textflint.generation_layer.subpopulation.UT.prejudice
Extract samples with gender bias
-
class textflint.generation_layer.subpopulation.UT.prejudice.PrejudiceSubPopulation(mode='man')
Bases: textflint.generation_layer.subpopulation.subpopulation.SubPopulation
Filter samples based on gender bias.
For example, in mode 'man':
sample 1: "There is a boy.", score: 1
sample 2: "There is a girl.", score: 1
sample 3: "There are boys and girls.", score: 0
-
class textflint.generation_layer.subpopulation.UT.prejudice.KeywordProcessor(case_sensitive=False)
Bases: object
- Attributes:
- _keyword (str): Used as key to store keywords in trie dictionary.
Defaults to ‘_keyword_’
- non_word_boundaries (set(str)): Characters that will determine if the word is continuing.
Defaults to set([A-Za-z0-9_])
- keyword_trie_dict (dict): Trie dict built character by character, that is used for lookup
Defaults to empty dictionary
- case_sensitive (boolean): if the search algorithm should be case sensitive or not.
Defaults to False
- Examples:
>>> # import module
>>> from flashtext import KeywordProcessor
>>> # Create an object of KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # add keywords
>>> keyword_names = ['NY', 'new-york', 'SF']
>>> clean_names = ['new york', 'new york', 'san francisco']
>>> for keyword_name, clean_name in zip(keyword_names, clean_names):
...     keyword_processor.add_keyword(keyword_name, clean_name)
>>> keywords_found = keyword_processor.extract_keywords('I love SF and NY. new-york is the best.')
>>> keywords_found
['san francisco', 'new york', 'new york']
- Note:
Loosely based on the Aho-Corasick algorithm.
The idea came from a Stack Overflow question.
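The trie layout described in the attribute list can be pictured with a minimal sketch. It is not the class's own code: add_to_trie and KEYWORD_MARKER are hypothetical names, and only the '_keyword_' marker and the character-by-character nesting are taken from the attributes above.

```python
# Hypothetical helper; only the '_keyword_' marker and the nested-dict
# layout come from the documented attributes.
KEYWORD_MARKER = "_keyword_"


def add_to_trie(trie, keyword, clean_name=None, case_sensitive=False):
    """Insert keyword into a nested-dict trie, one character per level."""
    if clean_name is None:
        # Without a clean name, the keyword itself is used.
        clean_name = keyword
    if not case_sensitive:
        keyword = keyword.lower()
    node = trie
    for char in keyword:
        node = node.setdefault(char, {})
    # The terminal node stores the clean name under the marker key.
    node[KEYWORD_MARKER] = clean_name
    return trie


trie = {}
add_to_trie(trie, "NY", "new york")
add_to_trie(trie, "no")
print(trie)
# {'n': {'y': {'_keyword_': 'new york'}, 'o': {'_keyword_': 'no'}}}
```

Keywords that share a prefix share the same branch, so a lookup walks the text once instead of testing each keyword separately.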
-
__init__(case_sensitive=False)
- Args:
- case_sensitive (boolean): Whether keyword search should be case sensitive.
Defaults to False
-
set_non_word_boundaries(non_word_boundaries)
Set the characters that will be considered part of a word.
- Args:
- non_word_boundaries (set(str)):
Set of characters that will be considered as part of word.
-
add_non_word_boundary(character)
Add a character that will be considered part of a word.
- Args:
- character (char):
Character that will be considered as part of word.
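The effect of the non-word-boundary set can be sketched as follows. This is an illustration under assumptions, not the class's matching code: is_whole_word is a hypothetical helper, and the default set mirrors the documented [A-Za-z0-9_].

```python
import string

# Assumed default, mirroring the documented set([A-Za-z0-9_]).
NON_WORD_BOUNDARIES = set(string.ascii_letters + string.digits + "_")


def is_whole_word(text, start, end, non_word_boundaries=NON_WORD_BOUNDARIES):
    """Return True if text[start:end] is not glued to word characters."""
    before_ok = start == 0 or text[start - 1] not in non_word_boundaries
    after_ok = end == len(text) or text[end] not in non_word_boundaries
    return before_ok and after_ok


text = "I love SF-based teams, not SFX."
first = text.index("SF")
last = text.rindex("SF")

# '-' is not a word character by default, so "SF" stands alone here.
default_hit = is_whole_word(text, first, first + 2)
# Adding '-' to the set makes "SF-based" a single word.
hyphen_hit = is_whole_word(text, first, first + 2, NON_WORD_BOUNDARIES | {"-"})
# "SFX" continues with a letter, so "SF" is only a prefix there.
prefix_hit = is_whole_word(text, last, last + 2)
print(default_hit, hyphen_hit, prefix_hit)  # True False False
```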
-
add_keyword(keyword, clean_name=None)
To add a keyword to the dictionary, pass the keyword and the clean name it maps to.
- Args:
- keyword (string)
keyword that you want to identify
- clean_name (string)
clean term for that keyword that you want to get back or use as the replacement. If not provided, the keyword is used as the clean name.
- Returns:
- status (bool)
The return value. True for success, False otherwise.
- Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> # In this case 'Big Apple' will return 'New York'
>>> # OR
>>> keyword_processor.add_keyword('Big Apple')
>>> # In this case 'Big Apple' will return 'Big Apple'
-
remove_keyword(keyword)
To remove a keyword from the dictionary, pass the keyword.
- Args:
- keyword (string)
keyword that you want to remove if it's present
- Returns:
- status (bool)
The return value. True for success, False otherwise.
- Examples:
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns True
>>> # In this case 'Big Apple' will no longer be a recognized keyword
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns False
-
get_keyword(word)
If word is present in keyword_trie_dict, return the clean name for it.
- Args:
- word (string)
word that you want to check
- Returns:
- keyword (string)
If word is present as-is in keyword_trie_dict, the clean name mapped to it is returned.
- Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.get_keyword('Big Apple')
>>> # New York
-
add_keyword_from_file(keyword_file, encoding='utf-8')
To add keywords from a file
- Args:
keyword_file: path to keywords file
encoding: specify the encoding of the file
- Examples:
keywords file format can be like:
>>> # Option 1: keywords.txt content
>>> # java_2e=>java
>>> # java programing=>java
>>> # product management=>product management
>>> # product management techniques=>product management
>>> # Option 2: keywords.txt content
>>> # java
>>> # python
>>> # c++
>>> keyword_processor.add_keyword_from_file('keywords.txt')
- Raises:
IOError: If keyword_file path is not valid
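The two file layouts shown in the example can be read with a small parser. The sketch below is an assumption about how such a file could be handled, not the method's actual body: parse_keyword_file is a hypothetical helper that yields pairs which could then be passed to add_keyword.

```python
import os
import tempfile


def parse_keyword_file(keyword_file, encoding="utf-8"):
    """Yield (keyword, clean_name) pairs; clean_name is None when absent."""
    if not os.path.isfile(keyword_file):
        raise IOError("Invalid file path {}".format(keyword_file))
    with open(keyword_file, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if "=>" in line:
                # Option 1: keyword=>clean_name
                keyword, clean_name = line.split("=>", 1)
                yield keyword.strip(), clean_name.strip()
            else:
                # Option 2: bare keyword, no clean name given
                yield line, None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "keywords.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("java_2e=>java\n"
                     "product management=>product management\n"
                     "c++\n")
    pairs = list(parse_keyword_file(path))

print(pairs)
# [('java_2e', 'java'), ('product management', 'product management'), ('c++', None)]
```

Because this sketch is a generator, the IOError for a bad path is raised when iteration starts, not when the function is called.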
-
add_keywords_from_dict(keyword_dict)
To add keywords from a dictionary
- Args:
keyword_dict (dict): A dictionary with a str key (the clean name) and a list(str) value (the keywords mapped to it)
- Examples:
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
- Raises:
AttributeError: If value for a key in keyword_dict is not a list.
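The dictionary layout can be easy to read backwards, since each key is the clean name and the list holds the keywords that map to it. A minimal sketch of that flattening, using a hypothetical flatten_keyword_dict helper, could look like this:

```python
def flatten_keyword_dict(keyword_dict):
    """Map every keyword to its clean name (the dictionary key)."""
    mapping = {}
    for clean_name, keywords in keyword_dict.items():
        if not isinstance(keywords, list):
            # Mirrors the documented AttributeError for non-list values.
            raise AttributeError(
                "Value of key {} should be a list".format(clean_name))
        for keyword in keywords:
            mapping[keyword] = clean_name
    return mapping


keyword_dict = {
    "java": ["java_2e", "java programing"],
    "product management": ["PM", "product manager"],
}
mapping = flatten_keyword_dict(keyword_dict)
print(mapping["PM"])  # product management
```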
-
remove_keywords_from_dict(keyword_dict)
To remove keywords from a dictionary
- Args:
keyword_dict (dict): A dictionary with a str key and a list(str) value
- Examples:
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.remove_keywords_from_dict(keyword_dict)
- Raises:
AttributeError: If value for a key in keyword_dict is not a list.
-
add_keywords_from_list(keyword_list)
To add keywords from a list
- Args:
keyword_list (list(str)): List of keywords to add
- Examples:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
- Raises:
AttributeError: If keyword_list is not a list.
-
remove_keywords_from_list(keyword_list)
To remove keywords present in a list
- Args:
keyword_list (list(str)): List of keywords to remove
- Examples:
>>> keyword_processor.remove_keywords_from_list(["java", "python"])
- Raises:
AttributeError: If keyword_list is not a list.
-
get_all_keywords(term_so_far='', current_dict=None)
Recursively builds a dictionary of the keywords present in the trie and the clean names mapped to them.
- Args:
- term_so_far (string)
term built so far by adding all previous characters
- current_dict (dict)
current recursive position in dictionary
- Returns:
- terms_present (dict)
A map in which each key is a term in keyword_trie_dict and each value is the clean name mapped to it.
- Examples:
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('Python', 'Python')
>>> keyword_processor.get_all_keywords()
{'j2ee': 'Java', 'python': 'Python'}
>>> # NOTE: when case_sensitive is False, all keys will be lowercased.
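The recursion over term_so_far and current_dict can be sketched on a hand-built trie. collect_keywords below is a hypothetical stand-in that assumes the nested-dict layout with the '_keyword_' marker; it is not the method's source.

```python
KEYWORD_MARKER = "_keyword_"  # marker key from the documented attributes


def collect_keywords(node, term_so_far=""):
    """Walk a nested-dict trie and return {keyword: clean_name}."""
    terms_present = {}
    for key, child in node.items():
        if key == KEYWORD_MARKER:
            # A marker at this depth means term_so_far is a full keyword.
            terms_present[term_so_far] = child
        else:
            # Otherwise descend one character deeper.
            terms_present.update(collect_keywords(child, term_so_far + key))
    return terms_present


trie = {
    "j": {"2": {"e": {"e": {"_keyword_": "Java"}}}},
    "p": {"y": {"_keyword_": "Python"}},
}
all_keywords = collect_keywords(trie)
print(all_keywords)  # {'j2ee': 'Java', 'py': 'Python'}
```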
-
extract_keywords(sentence, span_info=False)
Searches the string for all keywords present in the corpus. Keywords found are added to a list keywords_extracted, which is returned.
- Args:
sentence (str): Line of text where we will search for keywords
- Returns:
keywords_extracted (list(str)): List of terms/keywords found in sentence that match our corpus
- Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
['New York', 'Bay Area']
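A simplified picture of the search is a left-to-right scan that walks the trie from each word start and keeps the longest keyword ending on a word boundary. The sketch below is an approximation under assumptions, not the library's algorithm: build_trie and extract are hypothetical helpers, matching is case-insensitive, and span_info (which the signature lists but this page does not describe) is assumed to add start and end offsets to each hit, as in the flashtext project.

```python
import string

KEYWORD_MARKER = "_keyword_"
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(pairs):
    """Build a nested-dict trie from (keyword, clean_name) pairs."""
    trie = {}
    for keyword, clean_name in pairs:
        node = trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node[KEYWORD_MARKER] = clean_name
    return trie


def extract(sentence, trie, span_info=False):
    """Return clean names of the longest keywords found, left to right."""
    text = sentence.lower()
    found, pos = [], 0
    while pos < len(text):
        at_word_start = pos == 0 or text[pos - 1] not in WORD_CHARS
        best = None
        if at_word_start:
            node, end = trie, pos
            while end < len(text) and text[end] in node:
                node = node[text[end]]
                end += 1
                at_word_end = end == len(text) or text[end] not in WORD_CHARS
                if KEYWORD_MARKER in node and at_word_end:
                    # Later (longer) matches overwrite shorter ones.
                    best = (node[KEYWORD_MARKER], pos, end)
        if best is None:
            pos += 1
        else:
            found.append(best if span_info else best[0])
            pos = best[2]  # continue after the matched keyword
    return found


trie = build_trie([("Big Apple", "New York"), ("Bay Area", "Bay Area")])
sentence = "I love Big Apple and Bay Area."
names = extract(sentence, trie)
spans = extract(sentence, trie, span_info=True)
print(names)  # ['New York', 'Bay Area']
print(spans)  # [('New York', 7, 16), ('Bay Area', 21, 29)]
```

The scan visits each character a bounded number of times per trie walk, which is the reason a trie is preferred over testing every keyword against the sentence.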
-
replace_keywords(sentence)
Searches the string for all keywords present in the corpus. Keywords found are replaced by their clean names and a new string is returned.
- Args:
sentence (str): Line of text where we will replace keywords
- Returns:
new_sentence (str): Line of text with replaced keywords
- Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and bay area.')
>>> new_sentence
'I love New York and Bay Area.'
-
class textflint.generation_layer.subpopulation.UT.prejudice.SubPopulation(intervals=None, **kwargs)
Bases: abc.ABC
An abstract class for extracting a subset of examples.
-
text_processor = <textflint.common.preprocess.en_processor.EnProcessor object>
-
score(sample, field, **kwargs)
Score the sample
- Parameters
sample – data sample
field (str|list) – field name or list of field names
kwargs – additional keyword arguments
- Return int
score for sample
-
get_slice(scores, dataset)
Pick up samples based on scores
- Parameters
scores (list) – list of int
dataset – Dataset
- Returns
subset samples
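The score-then-slice pattern can be illustrated with a toy abstract class. Everything in this sketch is hypothetical (ScoreFilter, ContainsWord, dict-shaped samples), and the rule of keeping samples with a non-zero score is an assumption; the real class also accepts intervals, whose handling is not described here.

```python
import re
from abc import ABC, abstractmethod


class ScoreFilter(ABC):
    """Hypothetical stand-in for a score-then-slice subpopulation."""

    @abstractmethod
    def score(self, sample, field):
        """Return an int score for one sample."""

    def get_slice(self, scores, dataset):
        # Assumption: keep the samples whose score is non-zero.
        return [sample for sample, s in zip(dataset, scores) if s]


class ContainsWord(ScoreFilter):
    """Scores 1 when the given word occurs in the chosen field."""

    def __init__(self, word):
        self.word = word.lower()

    def score(self, sample, field):
        tokens = re.findall(r"\w+", sample[field].lower())
        return int(self.word in tokens)


dataset = [{"x": "There is a boy."}, {"x": "Nothing here."}]
word_filter = ContainsWord("boy")
scores = [word_filter.score(sample, "x") for sample in dataset]
subset = word_filter.get_slice(scores, dataset)
print(scores, subset)  # [1, 0] [{'x': 'There is a boy.'}]
```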
-
-
textflint.generation_layer.subpopulation.UT.prejudice.download_if_needed(folder_name)
The folder is saved as .cache/textflint/[folder_name]. If it doesn't exist on disk, the zip file will be downloaded and extracted.
- Parameters
folder_name (str) – path to folder or file in cache
- Returns
path to the downloaded folder or file on disk
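The check-then-download behaviour follows a common caching pattern, sketched below without any network access. ensure_cached, fake_fetch and the "UT_DATA" folder name are hypothetical; the real function decides the cache location and download URL itself.

```python
import os
import tempfile


def ensure_cached(folder_name, cache_root, fetch):
    """Return the cached path, calling fetch(target) only when missing."""
    target = os.path.join(cache_root, folder_name)
    if not os.path.exists(target):
        os.makedirs(os.path.dirname(target), exist_ok=True)
        fetch(target)  # e.g. download a zip file and extract it to target
    return target


calls = []


def fake_fetch(target):
    # Stand-in for the download and extraction step.
    calls.append(target)
    os.makedirs(target)


with tempfile.TemporaryDirectory() as tmp:
    cache_root = os.path.join(tmp, ".cache", "textflint")
    first = ensure_cached("UT_DATA", cache_root, fake_fetch)
    second = ensure_cached("UT_DATA", cache_root, fake_fetch)

print(first == second, len(calls))  # True 1
```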
-
textflint.generation_layer.subpopulation.UT.prejudice.read_json(path, encoding='utf-8', fields=None, dropna=True)
Construct a generator to read json items.
- Parameters
path – file path
encoding – file's encoding, default: utf-8
fields – fields of the json object that are needed; if None, all fields are kept. default: None
dropna – whether to ignore and drop invalid data; if False, a ValueError is raised when invalid data is read. default: True
- Returns
generator that yields (line number, json item) on each iteration
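A generator with this shape can be sketched as follows. The details are assumptions: iter_json_lines is a hypothetical name, the file is taken to hold one JSON object per line, line numbers are counted from 0, and an item is treated as invalid when it lacks one of the requested fields.

```python
import json
import os
import tempfile


def iter_json_lines(path, encoding="utf-8", fields=None, dropna=True):
    """Yield (line number, json item) for each line of a JSON Lines file."""
    wanted = set(fields) if fields else None
    with open(path, encoding=encoding) as handle:
        for line_idx, line in enumerate(handle):
            item = json.loads(line)
            if wanted is None:
                yield line_idx, item
                continue
            picked = {k: v for k, v in item.items() if k in wanted}
            if len(picked) < len(wanted):
                # Assumed meaning of "invalid": a requested field is missing.
                if dropna:
                    continue
                raise ValueError("invalid item at line {}".format(line_idx))
            yield line_idx, picked


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write('{"x": "a", "y": 1}\n{"x": "b"}\n{"x": "c", "y": 0, "z": 2}\n')
    everything = list(iter_json_lines(path))
    filtered = list(iter_json_lines(path, fields=["x", "y"]))
    try:
        list(iter_json_lines(path, fields=["x", "y"], dropna=False))
        strict_error = None
    except ValueError as err:
        strict_error = str(err)

print(filtered)      # [(0, {'x': 'a', 'y': 1}), (2, {'x': 'c', 'y': 0})]
print(strict_error)  # invalid item at line 1
```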