textflint.generation_layer.subpopulation.UT.prejudice

Extract samples with gender bias

class textflint.generation_layer.subpopulation.UT.prejudice.PrejudiceSubPopulation(mode='man')

Bases: textflint.generation_layer.subpopulation.subpopulation.SubPopulation

Filter samples based on gender bias

For example, in mode ‘man’:

sample 1: "There is a boy.", score: 1
sample 2: "There is a girl.", score: 1
sample 3: "There are boys and girls.", score: 0
get_slice(scores, dataset)

Save the samples that match the phrase groups and mode.
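As a rough illustration of this filtering step, the sketch below keeps the samples whose score is 1. It assumes that scores and samples are parallel lists; the function name `keep_matching` is hypothetical and is not part of textflint.

```python
def keep_matching(scores, samples):
    """Return the samples whose score is truthy (1), preserving their order."""
    if len(scores) != len(samples):
        raise ValueError("scores and samples must have the same length")
    return [sample for score, sample in zip(scores, samples) if score]


# Scores taken from the mode 'man' example above.
samples = ["There is a boy.", "There is a girl.", "There are boys and girls."]
scores = [1, 1, 0]
selected = keep_matching(scores, samples)
print(selected)
```

The real method receives a Dataset rather than a plain list, so this only mirrors the selection logic.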

class textflint.generation_layer.subpopulation.UT.prejudice.KeywordProcessor(case_sensitive=False)

Bases: object

Attributes:
_keyword (str): Used as key to store keywords in trie dictionary.

Defaults to ‘_keyword_’

non_word_boundaries (set(str)): Characters that will determine if the word is continuing.

Defaults to set([A-Za-z0-9_])

keyword_trie_dict (dict): Trie dict built character by character, that is used for lookup

Defaults to empty dictionary

case_sensitive (boolean): if the search algorithm should be case sensitive or not.

Defaults to False
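To make the attributes above concrete, here is a minimal sketch of a character-level trie dictionary that stores the clean name under a `_keyword_` key. The helper `add_to_trie` is hypothetical; it follows the documented attributes but is not the class's actual implementation.

```python
KEYWORD_KEY = "_keyword_"  # terminal marker, matching the documented default


def add_to_trie(trie, keyword, clean_name=None, case_sensitive=False):
    """Insert keyword character by character and store the clean name at the end."""
    if clean_name is None:
        clean_name = keyword  # assumption: the keyword doubles as its clean name
    if not case_sensitive:
        keyword = keyword.lower()
    node = trie
    for char in keyword:
        node = node.setdefault(char, {})
    node[KEYWORD_KEY] = clean_name
    return trie


trie = {}
add_to_trie(trie, "NY", "new york")
add_to_trie(trie, "no")
print(trie)
```

Keywords that share a prefix share the same nested dictionaries, which is why lookup cost depends on the length of the text rather than on the number of keywords.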

Examples:
>>> # import module
>>> from flashtext import KeywordProcessor
>>> # Create an object of KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # add keywords
>>> keyword_names = ['NY', 'new-york', 'SF']
>>> clean_names = ['new york', 'new york', 'san francisco']
>>> for keyword_name, clean_name in zip(keyword_names, clean_names):
...     keyword_processor.add_keyword(keyword_name, clean_name)
>>> keywords_found = keyword_processor.extract_keywords('I love SF and NY. new-york is the best.')
>>> keywords_found
['san francisco', 'new york', 'new york']
__init__(case_sensitive=False)
Args:
case_sensitive (boolean): Whether keyword search should be case sensitive.

Defaults to False

set_non_word_boundaries(non_word_boundaries)

Set the characters that will be considered part of a word.

Args:
non_word_boundaries (set(str)):

Set of characters that will be considered as part of word.

add_non_word_boundary(character)

Add a character that will be considered part of a word.

Args:
character (char):

Character that will be considered as part of word.
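The effect of non-word-boundary characters can be sketched as a whole-word check: a candidate match is accepted only if the characters on either side of it do not continue a word. The function `is_whole_word` below is a hypothetical illustration built on the documented default set [A-Za-z0-9_], not the library's own code.

```python
import string

# Documented default: letters, digits and underscore continue a word.
NON_WORD_BOUNDARIES = set(string.ascii_letters + string.digits + "_")


def is_whole_word(text, start, end, non_word_boundaries=NON_WORD_BOUNDARIES):
    """True if text[start:end] is not glued to a word character on either side."""
    before_ok = start == 0 or text[start - 1] not in non_word_boundaries
    after_ok = end == len(text) or text[end] not in non_word_boundaries
    return before_ok and after_ok


text = "java javascript c++"
print(is_whole_word(text, 0, 4))    # "java" followed by a space
print(is_whole_word(text, 5, 9))    # "java" inside "javascript"
# Treating "+" as a word character changes how the lone "c" is judged.
print(is_whole_word(text, 16, 17))
print(is_whole_word(text, 16, 17, NON_WORD_BOUNDARIES | {"+"}))
```

Adding "+" to the set is the kind of adjustment add_non_word_boundary is meant for when keywords such as "c++" should not be split.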

add_keyword(keyword, clean_name=None)

Add a keyword to the dictionary by passing the keyword and, optionally, the clean name it maps to.

Args:
keyword (str):

Keyword that you want to identify.

clean_name (str):

Clean term that is returned or substituted for the keyword. If not provided, the keyword itself is used as the clean name.

Returns:
status (bool):

True for success, False otherwise.

Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> # This case 'Big Apple' will return 'New York'
>>> # OR
>>> keyword_processor.add_keyword('Big Apple')
>>> # This case 'Big Apple' will return 'Big Apple'
remove_keyword(keyword)

Remove a keyword from the dictionary by passing the keyword.

Args:
keyword (str):

Keyword that you want to remove, if it is present.

Returns:
status (bool):

True for success, False otherwise.

Examples:
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns True
>>> # This case 'Big Apple' will no longer be a recognized keyword
>>> keyword_processor.remove_keyword('Big Apple')
>>> # Returns False
get_keyword(word)

If word is present in keyword_trie_dict, return the clean name mapped to it.

Args:
word (str):

Word that you want to check.

Returns:
keyword (str):

The clean name mapped to word, if word is present exactly as given in keyword_trie_dict.

Examples:
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.get_keyword('Big Apple')
'New York'
add_keyword_from_file(keyword_file, encoding='utf-8')

To add keywords from a file

Args:

keyword_file: Path to the keywords file

encoding: Encoding of the file

Examples:

keywords file format can be like:

>>> # Option 1: keywords.txt content
>>> # java_2e=>java
>>> # java programing=>java
>>> # product management=>product management
>>> # product management techniques=>product management
>>> # Option 2: keywords.txt content
>>> # java
>>> # python
>>> # c++
>>> keyword_processor.add_keyword_from_file('keywords.txt')
Raises:

IOError: If keyword_file path is not valid
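The two file formats shown above can be parsed with a few lines of Python. The sketch below works on a list of lines instead of a file so that it stays self-contained; `parse_keyword_lines` is a hypothetical helper, and skipping blank lines is an assumption rather than documented behaviour.

```python
def parse_keyword_lines(lines):
    """Map each keyword to its clean name from 'keyword=>clean' or 'keyword' lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines are ignored
        if "=>" in line:
            keyword, clean_name = line.split("=>", 1)
            mapping[keyword.strip()] = clean_name.strip()
        else:
            mapping[line] = line  # no clean name given: keyword maps to itself
    return mapping


lines = ["java_2e=>java", "java programing=>java", "python", "c++"]
parsed = parse_keyword_lines(lines)
print(parsed)
```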

add_keywords_from_dict(keyword_dict)

To add keywords from a dictionary

Args:

keyword_dict (dict): A dictionary with str key and (list str) as value

Examples:
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
Raises:

AttributeError: If value for a key in keyword_dict is not a list.
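The dictionary is keyed by clean name, with the list of keywords that map to it as the value. A minimal sketch of how such a dictionary could be flattened into keyword-to-clean-name pairs follows; `flatten_keyword_dict` is a hypothetical name, and the error message is illustrative only.

```python
def flatten_keyword_dict(keyword_dict):
    """Turn {clean_name: [keyword, ...]} into {keyword: clean_name}."""
    flat = {}
    for clean_name, keywords in keyword_dict.items():
        if not isinstance(keywords, list):
            # Mirrors the documented AttributeError for non-list values.
            raise AttributeError(f"Value of key {clean_name!r} should be a list")
        for keyword in keywords:
            flat[keyword] = clean_name
    return flat


keyword_dict = {
    "java": ["java_2e", "java programing"],
    "product management": ["PM", "product manager"],
}
flat = flatten_keyword_dict(keyword_dict)
print(flat)
```

Each resulting pair corresponds to one add_keyword(keyword, clean_name) call.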

remove_keywords_from_dict(keyword_dict)

To remove keywords from a dictionary

Args:

keyword_dict (dict): A dictionary with str key and (list str) as value

Examples:
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.remove_keywords_from_dict(keyword_dict)
Raises:

AttributeError: If value for a key in keyword_dict is not a list.

add_keywords_from_list(keyword_list)

To add keywords from a list

Args:

keyword_list (list(str)): List of keywords to add

Examples:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
Raises:

AttributeError: If keyword_list is not a list.

remove_keywords_from_list(keyword_list)

To remove keywords present in list

Args:

keyword_list (list(str)): List of keywords to remove

Examples:
>>> keyword_processor.remove_keywords_from_list(["java", "python"])
Raises:

AttributeError: If keyword_list is not a list.

get_all_keywords(term_so_far='', current_dict=None)

Recursively builds a dictionary of the keywords present in the trie and the clean names mapped to them.

Args:
term_so_far (str):

Term built so far by adding all previous characters.

current_dict (dict):

Current recursive position in the dictionary.

Returns:
terms_present (dict):

A map in which each key is a term in keyword_trie_dict and each value is the clean name mapped to that term.

Examples:
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('j2ee', 'Java')
>>> keyword_processor.add_keyword('Python', 'Python')
>>> keyword_processor.get_all_keywords()
{'j2ee': 'Java', 'python': 'Python'}
>>> # NOTE: when case_sensitive is False, all keys are lowercased.
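The recursion can be sketched on a plain nested dictionary. The snippet below assumes the trie layout described in the class attributes (one nested dict per character, clean name stored under `_keyword_`); `collect_keywords` is a hypothetical stand-in for the method, not its actual source.

```python
KEYWORD_KEY = "_keyword_"


def collect_keywords(current_dict, term_so_far=""):
    """Walk the trie depth first and return {term: clean_name}."""
    terms_present = {}
    for key, child in current_dict.items():
        if key == KEYWORD_KEY:
            # A terminal marker: the characters seen so far form a keyword.
            terms_present[term_so_far] = child
        else:
            terms_present.update(collect_keywords(child, term_so_far + key))
    return terms_present


# Build a small trie by hand, lowercasing keys as a case-insensitive trie would.
trie = {}
for keyword, clean_name in [("j2ee", "Java"), ("python", "Python")]:
    node = trie
    for char in keyword:
        node = node.setdefault(char, {})
    node[KEYWORD_KEY] = clean_name

all_keywords = collect_keywords(trie)
print(all_keywords)
```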
extract_keywords(sentence, span_info=False)

Searches in the string for all keywords present in corpus. Keywords present are added to a list keywords_extracted and returned.

Args:

sentence (str): Line of text where we will search for keywords

Returns:

keywords_extracted (list(str)): List of terms/keywords found in sentence that match our corpus

Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
['New York', 'Bay Area']
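For readers curious how a trie-based search of this kind can work, the following is a simplified, case-insensitive sketch that returns the clean names of the longest whole-word matches. It is written from the description above under stated assumptions (default word characters, `_keyword_` terminal key); `build_trie` and `extract` are hypothetical names, and the real method may differ in details.

```python
import string

KEYWORD_KEY = "_keyword_"
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def build_trie(keywords):
    """Build a lowercase character trie from {keyword: clean_name}."""
    trie = {}
    for keyword, clean_name in keywords.items():
        node = trie
        for char in keyword.lower():
            node = node.setdefault(char, {})
        node[KEYWORD_KEY] = clean_name
    return trie


def extract(sentence, trie):
    """Return clean names of the longest whole-word matches, left to right."""
    lowered = sentence.lower()
    found, i, n = [], 0, len(lowered)
    while i < n:
        # Only start matching at the beginning of a word.
        if i > 0 and lowered[i - 1] in WORD_CHARS:
            i += 1
            continue
        node, j, best = trie, i, None
        while j < n and lowered[j] in node:
            node = node[lowered[j]]
            j += 1
            at_word_end = j == n or lowered[j] not in WORD_CHARS
            if KEYWORD_KEY in node and at_word_end:
                best = (node[KEYWORD_KEY], j)  # longest match seen so far
        if best:
            found.append(best[0])
            i = best[1]  # resume after the matched keyword
        else:
            i += 1
    return found


trie = build_trie({"Big Apple": "New York", "Bay Area": "Bay Area", "java": "Java"})
keywords_found = extract("I love Big Apple and Bay Area.", trie)
print(keywords_found)
print(extract("I like javascript", trie))
```

The second call finds nothing because "java" is followed by a word character, so it is not treated as a separate word.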
replace_keywords(sentence)

Searches in the string for all keywords present in corpus. Keywords present are replaced by the clean name and a new string is returned.

Args:

sentence (str): Line of text where we will replace keywords

Returns:

new_sentence (str): Line of text with replaced keywords

Examples:
>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and bay area.')
>>> new_sentence
'I love New York and Bay Area.'
class textflint.generation_layer.subpopulation.UT.prejudice.SubPopulation(intervals=None, **kwargs)

Bases: abc.ABC

An abstract class for extracting a subset of examples.

text_processor = <textflint.common.preprocess.en_processor.EnProcessor object>
score(sample, field, **kwargs)

Score the sample

Parameters
  • sample – data sample

  • field (str|list) – field str

  • kwargs

Return int

score for sample

get_slice(scores, dataset)

Pick up samples based on scores

Parameters
  • scores (list) – list of int

  • dataset – Dataset

Returns

subset samples

slice_population(dataset, fields, **kwargs)

Extract a subset of samples.

Parameters
  • dataset – Dataset

  • fields (list) – field str list

  • kwargs

Returns

Subset Dataset

static normalize_bound(limit, size)

Normalize the bound of slice

Parameters
  • limit (str|float|int) – left_bound or right_bound for intervals. It can be a percentile such as 10% or 20%, a float between 0 and 1 such as 0.3, or an int index such as 50

  • size – the size of samples

Return int

normalized bound
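One plausible reading of the three accepted forms of limit is sketched below. The function `normalize_bound_sketch` is hypothetical: truncating toward zero and clamping the result to the range 0..size are assumptions made for the illustration, not behaviour confirmed by this page.

```python
def normalize_bound_sketch(limit, size):
    """Turn a percentile string, a 0-1 fraction, or an int index into an int bound."""
    if isinstance(limit, str):
        if not limit.endswith("%"):
            raise ValueError(f"Expected a percentile such as '10%', got {limit!r}")
        bound = int(float(limit[:-1]) * size / 100)
    elif isinstance(limit, float):
        if not 0 <= limit <= 1:
            raise ValueError("A float limit must lie between 0 and 1")
        bound = int(limit * size)
    elif isinstance(limit, int):
        bound = limit
    else:
        raise TypeError(f"Unsupported limit type: {type(limit).__name__}")
    # Assumption: keep the bound inside the valid index range.
    return max(0, min(bound, size))


print(normalize_bound_sketch("10%", 200))
print(normalize_bound_sketch(0.5, 50))
print(normalize_bound_sketch(50, 200))
```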

textflint.generation_layer.subpopulation.UT.prejudice.download_if_needed(folder_name)

Folder name will be saved as .cache/textflint/[folder_name]. If it doesn’t exist on disk, the zip file will be downloaded and extracted.

Parameters

folder_name (str) – path to folder or file in cache

Returns

path to the downloaded folder or file on disk

textflint.generation_layer.subpopulation.UT.prejudice.read_json(path, encoding='utf-8', fields=None, dropna=True)

Construct a generator to read json items.

Parameters
  • path – file path

  • encoding – file’s encoding, default: utf-8

  • fields – fields of the json object that are needed; if None, all fields are kept. default: None

  • dropna – whether to ignore and drop invalid data; if False, a ValueError is raised when invalid data is read. default: True

Returns

generator that yields (line number, json item) on each iteration
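A generator with this shape can be sketched with the standard library alone. The version below assumes one JSON object per line, counts lines from 1, and treats a record as invalid when a requested field is missing; all three are assumptions for the illustration, and `read_json_lines` is a hypothetical name rather than the function documented here.

```python
import json
import os
import tempfile


def read_json_lines(path, encoding="utf-8", fields=None, dropna=True):
    """Yield (line number, item) for each JSON object stored one per line."""
    with open(path, encoding=encoding) as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # skip empty lines
            item = json.loads(line)
            if fields is None:
                yield line_no, item
                continue
            selected = {key: item.get(key) for key in fields}
            if any(value is None for value in selected.values()):
                if dropna:
                    continue  # silently drop incomplete records
                raise ValueError(f"Invalid instance at line {line_no}")
            yield line_no, selected


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "samples.jsonl")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write('{"x": "There is a boy.", "y": 1}\n')
        handle.write('{"x": "No label here."}\n')
    rows = list(read_json_lines(path, fields=["x", "y"]))

print(rows)
```

Because it is a generator, large files are read lazily; the example wraps it in list() only to print the result.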