textflint.generation_layer.transformation.UT.keyboard
KeyboardTransformation Class
-
class textflint.generation_layer.transformation.UT.keyboard.Keyboard(min_char=4, trans_min=1, trans_max=10, trans_p=0.2, stop_words=None, rules_path=None, include_special_char=True, include_numeric=True, include_upper_case=True, lang='en', **kwargs)
Bases: textflint.generation_layer.transformation.word_substitute.WordSubstitute
Transformation that simulates typing errors by substituting randomly chosen characters.
https://arxiv.org/pdf/1711.02173.pdf
For example, people may type i as o by mistake. Keyboard distance is used to replace a character with one that a plausible keystroke error would produce.
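The idea of replacing a character with a nearby key can be sketched in a few lines. This is a minimal illustration, not the library's implementation: the `NEIGHBORS` map and the `keyboard_typo` helper are hypothetical names, and the adjacency lists cover only a handful of QWERTY keys.

```python
import random

# Hypothetical adjacency map for a few QWERTY keys (illustration only;
# the library loads its own rules).
NEIGHBORS = {
    "i": ["u", "o", "k", "j"],
    "o": ["i", "p", "l", "k"],
    "e": ["w", "r", "d", "s"],
    "a": ["q", "s", "z"],
    "t": ["r", "y", "g", "f"],
}


def keyboard_typo(word, n_chars=1, rng=None):
    """Replace up to n_chars characters of word with an adjacent key."""
    rng = rng or random.Random()
    chars = list(word)
    # Only positions whose character has known neighbours can be changed.
    candidates = [i for i, c in enumerate(chars) if c.lower() in NEIGHBORS]
    for i in rng.sample(candidates, min(n_chars, len(candidates))):
        chars[i] = rng.choice(NEIGHBORS[chars[i].lower()])
    return "".join(chars)


print(keyboard_typo("station", n_chars=2, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the sketch reproducible; words with no mapped characters are returned unchanged.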
-
__init__(min_char=4, trans_min=1, trans_max=10, trans_p=0.2, stop_words=None, rules_path=None, include_special_char=True, include_numeric=True, include_upper_case=True, lang='en', **kwargs)
- Parameters
min_char (int) – Words shorter than this value are not selected for augmentation.
trans_min (int) – Minimum number of characters to augment.
trans_max (int) – Maximum number of characters to augment. If None is passed, the number of augmentations is calculated from trans_p. If the result calculated from trans_p is smaller than trans_max, the calculated result is used; otherwise trans_max is used.
trans_p (float) – Percentage of characters (per token) to augment.
stop_words (list) – List of words to skip during augmentation.
rules_path (str) – Path to a custom model to load from the file system.
include_special_char (bool) – If True, special characters may be included in augmented data.
include_numeric (bool) – If True, numeric characters may be included in augmented data.
include_upper_case (bool) – If True, upper-case characters may be included in augmented data.
lang (str) – Selects the built-in language model. Default value is ‘en’. Possible values are ‘en’ and ‘th’. If a custom model is used (by passing rules_path), this value is ignored.
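How trans_min, trans_max and trans_p interact can be illustrated with a short sketch. The helper name `n_to_transform` is hypothetical, and rounding the percentage up is an assumption; the documentation above only states that the value calculated from the percentage is capped by the maximum.

```python
import math


def n_to_transform(size, trans_p, trans_min=1, trans_max=10):
    """Number of items to alter out of `size` candidates (assumed rule)."""
    if size == 0:
        return 0
    n = math.ceil(trans_p * size)  # assumption: round the percentage up
    if trans_min is not None:
        n = max(n, trans_min)      # never fewer than the minimum
    if trans_max is not None:
        n = min(n, trans_max)      # never more than the maximum
    return min(n, size)            # cannot alter more items than exist


print(n_to_transform(8, 0.2))    # 2
print(n_to_transform(100, 0.2))  # 10, capped by trans_max
```

With `trans_max=None` the cap is skipped and only the percentage and the minimum apply.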
-
-
class textflint.generation_layer.transformation.UT.keyboard.WordSubstitute(trans_min=1, trans_max=10, trans_p=0.1, stop_words=None, **kwargs)
Bases: textflint.generation_layer.transformation.transformation.Transformation
Word replacement transformation that implements the common word-replacement functions.
-
__init__(trans_min=1, trans_max=10, trans_p=0.1, stop_words=None, **kwargs)
- Parameters
trans_min (int) – Minimum number of words to augment.
trans_max (int) – Maximum number of words to augment. If None is passed, the number of augmentations is calculated from trans_p. If the result calculated from trans_p is smaller than trans_max, the calculated result is used; otherwise trans_max is used.
trans_p (float) – Percentage of words to augment.
stop_words (list) – List of words to skip during augmentation.
processor (EnProcessor) –
get_pos (bool) – Whether to pass POS tags to the _get_substitute_words API.
-
abstract skip_aug(tokens, mask, pos=None)
Returns the indices of the replaced tokens.
- Parameters
tokens (list) – tokenized words, or pairs of a word and its POS tag
- Return list
the indices of the replaced tokens
-
is_stop_words(token)
Checks whether the input word belongs to the stop-word vocabulary.
- Parameters
token (str) – the input word to check
- Return bool
whether the word is a stop word
-
pre_skip_aug(tokens, mask)
Skips tokens that appear in the stop-word list or the punctuation list.
- Parameters
tokens (list) – the list of tokens
mask (list) – the list of mask values, indicating whether each word may be substituted. ORIGIN is allowed, while TASK_MASK and MODIFIED_MASK are not.
- Return list
List of indices of tokens that may be substituted.
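The filtering described for pre_skip_aug can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `candidate_indices` is a hypothetical helper, the numeric mask values are placeholders for the library's ORIGIN, TASK_MASK and MODIFIED_MASK constants, and punctuation is approximated with Python's `string.punctuation`.

```python
import string

# Placeholder mask values; the real constants are defined by the library
# and may differ.
ORIGIN, TASK_MASK, MODIFIED_MASK = 0, 1, 2


def candidate_indices(tokens, mask, stop_words=()):
    """Indices of tokens that remain eligible for substitution."""
    stop = {w.lower() for w in stop_words}
    keep = []
    for i, (tok, m) in enumerate(zip(tokens, mask)):
        if m != ORIGIN:
            continue  # masked by the task or already modified
        if tok.lower() in stop:
            continue  # stop word
        if all(ch in string.punctuation for ch in tok):
            continue  # token consists only of punctuation
        keep.append(i)
    return keep


tokens = ["The", "movie", "was", "great", "!"]
mask = [ORIGIN, ORIGIN, ORIGIN, MODIFIED_MASK, ORIGIN]
print(candidate_indices(tokens, mask, stop_words=["the", "was"]))  # [1]
```

Only "movie" survives here: "The" and "was" are stop words, "great" is already marked as modified, and "!" is punctuation.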
-
-
textflint.generation_layer.transformation.UT.keyboard.download_if_needed(folder_name)
The folder is stored as .cache/textflint/[folder_name]. If it does not exist on disk, the zip file is downloaded and extracted.
- Parameters
folder_name (str) – path to folder or file in cache
- Returns
path to the downloaded folder or file on disk
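The check-then-download pattern described above might look like the sketch below. The names `ensure_cached`, `cache_root` and `fetch_zip` are hypothetical: the caller supplies the cache directory and a callable that writes the archive to a given path, so the library's actual download source and logic are deliberately left out.

```python
import zipfile
from pathlib import Path


def ensure_cached(folder_name, cache_root, fetch_zip):
    """Return the cached path, fetching and unpacking a zip only if missing.

    `fetch_zip(folder_name, archive_path)` is a hypothetical callable that
    writes the zip archive to `archive_path`.
    """
    target = Path(cache_root) / folder_name
    if target.exists():
        return target  # already on disk, nothing to download
    target.parent.mkdir(parents=True, exist_ok=True)
    archive = target.parent / (target.name + ".zip")
    fetch_zip(folder_name, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    archive.unlink()  # remove the archive once it is unpacked
    return target
```

A second call with the same folder_name returns immediately, because the target path already exists.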