textflint.generation_layer.transformation.transformation

Transformation Abstract Class

class textflint.generation_layer.transformation.transformation.Transformation(**kwargs)[source]

Bases: abc.ABC

An abstract class for transforming a sequence of text to produce a list of potential adversarial examples.

processor = <textflint.common.preprocess.en_processor.EnProcessor object>
transform(sample, n=1, field='x', **kwargs)[source]

Transform data sample to a list of Sample.

Parameters
  • sample (Sample) – Data sample for augmentation.

  • n (int) – Max number of unique augmented outputs; default is 1.

  • field (str|list) – Indicates which field(s) to apply the transformation to.

  • **kwargs (dict) –

    other auxiliary params.

Returns

list of Sample
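The abstract pattern above can be sketched in plain Python. This is an illustrative stand-in, not textflint's real implementation: the `SwapCase` subclass, the `_transform` hook, and the dict-based samples are all hypothetical, used only to show how an abstract `transform` dispatches to a concrete subclass and caps the output at `n` results.

```python
from abc import ABC, abstractmethod

class Transformation(ABC):
    """Sketch of an abstract transformation; not the real textflint class."""

    def transform(self, sample, n=1, field='x', **kwargs):
        # Delegate to the subclass hook, then cap the output at n samples.
        return self._transform(sample, n=n, field=field, **kwargs)[:n]

    @abstractmethod
    def _transform(self, sample, n=1, field='x', **kwargs):
        raise NotImplementedError

class SwapCase(Transformation):
    """Toy concrete transformation: swaps letter case in one field."""

    def _transform(self, sample, n=1, field='x', **kwargs):
        new_sample = dict(sample)
        new_sample[field] = sample[field].swapcase()
        return [new_sample]
```

With this sketch, `SwapCase().transform({'x': 'Hello'})` yields `[{'x': 'hELLO'}]`, while instantiating `Transformation` directly raises `TypeError` because the abstract method is not overridden.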

classmethod sample_num(x, num)[source]

Get ‘num’ samples from x.

Parameters
  • x (list) – list to sample

  • num (int) – sample number

Returns

At most ‘num’ unique samples.
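A plausible pure-Python sketch of this helper, assuming it draws without replacement and simply returns a shuffled copy of the whole list when ‘num’ exceeds its length (the real implementation may differ):

```python
import random

def sample_num(x, num):
    # Return at most ``num`` unique elements drawn from ``x``.
    if num >= len(x):
        # Fewer elements than requested: return all of them, shuffled.
        shuffled = x[:]
        random.shuffle(shuffled)
        return shuffled
    return random.sample(x, num)
```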

class textflint.generation_layer.transformation.transformation.ABC[source]

Bases: object

Helper class that provides a standard way to create an ABC using inheritance.

class textflint.generation_layer.transformation.transformation.EnProcessor(*args, **kwargs)[source]

Bases: object

Text processor class implementing NER, POS tagging, and lexical tree parsing. EnProcessor is designed as a singleton (single-instance mode).

sentence_tokenize(text)[source]

Split text to sentences.

Parameters

text (str) – text string

Returns

list[str]

tokenize_one_sent(text, split_by_space=False)[source]

Tokenize one sentence.

Parameters
  • text (str) –

  • split_by_space (bool) – whether to tokenize the sentence by splitting on spaces

Returns

tokens

tokenize(text, is_one_sent=False, split_by_space=False)[source]

Split a text into tokens (words, morphemes we can separate such as “n’t”, and punctuation).

Parameters
  • text (str) –

  • is_one_sent (bool) –

  • split_by_space (bool) –

Returns

list of tokens

static inverse_tokenize(tokens)[source]

Convert tokens to sentence.

Untokenizing a text undoes the tokenizing operation, restoring punctuation and spaces to the places that people expect them to be. Ideally, untokenize(tokenize(text)) should be identical to text, except for line breaks.

Watch out! By default punctuation is attached to the word before its index, which may cause inconsistency bugs.

Parameters

tokens (list[str]) – target token list

Returns

str
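A minimal sketch of the untokenizing idea, including the punctuation caveat above: punctuation and contraction pieces such as “n’t” are glued to the preceding word, everything else gets a space. The token categories chosen here are illustrative, not textflint's actual rules.

```python
def inverse_tokenize(tokens):
    # Join tokens into a sentence, attaching punctuation and
    # contraction pieces (e.g. "n't", "'s") to the preceding word.
    text = ''
    for tok in tokens:
        attach = tok in {'.', ',', '!', '?', ';', ':'} \
            or tok.startswith("'") or tok == "n't"
        if attach or not text:
            text += tok
        else:
            text += ' ' + tok
    return text
```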

get_pos(sentence)[source]

POS tagging function.

Example:

EnProcessor().get_pos(
    'All things in their being are good for something.'
)

>> [('All', 'DT'),
    ('things', 'NNS'),
    ('in', 'IN'),
    ('their', 'PRP$'),
    ('being', 'VBG'),
    ('are', 'VBP'),
    ('good', 'JJ'),
    ('for', 'IN'),
    ('something', 'NN'),
    ('.', '.')]
Parameters

sentence (str|list) – A sentence which needs to be tokenized.

Returns

Tokenized tokens with their POS tags.

get_ner(sentence, return_char_idx=True)[source]

NER function, implemented based on a spaCy model.

Example:

EnProcessor().get_ner(
    'Lionel Messi is a football player from Argentina.'
)

if return_char_idx is True
>>[('Lionel Messi', 0, 12, 'PERSON'),
   ('Argentina', 39, 48, 'LOCATION')]

if return_char_idx is False
>>[('Lionel Messi', 0, 2, 'PERSON'),
   ('Argentina', 7, 8, 'LOCATION')]
Parameters
  • sentence (str|list) – text string or token list

  • return_char_idx (bool) – if set True, return character start and end indices; otherwise return word start and end indices.

Returns

A list of tuples, (entity, start, end, label)

get_parser(sentence)[source]

Lexical tree parsing function based on NLTK toolkit.

Example:

EnProcessor().get_parser('Messi is a football player.')

>>'(ROOT\n  (S\n    (NP (NNP Messi))\n    (VP (VBZ is) (NP (DT a)
(NN football) (NN player)))\n    (. .)))'
Parameters

sentence (str|list) – A sentence needs to be parsed.

Returns

The result tree of the lexicalized parser in string format.

get_dep_parser(sentence, is_one_sent=True, split_by_space=False)[source]

Dependency parsing based on spacy model.

Example:

EnProcessor().get_dep_parser(
'The quick brown fox jumps over the lazy dog.'
)

>>
    The     DT      4       det
    quick   JJ      4       amod
    brown   JJ      4       amod
    fox     NN      5       nsubj
    jumps   VBZ     0       root
    over    IN      9       case
    the     DT      9       det
    lazy    JJ      9       amod
    dog     NN      5       obl
Parameters
  • sentence (str|list) – input text string

  • is_one_sent (bool) – whether to apply sentence tokenization

  • split_by_space (bool) – whether to tokenize the sentence by splitting on spaces

Returns

dependency parsing tags.

get_lemmas(token_and_pos)[source]

Lemmatize function. This method uses nltk.WordNetLemmatizer to lemmatize tokens.

Parameters

token_and_pos (list) – (token, POS).

Returns

A lemma or a list of lemmas, depending on your input.

get_all_lemmas(pos)[source]

Lemmatize function for all words in WordNet.

Parameters

pos – A POS tag or a list of POS tags.

Returns

A list of lemmas that have the given pos tag.

get_delemmas(lemma_and_pos)[source]

Delemmatize function.

This method uses a pre-processed dict which maps (lemma, pos) to original token for delemmatizing.

Parameters

lemma_and_pos (tuple|list) – A tuple or a list of (lemma, POS).

Returns

A word or a list of words, each word represents the specific form of input lemma.
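The dict-based delemmatizing described above can be sketched as follows; the `DELEMMA_DICT` entries and the fallback to the bare lemma are hypothetical illustrations, not the library's actual pre-processed mapping.

```python
# Hypothetical pre-processed mapping from (lemma, POS) to a surface form.
DELEMMA_DICT = {
    ('be', 'VBD'): 'was',
    ('run', 'VBG'): 'running',
}

def get_delemmas(lemma_and_pos):
    # Accept a single (lemma, POS) tuple or a list of them;
    # fall back to the bare lemma when the pair is not in the dict.
    if isinstance(lemma_and_pos, tuple):
        return DELEMMA_DICT.get(lemma_and_pos, lemma_and_pos[0])
    return [DELEMMA_DICT.get(pair, pair[0]) for pair in lemma_and_pos]
```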

get_synsets(tokens_and_pos, lang='eng')[source]

Get synsets from WordNet.

Parameters
  • tokens_and_pos (list) – A list of tuples, (token, POS).

  • lang (str) – language name

Returns

A list of str, represents the sense of each input token.

get_antonyms(tokens_and_pos, lang='eng')[source]

Get antonyms from WordNet.

This method uses NLTK WordNet to generate antonyms, and uses the “lesk” algorithm, proposed by Michael E. Lesk in 1986, to screen out the intended sense.

Parameters
  • tokens_and_pos (list) – A list of tuples, (token, POS).

  • lang (str) – language name.

Returns

A list of str, representing the antonyms of each input token.

filter_candidates_by_pos(token_and_pos, candidates)[source]

Filter out candidate synonyms that do not have the same POS tag as the given token.

Parameters
  • token_and_pos (list|tuple) – (token, pos)

  • candidates (list) – strings to verify

Returns

filtered candidates list.
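The filtering rule can be sketched in a few lines. The `TOY_POS` lookup table stands in for a real POS tagger and is purely illustrative; the actual method presumably tags each candidate before comparing.

```python
# Illustrative stand-in for a real POS tagger.
TOY_POS = {'good': 'JJ', 'fine': 'JJ', 'well': 'RB', 'goodness': 'NN'}

def filter_candidates_by_pos(token_and_pos, candidates):
    # Keep only candidates whose POS tag matches the given token's tag.
    _, pos = token_and_pos
    return [c for c in candidates if TOY_POS.get(c) == pos]
```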

feature_extract(sent)[source]

Generate linguistic tags for tokens.

Parameters

sent (str) – input sentence

Returns

list of dict

exception textflint.generation_layer.transformation.transformation.FlintError[source]

Bases: RuntimeError

Default error thrown by textflint functions. FlintError will be raised if you do not give any more specific error type.

textflint.generation_layer.transformation.transformation.abstractmethod(funcobj)[source]

A decorator indicating abstract methods.

Requires that the metaclass is ABCMeta or derived from it. A class that has a metaclass derived from ABCMeta cannot be instantiated unless all of its abstract methods are overridden. The abstract methods can be called using any of the normal ‘super’ call mechanisms.

Usage:

class C(metaclass=ABCMeta):
    @abstractmethod
    def my_abstract_method(self, ...):
        ...

textflint.generation_layer.transformation.transformation.trade_off_sub_words(sub_words, sub_indices, trans_num=None, n=1)[source]

Select candidate words so as to maximize the number of transform results: keep the words whose substitute lists are among the top-n largest.

Parameters
  • sub_words (list) – list of substitutes word of each legal word

  • sub_indices (list) – list of indices of each legal word

  • trans_num (int) – max number of words to apply substitution

  • n (int) –

Returns

sub_words after alignment, plus the corresponding indices of sub_words.
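One plausible reading of this selection rule can be sketched as follows; the exact ranking and truncation behavior here is an assumption, not textflint's verified implementation. Positions offering the most substitutes are kept (up to `trans_num`), and each surviving substitute list is truncated to `n` candidates so the lists stay aligned with their indices.

```python
def trade_off_sub_words(sub_words, sub_indices, trans_num=None, n=1):
    # sub_words:   list of substitute-word lists, one per legal word
    # sub_indices: list of positions of those legal words
    if trans_num is None:
        trans_num = len(sub_words)
    # Rank positions by how many substitutes they offer, keep the best.
    order = sorted(range(len(sub_words)),
                   key=lambda i: len(sub_words[i]), reverse=True)
    keep = sorted(order[:trans_num])
    # Truncate every kept list to n candidates so lists stay aligned.
    words = [sub_words[i][:n] for i in keep]
    indices = [sub_indices[i] for i in keep]
    return words, indices
```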