textflint

Welcome to the API references for TextFlint!

class textflint.Sample(data, origin=None, sample_id=None)[source]

Bases: abc.ABC

Base Sample class to hold the necessary info and provide atomic operations

text_processor = <textflint.common.preprocess.en_processor.EnProcessor object>
__init__(data, origin=None, sample_id=None)[source]
Parameters
  • data (dict) – The dict obj that contains data info.

  • origin (sample) – original sample obj.

  • sample_id (int) – sample index

get_value(field)[source]

Get field value by field_str.

Parameters

field (str) – field name

Returns

field value

get_words(field)[source]

Get tokenized words of the given text field

Parameters

field (str) – field name

Returns

tokenized words

get_text(field)[source]

Get the text string of the given text field

Parameters

field (str) – field name

Return string

text

get_mask(field)[source]

Get word masks of the given text field

Parameters

field (str) – field name

Returns

list of mask values

get_sentences(field)[source]

Get split sentences of the given text field

Parameters

field (str) – field name

Returns

list of sentences

get_pos(field)[source]

Get text field pos tags.

Parameters

field (str) – field name

Returns

pos tag list

get_ner(field)[source]

Get text field NER tags

Parameters

field (str) – field name

Returns

NER tag list
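Taken together, the get_* accessors return different views of one stored field. A minimal pure-Python sketch of the idea (a toy stand-in with naive whitespace/period splitting, not the textflint implementation, which uses EnProcessor):

```python
class ToySample:
    """Toy stand-in mimicking Sample's field accessors."""

    def __init__(self, data):
        self.data = data  # e.g. {'x': 'Hello world. Bye now.'}

    def get_value(self, field):
        # Return the raw field value.
        return self.data[field]

    def get_text(self, field):
        return str(self.data[field])

    def get_words(self, field):
        # Naive whitespace tokenization.
        return self.get_text(field).split()

    def get_sentences(self, field):
        # Naive sentence split on '.'.
        return [s.strip() + '.' for s in self.get_text(field).split('.') if s.strip()]


sample = ToySample({'x': 'Hello world. Bye now.'})
print(sample.get_words('x'))      # ['Hello', 'world.', 'Bye', 'now.']
print(sample.get_sentences('x'))  # ['Hello world.', 'Bye now.']
```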

replace_fields(fields, field_values, field_masks=None)[source]

Fully replace multiple fields at the same time and return a new sample. Note: using this API is not recommended, as it sets the mask values of the TextField to MODIFIED_MASK.

Parameters
  • fields (list) – field str list

  • field_values (list) – field value list

  • field_masks (list) – indicate mask values, useful for printable text

Returns

Modified Sample

replace_field(field, field_value, field_mask=None)[source]

Fully replace a single field and return a new sample. Note: using this API is not recommended, as it sets the mask values of the TextField to MODIFIED_MASK.

Parameters
  • field (str) – field str

  • field_value – field_type

  • field_mask (list) – indicate mask value of field

Returns

Modified Sample

replace_field_at_indices(field, indices, items)[source]

Replace items in multiple given scopes of the field value at the same time. Avoid this complex function if possible!

Be careful of your input list shape.

Parameters
  • field (str) – field name

  • indices (list of int|list|slice) –

    each index can be an int indicating a single item to replace, a list like [1, 2, 3], a (start, end) pair like (0, 3) indicating items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

  • items – replacement items, corresponding to indices

Returns

Modified Sample

replace_field_at_index(field, index, items)[source]

Replace items of given scope of field value.

Be careful of your input list shape.

Parameters
  • field (str) – field name

  • index (int|list|slice) –

    can be an int indicating a single item to replace, a list like [1, 2, 3], a (start, end) pair like (0, 3) indicating items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

  • items (str|list) – shape: indices_num, correspond to field_sub_items

Returns

Modified Sample
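The index argument of these replace/delete helpers accepts several shapes. A pure-Python sketch of how such a spec could be normalized to explicit positions (an illustration of the documented semantics, treating (0, 3) as a (start, end) pair; not textflint's own code):

```python
def normalize_index(index, length):
    """Normalize an int / [i, j, k] list / (start, end) pair / slice
    into an explicit list of positions."""
    if isinstance(index, int):
        return [index]
    if isinstance(index, slice):
        # slices are converted to an explicit position list
        return list(range(*index.indices(length)))
    if isinstance(index, tuple) and len(index) == 2:
        start, end = index
        return list(range(start, end))  # end not included
    return list(index)  # already an explicit list like [1, 2, 3]


print(normalize_index(2, 10))           # [2]
print(normalize_index((0, 3), 10))      # [0, 1, 2]
print(normalize_index(slice(1, 4), 10)) # [1, 2, 3]
print(normalize_index([1, 2, 3], 10))   # [1, 2, 3]
```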

unequal_replace_field_at_indices(field, indices, rep_items)[source]

Replace scope items of the field value with rep_items, whose length may differ from the scope.

Parameters
  • field – field str

  • indices – list of int/tuple/list

  • rep_items – list

Returns

Modified Sample

delete_field_at_indices(field, indices)[source]

Delete items of given scopes of field value.

Parameters
  • field (str) – field name

  • indices (list of int|list|slice) –

    shape: indices_num. Each index can be an int indicating a single item to delete, a list like [1, 2, 3], a (start, end) pair like (0, 3) indicating items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

Returns

Modified Sample

delete_field_at_index(field, index)[source]

Delete items at the given index of the field value.

Parameters
  • field (str) – field name

  • index (int|list|slice) –

    can be an int indicating a single item to delete, a list like [1, 2, 3], a (start, end) pair like (0, 3) indicating items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

Returns

Modified Sample

insert_field_before_indices(field, indices, items)[source]

Insert items of multiple given scopes before indices of field value at the same time.

Avoid this complex function if possible! Be careful of your input list shape.

Parameters
  • field (str) – field name

  • indices – list of int, shape:indices_num, list like [1, 2, 3]

  • items – list of str/list, shape: indices_num, correspond to indices

Returns

Modified Sample

insert_field_before_index(field, index, items)[source]

Insert items before the given index of the field value.

Parameters
  • field (str) – field name

  • index (int) – indicate which index to insert items

  • items (str|list) – items to insert

Returns

Modified Sample

insert_field_after_indices(field, indices, items)[source]

Insert items of multiple given scopes after indices of field value at the same time.

Avoid this complex function if possible! Be careful of your input list shape.

Parameters
  • field (str) – field name

  • indices – list of int, shape:indices_num, like [1, 2, 3]

  • items – list of str/list shape: indices_num, correspond to indices

Returns

Modified Sample

insert_field_after_index(field, index, items)[source]

Insert items after the given index of the field value

Parameters
  • field (str) – field name

  • index (int) – indicate where to apply insert

  • items (str|list) – shape: indices_num, correspond to field_sub_items

Returns

Modified Sample

swap_field_at_index(field, first_index, second_index)[source]

Swap items between first_index and second_index of field value.

Parameters
  • field (str) – field name

  • first_index (int) –

  • second_index (int) –

Returns

Modified Sample
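The insert/delete/swap helpers above operate on token positions. A pure-Python sketch of the insert-before and delete semantics on a plain token list (an illustration under the documented "items correspond to indices" convention, not textflint's implementation; insertions are applied from the highest index down so positions stay valid):

```python
def insert_before_indices(tokens, indices, items):
    """Insert items[i] before tokens[indices[i]], mirroring
    insert_field_before_indices on a plain token list."""
    out = list(tokens)
    # Apply from the highest index down so earlier insertions
    # don't shift later positions.
    for idx, item in sorted(zip(indices, items), reverse=True):
        new = item if isinstance(item, list) else [item]
        out[idx:idx] = new
    return out


def delete_at_indices(tokens, indices):
    """Delete tokens at the given integer indices."""
    drop = set(indices)
    return [tok for i, tok in enumerate(tokens) if i not in drop]


tokens = ['the', 'cat', 'sat']
print(insert_before_indices(tokens, [1, 2], ['big', 'quietly']))
# ['the', 'big', 'cat', 'quietly', 'sat']
print(delete_at_indices(tokens, [0]))  # ['cat', 'sat']
```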

abstract check_data(data)[source]

Check raw data format

Parameters

data – raw data input

Returns

abstract load(data)[source]

Parse data into sample field value.

Parameters

data – raw data input

abstract dump()[source]

Convert sample info to input data json format.

Returns

dict object.

classmethod clone(original_sample)[source]

Deep copy the given sample into a new sample

Parameters

original_sample – sample to be copied

Returns

Sample instance

property is_origin

Return whether the sample is original Sample.

class textflint.Field(field_value, field_type=<class 'str'>, **kwargs)[source]

Bases: object

A helper class that represents an input string to be modified.

__init__(field_value, field_type=<class 'str'>, **kwargs)[source]
Parameters
  • field_value (string|int|list) – The string that Field represents.

  • field_type (str) – field value type

class textflint.Dataset(task='UT')[source]

Bases: object

Any iterable of (label, text_input) pairs qualifies as a Dataset.

__init__(task='UT')[source]
Parameters

task (str) – indicate data sample format.

free()[source]

Fully clear dataset.

dump()[source]

Return dataset in json object format.

load(dataset)[source]

Loads json object and prepares it as a Dataset.

Supports two input formats. Example:

  1. {'x': ['The robustness of deep neural networks has received much attention recently', 'We focus on certified robustness of smoothed classifiers in this work', ..., 'our approach exceeds the state-of-the-art.'], 'y': ['neural', 'positive', ..., 'positive']}

  2. [{'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neural'}, {'x': 'We focus on certified robustness of smoothed classifiers in this work', 'y': 'positive'}, ..., {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}]

Parameters

dataset (list|dict) –

Returns

load_json(json_path, encoding='utf-8', fields=None, dropna=True)[source]

Loads json file, each line of the file is a json string.

Parameters
  • json_path – file path

  • encoding – file’s encoding, default: utf-8

  • fields – the json object fields that are needed; if None, all fields are needed. default: None

  • dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True

Returns

load_csv(csv_path, encoding='utf-8', headers=None, sep=',', dropna=True)[source]

Loads csv file; each line corresponds to one sample.

Parameters
  • csv_path – file path

  • encoding – file’s encoding, default: utf-8

  • headers – file’s headers; if None, use the file’s first line as headers. default: None

  • sep – separator for each column. default: ‘,’

  • dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True

Returns

load_hugging_face(name, subset='train')[source]

Loads a dataset from HuggingFace datasets and prepares it as a Dataset.

Parameters
  • name – the dataset name

  • subset – the subset of the main dataset.

Returns

append(data_sample, sample_id=-1)[source]

Load single data sample and append to dataset.

Parameters
  • data_sample (dict|sample) –

  • sample_id (int) – useful to identify sample, default -1

Returns

True/False, indicating whether the append action was successful.

extend(data_samples)[source]

Load multi data samples and extend to dataset.

Parameters

data_samples (list|dict|Sample) –

Returns

static norm_input(data_samples)[source]

Convert various data input to list of dict. Example:

 {'x': [
          'The robustness of deep neural networks has received
          much attention recently',
          'We focus on certified robustness of smoothed classifiers
          in this work',
          ...,
          'our approach exceeds the state-of-the-art.'
      ],
 'y': [
          'neural',
          'positive',
          ...,
          'positive'
      ]
}
convert to
[
    {'x': 'The robustness of deep neural networks has received
    much attention recently', 'y': 'neural'},
    {'x': 'We focus on certified robustness of smoothed classifiers
    in this work', 'y': 'positive'},
    ...,
    {'x': 'our approach exceeds the state-of-the-art.',
    'y': 'positive'}
]
Parameters

data_samples (list|dict|Sample) –

Returns

Normalized data.
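The dict-to-list conversion shown above can be sketched in a few lines of plain Python (a minimal illustration of the documented behavior, not textflint's own code):

```python
def norm_input(data_samples):
    """Convert {'x': [...], 'y': [...]} into a list of {'x': ..., 'y': ...}
    dicts. Inputs already in list-of-dicts form are returned as a list."""
    if isinstance(data_samples, dict):
        keys = list(data_samples)
        # zip the parallel value lists back into per-sample dicts
        return [dict(zip(keys, values)) for values in zip(*data_samples.values())]
    return list(data_samples)


print(norm_input({'x': ['a', 'b'], 'y': ['neg', 'pos']}))
# [{'x': 'a', 'y': 'neg'}, {'x': 'b', 'y': 'pos'}]
```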

save_csv(out_path, encoding='utf-8', headers=None, sep=',')[source]

Save dataset to csv file.

Parameters
  • out_path – file path

  • encoding – file’s encoding, default: utf-8

  • headers – file’s headers; if None, use the file’s first line as headers. default: None

  • sep – separator for each column. default: ‘,’

Returns

save_json(out_path, encoding='utf-8', fields=None)[source]

Save dataset to json file which contains json object in each line.

Parameters
  • out_path – file path

  • encoding – file’s encoding, default: utf-8

  • fields – the json object fields that are needed; if None, all fields are needed. default: None

Returns

class textflint.Config(task='UT', out_dir=None, max_trans=1, random_seed=1, fields=None, flint_model=None, trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]

Bases: object

Hold some config params to control generation and report procedure.

__init__(task='UT', out_dir=None, max_trans=1, random_seed=1, fields=None, flint_model=None, trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]
Parameters
  • task (str) – task name

  • out_dir (string) – out dir for saving generated samples, default current path.

  • max_trans (int) – maximum number of transformed samples generated from one original sample per Transformation.

  • random_seed (int) – random number seed to reproduce generation.

  • fields (str|list[str]) – fields on which new samples are generated.

  • model_file (str) – path to the python file containing the FlintModel instance named ‘model’.
  • trans_methods (list) – indicate what transformations to apply to dataset.

  • trans_config (dict) – parameters for the initialization of the transformation instances.

  • return_unk (bool) – whether to apply transformations that may change the label of a sample.

  • sub_methods (list) – indicate what subpopulations to apply to dataset.

  • sub_config (dict) – parameters for the initialization of the subpopulation instances.

  • attack_methods (str) – path to the python file containing the Attack instances named “attacks”.

  • validate_methods (str|list[str]) – which validation methods to use to calculate the confidence of generated samples.

check_config()[source]

Check common config params.

get_generate_methods(methods, task_to_methods, allow_pipeline=False)[source]

Validate transformation or subpopulation methods.

Watch out! Some UT transformations/subpopulations may not be compatible with your task; please choose your methods carefully.

Parameters
  • methods (list) – transformations or subpopulations to apply to the dataset. If not provided, return the default methods.

  • task_to_methods (dict) – map allowed methods by task name.

  • allow_pipeline (bool) – whether allow pipeline input

Returns

list of transformation/subpopulation.

classmethod from_dict(json_object)[source]

Constructs a Config from a Python dictionary of parameters.

classmethod from_json_file(json_file)[source]

Constructs a Config from a json file of parameters.

to_dict()[source]

Serializes this instance to a Python dictionary.

to_json_string()[source]

Serializes this instance to a JSON string.

to_json_file(json_file)[source]

Serializes this instance to a JSON file.
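The from_dict / to_dict / to_json_string trio follows the common config round-trip pattern. A minimal stand-alone sketch of that pattern (a toy class for illustration, not the textflint Config itself, which takes many more parameters):

```python
import json


class ToyConfig:
    """Minimal config object with the same round-trip pattern."""

    def __init__(self, task='UT', max_trans=1, random_seed=1):
        self.task = task
        self.max_trans = max_trans
        self.random_seed = random_seed

    @classmethod
    def from_dict(cls, json_object):
        # Construct a config from a plain dict of parameters.
        return cls(**json_object)

    def to_dict(self):
        return dict(self.__dict__)

    def to_json_string(self):
        return json.dumps(self.to_dict(), indent=2, sort_keys=True)


config = ToyConfig.from_dict({'task': 'SA', 'max_trans': 2, 'random_seed': 42})
# Round-trip: dict -> object -> JSON string -> dict preserves all params.
restored = ToyConfig.from_dict(json.loads(config.to_json_string()))
print(restored.to_dict() == config.to_dict())  # True
```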

class textflint.FlintModel(model, tokenizer, task='SA', batch_size=1)[source]

Bases: abc.ABC

A model wrapper queries a model with a list of text inputs.

Classification-based models return a list of lists, where each sublist represents the model’s scores for a given input.

Text-to-text models return a list of strings, where each string is the output – like a translation or summarization – for a given input.

__init__(model, tokenizer, task='SA', batch_size=1)[source]
Parameters
  • model – any model object

  • tokenizer – supports tokenizing sentences and converting tokens to model input ids

  • task (str) – task name

  • batch_size (int) – batch size to apply evaluation

evaluate(data_samples, prefix='')[source]
Parameters
  • data_samples (list[Sample]) – list of Samples

  • prefix (str) – name prefix to add to metrics

Returns

dict obj to save metrics result

get_grad(*inputs)[source]

Get gradient of loss with respect to input tokens.

Parameters

inputs (tuple) – tuple of original texts

get_model_grad(*inputs)[source]

Get gradient of loss with respect to input tokens.

Parameters

inputs (tuple) – list of original text

unzip_samples(data_samples)[source]

Unzip sample to input texts and labels.

Parameters

data_samples (list) – sample list

Returns

(inputs_text), labels.

class textflint.Engine[source]

Bases: object

Engine class of Text Robustness.

Support run entrance which automatically finish data loading, transformation/subpopulation/attack generation and robustness report generation.

Also provide interfaces of each layer to practitioners.

run(data_input, config=None, model=None)[source]

Engine start entrance, load data and apply transformations, finally generate robustness report if needed.

Parameters
  • data_input (dict|list|string) – json object or json/csv file

  • config (string|textflint.Config) – json file or Config object

  • model (textflint.FlintModel) – model wrapper which implements FlintModel abstract methods, not a necessary input.

Returns

save generated data to the out dir and provide a report in HTML format.

load(data_input, config=None, model=None)[source]

Load data input, config file and FlintModel.

Parameters
  • data_input (dict|list|string) – json object or json/csv file

  • config (string|textflint.Config) – json file or Config object

  • model (textflint.FlintModel) – model wrapper which implements FlintModel abstract methods, not a necessary input.

Returns

textflint.Dataset, textflint.Config, textflint.FlintModel

generate(dataset, config, model=None)[source]

Generate new samples according to the given config, save the result as a json file to the out path, and evaluate model performance automatically if a model is provided.

Parameters
Returns

save generated samples to json file.

report(evaluate_result)[source]

Automatically analyze the model robustness verification results and plot the robustness evaluation report.

Parameters

evaluate_result (dict) – json object contains robustness evaluation result and other additional information.

Returns

open an HTML robustness report.

class textflint.Generator(task='UT', max_trans=1, random_seed=1, fields='x', trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]

Bases: abc.ABC

Transformation controller which applies multi transformations to each data sample.

__init__(task='UT', max_trans=1, random_seed=1, fields='x', trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]
Parameters
  • task (str) – Indicate which task your data belongs to.

  • max_trans (int) – Maximum number of transformed samples generated from one original sample per Transformation.

  • random_seed (int) – random number seed to reproduce generation.

  • fields (str|list) – Indicate which fields to apply transformations to. Multi-field transformation is only for some special tasks, like SM and NLI.

  • trans_methods (list) – list of transformations’ name.

  • trans_config (dict) – transformation class configs, useful to control the behavior of transformations.

  • return_unk (bool) – Some transformations may generate unknown labels, e.g. inserting a word into a sequence in an NER task. If set to False, these transformations are skipped.

  • sub_methods (list) – list of subpopulations’ name.

  • sub_config (dict) – subpopulation class configs, useful to control the behavior of subpopulation.

  • attack_methods (str) – path to the python file containing the Attack instances.

  • validate_methods (list) – confidence calculate functions.

prepare(dataset)[source]

Check dataset

Parameters

dataset (textflint.Dataset) – the input dataset

generate(dataset, model=None)[source]

Returns a list of possible generated samples for dataset.

Parameters
Returns

yield (original samples, new samples, generated function string).

generate_by_transformations(dataset, **kwargs)[source]

Generate samples by a list of transformation methods.

Parameters

dataset – the input dataset

Returns

(original samples, new samples, generated function string)

generate_by_subpopulations(dataset, **kwargs)[source]

Generate samples by a list of subpopulation methods.

Parameters

dataset – the input dataset

Returns

the transformed dataset

generate_by_attacks(dataset, model=None, **kwargs)[source]

Generate samples by a list of attack methods.

Parameters
  • dataset – the input dataset

  • model – the model to attack if given.

Returns

the transformed dataset

class textflint.Validator(origin_dataset, trans_dataset, fields, need_tokens=False)[source]

Bases: abc.ABC

An abstract class that computes the semantic similarity score between original texts and adversarial texts.

Parameters
  • origin_dataset (dataset) – the dataset of origin sample

  • trans_dataset (dataset) – the dataset of transformed samples

  • fields (str|list) – the name of the origin field to compare.

  • need_tokens (bool) – whether the sentences need to be tokenized

abstract validate(transformed_text, reference_text)[source]

Calculate the score

Parameters
  • transformed_text (str) – transformed sentence

  • reference_text (str) – origin sentence

Return float

the similarity score of the two sentences
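One simple way to fill in the abstract validate interface is a token-overlap score. A hedged sketch using Jaccard similarity (chosen purely for illustration; textflint's concrete validators use their own metrics):

```python
def jaccard_validate(transformed_text, reference_text):
    """Score two sentences by Jaccard overlap of their word sets (0.0-1.0)."""
    a = set(transformed_text.lower().split())
    b = set(reference_text.lower().split())
    if not a and not b:
        return 1.0  # two empty sentences count as identical
    return len(a & b) / len(a | b)


print(jaccard_validate('the cat sat', 'the cat ran'))  # 0.5
print(jaccard_validate('same text', 'same text'))      # 1.0
```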

check_data()[source]

Check whether the input data is legal

property score

Calculate the scores of the transformed sentences

Return list

a list of transformed-sentence scores

class textflint.ABSASample(data, trans_id=None, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

ABSASample Class

check_data(data)[source]

Check the format of input data.

Parameters

data (dict) – input data dict

load(data)[source]

Load the legal data and convert it into ABSASample.

Parameters

data (dict) – input data dict

dump()[source]

Dump the legal data.

Return dict

output of transformed data

Check whether aspect words and opinion words are in the correct position.

Return bool

whether the format of data is legal.

tokenize_term_list()[source]

Tokenize the term list of ABSASample.

Return list

terms in ABSASample

update_sentence(trans_sentence)[source]

Update the sentence of ABSASample.

Parameters

trans_sentence (str|list) – updated sentence

update_terms(trans_terms)[source]

Update the terms of ABSASample.

Parameters

trans_terms (dict) – updated terms

update_term_list(sample)[source]

Update the term_list of ABSASample.

Parameters

sample (ABSASample) – updated sample

insert_field_before_indices(field, indices, items)[source]

Insert items of multi given scopes before indices of field value at the same time.

Parameters
  • field (str) – transformed field

  • indices (list) – indices of insert positions

  • items (list) – insert items

Return ~textflint.ABSASample

modified sample

insert_field_before_index(field, ins_index, new_item)[source]

Insert items before the given index of the field value.

Parameters
  • field (str) – transformed field

  • ins_index (int|list) – index of insert position

  • new_item (str|list) – insert item

Return ~textflint.ABSASample

modified sample

insert_field_after_indices(field, indices, items)[source]

Insert items of multi given scopes after indices of field value at the same time.

Parameters
  • field (str) – transformed field

  • indices (list) – indices of insert positions

  • items (list) – insert items

Return ABSASample

modified sample

insert_field_after_index(field, ins_index, new_item)[source]

Insert items after the given index of the field value.

Parameters
  • field (str) – transformed field

  • ins_index (int|list) – index of insert position

  • new_item (str|list) – insert item

Return ~textflint.ABSASample

modified sample

delete_field_at_indices(field, indices)[source]

Delete items of given scopes of field value.

Parameters
  • field (str) – transformed field

  • indices (list) – indices of delete positions

Return ABSASample

modified sample

delete_field_at_index(field, del_index)[source]

Delete items at the given index of the field value.

Parameters
  • field (str) – transformed field

  • del_index (list) – index of delete position

Return ~textflint.ABSASample

modified sample

class textflint.CWSSample(data, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

Our segmentation rules are based on ctb6.

The input x can be a list or a sentence. The input y is the segmentation label, which includes B, M, E, and S. y can also be generated automatically: to do so, pass an empty list for y, and make sure each word in x is either separated by a space or split into a separate element of the list.

Note that punctuation should be separated as a single word.

Example:

1. input {'x': '小明好想送Jo圣诞礼物', 'y': ['B', 'E', 'B', 'E', 'S', 'B',
    'E', 'B', 'E', 'B', 'E']}
2. input {'x': ['小明', '好想送Jo圣诞礼物'], 'y': ['B', 'E', 'B', 'E', 'S',
    'B', 'E', 'B', 'E', 'B', 'E']}
3. input {'x': '小明 好想 送 Jo 圣诞 礼物', 'y': []}
4. input {'x': ['小明', '好想', '送', 'Jo', '圣诞', '礼物'], 'y': []}
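The automatic y generation described above amounts to emitting one B/M/E/S label per character of each segmented word. A plain-Python sketch of that rule (an illustration of the labeling scheme, not the library's code):

```python
def bmes_labels(words):
    """Derive B/M/E/S segmentation labels from a list of segmented words.
    S = single-char word; B/M/E = begin/middle/end of a multi-char word."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append('S')
        else:
            labels.extend(['B'] + ['M'] * (len(word) - 2) + ['E'])
    return labels


# Matches example 1 above, generated from the segmented input of example 4.
print(bmes_labels(['小明', '好想', '送', 'Jo', '圣诞', '礼物']))
# ['B', 'E', 'B', 'E', 'S', 'B', 'E', 'B', 'E', 'B', 'E']
```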
__init__(data, origin=None, sample_id=None)[source]
Parameters
  • data (dict) – The dict obj that contains data info

  • sample_id (int) – the id of sample

  • origin (bool) – if the sample is origin

check_data(data)[source]

Check whether the data is legitimate; we do not check that the labels are correct. If the data is not in the standard format but is in an acceptable format, convert it into the standard format.

Parameters

data (dict) – The dict obj that contains data info

load(data)[source]

Convert data dict which contains essential information to CWSSample.

Parameters

data (dict) – The dict obj that contains data info

get_words()[source]

Get the words from the sentence.

Return list

the words in sentence

replace_at_ranges(indices, new_items, y_new_items=None)[source]

Replace words at indices and set their mask to MODIFIED_MASK.

Parameters
  • indices (list) – the list of positions to be changed.

  • new_items (list) – the list of items to be changed.

  • y_new_items (list) – the list of mask info to be changed.

Returns

replaced CWSSample object.

update(x, y)[source]

Update the sentence and labels, and return a new sample.

Parameters
  • x (str) – the new sentence.

  • y (list) – the new labels.

Returns

new CWSSample object.

check(indices, new_items, y_new_items=None)[source]

Check whether the position of change is legal.

Parameters
  • indices (list) – the list of positions to be changed.

  • new_items (list) – the list of items to be changed.

  • y_new_items (list) – the list of mask info to be changed.

Return three lists

legal positions, changed items, changed labels.

static get_labels(words)[source]

Get the labels of the words.

Parameters

words (str) – the words for which to get labels.

Return list

the labels of the words.

class textflint.CorefSample(data, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

Coref Sample

check_data(data)[source]

Check if data is a conll-dict and is ready to be predicted.

Parameters

data (None|dict) – Must have keys: sentences, clusters. May have keys: doc_key, speakers, constituents, ner

Returns

Validate whether the sample is legal.

load(data)[source]

Convert a conll-dict to CorefSample.

Parameters

data (None|dict) – None, or a conll-style dict. Must have keys: sentences, clusters. May have keys: doc_key, speakers, constituents, ner

Returns

dump(with_check=True)[source]

Dump a CorefSample to a conll-dict.

Parameters

with_check (bool) – whether the dumped conll-dict should be checked

Return dict ret_dict

a conll-style dict

pretty_print(show='Sample:')[source]

A pretty-printer for CorefSample. Print useful sample information by calling this function.

Parameters

show (str) – optional, the welcome information of printing this sample

num_sentences()[source]

Return the number of sentences in this sample.

Return int

the number of sentences in this sample

get_kth_sen(k)[source]

Get the kth sentence as a word list.

Parameters

k (int) – sen id

Return list

kth sen, word list

eqlen_sen_map()[source]

Generate [0, 0, 1, 1, 1, 2, 2] from self.sen_map = [2, 3, 2].

Return list

sentence mapping with equal length to x, like [0, 0, 1, 1, 1, 2, 2]
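The expansion described above can be sketched in one line of plain Python (an illustration of the documented mapping, not the library's code):

```python
def eqlen_sen_map(sen_map):
    """Expand per-sentence lengths [2, 3, 2] into a per-word sentence-id map
    with the same length as the flattened word list x."""
    return [sen_idx for sen_idx, length in enumerate(sen_map) for _ in range(length)]


print(eqlen_sen_map([2, 3, 2]))  # [0, 0, 1, 1, 1, 2, 2]
```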

index_in_sen(idx)[source]

For the given word idx, determine which sen it is in.

Parameters

idx (int) – word idx

Return int

sen_idx, the index of the sentence that word idx is in

static sens2doc(sens)[source]

Given a 2-d list of str (a list of word lists), concatenate it and record the length of each sentence

Parameters

sens (list) – 2-d list of str (a list of word lists)

Returns (list, list)

x as list of str (word list), sen_map as list of int (sen len list)

static doc2sens(x, sen_map)[source]

Given x and sen_map, return sens. Inverse to sens2doc.

Parameters
  • x (list) – list of str (word list)

  • sen_map (list) – list of int (sen len list)

Return list

sens as a 2-d list of str (a list of word lists)
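sens2doc and doc2sens form a simple inverse pair; a plain-Python sketch of the documented behavior (not the library's own code):

```python
def sens2doc(sens):
    """Flatten a list of word lists into one word list x plus
    per-sentence lengths sen_map."""
    x = [word for sen in sens for word in sen]
    sen_map = [len(sen) for sen in sens]
    return x, sen_map


def doc2sens(x, sen_map):
    """Inverse of sens2doc: split the flat word list back into sentences."""
    sens, start = [], 0
    for length in sen_map:
        sens.append(x[start:start + length])
        start += length
    return sens


sens = [['I', 'slept'], ['You', 'ran', 'fast']]
x, sen_map = sens2doc(sens)
print(x, sen_map)            # ['I', 'slept', 'You', 'ran', 'fast'] [2, 3]
print(doc2sens(x, sen_map))  # [['I', 'slept'], ['You', 'ran', 'fast']]
```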

insert_field_before_indices(field, indices, items)[source]

Insert items of given scopes before indices of field value simultaneously.

Parameters
  • field (str) – transformed field

  • indices (list) – indices of insert positions

  • items (list) – insert items

Return ~textflint.CorefSample

modified sample

insert_field_after_indices(field, indices, items)[source]

Insert items of given scopes after indices of field value simultaneously.

Parameters
  • field (str) – transformed field

  • indices (list) – indices of insert positions

  • items (list) – insert items

Return ~textflint.CorefSample

modified sample

delete_field_at_indices(field, indices)[source]

Delete items of given scopes of field value.

Parameters
  • field (str) – transformed field

  • indices (list) – indices of delete positions

Return ~textflint.CorefSample

modified sample

replace_field_at_indices(field, indices, items)[source]

Replace scope items of field value with items.

Parameters
  • field (str) – transformed field

  • indices (list) – indices of replace positions

  • items (list) – replacement items

Return ~textflint.CorefSample

modified sample

static concat_conlls(*args)[source]

Given several CorefSamples, concat the values key by key.

Parameters

args – some CorefSamples

Return ~textflint.input_layer.component.sample.CorefSample

A CorefSample, as the docs are concatenated to form one x

shuffle_conll(sen_idxs)[source]

Given a CorefSample and shuffled sentence indexes, reproduce a CorefSample with respect to the indexes.

Parameters

sen_idxs (list) – a list of ints, the indexes in shuffled order. We expect sen_idxs to be like [1, 3, 0, 4, 2, 5] when sen_num = 6.

Return ~textflint.input_layer.component.sample.CorefSample

a CorefSample with respect to the shuffled index

part_conll(pres_idxs)[source]

Only sentences with the given indexes will be kept; all the cluster structures are kept for convenient concatenation.

Parameters

pres_idxs (list) – a list of ints, the indexes to be preserved. We expect pres_idxs to be drawn from [0..num_sen] and in ascending order, like [0, 1, 3, 5] when num_sen = 6.

Return ~textflint.input_layer.component.sample.CorefSample

a CorefPartSample of a conll-part

part_before_conll(sen_idx)[source]

Only sentences [0, sen_idx) will be kept, and all the structures of clusters are kept for convenience of concat.

Parameters

sen_idx (int) – sentences with idx < sen_idx will be preserved

Return ~textflint.input_layer.component.sample.CorefSample

a CorefPartSample of a conll-part

part_after_conll(sen_idx)[source]

Only sentences [sen_idx:] will be kept, and all the structures of clusters are kept for convenience of concat.

Parameters

sen_idx (int) – sentences with idx >= sen_idx will be preserved

Return ~textflint.input_layer.component.sample.CorefSample

a CorefPartSample of a conll-part

class textflint.DPSample(data, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

DP Sample class to hold the data info and provide atomic operations.

Validate whether the sample is legal

load(data)[source]

Convert data dict to DPSample and get matched brackets.

Parameters

data (dict) – contains ‘word’, ‘postag’, ‘head’, ‘deprel’ keys.

insert_field_after_indices(field, indices, items)[source]

Insert items of multiple given scopes after indices of field value at the same time.

Parameters
  • field (str) – Only value ‘x’ supported.

  • indices (list) – shape:indices_num

  • items (list) – shape: indices_num, correspond to indices

Return ~DPSample

The sentence with words added.

insert_field_after_index(field, ins_index, new_item)[source]

Insert given data after the given index.

Parameters
  • field (str) – Only value ‘x’ supported.

  • ins_index (int) – The index where the word will be inserted after.

  • new_item (str) – The word to be inserted.

Return ~DPSample

The sentence with one word added.

insert_field_before_indices(field, indices, items)[source]

Insert items of multiple given scopes before indices of field value at the same time.

Parameters
  • field (str) – Only value ‘x’ supported.

  • indices (list) – shape:indices_num

  • items (list) – shape: indices_num, correspond to indices

Return ~DPSample

The sentence with words added.

insert_field_before_index(field, ins_index, new_item)[source]

Insert given data before the given position.

Parameters
  • field (str) – Only value ‘x’ supported.

  • ins_index (int) – The index before which the word will be inserted.

  • new_item (str) – The word to be inserted.

Return ~DPSample

The sentence with one word added.

delete_field_at_indices(field, indices)[source]

Delete items of given scopes of field value.

Parameters
  • field (str) – Only value ‘x’ supported.

  • indices (list) –

    shape: indices_num. Each index can be an int, indicating a single item to delete; a list like [1, 2, 3]; a pair like (0, 3), indicating items from 0 to 3 (3 not included); or a slice, which will be converted to a list.

Return ~DPSample

The sentence with words deleted.

delete_field_at_index(field, del_index)[source]

Delete data at the given position.

Parameters
  • field (str) – Only value ‘x’ supported.

  • del_index (int|list|slice) –

    can be an int, indicating a single item to delete; a list like [1, 2, 3]; a pair like (0, 3), indicating items from 0 to 3 (3 not included); or a slice, which will be converted to a list.

Return ~DPSample

The sentence with one word deleted.
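The index conventions above (an int, an explicit list, a (start, end) pair, or a slice) can be sketched in plain Python. The helpers below (normalize_indices, delete_at_indices) are hypothetical illustrations of the documented semantics, not textflint's internal implementation:

```python
def normalize_indices(spec, length):
    """Expand an index spec (int, list, (start, end) pair, or slice)
    into an explicit list of positions, per the conventions above."""
    if isinstance(spec, int):
        return [spec]
    if isinstance(spec, slice):
        return list(range(*spec.indices(length)))
    if isinstance(spec, tuple):
        start, end = spec              # end is exclusive, as in (0, 3)
        return list(range(start, end))
    return list(spec)                  # already an explicit list like [1, 2, 3]

def delete_at_indices(tokens, spec):
    """Return a copy of tokens with the specified positions removed."""
    drop = set(normalize_indices(spec, len(tokens)))
    return [tok for i, tok in enumerate(tokens) if i not in drop]

words = ["The", "quick", "brown", "fox", "jumps"]
print(delete_at_indices(words, (1, 3)))   # ['The', 'fox', 'jumps']
print(delete_at_indices(words, [0, 4]))   # ['quick', 'brown', 'fox']
```

Whether a two-element sequence like (0, 3) is treated as a range or as two individual indices depends on its type here; the tuple-as-range reading matches the documentation's wording.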

class textflint.MRCSample(data, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

MRC Sample class to hold the mrc data info and provide atomic operations.

STEMMER = <LancasterStemmer>
wn = <WordNetCorpusReader in '/home/docs/.cache/textflint/NLTK_DATA/wordnet'>
POS_TO_WORDNET = {'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'NN': 'n'}
__init__(data, origin=None, sample_id=None)[source]

The sample object for the machine reading comprehension task.

Parameters
  • data (dict) – The dict obj that contains data info.

  • origin (sample) – original sample obj.

  • sample_id (int) – sample index

check_data(data)[source]

Check whether the input data is legal and validate whether the sample is legal.

Parameters

data (dict) – dict obj that contains data info

Returns

bool

static convert_idx(text, tokens)[source]

Get the start and end character idx of tokens in the context

Parameters
  • text (str) – context text

  • tokens (list) – context words

Returns

list of spans
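A simplified sketch of what convert_idx is documented to do, assuming a left-to-right scan so that repeated tokens map to successive occurrences (the real implementation may differ):

```python
def convert_idx(text, tokens):
    """Locate each token in the context text and return its
    (start, end) character span. Illustrative sketch only."""
    spans, current = [], 0
    for token in tokens:
        start = text.find(token, current)  # search from the last match onward
        if start < 0:
            raise ValueError(f"token {token!r} not found in context")
        spans.append((start, start + len(token)))
        current = start + len(token)
    return spans

print(convert_idx("the cat sat", ["the", "cat", "sat"]))
# [(0, 3), (4, 7), (8, 11)]
```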

load_answers(ans, spans)[source]

Get word-level positions of answers

Parameters
  • ans (dict) – answers dict with character position and text

  • spans (list) – the start idx and end idx of tokens

get_answers()[source]

Get copy of answers

Returns

dict, answers

set_answers_mask()[source]

Set the answers with TASK_MASK

load(data)[source]

Convert data dict which contains essential information to MRCSample.

Parameters

data (dict) – the dict obj that contains dict info

dump()[source]

Convert the sample back to a data dict which contains essential information.

Returns

dict object

delete_field_at_index(field, index)[source]

Delete the word at the given index.

Parameters
  • field (str) – field name

  • index (int|list|slice) – modified scope

Returns

modified sample

delete_field_at_indices(field, indices)[source]

Delete items of given scopes of field value.

Parameters
  • field (str) – field name

  • indices (list) – list of int/list/slice, modified scopes

Returns

modified Sample

insert_field_before_indices(field, indices, items)[source]

Insert items of multi given scopes before indices of field value at the same time.

Parameters
  • field (str) – field name

  • indices (list) – list of int/list/slice, modified scopes

  • items (list) – inserted items

Returns

modified Sample

insert_field_before_index(field, index, items)[source]

Insert item before index of field value.

Parameters
  • field (str) – field name

  • index (int) – modified scope

  • items – inserted item

Returns

modified Sample

insert_field_after_index(field, index, new_item)[source]

Insert item after index of field value.

Parameters
  • field (str) – field name

  • index (int) – modified scope

  • new_item – inserted item

Returns

modified Sample

insert_field_after_indices(field, indices, items)[source]

Insert items of multi given scopes after indices of field value at the same time.

Parameters
  • field (str) – field name

  • indices (list) – list of int/list/slice, modified scopes

  • items (list) – inserted items

Returns

modified Sample

unequal_replace_field_at_indices(field, indices, rep_items)[source]

Replace scope items of field value with rep_items which may not equal with scope.

Parameters
  • field (str) – field name

  • indices (list) – list of int/list/slice, modified scopes

  • rep_items (list) – replace items

Returns

modified sample
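The "unequal" replace semantics (the replacement may be longer or shorter than the scope it replaces) can be illustrated with a hypothetical helper; splicing right to left keeps earlier indices valid:

```python
def unequal_replace_at_indices(tokens, indices, rep_items):
    """Replace each scope with a replacement of possibly different
    length. Illustrative sketch, not textflint's internal code."""
    out = list(tokens)
    # process scopes from right to left so earlier positions don't shift
    for scope, rep in sorted(zip(indices, rep_items),
                             key=lambda p: p[0] if isinstance(p[0], int) else p[0][0],
                             reverse=True):
        if isinstance(scope, int):
            start, end = scope, scope + 1
        else:
            start, end = scope         # (start, end), end exclusive
        out[start:end] = rep if isinstance(rep, list) else [rep]
    return out

tokens = ["I", "saw", "a", "cat"]
print(unequal_replace_at_indices(tokens, [(2, 4)], [["two", "small", "dogs"]]))
# ['I', 'saw', 'two', 'small', 'dogs']
```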

static get_answer_position(spans, answer_start, answer_end)[source]

Get answer tokens start position and end position

static run_conversion(question, answer, tokens, const_parse)[source]

Convert the question and answer to a declarative sentence

Parameters
  • question (str) – question

  • answer (str) – answer

  • tokens (list) – the semantic tag dicts of question

  • const_parse – the constituency parse of question

Returns

a declarative sentence

convert_answer(answer, sent_tokens, question)[source]

Replace the ground truth with fake answer based on specific rules

Parameters
  • answer (str) – ground truth, str

  • sent_tokens (list) – sentence dicts, like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]

  • question (str) – question sentence

Return str

fake answer

static alter_sentence(sample, nearby_word_dict=None, pos_tag_dict=None, rules=None)[source]
Parameters
  • sample – sentence dicts, like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]

  • nearby_word_dict – the dictionary to search for nearby words

  • pos_tag_dict – the dictionary to search for the most frequent pos tags

  • rules – the rules to alter the sentence

Returns

the altered sentence and the altered sentence dicts

static alter_special(token, **kwargs)[source]

Alter special tokens

Parameters
  • token – the token to alter

  • kwargs

Returns

like ‘US’ -> ‘UK’

static alter_wordnet_antonyms(token, **kwargs)[source]

Replace words with wordnet antonyms

Parameters
  • token – the token to replace

  • kwargs

Returns

like good -> bad

static alter_wordnet_synonyms(token, **kwargs)[source]

Replace words with synonyms

Parameters
  • token – the token to replace

  • kwargs

Returns

like good -> great

static alter_nearby(pos_list, ignore_pos=False, is_ner=False)[source]

Alter words based on glove embedding space

Parameters
  • pos_list – pos tags list

  • ignore_pos (bool) – whether to match pos tag

  • is_ner (bool) – indicate ner

Returns

like ‘Mary’ -> ‘Rose’

static alter_entity_type(token, **kwargs)[source]

Alter entity

Parameters
  • token – the word to replace

  • kwargs

Returns

like ‘London’ -> ‘Berlin’

static get_answer_tokens(sent_tokens, answer)[source]

Extract the pos, ner, lemma tags of answer tokens

Parameters
  • sent_tokens (list) – a list of dicts

  • answer (str) – answer

Returns

a list of dicts like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}, {‘word’: ‘Bernadette’, ‘pos’: ‘NNP’, ‘lemma’: ‘Bernadette’, …}, {‘word’: ‘Soubirous’, ‘pos’: ‘NNP’, ‘lemma’: ‘Soubirous’, …}]
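A minimal sketch of this extraction, matching the answer's word sequence against the sentence dicts (hypothetical helper, not the library's code):

```python
def get_answer_tokens(sent_tokens, answer):
    """Return the dicts of the tokens whose words make up the answer,
    found by sliding the answer's word sequence over the sentence."""
    ans_words = answer.split()
    words = [tok["word"] for tok in sent_tokens]
    for i in range(len(words) - len(ans_words) + 1):
        if words[i:i + len(ans_words)] == ans_words:
            return sent_tokens[i:i + len(ans_words)]
    return []  # answer not present in the sentence

sent = [{"word": "Saint", "pos": "NNP", "lemma": "Saint", "ner": "PERSON"},
        {"word": "Bernadette", "pos": "NNP", "lemma": "Bernadette", "ner": "PERSON"},
        {"word": "appeared", "pos": "VBD", "lemma": "appear", "ner": "O"}]
print([t["word"] for t in get_answer_tokens(sent, "Saint Bernadette")])
# ['Saint', 'Bernadette']
```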

static ans_entity_full(ner_tag, new_ans)[source]

Returns a function that yields new_ans iff every token has |ner_tag|

Parameters
  • ner_tag (str) – ner tag

  • new_ans (list) – like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]

Returns

fake answer, str

static ans_abbrev(new_ans)[source]
Parameters

new_ans (str) – answer words

Return str

fake answer

static ans_match_wh(wh_word, new_ans)[source]
Returns a function that yields new_ans if the question starts with |wh_word|

Parameters
  • wh_word (str) – question word

  • new_ans (list) – like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]

Return str

fake answers,

static ans_pos(pos, new_ans, end=False, add_dt=False)[source]

Returns a function that yields new_ans if the first/last token has |pos|

Parameters
  • pos (str) – pos tag

  • new_ans (list) – like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]

  • end (bool) – whether to use the last word to match the pos tag

  • add_dt (bool) – whether to add a determiner

Return str

fake answer

static read_const_parse(parse_str)[source]

Construct a constituency tree based on constituency parser

static fix_style(s)[source]

Minor, general style fixes for questions.

class textflint.NERSample(data, origin=None, sample_id=None, mode='BIO')[source]

Bases: textflint.input_layer.component.sample.sample.Sample

NER Sample class to hold the necessary info and provide atomic operations.

__init__(data, origin=None, sample_id=None, mode='BIO')[source]
Parameters
  • data (dict) – The dict obj that contains data info

  • origin (~BaseSample) – Original sample obj

  • sample_id (int) – the id of sample

  • mode (str) – The sequence labeling mode for NER samples.

check_data(data)[source]

Check raw data format.

Parameters

data (dict) – raw data input.

load(data)[source]

Parse data into sample field value.

Parameters

data (dict) – raw data input.

dump()[source]

Convert sample info to input data json format.

Return json

the dict of sentences and labels

delete_field_at_indices(field, indices)[source]

Delete tokens and their NER tag.

Parameters
  • field (str) – field str

  • indices (list) –

    list of int/list/slice, shape: indices_num. Each index can be an int, indicating a single item to delete; a list like [1, 2, 3]; a pair like (0, 3), indicating items from 0 to 3 (3 not included); or a slice, which will be converted to a list.

Returns

Modified NERSample.

delete_field_at_index(field, index)[source]

Delete tokens and their NER tag.

Parameters
  • field (str) – field string, normally ‘x’

  • index (int|list|slice) –

    can be an int, indicating a single item to delete; a list like [1, 2, 3]; a pair like (0, 3), indicating items from 0 to 3 (3 not included); or a slice, which will be converted to a list.

Returns

Modified NERSample

insert_field_before_indices(field, indices, items)[source]

Insert tokens and their NER tags, assuming the tag of each new item is O.

Parameters
  • field (str) – field string

  • indices (list) – list of int, shape: indices_num, like [1, 2, 3]

  • items (list) – list of str/list, shape: indices_num, correspond to indices

Returns

Modified NERSample
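The "tag of new_item is O" convention can be sketched as follows; insert_before_indices here is a hypothetical helper operating on parallel word/tag lists, not textflint's implementation:

```python
def insert_before_indices(words, tags, indices, items):
    """Insert items before each index, tagging every inserted token
    'O'. Works right to left so the original indices stay valid."""
    words, tags = list(words), list(tags)
    for idx, item in sorted(zip(indices, items), reverse=True):
        new = item if isinstance(item, list) else [item]
        words[idx:idx] = new
        tags[idx:idx] = ["O"] * len(new)   # inserted tokens get the O tag
    return words, tags

w, t = insert_before_indices(
    ["John", "lives", "in", "Paris"],
    ["B-PER", "O", "O", "B-LOC"],
    [1, 3], ["Smith", "sunny"])
print(w)  # ['John', 'Smith', 'lives', 'in', 'sunny', 'Paris']
print(t)  # ['B-PER', 'O', 'O', 'O', 'O', 'B-LOC']
```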

insert_field_before_index(field, ins_index, new_item)[source]

Insert tokens and their NER tags, assuming the tag of the new item is O.

Parameters
  • field (str) – field str

  • ins_index (int) – indicate which index to insert items

  • new_item (str/list) – items to insert

Returns

Modified NERSample

insert_field_after_indices(field, indices, items)[source]

Insert tokens and their NER tags, assuming the tag of each new item is O.

Parameters
  • field (str) – field string

  • indices (list) – list of int, shape: indices_num, like [1, 2, 3]

  • items (list) – list of str/list shape: indices_num, correspond to indices

Returns

Modified NERSample

insert_field_after_index(field, ins_index, new_item)[source]

Insert tokens and their NER tags, assuming the tag of the new item is O.

Parameters
  • field (str) – field string

  • ins_index (int) – indicate where to apply insert

  • new_item (str|list) – shape: indices_num, correspond to field_sub_items

Returns

Modified NERSample

find_entities_BIO(word_seq, tag_seq)[source]

Find entities in a sentence with BIO labels.

Parameters
  • word_seq (list) – a list of tokens representing a sentence

  • tag_seq (list) – a list of tags representing a tag sequence labeling the sentence

Return list entity_in_seq

a list of entities found in the sequence, including the information of the start position & end position in the sentence, the category, and the entity itself.
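A plain-Python illustration of BIO entity collection as described above (hypothetical find_entities_bio, returning (start, end, category, entity) tuples; not textflint's implementation):

```python
def find_entities_bio(word_seq, tag_seq):
    """Collect (start, end, category, entity) tuples from BIO tags:
    an entity begins at B-X and extends over consecutive I-X tags."""
    entities, i = [], 0
    while i < len(tag_seq):
        if tag_seq[i].startswith("B-"):
            category, start = tag_seq[i][2:], i
            i += 1
            while i < len(tag_seq) and tag_seq[i] == "I-" + category:
                i += 1
            entities.append((start, i - 1, category,
                             " ".join(word_seq[start:i])))
        else:
            i += 1
    return entities

tokens = ["John", "Smith", "lives", "in", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(find_entities_bio(tokens, tags))
# [(0, 1, 'PER', 'John Smith'), (4, 5, 'LOC', 'New York')]
```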

find_entities_BIOES(word_seq, tag_seq)[source]

Find entities in a sentence with BIOES labels.

Parameters
  • word_seq (list) – a list of tokens representing a sentence

  • tag_seq (list) – a list of tags representing a tag sequence labeling the sentence

Return list entity_in_seq

a list of entities found in the sequence, including the information of the start position & end position in the sentence, the category, and the entity itself.

entities_replace(entities_info, candidates)[source]

Replace multiple entities at once. The input entities are assumed to be in reversed sequential order.

Parameters
  • entities_info (list) – list of entity_info

  • candidates (list) – candidate entities

Returns

Modified NERSample

entity_replace(start, end, entity, label)[source]

Replace one entity and update entities info.

Parameters
  • start (int) – the start position of the entity to be replaced

  • end (int) – the end position of the entity to be replaced

  • entity (str) – the entity to be replaced with

  • label (str) – the category of the entity

Returns

Modified NERSample
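A sketch of the documented semantics, assuming inclusive start/end positions and BIO tags (hypothetical entity_replace operating on parallel word/tag lists, not textflint's implementation):

```python
def entity_replace(words, tags, start, end, entity, label):
    """Swap the entity occupying words[start:end + 1] for a new one
    and rebuild its BIO tags. end is inclusive, as documented."""
    new_words = entity.split()
    new_tags = ["B-" + label] + ["I-" + label] * (len(new_words) - 1)
    words = words[:start] + new_words + words[end + 1:]
    tags = tags[:start] + new_tags + tags[end + 1:]
    return words, tags

w, t = entity_replace(
    ["John", "Smith", "lives", "here"],
    ["B-PER", "I-PER", "O", "O"],
    0, 1, "Mary Jane Watson", "PER")
print(w)  # ['Mary', 'Jane', 'Watson', 'lives', 'here']
print(t)  # ['B-PER', 'I-PER', 'I-PER', 'O', 'O']
```

Note the replacement may have a different number of tokens than the original entity, so all later positions shift; entities_replace presumably processes entities from right to left for exactly this reason.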

class textflint.POSSample(data, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

POS Sample class to hold the necessary info and provide atomic operations.

get_pos(field)[source]

Get text field pos tag.

Parameters

field (str) – field name

Returns

list, a pos tag list.

check_data(data)[source]

Check raw data format and validate whether the sample is legal.

delete_field_at_indices(field, indices)[source]

See sample.py for details.

insert_field_before_indices(field, indices, items)[source]

See sample.py for details.

insert_field_after_indices(field, indices, items)[source]

See sample.py for details.

unequal_replace_field_at_indices(field, indices, rep_items)[source]

See sample.py for details.

load(data)[source]

Parse data into sample field value.

dump()[source]

Convert sample info to input data json format.

class textflint.RESample(data, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

RE Sample class to transform and retrieve features of relation extraction samples.

check_data(data)[source]

Check whether the type of data is correct.

Parameters

data (dict) – data dict containing ‘x’, ‘subj’, ‘obj’ and ‘y’

Validate whether the sample is legal

get_sent_ids()[source]

Generate sentence ID

Returns

string: sentence ID

load(data)[source]

Convert data dict which contains essential information to RESample.

Parameters

data (dict) – contains ‘token’, ‘subj’, ‘obj’, ‘relation’ keys.

get_dp()[source]

Get the dependency parse.

Return Tuple(list, list)

dependency tag of sentence and head of sentence

get_en()[source]

Get entity indices.

Return Tuple(int, int, int, int)

start index of subject entity, end index of subject entity, start index of object entity and end index of object entity

get_type()[source]

Get entity types.

Return Tuple(string, string)

entity type of subject and entity type of object

get_sent()[source]

Get the tokenized sentence.

Return Tuple(list, string)

tokenized sentence and relation

delete_field_at_indices(field, indices)[source]

Delete words at the given indices in the sentence.

Parameters
  • field (string) – field to be operated on

  • indices (list) – a list of index to be deleted

Return dict

contains ‘token’, ‘subj’, ‘obj’ keys
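Deleting tokens must also shift the subject/object spans so they still point at the same words. A hypothetical sketch, assuming [start, end] index pairs and no deletion inside either entity (not textflint's internal code):

```python
def delete_tokens(tokens, subj, obj, indices):
    """Delete tokens at the given indices and shift the subject and
    object spans left by the number of deletions before each bound."""
    drop = set(indices)
    shift = lambda i: i - sum(1 for d in drop if d < i)
    new_tokens = [t for i, t in enumerate(tokens) if i not in drop]
    return {
        "token": new_tokens,
        "subj": [shift(subj[0]), shift(subj[1])],
        "obj": [shift(obj[0]), shift(obj[1])],
    }

out = delete_tokens(
    ["Bill", "Gates", "co", "founded", "Microsoft"],
    subj=[0, 1], obj=[4, 4], indices=[2])
print(out["token"])  # ['Bill', 'Gates', 'founded', 'Microsoft']
print(out["obj"])    # [3, 3]
```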

insert_field_after_indices(field, indices, new_item)[source]

Insert words after the given indices in the sentence.

Parameters
  • field (string) – field to be operated on

  • indices (list) – a list of index to be inserted

  • new_item (list) – list of items to be inserted

Return dict

contains ‘token’, ‘subj’, ‘obj’ keys

insert_field_before_indices(field, indices, new_item)[source]

Insert words before the given indices in the sentence.

Parameters
  • field (string) – field to be operated on

  • indices (list) – a list of index to be inserted

  • new_item (list) – list of items to be inserted

Return dict

contains ‘token’, ‘subj’, ‘obj’ keys

replace_sample_fields(data)[source]

Replace sample fields for RE transformation.

Parameters

data (dict) – contains transformed x, subj, obj keys

Return RESample

transformed sample

stan_ner_transform()[source]

Generate ner list

Return list

ner tags

get_pos()[source]

Get POS tags of the sentence.

Return list

pos tags

dump()[source]

Output the data sample.

Return dict

containing x, subj, obj, y and sample_id

class textflint.UTSample(data, origin=None, sample_id=None)[source]

Bases: textflint.input_layer.component.sample.sample.Sample

Universal Transformation sample.

Universal Transformation is not an NLP subtask; it is implemented to provide universal text transformation functions.

load(data)[source]

Convert data dict which contains essential information to UTSample.

Parameters

data (dict) – contains ‘x’ key at least.

textflint.auto_config(task='UT', config=None)[source]

Check config input or create config automatically.

Parameters
  • task (str) – task name

  • config (str|dict|textflint.config.Config) – config to control generation procedure.

Returns

textflint.config.Config instance.

textflint.auto_dataset(data_input=None, task='UT')[source]

Create Dataset instance and load data input automatically.

Parameters
  • data_input (dict|list|string) – json object or json/csv file.

  • task (str) – task name.

Returns

textflint.Dataset instance.

textflint.auto_flintmodel(model, task)[source]

Check flint model type and whether compatible to task.

Parameters
  • model (textflint.FlintModel|str) – FlintModel instance or python file path which contains FlintModel instance

  • task (str) – task name

Returns

textflint.FlintModel

textflint.auto_generator(config_obj)[source]

Automatically create a task generator to apply transformations, subpopulations and adversarial attacks.

Parameters

config_obj (textflint.Config) – Config instance.

Returns

textflint.Generator

textflint.auto_report_generator()[source]

Return a ReportGenerator instance.

Returns

ReportGenerator