textflint¶
Welcome to the API references for TextFlint!
-
class
textflint.
Sample
(data, origin=None, sample_id=None)[source]¶ Bases:
abc.ABC
Base Sample class to hold the necessary info and provide atomic operations
-
text_processor
= <textflint.common.preprocess.en_processor.EnProcessor object>¶
-
__init__
(data, origin=None, sample_id=None)[source]¶ - Parameters
data (dict) – The dict obj that contains data info.
origin (sample) – original sample obj.
sample_id (int) – sample index
-
get_value
(field)[source]¶ Get field value by field_str.
- Parameters
field (str) – field name
- Returns
field value
-
get_words
(field)[source]¶ Get tokenized words of given textfield
- Parameters
field (str) – field name
- Returns
tokenized words
-
get_text
(field)[source]¶ Get text string of given textfield
- Parameters
field (str) – field name
- Return string
text
-
get_mask
(field)[source]¶ Get word masks of given textfield
- Parameters
field (str) – field name
- Returns
list of mask values
-
get_sentences
(field)[source]¶ Get split sentences of given textfield
- Parameters
field (str) – field name
- Returns
list of sentences
-
get_ner
(field)[source]¶ Get text field ner tags
- Parameters
field (str) – field name
- Returns
ner tag list
-
replace_fields
(fields, field_values, field_masks=None)[source]¶ Fully replace multiple fields at the same time and return a new sample. Note: this API is not recommended, as it will set the mask values of the TextField to MODIFIED_MASK.
- Parameters
fields (list) – field str list
field_values (list) – field value list
field_masks (list) – indicate mask values, useful for printable text
- Returns
Modified Sample
-
replace_field
(field, field_value, field_mask=None)[source]¶ Fully replace a single field and return a new sample. Note: this API is not recommended, as it will set the mask values of the TextField to MODIFIED_MASK.
- Parameters
field (str) – field str
field_value – new value for the field
field_mask (list) – indicate mask value of field
- Returns
Modified Sample
-
replace_field_at_indices
(field, indices, items)[source]¶ Replace items in multiple given scopes of the field value at the same time. Note: this is a complex function; be careful with the shape of your input lists.
- Parameters
field (str) – field name
indices (list of int|list|slice) –
each index can be an int (replace a single item), a list of positions like [1, 2, 3], a tuple like (0, 3) (replace items from 0 to 3, 3 not included), or a slice, which will be converted to a list.
items (list) – replacement items; shape: indices_num, corresponding to indices
- Returns
Modified Sample
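The scope semantics above (an int, an explicit position list, a (start, end) pair, or a slice) can be illustrated with a small plain-Python sketch. The helper names here are hypothetical; this reproduces the documented behavior, not textflint's own implementation:

```python
def expand_scope(index, length):
    """Normalize one scope to a list of positions (a sketch of the documented rules)."""
    if isinstance(index, int):
        return [index]                           # single position
    if isinstance(index, slice):
        return list(range(*index.indices(length)))
    if isinstance(index, tuple):
        return list(range(index[0], index[1]))   # (start, end), end excluded
    return list(index)                           # explicit positions like [1, 2, 3]

def replace_at_indices(items, indices, new_items):
    """Replace every position covered by each scope with the matching new items."""
    result = list(items)
    for scope, news in zip(indices, new_items):
        positions = expand_scope(scope, len(result))
        news = [news] if isinstance(news, str) else list(news)
        for pos, new in zip(positions, news):
            result[pos] = new
    return result

words = ["The", "quick", "brown", "fox"]
print(replace_at_indices(words, [1, (2, 4)], ["slow", ["red", "cat"]]))
# -> ['The', 'slow', 'red', 'cat']
```

The same scope normalization applies to the delete and insert operations documented below.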
-
replace_field_at_index
(field, index, items)[source]¶ Replace items in the given scope of the field value. Be careful with the shape of your input list.
- Parameters
field (str) – field name
index (int|list|slice) –
can be an int (replace a single item), a list of positions like [1, 2, 3], a tuple like (0, 3) (replace items from 0 to 3, 3 not included), or a slice, which will be converted to a list.
items (str|list) – shape: indices_num, corresponding to field_sub_items
- Returns
Modified Sample
-
unequal_replace_field_at_indices
(field, indices, rep_items)[source]¶ Replace items in the given scopes of the field value with rep_items, whose lengths may differ from the scopes.
- Parameters
field – field str
indices – list of int/tuple/list
rep_items – list
- Returns
Modified Sample
-
delete_field_at_indices
(field, indices)[source]¶ Delete items in the given scopes of the field value.
- Parameters
field (str) – field name
indices (list of int|list|slice) –
shape: indices_num; each index can be an int (delete a single item), a list of positions like [1, 2, 3], a tuple like (0, 3) (delete items from 0 to 3, 3 not included), or a slice, which will be converted to a list.
- Returns
Modified Sample
-
delete_field_at_index
(field, index)[source]¶ Delete items in the given scope of the field value.
- Parameters
field (str) – field name
index (int|list|slice) –
can be an int (delete a single item), a list of positions like [1, 2, 3], a tuple like (0, 3) (delete items from 0 to 3, 3 not included), or a slice, which will be converted to a list.
- Returns
Modified Sample
-
insert_field_before_indices
(field, indices, items)[source]¶ Insert items before multiple given indices of the field value at the same time.
Note: this is a complex function; be careful with the shape of your input lists.
- Parameters
field (str) – field name
indices – list of int, shape:indices_num, list like [1, 2, 3]
items – list of str/list, shape: indices_num, correspond to indices
- Returns
Modified Sample
-
insert_field_before_index
(field, index, items)[source]¶ Insert items before the given index of the field value.
- Parameters
field (str) – field name
index (int) – indicate which index to insert items
items (str|list) – items to insert
- Returns
Modified Sample
-
insert_field_after_indices
(field, indices, items)[source]¶ Insert items after multiple given indices of the field value at the same time.
Note: this is a complex function; be careful with the shape of your input lists.
- Parameters
field (str) – field name
indices – list of int, shape:indices_num, like [1, 2, 3]
items – list of str/list shape: indices_num, correspond to indices
- Returns
Modified Sample
-
insert_field_after_index
(field, index, items)[source]¶ Insert items after the given index of the field value.
- Parameters
field (str) – field name
index (int) – indicate where to apply insert
items (str|list) – shape: indices_num, correspond to field_sub_items
- Returns
Modified Sample
-
swap_field_at_index
(field, first_index, second_index)[source]¶ Swap items between first_index and second_index of field value.
- Parameters
field (str) – field name
first_index (int) –
second_index (int) –
- Returns
Modified Sample
-
classmethod
clone
(original_sample)[source]¶ Deep copy original_sample into a new sample.
- Parameters
original_sample – sample to be copied
- Returns
Sample instance
-
property
is_origin
¶ Return whether the sample is an original (untransformed) Sample.
-
-
class
textflint.
Field
(field_value, field_type=<class 'str'>, **kwargs)[source]¶ Bases:
object
A helper class that represents an input string to be modified.
-
class
textflint.
Dataset
(task='UT')[source]¶ Bases:
object
Any iterable of (label, text_input) pairs qualifies as a
Dataset
.
-
load
(dataset)[source]¶ Loads a json object and prepares it as a Dataset.
Two input formats are supported. Example:
1. {'x': ['The robustness of deep neural networks has received much attention recently', 'We focus on certified robustness of smoothed classifiers in this work', ..., 'our approach exceeds the state-of-the-art.'], 'y': ['neural', 'positive', ..., 'positive']}
2. [{'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neural'}, {'x': 'We focus on certified robustness of smoothed classifiers in this work', 'y': 'positive'}, ..., {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}]
- Parameters
dataset (list|dict) –
- Returns
-
load_json
(json_path, encoding='utf-8', fields=None, dropna=True)[source]¶ Loads a json file in which each line is a json string.
- Parameters
json_path – file path
encoding – file encoding, default: utf-8
fields – the json fields to keep; if None, all fields are kept. default: None
dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True
- Returns
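The json-lines behavior described here (one object per line, optional field filtering, dropna) can be sketched in plain Python. The function below operates on an iterable of lines rather than a file path, and its name and details are illustrative, not textflint's implementation:

```python
import json

def load_json_lines(lines, fields=None, dropna=True):
    # One json object per line; keep only the requested fields.
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            if dropna:
                continue                    # drop invalid lines silently
            raise ValueError(f"invalid json line: {line!r}")
        if fields is not None:
            obj = {key: obj[key] for key in fields}
        samples.append(obj)
    return samples

lines = ['{"x": "good movie", "y": "positive", "id": 1}', "not json"]
print(load_json_lines(lines, fields=["x", "y"]))
# -> [{'x': 'good movie', 'y': 'positive'}]
```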
-
load_csv
(csv_path, encoding='utf-8', headers=None, sep=',', dropna=True)[source]¶ Loads a csv file; each line corresponds to one sample.
- Parameters
csv_path – file path
encoding – file encoding, default: utf-8
headers – file headers; if None, use the file's first line as headers. default: None
sep – separator for each column. default: ','
dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True
- Returns
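The csv behavior (first line as headers when none are given, dropna for malformed rows) can likewise be sketched in plain Python; this operates on a csv string rather than a path and is an illustration, not textflint's code:

```python
import csv
import io

def load_csv_rows(text, headers=None, sep=",", dropna=True):
    # With headers=None, the first line is taken as the header row.
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    rows = list(reader)
    if headers is None:
        headers, rows = rows[0], rows[1:]
    samples = []
    for row in rows:
        if len(row) != len(headers):
            if dropna:
                continue                    # drop malformed rows silently
            raise ValueError(f"invalid row: {row!r}")
        samples.append(dict(zip(headers, row)))
    return samples

print(load_csv_rows("x,y\ngood movie,positive\nonly one column"))
# -> [{'x': 'good movie', 'y': 'positive'}]
```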
-
load_hugging_face
(name, subset='train')[source]¶ Loads a dataset from HuggingFace
datasets
and prepares it as a Dataset.
- Parameters
name – the dataset name
subset – the subset of the main dataset.
- Returns
-
append
(data_sample, sample_id=-1)[source]¶ Load a single data sample and append it to the dataset.
- Parameters
data_sample (dict|sample) –
sample_id (int) – useful to identify sample, default -1
- Returns
True/False, indicating whether the append succeeded.
-
extend
(data_samples)[source]¶ Load multiple data samples and extend the dataset.
- Parameters
data_samples (list|dict|Sample) –
- Returns
-
static
norm_input
(data_samples)[source]¶ Convert various data inputs to a list of dicts. Example:
{'x': ['The robustness of deep neural networks has received much attention recently', 'We focus on certified robustness of smoothed classifiers in this work', ..., 'our approach exceeds the state-of-the-art.'], 'y': ['neural', 'positive', ..., 'positive']}
is converted to
[{'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neural'}, {'x': 'We focus on certified robustness of smoothed classifiers in this work', 'y': 'positive'}, ..., {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}]
- Parameters
data_samples (list|dict|Sample) –
- Returns
Normalized data.
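The dict-of-lists to list-of-dicts conversion above can be sketched in a few lines of plain Python (an illustration of the documented behavior, not textflint's implementation):

```python
def norm_input(data_samples):
    # dict of parallel lists -> list of per-sample dicts; lists pass through.
    if isinstance(data_samples, dict):
        keys = list(data_samples)
        n = len(data_samples[keys[0]])
        return [{k: data_samples[k][i] for k in keys} for i in range(n)]
    return list(data_samples)

batch = {"x": ["good movie", "bad plot"], "y": ["positive", "negative"]}
print(norm_input(batch))
# -> [{'x': 'good movie', 'y': 'positive'}, {'x': 'bad plot', 'y': 'negative'}]
```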
-
save_csv
(out_path, encoding='utf-8', headers=None, sep=',')[source]¶ Save dataset to csv file.
- Parameters
out_path – file path
encoding – file’s encoding, default: utf-8
headers – file’s headers, if None, make file’s first line as headers. default: None
sep – separator for each column. default: ‘,’
- Returns
-
-
class
textflint.
Config
(task='UT', out_dir=None, max_trans=1, random_seed=1, fields=None, flint_model=None, trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]¶ Bases:
object
Hold some config params to control generation and report procedure.
-
__init__
(task='UT', out_dir=None, max_trans=1, random_seed=1, fields=None, flint_model=None, trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]¶ - Parameters
task (str) – task name
out_dir (string) – output dir for saving generated samples, default: current path.
max_trans (int) – maximum number of transformed samples generated from one original sample per Transformation.
random_seed (int) – random seed to reproduce generation.
fields (str|list[str]) – fields on which new samples are generated.
model_file (str) – path to the python file containing the FlintModel instance named 'model'.
trans_methods (list) – indicate what transformations to apply to dataset.
trans_config (dict) – parameters for the initialization of the transformation instances.
return_unk (bool) – whether to apply transformations that may change the label of a sample.
sub_methods (list) – indicate what subpopulations to apply to dataset.
sub_config (dict) – parameters for the initialization of the subpopulation instances.
attack_methods (str) – path to the python file containing the Attack instances named 'attacks'.
validate_methods (str|list[str]) – which validation methods to use to calculate the confidence of generated samples.
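A hedged example of what a config might look like, assuming the json keys mirror the constructor parameters above; the transformation names and exact schema are assumptions and should be checked against the textflint documentation for your task:

```python
import json

# All values here are illustrative assumptions, not a canonical config.
config = {
    "task": "SA",                          # sentiment analysis
    "out_dir": "./out",
    "max_trans": 2,
    "random_seed": 1,
    "trans_methods": ["Ocr", "SwapSyn"],   # hypothetical method list
    "return_unk": True,
}
print(json.dumps(config, indent=2))
```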
-
get_generate_methods
(methods, task_to_methods, allow_pipeline=False)[source]¶ Validate transformation or subpopulation methods.
Watch out! Some UT transformations/subpopulations may not be compatible with your task; please choose your methods carefully.
- Parameters
methods (list) – transformations or subpopulations to apply to the dataset. If not provided, return the default methods.
task_to_methods (dict) – map allowed methods by task name.
allow_pipeline (bool) – whether allow pipeline input
- Returns
list of transformation/subpopulation.
-
-
class
textflint.
FlintModel
(model, tokenizer, task='SA', batch_size=1)[source]¶ Bases:
abc.ABC
A model wrapper queries a model with a list of text inputs.
Classification-based models return a list of lists, where each sublist represents the model’s scores for a given input.
Text-to-text models return a list of strings, where each string is the output – like a translation or summarization – for a given input.
-
__init__
(model, tokenizer, task='SA', batch_size=1)[source]¶ - Parameters
model – any model object
tokenizer – supports tokenizing sentences and converting tokens to model input ids
task (str) – task name
batch_size (int) – batch size to apply evaluation
-
evaluate
(data_samples, prefix='')[source]¶ - Parameters
data_samples (list[Sample]) – list of Samples
prefix (str) – name prefix to add to metrics
- Returns
dict obj to save metrics result
-
get_grad
(*inputs)[source]¶ Get gradient of loss with respect to input tokens.
- Parameters
inputs (tuple) – tuple of original texts
-
-
class
textflint.
Engine
[source]¶ Bases:
object
Engine class of Text Robustness.
Supports the run entrance, which automatically finishes data loading, transformation/subpopulation/attack generation, and robustness report generation.
It also provides interfaces to each layer for practitioners.
-
run
(data_input, config=None, model=None)[source]¶ Engine start entrance: load data, apply transformations, and finally generate a robustness report if needed.
- Parameters
data_input (dict|list|string) – json object or json/csv file
config (string|textflint.Config) – json file or Config object
model (textflint.FlintModel) – model wrapper which implements the FlintModel abstract methods; optional.
- Returns
save generated data to out dir and provide report in html format.
-
load
(data_input, config=None, model=None)[source]¶ Load data input, config file and FlintModel.
- Parameters
data_input (dict|list|string) – json object or json/csv file
config (string|textflint.Config) – json file or Config object
model (textflint.FlintModel) – model wrapper which implements the FlintModel abstract methods; optional.
- Returns
textflint.Dataset, textflint.Config, textflint.FlintModel
-
generate
(dataset, config, model=None)[source]¶ Generate new samples according to the given config, save the results as json files to the output path, and evaluate model performance automatically if a model is provided.
- Parameters
dataset (textflint.Dataset) – container of original samples.
config (textflint.Config) – config instance to control procedure.
model (textflint.FlintModel) – model wrapper which implements the FlintModel abstract methods; optional.
- Returns
save generated samples to json file.
-
report
(evaluate_result)[source]¶ Automatically analyze the model robustness verification results and plot the robustness evaluation report.
- Parameters
evaluate_result (dict) – json object contains robustness evaluation result and other additional information.
- Returns
opens an html robustness report.
-
-
class
textflint.
Generator
(task='UT', max_trans=1, random_seed=1, fields='x', trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]¶ Bases:
abc.ABC
Transformation controller which applies multi transformations to each data sample.
-
__init__
(task='UT', max_trans=1, random_seed=1, fields='x', trans_methods=None, trans_config=None, return_unk=True, sub_methods=None, sub_config=None, attack_methods=None, validate_methods=None, **kwargs)[source]¶ - Parameters
task (str) – the task your data belongs to.
max_trans (int) – maximum number of transformed samples generated from one original sample per Transformation.
random_seed (int) – random number seed to reproduce generation.
fields (str|list) – the fields to apply transformations to. Multi-field transformation is only for special tasks such as SM and NLI.
trans_methods (list) – list of transformations’ name.
trans_config (dict) – transformation class configs, useful to control the behavior of transformations.
return_unk (bool) – Some transformations may generate unk labels, e.g. inserting a word into a sequence in an NER task. If set to False, such transformations are skipped.
sub_methods (list) – list of subpopulations’ name.
sub_config (dict) – subpopulation class configs, useful to control the behavior of subpopulation.
attack_methods (str) – path to the python file containing the Attack instances.
validate_methods (list) – confidence calculate functions.
-
prepare
(dataset)[source]¶ Check dataset
- Parameters
dataset (textflint.Dataset) – the input dataset
-
generate
(dataset, model=None)[source]¶ Returns a list of possible generated samples for
dataset
.
- Parameters
dataset (textflint.Dataset) – the input dataset
model (textflint.FlintModel) – the model to attack if given.
- Returns
yield (original samples, new samples, generated function string).
-
generate_by_transformations
(dataset, **kwargs)[source]¶ Generate samples by a list of transformation methods.
- Parameters
dataset – the input dataset
- Returns
(original samples, new samples, generated function string)
-
-
class
textflint.
Validator
(origin_dataset, trans_dataset, fields, need_tokens=False)[source]¶ Bases:
abc.ABC
- An abstract class that computes the semantic similarity score between
original texts and adversarial texts.
- Parameters
origin_dataset (dataset) – the dataset of original samples
trans_dataset (dataset) – the dataset of transformed samples
fields (str|list) – the name(s) of the original field(s) to compare.
need_tokens (bool) – whether the sentences need to be tokenized
-
abstract
validate
(transformed_text, reference_text)[source]¶ Calculate the score
- Parameters
transformed_text (str) – transformed sentence
reference_text (str) – origin sentence
- Return float
the score of the two sentences
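As a toy illustration of implementing the abstract validate, the sketch below scores two sentences by word-level Jaccard overlap. This is an illustrative stand-in, not one of textflint's built-in validators:

```python
def jaccard_validate(transformed_text, reference_text):
    # Word-level Jaccard similarity in [0, 1] between the two sentences.
    a = set(transformed_text.lower().split())
    b = set(reference_text.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

print(jaccard_validate("the quick brown fox", "the quick red fox"))
# -> 0.6
```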
-
property
score
¶ Calculate the scores of the transformed sentences
- Return list
a list of scores for the transformed sentences
-
class
textflint.
ABSASample
(data, trans_id=None, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
ABSASample Class
-
load
(data)[source]¶ Load the legal data and convert it into ABSASample.
- Parameters
data (dict) – the dict obj that contains data info
-
is_legal
()[source]¶ Check whether aspect words and opinion words are in the correct positions.
- Return bool
whether format of data is legal.
-
update_sentence
(trans_sentence)[source]¶ Update the sentence of ABSASample.
- Parameters
trans_sentence (str|list) – updated sentence
-
update_terms
(trans_terms)[source]¶ Update the terms of ABSASample.
- Parameters
trans_terms (dict) – updated terms
-
update_term_list
(sample)[source]¶ Update the term_list of ABSASample.
- Parameters
sample (ABSAsample) – updated sample
-
insert_field_before_indices
(field, indices, items)[source]¶ Insert items of multi given scopes before indices of field value at the same time.
- Parameters
field (str) – transformed field
indices (list) – indices of insert positions
items (list) – insert items
- Return ~textflint.ABSAsample
modified sample
-
insert_field_before_index
(field, ins_index, new_item)[source]¶ Insert items before the given index of the field value.
- Parameters
field (str) – transformed field
ins_index (int|list) – index of insert position
new_item (str|list) – insert item
- Return ~textflint.ABSAsample
modified sample
-
insert_field_after_indices
(field, indices, items)[source]¶ Insert items of multi given scopes after indices of field value at the same time.
- Parameters
field (str) – transformed field
indices (list) – indices of insert positions
items (list) – insert items
- Return ABSAsample
modified sample
-
insert_field_after_index
(field, ins_index, new_item)[source]¶ Insert items after the given index of the field value.
- Parameters
field (str) – transformed field
ins_index (int|list) – index of insert position
new_item (str|list) – insert item
- Return ~textflint.ABSAsample
modified sample
-
-
class
textflint.
CWSSample
(data, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
Our segmentation rules are based on CTB6.
The input x can be a list or a sentence. The input y is the segmentation label, whose values include B, M, E and S. y can also be generated automatically: pass an empty list, and make sure each word in x is separated by a space or split into a separate element of the list.
Note that punctuation should be separated into a single word.
Example:
1. input {'x': '小明好想送Jo圣诞礼物', 'y': ['B', 'E', 'B', 'E', 'S', 'B', 'E', 'B', 'E', 'B', 'E']}
2. input {'x': ['小明', '好想送Jo圣诞礼物'], 'y': ['B', 'E', 'B', 'E', 'S', 'B', 'E', 'B', 'E', 'B', 'E']}
3. input {'x': '小明 好想 送 Jo 圣诞 礼物', 'y': []}
4. input {'x': ['小明', '好想', '送', 'Jo', '圣诞', '礼物'], 'y': []}
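The automatic label generation for the space-separated or pre-split formats can be sketched as follows; this reproduces the labels in the example above but is an illustration, not textflint's implementation:

```python
def bmes_labels(words):
    # Single-character word -> S; otherwise B, then M..., then E.
    labels = []
    for w in words:
        if len(w) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return labels

print(bmes_labels(["小明", "好想", "送", "Jo", "圣诞", "礼物"]))
# -> ['B', 'E', 'B', 'E', 'S', 'B', 'E', 'B', 'E', 'B', 'E']
```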
-
__init__
(data, origin=None, sample_id=None)[source]¶ - Parameters
data (dict) – The dict obj that contains data info
sample_id (int) – the id of sample
origin (bool) – whether the sample is an original sample
-
check_data
(data)[source]¶ Check whether the data is legitimate; note that label correctness is not checked. If the data is not legal but is in an acceptable format, change the format of the data.
- Parameters
data (dict) – The dict obj that contains data info
-
load
(data)[source]¶ Convert data dict which contains essential information to CWSSample.
- Parameters
data (dict) – The dict obj that contains data info
-
replace_at_ranges
(indices, new_items, y_new_items=None)[source]¶ Replace words at indices and set their mask to MODIFIED_MASK.
- Parameters
indices (list) – the list of positions to be changed.
new_items (list) – the list of new items.
y_new_items (list) – the list of new label info.
- Returns
replaced CWSSample object.
-
update
(x, y)[source]¶ Update the sample with a new sentence and new labels, and return a new CWSSample object.
- Parameters
x (str) – the new sentence.
y (list) – the new labels.
- Returns
new CWSSample object.
-
check
(indices, new_items, y_new_items=None)[source]¶ Check whether the positions to change are legal.
- Parameters
indices (list) – the list of positions to be changed.
new_items (list) – the list of new items.
y_new_items (list) – the list of new label info.
- Return three lists
legal positions, changed items, changed labels.
-
-
class
textflint.
CorefSample
(data, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
Coref Sample
-
check_data
(data)[source]¶ Check if data is a conll-dict and is ready to be predicted.
- Parameters
data (None|dict) – must have keys: sentences, clusters; may have keys: doc_key, speakers, constituents, ner
- Returns
-
load
(data)[source]¶ Convert a conll-dict to CorefSample.
- Parameters
data (None|dict) – None, or a conll-style dict; must have keys: sentences, clusters; may have keys: doc_key, speakers, constituents, ner
- Returns
-
dump
(with_check=True)[source]¶ Dump a CorefSample to a conll-dict.
- Parameters
with_check (bool) – whether the dumped conll-dict should be checked
- Return dict ret_dict
a conll-style dict
-
pretty_print
(show='Sample:')[source]¶ A pretty-printer for CorefSample. Print useful sample information by calling this function.
- Parameters
show (str) – optional, the welcome information of printing this sample
-
num_sentences
()[source]¶ The number of sentences in this sample.
- Return int
the number of sentences in this sample
-
get_kth_sen
(k)[source]¶ get the kth sen as a word list
- Parameters
k (int) – sen id
- Return list
kth sen, word list
-
eqlen_sen_map
()[source]¶ Generate an equal-length sentence map from self.sen_map; e.g. self.sen_map = [2, 3, 2] yields [0, 0, 1, 1, 1, 2, 2].
- Return list
sentence mapping with length equal to x, like [0, 0, 1, 1, 1, 2, 2]
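The mapping can be sketched in one line of plain Python (an illustration of the documented behavior):

```python
def eqlen_sen_map(sen_map):
    # [2, 3, 2] -> [0, 0, 1, 1, 1, 2, 2]: one sentence id per word position.
    return [sen_idx for sen_idx, sen_len in enumerate(sen_map) for _ in range(sen_len)]

print(eqlen_sen_map([2, 3, 2]))
# -> [0, 0, 1, 1, 1, 2, 2]
```

Note that index_in_sen can then be read off this list directly, as eqlen_sen_map(sen_map)[idx].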
-
index_in_sen
(idx)[source]¶ For the given word idx, determine which sen it is in.
- Parameters
idx (int) – word idx
- Return int
sen_idx, which sentence is word idx in
-
static
sens2doc
(sens)[source]¶ Given a 2-d list of str (a list of word lists), concatenate it and record the length of each sentence.
- Parameters
sens (list) – 2nd list of str (word list list)
- Returns (list, list)
x as list of str (word list), sen_map as list of int (sen len list)
-
static
doc2sens
(x, sen_map)[source]¶ Given x and sen_map, return sens. Inverse to sens2doc.
- Parameters
x (list) – list of str (word list)
sen_map (list) – list of int (sen len list)
- Return list
sens as a 2-d list of str (a list of word lists)
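The sens2doc / doc2sens pair can be sketched in plain Python as mutual inverses (an illustration of the documented behavior, not textflint's code):

```python
def sens2doc(sens):
    # Word-list-list -> flat word list plus per-sentence lengths.
    x = [w for sen in sens for w in sen]
    sen_map = [len(sen) for sen in sens]
    return x, sen_map

def doc2sens(x, sen_map):
    # Inverse of sens2doc: cut the flat list back into sentences.
    sens, start = [], 0
    for n in sen_map:
        sens.append(x[start:start + n])
        start += n
    return sens

sens = [["Hello", "world"], ["How", "are", "you"]]
x, sen_map = sens2doc(sens)
print(x, sen_map)   # -> ['Hello', 'world', 'How', 'are', 'you'] [2, 3]
assert doc2sens(x, sen_map) == sens
```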
-
insert_field_before_indices
(field, indices, items)[source]¶ Insert items of given scopes before indices of field value simultaneously.
- Parameters
field (str) – transformed field
indices (list) – indices of insert positions
items (list) – insert items
- Return ~textflint.CorefSample
modified sample
-
insert_field_after_indices
(field, indices, items)[source]¶ Insert items of given scopes after indices of field value simultaneously.
- Parameters
field (str) – transformed field
indices (list) – indices of insert positions
items (list) – insert items
- Return ~textflint.CorefSample
modified sample
-
delete_field_at_indices
(field, indices)[source]¶ Delete items of given scopes of field value.
- Parameters
field (str) – transformed field
indices (list) – indices of delete positions
- Return ~textflint.CorefSample
modified sample
-
replace_field_at_indices
(field, indices, items)[source]¶ Replace scope items of field value with items.
- Parameters
field (str) – transformed field
indices (list) – indices of replace positions
items (list) – insert items
- Return ~textflint.CorefSample
modified sample
-
static
concat_conlls
(*args)[source]¶ Given several CorefSamples, concat the values key by key.
- Param
Some CorefSamples
- Return ~textflint.input_layer.component.sample.CorefSample
a CorefSample, with the docs concatenated to form one x
-
shuffle_conll
(sen_idxs)[source]¶ Given a CorefSample and shuffled sentence indexes, reproduce a CorefSample with respect to the indexes.
- Parameters
sen_idxs (list) – a list of ints: the indexes in shuffled order, e.g. [1, 3, 0, 4, 2, 5] when sen_num = 6
- Return ~textflint.input_layer.component.sample.CorefSample
a CorefSample with respect to the shuffled index
-
part_conll
(pres_idxs)[source]¶ Only sentences with the given indexes are kept; the structures of all clusters are kept for convenience of concatenation.
- Parameters
pres_idxs (list) – a list of ints: the indexes to be preserved. pres_idxs is expected to be drawn from [0..num_sen] in ascending order, like [0, 1, 3, 5] when num_sen = 6
- Return ~textflint.input_layer.component.sample.CorefSample
a CorefPartSample of a conll-part
-
part_before_conll
(sen_idx)[source]¶ Only sentences [0, sen_idx) are kept; the structures of all clusters are kept for convenience of concatenation.
- Parameters
sen_idx (int) – sentences with idx < sen_idx will be preserved
- Return ~textflint.input_layer.component.sample.CorefSample
a CorefPartSample of a conll-part
-
part_after_conll
(sen_idx)[source]¶ Only sentences [sen_idx:] are kept; the structures of all clusters are kept for convenience of concatenation.
- Parameters
sen_idx (int) – sentences with idx >= sen_idx will be preserved
- Return ~textflint.input_layer.component.sample.CorefSample
a CorefPartSample of a conll-part
-
-
class
textflint.
DPSample
(data, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
DP Sample class to hold the data info and provide atomic operations.
-
load
(data)[source]¶ Convert data dict to DPSample and get matched brackets.
- Parameters
data (dict) – contains ‘word’, ‘postag’, ‘head’, ‘deprel’ keys.
-
insert_field_after_indices
(field, indices, items)[source]¶ Insert items after multiple given indices of the field value at the same time.
- Parameters
field (str) – Only value ‘x’ supported.
indices (list) – shape:indices_num
items (list) – shape: indices_num, correspond to indices
- Return ~DPSample
The sentence with words added.
-
insert_field_after_index
(field, ins_index, new_item)[source]¶ Insert given data after the given index.
- Parameters
field (str) – Only value ‘x’ supported.
ins_index (int) – the index after which the word will be inserted.
new_item (str) – The word to be inserted.
- Return ~DPSample
The sentence with one word added.
-
insert_field_before_indices
(field, indices, items)[source]¶ Insert items of multi given scopes before indices of field value at the same time.
- Parameters
field (str) – Only value ‘x’ supported.
indices (list) – shape:indices_num
items (list) – shape: indices_num, correspond to indices
- Return ~DPSample
The sentence with words added.
-
insert_field_before_index
(field, ins_index, new_item)[source]¶ Insert given data before the given position.
- Parameters
field (str) – Only value ‘x’ supported.
ins_index (int) – the index before which the word will be inserted.
new_item (str) – The word to be inserted.
- Return ~DPSample
The sentence with one word added.
-
delete_field_at_indices
(field, indices)[source]¶ Delete items in the given scopes of the field value.
- Parameters
field (str) – Only value 'x' supported.
indices (list) –
shape: indices_num; each index can be an int (delete a single item), a list of positions like [1, 2, 3], a tuple like (0, 3) (delete items from 0 to 3, 3 not included), or a slice, which will be converted to a list.
- Return ~DPSample
The sentence with words deleted.
-
delete_field_at_index
(field, del_index)[source]¶ Delete data at the given position.
- Parameters
field (str) – Only value ‘x’ supported.
del_index (int|list|slice) –
can be an int (delete a single item), a list of positions like [1, 2, 3], a tuple like (0, 3) (delete items from 0 to 3, 3 not included), or a slice, which will be converted to a list.
- Return ~DPSample
The sentence with one word deleted.
-
-
class
textflint.
MRCSample
(data, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
MRC Sample class to hold the mrc data info and provide atomic operations.
-
STEMMER
= <LancasterStemmer>¶
-
wn
= <WordNetCorpusReader in '/home/docs/.cache/textflint/NLTK_DATA/wordnet'>¶
-
POS_TO_WORDNET
= {'JJ': 'a', 'JJR': 'a', 'JJS': 'a', 'NN': 'n'}¶
-
__init__
(data, origin=None, sample_id=None)[source]¶ The sample object for the machine reading comprehension task.
- Parameters
data (dict) – The dict obj that contains data info.
origin (bool) –
sample_id (int) – sample index
-
check_data
(data)[source]¶ Check whether the input data is legal.
- Parameters
data (dict) – dict obj that contains data info
-
static
convert_idx
(text, tokens)[source]¶ Get the start and end character idx of tokens in the context
- Parameters
text (str) – context text
tokens (list) – context words
- Returns
list of spans
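The span computation can be sketched in plain Python, assuming each token occurs left-to-right in the context; this is an illustration of the documented behavior, not textflint's implementation:

```python
def convert_idx(text, tokens):
    # Find each token left-to-right and record its (start, end) char span.
    spans, current = [], 0
    for token in tokens:
        start = text.find(token, current)
        spans.append((start, start + len(token)))
        current = start + len(token)
    return spans

print(convert_idx("a cat sat", ["a", "cat", "sat"]))
# -> [(0, 1), (2, 5), (6, 9)]
```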
-
load_answers
(ans, spans)[source]¶ Get word-level positions of answers
- Parameters
ans (dict) – answers dict with character position and text
spans (list) – the start idx and end idx of tokens
-
load
(data)[source]¶ Convert data dict which contains essential information to MRCSample.
- Parameters
data (dict) – the dict obj that contains dict info
-
dump
()[source]¶ Convert the MRCSample back to a data dict containing the essential information.
- Returns
dict object
-
delete_field_at_index
(field, index)[source]¶ Delete the word at the given index.
- Parameters
field (str) – field name
index (int|list|slice) – modified scope
- Returns
modified sample
-
delete_field_at_indices
(field, indices)[source]¶ Delete items of given scopes of field value.
- Parameters
field (str) – field name
indices (list) – list of int/list/slice, modified scopes
- Returns
modified Sample
-
insert_field_before_indices
(field, indices, items)[source]¶ Insert items of multi given scopes before indices of field value at the same time.
- Parameters
field (str) – field name
indices (list) – list of int/list/slice, modified scopes
items (list) – inserted items
- Returns
modified Sample
-
insert_field_before_index
(field, index, items)[source]¶ Insert item before index of field value.
- Parameters
field (str) – field name
index (int) – modified scope
items – inserted item
- Returns
modified Sample
-
insert_field_after_index
(field, index, new_item)[source]¶ Insert item after index of field value.
- Parameters
field (str) – field name
index (int) – modified scope
new_item – inserted item
- Returns
modified Sample
-
insert_field_after_indices
(field, indices, items)[source]¶ Insert items of multi given scopes after indices of field value at the same time.
- Parameters
field (str) – field name
indices (list) – list of int/list/slice, modified scopes
items (list) – inserted items
- Returns
modified Sample
-
unequal_replace_field_at_indices
(field, indices, rep_items)[source]¶ Replace scope items of field value with rep_items, whose length may differ from that of the scope.
- Parameters
field (str) – field name
indices (list) – list of int/list/slice, modified scopes
rep_items (list) – replace items
- Returns
modified sample
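The core splice behind an unequal-length replacement can be sketched like this (an illustrative helper, not the library implementation, assuming a (start, end) scope with end exclusive):

```python
def unequal_replace(tokens, scope, rep_items):
    """Replace tokens[start:end] with rep_items of any length."""
    start, end = scope
    return tokens[:start] + list(rep_items) + tokens[end:]
```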
-
static
get_answer_position
(spans, answer_start, answer_end)[source]¶ Get the start and end token positions of the answer
-
static
run_conversion
(question, answer, tokens, const_parse)[source]¶ Convert the question and answer to a declarative sentence
- Parameters
question (str) – question
answer (str) – answer
tokens (list) – the semantic tag dicts of question
const_parse – the constituency parse of question
- Returns
a declarative sentence
-
convert_answer
(answer, sent_tokens, question)[source]¶ Replace the ground truth with a fake answer based on specific rules
- Parameters
answer (str) – ground truth, str
sent_tokens (list) – sentence dicts, like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]
question (str) – question sentence
- Return str
fake answer
-
static
alter_sentence
(sample, nearby_word_dict=None, pos_tag_dict=None, rules=None)[source]¶ - Parameters
sample – sentence dicts, like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]
nearby_word_dict – the dictionary to search for nearby words
pos_tag_dict – the dictionary to search for the most frequent pos tags
rules – the rules to alter the sentence
- Returns
alter_sentence, alter_sentence dicts
-
static
alter_special
(token, **kwargs)[source]¶ Alter special tokens
- Parameters
token – the token to alter
kwargs –
- Returns
like ‘US’ -> ‘UK’
-
static
alter_wordnet_antonyms
(token, **kwargs)[source]¶ Replace words with WordNet antonyms
- Parameters
token – the token to replace
kwargs –
- Returns
like good -> bad
-
static
alter_wordnet_synonyms
(token, **kwargs)[source]¶ Replace words with synonyms
- Parameters
token – the token to replace
kwargs –
- Returns
like good -> great
-
static
alter_nearby
(pos_list, ignore_pos=False, is_ner=False)[source]¶ Alter words based on the GloVe embedding space
- Parameters
pos_list – pos tags list
ignore_pos (bool) – whether to match pos tag
is_ner (bool) – indicate ner
- Returns
like ‘Mary’ -> ‘Rose’
-
static
alter_entity_type
(token, **kwargs)[source]¶ Alter entity
- Parameters
token – the word to replace
kwargs –
- Returns
like ‘London’ -> ‘Berlin’
-
static
get_answer_tokens
(sent_tokens, answer)[source]¶ Extract the pos, ner, lemma tags of answer tokens
- Parameters
sent_tokens (list) – a list of dicts
answer (str) – answer
- Returns
a list of dicts like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}, {‘word’: ‘Bernadette’, ‘pos’: ‘NNP’, ‘lemma’: ‘Bernadette’, …}, {‘word’: ‘Soubirous’, ‘pos’: ‘NNP’, ‘lemma’: ‘Soubirous’, …}]
-
static
ans_entity_full
(ner_tag, new_ans)[source]¶ Returns a function that yields new_ans iff every token has |ner_tag|
- Parameters
ner_tag (str) – ner tag
new_ans (list) – like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]
- Returns
fake answer, str
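The closure pattern this describes can be sketched as follows (a hypothetical reimplementation; the matcher signature is an assumption):

```python
def ans_entity_full(ner_tag, new_ans):
    """Build a matcher that returns new_ans only when every answer
    token carries ner_tag; otherwise it returns None."""
    def matcher(answer_tokens, question):
        if all(tok["ner"] == ner_tag for tok in answer_tokens):
            return new_ans
        return None
    return matcher
```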
-
static
ans_match_wh
(wh_word, new_ans)[source]¶ Returns a function that yields new_ans if the question starts with |wh_word|
- Parameters
wh_word (str) – question word
new_ans (list) – like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]
- Return str
fake answer
-
static
ans_pos
(pos, new_ans, end=False, add_dt=False)[source]¶ Returns a function that yields new_ans if the first/last token has |pos|
- Parameters
pos (str) – pos tag
new_ans (list) – like [{‘word’: ‘Saint’, ‘pos’: ‘NNP’, ‘lemma’: ‘Saint’, ‘ner’: ‘PERSON’}…]
end (bool) – whether to use the last word to match the pos tag
add_dt (bool) – whether to add a determiner
- Return str
fake answer
-
-
class
textflint.
NERSample
(data, origin=None, sample_id=None, mode='BIO')[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
NER Sample class to hold the necessary info and provide atomic operations.
-
__init__
(data, origin=None, sample_id=None, mode='BIO')[source]¶ - Parameters
data (dict) – The dict obj that contains data info
origin (~BaseSample) – Original sample obj
sample_id (int) – the id of sample
mode (str) – The sequence labeling mode for NER samples.
-
dump
()[source]¶ Convert sample info to input data json format.
- Return json
the dict of sentences and labels
-
delete_field_at_indices
(field, indices)[source]¶ Delete tokens and their NER tag.
- Parameters
field (str) – field str
indices (list) –
list of int/list/slice, shape: indices_num. Each index can be an int indicating a single item to delete, a pair like (0, 3) indicating items from 0 to 3 (not included), or a slice, which would be converted to a list.
- Returns
Modified NERSample.
-
delete_field_at_index
(field, index)[source]¶ Delete tokens and their NER tag.
- Parameters
field (str) – field string, normally ‘x’
index (int|list|slice) – can be an int indicating a single item to delete, a pair like (0, 3) indicating items from 0 to 3 (not included), or a slice, which would be converted to a list
- Returns
Modified NERSample
-
insert_field_before_indices
(field, indices, items)[source]¶ Insert tokens and NER tags. Assumes the tag of each new item is O.
- Parameters
field (str) – field string
indices (list) – list of int, shape: indices_num, like [1, 2, 3]
items (list) – list of str/list, shape: indices_num, corresponding to indices
- Returns
Modified NERSample
-
insert_field_before_index
(field, ins_index, new_item)[source]¶ Insert tokens and NER tags. Assumes the tag of new_item is O.
- Parameters
field (str) – field str
ins_index (int) – indicate which index to insert items
new_item (str/list) – items to insert
- Returns
Modified NERSample
-
insert_field_after_indices
(field, indices, items)[source]¶ Insert tokens and NER tags. Assumes the tag of each new item is O.
- Parameters
field (str) – field string
indices (list) – list of int shape:indices_num, like [1, 2, 3]
items (list) – list of str/list, shape: indices_num, corresponding to indices
- Returns
Modified NERSample
-
insert_field_after_index
(field, ins_index, new_item)[source]¶ Insert tokens and NER tags. Assumes the tag of new_item is O.
- Parameters
field (str) – field string
ins_index (int) – indicate where to apply insert
new_item (str|list) – the item to insert
- Returns
Modified NERSample
-
find_entities_BIO
(word_seq, tag_seq)[source]¶ Find entities in a sentence with BIO labels.
- Parameters
word_seq (list) – a list of tokens representing a sentence
tag_seq (list) – a list of tags representing a tag sequence labeling the sentence
- Return list entity_in_seq
a list of entities found in the sequence, including the information of the start position & end position in the sentence, the category, and the entity itself.
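The BIO scan can be sketched as a standalone function (an illustrative reimplementation, not textflint's code; the dict keys are assumptions based on the return description above):

```python
def find_entities_bio(word_seq, tag_seq):
    """Collect entities from a BIO-tagged token sequence."""
    entities, start, category = [], None, None

    def close(end):
        entities.append({"start": start, "end": end,
                         "entity": " ".join(word_seq[start:end + 1]),
                         "type": category})

    for i, tag in enumerate(tag_seq):
        if tag.startswith("B-"):
            if start is not None:
                close(i - 1)          # an entity ends right before a new B- tag
            start, category = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == category:
            continue                  # still inside the current entity
        else:
            if start is not None:
                close(i - 1)          # O tag (or mismatched I-) ends the entity
            start, category = None, None
    if start is not None:
        close(len(tag_seq) - 1)       # entity running to the end of the sentence
    return entities
```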
-
find_entities_BIOES
(word_seq, tag_seq)[source]¶ Find entities in a sentence with BIOES labels.
- Parameters
word_seq (list) – a list of tokens representing a sentence
tag_seq (list) – a list of tags representing a tag sequence labeling the sentence
- Return list entity_in_seq
a list of entities found in the sequence, including the information of the start position & end position in the sentence, the category, and the entity itself.
-
entities_replace
(entities_info, candidates)[source]¶ Replace multiple entities at once. Assumes input entities are in reverse sequential order.
- Parameters
entities_info (list) – list of entity_info
candidates (list) – candidate entities
- Returns
Modified NERSample
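Why reverse order matters can be seen in a minimal sketch (hypothetical helper; the (start, end, type) tuple shape is an assumption): splicing later spans first keeps earlier spans' indices valid even when replacements change length.

```python
def entities_replace(tokens, tags, entities_info, candidates):
    """Splice candidate entities into token/tag sequences.

    entities_info: (start, end, old_type) tuples in REVERSE sequential
    order, so earlier spans' indices stay valid while splicing.
    """
    for (start, end, _), (new_entity, label) in zip(entities_info, candidates):
        new_tokens = new_entity.split()
        new_tags = ["B-" + label] + ["I-" + label] * (len(new_tokens) - 1)
        tokens[start:end + 1] = new_tokens
        tags[start:end + 1] = new_tags
    return tokens, tags
```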
-
entity_replace
(start, end, entity, label)[source]¶ Replace one entity and update entities info.
- Parameters
start (int) – the start position of the entity to be replaced
end (int) – the end position of the entity to be replaced
entity (str) – the entity to be replaced with
label (str) – the category of the entity
- Returns
Modified NERSample
-
-
class
textflint.
POSSample
(data, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
POS Sample class to hold the necessary info and provide atomic operations.
-
class
textflint.
RESample
(data, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
Transform and retrieve features of RESample.
-
check_data
(data)[source]¶ Check whether the type of data is correct
- Parameters
data (dict) – data dict containing ‘x’, ‘subj’, ‘obj’ and ‘y’
-
load
(data)[source]¶ Convert data dict which contains essential information to RESample.
- Parameters
data (dict) – contains ‘token’, ‘subj’, ‘obj’, ‘relation’ keys
-
get_dp
()[source]¶ Get the dependency parse
- Return Tuple(list, list)
dependency tag of sentence and head of sentence
-
get_en
()[source]¶ Get entity indices
- Return Tuple(int, int, int, int)
start index of subject entity, end index of subject entity, start index of object entity and end index of object entity
-
get_type
()[source]¶ Get entity types
- Return Tuple(string, string)
entity type of subject and entity type of object
-
get_sent
()[source]¶ Get the tokenized sentence
- Return Tuple(list, string)
tokenized sentence and relation
-
delete_field_at_indices
(field, indices)[source]¶ Delete words at given indices in the sentence
- Parameters
field (string) – field to be operated on
indices (list) – a list of indices to delete
- Return dict
contains ‘token’, ‘subj’, ‘obj’ keys
-
insert_field_after_indices
(field, indices, new_item)[source]¶ Insert words after given indices in the sentence
- Parameters
field (string) – field to be operated on
indices (list) – a list of indices at which to insert
new_item (list) – list of items to be inserted
- Return dict
contains ‘token’, ‘subj’, ‘obj’ keys
-
insert_field_before_indices
(field, indices, new_item)[source]¶ Insert words before given indices in the sentence
- Parameters
field (string) – field to be operated on
indices (list) – a list of indices at which to insert
new_item (list) – list of items to be inserted
- Return dict
contains ‘token’, ‘subj’, ‘obj’ keys
-
-
class
textflint.
UTSample
(data, origin=None, sample_id=None)[source]¶ Bases:
textflint.input_layer.component.sample.sample.Sample
Universal Transformation sample.
Universal Transformation is not an NLP subtask; it is implemented to provide task-agnostic text transformation functions.
-
textflint.
auto_config
(task='UT', config=None)[source]¶ Check config input or create config automatically.
- Parameters
task (str) – task name
config (str|dict|textflint.config.Config) – config to control generation procedure.
- Returns
textflint.config.Config instance.
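The normalization this helper performs might look roughly like the following sketch (hypothetical; the real function returns a textflint.config.Config instance, while this toy version returns a plain dict for illustration):

```python
import json

def auto_config(task="UT", config=None):
    """Normalize None / dict / JSON-file-path config input into one dict."""
    if config is None:
        return {"task": task}                      # build a default config
    if isinstance(config, dict):
        return {"task": task, **config}            # merge user overrides
    if isinstance(config, str):
        with open(config) as f:                    # treat str as a JSON path
            return {"task": task, **json.load(f)}
    raise TypeError("config must be None, a dict, or a JSON file path")
```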
-
textflint.
auto_dataset
(data_input=None, task='UT')[source]¶ Create Dataset instance and load data input automatically.
- Parameters
data_input (dict|list|string) – json object or json/csv file.
task (str) – task name.
- Returns
textflint.Dataset instance.
-
textflint.
auto_flintmodel
(model, task)[source]¶ Check flint model type and whether compatible to task.
- Parameters
model (textflint.FlintModel|str) – FlintModel instance or python file path which contains FlintModel instance
task (str) – task name
- Returns
textflint.FlintModel
-
textflint.
auto_generator
(config_obj)[source]¶ Automatically create a task generator to apply transformations, subpopulations and adversarial attacks.
- Parameters
config_obj (textflint.Config) – Config instance.
- Returns
textflint.Generator
-
textflint.
auto_report_generator
()[source]¶ Return a ReportGenerator instance.
- Returns
ReportGenerator