textflint.generation_layer.validator.validator

Constraint Class

class textflint.generation_layer.validator.validator.Validator(origin_dataset, trans_dataset, fields, need_tokens=False)[source]

Bases: abc.ABC

An abstract class that computes the semantic similarity score between the original texts and the adversarial texts.

Parameters
  • origin_dataset (Dataset) – the dataset of original samples

  • trans_dataset (Dataset) – the dataset of transformed samples

  • fields (str|list) – the name(s) of the origin field(s) to compare.

  • need_tokens (bool) – whether to tokenize the sentences

abstract validate(transformed_text, reference_text)[source]

Calculate the score

Parameters
  • transformed_text (str) – transformed sentence

  • reference_text (str) – origin sentence

Return float

the similarity score of the two sentences

check_data()[source]

Check whether the input data is valid.

property score

Calculate the scores of the transformed sentences.

Return list

a list of scores, one per transformed sentence
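A minimal sketch of a concrete subclass is shown below. Only the abstract validate method has to be implemented; the word-overlap (Jaccard) metric used here is purely illustrative and is not one of textflint's built-in validators.

from textflint.generation_layer.validator.validator import Validator


class OverlapValidator(Validator):
    # Scores a transformed sentence by word overlap with its original.

    def validate(self, transformed_text, reference_text):
        # Higher score = transformed sentence stays closer to the original.
        trans_tokens = set(transformed_text.lower().split())
        ref_tokens = set(reference_text.lower().split())
        if not trans_tokens or not ref_tokens:
            return 0.0
        return len(trans_tokens & ref_tokens) / len(trans_tokens | ref_tokens)

Given an origin_dataset and a trans_dataset built with the Dataset class documented below, OverlapValidator(origin_dataset, trans_dataset, fields='x').score should then return one score per transformed sample, as described for the score property.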

class textflint.generation_layer.validator.validator.ABC[source]

Bases: object

Helper class that provides a standard way to create an ABC using inheritance.

class textflint.generation_layer.validator.validator.Dataset(task='UT')[source]

Bases: object

Any iterable of (label, text_input) pairs qualifies as a Dataset.

__init__(task='UT')[source]
Parameters

task (str) – indicates the data sample format.

free()[source]

Fully clear dataset.

dump()[source]

Return dataset in json object format.

load(dataset)[source]

Loads json object and prepares it as a Dataset.

Two input formats are supported. Example:

  1. Dict of lists:

     {'x': ['The robustness of deep neural networks has received much attention recently',
            'We focus on certified robustness of smoothed classifiers in this work',
            ...,
            'our approach exceeds the state-of-the-art.'],
      'y': ['neural', 'positive', ..., 'positive']}

  2. List of dicts:

     [{'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neural'},
      {'x': 'We focus on certified robustness of smoothed classifiers in this work', 'y': 'positive'},
      ...,
      {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}]

Parameters

dataset (list|dict) –

Returns
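As a short usage sketch (assuming the default 'UT' task, whose samples carry the 'x'/'y' fields used in the example above), both documented formats load the same data:

from textflint.generation_layer.validator.validator import Dataset

columnar = {'x': ['The robustness of deep neural networks has received much attention recently',
                  'our approach exceeds the state-of-the-art.'],
            'y': ['neural', 'positive']}
row_wise = [{'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neural'},
            {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}]

ds_a = Dataset(task='UT')
ds_a.load(columnar)    # dict-of-lists input

ds_b = Dataset(task='UT')
ds_b.load(row_wise)    # list-of-dicts input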

load_json(json_path, encoding='utf-8', fields=None, dropna=True)[source]

Loads json file, each line of the file is a json string.

Parameters
  • json_path – file path

  • encoding – file’s encoding, default: utf-8

  • fields – the json object fields to keep; if None, all fields are kept. default: None

  • dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True

Returns

load_csv(csv_path, encoding='utf-8', headers=None, sep=',', dropna=True)[source]

Loads csv file, where each line corresponds to one sample.

Parameters
  • csv_path – file path

  • encoding – file’s encoding, default: utf-8

  • headers – the file’s headers; if None, the file’s first line is used as headers. default: None

  • sep – separator for each column. default: ‘,’

  • dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True

Returns

load_hugging_face(name, subset='train')[source]

Loads a dataset from HuggingFace datasets and prepares it as a Dataset.

Parameters
  • name – the dataset name

  • subset – the subset of the main dataset.

Returns

append(data_sample, sample_id=-1)[source]

Load single data sample and append to dataset.

Parameters
  • data_sample (dict|sample) –

  • sample_id (int) – useful to identify sample, default -1

Returns

True / False indicating whether the append action was successful.

extend(data_samples)[source]

Load multiple data samples and extend the dataset.

Parameters

data_samples (list|dict|Sample) –

Returns
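A short sketch of building a dataset incrementally with append and extend (field names follow the 'x'/'y' convention of the examples above):

from textflint.generation_layer.validator.validator import Dataset

ds = Dataset(task='UT')

# append returns True / False depending on whether the sample was accepted
ok = ds.append({'x': 'We focus on certified robustness of smoothed classifiers', 'y': 'positive'},
               sample_id=0)

# extend accepts a list of dicts (or Sample objects) in one call
ds.extend([
    {'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neural'},
    {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'},
])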

static norm_input(data_samples)[source]

Convert various data inputs to a list of dicts. Example:

 {'x': [
          'The robustness of deep neural networks has received
          much attention recently',
          'We focus on certified robustness of smoothed classifiers
          in this work',
          ...,
          'our approach exceeds the state-of-the-art.'
      ],
 'y': [
          'neural',
          'positive',
          ...,
          'positive'
      ]
}
convert to
[
    {'x': 'The robustness of deep neural networks has received
    much attention recently', 'y': 'neural'},
    {'x': 'We focus on certified robustness of smoothed classifiers
    in this work', 'y': 'positive'},
    ...,
    {'x': 'our approach exceeds the state-of-the-art.',
    'y': 'positive'}
]
Parameters

data_samples (list|dict|Sample) –

Returns

Normalized data.

save_csv(out_path, encoding='utf-8', headers=None, sep=',')[source]

Save dataset to csv file.

Parameters
  • out_path – file path

  • encoding – file’s encoding, default: utf-8

  • headers – the file’s headers; if None, the file’s first line is used as headers. default: None

  • sep – separator for each column. default: ‘,’

Returns

save_json(out_path, encoding='utf-8', fields=None)[source]

Save dataset to a json file that contains one json object per line.

Parameters
  • out_path – file path

  • encoding – file’s encoding, default: utf-8

  • fields – the json object fields to keep; if None, all fields are kept. default: None

Returns
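A brief save/reload round trip using the I/O helpers above (the file paths are placeholders):

from textflint.generation_layer.validator.validator import Dataset

ds = Dataset(task='UT')
ds.load([{'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}])

ds.save_json('dataset.json')   # one json object per line
ds.save_csv('dataset.csv')     # csv export; the first line holds the headers

reloaded = Dataset(task='UT')
reloaded.load_json('dataset.json')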

class textflint.generation_layer.validator.validator.Sample(data, origin=None, sample_id=None)[source]

Bases: abc.ABC

Base Sample class to hold the necessary info and provide atomic operations

text_processor = <textflint.common.preprocess.en_processor.EnProcessor object>
__init__(data, origin=None, sample_id=None)[source]
Parameters
  • data (dict) – The dict obj that contains data info.

  • origin (sample) – original sample obj.

  • sample_id (int) – sample index

get_value(field)[source]

Get the field value by field name.

Parameters

field (str) – field name

Returns

field value

get_words(field)[source]

Get the tokenized words of the given text field.

Parameters

field (str) – field name

Returns

tokenized words

get_text(field)[source]

Get the text string of the given text field.

Parameters

field (str) – field name

Return string

text

get_mask(field)[source]

Get the word masks of the given text field.

Parameters

field (str) – field name

Returns

list of mask values

get_sentences(field)[source]

Get the split sentences of the given text field.

Parameters

field (str) – field name

Returns

list of sentences

get_pos(field)[source]

Get the pos tags of the given text field.

Parameters

field (str) – field name

Returns

pos tag list

get_ner(field)[source]

Get the ner tags of the given text field.

Parameters

field (str) – field name

Returns

ner tag list
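Sample itself is abstract, so the accessors above are normally used through a task-specific subclass. The sketch below assumes such a subclass (here SASample for sentiment analysis); the import path is an assumption and may differ between textflint versions.

# Hypothetical import path; concrete Sample subclasses live in the task layer.
from textflint.input_layer.component.sample import SASample

sample = SASample({'x': 'The movie is great', 'y': 'positive'})

sample.get_text('x')        # the raw text string
sample.get_words('x')       # tokenized words, e.g. ['The', 'movie', 'is', 'great']
sample.get_sentences('x')   # sentence-split text
sample.get_pos('x')         # POS tag list
sample.get_ner('x')         # NER tag list
sample.get_mask('x')        # per-word mask values (e.g. MODIFIED_MASK, see below)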

replace_fields(fields, field_values, field_masks=None)[source]

Fully replace multiple fields at the same time and return a new sample. Note: this API is not recommended, as it will set the mask values of the TextField to MODIFIED_MASK.

Parameters
  • fields (list) – field str list

  • field_values (list) – field value list

  • field_masks (list) – indicate mask values, useful for printable text

Returns

Modified Sample

replace_field(field, field_value, field_mask=None)[source]

Fully replace a single field and return a new sample. Note: this API is not recommended, as it will set the mask values of the TextField to MODIFIED_MASK.

Parameters
  • field (str) – field str

  • field_value – the new value of the field

  • field_mask (list) – indicate mask value of field

Returns

Modified Sample

replace_field_at_indices(field, indices, items)[source]

Replace items in multiple given scopes of the field value at the same time. This is a complex function; be careful with your input list shape.

Parameters
  • field (str) – field name

  • indices (list of int|list|slice) –

    each index can be an int (replace a single item), a list like [1, 2, 3], a pair like (0, 3) indicating the items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

  • items – the replacement items, corresponding to indices

Returns

Modified Sample

replace_field_at_index(field, index, items)[source]

Replace items in the given scope of the field value. Be careful with your input list shape.

Parameters
  • field (str) – field name

  • index (int|list|slice) –

    can be an int (replace a single item), a list like [1, 2, 3], a pair like (0, 3) indicating the items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

  • items (str|list) – shape: indices_num; corresponds to the field sub-items

Returns

Modified Sample
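To make the index semantics concrete, a small sketch continuing the hypothetical SASample from above, whose 'x' field tokenizes to ['The', 'movie', 'is', 'great']:

sample = SASample({'x': 'The movie is great', 'y': 'positive'})

# index as an int: replace the single token at position 3
s1 = sample.replace_field_at_index('x', 3, 'terrible')

# index as a (start, end) pair: replace tokens 0..2 (2 not included)
s2 = sample.replace_field_at_index('x', (0, 2), ['That', 'film'])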

unequal_replace_field_at_indices(field, indices, rep_items)[source]

Replace scope items of the field value with rep_items, which may differ in length from the scopes.

Parameters
  • field – field str

  • indices – list of int/tuple/list

  • rep_items – list

Returns

Modified Sample

delete_field_at_indices(field, indices)[source]

Delete items in the given scopes of the field value.

Parameters
  • field (str) – field name

  • indices (list of int|list|slice) –

    shape: indices_num; each index can be an int (delete a single item), a list like [1, 2, 3], a pair like (0, 3) indicating the items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

Returns

Modified Sample

delete_field_at_index(field, index)[source]

Delete items in the given scope of the field value.

Parameters
  • field (str) – field name

  • index (int|list|slice) –

    can be an int (delete a single item), a list like [1, 2, 3], a pair like (0, 3) indicating the items from 0 to 3 (3 not included), or a slice, which will be converted to a list.

Returns

Modified Sample

insert_field_before_indices(field, indices, items)[source]

Insert items at multiple given scopes before the corresponding indices of the field value at the same time. This is a complex function; be careful with your input list shape.

Parameters
  • field (str) – field name

  • indices – list of int, shape:indices_num, list like [1, 2, 3]

  • items – list of str/list, shape: indices_num, correspond to indices

Returns

Modified Sample

insert_field_before_index(field, index, items)[source]

Insert items before the given index of the field value.

Parameters
  • field (str) – field name

  • index (int) – indicate which index to insert items

  • items (str|list) – items to insert

Returns

Modified Sample

insert_field_after_indices(field, indices, items)[source]

Insert items at multiple given scopes after the corresponding indices of the field value at the same time. This is a complex function; be careful with your input list shape.

Parameters
  • field (str) – field name

  • indices – list of int, shape:indices_num, like [1, 2, 3]

  • items – list of str/list shape: indices_num, correspond to indices

Returns

Modified Sample

insert_field_after_index(field, index, items)[source]

Insert items after the given index of the field value.

Parameters
  • field (str) – field name

  • index (int) – indicate where to apply insert

  • items (str|list) – shape: indices_num, correspond to field_sub_items

Returns

Modified Sample

swap_field_at_index(field, first_index, second_index)[source]

Swap items between first_index and second_index of field value.

Parameters
  • field (str) – field name

  • first_index (int) –

  • second_index (int) –

Returns

Modified Sample
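The remaining atomic edits follow the same pattern; a short sketch with the same hypothetical sample as above:

# insert a single token after position 1 -> ['The', 'movie', 'really', 'is', 'great']
s1 = sample.insert_field_after_index('x', 1, 'really')

# insert before position 0
s2 = sample.insert_field_before_index('x', 0, 'Honestly')

# delete the token at position 1
s3 = sample.delete_field_at_index('x', 1)

# swap the tokens at positions 1 and 3
s4 = sample.swap_field_at_index('x', 1, 3)

Each call leaves sample untouched and returns a new, modified Sample.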

abstract check_data(data)[source]

Check the raw data format.

Parameters

data – raw data input

Returns

abstract load(data)[source]

Parse data into sample field value.

Parameters

data – raw data input

abstract dump()[source]

Convert sample info to input data json format.

Returns

dict object.

classmethod clone(original_sample)[source]

Deep copy the given sample into a new sample.

Parameters

original_sample – sample to be copied

Returns

Sample instance

property is_origin

Return whether the sample is an original Sample.

textflint.generation_layer.validator.validator.abstractmethod(funcobj)[source]

A decorator indicating abstract methods.

Requires that the metaclass is ABCMeta or derived from it. A class that has a metaclass derived from ABCMeta cannot be instantiated unless all of its abstract methods are overridden. The abstract methods can be called using any of the normal ‘super’ call mechanisms.

Usage:

    class C(metaclass=ABCMeta):
        @abstractmethod
        def my_abstract_method(self, ...):
            ...

textflint.generation_layer.validator.validator.read_csv(path, encoding='utf-8', headers=None, sep=',', dropna=True)[source]

Construct a generator to read csv items.

Parameters
  • path – file path

  • encoding – file’s encoding, default: utf-8

  • headers – the file’s headers; if None, the file’s first line is used as headers. default: None

  • sep – separator for each column. default: ‘,’

  • dropna – whether to ignore and drop invalid data, if False, raise ValueError when reading invalid data. default: True

Returns

a generator that yields (line number, csv item) pairs

textflint.generation_layer.validator.validator.read_json(path, encoding='utf-8', fields=None, dropna=True)[source]

Construct a generator to read json items.

Parameters
  • path – file path

  • encoding – file’s encoding, default: utf-8

  • fields – the json object fields to keep; if None, all fields are kept. default: None

  • dropna – whether to ignore and drop invalid data, if False, raise ValueError when reading invalid data. default: True

Returns

a generator that yields (line number, json item) pairs
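A sketch of consuming these generators directly (the file paths and the 'x'/'y' fields are placeholders):

from textflint.generation_layer.validator.validator import read_csv, read_json

for line_idx, obj in read_json('data.json', fields=['x', 'y']):
    print(line_idx, obj['x'])

for line_idx, row in read_csv('data.csv', headers=None, sep=','):
    print(line_idx, row)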

textflint.generation_layer.validator.validator.save_csv(json_list, out_path, encoding='utf-8', headers=None, sep=',')[source]

Save json list to csv file.

Parameters
  • json_list – list of dict

  • out_path – file path

  • encoding – file’s encoding, default: utf-8

  • headers – the file’s headers; if None, the file’s first line is used as headers. default: None

  • sep – separator for each column. default: ‘,’

Returns

textflint.generation_layer.validator.validator.save_json(json_list, out_path, encoding='utf-8', fields=None)[source]

Save a json list to a json file that contains one json object per line.

Parameters
  • json_list – list of dict

  • out_path – output path

  • encoding – file’s encoding, default: utf-8

  • fields – the json object fields to keep; if None, all fields are kept. default: None

Returns

class textflint.generation_layer.validator.validator.tqdm(*args, **kwargs)[source]

Bases: tqdm.utils.Comparable

Decorate an iterable object, returning an iterator which acts exactly like the original iterable, but prints a dynamically updating progressbar every time a value is requested.

monitor_interval = 10
static format_sizeof(num, suffix='', divisor=1000)[source]

Formats a number (greater than unity) with SI Order of Magnitude prefixes.

num : float

Number ( >= 1) to format.

suffix : str, optional

Post-postfix [default: ''].

divisor : float, optional

Divisor between prefixes [default: 1000].

out : str

Number with Order of Magnitude SI unit postfix.

static format_interval(t)[source]

Formats a number of seconds as a clock time, [H:]MM:SS

t : int

Number of seconds.

out : str

[H:]MM:SS

static format_num(n)[source]

Intelligent scientific notation (.3g).

n : int or float or Numeric

A Number.

out : str

Formatted number.
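A quick sketch of the static formatting helpers above; the exact output strings may vary slightly between tqdm versions.

from tqdm import tqdm

tqdm.format_interval(3661)                             # '1:01:01'
tqdm.format_sizeof(1048576, suffix='B', divisor=1024)  # '1.00MB'
tqdm.format_num(1234567)                               # '1.23e+6'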

static ema(x, mu=None, alpha=0.3)[source]

Exponential moving average: smoothing to give progressively lower weights to older values.

x : float

New value to include in EMA.

mu : float, optional

Previous EMA value.

alpha : float, optional

Smoothing factor in range [0, 1], [default: 0.3]. Increase to give more weight to recent values. Ranges from 0 (yields mu) to 1 (yields x).

static status_printer(file)[source]

Manage the printing and in-place updating of a line of characters. Note that if the string is longer than a line, then in-place updating may not work (it will print a new line at each refresh).

static format_meter(n, total, elapsed, ncols=None, prefix='', ascii=False, unit='it', unit_scale=False, rate=None, bar_format=None, postfix=None, unit_divisor=1000, initial=0, **extra_kwargs)[source]

Return a string-based progress bar given some parameters

n : int or float

Number of finished iterations.

total : int or float

The expected total number of iterations. If meaningless (None), only basic progress statistics are displayed (no ETA).

elapsed : float

Number of seconds passed since start.

ncols : int, optional

The width of the entire output message. If specified, dynamically resizes {bar} to stay within this bound [default: None]. If 0, will not print any bar (only stats). The fallback is {bar:10}.

prefix : str, optional

Prefix message (included in total width) [default: '']. Use as {desc} in bar_format string.

ascii : bool, optional or str, optional

If not set, use unicode (smooth blocks) to fill the meter [default: False]. The fallback is to use ASCII characters " 123456789#".

unit : str, optional

The iteration unit [default: 'it'].

unit_scale : bool or int or float, optional

If 1 or True, the number of iterations will be printed with an appropriate SI metric prefix (k = 10^3, M = 10^6, etc.) [default: False]. If any other non-zero number, will scale total and n.

rate : float, optional

Manual override for iteration rate. If [default: None], uses n/elapsed.

bar_format : str, optional

Specify a custom bar string formatting. May impact performance. [default: '{l_bar}{bar}{r_bar}'], where l_bar='{desc}: {percentage:3.0f}%|' and r_bar='| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]'.

Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s.

Note that a trailing ": " is automatically removed after {desc} if the latter is empty.

postfix : *, optional

Similar to prefix, but placed at the end (e.g. for additional stats). Note: postfix is usually a string (not a dict) for this method, and will if possible be set to postfix = ', ' + postfix. However other types are supported (#382).

unit_divisor : float, optional

[default: 1000], ignored unless unit_scale is True.

initial : int or float, optional

The initial counter value [default: 0].

out : Formatted meter and stats, ready to display.

classmethod write(s, file=None, end='\n', nolock=False)[source]

Print a message via tqdm (without overlap with bars).
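For example, tqdm.write keeps log messages from interleaving with an active bar:

from tqdm import tqdm
import time

for i in tqdm(range(20), desc='working'):
    if i == 10:
        tqdm.write('halfway there')  # printed above the bar, not through it
    time.sleep(0.05)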

classmethod external_write_mode(file=None, nolock=False)[source]

Disable tqdm within context and refresh tqdm when exits. Useful when writing to standard output stream

classmethod set_lock(lock)[source]

Set the global lock.

classmethod get_lock()[source]

Get the global lock. Construct it if it does not exist.

classmethod pandas(**tqdm_kwargs)[source]
Registers the current tqdm class with pandas.core.(frame.DataFrame | series.Series | groupby.(generic.)DataFrameGroupBy | groupby.(generic.)SeriesGroupBy).progress_apply

A new instance will be created every time progress_apply is called, and each instance will automatically close() upon completion.

tqdm_kwargs : arguments for the tqdm instance

>>> import pandas as pd
>>> import numpy as np
>>> from tqdm import tqdm
>>> from tqdm.gui import tqdm as tqdm_gui
>>>
>>> df = pd.DataFrame(np.random.randint(0, 100, (100000, 6)))
>>> tqdm.pandas(ncols=50)  # can use tqdm_gui, optional kwargs, etc
>>> # Now you can use `progress_apply` instead of `apply`
>>> df.groupby(0).progress_apply(lambda x: x**2)

<https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations-python>

__init__(iterable=None, desc=None, total=None, leave=True, file=None, ncols=None, mininterval=0.1, maxinterval=10.0, miniters=None, ascii=None, disable=False, unit='it', unit_scale=False, dynamic_ncols=False, smoothing=0.3, bar_format=None, initial=0, position=None, postfix=None, unit_divisor=1000, write_bytes=None, lock_args=None, nrows=None, gui=False, **kwargs)[source]
iterable : iterable, optional

Iterable to decorate with a progressbar. Leave blank to manually manage the updates.

desc : str, optional

Prefix for the progressbar.

total : int or float, optional

The number of expected iterations. If unspecified, len(iterable) is used if possible. If float("inf") or as a last resort, only basic progress statistics are displayed (no ETA, no progressbar). If gui is True and this parameter needs subsequent updating, specify an initial arbitrary large positive number, e.g. 9e9.

leave : bool, optional

If [default: True], keeps all traces of the progressbar upon termination of iteration. If None, will leave only if position is 0.

file : io.TextIOWrapper or io.StringIO, optional

Specifies where to output the progress messages (default: sys.stderr). Uses file.write(str) and file.flush() methods. For encoding, see write_bytes.

ncols : int, optional

The width of the entire output message. If specified, dynamically resizes the progressbar to stay within this bound. If unspecified, attempts to use environment width. The fallback is a meter width of 10 and no limit for the counter and statistics. If 0, will not print any meter (only stats).

mininterval : float, optional

Minimum progress display update interval [default: 0.1] seconds.

maxinterval : float, optional

Maximum progress display update interval [default: 10] seconds. Automatically adjusts miniters to correspond to mininterval after long display update lag. Only works if dynamic_miniters or monitor thread is enabled.

miniters : int or float, optional

Minimum progress display update interval, in iterations. If 0 and dynamic_miniters, will automatically adjust to equal mininterval (more CPU efficient, good for tight loops). If > 0, will skip display of specified number of iterations. Tweak this and mininterval to get very efficient loops. If your progress is erratic with both fast and slow iterations (network, skipping items, etc) you should set miniters=1.

ascii : bool or str, optional

If unspecified or False, use unicode (smooth blocks) to fill the meter. The fallback is to use ASCII characters " 123456789#".

disable : bool, optional

Whether to disable the entire progressbar wrapper [default: False]. If set to None, disable on non-TTY.

unit : str, optional

String that will be used to define the unit of each iteration [default: it].

unit_scale : bool or int or float, optional

If 1 or True, the number of iterations will be reduced/scaled automatically and a metric prefix following the International System of Units standard will be added (kilo, mega, etc.) [default: False]. If any other non-zero number, will scale total and n.

dynamic_ncols : bool, optional

If set, constantly alters ncols and nrows to the environment (allowing for window resizes) [default: False].

smoothing : float, optional

Exponential moving average smoothing factor for speed estimates (ignored in GUI mode). Ranges from 0 (average speed) to 1 (current/instantaneous speed) [default: 0.3].

bar_format : str, optional

Specify a custom bar string formatting. May impact performance. [default: '{l_bar}{bar}{r_bar}'], where l_bar='{desc}: {percentage:3.0f}%|' and r_bar='| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]'.

Possible vars: l_bar, bar, r_bar, n, n_fmt, total, total_fmt, percentage, elapsed, elapsed_s, ncols, nrows, desc, unit, rate, rate_fmt, rate_noinv, rate_noinv_fmt, rate_inv, rate_inv_fmt, postfix, unit_divisor, remaining, remaining_s.

Note that a trailing ": " is automatically removed after {desc} if the latter is empty.

initial : int or float, optional

The initial counter value. Useful when restarting a progress bar [default: 0]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.

position : int, optional

Specify the line offset to print this bar (starting from 0). Automatic if unspecified. Useful to manage multiple bars at once (eg, from threads).

postfix : dict or *, optional

Specify additional stats to display at the end of the bar. Calls set_postfix(**postfix) if possible (dict).

unit_divisor : float, optional

[default: 1000], ignored unless unit_scale is True.

write_bytes : bool, optional

If (default: None) and file is unspecified, bytes will be written in Python 2. If True will also write bytes. In all other cases will default to unicode.

lock_args : tuple, optional

Passed to refresh for intermediate output (initialisation, iterating, and updating).

nrows : int, optional

The screen height. If specified, hides nested bars outside this bound. If unspecified, attempts to use environment height. The fallback is 20.

gui : bool, optional

WARNING: internal parameter - do not use. Use tqdm.gui.tqdm(...) instead. If set, will attempt to use matplotlib animations for a graphical output [default: False].

out : decorated iterator.
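A minimal usage sketch of the constructor parameters above:

from tqdm import tqdm
import time

# wrap any iterable; total is inferred from len() when available
for _ in tqdm(range(200), desc='transforming', unit='sample', ncols=80):
    time.sleep(0.01)

# or drive the bar manually
with tqdm(total=1000, unit='B', unit_scale=True, unit_divisor=1024) as bar:
    for _ in range(10):
        bar.update(100)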

update(n=1)[source]

Manually update the progress bar, useful for streams such as reading files. E.g.:

>>> t = tqdm(total=filesize)  # Initialise
>>> for current_buffer in stream:
...     ...
...     t.update(len(current_buffer))
>>> t.close()

The last line is highly recommended, but possibly not necessary if t.update() will be called in such a way that filesize will be exactly reached and printed.

n : int or float, optional

Increment to add to the internal counter of iterations [default: 1]. If using float, consider specifying {n:.3f} or similar in bar_format, or specifying unit_scale.

out : bool or None

True if a display() was triggered.

close()[source]

Cleanup and (if leave=False) close the progressbar.

clear(nolock=False)[source]

Clear current bar display.

refresh(nolock=False, lock_args=None)[source]

Force refresh the display of this bar.

nolock : bool, optional

If True, does not lock. If [default: False]: calls acquire() on internal lock.

lock_args : tuple, optional

Passed to internal lock’s acquire(). If specified, will only display() if acquire() returns True.

unpause()[source]

Restart tqdm timer from last print time.

reset(total=None)[source]

Resets to 0 iterations for repeated use.

Consider combining with leave=True.

total : int or float, optional. Total to use for the new bar.

set_description(desc=None, refresh=True)[source]

Set/modify description of the progress bar.

desc : str, optional
refresh : bool, optional

Forces refresh [default: True].

set_description_str(desc=None, refresh=True)[source]

Set/modify description without ": " appended.

set_postfix(ordered_dict=None, refresh=True, **kwargs)[source]

Set/modify postfix (additional stats) with automatic formatting based on datatype.

ordered_dict : dict or OrderedDict, optional
refresh : bool, optional

Forces refresh [default: True].

kwargs : dict, optional

set_postfix_str(s='', refresh=True)[source]

Postfix without dictionary expansion, similar to prefix handling.

property format_dict

Public API for read-only member access.

display(msg=None, pos=None)[source]

Use self.sp to display msg in the specified pos.

Consider overloading this function when inheriting to use e.g.: self.some_frontend(**self.format_dict) instead of self.sp.

msg : str, optional. What to display (default: repr(self)).
pos : int, optional. Position to moveto (default: abs(self.pos)).

classmethod wrapattr(stream, method, total=None, bytes=True, **tqdm_kwargs)[source]

stream : file-like object.
method : str, "read" or "write". The result of read() and the first argument of write() should have a len().

>>> with tqdm.wrapattr(file_obj, "read", total=file_obj.size) as fobj:
...     while True:
...         chunk = fobj.read(chunk_size)
...         if not chunk:
...             break