textflint.input_layer.dataset.dataset¶
dataset: textflint dataset¶

class textflint.input_layer.dataset.dataset.Dataset(task='UT')[source]¶
Bases: object

Any iterable of (label, text_input) pairs qualifies as a Dataset.
load(dataset)[source]¶
Loads a json object and prepares it as a Dataset.

Two input formats are supported. Example:

{'x': ['The robustness of deep neural networks has received much attention recently',
       'We focus on certified robustness of smoothed classifiers in this work',
       ...,
       'our approach exceeds the state-of-the-art.'],
 'y': ['neutral', 'positive', ..., 'positive']}

[{'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neutral'},
 {'x': 'We focus on certified robustness of smoothed classifiers in this work', 'y': 'positive'},
 ...,
 {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}]

Parameters
    dataset (list|dict) – data samples in either format above
Returns
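The two formats above encode the same samples. A minimal pure-Python sketch (not textflint code; the texts and labels are shortened from the docstring example) showing how the column-wise dict pairs up into the per-sample list:

```python
# Column-wise format: parallel 'x' and 'y' lists.
columns = {
    'x': ['The robustness of deep neural networks has received much attention recently',
          'We focus on certified robustness of smoothed classifiers in this work'],
    'y': ['neutral', 'positive'],
}

# Row-wise format: one {'x': ..., 'y': ...} dict per sample,
# obtained by zipping the parallel columns together.
rows = [{'x': x, 'y': y} for x, y in zip(columns['x'], columns['y'])]
```

Either `columns` or `rows` can then be handed to `load`, since both describe the same (text, label) pairs.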
load_json(json_path, encoding='utf-8', fields=None, dropna=True)[source]¶
Loads a json file; each line of the file is a json string.

Parameters
    json_path – file path
    encoding – file encoding, default: utf-8
    fields – the json fields to keep; if None, all fields are kept. default: None
    dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True
Returns
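A sketch of the line-per-object reading contract described above, written in plain Python (this is an illustration of the documented behavior, not textflint's implementation):

```python
import json

def read_jsonl(path, encoding='utf-8', fields=None, dropna=True):
    """Read a file with one JSON object per line: keep only `fields`
    if given; on an invalid line, skip it (dropna=True) or raise
    ValueError (dropna=False)."""
    samples = []
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
            except json.JSONDecodeError:
                if dropna:
                    continue
                raise ValueError(f'invalid json line: {line!r}')
            if fields is not None:
                # Restrict each object to the requested fields.
                obj = {k: obj[k] for k in fields}
            samples.append(obj)
    return samples
```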
load_csv(csv_path, encoding='utf-8', headers=None, sep=',', dropna=True)[source]¶
Loads a csv file; each line corresponds to one sample.

Parameters
    csv_path – file path
    encoding – file encoding, default: utf-8
    headers – the file's headers; if None, the file's first line is used as headers. default: None
    sep – separator for each column. default: ','
    dropna – whether to ignore and drop invalid data; if False, raise ValueError when reading invalid data. default: True
Returns
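The same one-line-per-sample contract can be sketched in plain Python (again, an illustration of the documented parameters, not textflint's own code):

```python
import csv

def read_csv_rows(path, encoding='utf-8', headers=None, sep=',', dropna=True):
    """Read a csv file, one sample per line. If headers is None, the
    first line supplies the column names; rows with a mismatched field
    count are dropped (dropna=True) or raise ValueError (dropna=False)."""
    samples = []
    with open(path, encoding=encoding, newline='') as f:
        reader = csv.reader(f, delimiter=sep)
        for row in reader:
            if headers is None:
                headers = row  # first line becomes the header row
                continue
            if len(row) != len(headers):
                if dropna:
                    continue
                raise ValueError(f'malformed row: {row!r}')
            samples.append(dict(zip(headers, row)))
    return samples
```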
load_hugging_face(name, subset='train')[source]¶
Loads a dataset from the HuggingFace datasets library and prepares it as a Dataset.

Parameters
    name – the dataset name
    subset – the subset (split) of the main dataset. default: 'train'
Returns
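A HuggingFace split behaves like a sequence of dict rows. The sketch below uses a stand-in list instead of a real download, and the `sentence`/`label` field names plus the label mapping are assumptions for illustration (field names vary per dataset); it only shows the reshaping into the {'x', 'y'} records that Dataset accepts:

```python
# Stand-in for rows yielded by a HuggingFace split (no download here);
# 'sentence'/'label' and the label names are assumed for illustration.
hf_rows = [
    {'sentence': 'a gripping, well-acted film', 'label': 1},
    {'sentence': 'tedious and overlong', 'label': 0},
]
label_names = ['negative', 'positive']  # assumed id -> string mapping

# Reshape into the {'x', 'y'} records the Dataset format expects.
records = [{'x': r['sentence'], 'y': label_names[r['label']]} for r in hf_rows]
```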
append(data_sample, sample_id=-1)[source]¶
Loads a single data sample and appends it to the dataset.

Parameters
    data_sample (dict|Sample) –
    sample_id (int) – useful to identify the sample, default -1
Returns
    True / False, indicating whether the append action was successful.
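The notable part of this contract is that append reports success with a bool rather than raising. A plain-Python sketch of that shape (the validation rule and the plain-list "dataset" are assumptions for illustration, not textflint's logic):

```python
def try_append(dataset, data_sample, sample_id=-1):
    """Append-with-a-bool contract: accept a dict carrying the expected
    'x' field, reject anything else, and return True/False instead of
    raising. The dict-with-'x' check is an assumed validation rule."""
    if not isinstance(data_sample, dict) or 'x' not in data_sample:
        return False
    dataset.append({**data_sample, 'sample_id': sample_id})
    return True

ds = []
ok = try_append(ds, {'x': 'great soundtrack', 'y': 'positive'}, sample_id=0)
bad = try_append(ds, 'not a dict')
```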
extend(data_samples)[source]¶
Loads multiple data samples and extends the dataset with them.

Parameters
    data_samples (list|dict|Sample) –
Returns
static norm_input(data_samples)[source]¶
Converts various data inputs to a list of dicts. Example:

{'x': ['The robustness of deep neural networks has received much attention recently',
       'We focus on certified robustness of smoothed classifiers in this work',
       ...,
       'our approach exceeds the state-of-the-art.'],
 'y': ['neutral', 'positive', ..., 'positive']}

converts to

[{'x': 'The robustness of deep neural networks has received much attention recently', 'y': 'neutral'},
 {'x': 'We focus on certified robustness of smoothed classifiers in this work', 'y': 'positive'},
 ...,
 {'x': 'our approach exceeds the state-of-the-art.', 'y': 'positive'}]

Parameters
    data_samples (list|dict|Sample) –
Returns
    Normalized data.
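A minimal sketch of this normalization in plain Python, assuming three input shapes: a dict of parallel column lists, a single dict sample, or an iterable of samples (this mirrors the documented conversion, not textflint's implementation):

```python
def norm_samples(data_samples):
    """Normalize input to a list of per-sample dicts: a dict of
    parallel lists is unzipped column-wise; a single dict sample is
    wrapped in a list; any other iterable passes through as a list."""
    if isinstance(data_samples, dict) and all(
            isinstance(v, list) for v in data_samples.values()):
        keys = list(data_samples)
        # zip the parallel columns into one dict per sample
        return [dict(zip(keys, values))
                for values in zip(*(data_samples[k] for k in keys))]
    if isinstance(data_samples, dict):
        return [data_samples]
    return list(data_samples)
```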
save_csv(out_path, encoding='utf-8', headers=None, sep=',')[source]¶
Saves the dataset to a csv file.

Parameters
    out_path – file path
    encoding – file encoding, default: utf-8
    headers – the headers to write as the file's first line. default: None
    sep – separator for each column. default: ','
Returns
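A sketch of the corresponding write path in plain Python (assuming, when headers is None, that the column names come from the first sample; this is an illustration of the documented parameters, not textflint's code):

```python
import csv

def write_csv(samples, out_path, encoding='utf-8', headers=None, sep=','):
    """Write one csv row per sample, with a header line first. If
    headers is None, take the column names from the first sample
    (an assumed default for this sketch)."""
    if headers is None:
        headers = list(samples[0])
    with open(out_path, 'w', encoding=encoding, newline='') as f:
        writer = csv.writer(f, delimiter=sep)
        writer.writerow(headers)
        for sample in samples:
            # Missing fields are written as empty strings.
            writer.writerow(sample.get(h, '') for h in headers)
```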