Sample

Different models expect very different input formats, which makes data hard to load and reuse. Sample solves this problem by decomposing the data of various NLP tasks into underlying Fields, which cover all basic input types.

[1]:

# Take SASample as an example
from textflint.input_layer.component.sample.sa_sample import SASample

data = {'x': 'Titanic is my favorite movie. The leading actor is good.', 'y': 'pos'}
sample = SASample(data)
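To make the idea of decomposing a record into fields more concrete, here is a minimal sketch in plain Python. The `TextField` and `LabelField` classes and the `decompose` helper are hypothetical names invented for this illustration; they are not textflint's classes, and the whitespace tokenizer is a deliberate simplification.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for illustration only; these are NOT textflint classes.
@dataclass
class TextField:
    text: str

    def words(self):
        # Naive whitespace split; a real tokenizer would also separate punctuation.
        return self.text.split()


@dataclass
class LabelField:
    label: str


def decompose(record):
    """Map a raw {'x': ..., 'y': ...} record onto typed fields."""
    return {"x": TextField(record["x"]), "y": LabelField(record["y"])}


fields = decompose({"x": "Titanic is my favorite movie.", "y": "pos"})
print(fields["x"].words())
print(fields["y"].label)
```

Under this assumed design, any task whose input is some mix of text and labels can reuse the same two field types, which is the kind of reuse the paragraph above describes.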

Sample provides common linguistic functions, including tokenization, part-of-speech tagging, named entity recognition and dependency parsing.

[3]:

sample.get_ner('x')

[3]:

[('Titanic', 0, 1, 'PRODUCT')]

[4]:

sample.get_pos('x')

[4]:

['NNP', 'VBZ', 'PRP$', 'JJ', 'NN', '.', 'DT', 'VBG', 'NN', 'VBZ', 'JJ', '.']

textflint also breaks arbitrary text transformations down into a set of atomic operations inside Sample, backed by clean and consistent implementations.

[5]:

sample.get_text('x')

[5]:

'Titanic is my favorite movie. The leading actor is good.'

[6]:

sample.get_words('x')

[6]:

['Titanic',
 'is',
 'my',
 'favorite',
 'movie',
 '.',
 'The',
 'leading',
 'actor',
 'is',
 'good',
 '.']

[8]:

sample.get_sentences('x')

[8]:

['Titanic is my favorite movie.', 'The leading actor is good.']
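As a rough illustration of what "atomic operations" on a token list could look like, the sketch below composes three small, non-mutating helpers. The function names `replace_at`, `insert_before` and `delete_at` are hypothetical and chosen for this example; they are not taken from textflint's API.

```python
# Hypothetical helpers for illustration; the names are NOT textflint's API.
def replace_at(words, index, new_word):
    """Return a new list with the word at `index` replaced."""
    return words[:index] + [new_word] + words[index + 1:]


def insert_before(words, index, new_word):
    """Return a new list with `new_word` inserted before `index`."""
    return words[:index] + [new_word] + words[index:]


def delete_at(words, index):
    """Return a new list with the word at `index` removed."""
    return words[:index] + words[index + 1:]


words = ['Titanic', 'is', 'my', 'favorite', 'movie', '.']

# A larger transformation is just a sequence of atomic steps.
step1 = replace_at(words, 3, 'preferred')
step2 = insert_before(step1, 3, 'most')
step3 = delete_at(step2, 2)
print(step3)
```

Each helper returns a new list and leaves its input untouched, so the original token sequence stays available for comparison with the transformed one.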

Dataset

Dataset holds a collection of samples and provides convenient interfaces for operating on them. It supports loading, verifying, and saving data in JSON or CSV format for various NLP tasks.

[9]:

from textflint.input_layer.dataset import Dataset
data2 = {'x': "I don't like the actor Tim Hill", 'y': 'neg'}
dataset = Dataset('SA')
dataset.load([data, data2])

textflint: ******Start load!******
100%|██████████| 2/2 [00:00<00:00, 976.56it/s]
textflint: 2 in total, 2 were loaded successful.
textflint: ******Finish load!******
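For readers who want to see how records of this shape might look in JSON or CSV, the following sketch round-trips the two examples using only the Python standard library. The layouts used here (one JSON object per line, and a CSV with an `x,y` header row) are assumptions made for the example, not a statement of the exact file format textflint reads or writes.

```python
import csv
import io
import json

records = [
    {'x': 'Titanic is my favorite movie. The leading actor is good.', 'y': 'pos'},
    {'x': "I don't like the actor Tim Hill", 'y': 'neg'},
]

# JSON: one object per line (an assumed layout for this sketch).
json_text = "\n".join(json.dumps(record) for record in records)
from_json = [json.loads(line) for line in json_text.splitlines()]

# CSV: a header row with the field names, then one row per record.
# An in-memory buffer stands in for a file on disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['x', 'y'])
writer.writeheader()
writer.writerows(records)
buffer.seek(0)
from_csv = [dict(row) for row in csv.DictReader(buffer)]

print(from_json == records)
print(from_csv == records)
```

Either list of dictionaries has the same shape as the list passed to `dataset.load` above.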