Sample
A common problem is that input formats differ widely across models, which makes data hard to load and reuse. Sample solves this problem by decomposing the data of various NLP tasks into underlying Fields, which cover all basic input types.
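To make the idea of decomposing a record into fields concrete, here is a minimal sketch in plain Python. It does not use textflint's real classes: `TextField`, `LabelField`, and `ToySample` are hypothetical names invented for illustration, and the whitespace tokenizer is only a stand-in for a real one.

```python
from dataclasses import dataclass


@dataclass
class TextField:
    """Hypothetical field that wraps raw text (not textflint's class)."""
    text: str

    @property
    def words(self):
        # Naive whitespace split, used only to keep the sketch self-contained.
        return self.text.split()


@dataclass
class LabelField:
    """Hypothetical field that wraps a class label."""
    label: str


@dataclass
class ToySample:
    """A sentiment-analysis record decomposed into one text and one label field."""
    x: TextField
    y: LabelField

    @classmethod
    def from_dict(cls, data):
        # Each key of the raw record is mapped to the field type that suits it.
        return cls(x=TextField(data['x']), y=LabelField(data['y']))


toy = ToySample.from_dict({'x': 'Titanic is my favorite movie.', 'y': 'pos'})
print(toy.x.words)
print(toy.y.label)
```

Under this assumed design, a task with a different input shape would only need a different combination of the same few field types.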
[1]:
# Take SASample as an example
from textflint.input_layer.component.sample.sa_sample import SASample
data = {'x': 'Titanic is my favorite movie. The leading actor is good.','y': 'pos'}
sample = SASample(data)
Sample provides common linguistic functions, including tokenization, part-of-speech tagging, and dependency parsing.
[3]:
sample.get_ner('x')
[3]:
[('Titanic', 0, 1, 'PRODUCT')]
[4]:
sample.get_pos('x')
[4]:
['NNP', 'VBZ', 'PRP$', 'JJ', 'NN', '.', 'DT', 'VBG', 'NN', 'VBZ', 'JJ', '.']
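As a rough illustration of how these outputs can be combined, the sketch below pairs each token with its tag and reads the entity span. The word and tag lists are copied from this page rather than computed, and treating the two integers in the NER tuple as token indices with an exclusive end is an assumption, not something confirmed here.

```python
# Tokens and POS tags copied from the outputs shown on this page.
words = ['Titanic', 'is', 'my', 'favorite', 'movie', '.',
         'The', 'leading', 'actor', 'is', 'good', '.']
pos_tags = ['NNP', 'VBZ', 'PRP$', 'JJ', 'NN', '.',
            'DT', 'VBG', 'NN', 'VBZ', 'JJ', '.']

# One tag per token, so the two lists can be zipped together.
tagged = list(zip(words, pos_tags))
adjectives = [word for word, tag in tagged if tag == 'JJ']

# Assumption: (text, start, end, label) uses token indices, end exclusive.
entity_text, start, end, label = ('Titanic', 0, 1, 'PRODUCT')
entity_tokens = words[start:end]

print(adjectives)
print(entity_tokens, label)
```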
textflint also breaks down arbitrary text transformations into a small set of atomic operations inside Sample, backed by clean and consistent implementations.
[5]:
sample.get_text('x')
[5]:
'Titanic is my favorite movie. The leading actor is good.'
[6]:
sample.get_words('x')
[6]:
['Titanic',
'is',
'my',
'favorite',
'movie',
'.',
'The',
'leading',
'actor',
'is',
'good',
'.']
[8]:
sample.get_sentences('x')
[8]:
['Titanic is my favorite movie.', 'The leading actor is good.']
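The notion of atomic operations can be sketched on a plain token list. The three helpers below (`insert_before`, `delete_at`, `replace_at`) are hypothetical names chosen for this illustration and are not textflint's API. Each returns a new list instead of mutating its input, which is one possible design choice and may differ from how the library behaves.

```python
def insert_before(words, index, new_words):
    """Return a copy of words with new_words inserted before position index."""
    return words[:index] + list(new_words) + words[index:]


def delete_at(words, index):
    """Return a copy of words without the token at position index."""
    return words[:index] + words[index + 1:]


def replace_at(words, index, new_words):
    """Return a copy of words with the token at index replaced by new_words."""
    return words[:index] + list(new_words) + words[index + 1:]


words = ['Titanic', 'is', 'my', 'favorite', 'movie', '.']

# A larger transformation can be composed from these small steps.
swapped = replace_at(words, 3, ['least', 'favorite'])
shortened = delete_at(words, 2)
extended = insert_before(words, 0, ['Honestly', ','])

print(' '.join(swapped))
```

Because every helper leaves the original list untouched, several transformed variants can be derived from the same source tokens.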
Dataset
Dataset holds a collection of samples and provides efficient, convenient interfaces for operating on them. Dataset supports loading, verifying, and saving data in JSON or CSV format for various NLP tasks.
[9]:
from textflint.input_layer.dataset import Dataset
data2 = {'x': 'I don\'t like the actor Tim Hill', 'y': 'neg'}
dataset = Dataset('SA')
dataset.load([data, data2])
textflint: ******Start load!******
100%|██████████| 2/2 [00:00<00:00, 976.56it/s]
textflint: 2 in total, 2 were loaded successful.
textflint: ******Finish load!******
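The verify-then-save flow described above can be approximated with the standard library alone. This is a sketch under stated assumptions, not textflint's implementation: the required keys `x` and `y` are taken from the sentiment-analysis records on this page, the helper names are hypothetical, and the one-JSON-object-per-line layout is an assumption about the file format.

```python
import csv
import io
import json

REQUIRED_KEYS = ('x', 'y')  # assumption: keys a sentiment-analysis record needs


def validate(records):
    """Split records into those with string values for all required keys and the rest."""
    valid, rejected = [], []
    for record in records:
        ok = isinstance(record, dict) and all(
            isinstance(record.get(key), str) for key in REQUIRED_KEYS
        )
        (valid if ok else rejected).append(record)
    return valid, rejected


def to_json_lines(records):
    """Serialize records as one JSON object per line (assumed layout)."""
    return '\n'.join(json.dumps(record) for record in records)


def to_csv(records):
    """Serialize records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REQUIRED_KEYS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


records = [
    {'x': 'Titanic is my favorite movie. The leading actor is good.', 'y': 'pos'},
    {'x': "I don't like the actor Tim Hill", 'y': 'neg'},
    {'x': 'A record without a label'},  # rejected by validate()
]

valid, rejected = validate(records)
json_text = to_json_lines(valid)
csv_text = to_csv(valid)
print(len(valid), 'valid,', len(rejected), 'rejected')
```

Rejecting malformed records before serialization mirrors the "in total" versus "loaded" counts in the log above, although the library's actual checks may be stricter.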