SubPopulation¶
This page gives a quick overview of how to start using SubPopulation, a category of textflint methods for verifying robustness comprehensively. The full list of SubPopulations can be found on our website or GitHub.
How to use a built-in SubPopulation¶
textflint offers multiple universal SubPopulation methods for NLP tasks, and task-specific SubPopulation methods will be provided in a coming version. Here we use the LM SubPopulation on a Sentiment Analysis task to give a brief introduction.
[5]:
# 1. Import the SA Sample, textflint Dataset and LM SubPopulation method
from textflint.input_layer.component.sample.sa_sample import SASample
from textflint.input_layer.dataset import Dataset
from textflint.generation_layer.subpopulation.UT import LMSubPopulation
# 2. Initialize the SA Sample
sample1 = {'x': 'Titanic is my favorite movie.','y': 'pos'}
sample2 = {'x': 'I don\'t like the actor Tim Hill', 'y': 'neg'}
sample3 = {'x': 'The leading actor is good.','y': 'pos'}
samples = [sample1, sample2, sample3]
# 3. Construct the Dataset
dataset = Dataset('SA')
dataset.load(samples)
# 4. Define the SubPopulation
sub = LMSubPopulation(intervals=[0, 1])
# 5. Run SubPopulation on Dataset
sub_dataset = sub.slice_population(dataset, 'x')
textflint: ******Start load!******
100%|██████████| 3/3 [00:00<00:00, 1950.23it/s]
textflint: 3 in total, 3 were loaded successful.
textflint: ******Finish load!******
100%|██████████| 3/3 [00:46<00:00, 15.59s/it]
We can save the sub-dataset as a JSON file in a predefined directory through the Dataset.save_json interface.
[7]:
# output path
path = './test_result/'
sub_dataset.save_json(path + 'result.json')
textflint: Save samples to ./test_result/result.json!
[8]:
import os
os.listdir(path)
[8]:
['result.json']
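The saved file can be inspected with Python's standard json module. This page does not describe the on-disk layout, so the sketch below makes no firm assumption about it: the hypothetical helper read_samples (not part of textflint) accepts either a single JSON document or one JSON object per line. The demo writes its own temporary file instead of relying on ./test_result/result.json.

```python
import json
import os
import tempfile


def read_samples(path):
    """Read samples from a file holding one JSON document or one object per line."""
    with open(path, encoding="utf-8") as f:
        text = f.read().strip()
    if not text:
        return []
    try:
        data = json.loads(text)
        # A single JSON document: a list of samples, or one sample on its own.
        return data if isinstance(data, list) else [data]
    except json.JSONDecodeError:
        # Several documents in one file: treat each non-empty line as a sample.
        return [json.loads(line) for line in text.splitlines() if line.strip()]


# Demo on a temporary file with one JSON object per line (an assumed layout).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "result.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write(json.dumps({"x": "The leading actor is good.", "y": "pos"}) + "\n")
        f.write(json.dumps({"x": "Titanic is my favorite movie.", "y": "pos"}) + "\n")
    loaded = read_samples(demo_path)

print(len(loaded))
```

If the real file turns out to use a different layout, only read_samples needs to change.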
Define your own SubPopulation¶
[15]:
from textflint.generation_layer.subpopulation import SubPopulation
class LengthStr(SubPopulation):
    r"""
    Filter samples based on string length
    """
    def _score(self, sample, fields, **kwargs):
        r"""
        Score the sample
        :param sample: data sample
        :param list fields: list of field str
        :param kwargs:
        :return int: score for sample
        """
        return len(sample.get_text(fields[0]))
SubPopulation requires you to implement the abstract method _score, which assigns a score to a Sample. The code above defines a new SubPopulation method, LengthStr, whose score is the length of the string. Here, fields is a list of field names of the input Sample, and the score is computed from the values of those fields.
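To make the role of fields concrete, the scoring rule can be tried outside textflint. The sketch below uses a hypothetical StubSample class that only mimics get_text; it is not textflint's Sample, and length_score is an illustrative name that repeats the body of _score.

```python
class StubSample:
    """Hypothetical stand-in for a textflint Sample; only get_text is mimicked."""

    def __init__(self, data):
        self._data = data

    def get_text(self, field):
        # Return the raw text stored under the given field name.
        return self._data[field]


def length_score(sample, fields):
    # Same rule as LengthStr._score: length of the text in the first field.
    return len(sample.get_text(fields[0]))


samples = [
    StubSample({'x': 'Titanic is my favorite movie.', 'y': 'pos'}),
    StubSample({'x': "I don't like the actor Tim Hill", 'y': 'neg'}),
    StubSample({'x': 'The leading actor is good.', 'y': 'pos'}),
]

# Score every sample on its 'x' field.
scores = [length_score(s, ['x']) for s in samples]
print(scores)  # [29, 31, 26]
```

Any other numeric rule could be written the same way, as long as it reads its input from the fields it is given.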
[23]:
test_sub = LengthStr(intervals=[0, 2])
test_dataset = test_sub.slice_population(dataset, 'x')
print(test_dataset[0].dump())
print(test_dataset[1].dump())
100%|██████████| 3/3 [00:00<00:00, 6654.10it/s]
{'x': 'The leading actor is good.', 'y': 'pos', 'sample_id': 2}
{'x': 'Titanic is my favorite movie.', 'y': 'pos', 'sample_id': 0}
[19]:
test_dataset[1].dump()
[19]:
{'x': 'Titanic is my favorite movie.', 'y': 'pos', 'sample_id': 0}
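The order of the two returned samples is consistent with one reading of intervals=[0, 2]: rank the samples by ascending score and keep ranks 0 and 1. The sketch below reproduces that reading in plain Python. It is an interpretation inferred from the output above, not textflint's implementation, and slice_by_interval is a hypothetical helper; the library's actual interval handling may differ.

```python
def slice_by_interval(items, score_fn, interval):
    """Sort items by ascending score and keep positions [start, end)."""
    start, end = interval
    ranked = sorted(items, key=score_fn)
    return ranked[start:end]


data = [
    {'x': 'Titanic is my favorite movie.', 'y': 'pos', 'sample_id': 0},
    {'x': "I don't like the actor Tim Hill", 'y': 'neg', 'sample_id': 1},
    {'x': 'The leading actor is good.', 'y': 'pos', 'sample_id': 2},
]

# Score by text length (29, 31 and 26 characters) and keep the two shortest.
subset = slice_by_interval(data, lambda s: len(s['x']), (0, 2))
print([s['sample_id'] for s in subset])  # [2, 0]
```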
Conclusion¶
In this section, we showed how to use a built-in SubPopulation, LMSubPopulation, and how to define our own SubPopulation. Currently textflint implements only a few SubPopulations, and we will add more task-specific SubPopulations, as with Transformations.