SubPopulation

This page gives a quick overview of how to start using SubPopulation, a family of textflint methods for verifying robustness comprehensively. The full list of SubPopulations can be found on our website or GitHub.

How to use a built-in SubPopulation

textflint offers multiple universal SubPopulation methods for NLP tasks, and we will provide task-specific SubPopulation methods in an upcoming version. Here we use LMSubPopulation on a Sentiment Analysis (SA) task as a brief introduction.

[5]:

# 1. Import the SA Sample, textflint Dataset and LM SubPopulation method
from textflint.input_layer.component.sample.sa_sample import SASample
from textflint.input_layer.dataset import Dataset
from textflint.generation_layer.subpopulation.UT import LMSubPopulation

# 2. Initialize the SA Sample
sample1 = {'x': 'Titanic is my favorite movie.', 'y': 'pos'}
sample2 = {'x': "I don't like the actor Tim Hill", 'y': 'neg'}
sample3 = {'x': 'The leading actor is good.', 'y': 'pos'}
samples = [sample1, sample2, sample3]

# 3. Construct the Dataset
dataset = Dataset('SA')
dataset.load(samples)

# 4. Define the SubPopulation
sub = LMSubPopulation(intervals=[0, 1])

# 5. Run SubPopulation on Dataset
sub_dataset = sub.slice_population(dataset, 'x')

textflint: ******Start load!******
100%|██████████| 3/3 [00:00<00:00, 1950.23it/s]
textflint: 3 in total, 3 were loaded successful.
textflint: ******Finish load!******
100%|██████████| 3/3 [00:46<00:00, 15.59s/it]

We can save the sub-dataset as a JSON file in a predefined directory through the Dataset.save_json interface.

[7]:

# output path
path = './test_result/'
sub_dataset.save_json(path + 'result.json')

textflint: Save samples to ./test_result/result.json!

[8]:

import os
os.listdir(path)

[8]:

['result.json']

Define your own SubPopulation

[15]:

from textflint.generation_layer.subpopulation import SubPopulation

class LengthStr(SubPopulation):
    r"""
    Filter samples based on string length
    """

    def _score(self, sample, fields, **kwargs):
        r"""
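        Score the sample by the length of the text in its first field.

        :param sample: data sample
        :param list fields: list of field names (str)
        :param kwargs: extra keyword arguments (unused here)
        :return int: number of characters in the text of fields[0]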
        """
        return len(sample.get_text(fields[0]))

SubPopulation requires you to implement the abstract method _score, which assigns a score to each Sample. The code above defines a new SubPopulation, LengthStr, whose score is the length of the text in the given field.

Here, fields is a list of field names of the input Sample, and the score is computed from the values of those fields. LengthStr uses only the first one, fields[0].
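
The scoring pattern can also be tried without textflint installed. The sketch below is only an illustration: ScoreBase and score_all are hypothetical stand-ins (not textflint's SubPopulation or any of its methods), and plain dicts replace Sample objects, so sample[field] plays the role of sample.get_text(field).

```python
from abc import ABC, abstractmethod


class ScoreBase(ABC):
    """Hypothetical stand-in for a base class; NOT textflint's SubPopulation."""

    @abstractmethod
    def _score(self, sample, fields, **kwargs):
        """Return a numeric score for one sample."""

    def score_all(self, samples, fields):
        # Score every sample using only the requested fields.
        return [self._score(sample, fields) for sample in samples]


class LengthScore(ScoreBase):
    """Same idea as LengthStr: score by the length of the first field."""

    def _score(self, sample, fields, **kwargs):
        # Plain dicts are used here instead of textflint Sample objects.
        return len(sample[fields[0]])


samples = [
    {'x': 'Titanic is my favorite movie.', 'y': 'pos'},
    {'x': "I don't like the actor Tim Hill", 'y': 'neg'},
    {'x': 'The leading actor is good.', 'y': 'pos'},
]

scores = LengthScore().score_all(samples, ['x'])
print(scores)
```

Because _score is declared abstract in this stand-in, instantiating ScoreBase directly raises a TypeError; only subclasses that provide _score can be used.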

[23]:

test_sub = LengthStr(intervals=[0, 2])
test_dataset = test_sub.slice_population(dataset, 'x')
print(test_dataset[0].dump())
print(test_dataset[1].dump())

100%|██████████| 3/3 [00:00<00:00, 6654.10it/s]

{'x': 'The leading actor is good.', 'y': 'pos', 'sample_id': 2}
{'x': 'Titanic is my favorite movie.', 'y': 'pos', 'sample_id': 0}

[19]:

test_dataset[1].dump()

[19]:

{'x': 'Titanic is my favorite movie.', 'y': 'pos', 'sample_id': 0}
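
The output above is consistent with one simple reading of intervals=[0, 2]: samples are sorted by score in ascending order, and those at rank positions 0 up to (but not including) 2 are kept. The sketch below reproduces that reading in plain Python. It is an inference from the printed result, not a description of textflint's internals; slice_by_rank is a hypothetical helper, and the sample_id of the second sample is assumed to be 1 (load order).

```python
def slice_by_rank(samples, score_fn, intervals):
    """Hypothetical helper: keep samples whose score rank is in [start, stop)."""
    start, stop = intervals
    ranked = sorted(samples, key=score_fn)  # ascending by score
    return ranked[start:stop]


samples = [
    {'x': 'Titanic is my favorite movie.', 'y': 'pos', 'sample_id': 0},
    # sample_id 1 is an assumption based on load order.
    {'x': "I don't like the actor Tim Hill", 'y': 'neg', 'sample_id': 1},
    {'x': 'The leading actor is good.', 'y': 'pos', 'sample_id': 2},
]

# Score by string length, as LengthStr does, and keep the two shortest.
subset = slice_by_rank(samples, lambda s: len(s['x']), [0, 2])
for sample in subset:
    print(sample)
```

Under this reading, the longest sentence (31 characters) is dropped, and the remaining two are returned shortest first, which matches the order of sample_id 2 and sample_id 0 printed above.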

Conclusion

In this section, we showed how to use the built-in LMSubPopulation and how to define a custom SubPopulation. textflint currently implements only a few SubPopulations; we will add more task-specific SubPopulations, similar to the task-specific Transformations.