textflint.common.preprocess.en_processor
EnProcessor Class
-
class textflint.common.preprocess.en_processor.EnProcessor(*args, **kwargs)
Bases: object
Text processor class implementing NER, POS tagging, and lexical tree parsing. EnProcessor is designed as a singleton (single-instance mode): repeated construction returns one shared instance.
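The singleton behaviour can be sketched with a standard Python pattern; the class name below is illustrative, not textflint's actual implementation:

```python
class SingletonProcessor:
    """Minimal singleton sketch: every call to the constructor
    returns the same shared instance."""
    _instance = None

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
```

This is why heavyweight resources (models, parsers) are loaded only once no matter how many times `EnProcessor()` is called.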
-
sentence_tokenize(text)
Split text into sentences.
- Parameters
text (str) – text string
- Returns
list[str]
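A rough stdlib approximation of this behaviour (textflint's own splitter handles far more edge cases, such as abbreviations):

```python
import re

def sentence_tokenize(text):
    """Naive sentence splitter: break after ., ! or ?
    when followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
```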
-
tokenize_one_sent(text, split_by_space=False)
Tokenize one sentence.
- Parameters
text (str) – input text string
split_by_space (bool) – whether to tokenize the sentence by splitting on spaces
- Returns
tokens
-
tokenize(text, is_one_sent=False, split_by_space=False)
Split a text into tokens (words, morphemes we can separate such as “n’t”, and punctuation).
- Parameters
text (str) – input text string
is_one_sent (bool) – whether the input is a single sentence
split_by_space (bool) – whether to tokenize by splitting on spaces
- Returns
list of tokens
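The clitic handling described above can be illustrated with a toy regex tokenizer; this is only a sketch of the idea, not textflint's tokenizer:

```python
import re

def toy_tokenize(text, split_by_space=False):
    """Toy tokenizer: separates punctuation and the clitic "n't" from words."""
    if split_by_space:
        return text.split()
    text = re.sub(r"([^\w\s'])", r" \1 ", text)   # space out punctuation, keep apostrophes
    text = re.sub(r"(\w)(n't)", r"\1 \2", text)   # split the negation clitic
    return text.split()
```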
-
static inverse_tokenize(tokens)
Convert tokens back to a sentence.
Untokenizing a text undoes the tokenizing operation, restoring punctuation and spaces to the places that people expect them to be. Ideally, untokenize(tokenize(text)) should be identical to text, except for line breaks.
Watch out! By default, punctuation is attached to the word before its index, which may introduce inconsistencies.
- Parameters
tokens (list[str]) – target token list
- Returns
str
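The re-attachment behaviour (and the punctuation caveat above) can be sketched as follows; textflint's own implementation is more thorough:

```python
import re

def toy_inverse_tokenize(tokens):
    """Join tokens, re-attaching punctuation and clitics to the preceding word."""
    text = " ".join(tokens)
    text = re.sub(r" ([.,!?;:)\]])", r"\1", text)               # closing punctuation
    text = re.sub(r" (n't|'s|'re|'ve|'ll|'d|'m)", r"\1", text)  # contractions
    text = re.sub(r"([(\[]) ", r"\1", text)                     # opening brackets
    return text
```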
-
get_pos(sentence)
POS tagging function.
Example:
EnProcessor().get_pos('All things in their being are good for something.')
>> [('All', 'DT'), ('things', 'NNS'), ('in', 'IN'), ('their', 'PRP$'), ('being', 'VBG'), ('are', 'VBP'), ('good', 'JJ'), ('for', 'IN'), ('something', 'NN'), ('.', '.')]
- Parameters
sentence (str|list) – A sentence which needs to be tokenized.
- Returns
Tokenized tokens with their POS tags.
-
get_ner(sentence, return_char_idx=True)
NER function, implemented with a spaCy model.
Example:
EnProcessor().get_ner('Lionel Messi is a football player from Argentina.')
if return_char_idx is True >> [('Lionel Messi', 0, 12, 'PERSON'), ('Argentina', 39, 48, 'LOCATION')]
if return_char_idx is False >> [('Lionel Messi', 0, 2, 'PERSON'), ('Argentina', 7, 8, 'LOCATION')]
- Parameters
sentence (str|list) – text string or token list
return_char_idx (bool) – if set True, return character start and end indices; otherwise return word start and end indices.
- Returns
A list of tuples, (entity, start, end, label)
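The relationship between the two index styles in the example can be sketched with a helper that maps character spans to token spans, assuming tokens are separated by single spaces (a simplification; spaCy tracks exact offsets):

```python
def char_span_to_word_span(tokens, char_start, char_end):
    """Map a character span to (start, end) token indices,
    assuming tokens are joined by single spaces."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok) + 1  # account for the separating space
    start = next(i for i, (s, e) in enumerate(spans) if s <= char_start < e)
    end = next(i for i, (s, e) in enumerate(spans) if s < char_end <= e) + 1
    return start, end
```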
-
get_parser(sentence)
Lexical tree parsing function based on the NLTK toolkit.
Example:
EnProcessor().get_parser('Messi is a football player.')
>> '(ROOT\n (S\n (NP (NNP Messi))\n (VP (VBZ is) (NP (DT a) (NN football) (NN player)))\n (. .)))'
- Parameters
sentence (str|list) – A sentence to be parsed.
- Returns
The lexicalized parse tree as a string.
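The bracketed string returned here can be turned into a nested structure with a small recursive parser; this sketch is stdlib-only and independent of NLTK:

```python
import re

def parse_bracketed(s):
    """Parse a bracketed parse-tree string into nested [label, children...] lists."""
    tokens = re.findall(r'\(|\)|[^\s()]+', s)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                       # consume '('
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            if tokens[pos] == '(':
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                       # consume ')'
        return [label] + children

    return node()
```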
-
get_dep_parser(sentence, is_one_sent=True, split_by_space=False)
Dependency parsing based on a spaCy model.
Example:
EnProcessor().get_dep_parser('The quick brown fox jumps over the lazy dog.')
>>
The    DT   4  det
quick  JJ   4  amod
brown  JJ   4  amod
fox    NN   5  nsubj
jumps  VBZ  0  root
over   IN   9  case
the    DT   9  det
lazy   JJ   9  amod
dog    NN   5  obl
- Parameters
sentence (str|list) – input text string
is_one_sent (bool) – whether to apply sentence tokenization
split_by_space (bool) – whether to tokenize the sentence by splitting on spaces (" ")
- Returns
Dependency parsing tags.
-
get_lemmas(token_and_pos)
Lemmatize function. This method uses nltk.WordNetLemmatizer to lemmatize tokens.
- Parameters
token_and_pos (tuple|list) – A (token, POS) pair or a list of such pairs.
- Returns
A lemma or a list of lemmas, depending on the input.
-
get_all_lemmas(pos)
Lemmatize function for all words in WordNet.
- Parameters
pos – A POS tag or a list of POS tags.
- Returns
A list of lemmas that have the given pos tag.
-
get_delemmas(lemma_and_pos)
Delemmatize function.
This method uses a pre-processed dict that maps (lemma, POS) to the original token for delemmatizing.
- Parameters
lemma_and_pos (tuple|list) – A tuple or a list of (lemma, POS).
- Returns
A word or a list of words, each representing the inflected form of the corresponding input lemma.
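The pre-processed dict can be sketched as follows; the helper names and data here are illustrative, not textflint's actual internals:

```python
def build_delemma_map(observations):
    """Build a (lemma, POS) -> surface-form map from (token, lemma, POS) triples."""
    mapping = {}
    for token, lemma, pos in observations:
        mapping.setdefault((lemma, pos), token)
    return mapping

def get_delemmas(lemma_and_pos, mapping):
    """Return the surface form for one (lemma, POS) pair or a list of pairs;
    fall back to the lemma itself when the pair is unseen."""
    if isinstance(lemma_and_pos, tuple):
        return mapping.get(lemma_and_pos, lemma_and_pos[0])
    return [mapping.get(pair, pair[0]) for pair in lemma_and_pos]
```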
-
get_synsets(tokens_and_pos, lang='eng')
Get synsets from WordNet.
- Parameters
tokens_and_pos (list) – A list of tuples, (token, POS).
lang (str) – language name
- Returns
A list of str, representing the sense of each input token.
-
get_antonyms(tokens_and_pos, lang='eng')
Get antonyms from WordNet.
This method uses NLTK WordNet to generate antonyms, and uses the “lesk” algorithm, proposed by Michael E. Lesk in 1986, to select the sense.
- Parameters
tokens_and_pos (list) – A list of tuples, (token, POS).
lang (str) – language name.
- Returns
A list of str, representing the antonyms of each input token.