SentencesNLTKTokenizer
- class SentencesNLTKTokenizer(config)
Bases: Tokenizer
Sentences Tokenizer splits the input fields by sentences with NLTK and additional custom rules.
Input - all input fields need to be of type str.
Output - all output fields are filled with data of type list[str].
- Parameters:
Example
{
    "step": "tokenizer",
    "type": "sentences_nltk",
    "input_fields": ["body"],
    "output_fields": ["sentences"],
    "rules": ["**", "...", "…", ": "],
    "cleaning": {
        "\t": " ",
        "\n": "",
        " ": " ",
        "approx.": "approx",
        "etc.": "etc",
        "i.e.": "ie"
    }
}
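To make the roles of "rules" and "cleaning" in the example above more concrete, here is a minimal Python sketch of one plausible reading of them. It is not the library's implementation. The order of operations (cleaning first, then sentence splitting, then splitting at each rule string) is an assumption, the names split_sentences and naive_sentence_split are hypothetical, and a simple regex splitter stands in for NLTK so the sketch runs without extra downloads.

```python
import re
from typing import Callable, Dict, List


def naive_sentence_split(text: str) -> List[str]:
    """Stand-in for an NLTK sentence splitter (assumption, not NLTK itself):
    break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_sentences(
    text: str,
    rules: List[str],
    cleaning: Dict[str, str],
    splitter: Callable[[str], List[str]] = naive_sentence_split,
) -> List[str]:
    # Assumed step 1: apply each cleaning entry as a literal find/replace,
    # e.g. "approx." -> "approx" so the abbreviation's dot does not end a sentence.
    for old, new in cleaning.items():
        text = text.replace(old, new)

    # Assumed step 2: split the cleaned text into sentences.
    sentences = splitter(text)

    # Assumed step 3: split each sentence further at every rule string.
    for rule in rules:
        pieces: List[str] = []
        for sentence in sentences:
            pieces.extend(sentence.split(rule))
        sentences = pieces

    # Drop empty fragments and surrounding whitespace.
    return [s.strip() for s in sentences if s.strip()]


# Mirrors the config: read the str field "body", write list[str] to "sentences".
doc = {"body": "Costs rose approx. 5% this year. Causes: fuel, rent etc. were higher.\tSee notes."}
doc["sentences"] = split_sentences(
    doc["body"],
    rules=["**", ": "],
    cleaning={"\t": " ", "approx.": "approx", "etc.": "etc"},
)
print(doc["sentences"])
```

In this sketch the cleaning entries for "approx." and "etc." remove the trailing dots before splitting, and the ": " rule breaks "Causes: fuel, ..." into two entries. Passing a different splitter callable would swap in another sentence splitter without changing the rest.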
Methods Summary
process_doc(doc): Process a document
Methods Documentation