SentencesNLTKTokenizer#

class SentencesNLTKTokenizer(config)#

Bases: Tokenizer

The Sentences Tokenizer splits the input fields into sentences using NLTK and additional custom rules.

Input - all input fields must be of type str.

Output - all output fields are filled with data of type list[str].

Parameters
  • type (str) – sentences_nltk

  • rules (list, default []) – list of additional splitting rules. Example: ['-', '**']

  • cleaning (dict, default {}) – dict of additional cleaning (replacement) rules. Example: {'U.N.': 'UN'}

Example

{
    "step": "tokenizer",
    "type": "sentences_nltk",
    "input_fields": ["body"],
    "output_fields": ["sentences"],
    "rules": ["**", "...", "…",": "],
    "cleaning": {
        "\t": " ",
        "\n": "",
        "  ": " ",
        "approx.": "approx",
        "etc.": "etc",
        "i.e.": "ie"
    }
}
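
The following is a minimal Python sketch of the splitting logic this configuration describes, assuming cleaning replacements are applied before NLTK sentence splitting and the custom rules split the resulting sentences further; the actual implementation may differ.

# Requires the NLTK Punkt model: nltk.download('punkt')
import nltk

def split_sentences(text, rules=None, cleaning=None):
    # Apply cleaning replacements first, e.g. {"\t": " ", "approx.": "approx"}.
    for old, new in (cleaning or {}).items():
        text = text.replace(old, new)
    # Split into sentences with NLTK's Punkt sentence tokenizer.
    sentences = nltk.sent_tokenize(text)
    # Split further on each additional custom rule, e.g. ["**", "..."].
    for rule in rules or []:
        sentences = [part.strip() for s in sentences
                     for part in s.split(rule) if part.strip()]
    return sentences

split_sentences("Approx 10 people came. **They left early.",
                rules=["**"], cleaning={"\t": " "})
# -> ["Approx 10 people came.", "They left early."]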

Methods Summary

process_doc(doc)

Process a document.

Methods Documentation

process_doc(doc)#

Process a document.

Parameters

doc (Document) – the document to process

Returns

The processed document

Return type

Document
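
A hedged usage sketch, assuming the tokenizer is constructed directly from a config dict like the example above and that a Document can be built from a plain dict of fields; the Document constructor and field access shown here are illustrative assumptions, not part of the documented API.

config = {
    "step": "tokenizer",
    "type": "sentences_nltk",
    "input_fields": ["body"],
    "output_fields": ["sentences"],
    "rules": ["**"],
    "cleaning": {"\t": " "}
}
tokenizer = SentencesNLTKTokenizer(config)

doc = Document({"body": "First sentence. Second sentence."})  # hypothetical constructor
doc = tokenizer.process_doc(doc)
# The "sentences" field is expected to hold ["First sentence.", "Second sentence."]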