PdfSentencesTokenizer

class squirro.lib.nlp.steps.tokenizers.PdfSentencesTokenizer(config)

Bases: squirro.lib.nlp.steps.batched_step.BatchedStep

The PDF Sentences Tokenizer splits PDF files into sentences and keeps track of the positional information of each extracted sentence.

Note - this step only works inside the Squirro Platform due to internal dependencies.

Input - requires a field named files of type dict containing the key-value pairs {"mime_type": "application/pdf", "content_url": "PATH"}, where PATH is an existing path to a PDF file.

Output - the output field is filled with data of type list [ dict ]. Each dict has the following structure: {"text": "EXTRACTED_SENTENCE", "page_to_rects": {PAGE_NUM: [{"x": X_POS, "y": Y_POS, "height": HEIGHT, "width": WIDTH}]}}
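The input and output shapes described above can be sketched in Python. All concrete values below (the path, page number, and coordinates) are hypothetical placeholders, not output produced by the step itself:

```python
# Illustrative shape of the "files" input field: a list of file descriptors,
# of which only PDF entries are processed by this tokenizer.
input_files = [
    {"mime_type": "application/pdf", "content_url": "/tmp/report.pdf"},
]

# Illustrative shape of the output field: one dict per extracted sentence,
# mapping each page number to the bounding rectangles of that sentence.
output_texts = [
    {
        "text": "EXTRACTED_SENTENCE",
        "page_to_rects": {
            1: [{"x": 72.0, "y": 540.5, "height": 11.2, "width": 310.7}],
        },
    },
]
```

A sentence that wraps across a page break would simply carry entries for both page numbers in its page_to_rects dict.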

Parameters
  • type (str) – pdf_sentences

  • output_field (str, "texts") – output field for PDF sentences data.

  • default_language (str, "en") – Default language if language_field is not present.

  • language_field (str, "language") – Document field that gives the language.

  • cleaning (dict, {}) – dict of additional cleaning rules, e.g. {"U.N.": "UN"}.

Example

{
    "step": "tokenizer",
    "type": "pdf_sentences",
    "cleaning": {
        "\t": " ",
        "\n": "",
        "  ": " ",
        "approx.": "approx",
        "etc.": "etc",
        "i.e.": "ie"
    }
}
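The effect of the cleaning rules can be sketched as plain substring replacements applied to each extracted sentence before tokenization. This is an assumption about the mechanism (the actual order and matching behavior are internal to the step); the helper name apply_cleaning is hypothetical:

```python
def apply_cleaning(text: str, cleaning: dict) -> str:
    # Hypothetical sketch: apply each cleaning rule as a literal
    # substring replacement, in the order the rules are defined.
    for pattern, replacement in cleaning.items():
        text = text.replace(pattern, replacement)
    return text

cleaning = {
    "\t": " ",
    "\n": "",
    "  ": " ",
    "approx.": "approx",
    "etc.": "etc",
    "i.e.": "ie",
}
apply_cleaning("Costs rise\tby approx. 5%,\netc.", cleaning)
# → "Costs rise by approx 5%,etc"
```

Rules like "approx." → "approx" strip trailing periods from common abbreviations so they are not mistaken for sentence boundaries.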

Methods Summary

get_pdf_files(fields)

    Return type: Iterator[Tuple[dict, str]]

process_doc(doc)

    Process a document.

Methods Documentation

get_pdf_files(fields)

    Return type: Iterator[Tuple[dict, str]]
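Given the Input description and the return type above, get_pdf_files plausibly filters the files field for PDF entries and yields each descriptor with its path. A minimal sketch, assuming the field layout shown in the Input section (the implementation inside the platform may differ):

```python
from typing import Iterator, Tuple

def get_pdf_files(fields: dict) -> Iterator[Tuple[dict, str]]:
    # Hypothetical sketch: yield each PDF file descriptor together with
    # its content_url path, skipping attachments with other MIME types.
    for file_info in fields.get("files", []):
        if file_info.get("mime_type") == "application/pdf":
            yield file_info, file_info["content_url"]

fields = {
    "files": [
        {"mime_type": "application/pdf", "content_url": "/tmp/a.pdf"},
        {"mime_type": "image/png", "content_url": "/tmp/b.png"},
    ]
}
pdfs = list(get_pdf_files(fields))
# only the PDF entry is yielded
```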

process_doc(doc)

    Process a document.

    Parameters: doc (Document) – the document to process

    Returns: Processed document

    Return type: Document