PdfSentencesTokenizer#

class PdfSentencesTokenizer(config)#

Bases: BatchedStep

PDF Sentences Tokenizer splits PDF files into sentences and keeps track of the positional information of each extracted sentence.

Note: For performance reasons, this step has been deprecated in favor of the "content-conversion" extraction step in the data ingestion pipeline.

If "content-conversion" is already in the pipeline, this step does nothing. This step only works inside the Squirro Platform due to internal dependencies.

Input - this step requires a field with the name files of type dict containing the key-value pairs {"mime_type": "application/pdf", "content_url": "PATH"}, where PATH is an existing path to a PDF file.
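For illustration, a minimal value for the files field might look as follows (the file path is a hypothetical placeholder; it must point to an existing PDF):

```python
# Hypothetical example of the "files" input field expected by this step.
files = {
    "mime_type": "application/pdf",
    "content_url": "/tmp/reports/annual_report.pdf",  # placeholder path
}
```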

Output - the output field is filled with data of type list[dict]. Each dict has the following structure: {"text": "EXTRACTED_SENTENCE", "page_to_rects": {PAGE_NUM: [{"x": X_POS, "y": Y_POS, "height": HEIGHT, "width": WIDTH}]}}
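To make the output schema concrete, here is a sketch of a single extracted sentence; all text and coordinate values are made up for illustration:

```python
# Hypothetical example of one entry in the output field.
# "page_to_rects" maps a page number to the bounding rectangles
# (in page coordinates) that cover the sentence on that page.
sentence = {
    "text": "Revenue grew in the last quarter.",
    "page_to_rects": {
        1: [
            {"x": 72.0, "y": 340.5, "height": 11.2, "width": 418.7},
        ]
    },
}
```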

Parameters
  • type (str) – pdf_sentences

  • output_field (str, "texts") – output field for PDF sentences data.

  • default_language (str, "en") – Default language if language_field is not present.

  • language_field (str, "language") – Document field that gives the language.

  • cleaning (dict, {}) – dict of additional cleaning rules, e.g. {"U.N.": "UN"}

Example

{
    "step": "tokenizer",
    "type": "pdf_sentences",
    "cleaning": {
        "\t": " ",
        "\n": "",
        "  ": " ",
        "approx.": "approx",
        "etc.": "etc",
        "i.e.": "ie"
    }
}
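The cleaning rules map a string to its replacement in the extracted text. A minimal sketch of the assumed semantics, applying each rule as a literal search-and-replace (this is an illustration, not the platform's implementation):

```python
def apply_cleaning(text: str, rules: dict) -> str:
    """Apply each cleaning rule as a literal string replacement."""
    for old, new in rules.items():
        text = text.replace(old, new)
    return text


rules = {"\t": " ", "\n": "", "  ": " ", "i.e.": "ie"}
print(apply_cleaning("foo\tbar i.e. baz", rules))  # foo bar ie baz
```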

Methods Summary

get_pdf_files(fields)
  Return type: Iterator[Tuple[dict, str]]

process_doc(doc)
  Process a document.

Methods Documentation

get_pdf_files(fields)#

Return type
  Iterator[Tuple[dict, str]]

process_doc(doc)#

Process a document.

Parameters
  doc (Document) – Document

Returns
  Processed document

Return type
  Document