PdfSentencesTokenizer

class squirro.lib.nlp.steps.tokenizers.PdfSentencesTokenizer(config)

Bases: squirro.lib.nlp.steps.batched_step.BatchedStep

The PDF Sentences Tokenizer splits PDF files into sentences and keeps track of the positional information of each extracted sentence.

Note - this step only works inside the Squirro Platform due to internal dependencies.

Input - requires a field named files of type dict containing the key-value pairs {"mime_type": "application/pdf", "content_url": "PATH"}, where PATH is an existing path to a PDF file.

Output - the output field is filled with data of type list [ dict ]. Each dict has the following structure: {"text": "EXTRACTED_SENTENCE", "page_to_rects": {PAGE_NUM: [{"x": X_POS, "y": Y_POS, "height": HEIGHT, "width": WIDTH}]}}
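The input and output shapes described above can be sketched in Python. All concrete values below (the path, page number, and coordinates) are hypothetical placeholders, not output produced by the step itself:

```python
# Illustrative shape of the "files" input field: a list of file descriptors,
# of which only PDF entries are processed by this tokenizer.
input_files = [
    {"mime_type": "application/pdf", "content_url": "/tmp/report.pdf"},
]

# Illustrative shape of the output field: one dict per extracted sentence,
# mapping each page number to the bounding rectangles of that sentence.
output_texts = [
    {
        "text": "EXTRACTED_SENTENCE",
        "page_to_rects": {
            1: [{"x": 72.0, "y": 540.5, "height": 11.2, "width": 310.7}],
        },
    },
]
```

A sentence that wraps across a page break would simply carry entries for both page numbers in its page_to_rects dict.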

Parameters
  • type (str) – pdf_sentences

  • output_field (str, "texts") – output field for PDF sentences data.

  • default_language (str, "en") – Default language if language_field is not present.

  • language_field (str, "language") – Document field that gives the language.

  • cleaning (dict, {}) – dict of additional cleaning rules, e.g. {"U.N.": "UN"}.

Example

{
    "step": "tokenizer",
    "type": "pdf_sentences",
    "cleaning": {
        "\t": " ",
        "\n": "",
        "  ": " ",
        "approx.": "approx",
        "etc.": "etc",
        "i.e.": "ie"
    }
}
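The effect of the cleaning rules can be sketched as plain substring replacements applied to each extracted sentence before tokenization. This is an assumption about the mechanism (the actual order and matching behavior are internal to the step); the helper name apply_cleaning is hypothetical:

```python
def apply_cleaning(text: str, cleaning: dict) -> str:
    # Hypothetical sketch: apply each cleaning rule as a literal
    # substring replacement, in the order the rules are defined.
    for pattern, replacement in cleaning.items():
        text = text.replace(pattern, replacement)
    return text

cleaning = {
    "\t": " ",
    "\n": "",
    "  ": " ",
    "approx.": "approx",
    "etc.": "etc",
    "i.e.": "ie",
}
apply_cleaning("Costs rise\tby approx. 5%,\netc.", cleaning)
# → "Costs rise by approx 5%,etc"
```

Rules like "approx." → "approx" strip trailing periods from common abbreviations so they are not mistaken for sentence boundaries.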

Methods Summary

get_pdf_files(fields)

    Return type: Iterator[Tuple[dict, str]]

process_doc(doc)

    Process a document.

Methods Documentation

get_pdf_files(fields)

    Return type: Iterator[Tuple[dict, str]]
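Given the Input description and the return type above, get_pdf_files plausibly filters the files field for PDF entries and yields each descriptor with its path. A minimal sketch, assuming the field layout shown in the Input section (the implementation inside the platform may differ):

```python
from typing import Iterator, Tuple

def get_pdf_files(fields: dict) -> Iterator[Tuple[dict, str]]:
    # Hypothetical sketch: yield each PDF file descriptor together with
    # its content_url path, skipping attachments with other MIME types.
    for file_info in fields.get("files", []):
        if file_info.get("mime_type") == "application/pdf":
            yield file_info, file_info["content_url"]

fields = {
    "files": [
        {"mime_type": "application/pdf", "content_url": "/tmp/a.pdf"},
        {"mime_type": "image/png", "content_url": "/tmp/b.png"},
    ]
}
pdfs = list(get_pdf_files(fields))
# only the PDF entry is yielded
```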

process_doc(doc)

    Process a document.

    Parameters: doc (Document) – the document to process

    Returns: Processed document

    Return type: Document