PdfSentencesTokenizer

class PdfSentencesTokenizer(config)

Bases: BatchedStep

The PDF Sentences Tokenizer splits PDF files into sentences and keeps track of the positional information of each extracted sentence.

Note: For performance reasons, this step has been deprecated in favor of the "content-conversion" extraction step in the data ingestion pipeline. If "content-conversion" is already in the pipeline, this step does nothing. This step only works inside the Squirro Platform due to internal dependencies.
Input - requires a field named files of type dict containing the key-value pairs {"mime_type": "application/pdf", "content_url": "PATH"}, where PATH is an existing path to a PDF file.

Output - the output field is filled with data of type list[dict]. Each dict has the following structure: {"text": "EXTRACTED_SENTENCE", "page_to_rects": {PAGE_NUM: [{"x": X_POS, "y": Y_POS, "height": HEIGHT, "width": WIDTH}]}}

Parameters
type (str) – pdf_sentences
output_field (str, "texts") – Output field for the PDF sentences data.
default_language (str, "en") – Default language to use if language_field is not present.
language_field (str, "language") – Document field that provides the language.
cleaning (dict, {}) – Dict of additional cleaning rules, for example {"U.N.": "UN"}
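The input and output shapes described above can be sketched as follows. This is illustrative only: the file path, rect values, and the helper function are made-up assumptions, since the real step runs only inside the Squirro Platform.

```python
# Illustrative shapes only; "/data/report.pdf" and the rect numbers are
# invented for this sketch.

# Input: a "files" field of type dict, pointing at an existing PDF.
doc_in = {
    "files": {"mime_type": "application/pdf", "content_url": "/data/report.pdf"}
}

# Output: the configured output_field (default "texts") is a list[dict],
# one entry per extracted sentence, with bounding rectangles per page.
doc_out = {
    "texts": [
        {
            "text": "EXTRACTED_SENTENCE",
            "page_to_rects": {
                1: [{"x": 72.0, "y": 540.0, "height": 11.5, "width": 310.0}]
            },
        }
    ]
}

def sentences_on_page(texts, page_num):
    """Return sentences that have at least one rect on the given page."""
    return [t["text"] for t in texts if page_num in t["page_to_rects"]]

print(sentences_on_page(doc_out["texts"], 1))  # ['EXTRACTED_SENTENCE']
```

A downstream step could use the page_to_rects data, for instance, to highlight each sentence at its original position in the PDF.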
Example
{
    "step": "tokenizer",
    "type": "pdf_sentences",
    "cleaning": {
        "\t": " ",
        "\n": "",
        " ": " ",
        "approx.": "approx",
        "etc.": "etc",
        "i.e.": "ie"
    }
}
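The cleaning rules could plausibly be applied as plain string replacements before sentence splitting; that mechanism is an assumption here (the step's implementation is internal), and the sketch uses only a subset of the rules from the example above.

```python
# Hedged sketch: applying cleaning rules as ordered string replacements.
# Whether the real step does exactly this is an assumption.
cleaning = {
    "\t": " ",
    "approx.": "approx",
    "i.e.": "ie",
}

def apply_cleaning(text: str, rules: dict) -> str:
    """Apply each replacement rule to the text, in insertion order."""
    for old, new in rules.items():
        text = text.replace(old, new)
    return text

raw = "Costs rise approx.\t10%, i.e. by a lot."
print(apply_cleaning(raw, cleaning))
# Costs rise approx 10%, ie by a lot.
```

Mapping abbreviations like "i.e." to forms without a trailing period prevents the sentence splitter from treating the period as a sentence boundary.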
Methods Summary
get_pdf_files(fields)

process_doc(doc) – Process a document.
Methods Documentation