PyMuPDFEntityBBoxFilter

class PyMuPDFEntityBBoxFilter(config)

Bases: Filter

This step runs after SquirroEntityFilter has created entities with offsets and lengths. It uses PyMuPDF to compute precise bounding boxes and fills in page_to_rects for each extract.
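As a rough illustration of the idea (not the filter's actual implementation), the sketch below searches a PDF for an entity's text with PyMuPDF and groups the resulting rectangles by page. The function names `search_rects` and `build_page_to_rects`, and the assumed shape of `page_to_rects` (a mapping from a zero-based page number to a list of `[x0, y0, x1, y1]` lists), are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def search_rects(pdf_path, text):
    """Yield (page_number, (x0, y0, x1, y1)) for every hit of `text`."""
    import fitz  # PyMuPDF; imported lazily so the helper below works without it

    with fitz.open(pdf_path) as pdf:
        for page in pdf:
            # Page.search_for returns a list of fitz.Rect objects.
            for rect in page.search_for(text):
                yield page.number, (rect.x0, rect.y0, rect.x1, rect.y1)


def build_page_to_rects(hits):
    """Group (page, rect) pairs into {page: [rect, ...]} (assumed shape)."""
    page_to_rects = defaultdict(list)
    for page_number, rect in hits:
        page_to_rects[page_number].append(list(rect))
    return dict(page_to_rects)
```

A caller could combine the two as `build_page_to_rects(search_rects("report.pdf", "Acme Corp"))`, where the file name and search text are placeholders.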

Parameters:

type (str) – bbox_filter
entities_field (str) – Name of the input field where entities are stored.
output_field (str) – Name of the output field where the updated entities will be stored.
pdf_input_field (str, None) – Name of the field that holds the PDF page content.
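A step configuration using these parameters might look like the following sketch. Only the `type` value `bbox_filter` and the parameter names come from the list above; the `step` key and all field names (`nlp_entities`, `pdf_pages`) are assumptions made for illustration.

```python
# Hypothetical configuration for the bounding-box step.
bbox_step = {
    "step": "filter",                  # assumed step category key
    "type": "bbox_filter",             # documented type identifier
    "entities_field": "nlp_entities",  # hypothetical input field name
    "output_field": "nlp_entities",    # hypothetical: write back to the same field
    "pdf_input_field": "pdf_pages",    # hypothetical; the parameter also accepts None
}
```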

Methods Summary

`get_ent_ranking`(content, query)	Get the ordinal of the entity based on its text and page.
`get_pdf_files`(fields)
`process_doc`(doc)	Process a document

Methods Documentation

get_ent_ranking(content, query)

Get the ordinal of the entity based on its text and page. This assumes entities are grouped by text and page.

Return type: dict
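The real signature takes `content` and `query`, whose structure is not documented here, so the following is only a sketch of the general idea of an ordinal among entities that share the same text and page. The function `assign_ordinals` and the `text` and `page` keys are hypothetical.

```python
from collections import Counter


def assign_ordinals(entities):
    """Map each entity's list index to its zero-based ordinal among
    earlier entities with the same (text, page) pair."""
    seen = Counter()
    ordinals = {}
    for idx, ent in enumerate(entities):
        key = (ent["text"], ent["page"])
        ordinals[idx] = seen[key]  # how many identical entities came before
        seen[key] += 1
    return ordinals
```

One plausible use of such an ordinal is to pick the matching hit when the same text occurs several times on a page, though the source does not state this.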

get_pdf_files(fields)

Return type: Iterator[tuple[dict, str]]

process_doc(doc)

Process a document

Parameters: doc (Document) – Document
Returns: Processed document
Return type: Document
