PyMuPDFEntityBBoxFilter

class PyMuPDFEntityBBoxFilter(config)

Bases: Filter

After SquirroEntityFilter has created entities with offsets/lengths, this step uses PyMuPDF to compute precise bounding boxes and fills in page_to_rects for each extract.

Parameters:
  • type (str) – bbox_filter

  • entities_field (str) – Name of the input field where entities are stored.

  • output_field (str) – Name of the output field where the updated entities will be stored.

  • pdf_input_field (str, None) – Name of the field that holds the PDF page content.
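
As a rough illustration of the parameters above, a step configuration might be written as a plain dictionary. This is a minimal sketch, not the library's own loader: only the four keys come from the parameter list, while every field value, the REQUIRED_KEYS set, and the missing_keys helper are hypothetical. Any additional keys a surrounding pipeline definition may need are left out.

```python
# Hypothetical configuration for the bbox_filter step.
# Only the key names come from the documented parameters;
# the field values below are illustrative assumptions.
bbox_filter_config = {
    "type": "bbox_filter",
    "entities_field": "entities",    # assumed input field name
    "output_field": "entities",      # assumed output field name
    "pdf_input_field": "pdf_pages",  # assumed; documented as str or None
}

# Assumption: keys treated as mandatory for this sketch only.
REQUIRED_KEYS = {"type", "entities_field", "output_field"}


def missing_keys(config):
    """Return the sorted list of assumed-required keys absent from config."""
    return sorted(REQUIRED_KEYS - config.keys())


print(missing_keys(bbox_filter_config))
```

A check like this only catches omitted keys before the step is built; it does not validate that the named fields exist on a document.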

Methods Summary

get_ent_ranking(content, query)

Get the ordinal of the entity based on its text and page.

get_pdf_files(fields)

process_doc(doc)

Process a document.

Methods Documentation

get_ent_ranking(content, query)

Get the ordinal of the entity based on its text and page. This assumes entities are grouped by text and page.

Return type:

dict
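
The idea of an ordinal within a text-and-page group can be sketched as follows. This is not the implementation of get_ent_ranking, whose content and query arguments are not described here. The helper name ordinal_by_text_and_page and the "text" and "page" entity keys are assumptions made for illustration.

```python
from collections import defaultdict


def ordinal_by_text_and_page(entities):
    """Map each entity's list index to its 0-based ordinal among
    the entities that share the same (text, page) pair.

    Hypothetical helper: the "text" and "page" keys are assumed.
    """
    seen = defaultdict(int)  # occurrences counted so far, per (text, page)
    ordinals = {}
    for index, entity in enumerate(entities):
        key = (entity["text"], entity["page"])
        ordinals[index] = seen[key]
        seen[key] += 1
    return ordinals


entities = [
    {"text": "Acme", "page": 1},
    {"text": "Acme", "page": 1},
    {"text": "Acme", "page": 2},
    {"text": "Zurich", "page": 1},
]
print(ordinal_by_text_and_page(entities))  # {0: 0, 1: 1, 2: 0, 3: 0}
```

An ordinal of this kind could plausibly be used to pick the n-th occurrence of a string on a page when several matches exist, though how the filter uses it is not stated in this reference.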

get_pdf_files(fields)

Return type:

Iterator[tuple[dict, str]]

process_doc(doc)

Process a document.

Parameters:

doc (Document) – Document

Returns:

Processed document

Return type:

Document