PyMuPDFEntityBBoxFilter
- class PyMuPDFEntityBBoxFilter(config)
Bases: Filter
After SquirroEntityFilter has created entities with offsets/lengths, this step uses PyMuPDF to compute precise bounding boxes and fills in page_to_rects for each extract.
- Parameters:
Methods Summary
- get_ent_ranking(content, query)
Get the ordinal of the entity based on its text and page.
- get_pdf_files(fields)
- process_doc(doc)
Process a document.
Methods Documentation
- get_ent_ranking(content, query)
Get the ordinal of the entity based on its text and page. This assumes entities are grouped by text and page.
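One way to picture an ordinal within a text-and-page group is sketched below. It is a hypothetical illustration, not the method's code: the meaning of `content` and `query` is not documented here, so the sketch does not mirror that signature. The helper `occurrence_ordinals`, the dictionary keys `text`, `page` and `offset`, and the 0-based ranking by ascending offset are all assumptions.

```python
from collections import defaultdict


def occurrence_ordinals(entities):
    """Return each entity's 0-based rank among entities sharing text and page.

    `entities` is a hypothetical list of dicts with "text", "page" and
    "offset" keys; ranks follow ascending offset within each group.
    """
    groups = defaultdict(list)
    for index, entity in enumerate(entities):
        key = (entity["text"], entity["page"])
        groups[key].append((entity["offset"], index))

    ordinals = [0] * len(entities)
    for members in groups.values():
        # Sort by offset so the earliest occurrence on the page gets rank 0.
        for rank, (_, index) in enumerate(sorted(members)):
            ordinals[index] = rank
    return ordinals


entities = [
    {"text": "Acme", "page": 0, "offset": 40},
    {"text": "Acme", "page": 0, "offset": 5},
    {"text": "Acme", "page": 1, "offset": 12},
    {"text": "Bolt", "page": 0, "offset": 20},
]
print(occurrence_ordinals(entities))  # [1, 0, 0, 0]
```

An ordinal of this kind could then be used to pick the matching hit when the same text appears several times on one page, for example the second rectangle returned by a text search for the second occurrence.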
- Return type: