PdfPagesTokenizer
- class PdfPagesTokenizer(config)
Bases: BatchedStep
PDF Pages Extractor: reads each PDF in your files field and extracts full-page text content using PyMuPDF (fitz).
- Input: expects doc.fields['files'] to be a list of dicts:
{ 'mime_type': 'application/pdf', 'content_url': '<path_or_url>' }
- Output: writes to doc.fields[output_field] (default 'pages') a list of dicts:
[{ 'page_number': int, 'text': str }, …]
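The input and output shapes above can be illustrated with a minimal sketch. This is not the class's actual implementation: the helper names `extract_pages` and `pages_from_files` and the `open_pdf` parameter are hypothetical, and the sketch assumes 1-based page numbers, local file paths in `content_url`, and per-file numbering when several PDFs are present, none of which the documentation above confirms.

```python
from typing import Any, Callable, Dict, List, Optional


def extract_pages(
    path: str, open_pdf: Optional[Callable[[str], Any]] = None
) -> List[Dict[str, Any]]:
    """Return [{'page_number': int, 'text': str}, ...] for one PDF."""
    if open_pdf is None:
        # PyMuPDF is imported lazily so the helper can be exercised
        # with a stand-in opener when the library is not installed.
        import fitz
        open_pdf = fitz.open
    doc = open_pdf(path)
    try:
        # Assumption: page numbers start at 1.
        return [
            {"page_number": number, "text": page.get_text()}
            for number, page in enumerate(doc, start=1)
        ]
    finally:
        doc.close()


def pages_from_files(
    files: List[Dict[str, str]],
    open_pdf: Optional[Callable[[str], Any]] = None,
) -> List[Dict[str, Any]]:
    """Collect page dicts for every PDF entry in a 'files' list."""
    pages: List[Dict[str, Any]] = []
    for entry in files:
        # Skip anything that is not declared as a PDF.
        if entry.get("mime_type") != "application/pdf":
            continue
        # Assumption: content_url is a local path; a remote URL would
        # need to be downloaded before it can be opened.
        pages.extend(extract_pages(entry["content_url"], open_pdf))
    return pages
```

Calling `pages_from_files(doc.fields['files'])` would then yield a list in the same shape as the `pages` output field described above.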
- Parameters:
Methods Summary
process_doc(doc): Process a document.
Methods Documentation