PdfPagesTokenizer

class PdfPagesTokenizer(config)

Bases: BatchedStep

PDF Pages Extractor: reads each PDF listed in the document's files field and extracts the full text of each page using PyMuPDF (fitz).

Input: expects doc.fields['files'] to be a list of dicts:

{ 'mime_type': 'application/pdf', 'content_url': '<path_or_url>' }

Output: writes to doc.fields[output_field] (default 'pages') a list of dicts:

[{ 'page_number': int, 'text': str }, …]
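
The input and output shapes above can be illustrated with a minimal sketch. This is not the step's actual implementation. The helper names pdf_entries, to_pages, and read_page_texts are hypothetical, page numbering is assumed to start at 1 (the documentation does not say), and read_page_texts assumes content_url is a local path. Only fitz.open and Page.get_text are real PyMuPDF calls.

```python
def pdf_entries(files):
    """Keep only the entries of doc.fields['files'] that describe PDFs."""
    return [f for f in files if f.get("mime_type") == "application/pdf"]


def to_pages(page_texts):
    """Shape raw page texts as [{'page_number': int, 'text': str}, ...].

    Assumption: numbering starts at 1; the real step may number differently.
    """
    return [
        {"page_number": number, "text": text}
        for number, text in enumerate(page_texts, start=1)
    ]


def read_page_texts(path):
    """Return the plain text of every page of a local PDF via PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(path) as pdf:
        return [page.get_text() for page in pdf]
```

With PyMuPDF installed, to_pages(read_page_texts(entry['content_url'])) for each entry returned by pdf_entries would yield a list in the documented output shape.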

Parameters:
  • type (str) – 'pdf_pages'

  • output_field (str) – field to write the pages list (default: 'pages')

  • page_count (int, None) – if set, only extract this many pages from each PDF
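
As an illustration, a configuration that uses only the parameters listed above might look like the dict below. How the config reaches the step, and whether the framework requires further keys, is not covered here. The value 3 is arbitrary, and limit_pages is a hypothetical helper showing one plausible reading of page_count, namely keeping the first N pages.

```python
# Hypothetical config built only from the documented parameters.
config = {
    "type": "pdf_pages",
    "output_field": "pages",  # documented default
    "page_count": 3,          # arbitrary example value; None would mean no limit
}


def limit_pages(pages, page_count=None):
    """Assumed semantics: keep the first `page_count` pages, or all if None."""
    if page_count is None:
        return pages
    return pages[:page_count]
```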

Methods Summary

process_doc(doc)

Process a document

Methods Documentation

process_doc(doc)

Process a document

Parameters:

doc (Document) – Document

Returns:

Processed document

Return type:

Document