DocumentUploader Class
As a best practice, provide the mime_type argument explicitly when calling upload(). If mime_type is omitted, the platform must infer it from the file content or filename further down the ingestion pipeline, which is less reliable. Setting it explicitly guarantees correct handling, in particular for Microsoft Office formats that require the PDF Conversion step before their text content can be extracted.
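The recommendation above can be sketched as a small helper that decides the mime-type on the client side and always passes it to upload(). This is a minimal sketch, not part of squirro_client: `EXPLICIT_MIME_TYPES`, `resolve_mime_type` and `upload_with_explicit_type` are hypothetical names, and the only assumption about the uploader is that it exposes `upload(filename, mime_type=...)` as documented on this page.

```python
import mimetypes
import os

# Hand-maintained table (an assumption of this sketch) for formats where a
# wrong guess is costly. Values are the standard media types for each extension.
EXPLICIT_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xls": "application/vnd.ms-excel",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".ppt": "application/vnd.ms-powerpoint",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
}


def resolve_mime_type(filename):
    """Return a mime-type for filename, or raise ValueError if it is unknown."""
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXPLICIT_MIME_TYPES:
        return EXPLICIT_MIME_TYPES[extension]
    # Fall back to the standard library lookup for everything else.
    guessed, _encoding = mimetypes.guess_type(filename)
    if guessed is None:
        raise ValueError(f"Cannot determine mime-type for {filename!r}")
    return guessed


def upload_with_explicit_type(uploader, filename):
    """Hypothetical wrapper: never call upload() without a mime_type."""
    uploader.upload(filename, mime_type=resolve_mime_type(filename))
```

Failing loudly on an unknown extension is a deliberate choice in this sketch: it surfaces the problem at upload time instead of leaving the decision to the ingestion pipeline.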
- class DocumentUploader(metadata_mapping=None, batch_size=10, batch_size_mb=150, default_mime_type_keyword=True, timeout_secs=300, **kwargs)
Document uploader class which simplifies the indexing of office documents. Default parameters are loaded from the .squirrorc file in your home directory. See the documentation of [[ItemUploader]] for a complete list of options regarding project selection, source selection, configuration, etc.
- Parameters:
batch_size – Number of items to send in one request.
batch_size_mb – Total size of documents, in megabytes, to send in one request. Once this size is reached, the client uploads the documents buffered so far.
metadata_mapping – A dictionary which contains the meta-data mapping.
default_mime_type_keyword – If set to True, a default keyword is added to the document which contains the mime-type.
timeout_secs – How many seconds to wait for data before giving up (default 300).
kwargs – Any additional keyword arguments are passed on to the [[ItemUploader]]. See the documentation of that class for details.
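To make the interplay of batch_size and batch_size_mb concrete, here is a minimal sketch of a buffer that sends a request when either limit is hit. It is one plausible reading of the parameter descriptions above, not the client's actual implementation; the `BatchBuffer` class and its `sent` attribute are hypothetical.

```python
class BatchBuffer:
    """Illustrative buffer with a count limit and a size limit (assumed semantics)."""

    def __init__(self, batch_size=10, batch_size_mb=150):
        self.batch_size = batch_size
        self.max_bytes = batch_size_mb * 1024 * 1024
        self.items = []       # names of documents waiting to be sent
        self.size_bytes = 0   # combined size of the waiting documents
        self.sent = []        # each entry stands for one request

    def add(self, name, size_bytes):
        # If the new document would reach the size limit, send the
        # existing documents first.
        if self.items and self.size_bytes + size_bytes >= self.max_bytes:
            self.flush()
        self.items.append(name)
        self.size_bytes += size_bytes
        # The count limit is checked after the document has been buffered.
        if len(self.items) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered; an empty buffer produces no request.
        if self.items:
            self.sent.append(list(self.items))
        self.items = []
        self.size_bytes = 0
```

With `BatchBuffer(batch_size=3, batch_size_mb=1)`, adding three 400 KiB documents sends the first two as one request when the third arrives, and the third stays buffered until the next flush.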
Typical usage:
>>> from squirro_client import DocumentUploader
>>> import os
>>> uploader = DocumentUploader(
...     project_title='My Project', token='<your token>',
...     cluster='https://demo.squirro.net/')
>>> uploader.upload(os.path.expanduser('~/Documents/test.pdf'))
>>> uploader.flush()
Meta-data mapping usage:
By default (i.e. for all document mime-types) map the original document size to a keyword field named “Doc Size” and the document mime-type to a keyword field named “Mime Type”:
>>> mapping = {'default': {'sq:size_orig': 'Doc Size',
...                        'sq:content-mime-type': 'Mime Type'}}
>>> uploader = DocumentUploader(metadata_mapping=mapping)
For a specific mime-type (e.g. ‘application/vnd.oasis.opendocument.text’) map the “meta:word-count” meta-data field value to a keyword field named “Word Count”:
>>> mapping = {'application/vnd.oasis.opendocument.text': {
...     'meta:word-count': 'Word Count'}}
>>> uploader = DocumentUploader(metadata_mapping=mapping)
Default meta-data fields available for mapping usage:
sq:doc_size: Converted document file size.
sq:doc_size_orig: Original uploaded document file size.
sq:content-mime-type: Document mime-type specified during upload operation.
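The mapping structure described above can be read as a lookup from meta-data field names to keyword names, keyed by mime-type. The sketch below illustrates that reading only; it does not reproduce how Squirro applies the mapping. The `apply_mapping` function is hypothetical, and it assumes that the 'default' rules are combined with the rules for the matching mime-type.

```python
def apply_mapping(mapping, mime_type, metadata):
    """Translate raw meta-data fields into keyword fields (illustrative only)."""
    # Assumption: start with the 'default' rules, then layer the
    # mime-type-specific rules on top of them.
    rules = dict(mapping.get("default", {}))
    rules.update(mapping.get(mime_type, {}))

    keywords = {}
    for field, keyword_name in rules.items():
        if field in metadata:
            # Keyword values are lists of strings.
            keywords[keyword_name] = [str(metadata[field])]
    return keywords


mapping = {
    "default": {"sq:content-mime-type": "Mime Type"},
    "application/vnd.oasis.opendocument.text": {"meta:word-count": "Word Count"},
}
odt_keywords = apply_mapping(
    mapping,
    "application/vnd.oasis.opendocument.text",
    {
        "sq:content-mime-type": "application/vnd.oasis.opendocument.text",
        "meta:word-count": 1200,
    },
)
```

Under these assumptions an OpenDocument text file receives both a “Mime Type” and a “Word Count” keyword, while a document of any other mime-type receives only “Mime Type”.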
- upload(filename, mime_type=None, title=None, doc_id=None, keywords=None, link=None, created_at=None, filename_encoding=None, content_url=None, priority=0, pipeline_workflow_id=None)
Method which will use the provided filename to create a Squirro item for upload. Items are buffered internally and uploaded according to the specified batch size. If mime_type is not provided, a simple lookup based on the filename extension is performed.
- Parameters:
filename – Read content from the provided filename.
mime_type – Optional mime-type for the provided filename.
title – Optional title for the uploaded document.
doc_id – Optional external document identifier.
keywords – Optional dictionary of document meta-data keywords. All values must be lists of strings.
link – Optional URL which points to the origin document.
created_at – Optional document creation date and time.
filename_encoding – Encoding of the filename.
content_url – Storage URL of this file. If this is set, the Squirro cluster will not copy the file.
priority – int, the ingestion priority for the dataset to be loaded. Currently only the values 0 and 1 are supported: 0 means that the items are loaded asynchronously, 1 means that they are loaded synchronously.
pipeline_workflow_id – str, id of an existing pipeline workflow which should be used to process the current batch of items. Can only be used with parameter priority set to 1.
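The constraints on priority and pipeline_workflow_id described above can be checked on the client side before calling upload(). The following is a minimal sketch; `build_upload_options` is a hypothetical helper, not part of squirro_client, and it only encodes the two rules stated in the parameter descriptions.

```python
def build_upload_options(priority=0, pipeline_workflow_id=None):
    """Validate and return keyword arguments for upload() (hypothetical helper)."""
    if priority not in (0, 1):
        raise ValueError("priority must be 0 (asynchronous) or 1 (synchronous)")
    if pipeline_workflow_id is not None and priority != 1:
        raise ValueError("pipeline_workflow_id can only be used with priority=1")

    options = {"priority": priority}
    if pipeline_workflow_id is not None:
        options["pipeline_workflow_id"] = pipeline_workflow_id
    return options


# Intended use, assuming an existing `uploader` and a real workflow id:
#   uploader.upload('test.pdf', mime_type='application/pdf',
#                   **build_upload_options(1, '<workflow id>'))
```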
Example:
>>> filename = 'test.pdf'
>>> mime_type = 'application/pdf'
>>> title = 'My Test Document'
>>> doc_id = 'doc01'
>>> keywords = {'Author': ['John Smith'], 'Tags': ['sales',
...     'marketing']}
>>> link = 'http://example.com/test.pdf'
>>> created_at = '2014-07-10T21:26:15'
>>> uploader.upload(filename, mime_type, title, doc_id, keywords,
...                 link, created_at)
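Because every value in the keywords dictionary must be a list of strings, raw meta-data collected from other systems usually needs a normalization pass first. A minimal sketch, assuming the raw values are scalars, lists or tuples; `normalize_keywords` is a hypothetical helper and not part of squirro_client.

```python
def normalize_keywords(raw):
    """Return a copy of raw in which every value is a list of strings."""
    keywords = {}
    for name, value in raw.items():
        if value is None:
            # Skip missing values instead of sending the string 'None'.
            continue
        if isinstance(value, (list, tuple)):
            values = list(value)
        else:
            # Wrap a single value (including a plain string) in a list.
            values = [value]
        keywords[name] = [str(item) for item in values]
    return keywords


keywords = normalize_keywords(
    {"Author": "John Smith", "Tags": ["sales", "marketing"], "Pages": 12, "Editor": None}
)
```

The result can then be passed as the keywords argument of upload().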
- flush()
Flush the internal buffer by uploading all documents.
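Since items are buffered internally, documents that do not fill a complete batch are only sent once flush() is called. One way to avoid forgetting that call is to wrap the upload loop in try/finally. This is a minimal sketch: `upload_all` is a hypothetical helper, and flushing after a failed upload is a design choice of the sketch, not documented client behavior.

```python
def upload_all(uploader, filenames, mime_type):
    """Upload every file with an explicit mime_type, then flush the buffer."""
    count = 0
    try:
        for filename in filenames:
            uploader.upload(filename, mime_type=mime_type)
            count += 1
    finally:
        # Send whatever is still buffered, even if an upload() call raised.
        uploader.flush()
    return count
```

The helper accepts any object with upload() and flush() methods, so it can be exercised with a stub in tests and with a DocumentUploader instance in production code.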