How to Access File Contents in Pipelets#
When importing files, you sometimes need to access the contents of those files in a pipelet.
This process only works for server-side pipelets, not for pipelets executed directly in the context of the data loader (using its pipelets.json).
This work is typically performed by Python engineers.
Concept#
Files are stored in the Squirro item in the files list. Currently, each Squirro item can only refer to one file. An example item with a file might look like this (extract of the most important fields only):
```json
{
    "title": "example_document.pdf",
    "id": "QDBwKommTM6rXm8Nh6Behw",
    "item_id": "first_item",
    "files": [
        {
            "mime_type": "application/pdf",
            "id": "MwaAYRpSQY2vbTJcDtETqw",
            "url": "/files/test_pdfs/example_document.pdf",
            "content_url": "storage://test_pdfs/example_document.pdf"
        }
    ]
}
```
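Because the files entry is a plain list of dictionaries, you can inspect it with ordinary dictionary access before involving any Squirro machinery. A minimal sketch, reusing an abbreviated version of the example item above (the helper function name is ours, not part of the Squirro SDK):

```python
# Abbreviated version of the example item shown above.
item = {
    "title": "example_document.pdf",
    "files": [
        {
            "mime_type": "application/pdf",
            "content_url": "storage://test_pdfs/example_document.pdf",
        }
    ],
}


def first_content_url(item):
    """Return the content_url of the item's first file, or None if absent."""
    files = item.get("files") or []
    if not files:
        return None
    return files[0].get("content_url")


print(first_content_url(item))
# storage://test_pdfs/example_document.pdf
```

The same guard clauses (missing files list, missing content_url) appear in the full pipelet below.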
To get access to the data, use the content_url path. This URL points to a storage bucket (see Using storage buckets), which may even point to data that is not physically present on the current node, for example on Amazon S3.
You can use the StorageHandler helper to access this content, so you do not have to implement the heavy lifting in pipelets yourself.
StorageHandler#
The StorageHandler helper opens the content_url resource and provides access to the data. The following snippet shows how to use the helper:
```python
from squirro.common.config import get_config
from squirro.lib.storage.handler import StorageHandler
from squirro.sdk import PipeletV1, require


@require('log')
class PdfTestPipelet(PipeletV1):
    def consume(self, item):
        # Only process items which have a file
        if not item.get('files'):
            return item

        # Limit to PDF files
        mime = item['files'][0].get('mime_type')
        if mime != 'application/pdf':
            return item

        content_url = item['files'][0].get('content_url')
        if not content_url:
            return item

        keywords = item.setdefault('keywords', {})

        storage_config = get_config('squirro.lib.storage')
        storage = StorageHandler(storage_config)
        with storage.open(content_url) as f:
            # Work with file object `f`. For example:
            content = f.read()
            # Note: this requires `file_size` to be defined as an integer facet
            keywords['file_size'] = [len(content)]

        return item
```
Testing#
To test these pipelets without having to re-ingest files continually, you can use a test setup.
Warning
Testing pipelets that make use of the StorageHandler is currently not supported with the Squirro toolbox due to missing packages in the toolbox. Execute the test on a Squirro server instead. You can activate the virtual environment that has all necessary packages by running squirro_activate3 on the command line.
Folder Setup#
Next to the pipelet, create three additional folders:

- conf
- items
- test_pdfs
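On the command line, the folder layout above can be created in one step:

```shell
# Create the three test folders next to the pipelet
mkdir -p conf items test_pdfs
```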
Config File#
Create a configuration file conf/storage.ini with the following contents:

```ini
[storage_test_pdfs]
container = file
directory = test_pdfs/
```
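The file uses standard INI syntax, so you can sanity-check it with Python's configparser before running the test. This is just a convenience check of the syntax, not part of the Squirro setup itself:

```python
import configparser

# The same config as conf/storage.ini above, inlined here for the check.
ini_text = """\
[storage_test_pdfs]
container = file
directory = test_pdfs/
"""

parser = configparser.ConfigParser()
parser.read_string(ini_text)

section = parser["storage_test_pdfs"]
print(section["container"])   # file
print(section["directory"])   # test_pdfs/
```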
Test File#
Create JSON files for your test cases that look like the following:
```json
{
    "title": "example_document.pdf",
    "id": "QDBwKommTM6rXm8Nh6Behw",
    "item_id": "first_item",
    "files": [
        {
            "mime_type": "application/pdf",
            "id": "MwaAYRpSQY2vbTJcDtETqw",
            "url": "/files/test_pdfs/example_document.pdf",
            "content_url": "storage://test_pdfs/example_document.pdf"
        }
    ]
}
```
Change the content_url for each file and store the corresponding binary file in the test_pdfs folder.
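If you have many test PDFs, you can generate the item JSON files instead of writing them by hand. A sketch using only the Python standard library (the helper function and the generated IDs are illustrative, not a Squirro API):

```python
import json
from pathlib import Path


def write_test_item(pdf_name, item_id, out_dir="items"):
    """Write a minimal test item pointing at test_pdfs/<pdf_name>."""
    item = {
        "title": pdf_name,
        "item_id": item_id,
        "files": [
            {
                "mime_type": "application/pdf",
                "url": f"/files/test_pdfs/{pdf_name}",
                "content_url": f"storage://test_pdfs/{pdf_name}",
            }
        ],
    }
    out_path = Path(out_dir) / f"{Path(pdf_name).stem}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(item, indent=4))
    return out_path


write_test_item("example_document.pdf", "first_item")
```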
Running Test#
Now run the test using pipelet consume:

```shell
pipelet consume extract.py -i items/example_document.json
```