How to Access File Contents in Pipelets

When importing files, you sometimes need to access the contents of these files in a pipelet.

This approach works only for server-side pipelets. It does not work for pipelets that the data loader executes directly (through its pipelets.json).

This work is typically performed by Python engineers.

Concept

Files are stored in the files list of the Squirro item. Currently, each Squirro item can refer to only one file. An example item with a file might look like this (only the most important fields are shown):

{
    "title": "example_document.pdf",
    "id": "QDBwKommTM6rXm8Nh6Behw",
    "item_id": "first_item",
    "files": [
        {
            "mime_type": "application/pdf",
            "id": "MwaAYRpSQY2vbTJcDtETqw",
            "url": "/files/test_pdfs/example_document.pdf",
            "content_url": "storage://test_pdfs/example_document.pdf"
        }
    ]
}

To get access to the data, use the content_url path. This URL points to a storage bucket (see Using storage buckets). The data behind it is not necessarily present on the current node; it may, for example, reside on Amazon S3.

You can use the StorageHandler helper to access this content, so the pipelet does not have to implement the retrieval logic itself.
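
To make the structure of a content_url more concrete, the following sketch takes one apart with the Python standard library. It is an illustration only and not how StorageHandler works internally. The helper name `split_content_url` is hypothetical, and reading the first URL component as the bucket name is an assumption based on the example item and on the `[storage_test_pdfs]` section used in the test configuration further down.

```python
from urllib.parse import urlsplit


def split_content_url(content_url):
    """Split a storage:// URL into (bucket, path inside the bucket).

    Hypothetical helper for illustration; not part of the Squirro SDK.
    """
    parts = urlsplit(content_url)
    if parts.scheme != "storage":
        raise ValueError(f"not a storage URL: {content_url!r}")
    # Assumption: the component right after the scheme names the bucket,
    # and the remainder is the path of the file inside that bucket.
    return parts.netloc, parts.path.lstrip("/")


bucket, path = split_content_url("storage://test_pdfs/example_document.pdf")
print(bucket, path)
```

In a pipelet there is no need to do this by hand; pass the unmodified content_url to StorageHandler as shown in the next section.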

StorageHandler

The StorageHandler helper opens the content_url resource and provides access to the data. The following snippet shows you how to use the helper:

from squirro.common.config import get_config
from squirro.lib.storage.handler import StorageHandler
from squirro.sdk import PipeletV1, require

@require('log')
class PdfTestPipelet(PipeletV1):
    def consume(self, item):
        # Only process items which have a file
        if not item.get('files'):
            return item

        # Limit to PDF files
        mime = item['files'][0].get('mime_type')
        if mime != 'application/pdf':
            return item

        content_url = item['files'][0].get('content_url')
        if not content_url:
            return item

        keywords = item.setdefault('keywords', {})

        storage_config = get_config('squirro.lib.storage')
        storage = StorageHandler(storage_config)
        with storage.open(content_url) as f:
            # Work with file object `f`. For example:
            content = f.read()
            # Note: this requires `file_size` to be defined as an integer facet
            keywords['file_size'] = [len(content)]

        return item
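
The guard clauses at the top of consume() do not depend on Squirro, so they can be factored into a plain function and checked without a server. The following is a minimal sketch of that idea; the function name `first_file_content_url` is hypothetical and not part of the Squirro SDK.

```python
def first_file_content_url(item, mime_type="application/pdf"):
    """Return the content_url of the item's first file, or None.

    None is returned when the item has no files, when the first file has a
    different MIME type, or when it carries no content_url.
    """
    files = item.get("files") or []
    if not files:
        return None
    first = files[0]
    if first.get("mime_type") != mime_type:
        return None
    return first.get("content_url") or None


example_item = {
    "title": "example_document.pdf",
    "files": [
        {
            "mime_type": "application/pdf",
            "content_url": "storage://test_pdfs/example_document.pdf",
        }
    ],
}
print(first_file_content_url(example_item))
```

With such a helper, consume() could return the item unchanged whenever the function returns None and open the storage only otherwise.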

Testing

To test these pipelets without re-ingesting files every time, you can use a local test setup.

Warning

Testing pipelets that use the StorageHandler is currently not supported with the Squirro toolbox, because the toolbox lacks the required packages. Run the tests on a Squirro server instead. To activate the virtual environment that has all necessary packages, run squirro_activate3 on the command line.

Folder Setup

Next to the pipelet, create three additional folders:
  • conf
  • items
  • test_pdfs

Config File

Create a configuration file conf/storage.ini with the following contents:

[storage_test_pdfs]
container = file
directory = test_pdfs/
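
The folder setup and the configuration file can also be created by a small script, which helps when the same layout is needed for several pipelets. This is a sketch using only the standard library; the function name `create_test_layout` is hypothetical, and the file contents are copied from the configuration shown above.

```python
from pathlib import Path

# Contents of conf/storage.ini, as given in this guide.
STORAGE_INI = """\
[storage_test_pdfs]
container = file
directory = test_pdfs/
"""


def create_test_layout(base_dir):
    """Create conf/, items/ and test_pdfs/ under base_dir and write conf/storage.ini."""
    base = Path(base_dir)
    for name in ("conf", "items", "test_pdfs"):
        (base / name).mkdir(parents=True, exist_ok=True)
    config_path = base / "conf" / "storage.ini"
    config_path.write_text(STORAGE_INI, encoding="utf-8")
    return config_path
```

Calling `create_test_layout(".")` from the folder that contains the pipelet produces the layout described in this section. Existing folders are left in place, but an existing conf/storage.ini is overwritten.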

Test File

Create a JSON file for each test case. The files look like the following:

items/example_document.json (example):
{
    "title": "example_document.pdf",
    "id": "QDBwKommTM6rXm8Nh6Behw",
    "item_id": "first_item",
    "files": [
        {
            "mime_type": "application/pdf",
            "id": "MwaAYRpSQY2vbTJcDtETqw",
            "url": "/files/test_pdfs/example_document.pdf",
            "content_url": "storage://test_pdfs/example_document.pdf"
        }
    ]
}

For each test case, adjust the content_url and store the corresponding binary file in the test_pdfs folder.
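
When there are many test PDFs, the item files can be generated instead of written by hand. The sketch below writes one JSON file per PDF found in test_pdfs. The function name `write_item_files` is hypothetical. The generated items contain only the fields that the example pipelet reads (title, mime_type, content_url); whether your pipelet or the test runner also needs id, item_id, or url is an assumption to verify, and those fields can be copied from the example above if required.

```python
import json
from pathlib import Path


def write_item_files(pdf_dir="test_pdfs", items_dir="items", bucket="test_pdfs"):
    """Write one item JSON file per PDF in pdf_dir and return the created paths."""
    out_dir = Path(items_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    created = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        item = {
            "title": pdf.name,
            "files": [
                {
                    "mime_type": "application/pdf",
                    # Same URL pattern as in the example item above.
                    "content_url": f"storage://{bucket}/{pdf.name}",
                }
            ],
        }
        target = out_dir / f"{pdf.stem}.json"
        target.write_text(json.dumps(item, indent=4), encoding="utf-8")
        created.append(target)
    return created
```

Run `write_item_files()` from the folder that contains the pipelet; an existing item file with the same name is overwritten.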

Running Test

Now run the test using pipelet consume, passing the pipelet file (extract.py in this example) and an item file:

pipelet consume extract.py -i items/example_document.json
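
To run the pipelet against every item file in the items folder, the command can be wrapped in a short script. This is a sketch under two assumptions: the pipelet command is on the PATH (for example after activating the environment with squirro_activate3), and it exits with a non-zero status when a run fails. The function names `build_consume_command` and `run_all` are hypothetical; only the `consume` subcommand and the `-i` option are taken from the command above.

```python
import subprocess
from pathlib import Path


def build_consume_command(pipelet_file, item_file):
    """Return the argument list for one `pipelet consume` run."""
    return ["pipelet", "consume", str(pipelet_file), "-i", str(item_file)]


def run_all(pipelet_file="extract.py", items_dir="items"):
    """Run the pipelet once per item file, stopping at the first failing run."""
    for item_file in sorted(Path(items_dir).glob("*.json")):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(build_consume_command(pipelet_file, item_file), check=True)
```

Call `run_all()` from the folder that contains extract.py and the items folder.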