Content Augmentation#

The content augmentation enrichment fetches additional content from third-party systems.

For example, the data processing pipeline can be instructed to fetch content from external websites that are accessible over HTTP(S).

Enrichment name: content-augmentation

Stage: content

Overview#

The content-augmentation step fetches third-party content.

The step runs in two phases. First, the content of all uploaded files is fetched. That content is also used to guess the MIME type of each file. Then, and only if configured to do so, the enrichment downloads the content referenced by the link attribute and uses it to set the body attribute.

When link fetching is enabled, this step will often be combined with the Noise Removal enrichment.
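The link-fetching phase described above can be pictured with a small toy model. This is only an illustrative sketch, not how the enrichment is implemented: the function name `augment_item` and the injected `fetch` callable are hypothetical, and the file-fetching and MIME-type phase is not modelled at all.

```python
def augment_item(item, fetch_link_content, fetch):
    """Toy model of the link-fetching phase (hypothetical, for illustration).

    `fetch` is any callable that takes a URL and returns its content.
    It is injected so the sketch can run without network access.
    """
    result = dict(item)  # work on a copy, leave the input untouched
    link = result.get('link')
    # Only download when the option is switched on and the item has a link
    if fetch_link_content and link:
        result['body'] = fetch(link)
    return result


def fake_fetch(url):
    # Stand-in for a real HTTP download
    return 'content of ' + url


item = {'link': 'http://www.example.com', 'title': 'Item 01'}
with_body = augment_item(item, True, fake_fetch)
without_body = augment_item(item, False, fake_fetch)
```

With the option disabled the item passes through unchanged, which mirrors the default of `fetch_link_content` being false.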

Configuration#

Field               Description
fetch_link_content  Boolean value indicating whether to fetch the content from the website referenced by the link attribute. Default: false.
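As a minimal sketch, the option above can be wrapped in a small helper that builds the processing configuration dictionary used in the examples on this page. The helper name `content_augmentation_config` is hypothetical and not part of the SDK. The `enabled` key is taken from the examples below.

```python
def content_augmentation_config(fetch_link_content=False):
    """Build a processing config for the content-augmentation step.

    Hypothetical helper. The default mirrors the documented default
    of fetch_link_content (false).
    """
    return {
        'content-augmentation': {
            'enabled': True,
            'fetch_link_content': bool(fetch_link_content),
        },
    }


# Step enabled, link fetching left at its default (off)
default_config = content_augmentation_config()

# Step enabled, link fetching switched on
link_config = content_augmentation_config(fetch_link_content=True)
```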

Examples#

The following examples all use the SquirroClient (Python SDK) to show how the content augmentation enrichment step can be used.

Item Uploader#

The following example details how to enable third-party content fetching.

from squirro_client import ItemUploader

# processing config to fetch 3rd party content
processing_config = {
    'content-augmentation': {
        'enabled': True,
        'fetch_link_content': True,
    },
}

uploader = ItemUploader(…, processing_config=processing_config)
# item with a link attribute
items = [
    {
        'link': 'http://www.example.com',
        'title': 'Item 01',
    },
]
uploader.upload(items)

In the example above the processing pipeline is instructed to fetch the content from the site http://www.example.com and use it as the item body.

New Data Source#

The following example details how to enable third-party content fetching for a new feed data source.

from squirro_client import SquirroClient

client = SquirroClient(None, None, cluster='https://next.squirro.net/')
client.authenticate(refresh_token='293d…a13b')

# processing config to fetch 3rd party content
processing_config = {
    'content-augmentation': {
        'enabled': True,
        'fetch_link_content': True,
    },
}

# source configuration
config = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': processing_config
}

# create new source subscription
client.new_subscription(
    project_id='…', object_id='default', provider='feed', config=config)

Existing Data Source#

The following example details how to enable third-party content fetching for an existing source. Items that have already been processed are not updated.

from squirro_client import SquirroClient

client = SquirroClient(None, None, cluster='https://next.squirro.net/')
client.authenticate(refresh_token='293d…a13b')

# Get existing source configuration (including processing configuration)
source = client.get_subscription(project_id='…', object_id='…', subscription_id='…')
config = source.get('config', {})
# read the same 'processing' key that is written back below
processing_config = config.get('processing', {})

# Modify processing configuration
processing_config['content-augmentation'] = {
    'enabled': True,
    'fetch_link_content': True,
}
config['processing'] = processing_config
client.modify_subscription(project_id='…', object_id='…', subscription_id='…', config=config)

In the example above the processing pipeline is instructed to fetch the content for every new incoming item (from the link attribute) and use it as the item body.
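The example above replaces the whole content-augmentation entry. If the existing configuration already holds other settings, a merge that leaves them in place may be preferable. The following is a minimal sketch under that assumption. The helper name `enable_link_fetching` is hypothetical, and it assumes the processing configuration lives under the `processing` key, as in the examples on this page.

```python
import copy


def enable_link_fetching(config):
    """Return a copy of a source config with link fetching switched on.

    Hypothetical helper: other processing steps and any other keys of
    the content-augmentation entry are kept as they are.
    """
    updated = copy.deepcopy(config)  # do not mutate the caller's dict
    processing = updated.setdefault('processing', {})
    step = processing.setdefault('content-augmentation', {})
    step['enabled'] = True
    step['fetch_link_content'] = True
    return updated


existing = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': {'some-other-step': {'enabled': True}},  # placeholder step name
}
new_config = enable_link_fetching(existing)
```

The returned dictionary can then be passed as the `config` argument of `client.modify_subscription`, as in the example above.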