ingester.ini

The ingester.ini config file, located at /etc/squirro/ingester.ini, configures the Ingester Service, which runs the workflows within the Squirro Pipeline.

The number of pipeline processors for each priority level (low, normal, and high) can be specified in the server configuration via the ingester.priorities.pool-*-processors options. For details, see the Configuring the Ingester for Prioritized Data Sources section of the documentation.
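
For example, assuming the wildcard expands to the three priority levels listed above, the configuration would contain one option per level (the worker counts below are illustrative):

ingester.priorities.pool-low-processors = 2
ingester.priorities.pool-normal-processors = 4
ingester.priorities.pool-high-processors = 2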

Example Configuration

[ingester]
# if there are no items to process, how long to wait until checking again
sleep_in_secs_if_no_batch = 5

# how often the ingester file reaper should check for orphaned files in the content streamer
clean_up_interval_in_hours = 1

[processor]
# number of workers to use on non-batched steps
workers = 10

# maximum number of times to retry an item that failed a processing step
max_retries = 10
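
As a rough illustration of the retry semantics that max_retries implies, here is a minimal Python sketch (not Squirro's actual implementation): an item that fails a step is retried up to max_retries times before the failure is propagated.

from typing import Any, Callable

def run_step(step: Callable[[Any], Any], item: Any, max_retries: int = 10) -> Any:
    # Retry a single processing step up to max_retries times.
    for attempt in range(1, max_retries + 1):
        try:
            return step(item)
        except Exception:
            if attempt == max_retries:
                raise  # give up; the failed batch is retained for inspection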

Data Retention Configuration

Data retention is configured in common.ini using the following default options:

[content_filesystem_stream]

# root directories for ingester content stream (space separated if more than one)
data_directories = /var/lib/squirro/inputstream

# how long to keep item batches that failed to be ingested
# total retention time is days + hours
days_to_retain_failed_batches = 30
hours_to_retain_failed_batches = 0
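
To see how the two values combine, here is a minimal Python sketch (not Squirro's clean-up code) that computes the resulting retention cutoff:

from datetime import datetime, timedelta, timezone

days_to_retain = 30   # days_to_retain_failed_batches
hours_to_retain = 0   # hours_to_retain_failed_batches

# Total retention time is days + hours; anything older is eligible for clean-up.
cutoff = datetime.now(timezone.utc) - timedelta(days=days_to_retain, hours=hours_to_retain)
print("clean up failed batches older than", cutoff.isoformat())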

Understanding The Squirro Pipeline

The Squirro Pipeline allows you to import large amounts of data in a resource-friendly way, so that Squirro remains responsive even during high-volume data inflow. This section describes the steps to follow to import large amounts of data into Squirro through the Pipeline.

Setting Up The Pipeline

The Pipeline relies on a file system to queue data in batches before Squirro inserts the data into Elasticsearch in bulk. Options for storing these temporary data files are:

  1. Local file system: This is the default out-of-the-box configuration, with queued data placed under /var/lib/squirro/inputstream/. We recommend RAID-1 or another form of redundancy to guard against loss of this temporary data. The IO generated by the Pipeline consists of sequential reads and writes, consistent with log-structured storage: data is appended “to the end” of the queue, then read and deleted “from the beginning” as it is inserted into Elasticsearch (see the sketch after this list). Note that in installations with multiple Squirro Cluster Nodes, the queue and file system are implicitly “sharded”, with different subsets of the data going to different servers and disks. This scales the data capacity of the Pipeline much like adding more Indexing Servers scales Elasticsearch.

  2. Amazon Elastic Block Store: For cloud installations hosted in AWS, Amazon EBS is a suitable choice.

  3. Network-attached storage: NAS will work, but it is expensive and therefore likely not the first choice.
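
To make the access pattern from option 1 concrete, here is a minimal Python sketch of a log-structured, directory-backed queue (purely illustrative; the function names are hypothetical and this is not Squirro's implementation):

import os

QUEUE_DIR = "/var/lib/squirro/inputstream"  # default queue location

def enqueue(batch_name: str, payload: bytes) -> None:
    # Sequential write: append a new batch file "to the end" of the queue.
    with open(os.path.join(QUEUE_DIR, batch_name), "wb") as f:
        f.write(payload)

def dequeue_oldest() -> bytes | None:
    # Sequential read and delete "from the beginning" of the queue.
    names = sorted(os.listdir(QUEUE_DIR))
    if not names:
        return None
    path = os.path.join(QUEUE_DIR, names[0])
    with open(path, "rb") as f:
        data = f.read()
    os.remove(path)
    return data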

Internally, the Pipeline is known as the “Ingester”, specifically the sqingesterd service. To have all new subscriptions import data via the Pipeline, configure the Squirro API Provider service (the service that all Squirro data imports interact with) as follows:

/etc/squirro/provider.ini

[provider]
# new pipeline configs:
# processing_mode controls the transition to the ingester ("new pipeline")
# modes: legacy (all sources go through only old pipeline and bulk pipeline),
#        tee (all sources go through legacy & new pipelines for "shadow-testing"),
#        default_legacy (by default sources go through legacy
#            pipelines, but can be overridden in source config to new pipeline),
#        default_ingester (by default sources go through new pipeline, but can be
#            overridden in source config to legacy pipelines),
#        ingester (all sources exclusively go through new pipeline)
#
processing_mode = default_ingester

Ensure that /var/lib/squirro/inputstream has enough space; at minimum, 10 GB is required. The following settings in /etc/squirro/common.ini ensure that the Pipeline does not fill up the disk and leaves at least 10 GB of space for other users of the same disk volume.

[content_filesystem_stream]
# directories for pipeline queued data
source_metadata_directory = /var/lib/squirro/inputstream
data_directories = %(source_metadata_directory)s

# minimum number of gigabytes of free space required to continue writing to the file system
back_off_when_data_disk_space_falls_below_in_gigabytes = 10
back_off_when_metadata_disk_space_falls_below_in_gigabytes = %(back_off_when_data_disk_space_falls_below_in_gigabytes)s
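
As an illustration of what the back-off threshold means in practice, a check along these lines (a Python sketch, not Squirro's code) would stop writes once free space on the queue volume drops below the configured 10 GB:

import shutil

THRESHOLD_GB = 10  # back_off_when_data_disk_space_falls_below_in_gigabytes
free_gb = shutil.disk_usage("/var/lib/squirro/inputstream").free / 10**9
if free_gb < THRESHOLD_GB:
    print(f"backing off: only {free_gb:.1f} GB free on the queue volume")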

If you elect not to use the local file system, change source_metadata_directory and data_directories accordingly.

Configuring The Pipeline

As in the previous section, ensure that there is enough disk space and sequential IO capacity for the temporarily queued data files.

Then follow the steps described under Processing Config to choose the Pipeline when creating a subscription.