Incremental Loading


The Squirro data loader handles incremental loading for data loader plugins.

The built-in support works when the data source exposes a sortable attribute whose value changes every time a record is updated.

A common attribute for this is a last modified timestamp.

To handle incremental loading, the connect() method takes two arguments: inc_column and max_inc_value.

inc_column is the name of the column on which incremental loading is done (specified on the command line with --incremental-column), and max_inc_value is the maximum value of that column received in the previous load. Well-behaved data loader plugins should honor both arguments and return only results where inc_column >= max_inc_value. The data loader then takes care of all the required bookkeeping.
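The filtering described above can be sketched as follows. This is a minimal illustration, not the Squirro implementation: ExampleSource, ROWS, and simulate_load are hypothetical names, the class does not derive from the real plugin base class, and the bookkeeping function only imitates what the data loader is described as doing. It also assumes that the values of the incremental column are directly comparable (here, datetime objects).

```python
from datetime import datetime

# Hypothetical in-memory "source"; a real plugin would query an API or database.
ROWS = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]


class ExampleSource:
    """Illustrative stand-in for a plugin; not the actual Squirro base class."""

    def __init__(self, rows):
        self.rows = rows
        self.selected = []

    def connect(self, inc_column=None, max_inc_value=None):
        rows = self.rows
        # Keep only rows at or after the value seen in the previous load.
        if inc_column and max_inc_value is not None:
            rows = [r for r in rows if r[inc_column] >= max_inc_value]
        # Return rows ordered by the incremental column.
        if inc_column:
            rows = sorted(rows, key=lambda r: r[inc_column])
        self.selected = rows


def simulate_load(source, inc_column, max_inc_value=None):
    """Imitates the data loader's bookkeeping (assumption, for illustration only)."""
    source.connect(inc_column, max_inc_value)
    rows = source.selected
    # Remember the highest value seen; keep the old one if nothing was returned.
    new_max = max((r[inc_column] for r in rows), default=max_inc_value)
    return rows, new_max


# First load: no previous value, so every row is returned.
first_rows, checkpoint = simulate_load(ExampleSource(ROWS), "updated_at")

# A new record appears in the source before the second load.
ROWS.append({"id": 4, "updated_at": datetime(2024, 1, 12)})
second_rows, checkpoint2 = simulate_load(ExampleSource(ROWS), "updated_at", checkpoint)
```

Because the comparison is inclusive (>=), the second load in this sketch returns the new row and also the row whose value equals the previous maximum.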

Data loader plugins that do not support incremental loading should raise an error when this option is specified:

def connect(self, inc_column=None, max_inc_value=None):
    if inc_column:
        raise ValueError("Incremental loading not supported.")

With most sources, incremental loading is only sensible and supported on one specific column or property. In that case, it is recommended to implement getIncrementalColumns() and to enforce the same restriction in connect():

def connect(self, inc_column=None, max_inc_value=None):
    if inc_column and inc_column != "updated_at":
        raise ValueError("Incremental loading is only supported on the updated_at column.")

def getIncrementalColumns(self):
    return ["updated_at"]
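One possible variation, shown here only as a sketch, is to have connect() validate against the list returned by getIncrementalColumns(), so that the supported column names are defined in a single place. ExampleSource is a hypothetical class name and does not derive from the real plugin base class.

```python
class ExampleSource:
    """Illustrative stand-in for a plugin; not the actual Squirro base class."""

    def getIncrementalColumns(self):
        # Single source of truth for the supported incremental columns.
        return ["updated_at"]

    def connect(self, inc_column=None, max_inc_value=None):
        supported = self.getIncrementalColumns()
        # Reject any incremental column that the plugin does not declare.
        if inc_column and inc_column not in supported:
            raise ValueError(
                "Incremental loading is only supported on: " + ", ".join(supported)
            )
```

With this arrangement, adding a second supported column only requires changing the returned list.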