Data Loader Plugin Preview

The data loader frontend requests a preview from data loader plugins. This preview is used to show the record values to users, so they can decide how to apply the item field and facet mapping.

Preview handling is automatic and uses the getDataBatch method of the data source class, but there are some additional considerations to keep in mind.

Preview Mode

When a preview is requested, self.preview_mode is set to True on the data source class. The plugin can check this flag to change its behavior.

Considerations

Response Size

The preview data is sent in full to the browser. As a result, returning large responses, such as file content, should be avoided.

For example, a loader that returns file content might change each row as follows:

if self.preview_mode:
    row['file_content'] = 'BINARY…'
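
The same idea can be generalized to any oversized value. The following is a minimal sketch, not part of the data loader API: the helper name shrink_row_for_preview, the max_length threshold, and the placeholder format are all assumptions made for illustration.

```python
def shrink_row_for_preview(row, max_length=1000):
    """Return a copy of `row` with oversized values replaced by short placeholders.

    Hypothetical helper: the name, threshold, and placeholder text are
    illustrative and not part of the data loader API.
    """
    preview_row = {}
    for key, value in row.items():
        if isinstance(value, (bytes, bytearray)):
            # Binary payloads are never useful in the browser preview.
            preview_row[key] = f"<{len(value)} bytes omitted>"
        elif isinstance(value, str) and len(value) > max_length:
            # Keep the beginning of long text so users can still recognize it.
            preview_row[key] = value[:max_length] + "…"
        else:
            preview_row[key] = value
    return preview_row
```

A plugin could call such a helper on each row only when self.preview_mode is set, leaving the rows of a real load untouched.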

Incremental Loading

Even though the data loader handles incremental data out of the box, some plugins need to apply their own logic on top of that. For example, a loader might ensure that each result is returned only once. Where such logic is present, it must be disabled in preview mode.

Example:

  1. Imagine that the purpose of your loader is to retrieve a list of articles first (metadata only), and then fetch the content of each PDF article (which might be many megabytes). Consider this loader to be a long-running job.

  2. To avoid downloading articles twice, the loader keeps track of which articles have already been downloaded to disk (see data-loader-plugin-state).

  3. If the preview actually downloads articles (say, 10 of them), they are marked as downloaded. Note that the preview will be extremely slow if it needs to download all the data.

  4. Then, during the main load (post-preview), these 10 articles are skipped, since they have already been marked as downloaded.

  5. Since previewed items are never ingested into Squirro, you effectively lose the content of these 10 items.
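
One way to avoid this problem is sketched below. It is an illustration under stated assumptions, not the actual plugin base class: ArticleLoader, its constructor, the download method, and the plain set used as state are hypothetical stand-ins. Only preview_mode and getDataBatch are names taken from this page.

```python
class ArticleLoader:
    """Hypothetical stand-in for a data source class (not the real base class)."""

    def __init__(self, articles, downloaded_ids, preview_mode=False):
        self.articles = articles              # metadata rows, each with an "id"
        self.downloaded_ids = downloaded_ids  # stand-in for persisted plugin state
        self.preview_mode = preview_mode

    def download(self, article):
        # Placeholder for the expensive PDF download.
        return f"content of {article['id']}"

    def getDataBatch(self, batch_size):
        batch = []
        for article in self.articles:
            row = dict(article)
            if self.preview_mode:
                # Preview: no download and, crucially, no state update.
                row["content"] = "BINARY…"
            else:
                if article["id"] in self.downloaded_ids:
                    continue  # already handled by an earlier real load
                row["content"] = self.download(article)
                self.downloaded_ids.add(article["id"])
            batch.append(row)
            if len(batch) >= batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
```

Because the preview branch never touches the state, a later real load still sees every article as not yet downloaded.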

Expensive Queries

In preview mode, ensure that the query to the source system cannot time out.

One common way to improve performance is to request fewer records than you would for a real load.
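
As a rough sketch of that approach, the generator below shrinks the page size and stops early in preview mode. The function fetch_records, the fetch_page(offset, limit) callable, and the preview_limit default are assumptions made for this example. How the limit is actually passed on depends on the source system.

```python
def fetch_records(fetch_page, page_size, preview_mode, preview_limit=10):
    """Yield records page by page; in preview mode, stop after `preview_limit`.

    `fetch_page(offset, limit)` is a hypothetical callable that queries the
    source system and returns a list of records (empty when exhausted).
    """
    if preview_mode:
        # Ask the source for no more than the preview needs.
        page_size = min(page_size, preview_limit)
    offset = 0
    yielded = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            return
        for record in page:
            yield record
            yielded += 1
            if preview_mode and yielded >= preview_limit:
                return
        offset += len(page)
```

Passing the smaller limit to the source system itself matters here: trimming the result only after a full query has run would not reduce the risk of a timeout.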