Item Format


Squirro items are represented in JSON format. This is true both for data loading and data consumption with the API. The following tables document the properties a Squirro item can have.

About Squirro Items

When executing a search, Squirro will show a list of matching items as the query result.

When planning and integrating a custom data source, consider the following points:

What is the smallest independent result entity that the user should consume? Each such entity should then be modeled as a Squirro item.
How the title and body content should be formatted.

Examples of individual items include:

News story, web article, tweet, etc.
Binary document (PDF, Office documents, etc.)
Service ticket
Email
Chat message

Items can also contain sub-items, which are always shown in the context of the full item. By default, Squirro uses sub-items to index the individual pages of PDF documents.

Common Fields

The fields in this table are used in both the data loading and data consumption APIs.

Field	Data type	Description
id	Unique Identifier	This field has the same name in data loading and data consumption, but its semantics differ. See the Data Loading Fields and Data Consumption Fields tables for details.
link	URL	Link to the item at its original location.
title	String	Item title.
body	HTML String	Item body. This field is in HTML format and special characters need to be escaped.
language	Language Codes	Content language of the item. If this is not specified, it is auto-detected based on the content.
created_at	Date and Time	Item creation date. Ideally, this is the creation date of the item in its source system. If this is not specified for data loading, the import process applies the following fallbacks in order: if the `files` property is specified, Squirro tries to extract a creation date from the file metadata; if that does not yield a date, the server’s current date and time are used.
webshot_url	URL	Main item picture. This image is displayed in the result list to represent the story. When loading data, use the `webshot_picture_hint` field instead, because the picture is then automatically archived. If this is not set, it is automatically extracted from the website specified with the `link` property, generally by using the first story picture.
webshot_height	Integer	Height of the webshot in pixels.
webshot_width	Integer	Width of the webshot in pixels.
keywords	Dictionary, values represented as lists	Keywords attached to the item (see Labels for further information). They are the structured information of an item. Item keywords are offered as filter options in the search screen. Keywords and their values are offered in the search field as typeahead options. Search Tagging, Known Entity Extraction (KEE), and unknown entity extraction are used to add additional keywords to items. Keyword values can have different data types. The default data type is string. To use other data types, configure them before loading any data into the system. Example item with keywords: { "title": "Our offices", "body": "We have offices in Munich, …", "keywords": { "country": ["Germany"], "city": ["Munich", "Berlin"] } }
entities	List of dictionaries	Entities attached to the item.

When importing data into Squirro, at least one of the fields `title`, `body`, or `files` must be set. All other fields are optional.
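
As a rough illustration of the rules above, the following sketch builds an item dictionary for data loading, escapes special characters in the body, and checks that at least one of `title`, `body`, or `files` is set. The helper names `build_item` and `validate_item` are hypothetical and not part of any Squirro library; only the field names are taken from the table above.

```python
import html
import json

# At least one of these fields must be set on import.
REQUIRED_ANY = ("title", "body", "files")


def validate_item(item):
    """Raise ValueError unless at least one of title, body, or files is set."""
    if not any(item.get(field) for field in REQUIRED_ANY):
        raise ValueError("item needs at least one of: title, body, files")
    # Keywords are a dictionary whose values are lists.
    for name, values in item.get("keywords", {}).items():
        if not isinstance(values, list):
            raise ValueError(f"keyword {name!r} must map to a list")
    return item


def build_item(title, text, keywords=None):
    """Hypothetical helper: wrap plain text into an HTML body and validate."""
    item = {
        "title": title,
        # The body is HTML, so special characters in plain text are escaped.
        "body": "<p>{}</p>".format(html.escape(text)),
    }
    if keywords:
        item["keywords"] = keywords
    return validate_item(item)


item = build_item(
    "Our offices",
    "We have offices in Munich & Berlin",
    keywords={"country": ["Germany"], "city": ["Munich", "Berlin"]},
)
print(json.dumps(item, indent=2))
```

Validating on the client side is optional; it mainly helps to catch malformed items before they are sent to the data loading API.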

Data Loading Fields

These fields can be specified in the data loading APIs. They will be transformed and output with different names in the data consumption APIs.

Field	Data type	Description
id	String	External item identifier. When a value is specified here at import, it is written into the `external_id` field. Used by data providers to reference their source system. Squirro uses this identifier for deduplication.
summary	Text String	Item summary text. If not specified, this is generated from the `body` field. Any HTML tags are removed.
webshot_picture_hint	URL	Main item picture. If this URL exists and can be downloaded, the image is archived by Squirro. The resulting URL is written into the `webshot_url` field. The picture width and height are calculated and written into `webshot_width` and `webshot_height`.
mime_type	String	The MIME type of the body. Set to `text/html` for HTML bodies. For all other types, a conversion to HTML will be attempted. If not specified, the MIME type is auto-detected.
files	List of dictionaries	A list of files that are uploaded for the item. Note: this is modeled as a list, but only one file can currently be attached. The fields for individual files are: `content`: Base64-encoded content of the file to upload. Either this or the `url` field is mandatory. `url`: URL where the file can be downloaded from. Either this or the `content` field is mandatory. `name`: File name without path. Mandatory when the `content` field is provided. If the `url` field is provided, the name is derived from the URL by default. Note that this field also exists at consumption, but with a different layout. See below.
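
The `files` layout for data loading can be sketched as follows. This is a minimal example under stated assumptions: the helper names `file_from_bytes`, `file_from_url`, and `item_with_file` are hypothetical, and the URL is a placeholder. Only the keys `content`, `url`, and `name` come from the table above.

```python
import base64


def file_from_bytes(name, data):
    """Build a `files` entry from raw bytes.

    `name` is mandatory when `content` is provided.
    """
    if not name:
        raise ValueError("name is mandatory when content is provided")
    return {
        "name": name,
        # The content field carries the Base64-encoded file content.
        "content": base64.b64encode(data).decode("ascii"),
    }


def file_from_url(url, name=None):
    """Build a `files` entry that points at a downloadable URL.

    `name` is optional here; by default it is derived from the URL.
    """
    entry = {"url": url}
    if name:
        entry["name"] = name
    return entry


def item_with_file(title, file_entry):
    """Attach exactly one file; the field is a list, but only one is allowed."""
    return {"title": title, "files": [file_entry]}


entry = file_from_bytes("report.pdf", b"%PDF-1.4 example bytes")
item = item_with_file("Quarterly report", entry)
print(item["files"][0]["name"])
```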

Data Consumption Fields

Some fields are only available during data consumption because they are calculated on the fly or represent a user state. This table documents these fields.

Field	Data type	Description
id	Unique Identifier	Internal item identifier, generated by Squirro only.
external_id	String	External item identifier. The external identifier is used for deduplication and can be used to link items to their source system. See the `id` field under Data Loading Fields for details.
read	Boolean	True if the item has been read.
starred	Boolean	True if the item has been starred.
abstract	Text String	Item abstract. This is generated from the `summary` field or, if that field doesn’t exist, from the `body`. When the item is returned as a matching result for a search query, the abstract is calculated around the most relevant matching keywords.
score	Float	Relevance score of the item. This is only set when the result list is ordered by relevance.
thumbler_url	Partial URL	Used internally by Squirro to display thumbnails of the `webshot_url` field.
explanation	Dictionary	Returned for items when the `explain_smartfilters` option is used in the `Get Item resource`. The dictionary contains a list of fields, for each of which the matching Smart Filter tokens are listed. Example: { "explanation": { "matches": { "summary.stemmed": [ {"term": "eliminated", "score": 0.010479515}, {"term": "equipped", "score": 0.00846127} ], "body.stemmed": [ {"term": "eliminated", "score": 0.010341313000000001}, {"term": "equipped", "score": 0.008302803000000001} ], "language_code": [ {"term": "en", "score": 0.0009886466000000001} ] } } }
related_items	List	Returned when the `filter_related_items` option is set in the `List Items resource`. A list of dictionaries which contains the field `id` of any related (duplicate) items. Example: { "related_items": [ {"id": "UfA8Ah08TeSLSUo-RBzm7Q"}, {"id": "tjiS4mjaTgupKaIiZisYww"} ] }
highlight_matches	Dictionary	A dictionary of matching query terms per field. Example: { "highlight_matches": { "body": ["asia"], "summary": ["asia"] } }
matching_sub_items	List	List of sub-items (e.g. pages for a PDF) with matches for the current query.
has_matching_sub_items	Boolean	True if the item consists of sub-items (e.g. pages for a PDF) that match the current query.
files	List	A list of files that are uploaded for the item. Note: this is modeled as a list, but only one file can currently be attached. The fields for individual files are: `content_url`: `content:///` URL where the file is stored. This includes the storage bucket name and the path within that bucket. `mime_type`: MIME type of the file. `name`: File name without path. Note that this field also exists at ingestion, but with a different layout. See above.
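
To show how the consumption fields might be read on the client side, the sketch below parses one item and extracts a few of the fields described above. The JSON payload is a hypothetical example: the identifiers and the `highlight_matches` content are copied from the examples in the table, while the `title` and `score` values are made up. The `summarize` function is an illustrative helper, not part of any Squirro library.

```python
import json

# Hypothetical response item, assembled from the examples in the table above.
raw = """
{
  "id": "UfA8Ah08TeSLSUo-RBzm7Q",
  "title": "Markets in Asia",
  "read": false,
  "starred": true,
  "score": 0.42,
  "highlight_matches": {"body": ["asia"], "summary": ["asia"]},
  "related_items": [{"id": "tjiS4mjaTgupKaIiZisYww"}]
}
"""


def summarize(item):
    """Collect a few consumption-only fields, tolerating missing ones."""
    matches = item.get("highlight_matches", {})
    # Merge the matching query terms of all fields into one sorted list.
    terms = sorted({term for field_terms in matches.values() for term in field_terms})
    return {
        "id": item["id"],
        "unread": not item.get("read", False),
        "matched_terms": terms,
        # related_items lists the ids of related (duplicate) items.
        "duplicate_ids": [rel["id"] for rel in item.get("related_items", [])],
        # score is only set when results are ordered by relevance.
        "score": item.get("score"),
    }


summary = summarize(json.loads(raw))
print(summary)
```

Using `.get()` with defaults reflects that several of these fields are only returned when the corresponding request option is set.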

Data Processing Fields

These fields are used by the pipeline (see Pipeline Overview). They pass information and processing state through the various pipeline stages. They cannot be passed in from the raw data, and they are neither indexed nor returned to the client on search.

Field	Data type	Description
clean_body	HTML String	A cleaned version of the item body as it should be used for machine learning classification tasks. For example, this is written by the Email Parser step, and read by AI Studio models, Known Entity Extraction, and NLP Keyphrase Tagger.