Squirro Scripts#

This section explains Squirro Scripts and how they are used in Squirro.

Overview#

Squirro Script is a scripting language used in Squirro enrichments. It allows each item to be processed as it comes through the pipeline. Its syntax strongly resembles Python, but with reduced complexity and functionality.

On the other hand, Squirro Script is optimized for dealing with Squirro items, especially in how functions are applied to keywords and in the set of functions provided. Where Squirro Script is not powerful enough, Pipelets can be used instead to unlock the full power of Python.

Using Squirro Script#

Squirro Script is available in the server-side pipeline as an enrichment type. As Squirro Script is itself implemented as a pipelet, it can also be used in the data loader using the pipelets.json.

Server-side Enrichment#

Squirro Script can be installed using the plugin repository as a pipelet. For more details, see the Plugin Repository wiki page.

Once installed, Squirro Script can be added as a step to any pipeline workflow. See Pipeline Editor for detailed instructions.

After dragging the step into the pipeline, use the pencil icon to open the properties. The code editor can be clicked to open a bigger editor window.

Squirro Script Cookbook#

The following examples show how to accomplish common tasks with Squirro Script.

Site-Specific Noise-Removal#

When indexing specific sites, a site-specific noise-removal script can be useful. This can be achieved with the following script:

@body = xpath('//div[@class="story-body__inner"]')
if @body:
    $body = @body

This selects the div element with the CSS class story-body__inner using an XPath expression. If that element is found, it is used as the item body.

To limit this to just a given site, the find() function can be used on the link field:

if find("https?://bbc\.", $link):
    @body = xpath('//div[@class="story-body__inner"]')
    if @body:
        $body = @body
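
The pattern passed to find() is worth checking before it is deployed. The following Python sketch assumes that find() searches for the pattern anywhere in the field, in the way Python's re.search does; this page does not state the matching mode, so treat that as an assumption. The link values are made up for illustration.

```python
import re

# Pattern copied from the script above.
PATTERN = r"https?://bbc\."

# Hypothetical link values, for illustration only.
links = [
    "http://bbc.example/news/1",
    "https://bbc.example/news/2",
    "https://www.bbc.example/news/3",
    "https://example.org/bbc.html",
]


def matches(link):
    """Return True if the pattern is found anywhere in the link."""
    return re.search(PATTERN, link) is not None


results = {link: matches(link) for link in links}
for link, matched in results.items():
    print(f"{matched!s:5} {link}")
```

Under that assumption, a link whose host starts with www. is not matched, so a pattern such as https?://(www\.)?bbc\. may be needed if the site uses that prefix.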

Squirro Script Reference#

This is the language reference for Squirro Script.

The syntax of Squirro Script strongly resembles Python, but the complexity and functionality have been reduced. On the other hand, Squirro Script is optimized for dealing with Squirro items, especially in how functions are applied to keywords and in the set of functions provided. Where Squirro Script is not powerful enough, Pipelets can be used instead to get the full power of Python.

Language Primer#

Squirro Script is modeled very closely after Python with only small syntax additions. Note, however, that compared to Python only very limited functionality is available.

Example#

An example script could look like this:

hours = minutes / 60

if len($title) > 40:
    $title = trim(substr($title, 0, 40)) + '…'

This simple script calculates the hours keyword from the minutes keyword. It then shortens the title to at most 40 characters and appends an ellipsis.

Variables#

There are three types of variables available. The Item Format documentation contains full details on the possible fields and keywords available.

Keyword

These are facet values stored in the keywords field of Squirro items. Each value is a list of items, but Squirro Script simplifies handling by working on each list item individually. Examples:

  • minute

  • hour = minute / 60

Item field

Variables prefixed with a $ reference a core item field of Squirro items. Title, body, creation date, etc. are stored at this level. Examples:

  • $title

  • $body = $title + "<br>" + $body

Temporary

Variables prefixed with a @ are temporary. They are the recommended place to store the intermediate result of a calculation. Example:

  • @tlen = len($title)

Variables can be assigned with the = operator.
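
To make the three kinds of variables concrete, here is a rough Python model of the examples above. It is only a sketch: the dictionary layout and the name run_example are illustrative assumptions, not Squirro's implementation.

```python
def run_example(item):
    """Mimic hour = minute / 60, @tlen = len($title) and $body = $title + "<br>" + $body."""
    keywords = item.setdefault("keywords", {})

    # Keyword: the expression is applied to each value in the list.
    keywords["hour"] = [m / 60 for m in keywords.get("minute", [])]

    # Temporary (@tlen): exists only while the script runs and is
    # assumed not to be stored on the item.
    tlen = len(item["title"])

    # Item field ($body): a single value stored at the top level of the item.
    item["body"] = item["title"] + "<br>" + item["body"]
    return item, tlen


item = {"title": "Example", "body": "Text", "keywords": {"minute": [90, 30]}}
updated, tlen = run_example(item)
```

The keyword assignment produces one result per list value, while the item field and the temporary variable each hold a single value.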

Data types#

Variables can hold values of four data types:

  • string

  • int

  • float

  • datetime

The types correspond to the type of Labels. The type of a variable is implicitly defined by the assigned value, or explicitly by using one of the to_* functions.

Literals (values)#

Squirro Script supports string and number literals. Strings are always treated as Unicode and are assumed to be UTF-8 encoded. Numbers can be whole numbers (int) or contain a decimal separator (float). Examples: 42 or 3.1415926.

Conditions#

Conditional logic can be implemented with the if block. Example:

if @tlen > 40:
    $title = $title + '…'
    short_title = 'false'
elif @tlen == 0:
    $title = "No title"
    short_title = 'true'
else:
    short_title = 'true'

Each if block has an indented block of code to be executed when the comparison returns true. Any number of optional elif blocks can be added which are executed if the respective condition matches. Optionally an else block can be used at the end for when none of the comparisons matches.

For comparison, the operators == (equal), != (not equal), > (greater than), >= (greater or equal), < (less than) and <= (less or equal) are available.

If no comparison operator is used (such as if $title:), then the variable is treated as true if it is non-empty and not zero.

The not keyword is also supported to negate a condition. For example:

if not startswith($title, "test"):
    # ...

Operators#

Operators can be used anywhere a variable can be referenced. The two sides of the operator can be any function call or variable. For example:

@offset = len($title) - 10

The following operators are provided:

  • +: for addition or concatenation.

  • -: for subtraction of numeric values.

  • *: for multiplication of two numeric values.

  • /: for division of numeric values.

Lookup Tables#

Lookup tables link data in processed Squirro items to an external data source, such as a product database. When you need more matching capabilities, Known Entity Extraction is another way to go.

A lookup table is instantiated with one of the load_table_* functions, such as load_table_from_json. Values can then be looked up using the lookup function, which returns a record based on the passed-in key. Finally, the extend_keywords function is a helper for working with more complex return values. It takes a table return value and assigns each value to the item keywords.

This whole process is best illustrated with an example.

Take the following input file (the lookup table) and Squirro item:

Lookup table (products.json):

{
    "iphone": {
        "category": "phone",
        "vendor": "Apple Inc."
    },
    "firefox": {
        "category": ["browser", "software"],
        "vendor": "Mozilla"
    }
}

Squirro item:

{
    "title": "Example",
    "keywords": {
        "products": ["firefox"]
    }
}

The following Squirro script will take the products keyword and look it up in the lookup table:

lookup.sqs

# Reads information from a product database (stored in JSON).
@product_table = load_table_from_json('products.json')
@products = lookup(@product_table, products)
extend_keywords(@products)

This will result in the following output item:

{
    "title": "Example",
    "keywords": {
        "products": ["firefox"],
        "category": ["browser", "software"],
        "vendor": ["Mozilla"]
    }
}
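
The data flow of this example can be modelled in a few lines of Python. This is a sketch of the behavior described above, not Squirro's code: the function bodies are assumptions, extend_keywords takes the item explicitly here, and skipping duplicate values is a choice made for this illustration.

```python
import json

# Contents of products.json from the example above.
PRODUCTS_JSON = """
{
    "iphone": {"category": "phone", "vendor": "Apple Inc."},
    "firefox": {"category": ["browser", "software"], "vendor": "Mozilla"}
}
"""


def lookup(table, keys):
    """Return the records of all keys that exist in the table."""
    return [table[key] for key in keys if key in table]


def extend_keywords(item, records):
    """Copy every key/value pair of each record into the item keywords."""
    keywords = item.setdefault("keywords", {})
    for record in records:
        for name, value in record.items():
            # Keywords are always lists, so wrap single values.
            values = value if isinstance(value, list) else [value]
            target = keywords.setdefault(name, [])
            for v in values:
                if v not in target:
                    target.append(v)
    return item


table = json.loads(PRODUCTS_JSON)
item = {"title": "Example", "keywords": {"products": ["firefox"]}}
extend_keywords(item, lookup(table, item["keywords"]["products"]))
```

After the last line, item has the same keywords as the output item shown above.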

Functions#

Functions can be called anywhere a variable can be referenced. For example:

title_hash = sha1($title)
if len($title) > 40:
    $title += '…'

When called on a keyword, most functions are applied to each individual value. The notable exceptions are the aggregation functions, which work on the entire list.

The following example shows the difference between a normal function (len) and an aggregation function (count).

Script:

attendee_count = count(attendees)
attendee_len = len(attendees)

Squirro item:

{
  "title": "Example",
  "keywords": {
    "attendees": ["Jon", "Abigail", "Mark", "Rose", "Isabella"]
  }
}

Result:

{
  "title": "Example",
  "keywords": {
    "attendees": ["Jon", "Abigail", "Mark", "Rose", "Isabella"],
    "attendee_count": [5],
    "attendee_len": [3, 7, 4, 4, 8]
  }
}
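
In Python terms, the difference between the two kinds of functions looks roughly like this. It is an analogy only; the variable names mirror the script, and storing the count as a one-element list follows the result shown above.

```python
attendees = ["Jon", "Abigail", "Mark", "Rose", "Isabella"]

# Normal function: applied to each value, so the result is again a list
# with one entry per input value.
attendee_len = [len(name) for name in attendees]

# Aggregation function: applied to the whole list, giving a single value
# that is stored as a one-element keyword list.
attendee_count = [len(attendees)]
```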

String Functions#

contains(str, substring[, substrings…])

Returns true if the string contains any of the given substrings. Examples:

if contains(campaign, 'Contact Us'):
    contains = 1
elif contains($body, 'contact', 'form'):
    contains = 1

endswith(str, substring)

Returns true if the string ends with the given substring.

find(regexp[, field…])

Searches the input item for the given regular expression and returns the matches, including any matching groups.

join(separator, list)

Combines all elements of the list, separating them with the given separator.

len(str)

Return the size of a string.

lower(str)

Returns a lower-cased version of the string.

replace(regexp, replace, str)

Replace all occurrences of the given regular expression.

Example:

# This example redacts email-like strings
$body = replace('\S+@\S+', 'REDACTED_EMAIL', $body)
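
The argument order differs from Python's re.sub, which is easy to trip over. The sketch below wraps re.sub with the Squirro Script argument order. It assumes that Squirro Script regular expressions behave like Python's, which this page does not confirm, and the sample text is made up.

```python
import re


def replace(regexp, replacement, text):
    """Same argument order as the Squirro Script function: pattern first, input last."""
    return re.sub(regexp, replacement, text)


body = "Contact jane@example.com or joe@example.org for details."
redacted = replace(r"\S+@\S+", "REDACTED_EMAIL", body)
print(redacted)
```

Note that \S+ also swallows punctuation directly attached to an address, such as a trailing full stop, so a stricter pattern may be preferable when the surrounding text must be preserved.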

sha1(str)

Return the SHA1 hash of the string.

sha256(str)

Return the SHA256 hash of the string.

sha512(str)

Return the SHA512 hash of the string.

startswith(str, substring)

Returns true if the string starts with the given substring.

strip(str)

Return a copy of the string with all whitespace removed at the beginning and end of the string.

substr(str, start, length)

Returns a substring of the string, starting at position start and containing at most length characters.

to_string(value)

Converts the given value to a string.

upper(str)

Returns an upper-cased version of the string.

xpath(expression)

Returns the text of the elements in the item body that match the given XPath expression. An example which determines the title from the body:

@title = xpath("//h1")
if @title:
    $title = @title

xpath_clear(expression)

Removes all nodes that match the given XPath expression from the item body.

Numeric Functions#

abs(num)

Returns the absolute value of the number.

round(number[, precision=0])

Rounds the number to the given precision. If precision is 0 (the default) an int number is returned. Otherwise a float with at most precision digits after the decimal point is returned.
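
The return types described above can be modelled in Python as follows. The name squirro_round is hypothetical, and the sketch only illustrates the int versus float distinction. How Squirro Script rounds exact ties such as 2.5 is not stated here; Python rounds ties to the nearest even number.

```python
def squirro_round(number, precision=0):
    """Model of the described return types; not Squirro's implementation."""
    if precision == 0:
        # Default precision: the result is an int.
        return int(round(number))
    # Otherwise: a float with at most `precision` digits after the decimal point.
    return round(number, precision)


whole = squirro_round(3.7)
two_digits = squirro_round(3.14159, 2)
```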

to_float(value)

Converts the given value to a float.

to_int(value)

Converts the given value to an int.

Date Functions#

datediff(date1, date2[, unit="seconds"])

Return the difference between the two given datetime objects. The unit can be specified as "seconds", "minutes", "hours", "days" or "weeks". Short versions of these units can be used, such as "s" for seconds.
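
The unit handling can be pictured with the Python sketch below. Several details are assumptions made for illustration: the sign convention (date2 minus date1), the float return value, and the lookup table itself. Only "s" is included as a short form because it is the only one named above.

```python
from datetime import datetime

# Seconds per unit. Only the short form "s" is listed, since it is the
# only one the text names explicitly.
UNIT_SECONDS = {
    "seconds": 1,
    "s": 1,
    "minutes": 60,
    "hours": 3600,
    "days": 86400,
    "weeks": 604800,
}


def datediff(date1, date2, unit="seconds"):
    """Difference between two datetimes in the given unit (date2 - date1 is assumed)."""
    delta = date2 - date1
    return delta.total_seconds() / UNIT_SECONDS[unit]


start = datetime(2024, 1, 1, 0, 0, 0)
end = datetime(2024, 1, 8, 12, 0, 0)
days_between = datediff(start, end, "days")
```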

now([format_string])

Returns the current date and time in the UTC time zone. If a format string is given, the result is returned as a string with the formatting applied; otherwise it is returned as a datetime object.

strftime(date[, format_string])

Return a string formatted version of the given datetime object. The default format is %Y-%m-%dT%H:%M:%S.

to_datetime(value)

Converts the given value to a datetime. This only works if the string uses the Squirro datetime format (%Y-%m-%dT%H:%M:%S).
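
Since the format string uses Python's datetime directives, the round trip between strings and datetime values can be sketched in Python. The function names mirror the Squirro Script functions for readability, but the bodies are assumptions. In particular, this sketch raises ValueError for other formats, and this page does not say how Squirro Script reacts in that case.

```python
from datetime import datetime

# The Squirro datetime format named above.
SQUIRRO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def to_datetime(value):
    """Parse a string in the Squirro datetime format."""
    return datetime.strptime(value, SQUIRRO_FORMAT)


def strftime(date, format_string=SQUIRRO_FORMAT):
    """Format a datetime, using the Squirro format by default."""
    return date.strftime(format_string)


parsed = to_datetime("2024-03-05T14:30:00")
default_text = strftime(parsed)
custom_text = strftime(parsed, "%d.%m.%Y")
```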

Datetime format strings#

Some date functions take format strings. These follow the Python datetime format string syntax; see Format Strings for details.

Aggregation Functions#

Aggregation functions are used on keywords and work on the entire list, instead of individual values. Most of the functions assume numeric values in the list.

avg(list_of_nums)

Return the average of all the values.

count(list)

Return the number of elements in the list.

max(list_of_nums)

Return the largest value of the list.

min(list_of_nums)

Return the smallest value of the list.

sum(list_of_nums)

Return the sum of all the values in the list.

Examples:

minutes_len = len(minutes)
minutes_count = count(minutes)
minutes_max = max(minutes)
minutes_min = min(minutes)
minutes_sum = sum(minutes)
minutes_avg = avg(minutes)

Lookup Table Functions#

extend_keywords(key_value_dict)

Applies the return value from the lookup table to the item’s keywords. The key_value_dict is a dictionary of key/value pairs, which is transformed into keywords on the items. The value can also be a list of such dictionaries, in which case each one is processed in turn.

load_table_from_json(filename)

Instantiates a lookup table by loading the given filename in JSON format. The format of the file is expected to be the following:

{
    "key": [
        {
            "output_key": "value",
            …
        },
        …
    ],
    …
}

lookup(table, keys)

Looks up the given key or keys in the lookup table. Returns a list of all the matching entries.
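
As an illustration of the file layout shown above, where each key maps to a list of records, the following Python sketch looks up one key and several keys. The table contents are made up, and the function body is an assumption; in particular, silently skipping unknown keys is a choice made for this sketch.

```python
import json

# Hypothetical table in the documented layout: each key maps to a list of records.
TABLE_JSON = """
{
    "firefox": [
        {"vendor": "Mozilla"},
        {"category": "browser"}
    ],
    "iphone": [
        {"vendor": "Apple Inc."}
    ]
}
"""


def lookup(table, keys):
    """Accept one key or a list of keys and return all matching records."""
    if isinstance(keys, str):
        keys = [keys]
    matches = []
    for key in keys:
        # Unknown keys contribute nothing (assumption).
        matches.extend(table.get(key, []))
    return matches


table = json.loads(TABLE_JSON)
single = lookup(table, "iphone")
several = lookup(table, ["firefox", "iphone", "unknown"])
```

The resulting list of records is the kind of value that extend_keywords accepts, processing each dictionary in turn.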

Other Functions#

clear(variable)

Clears the given variable. This is usually used on facets or item fields to clear them.

discard()

Aborts execution of the script and discards the current item. This can be used to implement a stop criterion, so that certain items are not indexed.

get_value(key)

Returns the value of the given variable. The key notation is the same as for any variables, with support for @ or $ prefix to access temporary variables or item fields.

The main use of this function is that it supports spaces in the keys. So this can be used to access keywords that contain spaces.

set_value(key, value)

Sets the value of the given variable. Like get_value, this function is mostly used to work with keywords that have spaces in their name.