3.9.7 LTS Release Notes#

Squirro 3.9.7 was released on February 27, 2024.

Reference: Learn more about the Squirro Release Process.

Caution

This release includes breaking changes. See the Breaking Changes section at the end of this page to learn more.

What’s New#

Squirro 3.9.7 LTS includes significant platform improvements, including the following:

A new Semantic Search default pipeline was created to support improved semantic search capabilities and performance, available as a preset pipeline workflow within the Pipeline Editor.

A redesigned header and navigation bar with improved looks and functionality. This includes query chips within the Search Bar widget. Learn more about Search Bar Chips.

You can now change boolean operator chip values (AND, OR, NOT) by clicking on them and selecting from the dropdown.

The Filter widget was redesigned and improved, including search and typeahead functionality for labels.

Monitoring projects are no longer shown in the project list accessed by clicking the current project title. They are accessible via Monitoring in the Squirro Spaces menu.

In the Tabs widget, the communityPopulated property was replaced with a new “Control By Community” option in the UI.

Squirro now includes beta support for Oracle Database for new installations. To learn more, see How to Set Up Oracle Database as Squirro’s Metadata Storage.
Squirro now supports license keys for easier provisioning and updating of Squirro features and metrics.
Dashboard facets and facets applied by URL are now converted to search chips and shown in the global search and dashboard search widget with the ability to remove them.
Added .eml and .msg as supported searchable file types. To learn more, see Searchable File Types.
Introduced the beta feature Multi Query API, which enables users to perform multiple searches in parallel and combine the results using a specified rank method.
The default value of useReactWidgets is now true, meaning that React widgets are used by default on new projects.
Added a lazy loading option to the Labels widget.
Added new API overrides to the item detail: getSidepanelItem and getSidepanelWidth.
Added a new Topic API endpoint for retrieving batches of GT labels to enable clients to retrieve large numbers of labels without timing out.
Studio Plugins can now access a database (the studio database) and define and manage their own tables via SQLAlchemy.
The PDF Conversion step now passes the MIME type of the file being converted to the pdfconversion service. If no MIME type has been provided, the pdfconversion service tries to detect it from the content. Filename-based detection can be enabled to run first via the server config option pdfconversion.detect-mime-type-only-by-content.
The MIME Type Detection step now includes a configuration option to detect MIME types solely based on content, disregarding the filename. This can be useful in scenarios where the data ingested may have misleading extensions. For example, if a file with the filename test.pdf is in fact a TIFF file and not a PDF.
Added a mechanism to the ingester service to restart the local Apache Tika service in case of internal server errors. This mechanism is controlled by the server config option ingester.tika.restart-on-error-*.
Values in the configuration service can now be fetched from INI files. For example, specifying ${nlp_service.api_key} in a configuration service option fetches the value of the api_key key in the [nlp_service] section of the .ini file. This is especially useful for sensitive data that should preferably not be exposed in the configuration service.
Added a new server setting to hide internal information such as version and thumblerAPIRoot.
Introduced a new item field clean_body which can be used to pre-process the body for classification tasks. KEE, NLP keyphrase tagging, and machine learning workflows now default to using this field.
Exposed the ability to retry failed data loading batches from the UI.
Added the ability to move all failed batches of a given source back to processing via the SquirroClient (Python SDK). This requires specifying the project and source identifiers. If a batch identifier is specified, only that specific batch is moved. If a batch priority level is specified, only batches with that priority level are moved.
Item title can now be modified using the modify_item and modify_items Squirro Client methods.
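The INI lookup mentioned in the list above (the ${nlp_service.api_key} example) can be pictured with a short sketch. This is not Squirro's implementation: the resolve_placeholders function, the regular expression, and the sample INI content are assumptions made for illustration. Only the ${section.key} syntax comes from the release note.

```python
import configparser
import re

# Hypothetical INI content; in a real deployment this would be a .ini file on disk.
SAMPLE_INI = """
[nlp_service]
api_key = s3cr3t-token
"""

# Matches ${section.key}. The exact grammar Squirro accepts is an assumption.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_\-]+)\.([A-Za-z0-9_\-]+)\}")


def resolve_placeholders(value: str, ini: configparser.ConfigParser) -> str:
    """Replace each ${section.key} in value with the matching INI entry."""

    def lookup(match: re.Match) -> str:
        section, key = match.group(1), match.group(2)
        if ini.has_option(section, key):
            return ini.get(section, key)
        # Leave the placeholder untouched when the INI file has no such entry.
        return match.group(0)

    return PLACEHOLDER.sub(lookup, value)


# Disable configparser's own interpolation so "$" and "%" are read literally.
ini = configparser.ConfigParser(interpolation=None)
ini.read_string(SAMPLE_INI)

print(resolve_placeholders("${nlp_service.api_key}", ini))  # s3cr3t-token
print(resolve_placeholders("${missing.key}", ini))          # ${missing.key}
```

The point of the indirection is that the configuration service stores only the placeholder, while the secret itself stays in a file on the server.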
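The content-only option of the MIME Type Detection step (the test.pdf example in the list above) addresses files whose extension disagrees with their bytes. The sketch below illustrates the idea with a handful of well-known magic-byte signatures. The function names and the only_by_content flag are hypothetical and do not reflect the actual step or its option name.

```python
import mimetypes
from typing import Optional

# A few well-known signatures; a real detector covers far more formats.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"II*\x00", "image/tiff"),  # little-endian TIFF
    (b"MM\x00*", "image/tiff"),  # big-endian TIFF
]


def detect_by_content(data: bytes) -> Optional[str]:
    """Return a MIME type based on the leading bytes, or None if unknown."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return None


def detect_mime_type(filename: str, data: bytes, only_by_content: bool) -> Optional[str]:
    """Trust the filename first unless only_by_content is set (hypothetical flag)."""
    if not only_by_content:
        guessed, _encoding = mimetypes.guess_type(filename)
        if guessed:
            return guessed
    return detect_by_content(data)


# A TIFF payload stored under a misleading .pdf filename.
tiff_bytes = b"II*\x00" + b"\x00" * 8

print(detect_mime_type("test.pdf", tiff_bytes, only_by_content=False))  # application/pdf
print(detect_mime_type("test.pdf", tiff_bytes, only_by_content=True))   # image/tiff
```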

What’s New in SquirroGPT#

SquirroGPT now automatically detects the language of the user input and will respond in the same language. However, if a specific language is specified in the configuration settings, this automatic detection feature will remain inactive.

Foundational backend work is completed that allows SquirroGPT to be incorporated into various parts of the Squirro platform via Squirro’s built-in widget functionality. Exciting new features are on the way soon!
SquirroGPT now supports connecting to your own deployments of OpenAI or Azure-supported models, including Mixtral, via the project or server-level configuration settings under genai.sqgpt.settings. To learn more, see How to Connect Squirro Chat to a Third-Party LLM.
Added a filter query to the GenAI studio plugin, which allows changing the scope of SquirroGPT. This allows you to narrow the set of documents available for retrieval.
Deleting a SquirroGPT project and its data sources will now delete the related embeddings.
Introduced the backend preparation for a new SquirroGPT Summarization feature, which takes a body of text and asks an LLM to summarize its contents. Optionally, you can provide an anchor question or statement, or a maximum number of words. Note that this feature is not yet available in the frontend.

Search Additions and Improvements#

Tuned the precision/recall of keyword search by applying scoring plugins to the individual term sequences found in the search query. Depending on the plugin that is set up, it is possible to enable fuzzy term matching on the body, typeahead-like (prefix) term matching on the title, or any other custom matching logic that a plugin provides. This can be enabled in the new project configuration topic.search.query-strategy-term-scoring-profiles. The default configuration comes with these settings: "scoring_profiles": ["prefix_match fields:title", "fuzzy_match fuzziness:auto fields:title"]
Created a Text Chunking step that can be configured for different chunking strategies.
Added Scoring Plugins to perform fuzzy_match and prefix_match matching. (Phrase)-Prefix term matching can be used to achieve typeahead functionality on searchable text content fields, considering the order of tokens and proximity. This can be used to increase recall for lexical search (with higher query-computation cost). To learn more, see the Retrieve Scoring Plugin Documentation.
Added a new layer visibility condition option, Concept Search, with is empty and is not empty conditions.
Added the ability to highlight specific paragraphs in searches via a new paragraph_highlight profile. Provided a paragraph ID, it can be used in the following manner: profile:{ paragraph_highlight id:FUelODYIFNsmsUoDhEa2cA_0 }. This profile is primarily intended for SquirroClient users and is only available for projects with paragraph embeddings.
Semantic search highlighting in the Item Detail view was improved. The best paragraph is highlighted, together with query terms (hybrid search highlighting). This introduces basic semantic highlighting functionality without requiring extractive-question-answering to be active.
Removed the semantic similarity step. Instead, the snippet is now sentence-tokenized to produce most_relevant_sentences.
An error is no longer raised if an embedding chunk is too big. Instead, the text is truncated and a warning is logged.
Improved highlighting of extractive-question-answer span (sentence boundary aware).
Squirro now returns the approximate number of matching documents when doing semantic search (instead of returning the count of matching paragraphs). This applies when calling client.query(options={"search_scope":"paragraph", "response_format":"document"}) or client.query(query=" ..user question.. profile:{semantic}").
There is now a more robust baseline keyword search (without query processing). Squirro now better interprets ? in term-sequences as a question indicator and not as wildcard term matching.
For the aggregation API, exposed single value aggregation value_count to count the matching documents of a sub-aggregation.
Added the option to do keyword search on paragraphs using the query endpoint by specifying the search_scope: paragraph option.
Disabled auto-submit query within Global Search and the Search widget.
Improved Search widget/Global Search spacing.
Removed search overhead when doing paragraph / semantic search. This includes removal of the has_child clause (only necessary when doing document-centric PDF search) and search on paragraph index (instead of all project indices).
Exposed the response of the Elasticsearch profiler in the query API if profile is enabled in the search settings topic.search.search-settings. This allows for detailed insights into performance bottlenecks of the Elasticsearch query.
A new typeahead implementation facet_value_lenient can be used to find facet-value matches (leveraged by Labels Widget’s searchbar). This typeahead strategy matches relevant (visible & analyzed) facets. It’s more lenient than the default facet_value as it ignores the order of the matched terms, and matches all search-strings as prefixes. For example, this means that state un amer will match united states of america.
Changed the default search settings to decrease semantic search latency and make keyword search less strict (decreased minimum_should_match for term sequences).
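The matching rule described for facet_value_lenient in the list above (term order is ignored and every search string is treated as a prefix) can be sketched in a few lines. This is only an illustration of the stated semantics, not the actual typeahead implementation. The lenient_match function is hypothetical, and whitespace tokenization is an assumption.

```python
def lenient_match(search: str, facet_value: str) -> bool:
    """True if every search token is a prefix of some word in facet_value.

    Token order is ignored, which is what makes the match "lenient".
    """
    words = facet_value.lower().split()
    tokens = search.lower().split()
    return all(any(word.startswith(token) for word in words) for token in tokens)


print(lenient_match("state un amer", "united states of america"))  # True
print(lenient_match("amer state", "united states of america"))     # True
print(lenient_match("state can", "united states of america"))      # False
```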
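The default value quoted above for topic.search.query-strategy-term-scoring-profiles is a list of profile strings, each naming a scoring plugin followed by key:value parameters. The sketch below splits those strings to make their structure explicit. It shows one way to read the strings and makes no claim about how Squirro parses them. The parse_profile helper is hypothetical.

```python
import json

# Default settings as quoted in the release note.
config_value = {
    "scoring_profiles": [
        "prefix_match fields:title",
        "fuzzy_match fuzziness:auto fields:title",
    ]
}


def parse_profile(profile: str) -> dict:
    """Split 'plugin key:value key:value' into a plugin name and its parameters."""
    name, *params = profile.split()
    return {"plugin": name, "params": dict(p.split(":", 1) for p in params)}


for profile in config_value["scoring_profiles"]:
    print(parse_profile(profile))

# The value serialized as JSON, for example to paste into a configuration field.
print(json.dumps(config_value))
```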

Widget Improvements#

In the Labels widget, label value percentages are no longer shown when they are not known. For example, read and starred values.
Added the option to customize the Labels widget and add a start icon or text to the dropdown or accordion.
Added helper text to the Labels widget configuration options.
Added a loading indicator and empty message to the Labels widget.
Now, custom empty status for widgets can be HTML-based.
Updated the design of the Facets widget.
Removed the avatar option for the Cards widget and improved sources fetching.
Added onTabClick to the Tabs widget overrides.
Migrated the HorizontalTabs design.

UI Improvements#

Squirro now supports highlighting over multiple lines in the PDF viewer.
Added fullscreen and export options for the heatmap chart in the validation screen of AI Studio.
After successfully importing a project, you will now be taken directly to that project’s welcome page.
JPEG 2000 files are now supported in the filesystem (Documents) plugin.
Now, thumbnails are shrunk within Items when the abstract is hidden.
Improved dashboard loading performance. Sections will now lazy load based on visibility.
Improved search bar chips hover and focus styling.
Removed the reset zoom button from the timeline chart.
Added an option to autoexpand and hide accordion layer controls.
The text highlight color from project settings is now also used for entity highlighting.

Other Platform Improvements#

Made small improvements to the support for image-to-PDF conversions: the pdfconversion service now attempts to convert any image/* file to PDF. Images in a mode that has not been tested for saving to PDF are first converted to RGB mode.
Added the new method get_groundtruth_labels_batched that returns a ground truth label generator, which uses the new batched endpoint from Topic API.
Implemented a unified welcome endpoint under /welcome/project_id.
Increased the timeout of the filtering service to 60 minutes, reducing the risk of filtering service timeouts if the ingestion pipeline uses the Search Tagging and Alerting pipelet.
Adjusted the squirro_groundtruth_loader step to use the newly added client method, which retrieves GT labels in batches under the hood. This should enable ML workflows to load large amounts of GT labels without the request to the backend timing out.
Created the project start endpoint app/project_start/<project_id>.
Moved the embedding pipelet to the native pipeline step.
Added the option to retrieve paragraphs from the query endpoint using the response_format: paragraph option.
Enhanced the Share Dashboard dialog with options for multiple dashboards.
Implemented a /dashboards/<project>/<dashboard?> endpoint for dashboard tabs and global search.
Improved the stability of query parsing for Phrase Queries. Phrase queries can now enclose nested, escaped quotes. This allows a query like language:en "new york "city park"", where the query phrase is parsed as "new york city park".
The Deduplication step now exposes its actions as configurable options. By default, the action to remove existing project items found in the incoming batch is now disabled. This was disabled for performance reasons given its limited usefulness in the majority of use cases.
Batches originating from a Change Pipeline step will now be processed using the pipeline workflow’s current state, rather than its state at the time of batch creation.
Exporting a matrix chart in the AI studio model validation will now have the title Confusion Matrix.
Deleting or modifying a parent item now reflects changes in the corresponding paragraphs.
Added paragraphs locator and index to existing projects, allowing the use of semantic search.
Added the project-level configuration option frontend.userapp.excel-export-filename which sets the filename of the generated Excel file when selecting to export project items to Excel.
In the webshot service, changed the log level from WARNING to DEBUG.
Activated the stripping of title prefix by default.
Made the highlighting of matching keywords more meaningful within the Item Detail view by rewriting highlight-query based on the output of query-processing, using only relevant terms for item-detail view highlighting.
Internal service users can now be created more easily when the tenant is unknown.
The store value in useStoreKeyChange now defaults to the current store value.
Introduced the new libNLP filter coalesce. This returns the first field from a provided list for which a value is present. This has been introduced to support the clean_body handling in machine learning workflows.
The email parsing pipelet now writes content into the newly standardized clean_body field.
Improved the quality of the email parsing pipelet and machine learning step. Only the latest reply of an email chain is now returned, leading to improved precision in email classification tasks.
Added options to specify certificates and custom headers (that may include the API token) to connect to the NLP services.
Squirro now counts the number of items for retried failed batches.
Added the option to rerun failed items on a source.
Upgraded redis-server to version 7.2.3.
Updated hiredis to the latest stable version (2.3.2).
Updated fastText to 0.9.2, which includes various fixes and improvements, including a memory leak fix.
The squirro_status command line interface (CLI) utility now provides information on the GenAI service’s health status as well.
Fixed the PDF loading indicator and improved the PDF toolbar.
Added log rotation to the machinelearning job logs and to the dataloader Crawler logs. Datasource job logs, machinelearning job logs, and dataloader Crawler logs that have not been accessed for 30 days are automatically removed.
Upgraded sqlalchemy to version 1.4.51.
Reworked the grid carousel click handlers.
AmazonS3Container.download_all is now able to download files from a bucket with explicitly created folders (e.g. folders created from the AWS Management Console). It also supports filtering which objects to download from the bucket by prefix.
Added support for multiple uvicorn workers for the machinelearning service.
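The coalesce filter mentioned in the list above returns the first field from a provided list for which a value is present. The sketch below illustrates that behavior on a plain dictionary. It is not the libNLP implementation. Returning the field's value (rather than its name) and treating None and empty strings as "not present" are assumptions.

```python
from typing import Any, Iterable, Mapping, Optional


def coalesce(item: Mapping[str, Any], fields: Iterable[str]) -> Optional[Any]:
    """Return the value of the first listed field that is present in item."""
    for field in fields:
        value = item.get(field)
        # Assumption: None and empty strings count as "no value present".
        if value is not None and value != "":
            return value
    return None


# Prefer clean_body and fall back to body when clean_body is missing or empty.
print(coalesce({"clean_body": "latest reply", "body": "full thread"}, ["clean_body", "body"]))
print(coalesce({"clean_body": "", "body": "full thread"}, ["clean_body", "body"]))
```

This kind of fallback lets a workflow default to clean_body without breaking on items that only carry a body.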
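The batched ground truth retrieval mentioned in the list above (get_groundtruth_labels_batched returns a generator backed by a batched endpoint) follows a common pattern: request small pages and yield labels as they arrive, so that no single request has to return everything. The sketch below shows that pattern in isolation. The fetch_page callable and the offset-based paging are assumptions that stand in for the real HTTP call. The actual endpoint and method signature may differ.

```python
from typing import Callable, Dict, Iterator, List

Label = Dict[str, str]


def iter_labels_batched(
    fetch_page: Callable[[int, int], List[Label]],
    batch_size: int = 100,
) -> Iterator[Label]:
    """Yield labels one by one while requesting them in fixed-size pages."""
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        yield from page
        # A short (or empty) page means the backend has no more labels.
        if len(page) < batch_size:
            return
        offset += batch_size


# In-memory stand-in for the backend, used only to make the sketch runnable.
ALL_LABELS = [{"id": f"label-{i}"} for i in range(7)]
calls = []


def fake_fetch_page(offset: int, limit: int) -> List[Label]:
    calls.append((offset, limit))
    return ALL_LABELS[offset:offset + limit]


labels = list(iter_labels_batched(fake_fetch_page, batch_size=3))
print(len(labels), "labels fetched in", len(calls), "requests")
```

Because the function is a generator, a caller can start processing labels after the first page instead of waiting for the full set.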

Breaking Changes#

The Paragraph Embedder pipelet is no longer supported and all semantic search pipelines should replace it with the new Paragraph Embedding step.
By default, the Deduplication step will no longer remove existing project items found in the incoming batch. This change affects use cases where the same documents (i.e. updated versions of a document) with identical IDs are ingested, and it is necessary to remove the previous copy (indexed document) before ingesting the new one. This might be required, for example, to avoid retaining an outdated label that is not present in the new document. To revert to the previous behavior, you can enable the original functionality from the Pipeline Editor by adjusting the relevant option in the step.
If you have enabled the email parsing step in a workflow, the result of the steps that now use the clean_body field can change. Carefully review the output of the classification steps or remove the email parsing step if it is not required.
Chunking documents inside the Paragraph Embedding pipeline step is now deprecated. All semantic search projects should have the Text Chunking step before Paragraph Embedding.
Updated Pydantic, FastAPI, and Spacy dependencies. If you are using any of these in a pipelet, data loader plugin, or studio plugin, they may need to be updated.
The topic.nlp.remote-services-enabled key in the configuration service has been removed. Query Processing now has a smarter way to detect whether the remote service is available.

Installation and Upgrade#

For new installations, find step-by-step instructions in install-ansible (recommended) or Installing Squirro on Linux.

To upgrade an existing installation, see Upgrading Squirro.