How Squirro Scales

How Squirro Scales#

This document aims to outline how Squirro can scale from a single, small development host to a massive multi-node cluster, ingesting millions of documents daily and serving thousands of users.

Basic Concepts#

Before going further it is important to familiarize yourself with the Squirro architecture, designed for both cloud and on-premise deployments.

For this reason, Squirro adopts a service-oriented approach, where the service is provided not by a few monolithic processes, but by a fleet of small, very focused services.

Microservices Explained#

In computing, microservices is a software architecture style in which complex applications are constructed from many independent processes communicating with each other using language-agnostic APIs.

These services are small, highly decoupled, and focus on doing a small task. This facilitates a modular approach to system building.

Read more about the topic on the Microservices Wikipedia Page.

Squirro Microservices#

Each service of Squirro offers a versioned, restful, language-agnostic (JSON and XML), stateless API. Each service is decoupled and shares nothing with the other services.

The only way for these services to communicate is via the API.

This makes Squirro very modular and also simple to scale for the following reasons:

Each service can be run on as many hosts as needed.
Load balancing can be done by HTTP load balancers.
Services and hosts can be added or removed from the cluster as needed without any downtime.
State and data persistence of Squirro is handled by standard software components that can scale horizontally.

The Squirro cluster service manages these systems, handles failover, and minimizes configuration and management complexity.

Design for Scaling#

Scaling Vertically#

Squirro can be run on a single host. While this isn’t recommended for production environments, it is ideal for development and testing environments.

The most crucial component in this setup is RAM. For simple testing purposes, 4 GB is acceptable. However, if any amount of data will be ingested, at least 8 GB is recommended. That initial setup can be scaled vertically to a certain point.

The following are the major components that can be increased:

CPU: The service-oriented nature of Squirro benefits from additional CPU cores. While higher CPU frequencies are always beneficial too, it is recommended that you focus on adding additional CPU cores or threads.
RAM: As more data is ingested and more queries are run in parallel, the RAM requirements increase, especially for Elasticsearch. For a production environment, it is strongly recommended that you run Elasticsearch on dedicated hosts for this very reason. See below.

Note: There are limits on how much RAM is recommended for Elasticsearch. See Liming Memory Usage in Elasticsearch’s official documentation for more details.
Storage: Fast, flash-based IO is highly recommended and makes a big impact.

Note: Once a single host is no longer sufficient, an intermediate step before deploying is to split Elasticsearch out to a dedicated host. Elasticsearch benefits from fast IO and depends heavily on filesystem caching. If multiple components compete for RAM and the filesystem cache shrinks, the Elasticsearch performance drops or becomes erratic.

Scaling Horizontally#

Squirro is designed for horizontal scaling. For additional storage needs, resilience, and performance, more hosts can be added. In this case, the Squirro cluster and storage nodes are typically deployed on separate servers.

Initial Setup#

To prepare for easy scaling without yet investing in the full server farm, a Squirro setup can be split into two roles:

An application server
An Elasticsearch server

Recommended Cluster Setup#

A full horizontal scaling recommended setup looks as follows:

The nodes are operated in an active-active setup and the storage, as well as the query load, is shared across all hosts. Nodes can be added or removed with minimal or no downtime. Squirro recommends a minimum of three nodes so that a robust leader election is possible.

The ingested documents are evenly spread across the entire cluster, increasing both the storage capacity as well as lowering the query pressure for each node.

For additional query performance, data can be replicated multiple times inside the Elasticsearch cluster. See the following Elasticsearch section for more details.

In this setup, Elasticsearch runs on dedicated hosts for maximum performance. No SAN or NAS is required here, which helps to keep costs low.

Setup Without Dedicated Elasticsearch Nodes#

In certain scenarios, the main load might not be on Elasticsearch, or high availability is key while the data volume is small. To keep costs lower, Elasticsearch can be deployed to the app servers too. While this limits the potential, it is a popular initial setup. Once traffic increases and the data volume grows, the dedicated Elasticsearch nodes can be brought online quickly.

Note: For a productive setup, factor in plenty of RAM per host. There must be enough free RAM for Elasticsearch to benefit from the filesystem cache. Once the host runs out of RAM for the filesystem cache or (even worse) starts swapping, performance drops dramatically.

Single Node Setup with Full Elasticsearch Cluster#

In scenarios with very high data volume, but simple or few queries and no high availability requirement, this setup can scale very well and keep costs low:

In that scenario, the Squirro node does not offer high availability, but the Elasticsearch cluster still does if the replica level is set to > 0.

A Setup That Meets Your Needs#

Both roles can be scaled independently. For example, a setup with 3 App Servers and 7 Elasticsearch hosts is possible too if that fits your needs.

For both roles, the same logic applies: it is better to have more medium-sized hosts, than just a few very big ones.

If you go for high availability for both roles, start with at least 3 hosts. Once you grow, avoid even numbers of hosts.

The ideal solution is always: start small, measure, then adjust.

Business Continuity Planning (BCP)#

See Business Continuity Planning for information on how to run Squirro across multiple data centers.

Squirro Components#

Elasticsearch#

Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable, full-text search and analytics engine with an HTTP web interface and schema-free JSON documents. It is designed from the ground up to support massively distributed deployments with high availability on commodity hardware.

For more information on its history, see the Elasticsearch Wikipedia entry.

Squirro uses Elasticsearch as its primary means to store and query its documents. When you ingest data into Squirro, Elasticsearch is the component that stores the bulk of the data.

Unless you index binary documents, it is the only persistence component that will grow based on the data volume you ingest.

Thanks to Elasticsearch, Squirro can scale to massive sizes with relative ease, but there are still a few things to consider:

How big is the dataset expected to grow?
How complex will the data be? (How many fields with how much variation?)
How many documents will be indexed per day/hour/minute/second?
How many concurrent users will be querying the system in parallel?
Will those parallel queries all contain aggregations?
Is the data time-based? Can data be segmented into time-based indexes that can be either discarded after some time has elapsed or can be marked as read-only?

These questions define how many nodes of Elasticsearch, how many shards and replicas, how much RAM and CPU and what type of storage is required to provide the desired user experience and resilience. Whenever the team at Elasticsearch is asked how a specific setup will perform, the answer is always: it depends! While this can be frustrating, it is the honest and correct answer.

While Squirro takes away some of the main concerns, for example, with its intelligent caching layer, the basic logic still applies with Elasticsearch: start small within the development environment, load real data, test and measure the performance under as real as possible conditions, and adjust & scale accordingly.

Time-Based Indexes#

Another important aspect is that in Elasticsearch queries can be run against one or multiple indexes. For the application executing the query, this makes no difference. It allows for some really powerful patterns if the data can expire after a given time or if it will be queried by the end users less frequently.

Here is an example:

Suppose you have to index 10 million documents every day and keep the data for 30 days. With a single index, you would end up with an index with 300 million documents. Using sharding and replicas, that is a feasible setup.

Using time-based indexes we could instead create a dedicated index every day. e.g. index_2022_11_10 and index_2023_11_11.

Assume each index is roughly 10 million documents.

If you want to run a query against all 30 days you would simply query against index_* and the result and performance would be about the same as above. However, if you know, based on the Squirro dashboard, that the user is only interested in the last 24 hours, you can choose to only run the query against one or two of these indexes.

This way, the data being queried is lower, the query performance is far superior, and the memory overhead for the concurrent queries is far lower.

Reference: See this excellent document from Elasticsearch about how to design for scaling: Elasticsearch: Designing for Scale.

MySQL#

MySQL is an open-source relational database management system (RDBMS). In July 2013, it was the world’s second most widely used RDBMS, and the most widely used open-source client-server model RDBMS.

MySQL is a popular choice of database for use in web applications and is a central component of the widely used LAMP open-source web application software stack (and other AMP stacks).

For more information on its history, see the MySQL Wikipedia page.

The MySQL RDBMS is used by Squirro to store configuration data.

No documents are stored in MySQL, hence it does not have to scale with the data volume ingested into Squirro. Squirro has previously seen deployments with millions of documents and a MySQL database that was less than 10 MB.

The main data stored in MySQL is:

Users
Groups
Roles
User Sessions / Auth Tokens
Dashboards
Smart Filters
Saved Searches
Scheduled Tasks / Alerts
Scaling

To achieve high availability and scalability the Squirro cluster service configures MySQL in a Source → Replica setup.

Each node in the Squirro cluster can contribute a MySQL service to the cluster, and the cluster services uses Zookeeper to automatically elect a leader. All remaining nodes will become followers and replicate the entire database.

Read operations are handled by all nodes, writes go to the leader and are replicated out to the followers.

There are two main concerns when scaling:

Number of write operations to the Source node
Concurrent Open Connections to any of the MySQL nodes. (each connection requires RAM)
In general, pressure on MySQL is low and the caching layer and connection pooling improve the scaling capability further.

Redis#

Redis is a data structure server. It is open-source, networked, in-memory, and stores keys with optional durability. According to the monthly ranking by DB-Engines.com, Redis is the most popular key-value database.

To learn more about Redis, see the Redis Wikipedia page.

Redis is used by Squirro for queuing, temporary data storage, and caching.

Temporary Data Storage

Documents are stored in Redis only “in flight”. While documents are processed by the pipeline service, a copy is kept in Redis for fast access / update. Once documents exit the pipeline and are persisted in Elasticsearch, they get removed from Redis.

Queuing

Redis is the primary queueing backend for the pipeline service. Only document references enter the queue, hence the size of that data is minimal. The queuing is abstracted using the Kombu library. Other backends such as in-memory, Amazon SQS, RabbitMQ can be supported too.

Caching

Redis is used by Squirro to provide an intelligent caching layer, both for the queries that are executed against Elasticsearch as well as for all HTTP requests between the various Squirro services. Cache entries can be stored long term, and get evicted intelligently when changes to the underlying data or the configuration occur.

Scaling

To achieve high availability and scalability, the Squirro cluster service configures Redis in a Master → Slave replication setup. Each node in the Squirro cluster can contribute a Redis service to the cluster, and the cluster services uses Zookeeper to automatically elect a leader. All remaining nodes will become followers and replicate the entire database.

Read operations will be handled by all nodes, writes will go to the leader and will be replicated out to the followers.

There are two main concerns when scaling:

Number of write operations to the Master node (but Redis is extremely fast).
The latency between the master node and the slaves.
The amount of RAM needed (Redis keeps all data in memory).

Distributed File System#

A distributed file system is a storage and management solution that facilitates data storage across multiple servers or locations, enhancing scalability, fault tolerance, and accessibility. Squirro uses it to store and replicate all file system-based documents across various nodes. The storage requirements for this data grow in proportion to the volume of binary documents ingested by the Squirro platform, as that data is managed and served by the system. The storage demand is minimal when the system is used for storing and replicating pipeline service plugins, known as pipelets. Similarly, when handling Excel or CSV files uploaded via the UI-based data loader, storage requirements remain low since the data is deleted after ingestion.

You can implement a shared file system through various methods, such as a NAS or an existing clustering file system. Note that if you do not ingest binary documents, do not need pipelets, and do not upload Excel or CSV files using the web interface, no shared filesystem is required.

For expert guidance on selecting the right technology or integrating with your current distributed file system, contact Squirro Support and open a technical support request.

Zookeeper#

Apache ZooKeeper is a software project from the Apache Software Foundation. It provides an open-source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right.

ZooKeeper’s architecture supports high availability through redundant services.

To learn more, see the Zookeeper Wikipedia page.

Zookeeper is used by Squirro for leader election. Its design is suited to this purpose and helps Squirro detect and handle network segmentation / split brain events.

Scaling Considerations

Only the cluster state and the data needed for the leader election is stored in Zookeeper.
Size is minimal and doesn’t increase with the data ingested into Squirro.
Run an uneven number of Zookeeper nodes, to ensure robust leader election.