How Squirro Scales#
This document aims to outline how Squirro can scale from a single, small development host to a massive multi-node cluster, ingesting millions of documents daily and serving thousand of users.
Before going further it’s important to familiarize yourself with the Squirro Architecture.
The following diagram gives an overview of the key components and technologies used to build the Squirro platform:
You can also download a
PDF of the Squirro architecture.
Squirro is built for both cloud and on-premise deployments.
For this reason, Squirro adopts a service-oriented approach, where the service is provided not by a few monolithic processes, but by a fleet of small, very focused services.
In computing, microservices is a software architecture style in which complex applications are constructed from many independent processes communicating with each other using language-agnostic APIs.
These services are small, highly decoupled, and focus on doing a small task. This facilitates a modular approach to system building.
Read more about the topic on the Microservices Wikipedia Page.
Each service of Squirro offers a versioned, restful, language-agnostic (JSON and XML), stateless API. Each service is decoupled and shares nothing with the other services.
The only way for these services to communicate is via the API.
This makes Squirro very modular and also simple to scale for the following reasons:
Each service can be run on as many hosts as needed.
Load balancing can be done by HTTP load balancers
Services and hosts can be added or removed from the cluster as needed without any downtime
State and data persistence of Squirro is handled by standard software components that can scale horizontally too
See the sections for Elasticsearch, MySQL, Redis, GlusterFS and Zookeeper below for more details.
The Squirro cluster service manages these systems, handles failover, and minimizes configuration and management complexity.
Design for Scaling#
Squirro can be run on a single host. While this isn’t recommended for production environments, it’s ideal for development and testing environments.
The simplest setup with all the components of Squirro deployed is shown in the image below:
The most crucial component in this setup is RAM. For simple testing purposes, 4 GB is acceptable. However, if any amount of data will be ingested, at least 8 GB is recommended. This setup can be scaled vertically to a certain limit.
The following are the major components that can be increased:
CPU: The service-oriented nature of Squirro benefits from additional CPU cores. While higher CPU frequencies are always beneficial too, it’s recommended that you focus on adding additional CPU cores or threads.
RAM: As more data is ingested and more queries are run in parallel, the RAM requirements will increase, especially for Elasticsearch. For a production environment, it’s strongly recommended that you run Elasticsearch on dedicated hosts for this very reason. See below.
Note: There are limits on how much RAM is recommended for Elasticsearch. See Liming Memory Usage in Elasticsearch’s official documentation for more details.
Storage: Fast, flash-based IO is highly recommended and will make a big impact.
Note: Once a single host is no longer sufficient, an intermediate step before deploying is to split Elasticsearch out to a dedicated host. Elasticsearch benefits from fast IO and depends heavily on filesystem caching. If multiple components compete for RAM and the filesystem cache shrinks, the Elasticsearch performance will drop or will become erratic.
Squirro is designed for horizontal scaling. For additional storage needs, resilience and performance additional hosts can be added. In this case, the Squirro cluster and storage nodes are usually split up on different servers.
To prepare for easy scaling without yet investing in the full server farm, a Squirro setup can be split into two roles from the beginning as shown below:
Recommended Cluster Setup#
A full horizontal scaling recommended setup looks as follows:
The nodes are operated in an active-active setup and the storage, as well as the query load, is shared across all hosts. Nodes can be added or removed with minimal or no downtime. Squirro recommends a minimum of three nodes so that a robust leader election is possible.
The ingested documents are evenly spread across the entire cluster, increasing both the storage capacity as well as lowering the query pressure for each node.
For additional query performance, data can be replicated multiple times inside the Elasticsearch cluster. See the following Elasticsearch section for more details.
In this setup, Elasticsearch runs on dedicated hosts for maximum performance. No SAN or NAS is required here, which helps to keep costs low.
Setup Without Dedicated Elasticsearch Nodes#
In certain scenarios, the main load might not be on Elasticsearch, or high availability is key while the data volume is small. To keep costs lower, Elasticsearch can be deployed to the app servers too:
While this limits the potential, it is a popular initial setup. Once traffic increases and the data volume grows, the dedicated Elasticsearch nodes can be brought online quickly.
Note: For a productive setup, factor in plenty of RAM per host. There must be enough free RAM for Elasticsearch to benefit from the filesystem cache. Once the host runs out of RAM for the filesystem cache or (even worse) starts swapping, performance will drop dramatically.
Single Node Setup with Full Elasticsearch Cluster#
In scenarios with very high data volume, but simple or few queries and no HA requirement, this setup can scale very well and keep costs low:
In this example, the Squirro node does not offer high availability, but the Elasticsearch cluster still does if the replica level is set to > 0.
A Setup That Meets Your Needs#
Both roles can be scaled independently. For example, a setup with 3 App Servers and 7 Elasticsearch Hosts is possible too if that fits your needs.
For both roles, the same logic applies: It’s better to have more medium-sized hosts, than just a few very big ones.
If you go for high availability for both roles, start with at least 3 hosts. Once you grow, avoid even numbers of hosts.
The ideal solution is always: Start small, measure, then adjust.
Business Continuity Planning (BCP)#
See Business Continuity Planning for information on how to run Squirro across multiple data centers.
Elasticsearch is a search server based on Lucene. It provides a distributed, multitenant-capable, full-text search and analytics engine with a HTTP web interface and schema-free JSON documents. It is designed from the ground up to support massively distributed deployments with high-availability on commodity hardware.
For more information on its history, see the Elasticsearch Wikipedia entry.
Squirro uses Elasticsearch as its primary means to store and query its documents. When you ingest data into Squirro, Elasticsearch is the component that will store the bulk of the data.
Unless you index binary documents, it is the only persistence component that will grow based on the data volume you ingest.
Thanks to Elasticsearch, Squirro can scale to massive sizes with relative ease, but there are still a few things to consider:
How big is the dataset expected to grow?
How complex will the data be? (How many fields with how much variation?)
How many documents will be indexed per day/hour/minute/second?
How many concurrent users will be querying the system in parallel?
Will those parallel queries all contain aggregations?
Is the data time-based? Can data be segmented into time-based indexes that can be either discarded after some time has elapsed or can be marked as read-only?
These questions will define how many nodes of Elasticsearch, how many shards and replicas, how much RAM and CPU and what type of storage is required to provide the desired user experience and resilience. Whenever the team at Elasticsearch is asked how a specific setup will perform, then answer is always: It depends! While this can be frustrating it is the honest and correct answer.
While Squirro will take away some of the main concerns, for example with its intelligent caching layer, the basic logic still applies with Elasticsearch: Start small within the development environment, load real data, test and measure the performance under as real as possible conditions, and adjust & scale accordingly.
Another important aspect is that in Elasticsearch queries can be run against one or multiple indexes. For the application executing the query, this makes no difference. This allows for some really powerful patterns if the data can be expired after a given time or if it will be queried by the end users less frequently.
Here is an example:
Suppose you have to index 10 million documents every day and keep the data for 30 days. With a single index, you’d end up with an index with 300 million documents. Using sharding and replicas this is a feasible setup.
Using time-based indexes we could instead create a dedicated index every day. e.g.
Assume each index is roughly 10 million documents.
If you want to run a query against all 30 days you would simply query against
index_* and the result and performance would be about the same as above. However, if you know based on the Squirro dashboard that the user is only is interested in the last 24 hours, you can choose to only run the query against one or two of these indexes.
This way the data being queried is lower and the query performance will be far superior and the memory overhead for the concurrent queries far lower.
Reference: See this excellent document from Elasticsearch about how to design for scaling: Elasticsearch: Designing for Scale.
MySQL is an open-source relational database management system (RDBMS). In July 2013, it was the world’s second most widely used RDBMS, and the most widely used open-source client-server model RDBMS.
MySQL is a popular choice of database for use in web applications and is a central component of the widely used LAMP open-source web application software stack (and other AMP stacks).
For more information on its history, see the MySQL Wikipedia page.
The MySQL RDBMS is used by Squirro to store configuration data.
No documents are stored in MySQL, hence it does not have to scale with the data volume ingested into Squirro. Squirro has previously seen deployments with millions of documents and a MySQL database that was less than 10 MB.
The main data stored in MySQL is:
User Sessions / Auth Tokens
Scheduled Tasks / Alerts
To achieve high availability and scalability the Squirro cluster service configures MySQL in a Source → Replica setup.
Each node in the Squirro cluster can contribute a MySQL service to the cluster, and the cluster services uses Zookeeper to automatically elect a leader. All remaining nodes will become followers and replicate the entire database.
Read operations are handled by all nodes, writes go to the leader and are replicated out to the followers.
There are two main concerns when scaling:
Number of write operations to the Source node
Concurrent Open Connections to any of the MySQL nodes. (each connection requires RAM)
In general, pressure on MySQL is low and the caching layer and connection pooling improve the scaling capability further.
Redis is a data structure server. It is open-source, networked, in-memory, and stores keys with optional durability. According to the monthly ranking by DB-Engines.com, Redis is the most popular key-value database.
To learn more about Redis, see the Redis Wikipedia page.
Redis is used by Squirro for queuing, temporary data storage, and caching.
Temporary Data Storage
Documents are stored in Redis only “in flight”. While documents are processed by the pipeline service a copy is kept in Redis for fast access / update. Once documents exit the pipeline and are persisted in Elasticsearch they get removed from Redis.
Redis is the primary queueing backend for the pipeline service. Only document references enter the queue, hence the size of that data is minimal. The queing is abstracted using the Kombu library. Other backends such as in-memory, Amazon SQS, RabbitMQ can be supported too.
Redis is used by Squirro to provide an intelligent caching layer, both for the queries that are executed against Elasticsearch as well as for all HTTP requests between the various Squirro services. Cache entries can be stored long term, and get evicted intelligently when changes to the underlying data or the configuration occur.
To achieve high availability and scalability the Squirro cluster service configures Redis in a Master → Slave replication setup. Each node in the Squirro cluster can contribute a Redis service to the cluster, and the cluster services uses Zookeeper to automatically elect a leader. All remaining nodes will become followers and replicate the entire database.
Read operations will be handled by all nodes, writes will go to the leader and will be replicated out to the followers.
There are two main concerns when scaling:
Number of write operations to the Master node (but Redis is extremely fast)
The latency between the master node and the slaves.
The amount of RAM needed (Redis keeps all data in memory)
The GlusterFS architecture aggregates compute, storage, and I/O resources into a global namespace. Capacity is scaled by adding additional nodes or adding additional storage to each node. Performance is increased by deploying storage among more nodes. High availability is achieved by replicating data n-way between nodes.
To learn more about Gluster, see the Gluster Wikipedia page.
Used to store and replicate all filesystems based documents between all nodes. This data only scales with the data ingested into Squirro if the data contains binary documents and that data is also served by Squirro. Otherwise the data requirement are minimal. See next two points. Used to store and replicate all pipeline service plugins (pipelets). Storage need is minimal. Used to store and replicate Excel / CSV files uploaded using the Web UI-based data loader. Storage need is minimal. Data gets deleted after ingestion is completed.
Alternatives GlusterFS is the default and built-in option for providing scalable and clustered file storage. It is however only one of multiple options:
Amazon S3 (and compatible storage solution)
Network Attached Storage, e.g. NFS
None: If you don’t ingest binary documents, don’t need pipelets, and don’t upload Excel/CSV files using the WebUI, no shared filesystem is required.
Apache ZooKeeper is a software project from the Apache Software Foundation. It provides an open-source distributed configuration service, synchronization service, and naming registry for large distributed systems. ZooKeeper was a sub-project of Hadoop but is now a top-level project in its own right.
ZooKeeper’s architecture supports high availability through redundant services.
To learn more, see the Zookeeper Wikipedia page.
Zookeeper is used by Squirro for leader election. Its design is suited to this purpose and helps Squirro detect and handle network segmentation / split brain events.
Only the cluster state and the data needed for the leader election is stored in Zookeeper.
Size is minimal and doesn’t increase with the data ingested into Squirro.
Run an uneven number of Zookeeper nodes, to ensure robust leader election.