Business Continuity Planning#

Introduction#

Squirro can be run as a clustered application across multiple hosts to provide high availability and horizontal scaling.

While it is technically possible to scale a single Squirro cluster across multiple data centers, it is not recommended: both the Squirro cluster service and the Elasticsearch cluster require stable, low-latency network connections, which usually cannot be guaranteed across multiple locations.

To support BCP scenarios, two fully independent Squirro clusters can be set up and operated in an Active-Standby configuration. All incoming data and query traffic should be directed to the active cluster, from where all data is replicated to the standby at regular intervals (e.g. every 5, 15, or 60 minutes).

The replication between the two clusters is done using a command-line (CLI) utility provided by Squirro.

Overview#

Figure: Squirro setup without dedicated Elasticsearch nodes

Core Concepts#

  • Squirro configuration is stored in .ini files under /etc/squirro.

  • Project and user metadata is stored in MySQL (does not grow with the data ingested into Squirro).

  • Text documents are stored in Elasticsearch (grows with the data ingested into Squirro).

  • Binary documents and additional assets such as custom CSS or Pipelets are stored in the filesystem, which is distributed within the cluster using GlusterFS (grows with the data ingested into Squirro, but only if binary documents such as Office or PDF files are indexed).

  • Caching is done in Redis, but the cache is volatile and does not need to be considered for BCP.

Technology Used#

  • The replication script is written with Fabric.

  • Files can be replicated using Rsync via SSH or by a storage vendor-specific method.

  • MySQL databases are exported using the mysqldump CLI command and restored using the mysql CLI command.

Requirements#

The replication is triggered and run from a single host. By default, this is the primary app server in the production environment, but it can also be a dedicated host that is not part of either cluster. For added resilience, the script and its configuration are deployed to all Squirro nodes but only actively run on the leader node.

The Fabric framework is used to communicate with the various host roles across both data centers (Fabric relies on key-based SSH connections).

The host running the replication has the following requirements:

  • Ability to trigger timed bash execution (e.g. using Cron)

  • Python 2.7 (fulfilled if this is run from a Squirro App Server)

  • Can establish a non-interactive SSH connection (using a key-pair) to:
    • All Squirro App Nodes in Production (See note 1 below)

    • All Squirro App Nodes in BCP (see note 2 below)

  • Can reach TCP Port 9200 on:
    • All Elasticsearch production nodes

    • All Elasticsearch BCP nodes (See note 2 below)

  • The Linux user used to SSH into the nodes must:
    • be able to restart the Squirro services (e.g. via sudo; this can be encapsulated in a single Bash script)

    • be a member of the Linux group squirro

    • have read & write access to the clustered filesystem, usually under /var/lib/squirro/storage

    • optionally, have read & write access to /etc/squirro/

Note 1: SSH connections to all production nodes are not mandatory; access to TCP port 443 and TCP port 3306 (MySQL) is sufficient as an alternative.

Note 2: If SSH connections from the Production to the BCP data center are not possible, the solution is to run the replication in three independent stages:

  • Stage 1: Backup Production to NFS

  • Stage 2: Replicate the NFS folder to BCP

  • Stage 3: Restore BCP from NFS

The main disadvantage of this approach is that there is no longer a single script that is aware of the success or failure of the replication process. On the BCP side, it can also be challenging to determine whether the Prod -> BCP replication of the NFS folder has completed and is consistent.

Replication Workflow#

The replication script runs through the following stages:

Stage 1: Testing#

Before the replication commences, Fabric is used to connect to the production cluster nodes to validate that the cluster is fully operational and reachable.

  • Contact the cluster service API and identify the leader

  • Contact the Elasticsearch API and identify the leader

  • Connect to the Squirro cluster leader (via SSH or MySQL) to test connectivity

  • Test the NFS mount for permissions and free space, to ensure the operation will run smoothly

If any of these steps reveals an issue, the replication job is aborted with verbose debug output.
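A minimal pre-flight check along these lines is sketched below; the host names (prod-node1, prod-leader) and the NFS mount point /mnt/bcp-nfs are assumptions and need to be adapted to the actual environment:

```bash
# Check Elasticsearch cluster health and identify the elected master node
curl -s "http://prod-node1:9200/_cluster/health?pretty"
curl -s "http://prod-node1:9200/_cat/master"

# Verify that a non-interactive (key-based) SSH connection to the Squirro leader works
ssh -o BatchMode=yes squirro@prod-leader true

# Verify that the shared NFS mount is writable and has free space
touch /mnt/bcp-nfs/.write_test && rm /mnt/bcp-nfs/.write_test
df -h /mnt/bcp-nfs
```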

Stage 2: Elasticsearch Snapshot Creation#

  • A new snapshot is requested using the official ES Snapshot Module.

  • Wait for the Snapshot to complete.

  • The snapshot target is the shared NFS mount.
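As an illustration, a filesystem snapshot repository pointing at the NFS mount can be registered once and then reused for every run. The repository name squirro_bcp and the location /mnt/bcp-nfs/es-snapshots are assumptions; the location must also be whitelisted via path.repo in elasticsearch.yml:

```bash
# Register a filesystem snapshot repository on the shared NFS mount (one-time setup)
curl -s -X PUT "http://localhost:9200/_snapshot/squirro_bcp" \
  -H 'Content-Type: application/json' \
  -d '{"type": "fs", "settings": {"location": "/mnt/bcp-nfs/es-snapshots"}}'

# Create a new snapshot and wait for it to complete
curl -s -X PUT "http://localhost:9200/_snapshot/squirro_bcp/snap_$(date +%Y%m%d%H%M)?wait_for_completion=true"
```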

Stage 3: MySQL Backup#

  • The host running the replication script connects to the leader of the production cluster (using SSH or MySQL).

  • A full backup of the MySQL Database is created, using mysqldump.

  • The MySQL backup is compressed and stored onto the shared NFS mount.
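A sketch of this step, assuming MySQL credentials are provided via ~/.my.cnf on the production leader (the host name and NFS path are placeholders):

```bash
# Dump all databases consistently on the production leader and compress the result onto the NFS mount
ssh squirro@prod-leader \
  "mysqldump --all-databases --single-transaction --routines --events | gzip" \
  > /mnt/bcp-nfs/mysql/squirro-$(date +%Y%m%d%H%M).sql.gz
```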

Stage 4: Config and Assets Backup#

  • The clustered filesystem used by the Squirro cluster is replicated incrementally to the shared NFS mount using Rsync.

  • Optional: Also replicate all (or some) configuration files. This is ideal if both the Production and the BCP cluster are set up identically.
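For example (the staging paths on the NFS mount are assumptions):

```bash
# Incrementally mirror the clustered filesystem onto the NFS staging area
rsync -az --delete /var/lib/squirro/storage/ /mnt/bcp-nfs/storage/

# Optional: also stage the configuration files
rsync -az --delete /etc/squirro/ /mnt/bcp-nfs/etc-squirro/
```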

Stage 5: NFS Replication#

  • With all data stored on the NFS mount, the contents of the entire mount are replicated to the BCP data center.

  • This can be done using Rsync via SSH or using a storage vendor-specific replication technology (e.g. NetApp SnapMirror).

  • While the initial replication can be large, subsequent replications should be small, since all methods except the MySQL export are incremental.
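When Rsync over SSH is used, the whole staging area can be mirrored to the BCP side in one incremental pass; the BCP host name and path are placeholders:

```bash
# Mirror the entire NFS staging area to the BCP data center over SSH
rsync -az --delete -e ssh /mnt/bcp-nfs/ squirro@bcp-leader:/mnt/bcp-nfs/
```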

Stage 6: Elasticsearch Snapshot Restore#

  • From the BCP NFS mount, the latest Elasticsearch Snapshot is restored into the ES cluster using the official ES Snapshot module.

  • During the restore ES will not serve traffic.
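A restore sketch, assuming the same repository name as above is registered on the BCP cluster and points at the replicated NFS path; note that indices being restored must be closed or absent:

```bash
# List the snapshots available in the repository and pick the most recent one
curl -s "http://localhost:9200/_snapshot/squirro_bcp/_all?pretty"

# Restore the chosen snapshot and wait for completion (the snapshot name is a placeholder)
curl -s -X POST \
  "http://localhost:9200/_snapshot/squirro_bcp/snap_202401010800/_restore?wait_for_completion=true"
```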

Stage 7: MySQL Restore#

From the BCP NFS mount, the latest MySQL backup is restored to the Squirro leader. The followers immediately replicate to the same state.
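A sketch of the restore, run on the BCP cluster leader and assuming MySQL credentials in ~/.my.cnf (the dump file name matches the backup stage above):

```bash
# Restore the latest MySQL dump on the BCP cluster leader
gunzip -c /mnt/bcp-nfs/mysql/squirro-202401010800.sql.gz | mysql
```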

Stage 8: Config and Assets Restore#

From the BCP NFS mount, the contents of the cluster filesystem are synced to the Squirro cluster leader.

Optional: If both clusters are set up identically, the config files under /etc/squirro are also synced.
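For example, the reverse of the backup stage, run on the BCP cluster leader (paths are assumptions):

```bash
# Sync the staged cluster filesystem back into the BCP cluster
rsync -az --delete /mnt/bcp-nfs/storage/ /var/lib/squirro/storage/

# Optional: restore the configuration files if both clusters are set up identically
rsync -az /mnt/bcp-nfs/etc-squirro/ /etc/squirro/
```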

Stage 9: Flushing Redis Cache#

To avoid stale caches, the Redis databases are flushed on the Squirro cluster leader. The followers immediately replicate to the same state.
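As an illustration, run on the BCP cluster leader (if several Redis instances are in use, the command has to be repeated for each port):

```bash
# Flush all Redis databases on the leader to avoid serving stale cache entries
redis-cli FLUSHALL
```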

Stage 10: Restart Squirro#

On all BCP Squirro cluster nodes, all Squirro processes are restarted.
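A sketch of this step; the node names and the restart wrapper script are assumptions (the wrapper corresponds to the single Bash script mentioned in the requirements section):

```bash
# Restart all Squirro services on every BCP node via a sudo-enabled wrapper script (hypothetical path)
for node in bcp-node1 bcp-node2 bcp-node3; do
    ssh "squirro@${node}" "sudo /usr/local/bin/restart_squirro.sh"
done
```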

Stage 11: Testing II#

The script ensures that the BCP cluster is responsive again. If any error occurs, it can raise alerts (e.g. via email notification).
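A minimal health check could look as follows; the host name and the assumption that the frontend answers on the root HTTPS endpoint need to be adapted:

```bash
# Verify the BCP Elasticsearch cluster reports a healthy state again
curl -s "http://bcp-node1:9200/_cluster/health?pretty"

# Verify the Squirro frontend responds over HTTPS (prints the HTTP status code)
curl -sk -o /dev/null -w '%{http_code}\n' "https://bcp-node1/"
```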

Changing the Replication Direction#

The same mechanism is used to replicate from BCP to Production.

The best practice is to set up and test this scenario, but not to execute the script automatically (e.g. via cron).

Once BCP becomes active, the replication cron job on Production is stopped and the script on BCP is enabled.
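As an illustration, with cron the switch amounts to removing the job on the Production leader and enabling the opposite one on the BCP leader; the script names and the 15-minute interval are placeholders:

```bash
# On the Production leader: remove the scheduled Prod -> BCP replication
crontab -l | grep -v 'replicate_to_bcp.sh' | crontab -

# On the BCP leader: enable the BCP -> Prod replication, e.g. every 15 minutes
( crontab -l; echo '*/15 * * * * /opt/squirro/bcp/replicate_to_prod.sh' ) | crontab -
```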

For maximum safety, Squirro recommends separating both scenarios in the NFS mount. This way an accidental reversal of the direction cannot lead to unwanted data loss.

Reduced number of nodes in BCP#

The ideal setup is to run production and BCP with the same setup. This way the user experience will not degrade when a failover to BCP occurs.

However, it is possible to run a reduced setup in BCP. For example, instead of 3 nodes, only 1 node can be used.

Note: You should never run an even number of Squirro application or Elasticsearch nodes, since both systems rely on quorum to detect and handle network partitioning events.

Backup the NFS Mount#

It is highly recommended that the NFS mount is regularly backed up or protected by a vendor-specific snapshotting technology.

The NFS mount can be easily used to restore previous cluster states and is ideal for disaster recovery.

Known Limitations#

Session reset during failover: if a user logs into Production and is then routed (via the LB or GLB) to the replicated BCP installation, the user will be logged out.

This is unavoidable, as the user session stored in the Production cluster MySQL server is most likely not (yet) replicated to BCP.

The issue is minor if SSO integration is active, since the user will be logged in again automatically.