Leveraging NVMe SSDs for Elasticsearch speed#

By default, cloud providers use software-defined storage (SDS) to provide a virtual block device to the virtual machine: AWS uses EBS, GCP uses Persistent Disks (PD), and Azure uses Managed Disks.

While these SDS solutions are great for general-purpose workloads, they are not optimized for Elasticsearch. The disks are slow, with high latency and low throughput, and scaling up their performance is expensive while still falling short.

SDS disks also bundle many features that Elasticsearch does not need, such as snapshots, encryption, and replication. Elasticsearch already provides these itself, so paying for them twice is wasted money.

The best solution is to use NVMe SSDs: local disks directly attached to the virtual machine. They are fast, with low latency and high throughput, and they are also cheaper than the SDS disks.

Select a suitable instance type#

The first step is to select an instance type that has NVMe SSDs.

Each vendor offers different instance types, so you will need to find the equivalent instance type for your cloud provider.

Listing all suitable instance types for each cloud provider is out of scope for this guide, and the offerings change frequently, so you will need to do your own research.

At the time of writing (2024), the i3 and i4i instance families are an obvious choice on AWS. They are storage-optimized and come with fast direct-attached NVMe SSDs.

The regular m5d, r5d, and c5d instances also have NVMe SSDs. They are cheaper than the i instances, but they come with fewer and smaller NVMe SSDs.
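
If you are on AWS, one way to explore candidates is the EC2 API's instance-storage filter. A minimal sketch, assuming the AWS CLI is installed and configured:

aws ec2 describe-instance-types \
  --filters "Name=instance-storage-supported,Values=true" \
  --query "InstanceTypes[].[InstanceType,InstanceStorageInfo.TotalSizeInGB]" \
  --output table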

Understand the risk#

The NVMe SSDs are local to the virtual machine: if the virtual machine is terminated, the data on them is lost. This is not a problem for Elasticsearch, because Elasticsearch is a distributed system and the data is replicated across the cluster.

Warning

You have to ensure that the data is actually replicated across the cluster. With a single-node cluster, you will lose all your data if the instance is terminated.

For a lower environment, such as development or staging, this is not a problem: you can always re-index the data or restore it from a snapshot, and this can be automated.

For a production environment, losing data, even if only temporarily, is not an option. You will need a multi-node cluster with at least three master-eligible nodes. This ensures that the data is replicated across the cluster and that no data is lost if a single node is terminated.
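
You can verify this via the Elasticsearch REST API by checking the cluster health and the per-index replica counts. A minimal sketch, assuming Elasticsearch listens on localhost:9200 without authentication:

curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/indices?v&h=index,rep,health"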

When leveraging auto-scaling groups, ensure that new instances have enough time to join the cluster and replicate the data before more old instances are terminated; otherwise you will lose data.
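
One way to enforce this is to block the rollout until the cluster reports green again after each instance replacement. A minimal sketch, assuming the same unauthenticated local endpoint:

# Block for up to 10 minutes until all shards are allocated again
curl -s "localhost:9200/_cluster/health?wait_for_status=green&timeout=10m"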

Having said all that, NVMe SSDs are still a great choice for production environments, as they are easily up to 40 times faster than SDS volumes.
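
To measure the difference on your own instances, a random-read benchmark with fio is a quick sanity check. A minimal sketch, assuming fio is installed and /mnt/nvme is the NVMe mount point:

fio --name=randread --directory=/mnt/nvme --rw=randread \
    --bs=4k --size=1G --numjobs=4 --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=60 --time_based --group_reporting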

Backup the data before you try this#

If you’re migrating an existing cluster to NVMe SSDs, back up the data before proceeding. The solution should not lose data during the playbook run, as the data is moved to the NVMe SSDs, but it is always better to be safe than sorry, especially if you roll out instances too fast for the cluster to keep up with replication.

If you’re doing a new deployment, you don’t need to worry about this and you can safely skip this section.

Ideally, leverage the snapshot support of squirro-ansible, documented under Automate the Backup of Elasticsearch.
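
If you prefer to trigger a snapshot manually instead, the raw Elasticsearch snapshot API also works. A minimal sketch, assuming a snapshot repository named backup is already registered:

curl -s -X PUT "localhost:9200/_snapshot/backup/pre-nvme-migration?wait_for_completion=true"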

Adjust the Ansible playbook / variables#

Set the following variables for the playbook:

direct_attached_nvme_device: "/dev/sdb"
direct_attached_nvme_mount_point: "/mnt/nvme"  # this is the default

Adjust the example /dev/sdb to the device name of the NVMe SSD. The name differs per instance type, so you will need to do your own research.
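
For example, on an AWS Nitro-based instance the instance store volume typically shows up as /dev/nvme1n1 (this exact name is an assumption; verify it with lsblk as shown below):

direct_attached_nvme_device: "/dev/nvme1n1"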

On Red Hat Enterprise Linux you can use the following command to list all available block devices:

lsblk -d -o name,tran
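
On an AWS Nitro-based instance the output might look like the following. Note that EBS volumes also report the nvme transport on Nitro, so double-check the device sizes with plain lsblk (the names below are illustrative):

NAME    TRAN
nvme0n1 nvme
nvme1n1 nvme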

How it works#

Since the disks come up empty and unformatted, this needs to be taken care of on every instance creation. While it is possible to run Ansible on each instance creation, that is not the best solution: it slows down instance creation and can make the process unreliable.

Because of this, the Ansible role creates a one-shot systemd service that runs on every boot of the instance. It checks whether the NVMe SSD is already formatted and mounted; if not, it formats and mounts it.

It is safe to run this service on every boot, because it will only format and mount the NVMe SSD if it is not already formatted and mounted.

If there is still data on the regular data directory /var/lib/elasticsearch, it will be moved to the new data directory /mnt/nvme/var/lib/elasticsearch. A symlink is created from the original location to the new location for ease of use.
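
The actual role implementation may differ, but the logic of such a one-shot script roughly amounts to the following sketch (device name, mount point, filesystem choice, and paths are assumptions based on the defaults above):

#!/bin/bash
set -euo pipefail

# Assumed values; the role derives them from the variables above
DEVICE="/dev/nvme1n1"
MOUNT_POINT="/mnt/nvme"
DATA_DIR="/var/lib/elasticsearch"

# Format only if the device does not carry a filesystem yet
if ! blkid "$DEVICE" >/dev/null 2>&1; then
    mkfs.xfs "$DEVICE"
fi

# Mount only if the mount point is not already mounted
if ! findmnt "$MOUNT_POINT" >/dev/null 2>&1; then
    mkdir -p "$MOUNT_POINT"
    mount "$DEVICE" "$MOUNT_POINT"
fi

# Move existing data once, then leave a symlink at the old location
if [ -d "$DATA_DIR" ] && [ ! -L "$DATA_DIR" ]; then
    mkdir -p "$MOUNT_POINT$(dirname "$DATA_DIR")"
    mv "$DATA_DIR" "$MOUNT_POINT$DATA_DIR"
    ln -s "$MOUNT_POINT$DATA_DIR" "$DATA_DIR"
fi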

Troubleshooting#

If the NVMe SSD does not get formatted and mounted, run the following command to inspect the logs:

sudo journalctl -u sqnvmepreparation -f
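
Beyond the logs, checking the service state and the mount itself can help narrow things down (the mount point assumes the default /mnt/nvme):

sudo systemctl status sqnvmepreparation
findmnt /mnt/nvme
lsblk -f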