Reindexing Elasticsearch#

Reindexing copies all data from one index into a newly created index and is useful for the following functions:

  • Changing index configuration (shards).

  • Getting rid of documents marked as deleted, which are only purged when two smaller segments are merged into one bigger segment.

  • Applying additional operations like removal of duplicated sub-items.

Note: Suboptimal index configuration can lead to segments larger than 5GB. Such large segments are typically no longer considered for merging, so their deleted documents are never purged.
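
To see whether an index is affected, you can inspect its segments and their deleted-document counts. The command below is a sketch that uses the ES proxy endpoint (http://localhost:81/ext/elastic) introduced in Working with Elastic-API on a Squirro-Node further down; replace <project_id> with your lowercased project ID.

# List segments with their size and deleted-document counts, largest first.
# Segments close to the 5GB merge ceiling are rarely merged again, so their
# deleted documents stay on disk until the index is rebuilt.
curl -XGET "http://localhost:81/ext/elastic/_cat/segments/squirro_v9_<project_id>?v=true&h=index,segment,size,docs.count,docs.deleted&s=size:desc"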

Multi Node ES Cluster#

Figure: Elasticsearch Architecture

Improve Shard Settings for Production#

By default, Squirro deployments configure the ES cluster with 6 shards and 0 replicas.

Projects that continuously add data to the same Squirro project are likely to outgrow this default configuration sooner or later.

Tip 1: Maximum Shard Size#

  • A shard should not grow larger than 50GB, resulting in a maximum index size of 6 x 50GB = 300GB with the default 6-shard configuration.

The advantages of having many reasonable-sized shards (<50GB) include:

  • Faster (parallel) search across the shards in the cluster.

  • Recovery from node failure: smaller shards can be more easily and evenly distributed.
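
To spot shards that approach the 50GB guideline, list them sorted by store size (a sketch using the same ES proxy endpoint described further down; <project_id> is a placeholder for the lowercased project ID):

curl -XGET "http://localhost:81/ext/elastic/_cat/shards/squirro_v9_<project_id>?v=true&h=index,shard,prirep,store,node&s=store:desc"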

Tip 2: Minimum Shard Size#

Shards need additional resources:

  • Many small shards add many small segments & increase overhead.

  • Ideally, a shard holds between a few GB and a few tens of GB.

Tip 3: Available Heap vs Maximum Shards#

The maximum number of shards is proportional to the amount of heap available on the Storage node.

At most, this should be 20 shards per GB of heap, meaning that for 32GB of heap available, use a maximum of 640 shards.
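
To compare this limit against your own cluster, check the configured heap per node and the number of shards currently allocated (again a sketch against the ES proxy endpoint described further down):

# Configured vs. used heap per node:
curl -XGET "http://localhost:81/ext/elastic/_cat/nodes?v=true&h=name,heap.max,heap.current,heap.percent"

# Number of shards currently allocated across the cluster (one line per shard):
curl -s -XGET "http://localhost:81/ext/elastic/_cat/shards" | wc -l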

Tip 4: Shards vs Number of Nodes#

The number of shards per node (total shards divided by the number of storage nodes) should be a whole number, so that shards are distributed evenly across the storage nodes.

Example: For 3 storage nodes, a total of 18 shards on a cluster would result in 6 shards per node.
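
To check how the shards of an existing index are actually spread across the storage nodes (same assumptions as the sketches above):

# One line per shard, grouped and counted per node:
curl -s -XGET "http://localhost:81/ext/elastic/_cat/shards/squirro_v9_<project_id>?h=node" | sort | uniq -c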


Requirements#

Before you start any reindexing operation, make sure that the requirements listed below are met.

Disk Space#

You need at least a spare amount of disk space equivalent to the primary store size (in the example below, pri.store.size=272GB).

store.size (541GB) includes all shards and replicas across the cluster.

The more documents marked for deletion exist within your indices (docs.deleted), the less space is needed for the new index, because deleted documents are skipped.

Example: Stats of Source Index vs Re-indexed#

>>> Original IDX
$ $GET/_cat/indices/squirro_v9_$pid?v=true
    health status index             pri rep docs.count    docs.deleted  store.size     pri.store.size
    green  open   squirro_v9_$pid    6   1   13393103      2733053       541.4gb        272.3gb

>>> Re-Indexed IDX
$ $GET/_cat/indices/squirro_v9_$pid-reindexed?v=true
    health status index                       pri rep docs.count docs.deleted store.size pri.store.size
    green  open   squirro_v9_$pid-reindexed   18   0   13096028            0    158.4gb        158.4gb

Achieved reduction of index storage size: ~42%  =  1.0 - ( 158GB / 272GB )
Note: docs.count of the target index is ~300k documents lower than in the source index; these are the excluded duplicated PDFs.

Root Access#

The scripts are designed to run on a Squirro node as root (e.g., to start/stop Squirro services).

Permission to Modify Index Locator#

/etc/squirro/configuration.ini
[bool]
# Note: legacy .ini uses `_`
topic.custom_locator = true

Or update Configuration Service via UI (as Admin):

# Note: configuration service v2 uses `-`
Server > Configuration > topic.custom-locator: true

Optional: Check for Duplicated PDFs#

detect_duplicate_sub_items.py

Problems may appear if:

  • The same binary documents are loaded into the same project/datasource multiple times without a configured de-duplication step.

To check whether a project contains duplicates, use the following command:

$ python detect_duplicate_sub_items.py --index $IDX1

Example Output:

2021-08-24 12:19:18,989 - >> Parents with conflicting sub-items: 1
2021-08-24 12:19:18,989 - >> Conflicting IDs: {'mvhEe9481KwHcEafjCn_wQ'}
2021-08-24 12:19:18,989 - Total sub-items: 766
2021-08-24 12:19:18,989 - Total parent documents: 11
2021-08-24 12:19:18,989 - Duplicate detection finished in 0s
2021-08-24 12:19:18,989 - >> Retrieve External ID mapping
2021-08-24 12:19:18,999 - >> ID-Mapping:
{'mvhEe9481KwHcEafjCn_wQ': 'UAT!-23XF19'}

The script logs a list of [{item_id: external_id}] mappings, with external_id coming from the third-party source.

Working with Elastic-API on a Squirro-Node#

Below is a list of terminal shortcuts to query the ES API from a Squirro node and retrieve stats about indices, tasks, and so on.

Note

The commands below have to be executed on a Squirro node (requires SSH access).

Set Up Terminal Shortcuts#

Store project_id in pid variable:

pid=wyGRWWirSleN_bMrqCXOOw

GET & POST shortcuts to work with ES-API:

ES=http://localhost:81/ext/elastic

GET="curl -XGET $ES"
POST="curl -XPOST $ES"                                      # to upload json in body with curl provide
PUT="curl -XPUT $ES"                                        # -d {"json":"body"} -H "Accept: application/json" -H "Content-Type: application/json"

IDX1=squirro_v9_`echo $pid | tr '[:upper:]' '[:lower:]'`    # Source Index, lowercased project_id
IDX2=$IDX1-reindexed                                        # Target Index created during reindexing
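
A quick way to verify that the shortcuts resolve and that the ES proxy is reachable:

$GET/_cat/health?v=true
$GET/_cat/indices/$IDX1?v=true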

View / Cancel a Task#

t_id=o2rcMpCkSc6dESfYNmKmRA:82899  # ID is printed from reindex.py script
$GET/_tasks/$t_id
$POST/_tasks/$t_id/_cancel

Retrieve IDX Statistics#

Monitor index, shard, and segment statistics on the source vs the target index.

$GET/_cat/indices/$IDX1?v=true
$GET/_cat/shards/$IDX1?v=true
$GET/_cat/segments/$IDX1?v=true

Open / Close Indices#

After reindexing is finished, the old index gets closed automatically. To close or open an index manually, use:

$POST/$IDX1/_close
$POST/$IDX1/_open

Monitor Disk Space, Target IDX Stats#

Execute the watch commands in their own screen session.

watch df -h                                 # make sure you do not run out of disk space during reindexing
watch $GET/_cat/indices/$IDX2?v=true        # watch target index growing

Working with Screen Sessions#

Create named session:               screen -S reindex
List sessions:                      screen -ls
Attach to running session:          screen -R reindex

Run Automated Re-Indexing#

Note

Run the scripts on one Squirro node; database configuration changes (like the index locator) are distributed across the cluster.

Switch to the root user and activate the Squirro environment:

sudo su
squirro_activate

1: ES Reindexing#

Caution

Stop the ingester service and any datasource loading before continuing with the next step.

systemctl stop sqingesterd

1.1: reindex.py#

pid=<YOUR_PROJECT>
token=<YOUR_TOKEN>
python reindex.py --project-id $pid --shards 6 --replicas 0 --distinct-sub-items False --token $token

Alternatively, you can pass the name of the Elasticsearch index directly to the script:

python reindex.py --original-index <YOUR_INDEX> --shards 6 --replicas 0 --distinct-sub-items False

Note: Script&reindexing can be canceled via CTRL-C

reindex.py:

  1. Sets up a new index with the provided shard configuration. (Note that the total number of shards divided by the number of storage nodes should be a whole number; see Tip 4 above.)

  2. Starts reindexing operation in the background.

Optimizations for indexing speed (see the sketch after this list):

  • Performs parallel (sliced) indexing: one sub-task per target-index shard.

  • Sets a very low refresh frequency on the target index (refreshing is an expensive operation).

  • Tunable: reduce/increase the batch size used for bulk indexing.

  • Monitors the reindexing task:
    • Logs the task ID, e.g.: 2021-08-20 11:47:19,773 - >>> reindexing running on task: o2rcMpCkSc6dESfYNmKmRA:82899

    • Monitors state via ES /_tasks/<id>

    • State logged periodically to reindex-$pid-$timestamp.log

  • Records index statistics (source index vs target index):
    • reindex-$pid-$timestamp_idx_stats.txt

    • With an indication of found version_conflicts, total reindexed documents, etc.

  • (Optional) Skips duplicated sub-items (applies a custom script processor).

  • Cleans up documents marked as deleted (default behavior for reindexing).
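
For orientation, below is a minimal sketch of the kind of ES requests these optimizations boil down to, using the terminal shortcuts set up above. It is illustrative only; the exact calls and parameters used by reindex.py may differ.

# Illustrative sketch -- reindex.py automates this; the parameters shown are assumptions.
# 1. Disable refresh on the target index while bulk indexing runs.
settings='{"index": {"refresh_interval": "-1"}}'
$PUT/$IDX2/_settings -d "$settings" -H "Accept: application/json" -H "Content-Type: application/json"

# 2. Start a sliced reindex as a background task; the response contains the task ID.
body='{"source": {"index": "'$IDX1'", "size": 1000}, "dest": {"index": "'$IDX2'"}, "conflicts": "proceed"}'
$POST/_reindex"?slices=auto&wait_for_completion=false" -d "$body" -H "Accept: application/json" -H "Content-Type: application/json"

# 3. Once reindexing has finished, restore the refresh interval (ES default: 1s).
settings='{"index": {"refresh_interval": "1s"}}'
$PUT/$IDX2/_settings -d "$settings" -H "Accept: application/json" -H "Content-Type: application/json"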

1.2 Check Cluster State#

  • Check the index stats in the associated logfile: reindex-$pid-$timestamp_idx_stats.txt

  • Check for a green index state.

$GET/_cluster/health/$IDX2?pretty

2: (Optional) Compare Number of Documents#

To be sure that reindexing went well, you can verify the number of documents in the reindexed index and compare it to the number of documents in the original one using the following command:

python compare_docs.py --original-index <ORIGINAL_INDEX> --target-index <TARGET_INDEX>

Note

The number of documents may vary slightly; this does not mean that something went wrong.

During reindexing, documents with an invalid mapping are filtered out. Treat the comparison as a sanity check to catch obvious issues, for example an empty index or half of the documents missing.
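
As a quick manual cross-check, you can also compare the raw document counts directly via the ES _count API, using the shortcuts set up above (both indices must be open for this to work):

$GET/$IDX1/_count?pretty
$GET/$IDX2/_count?pretty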

3: Update Project’s Index Locator#

3.1 update_index_locator.py#

python update_index_locator.py --project-id $pid --token $token

Alternatively, you can pass the name of the Elasticsearch index directly:

python update_index_locator.py --original-index <YOUR_INDEX> --token $token

update_index_locator.py:

  • Updates the index_locator on the project.

  • Closes the old ES index and opens the target index (in case it’s closed)

  • Logs updated project_config

  • Allows rollback to the original index via --target-index, used as follows:

$ python update_index_locator.py --project-id $pid --token $token --target-index $IDX1
Start rewiring of project wyGRWWirSleN_bMrqCXOOw. Change index-locator from 'squirro_v9_wygrwwirslen_bmrqcxoow-reindexed' to 'squirro_v9_wygrwwirslen_bmrqcxoow'

3.2 Validation & Restart Services#

Test Project Setup

  • Perform search queries & retrieve results.

  • Click on a result item & check that the detail view gets loaded.

  • Check that set-up communities work as expected.

  • Verify that the facet configuration page loads:
    • Setup > Data > Labels
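
In addition to the UI checks, a quick index-level sanity check can be run with the terminal shortcuts set up above (a sketch, assuming the target index now serves the project):

$GET/$IDX2/_search"?size=1&pretty"          # the target index should return hits
$GET/_cat/indices/$IDX2?v=true              # and report an open, green index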

Restart Squirro Services

squirro_restart

4: (Optional) Backup Original Index#

To save disk space, the original index (closed by update_index_locator.py) can be deleted after a successful reindexing operation.

It is recommended to take a snapshot and store the backup in a dedicated repository.
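
The following is a minimal sketch of such a snapshot via the ES snapshot API. The repository name squirro_backup and the location /var/backups/elasticsearch are placeholders, and the location must be listed under path.repo in the Elasticsearch configuration.

# Register a filesystem snapshot repository (name and location are assumptions).
repo='{"type": "fs", "settings": {"location": "/var/backups/elasticsearch"}}'
$PUT/_snapshot/squirro_backup -d "$repo" -H "Accept: application/json" -H "Content-Type: application/json"

# Snapshot only the original index; depending on the ES version, the index may
# need to be open ($POST/$IDX1/_open) before it can be snapshotted.
snapshot='{"indices": "'$IDX1'", "include_global_state": false}'
$PUT/_snapshot/squirro_backup/before-delete-$(date +%Y%m%d)"?wait_for_completion=true" -d "$snapshot" -H "Accept: application/json" -H "Content-Type: application/json"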


5: (Optional) Enable Replicas#

Configure the number of replicas depending on the use case.

Replicas increase cluster resilience through redundancy:

  • A replica shard R-S1 is never located on the same node as its primary shard S1.

  • Example: S1 on Node 1, R-S1 on Node 2.
    • In case of Node 1 failure, documents associated with S1 can still be served through R-S1 via Node 2.

  • Additionally, replicas serve search requests & decrease latency for setups with high search-traffic.

Prerequisite

Replicas can only be configured on an ES cluster setup with multiple nodes. A replica shard cannot be located on the same node as its primary shard.

settings='{
  "index" : {
    "number_of_replicas" : 1
  }
}'

$PUT/$IDX1/_settings -d "$settings" -H "Accept: application/json" -H "Content-Type: application/json"
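
After updating the setting, verify that the replica shards were created and allocated to different nodes than their primaries (a quick check using the shortcuts from above):

$GET/_cat/shards/$IDX1"?v=true&h=index,shard,prirep,state,store,node"
$GET/_cluster/health/$IDX1?pretty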