Reindexing Elasticsearch#
Reindexing copies all data from one index into a newly created index and is useful for the following purposes:
Changing index configuration (shards).
Getting rid of all documents marked as deleted, which are only purged when two smaller segments are merged into one bigger segment.
Applying additional operations like removal of duplicated sub-items.
Note: Suboptimal index configuration can lead to segments larger than 5GB. Such large segments are typically no longer considered for merging, so the deleted documents they contain are never purged.
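To see how large the segments of an index currently are and how many deleted documents they hold, you can query the segment statistics directly. The snippet below is a minimal sketch; it assumes the local Elasticsearch proxy endpoint (http://localhost:81/ext/elastic) used throughout this guide, and <project_id> is a placeholder for your lowercased project ID.
# List per-segment size and deleted-document counts for a project's index:
curl -XGET "http://localhost:81/ext/elastic/_cat/segments/squirro_v9_<project_id>?v=true&h=index,shard,segment,size,docs.count,docs.deleted"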
Multi Node ES Cluster#
Improve Shard Settings for Production#
By default, Squirro Deployments configure the ES-Cluster with 6 shards and 0 replicas.
Deployments that continuously add data to the same Squirro project are likely to outgrow this default configuration sooner or later.
Tip 1: Maximum Shard Size#
A shard should not grow larger than 50GB; with the default 6 shards, this results in a maximum index size of 6 x 50GB = 300GB.
The advantages of having many reasonable-sized shards (<50GB) include:
Faster (parallel) search across shards distributed over the cluster.
Recovery from node failure: smaller shards can be more easily and evenly distributed.
Tip 2: Minimum Shard Size#
Shards need additional resources:
Many small shards add many small segments & increase overhead.
Ideally, a shard has between a few GBs to tens of GBs.
Tip 3: Available Heap vs Maximum Shards#
The maximum number of shards is proportional to the amount of available heap on the Storage node.
At most, this should be 20 shards per GB of heap, meaning that for 32GB of heap available, use a maximum of 640 shards.
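To compare the current cluster against this guideline, the heap per node and the total number of allocated shards can be read from the _cat APIs. This is a sketch assuming the local Elasticsearch proxy endpoint used throughout this guide:
# Maximum heap configured per node:
curl -XGET "http://localhost:81/ext/elastic/_cat/nodes?v=true&h=name,heap.max"
# Total number of shards on the cluster (one line per shard, including replicas):
curl -XGET "http://localhost:81/ext/elastic/_cat/shards?h=index" | wc -l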
Tip 4: Shards vs Amount of Nodes#
The number of shards divided by the number of storage nodes (shards / amount(storage_nodes)) should be a whole number, so that shards can be distributed evenly.
Example: For 3 storage nodes, a total of 18 shards on the cluster results in 6 shards per node.
For more information, see the official Elasticsearch documentation on shard sizing.
Requirements#
Before you start any reindexing operation, make sure that the requirements listed below are met.
Disk Space#
You need to have at least a spare amount of disk space equivalent to the primary store size (in the example below, pri.store.size = 272GB).
store.size (541GB) contains all shards and replicas across the cluster.
The more documents marked for deletion exist within your indices (docs.deleted), the less space is needed for the new index, because deleted documents are skipped.
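To confirm that enough free disk space is available before reindexing, check the file system that holds the Elasticsearch data. The path below is an assumption (the default location on RPM-based installs); adjust it to your deployment.
df -h /var/lib/elasticsearch # free space should exceed pri.store.size of the source index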
Example: Stats of Source Index vs Re-indexed#
>>> Original IDX
$ $GET/_cat/indices/squirro_v9_$pid?v=true
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open squirro_v9_$pid 6 1 13393103 2733053 541.4gb 272.3gb
>>> Re-Indexed IDX
$ $GET/_cat/indices/squirro_v9_$pid-reindexed?v=true
health status index pri rep docs.count docs.deleted store.size pri.store.size
green open squirro_v9_$pid-reindexed 18 0 13096028 0 158.4gb 158.4gb
Achieved reduction of primary index storage size: ~42% = 1.0 - ( 158GB / 272GB )
Note: docs.count of the target index is ~300k documents lower than that of the source index; these are the excluded duplicated PDFs.
Root Access#
The scripts are configured to run on a Squirro node as root (e.g., to start/stop Squirro services).
Permission to Modify Index Locator#
/etc/squirro/configuration.ini
[bool]
# Note: legacy .ini uses `_`
topic.custom_locator = true
Or update the Configuration Service via the UI (as Admin):
# Note: configuration service v2 uses `-`
Server > Configuration > topic.custom-locator: true
Optional: Check for Duplicated PDFs
detect_duplicate_sub_items.py
Problems may appear if:
The same binary documents are loaded into the same project/datasource multiple times without a configured de-duplication step.
To check whether a project contains duplicates, use the following command:
$ python detect_duplicate_sub_items.py --index $IDX1
Example Output:
2021-08-24 12:19:18,989 - >> Parents with conflicting sub-items: 1
2021-08-24 12:19:18,989 - >> Conflicting IDs: {'mvhEe9481KwHcEafjCn_wQ'}
2021-08-24 12:19:18,989 - Total sub-items: 766
2021-08-24 12:19:18,989 - Total parent documents: 11
2021-08-24 12:19:18,989 - Duplicate detection finished in 0s
2021-08-24 12:19:18,989 - >> Retrieve External ID mapping
2021-08-24 12:19:18,999 - >> ID-Mapping:
{'mvhEe9481KwHcEafjCn_wQ': 'UAT!-23XF19'}
Logs a list of [{item_id: external_id}], with external_id coming from a third-party source.
Working with Elastic-API on a Squirro-Node#
Below is a list of terminal shortcuts to query the ES API from a Squirro node and retrieve stats about indices, tasks, and so on.
Note
The commands below have to be executed on a Squirro node (requires SSH access).
Set Up Terminal Shortcuts#
Store project_id in pid variable:
pid=wyGRWWirSleN_bMrqCXOOw
GET & POST shortcuts to work with the ES API:
ES=http://localhost:81/ext/elastic
GET="curl -XGET $ES"
POST="curl -XPOST $ES" # to upload json in body with curl provide
PUT="curl -XPUT $ES" # -d {"json":"body"} -H "Accept: application/json" -H "Content-Type: application/json"
IDX1=squirro_v9_`echo $pid | tr '[:upper:]' '[:lower:]'` # Source Index, lowercased project_id
IDX2=$IDX1-reindexed # Target Index created during reindexing
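As a quick smoke test that the shortcuts work, any read-only endpoint can be queried, for example the cluster health:
$GET/_cat/health?v=true # returns the cluster health as a one-line table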
View / Cancel a Task#
t_id=o2rcMpCkSc6dESfYNmKmRA:82899 # ID is printed from reindex.py script
$GET/_tasks/$t_id
$POST/_tasks/$t_id/_cancel
Retrieve IDX Statistics#
Monitor index, shard, and segment statistics on the source vs the target index.
$GET/_cat/indices/$IDX1?v=true
$GET/_cat/shards/$IDX1?v=true
$GET/_cat/segments/$IDX1?v=true
Open / Close Indices#
After reindexing is finished, the old index gets automatically closed:
$POST/$IDX1/_close
$POST/$IDX1/_open
Monitor Disk Space, Target IDX Stats#
Execute watch commands in their own screen session.
watch df -h # make sure you do not run out of disk space during reindexing
watch $GET/_cat/indices/$IDX2?v=true # watch target index growing
Working with Screen Sessions#
Create named session: screen -S reindex
List sessions: screen -ls
Attach to running session: screen -R reindex
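Detach from the current session (it keeps running in the background): Ctrl-A, then d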
Run Automated Re-Indexing#
Note
Run the scripts on one Squirro node; database configuration changes (like the index locator) are distributed across the cluster.
Switch to the root user and activate the Squirro environment:
sudo su
squirro_activate
1: ES Reindexing#
Caution
Stop the ingester service and any datasource loading before continuing with the next step.
systemctl stop sqingesterd
1.1: reindex.py#
pid=<YOUR_PROJECT>
token=<YOUR_TOKEN>
python reindex.py --project-id $pid --shards 6 --replicas 0 --distinct-sub-items False --token $token
Alternatively, you can pass the name of the Elasticsearch index directly to the script:
python reindex.py --original-index <YOUR_INDEX> --shards 6 --replicas 0 --distinct-sub-items False
Note: Script&reindexing can be canceled via CTRL-C
reindex.py
:
Sets up a new index with the provided shard configuration. (Note that shards / amount(storage_nodes) should be a whole number.)
Starts the reindexing operation in the background.
Optimizations for indexing speed (see the sketch after this list):
Performs parallel (sliced) indexing: one sub-task per target-index shard
Sets a very low refresh_interval frequency (refresh is an expensive operation)
Tunable: reduce/increase the batch size used for bulk indexing
Logs:
2021-08-20 11:47:19,773 - >>> reindexing running on task: o2rcMpCkSc6dESfYNmKmRA:82899
Monitors the task state via the ES /_tasks/<id> endpoint
State is logged periodically to reindex-$pid-$timestamp.log
Records index statistics (source index vs target index) in reindex-$pid-$timestamp_idx_stats.txt, with an indication of found version_conflicts, total reindexed documents, etc.
(Optional) Skips duplicated sub-items (applies a custom script processor)
Cleans up documents marked as deleted (default behavior of reindexing)
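For illustration, the underlying Elasticsearch calls correspond roughly to the following sketch. This is not the exact implementation of reindex.py; it only shows the kind of requests involved, reusing the $ES shortcuts and index variables defined above:
# Disable periodic refresh on the target index while bulk indexing (reset it afterwards):
settings='{"index": {"refresh_interval": "-1"}}'
$PUT/$IDX2/_settings -d "$settings" -H "Content-Type: application/json"
# Start a sliced reindex as a background task; the response contains the task ID:
body='{"source": {"index": "'$IDX1'"}, "dest": {"index": "'$IDX2'"}}'
$POST"/_reindex?slices=auto&wait_for_completion=false" -d "$body" -H "Content-Type: application/json" # quotes keep ? and & from the shell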
1.2 Check Cluster State#
Check the index stats in the associated logfile:
reindex-$pid-$timestamp_idx_stats.txt
Check for a green index state.
$GET/_cluster/health/$IDX2?pretty
2: (Optional) Compare Number of Documents#
To be sure that reindexing went well, you can verify the number of documents in the reindexed index and compare it to the number of documents in the original one using the following command:
python compare_docs.py --original-index <ORIGINAL_INDEX> --target-index <TARGET_INDEX>
Note
The number of documents may vary slightly; this does not mean that something went wrong.
During reindexing, documents with invalid mappings are filtered out. Treat this more as a sanity check to catch obvious issues, for example an empty index or half of the documents missing.
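If you prefer to look at the raw numbers yourself, the document counts can also be read via the ES API (using the shortcuts defined earlier); note that this only works while both indices are still open:
$GET/_cat/count/$IDX1?v=true # document count of the original index
$GET/_cat/count/$IDX2?v=true # document count of the reindexed index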
3: Update Project’s Index Locator#
3.1 update_index_locator.py#
python update_index_locator.py --project-id $pid --token $token
Alternatively, you can pass the name of the Elasticsearch index directly:
python update_index_locator.py --original-index <YOUR_INDEX> --token $token
update_index_locator.py:
Updates the index_locator on the project.
Closes the old ES index and opens the target index (in case it’s closed)
Logs updated project_config
Allows rollback to the original index via --target-index, used as:
$ python update_index_locator.py --project-id $pid --token $token --target-index $IDX1
Start rewiring of project wyGRWWirSleN_bMrqCXOOw. Change index-locator from 'squirro_v9_wygrwwirslen_bmrqcxoow-reindexed' to 'squirro_v9_wygrwwirslen_bmrqcxoow'
3.2 Validation & Restart Services#
Test Project Setup
Perform search queries & retrieve results
Click on a result item & check that the detail view gets loaded
Verify that set-up communities work as expected
Verify that the facet configuration page loads: Setup > Data > Labels
Restart Squirro Services
squirro_restart
4: (Optional) Backup Original Index#
To save disk space, the original index (closed by update_index_locator.py) can be deleted after a successful reindexing operation.
It is recommended to take a snapshot and store the backup in a dedicated repository.
For more information, see the official Elasticsearch documentation on snapshot and restore.
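As an illustration, a file-system snapshot could be taken roughly as follows. This is only a sketch: the repository name, snapshot name, and location are hypothetical, and the location must already be registered under path.repo in elasticsearch.yml. Note that a closed index cannot be snapshotted, so reopen the original index first if needed.
$POST/$IDX1/_open # the index must be open to be included in a snapshot
# Register a file-system snapshot repository (location must be listed in path.repo):
repo='{"type": "fs", "settings": {"location": "/mnt/es-backups"}}'
$PUT/_snapshot/reindex_backup -d "$repo" -H "Content-Type: application/json"
# Snapshot only the original index before deleting it:
snapshot='{"indices": "'$IDX1'"}'
$PUT"/_snapshot/reindex_backup/pre-delete-snapshot?wait_for_completion=true" -d "$snapshot" -H "Content-Type: application/json"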
5: (Optional) Enable Replicas#
Configure the number of replicas depending on the use case.
Replicas increase cluster resilience through redundancy:
A replica shard R-S1 is never located on the same node as its primary shard S1.
Example: S1 on Node 1, R-S1 on Node 2. In case of Node 1 failure, documents associated with S1 can still be served through R-S1 via Node 2.
Additionally, replicas serve search requests & decrease latency for setups with high search-traffic.
Pre-Requisite
Replicas can only be assigned on an ES cluster with multiple nodes. A replica shard cannot be located on the same node as its primary shard.
settings='{
"index" : {
"number_of_replicas" : 1
}
}'
$PUT/$IDX1/_settings -d "$settings" -H "Accept: application/json" -H "Content-Type: application/json"
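To verify that the replica shards are actually assigned and the cluster stays healthy, check the shard allocation and index health afterwards:
$GET/_cat/shards/$IDX1?v=true # prirep column: p = primary, r = replica
$GET/_cluster/health/$IDX1?pretty # should report status green once replicas are allocated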