diff options
Diffstat (limited to 'doc/integration/elasticsearch.md')
-rw-r--r-- | doc/integration/elasticsearch.md | 190 |
1 files changed, 171 insertions, 19 deletions
diff --git a/doc/integration/elasticsearch.md b/doc/integration/elasticsearch.md index 8ca45547f11..67b256cc944 100644 --- a/doc/integration/elasticsearch.md +++ b/doc/integration/elasticsearch.md @@ -23,9 +23,8 @@ special searches: ## Installing Elasticsearch -Elasticsearch is _not_ included in the Omnibus packages. You will have to -[install it yourself](https://www.elastic.co/guide/en/elasticsearch/reference/6.8/install-elasticsearch.html "Elasticsearch 6.8 installation documentation") -whether you are using the Omnibus package or installed GitLab from source. +Elasticsearch is _not_ included in the Omnibus packages or when you install from source. You must +[install it separately](https://www.elastic.co/guide/en/elasticsearch/reference/7.x/install-elasticsearch.html "Elasticsearch 7.x installation documentation"). Be sure to select your version. Providing detailed information on installing Elasticsearch is out of the scope of this document. @@ -118,14 +117,21 @@ Once installed, enable it under your instance's Elasticsearch settings explained ## System Requirements Elasticsearch requires additional resources in excess of those documented in the -[GitLab system requirements](../install/requirements.md). These will vary by -installation size, but you should ensure **at least** an additional **8 GiB of RAM** -for each Elasticsearch node, per the [official guidelines](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html). +[GitLab system requirements](../install/requirements.md). -Keep in mind, this is the **minimum requirements** as per Elasticsearch. For -production instances, they recommend considerably more resources. +The amount of resources (memory, CPU, storage) will vary greatly, based on the amount of data being indexed into the Elasticsearch cluster. According to [Elasticsearch official guidelines](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_memory), each node should have: -The necessary storage space largely depends on the size of the repositories you want to store in GitLab but as a rule of thumb you should have at least 50% of free space as all your repositories combined take up. +- [RAM](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_disks): **8 GiB as the bare minimum** +- [CPU](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_cpus): Modern processor with multiple cores +- [Storage](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_disks): Use SSD storage. As a guide you will need enough storage for 50% of the total size of your Git repositories. + +A few notes on CPU and storage: + +- CPU requirements for Elasticsearch tend to be light. There are specific scenarios where this isn't true, but GitLab.com isn't using Elasticsearch in an exceptionally CPU-heavy way. More cores will be more performant than faster CPUs. Extra concurrency from multiple cores will far outweigh a slightly faster clock speed in Elasticsearch. + +- Storage requirements for Elasticsearch are important, especially for indexing-heavy clusters. When possible, use SSDs, Their speed is far superior to any spinning media for Elasticsearch. In testing, nodes that use SSD storage see boosts in both query and indexing performance. + +Keep in mind, these are **minimum requirements** for Elasticsearch. Heavily-utilized Elasticsearch clusters will likely require considerably more resources. ## Enabling Elasticsearch @@ -142,13 +148,14 @@ The following Elasticsearch settings are available: | `Elasticsearch pause indexing` | Enables/disables temporary indexing pause. This is useful for cluster migration/reindexing. All changes are still tracked, but they are not committed to the Elasticsearch index until unpaused. | | `Search with Elasticsearch enabled` | Enables/disables using Elasticsearch in search. | | `URL` | The URL to use for connecting to Elasticsearch. Use a comma-separated list to support clustering (e.g., `http://host1, https://host2:9200`). If your Elasticsearch instance is password protected, pass the `username:password` in the URL (e.g., `http://<username>:<password>@<elastic_host>:9200/`). | -| `Number of Elasticsearch shards` | Elasticsearch indexes are split into multiple shards for performance reasons. In general, larger indexes need to have more shards. Changes to this value do not take effect until the index is recreated. You can read more about tradeoffs in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#create-index-settings) | +| `Number of Elasticsearch shards` | Elasticsearch indexes are split into multiple shards for performance reasons. In general, larger indexes need to have more shards. Changes to this value do not take effect until the index is recreated. You can read more about tradeoffs in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html). | | `Number of Elasticsearch replicas` | Each Elasticsearch shard can have a number of replicas. These are a complete copy of the shard, and can provide increased query performance or resilience against hardware failure. Increasing this value will greatly increase total disk space required by the index. | | `Limit namespaces and projects that can be indexed` | Enabling this will allow you to select namespaces and projects to index. All other namespaces and projects will use database search instead. Please note that if you enable this option but do not select any namespaces or projects, none will be indexed. [Read more below](#limiting-namespaces-and-projects). -| `Using AWS hosted Elasticsearch with IAM credentials` | Sign your Elasticsearch requests using [AWS IAM authorization](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) or [AWS EC2 Instance Profile Credentials](https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-iam-instance-profile.html#getting-started-create-iam-instance-profile-cli). The policies must be configured to allow `es:*` actions. | +| `Using AWS hosted Elasticsearch with IAM credentials` | Sign your Elasticsearch requests using [AWS IAM authorization](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html), [AWS EC2 Instance Profile Credentials](https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-iam-instance-profile.html#getting-started-create-iam-instance-profile-cli), or [AWS ECS Tasks Credentials](https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-iam-roles.html). The policies must be configured to allow `es:*` actions. | | `AWS Region` | The AWS region your Elasticsearch service is located in. | | `AWS Access Key` | The AWS access key. | | `AWS Secret Access Key` | The AWS secret access key. | +| `Maximum file size indexed` | See [the explanation in instance limits.](../administration/instance_limits.md#maximum-file-size-indexed). | | `Maximum field length` | See [the explanation in instance limits.](../administration/instance_limits.md#maximum-field-length). | | `Maximum bulk request size (MiB)` | The Maximum Bulk Request size is used by GitLab's Golang-based indexer processes and indicates how much data it ought to collect (and store in memory) in a given indexing process before submitting the payload to Elasticsearch’s Bulk API. This setting should be used with the Bulk request concurrency setting (see below) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running GitLab's Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. | | `Bulk request concurrency` | The Bulk request concurrency indicates how many of GitLab's Golang-based indexer processes (or threads) can run in parallel to collect data to subsequently submit to Elasticsearch’s Bulk API. This increases indexing performance, but fills the Elasticsearch bulk requests queue faster. This setting should be used together with the Maximum bulk request size setting (see above) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running GitLab's Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. | @@ -284,9 +291,16 @@ or creating [extra Sidekiq processes](../administration/operations/extra_sidekiq ``` 1. [Enable **Elasticsearch indexing**](#enabling-elasticsearch). -1. Indexing large Git repositories can take a while. To speed up the process, you - can temporarily disable auto-refreshing and replicating. In our experience, you can expect a 20% - decrease in indexing time. We'll enable them when indexing is done. This step is optional! +1. Indexing large Git repositories can take a while. To speed up the process, you can [tune for indexing speed](https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#tune-for-indexing-speed): + + - You can temporarily disable [`refresh`](https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html), the operation responsible for making changes to an index available to search. + + - You can set the number of replicas to 0. This setting controls the number of copies each primary shard of an index will have. Thus, having 0 replicas effectively disables the replication of shards across nodes, which should increase the indexing performance. This is an important trade-off in terms of reliability and query performance. It is important to remember to set the replicas to a considered value after the initial indexing is complete. + + In our experience, you can expect a 20% decrease in indexing time. After completing indexing in a later step, you can return `refresh` and `number_of_replicas` to their desired settings. + + NOTE: **Note:** + This step is optional but may help significantly speed up large indexing operations. ```shell curl --request PUT localhost:9200/gitlab-production/_settings --header 'Content-Type: application/json' --data '{ @@ -409,6 +423,144 @@ or creating [extra Sidekiq processes](../administration/operations/extra_sidekiq For repository and snippet files, GitLab will only index up to 1 MiB of content, in order to avoid indexing timeouts. +## Zero downtime reindexing + +The idea behind this reindexing method is to leverage Elasticsearch index alias feature to atomically swap between two indices. +We will refer to each index as `primary` (online and used by GitLab for read/writes) and `secondary` (offline, for reindexing purpose). + +Instead of connecting directly to the `primary` index, we'll setup an index alias such as we can change the underlying index at will. + +NOTE: **Note:** +Any index attached to the production alias is deemed a `primary` and will end up being used by the GitLab Elasticsearch integration. + +### Pause the indexing + +Under **Admin Area > Integration > Elasticsearch**, check the **Pause Elasticsearch Indexing** setting and save. + +With this, all updates that should happen on your Elasticsearch index will be buffered and caught up once unpaused. + +### Setup + +TIP: **Tip:** +If your index has been created with GitLab v13.0+ you can skip directly to [trigger the reindex](#trigger-the-reindex-via-the-elasticsearch-administration). + +This process involves multiple shell commands and curl invocations, so a good initial setup will help down the road: + +```shell +# You can find this value under Admin Area > Integration > Elasticsearch > URL +export CLUSTER_URL="http://localhost:9200" +export PRIMARY_INDEX="gitlab-production" +export SECONDARY_INDEX="gitlab-production-$(date +%s)" +``` + +### Reclaiming the `gitlab-production` index name + +CAUTION: **Caution:** +It is highly recommended that you take a snapshot of your cluster to make sure there is a recovery path if anything goes wrong. + +NOTE: **Note:** +Due to a technical limitation, there will be a slight downtime because of the fact that we need to reclaim the current `primary` index to be used as the alias. + +To reclaim the `gitlab-production` index name, you need to first create a `secondary` index and then trigger the re-index from `primary`. + +#### Creating a secondary index + +To create a secondary index, run the following Rake task. The `SKIP_ALIAS` +environment variable will disable the automatic creation of the Elasticsearch +alias, which would conflict with the existing index under `$PRIMARY_INDEX`: + +```shell +# Omnibus installation +sudo SKIP_ALIAS=1 gitlab-rake "gitlab:elastic:create_empty_index[$SECONDARY_INDEX]" + +# Source installation +SKIP_ALIAS=1 bundle exec rake "gitlab:elastic:create_empty_index[$SECONDARY_INDEX]" +``` + +The index should be created successfully, with the latest index options and mappings. + +#### Trigger the re-index from `primary` + +To trigger the re-index from `primary` index: + +1. Use the Elasticsearch [Reindex API](https://www.elastic.co/guide/en/elasticsearch/reference/7.6/docs-reindex.html): + + ```shell + curl --request POST \ + --header 'Content-Type: application/json' \ + --data "{ \"source\": { \"index\": \"$PRIMARY_INDEX\" }, \"dest\": { \"index\": \"$SECONDARY_INDEX\" } }" \ + "$CLUSTER_URL/_reindex?slices=auto&wait_for_completion=false" + ``` + + There will be an output like: + + ```plaintext + {"task":"3qw_Tr0YQLq7PF16Xek8YA:1012"} + ``` + + Note the `task` value here as it will be useful to follow the reindex progress. + +1. Wait for the reindex process to complete, by checking the `completed` value. + Using the `task` value form the previous step: + + ```shell + export TASK_ID=3qw_Tr0YQLq7PF16Xek8YA:1012 + curl "$CLUSTER_URL/_tasks/$TASK_ID?pretty" + ``` + + The output will be like: + + ```plaintext + {"completed":false, …} + ``` + + Once the returned value is `true`, you may continue to the next step. + +1. Make sure that the secondary index has data in it. You can use the Elasticsearch + API to look for the index size and compare our two indices: + + ```shell + curl $CLUSTER_URL/$PRIMARY_INDEX/_count => 123123 + curl $CLUSTER_URL/$SECONDARY_INDEX/_count => 123123 + ``` + + TIP: **Tip:** + Comparing the document count is more accurate than using the index size, as improvements to the storage might cause the new index to be smaller than the original one. + +1. Once you are confident your `secondary` index is valid, you can process to the creation of the alias. + + ```shell + # Delete the original index + curl --request DELETE $CLUSTER_URL/$PRIMARY_INDEX + + # Create the alias and add the `secondary` index to it + curl --request POST \ + --header 'Content-Type: application/json' \ + --data "{\"actions\":[{\"add\":{\"index\":\"$SECONDARY_INDEX\",\"alias\":\"$PRIMARY_INDEX\"}}]}}" \ + $CLUSTER_URL/_aliases + ``` + + The reindexing is now completed. Your GitLab instance is now ready to use the [automated in-cluster reindexing](#trigger-the-reindex-via-the-elasticsearch-administration) feature for future reindexing. + +1. Unpause the indexing + + Under **Admin Area > Integration > Elasticsearch**, uncheck the **Pause Elasticsearch Indexing** setting and save. + +### Trigger the reindex via the Elasticsearch administration + +> - [Introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/34069) in [GitLab Starter](https://about.gitlab.com/pricing/) 13.2. +> - A scheduled index deletion and the ability to cancel it was [introduced](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/38914) in GitLab Starter 13.3. + +Under **Admin Area > Integration > Elasticsearch zero-downtime reindexing**, click on **Trigger cluster reindexing**. + +NOTE: **Note:** +Reindexing can be a lengthy process depending on the size of your Elasticsearch cluster. + +CAUTION: **Caution:** +After the reindexing is completed, the original index will be scheduled to be deleted after 14 days. You can cancel this action by pressing the cancel button. + +While the reindexing is running, you will be able to follow its progress under that same section. + ## GitLab Elasticsearch Rake tasks Rake tasks are available to: @@ -572,7 +724,7 @@ Here are some common pitfalls and how to overcome them: - **I indexed all the repositories but then switched Elasticsearch servers and now I can't find anything** - You will need to re-run all the Rake tasks to re-index the database, repositories, and wikis. + You will need to re-run all the Rake tasks to reindex the database, repositories, and wikis. - **The indexing process is taking a very long time** @@ -602,7 +754,7 @@ Here are some common pitfalls and how to overcome them: ``` This is because we changed the index mapping in GitLab 8.12 and the old indexes should be removed and built from scratch again, - see details in the [8-11-to-8-12 update guide](https://gitlab.com/gitlab-org/gitlab/blob/master/doc/update/8.11-to-8.12.md#11-elasticsearch-index-update-if-you-currently-use-elasticsearch). + see details in the [update guide](../update/upgrading_from_source.md). - Exception `Elasticsearch::Transport::Transport::Errors::BadRequest` @@ -657,11 +809,11 @@ There is a [more structured, lower-level troubleshooting document](../administra ### Known Issues -- **[Elasticsearch `code_analyzer` doesn't account for all code cases](https://gitlab.com/gitlab-org/gitlab/-/issues/10693)** +- **[Elasticsearch `code_analyzer` doesn't account for all code cases](https://gitlab.com/groups/gitlab-org/-/epics/3621)** - The `code_analyzer` pattern and filter configuration is being evaluated for improvement. We have noticed [several edge cases](https://gitlab.com/gitlab-org/gitlab/-/issues/10693#note_158382332) that are not returning expected search results due to our pattern and filter configuration. + The `code_analyzer` pattern and filter configuration is being evaluated for improvement. We have fixed [most edge cases](https://gitlab.com/groups/gitlab-org/-/epics/3621#note_363429094) that were not returning expected search results due to our pattern and filter configuration. - An improved strategy for the `code_analyzer` pattern and filters are being discussed in [issue 29443](https://gitlab.com/gitlab-org/gitlab/-/issues/29443). + Improvements to the `code_analyzer` pattern and filters is being discussed in [epic 3621](https://gitlab.com/groups/gitlab-org/-/epics/3621). ### Reverting to basic search |