author    GitLab Bot <gitlab-bot@gitlab.com>  2021-03-16 18:18:33 +0000
committer GitLab Bot <gitlab-bot@gitlab.com>  2021-03-16 18:18:33 +0000
commit    f64a639bcfa1fc2bc89ca7db268f594306edfd7c (patch)
tree      a2c3c2ebcc3b45e596949db485d6ed18ffaacfa1 /doc/integration/elasticsearch.md
parent    bfbc3e0d6583ea1a91f627528bedc3d65ba4b10f (diff)
download  gitlab-ce-f64a639bcfa1fc2bc89ca7db268f594306edfd7c.tar.gz
Add latest changes from gitlab-org/gitlab@13-10-stable-ee (tag: v13.10.0-rc40)
Diffstat (limited to 'doc/integration/elasticsearch.md')
-rw-r--r--  doc/integration/elasticsearch.md  |  85
1 file changed, 74 insertions(+), 11 deletions(-)
diff --git a/doc/integration/elasticsearch.md b/doc/integration/elasticsearch.md
index 389c7aabdf5..5c9149ef49b 100644
--- a/doc/integration/elasticsearch.md
+++ b/doc/integration/elasticsearch.md
@@ -11,10 +11,7 @@ info: To determine the technical writer assigned to the Stage/Group associated w
This document describes how to enable Advanced Search. After
Advanced Search is enabled, you'll have the benefit of fast search response times
-and the advantage of the following special searches:
-
-- [Advanced Search](../user/search/advanced_global_search.md)
-- [Advanced Search Syntax](../user/search/advanced_search_syntax.md)
+and the advantage of the [special searches](../user/search/advanced_search.md).
## Version requirements
@@ -41,7 +38,7 @@ each node should have:
- [Memory](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_memory): 8 GiB (minimum).
- [CPU](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_cpus): Modern processor with multiple cores.
-- [Storage](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_disks): Use SSD storage. You will need enough storage for 50% of the total size of your Git repositories.
+- [Storage](https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html#_disks): Use SSD storage. The total storage size of all Elasticsearch nodes is about 50% of the total size of your Git repositories. It includes one primary and one replica.
A few notes on CPU and storage:
@@ -56,6 +53,12 @@ A few notes on CPU and storage:
to any spinning media for Elasticsearch. In testing, nodes that use SSD storage
see boosts in both query and indexing performance.
+- The [`estimate_cluster_size`](#gitlab-advanced-search-rake-tasks) Rake task estimates the
+  Advanced Search storage requirements in advance. The Rake task uses total repository size
+  for the calculation. [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/221177) in GitLab 13.10.
+  A usage sketch follows this list.
+
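As a quick illustration of the bullet above, the estimate can be produced directly on a GitLab node. This is a minimal sketch assuming an Omnibus installation; the task itself is documented in the [Rake tasks](#gitlab-advanced-search-rake-tasks) section:

```shell
# Estimate Advanced Search storage needs from the total repository size.
# Assumes an Omnibus installation; see the Rake tasks section for details.
sudo gitlab-rake gitlab:elastic:estimate_cluster_size
```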
Keep in mind, these are **minimum requirements** for Elasticsearch.
Heavily-used Elasticsearch clusters will likely require considerably more
resources.
@@ -218,8 +221,8 @@ The following Elasticsearch settings are available:
| `Elasticsearch indexing` | Enables or disables Elasticsearch indexing and creates an empty index if one does not already exist. You may want to enable indexing but disable search, for example, to give the index time to be fully completed. Also, keep in mind that this option doesn't have any impact on existing data; it only enables or disables the background indexer, which tracks data changes and ensures new data is indexed. |
| `Pause Elasticsearch indexing` | Enables or disables temporary indexing pause. This is useful for cluster migration/reindexing. All changes are still tracked, but they are not committed to the Elasticsearch index until resumed. |
| `Search with Elasticsearch enabled` | Enables or disables using Elasticsearch in search. |
-| `URL` | The URL to use for connecting to Elasticsearch. Use a comma-separated list to support clustering (e.g., `http://host1, https://host2:9200`). If your Elasticsearch instance is password protected, pass the `username:password` in the URL (e.g., `http://<username>:<password>@<elastic_host>:9200/`). |
-| `Number of Elasticsearch shards` | Elasticsearch indexes are split into multiple shards for performance reasons. In general, larger indexes need to have more shards. Changes to this value do not take effect until the index is recreated. You can read more about tradeoffs in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html). |
+| `URL` | The URL to use for connecting to Elasticsearch. Use a comma-separated list to support clustering (e.g., `http://host1, https://host2:9200`). If your Elasticsearch instance is password protected, pass the `username:password` in the URL (e.g., `http://<username>:<password>@<elastic_host>:9200/`). Special characters in the username or password should use [percentage encoding](https://en.wikipedia.org/wiki/Percent-encoding); see the example after this table excerpt. |
+| `Number of Elasticsearch shards` | Elasticsearch indexes are split into multiple shards for performance reasons. In general, you should use at least 5 shards, and indexes with tens of millions of documents need to have more shards ([see below](#guidance-on-choosing-optimal-cluster-configuration)). Changes to this value do not take effect until the index is recreated. You can read more about tradeoffs in the [Elasticsearch documentation](https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html). |
| `Number of Elasticsearch replicas` | Each Elasticsearch shard can have a number of replicas. These are a complete copy of the shard, and can provide increased query performance or resilience against hardware failure. Increasing this value will greatly increase total disk space required by the index. |
| `Limit namespaces and projects that can be indexed` | Enabling this will allow you to select namespaces and projects to index. All other namespaces and projects will use database search instead. Please note that if you enable this option but do not select any namespaces or projects, none will be indexed. [Read more below](#limiting-namespaces-and-projects). |
| `Using AWS hosted Elasticsearch with IAM credentials` | Sign your Elasticsearch requests using [AWS IAM authorization](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html), [AWS EC2 Instance Profile Credentials](https://docs.aws.amazon.com/codedeploy/latest/userguide/getting-started-create-iam-instance-profile.html#getting-started-create-iam-instance-profile-cli), or [AWS ECS Tasks Credentials](https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-iam-roles.html). Please refer to [Identity and Access Management in Amazon Elasticsearch Service](https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-ac.html) for details of AWS hosted Elasticsearch domain access policy configuration. |
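As an example of the percent encoding mentioned in the `URL` setting, the sketch below uses hypothetical credentials (`elastic` / `p@ss/word!`) and a placeholder host; substitute your own values:

```shell
# Hypothetical password "p@ss/word!" percent-encoded as p%40ss%2Fword%21.
# The same URL form can be pasted into the `URL` setting; here it is checked with curl.
curl "http://elastic:p%40ss%2Fword%21@elastic.example.com:9200/_cluster/health"
```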
@@ -232,6 +235,12 @@ The following Elasticsearch settings are available:
| `Bulk request concurrency` | The Bulk request concurrency indicates how many of the GitLab Golang-based indexer processes (or threads) can run in parallel to collect data to subsequently submit to Elasticsearch’s Bulk API. This increases indexing performance, but fills the Elasticsearch bulk requests queue faster. This setting should be used together with the Maximum bulk request size setting (see above) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running the GitLab Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. |
| `Client request timeout` | Elasticsearch HTTP client request timeout value in seconds. `0` means using the system default timeout value, which depends on the libraries that the GitLab application is built upon. |
+WARNING:
+Increasing the values of `Maximum bulk request size (MiB)` and `Bulk request concurrency` can negatively impact
+Sidekiq performance. Return them to their default values if you see increased `scheduling_latency_s` durations
+in your Sidekiq logs. For more information, see
+[issue 322147](https://gitlab.com/gitlab-org/gitlab/-/issues/322147).
+
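As a rough way to spot the symptom described in the warning, you can scan the Sidekiq JSON logs for high `scheduling_latency_s` values. This is a hedged sketch: the log path assumes an Omnibus installation and the one-second threshold is only an illustrative starting point:

```shell
# Print Sidekiq job log entries whose scheduling latency exceeded one second.
# Path assumes Omnibus; adjust the threshold to whatever is normal for your instance.
sudo jq -c 'select(.scheduling_latency_s? and .scheduling_latency_s > 1)' \
  /var/log/gitlab/sidekiq/current
```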
### Limiting namespaces and projects
If you select `Limit namespaces and projects that can be indexed`, more options will become available.
@@ -335,7 +344,7 @@ Sometimes, you might want to abandon the unfinished reindex job and resume the i
1. Uncheck the "Pause Elasticsearch indexing" checkbox in **Admin Area > Settings > Advanced Search**.
-## Background migrations
+## Advanced Search migrations
> [Introduced](https://gitlab.com/gitlab-org/gitlab/-/issues/234046) in GitLab 13.6.
@@ -343,7 +352,7 @@ With reindex migrations running in the background, there's no need for a manual
intervention. This usually happens in situations where new features are added to
Advanced Search, which means adding or changing the way content is indexed.
-To confirm that the background migrations ran, you can check with:
+To confirm that the Advanced Search migrations ran, you can check with:
```shell
curl "$CLUSTER_URL/gitlab-production-migrations/_search?q=*" | jq .
@@ -393,6 +402,21 @@ debug why the migration was halted and make any changes before retrying the migr
fixed the cause of the failure, click "Retry migration", and the migration will be scheduled to be retried
in the background.
+If you cannot get the migration to succeed, you may
+consider the [last resort of recreating the index from
+scratch](#last-resort-to-recreate-an-index). This may allow you to skip over
+the problem, because a newly created index skips all migrations: it is
+created with the correct, up-to-date schema.
+
+### All migrations must be finished before doing a major upgrade
+
+Before doing a major version GitLab upgrade, you should have completed all
+migrations that exist up until the latest minor version before that major
+version. If you have halted migrations, these will need to be resolved and
+[retried](#retry-a-halted-migration) before proceeding with a major version
+upgrade. Read more about [upgrading to a new major
+version](../update/index.md#upgrading-to-a-new-major-version).
+
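One way to confirm that nothing is outstanding before a major version upgrade is the `list_pending_migrations` Rake task from the table below. A minimal sketch, assuming an Omnibus installation:

```shell
# Lists Advanced Search migrations that have not started, are in progress, or are halted.
# If nothing is reported as pending, all migrations have completed.
sudo gitlab-rake gitlab:elastic:list_pending_migrations
```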
## GitLab Advanced Search Rake tasks
Rake tasks are available to:
@@ -415,7 +439,9 @@ The following are some available Rake tasks:
| [`sudo gitlab-rake gitlab:elastic:index_snippets`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/tasks/gitlab/elastic.rake) | Performs an Elasticsearch import that indexes the snippets data. |
| [`sudo gitlab-rake gitlab:elastic:projects_not_indexed`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/tasks/gitlab/elastic.rake) | Displays which projects are not indexed. |
| [`sudo gitlab-rake gitlab:elastic:reindex_cluster`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/tasks/gitlab/elastic.rake) | Schedules a zero-downtime cluster reindexing task. This feature should be used with an index that was created after GitLab 13.0. |
-| [`sudo gitlab-rake gitlab:elastic:mark_reindex_failed`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/tasks/gitlab/elastic.rake)`] | Mark the most recent re-index job as failed. |
+| [`sudo gitlab-rake gitlab:elastic:mark_reindex_failed`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/tasks/gitlab/elastic.rake) | Mark the most recent re-index job as failed. |
+| [`sudo gitlab-rake gitlab:elastic:list_pending_migrations`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/tasks/gitlab/elastic.rake) | List pending migrations. Pending migrations include those that have not yet started, have started but not finished, and those that are halted. |
+| [`sudo gitlab-rake gitlab:elastic:estimate_cluster_size`](https://gitlab.com/gitlab-org/gitlab/blob/master/ee/lib/tasks/gitlab/elastic.rake) | Get an estimate of cluster size based on the total repository size. |
### Environment variables
@@ -465,7 +491,12 @@ For basic guidance on choosing a cluster configuration you may refer to [Elastic
- `Heap size` should be set to no more than 50% of your physical RAM. Additionally, it shouldn't be set to more than the threshold for zero-based compressed oops. The exact threshold varies: 26 GB is safe on most systems, but it can be as large as 30 GB on some systems. See [Heap size settings](https://www.elastic.co/guide/en/elasticsearch/reference/current/important-settings.html#heap-size-settings) and [Setting JVM options](https://www.elastic.co/guide/en/elasticsearch/reference/current/jvm-options.html) for more details.
- Number of CPUs (CPU cores) per node usually corresponds to the `Number of Elasticsearch shards` setting described below.
- A good guideline is to ensure you keep the number of shards per node below 20 per GB of heap it has configured. A node with a 30 GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
-- Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. Another consideration is the number of documents, you should aim for this simple formula for the number of shards: `number of expected documents / 5M +1`.
+- Number of Elasticsearch shards:
+  - Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB.
+  - Another consideration is the number of documents. To determine the number of shards to use, sum the numbers in the **Admin Area > Dashboard > Statistics** pane (the number of documents to be indexed), divide by 5 million, and add 5. For example (see the quick calculation after this list):
+    - If you have fewer than about 2,000,000 documents, use the default of 5 shards
+    - 10,000,000 documents: `10000000/5000000 + 5` = 7 shards
+    - 100,000,000 documents: `100000000/5000000 + 5` = 25 shards
- `refresh_interval` is a per-index setting. You may want to adjust that from the default `1s` to a bigger value if you don't need data in real time. This will change how soon you will see fresh results. If that's important for you, you should leave it as close as possible to the default value.
- You might want to raise [`indices.memory.index_buffer_size`](https://www.elastic.co/guide/en/elasticsearch/reference/current/indexing-buffer.html) to 30% or 40% if you have a lot of heavy indexing operations.
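As a minimal check of the shard formula above, the arithmetic can be reproduced with shell integer division, using a hypothetical document count taken from **Admin Area > Dashboard > Statistics**:

```shell
# Hypothetical document count; replace with the sum from the Statistics pane.
DOCUMENT_COUNT=10000000
# shards = documents / 5 million + 5 (integer division)
echo $(( DOCUMENT_COUNT / 5000000 + 5 ))   # prints 7
```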
@@ -873,3 +904,35 @@ Improvements to the `code_analyzer` pattern and filters are being discussed in [
### Some binary files may not be searchable by name
In GitLab 13.9, a change was made where [binary file names are being indexed](https://gitlab.com/gitlab-org/gitlab/-/issues/301083). However, without indexing all projects' data from scratch, only binary files that are added or updated after the GitLab 13.9 release are searchable.
+
+### Last resort to recreate an index
+
+There may be cases where data never got indexed and is not in the
+queue, or the index is in a state where migrations simply cannot
+proceed. It is always best to try to troubleshoot the root cause of the problem
+using the [troubleshooting](#troubleshooting) steps above.
+
+If there are no other options, you can recreate the
+entire index from scratch. On a small GitLab installation, this can
+sometimes be a quick way to resolve a problem, but on a large GitLab
+installation it will likely take a very long time to complete. Until the
+index is fully recreated, it will not serve correct search results,
+so you may want to disable **Search with Elasticsearch** while it is running.
+
+If you are sure you've read the above caveats and want to proceed, then you
+should run the following Rake task to recreate the entire index from scratch:
+
+**For Omnibus installations**
+
+```shell
+# WARNING: DO NOT RUN THIS UNTIL YOU READ THE DESCRIPTION ABOVE
+sudo gitlab-rake gitlab:elastic:index
+```
+
+**For installations from source**
+
+```shell
+# WARNING: DO NOT RUN THIS UNTIL YOU READ THE DESCRIPTION ABOVE
+cd /home/git/gitlab
+sudo -u git -H bundle exec rake gitlab:elastic:index
+```