summaryrefslogtreecommitdiff
path: root/doc/integration/elasticsearch.md
diff options
context:
space:
mode:
authorGitLab Bot <gitlab-bot@gitlab.com>2021-04-20 23:50:22 +0000
committerGitLab Bot <gitlab-bot@gitlab.com>2021-04-20 23:50:22 +0000
commit9dc93a4519d9d5d7be48ff274127136236a3adb3 (patch)
tree70467ae3692a0e35e5ea56bcb803eb512a10bedb /doc/integration/elasticsearch.md
parent4b0f34b6d759d6299322b3a54453e930c6121ff0 (diff)
downloadgitlab-ce-9dc93a4519d9d5d7be48ff274127136236a3adb3.tar.gz
Add latest changes from gitlab-org/gitlab@13-11-stable-eev13.11.0-rc43
Diffstat (limited to 'doc/integration/elasticsearch.md')
-rw-r--r--doc/integration/elasticsearch.md26
1 files changed, 20 insertions, 6 deletions
diff --git a/doc/integration/elasticsearch.md b/doc/integration/elasticsearch.md
index 5c9149ef49b..a6c3afceeea 100644
--- a/doc/integration/elasticsearch.md
+++ b/doc/integration/elasticsearch.md
@@ -91,6 +91,14 @@ Since Elasticsearch can read and use indices created in the previous major versi
The only thing worth noting is that if you have created your current index before GitLab 13.0, you might want to reindex from scratch (which will implicitly create an alias) in order to use some features, for example [Zero downtime reindexing](#zero-downtime-reindexing). Once you do that, you'll be able to perform zero-downtime reindexing and will benefit from any future features that make use of the alias.
+If you are unsure when your current index was created,
+you can check whether it was created after GitLab 13.0 by using the
+[Elasticsearch cat aliases API](https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cat-alias.html).
+If the list of aliases returned contains an entry for `gitlab-production` that points to an index
+named `gitlab-production-<numerical timestamp>`, your index was created after GitLab 13.0.
+If the `gitlab-production` alias is missing, you'll need to reindex from scratch to use
+features such as Zero-downtime reindexing.
+
## Elasticsearch repository indexer
For indexing Git repository data, GitLab uses an [indexer written in Go](https://gitlab.com/gitlab-org/gitlab-elasticsearch-indexer).
@@ -231,8 +239,8 @@ The following Elasticsearch settings are available:
| `AWS Secret Access Key` | The AWS secret access key. |
| `Maximum file size indexed` | See [the explanation in instance limits.](../administration/instance_limits.md#maximum-file-size-indexed). |
| `Maximum field length` | See [the explanation in instance limits.](../administration/instance_limits.md#maximum-field-length). |
-| `Maximum bulk request size (MiB)` | The Maximum Bulk Request size is used by the GitLab Golang-based indexer processes and indicates how much data it ought to collect (and store in memory) in a given indexing process before submitting the payload to Elasticsearch’s Bulk API. This setting should be used with the Bulk request concurrency setting (see below) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running the GitLab Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. |
-| `Bulk request concurrency` | The Bulk request concurrency indicates how many of the GitLab Golang-based indexer processes (or threads) can run in parallel to collect data to subsequently submit to Elasticsearch’s Bulk API. This increases indexing performance, but fills the Elasticsearch bulk requests queue faster. This setting should be used together with the Maximum bulk request size setting (see above) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running the GitLab Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. |
+| `Maximum bulk request size (MiB)` | The Maximum Bulk Request size is used by the GitLab Golang-based indexer processes and indicates how much data it ought to collect (and store in memory) in a given indexing process before submitting the payload to Elasticsearch's Bulk API. This setting should be used with the Bulk request concurrency setting (see below) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running the GitLab Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. |
+| `Bulk request concurrency` | The Bulk request concurrency indicates how many of the GitLab Golang-based indexer processes (or threads) can run in parallel to collect data to subsequently submit to Elasticsearch's Bulk API. This increases indexing performance, but fills the Elasticsearch bulk requests queue faster. This setting should be used together with the Maximum bulk request size setting (see above) and needs to accommodate the resource constraints of both the Elasticsearch host(s) and the host(s) running the GitLab Golang-based indexer either from the `gitlab-rake` command or the Sidekiq tasks. |
| `Client request timeout` | Elasticsearch HTTP client request timeout value in seconds. `0` means using the system default timeout value, which depends on the libraries that GitLab application is built upon. |
WARNING:
@@ -671,7 +679,7 @@ Sidekiq processes](../administration/operations/extra_sidekiq_processes.md).
Whenever a change or deletion is made to an indexed GitLab object (a merge request description is changed, a file is deleted from the master branch in a repository, a project is deleted, etc), a document in the index is deleted. However, since these are "soft" deletes, the overall number of "deleted documents", and therefore wasted space, increases. Elasticsearch does intelligent merging of segments in order to remove these deleted documents. However, depending on the amount and type of activity in your GitLab installation, it's possible to see as much as 50% wasted space in the index.
-In general, we recommend simply letting Elasticsearch merge and reclaim space automatically, with the default settings. From [Lucene's Handling of Deleted Documents](https://www.elastic.co/blog/lucenes-handling-of-deleted-documents "Lucene's Handling of Deleted Documents"), _"Overall, besides perhaps decreasing the maximum segment size, it is best to leave Lucene's defaults as-is and not fret too much about when deletes are reclaimed."_
+In general, we recommend letting Elasticsearch merge and reclaim space automatically, with the default settings. From [Lucene's Handling of Deleted Documents](https://www.elastic.co/blog/lucenes-handling-of-deleted-documents "Lucene's Handling of Deleted Documents"), _"Overall, besides perhaps decreasing the maximum segment size, it is best to leave Lucene's defaults as-is and not fret too much about when deletes are reclaimed."_
However, some larger installations may wish to tune the merge policy settings:
@@ -711,8 +719,7 @@ data).
The use of Elasticsearch in GitLab is only ever as a secondary data store.
This means that all of the data stored in Elasticsearch can always be derived
again from other data sources, specifically PostgreSQL and Gitaly. Therefore, if
-the Elasticsearch data store is ever corrupted for whatever reason, you can
-simply reindex everything from scratch.
+the Elasticsearch data store is ever corrupted for whatever reason, you can reindex everything from scratch.
## Troubleshooting
@@ -908,7 +915,7 @@ In GitLab 13.9, a change was made where [binary file names are being indexed](ht
### Last resort to recreate an index
There may be cases where somehow data never got indexed and it's not in the
-queue, or the index is somehow in a state where migrations just simply cannot
+queue, or the index is somehow in a state where migrations just cannot
proceed. It is always best to try to troubleshoot the root cause of the problem
using the above [troubleshooting](#troubleshooting) steps.
@@ -936,3 +943,10 @@ sudo gitlab-rake gitlab:elastic:index
cd /home/git/gitlab
sudo -u git -H bundle exec rake gitlab:elastic:index
```
+
+### How does Advanced Search handle private projects?
+
+Advanced Search will store all the projects in the same Elasticsearch indexes,
+however searches will only surface results that can be viewed by the user.
+Advanced Search will honor all permission checks in the application by
+filtering out projects that a user does not have access to at search time.