Diffstat (limited to 'doc/development/elasticsearch.md')
 doc/development/elasticsearch.md | 35 +++++++++++++++++++++++++----------
 1 file changed, 25 insertions(+), 10 deletions(-)
diff --git a/doc/development/elasticsearch.md b/doc/development/elasticsearch.md
index c8c70fa7216..94b3796f6e9 100644
--- a/doc/development/elasticsearch.md
+++ b/doc/development/elasticsearch.md
@@ -64,20 +64,25 @@ All indexing after the initial one is done via `ElasticIndexerWorker` (sidekiq j
Search queries are generated by the concerns found in [ee/app/models/concerns/elastic](https://gitlab.com/gitlab-org/gitlab-ee/tree/master/ee/app/models/concerns/elastic). These concerns are also in charge of access control, and have been a historic source of security bugs, so please pay close attention to them!
## Existing Analyzers/Tokenizers/Filters
+
These are all defined in https://gitlab.com/gitlab-org/gitlab-ee/blob/master/ee/lib/elasticsearch/git/model.rb
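
Conceptually, each custom analyzer is declared as part of the index's `analysis` settings and simply names the tokenizer and token filters it is built from. The following is only a rough, illustrative sketch of that shape in Ruby; the analyzer/tokenizer/filter names come from the descriptions below, but the exact options are not copied from `model.rb`:

```ruby
# Illustrative sketch only -- the real definitions live in
# ee/lib/elasticsearch/git/model.rb and may use different options.
analysis_settings = {
  analysis: {
    analyzer: {
      path_analyzer: {
        type: 'custom',
        tokenizer: 'path_tokenizer',          # custom tokenizer, described below
        filter: %w[lowercase asciifolding]    # built-in token filters
      },
      sha_analyzer: {
        type: 'custom',
        tokenizer: 'sha_tokenizer',
        filter: %w[lowercase asciifolding]
      }
    }
  }
}
```
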
### Analyzers
+
#### `path_analyzer`
+
Used when indexing blobs' paths. Uses the `path_tokenizer` and the `lowercase` and `asciifolding` filters.
Please see the `path_tokenizer` explanation below for an example.
#### `sha_analyzer`
+
Used in blobs and commits. Uses the `sha_tokenizer` and the `lowercase` and `asciifolding` filters.
Please see the `sha_tokenizer` explanation below for an example.
#### `code_analyzer`
+
Used when indexing a blob's filename and content. Uses the `whitespace` tokenizer and the filters: `code`, `edgeNGram_filter`, `lowercase`, and `asciifolding`.
The `whitespace` tokenizer was selected in order to have more control over how tokens are split. For example, the string `Foo::bar(4)` needs to generate tokens like `Foo` and `bar(4)` in order to be properly searched.
@@ -85,15 +90,19 @@ The `whitespace` tokenizer was selected in order to have more control over how t
Please see the `code` filter for an explanation on how tokens are split.
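
To check how a string is actually split, you can run it through the `_analyze` API. A minimal sketch, assuming a local cluster, the `elasticsearch` Ruby gem, and a hypothetical index name that already has `code_analyzer` configured:

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Ask Elasticsearch how `Foo::bar(4)` is tokenized by the index's code_analyzer.
response = client.indices.analyze(
  index: 'gitlab-development',   # hypothetical index name
  body: { analyzer: 'code_analyzer', text: 'Foo::bar(4)' }
)

# Expect lowercased tokens such as "foo" and "bar(4)" plus their edge n-grams.
puts response['tokens'].map { |token| token['token'] }
```
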
#### `code_search_analyzer`
+
Not directly used for indexing, but rather used to transform a search input. Uses the `whitespace` tokenizer and the `lowercase` and `asciifolding` filters.
### Tokenizers
+
#### `sha_tokenizer`
+
This is a custom tokenizer that uses the [`edgeNGram` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenizer.html) to allow SHAs to be searchable by any prefix of at least 5 characters.
-example:
+Example:
`240c29dc7e` becomes:
+
- `240c2`
- `240c29`
- `240c29d`
@@ -102,21 +111,26 @@ example:
- `240c29dc7e`
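
A rough sketch of what an `edgeNGram` tokenizer with a five-character minimum could look like; the `max_gram` and `token_chars` values here are assumptions, not the actual settings from `model.rb`:

```ruby
# Illustrative only -- the real definition lives in ee/lib/elasticsearch/git/model.rb.
sha_tokenizer_settings = {
  tokenizer: {
    sha_tokenizer: {
      type: 'edgeNGram',
      min_gram: 5,                    # shortest searchable prefix, per the example above
      max_gram: 64,                   # assumed upper bound, long enough for a full SHA
      token_chars: %w[letter digit]   # only letters and digits contribute to tokens
    }
  }
}
```
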
#### `path_tokenizer`
+
This is a custom tokenizer that uses the [`path_hierarchy` tokenizer](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pathhierarchy-tokenizer.html) with `reverse: true` in order to allow searches to find paths no matter how much or how little of the path is given as input.
-example:
+Example:
`'/some/path/application.js'` becomes:
+
- `'/some/path/application.js'`
- `'some/path/application.js'`
- `'path/application.js'`
- `'application.js'`
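
A sketch of the corresponding definition, assuming nothing beyond the two options mentioned above:

```ruby
# Illustrative only -- the real definition lives in ee/lib/elasticsearch/git/model.rb.
path_tokenizer_settings = {
  tokenizer: {
    path_tokenizer: {
      type: 'path_hierarchy',
      reverse: true   # emit path suffixes so trailing fragments like `application.js` match
    }
  }
}
```
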
### Filters
+
#### `code`
-Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.
+
+Uses a [Pattern Capture token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-pattern-capture-tokenfilter.html) to split tokens into more easily searched versions of themselves.
Patterns:
+
- `"(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)"`: captures CamelCased and lowedCameCased strings as separate tokens
- `"(\\d+)"`: extracts digits
- `"(?=([\\p{Lu}]+[\\p{L}]+))"`: captures CamelCased strings recursively. Ex: `ThisIsATest` => `[ThisIsATest, IsATest, ATest, Test]`
@@ -126,6 +140,7 @@ Patterns:
- `'\/?([^\/]+)(?=\/|\b)'`: separates path terms `like/this/one`
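
Put together, a pattern-capture filter of this kind would look roughly like the following; `preserve_original` and the trimmed pattern list are illustrative assumptions rather than the exact contents of `model.rb`:

```ruby
# Illustrative only -- the real filter lives in ee/lib/elasticsearch/git/model.rb.
code_filter_settings = {
  filter: {
    code: {
      type: 'pattern_capture',
      preserve_original: true,               # keep the original token alongside the captures
      patterns: [
        '(\p{Ll}+|\p{Lu}\p{Ll}+|\p{Lu}+)',   # camelCase / CamelCase / UPPERCASE runs
        '(\d+)'                              # digit runs
        # ...plus the remaining patterns listed above
      ]
    }
  }
}
```
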
#### `edgeNGram_filter`
+
Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/reference/5.5/analysis-edgengram-tokenfilter.html) to allow inputs with only parts of a token to find the token. For example, it would turn `glasses` into prefixes starting with `gl` and ending with `glasses`, which would allow a search for "`glass`" to find the original token `glasses`.
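
A sketch of such a filter definition; the `min_gram` of 2 is inferred from the `gl` example above and the `max_gram` is an assumption:

```ruby
# Illustrative only -- the real filter lives in ee/lib/elasticsearch/git/model.rb.
edge_ngram_filter_settings = {
  filter: {
    edgeNGram_filter: {
      type: 'edgeNGram',
      min_gram: 2,    # "glasses" -> "gl", "gla", "glas", ...
      max_gram: 40    # assumed upper bound
    }
  }
}
```
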
## Gotchas
@@ -140,13 +155,13 @@ Uses an [Edge NGram token filter](https://www.elastic.co/guide/en/elasticsearch/
You might get an error such as:
```
-[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
- flood stage disk watermark [95%] exceeded on
- [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
+[2018-10-31T15:54:19,762][WARN ][o.e.c.r.a.DiskThresholdMonitor] [pval5Ct]
+ flood stage disk watermark [95%] exceeded on
+ [pval5Ct7SieH90t5MykM5w][pval5Ct][/usr/local/var/lib/elasticsearch/nodes/0] free: 56.2gb[3%],
all indices on this node will be marked read-only
```
-This is because you've exceeded the disk space threshold - it thinks you don't have enough disk space left, based on the default 95% threshold.
+This is because you've exceeded the disk space threshold: Elasticsearch thinks you don't have enough disk space left, based on the default 95% threshold.
In addition, the `read_only_allow_delete` setting will be set to `true`, which blocks indexing, `forcemerge`, and so on.
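
On these Elasticsearch versions the block is typically not lifted automatically, so after freeing up disk space (or adjusting the thresholds as shown below) you may also need to clear the flag yourself. A minimal sketch, assuming the `elasticsearch` Ruby gem and a local cluster:

```ruby
require 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Remove the read-only block from all indices once disk space is available again.
client.indices.put_settings(
  index: '_all',
  body: { 'index.blocks.read_only_allow_delete' => false }
)
```
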
@@ -158,16 +173,16 @@ Add this to your `elasticsearch.yml` file:
```
# turn off the disk allocator
-cluster.routing.allocation.disk.threshold_enabled: false
+cluster.routing.allocation.disk.threshold_enabled: false
```
_or_
```
# set your own limits
-cluster.routing.allocation.disk.threshold_enabled: true
+cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.flood_stage: 5gb # ES 6.x only
-cluster.routing.allocation.disk.watermark.low: 15gb
+cluster.routing.allocation.disk.watermark.low: 15gb
cluster.routing.allocation.disk.watermark.high: 10gb
```