summaryrefslogtreecommitdiff
path: root/doc/architecture/blueprints/code_search_with_zoekt/index.md
blob: ca74460b98a22b02dc1d8d8d43f5041043f334ce (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
---
status: ongoing
creation-date: "2022-12-28"
authors: [ "@dgruzd", "@DylanGriffith" ]
coach: "@DylanGriffith"
approvers: [ "@joshlambert", "@changzhengliu" ]
owning-stage: "~devops::enablement"
participating-stages: []
---

# Use Zoekt For code search

## Summary

We will be implementing an additional code search functionality in GitLab that
is backed by [Zoekt](https://github.com/sourcegraph/zoekt), an open source
search engine that is specifically designed for code search. Zoekt will be used as
an API by GitLab and remain an implementation detail while the user interface
in GitLab will not change much except for some new features made available by
Zoekt.

This will be rolled out in phases to ensure that the system will actually meet
our scaling and cost expectations and will run alongside code search backed by
Elasticsearch until we can be sure it is a viable replacement. The first step
will be making it available for `gitlab-org` for internal and expanding
customer by customer based on customer interest.

## Motivation

GitLab code search functionality today is backed by Elasticsearch.
Elasticsearch has proven useful for other types of search (issues, merge
requests, comments and so-on) but is by design not a good choice for code
search where users expect matches to be precise (ie. no false positives) and
flexible (e.g. support
[substring matching](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
and
[regexes](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)). We have
[investigated our options](https://gitlab.com/groups/gitlab-org/-/epics/7404)
and [Zoekt](https://github.com/sourcegraph/zoekt) is pretty much the only well
maintained open source technology that is suited to code search. Based on our
research we believe it will be better to adopt a well maintained open source
database than attempt to build our own. This is mostly due to the fact that our
research indicates that the fundamental architecture of Zoekt is what we would
implement again if we tried to implement something ourselves.

Our
[early benchmarking](https://gitlab.com/gitlab-org/gitlab/-/issues/370832#note_1183611955)
suggests that Zoekt will be viable at our scale, but we feel strongly
that investing in building a beta integration with Zoekt and rolling it out
group by group on GitLab.com will provide better insights into scalability and
cost than more accurate benchmarking efforts. It will also be relatively low
risk as it will be rolled out internally first and later rolled out to
customers that wish to participate in the trial.

### Goals

The main goals of this integration will be to implement the following highly
requested improvements to code search:

1. [Exact match (substring match) code searches in advanced search](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
1. [Support regular expressions with Advanced Global Search](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)
1. [Support multiple line matches in the same file](https://gitlab.com/gitlab-org/gitlab/-/issues/668)

The initial phases of the rollout will be designed to catch and resolve scaling
or infrastructure cost issues as early as possible so that we can pivot early
before investing too much in this technology if it is not suitable.

### Non-Goals

The following are not goals initially but could theoretically be built upon
this solution:

1. Improving security scanning features by having access to quickly perform
   regex scans across many repositories
1. Saving money on our search infrastructure - this may be possible with
   further optimizations, but initial estimates suggest the cost is similar
1. AI/ML features of search used to predict what users might be interested in
   finding
1. Code Intelligence and Navigation - likely code intelligence and navigation
   features should be built on structured data rather than a trigram index but
   regex based searches (using Zoekt) may be a suitable fallback for code which
   does not have structured metadata enabled or dynamic languages where static
   analysis is not very accurate. Zoekt in particular may not be well suited
   initially, despite existing symbol extraction using ctags, because ctags
   symbols may not contain enough data for accurate navigation and Zoekt
   doesn't undersand dependencies which would be necessary for cross-project
   navigation.

## Proposal

An
[initial implementation of a Zoekt integration](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/105049)
was created to demonstrate the feasibility of using Zoekt as a drop-in
replacement for Elasticsearch code searches. This blueprint will extend on all
the details needed to provide a minimum viable change as well steps needed to
scale this to a larger customer rollout on GitLab.com.

## Design and implementation details

### User Experience

When a user performs an advanced search on a group or project that is part
of the Zoekt rollout we will present a toggle somewhere in the UI to change
to "precise search" (or some other UX TBD) which switches them from
Elasticsearch to Zoekt. Early user feedback will help us assess the best way
to present these choices to users and ultimately we will want to remove the
Elasticsearch option if we find Zoekt is a suitable long term option.

### Indexing

Similar to our Elasticsearch integration, GitLab will notify Zoekt every time
there are updates to a repository. Zoekt, unlike Elasticsearch, is designed to
clone and index Git repositories so we will simply notify Zoekt of the URL of
the repository that has changed and it will update its local copy of the Git
repo and then update its local index files. The Zoekt side of this logic will
be implemented in a new server-side indexing endpoint we add to Zoekt which is
currently in
[an open Pull request](https://github.com/sourcegraph/zoekt/pull/496).
While the details of
this pull request are still being debated, we may choose to deploy a fork with
the functionality we need, but our strongest intention is not to maintain a
fork of Zoekt and the maintainers have already expressed they are open to this
new functionality.

The rails side of the integration will be a Sidekiq worker that is scheduled
every time there is an update to a repository and it will simply call this
`/index` endpoint in Zoekt. This will also need to generate a one-time token
that can allow Zoekt to clone a private repository.

```mermaid
sequenceDiagram
   participant user as User
   participant gitlab_git as GitLab Git
   participant gitlab_sidekiq as GitLab Sidekiq
   participant zoekt as Zoekt
   user->>gitlab_git: git push git@gitlab.com:gitlab-org/gitlab.git
   gitlab_git->>gitlab_sidekiq: ZoektIndexerWorker.perform_async(278964)
   gitlab_sidekiq->>zoekt: POST /index {"RepoUrl":"https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git","RepoId":278964}'
   zoekt->>gitlab_git: git clone https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git
```

The Sidekiq worker can leverage de-duplication based on the `project_id`.

Zoekt supports indexing multiple projects we'll likely need to, eventually,
allow a way for users to configure additional branches (beyond the default
branch) and this will need to be sent to Zoekt. We will need to decide if these
branch lists are sent every time we index the project or only when they change
configuration.

There may be race conditions with multiple Zoekt processes indexing the same
repo at the same time. For this reason we should implement a locking mechanism
somewhere to ensure we are only indexing 1 project in 1 place at a time. We
could make use of the same Redis locking we use for indexing projects in
Elasticsearch.

### Searching

Searching will be implemented using the `/api/search` functionality in
Zoekt. There is also
[an open PR to fix this endpoint in Zoekt](https://github.com/sourcegraph/zoekt/pull/506),
and again we may consider working from a fork until this is fixed. GitLab will
prepend all searches with the appropriate filter for repositories based on the
user's search context (group or project) in the same way we do for
Elasticsearch. For Zoekt this will be implemented as a query string regex that
matches all the searched repositories.

### Zoekt infrastructure

Each Zoekt node will need to run a
[zoekt-dynamic-indexserver](https://github.com/sourcegraph/zoekt/pull/496) and
a
[zoekt-webserver](https://github.com/sourcegraph/zoekt/blob/main/cmd/zoekt-webserver/main.go).
These are both webservers with different responsibilities. Considering that the
Zoekt indexing process needs to keep a full clone of the bare repo
([unless we come up with a better option](https://gitlab.com/gitlab-org/gitlab/-/issues/384722))
these bare repos will be stored on spinning disks to save space. These are only
used as an intermediate step to generate the actual `.zoekt` index files which
will be stored on an SSD for fast searches. These web servers need to run on
the same node as they access the same files. The `zoekt-dynamic-indexserver` is
responsible for writing the `.zoekt` index files. The `zoekt-webserver` is
responsible for responding to searches that it performs by reading these
`.zoekt` index files.

### Rollout strategy

Initially Zoekt code search will only be available to `gitlab-org`. After that
we'll start rolling it out to specific customers that have requested better
code search experience. As we learn about scaling and make improvements we will
gradually roll it out to all licensed groups on GitLab.com. We will use a
similar approach to Elasticsearch for keeping track of which groups are indexed
and which are not. This will be based on a new table `zoekt_indexed_namespaces`
with a `namespace_id` reference. We will only allow rolling out to top level
namespaces to simplify the logic of checking for all layers of group
inheritance. Once we've rolled out to all licensed groups we'll enable logic to
automatically enroll newly licensed groups. This table also may be a place to
store per-namespace sharding and replication data as described below.

### Sharding and replication strategy

Zoekt does not have any inbuilt sharding, and we expect that we'll need
multiple Zoekt servers to reach the scale to provide search functionality to
all of GitLab licensed customers.

There are 2 clear ways to implement sharding:

1. Build it on top of, or in front of Zoekt, as an independent component. Building
   all the complexities of a distributed database into Zoekt is not likely to
   be a good direction for the project so most likely this would be an
   independent piece of infrastructure that proxied requests to the correct
   shard.
1. Manage the shards inside GitLab. This would be an application layer in
   GitLab which chooses the correct shard to send indexing and search requests
   to.

Likewise, there are a few ways to implement replication:

1. Server-side where Zoekt replicas are aware of other Zoekt replicas and they
   stream updates from some primary to remain in sync
1. Client-side replication where clients send indexing requests to all replicas
   and search requests to any replica

We plan to implement sharding inside GitLab application but replication may be
best served at the level of the filesystem of Zoekt servers rather than sending
duplicated updates from GitLab to all replicas. This could be some process on
Zoekt servers that monitors for changes to the `.zoekt` files in a specific
directory and syncs those updates to the replicas. This will need to be
slightly more sophisticated than `rsync` because the files are constantly
changing and files may be getting deleted while the sync is happening so we
would want to be syncing the updates in batches somehow without slowing down
indexing.

Implementing sharding in GitLab simplifies the additional infrastructure
components that need to be deployed and allows more flexibility to control our
rollout to many customers alongside our rollout of multiple shards.

Implementing syncing from primary -> replica on Zoekt nodes at the filesystem
level optimizes that overall resource usage. We only need to sync the index
files to replicas as the bare repo is just a cache. This saves on:

1. Disk space on replicas
1. CPU usage on replicas as it does not need to rebuild the index
1. Load on Gitaly to clone the repos

We plan to defer the implementation of these high availability aspects until
later, but a preliminary plan would be:

1. GitLab is configured with a pool of Zoekt servers
1. GitLab assigns groups randomly a Zoekt primary server
1. There will also be Zoekt replica servers
1. Periodically Zoekt primary servers will sync their `.zoekt` index files to
   their respective replicas
1. There will need to be some process by which to promote a replica to a
   primary if the primary is having issues. We will be using Consul for
   keeping track of which is the primary and which are the replicas.
1. When indexing a project GitLab will queue a Sidekiq job to update the index
   on the primary
1. When searching we will randomly select one of the Zoekt primaries or replica
   servers for the group being searched. We don't care which is "more up to
   date" as code search will be "eventually consistent" and all reads may read
   slightly out of date indexes. We will have a target of maximum latency of
   index updates and may consider removing nodes from rotation if they are too
   far out of date.
1. We will shard everything by top level group as this ensures group search can
   always search a single Zoekt server. Aggregation may be possible for global
   searches at some point in future if this turns out to be important. Smaller
   self-managed instances may use a single Zoekt server allowing global
   searches to work without any aggregation being implemented. Depending on our
   largest group sizes and scaling limitations of a single node Zoekt server we
   may consider implementing an approach where a group can be assigned multiple
   shards.

The downside of the chosen path will be added complexity of managing all these
Zoekt servers from GitLab when compared with a "proxy" layer outside of GitLab
that is managing all of these shards. We will consider this decision a work in
progress and reassess if it turns out to add too much complexity to GitLab.

#### Sharding proposal using GitLab `::Zoekt::Shard` model

This is already implemented as the `::Zoekt::IndexedNamespace`
implements a many-to-many relationship between namespaces and shards.

#### Replication and service discovery using Consul

If we plan to replicate at the Zoekt node level as described above we need to
change our data model to use a one-to-many relationship from `zoekt_shards ->
namespaces`. This means making the `namespace_id` column unique in
`zoekt_indexed_namespaces`. Then we need to implement a service discovery
approach where the `index_url` always points at a primary Zoekt node and the
`search_url` is a DNS record with N replicas and the primary. We then choose
randomly from `search_url` records when searching.

### Iterations

1. Make available for `gitlab-org`
1. Improve monitoring
1. Improve performance
1. Make available for select customers
1. Implement sharding
1. Implement replication
1. Make available to many more licensed groups
1. Implement automatic (re)balancing of shards
1. Estimate costs for rolling out to all licensed groups and decide if it's worth it or if we need to optimize further or adjust our plan
1. Rollout to all licensed groups
1. Improve performance
1. Assess costs and decide whether we should roll out to all free customers