1 files changed, 305 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
new file mode 100644
index 00000000000..d0d347f1ff4
--- /dev/null
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -0,0 +1,305 @@
+---
+status: ongoing
+creation-date: "2022-12-28"
+authors: [ "@dgruzd", "@DylanGriffith" ]
+coach: "@DylanGriffith"
+approvers: [ "@joshlambert", "@changzhengliu" ]
+owning-stage: "~devops::enablement"
+participating-stages: []
+---
+
+# Use Zoekt For code search
+
+## Summary
+
+We will be implementing an additional code search functionality in GitLab that
+is backed by [Zoekt](https://github.com/sourcegraph/zoekt), an open source
+search engine that is specifically designed for code search. Zoekt will be used as
+an API by GitLab and remain an implementation detail while the user interface
+in GitLab will not change much except for some new features made available by
+Zoekt.
+
+This will be rolled out in phases to ensure that the system will actually meet
+our scaling and cost expectations and will run alongside code search backed by
+Elasticsearch until we can be sure it is a viable replacement. The first step
+will be making it available for `gitlab-org` for internal and expanding
+customer by customer based on customer interest.
+
+## Motivation
+
+GitLab code search functionality today is backed by Elasticsearch.
+Elasticsearch has proven useful for other types of search (issues, merge
+requests, comments and so-on) but is by design not a good choice for code
+search where users expect matches to be precise (ie. no false positives) and
+flexible (e.g. support
+[substring matching](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
+and
+[regexes](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)). We have
+[investigated our options](https://gitlab.com/groups/gitlab-org/-/epics/7404)
+and [Zoekt](https://github.com/sourcegraph/zoekt) is pretty much the only well
+maintained open source technology that is suited to code search. Based on our
+research we believe it will be better to adopt a well maintained open source
+database than attempt to build our own. This is mostly due to the fact that our
+research indicates that the fundamental architecture of Zoekt is what we would
+implement again if we tried to implement something ourselves.
+
+Our
+[early benchmarking](https://gitlab.com/gitlab-org/gitlab/-/issues/370832#note_1183611955)
+suggests that Zoekt will be viable at our scale, but we feel strongly
+that investing in building a beta integration with Zoekt and rolling it out
+group by group on GitLab.com will provide better insights into scalability and
+cost than more accurate benchmarking efforts. It will also be relatively low
+risk as it will be rolled out internally first and later rolled out to
+customers that wish to participate in the trial.
+
+### Goals
+
+The main goals of this integration will be to implement the following highly
+requested improvements to code search:
+
+1. [Exact match (substring match) code searches in Advanced Search](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
+1. [Support regular expressions with Advanced Global Search](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)
+1. [Support multiple line matches in the same file](https://gitlab.com/gitlab-org/gitlab/-/issues/668)
+
+The initial phases of the rollout will be designed to catch and resolve scaling
+or infrastructure cost issues as early as possible so that we can pivot early
+before investing too much in this technology if it is not suitable.
+
+### Non-Goals
+
+The following are not goals initially but could theoretically be built upon
+this solution:
+
+1. Improving security scanning features by having access to quickly perform
+   regex scans across many repositories
+1. Saving money on our search infrastructure - this may be possible with
+   further optimizations, but initial estimates suggest the cost is similar
+1. AI/ML features of search used to predict what users might be interested in
+   finding
+1. Code Intelligence and Navigation - likely code intelligence and navigation
+   features should be built on structured data rather than a trigram index but
+   regex based searches (using Zoekt) may be a suitable fallback for code which
+   does not have structured metadata enabled or dynamic languages where static
+   analysis is not very accurate. Zoekt in particular may not be well suited
+   initially, despite existing symbol extraction using ctags, because ctags
+   symbols may not contain enough data for accurate navigation and Zoekt
+   doesn't undersand dependencies which would be necessary for cross-project
+   navigation.
+
+## Proposal
+
+An
+[initial implementation of a Zoekt integration](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/105049)
+was created to demonstrate the feasibility of using Zoekt as a drop-in
+replacement for Elasticsearch code searches. This blueprint will extend on all
+the details needed to provide a minimum viable change as well steps needed to
+scale this to a larger customer rollout on GitLab.com.
+
+## Design and implementation details
+
+### User Experience
+
+When a user performs an advanced search on a group or project that is part
+of the Zoekt rollout we will present a toggle somewhere in the UI to change
+to "precise search" (or some other UX TBD) which switches them from
+Elasticsearch to Zoekt. Early user feedback will help us assess the best way
+to present these choices to users and ultimately we will want to remove the
+Elasticsearch option if we find Zoekt is a suitable long term option.
+
+### Indexing
+
+Similar to our Elasticsearch integration, GitLab will notify Zoekt every time
+there are updates to a repository. Zoekt, unlike Elasticsearch, is designed to
+clone and index Git repositories so we will simply notify Zoekt of the URL of
+the repository that has changed and it will update its local copy of the Git
+repo and then update its local index files. The Zoekt side of this logic will
+be implemented in a new server-side indexing endpoint we add to Zoekt which is
+currently in
+[an open Pull request](https://github.com/sourcegraph/zoekt/pull/496).
+While the details of
+this pull request are still being debated, we may choose to deploy a fork with
+the functionality we need, but our strongest intention is not to maintain a
+fork of Zoekt and the maintainers have already expressed they are open to this
+new functionality.
+
+The rails side of the integration will be a Sidekiq worker that is scheduled
+every time there is an update to a repository and it will simply call this
+`/index` endpoint in Zoekt. This will also need to generate a one-time token
+that can allow Zoekt to clone a private repository.
+
+```mermaid
+sequenceDiagram
+   participant user as User
+   participant gitlab_git as GitLab Git
+   participant gitlab_sidekiq as GitLab Sidekiq
+   participant zoekt as Zoekt
+   user->>gitlab_git: git push git@gitlab.com:gitlab-org/gitlab.git
+   gitlab_git->>gitlab_sidekiq: ZoektIndexerWorker.perform_async(278964)
+   gitlab_sidekiq->>zoekt: POST /index {"RepoUrl":"https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git","RepoId":278964}'
+   zoekt->>gitlab_git: git clone https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git
+```
+
+The Sidekiq worker can leverage de-duplication based on the `project_id`.
+
+Zoekt supports indexing multiple projects we'll likely need to, eventually,
+allow a way for users to configure additional branches (beyond the default
+branch) and this will need to be sent to Zoekt. We will need to decide if these
+branch lists are sent every time we index the project or only when they change
+configuration.
+
+There may be race conditions with multiple Zoekt processes indexing the same
+repo at the same time. For this reason we should implement a locking mechanism
+somewhere to ensure we are only indexing 1 project in 1 place at a time. We
+could make use of the same Redis locking we use for indexing projects in
+Elasticsearch.
+
+### Searching
+
+Searching will be implemented using the `/api/search` functionality in
+Zoekt. There is also
+[an open PR to fix this endpoint in Zoekt](https://github.com/sourcegraph/zoekt/pull/506),
+and again we may consider working from a fork until this is fixed. GitLab will
+prepend all searches with the appropriate filter for repositories based on the
+user's search context (group or project) in the same way we do for
+Elasticsearch. For Zoekt this will be implemented as a query string regex that
+matches all the searched repositories.
+
+### Zoekt infrastructure
+
+Each Zoekt node will need to run a
+[zoekt-dynamic-indexserver](https://github.com/sourcegraph/zoekt/pull/496) and
+a
+[zoekt-webserver](https://github.com/sourcegraph/zoekt/blob/main/cmd/zoekt-webserver/main.go).
+These are both webservers with different responsibilities. Considering that the
+Zoekt indexing process needs to keep a full clone of the bare repo
+([unless we come up with a better option](https://gitlab.com/gitlab-org/gitlab/-/issues/384722))
+these bare repos will be stored on spinning disks to save space. These are only
+used as an intermediate step to generate the actual `.zoekt` index files which
+will be stored on an SSD for fast searches. These web servers need to run on
+the same node as they access the same files. The `zoekt-dynamic-indexserver` is
+responsible for writing the `.zoekt` index files. The `zoekt-webserver` is
+responsible for responding to searches that it performs by reading these
+`.zoekt` index files.
+
+### Rollout strategy
+
+Initially Zoekt code search will only be available to `gitlab-org`. After that
+we'll start rolling it out to specific customers that have requested better
+code search experience. As we learn about scaling and make improvements we will
+gradually roll it out to all licensed groups on GitLab.com. We will use a
+similar approach to Elasticsearch for keeping track of which groups are indexed
+and which are not. This will be based on a new table `zoekt_indexed_namespaces`
+with a `namespace_id` reference. We will only allow rolling out to top level
+namespaces to simplify the logic of checking for all layers of group
+inheritance. Once we've rolled out to all licensed groups we'll enable logic to
+automatically enroll newly licensed groups. This table also may be a place to
+store per-namespace sharding and replication data as described below.
+
+### Sharding and replication strategy
+
+Zoekt does not have any inbuilt sharding, and we expect that we'll need
+multiple Zoekt servers to reach the scale to provide search functionality to
+all of GitLab licensed customers.
+
+There are 2 clear ways to implement sharding:
+
+1. Build it on top of, or in front of Zoekt, as an independent component. Building
+   all the complexities of a distributed database into Zoekt is not likely to
+   be a good direction for the project so most likely this would be an
+   independent piece of infrastructure that proxied requests to the correct
+   shard.
+1. Manage the shards inside GitLab. This would be an application layer in
+   GitLab which chooses the correct shard to send indexing and search requests
+   to.
+
+Likewise, there are a few ways to implement replication:
+
+1. Server-side where Zoekt replicas are aware of other Zoekt replicas and they
+   stream updates from some primary to remain in sync
+1. Client-side replication where clients send indexing requests to all replicas
+   and search requests to any replica
+
+We plan to implement sharding inside GitLab application but replication may be
+best served at the level of the filesystem of Zoekt servers rather than sending
+duplicated updates from GitLab to all replicas. This could be some process on
+Zoekt servers that monitors for changes to the `.zoekt` files in a specific
+directory and syncs those updates to the replicas. This will need to be
+slightly more sophisticated than `rsync` because the files are constantly
+changing and files may be getting deleted while the sync is happening so we
+would want to be syncing the updates in batches somehow without slowing down
+indexing.
+
+Implementing sharding in GitLab simplifies the additional infrastructure
+components that need to be deployed and allows more flexibility to control our
+rollout to many customers alongside our rollout of multiple shards.
+
+Implementing syncing from primary -> replica on Zoekt nodes at the filesystem
+level optimizes that overall resource usage. We only need to sync the index
+files to replicas as the bare repo is just a cache. This saves on:
+
+1. Disk space on replicas
+1. CPU usage on replicas as it does not need to rebuild the index
+1. Load on Gitaly to clone the repos
+
+We plan to defer the implementation of these high availability aspects until
+later, but a preliminary plan would be:
+
+1. GitLab is configured with a pool of Zoekt servers
+1. GitLab assigns groups randomly a Zoekt primary server
+1. There will also be Zoekt replica servers
+1. Periodically Zoekt primary servers will sync their `.zoekt` index files to
+   their respective replicas
+1. There will need to be some process by which to promote a replica to a
+   primary if the primary is having issues. We will be using Consul for
+   keeping track of which is the primary and which are the replicas.
+1. When indexing a project GitLab will queue a Sidekiq job to update the index
+   on the primary
+1. When searching we will randomly select one of the Zoekt primaries or replica
+   servers for the group being searched. We don't care which is "more up to
+   date" as code search will be "eventually consistent" and all reads may read
+   slightly out of date indexes. We will have a target of maximum latency of
+   index updates and may consider removing nodes from rotation if they are too
+   far out of date.
+1. We will shard everything by top level group as this ensures group search can
+   always search a single Zoekt server. Aggregation may be possible for global
+   searches at some point in future if this turns out to be important. Smaller
+   self-managed instances may use a single Zoekt server allowing global
+   searches to work without any aggregation being implemented. Depending on our
+   largest group sizes and scaling limitations of a single node Zoekt server we
+   may consider implementing an approach where a group can be assigned multiple
+   shards.
+
+The downside of the chosen path will be added complexity of managing all these
+Zoekt servers from GitLab when compared with a "proxy" layer outside of GitLab
+that is managing all of these shards. We will consider this decision a work in
+progress and reassess if it turns out to add too much complexity to GitLab.
+
+#### Sharding proposal using GitLab `::Zoekt::Shard` model
+
+This is already implemented as the `::Zoekt::IndexedNamespace`
+implements a many-to-many relationship between namespaces and shards.
+
+#### Replication and service discovery using Consul
+
+If we plan to replicate at the Zoekt node level as described above we need to
+change our data model to use a one-to-many relationship from `zoekt_shards ->
+namespaces`. This means making the `namespace_id` column unique in
+`zoekt_indexed_namespaces`. Then we need to implement a service discovery
+approach where the `index_url` always points at a primary Zoekt node and the
+`search_url` is a DNS record with N replicas and the primary. We then choose
+randomly from `search_url` records when searching.
+
+### Iterations
+
+1. Make available for `gitlab-org`
+1. Improve monitoring
+1. Improve performance
+1. Make available for select customers
+1. Implement sharding
+1. Implement replication
+1. Make available to many more licensed groups
+1. Implement automatic (re)balancing of shards
+1. Estimate costs for rolling out to all licensed groups and decide if it's worth it or if we need to optimize further or adjust our plan
+1. Rollout to all licensed groups
+1. Improve performance
+1. Assess costs and decide whether we should roll out to all free customers