Diffstat (limited to 'doc/architecture/blueprints/search/code_search_with_zoekt.md')
-rw-r--r-- | doc/architecture/blueprints/search/code_search_with_zoekt.md | 305 |
1 files changed, 305 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/search/code_search_with_zoekt.md b/doc/architecture/blueprints/search/code_search_with_zoekt.md
new file mode 100644
index 00000000000..d0d347f1ff4
--- /dev/null
+++ b/doc/architecture/blueprints/search/code_search_with_zoekt.md
@@ -0,0 +1,305 @@
+---
+status: ongoing
+creation-date: "2022-12-28"
+authors: [ "@dgruzd", "@DylanGriffith" ]
+coach: "@DylanGriffith"
+approvers: [ "@joshlambert", "@changzhengliu" ]
+owning-stage: "~devops::enablement"
+participating-stages: []
+---
+
+# Use Zoekt for code search
+
+## Summary
+
+We will be implementing an additional code search functionality in GitLab that
+is backed by [Zoekt](https://github.com/sourcegraph/zoekt), an open source
+search engine that is specifically designed for code search. Zoekt will be used as
+an API by GitLab and remain an implementation detail while the user interface
+in GitLab will not change much except for some new features made available by
+Zoekt.
+
+This will be rolled out in phases to ensure that the system will actually meet
+our scaling and cost expectations and will run alongside code search backed by
+Elasticsearch until we can be sure it is a viable replacement. The first step
+will be making it available for `gitlab-org` for internal use and then expanding
+customer by customer based on customer interest.
+
+## Motivation
+
+GitLab code search functionality today is backed by Elasticsearch.
+Elasticsearch has proven useful for other types of search (issues, merge
+requests, comments and so on) but is by design not a good choice for code
+search where users expect matches to be precise (i.e. no false positives) and
+flexible (e.g. support
+[substring matching](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
+and
+[regexes](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)). We have
+[investigated our options](https://gitlab.com/groups/gitlab-org/-/epics/7404)
+and [Zoekt](https://github.com/sourcegraph/zoekt) is pretty much the only well
+maintained open source technology that is suited to code search. Based on our
+research we believe it will be better to adopt a well maintained open source
+database than attempt to build our own. This is mostly due to the fact that our
+research indicates that the fundamental architecture of Zoekt is what we would
+implement again if we tried to implement something ourselves.
+
+Our
+[early benchmarking](https://gitlab.com/gitlab-org/gitlab/-/issues/370832#note_1183611955)
+suggests that Zoekt will be viable at our scale, but we feel strongly
+that investing in building a beta integration with Zoekt and rolling it out
+group by group on GitLab.com will provide better insights into scalability and
+cost than more accurate benchmarking efforts. It will also be relatively low
+risk as it will be rolled out internally first and later rolled out to
+customers that wish to participate in the trial.
+
+### Goals
+
+The main goals of this integration will be to implement the following highly
+requested improvements to code search:
+
+1. [Exact match (substring match) code searches in Advanced Search](https://gitlab.com/gitlab-org/gitlab/-/issues/325234)
+1. [Support regular expressions with Advanced Global Search](https://gitlab.com/gitlab-org/gitlab/-/issues/4175)
+1. [Support multiple line matches in the same file](https://gitlab.com/gitlab-org/gitlab/-/issues/668)
+
+The initial phases of the rollout will be designed to catch and resolve scaling
+or infrastructure cost issues as early as possible so that we can pivot early
+before investing too much in this technology if it is not suitable.
+
+### Non-Goals
+
+The following are not goals initially but could theoretically be built upon
+this solution:
+
+1. Improving security scanning features by having access to quickly perform
+   regex scans across many repositories
+1. Saving money on our search infrastructure - this may be possible with
+   further optimizations, but initial estimates suggest the cost is similar
+1. AI/ML features of search used to predict what users might be interested in
+   finding
+1. Code Intelligence and Navigation - likely code intelligence and navigation
+   features should be built on structured data rather than a trigram index but
+   regex based searches (using Zoekt) may be a suitable fallback for code which
+   does not have structured metadata enabled or dynamic languages where static
+   analysis is not very accurate. Zoekt in particular may not be well suited
+   initially, despite existing symbol extraction using ctags, because ctags
+   symbols may not contain enough data for accurate navigation and Zoekt
+   doesn't understand dependencies which would be necessary for cross-project
+   navigation.
+
+## Proposal
+
+An
+[initial implementation of a Zoekt integration](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/105049)
+was created to demonstrate the feasibility of using Zoekt as a drop-in
+replacement for Elasticsearch code searches. This blueprint will expand on all
+the details needed to provide a minimum viable change as well as the steps needed to
+scale this to a larger customer rollout on GitLab.com.
+
+## Design and implementation details
+
+### User Experience
+
+When a user performs an advanced search on a group or project that is part
+of the Zoekt rollout we will present a toggle somewhere in the UI to change
+to "precise search" (or some other UX TBD) which switches them from
+Elasticsearch to Zoekt. Early user feedback will help us assess the best way
+to present these choices to users and ultimately we will want to remove the
+Elasticsearch option if we find Zoekt is a suitable long term option.
+
+### Indexing
+
+Similar to our Elasticsearch integration, GitLab will notify Zoekt every time
+there are updates to a repository. Zoekt, unlike Elasticsearch, is designed to
+clone and index Git repositories so we will simply notify Zoekt of the URL of
+the repository that has changed and it will update its local copy of the Git
+repo and then update its local index files. The Zoekt side of this logic will
+be implemented in a new server-side indexing endpoint we add to Zoekt which is
+currently in
+[an open pull request](https://github.com/sourcegraph/zoekt/pull/496).
+While the details of
+this pull request are still being debated, we may choose to deploy a fork with
+the functionality we need, but our strongest intention is not to maintain a
+fork of Zoekt and the maintainers have already expressed they are open to this
+new functionality.
+
+The Rails side of the integration will be a Sidekiq worker that is scheduled
+every time there is an update to a repository and it will simply call this
+`/index` endpoint in Zoekt. This will also need to generate a one-time token
+that can allow Zoekt to clone a private repository.
+
+```mermaid
+sequenceDiagram
+  participant user as User
+  participant gitlab_git as GitLab Git
+  participant gitlab_sidekiq as GitLab Sidekiq
+  participant zoekt as Zoekt
+  user->>gitlab_git: git push git@gitlab.com:gitlab-org/gitlab.git
+  gitlab_git->>gitlab_sidekiq: ZoektIndexerWorker.perform_async(278964)
+  gitlab_sidekiq->>zoekt: POST /index {"RepoUrl":"https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git","RepoId":278964}
+  zoekt->>gitlab_git: git clone https://zoekt:SECRET_TOKEN@gitlab.com/gitlab-org/gitlab.git
+```
+
+The Sidekiq worker can leverage de-duplication based on the `project_id`.
+
+Zoekt supports indexing multiple branches, so we'll likely need to, eventually,
+allow a way for users to configure additional branches (beyond the default
+branch) and this will need to be sent to Zoekt. We will need to decide if these
+branch lists are sent every time we index the project or only when they change
+configuration.
+
+There may be race conditions with multiple Zoekt processes indexing the same
+repo at the same time. For this reason we should implement a locking mechanism
+somewhere to ensure each project is indexed in only one place at a time. We
+could make use of the same Redis locking we use for indexing projects in
+Elasticsearch.
+
+### Searching
+
+Searching will be implemented using the `/api/search` functionality in
+Zoekt. There is also
+[an open PR to fix this endpoint in Zoekt](https://github.com/sourcegraph/zoekt/pull/506),
+and again we may consider working from a fork until this is fixed. GitLab will
+prepend all searches with the appropriate filter for repositories based on the
+user's search context (group or project) in the same way we do for
+Elasticsearch. For Zoekt this will be implemented as a query string regex that
+matches all the searched repositories.
+
+### Zoekt infrastructure
+
+Each Zoekt node will need to run a
+[zoekt-dynamic-indexserver](https://github.com/sourcegraph/zoekt/pull/496) and
+a
+[zoekt-webserver](https://github.com/sourcegraph/zoekt/blob/main/cmd/zoekt-webserver/main.go).
+These are both web servers with different responsibilities. Considering that the
+Zoekt indexing process needs to keep a full clone of the bare repo
+([unless we come up with a better option](https://gitlab.com/gitlab-org/gitlab/-/issues/384722))
+these bare repos will be stored on spinning disks to save space. These are only
+used as an intermediate step to generate the actual `.zoekt` index files which
+will be stored on an SSD for fast searches. These web servers need to run on
+the same node as they access the same files. The `zoekt-dynamic-indexserver` is
+responsible for writing the `.zoekt` index files. The `zoekt-webserver` is
+responsible for responding to searches that it performs by reading these
+`.zoekt` index files.
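
As a rough illustration of the repository filter described under Searching, the sketch below prepends a `repo:` regex to the user's query. The helper name `scoped_zoekt_query` and the use of repository paths as Zoekt repository names are assumptions made for this example and are not taken from the integration. How Zoekt's query parser treats the escaped characters should also be verified before relying on this.

```ruby
# Hypothetical helper (not part of GitLab or Zoekt): restrict a user's query
# to a set of repositories by prepending a `repo:` regex filter.
# Assumes Zoekt repository names equal the GitLab project paths.
def scoped_zoekt_query(user_query, repo_names)
  raise ArgumentError, "repo_names must not be empty" if repo_names.empty?

  # Escape each name so characters such as "." are matched literally,
  # then anchor the alternation so only whole names match.
  alternatives = repo_names.map { |name| Regexp.escape(name) }.join("|")
  "repo:^(#{alternatives})$ #{user_query}"
end

query = scoped_zoekt_query("foo", ["gitlab-org/gitlab", "gitlab-org/gitaly"])
```

Anchoring with `^` and `$` keeps a project such as `gitlab-org/gitlab` from also matching `gitlab-org/gitlab-foss`. For large groups the alternation could become very long, so a real implementation may need a different filter.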
+
+### Rollout strategy
+
+Initially Zoekt code search will only be available to `gitlab-org`. After that
+we'll start rolling it out to specific customers that have requested a better
+code search experience. As we learn about scaling and make improvements we will
+gradually roll it out to all licensed groups on GitLab.com. We will use a
+similar approach to Elasticsearch for keeping track of which groups are indexed
+and which are not. This will be based on a new table `zoekt_indexed_namespaces`
+with a `namespace_id` reference. We will only allow rolling out to top level
+namespaces to simplify the logic of checking for all layers of group
+inheritance. Once we've rolled out to all licensed groups we'll enable logic to
+automatically enroll newly licensed groups. This table also may be a place to
+store per-namespace sharding and replication data as described below.
+
+### Sharding and replication strategy
+
+Zoekt does not have any inbuilt sharding, and we expect that we'll need
+multiple Zoekt servers to reach the scale to provide search functionality to
+all of GitLab's licensed customers.
+
+There are 2 clear ways to implement sharding:
+
+1. Build it on top of, or in front of Zoekt, as an independent component. Building
+   all the complexities of a distributed database into Zoekt is not likely to
+   be a good direction for the project so most likely this would be an
+   independent piece of infrastructure that proxied requests to the correct
+   shard.
+1. Manage the shards inside GitLab. This would be an application layer in
+   GitLab which chooses the correct shard to send indexing and search requests
+   to.
+
+Likewise, there are a few ways to implement replication:
+
+1. Server-side where Zoekt replicas are aware of other Zoekt replicas and they
+   stream updates from some primary to remain in sync
+1. Client-side replication where clients send indexing requests to all replicas
+   and search requests to any replica
+
+We plan to implement sharding inside the GitLab application but replication may be
+best served at the level of the filesystem of Zoekt servers rather than sending
+duplicated updates from GitLab to all replicas. This could be some process on
+Zoekt servers that monitors for changes to the `.zoekt` files in a specific
+directory and syncs those updates to the replicas. This will need to be
+slightly more sophisticated than `rsync` because the files are constantly
+changing and files may be getting deleted while the sync is happening so we
+would want to be syncing the updates in batches somehow without slowing down
+indexing.
+
+Implementing sharding in GitLab simplifies the additional infrastructure
+components that need to be deployed and allows more flexibility to control our
+rollout to many customers alongside our rollout of multiple shards.
+
+Implementing syncing from primary -> replica on Zoekt nodes at the filesystem
+level optimizes overall resource usage. We only need to sync the index
+files to replicas as the bare repo is just a cache. This saves on:
+
+1. Disk space on replicas
+1. CPU usage on replicas as it does not need to rebuild the index
+1. Load on Gitaly to clone the repos
+
+We plan to defer the implementation of these high availability aspects until
+later, but a preliminary plan would be:
+
+1. GitLab is configured with a pool of Zoekt servers
+1. GitLab randomly assigns each group a Zoekt primary server
+1. There will also be Zoekt replica servers
+1. Periodically Zoekt primary servers will sync their `.zoekt` index files to
+   their respective replicas
+1. There will need to be some process by which to promote a replica to a
+   primary if the primary is having issues. We will be using Consul for
+   keeping track of which is the primary and which are the replicas.
+1. When indexing a project GitLab will queue a Sidekiq job to update the index
+   on the primary
+1. When searching we will randomly select one of the Zoekt primary or replica
+   servers for the group being searched. We don't care which is "more up to
+   date" as code search will be "eventually consistent" and all reads may read
+   slightly out of date indexes. We will have a target of maximum latency of
+   index updates and may consider removing nodes from rotation if they are too
+   far out of date.
+1. We will shard everything by top level group as this ensures group search can
+   always search a single Zoekt server. Aggregation may be possible for global
+   searches at some point in future if this turns out to be important. Smaller
+   self-managed instances may use a single Zoekt server allowing global
+   searches to work without any aggregation being implemented. Depending on our
+   largest group sizes and scaling limitations of a single node Zoekt server we
+   may consider implementing an approach where a group can be assigned multiple
+   shards.
+
+The downside of the chosen path will be added complexity of managing all these
+Zoekt servers from GitLab when compared with a "proxy" layer outside of GitLab
+that is managing all of these shards. We will consider this decision a work in
+progress and reassess if it turns out to add too much complexity to GitLab.
+
+#### Sharding proposal using GitLab `::Zoekt::Shard` model
+
+This is already implemented: `::Zoekt::IndexedNamespace`
+implements a many-to-many relationship between namespaces and shards.
+
+#### Replication and service discovery using Consul
+
+If we plan to replicate at the Zoekt node level as described above we need to
+change our data model to use a one-to-many relationship from `zoekt_shards ->
+namespaces`. This means making the `namespace_id` column unique in
+`zoekt_indexed_namespaces`. Then we need to implement a service discovery
+approach where the `index_url` always points at a primary Zoekt node and the
+`search_url` is a DNS record with N replicas and the primary. We then choose
+randomly from `search_url` records when searching.
+
+### Iterations
+
+1. Make available for `gitlab-org`
+1. Improve monitoring
+1. Improve performance
+1. Make available for select customers
+1. Implement sharding
+1. Implement replication
+1. Make available to many more licensed groups
+1. Implement automatic (re)balancing of shards
+1. Estimate costs for rolling out to all licensed groups and decide if it's worth it or if we need to optimize further or adjust our plan
+1. Roll out to all licensed groups
+1. Improve performance
+1. Assess costs and decide whether we should roll out to all free customers
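
The service discovery approach in "Replication and service discovery using Consul" could be sketched roughly as follows. `ShardEndpoints`, `index_target`, and `search_target` are hypothetical names, the hosts and ports are placeholders, and the `search_addresses` list stands in for whatever records a DNS lookup of `search_url` would return.

```ruby
# Hypothetical stand-in for one shard's endpoints. `search_addresses` models
# the records behind `search_url` (the primary plus N replicas).
ShardEndpoints = Struct.new(:index_url, :search_addresses)

# Indexing requests always go to the primary node.
def index_target(shard)
  shard.index_url
end

# Search requests go to any node chosen at random, since reads are allowed
# to be slightly out of date (eventually consistent).
def search_target(shard, random: Random.new)
  shard.search_addresses.sample(random: random)
end

# Placeholder hosts and ports, for illustration only.
shard = ShardEndpoints.new(
  "http://zoekt-primary.example:6060",
  ["http://zoekt-primary.example:6070", "http://zoekt-replica-1.example:6070"]
)
```

Keeping the random choice on the client side means GitLab needs no knowledge of which node is the primary when searching. Only the indexing path depends on the primary being correctly registered.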