author    GitLab Bot <gitlab-bot@gitlab.com>  2021-02-18 10:34:06 +0000
committer GitLab Bot <gitlab-bot@gitlab.com>  2021-02-18 10:34:06 +0000
commit    859a6fb938bb9ee2a317c46dfa4fcc1af49608f0 (patch)
tree      d7f2700abe6b4ffcb2dcfc80631b2d87d0609239 /doc/architecture
parent    446d496a6d000c73a304be52587cd9bbc7493136 (diff)
Add latest changes from gitlab-org/gitlab@13-9-stable-ee (tag: v13.9.0-rc42)
Diffstat (limited to 'doc/architecture')
-rw-r--r-- doc/architecture/blueprints/database_testing/index.md                  | 145
-rw-r--r-- doc/architecture/blueprints/feature_flags_development/index.md         |   4
-rw-r--r-- doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md |   6
-rw-r--r-- doc/architecture/blueprints/graphql_api/index.md                       | 183
-rw-r--r-- doc/architecture/blueprints/image_resizing/index.md                    |   8
5 files changed, 341 insertions(+), 5 deletions(-)
diff --git a/doc/architecture/blueprints/database_testing/index.md b/doc/architecture/blueprints/database_testing/index.md
new file mode 100644
index 00000000000..a333ac12ef3
--- /dev/null
+++ b/doc/architecture/blueprints/database_testing/index.md
@@ -0,0 +1,145 @@
+---
+comments: false
+description: 'Database Testing'
+---
+
+# Database Testing
+
+We have identified [common themes of reverted migrations](https://gitlab.com/gitlab-org/gitlab/-/issues/233391) and discovered failed migrations breaking in both production and staging even when successfully tested in a developer environment. We have also experienced production incidents even with successful testing in staging. These failures are quite expensive: they can have a significant effect on availability, block deployments, and generate incident escalations. These escalations must be triaged, and the failed migrations either reverted or fixed forward. Often, this happens without the original author's involvement due to time zones or the criticality of the escalation. With our increased deployment speeds and stricter uptime requirements, the need to improve database testing is critical, particularly earlier in the development process (shift left).
+
+From a developer's perspective, it is hard, if not infeasible, to validate a migration on a large enough dataset before it goes into production.
+
+Our primary goal is to **provide developers with immediate feedback for new migrations and other database-related changes tested on a full copy of the production database**, and to do so with high levels of efficiency (particularly in terms of infrastructure costs) and security.
+
+## Current day
+
+Developers are expected to test database migrations prior to deploying to any environment, but we lack the ability to perform testing against large environments such as GitLab.com. The [developer database migration style guide](../../../development/migration_style_guide.md) provides guidelines on migrations, and we focus on validating migrations during code review and testing in CI and staging.
+
+The [code review phase](../../../development/database_review.md) relies on Database Reviewers and Maintainers manually checking the committed migrations. This often means knowing and spotting problematic patterns and their particular behavior on GitLab.com from experience. There is no large-scale environment available that allows us to test database migrations before they are merged.
+
+Testing in CI is done on a very small database. We mainly check forward/backward migration consistency, evaluate RuboCop rules to detect well-known problematic behaviors (static code checking), and have a few other, rather technical checks in place (adding the right files, and so on). That is, we typically find code errors or other rather simple mistakes, but cannot surface data-related errors, which are typically not covered by unit tests either.
+
+Once merged, migrations are deployed to the staging environment. Its database size is less than 5% of the production database size as of January 2021, and its recent data distribution does not resemble the production site. Oftentimes, we see migrations succeed in staging but then fail in production due to query timeouts or other unexpected problems. Even when we catch problems in staging, they are still expensive to reconcile; ideally, we want to catch problems as early as possible in the development cycle.
+
+Today, we have gained experience working with a thin-cloned production database (more on this below) and already use it to provide developers with access to production query plans, automated query feedback, and optimization suggestions. This is built around [Database Lab](https://gitlab.com/postgres-ai/database-lab) and [Joe](https://gitlab.com/postgres-ai/joe), both available through Slack (using ChatOps) and [postgres.ai](https://postgres.ai/).
+
+## Vision
+
+As a developer:
+
+1. I am working on a GitLab code change that includes a data migration and changes a heavy database query.
+1. I push my code, create a merge request, and provide an example query in the description.
+1. The pipeline executes the data migration and examines the query in a large-scale environment (a copy of GitLab.com).
+1. Once the pipeline finishes, the merge request gets detailed feedback and information about the migration and the query I provided. This is based on a full clone of the production database, with a state that lags production by only minutes.
+
+For database migrations, the information gathered from execution on the clone includes:
+
+- Overall runtime.
+- Detailed statistics for queries being executed in the migration (normalizing queries and showing their frequencies and execution times as plots).
+- Dangerous locks held during the migration (which would cause blocking situations in production).
+
+For database queries, we can automatically gather the following (see the sketch after this list):
+
+- Query plans along with visualization.
+- Execution times and predictions for production.
+- Suggestions on optimizations from Joe.
+- Memory and IO statistics.
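+
+A minimal sketch of how a query plan could be captured on the thin clone, assuming standard ActiveRecord; this is illustrative only, not the actual Database Lab or Joe integration:
+
+```ruby
+require 'active_record'
+
+# Capture an execution plan, with timing and IO statistics, for a query
+# supplied in the merge request description.
+def explain_on_clone(sql)
+  ActiveRecord::Base.connection
+    .execute("EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) #{sql}")
+    .first
+end
+```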
+
+After receiving that feedback:
+
+1. I can go back and investigate a performance problem with the data migration.
+1. Once I have a fix pushed, I can repeat the above cycle and eventually send my merge request for database review. During the database review, the database reviewer and maintainer have all the additional generated information available to them to make an informed decision on the performance of the introduced changes.
+
+This information gathering is done in a protected and safe environment, ensuring that there is no unauthorized access to production data and that code can be executed safely.
+
+The intended benefits include:
+
+- Shifting left: Allow developers to understand large-scale database performance and what to expect on GitLab.com in a self-service manner.
+- Identify errors that only surface when working against a production-scale dataset with real data (with its inconsistencies and unexpected patterns).
+- Automate the information-gathering phase to make code review easier for everybody involved (developer, reviewer, maintainer) by providing relevant details automatically and upfront.
+
+## Technology and next steps
+
+We already use Database Lab from [postgres.ai](https://postgres.ai/), which is a thin-cloning technology. We maintain a PostgreSQL replica that is up to date with production data but does not serve any production traffic. This replica runs Database Lab, which allows us to quickly create a full clone of the production dataset (on the order of seconds).
+
+Internally, this is based on ZFS and implements a "thin-cloning technology". That is, ZFS snapshots are used to clone the data, and Database Lab exposes a full read/write PostgreSQL cluster based on the cloned data. This is called a *thin clone*. It is rather short-lived and is destroyed shortly after we finish using it.
+
+It is important to note that a thin clone is fully read/write. This allows us to execute migrations on top of it.
+
+Database Lab provides an API we can interact with to manage thin clones. In order to automate the migration and query testing, we add steps to the `gitlab-org/gitlab` CI pipeline. This triggers automation that performs the following steps for a given merge request (sketched in code after this list):
+
+1. Create a thin clone with production data for this testing session.
+1. Pull GitLab code from the merge request.
+1. Execute migrations and gather all necessary information from it.
+1. Execute query testing and gather all necessary information from it.
+1. Post back the results of the migration and query testing to the merge request.
+1. Destroy the thin clone.
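+
+A hedged sketch of that automation, assuming a client for the Database Lab API; the class and method names here are hypothetical, not the actual postgres.ai API:
+
+```ruby
+class MigrationTestingJob
+  def run(merge_request)
+    # 1. Create a thin clone with production data for this session.
+    clone = DatabaseLab.create_thin_clone
+    # 2. Pull the GitLab code from the merge request.
+    checkout(merge_request.source_branch)
+    # 3. and 4. Execute migrations and query testing, gathering statistics.
+    migration_report = execute_migrations(clone)
+    query_report = test_queries(clone, merge_request.example_queries)
+    # 5. Post the results back to the merge request.
+    post_feedback(merge_request, migration_report, query_report)
+  ensure
+    # 6. Always destroy the thin clone, even on failure.
+    clone&.destroy
+  end
+end
+```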
+
+### Short-term
+
+The short-term focus is on testing regular migrations (typically schema changes) and using the existing Database Lab instance from postgres.ai for it.
+
+In order to secure this process and meet compliance goals, the runner environment will be treated as a *production* environment and similarly locked down, monitored and audited. Only Database Maintainers will have access to the CI pipeline and its job output. Everyone else will only be able to see the results and statistics posted back on the merge request.
+
+We implement a secured CI pipeline on <https://ops.gitlab.net> that adds the execution steps outlined above. The goal is to secure this pipeline in order to solve the following problem:
+
+Make sure we strongly protect production data, even though we allow everyone (GitLab team members and developers) to execute arbitrary code on the thin clone, which contains production data.
+
+In principle, this is achieved by locking down the GitLab Runner instance executing the code, and its containers, at the network level, such that no data can escape over the network. We make sure no communication can happen with the outside world from within the container executing the GitLab Rails code (and its database migrations).
+
+Furthermore, we limit the ability to view the results of the jobs (including the output printed from code) to the Maintainer and Owner levels on the <https://ops.gitlab.net> pipeline, and provide only a high-level summary back to the original MR. If there are issues or errors in one of the jobs, the Database Maintainer assigned to review the MR can check the original job for more details.
+
+With this step implemented, we already have the ability to execute database migrations on the thin-cloned GitLab.com database automatically from GitLab CI and provide feedback to the merge request and the developer. The content of that feedback is expected to evolve over time, and we can continuously add to it.
+
+We already have an [MVC-style implementation for the pipeline](https://gitlab.com/gitlab-org/database-team/gitlab-com-migrations) for reference and an [example merge request with feedback](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/50793#note_477815261) from the pipeline.
+
+The short-term goal is detailed in [this epic](https://gitlab.com/groups/gitlab-org/database-team/-/epics/6).
+
+### Mid-term - Improved feedback, query testing and background migration testing
+
+Mid-term, we plan to expand the level of detail the testing pipeline reports back to the merge request and to expand its scope to cover query testing, too. In doing so, we use our experience from database code reviews and thin-clone technology and bring it closer to the GitLab workflow. Instead of reaching out to different tools (postgres.ai, Joe, Slack, plan visualizations, and so on), we bring this information back to GitLab so that developers can work directly on the merge request.
+
+Secondly, we plan to cover background migration testing, too. These are typically data migrations scheduled to run over a long period of time. The success of both the scheduling phase and the job execution phase typically depends heavily on data distribution, which only surfaces when running these migrations on actual production data. In order to become confident about a background migration, we plan to provide the following feedback (a sketch follows the list):
+
+1. Scheduling phase - query statistics (for example, a histogram of query execution times), job statistics (number of jobs, overall duration, and so on), and batch sizes.
+1. Execution phase - using a few instances of a job as examples, we execute those to gather query and runtime statistics.
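+
+A minimal sketch of the scheduling-phase feedback, assuming a batched background migration over an `id` range; the batch size, delay, and helper are illustrative only:
+
+```ruby
+BATCH_SIZE = 10_000
+DELAY_SECONDS = 120 # hypothetical 2-minute interval between jobs
+
+# Summarize how many jobs a background migration would schedule and
+# roughly how long the scheduled run would take.
+def scheduling_stats(min_id, max_id)
+  total_jobs = ((max_id - min_id + 1) / BATCH_SIZE.to_f).ceil
+  {
+    total_jobs: total_jobs,
+    batch_size: BATCH_SIZE,
+    estimated_duration_seconds: total_jobs * DELAY_SECONDS
+  }
+end
+
+scheduling_stats(1, 1_000_000)
+# => { total_jobs: 100, batch_size: 10000, estimated_duration_seconds: 12000 }
+```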
+
+### Long-term - incorporate into GitLab product
+
+There are opportunities to discuss extracting features from this into GitLab itself. For example, annotating the merge request with query examples and attaching feedback gathered from the testing run can become a first-class citizen instead of relying on the merge request description and comments. We plan to evaluate those ideas as we see them being used in earlier phases and bring our experience back into the product.
+
+## An alternative discussed: Anonymization
+
+At the core of this problem lies the concern about executing (potentially arbitrary) code on a production dataset and making sure the production data is well protected. The approach discussed above solves this by strongly limiting access to the output of said code.
+
+An alternative approach we have discussed and abandoned is to "scrub" and anonymize production data. The idea is to remove any sensitive data from the database and use the resulting dataset for database testing. This approach has a number of downsides, which led us to abandon the idea:
+
+- Anonymization is complex by nature - it is a hard problem to call a "scrubbed clone" actually safe to work with in public. Different data types may require different anonymization techniques (for example, anonymizing sensitive information inside a JSON field), and focusing on only one attribute at a time does not guarantee that a dataset is fully anonymized (for example, join attacks, or using timestamps in conjunction with public profiles/projects to de-anonymize users by their activity).
+- Anonymization requires an additional process to track and update the set of attributes considered sensitive, plus ongoing maintenance and security reviews every time the database schema changes.
+- Annotating data as "sensitive" is error-prone: the wrong anonymization approach used for a data type, or one sensitive attribute accidentally not marked as such, can lead to a data breach.
+- Scrubbing not only removes sensitive data, but also changes data distribution, which greatly affects performance of migrations and queries.
+- Scrubbing heavily changes the database contents, potentially updating a lot of data, which leads to different data storage details (think MVCC bloat), affecting performance of migrations and queries.
+
+## Who
+
+<!-- vale gitlab.Spelling = NO -->
+
+This effort is owned and driven by the [GitLab Database Team](https://about.gitlab.com/handbook/engineering/development/enablement/database/) with support from the [GitLab.com Reliability Datastores](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/datastores/) team.
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Andreas Brandl |
+| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
+| Engineering Leader | Craig Gomes |
+| Domain Expert | Yannis Roussos |
+| Domain Expert | Pat Bair |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Product | Fabian Zimmer |
+| Leadership | Craig Gomes |
+| Engineering | Andreas Brandl |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/feature_flags_development/index.md b/doc/architecture/blueprints/feature_flags_development/index.md
index 6be582bb8af..a5e46d25921 100644
--- a/doc/architecture/blueprints/feature_flags_development/index.md
+++ b/doc/architecture/blueprints/feature_flags_development/index.md
@@ -91,7 +91,7 @@ allow us to have:
name: ci_disallow_to_create_merge_request_pipelines_in_target_project
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
-group: group::progressive delivery
+group: group::release
type: development
default_enabled: false
```
@@ -105,7 +105,7 @@ These are reason why these changes are needed:
- we have ambiguous usage of feature flag with different `default_enabled:` and
different `actors` used
- we lack a clear indication who owns what feature flag and where to find
- relevant informations
+ relevant information
- we do not emphasise the desire to create feature flag rollout issue to
indicate that feature flag is in fact a ~"technical debt"
- we don't know exactly what feature flags we have in our codebase
diff --git a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
index 6c27ecca284..fb71707c146 100644
--- a/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
+++ b/doc/architecture/blueprints/gitlab_to_kubernetes_communication/index.md
@@ -1,12 +1,12 @@
---
-stage: configure
-group: configure
+stage: Configure
+group: Configure
info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#designated-technical-writers
comments: false
description: 'GitLab to Kubernetes communication'
---
-# GitLab to Kubernetes communication
+# GitLab to Kubernetes communication **(FREE)**
The goal of this document is to define how GitLab can communicate with Kubernetes
and in-cluster services through the GitLab Kubernetes Agent.
diff --git a/doc/architecture/blueprints/graphql_api/index.md b/doc/architecture/blueprints/graphql_api/index.md
new file mode 100644
index 00000000000..40d02168b3b
--- /dev/null
+++ b/doc/architecture/blueprints/graphql_api/index.md
@@ -0,0 +1,183 @@
+---
+stage: none
+group: unassigned
+comments: false
+description: 'GraphQL API architecture foundation'
+---
+
+# GraphQL API
+
+[GraphQL](https://graphql.org/) is a data query and manipulation language for
+APIs, and a runtime for fulfilling queries with existing data.
+
+At GitLab we want to adopt GraphQL to make it easier for the wider community to
+interact with GitLab in a reliable way, but also to advance our own product by
+modeling communication between backend and frontend components using GraphQL.
+
+We've recently increased the pace of adoption by defining quarterly OKRs
+related to the GraphQL migration. This resulted in us spending more time on
+GraphQL development and helped surface the need to improve the tooling we use
+to extend the new API.
+
+This document describes the work that is needed to build a stable foundation that
+will support our development efforts and large-scale usage of the [GraphQL
+API](https://docs.gitlab.com/ee/api/graphql/index.html).
+
+## Summary
+
+The GraphQL initiative at GitLab [started around three years ago](https://gitlab.com/gitlab-org/gitlab/-/commit/9c6c17cbcdb8bf8185fc1b873dcfd08f723e4df5).
+Most of the work around the GraphQL ecosystem has been done by volunteers who are
+[GraphQL experts](https://gitlab.com/groups/gitlab-org/graphql-experts/-/group_members?with_inherited_permissions=exclude).
+
+The [retrospective on our progress](https://gitlab.com/gitlab-org/gitlab/-/issues/235659)
+surfaced a few opportunities to streamline our GraphQL development efforts and
+to reduce the risk of performance degradations and possible outages that may
+be related to the gaps in the essential mechanisms needed to make the GraphQL
+API observable and operable at scale.
+
+Alongside small improvements to the GraphQL engine itself, we want to build a
+comprehensive monitoring dashboard that will enable team members to make sense
+of what is happening inside our GraphQL API. We want to make it possible to define
+SLOs, to triage breached SLIs, and to zoom into relevant details using
+Grafana and Elastic. We want to see historical data and predict future usage.
+
+It is an opportunity to learn from our experience in evolving the REST API at
+scale and to apply this knowledge to the GraphQL development efforts. We
+can do that by building query-to-feature correlation mechanisms, adding
+scalable state synchronization support and aligning GraphQL with other
+architectural initiatives being executed in parallel, like [the support for
+direct uploads](https://gitlab.com/gitlab-org/gitlab/-/issues/280819).
+
+GraphQL should be secure by default. We can avoid common security mistakes by
+building mechanisms that will help us to enforce [OWASP GraphQL
+recommendations](https://cheatsheetseries.owasp.org/cheatsheets/GraphQL_Cheat_Sheet.html)
+that are relevant to us.
+
+Understanding the needs of the wider community will also allow us to
+plan deprecation policies better and to design parity between the GraphQL and
+REST APIs that suits those needs.
+
+## Challenges
+
+### Make sense of what is happening in GraphQL
+
+Being able to see how GraphQL performs in a production environment is a
+prerequisite for improving performance and reliability of that service.
+
+We do not yet have tools that make it possible for us to answer the
+question of how GraphQL performs and which bottlenecks we should optimize.
+This, combined with the pace of GraphQL adoption and the scale at which we
+expect it to operate, imposes a risk of an increased rate of production
+incidents that will be difficult to resolve.
+
+We want to build a comprehensive Grafana dashboard that will focus on
+delivering insights into how the GraphQL endpoint performs, while still
+empowering team members with the capability of zooming into details. We want
+to improve logging to make it possible to better correlate GraphQL queries
+with features using Elastic, and to index them in a way that performance
+problems can be detected early.
+
+- Build a comprehensive Grafana dashboard for GraphQL
+- Build GraphQL query-to-feature correlation mechanisms
+- Improve logging of GraphQL queries in Elastic
+- Redesign error handling on the frontend to surface warnings
+
+### Manage volatile GraphQL data structures
+
+Our GraphQL API will evolve with time. GraphQL has been designed to make such
+evolution easier. GraphQL APIs are easier to extend because of how composable
+GraphQL is. On the other hand, this is also a reason why versioning of GraphQL
+APIs is considered unnecessary. Instead of versioning the API, we want to mark
+some fields as deprecated, but we need a way to understand the usage of
+deprecated fields and types, and a way to visualize that usage that is easy to
+understand. We might want to detect usage of deprecated fields and notify
+users that we plan to remove them (see the sketch after this list).
+
+- Define a data-informed deprecation policy that will serve our users better
+- Build a dashboard showing usage frequency of deprecated GraphQL fields
+- Build mechanisms required to send deprecated fields usage in usage ping
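+
+A sketch of how a field deprecation looks with graphql-ruby, which our
+GraphQL API is built on; the field names here are made up for illustration:
+
+```ruby
+module Types
+  class UserType < BaseObject
+    # The deprecation_reason keyword marks the field as deprecated in the
+    # schema, which clients and introspection tooling can detect.
+    field :username, String, null: true,
+      deprecation_reason: 'Use public_handle instead'
+    field :public_handle, String, null: true
+  end
+end
+```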
+
+### Ensure consistency with the rest of the codebase
+
+GraphQL is not the only thing we work on, but it cuts across the entire
+application. It is used to expose data collected and processed in almost
+every part of our product. This makes it tightly coupled with our monolithic
+codebase.
+
+We need to ensure that how we use GraphQL is consistent with other mechanisms
+we've designed to improve performance and reliability of GitLab.
+
+We have extensive experience with evolving our REST API. We want to apply
+this knowledge to GraphQL and make it performant and secure by default
+(see the sketch after this list).
+
+- Design direct uploads for GraphQL
+- Build GraphQL query depth and complexity histograms
+- Visualize the amount of GraphQL queries reaching limits
+- Add support for GraphQL ETags for existing features
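+
+A minimal sketch of schema-level guardrails in graphql-ruby that make depth
+and complexity limits explicit; the numeric values here are illustrative,
+not our production settings:
+
+```ruby
+class GitlabSchema < GraphQL::Schema
+  max_depth 15        # reject queries nested deeper than this
+  max_complexity 250  # reject queries whose computed complexity exceeds this
+end
+```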
+
+### Design GraphQL interoperability with REST API
+
+We do not plan to deprecate our REST API. It is a simple way to interact with
+GitLab, and GraphQL might never become a full replacement for a traditional
+REST API. The two APIs will need to coexist. We will need to remove
+duplication between them to keep their codebases maintainable. This symbiosis,
+however, is not only a technical challenge to resolve on the backend.
+Users might want to use the two APIs interchangeably, or even at the same time.
+Exposing a common scheme for resource identifiers is a prerequisite for that
+interoperability (see the sketch after this list).
+
+- Make GraphQL and REST API interoperable
+- Design common resource identifiers for both APIs
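+
+A hedged illustration of such a common identifier scheme: global IDs that
+embed the model name together with the numeric ID the REST API already uses.
+The helper functions are hypothetical:
+
+```ruby
+# Build a global ID from a model name and a REST-style numeric ID.
+def global_id_for(model_name, rest_id)
+  "gid://gitlab/#{model_name}/#{rest_id}"
+end
+
+# Recover the REST-style numeric ID from a global ID.
+def rest_id_from(global_id)
+  Integer(global_id.split('/').last)
+end
+
+global_id_for('Project', 42)            # => "gid://gitlab/Project/42"
+rest_id_from('gid://gitlab/Project/42') # => 42
+```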
+
+### Design scalable state synchronization mechanisms
+
+One of the most important goals related to GraphQL adoption at GitLab is using
+it to model interactions between GitLab backend and frontend components. This
+is an ongoing process that has already surfaced the need to build better
+state synchronization mechanisms and to hook into existing ones (one candidate
+approach is sketched after this list).
+
+- Design a scalable state synchronization mechanism
+- Evaluate state synchronization through pub/sub and websockets
+- Build generic support for GraphQL feature correlation and feature ETags
+- Redesign frontend code responsible for managing shared global state
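+
+One candidate we plan to evaluate for pub/sub and websockets: GraphQL
+subscriptions delivered over ActionCable, as supported by graphql-ruby. A
+minimal sketch; the subscription field shown is hypothetical:
+
+```ruby
+class GitlabSchema < GraphQL::Schema
+  # Deliver subscription updates over ActionCable (websockets).
+  use GraphQL::Subscriptions::ActionCableSubscriptions
+end
+
+module Types
+  class SubscriptionType < BaseObject
+    # Push an updated issue to subscribed frontend components.
+    field :issue_updated, Types::IssueType, null: true do
+      argument :issue_id, ID, required: true
+    end
+  end
+end
+```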
+
+## Iterations
+
+1. [Build comprehensive Grafana dashboard for GraphQL](https://gitlab.com/groups/gitlab-com/-/epics/1343)
+1. [Improve logging of GraphQL requests in Elastic](https://gitlab.com/groups/gitlab-org/-/epics/4646)
+1. [Build a scalable state synchronization for GraphQL](https://gitlab.com/groups/gitlab-org/-/epics/5319)
+1. [Build GraphQL feature-to-query correlation mechanisms](https://gitlab.com/groups/gitlab-org/-/epics/5320)
+1. [Design a better data-informed deprecation policy](https://gitlab.com/groups/gitlab-org/-/epics/5321)
+1. [Add support for direct uploads for GraphQL](https://gitlab.com/gitlab-org/gitlab/-/issues/280819)
+1. [Review GraphQL design choices related to security](https://gitlab.com/gitlab-org/security/gitlab/-/issues/339)
+
+## Status
+
+Current status: in progress.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Grzegorz Bizon |
+| Architecture Evolution Coach | Kamil TrzciƄski |
+| Engineering Leader | Darva Satcher |
+| Product Manager | Patrick Deuley |
+| Domain Expert / GraphQL | Charlie Ablett |
+| Domain Expert / GraphQL | Alex Kalderimis |
+| Domain Expert / GraphQL | Natalia Tepluhina |
+| Domain Expert / Scalability | Bob Van Landuyt |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Leadership | Darva Satcher |
+| Product | Patrick Deuley |
+| Engineering | |
+
+<!-- vale gitlab.Spelling = YES -->
diff --git a/doc/architecture/blueprints/image_resizing/index.md b/doc/architecture/blueprints/image_resizing/index.md
index 9e5c45a715d..686a2f9c8f5 100644
--- a/doc/architecture/blueprints/image_resizing/index.md
+++ b/doc/architecture/blueprints/image_resizing/index.md
@@ -59,6 +59,8 @@ The MVC Avatar resizing implementation is integrated into Workhorse. With the ex
Proposal:
+<!-- vale gitlab.Spelling = NO -->
+
| Role | Who
|------------------------------|-------------------------|
| Author | Craig Gomes |
@@ -67,10 +69,16 @@ Proposal:
| Domain Expert | Matthias Kaeppler |
| Domain Expert | Aleksei Lipniagov |
+<!-- vale gitlab.Spelling = YES -->
+
DRIs:
+<!-- vale gitlab.Spelling = NO -->
+
| Role | Who
|------------------------------|------------------------|
| Product | Josh Lambert |
| Leadership | Craig Gomes |
| Engineering | Matthias Kaeppler |
+
+<!-- vale gitlab.Spelling = YES -->