author GitLab Bot <gitlab-bot@gitlab.com> 2020-10-21 07:08:36 +0000
committer GitLab Bot <gitlab-bot@gitlab.com> 2020-10-21 07:08:36 +0000
commit 48aff82709769b098321c738f3444b9bdaa694c6 (patch)
tree e00c7c43e2d9b603a5a6af576b1685e400410dee /doc/architecture/blueprints
parent 879f5329ee916a948223f8f43d77fba4da6cd028 (diff)
download gitlab-ce-48aff82709769b098321c738f3444b9bdaa694c6.tar.gz
Add latest changes from gitlab-org/gitlab@13-5-stable-ee (v13.5.0-rc42)
Diffstat (limited to 'doc/architecture/blueprints')
-rw-r--r-- doc/architecture/blueprints/cloud_native_build_logs/index.md | 141
-rw-r--r-- doc/architecture/blueprints/cloud_native_gitlab_pages/index.md | 135
-rw-r--r-- doc/architecture/blueprints/feature_flags_development/index.md | 140
3 files changed, 416 insertions, 0 deletions
diff --git a/doc/architecture/blueprints/cloud_native_build_logs/index.md b/doc/architecture/blueprints/cloud_native_build_logs/index.md
new file mode 100644
index 00000000000..25abfe36e88
--- /dev/null
+++ b/doc/architecture/blueprints/cloud_native_build_logs/index.md
@@ -0,0 +1,141 @@
+---
+comments: false
+description: 'Next iteration of build logs architecture at GitLab'
+---
+
+# Cloud Native Build Logs
+
+Cloud native and the adoption of Kubernetes have been recognised by GitLab as
+one of the top two tailwinds that are helping us grow faster as a company
+behind the project.
+
+This effort is described in more detail [in the infrastructure team
+handbook](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
+
+## Traditional build logs
+
+Traditional job logs depend heavily on the availability of local shared storage.
+
+Every time a GitLab Runner sends a new partial build output, we write this
+output to a file on disk. This is simple, but the mechanism depends on
+shared local storage: the same file needs to be available on every GitLab web
+node, because GitLab Runner might connect to a different one every time
+it performs an API request. Sidekiq also needs access to the file because, when
+a job is complete, the contents of the trace file are sent to the object store.
+
+## New architecture
+
+The new architecture writes build logs to Redis instead of writing them into a
+file.
+
+In order to make this performant and resilient enough, we implemented a chunked
+I/O mechanism - we store data in Redis in chunks, and migrate them to an object
+store once we reach a desired chunk size.
+
+A simplified sequence diagram is available below.
+
+```mermaid
+sequenceDiagram
+ autonumber
+ participant U as User
+ participant R as Runner
+ participant G as GitLab (rails)
+ participant I as Redis
+ participant D as Database
+ participant O as Object store
+
+ loop incremental trace update sent by a runner
+ Note right of R: Runner appends a build trace
+ R->>+G: PATCH trace [build.id, offset, data]
+ G->>+D: find or create chunk [chunk.index]
+ D-->>-G: chunk [id, index]
+ G->>I: append chunk data [chunk.index, data]
+ G-->>-R: 200 OK
+ end
+
+ Note right of R: User retrieves a trace
+ U->>+G: GET build trace
+ loop every trace chunk
+ G->>+D: find chunk [index]
+ D-->>-G: chunk [id]
+ G->>+I: read chunk data [chunk.index]
+ I-->>-G: chunk data [data, size]
+ end
+ G-->>-U: build trace
+
+ Note right of R: Trace chunk is full
+ R->>+G: PATCH trace [build.id, offset, data]
+ G->>+D: find or create chunk [chunk.index]
+ D-->>-G: chunk [id, index]
+ G->>I: append chunk data [chunk.index, data]
+ G->>G: chunk full [index]
+ G-->>-R: 200 OK
+ G->>+I: read chunk data [chunk.index]
+ I-->>-G: chunk data [data, size]
+ G->>O: send chunk data [data, size]
+ G->>+D: update data store type [chunk.id]
+ G->>+I: delete chunk data [chunk.index]
+```
+
+## NFS coupling
+
+In 2017, we experienced serious problems scaling our NFS infrastructure. We
+even tried to replace NFS with
+[CephFS](https://docs.ceph.com/docs/master/cephfs/), unsuccessfully.
+
+Since then it has become apparent that the cost of operating and
+maintaining an NFS cluster is significant and that, if we ever decide to
+migrate to Kubernetes, [we need to decouple GitLab from shared local storage
+and
+NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396).
+
+1. NFS might be a single point of failure
+1. NFS can only be reliably scaled vertically
+1. Moving to Kubernetes means increasing the number of mount points by an order
+ of magnitude
+1. NFS depends on an extremely reliable network, which can be difficult to provide
+ in a Kubernetes environment
+1. Storing customer data on NFS involves additional security risks
+
+Moving GitLab to Kubernetes without NFS decoupling would result in an explosion
+of complexity and maintenance cost, and an enormous negative impact on availability.
+
+## Iterations
+
+1. ✓ Implement the new architecture in a way that does not depend on shared local storage
+1. ✓ Evaluate performance and edge-cases, iterate to improve the new architecture
+1. ✓ Design cloud native build logs correctness verification mechanisms
+1. ✓ Build observability mechanisms around performance and correctness
+1. Roll out the feature to the production environment incrementally
+
+The work needed to make the new architecture production ready and enabled on
+GitLab.com is being tracked in the [Cloud Native Build Logs on
+GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/4275) epic.
+
+Enabling this feature on GitLab.com is a subtask of [making the new
+architecture generally
+available](https://gitlab.com/groups/gitlab-org/-/epics/3791) for everyone.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Grzegorz Bizon |
+| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
+| Engineering Leader | Darby Frey |
+| Domain Expert | Kamil Trzciński |
+| Domain Expert | Sean McGivern |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Product | Jason Yavorska |
+| Leadership | Darby Frey |
+| Engineering | Grzegorz Bizon |
+
+<!-- vale gitlab.Spelling = YES -->
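Note on `cloud_native_build_logs/index.md`: the chunked I/O flow in the "New architecture" section (append to Redis, migrate a full chunk to the object store, then update the chunk's store type and delete it from Redis) can be sketched roughly as below. This is a minimal in-memory sketch, not GitLab's implementation. The class name `ChunkedTrace`, the `CHUNK_SIZE` value, the plain dictionaries standing in for the database, Redis and the object store, and the strict offset check are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 8  # hypothetical value; the blueprint does not state the real chunk size


@dataclass
class ChunkedTrace:
    """Hypothetical in-memory stand-ins for the database, Redis and the object store."""
    chunks: dict = field(default_factory=dict)        # "database": chunk index -> store name
    redis: dict = field(default_factory=dict)         # live (not yet full) chunk data
    object_store: dict = field(default_factory=dict)  # migrated chunk data
    size: int = 0                                     # total bytes accepted so far

    def append(self, offset: int, data: bytes) -> None:
        """Handle one incremental trace update (PATCH trace [offset, data])."""
        if offset != self.size:  # assumed check: updates must arrive in order
            raise ValueError(f"expected offset {self.size}, got {offset}")
        while data:
            index = self.size // CHUNK_SIZE
            self.chunks.setdefault(index, "redis")  # find or create chunk
            room = CHUNK_SIZE - self.size % CHUNK_SIZE
            part, data = data[:room], data[room:]
            self.redis[index] = self.redis.get(index, b"") + part  # append chunk data
            self.size += len(part)
            if len(self.redis[index]) == CHUNK_SIZE:  # chunk full
                self._migrate(index)

    def _migrate(self, index: int) -> None:
        """Send a full chunk to the object store, update its store type, delete it from Redis."""
        self.object_store[index] = self.redis[index]
        self.chunks[index] = "object_store"
        del self.redis[index]

    def read(self) -> bytes:
        """Assemble the trace by reading every chunk from whichever store holds it."""
        stores = {"redis": self.redis, "object_store": self.object_store}
        return b"".join(stores[self.chunks[i]][i] for i in sorted(self.chunks))
```

In this sketch a reader never needs to know where a chunk lives: the per-chunk store type recorded in the "database" decides whether data comes from Redis or from the object store, which mirrors the "find chunk" then "read chunk data" steps in the diagram.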
diff --git a/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
new file mode 100644
index 00000000000..37e69d46ae1
--- /dev/null
+++ b/doc/architecture/blueprints/cloud_native_gitlab_pages/index.md
@@ -0,0 +1,135 @@
+---
+comments: false
+description: 'Making GitLab Pages a Cloud Native application - architecture blueprint.'
+---
+
+# GitLab Pages New Architecture
+
+GitLab Pages is an important component of the GitLab product. It is mostly
+used to serve static content, and has a limited set of well-defined
+responsibilities. Unfortunately, it has become a blocker for the
+GitLab.com Kubernetes migration.
+
+Cloud native and the adoption of Kubernetes have been recognised by GitLab as
+one of the top two tailwinds that are helping us grow faster as a company
+behind the project.
+
+This effort is described in more detail [in the infrastructure team handbook
+page](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).
+
+GitLab Pages is tightly coupled with NFS, and unblocking the Kubernetes
+migration requires a significant change to GitLab Pages' architecture. This
+is ongoing work that we started more than a year ago. This blueprint
+explains why the work is important and what the roadmap is.
+
+## How GitLab Pages Works
+
+GitLab Pages is a daemon designed to serve static content, written in
+[Go](https://golang.org/).
+
+Initially, GitLab Pages was designed to store static content on local
+shared block storage (NFS) in a hierarchical group > project directory
+structure. Each directory, representing a project, was supposed to contain a
+configuration file and static content that the GitLab Pages daemon would
+read and serve.
+
+```mermaid
+graph LR
+ A(GitLab Rails) -- Writes new pages deployment --> B[(NFS)]
+ C(GitLab Pages) -. Reads static content .-> B
+```
+
+This initial design has become outdated for a few reasons, NFS coupling
+being one of them, and we decided to replace it with a more "decoupled
+service"-like architecture. The new architecture that we are working on is
+described in this blueprint.
+
+## NFS coupling
+
+In 2017, we experienced serious problems scaling our NFS infrastructure. We
+even tried to replace NFS with
+[CephFS](https://docs.ceph.com/docs/master/cephfs/), unsuccessfully.
+
+Since then it has become apparent that the cost of operating and
+maintaining an NFS cluster is significant and that, if we ever decide to
+migrate to Kubernetes, [we need to decouple GitLab from shared local storage
+and
+NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396).
+
+1. NFS might be a single point of failure
+1. NFS can only be reliably scaled vertically
+1. Moving to Kubernetes means increasing the number of mount points by an order
+ of magnitude
+1. NFS depends on an extremely reliable network, which can be difficult to provide
+ in a Kubernetes environment
+1. Storing customer data on NFS involves additional security risks
+
+Moving GitLab to Kubernetes without NFS decoupling would result in an explosion
+of complexity and maintenance cost, and an enormous negative impact on availability.
+
+## New GitLab Pages Architecture
+
+- GitLab Pages is going to source domains' configuration from GitLab's internal
+ API, instead of reading `config.json` files from local shared storage.
+- GitLab Pages is going to serve static content from Object Storage.
+
+```mermaid
+graph TD
+ A(User) -- Pushes pages deployment --> B{GitLab}
+ C((GitLab Pages)) -. Reads configuration from API .-> B
+ C -. Reads static content .-> D[(Object Storage)]
+ C -- Serves static content --> E(Visitors)
+```
+
+This new architecture has been briefly described in [the blog
+post](https://about.gitlab.com/blog/2020/08/03/how-gitlab-pages-uses-the-gitlab-api-to-serve-content/)
+too.
+
+## Iterations
+
+1. ✓ Redesign GitLab Pages configuration source to use GitLab's API
+1. ✓ Evaluate performance and build reliable caching mechanisms
+1. ✓ Incrementally roll out the new source on GitLab.com
+1. ✓ Make GitLab Pages API domains config source enabled by default
+1. Enable experimentation with different servings through feature flags
+1. Triangulate object store serving design through meaningful experiments
+1. Design pages migration mechanisms that can work incrementally
+1. Gradually migrate towards object storage serving on GitLab.com
+
+The [GitLab Pages Architecture](https://gitlab.com/groups/gitlab-org/-/epics/1316)
+epic, with a detailed roadmap, is also available.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Grzegorz Bizon |
+| Architecture Evolution Coach | Kamil Trzciński |
+| Engineering Leader | Daniel Croft |
+| Domain Expert | Grzegorz Bizon |
+| Domain Expert | Vladimir Shushlin |
+| Domain Expert | Jaime Martinez |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Product | Jackie Porter |
+| Leadership | Daniel Croft |
+| Engineering | Kamil Trzciński |
+
+Domain Experts:
+
+| Role | Who
+|------------------------------|------------------------|
+| Domain Expert | Kamil Trzciński |
+| Domain Expert | Grzegorz Bizon |
+| Domain Expert | Vladimir Shushlin |
+| Domain Expert | Jaime Martinez |
+| Domain Expert | Krasimir Angelov |
+
+<!-- vale gitlab.Spelling = YES -->
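Note on `cloud_native_gitlab_pages/index.md`: the request path implied by "New GitLab Pages Architecture" (look up the domain's configuration through the API, cache it, then read the content from object storage) might look roughly like the sketch below. GitLab Pages itself is written in Go; Python is used here only for brevity. The class name `PagesSketch`, the `fetch_config` callable, the `prefix` configuration key and the TTL-based cache are hypothetical and chosen for illustration, not taken from the real internal API.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class PagesSketch:
    """Resolve a domain via a config API (with caching) and serve from an object store."""

    def __init__(self, fetch_config: Callable[[str], Optional[dict]],
                 object_store: Dict[str, bytes], ttl: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch_config = fetch_config  # stand-in for the internal API call
        self.object_store = object_store  # stand-in for object storage: key -> content
        self.ttl = ttl                    # assumed cache lifetime in seconds
        self.clock = clock
        self._cache: Dict[str, Tuple[float, Optional[dict]]] = {}

    def config_for(self, host: str) -> Optional[dict]:
        """Return the cached configuration, calling the API only when the entry is stale."""
        now = self.clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[0] < self.ttl:
            return cached[1]
        config = self.fetch_config(host)
        self._cache[host] = (now, config)  # unknown domains are cached too
        return config

    def serve(self, host: str, path: str) -> Tuple[int, bytes]:
        """Map host and path to an object key and return (status, body)."""
        config = self.config_for(host)
        if config is None:
            return 404, b"unknown domain"
        key = config["prefix"] + path.lstrip("/")
        body = self.object_store.get(key)
        return (200, body) if body is not None else (404, b"not found")
```

The cache is what keeps the daemon from calling the API on every request, which is the concern behind the "build reliable caching mechanisms" iteration; a real implementation would also need to handle API failures and cache invalidation, which this sketch leaves out.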
diff --git a/doc/architecture/blueprints/feature_flags_development/index.md b/doc/architecture/blueprints/feature_flags_development/index.md
new file mode 100644
index 00000000000..0aeb2b51b39
--- /dev/null
+++ b/doc/architecture/blueprints/feature_flags_development/index.md
@@ -0,0 +1,140 @@
+---
+comments: false
+description: 'Internal usage of Feature Flags for GitLab development'
+---
+
+# Usage of Feature Flags for GitLab development
+
+The use of feature flags has become crucial for the development of GitLab.
+Feature flags are a convenient way to ship changes early and safely roll
+them out to a wide audience, ensuring that a feature is stable and performant.
+
+Since the presence of a feature is controlled with a dedicated condition, a
+developer can decide on the best time to test the feature, ensuring that the
+feature is not enabled prematurely.
+
+## Challenges
+
+The extensive usage of feature flags poses a few challenges:
+
+- Each feature flag that we add to the codebase is ~"technical debt", as it expands the
+ matrix of configurations.
+- Testing each combination of feature flags is close to impossible, so we
+ instead try to optimise our testing of feature flags to the most common
+ scenarios.
+- Maintaining a growing number of feature flags is increasingly challenging.
+ We sometimes forget how our feature flags are configured or why we haven't
+ yet removed a feature flag.
+- The usage of feature flags can also be confusing to people outside of
+ development, who might not fully understand how a ~feature or ~bug
+ fix depends on a feature flag, how that feature flag is configured, or whether the feature
+ should be announced as part of a release post.
+- Maintaining feature flags poses the additional challenge of having to manage
+ different configurations across different environments and targets. We have
+ different feature flag configurations for testing, for development, for
+ staging, for production, and for what is shipped to our customers as part of
+ the on-premise offering.
+
+## Goals
+
+The biggest challenge today with our feature flag usage is its implicit
+nature. Feature flags are part of the codebase, making them hard to understand
+outside of the development function.
+
+We should aim to make our feature flag based development accessible to
+any interested party:
+
+- developer / engineer
+ - can easily add a new feature flag, and configure its state
+ - can quickly find whom to contact if they touch another feature flag
+ - can quickly find stale feature flags
+- engineering manager
+ - can understand what feature flags their group manages
+- engineering manager and director
+ - can understand how much ~"technical debt" is incurred due to the number of feature flags that we have to manage
+ - can understand how many feature flags are added and removed in each release
+- product manager and documentation writer
+ - can understand what features are gated by what feature flags
+ - can understand if a feature, and thus its feature flag, is generally available on GitLab.com
+ - can understand if a feature, and thus its feature flag, is enabled by default for on-premise installations
+- delivery engineer
+ - can understand what feature flags are introduced and changed between subsequent deployments
+- support and reliability engineer
+ - can understand how feature flags changed between releases: which feature flags were enabled and which were removed
+ - can quickly find relevant information about a feature flag, including the individuals who might help with an ongoing support request or incident
+
+## Proposal
+
+To help with the above goals, we should aim to make our feature flag usage explicit
+and understood by all involved parties.
+
+Introduce a YAML-described `feature-flags/<name-of-feature.yml>` that would
+allow us to:
+
+1. Have a central place where all feature flags are documented
+1. Describe why the given feature flag was introduced
+1. Record which issue and merge request it was introduced by
+1. Build automated documentation with all feature flags in the codebase
+1. Track how many feature flags there are per given group
+1. Track how many feature flags are added and removed between releases
+1. Make this information easily accessible for all
+1. Allow our customers to easily discover how to enable features and quickly
+ find out what changed between different releases
+
+### The `YAML`
+
+```yaml
+---
+name: ci_disallow_to_create_merge_request_pipelines_in_target_project
+introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
+rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
+group: group::progressive delivery
+type: development
+default_enabled: false
+```
+
+## Reasons
+
+These are the reasons why these changes are needed:
+
+- we have around 500 different feature flags today
+- we have a hard time tracking their usage
+- we have ambiguous usage of feature flags with different `default_enabled:` and
+ different `actors` used
+- we lack a clear indication of who owns which feature flag and where to find
+ relevant information
+- we do not emphasise the need to create a feature flag rollout issue to
+ indicate that a feature flag is in fact ~"technical debt"
+- we don't know exactly what feature flags we have in our codebase
+- we don't know exactly how our feature flags are configured for different
+ environments: what is being used for `test`, what we ship for `on-premise`,
+ what our settings are for `staging`, `qa` and `production`
+
+## Iterations
+
+This work is being done as part of a dedicated epic: [Improve internal usage of
+Feature Flags](https://gitlab.com/groups/gitlab-org/-/epics/3551). This epic
+describes the meta reasons for making these changes.
+
+## Who
+
+Proposal:
+
+<!-- vale gitlab.Spelling = NO -->
+
+| Role | Who
+|------------------------------|-------------------------|
+| Author | Kamil Trzciński |
+| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
+| Engineering Leader | Kamil Trzciński |
+| Domain Expert | Shinya Maeda |
+
+DRIs:
+
+| Role | Who
+|------------------------------|------------------------|
+| Product | Kenny Johnston |
+| Leadership | Craig Gomes |
+| Engineering | Kamil Trzciński |
+
+<!-- vale gitlab.Spelling = YES -->
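Note on `feature_flags_development/index.md`: two of the proposal's goals, documenting each flag in a definition file and tracking how many flags each group owns, can be illustrated with the sketch below. It reads the example definition from "The `YAML`" section with a deliberately tiny `key: value` parser so that it needs no third-party YAML library. The function names `parse_definition` and `flags_per_group` are hypothetical, and the parser is an assumption that only handles flat definitions like the one shown, not YAML in general.

```python
from collections import Counter

# The example definition from the blueprint, copied verbatim.
EXAMPLE = """\
---
name: ci_disallow_to_create_merge_request_pipelines_in_target_project
introduced_by_url: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/40724
rollout_issue_url: https://gitlab.com/gitlab-org/gitlab/-/issues/235119
group: group::progressive delivery
type: development
default_enabled: false
"""


def parse_definition(text: str) -> dict:
    """Parse a flat `key: value` definition; this is not a general YAML parser."""
    definition = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "---" or line.startswith("#"):
            continue
        # Split on the first ": " only, so URLs and `group::name` values stay intact.
        key, _, value = line.partition(": ")
        definition[key] = {"true": True, "false": False}.get(value, value)
    return definition


def flags_per_group(definitions: list) -> Counter:
    """Count feature flags per owning group."""
    return Counter(d.get("group", "unknown") for d in definitions)
```

Once every flag has such a file, the same loop over all definitions could feed generated documentation or a per-release count of added and removed flags, which is what the proposal's tracking items describe.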