---
comments: false
description: 'Next iteration of build logs architecture at GitLab'
---

# Cloud Native Build Logs

GitLab has recognized cloud native and the adoption of Kubernetes as one of
the top two tailwinds helping us grow faster as the company behind the
project.

This effort is described in more detail [in the infrastructure team
handbook](https://about.gitlab.com/handbook/engineering/infrastructure/production/kubernetes/gitlab-com/).

## Traditional build logs

Traditional job logs depend heavily on the availability of local shared
storage.

Every time a GitLab Runner sends a new partial build output, we write this
output to a file on disk. This is simple, but the mechanism depends on shared
local storage: the same file needs to be available on every GitLab web node,
because GitLab Runner might connect to a different one each time it performs
an API request. Sidekiq also needs access to the file, because when a job
completes, the trace file contents are sent to the object store.

## New architecture

The new architecture writes data to Redis instead of writing build logs to a
file.

To make this performant and resilient enough, we implemented a chunked I/O
mechanism: we store data in Redis in chunks, and migrate a chunk to an object
store once it reaches the desired size.

A simplified sequence diagram is available below.
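The chunked I/O idea can be illustrated with a short, simplified sketch. This
is not GitLab's actual implementation: plain Python dictionaries stand in for
Redis and the object store, and the class name and chunk size are illustrative
assumptions.

```python
# Illustrative sketch of chunked trace I/O: partial output is appended to
# fixed-size chunks in a fast store (Redis in the real architecture), and a
# chunk is migrated to the object store as soon as it becomes full.

CHUNK_SIZE = 128 * 1024  # hypothetical chunk size; the real value may differ


class ChunkedTrace:
    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.redis = {}         # chunk index -> bytes (stands in for Redis)
        self.object_store = {}  # chunk index -> bytes (stands in for S3/GCS)

    def append(self, offset, data):
        """Append partial trace output sequentially, splitting it across chunks."""
        while data:
            index, pos = divmod(offset, self.chunk_size)
            room = self.chunk_size - pos
            piece, data = data[:room], data[room:]
            self.redis[index] = self.redis.get(index, b"") + piece
            offset += len(piece)
            if len(self.redis[index]) == self.chunk_size:
                self._migrate(index)  # chunk is full: move it to the object store
        return offset

    def _migrate(self, index):
        self.object_store[index] = self.redis.pop(index)

    def read(self):
        """Reassemble the full trace from both stores, in chunk order."""
        chunks = {**self.object_store, **self.redis}
        return b"".join(chunks[index] for index in sorted(chunks))
```

With a tiny chunk size, appending `b"hello world"` fills and migrates the
first chunks while the partial tail chunk stays in the fast store, which is
the behavior the sequence diagram below describes in detail.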
```mermaid
sequenceDiagram
    autonumber
    participant U as User
    participant R as Runner
    participant G as GitLab (rails)
    participant I as Redis
    participant D as Database
    participant O as Object store

    loop incremental trace update sent by a runner
        Note right of R: Runner appends a build trace
        R->>+G: PATCH trace [build.id, offset, data]
        G->>+D: find or create chunk [chunk.index]
        D-->>-G: chunk [id, index]
        G->>I: append chunk data [chunk.index, data]
        G-->>-R: 200 OK
    end

    Note right of R: User retrieves a trace
    U->>+G: GET build trace
    loop every trace chunk
        G->>+D: find chunk [index]
        D-->>-G: chunk [id]
        G->>+I: read chunk data [chunk.index]
        I-->>-G: chunk data [data, size]
    end
    G-->>-U: build trace

    Note right of R: Trace chunk is full
    R->>+G: PATCH trace [build.id, offset, data]
    G->>+D: find or create chunk [chunk.index]
    D-->>-G: chunk [id, index]
    G->>I: append chunk data [chunk.index, data]
    G->>G: chunk full [index]
    G-->>-R: 200 OK
    G->>+I: read chunk data [chunk.index]
    I-->>-G: chunk data [data, size]
    G->>O: send chunk data [data, size]
    G->>+D: update data store type [chunk.id]
    G->>+I: delete chunk data [chunk.index]
```

## NFS coupling

In 2017, we experienced serious problems scaling our NFS infrastructure. We
even tried to replace NFS with
[CephFS](https://docs.ceph.com/docs/master/cephfs/), unsuccessfully.

Since then it has become apparent that the cost of operating and maintaining
an NFS cluster is significant, and that if we ever decide to migrate to
Kubernetes [we need to decouple GitLab from shared local storage and
NFS](https://gitlab.com/gitlab-org/gitlab-pages/-/issues/426#note_375646396):

1. NFS might be a single point of failure.
1. NFS can only be reliably scaled vertically.
1. Moving to Kubernetes means increasing the number of mount points by an
   order of magnitude.
1. NFS depends on an extremely reliable network, which can be difficult to
   provide in a Kubernetes environment.
1. Storing customer data on NFS involves additional security risks.

Moving GitLab to Kubernetes without decoupling it from NFS would result in an
explosion of complexity, a higher maintenance cost, and an enormous negative
impact on availability.

## Iterations

1. ✓ Implement the new architecture in a way that does not depend on shared local storage
1. ✓ Evaluate performance and edge cases, iterate to improve the new architecture
1. ✓ Design cloud native build logs correctness verification mechanisms
1. ✓ Build observability mechanisms around performance and correctness
1. Roll the feature out to the production environment incrementally

The work needed to make the new architecture production-ready and enabled on
GitLab.com is tracked in the [Cloud Native Build Logs on
GitLab.com](https://gitlab.com/groups/gitlab-org/-/epics/4275) epic.

Enabling this feature on GitLab.com is a subtask of [making the new
architecture generally
available](https://gitlab.com/groups/gitlab-org/-/epics/3791) for everyone.

## Who

Proposal:

| Role                         | Who                     |
|------------------------------|-------------------------|
| Author                       | Grzegorz Bizon          |
| Architecture Evolution Coach | Gerardo Lopez-Fernandez |
| Engineering Leader           | Darby Frey              |
| Domain Expert                | Kamil Trzciński         |
| Domain Expert                | Sean McGivern           |

DRIs:

| Role        | Who            |
|-------------|----------------|
| Product     | Jason Yavorska |
| Leadership  | Darby Frey     |
| Engineering | Grzegorz Bizon |