diff --git a/doc/development/uploads/background.md b/doc/development/uploads/background.md
new file mode 100644
index 00000000000..e68e4127b57
--- /dev/null
+++ b/doc/development/uploads/background.md
@@ -0,0 +1,154 @@
+---
+stage: none
+group: unassigned
+info: To determine the technical writer assigned to the Stage/Group associated with this page, see https://about.gitlab.com/handbook/engineering/ux/technical-writing/#assignments
+---
+
+# Uploads guide: Why GitLab uses custom upload logic
+
+This page is for developers who want to understand the history of GitLab uploads and the
+technical challenges associated with them.
+
+## Problem description
+
+GitLab and [GitLab Workhorse](https://gitlab.com/gitlab-org/gitlab-workhorse) use special rules for handling file uploads,
+because in an ordinary Rails application file uploads can become expensive as files grow in size.
+Rails often sacrifices performance to provide a better developer experience, including in how it handles
+`multipart/form-data` uploads. When such a request arrives at the application server of any Rack-based
+application, Rails included, several things happen:
+
+1. A [Rack middleware](https://github.com/rack/rack/blob/main/lib/rack/multipart.rb) intercepts the request and parses the request body.
+1. The middleware writes each file in the multipart request to a temporary directory on disk.
+1. A `params` hash is constructed with entries pointing to the respective files on disk.
+1. A Rails controller acts on the file contents.
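+
+To make the cost concrete, here is a minimal, hypothetical Rails controller
+action illustrating the last step. The controller name and the `file` field
+are assumptions for this example, not code from GitLab:
+
+```ruby
+# Hypothetical controller: by the time this action runs, Rack has already
+# buffered the entire uploaded file to a temporary location on disk.
+class UploadsController < ApplicationController
+  def create
+    uploaded = params[:file]   # an ActionDispatch::Http::UploadedFile
+    uploaded.original_filename # file name sent by the client
+    uploaded.tempfile.path     # temporary file written by the middleware
+
+    # The action only ever sees the on-disk copy; the full request body
+    # was written out before this code ran.
+    File.size(uploaded.tempfile.path)
+
+    head :ok
+  end
+end
+```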
+
+While this is convenient for developers, it is costly for the Ruby server process to buffer large files on disk.
+Because of Ruby's [global interpreter lock](https://en.wikipedia.org/wiki/Global_interpreter_lock),
+only one thread of execution per Ruby process can be on a CPU at any given time, so any CPU time
+spent buffering uploads is unavailable to other worker threads serving user requests.
+Buffering files to disk also means spending more time in I/O routines and mode switches, both of which are expensive operations.
+
+The following diagram shows how GitLab handled such a request before these optimizations were put in place.
+
+```mermaid
+graph TB
+ subgraph "load balancers"
+    LB(HA Proxy)
+ end
+
+ subgraph "Shared storage"
+ nfs(NFS)
+ end
+
+ subgraph "redis cluster"
+ r(persisted redis)
+ end
+ LB-- 1 -->Workhorse
+
+ subgraph "web or API fleet"
+ Workhorse-- 2 -->rails
+ end
+ rails-- "3 (write files)" -->nfs
+ rails-- "4 (schedule a job)" -->r
+
+ subgraph sidekiq
+ s(sidekiq)
+ end
+ s-- "5 (fetch a job)" -->r
+ s-- "6 (read files)" -->nfs
+```
+
+We went through two major iterations of our uploads architecture to address these problems:
+
+1. [Moving disk buffering to Workhorse.](#moving-disk-buffering-to-workhorse)
+1. [Uploading to Object Storage from Workhorse.](#moving-to-object-storage-and-direct-uploads)
+
+### Moving disk buffering to Workhorse
+
+To address the performance issues resulting from buffering files in Ruby, we moved this logic to Workhorse,
+the reverse proxy that fronts the GitLab Rails application.
+Workhorse is written in Go, and is much better suited to stream processing and I/O than Rails.
+
+There are two parts to this implementation:
+
+1. In Workhorse, a request handler detects `multipart/form-data` content in an incoming user request.
+   When it finds one, Workhorse hijacks the request body before it reaches Rails:
+   it writes all files to disk, rewrites the multipart form fields to point to the new locations, signs the
+   request, then forwards it to Rails.
+1. In Rails, a [custom multipart Rack middleware](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/middleware/multipart.rb)
+ identifies any signed multipart requests coming from Workhorse and prepares the `params` hash Rails
+ would expect, now pointing to the files cached by Workhorse. This makes it a drop-in replacement for `Rack::Multipart`.
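+
+In spirit, the Rails-side middleware works like the simplified sketch below. This is
+not the actual `Gitlab::Middleware::Multipart` code; the middleware class, header name,
+and metadata format are illustrative stand-ins for the real signed exchange:
+
+```ruby
+require 'json'
+require 'rack'
+
+# Simplified sketch: trust signed metadata from the proxy instead of
+# parsing the multipart body in Ruby.
+class SignedMultipartMiddleware
+  def initialize(app)
+    @app = app
+  end
+
+  def call(env)
+    request = Rack::Request.new(env)
+    metadata = verified_upload_metadata(request)
+
+    if metadata
+      # Point params at the files Workhorse already wrote to disk, so
+      # Rack::Multipart never parses the request body in Ruby.
+      metadata.each do |field, path|
+        request.update_param(field, ::File.open(path, 'rb'))
+      end
+    end
+
+    @app.call(env)
+  end
+
+  private
+
+  # Hypothetical stand-in for the signature verification the real
+  # middleware performs; returns { field_name => on_disk_path } or nil.
+  def verified_upload_metadata(request)
+    header = request.get_header('HTTP_X_SIGNED_UPLOAD_METADATA') # assumed name
+    header && JSON.parse(header)
+  end
+end
+```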
+
+The diagram below shows how GitLab handles such a request today:
+
+```mermaid
+graph TB
+ subgraph "load balancers"
+ LB(HA Proxy)
+ end
+
+ subgraph "Shared storage"
+ nfs(NFS)
+ end
+
+ subgraph "redis cluster"
+ r(persisted redis)
+ end
+ LB-- 1 -->Workhorse
+
+ subgraph "web or API fleet"
+ Workhorse-- "3 (without files)" -->rails
+ end
+ Workhorse -- "2 (write files)" -->nfs
+ rails-- "4 (schedule a job)" -->r
+
+ subgraph sidekiq
+ s(sidekiq)
+ end
+ s-- "5 (fetch a job)" -->r
+ s-- "6 (read files)" -->nfs
+```
+
+While this "one-size-fits-all" solution greatly improves performance for multipart uploads without compromising
+developer ergonomics, it severely limits the [availability](#availability-challenges)
+and [scalability](#scalability-challenges) of GitLab.
+
+#### Availability challenges
+
+Moving file buffering to Workhorse addresses the immediate performance problems that stem from Ruby's
+poor handling of large file uploads. However, a remaining issue with this solution is its reliance on attached storage,
+whether ordinary hard drives or network-attached storage such as NFS.
+NFS is a [single point of failure](https://en.wikipedia.org/wiki/Single_point_of_failure), and is unsuitable for
+deploying GitLab in highly available, cloud native environments.
+
+#### Scalability challenges
+
+NFS is not a part of cloud native installations, such as those running in Kubernetes.
+In Kubernetes, machine boundaries translate to pods, and without network-attached storage, disk-buffered uploads
+must be written directly to the pod's file system.
+
+Using disk buffering presents us with a scalability challenge. If Workhorse can only
+write files to a pod's private file system, then these files are inaccessible outside of that particular pod.
+With disk buffering, a Rails controller accepts a file upload and enqueues a Sidekiq background
+job to move the file to its final destination, so Sidekiq requires access to these files.
+However, in a cloud native environment all Sidekiq instances run on separate pods, so they
+cannot access files buffered to disk on a web server pod.
+
+Therefore, all features that involve Sidekiq uploading disk-buffered files severely limit the scalability of GitLab.
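+
+A minimal, hypothetical sketch shows the failure mode. The controller, worker, and
+route below are illustrations, not GitLab code:
+
+```ruby
+# The problematic pattern: the job carries only a local file path,
+# which is meaningless on any other pod.
+class AttachmentsController < ApplicationController
+  def create
+    path = params[:file].tempfile.path # exists only on *this* web pod
+    StoreAttachmentWorker.perform_async(path)
+    head :accepted
+  end
+end
+
+class StoreAttachmentWorker
+  include Sidekiq::Worker
+
+  def perform(path)
+    # On a separate Sidekiq pod this raises Errno::ENOENT: the path
+    # points into the web pod's private file system.
+    File.open(path, 'rb') { |f| f.read } # then upload or process the data
+  end
+end
+```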
+
+## Moving to object storage and direct uploads
+
+To address these availability and scalability problems,
+we added support for uploading files directly
+from Workhorse to a given destination, instead of buffering them to disk. While it remains possible to upload to local
+or network-attached storage this way, for scalability you should use a highly available
+[object store](https://en.wikipedia.org/wiki/Object_storage)
+such as AWS S3, Google Cloud Storage, or Azure Blob Storage.
+
+With direct uploads, Workhorse does not buffer files to disk. Instead, it first authorizes the request with
+the Rails application to find out where to upload it, then streams the file directly to its ultimate destination.
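+
+The shape of that exchange can be pictured with the hypothetical Rails endpoint below.
+The controller, response fields, and `presigned_put_url` helper are illustrative
+assumptions; GitLab's real internal Workhorse API differs:
+
+```ruby
+require 'securerandom'
+
+# Hypothetical authorization endpoint: tells Workhorse where to stream
+# the upload, so the file never touches Rails or the local disk.
+class UploadAuthorizationsController < ApplicationController
+  def create
+    key = SecureRandom.uuid
+
+    render json: {
+      upload_id: key, # illustrative field names, not Workhorse's real API
+      store_url: presigned_put_url(bucket: 'uploads', key: key)
+    }
+  end
+
+  private
+
+  # Stand-in helper. A real implementation would ask the object store SDK
+  # for a time-limited URL, for example with
+  # Aws::S3::Presigner#presigned_url(:put_object, bucket:, key:).
+  def presigned_put_url(bucket:, key:)
+    "https://#{bucket}.example-object-store.test/#{key}?signature=placeholder"
+  end
+end
+```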
+
+To learn more about how disk buffering and direct uploads are implemented, see:
+
+- [How uploads work technically](implementation.md)
+- [Adding new uploads](working_with_uploads.md)