Diffstat (limited to 'doc/administration/high_availability/README.md')
-rw-r--r-- doc/administration/high_availability/README.md | 127
1 files changed, 93 insertions, 34 deletions
diff --git a/doc/administration/high_availability/README.md b/doc/administration/high_availability/README.md
index 7f0b4056acc..d411fb7f20f 100644
--- a/doc/administration/high_availability/README.md
+++ b/doc/administration/high_availability/README.md
@@ -4,18 +4,19 @@ type: reference, concepts
# Scaling and High Availability
-GitLab supports several different types of clustering and high-availability.
+GitLab supports a number of options for scaling your self-managed instance and configuring high availability (HA).
The solution you choose will be based on the level of scalability and
availability you require. The easiest solutions are scalable, but not necessarily
highly available.
-GitLab provides a service that is usually essential to most organizations: it
+GitLab provides a service that is essential to most organizations: it
enables people to collaborate on code in a timely fashion. Any downtime should
-therefore be short and planned. Luckily, GitLab provides a solid setup even on
-a single server without special measures. Due to the distributed nature
-of Git, developers can still commit code locally even when GitLab is not
+therefore be short and planned. Due to the distributed nature
+of Git, developers can continue to commit code locally even when GitLab is not
available. However, some GitLab features such as the issue tracker and
-Continuous Integration are not available when GitLab is down.
+continuous integration are not available when GitLab is down.
+If you require all GitLab functionality to be highly available,
+consider the options outlined below.
**Keep in mind that all highly-available solutions come with a trade-off between
cost/complexity and uptime**. The more uptime you want, the more complex the
@@ -25,8 +26,8 @@ solution should balance the costs against the benefits.
There are many options when choosing a highly-available GitLab architecture. We
recommend engaging with GitLab Support to choose the best architecture for your
-use case. This page contains some various options and guidelines based on
-experience with GitLab.com and Enterprise Edition on-premises customers.
+use case. This page contains recommendations based on
+experience with GitLab.com and internal scale testing.
For detailed insight into how GitLab scales and configures GitLab.com, you can
watch [this 1 hour Q&A](https://www.youtube.com/watch?v=uCU8jdYzpac)
@@ -82,12 +83,13 @@ Complete the following installation steps in order. A link at the end of each
section will bring you back to the Scalable Architecture Examples section so
you can continue with the next step.
-1. [PostgreSQL](database.md#postgresql-in-a-scaled-environment) with [PGBouncer](https://docs.gitlab.com/ee/administration/high_availability/pgbouncer.html)
+1. [Load Balancer(s)](load_balancer.md)[^2]
+1. [Consul](consul.md)
+1. [PostgreSQL](database.md#postgresql-in-a-scaled-environment) with [PgBouncer](https://docs.gitlab.com/ee/administration/high_availability/pgbouncer.html)
1. [Redis](redis.md#redis-in-a-scaled-environment)
1. [Gitaly](gitaly.md) (recommended) and / or [NFS](nfs.md)[^4]
1. [GitLab application nodes](gitlab.md)
- With [Object Storage service enabled](../gitaly/index.md#eliminating-nfs-altogether)[^3]
-1. [Load Balancer(s)](load_balancer.md)[^2]
1. [Monitoring node (Prometheus and Grafana)](monitoring_node.md)
### Full Scaling
@@ -201,13 +203,14 @@ with the added complexity of many more nodes to configure, manage, and monitor.
## Reference Architecture Examples
The Support and Quality teams build, performance test, and validate Reference
-Architectures that support set large numbers of users. The specifications below are a
-representation of this work so far and may be adjusted in the future based on
+Architectures that support large numbers of users. The specifications below are
+a representation of this work so far and may be adjusted in the future based on
additional testing and iteration.
-The architectures have been tested with specific coded workloads. The throughputs
-used for testing are calculated based on sample customer data. We test each endpoint
-type with the following number of requests per second (RPS) per 1000 users:
+The architectures have been tested with specific coded workloads, and the
+throughputs used for testing were calculated based on sample customer data. We
+test each endpoint type with the following number of requests per second (RPS)
+per 1000 users:
- API: 20 RPS
- Web: 2 RPS
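The per-1,000-user rates above appear to scale linearly with user count: each configuration's "Test RPS Rates" in this diff is the base rate multiplied by the user count in thousands. A minimal Python sketch under that assumption follows. The name `target_rps` is hypothetical, and the Git rate of 2 RPS per 1,000 users is inferred from the listed configurations (for example, 4 RPS at 2,000 users) rather than shown in this hunk.

```python
# Test throughput per 1,000 users for each endpoint type.
# API and Web come from the list above; the Git rate is an assumption
# inferred from the per-configuration "Test RPS Rates" entries.
RPS_PER_1000_USERS = {"API": 20, "Web": 2, "Git": 2}


def target_rps(users: int) -> dict[str, float]:
    """Scale the per-1,000-user rates linearly to the given user count."""
    return {
        endpoint: rate * users / 1000
        for endpoint, rate in RPS_PER_1000_USERS.items()
    }


if __name__ == "__main__":
    for users in (2_000, 5_000, 10_000, 25_000, 50_000):
        print(users, target_rps(users))
```

Real workloads may need more than these targets, as the surrounding text notes.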
@@ -217,13 +220,74 @@ Note that your exact needs may be more, depending on your workload. Your
workload is influenced by factors such as - but not limited to - how active your
users are, how much automation you use, mirroring, and repo/change size.
+### 2,000 User Configuration
+
+- **Supported Users (approximate):** 2,000
+- **Test RPS Rates:** API: 40 RPS, Web: 4 RPS, Git: 4 RPS
+- **Status:** Work-in-progress
+- **Known Issues:** For the latest list of known performance issues, see
+[this issue list](https://gitlab.com/gitlab-org/gitlab/issues?label_name%5B%5D=Quality%3Aperformance-issues).
+
+NOTE: **Note:** This architecture is a work in progress. The
+Quality team will be certifying this environment in late 2019 or early 2020. The specifications
+may be adjusted prior to certification based on performance testing.
+
+| Service | Nodes | Configuration | GCP type |
+| ----------------------------|-------|-----------------------|---------------|
+| GitLab Rails <br> - Puma workers on each node set to 90% of available CPUs with 8 threads | 3 | 8 vCPU, 7.2GB Memory | n1-highcpu-8 |
+| PostgreSQL | 3 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| PgBouncer | 3 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| Gitaly <br> - Gitaly Ruby workers on each node set to 20% of available CPUs | X[^1] . | 4 vCPU, 15GB Memory | n1-standard-4 |
+| Redis Cache + Sentinel <br> - Cache maxmemory set to 90% of available memory | 3 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| Redis Persistent + Sentinel | 3 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| Sidekiq | 4 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| Consul | 3 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| NFS Server[^4] . | 1 | 4 vCPU, 3.6GB Memory | n1-highcpu-4 |
+| S3 Object Storage[^3] . | - | - | - |
+| Monitoring node | 1 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| External load balancing node[^2] . | 1 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| Internal load balancing node[^2] . | 1 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+
+NOTE: **Note:** Memory values are given directly by GCP machine sizes. On different cloud
+vendors a best effort like for like can be used.
+
+### 5,000 User Configuration
+
+- **Supported Users (approximate):** 5,000
+- **Test RPS Rates:** API: 100 RPS, Web: 10 RPS, Git: 10 RPS
+- **Status:** Work-in-progress
+- **Known Issues:** For the latest list of known performance issues, see
+[this issue list](https://gitlab.com/gitlab-org/gitlab/issues?label_name%5B%5D=Quality%3Aperformance-issues).
+
+NOTE: **Note:** This architecture is a work in progress. The
+Quality team will be certifying this environment in late 2019 or early 2020. The specifications
+may be adjusted prior to certification based on performance testing.
+
+| Service | Nodes | Configuration | GCP type |
+| ----------------------------|-------|-----------------------|---------------|
+| GitLab Rails <br> - Puma workers on each node set to 90% of available CPUs with 16 threads | 3 | 16 vCPU, 14.4GB Memory | n1-highcpu-16 |
+| PostgreSQL | 3 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| PgBouncer | 3 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| Gitaly <br> - Gitaly Ruby workers on each node set to 20% of available CPUs | X[^1] . | 8 vCPU, 30GB Memory | n1-standard-8 |
+| Redis Cache + Sentinel <br> - Cache maxmemory set to 90% of available memory | 3 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| Redis Persistent + Sentinel | 3 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| Sidekiq | 4 | 2 vCPU, 7.5GB Memory | n1-standard-2 |
+| Consul | 3 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| NFS Server[^4] . | 1 | 4 vCPU, 3.6GB Memory | n1-highcpu-4 |
+| S3 Object Storage[^3] . | - | - | - |
+| Monitoring node | 1 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| External load balancing node[^2] . | 1 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+| Internal load balancing node[^2] . | 1 | 2 vCPU, 1.8GB Memory | n1-highcpu-2 |
+
+NOTE: **Note:** Memory values are given directly by GCP machine sizes. On different cloud
+vendors a best effort like for like can be used.
+
### 10,000 User Configuration
- **Supported Users (approximate):** 10,000
- **Test RPS Rates:** API: 200 RPS, Web: 20 RPS, Git: 20 RPS
-- **Known Issues:** While validating the reference architecture, slow API endpoints
- were discovered. For details, see the related issues list in
- [this issue](https://gitlab.com/gitlab-org/gitlab-foss/issues/64335).
+- **Known Issues:** For the latest list of known performance issues, see
+[this issue list](https://gitlab.com/gitlab-org/gitlab/issues?label_name%5B%5D=Quality%3Aperformance-issues).
| Service | Nodes | Configuration | GCP type |
| ----------------------------|-------|-----------------------|---------------|
@@ -248,10 +312,8 @@ vendors a best effort like for like can be used.
- **Supported Users (approximate):** 25,000
- **Test RPS Rates:** API: 500 RPS, Web: 50 RPS, Git: 50 RPS
-- **Known Issues:** The slow API endpoints that were discovered during testing
- the 10,000 user architecture also affect the 25,000 user architecture. For
- details, see the related issues list in
- [this issue](https://gitlab.com/gitlab-org/gitlab-foss/issues/64335).
+- **Known Issues:** For the latest list of known performance issues, see
+[this issue list](https://gitlab.com/gitlab-org/gitlab/issues?label_name%5B%5D=Quality%3Aperformance-issues).
| Service | Nodes | Configuration | GCP type |
| ----------------------------|-------|-----------------------|---------------|
@@ -276,12 +338,8 @@ vendors a best effort like for like can be used.
- **Supported Users (approximate):** 50,000
- **Test RPS Rates:** API: 1000 RPS, Web: 100 RPS, Git: 100 RPS
-- **Status:** Work-in-progress
-- **Related Issue:** See the [related issue](https://gitlab.com/gitlab-org/quality/performance/issues/66) for more information.
-
-NOTE: **Note:** This architecture is a work-in-progress of the work so far. The
-Quality team will be certifying this environment in late 2019. The specifications
-may be adjusted prior to certification based on performance testing.
+- **Known Issues:** For the latest list of known performance issues, see
+[this issue list](https://gitlab.com/gitlab-org/gitlab/issues?label_name%5B%5D=Quality%3Aperformance-issues).
| Service | Nodes | Configuration | GCP type |
| ----------------------------|-------|-----------------------|---------------|
@@ -302,15 +360,16 @@ may be adjusted prior to certification based on performance testing.
NOTE: **Note:** Memory values are given directly by GCP machine sizes. On different cloud
vendors a best effort like for like can be used.
-[^1]: Gitaly node requirements are dependent on customer data. We recommend 2
- nodes as an absolute minimum for performance at the 10,000 and 25,000 user
- scale and 4 nodes as an absolute minimum at the 50,000 user scale, but
- additional nodes should be considered in conjunction with a review of
- project counts and sizes.
+[^1]: Gitaly node requirements depend on customer data, specifically the number of
+    projects and their sizes. We recommend 2 nodes as an absolute minimum for HA environments
+    and at least 4 nodes when supporting 50,000 or more users.
+    We recommend that each Gitaly node store no more than 5TB of data.
+    Consider additional nodes in conjunction with a review of expected
+    data size and spread, based on the recommendations above.
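One way to read the guidance in footnote 1 is as the larger of two numbers: a node-count floor set by user scale, and a count driven by the 5TB-per-node limit. A rough Python sketch under that reading follows. `min_gitaly_nodes` is a hypothetical helper, and it assumes repository data can be spread evenly across nodes, which the footnote does not guarantee.

```python
import math

MAX_TB_PER_NODE = 5          # recommended upper bound of data per Gitaly node
LARGE_SCALE_USERS = 50_000   # user count at which the 4-node floor applies


def min_gitaly_nodes(users: int, repo_data_tb: float) -> int:
    """Return a lower bound on Gitaly nodes for an HA environment."""
    # Floor from the footnote: 2 nodes for HA, 4 at 50,000 or more users.
    floor = 4 if users >= LARGE_SCALE_USERS else 2
    # Nodes needed to keep each node at or below 5TB (even spread assumed).
    by_storage = math.ceil(repo_data_tb / MAX_TB_PER_NODE)
    return max(floor, by_storage)


if __name__ == "__main__":
    print(min_gitaly_nodes(10_000, 12))
```

The result is only a starting point; project counts and sizes should still be reviewed, as the footnote advises.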
[^2]: Our architectures have been tested and validated with [HAProxy](https://www.haproxy.org/)
as the load balancer. However other reputable load balancers with similar feature sets
- should also work here but be aware these aren't validated.
+    should also work, but be aware that these haven't been validated.
[^3]: For data objects such as LFS, Uploads, Artifacts, etc., we recommend an S3 Object Storage
where possible over NFS due to better performance and availability. Several types of objects