Move operations/ to new locationdocs/refactor-operations

[ci skip]
author: Achilleas Pipinellis <axilleas@axilleas.me> 2016-09-25 12:44:09 +0200
committer: Achilleas Pipinellis <axilleas@axilleas.me> 2016-10-11 16:13:04 +0200
commit: d8f33c0a51d9106ece6cd4bae469e40734e05f85 (patch)
tree: f8168d158bb85d303e6dcab953ebf6a26c9dfdd2 /doc/administration/operations
parent: 6602b917f7ffcb8c8e9134ee156ac49c24ed2a9b (diff)
download: gitlab-ce-d8f33c0a51d9106ece6cd4bae469e40734e05f85.tar.gz
4 files changed, 358 insertions, 0 deletions
diff --git a/doc/administration/operations/cleaning_up_redis_sessions.md b/doc/administration/operations/cleaning_up_redis_sessions.md
new file mode 100644
index 00000000000..93521e976d5
--- /dev/null
+++ b/doc/administration/operations/cleaning_up_redis_sessions.md
@@ -0,0 +1,52 @@
+# Cleaning up stale Redis sessions
+
+Since version 6.2, GitLab stores web user sessions as key-value pairs in Redis.
+Prior to GitLab 7.3, user sessions did not automatically expire from Redis. If
+you have been running a large GitLab server (thousands of users) since before
+GitLab 7.3 we recommend cleaning up stale sessions to compact the Redis
+database after you upgrade to GitLab 7.3. You can also perform a cleanup while
+still running GitLab 7.2 or older, but in that case new stale sessions will
+start building up again after you clean up.
+
+In GitLab versions prior to 7.3.0, the session keys in Redis are 16-byte
+hexadecimal values such as '976aa289e2189b17d7ef525a6702ace9'. Starting with
+GitLab 7.3.0, the keys are
+prefixed with 'session:gitlab:', so they would look like
+'session:gitlab:976aa289e2189b17d7ef525a6702ace9'. Below we describe how to
+remove the keys in the old format.
+
+First we define a shell function with the proper Redis connection details.
+
+```
+rcli() {
+  # This example works for Omnibus installations of GitLab 7.3 or newer. For an
+  # installation from source you will have to change the socket path and the
+  # path to redis-cli.
+  sudo /opt/gitlab/embedded/bin/redis-cli -s /var/opt/gitlab/redis/redis.socket "$@"
+}
+
+# test the new shell function; the response should be PONG
+rcli ping
+```
+
+Now we do a search to see if there are any session keys in the old format for
+us to clean up.
+
+```
+# returns the number of old-format session keys in Redis
+rcli keys '*' | grep '^[a-f0-9]\{32\}$' | wc -l
+```
+
+If the number is larger than zero, you can proceed to expire the keys from
+Redis. If the number is zero there is nothing to clean up.
+
+```
+# Tell Redis to expire each matched key after 600 seconds.
+rcli keys '*' | grep '^[a-f0-9]\{32\}$' | awk '{ print "expire", $0, 600 }' | rcli
+# This will print '(integer) 1' for each key that gets expired.
+```
+
+Over the next 15 minutes (10 minutes expiry time plus 5 minutes Redis
+background save interval) your Redis database will be compacted. If you are
+still using GitLab 7.2, users who are not clicking around in GitLab during the
+10 minute expiry window will be signed out of GitLab.
diff --git a/doc/administration/operations/moving_repositories.md b/doc/administration/operations/moving_repositories.md
new file mode 100644
index 00000000000..54adb99386a
--- /dev/null
+++ b/doc/administration/operations/moving_repositories.md
@@ -0,0 +1,180 @@
+# Moving repositories managed by GitLab
+
+Sometimes you need to move all repositories managed by GitLab to
+another filesystem or another server. In this document we will look
+at some of the ways you can copy all your repositories from
+`/var/opt/gitlab/git-data/repositories` to `/mnt/gitlab/repositories`.
+
+We will look at three scenarios: the target directory is empty, the
+target directory contains an outdated copy of the repositories, and
+how to deal with thousands of repositories.
+
+**Each of the approaches we list can/will overwrite data in the
+target directory `/mnt/gitlab/repositories`. Do not mix up the
+source and the target.**
+
+## Target directory is empty: use a tar pipe
+
+If the target directory `/mnt/gitlab/repositories` is empty the
+simplest thing to do is to use a tar pipe.  This method has low
+overhead and tar is almost always already installed on your system.
+However, it is not possible to resume an interrupted tar pipe:  if
+that happens then all data must be copied again.
+
+```
+# As the git user
+tar -C /var/opt/gitlab/git-data/repositories -cf - -- . |\
+  tar -C /mnt/gitlab/repositories -xf -
+```
+
+If you want to see progress, replace `-xf` with `-xvf`.
+
+### Tar pipe to another server
+
+You can also use a tar pipe to copy data to another server. If your
+'git' user has SSH access to the newserver as 'git@newserver', you
+can pipe the data through SSH.
+
+```
+# As the git user
+tar -C /var/opt/gitlab/git-data/repositories -cf - -- . |\
+  ssh git@newserver tar -C /mnt/gitlab/repositories -xf -
+```
+
+If you want to compress the data before it goes over the network
+(which will cost you CPU cycles) you can replace `ssh` with `ssh -C`.
+
+## The target directory contains an outdated copy of the repositories: use rsync
+
+If the target directory already contains a partial / outdated copy
+of the repositories it may be wasteful to copy all the data again
+with tar. In this scenario it is better to use rsync. This utility
+is either already installed on your system or easily installable
+via apt, yum etc.
+
+```
+# As the 'git' user
+rsync -a --delete /var/opt/gitlab/git-data/repositories/. \
+  /mnt/gitlab/repositories
+```
+
+The `/.` in the command above is very important, without it you can
+easily get the wrong directory structure in the target directory.
+If you want to see progress, replace `-a` with `-av`.
+
+### Single rsync to another server
+
+If the 'git' user on your source system has SSH access to the target
+server you can send the repositories over the network with rsync.
+
+```
+# As the 'git' user
+rsync -a --delete /var/opt/gitlab/git-data/repositories/. \
+  git@newserver:/mnt/gitlab/repositories
+```
+
+## Thousands of Git repositories: use one rsync per repository
+
+Every time you start an rsync job it has to inspect all files in
+the source directory, all files in the target directory, and then
+decide what files to copy or not. If the source or target directory
+has many contents this startup phase of rsync can become a burden
+for your GitLab server. In cases like this you can make rsync's
+life easier by dividing its work in smaller pieces, and sync one
+repository at a time.
+
+In addition to rsync we will use [GNU
+Parallel](http://www.gnu.org/software/parallel/). This utility is
+not included in GitLab so you need to install it yourself with apt
+or yum.  Also note that the GitLab scripts we used below were added
+in GitLab 8.1.
+
+** This process does not clean up repositories at the target location that no
+longer exist at the source. ** If you start using your GitLab instance with
+`/mnt/gitlab/repositories`, you need to run `gitlab-rake gitlab:cleanup:repos`
+after switching to the new repository storage directory.
+
+### Parallel rsync for all repositories known to GitLab
+
+This will sync repositories with 10 rsync processes at a time. We keep
+track of progress so that the transfer can be restarted if necessary.
+
+First we create a new directory, owned by 'git', to hold transfer
+logs. We assume the directory is empty before we start the transfer
+procedure, and that we are the only ones writing files in it.
+
+```
+# Omnibus
+sudo mkdir /var/opt/gitlab/transfer-logs
+sudo chown git:git /var/opt/gitlab/transfer-logs
+
+# Source
+sudo -u git -H mkdir /home/git/transfer-logs
+```
+
+We seed the process with a list of the directories we want to copy.
+
+```
+# Omnibus
+sudo -u git sh -c 'gitlab-rake gitlab:list_repos > /var/opt/gitlab/transfer-logs/all-repos-$(date +%s).txt'
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H sh -c 'bundle exec rake gitlab:list_repos > /home/git/transfer-logs/all-repos-$(date +%s).txt'
+```
+
+Now we can start the transfer. The command below is idempotent, and
+the number of jobs done by GNU Parallel should converge to zero. If it
+does not some repositories listed in all-repos-1234.txt may have been
+deleted/renamed before they could be copied.
+
+```
+# Omnibus
+sudo -u git sh -c '
+cat /var/opt/gitlab/transfer-logs/* | sort | uniq -u |\
+  /usr/bin/env JOBS=10 \
+  /opt/gitlab/embedded/service/gitlab-rails/bin/parallel-rsync-repos \
+    /var/opt/gitlab/transfer-logs/success-$(date +%s).log \
+    /var/opt/gitlab/git-data/repositories \
+    /mnt/gitlab/repositories
+'
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H sh -c '
+cat /home/git/transfer-logs/* | sort | uniq -u |\
+  /usr/bin/env JOBS=10 \
+  bin/parallel-rsync-repos \
+    /home/git/transfer-logs/success-$(date +%s).log \
+    /home/git/repositories \
+    /mnt/gitlab/repositories
+`
+```
+
+### Parallel rsync only for repositories with recent activity
+
+Suppose you have already done one sync that started after 2015-10-1 12:00 UTC.
+Then you might only want to sync repositories that were changed via GitLab
+_after_ that time. You can use the 'SINCE' variable to tell 'rake
+gitlab:list_repos' to only print repositories with recent activity.
+
+```
+# Omnibus
+sudo gitlab-rake gitlab:list_repos SINCE='2015-10-1 12:00 UTC' |\
+  sudo -u git \
+  /usr/bin/env JOBS=10 \
+  /opt/gitlab/embedded/service/gitlab-rails/bin/parallel-rsync-repos \
+    success-$(date +%s).log \
+    /var/opt/gitlab/git-data/repositories \
+    /mnt/gitlab/repositories
+
+# Source
+cd /home/git/gitlab
+sudo -u git -H bundle exec rake gitlab:list_repos SINCE='2015-10-1 12:00 UTC' |\
+  sudo -u git -H \
+  /usr/bin/env JOBS=10 \
+  bin/parallel-rsync-repos \
+    success-$(date +%s).log \
+    /home/git/repositories \
+    /mnt/gitlab/repositories
+```
diff --git a/doc/administration/operations/sidekiq_memory_killer.md b/doc/administration/operations/sidekiq_memory_killer.md
new file mode 100644
index 00000000000..b5e78348989
--- /dev/null
+++ b/doc/administration/operations/sidekiq_memory_killer.md
@@ -0,0 +1,40 @@
+# Sidekiq MemoryKiller
+
+The GitLab Rails application code suffers from memory leaks. For web requests
+this problem is made manageable using
+[unicorn-worker-killer](https://github.com/kzk/unicorn-worker-killer) which
+restarts Unicorn worker processes in between requests when needed. The Sidekiq
+MemoryKiller applies the same approach to the Sidekiq processes used by GitLab
+to process background jobs.
+
+Unlike unicorn-worker-killer, which is enabled by default for all GitLab
+installations since GitLab 6.4, the Sidekiq MemoryKiller is enabled by default
+_only_ for Omnibus packages. The reason for this is that the MemoryKiller
+relies on Runit to restart Sidekiq after a memory-induced shutdown and GitLab
+installations from source do not all use Runit or an equivalent.
+
+With the default settings, the MemoryKiller will cause a Sidekiq restart no
+more often than once every 15 minutes, with the restart causing about one
+minute of delay for incoming background jobs.
+
+## Configuring the MemoryKiller
+
+The MemoryKiller is controlled using environment variables.
+
+- `SIDEKIQ_MEMORY_KILLER_MAX_RSS`: if this variable is set, and its value is
+  greater than 0, then after each Sidekiq job, the MemoryKiller will check the
+  RSS of the Sidekiq process that executed the job. If the RSS of the Sidekiq
+  process (expressed in kilobytes) exceeds SIDEKIQ_MEMORY_KILLER_MAX_RSS, a
+  delayed shutdown is triggered. The default value for Omnibus packages is set
+  [in the omnibus-gitlab
+  repository](https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/files/gitlab-cookbooks/gitlab/attributes/default.rb).
+- `SIDEKIQ_MEMORY_KILLER_GRACE_TIME`: defaults 900 seconds (15 minutes). When
+  a shutdown is triggered, the Sidekiq process will keep working normally for
+  another 15 minutes.
+- `SIDEKIQ_MEMORY_KILLER_SHUTDOWN_WAIT`: defaults to 30 seconds. When the grace
+  time has expired, the MemoryKiller tells Sidekiq to stop accepting new jobs.
+  Existing jobs get 30 seconds to finish. After that, the MemoryKiller tells
+  Sidekiq to shut down, and an external supervision mechanism (e.g. Runit) must
+  restart Sidekiq.
+- `SIDEKIQ_MEMORY_KILLER_SHUTDOWN_SIGNAL`: defaults to `SIGKILL`. The name of
+  the final signal sent to the Sidekiq process when we want it to shut down.
diff --git a/doc/administration/operations/unicorn.md b/doc/administration/operations/unicorn.md
new file mode 100644
index 00000000000..bad61151bda
--- /dev/null
+++ b/doc/administration/operations/unicorn.md
@@ -0,0 +1,86 @@
+# Understanding Unicorn and unicorn-worker-killer
+
+## Unicorn
+
+GitLab uses [Unicorn](http://unicorn.bogomips.org/), a pre-forking Ruby web
+server, to handle web requests (web browsers and Git HTTP clients). Unicorn is
+a daemon written in Ruby and C that can load and run a Ruby on Rails
+application; in our case the Rails application is GitLab Community Edition or
+GitLab Enterprise Edition.
+
+Unicorn has a multi-process architecture to make better use of available CPU
+cores (processes can run on different cores) and to have stronger fault
+tolerance (most failures stay isolated in only one process and cannot take down
+GitLab entirely). On startup, the Unicorn 'master' process loads a clean Ruby
+environment with the GitLab application code, and then spawns 'workers' which
+inherit this clean initial environment. The 'master' never handles any
+requests, that is left to the workers. The operating system network stack
+queues incoming requests and distributes them among the workers.
+
+In a perfect world, the master would spawn its pool of workers once, and then
+the workers handle incoming web requests one after another until the end of
+time. In reality, worker processes can crash or time out: if the master notices
+that a worker takes too long to handle a request it will terminate the worker
+process with SIGKILL ('kill -9'). No matter how the worker process ended, the
+master process will replace it with a new 'clean' process again. Unicorn is
+designed to be able to replace 'crashed' workers without dropping user
+requests.
+
+This is what a Unicorn worker timeout looks like in `unicorn_stderr.log`. The
+master process has PID 56227 below.
+
+```
+[2015-06-05T10:58:08.660325 #56227] ERROR -- : worker=10 PID:53009 timeout (61s > 60s), killing
+[2015-06-05T10:58:08.699360 #56227] ERROR -- : reaped #<Process::Status: pid 53009 SIGKILL (signal 9)> worker=10
+[2015-06-05T10:58:08.708141 #62538]  INFO -- : worker=10 spawned pid=62538
+[2015-06-05T10:58:08.708824 #62538]  INFO -- : worker=10 ready
+```
+
+### Tunables
+
+The main tunables for Unicorn are the number of worker processes and the
+request timeout after which the Unicorn master terminates a worker process.
+See the [omnibus-gitlab Unicorn settings
+documentation](https://gitlab.com/gitlab-org/omnibus-gitlab/blob/master/doc/settings/unicorn.md)
+if you want to adjust these settings.
+
+## unicorn-worker-killer
+
+GitLab has memory leaks. These memory leaks manifest themselves in long-running
+processes, such as Unicorn workers. (The Unicorn master process is not known to
+leak memory, probably because it does not handle user requests.)
+
+To make these memory leaks manageable, GitLab comes with the
+[unicorn-worker-killer gem](https://github.com/kzk/unicorn-worker-killer). This
+gem [monkey-patches](https://en.wikipedia.org/wiki/Monkey_patch) the Unicorn
+workers to do a memory self-check after every 16 requests. If the memory of the
+Unicorn worker exceeds a pre-set limit then the worker process exits. The
+Unicorn master then automatically replaces the worker process.
+
+This is a robust way to handle memory leaks: Unicorn is designed to handle
+workers that 'crash' so no user requests will be dropped. The
+unicorn-worker-killer gem is designed to only terminate a worker process _in
+between requests_, so no user requests are affected.
+
+This is what a Unicorn worker memory restart looks like in unicorn_stderr.log.
+You see that worker 4 (PID 125918) is inspecting itself and decides to exit.
+The threshold memory value was 254802235 bytes, about 250MB. With GitLab this
+threshold is a random value between 200 and 250 MB.  The master process (PID
+117565) then reaps the worker process and spawns a new 'worker 4' with PID
+127549.
+
+```
+[2015-06-05T12:07:41.828374 #125918]  WARN -- : #<Unicorn::HttpServer:0x00000002734770>: worker (pid: 125918) exceeds memory limit (256413696 bytes > 254802235 bytes)
+[2015-06-05T12:07:41.828472 #125918]  WARN -- : Unicorn::WorkerKiller send SIGQUIT (pid: 125918) alive: 23 sec (trial 1)
+[2015-06-05T12:07:42.025916 #117565]  INFO -- : reaped #<Process::Status: pid 125918 exit 0> worker=4
+[2015-06-05T12:07:42.034527 #127549]  INFO -- : worker=4 spawned pid=127549
+[2015-06-05T12:07:42.035217 #127549]  INFO -- : worker=4 ready
+```
+
+One other thing that stands out in the log snippet above, taken from
+GitLab.com, is that 'worker 4' was serving requests for only 23 seconds. This
+is a normal value for our current GitLab.com setup and traffic.
+
+The high frequency of Unicorn memory restarts on some GitLab sites can be a
+source of confusion for administrators. Usually they are a [red
+herring](https://en.wikipedia.org/wiki/Red_herring).
author	Achilleas Pipinellis <axilleas@axilleas.me>	2016-09-25 12:44:09 +0200
committer	Achilleas Pipinellis <axilleas@axilleas.me>	2016-10-11 16:13:04 +0200
commit	d8f33c0a51d9106ece6cd4bae469e40734e05f85 (patch)
tree	f8168d158bb85d303e6dcab953ebf6a26c9dfdd2 /doc/administration/operations
parent	6602b917f7ffcb8c8e9134ee156ac49c24ed2a9b (diff)
download	gitlab-ce-d8f33c0a51d9106ece6cd4bae469e40734e05f85.tar.gz