Diffstat (limited to 'doc/administration/geo/replication/troubleshooting.md')
-rw-r--r-- | doc/administration/geo/replication/troubleshooting.md | 384 |
1 files changed, 384 insertions, 0 deletions
diff --git a/doc/administration/geo/replication/troubleshooting.md b/doc/administration/geo/replication/troubleshooting.md
new file mode 100644
index 00000000000..6fea03cc8ec
--- /dev/null
+++ b/doc/administration/geo/replication/troubleshooting.md
@@ -0,0 +1,384 @@
+# Geo Troubleshooting
+
+NOTE: **Note:**
+This list is an attempt to document all the moving parts that can go wrong.
+We are working on getting all these steps verified automatically in a
+rake task in the future.
+
+Setting up Geo requires careful attention to detail, and sometimes it's easy to
+miss a step. Here is a list of questions you should ask to try to detect
+what you need to fix (all commands and path locations are for Omnibus installs):
+
+## First check the health of the **secondary** node
+
+Visit the **primary** node's **Admin Area > Geo** (`/admin/geo/nodes`) in
+your browser. We perform the following health checks on each **secondary** node
+to help identify if something is wrong:
+
+- Is the node running?
+- Is the node's secondary database configured for streaming replication?
+- Is the node's secondary tracking database configured?
+- Is the node's secondary tracking database connected?
+- Is the node's secondary tracking database up-to-date?
+
+![Geo health check](img/geo_node_healthcheck.png)
+
+There is also an option to check the status of the **secondary** node by running a special rake task:
+
+```sh
+sudo gitlab-rake geo:status
+```
+
+## Is Postgres replication working?
+
+### Are my nodes pointing to the correct database instance?
+
+You should make sure your **primary** Geo node points to the instance with
+write permissions.
+
+Any **secondary** nodes should point only to read-only instances.
+
+### Can Geo detect my current node correctly?
+
+Geo uses the defined node from the **Admin Area > Geo** screen, and tries to match
+it with the value defined in the `/etc/gitlab/gitlab.rb` configuration file.
+The relevant line looks like: `external_url "http://gitlab.example.com"`.
+
+To check if the node on the current machine is correctly detected, type:
+
+```sh
+sudo gitlab-rails runner "puts Gitlab::Geo.current_node.inspect"
+```
+
+and expect something like:
+
+```
+#<GeoNode id: 2, schema: "https", host: "gitlab.example.com", port: 443, relative_url_root: "", primary: false, ...>
+```
+
+In the output of the command above, `primary` should be `true` when executed on
+the **primary** node, and `false` on any **secondary** node.
+
+## How do I fix the message, "ERROR: replication slots can only be used if max_replication_slots > 0"?
+
+This means that the `max_replication_slots` PostgreSQL variable needs to
+be set on the **primary** database. In GitLab 9.4, we have made this setting
+default to 1. You may need to increase this value if you have more
+**secondary** nodes. Be sure to restart PostgreSQL for this to take
+effect. See the [PostgreSQL replication
+setup][database-pg-replication] guide for more details.
+
+## How do I fix the message, "FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist"?
+
+This occurs when PostgreSQL does not have a replication slot for the
+**secondary** node by that name. You may want to rerun the [replication
+process](database.md) on the **secondary** node.
+
+## How do I fix the message, "Command exceeded allowed execution time" when setting up replication?
+
+This may happen while [initiating the replication process][database-start-replication] on the **secondary** node,
+and indicates that your initial dataset is too large to be replicated in the default timeout (30 minutes).
+
+Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
+`--backup-timeout`:
+
+```sh
+sudo gitlab-ctl replicate-geo-database --host=primary.geo.example.com --slot-name=secondary_geo_example_com --backup-timeout=21600
+```
+
+This will give the initial replication up to six hours to complete, rather than
+the default thirty minutes. Adjust as required for your installation.
+
+## How do I fix the message, "PANIC: could not write to file 'pg_xlog/xlogtemp.123': No space left on device"?
+
+Determine if you have any unused replication slots in the **primary** database. Unused slots can cause large amounts of
+log data to build up in `pg_xlog`. Removing the unused slots can reduce the amount of space used in `pg_xlog`.
+
+1. Start a PostgreSQL console session:
+
+    ```sh
+    sudo gitlab-psql gitlabhq_production
+    ```
+
+    > Note that using `gitlab-rails dbconsole` will not work, because managing replication slots requires superuser permissions.
+
+1. View your replication slots with:
+
+    ```sql
+    SELECT * FROM pg_replication_slots;
+    ```
+
+Slots where `active` is `f` are not active.
+
+- If a slot should be active, because you have a **secondary** node configured using that slot,
+  log in to that **secondary** node and check the PostgreSQL logs to find out why replication is not running.
+
+- If you are no longer using the slot (e.g. you no longer have Geo enabled), you can remove it in the
+  PostgreSQL console session:
+
+  ```sql
+  SELECT pg_drop_replication_slot('name_of_extra_slot');
+  ```
+
+## Very large repositories never successfully synchronize on the **secondary** node
+
+GitLab places a timeout on all repository clones, including project imports
+and Geo synchronization operations. If a fresh `git clone` of a repository
+on the primary takes more than a few minutes, you may be affected by this.
+To increase the timeout, add the following line to `/etc/gitlab/gitlab.rb`
+on the **secondary** node:
+
+```ruby
+gitlab_rails['gitlab_shell_git_timeout'] = 10800
+```
+
+Then reconfigure GitLab:
+
+```sh
+sudo gitlab-ctl reconfigure
+```
+
+This will increase the timeout to three hours (10800 seconds). Choose a time
+long enough to accommodate a full clone of your largest repositories.
+
+## How to reset Geo **secondary** node replication
+
+If you get a **secondary** node in a broken state and want to reset the replication state,
+to start again from scratch, there are a few steps that can help you:
+
+1. Stop Sidekiq and the Geo LogCursor
+
+    It's possible to make Sidekiq stop gracefully, by making it stop getting new jobs and
+    wait until the current jobs finish processing.
+
+    You need to send a **SIGTSTP** kill signal for the first phase and then a **SIGTERM**
+    when all jobs have finished. Otherwise just use the `gitlab-ctl stop` commands.
+
+    ```sh
+    gitlab-ctl status sidekiq
+    # run: sidekiq: (pid 10180) <- this is the PID you will use
+    kill -TSTP 10180 # change to the correct PID
+
+    gitlab-ctl stop sidekiq
+    gitlab-ctl stop geo-logcursor
+    ```
+
+    You can watch the Sidekiq logs to know when Sidekiq has finished processing jobs:
+
+    ```sh
+    gitlab-ctl tail sidekiq
+    ```
+
+1. Rename repository storage folders and create new ones
+
+    ```sh
+    mv /var/opt/gitlab/git-data/repositories /var/opt/gitlab/git-data/repositories.old
+    mkdir -p /var/opt/gitlab/git-data/repositories
+    chown git:git /var/opt/gitlab/git-data/repositories
+    ```
+
+    TIP: **Tip**
+    You may want to remove `/var/opt/gitlab/git-data/repositories.old` in the future,
+    as soon as you have confirmed that you don't need it anymore, to save disk space.
+
+1. _(Optional)_ Rename other data folders and create new ones
+
+    CAUTION: **Caution**:
+    You may still have files on the **secondary** node that have been removed from the **primary** node but
+    whose removal has not been reflected.
+    If you skip this step, they will never be removed from this Geo node.
+
+    Any uploaded content like file attachments, avatars or LFS objects is stored in a
+    subfolder in one of the two paths below:
+
+    1. /var/opt/gitlab/gitlab-rails/shared
+    1. /var/opt/gitlab/gitlab-rails/uploads
+
+    To rename all of them:
+
+    ```sh
+    gitlab-ctl stop
+
+    mv /var/opt/gitlab/gitlab-rails/shared /var/opt/gitlab/gitlab-rails/shared.old
+    mkdir -p /var/opt/gitlab/gitlab-rails/shared
+
+    mv /var/opt/gitlab/gitlab-rails/uploads /var/opt/gitlab/gitlab-rails/uploads.old
+    mkdir -p /var/opt/gitlab/gitlab-rails/uploads
+    ```
+
+    Reconfigure in order to recreate the folders and make sure permissions and ownership
+    are correct:
+
+    ```sh
+    gitlab-ctl reconfigure
+    ```
+
+1. Reset the Tracking Database
+
+    ```sh
+    gitlab-rake geo:db:reset
+    ```
+
+1. Restart previously stopped services
+
+    ```sh
+    gitlab-ctl start
+    ```
+
+## How do I fix a "Foreign Data Wrapper (FDW) is not configured" error?
+
+When setting up Geo, you might see this warning in the `gitlab-rake
+gitlab:geo:check` output:
+
+```
+GitLab Geo tracking database Foreign Data Wrapper schema is up-to-date? ... foreign data wrapper is not configured
+```
+
+There are a few key points to remember:
+
+1. The FDW settings are configured on the Geo **tracking** database.
+1. The configured foreign server enables a login to the Geo
+**secondary**, read-only database.
+
+By default, the Geo secondary and tracking databases run on the
+same host on different ports. That is, 5432 and 5431 respectively.
+
+### Checking configuration
+
+NOTE: **Note:**
+The following steps are for Omnibus installs only. Using Geo with source-based installs [is deprecated](index.md#using-gitlab-installed-from-source-deprecated).
+
+To check the configuration:
+
+1. Enter the database console:
+
+    ```sh
+    gitlab-geo-psql
+    ```
+
+1. Check whether any tables are present.
+    If everything is working, you should see something like this:
+
+    ```sql
+    gitlabhq_geo_production=# SELECT * from information_schema.foreign_tables;
+     foreign_table_catalog   | foreign_table_schema | foreign_table_name                              | foreign_server_catalog  | foreign_server_name
+    -------------------------+----------------------+-------------------------------------------------+-------------------------+---------------------
+     gitlabhq_geo_production | gitlab_secondary     | abuse_reports                                   | gitlabhq_geo_production | gitlab_secondary
+     gitlabhq_geo_production | gitlab_secondary     | appearances                                     | gitlabhq_geo_production | gitlab_secondary
+     gitlabhq_geo_production | gitlab_secondary     | application_setting_terms                       | gitlabhq_geo_production | gitlab_secondary
+     gitlabhq_geo_production | gitlab_secondary     | application_settings                            | gitlabhq_geo_production | gitlab_secondary
+    <snip>
+    ```
+
+    However, if the query returns `0 rows`, continue with the next steps.
+
+1. Check that the foreign server mapping is correct via `\des+`. The
+    results should look something like this:
+
+    ```sql
+    gitlabhq_geo_production=# \des+
+    List of foreign servers
+    -[ RECORD 1 ]--------+------------------------------------------------------------
+    Name                 | gitlab_secondary
+    Owner                | gitlab-psql
+    Foreign-data wrapper | postgres_fdw
+    Access privileges    | "gitlab-psql"=U/"gitlab-psql"                              +
+                         | gitlab_geo=U/"gitlab-psql"
+    Type                 |
+    Version              |
+    FDW Options          | (host '0.0.0.0', port '5432', dbname 'gitlabhq_production')
+    Description          |
+    ```
+
+    NOTE: **Note:** Pay particular attention to the host and port under
+    FDW options. That configuration should point to the Geo secondary
+    database.
+
+    If you need to experiment with changing the host or port, the
+    following queries demonstrate how:
+
+    ```sql
+    ALTER SERVER gitlab_secondary OPTIONS (SET host 'my-new-host');
+    ALTER SERVER gitlab_secondary OPTIONS (SET port '5432');
+    ```
+
+    If you change the host and/or port, you will also have to adjust the
+    following settings in `/etc/gitlab/gitlab.rb` and run `gitlab-ctl
+    reconfigure`:
+
+    - `gitlab_rails['db_host']`
+    - `gitlab_rails['db_port']`
+
+1. Check that the user mapping is configured properly via `\deu+`:
+
+    ```sql
+    gitlabhq_geo_production=# \deu+
+                                                 List of user mappings
+          Server      | User name  |                                  FDW Options
+    ------------------+------------+--------------------------------------------------------------------------------
+     gitlab_secondary | gitlab_geo | ("user" 'gitlab', password 'YOUR-PASSWORD-HERE')
+    (1 row)
+    ```
+
+    Make sure the password is correct. You can test that logins work by running `psql`:
+
+    ```sh
+    # Connect to the tracking database as the `gitlab_geo` user
+    sudo -u git /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/geo-postgresql -p 5431 -U gitlab_geo -W -d gitlabhq_geo_production
+    ```
+
+    If you need to correct the password, the following query shows how:
+
+    ```sql
+    ALTER USER MAPPING FOR gitlab_geo SERVER gitlab_secondary OPTIONS (SET password 'my-new-password');
+    ```
+
+    If you change the user or password, you will also have to adjust the
+    following settings in `/etc/gitlab/gitlab.rb` and run `gitlab-ctl
+    reconfigure`:
+
+    - `gitlab_rails['db_username']`
+    - `gitlab_rails['db_password']`
+
+    If you are using [PgBouncer in front of the secondary
+    database](database.md#pgbouncer-support-optional), be sure to update
+    the following settings:
+
+    - `geo_postgresql['fdw_external_user']`
+    - `geo_postgresql['fdw_external_password']`
+
+### Manual reload of FDW schema
+
+If you're still unable to get FDW working, you may want to try a manual
+reload of the FDW schema. To manually reload the FDW schema:
+
+1.
+    On the node running the Geo tracking database, enter the PostgreSQL console as
+    the `gitlab_geo` user:
+
+    ```sh
+    sudo -u git /opt/gitlab/embedded/bin/psql -h /var/opt/gitlab/geo-postgresql -p 5431 -U gitlab_geo -W -d gitlabhq_geo_production
+    ```
+
+    Be sure to adjust the port and hostname for your configuration. You
+    may be asked to enter a password.
+
+1. Reload the schema via:
+
+    ```sql
+    DROP SCHEMA IF EXISTS gitlab_secondary CASCADE;
+    CREATE SCHEMA gitlab_secondary;
+    GRANT USAGE ON FOREIGN SERVER gitlab_secondary TO gitlab_geo;
+    IMPORT FOREIGN SCHEMA public FROM SERVER gitlab_secondary INTO gitlab_secondary;
+    ```
+
+1. Test that queries work:
+
+    ```sql
+    SELECT * FROM information_schema.foreign_tables;
+    SELECT * FROM gitlab_secondary.projects LIMIT 1;
+    ```
+
+[database-start-replication]: database.md#step-3-initiate-the-replication-process
+[database-pg-replication]: database.md#postgresql-replication