From 6ed4ec3e0b1340f96b7c043ef51d1b33bbe85fde Mon Sep 17 00:00:00 2001
From: GitLab Bot
Date: Mon, 19 Sep 2022 23:18:09 +0000
Subject: Add latest changes from gitlab-org/gitlab@15-4-stable-ee

---
 .../geo/replication/troubleshooting.md | 190 ++++++++++++---------
 1 file changed, 112 insertions(+), 78 deletions(-)

diff --git a/doc/administration/geo/replication/troubleshooting.md b/doc/administration/geo/replication/troubleshooting.md
index 26d192f62cd..d64ad2549e8 100644
--- a/doc/administration/geo/replication/troubleshooting.md
+++ b/doc/administration/geo/replication/troubleshooting.md
@@ -19,24 +19,24 @@ Here is a list of steps you should take to attempt to fix problem:
 
 Before attempting more advanced troubleshooting:
 
-- Check [the health of the **secondary** node](#check-the-health-of-the-secondary-node).
+- Check [the health of the **secondary** site](#check-the-health-of-the-secondary-site).
 - Check [if PostgreSQL replication is working](#check-if-postgresql-replication-is-working).
 
-### Check the health of the **secondary** node
+### Check the health of the **secondary** site
 
-On the **primary** node:
+On the **primary** site:
 
-1. On the top bar, select **Menu > Admin**.
-1. On the left sidebar, select **Geo > Nodes**.
+1. On the top bar, select **Main menu > Admin**.
+1. On the left sidebar, select **Geo > Sites**.
 
-We perform the following health checks on each **secondary** node
+We perform the following health checks on each **secondary** site
 to help identify if something is wrong:
 
-- Is the node running?
-- Is the node's secondary database configured for streaming replication?
-- Is the node's secondary tracking database configured?
-- Is the node's secondary tracking database connected?
-- Is the node's secondary tracking database up-to-date?
+- Is the site running?
+- Is the secondary site's database configured for streaming replication?
+- Is the secondary site's tracking database configured?
+- Is the secondary site's tracking database connected?
+- Is the secondary site's tracking database up-to-date?
 
 ![Geo health check](img/geo_site_health_v14_0.png)
 
@@ -48,8 +48,8 @@ health check manually to get this information and a few more details.
 
 #### Health check Rake task
 
-This Rake task can be run on an app node in the **primary** or **secondary**
-Geo nodes:
+This Rake task can be run on a **Rails** node in the **primary** or **secondary**
+Geo sites:
 
 ```shell
 sudo gitlab-rake gitlab:geo:check
 ```
 
@@ -275,11 +275,11 @@ sudo gitlab-rake gitlab:geo:check
   Checking Geo ... Finished
   ```
 
-  Ensure you have added the secondary node in the Admin Area of the **primary** node.
-  Also ensure you entered the `external_url` or `gitlab_rails['geo_node_name']`
-  when adding the secondary node in the Admin Area of the **primary** node.
-  In GitLab 12.3 and earlier, edit the secondary node in the Admin Area of the **primary**
-  node and ensure that there is a trailing `/` in the `Name` field.
+  Ensure you have added the secondary site in the **Main menu > Admin > Geo > Sites** on the web interface for the **primary** site.
+  Also ensure you entered the `gitlab_rails['geo_node_name']`
+  when adding the secondary site in the Admin Area of the **primary** site.
+  In GitLab 12.3 and earlier, edit the secondary site in the Admin Area of the **primary**
+  site and ensure that there is a trailing `/` in the `Name` field.
 
 - Check returns `Exception: PG::UndefinedTable: ERROR:  relation "geo_nodes" does not exist`.
@@ -321,7 +321,7 @@ error messages (indicated by `Database replication working? ... no` in the
 
 This means that the `max_replication_slots` PostgreSQL variable needs to
 be set on the **primary** database. This setting defaults to 1. You may need to
-increase this value if you have more **secondary** nodes.
+increase this value if you have more **secondary** sites.
 
 Be sure to restart PostgreSQL for this to take effect. See the
 [PostgreSQL replication setup](../setup/database.md#postgresql-replication) guide for more details.
@@ -329,13 +329,13 @@ Be sure to restart PostgreSQL for this to take effect. See the
 ### Message: `FATAL: could not start WAL streaming: ERROR: replication slot "geo_secondary_my_domain_com" does not exist`?
 
 This occurs when PostgreSQL does not have a replication slot for the
-**secondary** node by that name.
+**secondary** site by that name.
 
-You may want to rerun the [replication process](../setup/database.md) on the **secondary** node .
+You may want to rerun the [replication process](../setup/database.md) on the **secondary** site.
 
 ### Message: "Command exceeded allowed execution time" when setting up replication?
 
-This may happen while [initiating the replication process](../setup/database.md#step-3-initiate-the-replication-process) on the **secondary** node,
+This may happen while [initiating the replication process](../setup/database.md#step-3-initiate-the-replication-process) on the **secondary** site,
 and indicates your initial dataset is too large to be replicated in the default timeout (30 minutes).
 
 Re-run `gitlab-ctl replicate-geo-database`, but include a larger value for
@@ -374,8 +374,8 @@ log data to build up in `pg_xlog`. Removing the unused slots can reduce the amou
 
 Slots where `active` is `f` are not active.
 
-- When this slot should be active, because you have a **secondary** node configured using that slot,
-  sign in to that **secondary** node and check the [PostgreSQL logs](../../logs/index.md#postgresql-logs)
+- When this slot should be active, because you have a **secondary** site configured using that slot,
+  sign in on the web interface for the **secondary** site and check the [PostgreSQL logs](../../logs/index.md#postgresql-logs)
   to view why the replication is not running.
 
 - If you are no longer using the slot (for example, you no longer have Geo enabled), you can remove it in the
@@ -398,12 +398,12 @@ These long-running queries are
 [planned to be removed in the future](https://gitlab.com/gitlab-org/gitlab/-/issues/34269),
 but as a workaround, we recommend enabling
 [`hot_standby_feedback`](https://www.postgresql.org/docs/10/hot-standby.html#HOT-STANDBY-CONFLICT).
-This increases the likelihood of bloat on the **primary** node as it prevents
+This increases the likelihood of bloat on the **primary** site as it prevents
 `VACUUM` from removing recently-dead rows. However, it has been used successfully
 in production on GitLab.com.
 
 To enable `hot_standby_feedback`, add the following to `/etc/gitlab/gitlab.rb`
-on the **secondary** node:
+on the **secondary** site:
 
 ```ruby
 postgresql['hot_standby_feedback'] = 'on'
@@ -463,14 +463,14 @@ This happens if data is detected in the `projects` table. When one or more proje
 is aborted to prevent accidental data loss. To bypass this message, pass the
 `--force` option to the command.
 
 In GitLab 13.4, a seed project is added when GitLab is first installed. This makes
 it necessary to pass `--force` even
-on a new Geo secondary node. 
There is an [issue to account for seed projects](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/5618)
+on a new Geo secondary site. There is an [issue to account for seed projects](https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/5618)
 when checking the database.
 
 ### Message: `Synchronization failed - Error syncing repository`
 
 WARNING:
 If large repositories are affected by this problem,
-their resync may take a long time and cause significant load on your Geo nodes,
+their resync may take a long time and cause significant load on your Geo sites,
 storage, and network systems.
 
 If you see the error message `Synchronization failed - Error syncing repository` along with `fatal: fsck error in packed object`, this indicates
@@ -483,7 +483,7 @@ it's possible to override the consistency checks instead. To do that, follow
 [the instructions in the Gitaly docs](../../gitaly/configure_gitaly.md#repository-consistency-checks).
 
 You can also get the error message `Synchronization failed - Error syncing repository` along with the following log messages. This indicates that the expected `geo` remote is not present in the `.git/config` file
-of a repository on the secondary Geo node's file system:
+of a repository on the secondary Geo site's file system:
 
 ```json
 {
@@ -505,7 +505,7 @@ of a repository on the secondary Geo node's file system:
 
 To solve this:
 
-1. Sign in to the secondary Geo node.
+1. Sign in on the web interface for the secondary Geo site.
 
 1. Back up [the `.git` folder](../../repository_storage_types.md#translate-hashed-storage-paths).
 
@@ -538,7 +538,7 @@ To solve this:
    end
   ```
 
-### Very large repositories never successfully synchronize on the **secondary** node
+### Very large repositories never successfully synchronize on the **secondary** site
 
 GitLab places a timeout on all repository clones, including project imports
 and Geo synchronization operations. If a fresh `git clone` of a repository
@@ -546,7 +546,8 @@ on the **primary** takes more than the default three hours, you may be affected
 
 To increase the timeout:
 
-1. On the **secondary** node, add the following line to `/etc/gitlab/gitlab.rb`:
+1. On the **Sidekiq nodes on your secondary** site,
+   add the following line to `/etc/gitlab/gitlab.rb`:
 
    ```ruby
   gitlab_rails['gitlab_shell_git_timeout'] = 14400
@@ -563,9 +564,9 @@ long enough to accommodate a full clone of your largest repositories.
 
 ### New LFS objects are never replicated
 
-If new LFS objects are never replicated to secondary Geo nodes, check the version of
+If new LFS objects are never replicated to secondary Geo sites, check the version of
 GitLab you are running. GitLab versions 11.11.x or 12.0.x are affected by
-[a bug that results in new LFS objects not being replicated to Geo secondary nodes](https://gitlab.com/gitlab-org/gitlab/-/issues/32696).
+[a bug that results in new LFS objects not being replicated to Geo secondary sites](https://gitlab.com/gitlab-org/gitlab/-/issues/32696).
 
 To resolve the issue, upgrade to GitLab 12.1 or later.
 
@@ -574,9 +575,9 @@ To resolve the issue, upgrade to GitLab 12.1 or later.
 
 During a [backfill](../index.md#backfill), failures are scheduled to be retried at the end
 of the backfill queue. Therefore, these failures only clear up **after** the backfill
 completes. 
-### Resetting Geo **secondary** node replication
+### Resetting Geo **secondary** site replication
 
-If you get a **secondary** node in a broken state and want to reset the replication state,
+If you get a **secondary** site in a broken state and want to reset the replication state,
 to start again from scratch, there are a few steps that can help you:
 
 1. Stop Sidekiq and the Geo LogCursor.
 
@@ -617,8 +618,8 @@ to start again from scratch, there are a few steps that can help you:
 1. Optional. Rename other data folders and create new ones.
 
    WARNING:
-   You may still have files on the **secondary** node that have been removed from the **primary** node, but this
-   removal has not been reflected. If you skip this step, these files are not removed from the Geo node.
+   You may still have files on the **secondary** site that have been removed from the **primary** site, but this
+   removal has not been reflected. If you skip this step, these files are not removed from the Geo **secondary** site.
 
    Any uploaded content (like file attachments, avatars, or LFS objects) is stored in a
   subfolder in one of these paths:
 
@@ -667,7 +668,7 @@ to start again from scratch, there are a few steps that can help you:
 
 ### Design repository failures on mirrored projects and project imports
 
-On the top bar, under **Menu > Admin > Geo > Nodes**,
+On the top bar, under **Main menu > Admin > Geo > Sites**,
 if the Design repositories progress bar shows
 `Synced` and `Failed` greater than 100%, and negative `Queued`, the instance
 is likely affected by
@@ -714,7 +715,7 @@ Counts: {"synced"=>3}
 ```
 
-#### If you are promoting a Geo secondary site running on a single server
+#### If you are promoting a Geo secondary site running on a single node
 
 `gitlab-ctl promotion-preflight-checks` fails due to the existence of
 `failed` rows in the `geo_design_registry` table. Use the
@@ -831,10 +832,10 @@ We recommend transferring each failing repository individually and checking for
 after each transfer. Follow the [single target `rsync` instructions](../../operations/moving_repositories.md#single-rsync-to-another-server)
 to transfer each affected repository from the primary to the secondary site.
 
-## Fixing errors during a failover or when promoting a secondary to a primary node
+## Fixing errors during a failover or when promoting a secondary to a primary site
 
 The following are possible error messages that might be encountered during failover or
-when promoting a secondary to a primary node with strategies to resolve them.
+when promoting a secondary to a primary site, with strategies to resolve them.
 
 ### Message: `ActiveRecord::RecordInvalid: Validation failed: Name has already been taken`
 
@@ -868,14 +869,14 @@ or `gitlab-ctl promote-to-primary-node`, either:
   ```
 
 - Upgrade to GitLab 12.6.3 or later if it is safe to do so. For example,
-  if the failover was just a test. A
+  if the failover was just a test. A
   [caching-related bug](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/22021) was fixed.
 
 ### Message: `ActiveRecord::RecordInvalid: Validation failed: Enabled Geo primary node cannot be disabled`
 
-If you disabled a secondary node, either with the [replication pause task](../index.md#pausing-and-resuming-replication)
+If you disabled a secondary site, either with the [replication pause task](../index.md#pausing-and-resuming-replication)
 (GitLab 13.2) or by using the user interface (GitLab 13.1 and earlier), you must first
-re-enable the node before you can continue. This is fixed in GitLab 13.4. 
+re-enable the site before you can continue. This is fixed in GitLab 13.4.
 
 This can be fixed in the database.
 
@@ -894,7 +895,7 @@ This can be fixed in the database.
    ```
 
 1. Run the following command, replacing `https://<secondary url>/` with the URL
-   for your secondary server. You can use either `http` or `https`, but ensure that you
+   for your secondary node. You can use either `http` or `https`, but ensure that you
    end the URL with a slash (`/`):
 
     ```sql
@@ -987,32 +988,31 @@ sudo gitlab-rake geo:set_secondary_as_primary
 
 ## Expired artifacts
 
 If you notice for some reason there are more artifacts on the Geo
-secondary node than on the Geo primary node, you can use the Rake task
+**secondary** site than on the Geo **primary** site, you can use the Rake task
 to [clean up orphan artifact files](../../../raketasks/cleanup.md#remove-orphan-artifact-files).
 
-On a Geo **secondary** node, this command also cleans up all Geo
+On a Geo **secondary** site, this command also cleans up all Geo
 registry records related to the orphan files on disk.
 
 ## Fixing sign in errors
 
 ### Message: The redirect URI included is not valid
 
-If you are able to sign in to the **primary** node, but you receive this error message
-when attempting to sign in to a **secondary**, you should verify the Geo
-node's URL matches its external URL.
+If you are able to sign in to the web interface for the **primary** site, but you receive this error message
+when attempting to sign in to a **secondary** web interface, you should verify the Geo
+site's URL matches its external URL.
 
-On the **primary** node:
+On the **primary** site:
 
-1. On the top bar, select **Menu > Admin**.
-1. On the left sidebar, select **Geo > Nodes**.
+1. On the top bar, select **Main menu > Admin**.
+1. On the left sidebar, select **Geo > Sites**.
 1. Find the affected **secondary** site and select **Edit**.
 1. Ensure the **URL** field matches the value found in `/etc/gitlab/gitlab.rb`
-   in `external_url "https://gitlab.example.com"` on the frontend servers of
-   the **secondary** node.
+   in `external_url "https://gitlab.example.com"` on the **Rails nodes of the secondary** site.
 
 ## Fixing common errors
 
-This section documents common error messages reported in the Admin Area, and how to fix them.
+This section documents common error messages reported in the Admin Area on the web interface, and how to fix them.
 
 ### Geo database configuration file is missing
 
@@ -1029,11 +1029,11 @@ has the correct permissions.
 
 Geo cannot reuse an existing tracking database.
 It is safest to use a fresh secondary, or reset the whole secondary by following
-[Resetting Geo secondary node replication](#resetting-geo-secondary-node-replication).
+[Resetting Geo secondary site replication](#resetting-geo-secondary-site-replication).
 
-### Geo node has a database that is writable which is an indication it is not configured for replication with the primary node
+### Geo site has a database that is writable which is an indication it is not configured for replication with the primary site
 
-This error message refers to a problem with the database replica on a **secondary** node,
+This error message refers to a problem with the database replica on a **secondary** site,
 which Geo expects to have access to. It usually means either:
 
 - An unsupported replication method was used (for example, logical replication).
@@ -1043,24 +1043,24 @@ which Geo expects to have access to. It usually means, either:
 
 Geo **secondary** sites require two separate PostgreSQL instances:
 
-- A read-only replica of the **primary** node. 
+- A read-only replica of the **primary** site.
 - A regular, writable instance that holds replication metadata. That is, the Geo tracking database.
 
 This error message indicates that the replica database in the **secondary** site is misconfigured and replication has stopped.
 
 To restore the database and resume replication, you can do one of the following:
 
-- [Reset the Geo secondary site replication](#resetting-geo-secondary-node-replication).
+- [Reset the Geo secondary site replication](#resetting-geo-secondary-site-replication).
 - [Set up a new secondary Geo Omnibus instance](../setup/index.md#using-omnibus-gitlab).
 
 If you set up a new secondary from scratch, you must also [remove the old site from the Geo cluster](remove_geo_site.md#removing-secondary-geo-sites).
 
-### Geo node does not appear to be replicating the database from the primary node
+### Geo site does not appear to be replicating the database from the primary site
 
 The most common problems that prevent the database from replicating correctly are:
 
-- **Secondary** nodes cannot reach the **primary** node. Check credentials, firewall rules, and so on.
-- SSL certificate problems. Make sure you copied `/etc/gitlab/gitlab-secrets.json` from the **primary** node.
+- **Secondary** sites cannot reach the **primary** site. Check credentials, [firewall rules](../index.md#firewall-rules), and so on.
+- SSL certificate problems. Make sure you copied `/etc/gitlab/gitlab-secrets.json` from the **primary** site.
 - Database storage disk is full.
 - Database replication slot is misconfigured.
 - Database is not using a replication slot or another alternative and cannot catch up because WAL files were purged.
 
 Make sure you follow the [Geo database replication](../setup/database.md) instructions for supported configuration.
 
@@ -1072,26 +1072,26 @@ Make sure you follow the [Geo database replication](../setup/database.md) instru
 If you are using an Omnibus GitLab installation, something might have failed during upgrade. You can:
 
 - Run `sudo gitlab-ctl reconfigure`.
-- Manually trigger the database migration by running: `sudo gitlab-rake db:migrate:geo` as root on the **secondary** node.
+- Manually trigger the database migration by running: `sudo gitlab-rake db:migrate:geo` as root on the **secondary** site.
 
 ### GitLab indicates that more than 100% of repositories were synced
 
 This can be caused by orphaned records in the project registry. You can clear them
 [using a Rake task](../../../administration/raketasks/geo.md#remove-orphaned-project-registries).
 
-### Geo Admin Area returns 404 error for a secondary node
+### Geo Admin Area returns 404 error for a secondary site
 
-Sometimes `sudo gitlab-rake gitlab:geo:check` indicates that the **secondary** node is
-healthy, but a 404 Not Found error message for the **secondary** node is returned in the Geo Admin Area on
-the **primary** node.
+Sometimes `sudo gitlab-rake gitlab:geo:check` indicates that the **Rails nodes of the secondary** site are
+healthy, but a 404 Not Found error message for the **secondary** site is returned in the Geo Admin Area on the web interface for
+the **primary** site.
 
 To resolve this issue:
 
-- Try restarting the **secondary** using `sudo gitlab-ctl restart`.
-- Check `/var/log/gitlab/gitlab-rails/geo.log` to see if the **secondary** node is
-  using IPv6 to send its status to the **primary** node. If it is, add an entry to
-  the **primary** node using IPv4 in the `/etc/hosts` file. Alternatively, you should
-  [enable IPv6 on the **primary** node](https://docs.gitlab.com/omnibus/settings/nginx.html#setting-the-nginx-listen-address-or-addresses). 
+- Try restarting **each Rails, Sidekiq, and Gitaly node on your secondary site** using `sudo gitlab-ctl restart`.
+- Check `/var/log/gitlab/gitlab-rails/geo.log` on Sidekiq nodes to see if the **secondary** site is
+  using IPv6 to send its status to the **primary** site. If it is, add an entry to
+  the **primary** site using IPv4 in the `/etc/hosts` file. Alternatively, you should
+  [enable IPv6 on the **primary** site](https://docs.gitlab.com/omnibus/settings/nginx.html#setting-the-nginx-listen-address-or-addresses).
 
 ### Secondary site returns 502 errors with Geo proxying
 
@@ -1167,7 +1167,7 @@ To fix this issue, set the primary site's internal URL to a URL that is:
 
 You may have problems if you're running a version of
 [Git LFS](https://git-lfs.github.com/) before 2.4.2.
 As noted in [this authentication issue](https://github.com/git-lfs/git-lfs/issues/3025),
-requests redirected from the secondary to the primary node do not properly send the
+requests redirected from the secondary to the primary site do not properly send the
 Authorization header. This may result in either an infinite `Authorization <->
 Redirect` loop, or Authorization error messages.
 
@@ -1194,13 +1194,13 @@ The partial failover to a secondary Geo *site* may be the result of a temporary/
 
 1. SSH into every Sidekiq, PostgreSQL, Gitaly, and Rails node in the **secondary** site and run one of the following commands:
 
-   - To promote the secondary node to primary:
+   - To promote the secondary site to primary:
 
      ```shell
     sudo gitlab-ctl geo promote
     ```
 
-   - To promote the secondary node to primary **without any further confirmation**:
+   - To promote the secondary site to primary **without any further confirmation**:
 
     ```shell
     sudo gitlab-ctl geo promote --force
    ```
 
@@ -1230,3 +1230,37 @@ If the above steps are **not successful**, proceed through the next steps:
 
 1. Verify you can connect to the newly-promoted **primary** site using the URL used
    previously for the **secondary** site.
 1. If successful, the **secondary** site is now promoted to the **primary** site.
+
+## Additional tools
+
+There are useful snippets for manipulating Geo internals in the [GitLab Rails Cheat Sheet](../../troubleshooting/gitlab_rails_cheat_sheet.md#geo). For example, you can find how to manually sync or verify a replicable in Rails console.
+
+## Check OS locale data compatibility
+
+If different operating systems or different operating system versions are deployed across Geo sites, we recommend that you perform a locale data compatibility check before setting up Geo.
+
+Geo uses PostgreSQL and Streaming Replication to replicate data across Geo sites. PostgreSQL uses locale data provided by the operating system’s C library for sorting text. If the locale data in the C library is incompatible across Geo sites, it can cause erroneous query results that lead to [incorrect behavior on secondary sites](https://gitlab.com/gitlab-org/gitlab/-/issues/360723). For more details, see the [PostgreSQL wiki page about locale data changes](https://wiki.postgresql.org/wiki/Locale_data_changes).
+
+On all hosts running PostgreSQL, across all Geo sites, run the following shell command:
+
+```shell
+( echo "1-1"; echo "11" ) | LC_COLLATE=en_US.UTF-8 sort
+```
+
+The output will either look like:
+
+```plaintext
+1-1
+11
+```
+
+or the reverse order:
+
+```plaintext
+11
+1-1
+```
+
+If the output is identical on all hosts, then they are running compatible versions of locale data.
+
+If the output differs on some hosts, then PostgreSQL replication will not work properly. We advise that you select operating system versions that are compatible.
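+
+On Linux, most locale data incompatibilities come from glibc upgrades (notably glibc 2.28, which the
+PostgreSQL wiki page above describes). As a supplementary cross-check, you can record which C library
+version each database host reports. The following is a minimal sketch, not an official GitLab tool: it
+assumes glibc-based Linux hosts, SSH access from your workstation, and hypothetical host names that you
+must replace with your own PostgreSQL hosts across all Geo sites:
+
+```shell
+# Hypothetical host names; replace with the PostgreSQL hosts of your
+# primary and secondary sites.
+for host in primary-db.example.com secondary-db.example.com; do
+  # getconf GNU_LIBC_VERSION prints the C library version on glibc-based
+  # distributions, for example "glibc 2.31".
+  echo -n "${host}: "
+  ssh "${host}" getconf GNU_LIBC_VERSION
+done
+```
+
+Treat the `sort` check above as the authoritative signal: identical glibc versions make incompatible
+locale data unlikely, while versions on opposite sides of the glibc 2.28 boundary usually mean the
+locale data differs.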