| author | GitLab Bot <gitlab-bot@gitlab.com> | 2022-08-18 08:17:02 +0000 |
|---|---|---|
| committer | GitLab Bot <gitlab-bot@gitlab.com> | 2022-08-18 08:17:02 +0000 |
| commit | b39512ed755239198a9c294b6a45e65c05900235 (patch) | |
| tree | d234a3efade1de67c46b9e5a38ce813627726aa7 /doc/administration/geo/replication/troubleshooting.md | |
| parent | d31474cf3b17ece37939d20082b07f6657cc79a9 (diff) | |
Add latest changes from gitlab-org/gitlab@15-3-stable-ee (v15.3.0-rc42)
Diffstat (limited to 'doc/administration/geo/replication/troubleshooting.md')
-rw-r--r-- doc/administration/geo/replication/troubleshooting.md | 68
1 file changed, 49 insertions(+), 19 deletions(-)
diff --git a/doc/administration/geo/replication/troubleshooting.md b/doc/administration/geo/replication/troubleshooting.md
index 082ecbbb208..26d192f62cd 100644
--- a/doc/administration/geo/replication/troubleshooting.md
+++ b/doc/administration/geo/replication/troubleshooting.md
@@ -129,7 +129,7 @@ http://secondary.example.com/
 ```

 To find more details about failed items, check
-[the `gitlab-rails/geo.log` file](../../troubleshooting/log_parsing.md#find-most-common-geo-sync-errors)
+[the `gitlab-rails/geo.log` file](../../logs/log_parsing.md#find-most-common-geo-sync-errors)

 ### Check if PostgreSQL replication is working

@@ -191,7 +191,7 @@ If a replication slot is inactive,
 the `pg_wal` logs corresponding to the slot are reserved forever
 (or until the slot is active again). This causes continuous disk usage growth
 and the following messages appear repeatedly in the
-[PostgreSQL logs](../../logs.md#postgresql-logs):
+[PostgreSQL logs](../../logs/index.md#postgresql-logs):

 ```plaintext
 WARNING: oldest xmin is far in the past
@@ -331,8 +331,7 @@ Be sure to restart PostgreSQL for this to take effect. See the
 This occurs when PostgreSQL does not have a replication slot for the
 **secondary** node by that name.

-You may want to rerun the [replication
-process](../setup/database.md) on the **secondary** node .
+You may want to rerun the [replication process](../setup/database.md) on the **secondary** node .

 ### Message: "Command exceeded allowed execution time" when setting up replication?

@@ -376,7 +375,7 @@ log data to build up in `pg_xlog`. Removing the unused slots can reduce the amou
 Slots where `active` is `f` are not active.

 - When this slot should be active, because you have a **secondary** node configured using that slot,
-  sign in to that **secondary** node and check the [PostgreSQL logs](../../logs.md#postgresql-logs)
+  sign in to that **secondary** node and check the [PostgreSQL logs](../../logs/index.md#postgresql-logs)
   to view why the replication is not running.
 - If you are no longer using the slot (for example, you no longer have Geo enabled), you can remove it with in the
@@ -510,7 +509,7 @@ To solve this:

 1. Back up [the `.git` folder](../../repository_storage_types.md#translate-hashed-storage-paths).

-1. Optional. [Spot-check](../../troubleshooting/log_parsing.md#find-all-projects-affected-by-a-fatal-git-problem)
+1. Optional. [Spot-check](../../logs/log_parsing.md#find-all-projects-affected-by-a-fatal-git-problem)
   a few of those IDs whether they indeed correspond to a project with known Geo replication failures.
   Use `fatal: 'geo'` as the `grep` term and the following API call:

@@ -597,7 +596,7 @@ to start again from scratch, there are a few steps that can help you:
    gitlab-ctl stop geo-logcursor
    ```

-   You can watch the [Sidekiq logs](../../logs.md#sidekiq-logs) to know when Sidekiq jobs processing has finished:
+   You can watch the [Sidekiq logs](../../logs/index.md#sidekiq-logs) to know when Sidekiq jobs processing has finished:

    ```shell
    gitlab-ctl tail sidekiq
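A minimal sketch of the same check from the Rails console, assuming Sidekiq's standard `Sidekiq::Stats` API is available in your GitLab version (this example is illustrative, not part of the diff):

```ruby
# Sketch only: run in `gitlab-rails console` on the node whose queues you
# are draining. Sidekiq::Stats is part of Sidekiq's public API.
require 'sidekiq/api'

stats = Sidekiq::Stats.new
puts "Jobs still enqueued: #{stats.enqueued}"
puts "Jobs currently running: #{stats.workers_size}"

# Per-queue backlog; all zeros (and no running workers) means job
# processing has finished and it is safe to stop the services.
stats.queues.each { |queue, size| puts "#{queue}: #{size}" }
```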
@@ -837,7 +836,7 @@ to transfer each affected repository from the primary to the secondary site.

 The following are possible error messages that might be encountered during failover
 or when promoting a secondary to a primary node with strategies to resolve them.

-### Message: ActiveRecord::RecordInvalid: Validation failed: Name has already been taken
+### Message: `ActiveRecord::RecordInvalid: Validation failed: Name has already been taken`

 When [promoting a **secondary** site](../disaster_recovery/index.md#step-3-promoting-a-secondary-site),
 you might encounter the following error message:
@@ -869,11 +868,10 @@ or `gitlab-ctl promote-to-primary-node`, either:
   ```

 - Upgrade to GitLab 12.6.3 or later if it is safe to do so. For example,
-  if the failover was just a test. A [caching-related
-  bug](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/22021) was
-  fixed.
+  if the failover was just a test. A
+  [caching-related bug](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/22021) was fixed.

-### Message: ActiveRecord::RecordInvalid: Validation failed: Enabled Geo primary node cannot be disabled
+### Message: `ActiveRecord::RecordInvalid: Validation failed: Enabled Geo primary node cannot be disabled`

 If you disabled a secondary node, either with the [replication pause task](../index.md#pausing-and-resuming-replication)
 (GitLab 13.2) or by using the user interface (GitLab 13.1 and earlier), you must first
@@ -1127,12 +1125,6 @@ Geo secondary sites continue to replicate and verify data, and the secondary sit

 This bug was [fixed in GitLab 14.4](https://gitlab.com/gitlab-org/gitlab/-/issues/292983).

-### GitLab Pages return 404 errors after promoting
-
-This is due to [Pages data not being managed by Geo](datatypes.md#limitations-on-replicationverification).
-Find advice to resolve those error messages in the
-[Pages administration documentation](../../../administration/pages/index.md#404-error-after-promoting-a-geo-secondary-to-a-primary-node).
-
 ### Primary site returns 500 error when accessing `/admin/geo/replication/projects`

 Navigating to **Admin > Geo > Replication** (or `/admin/geo/replication/projects`) on a primary Geo site, shows a 500 error, while that same link on the secondary works fine. The primary's `production.log` has a similar entry to the following:
@@ -1146,7 +1138,28 @@ Geo::TrackingBase::SecondaryNotConfigured: Geo secondary database is not configu

 On a Geo primary site this error can be ignored.

-This happens because GitLab is attempting to display registries from the [Geo tracking database](../../../administration/geo/#geo-tracking-database) which doesn't exist on the primary site (only the original projects exist on the primary; no replicated projects are present, therefore no tracking database exists).
+This happens because GitLab is attempting to display registries from the [Geo tracking database](../../../administration/geo/index.md#geo-tracking-database) which doesn't exist on the primary site (only the original projects exist on the primary; no replicated projects are present, therefore no tracking database exists).
+
+### Secondary site returns 400 error "Request header or cookie too large"
+
+This error can happen when the internal URL of the primary site is incorrect.
+
+For example, when you use a unified URL and the primary site's internal URL is also equal to the external URL. This causes a loop when a secondary site proxies requests to the primary site's internal URL.
+
+To fix this issue, set the primary site's internal URL to a URL that is:
+
+- Unique to the primary site.
+- Accessible from all secondary sites.
+
+1. Enter the [Rails console](../../operations/rails_console.md) on the primary site.
+
+1. Run the following, replacing `https://unique.url.for.primary.site` with your specific internal URL.
+   For example, depending on your network configuration, you could use an IP address, like
+   `http://1.2.3.4`.
+
+   ```ruby
+   GeoNode.where(primary: true).first.update!(internal_url: "https://unique.url.for.primary.site")
+   ```

 ## Fixing client errors

@@ -1158,6 +1171,23 @@ requests redirected from the secondary to the primary node do not properly send
 Authorization header. This may result in either an infinite `Authorization <-> Redirect`
 loop, or Authorization error messages.

+### Error: Net::ReadTimeout when pushing through SSH on a Geo secondary
+
+When you push large repositories through SSH on a Geo secondary site, you may encounter a timeout.
+This is because Rails proxies the push to the primary and has a 60 second default timeout,
+[as described in this Geo issue](https://gitlab.com/gitlab-org/gitlab/-/issues/7405).
+
+Current workarounds are:
+
+- Push through HTTP instead, where Workhorse proxies the request to the primary (or redirects to the primary if Geo proxying is not enabled).
+- Push directly to the primary.
+
+Example log (`gitlab-shell.log`):
+
+```plaintext
+Failed to contact primary https://primary.domain.com/namespace/push_test.git\\nError: Net::ReadTimeout\",\"result\":null}" code=500 method=POST pid=5483 url="http://127.0.0.1:3000/api/v4/geo/proxy_git_push_ssh/push"
+```
+
 ## Recovering from a partial failover

 The partial failover to a secondary Geo *site* may be the result of a temporary/transient issue. Therefore, first attempt to run the promote command again.
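A minimal sketch of that retry, assuming the Omnibus command referenced in the hunks above (`gitlab-ctl promote-to-primary-node`); the exact command varies by GitLab version and topology, so verify it against the documentation for your release before running:

```shell
# Illustrative retry, run on the secondary node being promoted.
sudo gitlab-ctl promote-to-primary-node

# Optional follow-up: sanity-check Geo configuration on the new primary.
sudo gitlab-rake gitlab:geo:check
```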