author David Ansari <david.ansari@gmx.de> 2022-08-04 10:14:05 +0000
committer David Ansari <david.ansari@gmx.de> 2022-08-04 13:44:51 +0000
commit 4bf78d822d7496e03061119f4cb07c0b306e4c03
tree ac9f8ef71d05d3d60dd9b55294337a66d382321f
parent 4faec42412d499cde370e6ebd680858eeeda7452
Prevent global:sync/0 from being stuck

Prior to this commit, global:sync/0 sometimes gets stuck, either when
performing a rolling update on Kubernetes or when creating a new RabbitMQ
cluster on Kubernetes.

When performing a rolling update, the node being booted will be stuck in:
```
2022-07-26 10:49:58.891896+00:00 [debug] <0.226.0> == Plugins (prelaunch phase) ==
2022-07-26 10:49:58.891908+00:00 [debug] <0.226.0> Setting plugins up
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> Loading the following plugins: [cowlib,cowboy,rabbitmq_web_dispatch,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_management_agent,amqp_client,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_management,quantile_estimator,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> prometheus,rabbitmq_peer_discovery_common,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> accept,rabbitmq_peer_discovery_k8s,
2022-07-26 10:49:58.920915+00:00 [debug] <0.226.0> rabbitmq_prometheus]
2022-07-26 10:49:58.926373+00:00 [debug] <0.226.0> Feature flags: REFRESHING after applications load...
2022-07-26 10:49:58.926416+00:00 [debug] <0.372.0> Feature flags: registering controller globally before proceeding with task: refresh_after_app_load
2022-07-26 10:49:58.926450+00:00 [debug] <0.372.0> Feature flags: [global sync] @ rabbit@r1-server-3.r1-nodes.default
```

During cluster creation, an example log of global:sync/0 being stuck can be
found in bullet point 2 of
https://github.com/rabbitmq/rabbitmq-server/pull/5331#pullrequestreview-1050715029

When global:sync/0 is stuck, it never receives a message in line
https://github.com/erlang/otp/blob/bd05b07f973f11d73c4fc77d59b69f212f121c2d/lib/kernel/src/global.erl#L2942

This issue can be observed in both `kind` and GKE.
`kind` uses CoreDNS, GKE uses kubedns.

CoreDNS does not resolve the hostname of RabbitMQ and its peers correctly
for up to 30 seconds after node startup.
This is because the default cache value of CoreDNS is 30 seconds and CoreDNS
has a bug described in
https://github.com/kubernetes/kubernetes/issues/92559

global:sync/0 is known to be buggy "in the presence of network failures"
unless the kernel parameter `prevent_overlapping_partitions` is set to `true`.

When either:
1. setting the CoreDNS cache value to 1 second (see
   https://github.com/rabbitmq/rabbitmq-server/issues/5322#issuecomment-1195826135
   on how to set this value), or
2. setting the kernel parameter `prevent_overlapping_partitions` to `true`

rolling updates do NOT get stuck anymore.

This means we are hitting a combination of:
1. a Kubernetes DNS bug that does not update DNS caches promptly for headless
   services with `publishNotReadyAddresses: true`, and
2. an Erlang bug that causes global:sync/0 to hang forever in the presence
   of network failures.

The Erlang bug is fixed by setting `prevent_overlapping_partitions` to `true`
(the default in Erlang/OTP 25).

In RabbitMQ, however, we explicitly set `prevent_overlapping_partitions` to
`false` because we fear other issues could arise if we set this parameter
to `true`.

Luckily, to resolve this issue of global:sync/0 being stuck, we can just call
the function rabbit_node_monitor:global_sync/0, which provides a workaround.
This function was introduced 8 years ago in
https://github.com/rabbitmq/rabbitmq-server/commit/9fcb31f348590a74fd526333cf881cfbe27241e6

With this commit applied, rolling updates are not stuck anymore, and in the
debug log we sometimes see the workaround being applied.
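
For reference, the kernel parameter mentioned above could be written as follows
in a generic Erlang/OTP system configuration file. This is only a sketch of the
parameter's syntax, not part of this commit: the commit message states that
RabbitMQ deliberately keeps the value at `false`, and whether a setting in a
config file would take precedence over the way RabbitMQ sets it is an assumption
that is not confirmed here.

```erlang
%% Illustrative sys.config-style fragment (assumption: not taken from this commit).
%% Sets the kernel application parameter discussed in the commit message.
[
 {kernel, [
   %% `true` is the default in Erlang/OTP 25 according to the commit message;
   %% RabbitMQ explicitly sets this to `false`.
   {prevent_overlapping_partitions, true}
 ]}
].
```

The change in this commit avoids touching this parameter at all and instead
relies on rabbit_node_monitor:global_sync/0.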
-rw-r--r-- deps/rabbit/src/rabbit_ff_controller.erl 2
1 files changed, 1 insertions, 1 deletions
diff --git a/deps/rabbit/src/rabbit_ff_controller.erl b/deps/rabbit/src/rabbit_ff_controller.erl
index f8dd874dc6..7b005c5db7 100644
--- a/deps/rabbit/src/rabbit_ff_controller.erl
+++ b/deps/rabbit/src/rabbit_ff_controller.erl
@@ -268,7 +268,7 @@ register_globally() ->
        "Feature flags: [global sync] @ ~s",
        [node()],
        #{domain => ?RMQLOG_DOMAIN_FEAT_FLAGS}),
-    ok = global:sync(),
+    ok = rabbit_node_monitor:global_sync(),
     ?LOG_DEBUG(
        "Feature flags: [global register] @ ~s",
        [node()],