Remove randomized startup delays

On initial cluster formation, only one node in a multi node cluster should initialize the Mnesia database schema (i.e. form the cluster). To ensure that for nodes starting up in parallel, RabbitMQ peer discovery backends have used either locks or randomized startup delays. Locks work great: When a node holds the lock, it either starts a new blank node (if there is no other node in the cluster), or it joins an existing node. This makes it impossible to have two nodes forming the cluster at the same time. Consul and etcd peer discovery backends use locks. The lock is acquired in the consul and etcd infrastructure, respectively. For other peer discovery backends (classic, DNS, AWS), randomized startup delays were used. They work good enough in most cases. However, in https://github.com/rabbitmq/cluster-operator/issues/662 we observed that in 1% - 10% of the cases (the more nodes or the smaller the randomized startup delay range, the higher the chances), two nodes decide to form the cluster. That's bad since it will end up in a single Erlang cluster, but in two RabbitMQ clusters. Even worse, no obvious alert got triggered or error message logged. To solve this issue, one could increase the randomized startup delay range from e.g. 0m - 1m to 0m - 3m. However, this makes initial cluster formation very slow since it will take up to 3 minutes until every node is ready. In rare cases, we still end up with two nodes forming the cluster. Another way to solve the problem is to name a dedicated node to be the seed node (forming the cluster). This was explored in https://github.com/rabbitmq/cluster-operator/pull/689 and works well. Two minor downsides to this approach are: 1. If the seed node never becomes available, the whole cluster won't be formed (which is okay), and 2. it doesn't integrate with existing dynamic peer discovery backends (e.g. K8s, AWS) since nodes are not yet known at deploy time. In this commit, we take a better approach: We remove randomized startup delays altogether. We replace them with locks. However, instead of implementing our own lock implementation in an external system (e.g. in K8s), we re-use Erlang's locking mechanism global:set_lock/3. global:set_lock/3 has some convenient properties: 1. It accepts a list of nodes to set the lock on. 2. The nodes in that list connect to each other (i.e. create an Erlang cluster). 3. The method is synchronous with a timeout (number of retries). It blocks until the lock becomes available. 4. If a process that holds a lock dies, or the node goes down, the lock held by the process is deleted. The list of nodes passed to global:set_lock/3 corresponds to the nodes the peer discovery backend discovers (lists). Two special cases worth mentioning: 1. That list can be all desired nodes in the cluster (e.g. in classic peer discovery where nodes are known at deploy time) while only a subset of nodes is available. In that case, global:set_lock/3 still sets the lock not blocking until all nodes can be connected to. This is good since nodes might start sequentially (non-parallel). 2. In dynamic peer discovery backends (e.g. K8s, AWS), this list can be just a subset of desired nodes since nodes might not startup in parallel. That's also not a problem as long as the following requirement is met: "The peer disovery backend does not list two disjoint sets of nodes (on different nodes) at the same time." For example, in a 2-node cluster, the peer discovery backend must not list only node 1 on node 1 and only node 2 on node 2. Existing peer discovery backends fullfil that requirement because the resource the nodes are discovered from is global. For example, in K8s, once node 1 is part of the Endpoints object, it will be returned on both node 1 and node 2. Likewise, in AWS, once node 1 started, the described list of instances with a specific tag will include node 1 when the AWS peer discovery backend runs on node 1 or node 2. Removing randomized startup delays also makes cluster formation considerably faster (up to 1 minute faster if that was the upper bound in the range).
author: David Ansari <david.ansari@gmx.de> 2021-05-18 01:01:08 +0200
committer: David Ansari <david.ansari@gmx.de> 2021-06-03 08:01:28 +0200
commit: 0876746d5f94ad914e560d5ad0162741d88b3c29 (patch)
tree: 4d45d0df357b1e51ed688dda0365a9aa96d2a263 /deps/rabbitmq_peer_discovery_aws
parent: 4b95614f4e8e7edf7978344f68255b0d089de4a1 (diff)
download: rabbitmq-server-git-0876746d5f94ad914e560d5ad0162741d88b3c29.tar.gz
7 files changed, 86 insertions, 19 deletions
diff --git a/deps/rabbitmq_peer_discovery_aws/BUILD.bazel b/deps/rabbitmq_peer_discovery_aws/BUILD.bazel
index 432381107a..a0b11a6953 100644
--- a/deps/rabbitmq_peer_discovery_aws/BUILD.bazel
+++ b/deps/rabbitmq_peer_discovery_aws/BUILD.bazel
@@ -82,4 +82,7 @@ rabbitmq_suite(
 rabbitmq_suite(
     name = "unit_SUITE",
     size = "small",
+    runtime_deps = [
+        "@meck//:bazel_erlang_lib",
+    ],
 )
diff --git a/deps/rabbitmq_peer_discovery_aws/Makefile b/deps/rabbitmq_peer_discovery_aws/Makefile
index 5f617f28ea..6ec7bdddba 100644
--- a/deps/rabbitmq_peer_discovery_aws/Makefile
+++ b/deps/rabbitmq_peer_discovery_aws/Makefile
@@ -3,7 +3,7 @@ PROJECT_DESCRIPTION = AWS-based RabbitMQ peer discovery backend
 
 LOCAL_DEPS = inets
 DEPS = rabbit_common rabbitmq_peer_discovery_common rabbitmq_aws rabbit
-TEST_DEPS = rabbitmq_ct_helpers rabbitmq_ct_client_helpers ct_helper
+TEST_DEPS = rabbitmq_ct_helpers rabbitmq_ct_client_helpers ct_helper meck
 dep_ct_helper = git https://github.com/extend/ct_helper.git master
 
 DEP_EARLY_PLUGINS = rabbit_common/mk/rabbitmq-early-plugin.mk
diff --git a/deps/rabbitmq_peer_discovery_aws/src/rabbit_peer_discovery_aws.erl b/deps/rabbitmq_peer_discovery_aws/src/rabbit_peer_discovery_aws.erl
index 15746ab958..7c355d4094 100644
--- a/deps/rabbitmq_peer_discovery_aws/src/rabbit_peer_discovery_aws.erl
+++ b/deps/rabbitmq_peer_discovery_aws/src/rabbit_peer_discovery_aws.erl
@@ -99,9 +99,7 @@ list_nodes() ->
 -spec supports_registration() -> boolean().
 
 supports_registration() ->
-    %% see rabbitmq-peer-discovery-aws#17
-    true.
-
+    false.
 
 -spec register() -> ok.
 register() ->
@@ -116,15 +114,44 @@ unregister() ->
 post_registration() ->
     ok.
 
--spec lock(Node :: atom()) -> not_supported.
-
-lock(_Node) ->
-    not_supported.
-
--spec unlock(Data :: term()) -> ok.
+-spec lock(Node :: node()) -> {ok, {ResourceId :: string(), LockRequesterId :: node()}} | {error, Reason :: string()}.
 
-unlock(_Data) ->
-    ok.
+lock(Node) ->
+  %% call list_nodes/0 externally such that meck can mock the function
+  case ?MODULE:list_nodes() of
+    {ok, {[], disc}} ->
+          {error, "Cannot lock since no nodes got discovered."};
+    {ok, {Nodes, disc}} ->
+      case lists:member(Node, Nodes) of
+        true ->
+          rabbit_log:info("Will try to lock connecting to nodes ~p", [Nodes]),
+          LockId = rabbit_nodes:lock_id(Node),
+          Retries = rabbit_nodes:lock_retries(),
+          case global:set_lock(LockId, Nodes, Retries) of
+            true ->
+              {ok, LockId};
+            false ->
+              {error, io_lib:format("Acquiring lock taking too long, bailing out after ~b retries", [Retries])}
+          end;
+        false ->
+          %% Don't try to acquire the global lock when our own node is not discoverable by peers.
+          %% We shouldn't run into this branch because our node is running and should have been discovered.
+          {error, lists:flatten(io_lib:format("Local node ~s is not part of discovered nodes ~p", [Node, Nodes]))}
+      end;
+    {error, _} = Error ->
+      Error
+  end.
+
+-spec unlock({ResourceId :: string(), LockRequesterId :: node()}) -> ok | {error, Reason :: string()}.
+
+unlock(LockId) ->
+  case ?MODULE:list_nodes() of
+    {ok, {Nodes, disc}} ->
+      global:del_lock(LockId, Nodes),
+      ok;
+    {error, _} = Error ->
+      Error
+  end.
 
 %%
 %% Implementation
diff --git a/deps/rabbitmq_peer_discovery_aws/src/rabbitmq_peer_discovery_aws.erl b/deps/rabbitmq_peer_discovery_aws/src/rabbitmq_peer_discovery_aws.erl
index 55e83a566f..05bae26c20 100644
--- a/deps/rabbitmq_peer_discovery_aws/src/rabbitmq_peer_discovery_aws.erl
+++ b/deps/rabbitmq_peer_discovery_aws/src/rabbitmq_peer_discovery_aws.erl
@@ -45,10 +45,10 @@ unregister() ->
 post_registration() ->
     ?DELEGATE:post_registration().
 
--spec lock(Node :: atom()) -> not_supported.
+-spec lock(Node :: node()) -> {ok, {ResourceId :: string(), LockRequesterId :: node()}} | {error, Reason :: string()}.
 lock(Node) ->
     ?DELEGATE:lock(Node).
 
--spec unlock(Data :: term()) -> ok.
+-spec unlock({ResourceId :: string(), LockRequesterId :: node()}) -> ok | {error, Reason :: string()}.
 unlock(Data) ->
     ?DELEGATE:unlock(Data).
diff --git a/deps/rabbitmq_peer_discovery_aws/test/aws_ecs_util.erl b/deps/rabbitmq_peer_discovery_aws/test/aws_ecs_util.erl
index ed71a6c6a9..02f80aa487 100644
--- a/deps/rabbitmq_peer_discovery_aws/test/aws_ecs_util.erl
+++ b/deps/rabbitmq_peer_discovery_aws/test/aws_ecs_util.erl
@@ -23,7 +23,7 @@
          public_dns_names/1,
          fetch_nodes_endpoint/2]).
 
--define(ECS_CLUSTER_TIMEOUT, 120000).
+-define(ECS_CLUSTER_TIMEOUT, 120_000).
 
 %% NOTE:
 %% These helpers assume certain permissions associated with the aws credentials
diff --git a/deps/rabbitmq_peer_discovery_aws/test/integration_SUITE.erl b/deps/rabbitmq_peer_discovery_aws/test/integration_SUITE.erl
index 97130e9c20..8b969908e8 100644
--- a/deps/rabbitmq_peer_discovery_aws/test/integration_SUITE.erl
+++ b/deps/rabbitmq_peer_discovery_aws/test/integration_SUITE.erl
@@ -12,7 +12,7 @@
 -include_lib("rabbitmq_ct_helpers/include/rabbit_assert.hrl").
 
 -define(CLUSTER_SIZE, 3).
--define(TIMEOUT_MILLIS, 180000).
+-define(TIMEOUT_MILLIS, 180_000).
 
 -export([all/0,
          suite/0,
diff --git a/deps/rabbitmq_peer_discovery_aws/test/unit_SUITE.erl b/deps/rabbitmq_peer_discovery_aws/test/unit_SUITE.erl
index 791f755008..61e8ed2561 100644
--- a/deps/rabbitmq_peer_discovery_aws/test/unit_SUITE.erl
+++ b/deps/rabbitmq_peer_discovery_aws/test/unit_SUITE.erl
@@ -14,7 +14,8 @@
 
 all() ->
     [
-     {group, unit}
+     {group, unit},
+     {group, lock}
     ].
 
 groups() ->
@@ -23,7 +24,14 @@ groups() ->
                  maybe_add_tag_filters,
                  get_hostname_name_from_reservation_set,
                  registration_support
-                ]}].
+                ]},
+     {lock, [], [
+                 lock_single_node,
+                 lock_multiple_nodes,
+                 lock_local_node_not_discovered,
+                 lock_list_nodes_fails
+                ]}
+    ].
 
 %%%
 %%% Testcases
@@ -63,7 +71,36 @@ get_hostname_name_from_reservation_set(_Config) ->
     }.
 
 registration_support(_Config) ->
-    ?assertEqual(rabbit_peer_discovery_aws:supports_registration(), true).
+    ?assertEqual(false, rabbit_peer_discovery_aws:supports_registration()).
+
+lock_single_node(_Config) ->
+  LocalNode = node(),
+  meck:expect(rabbit_peer_discovery_aws, list_nodes, 0, {ok, {[LocalNode], disc}}),
+
+  {ok, LockId} = rabbit_peer_discovery_aws:lock(LocalNode),
+  ?assertEqual(ok, rabbit_peer_discovery_aws:unlock(LockId)).
+
+lock_multiple_nodes(_Config) ->
+  application:set_env(rabbit, cluster_formation, [{internal_lock_retries, 2}]),
+  LocalNode = node(),
+  OtherNode = other@host,
+  meck:expect(rabbit_peer_discovery_aws, list_nodes, 0, {ok, {[OtherNode, LocalNode], disc}}),
+
+  {ok, {LockResourceId, OtherNode}} = rabbit_peer_discovery_aws:lock(OtherNode),
+  ?assertEqual({error, "Acquiring lock taking too long, bailing out after 2 retries"},
+               rabbit_peer_discovery_aws:lock(LocalNode)),
+  ?assertEqual(ok, rabbitmq_peer_discovery_aws:unlock({LockResourceId, OtherNode})),
+  ?assertEqual({ok, {LockResourceId, LocalNode}}, rabbit_peer_discovery_aws:lock(LocalNode)),
+  ?assertEqual(ok, rabbitmq_peer_discovery_aws:unlock({LockResourceId, LocalNode})).
+
+lock_local_node_not_discovered(_Config) ->
+  meck:expect(rabbit_peer_discovery_aws, list_nodes, 0, {ok, {[n1@host, n2@host], disc}} ),
+  Expectation = {error, "Local node me@host is not part of discovered nodes [n1@host,n2@host]"},
+  ?assertEqual(Expectation, rabbit_peer_discovery_aws:lock(me@host)).
+
+lock_list_nodes_fails(_Config) ->
+  meck:expect(rabbit_peer_discovery_aws, list_nodes, 0, {error, "failed for some reason"}),
+  ?assertEqual({error, "failed for some reason"}, rabbit_peer_discovery_aws:lock(me@host)).
 
 %%%
 %%% Implementation
author	David Ansari <david.ansari@gmx.de>	2021-05-18 01:01:08 +0200
committer	David Ansari <david.ansari@gmx.de>	2021-06-03 08:01:28 +0200
commit	0876746d5f94ad914e560d5ad0162741d88b3c29 (patch)
tree	4d45d0df357b1e51ed688dda0365a9aa96d2a263 /deps/rabbitmq_peer_discovery_aws
parent	4b95614f4e8e7edf7978344f68255b0d089de4a1 (diff)
download	rabbitmq-server-git-0876746d5f94ad914e560d5ad0162741d88b3c29.tar.gz