The per-vhost DETS file with recovery terms for all queues is a bottleneck
when stopping RabbitMQ: all queues try to save their state, leading
to a very long file server mailbox and a very unpredictable time
to stop RabbitMQ (on my machine it can vary from 20 seconds to 5 minutes
with 100k classic queues).

In this PR we can still read the recovery terms from DETS, but we only
save them in per-queue files. This way each queue can quickly store its
state. Under the same conditions, my machine can consistently stop
RabbitMQ in 15 seconds or so.

The tradeoff is a slower startup time: on my machine, it goes up from
29 seconds to 38 seconds, but that's still better than what we had until
https://github.com/rabbitmq/rabbitmq-server/pull/7676 was merged a few
days ago. More importantly, the total of stop+start is lower and more
predictable.

This PR also improves shutdown with many classic queues v1.
Startup time with 100k CQv1s is so long and unpredictable that it's hard
to even tell if this PR affects it (it varies from 4 to 8 minutes for me).

Unfortunately this PR makes startup on macOS slower (~55s instead of 30s
for me), but we don't have to optimise for that. In most cases (with
far fewer queues), it won't be noticeable anyway.
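A rough sketch of the idea, with made-up module and function names (this is not the actual rabbit_recovery_terms API): each queue writes its own small terms file on shutdown, and reads fall back to the legacy per-vhost DETS table so existing data is still recoverable.
```erlang
%% Illustrative sketch only; module, file and function names are hypothetical.
-module(recovery_terms_sketch).
-export([store/2, read/2]).

%% Each queue writes its own small file, so a single DETS file server
%% no longer becomes a bottleneck when every queue stops at once.
store(QueueDir, Terms) ->
    Path = filename:join(QueueDir, ".recovery_terms"),
    ok = file:write_file(Path, term_to_binary(Terms)).

%% On startup, prefer the per-queue file; fall back to the legacy
%% per-vhost DETS table (assumed to be already open) for old data.
read(QueueDir, DetsTable) ->
    Path = filename:join(QueueDir, ".recovery_terms"),
    case file:read_file(Path) of
        {ok, Bin} ->
            {ok, binary_to_term(Bin)};
        {error, enoent} ->
            case dets:lookup(DetsTable, filename:basename(QueueDir)) of
                [{_, Terms}] -> {ok, Terms};
                []           -> {error, not_found}
            end
    end.
```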
CLI: correctly override DocGuide.virtual_hosts path segment
Closes #7716
Update 3.11.11 release notes
Update 3.10.20 release notes
3.10.20 release notes
rabbitmq/call-rabbit_mnesia-for-partition-handling-specific-code
rabbit_node_monitor: Use `rabbit_mnesia` in partition handling-specific code
[Why]
The partition detection code defines a partitioned node as an Erlang
node that is running RabbitMQ but is not among the Mnesia running nodes.

Since #7058, `rabbit_node_monitor` uses the list functions exported by
`rabbit_nodes` for everything, except the partition detection code, which
is Mnesia-specific and relies on `rabbit_mnesia:cluster_nodes/1`.

Unfortunately, we only saw regressions in the Jepsen testsuite during the
3.12.0 release cycle, because that testsuite is not executed on `main`.
It turns out that the partition detection code was using `rabbit_nodes`
list functions in two places where it should have continued to use
`rabbit_mnesia`.

[How]
The bug fix simply consists of reverting the two calls to
`rabbit_nodes` back to calls to `rabbit_mnesia`, as the code used to do.
This seems to improve the situation a lot in manual testing.

This code will go away along with our use of Mnesia in the future, so it's
not a problem to call `rabbit_mnesia` here.
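A simplified sketch of the distinction this relies on (not the actual rabbit_node_monitor code; the input list is a placeholder): a node counts as partitioned when RabbitMQ is running on it but Mnesia does not report it as a running cluster node.
```erlang
%% Simplified sketch, not the actual rabbit_node_monitor code.
%% NodesRunningRabbit is a placeholder for the list of Erlang nodes
%% currently running RabbitMQ.
partitioned_nodes(NodesRunningRabbit) ->
    MnesiaRunning = rabbit_mnesia:cluster_nodes(running),
    [N || N <- NodesRunningRabbit, not lists:member(N, MnesiaRunning)].
```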
Fix return value of mgmt login handler on bad method
|
| |
| |
| |
| | |
To match what cowboy_handler expects.
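For illustration, a minimal cowboy handler shape (not the actual rabbitmq_management login handler; `handle_login/2` is a made-up helper): whatever the method, `cowboy_handler` expects an `{ok, Req, State}` tuple back from `init/2`.
```erlang
%% Minimal sketch only; handle_login/2 is hypothetical.
init(Req0, State) ->
    case cowboy_req:method(Req0) of
        <<"POST">> ->
            handle_login(Req0, State);
        _ ->
            %% Reject other methods but still return the tuple shape
            %% that cowboy_handler expects.
            Req = cowboy_req:reply(405, #{<<"allow">> => <<"POST">>}, Req0),
            {ok, Req, State}
    end.
```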
Closes #7685
Refactor selenium tests
Use 3.11.11 for mixed version testing
* Faster all_queue_directory_names/1
* Optimise writing stub files
Combined, these changes reduce node startup time by half with many empty classic queues v2
Revert #7672
rabbitmq/mk-switch-cq-version-to-2-by-default"
This reverts commit f6e1a6e74bc916e17a26a9fed0549df08a26355b, reversing
changes made to c4d6503cad6666307c1ddc3f67ae3617c1c117b8.
The rabbit_fifo_dlx_worker should be co-located with the quorum
queue leader.
If a new leader on a different node gets elected before the
rabbit_fifo_dlx_worker initialises (i.e. registers itself as a
consumer), it should stop itself normally, such that it is not restarted
by rabbit_fifo_dlx_sup.
Another rabbit_fifo_dlx_worker should be created on the new quorum
queue leader node.
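A rough sketch of that intent (not the actual rabbit_fifo_dlx_worker code; the `#state{}` record and the helpers are made up): during its deferred initialisation the worker checks whether the local node still hosts the leader, and if not, it stops with reason `normal` so the supervisor does not restart it on this node.
```erlang
%% Rough sketch only; is_local_leader/1, register_as_consumer/2 and the
%% #state{} record are hypothetical.
handle_continue(init, #state{queue = QRef} = State) ->
    case is_local_leader(QRef) of
        true ->
            {ok, State1} = register_as_consumer(QRef, State),
            {noreply, State1};
        false ->
            %% A fresh worker will be started on the new leader node;
            %% a normal exit is not restarted by the supervisor.
            {stop, normal, State}
    end.
```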
Previously, it used the default intensity:
"intensity defaults to 1 and period defaults to 5."
However, that is quite low given there can be dozens or hundreds of DLX
workers: if only 2 fail within 5 seconds, the whole supervisor
terminates.

Even with the new values, there shouldn't be any infinite loop of the
supervisor terminating and restarting children, because the
rabbit_fifo_dlx_worker is terminated and started very quickly,
given that the (slow) consumer registration happens in
rabbit_fifo_dlx_worker:handle_continue/2.
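For reference, this is the kind of change it implies in the supervisor's `init/1` (the numbers and strategy below are illustrative, not the values actually chosen):
```erlang
%% Illustrative values only; child specs omitted in this sketch.
init([]) ->
    SupFlags = #{strategy  => one_for_one,
                 intensity => 10,  %% allow up to 10 restarts...
                 period    => 2},  %% ...within any 2-second window
    {ok, {SupFlags, []}}.
```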
[Jepsen dead lettering tests](https://github.com/rabbitmq/rabbitmq-ci/blob/5977f587e203698b8f281ed52b636d60489883b7/jepsen/scripts/qq-jepsen-test.sh#L108)
of job `qq-jepsen-test-3-12` of Concourse pipeline `jepsen-tests`
sometimes fail with the following error:
```
{{:try_clause, [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]}, {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]}, [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]}, {:erl_eval, :exprs, 2, []}]}
```
At the end of the Jepsen test, there are 2 DLX workers on the same node.

Analysing the logs reveals the following:

The source quorum queue node becomes leader and starts its DLX worker:
```
2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/'
```
Less than 1 second later, Mnesia reports a network partition (introduced
by Jepsen).

The DLX worker does not succeed in registering as a consumer of its source quorum queue because the Ra command times out:
```
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}: {timeout,
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> {'%2F_jepsen.queue',
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}}
2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)...
```
3 seconds after the DLX worker was created, the local source quorum queue node is no longer the leader:
```
2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3
```
But because the DLX worker failed to register as a consumer at this point, it will not be terminated in
https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx.erl#L264-L275

Eventually, when the local node becomes a leader again, that DLX worker succeeds in registering
as a consumer (due to the retries in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx_client.erl#L41-L58)
and stays alive. When that happens, there is a second DLX worker active, because that second
worker was started when the local quorum queue node transitioned to leader.

This commit prevents this issue: the last consumer that performs a `#checkout{}` wins, and the "old" one has to terminate.
Classic queues: make CQv2 the new default
CQv2 is significantly more efficient (2-4x on some workloads),
has a lower and more predictable memory footprint, and eliminates
the need to make classic queues lazy to achieve that predictability.

Per several discussions with the team.
More 3.12.0 release notes updates
3.12.0 release notes corrections
rabbitmq/rin/queue-deleted-events-include-queue-type
Include the queue type in the queue_deleted rabbit_event
These will always be rabbit_classic_queue queues
This is useful for understanding whether a deleted queue matched any
policies, given the more selective policies introduced in #7601.

This does not apply to the bulk deletion of transient queues on node down.
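For illustration only (the exact set of fields in the real event is not reproduced here, and the helper name is made up): the idea is that the `queue_deleted` event now carries the queue type module alongside the name.
```erlang
%% Illustrative sketch; field names other than the type are assumptions.
notify_queue_deleted(QName, QTypeModule, ActingUser) ->
    rabbit_event:notify(queue_deleted,
                        [{name, QName},
                         {type, QTypeModule},  %% e.g. rabbit_classic_queue
                         {user_who_performed_action, ActingUser}]).
```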
Fix flaky test rabbit_mqtt_qos0_queue_overflow
The test always succeeds on the `main` branch.
The test also always succeeds on the `mc` branch when run remotely:
```
bazel test //deps/rabbitmq_mqtt:reader_SUITE --test_env FOCUS="-group tests -case rabbit_mqtt_qos0_queue_overflow" --config=rbe-25 -t- --runs_per_test=50
```
However, the test flakes when run locally on macOS on the `mc` branch:
```
make -C deps/rabbitmq_mqtt ct-reader t=tests:rabbit_mqtt_qos0_queue_overflow FULL=1
```
with the following local changes:
```
diff --git a/deps/rabbitmq_mqtt/test/reader_SUITE.erl b/deps/rabbitmq_mqtt/test/reader_SUITE.erl
index fb71eae375..21377a2e73 100644
--- a/deps/rabbitmq_mqtt/test/reader_SUITE.erl
+++ b/deps/rabbitmq_mqtt/test/reader_SUITE.erl
@@ -27,7 +27,7 @@ all() ->
groups() ->
[
- {tests, [],
+ {tests, [{repeat_until_any_fail, 30}],
[
block_connack_timeout,
handle_invalid_packets,
@@ -43,7 +43,7 @@ groups() ->
].
suite() ->
- [{timetrap, {seconds, 60}}].
+ [{timetrap, {minutes, 60}}].
%% -------------------------------------------------------------------
%% Testsuite setup/teardown.
```
It fails prior to this commit after the 2nd repetition and does not fail after
this commit.
3.12.0 release notes edits
When rabbit_policy:match_all/2 is called with the name of a queue,
look up the queue type to correctly match the extra policy granularity
added in #7601.
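A rough sketch of the lookup this involves (not the actual rabbit_policy code; the function name is made up): resolve the queue record from its name so that the queue's type can take part in the matching.
```erlang
%% Rough sketch only; queue_name_to_match_input/1 is hypothetical.
queue_name_to_match_input(QName) ->
    case rabbit_amqqueue:lookup(QName) of
        {ok, Q} ->
            %% amqqueue:get_type/1 returns the queue type module,
            %% e.g. rabbit_classic_queue or rabbit_quorum_queue.
            {ok, {QName, amqqueue:get_type(Q)}};
        {error, not_found} ->
            {error, not_found}
    end.
```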
Give each of the summary jobs in actions different names
Otherwise they do not appear to be selectable in the GitHub branch
protection rules UI.
rabbitmq/stream-take-credit-even-for-inactive-subscription
Take credits for inactive stream subscription
Not taking the credits can starve the subscription,
leaving it permanently under its credit send limit.
The subscription then never dispatches messages when
it becomes active again.

This happens in an active-inactive-active cycle, especially
with slow consumers.
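A hand-wavy sketch of the idea with made-up names (not the actual rabbitmq_stream reader code): credit received for a subscription is accounted for even while the subscription is inactive, so the accounting is correct once the subscription becomes active again.
```erlang
%% Hand-wavy sketch; the map shape and the function name are hypothetical.
handle_credit(SubscriptionId, Credit, Subscriptions) ->
    maps:update_with(SubscriptionId,
                     fun(#{credit := C} = Sub) ->
                             %% Applied whether or not the subscription
                             %% is currently active.
                             Sub#{credit => C + Credit}
                     end,
                     Subscriptions).
```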
Add "Summary" jobs to test workflows in actions