Commit message | Author | Age | Files | Lines
* Recovery terms in per-queue files instead of DETS (branch: per-queue-recovery-terms) | Michal Kuratczyk | 2023-03-23 | 6 files, -40/+59

  The per-vhost DETS file with recovery terms for all queues is a bottleneck when stopping RabbitMQ: all queues try to save their state, leading to a very long file server mailbox and a very unpredictable time to stop RabbitMQ (on my machine it can vary from 20 seconds to 5 minutes with 100k classic queues). In this PR we can still read the recovery terms from DETS, but we only save them in per-queue files. This way each queue can quickly store its state. Under the same conditions, my machine can consistently stop RabbitMQ in about 15 seconds.

  The tradeoff is a slower startup time: on my machine it goes up from 29 seconds to 38 seconds, but that's still better than what we had until https://github.com/rabbitmq/rabbitmq-server/pull/7676 was merged a few days ago. More importantly, the total of stop+start is lower and more predictable.

  This PR also improves shutdown with many classic queues v1. Startup time with 100k CQv1s is so long and unpredictable that it's hard to even tell if this PR affects it (it varies from 4 to 8 minutes for me). Unfortunately, this PR makes startup on macOS slower (~55s instead of 30s for me), but we don't have to optimise for that. In most cases (with far fewer queues), it won't be noticeable anyway.
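The bottleneck above is a classic single-writer serialization problem. As a rough Python analogue of the idea (not RabbitMQ's Erlang code; `save_recovery_terms` and the queue names are hypothetical), each queue persists its own small recovery file, so shutdown work is independent per queue instead of funnelled through one shared DETS table:

```python
# Illustrative sketch only: one small state file per queue, written
# independently, instead of one shared store serialized by a single
# file-server process.
import json
import os
import tempfile

def save_recovery_terms(dir_path: str, queue_name: str, terms: dict) -> str:
    """Write one queue's recovery terms to its own file and return the path."""
    path = os.path.join(dir_path, f"{queue_name}.recovery")
    with open(path, "w") as f:
        json.dump(terms, f)
    return path

dir_path = tempfile.mkdtemp()
queues = {"q1": {"unacked": 0}, "q2": {"unacked": 3}}
paths = [save_recovery_terms(dir_path, name, t) for name, t in queues.items()]
print(len(paths))  # 2
```

Because each write touches a distinct file, no queue waits on another queue's state being flushed, which is the property the commit relies on for fast, predictable shutdown.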
* Merge pull request #7717 from rabbitmq/rabbitmq-server-7716 | Michael Klishin | 2023-03-22 | 1 file, -1/+1

  CLI: correctly override DocGuide.virtual_hosts path segment

  * CLI: correctly override DocGuide.virtual_hosts path segment | Michael Klishin | 2023-03-22 | 1 file, -1/+1

    Closes #7716
* CLI: define a couple more doc guides | Michael Klishin | 2023-03-22 | 1 file, -1/+3
* Split 3.11.12 and 3.11.11 release notes, lumped together by mistake | Michael Klishin | 2023-03-22 | 2 files, -16/+64
* Merge pull request #7711 from rabbitmq/mk-update-3.11.11-release-notes | Michael Klishin | 2023-03-22 | 1 file, -0/+28

  Update 3.11.11 release notes

  * Update 3.11.11 release notes | Michael Klishin | 2023-03-22 | 1 file, -0/+28
* Merge pull request #7707 from rabbitmq/mk-update-3.10.20-release-notes | Michael Klishin | 2023-03-22 | 1 file, -0/+24

  Update 3.10.20 release notes

  * Update 3.10.20 release notes | Michael Klishin | 2023-03-22 | 1 file, -0/+24
* Merge pull request #7700 from rabbitmq/mk-3.10.20-release-notes | Michael Klishin | 2023-03-22 | 1 file, -0/+66

  3.10.20 release notes

  * 3.10.20 release notes | Michael Klishin | 2023-03-22 | 1 file, -0/+66
* Merge pull request #7686 from rabbitmq/call-rabbit_mnesia-for-partition-handling-specific-code | Arnaud Cogoluègnes | 2023-03-21 | 1 file, -2/+2

  rabbit_node_monitor: Use `rabbit_mnesia` in partition handling-specific code

  * rabbit_node_monitor: Use `rabbit_mnesia` in partition handling-specific code | Jean-Sébastien Pédron | 2023-03-21 | 1 file, -2/+2

    [Why]
    The partition detection code defines a partitioned node as an Erlang node running RabbitMQ which is not among the Mnesia running nodes. Since #7058, `rabbit_node_monitor` uses the list functions exported by `rabbit_nodes` for everything, except the partition detection code, which is Mnesia-specific and relies on `rabbit_mnesia:cluster_nodes/1`.

    Unfortunately, we only saw regressions in the Jepsen testsuite during the 3.12.0 release cycle because that testsuite is not executed on `main`. It turns out the partition detection code was using `rabbit_nodes` list functions in two places where it should have continued to use `rabbit_mnesia`.

    [How]
    The bug fix simply consists of reverting the two calls to `rabbit_nodes` back to calls to `rabbit_mnesia`, as the code used to do. This seems to improve the situation a lot in manual testing. This code will go away along with our use of Mnesia in the future, so it's not a problem to call `rabbit_mnesia` here.
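The definition above amounts to a set difference. A minimal illustration in Python rather than Erlang (function and node names are hypothetical, not RabbitMQ's API):

```python
# Sketch of the invariant: a "partitioned" node runs RabbitMQ but is
# absent from the set of Mnesia running nodes.
def partitioned_nodes(rabbit_nodes, mnesia_running_nodes):
    """Nodes running RabbitMQ that Mnesia does not see as running."""
    return sorted(set(rabbit_nodes) - set(mnesia_running_nodes))

rabbit = ["rabbit@a", "rabbit@b", "rabbit@c"]
mnesia_running = ["rabbit@a", "rabbit@c"]
print(partitioned_nodes(rabbit, mnesia_running))  # ['rabbit@b']
```

The regression came from taking the minuend and subtrahend from two different membership views; the fix keeps both sides of this difference sourced from Mnesia.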
* Merge pull request #7675 from cloudamqp/login_handler | Michael Klishin | 2023-03-21 | 1 file, -3/+3

  Fix return value of mgmt login handler on bad method

  * Fix return value of mgmt login handler on bad method | Péter Gömöri | 2023-03-20 | 1 file, -3/+3

    To match what cowboy_handler expects.
* Merge pull request #7689 from rabbitmq/rabbitmq-server-7685 | Michael Klishin | 2023-03-21 | 2 files, -23/+32

  Closes #7685

  * Closes #7685 | Michael Klishin | 2023-03-21 | 2 files, -23/+32
* Merge pull request #7687 from rabbitmq/refactor-selenium-tests | Michael Klishin | 2023-03-21 | 82 files, -1149/+3113

  Refactor selenium tests

  * Refactor selenium tests | Marcial Rosales | 2023-03-21 | 82 files, -1149/+3113
* Merge pull request #7682 from rabbitmq/bump-mixed-version-cluster-broker | Michael Klishin | 2023-03-21 | 1 file, -4/+3

  Use 3.11.11 for mixed version testing

  * Use 3.11.11 for mixed version testing | Rin Kuryloski | 2023-03-21 | 1 file, -4/+3
* Faster node start with many classic queues v2 (#7676) | Michal Kuratczyk | 2023-03-21 | 2 files, -8/+25

  * Faster all_queue_directory_names/1
  * Optimise writing stub files

  Combined, this reduces node startup time by half with many empty classic queues v2.
* Update PROTOCOL.adoc | David Ansari | 2023-03-21 | 1 file, -2/+1
* Merge pull request #7684 from rabbitmq/revert-7672 | Rin Kuryloski | 2023-03-21 | 6 files, -19/+8

  Revert #7672

  * Revert "Merge pull request #7672 from rabbitmq/mk-switch-cq-version-to-2-by-default" | Rin Kuryloski | 2023-03-21 | 6 files, -19/+8

    This reverts commit f6e1a6e74bc916e17a26a9fed0549df08a26355b, reversing changes made to c4d6503cad6666307c1ddc3f67ae3617c1c117b8.
* Do not restart DLX worker if leader is non-local | David Ansari | 2023-03-20 | 3 files, -14/+19

  The rabbit_fifo_dlx_worker should be co-located with the quorum queue leader. If a new leader on a different node gets elected before the rabbit_fifo_dlx_worker initialises (i.e. registers itself as a consumer), it should stop itself normally, such that it is not restarted by rabbit_fifo_dlx_sup. Another rabbit_fifo_dlx_worker should be created on the new quorum queue leader node.
* Make rabbit_fifo_dlx_sup more resilient | David Ansari | 2023-03-20 | 1 file, -2/+2

  Previously, it used the default restart intensity: "intensity defaults to 1 and period defaults to 5." That is a bit low given there can be dozens or hundreds of DLX workers: if only 2 fail within 5 seconds, the whole supervisor terminates. Even with the new values, there shouldn't be any infinite loop of the supervisor terminating and restarting children, because the rabbit_fifo_dlx_worker is terminated and started very quickly: the (slow) consumer registration happens in rabbit_fifo_dlx_worker:handle_continue/2.
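For readers unfamiliar with Erlang supervisor restart intensity: if more than `intensity` child restarts occur within `period` seconds, the supervisor itself gives up and terminates. A hedged Python sketch of that sliding-window rule (hypothetical names, not OTP code):

```python
# Toy model of OTP's restart-intensity check: keep timestamps of recent
# restarts, drop those older than `period`, and fail once the count
# exceeds `intensity`. With the old defaults (intensity 1, period 5),
# a second restart within 5 seconds already exceeds the limit.
from collections import deque

class RestartIntensity:
    def __init__(self, intensity: int, period: float):
        self.intensity = intensity
        self.period = period
        self.restarts = deque()  # timestamps of recent child restarts

    def record_restart(self, now: float) -> bool:
        """Record a child restart; False means the supervisor must give up."""
        self.restarts.append(now)
        while self.restarts and now - self.restarts[0] > self.period:
            self.restarts.popleft()
        return len(self.restarts) <= self.intensity

old_defaults = RestartIntensity(intensity=1, period=5)
print(old_defaults.record_restart(0.0))  # True: one restart is allowed
print(old_defaults.record_restart(2.0))  # False: 2 restarts within 5s
```

This is why the commit raises the limits: with many DLX workers, two quick failures under the defaults would take down the whole supervisor rather than just the failing children.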
* Terminate replaced rabbit_fifo_dlx_worker | David Ansari | 2023-03-20 | 1 file, -1/+10

  [Jepsen dead lettering tests](https://github.com/rabbitmq/rabbitmq-ci/blob/5977f587e203698b8f281ed52b636d60489883b7/jepsen/scripts/qq-jepsen-test.sh#L108) of job `qq-jepsen-test-3-12` of Concourse pipeline `jepsen-tests` sometimes fail with the following error:

  ```
  {{:try_clause,
    [{:undefined, #PID<12128.3596.0>, :worker, [:rabbit_fifo_dlx_worker]},
     {:undefined, #PID<12128.10212.0>, :worker, [:rabbit_fifo_dlx_worker]}]},
   [{:erl_eval, :try_clauses, 10, [file: 'erl_eval.erl', line: 995]},
    {:erl_eval, :exprs, 2, []}]}
  ```

  At the end of the Jepsen test, there are 2 DLX workers on the same node. Analysing the logs reveals the following. The source quorum queue node becomes leader and starts its DLX worker:

  ```
  2023-03-18 12:14:04.365295+00:00 [debug] <0.1645.0> started rabbit_fifo_dlx_worker <0.3596.0> for queue 'jepsen.queue' in vhost '/'
  ```

  Less than 1 second later, Mnesia reports a network partition (introduced by Jepsen). The DLX worker does not succeed in registering as a consumer of its source quorum queue because the Ra command times out:

  ```
  2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Failed to process command {dlx,{checkout,<0.3596.0>,32}} on quorum queue leader {'%2F_jepsen.queue',
  2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}: {timeout,
  2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> {'%2F_jepsen.queue',
  2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> 'rabbit@concourse-qq-jepsen-312-3'}}
  2023-03-18 12:15:04.365840+00:00 [warning] <0.3596.0> Trying 5 more time(s)...
  ```

  3 seconds after the DLX worker got created, the local source quorum queue node is no longer leader:

  ```
  2023-03-18 12:14:07.289213+00:00 [notice] <0.1645.0> queue 'jepsen.queue' in vhost '/': leader -> follower in term: 17 machine version: 3
  ```

  But because the DLX worker failed to register as a consumer at this point, it is not terminated in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx.erl#L264-L275. Eventually, when the local node becomes leader again, that DLX worker succeeds in registering as a consumer (due to retries in https://github.com/rabbitmq/rabbitmq-server/blob/865d533863c29ed52e03070ac8d9e1bcaee8b205/deps/rabbit/src/rabbit_fifo_dlx_client.erl#L41-L58) and stays alive. When that happens, a 2nd DLX worker is active, because a 2nd one got started when the local quorum queue node transitioned to become a leader.

  This commit prevents this issue: the last consumer that does a `#checkout{}` wins, and the "old one" has to terminate.
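The last-checkout-wins rule can be illustrated with a toy registry, in Python rather than the actual Erlang state machine (all names hypothetical):

```python
# Sketch: registering a new DLX worker for a queue replaces the previous
# one, and the replaced worker is told to terminate, so two workers never
# stay active for the same queue.
class DlxRegistry:
    def __init__(self):
        self.workers = {}     # queue name -> currently active worker id
        self.terminated = []  # worker ids asked to shut down

    def checkout(self, queue: str, worker: str) -> None:
        old = self.workers.get(queue)
        if old is not None and old != worker:
            self.terminated.append(old)  # replaced worker must stop
        self.workers[queue] = worker

reg = DlxRegistry()
reg.checkout("jepsen.queue", "pid-3596")   # first worker registers
reg.checkout("jepsen.queue", "pid-10212")  # later worker replaces it
print(reg.workers["jepsen.queue"], reg.terminated)
```

Under this rule, the stale worker that finally completes its delayed registration cannot coexist with the worker started for the new leader: whichever checks out last stays, the other terminates.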
* 3.12.0 release notes: don't claim that CQv2 is the default version | Michael Klishin | 2023-03-20 | 1 file, -4/+4

* 3.12.0 release notes: mention #6440 | Michael Klishin | 2023-03-20 | 1 file, -2/+6
* Merge pull request #7672 from rabbitmq/mk-switch-cq-version-to-2-by-default | Michael Klishin | 2023-03-20 | 6 files, -8/+19

  Classic queues: make CQv2 the new default

  * Make CQv2 the new default | Michael Klishin | 2023-03-20 | 6 files, -8/+19

    CQv2 is significantly more efficient (2-4x on some workloads), has a lower and more predictable memory footprint, and eliminates the need to make classic queues lazy to achieve that predictability. Per several discussions with the team.
* 3.11.11 release notes | Michael Klishin | 2023-03-20 | 1 file, -0/+86
* Merge pull request #7670 from rabbitmq/mk-update-3.12.0-release-notes-again | Michael Klishin | 2023-03-18 | 1 file, -1/+13

  More 3.12.0 release notes updates

  * More 3.12.0 release notes updates | Michael Klishin | 2023-03-18 | 1 file, -1/+13
* Merge pull request #7666 from rabbitmq/mk-update-3.12.0-beta-number | Michael Klishin | 2023-03-18 | 1 file, -4/+4

  3.12.0 release notes corrections

  * 3.12.0 release notes corrections | Michael Klishin | 2023-03-18 | 1 file, -4/+4
* Merge pull request #7659 from rabbitmq/rin/queue-deleted-events-include-queue-type | Michael Klishin | 2023-03-18 | 9 files, -47/+43

  Include the queue type in the queue_deleted rabbit_event

  * Also include the kind in queue_deleted events for transient queues | Rin Kuryloski | 2023-03-17 | 1 file, -2/+3

    These will always be rabbit_classic_queue queues.

  * Include the queue type in the queue_deleted rabbit_event | Rin Kuryloski | 2023-03-17 | 9 files, -45/+40

    This is useful for understanding whether a deleted queue matched any policies, given the more selective policies introduced in #7601. Does not apply to bulk deletion of transient queues on node down.
* Merge pull request #7663 from rabbitmq/fix-flaky-rabbit_mqtt_qos0_queue_overflow | Michael Klishin | 2023-03-17 | 1 file, -2/+2

  Fix flaky test rabbit_mqtt_qos0_queue_overflow

  * Fix flaky test rabbit_mqtt_qos0_queue_overflow | David Ansari | 2023-03-17 | 1 file, -2/+2

    The test always succeeds on the `main` branch. The test also always succeeds on the `mc` branch when running remotely:

    ```
    bazel test //deps/rabbitmq_mqtt:reader_SUITE --test_env FOCUS="-group tests -case rabbit_mqtt_qos0_queue_overflow" --config=rbe-25 -t- --runs_per_test=50
    ```

    However, the test flakes when running on the `mc` branch locally on the Mac:

    ```
    make -C deps/rabbitmq_mqtt ct-reader t=tests:rabbit_mqtt_qos0_queue_overflow FULL=1
    ```

    with the following local changes:

    ```
    diff --git a/deps/rabbitmq_mqtt/test/reader_SUITE.erl b/deps/rabbitmq_mqtt/test/reader_SUITE.erl
    index fb71eae375..21377a2e73 100644
    --- a/deps/rabbitmq_mqtt/test/reader_SUITE.erl
    +++ b/deps/rabbitmq_mqtt/test/reader_SUITE.erl
    @@ -27,7 +27,7 @@ all() ->
     groups() ->
         [
    -     {tests, [],
    +     {tests, [{repeat_until_any_fail, 30}],
           [
            block_connack_timeout,
            handle_invalid_packets,
    @@ -43,7 +43,7 @@ groups() ->
         ].

     suite() ->
    -    [{timetrap, {seconds, 60}}].
    +    [{timetrap, {minutes, 60}}].

     %% -------------------------------------------------------------------
     %% Testsuite setup/teardown.
    ```

    Prior to this commit, the test fails on the 2nd repetition; after this commit, it does not fail.
* Merge pull request #7660 from rabbitmq/mk-update-3.12.0-release-notes | Michael Klishin | 2023-03-17 | 1 file, -3/+9

  3.12.0 release notes edits

  * 3.12.0 release notes edits | Michael Klishin | 2023-03-17 | 1 file, -3/+9
* Fix rabbit_policy:match_all/2 (#7654) | Rin Kuryloski | 2023-03-16 | 1 file, -0/+3

  When rabbit_policy:match_all/2 is called with the name of a queue, look up the queue type to correctly match the extra policy granularity added in #7601.
* Merge pull request #7640 from rabbitmq/rin/fixup-ci-summary-jobs | Rin Kuryloski | 2023-03-16 | 5 files, -5/+5

  Give each of the summary jobs in actions different names

  * Give each of the summary jobs in actions different names | Rin Kuryloski | 2023-03-16 | 5 files, -5/+5

    Otherwise they do not appear to be selectable in the GitHub branch protection rules UI.
* Merge pull request #7638 from rabbitmq/stream-take-credit-even-for-inactive-subscription | Arnaud Cogoluègnes | 2023-03-16 | 1 file, -13/+31

  Take credits for inactive stream subscription

  * Take credits for inactive stream subscription | Arnaud Cogoluègnes | 2023-03-16 | 1 file, -13/+31

    Not taking the credits can starve the subscription, making it permanently under its credit send limit. The subscription then never dispatches messages when it becomes active again. This happens in an active-inactive-active cycle, especially with slow consumers.
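A hedged sketch of the credit-flow behaviour described above, in Python rather than the stream plugin's Erlang (all names hypothetical, not the real API): credit grants that arrive while a subscription is inactive must still be accounted, otherwise the subscription returns to active already below its send limit and never dispatches.

```python
# Toy model: a subscription dispatches only while active and holding
# credits. The fix modelled here is that add_credits() takes effect
# even when the subscription is inactive.
class Subscription:
    def __init__(self) -> None:
        self.credits = 0
        self.active = True

    def add_credits(self, n: int) -> None:
        # Take the credits unconditionally, even while inactive.
        self.credits += n

    def can_dispatch(self) -> bool:
        return self.active and self.credits > 0

sub = Subscription()
sub.active = False   # single-active-consumer makes this one inactive
sub.add_credits(10)  # credits granted during the inactive phase
sub.active = True    # active again after the active-inactive-active cycle
print(sub.can_dispatch())  # True: credits were not lost while inactive
```

Dropping the `add_credits` call during the inactive phase is the starvation the commit fixes: the counter would stay at 0 and `can_dispatch` would remain false forever.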
* Merge pull request #7631 from rabbitmq/rin/add-ci-summary-jobs | Rin Kuryloski | 2023-03-16 | 5 files, -1/+45

  Add "Summary" jobs to test workflows in actions