author    | Alan Conway <aconway@apache.org> | 2014-12-19 03:18:57 +0000
committer | Alan Conway <aconway@apache.org> | 2014-12-19 03:18:57 +0000
commit    | 40e74eaa3f8a345e7bc888e36de79717b7c761d0 (patch)
tree      | 4d9a08cb40caf897b9d73c55deac60374d97eb0c /qpid/cpp/examples/messaging
parent    | aa51ac52f3bd77d92acf585699bc7429666ad785 (diff)
download  | qpid-python-40e74eaa3f8a345e7bc888e36de79717b7c761d0.tar.gz
QPID-6278: HA broker abort in TXN soak test
The crash appears to be a race condition in async completion, exposed by the HA
TX code as follows:
1. Message received and placed on the tx-replication queue; completion is delayed until the backups ack.
The completion count goes up for each backup, then down as each backup acks.
2. Prepare received; message placed on the primary's local persistent queue.
The completion count goes up by one, then down by one for the local store completion (a null store in this case).
The race is something like this:
- The last backup ack arrives (on a backup IO thread) and drops the completion count to 0.
- The prepare arrives (on a client thread); the null store bumps the count to 1, and it immediately drops back to 0.
- Both threads try to invoke the completion callback; one deletes it while the other is still invoking it.
The old completion logic assumed that only one thread can see the atomic counter
go to 0. It does not handle the count going to 0 in one thread while concurrently
being increased and decreased back to 0 in another. This case is introduced by
HA transactions because the same message is put onto a tx-replication queue and
then put again onto another persistent local queue, so there are two cycles of
completion.
The new logic fixes this: only one call to the completion callback is possible in all cases.
Also fixed a missing lock in ha/Primary.cpp.
git-svn-id: https://svn.apache.org/repos/asf/qpid/trunk@1646618 13f79535-47bb-0310-9956-ffa450edef68