/*! @page timestamp_prepare_roundup Automatic prepare timestamp rounding

@section timestamp_prepare_roundup_replay Replaying prepared transactions by rounding up the prepare timestamp

WiredTiger provides a configuration keyword for rounding up the timestamps
of prepared transactions. Applications can configure
<code>roundup_timestamps=(prepared=true)</code> with the
WT_SESSION::begin_transaction method.
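
As a minimal sketch of the configuration strings involved in a replay, the
helper below formats timestamp settings in the hexadecimal form WiredTiger
uses for timestamps in configuration strings. The names
<code>REPLAY_BEGIN_CONFIG</code> and <code>format_timestamp_config</code> are
hypothetical and are not part of the WiredTiger API; the resulting strings
are assumed to be passed to WT_SESSION::begin_transaction,
WT_SESSION::prepare_transaction and WT_SESSION::commit_transaction by the
application's recovery code.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical constant: the begin_transaction configuration that enables
// rounding up for a replayed prepared transaction.
static const char *const REPLAY_BEGIN_CONFIG = "roundup_timestamps=(prepared=true)";

// Hypothetical helper: write "<key>=<timestamp in hex>" into buf.
// Returns 0 on success, or -1 if formatting failed or buf is too small.
static int
format_timestamp_config(char *buf, size_t len, const char *key, uint64_t ts)
{
    int n;

    n = snprintf(buf, len, "%s=%" PRIx64, key, ts);
    return (n < 0 || (size_t)n >= len ? -1 : 0);
}
```

For example, formatting the key <code>prepare_timestamp</code> with the value
42 yields <code>prepare_timestamp=2a</code>, which recovery code could pass
to WT_SESSION::prepare_transaction after beginning the transaction with
<code>REPLAY_BEGIN_CONFIG</code>.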

It is possible for a system crash to cause a prepared transaction to be
rolled back. Because the durable timestamp of a transaction is permitted
to be later than the prepared transaction's commit timestamp, it is even
possible for a system crash to cause a prepared and committed transaction
to be rolled back. Part of the purpose of the timestamp interface is to
allow such transactions to be replayed at their original timestamps during
an application-level recovery phase.

Under ordinary circumstances this is purely an application concern. However,
because the stable timestamp is also allowed to move forward after a
transaction prepares, strict enforcement of the timestamping rules can make
it impossible to replay prepared transactions at their original timestamps.

The setting <code>roundup_timestamps=(prepared=true)</code> is provided to
handle this problem. It disables the normal restriction that the prepare
timestamp must be greater than the stable timestamp. In addition, the
prepare timestamp is rounded up to the <i>oldest</i> timestamp (not the
stable timestamp) if necessary and then the commit timestamp is rounded up
to the prepare timestamp. The rounding provides some measure of safety by
disallowing operations before the oldest timestamp.
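
The rounding rule described above can be modeled in a few lines. This is an
illustrative sketch only, not WiredTiger's implementation: the type
<code>txn_timestamps</code> and the functions <code>ts_max</code> and
<code>roundup_prepared</code> are hypothetical names introduced here, and
timestamps are modeled as plain 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical model of the two timestamps affected by rounding.
struct txn_timestamps {
    uint64_t prepare;
    uint64_t commit;
};

static uint64_t
ts_max(uint64_t a, uint64_t b)
{
    return (a > b ? a : b);
}

// Model of the rounding: the prepare timestamp is raised to the oldest
// timestamp if it is earlier, then the commit timestamp is raised to the
// (possibly rounded) prepare timestamp. The stable timestamp plays no part.
static struct txn_timestamps
roundup_prepared(uint64_t oldest, struct txn_timestamps requested)
{
    struct txn_timestamps effective;

    effective.prepare = ts_max(requested.prepare, oldest);
    effective.commit = ts_max(requested.commit, effective.prepare);
    assert(effective.prepare >= oldest);
    assert(effective.commit >= effective.prepare);
    return (effective);
}
```

In this model, a transaction requesting prepare timestamp 30 and commit
timestamp 40 while the oldest timestamp is 50 ends up with both timestamps
at 50. A transaction whose timestamps are already at or after the oldest
timestamp and correctly ordered is left unchanged.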

\warning
This setting is dangerous. It is safe to replay a prepared transaction at
its original timestamps, regardless of the current stable timestamp, as
long as it is done during an application recovery phase after a crash and
before any ordinary operations are allowed. Using this setting to prepare
and/or commit before the current stable timestamp for any other purpose
can lead to data inconsistency. Likewise, replaying anything other than the
exact transaction that successfully prepared before the crash can lead to
subtle inconsistencies. If in any doubt, it is far safer to either abort the
transaction (this requires no further action in WiredTiger) or not allow the
stable timestamp to advance past the commit timestamp of a transaction that
has been prepared.

@section timestamp_prepare_roundup_safety Safety rationale and details

When a transaction is prepared, rolled back by a crash, and then replayed,
there is a period of execution time during which the transaction's updates
do not appear. Reads or writes made during this period that intersect with
the transaction will not see it and can therefore produce incorrect results.

An <i>application recovery phase</i> is a startup phase in application
code that is responsible for returning the application to a running
state after a crash.
It executes after WiredTiger's own recovery completes and before the
application resumes normal operation.
(For a distributed application this may have nontrivial aspects.)
The important property is that only application-level recovery code
executes, and that code is expected to be able to take account of
special circumstances related to recovery.

It is safe to replay a prepared transaction during an application recovery
phase if nothing makes intersecting reads or writes during the period the
prepared transaction is missing and the replay makes the exact same updates
as before the crash, so any subsequent intersecting reads or writes will
behave the same as if they had been performed before the crash. (If the
application recovery code itself makes intersecting reads before replaying
a prepared transaction, it is responsible for compensating.)

Because a transaction's durable timestamp is allowed to be
later than its commit timestamp, it is possible for a transaction to
prepare and commit and still be rolled back by a crash.
It is thus possible to perform intersecting reads that succeed (rather
than failing with ::WT_PREPARE_CONFLICT), either before or after the
crash, and these would become inconsistent if the replayed transaction
is not replayed <i>exactly</i>.

Even for transactions that prepared successfully but did not commit
before the crash, it is important to replay exactly the same write
set; otherwise reads before and after the crash might produce
::WT_PREPARE_CONFLICT inconsistently.

It is expected that the oldest timestamp will not advance during application
recovery. The rounding behavior does not check for this possibility; if for
some reason applications wish to advance the oldest timestamp while replaying
transactions during recovery, they must check their commit timestamps
explicitly to avoid committing before the oldest timestamp.
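
One way an application might perform such an explicit check is sketched
below. It assumes the current oldest timestamp is available as a hexadecimal
string (for example, as obtained from WT_CONNECTION::query_timestamp); the
function names <code>parse_hex_timestamp</code> and
<code>commit_is_at_or_after_oldest</code> are hypothetical and are not part
of WiredTiger.

```c
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

// Hypothetical helper: parse a hexadecimal timestamp string.
// Returns false if the string is empty, malformed or out of range.
static bool
parse_hex_timestamp(const char *s, uint64_t *tsp)
{
    char *end;
    unsigned long long v;

    if (s == NULL || !isxdigit((unsigned char)*s))
        return (false);
    errno = 0;
    v = strtoull(s, &end, 16);
    if (errno != 0 || *end != '\0')
        return (false);
    *tsp = (uint64_t)v;
    return (true);
}

// Hypothetical check: true only if the oldest timestamp parses and the
// commit timestamp is not before it. A parse failure is treated as unsafe.
static bool
commit_is_at_or_after_oldest(const char *oldest_hex, uint64_t commit_ts)
{
    uint64_t oldest;

    if (!parse_hex_timestamp(oldest_hex, &oldest))
        return (false);
    return (commit_ts >= oldest);
}
```

Recovery code that advances the oldest timestamp could call such a check
immediately before committing each replayed transaction and refuse to commit
when it returns false.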

*/