1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
|
# $Id: Design,v 11.5 2002/02/01 19:07:18 bostic Exp $
Synchronization in the Locking Subsystem
This is a document that describes how we implemented fine-grain locking
in the lock manager (that is, locking on a hash bucket level instead of
locking the entire region). We found that the increase in concurrency
was not sufficient to warrant the increase in complexity or the additional
cost of performing each lock operation. Therefore, we don't use this
any more. Should we have to do fine-grain locking in a future release,
this would be a reasonable starting point.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
1. Data structures
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
The lock manager maintains 3 different structures:
Objects (__db_lockobj):
Describes an object that is locked. When used with DB, this consists
of a __db_ilock (a file identifier and a page number).
Lockers (__db_locker):
Identifies a specific locker ID and maintains the head of a list of
locks held by a locker (for using during transaction commit/abort).
Locks (__db_lock):
Describes a particular object lock held on behalf of a particular
locker id.
Objects and Lockers reference Locks.
These structures are organized via two synchronized hash tables. Each
hash table consists of two physical arrays: the array of actual hash
buckets and an array of mutexes so we can lock individual buckets, rather
than the whole table.
One hash table contains Objects and the other hash table contains Lockers.
Objects contain two lists of locks, waiters and holders: holders currently
hold a lock on the Object, waiters are lock waiting to be granted.
Lockers are a single linked list that connects the Locks held on behalf
of the specific locker ID.
In the diagram below:
Locker ID #1 holds a lock on Object #1 (L1) and Object #2 (L5), and is
waiting on a lock on Object #1 (L3).
Locker ID #2 holds a lock on Object #1 (L2) and is waiting on a lock for
Object #2 (L7).
Locker ID #3 is waiting for a lock on Object #2 (L6).
OBJECT -----------------------
HASH | |
----|------------- |
________ _______ | | ________ | |
| |-->| O1 |--|---|-->| O2 | | |
|_______| |_____| | | |______| V |
| | W H--->L1->L2 W H--->L5 | holders
|_______| | | | | V
| | ------->L3 \ ------->L6------>L7 waiters
|_______| / \ \
. . / \ \
. . | \ \
. . | \ -----------
|_______| | -------------- |
| | ____|____ ___|_____ _|______
|_______| | | | | | |
| | | LID1 | | LID2 | | LID3 |
|_______| |_______| |_______| |______|
^ ^ ^
| | |
___|________________________|________|___
LOCKER | | | | | | | | |
HASH | | | | | | | | |
| | | | | | | | |
|____|____|____|____|____|____|____|____|
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
2. Synchronization
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
There are four types of mutexes in the subsystem.
Object mutexes;
These map one-to-one to each bucket in the Object hash table.
Holding a mutex on an Object bucket secures all the Objects in
that bucket as well as the Lock structures linked from those
Objects. All fields in the Locks EXCEPT the Locker links (the
links that attach Locks by Locker ID) are protected by these
mutexes.
Locker mutexes:
These map one-to-one to each bucket in the Locker hash table.
Holding a mutex on a Locker bucket secures the Locker structures
and the Locker links in the Locks.
Memory mutex:
This mutex allows calls to allocate/free memory, i.e. calls to
__db_shalloc and __db_shalloc_free, as well as manipulation of
the Object, Locker and Lock free lists.
Region mutex:
This mutex is currently only used to protect the locker ids.
It may also be needed later to provide exclusive access to
the region for deadlock detection.
Creating or removing a Lock requires locking both the Object lock and the
Locker lock (and eventually the shalloc lock to return the item to the
free list).
The locking hierarchy is as follows:
The Region mutex may never be acquired after any other mutex.
The Object mutex may be acquired after the Region mutex.
The Locker mutex may be acquired after the Region and Object
mutexes.
The Memory mutex may be acquired after any mutex.
So, if both and Object mutex and a Locker mutex are going to be acquired,
the Object mutex must be acquired first.
The Memory mutex may be acquired after any other mutex, but no other mutexes
can be acquired once the Memory mutex is held.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
3. The algorithms:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
The locking subsystem supports four basic operations:
Get a Lock (lock_get)
Release a Lock (lock_put)
Release all the Locks on a specific Object (lock_vec)
Release all the Locks for a specific Locker (lock_vec)
Get a lock:
Acquire Object bucket mutex.
Acquire Locker bucket mutex.
Acquire Memory mutex.
If the Object does not exist
Take an Object off the freelist.
If the Locker doesn't exist
Take a Locker off the freelist.
Take a Lock off the free list.
Release Memory mutex.
Add Lock to the Object list.
Add Lock to the Locker list.
Release Locker bucket mutex
If the lock cannot be granted
Release Object bucket mutex
Acquire lock mutex (blocks)
Acquire Object bucket mutex
If lock acquisition did not succeed (e.g, deadlock)
Acquire Locker bucket mutex
If locker should be destroyed
Remove locker from hash table
Acquire Memory mutex
Return locker to free list
Release Memory mutex
Release Locker bucket mutex
If object should be released
Acquire Memory mutex
Return object to free list
Release Memory mutex
Release Object bucket mutex
Release a lock:
Acquire Object bucket mutex.
(Requires that we be able to find the Object hash bucket
without looking inside the Lock itself.)
If releasing a single lock and the user provided generation number
doesn't match the Lock's generation number, the Lock has been reused
and we return failure.
Enter lock_put_internal:
if the Lock is still on the Object's lists:
Increment Lock's generation number.
Remove Lock from the Object's list (NULL link fields).
Promote locks for the Object.
Enter locker_list_removal
Acquire Locker bucket mutex.
If Locker doesn't exist:
Release Locker bucket mutex
Release Object bucket mutex
Return error.
Else if Locker marked as deleted:
dont_release = TRUE
Else
Remove Lock from Locker list.
If Locker has no more locks
Remove Locker from table.
Acquire Memory mutex.
Return Locker to free list
Release Memory mutex
Release Locker bucket mutex.
Exit locker_list_removal
If (!dont_release)
Acquire Memory mutex
Return Lock to free list
Release Memory mutex
Exit lock_put_internal
Release Object bucket mutex
Release all the Locks on a specific Object (lock_vec, DB_PUT_ALL_OBJ):
Acquire Object bucket mutex.
For each lock on the waiter list:
lock_put_internal
For each lock on the holder list:
lock_put_internal
Release Object bucket mutex.
Release all the Locks for a specific Locker (lock_vec, DB_PUT_ALL):
Acquire Locker bucket mutex.
Mark Locker deleted.
Release Locker mutex.
For each lock on the Locker's list:
Remove from locker's list
(The lock could get put back on the free list in
lock_put and then could get reallocated and the
act of setting its locker links could clobber us.)
Perform "Release a Lock" above: skip locker_list_removal.
Acquire Locker bucket mutex.
Remove Locker
Release Locker mutex.
Acquire Memory mutex
Return Locker to free list
Release Memory mutex
Deadlock detection (lock_detect):
For each bucket in Object table
Acquire the Object bucket mutex.
create waitsfor
For each bucket in Object table
Release the Object mutex.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
FAQ:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Q: Why do you need generation numbers?
A: If a lock has been released due to a transaction abort (potentially in a
different process), and then lock is released by a thread of control
unaware of the abort, the lock might have potentially been re-allocated
to a different object. The generation numbers detect this problem.
Note, we assume that reads/writes of lock generation numbers are atomic,
if they are not, it is theoretically possible that a re-allocated lock
could be mistaken for another lock.
Q: Why is is safe to walk the Locker list without holding any mutexes at
all?
A: Locks are created with both the Object and Locker bucket mutexes held.
Once created, they removed in two ways:
a) when a specific Lock is released, in which case, the Object and
Locker bucket mutexes are again held, and
b) when all Locks for a specific Locker Id is released.
In case b), the Locker bucket mutex is held while the Locker chain is
marked as "destroyed", which blocks any further access to the Locker
chain. Then, each individual Object bucket mutex is acquired when each
individual Lock is removed.
Q: What are the implications of doing fine grain locking?
A: Since we no longer globally lock the entire region, lock_vec will no
longer be atomic. We still execute the items in a lock_vec in order,
so things like lock-coupling still work, but you can't make any
guarantees about atomicity.
Q: How do I configure for FINE_GRAIN locking?
A: We currently do not support any automatic configuration for FINE_GRAIN
locking. When we do, will need to document that atomicity discussion
listed above (it is bug-report #553).
|