1 files changed, 301 insertions, 0 deletions
diff --git a/src/lock/Design b/src/lock/Design
new file mode 100644
index 00000000..f82bc7e8
--- /dev/null
+++ b/src/lock/Design
@@ -0,0 +1,301 @@
+Synchronization in the Locking Subsystem
+
+This is a document that describes how we implemented fine-grain locking
+in the lock manager (that is, locking on a hash bucket level instead of
+locking the entire region).  We found that the increase in concurrency
+was not sufficient to warrant the increase in complexity or the additional
+cost of performing each lock operation.  Therefore, we don't use this
+any more.  Should we have to do fine-grain locking in a future release,
+this would be a reasonable starting point.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+1. Data structures
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+
+The lock manager maintains 3 different structures:
+
+Objects (__db_lockobj):
+	Describes an object that is locked.  When used with DB, this consists
+	of a __db_ilock (a file identifier and a page number).
+
+Lockers (__db_locker):
+	Identifies a specific locker ID and maintains the head of a list of
+	locks held by a locker (for using during transaction commit/abort).
+
+Locks (__db_lock):
+	Describes a particular object lock held on behalf of a particular
+	locker id.
+
+Objects and Lockers reference Locks.
+
+These structures are organized via two synchronized hash tables.  Each
+hash table consists of two physical arrays: the array of actual hash
+buckets and an array of mutexes so we can lock individual buckets, rather
+than the whole table.
+
+One hash table contains Objects and the other hash table contains Lockers.
+Objects contain two lists of locks, waiters and holders: holders currently
+hold a lock on the Object, waiters are lock waiting to be granted.
+Lockers are a single linked list that connects the Locks held on behalf
+of the specific locker ID.
+
+In the diagram below:
+
+Locker ID #1 holds a lock on Object #1 (L1) and Object #2 (L5), and is
+waiting on a lock on Object #1 (L3).
+
+Locker ID #2 holds a lock on Object #1 (L2) and is waiting on a lock for
+Object #2 (L7).
+
+Locker ID #3 is waiting for a lock on Object #2 (L6).
+
+	OBJECT                   -----------------------
+	HASH                     |                     |
+                             ----|-------------        |
+	________    _______  |   |   ________ |        |
+	|	|-->| O1  |--|---|-->|  O2  | |        |
+	|_______|   |_____|  |   |   |______| V        |
+	|	|    W  H--->L1->L2   W  H--->L5       |	holders
+	|_______|    |       |   |    |                V
+	|	|    ------->L3  \    ------->L6------>L7	waiters
+	|_______|           /     \            \
+	.	.          /       \            \
+	.	.          |        \            \
+	.	.          |         \            -----------
+	|_______|          |          --------------        |
+	|	|      ____|____                ___|_____  _|______
+	|_______|      |       |                |       |  |      |
+	|	|      | LID1  |                |  LID2 |  | LID3 |
+	|_______|      |_______|                |_______|  |______|
+			   ^                        ^        ^
+			   |                        |        |
+			___|________________________|________|___
+	       LOCKER	|    |    |    |    |    |    |    |    |
+	       HASH	|    |    |    |    |    |    |    |    |
+			|    |    |    |    |    |    |    |    |
+			|____|____|____|____|____|____|____|____|
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+2. Synchronization
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+
+There are four types of mutexes in the subsystem.
+
+Object mutexes;
+	These map one-to-one to each bucket in the Object hash table.
+	Holding a mutex on an Object bucket secures all the Objects in
+	that bucket as well as the Lock structures linked from those
+	Objects.  All fields in the Locks EXCEPT the Locker links (the
+	links that attach Locks by Locker ID) are protected by these
+	mutexes.
+
+Locker mutexes:
+	These map one-to-one to each bucket in the Locker hash table.
+	Holding a mutex on a Locker bucket secures the Locker structures
+	and the Locker links in the Locks.
+
+Memory mutex:
+	This mutex allows calls to allocate/free memory, i.e. calls to
+	__db_shalloc and __db_shalloc_free, as well as manipulation of
+	the Object, Locker and Lock free lists.
+
+Region mutex:
+	This mutex is currently only used to protect the locker ids.
+	It may also be needed later to provide exclusive access to
+	the region for deadlock detection.
+
+Creating or removing a Lock requires locking both the Object lock and the
+Locker lock (and eventually the shalloc lock to return the item to the
+free list).
+
+The locking hierarchy is as follows:
+
+	The Region mutex may never be acquired after any other mutex.
+
+	The Object mutex may be acquired after the Region mutex.
+
+	The Locker mutex may be acquired after the Region and Object
+	mutexes.
+
+	The Memory mutex may be acquired after any mutex.
+
+So, if both and Object mutex and a Locker mutex are going to be acquired,
+the Object mutex must be acquired first.
+
+The Memory mutex may be acquired after any other mutex, but no other mutexes
+can be acquired once the Memory mutex is held.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+3. The algorithms:
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+The locking subsystem supports four basic operations:
+	Get a Lock (lock_get)
+
+	Release a Lock (lock_put)
+
+	Release all the Locks on a specific Object (lock_vec)
+
+	Release all the Locks for a specific Locker (lock_vec)
+
+Get a lock:
+	Acquire Object bucket mutex.
+	Acquire Locker bucket mutex.
+
+	Acquire Memory mutex.
+	If the Object does not exist
+		Take an Object off the freelist.
+	If the Locker doesn't exist
+		Take a Locker off the freelist.
+	Take a Lock off the free list.
+	Release Memory mutex.
+
+	Add Lock to the Object list.
+	Add Lock to the Locker list.
+	Release Locker bucket mutex
+
+	If the lock cannot be granted
+		Release Object bucket mutex
+		Acquire lock mutex (blocks)
+
+		Acquire Object bucket mutex
+		If lock acquisition did not succeed (e.g, deadlock)
+			Acquire Locker bucket mutex
+			If locker should be destroyed
+				Remove locker from hash table
+				Acquire Memory mutex
+				Return locker to free list
+				Release Memory mutex
+			Release Locker bucket mutex
+
+			If object should be released
+				Acquire Memory mutex
+				Return object to free list
+				Release Memory mutex
+
+	Release Object bucket mutex
+
+Release a lock:
+	Acquire Object bucket mutex.
+		(Requires that we be able to find the Object hash bucket
+		without looking inside the Lock itself.)
+
+	If releasing a single lock and the user provided generation number
+	doesn't match the Lock's generation number, the Lock has been reused
+	and we return failure.
+
+	Enter lock_put_internal:
+		if the Lock is still on the Object's lists:
+			Increment Lock's generation number.
+			Remove Lock from the Object's list (NULL link fields).
+			Promote locks for the Object.
+
+		Enter locker_list_removal
+			Acquire Locker bucket mutex.
+			If Locker doesn't exist:
+				Release Locker bucket mutex
+				Release Object bucket mutex
+				Return error.
+			Else if Locker marked as deleted:
+				dont_release = TRUE
+			Else
+				Remove Lock from Locker list.
+				If Locker has no more locks
+					Remove Locker from table.
+					Acquire Memory mutex.
+					Return Locker to free list
+					Release Memory mutex
+			Release Locker bucket mutex.
+		Exit locker_list_removal
+
+		If (!dont_release)
+			Acquire Memory mutex
+			Return Lock to free list
+			Release Memory mutex
+
+	Exit lock_put_internal
+
+	Release Object bucket mutex
+
+Release all the Locks on a specific Object (lock_vec, DB_PUT_ALL_OBJ):
+
+	Acquire Object bucket mutex.
+
+	For each lock on the waiter list:
+		lock_put_internal
+	For each lock on the holder list:
+		lock_put_internal
+
+	Release Object bucket mutex.
+
+Release all the Locks for a specific Locker (lock_vec, DB_PUT_ALL):
+
+	Acquire Locker bucket mutex.
+	Mark Locker deleted.
+	Release Locker mutex.
+
+	For each lock on the Locker's list:
+		Remove from locker's list
+			(The lock could get put back on the free list in
+			lock_put and then could get reallocated and the
+			act of setting its locker links could clobber us.)
+		Perform "Release a Lock" above: skip locker_list_removal.
+
+	Acquire Locker bucket mutex.
+	Remove Locker
+	Release Locker mutex.
+
+	Acquire Memory mutex
+	Return Locker to free list
+	Release Memory mutex
+
+Deadlock detection (lock_detect):
+
+	For each bucket in Object table
+		Acquire the Object bucket mutex.
+		create waitsfor
+
+	For each bucket in Object table
+		Release the Object mutex.
+
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+FAQ:
+=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
+Q: Why do you need generation numbers?
+A: If a lock has been released due to a transaction abort (potentially in a
+   different process), and then lock is released by a thread of control
+   unaware of the abort, the lock might have potentially been re-allocated
+   to a different object.  The generation numbers detect this problem.
+
+   Note, we assume that reads/writes of lock generation numbers are atomic,
+   if they are not, it is theoretically possible that a re-allocated lock
+   could be mistaken for another lock.
+
+Q: Why is is safe to walk the Locker list without holding any mutexes at
+   all?
+A: Locks are created with both the Object and Locker bucket mutexes held.
+   Once created, they removed in two ways:
+
+   a) when a specific Lock is released, in which case, the Object and
+   Locker bucket mutexes are again held, and
+
+   b) when all Locks for a specific Locker Id is released.
+
+   In case b), the Locker bucket mutex is held while the Locker chain is
+   marked as "destroyed", which blocks any further access to the Locker
+   chain.  Then, each individual Object bucket mutex is acquired when each
+   individual Lock is removed.
+
+Q: What are the implications of doing fine grain locking?
+
+A: Since we no longer globally lock the entire region, lock_vec will no
+   longer be atomic.  We still execute the items in a lock_vec in order,
+   so things like lock-coupling still work, but you can't make any
+   guarantees about atomicity.
+
+Q: How do I configure for FINE_GRAIN locking?
+
+A: We currently do not support any automatic configuration for FINE_GRAIN
+   locking.  When we do, will need to document that atomicity discussion
+   listed above (it is bug-report #553).
+
+Copyright (c) 2010, 2012 Oracle and/or its affiliates.  All rights reserved.