path: root/doc/storage.txt
author    dormando <dormando@rydia.net>  2017-09-26 14:43:17 -0700
committer dormando <dormando@rydia.net>  2017-11-28 14:18:05 -0800
commit    f593a59bce69f917514ef6213cf565c71bddcf8c (patch)
tree      4a5dc07433e97b089f46a913b5367aa5d52c059a /doc/storage.txt
parent    e6239a905d072e837baa8aa425ca0ccee2fc3e01 (diff)
download  memcached-f593a59bce69f917514ef6213cf565c71bddcf8c.tar.gz
external storage base commit
Been squashing, reorganizing, and pulling code off to go upstream ahead of merging the whole branch.
Diffstat (limited to 'doc/storage.txt')
-rw-r--r--  doc/storage.txt  141
1 file changed, 141 insertions, 0 deletions
diff --git a/doc/storage.txt b/doc/storage.txt
new file mode 100644
index 0000000..41a3c7e
--- /dev/null
+++ b/doc/storage.txt
@@ -0,0 +1,141 @@
+Storage system notes
+--------------------
+
+extstore.h defines the API.
+
+extstore_write() is a synchronous call which memcpy's the input buffer into a
+write buffer for an active page. A failure is not usually a hard failure, but
+indicates the caller can try again later; e.g. the engine might be busy
+freeing pages or assigning new ones.
+
+As of this writing, the write() implementation doesn't have an internal retry
+loop, so it can give spurious failures (good for testing integration).
+
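+A caller-side sketch of handling those soft failures (hedged: the exact
+signature lives in extstore.h; the bucket/buffer/length argument form and
+the retry policy here are illustrative):
+
+  /* Sketch: assumes extstore_write(engine, bucket, buf, len) returns 0
+   * on success and nonzero on a soft failure. Needs <unistd.h>. */
+  int store_with_retry(void *engine, unsigned int bucket,
+                       void *buf, int len, int max_tries) {
+      for (int i = 0; i < max_tries; i++) {
+          if (extstore_write(engine, bucket, buf, len) == 0)
+              return 0; /* copied into an active page's write buffer */
+          usleep(100); /* engine may be freeing or assigning pages */
+      }
+      return -1; /* give up; the item simply stays in RAM for now */
+  }
+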
+extstore_read() is an asynchronous call which takes a stack of IO objects and
+adds it to the end of a queue. It then signals the IO thread to run. Once an
+IO stack is submitted the caller must not touch the submitted objects anymore
+(they are relinked internally).
+
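+For example, chaining two IOs into a stack before submission (a sketch; the
+obj_io field names approximate the style of extstore.h):
+
+  obj_io first, second;
+  /* ... fill in page, offset, length, buffer, and callback for each ... */
+  first.next = &second;  /* IOs stack via their next pointers */
+  second.next = NULL;
+  extstore_read(engine, &first); /* queues the stack, signals IO thread */
+  /* from here on, do not touch first/second until their callbacks fire */
+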
+extstore_delete() is a synchronous call which informs the storage engine that
+an item has been removed from a given page. It's important to call this as
+items are actively deleted or passively reaped due to TTL expiration. This
+allows the engine to intelligently reclaim pages.
+
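+A hedged sketch of that bookkeeping (the argument list is illustrative;
+extstore.h is authoritative):
+
+  /* The caller remembers where the item was written: one object of
+   * `nbytes` was logically removed from this page generation. */
+  extstore_delete(engine, page_id, page_version, 1, nbytes);
+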
+The IO threads execute each object in turn (or in bulk if running in the
+future libaio mode).
+
+Callbacks are issued from the IO threads. It's thus important to keep
+processing to a minimum. Callbacks may be issued out of order, and it is the
+caller's responsibility to know when its stack has been fully processed so it
+may reclaim the memory.
+
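+One way to know when a stack is fully processed is a shared countdown, e.g.
+(a sketch; the context struct, data pointer, and callback signature are
+assumptions):
+
+  /* The last callback to fire signals completion. Callbacks run on the
+   * IO threads, so keep the work here minimal. Needs <pthread.h>. */
+  struct stack_ctx { int pending; pthread_mutex_t lock; };
+
+  static void read_done(void *engine, obj_io *io, int ret) {
+      struct stack_ctx *ctx = io->data; /* shared by the whole stack */
+      pthread_mutex_lock(&ctx->lock);
+      int left = --ctx->pending;
+      pthread_mutex_unlock(&ctx->lock);
+      if (left == 0) {
+          /* every IO processed: safe to reclaim the stack's memory */
+      }
+  }
+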
+With DIRECT_IO support, buffers submitted for read/write will need to be
+aligned with posix_memalign() or similar.
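+
+For example, a 4KB-aligned 64KB buffer (the sizes here are illustrative):
+
+  void *buf = NULL;
+  /* alignment must be a power of two and a multiple of sizeof(void *) */
+  if (posix_memalign(&buf, 4096, 65536) != 0) {
+      /* allocation failed; fall back to buffered IO or abort */
+  }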
+
+Buckets
+-------
+
+During extstore_init(), a number of active buckets is specified. Pages are
+handled overall as a global pool, but writes can be redirected to specific
+active pages.
+
+This allows a lot of flexibility, e.g.:
+
+1) An idea of "high TTL" and "low TTL" being two buckets: TTL < 86400
+goes into bucket 0, the rest into bucket 1. Co-locating low TTL items means
+those pages can reach zero objects and free up more easily (see the sketch
+after this list).
+
+2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
+If TTLs are low, mixed-size objects can go together as they are likely to
+expire before cycling out of flash (depending on workload, of course).
+For higher TTL items, pages are stored on chunk barriers. This means less
+space is wasted as items should fit nearly exactly into write buffers and
+pages. It also means you can blindly read items back if the system wants to
+free a page, and we can indicate to the caller somehow which pages are up
+for probation. I.e.: issue a read against page 3 version 1 for byte range
+0->1MB, then chunk and look up objects. Then read the next 1MB chunk, etc.
+If there's anything we want to keep, pull it back into RAM before the page
+is freed.
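+
+Example 1 as code (a sketch; the cutoff and function name are just the
+example's values):
+
+  /* Co-locate short-lived items so their pages empty out together. */
+  static unsigned int pick_bucket(time_t ttl) {
+      return (ttl < 86400) ? 0 : 1;
+  }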
+
+Pages are assigned to buckets on demand, so if you create 30 buckets but use
+only one, there will only be a single active page with write buffers.
+
+Memcached integration
+---------------------
+
+With the POC: items.c's lru_maintainer_thread writes items to storage if all
+memory has been allocated out to slab classes, and less than a configured
+amount of memory is free. Original objects are swapped with items marked with
+the ITEM_HDR flag. An ITEM_HDR item contains copies of the original key and
+most of the header data. The ITEM_data() section of an ITEM_HDR object
+contains an (item_hdr *), which describes enough information to retrieve the
+original object from storage.
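+
+A hedged sketch of what that pointer needs to carry (field names and types
+are illustrative; memcached's headers are authoritative):
+
+  /* Enough to find the object again: which page, which generation of
+   * that page (pages get reused), and where the object sits inside it. */
+  typedef struct {
+      uint64_t page_version; /* guards against reading a recycled page */
+      unsigned int offset;   /* byte offset within the page */
+      unsigned short page_id;
+  } item_hdr;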
+
+To get the best performance, it is important that reads can be deeply
+pipelined. As much processing as possible is done ahead of time, IOs are
+submitted, and once IOs are done processing, a minimal amount of code is
+executed before transmit() is possible. This should amortize the latency
+incurred by hopping threads and waiting on IO.
+
+Recaching
+---------
+
+If a header is hit twice overall, and the second time is within ~60s of the
+first, it has a chance of being recached. "recache_rate" is a simple
+"counter % rate == 0" check. Setting it to 1000 means that in one out of
+every 1000 instances of an item being hit twice within ~60s, the item is
+recached into memory. Very hot items will get pulled out of storage
+relatively quickly.
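+
+The check itself is cheap enough for the hit path; as a sketch (the counter
+handling is illustrative):
+
+  /* One in every `rate` qualifying second-hits wins recache; with
+   * rate = 1000, roughly 0.1% of hot re-hits get promoted. */
+  static int should_recache(unsigned int *counter, unsigned int rate) {
+      return (++(*counter) % rate) == 0;
+  }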
+
+Compaction
+----------
+
+A target fragmentation limit is set: "0.9", meaning "run compactions if pages
+exist which have less than 90% of their bytes used".
+
+This value is slewed based on the number of free pages in the system, and
+compaction activates once half of the pages are in use. The free-page ratio
+scales the target fragmentation limit down, i.e. with a limit of 0.9 and 50%
+of pages free: 0.9 * 0.5 -> 0.45. If a page is 64 megabytes, pages with less
+than 28.8 megabytes used would be targeted for compaction. If 0 pages are
+free, anything less than 90% used is targeted, which means up to 10 pages
+may need rewriting to free one page.
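+
+A worked sketch matching both examples above (one interpretation: the
+effective threshold shrinks as free pages accumulate):
+
+  /* limit 0.9: half the pages free -> 0.9 * 0.5 = 0.45 (28.8MB of a
+   * 64MB page); zero pages free -> the full 0.9. */
+  static double effective_frag_limit(double limit, unsigned int pages_free,
+                                     unsigned int pages_total) {
+      double free_ratio = (double)pages_free / pages_total;
+      return limit * (1.0 - free_ratio);
+  }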
+
+In memcached's integration, a second bucket is used for objects rewritten via
+the compactor. Potentially, objects that have been around long enough to get
+compacted might continue to stick around, so co-locating them could reduce
+future fragmentation work.
+
+If an exclusive lock can be taken on a valid object header, the flash
+locations are rewritten directly in the object. As of this writing, if an
+object header is busy for some reason, the write is dropped (COW needs to be
+implemented). This is an unlikely scenario, however.
+
+Objects are read back along the boundaries of a write buffer. If an 8 meg
+write buffer is used, 8 megs are read back at once and iterated for objects.
+
+This needs a fair amount of tuning, possibly more throttling. It will still
+evict pages if the compactor gets behind.
+
+TODO
+----
+
+Sharing my broad TODO items here. While doing the work they get split up
+further into local notes. Adding this so others can follow along:
+
+(a bunch of the TODO has been completed and removed)
+- DIRECT_IO support
+- libaio support (requires DIRECT_IO)
+- code cleanup (function over form until I have the bugs out)
+ - pull all of the inlined linked list code
+ - naming consistency
+ - clear FIXME/TODO's related to error handling
+- JBOD support (not first pass)
+  - 1-2 active pages per device, potentially with dedicated IO threads per
+    device. With a RAID setup you risk any individual disk doing a GC pause
+    stalling all writes. Could also simply rotate devices on a per-bucket
+    basis.
+
+on memcached end:
+- fix append/prepend/incr/decr/etc
+- large item support
+- --configure gating for extstore being compiled (for now, at least)
+- binprot support
+- DIRECT_IO support; mostly memalign pages, but also making chunks grow
+ aligned to sector sizes once they are >= a single sector.