commit     f593a59bce69f917514ef6213cf565c71bddcf8c (patch)
author     dormando <dormando@rydia.net>  2017-09-26 14:43:17 -0700
committer  dormando <dormando@rydia.net>  2017-11-28 14:18:05 -0800
tree       4a5dc07433e97b089f46a913b5367aa5d52c059a /doc/storage.txt
parent     e6239a905d072e837baa8aa425ca0ccee2fc3e01 (diff)
download   memcached-f593a59bce69f917514ef6213cf565c71bddcf8c.tar.gz
external storage base commit
Been squashing, reorganizing, and pulling code off to go upstream ahead
of merging the whole branch.
Diffstat (limited to 'doc/storage.txt')
-rw-r--r--  doc/storage.txt  141
1 file changed, 141 insertions, 0 deletions
diff --git a/doc/storage.txt b/doc/storage.txt
new file mode 100644
index 0000000..41a3c7e
--- /dev/null
+++ b/doc/storage.txt
@@ -0,0 +1,141 @@

Storage system notes
--------------------

extstore.h defines the API.

extstore_write() is a synchronous call which memcpy's the input buffer into a
write buffer for an active page. A failure is not usually a hard failure, but
indicates the caller can try again another time. IE: it might be busy freeing
pages or assigning new ones.

As of this writing the write() implementation doesn't have an internal loop,
so it can give spurious failures (good for testing integration).

extstore_read() is an asynchronous call which takes a stack of IO objects and
adds it to the end of a queue. It then signals the IO thread to run. Once an
IO stack is submitted the caller must not touch the submitted objects anymore
(they are relinked internally).

extstore_delete() is a synchronous call which informs the storage engine that
an item has been removed from a page. It's important to call this as items
are actively deleted or passively reaped due to TTL expiration. This allows
the engine to intelligently reclaim pages.

The IO threads execute each object in turn (or in bulk if running in the
future libaio mode).

Callbacks are issued from the IO threads. It's thus important to keep
processing to a minimum. Callbacks may be issued out of order, and it is the
caller's responsibility to know when its stack has been fully processed so it
may reclaim the memory.

With DIRECT_IO support, buffers submitted for read/write will need to be
aligned with posix_memalign() or similar.

Buckets
-------

During extstore_init(), a number of active buckets is specified. Pages are
handled overall as a global pool, but writes can be redirected to specific
active pages.

This allows a lot of flexibility, ie:

1) An idea of "high TTL" and "low TTL" being two buckets. TTL < 86400
goes into bucket 0, the rest into bucket 1.
Co-locating low TTL items means
those pages can reach zero objects and free up more easily.

2) Extended: "low TTL" is one bucket, and then one bucket per slab class.
If TTLs are low, mixed-size objects can go together as they are likely to
expire before cycling out of flash (depending on workload, of course).
For higher TTL items, pages are stored on chunk barriers. This means less
space is wasted, as items should fit nearly exactly into write buffers and
pages. It also means you can blindly read items back if the system wants to
free a page, and we can indicate to the caller somehow which pages are up for
probation. ie: issue a read against page 3 version 1 for byte range 0->1MB,
then chunk and look up objects. Then read the next 1MB chunk, etc. If there's
anything we want to keep, pull it back into RAM before the page is freed.

Pages are assigned into buckets on demand, so if you make 30 but use 1 there
will only be a single active page with write buffers.

Memcached integration
---------------------

With the POC: items.c's lru_maintainer_thread calls writes to storage if all
memory has been allocated out to slab classes, and there is less than a set
amount of memory free. Original objects are swapped with items marked with
the ITEM_HDR flag. An ITEM_HDR item contains copies of the original key and
most of the header data. The ITEM_data() section of an ITEM_HDR object
contains an (item_hdr *), which describes enough information to retrieve the
original object from storage.

To get the best performance, it is important that reads can be deeply
pipelined. As much processing as possible is done ahead of time, IOs are
submitted, and once IOs are done processing, a minimal amount of code is
executed before transmit() is possible. This should amortize the amount of
latency incurred by hopping threads and waiting on IO.
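The soft-failure contract of extstore_write() described earlier (a failure
means "busy, try again", not a hard error) implies a simple caller-side retry
pattern. Below is a minimal sketch in C; the engine is mocked, and all names
and signatures here (fake_extstore_write(), EXT_BUSY, write_with_retry()) are
invented for illustration and do not match the real extstore.h:

```c
#include <string.h>
#include <stdbool.h>

#define EXT_OK   0
#define EXT_BUSY 1  /* soft failure: engine busy, caller may retry */

/* Mock engine state: one active write buffer. */
static char wbuf[1024];
static size_t wbuf_used = 0;
static int busy_countdown = 2;  /* simulate a couple of spurious failures */

/* Stand-in for extstore_write(): synchronous memcpy into the active
 * write buffer; returns EXT_BUSY when the engine can't take the write. */
static int fake_extstore_write(const void *buf, size_t len) {
    if (busy_countdown > 0) {
        busy_countdown--;
        return EXT_BUSY;
    }
    if (wbuf_used + len > sizeof(wbuf)) return EXT_BUSY;
    memcpy(wbuf + wbuf_used, buf, len);
    wbuf_used += len;
    return EXT_OK;
}

/* Caller-side pattern: since a failure is not fatal, retry a bounded
 * number of times before giving up (or deferring the item). */
static bool write_with_retry(const void *buf, size_t len, int max_tries) {
    for (int i = 0; i < max_tries; i++) {
        if (fake_extstore_write(buf, len) == EXT_OK) return true;
    }
    return false;
}
```

In the real integration the "give up" path would simply leave the item in RAM
and let a later lru_maintainer_thread pass try again.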
Recaching
---------

If a header is hit twice overall, and the second time is within ~60s of the
first time, it has a chance of getting recached. "recache_rate" is a simple
"counter % rate == 0" check. Setting it to 1000 means that in one out of
every 1000 instances of an item being hit twice within ~60s, it will be
recached into memory. Very hot items will get pulled out of storage
relatively quickly.

Compaction
----------

A target fragmentation limit is set: "0.9", meaning "run compactions if pages
exist which have less than 90% of their bytes used".

This value is slewed based on the number of free pages in the system, and
activates when half of the pages are used. The percentage of free pages is
multiplied against the target fragmentation limit, ie: with a limit of 0.9
and 50% of pages free -> 0.9 * 0.5 -> 0.45. If a page is 64 megabytes, pages
with less than 28.8 megabytes used would be targeted for compaction. If 0
pages are free, anything less than 90% used is targeted, which means it has
to rewrite 10 pages to free one page.

In memcached's integration, a second bucket is used for objects rewritten via
the compactor. Objects that have been around long enough to get compacted may
well continue to stick around, so co-locating them could reduce future
fragmentation work.

If an exclusive lock can be taken on a valid object header, the flash
locations are rewritten directly in the object. As of this writing, if an
object header is busy for some reason, the write is dropped (COW needs to be
implemented). This is an unlikely scenario, however.

Objects are read back along the boundaries of a write buffer. If an 8 meg
write buffer is used, 8 megs are read back at once and iterated for objects.

This needs a fair amount of tuning, and possibly more throttling. It will
still evict pages if the compactor gets behind.

TODO
----

Sharing my broad TODO items in here. While doing the work they get split up
more into local notes.
Adding this so others can follow along:

(a bunch of the TODO has been completed and removed)
- DIRECT_IO support
- libaio support (requires DIRECT_IO)
- code cleanup (function over form until I have the bugs out)
  - pull all of the inlined linked list code
  - naming consistency
  - clear FIXME/TODOs related to error handling
- JBOD support (not first pass)
  - 1-2 active pages per device. potentially dedicated IO threads per device.
    with a RAID setup you risk any individual disk doing a GC pause stalling
    all writes. could also simply rotate devices on a per-bucket basis.

on the memcached end:
- fix append/prepend/incr/decr/etc
- large item support
- --configure gating for extstore being compiled (for now, at least)
- binprot support
- DIRECT_IO support; mostly memalign'ing pages, but also making chunks grow
  aligned to sector sizes once they are >= a single sector.
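The compaction-threshold arithmetic from the Compaction section can be
sketched in a few lines of C. Note an assumption: the text's two worked
examples (50% free -> 0.45, 0 pages free -> 0.9) are both consistent with
scaling the limit by the *used* fraction of pages, so that is what this
hypothetical helper (compact_target_mb(), not a real function) does:

```c
/* Return the maximum used size (in megabytes) at which a page becomes a
 * compaction target, given the configured fragmentation limit (e.g. 0.9),
 * the fraction of pages currently free, and the page size in megabytes.
 *
 * NOTE: illustrative only; the scaling by used fraction is inferred from
 * the worked examples in the text, not read from the source. */
static double compact_target_mb(double frag_limit, double free_frac,
                                double page_mb) {
    double slewed = frag_limit * (1.0 - free_frac);
    /* e.g. limit 0.9, 50% free: 0.9 * 0.5 = 0.45 */
    return slewed * page_mb;
}
```

With 64 MB pages this reproduces the numbers above: at 50% free the target is
0.45 * 64 = 28.8 MB, and at 0 pages free it is 0.9 * 64 = 57.6 MB (i.e. 90%
used).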