| Commit message | Author | Age | Files | Lines |

We want to start using cache commands in contexts without a client
connection, but the client object has always been passed to all
functions.
In most cases we only need the worker thread (LIBEVENT_THREAD *t), so
this change adjusts the arguments passed in.
extstore.h is now only used from storage.c. starting a path towards
getting the storage interface to be more generalized.
should be no functional changes.
- we get asked a lot to provide a "metaget" command, for various uses
(debugging, etc)
- we also get asked for random one-off commands for various use cases.
- I really hate both of these situations and have been wanting to
experiment with a slight tuning of how get commands work for a long
time.
I assume that if I offer a metaget command which gives people the
information they're curious about in an inefficient format, plus data
they don't need, we'll just end up with a slow command with
compatibility issues. No matter how many warnings you wrap around a
command, people will put it into production under high load. Then I'm
stuck with it forever.
Behold, the meta commands!
See doc/protocol.txt and the wiki for a full explanation and examples.
The intent of the meta commands is to support any features the binary
protocol had over the text protocol. Though this is missing some
commands still, it is close and surpasses the binary protocol in many
ways.
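As a rough illustration of the request shape, here is a Python sketch (not code from the tree) that builds a meta-get line; the flag letters follow doc/protocol.txt, e.g. `v` to return the value, `t` the remaining TTL, `f` the client flags:

```python
def build_metaget(key: str, flags: str = "v") -> bytes:
    """Build a meta-get ("mg") request line as a byte string.

    Each flag is sent as its own space-separated single-letter token,
    e.g. "mg user:42 v t f\r\n". Illustrative only.
    """
    if len(key) > 250:
        raise ValueError("key too long for the text protocol")
    parts = ["mg", key]
    parts.extend(flags)  # one token per flag letter
    return (" ".join(parts) + "\r\n").encode("ascii")
```

The server answers with compact response codes (`VA <size> <flags>` followed by the data, or `HD`/`EN`), which is what keeps the meta commands cheap compared to a verbose "metaget" dump.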
"-e /path/to/tmpfsmnt/file"
SIGUSR1 for graceful stop
restart requires the same memory limit, slab sizes, and some other
infrequently changed details. Most other options and features can
change between restarts. Binary can be upgraded between restarts.
Restart does some fixup work on start for every item in cache. Can take
over a minute with more than a few hundred million items in cache.
Keep in mind when a cache is down it may be missing invalidations,
updates, and so on.
The LRU crawler metadumper is used for getting snapshot-y looks at the LRUs.
Since there was no default limit, it would also pick up any new items added or
bumped after the roll started.
With this change it limits the number of items dumped to the number that
existed in that LRU when the roll was kicked off. You still end up with an
approximation, but not a terrible one:
- items bumped after the crawler passes them likely won't be revisited
- items bumped before the crawler passes them will likely be visited toward
the end, or mixed with new items.
- deletes are somewhere in the middle.
Been squashing, reorganizing, and pulling code off to go upstream ahead
of merging the whole branch.
used in separate file for flash branch.
removes a few ifdef's and upstreams small internal interface tweaks for easy
rebase.
plumbing for doing inline reclaim, or similar.
converts the python script to C, more or less.
When trying to manually run a crawl, the internal autocrawler is now blocked
from restarting for 60 seconds.
The internal autocrawl now independently schedules LRUs, and can re-schedule
sub-LRUs while others are still running. This should allow much better memory
control when some sub-LRUs (such as TEMP or WARM) are small, or slab classes
are differently sized.
This also makes the crawler drop its lock frequently; this fixes an issue
where a long crawl happening at the same time as a hash table expansion could
hang the server until the crawl finished.
Still to improve:
- elapsed time can be wrong in the logger entry
- need to cap the number of entries scanned. With enough set pressure a crawl
may never finish.
Memory chunk chains would simply stitch multiple chunks of the highest slab
class together. If your item was 17k and the chunk limit is 16k, the item
would use 32k of space instead of a bit over 17k.
This refactor simplifies the slab allocation path and pulls the allocation of
chunks into the upload process. A "large" item gets a small chunk assigned as
an object header, rather than attempting to inline a slab chunk into a parent
chunk. It then gets chunks individually allocated and added into the chain
while the object uploads.
This solves a lot of issues:
1) When assembling new, potentially very large items, we don't have to sit and
spin evicting objects all at once. If there are 20 16k chunks in the tail and
we allocate a 1 meg item, the new item will evict one of those chunks
in between each read, rather than trying to guess how many loops to run before
giving up. Very large objects take time to read from the socket anyway.
2) Simplifies code around the initial chunk. Originally, embedding the object
header and data into the top chunk at the same time required a good amount of
fiddling. (Though this might flip back to embedding the initial chunk if I can
clean it up a bit more.)
3) Pulling chunks individually means the slabber code can be flattened to not
think about chunks aside from freeing them, which culled a lot of code and
removed branches from a hot path.
4) The size of the final chunk is naturally set to the remaining amount of
bytes that need to be stored, which means chunks from another slab class can
be pulled to "cap off" a large item, reducing memory overhead.
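Point 4 is easiest to see with numbers. A Python sketch (not the allocator itself; the 16k chunk limit is a hypothetical constant) of how a chunk chain is sized:

```python
CHUNK_MAX = 16 * 1024  # hypothetical per-chunk limit in bytes

def chain_chunk_sizes(nbytes: int) -> list[int]:
    """Full-size chunks from the largest class, then one tail chunk
    sized to the remaining bytes so a smaller slab class can cap off
    the item instead of wasting a whole large chunk."""
    sizes = []
    while nbytes > CHUNK_MAX:
        sizes.append(CHUNK_MAX)
        nbytes -= CHUNK_MAX
    if nbytes:
        sizes.append(nbytes)  # tail chunk from a smaller class
    return sizes
```

A 17k item thus occupies roughly 17k (one 16k chunk plus a 1k cap) instead of two full 16k chunks.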
Previous tree fixed a problem; active items needed to be processed from the
tail of COLD, which makes evictions harder without evicting active items.
COLD bumps were modified to be immediate (old style). This uses a
per-worker-thread mostly-nonblocking queue that the LRU thread consumes for
COLD bumps.
In most cases, hits to COLD are 1/10th or less than the other classes. On high
rates of access where the buffers fill, those items simply don't get their
ACTIVE bit set. If they get hit again with free space, they will be processed
then. This prevents regressions from high speed keyspace scans.
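The drop-on-full behavior is the key property. A toy Python model of the per-worker bump buffer (the class name, size, and use of `queue.Queue` are illustrative, not the lock-free C structure):

```python
import queue

class BumpBuffer:
    """Per-worker COLD bump buffer: the worker never blocks; if the
    buffer is full the bump is dropped (the item just doesn't get its
    ACTIVE bit set this time, and may be bumped on a later hit)."""

    def __init__(self, maxsize: int = 4):
        self.q = queue.Queue(maxsize=maxsize)

    def offer(self, item) -> bool:
        try:
            self.q.put_nowait(item)   # worker thread side
            return True
        except queue.Full:
            return False              # dropped, not blocked

    def drain(self) -> list:
        out = []                      # LRU maintainer thread side
        while True:
            try:
                out.append(self.q.get_nowait())
            except queue.Empty:
                return out
```

Under a high-speed keyspace scan the buffer fills, offers start failing, and the scan simply stops generating bump work instead of stalling the workers.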
Confident the other feature was never used; and if someone wants it, it's easy
to restore by allowing exptime of 0 to go into TEMP_LRU.
This could possibly become a default, or at least recommended.
item_get() would hash, item_lock, fetch item.
consumers which can bump the LRU would then call item_update(),
which would hash, item_lock, then update the item.
Good performance bump by inlining the LRU bump when it's necessary.
~600 lines gone from items.c makes it a lot more manageable.
this change is almost purely moving code around and renaming functions. very
little logic has changed.
Now has an internal module system for the LRU crawler.
The autocrawl checker should be a bit better now; it doesn't
constantly re-run the histogram calcs.
metadump works as a module now. Ended up generalizing the client case outside
of the module system since it looks reusable. Cut the number of functions
required for metadump specifically to nothing.
Still need to bug hunt, do a few more smaller refactors, and see about pulling
this out into its own file.
Functionality is nearly all there. A handful of FIXME's and TODO's to address.
From there it needs to be refactored into something proper.
If all hash values of five tail items are zero on the specified slab
class, the expiry check is unintentionally skipped and the items stay without
being evicted. Consequently, new item allocations consume memory space
every time an item is set, which leads to slab OOM errors.
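An illustrative Python sketch of the bug class (not the actual C; the function names and dict fields are hypothetical): the tail walker used a hash value of 0 as "nothing here", so real items whose hash happened to be 0 were never expiry-checked.

```python
def find_evictable(tail_items, now):
    """Buggy version: skips items whose hash value is 0."""
    for it in tail_items[:5]:
        if it["hv"] == 0:         # BUG: 0 is a perfectly valid hash
            continue
        if it["exptime"] <= now:
            return it
    return None

def find_evictable_fixed(tail_items, now):
    """Fixed version: expiry-check every tail item regardless of hash."""
    for it in tail_items[:5]:
        if it["exptime"] <= now:
            return it
    return None
```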
Now relies on the CAS feature for runtime enable/disable tracking. Still usable if
enabled at start time with CAS disabled. Also adds the start option `-o
track_sizes`, and a stat for `stats settings`.
Finally, adds documentation and cleans up status outputs.
Could use some automated tests, but that's not make-or-break for release.
"stats sizes" is one of the lack cache-hanging commands. With millions of
items it can hang for many seconds.
This commit changes the command to be dynamic. A histogram is tracked as items
are linked and unlinked from the cache. The tracking is enabled or disabled at
runtime via "stats sizes_enable" and "stats sizes_disable".
This presently "works" but isn't accurate. Giving it some time to think over
before switching to requiring that CAS be enabled. Otherwise the values could
underflow if items are removed that existed before the sizes tracker is
enabled. This attempts to work around it by using it->time, which gets updated
on fetch, and is thus inaccurate.
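The dynamic idea in miniature, as a Python sketch (the 32-byte bucket width and hook names are illustrative): maintain the histogram at link/unlink time instead of scanning the whole cache on demand.

```python
from collections import Counter

BUCKET = 32  # hypothetical histogram bucket width in bytes

sizes_hist = Counter()

def on_item_link(nbytes: int):
    sizes_hist[(nbytes // BUCKET) * BUCKET] += 1

def on_item_unlink(nbytes: int):
    b = (nbytes // BUCKET) * BUCKET
    sizes_hist[b] -= 1     # would underflow for items linked before
    if sizes_hist[b] <= 0:  # the tracker was enabled -- the accuracy
        del sizes_hist[b]   # problem the message describes
</n```

Each hook is O(1), so "stats sizes" becomes a cheap read of the counter instead of a multi-second cache walk.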
Most of the code would parse and handle flags as an unsigned int, but they were
passed into alloc functions as a signed int... which would then continue to be
printed as unsigned up until a change made in 2007. Now flags are treated fully
as unsigned and printed as unsigned.
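The mismatch is easy to reproduce with fixed-width integer views; this Python/ctypes sketch reads an arbitrary large 32-bit flag value back through a signed versus an unsigned type:

```python
import ctypes

flags = 4000000000  # fits in uint32, not in int32

# Reading through a signed 32-bit int corrupts the value (the old bug);
# the unsigned view preserves what the client actually sent.
as_signed = ctypes.c_int32(flags).value    # wraps negative
as_unsigned = ctypes.c_uint32(flags).value  # 4000000000
```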
If any slab classes have more than two pages worth of free chunks, attempt to
free one page back to a global pool.
Create new concept of a slab page move destination of "0", which is a global
page pool. Pages can be re-assigned out of that pool during allocation.
Combined with item rescuing from the previous patch, we can safely shuffle
pages back to the reassignment pool as chunks free up naturally. This should
be a safe default going forward. Users should be able to decide to free or
move pages based on eviction pressure as well. This is coming up in another
commit.
This also fixes a calculation of the NOEXP LRU size, and completely removes
the old slab automover thread. Slab automove decisions will now be part of the
lru maintainer thread.
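The "more than two pages free → release one" rule can be sketched as follows (Python, not the mover code; the 1MB page size is the usual slab page size but treat the constant as illustrative):

```python
PAGE = 1024 * 1024  # slab page size, 1MB here for illustration

def pages_to_reclaim(free_chunks: int, chunk_size: int) -> int:
    """Return how many pages this class should give back to the
    global page pool (class 0): one page if it is sitting on more
    than two pages worth of free chunks, else none."""
    free_bytes = free_chunks * chunk_size
    return 1 if free_bytes > 2 * PAGE else 0
```

Pages released this way land in the class-0 pool and can be re-assigned to any class during allocation, which is what makes the shuffling safe as chunks free up naturally.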
During a slab page move items are typically ejected regardless of their
validity. Now, if an item is valid and free chunks are available in the same
slab class, copy the item over and replace it.
It's up to external systems to try to ensure free chunks are available before
moving a slab page. If there is no memory it will simply evict them as normal.
Also adds counters so we can finally tell how often these cases happen.
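A minimal Python model of the rescue rule (function and field names are hypothetical, not the C page mover): copy a valid item into a free chunk of the same class when one exists, otherwise evict as before.

```python
def move_page_items(items, free_chunks: int):
    """Partition a page's items into rescued vs evicted during a move."""
    rescued, evicted = [], []
    for it in items:
        if it["valid"] and free_chunks > 0:
            free_chunks -= 1
            rescued.append(it)   # copied and relinked in place
        else:
            evicted.append(it)   # invalid, or no free memory: ejected
    return rescued, evicted
```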
... if available. Very simple starter heuristic for how often to run the
crawler.
At this point, this patch series should have a significant impact on hit
ratio.
Only way to do eviction case fast enough is to inline it, sadly.
This finally deletes the old item_alloc code now that I'm not intending on
reusing it.
Also removes the condition wakeup for the background thread. Instead runs on a
timer, and meters its aggressiveness by how much shuffling is going on.
Also fixes a segfault in lru_pull_tail(), was unlinking `it` instead of
`search`.
The basics work, but tests still do not pass.
A background thread wakes up once per second, or when signaled. It is signaled
if a slab class gets an allocation request and has fewer than N chunks free.
The background thread shuffles LRUs: HOT, WARM, COLD. HOT is where new items
exist. HOT and WARM flow into COLD. Active items in COLD flow back to WARM.
Evictions are pulled from COLD.
item_update calls no longer do anything (and need to be fixed to tick it->time).
Items are reshuffled within or around LRUs as they reach the bottom.
Ratios of HOT/WARM memory are hardcoded, as are the low/high watermarks.
Thread is not fast enough right now, sets cannot block on it.
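The flows above can be sketched with three deques (a toy Python model with tiny hardcoded watermarks, nothing like the real thread's sizing or locking):

```python
from collections import deque

HOT_LIMIT, WARM_LIMIT = 2, 2   # hypothetical tiny watermarks

hot, warm, cold = deque(), deque(), deque()

def insert(it):
    hot.appendleft(it)          # new items start in HOT
    maintain()

def maintain():
    while len(hot) > HOT_LIMIT:     # HOT overflow flows into COLD
        cold.appendleft(hot.pop())
    while len(warm) > WARM_LIMIT:   # so does WARM overflow
        cold.appendleft(warm.pop())

def touch_cold(it):
    cold.remove(it)             # an active COLD item revives to WARM
    warm.appendleft(it)
    maintain()

def evict():
    return cold.pop() if cold else None  # evictions only from COLD tail
```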
Primarily splitting cache_lock into a lock-per LRU, and making the
it->slab_clsid lookup indirect. cache_lock is now more or less gone.
Stats are still wrong. They need to internally summarize over each
sub-class.
Unfortunately if you disable CAS, all items set in the same second as a
flush_all will immediately expire. This is the old (2006ish) behavior.
However, if CAS is enabled (as is the default), it will still be more or less
exact.
The locking issue is that if the LRU lock is held, you may not be able to
modify an item if the item lock is also held. This means that some items may
not be flushed if locking is done correctly.
In the current code, it could lead to corruption as an item could be locked
and in use while the expunging is happening.
We used to hold a global lock around all modifications to the hash table.
Then it was switched to wrapping hash table accesses in a global lock during
hash table expansion, set by notifying each worker thread to change lock
styles. There was a bug here which caused trylocks to clobber, due to the
specific item locks not being held during the global lock:
https://code.google.com/p/memcached/issues/detail?id=370
The patch previous to this one uses item locks during hash table expansion.
Since the item lock table is always smaller than the hash table, an item lock
will always cover both its new and old buckets.
However, we still need to pause all threads during the pointer swap and setup.
This patch pauses all background threads and worker threads, swaps the hash
table, then unpauses them.
This trades the (possibly significant) slowdown during the hash table copy,
with a short total hang at the beginning of each expansion. As before,
those worried about consistent performance can presize the hash table with
`-o hashpower=n`
If a client fetches a few thousand keys, then does not ever read the socket,
those keys will stay reflocked until the client disconnects or resumes. If
some of those items are unpopular they can drop to the tail, causing all
writes in the slab class to OOM.
This creates some relief by chucking the items back to the head.
Big thanks to Jay Grizzard and other folks at Box for helping narrow this
down.
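The relief valve can be sketched like this (a Python toy, not the C LRU code; the refcount convention of "greater than 1 means a connection holds a reference" is an assumption for the sketch):

```python
def pull_tail(lru):
    """lru: list with head at index 0, tail at the end.
    Reflocked tail items are chucked back to the head instead of
    wedging every eviction attempt in the class."""
    for _ in range(len(lru)):
        it = lru[-1]
        if it["refcount"] > 1:        # busy: a stalled client holds it
            lru.insert(0, lru.pop())  # move it back to the head
            continue
        return lru.pop()              # normal eviction candidate
    return None                       # everything is reflocked
```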
lru_crawler crawl 1,2,3,10,20 will kick crawlers off for all of those slabs in
parallel.
nothing internally magically fires it off yet, but now there is an external
command:
lru_crawler crawl [classid]
... will signal the thread to wake up and immediately reap through a
particular class.
need some thought/feedback for internal kickoffs (plugins?)
So many things are undone... the TODO is inline in items.c.
This seems to work, and the locking should be correct. It is a background
thread so it shouldn't cause significant latency. However, it does quickly roll
through the entire LRU (and as of this PoC it just constantly runs), so there
will be CPU impact.
This doesn't reduce mutex contention much, if at all, for the global stats
lock, but it does remove a handful of instructions from the alloc hot path,
which is always worth doing.
Previous commits possibly added a handful of instructions for the loop and for
the bucket readlock trylock, but this is still faster than .14 for writes
overall.
Fixes a few issues with a restructuring... I think -M was broken before,
should be fixed now. It had a refcount leak.
Now walks up to five items from the bottom in case of the bottommost items
being item_locked, or refcount locked. Helps avoid excessive OOM errors for
some oddball cases. Those happen more often if you're hammering on a handful
of pages in a very large class size (100k+)
The hash item lock ensures that if we're holding that lock, no other thread
can be incrementing the refcount lock at that time. It will mean more in
future patches.
slab rebalancer gets a similar update.
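The five-item walk reads roughly like this Python sketch (not the C; the `locked`/`refcount` fields stand in for the item lock and refcount checks):

```python
def lru_pull_tail(lru, tries=5):
    """lru: head at index 0, tail at the end. Examine up to `tries`
    items from the tail, skipping any that are item_locked or
    refcount-locked, before giving up (an OOM in the real code)."""
    for it in list(reversed(lru))[:tries]:
        if it["locked"] or it["refcount"] > 1:
            continue          # busy item: look further up the tail
        lru.remove(it)
        return it             # safe eviction candidate
    return None               # nothing usable near the tail
```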
Enable at startup with -o slab_reassign,slab_automove
Enable or disable at runtime with "slabs automove 1\r\n"
Has many weaknesses. Only pulls from slabs which have had zero recent
evictions. Is slow, not tunable, etc. Use the scripts/mc_slab_mover example to
write your own external automover if this doesn't satisfy.
push cache_lock deeper into the abyss
been hard to measure while using the intel hash (since it's very fast), but
should help with the software hash.
Taken from the 1.6 branch, partly written by Trond. I hope the CAS handling is
correct.
These are automatically initialized to 0 (both Trond and the spec says
so, and I asserted it on all current builders at least once before
killing it off).
(dustin) I made some changes to the original growth code to pass in
the required size.
See: http://code.google.com/p/memcached/issues/detail?id=22
This fixes a problem reported as bug 15 where incr and decr do not
change CAS values when they aren't completely replacing the item
(which is the typical case).
http://code.google.com/p/memcached/issues/detail?id=15
wasteful function calls (more to come).
abstraction by the callback.