Commit messages

The thread wasn't keeping up in high-load scenarios with low/default free ratios.

- removes the unused "completed" IO callback handler
- moves the primary post-IO callback handlers from the queue definition to the actual IO objects
- allows IO object callbacks to be handled generically instead of based on the queue they were submitted from

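A minimal sketch of what the per-object callback shape could look like; the types and field names here are illustrative, not memcached's actual definitions. The point is that the post-IO handler hangs off the IO object itself, so any completion path can run it without knowing the submitting queue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: the post-IO callback lives on the IO object
 * itself rather than on the queue definition, so completion can be
 * handled generically. Field names are illustrative. */
typedef struct _io_pending_t io_pending_t;
typedef void (*io_cb)(io_pending_t *io);

struct _io_pending_t {
    int result;      /* filled in by the IO thread */
    io_cb finalize;  /* per-object post-IO handler */
};

static void extstore_finalize(io_pending_t *io) {
    io->result += 1; /* stand-in for building the response */
}

/* generic completion path: no queue-specific dispatch needed */
static int complete_io(io_pending_t *io) {
    io->finalize(io);
    return io->result;
}
```
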
| |
allow users to differentiate thread functions externally to memcached.
Useful for setting priorities or pinning threads to CPU's.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
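One way thread differentiation can surface externally is via thread names; a sketch assuming the Linux/GNU extension pthread_setname_np (names are capped at 15 chars plus NUL). With names set, tools like top -H, ps, or taskset can target individual threads for priorities or CPU pinning. "mc-worker" is an illustrative name, not necessarily what memcached uses.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

/* Sketch: name the calling thread so external tools can identify it.
 * pthread_setname_np/pthread_getname_np are GNU extensions. */
static int set_thread_name(const char *name) {
    return pthread_setname_np(pthread_self(), name);
}

static int get_thread_name(char *buf, size_t len) {
    return pthread_getname_np(pthread_self(), buf, len);
}
```
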
extstore has a background thread which examines slab classes for items to flush to disk. The thresholds for flushing to disk are managed by a specialized "slab automove" algorithm. This algorithm was written in 2017 and has not been tuned since.
Most serious users set "ext_item_age=0" and force-flush all items. This is partially because the defaults do not flush aggressively enough, which causes memory to run out and evictions to happen.
This change simplifies the slab automove portion. Instead of balancing free chunks of memory per slab class, it sets a target of a certain number of free global pages.
The extstore flusher thread also uses the page pool and some low chunk limits to decide when to start flushing. Its sleep routines have also been adjusted, as it could oversleep too easily.
A few other small changes were required to avoid over-moving slab pages around.

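The target-of-free-global-pages idea can be sketched as a tiny decision function; the names and numbers below are illustrative, not memcached's actual code.

```c
#include <assert.h>

/* Sketch of the simplified automove target described above: instead of
 * balancing free chunks per slab class, reclaim pages until a target
 * count of free global pages is met. */
static unsigned int pages_to_reclaim(unsigned int free_pages,
                                     unsigned int target_free) {
    if (free_pages >= target_free)
        return 0; /* at or above target: leave slab classes alone */
    return target_free - free_pages;
}
```
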
Allows tests to run faster and lets users make it sleep for more or less time. Also cuts the sleep time down when actively compacting or when coming out of high idle.

Fixing 'bugs' of the pattern 'assert(ptr != 0)' occurring after 'ptr' was already dereferenced.

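The anti-pattern being fixed, in miniature (function names are made up for illustration): asserting a pointer is non-NULL only after it has already been dereferenced. A NULL pointer would crash on the dereference, so the late assert can never catch anything.

```c
#include <assert.h>
#include <stddef.h>

/* the buggy shape: dereference first... */
static int length_buggy(const int *p) {
    int v = *p;          /* a NULL p already crashed here... */
    assert(p != NULL);   /* ...so this check is dead code */
    return v;
}

/* corrected order: check before use */
static int length_fixed(const int *p) {
    assert(p != NULL);
    return *p;
}
```
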
| |
probably squash into previous commit.
io->c->thead can change for orpahned IO's, so we had to directly add the
original worker thread as a reference.
also tried again to split callbacks onto the thread and off of the
connection for similar reasons; sometimes we just need the callbacks,
sometimes we need both.
|
|
|
|
|
|
|
|
|
|
|
| |
instead of passing ownership of (io_queue_t)*q to the side thread,
instead the ownership of IO objects are passed to the side thread, which
are then individually returned. The worker thread runs return_cb() on
each, determining when it's done with the response batch.
this interface could use more explicit functions to make it more clear.
Ownership of *q isn't actually "passed" anywhere, it's just used or not
used depending on which return function the owner wants.
|
|
|
|
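The per-object return can be sketched as a countdown: the side thread hands IO objects back one at a time, and the worker's return callback decides when the whole batch is complete. Names here are illustrative, not memcached's actual return_cb() signature.

```c
#include <assert.h>

/* Sketch: track IO objects submitted to the side thread but not yet
 * returned; the batch is done when the count reaches zero. */
typedef struct {
    int pending;
} batch_t;

/* worker-side return callback; returns 1 once the batch is done */
static int return_one(batch_t *b) {
    b->pending--;
    return b->pending == 0;
}
```
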
| |
if extstore wasn't enabled, crashes. Reported by @zer0e on github.
|
|
|
|
|
|
| |
since multiple queues can be sent to different sidethreads, we need a
new mechanism for knowing when to return everything. In the common case
only one queue will be active, so adding a mutex would be excessive.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
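A mutex-free sketch of the mechanism described above, under the assumption that both submit accounting and return accounting run on the worker thread: count outstanding queues at submit time and decrement as each drains. Types and names are illustrative.

```c
#include <assert.h>

/* Sketch: per-connection count of queues with outstanding IO; no lock
 * needed because only the worker thread touches the counter. */
typedef struct {
    int queues_outstanding;
} conn_sketch;

static void submit_queue(conn_sketch *c) {
    c->queues_outstanding++;
}

/* called as each queue drains; returns 1 when everything is back */
static int queue_drained(conn_sketch *c) {
    return --c->queues_outstanding == 0;
}
```
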
| |
mc_resp is the proper owner of a pending IO once it's been initialized;
release it during resp_finish(). Also adds a completion callback which
runs on the submitted stack after returning to the worker thread but
before the response is transmitted.
allows re-queueing for pending IO if processing a response generates
another pending IO. also allows a further refactor to run more extstore
code on the worker thread instead of the IO threads.
uses proper conn_io_queue state to describe connections waiting for
pending IO's.
|
|
|
|
|
|
| |
reserve space in an io_pending_t. users cast it to a more specific
structure, avoiding extra allocations for local data. In this case what
might require 3 allocs stays as just 1.
|
|
|
|
|
|
|
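The reserved-space pattern sketched in C; sizes and struct names are illustrative, not memcached's. The generic object carries scratch bytes, and a user casts it to a more specific struct whose leading fields match, keeping its local data inline instead of allocating separately.

```c
#include <assert.h>

#define IO_PENDING_DATA 64

/* generic object with reserved scratch space */
typedef struct {
    int type;
    char data[IO_PENDING_DATA]; /* reserved for the specific user */
} io_pending_sk;

/* a specific user overlays its own layout on the same memory */
typedef struct {
    int type;            /* mirrors io_pending_sk's leading field */
    int page_id;         /* user-local data lives inline */
    unsigned int offset;
} storage_io_sk;

/* the specific struct must fit inside the reserved space */
_Static_assert(sizeof(storage_io_sk) <= sizeof(io_pending_sk),
               "storage_io_sk must fit in io_pending_sk");

static unsigned int use_reserved(io_pending_sk *io) {
    storage_io_sk *sio = (storage_io_sk *)io; /* no extra allocation */
    sio->page_id = 3;
    sio->offset = 128;
    return sio->offset;
}
```
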
| |
extstore.h is now only used from storage.c. starting a path towards
getting the storage interface to be more generalized.
should be no functional changes.
|
|
|
|
|
|
|
|
|
|
|
|
We want to reuse extstore's deferred IO system for something else. This should allow it to evolve into a more plugin-centric system.
Step one of three(?): replace in place; tests pass with extstore enabled.
Step two should move more extstore code into storage.c.
Step three should build the IO queue code without ifdef gating.

mem_requested is an oddball counter: it's the total number of bytes "actually requested" from the slab's callers. It's mainly used for a stats counter, alerting the user that the slab factor may not be efficient if the gap between total_chunks * chunk_size and mem_requested is large.
However, since chunked items were added it's _also_ used to help the LRU balance itself: the total number of bytes used in the class vs. the total number of bytes in a sub-LRU is used to judge whether to move items between sub-LRUs.
This is a layer violation, forcing slabs.c to know more about how items work, as well as about EXTSTORE for calculating item sizes from headers. Further, it turns out it wasn't necessary for item allocation: if we need to evict an item we _always_ pull from COLD_LRU or force a move from HOT_LRU, so the total doesn't matter there.
The total does matter in the LRU maintainer background thread. However, this thread caches mem_requested to avoid hitting the slab lock too frequently. Since sizes_bytes[] within items.c is generally redundant with mem_requested, we now total sizes_bytes[] from each sub-LRU before starting a batch of LRU juggles.
This simplifies the code a bit, reduces the layer violations in slabs.c slightly, and actually speeds up some hot paths as a number of branches and operations are removed completely. It also fixes an issue I was having with the restartable memory branch :) recalculating p->requested while keeping a clean API is painful and slow.
NOTE: this will vary a bit compared to what mem_requested originally did, mostly for large chunked items. For items which fit inside a single slab chunk, the stat is identical. However, items constructed by chaining chunks will have a single large "nbytes" value and end up in the highest slab class. Chunked items can be capped with chunks from smaller slab classes; you will see utilization of those chunks but not an increase in mem_requested for them. I'm still thinking this through, but this is probably acceptable. Large chunked items should be accounted for separately, perhaps with some new counters, so they can be discounted from normal calculations.

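The efficiency stat described above is simple arithmetic; a worked example with made-up numbers (the function name is illustrative):

```c
#include <assert.h>

/* Slack between memory assigned to a slab class and bytes callers
 * actually requested; a large gap suggests the slab growth factor is
 * inefficient for the workload. */
static unsigned long long slack_bytes(unsigned long long total_chunks,
                                      unsigned long long chunk_size,
                                      unsigned long long mem_requested) {
    return total_chunks * chunk_size - mem_requested;
}
```

For example, 1000 chunks of 1024 bytes with 900,000 bytes requested leaves 124,000 bytes of slack.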
| |
Has defaulted to false since 1.5.0, and with -o modern for a few years
before that. Performance is fine, no reported bugs. Always was the
intention. Code is simpler without the options.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
some whackarse ARM platforms on specific glibc/gcc (new?) versions trip
SIGBUS while reading the header chunk for a split item.
the header chunk is unfortunate magic: It lives in ITEM_data() at a random
offset, is zero sized, and only exists to simplify code around finding the
orignial slab class, and linking/relinking subchunks to an item.
there's no fix to this which isn't a lot of code. I need to refactor chunked
items, and attempted to do so, but couldn't come up with something I liked
quickly enough.
This change pads the first chunk if alignment is necessary, which wastes
bytes and a little CPU, but I'm not going to worry a ton for these obscure
platforms.
this works with rebalancing because in the case of ITEM_CHUNKED header, it
treats the item size as the size of the class it resides in, and memcpy's the
item during recovery.
all other cases were changes from ITEM_data to a new ITEM_schunk() inline
function that is created when NEED_ALIGN is set, else it's equal to ITEM_data
still.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
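The padding itself is a standard round-up-to-alignment computation; a sketch with an assumed 8-byte alignment (the real value depends on the platform and on NEED_ALIGN):

```c
#include <assert.h>
#include <stdint.h>

/* Round an offset up to the alignment boundary so strict platforms
 * (like the ARM builds above) don't SIGBUS on unaligned loads.
 * CHUNK_ALIGN is illustrative. */
#define CHUNK_ALIGN 8

static uintptr_t align_up(uintptr_t off) {
    return (off + (CHUNK_ALIGN - 1)) & ~(uintptr_t)(CHUNK_ALIGN - 1);
}
```
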
| |
Just a Bunch Of Devices :P
code exists for routing specific devices to specific buckets
(lowttl/compact/etc), but enabling it requires significant fixes to
compaction algorithm. Thus it is disabled as of this writing.
code cleanups and future work:
- pedantically freeing memory and closing fd's on exit
- unify and flatten the free_bucket code
- defines for free buckets
- page eviction adjustment (force min-free per free bucket)
- fix default calculation for compact_under and drop_under
- might require forcing this value only on default bucket
|
|
|
|
|
|
|
|
|
|
|
| |
trying out a simplified slab class backoff algorithm. The LRU maintainer
individually schedules slab classes by time, which leads to multiple wakeups
in a steady state as they get out of sync. This algorithm more simply skips
that class more often each time it runs the main loop, using a single
scheduled sleep instead.
if it goes to sleep for a long time, it also reduces the backoff for all
classes. if we're barely awake it should be fine to poke everything.
|
| |
|
|
|
|
|
|
|
|
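The skip-counter scheme can be sketched as below; the struct and helpers are illustrative, not the LRU maintainer's actual code. Each class is only examined when its counter reaches its backoff, so a single scheduled sleep covers all classes, and a long sleep cheapens poking everything again.

```c
#include <assert.h>

typedef struct {
    unsigned int backoff; /* rounds to skip between checks */
    unsigned int skipped; /* rounds skipped so far */
} class_sched;

/* returns 1 when the class should be checked this round */
static int should_check(class_sched *s) {
    if (s->skipped < s->backoff) {
        s->skipped++;
        return 0;
    }
    s->skipped = 0;
    return 1;
}

/* after a long sleep it's cheap to poke everything again */
static void reduce_backoff(class_sched *s) {
    if (s->backoff > 0)
        s->backoff /= 2;
}
```
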
| |
memory alignment when reading header data back.
left "32" in a few places that should've at least been a define, is now
properly an offsetof. used for skipping crc32 for dynamic parts of the item
headers.
|
|
|
|
|
|
|
|
|
|
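Deriving the skip length from the struct layout instead of a magic number looks roughly like this; the header layout below is illustrative, not extstore's real header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header: the leading fields change after write and are
 * excluded from the checksum; static data begins at `nbytes`. */
typedef struct {
    uint64_t cas;     /* dynamic: excluded from the checksum */
    uint32_t flags;   /* dynamic: excluded from the checksum */
    uint32_t crc;     /* the checksum itself */
    uint32_t nbytes;  /* static data begins here */
} hdr_sketch;

/* skip length derived from the layout, not hardcoded */
enum { CRC_SKIP = offsetof(hdr_sketch, nbytes) };
```
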
* automover
* avoiding
* compress
* fails
* successfully
* success
* tidiness

Simple change.

| |
there's now an optional ext_drop_under setting which defaults to the same as
compact_under, which should be fine. now, if drop_unread is enabled, it only
kicks in if there are no pages matching the compaction threshold.
This allows you to set a lower compaction frag rate, then start rescuing only
non-COLD items if storage is too full. You can also compact up to a point,
then allow a buffer of pages to be used before dropping unread.
previously enabling drop_unread would always drop_unread even when compacting
normally. This limited utility of the feature.
|
|
|
|
|
|
| |
"watch evictions" will show a stream of evictions + writes to extstore.
useful for analyzing the remaining ttl or key pattern of stuff being
flushed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ITEM_LINKED was still set on the objects being written to disk. If a page
being moved contains a chunk read from extstore currently being written back
to the client, it will mistake it for a chunk properly linked and attempt to
unlink if if the page has also been jammed in the mover.
So you need a cross section of a particular chunk being active the entire
time a page is jammed, then it tries to unlink it but the header is partially
garbage and it segfaults.
The page mover ignores all items which don't either have LINKED or SLABBED,
assuming they're in transition. So the fix is to simply remove the LINKED bit
after copying into the write buffer.
|
|
|
|
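The fix is a one-bit masking operation on the copy; a sketch with illustrative flag values (memcached's actual flag constants live in its item headers):

```c
#include <assert.h>

/* Hypothetical flag bits; values are illustrative. */
#define ITEM_LINKED  (1 << 0)
#define ITEM_SLABBED (1 << 2)

/* After copying an item into the write buffer, clear LINKED on the
 * copy so the page mover treats it as in-transition (neither LINKED
 * nor SLABBED) and never tries to unlink it. */
static unsigned int flags_for_write_copy(unsigned int it_flags) {
    return it_flags & ~(unsigned int)ITEM_LINKED;
}
```
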
I was so sure I'd triggered this with tests...

| |
was early evicting from HOT/WARM LRU's for item headers because the
*original* item size was being tracked, then compared to the actual byte
totals for the class.
also adjusts drop_unread so it drops items which are currently in the COLD_LRU
this is expected to be used with very low compacat_under values; ie 2-5
depending on page count and write load. If you can't defrag-compact,
drop-compact.
but this is still subtly wrong, since drop_compact is now an option.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
external comands only for the moment. allows specifying per-slab-class how
many chunks to leave free before causing flushing to storage.
the external page mover algo in previous commits has a few issues:
- relies too heavily on page mover. lots of constant activity under load.
- adjusting the item age level on the fly is too laggy, and can easily
over-frefree or under-free. IE; class 3 has TTL 90, but class 4 has TTL 60
and most of the pages in memory, it won't free much until it lowers to 60.
Thinking this would be something like; % of total chunks in slab class.
easiest to set as a percentage of total memory or by write rate periodically.
from there TTL can be balanced almost as in the original algorithm; keep a
small global page pool for small items to allocate memory from, and pull
pages from or balance between storage-capable classes to align TTL.
|
|
|
|
|
|
|
|
| |
had a hardcoded value of "start to compact under a slew if more than 3/4ths
of pages are used", but this allows it to be set directly.
ie; "I have 100 pages but don't want to compact util almost full, and then
drop any unread"
|
|
|
|
|
|
|
|
|
|
|
|
| |
was struggling to figure out how to automatically turn this on or off, but I
think it should be part of an outside process.
ie; a mechanism should be able to target a specific write rate, and one of
its tools for reducing the write rate should be flipping this on.
there's *still* a hole where you can't trigger a compaction attempt if
there's no fragmentation. I kind of want, if this feature is on, to attempt
a compaction on the oldest page while dropping unread items.
|
|
|
|
|
|
|
|
|
|
|
|
| |
LRU crawler was not marking reclaimed expired items as removed from the
storage engine. This could cause fragmentation to persist much longer than it
should, but would not cause any problems once compaction started.
Adds "ext_low_ttl" option. Items with a remaining expiration age below this
value are grouped into special pages. If you have a mixed TTL workload this
would help prevent low TTL items from causing excess fragmentation/compaction.
Pages with low ttl items are excluded from compaction.
|
| |
|
|
|
|
| |
could potentially cause weirdness when the hash table is swapped.
|
|
|
|
|
|
|
|
|
|
| |
refuse to start if inline_ascii_resp is enabled, due to it breaking the item
header objects.
actually use iovst value passed during binprot requests
make the item flags converters the same (strtoul can eat leading space). need
to replace them with a function still.
|
|
|
|
|
|
| |
item size max must be <= wbuf_size.
reads into iovecs, writes out of same iovecs.
|
|
|
|
|
|
| |
if < 2 free pages left, "evict" objects which haven't been hit at all.
should be better than evicting everything if we can continue compacting.
|
|
|
|
|
|
|
|
| |
write_request returns a buffer to write into, which lets us not corrupt the
active item with the hash and crc.
"technically" we can save 24 bytes per item in storage but I'll leave that
for a later optimization, in case we want to stuff more data into the header.
|
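The interface shape can be sketched as below; the buffer handling and function name are illustrative, not memcached's actual write_request signature. The engine hands back a buffer the caller copies the item into, so the hash and crc are computed over the copy and the live item is never modified.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WBUF_SIZE 256
static char wbuf[WBUF_SIZE];

/* Returns a buffer to write into, or NULL if the item won't fit; the
 * caller memcpy's the item and header fields are added to the copy. */
static char *write_request_sk(size_t len) {
    if (len > WBUF_SIZE)
        return NULL; /* item size max must be <= the write buffer */
    return wbuf;
}
```
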
Been squashing, reorganizing, and pulling code off to go upstream ahead of merging the whole branch.