| Commit message | Author | Age | Files | Lines |
| |
Sending the 's' flag to metaset now returns the size of the stored
item. Useful if you want to know how large an item has become after an
append or prepend.
If the 'N' flag is supplied while in append/prepend mode, the item is
autovivified (with the exptime taken from N), so append/prepend-style
keys no longer need their headers created first.
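A hypothetical exchange (meta flag syntax per doc/protocol.txt; the
returned size is illustrative). Append three bytes in append mode,
autovivifying with a 60 second TTL if the key is missing, and request
the resulting size back:
  ms foo 3 MA N60 s
  abc
  HD s3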
|
| |
`mcp.pool(p, { dist = etc, iothread = true })`
By default the IO thread is not used; instead a backend connection is
created for each worker thread. This can be overridden by setting
`iothread = true` when creating a pool.
`mcp.pool(p, { dist = etc, beprefix = "etc" })`
If a `beprefix` is added to the pool arguments, unique backend
connections are created for this pool. This lets you create multiple
sockets per backend by making multiple pools with unique prefixes.
There are legitimate use cases for sharing backend connections across
different pools, which is why sharing remains the default behavior.
|
| |
- removes the unused "completed" IO callback handler
- moves primary post-IO callback handlers from the queue definition to
the actual IO objects.
- allows IO object callbacks to be handled generically instead of based
on the queue they were submitted from, as sketched below.
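A conceptual sketch of the shape this enables (names hypothetical, not
memcached's actual structs):
  typedef struct io_pending io_pending_t;
  typedef void (*io_return_cb)(io_pending_t *io);
  struct io_pending {
      int queue_type;          /* which queue submitted it */
      io_return_cb return_cb;  /* callback now lives on the object */
      void *data;
  };
  /* completion no longer needs to switch on the source queue: */
  static void io_complete(io_pending_t *io) {
      io->return_cb(io);
  }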
|
| |
We want to start using cache commands in contexts without a client
connection, but the client object has always been passed to all
functions.
In most cases we only need the worker thread (LIBEVENT_THREAD *t), so
this change adjusts the arguments passed in.
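The pattern of the change, as declarations only (function names are
illustrative; the real argument lists differ):
  #include <stddef.h>
  typedef struct conn conn;                       /* client connection */
  typedef struct LIBEVENT_THREAD LIBEVENT_THREAD; /* worker thread */
  typedef struct item item;
  /* before: the whole client object came along for the ride */
  item *item_get_old(const char *key, size_t nkey, conn *c);
  /* after: only the worker thread, which is all most callers need */
  item *item_get_new(const char *key, size_t nkey, LIBEVENT_THREAD *t);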
|
| |
allow users to differentiate thread functions externally to memcached.
Useful for setting priorities or pinning threads to CPUs.
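A sketch of the mechanism (the thread name shown is hypothetical;
memcached's actual names may differ). Named threads are visible to
tools like top -H, ps -L, and gdb:
  #define _GNU_SOURCE
  #include <pthread.h>
  static void *worker_main(void *arg) {
      /* Linux limits names to 15 chars plus the terminator */
      pthread_setname_np(pthread_self(), "mc-worker");
      return arg;
  }
  int main(void) {
      pthread_t tid;
      pthread_create(&tid, NULL, worker_main, NULL);
      pthread_join(tid, NULL);
      return 0;
  }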
|
| |
clang-15+ has started diagnosing function declarations without
prototypes as errors:
thread.c:925:18: error: a function declaration without a prototype is deprecated in all versions of C [-Werror,-Wstrict-prototypes]
void STATS_UNLOCK() {
                 ^
                  void
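The fix is an explicit (void) parameter list; a minimal compilable
sketch mirroring thread.c's lock wrappers:
  #include <pthread.h>
  static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;
  /* before: void STATS_UNLOCK() { ... } -- no prototype, now an error */
  void STATS_UNLOCK(void) {
      pthread_mutex_unlock(&stats_lock);
  }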
Signed-off-by: Khem Raj <raj.khem@gmail.com>
|
| |
should make isolation/testing easier.
|
| |
-l proto[ascii]:127.0.0.1:11211
accepts:
- ascii
- binary
- negotiating
- proxy
Allows running the proxy on the default listeners while talking
directly to memcached on a specific port, or binary and ascii on
different ports, and so on.
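For example (ports illustrative): proxy on the default port with a
direct ASCII listener on a side port:
  memcached -l proto[proxy]:0.0.0.0:11211 -l proto[ascii]:127.0.0.1:11213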
|
| |
-l tag[asdfasdf]:0.0.0.0:11211
not presently used for anything outside of the proxy code.
|
| |
See BUILD for compilation details.
See t/startfile.lua for configuration examples.
(see also https://github.com/memcached/memcached-proxylibs for
extensions, config libraries, more examples)
NOTE: io_uring mode is _not stable_, will crash.
As of this commit it is not recommended to run the proxy in production.
If you are interested please let us know, as we are actively stabilizing
for production use.
|
| |
probably squash into previous commit.
io->c->thread can change for orphaned IO's, so we had to directly add
the original worker thread as a reference.
also tried again to split callbacks onto the thread and off of the
connection for similar reasons; sometimes we just need the callbacks,
sometimes we need both.
|
| |
Instead of passing ownership of (io_queue_t)*q to the side thread,
ownership of the individual IO objects is passed to the side thread,
and they are then individually returned. The worker thread runs
return_cb() on each, determining when it's done with the response
batch.
This interface could use more explicit functions to make it clearer.
Ownership of *q isn't actually "passed" anywhere; it's just used or not
used depending on which return function the owner wants.
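A sketch of the batch accounting on the worker side (types and names
hypothetical, not the actual interface):
  struct io_queue { int pending; };
  typedef struct io_object { struct io_queue *q; } io_object_t;
  /* the worker runs this once per IO object the side thread returns */
  static void return_cb(io_object_t *io) {
      if (--io->q->pending == 0) {
          /* response batch complete; the connection can resume */
      }
  }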
|
| |
Now that all of the reads/writes to the notify pipe are in one place,
we can easily use Linux's eventfd if available. This also allows
batching events so we're not firing the same notifier constantly.
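The batching property in a self-contained sketch: an eventfd is a
counter, so back-to-back notifications coalesce into a single wakeup.
  #include <sys/eventfd.h>
  #include <stdint.h>
  #include <unistd.h>
  int main(void) {
      int efd = eventfd(0, EFD_NONBLOCK);
      uint64_t one = 1, n = 0;
      write(efd, &one, sizeof(one));  /* notify */
      write(efd, &one, sizeof(one));  /* coalesces with the first */
      read(efd, &n, sizeof(n));       /* n == 2: two events, one wakeup */
      close(efd);
      return 0;
  }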
|
| |
Worker notification was a mix of reading data from a pipe or examining
an object queue stack. Now it's all one interface. This is necessary to
switch signalling to eventfd or similar, since we won't have that pipe
to work with.
|
| |
help scalability a bit by having a per-worker-thread freelist and queue
for connection event items (new conns, etc). Also removes a hand-rolled
linked list and uses cache.c for freelist handling to cull some
redundancy.
|
| |
cache constructors/destructors were never used, which just ended up
being wasted branches. Since there's no constructor/destructor for cache
objects, we can use the memory itself for the freelist.
This removes a doubling realloc for the freelist of cache objects and
simplifies the code a bunch.
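The trick, as a minimal sketch (not cache.c's exact code): a free
object's own bytes hold the freelist link, so no side array is needed.
  #include <stdlib.h>
  typedef struct free_obj { struct free_obj *next; } free_obj;
  typedef struct { free_obj *head; size_t obj_size; } cache_sketch_t;
  static void cache_free(cache_sketch_t *c, void *obj) {
      free_obj *o = obj;
      o->next = c->head;   /* the freed memory is the link itself */
      c->head = o;
  }
  static void *cache_alloc(cache_sketch_t *c) {
      if (c->head != NULL) {
          free_obj *o = c->head;
          c->head = o->next;
          return o;
      }
      return malloc(c->obj_size); /* obj_size >= sizeof(free_obj) */
  }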
|
| |
By default memcached assigns connections to worker threads in
a round-robin manner. This patch introduces an option to select
a worker thread based on the incoming connection's NAPI ID if
SO_INCOMING_NAPI_ID socket option is supported by the OS.
This allows a memcached worker thread to be associated with a
NIC HW receive queue and service all the connection requests
received on a specific RX queue. This mapping between a memcached
thread and a HW NIC queue streamlines the flow of data from the
NIC to the application. In addition, an optimal path with reduced
context switches is possible, if epoll based busy polling
(sysctl -w net.core.busy_poll = <non-zero value>) is also enabled.
This feature is enabled via a new command line parameter -N <num>
or "--napi_ids=<num>", where <num> is the number of available/assigned
NIC hardware RX queues through which the connections can be received.
The number of napi_ids specified cannot be greater than the number
of worker threads specified using -t/--threads option.
If the option is not specified, or the conditions are not met, the
code defaults to round-robin thread selection.
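A sketch of the selection step (helper name hypothetical; requires a
kernel exposing SO_INCOMING_NAPI_ID):
  #include <sys/socket.h>
  /* map an accepted connection to a worker via its NAPI ID;
   * num_napi_ids mirrors the -N option. */
  static int pick_worker_by_napi(int fd, int num_napi_ids) {
      unsigned int napi_id = 0;
      socklen_t len = sizeof(napi_id);
      if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                     &napi_id, &len) != 0 || napi_id == 0) {
          return -1; /* fall back to round robin */
      }
      return (int)(napi_id % num_napi_ids);
  }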
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
|
| |
mc_resp is the proper owner of a pending IO once it's been initialized;
release it during resp_finish(). Also adds a completion callback which
runs on the submitted stack after returning to the worker thread but
before the response is transmitted.
allows re-queueing for pending IO if processing a response generates
another pending IO. also allows a further refactor to run more extstore
code on the worker thread instead of the IO threads.
uses proper conn_io_queue state to describe connections waiting for
pending IO's.
|
| |
Don't gate the deferred io_queue code on EXTSTORE. This removes a
number of ifdef's and allows cleaner usage of the interface.
|
| |
want to reuse the deferred IO system for extstore for something else.
Should allow evolving into a more plugin-centric system.
step one of three(?) - replace in place and tests pass with extstore
enabled.
step two should move more extstore code into storage.c
step three should build the IO queue code without ifdef gating.
|
| |
accept but warn if the commandline is used. also keeps the oom counter
distinct so users have a chance of telling what type of memory pressure
they're under.
|
| |
If the no_modern option was supplied then these threads did not run,
but memcached still attempted to join them, which resulted in a
segfault.
resolve #685
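The shape of the fix, as a minimal sketch (variable and helper names
hypothetical):
  #include <pthread.h>
  #include <stdbool.h>
  static pthread_t lru_maintainer_tid;
  static bool lru_maintainer_started = false;
  static void stop_threads_sketch(void) {
      /* under no_modern the thread never ran; joining an
       * uninitialized pthread_t is what segfaulted */
      if (lru_maintainer_started) {
          pthread_join(lru_maintainer_tid, NULL);
      }
  }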
|
| |
1. warning: passing argument 4 of 'event_set' from incompatible pointer type [-Wincompatible-pointer-types]
--> libevent expects callbacks to take evutil_socket_t for the fd instead of int. The current code works fine on *nix, where it's just int, but not on Windows, where it's defined as intptr_t (see the sketch below).
2. warning: passing argument 4 of 'getsockopt' from incompatible pointer type [-Wincompatible-pointer-types]
--> Apply a cast for Windows' getsockopt call.
3. warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
--> Replace %lx with the more portable %p; long is only 32-bit on 64-bit Windows.
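For the first warning, the portable callback signature (handler name
hypothetical):
  #include <event2/event.h>
  /* evutil_socket_t is int on *nix and intptr_t on Windows */
  static void event_handler(evutil_socket_t fd, short which, void *arg) {
      (void)fd; (void)which; (void)arg;
  }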
|
| |
Client connections were being closed and cleaned up after worker
threads exit. In 2018 a patch went in to have the worker threads
actually free their event base when stopped. If your system is strict
enough (which is apparently none out of the dozen+ systems we've tested
against!) it will segfault on invalid memory.
This change leaves the workers hung while they wait for connections to
be centrally closed. I would prefer to have each worker thread close
its own connections for speed if nothing else, but we still need to
close the listener connections and any connections currently open in
side channels.
Much appreciation to darix for helping narrow this down, as it
presented as a wiped stack that only appeared in a specific build
environment on a specific linux distribution.
Hopefully with all of the valgrind noise fixes lately we can start
running it more regularly and spot these early.
|
| |
see doc/protocol.txt. needed slightly different code as we have to
generate the response line after the main operation completes.
|
| |
allows specifying a megabyte limit for either response objects or read
buffers. This limit is split among all of the worker threads.
Useful if the connection limit is extremely high and you want to
aggressively close connections if something happens and all of the
connections become active at the same time.
Runtime tuning is still missing.
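Usage sketch (option name as documented in current releases; treat it
as illustrative for this commit): cap read buffers at 64 megabytes,
divided across the workers:
  memcached -c 100000 -o read_buf_mem_limit=64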
|
| |
Instead of starting at 2k and reallocing all over every time you set a
large item or do large pipelined fetches, use a static, slightly
larger buffer.
Idle connections no longer hold a buffer, freeing up a ton of memory.
To maintain compatibility with unbound ASCII multigets, those fall back
to the old malloc/realloc/free routine, as they have done since the
dark ages.
|
| |
This change refactors most of memcached's networking frontend code. The
goals with this change:
- Decouple buffer memory from client connections where plausible,
reducing memory usage for currently inactive TCP sessions.
- Allow easy response stacking for all protocols. Previously every
binary command generated a syscall during response. This is no longer
true when multiple commands are read off the wire at once. The new meta
protocol and most text protocol commands behave similarly.
- Reduce code complexity for network handling. Remove automatic buffer
adjustments, error checking, and so on.
This is accomplished by removing the iovec, msg, and "write buffer"
structures from connection objects. A `mc_resp` object is now used to
represent an individual response. These objects contain small iovec's
and enough detail to be late-combined on transmit. As a side effect,
UDP mode now works with extstore :)
Adding to the iovec always had a remote chance of memory failure, so
every call had to be checked for an error. Now once a response object
is allocated, most manipulations can happen without any checking. This
is both a boost to code robustness and performance for some hot paths.
This patch is the first in a series, focusing on the client response.
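A conceptual sketch of the response object's shape (field names
hypothetical; see memcached.h for the real mc_resp):
  #include <sys/uio.h>
  #define RESP_IOV_COUNT 4
  typedef struct _mc_resp {
      struct _mc_resp *next;            /* responses stack per connection */
      int iovcnt;
      int tosend;                       /* total bytes left to transmit */
      struct iovec iov[RESP_IOV_COUNT]; /* small iovec, combined late */
      char wbuf[512];                   /* small local write buffer */
  } mc_resp;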
|
| |
In most of memcached's source code, `return` is followed by an
immediate semicolon; these lines were the exception. This change will
help future contributors keep the style consistent.
|
| |
During the configure step the probe was merely satisfied by generating
the header; there were no symbols to link against, so it had no chance
of working as-is.
Also changes the probe's signature: on some platforms pthread_t is an
opaque type, so it is cast to a type large enough to hold it on all
OSes.
|
| |
Frees said context. Don't use SSL_shutdown(), as the connection was not
established.
Also fixes a potential leak if dispatch_conn_new() fails, though that
shouldn't be possible on most systems; it requires either a malloc
failure or an event_add() failure.
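The cleanup path, as a minimal sketch (helper name hypothetical):
  #include <openssl/ssl.h>
  static void abort_new_tls_conn(SSL *ssl) {
      /* no SSL_shutdown(): that would try to send a close_notify
       * for a session that never completed its handshake */
      SSL_free(ssl);
  }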
|
| |
"-e /path/to/tmpfsmnt/file"
SIGUSR1 for graceful stop
restart requires the same memory limit, slab sizes, and some other
infrequently changed details. Most other options and features can
change between restarts. Binary can be upgraded between restarts.
Restart does some fixup work on start for every item in cache. Can take
over a minute with more than a few hundred million items in cache.
Keep in mind when a cache is down it may be missing invalidations,
updates, and so on.
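Usage sketch (paths and sizes illustrative):
  memcached -m 4096 -e /tmpfs_mount/memory_file
  kill -s USR1 <pid>     # graceful stop; cache state is saved
  memcached -m 4096 -e /tmpfs_mount/memory_file
The second start must reuse the same memory limit and slab settings,
per the notes above.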
|
| |
re #469 - delete actually locks/unlocks the item (and hashes the
key!) three times. In between the fetch and the unlink, a fully locked
add_delta() can run, deleting the underlying item. DELETE then returns
success despite the original object hopscotching over it.
I really need to get to the frontend rewrite soon :(
This commit hasn't been fully audited for deadlocks on the stats
counters or the extstore STORAGE_delete() function, but it passes tests.
|
| |
Most of the work done by Tharanga. Some commits squashed in by
dormando. Also reviewed by dormando.
Tested, working, but experimental implementation of TLS for memcached.
Enable with ./configure --enable-tls
Requires OpenSSL 1.1.0 or better.
See `memcached -h` output for usage.
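Usage sketch (option names as printed by `memcached -h` in TLS-enabled
builds; file names illustrative):
  ./configure --enable-tls
  memcached -Z -o ssl_chain_cert=server.pem -o ssl_key=server.key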
|
| |
trying out a simplified slab class backoff algorithm. The LRU maintainer
individually schedules slab classes by time, which leads to multiple wakeups
in a steady state as they get out of sync. This algorithm more simply skips
that class more often each time it runs the main loop, using a single
scheduled sleep instead.
if it goes to sleep for a long time, it also reduces the backoff for all
classes. if we're barely awake it should be fine to poke everything.
|
| |
The slab rebalancer was paused after the LRU maintainer thread during hash
table expansion. This was mostly fine, except in one case.
If an item is in the COLD LRU when it gets hit, it gets assigned an ACTIVE
bitflag and queued for an asynchronous bump. If the item already has an
ACTIVE bit, it does not queue a second time. Items in other LRU's are bumped
when they hit the tail, but COLD items have to bump relatively soon to avoid
accidentally evicting recently active items while under memory pressure.
Items queued for async bumps have an extra refcount.
The slab page mover cannot move a page until all memory within the page is
cleared and freed. If it runs into an item which is stuck, it will try to
delete it, then wait for all of its references to be removed.
However, if an item is in the bump queue when the LRU thread is stopped,
and the slab page mover is moving the page a queued item happens to be in,
the refcount will never drop to zero as the bump will never happen.
The page mover has no mechanism for "giving up" partway through, so it loops
forever. For most use cases, this degrades the instance but doesn't actually
break much:
- The COLD LRU's will eventually drain to zero.
- Allocations will all become "direct reclaims", which force items into COLD
and then evicts them.
- This will degrade the LRU algorithm and some performance, but if the
instance usage isn't very high most users may never notice.
- Normally, page moving is rare enough and hash table expansion is rare
enough that it's an unlikely race.
However, with extstore, page moves tend to happen a lot more frequently.
Also, items are only ever written to extstore from the COLD LRU's, so if they
empty the process will stop. This makes the race more likely to happen and
the effects are more obvious.
The fix is to simply pause the page mover before the LRU thread. The LRU
thread does call the page mover, but it is via trylock() since it normally
expects to progress even if the page mover is busy. The pause of the page
mover won't succeed until it fully completes the current page, so it should
then be clear to pause the LRU thread.
Further improvements could be made by allowing the page mover to bail after N
unsuccessful loops. I am also considering removing the async LRU bump
process, and letting the LRU crawler handle that instead.
Much thanks to Shashi and Sridhar from Netflix for all of their help in
tracking down this bug!
|
| |
If we detect libevent version >= 2.0.2-alpha, change to use
event_base_new_with_config() instead of obsolete event_init() for creating new
event base. Set config flag to EVENT_BASE_FLAG_NOLOCK to avoid lock/unlock
around every libevent operation.
For newer versions of libevent, the event_init() API is deprecated and
totally unsafe for multithreaded use. By using the new API, we can
explicitly disable/enable locking on the event_base.
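A sketch of the construction (the version guard is simplified;
memcached's actual cutoff check may differ):
  #include <event2/event.h>
  static struct event_base *new_worker_base(void) {
  #if defined(LIBEVENT_VERSION_NUMBER) && LIBEVENT_VERSION_NUMBER >= 0x02000200
      /* per-thread bases are never shared, so skip locking */
      struct event_config *cfg = event_config_new();
      event_config_set_flag(cfg, EVENT_BASE_FLAG_NOLOCK);
      struct event_base *base = event_base_new_with_config(cfg);
      event_config_free(cfg);
      return base;
  #else
      return event_base_new(); /* pre-2.0.2 fallback */
  #endif
  }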
|
| |
could potentially cause weirdness when the hash table is swapped.
|
| |
Been squashing, reorganizing, and pulling code off to go upstream ahead
of merging the whole branch.
|
| |
removes a few ifdef's and upstreams small internal interface tweaks for easy
rebase.
|
| |
Implement an aggressive version of drop_privileges(). Additionally add
similar initialization function for threads drop_worker_privileges().
This version is similar to Solaris one and prohibits memcached from
making any not approved syscalls. Current list narrows down the allowed
calls to socket sends/recvs, accept, epoll handling, futex (and
dependencies - mmap), getrusage (for stats), and signal / exit
handling.
Any incorrect behaviour will result in EACCES returned. This should be
restricted further to KILL in the future (after more testing).
The feature is only tested for i386 and x86_64. It depends on bpf
filters and seccomp enabled in the kernel. It also requires libseccomp
for abstraction to seccomp filters. All are available since Linux 3.5.
Seccomp filtering can be enabled at compile time with --enable-seccomp.
In case of local customisations which require more rights, memcached
allows disabling drop_privileges() with "-o no_drop_privileges" at
startup.
Tests have to run with "-o relaxed_privileges", since they require
disk access after the tests complete. This adds a few allowed syscalls,
but does not disable the protection system completely.
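A trimmed sketch of the approach with libseccomp (the real allowlist
is much longer):
  #include <errno.h>
  #include <seccomp.h>
  static int drop_privileges_sketch(void) {
      /* default action: deny with EACCES rather than killing */
      scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EACCES));
      if (ctx == NULL) return -1;
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvfrom), 0);
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(accept4), 0);
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(epoll_wait), 0);
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);
      seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
      return seccomp_load(ctx);
  }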
|
| |
plus a locking fix for slabs reassign.
The LRU maintainer can call into slabs reassign and the LRU crawler, so
we need to pause it before attempting to pause the other two threads.
|
| |
No actual speed loss. Emulates the slab_stats "get_hits" by totalling
up the per-LRU get_hits.
Many stats could be split per-LRU, but that should use a different
command/interface.
|
| |
Very old race condition: push to queue, then write to the signal pipe.
With extstore, hammering a secondary item type into the queue could
occasionally swap the item with the type it represented.
Now we differentiate on the type, and the character signifies whether
to pull an item or to do something else entirely.
|
| |
Previous tree fixed a problem; active items needed to be processed from the
tail of COLD, which makes evictions harder without evicting active items.
COLD bumps were modified to be immediate (old style). This uses a
per-worker-thread mostly-nonblocking queue that the LRU thread consumes for
COLD bumps.
In most cases, hits to COLD are 1/10th or less than the other classes. On high
rates of access where the buffers fill, those items simply don't get their
ACTIVE bit set. If they get hit again with free space, they will be processed
then. This prevents regressions from high speed keyspace scans.
|
| |
When I first split the locks up further, I had a trick where
item_remove() did not require holding the associated item lock. If an
item were to be freed, it would then do the necessary work.
Since then, all calls to refcount_incr and refcount_decr only happen
while the item is locked. This was mostly due to the slab mover being
very tricky with locks. The atomic is no longer needed, as the refcount
is only ever checked after a lock to the item.
Calling atomics is pretty expensive, especially in multicore/multisocket
scenarios. This yields a notable performance increase.
|
| |
item_get() would hash, item_lock, then fetch the item.
Consumers which can bump the LRU would then call item_update(), which
would hash, item_lock, then update the item.
Inlining the LRU bump where it's necessary gives a good performance
boost.
|
| |
Thread scalability is much higher now than when the bucket count was
last tested. Increasing this has a relatively small effect, but it is
still worth adjusting.
Stricter testing should be done to scale the table more properly, as
well as adjusting for false-sharing issues given table entries that
border each other.
|