path: root/thread.c
* meta: N flag changes append/prepend. ms s flag. (dormando, 2023-03-11, 1 file, -2/+2)
  Sending the 's' flag to metaset now returns the size of the item stored. Useful
  if you want to know how large an appended/prepended item now is. If the 'N' flag
  is supplied while in append/prepend mode, it allows autovivifying (with the
  exptime supplied from N) for append/prepend-style keys that don't need headers
  created first.
* declare item_lock_hashpower as a static variable (xuesenliang, 2023-03-08, 1 file, -1/+1)
* proxy: allow workers to run IO optionally (dormando, 2023-02-24, 1 file, -0/+1)
  `mcp.pool(p, { dist = etc, iothread = true })`
  By default the IO thread is not used; instead a backend connection is created
  for each worker thread. This can be overridden by setting `iothread = true`
  when creating a pool.
  `mcp.pool(p, { dist = etc, beprefix = "etc" })`
  If a `beprefix` is added to pool arguments, it will create unique backend
  connections for this pool. This allows you to create multiple sockets per
  backend by making multiple pools with unique prefixes. There are legitimate
  use cases for sharing backend connections across different pools, which is
  why that is the default behavior.
* core: simplify background IO API (dormando, 2023-01-11, 1 file, -4/+3)
  - removes unused "completed" IO callback handler
  - moves primary post-IO callback handlers from the queue definition to the
    actual IO objects.
  - allows IO object callbacks to be handled generically instead of based on
    the queue they were submitted from.
* core: remove *conn object from cache commands (dormando, 2023-01-11, 1 file, -11/+11)
  We want to start using cache commands in contexts without a client
  connection, but the client object has always been passed to all functions.
  In most cases we only need the worker thread (LIBEVENT_THREAD *t), so this
  change adjusts the arguments passed in.
* core: give threads unique names (dormando, 2022-11-01, 1 file, -0/+13)
  Allows users to differentiate thread functions externally to memcached.
  Useful for setting priorities or pinning threads to CPUs.
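  For illustration, a minimal sketch of how a worker thread can name itself on
  Linux, assuming glibc's pthread_setname_np(); the "mc-worker" prefix and the
  helper name are made up for the example, not the names memcached uses:

      #define _GNU_SOURCE
      #include <pthread.h>
      #include <stdio.h>

      /* Linux caps thread names at 15 characters plus the NUL terminator,
       * so keep the prefix short. */
      static void thread_setname_worker(int id) {
          char name[16];
          snprintf(name, sizeof(name), "mc-worker-%d", id);
          pthread_setname_np(pthread_self(), name);
      }

  Named threads then show up in `top -H` and /proc/<pid>/task/*/comm, which is
  what makes pinning or re-prioritizing them from outside practical.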
* Fix function prototypes for clang errors (Khem Raj, 2022-10-13, 1 file, -2/+2)
  clang-15+ has started diagnosing them as errors:
      thread.c:925:18: error: a function declaration without a prototype is
      deprecated in all versions of C [-Werror,-Wstrict-prototypes]
      void STATS_UNLOCK() {
                       ^
                        void
  Signed-off-by: Khem Raj <raj.khem@gmail.com>
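  The fix clang's fix-it note suggests is an explicit (void) parameter list. A
  minimal sketch of the before/after; the mutex here is a stand-in, not the
  exact thread.c code:

      #include <pthread.h>

      static pthread_mutex_t stats_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Old style: an empty "()" means "unspecified parameters" in C17 and
       * earlier, which -Wstrict-prototypes now rejects. */
      /* void STATS_UNLOCK() { pthread_mutex_unlock(&stats_lock); } */

      /* Fixed: spell out that the function takes no arguments. */
      void STATS_UNLOCK(void) {
          pthread_mutex_unlock(&stats_lock);
      }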
* proxy: remove most references to settings global (dormando, 2022-09-15, 1 file, -1/+1)
  Should make isolation/testing easier.
* core: allow forcing protocol per listener socket (dormando, 2022-08-24, 1 file, -2/+4)
      -l proto[ascii]:127.0.0.1:11211
  accepts:
  - ascii
  - binary
  - negotiating
  - proxy
  Allows running the proxy on default listeners but going direct to memcached
  on a specific port, or binary and ascii on different ports, and so on.
* core: add tagging to listener sockets (dormando, 2022-08-24, 1 file, -2/+5)
      -l tag[asdfasdf]:0.0.0.0:11211
  Not presently used for anything outside of the proxy code.
* proxy: initial commit. (dormando, 2021-10-05, 1 file, -0/+38)
  See BUILD for compilation details. See t/startfile.lua for configuration
  examples. (see also https://github.com/memcached/memcached-proxylibs for
  extensions, config libraries, more examples)
  NOTE: io_uring mode is _not stable_, will crash.
  As of this commit it is not recommended to run the proxy in production. If
  you are interested please let us know, as we are actively stabilizing it for
  production use.
* core: io_queue flow second attempt (dormando, 2021-08-09, 1 file, -11/+12)
  Probably squash into previous commit.
  io->c->thread can change for orphaned IOs, so we had to directly add the
  original worker thread as a reference. Also tried again to split callbacks
  onto the thread and off of the connection for similar reasons; sometimes we
  just need the callbacks, sometimes we need both.
* core: io_queue_t flow mode (dormando, 2021-08-09, 1 file, -2/+30)
  Instead of passing ownership of (io_queue_t)*q to the side thread, ownership
  of the individual IO objects is passed to the side thread, and they are then
  individually returned. The worker thread runs return_cb() on each,
  determining when it's done with the response batch.
  This interface could use more explicit functions to make it clearer.
  Ownership of *q isn't actually "passed" anywhere; it's just used or not used
  depending on which return function the owner wants.
* thread: use eventfd for worker notify if available (dormando, 2021-08-09, 1 file, -58/+99)
  Now that all of the reads/writes to the notify pipe are in one place, we can
  easily use Linux eventfd if available. This also allows batching events so
  we're not firing the same notifier constantly.
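  For illustration, a minimal sketch of the eventfd pattern described above
  (not memcached's actual notifier code): the eventfd is a kernel-side counter,
  so any number of signals posted before the worker wakes up collapse into a
  single readable event.

      #include <stdint.h>
      #include <sys/eventfd.h>
      #include <unistd.h>

      /* One notifier per worker thread. */
      int notify_create(void) {
          return eventfd(0, EFD_NONBLOCK);
      }

      /* Producer side: bump the counter; repeated bumps coalesce. */
      void notify_signal(int efd) {
          uint64_t one = 1;
          (void)write(efd, &one, sizeof(one));
      }

      /* Worker side (e.g. from the libevent read callback): one read drains
       * however many signals accumulated since the last wakeup. */
      uint64_t notify_drain(int efd) {
          uint64_t count = 0;
          if (read(efd, &count, sizeof(count)) != sizeof(count)) {
              return 0;
          }
          return count;
      }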
* thread: unify worker notify interface (dormando, 2021-08-09, 1 file, -99/+104)
  Worker notification was a mix of reading data from a pipe or examining an
  object queue stack; now it's all one interface. This is necessary to switch
  signalling to eventfd or similar, since we won't have that pipe to work with.
* thread: per-worker-thread connection event queues (dormando, 2021-08-09, 1 file, -79/+38)
  Helps scalability a bit by having a per-worker-thread freelist and queue for
  connection event items (new conns, etc). Also removes a hand-rolled linked
  list and uses cache.c for freelist handling to cull some redundancy.
* core: cache.c cleanups, use queue.h freelist (dormando, 2021-08-09, 1 file, -2/+2)
  Cache constructors/destructors were never used, which just ended up being
  wasted branches. Since there's no constructor/destructor for cache objects,
  we can use the memory itself for the freelist. This removes a doubling
  realloc for the freelist of cache objects and simplifies the code a bunch.
* Introduce NAPI ID based worker thread selection (Sridhar Samudrala, 2020-11-02, 1 file, -4/+76)
  By default memcached assigns connections to worker threads in a round-robin
  manner. This patch introduces an option to select a worker thread based on
  the incoming connection's NAPI ID, if the SO_INCOMING_NAPI_ID socket option
  is supported by the OS. This allows a memcached worker thread to be
  associated with a NIC hardware receive queue and to service all the
  connection requests received on a specific RX queue. This mapping between a
  memcached thread and a HW NIC queue streamlines the flow of data from the
  NIC to the application. In addition, an optimal path with reduced context
  switches is possible if epoll-based busy polling
  (sysctl -w net.core.busy_poll=<non-zero value>) is also enabled.
  The feature is enabled via a new command line parameter -N <num> or
  --napi_ids=<num>, where <num> is the number of available/assigned NIC
  hardware RX queues through which connections can be received. The number of
  napi_ids specified cannot be greater than the number of worker threads
  specified using the -t/--threads option. If the option is not specified, or
  the conditions are not met, the code defaults to round-robin thread
  selection.
  Signed-off-by: Kiran Patil <kiran.patil@intel.com>
  Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
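  A sketch of the selection mechanism under the assumptions above; the helper
  name and the simple modulo mapping are illustrative, not the patch's exact
  code:

      #include <sys/socket.h>

      /* Pick a worker for a freshly accepted socket.  num_napi_ids and
       * num_threads correspond to the -N and -t options described above;
       * last_thread is the round-robin cursor. */
      static int select_worker(int sfd, int num_napi_ids, int num_threads,
                               int *last_thread) {
      #ifdef SO_INCOMING_NAPI_ID
          unsigned int napi_id = 0;
          socklen_t len = sizeof(napi_id);
          if (num_napi_ids > 0 &&
              getsockopt(sfd, SOL_SOCKET, SO_INCOMING_NAPI_ID,
                         &napi_id, &len) == 0 && napi_id != 0) {
              /* NAPI IDs are opaque; fold them onto the configured queue count. */
              return napi_id % num_napi_ids;
          }
      #endif
          /* Fallback: classic round-robin across all worker threads. */
          *last_thread = (*last_thread + 1) % num_threads;
          return *last_thread;
      }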
* core: restructure IO queue callbacks (dormando, 2020-10-30, 1 file, -2/+3)
  mc_resp is the proper owner of a pending IO once it's been initialized;
  release it during resp_finish(). Also adds a completion callback which runs
  on the submitted stack after returning to the worker thread but before the
  response is transmitted.
  Allows re-queueing a pending IO if processing a response generates another
  pending IO. Also allows a further refactor to run more extstore code on the
  worker thread instead of the IO threads.
  Uses a proper conn_io_queue state to describe connections waiting for
  pending IOs.
* core: compile io_queue code by default (dormando, 2020-10-30, 1 file, -4/+0)
  Don't gate the deferred io_queue code on EXTSTORE. Removes a number of
  ifdefs and allows cleaner usage of the interface.
* core: generalize extstore's deferred IO queue (dormando, 2020-10-30, 1 file, -1/+10)
  We want to reuse extstore's deferred IO system for something else. Should
  allow evolving into a more plugin-centric system. Step one of three(?):
  replace in place, and tests pass with extstore enabled. Step two should move
  more extstore code into storage.c. Step three should build the IO queue code
  without ifdef gating.
* net: remove most response obj cache related code (dormando, 2020-06-09, 1 file, -20/+3)
  Accept but warn if the commandline option is used. Also keeps the oom
  counter distinct so users have a chance of telling what type of memory
  pressure they're under.
* Do not join lru and slab maintainer threads if they do not exist (Tomas Korbar, 2020-05-27, 1 file, -6/+10)
  If the no_modern option was supplied, these threads did not run, but
  memcached still attempted to join them, which resulted in a segfault.
  Resolves #685.
* Fix build warnings in Windows. (Jefty Negapatan, 2020-04-11, 1 file, -2/+2)
  1. warning: passing argument 4 of 'event_set' from incompatible pointer type
     [-Wincompatible-pointer-types]
     --> libevent expects callbacks to use evutil_socket_t for fd instead of
     int. Current code works fine in *nix because it's just int, but not in
     Windows because it's defined as intptr_t.
  2. warning: passing argument 4 of 'getsockopt' from incompatible pointer
     type [-Wincompatible-pointer-types]
     --> Apply casting for Windows' getsockopt call.
  3. warning: cast from pointer to integer of different size
     [-Wpointer-to-int-cast]
     --> Replace %lx with the more portable %p. long is just 32-bit in 64-bit
     Windows.
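  A minimal sketch of the first fix; the commit's warning is about the legacy
  event_set() call, and the same signature rule applies to the libevent 2 API
  used here for brevity (callback and setup names are illustrative):

      #include <event2/event.h>
      #include <event2/util.h>

      /* Portable callback signature: evutil_socket_t (int on POSIX, intptr_t
       * on Windows) instead of a plain int for the descriptor. */
      static void conn_ready_cb(evutil_socket_t fd, short which, void *arg) {
          (void)fd; (void)which; (void)arg;
          /* ... handle readiness ... */
      }

      static int setup_conn_event(struct event_base *base, evutil_socket_t fd) {
          struct event *ev = event_new(base, fd, EV_READ | EV_PERSIST,
                                       conn_ready_cb, NULL);
          return ev != NULL ? event_add(ev, NULL) : -1;
      }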
* restart: fix rare segfault on shutdown (dormando, 2020-03-25, 1 file, -0/+14)
  Client connections were being closed and cleaned up after worker threads
  exit. In 2018 a patch went in to have the worker threads actually free their
  event base when stopped. If your system is strict enough (which is
  apparently none out of the dozen+ systems we've tested against!) it will
  segfault on invalid memory.
  This change leaves the workers hung while they wait for connections to be
  centrally closed. I would prefer to have each worker thread close its own
  connections for speed if nothing else, but we still need to close the
  listener connections and any connections currently open in side channels.
  Much appreciation to darix for helping narrow this down, as it presented as
  a wiped stack that only appeared in a specific build environment on a
  specific Linux distribution. Hopefully with all of the valgrind noise fixes
  lately we can start running it more regularly and spot these early.
* meta: arithmetic command for incr/decr (dormando, 2020-03-06, 1 file, -1/+1)
  See doc/protocol.txt. Needed slightly different code as we have to generate
  the response line after the main operation completes.
* add separate limits for connection buffers (dormando, 2020-02-26, 1 file, -0/+20)
  Allows specifying a megabyte limit for either response objects or read
  buffers. This is split among all of the worker threads. Useful if the
  connection limit is extremely high and you want to aggressively close
  connections if something happens and all connections become active at the
  same time.
  Runtime tuning is still missing.
* network: transient static read buffer for conns (dormando, 2020-02-01, 1 file, -0/+9)
  Instead of starting at 2k and reallocing all over every time you set a large
  item or do large pipelined fetches, use a static, slightly larger buffer.
  Idle connections no longer hold a buffer, freeing up a ton of memory. To
  maintain compatibility with unbounded ASCII multigets, those fall back to
  the old malloc/realloc/free routine, as they have done since the dark ages.
* network: response stacking for all commands (dormando, 2020-02-01, 1 file, -43/+28)
  This change refactors most of memcached's networking frontend code. The
  goals with this change:
  - Decouple buffer memory from client connections where plausible, reducing
    memory usage for currently inactive TCP sessions.
  - Allow easy response stacking for all protocols. Previously every binary
    command generated a syscall during response. This is no longer true when
    multiple commands are read off the wire at once. The new meta protocol
    and most text protocol commands are similar.
  - Reduce code complexity for network handling. Remove automatic buffer
    adjustments, error checking, and so on.
  This is accomplished by removing the iovec, msg, and "write buffer"
  structures from connection objects. A `mc_resp` object is now used to
  represent an individual response. These objects contain small iovecs and
  enough detail to be late-combined on transmit. As a side effect, UDP mode
  now works with extstore :)
  Adding to the iovec always had a remote chance of memory failure, so every
  call had to be checked for an error. Now, once a response object is
  allocated, most manipulations can happen without any checking. This is both
  a boost to code robustness and performance for some hot paths.
  This patch is the first in a series, focusing on the client response.
* Replace return\ \; with return; (Iqram Mahmud, 2020-01-13, 1 file, -1/+1)
  In most of the source code of memcached, return is followed by an immediate
  semicolon, except on these lines. This change will help future contributors
  keep the style consistent.
* DTrace build fix (David Carlier, 2019-10-17, 1 file, -1/+1)
  During the config step, merely being "contented" with generating the header
  leaves no symbols to attach to, so it has no chance of working as is. Also
  changes a probe signature: on some platforms pthread_t is an opaque type,
  so cast it to a type large enough to hold it on all OSes.
* TLS: fix leak of SSL context on accept failure (dormando, 2019-09-28, 1 file, -0/+6)
  Frees said context. Don't use SSL_shutdown, as the connection was not
  established. Also fixes a potential leak if dispatch_conn_new fails, but
  that shouldn't be possible on most systems; it requires either a malloc
  failure or an event_add() failure.
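  A sketch of the cleanup pattern being described (hypothetical helper, not
  the actual accept path): since no TLS session was ever established, the
  per-connection SSL object is released with SSL_free() alone, rather than
  SSL_shutdown(), which would try to send a close_notify.

      #include <openssl/ssl.h>
      #include <unistd.h>

      /* Failure path after SSL_new() but before the connection is fully
       * set up: just free the context and close the socket. */
      static void abort_tls_accept(SSL *ssl, int sfd) {
          if (ssl != NULL) {
              SSL_free(ssl);
          }
          if (sfd >= 0) {
              close(sfd);
          }
      }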
* restartable cache (dormando, 2019-09-17, 1 file, -0/+61)
      -e /path/to/tmpfsmnt/file
  SIGUSR1 for graceful stop.
  Restart requires the same memory limit, slab sizes, and some other
  infrequently changed details. Most other options and features can change
  between restarts. The binary can be upgraded between restarts.
  Restart does some fixup work on start for every item in cache. This can
  take over a minute with more than a few hundred million items in cache.
  Keep in mind that while a cache is down it may be missing invalidations,
  updates, and so on.
* close delete + incr item survival bug (dormando, 2019-04-26, 1 file, -0/+11)
  re #469: delete actually locks/unlocks the item (and hashes the key!) three
  times. In between the fetch and the unlink, a fully locked add_delta() can
  run, deleting the underlying item. DELETE then returns success despite the
  original object hopscotching over it.
  I really need to get to the frontend rewrite soon :(
  This commit hasn't been fully audited for deadlocks on the stats counters or
  the extstore STORAGE_delete() function, but it passes tests.
* Basic implementation of TLS for memcached. [tag: 1.5.13] (Tharanga Gamaethige, 2019-04-15, 1 file, -2/+30)
  Most of the work done by Tharanga. Some commits squashed in by dormando.
  Also reviewed by dormando.
  Tested, working, but experimental implementation of TLS for memcached.
  Enable with ./configure --enable-tls
  Requires OpenSSL 1.1.0 or better.
  See `memcached -h` output for usage.
* split storage writer into its own thread (dormando, 2018-08-03, 1 file, -0/+2)
  Trying out a simplified slab class backoff algorithm. The LRU maintainer
  individually schedules slab classes by time, which leads to multiple wakeups
  in a steady state as they get out of sync. This algorithm more simply skips
  that class more often each time it runs the main loop, using a single
  scheduled sleep instead. If it goes to sleep for a long time, it also
  reduces the backoff for all classes. If we're barely awake it should be
  fine to poke everything.
* fix deadlock during hash table expansion (dormando, 2018-05-17, 1 file, -2/+2)
  The slab rebalancer was paused after the LRU maintainer thread during hash
  table expansion. This was mostly fine, except in one case.
  If an item is in the COLD LRU when it gets hit, it gets assigned an ACTIVE
  bitflag and queued for an asynchronous bump. If the item already has an
  ACTIVE bit, it does not queue a second time. Items in other LRUs are bumped
  when they hit the tail, but COLD items have to bump relatively soon to avoid
  accidentally evicting recently active items while under memory pressure.
  Items queued for async bumps have an extra refcount. The slab page mover
  cannot move a page until all memory within the page is cleared and freed.
  If it runs into an item which is stuck, it will try to delete it, then wait
  for all of its references to be removed. However, if an item is in the bump
  queue when the LRU thread is stopped, and the slab page mover is moving the
  page a queued item happens to be in, the refcount will never drop to zero,
  as the bump will never happen. The page mover has no mechanism for "giving
  up" partway through, so it loops forever.
  For most use cases, this degrades the instance but doesn't actually break
  much:
  - The COLD LRUs will eventually drain to zero.
  - Allocations will all become "direct reclaims", which force items into COLD
    and then evict them.
  - This will degrade the LRU algorithm and some performance, but if the
    instance usage isn't very high most users may never notice.
  - Normally, page moving is rare enough and hash table expansion is rare
    enough that it's an unlikely race.
  However, with extstore, page moves tend to happen a lot more frequently.
  Also, items are only ever written to extstore from the COLD LRUs, so if they
  empty the process will stop. This makes the race more likely to happen and
  the effects more obvious.
  The fix is to simply pause the page mover before the LRU thread. The LRU
  thread does call the page mover, but via trylock(), since it normally
  expects to progress even if the page mover is busy. The pause of the page
  mover won't succeed until it fully completes the current page, so it should
  then be clear to pause the LRU thread.
  Further improvements could be made by allowing the page mover to bail after
  N unsuccessful loops. I am also considering removing the async LRU bump
  process and letting the LRU crawler handle that instead.
  Much thanks to Shashi and Sridhar from Netflix for all of their help in
  tracking down this bug!
* add_delta: use bool incr to be consistent with other functions (Fangrui Song, 2018-03-14, 1 file, -1/+1)
* Replace event_init() with new API if a newer version is detected (Qian Li, 2018-02-19, 1 file, -0/+11)
  If we detect libevent version >= 2.0.2-alpha, use
  event_base_new_with_config() instead of the obsolete event_init() for
  creating a new event base. Set the config flag EVENT_BASE_FLAG_NOLOCK to
  avoid lock/unlock around every libevent operation.
  For newer versions of libevent, the event_init() API is deprecated and is
  totally unsafe for multithreaded use. By using the new API, we can
  explicitly disable/enable locking on the event_base.
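  A minimal sketch of that call sequence, assuming each worker's event_base is
  only ever touched by its own thread (which is what makes the no-lock flag
  safe); the helper name and version guard are illustrative:

      #include <event.h>   /* pulls in event2/event.h plus the legacy compat API */

      static struct event_base *worker_base_new(void) {
      #if defined(LIBEVENT_VERSION_NUMBER) && LIBEVENT_VERSION_NUMBER >= 0x02000200
          struct event_config *cfg = event_config_new();
          if (cfg == NULL) return NULL;
          /* Skip libevent's internal locking; this base is single-threaded. */
          event_config_set_flag(cfg, EVENT_BASE_FLAG_NOLOCK);
          struct event_base *base = event_base_new_with_config(cfg);
          event_config_free(cfg);
          return base;
      #else
          return event_init();   /* deprecated API on very old libevent */
      #endif
      }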
* extstore: pause compaction thread with hash expand (dormando, 2017-11-28, 1 file, -0/+9)
  Could potentially cause weirdness when the hash table is swapped.
* external storage base commit (dormando, 2017-11-28, 1 file, -1/+16)
  Been squashing, reorganizing, and pulling code off to go upstream ahead of
  merging the whole branch.
* interface code for flash branch (dormando, 2017-09-26, 1 file, -1/+1)
  Removes a few ifdefs and upstreams small internal interface tweaks for an
  easy rebase.
* Add drop_privileges() for Linux (Stanisław Pitucha, 2017-08-23, 1 file, -0/+4)
  Implement an aggressive version of drop_privileges(). Additionally add a
  similar initialization function for threads, drop_worker_privileges(). This
  version is similar to the Solaris one and prohibits memcached from making
  any non-approved syscalls. The current list narrows the allowed calls down
  to socket sends/recvs, accept, epoll handling, futex (and dependencies:
  mmap), getrusage (for stats), and signal/exit handling. Any incorrect
  behaviour will result in EACCES being returned. This should be restricted
  further to KILL in the future (after more testing).
  The feature is only tested for i386 and x86_64. It depends on bpf filters
  and seccomp enabled in the kernel. It also requires libseccomp for
  abstraction to seccomp filters. All are available since Linux 3.5.
  Seccomp filtering can be enabled at compile time with --enable-seccomp.
  In case of local customisations which require more rights, memcached allows
  disabling drop_privileges() with "-o no_drop_privileges" at startup.
  Tests have to run with "-o relaxed_privileges", since they require disk
  access after the tests complete. This adds a few allowed syscalls but does
  not disable the protection system completely.
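  For illustration, a heavily trimmed sketch of what a libseccomp allow-list
  of this shape looks like; the syscall selection is abbreviated and the
  helper name is made up, so treat it as an outline rather than the project's
  actual filter:

      #include <errno.h>
      #include <seccomp.h>

      /* Default-deny filter that returns EACCES for anything not explicitly
       * allowed.  A real filter needs many more rules than shown. */
      static int worker_sandbox_init(void) {
          scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EACCES));
          if (ctx == NULL) return -1;

          int rc = 0;
          rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvfrom), 0);
          rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);
          rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(accept4), 0);
          rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(epoll_wait), 0);
          rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);
          rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(getrusage), 0);
          rc |= seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(exit_group), 0);
          if (rc == 0) {
              rc = seccomp_load(ctx);
          }
          seccomp_release(ctx);
          return rc == 0 ? 0 : -1;
      }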
* fix long lock pause in hash expansion (dormando, 2017-06-21, 1 file, -2/+2)
  Plus a locking fix for slabs reassign. The lru maintainer can call into
  slabs reassign and the lru crawler, so we need to pause it before attempting
  to pause the other two threads.
* per-LRU hits breakdown (dormando, 2017-06-04, 1 file, -0/+9)
  No actual speed loss. Emulates the slab_stats "get_hits" by totalling up the
  per-LRU get_hits. Could sub-LRU many more stats, but that should use a
  different command/interface.
* fix ordering issue in conn dispatch (dormando, 2017-05-21, 1 file, -25/+34)
  Very old race condition: push to queue, write to signal pipe. With extstore,
  hammering a secondary item type into the queue could occasionally swap the
  item with the type it represented. Now we differentiate on the type, and the
  character signifies either to pull an item or to do something else entirely.
* use LRU thread for COLD -> WARM bumps (dormando, 2017-01-30, 1 file, -1/+2)
  The previous tree fixed a problem: active items needed to be processed from
  the tail of COLD, which makes evictions harder without evicting active
  items. COLD bumps were modified to be immediate (old style).
  This uses a per-worker-thread, mostly non-blocking queue that the LRU thread
  consumes for COLD bumps. In most cases, hits to COLD are 1/10th or less of
  the other classes. On high rates of access where the buffers fill, those
  items simply don't get their ACTIVE bit set. If they get hit again with free
  space available, they will be processed then. This prevents regressions from
  high-speed keyspace scans.
* stop using atomics for item refcount management (dormando, 2017-01-22, 1 file, -30/+0)
  When I first split the locks up further, I had a trick where item_remove()
  did not require holding the associated item lock. If an item were to be
  freed, it would then do the necessary work. Since then, all calls to
  refcount_incr and refcount_decr only happen while the item is locked. This
  was mostly due to the slab mover being very tricky with locks.
  The atomic is no longer needed, as the refcount is only ever checked after a
  lock to the item. Calling atomics is pretty expensive, especially in
  multicore/multisocket scenarios. This yields a notable performance increase.
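  A sketch of the invariant being exploited (illustrative struct and helpers,
  not the real item code): because every caller now holds the item's lock, the
  refcount can be a plain read-modify-write instead of an atomic one.

      /* Caller must already hold the per-item (hash bucket) lock. */
      typedef struct {
          unsigned short refcount;
          /* ... key, data, flags ... */
      } item_t;

      static unsigned short refcount_incr(item_t *it) {
          return ++it->refcount;   /* no lock-prefixed atomic needed */
      }

      static unsigned short refcount_decr(item_t *it) {
          return --it->refcount;
      }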
* Do LRU-bumps while already holding item lock (dormando, 2017-01-22, 1 file, -14/+2)
  item_get() would hash, item_lock, then fetch the item. Consumers which can
  bump the LRU would then call item_update(), which would hash, item_lock,
  then update the item. Good performance bump by inlining the LRU bump when
  it's necessary.
* scale item hash lock with more worker threads (dormando, 2016-12-19, 1 file, -2/+6)
  Thread scalability is much higher than when the bucket count was tested.
  Increasing this has a relatively small effect, but it is still worth
  adjusting. More strict testing should be done to more properly scale the
  table, as well as adjusting for false-sharing issues given table entries
  that border each other.
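  The mechanism being scaled, roughly (a sketch; the thresholds and helper
  bodies here are illustrative, not the tuned values in thread.c): item locks
  live in a power-of-two sized mutex array, a key's hash value picks its lock,
  and more worker threads warrant a larger array so unrelated hot keys land on
  different mutexes.

      #include <pthread.h>
      #include <stdint.h>
      #include <stdlib.h>

      #define hashsize(n) ((unsigned long)1 << (n))
      #define hashmask(n) (hashsize(n) - 1)

      static pthread_mutex_t *item_locks;
      static unsigned int item_lock_hashpower;   /* log2 of the lock table size */

      /* Size the lock table from the worker thread count. */
      void item_lock_init(int nthreads) {
          if (nthreads <= 10) {
              item_lock_hashpower = 13;
          } else if (nthreads <= 20) {
              item_lock_hashpower = 14;
          } else {
              item_lock_hashpower = 15;
          }
          item_locks = calloc(hashsize(item_lock_hashpower), sizeof(pthread_mutex_t));
          for (unsigned long i = 0; i < hashsize(item_lock_hashpower); i++) {
              pthread_mutex_init(&item_locks[i], NULL);
          }
      }

      /* Lock the bucket that covers this key's hash value. */
      void item_lock(uint32_t hv) {
          pthread_mutex_lock(&item_locks[hv & hashmask(item_lock_hashpower)]);
      }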