| Commit message | Author | Age | Files | Lines |
...
| |
export a lot of the connection handling code from memcached.c
|
| |
The list grows toward next, not prev. It also wasn't zeroing out the
next ptr, and didn't unmark free for the first resp object. (thanks
Prudhviraj!)
Adds a couple of counters so users can tell if something is wrong.
response_obj_count counts only response objects in-flight; they
should not be held when idle (except for one bundle per worker thread).
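The corrected append path, roughly (field names illustrative):

    resp->next = NULL;                 // was left dangling before
    if (c->resp_tail != NULL) {
        c->resp_tail->next = resp;     // grow toward next, not prev
    } else {
        c->resp_head = resp;
    }
    c->resp_tail = resp;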
|
| |
accept the commandline option but warn if it is used. also keeps the oom
counter distinct so users have a chance of telling what type of memory
pressure they're under.
|
| |
balancing startup arguments for memory limits on read vs response buffers
is hard. with this change there's a custom sub-allocator that carves the
small response objects out of larger read buffers.
it is somewhat complicated and has a few loops, but the allocation path is
short-circuited as much as possible.
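The core trick, as a sketch with hypothetical names:

    // each large read buffer doubles as a bundle of response-object
    // slots, tracked by a per-buffer free bitmask.
    typedef struct rbuf_bundle {
        uint64_t free_mask;                 // one bit per mc_resp slot
        char data[READ_BUFFER_SIZE];
    } rbuf_bundle;

    static mc_resp *resp_suballoc(rbuf_bundle *b) {
        int slot = __builtin_ffsll(b->free_mask);   // 1-based; 0 if full
        if (slot == 0)
            return NULL;               // short-circuit: grab a new buffer
        b->free_mask &= ~(1ULL << (slot - 1));
        return (mc_resp *)(b->data + (slot - 1) * sizeof(mc_resp));
    }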
|
| |
Add a new enumeration with reasons for stopping memcached.
Make sure that restart_mmap_close() gets called only when the right
signal is received.
Improve logging to stderr so the user knows exactly why memcached stopped.
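The enumeration looks roughly like this (a sketch; names approximate):

    enum stop_reasons {
        NOT_STOP,        // not stopping; normal operation
        GRACE_STOP,      // SIGUSR1: graceful stop, keep the restart mmap
        EXIT_NORMALLY    // SIGINT/SIGTERM: plain shutdown
    };

restart_mmap_close() then runs only for the graceful-stop path.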
|
| |
Enables server-side TLS session caching.
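On the OpenSSL side this mostly boils down to (a sketch; the cache size
is illustrative):

    SSL_CTX_set_session_cache_mode(ctx, SSL_SESS_CACHE_SERVER);
    SSL_CTX_sess_set_cache_size(ctx, 1024);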
|
| |
Client connections were being closed and cleaned up after worker
threads exit. In 2018 a patch went in to have the worker threads
actually free their event base when stopped. If your system is strict
enough (which is apparently none out of the dozen+ systems we've tested
against!) it will segfault on invalid memory.
This change leaves the workers hung while they wait for connections to
be centrally closed. I would prefer to have each worker thread close
its own connections for speed if nothing else, but we still need to
close the listener connections and any connections currently open in
side channels.
Much appreciation to darix for helping narrow this down, as it presented
as a wiped stack that only appeared in a specific build environment on
a specific linux distribution.
Hopefully with all of the valgrind noise fixes lately we can start
running it more regularly and spot these early.
|
| |
see doc/protocol.txt. needed slightly different code as we have to
generate the response line after the main operation completes.
|
| |
allows specifying a megabyte limit for either response objects or read
buffers. this is split among all of the worker threads.
useful if the connection limit is extremely high and you want to
aggressively close connections if something happens and all connections
become active at the same time.
runtime tuning is still missing.
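The per-thread share is simple arithmetic, roughly (names hypothetical):

    uint64_t per_worker = (uint64_t)limit_mb * 1024 * 1024
                          / settings.num_threads;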
|
| |
instead of starting at 2k and realloc'ing all over the place every time
you set a large item or do large pipelined fetches, use a static,
slightly larger buffer.
Idle connections no longer hold a buffer, freeing up a ton of memory.
To maintain compatibility with unbound ASCII multigets, those fall back
to the old malloc/realloc/free routine that has been in place since the
dark ages.
|
| |
since keys were (maybe) expanding to 64k, binprot overloads the
read-into-key "conn_nread" state to read... a specific number of bytes
into the read buffer.
However keys were never expanded in size, so this ends up being a waste
of code and attaches binprot to management of the read buffer directly.
Instead, re-parse the binary header if we don't have enough bytes and
let try_read_network() handle it like it does for ascii.
As a side effect, this prevents multiple memmoves of potentially large
amounts of data for pipelined binprot commands, and under
NEED_ALIGN.
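The re-parse path is roughly:

    // not a full binary header yet? wait for more bytes rather than
    // shuffling partial data around the buffer.
    if (c->rbytes < sizeof(protocol_binary_request_header)) {
        conn_set_state(c, conn_waiting);   // try_read_network() refills
        return;
    }
    protocol_binary_request_header *req =
        (protocol_binary_request_header *)c->rcurr;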
|
| |
This change refactors most of memcached's networking frontend code. The
goals with this change:
- Decouple buffer memory from client connections where plausible,
reducing memory usage for currently inactive TCP sessions.
- Allow easy response stacking for all protocols. Previously every
binary command generated a syscall during response. This is no longer
true when multiple commands are read off the wire at once. The new meta
protocol and most text protocol commands are similar.
- Reduce code complexity for network handling. Remove automatic buffer
adjustments, error checking, and so on.
This is accomplished by removing the iovec, msg, and "write buffer"
structures from connection objects. A `mc_resp` object is now used to
represent an individual response. These objects contain small iovecs
and enough detail to be late-combined on transmit. As a side effect,
UDP mode now works with extstore :)
Adding to the iovec always had a remote chance of memory failure, so
every call had to be checked for an error. Now once a response object
is allocated, most manipulations can happen without any checking. This
is both a boost to code robustness and performance for some hot paths.
This patch is the first in a series, focusing on the client response.
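A condensed sketch of the new response object (fields abbreviated):

    typedef struct _mc_resp {
        struct _mc_resp *next;   // responses chain in order per connection
        int iovcnt;
        size_t tosend;           // total bytes left to transmit
        struct iovec iov[MC_RESP_IOVCOUNT];   // small, fixed-size iovec
        item *item;              // item reference held until transmitted
        char wbuf[WRITE_BUFFER_SIZE];         // scratch for response lines
    } mc_resp;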
|
| |
Suggested in issue #544. Implementation inspired by the flag which
disables flush_all. A similar CLIENT_ERROR is generated should someone
attempt a watch command while watchers are disabled.
|
| |
For full discussion see: https://github.com/memcached/memcached/pull/524
- Avoids looping in most cases where an item had to be force-freed
- Avoids re-locking and re-checking already completed memory
- Uses a backoff timer for sleeping when busy items are found
|
| |
- we get asked a lot to provide a "metaget" command, for various uses
(debugging, etc)
- we also get asked for random one-off commands for various use cases.
- I really hate both of these situations and have been wanting to
experiment with a slight tuning of how get commands work for a long
time.
If I offer a metaget command which gives people the information they're
curious about in an inefficient format, plus data they don't need, we'll
just end up with a slow command with compatibility issues. No matter how
many warnings you wrap around a command,
people will put it into production under high load. Then I'm stuck with
it forever.
Behold, the meta commands!
See doc/protocol.txt and the wiki for a full explanation and examples.
The intent of the meta commands is to support any features the binary
protocol had over the text protocol. Though this is missing some
commands still, it is close and surpasses the binary protocol in many
ways.
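For example (doc/protocol.txt is authoritative; the flags here are
illustrative):

    mg foo v f t
    VA 3 f0 t93
    bar

one metaget returning the value plus client flags and remaining TTL,
with nothing you didn't ask for.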
|
| |
"-e /path/to/tmpfsmnt/file"
SIGUSR1 for graceful stop
restart requires the same memory limit, slab sizes, and some other
infrequently changed details. Most other options and features can
change between restarts. Binary can be upgraded between restarts.
Restart does some fixup work on start for every item in cache. Can take
over a minute with more than a few hundred million items in cache.
Keep in mind that while a cache is down it may be missing invalidations,
updates, and so on.
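Typical usage looks like (path and sizes illustrative):

    memcached -m 4096 -e /tmpfs_mount/memory_file
    kill -USR1 `pidof memcached`   # graceful stop; state is preserved
    memcached -m 4096 -e /tmpfs_mount/memory_file   # same -m; cache returns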
|
| |
When seccomp causes a crash, use a SIGSYS action and handle it to print
out an error. Most functions are not allowed at that point (no buffered
output, no printf-family functions, no abort, ...), so the implementation is
as minimal as possible.
Print out a message with the syscall number and exit the process (all
threads).
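A minimal sketch of the handler; only async-signal-safe calls appear:

    static void sigsys_handler(int sig, siginfo_t *info, void *ctx) {
        // write(2) and _exit(2) are signal-safe; stdio and abort() are
        // themselves blocked syscalls at this point.
        const char msg[] = "seccomp: blocked syscall, exiting\n";
        (void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
        // info->si_syscall holds the offending syscall number; printing
        // it means hand-formatting the integer.
        _exit(1);
    }

installed via sigaction() with SA_SIGINFO, paired with seccomp's
SCMP_ACT_TRAP action.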
|
| |
Modified Logger and Crawler to use the correct buffer length when
they are printing URI encoded keys. Fixes #471
|
| |
client flags did a weird dance. nsuffix is no longer an 8bit length;
it's replaced with an ITEM_CFLAGS bit, which indicates whether there is
a 32bit set of client flags in the item or not.
possible after removing the inlined ascii response header via the
previous commit.
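Readers now do something along these lines (a sketch; the accessor may
differ):

    uint32_t flags = 0;
    if (it->it_flags & ITEM_CFLAGS) {
        // optional 32bit copy of the client flags inside the item
        flags = *((uint32_t *)ITEM_suffix(it));
    }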
|
| |
Has defaulted to false since 1.5.0, and with -o modern for a few years
before that. Performance is fine, no reported bugs. Always was the
intention. Code is simpler without the options.
|
| |
Loads "username:password\n" tokens (up to 8) out of a supplied authfile.
If enabled, disables binary protocol (though may be able to enable both
if sasl is also used?).
authentication is done via the "set" command. A separate handler is
used to avoid some hot path conditionals and narrow the code
executed in an unauthenticated state.
ie:
set foo 0 0 7\r\n
foo bar\r\n
returns "STORED" on success. Else returns CLIENT_ERROR with some
information.
Any key is accepted: if a client connecting to a pool of servers doesn't
authenticate up front, the authentication set can be sent with the same
key as the request that failed, coercing the client into routing it to
the correct server. A fixed "auth"-style key would instead always hash
to the same server.
|
| |
re #469 - delete actually locks/unlocks the item (and hashes the
key!) three times. In between fetch and unlink, a fully locked
add_delta() can run, deleting the underlying item. DELETE then returns
success despite the original object hopscotching over it.
I really need to get to the frontend rewrite soon :(
This commit hasn't been fully audited for deadlocks on the stats
counters or extstore STORAGE_delete() function, but it passes tests.
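The shape of the fix is holding one item lock across fetch and unlink
(sketch):

    uint32_t hv = hash(key, nkey);
    item_lock(hv);
    item *it = do_item_get(key, nkey, hv, c, DONT_UPDATE);
    if (it != NULL) {
        do_item_unlink(it, hv);   // no window for add_delta() to race
        do_item_remove(it);
    }
    item_unlock(hv);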
|
| |
Most of the work done by Tharanga. Some commits squashed in by
dormando. Also reviewed by dormando.
Tested, working, but experimental implementation of TLS for memcached.
Enable with ./configure --enable-tls
Requires OpenSSL 1.1.0 or better.
See `memcached -h` output for usage.
|
| |
some whackarse ARM platforms on specific glibc/gcc (new?) versions trip
SIGBUS while reading the header chunk for a split item.
the header chunk is unfortunate magic: It lives in ITEM_data() at a random
offset, is zero sized, and only exists to simplify code around finding the
original slab class, and linking/relinking subchunks to an item.
there's no fix to this which isn't a lot of code. I need to refactor chunked
items, and attempted to do so, but couldn't come up with something I liked
quickly enough.
This change pads the first chunk if alignment is necessary, which wastes
bytes and a little CPU, but I'm not going to worry a ton for these obscure
platforms.
this works with rebalancing because in the case of ITEM_CHUNKED header, it
treats the item size as the size of the class it resides in, and memcpy's the
item during recovery.
all other cases were changed from ITEM_data to a new ITEM_schunk() inline
function, which is created when NEED_ALIGN is set; otherwise it's still
equal to ITEM_data.
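ITEM_schunk() ends up something like this (a sketch; the exact offset
math follows ITEM_data):

    #ifdef NEED_ALIGN
    static inline char *ITEM_schunk(item *it) {
        int offset = it->nkey + 1 + it->nsuffix
            + ((it->it_flags & ITEM_CAS) ? sizeof(uint64_t) : 0);
        int remain = offset % 8;
        if (remain != 0)
            offset += 8 - remain;   // pad the first chunk for alignment
        return ((char *) &(it->data)) + offset;
    }
    #else
    #define ITEM_schunk(item) ITEM_data(item)
    #endif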
|
| |
apparently since (forever?) the double while loop in process_get_command()
would turn into an infinite loop (and leak memory/die) if add_iov() ever
failed. The recently added get_extstore() is more likely to spuriously fail,
so it turned into a problem.
This creates a common path for the key length abort as well.
Adds a test, which breaks several ways before this patch.
|
| |
If sasl_server_step is called on a sasl_conn which has not had
sasl_server_start called on it, it will segfault from reading
uninitialized memory.
Memcached currently calls sasl_server_start when the client sends
the PROTOCOL_BINARY_CMD_SASL_AUTH command and sasl_server_step when
the client sends the PROTOCOL_BINARY_CMD_SASL_STEP command. So if the
client sends SASL_STEP before SASL_AUTH, the server segfaults.
For well-behaved clients, this case never happens; but for the
java-memcached-client, when configured with an incorrect password,
it happens very frequently. This is likely because the client handles
auth on a background thread and the socket may be swapped out in the
middle of authentication. You can see that code here:
https://github.com/dustin/java-memcached-client/blob/master/src/main/java/net/spy/memcached/auth/AuthThread.java
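The fix is essentially a state guard (flag name illustrative):

    if (!c->sasl_started) {
        // STEP before AUTH: reject rather than stepping an
        // uninitialized sasl_conn.
        write_bin_error(c, PROTOCOL_BINARY_RESPONSE_AUTH_ERROR, NULL, 0);
        return;
    }
    result = sasl_server_step(c->sasl_conn, challenge, vlen,
                              &out, &outlen);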
|
| |
When launching as root, we drop permissions to run as a specific
user, however the way this is done fails to drop supplementary
groups from the process.
On many systems, the root user is a member of many groups used to
guard important system files. If we neglect to drop these groups
it's still possible for a compromised memcached instance to access
critical system files as though it were running with root level
permissions.
On any given system `grep root /etc/group` will reveal all the
groups root is a member of, and memcached will have access to
any file these groups are authorized for, in spite of our attempts
to drop this access.
It's possible to test if a given memcached instance is affected by
this and is running with elevated permissions by checking the respective
line in procfs: `grep Groups /proc/$pid/status`
If any groups are listed, memcached would have access to everything
the listed groups have access to. After this patch no groups will
be listed and memcached will be locked down properly.
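The fix is a setgroups() call before switching uid/gid, in the spirit of
the existing error handling (sketch):

    if (setgroups(0, NULL) < 0) {
        fprintf(stderr, "failed to drop supplementary groups\n");
        exit(EX_OSERR);
    }
    if (setgid(pw->pw_gid) < 0 || setuid(pw->pw_uid) < 0) {
        fprintf(stderr, "failed to assume identity of user %s\n", username);
        exit(EX_OSERR);
    }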
|
| |
there's now an optional ext_drop_under setting which defaults to the same as
compact_under, which should be fine. now, if drop_unread is enabled, it only
kicks in if there are no pages matching the compaction threshold.
This allows you to set a lower compaction frag rate, then start rescuing only
non-COLD items if storage is too full. You can also compact up to a point,
then allow a buffer of pages to be used before dropping unread.
previously enabling drop_unread would always drop unread items even when
compacting normally. This limited the utility of the feature.
|
| |
items expired/evicted while pulling from tail weren't being tracked, leading
to a leak of object counts in pages.
|
| |
a couple of TODO items are left for a new issue I thought of. Also a
hardcoded memory buffer size which should be fixed.
also need to change the "free and re-init" logic to use a boolean in case any
related option changes.
|
| |
external commands only for the moment. allows specifying per-slab-class how
many chunks to leave free before causing flushing to storage.
the external page mover algo in previous commits has a few issues:
- relies too heavily on page mover. lots of constant activity under load.
- adjusting the item age level on the fly is too laggy, and can easily
over-free or under-free. IE: if class 3 has TTL 90, but class 4 has TTL 60
and most of the pages in memory, it won't free much until the age lowers to 60.
Thinking this would be something like; % of total chunks in slab class.
easiest to set as a percentage of total memory or by write rate periodically.
from there TTL can be balanced almost as in the original algorithm; keep a
small global page pool for small items to allocate memory from, and pull
pages from or balance between storage-capable classes to align TTL.
|
| |
had a hardcoded value of "start to compact under a slew if more than 3/4ths
of pages are used", but this allows it to be set directly.
ie; "I have 100 pages but don't want to compact util almost full, and then
drop any unread"
|
| |
was struggling to figure out how to automatically turn this on or off, but I
think it should be part of an outside process.
ie; a mechanism should be able to target a specific write rate, and one of
its tools for reducing the write rate should be flipping this on.
there's *still* a hole where you can't trigger a compaction attempt if
there's no fragmentation. I kind of want, if this feature is on, to attempt
a compaction on the oldest page while dropping unread items.
|
| |
LRU crawler was not marking reclaimed expired items as removed from the
storage engine. This could cause fragmentation to persist much longer than it
should, but would not cause any problems once compaction started.
Adds "ext_low_ttl" option. Items with a remaining expiration age below this
value are grouped into special pages. If you have a mixed TTL workload this
would help prevent low TTL items from causing excess fragmentation/compaction.
Pages with low ttl items are excluded from compaction.
|
| |
./configure --enable-extstore to compile the feature in
specify -o ext_path=/whatever to start.
|
| |
if < 2 free pages left, "evict" objects which haven't been hit at all.
should be better than evicting everything if we can continue compacting.
|
| |
been squashing, reorganizing, and pulling code off to go upstream ahead
of merging the whole branch.
|
| |
removes a few ifdefs and upstreams small internal interface tweaks for easy
rebase.
|
| |
most other stats related to items should be 64bit, so the totals should be
able to go higher until we get 64bit hash tables.
|
| |
Implement an aggressive version of drop_privileges(). Additionally add a
similar initialization function for threads, drop_worker_privileges().
This version is similar to Solaris one and prohibits memcached from
making any not approved syscalls. Current list narrows down the allowed
calls to socket sends/recvs, accept, epoll handling, futex (and
dependencies - mmap), getrusage (for stats), and signal / exit
handling.
Any non-approved syscall will result in EACCES being returned. This should be
restricted further to KILL in the future (after more testing).
The feature is only tested for i386 and x86_64. It depends on bpf
filters and seccomp enabled in the kernel. It also requires libseccomp
for abstraction to seccomp filters. All are available since Linux 3.5.
Seccomp filtering can be enabled at compile time with --enable-seccomp.
In case of local customisations which require more rights, memcached
allows disabling drop_privileges() with "-o no_drop_privileges" at
startup.
Tests have to run with "-o relaxed_privileges", since they require
disk access after the tests complete. This adds a few allowed syscalls,
but does not disable the protection system completely.
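With libseccomp the filter setup boils down to something like this
(syscall list heavily abbreviated):

    #include <errno.h>
    #include <seccomp.h>

    void drop_worker_privileges(void) {
        scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ERRNO(EACCES));
        if (ctx == NULL)
            return;
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(recvfrom), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(sendmsg), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(accept4), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(epoll_wait), 0);
        seccomp_rule_add(ctx, SCMP_ACT_ALLOW, SCMP_SYS(futex), 0);
        if (seccomp_load(ctx) != 0) {
            /* filter did not apply; decide whether to continue or abort */
        }
        seccomp_release(ctx);
    }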
|
| |
defaults to 20% of COLD age.
hot_max_age was added because many people's caches were sitting at 32% memory
utilized (exactly the size of hot). Capping the LRUs by percentage and age
would promote some fairness, but I made a mistake making WARM dynamic but HOT
static. This is now fixed.
|
| |
converts the python script to C, more or less.
|
| |
if we loop through a slab too many times without freeing everything, delete
items stuck with high refcounts. they should bleed off so long as the
connections aren't jammed holding them.
should be possible to force rescues in this case as well, but that's more code
so will follow up later. Need a big-ish refactor.
|
| |
no actual speed loss. emulate the slab_stats "get_hits" by totalling up the
per-LRU get_hits.
could sub-LRU many stats but should use a different command/interface for
that.
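The totalling is a sum over the four sub-LRUs of a class (sketch; the id
layout is approximate):

    uint64_t get_hits = 0;
    // HOT/WARM/COLD/TEMP LRU ids share their low bits with the slab class
    for (int lru = 0; lru < 4; lru++) {
        get_hits += thread_stats.lru_hits[slab_cls | (lru << 6)];
    }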
|