A call to setgroups() can fail on Linux if the user lacks CAP_SETGID or if setgroups has been disallowed in a user namespace. The latter happens when the namespace has no GID map or when /proc/pid/setgroups has been set to "deny". Treating this failure as fatal means that memcached cannot start in a Bazel fakeroot. Ignoring the failure seems harmless.
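A minimal sketch of the relaxed behavior, with a hypothetical helper name: attempt the drop, but treat failure (no CAP_SETGID, or setgroups "deny" in a user namespace) as a warning rather than a fatal error.

    #include <grp.h>
    #include <stdio.h>

    /* Try to clear supplementary groups; failure is non-fatal here,
     * e.g. under a fakeroot or a user namespace with setgroups denied. */
    static void drop_supplementary_groups(void) {
        if (setgroups(0, NULL) != 0) {
            perror("setgroups (ignored)");  /* log and continue */
        }
    }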
Severe bug in item chunk fixup (it wasn't doing the fixup at all!). I failed to check on my 32-bit builders... and 32-bit platforms weren't working at all. This is a bit of a kludge since I'm still working around having ptrdiff, but it seems to work.
Also fixes a bug with a missing null byte in the meta filename.
"-e /path/to/tmpfsmnt/file"
SIGUSR1 for graceful stop
restart requires the same memory limit, slab sizes, and some other
infrequently changed details. Most other options and features can
change between restarts. Binary can be upgraded between restarts.
Restart does some fixup work on start for every item in cache. Can take
over a minute with more than a few hundred million items in cache.
Keep in mind when a cache is down it may be missing invalidations,
updates, and so on.
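Not memcached's actual shutdown path, but a minimal sketch of the graceful-stop signal pattern with hypothetical names: the handler only sets a flag, and the event loop does the real work of flushing state to the tmpfs file.

    #include <signal.h>

    static volatile sig_atomic_t stop_requested = 0;

    /* Handler does the minimum: flag the request for the main loop. */
    static void sigusr1_handler(int sig) {
        (void)sig;
        stop_requested = 1;
    }

    static void install_graceful_stop(void) {
        struct sigaction sa = {0};
        sa.sa_handler = sigusr1_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGUSR1, &sa, NULL);
    }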
Ensure we're only reading up to the size of the smallest buffer, since they're both on the stack and could potentially overlap. Overlapping is defined as... undefined behavior. I've looked through all available implementations of strncpy and they all stop copying at the first \0 found.
We'll also never read past the end of sun_path since we _supply_ sun_path with a proper null terminator.
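A sketch of the bounded copy (names illustrative): limit the copy to the smaller of the two stack buffers so neither is read or written past its end, and supply the terminator ourselves.

    #include <string.h>
    #include <sys/un.h>

    static void copy_sun_path(char *dst, size_t dst_len,
                              const struct sockaddr_un *addr) {
        if (dst_len == 0)
            return;
        /* bound by the smaller buffer; they may sit near each other on the stack */
        size_t n = sizeof(addr->sun_path) < dst_len
                 ? sizeof(addr->sun_path) : dst_len;
        strncpy(dst, addr->sun_path, n);
        dst[n - 1] = '\0';  /* we always supply the null terminator */
    }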
This is helpful when you need to identify which clients are connected to which server address while memcached listens on multiple addresses.
When seccomp causes a crash, use a SIGSYS action and handle it to print out an error. Most functions are not allowed at that point (no buffered output, no printf-family functions, no abort, ...), so the implementation is as minimal as possible.
Print out a message with the syscall number and exit the process (all threads).
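A minimal sketch of such a handler under the same constraints, assuming Linux's SIGSYS siginfo (si_syscall) and hypothetical function names; only async-signal-safe calls (write, _exit) are used, with a hand-rolled digit conversion instead of printf.

    #define _GNU_SOURCE     /* for si_syscall on Linux */
    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void sigsys_handler(int sig, siginfo_t *info, void *ctx) {
        char buf[64];
        char *p = buf + sizeof(buf);
        long n = info->si_syscall;  /* number of the blocked syscall */
        (void)sig; (void)ctx;
        *--p = '\n';
        do {                        /* hand-rolled itoa; no snprintf here */
            *--p = '0' + (char)(n % 10);
            n /= 10;
        } while (n > 0);
        write(STDERR_FILENO, "seccomp blocked syscall ", 24);
        write(STDERR_FILENO, p, (size_t)(buf + sizeof(buf) - p));
        _exit(EXIT_FAILURE);        /* exits all threads */
    }

    static void install_sigsys(void) {
        struct sigaction sa = {0};
        sa.sa_sigaction = sigsys_handler;
        sa.sa_flags = SA_SIGINFO;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSYS, &sa, NULL);
    }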
You can now monitor fetches and mutations of a given client.
mem_requested is an oddball counter: it's the total number of bytes "actually requested" from the slab's caller. It's mainly used for a stats counter, alerting the user that the slab factor may not be efficient if the gap between total_chunks * chunk_size and mem_requested is large.

However, since chunked items were added it's _also_ used to help the LRU balance itself. The total number of bytes used in the class vs. the total number of bytes in a sub-LRU is used to judge whether to move items between sub-LRUs.

This is a layer violation, forcing slabs.c to know more about how items work, as well as about EXTSTORE for calculating item sizes from headers. Further, it turns out it wasn't necessary for item allocation: if we need to evict an item we _always_ pull from COLD_LRU or force a move from HOT_LRU. So the total doesn't matter.

The total does matter in the LRU maintainer background thread. However, that thread cached mem_requested to avoid hitting the slab lock too frequently. Since sizes_bytes[] within items.c is generally redundant with mem_requested, we now total sizes_bytes[] from each sub-LRU before starting a batch of LRU juggles.

This simplifies the code a bit, reduces the layer violations in slabs.c slightly, and actually speeds up some hot paths as a number of branches and operations are removed completely. It also fixes an issue I was having with the restartable memory branch :) recalculating p->requested and keeping a clean API is painful and slow.

NOTE: This will vary a bit compared to what mem_requested originally did, mostly for large chunked items. For items which fit inside a single slab chunk, the stat is identical. However, items constructed by chaining chunks will have a single large "nbytes" value and end up in the highest slab class. Chunked items can be capped with chunks from smaller slab classes; you will see utilization of those chunks but not an increase in mem_requested for them. I'm still thinking this through, but it's probably acceptable. Large chunked items should be accounted for separately, perhaps with some new counters so they can be discounted from normal calculations.
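A sketch of the totaling step described above; the sub-LRU id layout (class id OR'd with an LRU bit) and the constant values here are assumptions loosely following memcached's layout, not exact code.

    #include <stdint.h>

    #define HOT_LRU  0
    #define WARM_LRU 64
    #define COLD_LRU 128
    #define TEMP_LRU 192

    extern uint64_t sizes_bytes[];  /* per-sub-LRU byte totals kept by items.c */

    /* Total bytes used by one slab class across its sub-LRUs, computed
     * once before a batch of LRU juggles instead of caching mem_requested. */
    static uint64_t class_bytes(int cls_id) {
        return sizes_bytes[cls_id | HOT_LRU]
             + sizes_bytes[cls_id | WARM_LRU]
             + sizes_bytes[cls_id | COLD_LRU]
             + sizes_bytes[cls_id | TEMP_LRU];
    }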
Uses the fast itoa_u64 function instead. Fixes #471.
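For illustration, a minimal itoa_u64-style routine in the same spirit (a sketch, not the project's exact implementation): build digits backwards in a scratch buffer, then copy forward, avoiding snprintf in the hot path.

    #include <stdint.h>

    /* Returns the number of characters written; no null terminator. */
    static int my_itoa_u64(uint64_t v, char *out) {
        char buf[24];
        char *p = buf + sizeof(buf);
        do {
            *--p = '0' + (char)(v % 10);
            v /= 10;
        } while (v > 0);
        int len = (int)(buf + sizeof(buf) - p);
        for (int i = 0; i < len; i++)
            out[i] = p[i];
        return len;
    }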
Did a weird dance. nsuffix is no longer an 8-bit length; it's replaced with an ITEM_CFLAGS bit, which indicates whether a 32-bit set of client flags is stored in the item or not.
This became possible after removing the inlined ASCII response header in the previous commit.
Has defaulted to false since 1.5.0, and with -o modern for a few years
before that. Performance is fine, no reported bugs. Always was the
intention. Code is simpler without the options.
Loads "username:password\n" tokens (up to 8) out of a supplied authfile.
If enabled, disables binary protocol (though may be able to enable both
if sasl is also used?).
authentication is done via the "set" command. A separate handler is
used to avoid some hot path conditionals and narrow the code
executed in an unauthenticated state.
ie:
set foo 0 0 7\r\n
foo bar\r\n
returns "STORED" on success. Else returns CLIENT_ERROR with some
information.
Any key is accepted: if using a client that doesn't try to authenticate
when connecting to a pool of servers, the authentication set can be
tried with the same key as one that failed to coerce the client to
routing to the correct server. Else an "auth" or similar key would
always go to the same server.
Binary protocol commands never updated the last command time, so connections would always get disconnected by the idle timeout thread even while in use.
The limit got pushed to 1G with chunked items. Fixes #473.
Fixes #474: off-by-one in token count.
Temporary fix. Some folks ... randomize... their start arguments, so this needs to be restructured in a way I'm happy with.
Re #469: delete actually locks/unlocks the item (and hashes the key!) three times. In between the fetch and the unlink, a fully locked add_delta() can run, deleting the underlying item. DELETE then returns success despite the original object hopscotching over it.
I really need to get to the frontend rewrite soon :(
This commit hasn't been fully audited for deadlocks on the stats counters or the extstore STORAGE_delete() function, but it passes tests.
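A minimal sketch of the fix pattern, with entirely hypothetical names: take the lock once and hold it across both the lookup and the unlink, so a concurrent add_delta() can't swap the item in between and trick DELETE into reporting a stale success.

    #include <pthread.h>

    struct item;
    struct bucket {
        pthread_mutex_t lock;
        /* hash chain elided */
    };
    struct item *bucket_find(struct bucket *b, const char *key);  /* hypothetical */
    void bucket_unlink(struct bucket *b, struct item *it);        /* hypothetical */

    static int delete_item(struct bucket *b, const char *key) {
        int found = 0;
        pthread_mutex_lock(&b->lock);
        struct item *it = bucket_find(b, key);  /* fetch under lock */
        if (it != NULL) {
            bucket_unlink(b, it);               /* unlink under the same hold */
            found = 1;
        }
        pthread_mutex_unlock(&b->lock);
        return found;
    }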
This is a knob that has existed since 7.0 (2008) and can only be changed at boot time. It is enabled by default, on the usual archs at least, but in some cases it might not be desired, so we check it anyway.
Most of the work done by Tharanga. Some commits squashed in by
dormando. Also reviewed by dormando.
Tested, working, but experimental implementation of TLS for memcached.
Enable with ./configure --enable-tls
Requires OpenSSL 1.1.0 or better.
See `memcached -h` output for usage.
Bug added in 2014. The same condition was reused to bounce incr/decr commands off of CHUNKED and ITEM_HDR items. Thus, incr/decr'ing a value that is already in extstore would immediately cause a refcount leak :(
Queues were round-robin before. During sustained overload some queues can get behind while others stay empty. Simply do a bit more work to track depth and pick the shallowest queue. This is fine for now since the bottleneck remains elsewhere.
I'd been meaning to do this; benchmark work made it more obvious.
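A minimal sketch of the selection, with hypothetical names: scan the per-worker depth counters and dispatch the new connection to the shallowest queue.

    /* Caller increments queue_depth[best] on enqueue and the worker
     * decrements it on dequeue; a scan over a few dozen ints is cheap
     * relative to the rest of connection dispatch. */
    static int pick_thread(int nthreads, const int *queue_depth) {
        int best = 0;
        for (int i = 1; i < nthreads; i++) {
            if (queue_depth[i] < queue_depth[best])
                best = i;
        }
        return best;
    }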
Some whackarse ARM platforms on specific glibc/gcc (new?) versions trip SIGBUS while reading the header chunk for a split item.
The header chunk is unfortunate magic: it lives in ITEM_data() at a random offset, is zero sized, and only exists to simplify code around finding the original slab class and linking/relinking subchunks to an item.
There's no fix to this which isn't a lot of code. I need to refactor chunked items, and attempted to do so, but couldn't come up with something I liked quickly enough.
This change pads the first chunk if alignment is necessary, which wastes bytes and a little CPU, but I'm not going to worry a ton about these obscure platforms.
This works with rebalancing because in the case of an ITEM_CHUNKED header, it treats the item size as the size of the class it resides in and memcpy's the item during recovery.
All other cases were changed from ITEM_data to a new ITEM_schunk() inline function that is used when NEED_ALIGN is set; otherwise it's still equal to ITEM_data.
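A sketch of the padding idea (names illustrative, alignment value assumed): round the first chunk's start up to an 8-byte boundary on NEED_ALIGN platforms, trading a few wasted bytes for alignment-safe loads.

    #include <stddef.h>
    #include <stdint.h>

    /* Bytes of padding needed to bring addr up to the next multiple of align. */
    static size_t align_pad(uintptr_t addr, size_t align) {
        size_t rem = (size_t)(addr % align);
        return rem == 0 ? 0 : align - rem;
    }

    /* e.g. data += align_pad((uintptr_t)data, 8); */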
Just a Bunch Of Devices :P
Code exists for routing specific devices to specific buckets (lowttl/compact/etc), but enabling it requires significant fixes to the compaction algorithm. Thus it is disabled as of this writing.
Code cleanups and future work:
- pedantically freeing memory and closing fd's on exit
- unify and flatten the free_bucket code
- defines for free buckets
- page eviction adjustment (force min-free per free bucket)
- fix default calculation for compact_under and drop_under
- might require forcing this value only on default bucket
Trying out a simplified slab class backoff algorithm. The LRU maintainer individually schedules slab classes by time, which leads to multiple wakeups in a steady state as they get out of sync. This algorithm more simply skips that class more often each time it runs the main loop, using a single scheduled sleep instead.
If it goes to sleep for a long time, it also reduces the backoff for all classes. If we're barely awake it should be fine to poke everything.
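A sketch of the skip-count approach with hypothetical fields: the maintainer takes one scheduled sleep per loop, each idle class gets skipped for more consecutive loops, and a long sleep is the cue to reset everyone's backoff.

    struct class_backoff {
        unsigned int skip;     /* loops to skip before the next check */
        unsigned int skipped;  /* loops skipped so far */
    };

    /* Returns 1 when this class should be checked on the current loop. */
    static int should_check_class(struct class_backoff *b) {
        if (b->skipped++ < b->skip)
            return 0;          /* still backing off */
        b->skipped = 0;
        return 1;
    }

    /* Called after checking a class; the cap of 64 is an assumption. */
    static void note_check_result(struct class_backoff *b, int did_work) {
        if (did_work)
            b->skip = 0;       /* busy class: check every loop again */
        else if (b->skip < 64)
            b->skip++;         /* idle class: skip it more often */
    }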
Apparently since (forever?) the double while loop in process_get_command() would turn into an infinite loop (and leak memory/die) if add_iov() ever failed. The recently added get_extstore() is more likely to spuriously fail, so it turned into a problem.
This creates a common path for the key-length abort as well.
Adds a test, which breaks several ways before this patch.
Adds `-o drop_privileges` alongside the existing no_drop_privileges.
The feature is experimental and causing some user pain: I'd forgotten about the disable flag entirely, somehow.
Changing defaults (especially for a security feature) is not a typical thing to do, but we should have done this from the start like all other features: initially gated, then added to `modern`, then switched to default once mature.
Linux has supported transparent huge pages for quite some time. Memory regions can be marked for conversion to huge pages with madvise. Alternatively, users can have the system default to using huge pages for all memory regions when applicable, i.e. when the mapped region is large enough, properly aligned pages will be converted.
Using either method, we preallocate memory for the cache with proper alignment and call madvise on it. Whether the memory region actually gets converted to huge pages ultimately depends on the setting of /sys/kernel/mm/transparent_hugepage/enabled. The existence of this file is also checked to see if transparent huge page support is compiled into the kernel.
If any step of the preallocation fails, we simply fall back to standard allocation, without even preallocating slabs, as they would not have the proper alignment or settings anyway.
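A sketch of the preallocation path (Linux-specific; the 2MB huge page size and the helper name are assumptions): check that THP support exists, allocate the arena aligned, madvise it, and return NULL so the caller falls back to standard allocation on any failure.

    #define _GNU_SOURCE     /* for MADV_HUGEPAGE */
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *prealloc_thp(size_t len) {
        void *mem = NULL;
        /* file exists only when THP support is compiled into the kernel */
        if (access("/sys/kernel/mm/transparent_hugepage/enabled", F_OK) != 0)
            return NULL;
        if (posix_memalign(&mem, 1 << 21, len) != 0)  /* 2MB alignment */
            return NULL;
        if (madvise(mem, len, MADV_HUGEPAGE) != 0) {
            free(mem);
            return NULL;  /* caller falls back to standard allocation */
        }
        return mem;
    }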
If sasl_server_step is called on a sasl_conn which has not had
sasl_server_start called on it, it will segfault from reading
uninitialized memory.
Memcached currently calls sasl_server_start when the client sends
the PROTOCOL_BINARY_CMD_SASL_AUTH command and sasl_server_step when
the client sends the PROTOCOL_BINARY_CMD_SASL_STEP command. So if the
client sends SASL_STEP before SASL_AUTH, the server segfaults.
For well-behaved clients, this case never happens; but for the
java-memcached-client, when configured with an incorrect password,
it happens very frequently. This is likely because the client handles
auth on a background thread and the socket may be swapped out in the
middle of authentication. You can see that code here:
https://github.com/dustin/java-memcached-client/blob/master/src/main/java/net/spy/memcached/auth/AuthThread.java
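A minimal sketch of the guard, with hypothetical fields: gate SASL_STEP on a per-connection flag that the SASL_AUTH path sets after sasl_server_start(), so sasl_server_step() can never see an uninitialized sasl_conn.

    #include <stdbool.h>

    struct conn_auth_state {
        bool sasl_started;  /* set true once sasl_server_start() has run */
    };

    /* Return true only if a SASL_STEP command may proceed; the caller
     * should answer with an auth error response otherwise. */
    static bool sasl_step_allowed(const struct conn_auth_state *c) {
        return c->sasl_started;
    }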
Fixes memory alignment when reading header data back.
A "32" left in a few places that should've at least been a define is now properly an offsetof. It's used for skipping the crc32 over the dynamic parts of the item headers.
GATK returns a key but not the value. c->io_wraplist is only appended to if the value is to be returned, but c->item is skipped if it is an ITEM_HDR at all. This now checks for the ITEM_HDR bit being set but also !value, and then reclaims the reference normally.
I knew doubling up the cleanup code made it a lot more complex, and I hope to flatten that to a single path. Also, the TOUCH/GAT/GATK binprot code has no real test coverage, nor mc-crusher entries. That should be worth fixing.
When launching as root, we drop permissions to run as a specific user; however, the way this is done fails to drop supplementary groups from the process.
On many systems, the root user is a member of many groups used to guard important system files. If we neglect to drop these groups, it's still possible for a compromised memcached instance to access critical system files as though it were running with root-level permissions.
On any given system, `grep root /etc/group` will reveal all the groups root is a member of, and memcached will have access to any file these groups are authorized for, in spite of our attempts to drop this access.
It's possible to test whether a given memcached instance is affected and running with elevated permissions by checking the respective line in procfs: `grep Groups /proc/$pid/status`
If any groups are listed, memcached has access to everything the listed groups have access to. After this patch no groups will be listed and memcached will be locked down properly.
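A sketch of the corrected drop order (helper name hypothetical): clear supplementary groups while still root, then setgid/setuid to the target user; doing it in any other order either fails or leaves the groups behind.

    #include <grp.h>
    #include <pwd.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void drop_root_to(const char *username) {
        struct passwd *pw = getpwnam(username);
        if (pw == NULL) {
            fprintf(stderr, "can't find user %s\n", username);
            exit(EXIT_FAILURE);
        }
        if (setgroups(0, NULL) != 0) {  /* drop root's supplementary groups */
            perror("setgroups");
            exit(EXIT_FAILURE);
        }
        if (setgid(pw->pw_gid) != 0 || setuid(pw->pw_uid) != 0) {
            perror("setgid/setuid");
            exit(EXIT_FAILURE);
        }
    }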
As reported, UDP amplification attacks have started to use insecure internet-exposed memcached instances. UDP used to be a lot more popular as a transport for memcached many years ago, but I'm not aware of many recent users.
Ten years ago, the TCP connection overhead from many clients was relatively high (dozens or hundreds per client server), but these days many clients are batched, or use fewer processes, or simply aren't worried about it.
While changing the default to listen on localhost only would also help, the true culprit is UDP. There are many more use cases for using memcached over the network than there are for using the UDP protocol.
If we detect libevent version >= 2.0.2-alpha, use event_base_new_with_config() instead of the obsolete event_init() for creating a new event base. Set the config flag EVENT_BASE_FLAG_NOLOCK to avoid lock/unlock around every libevent operation.
In newer versions of libevent, the event_init() API is deprecated and is totally unsafe for multithreaded use. By using the new API, we can explicitly disable/enable locking on the event_base.
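A sketch of the new-API path: since each worker thread owns its own event base, libevent's internal locking can be disabled via the config flag.

    #include <stddef.h>
    #include <event2/event.h>

    static struct event_base *make_base(void) {
        struct event_config *cfg = event_config_new();
        if (cfg == NULL)
            return NULL;
        /* no cross-thread access to this base, so skip internal locks */
        event_config_set_flag(cfg, EVENT_BASE_FLAG_NOLOCK);
        struct event_base *base = event_base_new_with_config(cfg);
        event_config_free(cfg);
        return base;  /* NULL: caller can fall back to the old event_init() path */
    }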
Issue #338 reported a memory leak in the init code. Another non-issue, since it's a handful of bytes and that code path is only used in a couple of tests.
Issue #337 reported a memory leak, but in these cases the process exits anyway.
curr_items tracks how many items are linked in the hash table. Internally, the hash table tracked how many items it held in hash_items; on every insert/delete, hash_items had to be locked and checked to see if the table should be expanded. Rip all that out, and instead run a check from the once-per-second clock event to see if the hash table should expand.
This actually ends up fixing an obscure bug: if you burst a bunch of sets then stop, the hash table won't attempt to expand a second time until the next insert. With this change, every second the hash table has a chance of expanding again.
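A sketch of the once-per-second check, loosely following memcached's names; the 3/2 load factor and the expander hook are assumptions for illustration.

    #include <stdint.h>

    #define hashsize(n) ((uint64_t)1 << (n))

    void assoc_start_expand(void);  /* hypothetical: wakes the expander thread */

    /* Called from the clock event rather than on every insert/delete. */
    static void hash_expand_check(uint64_t curr_items, unsigned int hashpower) {
        if (curr_items > (hashsize(hashpower) * 3) / 2) {
            assoc_start_expand();
        }
    }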
The LRU crawler metadumper is used for getting snapshot-y looks at the LRUs. Since there's no default limit, it'll pick up any new items added or bumped since the roll started.
With this change it limits the number of items dumped to the number that existed in that LRU when the roll was kicked off. You still end up with an approximation, but not a terrible one:
- items bumped after the crawler passes them likely won't be revisited
- items bumped before the crawler passes them will likely be visited toward the end, or mixed in with new items
- deletes land somewhere in the middle
Would segfault if you gave it only 2 arguments :|
Needs more writing; that will happen over time. At least --help has the right number of newlines...
I purposefully broke _get_extstore and it turned the request into a miss, as expected. Tests pass with and without extstore.
Was initializing to 1... but we want it to be zero until the thing has a chance to fill and flip on the balancer algo.
With the page mover algo being less shitty, we shouldn't rely on item_age except for weird scenarios.