| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
Use has_feature() method too.
Signed-off-by: Samuel Just <sam.just@inktank.com>
Signed-off-by: Sage Weil <sage@inktank.com>
|
|\
| |
| |
| |
| |
| |
| |
| | |
Conflicts:
src/include/ceph_features.h
Reviewed-by: Sage Weil <sage@inktank.com>
Fixes: #5278
|
| |
| |
| |
| | |
Signed-off-by: Samuel Just <sam.just@inktank.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
| |
Compare all keys within the sync'ed prefixes across members of the quorum
and compare the key counts and CRC for inconsistencies.
Currently this is a one-shot inefficient hammer. We'll want to make this
work in chunks before it is usable in production environments.
Protect with a feature bit to avoid sending MMonScrub to mons who can't
decode it.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Greg Farnum <greg@inktank.com>
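
The comparison described in the message above can be sketched roughly as follows. This is a minimal illustration, not the monitor's code: ScrubSummary, Entry, scrub_store and find_inconsistent are hypothetical names, the store is assumed to be already filtered to the sync'ed prefixes and iterated in the same sorted order on every monitor, and the bitwise CRC-32C is used only for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Bitwise CRC-32C, chained through `crc`. The checksum the monitor really
// uses is an assumption here.
inline uint32_t crc32c_update(uint32_t crc, const std::string& data) {
  crc = ~crc;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}

// Hypothetical per-monitor summary: key count and running CRC per prefix.
struct ScrubSummary {
  std::map<std::string, uint64_t> prefix_keys;
  std::map<std::string, uint32_t> prefix_crc;
  bool operator==(const ScrubSummary& o) const {
    return prefix_keys == o.prefix_keys && prefix_crc == o.prefix_crc;
  }
};

// One store entry (hypothetical layout).
struct Entry { std::string prefix, key, value; };

// One-shot pass over the whole store, as in the "inefficient hammer".
inline ScrubSummary scrub_store(const std::vector<Entry>& store) {
  ScrubSummary s;
  for (const Entry& e : store) {
    ++s.prefix_keys[e.prefix];
    uint32_t& crc = s.prefix_crc[e.prefix];
    crc = crc32c_update(crc, e.key);
    crc = crc32c_update(crc, e.value);
  }
  return s;
}

// Returns the indexes of quorum members whose summary differs from the
// first one (taken here to be the leader's).
inline std::vector<int> find_inconsistent(const std::vector<ScrubSummary>& quorum) {
  assert(!quorum.empty());
  std::vector<int> bad;
  for (size_t i = 1; i < quorum.size(); ++i)
    if (!(quorum[i] == quorum[0])) bad.push_back(static_cast<int>(i));
  return bad;
}
```

A chunked version would carry the running counts and CRCs between calls instead of walking the whole store at once.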
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use the atomic pipe link removal as a signal that we are the one failing
the con and use that to queue the reset event.
This fixes the case where we have an open, the session gets set up via the
handle_accept callback, and we then race with another connection and go
into wait + close, or just close. In that case, fault() needs to queue a
reset event to match the accept.
Signed-off-by: Sage Weil <sage@inktank.com>
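
A minimal sketch of the idea above, assuming a mutex-protected link: only the caller whose unlink actually succeeds queues the reset. Pipe, DispatchQueue, reset_pipe and the free function fault() are simplified stand-ins invented for this illustration, not the messenger's real types.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

struct Pipe { int id; };

class Connection {
  std::mutex lock;
  Pipe* pipe = nullptr;
public:
  void reset_pipe(Pipe* p) {
    std::lock_guard<std::mutex> l(lock);
    pipe = p;
  }
  // Unlink `old` only if it is still the linked pipe. Returns true for the
  // single caller that performed the unlink.
  bool clear_pipe(Pipe* old) {
    std::lock_guard<std::mutex> l(lock);
    if (pipe != old) return false;
    pipe = nullptr;
    return true;
  }
};

// Hypothetical stand-in for the dispatch queue's reset events.
struct DispatchQueue {
  std::vector<Connection*> resets;
  void queue_reset(Connection* c) { resets.push_back(c); }
};

// Fault path: whoever wins the unlink is the one failing the connection,
// so that caller (and nobody else) queues the reset.
inline void fault(Pipe* self, Connection* con, DispatchQueue& dq) {
  assert(self && con);
  if (con->clear_pipe(self))
    dq.queue_reset(con);
}
```

Because clear_pipe() compares against the caller's own pipe, a stale pipe that has already been replaced neither unlinks the new one nor queues a second reset.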
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make RefCountedObject a private parent of Connection so that users are
forced to use ConnectionRef whenever references are taken.
Many methods can still take a raw Connection* when they are using the
caller's reference but not taking their own; this is cheaper than
twiddling the reference count, and the lifetime is still well defined.
Local variables generally use ConnectionRef, though.
Signed-off-by: Sage Weil <sage@inktank.com>
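
The private-parent pattern described above can be illustrated roughly like this. It is a sketch under assumptions: the hand-rolled ConnectionRef below stands in for whatever smart pointer type the tree uses, and ref_get(), ref_put(), ref_count() and peer_type_of() are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <utility>

class Connection;

// Hand-rolled handle standing in for ConnectionRef.
class ConnectionRef {
  Connection* p = nullptr;
  void acquire();
  void release();
public:
  ConnectionRef() = default;
  explicit ConnectionRef(Connection* c) : p(c) { acquire(); }
  ConnectionRef(const ConnectionRef& o) : p(o.p) { acquire(); }
  ConnectionRef& operator=(ConnectionRef o) { std::swap(p, o.p); return *this; }
  ~ConnectionRef() { release(); }
  Connection* get() const { return p; }
  Connection* operator->() const { return p; }
};

class RefCountedObject {
  std::atomic<int> nref{0};
public:
  virtual ~RefCountedObject() {}
  void ref_get() { ++nref; }
  void ref_put() { if (--nref == 0) delete this; }
  int ref_count() const { return nref.load(); }
};

// Private inheritance hides ref_get()/ref_put() from ordinary users; only
// the friend handle can change the count.
class Connection : private RefCountedObject {
  friend class ConnectionRef;
public:
  int peer_type = 0;
  using RefCountedObject::ref_count;  // read-only view, for inspection
};

inline void ConnectionRef::acquire() { if (p) p->ref_get(); }
inline void ConnectionRef::release() {
  if (!p) return;
  assert(p->ref_count() > 0);
  p->ref_put();
  p = nullptr;
}

// Borrows the caller's reference: takes a raw pointer and leaves the
// count untouched, since the caller keeps the object alive for the call.
inline int peer_type_of(Connection* con) { return con->peer_type; }
```

With this layout, calling con->ref_get() from outside the handle fails to compile, which is what forces callers to go through ConnectionRef.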
|
|\
| |
| |
| |
| |
| |
| | |
Reviewed-by: Sage Weil <sage@inktank.com>
Conflicts:
src/mds/MDCache.cc
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This patch adds an "open-by-ino" helper. It uses the backtrace to find
an inode's path and open the inode. The algorithm looks like:
1. Check MDS peers. If any MDS has the inode in its cache, goto step 6.
2. Fetch the backtrace. If a backtrace was previously fetched and we get
the same backtrace again, return -EIO.
3. Traverse the path in the backtrace. If the inode is found, goto step 6;
if a non-auth dirfrag is encountered, goto the next step. If we fail to
find the inode in its parent dir, goto step 1.
4. Request MDS peers to traverse the path in the backtrace. If the inode
is found, goto step 6. If an MDS peer encounters a non-auth dirfrag, it
stops traversing. If any MDS peer fails to find the inode in its
parent dir, goto step 1.
5. Use the same algorithm to open the inode's parent. Goto step 3 if
that succeeds; goto step 1 if it fails.
6. Return the inode's auth MDS ID.
The algorithm has two main assumptions:
1. If an inode is in its auth MDS's cache, its on-disk backtrace
can be out of date.
2. If an inode is not in any MDS's cache, its on-disk backtrace
must be up to date.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Send ping requests to both the front and back hb addrs for peer osds. If
the front hb addr is not present, do not send it and interpret a reply
as coming from both. This handles the transition from old to new OSDs
seamlessly.
Note both the front and back rx times. Both need to be up to date in order
for the peer to be healthy.
Signed-off-by: Sage Weil <sage@inktank.com>
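
The front/back bookkeeping above might look something like the following sketch. The struct layout, the Channel enum, and the use of plain integer seconds for timestamps are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

using utime = int64_t;  // seconds; a simplification for this sketch

struct HeartbeatInfo {
  enum class Channel { Front, Back };

  bool has_front = false;  // false for an old peer with no front hb addr
  utime last_rx_front = 0;
  utime last_rx_back = 0;

  // Old peers get a single ping on the back address only.
  int pings_to_send() const { return has_front ? 2 : 1; }

  void handle_reply(Channel ch, utime now) {
    if (ch == Channel::Back) {
      last_rx_back = now;
      if (!has_front)
        last_rx_front = now;  // one reply counts as coming from both
    } else {
      assert(has_front);      // we never pinged a front addr we don't have
      last_rx_front = now;
    }
  }

  // Both rx times must be at or after the cutoff for the peer to be healthy.
  bool is_healthy(utime cutoff) const {
    return last_rx_front >= cutoff && last_rx_back >= cutoff;
  }
};
```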
|
|/
|
|
|
|
| |
This allows us to get the messenger associated with a connection.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
|
|
|
|
|
|
|
|
| |
We already have a throttler that lets us limit the amount of memory
consumed by messages from a given source. Currently this is based only
on the size of the message payload. Add a second throttler that limits
the number of messages so that we can effectively throttle small requests
as well.
Signed-off-by: Sage Weil <sage@inktank.com>
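
A rough sketch of pairing the two throttlers. For brevity this version fails instead of blocking when a limit is hit, which is an assumption of the example rather than a statement about the messenger; Throttle here is a simplified stand-in and SourcePolicy is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Simplified counting throttle: non-blocking, single-threaded.
class Throttle {
  uint64_t max;
  uint64_t current = 0;
public:
  explicit Throttle(uint64_t m) : max(m) {}
  bool get_or_fail(uint64_t c) {
    if (current + c > max) return false;
    current += c;
    return true;
  }
  void put(uint64_t c) {
    assert(c <= current);
    current -= c;
  }
  uint64_t in_use() const { return current; }
};

// Hypothetical per-source policy holding both limits.
struct SourcePolicy {
  Throttle bytes;     // existing limit: payload bytes in flight
  Throttle messages;  // new limit: number of messages in flight
  SourcePolicy(uint64_t max_bytes, uint64_t max_msgs)
    : bytes(max_bytes), messages(max_msgs) {}

  // A message is admitted only if it fits under both limits.
  bool admit(uint64_t payload_len) {
    if (!messages.get_or_fail(1)) return false;
    if (!bytes.get_or_fail(payload_len)) {
      messages.put(1);  // roll back the slot we just took
      return false;
    }
    return true;
  }
  void release(uint64_t payload_len) {
    bytes.put(payload_len);
    messages.put(1);
  }
};
```

With only the byte throttle, a flood of tiny requests never comes close to the limit; the message count catches that case.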
|
|
|
|
| |
Signed-off-by: Samuel Just <sam.just@inktank.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The HealthMonitor builds upon the QuorumService interface, and should be
used to keep track of any and all relevant information about the monitor
cluster (maybe even about the whole cluster if need be).
This patch also introduces the HealthService interface, used to define
a HealthMonitor service, responsible for dispatching 'MMonHealth' messages
(the QuorumService interface dispatches generic 'Message').
Based on the HealthService interface, we introduce the DataHealthService
class, a service that will track disk space consumption by the monitors,
warn when a given threshold is crossed, and gracefully shut down the
monitor if disk space usage hits critical levels that might affect correct
monitor behavior.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
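
The threshold check that DataHealthService is described as performing could be sketched like this. The names and the choice to express thresholds as "percent of space still available" are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

enum class DiskState { Ok, Warn, Critical };

// Hypothetical thresholds, in percent of space still available.
struct DataThresholds {
  int warn_avail_pct;
  int crit_avail_pct;
};

inline int avail_percent(uint64_t avail_bytes, uint64_t total_bytes) {
  assert(total_bytes > 0 && avail_bytes <= total_bytes);
  return static_cast<int>(avail_bytes * 100 / total_bytes);
}

// Critical is checked first so that it wins when both thresholds are crossed.
inline DiskState classify(int avail_pct, const DataThresholds& t) {
  if (avail_pct <= t.crit_avail_pct) return DiskState::Critical;
  if (avail_pct <= t.warn_avail_pct) return DiskState::Warn;
  return DiskState::Ok;
}
```

A caller would log a warning on Warn and begin a graceful shutdown on Critical.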
|
|\
| |
| |
| |
| |
| |
| |
| |
| | |
Conflicts:
src/.gitignore
src/Makefile.am
src/include/ceph_features.h
src/mon/MDSMonitor.cc
src/mon/PGMonitor.cc
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The monitor's synchronization process requires a specific message type
to carry the required information. Since this process differs
significantly from slurping, reusing the MMonProbe message is not an
option: it would require major changes and, for all intents and purposes,
would be far outside the scope of the MMonProbe message.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
|
|/
|
|
|
|
| |
Replace C-style pointer casting with correct static_cast<>().
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|
|
|
|
|
|
| |
The data payload is a decent proxy for cost in most cases, but not all.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
|
|
|
| |
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
|
|
|
|
|
|
|
| |
This won't bite us for a while yet (we're on bit 26), but it will soon!
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
|
|
|
|
| |
Signed-off-by: Samuel Just <sam.just@inktank.com>
|
|
|
|
| |
Signed-off-by: Samuel Just <sam.just@inktank.com>
|
|
|
|
|
|
|
| |
This message will be used to reserve and release recovery slots on
replica PGs.
Signed-off-by: Mike Ryan <mike.ryan@inktank.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, a new osd would be bombarded by backfills from many osds
simultaneously, resulting in excessively high load. Instead, we
want to limit the number of backfills coming into and going out
from a single osd.
To that end, each OSDService now has two AsyncReserver instances: one
for backfills going from the osd (local_reserver) and one for backfills
going to the osd (remote_reserver). For a primary to initiate a
backfill, it must first obtain a reservation from its own
local_reserver. Then, it must obtain a reservation from the backfill
target's remote_reserver via an MBackfillReserve message. This process is
managed by substates of Active and ReplicaActive (see the changes in
PG.h). The reservations are dropped either on the Backfilled event
(which is sent on the primary before calling recovery_complete and on the
replica on receipt of the BackfillComplete progress message), or upon
leaving Active or ReplicaActive.
It's important that we always grab the local reservation before the
remote reservation in order to prevent a circular dependency.
Signed-off-by: Samuel Just <sam.just@inktank.com>
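
The local-then-remote ordering above can be sketched with a toy reserver. This is an illustration under assumptions: Reserver is a much-simplified stand-in for AsyncReserver, the remote request is a direct call rather than an MBackfillReserve message, and start_backfill() and finish_backfill() are hypothetical helpers.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <set>
#include <utility>

// Grants at most max_allowed reservations; the rest wait in FIFO order.
class Reserver {
  unsigned max_allowed;
  std::set<int> granted;
  std::deque<std::pair<int, std::function<void()>>> waiting;

  void do_queues() {
    while (granted.size() < max_allowed && !waiting.empty()) {
      auto next = std::move(waiting.front());
      waiting.pop_front();
      granted.insert(next.first);
      next.second();  // tell the requester its slot is granted
    }
  }
public:
  explicit Reserver(unsigned max) : max_allowed(max) { assert(max > 0); }

  void request(int pg, std::function<void()> on_grant) {
    waiting.emplace_back(pg, std::move(on_grant));
    do_queues();
  }
  // Drops a granted or still-queued reservation for pg.
  void cancel(int pg) {
    granted.erase(pg);
    for (auto it = waiting.begin(); it != waiting.end(); ++it)
      if (it->first == pg) { waiting.erase(it); break; }
    do_queues();
  }
};

// Primary side: take the local slot first, and only then ask the backfill
// target for a remote slot.
inline void start_backfill(int pg, Reserver& local, Reserver& target_remote,
                           std::function<void()> begin) {
  local.request(pg, [pg, &target_remote, begin]() {
    target_remote.request(pg, begin);
  });
}

inline void finish_backfill(int pg, Reserver& local, Reserver& target_remote) {
  target_remote.cancel(pg);
  local.cancel(pg);
}
```

Since no one holds a remote slot while waiting for a local one, two OSDs backfilling to each other cannot end up waiting on each other in a cycle.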
|
|\ |
|
| |
| |
| |
| |
| |
| |
| | |
gcc 4.7 requires that the intrusive_ptr_* functions be in
the same namespace as the templated class.
Signed-off-by: Samuel Just <sam.just@inktank.com>
|
|\ \
| |/
|/| |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There was a race where:
- sending stuff to a lossy Connection
- it fails, and queues itself for reap, queues a RESET event
- reaper clears the Pipe
- some thread queues new messages and the Pipe is reopened, messages sent
- RESET event delivered to dispatch, connection is closed and reopened.
The result was that messages got sent to the OSD out of order during the
window between the fault() and ms_handle_reset() getting called. This will
prevent that.
Signed-off-by: Sage Weil <sage@inktank.com>
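
One way to read the fix for the race above is "once a lossy Connection has failed, never send on it again; wait for the reset to be handled and use a fresh one". The sketch below illustrates that reading only. LossyConnection and reopen() are hypothetical, and a vector of strings stands in for the wire.

```cpp
#include <cassert>
#include <string>
#include <vector>

class LossyConnection {
  bool failed = false;
  std::vector<std::string> wire;  // messages that actually went out
public:
  // Fault path: the Pipe is gone and a RESET event has been queued.
  void mark_failed() { failed = true; }
  bool is_failed() const { return failed; }

  // After a failure, messages are dropped rather than reopening a Pipe, so
  // nothing can slip out between fault() and the reset being handled.
  bool send(const std::string& m) {
    if (failed) return false;
    wire.push_back(m);
    return true;
  }
  const std::vector<std::string>& sent() const { return wire; }
};

// What a reset handler might do: replace the failed connection wholesale.
inline LossyConnection reopen(const LossyConnection& old) {
  assert(old.is_failed());  // only failed connections are replaced
  return LossyConnection();
}
```

Anything queued after the failure is resent by the owner once it has the fresh connection, so ordering is decided in one place.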
|
| |
| |
| |
| |
| |
| | |
If a lossy Connection fails and we disconnect the Pipe, set a failed flag.
Signed-off-by: Sage Weil <sage@inktank.com>
|
|/
|
|
|
|
|
|
| |
Query and Notify messages include logical messages from multiple
pgs. Each logical message (pg_query_t and pg_notify_t) now
contains an epoch_sent.
Signed-off-by: Samuel Just <sam.just@inktank.com>
|
|
|
|
| |
Signed-off-by: Sage Weil <sage@inktank.com>
|
|
|
|
|
|
|
|
|
|
|
| |
This way old Pipes that have been replaced can't clear the new Pipe
out of a Connection's link.
We might attempt to instead sever the link between CLOSED Pipes and
their Connections more completely (eg, when the Connection gets a
new Pipe), but that will require more work to handle all the
cases, and this works for now.
Signed-off-by: Greg Farnum <greg@inktank.com>
|
|
|
|
|
|
|
|
| |
This is a big patch that will remove all references to the observers
throughout the code, including a complete removal of the Observer-related
messages' source files.
Signed-off-by: Joao Eduardo Luis <joao.luis@inktank.com>
|
|
|
|
|
|
| |
Following a popular request.
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
|
|
|
|
|
|
|
|
|
|
| |
This fixes #2342. We shouldn't call notify on the dispatcher
context. We should also make sure that we don't hold
the client lock while waiting for the responses.
Also, pushed the client_lock locking into the
ctx->notify().
Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
|
|
|
|
|
|
|
| |
Also, Message now has a timestamp indicating when the message
was fully received, for use by OSDTracker.
Signed-off-by: Samuel Just <samuel.just@dreamhost.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Define a HEAD_VERSION and COMPAT_VERSION for any versioned message. Pass
them to the Message constructor so that they are always initialized, even
from the default constructor. That's needed because we use them to check
decoding compatibility when receiving/decoding messages.
If we are conditionally encoding an old version, explicitly set
header.version in encode_payload().
We also set compat_version to demonstrate what will happen for future
revisions. In this case, it's moot, because no old code understands
compat_version yet: nobody with old decode code will see these values
anyway. But use this opportunity to demonstrate how it would be used in
the future.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
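
A minimal sketch of the versioning scheme above. MExample, its type code, MsgHeader and can_decode() are hypothetical; the compatibility rule shown (a receiver can decode if its own HEAD_VERSION is at least the sender's compat_version) is this example's reading of how compat_version would be used.

```cpp
#include <cassert>
#include <cstdint>

struct MsgHeader {
  uint16_t type = 0;
  uint16_t version = 0;
  uint16_t compat_version = 0;
};

class Message {
protected:
  MsgHeader header;
public:
  // Versions are constructor arguments, so they are always initialized,
  // even when a subclass is default-constructed for decoding.
  Message(uint16_t type, uint16_t version, uint16_t compat_version) {
    assert(compat_version <= version);
    header.type = type;
    header.version = version;
    header.compat_version = compat_version;
  }
  virtual ~Message() {}
  const MsgHeader& get_header() const { return header; }
  virtual void encode_payload(bool peer_understands_head) = 0;
};

class MExample : public Message {
public:
  static constexpr uint16_t HEAD_VERSION = 2;    // newest encoding we write
  static constexpr uint16_t COMPAT_VERSION = 1;  // oldest decoder that copes

  MExample() : Message(0x1234, HEAD_VERSION, COMPAT_VERSION) {}

  void encode_payload(bool peer_understands_head) override {
    // When conditionally encoding the old format, say so in the header.
    if (!peer_understands_head)
      header.version = 1;
    // ... payload encoding elided ...
  }

  // Receiver-side check against an incoming header.
  static bool can_decode(const MsgHeader& incoming) {
    return incoming.compat_version <= HEAD_VERSION;
  }
};
```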
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Reviewed-by: Josh Durgin <josh.durgin@dreamhost.com>
Conflicts:
src/msg/Message.h
src/osd/OSD.cc
src/osd/ReplicatedPG.cc
src/osd/ReplicatedPG.h
|
| |
| |
| |
| |
| |
| | |
Just wrap print() for now.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
- get_type_name()
- print()
and all the random crap they call.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
|
| |
| |
| |
| | |
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
|
| |
| |
| |
| | |
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
|
| |
| |
| |
| |
| |
| |
| | |
Avoid using the connection reference; pass it in explicitly instead. This
will make ceph-dencoder's life a bit easier.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
|
|/
|
|
| |
Signed-off-by: Greg Farnum <gregory.farnum@dreamhost.com>
|
|
|
|
| |
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
|
|
|
|
|
|
| |
Message to query hash ranges of a PG.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
|
|
|
|
|
|
| |
If a monitor starts up with the correct fsid and auth keys, it will now
add itself to the monmap (and subsequently try to join the quorum) if it
is not already in the monmap.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a monitor has been down and is behind, and joins the quorum, the
other nodes will try to send it all of the needed state, which can
bring the cluster to a halt.
Instead, implement a new bootstrap() procedure:
- probe the cluster nodes
- if there is an existing quorum,
- and it is not too far ahead of me, join it (call an election)
- otherwise, slurp down all the newer state and then restart (bootstrap)
- if we see enough online nodes that are not part of the quorum, call
an election.
We still need to add some timeouts.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
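
The decision made after probing could be sketched as a pure function. This is an illustration only: ProbeSummary, max_lag, and the use of a strict majority for "enough online nodes" are assumptions, not values taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

enum class BootstrapAction { JoinQuorum, SlurpThenRestart, CallElection, KeepProbing };

// Hypothetical digest of the probe replies.
struct ProbeSummary {
  bool quorum_exists;
  uint64_t quorum_version;  // newest state version seen in the quorum
  uint64_t my_version;      // newest state version we hold
  unsigned outside_quorum;  // online nodes (including us) not in a quorum
  unsigned monmap_size;
};

inline BootstrapAction decide(const ProbeSummary& p, uint64_t max_lag) {
  assert(p.monmap_size > 0);
  if (p.quorum_exists) {
    // Close enough: join by calling an election.
    if (p.quorum_version <= p.my_version + max_lag)
      return BootstrapAction::JoinQuorum;
    // Too far behind: fetch the newer state first, then bootstrap again.
    return BootstrapAction::SlurpThenRestart;
  }
  // No quorum: call an election once enough nodes are online.
  if (p.outside_quorum >= p.monmap_size / 2 + 1)
    return BootstrapAction::CallElection;
  return BootstrapAction::KeepProbing;
}
```

The KeepProbing branch is where the timeouts mentioned above would eventually apply.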
|
|
|
|
|
|
|
|
| |
These are similar to MMonCommand[Ack], but aren't PaxosServiceMessage
children, don't include the command in the reply (useless), and have a
more generic name.
Signed-off-by: Sage Weil <sage.weil@dreamhost.com>
|
|
|
|
|
|
| |
If we can't reply, throw out the request; they'll need to resend it anyway.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
|
|
| |
Signed-off-by: Colin McCabe <colin.mccabe@dreamhost.com>
|