commit 408cd61584c72c0d97b774b3d8f95c6b1b06341a
Author: Gary Lowell <gary.lowell@inktank.com>
Date: Mon Sep 9 12:50:11 2013 -0700
v0.67.3
commit 17a7342b3b935c06610c58ab92a9a1d086923d32
Merge: b4252bf 10433bb
Author: Sage Weil <sage@inktank.com>
Date: Sat Sep 7 13:34:45 2013 -0700
Merge pull request #574 from dalgaaf/fix/da-dumpling-cherry-picks
init-radosgw*: fix status return value if radosgw isn't running
Reviewed-by: Sage Weil <sage@inktank.com>
commit 10433bbe72dbf8eae8fae836e557a043610eb54e
Author: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Date: Sat Sep 7 11:30:15 2013 +0200
init-radosgw*: fix status return value if radosgw isn't running
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
(cherry picked from commit b5137baf651eaaa9f67e3864509e437f9d5c3d5a)
commit b4252bff79150a95e9d075dd0b5e146ba9bf2ee5
Author: Samuel Just <sam.just@inktank.com>
Date: Thu Aug 22 11:19:37 2013 -0700
FileStore: add config option to disable the wbthrottle
Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 3528100a53724e7ae20766344e467bf762a34163)
commit 699324e0910e5e07a1ac68df8cf1108e5671ec15
Author: Samuel Just <sam.just@inktank.com>
Date: Thu Aug 22 11:19:52 2013 -0700
WBThrottle: use fdatasync instead of fsync
Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d571825080f0bff1ed3666e95e19b78a738ecfe8)
commit 074717b4b49ae1a55bc867e5c34d43c51edc84a5
Author: Samuel Just <sam.just@inktank.com>
Date: Thu Aug 29 15:08:58 2013 -0700
PGLog: initialize writeout_from in PGLog constructor
Fixes: 6151
Backport: dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
Introduced: f808c205c503f7d32518c91619f249466f84c4cf
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 42d65b0a7057696f4b8094f7c686d467c075a64d)
commit c22d980cf42e580818dc9f526327518c0ddf8ff5
Author: Samuel Just <sam.just@inktank.com>
Date: Tue Aug 27 08:49:14 2013 -0700
PGLog: maintain writeout_from and trimmed
This way, we can avoid omap_rmkeyrange in the common append
and trim cases.
Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit f808c205c503f7d32518c91619f249466f84c4cf)
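The commit message does not spell out the mechanism, so what follows is only a toy Python model of the idea. It assumes that `writeout_from` marks the lowest log version not yet persisted, and that `trimmed` collects the versions dropped since the last write. The class `IncrementalLogWriter` and the dict standing in for the omap are hypothetical and are not Ceph code. In this model only the trimmed keys are deleted and only entries at or above `writeout_from` are rewritten, so plain appends and trims never need a range removal.

```python
class IncrementalLogWriter:
    """Hypothetical toy model: persist only what changed, not the whole log."""

    def __init__(self):
        self.entries = {}          # in-memory log: version -> entry
        self.store = {}            # stand-in for the on-disk omap
        self.writeout_from = None  # lowest version not yet persisted (set up front)
        self.trimmed = set()       # versions trimmed since the last write

    def append(self, version, entry):
        self.entries[version] = entry
        if self.writeout_from is None or version < self.writeout_from:
            self.writeout_from = version

    def trim(self, to_version):
        doomed = [v for v in self.entries if v <= to_version]
        for v in doomed:
            del self.entries[v]
            self.trimmed.add(v)

    def write(self):
        """Apply the minimal set of store operations; return how many were issued."""
        ops = 0
        for v in self.trimmed:
            self.store.pop(v, None)  # remove exactly the trimmed keys, no range delete
            ops += 1
        if self.writeout_from is not None:
            for v, e in self.entries.items():
                if v >= self.writeout_from:
                    self.store[v] = e
                    ops += 1
        self.trimmed.clear()
        self.writeout_from = None
        return ops
```

Initializing `writeout_from` in the constructor mirrors the follow-up fix above, which initializes it in the PGLog constructor.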
commit 53c7ab4db00ec7034f5aa555231f9ee167f43201
Author: Samuel Just <sam.just@inktank.com>
Date: Tue Aug 27 07:27:26 2013 -0700
PGLog: don't maintain log_keys_debug if the config is disabled
Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit 1c0d75db1075a58d893d30494a5d7280cb308899)
commit 40dc489351383c2e35b91c3d4e76b633309716df
Author: Samuel Just <sam.just@inktank.com>
Date: Mon Aug 26 23:19:45 2013 -0700
PGLog: move the log size check after the early return
There really are stl implementations (like the one on my ubuntu 12.04
machine) which have a list::size() which is linear in the size of the
list. That assert, therefore, is quite expensive!
Fixes: #6040
Backport: Dumpling
Signed-off-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit fe68b15a3d82349f8941f5b9f70fcbb5d4bc7f97)
commit 4261eb5ec105b9c27605360910602dc367fd79f5
Author: Sage Weil <sage@inktank.com>
Date: Tue Aug 13 17:16:08 2013 -0700
rbd.cc: relicense as LGPL2
All past authors for rbd.cc have consented to relicensing from GPL to
LGPL2 via email:
---
Date: Sat, 27 Jul 2013 01:59:36 +0200
From: Sylvain Munaut <s.munaut@whatever-company.com>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
I hereby consent to the relicensing of any contribution I made to the
aforementioned rbd.cc file from GPL to LGPL2.1.
(I hope that'll be impressive enough, I did my best :p)
btw, tnt@246tNt.com and s.munaut@whatever-company.com are both me.
Cheers,
Sylvain
---
Date: Fri, 26 Jul 2013 17:00:48 -0700
From: Yehuda Sadeh <yehuda@inktank.com>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
I consent.
---
Date: Fri, 26 Jul 2013 17:02:24 -0700
From: Josh Durgin <josh.durgin@inktank.com>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
I consent.
---
Date: Fri, 26 Jul 2013 18:17:46 -0700
From: Stanislav Sedov <stas@freebsd.org>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
I consent.
Thanks for taking care of it!
---
Date: Fri, 26 Jul 2013 18:24:15 -0700
From: Colin McCabe <cmccabe@alumni.cmu.edu>
I consent.
cheers,
Colin
---
Date: Sat, 27 Jul 2013 07:08:12 +0200
From: Christian Brunner <christian@brunner-muc.de>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
I consent
Christian
---
Date: Sat, 27 Jul 2013 12:17:34 +0300
From: Stratos Psomadakis <psomas@grnet.gr>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
Hi,
I consent with the GPL -> LGL2.1 re-licensing.
Thanks
Stratos
---
Date: Sat, 27 Jul 2013 16:13:13 +0200
From: Wido den Hollander <wido@42on.com>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
I consent!
You have my permission to re-license the code I wrote for rbd.cc to LGPL2.1
---
Date: Sun, 11 Aug 2013 10:40:32 +0200
From: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
Subject: Re: btw
Hi Sage,
I agree to switch the license of ceph_argparse.py and rbd.cc from GPL2
to LGPL2.
Regards
Danny Al-Gaaf
---
Date: Tue, 13 Aug 2013 17:15:24 -0700
From: Dan Mick <dan.mick@inktank.com>
Subject: Re: Ceph rbd.cc GPL -> LGPL2 license change
I consent to relicense any contributed code that I wrote under LGPL2.1 license.
---
...and I consent too. Drop the exception from COPYING and debian/copyright
files.
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2206f55761c675b31078dea4e7dd66f2666d7d03)
commit 211c5f13131e28b095a1f3b72426128f1db22218
Author: Yehuda Sadeh <yehuda@inktank.com>
Date: Fri Aug 23 15:39:20 2013 -0700
rgw: flush pending data when completing multipart part upload
Fixes: #6111
Backport: dumpling
When completing the part upload we need to flush any data that we
aggregated and didn't flush yet. The earlier code didn't have to deal
with it, as for multipart upload we didn't have any pending data.
What we do now is we call the regular atomic data completion
function that takes care of it.
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 9a551296e0811f2b65972377b25bb28dbb42f575)
commit 1a9651010aab51c9be2edeccd80e9bd11f5177ce
Author: Yehuda Sadeh <yehuda@inktank.com>
Date: Mon Aug 26 19:46:43 2013 -0700
rgw: check object name after rebuilding it in S3 POST
Fixes: #6088
Backport: bobtail, cuttlefish, dumpling
When posting an object it is possible to provide a key
name that refers to the original filename, however we
need to verify that in the end we don't end up with an
empty object name.
Reviewed-by: Josh Durgin <josh.durgin@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit c8ec532fadc0df36e4b265fe20a2ff3e35319744)
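In S3 browser-based POST uploads, the key field may contain the `${filename}` variable, which is replaced by the name of the uploaded file. The check described above can be sketched in Python. The function `rebuild_post_key` is hypothetical and does not come from radosgw. It only illustrates validating the name after substitution, rather than before it.

```python
def rebuild_post_key(key_template: str, filename: str) -> str:
    """Hypothetical helper: substitute ${filename}, then validate the final key."""
    key = key_template.replace("${filename}", filename)
    # The template may be non-empty while the rebuilt name is empty
    # (e.g. key "${filename}" with an empty filename), so check afterwards.
    if not key:
        raise ValueError("empty object name after rebuilding POST key")
    return key
```

For example, `rebuild_post_key("uploads/${filename}", "a.txt")` yields `"uploads/a.txt"`, while `rebuild_post_key("${filename}", "")` is rejected.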
commit 1bd74a020b93f154b2d4129d512f6334387de7c7
Author: Sage Weil <sage@inktank.com>
Date: Thu Aug 22 17:46:45 2013 -0700
mon/MonClient: release pending outgoing messages on shutdown
This fixes a small memory leak when we have messages queued for the mon
when we shut down. It is harmless except for the valgrind leak check
noise that obscures real leaks.
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 309569a6d0b7df263654b7f3f15b910a72f2918d)
commit 24f2669783e2eb9d9af5ecbe106efed93366ba63
Author: Yehuda Sadeh <yehuda@inktank.com>
Date: Thu Aug 29 13:06:33 2013 -0700
rgw: change watch init ordering, don't distribute if can't
Backport: dumpling
Moving back the watch initialization after the zone init,
as the zone info holds the control pool name. Since zone
init might need to create a new system object (that needs
to distribute cache), don't try to distribute cache if
watch is not yet initialized.
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 1d1f7f18dfbdc46fdb09a96ef973475cd29feef5)
commit a708c8ab52e5b1476405a1f817c23b8845fbaab3
Author: Sage Weil <sage@inktank.com>
Date: Fri Aug 30 09:41:29 2013 -0700
ceph-post-file: use mktemp instead of tempfile
tempfile is a debian thing, apparently; mktemp is present everywhere.
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit e60d4e09e9f11e3c34a05cd122341e06c7c889bb)
commit 625f13ee0d6cca48d61dfd65e00517d092552d1c
Author: Sage Weil <sage@inktank.com>
Date: Wed Aug 28 09:50:11 2013 -0700
mon: discover mon addrs, names during election state too
Currently we only detect new mon addrs and names during the probing phase.
For non-trivial clusters, this means we can get into a sticky spot when
we discover enough peers to form a quorum, but not all of them, and the
undiscovered ones are enough to break the mon ranks and prevent an
election.
One way to work around this is to continue addr and name discovery during
the election. We should also consider making the ranks less sensitive to
the undefined addrs; that is a separate change.
Fixes: #4924
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Tested-by: Bernhard Glomm <bernhard.glomm@ecologic.eu>
(cherry picked from commit c24028570015cacf1d9e154ffad80bec06a61e7c)
commit 83cfd4386c1fd0fa41aea345704e27f82b524ece
Author: Dan Mick <dan.mick@inktank.com>
Date: Thu Aug 22 17:30:24 2013 -0700
ceph_rest_api.py: create own default for log_file
common/config thinks the default log_file for non-daemons should be "".
Override that so that the default is
/var/log/ceph/{cluster}-{name}.{pid}.log
since ceph-rest-api is more of a daemon than a client.
Fixes: #6099
Backport: dumpling
Signed-off-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit 2031f391c3df68e0d9e381a1ef3fe58d8939f0a8)
commit 8a1da62d9564a32f7b8963fe298e1ac3ad0ea3d9
Author: Sage Weil <sage@inktank.com>
Date: Fri Aug 16 17:59:11 2013 -0700
ceph-post-file: single command to upload a file to cephdrop
Use sftp to upload to a directory that only this user and ceph devs can
access.
Distribute an ssh key to connect to the account. This will let us revoke
the key in the future if we feel the need. Also distribute a known_hosts
file so that users have some confidence that they are connecting to the
real ceph drop account and not some third party.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit d08e05e463f1f7106a1f719d81b849435790a3b9)
commit 3f8663477b585dcb528fdd7047c50d9a52d24b95
Author: Gary Lowell <glowell@inktank.com>
Date: Thu Aug 22 13:29:32 2013 -0700
ceph.spec.in: remove trailing paren in previous commit
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
commit 23fb908cb3ac969c874ac12755d20ed2f636e1b9
Author: Gary Lowell <glowell@inktank.com>
Date: Thu Aug 22 11:07:16 2013 -0700
ceph.spec.in: Don't invoke debug_package macro on centos.
If the redhat-rpm-config package is installed, the debuginfo rpms will
be built by default. The build will fail when the package is installed
and the specfile also invokes the macro.
Signed-off-by: Gary Lowell <gary.lowell@inktank.com>
commit 11f5853d8178ab60ab948d373c1a1f67324ce3bd
Author: Sage Weil <sage@inktank.com>
Date: Sat Aug 24 14:04:09 2013 -0700
osd: install admin socket commands after signals
This lets us tell by the presence of the admin socket commands whether
a signal will make us shut down cleanly. See #5924.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Samuel Just <sam.just@inktank.com>
(cherry picked from commit c5b5ce120a8ce9116be52874dbbcc39adec48b5c)
commit 39adc0195e6016ce36828885515be1bffbc10ae1
Author: Sage Weil <sage@inktank.com>
Date: Tue Aug 20 22:39:09 2013 -0700
ceph-disk: partprobe after creating journal partition
At least one user reports that a partprobe is needed after creating the
journal partition. It is not clear why sgdisk is not doing it, but this
fixes ceph-disk for them, and should be harmless for other users.
Fixes: #5599
Tested-by: lurbs in #ceph
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2af59d5e81c5e3e3d7cfc50d9330d7364659c5eb)
(cherry picked from commit 3e42df221315679605d68b2875aab6c7eb6b3cc4)
commit 6a4fe7b9b068ae990d6404921a46631fe9ebcd31
Author: Sage Weil <sage@inktank.com>
Date: Tue Aug 20 11:27:23 2013 -0700
mon/Paxos: always refresh after any store_state
If we store any new state, we need to refresh the services, even if we
are still in the midst of Paxos recovery. This is because the
subscription path will share any committed state even when paxos is
still recovering. This prevents a race like:
- we have maps 10..20
- we drop out of quorum
- we are elected leader, paxos recovery starts
- we get one LAST with committed states that trim maps 10..15
- we get a subscribe for map 10..20
- we crash because 10 is no longer on disk because the PaxosService
is out of sync with the on-disk state.
Fixes: #6045
Backport: dumpling
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 981eda9f7787c83dc457f061452685f499e7dd27)
commit 13d396e46ed9200e4b9f21db2f0a8efbc5998d82
Author: Sage Weil <sage@inktank.com>
Date: Tue Aug 20 11:27:09 2013 -0700
mon/Paxos: return whether store_state stored anything
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit 7e0848d8f88f156a05eef47a9f730b772b64fbf2)
commit f248383bacff76203fa94716cfdf6cf766da24a7
Author: Sage Weil <sage@inktank.com>
Date: Tue Aug 20 11:26:57 2013 -0700
mon/Paxos: cleanup: use do_refresh from handle_commit
This avoids duplicated code by using the helper created exactly for this
purpose.
Signed-off-by: Sage Weil <sage@inktank.com>
Reviewed-by: Joao Eduardo Luis <joao.luis@inktank.com>
(cherry picked from commit b9dee2285d9fe8533fa98c940d5af7b0b81f3d33)
commit 02608a12d4e7592784148a62a47d568efc24079d
Author: Sage Weil <sage@inktank.com>
Date: Thu Aug 15 21:48:06 2013 -0700
osdc/ObjectCacher: do not merge rx buffers
We do not try to merge rx buffers currently. Make that explicit and
documented in the code that it is not supported. (Otherwise the
last_read_tid values will get lost and read results won't get applied
to the cache properly.)
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 1c50c446152ab0e571ae5508edb4ad7c7614c310)
commit 0e2bfe71965eeef29b47e8032637ea820a7ce49c
Author: Sage Weil <sage@inktank.com>
Date: Thu Aug 15 21:47:18 2013 -0700
osdc/ObjectCacher: match reads with their original rx buffers
Consider a sequence like:
  1- start read on 100~200
       100~200 state rx
  2- truncate to 200
       100~100 state rx
  3- start read on 200~200
       100~100 state rx
       200~200 state rx
  4- get 100~200 read result
Currently this makes us crash on
  osdc/ObjectCacher.cc: 738: FAILED assert(bh->length() <= start+(loff_t)length-opos)
when processing the second 200~200 bufferhead (it is too big). The
larger issue, though, is that we should not be looking at this data at
all; it has been truncated away.
Fix this by marking each rx buffer with the read request that is sent to
fill it, and only fill it from that read request. Then the first reply
will fill the first 100~100 extent but not touch the other extent; the
second read will do that.
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit b59f930ae147767eb4c9ff18c3821f6936a83227)
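The fix above can be illustrated with a toy Python model that replays the same four-step sequence. `BufferHead`, `ObjectModel`, and their methods are hypothetical simplifications, not the ObjectCacher API. The model assumes each read covers an uncached range, and it omits the data payload. The key point is that a read reply fills only the rx buffers tagged with that read's tid.

```python
from dataclasses import dataclass


@dataclass
class BufferHead:
    start: int
    length: int
    state: str          # "rx" (read in flight) or "clean"
    last_read_tid: int  # the read request expected to fill this buffer


class ObjectModel:
    """Hypothetical toy model of per-buffer read-tid matching."""

    def __init__(self):
        self.bhs = []  # list of BufferHead
        self.next_tid = 1

    def start_read(self, start, length):
        tid = self.next_tid
        self.next_tid += 1
        # Simplification: assume the range is not already cached.
        self.bhs.append(BufferHead(start, length, "rx", tid))
        return tid

    def truncate(self, size):
        kept = []
        for bh in self.bhs:
            if bh.start >= size:
                continue  # entirely beyond the new size: drop it
            bh.length = min(bh.length, size - bh.start)
            kept.append(bh)
        self.bhs = kept

    def read_reply(self, tid):
        """Fill only buffers marked by this read; return the filled extents."""
        filled = []
        for bh in self.bhs:
            if bh.state == "rx" and bh.last_read_tid == tid:
                bh.state = "clean"
                filled.append((bh.start, bh.length))
        return filled
```

Replaying the sequence (read 100~200, truncate to 200, read 200~200, then the first reply), the first reply fills only the 100~100 extent and leaves 200~200 in rx for the second read.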
commit 6b51c960715971a0351e8203d4896cb0c4138a3f
Author: Sage Weil <sage@inktank.com>
Date: Thu Aug 22 15:54:48 2013 -0700
mon/Paxos: fix another uncommitted value corner case
It is possible that we begin the paxos recovery with an uncommitted
value for, say, commit 100. During last/collect we discover 100 has been
committed already. But also, another node provides an uncommitted value
for 101 with the same pn. Currently, we refuse to learn it, because the
pn is not strictly > than our current uncommitted pn... even though it is
the next last_committed+1 value that we need.
There are two possible fixes here:
- make this a >= as we can accept newer values from the same pn.
- discard our uncommitted value metadata when we commit the value.
Let's do both!
Fixes: #6090
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit fe5010380a3a18ca85f39403e8032de1dddbe905)
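Both fixes can be sketched as a hypothetical Python model of a recovering monitor's uncommitted-value bookkeeping. `PaxosRecoveryState` and its fields are illustrative names, not the real Paxos class, and a real implementation tracks more state than this. Resetting the pn to 0 on commit is a simplification of "discard our uncommitted value metadata."

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PaxosRecoveryState:
    """Hypothetical model of the uncommitted-value bookkeeping during recovery."""
    last_committed: int
    uncommitted_v: Optional[int] = None
    uncommitted_pn: int = 0

    def commit(self, v: int) -> None:
        self.last_committed = v
        # Fix 2: once the value we held is committed, drop its metadata.
        if self.uncommitted_v is not None and self.uncommitted_v <= v:
            self.uncommitted_v = None
            self.uncommitted_pn = 0

    def offer_uncommitted(self, v: int, pn: int) -> bool:
        """Learn a peer's uncommitted value if it is the next one we need."""
        if v != self.last_committed + 1:
            return False
        # Fix 1: '>=' instead of '>', so a newer value under the same pn is learned.
        if pn >= self.uncommitted_pn:
            self.uncommitted_v, self.uncommitted_pn = v, pn
            return True
        return False
```

Either fix alone lets the node learn value 101 proposed under the same pn as its stale value 100.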
commit b3a280d5af9d06783d2698bd434940de94ab0fda
Author: Sage Weil <sage@inktank.com>
Date: Fri Aug 23 11:45:35 2013 -0700
os: make readdir_r buffers larger
PATH_MAX isn't quite big enough.
Backport: dumpling, cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 99a2ff7da99f8cf70976f05d4fe7aa28dd7afae5)
commit 989a664ef0d1c716cab967f249112f595cf98c43
Author: Sage Weil <sage@inktank.com>
Date: Fri Aug 23 11:45:08 2013 -0700
os: fix readdir_r buffer size
The buffer needs to be big or else we'll walk all over the stack.
Backport: dumpling, cuttlefish, bobtail
Signed-off-by: Sage Weil <sage@inktank.com>
(cherry picked from commit 2df66d9fa214e90eb5141df4d5755b57e8ba9413)
Conflicts:
src/os/BtrfsFileStoreBackend.cc
commit a4cca31c82bf0e84272e01eb1b3188dfdb5b5615
Author: Yehuda Sadeh <yehuda@inktank.com>
Date: Thu Aug 22 10:53:12 2013 -0700
rgw: fix crash when creating new zone on init
Moving the watch/notify init before the zone init,
as we might need to send a notification.
Reviewed-by: Sage Weil <sage@inktank.com>
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
(cherry picked from commit 3d55534268de7124d29bd365ea65da8d2f63e501)
commit 4cf6996803ef66f2b6083f73593259d45e2740a3
Author: Yehuda Sadeh <yehuda@inktank.com>
Date: Mon Aug 19 08:40:16 2013 -0700
rgw: change cache / watch-notify init sequence
Fixes: #6046
We were initializing the watch-notify (through the cache
init) before reading the zone info which was much too
early, as we didn't have the control pool name yet. Now
simplifying init/cleanup a bit, cache doesn't call watch/notify
init and cleanup directly, but rather states its need
through a virtual callback.
Signed-off-by: Yehuda Sadeh <yehuda@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit d26ba3ab0374e77847c742dd00cb3bc9301214c2)
commit aea6de532b0b843c3a8bb76d10bab8476f0d7c09
Author: Alexandre Oliva <oliva@gnu.org>
Date: Thu Aug 22 03:40:22 2013 -0300
enable mds rejoin with active inodes' old parent xattrs
When the parent xattrs of active inodes that the mds attempts to open
during rejoin lack pool info (struct_v < 5), this field will be filled
in with -1, causing the mds to retry fetching a backtrace with a pool
number that matches the expected value, which fails and causes the
err==-ENOENT branch to be taken and retry pool 1, which succeeds, but
with pool -1, and so keeps on bouncing between the two retry cases
forever.
This patch arranges for the mds to go along with pool -1 instead of
insisting that it be refetched, enabling it to complete recovery
instead of eating cpu, network bandwidth and metadata osd's resources
like there's no tomorrow, in what AFAICT is an infinite and very busy
loop.
This is not a new problem: I've had it even before upgrading from
Cuttlefish to Dumpling, I'd just never managed to track it down, and
force-unmounting the filesystem and then restarting the mds was an
easier (if inconvenient) work-around, particularly because it always
hit when the filesystem was under active, heavy-ish use (or there
wouldn't be much reason for caps recovery ;-)
There are two issues not addressed in this patch, however. One is
that nothing seems to proactively update the parent xattr when it is
found to be outdated, so it remains out of date forever. Not even
renaming top-level directories causes the xattrs to be recursively
rewritten. AFAICT that's a bug.
The other is that inodes that don't have a parent xattr (created by
even older versions of ceph) are reported as non-existing in the mds
rejoin message, because the absence of the parent xattr is signaled as
a missing inode ("failed to reconnect caps for missing inodes"). I
suppose this may cause more serious recovery problems.
I suppose a global pass over the filesystem tree updating parent
xattrs that are out-of-date would be desirable, if we find any parent
xattrs still lacking current information; it might make sense to
activate it as a background thread from the backtrace decoding
function, when it finds a parent xattr that's too out-of-date, or as a
separate client (ceph-fsck?).
Backport: dumpling, cuttlefish
Signed-off-by: Alexandre Oliva <oliva@gnu.org>
Reviewed-by: Zheng, Yan <zheng.z.yan@intel.com>
(cherry picked from commit 617dc36d477fd83b2d45034fe6311413aa1866df)
commit 0738bdf92f5e5eb93add152a4135310ac7ea1c91
Author: David Disseldorp <ddiss@suse.de>
Date: Mon Jul 29 17:05:44 2013 +0200
mds: remove waiting lock before merging with neighbours
CephFS currently deadlocks under CTDB's ping_pong POSIX locking test
when run concurrently on multiple nodes.
The deadlock is caused by failed removal of a waiting_locks entry when
the waiting lock is merged with an existing lock, e.g:
Initial MDS state (two clients, same file):
held_locks -- start: 0, length: 1, client: 4116, pid: 7899, type: 2
              start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks -- start: 1, length: 1, client: 4116, pid: 7899, type: 2
Waiting lock entry 4116@1:1 fires:
handle_client_file_setlock: start: 1, length: 1,
                            client: 4116, pid: 7899, type: 2
MDS state after lock is obtained:
held_locks -- start: 0, length: 2, client: 4116, pid: 7899, type: 2
              start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks -- start: 1, length: 1, client: 4116, pid: 7899, type: 2
Note that the waiting 4116@1:1 lock entry is merged with the existing
4116@0:1 held lock to become a 4116@0:2 held lock. However, the now
handled 4116@1:1 waiting_locks entry remains.
When handling a lock request, the MDS calls adjust_locks() to merge
the new lock with available neighbours. If the new lock is merged,
then the waiting_locks entry is not located in the subsequent
remove_waiting() call because adjust_locks changed the new lock to
include the old locks.
This fix ensures that the waiting_locks entry is removed prior to
modification during merge.
Signed-off-by: David Disseldorp <ddiss@suse.de>
Reviewed-by: Greg Farnum <greg@inktank.com>
(cherry picked from commit 476e4902907dfadb3709ba820453299ececf990b)
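The ordering issue can be illustrated with a toy Python model. `Lock` and `grant_waiting_lock` are hypothetical, not the MDS API. The model also simplifies matching to client only, whereas the real lock state also carries pid and type. It removes the waiting entry while the new lock still has its original start and length, and only then coalesces the lock with adjacent or overlapping locks held by the same client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lock:
    start: int
    length: int
    client: int


def grant_waiting_lock(new, held, waiting):
    """Hypothetical sketch: grant a waiting lock and merge it with neighbours."""
    # Remove the waiting entry first: after merging, the lock no longer has
    # the same start/length, so a later lookup would fail to find it.
    waiting.remove(new)
    merged = new
    for lock in list(held):
        if lock.client != merged.client:
            continue
        lock_end = lock.start + lock.length
        merged_end = merged.start + merged.length
        # Coalesce adjacent or overlapping ranges held by the same client.
        if lock.start <= merged_end and merged.start <= lock_end:
            start = min(lock.start, merged.start)
            merged = Lock(start, max(lock_end, merged_end) - start, merged.client)
            held.remove(lock)
    held.append(merged)
    return merged
```

With the state from the example above (4116 holds 0~1, 4110 holds 2~1, and 4116 waits on 1~1), this yields a single 4116@0:2 held lock and an empty waiting list. Removing the waiting entry only after the merge, as the old code effectively did, would search for 4116@0:2 and miss the stale 4116@1:1 entry.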
commit a0ac88272511d670b5c3756dda2d02c93c2e9776
Author: Dan Mick <dan.mick@inktank.com>
Date: Tue Aug 20 11:10:42 2013 -0700
mon/PGMap: OSD byte counts 4x too large (conversion to bytes overzealous)
Fixes: #6049
Signed-off-by: Dan Mick <dan.mick@inktank.com>
(cherry picked from commit eca53bbf583027397f0d5e050a76498585ecb059)
commit 87b19c33ce29e2ca4fc49a2adeb12d3f14ca90a9
Author: Alfredo Deza <alfredo.deza@inktank.com>
Date: Fri Aug 23 08:56:07 2013 -0400
ceph-disk: specify the filetype when mounting
Signed-off-by: Alfredo Deza <alfredo.deza@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
(cherry picked from commit f040020fb2a7801ebbed23439159755ff8a3edbd)