v0.17
- kclient: fix multiple mds mdsmap decoding
- kclient: fix mon subscription renewal
- crush: fix crush map creation with empty buckets (occurs on larger clusters)
- osdmap: fix encoding bug (crashes kclient); make kclient not crash
- msgr: simplified policy, failure model
- mon: less push, more pull
- mon: request routing
- mon cluster expansion
- osd: fix pg parsing, restarts on larger clusters

v0.18
- osd: basic ENOSPC handling
- big endian fixes (required protocol/disk format change)
- osd: improved object -> pg hash function; selectable
- crush: selectable hash function(s)
- mds restart bug fixes
- kclient: mds reconnect bug fixes
- fixed mds log trimming bug
- fixed mds cap vs snap deadlock
- filestore: faster flushing
- uclient,kclient: snapshot fixes
- mds: fix recursive accounting bug
- uclient: fixes for 32bit clients
- auth: 'none' security framework
- mon: "safely" bail on write errors (e.g. ENOSPC)
- mds: fix replay/reconnect race (caused (fast) client reconnect to fail)
- mds: misc journal replay, session fixes

v0.19
- ms_dispatch fairness
- kclient: bad fsid deadlock fix
- tids in fixed msg header (protocol change)
- feature bits during connection handshake
- remove erank from ceph_entity_addr
- disk format, compat/incompat bits
- journal format improvements
- kclient: cephx
- improved truncation
- cephx: lots of fixes
- mkcephfs: cephx support
- debian: packaging fixes

v0.20

- new filestore, journaling
- multiple mds fixes

- qa: snap test.  maybe walk through 2.6.* kernel trees?
- osd: rebuild pg log
- osd: handle storage errors
- rebuild mds hierarchy
- kclient: retry alloc on ENOMEM when reading from connection?

filestore
- throttling
- flush objects onto primary during recovery
- audit queue_transaction calls for dependencies
- convert apply_transaction calls in handle_map to queue?
  - need an osdmap cache layer?

bugs
- mds prepare_force_open_sessions, then import aborts.. session is still OPENING but no client_session is sent...
- multimds: untar kernel, control-z, sync
- wget mismatch with multiple mds?
- rm -r failure (on kernel tree)
- dbench 1, restart mds (may take a few times), dbench will error out.

- cfuse crash on 'cat >> mnt/foo'
client/Client.cc: In function 'Inode* Client::_ll_get_inode(vinodeno_t)':
client/Client.cc:5047: FAILED assert(inode_map.count(vino))
 1: (Client::_ll_get_inode(vinodeno_t)+0x46) [0x60a268]
 2: (Client::ll_getattr(vinodeno_t, stat*, int, int)+0x15c) [0x62ab66]
 3: ./cfuse [0x6050f7]
 4: /usr/lib/libfuse.so.2 [0x7f532b02e0a2]
 5: (fuse_session_loop()+0x7a) [0x7f532b02aeba]
 6: (ceph_fuse_ll_main(Client*, int, char const**)+0x24a) [0x60373c]
 7: (main()+0x279) [0x5e0a14]
 8: (__libc_start_main()+0xfd) [0x7f5329f81abd]
 9: (std::ios_base::Init::~Init()+0x49) [0x5e0659]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

?- bonnie++ -u root -d /mnt/ceph/ -s 0 -n 1
  (reported by Isteriat)
  Using uid:0, gid:0.
  Create files in sequential order...done.
  Stat files in sequential order...Expected 1024 files but only got 0
  Cleaning up test directory after error.

- osd pg split breaks if not all osds are up...

- mds recovery flag set on inode that didn't get recovered??
- mislinked directory?  (cpusr.sh, mv /c/* /c/t, more cpusr, ls /c/t)

?- kclient: after reconnect,
cp: writing `/c/ceph2.2/bin/gs-gpl': Bad file descriptor
  - need to somehow wake up unreconnected caps?   hrm!!

?- kclient: socket creation

- snaprealm thing
ceph3:~# find /c
/c
/c/.ceph
/c/.ceph/mds0
/c/.ceph/mds0/journal
/c/.ceph/mds0/stray
[68663.397407] ceph: ceph_add_cap: couldn't find snap realm 10000491bb5
...
ceph3:/c# [68724.067160] BUG: unable to handle kernel NULL pointer dereference at 0000000000000088
[68724.071069] IP: [<ffffffffa00805c3>] __send_cap+0x237/0x585 [ceph]
[68724.078917] PGD f7a12067 PUD f688c067 PMD 0 
[68724.082907] Oops: 0000 [#1] PREEMPT SMP 
[68724.082907] last sysfs file: /sys/class/net/lo/operstate
[68724.082907] CPU 1 
[68724.082907] Modules linked in: ceph fan ac battery psmouse ehci_hcd ohci_hcd ide_pci_generic thermal processor button
[68724.082907] Pid: 10, comm: events/1 Not tainted 2.6.32-rc2 #1 H8SSL
[68724.082907] RIP: 0010:[<ffffffffa00805c3>]  [<ffffffffa00805c3>] __send_cap+0x237/0x585 [ceph]
[68724.114907] RSP: 0018:ffff8800f96e3a50  EFLAGS: 00010202
[68724.114907] RAX: 0000000000000000 RBX: 0000000000000354 RCX: 0000000000000000
[68724.114907] RDX: 0000000000000000 RSI: ffff8800f76e8ba8 RDI: ffff8800f581a508
[68724.114907] RBP: ffff8800f96e3bb0 R08: 0000000000000000 R09: 0000000000000001
[68724.114907] R10: ffff8800cea922b8 R11: ffffffffa0082982 R12: 0000000000000001
[68724.114907] R13: 0000000000000000 R14: ffff8800cea95378 R15: 0000000000000000
[68724.114907] FS:  00007f54be9a06e0(0000) GS:ffff880009200000(0000) knlGS:0000000000000000
[68724.114907] CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
[68724.114907] CR2: 0000000000000088 CR3: 00000000f7118000 CR4: 00000000000006e0
[68724.178904] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[68724.178904] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[68724.178904] Process events/1 (pid: 10, threadinfo ffff8800f96e2000, task ffff8800f96e02c0)
[68724.178904] Stack:
[68724.178904]  ffff8800f96e0980 ffff8800f96e02c0 ffff8800f96e3a80 ffffffff8106a3b9
[68724.178904] <0> ffff8800f96e3a80 0000000000000003 00006589ac4ca260 0000000000000004
[68724.178904] <0> 0cb13589944c0262 0000000000000000 ffff8800f96e3b30 ffffffff81ca7c80
[68724.178904] Call Trace:
[68724.178904]  [<ffffffff8106a3b9>] ? get_lock_stats+0x19/0x4c
[68724.178904]  [<ffffffff8106d8c8>] ? mark_held_locks+0x4d/0x6b
[68724.178904]  [<ffffffffa0082a25>] ceph_check_caps+0x740/0xa70 [ceph]
[68724.178904]  [<ffffffff8106a3b9>] ? get_lock_stats+0x19/0x4c
[68724.178904]  [<ffffffff8106a964>] ? put_lock_stats+0xe/0x27
[68724.178904]  [<ffffffffa00840b6>] ceph_check_delayed_caps+0xcb/0x14a [ceph]
[68724.178904]  [<ffffffffa009011f>] delayed_work+0x3f/0x368 [ceph]
[68724.178904]  [<ffffffff8105b194>] ? worker_thread+0x229/0x398
[68724.178904]  [<ffffffff8105b1ee>] worker_thread+0x283/0x398
[68724.178904]  [<ffffffff8105b194>] ? worker_thread+0x229/0x398
[68724.178904]  [<ffffffffa00900e0>] ? delayed_work+0x0/0x368 [ceph]
[68724.178904]  [<ffffffff8146a56e>] ? preempt_schedule+0x3e/0x4b
[68724.178904]  [<ffffffff8105f4d0>] ? autoremove_...


filestore performance notes
- write ordering options
  - fs only (no journal)
  - fs, journal
  - fs + journal in parallel
  - journal sync, then fs
- and the issues
  - latency
  - effect of a btrfs hang
  - unexpected error handling (EIO, ENOSPC)
  - impact on ack, sync ordering semantics.
  - how to throttle request stream to disk io rate
  - rmw vs delayed mode

- if journal is on fs, then
  - throttling isn't an issue, but
  - fs stalls are also journal stalls

- fs only
  - latency: commits are bad.
  - hang: bad.
  - errors: could be handled, aren't
  - acks: supported
  - throttle: fs does it
  - rmw: pg toggles mode
- fs, journal
  - latency: good, unless fs hangs
  - hang: bad.  latency spikes.  overall throughput drops.
  - errors: could probably be handled, aren't.
  - acks: supported
  - throttle: btrfs does it (by hanging), which leads to a (necessary) latency spike
  - rmw: pg toggles mode
- fs | journal
  - latency: good
  - hang: no latency spike.  fs throughput may drop, to the extent btrfs throughput necessarily will.
  - errors: not detected until later.  could journal addendum record.  or die (like we do now)
  - acks: could be flexible.. maybe supported, maybe not.  will need some extra locking smarts?
  - throttle: ??
  - rmw: rmw must block on prior fs writes.
- journal, fs (writeahead)
  - latency: good (commit only, no acks)
  - hang: same as |
  - errors: same as |
  - acks: never.
  - throttle: ??
  - rmw: rmw must block on prior fs writes.
  * JournalingObjectStore interface needs work?  (see the sketch below)

- separate reads/writes into separate op queues?
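A minimal C++ sketch of the writeahead ordering above, to make the callback
chain concrete.  Journal, FileStore, and queue_op are hypothetical stand-ins,
not the actual ObjectStore/JournalingObjectStore interface; a real submit and
apply would complete asynchronously instead of inline.

#include <functional>

struct Op { /* encoded transaction */ };

struct Journal {
  // stub: a real journal completes this asynchronously, calling
  // on_commit only once the entry is durable on the journal device.
  void submit(const Op &, std::function<void()> on_commit) { on_commit(); }
};

struct FileStore {
  // stub: a real filestore applies the op to the fs, then calls back.
  void apply(const Op &, std::function<void()> on_apply) { on_apply(); }
};

// writeahead ordering: the op touches the fs only after the journal
// entry is durable, so crash replay can always redo it.  there is no
// early ack before commit ("acks: never" above), and rmw still has to
// block on prior fs writes.
void queue_op(Journal &j, FileStore &fs, Op op,
              std::function<void()> on_commit) {
  j.submit(op, [&fs, op, on_commit]() {
    fs.apply(op, on_commit);
  });
}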


greg
- csync data import/export tool?
- uclient: readdir from cache
- mds: basic auth checks

later
- document on-wire protocol
- authentication
- client reconnect after long eviction; and slow delayed reconnect
- repair
- mds security enforcement
- client, user authentication
- cas
- osd failure declarations
- rename over old files should flush data, or revert back to old contents
- clean up SimpleMessenger interface and usage a little. Can probably unify
	some/all of shutdown, wait, destroy. Possibly move destroy into put()
	and make get/put usage more consistent/stringently mandated.
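A hedged sketch of what unified get/put with destroy folded into put() could
look like; illustrative pattern only, not SimpleMessenger's real interface:

#include <atomic>

class Messenger {
  std::atomic<int> nref;
public:
  Messenger() : nref(1) {}
  Messenger *get() { nref.fetch_add(1); return this; }
  void put() {
    // the last put tears everything down; no separate destroy() call
    if (nref.fetch_sub(1) == 1) {
      shutdown();   // stub: would stop accepter/dispatcher threads
      wait();       // stub: would join threads and drain queues
      delete this;
    }
  }
  void shutdown() {}
  void wait() {}
private:
  ~Messenger() {}   // force destruction to go through put()
};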

rados
- make rest interface superset of s3?
  - create/delete snapshots
  - list, access snapped version
- perl swig wrapper
- 'rados call foo.bar'?
- merge pgs
- destroy pg_pools
- autosize pg_pools?
- security

repair
- namespace reconstruction tool
- repair pg (rebuild log)  (online or offline?  ./cosd --repair_pg 1.ef?)
- repair file ioctl?
- are we concerned about
  - scrubbing
  - reconstruction after loss of subset of cdirs
  - reconstruction after loss of md log
- data object 
  - path backpointers?
  - parent dir pointer?
- mds scrubbing

kclient
- ENOMEM
  - message pools
  - sockets?  (this can actually generate a lockdep warning :/)
- fs-portable file layout virtual xattr (see Andreas' -fsdevel thread)
- statlite
- audit/combine/rework/whatever invalidate, writeback threads and associated invariants
- add cap to release if we get fouled up in fill_inode et al?
- fix up ESTALE handling
- don't retry on ENOMEM on non-nofail requests in kick_requests
- make cap import/export more efficient?
- flock, fcntl locks
- ACLs
  - init security xattrs
- should we try to ref CAP_PIN on special inodes that are open?  
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- inotify for updates from other clients?

vfs issues
- real_lookup() race:
  1- hash lookup find no dentry
  2- real_lookup() takes dir i_mutex, but then finds a dentry
  3- drops mutex, then calls d_revalidate.  if that fails, we return ENOENT (instead of looping?)
- vfs_rename_dir()
- a getattr mask would be really nice

filestore
- make min sync interval self-tuning (ala xfs, ext3?)
- get file csum?

btrfs
- clone compressed inline extents
- ioctl to pull out data csum?

osd
- gracefully handle ENOSPC
- gracefully handle EIO?
- client session object
  - track client's osdmap; and only share latest osdmap with them once!
- what to do with lost objects.. continue peering?
- segregate backlog from log ondisk?
- preserve pg logs on disk for longer period
- make scrub interruptible
- optionally separate osd interfaces (ips) for clients and osds (replication, peering, etc.)
- pg repair
- pg split should be a work queue
- optimize remove wrt recovery pushes?

uclient
- fix client_lock vs other mutex with C_SafeCond
- clean up check_caps to more closely mirror kclient logic
- readdir from cache
- fix readdir vs fragment race by keeping a separate frag pos, and ignoring dentries below it
- hadoop: clean up assert usage

mds
- don't sync log on every clientreplay request?
- pass issued, wanted into eval(lock) when eval() already has it?  (and otherwise optimize eval paths..)
- add an up:shadow mode?
  - tail the mds log as it is written
  - periodically check head so that we trim, too
- handle slow client reconnect (i.e. after mds has gone active)
- anchor_destroy needs to xlock linklock.. which means it needs a Mutation wrapper?
  - ... when it gets a caller.. someday..
- add FILE_CAP_EXTEND capability bit
- dir fragment
  - maybe just take dftlock for now, to keep it simple.
- dir merge
- snap
  - hard link backpointers
    - anchor source dir
    - build snaprealm for any hardlinked file
    - include snaps for all (primary+remote) parents
  - how do we properly clean up inodes when doing a snap purge?
    - when they are mid-recover?  see 136470cf7ca876febf68a2b0610fa3bb77ad3532
  - what if a recovery is queued, or in progress, and the inode is then cowed?  can that happen?  
  - proper handling of cache expire messages during rejoin phase?
    -> i think cache expires are fine; the rejoin_ack handler just has to behave if rejoining items go missing

- clustered
  - on replay, put dirty scatter replicas on lists so that they get flushed?  or does rejoin handle that?
  - linkage vs cdentry replicas and remote rename....
  - rename: importing inode... also journal imported client map?

mon
- don't allow lpg_num expansion and osd addition at the same time?
- how to shrink cluster?
- how to tell osd to cleanly shut down
- mds injectargs N should take mds# or id.  * should bcast to standby mds's.
- paxos needs to clean up old states.
  - default: simple max of (state count, min age), so that we have at least N hours of history, say?
  - osd map: trim only old maps < oldest "in" osd up_from
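A sketch of that default trim rule, assuming a hypothetical version ->
commit-time map (the monitor's real state is structured differently, and
min_states >= 1 is assumed).  "max of (state count, min age)" means keeping
whichever bound retains more history:

#include <algorithm>
#include <ctime>
#include <map>

typedef std::map<unsigned long long, time_t> state_map_t;  // version -> commit time

unsigned long long first_version_to_keep(const state_map_t &states,
                                         size_t min_states, time_t min_age) {
  if (states.size() <= min_states)
    return states.empty() ? 0 : states.begin()->first;     // nothing to trim
  // bound 1: the newest min_states versions
  state_map_t::const_iterator it = states.end();
  for (size_t i = 0; i < min_states; i++) --it;
  unsigned long long by_count = it->first;
  // bound 2: everything committed within the last min_age seconds
  time_t cutoff = time(NULL) - min_age;
  unsigned long long by_age = by_count;
  for (state_map_t::const_iterator p = states.begin(); p != states.end(); ++p)
    if (p->second >= cutoff) { by_age = p->first; break; }
  // the smaller first-kept version retains more history, so taking the
  // min of the two bounds implements "max of (state count, min age)".
  return std::min(by_count, by_age);
}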

osdmon
- monitor needs to monitor some osds...

pgmon
/- include osd vector with pg state
  - check for orphan pgs
- monitor pg states, notify on out?
- watch osd utilization; adjust overload in cluster map

crush
- allow forcefeed for more complicated rule structures.  (e.g. make force_stack a list< set<int> >)

simplemessenger
- close idle connections?

objectcacher
- read locks?
- maintain more explicit inode grouping instead of wonky hashes

cas
- chunking.  see TTTD in
   ESHGHI, K.
   A framework for analyzing and improving content-based chunking algorithms.
   Tech. Rep. HPL-2005-30(R.1), Hewlett Packard Laboratories, Palo Alto, 2005. 
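A hedged C++ sketch of the TTTD (Two Thresholds, Two Divisors) scheme from
that paper; the thresholds, divisors, and rolling hash below are toy
stand-ins, not the paper's tuned choices:

#include <cstddef>
#include <stdint.h>
#include <vector>

struct chunk_t { size_t off, len; };

std::vector<chunk_t> tttd_chunk(const uint8_t *p, size_t n) {
  const size_t t_min = 2048, t_max = 8192;   // the two thresholds
  const uint32_t d = 1024, d_backup = 512;   // the two divisors (d > d_backup)
  std::vector<chunk_t> out;
  size_t start = 0, backup = 0;
  uint32_t h = 0;
  for (size_t i = 0; i < n; i++) {
    h = h * 31 + p[i];                       // toy hash over the chunk so far
    size_t len = i + 1 - start;
    if (len < t_min)
      continue;                              // never cut below the min size
    if (h % d_backup == d_backup - 1)
      backup = i + 1;                        // remember a backup breakpoint
    if (h % d == d - 1 || len >= t_max) {
      // prefer the main divisor; at t_max fall back to the backup
      // breakpoint, or cut hard if there is none.
      size_t end = (h % d == d - 1 || backup <= start) ? i + 1 : backup;
      out.push_back({start, end - start});
      start = end;
      backup = start;
      h = 0;
      i = end - 1;                           // resume scan at the new chunk
    }
  }
  if (start < n)
    out.push_back({start, n - start});       // trailing partial chunk
  return out;
}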

radosgw
 - gracefully handle location-related requests
 - logging control (?)
 - parse date/time better
 - upload using POST
 - torrent
 - gracefully handle PUT/GET requestPayment



-- for nicer kclient debug output (everything but messenger, but including msg in/out)
echo 'module ceph +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file fs/ceph/messenger.c -p' > /sys/kernel/debug/dynamic_debug/control
echo 'file ' `grep -- --- /sys/kernel/debug/dynamic_debug/control | grep ceph | awk '{print $1}' | sed 's/:/ line /'` +p > /sys/kernel/debug/dynamic_debug/control
echo 'file ' `grep -- === /sys/kernel/debug/dynamic_debug/control | grep ceph | awk '{print $1}' | sed 's/:/ line /'` +p > /sys/kernel/debug/dynamic_debug/control