summaryrefslogtreecommitdiff
path: root/fs/super.c
Commit message (Collapse)AuthorAgeFilesLines
* freeze_bdev: grab active reference to frozen superblocksChristoph Hellwig2009-09-241-1/+47
| | | | | | | | | | | | | | Currently we held s_umount while a filesystem is frozen, despite that we might return to userspace and unlock it from a different process. Instead grab an active reference to keep the file system busy and add an explicit check for frozen filesystems in remount and reject the remount instead of blocking on s_umount. Add a new get_active_super helper to super.c for use by freeze_bdev that grabs an active reference to a superblock from a given block device. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* freeze_bdev: kill bd_mount_semChristoph Hellwig2009-09-241-2/+7
| | | | | | | | | | | | | | Now that we have the freeze count there is not much reason for bd_mount_sem anymore. The actual freeze/thaw operations are serialized using the bd_fsfreeze_mutex, and the only other place we take bd_mount_sem is get_sb_bdev which tries to prevent mounting a filesystem while the block device is frozen. Instead of add a check for bd_fsfreeze_count and return -EBUSY if a filesystem is frozen. While that is a change in user visible behaviour a failing mount is much better for this case rather than having the mount process stuck uninterruptible for a long time. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: change sb->s_maxbytes to a loff_tJeff Layton2009-09-241-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | sb->s_maxbytes is supposed to indicate the maximum size of a file that can exist on the filesystem. It's declared as an unsigned long long. Even if a filesystem has no inherent limit that prevents it from using every bit in that unsigned long long, it's still problematic to set it to anything larger than MAX_LFS_FILESIZE. There are places in the kernel that cast s_maxbytes to a signed value. If it's set too large then this cast makes it a negative number and generally breaks the comparison. Change s_maxbytes to be loff_t instead. That should help eliminate the temptation to set it too large by making it a signed value. Also, add a warning for couple of releases to help catch filesystems that set s_maxbytes too large. Eventually we can either convert this to a BUG() or just remove it and in the hope that no one will get it wrong now that it's a signed value. Signed-off-by: Jeff Layton <jlayton@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Robert Love <rlove@google.com> Cc: Mandeep Singh Baines <msb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* const: mark remaining super_operations constAlexey Dobriyan2009-09-221-1/+1
| | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: Assign bdi in super_blockJens Axboe2009-09-161-0/+6
| | | | | | | | | | | We do this automatically in get_sb_bdev() from the set_bdev_super() callback. Filesystems that have their own private backing_dev_info must assign that in ->fill_super(). Note that ->s_bdi assignment is required for proper writeback! Acked-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: switch to per-bdi threads for flushing dataJens Axboe2009-09-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This gets rid of pdflush for bdi writeout and kupdated style cleaning. pdflush writeout suffers from lack of locality and also requires more threads to handle the same workload, since it has to work in a non-blocking fashion against each queue. This also introduces lumpy behaviour and potential request starvation, since pdflush can be starved for queue access if others are accessing it. A sample ffsb workload that does random writes to files is about 8% faster here on a simple SATA drive during the benchmark phase. File layout also seems a LOT more smooth in vmstat: r b swpd free buff cache si so bi bo in cs us sy id wa 0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42 0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44 1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37 0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58 0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34 0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37 0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44 0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38 0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41 0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45 where vanilla tends to fluctuate a lot in the creation phase: r b swpd free buff cache si so bi bo in cs us sy id wa 1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36 1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51 0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40 0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37 1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41 0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49 0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36 1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43 0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39 1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45 1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34 0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54 A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A SSD based writeback test on XFS performs over 20% better as well, with the throughput being very stable around 1GB/sec, where pdflush only manages 750MB/sec and fluctuates wildly while doing so. Random buffered writes to many files behave a lot better as well, as does random mmap'ed writes. A separate thread is added to sync the super blocks. In the long term, adding sync_supers_bdi() functionality could get rid of this thread again. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* writeback: move dirty inodes from super_block to backing_dev_infoJens Axboe2009-09-111-3/+0
| | | | | | | | | This is a first step at introducing per-bdi flusher threads. We should have no change in behaviour, although sb_has_dirty_inodes() is now ridiculously expensive, as there's no easy way to answer that question. Not a huge problem, since it'll be deleted in subsequent patches. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* ... and the same for vfsmount id/mount group idAl Viro2009-06-241-2/+4
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Make allocation of anon devices cheaperAl Viro2009-06-241-1/+6
| | | | | | | | | | | | | | | | | | | Standard trick - add a new variable (start) such that for each n < start n is known to be busy. Allocation can skip checking everything in [0..start) and if it returns n, we can set start to n + 1. Freeing below start sets start to what we'd just freed. Of course, it still sucks if we do something like free 0 allocate allocate in a loop - still O(n^2) time. However, on saner loads it improves the things a lot and the entire thing is not worth the trouble of switching to something with better worst-case behaviour. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* remove unlock_kernel() left accidentallyJ. R. Okajima2009-06-171-9/+3
| | | | | | | | | | | | | | | commit 337eb00a2c3a421999c39c94ce7e33545ee8baa7 Push BKL down into ->remount_fs() and commit 4aa98cf768b6f2ea4b204620d949a665959214f6 Push BKL down into do_remount_sb() were uncorrectly merged. The former removes one pair of lock/unlock_kernel(), but the latter adds several unlock_kernel(). Finally a few unlock_kernel() calls left. Signed-off-by: J. R. Okajima <hooanon05@yahoo.co.jp> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Push BKL down into ->remount_fs()Alessio Igor Bogani2009-06-111-2/+0
| | | | | | | [xfs, btrfs, capifs, shmem don't need BKL, exempt] Signed-off-by: Alessio Igor Bogani <abogani@texware.it> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ->write_super lock_super pushdownChristoph Hellwig2009-06-111-2/+0
| | | | | | | | | | | | | | | | | | | Push down lock_super into ->write_super instances and remove it from the caller. Following filesystem don't need ->s_lock in ->write_super and are skipped: * bfs, nilfs2 - no other uses of s_lock and have internal locks in ->write_super * ext2 - uses BKL in ext2_write_super and has internal calls without s_lock * reiserfs - no other uses of s_lock as has reiserfs_write_lock (BKL) in ->write_super * xfs - no other uses of s_lock and uses internal lock (buffer lock on superblock buffer) to serialize ->write_super. Also xfs_fs_write_super is superflous and will go away in the next merge window Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Push BKL down into do_remount_sb()Al Viro2009-06-111-5/+11
| | | | | | [folded fix from Jiri Slaby] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Push lock_super() into the ->remount_fs() of filesystems that care about itAl Viro2009-06-111-2/+0
| | | | | | | | | | | | | Note that since we can't run into contention between remount_fs and write_super (due to exclusion on s_umount), we have to care only about filesystems that touch lock_super() on their own. Out of those ext3, ext4, hpfs, sysv and ufs do need it; fat doesn't since its ->remount_fs() only accesses assign-once data (basically, it's "we have no atime on directories and only have atime on files for vfat; force nodiratime and possibly noatime into *flags"). [folded a build fix from hch] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* push BKL down into ->put_superChristoph Hellwig2009-06-111-3/+0
| | | | | | | | | | | | | | | | | Move BKL into ->put_super from the only caller. A couple of filesystems had trivial enough ->put_super (only kfree and NULLing of s_fs_info + stuff in there) to not get any locking: coda, cramfs, efs, hugetlbfs, omfs, qnx4, shmem, all others got the full treatment. Most of them probably don't need it, but I'd rather sort that out individually. Preferably after all the other BKL pushdowns in that area. [AV: original used to move lock_super() down as well; these changes are removed since we don't do lock_super() at all in generic_shutdown_super() now] [AV: fuse, btrfs and xfs are known to need no damn BKL, exempt] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* No need to do lock_super() for exclusion in generic_shutdown_super()Al Viro2009-06-111-2/+2
| | | | | | | | | | | | | | We can't run into contention on it. All other callers of lock_super() either hold s_umount (and we have it exclusive) or hold an active reference to superblock in question, which prevents the call of generic_shutdown_super() while the reference is held. So we can replace lock_super(s) with get_fs_excl() in generic_shutdown_super() (and corresponding change for unlock_super(), of course). Since ext4 expects s_lock held for its put_super, take lock_super() into it. The rest of filesystems do not care at all. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Make sure that all callers of remount hold s_umount exclusiveAl Viro2009-06-111-2/+3
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* cleanup sync_supersChristoph Hellwig2009-06-111-12/+15
| | | | | | | | | | | Merge the write_super helper into sync_super and move the check for ->write_super earlier so that we can avoid grabbing a reference to a superblock that doesn't have it. While we're at it also add a little comment documenting sync_supers. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* remove ->write_super call in generic_shutdown_superChristoph Hellwig2009-06-111-2/+0
| | | | | | | | | | | | | | | | | | | | | We just did a full fs writeout using sync_filesystem before, and if that's not enough for the filesystem it can perform it's own writeout in ->put_super, which many filesystems already do. Move a call to foofs_write_super into every foofs_put_super for now to guarantee identical behaviour until it's cleaned up by the individual filesystem maintainers. Exceptions: - affs already has identical copy & pasted code at the beginning of affs_put_super so no need to do it twice. - xfs does the right thing without it and I have changes pending for the xfs tree touching this are so I don't really need conflicts here.. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Rename fsync_super() to sync_filesystem() (version 4)Jan Kara2009-06-111-3/+2
| | | | | | | | Rename the function so that it better describe what it really does. Also remove the unnecessary include of buffer_head.h. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Move syncing code from super.c to sync.c (version 4)Jan Kara2009-06-111-85/+0
| | | | | | | | | | Move sync_filesystems(), __fsync_super(), fsync_super() from super.c to sync.c where it fits better. [build fixes folded] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Make sys_sync() use fsync_super() (version 4)Jan Kara2009-06-111-49/+23
| | | | | | | | | | | | | | | | | | It is unnecessarily fragile to have two places (fsync_super() and do_sync()) doing data integrity sync of the filesystem. Alter __fsync_super() to accommodate needs of both callers and use it. So after this patch __fsync_super() is the only place where we gather all the calls needed to properly send all data on a filesystem to disk. Nice bonus is that we get a complete livelock avoidance and write_supers() is now only used for periodic writeback of superblocks. sync_blockdevs() introduced a couple of patches ago is gone now. [build fixes folded] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Make __fsync_super() a static function (version 4)Jan Kara2009-06-111-4/+3
| | | | | | | | | | __fsync_super() does the same thing as fsync_super(). So change the only caller to use fsync_super() and make __fsync_super() static. This removes unnecessarily duplicated call to sync_blockdev() and prepares ground for the changes to __fsync_super() in the following patches. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Call ->sync_fs() even if s_dirt is 0 (version 4)Jan Kara2009-06-111-1/+1
| | | | | | | | | | | sync_filesystems() has a condition that if wait == 0 and s_dirt == 0, then ->sync_fs() isn't called. This does not really make much sence since s_dirt is generally used by a filesystem to mean that ->write_super() needs to be called. But ->sync_fs() does different things. I even suspect that some filesystems (btrfs?) sets s_dirt just to fool this logic. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: Fix sys_sync() and fsync_super() reliability (version 4)Jan Kara2009-06-111-1/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far, do_sync() called: sync_inodes(0); sync_supers(); sync_filesystems(0); sync_filesystems(1); sync_inodes(1); This ordering makes it kind of hard for filesystems as sync_inodes(0) need not submit all the IO (for example it skips inodes with I_SYNC set) so e.g. forcing transaction to disk in ->sync_fs() is not really enough. Therefore sys_sync has not been completely reliable on some filesystems (ext3, ext4, reiserfs, ocfs2 and others are hit by this) when racing e.g. with background writeback. A similar problem hits also other filesystems (e.g. ext2) because of write_supers() being called before the sync_inodes(1). Change the ordering of calls in do_sync() - this requires a new function sync_blockdevs() to preserve the property that block devices are always synced after write_super() / sync_fs() call. The same issue is fixed in __fsync_super() function used on umount / remount read-only. [AV: build fixes] Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* remove s_async_listChristoph Hellwig2009-06-111-8/+0
| | | | | | | | | | Remove the unused s_async_list in the superblock, a leftover of the broken async inode deletion code that leaked into mainline. Having this in the middle of the sync/unmount path is not helpful for the following cleanups. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: move mark_files_ro into file_table.cnpiggin@suse.de2009-06-111-39/+0
| | | | | | | | | | | This function walks the s_files lock, and operates primarily on the files in a superblock, so it better belongs here (eg. see also fs_may_remount_ro). [AV: ... and it shouldn't be static after that move] Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* NULL noise in fs/super.c:kill_bdev_super()H Hartley Sweeten2009-05-091-1/+1
| | | | | | Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Subrata Modak <subrata@linux.vnet.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* New helper: deactivate_locked_super()Al Viro2009-05-091-12/+34
| | | | | | | | | | | | | | | Does equivalent of up_write(&s->s_umount); deactivate_super(s); However, it does not does not unlock it until it's all over. As the result, it's safe to use to dispose of new superblock on ->get_sb() failure exits - nobody will see the sucker until it's all over. Equivalent using up_write/deactivate_super is safe for that purpose if superblock is either safe to use or has NULL ->s_root when we unlock. Normally filesystems take the required precautions, but a) we do have bugs in that area in some of them. b) up_write/deactivate_super sequence is extremely common, so the helper makes sense anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* namespaces: move proc_net_get_sb to a generic fs/super.c helperSerge E. Hallyn2009-04-071-0/+40
| | | | | | | | | | | | The mqueuefs filesystem will use this helper as well. Proc's main get_sb could also be made to use it, but that will require a bit more rework. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Cedric Le Goater <clg@fr.ibm.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* CacheFiles: Export things for CacheFilesDavid Howells2009-04-031-0/+1
| | | | | | | | | | | Export a number of functions for CacheFiles's use. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Steve Dickson <steved@redhat.com> Acked-by: Trond Myklebust <Trond.Myklebust@netapp.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Tested-by: Daire Byrne <Daire.Byrne@framestore.com>
* Merge branch 'for-linus' of ↵Linus Torvalds2009-03-271-3/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (37 commits) fs: avoid I_NEW inodes Merge code for single and multiple-instance mounts Remove get_init_pts_sb() Move common mknod_ptmx() calls into caller Parse mount options just once and copy them to super block Unroll essentials of do_remount_sb() into devpts vfs: simple_set_mnt() should return void fs: move bdev code out of buffer.c constify dentry_operations: rest constify dentry_operations: configfs constify dentry_operations: sysfs constify dentry_operations: JFS constify dentry_operations: OCFS2 constify dentry_operations: GFS2 constify dentry_operations: FAT constify dentry_operations: FUSE constify dentry_operations: procfs constify dentry_operations: ecryptfs constify dentry_operations: CIFS constify dentry_operations: AFS ...
| * vfs: simple_set_mnt() should return voidSukadev Bhattiprolu2009-03-271-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | simple_set_mnt() is defined as returning 'int' but always returns 0. Callers assume simple_set_mnt() never fails and don't properly cleanup if it were to _ever_ fail. For instance, get_sb_single() and get_sb_nodev() should: up_write(sb->s_unmount); deactivate_super(sb); if simple_set_mnt() fails. Since simple_set_mnt() never fails, would be cleaner if it did not return anything. [akpm@linux-foundation.org: fix build] Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for_linus' of ↵Linus Torvalds2009-03-271-4/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-quota-2.6: (27 commits) ext2: Zero our b_size in ext2_quota_read() trivial: fix typos/grammar errors in fs/Kconfig quota: Coding style fixes quota: Remove superfluous inlines quota: Remove uppercase aliases for quota functions. nfsd: Use lowercase names of quota functions jfs: Use lowercase names of quota functions udf: Use lowercase names of quota functions ufs: Use lowercase names of quota functions reiserfs: Use lowercase names of quota functions ext4: Use lowercase names of quota functions ext3: Use lowercase names of quota functions ext2: Use lowercase names of quota functions ramfs: Remove quota call vfs: Use lowercase names of quota functions quota: Remove dqbuf_t and other cleanups quota: Remove NODQUOT macro quota: Make global quota locks cacheline aligned quota: Move quota files into separate directory ext4: quota reservation for delayed allocation ...
| * | vfs: Use lowercase names of quota functionsJan Kara2009-03-261-4/+4
| |/ | | | | | | | | | | | | Use lowercase names of quota functions instead of old uppercase ones. Signed-off-by: Jan Kara <jack@suse.cz> CC: Alexander Viro <viro@zeniv.linux.org.uk>
* | Get rid of pdflush_operation() in emergency sync and remountJens Axboe2009-03-261-2/+9
|/ | | | | | | | | | | Opencode a cheasy approach with kevent. The idea here is that we'll add some generic delayed work infrastructure, which probably wont be based on pdflush (or maybe it will, in which case we can just add it back). This is in preparation for getting rid of pdflush completely. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* vfs: add missing unlock in sget()Li Zefan2009-03-121-1/+4
| | | | | | | | | | | | In sget(), destroy_super(s) is called with s->s_umount held, which makes lockdep unhappy. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs/super.c: add lockdep annotation to s_umountPeter Zijlstra2009-02-181-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Li Zefan said: Thread 1: for ((; ;)) { mount -t cpuset xxx /mnt > /dev/null 2>&1 cat /mnt/cpus > /dev/null 2>&1 umount /mnt > /dev/null 2>&1 } Thread 2: for ((; ;)) { mount -t cpuset xxx /mnt > /dev/null 2>&1 umount /mnt > /dev/null 2>&1 } (Note: It is irrelevant which cgroup subsys is used.) After a while a lockdep warning showed up: ============================================= [ INFO: possible recursive locking detected ] 2.6.28 #479 --------------------------------------------- mount/13554 is trying to acquire lock: (&type->s_umount_key#19){--..}, at: [<c049d888>] sget+0x5e/0x321 but task is already holding lock: (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321 other info that might help us debug this: 1 lock held by mount/13554: #0: (&type->s_umount_key#19){--..}, at: [<c049da0c>] sget+0x1e2/0x321 stack backtrace: Pid: 13554, comm: mount Not tainted 2.6.28-mc #479 Call Trace: [<c044ad2e>] validate_chain+0x4c6/0xbbd [<c044ba9b>] __lock_acquire+0x676/0x700 [<c044bb82>] lock_acquire+0x5d/0x7a [<c049d888>] ? sget+0x5e/0x321 [<c061b9b8>] down_write+0x34/0x50 [<c049d888>] ? sget+0x5e/0x321 [<c049d888>] sget+0x5e/0x321 [<c045a2e7>] ? cgroup_set_super+0x0/0x3e [<c045959f>] ? cgroup_test_super+0x0/0x2f [<c045bcea>] cgroup_get_sb+0x98/0x2e7 [<c045cfb6>] cpuset_get_sb+0x4a/0x5f [<c049dfa4>] vfs_kern_mount+0x40/0x7b [<c049e02d>] do_kern_mount+0x37/0xbf [<c04af4a0>] do_mount+0x5c3/0x61a [<c04addd2>] ? copy_mount_options+0x2c/0x111 [<c04af560>] sys_mount+0x69/0xa0 [<c0403251>] sysenter_do_call+0x12/0x31 The cause is after alloc_super() and then retry, an old entry in list fs_supers is found, so grab_super(old) is called, but both functions hold s_umount lock: struct super_block *sget(...) { ... retry: spin_lock(&sb_lock); if (test) { list_for_each_entry(old, &type->fs_supers, s_instances) { if (!test(old, data)) continue; if (!grab_super(old)) <--- 2nd: down_write(&old->s_umount); goto retry; if (s) destroy_super(s); return old; } } if (!s) { spin_unlock(&sb_lock); s = alloc_super(type); <--- 1th: down_write(&s->s_umount) if (!s) return ERR_PTR(-ENOMEM); goto retry; } ... } It seems like a false positive, and seems like VFS but not cgroup needs to be fixed. Peter said: We can simply put the new s_umount instance in a but lockdep doesn't particularly cares about subclass order. If there's any issue with the callers of sget() assuming the s_umount lock being of sublcass 0, then there is another annotation we can use to fix that, but lets not bother with that if this is sufficient. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12673 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Reported-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Paul Menage <menage@google.com> Cc: Arjan van de Ven <arjan@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* async: Rename _special -> _domain for clarity.Cornelia Huck2009-02-081-2/+2
| | | | | | | | | Rename the async_*_special() functions to async_*_domain(), which describes the purpose of these functions much better. [Broke up long lines to silence checkpatch] Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* [CVE-2009-0029] System call wrappers part 11Heiko Carstens2009-01-141-1/+1
| | | | Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
* Merge branch 'for_linus' of ↵Linus Torvalds2009-01-081-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (57 commits) jbd2: Fix oops in jbd2_journal_init_inode() on corrupted fs ext4: Remove "extents" mount option block: Add Kconfig help which notes that ext4 needs CONFIG_LBD ext4: Make printk's consistently prefixed with "EXT4-fs: " ext4: Add sanity checks for the superblock before mounting the filesystem ext4: Add mount option to set kjournald's I/O priority jbd2: Submit writes to the journal using WRITE_SYNC jbd2: Add pid and journal device name to the "kjournald2 starting" message ext4: Add markers for better debuggability ext4: Remove code to create the journal inode ext4: provide function to release metadata pages under memory pressure ext3: provide function to release metadata pages under memory pressure add releasepage hooks to block devices which can be used by file systems ext4: Fix s_dirty_blocks_counter if block allocation failed with nodelalloc ext4: Init the complete page while building buddy cache ext4: Don't allow new groups to be added during block allocation ext4: mark the blocks/inode bitmap beyond end of group as used ext4: Use new buffer_head flag to check uninit group bitmaps initialization ext4: Fix the race between read_inode_bitmap() and ext4_new_inode() ext4: code cleanup ...
| * add releasepage hooks to block devices which can be used by file systemsTheodore Ts'o2009-01-031-0/+2
| | | | | | | | | | | | | | | | | | Implement blkdev_releasepage() to release the buffer_heads and pages after we release private data belonging to a mounted filesystem. Cc: Toshiyuki Okajima <toshi.okajima@jp.fujitsu.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
* | async: Don't call async_synchronize_full_special() while holding sb_lockDave Kleikamp2009-01-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | sync_filesystems() shouldn't be calling async_synchronize_full_special while holding a spinlock. The second while loop in that function is the right place for this anyway. Signed-off-by: Dave Kleikamp <shaggy@linux.vnet.ibm.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Reported-by: Grissiom <chaos.proton@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | async: make the final inode deletion an asynchronous eventArjan van de Ven2009-01-071-0/+10
|/ | | | | | | this makes "rm -rf" on a (names cached) kernel tree go from 11.6 to 8.6 seconds on an ext3 filesystem Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
* security: pass mount flags to security_sb_kern_mount()James Morris2008-12-201-1/+1
| | | | | | | | Pass mount flags to security_sb_kern_mount(), so security modules can determine if a mount operation is being performed by the kernel. Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdevLinus Torvalds2008-10-231-4/+10
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/viro/bdev: (66 commits) [PATCH] kill the rest of struct file propagation in block ioctls [PATCH] get rid of struct file use in blkdev_ioctl() BLKBSZSET [PATCH] get rid of blkdev_locked_ioctl() [PATCH] get rid of blkdev_driver_ioctl() [PATCH] sanitize blkdev_get() and friends [PATCH] remember mode of reiserfs journal [PATCH] propagate mode through swsusp_close() [PATCH] propagate mode through open_bdev_excl/close_bdev_excl [PATCH] pass fmode_t to blkdev_put() [PATCH] kill the unused bsize on the send side of /dev/loop [PATCH] trim file propagation in block/compat_ioctl.c [PATCH] end of methods switch: remove the old ones [PATCH] switch sr [PATCH] switch sd [PATCH] switch ide-scsi [PATCH] switch tape_block [PATCH] switch dcssblk [PATCH] switch dasd [PATCH] switch mtd_blkdevs [PATCH] switch mmc ...
| * [PATCH] propagate mode through open_bdev_excl/close_bdev_exclAl Viro2008-10-211-4/+10
| | | | | | | | | | | | | | replace open_bdev_excl/close_bdev_excl with variants taking fmode_t. superblock gets the value used to mount it stored in sb->s_mode Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | [PATCH 2/2] anondev: switch to IDAAlexey Dobriyan2008-10-231-5/+5
| | | | | | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* | [PATCH 1/2] anondev: init IDR staticallyAlexey Dobriyan2008-10-231-6/+1
|/ | | | Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
* fix soft lock up at NFS mount via per-SB LRU-list of unused dentriesKentaro Makita2008-07-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [Summary] Split LRU-list of unused dentries to one per superblock to avoid soft lock up during NFS mounts and remounting of any filesystem. Previously I posted here: http://lkml.org/lkml/2008/3/5/590 [Descriptions] - background dentry_unused is a list of dentries which are not referenced. dentry_unused grows up when references on directories or files are released. This list can be very long if there is huge free memory. - the problem When shrink_dcache_sb() is called, it scans all dentry_unused linearly under spin_lock(), and if dentry->d_sb is differnt from given superblock, scan next dentry. This scan costs very much if there are many entries, and very ineffective if there are many superblocks. IOW, When we need to shrink unused dentries on one dentry, but scans unused dentries on all superblocks in the system. For example, we scan 500 dentries to unmount a filesystem, but scans 1,000,000 or more unused dentries on other superblocks. In our case , At mounting NFS*, shrink_dcache_sb() is called to shrink unused dentries on NFS, but scans 100,000,000 unused dentries on superblocks in the system such as local ext3 filesystems. I hear NFS mounting took 1 min on some system in use. * : NFS uses virtual filesystem in rpc layer, so NFS is affected by this problem. 100,000,000 is possible number on large systems. Per-superblock LRU of unused dentried can reduce the cost in reasonable manner. - How to fix I found this problem is solved by David Chinner's "Per-superblock unused dentry LRU lists V3"(1), so I rebase it and add some fix to reclaim with fairness, which is in Andrew Morton's comments(2). 1) http://lkml.org/lkml/2006/5/25/318 2) http://lkml.org/lkml/2006/5/25/320 Split LRU-list of unused dentries to each superblocks. Then, NFS mounting will check dentries under a superblock instead of all. But this spliting will break LRU of dentry-unused. So, I've attempted to make reclaim unused dentrins with fairness by calculate number of dentries to scan on this sb based on following way number of dentries to scan on this sb = count * (number of dentries on this sb / number of dentries in the machine) - ToDo - I have to measuring performance number and do stress tests. - When unmount occurs during prune_dcache(), scanning on same superblock, It is unable to reach next superblock because it is gone away. We restart scannig superblock from first one, it causes unfairness of reclaim unused dentries on first superblock. But I think this happens very rarely. - Test Results Result on 6GB boxes with excessive unused dentries. Without patch: $ cat /proc/sys/fs/dentry-state 10181835 10180203 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m1.830s user 0m0.001s sys 0m1.653s With this patch: $ cat /proc/sys/fs/dentry-state 10236610 10234751 45 0 0 0 # mount -t nfs 10.124.60.70:/work/kernel-src nfs real 0m0.106s user 0m0.002s sys 0m0.032s [akpm@linux-foundation.org: fix comments] Signed-off-by: Kentaro Makita <k-makita@np.css.fujitsu.com> Cc: Neil Brown <neilb@suse.de> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: David Chinner <dgc@sgi.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>