summaryrefslogtreecommitdiff
path: root/fs
Commit message (Collapse)AuthorAgeFilesLines
* orangefs: bump minimum userspace versionMartin Brandenburg2016-09-211-2/+2
| | | | | | | OrangeFS 2.9.6 was released without support for the features op. Thus OrangeFS 2.9.7 will be required to use it. Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: do not allow client readahead cache without feature bitMartin Brandenburg2016-08-122-5/+27
| | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: add features opMartin Brandenburg2016-08-127-6/+53
| | | | | | | | | | | | | This is a new userspace operation, which will be done if the client-core version is greater than or equal to 2.9.6. This will provide a way to implement optional features and to determine which features are supported by the client-core. If the client-core version is older than 2.9.6, no optional features are supported and the op will not be done. The intent is to allow protocol extensions without relying on the client-core's current behavior of ignoring what it doesn't understand. Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: record userspace version for feature compatbilityMartin Brandenburg2016-08-092-0/+12
| | | | | | | | | The client reports its version to the kernel on startup. We already test that it is above the minimum version. Now we record it in a global variable so code elsewhere can consult it before making a request the client may not understand. Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: add readahead count and size to sysfsMartin Brandenburg2016-08-081-8/+115
| | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: re-add flush_racache from out-of-treeMartin Brandenburg2016-08-081-1/+33
| | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: turn param response value into unionMartin Brandenburg2016-08-083-7/+11
| | | | | | | | | | | This will support a upcoming request where two related values need to be updated atomically. This was done without a union in the OrangeFS server source already. Since that will break the kernel protocol, it has been fixed there and done here in a way that does not break the kernel protocol. Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: add missing param request opsMartin Brandenburg2016-08-081-0/+3
| | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* orangefs: rename remaining bits of mmap readahead cacheMartin Brandenburg2016-08-085-7/+7
| | | | | | | | | | This has been dormant code for many years. Parts of it were removed from the OrangeFS kernel code when it went into mainline. These bits were missed. Now the readahead cache has been resurrected in the OrangeFS userspace portions. It was renamed there, since it doesn't really have anything to do with mmap specifically, so it will be renamed here. Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* Merge tag 'for-linus-v4.8' of git://github.com/martinbrandenburg/linuxLinus Torvalds2016-08-028-30/+89
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull orangefs update from Martin Brandenburg: "Kernel side caching and executable bugfix This allows OrangeFS to utilize the dcache and adds an in kernel attribute cache. We previously used the user side client for this purpose. We see a modest performance increase on small file operations. For example, without the cache, compiling coreutils takes about 17 minutes. With the patch and a 50 millisecond timeout for dcache_timeout_msecs and getattr_timeout_msecs (the default), compiling coreutils takes about 6 minutes 20 seconds. On the same hardware, compiling coreutils on an xfs filesystem takes 90 seconds. We see similar improvements with mdtest and a test involving writing, reading, and deleting a large number of small files. Interested parties can review more data at the following URL. https://docs.google.com/spreadsheets/d/1v4aUeppKexIbRMz_Yn9k4eaM3uy2KCaPoe_93YKWOtA/pubhtml The eventual goal of this is to allow getdents to turn into a readdirplus to the OrangeFS server. The cache will be filled then, which should provide a performance benefit to the common case of readdir followed by getattr on each entry (i.e. ls -l). This also fixes a bug. When orangefs_inode_permission was added, it did not collect i_size from the OrangeFS server, since this presses an unnecessary load on the OrangeFS server. However, it left a case where i_size is never initialized. Then running an executable could fail. With this patch, size is always collected to be inserted into the cache. Thus the bug disappears. If this patch is not accepted during this merge window, we will send a one-line band-aid for this bug instead" * tag 'for-linus-v4.8' of git://github.com/martinbrandenburg/linux: Orangefs: update orangefs.txt orangefs: Account for jiffies wraparound. orangefs: Change default dcache and getattr timeout to 50 msec. orangefs: Allow dcache and getattr cache time to be configured. orangefs: Cache getattr results. orangefs: Use d_time to avoid excessive lookups
| * orangefs: Account for jiffies wraparound.Martin Brandenburg2016-08-023-8/+8
| | | | | | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
| * orangefs: Change default dcache and getattr timeout to 50 msec.Martin Brandenburg2016-08-021-2/+2
| | | | | | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
| * orangefs: Allow dcache and getattr cache time to be configured.Martin Brandenburg2016-08-026-7/+52
| | | | | | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
| * orangefs: Cache getattr results.Martin Brandenburg2016-08-025-29/+34
| | | | | | | | | | | | | | | | The userspace component attempts to do this, but this will prevent us from even needing to go into userspace to satisfy certain getattr requests. Signed-off-by: Martin Brandenburg <martin@omnibond.com>
| * orangefs: Use d_time to avoid excessive lookupsMartin Brandenburg2016-08-022-0/+9
| | | | | | | | Signed-off-by: Martin Brandenburg <martin@omnibond.com>
* | Merge tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-clientLinus Torvalds2016-08-0213-752/+999
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull Ceph updates from Ilya Dryomov: "The highlights are: - RADOS namespace support in libceph and CephFS (Zheng Yan and myself). The stopgaps added in 4.5 to deny access to inodes in namespaces are removed and CEPH_FEATURE_FS_FILE_LAYOUT_V2 feature bit is now fully supported - A large rework of the MDS cap flushing code (Zheng Yan) - Handle some of ->d_revalidate() in RCU mode (Jeff Layton). We were overly pessimistic before, bailing at the first sight of LOOKUP_RCU On top of that we've got a few CephFS bug fixes, a couple of cleanups and Arnd's workaround for a weird genksyms issue" * tag 'ceph-for-4.8-rc1' of git://github.com/ceph/ceph-client: (34 commits) ceph: fix symbol versioning for ceph_monc_do_statfs ceph: Correctly return NXIO errors from ceph_llseek ceph: Mark the file cache as unreclaimable ceph: optimize cap flush waiting ceph: cleanup ceph_flush_snaps() ceph: kick cap flushes before sending other cap message ceph: introduce an inode flag to indicates if snapflush is needed ceph: avoid sending duplicated cap flush message ceph: unify cap flush and snapcap flush ceph: use list instead of rbtree to track cap flushes ceph: update types of some local varibles ceph: include 'follows' of pending snapflush in cap reconnect message ceph: update cap reconnect message to version 3 ceph: mount non-default filesystem by name libceph: fsmap.user subscription support ceph: handle LOOKUP_RCU in ceph_d_revalidate ceph: allow dentry_lease_is_valid to work under RCU walk ceph: clear d_fsinfo pointer under d_lock ceph: remove ceph_mdsc_lease_release ceph: don't use ->d_time ...
| * ceph: Correctly return NXIO errors from ceph_llseekPhil Turnbull2016-07-281-7/+5
| | | | | | | | | | | | | | | | | | ceph_llseek does not correctly return NXIO errors because the 'out' path always returns 'offset'. Fixes: 06222e491e66 ("fs: handle SEEK_HOLE/SEEK_DATA properly in all fs's that define their own llseek") Signed-off-by: Phil Turnbull <phil.turnbull@oracle.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: Mark the file cache as unreclaimableNikolay Borisov2016-07-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ceph creates multiple caches with the SLAB_RECLAIMABLE flag set, so that it can satisfy its internal needs. Inspecting the code shows that most of the caches are indeed reclaimable since they are directly related to the generic inode/dentry shrinkers. However, one of the cache used to satisfy struct file is not reclaimable since its entries are freed only when the last reference to the file is dropped. If a heavily loaded node opens a lot of files it can introduce non-trivial discrepancies between memory shown as reclaimable and what is actually reclaimed when drop_caches is used. Fix this by removing the reclaimable flag for the file's cache. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: optimize cap flush waitingYan, Zheng2016-07-283-27/+73
| | | | | | | | | | | | | | | | | | | | | | | | Add a 'wake' flag to ceph_cap_flush struct, which indicates if there is someone waiting for it to finish. When getting flush ack message, we check the 'wake' flag in corresponding ceph_cap_flush struct to decide if we should wake up waiters. One corner case is that the acked cap flush has 'wake' flags is set, but it is not the first one on the flushing list. We do not wake up waiters in this case, set 'wake' flags of preceding ceph_cap_flush struct instead Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: cleanup ceph_flush_snaps()Yan, Zheng2016-07-283-88/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch devide __ceph_flush_snaps() into two stags. In the first stage, __ceph_flush_snaps() assign snapcaps flush TIDs and add them to cap flush lists. __ceph_flush_snaps() keeps holding the i_ceph_lock in this stagge. So inode's auth cap can not change. In the second stage, __ceph_flush_snaps() send flushsnap cap messages. i_ceph_lock is unlocked before sending each cap message. If auth cap changes in the middle, __ceph_flush_snaps() just stops. This is OK because kick_flushing_inode_caps() will re-send flushsnap cap messages to inode's new auth MDS. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: kick cap flushes before sending other cap messageYan, Zheng2016-07-281-9/+34
| | | | | | | | | | | | | | If ceph_check_caps() wants to send cap message to a recovering MDS, make sure it kicks cap flushes first. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: introduce an inode flag to indicates if snapflush is neededYan, Zheng2016-07-283-15/+36
| | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: avoid sending duplicated cap flush messageYan, Zheng2016-07-282-1/+18
| | | | | | | | | | | | | | make ceph_kick_flushing_caps() ignore inodes whose cap flushes have already been re-sent by ceph_early_kick_flushing_caps() Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: unify cap flush and snapcap flushYan, Zheng2016-07-285-222/+175
| | | | | | | | | | | | | | | | | | | | | | | | This patch includes following changes - Assign flush tid to snapcap flush - Remove session's s_cap_snaps_flushing list. Add inode to session's s_cap_flushing list instead. Inode is removed from the list when there is no pending snapcap flush or cap flush. - make __kick_flushing_caps() re-send both snapcap flushes and cap flushes. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: use list instead of rbtree to track cap flushesYan, Zheng2016-07-285-118/+56
| | | | | | | | | | | | | | | | We don't have requirement of searching cap flush by TID. In most cases, we just need to know TID of the oldest cap flush. List is ideal for this usage. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: update types of some local variblesYan, Zheng2016-07-281-11/+12
| | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: include 'follows' of pending snapflush in cap reconnect messageYan, Zheng2016-07-281-1/+16
| | | | | | | | | | | | | | This helps the recovering MDS to reconstruct the internal states that tracking pending snapflush. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: update cap reconnect message to version 3Yan, Zheng2016-07-281-21/+47
| | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: mount non-default filesystem by nameYan, Zheng2016-07-284-17/+117
| | | | | | | | | | | | | | | | | | | | | | To mount non-default filesytem, user currently needs to provide mds namespace ID. This is inconvenience. This patch makes user be able to mount filesystem by name. If user wants to mount non-default filesystem. Client first subscribes to fsmap.user. Subscribe to mdsmap.<ID> after getting ID of filesystem. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: handle LOOKUP_RCU in ceph_d_revalidateJeff Layton2016-07-281-6/+14
| | | | | | | | | | | | | | | | | | We can now handle the snapshot cases under RCU, as well as the non-snapshot case when we don't need to queue up a lease renewal allow LOOKUP_RCU walks to proceed under those conditions. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: allow dentry_lease_is_valid to work under RCU walkJeff Layton2016-07-281-15/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under rcuwalk, we need to take extra care when dereferencing d_parent. We want to do that once and pass a pointer to dentry_lease_is_valid. Also, we must ensure that that function can handle the case where we're racing with d_release. Check whether "di" is NULL under the d_lock, and just return 0 if so. Finally, we still need to kick off a renewal job if the lease is getting close to expiration. If that's the case, then just drop out of rcuwalk mode since that could block. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: clear d_fsinfo pointer under d_lockJeff Layton2016-07-281-1/+5
| | | | | | | | | | | | | | | | | | | | | | To check for a valid dentry lease, we need to get at the ceph_dentry_info. Under rcuwalk though, we may end up with a dentry that is on its way to destruction. Since we need to take the d_lock in dentry_lease_is_valid already, we can just ensure that we clear the d_fsinfo pointer out under the same lock before destroying it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: remove ceph_mdsc_lease_releaseJeff Layton2016-07-282-45/+0
| | | | | | | | | | | | | | Nothing calls it. Signed-off-by: Jeff Layton <jlayton@redhat.com> Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: don't use ->d_timeMiklos Szeredi2016-07-284-8/+8
| | | | | | | | | | | | | | Pretty simple: just use ceph_dentry_info.time instead (which was already there, unused). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * ceph: fix spelling mistake: "resgister" -> "register"Colin Ian King2016-07-281-1/+1
| | | | | | | | | | | | trivial fix to spelling mistake in pr_err message Signed-off-by: Colin Ian King <colin.king@canonical.com>
| * ceph: fix NULL dereference in ceph_queue_cap_snap()Yan, Zheng2016-07-281-1/+1
| | | | | | | | | | | | old_snapc->seq is used in dout(...) Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: wait unsafe sync writes for evicting inodeYan, Zheng2016-07-285-48/+61
| | | | | | | | | | | | Otherwise ceph_sync_write_unsafe() may access/modify freed inode. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix use-after-free bug in ceph_direct_read_write()Yan, Zheng2016-07-281-2/+5
| | | | | | | | | | | | | | ceph_aio_complete() can free the ceph_aio_request struct before the code exits the while loop. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: reduce i_nr_by_mode array sizeYan, Zheng2016-07-284-22/+35
| | | | | | | | | | | | | | Track usage count for individual fmode bit. This can reduce the array size by half. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: set user pages dirty after direct IO readYan, Zheng2016-07-281-2/+2
| | | | | | | | Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: rados pool namespace supportYan, Zheng2016-07-288-77/+159
| | | | | | | | | | | | | | | | This patch adds codes that decode pool namespace information in cap message and request reply. Pool namespace is saved in i_layout, it will be passed to libceph when doing read/write. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * libceph: rados pool namespace supportYan, Zheng2016-07-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Add pool namesapce pointer to struct ceph_file_layout and struct ceph_object_locator. Pool namespace is used by when mapping object to PG, it's also used when composing OSD request. The namespace pointer in struct ceph_file_layout is RCU protected. So libceph can read namespace without taking lock. Signed-off-by: Yan, Zheng <zyan@redhat.com> [idryomov@gmail.com: ceph_oloc_destroy(), misc minor changes] Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: define new ceph_file_layout structureYan, Zheng2016-07-287-45/+43
| | | | | | | | | | | | | | | | Define new ceph_file_layout structure and rename old ceph_file_layout to ceph_file_layout_legacy. This is preparation for adding namespace to ceph_file_layout structure. Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * libceph: add an ONSTACK initializer for oidsIlya Dryomov2016-07-281-1/+1
| | | | | | | | | | | | | | | | | | | | An on-stack oid in ceph_ioctl_get_dataloc() is not initialized, resulting in a WARN and a NULL pointer dereference later on. We will have more of these on-stack in the future, so fix it with a convenience macro. Fixes: d30291b985d1 ("libceph: variable-sized ceph_object_id") Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | Merge branch 'for-linus-4.8' of ↵Linus Torvalds2016-07-316-338/+544
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs updates from Chris Mason: "This pull is dedicated to Josef's enospc rework, which we've been testing for a few releases now. It fixes some early enospc problems and is dramatically faster. This also includes an updated fix for the delalloc accounting that happens after a fault in copy_from_user. My patch in v4.7 was almost but not quite enough" * 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: Btrfs: fix delalloc accounting after copy_from_user faults Btrfs: avoid deadlocks during reservations in btrfs_truncate_block Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes Btrfs: fill relocation block rsv after allocation Btrfs: always use trans->block_rsv for orphans Btrfs: change how we calculate the global block rsv Btrfs: use root when checking need_async_flush Btrfs: don't bother kicking async if there's nothing to reclaim Btrfs: fix release reserved extents trace points Btrfs: add fsid to some tracepoints Btrfs: add tracepoints for flush events Btrfs: fix delalloc reservation amount tracepoint Btrfs: trace pinned extents Btrfs: introduce ticketed enospc infrastructure Btrfs: add tracepoint for adding block groups Btrfs: warn_on for unaccounted spaces Btrfs: change delayed reservation fallback behavior Btrfs: always reserve metadata for delalloc extents Btrfs: fix callers of btrfs_block_rsv_migrate Btrfs: add bytes_readonly to the spaceinfo at once
| * | Btrfs: fix delalloc accounting after copy_from_user faultsChris Mason2016-07-211-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 56244ef151c3cd11 was almost but not quite enough to fix the reservation math after btrfs_copy_from_user returned partial copies. Some users are still seeing warnings in btrfs_destroy_inode, and with a long enough test run I'm able to trigger them as well. This patch fixes the accounting math again, bringing it much closer to the way it was before the sectorsize conversion Chandan did. The problem is accounting for the offset into the page/sector when we do a partial copy. This one just uses the dirty_sectors variable which should already be updated properly. Signed-off-by: Chris Mason <clm@fb.com> cc: stable@vger.kernel.org # v4.6+
| * | Btrfs: avoid deadlocks during reservations in btrfs_truncate_blockJosef Bacik2016-07-201-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | The new enospc code makes it possible to deadlock if we don't use FLUSH_LIMIT during reservations inside a transaction. This enforces the correct flush type to avoid both deadlocks and assertions Signed-off-by: Chris Mason <clm@fb.com> Signed-off-by: Josef Bacik <jbacik@fb.com>
| * | Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytesJosef Bacik2016-07-072-17/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We used to allow you to set FLUSH_ALL and then just wouldn't do things like commit transactions or wait on ordered extents if we noticed you were in a transaction. However now that all the flushing for FLUSH_ALL is asynchronous we've lost the ability to tell, and we could end up deadlocking. So instead use FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if we error out to preserve the previous behavior. I've also added an ASSERT() to catch anybody else who tries to do this. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: fill relocation block rsv after allocationJosef Bacik2016-07-071-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we set the reloc control before we've reserved our space for relocation we could race with a root being dirtied and not actually have space to do our init reloc root. So once we've allocated it and set it up go ahead and make our reservation before setting the relocate control, that way anybody who tries to do the reloc root init has space to use. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | Btrfs: always use trans->block_rsv for orphansJosef Bacik2016-07-071-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the case all the time anyway except for relocation which could be doing a reloc root for a non ref counted root, in which case we'd end up with some random block rsv rather than the one we have our reservation in. If there isn't enough space in the block rsv we are trying to steal from we'll BUG() because we expect there to be space for the orphan to make its reservation. Thanks, Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>