summaryrefslogtreecommitdiff
path: root/net
Commit message (Collapse)AuthorAgeFilesLines
* SUNRPC: Unexport xdr_partial_copy_from_skb()Trond Myklebust2018-09-301-2/+2
| | | | | | It is no longer used outside of net/sunrpc/socklib.c Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Clean up xs_udp_data_receive()Trond Myklebust2018-09-301-12/+5
| | | | | | Simplify the retry logic. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Allow AF_LOCAL sockets to use the generic stream receiveTrond Myklebust2018-09-302-123/+18
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Clean up - rename xs_tcp_data_receive() to xs_stream_data_receive()Trond Myklebust2018-09-301-41/+30
| | | | | | In preparation for sharing with AF_LOCAL. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Simplify TCP receive code by switching to using iteratorsTrond Myklebust2018-09-301-364/+333
| | | | | | Most of this code should also be reusable with other socket types. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Add a bvec array to struct xdr_buf for use with iovec_iter()Trond Myklebust2018-09-303-1/+54
| | | | | | | Add a bvec array to struct xdr_buf, and have the client allocate it when we need to receive data into pages. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Add a label for RPC calls that require allocation on receiveTrond Myklebust2018-09-302-1/+2
| | | | | | | | | | If the RPC call relies on the receive call allocating pages as buffers, then let's label it so that we a) Don't leak memory by allocating pages for requests that do not expect this behaviour b) Can optimise for the common case where calls do not require allocation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Convert the xprt->sending queue back to an ordinary wait queueTrond Myklebust2018-09-301-17/+3
| | | | | | | | | | We no longer need priority semantics on the xprt->sending queue, because the order in which tasks are sent is now dictated by their position in the send queue. Note that the backlog queue remains a priority queue, meaning that slot resources are still managed in order of task priority. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Fix priority queue fairnessTrond Myklebust2018-09-301-55/+54
| | | | | | | | | | | Fix up the priority queue to not batch by owner, but by queue, so that we allow '1 << priority' elements to be dequeued before switching to the next priority queue. The owner field is still used to wake up requests in round robin order by owner to avoid single processes hogging the RPC layer by loading the queues. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Convert xprt receive queue to use an rbtreeTrond Myklebust2018-09-301-11/+82
| | | | | | | | | If the server is slow, we can find ourselves with quite a lot of entries on the receive queue. Converting the search from an O(n) to O(log(n)) can make a significant difference, particularly since we have to hold a number of locks while searching. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Don't take transport->lock unnecessarily when taking XPRT_LOCKTrond Myklebust2018-09-301-2/+5
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Cleanup: remove the unused 'task' argument from the request_send()Trond Myklebust2018-09-304-11/+8
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Clean up transport write space handlingTrond Myklebust2018-09-306-79/+70
| | | | | | | | Treat socket write space handling in the same way we now treat transport congestion: by denying the XPRT_LOCK until the transport signals that it has free buffer space. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Turn off throttling of RPC slots for TCP socketsTrond Myklebust2018-09-302-15/+1
| | | | | | | | The theory was that we would need to grab the socket lock anyway, so we might as well use it to gate the allocation of RPC slots for a TCP socket. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Allow soft RPC calls to time out when waiting for the XPRT_LOCKTrond Myklebust2018-09-301-2/+2
| | | | | | This no longer causes them to lose their place in the transmission queue. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Allow calls to xprt_transmit() to drain the entire transmit queueTrond Myklebust2018-09-301-11/+60
| | | | | | | | Rather than forcing each and every RPC task to grab the socket write lock in order to send itself, we allow whichever task is holding the write lock to attempt to drain the entire transmit queue. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Enqueue swapper tagged RPCs at the head of the transmit queueTrond Myklebust2018-09-301-0/+11
| | | | | | | Avoid memory starvation by giving RPCs that are tagged with the RPC_TASK_SWAPPER flag the highest priority. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Support for congestion control when queuing is enabledTrond Myklebust2018-09-305-36/+107
| | | | | | | | | | | | Both RDMA and UDP transports require the request to get a "congestion control" credit before they can be transmitted. Right now, this is done when the request locks the socket. We'd like it to happen when a request attempts to be transmitted for the first time. In order to support retransmission of requests that already hold such credits, we also want to ensure that they get queued first, so that we don't deadlock with requests that have yet to obtain a credit. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Improve latency for interactive tasksTrond Myklebust2018-09-301-3/+24
| | | | | | | | | | | One of the intentions with the priority queues was to ensure that no single process can hog the transport. The field task->tk_owner therefore identifies the RPC call's origin, and is intended to allow the RPC layer to organise queues for fairness. This commit therefore modifies the transmit queue to group requests by task->tk_owner, and ensures that we round robin among those groups. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Move RPC retransmission stat counter to xprt_transmit()Trond Myklebust2018-09-302-13/+12
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Simplify xprt_prepare_transmit()Trond Myklebust2018-09-301-16/+7
| | | | | | | | Remove the checks for whether or not we need to transmit, and whether or not a reply has been received. Those are already handled in call_transmit() itself. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Don't reset the request 'bytes_sent' counter when releasing XPRT_LOCKTrond Myklebust2018-09-302-18/+0
| | | | | | If the request is still on the queue, this will be incorrect behaviour. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Treat the task and request as separate in the xprt_ops->send_request()Trond Myklebust2018-09-304-20/+17
| | | | | | | When we shift to using the transmit queue, then the task that holds the write lock will not necessarily be the same as the one being transmitted. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Fix up the back channel transmitTrond Myklebust2018-09-302-15/+31
| | | | | | | | Fix up the back channel code to recognise that it has already been transmitted, so does not need to be called again. Also ensure that we set req->rq_task. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Refactor RPC call encodingTrond Myklebust2018-09-302-41/+62
| | | | | | | Move the call encoding so that it occurs before the transport connection etc. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Add a transmission queue for RPC requestsTrond Myklebust2018-09-302-13/+77
| | | | | | Add the queue that will enforce the ordering of RPC task transmission. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Distinguish between the slot allocation list and receive queueTrond Myklebust2018-09-301-6/+6
| | | | | | | | | | | | When storing a struct rpc_rqst on the slot allocation list, we currently use the same field 'rq_list' as we use to store the request on the receive queue. Since the structure is never on both lists at the same time, this is OK. However, for clarity, let's make that a union with different names for the different lists so that we can more easily distinguish between the two states. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Minor cleanup for call_transmit()Trond Myklebust2018-09-301-17/+15
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Refactor xprt_transmit() to remove wait for reply codeTrond Myklebust2018-09-302-31/+53
| | | | | | | | Allow the caller in clnt.c to call into the code to wait for a reply after calling xprt_transmit(). Again, the reason is that the backchannel code does not need this functionality. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Refactor xprt_transmit() to remove the reply queue codeTrond Myklebust2018-09-304-46/+88
| | | | | | | Separate out the action of adding a request to the reply queue so that the backchannel code can simply skip calling it altogether. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Rename xprt->recv_lock to xprt->queue_lockTrond Myklebust2018-09-305-37/+37
| | | | | | We will use the same lock to protect both the transmit and receive queues. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Don't wake queued RPC calls multiple times in xprt_transmitTrond Myklebust2018-09-301-6/+3
| | | | | | | Rather than waking up the entire queue of RPC messages a second time, just wake up the task that was put to sleep. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Test whether the task is queued before grabbing the queue spinlocksTrond Myklebust2018-09-301-0/+4
| | | | | | | When asked to wake up an RPC task, it makes sense to test whether or not the task is still queued. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Add a helper to wake up a sleeping rpc_task and set its statusTrond Myklebust2018-09-301-10/+55
| | | | | | | | | Add a helper that will wake up a task that is sleeping on a specific queue, and will set the value of task->tk_status. This is mainly intended for use by the transport layer to notify the task of an error condition. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Refactor the transport request pinningTrond Myklebust2018-09-301-20/+23
| | | | | | We are going to need to pin for both send and receive. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Simplify dealing with aborted partially transmitted messagesTrond Myklebust2018-09-301-26/+25
| | | | | | | | | | | | If the previous message was only partially transmitted, we need to close the socket in order to avoid corruption of the message stream. To do so, we currently hijack the unlocking of the socket in order to schedule the close. Now that we track the message offset in the socket state, we can move that kind of checking out of the socket lock code, which is needed to allow messages to remain queued after dropping the socket lock. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Add socket transmit queue offset trackingTrond Myklebust2018-09-301-18/+22
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Move reset of TCP state variables into the reconnect codeTrond Myklebust2018-09-301-7/+6
| | | | | | | Rather than resetting state variables in socket state_change() callback, do it in the sunrpc TCP connect function itself. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Rename TCP receive-specific state variablesTrond Myklebust2018-09-301-89/+89
| | | | | | | | Since we will want to introduce similar TCP state variables for the transmission of requests, let's rename the existing ones to label that they are for the receive side. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Avoid holding locks across the XDR encoding of the RPC messageTrond Myklebust2018-09-301-3/+3
| | | | | | | | Currently, we grab the socket bit lock before we allow the message to be XDR encoded. That significantly slows down the transmission rate, since we serialise on a potentially blocking operation. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Simplify identification of when the message send/receive is completeTrond Myklebust2018-09-302-15/+21
| | | | | | | Add states to indicate that the message send and receive are not yet complete. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: The transmitted message must lie in the RPCSEC window of validityTrond Myklebust2018-09-304-0/+61
| | | | | | | | | | | | | | | If a message has been encoded using RPCSEC_GSS, the server is maintaining a window of sequence numbers that it considers valid. The client should normally be tracking that window, and needs to verify that the sequence number used by the message being transmitted still lies inside the window of validity. So far, we've been able to assume this condition would be realised automatically, since the client has been encoding the message only after taking the socket lock. Once we change that condition, we will need the explicit check. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: If there is no reply expected, bail early from call_decodeTrond Myklebust2018-09-301-5/+8
| | | | Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* SUNRPC: Clean up initialisation of the struct rpc_rqstTrond Myklebust2018-09-302-41/+51
| | | | | | Move the initialisation back into xprt.c. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* ip_tunnel: be careful when accessing the inner headerPaolo Abeni2018-09-241-0/+9
| | | | | | | | | | | Cong noted that we need the same checks introduced by commit 76c0ddd8c3a6 ("ip6_tunnel: be careful when accessing the inner header") even for ipv4 tunnels. Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mpls: allow routes on ip6gre devicesSaif Hasan2018-09-241-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: This appears to be necessary and sufficient change to enable `MPLS` on `ip6gre` tunnels (RFC4023). This diff allows IP6GRE devices to be recognized by MPLS kernel module and hence user can configure interface to accept packets with mpls headers as well setup mpls routes on them. Test Plan: Test plan consists of multiple containers connected via GRE-V6 tunnel. Then carrying out testing steps as below. - Carry out necessary sysctl settings on all containers ``` sysctl -w net.mpls.platform_labels=65536 sysctl -w net.mpls.ip_ttl_propagate=1 sysctl -w net.mpls.conf.lo.input=1 ``` - Establish IP6GRE tunnels ``` ip -6 tunnel add name if_1_2_1 mode ip6gre \ local 2401:db00:21:6048:feed:0::1 \ remote 2401:db00:21:6048:feed:0::2 key 1 ip link set dev if_1_2_1 up sysctl -w net.mpls.conf.if_1_2_1.input=1 ip -4 addr add 169.254.0.2/31 dev if_1_2_1 scope link ip -6 tunnel add name if_1_3_1 mode ip6gre \ local 2401:db00:21:6048:feed:0::1 \ remote 2401:db00:21:6048:feed:0::3 key 1 ip link set dev if_1_3_1 up sysctl -w net.mpls.conf.if_1_3_1.input=1 ip -4 addr add 169.254.0.4/31 dev if_1_3_1 scope link ``` - Install MPLS encap rules on node-1 towards node-2 ``` ip route add 192.168.0.11/32 nexthop encap mpls 32/64 \ via inet 169.254.0.3 dev if_1_2_1 ``` - Install MPLS forwarding rules on node-2 and node-3 ``` // node2 ip -f mpls route add 32 via inet 169.254.0.7 dev if_2_4_1 // node3 ip -f mpls route add 64 via inet 169.254.0.12 dev if_4_3_1 ``` - Ping 192.168.0.11 (node4) from 192.168.0.1 (node1) (where routing towards 192.168.0.1 is via IP route directly towards node1 from node4) ``` ping 192.168.0.11 ``` - tcpdump on interface to capture ping packets wrapped within MPLS header which inturn wrapped within IP6GRE header ``` 16:43:41.121073 IP6 2401:db00:21:6048:feed::1 > 2401:db00:21:6048:feed::2: DSTOPT GREv0, key=0x1, length 100: MPLS (label 32, exp 0, ttl 255) (label 64, exp 0, [S], ttl 255) IP 192.168.0.1 > 192.168.0.11: ICMP echo request, id 1208, seq 45, length 64 0x0000: 6000 2cdb 006c 3c3f 2401 db00 0021 6048 `.,..l<?$....!`H 0x0010: feed 0000 0000 0001 2401 db00 0021 6048 ........$....!`H 0x0020: feed 0000 0000 0002 2f00 0401 0401 0100 ......../....... 0x0030: 2000 8847 0000 0001 0002 00ff 0004 01ff ...G............ 0x0040: 4500 0054 3280 4000 ff01 c7cb c0a8 0001 E..T2.@......... 0x0050: c0a8 000b 0800 a8d7 04b8 002d 2d3c a05b ...........--<.[ 0x0060: 0000 0000 bcd8 0100 0000 0000 1011 1213 ................ 0x0070: 1415 1617 1819 1a1b 1c1d 1e1f 2021 2223 .............!"# 0x0080: 2425 2627 2829 2a2b 2c2d 2e2f 3031 3233 $%&'()*+,-./0123 0x0090: 3435 3637 4567 ``` Signed-off-by: Saif Hasan <has@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netpoll: make ndo_poll_controller() optionalEric Dumazet2018-09-231-12/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | As diagnosed by Song Liu, ndo_poll_controller() can be very dangerous on loaded hosts, since the cpu calling ndo_poll_controller() might steal all NAPI contexts (for all RX/TX queues of the NIC). This capture can last for unlimited amount of time, since one cpu is generally not able to drain all the queues under load. It seems that all networking drivers that do use NAPI for their TX completions, should not provide a ndo_poll_controller(). NAPI drivers have netpoll support already handled in core networking stack, since netpoll_poll_dev() uses poll_napi(dev) to iterate through registered NAPI contexts for a device. This patch allows netpoll_poll_dev() to process NAPI contexts even for drivers not providing ndo_poll_controller(), allowing for following patches in NAPI drivers. Also we export netpoll_poll_dev() so that it can be called by bonding/team drivers in following patches. Reported-by: Song Liu <songliubraving@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Song Liu <songliubraving@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rds: Fix build regression.David S. Miller2018-09-231-1/+1
| | | | | | | Use DECLARE_* not DEFINE_* Fixes: 8360ed6745df ("RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats") Signed-off-by: David S. Miller <davem@davemloft.net>
* net-ethtool: ETHTOOL_GUFO did not and should not require CAP_NET_ADMINMaciej Żenczykowski2018-09-221-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | So it should not fail with EPERM even though it is no longer implemented... This is a fix for: (userns)$ egrep ^Cap /proc/self/status CapInh: 0000003fffffffff CapPrm: 0000003fffffffff CapEff: 0000003fffffffff CapBnd: 0000003fffffffff CapAmb: 0000003fffffffff (userns)$ tcpdump -i usb_rndis0 tcpdump: WARNING: usb_rndis0: SIOCETHTOOL(ETHTOOL_GUFO) ioctl failed: Operation not permitted Warning: Kernel filter failed: Bad file descriptor tcpdump: can't remove kernel filter: Bad file descriptor With this change it returns EOPNOTSUPP instead of EPERM. See also https://github.com/the-tcpdump-group/libpcap/issues/689 Fixes: 08a00fea6de2 "net: Remove references to NETIF_F_UFO from ethtool." Cc: David S. Miller <davem@davemloft.net> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_statsNathan Chancellor2018-09-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clang warns when two declarations' section attributes don't match. net/rds/ib_stats.c:40:1: warning: section does not match previous declaration [-Wsection] DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_ib_statistics, rds_ib_stats); ^ ./include/linux/percpu-defs.h:142:2: note: expanded from macro 'DEFINE_PER_CPU_SHARED_ALIGNED' DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \ ^ ./include/linux/percpu-defs.h:93:9: note: expanded from macro 'DEFINE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name; \ ^ ./include/linux/percpu-defs.h:49:26: note: expanded from macro '__PCPU_ATTRS' __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ ^ net/rds/ib.h:446:1: note: previous attribute is here DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats); ^ ./include/linux/percpu-defs.h:111:2: note: expanded from macro 'DECLARE_PER_CPU' DECLARE_PER_CPU_SECTION(type, name, "") ^ ./include/linux/percpu-defs.h:87:9: note: expanded from macro 'DECLARE_PER_CPU_SECTION' extern __PCPU_ATTRS(sec) __typeof__(type) name ^ ./include/linux/percpu-defs.h:49:26: note: expanded from macro '__PCPU_ATTRS' __percpu __attribute__((section(PER_CPU_BASE_SECTION sec))) \ ^ 1 warning generated. The initial definition was added in commit ec16227e1414 ("RDS/IB: Infiniband transport") and the cache aligned definition was added in commit e6babe4cc4ce ("RDS/IB: Stats and sysctls") right after. The definition probably should have been updated in net/rds/ib.h, which is what this patch does. Link: https://github.com/ClangBuiltLinux/linux/issues/114 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>