diff options
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/bonding.rst | 11 | ||||
-rw-r--r-- | Documentation/networking/devlink/index.rst | 16 | ||||
-rw-r--r-- | Documentation/networking/dsa/sja1105.rst | 27 | ||||
-rw-r--r-- | Documentation/networking/ethtool-netlink.rst | 19 | ||||
-rw-r--r-- | Documentation/networking/index.rst | 1 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.rst | 23 | ||||
-rw-r--r-- | Documentation/networking/mctp.rst | 48 | ||||
-rw-r--r-- | Documentation/networking/page_pool.rst | 56 | ||||
-rw-r--r-- | Documentation/networking/smc-sysctl.rst | 23 | ||||
-rw-r--r-- | Documentation/networking/timestamping.rst | 2 |
10 files changed, 225 insertions, 1 deletions
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst index ab98373535ea..525e6842dd33 100644 --- a/Documentation/networking/bonding.rst +++ b/Documentation/networking/bonding.rst @@ -313,6 +313,17 @@ arp_ip_target maximum number of targets that can be specified is 16. The default value is no IP addresses. +ns_ip6_target + + Specifies the IPv6 addresses to use as IPv6 monitoring peers when + arp_interval is > 0. These are the targets of the NS request + sent to determine the health of the link to the targets. + Specify these values in ffff:ffff::ffff:ffff format. Multiple IPv6 + addresses must be separated by a comma. At least one IPv6 + address must be given for NS/NA monitoring to function. The + maximum number of targets that can be specified is 16. The + default value is no IPv6 addresses. + arp_validate Specifies whether or not ARP probes and replies should be diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst index 443123772f44..c17cdb079611 100644 --- a/Documentation/networking/devlink/index.rst +++ b/Documentation/networking/devlink/index.rst @@ -4,6 +4,22 @@ Linux Devlink Documentation devlink is an API to expose device information and resources not directly related to any device class, such as chip-wide/switch-ASIC-wide configuration. +Locking +------- + +Driver facing APIs are currently transitioning to allow more explicit +locking. Drivers can use the existing ``devlink_*`` set of APIs, or +new APIs prefixed by ``devl_*``. The older APIs handle all the locking +in devlink core, but don't allow registration of most sub-objects once +the main devlink object is itself registered. The newer ``devl_*`` APIs assume +the devlink instance lock is already held. Drivers can take the instance +lock by calling ``devl_lock()``. It is also held in most of the callbacks. +Eventually all callbacks will be invoked under the devlink instance lock, +refer to the use of the ``DEVLINK_NL_FLAG_NO_LOCK`` flag in devlink core +to find out which callbacks are not converted, yet. + +Drivers are encouraged to use the devlink instance lock for their own needs. + Interface documentation ----------------------- diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst index 29b1bae0cf00..e0219c1452ab 100644 --- a/Documentation/networking/dsa/sja1105.rst +++ b/Documentation/networking/dsa/sja1105.rst @@ -293,6 +293,33 @@ of dropped frames, which is a sum of frames dropped due to timing violations, lack of destination ports and MTU enforcement checks). Byte-level counters are not available. +Limitations +=========== + +The SJA1105 switch family always performs VLAN processing. When configured as +VLAN-unaware, frames carry a different VLAN tag internally, depending on +whether the port is standalone or under a VLAN-unaware bridge. + +The virtual link keys are always fixed at {MAC DA, VLAN ID, VLAN PCP}, but the +driver asks for the VLAN ID and VLAN PCP when the port is under a VLAN-aware +bridge. Otherwise, it fills in the VLAN ID and PCP automatically, based on +whether the port is standalone or in a VLAN-unaware bridge, and accepts only +"VLAN-unaware" tc-flower keys (MAC DA). + +The existing tc-flower keys that are offloaded using virtual links are no +longer operational after one of the following happens: + +- port was standalone and joins a bridge (VLAN-aware or VLAN-unaware) +- port is part of a bridge whose VLAN awareness state changes +- port was part of a bridge and becomes standalone +- port was standalone, but another port joins a VLAN-aware bridge and this + changes the global VLAN awareness state of the bridge + +The driver cannot veto all these operations, and it cannot update/remove the +existing tc-flower filters either. So for proper operation, the tc-flower +filters should be installed only after the forwarding configuration of the port +has been made, and removed by user space before making any changes to it. + Device Tree bindings and board design ===================================== diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index 9d98e0511249..24d9be69065d 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -860,8 +860,17 @@ Kernel response contents: ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring + ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split + ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE ==================================== ====== =========================== +``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with +page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``). +If enabled the device is configured to place frame headers and data into +separate buffers. The device configuration must make it possible to receive +full memory pages of data, for example because MTU is high enough or through +HW-GRO. + RINGS_SET ========= @@ -877,6 +886,7 @@ Request contents: ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring + ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE ==================================== ====== =========================== Kernel checks that requested ring sizes do not exceed limits reported by @@ -884,6 +894,15 @@ driver. Driver may impose additional constraints and may not suspport all attributes. +``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size. +Completion queue events(CQE) are the events posted by NIC to indicate the +completion status of a packet when the packet is sent(like send success or +error) or received(like pointers to packet fragments). The CQE size parameter +enables to modify the CQE size other than default size if NIC supports it. +A bigger CQE can have more receive buffer pointers inturn NIC can transfer +a bigger frame from wire. Based on the NIC hardware, the overall completion +queue size can be adjusted in the driver if CQE size is modified. + CHANNELS_GET ============ diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 58bc8cd367c6..ce017136ab05 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -96,6 +96,7 @@ Contents: sctp secid seg6-sysctl + smc-sysctl statistics strparser switchdev diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst index 2572eecc3e86..b0024aa7b051 100644 --- a/Documentation/networking/ip-sysctl.rst +++ b/Documentation/networking/ip-sysctl.rst @@ -878,6 +878,29 @@ tcp_min_tso_segs - INTEGER Default: 2 +tcp_tso_rtt_log - INTEGER + Adjustment of TSO packet sizes based on min_rtt + + Starting from linux-5.18, TCP autosizing can be tweaked + for flows having small RTT. + + Old autosizing was splitting the pacing budget to send 1024 TSO + per second. + + tso_packet_size = sk->sk_pacing_rate / 1024; + + With the new mechanism, we increase this TSO sizing using: + + distance = min_rtt_usec / (2^tcp_tso_rtt_log) + tso_packet_size += gso_max_size >> distance; + + This means that flows between very close hosts can use bigger + TSO packets, reducing their cpu costs. + + If you want to use the old autosizing, set this sysctl to 0. + + Default: 9 (2^9 = 512 usec) + tcp_pacing_ss_ratio - INTEGER sk->sk_pacing_rate is set by TCP stack using a ratio applied to current rate. (current_rate = cwnd * mss / srtt) diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst index 46f74bffce0f..c628cb5406d2 100644 --- a/Documentation/networking/mctp.rst +++ b/Documentation/networking/mctp.rst @@ -212,6 +212,54 @@ remote address is already known, or the message does not require a reply. Like the send calls, sockets will only receive responses to requests they have sent (TO=1) and may only respond (TO=0) to requests they have received. +``ioctl(SIOCMCTPALLOCTAG)`` and ``ioctl(SIOCMCTPDROPTAG)`` +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +These tags give applications more control over MCTP message tags, by allocating +(and dropping) tag values explicitly, rather than the kernel automatically +allocating a per-message tag at ``sendmsg()`` time. + +In general, you will only need to use these ioctls if your MCTP protocol does +not fit the usual request/response model. For example, if you need to persist +tags across multiple requests, or a request may generate more than one response. +In these cases, the ioctls allow you to decouple the tag allocation (and +release) from individual message send and receive operations. + +Both ioctls are passed a pointer to a ``struct mctp_ioc_tag_ctl``: + +.. code-block:: C + + struct mctp_ioc_tag_ctl { + mctp_eid_t peer_addr; + __u8 tag; + __u16 flags; + }; + +``SIOCMCTPALLOCTAG`` allocates a tag for a specific peer, which an application +can use in future ``sendmsg()`` calls. The application populates the +``peer_addr`` member with the remote EID. Other fields must be zero. + +On return, the ``tag`` member will be populated with the allocated tag value. +The allocated tag will have the following tag bits set: + + - ``MCTP_TAG_OWNER``: it only makes sense to allocate tags if you're the tag + owner + + - ``MCTP_TAG_PREALLOC``: to indicate to ``sendmsg()`` that this is a + preallocated tag. + + - ... and the actual tag value, within the least-significant three bits + (``MCTP_TAG_MASK``). Note that zero is a valid tag value. + +The tag value should be used as-is for the ``smctp_tag`` member of ``struct +sockaddr_mctp``. + +``SIOCMCTPDROPTAG`` releases a tag that has been previously allocated by a +``SIOCMCTPALLOCTAG`` ioctl. The ``peer_addr`` must be the same as used for the +allocation, and the ``tag`` value must match exactly the tag returned from the +allocation (including the ``MCTP_TAG_OWNER`` and ``MCTP_TAG_PREALLOC`` bits). +The ``flags`` field must be zero. + Kernel internals ================ diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst index a147591ce203..5db8c263b0c6 100644 --- a/Documentation/networking/page_pool.rst +++ b/Documentation/networking/page_pool.rst @@ -105,6 +105,47 @@ a page will cause no race conditions is enough. Please note the caller must not use data area after running page_pool_put_page_bulk(), as this function overwrites it. +* page_pool_get_stats(): Retrieve statistics about the page_pool. This API + is only available if the kernel has been configured with + ``CONFIG_PAGE_POOL_STATS=y``. A pointer to a caller allocated ``struct + page_pool_stats`` structure is passed to this API which is filled in. The + caller can then report those stats to the user (perhaps via ethtool, + debugfs, etc.). See below for an example usage of this API. + +Stats API and structures +------------------------ +If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API +``page_pool_get_stats()`` and structures described below are available. It +takes a pointer to a ``struct page_pool`` and a pointer to a ``struct +page_pool_stats`` allocated by the caller. + +The API will fill in the provided ``struct page_pool_stats`` with +statistics about the page_pool. + +The stats structure has the following fields:: + + struct page_pool_stats { + struct page_pool_alloc_stats alloc_stats; + struct page_pool_recycle_stats recycle_stats; + }; + + +The ``struct page_pool_alloc_stats`` has the following fields: + * ``fast``: successful fast path allocations + * ``slow``: slow path order-0 allocations + * ``slow_high_order``: slow path high order allocations + * ``empty``: ptr ring is empty, so a slow path allocation was forced. + * ``refill``: an allocation which triggered a refill of the cache + * ``waive``: pages obtained from the ptr ring that cannot be added to + the cache due to a NUMA mismatch. + +The ``struct page_pool_recycle_stats`` has the following fields: + * ``cached``: recycling placed page in the page pool cache + * ``cache_full``: page pool cache was full + * ``ring``: page placed into the ptr ring + * ``ring_full``: page released from page pool because the ptr ring was full + * ``released_refcnt``: page released (and not recycled) because refcnt > 1 + Coding examples =============== @@ -157,6 +198,21 @@ NAPI poller } } +Stats +----- + +.. code-block:: c + + #ifdef CONFIG_PAGE_POOL_STATS + /* retrieve stats */ + struct page_pool_stats stats = { 0 }; + if (page_pool_get_stats(page_pool, &stats)) { + /* perhaps the driver reports statistics with ethool */ + ethtool_print_allocation_stats(&stats.alloc_stats); + ethtool_print_recycle_stats(&stats.recycle_stats); + } + #endif + Driver unload ------------- diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst new file mode 100644 index 000000000000..0987fd1bc220 --- /dev/null +++ b/Documentation/networking/smc-sysctl.rst @@ -0,0 +1,23 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========== +SMC Sysctl +========== + +/proc/sys/net/smc/* Variables +============================= + +autocorking_size - INTEGER + Setting SMC auto corking size: + SMC auto corking is like TCP auto corking from the application's + perspective of view. When applications do consecutive small + write()/sendmsg() system calls, we try to coalesce these small writes + as much as possible, to lower total amount of CDC and RDMA Write been + sent. + autocorking_size limits the maximum corked bytes that can be sent to + the under device in 1 single sending. If set to 0, the SMC auto corking + is disabled. + Applications can still use TCP_CORK for optimal behavior when they + know how/when to uncork their sockets. + + Default: 64K diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst index f5809206eb93..be4eb1242057 100644 --- a/Documentation/networking/timestamping.rst +++ b/Documentation/networking/timestamping.rst @@ -668,7 +668,7 @@ timestamping: (through another RX timestamping FIFO). Deferral on RX is typically necessary when retrieving the timestamp needs a sleepable context. In that case, it is the responsibility of the DSA driver to call - ``netif_rx_ni()`` on the freshly timestamped skb. + ``netif_rx()`` on the freshly timestamped skb. 3.2.2 Ethernet PHYs ^^^^^^^^^^^^^^^^^^^ |