summaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/bonding.rst11
-rw-r--r--Documentation/networking/devlink/index.rst16
-rw-r--r--Documentation/networking/dsa/sja1105.rst27
-rw-r--r--Documentation/networking/ethtool-netlink.rst19
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/ip-sysctl.rst23
-rw-r--r--Documentation/networking/mctp.rst48
-rw-r--r--Documentation/networking/page_pool.rst56
-rw-r--r--Documentation/networking/smc-sysctl.rst23
-rw-r--r--Documentation/networking/timestamping.rst2
10 files changed, 225 insertions, 1 deletions
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index ab98373535ea..525e6842dd33 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -313,6 +313,17 @@ arp_ip_target
maximum number of targets that can be specified is 16. The
default value is no IP addresses.
+ns_ip6_target
+
+ Specifies the IPv6 addresses to use as IPv6 monitoring peers when
+ arp_interval is > 0. These are the targets of the NS request
+ sent to determine the health of the link to the targets.
+ Specify these values in ffff:ffff::ffff:ffff format. Multiple IPv6
+ addresses must be separated by a comma. At least one IPv6
+ address must be given for NS/NA monitoring to function. The
+ maximum number of targets that can be specified is 16. The
+ default value is no IPv6 addresses.
+
arp_validate
Specifies whether or not ARP probes and replies should be
diff --git a/Documentation/networking/devlink/index.rst b/Documentation/networking/devlink/index.rst
index 443123772f44..c17cdb079611 100644
--- a/Documentation/networking/devlink/index.rst
+++ b/Documentation/networking/devlink/index.rst
@@ -4,6 +4,22 @@ Linux Devlink Documentation
devlink is an API to expose device information and resources not directly
related to any device class, such as chip-wide/switch-ASIC-wide configuration.
+Locking
+-------
+
+Driver facing APIs are currently transitioning to allow more explicit
+locking. Drivers can use the existing ``devlink_*`` set of APIs, or
+new APIs prefixed by ``devl_*``. The older APIs handle all the locking
+in devlink core, but don't allow registration of most sub-objects once
+the main devlink object is itself registered. The newer ``devl_*`` APIs assume
+the devlink instance lock is already held. Drivers can take the instance
+lock by calling ``devl_lock()``. It is also held in most of the callbacks.
+Eventually all callbacks will be invoked under the devlink instance lock,
+refer to the use of the ``DEVLINK_NL_FLAG_NO_LOCK`` flag in devlink core
+to find out which callbacks are not converted, yet.
+
+Drivers are encouraged to use the devlink instance lock for their own needs.
+
Interface documentation
-----------------------
diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst
index 29b1bae0cf00..e0219c1452ab 100644
--- a/Documentation/networking/dsa/sja1105.rst
+++ b/Documentation/networking/dsa/sja1105.rst
@@ -293,6 +293,33 @@ of dropped frames, which is a sum of frames dropped due to timing violations,
lack of destination ports and MTU enforcement checks). Byte-level counters are
not available.
+Limitations
+===========
+
+The SJA1105 switch family always performs VLAN processing. When configured as
+VLAN-unaware, frames carry a different VLAN tag internally, depending on
+whether the port is standalone or under a VLAN-unaware bridge.
+
+The virtual link keys are always fixed at {MAC DA, VLAN ID, VLAN PCP}, but the
+driver asks for the VLAN ID and VLAN PCP when the port is under a VLAN-aware
+bridge. Otherwise, it fills in the VLAN ID and PCP automatically, based on
+whether the port is standalone or in a VLAN-unaware bridge, and accepts only
+"VLAN-unaware" tc-flower keys (MAC DA).
+
+The existing tc-flower keys that are offloaded using virtual links are no
+longer operational after one of the following happens:
+
+- port was standalone and joins a bridge (VLAN-aware or VLAN-unaware)
+- port is part of a bridge whose VLAN awareness state changes
+- port was part of a bridge and becomes standalone
+- port was standalone, but another port joins a VLAN-aware bridge and this
+ changes the global VLAN awareness state of the bridge
+
+The driver cannot veto all these operations, and it cannot update/remove the
+existing tc-flower filters either. So for proper operation, the tc-flower
+filters should be installed only after the forwarding configuration of the port
+has been made, and removed by user space before making any changes to it.
+
Device Tree bindings and board design
=====================================
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index 9d98e0511249..24d9be69065d 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -860,8 +860,17 @@ Kernel response contents:
``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
==================================== ====== ===========================
+``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with
+page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``).
+If enabled the device is configured to place frame headers and data into
+separate buffers. The device configuration must make it possible to receive
+full memory pages of data, for example because MTU is high enough or through
+HW-GRO.
+
RINGS_SET
=========
@@ -877,6 +886,7 @@ Request contents:
``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring
``ETHTOOL_A_RINGS_TX`` u32 size of TX ring
``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring
+ ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE
==================================== ====== ===========================
Kernel checks that requested ring sizes do not exceed limits reported by
@@ -884,6 +894,15 @@ driver. Driver may impose additional constraints and may not suspport all
attributes.
+``ETHTOOL_A_RINGS_CQE_SIZE`` specifies the completion queue event size.
+Completion queue events(CQE) are the events posted by NIC to indicate the
+completion status of a packet when the packet is sent(like send success or
+error) or received(like pointers to packet fragments). The CQE size parameter
+enables to modify the CQE size other than default size if NIC supports it.
+A bigger CQE can have more receive buffer pointers inturn NIC can transfer
+a bigger frame from wire. Based on the NIC hardware, the overall completion
+queue size can be adjusted in the driver if CQE size is modified.
+
CHANNELS_GET
============
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 58bc8cd367c6..ce017136ab05 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -96,6 +96,7 @@ Contents:
sctp
secid
seg6-sysctl
+ smc-sysctl
statistics
strparser
switchdev
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 2572eecc3e86..b0024aa7b051 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -878,6 +878,29 @@ tcp_min_tso_segs - INTEGER
Default: 2
+tcp_tso_rtt_log - INTEGER
+ Adjustment of TSO packet sizes based on min_rtt
+
+ Starting from linux-5.18, TCP autosizing can be tweaked
+ for flows having small RTT.
+
+ Old autosizing was splitting the pacing budget to send 1024 TSO
+ per second.
+
+ tso_packet_size = sk->sk_pacing_rate / 1024;
+
+ With the new mechanism, we increase this TSO sizing using:
+
+ distance = min_rtt_usec / (2^tcp_tso_rtt_log)
+ tso_packet_size += gso_max_size >> distance;
+
+ This means that flows between very close hosts can use bigger
+ TSO packets, reducing their cpu costs.
+
+ If you want to use the old autosizing, set this sysctl to 0.
+
+ Default: 9 (2^9 = 512 usec)
+
tcp_pacing_ss_ratio - INTEGER
sk->sk_pacing_rate is set by TCP stack using a ratio applied
to current rate. (current_rate = cwnd * mss / srtt)
diff --git a/Documentation/networking/mctp.rst b/Documentation/networking/mctp.rst
index 46f74bffce0f..c628cb5406d2 100644
--- a/Documentation/networking/mctp.rst
+++ b/Documentation/networking/mctp.rst
@@ -212,6 +212,54 @@ remote address is already known, or the message does not require a reply.
Like the send calls, sockets will only receive responses to requests they have
sent (TO=1) and may only respond (TO=0) to requests they have received.
+``ioctl(SIOCMCTPALLOCTAG)`` and ``ioctl(SIOCMCTPDROPTAG)``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+These tags give applications more control over MCTP message tags, by allocating
+(and dropping) tag values explicitly, rather than the kernel automatically
+allocating a per-message tag at ``sendmsg()`` time.
+
+In general, you will only need to use these ioctls if your MCTP protocol does
+not fit the usual request/response model. For example, if you need to persist
+tags across multiple requests, or a request may generate more than one response.
+In these cases, the ioctls allow you to decouple the tag allocation (and
+release) from individual message send and receive operations.
+
+Both ioctls are passed a pointer to a ``struct mctp_ioc_tag_ctl``:
+
+.. code-block:: C
+
+ struct mctp_ioc_tag_ctl {
+ mctp_eid_t peer_addr;
+ __u8 tag;
+ __u16 flags;
+ };
+
+``SIOCMCTPALLOCTAG`` allocates a tag for a specific peer, which an application
+can use in future ``sendmsg()`` calls. The application populates the
+``peer_addr`` member with the remote EID. Other fields must be zero.
+
+On return, the ``tag`` member will be populated with the allocated tag value.
+The allocated tag will have the following tag bits set:
+
+ - ``MCTP_TAG_OWNER``: it only makes sense to allocate tags if you're the tag
+ owner
+
+ - ``MCTP_TAG_PREALLOC``: to indicate to ``sendmsg()`` that this is a
+ preallocated tag.
+
+ - ... and the actual tag value, within the least-significant three bits
+ (``MCTP_TAG_MASK``). Note that zero is a valid tag value.
+
+The tag value should be used as-is for the ``smctp_tag`` member of ``struct
+sockaddr_mctp``.
+
+``SIOCMCTPDROPTAG`` releases a tag that has been previously allocated by a
+``SIOCMCTPALLOCTAG`` ioctl. The ``peer_addr`` must be the same as used for the
+allocation, and the ``tag`` value must match exactly the tag returned from the
+allocation (including the ``MCTP_TAG_OWNER`` and ``MCTP_TAG_PREALLOC`` bits).
+The ``flags`` field must be zero.
+
Kernel internals
================
diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst
index a147591ce203..5db8c263b0c6 100644
--- a/Documentation/networking/page_pool.rst
+++ b/Documentation/networking/page_pool.rst
@@ -105,6 +105,47 @@ a page will cause no race conditions is enough.
Please note the caller must not use data area after running
page_pool_put_page_bulk(), as this function overwrites it.
+* page_pool_get_stats(): Retrieve statistics about the page_pool. This API
+ is only available if the kernel has been configured with
+ ``CONFIG_PAGE_POOL_STATS=y``. A pointer to a caller allocated ``struct
+ page_pool_stats`` structure is passed to this API which is filled in. The
+ caller can then report those stats to the user (perhaps via ethtool,
+ debugfs, etc.). See below for an example usage of this API.
+
+Stats API and structures
+------------------------
+If the kernel is configured with ``CONFIG_PAGE_POOL_STATS=y``, the API
+``page_pool_get_stats()`` and structures described below are available. It
+takes a pointer to a ``struct page_pool`` and a pointer to a ``struct
+page_pool_stats`` allocated by the caller.
+
+The API will fill in the provided ``struct page_pool_stats`` with
+statistics about the page_pool.
+
+The stats structure has the following fields::
+
+ struct page_pool_stats {
+ struct page_pool_alloc_stats alloc_stats;
+ struct page_pool_recycle_stats recycle_stats;
+ };
+
+
+The ``struct page_pool_alloc_stats`` has the following fields:
+ * ``fast``: successful fast path allocations
+ * ``slow``: slow path order-0 allocations
+ * ``slow_high_order``: slow path high order allocations
+ * ``empty``: ptr ring is empty, so a slow path allocation was forced.
+ * ``refill``: an allocation which triggered a refill of the cache
+ * ``waive``: pages obtained from the ptr ring that cannot be added to
+ the cache due to a NUMA mismatch.
+
+The ``struct page_pool_recycle_stats`` has the following fields:
+ * ``cached``: recycling placed page in the page pool cache
+ * ``cache_full``: page pool cache was full
+ * ``ring``: page placed into the ptr ring
+ * ``ring_full``: page released from page pool because the ptr ring was full
+ * ``released_refcnt``: page released (and not recycled) because refcnt > 1
+
Coding examples
===============
@@ -157,6 +198,21 @@ NAPI poller
}
}
+Stats
+-----
+
+.. code-block:: c
+
+ #ifdef CONFIG_PAGE_POOL_STATS
+ /* retrieve stats */
+ struct page_pool_stats stats = { 0 };
+ if (page_pool_get_stats(page_pool, &stats)) {
+ /* perhaps the driver reports statistics with ethool */
+ ethtool_print_allocation_stats(&stats.alloc_stats);
+ ethtool_print_recycle_stats(&stats.recycle_stats);
+ }
+ #endif
+
Driver unload
-------------
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
new file mode 100644
index 000000000000..0987fd1bc220
--- /dev/null
+++ b/Documentation/networking/smc-sysctl.rst
@@ -0,0 +1,23 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+SMC Sysctl
+==========
+
+/proc/sys/net/smc/* Variables
+=============================
+
+autocorking_size - INTEGER
+ Setting SMC auto corking size:
+ SMC auto corking is like TCP auto corking from the application's
+ perspective of view. When applications do consecutive small
+ write()/sendmsg() system calls, we try to coalesce these small writes
+ as much as possible, to lower total amount of CDC and RDMA Write been
+ sent.
+ autocorking_size limits the maximum corked bytes that can be sent to
+ the under device in 1 single sending. If set to 0, the SMC auto corking
+ is disabled.
+ Applications can still use TCP_CORK for optimal behavior when they
+ know how/when to uncork their sockets.
+
+ Default: 64K
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index f5809206eb93..be4eb1242057 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -668,7 +668,7 @@ timestamping:
(through another RX timestamping FIFO). Deferral on RX is typically
necessary when retrieving the timestamp needs a sleepable context. In
that case, it is the responsibility of the DSA driver to call
- ``netif_rx_ni()`` on the freshly timestamped skb.
+ ``netif_rx()`` on the freshly timestamped skb.
3.2.2 Ethernet PHYs
^^^^^^^^^^^^^^^^^^^