diff options
Diffstat (limited to 'Documentation/networking/devlink/devlink-port.rst')
-rw-r--r-- | Documentation/networking/devlink/devlink-port.rst | 168 |
1 files changed, 161 insertions, 7 deletions
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst index 7627b1da01f2..3da590953ce8 100644 --- a/Documentation/networking/devlink/devlink-port.rst +++ b/Documentation/networking/devlink/devlink-port.rst @@ -110,7 +110,7 @@ devlink ports for both the controllers. Function configuration ====================== -A user can configure the function attribute before enumerating the PCI +Users can configure one or more function attributes before enumerating the PCI function. Usually it means, user should configure function attribute before a bus specific device for the function is created. However, when SRIOV is enabled, virtual function devices are created on the PCI bus. @@ -119,9 +119,127 @@ function device to the driver. For subfunctions, this means user should configure port function attribute before activating the port function. A user may set the hardware address of the function using -'devlink port function set hw_addr' command. For Ethernet port function +`devlink port function set hw_addr` command. For Ethernet port function this means a MAC address. +Users may also set the RoCE capability of the function using +`devlink port function set roce` command. + +Users may also set the function as migratable using +'devlink port function set migratable' command. + +Function attributes +=================== + +MAC address setup +----------------- +The configured MAC address of the PCI VF/SF will be used by netdevice and rdma +device created for the PCI VF/SF. + +- Get the MAC address of the VF identified by its unique devlink port index:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 + +- Set the MAC address of the VF identified by its unique devlink port index:: + + $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55 + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:11:22:33:44:55 + +- Get the MAC address of the SF identified by its unique devlink port index:: + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:00:00 + +- Set the MAC address of the SF identified by its unique devlink port index:: + + $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88 + + $ devlink port show pci/0000:06:00.0/32768 + pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88 + function: + hw_addr 00:00:00:00:88:88 + +RoCE capability setup +--------------------- +Not all PCI VFs/SFs require RoCE capability. + +When RoCE capability is disabled, it saves system memory per PCI VF/SF. + +When user disables RoCE capability for a VF/SF, user application cannot send or +receive any RoCE packets through this VF/SF and RoCE GID table for this PCI +will be empty. + +When RoCE capability is disabled in the device using port function attribute, +VF/SF driver cannot override it. + +- Get RoCE capability of the VF device:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 roce enable + +- Set RoCE capability of the VF device:: + + $ devlink port function set pci/0000:06:00.0/2 roce disable + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 roce disable + +migratable capability setup +--------------------------- +Live migration is the process of transferring a live virtual machine +from one physical host to another without disrupting its normal +operation. + +User who want PCI VFs to be able to perform live migration need to +explicitly enable the VF migratable capability. + +When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver +with migration support, the user can migrate the VM with this VF from one HV to a +different one. + +However, when migratable capability is enable, device will disable features which cannot +be migrated. Thus migratable cap can impose limitations on a VF so let the user decide. + +Example of LM with migratable function configuration: +- Get migratable capability of the VF device:: + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 migratable disable + +- Set migratable capability of the VF device:: + + $ devlink port function set pci/0000:06:00.0/2 migratable enable + + $ devlink port show pci/0000:06:00.0/2 + pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1 + function: + hw_addr 00:00:00:00:00:00 migratable enable + +- Bind VF to VFIO driver with migration support:: + + $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind + $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override + $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind + +Attach VF to the VM. +Start the VM. +Perform live migration. + Subfunction ============ @@ -130,10 +248,11 @@ it is deployed. Subfunction is created and deployed in unit of 1. Unlike SRIOV VFs, a subfunction doesn't require its own PCI virtual function. A subfunction communicates with the hardware through the parent PCI function. -To use a subfunction, 3 steps setup sequence is followed. -(1) create - create a subfunction; -(2) configure - configure subfunction attributes; -(3) deploy - deploy the subfunction; +To use a subfunction, 3 steps setup sequence is followed: + +1) create - create a subfunction; +2) configure - configure subfunction attributes; +3) deploy - deploy the subfunction; Subfunction management is done using devlink port user interface. User performs setup on the subfunction management device. @@ -191,13 +310,48 @@ API allows to configure following rate object's parameters: ``tx_max`` Maximum TX rate value. +``tx_priority`` + Allows for usage of strict priority arbiter among siblings. This + arbitration scheme attempts to schedule nodes based on their priority + as long as the nodes remain within their bandwidth limit. The higher the + priority the higher the probability that the node will get selected for + scheduling. + +``tx_weight`` + Allows for usage of Weighted Fair Queuing arbitration scheme among + siblings. This arbitration scheme can be used simultaneously with the + strict priority. As a node is configured with a higher rate it gets more + BW relative to it's siblings. Values are relative like a percentage + points, they basically tell how much BW should node take relative to + it's siblings. + ``parent`` Parent node name. Parent node rate limits are considered as additional limits to all node children limits. ``tx_max`` is an upper limit for children. ``tx_share`` is a total bandwidth distributed among children. +``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case +nodes with the same priority form a WFQ subgroup in the sibling group +and arbitration among them is based on assigned weights. + +Arbitration flow from the high level: + +#. Choose a node, or group of nodes with the highest priority that stays + within the BW limit and are not blocked. Use ``tx_priority`` as a + parameter for this arbitration. + +#. If group of nodes have the same priority perform WFQ arbitration on + that subgroup. Use ``tx_weight`` as a parameter for this arbitration. + +#. Select the winner node, and continue arbitration flow among it's children, + until leaf node is reached, and the winner is established. + +#. If all the nodes from the highest priority sub-group are satisfied, or + overused their assigned BW, move to the lower priority nodes. + Driver implementations are allowed to support both or either rate object types -and setting methods of their parameters. +and setting methods of their parameters. Additionally driver implementation +may export nodes/leafs and their child-parent relationships. Terms and Definitions ===================== |