diff options
Diffstat (limited to 'Documentation/bpf')
-rw-r--r-- | Documentation/bpf/bpf_prog_run.rst | 117 | ||||
-rw-r--r-- | Documentation/bpf/btf.rst | 45 | ||||
-rw-r--r-- | Documentation/bpf/index.rst | 1 | ||||
-rw-r--r-- | Documentation/bpf/instruction-set.rst | 215 | ||||
-rw-r--r-- | Documentation/bpf/verifier.rst | 2 |
5 files changed, 297 insertions, 83 deletions
diff --git a/Documentation/bpf/bpf_prog_run.rst b/Documentation/bpf/bpf_prog_run.rst new file mode 100644 index 000000000000..4868c909df5c --- /dev/null +++ b/Documentation/bpf/bpf_prog_run.rst @@ -0,0 +1,117 @@ +.. SPDX-License-Identifier: GPL-2.0 + +=================================== +Running BPF programs from userspace +=================================== + +This document describes the ``BPF_PROG_RUN`` facility for running BPF programs +from userspace. + +.. contents:: + :local: + :depth: 2 + + +Overview +-------- + +The ``BPF_PROG_RUN`` command can be used through the ``bpf()`` syscall to +execute a BPF program in the kernel and return the results to userspace. This +can be used to unit test BPF programs against user-supplied context objects, and +as way to explicitly execute programs in the kernel for their side effects. The +command was previously named ``BPF_PROG_TEST_RUN``, and both constants continue +to be defined in the UAPI header, aliased to the same value. + +The ``BPF_PROG_RUN`` command can be used to execute BPF programs of the +following types: + +- ``BPF_PROG_TYPE_SOCKET_FILTER`` +- ``BPF_PROG_TYPE_SCHED_CLS`` +- ``BPF_PROG_TYPE_SCHED_ACT`` +- ``BPF_PROG_TYPE_XDP`` +- ``BPF_PROG_TYPE_SK_LOOKUP`` +- ``BPF_PROG_TYPE_CGROUP_SKB`` +- ``BPF_PROG_TYPE_LWT_IN`` +- ``BPF_PROG_TYPE_LWT_OUT`` +- ``BPF_PROG_TYPE_LWT_XMIT`` +- ``BPF_PROG_TYPE_LWT_SEG6LOCAL`` +- ``BPF_PROG_TYPE_FLOW_DISSECTOR`` +- ``BPF_PROG_TYPE_STRUCT_OPS`` +- ``BPF_PROG_TYPE_RAW_TRACEPOINT`` +- ``BPF_PROG_TYPE_SYSCALL`` + +When using the ``BPF_PROG_RUN`` command, userspace supplies an input context +object and (for program types operating on network packets) a buffer containing +the packet data that the BPF program will operate on. The kernel will then +execute the program and return the results to userspace. Note that programs will +not have any side effects while being run in this mode; in particular, packets +will not actually be redirected or dropped, the program return code will just be +returned to userspace. A separate mode for live execution of XDP programs is +provided, documented separately below. + +Running XDP programs in "live frame mode" +----------------------------------------- + +The ``BPF_PROG_RUN`` command has a separate mode for running live XDP programs, +which can be used to execute XDP programs in a way where packets will actually +be processed by the kernel after the execution of the XDP program as if they +arrived on a physical interface. This mode is activated by setting the +``BPF_F_TEST_XDP_LIVE_FRAMES`` flag when supplying an XDP program to +``BPF_PROG_RUN``. + +The live packet mode is optimised for high performance execution of the supplied +XDP program many times (suitable for, e.g., running as a traffic generator), +which means the semantics are not quite as straight-forward as the regular test +run mode. Specifically: + +- When executing an XDP program in live frame mode, the result of the execution + will not be returned to userspace; instead, the kernel will perform the + operation indicated by the program's return code (drop the packet, redirect + it, etc). For this reason, setting the ``data_out`` or ``ctx_out`` attributes + in the syscall parameters when running in this mode will be rejected. In + addition, not all failures will be reported back to userspace directly; + specifically, only fatal errors in setup or during execution (like memory + allocation errors) will halt execution and return an error. If an error occurs + in packet processing, like a failure to redirect to a given interface, + execution will continue with the next repetition; these errors can be detected + via the same trace points as for regular XDP programs. + +- Userspace can supply an ifindex as part of the context object, just like in + the regular (non-live) mode. The XDP program will be executed as though the + packet arrived on this interface; i.e., the ``ingress_ifindex`` of the context + object will point to that interface. Furthermore, if the XDP program returns + ``XDP_PASS``, the packet will be injected into the kernel networking stack as + though it arrived on that ifindex, and if it returns ``XDP_TX``, the packet + will be transmitted *out* of that same interface. Do note, though, that + because the program execution is not happening in driver context, an + ``XDP_TX`` is actually turned into the same action as an ``XDP_REDIRECT`` to + that same interface (i.e., it will only work if the driver has support for the + ``ndo_xdp_xmit`` driver op). + +- When running the program with multiple repetitions, the execution will happen + in batches. The batch size defaults to 64 packets (which is same as the + maximum NAPI receive batch size), but can be specified by userspace through + the ``batch_size`` parameter, up to a maximum of 256 packets. For each batch, + the kernel executes the XDP program repeatedly, each invocation getting a + separate copy of the packet data. For each repetition, if the program drops + the packet, the data page is immediately recycled (see below). Otherwise, the + packet is buffered until the end of the batch, at which point all packets + buffered this way during the batch are transmitted at once. + +- When setting up the test run, the kernel will initialise a pool of memory + pages of the same size as the batch size. Each memory page will be initialised + with the initial packet data supplied by userspace at ``BPF_PROG_RUN`` + invocation. When possible, the pages will be recycled on future program + invocations, to improve performance. Pages will generally be recycled a full + batch at a time, except when a packet is dropped (by return code or because + of, say, a redirection error), in which case that page will be recycled + immediately. If a packet ends up being passed to the regular networking stack + (because the XDP program returns ``XDP_PASS``, or because it ends up being + redirected to an interface that injects it into the stack), the page will be + released and a new one will be allocated when the pool is empty. + + When recycling, the page content is not rewritten; only the packet boundary + pointers (``data``, ``data_end`` and ``data_meta``) in the context object will + be reset to the original values. This means that if a program rewrites the + packet contents, it has to be prepared to see either the original content or + the modified version on subsequent invocations. diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst index 1ebf4c5c7ddc..7940da9bc6c1 100644 --- a/Documentation/bpf/btf.rst +++ b/Documentation/bpf/btf.rst @@ -503,6 +503,19 @@ valid index (starting from 0) pointing to a member or an argument. * ``info.vlen``: 0 * ``type``: the type with ``btf_type_tag`` attribute +Currently, ``BTF_KIND_TYPE_TAG`` is only emitted for pointer types. +It has the following btf type chain: +:: + + ptr -> [type_tag]* + -> [const | volatile | restrict | typedef]* + -> base_type + +Basically, a pointer type points to zero or more +type_tag, then zero or more const/volatile/restrict/typedef +and finally the base type. The base type is one of +int, ptr, array, struct, union, enum, func_proto and float types. + 3. BTF Kernel API ================= @@ -565,18 +578,15 @@ A map can be created with ``btf_fd`` and specified key/value type id.:: In libbpf, the map can be defined with extra annotation like below: :: - struct bpf_map_def SEC("maps") btf_map = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(int), - .value_size = sizeof(struct ipv_counts), - .max_entries = 4, - }; - BPF_ANNOTATE_KV_PAIR(btf_map, int, struct ipv_counts); + struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, int); + __type(value, struct ipv_counts); + __uint(max_entries, 4); + } btf_map SEC(".maps"); -Here, the parameters for macro BPF_ANNOTATE_KV_PAIR are map name, key and -value types for the map. During ELF parsing, libbpf is able to extract -key/value type_id's and assign them to BPF_MAP_CREATE attributes -automatically. +During ELF parsing, libbpf is able to extract key/value type_id's and assign +them to BPF_MAP_CREATE attributes automatically. .. _BPF_Prog_Load: @@ -824,13 +834,12 @@ structure has bitfields. For example, for the following map,:: ___A b1:4; enum A b2:4; }; - struct bpf_map_def SEC("maps") tmpmap = { - .type = BPF_MAP_TYPE_ARRAY, - .key_size = sizeof(__u32), - .value_size = sizeof(struct tmp_t), - .max_entries = 1, - }; - BPF_ANNOTATE_KV_PAIR(tmpmap, int, struct tmp_t); + struct { + __uint(type, BPF_MAP_TYPE_ARRAY); + __type(key, int); + __type(value, struct tmp_t); + __uint(max_entries, 1); + } tmpmap SEC(".maps"); bpftool is able to pretty print like below: :: diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst index ef5c996547ec..96056a7447c7 100644 --- a/Documentation/bpf/index.rst +++ b/Documentation/bpf/index.rst @@ -21,6 +21,7 @@ that goes into great technical depth about the BPF Architecture. helpers programs maps + bpf_prog_run classic_vs_extended.rst bpf_licensing test_debug diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst index 3704836fe6df..5300837ac2c9 100644 --- a/Documentation/bpf/instruction-set.rst +++ b/Documentation/bpf/instruction-set.rst @@ -22,7 +22,13 @@ necessary across calls. Instruction encoding ==================== -eBPF uses 64-bit instructions with the following encoding: +eBPF has two instruction encodings: + + * the basic instruction encoding, which uses 64 bits to encode an instruction + * the wide instruction encoding, which appends a second 64-bit immediate value + (imm64) after the basic instruction for a total of 128 bits. + +The basic instruction encoding looks as follows: ============= ======= =============== ==================== ============ 32 bits (MSB) 16 bits 4 bits 4 bits 8 bits (LSB) @@ -82,9 +88,9 @@ BPF_ALU uses 32-bit wide operands while BPF_ALU64 uses 64-bit wide operands for otherwise identical operations. The code field encodes the operation as below: - ======== ===== ========================== + ======== ===== ================================================= code value description - ======== ===== ========================== + ======== ===== ================================================= BPF_ADD 0x00 dst += src BPF_SUB 0x10 dst -= src BPF_MUL 0x20 dst \*= src @@ -98,8 +104,8 @@ The code field encodes the operation as below: BPF_XOR 0xa0 dst ^= src BPF_MOV 0xb0 dst = src BPF_ARSH 0xc0 sign extending shift right - BPF_END 0xd0 endianness conversion - ======== ===== ========================== + BPF_END 0xd0 byte swap operations (see separate section below) + ======== ===== ================================================= BPF_ADD | BPF_X | BPF_ALU means:: @@ -118,6 +124,42 @@ BPF_XOR | BPF_K | BPF_ALU64 means:: src_reg = src_reg ^ imm32 +Byte swap instructions +---------------------- + +The byte swap instructions use an instruction class of ``BFP_ALU`` and a 4-bit +code field of ``BPF_END``. + +The byte swap instructions instructions operate on the destination register +only and do not use a separate source register or immediate value. + +The 1-bit source operand field in the opcode is used to to select what byte +order the operation convert from or to: + + ========= ===== ================================================= + source value description + ========= ===== ================================================= + BPF_TO_LE 0x00 convert between host byte order and little endian + BPF_TO_BE 0x08 convert between host byte order and big endian + ========= ===== ================================================= + +The imm field encodes the width of the swap operations. The following widths +are supported: 16, 32 and 64. + +Examples: + +``BPF_ALU | BPF_TO_LE | BPF_END`` with imm = 16 means:: + + dst_reg = htole16(dst_reg) + +``BPF_ALU | BPF_TO_BE | BPF_END`` with imm = 64 means:: + + dst_reg = htobe64(dst_reg) + +``BPF_FROM_LE`` and ``BPF_FROM_BE`` exist as aliases for ``BPF_TO_LE`` and +``BPF_TO_LE`` respetively. + + Jump instructions ----------------- @@ -176,63 +218,96 @@ The mode modifier is one of: ============= ===== ==================================== mode modifier value description ============= ===== ==================================== - BPF_IMM 0x00 used for 64-bit mov - BPF_ABS 0x20 legacy BPF packet access - BPF_IND 0x40 legacy BPF packet access - BPF_MEM 0x60 all normal load and store operations + BPF_IMM 0x00 64-bit immediate instructions + BPF_ABS 0x20 legacy BPF packet access (absolute) + BPF_IND 0x40 legacy BPF packet access (indirect) + BPF_MEM 0x60 regular load and store operations BPF_ATOMIC 0xc0 atomic operations ============= ===== ==================================== -BPF_MEM | <size> | BPF_STX means:: + +Regular load and store operations +--------------------------------- + +The ``BPF_MEM`` mode modifier is used to encode regular load and store +instructions that transfer data between a register and memory. + +``BPF_MEM | <size> | BPF_STX`` means:: *(size *) (dst_reg + off) = src_reg -BPF_MEM | <size> | BPF_ST means:: +``BPF_MEM | <size> | BPF_ST`` means:: *(size *) (dst_reg + off) = imm32 -BPF_MEM | <size> | BPF_LDX means:: +``BPF_MEM | <size> | BPF_LDX`` means:: dst_reg = *(size *) (src_reg + off) -Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. +Where size is one of: ``BPF_B``, ``BPF_H``, ``BPF_W``, or ``BPF_DW``. Atomic operations ----------------- -eBPF includes atomic operations, which use the immediate field for extra -encoding:: +Atomic operations are operations that operate on memory and can not be +interrupted or corrupted by other access to the same memory region +by other eBPF programs or means outside of this specification. + +All atomic operations supported by eBPF are encoded as store operations +that use the ``BPF_ATOMIC`` mode modifier as follows: + + * ``BPF_ATOMIC | BPF_W | BPF_STX`` for 32-bit operations + * ``BPF_ATOMIC | BPF_DW | BPF_STX`` for 64-bit operations + * 8-bit and 16-bit wide atomic operations are not supported. - .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg - .imm = BPF_ADD, .code = BPF_ATOMIC | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg +The imm field is used to encode the actual atomic operation. +Simple atomic operation use a subset of the values defined to encode +arithmetic operations in the imm field to encode the atomic operation: -The basic atomic operations supported are:: + ======== ===== =========== + imm value description + ======== ===== =========== + BPF_ADD 0x00 atomic add + BPF_OR 0x40 atomic or + BPF_AND 0x50 atomic and + BPF_XOR 0xa0 atomic xor + ======== ===== =========== - BPF_ADD - BPF_AND - BPF_OR - BPF_XOR -Each having equivalent semantics with the ``BPF_ADD`` example, that is: the -memory location addresed by ``dst_reg + off`` is atomically modified, with -``src_reg`` as the other operand. If the ``BPF_FETCH`` flag is set in the -immediate, then these operations also overwrite ``src_reg`` with the -value that was in memory before it was modified. +``BPF_ATOMIC | BPF_W | BPF_STX`` with imm = BPF_ADD means:: -The more special operations are:: + *(u32 *)(dst_reg + off16) += src_reg - BPF_XCHG +``BPF_ATOMIC | BPF_DW | BPF_STX`` with imm = BPF ADD means:: -This atomically exchanges ``src_reg`` with the value addressed by ``dst_reg + -off``. :: + *(u64 *)(dst_reg + off16) += src_reg - BPF_CMPXCHG +``BPF_XADD`` is a deprecated name for ``BPF_ATOMIC | BPF_ADD``. -This atomically compares the value addressed by ``dst_reg + off`` with -``R0``. If they match it is replaced with ``src_reg``. In either case, the -value that was there before is zero-extended and loaded back to ``R0``. +In addition to the simple atomic operations, there also is a modifier and +two complex atomic operations: -Note that 1 and 2 byte atomic operations are not supported. + =========== ================ =========================== + imm value description + =========== ================ =========================== + BPF_FETCH 0x01 modifier: return old value + BPF_XCHG 0xe0 | BPF_FETCH atomic exchange + BPF_CMPXCHG 0xf0 | BPF_FETCH atomic compare and exchange + =========== ================ =========================== + +The ``BPF_FETCH`` modifier is optional for simple atomic operations, and +always set for the complex atomic operations. If the ``BPF_FETCH`` flag +is set, then the operation also overwrites ``src_reg`` with the value that +was in memory before it was modified. + +The ``BPF_XCHG`` operation atomically exchanges ``src_reg`` with the value +addressed by ``dst_reg + off``. + +The ``BPF_CMPXCHG`` operation atomically compares the value addressed by +``dst_reg + off`` with ``R0``. If they match, the value addressed by +``dst_reg + off`` is replaced with ``src_reg``. In either case, the +value that was at ``dst_reg + off`` before the operation is zero-extended +and loaded back to ``R0``. Clang can generate atomic instructions by default when ``-mcpu=v3`` is enabled. If a lower version for ``-mcpu`` is set, the only atomic instruction @@ -240,40 +315,52 @@ Clang can generate is ``BPF_ADD`` *without* ``BPF_FETCH``. If you need to enable the atomics features, while keeping a lower ``-mcpu`` version, you can use ``-Xclang -target-feature -Xclang +alu32``. -You may encounter ``BPF_XADD`` - this is a legacy name for ``BPF_ATOMIC``, -referring to the exclusive-add operation encoded when the immediate field is -zero. +64-bit immediate instructions +----------------------------- -16-byte instructions --------------------- +Instructions with the ``BPF_IMM`` mode modifier use the wide instruction +encoding for an extra imm64 value. -eBPF has one 16-byte instruction: ``BPF_LD | BPF_DW | BPF_IMM`` which consists -of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single -instruction that loads 64-bit immediate value into a dst_reg. +There is currently only one such instruction. -Packet access instructions --------------------------- +``BPF_LD | BPF_DW | BPF_IMM`` means:: -eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and -(BPF_IND | <size> | BPF_LD) which are used to access packet data. + dst_reg = imm64 -They had to be carried over from classic BPF to have strong performance of -socket filters running in eBPF interpreter. These instructions can only -be used when interpreter context is a pointer to ``struct sk_buff`` and -have seven implicit operands. Register R6 is an implicit input that must -contain pointer to sk_buff. Register R0 is an implicit output which contains -the data fetched from the packet. Registers R1-R5 are scratch registers -and must not be used to store the data across BPF_ABS | BPF_LD or -BPF_IND | BPF_LD instructions. -These instructions have implicit program exit condition as well. When -eBPF program is trying to access the data beyond the packet boundary, -the interpreter will abort the execution of the program. JIT compilers -therefore must preserve this property. src_reg and imm32 fields are -explicit inputs to these instructions. +Legacy BPF Packet access instructions +------------------------------------- -For example, BPF_IND | BPF_W | BPF_LD means:: +eBPF has special instructions for access to packet data that have been +carried over from classic BPF to retain the performance of legacy socket +filters running in the eBPF interpreter. - R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) +The instructions come in two forms: ``BPF_ABS | <size> | BPF_LD`` and +``BPF_IND | <size> | BPF_LD``. -and R1 - R5 are clobbered. +These instructions are used to access packet data and can only be used when +the program context is a pointer to networking packet. ``BPF_ABS`` +accesses packet data at an absolute offset specified by the immediate data +and ``BPF_IND`` access packet data at an offset that includes the value of +a register in addition to the immediate data. + +These instructions have seven implicit operands: + + * Register R6 is an implicit input that must contain pointer to a + struct sk_buff. + * Register R0 is an implicit output which contains the data fetched from + the packet. + * Registers R1-R5 are scratch registers that are clobbered after a call to + ``BPF_ABS | BPF_LD`` or ``BPF_IND`` | BPF_LD instructions. + +These instructions have an implicit program exit condition as well. When an +eBPF program is trying to access the data beyond the packet boundary, the +program execution will be aborted. + +``BPF_ABS | BPF_W | BPF_LD`` means:: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + imm32)) + +``BPF_IND | BPF_W | BPF_LD`` means:: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) diff --git a/Documentation/bpf/verifier.rst b/Documentation/bpf/verifier.rst index fae5f6273bac..d4326caf01f9 100644 --- a/Documentation/bpf/verifier.rst +++ b/Documentation/bpf/verifier.rst @@ -329,7 +329,7 @@ Program with unreachable instructions:: BPF_EXIT_INSN(), }; -Error: +Error:: unreachable insn 1 |