Program type BPF_PROG_TYPE_XDP

v4.8

XDP (Express Data Path) programs can attach to network devices and are called for every incoming (ingress) packet received by that network device. XDP programs can take a number of actions, the most prominent of which are manipulating the packet, dropping it, redirecting it, and letting it pass to the network stack.

Notable use cases for XDP programs are DDoS protection, load balancing, and high-throughput packet filtering. If loaded with native driver support, XDP programs are called just after the packet is received but before memory is allocated for a socket buffer. This call site makes XDP programs extremely performant, especially when a large share of the traffic is forwarded or dropped. Other eBPF program types and techniques run only after the relatively expensive socket buffer allocation has taken place, so they pay for a buffer that is then discarded.

Usage

XDP programs are typically put into an ELF section prefixed with xdp. The XDP program is called by the kernel with an xdp_md context. The return value indicates what action the kernel should take with the packet. The following values are permitted:

  • XDP_ABORTED - Signals that an unrecoverable error has taken place. Returning this action will cause the kernel to trigger the xdp_exception tracepoint and print a line to the trace log. This allows for debugging of such occurrences. It is also expensive, so it should not be used without consideration in production.
  • XDP_DROP - Discards the packet. Since the packet is dropped very early, it will be invisible to tools like tcpdump. Consider recording drops using a custom feedback mechanism to maintain visibility.
  • XDP_PASS - Pass the packet to the network stack. The packet can be manipulated beforehand.
  • XDP_TX - Send the packet back out the same network port it arrived on. The packet can be manipulated beforehand.
  • XDP_REDIRECT - Redirect the packet to one of a number of locations. The packet can be manipulated beforehand.
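The return codes listed above are plain integers, and the suggestion to record drops can be illustrated with a small per-action counter. The following is a minimal user-space sketch, not a loadable XDP program: the enum mirrors the values of the kernel's enum xdp_action from linux/bpf.h so that it builds without kernel headers, and struct verdict_stats, xdp_action_name and record_verdict are hypothetical names introduced only for illustration. In a real XDP program the counters would typically live in a BPF map.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the kernel's enum xdp_action (linux/bpf.h), redefined
 * here only so the sketch compiles in user space without kernel headers. */
enum xdp_action {
    XDP_ABORTED = 0,
    XDP_DROP,
    XDP_PASS,
    XDP_TX,
    XDP_REDIRECT,
};

#define XDP_ACTION_COUNT 5

/* Hypothetical feedback structure: one counter per action code. */
struct verdict_stats {
    uint64_t count[XDP_ACTION_COUNT];
};

/* Map an action code to a printable name, "unknown" for anything else. */
static const char *xdp_action_name(uint32_t act)
{
    static const char *const names[XDP_ACTION_COUNT] = {
        "XDP_ABORTED", "XDP_DROP", "XDP_PASS", "XDP_TX", "XDP_REDIRECT",
    };
    return act < XDP_ACTION_COUNT ? names[act] : "unknown";
}

/* Count the verdict and hand it back unchanged, so a program could end
 * with `return record_verdict(&stats, XDP_DROP);`. Codes outside the
 * known range are returned but not counted. */
static uint32_t record_verdict(struct verdict_stats *stats, uint32_t act)
{
    if (act < XDP_ACTION_COUNT)
        stats->count[act]++;
    return act;
}
```

Wrapping the final return in a helper like this keeps the drop count next to the decision itself, which is one way to regain the visibility that tcpdump cannot provide for packets dropped this early.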

XDP_REDIRECT should not be returned by itself, but always in combination with a helper function call. A number of helper functions can be used to redirect the current packet. These annotate hidden values in the context to inform the kernel what actual redirection action to take after the program exits.

Packets can be redirected in the following ways:

Context

XDP programs are called with the struct xdp_md context. This is a very simple context representing a single packet.

data

v4.8

This field contains a pointer to the start of packet data. The XDP program can read from this region between data and data_end, as long as it always performs bounds checks.

data_end

v4.8

This field contains a pointer to the end of the packet data. The verifier will enforce that any XDP program checks that offsets from data are less than data_end before the program attempts to read from them.
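The bounds-check requirement can be sketched in ordinary C. This is a user-space model under stated assumptions, not verifier-checked BPF: struct pkt_ctx is a hypothetical stand-in for struct xdp_md that holds real pointers, and read_ethertype is an illustrative name. In an actual XDP program the same check is usually written by comparing a pointer derived from data against data_end.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct xdp_md, holding real
 * pointers to the start and end of the packet data. */
struct pkt_ctx {
    const uint8_t *data;
    const uint8_t *data_end;
};

#define ETH_HDR_LEN 14 /* dst MAC (6) + src MAC (6) + EtherType (2) */

/* Read the EtherType of an Ethernet frame in host byte order.
 * Returns 0 on success, -1 if the header does not fit before data_end. */
static int read_ethertype(const struct pkt_ctx *ctx, uint16_t *proto)
{
    size_t avail = (size_t)(ctx->data_end - ctx->data);

    /* The bounds check: never touch a byte unless the whole header
     * is known to lie between data and data_end. */
    if (avail < ETH_HDR_LEN)
        return -1;

    *proto = (uint16_t)((ctx->data[12] << 8) | ctx->data[13]);
    return 0;
}
```

The read happens only on the path where the check has succeeded, which is the shape the verifier looks for before it allows a packet access.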

data_meta

v4.15

This field contains a pointer to the start of a metadata region in the packet memory. By default, no metadata room is available, so the values of data_meta and data will be the same. The XDP program can request metadata room with the bpf_xdp_adjust_meta helper. On success, data_meta is updated so that it is now less than data. The room between data_meta and data is freely usable by the XDP program.

If the packet with metadata is passed to the kernel, that metadata will be available in the __sk_buff via its data_meta and data fields.

This means that XDP programs can communicate information to, for example, BPF_PROG_TYPE_SCHED_CLS programs, which can then manipulate the socket buffer to change __sk_buff->mark or __sk_buff->priority on behalf of an XDP program.
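The relationship between data_meta and data can be modelled with plain offsets. The sketch below is a simplified user-space model of the pointer movement only, not the kernel's implementation of bpf_xdp_adjust_meta: struct meta_ctx, model_adjust_meta and meta_len are hypothetical names, the headroom_start field is an assumed lower bound, and any size or alignment rules the kernel applies to the metadata length are deliberately left out. As with the real helper, a negative delta grows the metadata region.

```c
#include <stddef.h>

/* Hypothetical model of the front of a packet buffer, using offsets
 * from the start of the buffer instead of pointers. */
struct meta_ctx {
    long headroom_start; /* assumed lowest offset metadata may grow down to */
    long data_meta;      /* start of the metadata region */
    long data;           /* start of the packet data */
};

/* Move data_meta by delta bytes. A negative delta grows the metadata
 * region, a positive delta shrinks it. Returns 0 on success and -1 if
 * the new start would fall outside [headroom_start, data]. */
static int model_adjust_meta(struct meta_ctx *ctx, int delta)
{
    long meta = ctx->data_meta + delta;

    if (meta < ctx->headroom_start || meta > ctx->data)
        return -1;

    ctx->data_meta = meta;
    return 0;
}

/* Number of metadata bytes currently available to the program. */
static size_t meta_len(const struct meta_ctx *ctx)
{
    return (size_t)(ctx->data - ctx->data_meta);
}
```

Starting from data_meta equal to data, a call with delta -8 leaves eight bytes between data_meta and data that a later program could read back.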

ingress_ifindex

v4.16

This field contains the network interface index the packet arrived on.

rx_queue_index

v4.16

This field contains the queue index within the NIC on which the packet was received.

Note

While this field is normally read-only, offloaded XDP programs are allowed to write to it to perform custom RSS (Receive-Side Scaling) in the network device (since v4.18).

egress_ifindex

v5.8

This field is read-only and contains the network interface index the packet has been redirected out of. This field is only ever set after an initial XDP program redirected a packet to another device with a BPF_MAP_TYPE_DEVMAP and the value of the devmap contained a file descriptor of a secondary XDP program. This secondary program will be invoked with a context that has egress_ifindex, rx_queue_index, and ingress_ifindex set so it can modify fields in the packet to match the redirection.

XDP fragments

v5.18

An increasingly common performance optimization technique is to use larger packets and to bulk process them (Jumbo packets, GRO, BIG-TCP). It might therefore happen that packets get larger than a single memory page, or that we want to glue multiple already allocated packets together. This breaks the existing assumption of XDP programs that all the packet data lives in a linear area between data and data_end.

In order to offer support and not break existing programs, the concept of "XDP fragment aware" programs was introduced. XDP program authors writing such programs can compare the length between the data and data_end pointers with the output of bpf_xdp_get_buff_len. If the XDP program needs to work with data beyond the linear portion, it should use the bpf_xdp_load_bytes and bpf_xdp_store_bytes helpers.
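To make the difference between the linear area and the total packet length concrete, the following user-space sketch models a packet as a list of memory segments, the first being the linear area and the rest being fragments. It only imitates the idea behind bpf_xdp_get_buff_len and bpf_xdp_load_bytes and is not their implementation: struct segment, total_len and model_load_bytes are hypothetical names chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one contiguous piece of a packet.
 * segs[0] plays the role of the linear area between data and data_end. */
struct segment {
    const uint8_t *base;
    size_t len;
};

/* Total packet length across all segments. */
static size_t total_len(const struct segment *segs, size_t n)
{
    size_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += segs[i].len;
    return sum;
}

/* Copy len bytes starting at packet offset `offset` into buf, crossing
 * segment boundaries as needed. Returns 0 on success, -1 if the range
 * does not fit inside the packet. */
static int model_load_bytes(const struct segment *segs, size_t n,
                            size_t offset, void *buf, size_t len)
{
    size_t total = total_len(segs, n);
    uint8_t *out = buf;

    if (offset > total || len > total - offset)
        return -1;

    for (size_t i = 0; i < n && len > 0; i++) {
        if (offset >= segs[i].len) {
            offset -= segs[i].len; /* range starts in a later segment */
            continue;
        }
        size_t chunk = segs[i].len - offset;
        if (chunk > len)
            chunk = len;
        memcpy(out, segs[i].base + offset, chunk);
        out += chunk;
        len -= chunk;
        offset = 0; /* later segments are read from their start */
    }
    return 0;
}
```

When total_len is larger than the length of the first segment, part of the packet lives outside the linear area, which is the situation in which direct pointer access is not enough and a copying helper is needed.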

To indicate that a program is "XDP fragment aware", the program should be loaded with the BPF_F_XDP_HAS_FRAGS flag. Program authors can indicate that they wish libraries like libbpf to load programs with this flag by placing their program in an xdp.frags/ ELF section instead of an xdp/ section.

Note

If a program is both "XDP fragment aware" and should be attached to a CPUMAP or DEVMAP, the two ELF naming conventions are combined: xdp.frags/cpumap/ or xdp.frags/devmap/.
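The section naming conventions mentioned on this page combine in a regular way, which the following sketch spells out. It is a user-space illustration only: enum xdp_target and xdp_section_name are hypothetical names, and the returned strings are the section prefixes without a trailing slash.

```c
#include <stdbool.h>

/* Hypothetical selector for where the program is meant to be attached. */
enum xdp_target {
    TARGET_NETDEV, /* attached directly to a network device */
    TARGET_DEVMAP, /* attached to a BPF_MAP_TYPE_DEVMAP value */
    TARGET_CPUMAP, /* attached to a BPF_MAP_TYPE_CPUMAP value */
};

/* Return the ELF section prefix for a program, combining the
 * fragment-aware prefix with the map-specific suffix. */
static const char *xdp_section_name(bool frags, enum xdp_target target)
{
    switch (target) {
    case TARGET_DEVMAP:
        return frags ? "xdp.frags/devmap" : "xdp/devmap";
    case TARGET_CPUMAP:
        return frags ? "xdp.frags/cpumap" : "xdp/cpumap";
    default:
        return frags ? "xdp.frags" : "xdp";
    }
}
```

For example, a fragment-aware program intended for a CPUMAP value would be placed in a section starting with the string returned by xdp_section_name(true, TARGET_CPUMAP).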

Warning

XDP fragments are not supported by all network drivers. Check the driver support table.

Attachment

There are two ways of attaching XDP programs to network devices. The legacy way of doing so is via a netlink socket, the details of which are complex. Examples of libraries that implement netlink XDP attaching are vishvananda/netlink and libbpf.

The modern and recommended way is to use BPF links. Doing so is as easy as calling BPF_LINK_CREATE with the target_ifindex set to the network interface target, attach_type set to BPF_XDP (the resulting link is of type BPF_LINK_TYPE_XDP), and the same flags as would be used for the netlink approach.

There are some subtle differences. The netlink method gives the network interface a reference to the program, which means that after attaching, the program stays attached until it is detached by a program, even if the original loader exits. This is in contrast to kprobes, for example, which stop as soon as the loader exits (assuming the program is not pinned). With links, however, this referencing doesn't occur. Creating the link returns a file descriptor which is used to manage the lifecycle: if the link fd is closed, or the loader exits without pinning it, the program is detached from the network interface.

Warning

Hardware offloaded GRO and LRO are incompatible with XDP and have to be disabled in order to use XDP. Not doing so will result in an -EINVAL error upon attaching. The following command can be used to disable GRO and LRO: ethtool -K {ifname} lro off gro off

Warning

For XDP programs without fragments support, there is a maximum MTU of between 1500 and 4096 bytes. The exact limit depends on the driver. If the configured MTU on the device is set higher than the limit, XDP programs cannot be attached.

Flags

XDP_FLAGS_UPDATE_IF_NOEXIST

If set, the kernel will only attach the XDP program if the network interface doesn't have an XDP program attached already.

Note

This flag is only used with the netlink attach method. The link attach method handles this behavior more generically.

XDP_FLAGS_SKB_MODE

If set, the kernel will attach the program in SKB (socket buffer) mode. This mode is also known as "generic mode". This always works regardless of driver support. It works by calling the XDP program after a socket buffer has already been allocated, further up the stack than an XDP program would normally be called. This negates the speed advantage of XDP programs. This mode also lacks full feature support, since some actions cannot be taken this high up the network stack anymore.

If driver support isn't available, it is recommended to use BPF_PROG_TYPE_SCHED_CLS programs instead, since they offer more capabilities with roughly the same performance.

This flag is mutually exclusive with XDP_FLAGS_DRV_MODE and XDP_FLAGS_HW_MODE.

XDP_FLAGS_DRV_MODE

If set, the kernel will attach the program in driver mode. This requires support from the network driver, but most major network card vendors have support in the latest kernel.

This flag is mutually exclusive with XDP_FLAGS_SKB_MODE and XDP_FLAGS_HW_MODE.

XDP_FLAGS_HW_MODE

If set, the kernel will attach the program in hardware offload mode. This requires both driver and hardware support for XDP offloading. Currently only select Netronome devices support offloading. However, it should be noted that only a subset of normal features are supported.

XDP_FLAGS_REPLACE

If set, the kernel will atomically replace the existing program with this new program. You will also have to pass the file descriptor of the old program via the netlink request.

Note

This flag is only used with the netlink attach method. The link attach method handles this behavior more generically.
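The mutual exclusivity of the three mode flags described above can be checked before an attach request is sent. The sketch below is a user-space illustration, assuming the flag values from the kernel UAPI header linux/if_link.h are mirrored locally so that it builds without kernel headers. The function name xdp_mode_flags_valid is hypothetical, and the kernel performs its own validation regardless.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local mirror of the XDP attach flags from linux/if_link.h, redefined
 * here only so the sketch compiles without kernel headers. */
#define XDP_FLAGS_UPDATE_IF_NOEXIST (1U << 0)
#define XDP_FLAGS_SKB_MODE          (1U << 1)
#define XDP_FLAGS_DRV_MODE          (1U << 2)
#define XDP_FLAGS_HW_MODE           (1U << 3)
#define XDP_FLAGS_REPLACE           (1U << 4)

#define XDP_FLAGS_MODES \
    (XDP_FLAGS_SKB_MODE | XDP_FLAGS_DRV_MODE | XDP_FLAGS_HW_MODE)

/* Return true if at most one of the SKB/DRV/HW mode bits is set.
 * Non-mode flags are ignored by this check. */
static bool xdp_mode_flags_valid(uint32_t flags)
{
    uint32_t modes = flags & XDP_FLAGS_MODES;

    /* A value with zero or one bit set satisfies x & (x - 1) == 0. */
    return (modes & (modes - 1)) == 0;
}
```

A loader could call this on the flags it is about to pass and report a clear error instead of relying on the errno returned by the attach attempt.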

Device map program

v5.8

XDP programs can be attached to map values of a BPF_MAP_TYPE_DEVMAP map. Once attached, this program will run after the first program has concluded but before the packet is sent off to the new network device. These programs are called with additional context, see egress_ifindex.

Only XDP programs that have been loaded with the BPF_XDP_DEVMAP value in expected_attach_type are allowed to be attached in this way.

Program authors can indicate to loaders like libbpf that a given program should be loaded with this expected attach type by placing the program in a xdp/devmap/ ELF section.

CPU map program

v5.9

XDP programs can be attached to map values of a BPF_MAP_TYPE_CPUMAP map. Once attached, this program will run on the new logical CPU. The idea is to spend minimal time in the first XDP program, which only schedules the packet onto another CPU, and to perform the more CPU-intensive tasks in this second program.

Only XDP programs that have been loaded with the BPF_XDP_CPUMAP value in expected_attach_type are allowed to be attached in this way.

Program authors can indicate to loaders like libbpf that a given program should be loaded with this expected attach type by placing the program in a xdp/cpumap/ ELF section.

Driver support

Driver name Native XDP XDP HW Offload XDP Fragments AF_XDP
Mellanox mlx4 v4.8
Mellanox mlx5 v4.9 v5.18[1], v6.4 v5.3
Qlogic qede v4.10
Netronome nfp v4.10 v5.18
Virtio v4.10 v6.3
Broadcom bnxt v4.11 v5.19
Intel ixgbe v4.12 v4.20
Cavium thunder (nicvf) v4.12
Intel i40e v4.13 v6.4 v4.20
Tun v4.14
Netdevsim v4.16
Intel ixgbevf v4.17
Veth v4.19 v5.5
Freescale dpaa2 v5.0 v6.2
Socionext netsec v5.3
TI cpsw v5.3
Solarflare efx v5.5
Intel ice v5.5 v6.3 v5.5
Marvell mvneta v5.5 v5.18
Amazon ena v5.6
Hyper-V netvsc v5.6
Marvell mvpp2 v5.9
Xen xennet v5.9
Intel igb v5.10
Freescale dpaa v5.11
Intel igc v5.13 v5.14
STmicro stmmac v5.13 v5.13
Freescale enetc v5.13
Bond v5.15
Marvell otx2 v5.16
Microsoft mana v5.17
Fungible fun v5.18
Atlantic aq v5.19 v5.19
Mediatek mtk v6.0
Freescale fec_enet v6.2
Microchip lan966x v6.2
Engleder tsnep v6.3 v6.4
Google gve v6.4 v6.4
VMware vmxnet3 v6.6

Note

This table was last updated for Linux v6.7 and is subject to change in the future.

Max MTU

Plain XDP (fragments disabled) has the limitation that every packet must fit within a single memory page (typically 4096 bytes). This same memory page is also used to store NIC specific metadata and metadata to be passed to the network stack. The room needed for the metadata eats into the available space for the packet data. This means that the actual maximum MTU is some amount lower. The exact value depends on a lot of factors including but not limited to: the driver, the NIC, the CPU architecture, the kernel version and kernel configuration.
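The way reserved space eats into the page can be shown with simple arithmetic. The following is a deliberately simplified model and not the formula of any particular driver: the function names and the split into front, back and L2 overhead are assumptions made for illustration, and real drivers differ in what they reserve and how they round.

```c
/* Simplified model: the room left for the MTU is the page size minus
 * whatever is reserved in front of and behind the packet data, minus
 * the link-layer header that is not counted in the MTU.
 * All parameters are hypothetical inputs, not driver constants. */
static long model_max_mtu(long page_size, long reserved_front,
                          long reserved_back, long l2_overhead)
{
    long room = page_size - reserved_front - reserved_back - l2_overhead;

    return room > 0 ? room : 0;
}

/* Attaching is only possible while the configured MTU does not exceed
 * the limit. Returns 1 if attaching would be allowed, 0 otherwise. */
static int mtu_allows_attach(long configured_mtu, long limit)
{
    return configured_mtu <= limit;
}
```

With the illustrative inputs of a 4096-byte page, 256 bytes reserved in front, 320 bytes reserved behind, and a 14-byte Ethernet header, the model leaves 3506 bytes, so a configured MTU of 1500 would be accepted and one of 9000 would not.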

The following table has been calculated from mathematical formulas based on the driver code and constants derived from the most common systems. This table assumes a 4k page size, the most common L2 cache line sizes for the given architectures, and a 6.8 kernel (the kernel version doesn't seem to make a big difference). Please refer to tools/mtu-calc in the doc sources to see the exact formulas used and/or to calculate the exact max MTU if you have a non-standard system.

Plain XDP
Vendor Driver x86 arm arm64 armv7 riscv
Kernel Veth 3520 3518 3520 3454 3518
Kernel VirtIO 3506 3506 3506 3442 3506
Kernel Tun 1500 1500 1500 1500 1500
Kernel Bond [4] [4] [4] [4] [4]
Xen Netfront 3840 3840 3840 3840 3840
Amazon ENA 3498 3498 3498 3434 3498
Aquantia/Marvell AQtion 2048 2048 2048 2048 2048
Broadcom BNXT 3502 3500 3502 3436 3500
Cavium Thunder (nicvf) 1508 1508 1508 1508 1508
Engleder TSN Endpoint [2] [2] [2] [2] [2]
Freescale FEC [2] [2] [2] [2] [2]
Freescale DPAA 3706 3706 3706 3642 3706
Freescale DPAA2 ?[3] ?[3] ?[3] ?[3] ?[3]
Freescale ENETC [2] [2] [2] [2] [2]
Fungible Funeth 3566 3566 3566 3502 3566
Google GVE 2032 2032 2032 2032 2032
Intel I40e 3046 3046 3046 3046 3046
Intel ICE 3046 3046 3046 3046 3046
Intel IGB 3046 3046 3046 3046 3046
Intel IGC 1500 1500 1500 1500 1500
Intel IXGBE 3050 3050 3050 3050 3050
Intel IXGBEVF 3050 3050 3050 3050 3050
Marvell NETA 3520 3520 3520 3456 3520
Marvell PPv2 3552 3552 3552 3488 3552
Marvell Octeon TX2 1508 1508 1508 1508 1508
MediaTek MTK 3520 3520 3520 3456 3520
Mellanox MLX4 3498 3498 3498 3434 3498
Mellanox MLX5 3498 3498 3498 3434 3498
Microchip LAN966x [2] [2] [2] [2] [2]
Microsoft Mana 3506 3506 3506 3442 3506
Microsoft Hyper-V 3506 3506 3506 3442 3506
Netronome NFP 4096 4096 4096 4096 4096
Pensando Ionic 3502 3502 3502 3438 3502
Qlogic QEDE [2] [2] [2] [2] [2]
Solarflare SFP (SFC9xxx PF/VF) 3530 3546 3530 3386 3514
Solarflare SFP (Riverhead) 3522 3530 3522 3370 3498
Solarflare SFP (SFC4000A) 3508 3538 3508 3378 3506
Solarflare SFP (SFC4000B) 3528 3542 3528 3382 3510
Solarflare SFP (SFC9020/SFL9021) 3528 3542 3528 3382 3510
Socionext NetSec 1500 1500 1500 1500 1500
STMicro ST MAC 1500 1500 1500 1500 1500
TI CPSW [2] [2] [2] [2] [2]
VMWare VMXNET 3 3494 3492 3494 3428 3492
XDP with fragments
Vendor Driver x86 arm arm64 armv7 riscv
Kernel Veth 73152 73150 73152 73086 73150
Kernel VirtIO [2] [2] [2] [2] [2]
Kernel Tun
Kernel Bond [4] [4] [4] [4] [4]
Xen Netfront
Amazon ENA
Aquantia/Marvell AQtion [2] [2] [2] [2] [2]
Broadcom BNXT [2] [2] [2] [2] [2]
Cavium Thunder (nicvf)
Engleder TSN Endpoint
Freescale FEC
Freescale DPAA
Freescale DPAA2 ?[3] ?[3] ?[3] ?[3] ?[3]
Freescale ENETC
Fungible Funeth
Google GVE
Intel I40e 9702 9702 9702 9702 9702
Intel ICE 3046 3046 3046 3046 3046
Intel IGB
Intel IGC
Intel IXGBE
Intel IXGBEVF
Marvell NETA [2] [2] [2] [2] [2]
Marvell PPv2 3552 3552 3552 3488 3552
Marvell Octeon TX2
MediaTek MTK
Mellanox MLX4
Mellanox MLX5 [2] [2] [2] [2] [2]
Microchip LAN966x
Microsoft Mana
Microsoft Hyper-V
Netronome NFP
Pensando Ionic
Qlogic QEDE
Solarflare SFP (SFC9xxx PF/VF)
Solarflare SFP (Riverhead)
Solarflare SFP (SFC4000A)
Solarflare SFP (SFC4000B)
Solarflare SFP (SFC9020/SFL9021)
Socionext NetSec
STMicro ST MAC
TI CPSW
VMWare VMXNET 3

Warning

If the configured MTU on a network interface is higher than the limit calculated by the network driver, XDP programs cannot be attached. When attaching via netlink, most drivers will use netlink debug messages to communicate the exact limit. When attaching via BPF links, no such feedback is given by default. The error message can still be obtained by attaching an eBPF program to the bpf_xdp_link_attach_failed tracepoint and printing the error message or passing it to userspace.

Helper functions

Not all helper functions are available in all program types. These are the helper calls available for XDP programs:

Supported helper functions

KFuncs

Supported kfuncs

  1. Only the legacy RQ mode supports XDP frags, which is not the default and will require setting via ethtool. 

  2. Driver does not have logic to limit the max MTU and XDP usage, but implicit limits such as in firmware or hardware may still apply. 

  3. MTU limit is loaded from firmware. 

  4. MTU limit is determined by slave devices.