Program type BPF_PROG_TYPE_SOCK_OPS

v4.13

Socket ops programs are attached to cgroups and get called for multiple lifecycle events of a socket, giving the program the opportunity to change settings per connection or to record the existence of a socket.

Usage

Socket ops programs are called multiple times on the same socket during different parts of its lifecycle for different operations. Some operations query the program for certain parameters, others just inform the program of certain events so the program can perform some action at that time.

Regardless of the type of operation, the program should always return 1 on success. A negative integer indicates that an operation is not supported. For operations that query information, the reply field in the context is used to "reply" to the query; the program is expected to set it to the requested value.

There are a few envisioned use cases for this program type. First is to reply with certain settings like RTO, RTT and ECN (see ops section for details) or to set socket options using the bpf_setsockopt helper to tune settings/options on a per-connection basis.

For example, it is easy to use Facebook's internal IPv6 addresses to determine if both hosts of a connection are in the same datacenter. Therefore, it is easy to write a BPF program that chooses a small SYN RTO value when both hosts are in the same datacenter.
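The prefix comparison behind such a check can be sketched in plain C. This is a minimal illustration, not kernel code: same_datacenter is a hypothetical helper name, and the rule it implements (the first 44 bits of both addresses match) is the one used by the clamping example in the Examples section of this page. The address words are assumed to be in network byte order, as local_ip6 and remote_ip6 are in the context.

```c
#include <arpa/inet.h>   /* ntohl, htonl */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a kernel API: returns true when two IPv6
 * addresses, given as four 32-bit words in network byte order, share
 * their first 44 bits (5.5 bytes). */
static bool same_datacenter(const uint32_t local_ip6[4],
                            const uint32_t remote_ip6[4])
{
    /* The first 32 bits can be compared without any byte swapping. */
    if (local_ip6[0] != remote_ip6[0])
        return false;

    /* The next 12 bits are the top bits of the second word once it has
     * been converted to host byte order. */
    return (ntohl(local_ip6[1]) & 0xfff00000u) ==
           (ntohl(remote_ip6[1]) & 0xfff00000u);
}
```

Inside a BPF program the same comparison would use bpf_ntohl instead of ntohl, as the clamping example does.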

Secondly, socket ops programs are in an excellent position to gather detailed metrics about connections, especially since v4.16, which added most of the TCP statistics fields to the context.

Thirdly, socket ops programs can be used to implement TCP options which are not known to the kernel, both on the sending and receiving side. See BPF_SOCK_OPS_PARSE_HDR_OPT_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB.

Last but not least, socket ops programs can be used to dynamically add sockets to BPF_MAP_TYPE_SOCKMAP or BPF_MAP_TYPE_SOCKHASH maps. Since socket ops programs are notified when sockets are connecting or listening, sockets can be added to these maps before any actual message traffic happens. This allows BPF_PROG_TYPE_SK_MSG and BPF_PROG_TYPE_SK_SKB programs to operate without user space needing to add sockets to the sock maps. The bpf_sock_map_update and bpf_sock_hash_update helpers exist for this very purpose.

Ops

After attaching the program, it will be invoked for multiple sockets and multiple ops. The op field in the context indicates for which operation the program is invoked. Availability of fields in the context and the meaning of return values vary from op to op.

The ops ending with _CB are callbacks which are just called to notify the program of an event. Return values for these ops are ignored. Some of these callbacks are not triggered unless activated by setting flags on the socket. Setting these flags is done by the program itself with the use of the bpf_sock_ops_cb_flags_set helper which can both set and unset flags.
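Because the helper sets the whole flag field to the value it is given, enabling or disabling one callback usually means combining the current flags with the flag of interest. The following is a minimal sketch of that bit arithmetic in plain C. The flag values are local copies of those in include/uapi/linux/bpf.h, and cb_flags_enable and cb_flags_disable are hypothetical helper names used only for illustration.

```c
#include <stdint.h>

/* Local copies of the callback flag values from include/uapi/linux/bpf.h,
 * so the sketch builds without kernel headers. */
enum {
    RTO_CB_FLAG     = 1U << 0, /* BPF_SOCK_OPS_RTO_CB_FLAG */
    RETRANS_CB_FLAG = 1U << 1, /* BPF_SOCK_OPS_RETRANS_CB_FLAG */
    STATE_CB_FLAG   = 1U << 2, /* BPF_SOCK_OPS_STATE_CB_FLAG */
    RTT_CB_FLAG     = 1U << 3, /* BPF_SOCK_OPS_RTT_CB_FLAG */
};

/* Hypothetical helper: new flag value with 'flag' enabled, others untouched. */
static uint32_t cb_flags_enable(uint32_t current, uint32_t flag)
{
    return current | flag;
}

/* Hypothetical helper: new flag value with 'flag' disabled, others untouched. */
static uint32_t cb_flags_disable(uint32_t current, uint32_t flag)
{
    return current & ~flag;
}
```

In a BPF program, current would be read from ctx->bpf_sock_ops_cb_flags and the result passed to bpf_sock_ops_cb_flags_set(ctx, new_flags).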

BPF_SOCK_OPS_TIMEOUT_INIT

v4.13

When invoked with this op, the program can override the default RTO (retransmission timeout) for a SYN or SYN-ACK. -1 can be returned if the default value should be used.

BPF_SOCK_OPS_RWND_INIT

v4.13

When invoked with this op, the program can override the default initial advertised window (in packets), or return -1 if the default value should be used.

BPF_SOCK_OPS_TCP_CONNECT_CB

v4.13

The program is invoked with this op when connect is called on a socket, right before the active connection is initialized, so the connection is not yet established. This is just a notification, return value is discarded.

BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB

v4.13

The program is invoked with this op when an active socket transitions to an established connection. This happens when an outgoing connection is established. This is just a notification, return value is discarded.

BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB

v4.13

The program is invoked with this op when a passive socket transitions to an established connection. This happens when an incoming connection is established. This is just a notification, return value is discarded.

BPF_SOCK_OPS_NEEDS_ECN

v4.13

When invoked with this op, the program is asked if ECN (Explicit Congestion Notification) should be enabled for a given connection. The program is expected to return 0 or 1.

BPF_SOCK_OPS_BASE_RTT

v4.15

When invoked with this op, the program is asked for the base RTT (Round Trip Time) for a given connection. If the measured RTT goes above this value it indicates the connection is congested and the congestion control algorithm will take steps.

BPF_SOCK_OPS_RTO_CB

v4.16

When BPF_SOCK_OPS_RTO_CB_FLAG is set via bpf_sock_ops_cb_flags_set, this program may be called with this op to indicate when an RTO (Retransmission Timeout) has triggered. This is just a notification, return value is discarded.

The arguments in the context will have the following meanings:

  • args[0]: value of icsk_retransmits
  • args[1]: value of icsk_rto
  • args[2]: whether RTO has expired

BPF_SOCK_OPS_RETRANS_CB

v4.16

When the BPF_SOCK_OPS_RETRANS_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op when a packet from the skb has been retransmitted. This is just a notification, return value is discarded.

The arguments in the context will have the following meanings:

  • args[0]: sequence number of 1st byte
  • args[1]: number of segments
  • args[2]: return value of tcp_transmit_skb (0 => success)

BPF_SOCK_OPS_STATE_CB

v4.16

When the BPF_SOCK_OPS_STATE_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op when the TCP state of the socket changes. This is just a notification, return value is discarded.

The arguments in the context will have the following meanings:

  • args[0]: old_state
  • args[1]: new_state

The states will be one of:

enum {
    BPF_TCP_ESTABLISHED = 1,
    BPF_TCP_SYN_SENT,
    BPF_TCP_SYN_RECV,
    BPF_TCP_FIN_WAIT1,
    BPF_TCP_FIN_WAIT2,
    BPF_TCP_TIME_WAIT,
    BPF_TCP_CLOSE,
    BPF_TCP_CLOSE_WAIT,
    BPF_TCP_LAST_ACK,
    BPF_TCP_LISTEN,
    BPF_TCP_CLOSING,    /* Now a valid state */
    BPF_TCP_NEW_SYN_RECV
};
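When logging or exporting these transitions, it can help to turn the numeric values of args[0] and args[1] into names. The sketch below is a plain C illustration of that mapping, for instance for a user-space consumer of events emitted by the program. tcp_state_name is a hypothetical helper, and the numeric values simply follow the enum above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: maps a BPF_TCP_* state number, as found in
 * args[0] (old state) or args[1] (new state), to a printable name. */
static const char *tcp_state_name(uint32_t state)
{
    /* Indexes follow the enum: BPF_TCP_ESTABLISHED is 1, and each
     * following state is one higher. Index 0 is unused. */
    static const char *const names[] = {
        [1]  = "ESTABLISHED",
        [2]  = "SYN_SENT",
        [3]  = "SYN_RECV",
        [4]  = "FIN_WAIT1",
        [5]  = "FIN_WAIT2",
        [6]  = "TIME_WAIT",
        [7]  = "CLOSE",
        [8]  = "CLOSE_WAIT",
        [9]  = "LAST_ACK",
        [10] = "LISTEN",
        [11] = "CLOSING",
        [12] = "NEW_SYN_RECV",
    };

    if (state >= sizeof(names) / sizeof(names[0]) || names[state] == NULL)
        return "UNKNOWN";
    return names[state];
}
```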

BPF_SOCK_OPS_TCP_LISTEN_CB

v4.19

The program is invoked with this op when the listen syscall is used on the socket, transitioning it to the LISTEN state. This is just a notification, return value is discarded.

BPF_SOCK_OPS_RTT_CB

v5.3

When the BPF_SOCK_OPS_RTT_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op for every round trip. This is just a notification, return value is discarded.

BPF_SOCK_OPS_PARSE_HDR_OPT_CB

v5.10

The program is invoked with this op to parse TCP header options. If the BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG flag is set, the program is invoked for all received TCP headers; if the BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG flag is set, the program is only invoked for headers that contain an option unknown to the kernel.

The program will be invoked to handle the packets received on an already established connection.

The TCP header in question starts at sock_ops->skb_data. The bpf_load_hdr_opt helper can also be used to search for a particular option.

This is just a notification, return value is discarded.
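To illustrate what walking the header options involves, here is a minimal user-space sketch in plain C. It treats data and data_end like skb_data and skb_data_end, i.e. as the bounds of one TCP header. find_tcp_option is a hypothetical name, not a kernel helper, and the sketch assumes the searched kind is a regular kind/length option (not End of Option List or NOP).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns a pointer to the first option of the given
 * kind inside the TCP header [data, data_end), or NULL if it is absent or
 * the header is malformed. */
static const uint8_t *find_tcp_option(const uint8_t *data,
                                      const uint8_t *data_end,
                                      uint8_t kind)
{
    /* The fixed part of a TCP header is 20 bytes long. */
    if (data_end - data < 20)
        return NULL;

    /* The upper nibble of byte 12 is the header length in 32-bit words. */
    size_t hdr_len = (size_t)(data[12] >> 4) * 4;
    if (hdr_len < 20 || hdr_len > (size_t)(data_end - data))
        return NULL;

    const uint8_t *opt = data + 20;
    const uint8_t *end = data + hdr_len;

    while (opt < end) {
        if (opt[0] == 0)        /* End of Option List */
            break;
        if (opt[0] == 1) {      /* NOP: single byte, no length field */
            opt++;
            continue;
        }
        if (end - opt < 2)      /* no room for the length byte */
            break;

        uint8_t len = opt[1];
        if (len < 2 || len > end - opt) /* malformed length */
            break;
        if (opt[0] == kind)
            return opt;
        opt += len;
    }
    return NULL;
}
```

In a BPF program every access must also be bounds-checked against skb_data_end to satisfy the verifier, so using bpf_load_hdr_opt is usually the simpler choice.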

BPF_SOCK_OPS_HDR_OPT_LEN_CB

v5.10

When the BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op to reserve space for TCP options which will be written to the packet when the program is invoked with the BPF_SOCK_OPS_WRITE_HDR_OPT_CB op.

The arguments in the context will have the following meanings:

  • args[0]: bool want_cookie. (in writing SYNACK only)

sock_ops->skb_data: Not available because no header has been written yet.

sock_ops->skb_tcp_flags: The tcp_flags of the outgoing skb. (e.g. SYN, ACK, FIN).

The bpf_reserve_hdr_opt helper should be used to reserve space.

This is just a notification, return value is discarded.

BPF_SOCK_OPS_WRITE_HDR_OPT_CB

v5.10

When the BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG flag is set with bpf_sock_ops_cb_flags_set, the program is invoked with this op to write TCP options to the packet. The room for these options has been reserved in a previous invocation of the program with the BPF_SOCK_OPS_HDR_OPT_LEN_CB op.

The arguments in the context will have the following meanings:

  • args[0]: bool want_cookie. (in writing SYNACK only)

sock_ops->skb_data: Referring to the outgoing skb. It covers the TCP header that has already been written by the kernel and the earlier bpf-progs.

sock_ops->skb_tcp_flags: The tcp_flags of the outgoing skb. (e.g. SYN, ACK, FIN).

The bpf_store_hdr_opt helper should be used to write the option.

The bpf_load_hdr_opt helper can also be used to search for a particular option that has already been written by the kernel or the earlier bpf-progs.

Context

struct bpf_sock_ops

C structure
/* User bpf_sock_ops struct to access socket values and specify request ops
* and their replies.
* Some of this fields are in network (bigendian) byte order and may need
* to be converted before use (bpf_ntohl() defined in samples/bpf/bpf_endian.h).
* New fields can only be added at the end of this structure
*/
struct bpf_sock_ops {
    __u32 op;
    union {
        __u32 args[4];      /* Optionally passed to bpf program */
        __u32 reply;        /* Returned by bpf program      */
        __u32 replylong[4]; /* Optionally returned by bpf prog  */
    };
    __u32 family;
    __u32 remote_ip4;   /* Stored in network byte order */
    __u32 local_ip4;    /* Stored in network byte order */
    __u32 remote_ip6[4];    /* Stored in network byte order */
    __u32 local_ip6[4]; /* Stored in network byte order */
    __u32 remote_port;  /* Stored in network byte order */
    __u32 local_port;   /* stored in host byte order */
    __u32 is_fullsock;  /* Some TCP fields are only valid if
                * there is a full socket. If not, the
                * fields read as zero.
                */
    __u32 snd_cwnd;
    __u32 srtt_us;      /* Averaged RTT << 3 in usecs */
    __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
    __u32 state;
    __u32 rtt_min;
    __u32 snd_ssthresh;
    __u32 rcv_nxt;
    __u32 snd_nxt;
    __u32 snd_una;
    __u32 mss_cache;
    __u32 ecn_flags;
    __u32 rate_delivered;
    __u32 rate_interval_us;
    __u32 packets_out;
    __u32 retrans_out;
    __u32 total_retrans;
    __u32 segs_in;
    __u32 data_segs_in;
    __u32 segs_out;
    __u32 data_segs_out;
    __u32 lost_out;
    __u32 sacked_out;
    __u32 sk_txhash;
    __u64 bytes_received;
    __u64 bytes_acked;
    __bpf_md_ptr(struct bpf_sock *, sk);
    /* [skb_data, skb_data_end) covers the whole TCP header.
    *
    * BPF_SOCK_OPS_PARSE_HDR_OPT_CB: The packet received
    * BPF_SOCK_OPS_HDR_OPT_LEN_CB:   Not useful because the
    *                                header has not been written.
    * BPF_SOCK_OPS_WRITE_HDR_OPT_CB: The header and options have
    *                 been written so far.
    * BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:  The SYNACK that concludes
    *                   the 3WHS.
    * BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB: The ACK that concludes
    *                   the 3WHS.
    *
    * bpf_load_hdr_opt() can also be used to read a particular option.
    */
    __bpf_md_ptr(void *, skb_data);
    __bpf_md_ptr(void *, skb_data_end);
    __u32 skb_len;      /* The total length of a packet.
                * It includes the header, options,
                * and payload.
                */
    __u32 skb_tcp_flags;    /* tcp_flags of the header.  It provides
                * an easy way to check for tcp_flags
                * without parsing skb_data.
                *
                * In particular, the skb_tcp_flags
                * will still be available in
                * BPF_SOCK_OPS_HDR_OPT_LEN even though
                * the outgoing header has not
                * been written yet.
                */
    __u64 skb_hwtstamp;
};

op

v4.13

This field will indicate the current operation, see the ops section for the possible values and meanings.

args

v4.16

This field is an array of 4 __u32 values, used by some operations to provide additional information. The meaning of the arguments is dependent on the op.

reply

v4.13

This field is used as the return value for operations that expect one. It is the only field the BPF program is allowed to modify.

replylong

v4.13

This field was envisioned to be used for replies that do not fit in a single __u32, but in practice this has not occurred as of v6.3.

family

v4.13

The address family of the socket for which the program is invoked. One of the AF_* enums.

remote_ip4

v4.13

The remote IPv4 address in network byte order if family == AF_INET.

local_ip4

v4.13

The local IPv4 address in network byte order if family == AF_INET.
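Since both IPv4 fields are already in network byte order, they can be handed to the usual socket API functions unchanged once they reach user space (for example through a map or ring buffer). The following is a minimal sketch of that; format_ip4 is a hypothetical helper name.

```c
#include <arpa/inet.h>   /* inet_ntop */
#include <netinet/in.h>  /* struct in_addr, INET_ADDRSTRLEN */
#include <stdint.h>
#include <sys/socket.h>  /* AF_INET, socklen_t */

/* Hypothetical helper: formats a remote_ip4 or local_ip4 value (network
 * byte order) as dotted-decimal text in buf. Returns buf, or NULL on error. */
static const char *format_ip4(uint32_t ip4_be, char *buf, socklen_t len)
{
    /* s_addr is defined to hold the address in network byte order, so no
     * conversion is needed here. */
    struct in_addr addr = { .s_addr = ip4_be };

    return inet_ntop(AF_INET, &addr, buf, len);
}
```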

remote_ip6

v4.13

The remote IPv6 address in network byte order if family == AF_INET6.

local_ip6

v4.13

The local IPv6 address in network byte order if family == AF_INET6.

remote_port

v4.13

The remote transport layer (layer 4) port in network byte order.

local_port

v4.13

The local transport layer (layer 4) port in host byte order.
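The two port fields use different byte orders, which is an easy source of bugs. The sketch below shows the conversions in plain C, following the comments in struct bpf_sock_ops and the way the examples on this page compare ports. sock_ops_remote_port and sock_ops_local_port are hypothetical helper names; a BPF program would use bpf_ntohl instead of ntohl.

```c
#include <arpa/inet.h>  /* ntohl, htonl */
#include <stdint.h>

/* Hypothetical helper: remote_port is a 32-bit field in network byte
 * order, so it is converted with ntohl before use. */
static uint16_t sock_ops_remote_port(uint32_t remote_port)
{
    return (uint16_t)ntohl(remote_port);
}

/* Hypothetical helper: local_port is already in host byte order and only
 * needs to be narrowed. */
static uint16_t sock_ops_local_port(uint32_t local_port)
{
    return (uint16_t)local_port;
}
```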

is_fullsock

v4.16

Some TCP fields are only valid if there is a full socket. If not, the fields read as zero.

snd_cwnd

v4.16

The sending congestion window, in packets.

srtt_us

v4.16

The averaged/smoothed RTT (Round Trip Time), stored 3 bits shifted left in μs (microseconds).

actual srtt in μs = ctx->srtt_us >> 3;
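As a small worked example of the shift, the function below converts a raw srtt_us value to microseconds. It is plain C for illustration and srtt_us_to_usecs is a hypothetical name; for instance, a raw value of 800 corresponds to a smoothed RTT of 100 μs.

```c
#include <stdint.h>

/* Hypothetical helper: the field stores the smoothed RTT multiplied by 8
 * (shifted left by 3 bits), so shifting right by 3 recovers microseconds. */
static uint32_t srtt_us_to_usecs(uint32_t srtt_us)
{
    return srtt_us >> 3;
}
```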

bpf_sock_ops_cb_flags

v4.16

This field contains the flags that indicate which optional operations are enabled. Possible values are listed in include/uapi/linux/bpf.h. To change the contents of the field, the bpf_sock_ops_cb_flags_set helper must be used.

state

v4.16

This field contains the connection state of the socket.

The states will be one of:

enum {
    BPF_TCP_ESTABLISHED = 1,
    BPF_TCP_SYN_SENT,
    BPF_TCP_SYN_RECV,
    BPF_TCP_FIN_WAIT1,
    BPF_TCP_FIN_WAIT2,
    BPF_TCP_TIME_WAIT,
    BPF_TCP_CLOSE,
    BPF_TCP_CLOSE_WAIT,
    BPF_TCP_LAST_ACK,
    BPF_TCP_LISTEN,
    BPF_TCP_CLOSING,    /* Now a valid state */
    BPF_TCP_NEW_SYN_RECV
};

rtt_min

v4.16

The minimum observed RTT (Round Trip Time).

snd_ssthresh

v4.16

The slow start size threshold.

rcv_nxt

v4.16

The TCP sequence number we want to receive next.

snd_nxt

v4.16

The TCP sequence number we will send next.

snd_una

v4.16

The first byte we want an ACK for.

mss_cache

v4.16

Cached effective MSS (Maximum Segment Size), not including SACKS.

ecn_flags

v4.16

ECN (Explicit Congestion Notification) status bits.

rate_delivered

v4.16

Saved rate sample: packets delivered.

rate_interval_us

v4.16

Saved rate sample: time elapsed.

packets_out

v4.16

Number of packets which are "in flight".

retrans_out

v4.16

Number of packets retransmitted out.

total_retrans

v4.16

Total number of packet retransmits for the entire connection.

segs_in

v4.16

RFC4898 tcpEStatsPerfSegsIn total number of segments in.

data_segs_in

v4.16

RFC4898 tcpEStatsPerfDataSegsIn total number of data segments in.

segs_out

v4.16

RFC4898 tcpEStatsPerfSegsOut the total number of segments sent.

data_segs_out

v4.16

RFC4898 tcpEStatsPerfDataSegsOut total number of data segments sent.

lost_out

v4.16

Number of lost packets.

sacked_out

v4.16

Number of SACK'd packets.

sk_txhash

v4.16

Computed flow hash for use on transmit.

bytes_received

v4.16

RFC4898 tcpEStatsAppHCThruOctetsReceived: sum(delta(rcv_nxt)), i.e. how many bytes were received and acknowledged by this host.

bytes_acked

v4.16

RFC4898 tcpEStatsAppHCThruOctetsAcked sum(delta(snd_una)), or how many bytes were acked.

sk

v5.3

Pointer to the struct bpf_sock.

skb_data

v5.10

skb_data to skb_data_end covers the whole TCP header.

  • BPF_SOCK_OPS_PARSE_HDR_OPT_CB - The packet received
  • BPF_SOCK_OPS_HDR_OPT_LEN_CB - Not useful because the header has not been written.
  • BPF_SOCK_OPS_WRITE_HDR_OPT_CB - The header and options have been written so far.
  • BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB - The SYNACK that concludes the 3WHS.
  • BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB - The ACK that concludes the 3WHS.

bpf_load_hdr_opt can also be used to read a particular option.

skb_data_end

v5.10

The end pointer of the TCP header.

skb_len

v5.10

The total length of a packet. It includes the header, options, and payload.

skb_tcp_flags

v5.10

tcp_flags of the header. It provides an easy way to check for tcp_flags without parsing skb_data.

In particular, skb_tcp_flags will still be available in BPF_SOCK_OPS_HDR_OPT_LEN_CB even though the outgoing header has not been written yet.
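Checking the flags is a matter of masking bits. The following minimal sketch uses the standard TCP header flag bit values, defined locally so it builds without kernel headers. The TCPHDR_* names mirror the kernel-internal macros, and is_syn_only and is_synack are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard TCP header flag bits, defined locally for this sketch. */
#define TCPHDR_FIN 0x01
#define TCPHDR_SYN 0x02
#define TCPHDR_RST 0x04
#define TCPHDR_PSH 0x08
#define TCPHDR_ACK 0x10

/* Hypothetical helper: true for a SYN without ACK (start of a handshake). */
static bool is_syn_only(uint32_t skb_tcp_flags)
{
    return (skb_tcp_flags & (TCPHDR_SYN | TCPHDR_ACK)) == TCPHDR_SYN;
}

/* Hypothetical helper: true when both SYN and ACK are set (a SYNACK). */
static bool is_synack(uint32_t skb_tcp_flags)
{
    return (skb_tcp_flags & (TCPHDR_SYN | TCPHDR_ACK)) ==
           (TCPHDR_SYN | TCPHDR_ACK);
}
```

A program could, for example, use such a check in BPF_SOCK_OPS_HDR_OPT_LEN_CB to reserve option space only for SYN or SYNACK packets.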

skb_hwtstamp

v6.2

The timestamp at which the packet was received as reported by the hardware/NIC.

In sockops, the skb is also available to the BPF program during the BPF_SOCK_OPS_PARSE_HDR_OPT_CB event. The hardware timestamp allows the program to measure one-way delay more accurately when the sender has put its transmit timestamp in a TCP header option.

Warning

hwtstamps can only be compared against other hwtstamps from the same device.

Attachment

Socket ops programs are attached to cgroups via the BPF_PROG_ATTACH syscall or via BPF link.
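A minimal sketch of the BPF_PROG_ATTACH route using the raw bpf syscall is shown below. It assumes a Linux system with the kernel UAPI headers installed, a program that has already been loaded (prog_fd), and a file descriptor obtained by opening a cgroup v2 directory (cgroup_fd). attach_sock_ops is a hypothetical wrapper name; most loaders would use a library such as libbpf instead of calling the syscall directly.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>    /* union bpf_attr, BPF_PROG_ATTACH, BPF_CGROUP_SOCK_OPS */
#include <string.h>       /* memset */
#include <sys/syscall.h>  /* __NR_bpf */
#include <unistd.h>       /* syscall */

/* Hypothetical wrapper: attaches a loaded sock_ops program to a cgroup.
 * Returns 0 on success, or -1 with errno set on failure. */
static int attach_sock_ops(int cgroup_fd, int prog_fd)
{
    union bpf_attr attr;

    /* Unused fields of the attribute union must be zero. */
    memset(&attr, 0, sizeof(attr));
    attr.target_fd = (__u32)cgroup_fd;      /* cgroup to attach to */
    attr.attach_bpf_fd = (__u32)prog_fd;    /* program to attach */
    attr.attach_type = BPF_CGROUP_SOCK_OPS; /* attach point for sock_ops */

    return (int)syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
}
```

The call fails with -1 when either file descriptor is invalid or the caller lacks the required privileges.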

Examples

Clamping a connection

// Copyright (c) 2017 Facebook
#define DEBUG 1

SEC("sockops")
int bpf_clamp(struct bpf_sock_ops *skops)
{
    int bufsize = 150000;
    int to_init = 10;
    int clamp = 100;
    int rv = 0;
    int op;

    /* For testing purposes, only execute the rest of the BPF program
    * if one of the port numbers is 55601
    */
    if (bpf_ntohl(skops->remote_port) != 55601 && skops->local_port != 55601) {
        skops->reply = -1;
        return 1;
    }

    op = (int) skops->op;

#ifdef DEBUG
    bpf_printk("BPF command: %d\n", op);
#endif

    /* Check that both hosts are within same datacenter. For this example
    * it is the case when the first 5.5 bytes of their IPv6 addresses are
    * the same.
    */
    if (skops->family == AF_INET6 &&
        skops->local_ip6[0] == skops->remote_ip6[0] &&
        (bpf_ntohl(skops->local_ip6[1]) & 0xfff00000) ==
        (bpf_ntohl(skops->remote_ip6[1]) & 0xfff00000)) {
        switch (op) {
        case BPF_SOCK_OPS_TIMEOUT_INIT:
            rv = to_init;
            break;
        case BPF_SOCK_OPS_TCP_CONNECT_CB:
            /* Set sndbuf and rcvbuf of active connections */
            rv = bpf_setsockopt(skops, SOL_SOCKET, SO_SNDBUF,
                        &bufsize, sizeof(bufsize));
            rv += bpf_setsockopt(skops, SOL_SOCKET,
                        SO_RCVBUF, &bufsize,
                        sizeof(bufsize));
            break;
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
            rv = bpf_setsockopt(skops, SOL_TCP,
                        TCP_BPF_SNDCWND_CLAMP,
                        &clamp, sizeof(clamp));
            break;
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
            /* Set sndbuf and rcvbuf of passive connections */
            rv = bpf_setsockopt(skops, SOL_TCP,
                        TCP_BPF_SNDCWND_CLAMP,
                        &clamp, sizeof(clamp));
            rv += bpf_setsockopt(skops, SOL_SOCKET,
                        SO_SNDBUF, &bufsize,
                        sizeof(bufsize));
            rv += bpf_setsockopt(skops, SOL_SOCKET,
                        SO_RCVBUF, &bufsize,
                        sizeof(bufsize));
            break;
        default:
            rv = -1;
        }
    } else {
        rv = -1;
    }
#ifdef DEBUG
    bpf_printk("Returning %d\n", rv);
#endif
    skops->reply = rv;
    return 1;
}

Dump statistics

#define INTERVAL            1000000000ULL

int _version SEC("version") = 1;
char _license[] SEC("license") = "GPL";

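/* Per-socket storage holding the time of the next dump. This is a
 * BTF-defined map; the __uint and __type macros come from
 * <bpf/bpf_helpers.h>.
 */
struct {
    __uint(type, BPF_MAP_TYPE_SK_STORAGE);
    __uint(map_flags, BPF_F_NO_PREALLOC);
    __type(key, int);
    __type(value, __u64);
} bpf_next_dump SEC(".maps");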

SEC("sockops")
int _sockops(struct bpf_sock_ops *ctx)
{
    struct bpf_tcp_sock *tcp_sk;
    struct bpf_sock *sk;
    __u64 *next_dump;
    __u64 now;

    switch (ctx->op) {
    case BPF_SOCK_OPS_TCP_CONNECT_CB:
        bpf_sock_ops_cb_flags_set(ctx, BPF_SOCK_OPS_RTT_CB_FLAG);
        return 1;
    case BPF_SOCK_OPS_RTT_CB:
        break;
    default:
        return 1;
    }

    sk = ctx->sk;
    if (!sk)
        return 1;

    next_dump = bpf_sk_storage_get(&bpf_next_dump, sk, 0,
                    BPF_SK_STORAGE_GET_F_CREATE);
    if (!next_dump)
        return 1;

    now = bpf_ktime_get_ns();
    if (now < *next_dump)
        return 1;

    tcp_sk = bpf_tcp_sock(sk);
    if (!tcp_sk)
        return 1;

    *next_dump = now + INTERVAL;

    bpf_printk("dsack_dups=%u delivered=%u\n",
        tcp_sk->dsack_dups, tcp_sk->delivered);
    bpf_printk("delivered_ce=%u icsk_retransmits=%u\n",
        tcp_sk->delivered_ce, tcp_sk->icsk_retransmits);

    return 1;
}

Adding socket to map

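// Copyright (c) 2017-2018 Covalent IO
// sock_map is assumed to be declared elsewhere in the program: a
// BPF_MAP_TYPE_SOCKMAP map when SOCKMAP is defined, otherwise a
// BPF_MAP_TYPE_SOCKHASH map.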
SEC("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
    __u32 lport, rport;
    int op, err = 0, ret;

    op = (int) skops->op;

    switch (op) {
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        lport = skops->local_port;
        rport = skops->remote_port;

        if (lport == 10000) {
            ret = 1;
#ifdef SOCKMAP
            err = bpf_sock_map_update(skops, &sock_map, &ret,
                        BPF_NOEXIST);
#else
            err = bpf_sock_hash_update(skops, &sock_map, &ret,
                        BPF_NOEXIST);
#endif
        }
        break;
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        lport = skops->local_port;
        rport = skops->remote_port;

        if (bpf_ntohl(rport) == 10001) {
            ret = 10;
#ifdef SOCKMAP
            err = bpf_sock_map_update(skops, &sock_map, &ret,
                        BPF_NOEXIST);
#else
            err = bpf_sock_hash_update(skops, &sock_map, &ret,
                        BPF_NOEXIST);
#endif
        }
        break;
    default:
        break;
    }

    return 1;
}

Helper functions

Supported helper functions

KFuncs

There are currently no kfuncs supported for this program type.