    net: add initial support for AF_XDP network backend · cb039ef3
    Ilya Maximets authored
    
    AF_XDP is a network socket family that allows communication directly
    with the network device driver in the kernel, bypassing most or all
    of the kernel networking stack.  In essence, the technology is
    similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
    and works with any network interface without driver modifications.
    Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
    require access to character devices or unix sockets.  Only access to
    the network interface itself is necessary.
    
    This patch implements a network backend that communicates with the
    kernel by creating an AF_XDP socket.  A chunk of userspace memory
    is shared between QEMU and the host kernel.  Four ring buffers (Tx,
    Rx, Fill and Completion) are placed in that memory along with a pool
    of memory buffers for the packet data.  Transmission is done by
    allocating one of the buffers, copying packet data into it and
    placing a descriptor pointing to it into the Tx ring.  After
    transmission, the device returns the buffer via the Completion ring.
    On Rx, the device takes a buffer from the pre-populated Fill ring,
    writes the packet data into it and places the buffer into the Rx ring.
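
    For illustration, below is a minimal sketch of this Tx path using the
    xsk helpers from libxdp (<xdp/xsk.h>; older setups ship them in
    libbpf's <bpf/xsk.h>).  It is not QEMU code: the function name and the
    'umem_area'/'free_addr' parameters are assumptions for illustration,
    and socket/UMEM creation (xsk_umem__create()/xsk_socket__create()) is
    assumed to have happened already.

      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <xdp/xsk.h>

      /* Illustrative sketch only; names are not QEMU identifiers. */
      static bool send_one_packet(struct xsk_socket *xsk,
                                  struct xsk_ring_prod *tx,
                                  void *umem_area, uint64_t free_addr,
                                  const void *pkt, uint32_t len)
      {
          uint32_t idx;

          /* Reserve one descriptor slot in the Tx ring. */
          if (xsk_ring_prod__reserve(tx, 1, &idx) != 1) {
              return false;   /* Tx ring is full. */
          }

          /* Copy the packet into a free UMEM buffer and fill the descriptor. */
          memcpy(xsk_umem__get_data(umem_area, free_addr), pkt, len);
          xsk_ring_prod__tx_desc(tx, idx)->addr = free_addr;
          xsk_ring_prod__tx_desc(tx, idx)->len = len;

          /* Publish the descriptor and kick the kernel if it asked for it. */
          xsk_ring_prod__submit(tx, 1);
          if (xsk_ring_prod__needs_wakeup(tx)) {
              sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);
          }

          /* The kernel hands the buffer back via the Completion ring later. */
          return true;
      }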
    
    The AF_XDP network backend handles the communication with the host
    kernel and the network interface and forwards packets to/from the
    peer device in QEMU.
    
    Usage example:
    
      -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
      -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
    
    An XDP program bridges the socket with a network interface.  It can be
    attached to the interface in two different modes:

    1. skb - this mode should work for any interface and doesn't require
             driver support, at the cost of lower performance.

    2. native - this mode requires support from the driver and allows
                bypassing skb allocation in the kernel and potentially
                using zero-copy while getting packets in/out of userspace.
    
    By default, QEMU will try to use native mode and fall back to skb.
    The mode can be forced via the 'mode' option.  To force copying even
    in native mode, use the 'force-copy=on' option.  This might be useful
    if there is some issue with the driver.
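
    For example, adapting the usage example above, forcing skb mode or
    forcing copy mode on top of a native attachment would look like:

      -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=skb
      -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,force-copy=on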
    
    The 'queues=N' option specifies how many device queues should be
    opened.  Note that all the queues that are not opened are still
    functional and can receive traffic, but it will not be delivered to
    QEMU.  So, the number of device queues should generally match the
    QEMU configuration, unless the device is shared with something else
    and the redirection of traffic to the appropriate queues is correctly
    configured on the device level (e.g. with ethtool -N).
    The 'start-queue=M' option can be used to specify from which queue id
    QEMU should start configuring 'N' queues.  It might also be necessary
    to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
    for examples.
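
    As an illustration (not taken from the docs), a hypothetical setup
    that opens four queues starting from queue 1 and steers traffic for
    the guest MAC to one of those queues on the device level might look
    roughly like the following; exact ethtool -N syntax and support
    depend on the NIC:

      ethtool -N ens6f1np1 flow-type ether dst 00:16:35:AF:AA:5C action 1
      -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=4,start-queue=1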
    
    In the general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
    or CAP_BPF capabilities in order to load the default XSK/XDP programs
    onto the network interface and configure BPF maps.  It is possible,
    however, to run with no capabilities.  For that to work, an external
    process with enough capabilities will need to pre-load the default
    XSK program, create the AF_XDP sockets and pass their file descriptors
    to the QEMU process on startup via the 'sock-fds' option.  The network
    backend will need to be configured with 'inhibit=on' to avoid loading
    the program.  QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK)
    per queue or CAP_IPC_LOCK.
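
    A capability-less invocation might then look roughly like the line
    below, assuming the external process passed two already-created
    socket file descriptors (15 and 16) to QEMU; the exact 'sock-fds'
    value format is described in the QEMU documentation:

      -netdev af-xdp,ifname=ens6f1np1,id=guest1,queues=2,inhibit=on,sock-fds=15:16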
    
    There are a few performance challenges with the current network backends.
    
    The first is that they do not support IO threads.  This means that the
    data path is handled by the main thread in QEMU and may slow down other
    work or be slowed down by some other work.  It also means that taking
    advantage of multi-queue is generally not possible today.
    
    Another is that the data path goes through the device emulation code,
    which is not really optimized for performance.  The fastest "frontend"
    device is virtio-net, but it's not optimized for heavy traffic either,
    because it expects such use cases to be handled via some implementation
    of vhost (user, kernel, vdpa).  In practice, we have virtio
    notifications and RCU lock/unlock on a per-packet basis and not very
    efficient accesses to the guest memory.  Communication channels between
    backend and frontend devices also do not allow passing more than one
    packet at a time.
    
    Some of these challenges can be avoided in the future by adding better
    batching into the device emulation or by implementing a vhost-af-xdp
    variant.
    
    There are also a few kernel limitations.  AF_XDP sockets do not
    support any kind of checksum or segmentation offloading.  Buffers
    are limited to a page size (4K), i.e. the MTU is limited.  An
    implementation of multi-buffer support for AF_XDP is in progress,
    but not ready yet.  Also, transmission in all non-zero-copy modes
    is synchronous, i.e. done in a syscall.  That doesn't allow high
    packet rates on virtual interfaces.
    
    However, keeping all of these challenges in mind, the current
    implementation of the AF_XDP backend shows decent performance while
    running on top of a physical NIC with zero-copy support.
    
    Test setup:
    
    Two VMs running on two physical hosts connected via ConnectX-6 Dx cards.
    The network backend is configured to open the NIC directly in native
    mode.  The driver supports zero-copy.  The NIC is configured to use
    a single queue.
    
    Inside the VMs, iperf3 is used for basic TCP performance testing and
    dpdk-testpmd for PPS testing.
    
    iperf3 result:
     TCP stream      : 19.1 Gbps
    
    dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
     Tx only         : 3.4 Mpps
     Rx only         : 2.0 Mpps
     L2 FWD Loopback : 1.5 Mpps
    
    In skb mode the same setup shows much lower performance, similar to
    a setup where the pair of physical NICs is replaced with a veth pair:
    
    iperf3 result:
      TCP stream      : 9 Gbps
    
    dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
      Tx only         : 1.2 Mpps
      Rx only         : 1.0 Mpps
      L2 FWD Loopback : 0.7 Mpps
    
    Results in skb mode or over veth are close to those of a tap backend
    with vhost=on and segmentation offloading disabled, bridged with a NIC.
    
    Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
    Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> (docker/lcitool)
    Signed-off-by: Jason Wang <jasowang@redhat.com>