  1. Nov 07, 2023
  2. Nov 02, 2023
  3. Sep 18, 2023
    • net: add initial support for AF_XDP network backend · cb039ef3
      Ilya Maximets authored
      
      AF_XDP is a network socket family that allows communication directly
      with the network device driver in the kernel, bypassing most or all
      of the kernel networking stack.  In essence, the technology is
      pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
      and works with any network interfaces without driver modifications.
      Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
      require access to character devices or unix sockets.  Only access to
      the network interface itself is necessary.
      
      This patch implements a network backend that communicates with the
      kernel by creating an AF_XDP socket.  A chunk of userspace memory
      is shared between QEMU and the host kernel.  Four ring buffers (Tx, Rx,
      Fill and Completion) are placed in that memory along with a pool of
      memory buffers for the packet data.  Data transmission is done by
      allocating one of the buffers, copying packet data into it and
      placing the pointer into the Tx ring.  After transmission, the device
      returns the buffer via the Completion ring.  On Rx, the device takes
      a buffer from a pre-populated Fill ring, writes the packet data into
      it and places the buffer into the Rx ring.
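The buffer flow described above can be sketched as a toy model. This is only an illustration of the ownership hand-offs between the four rings; the class `ToyXskRings` and all of its method names are hypothetical, and nothing here touches the real AF_XDP socket API or QEMU code.

```python
from collections import deque


class ToyXskRings:
    """Toy model of the Tx, Completion, Fill and Rx rings (not the kernel API)."""

    def __init__(self, num_buffers, buf_size=4096):
        self.buf_size = buf_size
        self.buffers = [bytearray(buf_size) for _ in range(num_buffers)]
        self.lengths = [0] * num_buffers
        self.free = deque(range(num_buffers))  # buffers owned by userspace
        self.tx, self.completion = deque(), deque()
        self.fill, self.rx = deque(), deque()

    def _store(self, idx, packet):
        self.buffers[idx][:len(packet)] = packet
        self.lengths[idx] = len(packet)

    def _load(self, idx):
        return bytes(self.buffers[idx][:self.lengths[idx]])

    def _check(self, packet):
        if len(packet) > self.buf_size:
            raise ValueError("packet does not fit into one buffer")

    def send(self, packet):
        """Userspace: allocate a buffer, copy the packet, queue it on Tx."""
        self._check(packet)
        idx = self.free.popleft()
        self._store(idx, packet)
        self.tx.append(idx)
        return idx

    def device_transmit(self):
        """Device: drain Tx and hand the buffers back via Completion."""
        sent = []
        while self.tx:
            idx = self.tx.popleft()
            sent.append(self._load(idx))
            self.completion.append(idx)
        return sent

    def reclaim(self):
        """Userspace: take completed buffers back into the free pool."""
        while self.completion:
            self.free.append(self.completion.popleft())

    def refill(self, count):
        """Userspace: pre-populate the Fill ring with free buffers."""
        for _ in range(min(count, len(self.free))):
            self.fill.append(self.free.popleft())

    def device_receive(self, packet):
        """Device: take a buffer from Fill, write the packet, queue it on Rx."""
        self._check(packet)
        idx = self.fill.popleft()
        self._store(idx, packet)
        self.rx.append(idx)

    def recv(self):
        """Userspace: read one packet from Rx and free its buffer."""
        idx = self.rx.popleft()
        data = self._load(idx)
        self.free.append(idx)
        return data
```

In this sketch a buffer index is always owned by exactly one queue, which mirrors why the Completion and Fill rings are needed in addition to Tx and Rx.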
      
      AF_XDP network backend takes on the communication with the host
      kernel and the network interface and forwards packets to/from the
      peer device in QEMU.
      
      Usage example:
      
        -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
        -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
      
      An XDP program bridges the socket with a network interface.  It can be
      attached to the interface in 2 different modes:
      
      1. skb - this mode should work for any interface and doesn't require
               driver support, with the caveat of lower performance.
      
      2. native - this mode requires support from the driver and allows
                  bypassing skb allocation in the kernel and potentially using
                  zero-copy while getting packets in and out of userspace.
      
      By default, QEMU will try to use native mode and fall back to skb.
      The mode can be forced via the 'mode' option.  To force 'copy' even in
      native mode, use the 'force-copy=on' option.  This might be useful if
      there is some issue with the driver.
      
      The 'queues=N' option specifies how many device queues should
      be open.  Note that all the queues that are not open are still
      functional and can receive traffic, but it will not be delivered to
      QEMU.  So, the number of device queues should generally match the
      QEMU configuration, unless the device is shared with something
      else and the traffic re-direction to appropriate queues is correctly
      configured on a device level (e.g. with ethtool -N).
      'start-queue=M' option can be used to specify from which queue id
      QEMU should start configuring 'N' queues.  It might also be necessary
      to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
      for examples.
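As a rough sketch of how these options combine, the helpers below build a `-netdev af-xdp` argument and list the queue ids that 'queues=N' together with 'start-queue=M' would cover. The function names are hypothetical, and the default starting queue of 0 is an assumption that the text above does not state.

```python
def af_xdp_queue_ids(queues, start_queue=0):
    """Queue ids covered by queues=N starting at start-queue=M.

    Assumption: queues are numbered consecutively and start at 0 by default.
    """
    if queues < 1 or start_queue < 0:
        raise ValueError("queues must be >= 1 and start_queue >= 0")
    return list(range(start_queue, start_queue + queues))


def af_xdp_netdev_arg(ifname, netdev_id, queues=1, start_queue=None,
                      mode=None, force_copy=False):
    """Build the value passed to -netdev, using only options named above."""
    parts = ["af-xdp", f"ifname={ifname}", f"id={netdev_id}"]
    if mode is not None:
        parts.append(f"mode={mode}")
    parts.append(f"queues={queues}")
    if start_queue is not None:
        parts.append(f"start-queue={start_queue}")
    if force_copy:
        parts.append("force-copy=on")
    return ",".join(parts)


if __name__ == "__main__":
    print("-netdev", af_xdp_netdev_arg("ens6f1np1", "guest1", mode="native"))
    print(af_xdp_queue_ids(2, start_queue=4))
```

With the arguments from the usage example, the first helper reproduces the same `-netdev` string shown there.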
      
      In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
      or CAP_BPF capabilities in order to load default XSK/XDP programs to
      the network interface and configure BPF maps.  It is possible, however,
      to run with no capabilities.  For that to work, an external process
      with enough capabilities will need to pre-load the default XSK program,
      create AF_XDP sockets and pass their file descriptors to the QEMU process
      on startup via the 'sock-fds' option.  The network backend will need to be
      configured with 'inhibit=on' to avoid loading of the program.
      QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
      or CAP_IPC_LOCK.
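The locked-memory requirement can be checked ahead of time with a short script. This is a minimal sketch: it assumes "32 MB" means 32 * 1024 * 1024 bytes, the helper names are hypothetical, and it does not detect CAP_IPC_LOCK, which would make the limit irrelevant.

```python
import resource

# Assumption: "32 MB" per queue is interpreted as 32 MiB.
MEMLOCK_PER_QUEUE = 32 * 1024 * 1024


def required_memlock_bytes(queues):
    """Locked memory needed for the given number of queues."""
    return queues * MEMLOCK_PER_QUEUE


def memlock_limit_is_sufficient(queues, soft_limit=None):
    """Compare the soft RLIMIT_MEMLOCK of this process with the requirement."""
    if soft_limit is None:
        soft_limit, _hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return soft_limit >= required_memlock_bytes(queues)


if __name__ == "__main__":
    for n in (1, 4):
        print(n, required_memlock_bytes(n), memlock_limit_is_sufficient(n))
```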
      
      There are a few performance challenges with the current network backends.
      
      First is that they do not support IO threads.  This means that data
      path is handled by the main thread in QEMU and may slow down other
      work or may be slowed down by some other work.  This also means that
      taking advantage of multi-queue is generally not possible today.
      
      Another thing is that data path is going through the device emulation
      code, which is not really optimized for performance.  The fastest
      "frontend" device is virtio-net.  But it's not optimized for heavy
      traffic either, because it expects such use-cases to be handled via
      some implementation of vhost (user, kernel, vdpa).  In practice, we
      have virtio notifications and rcu lock/unlock on a per-packet basis
      and not very efficient accesses to the guest memory.  Communication
      channels between backend and frontend devices do not allow passing
      more than one packet at a time either.
      
      Some of these challenges can be avoided in the future by adding better
      batching into device emulation or by implementing a vhost-af-xdp variant.
      
      There are also a few kernel limitations.  AF_XDP sockets do not
      support any kind of checksum or segmentation offloading.  Buffers
      are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
      support implementation for AF_XDP is in progress, but not ready yet.
      Also, transmission in all non-zero-copy modes is synchronous, i.e.
      done in a syscall.  That doesn't allow high packet rates on virtual
      interfaces.
      
      However, even with all of these challenges, the current implementation
      of the AF_XDP backend shows decent performance while running on top
      of a physical NIC with zero-copy support.
      
      Test setup:
      
      2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
      Network backend is configured to open the NIC directly in native mode.
      The driver supports zero-copy.  NIC is configured to use 1 queue.
      
      Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
      for PPS testing.
      
      iperf3 result:
       TCP stream      : 19.1 Gbps
      
      dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
       Tx only         : 3.4 Mpps
       Rx only         : 2.0 Mpps
       L2 FWD Loopback : 1.5 Mpps
      
      In skb mode the same setup shows much lower performance, similar to
      the setup where the pair of physical NICs is replaced with a veth pair:
      
      iperf3 result:
        TCP stream      : 9 Gbps
      
      dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
        Tx only         : 1.2 Mpps
        Rx only         : 1.0 Mpps
        L2 FWD Loopback : 0.7 Mpps
      
      Results in skb mode or over the veth are close to results of a tap
      backend with vhost=on and disabled segmentation offloading bridged
      with a NIC.
      
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> (docker/lcitool)
      Signed-off-by: Jason Wang <jasowang@redhat.com>
  4. May 10, 2023
  5. May 02, 2023
  6. Mar 13, 2023
  7. Mar 01, 2023
  8. Feb 04, 2023
  9. Oct 28, 2022
    • qapi: net: add stream and dgram netdevs · 5166fe0a
      Laurent Vivier authored
      
      Copied from socket netdev file and modified to use SocketAddress
      to be able to introduce new features like unix socket.
      
      "udp" and "mcast" are squashed into dgram netdev, multicast is detected
      according to the IP address type.
      "listen" and "connect" modes are managed by stream netdev. An optional
      parameter "server" defines the mode (off by default).
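Detecting multicast from the address type can be illustrated with Python's standard `ipaddress` module. This is only a sketch of the idea; the function name is hypothetical, and QEMU's own check is implemented in C and may differ in detail.

```python
import ipaddress


def dgram_is_multicast(host):
    """Return True if the literal IPv4/IPv6 address is a multicast address.

    Only literal addresses are handled here; a host name raises ValueError.
    """
    return ipaddress.ip_address(host).is_multicast


if __name__ == "__main__":
    for addr in ("230.0.0.1", "192.168.1.10", "ff02::1"):
        print(addr, dgram_is_multicast(addr))
```

Under this scheme a single "dgram" type is enough, because the address itself says whether the old "mcast" or "udp" behaviour applies.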
      
      The two new types need to be parsed the modern way with -netdev, because
      with the traditional way, the "type" field of netdev structure collides with
      the "type" field of SocketAddress and prevents the correct evaluation of the
      command line option. Moreover, the traditional way doesn't allow using
      the same type (SocketAddress) several times with the -netdev option
      (needed to specify "local" and "remote" addresses).
      
      The previous commit paved the way for parsing the modern way, but
      omitted one detail: how to pick modern vs. traditional, in
      netdev_is_modern().
      
      We want to pick based on the value of parameter "type".  But how to
      extract it from the option argument?
      
      Parsing the option argument, either the modern or the traditional way,
      extracts it for us, but only if parsing succeeds.
      
      If parsing fails, there is no good option.  No matter which parser we
      pick, it'll be the wrong one for some arguments, and the error
      reporting will be confusing.
      
      Fortunately, the traditional parser accepts *anything* when called in
      a certain way.  This maximizes our chance to extract the value of
      "type", and in turn minimizes the risk of confusing error reporting.
      
      Signed-off-by: Laurent Vivier <lvivier@redhat.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Markus Armbruster <armbru@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
  10. Oct 17, 2022
    • qmp/hmp, device_tree.c: introduce dumpdtb · bf353ad5
      Daniel Henrique Barboza authored
      
      To save the FDT blob we have the '-machine dumpdtb=<file>' property.
      With this property set, the machine saves the FDT in <file> and exits.
      The created file can then be converted to plain text dts format using
      'dtc'.
      
      There's nothing particularly sophisticated about saving the FDT that
      can't be done with the machine at any state, as long as the machine has
      a valid FDT to be saved.
      
      The 'dumpdtb' command receives a 'filename' parameter and, if the FDT is
      available via current_machine->fdt, saves it in dtb format to 'filename'.
      In short, this is a '-machine dumpdtb' that can be fired on demand via
      QMP/HMP.
      
      This command will always be executed in-band (i.e. holding BQL),
      avoiding potential race conditions with machines that might change the
      FDT during runtime (e.g. PowerPC 'pseries' machine).
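For illustration, the QMP message for this command could be built as follows. The command name and its 'filename' parameter come from the text above; the helper name is hypothetical, and the sketch only builds the JSON text rather than talking to a QMP socket (which would first require the usual `qmp_capabilities` handshake).

```python
import json


def qmp_dumpdtb(filename):
    """Build the JSON text of a QMP 'dumpdtb' request."""
    return json.dumps({
        "execute": "dumpdtb",
        "arguments": {"filename": filename},
    })


if __name__ == "__main__":
    # "fdt.dtb" is only an example output file name.
    print(qmp_dumpdtb("fdt.dtb"))
```

The resulting file can then be converted to dts text with 'dtc', as with '-machine dumpdtb'.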
      
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Cc: Markus Armbruster <armbru@redhat.com>
      Cc: Alistair Francis <alistair.francis@wdc.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Acked-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <f4bug@amsat.org>
      Reviewed-by: Markus Armbruster <armbru@redhat.com>
      Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
      Message-Id: <20220926173855.1159396-2-danielhb413@gmail.com>
  11. Sep 15, 2022
  12. Jul 20, 2022
    • softmmu/dirtylimit: Implement dirty page rate limit · f3b2e38c
      Hyman Huang authored
      
      Implement periodic dirty rate calculation based on the
      dirty ring and throttle the virtual CPU until it reaches the
      dirty page rate quota given by the user.
      
      Introduce QMP commands "set-vcpu-dirty-limit",
      "cancel-vcpu-dirty-limit", "query-vcpu-dirty-limit"
      to enable, disable, and query the dirty page limit for virtual CPUs.
      
      Meanwhile, introduce corresponding HMP commands
      "set_vcpu_dirty_limit", "cancel_vcpu_dirty_limit",
      "info vcpu_dirty_limit" so the feature can be more usable.
      
      "query-vcpu-dirty-limit" success depends on enabling dirty
      page rate limit, so just add it to the list of skipped
      commands to ensure that qmp-cmd-test runs successfully.
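A small helper can show what requests for the three QMP commands might look like. The command names are taken from the text above; the argument names "dirty-rate" and "cpu-index" are not stated in this commit message and are an assumption to verify against the QAPI schema. The helper `qmp_command` is hypothetical.

```python
import json


def qmp_command(name, **arguments):
    """Build a QMP request; Python-style keyword names map to dashed names."""
    message = {"execute": name}
    args = {key.replace("_", "-"): value
            for key, value in arguments.items() if value is not None}
    if args:
        message["arguments"] = args
    return json.dumps(message)


if __name__ == "__main__":
    # Assumed argument names; check them in the QAPI schema before use.
    print(qmp_command("set-vcpu-dirty-limit", dirty_rate=200, cpu_index=1))
    print(qmp_command("query-vcpu-dirty-limit"))
    print(qmp_command("cancel-vcpu-dirty-limit", cpu_index=1))
```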
      
      Signed-off-by: Hyman Huang(黄勇) <huangy81@chinatelecom.cn>
      Acked-by: Markus Armbruster <armbru@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Message-Id: <4143f26706d413dd29db0b672fe58b3d3fbe34bc.1656177590.git.huangy81@chinatelecom.cn>
      Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
  13. Jun 14, 2022
  14. May 17, 2022
  15. Apr 27, 2022
  16. Apr 25, 2022
  17. Apr 22, 2022
  18. Mar 02, 2022
  19. Nov 09, 2021
  20. Nov 01, 2021
  21. Sep 27, 2021
  22. Jun 11, 2021
  23. Jun 08, 2021
  24. May 26, 2021
  25. May 25, 2021
  26. Mar 23, 2021
  27. Mar 19, 2021
  28. Mar 18, 2021
  29. Jan 23, 2021
    • hmp: remove "change vnc TARGET" command · cfb5387a
      Paolo Bonzini authored
      
      The HMP command "change vnc TARGET" is messy:
      
      - it takes an ugly shortcut to determine if the option has an "id",
      with incorrect results if "id=" is not preceded by an unescaped
      comma.
      
      - it deletes the existing QemuOpts and does not try to roll back
      if the parsing fails (which is not causing problems, but only due to
      how VNC options are parsed).
      
      - because it uses the same parsing function as "-vnc", it forces
      the latter to not support "-vnc help".
      
      On top of this, it uses a deprecated QMP command, thus getting in
      the way of removing the QMP command.  Since the use case for the
      command is not clear, just remove it and send "change vnc password"
      directly to the QMP "change-vnc-password" command.
      
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: Eric Blake <eblake@redhat.com>
      Reviewed-by: Gerd Hoffmann <kraxel@redhat.com>
      Message-Id: <20210120144235.345983-2-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  30. Dec 18, 2020
  31. Dec 15, 2020
  32. Dec 09, 2020
  33. Nov 04, 2020
  34. Oct 09, 2020
    • block: Convert 'block_resize' to coroutine · eb94b81a
      Kevin Wolf authored
      
      block_resize performs some I/O that could potentially take quite some
      time, so use it as an example for the new 'coroutine': true annotation
      in the QAPI schema.
      
      bdrv_truncate() requires that we're already in the right AioContext for
      the BlockDriverState if called in coroutine context. So instead of just
      taking the AioContext lock, move the QMP handler coroutine to the
      context.
      
      Call blk_unref() only after switching back because blk_unref() may only
      be called in the main thread.
      
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      Message-Id: <20201005155855.256490-15-kwolf@redhat.com>
      Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
  35. Oct 06, 2020
  36. Sep 29, 2020
  37. Sep 17, 2020