  1. Oct 17, 2023
    • migration: Improve json and formatting · e4ceec29
      Juan Quintela authored
      
      Reviewed-by: Markus Armbruster <armbru@redhat.com>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
      Message-ID: <20231013104736.31722-2-quintela@redhat.com>
    • migration: Allow user to specify available switchover bandwidth · 8b239597
      Peter Xu authored
      
      Migration bandwidth is a very important value to live migration.  It's
      because it's one of the major factors that we'll make decision on when to
      switchover to destination in a precopy process.
      
      This value is currently estimated by QEMU during the whole live migration
      process by monitoring how fast we were sending the data.  This can be the
      most accurate bandwidth if in the ideal world, where we're always feeding
      unlimited data to the migration channel, and then it'll be limited to the
      bandwidth that is available.
      
      However in reality it may be very different, e.g., over a 10Gbps network we
      can see query-migrate showing migration bandwidth of only a few tens of
      MB/s just because there are plenty of other things the migration thread
      might be doing.  For example, the migration thread can be busy scanning
      zero pages, or it can be fetching dirty bitmap from other external dirty
      sources (like vhost or KVM).  It means we may not be pushing data as much
      as possible to migration channel, so the bandwidth estimated from "how much
      data we sent in the channel" can be dramatically inaccurate sometimes.
      
      With that, the decision to switchover will be affected, by assuming that we
      may not be able to switchover at all with such a low bandwidth, but in
      reality we can.
      
      The migration may not even converge at all with the downtime specified,
      with that wrong estimation of bandwidth, keeping iterations forever with a
      low estimation of bandwidth.
      
      The issue is QEMU itself may not be able to avoid those uncertainties on
      measuring the real "available migration bandwidth".  At least not something
      I can think of so far.
      
      One way to fix this is when the user is fully aware of the available
      bandwidth, then we can allow the user to help providing an accurate value.
      
      For example, if the user has a dedicated channel of 10Gbps for migration
      for this specific VM, the user can specify this bandwidth so QEMU can
      always do the calculation based on this fact, trusting the user as long as
      specified.  It may not be the exact bandwidth when switching over (in which
      case qemu will push migration data as fast as possible), but much better
      than QEMU trying to wildly guess, especially when very wrong.
      
      A new parameter "avail-switchover-bandwidth" is introduced just for this.
      So when the user specified this parameter, instead of trusting the
      estimated value from QEMU itself (based on the QEMUFile send speed), it
      trusts the user more by using this value to decide when to switchover,
      assuming that we'll have such bandwidth available then.
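
      The switchover decision described above can be sketched as follows.
      This is a simplified illustration, not QEMU's actual code: the
      function name and arguments are hypothetical, and treating 0 as
      "parameter not set" is an assumption made for the sketch.

```python
def can_switchover(remaining_bytes, downtime_limit_ms,
                   estimated_bw, avail_switchover_bw=0):
    """Return True if the remaining data should fit in the downtime budget.

    Bandwidths are in bytes per second. A non-zero avail_switchover_bw
    (hypothetical stand-in for "avail-switchover-bandwidth") takes
    precedence over the estimate derived from the observed send speed.
    """
    bandwidth = avail_switchover_bw or estimated_bw
    expected_downtime_ms = remaining_bytes * 1000 / bandwidth
    return expected_downtime_ms <= downtime_limit_ms


remaining = 256 * 2**20  # 256 MiB still dirty
# Estimated bandwidth of 50 MB/s: switchover looks impossible.
print(can_switchover(remaining, 300, 50_000_000))
# User states that 10 Gbps (1.25e9 bytes/s) is available: switchover fits.
print(can_switchover(remaining, 300, 50_000_000, 1_250_000_000))
```

      With the same amount of remaining data, only the bandwidth figure
      changes the outcome, which is why a badly underestimated bandwidth
      can keep the migration iterating.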
      
      Note that specifying this value will not throttle the bandwidth for
      switchover yet, so QEMU will always use the full bandwidth possible for
      sending switchover data, assuming that should always be the most important
      way to use the network at that time.
      
      This can resolve issues like a non-converging migration that is caused by a
      hilariously low "migration bandwidth" detected for whatever reason.
      
      Reported-by: Zhiyi Guo <zhguo@redhat.com>
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Reviewed-by: Juan Quintela <quintela@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
      Message-ID: <20231010221922.40638-1-peterx@redhat.com>
  2. Oct 11, 2023
  3. Oct 10, 2023
    • migration/dirtyrate: use QEMU_CLOCK_HOST to report start-time · 320a6ccc
      Andrei Gudkov authored
      
      Currently query-dirty-rate uses QEMU_CLOCK_REALTIME as
      the source for start-time field. This translates to
      clock_gettime(CLOCK_MONOTONIC), i.e. number of seconds
      since host boot. This is not very useful. The only
      reasonable use case of start-time I can imagine is to
      check whether previously completed measurements are
      too old or not. But this makes sense only if start-time
      is reported as host wall-clock time.
      
      This patch replaces source of start-time from
      QEMU_CLOCK_REALTIME to QEMU_CLOCK_HOST.
      
      Signed-off-by: Andrei Gudkov <gudkov.andrei@huawei.com>
      Reviewed-by: Hyman Huang <yong.huang@smartx.com>
      Message-Id: <399861531e3b24a1ecea2ba453fb2c3d129fb03a.1693905328.git.gudkov.andrei@huawei.com>
      Signed-off-by: Hyman Huang <yong.huang@smartx.com>
    • migration/calc-dirty-rate: millisecond-granularity period · 34a68001
      Andrei Gudkov authored
      
      This patch allows measuring the dirty page rate over
      sub-second intervals of time. An optional argument is
      introduced -- calc-time-unit. For example:
      {"execute": "calc-dirty-rate", "arguments":
        {"calc-time": 500, "calc-time-unit": "millisecond"} }
      
      Millisecond granularity makes it possible to predict whether
      migration will succeed or not. To do this, calculate dirty
      rate with calc-time set to max allowed downtime (e.g. 300ms),
      convert measured rate into volume of dirtied memory,
      and divide by network throughput. If the value is lower
      than max allowed downtime, then migration will converge.
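
      The prediction procedure above can be written out as a small
      calculation. This is only a sketch: the function name is
      hypothetical, and the throughput figure in the example is an
      assumed value, not a measurement from this patch.

```python
def will_converge(dirty_rate_mib_s, max_downtime_ms, throughput_mib_s):
    """Apply the rule from the text: dirty rate measured over the allowed
    downtime -> dirtied volume -> time to transfer that volume."""
    dirtied_mib = dirty_rate_mib_s * max_downtime_ms / 1000
    transfer_ms = dirtied_mib / throughput_mib_s * 1000
    return transfer_ms < max_downtime_ms


# Assumed network throughput of roughly 10 Gbps, expressed in MiB/s.
throughput = 1192
print(will_converge(3280, 300, throughput))  # heavy writer
print(will_converge(500, 300, throughput))   # light writer
```

      Because the dirtied volume scales with the same interval that
      serves as the budget, the check reduces to comparing the dirty rate
      measured at that interval with the network throughput.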
      
      Measurement results for single thread randomly writing to
      a 1/4/24GiB memory region:
      
      +----------------+-----------------------------------------------+
      | calc-time      |                dirty rate MiB/s               |
      | (milliseconds) +----------------+---------------+--------------+
      |                | theoretical    | page-sampling | dirty-bitmap |
      |                | (at 3M wr/sec) |               |              |
      +----------------+----------------+---------------+--------------+
      |                               1GiB                             |
      +----------------+----------------+---------------+--------------+
      |            100 |           6996 |          7100 |         3192 |
      |            200 |           4606 |          4660 |         2655 |
      |            300 |           3305 |          3280 |         2371 |
      |            400 |           2534 |          2525 |         2154 |
      |            500 |           2041 |          2044 |         1871 |
      |            750 |           1365 |          1341 |         1358 |
      |           1000 |           1024 |          1052 |         1025 |
      |           1500 |            683 |           678 |          684 |
      |           2000 |            512 |           507 |          513 |
      +----------------+----------------+---------------+--------------+
      |                               4GiB                             |
      +----------------+----------------+---------------+--------------+
      |            100 |          10232 |          8880 |         4070 |
      |            200 |           8954 |          8049 |         3195 |
      |            300 |           7889 |          7193 |         2881 |
      |            400 |           6996 |          6530 |         2700 |
      |            500 |           6245 |          5772 |         2312 |
      |            750 |           4829 |          4586 |         2465 |
      |           1000 |           3865 |          3780 |         2178 |
      |           1500 |           2694 |          2633 |         2004 |
      |           2000 |           2041 |          2031 |         1789 |
      +----------------+----------------+---------------+--------------+
      |                               24GiB                            |
      +----------------+----------------+---------------+--------------+
      |            100 |          11495 |          8640 |         5597 |
      |            200 |          11226 |          8616 |         3527 |
      |            300 |          10965 |          8386 |         2355 |
      |            400 |          10713 |          8370 |         2179 |
      |            500 |          10469 |          8196 |         2098 |
      |            750 |           9890 |          7885 |         2556 |
      |           1000 |           9354 |          7506 |         2084 |
      |           1500 |           8397 |          6944 |         2075 |
      |           2000 |           7574 |          6402 |         2062 |
      +----------------+----------------+---------------+--------------+
      
      Theoretical values are computed according to the following formula:
      size * (1 - (1-(4096/size))^(time*wps)) / (time * 2^20),
      where size is in bytes, time is in seconds, and wps is number of
      writes per second.
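
      The formula can be evaluated directly. A minimal sketch follows;
      the function name is hypothetical and wps is taken as exactly
      3,000,000, following the "at 3M wr/sec" column header.

```python
def theoretical_dirty_rate(size, time, wps, page_size=4096):
    """Expected dirty rate in MiB/s for uniformly random writes.

    size is in bytes, time is in seconds, wps is writes per second.
    (1 - page_size/size) ** (time * wps) is the probability that a given
    page is never written during the interval.
    """
    untouched = (1 - page_size / size) ** (time * wps)
    return size * (1 - untouched) / (time * 2**20)


GIB = 2**30
for size_gib, seconds in [(1, 1.0), (1, 2.0), (4, 2.0)]:
    rate = theoretical_dirty_rate(size_gib * GIB, seconds, 3_000_000)
    print(f"{size_gib} GiB, {seconds} s: {rate:.0f} MiB/s")
```

      For long intervals on a small region nearly every page is dirtied,
      so the rate approaches size / time, which is why the 1GiB rows at
      1000 and 2000 ms sit at 1024 and 512 MiB/s.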
      
      Signed-off-by: Andrei Gudkov <gudkov.andrei@huawei.com>
      Reviewed-by: Hyman Huang <yong.huang@smartx.com>
      Message-Id: <d802e6b8053eb60fbec1a784cf86f67d9528e0a8.1693895970.git.gudkov.andrei@huawei.com>
      Signed-off-by: Hyman Huang <yong.huang@smartx.com>
  4. Sep 20, 2023
  5. Sep 19, 2023
    • backends/hostmem-file: Add "rom" property to support VM templating with R/O files · e92666b0
      David Hildenbrand authored
      
      For now, "share=off,readonly=on" would always result in us opening the
      file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively
      turning it into ROM.
      
      Especially for VM templating, "share=off" is a common use case. However,
      that use case is impossible with files that lack write permissions,
      because "share=off,readonly=on" will not give us writable RAM.
      
      The sole user of ROM via memory-backend-file are R/O NVDIMMs, but as we
      have users (Kata Containers) that rely on the existing behavior --
      malicious VMs should not be able to consume COW memory for R/O NVDIMMs --
      we cannot change the semantics of "share=off,readonly=on"
      
      So let's add a new "rom" property with on/off/auto values. "auto" is
      the default and what most people will use: for historical reasons, to not
      change the old semantics, it defaults to the value of the "readonly"
      property.
      
      For VM templating, one can now use:
          -object memory-backend-file,share=off,readonly=on,rom=off,...
      
      But we'll disallow:
          -object memory-backend-file,share=on,readonly=on,rom=off,...
      because we would otherwise get an error when trying to mmap the R/O file
      shared and writable. An explicit error message is cleaner.
      
      We will also disallow for now:
          -object memory-backend-file,share=off,readonly=off,rom=on,...
          -object memory-backend-file,share=on,readonly=off,rom=on,...
      It's not harmful, but also not really required for now.
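
      The allowed and disallowed combinations listed above can be
      summarized in a small model. This is an illustrative sketch only;
      the function names are hypothetical and the real checks live in
      QEMU's C code.

```python
def effective_rom(readonly, rom="auto"):
    """Resolve rom=auto to the value of "readonly", as described above."""
    if rom not in ("on", "off", "auto"):
        raise ValueError("rom must be 'on', 'off' or 'auto'")
    return readonly if rom == "auto" else rom == "on"


def is_allowed(share, readonly, rom="auto"):
    """Return False for the combinations the commit message disallows."""
    if rom == "off" and share and readonly:
        # Would require mapping a R/O file shared and writable.
        return False
    if rom == "on" and not readonly:
        # Not harmful, but not supported for now.
        return False
    return True


# VM templating: private writable RAM backed by a read-only file.
print(is_allowed(share=False, readonly=True, rom="off"))
print(effective_rom(readonly=True, rom="off"))
```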
      
      Alternatives that were abandoned:
      * Make "unarmed=on" for the NVDIMM set the memory region container
        readonly. We would still see a change of ROM->RAM and possibly run
        into memslot limits with vhost-user. Further, there might be use cases
        for "unarmed=on" that should still allow writing to that memory
        (temporary files, system RAM, ...).
      * Add a new "readonly=on/off/auto" parameter for NVDIMMs. Similar issues
        as with "unarmed=on".
      * Make "readonly" consume "on/off/file" instead of being a 'bool' type.
        This would slightly change the behavior of the "readonly" parameter:
        values like true/false (as accepted by a 'bool' type) would no longer be
        accepted.
      
      Message-ID: <20230906120503.359863-4-david@redhat.com>
      Acked-by: Markus Armbruster <armbru@redhat.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
  6. Sep 18, 2023
    • net: add initial support for AF_XDP network backend · cb039ef3
      Ilya Maximets authored
      
      AF_XDP is a network socket family that allows communication directly
      with the network device driver in the kernel, bypassing most or all
      of the kernel networking stack.  In essence, the technology is
      pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
      and works with any network interfaces without driver modifications.
      Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
      require access to character devices or unix sockets.  Only access to
      the network interface itself is necessary.
      
      This patch implements a network backend that communicates with the
      kernel by creating an AF_XDP socket.  A chunk of userspace memory
      is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
      Fill and Completion) are placed in that memory along with a pool of
      memory buffers for the packet data.  Data transmission is done by
      allocating one of the buffers, copying packet data into it and
      placing the pointer into Tx ring.  After transmission, device will
      return the buffer via Completion ring.  On Rx, device will take
      a buffer from a pre-populated Fill ring, write the packet data into
      it and place the buffer into Rx ring.
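
      The buffer flow through the four rings can be illustrated with a
      toy model. This is not the AF_XDP API: the class and method names
      are hypothetical, and plain Python deques stand in for the shared
      memory rings.

```python
from collections import deque


class ToyXsk:
    """Toy model of the Tx, Completion, Fill and Rx rings."""

    def __init__(self, num_buffers=4):
        self.buffers = [b""] * num_buffers
        self.free = deque(range(num_buffers))  # buffers owned by userspace
        self.tx, self.completion = deque(), deque()
        self.fill, self.rx = deque(), deque()

    def transmit(self, packet):
        # Allocate a buffer, copy the packet in, post it on the Tx ring.
        idx = self.free.popleft()
        self.buffers[idx] = packet
        self.tx.append(idx)

    def device_send(self):
        # Device side: send the packet, return the buffer via Completion.
        idx = self.tx.popleft()
        self.completion.append(idx)
        return self.buffers[idx]

    def reclaim(self):
        # Userspace takes completed buffers back.
        while self.completion:
            self.free.append(self.completion.popleft())

    def refill(self):
        # Pre-populate the Fill ring with free buffers.
        while self.free:
            self.fill.append(self.free.popleft())

    def device_receive(self, packet):
        # Device side: take a buffer from Fill, write data, post on Rx.
        idx = self.fill.popleft()
        self.buffers[idx] = packet
        self.rx.append(idx)

    def receive(self):
        idx = self.rx.popleft()
        self.free.append(idx)
        return self.buffers[idx]


xsk = ToyXsk(num_buffers=2)
xsk.transmit(b"ping")
print(xsk.device_send())
xsk.reclaim()
xsk.refill()
xsk.device_receive(b"pong")
print(xsk.receive())
```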
      
      AF_XDP network backend takes on the communication with the host
      kernel and the network interface and forwards packets to/from the
      peer device in QEMU.
      
      Usage example:
      
        -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
        -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
      
      XDP program bridges the socket with a network interface.  It can be
      attached to the interface in 2 different modes:
      
      1. skb - this mode should work for any interface and doesn't require
               driver support.  With a caveat of lower performance.
      
      2. native - this does require support from the driver and allows to
                  bypass skb allocation in the kernel and potentially use
                  zero-copy while getting packets in/out userspace.
      
      By default, QEMU will try to use native mode and fall back to skb.
      Mode can be forced via 'mode' option.  To force 'copy' even in native
      mode, use 'force-copy=on' option.  This might be useful if there is
      some issue with the driver.
      
      Option 'queues=N' allows to specify how many device queues should
      be open.  Note that all the queues that are not open are still
      functional and can receive traffic, but it will not be delivered to
      QEMU.  So, the number of device queues should generally match the
      QEMU configuration, unless the device is shared with something
      else and the traffic re-direction to appropriate queues is correctly
      configured on a device level (e.g. with ethtool -N).
      'start-queue=M' option can be used to specify from which queue id
      QEMU should start configuring 'N' queues.  It might also be necessary
      to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
      for examples.
      
      In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
      or CAP_BPF capabilities in order to load default XSK/XDP programs to
      the network interface and configure BPF maps.  It is possible, however,
      to run with no capabilities.  For that to work, an external process
      with enough capabilities will need to pre-load default XSK program,
      create AF_XDP sockets and pass their file descriptors to QEMU process
      on startup via 'sock-fds' option.  Network backend will need to be
      configured with 'inhibit=on' to avoid loading of the program.
      QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
      or CAP_IPC_LOCK.
      
      There are a few performance challenges with the current network backends.
      
      First is that they do not support IO threads.  This means that data
      path is handled by the main thread in QEMU and may slow down other
      work or may be slowed down by some other work.  This also means that
      taking advantage of multi-queue is generally not possible today.
      
      Another thing is that data path is going through the device emulation
      code, which is not really optimized for performance.  The fastest
      "frontend" device is virtio-net.  But it's not optimized for heavy
      traffic either, because it expects such use-cases to be handled via
      some implementation of vhost (user, kernel, vdpa).  In practice, we
      have virtio notifications and rcu lock/unlock on a per-packet basis
      and not very efficient accesses to the guest memory.  Communication
      channels between backend and frontend devices do not allow passing
      more than one packet at a time as well.
      
      Some of these challenges can be avoided in the future by adding better
      batching into device emulation or by implementing vhost-af-xdp variant.
      
      There are also a few kernel limitations.  AF_XDP sockets do not
      support any kinds of checksum or segmentation offloading.  Buffers
      are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
      support implementation for AF_XDP is in progress, but not ready yet.
      Also, transmission in all non-zero-copy modes is synchronous, i.e.
      done in a syscall.  That doesn't allow high packet rates on virtual
      interfaces.
      
      However, keeping in mind all of these challenges, current implementation
      of the AF_XDP backend shows a decent performance while running on top
      of a physical NIC with zero-copy support.
      
      Test setup:
      
      2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
      Network backend is configured to open the NIC directly in native mode.
      The driver supports zero-copy.  NIC is configured to use 1 queue.
      
      Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
      for PPS testing.
      
      iperf3 result:
       TCP stream      : 19.1 Gbps
      
      dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
       Tx only         : 3.4 Mpps
       Rx only         : 2.0 Mpps
       L2 FWD Loopback : 1.5 Mpps
      
      In skb mode the same setup shows much lower performance, similar to
      the setup where pair of physical NICs is replaced with veth pair:
      
      iperf3 result:
        TCP stream      : 9 Gbps
      
      dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
        Tx only         : 1.2 Mpps
        Rx only         : 1.0 Mpps
        L2 FWD Loopback : 0.7 Mpps
      
      Results in skb mode or over the veth are close to results of a tap
      backend with vhost=on and disabled segmentation offloading bridged
      with a NIC.
      
      Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
      Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> (docker/lcitool)
      Signed-off-by: Jason Wang <jasowang@redhat.com>
  7. Sep 04, 2023
  8. Aug 02, 2023
  9. Jul 26, 2023
    • qapi: Reformat recent doc comments to conform to current conventions · 9e272073
      Markus Armbruster authored
      
      Since commit a937b6aa (qapi: Reformat doc comments to conform to
      current conventions), a number of comments not conforming to the
      current formatting conventions were added.  No problem, just sweep
      the entire documentation once more.
      
      To check the generated documentation does not change, I compared the
      generated HTML before and after this commit with "wdiff -3".  Finds no
      differences.  Comparing with diff is not useful, as the reflown
      paragraphs are visible there.
      
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Message-ID: <20230720071610.1096458-7-armbru@redhat.com>
    • qapi/trace: Tidy up trace-event-get-state, -set-state documentation · e27a9d62
      Markus Armbruster authored
      
      trace-event-set-state's explanation of how events are selected is
      under "Features".  Doesn't belong there.  Simply delete it, as it
      feels redundant with documentation of member @name.
      
      trace-event-get-state's explanation is under "Returns".  Tolerable,
      but similarly redundant.  Delete it, too.
      
      Cc: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Message-ID: <20230720071610.1096458-5-armbru@redhat.com>
    • qapi/qdev: Tidy up device_add documentation · a9c72efd
      Markus Armbruster authored
      
      The notes section comes out like this:
      
          Notes
      
          Additional arguments depend on the type.
      
          1. For detailed information about this command, please refer to the
             ‘docs/qdev-device-use.txt’ file.
      
          2. It’s possible to list device properties by running QEMU with the
             “-device DEVICE,help” command-line argument, where DEVICE is the
             device’s name
      
      The first item isn't numbered.  Fix that:
      
          1. Additional arguments depend on the type.
      
          2. For detailed information about this command, please refer to the
             ‘docs/qdev-device-use.txt’ file.
      
          3. It’s possible to list device properties by running QEMU with the
             “-device DEVICE,help” command-line argument, where DEVICE is the
             device’s name
      
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Message-ID: <20230720071610.1096458-4-armbru@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
    • qapi/block: Tidy up block-latency-histogram-set documentation · e893b9e3
      Markus Armbruster authored
      
      Examples come out like
      
          Example
      
             set new histograms for all io types with intervals [0, 10), [10,
             50), [50, 100), [100, +inf):
      
      The sentence "set new histograms ..." starts with a lower case letter.
      Capitalize it.  Same for the other examples.
      
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Message-ID: <20230720071610.1096458-3-armbru@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
    • qapi/block-core: Tidy up BlockLatencyHistogramInfo documentation · dad3c956
      Markus Armbruster authored
      
      Documentation for member @bin comes out like
      
          list of io request counts corresponding to histogram intervals.
          len("bins") = len("boundaries") + 1 For the example above, "bins"
          may be something like [3, 1, 5, 2], and corresponding histogram
          looks like:
      
      Note how the equation and the sentence following it run together.
      Replace the equation:
      
          list of io request counts corresponding to histogram intervals,
          one more element than "boundaries" has.  For the example above,
          "bins" may be something like [3, 1, 5, 2], and corresponding
          histogram looks like:
      
      Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
      Signed-off-by: Markus Armbruster <armbru@redhat.com>
      Message-ID: <20230720071610.1096458-2-armbru@redhat.com>
      [Off by one fixed]
    • migration: skipped field is really obsolete. · 7b24d326
      Juan Quintela authored
      
      It has returned zero for more than 10 years.
      
      Specifically we introduced the field in 1.5.0
      
      commit f1c72795
      Author: Peter Lieven <pl@kamp.de>
      Date:   Tue Mar 26 10:58:37 2013 +0100
      
          migration: do not sent zero pages in bulk stage
      
          during bulk stage of ram migration if a page is a
          zero page do not send it at all.
          the memory at the destination reads as zero anyway.
      
          even if there is an madvise with QEMU_MADV_DONTNEED
          at the target upon receipt of a zero page I have observed
          that the target starts swapping if the memory is overcommitted.
          it seems that the pages are dropped asynchronously.
      
          this patch also updates QMP to return the number of
          skipped pages in MigrationStats.
      
      but removed its usage in 1.5.3
      
      commit 9ef051e5
      Author: Peter Lieven <pl@kamp.de>
      Date:   Mon Jun 10 12:14:19 2013 +0200
      
          Revert "migration: do not sent zero pages in bulk stage"
      
          Not sending zero pages breaks migration if a page is zero
          at the source but not at the destination. This can e.g. happen
          if different BIOS versions are used at source and destination.
          It has also been reported that migration on pseries is completely
          broken with this patch.
      
          This effectively reverts commit f1c72795.
      
      Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
      Message-ID: <20230612193344.3796-2-quintela@redhat.com>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
    • migration: Extend query-migrate to provide dirty page limit info · 15699cf5
      Hyman Huang(黄勇) authored
      
      Extend query-migrate to provide throttle time and estimated
      ring full time with dirty-limit capability enabled, through which
      we can observe whether the dirty limit takes effect during live migration.
      
      Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
      Reviewed-by: Markus Armbruster <armbru@redhat.com>
      Reviewed-by: Juan Quintela <quintela@redhat.com>
      Message-ID: <168733225273.5845.15871826788879741674-8@git.sr.ht>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
    • migration: Introduce dirty-limit capability · dc623955
      Hyman Huang(黄勇) authored
      
      Introduce migration dirty-limit capability, which can
      be turned on before live migration and limit dirty
      page rate during live migration.
      
      Introduce migrate_dirty_limit function to help check
      if dirty-limit capability is enabled during live migration.
      
      Meanwhile, refactor vcpu_dirty_rate_stat_collect
      so that period can be configured instead of hardcoded.
      
      dirty-limit capability is kind of like auto-converge
      but using dirty limit instead of traditional cpu-throttle
      to throttle guest down. To enable this feature, turn on
      the dirty-limit capability before live migration using
      migrate-set-capabilities, and set the parameters
      "x-vcpu-dirty-limit-period", "vcpu-dirty-limit" suitably
      to speed up convergence.
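
      The two steps above could be issued over QMP roughly as follows. A
      sketch only: the helper name and the numeric values are
      hypothetical, and the 1 to 1000 ms range check follows the
      description of "x-vcpu-dirty-limit-period" elsewhere in this
      series.

```python
import json


def dirty_limit_commands(vcpu_dirty_limit, period_ms):
    """Build the QMP commands that enable dirty-limit and set its parameters."""
    if not 1 <= period_ms <= 1000:
        raise ValueError("x-vcpu-dirty-limit-period must be within 1..1000 ms")
    return [
        {"execute": "migrate-set-capabilities",
         "arguments": {"capabilities": [
             {"capability": "dirty-limit", "state": True}]}},
        {"execute": "migrate-set-parameters",
         "arguments": {"x-vcpu-dirty-limit-period": period_ms,
                       "vcpu-dirty-limit": vcpu_dirty_limit}},
    ]


# Example values only; suitable settings depend on the workload.
for command in dirty_limit_commands(vcpu_dirty_limit=50, period_ms=500):
    print(json.dumps(command))
```

      Each printed line is one JSON command that can be sent on a QMP
      monitor before starting the migration.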
      
      Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Juan Quintela <quintela@redhat.com>
      Message-Id: <168618975839.6361.17407633874747688653-4@git.sr.ht>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
    • qapi/migration: Introduce vcpu-dirty-limit parameters · 09f9ec99
      Hyman Huang(黄勇) authored
      
      Introduce "vcpu-dirty-limit" migration parameter used
      to limit dirty page rate during live migration.
      
      "vcpu-dirty-limit" and "x-vcpu-dirty-limit-period" are
      two dirty-limit-related migration parameters, which can
      be set before and during live migration by qmp
      migrate-set-parameters.
      
      These two parameters are used to help implement the dirty
      page rate limit algorithm of migration.
      
      Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Juan Quintela <quintela@redhat.com>
      Message-Id: <168618975839.6361.17407633874747688653-3@git.sr.ht>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
    • qapi/migration: Introduce x-vcpu-dirty-limit-period parameter · 4d807857
      Hyman Huang(黄勇) authored
      
      Introduce "x-vcpu-dirty-limit-period" migration experimental
      parameter, which is in the range of 1 to 1000ms and used to
      make dirtyrate calculation period configurable.
      
      Currently, as "x-vcpu-dirty-limit-period" varies, the
      total time of live migration changes. Test results show that the
      optimal value of "x-vcpu-dirty-limit-period" ranges from
      500ms to 1000ms. "x-vcpu-dirty-limit-period" should be made
      stable once it proves best value can not be determined with
      developer's experiments.
      
      Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
      Reviewed-by: Markus Armbruster <armbru@redhat.com>
      Reviewed-by: Juan Quintela <quintela@redhat.com>
      Message-Id: <168618975839.6361.17407633874747688653-2@git.sr.ht>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
  10. Jul 25, 2023
  11. Jul 17, 2023
  12. Jul 10, 2023
  13. Jun 30, 2023
    • migration: Add switchover ack capability · 6574232f
      Avihai Horon authored
      
      Migration downtime estimation is calculated based on bandwidth and
      remaining migration data. This assumes that loading of migration data in
      the destination takes a negligible amount of time and that downtime
      depends only on network speed.
      
      While this may be true for RAM, it's not necessarily true for other
      migrated devices. For example, loading the data of a VFIO device in the
      destination might require from the device to allocate resources, prepare
      internal data structures and so on. These operations can take a
      significant amount of time which can increase migration downtime.
      
      This patch adds a new capability "switchover ack" that prevents the
      source from stopping the VM and completing the migration until an ACK
      is received from the destination that it's OK to do so.
      
      This can be used by migrated devices in various ways to reduce downtime.
      For example, a device can send initial precopy metadata to pre-allocate
      resources in the destination and use this capability to make sure that
      the pre-allocation is completed before the source VM is stopped, so it
      will have full effect.
      
      This new capability relies on the return path capability to communicate
      from the destination back to the source.
      
      The actual implementation of the capability will be added in the
      following patches.
      
      Signed-off-by: Avihai Horon <avihaih@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Markus Armbruster <armbru@redhat.com>
      Tested-by: YangHang Liu <yanghliu@redhat.com>
      Acked-by: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Cédric Le Goater <clg@redhat.com>
      6574232f
  14. Jun 27, 2023
  15. Jun 26, 2023
  16. Jun 23, 2023
  17. Jun 22, 2023
  18. Jun 20, 2023
  19. Jun 13, 2023
    • exec/memory: Introduce RAM_NAMED_FILE flag · b0182e53
      Steve Sistare authored
      
      migrate_ignore_shared() is an optimization that avoids copying memory
      that is visible and can be mapped on the target.  However, a
      memory-backend-ram or a memory-backend-memfd block with the RAM_SHARED
      flag set is not migrated when migrate_ignore_shared() is true.  This is
      wrong, because the block has no named backing store, and its contents will
      be lost.  To fix, ignore shared memory iff it is a named file.  Define a
      new flag RAM_NAMED_FILE to distinguish this case.
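      
      The corrected rule ("ignore shared memory iff it is a named file")
      can be sketched in C as below. The SKETCH_ macros use made-up bit
      values and the function is a hypothetical stand-in, not the code
      touched by this commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit values only; they are not QEMU's real flag values. */
#define SKETCH_RAM_SHARED      (1u << 0)
#define SKETCH_RAM_NAMED_FILE  (1u << 1)

/*
 * Return true if a RAM block may be skipped during migration.
 * A shared block is skipped only when it is also backed by a named file,
 * because only then can the target map the same contents itself.
 */
bool sketch_block_is_ignored(uint32_t flags, bool ignore_shared_enabled)
{
    return ignore_shared_enabled &&
           (flags & SKETCH_RAM_SHARED) != 0 &&
           (flags & SKETCH_RAM_NAMED_FILE) != 0;
}
```

      Under this rule a shared block without a named backing file (the
      memory-backend-ram and memory-backend-memfd cases above) is still
      migrated.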
      
      Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Message-Id: <1686151116-253260-1-git-send-email-steven.sistare@oracle.com>
      Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      b0182e53
  20. Jun 09, 2023
  21. Jun 05, 2023
    • qcow2: add discard-no-unref option · 42a2890a
      Jean-Louis Dupond authored
      When, for example, a sparse qcow2 image is used with discard: unmap
      enabled, the image can become heavily fragmented over time, especially
      on VMs that do a lot of writes and deletes.
      This can cause the qcow2 image to grow to even more than 110% of its
      virtual size, because the free gaps in the image become too small to
      hold newly allocated contiguous clusters, so new space is allocated at
      the end of the image instead.
      
      Disabling discard is not an option, as discard is needed to keep the
      incremental backup size as low as possible. Without discard, the
      incremental backups would become large, because QEMU only sees the
      blocks as dirty and does not know that they are no longer needed.
      So we need to avoid fragmentation but also 'empty' the unneeded blocks in
      the image to keep the incremental backup small.
      
      In addition, we also want to send the discards further down the stack, so
      the underlying blocks are still discarded.
      
      Therefore we introduce a new qcow2 option "discard-no-unref".
      When setting this option to true, discards will no longer have the qcow2
      driver relinquish cluster allocations. Other than that, the request is
      handled as normal: All clusters in range are marked as zero, and, if
      pass-discard-request is true, it is passed further down the stack.
      The only difference is that the now-zero clusters are preallocated
      instead of being unallocated.
      This will avoid fragmentation on the qcow2 image.
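      
      The behaviour described above can be summarised with a small model in
      C. All type and function names here are hypothetical and only mirror
      the wording of this message; this is not the qcow2 driver's code.

```c
#include <stdbool.h>

/* Hypothetical model of a cluster's state after a discard request. */
typedef enum {
    SKETCH_CLUSTER_UNALLOCATED,       /* allocation relinquished */
    SKETCH_CLUSTER_ZERO_PREALLOCATED  /* reads as zero, allocation kept */
} SketchClusterState;

typedef struct {
    SketchClusterState state;  /* what the qcow2 layer ends up with */
    bool passed_down;          /* request forwarded further down the stack */
} SketchDiscardOutcome;

/* Model the outcome for the two options named in the commit message. */
SketchDiscardOutcome sketch_discard(bool discard_no_unref,
                                    bool pass_discard_request)
{
    SketchDiscardOutcome out;

    out.state = discard_no_unref ? SKETCH_CLUSTER_ZERO_PREALLOCATED
                                 : SKETCH_CLUSTER_UNALLOCATED;
    /* Forwarding depends only on pass-discard-request in this model. */
    out.passed_down = pass_discard_request;
    return out;
}
```

      In this model the option changes only the resulting cluster state;
      whether the request travels further down is independent of it.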
      
      Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621
      
      
      Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
      Message-Id: <20230605084523.34134-2-jean-louis@dupond.be>
      Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
      Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
      42a2890a
  22. Jun 02, 2023
    • cutils: Adjust signature of parse_uint[_full] · bd1386cc
      Eric Blake authored
      
      It's already confusing that we have two very similar functions for
      wrapping the parse of a 64-bit unsigned value, differing mainly on
      whether they permit leading '-'.  Adjust the signature of parse_uint()
      and parse_uint_full() to be like all of qemu_strto*(): put the result
      parameter last, use the same types (uint64_t and unsigned long long
      have the same width, but are not always the same type), and mark
      endptr const (this latter change only affects the rare caller of
      parse_uint).  Adjust all callers in the tree.
      
      While at it, note that since cutils.c already includes:
      
          QEMU_BUILD_BUG_ON(sizeof(int64_t) != sizeof(long long));
      
      we are guaranteed that the result of parse_uint* cannot exceed
      UINT64_MAX (or the build would have failed), so we can drop
      pre-existing dead comparisons in opts-visitor.c that were never false.
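      
      To illustrate the calling convention described above (result
      parameter last, uint64_t result, const endptr), here is a
      self-contained stand-in written on top of strtoull(). The name
      sketch_parse_uint and the exact error codes are assumptions for this
      example; it is not QEMU's implementation.

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Same idea as the build-time check quoted above, in plain C11. */
_Static_assert(sizeof(uint64_t) == sizeof(unsigned long long),
               "uint64_t and unsigned long long must have the same width");

/*
 * Stand-in parser: returns 0 on success or a negative errno value.
 * A leading '-' is rejected; trailing characters are left for the caller
 * to inspect through *endptr (which may be NULL).
 */
int sketch_parse_uint(const char *s, const char **endptr, int base,
                      uint64_t *value)
{
    const char *p = s;
    char *end = NULL;
    unsigned long long v;

    *value = 0;
    if (endptr) {
        *endptr = s;
    }
    if (!s) {
        return -EINVAL;
    }
    while (isspace((unsigned char)*p)) {
        p++;
    }
    if (*p == '-') {
        return -EINVAL;          /* negative input is not accepted */
    }
    errno = 0;
    v = strtoull(p, &end, base);
    if (end == p) {
        return -EINVAL;          /* no digits were consumed */
    }
    if (endptr) {
        *endptr = end;
    }
    if (errno == ERANGE) {
        *value = UINT64_MAX;     /* overflow: saturate and report */
        return -ERANGE;
    }
    *value = v;
    return 0;
}
```

      Because the two types have the same width, the parsed value always
      fits in *value, which is why a later comparison against UINT64_MAX
      would be dead code.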
      
      Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
      Message-Id: <20230522190441.64278-8-eblake@redhat.com>
      [eblake: Drop dead code spotted by Markus]
      Signed-off-by: Eric Blake <eblake@redhat.com>
      bd1386cc
  23. Jun 01, 2023
  24. May 28, 2023