Commits · b1be65f6436f53618408d9c6fc6959054f5afed6 · Anton / libtcg

Oct 17, 2023

migration: Improve json and formatting · e4ceec29

Juan Quintela authored 1 year ago


Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <20231013104736.31722-2-quintela@redhat.com>

e4ceec29

migration: Allow user to specify available switchover bandwidth · 8b239597

Peter Xu authored 1 year ago

Migration bandwidth is a very important value to live migration. It's
because it's one of the major factors that we'll make decision on when to
switchover to destination in a precopy process.

This value is currently estimated by QEMU during the whole live migration
process by monitoring how fast we were sending the data. This can be the
most accurate bandwidth if in the ideal world, where we're always feeding
unlimited data to the migration channel, and then it'll be limited to the
bandwidth that is available.

However in reality it may be very different, e.g., over a 10Gbps network we
can see query-migrate showing migration bandwidth of only a few tens of
MB/s just because there are plenty of other things the migration thread
might be doing. For example, the migration thread can be busy scanning
zero pages, or it can be fetching dirty bitmap from other external dirty
sources (like vhost or KVM). It means we may not be pushing data as much
as possible to migration channel, so the bandwidth estimated from "how many
data we sent in the channel" can be dramatically inaccurate sometimes.

With that, the decision to switchover will be affected, by assuming that we
may not be able to switchover at all with such a low bandwidth, but in
reality we can.

The migration may not even converge at all with the downtime specified,
with that wrong estimation of bandwidth, keeping iterations forever with a
low estimation of bandwidth.

The issue is QEMU itself may not be able to avoid those uncertainties on
measuing the real "available migration bandwidth". At least not something
I can think of so far.

One way to fix this is when the user is fully aware of the available
bandwidth, then we can allow the user to help providing an accurate value.

For example, if the user has a dedicated channel of 10Gbps for migration
for this specific VM, the user can specify this bandwidth so QEMU can
always do the calculation based on this fact, trusting the user as long as
specified. It may not be the exact bandwidth when switching over (in which
case qemu will push migration data as fast as possible), but much better
than QEMU trying to wildly guess, especially when very wrong.

A new parameter "avail-switchover-bandwidth" is introduced just for this.
So when the user specified this parameter, instead of trusting the
estimated value from QEMU itself (based on the QEMUFile send speed), it
trusts the user more by using this value to decide when to switchover,
assuming that we'll have such bandwidth available then.

Note that specifying this value will not throttle the bandwidth for
switchover yet, so QEMU will always use the full bandwidth possible for
sending switchover data, assuming that should always be the most important
way to use the network at that time.

This can resolve issues like "unconvergence migration" which is caused by
hilarious low "migration bandwidth" detected for whatever reason.

Reported-by: Zhiyi Guo <zhguo@redhat.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <20231010221922.40638-1-peterx@redhat.com>

8b239597

Oct 11, 2023

migration: Display error in query-migrate irrelevant of status · c94143e5

Peter Xu authored 1 year ago

Display it as long as being set, irrelevant of FAILED status.  E.g., it may
also be applicable to PAUSED stage of postcopy, to provide hint on what has
gone wrong.

The error_mutex seems to be overlooked when referencing the error, add it
to be very safe.

This will change QAPI behavior by showing up error message outside !FAILED
status, but it's intended and doesn't expect to break anyone.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2018404


Reviewed-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-ID: <20231004220240.167175-2-peterx@redhat.com>

c94143e5

Oct 10, 2023

migration/dirtyrate: use QEMU_CLOCK_HOST to report start-time · 320a6ccc

Andrei Gudkov authored 1 year ago


Currently query-dirty-rate uses QEMU_CLOCK_REALTIME as
the source for start-time field. This translates to
clock_gettime(CLOCK_MONOTONIC), i.e. number of seconds
since host boot. This is not very useful. The only
reasonable use case of start-time I can imagine is to
check whether previously completed measurements are
too old or not. But this makes sense only if start-time
is reported as host wall-clock time.

This patch replaces source of start-time from
QEMU_CLOCK_REALTIME to QEMU_CLOCK_HOST.

Signed-off-by: Andrei Gudkov <gudkov.andrei@huawei.com>
Reviewed-by: Hyman Huang <yong.huang@smartx.com>
Message-Id: <399861531e3b24a1ecea2ba453fb2c3d129fb03a.1693905328.git.gudkov.andrei@huawei.com>
Signed-off-by: Hyman Huang <yong.huang@smartx.com>

320a6ccc

migration/calc-dirty-rate: millisecond-granularity period · 34a68001

Andrei Gudkov authored 1 year ago


This patch allows to measure dirty page rate for
sub-second intervals of time. An optional argument is
introduced -- calc-time-unit. For example:
{"execute": "calc-dirty-rate", "arguments":
  {"calc-time": 500, "calc-time-unit": "millisecond"} }

Millisecond granularity allows to make predictions whether
migration will succeed or not. To do this, calculate dirty
rate with calc-time set to max allowed downtime (e.g. 300ms),
convert measured rate into volume of dirtied memory,
and divide by network throughput. If the value is lower
than max allowed downtime, then migration will converge.

Measurement results for single thread randomly writing to
a 1/4/24GiB memory region:

+----------------+-----------------------------------------------+
| calc-time      |                dirty rate MiB/s               |
| (milliseconds) +----------------+---------------+--------------+
|                | theoretical    | page-sampling | dirty-bitmap |
|                | (at 3M wr/sec) |               |              |
+----------------+----------------+---------------+--------------+
|                               1GiB                             |
+----------------+----------------+---------------+--------------+
|            100 |           6996 |          7100 |         3192 |
|            200 |           4606 |          4660 |         2655 |
|            300 |           3305 |          3280 |         2371 |
|            400 |           2534 |          2525 |         2154 |
|            500 |           2041 |          2044 |         1871 |
|            750 |           1365 |          1341 |         1358 |
|           1000 |           1024 |          1052 |         1025 |
|           1500 |            683 |           678 |          684 |
|           2000 |            512 |           507 |          513 |
+----------------+----------------+---------------+--------------+
|                               4GiB                             |
+----------------+----------------+---------------+--------------+
|            100 |          10232 |          8880 |         4070 |
|            200 |           8954 |          8049 |         3195 |
|            300 |           7889 |          7193 |         2881 |
|            400 |           6996 |          6530 |         2700 |
|            500 |           6245 |          5772 |         2312 |
|            750 |           4829 |          4586 |         2465 |
|           1000 |           3865 |          3780 |         2178 |
|           1500 |           2694 |          2633 |         2004 |
|           2000 |           2041 |          2031 |         1789 |
+----------------+----------------+---------------+--------------+
|                               24GiB                            |
+----------------+----------------+---------------+--------------+
|            100 |          11495 |          8640 |         5597 |
|            200 |          11226 |          8616 |         3527 |
|            300 |          10965 |          8386 |         2355 |
|            400 |          10713 |          8370 |         2179 |
|            500 |          10469 |          8196 |         2098 |
|            750 |           9890 |          7885 |         2556 |
|           1000 |           9354 |          7506 |         2084 |
|           1500 |           8397 |          6944 |         2075 |
|           2000 |           7574 |          6402 |         2062 |
+----------------+----------------+---------------+--------------+

Theoretical values are computed according to the following formula:
size * (1 - (1-(4096/size))^(time*wps)) / (time * 2^20),
where size is in bytes, time is in seconds, and wps is number of
writes per second.

Signed-off-by: Andrei Gudkov <gudkov.andrei@huawei.com>
Reviewed-by: Hyman Huang <yong.huang@smartx.com>
Message-Id: <d802e6b8053eb60fbec1a784cf86f67d9528e0a8.1693895970.git.gudkov.andrei@huawei.com>
Signed-off-by: Hyman Huang <yong.huang@smartx.com>

34a68001

Sep 20, 2023

qemu-img: map: report compressed data blocks · 52b10c9c

Andrey Drobyshev authored 1 year ago

Right now "qemu-img map" reports compressed blocks as containing data
but having no host offset. This is not very informative. Instead,
let's add another boolean field named "compressed" in case JSON output
mode is specified. This is achieved by utilizing new allocation status
flag BDRV_BLOCK_COMPRESSED for bdrv_block_status().

Also update the expected qemu-iotests outputs to contain the new field.

Signed-off-by: Andrey Drobyshev <andrey.drobyshev@virtuozzo.com>
Message-ID: <20230907210226.953821-3-andrey.drobyshev@virtuozzo.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>

52b10c9c

Sep 19, 2023

backends/hostmem-file: Add "rom" property to support VM templating with R/O files · e92666b0

David Hildenbrand authored 1 year ago


For now, "share=off,readonly=on" would always result in us opening the
file R/O and mmap'ing the opened file MAP_PRIVATE R/O -- effectively
turning it into ROM.

Especially for VM templating, "share=off" is a common use case. However,
that use case is impossible with files that lack write permissions,
because "share=off,readonly=on" will not give us writable RAM.

The sole user of ROM via memory-backend-file are R/O NVDIMMs, but as we
have users (Kata Containers) that rely on the existing behavior --
malicious VMs should not be able to consume COW memory for R/O NVDIMMs --
we cannot change the semantics of "share=off,readonly=on"

So let's add a new "rom" property with on/off/auto values. "auto" is
the default and what most people will use: for historical reasons, to not
change the old semantics, it defaults to the value of the "readonly"
property.

For VM templating, one can now use:
    -object memory-backend-file,share=off,readonly=on,rom=off,...

But we'll disallow:
    -object memory-backend-file,share=on,readonly=on,rom=off,...
because we would otherwise get an error when trying to mmap the R/O file
shared and writable. An explicit error message is cleaner.

We will also disallow for now:
    -object memory-backend-file,share=off,readonly=off,rom=on,...
    -object memory-backend-file,share=on,readonly=off,rom=on,...
It's not harmful, but also not really required for now.

Alternatives that were abandoned:
* Make "unarmed=on" for the NVDIMM set the memory region container
  readonly. We would still see a change of ROM->RAM and possibly run
  into memslot limits with vhost-user. Further, there might be use cases
  for "unarmed=on" that should still allow writing to that memory
  (temporary files, system RAM, ...).
* Add a new "readonly=on/off/auto" parameter for NVDIMMs. Similar issues
  as with "unarmed=on".
* Make "readonly" consume "on/off/file" instead of being a 'bool' type.
  This would slightly changes the behavior of the "readonly" parameter:
  values like true/false (as accepted by a 'bool'type) would no longer be
  accepted.

Message-ID: <20230906120503.359863-4-david@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

e92666b0

Sep 18, 2023

net: add initial support for AF_XDP network backend · cb039ef3

Ilya Maximets authored 1 year ago


AF_XDP is a network socket family that allows communication directly
with the network device driver in the kernel, bypassing most or all
of the kernel networking stack.  In the essence, the technology is
pretty similar to netmap.  But, unlike netmap, AF_XDP is Linux-native
and works with any network interfaces without driver modifications.
Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
require access to character devices or unix sockets.  Only access to
the network interface itself is necessary.

This patch implements a network backend that communicates with the
kernel by creating an AF_XDP socket.  A chunk of userspace memory
is shared between QEMU and the host kernel.  4 ring buffers (Tx, Rx,
Fill and Completion) are placed in that memory along with a pool of
memory buffers for the packet data.  Data transmission is done by
allocating one of the buffers, copying packet data into it and
placing the pointer into Tx ring.  After transmission, device will
return the buffer via Completion ring.  On Rx, device will take
a buffer form a pre-populated Fill ring, write the packet data into
it and place the buffer into Rx ring.

AF_XDP network backend takes on the communication with the host
kernel and the network interface and forwards packets to/from the
peer device in QEMU.

Usage example:

  -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
  -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1

XDP program bridges the socket with a network interface.  It can be
attached to the interface in 2 different modes:

1. skb - this mode should work for any interface and doesn't require
         driver support.  With a caveat of lower performance.

2. native - this does require support from the driver and allows to
            bypass skb allocation in the kernel and potentially use
            zero-copy while getting packets in/out userspace.

By default, QEMU will try to use native mode and fall back to skb.
Mode can be forced via 'mode' option.  To force 'copy' even in native
mode, use 'force-copy=on' option.  This might be useful if there is
some issue with the driver.

Option 'queues=N' allows to specify how many device queues should
be open.  Note that all the queues that are not open are still
functional and can receive traffic, but it will not be delivered to
QEMU.  So, the number of device queues should generally match the
QEMU configuration, unless the device is shared with something
else and the traffic re-direction to appropriate queues is correctly
configured on a device level (e.g. with ethtool -N).
'start-queue=M' option can be used to specify from which queue id
QEMU should start configuring 'N' queues.  It might also be necessary
to use this option with certain NICs, e.g. MLX5 NICs.  See the docs
for examples.

In a general case QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
or CAP_BPF capabilities in order to load default XSK/XDP programs to
the network interface and configure BPF maps.  It is possible, however,
to run with no capabilities.  For that to work, an external process
with enough capabilities will need to pre-load default XSK program,
create AF_XDP sockets and pass their file descriptors to QEMU process
on startup via 'sock-fds' option.  Network backend will need to be
configured with 'inhibit=on' to avoid loading of the program.
QEMU will need 32 MB of locked memory (RLIMIT_MEMLOCK) per queue
or CAP_IPC_LOCK.

There are few performance challenges with the current network backends.

First is that they do not support IO threads.  This means that data
path is handled by the main thread in QEMU and may slow down other
work or may be slowed down by some other work.  This also means that
taking advantage of multi-queue is generally not possible today.

Another thing is that data path is going through the device emulation
code, which is not really optimized for performance.  The fastest
"frontend" device is virtio-net.  But it's not optimized for heavy
traffic either, because it expects such use-cases to be handled via
some implementation of vhost (user, kernel, vdpa).  In practice, we
have virtio notifications and rcu lock/unlock on a per-packet basis
and not very efficient accesses to the guest memory.  Communication
channels between backend and frontend devices do not allow passing
more than one packet at a time as well.

Some of these challenges can be avoided in the future by adding better
batching into device emulation or by implementing vhost-af-xdp variant.

There are also a few kernel limitations.  AF_XDP sockets do not
support any kinds of checksum or segmentation offloading.  Buffers
are limited to a page size (4K), i.e. MTU is limited.  Multi-buffer
support implementation for AF_XDP is in progress, but not ready yet.
Also, transmission in all non-zero-copy modes is synchronous, i.e.
done in a syscall.  That doesn't allow high packet rates on virtual
interfaces.

However, keeping in mind all of these challenges, current implementation
of the AF_XDP backend shows a decent performance while running on top
of a physical NIC with zero-copy support.

Test setup:

2 VMs running on 2 physical hosts connected via ConnectX6-Dx card.
Network backend is configured to open the NIC directly in native mode.
The driver supports zero-copy.  NIC is configured to use 1 queue.

Inside a VM - iperf3 for basic TCP performance testing and dpdk-testpmd
for PPS testing.

iperf3 result:
 TCP stream      : 19.1 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
 Tx only         : 3.4 Mpps
 Rx only         : 2.0 Mpps
 L2 FWD Loopback : 1.5 Mpps

In skb mode the same setup shows much lower performance, similar to
the setup where pair of physical NICs is replaced with veth pair:

iperf3 result:
  TCP stream      : 9 Gbps

dpdk-testpmd (single queue, single CPU core, 64 B packets) results:
  Tx only         : 1.2 Mpps
  Rx only         : 1.0 Mpps
  L2 FWD Loopback : 0.7 Mpps

Results in skb mode or over the veth are close to results of a tap
backend with vhost=on and disabled segmentation offloading bridged
with a NIC.

Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> (docker/lcitool)
Signed-off-by: Jason Wang <jasowang@redhat.com>

cb039ef3

Sep 04, 2023

ui/vc: do not parse VC-specific options in Spice and GTK · 32aa1f8d

Marc-André Lureau authored 1 year ago


In commit 6f974c84 ("gtk: overwrite the console.c char driver"), I
shared the VC console parse handler with GTK. And later on in commit
d8aec9d9 ("display: add -display spice-app launching a Spice client"),
I also used it to handle spice-app VC.

This is not necessary, the VC console options (width/height/cols/rows)
are specific, and unused by tty-level GTK/Spice VC.

This is not a breaking change, as those options are still being parsed
by QAPI ChardevVC. Adjust the documentation about it.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-Id: <20230830093843.3531473-44-marcandre.lureau@redhat.com>

32aa1f8d

Aug 02, 2023

qapi: Craft the dirty-limit capability comment · ef965377

Hyman Huang(黄勇) authored 1 year ago


Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
Message-ID: <169073570563.19893.2928364761104733482-2@git.sr.ht>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

ef965377

qapi: Reformat the dirty-limit migration doc comments · 8abc8115

Hyman Huang(黄勇) authored 1 year ago


Reformat the dirty-limit migration doc comments to conform
to current conventions as commit a937b6aa (qapi: Reformat
doc comments to conform to current conventions).

Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
Message-ID: <169073570563.19893.2928364761104733482-1@git.sr.ht>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Whitespace tidied up]
Signed-off-by: Markus Armbruster <armbru@redhat.com>

8abc8115

Jul 26, 2023

qapi: Reformat recent doc comments to conform to current conventions · 9e272073

Markus Armbruster authored 1 year ago


Since commit a937b6aa (qapi: Reformat doc comments to conform to
current conventions), a number of comments not conforming to the
current formatting conventions were added.  No problem, just sweep
the entire documentation once more.

To check the generated documentation does not change, I compared the
generated HTML before and after this commit with "wdiff -3".  Finds no
differences.  Comparing with diff is not useful, as the reflown
paragraphs are visible there.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20230720071610.1096458-7-armbru@redhat.com>

9e272073

qapi/trace: Tidy up trace-event-get-state, -set-state documentation · e27a9d62

Markus Armbruster authored 1 year ago


trace-event-set-state's explanation of how events are selected is
under "Features".  Doesn't belong there.  Simply delete it, as it
feels redundant with documentation of member @name.

trace-event-get-state's explanation is under "Returns".  Tolerable,
but similarly redundant.  Delete it, too.

Cc: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20230720071610.1096458-5-armbru@redhat.com>

e27a9d62

qapi/qdev: Tidy up device_add documentation · a9c72efd

Markus Armbruster authored 1 year ago


The notes section comes out like this:

    Notes

    Additional arguments depend on the type.

    1. For detailed information about this command, please refer to the
       ‘docs/qdev-device-use.txt’ file.

    2. It’s possible to list device properties by running QEMU with the
       “-device DEVICE,help” command-line argument, where DEVICE is the
       device’s name

The first item isn't numbered.  Fix that:

    1. Additional arguments depend on the type.

    2. For detailed information about this command, please refer to the
       ‘docs/qdev-device-use.txt’ file.

    3. It’s possible to list device properties by running QEMU with the
       “-device DEVICE,help” command-line argument, where DEVICE is the
       device’s name

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20230720071610.1096458-4-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

a9c72efd

qapi/block: Tidy up block-latency-histogram-set documentation · e893b9e3

Markus Armbruster authored 1 year ago


Examples come out like

    Example

       set new histograms for all io types with intervals [0, 10), [10,
       50), [50, 100), [100, +inf):

The sentence "set new histograms ..." starts with a lower case letter.
Capitalize it.  Same for the other examples.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20230720071610.1096458-3-armbru@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>

e893b9e3

qapi/block-core: Tidy up BlockLatencyHistogramInfo documentation · dad3c956

Markus Armbruster authored 1 year ago


Documentation for member @bin comes out like

    list of io request counts corresponding to histogram intervals.
    len("bins") = len("boundaries") + 1 For the example above, "bins"
    may be something like [3, 1, 5, 2], and corresponding histogram
    looks like:

Note how the equation and the sentence following it run together.
Replace the equation:

    list of io request counts corresponding to histogram intervals,
    one more element than "boundaries" has.  For the example above,
    "bins" may be something like [3, 1, 5, 2], and corresponding
    histogram looks like:

Cc: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-ID: <20230720071610.1096458-2-armbru@redhat.com>
[Off by one fixed]

dad3c956

migration: skipped field is really obsolete. · 7b24d326

Juan Quintela authored 1 year ago


Has return zero for more than 10 years.

Specifically we introduced the field in 1.5.0

commit f1c72795
Author: Peter Lieven <pl@kamp.de>
Date:   Tue Mar 26 10:58:37 2013 +0100

    migration: do not sent zero pages in bulk stage

    during bulk stage of ram migration if a page is a
    zero page do not send it at all.
    the memory at the destination reads as zero anyway.

    even if there is an madvise with QEMU_MADV_DONTNEED
    at the target upon receipt of a zero page I have observed
    that the target starts swapping if the memory is overcommitted.
    it seems that the pages are dropped asynchronously.

    this patch also updates QMP to return the number of
    skipped pages in MigrationStats.

but removed its usage in 1.5.3

commit 9ef051e5
Author: Peter Lieven <pl@kamp.de>
Date:   Mon Jun 10 12:14:19 2013 +0200

    Revert "migration: do not sent zero pages in bulk stage"

    Not sending zero pages breaks migration if a page is zero
    at the source but not at the destination. This can e.g. happen
    if different BIOS versions are used at source and destination.
    It has also been reported that migration on pseries is completely
    broken with this patch.

    This effectively reverts commit f1c72795.

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20230612193344.3796-2-quintela@redhat.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>

7b24d326

migration: Extend query-migrate to provide dirty page limit info · 15699cf5

Hyman Huang(黄勇) authored 1 year ago


Extend query-migrate to provide throttle time and estimated
ring full time with dirty-limit capability enabled, through which
we can observe if dirty limit take effect during live migration.

Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-ID: <168733225273.5845.15871826788879741674-8@git.sr.ht>
Signed-off-by: Juan Quintela <quintela@redhat.com>

15699cf5

migration: Introduce dirty-limit capability · dc623955

Hyman Huang(黄勇) authored 1 year ago


Introduce migration dirty-limit capability, which can
be turned on before live migration and limit dirty
page rate durty live migration.

Introduce migrate_dirty_limit function to help check
if dirty-limit capability enabled during live migration.

Meanwhile, refactor vcpu_dirty_rate_stat_collect
so that period can be configured instead of hardcoded.

dirty-limit capability is kind of like auto-converge
but using dirty limit instead of traditional cpu-throttle
to throttle guest down. To enable this feature, turn on
the dirty-limit capability before live migration using
migrate-set-capabilities, and set the parameters
"x-vcpu-dirty-limit-period", "vcpu-dirty-limit" suitably
to speed up convergence.

Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <168618975839.6361.17407633874747688653-4@git.sr.ht>
Signed-off-by: Juan Quintela <quintela@redhat.com>

dc623955

qapi/migration: Introduce vcpu-dirty-limit parameters · 09f9ec99

Hyman Huang(黄勇) authored 1 year ago


Introduce "vcpu-dirty-limit" migration parameter used
to limit dirty page rate during live migration.

"vcpu-dirty-limit" and "x-vcpu-dirty-limit-period" are
two dirty-limit-related migration parameters, which can
be set before and during live migration by qmp
migrate-set-parameters.

This two parameters are used to help implement the dirty
page rate limit algo of migration.

Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
Acked-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <168618975839.6361.17407633874747688653-3@git.sr.ht>
Signed-off-by: Juan Quintela <quintela@redhat.com>

09f9ec99

qapi/migration: Introduce x-vcpu-dirty-limit-period parameter · 4d807857

Hyman Huang(黄勇) authored 1 year ago


Introduce "x-vcpu-dirty-limit-period" migration experimental
parameter, which is in the range of 1 to 1000ms and used to
make dirtyrate calculation period configurable.

Currently with the "x-vcpu-dirty-limit-period" varies, the
total time of live migration changes, test results show the
optimal value of "x-vcpu-dirty-limit-period" ranges from
500ms to 1000 ms. "x-vcpu-dirty-limit-period" should be made
stable once it proves best value can not be determined with
developer's experiments.

Signed-off-by: Hyman Huang(黄勇) <yong.huang@smartx.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Juan Quintela <quintela@redhat.com>
Message-Id: <168618975839.6361.17407633874747688653-2@git.sr.ht>
Signed-off-by: Juan Quintela <quintela@redhat.com>

4d807857

Jul 25, 2023

qapi: Correct "eg." to "e.g." in documentation · ff62c210

Markus Armbruster authored 1 year ago


Signed-off-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>

ff62c210

Jul 17, 2023

audio/pw: Pipewire->PipeWire case fix for user-visible text · 20c51248

Marc-André Lureau authored 1 year ago


"PipeWire" is the correct case.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Volker Rümelin <vr_qemu@t-online.de>
Message-Id: <20230506163735.3481387-4-marcandre.lureau@redhat.com>

20c51248

Jul 10, 2023

migration.json: Don't use space before colon · fd658a7b

Juan Quintela authored 1 year ago


So all the file is consistent.

Signed-off-by: Juan Quintela <quintela@redhat.com>
Message-Id: <20230612191604.2219-1-quintela@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>

fd658a7b

qapi: better docs for calc-dirty-rate and friends · 5034e3d4

Andrei Gudkov authored 1 year ago


Rewrote calc-dirty-rate documentation. Briefly described
different modes of dirty page rate measurement. Added some
examples. Fixed obvious grammar errors.

Signed-off-by: Andrei Gudkov <gudkov.andrei@huawei.com>
Message-Id: <fe7d32a621ebd69ef6974beb2499c0b5dccb9e19.1684854849.git.gudkov.andrei@huawei.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
[Prose tweaked and spacing corrected, as per review]
Signed-off-by: Markus Armbruster <armbru@redhat.com>

5034e3d4

Jun 30, 2023

migration: Add switchover ack capability · 6574232f

Avihai Horon authored 1 year ago


Migration downtime estimation is calculated based on bandwidth and
remaining migration data. This assumes that loading of migration data in
the destination takes a negligible amount of time and that downtime
depends only on network speed.

While this may be true for RAM, it's not necessarily true for other
migrated devices. For example, loading the data of a VFIO device in the
destination might require from the device to allocate resources, prepare
internal data structures and so on. These operations can take a
significant amount of time which can increase migration downtime.

This patch adds a new capability "switchover ack" that prevents the
source from stopping the VM and completing the migration until an ACK
is received from the destination that it's OK to do so.

This can be used by migrated devices in various ways to reduce downtime.
For example, a device can send initial precopy metadata to pre-allocate
resources in the destination and use this capability to make sure that
the pre-allocation is completed before the source VM is stopped, so it
will have full effect.

This new capability relies on the return path capability to communicate
from the destination back to the source.

The actual implementation of the capability will be added in the
following patches.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Tested-by: YangHang Liu <yanghliu@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Cédric Le Goater <clg@redhat.com>

6574232f

Jun 27, 2023

ui: add egl-headless support on win32 · 39324b49

Marc-André Lureau authored 1 year ago

Make GBM optional for EGL code, and enable the build for win32.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20230606115658.677673-13-marcandre.lureau@redhat.com>

39324b49

Jun 26, 2023

accel/tcg: remove CONFIG_PROFILER · 1b65b4f5

Fei Wu authored 1 year ago


TBStats will be introduced to replace CONFIG_PROFILER totally, here
remove all CONFIG_PROFILER related stuffs first.

Signed-off-by: Vanderson M. do Rosario <vandersonmr2@gmail.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Fei Wu <fei2.wu@intel.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230607122411.3394702-2-fei2.wu@intel.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

1b65b4f5

Jun 23, 2023

hw/cxl/events: Add injection of Memory Module Events · bafe0308

Jonathan Cameron authored 1 year ago


These events include a copy of the device health information at the
time of the event. Actually using the emulated device health would
require a lot of controls to manipulate that state.  Given the aim
of this injection code is to just test the flows when events occur,
inject the contents of the device health state as well.

Future work may add more sophisticate device health emulation
including direct generation of these records when events occur
(such as a temperature threshold being crossed).  That does not
reduce the usefulness of this more basic generation of the events.

Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Message-Id: <20230530133603.16934-8-Jonathan.Cameron@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

bafe0308

Jun 22, 2023

hw/cxl/events: Add injection of DRAM events · b90a324e

Jonathan Cameron authored 1 year ago


Defined in CXL r3.0 8.2.9.2.1.2 DRAM Event Record, this event
provides information related to DRAM devices.

Example injection command in QMP:

{ "execute": "cxl-inject-dram-event",
    "arguments": {
        "path": "/machine/peripheral/cxl-mem0",
        "log": "informational",
        "flags": 1,
        "dpa": 1000,
        "descriptor": 3,
        "type": 3,
        "transaction-type": 192,
        "channel": 3,
        "rank": 17,
        "nibble-mask": 37421234,
        "bank-group": 7,
        "bank": 11,
        "row": 2,
        "column": 77,
        "correction-mask": [33, 44, 55,66]
    }}

Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Message-Id: <20230530133603.16934-7-Jonathan.Cameron@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

b90a324e

hw/cxl/events: Add injection of General Media Events · ea9b6d64

Ira Weiny authored 1 year ago


To facilitate testing provide a QMP command to inject a general media
event.  The event can be added to the log specified.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Message-Id: <20230530133603.16934-6-Jonathan.Cameron@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

ea9b6d64

hw/cxl: QMP based poison injection support · 9547754f

Jonathan Cameron authored 1 year ago

Inject poison using QMP command cxl-inject-poison to add an entry to the
poison list.

For now, the poison is not returned CXL.mem reads, but only via the
mailbox command Get Poison List. So a normal memory read to an address
that is on the poison list will not yet result in a synchronous exception
(and similar for partial cacheline writes).
That is left for a future patch.

See CXL rev 3.0, sec 8.2.9.8.4.1 Get Poison list (Opcode 4300h)

Kernel patches to use this interface here:
https://lore.kernel.org/linux-cxl/cover.1665606782.git.alison.schofield@intel.com/



To inject poison using QMP (telnet to the QMP port)
{ "execute": "qmp_capabilities" }

{ "execute": "cxl-inject-poison",
    "arguments": {
         "path": "/machine/peripheral/cxl-pmem0",
         "start": 2048,
         "length": 256
    }
}

Adjusted to select a device on your machine.

Note that the poison list supported is kept short enough to avoid the
complexity of state machine that is needed to handle the MORE flag.

Reviewed-by: Fan Ni <fan.ni@samsung.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Message-Id: <20230526170010.574-3-Jonathan.Cameron@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

9547754f

Jun 20, 2023

meson: Replace CONFIG_SOFTMMU -> CONFIG_SYSTEM_ONLY · c7b64948

Philippe Mathieu-Daudé authored 1 year ago


Since we *might* have user emulation with softmmu,
use the clearer 'CONFIG_SYSTEM_ONLY' key to check
for system emulation.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20230613133347.82210-9-philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

c7b64948

Jun 13, 2023

exec/memory: Introduce RAM_NAMED_FILE flag · b0182e53

Steve Sistare authored 1 year ago

migrate_ignore_shared() is an optimization that avoids copying memory
that is visible and can be mapped on the target. However, a
memory-backend-ram or a memory-backend-memfd block with the RAM_SHARED
flag set is not migrated when migrate_ignore_shared() is true. This is
wrong, because the block has no named backing store, and its contents will
be lost. To fix, ignore shared memory iff it is a named file. Define a
new flag RAM_NAMED_FILE to distinguish this case.

Signed-off-by: Steve Sistare <steven.sistare@oracle.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Message-Id: <1686151116-253260-1-git-send-email-steven.sistare@oracle.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>

b0182e53

Jun 09, 2023

spelling: information · 40b89515

Michael Tokarev authored 1 year ago


3 trivial fixes: 2 .json comments which goes to
executables, and 1 .h file comment.

Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>

40b89515

Jun 05, 2023

qcow2: add discard-no-unref option · 42a2890a

Jean-Louis Dupond authored 1 year ago

When we for example have a sparse qcow2 image and discard: unmap is enabled,
there can be a lot of fragmentation in the image after some time. Especially on VM's
that do a lot of writes/deletes.
This causes the qcow2 image to grow even over 110% of its virtual size,
because the free gaps in the image get too small to allocate new
continuous clusters. So it allocates new space at the end of the image.

Disabling discard is not an option, as discard is needed to keep the
incremental backup size as low as possible. Without discard, the
incremental backups would become large, as qemu thinks it's just dirty
blocks but it doesn't know the blocks are unneeded.
So we need to avoid fragmentation but also 'empty' the unneeded blocks in
the image to have a small incremental backup.

In addition, we also want to send the discards further down the stack, so
the underlying blocks are still discarded.

Therefor we introduce a new qcow2 option "discard-no-unref".
When setting this option to true, discards will no longer have the qcow2
driver relinquish cluster allocations. Other than that, the request is
handled as normal: All clusters in range are marked as zero, and, if
pass-discard-request is true, it is passed further down the stack.
The only difference is that the now-zero clusters are preallocated
instead of being unallocated.
This will avoid fragmentation on the qcow2 image.

Fixes: https://gitlab.com/qemu-project/qemu/-/issues/1621


Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
Message-Id: <20230605084523.34134-2-jean-louis@dupond.be>
Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>

42a2890a

Jun 02, 2023

cutils: Adjust signature of parse_uint[_full] · bd1386cc

Eric Blake authored 1 year ago


It's already confusing that we have two very similar functions for
wrapping the parse of a 64-bit unsigned value, differing mainly on
whether they permit leading '-'.  Adjust the signature of parse_uint()
and parse_uint_full() to be like all of qemu_strto*(): put the result
parameter last, use the same types (uint64_t and unsigned long long
have the same width, but are not always the same type), and mark
endptr const (this latter change only affects the rare caller of
parse_uint).  Adjust all callers in the tree.

While at it, note that since cutils.c already includes:

    QEMU_BUILD_BUG_ON(sizeof(int64_t) != sizeof(long long));

we are guaranteed that the result of parse_uint* cannot exceed
UINT64_MAX (or the build would have failed), so we can drop
pre-existing dead comparisons in opts-visitor.c that were never false.

Reviewed-by: Hanna Czenczek <hreitz@redhat.com>
Message-Id: <20230522190441.64278-8-eblake@redhat.com>
[eblake: Drop dead code spotted by Markus]
Signed-off-by: Eric Blake <eblake@redhat.com>

bd1386cc

Jun 01, 2023

qapi: add '@fdset' feature for BlockdevOptionsVirtioBlkVhostVdpa · 98b126f5

Stefano Garzarella authored 1 year ago


The virtio-blk-vhost-vdpa driver in libblkio 1.3.0 supports the fd
passing through the new 'fd' property.

Since now we are using qemu_open() on '@path' if the virtio-blk driver
supports the fd passing, let's announce it.
In this way, the management layer can pass the file descriptor of an
already opened vhost-vdpa character device. This is useful especially
when the device can only be accessed with certain privileges.

Add the '@fdset' feature only when the virtio-blk-vhost-vdpa driver
in libblkio supports it.

Suggested-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Message-id: 20230530071941.8954-3-sgarzare@redhat.com
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

98b126f5

qapi: make the vcpu parameters deprecated for 8.1 · 5485e52a

Alex Bennée authored 1 year ago


I don't think I can remove the parameters directly but certainly mark
them as deprecated.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-id: 20230526165401.574474-7-alex.bennee@linaro.org
Message-Id: <20230524133952.3971948-6-alex.bennee@linaro.org>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>

5485e52a

May 28, 2023

ui: add the infrastructure to support MT events · 2bfb10df

Sergio Lopez authored 1 year ago


Add the required infrastructure to support generating multitouch events.

Signed-off-by: Sergio Lopez <slp@redhat.com>
Reviewed-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20230526112925.38794-3-slp@redhat.com>

2bfb10df