  1. Sep 19, 2023
  2. Aug 23, 2023
  3. Jul 10, 2023
  4. Jun 26, 2023
    • numa: Validate cluster and NUMA node boundary if required · a494fdb7
      Gavin Shan authored
      
      For some architectures, such as ARM64, multiple CPUs in one cluster can be
      associated with different NUMA nodes. This is an irregular configuration,
      since it should not occur in a bare-metal environment, and it causes the
      Linux guest to misbehave, as the following warning messages indicate.
      
        -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
        -numa node,nodeid=0,cpus=0-1,memdev=ram0                \
        -numa node,nodeid=1,cpus=2-3,memdev=ram1                \
        -numa node,nodeid=2,cpus=4-5,memdev=ram2                \
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 1 at kernel/sched/topology.c:2271 build_sched_domains+0x284/0x910
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.14.0-268.el9.aarch64 #1
        pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
        pc : build_sched_domains+0x284/0x910
        lr : build_sched_domains+0x184/0x910
        sp : ffff80000804bd50
        x29: ffff80000804bd50 x28: 0000000000000002 x27: 0000000000000000
        x26: ffff800009cf9a80 x25: 0000000000000000 x24: ffff800009cbf840
        x23: ffff000080325000 x22: ffff0000005df800 x21: ffff80000a4ce508
        x20: 0000000000000000 x19: ffff000080324440 x18: 0000000000000014
        x17: 00000000388925c0 x16: 000000005386a066 x15: 000000009c10cc2e
        x14: 00000000000001c0 x13: 0000000000000001 x12: ffff00007fffb1a0
        x11: ffff00007fffb180 x10: ffff80000a4ce508 x9 : 0000000000000041
        x8 : ffff80000a4ce500 x7 : ffff80000a4cf920 x6 : 0000000000000001
        x5 : 0000000000000001 x4 : 0000000000000007 x3 : 0000000000000002
        x2 : 0000000000001000 x1 : ffff80000a4cf928 x0 : 0000000000000001
        Call trace:
         build_sched_domains+0x284/0x910
         sched_init_domains+0xac/0xe0
         sched_init_smp+0x48/0xc8
         kernel_init_freeable+0x140/0x1ac
         kernel_init+0x28/0x140
         ret_from_fork+0x10/0x20
      
      Improve the situation by warning when multiple CPUs in one cluster are
      associated with different NUMA nodes. One NUMA node is still allowed to be
      associated with multiple clusters.
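
      For reference, a layout that keeps each cluster within a single NUMA node
      would be (a hedged sketch assuming the default linear CPU-index mapping;
      the ram0/ram1 memory backends are assumed to be defined elsewhere):

        -smp 6,maxcpus=6,sockets=2,clusters=1,cores=3,threads=1 \
        -numa node,nodeid=0,cpus=0-2,memdev=ram0                \
        -numa node,nodeid=1,cpus=3-5,memdev=ram1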
      
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: Philippe Mathieu-Daudé <philmd@linaro.org>
      Acked-by: Igor Mammedov <imammedo@redhat.com>
      Message-Id: <20230509002739.18388-2-gshan@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. May 26, 2023
  6. May 19, 2023
  7. Apr 27, 2023
  8. Apr 21, 2023
  9. Apr 20, 2023
  10. Apr 12, 2023
    • migration: Fix potential race on postcopy_qemufile_src · 6621883f
      Peter Xu authored
      
      The postcopy_qemufile_src object should be owned by a single thread: either
      the main thread (e.g. at the beginning or at the end of migration) or the
      return path thread (during a preempt-enabled postcopy migration).  If that's
      not the case, access to the object can be racy.

      postcopy_preempt_shutdown_file() is potentially racy because it's called in
      the end phase of migration on the main thread, at a point where the return
      path thread hasn't yet been recycled; the recycle happens later, in
      await_return_path_close_on_source().

      This means it is logically possible for the main thread and the return path
      thread to operate on the same qemufile at the same time, and qemufile is not
      thread safe.
      
      postcopy_preempt_shutdown_file() used to be needed because that's where we
      send EOS to the destination so that it can safely shut down the preempt
      thread.

      To avoid the possible race, remove the only place where it can happen and
      instead find another way to safely close the preempt thread on the
      destination.

      The core idea for deciding "when to stop" during postcopy is that the
      destination sends a postcopy SHUT message to the source once all data has
      arrived. Hence it is better to shut down the destination preempt thread
      directly on the destination node.

      This patch does so by changing postcopy_prio_thread_created into a
      PreemptThreadStatus, so that the preempt thread on the destination QEMU is
      kicked with the following sequence:
      
        mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
        qemu_file_shutdown(mis->postcopy_qemufile_dst);
      
      Here shutdown() is probably the easiest way to kick the preempt thread out
      of a blocked qemu_get_be64().  The thread then reads preempt_thread_status
      to make sure this is not a network failure but an intentional request to
      quit.

      We could have avoided the extra status and relied on the migration status
      instead, but postcopy_ram_incoming_cleanup() is called early enough that we
      are still in POSTCOPY_ACTIVE at that point.  So keep it simple and introduce
      the new status.

      A flag, x-preempt-pre-7-2, is added to keep the old pre-7.2 behavior of
      postcopy preempt.
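
      As a hedged usage sketch (assuming the flag is exposed as a property of
      the migration object, like other x- migration knobs), the old behavior
      could be forced with:

        -global migration.x-preempt-pre-7-2=on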
      
      Fixes: 93589827 ("migration: Send requested page directly in rp-return thread")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Juan Quintela <quintela@redhat.com>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
  11. Mar 10, 2023
  12. Mar 02, 2023
  13. Feb 23, 2023
  14. Feb 08, 2023
  15. Feb 06, 2023
    • virtio-mem: Migrate immutable properties early · 3b95a71b
      David Hildenbrand authored
      
      The bitmap and the size are immutable while migration is active: see
      virtio_mem_is_busy(). We can migrate this information early, before
      migrating any actual RAM content. Further, all information we need for
      sanity checks is immutable as well.
      
      Having this information in place early will, for example, allow for
      properly preallocating memory before touching these memory locations
      during RAM migration: this way, we can make sure that all memory was
      actually preallocated and that any user errors (e.g., insufficient
      hugetlb pages) can be handled gracefully.
      
      In contrast, usable_region_size and requested_size can theoretically
      still be modified on the source while the VM is running. Keep migrating
      these properties the usual, late way.

      Use a new device property to keep the behavior of compat machines
      unmodified.
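
      A hedged command-line sketch of opting out on a new machine type (the
      property name x-early-migration and the mem0 backend id are illustrative
      assumptions; compat machines keep the old behavior automatically):

        -m 4G,maxmem=8G \
        -object memory-backend-ram,id=mem0,size=4G \
        -device virtio-mem-pci,memdev=mem0,requested-size=1G,x-early-migration=off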
      
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Juan Quintela <quintela@redhat.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Juan Quintela <quintela@redhat.com>
  16. Jan 27, 2023
  17. Jan 05, 2023
  18. Dec 21, 2022
  19. Dec 14, 2022
  20. Nov 07, 2022
    • hmat acpi: Don't require initiator value in -numa · 83bcae98
      Brice Goglin authored
      
      The "Memory Proximity Domain Attributes" structure of the ACPI HMAT
      has a "Processor Proximity Domain Valid" flag that is currently
      always set because Qemu -numa requires an initiator=X value
      when hmat=on. Unsetting this flag allows to create more complex
      memory topologies by having multiple best initiators for a single
      memory target.
      
      This patch allows -numa without initiator=X when hmat=on by keeping
      the default value MAX_NODES in numa_state->nodes[i].initiator.
      All places reading numa_state->nodes[i].initiator already check
      whether it's different from MAX_NODES before using it.
      
      Tested with
      qemu-system-x86_64 -accel kvm \
       -machine pc,hmat=on \
       -drive if=pflash,format=raw,file=./OVMF.fd \
       -drive media=disk,format=qcow2,file=efi.qcow2 \
       -smp 4 \
       -m 3G \
       -object memory-backend-ram,size=1G,id=ram0 \
       -object memory-backend-ram,size=1G,id=ram1 \
       -object memory-backend-ram,size=1G,id=ram2 \
       -numa node,nodeid=0,memdev=ram0,cpus=0-1 \
       -numa node,nodeid=1,memdev=ram1,cpus=2-3 \
       -numa node,nodeid=2,memdev=ram2 \
       -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
       -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
       -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
       -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 \
       -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-latency,latency=30 \
       -numa hmat-lb,initiator=0,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=1048576 \
       -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-latency,latency=20 \
       -numa hmat-lb,initiator=1,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880 \
       -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-latency,latency=10 \
       -numa hmat-lb,initiator=1,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
       -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-latency,latency=30 \
       -numa hmat-lb,initiator=1,target=2,hierarchy=memory,data-type=access-bandwidth,bandwidth=1048576
      which reports NUMA node2 at the same distance from both node0 and node1, as seen in lstopo:
      Machine (2966MB total) + Package P#0
        NUMANode P#2 (979MB)
        Group0
          NUMANode P#0 (980MB)
          Core P#0 + PU P#0
          Core P#1 + PU P#1
        Group0
          NUMANode P#1 (1007MB)
          Core P#2 + PU P#2
          Core P#3 + PU P#3
      
      Before this patch, we had to add ",initiator=X" to "-numa node,nodeid=2,memdev=ram2".
      The lstopo output difference between initiator=1 and no initiator is:
      @@ -1,10 +1,10 @@
       Machine (2966MB total) + Package P#0
      +  NUMANode P#2 (979MB)
         Group0
           NUMANode P#0 (980MB)
           Core P#0 + PU P#0
           Core P#1 + PU P#1
         Group0
           NUMANode P#1 (1007MB)
      -    NUMANode P#2 (979MB)
           Core P#2 + PU P#2
           Core P#3 + PU P#3
      
      Corresponding changes in the HMAT MPDA structure:
      @@ -49,10 +49,10 @@
       [078h 0120   2]               Structure Type : 0000 [Memory Proximity Domain Attributes]
       [07Ah 0122   2]                     Reserved : 0000
       [07Ch 0124   4]                       Length : 00000028
      -[080h 0128   2]        Flags (decoded below) : 0001
      -            Processor Proximity Domain Valid : 1
      +[080h 0128   2]        Flags (decoded below) : 0000
      +            Processor Proximity Domain Valid : 0
       [082h 0130   2]                    Reserved1 : 0000
      -[084h 0132   4] Attached Initiator Proximity Domain : 00000001
      +[084h 0132   4] Attached Initiator Proximity Domain : 00000080
       [088h 0136   4]      Memory Proximity Domain : 00000002
       [08Ch 0140   4]                    Reserved2 : 00000000
       [090h 0144   8]                    Reserved3 : 0000000000000000
      
      Final HMAT SLLB structures:
      [0A0h 0160   2]               Structure Type : 0001 [System Locality Latency and Bandwidth Information]
      [0A2h 0162   2]                     Reserved : 0000
      [0A4h 0164   4]                       Length : 00000040
      [0A8h 0168   1]        Flags (decoded below) : 00
                                  Memory Hierarchy : 0
      [0A9h 0169   1]                    Data Type : 00
      [0AAh 0170   2]                    Reserved1 : 0000
      [0ACh 0172   4] Initiator Proximity Domains # : 00000002
      [0B0h 0176   4]   Target Proximity Domains # : 00000003
      [0B4h 0180   4]                    Reserved2 : 00000000
      [0B8h 0184   8]              Entry Base Unit : 0000000000002710
      [0C0h 0192   4] Initiator Proximity Domain List : 00000000
      [0C4h 0196   4] Initiator Proximity Domain List : 00000001
      [0C8h 0200   4] Target Proximity Domain List : 00000000
      [0CCh 0204   4] Target Proximity Domain List : 00000001
      [0D0h 0208   4] Target Proximity Domain List : 00000002
      [0D4h 0212   2]                        Entry : 0001
      [0D6h 0214   2]                        Entry : 0002
      [0D8h 0216   2]                        Entry : 0003
      [0DAh 0218   2]                        Entry : 0002
      [0DCh 0220   2]                        Entry : 0001
      [0DEh 0222   2]                        Entry : 0003
      
      [0E0h 0224   2]               Structure Type : 0001 [System Locality Latency and Bandwidth Information]
      [0E2h 0226   2]                     Reserved : 0000
      [0E4h 0228   4]                       Length : 00000040
      [0E8h 0232   1]        Flags (decoded below) : 00
                                  Memory Hierarchy : 0
      [0E9h 0233   1]                    Data Type : 03
      [0EAh 0234   2]                    Reserved1 : 0000
      [0ECh 0236   4] Initiator Proximity Domains # : 00000002
      [0F0h 0240   4]   Target Proximity Domains # : 00000003
      [0F4h 0244   4]                    Reserved2 : 00000000
      [0F8h 0248   8]              Entry Base Unit : 0000000000000001
      [100h 0256   4] Initiator Proximity Domain List : 00000000
      [104h 0260   4] Initiator Proximity Domain List : 00000001
      [108h 0264   4] Target Proximity Domain List : 00000000
      [10Ch 0268   4] Target Proximity Domain List : 00000001
      [110h 0272   4] Target Proximity Domain List : 00000002
      [114h 0276   2]                        Entry : 000A
      [116h 0278   2]                        Entry : 0005
      [118h 0280   2]                        Entry : 0001
      [11Ah 0282   2]                        Entry : 0005
      [11Ch 0284   2]                        Entry : 000A
      [11Eh 0286   2]                        Entry : 0001
      
      Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
      Signed-off-by: Hesham Almatary <hesham.almatary@huawei.com>
      Reviewed-by: Jingqi Liu <jingqi.liu@intel.com>
      Message-Id: <20221027100037.251-2-hesham.almatary@huawei.com>
      Tested-by: Yicong Yang <yangyicong@hisilicon.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    • virtio: core: vq reset feature negotiation support · 69e1c14a
      Kangjie Xu authored
      
      A new command line parameter, "queue_reset", is added.
      
      Meanwhile, the vq reset feature is disabled for pre-7.2 machines.
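
      A hedged example of disabling the feature explicitly on a 7.2+ machine
      type (the netdev id and device model are illustrative):

        -netdev user,id=net0 \
        -device virtio-net-pci,netdev=net0,queue_reset=off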
      
      Signed-off-by: Kangjie Xu <kangjie.xu@linux.alibaba.com>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Message-Id: <20221017092558.111082-5-xuanzhuo@linux.alibaba.com>
      Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
  21. Aug 25, 2022
  22. Jun 09, 2022
  23. Jun 03, 2022
    • hw/nvme: do not auto-generate eui64 · 36d83272
      Klaus Jensen authored
      
      We cannot easily provide auto-generated unique or persistent namespace
      identifiers (EUI64, NGUID, UUID). Since 6.1, namespaces have been
      assigned a generated EUI64 of the form "52:54:00:<namespace counter>".
      This is unique within a QEMU instance, but not globally.

      Revert the automatic assignment and immediately deprecate the
      compatibility parameter. Users can opt in to it with the
      `eui64-default=on` device parameter or set the identifier explicitly
      with `eui64=UINT64`.
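
      For illustration, a hedged sketch of both options (the drive id and
      image file are assumptions):

        -drive file=ns1.img,if=none,format=raw,id=nvm1 \
        -device nvme,serial=deadbeef \
        -device nvme-ns,drive=nvm1,eui64=0x0011223344556677

      or, to keep the deprecated auto-generated EUI64:

        -device nvme-ns,drive=nvm1,eui64-default=on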
      
      Cc: libvir-list@redhat.com
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
  24. May 19, 2022
  25. May 13, 2022
  26. May 12, 2022
  27. May 09, 2022
  28. Apr 20, 2022
  29. Apr 06, 2022
    • acpi: fix acpi_index migration · a83c2844
      Dr. David Alan Gilbert authored
      
      vmstate_acpi_pcihp_use_acpi_index() was expecting AcpiPciHpState
      as its state, but it actually received PIIX4PMState, because
      VMSTATE_PCI_HOTPLUG is a macro and not another struct.
      It therefore ended up accessing a random pointer, which resulted
      in a 'false' return value, so the acpi_index field was never
      sent.

      However, in 7.0 that pointer dereferences to a value > 0, and the
      destination QEMU starts to expect a field that isn't sent in the
      migration stream from older QEMU (6.2 and older). As a result,
      migration fails with:
        qemu-system-x86_64: Missing section footer for 0000:00:01.3/piix4_pm
        qemu-system-x86_64: load of migration failed: Invalid argument

      In addition, with QEMU 6.2 the destination, due to the same
      unexpected state, also never expects the acpi_index field in the
      migration stream.

      Q35 is not affected, as it always sends/expects the field as
      long as ACPI-based PCI hotplug is enabled.

      Fix the issue by introducing a compat knob that never sends/expects
      acpi_index in the migration stream for 6.2 and older PC machine
      types, and always sends it for 7.0 and newer PC machine types.
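
      As a hedged usage note: the knob is tied to the versioned PC machine
      types, so cross-version migration stays compatible only when both sides
      run the same versioned machine type, e.g.:

        -machine pc-i440fx-6.2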
      
      Diagnosed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Fixes: b32bd763 ("pci: introduce acpi-index property for PCI device")
      Resolves: https://gitlab.com/qemu-project/qemu/-/issues/932

      Signed-off-by: Igor Mammedov <imammedo@redhat.com>
      Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
  30. Jan 18, 2022
  31. Jan 05, 2022
  32. Dec 31, 2021
    • hw/core/machine: Introduce CPU cluster topology support · 864c3b5c
      Yanan Wang authored
      
      The new Cluster-Aware Scheduling support has landed in Linux 5.16,
      and it has been shown to benefit scheduling performance (e.g. load
      balancing and the wake_affine strategy) on both x86_64 and AArch64.

      So Linux 5.16 now has the four-level arch-neutral CPU topology
      definition shown below, plus a new scheduler level for clusters.
      struct cpu_topology {
          int thread_id;
          int core_id;
          int cluster_id;
          int package_id;
          int llc_id;
          cpumask_t thread_sibling;
          cpumask_t core_sibling;
          cpumask_t cluster_sibling;
          cpumask_t llc_sibling;
      };
      
      A cluster generally means a group of CPU cores which share an L2 cache
      or other mid-level resources, and it is these shared resources that are
      used to improve the scheduler's behavior. In terms of size, it sits
      between the CPU die and the CPU core. For example, on some ARM64
      Kunpeng servers, there are 6 clusters in each NUMA node and 4 CPU cores
      in each cluster; the 4 CPU cores share a separate L2 cache and an L3
      cache tag, which brings a cache-affinity advantage.

      In virtualization, on hosts that have physical clusters (pClusters), if
      we design a vCPU topology with a cluster level for the guest kernel and
      use dedicated vCPU pinning, a cluster-aware guest kernel can also make
      use of the cache affinity of CPU clusters to gain similar scheduling
      performance.
      
      This patch adds infrastructure for CPU cluster-level topology
      configuration and parsing, so that users can specify the cluster
      parameter if their machines support it.
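
      For example, a hedged sketch of one socket holding four clusters of
      two dual-threaded cores (16 vCPUs in total):

        -smp 16,sockets=1,clusters=4,cores=2,threads=2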
      
      Signed-off-by: Yanan Wang <wangyanan55@huawei.com>
      Message-Id: <20211228092221.21068-3-wangyanan55@huawei.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
      [PMD: Added '(since 7.0)' to @clusters in qapi/machine.json]
      Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
    • hw/core: Rename smp_parse() -> machine_parse_smp_config() · 3e2f1498
      Philippe Mathieu-Daudé authored
      
      All methods related to MachineState are prefixed with "machine_".
      smp_parse() does not need to be an exception. Rename it and
      const'ify the SMPConfiguration argument, since it doesn't need
      to be modified.
      
      Reviewed-by: Andrew Jones <drjones@redhat.com>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Reviewed-by: Yanan Wang <wangyanan55@huawei.com>
      Tested-by: Yanan Wang <wangyanan55@huawei.com>
      Signed-off-by: Philippe Mathieu-Daudé <philmd@redhat.com>
      Message-Id: <20211216132015.815493-9-philmd@redhat.com>