    spapr: rollback 'unplug timeout' for CPU hotunplugs · d522cb52
    Daniel Henrique Barboza authored
    The pseries machines introduced the concept of 'unplug timeout' for CPU
    hotunplugs. The idea was to circumvent a deficiency in the pSeries
    specification (PAPR), which currently does not define a proper way for
    the hotunplug to fail. If the guest refuses to release the CPU (see [1]
    for an example) there is no way for QEMU to detect the failure.
    
    Further discussions about how to send a QAPI event to inform about the
    hotunplug timeout [2] exposed problems that weren't predicted back when
    the idea was developed. Other QEMU machines don't have any type of
    hotunplug timeout mechanism for any device; ACPI-based machines, for
    example, have a way to make hotunplug errors visible to the hypervisor.
    This would make the timeout mechanism exclusive to pSeries, which is
    not ideal.
    
    The real problem is that a QAPI event that reports hotunplug timeouts
    puts the management layer (namely Libvirt) in a weird spot. We're not
    saying that the hotunplug failed, because we can't be 100% sure of
    that, and yet we're resetting the unplug state, preventing any
    DEVICE_DEL event from reaching it in case the guest decides to release
    the device. Libvirt would need to inspect the guest itself to see if
    the device was released or not, otherwise its internal domain state
    will be inconsistent. Moreover, Libvirt already has an 'unplug timeout'
    concept, and a QEMU-side timeout would need to be juggled together with
    the existing Libvirt timeout.
    
    All this considered, this solution ended up creating more trouble than
    it solved. This patch reverts the 3 commits that introduced the timeout
    mechanism for CPU hotunplugs in pSeries machines.
    
    This reverts commit 4515a5f7
    "qemu_timer.c: add timer_deadline_ms() helper"
    
    This reverts commit d1c2e3ce
    "spapr_drc.c: add hotunplug timeout for CPUs"
    
    This reverts commit 51254ffb
    "spapr_drc.c: introduce unplug_timeout_timer"
    
    [1] https://bugzilla.redhat.com/show_bug.cgi?id=1911414
    [2] https://lists.gnu.org/archive/html/qemu-devel/2021-03/msg04682.html
    
    CC: Paolo Bonzini <pbonzini@redhat.com>
    Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
    Message-Id: <20210401000437.131140-2-danielhb413@gmail.com>
    Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
qemu-timer.c 18.29 KiB