    f8c543e8
    migration: Allow network to fail even during recovery · f8c543e8
    Peter Xu authored
    
    
    Normally the postcopy recover phase should only exist for a very short
    period: the time during which QEMU is trying to recover from an
    interrupted postcopy migration.  During that time a handshake is carried
    out to continue the procedure, with state changes from PAUSED -> RECOVER
    -> POSTCOPY_ACTIVE again.
    
    The RECOVER phase should be very short; it happens right after the admin
    specifies a new, working network link for QEMU to reconnect to dest
    QEMU.
    
    However, there can still be a case where the channel breaks within this
    small RECOVER window.
    
    If that happens, with the current code there's no way for the src QEMU
    to get kicked out of the RECOVER stage.  There's also no way to retry the
    recovery over another channel once one is established.
    
    This patch allows the RECOVER phase itself to fail too.  We're mostly
    ready, with just some small things missing, e.g. properly kicking the
    main migration thread out when it is sleeping on rp_sem and we find that
    we're at the RECOVER stage.  When this happens, the RECOVER itself fails
    and the state rolls back to PAUSED.  The user can then retry another
    round of recovery.
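
    The state changes described above can be modelled as a small transition
    table.  This is only an illustrative sketch, not QEMU code: the names
    State, ALLOWED, step() and recover() are hypothetical, and the string
    values are assumed to match the status names QMP reports.  The
    (RECOVER, PAUSED) edge is the one this patch makes reachable.

```python
from enum import Enum


class State(Enum):
    # Values are assumed to mirror the QMP migration status strings.
    PAUSED = "postcopy-paused"
    RECOVER = "postcopy-recover"
    ACTIVE = "postcopy-active"


# Transitions taken from the commit message.  RECOVER -> PAUSED is the
# rollback that this patch allows.
ALLOWED = {
    (State.ACTIVE, State.PAUSED),   # postcopy was interrupted
    (State.PAUSED, State.RECOVER),  # admin supplied a new channel
    (State.RECOVER, State.ACTIVE),  # handshake succeeded
    (State.RECOVER, State.PAUSED),  # channel broke during the handshake
}


def step(current: State, target: State) -> State:
    """Move to target, rejecting transitions that are not in ALLOWED."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def recover(current: State, channel_ok: bool) -> State:
    """Run one recovery attempt and return the resulting state."""
    state = step(current, State.RECOVER)
    return step(state, State.ACTIVE if channel_ok else State.PAUSED)
```

    A failed attempt ends in PAUSED, from which recover() can simply be
    called again with a working channel.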
    
    To make it even more robust, teach the QMP command migrate-pause to
    explicitly kick src/dst QEMU out when needed, so that even if for some
    reason the migration thread hasn't already been kicked out by a failing
    return-path thread, the admin can still kick it out.
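
    The kick itself can be pictured as posting the semaphore the migration
    thread sleeps on.  The sketch below is a toy model in Python, assuming a
    single waiter; the Migration class and its method names are hypothetical
    and only rp_sem is a name taken from this message.  It does not reflect
    the actual C implementation.

```python
import threading


class Migration:
    """Toy model of a migration thread sleeping on rp_sem during RECOVER."""

    def __init__(self):
        self.state = "RECOVER"
        self.rp_error = False
        self.rp_sem = threading.Semaphore(0)
        self.lock = threading.Lock()

    def wait_for_handshake(self):
        # Main migration thread: block until someone posts rp_sem.
        self.rp_sem.acquire()
        with self.lock:
            # Woken by a failure or an explicit kick: roll back to PAUSED.
            # Woken by a completed handshake: continue the postcopy.
            self.state = "PAUSED" if self.rp_error else "POSTCOPY_ACTIVE"
        return self.state

    def handshake_done(self):
        # Return path: the handshake completed normally.
        self.rp_sem.release()

    def kick(self):
        # Return-path failure: mark the error and wake the sleeper.
        with self.lock:
            self.rp_error = True
        self.rp_sem.release()

    def migrate_pause(self):
        # Admin path: only kick if we are actually stuck in RECOVER.
        with self.lock:
            stuck = self.state == "RECOVER"
        if stuck:
            self.kick()
```

    Posting the semaphore before the waiter has gone to sleep is harmless
    here, because the count is kept until the waiter acquires it.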
    
    This is a very rare corner case, but still try to cover it.
    
    One can try to test this with two proxy channels for migration:
    
      (a) socat unix-listen:/tmp/src.sock,reuseaddr,fork tcp:localhost:10000
      (b) socat tcp-listen:10000,reuseaddr,fork unix:/tmp/dst.sock
    
    So the migration channel will be:
    
                          (a)          (b)
      src -> /tmp/src.sock -> tcp:10000 -> /tmp/dst.sock -> dst
    
    Then, to make QEMU hang at the RECOVER stage, one can do the following:
    
      (1) stop the postcopy using QMP command migrate-pause
      (2) kill the 2nd proxy (b)
      (3) try to recover the postcopy using /tmp/src.sock on src
      (4) src QEMU will go into RECOVER stage but won't be able to continue
          from there, because the channel is actually broken at (b)
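
    For reference, the QMP side of the steps above could look roughly like
    the messages built below.  This is a sketch under assumptions: the
    helper qmp() is hypothetical, step (1) is assumed to be issued as
    migrate-pause (the command this message names further down), step (3)
    is assumed to use the migrate command with resume set, and killing
    proxy (b) in step (2) happens outside QMP.  Nothing is sent anywhere;
    the code only builds the JSON text.

```python
import json


def qmp(command, **arguments):
    """Build one QMP request as a JSON string (nothing is sent)."""
    msg = {"execute": command}
    if arguments:
        msg["arguments"] = arguments
    return json.dumps(msg)


# URI of proxy (a) as seen from src QEMU, taken from the setup above.
SRC_PROXY_URI = "unix:/tmp/src.sock"

steps = [
    qmp("migrate-pause"),                            # step (1)
    qmp("migrate", uri=SRC_PROXY_URI, resume=True),  # step (3)
    qmp("query-migrate"),                            # inspect the status
]

for line in steps:
    print(line)
```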
    
    Before this patch, step (4) leaves src QEMU stuck in the RECOVER stage,
    with no way to kick QEMU out or to continue the postcopy again.  After
    this patch, (4) quickly fails and QEMU bounces back to the PAUSED stage.
    
    The admin can also kick QEMU from (4) into PAUSED using migrate-pause
    when needed.
    
    After bouncing back to PAUSED stage, one can recover again.
    
    Reported-by: Xiaohui Li <xiaohli@redhat.com>
    Reviewed-by: Fabiano Rosas <farosas@suse.de>
    Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=2111332
    
    
    Reviewed-by: Juan Quintela <quintela@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Signed-off-by: Juan Quintela <quintela@redhat.com>
    Message-ID: <20231017202633.296756-3-peterx@redhat.com>