Skip to content
  • Peter Xu's avatar
    6621883f
    migration: Fix potential race on postcopy_qemufile_src · 6621883f
    Peter Xu authored
    
    
    postcopy_qemufile_src object should be owned by one thread, either the main
    thread (e.g. when at the beginning, or at the end of migration), or by the
    return path thread (when during a preempt enabled postcopy migration).  If
    that's not the case the access to the object might be racy.
    
    postcopy_preempt_shutdown_file() can be potentially racy, because it's
    called at the end phase of migration on the main thread, however during
    which the return path thread hasn't yet been recycled; the recycle happens
    in await_return_path_close_on_source() which is after this point.
    
    It means, logically it's posslbe the main thread and the return path thread
    are both operating on the same qemufile.  While I don't think qemufile is
    thread safe at all.
    
    postcopy_preempt_shutdown_file() used to be needed because that's where we
    send EOS to dest so that dest can safely shutdown the preempt thread.
    
    To avoid the possible race, remove this only place that a race can happen.
    Instead we figure out another way to safely close the preempt thread on
    dest.
    
    The core idea during postcopy on deciding "when to stop" is that dest will
    send a postcopy SHUT message to src, telling src that all data is there.
    Hence to shut the dest preempt thread maybe better to do it directly on
    dest node.
    
    This patch proposed such a way that we change postcopy_prio_thread_created
    into PreemptThreadStatus, so that we kick the preempt thread on dest qemu
    by a sequence of:
    
      mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
      qemu_file_shutdown(mis->postcopy_qemufile_dst);
    
    While here shutdown() is probably so far the easiest way to kick preempt
    thread from a blocked qemu_get_be64().  Then it reads preempt_thread_status
    to make sure it's not a network failure but a willingness to quit the
    thread.
    
    We could have avoided that extra status but just rely on migration status.
    The problem is postcopy_ram_incoming_cleanup() is just called early enough
    so we're still during POSTCOPY_ACTIVE no matter what.. So just make it
    simple to have the status introduced.
    
    One flag x-preempt-pre-7-2 is added to keep old pre-7.2 behaviors of
    postcopy preempt.
    
    Fixes: 93589827 ("migration: Send requested page directly in rp-return thread")
    Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
    Reviewed-by: default avatarJuan Quintela <quintela@redhat.com>
    Signed-off-by: default avatarJuan Quintela <quintela@redhat.com>
    6621883f
    migration: Fix potential race on postcopy_qemufile_src
    Peter Xu authored
    
    
    postcopy_qemufile_src object should be owned by one thread, either the main
    thread (e.g. when at the beginning, or at the end of migration), or by the
    return path thread (when during a preempt enabled postcopy migration).  If
    that's not the case the access to the object might be racy.
    
    postcopy_preempt_shutdown_file() can be potentially racy, because it's
    called at the end phase of migration on the main thread, however during
    which the return path thread hasn't yet been recycled; the recycle happens
    in await_return_path_close_on_source() which is after this point.
    
    It means, logically it's posslbe the main thread and the return path thread
    are both operating on the same qemufile.  While I don't think qemufile is
    thread safe at all.
    
    postcopy_preempt_shutdown_file() used to be needed because that's where we
    send EOS to dest so that dest can safely shutdown the preempt thread.
    
    To avoid the possible race, remove this only place that a race can happen.
    Instead we figure out another way to safely close the preempt thread on
    dest.
    
    The core idea during postcopy on deciding "when to stop" is that dest will
    send a postcopy SHUT message to src, telling src that all data is there.
    Hence to shut the dest preempt thread maybe better to do it directly on
    dest node.
    
    This patch proposed such a way that we change postcopy_prio_thread_created
    into PreemptThreadStatus, so that we kick the preempt thread on dest qemu
    by a sequence of:
    
      mis->preempt_thread_status = PREEMPT_THREAD_QUIT;
      qemu_file_shutdown(mis->postcopy_qemufile_dst);
    
    While here shutdown() is probably so far the easiest way to kick preempt
    thread from a blocked qemu_get_be64().  Then it reads preempt_thread_status
    to make sure it's not a network failure but a willingness to quit the
    thread.
    
    We could have avoided that extra status but just rely on migration status.
    The problem is postcopy_ram_incoming_cleanup() is just called early enough
    so we're still during POSTCOPY_ACTIVE no matter what.. So just make it
    simple to have the status introduced.
    
    One flag x-preempt-pre-7-2 is added to keep old pre-7.2 behaviors of
    postcopy preempt.
    
    Fixes: 93589827 ("migration: Send requested page directly in rp-return thread")
    Signed-off-by: default avatarPeter Xu <peterx@redhat.com>
    Reviewed-by: default avatarJuan Quintela <quintela@redhat.com>
    Signed-off-by: default avatarJuan Quintela <quintela@redhat.com>
Loading