    d246ea50
    migration/postcopy: Release fd before going into 'postcopy-pause' · d246ea50
    Peter Xu authored
    
    Logically, the race below could trigger with the old code:
    
              test program                        migration thread
              ------------                        ----------------
           wait_until('postcopy-pause')
                                              postcopy_pause()
                                                set_state('postcopy-pause')
           do_postcopy_recover()
             arm s->to_dst_file with new fd
                                                release s->to_dst_file [1]
    
    Here [1] could have released the just-installed recovery channel.  Then the
    migration could hang without really resuming.
    
    Instead, it should be very safe to release the fd before setting the state into
    'postcopy-pause', because there's no reason for any other thread to touch it
    during 'postcopy-active'.
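
The two orderings can be replayed deterministically in a small single-threaded sketch. This is not the QEMU code: `struct mig`, `release_channel()`, `do_recover()`, `old_order()` and `new_order()` are hypothetical stand-ins, and the channel is modelled as a plain integer (-1 meaning "no channel installed"). It only illustrates why releasing before the state change closes the window.

```c
#include <assert.h>

/* Hypothetical stand-ins for the migration state; not the real QEMU types. */
enum mig_state { POSTCOPY_ACTIVE, POSTCOPY_PAUSED };

struct mig {
    enum mig_state state;
    int to_dst_file;            /* channel id, or -1 when none is installed */
};

/* Migration thread: drop whatever channel is currently installed. */
void release_channel(struct mig *s)
{
    s->to_dst_file = -1;
}

/* Test program: recovery is only attempted once 'postcopy-pause' is visible. */
void do_recover(struct mig *s, int new_fd)
{
    assert(s->state == POSTCOPY_PAUSED);
    s->to_dst_file = new_fd;
}

/* Old ordering, with the recovery landing between set_state and release. */
int old_order(int broken_fd, int new_fd)
{
    struct mig s = { POSTCOPY_ACTIVE, broken_fd };

    s.state = POSTCOPY_PAUSED;  /* migration thread: state becomes visible */
    do_recover(&s, new_fd);     /* test program: installs the new channel */
    release_channel(&s);        /* migration thread: [1] drops the NEW one */
    return s.to_dst_file;
}

/* New ordering: release first, so recovery cannot start before it is done. */
int new_order(int broken_fd, int new_fd)
{
    struct mig s = { POSTCOPY_ACTIVE, broken_fd };

    release_channel(&s);        /* migration thread: drops the broken one */
    s.state = POSTCOPY_PAUSED;  /* only now may the test program recover */
    do_recover(&s, new_fd);
    return s.to_dst_file;
}
```

Calling `old_order(3, 7)` leaves no channel installed, while `new_order(3, 7)` keeps the recovery channel 7.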
    
    Dave reported a very rare postcopy recovery hang in which the migration-test
    program kept waiting for the migration to complete in
    migrate_postcopy_complete().  We suspect it is the same issue that is fixed
    here, though it is hard to tell.  Since we've noticed the race, fix it
    regardless of the hang report.
    
    Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Cc: Juan Quintela <quintela@redhat.com>
    Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
    Signed-off-by: Peter Xu <peterx@redhat.com>
    Message-Id: <20201021212721.440373-6-peterx@redhat.com>
    Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
migration.c 119.57 KiB