    migration/rdma: Fix out of order wrid · b390afd8
    Li Zhijian authored
    
    destination:
    ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server-migration.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5902,disable-ticketing -incoming rdma:192.168.22.23:8888
    qemu-system-x86_64: -spice streaming-video=filter,port=5902,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
    Please use disable-ticketing=on instead
    QEMU 6.0.50 monitor - type 'help' for more information
    (qemu) trace-event qemu_rdma_block_for_wrid_miss on
    (qemu) dest_init RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
    qemu_rdma_block_for_wrid_miss A Wanted wrid CONTROL SEND (2000) but got CONTROL RECV (4000)
    
    source:
    ../qemu/build/qemu-system-x86_64 -enable-kvm -netdev tap,id=hn0,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown -device e1000,netdev=hn0,mac=50:52:54:00:11:22 -boot c -drive if=none,file=./Fedora-rdma-server.qcow2,id=drive-virtio-disk0 -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0 -m 2048 -smp 2 -device piix3-usb-uhci -device usb-tablet -monitor stdio -vga qxl -spice streaming-video=filter,port=5901,disable-ticketing -S
    qemu-system-x86_64: -spice streaming-video=filter,port=5901,disable-ticketing: warning: short-form boolean option 'disable-ticketing' deprecated
    Please use disable-ticketing=on instead
    QEMU 6.0.50 monitor - type 'help' for more information
    (qemu)
    (qemu) trace-event qemu_rdma_block_for_wrid_miss on
    (qemu) migrate -d rdma:192.168.22.23:8888
    source_resolve_host RDMA Device opened: kernel name rxe_eth0 uverbs device name uverbs2, infiniband_verbs class device path /sys/class/infiniband_verbs/uverbs2, infiniband class device path /sys/class/infiniband/rxe_eth0, transport: (2) Ethernet
    (qemu) qemu_rdma_block_for_wrid_miss A Wanted wrid WRITE RDMA (1) but got CONTROL RECV (4000)
    
    NOTE: we use soft RoCE as the rdma device.
    [root@iaas-rpma images]# rdma link show rxe_eth0/1
    link rxe_eth0/1 state ACTIVE physical_state LINK_UP netdev eth0
    
    The migration cannot complete when an out-of-order (OOO) CQ event occurs.
    The send queue and the receive queue share a single completion queue, and
    qemu_rdma_block_for_wrid() drops any completion it is not currently waiting
    for. A completion dropped this way may be exactly the one that a later call
    to qemu_rdma_block_for_wrid() waits for, in which case that call blocks
    forever.
    
    OOO can occur on both the source side and the destination side. The
    permanent block happens only when SEND and RECV complete out of order; OOO
    between 'WRITE RDMA' and 'RECV' doesn't matter.
    
    The OOO sequence is shown below:
           source                             destination
          rdma_write_one()                   qemu_rdma_registration_handle()
    1.    S1: post_recv X                    D1: post_recv Y
    2.    wait for recv CQ event X
    3.                                       D2: post_send X     ---------------+
    4.                                       wait for send CQ event X (D2)      |
    5.    recv CQ event X reaches (D2)                                          |
    6.  +-S2: post_send Y                                                       |
    7.  | wait for send CQ event Y                                              |
    8.  |                                    recv CQ event Y (S2) (drop it)     |
    9.  +-send CQ event Y reaches (S2)                                          |
    10.                                      send CQ event X reaches (D2)  -----+
    11.                                      wait recv CQ event Y (dropped by (8))
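
The destination side of the sequence above can be replayed with a toy model. This is a minimal sketch, not QEMU code: `toy_cq`, `toy_block_for_wrid()` and the constant names are hypothetical, the wrid values are the ones printed in the trace output, and a return value of 0 stands in for the point where the real wait loop would block forever.

```c
#include <assert.h>
#include <stddef.h>

/* Work request ids as printed in the trace output (names are illustrative). */
enum { WRID_CONTROL_SEND = 2000, WRID_CONTROL_RECV = 4000 };

/* Toy completion queue: a fixed-size FIFO of wrids (hypothetical model). */
#define CQ_CAP 8
struct toy_cq {
    int wrid[CQ_CAP];
    size_t head, tail;
};

static void toy_cq_push(struct toy_cq *cq, int wrid)
{
    assert(cq->tail < CQ_CAP);
    cq->wrid[cq->tail++] = wrid;
}

/*
 * Model of a wait loop on a shared CQ: pop completions until the wanted
 * one shows up and discard everything else. Returns 1 on success, 0 when
 * the queue runs dry (where a real blocking loop would never return).
 */
static int toy_block_for_wrid(struct toy_cq *cq, int wanted)
{
    while (cq->head < cq->tail) {
        if (cq->wrid[cq->head++] == wanted) {
            return 1;
        }
        /* any other completion is dropped here */
    }
    return 0;
}

/* Destination side: RECV Y (step 8) completes before SEND X (step 10). */
static int toy_replay_shared_cq(void)
{
    struct toy_cq cq = { {0}, 0, 0 };

    toy_cq_push(&cq, WRID_CONTROL_RECV);
    toy_cq_push(&cq, WRID_CONTROL_SEND);

    /* Step 4/10: waiting for SEND X silently consumes RECV Y. */
    if (!toy_block_for_wrid(&cq, WRID_CONTROL_SEND)) {
        return -1;
    }
    /* Step 11: RECV Y is already gone, so this wait cannot succeed. */
    return toy_block_for_wrid(&cq, WRID_CONTROL_RECV);
}
```

In this model `toy_replay_shared_cq()` returns 0: the first wait succeeds but discards the RECV completion, so the second wait finds nothing.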
    
    Although hardware IB worked fine in about a hundred of my runs, the IB
    specification doesn't guarantee the CQ ordering in this case.
    
    Here we introduce an independent send completion queue to separate
    ibv_post_send completions from the original mixed completion queue.
    This lets us poll for the specific CQE we are actually interested in.
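
The effect of a separate send completion queue can be pictured with a small model. This is a sketch under stated assumptions, not the code in rdma.c: `toy_ctx`, `toy_cq_for()` and the rule that picks a queue by comparing the wrid against the CONTROL RECV value are hypothetical; only the wrid values come from the trace output.

```c
#include <assert.h>
#include <stddef.h>

/* Work request ids as printed in the trace output (names are illustrative). */
enum { WRID_CONTROL_SEND = 2000, WRID_CONTROL_RECV = 4000 };

#define CQ_CAP 8
struct toy_cq {
    int wrid[CQ_CAP];
    size_t head, tail;
};

/* Hypothetical context holding one queue per direction. */
struct toy_ctx {
    struct toy_cq send_cq;
    struct toy_cq recv_cq;
};

/* Assumed routing rule for this model: RECV wrids go to recv_cq. */
static struct toy_cq *toy_cq_for(struct toy_ctx *ctx, int wrid)
{
    return wrid >= WRID_CONTROL_RECV ? &ctx->recv_cq : &ctx->send_cq;
}

/* Deliver a completion to the queue that matches its direction. */
static void toy_complete(struct toy_ctx *ctx, int wrid)
{
    struct toy_cq *cq = toy_cq_for(ctx, wrid);

    assert(cq->tail < CQ_CAP);
    cq->wrid[cq->tail++] = wrid;
}

/* Wait only on the queue that can contain the wanted wrid. */
static int toy_block_for_wrid(struct toy_ctx *ctx, int wanted)
{
    struct toy_cq *cq = toy_cq_for(ctx, wanted);

    while (cq->head < cq->tail) {
        if (cq->wrid[cq->head++] == wanted) {
            return 1;
        }
    }
    return 0;
}

/* Same arrival order as the failing case: RECV first, then SEND. */
static int toy_replay_split_cq(void)
{
    struct toy_ctx ctx = { { {0}, 0, 0 }, { {0}, 0, 0 } };

    toy_complete(&ctx, WRID_CONTROL_RECV);
    toy_complete(&ctx, WRID_CONTROL_SEND);

    return toy_block_for_wrid(&ctx, WRID_CONTROL_SEND)
        && toy_block_for_wrid(&ctx, WRID_CONTROL_RECV);
}
```

Here `toy_replay_split_cq()` returns 1: waiting for the SEND completion never touches the receive queue, so the early RECV completion is still there when it is needed.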
    
    Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com>
    Reviewed-by: Juan Quintela <quintela@redhat.com>
    Signed-off-by: Juan Quintela <quintela@redhat.com>
rdma.c 131.91 KiB