    block/rbd: add write zeroes support · c56ac27d
    Peter Lieven authored
    This patch wittingly sets BDRV_REQ_NO_FALLBACK and silently ignores
    BDRV_REQ_MAY_UNMAP for older librbd versions.
    
    The rationale for this is as follows (citing Ilya Dryomov, the current
    RBD maintainer):
    
    ---8<---
    a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
       and as a consequence always unmap if librbd is too old
    
       It's not clear what qemu's expectation is but in general Write
       Zeroes is allowed to unmap.  The only guarantee is that subsequent
       reads return zeroes, everything else is a hint.  This is how it is
       specified in the kernel and in the NVMe spec.
    
       In particular, block/nvme.c implements it as follows:
    
       if (flags & BDRV_REQ_MAY_UNMAP) {
           cdw12 |= (1 << 25);
       }
    
       This sets the Deallocate bit.  But if it's not set, the device may
       still deallocate:
    
       """
       If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
       command, and the namespace supports clearing all bytes to 0h in the
       values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
       from a deallocated logical block and its metadata (excluding
       protection information), then for each specified logical block, the
       controller:
       - should deallocate that logical block;
    
       ...
    
       If the Deallocate bit is cleared to '0' in a Write Zeroes command,
       and the namespace supports clearing all bytes to 0h in the values
       read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
       a deallocated logical block and its metadata (excluding protection
       information), then, for each specified logical block, the
       controller:
       - may deallocate that logical block;
       """
    
       https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf
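
As a rough illustration of the two cases quoted above, the following sketch builds CDW12 for a Write Zeroes command and classifies the deallocation hint. All `sketch_`/`SKETCH_` names are hypothetical, the value of `SKETCH_REQ_MAY_UNMAP` is illustrative rather than QEMU's, and the DLFEAT check covers only the 001b example from the quote; cases the quote elides are reported as not covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for BDRV_REQ_MAY_UNMAP; the value is illustrative. */
#define SKETCH_REQ_MAY_UNMAP (1u << 0)
/* Deallocate bit (CDW12.DEAC), bit 25 as in the block/nvme.c snippet. */
#define SKETCH_NVME_DEAC (1u << 25)

enum sketch_dealloc {
    SKETCH_NOT_COVERED,       /* case not described by the quoted text */
    SKETCH_MAY_DEALLOCATE,    /* DEAC cleared: controller "may" deallocate */
    SKETCH_SHOULD_DEALLOCATE  /* DEAC set: controller "should" deallocate */
};

/* Build CDW12: zero-based block count in bits 15:0, DEAC only as a hint. */
static uint32_t sketch_write_zeroes_cdw12(uint32_t nlb, unsigned flags)
{
    assert(nlb >= 1 && nlb <= 0x10000);
    uint32_t cdw12 = (nlb - 1) & 0xffff;
    if (flags & SKETCH_REQ_MAY_UNMAP) {
        cdw12 |= SKETCH_NVME_DEAC;
    }
    return cdw12;
}

/* Classify the hint for a namespace whose DLFEAT bits 2:0 are given. */
static enum sketch_dealloc sketch_dealloc_hint(uint32_t cdw12, uint8_t dlfeat)
{
    /* 001b: deallocated blocks read back as zeroes (the quoted example). */
    bool reads_zero = (dlfeat & 0x7) == 0x1;
    if (!reads_zero) {
        return SKETCH_NOT_COVERED;
    }
    return (cdw12 & SKETCH_NVME_DEAC) ? SKETCH_SHOULD_DEALLOCATE
                                      : SKETCH_MAY_DEALLOCATE;
}
```

In this model the flag only upgrades "may" to "should". The blocks read back as zeroes whether or not the flag is set, which is why ignoring BDRV_REQ_MAY_UNMAP does not break the guarantee.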
    
    
    
    b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags
    
       Again, it's not clear what qemu expects here, but without it we end
       up in a ridiculous situation where specifying the "don't allow slow
       fallback" switch immediately fails all efficient zeroing requests on
       a device where Write Zeroes is always efficient:
    
       $ qemu-io -c 'help write' | grep -- '-[zun]'
        -n, -- with -z, don't allow slow fallback
        -u, -- with -z, allow unmapping
        -z, -- write zeroes using blk_co_pwrite_zeroes
    
       $ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
       write failed: Operation not supported
    --->8---
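
The failure in point b) can be modelled with a minimal sketch, assuming the generic block layer rejects a no-fallback request whenever the driver does not advertise that flag. The names below (`sketch_driver`, `sketch_pwrite_zeroes`, the `SKETCH_` flag values) are hypothetical and simplified; this is not the actual QEMU code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the BDRV_REQ_* flags; values are illustrative. */
#define SKETCH_REQ_MAY_UNMAP   (1u << 0)
#define SKETCH_REQ_NO_FALLBACK (1u << 1)

/* Simplified driver state: only the advertised write-zeroes flags. */
struct sketch_driver {
    unsigned supported_zero_flags;
};

/*
 * Model of the gate in front of the driver: if the caller forbids a slow
 * fallback and the driver does not advertise NO_FALLBACK, fail up front
 * without ever asking the driver.
 */
static int sketch_pwrite_zeroes(const struct sketch_driver *drv, unsigned flags)
{
    assert(drv != NULL);
    if ((flags & ~drv->supported_zero_flags) & SKETCH_REQ_NO_FALLBACK) {
        return -ENOTSUP;
    }
    return 0; /* the request would be handed to the driver here */
}
```

With `supported_zero_flags` left at 0, a request carrying the no-fallback flag returns `-ENOTSUP`, matching the "Operation not supported" message above. Advertising the flag lets the same request through.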
    
    Signed-off-by: Peter Lieven <pl@kamp.de>
    Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
    Message-Id: <20210702172356.11574-6-idryomov@gmail.com>
    Signed-off-by: Kevin Wolf <kwolf@redhat.com>