    vmdk: Add read-only support for seSparse snapshots · 98eb9733
    Sam Eiderman authored
    Until ESXi 6.5, VMware used the vmfsSparse format for snapshots (VMDK3 in
    QEMU).
    
    This format had the following shortcomings:
    
        * Grain directory (L1) and grain table (L2) entries were 32-bit,
          allowing access to only 2TB (slightly less) of data.
        * The grain size (default) was 512 bytes - leading to data
          fragmentation and many grain tables.
        * For space reclamation purposes, it was necessary to find all the
          grains which are not pointed to by any grain table - so a reverse
          mapping of "offset of grain in vmdk" to "grain table" must be
          constructed - which takes large amounts of CPU/RAM.
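
The 2TB figure above follows from simple arithmetic. A minimal sketch,
assuming that each 32-bit entry stores an offset counted in 512-byte
sectors (the helper name is hypothetical, used for illustration only):

```python
SECTOR_SIZE = 512  # bytes per sector (assumed unit of the stored offsets)


def max_addressable_bytes(entry_bits: int, sector_size: int = SECTOR_SIZE) -> int:
    """Upper bound on the data reachable through an entry of `entry_bits`
    bits when the entry stores a sector offset."""
    return (1 << entry_bits) * sector_size


TIB = 1024 ** 4
# 2**32 sectors * 512 bytes = 2**41 bytes, i.e. 2 TiB for 32-bit entries.
print(max_addressable_bytes(32) // TIB)
```

The "slightly less" in the text is presumably because part of that range
is taken up by metadata rather than grain data.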
    
    The format specification can be found in VMware's documentation:
    https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
    
    
    
    In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
    introduced: SESparse (Space Efficient).
    
    This format fixes the above issues:
    
        * All entries are now 64-bit.
        * The grain size (default) is 4KB.
        * Grain directory and grain tables are now located at the beginning
          of the file.
          + seSparse format reserves space for all grain tables.
          + Grain tables can be addressed using an index.
          + Grains are located at the end of the file and can also be
            addressed with an index.
          - seSparse vmdks of large disks (64TB) have huge preallocated
            headers - mainly due to L2 tables, even for empty snapshots.
        * The header contains a reverse mapping ("backmap") of "offset of
          grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
          specifies for each grain - whether it is allocated or not.
          Using these data structures we can implement space reclamation
          efficiently.
        * The header now maintains two mappings:
            * The regular one (grain directory & grain tables)
            * A reverse one (backmap and free bitmap)
          These data structures can lose consistency upon a crash, resulting
          in a corrupted VMDK.
          Therefore, a journal is also added to the VMDK and is replayed
          when VMware reopens the file after a crash.
    
    Since ESXi 6.7 - SESparse is the only snapshot format available.
    
    Unfortunately, VMware does not provide documentation regarding the new
    seSparse format.
    
    This commit is based on black-box research of the seSparse format.
    Various in-guest block operations and their effect on the snapshot file
    were tested.
    
    The only VMware provided source of information (regarding the underlying
    implementation) was a log file on the ESXi:
    
        /var/log/hostd.log
    
    Whenever an seSparse snapshot is created, the log is populated
    with seSparse records.
    
    Relevant log records are of the form:
    
    [...] Const Header:
    [...]  constMagic     = 0xcafebabe
    [...]  version        = 2.1
    [...]  capacity       = 204800
    [...]  grainSize      = 8
    [...]  grainTableSize = 64
    [...]  flags          = 0
    [...] Extents:
    [...]  Header         : <1 : 1>
    [...]  JournalHdr     : <2 : 2>
    [...]  Journal        : <2048 : 2048>
    [...]  GrainDirectory : <4096 : 2048>
    [...]  GrainTables    : <6144 : 2048>
    [...]  FreeBitmap     : <8192 : 2048>
    [...]  BackMap        : <10240 : 2048>
    [...]  Grain          : <12288 : 204800>
    [...] Volatile Header:
    [...] volatileMagic     = 0xcafecafe
    [...] FreeGTNumber      = 0
    [...] nextTxnSeqNumber  = 0
    [...] replayJournal     = 0
    
    The sizes that are seen in the log file are in sectors.
    Extents are of the following format: <offset : size>
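
To make the units concrete, the extents from the sample log can be
converted to bytes and checked for adjacency. This is only an
illustrative sketch: the dictionary and helper names are hypothetical,
and the numbers are copied from the sample log, not from a format
specification.

```python
from typing import List, Tuple

SECTOR_SIZE = 512  # bytes per sector

# <offset : size> pairs copied from the sample log above, in sectors.
EXTENTS = {
    "Header": (1, 1),
    "JournalHdr": (2, 2),
    "Journal": (2048, 2048),
    "GrainDirectory": (4096, 2048),
    "GrainTables": (6144, 2048),
    "FreeBitmap": (8192, 2048),
    "BackMap": (10240, 2048),
    "Grain": (12288, 204800),
}


def extent_bytes(name: str) -> Tuple[int, int]:
    """Return (offset, size) of an extent in bytes."""
    offset, size = EXTENTS[name]
    return offset * SECTOR_SIZE, size * SECTOR_SIZE


def is_contiguous(names: List[str]) -> bool:
    """True if each listed extent starts where the previous one ends."""
    return all(
        sum(EXTENTS[a]) == EXTENTS[b][0] for a, b in zip(names, names[1:])
    )


print(extent_bytes("Grain"))
print(is_contiguous(["Journal", "GrainDirectory", "GrainTables",
                     "FreeBitmap", "BackMap", "Grain"]))
```

In this sample, the extents from the journal onward are laid out back to
back, and the grain extent size equals the capacity (204800 sectors).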
    
    This commit is a strict implementation which enforces:
        * magics
        * version number 2.1
        * grain size of 8 sectors  (4KB)
        * grain table size of 64 sectors
        * zero flags
        * extent locations
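
The strict checks listed above can be pictured as a comparison against
fixed expected values. The sketch below works on a plain dictionary keyed
by the field names seen in the log; it is not the on-disk layout and not
the code of this commit, and the extent location checks are left out.

```python
from typing import Dict, List

# Expected values, taken from the list above and the sample log.
EXPECTED = {
    "constMagic": 0xCAFEBABE,
    "volatileMagic": 0xCAFECAFE,
    "version": "2.1",
    "grainSize": 8,         # sectors (4KB)
    "grainTableSize": 64,   # sectors
    "flags": 0,
}


def header_errors(header: Dict[str, object]) -> List[str]:
    """Return one message per field that deviates from the expected value."""
    return [
        "%s: expected %r, got %r" % (key, want, header.get(key))
        for key, want in EXPECTED.items()
        if header.get(key) != want
    ]


sample = dict(EXPECTED)
print(header_errors(sample))        # no deviations for the sample values
sample["grainSize"] = 16
print(header_errors(sample))        # reports the grainSize mismatch
```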
    
    Additionally, this commit provides only a subset of the functionality
    offered by the seSparse format:
        * Read-only
        * No journal replay
        * No space reclamation
        * No unmap support
    
    Hence, the journal header, journal, free bitmap and backmap extents are
    unused; only the "classic" (L1 -> L2 -> data) grain access is
    implemented.
    
    However, there are several differences in the grain access itself.
    Grain directory (L1):
        * Grain directory entries are indexes (not offsets) to grain
          tables.
        * Valid grain directory entries have their highest nibble set to
          0x1.
        * Since grain tables are always located at the beginning of the
          file, the index fits into 32 bits, so the low 32 bits of a valid
          entry can be used as the index.
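
A minimal sketch of the grain directory rules above. The function name is
hypothetical, and returning None for any entry whose highest nibble is
not 0x1 is an assumption made for illustration (the actual driver may
treat such entries differently):

```python
from typing import Optional


def gd_entry_to_index(entry: int) -> Optional[int]:
    """Decode a 64-bit grain directory (L1) entry.

    Returns the grain table index (low 32 bits) when the highest nibble
    is 0x1, otherwise None.
    """
    if (entry >> 60) & 0xF != 0x1:
        return None  # assumption: not a valid entry, no grain table
    return entry & 0xFFFFFFFF


print(gd_entry_to_index(0x1000000000000005))
print(gd_entry_to_index(0))
```
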
    Grain table (L2):
        * Grain table entries are indexes (not offsets) to grains.
        * If the highest nibble of the entry is:
            0x0:
                The grain is not allocated.
                The rest of the bytes are 0.
            0x1:
                The grain is unmapped - guest sees a zero grain.
                The rest of the bits point to the previously mapped grain,
                see 0x3 case.
            0x2:
                The grain is zero.
            0x3:
                The grain is allocated - to get the index calculate:
                ((entry & 0x0fff000000000000) >> 48) |
                ((entry & 0x0000ffffffffffff) << 12)
        * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
          grain which results from the guest using sg_unmap to unmap the
          grain - but the grain itself still exists in the grain extent - a
          space reclamation procedure should delete it.
          Unmapping a zero grain has no effect (0x2 will not change to 0x1)
          but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
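
The grain table cases above can be sketched as a small decoder. All
names here are hypothetical. The index formula is the one given in the
0x3 case; the sector calculation assumes that grain indexes are relative
to the start of the grain extent.

```python
from enum import Enum
from typing import Optional, Tuple


class GrainState(Enum):
    UNALLOCATED = 0x0  # no grain in this snapshot
    UNMAPPED = 0x1     # guest sees zeros; old grain still in the extent
    ZERO = 0x2         # guest sees zeros
    ALLOCATED = 0x3    # data lives in the grain extent


def grain_index(entry: int) -> int:
    """Apply the index formula from the 0x3 case."""
    return ((entry & 0x0FFF000000000000) >> 48) | \
           ((entry & 0x0000FFFFFFFFFFFF) << 12)


def decode_gt_entry(entry: int) -> Tuple[GrainState, Optional[int]]:
    """Return (state, grain index or None) for a 64-bit L2 entry."""
    nibble = (entry >> 60) & 0xF
    if nibble > 0x3:
        raise ValueError("unknown grain table entry type: 0x%x" % nibble)
    state = GrainState(nibble)
    if state in (GrainState.UNMAPPED, GrainState.ALLOCATED):
        return state, grain_index(entry)
    return state, None


def grain_sector(index: int, grain_extent_offset: int, grain_size: int) -> int:
    """Sector of a grain, assuming indexes count from the grain extent start."""
    return grain_extent_offset + index * grain_size


state, index = decode_gt_entry(0x3005000000000002)
print(state.name, hex(index))
# Values from the sample log: grain extent at sector 12288, 8-sector grains.
print(grain_sector(index, 12288, 8))
```

Only the ALLOCATED case requires reading from the grain extent. The index
carried by an UNMAPPED entry matters only to space reclamation, which this
commit does not implement.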
    
    In order to implement seSparse some fields had to be changed to support
    both 32-bit and 64-bit entry sizes.
    
    Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
    Reviewed-by: Eyal Moscovici <eyal.moscovici@oracle.com>
    Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
    Signed-off-by: Sam Eiderman <shmuel.eiderman@oracle.com>
    Message-id: 20190620091057.47441-4-shmuel.eiderman@oracle.com
    Signed-off-by: Max Reitz <mreitz@redhat.com>