Skip to content
  • Emilio G. Cota's avatar
    42bd3228
    tb hash: hash phys_pc, pc, and flags with xxhash · 42bd3228
    Emilio G. Cota authored
    For some workloads such as arm bootup, tb_phys_hash is performance-critical.
    The is due to the high frequency of accesses to the hash table, originated
    by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
    More info:
      https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
    
    
    
    To dig further into this I modified an arm image booting debian jessie to
    immediately shut down after boot. Analysis revealed that quite a bit of time
    is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
    results in very uneven loading of chains in the hash table's buckets;
    the longest observed chain had ~550 elements.
    
    The appended addresses this with two changes:
    
    1) Use xxhash as the hash table's hash function. xxhash is a fast,
       high-quality hashing function.
    
    2) Feed the hashing function with not just tb_phys, but also pc and flags.
    
    This improves performance over using just tb_phys for hashing, since that
    resulted in some hash buckets having many TB's, while others getting very few;
    with these changes, the longest observed chain on a single hash bucket is
    brought down from ~550 to ~40.
    
    Tests show that the other element checked for in tb_find_physical,
    cs_base, is always a match when tb_phys+pc+flags are a match,
    so hashing cs_base is wasteful. It could be that this is an ARM-only
    thing, though. UPDATE:
    On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
    > The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
    > consisting of only a delay slot).
    > It may well still turn out to be reasonable to ignore cs_base for hashing.
    
    BTW, after this change the hash table should not be called "tb_hash_phys"
    anymore; this is addressed later in this series.
    
    This change gives consistent bootup time improvements. I tested two
    host machines:
    - Intel Xeon E5-2690: 11.6% less time
    - Intel i7-4790K: 19.2% less time
    
    Increasing the number of hash buckets yields further improvements. However,
    using a larger, fixed number of buckets can degrade performance for other
    workloads that do not translate as many blocks (600K+ for debian-jessie arm
    bootup). This is dealt with later in this series.
    
    Reviewed-by: default avatarSergey Fedorov <sergey.fedorov@linaro.org>
    Reviewed-by: default avatarRichard Henderson <rth@twiddle.net>
    Reviewed-by: default avatarAlex Bennée <alex.bennee@linaro.org>
    Signed-off-by: default avatarEmilio G. Cota <cota@braap.org>
    Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org>
    Signed-off-by: default avatarRichard Henderson <rth@twiddle.net>
    42bd3228
    tb hash: hash phys_pc, pc, and flags with xxhash
    Emilio G. Cota authored
    For some workloads such as arm bootup, tb_phys_hash is performance-critical.
    The is due to the high frequency of accesses to the hash table, originated
    by (frequent) TLB flushes that wipe out the cpu-private tb_jmp_cache's.
    More info:
      https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg05098.html
    
    
    
    To dig further into this I modified an arm image booting debian jessie to
    immediately shut down after boot. Analysis revealed that quite a bit of time
    is unnecessarily spent in tb_phys_hash: the cause is poor hashing that
    results in very uneven loading of chains in the hash table's buckets;
    the longest observed chain had ~550 elements.
    
    The appended addresses this with two changes:
    
    1) Use xxhash as the hash table's hash function. xxhash is a fast,
       high-quality hashing function.
    
    2) Feed the hashing function with not just tb_phys, but also pc and flags.
    
    This improves performance over using just tb_phys for hashing, since that
    resulted in some hash buckets having many TB's, while others getting very few;
    with these changes, the longest observed chain on a single hash bucket is
    brought down from ~550 to ~40.
    
    Tests show that the other element checked for in tb_find_physical,
    cs_base, is always a match when tb_phys+pc+flags are a match,
    so hashing cs_base is wasteful. It could be that this is an ARM-only
    thing, though. UPDATE:
    On Tue, Apr 05, 2016 at 08:41:43 -0700, Richard Henderson wrote:
    > The cs_base field is only used by i386 (in 16-bit modes), and sparc (for a TB
    > consisting of only a delay slot).
    > It may well still turn out to be reasonable to ignore cs_base for hashing.
    
    BTW, after this change the hash table should not be called "tb_hash_phys"
    anymore; this is addressed later in this series.
    
    This change gives consistent bootup time improvements. I tested two
    host machines:
    - Intel Xeon E5-2690: 11.6% less time
    - Intel i7-4790K: 19.2% less time
    
    Increasing the number of hash buckets yields further improvements. However,
    using a larger, fixed number of buckets can degrade performance for other
    workloads that do not translate as many blocks (600K+ for debian-jessie arm
    bootup). This is dealt with later in this series.
    
    Reviewed-by: default avatarSergey Fedorov <sergey.fedorov@linaro.org>
    Reviewed-by: default avatarRichard Henderson <rth@twiddle.net>
    Reviewed-by: default avatarAlex Bennée <alex.bennee@linaro.org>
    Signed-off-by: default avatarEmilio G. Cota <cota@braap.org>
    Message-Id: <1465412133-3029-8-git-send-email-cota@braap.org>
    Signed-off-by: default avatarRichard Henderson <rth@twiddle.net>
Loading