  1. Sep 23, 2020
    • qemu/atomic.h: rename atomic_ to qatomic_ · d73415a3
      Stefan Hajnoczi authored
      
      clang's C11 atomic_fetch_*() functions only take a C11 atomic type
      pointer argument. QEMU uses plain types (int, etc.), which causes a
      compiler error when QEMU code calls these functions in a source file
      that also includes <stdatomic.h> via a system header file:
      
        $ CC=clang CXX=clang++ ./configure ... && make
        ../util/async.c:79:17: error: address argument to atomic operation must be a pointer to _Atomic type ('unsigned int *' invalid)
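
      The clash is easy to reproduce outside QEMU. A minimal standalone
      sketch (illustrative, not QEMU code): C11's atomic_fetch_add() is
      only defined for _Atomic-qualified objects, so passing the address
      of a plain integer fails under clang exactly as above.

        /* repro.c -- hypothetical standalone example, not QEMU code */
        #include <stdatomic.h>

        static unsigned int counter;   /* plain type, as QEMU uses */

        unsigned int bump(void)
        {
            /* clang: "address argument to atomic operation must be a
             * pointer to _Atomic type ('unsigned int *' invalid)" */
            return atomic_fetch_add(&counter, 1);
            /* declaring "static _Atomic unsigned int counter;" instead
             * would make this build */
        }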
      
      Avoid using atomic_*() names in QEMU's atomic.h since that namespace is
      used by <stdatomic.h>. Prefix QEMU's APIs with 'q' so that atomic.h
      and <stdatomic.h> can co-exist. I checked /usr/include on my machine and
      searched GitHub for existing "qatomic_" users but there seem to be none.
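
      With the prefix, QEMU's macros and the C11 names occupy disjoint
      namespaces. A simplified sketch of the idea (these definitions are
      stand-ins; the real qatomic_* macros in include/qemu/atomic.h add
      type checks and more barrier variants):

        #include <stdatomic.h>   /* owns the atomic_* namespace */

        /* hypothetical stand-ins for QEMU's prefixed macros */
        #define qatomic_read(ptr)      __atomic_load_n(ptr, __ATOMIC_RELAXED)
        #define qatomic_set(ptr, val)  __atomic_store_n(ptr, val, __ATOMIC_RELAXED)
        #define qatomic_fetch_inc(ptr) __atomic_fetch_add(ptr, 1, __ATOMIC_SEQ_CST)

        static unsigned int refcount;

        unsigned int grab(void)
        {
            return qatomic_fetch_inc(&refcount);  /* no clash with <stdatomic.h> */
        }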
      
      This patch was generated using:
      
        $ git grep -h -o '\<atomic\(64\)\?_[a-z0-9_]\+' include/qemu/atomic.h | \
          sort -u >/tmp/changed_identifiers
        $ for identifier in $(</tmp/changed_identifiers); do
              sed -i "s%\<$identifier\>%q$identifier%g" \
                  $(git grep -I -l "\<$identifier\>")
          done
      
      I manually fixed line-wrap issues and misaligned rST tables.
      
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
      Acked-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20200923105646.47864-1-stefanha@redhat.com>
  2. Oct 02, 2018
    • qsp: use atomic64 accessors · ac8c7748
      Emilio G. Cota authored
      
      With the seqlock, we either have to use atomics to remain
      within defined behaviour (and note that 64-bit atomics aren't
      always guaranteed to compile, irrespective of __nocheck), or
      drop the atomics and be in undefined behaviour territory.
      
      Fix it by dropping the seqlock and using atomic64 accessors.
      This will limit scalability when !CONFIG_ATOMIC64, but those
      machines (1) don't have many users and (2) are unlikely to
      have many cores.
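
      A hedged sketch of the resulting access pattern (standalone
      stand-ins; the real accessors are QEMU's atomic_read_i64()/
      atomic_set_i64(), which fall back to a lock-based implementation
      when !CONFIG_ATOMIC64):

        #include <stdint.h>

        /* illustrative equivalents of the atomic64 accessors */
        static inline int64_t read_i64(const int64_t *ptr)
        {
            return __atomic_load_n(ptr, __ATOMIC_RELAXED);
        }

        static inline void set_i64(int64_t *ptr, int64_t val)
        {
            __atomic_store_n(ptr, val, __ATOMIC_RELAXED);
        }

        /* counters are written only by their owner thread; readers in
         * other threads now see tear-free 64-bit values without the
         * seqlock */
        static void record_wait(int64_t *ns_acc, int64_t delta)
        {
            set_i64(ns_acc, read_i64(ns_acc) + delta);
        }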
      
      - With CONFIG_ATOMIC64:
      $ tests/atomic_add-bench -n 1 -m -p
       Throughput:         13.00 Mops/s
      
      - Forcing !CONFIG_ATOMIC64:
      $ tests/atomic_add-bench -n 1 -m -p
       Throughput:         10.89 Mops/s
      
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <20180910232752.31565-5-cota@braap.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. Aug 23, 2018
    • qsp: track BQL callers explicitly · cb764d06
      Emilio G. Cota authored
      
      The BQL is acquired via qemu_mutex_lock_iothread(), which makes
      the profiler assign the associated wait time (i.e. most of
      BQL wait time) entirely to that function. This loses the original
      call site information, which does not help diagnose BQL contention.
      Fix it by tracking the callers explicitly.
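
      The fix follows the usual file/line forwarding pattern; a
      simplified sketch of the shape of the change (not the verbatim
      patch):

        /* the wrapper becomes a macro, so the profiler records the
         * caller's location instead of the wrapper's */
        void qemu_mutex_lock_iothread_impl(const char *file, int line);

        #define qemu_mutex_lock_iothread() \
            qemu_mutex_lock_iothread_impl(__FILE__, __LINE__)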
      
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • qsp: support call site coalescing · d557de4a
      Emilio G. Cota authored
      
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • qsp: add qsp_reset · 996e8d9a
      Emilio G. Cota authored
      
      I first implemented this by deleting all entries in the global
      hash table. But doing that safely slows down profiling, since
      we'd need to introduce rcu_read_lock/unlock in the fast path.
      
      What's implemented here avoids touching the thread-local data in
      the global hash table. It achieves this by taking a snapshot of the
      current state, so that subsequent reports present the delta with
      respect to that snapshot.
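
      A conceptual sketch of the snapshot approach (illustrative types,
      not the actual qsp code):

        #include <stdint.h>

        typedef struct Counter {
            int64_t total;     /* written only by its owner thread */
            int64_t snapshot;  /* captured at the last reset */
        } Counter;

        /* "reset" remembers the current value instead of zeroing it,
         * so the owner thread's fast path is never disturbed */
        static void counter_reset(Counter *c)
        {
            c->snapshot = __atomic_load_n(&c->total, __ATOMIC_RELAXED);
        }

        /* reports show only what accumulated since the last reset */
        static int64_t counter_report(const Counter *c)
        {
            return __atomic_load_n(&c->total, __ATOMIC_RELAXED) - c->snapshot;
        }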
      
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • qsp: add sort_by option to qsp_report · 0a22777c
      Emilio G. Cota authored
      
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • qsp: QEMU's Synchronization Profiler · fe9959a2
      Emilio G. Cota authored
      
      The goal of this module is to profile synchronization primitives (i.e.
      mutexes, recursive mutexes and condition variables) so that scalability
      issues can be quickly diagnosed.
      
      Sync primitives are profiled by QSP based on the vaddr of the object
      accessed as well as the call site (file:line_nr). That means the same
      object accessed from two different call sites will be tracked in
      separate entries, which might be reported together or separately (see
      the subsequent commit on call site coalescing).
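
      A sketch of the resulting profiling key (field names here are
      illustrative; the real structure lives in util/qsp.c):

        /* each entry is keyed by object address plus call site, so the
         * same mutex taken from two places yields two entries */
        typedef struct QSPKey {
            const void *obj;   /* address of the profiled mutex/cond */
            const char *file;  /* caller's __FILE__ */
            int         line;  /* caller's __LINE__ */
        } QSPKey;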
      
      Some perf numbers:
      
      Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      Command: taskset -c 0 tests/atomic_add-bench -d 5 -m
      
      - Before: 54.80 Mops/s
      - After:  54.75 Mops/s
      
      That is, a negligible slowdown due to the now-indirect call to
      qemu_mutex_lock. Note that using a branch instead of an indirect
      call introduces a more severe slowdown (53.65 Mops/s, i.e. a 2%
      slowdown).
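
      A simplified sketch of that indirect-call hook (QEMU routes the
      lock through a function pointer that enabling the profiler swaps
      out; the details here are illustrative):

        typedef struct QemuMutex QemuMutex;
        typedef void (*QemuMutexLockFunc)(QemuMutex *m,
                                          const char *file, int line);

        /* points at the plain implementation by default; the profiler
         * installs a timing wrapper instead */
        extern QemuMutexLockFunc qemu_mutex_lock_func;

        #define qemu_mutex_lock(m) \
            qemu_mutex_lock_func(m, __FILE__, __LINE__)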
      
      Enabling the profiler (with -p, added in this series) is more interesting:
      
      - No profiling: 54.75 Mops/s
      - W/ profiling: 12.53 Mops/s
      
      That is, a 4.36X slowdown.
      
      We can break down this slowdown by removing the get_clock calls or
      the entry lookup:
      
      - No profiling:     54.75 Mops/s
      - W/o get_clock:    25.37 Mops/s
      - W/o entry lookup: 19.30 Mops/s
      - W/ profiling:     12.53 Mops/s
      
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>