  3. Mar 25, 2019
    • hardfloat: fix float32/64 fused multiply-add · 896f51fb
      Kito Cheng authored
      
      Before falling back to the softfloat FMA, we did not restore the
      original values of inputs A and C. Fix that by restoring them before
      taking the fallback path.
      
      This bug was caught by running gcc's testsuite on RISC-V qemu.
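      The shape of the bug and fix can be sketched as follows. This is an
      illustrative save-and-restore pattern, not QEMU's actual code; the
      names (soft_fma, hard_fma, negate_product) are made up for the example:

      ```c
      #include <assert.h>
      #include <math.h>

      /* Stand-in for the slow, fully-emulated softfloat FMA path. */
      static double soft_fma(double a, double b, double c, int negate_product)
      {
          return fma(negate_product ? -a : a, b, c);
      }

      /* The fast path rewrites its inputs (here, negating 'a'). The bug
       * was reaching the softfloat fallback with the rewritten values;
       * the fix is to hand it the saved originals. */
      static double hard_fma(double a, double b, double c, int negate_product)
      {
          const double orig_a = a, orig_c = c;   /* save for the fallback */

          if (negate_product) {
              a = -a;                            /* fast path mutates its input */
          }
          double r = a * b + c;                  /* host-FPU fast path (sketch) */
          if (isnan(r)) {
              /* Special case: defer to softfloat. Passing the mutated 'a'
               * here was the bug; pass the restored originals instead. */
              return soft_fma(orig_a, b, orig_c, negate_product);
          }
          return r;
      }
      ```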
      
      Note that this change gives a small perf increase for fp-bench:
      
        Host: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
        Command: perf stat -r 3 taskset -c 0 ./fp-bench -o mulAdd -p $prec
      
      - $prec = single:
        - before:
          101.71 MFlops
          102.18 MFlops
          100.96 MFlops
        - after:
          103.63 MFlops
          103.05 MFlops
          102.96 MFlops
      
      - $prec = double:
        - before:
          173.10 MFlops
          173.93 MFlops
          172.11 MFlops
        - after:
          178.49 MFlops
          178.88 MFlops
          178.66 MFlops
      
      Signed-off-by: Kito Cheng <kito.cheng@gmail.com>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Message-Id: <20190322204320.17777-1-cota@braap.org>
      Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • target/mips: Fix minor bug in FPU · 7ca96e1a
      Mateja Marjanovic authored
      
      The wrong type of NaN was generated for IEEE 754-2008 by the
      MADDF.<D|S> and MSUBF.<D|S> instructions when the arguments were
      (Inf, Zero, NaN) or (Zero, Inf, NaN).
      
      An if-else statement determines whether the system conforms to IEEE
      754-1985 or IEEE 754-2008 and defines different behaviors accordingly.
      In the IEEE 754-2008 case, for the input combinations above,
      <MADDF|MSUBF>.<D|S> returns the input value 'c' [2] (page 53) and
      raises the floating point exception 'Invalid Operation' [1] (pages
      349, 350).
      
      These scenarios were tested and the results in QEMU emulation match
      the results obtained on the machine that has a MIPS64R6 CPU.
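      The corrected 2008-mode behavior can be sketched like this. It is an
      illustrative model only, not QEMU's actual helper; the names
      (maddf_2008, invalid_operation) are made up for the example:

      ```c
      #include <assert.h>
      #include <math.h>
      #include <stdbool.h>

      /* Sketch of IEEE 754-2008-mode MADDF: with inputs (Inf, Zero, c)
       * or (Zero, Inf, c), return the input 'c' and raise the 'Invalid
       * Operation' exception, rather than generating a default NaN. */
      static bool invalid_operation;

      static double maddf_2008(double a, double b, double c)
      {
          if ((isinf(a) && b == 0.0) || (a == 0.0 && isinf(b))) {
              invalid_operation = true;   /* raise 'Invalid Operation' */
              return c;                   /* return 'c', even when it is NaN */
          }
          return fma(a, b, c);            /* ordinary fused multiply-add */
      }
      ```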
      
      [1] MIPS Architecture for Programmers Volume II-a: The MIPS64
          Instruction Set Reference Manual, Revision 6.06
      [2] MIPS Architecture for Programmers Volume IV-j: The MIPS64
          SIMD Architecture Module, Revision 1.12
      
      Signed-off-by: Mateja Marjanovic <mateja.marjanovic@rt-rk.com>
      Message-Id: <1553008916-15274-2-git-send-email-mateja.marjanovic@rt-rk.com>
      Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
      [AJB: fixed up commit message]
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
  6. Dec 17, 2018
    • hardfloat: implement float32/64 comparison · d9fe9db9
      Emilio G. Cota authored
      Performance results for fp-bench:
      
      Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      - before:
      cmp-single: 110.98 MFlops
      cmp-double: 107.12 MFlops
      - after:
      cmp-single: 506.28 MFlops
      cmp-double: 524.77 MFlops
      
      Note that flattening both the eq and eq_signaling versions
      would give us extra performance (695 vs 506 MFlops for single,
      615 vs 524 MFlops for double), but this would emit two
      essentially identical functions for each eq/signaling pair,
      which is a waste.
      
      Aggregate performance improvement for the last few patches:
      [ all charts in png: https://imgur.com/a/4yV8p ]
      
      1. Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      
                         qemu-aarch64 NBench score; higher is better
                       Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      
        [ASCII bar chart, garbled in this text dump: compares 'before' with
        the cumulative hardfloat patches (through sqrt and cmp) on FOURIER,
        NEURAL NET, LU DECOMPOSITION and their gmean; see the png link above
        for a readable version.]
      
                                    qemu-aarch64 SPEC06fp (test set) speedup over QEMU 4c2c1015
                                            Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
                                                  error bars: 95% confidence interval
      
        [ASCII bar chart, garbled in this text dump: per-benchmark speedup
        bars with 95% confidence intervals, plus the geomean; see the png
        link above for a readable version.]
      
      2. Host: ARM Aarch64 A57 @ 2.4GHz
      
                          qemu-aarch64 NBench score; higher is better
                       Host: Applied Micro X-Gene, Aarch64 A57 @ 2.4 GHz
      
        [ASCII bar chart, garbled in this text dump: compares 'before' with
        the cumulative hardfloat patches on FOURIER, NEURAL NET,
        LU DECOMPOSITION and their gmean; see the png link above for a
        readable version.]
      
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • hardfloat: implement float32/64 square root · f131bae8
      Emilio G. Cota authored
      
      Performance results for fp-bench:
      
      Host: Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      - before:
      sqrt-single: 42.30 MFlops
      sqrt-double: 22.97 MFlops
      - after:
      sqrt-single: 311.42 MFlops
      sqrt-double: 311.08 MFlops
      
      Here USE_FP makes a huge difference for f64's, with throughput
      going from ~200 MFlops to ~300 MFlops.
      
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • hardfloat: implement float32/64 fused multiply-add · ccf770ba
      Emilio G. Cota authored
      
      Performance results for fp-bench:
      
      1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      - before:
      fma-single: 74.73 MFlops
      fma-double: 74.54 MFlops
      - after:
      fma-single: 203.37 MFlops
      fma-double: 169.37 MFlops
      
      2. ARM Aarch64 A57 @ 2.4GHz
      - before:
      fma-single: 23.24 MFlops
      fma-double: 23.70 MFlops
      - after:
      fma-single: 66.14 MFlops
      fma-double: 63.10 MFlops
      
      3. IBM POWER8E @ 2.1 GHz
      - before:
      fma-single: 37.26 MFlops
      fma-double: 37.29 MFlops
      - after:
      fma-single: 48.90 MFlops
      fma-double: 59.51 MFlops
      
      Here having 3FP64 set to 1 pays off for x86_64:
      [1] 170.15 vs [0] 153.12 MFlops
      
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • hardfloat: implement float32/64 division · 4a629561
      Emilio G. Cota authored
      
      Performance results for fp-bench:
      
      1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      - before:
      div-single: 34.84 MFlops
      div-double: 34.04 MFlops
      - after:
      div-single: 275.23 MFlops
      div-double: 216.38 MFlops
      
      2. ARM Aarch64 A57 @ 2.4GHz
      - before:
      div-single: 9.33 MFlops
      div-double: 9.30 MFlops
      - after:
      div-single: 51.55 MFlops
      div-double: 15.09 MFlops
      
      3. IBM POWER8E @ 2.1 GHz
      - before:
      div-single: 25.65 MFlops
      div-double: 24.91 MFlops
      - after:
      div-single: 96.83 MFlops
      div-double: 31.01 MFlops
      
      Here setting 2FP64_USE_FP to 1 pays off for x86_64:
      [1] 215.97 vs [0] 62.15 MFlops
      
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • hardfloat: implement float32/64 multiplication · 2dfabc86
      Emilio G. Cota authored
      
      Performance results for fp-bench:
      
      1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      - before:
      mul-single: 126.91 MFlops
      mul-double: 118.28 MFlops
      - after:
      mul-single: 258.02 MFlops
      mul-double: 197.96 MFlops
      
      2. ARM Aarch64 A57 @ 2.4GHz
      - before:
      mul-single: 37.42 MFlops
      mul-double: 38.77 MFlops
      - after:
      mul-single: 73.41 MFlops
      mul-double: 76.93 MFlops
      
      3. IBM POWER8E @ 2.1 GHz
      - before:
      mul-single: 58.40 MFlops
      mul-double: 59.33 MFlops
      - after:
      mul-single: 60.25 MFlops
      mul-double: 94.79 MFlops
      
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • hardfloat: implement float32/64 addition and subtraction · 1b615d48
      Emilio G. Cota authored
      
      Performance results (single and double precision) for fp-bench:
      
      1. Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
      - before:
      add-single: 135.07 MFlops
      add-double: 131.60 MFlops
      sub-single: 130.04 MFlops
      sub-double: 133.01 MFlops
      - after:
      add-single: 443.04 MFlops
      add-double: 301.95 MFlops
      sub-single: 411.36 MFlops
      sub-double: 293.15 MFlops
      
      2. ARM Aarch64 A57 @ 2.4GHz
      - before:
      add-single: 44.79 MFlops
      add-double: 49.20 MFlops
      sub-single: 44.55 MFlops
      sub-double: 49.06 MFlops
      - after:
      add-single: 93.28 MFlops
      add-double: 88.27 MFlops
      sub-single: 91.47 MFlops
      sub-double: 88.27 MFlops
      
      3. IBM POWER8E @ 2.1 GHz
      - before:
      add-single: 72.59 MFlops
      add-double: 72.27 MFlops
      sub-single: 75.33 MFlops
      sub-double: 70.54 MFlops
      - after:
      add-single: 112.95 MFlops
      add-double: 201.11 MFlops
      sub-single: 116.80 MFlops
      sub-double: 188.72 MFlops
      
      Note that the IBM and ARM machines benefit from having
      HARDFLOAT_2F{32,64}_USE_FP set to 0. Otherwise their performance
      can suffer significantly:
      - IBM Power8:
      add-single: [1] 54.94 vs [0] 116.37 MFlops
      add-double: [1] 58.92 vs [0] 201.44 MFlops
      - Aarch64 A57:
      add-single: [1] 80.72 vs [0] 93.24 MFlops
      add-double: [1] 82.10 vs [0] 88.18 MFlops
      
      On the Intel machine, having 2F64 set to 1 pays off, but it
      doesn't for 2F32:
      - Intel i7-6700K:
      add-single: [1] 285.79 vs [0] 426.70 MFlops
      add-double: [1] 302.15 vs [0] 278.82 MFlops
      
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • fpu: introduce hardfloat · a94b7839
      Emilio G. Cota authored
      
      This patch paves the way for leveraging the host FPU for a subset
      of guest FP operations. For most guest workloads (e.g. where FP
      flags are never cleared, inexact occurs often, and rounding is set
      to the default [to nearest]) this will yield sizable performance
      speedups.
      
      The approach followed here avoids checking the FP exception flags register.
      See the added comment for details.
      
      This assumes that QEMU is running on an IEEE754-compliant FPU and
      that the rounding is set to the default (to nearest). The
      implementation-dependent specifics of the FPU should not matter; things
      like tininess detection and snan representation are still dealt with in
      soft-fp. However, this approach will break on most hosts if we compile
      QEMU with flags that break IEEE compatibility. There is no way to detect
      all of these flags at compilation time, but at least we check for
      -ffast-math (which defines __FAST_MATH__) and disable hardfloat
      (plus emit a #warning) when it is set.
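      The dispatch idea can be sketched as follows. All names here are
      made up for the example, not QEMU's actual identifiers: the fast
      path is taken only when the guest's inexact flag is already sticky
      and rounding is round-to-nearest-even, so the host's FP exception
      flags never need to be read or cleared.

      ```c
      #include <assert.h>
      #include <stdbool.h>

      enum rounding { ROUND_NEAREST_EVEN, ROUND_OTHER };

      typedef struct {
          unsigned exception_flags;     /* sticky, accumulated guest flags */
          enum rounding rounding_mode;
      } float_status;

      #define FLAG_INEXACT 0x1u

      /* Host FPU is safe only when any inexact result would be absorbed
       * by an already-set sticky flag, under default rounding. */
      static bool can_use_host_fpu(const float_status *s)
      {
          return (s->exception_flags & FLAG_INEXACT)
              && s->rounding_mode == ROUND_NEAREST_EVEN;
      }

      /* Stand-in for the fully emulated softfloat addition, which does
       * its own exact flag bookkeeping. */
      static double soft_add(double a, double b, float_status *s)
      {
          s->exception_flags |= FLAG_INEXACT;
          return a + b;
      }

      static double float64_add(double a, double b, float_status *s)
      {
          if (can_use_host_fpu(s)) {
              return a + b;             /* host FPU; no flags to read back */
          }
          return soft_add(a, b, s);     /* slow path keeps flags exact */
      }
      ```

      Note that this is why sticky (never-cleared) inexact flags matter:
      once the flag is set, the fast path cannot lose any information.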
      
      This patch just adds common code. Some operations will be migrated
      to hardfloat in subsequent patches to ease bisection.
      
      Note: some architectures (at least PPC, there might be others) clear
      the status flags passed to softfloat before most FP operations. This
      precludes the use of hardfloat, so to avoid introducing a performance
      regression for those targets, we add a flag to disable hardfloat.
      In the long run though it would be good to fix the targets so that
      at least the inexact flag passed to softfloat is indeed sticky.
      
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
    • softfloat: rename canonicalize to sf_canonicalize · f9943c7f
      Emilio G. Cota authored
      
      glibc >= 2.25 defines canonicalize in commit eaf5ad0
      (Add canonicalize, canonicalizef, canonicalizel., 2016-10-26).
      
      Given that we'll be including <math.h> soon, prepare
      for this by prefixing our canonicalize() with sf_ to avoid
      clashing with the libc's canonicalize().
      
      Reported-by: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
      Tested-by: Bastian Koppelmann <kbastian@mail.uni-paderborn.de>
      Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
      Signed-off-by: Emilio G. Cota <cota@braap.org>
      Signed-off-by: Alex Bennée <alex.bennee@linaro.org>