Commits · 2f8a21d8ff3af484a37edc8ea61d127ec1529ab5 · Anton / libtcg

Oct 18, 2022

target/i386: Enable AVX cpuid bits when using TCG · 2f8a21d8

Paul Brook authored 3 years ago


Include AVX, AVX2 and VAES in the guest cpuid features supported by TCG.

Signed-off-by: Paul Brook <paul@nowt.org>
Message-Id: <20220424220204.2493824-40-paul@nowt.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2f8a21d8

target/i386: implement VLDMXCSR/VSTMXCSR · 57f6bba0

Paolo Bonzini authored 2 years ago


These are exactly the same as the non-VEX version, but one has to be careful
that only VEX.L=0 is allowed.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

57f6bba0

target/i386: implement XSAVE and XRSTOR of AVX registers · 89254431

Paolo Bonzini authored 2 years ago


Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

89254431

target/i386: reimplement 0x0f 0x28-0x2f, add AVX · f8d19eec

Paolo Bonzini authored 2 years ago


Here the code is a bit uglier due to the truncation and extension
of registers to and from 32-bit.  There is also a mistake in the
manual with respect to the size of the memory operand of CVTPS2PI
and CVTTPS2PI, reported by Ricky Zhou.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f8d19eec

target/i386: reimplement 0x0f 0x10-0x17, add AVX · 7170a17e

Paolo Bonzini authored 2 years ago


These are mostly moves, and yet are a total pain.  The main issue
is that:

1) some instructions are selected by mod==11 (register operand)
vs. mod=00/01/10 (memory operand)

2) stores to memory are two-operand operations, while the 3-register
and load-from-memory versions operate on the entire contents of the
destination; this makes it easier to separate the gen_* function for
the store case

3) it's inefficient to load into xmm_T0 only to move the value out
again, so the gen_* function for the load case is separated too

The manual also has various mistakes in the operands here, for example
the store case of MOVHPS operates on a 128-bit source (albeit discarding
the bottom 64 bits) and therefore should be Mq,Vdq rather than Mq,Vq.
Likewise for the destination and source of MOVHLPS.

VUNPCK?PS and VUNPCK?PD are the same as VUNPCK?DQ and VUNPCK?QDQ,
but encoded as prefixes rather than separate operands.  The helpers
can be reused however.

For MOVSLDUP, MOVSHDUP and MOVDDUP I chose to reimplement them as
helpers.  I named the helper for MOVDDUP "movdldup" in preparation
for possible future introduction of MOVDHDUP and to clarify the
similarity with MOVSLDUP.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

7170a17e

target/i386: reimplement 0x0f 0xc2, 0xc4-0xc6, add AVX · aba2b8ec

Paolo Bonzini authored 2 years ago


Nothing special going on here, for once.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

aba2b8ec

target/i386: reimplement 0x0f 0x38, add AVX · 16fc5726

Paolo Bonzini authored 2 years ago


There are several special cases here:

1) extending moves have different widths for the helpers vs. for the
memory loads, and the width for memory loads depends on VEX.L too.
This is represented by X86_SPECIAL_AVXExtMov.

2) some instructions, such as variable-width shifts, select the vector element
size via REX.W.

3) VSIB instructions (VGATHERxPy, VPGATHERxy) are also part of this group,
and they have (among other things) two output operands.

3) the macros for 4-operand blends (which are under 0x0f 0x3a) have to be
extended to support 2-operand blends.  The 2-operand variant actually
came a few years earlier, but it is clearer to implement them in the
opposite order.

X86_TYPE_WM, introduced earlier for unaligned loads, is reused for helpers
that accept a Reg* but have a M argument.

These three-byte opcodes also include AVX new instructions, for which
the helpers were originally implemented by Paul Brook <paul@nowt.org>.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

16fc5726

target/i386: Use tcg gvec ops for pmovmskb · d4af67a2

Richard Henderson authored 2 years ago


As pmovmskb is used by strlen et al, this is the third
highest overhead sse operation at %0.8.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
[Reorganize to generate code for any vector size. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

d4af67a2

target/i386: reimplement 0x0f 0x3a, add AVX · 79068477

Paolo Bonzini authored 2 years ago


The more complicated operations here are insertions and extractions.
Otherwise, there are just more entries than usual because the PS/PD/SS/SD
variations are encoded in the opcode rater than in the prefixes.

These three-byte opcodes also include AVX new instructions, whose
implementation in the helpers was originally done by Paul Brook
<paul@nowt.org>.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

79068477

target/i386: clarify (un)signedness of immediates from 0F3Ah opcodes · a64eee3a

Paolo Bonzini authored 2 years ago


Three-byte opcodes from the 0F3Ah area all have an immediate byte which
is usually unsigned.  Clarify in the helper code that it is unsigned;
the new decoder treats immediates as signed by default, and seeing
an intN_t in the prototype might give the wrong impression that one
can use decode->immediate directly.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a64eee3a

target/i386: reimplement 0x0f 0xd0-0xd7, 0xe0-0xe7, 0xf0-0xf7, add AVX · 6bbeb98d

Paolo Bonzini authored 2 years ago


The more complicated ones here are d6-d7, e6-e7, f7.  The others
are trivial.

For LDDQU, using gen_load_sse directly might corrupt the register if
the second part of the load fails.  Therefore, add a custom X86_TYPE_WM
value; like X86_TYPE_W it does call gen_load(), but it also rejects a
value of 11 in the ModRM field like X86_TYPE_M.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

6bbeb98d

target/i386: reimplement 0x0f 0x70-0x77, add AVX · ce4fcb94

Paolo Bonzini authored 2 years ago


This includes shifts by immediate, which use bits 3-5 of the ModRM byte
as an opcode extension.  With the exception of 128-bit shifts, they are
implemented using gvec.

This also covers VZEROALL and VZEROUPPER, which use the same opcode
as EMMS.  If we were wanting to optimize out gen_clear_ymmh then this
would be one of the starting points.  The implementation of the VZEROALL
and VZEROUPPER helpers is by Paul Brook.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ce4fcb94

target/i386: reimplement 0x0f 0x78-0x7f, add AVX · d1c1a422

Paolo Bonzini authored 2 years ago


These are a mixed batch, including the first two horizontal
(66 and F2 only) operations, more moves, and SSE4a extract/insert.

Because SSE4a is pretty rare, I chose to leave the helper as they are,
but it is possible to unify them by loading index and length from the
source XMM register and generating deposit or extract TCG ops.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

d1c1a422

target/i386: reimplement 0x0f 0x50-0x5f, add AVX · 03b45880

Paolo Bonzini authored 2 years ago


These are mostly floating-point SSE operations.  The odd ones out
are MOVMSK and CVTxx2yy, the others are straightforward.

Unary operations are a bit special in AVX because they have 2 operands
for PD/PS operands (VEX.vvvv must be 1111b), and 3 operands for SD/SS.
They are handled using X86_OP_GROUP3 for compactness.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

03b45880

target/i386: reimplement 0x0f 0xd8-0xdf, 0xe8-0xef, 0xf8-0xff, add AVX · 1d0efbdb

Paolo Bonzini authored 2 years ago

These are more simple integer instructions present in both MMX and SSE/AVX,
with no holes that were later occupied by newer instructions.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1d0efbdb

target/i386: reimplement 0x0f 0x60-0x6f, add AVX · 92ec056a

Paolo Bonzini authored 2 years ago

These are both MMX and SSE/AVX instructions, except for vmovdqu. In both
cases the inputs and output is in s->ptr{0,1,2}, so the only difference
between MMX, SSE, and AVX is which helper to call.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

92ec056a

target/i386: Introduce 256-bit vector helpers · b98f886c

Paolo Bonzini authored 2 years ago


The new implementation of SSE will cover AVX from the get go, because
all the work for the helper functions is already done.  We just need to
build them.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

b98f886c

target/i386: implement additional AVX comparison operators · 6e0cac78

Paolo Bonzini authored 2 years ago


The new implementation of SSE will cover AVX from the get go, so include
the 24 extra comparison operators that are only available with the VEX
prefix.

Based on a patch by Paul Brook <paul@nowt.org>.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

6e0cac78

target/i386: provide 3-operand versions of unary scalar helpers · 620f7556

Paolo Bonzini authored 2 years ago

Compared to Paul's implementation, the new decoder will use a different approach
to implement AVX's merging of dst with src1 on scalar operations. Adjust the
old SSE decoder to be compatible with new-style helpers.

The affected instructions are CVTSx2Sx, ROUNDSx, RSQRTSx, SQRTSx, RCPSx.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

620f7556

target/i386: support operand merging in binary scalar helpers · 1de9e7e6

Paolo Bonzini authored 2 years ago

Compared to Paul's implementation, the new decoder will use a different approach
to implement AVX's merging of dst with src1 on scalar operations. Adjust the
helpers to provide this functionality.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1de9e7e6

target/i386: extend helpers to support VEX.V 3- and 4- operand encodings · f05f9789

Paolo Bonzini authored 2 years ago


Add to the helpers all the operands that are needed to implement AVX.

Extracted from a patch by Paul Brook <paul@nowt.org>.

Message-Id: <20220424220204.2493824-26-paul@nowt.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f05f9789

target/i386: Prepare ops_sse_header.h for 256 bit AVX · 30f41921

Paul Brook authored 3 years ago


Adjust all #ifdefs to match the ones in ops_sse.h.

Signed-off-by: Paul Brook <paul@nowt.org>
Message-Id: <20220424220204.2493824-23-paul@nowt.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

30f41921

target/i386: move scalar 0F 38 and 0F 3A instruction to new decoder · 1d0b9261

Paolo Bonzini authored 2 years ago


Because these are the only VEX instructions that QEMU supports, the
new decoder is entered on the first byte of a valid VEX prefix, and VEX
decoding only needs to be done in decode-new.c.inc.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

1d0b9261

target/i386: validate SSE prefixes directly in the decoding table · 55a33286

Paolo Bonzini authored 2 years ago


Many SSE and AVX instructions are only valid with specific prefixes
(none, 66, F3, F2).  Introduce a direct way to encode this in the
decoding table to avoid using decode groups too much.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

55a33286

target/i386: validate VEX prefixes via the instructions' exception classes · 20581aad
Paolo Bonzini authored 2 years ago
```
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
```
20581aad

target/i386: add AVX_EN hflag · 608db8db

Paul Brook authored 3 years ago


Add a new hflag bit to determine whether AVX instructions are allowed

Signed-off-by: Paul Brook <paul@nowt.org>
Message-Id: <20220424220204.2493824-4-paul@nowt.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

608db8db

target/i386: add CPUID feature checks to new decoder · caa01fad

Paolo Bonzini authored 2 years ago


Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

caa01fad

target/i386: add CPUID[EAX=7,ECX=0].ECX to DisasContext · 268dc464

Paolo Bonzini authored 2 years ago


TCG will shortly implement VAES instructions, so add the relevant feature
word to the DisasContext.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

268dc464

target/i386: add ALU load/writeback core · 6ba13999

Paolo Bonzini authored 2 years ago


Add generic code generation that takes care of preparing operands
around calls to decode.e.gen in a table-driven manner, so that ALU
operations need not take care of that.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

6ba13999

target/i386: add core of new i386 decoder · b3e22b23

Paolo Bonzini authored 2 years ago


The new decoder is based on three principles:

- use mostly table-driven decoding, using tables derived as much as possible
  from the Intel manual.  Centralizing the decode the operands makes it
  more homogeneous, for example all immediates are signed.  All modrm
  handling is in one function, and can be shared between SSE and ALU
  instructions (including XMM<->GPR instructions).  The SSE/AVX decoder
  will also not have duplicated code between the 0F, 0F38 and 0F3A tables.

- keep the code as "non-branchy" as possible.  Generally, the code for
  the new decoder is more verbose, but the control flow is simpler.
  Conditionals are not nested and have small bodies.  All instruction
  groups are resolved even before operands are decoded, and code
  generation is separated as much as possible within small functions
  that only handle one instruction each.

- keep address generation and (for ALU operands) memory loads and writeback
  as much in common code as possible.  All ALU operations for example
  are implemented as T0=f(T0,T1).  For non-ALU instructions,
  read-modify-write memory operations are rare, but registers do not
  have TCGv equivalents: therefore, the common logic sets up pointer
  temporaries with the operands, while load and writeback are handled
  by gvec or by helpers.

These principles make future code review and extensibility simpler, at
the cost of having a relatively large amount of code in the form of this
patch.  Even EVEX should not be _too_ hard to implement (it's just a crazy
large amount of possibilities).

This patch introduces the main decoder flow, and integrates the old
decoder with the new one.  The old decoder takes care of parsing
prefixes and then optionally drops to the new one.  The changes to the
old decoder are minimal and allow it to be replaced incrementally with
the new one.

There is a debugging mechanism through a "LIMIT" environment variable.
In user-mode emulation, the variable is the number of instructions
decoded by the new decoder before permanently switching to the old one.
In system emulation, the variable is the highest opcode that is decoded
by the new decoder (this is less friendly, but it's the best that can
be done without requiring deterministic execution).

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

b3e22b23

target/i386: make rex_w available even in 32-bit mode · a61ef762

Paolo Bonzini authored 2 years ago

REX.W can be used even in 32-bit mode by AVX instructions, where it is retroactively
renamed to VEX.W. Make the field available even in 32-bit mode but keep the REX_W()
macro as it was; this way, that the handling of dflag does not use it by mistake and
the AVX code more clearly points at the special VEX behavior of the bit.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

a61ef762

target/i386: make ldo/sto operations consistent with ldq · 12a2c9c7

Paolo Bonzini authored 2 years ago


ldq takes a pointer to the first byte to load the 64-bit word in;
ldo takes a pointer to the first byte of the ZMMReg.  Make them
consistent, which will be useful in the new SSE decoder's
load/writeback routines.

Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

12a2c9c7

target/i386: Define XMMReg and access macros, align ZMM registers · 75f107a8

Richard Henderson authored 2 years ago

This will be used for emission and endian adjustments of gvec operations.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20220822223722.1697758-2-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

75f107a8

target/i386: Use probe_access_full for final stage2 translation · 8629e77b

Richard Henderson authored 2 years ago


Rather than recurse directly on mmu_translate, go through the
same softmmu lookup that we did for the page table walk.
This centralizes all knowledge of MMU_NESTED_IDX, with respect
to setup of TranslationParams, to get_physical_address.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221002172956.265735-10-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

8629e77b

target/i386: Use atomic operations for pte updates · 4a1e9d4d

Richard Henderson authored 2 years ago

Use probe_access_full in order to resolve to a host address,
which then lets us use a host cmpxchg to update the pte.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/279


Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221002172956.265735-9-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4a1e9d4d

target/i386: Combine 5 sets of variables in mmu_translate · 11b4e971

Richard Henderson authored 2 years ago


We don't need one variable set per translation level,
which requires copying into pte/pte_addr for huge pages.
Standardize on pte/pte_addr for all levels.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221002172956.265735-8-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

11b4e971

target/i386: Use MMU_NESTED_IDX for vmload/vmsave · 726ea335

Richard Henderson authored 2 years ago


Use MMU_NESTED_IDX for each memory access, rather than
just a single translation to physical.  Adjust svm_save_seg
and svm_load_seg to pass in mmu_idx.

This removes the last use of get_hphys so remove it.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221002172956.265735-7-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

726ea335

target/i386: Add MMU_PHYS_IDX and MMU_NESTED_IDX · 98281984

Richard Henderson authored 2 years ago


These new mmu indexes will be helpful for improving
paging and code throughout the target.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221002172956.265735-6-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

98281984

target/i386: Reorg GET_HPHYS · 9bbcf372

Richard Henderson authored 2 years ago


Replace with PTE_HPHYS for the page table walk, and a direct call
to mmu_translate for the final stage2 translation.  Hoist the check
for HF2_NPT_MASK out to get_physical_address, which avoids the
recursive call when stage2 is disabled.

We can now return all the way out to x86_cpu_tlb_fill before raising
an exception, which means probe works.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221002172956.265735-5-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

9bbcf372

target/i386: Introduce structures for mmu_translate · 3563362d

Richard Henderson authored 2 years ago

Create TranslateParams for inputs, TranslateResults for successful
outputs, and TranslateFault for error outputs; return true on success.

Move stage1 error paths from handle_mmu_fault to x86_cpu_tlb_fill;
reorg the rest of handle_mmu_fault into get_physical_address.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20221002172956.265735-4-richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

3563362d