- Apr 10, 2018
-
-
Richard Henderson authored
The parameters for tcg_gen_insn_start are target_ulong, which may be split into two TCGArg parameters for storage in the opcode on 32-bit hosts. Fixes the ARM target and its direct use of tcg_set_insn_param, which would set the wrong argument in the 64-on-32 case. Cc: qemu-stable@nongnu.org Reported-by:
<alarson@ddci.com> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org> Message-id: 20180410003558.2470-1-richard.henderson@linaro.org Reviewed-by:
Peter Maydell <peter.maydell@linaro.org> Signed-off-by:
Peter Maydell <peter.maydell@linaro.org>
-
- Mar 28, 2018
-
-
Richard Henderson authored
Failure to do so results in the tcg optimizer sign-extending any constant fold from 32-bits. This turns out to be visible in the RISC-V testsuite using a host that emits these opcodes (e.g. any non-x86_64). Reported-by:
Michael Clark <mjc@sifive.com> Reviewed-by:
Emilio G. Cota <cota@braap.org> Reviewed-by:
Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
- Mar 15, 2018
-
-
Richard Henderson authored
This unifies 5 copies of checks for supported vector size, and in the process fixes a missing check in tcg_gen_gvec_2s. This lead to an assertion failure for 64-bit vector multiply, which is not available in the AVX instruction set. Suggested-by:
Peter Maydell <peter.maydell@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Unknown why -m32 was passing with gcc but not clang; it should have failed for both. This would be used for tcg_gen_dup_i64_vec, and visible with the right TB and an aarch64 guest. Reported-by:
Max Reitz <mreitz@redhat.com> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Convert multiplication by power of two to left shift. Reviewed-by:
Emilio G. Cota <cota@braap.org> Reviewed-by:
Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
- Feb 08, 2018
-
-
Richard Henderson authored
Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
The x86 vector instruction set is extremely irregular. With newer editions, Intel has filled in some of the blanks. However, we don't get many 64-bit operations until SSE4.2, introduced in 2009. The subsequent edition was for AVX1, introduced in 2011, which added three-operand addressing, and adjusts how all instructions should be encoded. Given the relatively narrow 2 year window between possible to support and desirable to support, and to vastly simplify code maintainence, I am only planning to support AVX1 and later cpus. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Trivial move and constant propagation. Some identity and constant function folding, but nothing that requires knowledge of the size of the vector element. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Use dup to convert a non-constant scalar to a third vector. Add addition, multiplication, and logical operations with an immediate. Add addition, subtraction, multiplication, and logical operations with a non-constant scalar. Allow for the front-end to build operations in which the scalar operand comes first. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
No vector ops as yet. SSE only has direct support for 8- and 16-bit saturation; handling 32- and 64-bit saturation is much more expensive. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Opcodes are added for scalar and vector shifts, but considering the varied semantics of these do not expose them to the front ends. Do go ahead and provide them in case they are needed for backend expansion. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Some functions use intN_t arguments, some use uintN_t, some just used "unsigned". To aid putting function pointers in tables, we need consistency. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Reviewed-by:
Peter Maydell <peter.maydell@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Nothing uses or enables them yet. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
This will be required for storing vector constants. Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Reviewed-by:
Peter Maydell <peter.maydell@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
- Jan 16, 2018
-
-
Richard Henderson authored
We recently relaxed the limit of the number of opcodes that can appear in a TranslationBlock. In certain cases this has resulted in relocation overflow. Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
AArch64 with SVE has an offset of 80k to the 8th TLB. Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
AArch64 with SVE has an offset of 80k to the 8th TLB. Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
The code sequence we were generating was only good for unsigned comparisons. For signed comparisions, use the sequence from gcc. Fixes booting of ppc64 firmware, with a patch changing the code sequence for ppc comparisons. Tested-by:
Michael Roth <mdroth@linux.vnet.ibm.com> Reviewed-by:
Peter Maydell <peter.maydell@linaro.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
- Dec 29, 2017
-
-
Richard Henderson authored
We already handle this in the backends, and the lifetime datum for the TCGOp is already large enough. Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Complimenting the existing tcg_unsigned_cond. Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
We had two fields specific to INDEX_op_call. Rename these and add some macros so that the fields may be reused for other opcodes. Reviewed-by:
Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
With no fixed array allocation, we can't overflow a buffer. This will be important as optimizations related to host vectors may expand the number of ops used. Use QTAILQ to link the ops together. Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
These are now trivial sets and tests against NULL. Unwrap. Reviewed-by:
Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
- Nov 03, 2017
-
-
Richard Henderson authored
Rather than have separate code only used for guest_base, rely on a recent change to handle constant pool entries. Cc: qemu-s390x@nongnu.org Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Both ARMv6 and AArch64 currently may drop complex guest_base values into the constant pool. But generic code wasn't expecting that, and the pool is not emitted. Correct that. Tested-by:
Emilio G. Cota <cota@braap.org> Tested-by:
Laurent Desnogues <laurent.desnogues@gmail.com> Reported-by:
Laurent Desnogues <laurent.desnogues@gmail.com> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
- Oct 24, 2017
-
-
Richard Henderson authored
This is identical for each target. So, move the initialization to common code. Move the variable itself out of tcg_ctx and name it cpu_env to minimize changes within targets. This also means we can remove tcg_global_reg_new_{ptr,i32,i64}, since there are no longer global-register temps created by targets. Reviewed-by:
Emilio G. Cota <cota@braap.org> Reviewed-by:
Philippe Mathieu-Daudé <f4bug@amsat.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
This enables parallel TCG code generation. However, we do not take advantage of it yet since tb_lock is still held during tb_gen_code. In user-mode we use a single TCG context; see the documentation added to tcg_region_init for the rationale. Note that targets do not need any conversion: targets initialize a TCGContext (e.g. defining TCG globals), and after this initialization has finished, the context is cloned by the vCPU threads, each of them keeping a separate copy. TCG threads claim one entry in tcg_ctxs[] by atomically increasing n_tcg_ctxs. Do not be too annoyed by the subsequent atomic_read's of that variable and tcg_ctxs; they are there just to play nice with analysis tools such as thread sanitizer. Note that we do not allocate an array of contexts (we allocate an array of pointers instead) because when tcg_context_init is called, we do not know yet how many contexts we'll use since the bool behind qemu_tcg_mttcg_enabled() isn't set yet. Previous patches folded some TCG globals into TCGContext. The non-const globals remaining are only set at init time, i.e. before the TCG threads are spawned. Here is a list of these set-at-init-time globals under tcg/: Only written by tcg_context_init: - indirect_reg_alloc_order - tcg_op_defs Only written by tcg_target_init (called from tcg_context_init): - tcg_target_available_regs - tcg_target_call_clobber_regs - arm: arm_arch, use_idiv_instructions - i386: have_cmov, have_bmi1, have_bmi2, have_lzcnt, have_movbe, have_popcnt - mips: use_movnz_instructions, use_mips32_instructions, use_mips32r2_instructions, got_sigill (tcg_target_detect_isa) - ppc: have_isa_2_06, have_isa_3_00, tb_ret_addr - s390: tb_ret_addr, s390_facilities - sparc: qemu_ld_trampoline, qemu_st_trampoline (build_trampolines), use_vis3_instructions Only written by tcg_prologue_init: - 'struct jit_code_entry one_entry' - aarch64: tb_ret_addr - arm: tb_ret_addr - i386: tb_ret_addr, guest_base_flags - ia64: tb_ret_addr - mips: tb_ret_addr, bswap32_addr, bswap32u_addr, bswap64_addr Reviewed-by:
Richard Henderson <rth@twiddle.net> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
This is groundwork for supporting multiple TCG contexts. The naive solution here is to split code_gen_buffer statically among the TCG threads; this however results in poor utilization if translation needs are different across TCG threads. What we do here is to add an extra layer of indirection, assigning regions that act just like pages do in virtual memory allocation. (BTW if you are wondering about the chosen naming, I did not want to use blocks or pages because those are already heavily used in QEMU). We use a global lock to serialize allocations as well as statistics reporting (we now export the size of the used code_gen_buffer with tcg_code_size()). Note that for the allocator we could just use a counter and atomic_inc; however, that would complicate the gathering of tcg_code_size()-like stats. So given that the region operations are not a fast path, a lock seems the most reasonable choice. The effectiveness of this approach is clear after seeing some numbers. I used the bootup+shutdown of debian-arm with '-tb-size 80' as a benchmark. Note that I'm evaluating this after enabling per-thread TCG (which is done by a subsequent commit). * -smp 1, 1 region (entire buffer): qemu: flush code_size=83885014 nb_tbs=154739 avg_tb_size=357 qemu: flush code_size=83884902 nb_tbs=153136 avg_tb_size=363 qemu: flush code_size=83885014 nb_tbs=152777 avg_tb_size=364 qemu: flush code_size=83884950 nb_tbs=150057 avg_tb_size=373 qemu: flush code_size=83884998 nb_tbs=150234 avg_tb_size=373 qemu: flush code_size=83885014 nb_tbs=154009 avg_tb_size=360 qemu: flush code_size=83885014 nb_tbs=151007 avg_tb_size=370 qemu: flush code_size=83885014 nb_tbs=151816 avg_tb_size=367 That is, 8 flushes. * -smp 8, 32 regions (80/32 MB per region) [i.e. this patch]: qemu: flush code_size=76328008 nb_tbs=141040 avg_tb_size=356 qemu: flush code_size=75366534 nb_tbs=138000 avg_tb_size=361 qemu: flush code_size=76864546 nb_tbs=140653 avg_tb_size=361 qemu: flush code_size=76309084 nb_tbs=135945 avg_tb_size=375 qemu: flush code_size=74581856 nb_tbs=132909 avg_tb_size=375 qemu: flush code_size=73927256 nb_tbs=135616 avg_tb_size=360 qemu: flush code_size=78629426 nb_tbs=142896 avg_tb_size=365 qemu: flush code_size=76667052 nb_tbs=138508 avg_tb_size=368 Again, 8 flushes. Note how buffer utilization is not 100%, but it is close. Smaller region sizes would yield higher utilization, but we want region allocation to be rare (it acquires a lock), so we do not want to go too small. * -smp 8, static partitioning of 8 regions (10 MB per region): qemu: flush code_size=21936504 nb_tbs=40570 avg_tb_size=354 qemu: flush code_size=11472174 nb_tbs=20633 avg_tb_size=370 qemu: flush code_size=11603976 nb_tbs=21059 avg_tb_size=365 qemu: flush code_size=23254872 nb_tbs=41243 avg_tb_size=377 qemu: flush code_size=28289496 nb_tbs=52057 avg_tb_size=358 qemu: flush code_size=43605160 nb_tbs=78896 avg_tb_size=367 qemu: flush code_size=45166552 nb_tbs=82158 avg_tb_size=364 qemu: flush code_size=63289640 nb_tbs=116494 avg_tb_size=358 qemu: flush code_size=51389960 nb_tbs=93937 avg_tb_size=362 qemu: flush code_size=59665928 nb_tbs=107063 avg_tb_size=372 qemu: flush code_size=38380824 nb_tbs=68597 avg_tb_size=374 qemu: flush code_size=44884568 nb_tbs=79901 avg_tb_size=376 qemu: flush code_size=50782632 nb_tbs=90681 avg_tb_size=374 qemu: flush code_size=39848888 nb_tbs=71433 avg_tb_size=372 qemu: flush code_size=64708840 nb_tbs=119052 avg_tb_size=359 qemu: flush code_size=49830008 nb_tbs=90992 avg_tb_size=362 qemu: flush code_size=68372408 nb_tbs=123442 avg_tb_size=368 qemu: flush code_size=33555560 nb_tbs=59514 avg_tb_size=378 qemu: flush code_size=44748344 nb_tbs=80974 avg_tb_size=367 qemu: flush code_size=37104248 nb_tbs=67609 avg_tb_size=364 That is, 20 flushes. Note how a static partitioning approach uses the code buffer poorly, leading to many unnecessary flushes. Reviewed-by:
Richard Henderson <richard.henderson@linaro.org> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
Groundwork for supporting multiple TCG contexts. While at it, also allocate temps_used directly as a bitmap of the required size, instead of using a bitmap of TCG_MAX_TEMPS via TCGTempSet. Performance-wise we lose about 1.12% in a translation-heavy workload such as booting+shutting down debian-arm: Performance counter stats for 'taskset -c 0 arm-softmmu/qemu-system-arm \ -machine type=virt -nographic -smp 1 -m 4096 \ -netdev user,id=unet,hostfwd=tcp::2222-:22 \ -device virtio-net-device,netdev=unet \ -drive file=die-on-boot.qcow2,id=myblock,index=0,if=none \ -device virtio-blk-device,drive=myblock \ -kernel kernel.img -append console=ttyAMA0 root=/dev/vda1 \ -name arm,debug-threads=on -smp 1' (10 runs): exec time (s) Relative slowdown wrt original (%) --------------------------------------------------------------- original 20.213321616 0. tcg_malloc 20.441130078 1.1270214 TCGContext 20.477846517 1.3086662 g_malloc 20.780527895 2.8061013 The other two alternatives shown in the table are: - TCGContext: embed temps[TCG_MAX_TEMPS] and TCGTempSet used_temps in TCGContext. This is simple enough but it isn't faster than using tcg_malloc; moreover, it wastes memory. - g_malloc: allocate/deallocate both temps and used_temps every time tcg_optimize is executed. Suggested-by:
Richard Henderson <rth@twiddle.net> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
This is groundwork for supporting multiple TCG contexts. To avoid scalability issues when profiling info is enabled, this patch makes the profiling info counters distributed via the following changes: 1) Consolidate profile info into its own struct, TCGProfile, which TCGContext also includes. Note that tcg_table_op_count is brought into TCGProfile after dropping the tcg_ prefix. 2) Iterate over the TCG contexts in the system to obtain the total counts. This change also requires updating the accessors to TCGProfile fields to use atomic_read/set whenever there may be conflicting accesses (as defined in C11) to them. Reviewed-by:
Richard Henderson <rth@twiddle.net> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
Groundwork for supporting multiple TCG contexts. Note that having n_tcg_ctxs is unnecessary. However, it is convenient to have it, since it will simplify iterating over the array: we'll have just a for loop instead of having to iterate over a NULL-terminated array (which would require n+1 elems) or having to check with ifdef's for usermode/softmmu. Reviewed-by:
Richard Henderson <rth@twiddle.net> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
Groundwork for supporting multiple TCG contexts. Reviewed-by:
Richard Henderson <rth@twiddle.net> Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
Groundwork for supporting multiple TCG contexts. The core of this patch is this change to tcg/tcg.h: > -extern TCGContext tcg_ctx; > +extern TCGContext tcg_init_ctx; > +extern TCGContext *tcg_ctx; Note that for now we set *tcg_ctx to whatever TCGContext is passed to tcg_context_init -- in this case &tcg_init_ctx. Reviewed-by:
Richard Henderson <rth@twiddle.net> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
Groundwork for supporting multiple TCG contexts. Reviewed-by:
Richard Henderson <rth@twiddle.net> Reviewed-by:
Alex Bennée <alex.bennee@linaro.org> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
Thereby decoupling the resulting translated code from the current state of the system. The tb->cflags field is not passed to tcg generation functions. So we add a field to TCGContext, storing there a copy of tb->cflags. Most architectures have <= 32 registers, which results in a 4-byte hole in TCGContext. Use this hole for the new field. Reviewed-by:
Richard Henderson <rth@twiddle.net> Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Emilio G. Cota authored
This will enable us to decouple code translation from the value of parallel_cpus at any given time. It will also help us minimize TB flushes when generating code via EXCP_ATOMIC. Note that the declaration of parallel_cpus is brought to exec-all.h to be able to define there the "curr_cflags" inline. Signed-off-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-
Richard Henderson authored
Using the offset of a temporary, relative to TCGContext, rather than its index means that we don't use 0. That leaves offset 0 free for a NULL representation without having to leave index 0 unused. Reviewed-by:
Philippe Mathieu-Daudé <f4bug@amsat.org> Reviewed-by:
Emilio G. Cota <cota@braap.org> Signed-off-by:
Richard Henderson <richard.henderson@linaro.org>
-