This patch makes three changes:
1. Split the Load/Store resources into two, Ld0St and Ld1,
since only one of them is capable of stores.
2. Integer ADD and SUB instructions have different latencies
and processor resource usage (pipeline) when they have a shift of
zero vs. non-zero; refer to D8043.
3. Update the throughput of the scalar DIV instruction.
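A minimal Python sketch of why the split in item 1 matters for throughput. It assumes each unit accepts one memory operation per cycle; the function name and that assumption are illustrative and are not taken from the scheduling model itself:

```python
def min_memory_cycles(num_loads, num_stores):
    """Lower bound on cycles to issue the given loads and stores.

    Assumption for illustration: two units, Ld0St (loads and stores)
    and Ld1 (loads only), each accepting one operation per cycle.
    """
    total = num_loads + num_stores
    both_units_bound = -(-total // 2)   # ceil(total / 2) across two units
    store_bound = num_stores            # stores can only go to Ld0St
    return max(both_units_bound, store_bound)


print(min_memory_cycles(4, 0))  # loads can use both units
print(min_memory_cycles(0, 4))  # stores are serialized on Ld0St
```

With a single two-unit load/store resource, four stores would appear to need only two cycles; with the split they need four, which is the behavior the change is meant to capture.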
Reviewed By: dmgreen, bryanpkc
Differential Revision: https://reviews.llvm.org/D132529
Parts of the schedule model are not accurate, so an initial test is needed to record the changes.
The assembly list is based on the basic part of D128631.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D132103
1. Missing instruction information (FTSSEL, FMSB, PFIRST and RDFFR)
is added and CompleteModel is set to one.
2. Information for pseudo SVE instructions is added. Those
instructions are present at the time of scheduling.
3. Resource and latency information for SVE instructions is modified
to be more accurate.
For example, the description for CMPEQ, which consumes one cycle
each of unit FLA and PPR, is as follows.
```
Previous:
def A64FXGI01 : ProcResGroup<[A64FXIPFLA, A64FXIPPR]>;
def A64FXWrite_4Cyc_GI01 : SchedWriteRes<[A64FXGI01]> {...
Modified:
def A64FXGI0 : ProcResGroup<[A64FXIPFLA]>;
def A64FXGI1 : ProcResGroup<[A64FXIPPR]>;
def A64FXWrite_CMP : SchedWriteRes<[A64FXGI0, A64FXGI1]> {...
```
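The difference between the two descriptions can be sketched in Python. This is a simplified model under stated assumptions: a write that names a group takes any one free unit from that group, a write that names several groups takes one unit from each, and units are picked greedily. The function name is hypothetical, and latencies and resource cycles are ignored:

```python
def max_issues_per_cycle(units, write_groups):
    """Count how many copies of one write fit into a single cycle.

    units: names of the available units (each usable once per cycle).
    write_groups: list of groups; the write needs one unit from each group.
    Greedy unit selection is an assumption made for this sketch.
    """
    if not write_groups:
        raise ValueError("a write must name at least one group")
    free = set(units)
    issued = 0
    while True:
        chosen = []
        for group in write_groups:
            candidates = [u for u in group if u in free and u not in chosen]
            if not candidates:
                return issued
            chosen.append(candidates[0])
        free.difference_update(chosen)
        issued += 1


units = ["FLA", "PPR"]
previous = [["FLA", "PPR"]]    # one group: either FLA or PPR
modified = [["FLA"], ["PPR"]]  # two groups: FLA and PPR together

print(max_issues_per_cycle(units, previous))
print(max_issues_per_cycle(units, modified))
```

Under these assumptions the previous description lets two CMPEQ instructions issue in the same cycle, while the modified one allows only one, matching the statement that CMPEQ consumes one cycle each of FLA and PPR.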
Reference: A64FX Microarchitecture Manual (Table 16-3)
https://github.com/fujitsu/A64FX/blob/master/doc/A64FX_Microarchitecture_Manual_en_1.7.pdf
Reviewed By: dmgreen, kawashima-fj
Differential Revision: https://reviews.llvm.org/D131165
The Cortex-A55 scheduling model is used for -mcpu=generic, meaning it
can have a wider effect than just the A55. The changes to the A55
scheduling model seem to have caused performance regressions on
Cortex-A510 devices, which have latencies closer to the original and
different forwarding paths.
This partially reverts the changes from D117003, at least until we can
do something to improve Cortex-A510. According to my results, this
improves the A510 results without altering the A55 very much.
memory-barrier instructions to provide targets and developers with a convenient
way to explicitly declare which instructions are memory-barriers.
Differential Revision: https://reviews.llvm.org/D116779
This patch fixes the scheduling of FP load instructions with pre/post increment by adding WriteAdr for the address operand.
Reviewed By: dmgreen
Differential Revision: https://reviews.llvm.org/D116361
OS Laboratory. Huawei Russian Research Institute. Saint-Petersburg
Cortex-A55 has 2 64bit NEON vector units, meaning a 128bit instruction
requires taking both units (and can only be issued as the first
instruction in a dual issue pair). This patch models that by splitting
the WriteV SchedWrite into two - the WriteVd that reads/writes only
64bit operands, and the WriteVq that reads/writes 128bit registers. The
A55 schedule then uses this distinction to model the WriteVq as taking
both resource units and starting a Schedule Group, and WriteVd as taking
one as before.
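The issue rule described above can be sketched as a small in-order simulation in Python. It assumes two vector units, that a 128bit ("q") operation needs both units and must be the first operation in its issue group, and that a 64bit ("d") operation needs one unit. The function name and the "d"/"q" encoding are illustrative only:

```python
def count_issue_cycles(ops, num_units=2):
    """Count cycles needed to issue ops in order.

    ops: sequence of "d" (64bit, one unit) or "q" (128bit, all units,
    must start an issue group). Latencies and dependencies are ignored.
    """
    cycles = 0
    i = 0
    while i < len(ops):
        cycles += 1
        free = num_units
        first_in_group = True
        while i < len(ops):
            is_q = ops[i] == "q"
            need = num_units if is_q else 1
            if need > free or (is_q and not first_in_group):
                break  # this op has to wait for the next cycle
            free -= need
            first_in_group = False
            i += 1
    return cycles


print(count_issue_cycles(["d", "d"]))       # two 64bit ops dual-issue
print(count_issue_cycles(["q", "q"]))       # each 128bit op takes a full cycle
print(count_issue_cycles(["d", "q", "d"]))  # the q op cannot pair with either d
```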
I believe this is more correct, even if it does not lead to much better
performance.
Differential Revision: https://reviews.llvm.org/D108766
It appears that the Read operand for stores was being placed on the
first operand (the stored value) not the address base. This adds a
ReadST for the stored value operand, allowing the ReadAdrBase to
correctly act upon the address.
Differential Revision: https://reviews.llvm.org/D108287
Load/Store unit is used to enforce order of loads and stores if they
alias (controlled by --noalias=false option).
Fixes PR50483 - [MCA] In-order pipeline doesn't track memory
load/store dependencies.
Differential Revision: https://reviews.llvm.org/D103955
Specifying the latencies of specific LDP variants appears to improve
performance almost universally.
Differential Revision: https://reviews.llvm.org/D105882
This sets the latency of stores to 1 in the Cortex-A55 scheduling model,
to better match the values given in the software optimization guide.
The latency of a store in normal llvm scheduling does not appear to have
a lot of uses. If the store has no outputs then the latency is somewhat
meaningless (and pre/post increment update operands use the WriteAdr
write for those operands instead). The one place it does alter things is
the latency between a store and the end of the scheduling region, which
can in turn have an effect on the critical path length. As a result a
latency of 1 is more correct and offers ever-so-slightly better
scheduling of instructions near the end of the block.
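As a rough illustration of the critical-path effect, the following Python sketch computes the longest latency-weighted path through a small dependency chain. The instruction names and latency values are hypothetical and are not taken from the Cortex-A55 model:

```python
def critical_path_length(latency, deps):
    """Longest latency-weighted path to the end of the region.

    latency: dict of node -> latency, listed in topological order.
    deps: dict of node -> list of predecessor nodes.
    """
    finish = {}
    for node, lat in latency.items():
        start = max((finish[p] for p in deps.get(node, [])), default=0)
        finish[node] = start + lat
    return max(finish.values(), default=0)


deps = {"add": ["ldr"], "str": ["add"]}

# Hypothetical latencies: only the store latency differs between the two runs.
print(critical_path_length({"ldr": 3, "add": 1, "str": 4}, deps))
print(critical_path_length({"ldr": 3, "add": 1, "str": 1}, deps))
```

The store has no consumers in the region, so its latency only matters through the distance to the end of the region, which is what shortens the computed critical path in the second call.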
They are marked as RetireOOO to keep llvm-mca from introducing
stalls where none would exist.
Differential Revision: https://reviews.llvm.org/D105541
Conservatively use the instruction latency to compute the last write-back cycle.
Before this patch, the last write cycle computation was incorrect for store
instructions that didn't declare any register writes.
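One way to read this, sketched in Python under the assumption that the last write-back cycle is the issue cycle plus the larger of the instruction latency and any declared register-write latency. The function and parameter names are hypothetical and do not correspond to llvm-mca's internal API:

```python
def last_write_back_cycle(issue_cycle, inst_latency, write_latencies):
    """Conservative last write-back cycle for one instruction.

    write_latencies may be empty, for example for a store that declares
    no register writes; the instruction latency is then used on its own.
    """
    return issue_cycle + max([inst_latency, *write_latencies])


# A store with no declared register writes still gets a non-trivial cycle.
print(last_write_back_cycle(issue_cycle=10, inst_latency=1, write_latencies=[]))
# A load-like instruction with one declared write (values are made up).
print(last_write_back_cycle(issue_cycle=10, inst_latency=3, write_latencies=[3]))
```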
The tests compare IPC statistics that MCA provides with IPC values
measured on Cortex-A55 hardware. For hardware tests, each snippet is
run in a loop unrolled by 1000, and IPC is measured by linux-perf.
Several tests do not match the hardware: the skewed ALU is not
supported, and LDR seems to be missing a forwarding path.
Differential Revision: https://reviews.llvm.org/D98174
Instructions that have more uops than the processor's IssueWidth are
issued in multiple cycles.
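A simplified way to picture this in Python, assuming the uops of such an instruction are spread over consecutive cycles at IssueWidth uops per cycle. The function name is hypothetical, and the actual pipeline implementation may account for additional constraints:

```python
def cycles_to_issue(num_uops, issue_width):
    """Number of cycles needed to issue one instruction's uops."""
    if issue_width <= 0:
        raise ValueError("issue_width must be positive")
    # Ceiling division; every instruction takes at least one cycle.
    return max(1, -(-num_uops // issue_width))


print(cycles_to_issue(2, 2))  # fits in a single cycle
print(cycles_to_issue(5, 2))  # needs three cycles on a dual-issue core
```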
The patch fixes PR49712.
Differential Revision: https://reviews.llvm.org/D99339
This is a follow-up for:
D98604 [MCA] Ensure that writes occur in-order
When instructions are aligned by the order of writes, they retire
in-order naturally. There is no need for an RCU, so it is disabled.
Differential Revision: https://reviews.llvm.org/D98628
This patch adds a pipeline to support in-order CPUs such as ARM
Cortex-A55.
The in-order pipeline implements a simplified version of Dispatch,
Scheduler and Execute stages as a single stage. Entry and Retire
stages are common for both in-order and out-of-order pipelines.
Differential Revision: https://reviews.llvm.org/D94928
The old CPU model only had MLA->MLA forwarding. I added some missing
MUL->MLA read advances and a missing absolute diff accumulator read
advance according to the Cortex A57 Software Optimization Guide.
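The effect of a read advance on the latency seen by a consumer can be sketched in Python as follows. The latency and advance values are hypothetical and are not the ones from the Cortex-A57 model:

```python
def operand_latency(write_latency, read_advance=0):
    """Latency between a producer and a consumer operand.

    A positive read advance models a forwarding path: the consumer reads
    the operand that many cycles later in its own pipeline, so the
    effective latency shrinks, but never below zero.
    """
    return max(0, write_latency - read_advance)


# Hypothetical numbers: a multiply feeding the accumulator of an MLA.
print(operand_latency(write_latency=4))                  # no forwarding
print(operand_latency(write_latency=4, read_advance=3))  # with forwarding
```

Without the read advance, back-to-back MUL and MLA instructions would be scheduled as if the accumulator operand were needed at issue time, which overestimates the dependency chain.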
The patch improves performance in EEMBC rgbyiqv2 by about 6%-7% and
spec2006/milc by 8% (repeated runs on multiple devices), and causes no
significant regressions (none in SPEC).
Differential Revision: https://reviews.llvm.org/D92296
The model was committed in 4b8ade837e
but not yet enabled to allow for a few fix ups. This adds a few
of these fixes, and also an LLVM MCA test to check most instructions.
While I do have plans to look into some more tuning, it's time to
enable this as it is better than using the A53 schedule.
Differential Revision: https://reviews.llvm.org/D88017
Only the aliases 'xzr' and 'sp' exist for the physical register x31.
The reason for wanting to remove the alias 'x31' is because it allows users
to write invalid asm that is not accepted by the GNU assembler.
Is there any objection to removing this alias? Or do we want to keep
this for compatibility with existing code that uses w31/x31?
Differential Revision: https://reviews.llvm.org/D90153
This fixes a regression introduced by a very old commit 280ac1fd1d (was
llvm-svn 361950).
Commit 280ac1fd1d redesigned the logic in the LSUnit with the goal of
speeding up isReady() queries, and stabilising the LSUnit API (while also making
the load store unit more customisable).
The concept of MemoryGroup (effectively an alias set) was added by that commit
to better describe and track dependencies between memory operations. However,
that concept was not just used for alias dependencies, but it was also used for
describing memory "order" dependencies (enforced by the memory consistency
model).
Instructions of the same memory group were considered "equivalent" as in:
independent operations that can potentially execute in parallel. The problem
was that the cost of a dependency (in terms of number of cycles) should have
been different for "order" dependencies. Instructions in an order dependency
simply have to wait until their predecessors are "issued" to an
underlying pipeline (rather than having to wait until predecessors have been
fully executed). For simple "order" dependencies, this was effectively
introducing an artificial delay on the "issue" of independent loads and stores.
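The difference in cost between the two kinds of dependency can be sketched in Python. This is an illustration only: the function name, the "alias"/"order" labels, and the choice to let an order-dependent operation become ready in the same cycle its predecessor issues are all assumptions, not the LSUnit implementation:

```python
def earliest_ready_cycle(pred_issue_cycle, pred_latency, kind):
    """Earliest cycle at which a dependent memory operation may issue.

    kind == "alias": wait until the predecessor has fully executed.
    kind == "order": wait only until the predecessor has been issued.
    """
    if kind == "alias":
        return pred_issue_cycle + pred_latency
    if kind == "order":
        return pred_issue_cycle
    raise ValueError(f"unknown dependency kind: {kind}")


# Hypothetical predecessor: issued at cycle 5 with a latency of 4 cycles.
print(earliest_ready_cycle(5, 4, "alias"))
print(earliest_ready_cycle(5, 4, "order"))
```

Treating an order dependency like an alias dependency is what produced the artificial delay described above: the dependent operation would wait for the full latency even though it only needed its predecessor to have issued.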
This patch fixes the issue and adds a new test named 'independent-load-stores.s'
to a bunch of x86 targets. That test contains the reproducer posted by Fabian
Ritter on PR45793.
I had to rerun the update-mca-tests script on several files. To avoid expected
regressions on some Exynos tests, I have added a -noalias=false flag (to match
the old strict behavior on latencies).
Some tests for processor Barcelona are improved/fixed by this change and they
now show better results. In a few tests we were incorrectly counting the time
spent by instructions in a scheduler queue. In one case in particular we now
correctly see a store executed out of order. That test was affected by the same
underlying issue reported as PR45793.
Reviewers: mattd
Differential Revision: https://reviews.llvm.org/D79351
It makes more sense to print out the number of micro opcodes that are issued
every cycle rather than the number of instructions issued per cycle.
This behavior is also consistent with the dispatch-stats: numbers from the two
views can now be easily compared.
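The distinction between the two counts can be illustrated with a short Python sketch that builds a histogram of micro opcodes issued per cycle. The input format (one list of per-instruction uop counts for each cycle) and the function name are hypothetical and do not mirror llvm-mca's internal data structures:

```python
from collections import Counter


def uops_issued_histogram(cycles):
    """Map 'uops issued in a cycle' -> 'number of cycles with that count'.

    cycles: one entry per simulated cycle, each a list holding the uop
    count of every instruction issued in that cycle.
    """
    return Counter(sum(uops_per_inst) for uops_per_inst in cycles)


# Three cycles: two instructions (1 and 2 uops), nothing, one 3-uop instruction.
trace = [[1, 2], [], [3]]
print(dict(uops_issued_histogram(trace)))
```

Counting instructions instead would report 2, 0 and 1 for the same three cycles, even though the first and last cycle issue the same number of micro opcodes.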
llvm-svn: 357919