As extending from or truncating to mask vector do not use the same instructions as the normal cast, this path changed it to 2 which is the number of instructions we used.
Differential Revision: https://reviews.llvm.org/D131552
The cost of convert from or to mask vector is different from other cases. We could not use PowDiff to calculate it. This patch set it to 3 as we use 3 instruction to make it.
Differential Revision: https://reviews.llvm.org/D131149
In RVV, we use vwredsum.vs and vwredsumu.vs for vecreduce.add(ext(Ty A)) if the result type's width is twice of the input vector's SEW-width. In this situation, the cost of extended add reduction should be same as single-width add reduction. So as the vector float widenning reduction.
Differential Revision: https://reviews.llvm.org/D129994
A mul by a negated power of 2 is a slli followed by neg. This doesn't
require any constant materialization and may be lower latency than mul.
The neg may also be foldable into other arithmetic.
Reviewed By: reames
Differential Revision: https://reviews.llvm.org/D130047
This extends the existing cost model for reductions for scalable vectors.
The existing cost model assumes that reductions are roughly logarithmic in cost for unordered variants and linear for ordered ones. This change keeps that same basic model, and extends it out to the maximum number of elements a scalable vector could possibly have.
This results in costs which aren't terribly high for unordered reductions, but are for ordered ones. This seems about right; we want to strongly bias away from using scalable ordered reductions if the cost might be linear in VL.
Differential Revision: https://reviews.llvm.org/D127447
LoopVectorizer uses getVScaleForTuning for deciding how to discount the cost of a potential vector factor by the amount of work performed. Without the callback implemented, the vectorizer was defaulting to an estimated vscale of 1. This results in fixed vectorization looking falsely profitable (since it used the command line VLEN).
The test change is pretty limited since a) we don't have much coverage of the vectorizer with scalable vectors at all, and b) what little coverage we have mostly uses i64 element types. There's a separate issue with <vscale x 1 x i64> which prevents us from getting to this stage of costing, and thus only the one test explicitly written to avoid that is visible in the diff. However, this is actually a very wide impact change as it changes the practical vectorization result when both fixed and scalable is enabled to scalable.
As an aside, I think the vectorizer is at little too strongly biased towards scalable when both are legal, but we can explore that separately. For now, let's just get the cost model working the way it was intended.
Differential Revision: https://reviews.llvm.org/D128547
The comments in the existing code appear to pre-exist the standardization of the +v extension. In particular, the specification *does* provide a bound on the maximum value VLEN can take. From what I can tell, the LMUL comment was simply a misunderstanding of what this API returns.
This API returns the maximum value that vscale can take at runtime. This is used in the vectorizer to bound the largest scalable VF (e.g. LMUL in RISCV terms) which can be used without violating memory dependence.
Differential Revision: https://reviews.llvm.org/D128538
This doesn't change behavior, it just makes it slightly more obvious what's
going on. Note that getRealMinVLen is always >= getMinRVVVectorSizeInBits.
The first case is a bit tricky, as you have to know that
getMinRVVVectorSizeInBits returns 0 when not set, and thus is equivalent
to the else value clause. The new code structure makes it more obvious we
return 0 unless using RVV for fixed length vectors.
This was a bug introduced in d764aa. A pointer type is not a primitive type, and thus we were ending up dividing by zero when computing VLMax.
Differential Revision: https://reviews.llvm.org/D128219
The costing we use for fixed length vector gather and scatter is to simply count up the memory ops, and multiply by a fixed memory op cost. For scalable vectors, we don't actually know how many lanes are active. Instead, we have to end up making a worst case assumption on how many lanes could be active. In the generic +V case, this results in very high costs, but we can do better when we know an upper bound on the VLEN.
There's some obvious ways to improve this - e.g. using information about VL and mask bits from the instruction to reduce the upper bound - but this seems like a reasonable starting point.
The resulting costs do bias us pretty strongly away from generating scatter/gather for generic +V. Without this, we'd be returning an invalid cost and thus definitely not vectorizing, so no major change in practical behavior expected.
Differential Revision: https://reviews.llvm.org/D127541
Our actual lowering for i1 reductions uses ctpop combined with possibly a vector negate and possibly a logic op afterwards. I believe ctpop to be low cost on all reasonable hardware.
The default costing implementation here was returning quite inconsistent costs. and/or were returning very high costs (because we seem to think moving into scalar registers is very expensive?) and others were returning lower but still too high (because of the assumed tree reduce strategy). While we should probably improve the generic costing strategy for i1 vectors, let's start by fixing the immediate problem.
Differential Revision: https://reviews.llvm.org/D127511
Previously, `getRegUsageForType` was implemented using
`getTypeLegalizationCost`. `getRegUsageForType` is used by the loop
vectorizer to estimate the register pressure caused by using a vector
type. However, `getTypeLegalizationCost` currently only appears to
understand splitting and not scalarization, so significantly
underestimates the register requirements.
Instead, use `getNumRegisters`, which understands when scalarization
can occur (via computeRegisterProperties).
This was discovered while investigating D118979 (Set maximum VF with
shouldMaximizeVectorBandwidth), where under fixed-length 512-bit SVE the
loop vectorizer previously ends up costing an v128i1 as 2 v64i*
registers where it actually occupies 128 i32 registers.
I'm sending this patch early for comment, I'm still doing some sanity checking
with LNT. I note that getRegisterClassForType appears to return VectorRC even
though the type in question (large vNi1 types) end up occupying scalar
registers. That might be worth fixing too.
Differential Revision: https://reviews.llvm.org/D125918
In RISCVTargetTransformInfo, enumerating the processor family is not a good way to predict.
Because it needs to enumerate many subtarget family and is hard to update if add new subtarget.
Instead, create a feature to distinguish whether targets want to use default unroll preference or not.
Keep TuneSiFive7 because it's flag to indicate subtarget family, which may used in other place.
Differential Revision: https://reviews.llvm.org/D125741
Add cost model for broadcast shuffle in RISCVTTIImpl::getShuffleCost
with scalable vector. The cost model might not the best.
For scalable vector, BasicTTIImpl::getShuffleCost return invalid cost,
so this patch relies on the existing cost model in BasicTTIImpl.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D124101
This was added before Zve extensions were defined. I think users
should use Zve32x or Zve32f now. Though we will lose support for limiting
ELEN to 16 or 8, but I hope no one was using that.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D123418
Scalable vectors llvm.experimental.stepvector intrinsic
will crash due to an invalid cost when run the code through the loopunroll.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D122782
To perform the cost model of vector casting, the patch consider most vector
casts as their scalar form and consider those vector form of free scalr castings
as 1.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D121771
The patch adds very basic cost model for masked memory op on scalable vector.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117884
The patch adds very basic cost model for masked memory op on scalable vector.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D117884
While testing scalable vectors I found that if we generate a
vector splice intrinsic and run the code through the loop unroller,
we'll crash due to an invalid cost.
This adds a basic cost based on the 2 slide instructions used by the
lowering in D119303.
We probably need to factor LMUL into this, but that's true for
arithmetic instructions too. So I've ignored for the moment.
Reviewed By: ABataev
Differential Revision: https://reviews.llvm.org/D119316
Those two TTI hooks are used during vectorization for calculating
register pressure, the default implementation isn't consider for LMUL,
and that's also definitly wrong value for register number (all register class
are 8 registers).
So in this patch we tried to:
1. Calculate right register usage for vector type and scalar type.
2. Return right number of register for general purpose register and
vector register.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D116890
Use it to remove explicit string compares from unrolling preferences.
I'm of two minds on this. Ideally, we would define things in terms
of architectural or microarchitectural features, but it's hard to
do that with things like unrolling preferences without just ending up
with FeatureSiFive7UnrollingPreferences.
Having a proc enum is consistent with ARM and AArch64. X86 only has
a few and is trying to move away from it.
Reviewed By: asb, mcberg2021
Differential Revision: https://reviews.llvm.org/D117060
By default we return the width of an LMUL=1 register. We can enable
testing with larger LMUL values by returning a larger bit width.
This patch adds a RISCV specific option to provide a LMUL which will be
multiplied by the LMUL=1 bit width.
Reviewed By: kito-cheng
Differential Revision: https://reviews.llvm.org/D116339
Both these preference helper functions have initial support with
this change. The loop unrolling preferences are set with initial
settings to control thresholds, size and attributes of loops to
unroll with some tuning done. The peeling preferences may need
some tuning as well as the initial support looks much like what
other architectures utilize.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D113798
Both these preference helper functions have initial support with
this change. The loop unrolling preferences are set with initial
settings to control thresholds, size and attributes of loops to
unroll with some tuning done. The peeling preferences may need
some tuning as well as the initial support looks much like what
other architectures utilize.
Reviewed By: craig.topper
Differential Revision: https://reviews.llvm.org/D113798
Add new hasVInstructions() which is currently equivalent.
Replace vector uses of hasStdExtZfh/F/D with new vector specific
versions. The vector spec no longer requires that the vectors implement the
same types as scalar. It only requires that the scalar type is
the maximum size the vectors can support. This is currently
implemented using the scalar rule we were using before.
Add new hasVInstructionsI64() begin using to qualify code that
requires i64 vector elements.
This is all NFC for now, but we can start using this to better
implement D112408 which introduces the Zve extensions.
Reviewed By: frasercrmck, eopXD
Differential Revision: https://reviews.llvm.org/D112496
If we have these instructions, we don't need to hoist the immediate
for an AND that would match them.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D107783
If the upper 32 bits are zero and bit 31 is set, we might be able to
use zext.w to fill in the zeros after using an lui and/or addi.
Most of this patch is plumbing the subtarget features into the constant
materialization.
Reviewed By: luismarques
Differential Revision: https://reviews.llvm.org/D105509
getMinRVVVectorSizeInBits() asserts if the V extension isn't
enabled. So check that gather/scatter is legal first since it
already contains a check for V extension being enabled. It
also already checks getMinRVVVectorSizeInBits for fixed length
vectors so we don't need a check in getGatherScatterOpCost.
This will tell loop idiom recognize that it can make popcount loops countable
using the ctpop intrinsic. I didn't bother checking for illegal types.
Type legalization knows how to split a ctpop into multiple ctops added together.
Assuming we only receive reasonable integer bit widths, a few cpop instructions
added together is probably better than the loop.
Reviewed By: frasercrmck
Differential Revision: https://reviews.llvm.org/D99203