Update costs based on the worst case costs from the script in D103695.
Move to using legalized types wherever possible, which allows us to prune the cost tables.
Update (mainly) vXi8/vXi16 -> vXf32/vXf64 sitofp/uitofp costs based on the worst case costs from the script in D103695.
Move to using legalized types wherever possible, which allows us to prune the cost tables.
Provide a generic fallback that extends sub-i32 scalars before using the existing sitofp instructions.
These numbers can be tweaked for specific sse levels, but we should get the default handling in place first.
We get the extension for free for non-vector loads.
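A minimal sketch of the fallback described above, not the actual LLVM implementation: a sub-i32 scalar pays for an extension to i32 and then for the existing i32 conversion, unless the extension is folded into a scalar load. The function name, the `extendFoldedIntoLoad` flag, and both unit costs are hypothetical placeholders.

```cpp
// Hypothetical unit costs; the real values live in the X86 cost tables.
constexpr int kExtendCost = 1;   // assumed cost of a sext/zext to i32
constexpr int kI32ToFPCost = 1;  // assumed cost of the existing i32 sitofp

// Cost of converting a scalar integer of srcBits to floating point.
// Sub-i32 sources are extended first; the extension is treated as free
// when it can be folded into a (non-vector) load.
constexpr int scalarIntToFPCost(unsigned srcBits, bool extendFoldedIntoLoad) {
  return ((srcBits < 32 && !extendFoldedIntoLoad) ? kExtendCost : 0) +
         kI32ToFPCost;
}
```

With these placeholder values, an i8 source in a register costs one unit more than an i32 source, while an i8 source coming straight from a load costs the same.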
Update v4i64 -> v4f32/v4f64 uitofp costs based on the worst case costs from the script in D103695.
Fixes a few regressions before we start adding AVX costs for legalized types.
Building on rG2a1ef8784ad9a, adjust the SSE cost tables to use the legalized types based on the worst case costs from the script in D103695.
To account for different numbers of src/dst legalized type registers, we must scale the cost by the maximum of the src/dst register counts, not just the src count.
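The scaling rule can be sketched as follows. This is an illustration under assumed names (`scaledCastCost`, `perRegCost`), not the code from the patch.

```cpp
// Scale a per-register cast cost by whichever side of the cast needs more
// legalized registers. All names here are hypothetical.
constexpr int scaledCastCost(int numSrcRegs, int numDstRegs, int perRegCost) {
  return (numSrcRegs > numDstRegs ? numSrcRegs : numDstRegs) * perRegCost;
}
```

For instance, with 128-bit registers a v8i16 -> v8f64 conversion has one source register but four v2f64 destination registers, so scaling by the source count alone would undercount the work by a factor of four.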
Based on the worst-case numbers generated by D103695, the AVX1/2/512 sitofp/uitofp/fptosi/fptoui costs were higher than necessary (they were based on instruction counts instead of actual throughput).
The SSE costs still need further fixes, but I hit an issue with the order in which SSE costs are checked: we need to check CUSTOM costs (with non-legal types) first, and then fall back to LEGALIZED types. I'm looking at this now, and this should let us start thinning out a lot of the duplicates in the cost tables.
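One way to picture the intended lookup order is the sketch below: try a table keyed on the original (possibly non-legal) types first, and only then a table keyed on the legalized types. The type enum, both tables, and every cost value are invented for illustration and do not come from the X86 backend.

```cpp
#include <cstddef>

// Hypothetical type tags and table entries, for illustration only.
enum class Ty { v2i16, v4i32, v8i16, v2f64, v4f32 };

struct CostEntry {
  Ty dst;
  Ty src;
  int cost;
};

// Custom entries are keyed on the original, possibly non-legal, types.
constexpr CostEntry kCustom[] = {{Ty::v2f64, Ty::v2i16, 3}};
// Legalized entries are keyed on the types after legalization.
constexpr CostEntry kLegalized[] = {{Ty::v2f64, Ty::v8i16, 5},
                                    {Ty::v4f32, Ty::v4i32, 1}};

// Returns the matching cost, or -1 when the table has no entry.
template <std::size_t N>
constexpr int lookup(const CostEntry (&table)[N], Ty dst, Ty src) {
  for (std::size_t i = 0; i < N; ++i)
    if (table[i].dst == dst && table[i].src == src)
      return table[i].cost;
  return -1;
}

// Check the custom table first, then fall back to the legalized types.
constexpr int castCost(Ty dst, Ty src, Ty legalDst, Ty legalSrc) {
  int custom = lookup(kCustom, dst, src);
  if (custom >= 0)
    return custom;
  return lookup(kLegalized, legalDst, legalSrc);
}
```

Checking the legalized table first would hide the cheaper custom entry for v2i16 behind the generic v8i16 entry, which is the ordering problem the message describes.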
Then we can finally start work on vXi64 / vXi16 / vXi8 / vXi1 integers, which should let us look at sub-128-bit vectorization (D103925).
If both the source and the destination need to be split then the two halves of the split operation are completely independent and don't need to be split or joined. So we don't need to assess a cost for the split or join.
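As a rough sketch of that rule (parameter names are assumptions, and the overhead value is whatever the target reports), the split/join overhead is charged only when exactly one side of the cast has to be split:

```cpp
// Cost of a cast that is legalized by splitting it into two half-width casts.
// When both source and destination are split, the halves are independent and
// no split/join overhead is added. Names are hypothetical.
constexpr int splitCastCost(int halfCost, bool srcSplit, bool dstSplit,
                            int splitOverhead) {
  return 2 * halfCost + ((srcSplit && dstSplit) ? 0 : splitOverhead);
}
```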
Differential Revision: https://reviews.llvm.org/D79111
The assert that caused this to be reverted should be fixed now.
Original commit message:
This patch changes our default legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the element widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
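The difference between the two strategies can be illustrated with a small sketch. The `VecTy` struct and the fixed 128-bit target width are assumptions made for the example, not LLVM's actual type-legalization code.

```cpp
// A vector type described by element count and element width in bits.
struct VecTy {
  unsigned numElts;
  unsigned eltBits;
};

constexpr unsigned kLegalBits = 128;  // assumed legal vector width

// Widening keeps the element width and pads with extra (undef) elements.
constexpr VecTy widen(VecTy t) {
  return VecTy{kLegalBits / t.eltBits, t.eltBits};
}

// Promotion keeps the element count and grows each element instead.
constexpr VecTy promote(VecTy t) {
  return VecTy{t.numElts, kLegalBits / t.numElts};
}
```

Applied to v8i8, `widen` yields v16i8 and `promote` yields v8i16, matching the example in the message.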
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 368183
This patch changes our default legalization behavior for 16, 32, and
64 bit vectors with i8/i16/i32/i64 scalar types from promotion to
widening. For example, v8i8 will now be widened to v16i8 instead of
promoted to v8i16. This keeps the element widths the same and pads
with undef elements. We believe this is a better legalization strategy.
But it carries some issues due to the fragmented vector ISA. For
example, i8 shifts and multiplies get widened and then later have
to be promoted/split into vXi16 vectors.
This has the potential to cause regressions so we wanted to get
it in early in the 10.0 cycle so we have plenty of time to
address them.
Next steps will be to merge tests that explicitly test the command
line option. And then we can remove the option and its associated
code.
llvm-svn: 367901
Make some AVX and AVX512 cast costs more precise.
Based on part of a patch by Elena Demikhovsky (D15604).
Differential Revision: http://reviews.llvm.org/D22064
llvm-svn: 275106
The cost model should not assume vector casts get completely scalarized, since
on targets that have vector support, the common case is a partial split up to
the legal vector size. So, when a vector cast gets split, the resulting casts
end up legal and cheap.
Instead of pessimistically assuming scalarization, base TTI can use the costs
the concrete TTI provides for the split vector, plus a fudge factor to account
for the cost of the split itself. This fudge factor is currently 1 by default,
except on AMDGPU where inserts and extracts are considered free.
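A simplified model of this costing, assuming the vector is halved until it reaches a legal element count: the function name, its parameters, and the single flat `legalCost` are hypothetical stand-ins for what the concrete TTI would provide.

```cpp
// Cost of casting a vector of numElts elements when at most maxLegalElts
// elements fit in a legal vector. Illegal vectors are split in half
// recursively; each split adds splitFudge (1 by default, 0 on a target
// where inserts and extracts are considered free).
constexpr int vectorCastCost(unsigned numElts, unsigned maxLegalElts,
                             int legalCost, int splitFudge = 1) {
  return numElts <= maxLegalElts
             ? legalCost
             : 2 * vectorCastCost(numElts / 2, maxLegalElts, legalCost,
                                  splitFudge) +
                   splitFudge;
}
```

With a legal width of 4 elements and a legal cast cost of 1, a 16-element cast costs 7 under the default fudge factor and 4 when splits are free.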
Differential Revision: http://reviews.llvm.org/D21251
llvm-svn: 274642
I checked and updated the cost of AVX-512 conversion operations. Added cost of conversion operations in DQ mode.
The cost of converting illegal types that require a vector split is not calculated right now (as for other X86 targets).
Differential Revision: http://reviews.llvm.org/D15074
llvm-svn: 254494
Merged separate (but equivalent) SSE2/AVX512F tests.
Removed codegen tests since these are already done better in test/CodeGen/X86.
The actual cost values still need to be updated to match recent codegen improvements.
llvm-svn: 240219
AVX2 is available.
According to IACA, the new lowering has a throughput of 8 cycles instead of 13
with the previous one.
Although this lowering kicks in on some SPEC benchmarks, the performance
improvement was within the noise.
Correctness testing has been done for the whole range of uint32_t with the
following program:
    uint4 v = (uint4) {0,1,2,3};
    uint32_t i;
    //Check correctness over entire range for uint4 -> float4 conversion
    for( i = 0; i < 1U << (32-2); i++ )
    {
      float4 t = test(v);
      float4 c = correct(v);
      if( 0xf != _mm_movemask_ps( t == c ))
      {
        printf( "Error @ %vx: %vf vs. %vf\n", v, c, t);
        return -1;
      }
      v += 4;
    }
Where "correct" is the old lowering and "test" the new one.
The patch adds a test case for the two custom-lowered instructions.
It also modifies the vector cost model, which is why cast.ll and uitofp.ll are
modified.
2009-02-26-MachineLICMBug.ll is also modified because we now hoist 7
instructions instead of 4 (3 more constant loads).
<rdar://problem/18153096>
llvm-svn: 221657
The costs are overfitted so that I can still use the legalization factor.
For example, when compiled with SSE2 the following kernel has about half the
throughput vectorized as it does unvectorized. Before this patch we would vectorize it.
    unsigned short A[1024];
    double B[1024];
    void f() {
      int i;
      for (i = 0; i < 1024; ++i) {
        B[i] = (double) A[i];
      }
    }
radar://13599001
llvm-svn: 179033