vpermw is 2 uops. vpermt2b/vpermt2w are two shuffle uops and a port 015 uop. Weirdly vpermb is a single uop.
This patch bumps the cost to 2 for these operations. Maybe should go to 3 for the vpermt2*, but I've started conservative.
I've also removed a few entries that were now the same as earlier subtargets or that I didn't think we really did. Like I don't think we extend v32i8 to v32i16, shuffle, and then truncate.
Differential Revision: https://reviews.llvm.org/D79148
This moves v32i16/v64i8 to a model consistent with how we
treat integer types with avx1.
This does change the ABI for types vXi16/vXi8 vectors larger than
512 bits to pass in multiple zmms instead of multiple ymms. We'd
already hacked some code to make v64i8/v32i16 pass in zmm.
Cost model is still a bit of a mess. In some place I tried to
match existing behavior. But really we need to account for
splitting and concating costs. Cost model for shuffles is
especially pessimistic.
Differential Revision: https://reviews.llvm.org/D76212
v2i8/v4i8/v8i8 + v2i16/v4i16 all show up in vectorizer code and by just using the legalized types (v16i8/v8i16) we're highly exaggerating the actual cost of the shuffle.