Currently we model i16 bswap as very high cost (`10`),
which doesn't seem right, with all others being at `1`.
Regardless of `MOVBE`, i16 reg-reg bswap is lowered into
(an extending move plus) rot-by-8:
https://godbolt.org/z/8jrq7fMTj
I think it should at worst have throughput of `1`.
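
To illustrate why rot-by-8 is a valid lowering, here is a minimal C++ sketch
that checks a 16-bit byte swap against a 16-bit rotate by 8 for every input.
The helper names are hypothetical and only serve the illustration; this is
not the backend's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference: swap the two bytes of a 16-bit value.
static uint16_t bswap16(uint16_t X) {
  return static_cast<uint16_t>(((X & 0x00FFu) << 8) | ((X & 0xFF00u) >> 8));
}

// Hypothetical helper: rotate a 16-bit value left by 8 bits.
static uint16_t rotl16By8(uint16_t X) {
  return static_cast<uint16_t>((X << 8) | (X >> 8));
}

// Exhaustively compare both helpers over all 65536 possible inputs.
static void checkRotMatchesBswap16() {
  for (uint32_t V = 0; V <= 0xFFFFu; ++V) {
    uint16_t X = static_cast<uint16_t>(V);
    assert(bswap16(X) == rotl16By8(X));
  }
}
```

Since both halves of an i16 are exactly 8 bits wide, rotating by 8 in either
direction gives the same result.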
Since i32/i64 already have a cost of `1`,
`MOVBE` doesn't improve their costs any further.
BUT, `MOVBE` must have one memory operand,
with the other being a register. Which means, if we have
a bswap of a load, iff the load has a single use,
we'll fold the bswap into the load.
Likewise, if we have a store of a bswap, iff the bswap
has a single use, we'll fold the bswap into the store.
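
As a rough source-level picture of the two foldable patterns, consider the
C++ sketch below. The function names are hypothetical, and whether a given
compiler actually selects `MOVBE` for them depends on the target features
and optimization level, so treat the comments as the intended shape rather
than guaranteed codegen.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: portable 32-bit byte swap.
static uint32_t bswap32(uint32_t X) {
  return (X >> 24) | ((X >> 8) & 0x0000FF00u) |
         ((X << 8) & 0x00FF0000u) | (X << 24);
}

// bswap of a load: the loaded value has a single use (the bswap),
// so the pair is a candidate for the load form of MOVBE.
static uint32_t loadSwapped32(const unsigned char *P) {
  uint32_t V;
  std::memcpy(&V, P, sizeof(V));
  return bswap32(V);
}

// store of a bswap: the swapped value has a single use (the store),
// so the pair is a candidate for the store form of MOVBE.
static void storeSwapped32(unsigned char *P, uint32_t V) {
  uint32_t S = bswap32(V);
  std::memcpy(P, &S, sizeof(S));
}

// Usage: a swapped store followed by a swapped load round-trips the value.
static void checkRoundTrip() {
  unsigned char Buf[4];
  storeSwapped32(Buf, 0x11223344u);
  assert(loadSwapped32(Buf) == 0x11223344u);
}
```

If the loaded value or the swapped value had a second use, the unswapped or
swapped value would be needed in a register anyway, and the fold would not
apply.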
So I think we should treat such a bswap as free,
unless of course we know that `MOVBE` performs badly
on the particular CPU.
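
The proposed rule can be summarized with a toy C++ model. This is only a
sketch of the reasoning above, not the patch's code: the function, its
parameters, and the `MOVBEIsSlow` flag are all hypothetical names.

```cpp
#include <cassert>

// Hypothetical toy model of the proposed scalar bswap cost.
//   FoldsIntoMemOp: the bswap is fed by a single-use load, or is the
//                   single-use operand of a store.
//   HasMOVBE:       the target supports MOVBE at all.
//   MOVBEIsSlow:    we know MOVBE performs badly on this CPU.
static int scalarBswapCost(bool FoldsIntoMemOp, bool HasMOVBE,
                           bool MOVBEIsSlow) {
  // A bswap folded into its load/store is treated as free.
  if (FoldsIntoMemOp && HasMOVBE && !MOVBEIsSlow)
    return 0;
  // Otherwise i16/i32/i64 are all modeled with a cost of 1.
  return 1;
}

// Usage: a few representative combinations.
static void checkCosts() {
  assert(scalarBswapCost(false, false, false) == 1); // plain reg-reg bswap
  assert(scalarBswapCost(true, true, false) == 0);   // folded, MOVBE is fine
  assert(scalarBswapCost(true, true, true) == 1);    // folded, MOVBE is slow
}
```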
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D101924