The scalar variant with GPR source/dest has considerably higher latency than the SIMD&FP scalar variant across a variety of micro-architectures:

    Core          Scalar   SIMD&FP
    --------------------------------
    Neoverse V1   9 cyc    3 cyc
    Neoverse N2   8 cyc    3 cyc
    Cortex A510   8 cyc    4 cyc
    A64FX         29 cyc   6 cyc