We are using undef on the indirect move source subreg and then
using implicit super-reg. This creates a problem in RA when
Greedy decides to split the register. It reassigns the implicit
super-reg but does not bother to change undef source because
it is really does not matter. The fix is to stop lying to RA and
drop undef flag.
This has also hit a problem in SIFoldOperands as it can fold
immediate into an indirect move since there is no undef flag
anymore. That results in multiple test failures, so added the
check for this case.
Differential Revision: https://reviews.llvm.org/D84899
These are treated identically to value aggregates placed in the kernel
argument list. A %struct.foo or %struct.foo addrspace(4)*
byref(sizeof(%struct.foo)) align(alignof(%struct.foo)) argument should
produce the same offsets and argument metadata.
This handles all 3 kernel ABI implementations, and the two HSA
metadata emission paths.
The hardware has created a real mess in the naming for add/sub, which
have been renamed basically every generation. Switch the carry out
pseudos to have the gfx9/gfx10 names. We were using the original SI/CI
v_add_i32/v_sub_i32 names. Later targets reintroduced these names as
carryless instructions with a saturating clamp bit, which we do not
define. Do this rename so we can unambiguously add these missing
instructions.
The carry-in versions should also be renamed, but at least those had a
consistent _u32 name to begin with. The 16-bit instructions were also
renamed, but aren't ambiguous.
This does regress assembler error message quality in some cases. In
mismatched wave32/wave64 situations, this will switch from
"unsupported instruction" to "invalid operand", with the error
pointing at the wrong position. I couldn't quite follow how the
assembler selects these, but the previous behavior seemed accidental
to me. It looked like there was a partial attempt to handle this which
was never completed (i.e. there is an AMDGPUOperand::isBoolReg but it
isn't used for anything).
Summary:
Since changing the Predicate to be an unsigned enum, the lower bound check for
isFPPredicate no longer needs to check the lower bound, since
it will always evaluate to true.
Also fixed a similar issue in SIISelLowering.cpp by removing the need for
comparing to FIRST and LAST predicates
Added an assert to the isFPPredicate comparison to flag if the
FIRST_FCMP_PREDICATE is ever changed to anything other than 0, in which case the
logic will break.
Without this change warnings are generated in VS.
Change-Id: I358f0daf28c0628c7bda8ad4cab4e1757b761bab
Subscribers: arsenm, jvesely, nhaehnle, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D83540
Even though wide vectors are legal they still cost more as we
will have to eventually split them. Not all operations can
be uniformly done on vector types.
Conservatively add the cost of splitting at least to 8 dwords,
which is our widest possible load.
We are more or less lying to cost mode with this change but
this can prevent vectorizer from creation of wide vectors which
results in RA problems for us.
Differential Revision: https://reviews.llvm.org/D83078
This was resulting in a missing vreg def in the use select
instruction.
The output of the pseudo doesn't make sense, since it really shouldn't
have the vreg output in the first place, and instead an implicit scc
def to match the real scalar behavior.
We could have easier to understand tests if we selected scalar
versions of the [us]{add|sub}.with.overflow intrinsics.
This does still end up producing vector code in the end, since it gets
moved later.
This wasn't getting much value from the DAG or depth arguments, since
it's only called on the frame index root nodes. FrameIndexes can also
only return a scalar value, so it also didn't need DemandedElts.
In this awkward case, we have to emit custom pseudo-constrained FP
wrappers. InstrEmitter concludes that since a mayRaiseFPException
instruction had a chain, it can't add nofpexcept.
Test deferred until mayRaiseFPException is really set on everything.
This is a custom inserter because it was less work than teaching
tablegen a way to indicate that it is sometimes OK to have a no side
effect instruction in the output of a side effecting pattern.
The asm is needed to look like a read of the mode register to prevent
it from being deleted. However, there seems to be a bug where the mode
register def instructions are moved across the asm sideeffect by the
post-RA scheduler.
Another oddity is the immediate is formatted differently between
s_denorm_mode and s_round_mode.
Tweak a few constant expressions involving numbers::pi etc to avoid
rounding errors. NFCI though it's possible some of these will now be
more accurate in the last bit.
OpenMP emits these for some reason, so handle them. Assume these use
4096 bytes by default, with a flag to override this. Also change the
related stack assumption for calls to have a flag.
This will enable selecting non-entry block allocas. Skip the SP write
check in the base isSchedulingBoundary implementation to preserve the
previous scheduling behavior and avoid test churn. It's apparently for
compile time reasons, but if we were to use this more work would be
needed since in some of the failing tests, we seem to incorrectly get
hazard nops inserted.
Summary: 'A' constraint requires an immediate int or fp constant that can be inlined in an instruction encoding.
Reviewers: arsenm, rampitec
Differential Revision: https://reviews.llvm.org/D78494
Unlike SelectionDAGBuilder, IRTranslator omits the unconditional
branch in fallthrough cases. Confusingly, the control flow pseudos
function in the opposite way the intrinsics are used, and the branch
targets always need to be swapped. We're inverting the target blocks,
so we need to figure out the old fallthrough block and insert a branch
to the original unconditional branch target.
It was set in total vector size while the idea was to limit
a number of instructions. Now it started to work with doubles
and thresholds needs to be updated.
Differential Revision: https://reviews.llvm.org/D80322
Even though series of cmd/cndmask can produce quite a lot of
code that is still better than a loop. In case of doubles we
would even produce two loops.
Differential Revision: https://reviews.llvm.org/D80032
This should be directly implied from the register class, and there's
no need to special case live ins here. This was getting the wrong
answer for the queue ptr argument in callable functions, since it's
not an explicit IR argument and is always uniform.
Fixes not using scalar loads for the aperture in addrspacecast
lowering, and any other places that use implicit SGPR arguments.
We can produce such vectors in the Promote Alloca pass,
but we are unable to use movrel to operate it and lower
via scratch. Making it legal makes SI_INDIRECT patterns
work.
There is more work to do in subsequent changes:
1. We initialize m0 twice to access each dword. It shall
be possible to only do it once and increment base register
number instead.
2. We also need v16i64/v16f64 but these first need to be
added to tablegen.
Differential Revision: https://reviews.llvm.org/D79808
Summary: This change enables all kind of carry out ISD opcodes to be selected according to the node divergence.
Reviewers: rampitec, arsenm, vpykhtin
Reviewed By: rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D78091
This method has been commented as deprecated for a while. Remove
it and replace all uses with the equivalent getCalledOperand().
I also made a few cleanups in here. For example, to removes use
of getElementType on a pointer when we could just use getFunctionType
from the call.
Differential Revision: https://reviews.llvm.org/D78882
This avoids more long lists of register classes that have to be updated
every time we add a new one. NFC.
Differential Revision: https://reviews.llvm.org/D78570
12994a70cf did this for 128-bit classes:
SGPR_128 only includes the real allocatable SGPRs, and SReg_128 adds
the additional non-allocatable TTMP registers. There's no point in
allocating SReg_128 vregs. This shrinks the size of the classes
regalloc needs to consider, which is usually good.
This patch extends it to all classes > 64 bits, for consistency.
Differential Revision: https://reviews.llvm.org/D78622
Add 96-bit, 160-bit and 256-bit AReg classes to match VReg and SReg.
NFC as far as I know, but it may avoid weird legalization problems.
Differential Revision: https://reviews.llvm.org/D78348
Ports the existing DAG combines, minus the simplify demanded bits
which seems to have no equivalent now. Without these, this isn't
particularly helpful in most of the IR sample cases.
Since 1725f28841, this should check
isFMADLegalForFAddFSub rather than the the plain isOperationLegal.
This would assert in a subset of cases due to an oddity in how FMAD is
selected. We will allow FMA formation pre-legalize, but not FMAD even
in cases where it would be valid.
The current hook requires passing in the root fadd/fsub. However, in
this distributed case, this would be far more complicated to pass in
the relevant operand. AMDGPU doesn't get any value from the node, and
only needs the type and is the only implementor, so I'm not sure why
we have this complexity. Just rename and expand the assert to avoid
the more complicated checks spread through the distribution logic.
The extracts from control flow intrinsics are already properly handled
by divergence analysis. The inline asm case isn't dead, but has also
never really worked correctly so leave it as-is for now.
Add a new llvm.amdgcn.ballot intrinsic modeled on the ballot function
in GLSL and other shader languages. It returns a bitfield containing the
result of its boolean argument in all active lanes, and zero in all
inactive lanes.
This is intended to replace the existing llvm.amdgcn.icmp and
llvm.amdgcn.fcmp intrinsics after a suitable transition period.
Use the new intrinsic in the atomic optimizer pass.
Differential Revision: https://reviews.llvm.org/D65088
Summary:
This change adds amdgcn.reloc.constant intrinsic to the amdgpu backend, which will compile into a relocation entry in the resulting elf.
The intrinsics takes a MetadataNode (String) as its only argument, which specifies the symbol name of the relocation entry.
`SelectionDAGBuilder::getValueImpl` is changed to allow metadata operands passed through to ISel.
Author: csyonghe <yonghe@google.com>
Reviewers: tpr, nhaehnle
Reviewed By: nhaehnle
Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D76440
Remove the gap left between the stack pointer (s32) and frame pointer
(s34) now that the scratch wave offset is no longer a part of the
calling convention ABI.
Update llvm/docs/AMDGPUUsage.rst to reflect the change.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75657
Add the scratch wave offset to the scratch buffer descriptor (SRSrc) in
the entry function prologue. This allows us to removes the scratch wave
offset register from the calling convention ABI.
As part of this change, allow the use of an inline constant zero for the
SOffset of MUBUF instructions accessing the stack in entry functions
when a frame pointer is not requested/required. Entry functions with
calls still need to set up the calling convention ABI stack pointer
register, and reference it in order to address arguments of called
functions. The ABI stack pointer register remains unswizzled, but is now
wave-relative instead of queue-relative.
Non-entry functions also use an inline constant zero SOffset for
wave-relative scratch access, but continue to use the stack and frame
pointers as before. When the stack or frame pointer is converted to a
swizzled offset it is now scaled directly, as the scratch wave offset no
longer needs to be subtracted first.
Update llvm/docs/AMDGPUUsage.rst to reflect these changes to the calling
convention.
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75138
This isn't really usable, and requires using the
-amdgpu-fixed-function-abi flag to work.
Assumes a uniform call target, and will hit a verifier error if the
call target ends up in a VGPR. Also doesn't attempt to do anything
sensible for the reported register/stack usage.
Summary:
When SI_INDIRECT_DST_V* pseudos has indexes in VGPR, they get expanded into the self-looped basic block that modifies EXEC in a loop.
To keep EXEC consistent it is stored before and then re-stored after the pseudo expansion result.
%95:vreg_512 = SI_INDIRECT_DST_V16 %93:vreg_512(tied-def 0), %94:sreg_32, 0, killed %1500:vgpr_32
results to
s_mov_b64 s[6:7], exec
BB0_16:
v_readfirstlane_b32 s8, v28
v_cmp_eq_u32_e32 vcc, s8, v28
s_and_saveexec_b64 vcc, vcc
s_set_gpr_idx_on s8, gpr_idx(DST)
v_mov_b32_e32 v6, v25
s_set_gpr_idx_off
s_xor_b64 exec, exec, vcc
s_cbranch_execnz BB0_16
; %bb.17:
s_mov_b64 exec, s[6:7]
The bug appeared in case this expansion occurs in the ELSE block of the CF.
Originally
%110:vreg_512 = SI_INDIRECT_DST_V16 %103:vreg_512(tied-def 0), %85:vgpr_32, 0, %107:vgpr_32,
%112:sreg_64 = SI_ELSE %108:sreg_64, %bb.19, 0, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
expanded to
****************** <== here exec has "THEN" context
s_mov_b64 s[6:7], exec
BB0_16:
v_readfirstlane_b32 s8, v28
v_cmp_eq_u32_e32 vcc, s8, v28
s_and_saveexec_b64 vcc, vcc
s_set_gpr_idx_on s8, gpr_idx(DST)
v_mov_b32_e32 v6, v25
s_set_gpr_idx_off
s_xor_b64 exec, exec, vcc
s_cbranch_execnz BB0_16
; %bb.17:
s_or_saveexec_b64 s[4:5], s[4:5] <-- exec mask is restored for "ELSE" but immediately overwritten.
s_mov_b64 exec, s[6:7]
The rest of the "ELSE" block is executed not by the workitems which constitute the "else mask" but by those which constitute "then mask"
SILowerControlFlow::emitElse always considers the basic block begin() as an insertion point for s_or_saveexec.
Proposed fix: The SI_INDIRECT_DST_V* procedure should split the reminder block to create landing pad for the EXEC restoration.
Reviewers: rampitec, vpykhtin, nhaehnle
Reviewed By: vpykhtin
Subscribers: arsenm, kzhuravl, jvesely, wdng, yaxunl, dstuttard, tpr, t-tye, hiraditya, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D75472
If SimplifyDemandedBits succeeds in simplifying the byte src, add the CVT_F32_UBYTE node back to the worklist as we might be able to simplify further.
Yet another step towards removing SelectionDAG::GetDemandedBits.
This is part of the work to remove SelectionDAG::GetDemandedBits and just use SimplifyMultipleUseDemandedBits.
Recent experiments raised some v_cvt_f32_ubyte*_e32 regressions, so I've added some additional abilities to performCvtF32UByteNCombine to help unpack byte data more aggressively.
We still don't remove all OR(SHL,SRL) patterns as some of the regenerated nodes don't get combined again, but we are getting closer.
Differential Revision: https://reviews.llvm.org/D74786
Also greatly improve i64 lowering. LegalizeIntegerTypes does the
correct narrowing if i64 isn't legal. Just workaround this for
SelectionDAG by making i64 legal and splitting in the patterns.
This is apparently worse than 1-byte alignment. This does not attempt
to decompose 2-byte aligned wide stores, but will stop trying to
produce them.
Also fix bug in LoadStoreVectorizer which was decreasing the alignment
and vectorizing stack accesses. It was assuming a stack object was an
alloca that could have its base alignment changed, which is not true
if the pointer is derived from a function argument.
These are generated and do not need to have the same values.
We are defining separate subregs for R600 and GCN but then
using AMDGPU subregs on R600.
Differential Revision: https://reviews.llvm.org/D74248
Based on D72931
This adds a new feature called A16 which is enabled for gfx10.
gfx9 keeps the R128A16 feature so it can share all the instruction encodings
with gfx7/8.
Differential Revision: https://reviews.llvm.org/D73956
Summary:
The accuracy limit to use rcp is adjusted to 1.0 ulp from 2.5 ulp.
Also, afn instead of arcp is used to allow inaccurate rcp to be used.
Reviewers:
arsenm
Differential Revision: https://reviews.llvm.org/D73588
Summary: This patch introduces an API for MemOp in order to simplify and tighten the client code.
Reviewers: courbet
Subscribers: arsenm, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73964
We are using countPopulation on a LaneBitmask to determine
a number of registers it covers. This is the assumption which
does not necessarily need to be true. It is not changed but
factored into a single call SIRegisterInfo::getNumCoveredRegs().
Some other places are cleaned up with respect to assumptions
about subreg indexes values and tablegen behavior.
Differential Revision: https://reviews.llvm.org/D74177
scalar_to_vector takes only one argument, not two.
The a16 tests now also check the packing of coordinates into registers
Differential Revision: https://reviews.llvm.org/D73482
This should lower the amount of used registers for gfx9.
I updated some of the changed tests with the update script because
changing them by hand is tedious.
Differential Revision: https://reviews.llvm.org/D73884
This was incorrectly rounding up to the next power of 2. v4f32 was
rounding up to v8f32, which was just wrong. There are also v3i16/v3f16
available in MVT, so we don't even need to round the f16 cases
anymore. Additionally, this field is really an EVT so we don't even
need to consider this.
Also switch some asserts to return invalid. We should have an IR
verifier for these intrinsic return types, but for now it's better to
not assert on IR that passes the verifier.
This should also probably be fixed to consider that dmask is really
eliminating some of the loaded components.
Prepare to accurately track the future denormal-fp-math attribute
changes. The way to actually set these separately is not wired in yet.
This is just a mechanical change, and mostly still assumes the input
and output mode match. This should be refined for some cases. For
example, fcanonicalize lowering should use the flushing variant if
either input or output flushing is enabled
Summary:
This is patch is part of a series to introduce an Alignment type.
See this thread for context: http://lists.llvm.org/pipermail/llvm-dev/2019-July/133851.html
See this patch for the introduction of the type: https://reviews.llvm.org/D64790
Reviewers: courbet
Subscribers: arsenm, dschuff, jyknight, sdardis, nemanjai, jvesely, nhaehnle, sbc100, jgravelle-google, hiraditya, aheejin, kbarton, fedor.sergeev, asb, rbar, johnrusso, simoncook, sabuasal, niosHD, jrtc27, MaskRay, zzheng, edward-jones, atanasyan, rogfer01, MartinMosbeck, brucehoult, the_o, PkmX, jocewei, jsji, Jim, lenary, s.egerton, pzheng, sameer.abuasal, apazos, luismarques, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73885
Summary: This is a first step before changing the types to llvm::Align and introduce functions to ease client code.
Reviewers: courbet
Subscribers: arsenm, sdardis, nemanjai, jvesely, nhaehnle, hiraditya, kbarton, jrtc27, atanasyan, jsji, kerbowa, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D73785
There's not much value to this separate node from the intrinsic. Make
the operand structure the same as the intrinsic, so we can reuse the
same pattern for GlobalISel.
Manually select this is as a tablegen workraound. Both SelectionDAG
and GlobalISel end up misplacing the copy to m0 when both instructions
in the output need it. Neither considers that both output instructions
depend on m0. I don't know of any other pattern we need to handle this
case, so it's less effort to just workaround this for now.
Summary:
RCP has the accuracy limit. If FDIV fpmath require high accuracy rcp may not
meet the requirement. However, in DAG lowering, fpmath information gets lost,
and thus we may generate either inaccurate rcp related computation or slow code
for fdiv.
In patch implements fdiv optimizations in the AMDGPUCodeGenPrepare, which could
exactly know !fpmath.
FastUnsafeRcpLegal: We determine whether it is legal to use rcp based on
unsafe-fp-math, fast math flags, denormals and fpmath
accuracy request.
RCP Optimizations:
1/x -> rcp(x) when fast unsafe rcp is legal or fpmath >= 2.5ULP with
denormals flushed.
a/b -> a*rcp(b) when fast unsafe rcp is legal.
Use fdiv.fast:
a/b -> fdiv.fast(a, b) when RCP optimization is not performed and
fpmath >= 2.5ULP with denormals flushed.
1/x -> fdiv.fast(1,x) when RCP optimization is not performed and
fpmath >= 2.5ULP with denormals.
Reviewers:
arsenm
Differential Revision:
https://reviews.llvm.org/D71293
This using the wrong result register, and dropping the result entirely
for v2f16. This would fail to select on the scalar case. I believe it
was also mishandling packed/unpacked subtargets.
a785209bc2 switched to using a pseudos instead of manually tying
operands on the regular instruction. The VGPR indexing mode path
should have the same problems that change attempted to avoid, so these
should use the same strategy.
Use a single pseudo for the VGPR indexing mode and movreld paths, and
expand it based on the subtarget later. These have essentially the
same constraints, reading the index from m0.
Switch from using an offset to the subregister index directly, instead
of computing an offset and re-adding it back. Also add missing pseudos
for existing register class sizes.
Currently the lowering for i16 image coordinates asserts on gfx10. I'm
somewhat confused by this though. The feature is missing from the
gfx10 feature lists, but the a16 bit appears to be present in the
manual for MIMG instructions.
This case can be handled as a regular selection pattern, so move it
out of the weird post-isel folding code which doesn't have an exactly
equivalent place in GlobalISel.
I think it doesn't make much sense to do this optimization here
though, and it would be more useful in instcombine. There's not really
any new information that will be gained during lowering since these
inputs were known from the beginning.
I'm mildly worried about potentially reordering exp/exp_done with
IntrWriteMem on the intrinsic.
Requires hacking out the illegal type on SI, so manually select that
case during lowering.
The 16 bank LDS case is complicated due to using multiple
instructions. If I attempt to write a pattern for it, the generated
selector incorrectly places the copy to m0 after the first
instruction, so that needs to be separately addressed.
Also fix not gluing the copy to m0 to the second operation in the
second half of the 16 bank lowering.
Only PPC seems to be using it, and only checks some simple cases and
doesn't distinguish between FP. Just switch to using LLT to simplify
use from GlobalISel.
The existing test only covered one case for r600. The use of
mul_legacy also looks suspicious to me, but leave it for now. The
patterns are also not making use of source modifiers.
Summary:
- CF users won't be non-instruction values. Skip them to save the
compilation time. It's especially true when there are multiple
functions in that module, where, says, a constant may be used in most
functions. The current CF user tracing adds significant overhead.
Reviewers: alex-t, rampitec
Subscribers: arsenm, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D72174
When the "disable-tail-calls" attribute was added, checks were added for
it in various backends. Now this code has proliferated, and it is
something the target is responsible for checking. Move that
responsibility back to the ISels (fast, global, and SD).
There's no major functionality change, except for targets that never
implemented this check.
This LLVM attribute was originally added in
d9699bc7bd (2015).
Reviewers: echristo, MaskRay
Differential Revision: https://reviews.llvm.org/D72118
This allows us to clean up some places that were peeking through
the MERGE_VALUES node after the call. By returning the SDValues
directly, we can clean that up.
Unfortunately, there are several call sites in AMDGPU that wanted
the MERGE_VALUES and now need to create their own.
Summary:
The only useful information the UndefValue conveys is the address space,
which MachinePointerInfo can represent directly without referring to an
IR value.
Reviewers: arsenm, rampitec
Subscribers: kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, Petar.Avramovic, llvm-commits
Tags: #llvm
Differential Revision: https://reviews.llvm.org/D71838
This reverts commit 69fcfb7d35.
As shown in the test I attached to this commit, the change I reverted
causes a problem with "zext(cc1) - zext(cc2)". It commuted
the operands to the sub and used different logic to select the addc/subc
instruction:
sub zext (setcc), x => addcarry 0, x, setcc
sub sext (setcc), x => subcarry 0, x, setcc
... but that is bogus. I believe it is not possible to fold those commuted
patterns into any form of addcarry or subcarry. It may have worked as
intended before "AMDGPU: Change boolean content type to 0 or 1" because
the setcc was considered to be -1 rather than 1.
Differential Revision: https://reviews.llvm.org/D70978
Change-Id: If2139421aa6c935cbd1d925af58fe4a4aa9e8f43
Start moving towards treating this as a property of the calling
convention, and not the subtarget. The default denormal mode should
not be part of the subtarget, and be moved into a separate function
attribute.
This patch is still NFC. The denormal mode remains as a subtarget
feature for now, but make the necessary changes to switch to using an
attribute.
AMDGPU needs to know the FP mode for the function to answer this
correctly when this is removed from the subtarget.
AArch64 had to make this more complicated by using this from an IR
hook, so add an IR typed overload.
The usage of target boolean checks is overly inflexible, since sext
and zext of a compare are equally cheap. The choice is arbitrary, but
using 0/1 to some degree is the choice of lower resistance since
that's what most targets use. This enables a few combines that don't
bother to support ZeroOrNegativeOneBooleanContent.
For AMDGPU this depends on whether denormals are enabled in the
default FP mode for the function. Currently this is treated as a
subtarget feature, so FMAD is selectively legal based on that. I want
to move this out of the subtarget features so this can be controlled
with a denormal mode attribute. Additionally, this will allow folding
based on a future ftz fast math flag.