Commit Graph

15 Commits

Author SHA1 Message Date
Tue Ly 82d6e77048 [libc] Implement tanf function correctly rounded for all rounding modes.
Implement tanf function correctly rounded for all rounding modes.

We use the range reduction that is shared with `sinf`, `cosf`, and `sincosf`:
```
  k = round(x * 32/pi) and y = x * (32/pi) - k.
```
Then we use the tangent of sum formula:
```
  tan(x) = tan((k + y)* pi/32) = tan((k mod 32) * pi / 32 + y * pi/32)
         = (tan((k mod 32) * pi/32) + tan(y * pi/32)) / (1 - tan((k mod 32) * pi/32) * tan(y * pi/32))
```
We need to make a further reduction when `k mod 32 >= 16` due to the pole of the `tan(x)` function at `pi/2`:
```
  if (k mod 32 >= 16): k = k - 31, y = y - 1.0
```
And to compute the final result, we store `tan(k * pi/32)` for `k = -15..15` in a table of 32 double values,
and evaluate `tan(y * pi/32)` with a degree-11 minimax odd polynomial generated by Sollya with:
```
>  P = fpminimax(tan(y * pi/32)/y, [|0, 2, 4, 6, 8, 10|], [|D...|], [0, 1.5]);
```
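
The reduction and the tangent-of-sum step above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the libc code: it works in double precision, the names `TAN_K` and `tan_sketch` are hypothetical, and `math.tan` stands in for the Sollya-generated polynomial on the small argument `y * pi/32`.

```python
import math

# Hypothetical lookup table: tan(k * pi/32) for k = -15..15.
TAN_K = {k: math.tan(k * math.pi / 32) for k in range(-15, 16)}

def tan_sketch(x: float) -> float:
    """Range reduction mod pi/32 followed by the tangent-of-sum formula."""
    scaled = x * (32 / math.pi)
    k = round(scaled)            # nearest integer
    y = scaled - k               # -0.5 <= y <= 0.5
    k_mod = k % 32               # Python's % yields 0..31 here
    if k_mod >= 16:
        # Further reduction: shift by one period (32 units of pi/32)
        # so that the table index stays within -15..15.
        k_mod -= 31
        y -= 1.0
    tan_k = TAN_K[k_mod]
    # Assumption: math.tan replaces the minimax polynomial in y.
    tan_y = math.tan(y * math.pi / 32)
    return (tan_k + tan_y) / (1.0 - tan_k * tan_y)
```

For moderate inputs such as `tan_sketch(2.0)` the result should agree with `math.tan(2.0)` up to double-precision rounding; very large `x` would need a more careful reduction than the single multiplication used here.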

Performance benchmark using the `perf` tool from the CORE-MATH project on Ryzen 1700:
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanf
CORE-MATH reciprocal throughput   : 18.586
System LIBC reciprocal throughput : 50.068

LIBC reciprocal throughput        : 33.823
LIBC reciprocal throughput        : 25.161     (with `-msse4.2` flag)
LIBC reciprocal throughput        : 19.157     (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanf --latency
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH latency   : 55.630
System LIBC latency : 106.264

LIBC latency        : 96.060
LIBC latency        : 90.727    (with `-msse4.2` flag)
LIBC latency        : 82.361    (with `-mfma` flag)
```

Reviewed By: orex

Differential Revision: https://reviews.llvm.org/D131715
2022-08-12 09:21:05 -04:00
Tue Ly 42f183792c [libc] Change sinf/cosf range reduction to mod pi/32 to be shared with tanf.
Change sinf/cosf range reduction to mod pi/32 to be shared with tanf,
since polynomial approximations for tanf on subintervals of length pi/16 do not
provide enough accuracy.

Reviewed By: orex

Differential Revision: https://reviews.llvm.org/D131652
2022-08-11 09:41:45 -04:00
Tue Ly 131dda9acc [libc] Implement sincosf function correctly rounded to all rounding modes.
Refactor common range reductions and evaluations for sinf, cosf, and
sincosf.  Added exhaustive tests for sincosf.

Performance before the patch:
```
System LIBC reciprocal throughput : 30.205
LIBC reciprocal throughput        : 30.533

System LIBC latency : 67.961
LIBC latency        : 61.564
```
Performance after the patch:
```
System LIBC reciprocal throughput : 30.409
LIBC reciprocal throughput        : 20.273

System LIBC latency : 67.527
LIBC latency        : 61.959
```

Reviewed By: orex

Differential Revision: https://reviews.llvm.org/D130901
2022-08-05 09:58:01 -04:00
Tue Ly 69cc240534 [libc][doc] Update implementation status of tanhf. 2022-08-01 17:45:40 -04:00
Tue Ly 17df74214c [libc][doc] Update implementation status of exp2f, sinhf, and coshf. 2022-07-31 16:32:21 -04:00
Tue Ly 2ff187fbc9 [libc] Implement cosf function that is correctly rounded to all rounding modes.
Implement cosf function that is correctly rounded to all rounding
modes.

Performance benchmark using perf tool from CORE-MATH project
(https://gitlab.inria.fr/core-math/core-math/-/tree/master) on Ryzen 1700:
Before this patch (not correctly rounded):
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh cosf
CORE-MATH reciprocal throughput   : 19.043
System LIBC reciprocal throughput : 26.328
LIBC reciprocal throughput        : 30.955

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh cosf --latency
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH latency   : 49.995
System LIBC latency : 59.286
LIBC latency        : 60.174

```
After this patch (correctly rounded):
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh cosf
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH reciprocal throughput   : 19.072
System LIBC reciprocal throughput : 26.286
LIBC reciprocal throughput        : 13.631

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh cosf --latency
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH latency   : 49.872
System LIBC latency : 59.468
LIBC latency        : 56.119
```

Reviewed By: orex, zimmermann6

Differential Revision: https://reviews.llvm.org/D130644
2022-07-29 21:08:31 -04:00
Kirill Okhotnikov c78144e1c7 [libc][math] Improved performance of exp2f function.
New exp2 function algorithm:
1) Improved performance: 8.176 vs 15.270 by core-math perf tool.
2) Improved accuracy. Only two special values left.
3) Lookup table size reduced by a factor of two.

Differential Revision: https://reviews.llvm.org/D129005
2022-07-28 10:57:16 +02:00
Tue Ly 15b9380dfd [libc] Change sinf range reduction to mod pi/16 to be shared with cosf.
Change `sinf` range reduction to mod pi/16 to be shared with `cosf`.

Previously, `sinf` used range reduction `mod pi`, but this cannot be used to implement `cosf` since the minimax algorithm for `cosf` does not converge due to critical points at `pi/2`.  In order to be able to share the same range reduction functions for both `sinf` and `cosf`, we change the range reduction to `mod pi/16` for the following reasons:
- The table size is sufficiently small: 32 entries for `sin(k * pi/16)` with `k = 0..31`.  It could be reduced to 16 entries if we treat the final sign separately, with an extra multiplication at the end.
- The polynomials' degrees are reduced to 7/8 from 15, with extra computations to combine `sin` and `cos` using the trigonometric angle-sum identities.
- The number of exceptional cases is reduced to 2 (with FMA) and 3 (without FMA).
- The latency is reduced while maintaining similar throughput as before.
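
A rough Python sketch of how one `mod pi/16` reduction can serve both `sin` and `cos`. The names `SIN_K` and `sincos_sketch` are hypothetical, `math.sin`/`math.cos` stand in for the low-degree polynomials on the small argument, and reading `cos(k * pi/16)` from the same table via an index shift of 8 is an assumption of this sketch, not something the commit message states.

```python
import math

# Hypothetical 32-entry table: sin(k * pi/16) for k = 0..31.
SIN_K = [math.sin(k * math.pi / 16) for k in range(32)]

def sincos_sketch(x: float) -> tuple[float, float]:
    """Return approximations of (sin(x), cos(x)) from one shared reduction."""
    scaled = x * (16 / math.pi)
    k = round(scaled)                 # nearest integer
    y = scaled - k                    # -0.5 <= y <= 0.5
    sin_k = SIN_K[k % 32]
    cos_k = SIN_K[(k + 8) % 32]       # cos(t) = sin(t + pi/2)
    # Assumption: library calls replace the degree-7/8 polynomials.
    sin_y = math.sin(y * math.pi / 16)
    cos_y = math.cos(y * math.pi / 16)
    # Angle-sum identities combine the table values with the small-argument values.
    sin_x = sin_k * cos_y + cos_k * sin_y
    cos_x = cos_k * cos_y - sin_k * sin_y
    return sin_x, cos_x
```

Both outputs come from the same `k` and `y`, which is what makes the reduction shareable between the two functions.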

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D130629
2022-07-27 12:23:36 -04:00
Tue Ly 628fbbef81 [libc] Use nearest_integer instructions to improve expm1f performance.
Use nearest_integer instructions to improve expm1f performance.

Performance tests with CORE-MATH's perf tool:

Before the patch:
```
$ ./perf.sh expm1f
LIBC-location: /home/lnt/experiment/llvm/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH reciprocal throughput   : 10.096
System LIBC reciprocal throughput : 44.036
LIBC reciprocal throughput        : 11.575

$ ./perf.sh expm1f --latency
LIBC-location: /home/lnt/experiment/llvm/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH latency   : 42.239
System LIBC latency : 122.815
LIBC latency        : 50.122
```
After the patch:
```
$ ./perf.sh expm1f
LIBC-location: /home/lnt/experiment/llvm/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH reciprocal throughput   : 10.046
System LIBC reciprocal throughput : 43.899
LIBC reciprocal throughput        : 9.179

$ ./perf.sh expm1f --latency
LIBC-location: /home/lnt/experiment/llvm/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH latency   : 42.078
System LIBC latency : 120.488
LIBC latency        : 41.528
```

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D130502
2022-07-26 09:12:37 -04:00
Tue Ly 91ee672062 [libc] Use nearest_integer instructions to improve expf performance.
Use nearest_integer instructions to improve expf performance.

Performance tests with CORE-MATH's perf tool:

Before the patch:
```
$ ./perf.sh expf
LIBC-location: /home/lnt/experiment/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH reciprocal throughput   : 9.860
System LIBC reciprocal throughput : 7.728
LIBC reciprocal throughput        : 12.363

$ ./perf.sh expf --latency
LIBC-location: /home/lnt/experiment/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH latency   : 42.802
System LIBC latency : 35.941
LIBC latency        : 49.808
```

After the patch:
```
$ ./perf.sh expf
LIBC-location: /home/lnt/experiment/llvm/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH reciprocal throughput   : 9.441
System LIBC reciprocal throughput : 7.382
LIBC reciprocal throughput        : 8.843

$ ./perf.sh expf --latency
LIBC-location: /home/lnt/experiment/llvm/llvm-project/build/projects/libc/lib/libllvmlibc.a
GNU libc version: 2.31
GNU libc release: stable
CORE-MATH latency   : 44.192
System LIBC latency : 37.693
LIBC latency        : 44.145
```

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D130498
2022-07-26 09:11:27 -04:00
Tue Ly d883a4ad02 [libc] Implement sinf function that is correctly rounded to all rounding modes.
Implement sinf function that is correctly rounded to all rounding modes.

- We use a simple range reduction for `pi/16 < |x|` :
    Let `k = round(x / pi)` and `y = (x/pi) - k`.
    So `k` is an integer and `-0.5 <= y <= 0.5`.
Then
```
sin(x) = sin(y*pi + k*pi)
          = (-1)^(k & 1) * sin(y*pi)
          ~ (-1)^(k & 1) * y * P(y^2)
```
    where `y*P(y^2)` is a degree-15 minimax polynomial generated by Sollya with:
```
> P = fpminimax(sin(x*pi)/x, [|0, 2, 4, 6, 8, 10, 12, 14|], [|D...|], [0, 0.5]);
```
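
The structure `(-1)^(k & 1) * y * P(y^2)` can be sketched in Python as below. As an assumption for illustration, the coefficients are the Taylor coefficients of `sin(pi*y)/y` rather than the Sollya minimax coefficients (which the commit message does not list), so the accuracy is only around 1e-11 near `|y| = 0.5`. The names `COEFFS` and `sin_mod_pi_sketch` are hypothetical.

```python
import math

# Stand-in coefficients of P: Taylor series of sin(pi*y)/y in powers of y^2,
# i.e. (-1)^i * pi^(2i+1) / (2i+1)! for i = 0..7 (degree 14 in y).
COEFFS = [
    (-1) ** i * math.pi ** (2 * i + 1) / math.factorial(2 * i + 1)
    for i in range(8)
]

def sin_mod_pi_sketch(x: float) -> float:
    """sin(x) via k = round(x/pi), y = x/pi - k, and an odd polynomial in y."""
    scaled = x / math.pi
    k = round(scaled)          # integer
    y = scaled - k             # -0.5 <= y <= 0.5
    y2 = y * y
    p = 0.0
    for c in reversed(COEFFS): # Horner evaluation of P(y^2)
        p = p * y2 + c
    sign = -1.0 if (k & 1) else 1.0
    return sign * y * p        # degree-15 odd polynomial in y
```

Because `P` is evaluated in `y^2` and multiplied by `y` at the end, the approximation is odd in `y` by construction, matching the symmetry of `sin`.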

- Performance benchmark using perf tool from CORE-MATH project
(https://gitlab.inria.fr/core-math/core-math/-/tree/master) on Ryzen 1700:
Before this patch (not correctly rounded):
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinf
CORE-MATH reciprocal throughput   : 17.892
System LIBC reciprocal throughput : 25.559
LIBC reciprocal throughput        : 29.381
```
After this patch (correctly rounded):
```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinf
CORE-MATH reciprocal throughput   : 17.896
System LIBC reciprocal throughput : 25.740

LIBC reciprocal throughput        : 27.872
LIBC reciprocal throughput        : 20.012     (with `-msse4.2` flag)
LIBC reciprocal throughput        : 14.244     (with `-mfma` flag)
```

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D123154
2022-07-22 10:07:31 -04:00
Kirill Okhotnikov 5358457089 [libc][docs] Added fmod performance results. 2022-06-27 19:31:54 +02:00
Kirill Okhotnikov b8e8012aa2 [libc][math] fmod/fmodf implementation.
This is an implementation of the floating-point remainder function fmod from the standard libm.
The underlying algorithm was developed by me, though it has probably been
invented before.
Some features of the implementation:
1. The code is written in more-or-less modern C++.
2. One general implementation for both float and double precision numbers.
3. Platform/architecture dependent and independent code and tests are split.
4. Tests cover 100% of the code for both float and double numbers. Test cases with NaN/Inf etc. are copied from glibc.
5. The new implementation is in general 2-4 times faster for “regular” x,y values. It can be 20 times faster when x/y is huge, but can also be 2 times slower in the double denormalized range (according to the perf tests provided).
6. Two different implementations of the division loop are provided. On some platforms division can be a very time-consuming operation. Depending on the platform it can be 3-10 times slower than multiplication.
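
For readers who want a reference to check an fmod implementation against, here is a minimal Python sketch. It is not the algorithm from this commit (the message does not describe it in detail); it is a generic exact computation that assumes Python's arbitrary-precision integers in place of any division loop. The names `decompose` and `fmod_sketch` are hypothetical.

```python
import math

def decompose(v: float) -> tuple[int, int]:
    """Return integers (m, e) with v == m * 2**e, for finite v >= 0."""
    frac, exp = math.frexp(v)              # v == frac * 2**exp, 0.5 <= frac < 1
    return int(frac * (1 << 53)), exp - 53 # frac * 2**53 is an exact integer

def fmod_sketch(x: float, y: float) -> float:
    """Exact remainder of x / y with the sign of x, like C's fmod."""
    if math.isnan(x) or math.isnan(y) or math.isinf(x) or y == 0:
        return math.nan
    if math.isinf(y):
        return x
    mx, ex = decompose(abs(x))
    my, ey = decompose(abs(y))
    if ex >= ey:
        # Bring x to y's exponent; the shift can be thousands of bits wide.
        r, e = (mx << (ex - ey)) % my, ey
    else:
        r, e = mx % (my << (ey - ex)), ex
    return math.copysign(math.ldexp(r, e), x)
```

The remainder is always exactly representable, so no rounding occurs; the wide shift in the `ex >= ey` branch corresponds to the case where x/y is huge.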

Performance tests:

The test is based on the core-math project (https://gitlab.inria.fr/core-math/core-math). At Tue Ly's suggestion I took the hypot function and used it as a template for fmod, preserving all test cases.

`./check.sh <--special|--worst> fmodf` passed.
`CORE_MATH_PERF_MODE=rdtsc ./perf.sh fmodf` results are

```
GNU libc version: 2.35
GNU libc release: stable
21.166 <-- FPU
51.031 <-- current glibc
37.659 <-- this fmod version.
```
2022-06-24 23:09:14 +02:00
Tue Ly 6441bfb886 [libc][Obvious] Fix hyperlink and typo in math status page. 2022-06-17 09:35:51 -04:00
Tue Ly 72c1effb34 [libc] Add a status page for math functions.
Add a status page for math functions.

Reviewed By: sivachandra

Differential Revision: https://reviews.llvm.org/D127920
2022-06-16 17:41:46 -04:00