add cuda samples and cuda benchmarks
parent c5337cdf43
commit d6d28ed7e0
100
ReadMe.md
@@ -20,10 +20,108 @@ CUDA GPU programming study notes, based on 《CUDA 编程——基础与实践》(樊哲
14. [CUDA Standard Libraries](./capter14/ReadMe.md)


CUDA official documentation:
## CUDA Official Documentation

[CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)
[CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html)
[CUDA Runtime API Reference](https://docs.nvidia.com/cuda/cuda-runtime-api/index.html)
[CUDA Math API Reference](https://docs.nvidia.com/cuda/cuda-math-api/index.html)

## CUDA Programming Examples

[CUDA Samples](https://github.com/NVIDIA/cuda-samples)
+ Simple Reference
Basic CUDA samples for beginners that illustrate key concepts of using CUDA and the CUDA runtime APIs.
+ Utilities Reference
Utility samples that demonstrate how to query device capabilities and measure GPU/CPU bandwidth.
+ Graphics Reference
Graphics samples that demonstrate interoperability between CUDA, OpenGL, and DirectX.
+ Imaging Reference
Image processing, compression, and data analysis.
+ Finance Reference
Parallel processing for financial computations.
+ Simulations Reference
Simulation algorithms implemented with CUDA.
+ Advanced Reference
Advanced algorithms implemented with CUDA.
+ Cudalibraries Reference
Samples that show how to use the CUDA libraries (NPP, cuBLAS, cuFFT, cuSPARSE, and cuRAND).

## CUDA Performance Benchmarks

[CUDA Benchmarks](https://github.com/ekondis/mixbench)

+ Four types of experiments are executed, each combined with global memory accesses:
Single precision Flops (multiply-additions)
Double precision Flops (multiply-additions)
Half precision Flops (multiply-additions)
Integer multiply-addition operations

+ Building is now based on CMake files. Each implementation resides in a separate folder:
CUDA implementation: mixbench-cuda
OpenCL implementation: mixbench-opencl
HIP implementation: mixbench-hip
SYCL implementation: mixbench-sycl
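
To make "multiply-additions combined with global memory accesses" concrete, here is a minimal host-side C++ sketch of the idea. It is not mixbench's kernel: the names `MixedCounts` and `run_mixed`, the `seed` parameter, and the counting rules (2 operations per multiply-addition, one load and one store per element) are all assumptions made for illustration, and mixbench's own accounting may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Operation and byte counts for one pass over the buffer.
struct MixedCounts {
    double ops;    // arithmetic operations (a multiply-addition counts as 2)
    double bytes;  // bytes moved to and from memory
    double ops_per_byte() const { return ops / bytes; }
};

// Hypothetical CPU analogue of a compute/memory "mixed" kernel.
// This is NOT mixbench's kernel; names and counting rules are assumptions.
template <typename T>
MixedCounts run_mixed(std::vector<T>& data, int compute_iters, T seed) {
    assert(compute_iters >= 0);
    for (T& x : data) {
        const T loaded = x;                 // one load per element
        T acc = loaded;
        for (int i = 0; i < compute_iters; ++i) {
            acc = acc * seed + loaded;      // one multiply-addition
        }
        x = acc;                            // one store per element
    }
    const double n = static_cast<double>(data.size());
    return MixedCounts{2.0 * compute_iters * n, 2.0 * sizeof(T) * n};
}
```

Calling `run_mixed` on a `std::vector<float>` with a growing `compute_iters` raises `ops_per_byte()` while the bytes moved stay constant, which is the compute-versus-memory trade-off the benchmark sweeps over.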

The generated test results look like this:
```
mixbench/read-only (v0.03-2-gbccfd71)
------------------------ Device specifications ------------------------
Device: GeForce RTX 2070
CUDA driver version: 10.20
GPU clock rate: 1620 MHz
Memory clock rate: 3500 MHz
Memory bus width: 256 bits
WarpSize: 32
L2 cache size: 4096 KB
Total global mem: 7979 MB
ECC enabled: No
Compute Capability: 7.5
Total SPs: 2304 (36 MPs x 64 SPs/MP)
Compute throughput: 7464.96 GFlops (theoretical single precision FMAs)
Memory bandwidth: 448.06 GB/sec
-----------------------------------------------------------------------
Total GPU memory 8366784512, free 7941521408
Buffer size: 256MB
Trade-off type: compute with global memory (block strided)
Elements per thread: 8
Thread fusion degree: 4
----------------------------------------------------------------------------- CSV data -----------------------------------------------------------------------------
Experiment ID, Single Precision ops,,,, Double precision ops,,,, Half precision ops,,,, Integer operations,,,
Compute iters, Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOPS, GB/sec, Iops/byte, ex.time, GIOPS, GB/sec
0, 0.250, 0.32, 104.42, 417.68, 0.125, 0.63, 53.04, 424.35, 0.500, 0.32, 211.41, 422.81, 0.250, 0.32, 105.58, 422.30
1, 0.750, 0.32, 316.34, 421.79, 0.375, 0.63, 158.69, 423.18, 1.500, 0.32, 634.22, 422.81, 0.750, 0.32, 317.30, 423.07
2, 1.250, 0.32, 528.46, 422.77, 0.625, 0.78, 215.91, 345.45, 2.500, 0.32, 1055.97, 422.39, 1.250, 0.32, 528.57, 422.86
3, 1.750, 0.32, 738.81, 422.17, 0.875, 1.08, 218.17, 249.34, 3.500, 0.32, 1478.95, 422.56, 1.750, 0.32, 740.59, 423.20
4, 2.250, 0.32, 951.33, 422.81, 1.125, 1.38, 219.57, 195.17, 4.500, 0.32, 1902.66, 422.81, 2.250, 0.32, 950.66, 422.51
5, 2.750, 0.32, 1162.74, 422.81, 1.375, 1.67, 220.38, 160.28, 5.500, 0.32, 2328.52, 423.37, 2.750, 0.32, 1162.74, 422.81
6, 3.250, 0.32, 1374.56, 422.94, 1.625, 1.97, 220.99, 135.99, 6.500, 0.32, 2756.62, 424.10, 3.250, 0.32, 1375.81, 423.32
7, 3.750, 0.32, 1592.45, 424.65, 1.875, 2.27, 221.38, 118.07, 7.500, 0.32, 3169.50, 422.60, 3.750, 0.32, 1585.55, 422.81
8, 4.250, 0.32, 1796.95, 422.81, 2.125, 2.57, 221.71, 104.33, 8.500, 0.32, 3587.76, 422.09, 4.250, 0.37, 1545.63, 363.68
9, 4.750, 0.32, 2006.34, 422.39, 2.375, 2.87, 221.85, 93.41, 9.500, 0.32, 3995.38, 420.57, 4.750, 0.32, 1998.29, 420.69
10, 5.250, 0.32, 2209.52, 420.86, 2.625, 3.17, 222.02, 84.58, 10.500, 0.32, 4439.54, 422.81, 5.250, 0.32, 2220.44, 422.94
11, 5.750, 0.32, 2434.12, 423.32, 2.875, 3.47, 222.17, 77.28, 11.500, 0.32, 4855.01, 422.17, 5.750, 0.32, 2426.77, 422.05
12, 6.250, 0.32, 2638.06, 422.09, 3.125, 3.78, 222.18, 71.10, 12.500, 0.32, 5227.20, 418.18, 6.250, 0.38, 2202.15, 352.34
13, 6.750, 0.32, 2841.95, 421.03, 3.375, 4.08, 222.30, 65.87, 13.500, 0.32, 5712.58, 423.15, 6.750, 0.32, 2850.54, 422.30
14, 7.250, 0.32, 3065.39, 422.81, 3.625, 4.37, 222.45, 61.36, 14.500, 0.32, 6135.74, 423.15, 7.250, 0.32, 3065.08, 422.77
15, 7.750, 0.33, 3143.40, 405.60, 3.875, 4.67, 222.57, 57.44, 15.500, 0.32, 6546.34, 422.34, 7.750, 0.32, 3268.89, 421.79
16, 8.250, 0.32, 3482.59, 422.13, 4.125, 4.98, 222.57, 53.96, 16.500, 0.32, 6957.48, 421.67, 8.250, 0.39, 2803.68, 339.84
17, 8.750, 0.32, 3693.66, 422.13, 4.375, 5.28, 222.53, 50.86, 17.500, 0.32, 7396.24, 422.64, 8.750, 0.32, 3694.77, 422.26
18, 9.250, 0.32, 3901.58, 421.79, 4.625, 5.58, 222.58, 48.12, 18.500, 0.32, 7786.72, 420.90, 9.250, 0.32, 3897.66, 421.37
20, 10.250, 0.32, 4312.53, 420.73, 5.125, 6.18, 222.66, 43.45, 20.500, 0.32, 8640.66, 421.50, 10.250, 0.41, 3374.54, 329.22
22, 11.250, 0.32, 4729.94, 420.44, 5.625, 6.78, 222.74, 39.60, 22.500, 0.32, 9452.31, 420.10, 11.250, 0.32, 4734.21, 420.82
24, 12.250, 0.32, 5148.83, 420.31, 6.125, 7.36, 223.51, 36.49, 24.500, 0.32,10346.40, 422.30, 12.250, 0.42, 3900.12, 318.38
28, 14.250, 0.32, 6009.94, 421.75, 7.125, 8.53, 224.23, 31.47, 28.500, 0.32,11975.32, 420.19, 14.250, 0.44, 4368.11, 306.53
32, 16.250, 0.32, 6795.36, 418.18, 8.125, 9.72, 224.31, 27.61, 32.500, 0.32,13605.64, 418.64, 16.250, 0.45, 4797.12, 295.21
40, 20.250, 0.34, 7899.43, 390.10, 10.125, 12.11, 224.50, 22.17, 40.500, 0.33,16371.37, 404.23, 20.250, 0.50, 5464.85, 269.87
48, 24.250, 0.41, 8029.04, 331.09, 12.125, 14.49, 224.58, 18.52, 48.500, 0.40,16468.89, 339.56, 24.250, 0.54, 5986.22, 246.85
56, 28.250, 0.47, 8114.58, 287.24, 14.125, 16.88, 224.65, 15.90, 56.500, 0.46,16443.12, 291.03, 28.250, 0.60, 6342.42, 224.51
64, 32.250, 0.53, 8154.47, 252.85, 16.125, 19.26, 224.72, 13.94, 64.500, 0.52,16536.22, 256.38, 32.250, 0.66, 6591.93, 204.40
80, 40.250, 0.66, 8242.80, 204.79, 20.125, 24.03, 224.79, 11.17, 80.500, 0.65,16644.88, 206.77, 40.250, 0.78, 6909.54, 171.67
96, 48.250, 0.78, 8321.35, 172.46, 24.125, 28.80, 224.85, 9.32, 96.500, 0.78,16685.23, 172.90, 48.250, 0.91, 7108.62, 147.33
128, 64.250, 1.03, 8337.22, 129.76, 32.125, 38.34, 224.91, 7.00, 128.500, 1.03,16775.65, 130.55, 64.250, 1.18, 7295.18, 113.54
192, 96.250, 1.54, 8414.49, 87.42, 48.125, 57.42, 224.97, 4.67, 192.500, 1.53,16847.93, 87.52, 96.250, 1.74, 7431.64, 77.21
256, 128.250, 2.06, 8362.01, 65.20, 64.125, 76.50, 225.02, 3.51, 256.500, 2.06,16693.65, 65.08, 128.250, 2.30, 7477.75, 58.31
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
```
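
The numbers in the sample output can be cross-checked with a little arithmetic. The sketch below is a host-only C++ illustration; the function names `peak_gflops`, `gflops_from_bandwidth`, and `ridge_point` are hypothetical helpers, not part of mixbench or CUDA. It assumes one FMA (2 flops) per SP per clock for the theoretical peak and uses the simple roofline relation, throughput = flops/byte × bandwidth, for the bandwidth-bound rows.

```cpp
#include <cassert>
#include <cmath>

// Theoretical single-precision peak in GFLOPS, assuming every SP
// retires one FMA (2 flops) per clock cycle.
double peak_gflops(int total_sps, double clock_mhz) {
    assert(total_sps > 0 && clock_mhz > 0.0);
    return total_sps * clock_mhz * 2.0 / 1000.0;
}

// Throughput implied by an operational intensity (flops/byte)
// and an achieved memory bandwidth (GB/sec).
double gflops_from_bandwidth(double flops_per_byte, double gb_per_sec) {
    return flops_per_byte * gb_per_sec;
}

// Operational intensity above which a kernel is no longer
// bandwidth-bound in the simple roofline model.
double ridge_point(double peak_gflops_value, double bandwidth_gb_per_sec) {
    assert(bandwidth_gb_per_sec > 0.0);
    return peak_gflops_value / bandwidth_gb_per_sec;
}
```

With the device specifications listed above, `peak_gflops(2304, 1620.0)` reproduces the reported 7464.96 GFlops, and `gflops_from_bandwidth(0.250, 417.68)` reproduces the 104.42 GFLOPS in the first single-precision row. `ridge_point(7464.96, 448.06)` comes out between 16.6 and 16.7 flops/byte, which sits close to where the single-precision GB/sec column starts to fall in the table.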
@@ -68,7 +68,8 @@ CUDA timing method based on CUDA events:

>> nvprof ./bin/clock

(No output.)
If there is no output, add the directory containing `nvprof` to your environment variables (such as `PATH`). Note that GPUs with compute capability 7.5 or higher are not supported.
The new-generation profiling tool is recommended instead: [Nvidia Nsight Systems](https://developer.nvidia.com/zh-cn/nsight-systems).

------
