diff --git a/ReadMe.md b/ReadMe.md
index ccfcfb8..2103b2f 100644
--- a/ReadMe.md
+++ b/ReadMe.md
@@ -20,10 +20,108 @@ CUDA GPU programming notes, based on 《CUDA 编程——基础与实践》(樊哲
 14. [CUDA standard libraries](./capter14/ReadMe.md)

-Official CUDA documentation:
+## Official CUDA documentation
+
 [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)

 [CUDA C++ Best Practices Guide](https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html)

 [CUDA Runtime API reference](https://docs.nvidia.com/cuda/cuda-runtime-api/index.html)

 [CUDA Math API reference](https://docs.nvidia.com/cuda/cuda-math-api/index.html)
+## CUDA programming examples
+
+[CUDA Samples](https://github.com/NVIDIA/cuda-samples)
++ Simple Reference
+Basic CUDA samples for beginners that illustrate key concepts of using CUDA and the CUDA runtime APIs.
++ Utilities Reference
+Utility samples that show how to query device capabilities and measure GPU/CPU bandwidth.
++ Graphics Reference
+Graphics samples that demonstrate interoperability among CUDA, OpenGL, and DirectX.
++ Imaging Reference
+Image processing, compression, and data analysis.
++ Finance Reference
+Parallel processing for financial computations.
++ Simulations Reference
+Simulation algorithms implemented with CUDA.
++ Advanced Reference
+Advanced algorithms implemented with CUDA.
++ Cudalibraries Reference
+Samples that show how to use the CUDA libraries (NPP, cuBLAS, cuFFT, cuSPARSE, and cuRAND).
+
+## CUDA benchmarks
+
+[CUDA Benchmarks](https://github.com/ekondis/mixbench)
+
++ Four types of experiments are executed, each combined with global memory accesses:
+Single precision Flops (multiply-additions)
+Double precision Flops (multiply-additions)
+Half precision Flops (multiply-additions)
+Integer multiply-addition operations
+
++ The build is now based on CMake files. Each implementation resides in a separate folder:
+CUDA implementation: mixbench-cuda
+OpenCL implementation: mixbench-opencl
+HIP implementation: mixbench-hip
+SYCL implementation: mixbench-sycl
+
+The generated results look similar to:
+```
+mixbench/read-only (v0.03-2-gbccfd71)
+------------------------ Device specifications ------------------------
+Device: GeForce RTX 2070
+CUDA driver version: 10.20
+GPU clock rate: 1620 MHz
+Memory clock rate: 3500 MHz
+Memory bus width: 256 bits
+WarpSize: 32
+L2 cache size: 4096 KB
+Total global mem: 7979 MB
+ECC enabled: No
+Compute Capability: 7.5
+Total SPs: 2304 (36 MPs x 64 SPs/MP)
+Compute throughput: 7464.96 GFlops (theoretical single precision FMAs)
+Memory bandwidth: 448.06 GB/sec
+-----------------------------------------------------------------------
+Total GPU memory 8366784512, free 7941521408
+Buffer size: 256MB
+Trade-off type: compute with global memory (block strided)
+Elements per thread: 8
+Thread fusion degree: 4
+----------------------------------------------------------------------------- CSV data -----------------------------------------------------------------------------
+Experiment ID, Single Precision ops,,,, Double precision ops,,,, Half precision ops,,,, Integer operations,,,
+Compute iters, Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOPS, GB/sec, Flops/byte, ex.time, GFLOPS, GB/sec, Iops/byte, ex.time, GIOPS, GB/sec
+ 0, 0.250, 0.32, 104.42, 417.68, 0.125, 0.63, 53.04, 424.35, 0.500, 0.32, 211.41, 422.81, 0.250, 0.32, 105.58, 422.30
+ 1, 0.750, 0.32, 316.34, 421.79, 0.375, 0.63, 158.69, 423.18, 1.500, 0.32, 634.22, 422.81, 0.750, 0.32, 317.30, 423.07
+ 2, 1.250, 0.32, 528.46, 422.77, 0.625, 0.78, 215.91, 345.45, 2.500, 0.32, 1055.97, 422.39, 1.250, 0.32, 528.57, 422.86
+ 3, 1.750, 0.32, 738.81, 422.17, 0.875, 1.08, 218.17, 249.34, 3.500, 0.32, 1478.95, 422.56, 1.750, 0.32, 740.59, 423.20
+ 4, 2.250, 0.32, 951.33, 422.81, 1.125, 1.38, 219.57, 195.17, 4.500, 0.32, 1902.66, 422.81, 2.250, 0.32, 950.66, 422.51
+ 5, 2.750, 0.32, 1162.74, 422.81, 1.375, 1.67, 220.38, 160.28, 5.500, 0.32, 2328.52, 423.37, 2.750, 0.32, 1162.74, 422.81
+ 6, 3.250, 0.32, 1374.56, 422.94, 1.625, 1.97, 220.99, 135.99, 6.500, 0.32, 2756.62, 424.10, 3.250, 0.32, 1375.81, 423.32
+ 7, 3.750, 0.32, 1592.45, 424.65, 1.875, 2.27, 221.38, 118.07, 7.500, 0.32, 3169.50, 422.60, 3.750, 0.32, 1585.55, 422.81
+ 8, 4.250, 0.32, 1796.95, 422.81, 2.125, 2.57, 221.71, 104.33, 8.500, 0.32, 3587.76, 422.09, 4.250, 0.37, 1545.63, 363.68
+ 9, 4.750, 0.32, 2006.34, 422.39, 2.375, 2.87, 221.85, 93.41, 9.500, 0.32, 3995.38, 420.57, 4.750, 0.32, 1998.29, 420.69
+ 10, 5.250, 0.32, 2209.52, 420.86, 2.625, 3.17, 222.02, 84.58, 10.500, 0.32, 4439.54, 422.81, 5.250, 0.32, 2220.44, 422.94
+ 11, 5.750, 0.32, 2434.12, 423.32, 2.875, 3.47, 222.17, 77.28, 11.500, 0.32, 4855.01, 422.17, 5.750, 0.32, 2426.77, 422.05
+ 12, 6.250, 0.32, 2638.06, 422.09, 3.125, 3.78, 222.18, 71.10, 12.500, 0.32, 5227.20, 418.18, 6.250, 0.38, 2202.15, 352.34
+ 13, 6.750, 0.32, 2841.95, 421.03, 3.375, 4.08, 222.30, 65.87, 13.500, 0.32, 5712.58, 423.15, 6.750, 0.32, 2850.54, 422.30
+ 14, 7.250, 0.32, 3065.39, 422.81, 3.625, 4.37, 222.45, 61.36, 14.500, 0.32, 6135.74, 423.15, 7.250, 0.32, 3065.08, 422.77
+ 15, 7.750, 0.33, 3143.40, 405.60, 3.875, 4.67, 222.57, 57.44, 15.500, 0.32, 6546.34, 422.34, 7.750, 0.32, 3268.89, 421.79
+ 16, 8.250, 0.32, 3482.59, 422.13, 4.125, 4.98, 222.57, 53.96, 16.500, 0.32, 6957.48, 421.67, 8.250, 0.39, 2803.68, 339.84
+ 17, 8.750, 0.32, 3693.66, 422.13, 4.375, 5.28, 222.53, 50.86, 17.500, 0.32, 7396.24, 422.64, 8.750, 0.32, 3694.77, 422.26
+ 18, 9.250, 0.32, 3901.58, 421.79, 4.625, 5.58, 222.58, 48.12, 18.500, 0.32, 7786.72, 420.90, 9.250, 0.32, 3897.66, 421.37
+ 20, 10.250, 0.32, 4312.53, 420.73, 5.125, 6.18, 222.66, 43.45, 20.500, 0.32, 8640.66, 421.50, 10.250, 0.41, 3374.54, 329.22
+ 22, 11.250, 0.32, 4729.94, 420.44, 5.625, 6.78, 222.74, 39.60, 22.500, 0.32, 9452.31, 420.10, 11.250, 0.32, 4734.21, 420.82
+ 24, 12.250, 0.32, 5148.83, 420.31, 6.125, 7.36, 223.51, 36.49, 24.500, 0.32,10346.40, 422.30, 12.250, 0.42, 3900.12, 318.38
+ 28, 14.250, 0.32, 6009.94, 421.75, 7.125, 8.53, 224.23, 31.47, 28.500, 0.32,11975.32, 420.19, 14.250, 0.44, 4368.11, 306.53
+ 32, 16.250, 0.32, 6795.36, 418.18, 8.125, 9.72, 224.31, 27.61, 32.500, 0.32,13605.64, 418.64, 16.250, 0.45, 4797.12, 295.21
+ 40, 20.250, 0.34, 7899.43, 390.10, 10.125, 12.11, 224.50, 22.17, 40.500, 0.33,16371.37, 404.23, 20.250, 0.50, 5464.85, 269.87
+ 48, 24.250, 0.41, 8029.04, 331.09, 12.125, 14.49, 224.58, 18.52, 48.500, 0.40,16468.89, 339.56, 24.250, 0.54, 5986.22, 246.85
+ 56, 28.250, 0.47, 8114.58, 287.24, 14.125, 16.88, 224.65, 15.90, 56.500, 0.46,16443.12, 291.03, 28.250, 0.60, 6342.42, 224.51
+ 64, 32.250, 0.53, 8154.47, 252.85, 16.125, 19.26, 224.72, 13.94, 64.500, 0.52,16536.22, 256.38, 32.250, 0.66, 6591.93, 204.40
+ 80, 40.250, 0.66, 8242.80, 204.79, 20.125, 24.03, 224.79, 11.17, 80.500, 0.65,16644.88, 206.77, 40.250, 0.78, 6909.54, 171.67
+ 96, 48.250, 0.78, 8321.35, 172.46, 24.125, 28.80, 224.85, 9.32, 96.500, 0.78,16685.23, 172.90, 48.250, 0.91, 7108.62, 147.33
+ 128, 64.250, 1.03, 8337.22, 129.76, 32.125, 38.34, 224.91, 7.00, 128.500, 1.03,16775.65, 130.55, 64.250, 1.18, 7295.18, 113.54
+ 192, 96.250, 1.54, 8414.49, 87.42, 48.125, 57.42, 224.97, 4.67, 192.500, 1.53,16847.93, 87.52, 96.250, 1.74, 7431.64, 77.21
+ 256, 128.250, 2.06, 8362.01, 65.20, 64.125, 76.50, 225.02, 3.51, 256.500, 2.06,16693.65, 65.08, 128.250, 2.30, 7477.75, 58.31
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------
+```
\ No newline at end of file
diff --git a/capter5/ReadMe.md b/capter5/ReadMe.md
index de0ced2..4454875 100644
--- a/capter5/ReadMe.md
+++ b/capter5/ReadMe.md
@@ -68,7 +68,8 @@ Timing in CUDA based on CUDA events:

 >> nvprof ./bin/clock

-(No output.)
+If there is no output, add the directory that contains `nvprof` to your environment variables (`nvprof` does not support GPUs with compute capability above 7.5).
+The newer-generation profiling tool [NVIDIA Nsight Systems](https://developer.nvidia.com/zh-cn/nsight-systems) is recommended instead.

 ------