100 Commits

Author SHA1 Message Date
David Addison fae7cb4727
Merge pull request #316 from martin-belanger/print-program-name
Print the name of the program being executed before and after test output
2025-07-24 14:58:54 -07:00
David Addison 6edafa0a9c Add extra reserved space during maxBytes calculation
Also, don't allow minBytes > maxBytes
2025-07-23 16:19:37 -07:00
David Addison def2d3689c Minor fix to Makefile
Move comments to separate lines
2025-07-23 16:04:30 -07:00
David Addison 97ee098516 Add Turing (SM75) support to CUDA 13.0 builds 2025-06-04 17:54:58 -07:00
David Addison e7c8825b0b Wrap ncclCommWindowRegister() calls within ncclGroup 2025-06-03 10:36:53 -07:00
Martin Belanger dafb70408d Print the name of the program being executed
One thing missing from the stdout of each performance test is
the name of the test that is actually being run.

This patch adds 2 new messages to the stdout. At the beginning
of the execution of a test (e.g. sendrecv_perf) we will now
see this message:

  Collective test starting: sendrecv_perf

And at the end, we will now see this:

  Collective test concluded: sendrecv_perf

This is needed when running several tests consecutively and we're
trying to parse the stdout to collect the results.

For example, using a Python script to parse the stdout, one could
retrieve the results for each test and plot them on a graph. This
patch makes it easier to implement such a script.

Signed-off-by: Martin Belanger <martin.belanger@dell.com>
2025-06-03 11:43:02 -04:00
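
The parsing use case described in the commit message above could look roughly like the following. This is a minimal sketch, not the script the author had in mind: the function name `split_by_test` is hypothetical, and only the two marker strings come from the commit message.

```python
import re

# Marker formats as quoted in the commit message; leading indentation is tolerated.
START = re.compile(r"^\s*Collective test starting: (\S+)\s*$")
END = re.compile(r"^\s*Collective test concluded: (\S+)\s*$")


def split_by_test(stdout_text):
    """Return {test_name: [output lines]} for each starting/concluded pair.

    If the same test name appears twice, the later run replaces the earlier one.
    Lines outside any starting/concluded pair are ignored.
    """
    sections = {}
    current = None
    for line in stdout_text.splitlines():
        start = START.match(line)
        end = END.match(line)
        if start:
            current = start.group(1)
            sections[current] = []
        elif end and end.group(1) == current:
            current = None
        elif current is not None:
            sections[current].append(line)
    return sections
```

Each value in the returned dictionary can then be handed to whatever code extracts the bandwidth columns for plotting.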
David Addison 5290298ab6 Reinstate Pascal support for CUDA 12.8+ builds 2025-06-02 09:29:52 -07:00
David Addison 8bc16f4e01 Need to drop Volta (sm_70) support from CUDA 13.0 2025-05-30 18:04:25 -07:00
David Addison 0c60e6a8e4 Fix formatting errors in README.md 2025-05-30 17:43:30 -07:00
David Addison a5c539e68b Add support for Symmetric Memory Registration
From NCCL 2.27.x we can now use the Symmetric Memory APIs (-R 2)
2025-05-30 17:31:34 -07:00
David Addison e041d901e6 Re-add sm_70 support for CUDA 12.8+ and 13.0 builds 2025-05-07 10:30:59 -07:00
David Addison 1021260ca9 Make verifiable a DSO and add NAME_SUFFIX support
Build option DSO=1 generates libverifiable.so which can be
used to reduce the combined binary size.

Build option NAME_SUFFIX can be used to add a suffix to all
generated binaries, e.g. NAME_SUFFIX=_mpi

Added new make target: clean_intermediates
2025-04-23 17:07:24 -07:00
David Addison 501a149d57 Add support for FP8 datatypes
Added new datatypes: f8e4m3, f8e5m2

Only supported on H100+ architectures and NCCL versions >= 2.24.0
2025-04-18 19:20:59 -07:00
David Addison b4300cc79d Add PCI domain and device ID for GPU device BDF display 2025-02-28 13:25:51 -08:00
Sylvain Jeaugey 903918fc54
Add NCCL_TESTS_SPLIT documentation in the README 2025-02-06 14:10:07 +01:00
Junyu Ma a89cf07fe8 Perftests: Introduce NCCL_TESTS_SPLIT env
`NCCL_TESTS_SPLIT` serves as a new way of computing the color for splitting communicators.

It is overridden by `NCCL_TESTS_SPLIT_MASK` when that variable is also set.

Examples:

NCCL_TESTS_SPLIT_MASK="0x7" # color = rank & 0x7. What we do today to run on a DGX with one GPU per node.
NCCL_TESTS_SPLIT="AND 0x7"  # color = rank & 0x7. New way to run on one GPU per node on a DGX, equivalent to NCCL_TESTS_SPLIT_MASK=0x7
NCCL_TESTS_SPLIT="MOD 72"   # color = rank % 72.  One GPU per NVLink domain on an NVL72 system.
NCCL_TESTS_SPLIT="DIV 72"   # color = rank / 72.  Intra NVLink domain on NVL72.

The short forms "%", "&", "|" and "/" can be used in place of the operator names.
Extra spaces between the operator and the value are ignored.
Operator names are not case sensitive.

The following are all equivalent:

NCCL_TESTS_SPLIT="&0x7"
NCCL_TESTS_SPLIT="&0b111"
NCCL_TESTS_SPLIT="AND 7"
NCCL_TESTS_SPLIT="and 0x7"
2025-02-04 15:18:09 -08:00
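
To make the color rule above concrete, here is a small sketch of how such a value might be interpreted. It is an illustration under stated assumptions, not the nccl-tests parser: the names `parse_split` and `split_color` are hypothetical, and "OR" as the long spelling of "|" is assumed because the commit message only names AND, MOD and DIV.

```python
import re

# Operator table. "AND", "MOD" and "DIV" come from the commit message;
# "OR" as the long form of "|" is an assumption.
_OPS = {
    "AND": lambda rank, v: rank & v,
    "&": lambda rank, v: rank & v,
    "OR": lambda rank, v: rank | v,
    "|": lambda rank, v: rank | v,
    "MOD": lambda rank, v: rank % v,
    "%": lambda rank, v: rank % v,
    "DIV": lambda rank, v: rank // v,
    "/": lambda rank, v: rank // v,
}

_SPEC = re.compile(r"(AND|OR|MOD|DIV|[%&|/])(0X[0-9A-F]+|0B[01]+|[0-9]+)")


def parse_split(spec):
    """Parse e.g. "AND 0x7" or "%72" into (operator function, integer value)."""
    compact = "".join(spec.split()).upper()  # drop spaces, ignore case
    match = _SPEC.fullmatch(compact)
    if match is None:
        raise ValueError(f"unrecognized split spec: {spec!r}")
    op, number = match.groups()
    if number.startswith("0X"):
        value = int(number[2:], 16)
    elif number.startswith("0B"):
        value = int(number[2:], 2)
    else:
        value = int(number, 10)
    return _OPS[op], value


def split_color(rank, spec):
    """Color for `rank` under `spec`, e.g. split_color(75, "MOD 72")."""
    op, value = parse_split(spec)
    return op(rank, value)
```

Ranks that produce the same color would end up in the same communicator after the split.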
David Addison cb6a46fdd6 Update CUDA gencodes
Add support for Blackwell sm100 and sm120 from CUDA 12.8

Add support for Hopper sm90 from CUDA 12.0
2025-01-25 17:32:16 -08:00
John Bachan 29f4114f02 Fix all tests that divide buffers by nranks so that they trim buffer sizes to multiples of 16 bytes.
This ensures that runs with a non-power-of-2 number of ranks have buffer addresses suitably aligned for performance.
2024-12-18 11:20:28 -08:00
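
The trimming described in the commit above can be illustrated with a short calculation. This is a sketch of the idea only, not the code from the tests: `per_rank_count` is a hypothetical helper, and it assumes the element size divides 16 evenly (1, 2, 4 or 8 bytes).

```python
ALIGN_BYTES = 16  # alignment named in the commit message


def per_rank_count(total_count, nranks, elt_size):
    """Elements per rank, rounded down so each rank's chunk is a multiple of 16 bytes.

    Assumes elt_size divides ALIGN_BYTES (an assumption of this sketch).
    """
    elts_per_align = max(1, ALIGN_BYTES // elt_size)
    base = total_count // nranks
    return base - (base % elts_per_align)
```

With 1000 four-byte elements over 3 ranks, the untrimmed share is 333 elements (1332 bytes), so the second rank's chunk would start at a byte offset that is not a multiple of 16. Trimming to 332 elements (1328 bytes) keeps every rank's offset 16-byte aligned.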
Sylvain Jeaugey 8dfeab9eb9
Merge pull request #259 from NVIDIA/fix-ncclstringtotype
Future-proof ncclstringtotype
2024-10-24 10:28:02 -07:00
Kamil Iskra 34d6d53910 Future-proof ncclstringtotype
Ensure that ncclstringtotype iterates only over data types known to
nccl-tests (as indicated by test_typenum), not over a potentially larger
set of all NCCL types.
2024-10-24 09:21:37 -07:00
David Addison 9d26b8422b
Merge pull request #226 from netgroup/master
improve parsing of stepbytes (increment size) argument
2024-07-30 14:58:54 -07:00
David Addison 0d86b5a6e7 Added some missing command line options to README.md
Also updated single and multi-node examples.
2024-07-30 14:50:45 -07:00
David Addison d2d40cc824 Added -N,--run_cycles option 2024-07-25 22:00:23 -07:00
David Addison 3a3f790efd
Merge pull request #240 from OrenLeung/patch-1
doc: add all2all factor
2024-07-25 22:00:06 -07:00
Oren c6eb15875f
doc: add all2all factor 2024-07-24 22:55:00 -04:00
Stefano Salsano 746549b28d
improve parsing of stepbytes (increment size) argument 2024-06-14 11:28:55 +02:00
Kaiming Ouyang d028efcf35 Change ncclCommRegister size to maxBytes in serial comm init 2024-06-06 06:54:48 -07:00
Giuseppe Congiu a1efb427e7 Add -R option to register user buffers 2024-06-03 01:04:58 -07:00
David Addison c6afef0b6f Added missing MPI_Comm_free() call before MPI_Finalize() 2024-02-05 08:53:54 -08:00
David Addison 1292b25553 Added an MPI_Barrier() call after MPI_Bcast() for HCOLL issue 2023-10-12 16:53:32 -07:00
David Addison 6c46206a47 Make the -c option be a datacheck iteration count parameter
Default is 1
2023-09-13 14:03:38 -07:00
Sylvain Jeaugey 1a5f551ffd
Merge pull request #146 from yangxingwu/master
makefile: remove extra space
2023-06-06 11:58:24 +02:00
yangxingwu 52ea1b2148 makefile: remove extra space 2023-06-06 09:47:50 +00:00
Sylvain Jeaugey e98ef24bc0
Merge pull request #135 from aavbsouza/fix_nvcc_variable_handling
fix handling of variable NVCC.
2023-03-27 11:14:10 +02:00
alan.souza 7ccda3c97b fix handling of variable NVCC. Permit overriding the variable using environment variables 2023-03-25 16:56:16 -03:00
David Addison e76e36e9a9
Merge pull request #134 from flx42/patch-1
Update README.md to fix -i default increment value.
2023-03-23 09:53:15 -07:00
Felix Abecassis 17d0a42d5a
Update README.md 2023-03-23 09:05:41 -07:00
Sylvain Jeaugey 2cbb968101
Update README.md
Improve MPI example to avoid confusion of number of processes / total number of GPUs.

https://github.com/NVIDIA/nccl-tests/issues/54#issuecomment-1212023369
2023-01-03 08:47:43 +01:00
David Addison 0b4c4cb99f Add boot_id to the hostname hash due to collisions on Azure
Fixes #60
2022-12-12 01:16:46 -08:00
Jithin Jose 0aeba157db Use DJB2a hash algorithm in getHostHash() 2022-12-12 01:16:38 -08:00
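
For reference, DJB2a is the XOR variant of Bernstein's string hash: start from 5381, then for each byte multiply by 33 and XOR the byte in. The sketch below shows that recurrence with 64-bit wraparound. The helper `host_hash` and the way it concatenates the hostname and boot ID are assumptions for illustration, not the layout used by getHostHash().

```python
MASK64 = (1 << 64) - 1  # emulate unsigned 64-bit overflow


def djb2a(data: bytes) -> int:
    """DJB2a: result = (result * 33) XOR byte, starting from 5381."""
    result = 5381
    for byte in data:
        result = (((result << 5) + result) & MASK64) ^ byte
    return result


def host_hash(hostname: str, boot_id: str) -> int:
    """Hash the hostname together with a boot ID (concatenation order is assumed)."""
    return djb2a((hostname + boot_id).encode())
```

Mixing in the boot ID means two machines that report the same hostname still hash differently as long as their boot IDs differ in a way the hash distinguishes.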
David Addison 24fcf64ed1 Call cudaFreeHost() on wrongPerGpu not cudaFree() 2022-11-22 11:18:37 -08:00
David Addison 3bd2bd292b Add fflush(stdout) before perf output 2022-11-22 11:16:47 -08:00
Sylvain Jeaugey 365b92a1ea Fix build on RHEL7 with GCC 4.8
Add -std=c++11 to CXXFLAGS.
Fixes #116.
2022-10-12 01:24:14 -07:00
Sylvain Jeaugey d313d20a26 Update NCCL tests 2022-09-23 01:13:29 -07:00
David Addison 749573f2d6 Fix preprocessor version check for ncclGetLastError()
ncclGetLastError() was added in NCCL 2.13.0
2022-09-07 16:10:41 -07:00
David Addison afa4c56b6a Fix an issue with the last commit when data checking is disabled 2022-09-07 11:23:49 -07:00
David Addison a0a14911ee Display N/A for error count in AlltoAll in-place test
AlltoAll does not support in-place buffers
2022-09-06 13:17:15 -07:00
John Bachan bc5f7cfb0a Changed top-level Makefile behavior so that BUILDDIR is interpreted
as relative to the top-level directory. This is done by abspath'ing it before
passing it to the subdirectory Makefiles.

The old behavior had two cases: with and without BUILDDIR being set by
the user. With BUILDDIR not set, the build dir would be named "build"
in the top-level directory. If BUILDDIR was set, then the build dir
would be placed at "src/${BUILDDIR}".

The new behavior is simpler: if BUILDDIR is not set then it defaults
to "build", and the directory holding the final build is always just
"${BUILDDIR}" in the top level.
2022-08-23 10:08:49 -07:00
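
The two behaviors described in the commit above can be compared side by side. The functions below are a hypothetical model of the path resolution written for illustration; they are not taken from the Makefiles, and `top` stands for the repository's top-level directory.

```python
import posixpath


def old_build_dir(top, builddir=None):
    """Old rule: "build" at the top level if unset, else under src/."""
    if builddir is None:
        return posixpath.join(top, "build")
    return posixpath.join(top, "src", builddir)


def new_build_dir(top, builddir=None):
    """New rule: default to "build", always resolved against the top level.

    An absolute BUILDDIR is kept as-is, which is how an abspath-style
    resolution behaves.
    """
    return posixpath.normpath(posixpath.join(top, builddir or "build"))
```

For a relative value such as `out`, the old rule gives `<top>/src/out` while the new rule gives `<top>/out`. With BUILDDIR unset, both give `<top>/build`.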
John Bachan 51af5572bf Resync with NCCL 2.13
* Added "verifiable", a suite of kernels for generating and verifying reduction
  input and output arrays in a bit-precise way.
* Data corruption errors now reported in number of wrong elements instead of max
  deviation.
* Use ncclGetLastError.
* Don't run hypercube on non-powers of 2 ranks.
* Fix to hypercube data verification.
* Use "thread local" as the default CUDA capture mode.
* Replaced pthread_yield -> sched_yield()
* Bugfix to the cpu-side barrier/allreduce implementations.
2022-08-22 17:51:06 -07:00
David Addison 8274cb47b6
Merge pull request #96 from NVIDIA/nersc-linkage-fix
Add option to statically link cudart
2022-05-26 16:54:44 -07:00