Compare commits

...

8 Commits

Author SHA1 Message Date
Yehuda Yitschak 1c1f49dcb6
Merge ef96b97524 into 0d1ece2b43 2025-07-22 20:58:01 +02:00
Stephen Sachs 0d1ece2b43 Exclude ongoing issues from auto-closing logic
- Added a check to skip issues labeled "ongoing" in the close-old-issues script
- Adjusted the condition to compare both creation and update dates against six months ago
2025-07-17 21:50:05 +02:00
Stephen Sachs bfedf2629e Add issue templates and GitHub action to remove stale issues
We add 3 different issue types (issue/question/RFE) and some predefined
questions to speed up the debugging process.

We also add a custom action that closes all issues created more than 6
months ago which have not been updated for more than a month.
2025-07-16 17:56:12 +02:00
Kamil Iskra 7c12c627c6 NCCL 2.27.6-1
Improve support for DirectNIC (CX8)
* Add support for XDR speed detection.
* When DirectNIC is enabled, report only the RDMA interfaces.

Extend the P2C (PXN over C2C) support to send/receive operations.

Support compilation with GCC 14 (Issues #1743, #1751).

Fix the unloading of network plugins that also provide tuner capability.

Fix an unintended change of the current device across calls to ncclCommDestroy()
and ncclCommAbort().

A note for users on MNNVL systems: please ensure an adequate stack size for
NCCL threads.  While the default Linux stack size limit of 8192 KB is known
to be sufficient, we've seen crashes if the limit is changed to
"unlimited", as it causes the glibc library to unexpectedly *decrease* the
stack size of NCCL's background threads to just 2048 KB.  Use "ulimit -s"
in bash to print the current limit; if needed, reset it to 8192 KB using
"ulimit -s 8192" (one also needs to ensure that the new setting is
propagated to other nodes when launching a multi-node NCCL job).
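
For reference, the same check can be done programmatically with standard POSIX getrlimit(); the sketch below is generic illustration code, not NCCL's:

```
/* Illustration only: check the stack size limit that NCCL's background
 * threads will inherit, using standard POSIX getrlimit().  This is not
 * NCCL code; it mirrors the "ulimit -s" check described above. */
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) { perror("getrlimit"); return 1; }
  if (rl.rlim_cur == RLIM_INFINITY)
    printf("stack limit: unlimited (consider 'ulimit -s 8192')\n");
  else
    printf("stack limit: %lu KB\n", (unsigned long)(rl.rlim_cur / 1024));
  return 0;
}
```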
2025-07-11 07:32:13 -07:00
Kamil Iskra 3ea7eedf3b NCCL 2.27.5-1
Improvements for GB200 systems
* Optimize the network performance by alternating the direction of the
  rings and the NIC to GPU assignment across communicators to limit
  unnecessary sharing.
* Fix the detection of C2C links in case GPU Direct RDMA is disabled
  between a GPU and a NIC.
* Fix PXN support on MNNVL systems, where NCCL would try (and fail) to
  share regular host memory across multiple nodes.
* Fix P2C (PXN over C2C), which is now preferred over regular PXN.  This
  support is currently preliminary and is disabled by default; use
  NCCL_PXN_C2C=1 to enable.

Further reduce the overheads of CUDA graph capturing, which increased in
NCCL 2.26.2 for large graphs.

Optimize the network performance on DGX B200 systems by adjusting the
bandwidths provided to the graph search algorithm.

Enable fp8 reductions in symmetric kernels on Blackwell with CUDA 12.8.

Restore the plugin name handling logic to make it possible to specify a
path to the plugin (Issue #1732).

Restore the ability to change NCCL_COLLNET_ENABLE during execution
(Issue #1741).

Add an example tuner plugin with CSV-based overrides.

Remove an x86 dependency from the example profiler.
2025-06-18 10:34:47 -07:00
Kamil Iskra 72d2432094 NCCL 2.27.3-1
Symmetric memory API and symmetric kernels
 * Redesign from the ground up, enabling major latency and bandwidth
   improvements.
 * Add new API calls to register user-allocated memory among communicator
   ranks into a NCCL window: ncclCommWindowRegister() and
   ncclCommWindowDeregister(). The calls currently support symmetric
   registration for P2P and NVLS, and require VMM memory buffers (i.e.,
   CUMEM must be operational).
 * Implement specialized kernels taking advantage of symmetrically
   registered memory, with performance gains expected particularly for
   small to medium message sizes.
 * The kernels support 32 bit floating point types and smaller, and sum as
   the reduction operator, with no more than one collective operation per
   group.
 * Floating point summation is always done in fp32 accumulators (with the
   exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
   the accuracy with fp8 and fp16 data types should be much improved.
 * This initial implementation supports non-network communicators only (P2P
   and NVLS transports).
 * To explore this functionality, users need to use the new memory
   registration API calls with the NCCL_WIN_COLL_SYMMETRIC flag, and all
   ranks of a communicator must pass buffers at the same offset in the same
   registration when invoking a collective NCCL operation (see the sketch
   below).
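
A minimal sketch of the registration flow; the argument order (communicator, buffer, size, window out-parameter, flags) and the ncclWindow_t handle type are assumptions, so consult nccl.h for the authoritative signatures:

```
/* Minimal sketch of the window registration flow.  The argument order and
 * the ncclWindow_t handle type are assumptions; consult nccl.h for the
 * authoritative signatures.  The buffer must be a VMM/cuMem allocation,
 * e.g. from ncclMemAlloc(), and every rank registers the same size. */
#include <nccl.h>

ncclResult_t registerSymmetric(ncclComm_t comm, void* buf, size_t bytes, ncclWindow_t* win) {
  /* All ranks later pass the same offset into this window to the collective. */
  return ncclCommWindowRegister(comm, buf, bytes, win, NCCL_WIN_COLL_SYMMETRIC);
}

ncclResult_t deregisterSymmetric(ncclComm_t comm, ncclWindow_t win) {
  return ncclCommWindowDeregister(comm, win);
}
```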

Add support for DGX Spark.

Add support for DirectNIC (CX8) to the internal IB plugin.

Add a new ncclCommShrink() API call
 * It is a non-collective call similar to ncclCommSplit(), which makes it
   possible to exclude some (possibly unresponsive) ranks from the parent
   communicator.

Add support for loading multiple network plugins
 * This enables the creation of generic containers that can work across a
   range of providers.
 * Allow NCCL_NET_PLUGIN to accept a comma-separated list of plugins to
   load.

NVLink SHARP (NVLS) improvements
 * Implement NVLS+IB SHARP support for AllGather and ReduceScatter with
   user buffer registration. This improves performance and reduces the
   number of CTAs needed to achieve peak bandwidth.
 * Gracefully fall back by default to other transports if NVLS
   initialization fails (the old behavior of returning an error code from a
   NCCL call can be preserved by setting NCCL_NVLS_ENABLE=1).
 * Decrease the NVLS channel count to 24 on Blackwell systems with multiple
   NVLink domains per communicator.
 * Enable fine-tuning of NCCL behavior per communicator using new
   "ncclConfig_t" members "collnetEnable", "CTAPolicy", and "nvlsCTAs"
   (see the sketch below).

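A minimal sketch of setting the "ncclConfig_t" members just mentioned; the member names come from the note above, while the values are placeholders rather than recommendations:

```
/* Minimal sketch: per-communicator tuning through the new ncclConfig_t
 * members named above.  The member names come from the release note; the
 * values below are placeholders, not recommendations. */
#include <nccl.h>

ncclResult_t initTunedComm(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.collnetEnable = 0;        /* disable CollNet for this communicator only */
  config.nvlsCTAs = 16;            /* CTAs reserved for NVLS on this communicator */
  /* config.CTAPolicy = ...;          valid policy values are defined in nccl.h */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```
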
Profiler improvements
 * Extend the init function by adding communicator name, comm id (hash),
   rank, number of ranks, number of nodes, and the NCCL log function to the
   argument list. This makes the name and the comm id available to all
   events in the communicator without explicitly passing them to each
   individual event. Add the communicator id and rank to the profiler trace
   filename. The communicator name can now be set via a new "ncclConfig_t"
   member "commName" (see the sketch after this list).
 * Improve the accuracy of the GPU kernel events by providing GPU-generated
   timestamps for the start and stop of every NCCL operation.
 * Harmonize proxy events, removing overlaps between ProxyOp and ProxyStep
   states.
 * Add support for network-defined event updates (through
   "recordEventState").
 * Report the correct number of channels used by every collective/p2p
   operation (used to be set to nMaxChannels for collectives and absent for
   p2ps).
 * Fix the logic on proxyCtrl Idle/Active events (Issue #1162).
 * Fix an issue where the network proxy profiler could lose track of an
   event identifier (Issue #1682).
 * Improve the backward compatibility with plugins older than v4.
 * Ensure that the work counters are 0-initialized.
 * Fix a potential race condition in the network profiler that could result
   in an event being linked to a wrong parent.
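
A minimal sketch of an init callback using the extended v4 argument list, based on the signature shown in the profiler README changes further down; the context struct and included header are illustrative assumptions:

```
/* Minimal sketch of a profiler plugin init callback using the extended v4
 * argument list described above.  Type and constant names follow the
 * plugin headers; the context struct and header name are illustrative. */
#include <stdint.h>
#include <stdlib.h>
#include "profiler.h"   /* assumed: provides ncclResult_t, ncclDebugLogger_t, event type flags */

struct myCtx { const char* commName; uint64_t commHash; int nranks; int rank; };
static ncclDebugLogger_t logFn;

static ncclResult_t myProfilerInit(void** context, int* eActivationMask,
                                   const char* commName, uint64_t commHash,
                                   int nNodes, int nranks, int rank,
                                   ncclDebugLogger_t logfn) {
  struct myCtx* ctx = (struct myCtx*)calloc(1, sizeof(*ctx));
  if (ctx == NULL) return ncclInternalError;
  ctx->commName = commName;
  ctx->commHash = commHash;
  ctx->nranks   = nranks;
  ctx->rank     = rank;
  logFn = logfn;                                        /* keep the NCCL logger for later use */
  *eActivationMask = ncclProfileColl | ncclProfileP2p;  /* events this plugin wants to receive */
  *context = ctx;
  return ncclSuccess;
}
```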

MNNVL improvements
 * Increase to 16 the number of NICs used to communicate between MNNVL
   domains on GB200 systems, to optimize the performance of collective
   operations.
 * Add support for more complex MNNVL topologies with up to 32 NICs per
   node.
 * If the MNNVL fabric initialization was unsuccessful, NCCL will now fail
   by default, so as to avoid inadvertently falling back to a potentially
   much slower network transport. Such failures are typically due to a
   misconfigured IMEX support on the system. To continue without MNNVL,
   restart the job with NCCL_MNNVL_ENABLE=0.
 * Fix a potential hang in alltoall-like communication patterns at a scale
   of over 80 ranks.
 * Make NCCL_P2P_DISABLE=1 imply NCCL_MNNVL_ENABLE=0 (so the latter no
   longer needs to be specified on MNNVL systems).
 * Fix an initialization failure when NCCL_TOPO_FILE is used on MNNVL
   systems.
 * Fix the graph search to exclude non-local NICs.
 * Fix the SHM transport to use fabric handles on MNNVL systems.

NIC Fusion improvements
 * Disable the creation of fused NICs for physical devices that haven't
   been merged.
 * Flatten multiple ports to a single PCI device within the internal IB
   plugin and reparent dual-port NICs under the first PCI parent. If the
   parent is not a PCI switch, PCI devices for fused NICs won't be
   duplicated.
 * Route traffic on GB200-CX8 systems through DirectNIC, not the host
   interface.

Improve support for platforms with C2C connectivity (e.g., GB200)
 * Enable GPUDirect RDMA for the NICs by default.
 * Add support for P2C (PXN over C2C) and the LL128 protocol.

Extend NCCL fault tolerance in multithreaded scenarios
 * Support the creation of multiple nonblocking communicators within a
   single group and polling in parallel for the completion using multiple
   threads (one per communicator).
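
A minimal sketch of this pattern, assuming one polling thread per communicator and relying on the existing nonblocking-config and ncclCommGetAsyncError() mechanism; error handling is omitted:

```
/* Minimal sketch of the pattern above: create several nonblocking
 * communicators inside one group, then poll each one from its own thread
 * via ncclCommGetAsyncError().  Helper names are illustrative. */
#include <nccl.h>
#include <pthread.h>

static void* pollComm(void* arg) {
  ncclComm_t comm = (ncclComm_t)arg;
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) ncclCommGetAsyncError(comm, &state);
  return NULL;   /* state is ncclSuccess once initialization has completed */
}

void initManyNonblocking(ncclComm_t* comms, int n, int nranks, ncclUniqueId* ids, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* request nonblocking communicators */
  ncclGroupStart();
  for (int i = 0; i < n; i++) ncclCommInitRankConfig(&comms[i], nranks, ids[i], rank, &config);
  ncclGroupEnd();                            /* may return ncclInProgress */
  pthread_t tid[n];
  for (int i = 0; i < n; i++) pthread_create(&tid[i], NULL, pollComm, comms[i]);
  for (int i = 0; i < n; i++) pthread_join(tid[i], NULL);
}
```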

Enable ncclImplicitOrderLaunch for CUDA 12.9+
 * This can potentially speed up NCCL_IMPLICIT_LAUNCH_ORDER.

Improve the netSocket transport latency and control
 * Provide finer control over the size of the socket send/receive buffers,
   the task size, and the number of sockets that a single peer can open.
 * Add support for the inlining of small messages behind the header when
   using multiple sockets per connection.

Improve the readability of the CPU affinity in the debug output
 * Print it as a range string rather than a bitmask.
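
A sketch of the range-string formatting, written against a generic POSIX cpu_set_t rather than NCCL's internal representation:

```
/* Illustration of the formatting change described above: render a CPU set
 * as a range string such as "0-15,32-47" instead of a raw bitmask.  This
 * is generic POSIX code, not NCCL's internal implementation. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

void printAffinityRanges(const cpu_set_t* set) {
  int first = 1;
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (!CPU_ISSET(cpu, set)) continue;
    int start = cpu;
    while (cpu + 1 < CPU_SETSIZE && CPU_ISSET(cpu + 1, set)) cpu++;   /* extend the run */
    printf("%s%d", first ? "" : ",", start);
    if (cpu > start) printf("-%d", cpu);
    first = 0;
  }
  printf("\n");
}
```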

Fix a potential race condition in graph execution
 * A contention could arise when mixing graph and non-graph execution.

Improve PXN connection code
 * Avoid duplicate and unused connections.

RAS fixes
 * Fix a memory corruption at job termination time in case of a previously
   failed initialization of a RAS socket connection.
 * Fix a race condition leading to a crash when generating a RAS report
   during communicator initialization (Issues #1669, #1718).
 * Fix a potential race condition when gathering data for a RAS status
   report.

Fix a potential memory corruption in ncclCommSplit()
 * Memory could get corrupted when resource sharing was in use and the size
   of the NVLink domain in the new communicator was smaller than in the old
   one.

Fix asynchronous graph upload
 * Fix a small memory leak.
 * Fix oversynchronization.

Add a check for out-of-memory conditions in ncclMemAlloc()

Clean up the NCCL socket code
 * accept() will retry also if just reading the magic failed (Issue #1613).
 * connect() will retry also if poll() did not return a POLLOUT event
   (Issue #1618).
 * Add error checking in a few instances (Issue #1539).
 * Fix the loop condition in ncclFindInterfaceMatchSubnet() (Issue #1574).
 * Clean up the debug output, downgrading WARN messages to INFO in
   non-critical cases, and printing the peer's address where relevant.

Switch NCCL_DEBUG_FILE to line buffering
 * This should help avoid mixed-up partial output lines in multithreaded
   cases.
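
As a generic illustration of what line buffering means here (not NCCL's actual code):

```
/* Generic C illustration of line buffering: with _IOLBF every completed
 * line is flushed as a unit, which helps avoid mixed-up partial lines
 * when several threads write to the same file. */
#include <stdio.h>

FILE* openLineBufferedLog(const char* path) {
  FILE* f = fopen(path, "w");
  if (f != NULL) setvbuf(f, NULL, _IOLBF, BUFSIZ);   /* switch to line buffering */
  return f;
}
```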

Other minor fixes
 * Improve the checks for buffer overflows in the graph code (Issue #1585).
 * Extend logging and state clearing to all four events in the internal IB
   plugin (Issue #1650).
 * Fix the error path in case IB communication is not ready (Issue #1489).
 * Add ECE logging for IB fabric.
 * Fix various minor issues in the graph module (Issue #1635).
 * Clean up the debug output in the graph code, downgrading WARN messages
   to INFO in non-critical cases.
 * Add a missing argument to a directSend() call (Issue #1628).
 * Remove duplicate code in sendProxySetup() (Issue #1420).
 * Fix the order of arguments of cudaDeviceCanAccessPeer() (Issue #1507).
 * Fix compiler warnings with GCC 14.
 * Fix a typo in a comment (Issue #1236).
2025-05-29 20:56:40 -07:00
Giuseppe Congiu 8171af656b NCCL 2.26.6-1
Fix profiler_v2 compatibility layer
 * Removing trafficBytes in profiler_v3 breaks casting to ncclProfilerEventDescr_v2_t
   in the compatibility layer for the profiler_v2 interface. This patch fixes the issue
   by making the conversion between the two descriptors explicit.
2025-05-20 04:04:41 -07:00
Yehuda Yitschak ef96b97524 proxy: yield progress only when idle
Currently, if no new ops were added, the proxy thread may sleep while
there are still non-idle ops to process

Signed-off-by: Yehuda Yitschak <yehuday@amazon.com>
2023-09-06 06:21:52 +00:00
125 changed files with 10273 additions and 2153 deletions

77
.github/ISSUE_TEMPLATE/ISSUE.yaml vendored Normal file

@ -0,0 +1,77 @@
name: NCCL issue or bug
description: Report an issue or failure when running NCCL code
title: "[Issue]: "
labels: ["triage"]
body:
- type: markdown
attributes:
value: |
Thanks for reaching out! Before reporting a new issue, please feel free to search the existing issues for the behavior. If you find an issue that is already closed, or you are unsure, open a new issue and reference the old one from it.
You can also check out the [troubleshooting section](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html) in our user guide.
---
To ensure we can assist you quickly and accurately, we often need the following information:
- type: dropdown
id: type
attributes:
label: How is this issue impacting you?
description: What best describes your issue?
options:
- Lower performance than expected
- Application crash
- Data corruption
- Application hang
validations:
required: true
- type: textarea
id: log
attributes:
label: Share Your Debug Logs
description: |
The logs and topo-files are a great tool to pin down issues. You can create them by setting these environment variables before the run.
* `NCCL_DEBUG=INFO` and `NCCL_DEBUG_FILE=ncclDebug.%h.%p` to produce one file per rank
* `NCCL_TOPO_DUMP_FILE=ncclSystem.txt`
- type: textarea
id: repro
attributes:
label: Steps to Reproduce the Issue
description: |
* **Minimal Steps**: Please provide a simple way to recreate the issue (see [Minimal Bug Reports](https://matthewrocklin.com/minimal-bug-reports) for inspiration).
* **Environment Details**: Include software versions and relevant settings.
* **Intermittency**: Is this a sporadic issue? If so, how often does it occur?
* **Previous Success**: Did this work with an older NCCL version?
The easier it is for us to reproduce the issue, the more likely we are to solve it in a timely manner.
- type: input
id: nccl_version
attributes:
label: NCCL Version
description: |
NCCL reports its version string in the debug logs.
You can also determine the version if you know which library was used by running `strings libnccl.so | grep 'NCCL version'`.
placeholder: "e.g. 2.27.1+cuda12.8"
validations:
required: true
- type: textarea
id: platform
attributes:
label: Your platform details
description: |
* **GPU & Network**: Share your architecture and topology (e.g., from `nvidia-smi`, `nvidia-smi topo -m`, `ibstatus`).
* **Environment**: Bare-metal, containers, or cloud?
* **Scalability**: Does this issue occur with a specific number of ranks/nodes?
- type: textarea
id: issue-description
attributes:
label: Error Message & Behavior
description: |
* **First Error**: What was the initial `NCCL WARN` message in your logs?
* **Expected vs. Actual**: Briefly describe the anticipated behavior versus what you're seeing.

15
.github/ISSUE_TEMPLATE/QUESTION.yaml vendored Normal file

@ -0,0 +1,15 @@
name: NCCL question
description: Ask the NCCL team a question
title: "[Question]: "
labels: ["question"]
body:
- type: markdown
attributes:
value: |
Thanks for reaching out! To solve your problem, feel free to check out the [user guide](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html), in particular the troubleshooting section, and also the [release notes](https://docs.nvidia.com/deeplearning/nccl/release-notes/index.html).
---
- type: textarea
id: question
attributes:
label: Question

22
.github/ISSUE_TEMPLATE/RFE.yaml vendored Normal file

@ -0,0 +1,22 @@
name: NCCL request for enhancement
description: Request for enhancement
title: "[RFE]: "
labels: ["enhancement"]
body:
- type: markdown
attributes:
value: |
Thanks for your feedback! Before reporting a new RFE you could quickly check if this already exists in our [existing requests](https://github.com/NVIDIA/nccl/issues?q=sort%3Aupdated-desc%20is%3Aissue%20is%3Aopen%20label%3Aenhancement).
---
- type: textarea
id: rfe-description
attributes:
label: Please provide the below details to ensure we understand your needs
description: |
* What is the goal of this request?
* Who will benefit from this feature?
* Is this request for a specific GPU architecture or network infrastructure?
* How will this feature improve current workflows or processes?
* What is the priority level of this request?

1
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@ -0,0 +1 @@
blank_issues_enabled: false

79
.github/workflows/close-old-issues.js vendored Normal file

@ -0,0 +1,79 @@
const { Octokit } = require("@octokit/rest");
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = process.env.REPO_OWNER;
const repo = process.env.REPO_NAME.split('/').pop(); // Handles owner/repo format
const now = new Date();
const sixMonthsAgo = new Date(now);
sixMonthsAgo.setMonth(now.getMonth() - 6);
const oneMonthAgo = new Date(now);
oneMonthAgo.setMonth(now.getMonth() - 1);
async function closeOldIssues() {
let page = 1;
let closedCount = 0;
// write a multiline comment into a variable:
let body = `### Issue Cleanup: Helping Us Focus on Current Challenges
We're [reviewing](https://github.com/NVIDIA/nccl/discussions/1761) older issues to ensure we prioritize the most relevant and active ones. Since this issue hasn't seen updates in over 6 months, we'll be closing it for now.
*This change helps us focus our efforts on addressing any current issues our users are facing.* If this issue still affects you, please don't hesitate to reopen it with a quick update (e.g., \"Still relevant on [version=X]\").
Thanks for your understanding and for contributing to NCCL.`;
while (true) {
const { data: issues } = await octokit.issues.listForRepo({
owner,
repo,
state: "open",
per_page: 100,
page,
});
if (issues.length === 0) break;
for (const issue of issues) {
// Ignore PRs
if (issue.pull_request) continue;
// Ignore issues with label "ongoing"
if (issue.labels.some(label => label.name === "ongoing")) continue;
const createdAt = new Date(issue.created_at);
const updatedAt = new Date(issue.updated_at);
if (createdAt < sixMonthsAgo && updatedAt < sixMonthsAgo) {
// Add a comment before closing
await octokit.issues.createComment({
owner,
repo,
issue_number: issue.number,
body: body,
});
await octokit.issues.update({
owner,
repo,
issue_number: issue.number,
state: "closed",
state_reason: "not_planned",
});
closedCount++;
console.log(`Closed issue #${issue.number}`);
// Break out if we have closed 100 issues
if (closedCount >= 100) {
console.log("Closed 100 issues, stopping.");
return;
}
}
}
page++;
}
console.log(`Total closed: ${closedCount}`);
}
closeOldIssues().catch(console.error);

31
.github/workflows/close_old_issues.yaml vendored Normal file

@ -0,0 +1,31 @@
name: Close Old Issues
on:
schedule:
- cron: '30 2 * * *' # Runs daily at 02:30 UTC
workflow_dispatch:
permissions:
issues: write
jobs:
close-old-issues:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: 20
- name: Install dependencies
run: npm install @octokit/rest@22.0.0
- name: Run close-old-issues script
run: node .github/workflows/close-old-issues.js
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO_OWNER: ${{ github.repository_owner }}
REPO_NAME: ${{ github.event.repository.name || github.repository }}


@ -3,15 +3,20 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME:=../../build/
CUDA_HOME:=/usr/local/cuda
INC:= -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO:=libnccl-net.so
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
default: $(PLUGIN_SO)
SRC_FILES := $(wildcard *.c)
$(PLUGIN_SO): plugin.c
$(CC) $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
build: ${BUILDDIR}/libnccl-net-example.so
${BUILDDIR}/libnccl-net-example.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/libnccl-net-example.so


@ -7,9 +7,15 @@
#ifndef COMMON_H_
#define COMMON_H_
#include <stdint.h>
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
enum { ncclProfilerNetEventStart = 0, ncclProfilerNetEventStop, ncclProfilerNetEventUpdate, ncclProfilerNetEventUpdateAndStop };
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
#endif


@ -8,9 +8,9 @@
#include <stdint.h>
#include <stdlib.h>
#include "common.h"
#include "err.h"
#include "net_device.h"
#include "common.h"
#define NCCL_NET_HANDLE_MAXSIZE 128
#define NCCL_MAX_NET_SIZE_BYTES (1*1024*1024*1024*1024L) //1TB
@ -23,8 +23,6 @@
// Maximum number of requests per comm object
#define NCCL_NET_MAX_REQUESTS 32
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
#include "net_v10.h"
#include "net_v9.h"
#include "net_v8.h"


@ -49,9 +49,9 @@ of newer ones.
The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
from old API versions. It also provides error codes in `err.h`.
# API (v3)
# API (v4)
Below is the main `ncclProfiler_v3` struct. Each function is explained in later sections.
Below is the main `ncclProfiler_v4` struct. Each function is explained in later sections.
```
typedef struct {
@ -60,9 +60,15 @@ typedef struct {
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask);
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
// Input
@ -70,7 +76,7 @@ typedef struct {
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v3_t* eDescr);
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside an event set
// Input
@ -82,13 +88,13 @@ typedef struct {
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v3_t eState, ncclProfilerEventStateArgs_v3_t* eStateArgs);
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v3_t;
} ncclProfiler_v4_t;
```
## Error codes
@ -147,8 +153,6 @@ typedef struct {
int rank; // rank that generated the event
union {
struct { // collective events metadata
const char* name; // string containing name of the communicator
uint64_t commHash; // unique hash/id for the communicator
uint64_t seqNumber; // sequence number of this collective operation in the communicator
const char* func; // string containing name of the collective
void const* sendBuff; // address of send buffer
@ -156,20 +160,19 @@ typedef struct {
size_t count; // data count
int root; // root rank
const char* datatype; // string containing the name of the datatype
uint8_t nMaxChannels; // max number of channels for this collective
uint8_t nChannels; // number of channels for this collective
uint8_t nWarps; // number of GPU warps for this collective
const char* algo; // string containing name of the algorithm for this collective
const char* proto; // string containing name of the protocol for this collective
} coll;
struct { // point-to-point events metadata
const char* name;
uint64_t commHash;
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer; // peer rank for this point-to-point
uint8_t nChannels; // number of channels for this p2p
} p2p;
struct { // proxyOp events metadata
@ -178,7 +181,7 @@ typedef struct {
int peer; // peer rank
int nSteps; // number of network transfers/steps required by the `ncclProxyOp`
int chunkSize; // chunk size for this `ncclProxyOp`
int isSend; // set to 1 for sends and 0 for recvs
int isSend; // type of network operation
} proxyOp;
struct { // proxyStep events metadata
@ -187,6 +190,7 @@ typedef struct {
struct {
uint8_t channelId; // id of the channel used by the kernel
uint64_t ptimer; // kernel supplied timestamp
} kernelCh;
struct {
@ -194,7 +198,7 @@ typedef struct {
void* data; // pointer to network plugin defined event
} netPlugin;
};
} ncclProfilerEventDescr_v3_t;
} ncclProfilerEventDescr_v4_t;
```
NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
@ -212,45 +216,57 @@ handle after `eventStop` is undefined behavior.
Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`,
`ncclProfileP2p`, cannot be updated through calls to `recordEventState`.
`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
calls to `recordEventState`.
`ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileNetPlugin`, `ncclProfileKernelCh`, and
`ncclProfileProxyCtrl` can be updated through calls to `recordEventState`.
The state of proxy generated events can be updated, along with event attributes, using
`recordEventState`. These events can go through several states during their lifecycle.
The list of supported states for the proxy-defined events is reported below.
The state of these events can be updated, along with event attributes, using `recordEventState`.
These events can go through several states during their lifecycle.
The list of supported states for the updatable events is reported below.
```
typedef enum {
// ncclProfileProxyOp event states
ncclProfilerProxyOpSendPosted, // state marks the posting of send buffer to GPU for given network transfer/step
ncclProfilerProxyOpSendRemFifoWait, // state marks the waiting of CTS credits from peer rank
ncclProfilerProxyOpSendTransmitted, // state marks the sending of network transfer/step to peer rank
ncclProfilerProxyOpSendDone, // state marks the ending of network transfer/step
ncclProfilerProxyOpRecvPosted, // state marks the posting of recv to network for given network transfer/step
ncclProfilerProxyOpRecvReceived, // state marks the recving of network transfer/step from peer rank
ncclProfilerProxyOpRecvTransmitted, // state marks the ending of the network transfer/step
ncclProfilerProxyOpRecvDone, // state marks the consuming of data from GPU
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,// state marks transition of proxy op to progress
// ncclProfileProxyStep event states
ncclProfilerProxyStepSendGPUWait, // state marks the waiting of send data from GPU for given network transfer/step
ncclProfilerProxyStepSendWait, // state marks the waiting of send data from network for given network transfer/step
ncclProfilerProxyStepRecvWait, // state marks the waiting of recv data from network for given network transfer/step
ncclProfilerProxyStepRecvFlushWait, // state marks the waiting of recv data flush to GPU for given network transfer/step
ncclProfilerProxyStepRecvGPUWait, // state marks the waiting of recv data consumption from GPU for given network transfer/step
ncclProfilerProxyStepSendGPUWait = 8, // state marks the waiting of send data from GPU for given network transfer/step
ncclProfilerProxyStepSendPeerWait_v4 = 20,// state marks the waiting of recv clear to send credits for given network transfer/step
ncclProfilerProxyStepSendWait = 9, // state marks the waiting of send data from network for given network transfer/step
ncclProfilerProxyStepRecvWait = 10,// state marks the waiting of recv data from network for given network transfer/step
ncclProfilerProxyStepRecvFlushWait = 11,// state marks the waiting of recv data flush to GPU for given network transfer/step
ncclProfilerProxyStepRecvGPUWait = 12,// state marks the waiting of recv data consumption from GPU for given network transfer/step
// ncclProfileProxyCtrl event states
ncclProfilerProxyCtrlIdle, // state marks proxy progress thread idle
ncclProfilerProxyCtrlActive, // state marks proxy progress thread active
ncclProfilerProxyCtrlSleep, // state marks proxy progress thread sleeping
ncclProfilerProxyCtrlWakeup, // state marks proxy progress thread waking up
ncclProfilerProxyCtrlAppend, // state marks append of new network work item begin
ncclProfilerProxyCtrlAppendEnd, // state marks append of new network work item end
} ncclProfilerEventState_v3_t;
ncclProfilerProxyCtrlIdle = 13,// state marks proxy progress thread idle
ncclProfilerProxyCtrlActive = 14,// state marks proxy progress thread active
ncclProfilerProxyCtrlSleep = 15,// state marks proxy progress thread sleeping
ncclProfilerProxyCtrlWakeup = 16,// state marks proxy progress thread waking up
ncclProfilerProxyCtrlAppend = 17,// state marks append of new network work item begin
ncclProfilerProxyCtrlAppendEnd = 18,// state marks append of new network work item end
// ncclProfileNetPlugin event states
ncclProfilerNetPluginUpdate = 21,// state marks update of network defined event
// ncclProfileKernelCh event states
ncclProfilerKernelChStop = 22,// state marks stop of kernelCh event and timestamp update
} ncclProfilerEventState_v4_t;
```
`ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyOp events are generated for every active channel and
provide a summary of the activity of the proxy progress thread for that channel.
provide a summary of the activity of the proxy progress thread for that channel. Most of the
states for this event were duplicated with `ncclProfileProxyStep` events. Therefore, starting
with version 4 of the profiler interface these states have been deprecated. The same level of
information can still be obtained through the `ncclProfileProxyStep` events.
`ncclProfileProxyStep` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyStep events describe individual network transfer in
@ -348,15 +364,22 @@ reason the profiler defines the `ncclProfilerEventStateArgs_t` struct, reported
```
typedef union {
struct { // attributes to update for ncclProfileProxyOp events
size_t transSize; // data transferred thus far
int steps; // network transfer/steps processed thus far
} proxyOp;
struct { // attributes for update for ncclProfileProxyStep events
size_t transSize; // transfer size field for this proxy step
} proxyStep;
struct { // attributes to update for ncclProfileProxyCtrl
struct { // attributes to update for ncclProfileProxyCtrl events
int appendedProxyOps; // number of appended proxy ops thus far
} proxyCtrl;
} ncclProfilerEventStateArgs_v3_t;
struct { // attributes to update for ncclProfileNetPlugin events
void* data; // network plugin opaque update data field
} netPlugin;
struct { // attribute to update for ncclProfileKernelCh events
uint64_t pTimer; // timestamp provided by the NCCL kernel
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
```
The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
@ -396,12 +419,12 @@ ProxyCtrl event
## Profiling of collective and p2p operations
The NCCL code is instrumented with profiler callbacks at different levels to capture start/stop of groups,
collective and point-to-point operations, as well as proxy progress activity. Due to the asynchronous nature
collective and point-to-point operations, as well as proxy, kernel and network activity. Due to the asynchronous nature
of NCCL operations, events associated to collective and point-to-point operations are not easy to delimit
precisely. For example, without both proxy and/or kernel activity it is impossible for the profiler to
figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates to
the profiler that the collective has been enqueued. The profiler can leverage proxy event information, if
these are enabled, to estimate when the collective ends. In this case, the profiler can look at the `stopEvent`
the profiler that the collective has been enqueued. The profiler can leverage proxy and/or kernel event information, if
these are enabled, to estimate when the collective ends. For example, the profiler can look at the `stopEvent`
call of the last `ncclProfileProxyOp` event to mark the completion of the associated collective event. This
can be achieved by reference counting the collective event and letting calls to `startEvent` and `stopEvent`
increment and decrement the reference counter, respectively.
@ -425,8 +448,14 @@ enqueue can be time stamped by the profiler (at start and stop) to reconstruct t
collective. However, this time only represents the launch time of the collective and not the actual
execution time. To reconstruct the execution time more accurately proxy and kernel events are provided.
With version 3 of the profiler interface network activity is no longer required to do intra-node profiling.
Kernel events instrumentation leverages counters exposed by the kernel to the host and the proxy progress
thread. Thus, the proxy progress thread infrastructure is shared between the network and the profiler. If
the proxy is serving network requests the kernel profiling probing can be delayed, causing loss of
accuracy. Similarly, if the CPU is under heavy load and the scheduling of the proxy progress thread is
delayed, a similar loss of accuracy can be encountered. Keep this in mind when using kernel events.
delayed, a similar loss of accuracy can be encountered.
To mitigate this effect, with version 4 of the profiler NCCL uses a per-channel ring buffer of 64 elements.
Every counter is complemented by two timestamps (ptimers) supplied by the NCCL kernel (one for start and one
for stop of the operation in the kernel). NCCL propagates these timestamps to the profiler plugin so that it can
convert them to the CPU time domain.


@ -3,14 +3,20 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME := ../../build
INC := -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO := libnccl-profiler.so
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
default: $(PLUGIN_SO)
SRC_FILES := $(wildcard *.c)
$(PLUGIN_SO): plugin.c event.c print_event.c
$(CXX) $(INC) -g -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
build: ${BUILDDIR}/libnccl-profiler-example.so
${BUILDDIR}/libnccl-profiler-example.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/libnccl-profiler-example.so


@ -15,24 +15,6 @@
#define MAX_CHANNELS 32
#define MAX_STEPS 16
#define MAX_OPS 16 // Up to 64K ranks for PAT
#define PROXY_OP_SEND_STATE_OFFSET (ncclProfilerProxyOpSendPosted)
#define PROXY_OP_RECV_STATE_OFFSET (ncclProfilerProxyOpRecvPosted)
#define PROXY_STEP_SEND_STATE_OFFSET (ncclProfilerProxyStepSendGPUWait)
#define PROXY_STEP_RECV_STATE_OFFSET (ncclProfilerProxyStepRecvWait)
#define NUM_PROXY_OP_SEND_STATES (ncclProfilerProxyOpSendDone - ncclProfilerProxyOpSendPosted + 1)
#define NUM_PROXY_OP_RECV_STATES (ncclProfilerProxyOpRecvDone - ncclProfilerProxyOpRecvPosted + 1)
#define NUM_PROXY_STEP_SEND_STATES (ncclProfilerProxyStepSendWait - ncclProfilerProxyStepSendGPUWait + 1)
#define NUM_PROXY_STEP_RECV_STATES (ncclProfilerProxyStepRecvGPUWait - ncclProfilerProxyStepRecvWait + 1)
#define PROXY_OP_SEND_STATE_IDX(state) (state - PROXY_OP_SEND_STATE_OFFSET)
#define PROXY_OP_RECV_STATE_IDX(state) (state - PROXY_OP_RECV_STATE_OFFSET)
#define PROXY_STEP_SEND_STATE_IDX(state) (state - PROXY_STEP_SEND_STATE_OFFSET)
#define PROXY_STEP_RECV_STATE_IDX(state) (state - PROXY_STEP_RECV_STATE_OFFSET)
#define MAX_PROXY_OP_STATES ((NUM_PROXY_OP_SEND_STATES > NUM_PROXY_OP_RECV_STATES ) ? NUM_PROXY_OP_SEND_STATES : NUM_PROXY_OP_RECV_STATES)
#define MAX_PROXY_STEP_STATES ((NUM_PROXY_STEP_SEND_STATES > NUM_PROXY_STEP_RECV_STATES) ? NUM_PROXY_STEP_SEND_STATES : NUM_PROXY_STEP_RECV_STATES)
#define MAX_EVENTS_PER_REQ (8)
struct proxyOp;
@ -68,13 +50,24 @@ struct kernelCh {
struct taskEventBase* parent;
double startTs;
double stopTs;
uint64_t startGpuClk;
uint64_t stopGpuClk;
};
#define PROXY_STEP_SEND_GPU_WAIT 0
#define PROXY_STEP_SEND_PEER_WAIT 1
#define PROXY_STEP_SEND_WAIT 2
#define PROXY_STEP_RECV_WAIT 0
#define PROXY_STEP_RECV_FLUSH_WAIT 1
#define PROXY_STEP_RECV_GPU_WAIT 2
#define PROXY_STEP_MAX_STATES 3
struct proxyStep {
uint8_t type; // type of event: network transfer
int state;
int step; // network transfer id in given channel
int isSend; // send/recv channel operation
double timestamp[MAX_PROXY_STEP_STATES];
double timestamp[PROXY_STEP_MAX_STATES];
double startTs;
double stopTs;
struct proxyOp* parent;
@ -92,11 +85,8 @@ struct proxyOp {
int chunkSize; // chunk size for this proxy operation
int isSend; // send/recv channel operation
size_t transSize; // transfer data size for this proxy operation
struct {
int steps; // completed steps for this proxy operation state
double timestamp;
} states[MAX_PROXY_OP_STATES];
double startTs;
double progrTs; // In progress state transition
double stopTs;
int stepCount; // last processed network operation for this proxy operation
struct proxyStep step[MAX_STEPS]; // array of network transfer events
@ -119,8 +109,6 @@ struct proxyCtrl {
struct taskEventBase {
uint8_t type; // event type: collective/p2p
int rank; // rank of the operation in NCCL communicator
const char* name; // FIXME: unused
uint64_t commHash; // communicator identifier
const char* func; // ncclFunc*
int refCount; // number of references for this operation
struct group* parent; // parent event group
@ -137,12 +125,11 @@ struct collective {
size_t count;
int root;
const char* datatype;
uint8_t nMaxChannels;
uint8_t nChannels;
const char* algo;
const char* proto;
int nWarps;
struct proxyOp send[MAX_CHANNELS][MAX_OPS];// array of send proxy operation events
struct proxyOp recv[MAX_CHANNELS][MAX_OPS];// array of recv proxy operation events
struct proxyOp op[MAX_CHANNELS][2*MAX_OPS];
int nProxyOps[MAX_CHANNELS];
struct kernelCh kernel[MAX_CHANNELS];
};
@ -154,6 +141,7 @@ struct p2p {
size_t count;
const char* datatype;
int peer;
uint8_t nChannels;
struct proxyOp op[MAX_CHANNELS];
struct kernelCh kernel[MAX_CHANNELS];
};
@ -172,6 +160,11 @@ struct group {
// arrays for different event objects
struct context {
const char* commName;
uint64_t commHash;
int nranks;
int rank;
int groupPoolSize;
int groupPoolBase;
int groupPoolIndex;


@ -25,42 +25,52 @@ enum {
};
typedef enum {
ncclProfilerProxyOpSendPosted,
ncclProfilerProxyOpSendRemFifoWait,
ncclProfilerProxyOpSendTransmitted,
ncclProfilerProxyOpSendDone,
ncclProfilerProxyOpRecvPosted,
ncclProfilerProxyOpRecvReceived,
ncclProfilerProxyOpRecvTransmitted,
ncclProfilerProxyOpRecvDone,
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,
/* Legacy proxy profiler states */
ncclProfilerProxyStepSendGPUWait,
ncclProfilerProxyStepSendWait,
ncclProfilerProxyStepRecvWait,
ncclProfilerProxyStepRecvFlushWait,
ncclProfilerProxyStepRecvGPUWait,
ncclProfilerProxyStepSendGPUWait = 8,
ncclProfilerProxyStepSendPeerWait_v4 = 20,
ncclProfilerProxyStepSendWait = 9,
ncclProfilerProxyStepRecvWait = 10,
ncclProfilerProxyStepRecvFlushWait = 11,
ncclProfilerProxyStepRecvGPUWait = 12,
/* Legacy proxy control states */
ncclProfilerProxyCtrlIdle,
ncclProfilerProxyCtrlActive,
ncclProfilerProxyCtrlSleep,
ncclProfilerProxyCtrlWakeup,
ncclProfilerProxyCtrlAppend,
ncclProfilerProxyCtrlAppendEnd,
ncclProfilerProxyCtrlIdle = 13,
ncclProfilerProxyCtrlActive = 14,
ncclProfilerProxyCtrlSleep = 15,
ncclProfilerProxyCtrlWakeup = 16,
ncclProfilerProxyCtrlAppend = 17,
ncclProfilerProxyCtrlAppendEnd = 18,
/* Network defined events states */
ncclProfilerNetPluginUpdate = 21,
/* Kernel event states */
ncclProfilerKernelChStop = 22,
} ncclProfilerEventState_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
#include "profiler_v4.h"
#include "profiler_v3.h"
#include "profiler_v2.h"
#include "profiler_v1.h"
#include "profiler_net.h"
typedef ncclProfiler_v3_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v4_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
#endif // end include guard


@ -111,9 +111,4 @@ typedef struct {
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v3_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventState_v3_t ncclProfilerEventState_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v3_t ncclProfiler_t;
#endif


@ -0,0 +1,123 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef PROFILER_V4_H_
#define PROFILER_V4_H_
typedef struct {
uint8_t type; // event type descriptor: ncclProfileColl, ...
void* parentObj; // pointer to the profiler parent object (for coll is the group)
int rank; // originating rank
union {
struct {
uint64_t seqNumber;
const char* func;
void const* sendBuff;
void* recvBuff;
size_t count;
int root;
const char* datatype;
uint8_t nChannels;
uint8_t nWarps;
const char* algo;
const char* proto;
} coll;
struct {
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer;
uint8_t nChannels;
} p2p;
struct {
pid_t pid; // pid of the originating process
uint8_t channelId; // channel id for this proxy operation
int peer; // remote rank for send/recv
int nSteps; // number of steps for this proxy operation
int chunkSize; // amount of data transferred by this proxy operation
int isSend;
} proxyOp;
struct {
int step;
} proxyStep;
struct {
uint8_t channelId;
uint64_t pTimer; // start timestamp from GPU globaltimer
} kernelCh;
struct {
int64_t id;
void* data;
} netPlugin;
};
} ncclProfilerEventDescr_v4_t;
typedef union {
struct {
size_t transSize;
} proxyStep;
struct {
int appendedProxyOps;
} proxyCtrl;
struct {
void* data;
} netPlugin;
struct {
uint64_t pTimer;
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
typedef struct {
const char* name;
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
// Input
// - context: opaque profiler context object
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside an event set
// Input
// - eHandle: handle to event object
ncclResult_t (*stopEvent)(void* eHandle);
// recordEventState - record event state transitions and event attribute updates
// Input
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v4_t;
#endif


@ -12,7 +12,7 @@
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>
#include <time.h>
#include "event.h"
#include "print_event.h"
@ -38,29 +38,20 @@ static int detachPoolIndex;
static int detachPoolDone;
static struct proxyOp* detachPool;
static double freq = -1;
__hidden void calibrate() {
struct timeval tv;
gettimeofday(&tv, NULL);
uint64_t timeCycles = __rdtsc();
double time = - tv.tv_sec*1e6 - tv.tv_usec;
uint64_t total = 0ULL;
for (int i = 0; i < 10000; i++) total += __rdtsc();
gettimeofday(&tv, NULL);
timeCycles = __rdtsc() - timeCycles;
time += tv.tv_sec*1e6 + tv.tv_usec;
freq = timeCycles / time;
}
ncclDebugLogger_t logFn;
#define INFO(FLAGS, ...) logFn(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
__hidden double gettime(void) {
return __rdtsc() / freq;
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
return (t.tv_sec*1e6 + (t.tv_nsec*1e-3));
}
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t pid;
static int* eActivationMaskPtr;
__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask) {
__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
pthread_mutex_lock(&lock);
if (__atomic_fetch_add(&initialized, 1, __ATOMIC_RELAXED) == 0) {
// first thread initializes event mask, environment and detach pool
@ -95,8 +86,6 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask)
// process address space.
pid = getpid();
// calibrate and start timer
calibrate();
startTime = gettime();
}
pthread_mutex_unlock(&lock);
@ -106,6 +95,13 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask)
// pre-allocate memory for event object pools in dedicated profiler context
struct context* ctx = (struct context *)calloc(1, sizeof(*ctx));
ctx->commName = commName;
ctx->commHash = commHash;
ctx->nranks = nranks;
ctx->rank = rank;
logFn = logfn;
INFO(NCCL_INIT, "PROFILER/Plugin: init commName: %s commHash: %lu nranks: %d rank: %d", commName ? commName : "", commHash, nranks, rank);
ctx->groupPool = (struct group *)calloc(groupPoolSize, sizeof(*ctx->groupPool));
if (ctx->groupPool == NULL) goto fail;
@ -142,17 +138,16 @@ fail:
__hidden ncclResult_t exampleProfilerFinalize(void* context) {
FILE* fh = NULL;
char filename[PATH_MAX] = { 0 };
char hostname[64] = { 0 };
gethostname(hostname, 64);
struct context* ctx = (struct context *)context;
const char* dump = getenv("NCCL_PROFILE_DUMP_FILE");
if (dump) {
sprintf(filename, "%s-%s-%ld.txt", dump, hostname, syscall(SYS_gettid));
sprintf(filename, "%s_%lu_%d.json", dump, ctx->commHash, ctx->rank);
fh = fopen(filename, "w");
fprintf(fh, "[\n");
}
INFO(NCCL_INIT, "PROFILER/Plugin: finalize commName: %s commHash: %lu nranks: %d rank: %d", ctx->commName ? ctx->commName : "", ctx->commHash, ctx->nranks, ctx->rank);
// print last N groups/collectives/p2ps
struct context* ctx = (struct context *)context;
int start = (ctx->groupPoolIndex - groupPoolSize >= 0) ? ctx->groupPoolIndex - groupPoolSize : 0;
int end = ctx->groupPoolIndex;
for (int i = start; i < end; i++) {
@ -243,8 +238,6 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->base.type = ncclProfileColl;
event->base.rank = eDescr->rank;
event->base.name = eDescr->coll.name;
event->base.commHash = eDescr->coll.commHash;
event->base.func = eDescr->coll.func;
event->base.startTs = gettime() - startTime;
event->base.parent = parent;
@ -254,7 +247,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->count = eDescr->coll.count;
event->root = eDescr->coll.root;
event->datatype = eDescr->coll.datatype;
event->nMaxChannels = eDescr->coll.nMaxChannels;
event->nChannels = eDescr->coll.nChannels;
event->nWarps = eDescr->coll.nWarps;
event->algo = eDescr->coll.algo;
event->proto = eDescr->coll.proto;
@ -281,8 +274,6 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->base.type = ncclProfileP2p;
event->base.rank = eDescr->rank;
event->base.name = eDescr->p2p.name;
event->base.commHash = eDescr->p2p.commHash;
event->base.func = eDescr->p2p.func;
event->base.next = parent->eventHead;
event->base.startTs = gettime() - startTime;
@ -291,6 +282,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->count = eDescr->p2p.count;
event->datatype = eDescr->p2p.datatype;
event->peer = eDescr->p2p.peer;
event->nChannels = eDescr->p2p.nChannels;
*eHandle = event;
// increment the group ref counter so the event will stay open
taskEventQueueEnqueue(parent, (struct taskEventBase *)event);
@ -331,6 +323,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->startTs = gettime() - startTime;
event->parent = NULL;
event->stepCount = 0;
*eHandle = event;
debugEvent(event, "PxnProxyOpStart");
return ncclSuccess;
@ -339,9 +332,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
if (eventBase->type == ncclProfileColl) {
struct collective* parent = (struct collective *)eDescr->parentObj;
int channelId = eDescr->proxyOp.channelId;
struct proxyOp* event = (eDescr->proxyOp.isSend) ?
&parent->send[channelId][parent->nProxyOps[channelId]++] :
&parent->recv[channelId][parent->nProxyOps[channelId]++];
struct proxyOp* event = &parent->op[channelId][parent->nProxyOps[channelId]++];
event->type = ncclProfileProxyOp;
event->channelId = channelId;
@ -353,6 +344,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->parent = eventBase;
event->startTs = gettime() - startTime;
event->stepCount = 0;
*eHandle = event;
__atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
debugEvent(event, "ProxyOpStart");
@ -370,6 +362,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->parent = eventBase;
event->startTs = gettime() - startTime;
event->stepCount = 0;
*eHandle = event;
__atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
debugEvent(event, "ProxyOpStart");
@ -382,9 +375,10 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
int s = parent->stepCount++ % MAX_STEPS;
struct proxyStep* event = &parent->step[s];
event->type = ncclProfileProxyStep;
event->state = 0;
event->step = eDescr->proxyStep.step;
event->isSend = parent->isSend;
event->parent = parent;
event->isSend = parent->isSend;
event->startTs = gettime() - startTime;
event->nNetEvents = 0;
*eHandle = event;
@ -397,6 +391,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
event->type = ncclProfileKernelCh;
event->channelId = eDescr->kernelCh.channelId;
event->startGpuClk = eDescr->kernelCh.pTimer;
event->parent = eventBase;
event->startTs = gettime() - startTime;
*eHandle = event;
@ -407,6 +402,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
event->type = ncclProfileKernelCh;
event->channelId = eDescr->kernelCh.channelId;
event->startGpuClk = eDescr->kernelCh.pTimer;
event->parent = eventBase;
event->startTs = gettime() - startTime;
*eHandle = event;
@ -563,29 +559,57 @@ __hidden ncclResult_t exampleProfilerRecordEventState(void* eHandle, ncclProfile
// the event handle might be null if we run out of events
if (eHandle == NULL) return ncclSuccess;
debugEvent(eHandle, "RecordEventState");
uint8_t type = *(uint8_t *)eHandle;
if (type == ncclProfileProxyOp) {
struct proxyOp* event = (struct proxyOp *)eHandle;
int steps = event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].steps;
if (eState == ncclProfilerProxyOpSendRemFifoWait && eStateArgs->proxyOp.steps == steps) return ncclSuccess;
event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].steps = eStateArgs->proxyOp.steps;
event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].timestamp = gettime() - startTime;
event->transSize = eStateArgs->proxyOp.transSize;
if (eState == ncclProfilerProxyOpInProgress_v4) {
event->progrTs = gettime() - startTime;
}
} else if (type == ncclProfileProxyStep) {
struct proxyStep* event = (struct proxyStep *)eHandle;
event->timestamp[event->isSend ? PROXY_STEP_SEND_STATE_IDX(eState) : PROXY_STEP_RECV_STATE_IDX(eState)] = gettime() - startTime;
struct proxyOp* parent = event->parent;
switch (eState) {
case ncclProfilerProxyStepSendGPUWait:
event->timestamp[PROXY_STEP_SEND_GPU_WAIT] = gettime() - startTime;
break;
case ncclProfilerProxyStepSendPeerWait_v4:
// do not update step event if in SendPeerWait
if (event->state == ncclProfilerProxyStepSendPeerWait_v4) break;
event->timestamp[PROXY_STEP_SEND_PEER_WAIT] = gettime() - startTime;
event->state = ncclProfilerProxyStepSendPeerWait_v4;
break;
case ncclProfilerProxyStepSendWait:
event->timestamp[PROXY_STEP_SEND_WAIT] = gettime() - startTime;
parent->transSize += eStateArgs->proxyStep.transSize;
break;
case ncclProfilerProxyStepRecvWait:
event->timestamp[PROXY_STEP_RECV_WAIT] = gettime() - startTime;
break;
case ncclProfilerProxyStepRecvFlushWait:
event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT] = gettime() - startTime;
parent->transSize += eStateArgs->proxyStep.transSize;
break;
case ncclProfilerProxyStepRecvGPUWait:
event->timestamp[PROXY_STEP_RECV_GPU_WAIT] = gettime() - startTime;
break;
}
} else if (type == ncclProfileProxyCtrl) {
struct proxyCtrl* event = (struct proxyCtrl *)eHandle;
if (eState == ncclProfilerProxyCtrlAppendEnd) {
event->appended = eStateArgs->proxyCtrl.appendedProxyOps;
}
event->state = eState;
} else if (type == ncclProfileKernelCh) {
struct kernelCh* event = (struct kernelCh *)eHandle;
if (eState == ncclProfilerKernelChStop) {
event->stopGpuClk = eStateArgs->kernelCh.pTimer;
}
}
debugEvent(eHandle, "RecordEventState");
return ncclSuccess;
}
ncclProfiler_t ncclProfiler_v3 = {
ncclProfiler_t ncclProfiler_v4 = {
"Example-profiler",
exampleProfilerInit,
exampleProfilerStartEvent,


@ -27,8 +27,8 @@ __hidden void printGroupEventTrailer(FILE* fh, struct group* event) {
static __thread int collId;
__hidden void printCollEventHeader(FILE* fh, struct collective* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"SeqNum\": %lu, \"CommHash\": %lu, \"Rank\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"Algorithm\": \"%s\", \"Protocol\": \"%s\", \"nMaxChannels\": %d}},\n",
event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, event->base.commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nMaxChannels);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"SeqNum\": %lu, \"CommHash\": %lu, \"Rank\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"Algorithm\": \"%s\", \"Protocol\": \"%s\", \"nChannels\": %d}},\n",
event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, event->base.parent->ctx->commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nChannels);
}
__hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
@ -38,8 +38,8 @@ __hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
static __thread int p2pId;
__hidden void printP2pEventHeader(FILE* fh, struct p2p* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"CommHash\": %lu, \"Rank\": %d, \"Peer\": %d, \"Count\": %lu, \"Datatype\": \"%s\"}},\n",
event->base.func, p2pId, getpid(), 1, event->base.startTs, event->base.commHash, event->base.rank, event->peer, event->count, event->datatype);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"CommHash\": %lu, \"Rank\": %d, \"Peer\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"nChannels\": %d}},\n",
event->base.func, p2pId, getpid(), 1, event->base.startTs, event->base.parent->ctx->commHash, event->base.rank, event->peer, event->count, event->datatype, event->nChannels);
}
__hidden void printP2pEventTrailer(FILE* fh, struct p2p* event) {
@ -50,47 +50,43 @@ __hidden void printP2pEventTrailer(FILE* fh, struct p2p* event) {
static __thread int proxyOpId;
__hidden void printProxyOpEventHeader(FILE* fh, struct proxyOp* event) {
if (event->isSend) {
int posted = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendPosted);
int remFifoWait = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendRemFifoWait);
int transmitted = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendTransmitted);
int done = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendDone);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu, \"POSTED\": {\"step\": %d, \"ts\": %f}, \"REM_FIFO_WAIT\": {\"step\": %d, \"ts\": %f}, \"TRANSMITTED\": {\"step\": %d, \"ts\": %f}, \"DONE\": {\"step\": %d, \"ts\": %f}}},\n",
"Send", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize, event->states[posted].steps, event->states[posted].timestamp, event->states[remFifoWait].steps, event->states[remFifoWait].timestamp, event->states[transmitted].steps, event->states[transmitted].timestamp, event->states[done].steps, event->states[done].timestamp);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ScheduleSend", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"ScheduleSend", proxyOpId, getpid(), 1, event->progrTs);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ProgressSend", proxyOpId, getpid(), 1, event->progrTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
} else {
int posted = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvPosted);
int received = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvReceived);
int transmitted = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvTransmitted);
int done = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvDone);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu, \"POSTED\": {\"step\": %d, \"ts\": %f}, \"RECEIVED\": {\"step\": %d, \"ts\": %f}, \"TRANSMITTED\": {\"step\": %d, \"ts\": %f}, \"DONE\": {\"step\": %d, \"ts\": %f}}},\n",
"Recv", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize, event->states[posted].steps, event->states[posted].timestamp, event->states[received].steps, event->states[received].timestamp, event->states[transmitted].steps, event->states[transmitted].timestamp, event->states[done].steps, event->states[done].timestamp);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ScheduleRecv", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"ScheduleRecv", proxyOpId, getpid(), 1, event->progrTs);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ProgressRecv", proxyOpId, getpid(), 1, event->progrTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
}
}
__hidden void printProxyOpEventTrailer(FILE* fh, struct proxyOp* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
event->isSend ? "Send" : "Recv", proxyOpId++, getpid(), 1, event->stopTs);
event->isSend ? "ProgressSend" : "ProgressRecv", proxyOpId++, getpid(), 1, event->stopTs);
}
static __thread int proxyStepId;
__hidden void printProxyStepEventHeader(FILE* fh, struct proxyStep* event) {
if (event->isSend) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendBufferWait", proxyStepId, getpid(), 1, event->startTs, event->step);
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_GPU_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"SendBufferWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendGPUWait)]);
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_PEER_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendGPUWait)], event->step);
"SendPeerWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_PEER_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendWait)]);
"SendPeerWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendWait)], event->step);
"SendWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_WAIT], event->step);
} else {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvBufferWait", proxyStepId, getpid(), 1, event->startTs, event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvBufferWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvWait)]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvWait)], event->step);
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_WAIT], event->step);
}
}
@ -100,13 +96,13 @@ __hidden void printProxyStepEventTrailer(FILE* fh, struct proxyStep* event) {
"SendWait", proxyStepId++, getpid(), 1, event->stopTs);
} else {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvFlushWait)]);
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvFlushWait)], event->step);
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvGPUWait)]);
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_GPU_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvGPUWait)], event->step);
"RecvGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_GPU_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvGpuWait", proxyStepId++, getpid(), 1, event->stopTs);
}
@ -115,8 +111,8 @@ __hidden void printProxyStepEventTrailer(FILE* fh, struct proxyStep* event) {
static __thread int kernelId;
__hidden void printKernelChEventHeader(FILE* fh, struct kernelCh* event) {
if (event->type != ncclProfileKernelCh) return;
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GPU\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d}},\n",
"KernelCh", kernelId, getpid(), 1, event->startTs, event->channelId);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GPU\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"StartGpuClk\": %lu, \"StopGpuClk\": %lu}},\n",
"KernelCh", kernelId, getpid(), 1, event->startTs, event->channelId, event->startGpuClk, event->stopGpuClk);
}
__hidden void printKernelChEventTrailer(FILE* fh, struct kernelCh* event) {
@ -134,6 +130,8 @@ __hidden void printProxyCtrlEvent(FILE* fh, struct proxyCtrl* event) {
str = "Sleep";
} else if (event->state == ncclProfilerProxyCtrlAppend || event->state == ncclProfilerProxyCtrlAppendEnd) {
str = "Append";
} else {
return;
}
if (event->state == ncclProfilerProxyCtrlAppendEnd) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"appended\": %d}},\n",
@ -188,9 +186,8 @@ void debugEvent(void* eHandle, const char* tag) {
fprintf(fh, "Collective event %p tag = %s {\n", event, tag);
fprintf(fh, " refCount = %d\n", __atomic_load_n(&event->base.refCount, __ATOMIC_RELAXED));
fprintf(fh, " parent = %p\n", event->base.parent);
for (int j = 0; j < MAX_OPS; j++) {
for (int i = 0; i < MAX_CHANNELS; i++) if (event->send[i][j].type == ncclProfileProxyOp) fprintf(fh, " send[%d] = %p\n", i, &event->send[i]);
for (int i = 0; i < MAX_CHANNELS; i++) if (event->recv[i][j].type == ncclProfileProxyOp) fprintf(fh, " recv[%d] = %p\n", i, &event->recv[i]);
for (int j = 0; j < 2*MAX_OPS; j++) {
for (int i = 0; i < MAX_CHANNELS; i++) if (event->op[i][j].type == ncclProfileProxyOp) fprintf(fh, " op[%d] = %p\n", i, &event->op[i]);
}
fprintf(fh, " startTs = %f\n", event->base.startTs);
fprintf(fh, " stopTs = %f\n", event->base.stopTs);
@ -207,17 +204,18 @@ void debugEvent(void* eHandle, const char* tag) {
} else if (type == ncclProfileProxyOp) {
struct proxyOp* event = (struct proxyOp *)eHandle;
fprintf(fh, "ProxyOp event %p tag = %s {\n", event, tag);
fprintf(fh, " type = %s\n", event->isSend ? "Send" : "Recv");
fprintf(fh, " type = %s\n", event->isSend < 0 ? "Unknown" : event->isSend ? "Send" : "Recv");
fprintf(fh, " channel = %d\n", event->channelId);
fprintf(fh, " parent = %p\n", event->parent);
fprintf(fh, " rank = %d\n", event->rank);
fprintf(fh, " startTs = %f\n", event->startTs);
fprintf(fh, " progrTs = %f\n", event->progrTs);
fprintf(fh, " stopTs = %f\n", event->stopTs);
fprintf(fh, "}\n");
} else if (type == ncclProfileProxyStep) {
struct proxyStep* event = (struct proxyStep *)eHandle;
fprintf(fh, "ProxyStep event %p tag = %s {\n", event, tag);
fprintf(fh, " type = %s\n", event->isSend ? "Send" : "Recv");
fprintf(fh, " type = %s\n", event->isSend < 0 ? "Unknown" : event->isSend ? "Send" : "Recv");
fprintf(fh, " parent = %p\n", event->parent);
fprintf(fh, " startTs = %f\n", event->startTs);
fprintf(fh, " stopTs = %f\n", event->stopTs);
@ -260,8 +258,7 @@ void printEvent(FILE* fh, void* handle) {
for (int i = 0; i < MAX_CHANNELS; i++) {
printKernelChEventHeader(fh, &c->kernel[i]);
for (int j = 0; j < c->nProxyOps[i]; j++) {
printEvent(fh, &c->send[i][j]);
printEvent(fh, &c->recv[i][j]);
printEvent(fh, &c->op[i][j]);
}
printKernelChEventTrailer(fh, &c->kernel[i]);
}


@ -7,6 +7,9 @@
#ifndef PRINT_EVENT_H_
#define PRINT_EVENT_H_
#include "nccl/common.h"
extern ncclDebugLogger_t logFn;
void debugEvent(void* eHandle, const char* tag);
void printEvent(FILE* fh, void* handle);

ext-tuner/basic/Makefile Normal file

@ -0,0 +1,23 @@
#
# Copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
#
# See LICENSE.txt for license information
#
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
SRC_FILES := $(wildcard *.c)
DST_DIR := $(BUILDDIR)/test/unit/plugins
build: ${BUILDDIR}/libnccl-tuner-basic.so
${BUILDDIR}/libnccl-tuner-basic.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f ${BUILDDIR}/libnccl-tuner-basic.so


@ -0,0 +1,15 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef COMMON_H_
#define COMMON_H_
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
#endif


@ -0,0 +1,17 @@
/*
* Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
*/
#ifndef NCCL_ERR_H_
#define NCCL_ERR_H_
/* Error type for plugins */
typedef enum { ncclSuccess = 0,
ncclUnhandledCudaError = 1,
ncclSystemError = 2,
ncclInternalError = 3,
ncclInvalidArgument = 4,
ncclInvalidUsage = 5,
ncclRemoteError = 6 } ncclResult_t;
#endif


@ -0,0 +1,97 @@
/*************************************************************************
* Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2023, Meta Platforms, Inc. and affiliates.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_TUNER_H_
#define NCCL_TUNER_H_
#include <stdint.h>
#include <stdlib.h>
#include "common.h"
#include "err.h"
#define NCCL_NUM_FUNCTIONS 5 // Send/Recv not included for now
typedef enum {
ncclFuncBroadcast = 0,
ncclFuncReduce = 1,
ncclFuncAllGather = 2,
ncclFuncReduceScatter = 3,
ncclFuncAllReduce = 4,
ncclFuncSendRecv = 5,
ncclFuncSend = 6,
ncclFuncRecv = 7,
ncclNumFuncs = 8
} ncclFunc_t;
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*
#define NCCL_ALGO_UNDEF -1
#define NCCL_ALGO_TREE 0
#define NCCL_ALGO_RING 1
#define NCCL_ALGO_COLLNET_DIRECT 2
#define NCCL_ALGO_COLLNET_CHAIN 3
#define NCCL_ALGO_NVLS 4
#define NCCL_ALGO_NVLS_TREE 5
#define NCCL_ALGO_PAT 6
#define NCCL_NUM_PROTOCOLS 3 // Simple/LL/LL128
#define NCCL_PROTO_UNDEF -1
#define NCCL_PROTO_LL 0
#define NCCL_PROTO_LL128 1
#define NCCL_PROTO_SIMPLE 2
#define NCCL_ALGO_PROTO_IGNORE -1.0
// API to be implemented by external tuner
typedef struct {
// Name of the tuner
const char* name;
// Initializes tuner states.
// Inputs:
// - nRanks: number of ranks in current communicator. Each communicator initializes its own tuner.
// - nNodes: number of nodes in current communicator.
// - logFunction: a logging function that can be used to integrate plugin logging with NCCL core.
// Outputs:
// - context: tuner context object
ncclResult_t (*init)(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context);
// Gets info (algo, protocol, number of ctas and threads) for a given collective.
// Inputs:
// - context: tuner context object
// - collType: collective type, e.g., allreduce, allgather…
// - nBytes: collective size in bytes
// - numPipeOps: number of operations in the group
// - numAlgo: number of algorithms in collCostTable
// - numProto: number of protocols in collCostTable
// - regBuff: whether the user buffer can be registered
//
// Outputs:
// - nChannels: number of channels (hence SMs) to be used.
//
// InOut:
// - collCostTable: collective cost table, generated by NCCL core, containing algo|proto|time entries for collType.
// NCCL core sets ignored algo/proto cost table entries to -1.0 (NCCL_ALGO_PROTO_IGNORE).
//
// If getCollInfo() does not return ncclSuccess, NCCL will fall back to the
// default tuning for the given collective.
// Also, the plugin may leave all outputs unset, or set both the algorithm and
// the protocol together; it may not set only the algorithm or only the protocol.
// Unset fields will be set automatically by NCCL.
ncclResult_t (*getCollInfo)(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels);
// Terminates the plugin and cleans up any resources that the plugin allocated.
// context: tuner context object
ncclResult_t (*destroy)(void* context);
} ncclTuner_v4_t;
typedef ncclTuner_v4_t ncclTuner_t;
#define NCCL_TUNER_PLUGIN_SYMBOL "ncclTunerPlugin_v4"
#endif

ext-tuner/basic/plugin.c Normal file

@ -0,0 +1,34 @@
/*************************************************************************
* Copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "tuner.h"
#define __hidden __attribute__ ((visibility("hidden")))
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) { return ncclSuccess; }
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels) {
// Update NCCL core generated cost table. Updated table will be evaluated by NCCL to pick the best algo/proto combo
float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
if (table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] != NCCL_ALGO_PROTO_IGNORE) {
table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] = 0.0;
}
*nChannels = 1;
return ncclSuccess;
}
__hidden ncclResult_t pluginDestroy(void* context) { return ncclSuccess; }
#define PLUGIN_NAME "Basic"
const ncclTuner_v4_t ncclTunerPlugin_v4 = {
.name = PLUGIN_NAME,
.init = pluginInit,
.getCollInfo = pluginGetCollInfo,
.destroy = pluginDestroy
};


@ -3,15 +3,53 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME:=../../build/
CUDA_HOME:=/usr/local/cuda
INC:= -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO:=libnccl-tuner.so
default: $(PLUGIN_SO)
.DEFAULT_GOAL: build
PLUGIN_SO:=libnccl-tuner-example.so
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
$(PLUGIN_SO): plugin.c
$(CC) $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
SRC_FILES := $(wildcard *.c)
DST_DIR := $(BUILDDIR)/test/unit/plugins
default: ${BUILDDIR}/$(PLUGIN_SO)
build: ${BUILDDIR}/$(PLUGIN_SO)
${BUILDDIR}/$(PLUGIN_SO): plugin.c
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
# Test targets - delegate to test directory
test:
$(MAKE) -C test test TEST_CASE=$(TEST_CASE)
test-verbose:
$(MAKE) -C test test-verbose TEST_CASE=$(TEST_CASE)
# Build tests
test-build:
$(MAKE) -C test all
# Optimize configurations from performance data
optimize-config:
@if [ -z "$(CSV_FILE)" ]; then \
echo "Usage: make optimize-config CSV_FILE=path/to/data.csv [OUTPUT=config.conf] [METRIC=latency_us]"; \
echo "Example: make optimize-config CSV_FILE=scripts/sample_performance_data.csv"; \
exit 1; \
fi
python3 scripts/optimize_config.py $(CSV_FILE) \
$(if $(OUTPUT),-o $(OUTPUT)) \
$(if $(METRIC),-m $(METRIC)) \
$(if $(SIZE_RANGES),--size-ranges $(SIZE_RANGES)) \
$(if $(DRY_RUN),--dry-run) \
$(if $(NO_HEADER),--no-header)
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/$(PLUGIN_SO)
$(MAKE) -C test clean
.PHONY: test test-verbose test-build optimize-config clean

ext-tuner/example/README.md Normal file

@ -0,0 +1,164 @@
# NCCL Example Tuner Plugin
This example plugin demonstrates a practical, CSV-file-based tuning approach that allows tuning parameters to be selectively overridden based on all tuning inputs, without recompiling.
## Features
- **File-based Configuration**: Read tuning parameters from a CSV configuration file
- **Size-based Tuning**: Specify different configurations based on message size ranges
- **Dimension-aware Tuning**: Match configurations based on number of nodes and ranks
- **Optional Channels Configuration**: Set specific channel counts or use -1 to keep NCCL's default
- **Environment Variable Support**: Specify config file location via `NCCL_TUNER_CONFIG_FILE`
- **Fallback Behavior**: Gracefully handles missing config files and invalid entries
## Building
```bash
make
```
This will create `libnccl-tuner-example.so` that can be loaded by NCCL.
## Configuration File Format
The configuration file uses CSV (Comma-Separated Values) format with one configuration per line:
```
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
```
### Parameters
- **collective_type**: The collective operation type
- `broadcast`, `reduce`, `allgather`, `reducescatter`, `allreduce`
- **min_bytes/max_bytes**: The message size range (in bytes) for which this config applies
- Use `0` for minimum and `4294967295` for maximum (covers all sizes)
- **algorithm**: The NCCL algorithm to use
- `tree`, `ring`, `collnet_direct`, `collnet_chain`, `nvls`, `nvls_tree`, `pat`
- **protocol**: The NCCL protocol to use
- `ll`, `ll128`, `simple`
- **channels**: Number of channels (SMs) to use
- Use a positive integer to specify exact channel count
- Use `-1` to keep NCCL's default channel selection
- **nNodes**: Number of nodes to match
- Use a positive integer to match specific node count
- Use `-1` to match any number of nodes
- **nRanks**: Number of ranks to match
- Use a positive integer to match specific rank count
- Use `-1` to match any number of ranks
- **numPipeOps**: Number of pipeline operations to match (optional)
- Use a positive integer to match specific pipeline operation count
- Use `-1` to match any number of pipeline operations
- If omitted, configuration will match any numPipeOps value
- **regBuff**: Whether user buffer can be registered (optional)
- Use `0` to match only non-registered buffers
- Use `1` to match only registered buffers
- Use `-1` to match either registered or non-registered buffers
- If omitted, configuration will match any regBuff value
### Example Configuration
```csv
# Single-node, small allreduce: use tree algorithm, registered buffers only
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# 4-node, 32-rank setup: medium allreduce, single pipeline op, non-registered buffers
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# Any topology: large allreduce with LL128, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Single-node broadcast: prefer tree; the 8-field form matches any pipeOps and any regBuff (backward compatible)
broadcast,0,32768,tree,simple,-1,1,-1
# Multi-node broadcast: optimized for non-registered buffers, single pipeline op
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
```
Comments start with `#` and empty lines are ignored. The CSV format makes it easy to edit configurations in spreadsheet applications like Excel, Google Sheets, or LibreOffice Calc.
### Backward Compatibility
Configurations without the numPipeOps and/or regBuff parameters are fully supported:
- 8 fields: matches any numPipeOps and regBuff values
- 9 fields: matches any regBuff value
- 10 fields: full parameter specification
This ensures existing configuration files continue to work without modification.
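As a quick illustration (the values below are arbitrary), the same allreduce rule can be written with 8, 9, or 10 fields; the shorter forms simply leave the trailing parameters as wildcards:
```csv
# 8 fields: matches any numPipeOps and any regBuff
allreduce,0,65536,tree,simple,2,1,-1
# 9 fields: numPipeOps pinned to 1, any regBuff
allreduce,0,65536,tree,simple,2,1,-1,1
# 10 fields: numPipeOps=1 and regBuff=1
allreduce,0,65536,tree,simple,2,1,-1,1,1
```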
## Usage
### Method 1: Default Config File
Place your configuration in `nccl_tuner.conf` in the current working directory.
### Method 2: Environment Variable
Set the `NCCL_TUNER_CONFIG_FILE` environment variable to specify the config file path:
```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/your/tuner.conf
export LD_LIBRARY_PATH=/path/to/plugin:$LD_LIBRARY_PATH
mpirun -np 4 your_nccl_application
```
## Editing Configuration Files
### Generating Configuration Files from Raw Data
A Python script that generates valid CSV configs from raw performance data is provided; see [Using optimize_config.py](scripts/README.md).
### Spreadsheet Tips
- Use column headers: `collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff`
- Save as CSV format (not Excel format) for the plugin to read
- Use data validation to prevent typos in algorithm/protocol names
## Logging
The plugin uses NCCL's logging system. To see tuner-related messages:
```bash
export NCCL_DEBUG=INFO
```
This will show when configurations are loaded and applied, including the topology information.
For detailed debugging output during tuning decisions:
```bash
export NCCL_DEBUG=TRACE
```
This will show verbose information about which configurations are being evaluated and matched.
## Dimension Matching
Configurations are only applied when the topology matches:
- **Exact Match**: Configuration specifies `nNodes=4,nRanks=32`, only applied when communicator has exactly 4 nodes and 32 ranks
- **Wildcard Nodes**: Configuration specifies `nNodes=-1,nRanks=8`, applied to any topology with exactly 8 ranks
- **Wildcard Ranks**: Configuration specifies `nNodes=2,nRanks=-1`, applied to any 2-node topology regardless of ranks per node
- **Wildcard Both**: Configuration specifies `nNodes=-1,nRanks=-1`, applied to any topology
This allows you to create specialized configurations for different cluster setups while maintaining flexibility.
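For example, the following hypothetical entries exercise each of the matching modes above:
```csv
# Exact match: only applied on 4 nodes with 32 total ranks
allreduce,0,1048576,ring,simple,4,4,32
# Wildcard nodes: applied to any topology with exactly 8 ranks
allreduce,0,1048576,tree,simple,-1,-1,8
# Wildcard ranks: applied to any 2-node topology
allreduce,0,1048576,ring,simple,2,2,-1
# Wildcard both: applied to any topology
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1
```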
## Default Behavior
If no configuration file is found or no matching configuration exists for a collective operation, the plugin falls back to preferring the ring algorithm with the simple protocol. Matching algorithm/protocol combinations from the configuration are given a low cost (0.0) so that NCCL's selection logic prefers them.
When channels is set to `-1`, NCCL's default channel selection logic is preserved, allowing the system to automatically determine the optimal number of channels based on hardware and message size.
## Troubleshooting
1. **Config file not found**: Check the file path and permissions
2. **Configurations not applied**: Verify the collective type, size ranges, algorithm/protocol names, and topology parameters
3. **Plugin not loaded**: Ensure `LD_LIBRARY_PATH` includes the plugin directory
4. **No effect on performance**: Check that NCCL is actually using the tuner plugin with `NCCL_DEBUG=INFO`
5. **Topology mismatch**: Verify that nNodes and nRanks match your actual setup, or use -1 for wildcards
6. **CSV parsing errors**: Ensure no spaces after commas, or quote fields containing spaces


@ -0,0 +1,45 @@
# NCCL Tuner Configuration File (CSV Format)
# Format: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
#
# Collective types: broadcast, reduce, allgather, reducescatter, allreduce
# Algorithms: tree, ring, collnet_direct, collnet_chain, nvls, nvls_tree, pat
# Protocols: ll, ll128, simple
# Channels: number of channels to use, or -1 to keep default
# nNodes: number of nodes to match, or -1 for any number of nodes
# nRanks: number of ranks to match, or -1 for any number of ranks
# numPipeOps: number of pipeline operations to match, or -1 for any number (optional)
# regBuff: whether user buffer can be registered (0=no, 1=yes, -1=any) (optional)
#
# Note: numPipeOps and regBuff parameters are optional - configurations without them will match any value
#
# Examples:
# For single-node configurations with registered buffers
# Small allreduce operations on single node - use tree algorithm, registered buffers
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# For multi-node configurations with 4 nodes, 32 total ranks, single pipeline op, non-registered buffers
# Medium allreduce operations - use ring algorithm
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# For any topology - large allreduce operations with LL128 protocol, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Broadcast operations - different configs for different topologies, pipeline complexity, and buffer types
# Single node broadcast - prefer tree, any pipeOps, registered buffers only
broadcast,0,32768,tree,simple,-1,1,-1,-1,1
# Multi-node broadcast with single pipeline operation, non-registered buffers - use ring
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
# AllGather operations - optimized for 2-node configurations, any pipeOps, any buffer type
allgather,0,4294967295,ring,simple,4,2,-1
# ReduceScatter operations
# Small messages on single node, single pipeline op, registered buffers
reducescatter,0,131072,tree,simple,2,1,-1,1,1
# Large messages on any topology, multiple pipeline ops, non-registered buffers
reducescatter,131073,4294967295,ring,simple,-1,-1,-1,2,0
# Reduce operations - any topology, keep default channels, any pipeOps, any buffer type
reduce,0,4294967295,tree,simple,-1,-1,-1


@ -5,24 +5,443 @@
************************************************************************/
#include "tuner.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define __hidden __attribute__ ((visibility("hidden")))
#define MAX_LINE_LENGTH 256
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) { return ncclSuccess; }
// CSV field indices for configuration parsing
// Format: colltype,minbytes,maxbytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
#define CONFIG_FIELD_COLLTYPE 0
#define CONFIG_FIELD_MINBYTES 1
#define CONFIG_FIELD_MAXBYTES 2
#define CONFIG_FIELD_ALGORITHM 3
#define CONFIG_FIELD_PROTOCOL 4
#define CONFIG_FIELD_CHANNELS 5
#define CONFIG_FIELD_NNODES 6
#define CONFIG_FIELD_NRANKS 7
#define CONFIG_FIELD_PIPEOPS 8 // Optional field
#define CONFIG_FIELD_REGBUFF 9 // Optional field
// Field count constants
#define CONFIG_FIELDS_REQUIRED 8 // Minimum required fields (up to nRanks)
#define CONFIG_FIELDS_WITH_PIPEOPS 9 // Fields including numPipeOps
#define CONFIG_FIELDS_WITH_REGBUFF 10 // Fields including both numPipeOps and regBuff
#define CONFIG_FIELDS_MAX 10 // Maximum number of fields supported
typedef struct {
ncclFunc_t collType;
size_t minBytes;
size_t maxBytes;
int algorithm;
int protocol;
int nChannels;
int nNodes;
int nRanks;
int numPipeOps;
int regBuff;
} TuningConfig;
typedef struct {
TuningConfig* configs; // Changed from static array to dynamic pointer
int numConfigs;
int maxConfigs; // Added to track allocated size
size_t nRanks;
size_t nNodes;
ncclDebugLogger_t logFunction;
} TunerContext;
// Parse collective type from string
static ncclFunc_t parseCollType(const char* str) {
if (strcmp(str, "broadcast") == 0) return ncclFuncBroadcast;
if (strcmp(str, "reduce") == 0) return ncclFuncReduce;
if (strcmp(str, "allgather") == 0) return ncclFuncAllGather;
if (strcmp(str, "reducescatter") == 0) return ncclFuncReduceScatter;
if (strcmp(str, "allreduce") == 0) return ncclFuncAllReduce;
return ncclFuncAllReduce; // default
}
// Convert collective type to string
static const char* collTypeToString(ncclFunc_t collType) {
switch (collType) {
case ncclFuncBroadcast: return "broadcast";
case ncclFuncReduce: return "reduce";
case ncclFuncAllGather: return "allgather";
case ncclFuncReduceScatter: return "reducescatter";
case ncclFuncAllReduce: return "allreduce";
default: return "unknown";
}
}
// Parse algorithm from string
static int parseAlgorithm(const char* str) {
if (strcmp(str, "tree") == 0) return NCCL_ALGO_TREE;
if (strcmp(str, "ring") == 0) return NCCL_ALGO_RING;
if (strcmp(str, "collnet_direct") == 0) return NCCL_ALGO_COLLNET_DIRECT;
if (strcmp(str, "collnet_chain") == 0) return NCCL_ALGO_COLLNET_CHAIN;
if (strcmp(str, "nvls") == 0) return NCCL_ALGO_NVLS;
if (strcmp(str, "nvls_tree") == 0) return NCCL_ALGO_NVLS_TREE;
if (strcmp(str, "pat") == 0) return NCCL_ALGO_PAT;
return NCCL_ALGO_RING; // default
}
// Convert algorithm to string
static const char* algorithmToString(int algorithm) {
switch (algorithm) {
case NCCL_ALGO_TREE: return "tree";
case NCCL_ALGO_RING: return "ring";
case NCCL_ALGO_COLLNET_DIRECT: return "collnet_direct";
case NCCL_ALGO_COLLNET_CHAIN: return "collnet_chain";
case NCCL_ALGO_NVLS: return "nvls";
case NCCL_ALGO_NVLS_TREE: return "nvls_tree";
case NCCL_ALGO_PAT: return "pat";
default: return "unknown";
}
}
// Parse protocol from string
static int parseProtocol(const char* str) {
if (strcmp(str, "ll") == 0) return NCCL_PROTO_LL;
if (strcmp(str, "ll128") == 0) return NCCL_PROTO_LL128;
if (strcmp(str, "simple") == 0) return NCCL_PROTO_SIMPLE;
return NCCL_PROTO_SIMPLE; // default
}
// Convert protocol to string
static const char* protocolToString(int protocol) {
switch (protocol) {
case NCCL_PROTO_LL: return "ll";
case NCCL_PROTO_LL128: return "ll128";
case NCCL_PROTO_SIMPLE: return "simple";
default: return "unknown";
}
}
// Helper function to count valid configuration lines in file
static int countConfigLines(const char* filename) {
FILE* file = fopen(filename, "r");
if (!file) {
return 0;
}
char line[MAX_LINE_LENGTH];
int count = 0;
while (fgets(line, sizeof(line), file)) {
// Skip comments and empty lines
if (line[0] == '#' || line[0] == '\n') continue;
// Remove trailing newline
line[strcspn(line, "\n")] = 0;
// Check if line has content
if (strlen(line) > 0) {
count++;
}
}
fclose(file);
return count;
}
// Load configuration from file
static ncclResult_t loadConfig(TunerContext* ctx, const char* filename) {
FILE* file = fopen(filename, "r");
if (!file) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config file %s not found, using defaults", filename);
}
return ncclSuccess; // Not finding config file is not an error
}
// First pass: count valid configuration lines
int configCount = countConfigLines(filename);
if (configCount == 0) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: No valid configurations found in %s", filename);
}
fclose(file);
return ncclSuccess;
}
// Allocate memory for configurations based on actual count
ctx->configs = (TuningConfig*)malloc(configCount * sizeof(TuningConfig));
if (!ctx->configs) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Failed to allocate memory for %d configurations", configCount);
}
fclose(file);
return ncclSystemError;
}
ctx->maxConfigs = configCount;
ctx->numConfigs = 0;
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Allocated memory for %d configurations", configCount);
}
// Reset file pointer to beginning
fseek(file, 0, SEEK_SET);
char line[MAX_LINE_LENGTH];
while (fgets(line, sizeof(line), file) && ctx->numConfigs < ctx->maxConfigs) {
// Skip comments and empty lines
if (line[0] == '#' || line[0] == '\n') continue;
// Remove trailing newline
line[strcspn(line, "\n")] = 0;
// Parse CSV format: colltype,minbytes,maxbytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
char* token;
char* tokens[CONFIG_FIELDS_MAX];
int tokenCount = 0;
// Make a copy of the line for tokenizing
char lineCopy[MAX_LINE_LENGTH];
strncpy(lineCopy, line, sizeof(lineCopy));
lineCopy[sizeof(lineCopy) - 1] = '\0';
// Tokenize by comma
token = strtok(lineCopy, ",");
while (token != NULL && tokenCount < CONFIG_FIELDS_MAX) {
// Trim whitespace
while (*token == ' ' || *token == '\t') token++;
char* end = token + strlen(token) - 1;
while (end > token && (*end == ' ' || *end == '\t')) {
*end = '\0';
end--;
}
tokens[tokenCount++] = token;
token = strtok(NULL, ",");
}
// Validate field count: support required fields (8), with pipeOps (9), or with regBuff (10)
if (tokenCount >= CONFIG_FIELDS_REQUIRED && tokenCount <= CONFIG_FIELDS_MAX) {
TuningConfig* config = &ctx->configs[ctx->numConfigs];
config->collType = parseCollType(tokens[CONFIG_FIELD_COLLTYPE]);
config->minBytes = (size_t)strtoull(tokens[CONFIG_FIELD_MINBYTES], NULL, 10);
config->maxBytes = (size_t)strtoull(tokens[CONFIG_FIELD_MAXBYTES], NULL, 10);
config->algorithm = parseAlgorithm(tokens[CONFIG_FIELD_ALGORITHM]);
config->protocol = parseProtocol(tokens[CONFIG_FIELD_PROTOCOL]);
config->nChannels = atoi(tokens[CONFIG_FIELD_CHANNELS]);
config->nNodes = atoi(tokens[CONFIG_FIELD_NNODES]);
config->nRanks = atoi(tokens[CONFIG_FIELD_NRANKS]);
// numPipeOps is optional (9th field, index 8)
if (tokenCount >= CONFIG_FIELDS_WITH_PIPEOPS) {
config->numPipeOps = atoi(tokens[CONFIG_FIELD_PIPEOPS]);
} else {
config->numPipeOps = -1; // -1 means match any numPipeOps
}
// regBuff is optional (10th field, index 9)
if (tokenCount >= CONFIG_FIELDS_WITH_REGBUFF) {
config->regBuff = atoi(tokens[CONFIG_FIELD_REGBUFF]);
} else {
config->regBuff = -1; // -1 means match any regBuff value
}
ctx->numConfigs++;
if (ctx->logFunction) {
if (config->numPipeOps == -1 && config->regBuff == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=any regBuff=any",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks);
} else if (config->regBuff == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=%d regBuff=any",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->numPipeOps);
} else if (config->numPipeOps == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=any regBuff=%d",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->regBuff);
} else {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=%d regBuff=%d",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->numPipeOps, config->regBuff);
}
}
}
}
fclose(file);
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded %d tuning configurations from %s", ctx->numConfigs, filename);
}
return ncclSuccess;
}
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) {
TunerContext* ctx = (TunerContext*)malloc(sizeof(TunerContext));
if (!ctx) return ncclSystemError;
ctx->configs = NULL; // Initialize to NULL
ctx->numConfigs = 0;
ctx->maxConfigs = 0; // Initialize to 0
ctx->nRanks = nRanks;
ctx->nNodes = nNodes;
ctx->logFunction = logFunction;
if (logFunction) {
logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Initializing tuner for %zu nodes, %zu ranks", nNodes, nRanks);
}
// Try to load config file from environment variable or default location
const char* configFile = getenv("NCCL_TUNER_CONFIG_FILE");
if (!configFile) {
configFile = "nccl_tuner.conf"; // default config file name
}
ncclResult_t result = loadConfig(ctx, configFile);
if (result != ncclSuccess) {
if (ctx->configs) {
free(ctx->configs); // Clean up allocated memory on error
}
free(ctx);
return result;
}
*context = ctx;
return ncclSuccess;
}
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels) {
// Update NCCL core generated cost table. Updated table will be evaluated by NCCL to pick the best algo/proto combo
float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
if (table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] != NCCL_ALGO_PROTO_IGNORE) {
table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] = 0.0;
}
TunerContext* ctx = (TunerContext*)context;
if (!ctx) return ncclInternalError;
// Default channels
*nChannels = 1;
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: pluginGetCollInfo called - collType=%s, nBytes=%zu, numPipeOps=%d, regBuff=%d, numConfigs=%d",
collTypeToString(collType), nBytes, numPipeOps, regBuff, ctx->numConfigs);
}
// Look for matching configuration
for (int i = 0; i < ctx->numConfigs; i++) {
TuningConfig* config = &ctx->configs[i];
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Checking config %d - collType=%s, minBytes=%zu, maxBytes=%zu, algo=%s, proto=%s, nNodes=%d, nRanks=%d, numPipeOps=%d, regBuff=%d",
i, collTypeToString(config->collType), config->minBytes, config->maxBytes, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nNodes, config->nRanks, config->numPipeOps, config->regBuff);
}
// Check if this config matches the current collective, size range, topology, pipeline ops, and regBuff
if (config->collType == collType &&
nBytes >= config->minBytes &&
nBytes <= config->maxBytes &&
(config->nNodes == -1 || config->nNodes == (int)ctx->nNodes) &&
(config->nRanks == -1 || config->nRanks == (int)ctx->nRanks) &&
(config->numPipeOps == -1 || config->numPipeOps == numPipeOps) &&
(config->regBuff == -1 || config->regBuff == regBuff)) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config matches. Applying algo=%s, proto=%s, channels=%d",
algorithmToString(config->algorithm), protocolToString(config->protocol), config->nChannels);
}
// Check bounds
if (config->algorithm < numAlgo && config->protocol < numProto) {
if (collCostTable[config->algorithm][config->protocol] != NCCL_ALGO_PROTO_IGNORE) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Setting cost table[%s][%s] (%p) = 0.0 (was %.1f)",
algorithmToString(config->algorithm), protocolToString(config->protocol),
&collCostTable[config->algorithm][config->protocol], collCostTable[config->algorithm][config->protocol]);
}
collCostTable[config->algorithm][config->protocol] = 0.0; // Set low cost to prefer this configuration
// Only override channels if not set to -1 (keep default)
if (config->nChannels != -1) {
*nChannels = config->nChannels;
}
if (ctx->logFunction) {
if (config->nChannels == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Applied config for collType=%s, bytes=%zu, pipeOps=%d, regBuff=%d: algo=%s, proto=%s, channels=default (nodes=%d, ranks=%d)",
collTypeToString(config->collType), nBytes, numPipeOps, regBuff, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nNodes, config->nRanks);
} else {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Applied config for collType=%s, bytes=%zu, pipeOps=%d, regBuff=%d: algo=%s, proto=%s, channels=%d (nodes=%d, ranks=%d)",
collTypeToString(config->collType), nBytes, numPipeOps, regBuff, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nChannels, config->nNodes, config->nRanks);
}
}
return ncclSuccess;
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Algorithm/protocol combination [%s][%s] is marked as IGNORE",
algorithmToString(config->algorithm), protocolToString(config->protocol));
}
}
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Algorithm/protocol out of bounds - algo=%s (max %d), proto=%s (max %d)",
algorithmToString(config->algorithm), numAlgo, protocolToString(config->protocol), numProto);
}
}
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config does not match - collType match=%d, size match=%d, nodes match=%d, ranks match=%d, pipeOps match=%d, regBuff match=%d",
config->collType == collType,
(nBytes >= config->minBytes && nBytes <= config->maxBytes),
(config->nNodes == -1 || config->nNodes == (int)ctx->nNodes),
(config->nRanks == -1 || config->nRanks == (int)ctx->nRanks),
(config->numPipeOps == -1 || config->numPipeOps == numPipeOps),
(config->regBuff == -1 || config->regBuff == regBuff));
}
}
}
// If no specific config found, apply default behavior
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: No matching config found");
}
return ncclSuccess;
}
__hidden ncclResult_t pluginDestroy(void* context) { return ncclSuccess; }
__hidden ncclResult_t pluginDestroy(void* context) {
if (context) {
TunerContext* ctx = (TunerContext*)context;
if (ctx->configs) {
free(ctx->configs); // Free dynamically allocated configs array
}
free(context);
}
return ncclSuccess;
}
#define PLUGIN_NAME "Example"


@ -0,0 +1,106 @@
# NCCL Tuner Configuration Scripts
This directory contains scripts for optimizing NCCL tuner configurations based on performance data.
## optimize_config.py
A Python script that reads performance data from CSV files and generates optimal NCCL tuner configurations.
### Usage
```bash
python scripts/optimize_config.py [options] <input_csv_file>
```
### Options
- `-o, --output FILE`: Output NCCL tuner config file (default: `nccl_tuner.conf`)
- `-m, --metric METRIC`: Optimization metric (`cost_metric`, `bandwidth_gbps`, `latency_us`)
- `--no-header`: Don't add header comments to output file
- `--dry-run`: Print configurations without writing to file
### CSV Input Format
The input CSV file should have the following columns:
```csv
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
```
**Required columns:**
- `collective`: NCCL collective type (`allreduce`, `broadcast`, `reduce`, etc.)
- `size_bytes`: Message size in bytes
- `algorithm`: NCCL algorithm (`tree`, `ring`, `nvls`, etc.)
- `protocol`: NCCL protocol (`simple`, `ll`, `ll128`)
- `channels`: Number of channels (or `-1` for default)
- `nodes`: Number of nodes (or `-1` for any)
- `ranks`: Number of ranks (or `-1` for any)
- `pipeOps`: Number of pipeline operations (or `-1` for any)
- `regBuff`: Registered buffer flag (`0`, `1`, or `-1` for any)
**Optional metrics (at least one must be present):**
- `bandwidth_gbps`: Bandwidth in GB/s (higher is better)
- `latency_us`: Latency in microseconds (lower is better)
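For example (all numbers below are made up for illustration), two rows comparing ring/simple and tree/ll128 for a 1 MiB allreduce on a 2-node, 16-rank job could look like this:
```csv
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
allreduce,1048576,ring,simple,4,2,16,1,0,1.0,42.0,180.5
allreduce,1048576,tree,ll128,4,2,16,1,0,1.2,38.5,210.0
```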
### Examples
**Basic usage with cost optimization:**
```bash
python scripts/optimize_config.py sample_performance_data.csv
```
**Optimize for bandwidth and write to custom file:**
```bash
python scripts/optimize_config.py -m bandwidth_gbps -o my_tuner.conf performance_data.csv
```
**Preview configurations without writing:**
```bash
python scripts/optimize_config.py --dry-run performance_data.csv
```
### How It Works
1. **Data Loading**: Reads CSV performance data and validates format
2. **Grouping**: Groups data by collective type, topology (nodes/ranks), and other parameters
3. **Size Ranges**: Automatically bins data into size ranges for optimization
4. **Optimization**: Finds the best performing configuration for each group/size combination
5. **Output**: Generates NCCL tuner config format and appends to specified file
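For instance, if the only data for a 2-node, 16-rank allreduce group were the two illustrative rows shown under *CSV Input Format* and the metric were `latency_us`, the ring/simple row wins; with a single data size in that group, the auto-detected range spans 0 to that size, so a tuner config line roughly like the following would be emitted:
```csv
allreduce,0,1048576,ring,simple,4,2,16,1,0
```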
### Default Size Ranges
By default the script auto-detects per-topology size ranges from the data; when that detection is disabled and no custom ranges are given, it falls back to these default size ranges (in bytes):
- Small: 0 - 1,024
- Medium: 1,025 - 65,536
- Large: 65,537 - 1,048,576
- XLarge: 1,048,577 - 16,777,216
- XXLarge: 16,777,217 - 4,294,967,295
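To control this from the command line, pass `--size-ranges` with comma-separated `min-max` pairs, or `--no-auto-ranges` to use the defaults above (both options appear in the script's usage examples), for example:
```bash
python3 scripts/optimize_config.py sample_performance_data.csv --size-ranges "0-1024,1025-65536,65537-1048576"
```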
### Sample Data
See `sample_performance_data.csv` for an example of the expected input format.
### Integration with NCCL
The generated configuration file can be used directly with the NCCL tuner plugin:
```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/optimized_config.conf
export NCCL_TUNER_PLUGIN=/path/to/libnccl-tuner.so
mpirun -np 8 your_nccl_application
```
### Performance Data Collection
To collect performance data for optimization, you can:
1. **Use NCCL benchmarks** with different algorithm/protocol combinations
2. **Profile your applications** with various tuner settings
3. **Run systematic sweeps** across parameter combinations
4. **Use NCCL debug output** to collect timing information
The key is to have comprehensive data covering:
- Different message sizes (small to large)
- Various topologies (single node, multi-node)
- All relevant algorithm/protocol combinations
- Different channel counts and pipeline configurations


@ -0,0 +1,430 @@
#!/usr/bin/env python3
"""
NCCL Tuner Configuration Optimizer
Reads a CSV file containing performance data across different tuning parameters
and generates optimal NCCL tuner configurations based on the best performing
combinations.
By default, creates growing size ranges that interpolate between the actual data sizes
for each unique dimension (node count, rank count combination). This ensures that
different cluster configurations get their own optimized size boundaries, as
performance characteristics often vary significantly between topologies.
Each dimension gets its own set of ranges starting from 0 and extending to the maximum
size for that dimension, with boundaries at midpoints between consecutive data sizes.
CSV Input Format:
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,bandwidth_gbps,latency_us
Output Format (NCCL Tuner Config):
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
Usage Examples:
# Auto-create dimension-specific interpolated ranges (default)
python3 optimize_config.py data.csv
# Use custom size ranges (applied to all topologies)
python3 optimize_config.py data.csv --size-ranges "0-1024,1025-65536,65537-1048576"
# Use hardcoded default ranges (applied to all topologies)
python3 optimize_config.py data.csv --no-auto-ranges
"""
import csv
import argparse
import sys
import os
from collections import defaultdict
from typing import Dict, List, Tuple, Any
class PerformanceData:
def __init__(self, row: Dict[str, str]):
self.collective = row['collective']
self.size_bytes = int(row['size_bytes'])
self.algorithm = row['algorithm']
self.protocol = row['protocol']
self.channels = int(row['channels']) if row['channels'] != '-1' else -1
self.nodes = int(row['nodes']) if row['nodes'] != '-1' else -1
self.ranks = int(row['ranks']) if row['ranks'] != '-1' else -1
self.pipeOps = int(row['pipeOps']) if row['pipeOps'] != '-1' else -1
self.regBuff = int(row['regBuff']) if row['regBuff'] != '-1' else -1
# Performance metrics
self.bandwidth_gbps = float(row.get('bandwidth_gbps', 0)) # Higher is better
self.latency_us = float(row.get('latency_us', 0)) # Lower is better
def get_config_key(self) -> Tuple:
"""Generate a key for grouping similar configurations"""
return (self.collective, self.nodes, self.ranks, self.pipeOps, self.regBuff)
def get_size_range_key(self, topology_size_ranges: Dict[Tuple[int, int], List[Tuple[int, int]]]) -> Tuple[int, int]:
"""Find which size range this data point belongs to for its dimension"""
topology_key = (self.nodes, self.ranks)
# Get size ranges for this dimension, or fall back to default
if topology_key in topology_size_ranges:
size_ranges = topology_size_ranges[topology_key]
elif (-1, -1) in topology_size_ranges:
size_ranges = topology_size_ranges[(-1, -1)]
else:
# Fallback to first available dimension ranges
size_ranges = next(iter(topology_size_ranges.values()))
for min_size, max_size in size_ranges:
if min_size <= self.size_bytes <= max_size:
return (min_size, max_size)
# If no range found, create a single-point range
return (self.size_bytes, self.size_bytes)
class ConfigOptimizer:
def __init__(self, optimization_metric: str = 'latency_us'):
self.optimization_metric = optimization_metric
# Default size ranges - will be overridden by auto-detection
self.size_ranges = [
(0, 1024),
(1025, 64*1024),
(64*1024+1, 1024*1024),
(1024*1024+1, 16*1024*1024),
(16*1024*1024+1, 4*1024*1024*1024-1)
]
self.auto_size_ranges = True
def set_size_ranges(self, ranges: List[Tuple[int, int]]):
"""Set custom size ranges for optimization"""
self.size_ranges = ranges
self.auto_size_ranges = False
def auto_determine_size_ranges(self, data: List[PerformanceData]) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
"""Create growing size ranges for each unique (nodes, ranks) dimension"""
if not data:
return {(-1, -1): self.size_ranges}
# Group data by dimension (nodes, ranks)
topology_data = defaultdict(list)
for item in data:
topology_key = (item.nodes, item.ranks)
topology_data[topology_key].append(item)
topology_ranges = {}
for topology_key, items in topology_data.items():
nodes, ranks = topology_key
# Extract unique sizes for this dimension and sort them
unique_sizes = sorted(set(item.size_bytes for item in items))
if len(unique_sizes) <= 1:
# Only one size, create a single range from 0 to that size
size = unique_sizes[0] if unique_sizes else 0
ranges = [(0, size)]
else:
# Create growing ranges that interpolate between data points
ranges = []
for i, size in enumerate(unique_sizes):
if i == 0:
# First range: 0 to midpoint between first and second size
if len(unique_sizes) > 1:
next_size = unique_sizes[i + 1]
max_size = (size + next_size) // 2
else:
max_size = size
min_size = 0
elif i == len(unique_sizes) - 1:
# Last range: previous max + 1 to current size (and beyond)
min_size = ranges[-1][1] + 1
max_size = size
else:
# Intermediate ranges: previous max + 1 to midpoint with next size
min_size = ranges[-1][1] + 1
next_size = unique_sizes[i + 1]
max_size = (size + next_size) // 2
ranges.append((min_size, max_size))
topology_ranges[topology_key] = ranges
print(f"Dimension {nodes} nodes, {ranks} ranks: {len(ranges)} size ranges from {len(unique_sizes)} unique sizes:")
for i, (min_size, max_size) in enumerate(ranges):
# Count data points that fall in this range for this dimension
count = sum(1 for item in items if min_size <= item.size_bytes <= max_size)
actual_sizes = sorted(set(item.size_bytes for item in items if min_size <= item.size_bytes <= max_size))
if actual_sizes:
size_list = ', '.join(f"{s:,}" for s in actual_sizes[:3])
if len(actual_sizes) > 3:
size_list += f", ... (+{len(actual_sizes)-3} more)"
print(f" Range {i+1}: {min_size:,} - {max_size:,} bytes ({count} data points, sizes: {size_list})")
return topology_ranges
def load_data(self, csv_file: str) -> List[PerformanceData]:
"""Load performance data from CSV file"""
data = []
try:
with open(csv_file, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
try:
data.append(PerformanceData(row))
except (ValueError, KeyError) as e:
print(f"Warning: Skipping invalid row: {row} - {e}")
except FileNotFoundError:
print(f"Error: File {csv_file} not found")
sys.exit(1)
except Exception as e:
print(f"Error reading {csv_file}: {e}")
sys.exit(1)
print(f"Loaded {len(data)} performance data points")
# Auto-determine size ranges if enabled
if self.auto_size_ranges and data:
self.topology_size_ranges = self.auto_determine_size_ranges(data)
else:
# Use default ranges for all topologies
self.topology_size_ranges = {(-1, -1): self.size_ranges}
return data
def is_better(self, new_data: PerformanceData, current_best: PerformanceData) -> bool:
"""Determine if new_data is better than current_best"""
if self.optimization_metric == 'bandwidth_gbps':
return new_data.bandwidth_gbps > current_best.bandwidth_gbps
elif self.optimization_metric == 'latency_us':
return new_data.latency_us < current_best.latency_us
else:
# Default to latency
return new_data.latency_us < current_best.latency_us
def optimize_configurations(self, data: List[PerformanceData]) -> List[str]:
"""Find optimal configurations and return as NCCL config strings"""
# Group data by configuration key and size range
grouped_data = defaultdict(lambda: defaultdict(list))
for item in data:
config_key = item.get_config_key()
size_range = item.get_size_range_key(self.topology_size_ranges)
grouped_data[config_key][size_range].append(item)
# Store optimal configurations before combining ranges
optimal_configs = []
for config_key, size_ranges_dict in grouped_data.items():
collective, nodes, ranks, pipeOps, regBuff = config_key
for (min_size, max_size), items in size_ranges_dict.items():
if not items:
continue
# Find the best performing configuration for this size range
best_item = items[0]
for item in items[1:]:
if self.is_better(item, best_item):
best_item = item
# Store the optimal configuration with its range
optimal_configs.append({
'collective': collective,
'min_size': min_size,
'max_size': max_size,
'algorithm': best_item.algorithm,
'protocol': best_item.protocol,
'channels': best_item.channels,
'nodes': best_item.nodes,
'ranks': best_item.ranks,
'pipeOps': best_item.pipeOps,
'regBuff': best_item.regBuff,
'metric_value': getattr(best_item, self.optimization_metric)
})
# Combine sequential ranges with identical tunings
combined_configs = self.combine_sequential_ranges(optimal_configs)
# Generate config strings
configs = []
for config in combined_configs:
config_str = f"{config['collective']},{config['min_size']},{config['max_size']},{config['algorithm']},{config['protocol']},{config['channels']},{config['nodes']},{config['ranks']},{config['pipeOps']},{config['regBuff']}"
configs.append(config_str)
print(f"Optimal for {config['collective']} [{config['min_size']}-{config['max_size']}] nodes={config['nodes']} ranks={config['ranks']}: "
f"{config['algorithm']}/{config['protocol']} channels={config['channels']} "
f"({self.optimization_metric}={config['metric_value']:.3f})")
return configs
def combine_sequential_ranges(self, configs: List[Dict]) -> List[Dict]:
"""Combine sequential ranges that have identical tuning parameters"""
if not configs:
return configs
# Group by collective and topology (nodes, ranks)
topology_groups = defaultdict(list)
for config in configs:
topology_key = (config['collective'], config['nodes'], config['ranks'],
config['pipeOps'], config['regBuff'])
topology_groups[topology_key].append(config)
combined_configs = []
for topology_key, topology_configs in topology_groups.items():
# Sort by min_size to ensure proper ordering
topology_configs.sort(key=lambda x: x['min_size'])
# Group by tuning parameters (algorithm, protocol, channels)
tuning_groups = defaultdict(list)
for config in topology_configs:
tuning_key = (config['algorithm'], config['protocol'], config['channels'])
tuning_groups[tuning_key].append(config)
# For each tuning group, combine sequential ranges
for tuning_key, tuning_configs in tuning_groups.items():
if not tuning_configs:
continue
# Sort by min_size
tuning_configs.sort(key=lambda x: x['min_size'])
# Combine sequential ranges
current_config = tuning_configs[0].copy()
for next_config in tuning_configs[1:]:
# Check if ranges are adjacent or overlapping
if current_config['max_size'] + 1 >= next_config['min_size']:
# Extend the current range
current_config['max_size'] = max(current_config['max_size'], next_config['max_size'])
# Update metric value to the better one
if self.optimization_metric == 'bandwidth_gbps':
if next_config['metric_value'] > current_config['metric_value']:
current_config['metric_value'] = next_config['metric_value']
else: # latency_us or default
if next_config['metric_value'] < current_config['metric_value']:
current_config['metric_value'] = next_config['metric_value']
else:
# Gap between ranges, save current and start new one
combined_configs.append(current_config)
current_config = next_config.copy()
# Add the last configuration
combined_configs.append(current_config)
# Sort final configs by collective, nodes, ranks, then min_size
combined_configs.sort(key=lambda x: (x['collective'], x['nodes'], x['ranks'], x['min_size']))
original_count = len(configs)
combined_count = len(combined_configs)
if combined_count < original_count:
print(f"Combined {original_count} ranges into {combined_count} ranges "
f"(reduced by {original_count - combined_count})")
return combined_configs
def append_to_config_file(self, configs: List[str], config_file: str, add_header: bool = True):
"""Append optimized configurations to NCCL tuner config file"""
try:
# Create directory if it doesn't exist
config_dir = os.path.dirname(config_file)
if config_dir and not os.path.exists(config_dir):
os.makedirs(config_dir)
print(f"Created directory: {config_dir}")
# Check if file exists and has content
file_exists = os.path.exists(config_file)
add_separator = False
if file_exists:
with open(config_file, 'r') as f:
content = f.read().strip()
add_separator = len(content) > 0
print(f"Appending to existing file: {config_file}")
else:
print(f"Creating new file: {config_file}")
with open(config_file, 'a') as f:
if add_separator:
f.write("\n\n")
if add_header:
f.write(f"# Optimized configurations generated by optimize_config.py\n")
f.write(f"# Optimization metric: {self.optimization_metric}\n")
f.write(f"# Format: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff\n")
for config in configs:
f.write(f"{config}\n")
if file_exists:
print(f"Appended {len(configs)} optimized configurations to {config_file}")
else:
print(f"Created {config_file} with {len(configs)} optimized configurations")
except PermissionError:
print(f"Error: Permission denied writing to {config_file}")
print("Try running with appropriate permissions or choose a different output location")
sys.exit(1)
except OSError as e:
print(f"Error: Cannot create/write to {config_file}: {e}")
print("Check that the path is valid and you have write permissions")
sys.exit(1)
except Exception as e:
print(f"Unexpected error writing to {config_file}: {e}")
sys.exit(1)
def main():
parser = argparse.ArgumentParser(description="Optimize NCCL tuner configurations from performance data")
parser.add_argument("csv_file", help="Input CSV file with performance data")
parser.add_argument("-o", "--output", default="nccl_tuner.conf",
help="Output NCCL tuner config file (default: nccl_tuner.conf)")
parser.add_argument("-m", "--metric", choices=['bandwidth_gbps', 'latency_us'],
default='latency_us', help="Optimization metric (default: latency_us)")
parser.add_argument("--no-header", action="store_true",
help="Don't add header comments to output file")
parser.add_argument("--dry-run", action="store_true",
help="Print configurations without writing to file")
parser.add_argument("--no-auto-ranges", action="store_true",
help="Disable automatic size range determination (use default ranges)")
parser.add_argument("--size-ranges", type=str,
help="Custom size ranges as comma-separated pairs: 'min1-max1,min2-max2,...'")
args = parser.parse_args()
optimizer = ConfigOptimizer(args.metric)
# Handle size range configuration
if args.size_ranges:
# Parse custom size ranges
try:
ranges = []
for range_str in args.size_ranges.split(','):
min_size, max_size = map(int, range_str.split('-'))
ranges.append((min_size, max_size))
optimizer.set_size_ranges(ranges)
print(f"Using custom size ranges: {ranges}")
except ValueError:
print("Error: Invalid size ranges format. Use 'min1-max1,min2-max2,...'")
sys.exit(1)
elif args.no_auto_ranges:
# Disable auto-ranging
optimizer.auto_size_ranges = False
print("Using default hardcoded size ranges")
else:
# Auto-ranging is enabled by default - creates one bucket per unique size
optimizer.auto_size_ranges = True
print("Auto-ranging enabled: will create one bucket per unique size in data")
# Load and optimize data
data = optimizer.load_data(args.csv_file)
if not data:
print("No valid data found in CSV file")
sys.exit(1)
configs = optimizer.optimize_configurations(data)
if args.dry_run:
print("\nGenerated configurations:")
for config in configs:
print(config)
else:
optimizer.append_to_config_file(configs, args.output, not args.no_header)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,24 @@
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
allreduce,1024,tree,simple,2,1,8,-1,-1,0.15,45.2,12.5
allreduce,1024,ring,simple,4,1,8,-1,-1,0.12,52.1,10.8
allreduce,1024,tree,ll,2,1,8,-1,-1,0.18,41.3,15.2
allreduce,1024,ring,ll,4,1,8,-1,-1,0.14,48.7,12.1
allreduce,32768,tree,simple,2,1,8,-1,-1,0.25,156.8,25.3
allreduce,32768,ring,simple,4,1,8,-1,-1,0.18,189.2,18.4
allreduce,32768,ring,ll128,8,1,8,-1,-1,0.16,201.5,16.2
allreduce,1048576,ring,simple,4,1,8,-1,-1,0.45,425.6,45.1
allreduce,1048576,ring,ll128,8,1,8,-1,-1,0.38,482.3,38.7
allreduce,1048576,nvls,simple,16,1,8,-1,-1,0.32,551.2,32.1
broadcast,1024,tree,simple,2,1,8,-1,-1,0.08,89.4,8.2
broadcast,1024,ring,simple,4,1,8,-1,-1,0.12,71.3,12.1
broadcast,32768,tree,simple,2,1,8,-1,-1,0.18,234.7,18.5
broadcast,32768,ring,ll128,4,1,8,-1,-1,0.15,267.8,15.2
broadcast,1048576,ring,simple,4,1,8,-1,-1,0.35,612.4,35.1
broadcast,1048576,ring,ll128,8,1,8,-1,-1,0.28,702.1,28.3
allreduce,1024,tree,simple,2,2,16,-1,-1,0.22,38.1,22.4
allreduce,1024,ring,simple,4,2,16,-1,-1,0.19,42.7,19.6
allreduce,32768,ring,simple,4,2,16,-1,-1,0.28,145.2,28.1
allreduce,32768,ring,ll128,8,2,16,-1,-1,0.24,167.8,24.3
allreduce,1048576,ring,simple,4,2,16,-1,-1,0.58,387.5,58.2
allreduce,1048576,ring,ll128,8,2,16,-1,-1,0.48,456.9,48.1
allreduce,1048576,nvls,simple,16,2,16,-1,-1,0.42,512.6,42.3

View File

@ -0,0 +1,30 @@
#
# Makefile for NCCL Tuner Plugin Unit Tests
#
CC := gcc
CFLAGS := -Wall -Wextra -g -std=c99 -fPIC
INC := -I. -I../nccl
TARGET := test_plugin
SOURCES := test_plugin.c
# Default target
all: $(TARGET)

# Build the test executable
$(TARGET): $(SOURCES)
	$(CC) $(CFLAGS) $(INC) -o $(TARGET) $(SOURCES)

# Run the tests
test: $(TARGET)
	./$(TARGET) $(TEST_CASE)

# Run tests with verbose output
test-verbose: $(TARGET)
	NCCL_DEBUG=INFO ./$(TARGET) $(TEST_CASE)

# Clean build artifacts
clean:
	rm -f $(TARGET) *.o *.gcov *.gcda *.gcno test_*.conf

.PHONY: all test test-verbose clean

View File

@ -0,0 +1,205 @@
# NCCL Tuner Plugin Unit Tests
This directory contains comprehensive unit tests for the NCCL tuner plugin. The tests verify all major functionality including configuration parsing, matching logic, and cost table updates.
## Test Structure
```
test/
├── test_plugin.c # Main unit test file
├── Makefile # Build system for tests
└── README.md # This file
```
## Building and Running Tests
### Quick Start
```bash
# Build and run all tests
make test
# Or step by step
make # Build test executable
./test_plugin # Run tests
```
### Advanced Testing
```bash
# Run with memory leak detection (requires valgrind)
make test-memory
# Run with verbose logging
make test-verbose
# Generate code coverage report (requires gcov)
make coverage
# Create sample test configuration files
make test-configs
```
## Test Coverage
The unit tests cover the following functionality:
### 1. **Plugin Initialization (`test_plugin_init`)**
- Tests successful plugin initialization
- Verifies context allocation
- Tests cleanup on destroy
### 2. **Configuration Parsing (`test_config_parsing_valid`, `test_config_parsing_invalid`)**
- Valid CSV format parsing
- Comment and empty line handling
- Invalid format graceful handling
- Environment variable configuration
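For reference, a minimal configuration file in the format these tests parse (mirroring the fixture used in `test_config_parsing_valid`) looks like this; comment and blank lines are skipped:
```
# collective,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
allreduce,0,65536,tree,simple,2,1,-1,-1,-1
broadcast,0,32768,ring,ll128,4,2,16,-1,-1
```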
### 3. **Collective Type Matching (`test_collective_matching`)**
- Correct matching of allreduce, broadcast, etc.
- Algorithm/protocol selection
- Channel configuration
### 4. **Size Range Matching (`test_size_matching`)**
- Small, medium, large message size handling
- Proper range boundary checking
- Multiple size-based configurations
### 5. **Topology Matching (`test_topology_matching`)**
- Single-node vs multi-node configurations
- Exact nNodes/nRanks matching
- Wildcard matching (-1 values)
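As an illustration, `test_topology_matching` loads the three entries below: the first applies only on a single node, the second only to exactly 4 nodes and 32 ranks, and the third (wildcard `-1` for nNodes/nRanks) to any topology:
```
allreduce,0,65536,tree,simple,2,1,-1,-1,-1
allreduce,0,65536,ring,simple,4,4,32,-1,-1
allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1
```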
### 6. **Default Channels (`test_default_channels`)**
- Proper handling of -1 channel specification
- Preservation of NCCL default behavior
### 7. **Registered Buffer Matching (`test_regbuff_matching`)**
- Configurations based on regBuff parameter
- Registered vs non-registered buffer handling
- Backward compatibility with configs missing regBuff
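For example, the fixture in `test_regbuff_matching` selects on the final field: `1` matches only registered buffers, `0` only non-registered ones, and `-1` matches either:
```
allreduce,0,65536,tree,simple,2,-1,-1,-1,1
allreduce,0,65536,ring,simple,4,-1,-1,-1,0
allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1
```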
### 8. **Pipeline Operations Matching (`test_pipeops_matching`)**
- Configurations based on numPipeOps parameter
- Single vs multiple pipeline operation handling
- Backward compatibility with configs missing numPipeOps
### 9. **Fallback Behavior (`test_no_match_fallback`)**
- Default behavior when no config matches
- Ring/Simple algorithm fallback
## Test Output
Successful test run:
```
Running NCCL Tuner Plugin Unit Tests
=====================================
PASS: test_plugin_init
PASS: test_config_parsing_valid
PASS: test_config_parsing_invalid
PASS: test_collective_matching
PASS: test_size_matching
PASS: test_topology_matching
PASS: test_default_channels
PASS: test_regbuff_matching
PASS: test_pipeops_matching
PASS: test_no_match_fallback
=====================================
Test Results: 10/10 tests passed
All tests PASSED!
```
Failed test example:
```
FAIL: test_collective_matching - Tree/Simple should have low cost
Test Results: 9/10 tests passed
Some tests FAILED!
```
## Mock NCCL Implementation
The tests use the actual NCCL header files from the `../nccl/` directory:
- `tuner.h` - Complete NCCL tuner interface and type definitions
- `common.h` - Common NCCL types and logging functions
- `err.h` - NCCL error codes
This allows the tests to build against the real NCCL interface definitions without requiring a full NCCL library installation.
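Concretely, the test compiles the plugin implementation directly into the test binary instead of linking against NCCL (the include pattern from `test_plugin.c`):
```c
// Real NCCL tuner interface definitions from ../nccl
#include "tuner.h"
// Plugin implementation compiled into the test executable
#include "../plugin.c"
```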
## Integration with CI/CD
```bash
# Install tests for CI/CD pipeline
make install-test
# Run as part of automated testing
make test && echo "Tests passed" || echo "Tests failed"
```
## Memory Testing
The tests can be run with valgrind for memory leak detection:
```bash
make test-memory
```
This will detect:
- Memory leaks
- Invalid memory access
- Use of uninitialized memory
## Code Coverage
Generate code coverage reports to ensure comprehensive testing:
```bash
make coverage
# Creates test_plugin.c.gcov with line-by-line coverage
```
## Adding New Tests
To add a new test:
1. Create a new test function in `test_plugin.c`:
```c
int test_new_feature() {
  // Test setup
  TEST_ASSERT(condition, "description");
  // Test cleanup
  TEST_PASS();
}
```
2. Register the test in the `test_cases[]` table in `test_plugin.c`, which both runs it as part of the full suite and makes it selectable by name:
```c
{"new-feature", test_new_feature, "Description of the new feature"},
```
3. Rebuild and run:
```bash
make test
```
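Once registered, the new test can be run on its own by its registry name (the name `new-feature` above is illustrative), either directly or via the Makefile's `TEST_CASE` variable:
```bash
./test_plugin new-feature
# or
make test TEST_CASE=new-feature
```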
## Debugging Tests
For debugging failed tests:
```bash
# Compile with debug symbols
make CFLAGS="-g -O0 -DDEBUG"
# Run with gdb
gdb ./test_plugin
```
## Cleaning Up
```bash
# Remove all build artifacts and temporary files
make clean
```
This comprehensive test suite ensures the NCCL tuner plugin works correctly across all supported configurations and edge cases.

View File

@ -0,0 +1,856 @@
/*************************************************************************
* Unit tests for NCCL Tuner Plugin
************************************************************************/
#define _GNU_SOURCE // Enable setenv/unsetenv and other GNU extensions
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdarg.h>
// Include NCCL tuner header (which includes common.h and err.h)
#include "tuner.h"
// Include plugin source for testing
#include "../plugin.c"
// Test framework macros
#define TEST_ASSERT(condition, message) \
do { \
if (!(condition)) { \
printf("FAIL: %s - %s\n", __func__, message); \
return 0; \
} \
} while(0)
#define TEST_PASS() \
do { \
printf("PASS: %s\n", __func__); \
return 1; \
} while(0)
// Global test state
static int test_log_count = 0;
// Mock logger function
void mock_logger(ncclDebugLogLevel level, unsigned long flags,
const char* file, int line, const char* fmt, ...) {
(void)flags; // Suppress unused parameter warning
test_log_count++;
// Check if we should print based on NCCL_DEBUG level
const char* debug_level = getenv("NCCL_DEBUG");
int should_print = 0;
if (debug_level) {
if (strcmp(debug_level, "TRACE") == 0) {
should_print = 1; // Print everything
} else if (strcmp(debug_level, "INFO") == 0 && level <= NCCL_LOG_INFO) {
should_print = 1; // Print INFO and below
} else if (strcmp(debug_level, "WARN") == 0 && level <= NCCL_LOG_WARN) {
should_print = 1; // Print WARN and below
}
}
if (!should_print) return;
// Convert log level to string
const char* level_str;
switch(level) {
case NCCL_LOG_NONE: level_str = "NONE"; break;
case NCCL_LOG_VERSION: level_str = "VERSION"; break;
case NCCL_LOG_WARN: level_str = "WARN"; break;
case NCCL_LOG_INFO: level_str = "INFO"; break;
case NCCL_LOG_ABORT: level_str = "ABORT"; break;
case NCCL_LOG_TRACE: level_str = "TRACE"; break;
default: level_str = "UNKNOWN"; break;
}
// Print log header
printf("[TUNER:%s:%s:%d] ", level_str, file, line);
// Print formatted message
va_list args;
va_start(args, fmt);
vprintf(fmt, args);
va_end(args);
printf("\n");
}
// Helper function to create test config file
void create_test_config(const char* filename, const char* content) {
FILE* f = fopen(filename, "w");
if (f) {
fprintf(f, "%s", content);
fclose(f);
}
}
// Test 1: Plugin initialization
int test_plugin_init() {
void* context = NULL;
// Test successful initialization
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed");
TEST_ASSERT(context != NULL, "Context should be allocated");
// Clean up
pluginDestroy(context);
TEST_PASS();
}
// Test 2: Configuration file parsing - valid CSV
int test_config_parsing_valid() {
const char* test_config =
"# Test configuration\n"
"allreduce,0,65536,tree,simple,2,1,-1,-1,-1\n"
"broadcast,0,32768,ring,ll128,4,2,16,-1,-1\n"
"# Comment line\n"
"\n" // Empty line
"reduce,1024,2048,tree,simple,-1,-1,-1,-1,-1\n";
create_test_config("test_valid.conf", test_config);
// Set environment variable to use our test config
setenv("NCCL_TUNER_CONFIG_FILE", "test_valid.conf", 1);
void* context = NULL;
ncclResult_t result = pluginInit(16, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init with valid config should succeed");
// Clean up
pluginDestroy(context);
unlink("test_valid.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 3: Configuration file parsing - invalid CSV
int test_config_parsing_invalid() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,1 # Missing nRanks and other fields\n"
"invalid_collective,0,1024,ring,simple,1,1,1,-1,-1\n"
"broadcast,abc,def,ring,simple,1,1,1,-1,-1\n"; // Invalid numbers
create_test_config("test_invalid.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_invalid.conf", 1);
void* context = NULL;
ncclResult_t result = pluginInit(8, 1, mock_logger, &context);
// Should still succeed but with no valid configs loaded
TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed even with invalid config");
// Clean up
pluginDestroy(context);
unlink("test_invalid.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 4: Collective type matching
int test_collective_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,8,1,-1,-1,-1\n"
"broadcast,0,32768,ring,ll128,4,-1,-1,-1,-1\n";
create_test_config("test_match.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_match.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
// Create mock cost table
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Default high cost
}
}
int nChannels;
// Test allreduce matching (should match first config)
ncclResult_t result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should succeed");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Checking cost_table[TREE][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 8, "Should set 8 channels");
// Test broadcast matching (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Reset costs
}
}
result = pluginGetCollInfo(context, ncclFuncBroadcast, 16384, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should succeed");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Checking cost_table[RING][LL128] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128], cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 4, "Should set 4 channels");
// Clean up
pluginDestroy(context);
unlink("test_match.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 5: Size range matching
int test_size_matching() {
const char* test_config =
"allreduce,0,1024,tree,simple,2,-1,-1,-1,-1\n"
"allreduce,1025,65536,ring,simple,4,-1,-1,-1,-1\n"
"allreduce,65537,4294967295,ring,ll128,8,-1,-1,-1,-1\n";
create_test_config("test_size.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_size.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels = 1;
pluginGetCollInfo(context, ncclFuncAllReduce, 512, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Small message - checking cost_table[TREE][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Small: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Small: Should set 2 channels");
// Test medium message (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Medium message - checking cost_table[RING][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Medium: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Medium: Should set 4 channels");
// Test large message (should match third config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 1048576, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Large message - checking cost_table[RING][LL128] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128], cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Large: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Large: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_size.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 6: Topology matching
int test_topology_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,1,-1,-1,-1\n" // Single node only
"allreduce,0,65536,ring,simple,4,4,32,-1,-1\n" // 4 nodes, 32 ranks exactly
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any topology
create_test_config("test_topo.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_topo.conf", 1);
// Test with single node setup
void* context1 = NULL;
pluginInit(8, 1, mock_logger, &context1); // 8 ranks, 1 node
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
pluginGetCollInfo(context1, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Single node: Should match tree config");
TEST_ASSERT(nChannels == 2, "Single node: Should set 2 channels");
pluginDestroy(context1);
// Test with 4 nodes, 32 ranks setup
void* context2 = NULL;
pluginInit(32, 4, mock_logger, &context2); // 32 ranks, 4 nodes
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context2, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "4-node: Should match ring/simple config");
TEST_ASSERT(nChannels == 4, "4-node: Should set 4 channels");
// Clean up
pluginDestroy(context2);
unlink("test_topo.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 7: Default channels behavior (-1)
int test_default_channels() {
const char* test_config =
"allreduce,0,65536,tree,simple,-1,-1,-1,-1,-1\n"; // Use default channels
create_test_config("test_default.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_default.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels = 99; // Set to known value
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Should apply algorithm/protocol");
TEST_ASSERT(nChannels == 1, "Should keep default channels (1) when config has -1");
// Clean up
pluginDestroy(context);
unlink("test_default.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 8: regBuff matching
int test_regbuff_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,-1,-1,-1,1\n" // Registered buffers only
"allreduce,0,65536,ring,simple,4,-1,-1,-1,0\n" // Non-registered buffers only
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any buffer type (backward compatible)
create_test_config("test_regbuff.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_regbuff.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
}
int nChannels;
// Test registered buffer (should match first config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
1, &nChannels); // regBuff = 1 (registered)
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Registered buffer: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Registered buffer: Should set 2 channels");
// Test non-registered buffer (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels); // regBuff = 0 (non-registered)
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Non-registered buffer: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Non-registered buffer: Should set 4 channels");
// Test backward compatibility - config without regBuff should match any regBuff value
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
// First try with regBuff=2 (unusual value, should match third config)
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
2, &nChannels); // regBuff = 2 (only third config should match)
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Any regBuff: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Any regBuff: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_regbuff.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 9: numPipeOps matching
int test_pipeops_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,-1,-1,1,-1\n" // Single pipeline op
"allreduce,0,65536,ring,simple,4,-1,-1,4,-1\n" // Multiple pipeline ops
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any pipeline ops (backward compatible)
create_test_config("test_pipeops.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_pipeops.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
}
int nChannels;
// Test single pipeline op (should match first config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Single pipeOp: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Single pipeOp: Should set 2 channels");
// Test multiple pipeline ops (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 4,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Multiple pipeOps: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Multiple pipeOps: Should set 4 channels");
// Test different number of pipeline ops (should match third config - backward compatible)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 2,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Any pipeOps: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Any pipeOps: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_pipeops.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 10: No matching configuration (fallback behavior)
int test_no_match_fallback() {
const char* test_config =
"broadcast,0,1024,tree,simple,2,-1,-1,-1,-1\n"; // Only broadcast config
create_test_config("test_fallback.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_fallback.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
// Try allreduce (should not match, use fallback)
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Fallback test - checking cost_table[RING][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 1.0, "Cost table should pass through unmodified");
TEST_ASSERT(nChannels == 1, "Should use default channels");
// Clean up
pluginDestroy(context);
unlink("test_fallback.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 11: Large configuration files (testing dynamic allocation)
int test_large_config() {
const char* large_config_file = "test_large.conf";
// Create a large configuration file with many entries
// This tests the dynamic allocation functionality
FILE* f = fopen(large_config_file, "w");
TEST_ASSERT(f != NULL, "Should be able to create large config file");
// Write header comment
fprintf(f, "# Large configuration file for testing dynamic allocation\n");
fprintf(f, "# This file contains many configurations to test memory allocation\n");
// Generate a large number of configurations (much more than the old MAX_CONFIGS=100)
const int num_configs = 500; // 5x the old static limit
const char* collectives[] = {"allreduce", "broadcast", "reduce", "allgather", "reducescatter"};
const char* algorithms[] = {"tree", "ring", "collnet_direct", "nvls"};
const char* protocols[] = {"simple", "ll", "ll128"};
for (int i = 0; i < num_configs; i++) {
// Vary the configurations to create realistic test data
const char* coll = collectives[i % 5];
const char* algo = algorithms[i % 4];
const char* proto = protocols[i % 3];
size_t min_bytes = (i * 1024) % 1048576; // Vary from 0 to 1MB
size_t max_bytes = min_bytes + 65536; // 64KB range
int channels = (i % 8) + 1; // 1-8 channels
int nodes = (i % 4) == 0 ? -1 : (i % 4); // Mix of -1 and 1-3 nodes
int ranks = (i % 8) == 0 ? -1 : (i % 32) + 1; // Mix of -1 and 1-32 ranks
int pipeOps = (i % 3) == 0 ? -1 : (i % 4) + 1; // Mix of -1 and 1-4 pipeOps
int regBuff = (i % 3) == 0 ? -1 : (i % 2); // Mix of -1, 0, 1
fprintf(f, "%s,%zu,%zu,%s,%s,%d,%d,%d,%d,%d\n",
coll, min_bytes, max_bytes, algo, proto, channels, nodes, ranks, pipeOps, regBuff);
}
fclose(f);
// Set environment to use our large config file
setenv("NCCL_TUNER_CONFIG_FILE", large_config_file, 1);
// Initialize plugin with large config
void* context = NULL;
ncclResult_t result = pluginInit(16, 4, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init with large config should succeed");
TEST_ASSERT(context != NULL, "Context should be allocated");
// Verify that configurations were loaded
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == num_configs, "Should load all configurations from large file");
TEST_ASSERT(ctx->maxConfigs == num_configs, "maxConfigs should match allocated size");
TEST_ASSERT(ctx->configs != NULL, "Configs array should be dynamically allocated");
// Test that we can access configurations throughout the array
// (This would have failed with the old static MAX_CONFIGS=100 limit)
for (int i = 0; i < ctx->numConfigs; i++) {
TuningConfig* config = &ctx->configs[i];
// Basic sanity checks on the loaded configurations
TEST_ASSERT(config->collType >= ncclFuncBroadcast && config->collType <= ncclFuncAllReduce,
"Collective type should be valid");
TEST_ASSERT(config->maxBytes >= config->minBytes, "maxBytes should be >= minBytes");
TEST_ASSERT(config->nChannels > 0, "nChannels should be positive");
}
// Test specific configuration access at various indices
// Index 0 (first config)
TuningConfig* first_config = &ctx->configs[0];
TEST_ASSERT(first_config != NULL, "First config should be accessible");
// Index in middle
TuningConfig* mid_config = &ctx->configs[num_configs / 2];
TEST_ASSERT(mid_config != NULL, "Middle config should be accessible");
// Index near end (this would have crashed with static array of 100)
TuningConfig* late_config = &ctx->configs[num_configs - 1];
TEST_ASSERT(late_config != NULL, "Last config should be accessible");
// Test memory allocation size - verify we didn't over-allocate
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Successfully loaded %d configurations (dynamic allocation)", ctx->numConfigs);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Memory allocated for %d configurations (%zu bytes total)",
ctx->maxConfigs, ctx->maxConfigs * sizeof(TuningConfig));
// Test that the plugin can still find matching configurations from the large set
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Default high cost
}
}
int nChannels;
// Try to find a matching configuration - should work with large config set
result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with large config set");
// Clean up
pluginDestroy(context);
unlink(large_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 12: Very large configuration stress test
int test_very_large_config_stress() {
const char* stress_config_file = "test_stress.conf";
// Create an even larger configuration file to stress test the implementation
FILE* f = fopen(stress_config_file, "w");
TEST_ASSERT(f != NULL, "Should be able to create stress test config file");
fprintf(f, "# Stress test configuration with very large number of entries\n");
// Generate an extremely large number of configurations
const int stress_configs = 2000; // 20x the old static limit
for (int i = 0; i < stress_configs; i++) {
// Create varied but valid configurations
fprintf(f, "allreduce,%d,%d,ring,simple,4,-1,-1,-1,-1\n",
i * 512, (i * 512) + 1024);
}
fclose(f);
setenv("NCCL_TUNER_CONFIG_FILE", stress_config_file, 1);
// Test initialization with stress config
void* context = NULL;
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin should handle very large config files");
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == stress_configs, "Should load all stress test configurations");
TEST_ASSERT(ctx->configs != NULL, "Stress test configs should be allocated");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Stress test - loaded %d configurations successfully", stress_configs);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Memory usage: %zu bytes for configuration array",
stress_configs * sizeof(TuningConfig));
// Verify we can access configurations throughout the entire range
for (int i = 0; i < stress_configs; i += 100) { // Sample every 100th config
TuningConfig* config = &ctx->configs[i];
TEST_ASSERT(config->collType == ncclFuncAllReduce, "Config should have correct collective type");
TEST_ASSERT(config->minBytes == (size_t)(i * 512), "Config should have correct minBytes");
}
// Clean up
pluginDestroy(context);
unlink(stress_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 13: Edge case - empty config file
int test_empty_config() {
const char* empty_config_file = "test_empty.conf";
// Create empty config file (only comments)
create_test_config(empty_config_file,
"# Empty configuration file\n"
"# No actual configurations\n"
"\n"
"\n");
setenv("NCCL_TUNER_CONFIG_FILE", empty_config_file, 1);
void* context = NULL;
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin should handle empty config files");
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == 0, "Should have zero configurations");
TEST_ASSERT(ctx->maxConfigs == 0, "Should have zero max configurations");
TEST_ASSERT(ctx->configs == NULL, "Should not allocate memory for empty config");
// Test that plugin still works with no configurations (fallback behavior)
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with empty config");
// Clean up
pluginDestroy(context);
unlink(empty_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test runner function pointer type
typedef int (*TestFunction)(void);
// Test registry
typedef struct {
const char* name;
TestFunction func;
const char* description;
} TestCase;
// All available tests
TestCase test_cases[] = {
{"init", test_plugin_init, "Plugin initialization"},
{"config-valid", test_config_parsing_valid, "Valid configuration parsing"},
{"config-invalid", test_config_parsing_invalid, "Invalid configuration parsing"},
{"collective", test_collective_matching, "Collective type matching"},
{"size", test_size_matching, "Size range matching"},
{"topology", test_topology_matching, "Topology matching"},
{"channels", test_default_channels, "Default channels behavior"},
{"regbuff", test_regbuff_matching, "Registered buffer matching"},
{"pipeops", test_pipeops_matching, "Pipeline operations matching"},
{"fallback", test_no_match_fallback, "Fallback behavior"},
{"large-config", test_large_config, "Large configuration files (dynamic allocation)"},
{"stress-config", test_very_large_config_stress, "Very large configuration stress test"},
{"empty-config", test_empty_config, "Empty configuration file handling"},
{NULL, NULL, NULL} // End marker
};
// Show help/usage information
void show_help(const char* program_name) {
printf("Usage: %s [test_name ...]\n\n", program_name);
printf("Available tests:\n");
for (int i = 0; test_cases[i].name != NULL; i++) {
printf(" %-15s - %s\n", test_cases[i].name, test_cases[i].description);
}
printf("\nExamples:\n");
printf(" %s # Run all tests\n", program_name);
printf(" %s init # Run only initialization test\n", program_name);
printf(" %s init collective # Run initialization and collective tests\n", program_name);
printf(" %s --help # Show this help\n", program_name);
}
// Find test by name
TestFunction find_test(const char* name) {
for (int i = 0; test_cases[i].name != NULL; i++) {
if (strcmp(test_cases[i].name, name) == 0) {
return test_cases[i].func;
}
}
return NULL;
}
// Main test runner
int main(int argc, char* argv[]) {
int passed = 0, total = 0;
// Check for help
if (argc > 1 && (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-h") == 0)) {
show_help(argv[0]);
return 0;
}
printf("Running NCCL Tuner Plugin Unit Tests\n");
printf("=====================================\n");
if (argc == 1) {
// No arguments - run all tests
for (int i = 0; test_cases[i].name != NULL; i++) {
total++;
passed += test_cases[i].func();
}
} else {
// Run specific tests
for (int arg = 1; arg < argc; arg++) {
TestFunction test_func = find_test(argv[arg]);
if (test_func) {
total++;
passed += test_func();
} else {
printf("ERROR: Unknown test '%s'\n", argv[arg]);
printf("Use --help to see available tests\n");
return 1;
}
}
}
printf("\n=====================================\n");
printf("Test Results: %d/%d tests passed\n", passed, total);
if (passed == total) {
printf("All tests PASSED!\n");
return 0;
} else {
printf("Some tests FAILED!\n");
return 1;
}
}

View File

@ -17,6 +17,8 @@ PROFAPI ?= 1
NVTX ?= 1
RDMA_CORE ?= 0
NET_PROFILER ?= 0
MLX5DV ?= 0
MAX_EXT_NET_PLUGINS ?= 0
NVCC = $(CUDA_HOME)/bin/nvcc
@ -38,10 +40,12 @@ ifeq ($(shell test "0$(CUDA_MAJOR)" -lt 12; echo $$?),0)
CUDA8_GENCODE += -gencode=arch=compute_35,code=sm_35
endif
CUDA9_GENCODE = -gencode=arch=compute_70,code=sm_70
CUDA10_GENCODE = -gencode=arch=compute_75,code=sm_75
CUDA11_GENCODE = -gencode=arch=compute_80,code=sm_80
CUDA12_GENCODE = -gencode=arch=compute_90,code=sm_90
CUDA13_GENCODE = -gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120
CUDA12_8_GENCODE = -gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120
CUDA13_GENCODE = -gencode=arch=compute_110,code=sm_110
CUDA8_PTX = -gencode=arch=compute_61,code=compute_61
CUDA9_PTX = -gencode=arch=compute_70,code=compute_70
@ -49,10 +53,12 @@ CUDA11_PTX = -gencode=arch=compute_80,code=compute_80
CUDA12_PTX = -gencode=arch=compute_90,code=compute_90
CUDA13_PTX = -gencode=arch=compute_120,code=compute_120
ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -gt 12; echo $$?),0)
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
# Architectures prior to SM75 are deprecated from CUDA 13.0 onwards
NVCC_GENCODE ?= $(CUDA10_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_8_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8; echo $$?),0)
# Include Blackwell support if we're using CUDA12.8 or above
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_8_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 11 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -gt 11; echo $$?),0)
# Include Hopper support if we're using CUDA11.8 or above
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_PTX)
@ -66,14 +72,21 @@ else
endif
$(info NVCC_GENCODE is ${NVCC_GENCODE})
# CUDA 13.0 requires c++17
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
CXXSTD ?= -std=c++17
else
CXXSTD ?= -std=c++14
endif
CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \
-Wall -Wno-unused-function -Wno-sign-compare -std=c++11 -Wvla \
-I $(CUDA_INC) \
-Wall -Wno-unused-function -Wno-sign-compare $(CXXSTD) -Wvla \
-I $(CUDA_INC) -I $(CUDA_INC)/cccl \
$(CXXFLAGS)
# Maxrregcount needs to be set according to NCCL_MAX_NTHREADS (otherwise it will cause kernel launch errors)
# 512 : 120, 640 : 96, 768 : 80, 1024 : 60
# We would not have to set this if we used __launch_bounds__, but this only works on kernels, not on functions.
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11 --expt-extended-lambda -Xptxas -maxrregcount=96 -Xfatbin -compress-all
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) $(CXXSTD) --expt-extended-lambda -Xptxas -maxrregcount=96 -Xfatbin -compress-all
# Use addprefix so that we can specify more than one path
NVLDFLAGS := -L${CUDA_LIB} -lcudart -lrt
@ -136,9 +149,17 @@ CXXFLAGS += -DPROFAPI
endif
ifneq ($(RDMA_CORE), 0)
CXXFLAGS += -DNCCL_BUILD_RDMA_CORE=1
CXXFLAGS += -DNCCL_BUILD_RDMA_CORE=1 -libverbs
endif
ifneq ($(MLX5DV), 0)
CXXFLAGS += -DNCCL_BUILD_MLX5DV=1 -lmlx5
endif
ifneq ($(NET_PROFILER), 0)
CXXFLAGS += -DNCCL_ENABLE_NET_PROFILING=1
endif
ifneq ($(MAX_EXT_NET_PLUGINS), 0)
CXXFLAGS += -DNCCL_NET_MAX_PLUGINS=$(MAX_EXT_NET_PLUGINS)
endif

View File

@ -1,6 +1,6 @@
##### version
NCCL_MAJOR := 2
NCCL_MINOR := 26
NCCL_PATCH := 5
NCCL_MINOR := 27
NCCL_PATCH := 6
NCCL_SUFFIX :=
PKG_REVISION := 1

View File

@ -10,7 +10,7 @@ include ../makefiles/version.mk
INCEXPORTS := nccl.h
LIBSRCFILES := \
bootstrap.cc channel.cc collectives.cc debug.cc enqueue.cc group.cc \
init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc \
init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc allocator.cc symmetric.cc \
$(wildcard graph/*.cc) \
$(wildcard misc/*.cc) \
$(wildcard transport/*.cc) \

src/allocator.cc Normal file
View File

@ -0,0 +1,196 @@
/*************************************************************************
* Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "comm.h"
#include "transport.h"
#include "group.h"
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc(void **ptr, size_t size) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
#if CUDART_VERSION >= 12010
size_t memGran = 0;
CUdevice currentDev;
CUmemAllocationProp memprop = {};
CUmemAccessDesc accessDesc = {};
CUmemGenericAllocationHandle handle = (CUmemGenericAllocationHandle)-1;
int cudaDev;
int flag;
int dcnt;
if (ptr == NULL || size == 0) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUDACHECK(cudaGetDevice(&cudaDev));
CUCHECK(cuDeviceGet(&currentDev, cudaDev));
if (ncclCuMemEnable()) {
size_t handleSize = size;
int requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
// Query device to see if FABRIC handle support is available
flag = 0;
(void) CUPFN(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, currentDev));
if (flag) requestedHandleTypes |= CU_MEM_HANDLE_TYPE_FABRIC;
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
memprop.location.id = currentDev;
// Query device to see if RDMA support is available
flag = 0;
CUCHECK(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, currentDev));
if (flag) memprop.allocFlags.gpuDirectRDMACapable = 1;
CUCHECK(cuMemGetAllocationGranularity(&memGran, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
CUDACHECK(cudaGetDeviceCount(&dcnt));
ALIGN_SIZE(handleSize, memGran);
if (requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) {
/* First try cuMemCreate() with FABRIC handle support and then remove if it fails */
CUresult err = CUPFN(cuMemCreate(&handle, handleSize, &memprop, 0));
if (err == CUDA_ERROR_NOT_PERMITTED || err == CUDA_ERROR_NOT_SUPPORTED) {
requestedHandleTypes &= ~CU_MEM_HANDLE_TYPE_FABRIC;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
} else if (err != CUDA_SUCCESS) {
// Catch and report any error from above
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
} else {
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
/* Reserve a virtual address range */
CUCHECK(cuMemAddressReserve((CUdeviceptr*)ptr, handleSize, memGran, 0, 0));
/* Map the virtual address range to the physical allocation */
CUCHECK(cuMemMap((CUdeviceptr)*ptr, handleSize, 0, handle, 0));
/* Now allow RW access to the newly mapped memory */
for (int i = 0; i < dcnt; ++i) {
int p2p = 0;
if (i == cudaDev || ((cudaDeviceCanAccessPeer(&p2p, i, cudaDev) == cudaSuccess) && p2p)) {
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = i;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CUCHECK(cuMemSetAccess((CUdeviceptr)*ptr, handleSize, &accessDesc, 1));
}
if (0 == p2p && i != cudaDev) INFO(NCCL_ALLOC, "P2P not supported between GPU%d and GPU%d", cudaDev, i);
}
goto exit;
}
fallback:
#endif
// Coverity is right to complain that we may pass a NULL ptr to cudaMalloc. That's deliberate though:
// we want CUDA to return an error to the caller.
// coverity[var_deref_model]
CUDACHECKGOTO(cudaMalloc(ptr, size), ret, fail);
exit:
return ret;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree(void *ptr) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
int saveDevice;
CUDACHECK(cudaGetDevice(&saveDevice));
#if CUDART_VERSION >= 12010
CUdevice ptrDev = 0;
if (ptr == NULL) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUCHECKGOTO(cuPointerGetAttribute((void*)&ptrDev, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, (CUdeviceptr)ptr), ret, fail);
CUDACHECKGOTO(cudaSetDevice((int)ptrDev), ret, fail);
if (ncclCuMemEnable()) {
NCCLCHECKGOTO(ncclCuMemFree(ptr), ret, fail);
goto exit;
}
fallback:
#endif
CUDACHECKGOTO(cudaFree(ptr), ret, fail);
exit:
CUDACHECK(cudaSetDevice(saveDevice));
return ret;
fail:
goto exit;
}
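For orientation, a minimal, hedged usage sketch of this allocation/free pair from application code. It assumes only the public ncclMemAlloc/ncclMemFree/ncclGetErrorString declarations in nccl.h; error handling is kept minimal.
// Hedged sketch: allocate a NCCL-managed buffer, use it, then free it.
#include <nccl.h>
#include <stdio.h>
static int demoAllocFree(size_t bytes) {
  void* buf = NULL;
  ncclResult_t res = ncclMemAlloc(&buf, bytes);   // takes the cuMem path above when available, cudaMalloc fallback otherwise
  if (res != ncclSuccess) { fprintf(stderr, "ncclMemAlloc: %s\n", ncclGetErrorString(res)); return 1; }
  // ... use buf as a send/recv buffer, optionally register it with ncclCommRegister() ...
  res = ncclMemFree(buf);                         // routes to ncclCuMemFree() or cudaFree(), matching the allocation
  if (res != ncclSuccess) { fprintf(stderr, "ncclMemFree: %s\n", ncclGetErrorString(res)); return 1; }
  return 0;
}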
// This is a collective function and should be called by all ranks in the communicator
ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr) {
ncclResult_t ret = ncclSuccess;
void* regSymAddr = NULL;
size_t allocSize = size;
size_t granularity;
CUdevice cuDev;
CUmemAllocationProp memprop = {};
CUmemGenericAllocationHandle memHandle;
int bit = 0, cnt = 0;
// alignment must be a power of 2
while (bit < sizeof(size_t) * 8) {
if (alignment & (1L << bit)) cnt++;
if (cnt == 2) {
WARN("rank %d alignment %ld is not power of 2", comm->rank, alignment);
goto fail;
}
bit++;
}
// temporarily align the alignment to NCCL_REC_PAGE_SIZE
ALIGN_SIZE(alignment, NCCL_REC_PAGE_SIZE);
CUCHECKGOTO(cuDeviceGet(&cuDev, comm->cudaDev), ret, fail);
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = ncclCuMemHandleType;
memprop.location.id = cuDev;
CUCHECKGOTO(cuMemGetAllocationGranularity(&granularity, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED), ret, fail);
ALIGN_SIZE(allocSize, granularity);
CUCHECKGOTO(cuMemCreate(&memHandle, allocSize, &memprop, 0), ret, fail);
ALIGN_SIZE(comm->symAllocHead, alignment);
NCCLCHECKGOTO(ncclIpcSymmetricMap(comm, comm->symAllocHead, allocSize, memHandle, &regSymAddr), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricMap(comm, comm->symAllocHead, allocSize, regSymAddr), ret, fail);
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
comm->symAllocHead += allocSize;
*symPtr = regSymAddr;
exit:
return ret;
fail:
*symPtr = NULL;
goto exit;
}
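The bit-counting loop above rejects any alignment with more than one bit set; an equivalent, more common idiom is shown below as a hedged standalone sketch.
// Hedged sketch of the power-of-2 check performed by the bit-counting loop above:
// a nonzero value is a power of two exactly when it has a single set bit.
#include <cstddef>
static inline bool isPowerOfTwo(size_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}
// Example: isPowerOfTwo(4096) -> true; isPowerOfTwo(3072) -> false.
// (The loop above additionally lets 0 pass and then rounds the alignment up to NCCL_REC_PAGE_SIZE.)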
ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr) {
CUmemGenericAllocationHandle handle;
size_t size = 0;
ncclResult_t ret = ncclSuccess;
int saveDev = comm->cudaDev;
CUDACHECKGOTO(cudaGetDevice(&saveDev), ret, fail);
if (ncclCuMemEnable()) {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
CUCHECKGOTO(cuMemRetainAllocationHandle(&handle, symPtr), ret, fail);
CUCHECKGOTO(cuMemRelease(handle), ret, fail);
CUCHECKGOTO(cuMemGetAddressRange(NULL, &size, (CUdeviceptr)symPtr), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricFree(comm, size, symPtr), ret, fail);
NCCLCHECKGOTO(ncclIpcSymmetricFree(comm, size, symPtr), ret, fail);
CUCHECKGOTO(cuMemRelease(handle), ret, fail);
}
exit:
CUDACHECK(cudaSetDevice(saveDev));
return ret;
fail:
goto exit;
}

View File

@ -94,6 +94,7 @@ ncclResult_t bootstrapNetInit() {
pthread_mutex_lock(&bootstrapNetLock);
if (bootstrapNetInitDone == 0) {
const char* env = ncclGetEnv("NCCL_COMM_ID");
int nIfs = 0;
if (env) {
union ncclSocketAddress remoteAddr;
if (ncclSocketGetAddrFromString(&remoteAddr, env) != ncclSuccess) {
@ -101,13 +102,15 @@ ncclResult_t bootstrapNetInit() {
pthread_mutex_unlock(&bootstrapNetLock);
return ncclInvalidArgument;
}
if (ncclFindInterfaceMatchSubnet(bootstrapNetIfName, &bootstrapNetIfAddr, &remoteAddr, MAX_IF_NAME_SIZE, 1) <= 0) {
NCCLCHECK(ncclFindInterfaceMatchSubnet(bootstrapNetIfName, &bootstrapNetIfAddr, &remoteAddr, MAX_IF_NAME_SIZE,
&nIfs));
if (nIfs <= 0) {
WARN("NET/Socket : No usable listening interface found");
pthread_mutex_unlock(&bootstrapNetLock);
return ncclSystemError;
}
} else {
int nIfs = ncclFindInterfaces(bootstrapNetIfName, &bootstrapNetIfAddr, MAX_IF_NAME_SIZE, 1);
NCCLCHECK(ncclFindInterfaces(bootstrapNetIfName, &bootstrapNetIfAddr, MAX_IF_NAME_SIZE, 1, &nIfs));
if (nIfs <= 0) {
WARN("Bootstrap : no socket interface found");
pthread_mutex_unlock(&bootstrapNetLock);
@ -828,7 +831,7 @@ ncclResult_t bootstrapSplit(uint64_t magic, struct ncclComm* comm, struct ncclCo
NCCLCHECKGOTO(ncclCalloc(&state->peerP2pAddresses, nranks), ret, fail);
memcpy(state->peerP2pAddresses + rank, &peerSocketAddress, sizeof(union ncclSocketAddress));
if (parent->config.splitShare) {
if (parent->shareResources) {
/* map local rank to top parent local rank. */
for (int i = 0; i < nranks; ++i) {
comm->topParentRanks[i] = parent->topParentRanks[parentRanks[i]];

View File

@ -147,7 +147,7 @@ ncclResult_t initCollnetChannel(struct ncclComm* comm, int channelId, struct ncc
ncclResult_t freeChannel(struct ncclChannel* channel, int nRanks, int collnetNRanks, int nvlsNRanks) {
int nPeers = nRanks + collnetNRanks + nvlsNRanks;
/* channel peers are only valid when async init thread completes commAlloc() and
* the channel is intialized with initChannel(); if either is not done, this channel
* the channel is initialized with initChannel(); if either is not done, this channel
* should never be free. */
if (channel->id == -1 || channel->peers == NULL) return ncclSuccess;

View File

@ -16,6 +16,8 @@
#include <chrono>
#include "param.h"
#define NCCL_DEBUG_RESET_TRIGGERED (-2)
int ncclDebugLevel = -1;
static uint32_t ncclDebugTimestampLevels = 0; // bitmaps of levels that have timestamps turned on
static char ncclDebugTimestampFormat[256]; // with space for subseconds
@ -26,7 +28,7 @@ static int pid = -1;
static char hostname[1024];
thread_local int ncclDebugNoWarn = 0;
char ncclLastError[1024] = ""; // Global string for the last error in human readable form
static uint64_t ncclDebugMask = NCCL_INIT | NCCL_BOOTSTRAP | NCCL_ENV; // Default debug sub-system mask is INIT and ENV
static uint64_t ncclDebugMask = 0;
FILE *ncclDebugFile = stdout;
static pthread_mutex_t ncclDebugLock = PTHREAD_MUTEX_INITIALIZER;
static std::chrono::steady_clock::time_point ncclEpoch;
@ -34,11 +36,16 @@ static bool ncclWarnSetDebugInfo = false;
static __thread int tid = -1;
// This function must be called with ncclDebugLock locked!
static void ncclDebugInit() {
pthread_mutex_lock(&ncclDebugLock);
if (ncclDebugLevel != -1) { pthread_mutex_unlock(&ncclDebugLock); return; }
const char* nccl_debug = ncclGetEnv("NCCL_DEBUG");
int tempNcclDebugLevel = -1;
uint64_t tempNcclDebugMask = NCCL_INIT | NCCL_BOOTSTRAP | NCCL_ENV; // Default debug sub-system mask
if (ncclDebugLevel == NCCL_DEBUG_RESET_TRIGGERED && ncclDebugFile != stdout) {
// Finish the reset initiated via ncclResetDebugInit().
fclose(ncclDebugFile);
ncclDebugFile = stdout;
}
if (nccl_debug == NULL) {
tempNcclDebugLevel = NCCL_LOG_NONE;
} else if (strcasecmp(nccl_debug, "VERSION") == 0) {
@ -61,7 +68,7 @@ static void ncclDebugInit() {
if (ncclDebugSubsysEnv != NULL) {
int invert = 0;
if (ncclDebugSubsysEnv[0] == '^') { invert = 1; ncclDebugSubsysEnv++; }
ncclDebugMask = invert ? ~0ULL : 0ULL;
tempNcclDebugMask = invert ? ~0ULL : 0ULL;
char *ncclDebugSubsys = strdup(ncclDebugSubsysEnv);
char *subsys = strtok(ncclDebugSubsys, ",");
while (subsys != NULL) {
@ -102,7 +109,7 @@ static void ncclDebugInit() {
mask = NCCL_ALL;
}
if (mask) {
if (invert) ncclDebugMask &= ~mask; else ncclDebugMask |= mask;
if (invert) tempNcclDebugMask &= ~mask; else tempNcclDebugMask |= mask;
}
subsys = strtok(NULL, ",");
}
@ -246,15 +253,15 @@ static void ncclDebugInit() {
if (debugFn[0] != '\0') {
FILE *file = fopen(debugFn, "w");
if (file != nullptr) {
setbuf(file, nullptr); // disable buffering
setlinebuf(file); // disable block buffering
ncclDebugFile = file;
}
}
}
ncclEpoch = std::chrono::steady_clock::now();
ncclDebugMask = tempNcclDebugMask;
__atomic_store_n(&ncclDebugLevel, tempNcclDebugLevel, __ATOMIC_RELEASE);
pthread_mutex_unlock(&ncclDebugLock);
}
/* Common logging function used by the INFO, WARN and TRACE macros
@ -262,19 +269,38 @@ static void ncclDebugInit() {
* they can share the debugging mechanisms and output files
*/
void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *filefunc, int line, const char *fmt, ...) {
if (__atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE) == -1) ncclDebugInit();
bool locked = false; // Keeps track of the ncclDebugLock state.
int gotLevel = __atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE);
if (ncclDebugNoWarn != 0 && level == NCCL_LOG_WARN) { level = NCCL_LOG_INFO; flags = ncclDebugNoWarn; }
// Save the last error (WARN) as a human readable string
if (level == NCCL_LOG_WARN) {
pthread_mutex_lock(&ncclDebugLock);
locked = true;
va_list vargs;
va_start(vargs, fmt);
(void) vsnprintf(ncclLastError, sizeof(ncclLastError), fmt, vargs);
va_end(vargs);
pthread_mutex_unlock(&ncclDebugLock);
}
if (ncclDebugLevel < level || ((flags & ncclDebugMask) == 0)) return;
if (gotLevel >= 0 && (gotLevel < level || (flags & ncclDebugMask) == 0)) {
if (locked)
pthread_mutex_unlock(&ncclDebugLock);
return;
}
if (!locked) {
pthread_mutex_lock(&ncclDebugLock);
locked = true;
}
// From this point on ncclDebugLock is always locked so we don't need to check "locked" anymore.
if (ncclDebugLevel < 0)
ncclDebugInit();
if (ncclDebugLevel < level || ((flags & ncclDebugMask) == 0)) {
pthread_mutex_unlock(&ncclDebugLock);
return;
}
if (tid == -1) {
tid = syscall(SYS_gettid);
@ -335,7 +361,7 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
// Add level specific formatting.
if (level == NCCL_LOG_WARN) {
len += snprintf(buffer+len, sizeof(buffer)-len, "[%d] %s:%d NCCL WARN ", cudaDev, filefunc, line);
if (ncclWarnSetDebugInfo) ncclDebugLevel = NCCL_LOG_INFO;
if (ncclWarnSetDebugInfo) __atomic_store_n(&ncclDebugLevel, NCCL_LOG_INFO, __ATOMIC_RELEASE);
} else if (level == NCCL_LOG_INFO) {
len += snprintf(buffer+len, sizeof(buffer)-len, "[%d] NCCL INFO ", cudaDev);
} else if (level == NCCL_LOG_TRACE && flags == NCCL_CALL) {
@ -360,19 +386,17 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
// necessary since we write bytes instead of the string.
buffer[len++] = '\n';
fwrite(buffer, 1, len, ncclDebugFile);
pthread_mutex_unlock(&ncclDebugLock);
}
NCCL_API(void, ncclResetDebugInit);
void ncclResetDebugInit() {
// Cleans up from a previous ncclDebugInit() and reruns.
// Use this after changing NCCL_DEBUG and related parameters in the environment.
__atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE);
if (ncclDebugFile != stdout) {
fclose(ncclDebugFile);
ncclDebugFile = stdout;
}
ncclDebugLevel = -1;
ncclDebugInit();
pthread_mutex_lock(&ncclDebugLock);
// Let ncclDebugInit() know to complete the reset.
__atomic_store_n(&ncclDebugLevel, NCCL_DEBUG_RESET_TRIGGERED, __ATOMIC_RELEASE);
pthread_mutex_unlock(&ncclDebugLock);
}
NCCL_PARAM(SetThreadName, "SET_THREAD_NAME", 0);
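The rework above replaces eager re-initialization with a sentinel level value (NCCL_DEBUG_RESET_TRIGGERED) that the next locked ncclDebugInit() call completes. Below is a reduced, hedged sketch of that lock-plus-atomic lazy-init pattern; all names are illustrative placeholders, not NCCL identifiers.
// Hedged sketch: sentinel level value, atomic fast path, mutex-protected slow path.
#include <pthread.h>
#define LEVEL_UNINIT (-1)
#define LEVEL_RESET  (-2)   // set by the reset entry point; init completes it later
static int logLevel = LEVEL_UNINIT;
static pthread_mutex_t logLock = PTHREAD_MUTEX_INITIALIZER;
static void logInitLocked(void) {               // caller must hold logLock
  int level = /* parse environment, (re)open log file, ... */ 0;
  __atomic_store_n(&logLevel, level, __ATOMIC_RELEASE);
}
void logMessage(int msgLevel) {
  int level = __atomic_load_n(&logLevel, __ATOMIC_ACQUIRE);
  if (level >= 0 && level < msgLevel) return;   // fast path: initialized and filtered out
  pthread_mutex_lock(&logLock);
  if (__atomic_load_n(&logLevel, __ATOMIC_ACQUIRE) < 0) logInitLocked();
  // ... format and write the message while holding the lock ...
  pthread_mutex_unlock(&logLock);
}
void logReset(void) {
  pthread_mutex_lock(&logLock);
  __atomic_store_n(&logLevel, LEVEL_RESET, __ATOMIC_RELEASE);  // next logMessage() re-runs init
  pthread_mutex_unlock(&logLock);
}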

View File

@ -23,6 +23,9 @@ INCFLAGS = -I. -I.. -I$(BUILDDIR)/include -I../include
NVCUFLAGS += $(INCFLAGS) --compiler-options "-fPIC -fvisibility=hidden"
CXXFLAGS += $(INCFLAGS)
NVCUFLAGS_SYM := -ccbin $(CXX) $(CXXSTD) --expt-extended-lambda -Xptxas -maxrregcount=128 -Xfatbin -compress-all
NVCUFLAGS_SYM += $(INCFLAGS) --compiler-options "-fPIC -fvisibility=hidden"
SAY = @bash -c 'path="$$2"; [[ "$$(realpath "$$2")" =~ ^$(subst .,\.,$(abspath $(NCCLDIR)))/(.*)$$ ]] && path="$${BASH_REMATCH[1]}"; printf "%-15s %s\n" "$$1" "$$path"' SAY
COMPILE.cu = $(NVCC) $(NVCUFLAGS) -dc $2 -o $1
@ -30,7 +33,21 @@ COMPILE.cc = $(CXX) $(CXXFLAGS) -c $2 -o $1
define COMPILE
@$(SAY) "Compiling" $2;\
mkdir -p $(dir $1);\
$(call COMPILE$(suffix $2),$1,$2)
$(call COMPILE$(or $3,$(suffix $2)),$1,$2)
endef
ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12090))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100f,code=sm_100f
else ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12070))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100a,code=sm_100a
else
NVCC_GENCODE_LDMC_FP8 =
endif
define COMPILE_SYM
@$(SAY) "Compiling" $2;\
mkdir -p $(dir $1);\
$(NVCC) $(NVCUFLAGS_SYM) $3 -dw $2 -o $1
endef
DEPENDS.cu = $(NVCC) $(NVCUFLAGS) -M -dc $1
@ -48,8 +65,6 @@ endef
all: $(MANIFEST)
ifeq (1,1)
# Case if the <gensrc> directory is generated on-demand:
$(OBJDIR)/gensrc: generate.py
@mkdir -p $@
(which python3 >/dev/null || \
@ -57,22 +72,26 @@ $(OBJDIR)/gensrc: generate.py
printf "\n$${bar}\nERROR: Building NCCL requires a Python 3 installation invokable as 'python3'.\n$${bar}\n\n" 1>&2; \
exit 1)) \
&& ./generate.py $@ "$(ONLY_FUNCS)"
else
# Case if the <gensrc> directory is pre-generated and checked in the repo as ./gen:
$(OBJDIR)/gensrc:
@mkdir -p $(OBJDIR); ln -srfn ./gen $@
endif
$(OBJDIR)/gensrc/symmetric: $(OBJDIR)/gensrc symmetric/generate.py
@mkdir -p $@
./symmetric/generate.py $@
# The trailing ";" is necessary to make this an "empty recipe":
# https://www.gnu.org/software/make/manual/html_node/Empty-Recipes.html
$(OBJDIR)/gensrc/rules.mk: $(OBJDIR)/gensrc ;
$(OBJDIR)/gensrc/symmetric/rules.mk: $(OBJDIR)/gensrc/symmetric ;
-include $(OBJDIR)/gensrc/rules.mk
# "gensrc/rules.mk" populates $(LIB_OBJS_GEN)
-include $(OBJDIR)/gensrc/symmetric/rules.mk
# "gensrc/symmetric/rules.mk" populates $(LIB_OBJS_SYM_GEN)
SRCS = common.cu onerank.cu
LIB_OBJS = $(patsubst %, $(OBJDIR)/%.o, $(SRCS)) $(LIB_OBJS_GEN)
LIB_OBJS = $(patsubst %, $(OBJDIR)/%.o, $(SRCS)) $(LIB_OBJS_GEN) $(LIB_OBJS_SYM_GEN)
$(OBJDIR)/%.o: % $(OBJDIR)/%.d
$(call COMPILE,$@,$<)
@ -80,12 +99,18 @@ $(OBJDIR)/%.o: % $(OBJDIR)/%.d
$(OBJDIR)/genobj/%.o: $(OBJDIR)/gensrc $(OBJDIR)/genobj/%.d
$(call COMPILE,$@,$(OBJDIR)/gensrc/$*)
$(OBJDIR)/genobj/symmetric/%.o: $(OBJDIR)/gensrc/symmetric $(OBJDIR)/genobj/symmetric/%.d
$(call COMPILE,$@,$(OBJDIR)/gensrc/symmetric/$*)
$(OBJDIR)/%.d: %
$(call DEPENDS,$@,$<)
$(OBJDIR)/genobj/%.d: $(OBJDIR)/gensrc/%
$(call DEPENDS,$@,$<)
$(OBJDIR)/genobj/symmetric/%.d: $(OBJDIR)/gensrc/symmetric/%
$(call DEPENDS,$@,$<)
$(DEVGLUE_OBJ): $(LIB_OBJS)
$(NVCC) $(NVCUFLAGS) -dlink $^ -o $@
@ -94,6 +119,7 @@ $(MANIFEST): $(LIB_OBJS) $(DEVGLUE_OBJ)
-include $(wildcard $(OBJDIR)/*.d)
-include $(wildcard $(OBJDIR)/genobj/*.d)
-include $(wildcard $(OBJDIR)/genobj/symmetric/*.d)
.PHONY: clean
clean:

View File

@ -173,73 +173,221 @@ struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_PAT, NCCL_PROTO_SIMPLE
template<typename T, typename RedOp>
struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_NVLS, NCCL_PROTO_SIMPLE> {
template<bool BcastSendNotRecv>
struct Scatterer {
struct ncclDevWorkColl* work;
ssize_t chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
) {
static_assert(SlicePerChunk==1, "require: SlicePerChunk==1");
static_assert(MaxDsts<=1 || MaxSrcs<=1, "require: MaxDsts<=1 || MaxSrcs<=1");
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
int nNodes = ncclShmem.comm.nNodes;
int nRails = nvls->nHeads;
int part = ncclShmem.channelId - work->channelLo;
char* inbuf = (char*)work->sendbuff;
char* outbuf = (char*)work->recvbuff;
ssize_t countPerRank = work->collnet.count;
bool inPlace = (inbuf == outbuf + ncclShmem.comm.rank * countPerRank);
ssize_t railAllBeg = min(railGridOffset + part * chunkSize, nNodes * countPerRank);
ssize_t railAllEnd = min(railAllBeg + chunkSize, nNodes * countPerRank);
int railAllSize = railAllEnd - railAllBeg;
int rail = 0;
int src = 0;
if (BcastSendNotRecv) {
rail = nvls->headRank;
} else {
if (work->regUsed) return;
rail = 0;
}
if (tid < nDsts) dstSizes[tid] = railAllSize;
do {
int node = railAllBeg / countPerRank;
int railAllOffset = 0;
while (railAllOffset < railAllSize) {
ssize_t railOneBeg = node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t railOneOffset = (railAllBeg + railAllOffset) - railOneBeg;
int delta = min(railAllEnd, railOneEnd) - (railAllBeg + railAllOffset);
int rank = ncclShmem.comm.collNetDenseToUserRank[node * nRails + rail];
ssize_t userOneBeg = rank * countPerRank + railOneOffset;
int outIsDst = (inPlace && rank == ncclShmem.comm.rank) || BcastSendNotRecv || work->regUsed ? 0 : 1;
if (nSrcs != 0 && outIsDst + nDsts != 0) {
reduceCopy<ncclCollUnroll(), RedOp, T,
/*MultimemSrcs,MinSrcs,MaxSrcs=*/MultimemSrcs, 1, 1,
/*MultimemDsts=*/MultimemDsts, 0 + MultimemDsts + MinDsts, 1 + MaxDsts,
/*PreOpSrcs=*/0>
(tid, tn, 0, nullptr, false,
/*nSrcs=*/1, [=]__device__(int s/*==0*/) -> void* {
return (char*)srcPtrs[src] + railAllOffset;
},
/*nDsts=*/outIsDst + nDsts, [=]__device__(int d) -> void* {
return d < outIsDst ? outbuf + userOneBeg
: work->regUsed ? (char*)dstPtrs[d - outIsDst] + userOneBeg
: (char*)dstPtrs[d - outIsDst] + railAllOffset;
}, delta);
}
railAllOffset += delta;
node += 1;
}
rail += 1;
src += 1;
} while (!BcastSendNotRecv && src < nRails);
}
};
__device__ __forceinline__ void run(int tid, int/*nthreads*/, struct ncclDevWorkColl* work) {
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
const ssize_t rank = ncclShmem.comm.rank;
size_t count, gridOffset, channelCount;
size_t chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
size_t offset;
int nelem;
const int nThreadsBcast = work->regUsed ? (NCCL_MAX_NTHREADS - WARP_SIZE) : 4 * WARP_SIZE;
const int nThreadsGather = work->regUsed ? WARP_SIZE : NCCL_MAX_NTHREADS - nThreadsBcast;
const int tidEndGather = nThreadsGather;
const int tidEndBcast = tidEndGather + nThreadsBcast;
const int nThreadsNetSend = work->oneNode ? 0 : (work->netRegUsed ? WARP_SIZE : 6 * WARP_SIZE);
const int nThreadsGather = work->regUsed ? roundUp(nvls->nHeads << 2, WARP_SIZE) : 8 * WARP_SIZE;
const int nThreadsBcast = NCCL_MAX_NTHREADS - nThreadsNetSend - nThreadsGather;
if (!work->regUsed) {
if (tid < tidEndGather) {
// Gather
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, NULL, NULL, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.gather(offset, nvls->nHeads * count, nelem, count, -1, 0);
const int tidEndGather = nThreadsGather;
const int tidEndNetSend = tidEndGather + nThreadsNetSend;
const int tidEndBcast = tidEndNetSend + nThreadsBcast;
if (work->oneNode) {
const ssize_t rank = ncclShmem.comm.rank;
size_t count, gridOffset, channelCount, offset, chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
if (!work->regUsed) {
if (tid < tidEndGather) {
// Gather
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, NULL, NULL, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.gather(offset, nvls->nHeads * count, nelem, count, -1, 0);
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndBcast) {
// Bcast through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, NULL, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.send(offset, nelem);
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndBcast) {
// Bcast through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, NULL, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.send(offset, nelem);
} else {
if (tid < tidEndGather) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
/* used as sync */
prims.scatter(0, 0, 0, 0, -1, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.gather(0, 0, 0, 0, -1, 0);
}
} else if (tid < tidEndBcast) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, &nvls->down, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 1 * Proto::MaxGroupWidth, 0, 0, work);
/* used as sync */
prims.recv(0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
ssize_t inpOffset = gridOffset + elemOffset;
ssize_t outOffset = inpOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(inpOffset, outOffset, nelem);
}
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
}
} else {
/* direct allgather */
// NVLS + IB SHARP
int nNodes = ncclShmem.comm.nNodes;
int part = ncclShmem.channelId - work->channelLo;
ssize_t countPerRank = work->collnet.count;
const int nChannels = work->channelHi - work->channelLo + 1;
ssize_t chunkCount = work->collnet.chunkCount;
if (tid < tidEndGather) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
/* used as sync */
prims.scatter(0, 0, 0, 0, -1, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.gather(0, 0, 0, 0, -1, 0);
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/1, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nullptr, nullptr, work->recvbuff,
/*redOpArg=*/0, 1 * Proto::MaxGroupWidth, 1, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/false> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/0>(scat);
}
} else if (tid < tidEndBcast) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, &nvls->down, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 1 * Proto::MaxGroupWidth, 0, 0, work);
/* used as sync */
prims.recv(0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
ssize_t inpOffset = gridOffset + elemOffset;
ssize_t outOffset = inpOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(inpOffset, outOffset, nelem);
} else {
if (work->netRegUsed) {
using ProtoSend = ProtoSimple<1, 1, COLL_UNROLL>;
using ProtoBcast = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
int maxSteps = (int)divUp(nNodes * countPerRank, nChannels * chunkCount);
int curSteps = -1;
int postThread = tid - tidEndGather == 0 ? 1 : 0;
// for UB, we need to control the send speed to avoid net congestion.
// first unroll 2 steps, then unroll the remaining steps as the data is received.
if (postThread) {
curSteps = min(2, maxSteps);
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/1, ProtoSend, 0>::sendPeerNotify(nvls->out, 1, curSteps);
}
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, ProtoBcast, 0>
prims(tid - tidEndGather, nThreadsNetSend + nThreadsBcast, &nvls->out, &nvls->down, nullptr, nullptr,
/*redOpArg=*/0, 2 * ProtoBcast::MaxGroupWidth, 0, 0, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
if (postThread && curSteps < maxSteps) {
curSteps++;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/1, ProtoSend, 0>::sendPeerNotify(nvls->out, 1, 1);
}
}
} else {
if (tid < tidEndNetSend) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsNetSend, nullptr, &nvls->out, work->sendbuff, nullptr,
/*redOpArg=*/0, 0 * Proto::MaxGroupWidth, 1, 1);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
ssize_t railAllBeg = railGridOffset + part * chunkCount;
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
ssize_t railOneBeg = ncclShmem.comm.node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t beg = max(railAllBeg, railOneBeg);
ssize_t end = min(railAllEnd, railOneEnd);
prims.send(beg - railOneBeg, max(ssize_t(0), end - beg));
}
} else {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndNetSend, nThreadsBcast, &nvls->out, &nvls->down, nullptr, nullptr,
/*redOpArg=*/0, 2 * Proto::MaxGroupWidth, 0, 0);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
}
}
}
}
}
@ -254,7 +402,7 @@ struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_COLLNET_DIRECT, NCCL_P
ssize_t chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts>
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag

View File

@ -106,7 +106,7 @@ namespace {
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(offset, nelem);
prims.directSend(offset, offset, nelem);
}
}
else {

View File

@ -52,7 +52,6 @@ struct ncclShmemData {
uint16_t funcId;
int nWorks;
int workSize;
uint32_t workConsumed;
uint64_t workCounter;
bool profilerEnabled;
struct ncclShmemGroup groups[NCCL_MAX_GROUPS];
@ -182,7 +181,6 @@ __device__ __forceinline__ void loadWorkBatchToShmem(
}
if (tid == 0) {
ncclShmem.workSize = workSize;
ncclShmem.workConsumed = batch.offsetBase + (64-__clzll(batch.offsetBitset))*workSize;
}
// We deliberately replicate these div and mod calculations into the case
// blocks above so that they get constant divisor optimizations by the compiler.
@ -242,6 +240,12 @@ __device__ __forceinline__ void loadWorkBatchToShmem(
}
}
__device__ __forceinline__ unsigned long long int globaltimer() {
unsigned long long int timer;
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(timer));
return timer;
}
template<ncclFunc_t Fn, typename T, typename RedOp, int Algo, int Proto>
struct RunWorkColl {
__device__ void run(int tid, int tn, struct ncclDevWorkColl* work) {
@ -296,40 +300,30 @@ struct RunWorkBatch {
#define STOP 1
#define FINI 2
__device__ __forceinline__ bool profilerEnabled(void) {
// Check if any of the workItems in the batch is profiled. If so, there is an equivalent
// profiler ProxyOp waiting for the counter update in the host thread. If this check was
// done only for the first workItem the profiler counter for other workItems in the batch
// could never be updated, leaving the host thread spinning forever for the counter update
// and causing a hang.
bool enabled = false;
for (int i = 0; i < ncclShmem.nWorks && !enabled; i++) {
if (ncclShmem.workType == ncclDevWorkTypeP2p)
enabled = ((struct ncclDevWorkP2p*)ncclShmem.workStorage)[i].profilerEnabled;
else
enabled = ((struct ncclDevWorkColl*)ncclShmem.workStorage)[i].profilerEnabled;
}
return enabled;
__device__ __forceinline__ bool profilerEnabled(int workItemIdx) {
return (ncclShmem.workType == ncclDevWorkTypeP2p) ?
((struct ncclDevWorkP2p*)ncclShmem.workStorage)[workItemIdx].profilerEnabled :
((struct ncclDevWorkColl*)ncclShmem.workStorage)[workItemIdx].profilerEnabled;
}
__device__ __forceinline__ void profiler(int action) {
if (action == START) {
if (threadIdx.x == 0) {
// increment workCounter regardless of the profiler being active or not
if (threadIdx.x == 0) {
int idx = 0;
uint64_t wc = ncclShmem.channel.workCounter + 1;
if (action == START) {
for (; wc <= ncclShmem.channel.workCounter + ncclShmem.nWorks; wc++) {
if (!profilerEnabled(idx++)) continue;
ncclShmem.comm.workStarted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp = globaltimer();
ncclShmem.comm.workStarted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].counter = wc;
}
} else {
for (; wc <= ncclShmem.channel.workCounter + ncclShmem.nWorks; wc++) {
if (!profilerEnabled(idx++)) continue;
ncclShmem.comm.workCompleted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp = globaltimer();
ncclShmem.comm.workCompleted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].counter = wc;
}
ncclShmem.channel.workCounter += ncclShmem.nWorks;
if(!profilerEnabled()) return;
ncclShmem.comm.workStarted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
}
} else if (action == STOP) {
if (threadIdx.x == 0 && profilerEnabled()) {
ncclShmem.comm.workCompleted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
}
} else { // FINI
if (threadIdx.x == 0) {
// store the workCounter back to vidmem regardless of the profiler being active or not
((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
if (!profilerEnabled()) return;
ncclShmem.comm.workCompleted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
if (action == FINI) ((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
}
}
}
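The new device-side profiler path above timestamps each enabled work item into a fixed-size per-channel array indexed by workCounter modulo MAX_PROFILER_EVENTS_PER_CHANNEL. Below is a small hedged host-side sketch of that ring shape; the capacity and struct names are illustrative, not the NCCL definitions.
// Hedged sketch of the per-channel event ring implied above: events are written
// at counter % capacity, so the host side can consume them in counter order.
#include <stdint.h>
#define MAX_EVENTS 64
struct WorkEvent {
  uint64_t timestamp;   // e.g. a GPU globaltimer sample
  uint64_t counter;     // monotonically increasing work counter
};
struct ChannelEventRing {
  struct WorkEvent data[MAX_EVENTS];
};
// Record one started/completed work item for a channel.
static inline void recordEvent(struct ChannelEventRing* ring, uint64_t counter, uint64_t now) {
  ring->data[counter % MAX_EVENTS].timestamp = now;
  ring->data[counter % MAX_EVENTS].counter   = counter;
}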
@ -388,11 +382,6 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
}
__syncthreads(); // publish ncclShmem
if (tid == 0 && ncclShmem.args.workStorageType == ncclDevWorkStorageTypeFifo) {
// ncclShmem.workConsumed written by loadWorkBatchToShmem before __syncthreads()
ncclShmem.comm.workConsumed[ncclShmem.channelId] = ncclShmem.workConsumed;
}
while (ncclShmem.aborted == 0) {
profiler(START);
if (0 <= SpecializedFnId && ncclShmem.funcId == (unsigned)SpecializedFnId) {
@ -407,11 +396,6 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
profiler(STOP);
loadWorkBatchToShmem(tid, tn, args, batchIx);
__syncthreads();
if (tid == 0 && ncclShmem.args.workStorageType == ncclDevWorkStorageTypeFifo) {
// ncclShmem.workConsumed written by loadWorkBatchToShmem before __syncthreads()
ncclShmem.comm.workConsumed[ncclShmem.channelId] = ncclShmem.workConsumed;
}
}
profiler(FINI);
}

View File

@ -327,7 +327,7 @@ with open(os.path.join(gensrc, "rules.mk"), "w") as f:
out = f.write
impl_names = sorted(name_to_funcs.keys())
names = impl_names + ["host_table.cc", "device_table.cu"]
out("LIB_OBJS_GEN = $(patsubst %, $(OBJDIR)/genobj/%.o, {names})\n"
out("LIB_OBJS_GEN = $(patsubst %,$(OBJDIR)/genobj/%.o,{names})\n"
.format(names=" ".join(names)))
out("\n")

View File

@ -99,37 +99,60 @@ template<>
union BytePack<0> {};
template<>
union BytePack<1> {
uint8_t u8, native;
uint8_t u8[1], native;
};
template<>
union BytePack<2> {
BytePack<1> half[2];
BytePack<1> b1[2];
uint8_t u8[2];
uint16_t u16, native;
uint16_t u16[1], native;
};
template<>
union BytePack<4> {
BytePack<2> half[2];
BytePack<1> b1[4];
BytePack<2> b2[2];
uint8_t u8[4];
uint16_t u16[2];
uint32_t u32, native;
uint32_t u32[1], native;
};
template<>
union BytePack<8> {
BytePack<4> half[2];
BytePack<1> b1[8];
BytePack<2> b2[4];
BytePack<4> b4[2];
uint8_t u8[8];
uint16_t u16[4];
uint32_t u32[2];
uint64_t u64, native;
uint64_t u64[1], native;
};
template<>
union alignas(16) BytePack<16> {
BytePack<8> half[2];
BytePack<1> b1[16];
BytePack<2> b2[8];
BytePack<4> b4[4];
BytePack<8> b8[2];
uint8_t u8[16];
uint16_t u16[8];
uint32_t u32[4];
uint64_t u64[2];
ulong2 ul2, native;
ulong2 ul2[1], native;
};
template<int Size>
union BytePack {
BytePack<Size/2> half[2];
BytePack<1> b1[Size];
BytePack<2> b2[Size/2];
BytePack<4> b4[Size/4];
BytePack<8> b8[Size/8];
BytePack<16> b16[Size/16];
uint8_t u8[Size];
uint16_t u16[Size/2];
uint32_t u32[Size/4];
uint64_t u64[Size/8];
};
template<typename T>
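For readers unfamiliar with the pattern, here is a hedged, host-compilable illustration of what these BytePack unions provide: one value whose bytes can be reinterpreted at any smaller granularity, which the load/store and funnel-shift code below relies on. This is a simplified standalone analogue, not the device definition.
// Hedged illustration: one 8-byte pack viewed as native u64, two u32 halves, or 8 bytes.
#include <cstdint>
#include <cstdio>
union Pack8 {
  uint64_t native;
  uint32_t u32[2];
  uint8_t  u8[8];
};
int main() {
  Pack8 p;
  p.native = 0x1122334455667788ull;
  // Little-endian view: u32[0] holds the low half, u8[0] the lowest byte.
  std::printf("u32[0]=%08x u32[1]=%08x u8[0]=%02x\n", p.u32[0], p.u32[1], p.u8[0]);
  return 0;
}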
@ -357,19 +380,19 @@ __device__ __forceinline__ void multimem_st_global<0>(uintptr_t addr, BytePack<0
}
template<>
__device__ __forceinline__ void multimem_st_global<1>(uintptr_t addr, BytePack<1> val) {
asm volatile("st.global.b8 [%0], %1;" :: "l"(addr), "r"((uint32_t)val.u8) : "memory");
asm volatile("st.global.b8 [%0], %1;" :: "l"(addr), "r"((uint32_t)val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<2>(uintptr_t addr, BytePack<2> val) {
asm volatile("st.global.b16 [%0], %1;" :: "l"(addr), "h"(val.u16) : "memory");
asm volatile("st.global.b16 [%0], %1;" :: "l"(addr), "h"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<4>(uintptr_t addr, BytePack<4> val) {
asm volatile("multimem.st.global.b32 [%0], %1;" :: "l"(addr), "r"(val.u32) : "memory");
asm volatile("multimem.st.global.b32 [%0], %1;" :: "l"(addr), "r"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<8>(uintptr_t addr, BytePack<8> val) {
asm volatile("multimem.st.global.b64 [%0], %1;" :: "l"(addr), "l"(val.u64) : "memory");
asm volatile("multimem.st.global.b64 [%0], %1;" :: "l"(addr), "l"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<16>(uintptr_t addr, BytePack<16> val) {
@ -384,6 +407,56 @@ __device__ __forceinline__ void multimem_st_global(uintptr_t addr, BytePack<Size
}
#endif
// Load pack starting at index in array. Ignore elements past end (length of array).
template<typename Pack, typename T>
__device__ __forceinline__ Pack loadPack(T* ptr, int ix, int end) {
constexpr int Size = sizeof(Pack);
ptr += ix;
int n = end - ix;
if (alignof(T) == Size && sizeof(T) == Size) {
return *(Pack*)ptr;
} else if ((Size+3)/4 + 1 < Size/sizeof(T)) {
union { Pack ans; uint32_t part[Size/4]; };
int misalign = reinterpret_cast<uintptr_t>(ptr) % 4;
uint32_t* down = reinterpret_cast<uint32_t*>(reinterpret_cast<uintptr_t>(ptr) & -uintptr_t(4));
int i;
#pragma unroll
for (i=0; i < Size/4; i++) {
if (i*4/sizeof(T) < 1 || i*4/sizeof(T) < n) part[i] = down[i];
}
uint32_t extra;
if (misalign) extra = down[i];
#pragma unroll
for (i=0; i < Size/4; i++) {
part[i] = __funnelshift_r(part[i], part[i+1], 8*misalign);
}
if (misalign) part[i] = __funnelshift_r(part[i], extra, 8*misalign);
return ans;
} else {
union { Pack ans; BytePack<sizeof(T)> part[Size/sizeof(T)]; };
#pragma unroll
for (int i=0; i < Size/sizeof(T); i++) {
if (i < 1 || i < n) part[i] = ((BytePack<sizeof(T)>*)ptr)[i];
}
return ans;
}
}
// Store pack starting at index in array. Ignore elements past end (length of array).
template<typename Pack, typename T>
__device__ __forceinline__ void storePack(T* ptr, int ix, int end, Pack val) {
constexpr int Size = sizeof(Pack);
union { Pack tmp; BytePack<sizeof(T)> part[Size/sizeof(T)]; };
tmp = val;
ptr += ix;
int n = end - ix;
#pragma unroll
for (int i=0; i < Size/sizeof(T); i++) {
if (i < 1 || i < n) ((BytePack<sizeof(T)>*)ptr)[i] = part[i];
}
}
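As a reference for the intended semantics of loadPack/storePack, here is a hedged scalar sketch; the device versions above additionally vectorize with 4-byte word loads, handle misalignment with funnel shifts, and always touch at least one element.
// Hedged scalar reference: copy at most (end - ix) elements starting at ptr + ix.
// Elements past 'end' are left zero here (the device version leaves them undefined).
#include <cstring>
template<typename Pack, typename T>
Pack loadPackScalar(const T* ptr, int ix, int end) {
  Pack out{};
  int n = end - ix;
  int maxElts = int(sizeof(Pack) / sizeof(T));
  if (n > maxElts) n = maxElts;
  if (n > 0) std::memcpy(&out, ptr + ix, n * sizeof(T));
  return out;
}
template<typename Pack, typename T>
void storePackScalar(T* ptr, int ix, int end, Pack const& val) {
  int n = end - ix;
  int maxElts = int(sizeof(Pack) / sizeof(T));
  if (n > maxElts) n = maxElts;
  if (n > 0) std::memcpy(ptr + ix, &val, n * sizeof(T));
}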
// Warp-uniform memory copy from shared address (not generic) to global memory.
// The number of bytes copied is `min(MaxBytes, nBytesAhead)`; a negative value
// is interpreted as zero. EltSize is the guaranteed alignment of the addresses and sizes.
@ -426,10 +499,10 @@ __device__ __forceinline__ void copyGlobalShared_WarpUnrolled(
b4[3] = ld_shared<4>(srcAddr + 3*4);
if (srcMisalign != 0) {
BytePack<4> b4_4 = ld_shared<4>(srcAddr + 4*4);
b4[0].u32 = __funnelshift_r(b4[0].u32, b4[1].u32, srcMisalign*8);
b4[1].u32 = __funnelshift_r(b4[1].u32, b4[2].u32, srcMisalign*8);
b4[2].u32 = __funnelshift_r(b4[2].u32, b4[3].u32, srcMisalign*8);
b4[3].u32 = __funnelshift_r(b4[3].u32, b4_4.u32, srcMisalign*8);
b4[0].native = __funnelshift_r(b4[0].native, b4[1].native, srcMisalign*8);
b4[1].native = __funnelshift_r(b4[1].native, b4[2].native, srcMisalign*8);
b4[2].native = __funnelshift_r(b4[2].native, b4[3].native, srcMisalign*8);
b4[3].native = __funnelshift_r(b4[3].native, b4_4.native, srcMisalign*8);
}
if (Multimem) multimem_st_global<16>(dstAddr, b16);
else st_global<16>(dstAddr, b16);

View File

@ -125,7 +125,7 @@ class Primitives<
void **ptrs = isSendNotRecv ? (ncclShmem.groups[group].dsts + Dst)
: (ncclShmem.groups[group].srcs + Src);
if (flags & NetRegMode) {
if ((flags & NetRegMode) && ((!isSendNotRecv && DirectRecv) || (isSendNotRecv && DirectSend))) {
if (P2p) {
ptrs[index] = NULL;
} else {
@ -337,7 +337,7 @@ public:
}
template<int Recv, int Send, typename Fn>
__device__ __forceinline__ void process(Fn &&fn, uint32_t sendDirectFlag, uint32_t recvDirectFlag) {
__device__ __forceinline__ void process(Fn &&fn, uint32_t sendDirectFlag = 0, uint32_t recvDirectFlag = 0) {
#pragma unroll 1
for (int slice=0; slice < SlicePerChunk; slice++) {
if (tid < nworkers) {
@ -361,7 +361,7 @@ public:
} else if (flags & DirectRead) { // empty send
ptrs[index] = nullptr;
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
} else {
if (flags & DirectRead) {
@ -372,11 +372,11 @@ public:
else
ptrs[index] = nullptr;
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
}
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
}
subBarrier();
@ -391,7 +391,7 @@ public:
} else {
nsend = fan.nsend();
}
fn.template operator() < SlicePerChunk, 0, Recv*MaxRecv, 0, Send*MaxSend >
fn.template operator()<SlicePerChunk, 0, Recv*MaxRecv, 0, Send*MaxSend, MultimemSrcs, MultimemDsts>
(tid, nworkers, slice, stepSize * StepPerSlice,
nrecv, ncclShmem.groups[group].srcs,
nsend, ncclShmem.groups[group].dsts, ncclShmem.groups[group].dstSizes, sendDirectFlag, recvDirectFlag);
@ -896,6 +896,12 @@ private:
__device__ __forceinline__ void directRecvDirectSend(intptr_t inpIx, intptr_t outIx, int eltN, bool postOp=false) {
genericOp<1, 1, 1, 1, -1, -1>(inpIx, outIx, eltN, postOp);
}
__device__ __forceinline__ void recvDirectSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<0, 1, 1, 1, -1, -1>(-1, outIx, eltN, postOp);
}
__device__ __forceinline__ void directRecvSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<1, 0, 1, 1, -1, -1>(outIx, outIx, eltN, postOp);
}
__device__ __forceinline__ void recvCopyDirectSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<0, 1, 1, 1, -1, Output>(-1, outIx, eltN, postOp);
}

View File

@ -38,18 +38,18 @@ struct IsFloatingPoint<double>: std::true_type {};
// 3. Have constructor taking `uint64_t opArg`.
template<typename T>
struct FuncCopy { using EltType = T; __device__ FuncCopy(uint64_t opArg=0) {}; };
struct FuncCopy { using EltType = T; __device__ __forceinline__ FuncCopy(uint64_t opArg=0) {}; };
template<typename T>
struct FuncSum { using EltType = T; __device__ FuncSum(uint64_t opArg=0) {}; };
struct FuncSum { using EltType = T; __device__ __forceinline__ FuncSum(uint64_t opArg=0) {}; };
template<typename T>
struct FuncProd { using EltType = T; __device__ FuncProd(uint64_t opArg=0) {}; };
struct FuncProd { using EltType = T; __device__ __forceinline__ FuncProd(uint64_t opArg=0) {}; };
template<typename T>
struct FuncMinMax {
using EltType = T;
BytePack<sizeof(T)> xormask; // only used by integers
bool isMinNotMax; // only used by floats
__device__ FuncMinMax(uint64_t opArg=0) {
__device__ __forceinline__ FuncMinMax(uint64_t opArg=0) {
xormask.native = opArg;
isMinNotMax = (opArg&1)==0;
}
@ -64,13 +64,13 @@ template<typename T> struct FuncSumPostDiv;
template<typename Fn>
struct RedOpArg { // default case: no argument
static constexpr bool ArgUsed = false;
__device__ static uint64_t loadArg(void *ptr) { return 0; }
__device__ __forceinline__ static uint64_t loadArg(void *ptr) { return 0; }
};
template<typename T>
struct RedOpArg<FuncMinMax<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
union { uint64_t u64; T val; };
u64 = 0;
val = *(T*)ptr;
@ -84,6 +84,11 @@ struct RedOpArg<FuncMinMax<T>> {
// of elements. These classes are intended to be specialized for specific
// combinations of reduction function and pack size.
template<typename A, typename B, int EltPerPackA>
struct Apply_Cast/*{
static BytePack<EltPerPackA*sizeof(B)/sizeof(A)> cast(BytePack<EltPerPackA*sizeof(A)> a);
}*/;
template<typename Fn, int EltPerPack>
struct Apply_Reduce /*{
static BytePack<EltPerPack*sizeof(T)> reduce(
@ -111,16 +116,60 @@ struct Apply_LoadMultimem/*{
static BytePack<BytePerPack> load(Fn fn, uintptr_t addr);
}*/;
// Helpers for dealing with BytePack<0>'s
template<typename A, typename B, int EltPerPack>
struct Apply_Cast_MaybeEmpty: Apply_Cast<A, B, EltPerPack> {};
template<typename A, typename B>
struct Apply_Cast_MaybeEmpty<A, B, /*EltPerPack=*/0> {
__device__ constexpr static BytePack<0> cast(BytePack<0> a) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_Reduce_MaybeEmpty: Apply_Reduce<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_Reduce_MaybeEmpty<Fn, 0> {
__device__ constexpr static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_PreOp_MaybeEmpty: Apply_PreOp<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_PreOp_MaybeEmpty<Fn, 0> {
static constexpr bool IsIdentity = true;
__device__ constexpr static BytePack<0> preOp(Fn fn, BytePack<0> a) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_PostOp_MaybeEmpty: Apply_PostOp<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_PostOp_MaybeEmpty<Fn, 0> {
static constexpr bool IsIdentity = true;
__device__ constexpr static BytePack<0> postOp(Fn fn, BytePack<0> a) { return {}; }
};
template<typename Fn, int BytePerPack>
struct Apply_LoadMultimem_MaybeEmpty: Apply_LoadMultimem<Fn, BytePerPack> {};
template<typename Fn>
struct Apply_LoadMultimem_MaybeEmpty<Fn, 0> {
__device__ constexpr static BytePack<0> load(Fn fn, uintptr_t addr) { return {}; }
};
////////////////////////////////////////////////////////////////////////////////
// Public API for calling the trait classes. These take the data elements as a
// pack of any type, which could be a BytePack<?> or any integral type (uint64_t,
// uint32_t, etc.), and will return a new pack where each element has been
// transformed appropriately.
template<typename A, typename B, typename PackA>
__device__ __forceinline__ BytePack<BytePackOf<PackA>::Size*sizeof(B)/sizeof(A)> applyCast(PackA a) {
return Apply_Cast_MaybeEmpty<A, B, BytePackOf<PackA>::Size/sizeof(A)>::cast(toPack(a));
}
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyReduce(Fn fn, Pack a, Pack b) {
return fromPack<Pack>(
Apply_Reduce<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_Reduce_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::reduce(fn, toPack(a), toPack(b))
);
}
@ -128,7 +177,7 @@ __device__ __forceinline__ Pack applyReduce(Fn fn, Pack a, Pack b) {
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyPreOp(Fn fn, Pack a) {
return fromPack<Pack>(
Apply_PreOp<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_PreOp_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::preOp(fn, toPack(a))
);
}
@ -136,23 +185,107 @@ __device__ __forceinline__ Pack applyPreOp(Fn fn, Pack a) {
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyPostOp(Fn fn, Pack a) {
return fromPack<Pack>(
Apply_PostOp<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_PostOp_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::postOp(fn, toPack(a))
);
}
template<typename Fn, int BytePerPack>
__device__ __forceinline__ BytePack<BytePerPack> applyLoadMultimem(Fn fn, uintptr_t addr) {
return Apply_LoadMultimem<Fn, BytePerPack>::load(fn, addr);
return Apply_LoadMultimem_MaybeEmpty<Fn, BytePerPack>::load(fn, addr);
}
////////////////////////////////////////////////////////////////////////////////
// Apply_Cast
template<typename A, typename B, int EltPerPack>
struct Apply_Cast {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(B)> cast(BytePack<EltPerPack*sizeof(A)> a) {
BytePack<EltPerPack*sizeof(B)> b;
b.half[0] = Apply_Cast<A, B, EltPerPack/2>::cast(a.half[0]);
b.half[1] = Apply_Cast<A, B, EltPerPack/2>::cast(a.half[1]);
return b;
}
};
template<typename A, typename B>
struct Apply_Cast<A, B, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(B)> cast(BytePack<sizeof(A)> a) {
return toPack(B(fromPack<A>(a)));
}
};
template<>
struct Apply_Cast<__half, float, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(float)> cast(BytePack<sizeof(__half)> a) {
return toPack(__half2float(fromPack<__half>(a)));
}
};
template<>
struct Apply_Cast<float, __half, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(__half)> cast(BytePack<sizeof(float)> a) {
return toPack(__float2half_rn(fromPack<float>(a)));
}
};
template<>
struct Apply_Cast<__half, float, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<4*2> cast(BytePack<2*2> a) {
return toPack(__half22float2(fromPack<__half2>(a)));
}
};
template<>
struct Apply_Cast<float, __half, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<2*2> cast(BytePack<4*2> a) {
return toPack(__float22half2_rn(fromPack<float2>(a)));
}
};
#if defined(__CUDA_BF16_TYPES_EXIST__) && (CUDART_RUNTIME >= 12000 || __CUDA_ARCH__ >= 800)
template<>
struct Apply_Cast<__nv_bfloat16, float, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<4*2> cast(BytePack<2*2> a) {
return toPack(__bfloat1622float2(fromPack<__nv_bfloat162>(a)));
}
};
template<>
struct Apply_Cast<float ,__nv_bfloat16, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<2*2> cast(BytePack<4*2> a) {
return toPack(__float22bfloat162_rn(fromPack<float2>(a)));
}
};
#endif
#define EASY_CAST(A, B, EltPerPack, VecA, VecB) \
template<> \
struct Apply_Cast<A, B, EltPerPack> { \
__device__ __forceinline__ static BytePack<sizeof(B)*EltPerPack> cast(BytePack<sizeof(A)*EltPerPack> a) { \
return toPack(VecB(fromPack<VecA>(a))); \
} \
}; \
template<> \
struct Apply_Cast<B, A, EltPerPack> { \
__device__ __forceinline__ static BytePack<sizeof(A)*EltPerPack> cast(BytePack<sizeof(B)*EltPerPack> b) { \
return toPack(VecA(fromPack<VecB>(b))); \
} \
};
#if defined(__CUDA_FP8_TYPES_EXIST__)
EASY_CAST(__nv_fp8_e5m2, float, 2, __nv_fp8x2_e5m2, float2)
EASY_CAST(__nv_fp8_e5m2, float, 4, __nv_fp8x4_e5m2, float4)
EASY_CAST(__nv_fp8_e4m3, float, 2, __nv_fp8x2_e4m3, float2)
EASY_CAST(__nv_fp8_e4m3, float, 4, __nv_fp8x4_e4m3, float4)
#endif
#undef EASY_CAST
////////////////////////////////////////////////////////////////////////////////
// Apply_Reduce
// Nonsensical base case
template<typename Fn>
struct Apply_Reduce<Fn, /*EltPerPack=*/0> {
__device__ static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) {
__device__ __forceinline__ static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) {
return {};
}
};
@ -164,7 +297,7 @@ struct Apply_Reduce<Fn, /*EltPerPack=*/0> {
template<typename Fn, int EltPerPack>
struct Apply_Reduce {
template<int Size>
__device__ static BytePack<Size> reduce(Fn fn, BytePack<Size> a, BytePack<Size> b) {
__device__ __forceinline__ static BytePack<Size> reduce(Fn fn, BytePack<Size> a, BytePack<Size> b) {
a.half[0] = Apply_Reduce<Fn, EltPerPack/2>::reduce(fn, a.half[0], b.half[0]);
a.half[1] = Apply_Reduce<Fn, EltPerPack/2>::reduce(fn, a.half[1], b.half[1]);
return a;
@ -174,25 +307,25 @@ struct Apply_Reduce {
// Base case definitions (EltPerPack == 1)
template<typename T>
struct Apply_Reduce<FuncCopy<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncCopy<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncCopy<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return a;
}
};
template<typename T>
struct Apply_Reduce<FuncSum<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncSum<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncSum<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return toPack<T>(fromPack<T>(a) + fromPack<T>(b));
}
};
template<typename T>
struct Apply_Reduce<FuncProd<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncProd<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncProd<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return toPack<T>(fromPack<T>(a) * fromPack<T>(b));
}
};
template<typename T>
struct Apply_Reduce<FuncMinMax<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncMinMax<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncMinMax<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return (a.native ^ fn.xormask.native) < (b.native ^ fn.xormask.native) ? a : b;
}
};
@ -200,7 +333,7 @@ struct Apply_Reduce<FuncMinMax<T>, /*EltPerPack=*/1> {
// Optimizations for specific types and element count combinations:
template<>
struct Apply_Reduce<FuncSum<uint8_t>, /*EltPerPack=*/4> {
__device__ static BytePack<4> reduce(FuncSum<uint8_t> fn, BytePack<4> a, BytePack<4> b) {
__device__ __forceinline__ static BytePack<4> reduce(FuncSum<uint8_t> fn, BytePack<4> a, BytePack<4> b) {
constexpr uint32_t even = 0x00ff00ffu;
uint32_t x = (a.native & even) + (b.native & even);
uint32_t y = (a.native & ~even) + (b.native & ~even);
@ -236,7 +369,7 @@ struct Apply_Reduce<FuncMinMax<uint8_t>, /*EltPerPack=*/4> {
template<>
struct Apply_Reduce<FuncProd<uint8_t>, /*EltPerPack=*/4> {
__device__ static BytePack<4> reduce(FuncProd<uint8_t> fn, BytePack<4> apack, BytePack<4> bpack) {
__device__ __forceinline__ static BytePack<4> reduce(FuncProd<uint8_t> fn, BytePack<4> apack, BytePack<4> bpack) {
uint32_t a = apack.native;
uint32_t b = bpack.native;
uint32_t ab0 = (a*b) & 0xffu;
@ -332,7 +465,7 @@ template<typename Fn, int EltPerPack>
struct Apply_PreOp {
static constexpr bool IsIdentity = Apply_PreOp<Fn, EltPerPack/2>::IsIdentity;
template<int Size>
__device__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
#if __cpp_if_constexpr
if constexpr(!IsIdentity) {
#else
@ -352,7 +485,7 @@ template<typename Fn>
struct Apply_PreOp<Fn, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = true;
template<int Size>
__device__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
return a;
}
};
@ -360,7 +493,7 @@ struct Apply_PreOp<Fn, /*EltPerPack=*/1> {
template<typename Fn>
struct Apply_PreOp<Fn, /*EltPerPack=*/0> {
static constexpr bool IsIdentity = true;
__device__ static BytePack<0> preOp(Fn fn, BytePack<0> a) {
__device__ __forceinline__ static BytePack<0> preOp(Fn fn, BytePack<0> a) {
return {};
}
};
@ -373,7 +506,7 @@ template<typename Fn, int EltPerPack>
struct Apply_PostOp {
static constexpr bool IsIdentity = Apply_PostOp<Fn, EltPerPack/2>::IsIdentity;
template<int Size>
__device__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
#if __cpp_if_constexpr
if constexpr(!IsIdentity) {
#else
@ -393,7 +526,7 @@ template<typename Fn>
struct Apply_PostOp<Fn, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = true;
template<int Size>
__device__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
return a;
}
};
@ -401,7 +534,7 @@ struct Apply_PostOp<Fn, /*EltPerPack=*/1> {
template<typename Fn>
struct Apply_PostOp<Fn, /*EltPerPack=*/0> {
static constexpr bool IsIdentity = true;
__device__ static BytePack<0> postOp(Fn fn, BytePack<0> a) {
__device__ __forceinline__ static BytePack<0> postOp(Fn fn, BytePack<0> a) {
return {};
}
};
@ -413,7 +546,7 @@ struct Apply_PostOp<Fn, /*EltPerPack=*/0> {
template<typename T>
struct RedOpArg<FuncPreMulSum<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
union { uint64_t u64; T val; };
u64 = 0;
val = *(T*)ptr;
@ -426,7 +559,7 @@ template<typename T>
struct FuncPreMulSum {
using EltType = T;
T scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; T val; };
u64 = opArg;
scalar = val;
@ -441,7 +574,7 @@ struct FuncPreMulSum<half> {
using EltType = half;
#if __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
__half2 scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __half val; };
u64 = opArg;
scalar.x = val;
@ -449,7 +582,7 @@ struct FuncPreMulSum<half> {
}
#else
float scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __half val; };
u64 = opArg;
scalar = (float)val;
@ -466,7 +599,7 @@ struct FuncPreMulSum<half> {
using EltType = __nv_bfloat16;
#if __CUDA_ARCH__ >= 800
__nv_bfloat162 scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __nv_bfloat16 val; };
u64 = opArg;
scalar.x = val;
@ -474,7 +607,7 @@ struct FuncPreMulSum<half> {
}
#else
float scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __nv_bfloat16 val; };
u64 = opArg;
scalar = __bfloat162float(val);
@ -489,7 +622,7 @@ struct FuncPreMulSum<half> {
struct FuncPreMulSum<__nv_fp8_e4m3> {
using EltType = __nv_fp8_e4m3;
__half2 scalar2;
__device__ FuncPreMulSum(uint64_t opArg) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg) {
union { uint64_t u64; __nv_fp8_storage_t val; };
u64 = opArg;
scalar2.x = __half(__nv_cvt_fp8_to_halfraw(val, __NV_E4M3));
@ -501,7 +634,7 @@ struct FuncPreMulSum<half> {
struct FuncPreMulSum<__nv_fp8_e5m2> {
using EltType = __nv_fp8_e5m2;
__half2 scalar2;
__device__ FuncPreMulSum(uint64_t opArg) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg) {
union { uint64_t u64; __nv_fp8_storage_t val; };
u64 = opArg;
scalar2.x = __half(__nv_cvt_fp8_to_halfraw(val, __NV_E5M2));
@ -513,7 +646,7 @@ struct FuncPreMulSum<half> {
template<typename T, int EltPerPack>
struct Apply_Reduce<FuncPreMulSum<T>, EltPerPack> {
__device__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncPreMulSum<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncPreMulSum<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
// FuncPreMulSum reduce dispatches to FuncSum.
return Apply_Reduce<FuncSum<T>, EltPerPack>::reduce(FuncSum<T>(), a, b);
}
@ -523,7 +656,7 @@ struct Apply_Reduce<FuncPreMulSum<T>, EltPerPack> {
template<typename T>
struct Apply_PreOp<FuncPreMulSum<T>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(T)> preOp(FuncPreMulSum<T> fn, BytePack<sizeof(T)> a) {
__device__ __forceinline__ static BytePack<sizeof(T)> preOp(FuncPreMulSum<T> fn, BytePack<sizeof(T)> a) {
return toPack<T>(fromPack<T>(a) * fn.scalar);
}
};
@ -534,7 +667,7 @@ struct Apply_PreOp<FuncPreMulSum<T>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(half)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half)> a) {
__device__ __forceinline__ static BytePack<sizeof(half)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half)> a) {
#if __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
return toPack<half>(__hmul(fromPack<half>(a), fn.scalar.x));
#else
@ -546,7 +679,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(half2)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half2)> a) {
__device__ __forceinline__ static BytePack<sizeof(half2)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half2)> a) {
return toPack<half2>(__hmul2(fromPack<half2>(a), fn.scalar));
}
};
@ -559,7 +692,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_bfloat16>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_bfloat16)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_bfloat16)> preOp(
FuncPreMulSum<__nv_bfloat16> fn, BytePack<sizeof(__nv_bfloat16)> a
) {
#if __CUDA_ARCH__ >= 800
@ -573,7 +706,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_bfloat16>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_bfloat162)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_bfloat162)> preOp(
FuncPreMulSum<__nv_bfloat16> fn, BytePack<sizeof(__nv_bfloat162)> a
) {
return toPack<__nv_bfloat162>(__hmul2(fromPack<__nv_bfloat162>(a), fn.scalar));
@ -590,7 +723,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e4m3>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8_e4m3)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8_e4m3)> preOp(
FuncPreMulSum<__nv_fp8_e4m3> fn, BytePack<sizeof(__nv_fp8_e4m3)> a
) {
return toPack<__nv_fp8_e4m3>(__nv_fp8_e4m3(__hmul(__half(fromPack<__nv_fp8_e4m3>(a)), fn.scalar2.x)));
@ -599,7 +732,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e4m3>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8x2_e4m3)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8x2_e4m3)> preOp(
FuncPreMulSum<__nv_fp8_e4m3> fn, BytePack<sizeof(__nv_fp8x2_e4m3)> a
) {
return toPack<__nv_fp8x2_e4m3>(__nv_fp8x2_e4m3(__hmul2(__half2(fromPack<__nv_fp8x2_e4m3>(a)), fn.scalar2)));
@ -609,7 +742,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e5m2>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8_e5m2)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8_e5m2)> preOp(
FuncPreMulSum<__nv_fp8_e5m2> fn, BytePack<sizeof(__nv_fp8_e5m2)> a
) {
return toPack<__nv_fp8_e5m2>(__nv_fp8_e5m2(__hmul(__half(fromPack<__nv_fp8_e5m2>(a)), fn.scalar2.x)));
@ -618,7 +751,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e5m2>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8x2_e5m2)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8x2_e5m2)> preOp(
FuncPreMulSum<__nv_fp8_e5m2> fn, BytePack<sizeof(__nv_fp8x2_e5m2)> a
) {
return toPack<__nv_fp8x2_e5m2>(__nv_fp8x2_e5m2(__hmul2(__half2(fromPack<__nv_fp8x2_e5m2>(a)), fn.scalar2)));
@ -633,7 +766,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<typename T>
struct RedOpArg<FuncSumPostDiv<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
return *(uint64_t*)ptr;
}
};
@ -646,12 +779,12 @@ struct FuncSumPostDiv {
uint32_t divisor:31, isSigned:1;
UintType recip;
__device__ FuncSumPostDiv(uint64_t opArg=0) {
__device__ __forceinline__ FuncSumPostDiv(uint64_t opArg=0) {
isSigned = opArg & 1;
divisor = opArg >> 1;
recip = UintType(-1)/divisor;
}
__device__ T divide(T x) {
__device__ __forceinline__ T divide(T x) {
// x is negative iff we are in signed mode and the top bit is set
bool xneg = isSigned && (x & ~(T(-1)>>1));
// Compute abs(x):
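divide() above avoids a hardware integer divide by using the reciprocal precomputed in the constructor (recip = UintType(-1)/divisor). A minimal sketch of that approximate-then-correct idea for 32-bit unsigned operands, with the sign handling from divide() left out (divApprox32 is an illustrative name, not from the source):

// Sketch only: quotient via a precomputed reciprocal, then a small fix-up.
// recip = 0xFFFFFFFFu / divisor, so __umulhi(x, recip) never overshoots x/divisor.
__device__ static uint32_t divApprox32(uint32_t x, uint32_t divisor, uint32_t recip) {
  uint32_t q = __umulhi(x, recip);                 // lower bound of x / divisor
  uint32_t r = x - q*divisor;                      // remainder left by the estimate
  while (r >= divisor) { q += 1; r -= divisor; }   // converges in a couple of steps
  return q;
}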
@ -673,7 +806,7 @@ struct FuncSumPostDiv {
template<typename T, int EltPerPack>
struct Apply_Reduce<FuncSumPostDiv<T>, EltPerPack>:
Apply_Reduce<FuncSum<T>, EltPerPack> {
__device__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncSumPostDiv<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncSumPostDiv<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
// FuncSumPostDiv reduce dispatches to FuncSum.
return Apply_Reduce<FuncSum<T>, EltPerPack>::reduce(FuncSum<T>(), a, b);
}
@ -682,7 +815,7 @@ struct Apply_Reduce<FuncSumPostDiv<T>, EltPerPack>:
template<typename T>
struct Apply_PostOp<FuncSumPostDiv<T>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(T)> postOp(FuncSumPostDiv<T> fn, BytePack<sizeof(T)> a) {
__device__ __forceinline__ static BytePack<sizeof(T)> postOp(FuncSumPostDiv<T> fn, BytePack<sizeof(T)> a) {
return toPack<T>(fn.divide(fromPack<T>(a)));
}
};
@ -690,120 +823,145 @@ struct Apply_PostOp<FuncSumPostDiv<T>, /*EltPerPack=*/1> {
////////////////////////////////////////////////////////////////////////////////
// Apply_LoadMultimem
#define SIZEOF_BytePack_field_u16 2
#define PTX_REG_BytePack_field_u16 "h"
#define RegCode_for_size_1 "r"
#define RegCode_for_size_2 "h"
#define RegCode_for_size_4 "r"
#define RegCode_for_size_8 "l"
#define SIZEOF_BytePack_field_u32 4
#define PTX_REG_BytePack_field_u32 "r"
#define RegSize_for_size_1 4
#define RegSize_for_size_2 2
#define RegSize_for_size_4 4
#define RegSize_for_size_8 8
#define SIZEOF_BytePack_field_u64 8
#define PTX_REG_BytePack_field_u64 "l"
#define PtxAcc_for_u32
#define PtxAcc_for_s32
#define PtxAcc_for_s64
#define PtxAcc_for_u64
#define PtxAcc_for_f32
#define PtxAcc_for_f64
#if CUDART_VERSION >= 12020
#define PtxAcc_for_f16 ".acc::f32"
#define PtxAcc_for_bf16 ".acc::f32"
#define PtxAcc_for_f16x2 ".acc::f32"
#define PtxAcc_for_bf16x2 ".acc::f32"
#else
#define PtxAcc_for_f16
#define PtxAcc_for_bf16
#define PtxAcc_for_f16x2
#define PtxAcc_for_bf16x2
#endif
#define PtxAcc_for_e4m3 ".acc::f16"
#define PtxAcc_for_e5m2 ".acc::f16"
#define PtxAcc_for_e4m3x4 ".acc::f16"
#define PtxAcc_for_e5m2x4 ".acc::f16"
#define DEFINE_Apply_LoadMultimem_sum(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum(T, ptx_ty, PackSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, SIZEOF_BytePack_field_##pack_field> { \
static constexpr int PackSize = SIZEOF_BytePack_field_##pack_field; \
__device__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
struct Apply_LoadMultimem<FuncSum<T>, PackSize> { \
__device__ __forceinline__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<RegSize_for_size_##PackSize> reg; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty "." #ptx_ty " %0, [%1];" \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
BytePack<PackSize> ans; \
ans.native = reg.native; \
return ans; \
} \
};
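For reference, DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, 4) — instantiated further down — pastes these tokens together so that load() boils down to roughly the following (a sketch of the preprocessor output, not the literal generated code; PtxAcc_for_u32 is empty and RegCode_for_size_4 is "r"):

// Approximate expansion of Apply_LoadMultimem<FuncSum<uint32_t>, 4>::load (sketch only):
BytePack<4> reg;
asm volatile("multimem.ld_reduce.relaxed.sys.global.add.u32 %0, [%1];"
             : "=r"(reg.native)    // RegCode_for_size_4 == "r" (32-bit register)
             : "l"(addr) : "memory");
BytePack<4> ans;
ans.native = reg.native;
return ans;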
#define DEFINE_Apply_LoadMultimem_minmax(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax(T, ptx_ty, PackSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, SIZEOF_BytePack_field_##pack_field> { \
static constexpr int PackSize = SIZEOF_BytePack_field_##pack_field; \
__device__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
struct Apply_LoadMultimem<FuncMinMax<T>, PackSize> { \
__device__ __forceinline__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<RegSize_for_size_##PackSize> reg; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
} \
BytePack<PackSize> ans; \
ans.native = reg.native; \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, 4*(SIZEOF_BytePack_field_##pack_field)> { \
static constexpr int PackSize = 4*(SIZEOF_BytePack_field_##pack_field); \
__device__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
struct Apply_LoadMultimem<FuncSum<T>, 4*(VecEltSize)> { \
static constexpr int PackSize = 4*(VecEltSize); \
__device__ __forceinline__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
union { BytePack<PackSize> ans; BytePack<VecEltSize> elts[4]; }; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty ".v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, 4*(SIZEOF_BytePack_field_##pack_field)> { \
static constexpr int PackSize = 4*(SIZEOF_BytePack_field_##pack_field); \
__device__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
struct Apply_LoadMultimem<FuncMinMax<T>, 4*(VecEltSize)> { \
static constexpr int PackSize = 4*(VecEltSize); \
__device__ __forceinline__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
union { BytePack<PackSize> ans; BytePack<VecEltSize> elts[4]; }; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
} \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(T, ptx_ty, pack_field) \
DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(T, ptx_ty, VecEltSize) \
DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, sizeof(T)> { \
__device__ static BytePack<sizeof(T)> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<2*sizeof(T)> tmp; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
return tmp.half[(addr/sizeof(T))%2]; \
__device__ __forceinline__ static BytePack<sizeof(T)> load(FuncSum<T> fn, uintptr_t addr) { \
union { BytePack<VecEltSize> tmp; BytePack<sizeof(T)> elts[(VecEltSize)/sizeof(T)]; }; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty "." #ptx_ty " %0, [%1];" \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
return elts[(addr/sizeof(T))%((VecEltSize)/sizeof(T))]; \
} \
};
#define DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(T, ptx_ty, pack_field) \
DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(T, ptx_ty, VecEltSize) \
DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, sizeof(T)> { \
__device__ static BytePack<sizeof(T)> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<2*sizeof(T)> tmp; \
__device__ __forceinline__ static BytePack<sizeof(T)> load(FuncMinMax<T> fn, uintptr_t addr) { \
union { BytePack<VecEltSize> tmp; BytePack<sizeof(T)> elts[(VecEltSize)/sizeof(T)]; }; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
} \
return tmp.half[(addr/sizeof(T))%2]; \
return elts[(addr/sizeof(T))%((VecEltSize)/sizeof(T))]; \
} \
};
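Besides the v4 form, the *_xparts macros above handle loads narrower than one vector element by rounding the address down to a VecEltSize boundary, loading the whole element, and picking the wanted sub-pack out of a union. A small sketch of that selection for T = half with VecEltSize = 4 (the helper name and example address are illustrative only):

// Sketch of the sub-pack selection used by the xparts macros (illustrative).
__device__ static BytePack<2> pickHalfFromPair(BytePack<4> pair, uintptr_t addr) {
  union { BytePack<4> whole; BytePack<2> elts[2]; } u;
  u.whole = pair;                 // pair was loaded from addr & -uintptr_t(4)
  return u.elts[(addr / 2) % 2];  // e.g. addr 0x1002 selects elts[1], the upper half
}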
template<typename Fn, int BytePerPack>
struct Apply_LoadMultimem {
__device__ static BytePack<BytePerPack> load(Fn fn, uintptr_t addr) {
__device__ __forceinline__ static BytePack<BytePerPack> load(Fn fn, uintptr_t addr) {
__trap();
return {};
}
@ -826,29 +984,36 @@ struct Apply_LoadMultimem {
/*multimem.ld_reduce not supported:*/ 0;
};
DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, u32)
DEFINE_Apply_LoadMultimem_minmax(uint32_t, u32, u32)
DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, 4)
DEFINE_Apply_LoadMultimem_minmax(uint32_t, u32, 4)
DEFINE_Apply_LoadMultimem_sum(int32_t, s32, u32)
DEFINE_Apply_LoadMultimem_minmax(int32_t, s32, u32)
DEFINE_Apply_LoadMultimem_sum(int32_t, s32, 4)
DEFINE_Apply_LoadMultimem_minmax(int32_t, s32, 4)
DEFINE_Apply_LoadMultimem_sum(uint64_t, u64, u64)
DEFINE_Apply_LoadMultimem_minmax(uint64_t, u64, u64)
DEFINE_Apply_LoadMultimem_sum(uint64_t, u64, 8)
DEFINE_Apply_LoadMultimem_minmax(uint64_t, u64, 8)
DEFINE_Apply_LoadMultimem_sum(int64_t, u64, u64)
DEFINE_Apply_LoadMultimem_minmax(int64_t, s64, u64)
DEFINE_Apply_LoadMultimem_sum(int64_t, u64, 8)
DEFINE_Apply_LoadMultimem_minmax(int64_t, s64, 8)
DEFINE_Apply_LoadMultimem_sum(float, f32, u32)
DEFINE_Apply_LoadMultimem_sum_v4(float, f32, u32)
DEFINE_Apply_LoadMultimem_sum(float, f32, 4)
DEFINE_Apply_LoadMultimem_sum_v4(float, f32, 4)
DEFINE_Apply_LoadMultimem_sum(double, f64, u64)
DEFINE_Apply_LoadMultimem_sum(double, f64, 8)
DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(half, f16x2, u32)
DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(half, f16x2, u32)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(half, f16x2, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(half, f16x2, 4)
#if defined(__CUDA_BF16_TYPES_EXIST__)
DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(__nv_bfloat16, bf16x2, u32)
DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(__nv_bfloat16, bf16x2, u32)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_bfloat16, bf16x2, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_bfloat16, bf16x2, 4)
#endif
#if NCCL_CUDA_ARCH_SPECIFIC == 1000 || NCCL_CUDA_ARCH_SPECIFIC == 1010 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1000 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1010 || NCCL_CUDA_ARCH_SPECIFIC == 1200 || NCCL_CUDA_ARCH_SPECIFIC == 1210
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e5m2, e5m2x4, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_fp8_e5m2, e5m2x4, 4)
#endif
#else
template<typename Fn>
@ -860,11 +1025,29 @@ struct Apply_LoadMultimem {
#undef DEFINE_Apply_LoadMultimem
#undef DEFINE_Apply_LoadMultimem_v4
#undef DEFINE_Apply_LoadMultimem_v4x2_and_subhalf
#undef SIZEOF_BytePack_field_u64
#undef PTX_REG_BytePack_field_u64
#undef SIZEOF_BytePack_field_u32
#undef PTX_REG_BytePack_field_u32
#undef SIZEOF_BytePack_field_u16
#undef PTX_REG_BytePack_field_u16
#undef RegCode_for_size_2
#undef RegCode_for_size_4
#undef RegCode_for_size_8
#undef RegSize_for_size_1
#undef RegSize_for_size_2
#undef RegSize_for_size_4
#undef RegSize_for_size_8
#undef PtxAcc_for_u32
#undef PtxAcc_for_s32
#undef PtxAcc_for_s64
#undef PtxAcc_for_u64
#undef PtxAcc_for_f32
#undef PtxAcc_for_f64
#undef PtxAcc_for_f16
#undef PtxAcc_for_bf16
#undef PtxAcc_for_f16x2
#undef PtxAcc_for_bf16x2
#undef PtxAcc_for_e4m3
#undef PtxAcc_for_e5m2
#undef PtxAcc_for_e4m3x4
#undef PtxAcc_for_e5m2x4
#endif // REDUCE_KERNEL_H_


@ -142,82 +142,206 @@ struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_PAT, NCCL_PROTO_SI
template<typename T, typename RedOp>
struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_NVLS, NCCL_PROTO_SIMPLE> {
template<bool ReduceSendNotRecv>
struct Scatterer {
struct ncclDevWorkColl* work;
int chunkCount;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
) {
static_assert(SlicePerChunk == 1, "require: SlicePerChunk==1");
static_assert(MaxDsts <= 1 || MaxSrcs <= 1, "require: MaxDsts<=1 || MaxSrcs<=1");
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
int nNodes = ncclShmem.comm.nNodes;
int nRails = nvls->nHeads;
int part = ncclShmem.channelId - work->channelLo;
void* inbuf = (void*)work->sendbuff;
ssize_t countPerRank = work->collnet.count;
ssize_t railAllBeg = min(railGridOffset + part * chunkCount, nNodes * countPerRank);
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
int railAllSize = railAllEnd - railAllBeg;
int rail = nvls->headRank;
int dst = 0;
if (ReduceSendNotRecv) {
if (work->regUsed) return;
rail = 0;
nSrcs = 1;
} else {
rail = nvls->headRank;
}
if (tid < nDsts) dstSizes[tid] = railAllSize;
do {
int node = railAllBeg / countPerRank;
int railAllOffset = 0;
while (railAllOffset < railAllSize) {
ssize_t railOneBeg = node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t railOneOffset = (railAllBeg + railAllOffset) - railOneBeg;
int delta = min(railAllEnd, railOneEnd) - (railAllBeg + railAllOffset);
int rank = ncclShmem.comm.collNetDenseToUserRank[node * nRails + rail];
ssize_t userOneBeg = rank * countPerRank + railOneOffset;
if (nDsts != 0) {
reduceCopy<ncclCollUnroll(), RedOp, T,
/*MultimemSrcs=*/MultimemSrcs, 1, 1 + MaxSrcs,
/*MultimemDsts,MinDsts,MaxDsts=*/MultimemDsts, 1, 1,
/*PreOpSrcs=*/1>
(tid, tn, work->redOpArg, &work->redOpArg, false,
/*nSrcs=*/nSrcs, [=]__device__(int s) {
return work->regUsed ? (T*)srcPtrs[s] + userOneBeg :
!ReduceSendNotRecv ? (T*)srcPtrs[s] + railAllOffset:
(T*)inbuf + userOneBeg;
},
/*nDsts=*/1, [=]__device__(int d/*==0*/) {
return (T*)dstPtrs[dst] + railAllOffset;
}, delta);
}
railAllOffset += delta;
node += 1;
}
dst += 1;
rail += 1;
} while (ReduceSendNotRecv && dst < nRails);
}
};
__device__ __forceinline__ void run(int tid, int/*nthreads*/, struct ncclDevWorkColl* work) {
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
size_t count;
size_t gridOffset;
size_t channelCount;
size_t chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
const int rank = ncclShmem.comm.rank;
const int nranks = ncclShmem.comm.nRanks;
size_t offset;
int nelem;
/* If we are direct NVLS, we only need to allocate 1 warp to scatter for sync;
* otherwise, based on the number of ranks, we allocate 7 or 5 warps to reduce (to saturate bandwidth)
* and the rest to scatter. */
const int nThreadsReduce = work->regUsed ? (NCCL_MAX_NTHREADS - WARP_SIZE) : (nranks <= 6 ? 7 * WARP_SIZE : 5 * WARP_SIZE);
const int nThreadsScatter = work->regUsed ? WARP_SIZE : (NCCL_MAX_NTHREADS - nThreadsReduce);
const int tidEndScatter = nThreadsScatter;
const int nThreadsNetRecv = work->oneNode ? 0 : (work->netRegUsed ? WARP_SIZE : 6 * WARP_SIZE);
const int nThreadsScatter = work->regUsed ? roundUp(nvls->nHeads << 2, WARP_SIZE) : 8 * WARP_SIZE;
const int nThreadsReduce = NCCL_MAX_NTHREADS - nThreadsNetRecv - nThreadsScatter;
const int tidEndNetRecv = nThreadsNetRecv;
const int tidEndScatter = tidEndNetRecv + nThreadsScatter;
const int tidEndReduce = tidEndScatter + nThreadsReduce;
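To make the thread partitioning above concrete, a worked example assuming NCCL_MAX_NTHREADS is 640 and WARP_SIZE is 32, for the multi-node case with netRegUsed == 0 and regUsed == 0 (the numbers are derived from the expressions above, not taken from the source):

// nThreadsNetRecv = 6*32        = 192  -> tid [  0, 192) receives from the network
// nThreadsScatter = 8*32        = 256  -> tid [192, 448) scatters to the NVLS heads
// nThreadsReduce  = 640-192-256 = 192  -> tid [448, 640) reduces through NVLS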
if (!work->regUsed) {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, NULL, nvls->up, work->sendbuff, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.scatter(offset, nvls->nHeads * count, nelem, count, -1, 0);
if (work->oneNode) {
const int rank = ncclShmem.comm.rank;
size_t offset;
size_t count, gridOffset, channelCount, chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
if (!work->regUsed) {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, NULL, nvls->up, work->sendbuff, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.scatter(offset, nvls->nHeads * count, nelem, count, -1, 0);
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, NULL, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.recv(offset, nelem);
}
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, NULL, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.recv(offset, nelem);
} else {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.scatter(0, 0, 0, 0, -1, 0);
}
/* gather used as sync */
prims.gather(0, 0, 0, 0, -1, 0);
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->down, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0, work);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
size_t outOffset = gridOffset + elemOffset;
size_t inpOffset = outOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
// Coverity complains about a possible overrun inside the method invoked below, but that's actually
// a false positive.
// coverity[overrun-call:FALSE]
prims.directRecvCopy(inpOffset, outOffset, nelem);
}
/* send for sync */
prims.send(0, 0);
}
}
} else {
if (tid < tidEndScatter) {
// Scatter
// multi-node
int nNodes = ncclShmem.comm.nNodes;
int part = ncclShmem.channelId - work->channelLo;
ssize_t countPerRank = work->collnet.count;
const int nChannels = work->channelHi - work->channelLo + 1;
ssize_t chunkCount = work->collnet.chunkCount;
if (tid < tidEndNetRecv) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.scatter(0, 0, 0, 0, -1, 0);
if (work->netRegUsed) {
if (tid == 0) {
int steps = (int)divUp(nNodes * countPerRank, nChannels * chunkCount);
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>::recvPeerNotify(nvls->out, 0, steps);
}
__syncwarp();
} else {
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsNetRecv, &nvls->out, nullptr, nullptr, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 0, 0);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
ssize_t railAllBeg = railGridOffset + part * chunkCount;
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
ssize_t railOneBeg = ncclShmem.comm.node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t beg = max(railAllBeg, railOneBeg);
ssize_t end = min(railAllEnd, railOneEnd);
prims.recv(beg - railOneBeg, max(ssize_t(0), end - beg), /*postOp=*/true);
}
}
/* gather used as sync */
prims.gather(0, 0, 0, 0, -1, 0);
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->down, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0, work);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
size_t outOffset = gridOffset + elemOffset;
size_t inpOffset = outOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
// Coverity complains about a possible overrun inside the method invoked below, but that's actually
// a false positive.
// coverity[overrun-call:FALSE]
prims.directRecvCopy(inpOffset, outOffset, nelem);
} else {
if (tid < tidEndScatter) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndNetRecv, nThreadsScatter, nullptr, nvls->up, work->sendbuff, nullptr,
work->redOpArg, 1 * Proto::MaxGroupWidth, 1, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*ReduceSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkCount = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/0, /*Send=*/1>(scat);
}
} else if (tid < tidEndReduce) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->out, nullptr, nullptr,
work->redOpArg, 2 * Proto::MaxGroupWidth, 0, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*ReduceSendNotRecv=*/false> scat;
scat.work = work;
scat.chunkCount = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
}
}
/* send for sync */
prims.send(0, 0);
}
}
}
@ -231,7 +355,7 @@ struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_COLLNET_DIRECT, NC
int chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts>
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag


@ -0,0 +1,367 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers>
static __device__ void bcastDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
char* inputHere, char* outputRank0, bool inPlace, int nIters
) {
using Pack = BytePack<BytePerPack>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpHere = (Pack*)inputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack tmp[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = inpHere[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
int dr = inPlace ? 1 : 0;
int r = rank + dr;
if (r == nRanks) r = 0;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && dr == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
add4G(outRank0, r*stride4G)[u*WARP_SIZE] = tmp[u];
}
if (++r == nRanks) r = 0;
}
}
}
inpHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = inpHere[u*WARP_SIZE];
}
}
}
}
template<int UnrollPeers, typename T>
static __device__ void bcastEnds(
ncclSymPrims& prim, int tn, int t,
T* inputHere, T* outputRank0, bool inPlace, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpHere = (BytePack<sizeof(T)>*)inputHere;
BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nPreElts-nSufElts+i;
BytePack<sizeof(T)> tmp = inpHere[elt];
int dr = inPlace ? 1 : 0;
int r = rank + dr;
if (r == nRanks) r = 0;
#pragma unroll 1
for (; dr + UnrollPeers <= nRanks; dr += UnrollPeers) {
#pragma unroll UnrollPeers
for (int u=0; u < UnrollPeers; u++) {
*add4G(outRank0+elt, r*stride4G) = tmp;
if (++r == nRanks) r = 0;
}
}
#pragma unroll UnrollPeers
for (int u=0; u < UnrollPeers; u++) {
if (dr+u == nRanks) break;
*add4G(outRank0+elt, r*stride4G) = tmp;
if (++r == nRanks) r = 0;
}
}
}
template<typename T>
static __device__ void bcast(
ncclSymPrims& prim, int tn, int t, bool waitNeeded, T* input, T* output, size_t nElts
) {
bool inPlace = (input == output);
// Move to rank=0
output = prim.peerPtr(0, output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (128u - inputUptr)%128u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
bcastDeep<BytePerPack, UnrollPacks, UnrollPeers>(
prim, tn, t, waitNeeded,
(char*)input + cursor, (char*)output + cursor, inPlace,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
bcastDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers>(
prim, tn, t, waitNeeded,
(char*)input + cursor, (char*)output + cursor, inPlace,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
bcastEnds<UnrollPeers>(prim, tn, t, input, output, inPlace, nElts, nPreBytes/sizeof(T), nSufElts);
}
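A worked example of how bcast() carves up a buffer; the numbers are illustrative, not from the source:

// Say nBytes = 1<<20, input is 128-byte aligned (so nPreBytes = 0), and input and
// output addresses differ by a multiple of 16.  In the 16-byte pass:
//   BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack = 4*4*32*16 = 8192
//   chunks = (1<<20)/8192 = 128, rounded down to a multiple of prim.nBlocks
// bcastDeep streams those chunks; whatever remains (plus any unaligned head) falls
// through to the 4-byte pass and finally to bcastEnds, one element at a time.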
__device__ __forceinline__ void ncclSymRun_AllGather_ST(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
int const& rank = prim.rank;
// Threads numbered over rank.
int bt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int btn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
bcast(prim, btn, bt, /*waitNeeded=*/true, (char*)args->input, (char*)args->output + rank*args->nElts, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename T>
static __device__ void bcastMultimem(
ncclSymPrims& prim, int tn, int t, T* input, T* output, size_t nElts
) {
// Move output to multimem
output = prim.multimemPtr(output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16-inputUptr)%16;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 8;
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
uint32_t nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = *reinterpret_cast<BytePack<BytePerPack>*>(inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
multimem_st_global(outputUptr + cursor + u*WARP_SIZE*BytePerPack, tmp[u]);
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Handle the prefix+suffix elements one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = *reinterpret_cast<BytePack<sizeof(T)>*>(inputUptr + cursor);
multimem_st_global(outputUptr + cursor, val);
cursor += tn*sizeof(T);
}
}
__device__ __forceinline__ void ncclSymRun_AllGather_STMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
int const& rank = prim.rank;
char* input = args->input;
char* output = args->output;
size_t bytes = args->nElts;
// Round robin memory to blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
bcastMultimem(prim, tn, t, input, output + rank*bytes, bytes);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename EltType>
static __device__ void allgather_LL_body(
ncclSymPrims &prim, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts
) {
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(EltType);
ncclCoopCta cta;
int rank = prim.rank;
int nRanks = prim.nRanks;
constexpr int tn = ncclSymMaxThreads;
int t = threadIdx.x;
#pragma unroll 1
while (0 < nElts) {
int nIterPacks = min(nPacks, tn);
if (t < nIterPacks) {
Pack x = loadPack<Pack>(input, t*EltPerPack, nElts);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, x);
}
int tn_div_nPacks = tn/nIterPacks;
int tn_mod_nPacks = tn%nIterPacks;
int peer = t/nIterPacks;
int pack = t%nIterPacks;
#if 1
// NOTE: Unrolling speedup on eos nranks=8 size=64K: 5.7us vs 6.7us
constexpr int Unroll = 4;
#pragma unroll 1
for (int i = t; i < (nRanks*nIterPacks & -(Unroll*tn)); i += Unroll*tn) {
Pack got[Unroll];
prim.template recvLL<Unroll, Unroll>(i, Unroll, tn, /*&*/got);
#pragma unroll
for (int u=0; u < Unroll; u++) {
storePack<Pack>(output + peer*nStrideElts, pack*EltPerPack, nElts, got[u]);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
}
int i = (nRanks*nIterPacks & -(Unroll*tn)) + t;
int n = (nRanks*nIterPacks)/tn % Unroll;
if (i + n*tn < nRanks*nIterPacks) n += 1;
if (n != 0) {
Pack got[Unroll];
prim.template recvLL<1, Unroll>(i, n, tn, /*&*/got);
#pragma unroll
for (int u=0; u < Unroll; u++) {
if (u != 0 && u == n) break;
storePack(output + peer*nStrideElts, pack*EltPerPack, nElts, got[u]);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
}
#else
// The non-unrolled but "obviously correct" implementation for reference.
#pragma unroll 1
for (int i = t; i < nRanks*nIterPacks; i += tn) {
Pack got = prim.template recvLL<Pack>(i);
storePack(output + peer*nStrideElts, pack*EltPerPack, nElts, got);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
#endif
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
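The LL slot layout the loop above relies on, spelled out (this restates the indexing in the code, nothing new):

// Per iteration there are nIterPacks packs in flight.  Rank r publishes pack p into
// LL slot r*nIterPacks + p (the bcastLL call), and every rank then walks all
// nRanks*nIterPacks slots, recovering
//   peer = slot / nIterPacks,  pack = slot % nIterPacks
// which is exactly the (peer, pack) stepping maintained incrementally in the
// unrolled receive loop.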
static __device__ void ncclSymRun_AllGather_LL_impl(ncclSymDevArgs const* args, bool multimem) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
using Pack = BytePack<8>;
constexpr int BytePerPack = 8;
int nElts = args->nElts;
int nPacks = divUp(nElts, BytePerPack);
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
int nBlockPacks = blockPackEnd - blockPackBegin;
int nBlockElts = nElts - blockPackBegin*BytePerPack;
nBlockElts = min(nBlockElts, nBlockPacks*BytePerPack);
char* blockInput = args->input + blockPackBegin*BytePerPack;
char* blockOutput = args->output + blockPackBegin*BytePerPack;
uint32_t lowBits = args->nElts;
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
if (__builtin_expect(lowBits%8 == 0, true)) {
// NOTE: Specializing for 8-byte alignment in one case helps at size=65K: 8.9us vs 5.6us
allgather_LL_body(prim, (BytePack<8>*)blockInput, (BytePack<8>*)blockOutput, nBlockElts/8, nBlockPacks, nElts/8);
} else {
allgather_LL_body(prim, blockInput, blockOutput, nBlockElts, nBlockPacks, nElts);
}
}
__device__ __forceinline__ void ncclSymRun_AllGather_LL(ncclSymDevArgs const* args) {
ncclSymRun_AllGather_LL_impl(args, /*multimem=*/false);
}
__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(ncclSymDevArgs const* args) {
ncclSymRun_AllGather_LL_impl(args, /*multimem=*/true);
}


@ -0,0 +1,432 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
static __device__ __forceinline__ void allreduceDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, char* inputRank0, char* outputRank0, int32_t nIters
) {
using Pack = BytePack<BytePerPack>;
using Acc = typename Red::EltType;
using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack acc0[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
AccPack acc1[UnrollPacks];
int r = rank;
if (++r == nRanks) r = 0;
{ Pack tmp1[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, applyCast<T, Acc>(acc0[u]), applyCast<T, Acc>(tmp1[u]));
}
}
if (++r == nRanks) r = 0;
int dr = 2;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
Pack tmp1[UnrollPeers][UnrollPacks];
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
if (++r == nRanks) r = 0;
}
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, acc1[u], applyCast<T, Acc>(tmp1[ur][u]));
}
}
}
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) acc0[u] = applyCast<Acc, T>(acc1[u]);
dr = 0;
r = rank;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && dr == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
add4G(outRank0, r*stride4G)[u*WARP_SIZE] = acc0[u];
}
if (++r == nRanks) r = 0;
}
}
}
inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
}
}
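Summarizing the control flow of allreduceDeep above (a restatement of the loops, not new behaviour):

// acc  = input[rank]                                  (loaded before the barrier wait)
// for r = rank+1, rank+2, ... (wrapping mod nRanks):  acc += input[r], in Acc precision
// cast acc back to T, then store it to output[r] for every rank r, starting at rank
// UnrollPeers only batches the peer loads/stores; the "partial" pass mops up the
// remainder when nRanks is not a multiple of UnrollPeers.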
template<int UnrollPeers, typename Red, typename T>
static __device__ __forceinline__ void allreduceEnds(
ncclSymPrims& prim, int tn, int t, Red red,
T* inputRank0, T* outputRank0, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
using Acc = typename Red::EltType;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
BytePack<sizeof(Acc)> acc1;
BytePack<sizeof(T)> tmp[UnrollPeers];
int dr = 1;
int r = rank+1;
if (nRanks == r) r = 0;
bool first = true;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j = 0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
tmp[u] = *add4G(inpRank0+elt, r*stride4G);
r += 1;
if (r == nRanks) r = 0;
}
if (first) {
first = false;
acc1 = applyCast<T, Acc>(acc0);
}
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
acc1 = applyReduce(red, acc1, applyCast<T, Acc>(tmp[u]));
}
}
}
acc0 = applyCast<Acc, T>(acc1);
dr = 0;
r = rank;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j=0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && dr+u == nRanks) break;
*add4G(outRank0+elt, r*stride4G) = acc0;
r += 1;
if (r == nRanks) r = 0;
}
}
}
}
}
template<typename Red, typename T>
static __device__ void allreduce(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, T* input, T* output, size_t nElts
) {
int nRanks = prim.nRanks;
int nBlocks = prim.nBlocks;
// Move to rank=0
input = prim.peerPtr(0, input);
output = prim.peerPtr(0, output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16u - inputUptr)%16u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
allreduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
allreduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
allreduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
int /*const&*/ rank = prim.rank;
int /*const&*/ nRanks = prim.nRanks;
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
// Threads numbered globally such that we round robin warps by rank then block.
int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
rank, nRanks,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int gtn = nRanks*prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
allreduce(prim, gtn, gt, /*waitNeeded=*/true, red, (T*)args->input, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename Red, typename T>
static __device__ void allreduceMultimem(
ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
) {
// Move to multimem
input = prim.multimemPtr(input);
output = prim.multimemPtr(output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
constexpr int BytePerPack = LoadMultimem_BigPackSize<Red>::BigPackSize;
uint32_t nPreBytes = (BytePerPack - inputUptr)%BytePerPack;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if (alignof(T) == BytePerPack || (inputUptr-outputUptr)%BytePerPack == 0) {
constexpr int UnrollPacks = 16*8/BytePerPack;
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
int nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = applyLoadMultimem<Red, BytePerPack>(red, inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
multimem_st_global(outputUptr + cursor + u*WARP_SIZE*BytePerPack, tmp[u]);
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Handle the prefix+suffix elements one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
multimem_st_global(outputUptr + cursor, val);
cursor += tn*sizeof(T);
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
// Threads numbered globally such that we round robin warps by rank then block.
int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.rank, prim.nRanks,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int gtn = prim.nRanks*prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
allreduceMultimem(prim, gtn, gt, red, (T*)args->input, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R_impl(ncclSymDevArgs const* args, bool multimem) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
int /*const&*/ rank = prim.rank;
using Acc = typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type;
Red<Acc> red(args->redOpArg);
using Pack = BytePack<8>;
using AccPack = BytePack<8*sizeof(Acc)/sizeof(T)>;
constexpr int EltPerPack = 8/sizeof(T);
int nElts = args->nElts;
int nPacks = divUp(nElts, EltPerPack);
bool packAligned = 8 <= alignof(T) || (
args->nElts*sizeof(T) |
(uint32_t)reinterpret_cast<uintptr_t>(args->input) |
(uint32_t)reinterpret_cast<uintptr_t>(args->output)
)%8 == 0;
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
int begin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int end = begin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
nPacks = end - begin;
nElts -= begin*EltPerPack;
nElts = min(nElts, nPacks*EltPerPack);
T* input = (T*)args->input + begin*EltPerPack;
T* output = (T*)args->output + begin*EltPerPack;
ncclCoopCta cta;
int t = threadIdx.x;
int tn = ncclSymMaxThreads;
if (__builtin_expect(packAligned, true)) {
#pragma unroll 1
while (0 < nPacks) {
if (t < nPacks) {
int nIterPacks = min(nPacks, tn);
Pack inp = loadPack<Pack>((Pack*)input, t, nPacks);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack((Pack*)output, t, nPacks, out);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nPacks -= tn;
}
} else {
#pragma unroll 1
while (0 < nElts) {
if (t*EltPerPack < nElts) {
int nIterPacks = min(nPacks, tn);
Pack inp = loadPack<Pack>(input, t*EltPerPack, nElts);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack(output, t*EltPerPack, nElts, out);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(ncclSymDevArgs const* args) {
ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/false);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(ncclSymDevArgs const* args) {
ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/true);
}

src/device/symmetric/generate.py Executable file

@ -0,0 +1,294 @@
#!/usr/bin/env python3
import os
import sys
################################################################################
# The first command line argument is the path to the directory to generate and
# populate.
gensrc = sys.argv[1]
if os.path.exists(gensrc):
for name in os.listdir(gensrc):
os.remove(os.path.join(gensrc, name))
#os.truncate(os.path.join(gensrc, name), 0)
else:
os.mkdir(gensrc)
def paste(sep, *args):
return sep.join(args)
indents = 0
def emitln(f, lines):
global indents
for ln in ((lines,) if isinstance(lines, str) else lines):
f.write(' '*indents + ln + '\n')
def indent(s):
return '\n'.join(' '+l for l in s.splitlines())
class Rec(object):
def __init__(me, **kw):
me.__dict__.update(kw)
def __eq__(x, y):
if len(x.__dict__) != len(y.__dict__): return False
for k in x.__dict__:
if k not in y.__dict__: return False
if x.__dict__[k] != y.__dict__[k]: return False
return True
def __hash__(me):
h = 0
for k in me.__dict__:
h += hash((k, me.__dict__[k]))
return h
################################################################################
# Edit this region for introducing new algos etc
reductions = ["AllReduce","ReduceScatter"]
all_reds = ["sum"]
all_tys = ["f32","f16","bf16","f8e4m3","f8e5m2"]
nvls_algos_by_coll = {
"AllReduce": ["AGxLLMC_R","RSxLDMC_AGxSTMC"],
"ReduceScatter": ["LDMC"]
}
ldmc_algos = ["RSxLDMC_AGxSTMC", "LDMC"]
coll_to_lower = {
"AllGather": "all_gather",
"AllReduce": "all_reduce",
"ReduceScatter": "reduce_scatter"
}
red_to_ncclDevRedOp = {
"sum": "ncclDevSum"
}
red_to_Func = {
"sum": "FuncSum"
}
ty_to_ncclDataType = {
"f32": "ncclFloat32",
"f16": "ncclFloat16",
"bf16": "ncclBfloat16",
"f8e4m3": "ncclFloat8e4m3",
"f8e5m2": "ncclFloat8e5m2"
}
ty_to_cxxtype = {
"f32": "float",
"f16": "half",
"bf16": "__nv_bfloat16",
"f8e4m3": "__nv_fp8_e4m3",
"f8e5m2": "__nv_fp8_e5m2"
}
def enumerate_kernels():
for algo in ["LL","LLMC","ST","STMC"]:
yield Rec(coll="AllGather", algo=algo)
for red in all_reds:
for ty in all_tys:
for algo in ["AGxLL_R","AGxLLMC_R","RSxLD_AGxST","RSxLDMC_AGxSTMC"]:
yield Rec(coll="AllReduce", algo=algo, red=red, ty=ty)
for algo in ["LL","LD","LDMC"]:
yield Rec(coll="ReduceScatter", algo=algo, red=red, ty=ty)
def required_cuda(k):
cudart, arch, specific_sms = 0, 0, None
is_nvls = k.algo in nvls_algos_by_coll.get(k.coll, [])
if is_nvls:
cudart = max(cudart, 12010)
arch = 900
if k.coll in reductions:
if k.ty == "bf16":
cudart = max(cudart, 11000)
if k.ty.startswith("f8"):
cudart = max(cudart, 11080)
arch = 900
if k.algo in ldmc_algos:
cudart = 12070
arch = None
specific_sms = ["100a", "101a", "100f", "101f", "120a", "121a"]
return (cudart, arch, specific_sms)
################################################################################
def kernel_fdep(k):
return coll_to_lower[k.coll] + '.cu'
def kernel_fname(k):
if k.coll in reductions:
if k.algo in ldmc_algos and k.ty.startswith('f8'):
return paste('_', coll_to_lower[k.coll], k.red, k.ty, k.algo) + '.cu'
else:
return paste('_', coll_to_lower[k.coll], k.red, k.ty) + '.cu'
else:
return coll_to_lower[k.coll] + '.cu'
def kernel_gencode(k):
if k.coll in reductions and k.algo in ldmc_algos and k.ty.startswith('f8'):
return "$(NVCC_GENCODE_LDMC_FP8)"
else:
return "$(NVCC_GENCODE)"
def kernel_cname(k):
if k.coll in reductions:
return paste("_", "ncclSymDevKernel", k.coll, k.algo, k.red, k.ty)
else:
return paste("_", "ncclSymDevKernel", k.coll, k.algo)
def kernel_conds(k):
cudart, arch, specific_sms = required_cuda(k)
if cudart == 0: return (None, None)
cudart_cond = "CUDART_VERSION >= %d"%cudart
if not specific_sms:
arch_cond = "__CUDA_ARCH__ >= %d"%arch
else:
arch_cond = " || ".join(["0"] + ["NCCL_CUDA_ARCH_%sSPECIFIC==%d"%("FAMILY_" if sm[-1] == "f" else "", 10*int(sm.replace('a', '').replace('f', ''))) for sm in specific_sms])
return cudart_cond, arch_cond
def instantiate(k):
cudart_cond, arch_cond = kernel_conds(k)
if (cudart_cond, arch_cond) == (None, None):
form_red_ty = (
"__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" ncclSymRun_{id}<{red}, {ty}>(&args);\n"
"}}"
)
form = (
"__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" ncclSymRun_{id}(&args);\n"
"}}"
)
else:
form_red_ty = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" #if {arch_cond}\n"
" ncclSymRun_{id}<{red}, {ty}>(&args);\n"
" #endif\n"
" }}\n"
"#endif"
)
form = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" #if {arch_cond}\n"
" ncclSymRun_{id}(&args);\n"
" #endif\n"
" }}\n"
"#endif"
)
id = k.coll+'_'+k.algo
cname = kernel_cname(k)
if k.coll in reductions:
inst = form_red_ty.format(cname=cname, id=id, red=red_to_Func[k.red], ty=ty_to_cxxtype[k.ty], cudart_cond=cudart_cond, arch_cond=arch_cond)
else:
inst = form.format(cname=cname, id=id, cudart_cond=cudart_cond, arch_cond=arch_cond)
return inst
def prototype(k):
cudart_cond, arch_cond = kernel_conds(k)
if cudart_cond is None:
form = "__global__ void {cname}(ncclSymDevArgs const);"
else:
form = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs const);\n"
"#else\n"
" constexpr void* {cname} = nullptr;\n"
"#endif"
)
return form.format(cname=kernel_cname(k), cudart_cond=cudart_cond)
################################################################################
def partition(vals, keyfn):
ans = {}
for x in vals:
k = keyfn(x)
if k not in ans:
ans[k] = []
ans[k].append(x)
return ans
kernels_by_file = partition(enumerate_kernels(), lambda k: (kernel_fname(k), k.coll))
# Add dependency-only files (e.g. allreduce.cu)
for coll in set(k.coll for k in enumerate_kernels()):
fname = coll_to_lower[coll]+'.cu'
if (fname, coll) not in kernels_by_file:
kernels_by_file[fname, coll] = []
# Generate each kernel instantiation file
for (fname, coll), ks in kernels_by_file.items():
with open(os.path.join(gensrc, fname), "w") as f:
emitln(f, '#include "symmetric.h"')
emitln(f, '#include "symmetric/kernel.cuh"')
emitln(f, '#include "symmetric/{coll}.cuh"'.format(coll=coll_to_lower[coll]))
for k in ks:
emitln(f, instantiate(k))
# Generate <gensrc>/symmetric_kernels.cc
with open(os.path.join(gensrc, "symmetric_kernels.cc"), "w") as f:
emitln(f, '#include "symmetric.h"')
emitln(f, '#include "device.h"')
emitln(f, '')
for k in enumerate_kernels():
emitln(f, prototype(k))
emitln(f, '')
emitln(f, 'extern int const ncclSymKernelCount = %d;' % len(list(enumerate_kernels())))
emitln(f, 'extern void* const ncclSymKernelList[] = {')
for k in enumerate_kernels():
emitln(f, '(void*){cname},'.format(cname=kernel_cname(k)))
emitln(f, 'nullptr};')
emitln(f, '')
emitln(f, 'void* ncclSymGetKernelPtr(ncclSymKernelId id, int red, ncclDataType_t ty) {')
indents += 1
emitln(f, 'switch (id) {')
emitln(f, 'default: return nullptr;')
for (coll, algo), coll_algo_ks in partition(enumerate_kernels(), lambda k: (k.coll, k.algo)).items():
emitln(f, 'case ncclSymKernelId_'+coll+'_'+algo+':')
indents += 1
if len(coll_algo_ks) == 1:
emitln(f, 'return (void*)&'+kernel_cname(coll_algo_ks[0])+';')
else:
emitln(f, 'switch ((ncclDevRedOp_t)red) {')
emitln(f, 'default: return nullptr;')
for red, coll_algo_red_ks in partition(coll_algo_ks, lambda k: k.red).items():
emitln(f, 'case '+red_to_ncclDevRedOp[red]+':')
indents += 1
emitln(f, 'switch (ty) {')
emitln(f, 'default: return nullptr;')
for k in coll_algo_red_ks:
emitln(f, 'case '+ty_to_ncclDataType[k.ty]+': return (void*)'+kernel_cname(k)+';')
emitln(f, '}')
indents -= 1
emitln(f, '}')
indents -= 1
emitln(f, '}')
indents -= 1
emitln(f, '}')
# Generate <gensrc>/rules.mk
with open(os.path.join(gensrc, "rules.mk"), "w") as f:
inst_names = sorted(set(kernel_fname(k) for k in enumerate_kernels()))
names = inst_names + ["symmetric_kernels.cc"]
f.write("LIB_OBJS_SYM_GEN = $(patsubst %,$(OBJDIR)/genobj/symmetric/%.o,{names})\n"
.format(names=" ".join(names)))
f.write("\n")
inst_names = sorted(set((k.coll, kernel_fname(k), kernel_gencode(k)) for k in enumerate_kernels()))
for coll, name, gencode in inst_names:
f.write(
"$(OBJDIR)/genobj/symmetric/{name}.o: $(OBJDIR)/gensrc/symmetric $(OBJDIR)/genobj/symmetric/{coll}.cu.d\n"
"\t" "$(call COMPILE_SYM,$@,$(OBJDIR)/gensrc/symmetric/{name},{gencode})\n"
"\n"
.format(name=name, coll=coll_to_lower[coll], gencode=gencode)
)


@ -0,0 +1,27 @@
#ifndef NCCL_DEVICE_SYMMETRIC_KERNEL_H_
#define NCCL_DEVICE_SYMMETRIC_KERNEL_H_
#include "symmetric.h"
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_LL(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_ST(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_STMC(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(struct ncclSymDevArgs const* args);
#endif


@ -0,0 +1,420 @@
#ifndef NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
#define NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
#include "symmetric.h"
#include "bitops.h"
#include "collectives.h"
#include "op128.h"
#include "reduce_kernel.h"
#if __CUDA_ARCH__ >= 700
// __grid_constant__ appears to break cuda-gdb
#define NCCL_GRID_CONSTANT __grid_constant__
#else
#define NCCL_GRID_CONSTANT
#endif
// flattenIx(pos0, dim0, pos1, dim1, pos2, dim2, ...)
// Given a position vector `pos` in a rectangular index space with lengths in the `dim`
// vector, flatten that down to a linear index. The fastest moving dimension is given first.
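// Example: flattenIx(x,X, y,Y, z,Z) == x + X*(y + Y*z), so x varies fastest.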
__device__ __forceinline__ int flattenIx() { return 0; }
template<typename Int0, typename Int1, typename ...Ints>
static __device__ Int0 flattenIx(Int0 pos, Int1 size, Ints ...more) {
return pos + size*flattenIx(more...);
}
// Precomputed integer reciprocals for denominator values 1..64 inclusive.
// Pass these to idivFast64() for fast division on the GPU.
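// The reciprocal turns each division into a multiply plus shift, which is far cheaper
// than an integer divide on the GPU.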
static __device__ uint64_t idivRcp64_upto64(int x) {
static constexpr uint64_t table[65] = {
idivRcp64(0x01), idivRcp64(0x01), idivRcp64(0x02), idivRcp64(0x03),
idivRcp64(0x04), idivRcp64(0x05), idivRcp64(0x06), idivRcp64(0x07),
idivRcp64(0x08), idivRcp64(0x09), idivRcp64(0x0a), idivRcp64(0x0b),
idivRcp64(0x0c), idivRcp64(0x0d), idivRcp64(0x0e), idivRcp64(0x0f),
idivRcp64(0x10), idivRcp64(0x11), idivRcp64(0x12), idivRcp64(0x13),
idivRcp64(0x14), idivRcp64(0x15), idivRcp64(0x16), idivRcp64(0x17),
idivRcp64(0x18), idivRcp64(0x19), idivRcp64(0x1a), idivRcp64(0x1b),
idivRcp64(0x1c), idivRcp64(0x1d), idivRcp64(0x1e), idivRcp64(0x1f),
idivRcp64(0x20), idivRcp64(0x21), idivRcp64(0x22), idivRcp64(0x23),
idivRcp64(0x24), idivRcp64(0x25), idivRcp64(0x26), idivRcp64(0x27),
idivRcp64(0x28), idivRcp64(0x29), idivRcp64(0x2a), idivRcp64(0x2b),
idivRcp64(0x2c), idivRcp64(0x2d), idivRcp64(0x2e), idivRcp64(0x2f),
idivRcp64(0x30), idivRcp64(0x31), idivRcp64(0x32), idivRcp64(0x33),
idivRcp64(0x34), idivRcp64(0x35), idivRcp64(0x36), idivRcp64(0x37),
idivRcp64(0x38), idivRcp64(0x39), idivRcp64(0x3a), idivRcp64(0x3b),
idivRcp64(0x3c), idivRcp64(0x3d), idivRcp64(0x3e), idivRcp64(0x3f),
idivRcp64(0x40)
};
return table[x];
}
static __device__ uint32_t idivRcp32_upto64(int x) {
return idivRcp64_upto64(x)>>32;
}
namespace {
struct ncclCoopCta {
__device__ void sync() { __syncthreads(); }
__device__ int self() { return threadIdx.x; }
__device__ int count() { return blockDim.x; }
};
struct ncclCoopWarps {
int log2_nWarps;
__device__ void sync() {
asm volatile("barrier.sync %0, %1;" :: "r"(1 + (threadIdx.x>>(5+log2_nWarps))), "r"(32<<log2_nWarps) : "memory");
}
__device__ int self() { return threadIdx.x & ((32<<log2_nWarps)-1); }
__device__ int count() { return 32<<log2_nWarps; }
};
struct ncclCoopWarp {
__device__ void sync() { __syncwarp(); }
__device__ int self() { return threadIdx.x%32; }
__device__ int count() { return 32; }
};
}
namespace {
static constexpr int ncclSymPrims_UseBarrier = 1;
static constexpr int ncclSymPrims_UseLL = 2;
static constexpr int ncclSymPrims_UseMultimem = 4;
struct ncclSymPrims {
int flags;
int const &rank;
int const &nRanks;
uint32_t const &nRanks_rcp32;
int block, nBlocks;
uint32_t nBlocks_rcp32;
uint32_t nBlocks_nWarps_rcp32;
uint32_t nRanks_nBlocks_rcp32;
uint32_t nWarpPerRank, nWarpPerRank_rcp32;
struct ncclSymDevBase* const &base;
uintptr_t offsetMc;
uint32_t const &stride4G;
uint32_t barEpoch;
uint32_t llEpoch;
__device__ ncclSymPrims(ncclSymDevComm const &comm, int flags):
flags(flags),
rank(comm.rank),
nRanks(comm.nRanks),
nRanks_rcp32(comm.nRanks_rcp32),
block(blockIdx.x),
nBlocks(gridDim.x),
nBlocks_rcp32(idivRcp32_upto64(nBlocks)),
nBlocks_nWarps_rcp32(imulRcp32(nBlocks, nBlocks_rcp32, blockDim.x/32, idivRcp32_upto64(blockDim.x/32))),
nRanks_nBlocks_rcp32(imulRcp32(nRanks, nRanks_rcp32, gridDim.x, nBlocks_rcp32)),
nWarpPerRank(idivFast32(nBlocks*blockDim.x/32, nRanks, nRanks_rcp32)),
nWarpPerRank_rcp32(idivRcp32_upto64(nWarpPerRank)),
base(comm.base),
offsetMc((flags & ncclSymPrims_UseMultimem) ? (char*)comm.baseMc - (char*)base : 0x0),
stride4G(comm.stride4G) {
#if CUDART_VERSION >= 12030 && __CUDA_ARCH__ >= 900
cudaGridDependencySynchronize();
#endif
if ((flags & ncclSymPrims_UseBarrier) && threadIdx.x < nRanks) {
barEpoch = (flags & ncclSymPrims_UseMultimem) ? base->barEpochMc[block] : base->barEpochUc[block];
}
if (flags & ncclSymPrims_UseLL) llEpoch = base->llEpoch[block] + 2;
}
__device__ ~ncclSymPrims() {
if (threadIdx.x == 0) {
if (flags & ncclSymPrims_UseBarrier) {
((flags & ncclSymPrims_UseMultimem) ? base->barEpochMc : base->barEpochUc)[block] = barEpoch;
}
if (flags & ncclSymPrims_UseLL) base->llEpoch[block] = llEpoch - 2;
}
}
template<typename T>
__device__ T* peerPtr(int peer, T* selfPtr) {
return add4G(selfPtr, (peer-rank)*stride4G);
}
template<typename T>
__device__ T* multimemPtr(T* selfPtr) {
return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(selfPtr) + offsetMc);
}
__device__ void barrierArrive(ncclCoopCta cta, bool release) {
cta.sync();
#if __CUDA_ARCH__ < 700
if (release) {
if (cta.self() == 0) __threadfence_system();
cta.sync();
}
#endif
if (flags & ncclSymPrims_UseMultimem) {
#if __CUDA_ARCH__ >= 900 && CUDART_VERSION >= 12010
if (cta.self() == 0) {
uint32_t* inbox = &multimemPtr(base)->barInboxMc[block];
if (release) {
asm volatile("multimem.red.release.sys.add.u32 [%0],1;" :: "l"(inbox));
} else {
asm volatile("multimem.red.relaxed.sys.add.u32 [%0],1;" :: "l"(inbox));
}
}
#endif
} else {
int r = cta.self();
if (r != rank && r < nRanks) {
uint32_t* inbox = &peerPtr(r, base)->barInboxPerPeer[block*nRanks + rank];
#if __CUDA_ARCH__ >= 700
if (release) {
asm volatile("st.release.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
} else {
asm volatile("st.relaxed.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
}
#else
asm volatile("st.volatile.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
#endif
}
}
}
__device__ void barrierWait(ncclCoopCta cta, bool acquire) {
if (flags & ncclSymPrims_UseMultimem) {
#if __CUDA_ARCH__ >= 900
if (cta.self() == 0) {
uint32_t* inbox = &base->barInboxMc[block];
while (true) {
uint32_t got;
if (acquire) {
asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
} else {
asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
}
if (got-(barEpoch+nRanks) <= uint32_t(-1)>>1) break;
}
barEpoch += nRanks;
}
#endif
} else {
int r = cta.self();
if (r != rank && r < nRanks) {
uint32_t* inbox = &base->barInboxPerPeer[block*nRanks + r];
while (true) {
uint32_t got;
#if __CUDA_ARCH__ >= 700
if (acquire) {
asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
} else {
asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
}
#else
asm volatile("ld.volatile.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
#endif
if (got-(barEpoch+1) <= uint32_t(-1)>>1) break;
}
}
#if __CUDA_ARCH__ < 700
if (acquire) {
cta.sync();
if (cta.self() == 0) __threadfence();
}
#endif
barEpoch += 1;
}
cta.sync();
}
__device__ void endLL(ncclCoopCta cta) {
if (__builtin_expect(llEpoch >= -2u, false)) {
cta.sync();
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch);
int epochSize = ncclSymLLEpochSize(nRanks);
#pragma unroll 4
for (int i=cta.self(); i*16 < epochSize; i += cta.count()) {
buf[i] = uint4{0, 0, 0, 0};
}
}
cta.sync();
llEpoch += (llEpoch == -1u) ? 3 : 1;
}
template<typename T>
__device__ void sendLL(int peer, int slot, T val) {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* buf = ncclSymDevBase_getLLBuf(peerPtr(peer, base), nRanks, block, llEpoch) + slot;
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
}
template<typename T>
__device__ void bcastLL(int slot, T val) {
if (flags & ncclSymPrims_UseMultimem) {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* bufmc = ncclSymDevBase_getLLBuf(multimemPtr(base), nRanks, block, llEpoch) + slot;
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(bufmc + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
} else {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* buf0 = ncclSymDevBase_getLLBuf(peerPtr(0, base), nRanks, block, llEpoch) + slot;
int dr = 0;
int r = rank;
#pragma unroll 1
for (; dr+8 <= nRanks; dr += 8) {
#pragma unroll
for (int ur=0; ur < 8; ur++) {
uint4* buf = add4G(buf0, r*stride4G);
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
r += 1;
if (r == nRanks) r = 0;
}
}
#pragma unroll
for (int ur=0; ur < 8; ur++, dr++) {
if (dr == nRanks) break;
uint4* buf = add4G(buf0, r*stride4G);
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
r += 1;
if (r == nRanks) r = 0;
}
}
}
template<int nSlotsMin, int nSlotsMax, typename T>
__device__ void recvLL(int slot0, int nSlots, int stride, T(&elts)[nSlotsMax]) {
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0;
uint4 tmp[nSlotsMax][divUp(sizeof(T),8)];
//int spins=0;
while (true) {
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
if (u < nSlotsMin || u < nSlots) {
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(tmp[u][v].x), "=r"(tmp[u][v].y), "=r"(tmp[u][v].z), "=r"(tmp[u][v].w) : "l"(buf + u*stride + v*ncclSymLLMaxSlots(sizeof(T))));
}
}
}
bool okAll = true;
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
if (u < nSlotsMin || u < nSlots) {
bool ok = tmp[u][v].y == llEpoch &&
tmp[u][v].w == llEpoch;
okAll &= ok;
}
}
}
if (__builtin_expect(okAll, true)) break;
//if (spins++ == 10<<20) spins=0;
}
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
if (nSlotsMin <= u && u == nSlots) break;
union { T val; uint32_t u32[divUp(sizeof(T),8)][2]; };
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
u32[v][0] = tmp[u][v].x;
u32[v][1] = tmp[u][v].z;
}
elts[u] = val;
}
}
template<typename Pack, typename T, typename Red, int Unroll=8>
__device__ Pack recvReduceLL(int slot, int stride, Red red) {
using Acc = typename Red::EltType;
using AccPack = BytePack<sizeof(Pack)*sizeof(Acc)/sizeof(T)>;
AccPack acc;
bool first = true;
int r = 0;
#pragma unroll 1
for (; r+Unroll <= nRanks; r += Unroll) {
Pack got[Unroll];
this->template recvLL</*Min=*/Unroll>(slot + r*stride, Unroll, stride, got);
AccPack acc0 = applyCast<T, Acc>(got[0]);
acc = first ? acc0 : applyReduce(red, acc, acc0);
first = false;
#pragma unroll
for (int i=1; i < Unroll; i++) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
}
if (r < nRanks) {
Pack got[Unroll];
this->template recvLL</*Min=*/1>(slot + r*stride, nRanks-r, stride, got);
AccPack acc0 = applyCast<T, Acc>(got[0]);
acc = first ? acc0 : applyReduce(red, acc, acc0);
#pragma unroll
for (int i=1; i < Unroll-1; i++) {
if (r+i < nRanks) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
}
}
return applyCast<Acc, T>(acc);
}
template<typename T>
__device__ T recvLL(int slot) {
T one[1];
this->template recvLL<1, 1, T>(slot, 1, 0, one);
return one[0];
}
template<typename Coop, typename T>
__device__ void coopRecvLL(Coop coop, int slot0, int nSlots, T* dst) {
int me = coop.self();
if (me < nSlots) {
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0 + me;
uint4 got[divUp(sizeof(T), 8)];
//int spins=0;
#pragma unroll 1
while (true) {
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(got[u].x), "=r"(got[u].y), "=r"(got[u].z), "=r"(got[u].w) : "l"(buf + u*ncclSymLLMaxSlots(sizeof(T))));
}
bool ok = true;
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
ok &= got[u].y == llEpoch;
ok &= got[u].w == llEpoch;
}
if (__builtin_expect(ok, true)) break;
//if (++spins == 10<<20) { spins=0; printf("r=%d LL spin @ ix=%d got=%d want=%d\n", rank, slot0+me, got[0].y, llEpoch); }
}
union { T val; uint32_t u32[divUp(sizeof(T), 8)][2]; };
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
u32[u][0] = got[u].x;
u32[u][1] = got[u].z;
}
dst[slot0 + me] = val;
}
}
};
}
template<template<typename> typename Red, typename T, bool nvls>
struct ncclSymAccumType { using Type = T; };
// Only Red's whose opArg is invariant w.r.t. the datatype can have a different
// accumulator type. At the moment this excludes integer min/max, sumpostdiv,
// and premulsum.
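// For example, FuncSum over __half below accumulates in float to reduce rounding error
// when summing contributions from many ranks.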
template<> struct ncclSymAccumType<FuncSum, __half, false> { using Type = float; };
#if defined(__CUDA_BF16_TYPES_EXIST__)
template<> struct ncclSymAccumType<FuncSum, __nv_bfloat16, false> { using Type = float; };
#endif
#if defined(__CUDA_FP8_TYPES_EXIST__)
template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e4m3, false> { using Type = float; };
template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e5m2, false> { using Type = float; };
#endif
#endif


@ -0,0 +1,387 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
static __device__ void reduceDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, char* inputRank0, char* outputHere, int32_t nIters
) {
using Pack = BytePack<BytePerPack>;
using Acc = typename Red::EltType;
using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outHere = (Pack*)outputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack acc0[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
AccPack acc1[UnrollPacks];
int r = rank+1;
if (r == nRanks) r = 0;
{ Pack tmp1[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, applyCast<T, Acc>(acc0[u]), applyCast<T, Acc>(tmp1[u]));
}
}
r += 1;
if (r == nRanks) r = 0;
int dr = 2;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
Pack tmp1[UnrollPeers][UnrollPacks];
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
r += 1;
if (r == nRanks) r = 0;
}
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, acc1[u], applyCast<T, Acc>(tmp1[ur][u]));
}
}
}
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) acc0[u] = applyCast<Acc, T>(acc1[u]);
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) outHere[u*WARP_SIZE] = acc0[u];
inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
}
}
template<int UnrollPeers, typename Red, typename T>
static __device__ void reduceEnds(
ncclSymPrims& prim, int tn, int t, Red red,
T* inputRank0, T* outputHere, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
using Acc = typename Red::EltType;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
BytePack<sizeof(T)>* outHere = (BytePack<sizeof(T)>*)outputHere;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
BytePack<sizeof(Acc)> acc1;
BytePack<sizeof(T)> tmp[UnrollPeers];
int dr = 1;
int r = rank+1;
if (nRanks == r) r = 0;
bool first = true;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j = 0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
tmp[u] = *add4G(inpRank0+elt, r*stride4G);
r += 1;
if (r == nRanks) r = 0;
}
if (first) {
first = false;
acc1 = applyCast<T, Acc>(acc0);
}
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
acc1 = applyReduce(red, acc1, applyCast<T, Acc>(tmp[u]));
}
}
}
acc0 = applyCast<Acc, T>(acc1);
outHere[elt] = acc0;
}
}
template<typename Red, typename T>
static __device__ void reduce(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, T* input, T* output, size_t nElts
) {
int nRanks = prim.nRanks;
int nBlocks = prim.nBlocks;
// Move input to rank=0
input = prim.peerPtr(0, input);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
uint32_t alignment = uint32_t(inputUptr - outputUptr);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16u - inputUptr)%16u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if (alignment%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
reduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && alignment%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
reduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
reduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
// Round robin warps over blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
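// i.e. t = lane + WARP_SIZE*(block + nBlocks*warp): warps are interleaved across blocks.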
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
reduce(prim, tn, t, /*waitNeeded=*/true, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename Red, typename T>
static __device__ void reduceMultimem(
ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
) {
// Move input to multimem
input = prim.multimemPtr(input);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
constexpr int BytePerPack = LoadMultimem_BigPackSize<Red>::BigPackSize;
uint32_t nPreBytes = (BytePerPack - inputUptr)%BytePerPack;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if (sizeof(T) == BytePerPack || (inputUptr-outputUptr)%BytePerPack == 0) {
constexpr int UnrollPacks = 8*(16/BytePerPack);
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
uint32_t nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = applyLoadMultimem<Red, BytePerPack>(red, inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
*reinterpret_cast<BytePack<BytePerPack>*>(outputUptr + cursor + u*WARP_SIZE*BytePerPack) = tmp[u];
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Get the prefix+suffix element one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
*reinterpret_cast<BytePack<sizeof(T)>*>(outputUptr + cursor) = val;
cursor += tn*sizeof(T);
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
// Round robin warps over blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
reduceMultimem(prim, tn, t, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
// T is user type, EltType is the most aligned type
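// (When the size and both pointers are 8-byte aligned, the caller passes BytePack<8>
// pointers and pack counts; otherwise EltType==T and nElts counts elements.)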
template<typename T, typename Red, typename EltType>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL_body(
ncclSymPrims &prim, Red red, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts) {
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(EltType);
int nRanks = prim.nRanks;
int rank = prim.rank;
int t = threadIdx.x;
int tn = ncclSymMaxThreads;
ncclCoopCta cta;
#pragma unroll 1
while (0 < nElts) {
int nIterPacks = min(nPacks, tn);
int tn_div_nPacks = tn/nIterPacks;
int tn_mod_nPacks = tn%nIterPacks;
int peer = t/nIterPacks;
int pack = t%nIterPacks;
#pragma unroll 1
for (int i = t; i < nRanks*nIterPacks; i += tn) {
Pack got = loadPack<Pack>(input + peer*nStrideElts, pack*EltPerPack, nElts);
prim.sendLL(peer, rank*nIterPacks + pack, got);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
if (t < nIterPacks) {
Pack got = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack(output, t*EltPerPack, nElts, got);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(T);
int nAllElts = args->nElts;
int nAllPacks = divUp(nAllElts, EltPerPack);
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nAllPacks, prim.nBlocks, prim.nBlocks_rcp32);
int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
int nPacks = blockPackEnd - blockPackBegin;
int nElts = nAllElts - blockPackBegin*EltPerPack;
nElts = min(nElts, nPacks*EltPerPack);
T* input = (T*)args->input + blockPackBegin*EltPerPack;
T* output = (T*)args->output + blockPackBegin*EltPerPack;
uint32_t lowBits = args->nElts*sizeof(T);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
if (__builtin_expect(lowBits%8 == 0, true)) {
ncclSymRun_ReduceScatter_LL_body<T>(prim, red, (Pack*)input, (Pack*)output, nPacks, nPacks, nAllElts/EltPerPack);
} else {
ncclSymRun_ReduceScatter_LL_body<T>(prim, red, input, output, nElts, nPacks, nAllElts);
}
}


@ -13,6 +13,7 @@
#include "cudawrap.h"
#include "profiler.h"
#include "transport.h"
#include "register_inline.h"
#include <cstring> // std::memcpy
#include <cinttypes> // PRIx64
@ -28,34 +29,41 @@ ncclResult_t ncclInitKernelsForDevice(int cudaArch, int maxSharedMem, size_t* ma
int carveout = ncclParamL1SharedMemoryCarveout();
int ncclMaxSharedMem = ncclShmemDynamicSize(cudaArch);
for (int k=0; k < ncclDevKernelCount; k++) {
void* fn = ncclDevKernelList[k];
cudaFuncAttributes attr = {0};
if (fn == nullptr) continue;
for (int sym=0; sym <= 1; sym++) {
int kcount = sym==0 ? ncclDevKernelCount : ncclSymKernelCount;
void* const* kptrs = sym==0 ? ncclDevKernelList : ncclSymKernelList;
for (int k=0; k < kcount; k++) {
void* fn = kptrs[k];
cudaFuncAttributes attr = {0};
if (fn == nullptr) continue;
CUDACHECKGOTO(cudaFuncGetAttributes(&attr, fn), result, ignore0);
if (maxStackSize) {
if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
ignore0:;
}
if (carveout) {
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributePreferredSharedMemoryCarveout, carveout),
result, ignore1);
ignore1:;
}
if (ncclMaxSharedMem != 0) {
int sharedMemSize = ncclMaxSharedMem;
if (sharedMemSize > (maxSharedMem-attr.sharedSizeBytes)) {
WARN("cudaArch %d ncclMaxSharedMem %d exceeds device/fn maxSharedMem %zu",
cudaArch, sharedMemSize, maxSharedMem-attr.sharedSizeBytes);
return ncclSystemError;
cudaError_t errcode = cudaFuncGetAttributes(&attr, fn);
if (errcode == cudaErrorNoKernelImageForDevice) continue;
CUDACHECKGOTO(errcode, result, ignore0);
if (maxStackSize) {
if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
ignore0:;
}
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributeMaxDynamicSharedMemorySize, sharedMemSize),
result, next_kernel);
if (carveout) {
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributePreferredSharedMemoryCarveout, carveout),
result, ignore1);
ignore1:;
}
if (ncclMaxSharedMem != 0) {
int sharedMemSize = ncclMaxSharedMem;
if (sharedMemSize > (maxSharedMem-attr.sharedSizeBytes)) {
WARN("cudaArch %d ncclMaxSharedMem %d exceeds device/fn maxSharedMem %zu",
cudaArch, sharedMemSize, maxSharedMem-attr.sharedSizeBytes);
return ncclSystemError;
}
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributeMaxDynamicSharedMemorySize, sharedMemSize),
result, next_kernel);
}
next_kernel:;
}
next_kernel:;
}
return result;
}
@ -258,8 +266,8 @@ static bool testBudget(
ncclResult_t ncclTasksRegAndEnqueue(struct ncclComm* comm) {
struct ncclKernelPlanner* planner = &comm->planner;
if (planner->isSymColl) return ncclSuccess;
struct ncclTaskColl *task;
task = ncclIntruQueueHead(&planner->collTaskQueue);
while (task != nullptr) {
// Build a ncclDevWorkColl[Reg?] struct for each task.
@ -331,6 +339,38 @@ ncclResult_t ncclPrepareTasks(struct ncclComm* comm, bool* algoNeedConnect, bool
int fnOpTyIndices[ncclNumFuncs*ncclNumDevRedOps*ncclNumTypes];
int fnOpTyCount = 0;
if (comm->nNodes == 1 && planner->nTasksColl == 1 && planner->nTasksP2p == 0) {
void* sendSymPtr;
void* recvSymPtr;
struct ncclReg* sendReg;
struct ncclReg* recvReg;
size_t size = task->count*ncclTypeSize(task->datatype);
NCCLCHECK(ncclRegFindSymmetric(comm, task->sendbuff, size, &sendSymPtr, &sendReg));
NCCLCHECK(ncclRegFindSymmetric(comm, task->recvbuff, size, &recvSymPtr, &recvReg));
bool implemented = ncclSymImplemented(task->func, task->opDev.op, task->datatype);
if (sendReg && recvReg && (sendReg->winFlags & recvReg->winFlags & NCCL_WIN_COLL_SYMMETRIC) && implemented) {
enum ncclSymKernelId kernel;
int nChannels, nWarps;
float estTimeUs = 1.e18;
NCCLCHECK(ncclSymPickKernel(comm, task->func, task->opDev.op, task->datatype, task->count, &estTimeUs, &kernel, &nChannels, &nWarps));
// We should only use the symmetric kernel if it beats the asymmetric kernel. But the
// perf model for asymmetric kernels is too inaccurate and reports too high a
// bandwidth. For now just always use symmetric if available.
if (kernel != ncclSymKernelId_Count) {
task->sendbuff = sendSymPtr;
task->recvbuff = recvSymPtr;
task->devFuncId = (int)kernel;
task->nMaxChannels = nChannels;
task->nWarps = nWarps;
ncclIntruQueueEnqueue(&planner->collTaskQueue, task);
planner->isSymColl = true;
return ncclSuccess;
}
}
}
// Walk the size sorted tasks, binning them by (fn,op,ty).
while (task != nullptr) {
struct ncclTaskColl* next = task->next;
@ -603,6 +643,10 @@ static ncclResult_t scheduleCollTasksToPlan(
(countHi != 0 ? countHi : countLo) -= cells*elementsPerCell - task->count;
nChannels = (countLo!=0 ? 1 : 0) + nMidChannels + (cellsHi!=0 ? 1 : 0);
// Update number of channels propagated to the profiler
task->nChannels = (uint8_t)nChannels;
// Ensure room for worst case of one new batch per channel
if (!testBudget(budget, plan->nWorkBatches + nChannels, plan->workBytes + workNode->size)) {
return ncclSuccess;
@ -860,6 +904,8 @@ static ncclResult_t addP2pToPlan(
partSize = divUp(bytes[dir], nChannels[dir]);
}
}
// Update number of channels propagated to the profiler
if (p2pTasks[dir]) p2pTasks[dir]->nChannels = nChannels[dir];
}
struct ncclWorkList* workNode = ncclMemoryStackAllocInlineArray<ncclWorkList, ncclDevWorkP2p>(&comm->memScoped, 1);
@ -1052,47 +1098,17 @@ static ncclResult_t scheduleP2pTasksToPlan(
}
// Spin until it's safe to increase comm->workFifoProduced to desiredProduced.
static void waitWorkFifoAvailable(struct ncclComm* comm, uint32_t desiredProduced) {
bool hasRoom = (desiredProduced - comm->workFifoConsumedLeast) <= comm->workFifoBytes;
if (hasRoom) return;
while (true) {
// We have to poll for notifications from device.
uint32_t* consumedLive = comm->workFifoConsumed;
uint32_t consumed[MAXCHANNELS];
for (int c=0; c < MAXCHANNELS; c++) {
consumed[c] = __atomic_load_n(&consumedLive[c], __ATOMIC_RELAXED);
static ncclResult_t waitWorkFifoAvailable(struct ncclComm* comm, uint32_t desiredProduced) {
bool hasRoom = (desiredProduced - comm->workFifoConsumed) <= comm->workFifoBytes;
if (!hasRoom) {
while (true) {
NCCLCHECK(ncclCommPollEventCallbacks(comm, /*waitSome=*/true));
hasRoom = (desiredProduced - comm->workFifoConsumed) <= comm->workFifoBytes;
if (hasRoom) break;
sched_yield();
}
// Compiler-only fence to prevent fusion of loops to encourage dense loads.
__atomic_signal_fence(__ATOMIC_SEQ_CST);
uint32_t produced = comm->workFifoProduced;
uint32_t consumedLeast = produced;
for (int c=0; c < MAXCHANNELS; c++) {
// consumedLeast is min over all non-quiesced channels
if (consumed[c] != comm->channels[c].workFifoProduced) {
if ((produced - consumedLeast) < (produced - consumed[c])) {
consumedLeast = consumed[c];
}
}
}
// Compiler only fence to prevent fusion of loops to encourage dense stores.
__atomic_signal_fence(__ATOMIC_SEQ_CST);
for (int c=0; c < MAXCHANNELS; c++) {
// Advance counter on quiesced channels so they don't lag behind
// too far where they could get lost in 32-bit wraparound.
if (consumed[c] == comm->channels[c].workFifoProduced) {
comm->channels[c].workFifoProduced = consumedLeast;
__atomic_store_n(&consumedLive[c], consumedLeast, __ATOMIC_RELAXED);
}
}
comm->workFifoConsumedLeast = consumedLeast;
hasRoom = (desiredProduced - comm->workFifoConsumedLeast) <= comm->workFifoBytes;
if (hasRoom) break;
sched_yield();
}
return ncclSuccess;
}
namespace {
@ -1106,11 +1122,14 @@ namespace {
struct uploadWork_cleanup_t* me = (struct uploadWork_cleanup_t*)cb;
free(me->hostBuf);
CUDACHECK(cudaEventDestroy(me->base.event));
free(me);
return ncclSuccess;
}
}
static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* plan) {
if (plan->isSymColl) return ncclSuccess;
size_t workBytes = plan->workBytes;
size_t batchBytes = plan->nWorkBatches*sizeof(struct ncclDevWorkBatch);
void* fifoBufHost;
@ -1127,7 +1146,7 @@ static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* pla
fifoBufHost = comm->workFifoBuf;
fifoCursor = comm->workFifoProduced;
fifoMask = comm->workFifoBytes-1;
waitWorkFifoAvailable(comm, fifoCursor + workBytes);
NCCLCHECK(waitWorkFifoAvailable(comm, fifoCursor + workBytes));
plan->kernelArgs->workBuf = comm->workFifoBufDev;
break;
case ncclDevWorkStorageTypePersistent:
@ -1208,7 +1227,7 @@ static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* pla
ncclIntruQueueEnqueue(&comm->eventCallbackQueue, (struct ncclCommEventCallback *)cleanup);
NCCLCHECKGOTO(ncclStrongStreamRelease(ncclCudaGraphNone(), &comm->sharedRes->deviceStream, /*concurrent=*/false), result, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm), result, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm, /*waitSome=*/false), result, fail);
finish_scope:
if (mode != cudaStreamCaptureModeRelaxed) (void)cudaThreadExchangeStreamCaptureMode(&mode);
@ -1226,6 +1245,7 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
uint64_t collOpCount = comm->sharedRes->collOpCount;
uint64_t p2pOpBump[MAXCHANNELS] = {/*0...*/};
// Advance comm's collOpCount by number of colls in this plan.
int hasp2p = 0;
comm->sharedRes->collOpCount += plan->collOpCount;
comm->collOpCount += plan->collOpCount;
@ -1244,6 +1264,7 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
// remember last value to compute max.
p2pOpBump[op->channelId] = (oldId>>1) + 1; // +1 to ensure next plan doesn't collide
op->opCount = (comm->sharedRes->p2pOpCount[op->channelId]<<1) + oldId;
hasp2p = 1;
} else { // coll
op->opCount = (collOpCount<<1) + oldId;
}
@ -1253,9 +1274,11 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
op = op->enqNext;
}
for (int c=0; c < MAXCHANNELS; c++) {
// Advance channel's p2pOpCount by number of p2p's in this plan channel.
comm->sharedRes->p2pOpCount[c] += p2pOpBump[c];
if (hasp2p) {
for (int c=0; c < MAXCHANNELS; c++) {
// Advance channel's p2pOpCount by number of p2p's in this plan channel.
comm->sharedRes->p2pOpCount[c] += p2pOpBump[c];
}
}
return ncclSuccess;
}
@ -1263,8 +1286,10 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
static ncclResult_t hostStreamPlanTask(struct ncclComm* comm, struct ncclKernelPlan* plan) {
NCCLCHECK(ncclProfilerStartGroupEvent(plan));
NCCLCHECK(ncclProfilerStartTaskEvents(plan));
NCCLCHECK(uploadProxyOps(comm, plan));
NCCLCHECK(ncclProxyStart(comm));
if (ncclIntruQueueHead(&plan->proxyOpQueue)) {
NCCLCHECK(uploadProxyOps(comm, plan));
NCCLCHECK(ncclProxyStart(comm));
}
NCCLCHECK(ncclProfilerStopTaskEvents(plan));
NCCLCHECK(ncclProfilerStopGroupEvent(plan));
if (!plan->persistent) {
@ -1281,7 +1306,6 @@ static void CUDART_CB hostStreamPlanCallback(void *plan_) {
if (result != ncclSuccess) {
WARN("hostStreamPlanCallback() failed : %s", ncclGetErrorString(result));
}
if (!plan->persistent) ncclAtomicRefCountDecrement(&plan->comm->sharedRes->noncapturedRefs);
return;
}
@ -1357,9 +1381,8 @@ namespace {
static ncclResult_t getImplicitOrder(enum ncclImplicitOrder *mode, bool capturing, int driver=-1) {
if (ncclParamLaunchOrderImplicit()) {
// Due to an unresolved bug in CUDA ncclImplicitOrderLaunch is not supported in graphs
if (capturing) { *mode = ncclImplicitOrderSerial; return ncclSuccess; }
if (driver < 0) { NCCLCHECK(ncclCudaDriverVersion(&driver)); }
if (capturing && driver < 12090) { *mode = ncclImplicitOrderSerial; return ncclSuccess; }
*mode = 12030 <= std::min<int>(CUDART_VERSION, driver) ? ncclImplicitOrderLaunch : ncclImplicitOrderSerial;
return ncclSuccess;
}
@ -1386,26 +1409,51 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
plan->workStorageType = persistent ? ncclDevWorkStorageTypePersistent
: ncclDevWorkStorageTypeFifo;
struct ncclKernelPlanBudget budget;
budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
// Non-persistent kernels fill up at most half of our fifo per kernel.
budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
if (planner->isSymColl) {
plan->workStorageType = ncclDevWorkStorageTypeArgs;
// Drain coll tasks first. This is essential since we partition tasks based
// on the work budget and p2p work isn't collective. If we were to drain p2p
// first, the place where we cut the kernel could vary by rank which would
// cause the "shortest channel first" channel picker to have divergent results.
if (planner->nTasksColl != 0) {
NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
}
// And only drain p2p tasks once colls are depleted.
if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
}
finishPlan(comm, plan);
if (plan->workBytes != 0) {
struct ncclTaskColl* task = ncclIntruQueueHead(&planner->collTaskQueue);
plan->isSymColl = true;
plan->kernelFn = ncclSymGetKernelPtr((ncclSymKernelId)task->devFuncId, task->opDev.op, task->datatype);
plan->threadPerBlock = task->nWarps*WARP_SIZE;
plan->channelMask = uint64_t(-1) >> (64-task->nMaxChannels);
plan->kernelArgsSize = sizeof(struct ncclSymDevArgs);
plan->kernelSymArgs = ncclMemoryStackAlloc<struct ncclSymDevArgs>(&comm->memScoped);
plan->kernelSymArgs->comm = comm->symDevComm;
plan->kernelSymArgs->rootRank = task->root;
plan->kernelSymArgs->redOpArg = task->opDev.scalarArg;
plan->kernelSymArgs->nElts = task->count;
plan->kernelSymArgs->input = (char*)task->sendbuff;
plan->kernelSymArgs->output = (char*)task->recvbuff;
planner->nTasksColl -= 1;
ncclIntruQueueEnqueue(&planner->planQueue, plan);
INFO(NCCL_TUNING, "%s [Symmetric]: %ld Bytes -> Kernel %s nchannels %d nthreads %d",
ncclFuncToString(task->func), task->count * ncclTypeSize(task->datatype), ncclSymKernelIdToString(task->devFuncId), task->nMaxChannels, plan->threadPerBlock);
nPlans += 1;
} else {
struct ncclKernelPlanBudget budget;
budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
// Non-persistent kernels fill up at most half of our fifo per kernel.
budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
// Drain coll tasks first. This is essential since we partition tasks based
// on the work budget and p2p work isn't collective. If we were to drain p2p
// first, the place where we cut the kernel could vary by rank which would
// cause the "shortest channel first" channel picker to have divergent results.
if (planner->nTasksColl != 0) {
NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
}
// And only drain p2p tasks once colls are depleted.
if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
}
finishPlan(comm, plan);
if (plan->workBytes != 0) {
ncclIntruQueueEnqueue(&planner->planQueue, plan);
nPlans += 1;
}
}
} while (planner->nTasksColl + planner->nTasksP2p != 0);
@ -1428,6 +1476,7 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
bool capturing = ncclCudaGraphValid(planner->capturingGraph);
enum ncclImplicitOrder implicitOrder;
cudaError_t status = cudaSuccess;
NCCLCHECKGOTO(getImplicitOrder(&implicitOrder, capturing), result, failure);
if (implicitOrder != ncclImplicitOrderNone) {
@ -1439,7 +1488,8 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
NCCLCHECKGOTO(ncclStreamWaitStream(launchStream, launchOrder, comm->sharedRes->scratchEvent), result, failure);
}
if (persistent || comm->sharedRes->persistentRefs != 0 || ncclCudaLaunchBlocking || __atomic_load_n(&comm->sharedRes->noncapturedRefs, __ATOMIC_ACQUIRE)) {
if (!persistent && comm->sharedRes->persistentRefs) status = cudaEventQuery(comm->sharedRes->hostStream.serialEvent);
if (persistent || ncclCudaLaunchBlocking || status == cudaErrorNotReady) {
// We have to launch host tasks to push proxy args. We are careful to only
// do this if necessary since host tasks impose a high performance cost in CUDA.
bool acquired = false;
@ -1450,7 +1500,6 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
acquired = true;
NCCLCHECKGOTO(ncclStrongStreamAcquire(planner->capturingGraph, &comm->sharedRes->hostStream, /*concurrent=*/false, &hostStream), result, failure);
}
if (!persistent) ncclAtomicRefCountIncrement(&comm->sharedRes->noncapturedRefs);
plan->isHostCbEnq = true;
CUDACHECKGOTO(cudaLaunchHostFunc(hostStream, hostStreamPlanCallback, plan), result, failure);
}
@ -1485,6 +1534,8 @@ ncclResult_t ncclLaunchKernelBefore_NoUncapturedCuda(struct ncclComm* comm, stru
NCCL_PARAM(MemSyncDomain, "MEM_SYNC_DOMAIN", cudaLaunchMemSyncDomainRemote);
#endif
NCCL_PARAM(NvlinkUtilCentricSchedEnable, "NVLINK_UTIL_CENTRIC_SCHED_ENABLE", 0);
ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan) {
ncclResult_t ret = ncclSuccess;
struct ncclKernelPlanner* planner = &comm->planner;
@ -1512,7 +1563,7 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
unsigned int clusterSize = (compCap >= 90) ? comm->config.cgaClusterSize : 0;
CUlaunchConfig launchConfig = {0};
CUlaunchAttribute launchAttrs[4] = {};
CUlaunchAttribute launchAttrs[6] = {};
int attrs = 0;
/* Cooperative Group Array (CGA)
* On sm90 and later we have an extra level of hierarchy where we
@ -1549,6 +1600,18 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
launchAttrs[attrs].value.launchCompletionEvent.flags = 0;
attrs++;
}
if (comm->planner.isSymColl && compCap >= 90 && driverVersion >= 12030) {
launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION;
launchAttrs[attrs].value.programmaticStreamSerializationAllowed = 1;
attrs++;
}
#endif
#if CUDART_VERSION >= 13000
if (compCap >= 90 && driverVersion >= 13000) {
launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_NVLINK_UTIL_CENTRIC_SCHEDULING;
launchAttrs[attrs].value.nvlinkUtilCentricScheduling = ncclParamNvlinkUtilCentricSchedEnable();
attrs++;
}
#endif
launchConfig.gridDimX = grid.x;
launchConfig.gridDimY = grid.y;
@ -1560,7 +1623,6 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
launchConfig.attrs = launchAttrs;
launchConfig.numAttrs = attrs;
launchConfig.hStream = launchStream;
CUCHECKGOTO(cuLaunchKernelEx(&launchConfig, fn, nullptr, extra), ret, do_return);
#endif
} else {
@ -1573,21 +1635,30 @@ do_return:
}
ncclResult_t ncclLaunchKernelAfter_NoCuda(struct ncclComm* comm, struct ncclKernelPlan* plan) {
if (!(plan->persistent || ncclCudaLaunchBlocking || plan->isHostCbEnq)) {
// We are not using the host stream for proxy ops and reclamation submission.
if (!plan->isHostCbEnq) {
// We are not using the host stream for proxy ops and reclamation submission, so call
// hostStreamPlanTask directly
NCCLCHECK(hostStreamPlanTask(comm, plan));
} else {
// We are using the host stream for proxy ops and reclamation submission.
// Only plans with proxy ops have a callback pushed by ncclLaunchPrepare.
// Since non-persistent plans also require reclamation, we have to do it
// here.
if (!plan->persistent && !plan->hasProxyOps) {
ncclIntruQueueMpscEnqueue(&comm->callbackQueue, &plan->reclaimer);
}
}
return ncclSuccess;
}
namespace {
struct KernelFinishCallback {
struct ncclCommEventCallback base;
uint32_t workFifoConsumed;
};
ncclResult_t KernelFinishCallback_fn(
struct ncclComm* comm, struct ncclCommEventCallback* cb
) {
struct KernelFinishCallback* me = (struct KernelFinishCallback*)cb;
comm->workFifoConsumed = me->workFifoConsumed;
CUDACHECK(cudaEventDestroy(me->base.event));
free(me);
return ncclSuccess;
}
}
ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
struct ncclKernelPlanner* planner = &comm->planner;
if (!ncclIntruQueueEmpty(&planner->planQueue)) {
@ -1597,7 +1668,21 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
cudaStream_t launchStream = planner->streams->stream; // First user stream gets launch
cudaStream_t deviceStream, launchOrder;
CUDACHECK(cudaEventRecord(comm->sharedRes->scratchEvent, launchStream));
cudaEvent_t finishedEvent = comm->sharedRes->scratchEvent;
CUDACHECK(cudaEventRecord(finishedEvent, launchStream));
if (comm->workFifoProduced - comm->workFifoProducedLastRecorded > comm->workFifoBytes/8) {
comm->workFifoProducedLastRecorded = comm->workFifoProduced;
struct KernelFinishCallback* cb;
NCCLCHECK(ncclCalloc(&cb, 1));
cb->base.event = finishedEvent;
cb->base.fn = KernelFinishCallback_fn;
cb->workFifoConsumed = comm->workFifoProduced;
ncclIntruQueueEnqueue(&comm->eventCallbackQueue, &cb->base);
// We just stole scratchEvent so must create a new one.
CUDACHECK(cudaEventCreateWithFlags(&comm->sharedRes->scratchEvent, cudaEventDisableTiming));
}
// deviceStream waits on userStream[0]
NCCLCHECK(ncclStrongStreamAcquiredWorkStream(planner->capturingGraph, &comm->sharedRes->deviceStream, /*concurrent=*/false, &deviceStream));
@ -1606,13 +1691,13 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
// on launchStream as a fast-forward. When building CUDA graphs fast forwards should
// be handled specially so as not to create graphs with a blowup in the number of edges.
// So we could do this:
// CUDACHECK(cudaStreamWaitEvent(deviceStream, comm->sharedRes->scratchEvent, 0));
// CUDACHECK(cudaStreamWaitEvent(deviceStream, finishedEvent, 0));
// But instead we do:
NCCLCHECK(ncclStreamAdvanceToEvent(planner->capturingGraph, deviceStream, comm->sharedRes->scratchEvent));
NCCLCHECK(ncclStreamAdvanceToEvent(planner->capturingGraph, deviceStream, finishedEvent));
// Each userStream[i] waits on userStream[0]
for (struct ncclCudaStreamList* l=planner->streams->next; l != nullptr; l = l->next) {
CUDACHECK(cudaStreamWaitEvent(l->stream, comm->sharedRes->scratchEvent, 0));
CUDACHECK(cudaStreamWaitEvent(l->stream, finishedEvent, 0));
}
bool capturing = ncclCudaGraphValid(planner->capturingGraph);
enum ncclImplicitOrder implicitOrder;
@ -1623,7 +1708,7 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
// Incorporate launch event into per-device (context) launch order.
NCCLCHECK(ncclStrongStreamAcquiredWorkStream(planner->capturingGraph, &comm->context->launchOrder, concurrent, &launchOrder));
// If we don't have launch events (requires CUDA 12.3) then just use completion event (serialize execution).
CUDACHECK(cudaStreamWaitEvent(launchOrder, implicitOrder == ncclImplicitOrderLaunch ? comm->sharedRes->launchEvent : comm->sharedRes->scratchEvent));
CUDACHECK(cudaStreamWaitEvent(launchOrder, implicitOrder == ncclImplicitOrderLaunch ? comm->sharedRes->launchEvent : finishedEvent));
// Release launchOrder as acquired in ncclLaunchPrepare()
NCCLCHECK(ncclStrongStreamRelease(planner->capturingGraph, &comm->context->launchOrder, concurrent));
}
@ -1645,7 +1730,7 @@ static inline ncclResult_t getCollNetSupport(
if (info->opDev.op == ncclDevPreMulSum || info->opDev.op == ncclDevSumPostDiv) {
netOp = ncclSum;
}
*collNetSupport = comm->collNetSupport;
*collNetSupport = comm->config.collnetEnable;
switch (info->func) {
case ncclFuncAllReduce:
case ncclFuncReduce:
@ -1683,10 +1768,8 @@ static ncclResult_t updateCollCostTable(
if ((a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) && collNetSupport != 1) continue;
// CollNetDirect is only supported for up to 8 local GPUs
if (a == NCCL_ALGO_COLLNET_DIRECT && comm->maxLocalRanks > NCCL_MAX_DIRECT_ARITY+1) continue;
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) && nvlsSupport != 1 && info->func != ncclFuncAllGather) continue;
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) && (!nvlsSupport || (info->func != ncclFuncAllReduce && comm->localRanks > NCCL_MAX_NVLS_ARITY))) continue;
if (a == NCCL_ALGO_NVLS && collNetSupport != 1 && comm->nNodes > 1) continue;
/* now we only support single-node NVLS allgather and reducescatter */
if (a == NCCL_ALGO_NVLS && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter) && (comm->nNodes > 1 || comm->nRanks > NCCL_MAX_NVLS_ARITY)) continue;
/* Tree reduceScatter doesn't support scaling yet */
if (a == NCCL_ALGO_PAT && info->func == ncclFuncReduceScatter
&& (info->opDev.op == ncclDevPreMulSum || info->opDev.op == ncclDevSumPostDiv)) continue;
@ -1801,7 +1884,14 @@ static ncclResult_t getAlgoInfo(
struct ncclComm* comm, struct ncclTaskColl* info,
int collNetSupport, int nvlsSupport, int numPipeOps, ncclSimInfo_t* simInfo/* = NULL*/
) {
size_t nBytes = ncclTypeSize(info->datatype)*ncclFuncMaxSendRecvCount(info->func, comm->nRanks, info->count);
size_t elementSize = ncclTypeSize(info->datatype);
size_t nBytes = elementSize * ncclFuncMaxSendRecvCount(info->func, comm->nRanks, info->count);
struct ncclReg* regSendBuf = NULL;
struct ncclReg* regRecvBuf = NULL;
int regBuff;
bool isSendValid, isRecvValid;
size_t sendbuffSize = elementSize * ncclFuncSendCount(info->func, comm->nRanks, info->count);
size_t recvbuffSize = elementSize * ncclFuncRecvCount(info->func, comm->nRanks, info->count);
info->algorithm = NCCL_ALGO_UNDEF;
info->protocol = NCCL_PROTO_UNDEF;
int nMaxChannels = 0;
@ -1809,20 +1899,42 @@ static ncclResult_t getAlgoInfo(
initCollCostTable((float **)collCostTable);
NCCLCHECK(updateCollCostTable(comm, info, nBytes, collNetSupport, nvlsSupport, numPipeOps, (float **)collCostTable));
if (comm->tuner != NULL) {
size_t elementSize = ncclTypeSize(info->datatype);
size_t sendbuffSize = elementSize*ncclFuncSendCount(info->func, comm->nRanks, info->count);
size_t recvbuffSize = elementSize*ncclFuncRecvCount(info->func, comm->nRanks, info->count);
struct ncclReg* regSendBuf;
struct ncclReg* regRecvBuf;
NCCLCHECK(ncclRegFind(comm, info->sendbuff, sendbuffSize, &regSendBuf));
NCCLCHECK(ncclRegFind(comm, info->recvbuff, recvbuffSize, &regRecvBuf));
int regBuff = ((regSendBuf && regRecvBuf) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister()));
NCCLCHECK(ncclRegLocalIsValid(regSendBuf, &isSendValid));
NCCLCHECK(ncclRegLocalIsValid(regRecvBuf, &isRecvValid));
regBuff = (regSendBuf && regRecvBuf && isSendValid && isRecvValid) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister());
NCCLCHECK(comm->tuner->getCollInfo(
comm->tunerContext, info->func, nBytes,
numPipeOps, (float **)collCostTable, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
regBuff, &nMaxChannels));
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
} else {
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
// NCCL_CTA_POLICY_EFFICIENCY requires user (non-symmetric) buffer registration (currently unsupported with MNNVL)
if (comm->config.CTAPolicy == NCCL_CTA_POLICY_EFFICIENCY && ncclGetEnv("NCCL_ALGO") == NULL && ncclGetEnv("NCCL_PROTO") == NULL && !comm->MNNVL) {
// Make the algorithm selection based on buffer registration status.
// Other specialized policies for algorithm/protocol selection may be added in the future.
NCCLCHECK(ncclRegFind(comm, info->sendbuff, sendbuffSize, &regSendBuf));
NCCLCHECK(ncclRegFind(comm, info->recvbuff, recvbuffSize, &regRecvBuf));
NCCLCHECK(ncclRegLocalIsValid(regSendBuf, &isSendValid));
NCCLCHECK(ncclRegLocalIsValid(regRecvBuf, &isRecvValid));
regBuff = (regSendBuf && regRecvBuf && isSendValid && isRecvValid) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister());
if (regBuff && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter)) {
if ((comm->nNodes > 1 && collNetSupport && nvlsSupport) || (comm->nNodes == 1 && nvlsSupport)) {
int recChannels;
NCCLCHECK(ncclNvlsRegResourcesQuery(comm, info, &recChannels));
if (recChannels <= info->nMaxChannels) {
info->algorithm = NCCL_ALGO_NVLS;
info->protocol = NCCL_PROTO_SIMPLE;
info->nMaxChannels = recChannels;
info->nWarps = comm->maxThreads[info->algorithm][info->protocol] / WARP_SIZE;
}
}
}
}
}
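A hedged sketch of the selection rule added above, with local stand-in enums rather than NCCL's definitions: when both buffers are registered (or registrable under graph capture) and the collective is AllGather or ReduceScatter, NVLS with the SIMPLE protocol is preferred as long as the registered-NVLS resource query fits within the channel budget.
#include <cstdio>

enum Algo  { ALGO_UNDEF, ALGO_RING, ALGO_NVLS };   // local stand-ins, not NCCL's enum values
enum Proto { PROTO_UNDEF, PROTO_SIMPLE };

struct Choice { Algo algorithm; Proto protocol; int nMaxChannels; };

// regChannels plays the role of the value returned by ncclNvlsRegResourcesQuery().
bool preferNvlsForRegisteredBuffers(bool regBuff, bool isAllGatherOrReduceScatter,
                                    bool nvlsAvailable, int regChannels, Choice* c) {
  if (!(regBuff && isAllGatherOrReduceScatter && nvlsAvailable)) return false;
  if (regChannels > c->nMaxChannels) return false;  // would exceed the allowed channel count
  c->algorithm = ALGO_NVLS;
  c->protocol = PROTO_SIMPLE;
  c->nMaxChannels = regChannels;
  return true;
}

int main() {
  Choice c{ALGO_UNDEF, PROTO_UNDEF, /*nMaxChannels=*/16};
  if (preferNvlsForRegisteredBuffers(true, true, true, /*regChannels=*/8, &c))
    printf("picked NVLS/SIMPLE with %d channels\n", c.nMaxChannels);
}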
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
info->nMaxChannels = nMaxChannels == 0 ? info->nMaxChannels : nMaxChannels;
return ncclSuccess;
}
@ -1892,16 +2004,20 @@ static ncclResult_t calcCollChunking(
while (nBytes / (nChannels * chunkSize) < comm->channels[0].collnetChain.depth * 8 && chunkSize > 65536) chunkSize /= 2;
while (nBytes / (nChannels * chunkSize) < comm->channels[0].collnetChain.depth && chunkSize > 32768) chunkSize /= 2;
} else if (info->algorithm == NCCL_ALGO_NVLS) {
int maxChunkSize = comm->nvlsChunkSize;
if (comm->nNodes > 1 && comm->bandwidths[ncclFuncAllReduce][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] < 150) maxChunkSize = 32768;
if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
// coverity[overflow_before_widen]
uint64_t concurrentOps = nChannels * comm->channels[0].nvls.nHeads;
if ((nBytes < (64 * (concurrentOps * chunkSize))) && (chunkSize > 65536)) chunkSize = 65536;
if ((nBytes < (8 * (concurrentOps * chunkSize))) && (chunkSize > 32768)) chunkSize = 32768;
if ((nBytes < (2 * (concurrentOps * chunkSize))) && (chunkSize > 16384)) chunkSize = 16384;
if ((info->regBufType & NCCL_NVLS_REG_BUFFER) && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter)) {
chunkSize = comm->buffSizes[NCCL_PROTO_SIMPLE] / NCCL_STEPS;
} else {
int maxChunkSize = comm->nvlsChunkSize;
if (comm->nNodes > 1 && comm->bandwidths[ncclFuncAllReduce][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] < 150) maxChunkSize = 32768;
if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
// coverity[overflow_before_widen]
uint64_t concurrentOps = nChannels * comm->channels[0].nvls.nHeads;
if ((nBytes < (64 * (concurrentOps * chunkSize))) && (chunkSize > 65536)) chunkSize = 65536;
if ((nBytes < (8 * (concurrentOps * chunkSize))) && (chunkSize > 32768)) chunkSize = 32768;
if ((nBytes < (2 * (concurrentOps * chunkSize))) && (chunkSize > 16384)) chunkSize = 16384;
}
} else if (info->algorithm == NCCL_ALGO_NVLS_TREE) {
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
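An illustrative reimplementation of the chunk-size clamping cascade in the NVLS branch above; the 64x/8x/2x thresholds and the 64K/32K/16K floors come from that code, while the starting values in main() are made up.
#include <cstdint>
#include <cstdio>

uint64_t clampNvlsChunk(uint64_t nBytes, uint64_t nChannels, uint64_t nHeads,
                        uint64_t chunkSize, uint64_t maxChunkSize) {
  if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
  uint64_t concurrentOps = nChannels * nHeads;  // computed in 64 bits so the products below cannot overflow
  if (nBytes < 64 * concurrentOps * chunkSize && chunkSize > 65536) chunkSize = 65536;
  if (nBytes <  8 * concurrentOps * chunkSize && chunkSize > 32768) chunkSize = 32768;
  if (nBytes <  2 * concurrentOps * chunkSize && chunkSize > 16384) chunkSize = 16384;
  return chunkSize;
}

int main() {
  // 8 MiB spread over 16 channels * 4 heads: the 128 KiB starting chunk is clamped to 32 KiB.
  printf("%llu\n", (unsigned long long)clampNvlsChunk(8ull << 20, 16, 4, 128 << 10, 128 << 10));
}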
@ -2045,7 +2161,7 @@ static ncclResult_t calcCollChunking(
proxyOp->reg = 0;
}
if (pattern == ncclPatternCollnetDirect) {
if (pattern == ncclPatternCollnetDirect || pattern == ncclPatternNvls) {
proxyOp->specifics.collnetDirect.nNodes = comm->nNodes;
proxyOp->specifics.collnetDirect.node = comm->node;
if (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter) {
@ -2168,7 +2284,7 @@ static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
bool isSendNotRecv = info->coll == ncclFuncSend;
// Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
ncclGroupCommJoin(info->comm);
ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
struct ncclTaskP2p* p2p = ncclMemoryPoolAlloc<struct ncclTaskP2p>(&comm->memPool_ncclTaskP2p, &comm->memPermanent);
p2p->func = info->coll;
p2p->buff = (void*)info->recvbuff;
@ -2235,7 +2351,7 @@ static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
return ncclSuccess;
} else {
// Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
ncclGroupCommJoin(info->comm);
ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
struct ncclTaskColl* t = ncclMemoryPoolAlloc<struct ncclTaskColl>(&comm->memPool_ncclTaskColl, &comm->memPermanent);
t->func = info->coll;
t->sendbuff = info->sendbuff;

View File

@ -258,7 +258,7 @@ static ncclResult_t connectNvls(struct ncclComm* comm, int* nvlsHeads, int nHead
channel->nvls.out = -1; // NVLS+SHARP not yet implemented.
channel->nvls.headRank = headRank;
channel->nvls.treeUp = channel->nvls.treeDown[0] = channel->nvls.treeDown[1] = channel->nvls.treeDown[2] = -1;
if (comm->collNetSupport && channel->nvls.headRank != -1) channel->nvls.out = comm->nRanks;
if (comm->config.collnetEnable && channel->nvls.headRank != -1) channel->nvls.out = comm->nRanks;
}
if (comm->nNodes == 1) return ncclSuccess;
@ -330,7 +330,7 @@ int ncclMinNchannels() {
if (ncclParamMinNrings() != -2) minNchannels = ncclParamMinNrings();
if (ncclParamMinNchannels() != -2) minNchannels = ncclParamMinNchannels();
if (minNchannels > MAXCHANNELS) {
WARN("User asked for a minimum of %d channels, limiting to %d", minNchannels, MAXCHANNELS);
INFO(NCCL_GRAPH|NCCL_ENV, "User asked for a minimum of %d channels, limiting to %d", minNchannels, MAXCHANNELS);
minNchannels = MAXCHANNELS;
}
if (minNchannels < 0) minNchannels = 0;
@ -346,7 +346,7 @@ int ncclMaxNchannels() {
maxNchannels = std::min(maxNchannels, ncclDevMaxChannelsForArgsBytes(ncclParamWorkArgsBytes()));
if (maxNchannels > MAXCHANNELS) maxNchannels = MAXCHANNELS;
if (maxNchannels < 1) {
WARN("User asked for a maximum of %d channels, setting it to 1", maxNchannels);
INFO(NCCL_GRAPH|NCCL_ENV, "User asked for a maximum of %d channels, setting it to 1", maxNchannels);
maxNchannels = 1;
}
return maxNchannels;
@ -379,7 +379,7 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
int nNodes = comm->nNodes;
int nChannels = comm->nChannels;
int minHeadNum = INT_MAX;
int shared = parent && parent->nvlsSupport && parent->config.splitShare;
int shared = parent && parent->nvlsSupport && parent->shareResources;
NCCLCHECK(ncclCalloc(&ringRecv, nNodes*MAXCHANNELS));
NCCLCHECKGOTO(ncclCalloc(&ringSend, nNodes*MAXCHANNELS), ret, fail);
NCCLCHECKGOTO(ncclCalloc(&ringPrev, nranks*MAXCHANNELS), ret, fail);
@ -452,7 +452,7 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
nChannels = comm->nChannels = std::min(MAXCHANNELS,nChannels*2);
// Setup CollNet
if (comm->collNetSupport == 1) {
if (comm->config.collnetEnable) {
struct ncclTopoGraph* collNetChainGraph = graphs[NCCL_ALGO_COLLNET_CHAIN];
// Add more channels to saturate intra-node bandwidth, except the 1 PPN case
if (collNetChainGraph->bwIntra > collNetChainGraph->bwInter && comm->nRanks > comm->nNodes) {

View File

@ -175,6 +175,13 @@ ncclResult_t ncclGetLocalCpu(struct ncclTopoSystem* system, int gpu, int* retCpu
return ncclSuccess;
}
static int mergePathType(int type0, int type1) {
int max = std::max(type0, type1);
int min = std::min(type0, type1);
if (max == PATH_PHB && min == PATH_C2C) return PATH_P2C;
else return max;
}
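A quick standalone check of the merge rule above, using the path constants this change introduces in topo.h later in the diff (PATH_C2C=3, PATH_P2C=6, PATH_PHB=8).
#include <algorithm>
#include <cassert>

#define PATH_C2C 3
#define PATH_P2C 6
#define PATH_PHB 8

static int mergePathType(int type0, int type1) {
  int max = std::max(type0, type1);
  int min = std::min(type0, type1);
  return (max == PATH_PHB && min == PATH_C2C) ? PATH_P2C : max;
}

int main() {
  assert(mergePathType(PATH_C2C, PATH_PHB) == PATH_P2C);  // C2C on one leg, PHB on the other -> P2C
  assert(mergePathType(PATH_C2C, PATH_C2C) == PATH_C2C);  // otherwise the worse (larger) type wins
}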
static ncclResult_t addInterStep(struct ncclTopoSystem* system, int tx, int ix, int t1, int i1, int t2, int i2) {
struct ncclTopoNode* cpuNode = system->nodes[tx].nodes+ix;
struct ncclTopoNode* srcNode = system->nodes[t1].nodes+i1;
@ -187,7 +194,7 @@ static ncclResult_t addInterStep(struct ncclTopoSystem* system, int tx, int ix,
// Update path characteristics
srcNode->paths[t2][i2].count = l;
srcNode->paths[t2][i2].type = std::max(srcNode->paths[tx][ix].type, cpuNode->paths[t2][i2].type);
srcNode->paths[t2][i2].type = mergePathType(srcNode->paths[tx][ix].type, cpuNode->paths[t2][i2].type);
if (tx == GPU) srcNode->paths[t2][i2].type = PATH_PXN;
srcNode->paths[t2][i2].bw = std::min(srcNode->paths[tx][ix].bw, cpuNode->paths[t2][i2].bw);
return ncclSuccess;
@ -214,7 +221,7 @@ ncclResult_t ncclGetLevel(int* level, const char* disableEnv, const char* levelE
const char* str = ncclGetEnv(disableEnv);
if (str) {
int disable = strtol(str, NULL, 0);
if (disable == 1) l = 0;
if (disable == 1) l = PATH_LOC;
if (l >= 0) INFO(NCCL_ALL, "%s set by environment to %d", disableEnv, disable);
}
}
@ -247,7 +254,18 @@ ncclResult_t ncclGetLevel(int* level, const char* disableEnv, const char* levelE
NCCL_PARAM(IgnoreDisabledP2p, "IGNORE_DISABLED_P2P", 0);
int ncclTopoUserP2pLevel = -1;
static int ncclTopoUserP2pLevel = -1; // Initially "uninitialized". When initialized but unset, changes to -2.
// Gets the user-provided value of NCCL_P2P_LEVEL/NCCL_P2P_DISABLE. If the user did not provide any, the value
// of the "level" argument is left unchanged.
ncclResult_t ncclGetUserP2pLevel(int* level) {
if (ncclTopoUserP2pLevel == -1)
NCCLCHECK(ncclGetLevel(&ncclTopoUserP2pLevel, "NCCL_P2P_DISABLE", "NCCL_P2P_LEVEL"));
if (ncclTopoUserP2pLevel != -2)
*level = ncclTopoUserP2pLevel;
return ncclSuccess;
}
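A simplified sketch of the parse-once caching pattern used by ncclGetUserP2pLevel, mirroring the sentinel convention from the comment above (-1 means not parsed yet, -2 means parsed but unset). The real ncclGetLevel also handles NCCL_P2P_DISABLE and symbolic level names; only the numeric NCCL_P2P_LEVEL case is sketched.
#include <cstdio>
#include <cstdlib>

static int gUserP2pLevel = -1;  // -1: env not parsed yet; -2: parsed, user set nothing

// Leaves *level untouched unless the user provided NCCL_P2P_LEVEL.
static void getUserP2pLevel(int* level) {
  if (gUserP2pLevel == -1) {
    const char* s = std::getenv("NCCL_P2P_LEVEL");
    gUserP2pLevel = s ? (int)std::strtol(s, nullptr, 0) : -2;
  }
  if (gUserP2pLevel != -2) *level = gUserP2pLevel;
}

int main() {
  int level = 42;  // caller's computed default
  getUserP2pLevel(&level);
  printf("p2pLevel=%d\n", level);
}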
ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* system, int rank1, int rank2,
int* p2p, int *read, int* intermediateRank) {
int mnnvl = 0;
@ -275,9 +293,9 @@ ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* syst
// Get GPUs from topology
int g1, g2;
NCCLCHECK(ncclTopoRankToIndex(system, rank1, &g1));
NCCLCHECK(ncclTopoRankToIndex(system, rank1, &g1, /*showWarn=*/true));
struct ncclTopoNode* gpu1 = system->nodes[GPU].nodes+g1;
if (ncclTopoRankToIndex(system, rank2, &g2) == ncclInternalError) {
if (ncclTopoRankToIndex(system, rank2, &g2, /*showWarn=*/false) == ncclInternalError) {
// GPU not found, we can't use p2p.
return ncclSuccess;
}
@ -302,15 +320,8 @@ ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* syst
if ((arch == NCCL_TOPO_CPU_ARCH_X86 && vendor == NCCL_TOPO_CPU_VENDOR_AMD) && system->nodes[GPU].count <= 2) p2pLevel = PATH_SYS;
// User override
if (ncclTopoUserP2pLevel == -1)
NCCLCHECK(ncclGetLevel(&ncclTopoUserP2pLevel, "NCCL_P2P_DISABLE", "NCCL_P2P_LEVEL"));
if (ncclTopoUserP2pLevel != -2) {
p2pLevel = ncclTopoUserP2pLevel;
goto compare;
}
NCCLCHECK(ncclGetUserP2pLevel(&p2pLevel));
compare:
// Compute the PCI distance and compare with the p2pLevel.
if (path->type <= p2pLevel) *p2p = 1;
@ -378,7 +389,8 @@ NCCL_PARAM(NetGdrRead, "NET_GDR_READ", -2);
int ncclTopoUserGdrLevel = -1;
const char* ncclTopoGdrModeStr[ncclTopoGdrModeNum] = { "Disabled", "Default", "PCI" };
NCCL_PARAM(NetGdrC2c, "NET_GDR_C2C", 0);
// On C2C platforms use GDRDMA on NICs which are connected to the CPUs
NCCL_PARAM(NetGdrC2c, "NET_GDR_C2C", 1);
ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t netId, int read, enum ncclTopoGdrMode* gdrMode) {
*gdrMode = ncclTopoGdrModeDisable;
@ -387,7 +399,7 @@ ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t n
int n, g;
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &n));
struct ncclTopoNode* net = system->nodes[NET].nodes+n;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
// Check that both the NIC and GPUs support it
@ -423,29 +435,29 @@ ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t n
// In case of PXN, use the intermediate GPU distance instead
int proxyRank;
NCCLCHECK(ncclTopoGetIntermediateRank(system, gpu->gpu.rank, netId, &proxyRank));
NCCLCHECK(ncclTopoRankToIndex(system, proxyRank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, proxyRank, &g, /*showWarn=*/true));
gpu = system->nodes[GPU].nodes+g;
distance = gpu->paths[NET][n].type;
}
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (ncclParamNetGdrC2c() && distance == PATH_PHB && gpu->paths[CPU][c].type == PATH_C2C) {
// On C2C platforms we can still use GDRDMA on NICs connected to the CPUs
INFO(NCCL_NET, "GPU %d / HCA %lx connected to CPU %d via C2C link", rank, netId, c);
// On C2C platforms we can still use GDRDMA on NICs connected to the CPUs
if (ncclParamNetGdrC2c() && distance == PATH_P2C) {
INFO(NCCL_GRAPH | NCCL_NET, "GPU %d / HCA %lx connected via C2C link", rank, netId);
distance = PATH_C2C;
}
if (distance > netGdrLevel) {
INFO(NCCL_NET,"GPU Direct RDMA Disabled for GPU %d / HCA %lx (distance %d > %d)", rank, netId, distance, netGdrLevel);
INFO(NCCL_GRAPH|NCCL_NET,"GPU Direct RDMA Disabled for GPU %d / HCA %lx (distance %d > %d)", rank, netId, distance, netGdrLevel);
return ncclSuccess;
}
// Force PCIe mapping if path goes through PCI on a C2C system
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (gpu->paths[CPU][c].type == PATH_C2C && distance != PATH_C2C) *gdrMode = ncclTopoGdrModePci;
else *gdrMode = ncclTopoGdrModeDefault;
INFO(NCCL_NET,"GPU Direct RDMA Enabled for GPU %d / HCA %lx (distance %d <= %d), read %d mode %s", rank, netId, distance, netGdrLevel, read, ncclTopoGdrModeStr[*gdrMode]);
INFO(NCCL_GRAPH|NCCL_NET,"GPU Direct RDMA Enabled for GPU %d / HCA %lx (distance %d <= %d), read %d mode %s", rank, netId, distance, netGdrLevel, read, ncclTopoGdrModeStr[*gdrMode]);
return ncclSuccess;
}
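A hypothetical condensed form of the GDR decision above, using path constants from the topo.h hunk in this change; it sketches the control flow only and leaves out the PXN intermediate-GPU resolution and the logging.
#include <cstdio>

#define PATH_C2C 3
#define PATH_P2C 6

enum GdrMode { GdrDisable, GdrDefault, GdrPci };

GdrMode pickGdrMode(int distance, int netGdrLevel, int gpuToCpuPathType, bool netGdrC2c) {
  if (netGdrC2c && distance == PATH_P2C) distance = PATH_C2C;  // C2C-attached NIC still gets GDRDMA
  if (distance > netGdrLevel) return GdrDisable;               // NIC too far from the GPU
  // Force the PCIe mapping when the GPU<->CPU link is C2C but the NIC path itself is not.
  return (gpuToCpuPathType == PATH_C2C && distance != PATH_C2C) ? GdrPci : GdrDefault;
}

int main() {
  // A P2C-distance NIC with the C2C fast path enabled and netGdrLevel of PATH_P2C: prints 1 (GdrDefault).
  printf("%d\n", (int)pickGdrMode(PATH_P2C, PATH_P2C, PATH_C2C, true));
}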
@ -480,7 +492,7 @@ ncclResult_t ncclTopoNeedFlush(struct ncclComm* comm, int64_t netId, int netDev,
if (props.forceFlush == 1 || ncclParamNetForceFlush()) return ncclSuccess;
int g;
struct ncclTopoSystem* system = comm->topo;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
// Flush is required on Ampere and earlier
if (gpu->gpu.cudaCompCap >= 90) *flush = 0;
@ -506,8 +518,8 @@ ncclResult_t ncclTopoCheckNet(struct ncclTopoSystem* system, int rank1, int rank
*net = 1;
// First check the current GPU-to-GPU speed.
int g1, g2;
if (ncclTopoRankToIndex(system, rank1, &g1) != ncclSuccess ||
ncclTopoRankToIndex(system, rank2, &g2) != ncclSuccess) {
if (ncclTopoRankToIndex(system, rank1, &g1, /*showWarn=*/false) != ncclSuccess ||
ncclTopoRankToIndex(system, rank2, &g2, /*showWarn=*/false) != ncclSuccess) {
return ncclSuccess;
}
@ -533,7 +545,7 @@ ncclResult_t ncclTopoGetIntermediateRank(struct ncclTopoSystem* system, int rank
// Get GPU and NET
int n, g;
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &n));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
struct ncclTopoLinkList* path = gpu->paths[NET]+n;
if (path->type == PATH_PXN) {
@ -601,6 +613,8 @@ ncclResult_t ncclTopoGetPxnRanks(struct ncclComm* comm, int** intermediateRanks,
return ncclSuccess;
}
NCCL_PARAM(PxnC2c, "PXN_C2C", 0);
ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm* comm) {
// Precompute paths between GPUs/NICs.
@ -659,6 +673,20 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
}
}
}
// update the GPU -> NIC path in the case of C2C + PHB
for (int n = 0; n < system->nodes[NET].count; n++) {
struct ncclTopoNode* netNode = system->nodes[NET].nodes + n;
for (int g = 0; g < system->nodes[GPU].count; g++) {
struct ncclTopoNode* gpuNode = system->nodes[GPU].nodes + g;
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (c == -1) continue;
if (mergePathType(gpuNode->paths[CPU][c].type, netNode->paths[CPU][c].type) == PATH_P2C) {
gpuNode->paths[NET][n].type = std::min(PATH_P2C, gpuNode->paths[NET][n].type);
netNode->paths[GPU][g].type = std::min(PATH_P2C, netNode->paths[GPU][g].type);
}
}
}
// Update paths for NICs (no GPU Direct, PXN, ...)
for (int n=0; n<system->nodes[NET].count; n++) {
@ -674,15 +702,19 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
// PXN = PCI + NVLink.
struct ncclTopoNode* peerNode = system->nodes[GPU].nodes+localGpuIndex;
// Only use PXN for NIC n if remote GPU p ...
if (peerNode->paths[NET][n].type <= PATH_PXB && // Is connected to the NIC through PCI
peerNode->paths[GPU][g].type <= PATH_NVL && // Is connected to us through NVLink
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) && // Is on the same node as us
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || // Has either higher BW to that NIC
gpu->paths[NET][n].type > PATH_PXB)) // or avoids going through a CPU
// We can use that GPU as relay to communicate with that NIC.
// Only enabling it in the GPU->NIC direction for now to favor
// receiving locally and sending remotely (consistent with net.cc)
NCCLCHECK(addInterStep(system, GPU, localGpuIndex, GPU, g, NET, n));
int pxnType = ncclParamPxnC2c() ? PATH_P2C : PATH_PXB;
if (/* (1) is connected to the NIC with PxN type*/
peerNode->paths[NET][n].type <= pxnType &&
/* and (2) is connected to us through NVLink */
peerNode->paths[GPU][g].type <= PATH_NVL &&
/* and (3) is on the same node as us */
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) &&
/* and (4) either has higher bw to that NIC or avoids going through the CPU (path type > PATH_PXN) */
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || gpu->paths[NET][n].type > PATH_PXN))
// We can use that GPU as relay to communicate with that NIC.
// Only enabling it in the GPU->NIC direction for now to favor
// receiving locally and sending remotely (consistent with net.cc)
NCCLCHECK(addInterStep(system, GPU, localGpuIndex, GPU, g, NET, n));
}
}
if (gpu->paths[NET][n].type < PATH_PHB) {
@ -699,6 +731,12 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
}
}
}
// Pre-compute NET local gpus to accelerate search
for (int n=0; n<system->nodes[NET].count; n++) {
struct ncclTopoNode* net = system->nodes[NET].nodes+n;
NCCLCHECK(ncclTopoGetLocalGpu(system, net->id, &net->net.localGpu));
}
return ncclSuccess;
}
@ -761,7 +799,7 @@ static ncclResult_t ncclTopoGetNchannels(struct ncclComm* comm, int g /*local gp
int peer;
struct ncclTopoSystem* system = comm->topo;
struct ncclTopoLinkList* path = NULL;
if (ncclTopoRankToIndex(system, peerRank, &peer) == ncclSuccess) {
if (ncclTopoRankToIndex(system, peerRank, &peer, /*showWarn=*/false) == ncclSuccess) {
// Same rank
if (g == peer) {
*nChannels = -1;

View File

@ -137,6 +137,7 @@ static ncclResult_t ncclTopoFollowPath(struct ncclTopoSystem* system, struct ncc
float bw = intra ? graph->bwIntra : graph->bwInter;
int type = intra ? graph->typeIntra : graph->typeInter;
if (path->type >= PATH_DIS) return ncclSuccess;
if (mult == 1 && (path->type > type)) return ncclSuccess;
if (mult == 1 && (graph->pattern == NCCL_TOPO_PATTERN_BALANCED_TREE ||
graph->pattern == NCCL_TOPO_PATTERN_TREE ||
@ -328,8 +329,7 @@ ncclResult_t ncclTopoReplayGetGpu(struct ncclTopoSystem* system, struct ncclTopo
*g = i;
return ncclSuccess;
}
if (*g == -1) return ncclInternalError;
return ncclSuccess;
return ncclInternalError;
}
ncclResult_t ncclTopoSearchRecGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, struct ncclTopoNode* gpu, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time);
@ -437,6 +437,65 @@ ncclResult_t ncclTopoCompareGraphs(struct ncclTopoSystem* system, struct ncclTop
return ncclSuccess;
}
// Add the preferred NICs ordered by GPU first
static ncclResult_t ncclTopoPrefNetsGpuFirst(struct ncclTopoSystem* system, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCount) {
const int nGpus = (gpu == -1) ? system->nodes[GPU].count : 1;
int gpuCount = nGpus;
int gpuIds[NCCL_TOPO_MAX_NODES] = {gpu};
int firstNets[NCCL_TOPO_MAX_NODES];
if (gpu == -1)
for (int g = 0; g < nGpus; g++) gpuIds[g] = g;
for (int c = 0; c < MAXCHANNELS; c++) {
for (int g = 0; g < nGpus; g++) {
if (gpuIds[g] == -1) continue;
int localNet;
int64_t netId;
struct ncclTopoNode* gpu = system->nodes[GPU].nodes + gpuIds[g];
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &localNet));
// store the first net found for each GPU in case of duplicates
if(c == 0) firstNets[g] = localNet;
// if the NET has already been returned for channel 0, that GPU is done
if (c > 0 && firstNets[g] == localNet) {
gpuIds[g] = -1;
gpuCount--;
continue;
}
// only add it to the list if it doesn't already exist
int found = 0;
while (found < (*netCount) && nets[found] != localNet) found++;
if (found == (*netCount)) nets[(*netCount)++] = localNet;
}
if (gpuCount == 0) break;
}
return ncclSuccess;
}
// Add the preferred NICs ordered by channels first
static ncclResult_t ncclTopoPrefNetsChannelFirst(struct ncclTopoSystem* system, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCount) {
for (int g = 0; g < system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
int localNetCount = 0, localNets[MAXCHANNELS];
struct ncclTopoNode* gpu = system->nodes[GPU].nodes + g;
for (int c = 0; c < MAXCHANNELS; c++) {
int64_t netId;
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, localNets + localNetCount));
if (localNetCount > 0 && localNets[localNetCount] == localNets[0]) break;
localNetCount++;
}
// Append NICs to list
for (int i = 0; i < localNetCount; i++) {
int n = localNets[i];
int found = 0;
while (found < (*netCount) && nets[found] != n) found++;
if (found == (*netCount)) nets[(*netCount)++] = n;
}
}
return ncclSuccess;
}
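A toy illustration (made-up preference table) of the two loop orders above: ncclTopoPrefNetsGpuFirst walks channels in the outer loop and GPUs in the inner loop, interleaving NICs across GPUs, while ncclTopoPrefNetsChannelFirst exhausts one GPU's preferred NIC list before moving to the next GPU.
#include <cstdio>
#include <vector>

int main() {
  // pref[g][c]: the NIC that GPU g prefers for channel c (invented values)
  std::vector<std::vector<int>> pref = {{0, 1}, {2, 3}};

  printf("GpuFirst order    :");  // 0 2 1 3
  for (size_t c = 0; c < 2; c++)
    for (size_t g = 0; g < pref.size(); g++) printf(" %d", pref[g][c]);

  printf("\nChannelFirst order:");  // 0 1 2 3
  for (size_t g = 0; g < pref.size(); g++)
    for (size_t c = 0; c < 2; c++) printf(" %d", pref[g][c]);
  printf("\n");
}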
// Build a sorted list of the NETs to try.
//
// "gpu" can be set to -1 to build a list suitable for all GPUs (search start) or to a given gpu
@ -445,39 +504,25 @@ ncclResult_t ncclTopoCompareGraphs(struct ncclTopoSystem* system, struct ncclTop
// The list is built the following way:
// 1. Select NETs starting with those close to GPU(s), based on paths[n].type.
// 2. add other NETs satisfying typeInter but not already in the list.
NCCL_PARAM(ScatterEnable, "MNNVL_SCATTER_NETS_ENABLE", 1);
ncclResult_t ncclTopoSelectNets(struct ncclTopoSystem* system, int typeInter, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCountRet) {
ncclResult_t ret = ncclSuccess;
int netCount = 0;
int localNetCount;
int localNets[MAXCHANNELS];
// First add the preferred NICs
for (int g=0; g<system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
localNetCount = 0;
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
for (int c = 0; c<MAXCHANNELS; c++) {
int64_t netId;
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, localNets+localNetCount));
if (localNetCount > 0 && localNets[localNetCount] == localNets[0]) break;
localNetCount++;
}
// Append NICs to list
for (int i=0; i<localNetCount; i++) {
int n = localNets[i];
int found = 0;
while (found<netCount && nets[found] != n) found++;
if (found == netCount) nets[netCount++] = n;
}
// First add the preferred NETs.
if (system->nHosts > 1 && ncclParamScatterEnable()) {
// For MNNVL systems, we sort the devices by GPU first, then by channel
NCCLCHECK(ncclTopoPrefNetsGpuFirst(system, gpu, nets, &netCount));
} else {
// For other systems, we sort the devices by channel first, then by GPU
NCCLCHECK(ncclTopoPrefNetsChannelFirst(system, gpu, nets, &netCount));
}
// Then add others satisfying typeInter
for (int t=0; t <= typeInter; t++) {
for (int g=0; g<system->nodes[GPU].count; g++) {
for (int g = 0; g < system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
localNetCount = 0;
int localNetCount = 0, localNets[MAXCHANNELS];
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
struct ncclTopoLinkList* paths = gpu->paths[NET];
for (int n=0; n<system->nodes[NET].count && n<MAXCHANNELS; n++) {
@ -625,8 +670,7 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
if (graph->pattern == NCCL_TOPO_PATTERN_NVLS || graph->pattern == NCCL_TOPO_PATTERN_COLLNET_DIRECT) {
// NVLS search only tries to find NIC:GPU combinations to compute the heads.
if (graph->nChannels < netCount) {
int gpu;
NCCLCHECK(ncclTopoGetLocalGpu(system, net->id, &gpu));
int gpu = net->net.localGpu;
if (gpu != -1) {
int duplicate = 0;
// check whether there is duplicate head when one GPU connects with multiple NICs
@ -643,13 +687,12 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
}
}
} else {
if (graph->nChannels > 0) {
if (graph->nChannels > 0 && graph->sameChannels == 1) {
// Try to replay the last channel
int g;
NCCLCHECK(ncclTopoReplayGetGpu(system, graph, -1, &g));
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time, NET, n, g));
}
if (graph->nChannels == 0 || graph->sameChannels == 0) {
} else {
if (graph->nChannels == 0 && system->nodes[NVS].count == 0) {
// Always try the PCI order first to set a reference, but don't count in the timeout nor let it run for long
int t = 1 << 10;
@ -658,24 +701,17 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
}
// Then try the most local GPUs
float maxBw = 0;
int minHops = 0xfffffff;
struct ncclTopoLinkList* paths = net->paths[GPU];
for (int g=0; g<system->nodes[GPU].count; g++) {
if (paths[g].bw > maxBw) {
maxBw = paths[g].bw;
minHops = paths[g].count;
} else if (paths[g].bw == maxBw && paths[g].count > 0 && paths[g].count < minHops) {
minHops = paths[g].count;
}
int localGpu = net->net.localGpu;
if (localGpu != -1) {
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, localGpu));
}
if (maxBw >= bw) {
for (int i=0; i<system->nodes[GPU].count; i++) {
int g = (graph->nChannels+i)%system->nodes[GPU].count;
if (paths[g].bw == maxBw && paths[g].count == minHops) {
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, g));
}
}
int localGpus[NCCL_TOPO_MAX_NODES], localGpuCount, pathType;
NCCLCHECK(ncclTopoGetLocal(system, NET, n, GPU, localGpus, &localGpuCount, &pathType));
// if no GPUs are connected, skip this net
if (pathType == PATH_DIS) continue;
for (int g = 0; g < localGpuCount; ++g) {
if (localGpus[g] == localGpu) continue; // We already tried this one
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, localGpus[g]));
}
}
}
@ -761,6 +797,7 @@ struct kvDict kvDictLinkType[] = {
{ "NVB", PATH_NVB },
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "P2C", PATH_P2C },
{ "PXN", PATH_PXN },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
@ -809,8 +846,10 @@ ncclResult_t ncclTopoGetGraphFromXmlSub(struct ncclXmlNode *xmlGraph, struct ncc
NCCLCHECK(xmlGetAttrInt(xmlGraph, "nchannels", &graph->nChannels));
NCCLCHECK(xmlGetAttrFloat(xmlGraph, "speedintra", &graph->bwIntra));
NCCLCHECK(xmlGetAttrFloat(xmlGraph, "speedinter", &graph->bwInter));
if (xmlGetAttrFloat(xmlGraph, "latencyinter", &graph->latencyInter) != ncclSuccess) graph->latencyInter = 0.0;
const char* str;
NCCLCHECK(xmlGetAttr(xmlGraph, "latencyinter", &str));
if (!str) INFO(NCCL_GRAPH, "latencyinter not found in graph, using 0.0");
graph->latencyInter = str ? strtof(str, NULL) : 0.0;
NCCLCHECK(xmlGetAttr(xmlGraph, "typeintra", &str));
NCCLCHECK(kvConvertToInt(str, &graph->typeIntra, kvDictLinkType));
NCCLCHECK(xmlGetAttr(xmlGraph, "typeinter", &str));
@ -920,8 +959,8 @@ float sm90SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0,
#define NSPEEDSINTRA_SM90 (sizeof(sm90SpeedArrayIntra)/sizeof(float))
#define NSPEEDSINTER_SM90 (sizeof(sm90SpeedArrayInter)/sizeof(float))
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0 };
float sm100SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0, 18.0 };
float sm100SpeedArrayInter[] = { 96.0, 48.0, 45.1, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
#define NSPEEDSINTRA_SM100 (sizeof(sm100SpeedArrayIntra)/sizeof(float))
#define NSPEEDSINTER_SM100 (sizeof(sm100SpeedArrayInter)/sizeof(float))
@ -1060,13 +1099,13 @@ search:
int maxIntra = system->nodes[NET].count > 0 ? tmpGraph.typeInter : maxTypeIntra;
if (tmpGraph.typeIntra < maxIntra && (graph->nChannels == 0 || tmpGraph.typeIntra < graph->typeIntra)) {
tmpGraph.typeIntra += 1;
goto search;
if (tmpGraph.typeIntra < PATH_DIS) goto search;
}
tmpGraph.typeIntra = minTypeIntra;
if (system->nodes[NET].count > 0 && tmpGraph.typeInter < maxTypeInter && (graph->nChannels == 0 || tmpGraph.typeInter < graph->typeInter || tmpGraph.typeInter < PATH_PXN)) {
tmpGraph.typeInter += 1;
goto search;
if (tmpGraph.typeInter < PATH_DIS) goto search;
}
tmpGraph.typeInter = minTypeInter;
@ -1124,7 +1163,7 @@ done:
}
if (graph->nChannels == 0 && graph->collNet == 0 && graph->pattern != NCCL_TOPO_PATTERN_NVLS) {
WARN("Could not find a path for pattern %d, falling back to simple order", graph->pattern);
INFO(NCCL_GRAPH, "Could not find a path for pattern %d, falling back to simple order", graph->pattern);
for (int i=0; i<ngpus; i++) graph->intra[i] = system->nodes[GPU].nodes[i].gpu.rank;
graph->inter[0] = graph->inter[1] = 0;
graph->bwIntra = graph->bwInter = 0.1;
@ -1147,8 +1186,12 @@ ncclResult_t ncclTopoPrintGraph(struct ncclTopoSystem* system, struct ncclTopoGr
offset = strlen(line);
}
for (int i=0; i<ngpus; i++) {
sprintf(line+offset, " %s/%d", topoNodeTypeStr[GPU], graph->intra[ngpus*c+i]);
int g;
ncclTopoRankToIndex(system, graph->intra[ngpus * c + i], &g, true);
int64_t topoId = system->nodes[GPU].nodes[g].id;
sprintf(line + offset, " %s/%lx-%lx", topoNodeTypeStr[GPU], NCCL_TOPO_ID_SYSTEM_ID(topoId), NCCL_TOPO_ID_LOCAL_ID(topoId));
offset = strlen(line);
if (graph->id == 3) break; // NVLS graphs only use the first GPU
}
if (system->nodes[NET].count > 0) {
sprintf(line+offset, " %s/%lx-%lx", topoNodeTypeStr[NET], NCCL_TOPO_ID_SYSTEM_ID(graph->inter[2*c+1]), NCCL_TOPO_ID_LOCAL_ID(graph->inter[2*c+1]));
@ -1248,7 +1291,7 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
}
if (pxnLevel == 1) {
int g, n;
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g, /*showWarn=*/true));
NCCLCHECK(ncclTopoIdToIndex(comm->topo, NET, netId, &n));
struct ncclTopoNode* gpu = comm->topo->nodes[GPU].nodes+g;
if (gpu->paths[NET][n].type <= PATH_PXN) {
@ -1260,11 +1303,12 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
// Check which local GPU corresponds to that NIC and see if we can use PXN.
int n, g1, g2;
NCCLCHECK(ncclTopoIdToIndex(comm->topo, NET, netId, &n));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g1));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g1, /*showWarn=*/true));
NCCLCHECK(ncclTopoGetLocalGpu(comm->topo, netId, &g2));
if (g2 != -1) {
struct ncclTopoNode* peerGpu = comm->topo->nodes[GPU].nodes+g2;
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= PATH_PXB) {
int pxnType = ncclParamPxnC2c() ? PATH_P2C : PATH_PXB;
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= pxnType) {
*proxyRank = peerGpu->gpu.rank;
if (dev) *dev = netDev;
if (id) *id = netId;

View File

@ -9,12 +9,10 @@
#include "topo.h"
#include "comm.h"
#include "nvmlwrap.h"
#include "net.h"
#include "coll_net.h"
#include "transport.h"
#include <sys/stat.h>
#include <fcntl.h>
#include "xml.h"
#include "cpuset.h"
#include "bootstrap.h"
@ -22,8 +20,8 @@
#define BUSID_REDUCED_SIZE (sizeof("0000:00"))
const char* topoNodeTypeStr[] = { "GPU", "PCI", "NVS", "CPU", "NIC", "NET" };
const char* topoLinkTypeStr[] = { "LOC", "NVL", "", "C2C", "PCI", "", "", "", "SYS", "NET" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "PXN", "PHB", "SYS", "NET", "DIS" };
const char* topoLinkTypeStr[] = { "LOC", "NVL", "", "C2C", "PCI", "", "", "", "", "SYS", "NET" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "P2C", "PXN", "PHB", "SYS", "NET", "DIS" };
/******************************************************************/
/******************* Graph Creation Functions *********************/
@ -251,7 +249,7 @@ ncclResult_t ncclTopoFlattenBcmSwitches(struct ncclTopoSystem* system) {
pciSwitch->pci.device |= 0xffff;
free(subSwIds);
// Restart, as system->nodes[PCI].nodes has changed.
s = 0;
s = -1; // Will be incremented to 0 in the next loop iteration
continue;
fail:
free(subSwIds);
@ -404,7 +402,9 @@ ncclResult_t ncclTopoAddGpu(struct ncclXmlNode* xmlGpu, struct ncclTopoSystem* s
return ncclSuccess;
}
struct kvDict kvDictPciClass[] = { { "0x060400", PCI }, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
#define PCI_BRIDGE_DEVICE_CLASS "0x060400"
struct kvDict kvDictPciClass[] = { { PCI_BRIDGE_DEVICE_CLASS, PCI }, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
struct kvDict kvDictPciGen[] = {
{ "2.5 GT/s", 15 }, { "5 GT/s", 30 }, { "8 GT/s", 60 }, { "16 GT/s", 120 }, { "32 GT/s", 240 }, /* Kernel 5.6 and earlier */
{ "2.5 GT/s PCIe", 15 }, { "5.0 GT/s PCIe", 30 }, { "8.0 GT/s PCIe", 60 }, { "16.0 GT/s PCIe", 120 }, { "32.0 GT/s PCIe", 240 }, { "64.0 GT/s PCIe", 480 },
@ -677,7 +677,14 @@ ncclResult_t ncclTopoGetSystemFromXml(struct ncclXml* xml, struct ncclTopoSystem
struct ncclXmlNode* node = topNode->subs[s];
if (strcmp(node->name, "cpu") == 0) NCCLCHECK(ncclTopoAddCpu(node, *topoSystem));
}
for (int systemId=0; systemId<system->nHosts; systemId++) if (system->hostHashes[systemId] == localHostHash) system->systemId = systemId;
int systemId = 0;
while (systemId < system->nHosts && system->hostHashes[systemId] != localHostHash) systemId++;
system->systemId = systemId;
if (systemId == system->nHosts) {
WARN("localHostHash = 0x%lx not found in the list of system hostHashes", localHostHash);
return ncclInvalidArgument;
}
NCCLCHECK(ncclTopoAddNvLinks(topNode, *topoSystem, NULL, 0));
NCCLCHECK(ncclTopoAddC2c(topNode, *topoSystem, NULL, 0));
@ -699,6 +706,7 @@ static ncclResult_t xmlInitAttrInt(struct ncclXmlNode* node, const char* attrNam
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "%d", value);
}
return ncclSuccess;
@ -709,6 +717,7 @@ static ncclResult_t xmlInitAttrUint64(struct ncclXmlNode* node, const char* attr
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "0x%lx", value);
}
return ncclSuccess;
@ -719,6 +728,7 @@ static ncclResult_t xmlInitAttrFloat(struct ncclXmlNode* node, const char* attrN
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "%f", value);
}
return ncclSuccess;
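The explicit terminator added above matters because strncpy() does not NUL-terminate when the source fills the count; a standalone illustration with an invented buffer size:
#include <cstring>
#include <cstdio>

int main() {
  char key[8 + 1];                // stand-in for attrs[].key, sized MAX_STR_LEN + 1
  strncpy(key, "0123456789", 8);  // copies exactly 8 chars, writes no terminator
  key[8] = '\0';                  // the fix applied in xmlInitAttr*() above
  printf("%s\n", key);            // prints "01234567"
}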
@ -799,6 +809,17 @@ typedef struct xmlNodeStack {
} xmlNodeStack;
ncclResult_t ncclFindFirstPciParent(ncclXmlNode** parent) {
ncclXmlNode* newParent = *parent;
while (strcmp(newParent->name, "pci") != 0) {
newParent = newParent->parent;
if (newParent == nullptr) return ncclSuccess;
if (strcmp(newParent->name, "system") == 0) return ncclSuccess;
}
*parent = newParent;
return ncclSuccess;
}
// 1. Find the common parent xmlNode between the given set of nodes
ncclResult_t ncclTopoGetPath(ncclXmlNode** nodes, int nNodes, int* path, ncclXmlNode** parent) {
// Track a stack of parents per-net node being merged
@ -897,6 +918,7 @@ ncclResult_t ncclTopoGetPath(ncclXmlNode** nodes, int nNodes, int* path, ncclXml
}
out:
ncclFindFirstPciParent(&common);
*parent = common;
free(parents);
return ncclSuccess;
@ -960,13 +982,19 @@ ncclResult_t ncclTopoMakePciParent(struct ncclXml* xml, struct ncclXmlNode** par
return ncclSuccess;
}
ncclResult_t ncclTopoMakeVnic(ncclComm_t comm, struct ncclXml* xml, ncclNetVDeviceProps_t* vProps,
struct ncclXmlNode** physNetNodes, struct ncclXmlNode** netNode, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoMakeVnic(struct ncclXml* xml, ncclNetVDeviceProps_t* vProps,
struct ncclXmlNode** physNetNodes, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
if (vProps->ndevs > NCCL_NET_MAX_DEVS_PER_NIC) {
WARN("TOPO/NET : Tried to merge too many NICs. %d > %d", vProps->ndevs, NCCL_NET_MAX_DEVS_PER_NIC);
return ncclInternalError;
}
// Don't make vNics of size 1
if (vProps->ndevs == 1) {
TRACE(NCCL_GRAPH, "TOPO/NET : Skipping vNic of size 1");
return ncclSuccess;
}
// Trigger the merge, then get the new device's properties
int vDevIndex = 0;
ncclResult_t ret = makeVDevice(&vDevIndex, vProps);
@ -976,11 +1004,18 @@ struct ncclXmlNode** physNetNodes, struct ncclXmlNode** netNode, ncclResult_t (*
return ret;
}
// Mark original NICs as keep="0" in the topology
for (int i = 0; i < vProps->ndevs; i++) {
int dev = vProps->devs[i];
struct ncclXmlNode* netNode = physNetNodes[dev];
NCCLCHECK(xmlSetAttrInt(netNode, "keep", 0));
}
INFO(NCCL_GRAPH, "TOPO/NET : Made vNic %d", vDevIndex);
return ncclSuccess;
}
ncclResult_t ncclTopoForceMerge(ncclComm_t comm, struct ncclXml* xml, const char* str, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoForceMerge(struct ncclXml* xml, char* str, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ret = ncclSuccess;
INFO(NCCL_ENV|NCCL_NET, "TOPO/NET : Force-fusing NICs using NCCL_NET_FORCE_MERGE=%s", str);
char* ncStr;
@ -1018,8 +1053,7 @@ ncclResult_t ncclTopoForceMerge(ncclComm_t comm, struct ncclXml* xml, const char
goto fail;
}
struct ncclXmlNode* netNode;
ret = ncclTopoMakeVnic(comm, xml, &vProps, physNetNodes, &netNode, makeVDevice);
ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
if (ret == ncclSuccess) {
// Only set that a device is "placed" after successfully making a vNic (it's possible to exit before this)
for (int i = 0; i < vProps.ndevs; i++) {
@ -1041,7 +1075,7 @@ fail:
goto exit;
}
ncclResult_t ncclTopoAutoMerge(ncclComm_t comm, struct ncclXml* xml, int mergeLevel, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoAutoMerge(struct ncclXml* xml, int mergeLevel, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
// Compute the path type between each device
int* paths = NULL;
ncclResult_t res = ncclSuccess;
@ -1085,8 +1119,7 @@ ncclResult_t ncclTopoAutoMerge(ncclComm_t comm, struct ncclXml* xml, int mergeLe
return ncclInternalError;
}
struct ncclXmlNode* netNode;
ncclResult_t ret = ncclTopoMakeVnic(comm, xml, &vProps, physNetNodes, &netNode, makeVDevice);
ncclResult_t ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
// Merging failed.
// Mark all as unplaced and increase their distance to disconnected (PATH_DIS)
@ -1117,6 +1150,7 @@ struct kvDict nicPathKvList[] = {
{ "PORT", PATH_PORT },
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "P2C", PATH_P2C },
{ "PXN", PATH_PXN },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
@ -1139,14 +1173,19 @@ ncclResult_t ncclTopoGetVNicParent(struct ncclXml* xml, ncclResult_t (*getProper
if (path == PATH_LOC) {
*parent = NULL;
} else if (parent && strcmp((*parent)->name, "pci") == 0) {
// If the common parent is PCI, we must reparent the new NIC under a made up busId
NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
// Compare PCI class here to avoid NCCL WARN when the "class" attribute doesn't exist
const char* c;
NCCLCHECK(xmlGetAttrStr(*parent, "class", &c));
if (strcmp(c, PCI_BRIDGE_DEVICE_CLASS) == 0) {
// If the common parent is a PCI switch, we must reparent the new NIC under a made up pci device with a unique busid
NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
}
}
TRACE(NCCL_GRAPH, "Selected parent %s with path %d", (*parent)->name, path);
return ncclSuccess;
}
ncclResult_t ncclTopoMakeVNics(ncclComm_t comm, struct ncclXml* xml, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*getProperties)(int, ncclNetProperties_t*), int physicalDevs) {
ncclResult_t ncclTopoMakeVNics(struct ncclXml* xml, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*getProperties)(int, ncclNetProperties_t*), int physicalDevs) {
int* placedDevs = NULL;
struct ncclXmlNode** physNetNodes = NULL;
if (physicalDevs == 0) return ncclSuccess;
@ -1170,15 +1209,15 @@ ncclResult_t ncclTopoMakeVNics(ncclComm_t comm, struct ncclXml* xml, ncclResult_
{ // Avoids warnings related to jumping to "out"
const char* mergeLevelEnv = ncclGetEnv("NCCL_NET_MERGE_LEVEL");
if (mergeLevelEnv) kvConvertToInt(mergeLevelEnv, &mergeLevel, nicPathKvList);
const char* forceMerge = ncclGetEnv("NCCL_NET_FORCE_MERGE");
char* forceMerge = (char*) ncclGetEnv("NCCL_NET_FORCE_MERGE");
NCCLCHECK(ncclCalloc(&placedDevs, physicalDevs));
memset(placedDevs, 0, sizeof(int)*physicalDevs);
if (forceMerge) {
NCCLCHECKGOTO(ncclTopoForceMerge(comm, xml, forceMerge, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
NCCLCHECKGOTO(ncclTopoForceMerge(xml, forceMerge, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
}
}
NCCLCHECKGOTO(ncclTopoAutoMerge(comm, xml, mergeLevel, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
NCCLCHECKGOTO(ncclTopoAutoMerge(xml, mergeLevel, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
out:
free(physNetNodes);
@ -1187,7 +1226,7 @@ out:
return res;
}
static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int startIndex, int endIndex, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), const char* netName, int coll, int keep, int virtualNics) {
static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIndex, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), const char* netName, int coll, int virtualNics, bool dmaBufSupport) {
for (int n = startIndex; n < endIndex; n++) {
ncclNetProperties_t props;
NCCLCHECK(getProperties(n, &props));
@ -1206,15 +1245,17 @@ static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int star
const char* colAttr;
NCCLCHECK(xmlGetAttr(netNode, "coll", &colAttr));
// If coll == 0 but the netNode is tagged as coll, don't update the keep value
if (colAttr == NULL || coll != 0 || strcmp(colAttr,"1") != 0) NCCLCHECK(xmlSetAttrInt(netNode, "keep", keep));
NCCLCHECK(xmlSetAttrInt(netNode, "keep", 1));
int dev;
xmlGetAttrIntDefault(netNode, "dev", &dev, -1);
if (dev != -1 && dev != n) INFO(NCCL_GRAPH, "TOPO/NET : Changing %s dev index from %d to %d", netName, dev, n);
NCCLCHECK(xmlSetAttrInt(netNode, "dev", n));
NCCLCHECK(xmlInitAttrInt(netNode, "latency", props.latency));
NCCLCHECK(xmlInitAttrInt(netNode, "speed", props.speed));
NCCLCHECK(xmlInitAttrInt(netNode, "port", props.port));
NCCLCHECK(xmlInitAttrUint64(netNode, "guid", props.guid));
NCCLCHECK(xmlInitAttrInt(netNode, "maxconn", props.maxComms));
bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (comm->dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
INFO(NCCL_NET,"NET/%s : GPU Direct RDMA %s for HCA %d '%s'", netName, gdrSupport ? "Enabled" : "Disabled", n, props.name);
NCCLCHECK(xmlInitAttrInt(netNode, "gdr", gdrSupport));
// Only set coll if it's not 0
@ -1230,30 +1271,22 @@ static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int star
return ncclSuccess;
}
struct ncclTopoNetState {
int nVirtualNics;
int nPhysicalNics;
const char* name;
};
// Calls to network plugin APIs should be protected. This function should be called inside a per-process lock.
static ncclResult_t ncclTopoProcessNet(ncclComm_t comm, ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName) {
ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport) {
int usePhysicalDevices = (dumpXmlFile || makeVDevice == NULL);
if (state->nPhysicalNics == -1) NCCLCHECK(devices(&state->nPhysicalNics));
// Enumerate physical devices
NCCLCHECK(ncclTopoPopulateNics(comm, xml, 0, state->nPhysicalNics, getProperties, netName, coll, 1, 0));
NCCLCHECK(ncclTopoPopulateNics(xml, 0, state->nPhysicalNics, getProperties, netName, coll, false, dmaBufSupport));
if (!usePhysicalDevices) {
if (state->nVirtualNics == -1) {
NCCLCHECK(ncclTopoMakeVNics(comm, xml, makeVDevice, getProperties, state->nPhysicalNics));
NCCLCHECK(ncclTopoMakeVNics(xml, makeVDevice, getProperties, state->nPhysicalNics));
int nDevs;
NCCLCHECK(devices(&nDevs));
state->nVirtualNics = nDevs - state->nPhysicalNics;
}
// Remove keep=1 for physical collnets
if (state->nVirtualNics > 0) {
NCCLCHECK(ncclTopoPopulateNics(comm, xml, 0, state->nPhysicalNics, getProperties, netName, coll, 0, 0));
// Populate new devices
NCCLCHECK(ncclTopoPopulateNics(comm, xml, state->nPhysicalNics, state->nPhysicalNics+state->nVirtualNics, getProperties, netName, coll, 1, 1));
NCCLCHECK(ncclTopoPopulateNics(xml, state->nPhysicalNics, state->nPhysicalNics+state->nVirtualNics, getProperties, netName, coll, true, dmaBufSupport));
}
}
@ -1301,6 +1334,15 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
// Try default XML topology location
NCCLCHECKGOTO(ncclTopoGetXmlFromFile("/var/run/nvidia-topologyd/virtualTopology.xml", xml, 0), ret, fail);
}
// Fix up each CPU node's host_hash attribute: hashes read from existing XML files are not
// meant to be preserved, so overwrite them with the local host hash.
struct ncclXmlNode* node;
NCCLCHECKGOTO(xmlFindTag(xml, "cpu", &node), ret, fail);
while (node != nullptr) {
NCCLCHECKGOTO(xmlSetAttrLong(node, "host_hash", getHostHash()), ret, fail);
NCCLCHECKGOTO(xmlFindNextTag(xml, "cpu", node, &node), ret, fail);
}
if (xml->maxIndex == 0) {
// Create top tag
struct ncclXmlNode* top;
@ -1313,7 +1355,6 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
// Detect only the GPU managed by this process. We'll get any others through XML fusion.
char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
NCCLCHECKGOTO(int64ToBusId(comm->peerInfo[comm->rank].busId, busId), ret, fail);
struct ncclXmlNode* node;
NCCLCHECKGOTO(ncclTopoFillGpu(xml, busId, &node), ret, fail);
if (node) {
NCCLCHECKGOTO(xmlSetAttrInt(node, "keep", 1), ret, fail);
@ -1330,12 +1371,12 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
state = NULL;
if (collNetSupport(comm)) {
NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclCollNet->name, collNetStates), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(comm, xml, 1, dumpXmlFile, state,
comm->ncclCollNet->getProperties, comm->ncclCollNet->makeVDevice, comm->ncclCollNet->devices, comm->ncclCollNet->name), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(xml, 1, dumpXmlFile, state,
comm->ncclCollNet->getProperties, comm->ncclCollNet->makeVDevice, comm->ncclCollNet->devices, comm->ncclCollNet->name, comm->dmaBufSupport), ret, fail);
}
NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclNet->name, netStates), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(comm, xml, 0, dumpXmlFile, state,
comm->ncclNet->getProperties, comm->ncclNet->makeVDevice, comm->ncclNet->devices, comm->ncclNet->name), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(xml, 0, dumpXmlFile, state,
comm->ncclNet->getProperties, comm->ncclNet->makeVDevice, comm->ncclNet->devices, comm->ncclNet->name, comm->dmaBufSupport), ret, fail);
pthread_mutex_unlock(&netLock);
netLockHeld = 0;
@ -1387,7 +1428,7 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
}
// Only update our topo tracking structure if we aren't dumping (separate steps)
if (dumpXmlFile == NULL) NCCLCHECKGOTO(ncclTopoGetSystemFromXml(xml, system, comm->peerInfo[comm->rank].hostHash), ret, fail);
if (dumpXmlFile == NULL) NCCLCHECKGOTO(ncclTopoGetSystemFromXml(xml, system, getHostHash()), ret, fail);
exit:
if (!comm->MNNVL && localRanks) free(localRanks);
@ -1399,7 +1440,7 @@ fail:
goto exit;
}
static ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType,
ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType,
int locals[NCCL_TOPO_MAX_NODES], int* localCount, int* pathType) {
int minType = PATH_DIS;
float maxBw = 0;
@ -1452,7 +1493,7 @@ ncclResult_t getLocalNetCountByBw(struct ncclTopoSystem* system, int gpu, int *c
ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int channelId, int64_t* id, int* dev) {
int gpu;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpu));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpu, /*showWarn=*/true));
int localNets[NCCL_TOPO_MAX_NODES];
int localNetCount;
@ -1517,7 +1558,7 @@ NCCL_PARAM(IgnoreCpuAffinity, "IGNORE_CPU_AFFINITY", 0);
ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu_set_t* affinity) {
struct ncclTopoNode* cpu = NULL, *gpu = NULL;
int gpuIndex, cpuIndex;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpuIndex));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpuIndex, /*showWarn=*/true));
NCCLCHECK(ncclGetLocalCpu(system, gpuIndex, &cpuIndex));
gpu = system->nodes[GPU].nodes+gpuIndex;
cpu = system->nodes[CPU].nodes+cpuIndex;
@ -1529,8 +1570,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
#ifdef ENABLE_TRACE
{
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&mask, affinityStr));
TRACE(NCCL_INIT, "Current affinity for GPU %d is %s", gpu->gpu.dev, affinityStr);
TRACE(NCCL_INIT, "Current affinity for GPU %d is %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&mask, affinityStr, sizeof(affinityStr)));
}
#endif
@ -1540,8 +1581,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
#ifdef ENABLE_TRACE
{
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&cpuMask, affinityStr));
TRACE(NCCL_INIT, "CPU GPU affinity for GPU %d is %s", gpu->gpu.dev, affinityStr);
TRACE(NCCL_INIT, "CPU GPU affinity for GPU %d is %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&cpuMask, affinityStr, sizeof(affinityStr)));
}
#endif
@ -1558,8 +1599,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
// If there is a non empty set, use it to set affinity
if (CPU_COUNT(&finalMask)) {
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&finalMask, affinityStr));
INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev, affinityStr);
INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&finalMask, affinityStr, sizeof(affinityStr)));
}
return ncclSuccess;
}

View File

@ -9,6 +9,8 @@
#include "graph.h"
#include "core.h"
#include "xml.h"
#include "net.h"
#define LOC_BW 5000.0
#define SM60_NVLINK_BW 18.0
@ -16,7 +18,7 @@
#define SM80_NVLINK_BW 20.0
#define SM90_NVLINK_BW 20.6
#define SM86_NVLINK_BW 12.0
#define SM100_NVLINK_BW 40.0
#define SM100_NVLINK_BW 40.1
#define PCI_BW 12.0 // PCI Gen3 x16
#define AMD_BW 16.0
#define BDW_QPI_BW 6.0
@ -50,9 +52,10 @@ extern const char* topoNodeTypeStr[];
#define LINK_PCI 4
// Skipping 5 for PATH_PXB
// Skipping 6 for PATH_PXN
// Skipping 7 for PATH_PHB
#define LINK_SYS 8
#define LINK_NET 9
// Skipping 7 for PATH_P2C
// Skipping 8 for PATH_PHB
#define LINK_SYS 9
#define LINK_NET 10
extern const char* topoLinkTypeStr[];
// Local (myself)
@ -73,25 +76,30 @@ extern const char* topoLinkTypeStr[];
// Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
#define PATH_PXB 5
// Connection between a GPU and a NIC using the C2C connection to the CPU and the PCIe connection to the NIC
#define PATH_P2C 6
// Connection between a GPU and a NIC using an intermediate GPU. Used to enable rail-local, aggregated network send/recv operations.
#define PATH_PXN 6
#define PATH_PXN 7
// Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
#define PATH_PHB 7
#define PATH_PHB 8
// Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
#define PATH_SYS 8
#define PATH_SYS 9
// Connection through the network
#define PATH_NET 9
#define PATH_NET 10
// New type of path which should precede PATH_PIX
#define PATH_PORT PATH_NVL
// Disconnected
#define PATH_DIS 10
#define PATH_DIS 11
extern const char* topoPathTypeStr[];
extern int64_t ncclParamPxnC2c();
struct ncclTopoNode;
struct ncclTopoLink {
int type;
@ -137,6 +145,7 @@ struct ncclTopoNode {
int gdrSupport;
int collSupport;
int maxChannels;
int localGpu;
}net;
struct {
int arch;
@ -181,6 +190,13 @@ ncclResult_t ncclTopoGetGpuMinPath(struct ncclTopoSystem* system, int type, int*
ncclResult_t ncclTopoGetGpuMaxPath(struct ncclTopoSystem* system, int type, int* max);
ncclResult_t ncclTopoSplitNvLink(struct ncclTopoSystem* system, int* splitNvLink);
struct ncclTopoNetState {
int nVirtualNics;
int nPhysicalNics;
const char* name;
};
ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport);
#define NCCL_TOPO_XML_MAX_NODES 256
#define NCCL_GRAPH_XML_MAX_NODES 4096
ncclResult_t ncclTopoGetSystemFromXml(struct ncclXml* xml, struct ncclTopoSystem** topoSystem, uint64_t localHostHash);
@ -200,7 +216,7 @@ static ncclResult_t ncclTopoIdToIndex(struct ncclTopoSystem* system, int type, i
return ncclInternalError;
}
static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank, int* index) {
static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank, int* index, bool showWarn) {
*index = -1;
for (int i=0; i<system->nodes[GPU].count; i++) {
if (system->nodes[GPU].nodes[i].gpu.rank == rank) {
@ -208,6 +224,7 @@ static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank,
return ncclSuccess;
}
}
if (showWarn) WARN("ncclTopoRankToIndex could not find rank %d", rank);
return ncclInternalError;
}


@ -16,13 +16,13 @@ static int getNthreads(const char* name, int env, int min, int max, int def) {
int nt = env;
if (nt > 0) {
if (nt % WARP_SIZE != 0) {
WARN("Invalid %s %d (must be a multiple of %d)", name, nt, WARP_SIZE);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (must be a multiple of %d)", name, nt, WARP_SIZE);
nt = max;
} else if (nt > max) {
WARN("Invalid %s %d (maximum %d).", name, nt, max);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (maximum %d).", name, nt, max);
nt = max;
} else if (nt < min) {
WARN("Invalid %s %d (minimum %d).", name, nt, min);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (minimum %d).", name, nt, min);
nt = min;
}
} else {
@ -51,11 +51,14 @@ static int getNthreads(const char* name, int env, int min, int max, int def) {
// NCCL_PROTO="^LL128;allreduce:LL128"
// Enable everything but LL128, but only LL128 for allreduce.
ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes, const char* elems[], int nelems, int* list) {
ncclResult_t ret = ncclSuccess;
char* fullStr = strdup(str);
char* tmpFullStr;
char* fullToken = strtok_r(fullStr, ";", &tmpFullStr);
char* subToken = nullptr;
char* tokStr = nullptr;
while (fullToken) {
char* subToken = strdup(fullToken);
subToken = strdup(fullToken);
char* tmpSubStr;
char* prefix = strtok_r(subToken, ":", &tmpSubStr);
char* elemList = strtok_r(NULL, ":", &tmpSubStr);
@ -65,7 +68,8 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
// because then all the prefixes before the prefix-less entry would be
// overwritten.
WARN("All entries except the first must have a prefix: \"%s\"", str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
elemList = prefix;
prefix = NULL;
@ -84,7 +88,7 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
foundPrefix = true;
for (int e=0; e<nelems; e++) list[p*nelems+e] = unset;
char* tokStr = strdup(elemList);
tokStr = strdup(elemList);
char* tmpStr;
char* elem = strtok_r(tokStr, ",", &tmpStr);
while (elem) {
@ -97,22 +101,32 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
}
if (e==nelems) {
WARN("Unrecognized element token \"%s\" when parsing \"%s\"", elem, str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
elem = strtok_r(NULL, ",", &tmpStr);
}
free(tokStr);
tokStr = nullptr;
}
if (!foundPrefix) {
WARN("Unrecognized prefix token \"%s\" when parsing \"%s\"", prefix, str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
free(subToken);
subToken = nullptr;
fullToken = strtok_r(NULL, ";", &tmpFullStr);
}
exit:
free(tokStr);
free(subToken);
free(fullStr);
return ncclSuccess;
return ret;
fail:
goto exit;
}
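The parsing above is a two-level tokenization: split on ';' for per-collective entries, then ':' to separate the prefix from its element list, then ',' for the elements themselves. The following is a minimal standalone sketch of that nested strtok_r pattern, using a hypothetical NCCL_PROTO-style string; it only demonstrates the tokenization shape, not NCCL's enable/disable semantics (the '^' negation is left uninterpreted).

#include <cstdio>
#include <cstring>

int main() {
  // Hypothetical input in the same "prefix:elem,elem;prefix:elem" format.
  char input[] = "^LL128;allreduce:LL128";
  char* saveOuter;
  for (char* entry = strtok_r(input, ";", &saveOuter); entry; entry = strtok_r(nullptr, ";", &saveOuter)) {
    char* saveInner;
    char* prefix = strtok_r(entry, ":", &saveInner);
    char* elems  = strtok_r(nullptr, ":", &saveInner);
    if (elems == nullptr) { elems = prefix; prefix = nullptr; }  // prefix-less entry applies to all prefixes
    char* saveElem;
    for (char* e = strtok_r(elems, ",", &saveElem); e; e = strtok_r(nullptr, ",", &saveElem)) {
      printf("prefix=%s elem=%s\n", prefix ? prefix : "(all)", e);
    }
  }
  return 0;
}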
// Latencies in us, Bandwidths in GB/s
@ -194,6 +208,8 @@ static float getNetOverhead(struct ncclComm* comm) {
return 1.0;
}
NCCL_PARAM(Ll128C2c, "LL128_C2C", 1);
ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCompCap, struct ncclTopoGraph** graphs) {
int simpleDefaultThreads = (graphs[NCCL_ALGO_RING]->bwIntra*graphs[NCCL_ALGO_RING]->nChannels <= PCI_BW) ? 256 : NCCL_SIMPLE_MAX_NTHREADS;
comm->maxThreads[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] =
@ -248,7 +264,14 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
&& a == NCCL_ALGO_PAT && (p != NCCL_PROTO_SIMPLE || ncclPatEnable(comm) == 0)) continue;
int collnet = (a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) ? 1 : 0;
float bw = nNodes <= 2 || collnet ? graphs[a]->bwIntra : graphs[a]->bwInter;
if (a == NCCL_ALGO_NVLS) bw = std::min(graphs[a]->bwIntra, graphs[a]->bwInter);
if (a == NCCL_ALGO_NVLS) {
if (coll == ncclFuncAllReduce) {
bw = std::min(graphs[a]->bwIntra, graphs[a]->bwInter);
} else {
// allgather and reducescatter
bw = std::min(graphs[a]->bwIntra * (ppn - 1.0f) / ppn, graphs[a]->bwInter * 0.9f);
}
}
if (a == NCCL_ALGO_NVLS_TREE) bw = std::min(graphs[a]->bwIntra, nNodes <= 2 ? graphs[a]->bwInter : graphs[a]->bwInter/2);
float busBw = graphs[a]->nChannels * bw;
@ -264,19 +287,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
if (a == NCCL_ALGO_COLLNET_CHAIN && p != NCCL_PROTO_SIMPLE) busBw = 0; // Not used
if (a == NCCL_ALGO_COLLNET_DIRECT && p == NCCL_PROTO_SIMPLE) {
if (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter) {
busBw = ppn * bw;
// AllGather/ReduceScatter requires 1:1 GPU:NIC
int nicPerNode = comm->collNetHeadsNum;
if (coll == ncclFuncAllGather && comm->nNodes > 1) {
if (!comm->ncclCollNet || !comm->ncclCollNet->iallgather || ppn > nicPerNode) busBw = 0;
}
if (coll == ncclFuncReduceScatter && comm->nNodes > 1) {
if (!comm->ncclCollNet || !comm->ncclCollNet->ireducescatter || ppn > nicPerNode) busBw = 0;
}
// Measured corrective ratio needed at 1 ppn and 8ppn. Here we hackishly
// interpolate the two.
float w = (ppn-1)/(8-1);
busBw *= w*0.85 + (1-w)*0.95;
busBw = ppn * std::min(graphs[a]->bwIntra, graphs[a]->bwInter * 0.9f);
} else {
// Collnet+Direct requires all GPUs to have a local NIC to work at full speed
float factor = ppn / (1.0*graphs[a]->nChannels); // GPU/NIC ratio
@ -285,6 +296,26 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
if (minCompCap >= 90) busBw *= .85;
}
}
// disable collnet for allgather/reducescatter if #localranks > #heads
// AllGather/ReduceScatter requires 1:1 GPU:NIC
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_COLLNET_DIRECT) && p == NCCL_PROTO_SIMPLE && (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter) && comm->nNodes > 1) {
int nHeads = 0;
if (coll == ncclFuncAllGather && comm->nNodes > 1 && (!comm->ncclCollNet || !comm->ncclCollNet->iallgather)) busBw = 0.0f;
if (coll == ncclFuncReduceScatter && comm->nNodes > 1 && (!comm->ncclCollNet || !comm->ncclCollNet->ireducescatter)) busBw = 0.0f;
if (comm->config.collnetEnable)
nHeads = comm->collNetHeadsNum;
else
busBw = 0.0f;
if (busBw > 0.0f) {
for (int r = 0; r < comm->nRanks; r++) {
int node = comm->rankToNode[r];
if (comm->nodeRanks[node].localRanks > nHeads) {
busBw = 0.0f;
break;
}
}
}
}
// Convert bus BW to algorithm BW
if (!(a != NCCL_ALGO_RING && (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter))) {
@ -411,7 +442,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
// Disable NVLS Tree on a single node
if (comm->nNodes == 1 && a == NCCL_ALGO_NVLS_TREE) disable = 1;
// Disable Collnet+Direct, Collnet+Chain or Collnet+NVLS if collnet is not supported.
if (comm->collNetSupport == 0 &&
if (comm->config.collnetEnable == 0 &&
(a == NCCL_ALGO_COLLNET_DIRECT ||
a == NCCL_ALGO_COLLNET_CHAIN ||
(a == NCCL_ALGO_NVLS && comm->nNodes > 1))) disable = 1;
@ -424,19 +455,19 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
for (int c=0; c<NCCL_NUM_FUNCTIONS; c++) for (int a=0; a<NCCL_NUM_ALGORITHMS; a++) for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) {
int pEnable = protoEnable[c*NCCL_NUM_PROTOCOLS+p];
if (pEnable == 2 && p == NCCL_PROTO_LL128) {
// Enable LL128 by default only on Volta/Ampere/Hopper/Blackwell+NVLink. Other cases are not tested and may cause silent data corruption.
pEnable = 1;
pEnable &= (graphs[a]->typeInter <= PATH_PXB || (minCompCap >= 90 && graphs[a]->typeInter <= PATH_PXN));
if (ncclParamLl128C2c() && minCompCap >= 90) {
// Enable LL128 by default only on Hopper/Blackwell for all connections up to P2C and PXN.
pEnable &= (graphs[a]->typeInter <= PATH_PXN);
} else {
// Enable LL128 only up to PXB. Don't enable LL128 over PxN because PxN can encapsulate PxB or P2C links.
pEnable &= (graphs[a]->typeInter <= PATH_PXB);
if (!ncclParamLl128C2c() && minCompCap >= 90)
INFO(NCCL_GRAPH, "Disabling LL128 over all PxN connections (PXB and C2C). This ensures that no C2C link will be used by LL128.");
}
pEnable &= (graphs[a]->typeIntra <= PATH_NVB);
pEnable &= (minCompCap == maxCompCap);
switch (minCompCap) {
case 70: pEnable &= 1; break;
case 80: pEnable &= 1; break;
case 90: pEnable &= !(CUDART_VERSION == 11080 && c == ncclFuncAllReduce && a == NCCL_ALGO_RING && comm->nRanks == 2); break;
case 100: pEnable &= 1; break;
case 120: pEnable &= 1; break;
default: pEnable &= 0; break;
}
pEnable &= !(minCompCap < 70 || (minCompCap == 90 && CUDART_VERSION == 11080 && c == ncclFuncAllReduce && a == NCCL_ALGO_RING && comm->nRanks == 2));
}
if (pEnable == 0) comm->bandwidths[c][a][p] = 0;
if (algoEnable[c*NCCL_NUM_ALGORITHMS+a] == 0) comm->bandwidths[c][a][p] = 0;
@ -483,7 +514,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
}
}
}
// Set per-thread amount of work before we increase nThreads and nChannels
for (int a=0; a<NCCL_NUM_ALGORITHMS; a++) {
comm->threadThresholds[a][NCCL_PROTO_LL] = NCCL_LL_THREAD_THRESHOLD;


@ -39,7 +39,13 @@ ncclResult_t xmlGetValue(FILE* file, char* value, char* last) {
#if INT_OK
int o = 0;
do {
value[o++] = c;
value[o] = c;
if (o == MAX_STR_LEN-1) {
value[o] = '\0';
WARN("Error : value %s too long (max %d)", value, MAX_STR_LEN);
return ncclInternalError;
}
o++;
NCCLCHECK(xmlGetChar(file, &c));
} while (c >= '0' && c <= '9');
value[o] = '\0';
@ -51,10 +57,17 @@ ncclResult_t xmlGetValue(FILE* file, char* value, char* last) {
#endif
}
int o = 0;
char quote = c; // Remember which quote type we started with
do {
NCCLCHECK(xmlGetChar(file, &c));
value[o++] = c;
} while (c != '"');
value[o] = c;
if (o == MAX_STR_LEN-1) {
value[o] = '\0';
WARN("Error : value %s too long (max %d)", value, MAX_STR_LEN);
return ncclInternalError;
}
o++;
} while (c != quote);
value[o-1] = '\0';
NCCLCHECK(xmlGetChar(file, last));
return ncclSuccess;
@ -267,7 +280,7 @@ ncclResult_t ncclTopoDumpXmlRec(int indent, FILE* file, struct ncclXmlNode* node
ncclResult_t ncclTopoDumpXmlToFile(const char* xmlTopoFile, struct ncclXml* xml) {
FILE* file = fopen(xmlTopoFile, "w");
if (file == NULL) {
WARN("Unable to open %s, not dumping topology.", xmlTopoFile);
INFO(NCCL_GRAPH|NCCL_ENV, "Unable to open %s, not dumping topology.", xmlTopoFile);
return ncclSuccess;
}
NCCLCHECK(ncclTopoDumpXmlRec(0, file, xml->nodes));
@ -375,7 +388,7 @@ ncclResult_t ncclTopoGetXmlFromFile(const char* xmlTopoFile, struct ncclXml* xml
FILE* file = fopen(xmlTopoFile, "r");
if (file == NULL) {
if (warn) {
WARN("Could not open XML topology file %s : %s", xmlTopoFile, strerror(errno));
INFO(NCCL_GRAPH|NCCL_ENV, "Could not open XML topology file %s : %s", xmlTopoFile, strerror(errno));
}
return ncclSuccess;
}
@ -759,7 +772,7 @@ ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvm
int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : (sm < 90) ? 12 : 18;
if (maxNvLinks > 0 && nvmlDev == NULL) {
WARN("No NVML device handle. Skipping nvlink detection.");
INFO(NCCL_GRAPH, "No NVML device handle. Skipping nvlink detection.");
maxNvLinks = 0;
}
@ -961,8 +974,16 @@ ncclResult_t ncclTopoTrimXmlRec(struct ncclXmlNode* node, int* keep) {
NCCLCHECK(ncclTopoTrimXmlRec(subs[s], &k));
*keep += k;
}
if (*keep == 0 && // Trim PCI switches or CPU with no used GPU/NIC under them.
(strcmp(node->name, "pci") == 0 || strcmp(node->name, "cpu") == 0)) {
// Remove node if it has no children and no keep attribute
if (*keep == 0 && // Trim PCI switches, CPUs with no used GPU/NIC under them, or pruned NICs
(strcmp(node->name, "pci") == 0 || strcmp(node->name, "cpu") == 0 || strcmp(node->name, "nic") == 0 || strcmp(node->name, "net") == 0)) {
#ifdef ENABLE_TRACE
const char* name;
const char* busid;
NCCLCHECK(xmlGetAttr(node, "name", &name));
NCCLCHECK(xmlGetAttr(node, "busid", &busid));
TRACE(NCCL_GRAPH, "Removing node %s %s %s\n", node->name, name, busid);
#endif
NCCLCHECK(xmlRemoveNode(node));
}
}


@ -117,6 +117,13 @@ static ncclResult_t xmlGetAttrIntDefault(struct ncclXmlNode* node, const char* a
return ncclSuccess;
}
static ncclResult_t xmlGetAttrUint64(struct ncclXmlNode* node, const char* attrName, uint64_t* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
*value = strtoull(str, NULL, 0);
return ncclSuccess;
}
static ncclResult_t xmlGetAttrLong(struct ncclXmlNode* node, const char* attrName, int64_t* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
@ -124,7 +131,6 @@ static ncclResult_t xmlGetAttrLong(struct ncclXmlNode* node, const char* attrNam
return ncclSuccess;
}
static ncclResult_t xmlGetAttrFloat(struct ncclXmlNode* node, const char* attrName, float* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
@ -254,7 +260,6 @@ static ncclResult_t xmlSetAttrInt(struct ncclXmlNode* node, const char* attrName
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%d", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}
@ -267,7 +272,6 @@ static ncclResult_t xmlSetAttrFloat(struct ncclXmlNode* node, const char* attrNa
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%g", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}
@ -280,7 +284,6 @@ static ncclResult_t xmlSetAttrLong(struct ncclXmlNode* node, const char* attrNam
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%#lx", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}


@ -12,16 +12,14 @@
#include <assert.h>
#include "bootstrap.h"
#define GROUP_MAX_RECLAIM_STEPS 10
__thread int ncclGroupDepth = 0; // depth of ncclGroupStart nesting
__thread ncclResult_t ncclGroupError = ncclSuccess;
__thread struct ncclComm* ncclGroupCommHead = nullptr;
__thread struct ncclComm* ncclGroupCommHead[ncclGroupTaskTypeNum] = {nullptr};
__thread struct ncclComm* ncclGroupCommPreconnectHead = nullptr;
__thread struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> ncclAsyncJobs;
__thread struct ncclGroupJob *ncclGroupJobMainPtr = NULL;
__thread struct ncclGroupJob ncclGroupJobMain;
__thread int ncclGroupBlocking = -1; /* default mode */
__thread bool ncclGroupJobAbortFlag = false;
void* ncclAsyncJobMain(void* arg);
ncclResult_t ncclAsyncLaunch(
@ -191,6 +189,66 @@ fail:
goto exit;
}
struct ncclGroupSymmetricJob {
struct ncclAsyncJob base;
struct ncclComm* comm;
};
NCCL_PARAM(WinStride, "WIN_STRIDE", -1);
ncclResult_t ncclCommGroupRegisterSymmetric(struct ncclAsyncJob* job_) {
struct ncclGroupSymmetricJob* job = (struct ncclGroupSymmetricJob*)job_;
struct ncclComm* comm = job->comm;
ncclResult_t ret = ncclSuccess;
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
if (comm->baseStride == 0) {
cudaStream_t hostStream;
// first time to allocate symmetric VA space.
// calling into this function means symmetric is supported.
struct ncclSymDevBase* symBase = NULL;
size_t size = ncclSymDevBase::size(comm->localRanks);
if (ncclParamWinStride() != -1) {
comm->baseStride = ncclParamWinStride();
} else {
size_t maxStride = 0;
for (int r = 0; r < comm->nRanks; ++r)
if (comm->peerInfo[r].totalGlobalMem > maxStride) maxStride = comm->peerInfo[r].totalGlobalMem;
comm->baseStride = maxStride;
}
INFO(NCCL_INIT, "rank %d base stride %zuGB total VM %zuGB", comm->rank, comm->baseStride >> 30, (comm->baseStride * comm->localRanks) >> 30);
NCCLCHECKGOTO(ncclIpcSymmetricInit(comm), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricInit(comm), ret, fail);
comm->symAllocHead = 0;
// Allocate symmetric memory for NCCL internal usage
NCCLCHECKGOTO(ncclCommSymmetricAllocInternal(comm, size, alignof(struct ncclSymDevBase), (void**)&symBase), ret, fail);
assert((void*)symBase == (void*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride));
NCCLCHECKGOTO(ncclStrongStreamAcquire(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false, &hostStream), ret, fail);
CUDACHECKGOTO(cudaMemsetAsync(symBase, 0, size, hostStream), ret, fail);
CUDACHECKGOTO(cudaStreamSynchronize(hostStream), ret, fail);
NCCLCHECKGOTO(ncclStrongStreamRelease(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false), ret, fail);
comm->symDevComm.base = (struct ncclSymDevBase*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride);
comm->symDevComm.baseMc = (struct ncclSymDevBase*)comm->baseMCSymPtr;
comm->symDevComm.nRanks = comm->localRanks;
comm->symDevComm.nRanks_rcp32 = idivRcp32(comm->localRanks);
comm->symDevComm.rank = comm->localRank;
comm->symDevComm.stride4G = comm->baseStride >> 32;
}
while (!ncclIntruQueueEmpty(&comm->symRegTaskQueue)) {
struct ncclSymRegTask* task = ncclIntruQueueDequeue(&comm->symRegTaskQueue);
NCCLCHECKGOTO(ncclCommSymmetricRegisterInternal(comm, task->buff, task->baseSize, task->alignment, task->memHandle, task->regHandle), ret, fail);
free(task);
}
exit:
return ret;
fail:
goto exit;
}
static ncclResult_t doLaunches(struct ncclComm* head) {
ncclResult_t result = ncclSuccess;
struct ncclComm* cliqueHead = head;
@ -207,7 +265,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), result, failure);
NCCLCHECKGOTO(ncclLaunchPrepare(comm), result, failure);
if (useBarrier) ncclCommIntraBarrierIn(comm, 1);
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
cliqueNextHead = comm;
@ -224,7 +282,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
bool moreRounds = false;
comm = cliqueHead;
do { // Iterate clique members.
struct ncclComm* next = comm->groupNext;
struct ncclComm* next = comm->groupNext[ncclGroupTaskTypeCollective];
if (useBarrier) {
// Barrier reduction result tells us if this was the final round.
moreRounds = 0 != ncclCommIntraBarrierOut(comm);
@ -259,64 +317,60 @@ failure:
return result;
}
static inline void groupResetJobState(struct ncclGroupJob* job) {
if (job) {
if (job->groupBlockingPtr) *job->groupBlockingPtr = -1;
if (job->abortFlagPtr) *job->abortFlagPtr = false;
if (job->groupErrorPtr) *job->groupErrorPtr = ncclSuccess;
if (job->groupCommHeadPtr) *job->groupCommHeadPtr = NULL;
if (job->groupCommPreconnectHeadPtr) *job->groupCommPreconnectHeadPtr = NULL;
memset(job, 0, sizeof(struct ncclGroupJob));
}
static inline void groupLocalResetJobState() {
ncclGroupError = ncclSuccess;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) ncclGroupCommHead[type] = NULL;
ncclGroupCommPreconnectHead = NULL;
ncclGroupBlocking = -1;
ncclIntruQueueConstruct(&ncclAsyncJobs);
return;
}
static void groupCleanup(struct ncclComm** groupCommHeadPtr, struct ncclComm** groupCommPreconnectHeadPtr, struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next>* asyncJobsPtr, ncclResult_t* groupErrorPtr, int* groupBlockingPtr, volatile bool* groupJobAbortFlagPtr, ncclResult_t error) {
struct ncclComm* comm = *groupCommHeadPtr;
/* reset all thread local variables */
*groupCommHeadPtr = NULL;
*groupCommPreconnectHeadPtr = NULL;
*groupErrorPtr = ncclSuccess;
*groupBlockingPtr = -1;
*groupJobAbortFlagPtr = false;
while (comm != nullptr) {
struct ncclComm* next = comm->groupNext;
(void) ncclGroupCommLeave(comm); // overwrites comm->groupNext
// We don't know if preconnect succeeded or happened at all, so clear
// the flags that let `taskAppend()` skip over checking if preconnect
// is needed.
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < comm->nRanks; i++) {
comm->connectSend[i] = 0UL;
comm->connectRecv[i] = 0UL;
}
// Reclaim abandoned kernel plan memory. Note ncclWork structs were already
// reclaimed by a `ncclMemoryStackPop(&comm->memScoped)` during `ncclGroupCommLeave()`.
while (!ncclIntruQueueEmpty(&comm->planner.planQueue)) {
struct ncclKernelPlan* plan = ncclIntruQueueDequeue(&comm->planner.planQueue);
// Persistent plans will be reclaimed via the callbackQueue when the
// graph drops its UserObject reference.
if (!plan->persistent) {
while (!ncclIntruQueueEmpty(&plan->proxyOpQueue)) {
struct ncclProxyOp* pxop = ncclIntruQueueDequeue(&plan->proxyOpQueue);
ncclMemoryPoolFree(&comm->memPool_ncclProxyOp, pxop);
static void groupCleanup(struct ncclComm** groupCommHeadPtr, struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next>* asyncJobsPtr, ncclResult_t error) {
struct ncclComm* comm;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
comm = groupCommHeadPtr[type];
// reset groupCommHeadPtr[type]
groupCommHeadPtr[type] = nullptr;
while (comm != nullptr) {
struct ncclComm* next = comm->groupNext[type];
(void)ncclGroupCommLeave(comm, type); // overwrites comm->groupNext
// We don't know if preconnect succeeded or happened at all, so clear
// the flags that let `taskAppend()` skip over checking if preconnect
// is needed.
if (type == ncclGroupTaskTypeCollective) {
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < comm->nRanks; i++) {
comm->connectSend[i] = 0UL;
comm->connectRecv[i] = 0UL;
}
// Reclaim abandoned kernel plan memory. Note ncclWork structs were already
// reclaimed by a `ncclMemoryStackPop(&comm->memScoped)` during `ncclGroupCommLeave()`.
while (!ncclIntruQueueEmpty(&comm->planner.planQueue)) {
struct ncclKernelPlan* plan = ncclIntruQueueDequeue(&comm->planner.planQueue);
// Persistent plans will be reclaimed via the callbackQueue when the
// graph drops its UserObject reference.
if (!plan->persistent) {
while (!ncclIntruQueueEmpty(&plan->proxyOpQueue)) {
struct ncclProxyOp* pxop = ncclIntruQueueDequeue(&plan->proxyOpQueue);
ncclMemoryPoolFree(&comm->memPool_ncclProxyOp, pxop);
}
ncclMemoryPoolFree(&comm->memPool_ncclKernelPlan, plan);
}
}
{ // Reset comm->planner to empty.
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (comm->planner.peers != NULL) memset(comm->planner.peers, 0, comm->nRanks * sizeof(comm->planner.peers[0]));
}
ncclMemoryPoolFree(&comm->memPool_ncclKernelPlan, plan);
}
}
{ // Reset comm->planner to empty.
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (comm->planner.peers != NULL) memset(comm->planner.peers, 0, comm->nRanks*sizeof(comm->planner.peers[0]));
if (!comm->config.blocking)
(void)ncclCommSetAsyncError(comm, error);
comm = next;
}
if (!comm->config.blocking)
(void) ncclCommSetAsyncError(comm, error);
comm = next;
}
/* reset everything */
@ -393,11 +447,10 @@ fail:
static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInfo = NULL) {
ncclResult_t ret = ncclSuccess;
struct ncclGroupJob *gjob = (struct ncclGroupJob*) job_;
struct ncclComm *groupCommHeadMain = *gjob->groupCommHeadPtr;
struct ncclComm *groupCommPreconnectHeadMain = *gjob->groupCommPreconnectHeadPtr;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsMain = gjob->asyncJobsPtr;
bool *groupAbortFlag = gjob->abortFlagPtr;
struct ncclComm **groupCommHeadMain = gjob->groupCommHead;
struct ncclComm *groupCommPreconnectHeadMain = gjob->groupCommPreconnectHead;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsMain = &gjob->asyncJobs;
bool *groupAbortFlag = &gjob->abortFlag;
if (!simInfo && groupCommPreconnectHeadMain != nullptr) {
struct ncclComm* comm = groupCommPreconnectHeadMain;
@ -421,9 +474,41 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
NCCLCHECKGOTO(asyncJobLaunch(asyncJobsMain, groupAbortFlag), ret, fail);
// only loop through sym alloc and register tasks
for (int type = ncclGroupTaskTypeSymRegister; type <= ncclGroupTaskTypeSymRegister; ++type) {
if (groupCommHeadMain[type]) {
struct ncclComm* cliqueHead = groupCommHeadMain[type];
struct ncclComm* comm = NULL;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncSymJobs;
ncclIntruQueueConstruct(&asyncSymJobs);
do {
comm = cliqueHead;
do {
struct ncclGroupSymmetricJob* job;
NCCLCHECKGOTO(ncclCalloc(&job, 1), ret, fail);
job->base.func = ncclCommGroupRegisterSymmetric;
job->base.undo = nullptr;
job->base.destructor = free;
job->base.state = ncclGroupJobRunning;
job->base.abortFlag = comm->abortFlag;
job->base.abortFlagDev = comm->abortFlagDev;
job->comm = comm;
ncclIntruQueueEnqueue(&asyncSymJobs, (struct ncclAsyncJob*)job);
comm = comm->groupNext[type];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
NCCLCHECKGOTO(asyncJobLaunch(&asyncSymJobs, groupAbortFlag), ret, fail);
while (!ncclIntruQueueEmpty(&asyncSymJobs)) {
struct ncclAsyncJob* job = ncclIntruQueueDequeue(&asyncSymJobs);
if (job->destructor) job->destructor((void*)job);
}
cliqueHead = comm;
} while (cliqueHead != nullptr);
}
}
/* Connect channels at runtime if cumem is supported */
if (groupCommHeadMain != nullptr) {
struct ncclComm* cliqueHead = groupCommHeadMain;
if (groupCommHeadMain[ncclGroupTaskTypeCollective] != nullptr) {
struct ncclComm* cliqueHead = groupCommHeadMain[ncclGroupTaskTypeCollective];
struct ncclComm* comm = NULL;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncCollJobs;
ncclIntruQueueConstruct(&asyncCollJobs);
@ -454,7 +539,7 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
memcpy(job->algoNeedConnect, algoNeedConnect, sizeof(bool) * NCCL_NUM_ALGORITHMS);
ncclIntruQueueEnqueue(&asyncCollJobs, &job->base);
}
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
// connect
NCCLCHECKGOTO(asyncJobLaunch(&asyncCollJobs, groupAbortFlag), ret, fail);
@ -466,42 +551,49 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
} while (cliqueHead != nullptr);
// done with all buffer allocation, start registration and enqueue
comm = groupCommHeadMain;
comm = groupCommHeadMain[ncclGroupTaskTypeCollective];
do {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
NCCLCHECKGOTO(ncclTasksRegAndEnqueue(comm), ret, fail);
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm);
}
if ((!simInfo) && (groupCommHeadMain != nullptr)) {
NCCLCHECKGOTO(doLaunches(groupCommHeadMain), ret, fail);
if ((!simInfo) && (groupCommHeadMain[ncclGroupTaskTypeCollective] != nullptr)) {
NCCLCHECKGOTO(doLaunches(groupCommHeadMain[ncclGroupTaskTypeCollective]), ret, fail);
}
while (!ncclIntruQueueEmpty(asyncJobsMain)) {
struct ncclAsyncJob* job = ncclIntruQueueDequeue(asyncJobsMain);
if (!job->destroyFlag && job->comm && !job->comm->config.blocking)
if (!job->destroyFlag && job->comm && !job->comm->config.blocking && groupCommHeadMain[ncclGroupTaskTypeCollective] == nullptr)
(void) ncclCommSetAsyncError(job->comm, ret);
if (job->destructor) job->destructor((void*)job);
}
while (groupCommHeadMain != nullptr) {
struct ncclComm* comm = groupCommHeadMain;
struct ncclComm* next = comm->groupNext;
// Poll for callbacks sent to us from other threads. Typically these free
// resources back to our memory pools and UB
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, /*waitSome=*/false), ret, fail);
(void) ncclGroupCommLeave(comm);
if (!comm->config.blocking) {
(void) ncclCommSetAsyncError(comm, ret);
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
while (groupCommHeadMain[type] != nullptr) {
struct ncclComm* comm = groupCommHeadMain[type];
struct ncclComm* next = comm->groupNext[type];
// Poll for callbacks sent to us from other threads. Typically these free
// resources back to our memory pools and UB
if (comm->reclaimSteps == GROUP_MAX_RECLAIM_STEPS) {
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, /*waitSome=*/false), ret, fail);
comm->reclaimSteps = 0;
} else {
comm->reclaimSteps++;
}
(void)ncclGroupCommLeave(comm, type);
if (!comm->config.blocking) {
(void)ncclCommSetAsyncError(comm, ret);
}
groupCommHeadMain[type] = next;
}
groupCommHeadMain = next;
}
exit:
return ret;
fail:
groupCleanup(gjob->groupCommHeadPtr, gjob->groupCommPreconnectHeadPtr, gjob->asyncJobsPtr, gjob->groupErrorPtr, gjob->groupBlockingPtr, gjob->abortFlagPtr, ret);
groupCleanup(gjob->groupCommHead, &gjob->asyncJobs, ret);
goto exit;
}
@ -514,6 +606,8 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
ncclSimInfo_t internalSimInfo = NCCL_SIM_INFO_INITIALIZER;
ncclSimInfo_t* internalSimInfoPtr = NULL;
size_t realSize = 0;
bool hasCommHead = false;
ncclGroupJob* groupJob = NULL;
internalSimInfo.magic = 0;
@ -539,72 +633,108 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
internalSimInfoPtr = &internalSimInfo;
}
if (ncclGroupCommHead != nullptr || !ncclIntruQueueEmpty(&ncclAsyncJobs) || ncclGroupCommPreconnectHead != nullptr) {
ncclGroupJobMain.groupCommHeadPtr = &ncclGroupCommHead;
ncclGroupJobMain.groupCommPreconnectHeadPtr = &ncclGroupCommPreconnectHead;
ncclGroupJobMain.groupErrorPtr = &ncclGroupError;
ncclGroupJobMain.asyncJobsPtr = &ncclAsyncJobs;
ncclGroupJobMain.abortFlagPtr = &ncclGroupJobAbortFlag;
ncclGroupJobMain.groupBlockingPtr = &ncclGroupBlocking;
ncclGroupJobMain.initialized = true;
ncclGroupJobMainPtr = &ncclGroupJobMain;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
if (ncclGroupCommHead[type]) {
hasCommHead = true;
break;
}
}
NCCLCHECKGOTO(ncclCalloc(&groupJob, 1), ret, fail);
ncclIntruQueueConstruct(&groupJob->asyncJobs);
groupJob->groupRefCount = 0;
groupJob->nonBlockingInit = false;
memcpy(groupJob->groupCommHead, ncclGroupCommHead, sizeof(ncclGroupCommHead));
groupJob->groupCommPreconnectHead = ncclGroupCommPreconnectHead;
groupJob->groupError = ncclSuccess;
groupJob->abortFlag = false;
groupJob->joined = false;
ncclIntruQueueTransfer(&groupJob->asyncJobs, &ncclAsyncJobs);
if (hasCommHead || !ncclIntruQueueEmpty(&groupJob->asyncJobs) || ncclGroupCommPreconnectHead != nullptr) {
/* make sure ncclGroupBlocking has been set. */
assert(ncclGroupBlocking == 0 || ncclGroupBlocking == 1);
if (ncclGroupBlocking == 0) {
/* nonblocking group */
if (!ncclIntruQueueEmpty(&ncclAsyncJobs)) {
ncclAsyncJob* job = ncclIntruQueueHead(&ncclAsyncJobs);
if (!ncclIntruQueueEmpty(&groupJob->asyncJobs)) {
ncclAsyncJob* job = ncclIntruQueueHead(&groupJob->asyncJobs);
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(job->comm, ncclInProgress), ret, fail);
job->comm->groupJob = ncclGroupJobMainPtr;
if (job->comm->groupJob == NULL) {
job->comm->groupJob = groupJob;
groupJob->groupRefCount++;
}
job = job->next;
} while (job);
}
if (ncclGroupCommHead) {
ncclComm_t comm = ncclGroupCommHead;
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(comm, ncclInProgress), ret, fail);
/* link group job to communicators. */
comm->groupJob = ncclGroupJobMainPtr;
comm = comm->groupNext;
} while (comm);
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
if (ncclGroupCommHead[type]) {
ncclComm_t comm = ncclGroupCommHead[type];
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(comm, ncclInProgress), ret, fail);
/* link group job to communicators. */
if (comm->groupJob == NULL) {
comm->groupJob = groupJob;
groupJob->groupRefCount++;
}
comm = comm->groupNext[type];
} while (comm);
}
}
ncclGroupJobMainPtr->base.func = groupLaunchNonBlocking;
PTHREADCHECKGOTO(pthread_create(&ncclGroupJobMainPtr->base.thread, NULL, ncclAsyncJobMain, (void*)&ncclGroupJobMainPtr->base), "pthread_create", ret, fail);
groupJob->base.func = groupLaunchNonBlocking;
PTHREADCHECKGOTO(pthread_create(&groupJob->base.thread, NULL, ncclAsyncJobMain, (void*)&groupJob->base), "pthread_create", ret, fail);
groupJob->nonBlockingInit = true;
ret = ncclInProgress;
} else {
/* blocking group */
int savedDev;
CUDACHECKGOTO(cudaGetDevice(&savedDev), ret, fail);
NCCLCHECKGOTO(groupLaunch(&ncclGroupJobMainPtr->base, internalSimInfoPtr), ret, fail);
NCCLCHECKGOTO(groupLaunch(&groupJob->base, internalSimInfoPtr), ret, fail);
CUDACHECKGOTO(cudaSetDevice(savedDev), ret, fail);
if (simInfo) memcpy((void*)simInfo, (void*)internalSimInfoPtr, realSize);
groupResetJobState(ncclGroupJobMainPtr);
free(groupJob);
}
}
/* Reset the job state for the next group call. */
groupLocalResetJobState();
exit:
return ret;
fail:
groupCleanup(&ncclGroupCommHead, &ncclGroupCommPreconnectHead, &ncclAsyncJobs, &ncclGroupError, &ncclGroupBlocking, &ncclGroupJobAbortFlag, ret);
if (groupJob) {
groupCleanup(groupJob->groupCommHead, &groupJob->asyncJobs, ret);
free(groupJob);
} else {
groupCleanup(ncclGroupCommHead, &ncclAsyncJobs, ret);
}
groupLocalResetJobState();
goto exit;
}
ncclResult_t ncclGroupJobComplete(struct ncclGroupJob* groupJob) {
ncclResult_t ret = ncclSuccess;
if (groupJob && groupJob->initialized) {
ret = ncclAsyncJobComplete(&groupJob->base);
groupResetJobState(groupJob);
if (groupJob && groupJob->nonBlockingInit) {
if (!__atomic_exchange_n(&groupJob->joined, true, __ATOMIC_ACQ_REL)) {
ret = ncclAsyncJobComplete(&groupJob->base);
}
if (ncclAtomicRefCountDecrement(&groupJob->groupRefCount) == 0) {
free(groupJob);
}
}
return ret;
}
ncclResult_t ncclGroupJobAbort(struct ncclGroupJob* groupJob) {
if (groupJob && groupJob->initialized) {
__atomic_store_n(groupJob->abortFlagPtr, true, __ATOMIC_RELEASE);
NCCLCHECK(ncclGroupJobComplete(groupJob));
if (groupJob && groupJob->nonBlockingInit) {
if (!__atomic_exchange_n(&groupJob->joined, true, __ATOMIC_ACQ_REL)) {
__atomic_store_n(&groupJob->abortFlag, true, __ATOMIC_RELAXED);
ncclAsyncJobComplete(&groupJob->base);
}
if (ncclAtomicRefCountDecrement(&groupJob->groupRefCount) == 0) {
free(groupJob);
}
}
return ncclSuccess;
}
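Both functions above follow the same shutdown pattern: an atomic exchange on the joined flag guarantees that only the first of ncclGroupJobComplete()/ncclGroupJobAbort() actually waits on the asynchronous job, and a reference count lets whichever caller drops the last reference free the job. Below is a minimal, hypothetical sketch of that join-once-plus-refcount idiom using the same GCC atomic builtins; __atomic_sub_fetch stands in for NCCL's ncclAtomicRefCountDecrement helper, and the Job struct and releaseJob name are invented for illustration.

#include <cstdio>

struct Job {
  bool joined = false;
  int refCount = 2;   // e.g. one reference per caller sharing the job
};

// Join at most once, free on the last release.
static void releaseJob(Job* job) {
  if (!__atomic_exchange_n(&job->joined, true, __ATOMIC_ACQ_REL)) {
    printf("first caller joins the async work\n");
  }
  if (__atomic_sub_fetch(&job->refCount, 1, __ATOMIC_ACQ_REL) == 0) {
    printf("last reference dropped, freeing job\n");
    delete job;
  }
}

int main() {
  Job* job = new Job();
  releaseJob(job);  // joins the work
  releaseJob(job);  // only decrements the refcount and frees
  return 0;
}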

src/include/allocator.h (new file)

@ -0,0 +1,13 @@
/*************************************************************************
* Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_ALLOCATOR_H_
#define NCCL_ALLOCATOR_H_
ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr);
ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr);
#endif


@ -19,6 +19,28 @@
#endif
#endif
template<typename Int>
constexpr static __host__ __device__ Int minval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr static __host__ __device__ Int minval(Int a, Int b, More ...more) {
#if __CUDA_ARCH__
return minval(min(a, b), more...);
#else
return minval(a < b ? a : b, more...);
#endif
}
template<typename Int>
constexpr static __host__ __device__ Int maxval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr static __host__ __device__ Int maxval(Int a, Int b, More ...more) {
#if __CUDA_ARCH__
return maxval(max(a, b), more...);
#else
return maxval(a > b ? a : b, more...);
#endif
}
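minval()/maxval() above are variadic recursive folds, so they accept any number of same-typed arguments on both host and device. A small host-only usage sketch (the __host__ __device__ qualifiers are dropped so it compiles without CUDA):

#include <cstdio>

// Host-only copy of the variadic fold above, for illustration.
template<typename Int>
constexpr Int minval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr Int minval(Int a, Int b, More ...more) { return minval(a < b ? a : b, more...); }

int main() {
  static_assert(minval(7, 3, 9, 5) == 3, "fold picks the smallest argument");
  printf("minval(7,3,9,5) = %d\n", minval(7, 3, 9, 5));
  return 0;
}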
#define DIVUP(x, y) \
(((x)+(y)-1)/(y))
@ -32,32 +54,150 @@
size = ((size + (align) - 1) / (align)) * (align);
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z divUp(X x, Y y) {
static __host__ __device__ constexpr Z divUp(X x, Y y) {
return (x+y-1)/y;
}
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z roundUp(X x, Y y) {
static __host__ __device__ constexpr Z roundUp(X x, Y y) {
return (x+y-1) - (x+y-1)%y;
}
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z roundDown(X x, Y y) {
static __host__ __device__ constexpr Z roundDown(X x, Y y) {
return x - x%y;
}
// assumes second argument is a power of 2
template<typename X, typename Z = decltype(X()+int())>
__host__ __device__ constexpr Z alignUp(X x, int a) {
static __host__ __device__ constexpr Z alignUp(X x, int a) {
return (x + a-1) & Z(-a);
}
// assumes second argument is a power of 2
template<typename X, typename Z = decltype(X()+int())>
__host__ __device__ constexpr Z alignDown(X x, int a) {
static __host__ __device__ constexpr Z alignDown(X x, int a) {
return x & Z(-a);
}
template<typename Int>
inline __host__ __device__ int countOneBits(Int x) {
constexpr __host__ __device__ bool isPow2(Int x) {
return (x & (x-1)) == 0;
}
template<typename T>
static __host__ __device__ T add4G(T base, int delta4G) {
union { T tmp; uint32_t u32[2]; };
tmp = base;
u32[1] += delta4G;
return tmp;
}
template<typename T>
static __host__ __device__ T incWrap4G(T ptr, uint32_t delta4G, uint32_t lo4G, uint32_t hi4G) {
union { T tmp; uint32_t u32[2]; };
tmp = ptr;
u32[1] += delta4G;
if (u32[1] >= hi4G) u32[1] -= hi4G-lo4G;
return tmp;
}
template<typename T>
static __host__ __device__ T decWrap4G(T ptr, uint32_t delta4G, uint32_t lo4G, uint32_t hi4G) {
union { T tmp; uint32_t u32[2]; };
tmp = ptr;
u32[1] -= delta4G;
if (u32[1] < lo4G) u32[1] += hi4G-lo4G;
return tmp;
}
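add4G()/incWrap4G()/decWrap4G() only touch the upper 32 bits of a pointer, i.e. they step in 4 GiB units, which matches a layout where each rank's symmetric window sits at a fixed multiple of the base stride. The sketch below replays the same union overlay on a plain 64-bit value; it assumes a little-endian host (as the union trick itself does) and uses a made-up address purely for illustration.

#include <cstdint>
#include <cstdio>

// Same layout trick as add4G(): bump the high 32-bit half, i.e. add delta * 4 GiB.
static uint64_t add4G(uint64_t base, int delta4G) {
  union { uint64_t tmp; uint32_t u32[2]; };
  tmp = base;
  u32[1] += delta4G;   // little-endian: u32[1] is the upper half
  return tmp;
}

int main() {
  uint64_t p = 0x700000001000ull;      // hypothetical virtual address
  uint64_t q = add4G(p, 3);            // + 3 * 4 GiB
  printf("ok=%d\n", q == p + 3ull * (1ull << 32));
  return 0;
}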
// Produce the reciprocal of x for use in idivByRcp
constexpr __host__ __device__ uint32_t idivRcp32(uint32_t x) {
return uint32_t(uint64_t(0x100000000)/x);
}
constexpr __host__ __device__ uint64_t idivRcp64(uint64_t x) {
return uint64_t(-1)/x + isPow2(x);
}
static __host__ __device__ uint32_t mul32hi(uint32_t a, uint32_t b) {
#if __CUDA_ARCH__
return __umulhi(a, b);
#else
return uint64_t(a)*b >> 32;
#endif
}
static __host__ __device__ uint64_t mul64hi(uint64_t a, uint64_t b) {
#if __CUDA_ARCH__
return __umul64hi(a, b);
#else
return (uint64_t)(((unsigned __int128)a)*b >> 64);
#endif
}
// Produce the reciprocal of x*y given their respective reciprocals. This incurs
// no integer division on device.
static __host__ __device__ uint32_t imulRcp32(uint32_t x, uint32_t xrcp, uint32_t y, uint32_t yrcp) {
if (xrcp == 0) return yrcp;
if (yrcp == 0) return xrcp;
uint32_t rcp = mul32hi(xrcp, yrcp);
uint32_t rem = -x*y*rcp;
if (x*y <= rem) rcp += 1;
return rcp;
}
static __host__ __device__ uint64_t imulRcp64(uint64_t x, uint64_t xrcp, uint64_t y, uint64_t yrcp) {
if (xrcp == 0) return yrcp;
if (yrcp == 0) return xrcp;
uint64_t rcp = mul64hi(xrcp, yrcp);
uint64_t rem = -x*y*rcp;
if (x*y <= rem) rcp += 1;
return rcp;
}
// Fast integer division where divisor has precomputed reciprocal.
// idivFast(x, y, idivRcp(y)) == x/y
static __host__ __device__ void idivmodFast32(uint32_t *quo, uint32_t *rem, uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q = x, r = 0;
if (yrcp != 0) {
q = mul32hi(x, yrcp);
r = x - y*q;
if (r >= y) { q += 1; r -= y; }
}
*quo = q;
*rem = r;
}
static __host__ __device__ void idivmodFast64(uint64_t *quo, uint64_t *rem, uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q = x, r = 0;
if (yrcp != 0) {
q = mul64hi(x, yrcp);
r = x - y*q;
if (r >= y) { q += 1; r -= y; }
}
*quo = q;
*rem = r;
}
static __host__ __device__ uint32_t idivFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q, r;
idivmodFast32(&q, &r, x, y, yrcp);
return q;
}
static __host__ __device__ uint32_t idivFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q, r;
idivmodFast64(&q, &r, x, y, yrcp);
return q;
}
static __host__ __device__ uint32_t imodFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q, r;
idivmodFast32(&q, &r, x, y, yrcp);
return r;
}
static __host__ __device__ uint32_t imodFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q, r;
idivmodFast64(&q, &r, x, y, yrcp);
return r;
}
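The reciprocal trick above replaces x / y with a 32x32-to-64 high multiply: rcp = floor(2^32 / y), q = (x * rcp) >> 32, followed by a single correction step that can only be off by one. The standalone sketch below re-derives the same math (rather than calling the helpers above) and checks it against ordinary division over a sampled range.

#include <cstdint>
#include <cstdio>

static uint32_t idivRcp32(uint32_t y) { return (uint32_t)(0x100000000ull / y); }
static uint32_t mul32hi(uint32_t a, uint32_t b) { return (uint32_t)(((uint64_t)a * b) >> 32); }

// Fast division by y using its precomputed reciprocal, with one correction step.
static uint32_t idivFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
  if (yrcp == 0) return x;            // y == 1: the reciprocal wraps to 0, and x/1 == x
  uint32_t q = mul32hi(x, yrcp);
  uint32_t r = x - y * q;
  if (r >= y) q += 1;                 // q was floor(x/y) - 1; fix it up
  return q;
}

int main() {
  for (uint32_t y = 1; y < 1000; y++)
    for (uint32_t x = 0; x < 100000; x += 37)
      if (idivFast32(x, y, idivRcp32(y)) != x / y) { printf("mismatch x=%u y=%u\n", x, y); return 1; }
  printf("fast division matches x/y on the sampled range\n");
  return 0;
}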
template<typename Int>
static __host__ __device__ int countOneBits(Int x) {
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(unsigned int)) {
return __popc((unsigned int)x);
@ -83,7 +223,7 @@ inline __host__ __device__ int countOneBits(Int x) {
// Returns index of first one bit or returns -1 if mask is zero.
template<typename Int>
inline __host__ __device__ int firstOneBit(Int mask) {
static __host__ __device__ int firstOneBit(Int mask) {
int i;
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(int)) {
@ -108,14 +248,14 @@ inline __host__ __device__ int firstOneBit(Int mask) {
}
template<typename Int>
inline __host__ __device__ int popFirstOneBit(Int* mask) {
static __host__ __device__ int popFirstOneBit(Int* mask) {
Int tmp = *mask;
*mask &= *mask-1;
return firstOneBit(tmp);
}
template<typename Int>
inline __host__ __device__ int log2Down(Int x) {
static __host__ __device__ int log2Down(Int x) {
int w, n;
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(int)) {
@ -147,7 +287,7 @@ inline __host__ __device__ int log2Down(Int x) {
}
template<typename Int>
inline __host__ __device__ int log2Up(Int x) {
static __host__ __device__ int log2Up(Int x) {
int w, n;
if (x != 0) x -= 1;
#if __CUDA_ARCH__
@ -180,19 +320,19 @@ inline __host__ __device__ int log2Up(Int x) {
}
template<typename Int>
inline __host__ __device__ Int pow2Up(Int x) {
static __host__ __device__ Int pow2Up(Int x) {
return Int(1)<<log2Up(x);
}
template<typename Int>
inline __host__ __device__ Int pow2Down(Int x) {
static __host__ __device__ Int pow2Down(Int x) {
// True, log2Down can return -1, but we don't normally pass 0 as an argument...
// coverity[negative_shift]
return Int(1)<<log2Down(x);
}
template<typename UInt, int nSubBits>
inline __host__ UInt reverseSubBits(UInt x) {
static __host__ UInt reverseSubBits(UInt x) {
if (nSubBits >= 16 && 8*sizeof(UInt) == nSubBits) {
switch (8*sizeof(UInt)) {
case 16: x = __builtin_bswap16(x); break;
@ -225,7 +365,7 @@ template<> struct ncclToUnsigned<unsigned long long> { using type = unsigned lon
// Reverse the bottom nBits bits of x. The top bits will be overwritten with 0's.
template<typename Int>
inline __host__ __device__ Int reverseBits(Int x, int nBits) {
static __host__ __device__ Int reverseBits(Int x, int nBits) {
using UInt = typename ncclToUnsigned<Int>::type;
union { UInt ux; Int sx; };
sx = x;
@ -249,7 +389,7 @@ inline __host__ __device__ Int reverseBits(Int x, int nBits) {
// has nearly the full range of uint32_t except it only keeps the top 3 bits
// beneath the leading 1 bit and thus has a max value of 0xf0000000.
inline __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
static __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
int log2x;
#if __CUDA_ARCH__
log2x = 31-__clz(x|1);
@ -261,7 +401,7 @@ inline __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
return exponent<<bitsPerPow2 | mantissa;
}
inline __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
static __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
uint32_t exponent = x>>bitsPerPow2;
uint32_t mantissa = (x & ((1u<<bitsPerPow2)-1)) | (exponent!=0 ? 0x8 : 0);
if (exponent != 0) exponent -= 1;
@ -270,16 +410,16 @@ inline __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
constexpr uint32_t u32fp8MaxValue() { return 0xf0000000; }
inline __host__ __device__ uint8_t u32fp8Encode(uint32_t x) {
static __host__ __device__ uint8_t u32fp8Encode(uint32_t x) {
return u32fpEncode(x, 3);
}
inline __host__ __device__ uint32_t u32fp8Decode(uint8_t x) {
static __host__ __device__ uint32_t u32fp8Decode(uint8_t x) {
return u32fpDecode(x, 3);
}
// The hash isn't just a function of the bytes but also where the bytes are split
// into different calls to eatHash().
inline __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size_t size) {
static __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size_t size) {
char const* ptr = (char const*)bytes;
acc[0] ^= size;
while (size != 0) {
@ -302,11 +442,11 @@ inline __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size
}
template<typename T>
inline __host__ __device__ void eatHash(uint64_t acc[2], const T* bytes) {
static __host__ __device__ void eatHash(uint64_t acc[2], const T* bytes) {
eatHash(acc, (const void*)bytes, sizeof(T));
}
inline __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
static __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
uint64_t h = acc[0];
h ^= h >> 31;
h *= 0xbac3bd562846de6b;
@ -316,13 +456,13 @@ inline __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
return h;
}
inline __host__ __device__ uint64_t getHash(const void* bytes, size_t size) {
static __host__ __device__ uint64_t getHash(const void* bytes, size_t size) {
uint64_t acc[2] = {1, 1};
eatHash(acc, bytes, size);
return digestHash(acc);
}
template<typename T>
inline __host__ __device__ uint64_t getHash(const T* bytes) {
static __host__ __device__ uint64_t getHash(const T* bytes) {
return getHash((const void*)bytes, sizeof(T));
}


@ -17,6 +17,7 @@
#include "register.h"
#include "graph.h"
#include "profiler.h"
#include "allocator.h"
#if CUDART_VERSION < 9000
struct cudaLaunchParams {
@ -131,7 +132,6 @@ struct ncclSharedResources {
int* tpRankToLocalRank;
// Internal streams
struct ncclStrongStream deviceStream, hostStream;
int noncapturedRefs; // number of non-captured hostStreamPlanCallback on the stream
int persistentRefs;
cudaEvent_t launchEvent, scratchEvent;
@ -218,6 +218,7 @@ struct ncclTaskColl {
// Profiler plugin
int eActivationMask;
void* eventHandle;
uint8_t nChannels;
};
struct ncclTaskP2p {
struct ncclTaskP2p* next;
@ -231,6 +232,7 @@ struct ncclTaskP2p {
// Profiler plugin
int eActivationMask;
void* eventHandle;
uint8_t nChannels;
};
struct ncclKernelPlan {
@ -243,10 +245,14 @@ struct ncclKernelPlan {
bool persistent; // aka captured in a graph
bool isHostCbEnq;
bool isSymColl;
enum ncclDevWorkStorageType workStorageType;
bool kernelSpecialized;
void *kernelFn;
struct ncclDevKernelArgs* kernelArgs;
void* kernelFn;
union {
struct ncclDevKernelArgs* kernelArgs;
struct ncclSymDevArgs* kernelSymArgs;
};
size_t kernelArgsSize;
uint64_t channelMask; // bitset of which channels are present
bool hasProxyOps; // does any channel have a non-empty proxyOpQueue
@ -355,6 +361,7 @@ struct ncclKernelPlanner {
struct Peer* peers/*[nRanks]*/;
int nTasksColl, nTasksP2p;
bool persistent;
bool isSymColl;
// The list of user streams aggregated over all tasks present.
struct ncclCudaStreamList* streams;
@ -404,6 +411,12 @@ struct ncclKernelPlanner {
#define NCCL_MAGIC 0x0280028002800280 // Nickel atomic number is 28.
typedef enum ncclGroupTaskType {
ncclGroupTaskTypeCollective = 0,
ncclGroupTaskTypeSymRegister = 1,
ncclGroupTaskTypeNum = 2,
} ncclGroupTaskType_t;
struct ncclComm {
uint64_t startMagic;
struct ncclMemoryStack memPermanent, memScoped;
@ -420,9 +433,10 @@ struct ncclComm {
struct ncclTopoSystem* topo;
struct ncclProxyConnector* gproxyConn;
struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next> legacyRegCleanupQueue;
bool peerInfoValid;
int netPluginLoaded;
ncclNet_t* ncclNet;
int netPluginIndex;
int ncclNetVer;
ncclNetDeviceType netDeviceType;
ncclCollNet_t* ncclCollNet;
@ -439,7 +453,6 @@ struct ncclComm {
uint64_t magic; // Magic number for all network communication. Not a security key -- only goal is to detect mismatches.
const char* commName;
uint64_t commHash;
int rank; // my rank in the communicator
int nRanks; // number of GPUs in communicator
@ -515,6 +528,7 @@ struct ncclComm {
// Device side of the communicator (for cudaFree's)
struct ncclDevComm* devComm; // actually = &ncclDevCommAndChannels::comm
struct ncclSymDevComm symDevComm;
uint32_t workArgsBytes; // max size of kernel args
uint32_t workFifoBytes; // size of workFifoBuf, power of 2
@ -522,12 +536,10 @@ struct ncclComm {
void* workFifoBufDev;
void* workFifoBufGdrHandle;
// Monotonic number of bytes (mod 1<<32) consumed per channel. In cudaHost memory.
uint32_t* workFifoConsumed/*[MAXCHANNELS]*/;
// Last observed value of: min(workFifoConsumed[c] for c < MAXCHANNELS)
uint32_t workFifoConsumedLeast;
// Monotonic number of bytes (mod 1<<32) sent to fifo.
uint32_t workFifoProduced;
uint32_t workFifoProducedLastRecorded;
uint32_t workFifoConsumed;
// Intra-process sync
struct ncclComm* intraComm0; // leader of intra-process comms (self possible)
@ -543,10 +555,8 @@ struct ncclComm {
struct ncclProxyState* proxyState;
int proxyRefCountOld; /* store proxy post-atomic-sub refcount */
// Whether this communicator uses collNet
int collNetSupport;
bool isOneRPN;
uint8_t collNetSupportMatrix[4/*sum,prod,max,min*/][ncclNumTypes];
bool intraNodeP2pSupport;
int* collNetHeads;
int collNetHeadsNum;
int* collNetDenseToUserRank;
@ -568,7 +578,7 @@ struct ncclComm {
// Next comm in this thread's active ncclGroup[Start|End](). Holds "0x1" when
// this comm is not yet in a group.
struct ncclComm* groupNext;
struct ncclComm* groupNext[ncclGroupTaskTypeNum];
// Subset of those in groupNext list. Holds 0x1 if not needing preconnect.
struct ncclComm* preconnectNext;
int localPersistentRefs; // number of persistent plan-lists capturing this comm
@ -588,6 +598,7 @@ struct ncclComm {
ncclUserRedOp *userRedOps;
// Queue of things for the main thread to do
int reclaimSteps;
struct ncclIntruQueueMpsc<struct ncclCommCallback, &ncclCommCallback::next> callbackQueue;
ncclConfig_t config;
@ -600,6 +611,9 @@ struct ncclComm {
// group job to support multi-thread FT
struct ncclGroupJob *groupJob;
// Flag indicating if this communicator shares resources with parent or children
bool shareResources;
// Tuning plugin
int tunerPluginLoaded;
ncclTuner_t* tuner;
@ -613,9 +627,18 @@ struct ncclComm {
// buffer registration cache
struct ncclRegCache regCache;
int isAllNvlink;
bool isAllDirectP2p;
int symmetricSupport;
bool useNetPXN;
bool useGdr;
int splitCount;
// symmetric buffer
uint8_t* baseUCSymPtr;
uint8_t* baseMCSymPtr;
size_t baseStride;
size_t symAllocHead;
CUmemGenericAllocationHandle symMCHandle;
struct ncclIntruQueue<struct ncclSymRegTask, &ncclSymRegTask::next> symRegTaskQueue;
uint64_t endMagic;
};
@ -647,15 +670,21 @@ inline ncclResult_t ncclCommPollCallbacks(struct ncclComm* comm, bool waitSome)
return ncclSuccess;
}
inline ncclResult_t ncclCommPollEventCallbacks(struct ncclComm *comm) {
inline ncclResult_t ncclCommPollEventCallbacks(struct ncclComm *comm, bool waitSome) {
ncclResult_t result = ncclSuccess;
cudaStreamCaptureMode mode = cudaStreamCaptureModeRelaxed;
CUDACHECK(cudaThreadExchangeStreamCaptureMode(&mode));
while (true) {
struct ncclCommEventCallback* cb = ncclIntruQueueHead(&comm->eventCallbackQueue);
if (cb == nullptr) break;
cudaError_t ok = cudaEventSynchronize(cb->event);
if (ok == cudaErrorNotReady) break;
cudaError_t ok;
if (waitSome) {
ok = cudaEventSynchronize(cb->event);
waitSome = false;
} else {
ok = cudaEventQuery(cb->event);
if (ok == cudaErrorNotReady) break;
}
ncclIntruQueueDequeue(&comm->eventCallbackQueue);
if (ok == cudaSuccess) {
NCCLCHECKGOTO(cb->fn(comm, cb), result, finish);


@ -58,4 +58,29 @@ static ncclResult_t ncclCpusetToStr(cpu_set_t* mask, char* str) {
return ncclSuccess;
}
static char* ncclCpusetToRangeStr(cpu_set_t* mask, char* str, size_t len) {
int c = 0;
int start = -1;
// Iterate through all possible CPU bits plus one extra position
for (int cpu = 0; cpu <= CPU_SETSIZE; cpu++) {
int isSet = (cpu == CPU_SETSIZE) ? 0 : CPU_ISSET(cpu, mask);
// Start of a new range
if (isSet && start == -1) {
start = cpu;
}
// End of a range, add comma between ranges
if (!isSet && start != -1) {
if (cpu-1 == start) {
c += snprintf(str+c, len-c, "%s%d", c ? "," : "", start);
} else {
c += snprintf(str+c, len-c, "%s%d-%d", c ? "," : "", start, cpu-1);
}
if (c >= len-1) break;
start = -1;
}
}
if (c == 0) str[0] = '\0';
return str;
}
#endif
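ncclCpusetToRangeStr() above compresses an affinity mask into the familiar "first-last" notation (e.g. CPUs {0,1,2,3,8,9} become "0-3,8-9"), which keeps the INFO/TRACE affinity lines short on wide servers. The following is a small standalone sketch exercising the same logic on a hand-built cpu_set_t; it assumes a Linux/glibc host (CPU_* macros from <sched.h>) and reimplements the helper locally so it compiles on its own.

#include <sched.h>
#include <cstdio>

// Same range-compression idea as ncclCpusetToRangeStr(), kept standalone here.
static char* cpusetToRangeStr(cpu_set_t* mask, char* str, size_t len) {
  int c = 0, start = -1;
  for (int cpu = 0; cpu <= CPU_SETSIZE; cpu++) {
    int isSet = (cpu == CPU_SETSIZE) ? 0 : CPU_ISSET(cpu, mask);
    if (isSet && start == -1) start = cpu;          // start of a new range
    if (!isSet && start != -1) {                    // end of a range
      if (cpu - 1 == start) c += snprintf(str + c, len - c, "%s%d", c ? "," : "", start);
      else c += snprintf(str + c, len - c, "%s%d-%d", c ? "," : "", start, cpu - 1);
      if (c >= (int)len - 1) break;
      start = -1;
    }
  }
  if (c == 0) str[0] = '\0';
  return str;
}

int main() {
  cpu_set_t mask; CPU_ZERO(&mask);
  for (int i = 0; i < 4; i++) CPU_SET(i, &mask);    // CPUs 0-3
  CPU_SET(8, &mask); CPU_SET(9, &mask);             // CPUs 8-9
  char buf[128];
  printf("%s\n", cpusetToRangeStr(&mask, buf, sizeof(buf)));  // prints "0-3,8-9"
  return 0;
}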


@ -36,6 +36,10 @@ extern CUmemAllocationHandleType ncclCuMemHandleType;
} \
} while(false)
#define CUCALL(cmd) do { \
pfn_##cmd; \
} while(false)
#define CUCHECKGOTO(cmd, res, label) do { \
CUresult err = pfn_##cmd; \
if( err != CUDA_SUCCESS ) { \
@ -66,49 +70,49 @@ extern CUmemAllocationHandleType ncclCuMemHandleType;
} \
} while(0)
#define DECLARE_CUDA_PFN_EXTERN(symbol) extern PFN_##symbol pfn_##symbol
#define DECLARE_CUDA_PFN_EXTERN(symbol,version) extern PFN_##symbol##_v##version pfn_##symbol
#if CUDART_VERSION >= 11030
/* CUDA Driver functions loaded with cuGetProcAddress for versioning */
DECLARE_CUDA_PFN_EXTERN(cuDeviceGet);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGetAttribute);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorString);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorName);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAddressRange);
DECLARE_CUDA_PFN_EXTERN(cuCtxCreate);
DECLARE_CUDA_PFN_EXTERN(cuCtxDestroy);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetCurrent);
DECLARE_CUDA_PFN_EXTERN(cuCtxSetCurrent);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetDevice);
DECLARE_CUDA_PFN_EXTERN(cuPointerGetAttribute);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernel);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGet, 2000);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGetAttribute, 2000);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorString, 6000);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorName, 6000);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAddressRange, 3020);
DECLARE_CUDA_PFN_EXTERN(cuCtxCreate, 11040);
DECLARE_CUDA_PFN_EXTERN(cuCtxDestroy, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetCurrent, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxSetCurrent, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetDevice, 2000);
DECLARE_CUDA_PFN_EXTERN(cuPointerGetAttribute, 4000);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernel, 4000);
#if CUDART_VERSION >= 11080
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernelEx);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernelEx, 11060);
#endif
// cuMem API support
DECLARE_CUDA_PFN_EXTERN(cuMemAddressReserve);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressFree);
DECLARE_CUDA_PFN_EXTERN(cuMemCreate);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationGranularity);
DECLARE_CUDA_PFN_EXTERN(cuMemExportToShareableHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemImportFromShareableHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemMap);
DECLARE_CUDA_PFN_EXTERN(cuMemRelease);
DECLARE_CUDA_PFN_EXTERN(cuMemRetainAllocationHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemSetAccess);
DECLARE_CUDA_PFN_EXTERN(cuMemUnmap);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationPropertiesFromHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressReserve, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressFree, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemCreate, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationGranularity, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemExportToShareableHandle, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemImportFromShareableHandle, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemMap, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemRelease, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemRetainAllocationHandle, 11000);
DECLARE_CUDA_PFN_EXTERN(cuMemSetAccess, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemUnmap, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationPropertiesFromHandle, 10020);
#if CUDA_VERSION >= 11070
DECLARE_CUDA_PFN_EXTERN(cuMemGetHandleForAddressRange); // DMA-BUF support
DECLARE_CUDA_PFN_EXTERN(cuMemGetHandleForAddressRange, 11070); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
DECLARE_CUDA_PFN_EXTERN(cuMulticastAddDevice);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindMem);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindAddr);
DECLARE_CUDA_PFN_EXTERN(cuMulticastCreate);
DECLARE_CUDA_PFN_EXTERN(cuMulticastGetGranularity);
DECLARE_CUDA_PFN_EXTERN(cuMulticastUnbind);
DECLARE_CUDA_PFN_EXTERN(cuMulticastAddDevice, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindMem, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindAddr, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastCreate, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastGetGranularity, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastUnbind, 12010);
#endif
#endif


@ -10,6 +10,7 @@
#include "nccl.h"
#include "nccl_common.h"
#include "bitops.h"
#include "symmetric.h"
#include <algorithm>
#include <stdint.h>
#include <sys/types.h>
@ -29,6 +30,30 @@ extern const char* ncclProtoStr[NCCL_NUM_PROTOCOLS];
#define NCCL_CUDA_ARCH 0
#endif
#ifdef __CUDA_ARCH_SPECIFIC__
#define NCCL_CUDA_ARCH_SPECIFIC __CUDA_ARCH_SPECIFIC__
#elif defined(__CUDA_ARCH_HAS_FEATURE__)
#if __CUDA_ARCH_HAS_FEATURE__(SM90_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 900
#elif __CUDA_ARCH_HAS_FEATURE__(SM100_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1000
#elif __CUDA_ARCH_HAS_FEATURE__(SM101_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1010
#elif __CUDA_ARCH_HAS_FEATURE__(SM120_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1200
#else
#define NCCL_CUDA_ARCH_SPECIFIC 0
#endif
#else
#define NCCL_CUDA_ARCH_SPECIFIC 0
#endif
#ifdef __CUDA_ARCH_FAMILY_SPECIFIC__
#define NCCL_CUDA_ARCH_FAMILY_SPECIFIC __CUDA_ARCH_FAMILY_SPECIFIC__
#else
#define NCCL_CUDA_ARCH_FAMILY_SPECIFIC 0
#endif
#include "net_device.h"
enum ncclDevRedOp_t {
@ -380,6 +405,14 @@ struct alignas(16) ncclDevChannel {
uint64_t workCounter;
};
#define MAX_PROFILER_EVENTS_PER_CHANNEL 64
struct ncclDevProfiler {
struct {
uint64_t counter;
uint64_t timestamp;
} data[MAX_PROFILER_EVENTS_PER_CHANNEL];
};
struct ncclDevComm {
int rank;
int nRanks;
@ -389,9 +422,6 @@ struct ncclDevComm {
int p2pChunkSize;
int isAllNvlink;
// Work fifo return credits
uint32_t* workConsumed/*[MAXCHANNELS]*/;
int* collNetDenseToUserRank;
// Flag to ask NCCL kernels to abort
@ -402,8 +432,8 @@ struct ncclDevComm {
int* rankToLocalRank;
// Profiler counters
uint64_t* workStarted/*[MAXCHANNELS]*/;
uint64_t* workCompleted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workStarted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workCompleted/*[MAXCHANNELS]*/;
};
struct alignas(16) ncclDevCommAndChannels {
@ -476,7 +506,7 @@ __host__ __device__ constexpr int ncclCalcUnroll(int bytePerPack, int insns, int
__host__ __device__ constexpr int ncclCollUnroll(int cudaArch = NCCL_CUDA_ARCH) {
// Our collective unroll should move to the same bytes&insns model as NVLS.
return cudaArch >= 800 ? (cudaArch == 1200 ? 6 : 8) : 4;
return cudaArch >= 800 ? (cudaArch / 100 == 12 ? 6 : 8) : 4;
}
__host__ __device__ constexpr int ncclNvlsUnrollBytes(int cudaArch = NCCL_CUDA_ARCH) { return 4*16; }
@ -507,7 +537,6 @@ extern int const ncclDevKernelCount;
extern void* const ncclDevKernelList[/*ncclDevKernelCount*/];
// Table of most specialized kernel function to run given func index.
extern int const ncclDevFuncIdCount;
extern int const ncclDevFuncRowToId[];
extern void* const ncclDevKernelForFunc[/*funcIndex*/];
extern bool const ncclDevKernelForFuncIsSpecialized[/*funcIndex*/];
@ -535,11 +564,7 @@ inline bool ncclNvlsSupported(int devRedOp, int type) {
// `ncclDevFuncIndex()` needs to be in sync with "all_functions()" in "src/device/generate.py"
inline int ncclDevFuncId(int coll, int devRedOp, int type, int algo, int proto) {
#if defined(__CUDA_BF16_TYPES_EXIST__)
constexpr int NumTypes = ncclNumTypes;
#else
constexpr int NumTypes = ncclNumTypes + 1;
#endif
int row;
do {
row = 0; // ncclDevFuncIndex_P2p
@ -564,7 +589,7 @@ inline int ncclDevFuncId(int coll, int devRedOp, int type, int algo, int proto)
}
row += nAlgos*NCCL_NUM_PROTOCOLS;
nAlgos = 6;
nAlgos = 6; // TREE RING COLLNET_DIRECT COLLNET_CHAIN NVLS NVLS_TREE
if (coll == ncclFuncAllReduce) {
row += ((devRedOp*NumTypes + type)*nAlgos + algo)*NCCL_NUM_PROTOCOLS + proto;
break;


@ -50,6 +50,8 @@ int ncclPxnDisable(struct ncclComm* comm);
ncclResult_t ncclTopoGetPxnRanks(struct ncclComm* comm, int** intermediateRanks, int* nranks);
ncclResult_t ncclGetLocalCpu(struct ncclTopoSystem* system, int gpu, int* retCpu);
ncclResult_t ncclGetUserP2pLevel(int* level);
// Find CPU affinity
ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu_set_t* affinity);
@ -74,7 +76,9 @@ ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int ch
ncclResult_t ncclTopoGetLocalGpu(struct ncclTopoSystem* system, int64_t netId, int* gpuIndex);
ncclResult_t getLocalNetCountByBw(struct ncclTopoSystem* system, int gpu, int *count);
#define NCCL_TOPO_MAX_NODES 256
// Allows for up to 32 NICs per node on GB200-NVL72
#define NCCL_TOPO_MAX_NODES 576
ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType, int locals[NCCL_TOPO_MAX_NODES], int* localCount, int* pathType);
// Init search. Needs to be done before calling ncclTopoCompute
ncclResult_t ncclTopoSearchInit(struct ncclTopoSystem* system);


@ -9,9 +9,11 @@
#include "nccl.h"
#include "comm.h"
#include "allocator.h"
#include "register.h"
ncclResult_t ncclGroupErrCheck(ncclResult_t ret);
void ncclGroupCommJoin(struct ncclComm* comm);
void ncclGroupCommJoin(struct ncclComm* comm, int type);
void ncclGroupCommPreconnect(struct ncclComm* comm);
ncclResult_t ncclGroupCommLeave(struct ncclComm* comm);
ncclResult_t ncclGroupJobAbort(struct ncclGroupJob* groupJob);
@ -52,13 +54,14 @@ ncclResult_t ncclAsyncLaunch(
struct ncclGroupJob {
struct ncclAsyncJob base;
struct ncclComm **groupCommHeadPtr;
struct ncclComm **groupCommPreconnectHeadPtr;
ncclResult_t *groupErrorPtr;
bool *abortFlagPtr;
int *groupBlockingPtr;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsPtr;
bool initialized;
int groupRefCount;
bool nonBlockingInit;
bool joined;
struct ncclComm *groupCommHead[ncclGroupTaskTypeNum];
struct ncclComm *groupCommPreconnectHead;
ncclResult_t groupError;
bool abortFlag;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncJobs;
};
ncclResult_t ncclGroupStartInternal();
@ -69,27 +72,9 @@ ncclResult_t ncclAsyncJobComplete(struct ncclAsyncJob* job);
extern __thread int ncclGroupDepth; // depth of ncclGroupStart nesting
extern __thread ncclResult_t ncclGroupError;
extern __thread struct ncclComm* ncclGroupCommHead;
extern __thread struct ncclComm* ncclGroupCommHead[ncclGroupTaskTypeNum];
extern __thread struct ncclComm* ncclGroupCommPreconnectHead;
extern __thread int ncclGroupBlocking;
extern __thread struct ncclGroupJob *ncclGroupJobMainPtr;
extern __thread struct ncclGroupJob ncclGroupJobMain;
static inline void groupResetJobState() {
ncclGroupBlocking = -1;
ncclGroupJobMainPtr = NULL;
memset(&ncclGroupJobMain, 0, sizeof(struct ncclGroupJob));
return;
}
static inline ncclResult_t groupJobComplete(struct ncclGroupJob* job) {
ncclResult_t ret = ncclSuccess;
if (job) {
ret = ncclAsyncJobComplete(&job->base);
groupResetJobState();
}
return ret;
}
inline ncclResult_t ncclGroupStartInternal() {
ncclGroupDepth++;
@ -104,31 +89,32 @@ inline ncclResult_t ncclGroupErrCheck(ncclResult_t ret) {
}
// Add comm to this thread's group
inline void ncclGroupCommJoin(struct ncclComm* comm) {
if (comm->groupNext == reinterpret_cast<struct ncclComm*>(0x1)) {
inline void ncclGroupCommJoin(struct ncclComm* comm, int type) {
if (comm->groupNext[type] == reinterpret_cast<struct ncclComm*>(0x1)) {
// Insert comm into ncclGroupCommHead adjacent to sibling comms. This preserves
// the user's program order yet ensures siblings occur consecutively. This
// is required by doLaunches() in "group.cc".
struct ncclComm** pp = &ncclGroupCommHead;
struct ncclComm** pp = &ncclGroupCommHead[type];
while (*pp != nullptr && comm->intraComm0 != (*pp)->intraComm0)
pp = &(*pp)->groupNext;
pp = &(*pp)->groupNext[type];
// didn't find its clique, we need to insert it with ascending order based on commHash
if (*pp == nullptr) {
pp = &ncclGroupCommHead;
while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext;
pp = &ncclGroupCommHead[type];
while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext[type];
}
comm->groupNext = *pp;
comm->groupNext[type] = *pp;
*pp = comm;
// Each comm gets a new memory stack scope upon joining. Each task batched for
// this comm is allocated there.
ncclMemoryStackPush(&comm->memScoped);
// Initialize planner
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (type == ncclGroupTaskTypeCollective) {
// Initialize planner
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
}
}
ncclGroupBlocking = comm->config.blocking;
}
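
For reference, a minimal standalone sketch of the clique-grouping insertion performed above (the simplified Comm struct and joinGroup helper are hypothetical, not NCCL types): comms that share an intraComm0 stay adjacent, and a comm whose clique is not yet present is inserted in ascending commHash order.

#include <cstdint>

struct Comm {
  Comm* intraComm0;          // clique identifier: first comm of the intra-process group
  uint64_t commHash;         // ordering key used when a clique is seen for the first time
  Comm* groupNext = nullptr;
};

static void joinGroup(Comm** head, Comm* comm) {
  Comm** pp = head;
  // Walk until a sibling of the same clique is found, so siblings stay consecutive.
  while (*pp != nullptr && comm->intraComm0 != (*pp)->intraComm0)
    pp = &(*pp)->groupNext;
  // No sibling yet: fall back to ascending commHash order for the new clique.
  if (*pp == nullptr) {
    pp = head;
    while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext;
  }
  comm->groupNext = *pp;
  *pp = comm;
}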
@ -141,8 +127,8 @@ inline void ncclGroupCommPreconnect(struct ncclComm* comm) {
}
// Comm has left group
inline ncclResult_t ncclGroupCommLeave(struct ncclComm* comm) {
comm->groupNext = reinterpret_cast<struct ncclComm*>(0x1);
inline ncclResult_t ncclGroupCommLeave(struct ncclComm* comm, int type) {
comm->groupNext[type] = reinterpret_cast<struct ncclComm*>(0x1);
ncclMemoryStackPop(&comm->memScoped);
return ncclSuccess;
}


@ -9,6 +9,7 @@
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#if __GNUC__ >= 3
# define __attribute_const __attribute__((const))
@ -39,7 +40,7 @@ union ibv_gid {
#define vext_field_avail(type, fld, sz) (offsetof(type, fld) < (sz))
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
//static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
enum ibv_node_type {
IBV_NODE_UNKNOWN = -1,
@ -208,7 +209,9 @@ struct ibv_port_attr {
uint8_t active_speed;
uint8_t phys_state;
uint8_t link_layer;
uint8_t reserved;
uint8_t flags;
uint16_t port_cap_flags2;
uint32_t active_speed_ex;
};
enum ibv_event_type {
@ -993,37 +996,50 @@ enum verbs_context_mask {
struct verbs_context {
/* "grows up" - new fields go here */
int (*_reserved_2) (void);
int (*destroy_flow) (struct ibv_flow *flow);
int (*_reserved_1) (void);
struct ibv_flow * (*create_flow) (struct ibv_qp *qp,
struct ibv_flow_attr *flow_attr);
int (*query_port)(struct ibv_context *context, uint8_t port_num,
struct ibv_port_attr *port_attr,
size_t port_attr_len);
int (*_reserved[25]) (void);
struct verbs_ex_private *priv;
int (*query_device_ex)(struct ibv_context *context,
const struct ibv_query_device_ex_input *input,
struct ibv_device_attr_ex *attr,
size_t attr_size);
int (*ibv_destroy_flow) (struct ibv_flow *flow);
void (*ABI_placeholder2) (void); /* DO NOT COPY THIS GARBAGE */
struct ibv_flow * (*ibv_create_flow) (struct ibv_qp *qp,
struct ibv_flow_attr *flow_attr);
void (*ABI_placeholder1) (void); /* DO NOT COPY THIS GARBAGE */
struct ibv_qp * (*open_qp)(struct ibv_context *context,
struct ibv_qp_open_attr *attr);
struct ibv_qp * (*create_qp_ex)(struct ibv_context *context,
struct ibv_qp_init_attr_ex *qp_init_attr_ex);
int (*get_srq_num)(struct ibv_srq *srq, uint32_t *srq_num);
struct ibv_srq * (*create_srq_ex)(struct ibv_context *context,
struct ibv_srq_init_attr_ex *srq_init_attr_ex);
struct ibv_xrcd * (*open_xrcd)(struct ibv_context *context,
struct ibv_xrcd_init_attr *xrcd_init_attr);
int (*close_xrcd)(struct ibv_xrcd *xrcd);
uint64_t has_comp_mask;
size_t sz; /* Must be immediately before struct ibv_context */
struct ibv_context context;/* Must be last field in the struct */
struct ibv_srq * (*create_srq_ex)(struct ibv_context *context,
struct ibv_srq_init_attr_ex *srq_init_attr_ex);
struct ibv_xrcd * (*open_xrcd)(struct ibv_context *context,
struct ibv_xrcd_init_attr *xrcd_init_attr);
int (*close_xrcd)(struct ibv_xrcd *xrcd);
uint64_t _ABI_placeholder3;
size_t sz; /* Must be immediately before struct ibv_context */
struct ibv_context context; /* Must be last field in the struct */
};
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
/*static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
{
return (!ctx || (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)) ?
NULL : container_of(ctx, struct verbs_context, context);
if (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)
return NULL;
/* open code container_of to not pollute the global namespace */
return (struct verbs_context *)(((uintptr_t)ctx) -
offsetof(struct verbs_context,
context));
}
#define verbs_get_ctx_op(ctx, op) ({ \
struct verbs_context *_vctx = verbs_get_ctx(ctx); \
(!_vctx || (_vctx->sz < sizeof(*_vctx) - offsetof(struct verbs_context, op)) || \
!_vctx->op) ? NULL : _vctx; })*/
struct verbs_context *__vctx = verbs_get_ctx(ctx); \
(!__vctx || (__vctx->sz < sizeof(*__vctx) - offsetof(struct verbs_context, op)) || \
!__vctx->op) ? NULL : __vctx; })
#define verbs_set_ctx_op(_vctx, op, ptr) ({ \
struct verbs_context *vctx = _vctx; \
@ -1055,4 +1071,20 @@ struct ibv_ece {
uint32_t comp_mask;
};
/**
* ibv_query_port_ex - Get (extended) port properties
*/
static inline int ibv_query_port_ex(struct ibv_context *context,
uint8_t port_num,
struct ibv_port_attr *port_attr)
{
struct verbs_context *vctx = verbs_get_ctx_op(context, query_port);
if (vctx) {
return vctx->query_port(context, port_num, port_attr, sizeof(*port_attr));
}
return -1;
}
#endif // NCCL_IBV_CORE_H_
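
A hedged usage sketch of the helper above: ctx is assumed to be an ibv_context obtained elsewhere (for example via ibv_open_device()), and the extended fields are only meaningful when the provider implements the extended query_port verb.

// Hypothetical helper, assuming this header is included and ctx is already open.
static inline uint32_t queryActiveSpeedEx(struct ibv_context* ctx, uint8_t port) {
  struct ibv_port_attr attr;
  if (ibv_query_port_ex(ctx, port, &attr) != 0) return 0;  // extended query not available
  return attr.active_speed_ex;                             // extended speed encoding, if provided
}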


@ -0,0 +1,18 @@
#ifndef NCCL_MLX5DV_CORE_H_
#define NCCL_MLX5DV_CORE_H_
/* Basic MLX5 direct verbs structs. Needed to dynamically load MLX5 direct verbs functions without
 * explicitly including the MLX5 direct verbs header.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include "ibvwrap.h"
enum mlx5dv_reg_dmabuf_access {
MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT = (1<<0),
};
#endif // NCCL_MLX5DV_CORE_H_


@ -0,0 +1,23 @@
#ifndef NCCL_MLX5DV_SYMBOLS_H_
#define NCCL_MLX5DV_SYMBOLS_H_
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "nccl.h"
/* MLX5 Direct Verbs function pointers */
struct ncclMlx5dvSymbols {
bool (*mlx5dv_internal_is_supported)(struct ibv_device *device);
int (*mlx5dv_internal_get_data_direct_sysfs_path)(struct ibv_context *context, char *buf, size_t buf_len);
/* DMA-BUF support */
struct ibv_mr * (*mlx5dv_internal_reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
};
/* Builds the MLX5 direct verbs symbol table, depending on whether rdma-core is linked in or loaded dynamically */
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols);
#endif // NCCL_MLX5DV_SYMBOLS_H_
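
A hedged sketch of how these pointers might be resolved in the dynamic-loading mode (when NCCL_BUILD_MLX5DV is not set). The library name libmlx5.so.1 and the exported symbol names follow rdma-core conventions but are assumptions; the real buildMlx5dvSymbols() may differ.

#include <dlfcn.h>

static ncclResult_t buildMlx5dvSymbolsSketch(struct ncclMlx5dvSymbols* s) {
  void* lib = dlopen("libmlx5.so.1", RTLD_NOW);   // assumed library name
  if (lib == NULL) return ncclSystemError;
  s->mlx5dv_internal_is_supported =
    (bool (*)(struct ibv_device*)) dlsym(lib, "mlx5dv_is_supported");
  s->mlx5dv_internal_get_data_direct_sysfs_path =
    (int (*)(struct ibv_context*, char*, size_t)) dlsym(lib, "mlx5dv_get_data_direct_sysfs_path");
  s->mlx5dv_internal_reg_dmabuf_mr =
    (struct ibv_mr* (*)(struct ibv_pd*, uint64_t, size_t, uint64_t, int, int, int)) dlsym(lib, "mlx5dv_reg_dmabuf_mr");
  // Only is_supported and reg_dmabuf_mr are treated as mandatory in this sketch.
  return (s->mlx5dv_internal_is_supported && s->mlx5dv_internal_reg_dmabuf_mr) ? ncclSuccess : ncclSystemError;
}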


@ -0,0 +1,41 @@
/*************************************************************************
* Copyright (c) 2004, 2005 Topspin Communications. All rights reserved.
* Copyright (c) 2004, 2011-2012 Intel Corporation. All rights reserved.
* Copyright (c) 2005, 2006, 2007 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2005 PathScale, Inc. All rights reserved.
*
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_MLX5DVWRAP_H_
#define NCCL_MLX5DVWRAP_H_
#include <arpa/inet.h>
#include <netinet/in.h>
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "core.h"
#include "ibvwrap.h"
#include <sys/types.h>
#include <unistd.h>
typedef enum mlx5dv_return_enum
{
MLX5DV_SUCCESS = 0, //!< The operation was successful
} mlx5dv_return_t;
ncclResult_t wrap_mlx5dv_symbols(void);
/* NCCL wrappers of MLX5 direct verbs functions */
bool wrap_mlx5dv_is_supported(struct ibv_device *device);
ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context *context, char *buf, size_t buf_len);
/* DMA-BUF support */
ncclResult_t wrap_mlx5dv_reg_dmabuf_mr(struct ibv_mr **ret, struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
struct ibv_mr * wrap_direct_mlx5dv_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
#endif // NCCL_MLX5DVWRAP_H_


@ -7,6 +7,8 @@
#ifndef NCCL_DEBUG_H_
#define NCCL_DEBUG_H_
#include <cstdint>
typedef enum {
NCCL_LOG_NONE = 0,
NCCL_LOG_VERSION = 1,
@ -38,6 +40,16 @@ typedef enum {
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
// NCCL core profiler callback for network defined events instrumentation
enum {
ncclProfilerNetEventStart = 0,
ncclProfilerNetEventStop,
ncclProfilerNetEventUpdate,
ncclProfilerNetEventUpdateAndStop,
};
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* pHandle, int64_t pluginId, void* extData);
#define NCCL_NUM_FUNCTIONS 5 // Send/Recv not included for now
typedef enum {
ncclFuncBroadcast = 0,
@ -51,7 +63,7 @@ typedef enum {
ncclNumFuncs = 8
} ncclFunc_t;
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*/PAT
#define NCCL_ALGO_UNDEF -1
#define NCCL_ALGO_TREE 0
#define NCCL_ALGO_RING 1


@ -14,8 +14,6 @@
typedef char ncclNetHandle_t[NCCL_NET_HANDLE_MAXSIZE];
ncclResult_t ncclNetPluginLoad(struct ncclComm* comm);
ncclResult_t ncclNetPluginUnload(struct ncclComm* comm);
ncclResult_t ncclNetInit(struct ncclComm* comm);
ncclResult_t ncclNetFinalize(struct ncclComm* comm);


@ -31,10 +31,11 @@
#define NVTX_SID_CommInitRankScalable 12 // same schema as NVTX_SID_CommInitRank
#define NVTX_SID_CommSplit 13
#define NVTX_SID_CommFinalize 14
#define NVTX_SID_CommShrink 15
// When adding new schema IDs, DO NOT re-use/overlap with the enum schema ID below!
// Define static schema ID for the reduction operation.
#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 15 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 16 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
extern const nvtxDomainHandle_t ncclNvtxDomainHandle;


@ -67,6 +67,16 @@ NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommSplit, static cons
)
)
NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommShrink, static constexpr,
NCCL_NVTX_PAYLOAD_ENTRIES(
(uint64_t, newcomm, TYPE_UINT64, nccl_nvtxCommStr),
(int, nranks, TYPE_INT, nccl_nvtxNranksStr),
(int, myrank, TYPE_INT, nccl_nvtxRankStr),
(int, cudaDev, TYPE_INT, nccl_nvtxCudaDevStr),
(int, num_exclude, TYPE_INT, "num_exclude")
)
)
NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommFinalize, static constexpr,
NCCL_NVTX_PAYLOAD_ENTRIES(
(uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr)


@ -28,10 +28,9 @@
#define NCCL_NET_MAX_REQUESTS 32
// Max number of ncclNet objects which can live in the same process
#define NCCL_NET_MAX_PLUGINS 3
// NCCL core profiler callback for network defined events instrumentation
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* pHandle, int64_t pluginId, void* extData);
#ifndef NCCL_NET_MAX_PLUGINS
#define NCCL_NET_MAX_PLUGINS 16
#endif
#include "net/net_v10.h"
#include "net/net_v9.h"


@ -19,43 +19,53 @@ enum {
};
typedef enum {
ncclProfilerProxyOpSendPosted,
ncclProfilerProxyOpSendRemFifoWait,
ncclProfilerProxyOpSendTransmitted,
ncclProfilerProxyOpSendDone,
ncclProfilerProxyOpRecvPosted,
ncclProfilerProxyOpRecvReceived,
ncclProfilerProxyOpRecvTransmitted,
ncclProfilerProxyOpRecvDone,
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,
/* Legacy proxy profiler states */
ncclProfilerProxyStepSendGPUWait,
ncclProfilerProxyStepSendWait,
ncclProfilerProxyStepRecvWait,
ncclProfilerProxyStepRecvFlushWait,
ncclProfilerProxyStepRecvGPUWait,
ncclProfilerProxyStepSendGPUWait = 8,
ncclProfilerProxyStepSendPeerWait_v4 = 20,
ncclProfilerProxyStepSendWait = 9,
ncclProfilerProxyStepRecvWait = 10,
ncclProfilerProxyStepRecvFlushWait = 11,
ncclProfilerProxyStepRecvGPUWait = 12,
/* Legacy proxy control states */
ncclProfilerProxyCtrlIdle,
ncclProfilerProxyCtrlActive,
ncclProfilerProxyCtrlSleep,
ncclProfilerProxyCtrlWakeup,
ncclProfilerProxyCtrlAppend,
ncclProfilerProxyCtrlAppendEnd,
ncclProfilerProxyCtrlIdle = 13,
ncclProfilerProxyCtrlActive = 14,
ncclProfilerProxyCtrlSleep = 15,
ncclProfilerProxyCtrlWakeup = 16,
ncclProfilerProxyCtrlAppend = 17,
ncclProfilerProxyCtrlAppendEnd = 18,
/* Network defined event states */
ncclProfilerNetPluginUpdate = 21,
/* Kernel event states */
ncclProfilerKernelChStop = 22,
} ncclProfilerEventState_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
#include <cstdint>
#include "profiler/profiler_v4.h"
#include "profiler/profiler_v3.h"
#include "profiler/profiler_v2.h"
#include "profiler/profiler_v1.h"
typedef ncclProfiler_v3_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v4_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
#define NCCL_PROFILER_NET_VER_BITS (16)
#define NCCL_PROFILER_NET_VER_MASK (~0U >> NCCL_PROFILER_NET_VER_BITS)


@ -9,10 +9,16 @@
#include "nccl.h"
enum ncclPluginType {
ncclPluginTypeNet,
ncclPluginTypeTuner,
ncclPluginTypeProfiler,
};
void* ncclOpenNetPluginLib(const char* name);
void* ncclOpenTunerPluginLib(const char* name);
void* ncclOpenProfilerPluginLib(const char* name);
void* ncclGetNetPluginLib(void);
ncclResult_t ncclClosePluginLib(void* handle);
void* ncclGetNetPluginLib(enum ncclPluginType type);
ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type);
#endif


@ -0,0 +1,123 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef PROFILER_V4_H_
#define PROFILER_V4_H_
typedef struct {
uint8_t type; // event type descriptor: ncclProfileColl, ...
void* parentObj; // pointer to the profiler parent object (for a coll this is the group)
int rank; // originating rank
union {
struct {
uint64_t seqNumber;
const char* func;
void const* sendBuff;
void* recvBuff;
size_t count;
int root;
const char* datatype;
uint8_t nChannels;
uint8_t nWarps;
const char* algo;
const char* proto;
} coll;
struct {
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer;
uint8_t nChannels;
} p2p;
struct {
pid_t pid; // pid of the originating process
uint8_t channelId; // channel id for this proxy operation
int peer; // remote rank for send/recv
int nSteps; // number of steps for this proxy operation
int chunkSize; // amount of data transferred by this proxy operation
int isSend;
} proxyOp;
struct {
int step;
} proxyStep;
struct {
uint8_t channelId;
uint64_t pTimer; // start timestamp from GPU globaltimer
} kernelCh;
struct {
int64_t id;
void* data;
} netPlugin;
};
} ncclProfilerEventDescr_v4_t;
typedef union {
struct {
size_t transSize;
} proxyStep;
struct {
int appendedProxyOps;
} proxyCtrl;
struct {
void* data;
} netPlugin;
struct {
uint64_t pTimer;
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
typedef struct {
const char* name;
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the event set
// Input
// - context: opaque profiler context object
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside an event set
// Input
// - eHandle: handle to event object
ncclResult_t (*stopEvent)(void* eHandle);
// recordEventState - record event state transitions and event attribute updates
// Input
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v4_t;
#endif
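
A minimal, hypothetical sketch of a plugin filling in the v4 interface above. It assumes nccl.h and this header are on the include path; the exported symbol name ncclProfiler_v4 and the do-nothing bodies are illustrative only, not a real profiler.

#include <stdint.h>
#include <stdlib.h>

static ncclResult_t exampleInit(void** context, int* eActivationMask, const char* commName,
                                uint64_t commHash, int nNodes, int nranks, int rank,
                                ncclDebugLogger_t logfn) {
  *context = calloc(1, sizeof(long)); // per-communicator event counter
  *eActivationMask = 0;               // plugin selects which event types it wants to receive
  return ncclSuccess;
}
static ncclResult_t exampleStartEvent(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr) {
  if (context) (*(long*)context)++;   // just count started events in this sketch
  *eHandle = NULL;                    // no per-event state kept
  return ncclSuccess;
}
static ncclResult_t exampleStopEvent(void* eHandle) { return ncclSuccess; }
static ncclResult_t exampleRecordEventState(void* eHandle, ncclProfilerEventState_v4_t eState,
                                            ncclProfilerEventStateArgs_v4_t* eStateArgs) {
  return ncclSuccess;
}
static ncclResult_t exampleFinalize(void* context) { free(context); return ncclSuccess; }

// Assumed export name; fields follow the member order of ncclProfiler_v4_t above.
ncclProfiler_v4_t ncclProfiler_v4 = {
  "ExampleProfiler", exampleInit, exampleStartEvent, exampleStopEvent, exampleRecordEventState, exampleFinalize
};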


@ -21,8 +21,8 @@ struct ncclProxyConnector;
struct ncclProfilerProxy {
bool initialized;
uint64_t* workStarted/*[MAXCHANNELS]*/;
uint64_t* workCompleted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workStarted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workCompleted/*[MAXCHANNELS]*/;
uint64_t workCounter[MAXCHANNELS]; // host work counter
struct ncclProxyConnector sendProxyConn[MAXCHANNELS];
struct ncclProxyConnector recvProxyConn[MAXCHANNELS];
@ -43,8 +43,7 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan);
ncclResult_t ncclProfilerStopTaskEvents(struct ncclKernelPlan* plan);
// Proxy Op Start/Stop Event Wrappers
ncclResult_t ncclProfilerStartSendProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStartRecvProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStartProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStopProxyOpEvent(int sub, struct ncclProxyArgs* args);
// Proxy Step Start/Stop Event Wrappers
@ -57,11 +56,11 @@ ncclResult_t ncclProfilerStartProxyCtrlEvent(void* profilerContext, void** eHand
ncclResult_t ncclProfilerStopProxyCtrlEvent(void* eHandle);
// Kernel Channel Start/Stop Event Wrappers
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s);
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s);
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t start);
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t stop);
// Record Event Wrappers
ncclResult_t ncclProfilerRecordProxyOpEventState(int sub, struct ncclProxyArgs* args, int steps, size_t transSize, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyOpEventState(int sub, struct ncclProxyArgs* args, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyStepEventState(int sub, struct ncclProxyArgs* args, int stepId, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyCtrlEventState(void*eHandle, int appended, ncclProfilerEventState_t eState);


@ -105,6 +105,13 @@ struct ncclProxyOp {
struct ncclProxyOp *enqNext;
};
struct ncclProxySubArgs;
struct ncclProxyEventHandle {
void* stepEventHandle;
struct ncclProxySubArgs* subArgPtr;
};
struct ncclProxySubArgs {
struct ncclProxyConnection* connection;
int reg;
@ -137,13 +144,12 @@ struct ncclProxySubArgs {
// Profiler plugin
int eActivationMask;
int rank;
uint64_t profilerSteps;
pid_t pid;
void* profilerContext;
void* taskEventHandle;
void* opEventHandle;
void* kernelEventHandle;
void* stepEventHandles[NCCL_STEPS];
struct ncclProxyEventHandle pHandles[NCCL_STEPS];
size_t transSize;
uint64_t workCounter;
@ -226,6 +232,8 @@ struct ncclProxyPeer {
};
struct ncclSharedNetComms {
int activeConnect[MAXCHANNELS];
int activeAccept[MAXCHANNELS];
void* sendComm[MAXCHANNELS];
void* recvComm[MAXCHANNELS];
int sendRefCount[MAXCHANNELS];


@ -29,18 +29,24 @@ struct ncclRegNetHandles {
struct ncclRegNetHandles* next;
};
struct ncclSymRegTask {
struct ncclSymRegTask *next;
void* buff;
size_t baseSize;
CUmemGenericAllocationHandle memHandle;
struct ncclReg* regHandle;
size_t alignment;
};
struct ncclReg {
// common attributes
size_t pages;
uintptr_t begAddr, endAddr; // page aligned
int localRefs;
int graphRefs;
uintptr_t addr;
uint32_t state;
// net reg
struct ncclRegNetHandles* netHandleHead;
// nvls reg
uintptr_t baseAddr;
size_t baseSize;
CUdeviceptr regAddr;
size_t regUCSize, regMCSize;
int dev;
@ -52,6 +58,10 @@ struct ncclReg {
// general ipc reg
struct ncclPeerRegIpcAddr regIpcAddrs;
struct ncclIpcRegInfo* ipcInfos[NCCL_MAX_LOCAL_RANKS];
// symmetric reg
void* baseSymPtr;
size_t symSize;
int winFlags;
};
struct ncclRegCache {
@ -60,10 +70,14 @@ struct ncclRegCache {
uintptr_t pageSize;
};
struct ncclWindow {
struct ncclReg* handle;
};
ncclResult_t ncclRegCleanup(struct ncclComm* comm);
ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data, size_t size, struct ncclReg** reg);
ncclResult_t ncclCommGraphRegister(const ncclComm_t comm, void* buff, size_t size, void** handle);
ncclResult_t ncclCommGraphDeregister(const ncclComm_t comm, struct ncclReg *handle);
ncclResult_t ncclRegLocalIsValid(struct ncclReg *reg, bool *isValid);
ncclResult_t ncclCommSymmetricRegisterInternal(struct ncclComm* comm, void* buff, size_t baseSize, size_t alignment, CUmemGenericAllocationHandle memHandle, struct ncclReg* regHandle);
#endif


@ -0,0 +1,33 @@
#ifndef NCCL_REGISTER_INLINE_H_
#define NCCL_REGISTER_INLINE_H_
#include "comm.h"
#include "register.h"
static inline ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data, size_t size, struct ncclReg** outReg) {
struct ncclRegCache* cache = &comm->regCache;
*outReg = NULL;
for (int slot=0; /*true*/; slot++) {
if (slot == cache->population) return ncclSuccess;
struct ncclReg *reg = cache->slots[slot];
if ((uintptr_t)data < reg->begAddr) return ncclSuccess;
if ((uintptr_t)data + size <= reg->endAddr) {
*outReg = reg;
return ncclSuccess;
}
}
}
static inline ncclResult_t ncclRegFindSymmetric(struct ncclComm* comm, const void* data, size_t size, void** symPtr, struct ncclReg** outReg) {
struct ncclReg* regRecord = NULL;
*symPtr = NULL;
*outReg = NULL;
NCCLCHECK(ncclRegFind(comm, data, size, &regRecord));
if (regRecord && regRecord->baseSymPtr) {
*symPtr = (void*)((uintptr_t)regRecord->baseSymPtr + (uintptr_t)data - (uintptr_t)regRecord->begAddr);
*outReg = regRecord;
}
return ncclSuccess;
}
#endif
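
A worked illustration (hypothetical addresses) of the translation performed by ncclRegFindSymmetric(): a user pointer inside a registered window maps into the symmetric window at the same offset from the registered base.

#include <cassert>
#include <cstdint>

int main() {
  uintptr_t begAddr    = 0x7f0000000000;  // registered region base (page aligned)
  uintptr_t baseSymPtr = 0x200000000000;  // symmetric mapping of the same region
  uintptr_t data       = 0x7f0000001240;  // user buffer inside the region
  uintptr_t symPtr = baseSymPtr + (data - begAddr);  // same formula as above
  assert(symPtr == 0x200000001240);
  return 0;
}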


@ -69,8 +69,10 @@ struct ncclSocket {
const char *ncclSocketToString(const union ncclSocketAddress *addr, char *buf, const int numericHostForm = 1);
ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char* ip_port_pair);
int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAddrs, union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int maxIfs);
int ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs);
ncclResult_t ncclFindInterfaceMatchSubnet(char* ifName, union ncclSocketAddress* localAddr,
union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int* found);
ncclResult_t ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs,
int* nIfs);
// Initialize a socket
ncclResult_t ncclSocketInit(struct ncclSocket* sock, const union ncclSocketAddress* addr = NULL, uint64_t magic = NCCL_SOCKET_MAGIC, enum ncclSocketType type = ncclSocketTypeUnknown, volatile uint32_t* abortFlag = NULL, int asyncFlag = 0, int customRetry = 0);

src/include/symmetric.h (new file, 90 lines)

@ -0,0 +1,90 @@
#ifndef NCCL_DEVICE_SYMMETRIC_H_
#define NCCL_DEVICE_SYMMETRIC_H_
#include "nccl.h"
#include "nccl_common.h"
#include "bitops.h"
constexpr int ncclSymMaxBlocks = 64;
constexpr int ncclSymMaxThreads = 512;
constexpr int ncclSymLLMaxEltSize = 64;
constexpr __host__ __device__ int ncclSymLLMaxSlots(int eltSize = ncclSymLLMaxEltSize) {
return ncclSymMaxThreads*ncclSymLLMaxEltSize/eltSize;
}
constexpr __host__ __device__ int ncclSymLLEpochSize(int nRanks) {
return /*LL Overhead*/2 * maxval(ncclSymMaxThreads*nRanks*8, ncclSymLLMaxSlots(ncclSymLLMaxEltSize)*ncclSymLLMaxEltSize);
}
struct alignas(16) ncclSymDevBase {
uint32_t llEpoch[ncclSymMaxBlocks];
uint32_t barEpochMc[ncclSymMaxBlocks], barEpochUc[ncclSymMaxBlocks];
uint32_t barInboxMc[ncclSymMaxBlocks];
uint32_t barInboxPerPeer[];
static constexpr size_t size(int nRanks) {
return sizeof(ncclSymDevBase) +
alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16) +
ncclSymMaxBlocks * /*epochs=*/2 * ncclSymLLEpochSize(nRanks);
}
};
static __device__ uint4* ncclSymDevBase_getLLBuf(struct ncclSymDevBase* base, int nRanks, int block, uint32_t epoch) {
// Get pointer to buffer trailing the header struct.
char* ans = (char*)(base + 1);
// Skip over barInboxPerPeer[]
ans += alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16);
// Skip to our block
int epochSize = ncclSymLLEpochSize(nRanks);
ans += block * /*epochs=*/2 * epochSize;
ans += (epoch & 1)*epochSize;
return (uint4*)ans;
}
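
A standalone sketch (constants copied from above, helper names hypothetical) that re-derives the layout arithmetic: for nRanks = 8 each block's LL epoch is 64 KiB, the per-peer inbox area is 2 KiB, and the LL buffers total 8 MiB.

#include <cstdint>
#include <cstdio>
#include <cstddef>

constexpr int kMaxBlocks = 64, kMaxThreads = 512, kLLMaxEltSize = 64;
constexpr size_t alignUp16(size_t x) { return (x + 15) & ~size_t(15); }
constexpr int llMaxSlots(int eltSize) { return kMaxThreads * kLLMaxEltSize / eltSize; }
constexpr int llEpochSize(int nRanks) {
  int a = kMaxThreads * nRanks * 8;                  // first argument of maxval() above
  int b = llMaxSlots(kLLMaxEltSize) * kLLMaxEltSize; // second argument: full slot payload
  return 2 * (a > b ? a : b);
}

int main() {
  int nRanks = 8;
  size_t inbox   = alignUp16(size_t(kMaxBlocks) * nRanks * sizeof(uint32_t));
  size_t llBytes = size_t(kMaxBlocks) * 2 * llEpochSize(nRanks);
  printf("epochSize=%d inboxBytes=%zu llBytes=%zu\n", llEpochSize(nRanks), inbox, llBytes);
  return 0;
}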
struct ncclSymDevComm {
ncclSymDevBase* base;
ncclSymDevBase* baseMc;
uint32_t stride4G;
int nRanks, rank;
uint32_t nRanks_rcp32; // idivRcp32(nRanks)
};
struct alignas(16) ncclSymDevArgs {
struct ncclSymDevComm comm;
int rootRank;
uint64_t redOpArg; // must be collectively uniform
size_t nElts;
char* input;
char* output;
};
enum ncclSymKernelId {
ncclSymKernelId_AllReduce_AGxLL_R,
ncclSymKernelId_AllReduce_AGxLLMC_R,
ncclSymKernelId_AllReduce_RSxLD_AGxST,
ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC,
ncclSymKernelId_AllGather_LL,
ncclSymKernelId_AllGather_LLMC,
ncclSymKernelId_AllGather_ST,
ncclSymKernelId_AllGather_STMC,
ncclSymKernelId_ReduceScatter_LL,
ncclSymKernelId_ReduceScatter_LD,
ncclSymKernelId_ReduceScatter_LDMC,
ncclSymKernelId_Count
};
bool ncclSymImplemented(ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
ncclResult_t ncclSymPickKernel(struct ncclComm* comm, ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty, size_t nElts, float* estTimeUs, ncclSymKernelId* kernelId, int* nBlocks, int* nWarps);
// Generated by src/device/symmetric/generate.py
extern int const ncclSymKernelCount;
extern void* const ncclSymKernelList[];
void* ncclSymGetKernelPtr(ncclSymKernelId kernelId, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
const char* ncclSymKernelIdToString(int kernelId);
#endif


@ -22,6 +22,7 @@
#include "proxy.h"
#include "comm.h"
#include "bootstrap.h"
extern struct ncclTransport p2pTransport;
extern struct ncclTransport shmTransport;
@ -46,6 +47,7 @@ struct ncclPeerInfo {
int64_t busId;
struct ncclComm* comm;
int cudaCompCap;
size_t totalGlobalMem;
// MNNVL support
nvmlGpuFabricInfoV_t fabricInfo;
int cuMemSupport;
@ -53,6 +55,8 @@ struct ncclPeerInfo {
};
#define CONNECT_SIZE 256
#define NCCL_MAX_PAGE_SIZE (512L * 1024L * 1024L)
#define NCCL_REC_PAGE_SIZE (2L * 1024L * 1024L)
struct ncclConnect {
char data[CONNECT_SIZE];
};
@ -80,6 +84,7 @@ struct ncclNvlsSharedRes {
char* ucBuff; // Unicast NVLS buffer address
char* ucCredit; // Unicast NVLS credit address
int nChannels;
int nHeads;
struct ncclShmemCollBuff nvlsShmem;
void *nvlsShmemHandle;
};
@ -119,7 +124,8 @@ struct ncclTransport {
ncclResult_t ncclTransportP2pConnect(struct ncclComm* comm, int channelId, int nrecv, int* peerRecv, int nsend, int* peerSend, int connIndex);
ncclResult_t ncclTransportP2pSetup(struct ncclComm* comm, struct ncclTopoGraph* graph, int connIndex);
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* intraNodeP2pSupport, bool* directMode);
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* isAllDirectP2p, bool* directMode);
ncclResult_t ncclTransportIsAllDirectP2p(struct ncclComm* comm, int* isAllDirectP2p);
ncclResult_t ncclNvlsInit(struct ncclComm* comm);
ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent);
@ -154,5 +160,15 @@ ncclResult_t ncclRegisterP2pIpcBuffer(struct ncclComm* comm, void* userbuff, siz
ncclResult_t ncclRegisterP2pNetBuffer(struct ncclComm* comm, void* userbuff, size_t size, struct ncclConnector* conn, int* regFlag, void** handle, struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue);
ncclResult_t ncclRegisterCollBuffers(struct ncclComm* comm, struct ncclTaskColl* info, void* outRegBufSend[NCCL_MAX_LOCAL_RANKS], void* outRegBufRecv[NCCL_MAX_LOCAL_RANKS], struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue, bool* regNeedConnect);
ncclResult_t ncclRegisterCollNvlsBuffers(struct ncclComm* comm, struct ncclTaskColl* info, void* outRegBufSend[NCCL_MAX_LOCAL_RANKS], void* outRegBufRecv[NCCL_MAX_LOCAL_RANKS], struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue, bool* regNeedConnect);
ncclResult_t ncclNvlsRegResourcesQuery(struct ncclComm* comm, struct ncclTaskColl* info, int* recChannels);
ncclResult_t ncclIpcSymmetricInit(struct ncclComm* comm);
ncclResult_t ncclIpcSymmetricMap(struct ncclComm* comm, size_t offset, size_t size, CUmemGenericAllocationHandle memHandle, void** symPtr);
ncclResult_t ncclIpcSymmetricFree(struct ncclComm* comm, size_t size, void* symPtr);
ncclResult_t ncclIpcSymmetricFinalize(struct ncclComm* comm);
ncclResult_t ncclNvlsSymmetricInit(struct ncclComm* comm);
ncclResult_t ncclNvlsSymmetricMap(struct ncclComm* comm, size_t offset, size_t ucsize, void* ucaddr);
ncclResult_t ncclNvlsSymmetricFree(struct ncclComm* comm, size_t ucsize, void* ucaddr);
ncclResult_t ncclNvlsSymmetricFinalize(struct ncclComm* comm);
#endif


@ -43,6 +43,12 @@ static long log2i(long n) {
return log2Down(n);
}
// Comparator function for qsort/bsearch to compare integers
static int compareInts(const void *a, const void *b) {
int ia = *(const int*)a, ib = *(const int*)b;
return (ia > ib) - (ia < ib);
}
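
A brief, hypothetical usage note: qsort() and bsearch() from <stdlib.h> take exactly this comparator, for example to keep a rank list sorted before membership checks (containsRank is illustrative, not an NCCL function).

// Assumes <stdlib.h> is included.
static bool containsRank(int* ranks, int n, int key) {
  qsort(ranks, n, sizeof(int), compareInts);                         // e.g. {7,2,5} -> {2,5,7}
  return bsearch(&key, ranks, n, sizeof(int), compareInts) != NULL;
}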
inline uint64_t clockNano() {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);


@ -18,6 +18,7 @@
#include "argcheck.h"
#include "tuner.h"
#include "ras.h"
#include "profiler.h"
#include "mnnvl.h"
#include <fcntl.h>
#include <string.h>
@ -29,6 +30,7 @@
#include <unistd.h>
#include "param.h"
#include "nvtx_payload_schemas.h"
#include "utils.h"
#define STR2(v) #v
#define STR(v) STR2(v)
@ -48,6 +50,10 @@ NCCL_PARAM(GroupCudaStream, "GROUP_CUDA_STREAM", NCCL_GROUP_CUDA_STREAM);
NCCL_PARAM(CheckPointers, "CHECK_POINTERS", 0);
NCCL_PARAM(CommBlocking, "COMM_BLOCKING", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(RuntimeConnect, "RUNTIME_CONNECT", 1);
NCCL_PARAM(WinEnable, "WIN_ENABLE", 1);
NCCL_PARAM(CollnetEnable, "COLLNET_ENABLE", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(CtaPolicy, "CTA_POLICY", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(NvlsChannels, "NVLS_NCHANNELS", NCCL_CONFIG_UNDEF_INT);
static ncclResult_t commReclaim(ncclComm_t comm);
@ -174,6 +180,10 @@ static ncclResult_t commFree(ncclComm_t comm) {
if (comm == NULL)
return ncclSuccess;
if (comm->symmetricSupport && comm->symDevComm.base) {
NCCLCHECK(ncclCommSymmetricFreeInternal(comm, comm->baseUCSymPtr + comm->rank * comm->baseStride));
}
NCCLCHECK(ncclRasCommFini(comm));
/* in commReclaim, we have guaranteed only last rank which calls ncclCommDestroy() will
@ -253,15 +263,16 @@ static ncclResult_t commFree(ncclComm_t comm) {
NCCLCHECK(ncclRegCleanup(comm));
if (comm->symmetricSupport) {
NCCLCHECK(ncclNvlsSymmetricFinalize(comm));
NCCLCHECK(ncclIpcSymmetricFinalize(comm));
}
INFO(NCCL_INIT,"comm %p rank %d nranks %d cudaDev %d busId %lx - %s COMPLETE", comm, comm->rank, comm->nRanks, comm->cudaDev, comm->busId, abort ? "Abort" : "Destroy");
commPoison(comm); // poison comm before free to avoid comm reuse.
NCCLCHECK(ncclProfilerPluginFinalize(comm));
NCCLCHECK(ncclNetFinalize(comm));
NCCLCHECK(ncclNetPluginUnload(comm));
ncclCudaContextDrop(comm->context);
free(comm);
return ncclSuccess;
@ -271,7 +282,7 @@ NCCL_PARAM(DisableGraphHelper, "GRAPH_HELPER_DISABLE", 0);
// GDRCOPY support: FIFO_ENABLE when enabled locates a workFifo in CUDA memory
NCCL_PARAM(GdrCopyFifoEnable, "GDRCOPY_FIFO_ENABLE", 1);
#define NCCL_WORK_FIFO_BYTES_DEFAULT (1<<20)
NCCL_PARAM(WorkFifoBytes, "WORK_FIFO_BYTES", -1);
NCCL_PARAM(WorkFifoBytes, "WORK_FIFO_BYTES", NCCL_WORK_FIFO_BYTES_DEFAULT);
NCCL_PARAM(WorkArgsBytes, "WORK_ARGS_BYTES", INT64_MAX);
enum ncclLaunchMode ncclParamLaunchMode;
@ -331,12 +342,10 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
comm->rank = rank;
comm->nRanks = ndev;
NCCLCHECK(ncclNetPluginLoad(comm));
NCCLCHECK(ncclNetInit(comm));
NCCLCHECK(ncclProfilerPluginInit(comm));
INFO(NCCL_INIT, "Using network %s", comm->ncclNet->name);
if (parent && parent->config.splitShare) {
if (parent && parent->shareResources) {
if (parent->ncclNet != comm->ncclNet) {
WARN("Split shares resources, but parent comm netName %s is different from child comm netName %s", parent->ncclNet->name, comm->ncclNet->name);
return ncclInvalidUsage;
@ -361,13 +370,14 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
comm->checkPointers = ncclParamCheckPointers() == 1 ? true : false;
comm->dmaBufSupport = (dmaBufSupported(comm) == ncclSuccess) ? true : false;
comm->collNetSupport = 0;
memset(comm->collNetSupportMatrix, 0, sizeof(comm->collNetSupportMatrix));
ncclMemoryPoolConstruct(&comm->memPool_ncclKernelPlan);
ncclMemoryPoolConstruct(&comm->memPool_ncclProxyOp);
comm->groupNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < ncclGroupTaskTypeNum; i++) {
comm->groupNext[i] = reinterpret_cast<struct ncclComm*>(0x1);
}
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
static_assert(MAXCHANNELS <= sizeof(*comm->connectSend)*8, "comm->connectSend must have enough bits for all channels");
@ -378,7 +388,7 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
// Mark channels as non initialized.
for (int c=0; c < MAXCHANNELS; c++) comm->channels[c].id = -1;
if (parent == NULL || !parent->config.splitShare) {
if (parent == NULL || !parent->shareResources) {
struct ncclSharedResources* sharedRes = NULL;
NCCLCHECK(ncclCalloc(&sharedRes, 1));
/* most of attributes are assigned later in initTransportsRank(). */
@ -432,6 +442,7 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
bool ccEnable;
cudaStream_t deviceStream;
memset(&tmpCommAndChans, '\0', sizeof(tmpCommAndChans));
NCCLCHECKGOTO(ncclStrongStreamAcquire(ncclCudaGraphNone(), &comm->sharedRes->deviceStream, /*concurrent=*/false, &deviceStream), ret, fail);
NCCLCHECKGOTO(ncclCudaCallocAsync(&devCommAndChans, 1, deviceStream), ret, fail);
ncclCommPushCudaFree(comm, devCommAndChans);
@ -458,22 +469,12 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
if (ccEnable) {
comm->workFifoBytes = 0;
} else {
int64_t workFifoBytesParam = ncclParamWorkFifoBytes();
if (workFifoBytesParam == -1) {
if (comm->MNNVL && (comm->compCap >= 100)) {
// WAR: Disable work fifo for Blackwell all2all hang issue on MNNVL
INFO(NCCL_INIT, "Disabling work fifo");
comm->workFifoBytes = 0;
} else {
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
} else {
if (0 != (workFifoBytesParam & (workFifoBytesParam-1))) {
WARN("NCCL_WORK_FIFO_BYTES=%ld is being ignored because it is not a power of 2.", workFifoBytesParam);
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
comm->workFifoBytes = std::min<uint64_t>(workFifoBytesParam, 1ul<<30);
comm->workFifoBytes = ncclParamWorkFifoBytes();
if (0 != (comm->workFifoBytes & (comm->workFifoBytes-1))) {
WARN("NCCL_WORK_FIFO_BYTES=%d is being ignored because it is not a power of 2.", comm->workFifoBytes);
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
comm->workFifoBytes = std::min(comm->workFifoBytes, 1u<<30);
}
if (comm->rank == 0) {
@ -492,11 +493,9 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
comm->workFifoBufDev = comm->workFifoBuf;
}
NCCLCHECKGOTO(ncclCudaHostCalloc(&comm->workFifoConsumed, MAXCHANNELS), ret, fail);
ncclCommPushCudaHostFree(comm, comm->workFifoConsumed);
comm->workFifoProduced = 0;
comm->workFifoConsumedLeast = 0;
tmpCommAndChans.comm.workConsumed = comm->workFifoConsumed;
comm->workFifoProducedLastRecorded = 0;
comm->workFifoConsumed = 0;
// Alloc profiler counters for the kernel
NCCLCHECKGOTO(ncclCudaHostCalloc(&comm->profiler.workStarted, MAXCHANNELS), ret, fail);
@ -549,6 +548,7 @@ NCCL_PARAM(MNNVLUUID, "MNNVL_UUID", -1);
NCCL_PARAM(MNNVLCliqueId, "MNNVL_CLIQUE_ID", -1);
static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, uint64_t commHash) {
cudaDeviceProp prop;
info->rank = comm->rank;
info->cudaDev = comm->cudaDev;
info->nvmlDev = comm->nvmlDev;
@ -556,6 +556,8 @@ static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, u
info->hostHash=getHostHash()+commHash;
info->pidHash=getPidHash()+commHash;
info->cuMemSupport = ncclCuMemEnable();
CUDACHECK(cudaGetDeviceProperties(&prop, comm->cudaDev));
info->totalGlobalMem = ROUNDUP(prop.totalGlobalMem, (1L << 32));
// Get the device MAJOR:MINOR of /dev/shm so we can use that
// information to decide whether we can use SHM for inter-process
@ -700,6 +702,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
struct ncclTopoRanks topoRanks;
int cpuArch;
int cpuVendor;
int localRanks;
};
int nChannelsOrig;
@ -711,12 +714,14 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
struct ncclProxyConnector proxyConn;
int* pxnPeers = NULL;
int *topParentLocalRanks = NULL;
int p2pLevel = -1;
timers[TIMER_INIT_ALLGATHER] = clockNano();
// AllGather1 - begin
NCCLCHECKGOTO(ncclCalloc(&comm->peerInfo, nranks+1), ret, fail); // Extra rank to represent CollNet root
NCCLCHECKGOTO(fillInfo(comm, comm->peerInfo+rank, comm->commHash), ret, fail);
NCCLCHECKGOTO(bootstrapAllGather(comm->bootstrap, comm->peerInfo, sizeof(struct ncclPeerInfo)), ret, fail);
__atomic_store_n(&comm->peerInfoValid, true, __ATOMIC_RELEASE);
comm->cuMemSupport = 1;
for (int i = 0; i < nranks; i++) {
@ -738,7 +743,8 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
timers[TIMER_INIT_ALLGATHER] = clockNano() - timers[TIMER_INIT_ALLGATHER];
// Check for MNNVL support
if ((nNodes > 1 && ncclParamMNNVLEnable() != 0) || ncclParamMNNVLEnable() == 1) {
NCCLCHECKGOTO(ncclGetUserP2pLevel(&p2pLevel), ret, fail);
if ((nNodes > 1 && ncclParamMNNVLEnable() != 0 && p2pLevel != 0) || ncclParamMNNVLEnable() == 1) {
NCCLCHECKGOTO(ncclMnnvlCheck(comm), ret, fail);
}
@ -829,14 +835,8 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
// Determine local CollNet support
if (collNetSupport(comm)) {
const char *collNetEnable = ncclGetEnv("NCCL_COLLNET_ENABLE");
if (collNetEnable != NULL) {
INFO(NCCL_ALL, "NCCL_COLLNET_ENABLE set by environment to %s.", collNetEnable);
if (strcmp(collNetEnable, "1") == 0) {
comm->collNetSupport = 1;
}
}
if (!collNetSupport(comm)) {
comm->config.collnetEnable = 0;
}
// Determine local Nvls support
@ -873,7 +873,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
collNetDirectGraph->collNet = 1;
collNetDirectGraph->minChannels = 1;
collNetDirectGraph->maxChannels = MAXCHANNELS;
if (comm->collNetSupport) {
if (comm->config.collnetEnable) {
NCCLCHECKGOTO(ncclTopoCompute(comm->topo, collNetChainGraph), ret, fail);
NCCLCHECKGOTO(ncclTopoPrintGraph(comm->topo, collNetChainGraph), ret, fail);
NCCLCHECKGOTO(ncclTopoCompute(comm->topo, collNetDirectGraph), ret, fail);
@ -1014,7 +1014,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
comm->maxTreePattern = std::max(comm->maxTreePattern, allGather3Data[i].graphInfo[NCCL_ALGO_TREE].pattern);
}
if (graphs[NCCL_ALGO_COLLNET_CHAIN]->nChannels == 0) comm->collNetSupport = 0;
if (graphs[NCCL_ALGO_COLLNET_CHAIN]->nChannels == 0) comm->config.collnetEnable = 0;
if (graphs[NCCL_ALGO_NVLS]->nChannels == 0) comm->nvlsSupport = comm->nvlsChannels = 0;
comm->nChannels = treeGraph->nChannels = ringGraph->nChannels = std::min(treeGraph->nChannels, ringGraph->nChannels);
@ -1025,11 +1025,11 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
// Determine CollNet support after all-gather now that we know nNodes and each node localRanks
if (comm->collNetSupport == 1) {
if (comm->config.collnetEnable == 1) {
int collNetNodeThreshold = ncclParamCollNetNodeThreshold();
if (comm->nNodes < collNetNodeThreshold) {
INFO(NCCL_INIT, "Communicator has %d nodes which is less than CollNet node threshold %d, disabling CollNet", comm->nNodes, collNetNodeThreshold);
comm->collNetSupport = 0;
comm->config.collnetEnable = 0;
}
}
NCCLCHECK(ncclTopoPathAllNVLink(comm->topo, &comm->isAllNvlink));
@ -1075,9 +1075,12 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
comm->topParentLocalRanks = topParentLocalRanks;
NCCLCHECKGOTO(ncclTransportCheckP2pType(comm, &comm->intraNodeP2pSupport, &comm->directMode), ret, fail);
// Profiler plugin context has to be initialized before proxy thread
NCCLCHECK(ncclProfilerPluginInit(comm));
NCCLCHECKGOTO(ncclTransportCheckP2pType(comm, &comm->isAllDirectP2p, &comm->directMode), ret, fail);
// Launch proxy service thread, after this, the proxy calls can be used.
if (parent && parent->config.splitShare) {
if (parent && parent->shareResources) {
comm->proxyState = parent->sharedRes->proxyState;
ncclAtomicRefCountIncrement(&parent->sharedRes->proxyState->refCount);
} else {
@ -1147,10 +1150,10 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
for (int c=0; c<comm->nChannels; c++) {
NCCLCHECKGOTO(setupChannel(comm, c, rank, nranks, rings+c*nranks), ret, fail);
}
// Setup NVLS
// Attempt to set up NVLS; this may silently fail and disable NVLS
NCCLCHECKGOTO(ncclNvlsSetup(comm, parent), ret, fail);
// Check if we can setup CollNet
if (comm->collNetSupport > 0) ncclCollNetSetup(comm, parent, graphs);
if (comm->config.collnetEnable) ncclCollNetSetup(comm, parent, graphs);
} else {
for (int c=0; c<comm->nChannels; c++) {
NCCLCHECKGOTO(setupChannel(comm, c, rank, nranks, rings+c*nranks), ret, fail);
@ -1163,7 +1166,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
// Connect PAT only for communicators with 1 GPU per node
if (comm->maxLocalRanks == 1) NCCLCHECKGOTO(ncclTransportPatConnect(comm), ret, fail);
// Setup NVLS
// Attempt to set up NVLS; this may silently fail and disable NVLS
NCCLCHECKGOTO(ncclNvlsSetup(comm, parent), ret, fail);
NCCLCHECKGOTO(ncclNvlsBufferSetup(comm), ret, fail);
@ -1171,7 +1174,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
NCCLCHECKGOTO(ncclNvlsTreeConnect(comm), ret, fail);
// Check if we can setup CollNet
if (comm->collNetSupport > 0) {
if (comm->config.collnetEnable) {
ncclCollNetSetup(comm, parent, graphs);
NCCLCHECKGOTO(ncclCollNetChainBufferSetup(comm), ret, fail);
if (comm->maxLocalRanks <= NCCL_MAX_DIRECT_ARITY+1) {
@ -1244,9 +1247,13 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
}
comm->symmetricSupport = comm->isAllDirectP2p && comm->nNodes == 1 && ncclParamWinEnable() && ncclCuMemEnable();
comm->baseStride = 0;
// Call devCommSetup before the last barrier, making sure we don't have a thread running in front and starting to
// launch NCCL kernels before all cuda mem allocation is complete. That could cause a deadlock.
NCCLCHECKGOTO(devCommSetup(comm), ret, fail);
timers[TIMER_INIT_CONNECT] = clockNano() - timers[TIMER_INIT_CONNECT];
/* Local intra-node barrier */
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
@ -1260,7 +1267,7 @@ exit:
/* If split resource is shared, we are not able to unlink the proxy ops pool here since the child comm can
* attach the proxy ops pool of parent at any time; otherwise, unlink it here to make sure the pool will be
* properly cleaned up. */
if (comm->sharedRes->owner == comm && !comm->config.splitShare && ret == ncclSuccess && !ncclCuMemEnable()) ncclProxyShmUnlink(comm);
if (comm->sharedRes->owner == comm && !comm->shareResources && ret == ncclSuccess && !ncclCuMemEnable()) ncclProxyShmUnlink(comm);
free(allTopoRanks);
free(nodesTreePatterns);
free(nodesFirstRank);
@ -1293,6 +1300,9 @@ struct ncclCommInitRankAsyncJob {
struct ncclComm* parent;
int color, key;
int splitCount;
// For Shrink
int* excludeRanksList;
int excludeRanksCount;
// name of the function calling
char funcName[NCCL_COMMINIT_FUNCNAME_LEN];
};
@ -1303,6 +1313,7 @@ struct ncclCommFinalizeAsyncJob {
};
NCCL_PARAM(CommSplitShareResources, "COMM_SPLIT_SHARE_RESOURCES", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(CommShrinkShareResources, "COMM_SHRINK_SHARE_RESOURCES", NCCL_CONFIG_UNDEF_INT);
typedef struct{
int key;
@ -1350,6 +1361,21 @@ fail:
goto exit;
}
static ncclResult_t getParentRanks(int parentRanks, int parentRank, int* excludeRanksList, int excludeRanksCount, int* nRanksRet, int* myRankRet, int* parentRanksRet) {
int count = 0, j = 0;
for (int i = 0; i < parentRanks; i++) {
// we assume excludeRanksList is sorted
if (j < excludeRanksCount && excludeRanksList[j] == i) {
j++;
continue;
}
if (i == parentRank) *myRankRet = count;
parentRanksRet[count++] = i;
}
*nRanksRet = parentRanks - excludeRanksCount;
return ncclSuccess;
}
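
A standalone worked example (hypothetical ranks, copy of the logic above) of the shrink remapping: excluding ranks {1,4} from a 6-rank parent leaves {0,2,3,5}, and parent rank 3 becomes rank 2 in the shrunk communicator.

#include <cassert>

// Standalone copy of the remapping above; excl[] is assumed sorted, as in the original.
static void shrinkMap(int parentRanks, int parentRank, const int* excl, int exclCount,
                      int* nRanksRet, int* myRankRet, int* out) {
  int count = 0, j = 0;
  for (int i = 0; i < parentRanks; i++) {
    if (j < exclCount && excl[j] == i) { j++; continue; }
    if (i == parentRank) *myRankRet = count;
    out[count++] = i;
  }
  *nRanksRet = parentRanks - exclCount;
}

int main() {
  int excl[2] = {1, 4}, out[6], nRanks = 0, myRank = -1;
  shrinkMap(6, 3, excl, 2, &nRanks, &myRank, out);
  assert(nRanks == 4 && myRank == 2 && out[0] == 0 && out[3] == 5);
  return 0;
}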
static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
struct ncclCommInitRankAsyncJob* job = (struct ncclCommInitRankAsyncJob*)job_;
ncclComm_t comm = job->comm;
@ -1383,9 +1409,13 @@ static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
if (job->parent) {
NCCLCHECKGOTO(ncclCalloc(&parentRanks, job->parent->nRanks), res, fail);
NCCLCHECKGOTO(commGetSplitInfo(comm, job->parent, job->color, job->key, &job->nranks, &job->myrank, parentRanks), res, fail);
// Negative color does not create a new comm object. We needed to take part in the allgather, but we're done now.
if (job->color == NCCL_SPLIT_NOCOLOR) goto exit;
if (job->excludeRanksCount) {
NCCLCHECKGOTO(getParentRanks(job->parent->nRanks, job->parent->rank, job->excludeRanksList, job->excludeRanksCount, &job->nranks, &job->myrank, parentRanks), res, fail);
} else {
NCCLCHECKGOTO(commGetSplitInfo(comm, job->parent, job->color, job->key, &job->nranks, &job->myrank, parentRanks), res, fail);
// Negative color does not create a new comm object. We needed to take part in the allgather, but we're done now.
if (job->color == NCCL_SPLIT_NOCOLOR) goto exit;
}
timers[TIMER_INIT_ALLOC] = clockNano();
NCCLCHECKGOTO(commAlloc(comm, job->parent, job->nranks, job->myrank), res, fail);
timers[TIMER_INIT_ALLOC] = clockNano() - timers[TIMER_INIT_ALLOC];
@ -1477,6 +1507,10 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
int minCTAsEnv;
int maxCTAsEnv;
int splitShareEnv;
const char* collnetEnableEnv;
int ctaPolicyEnv;
int shrinkShareEnv;
int nvlsCTAsEnv;
/* override configuration from env variable. */
blockingEnv = ncclParamCommBlocking();
@ -1522,6 +1556,31 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
if (splitShareEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.splitShare = splitShareEnv;
}
shrinkShareEnv = ncclParamCommShrinkShareResources();
if (shrinkShareEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.shrinkShare = shrinkShareEnv;
}
// NCCL_COLLNET_ENABLE needs to be reloaded for each comm init
// since users might change the env on the fly to enable/disable collnet
collnetEnableEnv = ncclGetEnv("NCCL_COLLNET_ENABLE");
if (collnetEnableEnv != NULL) {
int collnetEnableInt = (int)strtol(collnetEnableEnv, NULL, 0);
if (collnetEnableInt != NCCL_CONFIG_UNDEF_INT) {
comm->config.collnetEnable = collnetEnableInt;
INFO(NCCL_ENV, "NCCL_COLLNET_ENABLE set by environment to %d.", collnetEnableInt);
}
}
ctaPolicyEnv = ncclParamCtaPolicy();
if (ctaPolicyEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.CTAPolicy = ctaPolicyEnv;
}
nvlsCTAsEnv = ncclParamNvlsChannels();
if (nvlsCTAsEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.nvlsCTAs = nvlsCTAsEnv;
}
/* cap channels if needed */
if (comm->config.minCTAs > MAXCHANNELS) {
@ -1544,6 +1603,20 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
comm->config.splitShare = 0;
}
if (comm->config.collnetEnable != 1 && comm->config.collnetEnable != 0) {
INFO(NCCL_ENV, "collnetEnable %d is not a valid value 0/1, set it to 0", comm->config.collnetEnable);
comm->config.collnetEnable = 0;
}
if (comm->config.CTAPolicy < NCCL_CTA_POLICY_DEFAULT || comm->config.CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY) {
INFO(NCCL_ENV, "CTAPolicy %d is not a valid value, set it to %d", comm->config.CTAPolicy, NCCL_CTA_POLICY_DEFAULT);
comm->config.CTAPolicy = NCCL_CTA_POLICY_DEFAULT;
}
if (comm->config.nvlsCTAs != NCCL_CONFIG_UNDEF_INT && comm->config.nvlsCTAs <= 0) {
INFO(NCCL_ENV, "nvlsCTAs %d is not a valid value, NCCL will decide the default value automatically", comm->config.nvlsCTAs);
comm->config.nvlsCTAs = NCCL_CONFIG_UNDEF_INT;
}
return ret;
}
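The override order is therefore: user-supplied ncclConfig_t first, then these environment variables at communicator creation time. Below is a minimal sketch of driving the new knobs from the environment. NCCL_COLLNET_ENABLE is read verbatim above; NCCL_COMM_SHRINK_SHARE_RESOURCES is an assumption based on the NCCL_PARAM("COMM_SHRINK_SHARE_RESOURCES") declaration and the usual NCCL_ prefix convention.

#include <stdlib.h>
#include <nccl.h>

// Sketch only: set the overrides before the communicator is created, since
// envConfigOverride() runs during init (and NCCL_COLLNET_ENABLE is re-read on every init).
ncclResult_t initWithEnvOverrides(ncclComm_t* comm, int nRanks, ncclUniqueId id, int myRank) {
  setenv("NCCL_COLLNET_ENABLE", "1", 1);               // -> comm->config.collnetEnable = 1
  setenv("NCCL_COMM_SHRINK_SHARE_RESOURCES", "1", 1);  // assumed name -> comm->config.shrinkShare = 1
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;       // fields left at their undefined defaults
  return ncclCommInitRankConfig(comm, nRanks, id, myRank, &config);
}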
@ -1584,6 +1657,17 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
internalConfigPtr->maxCTAs = defaultConfig.maxCTAs;
internalConfigPtr->netName = defaultConfig.netName;
}
if (internalConfigPtr->version < NCCL_VERSION(2, 25, 0)) {
internalConfigPtr->trafficClass = defaultConfig.trafficClass;
}
if (internalConfigPtr->version < NCCL_VERSION(2, 27, 0)) {
internalConfigPtr->collnetEnable = defaultConfig.collnetEnable;
internalConfigPtr->CTAPolicy = defaultConfig.CTAPolicy;
internalConfigPtr->shrinkShare = defaultConfig.shrinkShare;
internalConfigPtr->nvlsCTAs = defaultConfig.nvlsCTAs;
}
}
/* check input config attributes, -1 means user-undefined and we should use default value from NCCL. */
@ -1615,6 +1699,31 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
goto fail;
}
if (internalConfigPtr->collnetEnable != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->collnetEnable < 0 || internalConfigPtr->collnetEnable > 1)) {
WARN("Invalid config collnetEnable attribute value %d", internalConfigPtr->collnetEnable);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->CTAPolicy != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->CTAPolicy < NCCL_CTA_POLICY_DEFAULT ||
internalConfigPtr->CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY)) {
WARN("Invalid config policy attribute value %d", internalConfigPtr->CTAPolicy);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->shrinkShare != NCCL_CONFIG_UNDEF_INT && internalConfigPtr->shrinkShare != 0 && internalConfigPtr->shrinkShare != 1) {
WARN("Invalid config shrinkShare attribute value %d", internalConfigPtr->shrinkShare);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->nvlsCTAs != NCCL_CONFIG_UNDEF_INT && internalConfigPtr->nvlsCTAs <= 0) {
WARN("Invalid config nvlsCTAs attribute value %d", internalConfigPtr->nvlsCTAs);
ret = ncclInvalidArgument;
goto fail;
}
/* default config value can be tuned on different platform. */
NCCL_CONFIG_DEFAULT(internalConfigPtr, blocking, NCCL_CONFIG_UNDEF_INT, 1, "Blocking", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, cgaClusterSize, NCCL_CONFIG_UNDEF_INT, 4, "CGA cluster size", "%d");
@ -1623,6 +1732,11 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
NCCL_CONFIG_DEFAULT(internalConfigPtr, netName, NCCL_CONFIG_UNDEF_PTR, NULL, "Net name", "%s");
NCCL_CONFIG_DEFAULT(internalConfigPtr, splitShare, NCCL_CONFIG_UNDEF_INT, 0, "Split share", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, trafficClass, NCCL_CONFIG_UNDEF_INT, NCCL_CONFIG_UNDEF_INT, "Traffic class", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, commName, NCCL_CONFIG_UNDEF_PTR, NULL, "Comm name", "%s");
NCCL_CONFIG_DEFAULT(internalConfigPtr, collnetEnable, NCCL_CONFIG_UNDEF_INT, 0, "Collnet enable", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, CTAPolicy, NCCL_CONFIG_UNDEF_INT, NCCL_CTA_POLICY_DEFAULT, "CTA policy flags", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, shrinkShare, NCCL_CONFIG_UNDEF_INT, 0, "shrinkShare", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, nvlsCTAs, NCCL_CONFIG_UNDEF_INT, NCCL_CONFIG_UNDEF_INT, "nvlsCTAs", "%d");
/* assign config to communicator */
comm->config.blocking = internalConfigPtr->blocking;
@ -1632,7 +1746,11 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
comm->config.netName = internalConfigPtr->netName;
comm->config.splitShare = internalConfigPtr->splitShare;
comm->config.trafficClass = internalConfigPtr->trafficClass;
comm->config.commName = internalConfigPtr->commName;
comm->config.collnetEnable = internalConfigPtr->collnetEnable;
comm->config.CTAPolicy = internalConfigPtr->CTAPolicy;
comm->config.shrinkShare = internalConfigPtr->shrinkShare;
comm->config.nvlsCTAs = internalConfigPtr->nvlsCTAs;
NCCLCHECKGOTO(envConfigOverride(comm), ret, fail);
exit:
@ -1909,7 +2027,7 @@ static ncclResult_t commDestroySync(struct ncclAsyncJob* job_) {
WARN("commDestroySync: comm %p rank %d sync deviceStream error %d\n", comm, comm->rank, ret);
}
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm), ret, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm, true), ret, fail);
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, false), ret, fail);
// And keep polling until all graphs referencing us die.
while (comm->localPersistentRefs != 0) {
@ -2074,6 +2192,17 @@ fail:
goto exit;
}
static ncclResult_t setCommAbortFlags(ncclComm_t comm, int value) {
// Set abort flags
if (comm->childAbortFlag != nullptr) {
__atomic_store_n(comm->childAbortFlag, value, __ATOMIC_RELEASE);
__atomic_store_n(comm->childAbortFlagDev, value, __ATOMIC_RELEASE);
}
__atomic_store_n(comm->abortFlag, value, __ATOMIC_RELEASE);
__atomic_store_n(comm->abortFlagDev, value, __ATOMIC_RELEASE);
return ncclSuccess;
}
NCCL_API(ncclResult_t, ncclCommAbort, ncclComm_t comm);
ncclResult_t ncclCommAbort(ncclComm_t comm) {
NVTX3_RANGE(NcclNvtxParamsCommAbort);
@ -2083,12 +2212,7 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
}
NCCLCHECK(ncclGroupStartInternal());
// Ask anything that might still be running on the device to quit
if (comm->childAbortFlag != nullptr) {
__atomic_store_n(comm->childAbortFlag, 1, __ATOMIC_RELEASE);
__atomic_store_n(comm->childAbortFlagDev, 1, __ATOMIC_RELEASE);
}
__atomic_store_n(comm->abortFlag, 1, __ATOMIC_RELEASE);
__atomic_store_n(comm->abortFlagDev, 1, __ATOMIC_RELEASE);
NCCLCHECK(setCommAbortFlags(comm,1));
comm->destroyFlag = 1;
/* init thread must be joined before we destroy the comm,
* and we should ignore the init error here. */
@ -2111,36 +2235,51 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
exit:
ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return ncclSuccess;
return res;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclCommSplit, ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config);
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config) {
static void childCommCleanupJob(void* job) {
struct ncclCommInitRankAsyncJob* initJob = (struct ncclCommInitRankAsyncJob*)job;
if (initJob->excludeRanksList) free(initJob->excludeRanksList);
free(job);
}
// initializing a child communicator (for both split and shrink)
static ncclResult_t ncclCommInitChildComm(ncclComm_t comm, ncclComm_t* newcomm, bool isShrink, int flags, int color, int key, int* excludeRanksList, int excludeRanksCount,
ncclConfig_t* config, const char* caller) {
struct ncclCommInitRankAsyncJob *job = NULL;
struct ncclComm* childComm = NCCL_COMM_NULL;
ncclResult_t res = ncclSuccess;
NVTX3_RANGE(NcclNvtxParamsCommSplit)
int oldDev;
CUDACHECK(cudaGetDevice(&oldDev));
NCCLCHECKGOTO(CommCheck(comm, caller, "comm"), res, exit);
NCCLCHECKGOTO(PtrCheck(newcomm, caller, "newcomm"), res, exit);
if (isShrink) {
NCCLCHECKGOTO(PtrCheck(excludeRanksList, caller, "excludeRanksList"), res, exit);
NCCLCHECKGOTO(excludeRanksCount > 0 ? ncclSuccess : ncclInvalidArgument, res, exit);
// excludeRanksList may not be sorted, need to sort it
qsort(excludeRanksList, excludeRanksCount, sizeof(int), compareInts);
// ranks in excludeRanksList should not call into this function
NCCLCHECKGOTO(bsearch(&comm->rank, excludeRanksList, excludeRanksCount, sizeof(int), compareInts) ? ncclInvalidArgument : ncclSuccess, res, exit);
}
NCCLCHECKGOTO(ncclCommEnsureReady(comm), res, exit);
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), res, exit);
NCCLCHECK(ncclGroupStartInternal());
NCCLCHECKGOTO(CommCheck(comm, "CommSplit", "comm"), res, fail);
NCCLCHECKGOTO(PtrCheck(newcomm, "CommSplit", "newcomm"), res, fail);
NCCLCHECKGOTO(ncclCommEnsureReady(comm), res, fail);
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), res, fail);
/* *newcomm should be NCCL_COMM_NULL until comm split fully complete. */
*newcomm = NCCL_COMM_NULL;
if (color == NCCL_SPLIT_NOCOLOR) {
if (!isShrink && color == NCCL_SPLIT_NOCOLOR) {
INFO(NCCL_INIT, "Rank %d has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator", comm->rank);
} else {
NCCLCHECKGOTO(ncclCalloc(&childComm, 1), res, fail);
childComm->startMagic = childComm->endMagic = NCCL_MAGIC;
if (comm->config.splitShare) {
// Set the shareResources field; it is used throughout init and must be reset every time.
// If we shrink, we only reuse resources if we shrink in the default mode
comm->shareResources = isShrink ? (!(flags & NCCL_SHRINK_ABORT) && comm->config.shrinkShare) : comm->config.splitShare;
if (comm->shareResources) {
childComm->abortFlag = comm->abortFlag;
childComm->abortFlagDev = comm->abortFlagDev;
childComm->abortFlagRefCount = comm->abortFlagRefCount;
@ -2161,38 +2300,39 @@ ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newc
NCCLCHECKGOTO(parseCommConfig(childComm, config), res, fail);
}
/* start with ncclInProgress and will be changed to ncclSuccess if init succeeds. */
childComm->initState = ncclInProgress;
/* start with ncclInternalError and will be changed to ncclSuccess if init succeeds. */
childComm->initState = ncclInternalError;
}
NCCLCHECKGOTO(ncclCalloc(&job, 1), res, fail);
job->comm = childComm;
job->newcomm = newcomm;
job->parent = comm;
job->splitCount = ++comm->splitCount;
job->color = color;
job->key = key;
if (excludeRanksList) {
// need to copy the list of ranks to exclude because the job is async
job->excludeRanksCount = excludeRanksCount;
NCCLCHECKGOTO(ncclCalloc(&job->excludeRanksList, excludeRanksCount), res, fail);
memcpy(job->excludeRanksList, excludeRanksList, excludeRanksCount * sizeof(int));
} else {
// each split has to lead to a unique comm, so increment the splitCount
job->splitCount = ++comm->splitCount;
job->excludeRanksList = NULL;
}
job->cudaDev = comm->cudaDev;
snprintf(job->funcName, NCCL_COMMINIT_FUNCNAME_LEN, "%s", __func__);
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, ncclCommInitRankFunc, NULL, free, comm), res, fail);
snprintf(job->funcName, NCCL_COMMINIT_FUNCNAME_LEN, "%s", caller);
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, ncclCommInitRankFunc, /*undo=*/NULL, /*destructor=*/childCommCleanupJob, comm), res, fail);
exit:
(void)cudaSetDevice(oldDev);
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
if (res == ncclSuccess && *newcomm) {
NVTX3_RANGE_ADD_PAYLOAD(CommSplit, NcclNvtxParamsCommSplitSchema,
NVTX3_PAYLOAD((*newcomm)->commHash, comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, color, key));
}
return res;
fail:
if (childComm) {
if (!comm->config.splitShare) {
free(childComm->abortFlag);
if (!comm->shareResources) {
if (childComm->abortFlag) free(childComm->abortFlag);
if (childComm->abortFlagDev) ncclCudaHostFree(childComm->abortFlagDev);
free(childComm->abortFlagRefCount);
if (childComm->abortFlagRefCount) free(childComm->abortFlagRefCount);
}
free(childComm);
}
@ -2200,6 +2340,44 @@ fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclCommShrink, ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t* newcomm, ncclConfig_t* config, int shrinkFlags);
ncclResult_t ncclCommShrink(ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t *newcomm, ncclConfig_t* config, int shrinkFlags) {
NVTX3_RANGE(NcclNvtxParamsCommShrink)
ncclResult_t res = ncclSuccess;
NCCLCHECK(ncclGroupStartInternal());
// Handle the abort (error) mode: set the abort flags, wait for outstanding kernels to complete, then clear the flags to avoid bootstrap issues
if (shrinkFlags & NCCL_SHRINK_ABORT) {
NCCLCHECKGOTO(setCommAbortFlags(comm, 1), res, exit);
NCCLCHECKGOTO(ncclStrongStreamSynchronize(&comm->sharedRes->deviceStream), res, exit);
NCCLCHECKGOTO(setCommAbortFlags(comm, 0), res, exit);
}
NCCLCHECKGOTO(ncclCommInitChildComm(comm, newcomm, /*isShrink=*/true, shrinkFlags, /*color=*/0, /*key=*/comm->rank, excludeRanksList, excludeRanksCount, config, __func__), res, exit);
if (*newcomm) NVTX3_RANGE_ADD_PAYLOAD(CommShrink, NcclNvtxParamsCommShrinkSchema, NVTX3_PAYLOAD(comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, excludeRanksCount));
exit:
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
}
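A minimal usage sketch for the new API follows. The failedRank value is a placeholder, and passing config == NULL is assumed to inherit the parent configuration, as with ncclCommSplit; NCCL_SHRINK_DEFAULT and NCCL_SHRINK_ABORT are the two modes referenced above.

#include <nccl.h>

// Every surviving rank makes the same call with the same exclude list.
ncclResult_t shrinkAroundFailure(ncclComm_t comm, int failedRank, ncclComm_t* shrunkComm) {
  int excludeRanks[1] = { failedRank };
  // Use NCCL_SHRINK_ABORT instead of NCCL_SHRINK_DEFAULT if the parent comm still
  // has outstanding work that must be interrupted first (see the abort-flag
  // handling at the top of ncclCommShrink).
  return ncclCommShrink(comm, excludeRanks, 1, shrunkComm, /*config=*/NULL, NCCL_SHRINK_DEFAULT);
}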
NCCL_API(ncclResult_t, ncclCommSplit, ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config);
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config) {
NVTX3_RANGE(NcclNvtxParamsCommSplit)
ncclResult_t res = ncclSuccess;
NCCLCHECK(ncclGroupStartInternal());
NCCLCHECKGOTO(ncclCommInitChildComm(comm, newcomm, /*isShrink=*/false, /*shrink mode=*/NCCL_SHRINK_DEFAULT, color, key, NULL, 0, config, __func__), res, exit);
if (*newcomm)
NVTX3_RANGE_ADD_PAYLOAD(CommSplit, NcclNvtxParamsCommSplitSchema, NVTX3_PAYLOAD((*newcomm)->commHash, comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, color, key));
exit:
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
}
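For completeness, here is a hedged sketch of splitting with a child-specific config that exercises the fields added in this diff (collnetEnable, CTAPolicy, shrinkShare, nvlsCTAs). The NCCL_CTA_POLICY_EFFICIENCY value is taken from the validation code earlier in this change; everything else uses the public API as declared in nccl.h.

#include <nccl.h>

ncclResult_t splitWithConfig(ncclComm_t parent, int color, int key, ncclComm_t* child) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;  // version field lets parseCommConfig accept the 2.27 fields
  config.collnetEnable = 0;                       // valid values are 0 or 1
  config.CTAPolicy = NCCL_CTA_POLICY_EFFICIENCY;
  config.shrinkShare = 1;                         // lets a later default-mode ncclCommShrink reuse resources
  // config.nvlsCTAs is left at its initializer default so NCCL picks the value automatically.
  return ncclCommSplit(parent, color, key, child, &config);
}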
NCCL_API(const char*, ncclGetErrorString, ncclResult_t code);
const char* ncclGetErrorString(ncclResult_t code) {
switch (code) {
@ -2277,119 +2455,3 @@ ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank) {
*rank = comm->rank;
return ncclSuccess;
}
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc(void **ptr, size_t size) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
#if CUDART_VERSION >= 12010
size_t memGran = 0;
CUdevice currentDev;
CUmemAllocationProp memprop = {};
CUmemAccessDesc accessDesc = {};
CUmemGenericAllocationHandle handle;
int cudaDev;
int flag;
int dcnt;
if (ptr == NULL || size == 0) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUDACHECK(cudaGetDevice(&cudaDev));
CUCHECK(cuDeviceGet(&currentDev, cudaDev));
if (ncclCuMemEnable()) {
size_t handleSize = size;
int requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
// Query device to see if FABRIC handle support is available
flag = 0;
(void) CUPFN(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, currentDev));
if (flag) requestedHandleTypes |= CU_MEM_HANDLE_TYPE_FABRIC;
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
memprop.location.id = currentDev;
// Query device to see if RDMA support is available
flag = 0;
CUCHECK(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, currentDev));
if (flag) memprop.allocFlags.gpuDirectRDMACapable = 1;
CUCHECK(cuMemGetAllocationGranularity(&memGran, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
CUDACHECK(cudaGetDeviceCount(&dcnt));
ALIGN_SIZE(handleSize, memGran);
if (requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) {
/* First try cuMemCreate() with FABRIC handle support and then remove if it fails */
CUresult err = CUPFN(cuMemCreate(&handle, handleSize, &memprop, 0));
if (err == CUDA_ERROR_NOT_PERMITTED || err == CUDA_ERROR_NOT_SUPPORTED) {
requestedHandleTypes &= ~CU_MEM_HANDLE_TYPE_FABRIC;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
} else {
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
/* Reserve a virtual address range */
CUCHECK(cuMemAddressReserve((CUdeviceptr*)ptr, handleSize, memGran, 0, 0));
/* Map the virtual address range to the physical allocation */
CUCHECK(cuMemMap((CUdeviceptr)*ptr, handleSize, 0, handle, 0));
/* Now allow RW access to the newly mapped memory */
for (int i = 0; i < dcnt; ++i) {
int p2p = 0;
if (i == cudaDev || ((cudaDeviceCanAccessPeer(&p2p, cudaDev, i) == cudaSuccess) && p2p)) {
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = i;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CUCHECK(cuMemSetAccess((CUdeviceptr)*ptr, handleSize, &accessDesc, 1));
}
if (0 == p2p && i != cudaDev) INFO(NCCL_ALLOC, "P2P not supported between GPU%d and GPU%d", cudaDev, i);
}
goto exit;
}
fallback:
#endif
// Coverity is right to complain that we may pass a NULL ptr to cudaMalloc. That's deliberate though:
// we want CUDA to return an error to the caller.
// coverity[var_deref_model]
CUDACHECKGOTO(cudaMalloc(ptr, size), ret, fail);
exit:
return ret;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree(void *ptr) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
int saveDevice;
CUDACHECK(cudaGetDevice(&saveDevice));
#if CUDART_VERSION >= 12010
CUdevice ptrDev = 0;
if (ptr == NULL) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUCHECKGOTO(cuPointerGetAttribute((void*)&ptrDev, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, (CUdeviceptr)ptr), ret, fail);
CUDACHECKGOTO(cudaSetDevice((int)ptrDev), ret, fail);
if (ncclCuMemEnable()) {
NCCLCHECKGOTO(ncclCuMemFree(ptr), ret, fail);
goto exit;
}
fallback:
#endif
CUDACHECKGOTO(cudaFree(ptr), ret, fail);
exit:
CUDACHECK(cudaSetDevice(saveDevice));
return ret;
fail:
goto exit;
}
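Usage-wise, the two calls above form a pair: buffers obtained from ncclMemAlloc() must be released with ncclMemFree(), which re-selects the owning device before freeing. A minimal sketch (buffer size and element type are arbitrary):

#include <nccl.h>

ncclResult_t allocDeviceBuffer(void** buf, size_t count) {
  // cuMem-backed when available, plain cudaMalloc otherwise (see the fallback above).
  return ncclMemAlloc(buf, count * sizeof(float));
}

ncclResult_t freeDeviceBuffer(void* buf) {
  return ncclMemFree(buf);  // only for pointers obtained from ncclMemAlloc
}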


@ -105,53 +105,53 @@ error:
#endif
}
#define DECLARE_CUDA_PFN(symbol) PFN_##symbol pfn_##symbol = nullptr
#define DECLARE_CUDA_PFN(symbol,version) PFN_##symbol##_v##version pfn_##symbol = nullptr
#if CUDART_VERSION >= 11030
/* CUDA Driver functions loaded with cuGetProcAddress for versioning */
DECLARE_CUDA_PFN(cuDeviceGet);
DECLARE_CUDA_PFN(cuDeviceGetAttribute);
DECLARE_CUDA_PFN(cuGetErrorString);
DECLARE_CUDA_PFN(cuGetErrorName);
DECLARE_CUDA_PFN(cuDeviceGet, 2000);
DECLARE_CUDA_PFN(cuDeviceGetAttribute, 2000);
DECLARE_CUDA_PFN(cuGetErrorString, 6000);
DECLARE_CUDA_PFN(cuGetErrorName, 6000);
/* enqueue.cc */
DECLARE_CUDA_PFN(cuMemGetAddressRange);
DECLARE_CUDA_PFN(cuLaunchKernel);
DECLARE_CUDA_PFN(cuMemGetAddressRange, 3020);
DECLARE_CUDA_PFN(cuLaunchKernel, 4000);
#if CUDA_VERSION >= 11080
DECLARE_CUDA_PFN(cuLaunchKernelEx);
DECLARE_CUDA_PFN(cuLaunchKernelEx, 11060);
#endif
/* proxy.cc */
DECLARE_CUDA_PFN(cuCtxCreate);
DECLARE_CUDA_PFN(cuCtxDestroy);
DECLARE_CUDA_PFN(cuCtxGetCurrent);
DECLARE_CUDA_PFN(cuCtxSetCurrent);
DECLARE_CUDA_PFN(cuCtxGetDevice);
DECLARE_CUDA_PFN(cuCtxCreate, 11040);
DECLARE_CUDA_PFN(cuCtxDestroy, 4000);
DECLARE_CUDA_PFN(cuCtxGetCurrent, 4000);
DECLARE_CUDA_PFN(cuCtxSetCurrent, 4000);
DECLARE_CUDA_PFN(cuCtxGetDevice, 2000);
/* cuMem API support */
DECLARE_CUDA_PFN(cuMemAddressReserve);
DECLARE_CUDA_PFN(cuMemAddressFree);
DECLARE_CUDA_PFN(cuMemCreate);
DECLARE_CUDA_PFN(cuMemGetAllocationGranularity);
DECLARE_CUDA_PFN(cuMemExportToShareableHandle);
DECLARE_CUDA_PFN(cuMemImportFromShareableHandle);
DECLARE_CUDA_PFN(cuMemMap);
DECLARE_CUDA_PFN(cuMemRelease);
DECLARE_CUDA_PFN(cuMemRetainAllocationHandle);
DECLARE_CUDA_PFN(cuMemSetAccess);
DECLARE_CUDA_PFN(cuMemUnmap);
DECLARE_CUDA_PFN(cuMemGetAllocationPropertiesFromHandle);
DECLARE_CUDA_PFN(cuMemAddressReserve, 10020);
DECLARE_CUDA_PFN(cuMemAddressFree, 10020);
DECLARE_CUDA_PFN(cuMemCreate, 10020);
DECLARE_CUDA_PFN(cuMemGetAllocationGranularity, 10020);
DECLARE_CUDA_PFN(cuMemExportToShareableHandle, 10020);
DECLARE_CUDA_PFN(cuMemImportFromShareableHandle, 10020);
DECLARE_CUDA_PFN(cuMemMap, 10020);
DECLARE_CUDA_PFN(cuMemRelease, 10020);
DECLARE_CUDA_PFN(cuMemRetainAllocationHandle, 11000);
DECLARE_CUDA_PFN(cuMemSetAccess, 10020);
DECLARE_CUDA_PFN(cuMemUnmap, 10020);
DECLARE_CUDA_PFN(cuMemGetAllocationPropertiesFromHandle, 10020);
/* ncclMemAlloc/Free */
DECLARE_CUDA_PFN(cuPointerGetAttribute);
DECLARE_CUDA_PFN(cuPointerGetAttribute, 4000);
#if CUDA_VERSION >= 11070
/* transport/collNet.cc/net.cc*/
DECLARE_CUDA_PFN(cuMemGetHandleForAddressRange); // DMA-BUF support
DECLARE_CUDA_PFN(cuMemGetHandleForAddressRange, 11070); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
DECLARE_CUDA_PFN(cuMulticastAddDevice);
DECLARE_CUDA_PFN(cuMulticastBindMem);
DECLARE_CUDA_PFN(cuMulticastBindAddr);
DECLARE_CUDA_PFN(cuMulticastCreate);
DECLARE_CUDA_PFN(cuMulticastGetGranularity);
DECLARE_CUDA_PFN(cuMulticastUnbind);
DECLARE_CUDA_PFN(cuMulticastAddDevice, 12010);
DECLARE_CUDA_PFN(cuMulticastBindMem, 12010);
DECLARE_CUDA_PFN(cuMulticastBindAddr, 12010);
DECLARE_CUDA_PFN(cuMulticastCreate, 12010);
DECLARE_CUDA_PFN(cuMulticastGetGranularity, 12010);
DECLARE_CUDA_PFN(cuMulticastUnbind, 12010);
#endif
#endif
@ -162,8 +162,17 @@ bool ncclCudaLaunchBlocking = false;
#if CUDART_VERSION >= 11030
#if CUDART_VERSION >= 12000
#define LOAD_SYM(symbol, ignore) do { \
#if CUDART_VERSION >= 13000
#define LOAD_SYM(symbol, version, ignore) do { \
cudaDriverEntryPointQueryResult driverStatus = cudaDriverEntryPointSymbolNotFound; \
res = cudaGetDriverEntryPointByVersion(#symbol, (void **) (&pfn_##symbol), version, cudaEnableDefault, &driverStatus); \
if (res != cudaSuccess || driverStatus != cudaDriverEntryPointSuccess) { \
if (!ignore) { \
WARN("Retrieve %s version %d failed with %d status %d", #symbol, version, res, driverStatus); \
return ncclSystemError; } \
} } while(0)
#elif CUDART_VERSION >= 12000
#define LOAD_SYM(symbol, version, ignore) do { \
cudaDriverEntryPointQueryResult driverStatus = cudaDriverEntryPointSymbolNotFound; \
res = cudaGetDriverEntryPoint(#symbol, (void **) (&pfn_##symbol), cudaEnableDefault, &driverStatus); \
if (res != cudaSuccess || driverStatus != cudaDriverEntryPointSuccess) { \
@ -172,7 +181,7 @@ bool ncclCudaLaunchBlocking = false;
return ncclSystemError; } \
} } while(0)
#else
#define LOAD_SYM(symbol, ignore) do { \
#define LOAD_SYM(symbol, version, ignore) do { \
res = cudaGetDriverEntryPoint(#symbol, (void **) (&pfn_##symbol), cudaEnableDefault); \
if (res != cudaSuccess) { \
if (!ignore) { \
@ -188,46 +197,46 @@ static ncclResult_t cudaPfnFuncLoader(void) {
cudaError_t res;
LOAD_SYM(cuGetErrorString, 0);
LOAD_SYM(cuGetErrorName, 0);
LOAD_SYM(cuDeviceGet, 0);
LOAD_SYM(cuDeviceGetAttribute, 0);
LOAD_SYM(cuMemGetAddressRange, 1);
LOAD_SYM(cuCtxCreate, 1);
LOAD_SYM(cuCtxDestroy, 1);
LOAD_SYM(cuCtxGetCurrent, 1);
LOAD_SYM(cuCtxSetCurrent, 1);
LOAD_SYM(cuCtxGetDevice, 1);
LOAD_SYM(cuLaunchKernel, 1);
LOAD_SYM(cuGetErrorString, 6000, 0);
LOAD_SYM(cuGetErrorName, 6000, 0);
LOAD_SYM(cuDeviceGet, 2000, 0);
LOAD_SYM(cuDeviceGetAttribute, 2000, 0);
LOAD_SYM(cuMemGetAddressRange, 3020, 1);
LOAD_SYM(cuCtxCreate, 11040, 1);
LOAD_SYM(cuCtxDestroy, 4000, 1);
LOAD_SYM(cuCtxGetCurrent, 4000, 1);
LOAD_SYM(cuCtxSetCurrent, 4000, 1);
LOAD_SYM(cuCtxGetDevice, 2000, 1);
LOAD_SYM(cuLaunchKernel, 4000, 1);
#if CUDA_VERSION >= 11080
LOAD_SYM(cuLaunchKernelEx, 1);
LOAD_SYM(cuLaunchKernelEx, 11060, 1);
#endif
/* cuMem API support */
LOAD_SYM(cuMemAddressReserve, 1);
LOAD_SYM(cuMemAddressFree, 1);
LOAD_SYM(cuMemCreate, 1);
LOAD_SYM(cuMemGetAllocationGranularity, 1);
LOAD_SYM(cuMemExportToShareableHandle, 1);
LOAD_SYM(cuMemImportFromShareableHandle, 1);
LOAD_SYM(cuMemMap, 1);
LOAD_SYM(cuMemRelease, 1);
LOAD_SYM(cuMemRetainAllocationHandle, 1);
LOAD_SYM(cuMemSetAccess, 1);
LOAD_SYM(cuMemUnmap, 1);
LOAD_SYM(cuMemGetAllocationPropertiesFromHandle, 1);
LOAD_SYM(cuMemAddressReserve, 10020, 1);
LOAD_SYM(cuMemAddressFree, 10020, 1);
LOAD_SYM(cuMemCreate, 10020, 1);
LOAD_SYM(cuMemGetAllocationGranularity, 10020, 1);
LOAD_SYM(cuMemExportToShareableHandle, 10020, 1);
LOAD_SYM(cuMemImportFromShareableHandle, 10020, 1);
LOAD_SYM(cuMemMap, 10020, 1);
LOAD_SYM(cuMemRelease, 10020, 1);
LOAD_SYM(cuMemRetainAllocationHandle, 11000, 1);
LOAD_SYM(cuMemSetAccess, 10020, 1);
LOAD_SYM(cuMemUnmap, 10020, 1);
LOAD_SYM(cuMemGetAllocationPropertiesFromHandle, 10020, 1);
/* ncclMemAlloc/Free */
LOAD_SYM(cuPointerGetAttribute, 1);
LOAD_SYM(cuPointerGetAttribute, 4000, 1);
#if CUDA_VERSION >= 11070
LOAD_SYM(cuMemGetHandleForAddressRange, 1); // DMA-BUF support
LOAD_SYM(cuMemGetHandleForAddressRange, 11070, 1); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
LOAD_SYM(cuMulticastAddDevice, 1);
LOAD_SYM(cuMulticastBindMem, 1);
LOAD_SYM(cuMulticastBindAddr, 1);
LOAD_SYM(cuMulticastCreate, 1);
LOAD_SYM(cuMulticastGetGranularity, 1);
LOAD_SYM(cuMulticastUnbind, 1);
LOAD_SYM(cuMulticastAddDevice, 12010, 1);
LOAD_SYM(cuMulticastBindMem, 12010, 1);
LOAD_SYM(cuMulticastBindAddr, 12010, 1);
LOAD_SYM(cuMulticastCreate, 12010, 1);
LOAD_SYM(cuMulticastGetGranularity, 12010, 1);
LOAD_SYM(cuMulticastUnbind, 12010, 1);
#endif
return ncclSuccess;
}


@ -8,7 +8,11 @@
#include <sys/types.h>
#include <unistd.h>
#ifdef NCCL_BUILD_RDMA_CORE
#include <infiniband/verbs.h>
#else
#include "ibvcore.h"
#endif
#include "ibvsymbols.h"
static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
@ -138,8 +142,14 @@ ncclResult_t wrap_ibv_query_device(struct ibv_context *context, struct ibv_devic
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_device, ibv_internal_query_device(context, device_attr), 0, "ibv_query_device");
}
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) { /*returns 0 on success, or the value of errno on failure (which indicates the failure reason)*/
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_port, ibv_internal_query_port(context, port_num, port_attr), 0, "ibv_query_port");
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) {
// First try to query the extended port attributes (e.g. active_speed_ex)
if (ibv_query_port_ex(context, port_num, port_attr) != 0) {
// Fall back to the original attribute API call, but zero all members first
memset(port_attr, 0, sizeof(*port_attr));
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_port, ibv_internal_query_port(context, port_num, port_attr), 0, "ibv_query_port");
}
return ncclSuccess;
}
ncclResult_t wrap_ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid *gid) {

src/misc/mlx5dvsymbols.cc Normal file

@ -0,0 +1,77 @@
#include <sys/types.h>
#include <unistd.h>
#include "mlx5/mlx5dvsymbols.h"
#ifdef NCCL_BUILD_MLX5DV
/* Mlx5dv linking mode. Symbols are pointers to linked MLX5 Direct Verbs */
#define ASSIGN_SYM(container, symbol, name) container->name= &symbol;
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols) {
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_is_supported, mlx5dv_internal_is_supported);
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_get_data_direct_sysfs_path, mlx5dv_internal_get_data_direct_sysfs_path);
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_reg_dmabuf_mr, mlx5dv_internal_reg_dmabuf_mr);
return ncclSuccess;
}
#else
/* Mlx5dv dynamic loading mode. Symbols are loaded from shared objects. */
#include <dlfcn.h>
#include "core.h"
// MLX5DV Library versioning
#define MLX5DV_VERSION "MLX5_1.8"
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols) {
static void* mlx5dvhandle = NULL;
void* tmp;
void** cast;
mlx5dvhandle=dlopen("libmlx5.so", RTLD_NOW);
if (!mlx5dvhandle) {
mlx5dvhandle=dlopen("libmlx5.so.1", RTLD_NOW);
if (!mlx5dvhandle) {
INFO(NCCL_INIT, "Failed to open libmlx5.so[.1]");
goto teardown;
}
}
#define LOAD_SYM(handle, symbol, funcptr) do { \
cast = (void**)&funcptr; \
tmp = dlvsym(handle, symbol, MLX5DV_VERSION); \
if (tmp == NULL) { \
WARN("dlvsym failed on %s - %s version %s", symbol, dlerror(), MLX5DV_VERSION); \
goto teardown; \
} \
*cast = tmp; \
} while (0)
// Attempt to load a specific symbol version - fail silently
#define LOAD_SYM_VERSION(handle, symbol, funcptr, version) do { \
cast = (void**)&funcptr; \
*cast = dlvsym(handle, symbol, version); \
if (*cast == NULL) { \
INFO(NCCL_NET, "dlvsym failed on %s - %s version %s", symbol, dlerror(), version); \
} \
} while (0)
LOAD_SYM(mlx5dvhandle, "mlx5dv_is_supported", mlx5dvSymbols->mlx5dv_internal_is_supported);
// Cherry-pick the mlx5dv_get_data_direct_sysfs_path API from MLX5 1.25
LOAD_SYM_VERSION(mlx5dvhandle, "mlx5dv_get_data_direct_sysfs_path", mlx5dvSymbols->mlx5dv_internal_get_data_direct_sysfs_path, "MLX5_1.25");
// Cherry-pick the mlx5dv_reg_dmabuf_mr API from MLX5 1.25
LOAD_SYM_VERSION(mlx5dvhandle, "mlx5dv_reg_dmabuf_mr", mlx5dvSymbols->mlx5dv_internal_reg_dmabuf_mr, "MLX5_1.25");
return ncclSuccess;
teardown:
mlx5dvSymbols->mlx5dv_internal_is_supported = NULL;
mlx5dvSymbols->mlx5dv_internal_get_data_direct_sysfs_path = NULL;
mlx5dvSymbols->mlx5dv_internal_reg_dmabuf_mr = NULL;
if (mlx5dvhandle != NULL) dlclose(mlx5dvhandle);
return ncclSystemError;
}
#endif

src/misc/mlx5dvwrap.cc Normal file

@ -0,0 +1,75 @@
/*************************************************************************
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "mlx5/mlx5dvwrap.h"
#include <sys/types.h>
#include <unistd.h>
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "mlx5/mlx5dvsymbols.h"
static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
static ncclResult_t initResult;
struct ncclMlx5dvSymbols mlx5dvSymbols;
ncclResult_t wrap_mlx5dv_symbols(void) {
pthread_once(&initOnceControl,
[](){ initResult = buildMlx5dvSymbols(&mlx5dvSymbols); });
return initResult;
}
/* CHECK_NOT_NULL: helper macro to check for NULL symbol */
#define CHECK_NOT_NULL(container, internal_name) \
if (container.internal_name == NULL) { \
WARN("lib wrapper not initialized."); \
return ncclInternalError; \
}
#define MLX5DV_PTR_CHECK_ERRNO(container, internal_name, call, retval, error_retval, name) \
CHECK_NOT_NULL(container, internal_name); \
retval = container.call; \
if (retval == error_retval) { \
WARN("Call to " name " failed with error %s", strerror(errno)); \
return ncclSystemError; \
} \
return ncclSuccess;
#define MLX5DV_INT_CHECK_RET_ERRNO(container, internal_name, call, success_retval, name) \
CHECK_NOT_NULL(container, internal_name); \
int ret = container.call; \
if (ret != success_retval) { \
INFO(NCCL_NET, "Call to " name " failed with error %s errno %d", strerror(ret), ret); \
return ncclSystemError; \
} \
return ncclSuccess;
bool wrap_mlx5dv_is_supported(struct ibv_device *device) {
if (mlx5dvSymbols.mlx5dv_internal_is_supported == NULL) {
return 0;
}
return mlx5dvSymbols.mlx5dv_internal_is_supported(device);
}
ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context *context, char *buf, size_t buf_len) {
MLX5DV_INT_CHECK_RET_ERRNO(mlx5dvSymbols, mlx5dv_internal_get_data_direct_sysfs_path, mlx5dv_internal_get_data_direct_sysfs_path(context, buf, buf_len), 0, "mlx5dv_get_data_direct_sysfs_path");
}
/* DMA-BUF support */
ncclResult_t wrap_mlx5dv_reg_dmabuf_mr(struct ibv_mr **ret, struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access) {
MLX5DV_PTR_CHECK_ERRNO(mlx5dvSymbols, mlx5dv_internal_reg_dmabuf_mr, mlx5dv_internal_reg_dmabuf_mr(pd, offset, length, iova, fd, access, mlx5_access), *ret, NULL, "mlx5dv_reg_dmabuf_mr");
}
struct ibv_mr * wrap_direct_mlx5dv_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access) {
if (mlx5dvSymbols.mlx5dv_internal_reg_dmabuf_mr == NULL) {
errno = EOPNOTSUPP; // ncclIbDmaBufSupport() requires this errno being set
return NULL;
}
return mlx5dvSymbols.mlx5dv_internal_reg_dmabuf_mr(pd, offset, length, iova, fd, access, mlx5_access);
}


@ -68,7 +68,8 @@ static ncclResult_t socketProgress(int op, struct ncclSocket* sock, void* ptr, i
return ncclSuccess;
} else {
char line[SOCKET_NAME_MAXLEN+1];
WARN("socketProgress: Connection closed by remote peer %s", ncclSocketToString(&sock->addr, line, 0));
WARN("socketProgress: Connection closed by remote peer %s",
ncclSocketToString(&sock->addr, line, /*numericHostForm*/0));
return ncclRemoteError;
}
}
@ -86,17 +87,22 @@ static ncclResult_t socketWait(int op, struct ncclSocket* sock, void* ptr, int s
* Output: "IPv4/IPv6 address<port>"
*/
const char *ncclSocketToString(const union ncclSocketAddress *addr, char *buf, const int numericHostForm /*= 1*/) {
if (buf == NULL || addr == NULL) return NULL;
const struct sockaddr *saddr = &addr->sa;
if (saddr->sa_family != AF_INET && saddr->sa_family != AF_INET6) { buf[0]='\0'; return buf; }
const struct sockaddr *saddr;
char host[NI_MAXHOST], service[NI_MAXSERV];
int flag = NI_NUMERICSERV | (numericHostForm ? NI_NUMERICHOST : 0);
if (buf == NULL || addr == NULL) goto fail;
saddr = &addr->sa;
if (saddr->sa_family != AF_INET && saddr->sa_family != AF_INET6) goto fail;
/* NI_NUMERICHOST: If set, then the numeric form of the hostname is returned.
* (When not set, this will still happen in case the node's name cannot be determined.)
*/
int flag = NI_NUMERICSERV | (numericHostForm ? NI_NUMERICHOST : 0);
(void) getnameinfo(saddr, sizeof(union ncclSocketAddress), host, NI_MAXHOST, service, NI_MAXSERV, flag);
if (getnameinfo(saddr, sizeof(union ncclSocketAddress), host, NI_MAXHOST, service, NI_MAXSERV, flag)) goto fail;
sprintf(buf, "%s<%s>", host, service);
return buf;
fail:
if (buf)
buf[0] = '\0';
return buf;
}
static uint16_t socketToPort(union ncclSocketAddress *addr) {
@ -120,7 +126,8 @@ static int envSocketFamily(void) {
return family;
}
static int findInterfaces(const char* prefixList, char* names, union ncclSocketAddress *addrs, int sock_family, int maxIfNameSize, int maxIfs) {
static ncclResult_t findInterfaces(const char* prefixList, char* names, union ncclSocketAddress *addrs, int sock_family,
int maxIfNameSize, int maxIfs, int* found) {
#ifdef ENABLE_TRACE
char line[SOCKET_NAME_MAXLEN+1];
#endif
@ -131,10 +138,10 @@ static int findInterfaces(const char* prefixList, char* names, union ncclSocketA
if (searchExact) prefixList++;
int nUserIfs = parseStringList(prefixList, userIfs, MAX_IFS);
int found = 0;
*found = 0;
struct ifaddrs *interfaces, *interface;
getifaddrs(&interfaces);
for (interface = interfaces; interface && found < maxIfs; interface = interface->ifa_next) {
SYSCHECK(getifaddrs(&interfaces), "getifaddrs");
for (interface = interfaces; interface && *found < maxIfs; interface = interface->ifa_next) {
if (interface->ifa_addr == NULL) continue;
/* We only support IPv4 & IPv6 */
@ -162,23 +169,23 @@ static int findInterfaces(const char* prefixList, char* names, union ncclSocketA
// Check that this interface has not already been saved
// getifaddrs() normal order appears to be; IPv4, IPv6 Global, IPv6 Link
bool duplicate = false;
for (int i = 0; i < found; i++) {
for (int i = 0; i < *found; i++) {
if (strcmp(interface->ifa_name, names+i*maxIfNameSize) == 0) { duplicate = true; break; }
}
if (!duplicate) {
// Store the interface name
strncpy(names+found*maxIfNameSize, interface->ifa_name, maxIfNameSize);
strncpy(names + (*found)*maxIfNameSize, interface->ifa_name, maxIfNameSize);
// Store the IP address
int salen = (family == AF_INET) ? sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6);
memset(addrs+found, '\0', sizeof(*addrs));
memcpy(addrs+found, interface->ifa_addr, salen);
found++;
memset(addrs + *found, '\0', sizeof(*addrs));
memcpy(addrs + *found, interface->ifa_addr, salen);
(*found)++;
}
}
freeifaddrs(interfaces);
return found;
return ncclSuccess;
}
static bool matchSubnet(struct ifaddrs local_if, union ncclSocketAddress* remote) {
@ -219,20 +226,21 @@ static bool matchSubnet(struct ifaddrs local_if, union ncclSocketAddress* remote
same &= (local_addr->sin6_scope_id == remote_addr.sin6_scope_id);
return same;
} else {
WARN("Net : Unsupported address family type");
INFO(NCCL_NET, "Net : Unsupported address family type");
return false;
}
}
int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAddrs, union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int maxIfs) {
ncclResult_t ncclFindInterfaceMatchSubnet(char* ifName, union ncclSocketAddress* localAddr,
union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int* found) {
#ifdef ENABLE_TRACE
char line[SOCKET_NAME_MAXLEN+1];
#endif
char line_a[SOCKET_NAME_MAXLEN+1];
int found = 0;
#endif
*found = 0;
struct ifaddrs *interfaces, *interface;
getifaddrs(&interfaces);
for (interface = interfaces; interface && !found; interface = interface->ifa_next) {
SYSCHECK(getifaddrs(&interfaces), "getifaddrs");
for (interface = interfaces; interface && !*found; interface = interface->ifa_next) {
if (interface->ifa_addr == NULL) continue;
/* We only support IPv4 & IPv6 */
@ -247,21 +255,18 @@ int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAd
// Store the local IP address
int salen = (family == AF_INET) ? sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6);
memcpy(localAddrs+found, interface->ifa_addr, salen);
memcpy(localAddr, interface->ifa_addr, salen);
// Store the interface name
strncpy(ifNames+found*ifNameMaxSize, interface->ifa_name, ifNameMaxSize);
strncpy(ifName, interface->ifa_name, ifNameMaxSize);
TRACE(NCCL_INIT|NCCL_NET,"NET : Found interface %s:%s in the same subnet as remote address %s", interface->ifa_name, ncclSocketToString(localAddrs+found, line), ncclSocketToString(remoteAddr, line_a));
found++;
if (found == maxIfs) break;
TRACE(NCCL_INIT|NCCL_NET,"NET : Found interface %s:%s in the same subnet as remote address %s",
interface->ifa_name, ncclSocketToString(localAddr, line), ncclSocketToString(remoteAddr, line_a));
*found = 1;
}
if (found == 0) {
WARN("Net : No interface found in the same subnet as remote address %s", ncclSocketToString(remoteAddr, line_a));
}
freeifaddrs(interfaces);
return found;
return ncclSuccess;
}
ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char* ip_port_pair) {
@ -344,40 +349,41 @@ ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char
return ncclSuccess;
}
int ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs) {
ncclResult_t ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs,
int* nIfs) {
static int shownIfName = 0;
int nIfs = 0;
// Allow user to force the INET socket family selection
int sock_family = envSocketFamily();
// User specified interface
const char* env = ncclGetEnv("NCCL_SOCKET_IFNAME");
*nIfs = 0;
if (env && strlen(env) > 1) {
INFO(NCCL_ENV, "NCCL_SOCKET_IFNAME set by environment to %s", env);
// Specified by user : find or fail
if (shownIfName++ == 0) INFO(NCCL_NET, "NCCL_SOCKET_IFNAME set to %s", env);
nIfs = findInterfaces(env, ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
NCCLCHECK(findInterfaces(env, ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
} else {
// Try to automatically pick the right one
// Start with IB
nIfs = findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
NCCLCHECK(findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
// else see if we can get some hint from COMM ID
if (nIfs == 0) {
if (*nIfs == 0) {
const char* commId = ncclGetEnv("NCCL_COMM_ID");
if (commId && strlen(commId) > 1) {
INFO(NCCL_ENV, "NCCL_COMM_ID set by environment to %s", commId);
// Try to find interface that is in the same subnet as the IP in comm id
union ncclSocketAddress idAddr;
ncclSocketGetAddrFromString(&idAddr, commId);
nIfs = ncclFindInterfaceMatchSubnet(ifNames, ifAddrs, &idAddr, ifNameMaxSize, maxIfs);
NCCLCHECK(ncclSocketGetAddrFromString(&idAddr, commId));
NCCLCHECK(ncclFindInterfaceMatchSubnet(ifNames, ifAddrs, &idAddr, ifNameMaxSize, nIfs));
}
}
// Then look for anything else (but not docker or lo)
if (nIfs == 0) nIfs = findInterfaces("^docker,lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (*nIfs == 0) NCCLCHECK(findInterfaces("^docker,lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
// Finally look for docker, then lo.
if (nIfs == 0) nIfs = findInterfaces("docker", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (nIfs == 0) nIfs = findInterfaces("lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (*nIfs == 0) NCCLCHECK(findInterfaces("docker", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
if (*nIfs == 0) NCCLCHECK(findInterfaces("lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
}
return nIfs;
return ncclSuccess;
}
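Call sites adapt to the new signature by receiving the interface count through an out-parameter and propagating ncclResult_t. A hedged internal-style sketch (assuming the usual internal socket/checks headers are included; the 16-byte name size is an arbitrary choice):

static ncclResult_t pickSocketInterfaces(void) {
  char ifNames[MAX_IFS][16];
  union ncclSocketAddress ifAddrs[MAX_IFS];
  int nIfs = 0;
  NCCLCHECK(ncclFindInterfaces((char*)ifNames, ifAddrs, /*ifNameMaxSize=*/16, MAX_IFS, &nIfs));
  if (nIfs == 0) {
    WARN("NET/Socket : no interface found");
    return ncclInternalError;
  }
  return ncclSuccess;
}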
ncclResult_t ncclSocketListen(struct ncclSocket* sock) {
@ -435,21 +441,25 @@ static ncclResult_t socketTryAccept(struct ncclSocket* sock) {
if (sock->fd != -1) {
sock->state = ncclSocketStateAccepted;
} else if (errno == ENETDOWN || errno == EPROTO || errno == ENOPROTOOPT || errno == EHOSTDOWN ||
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH) {
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH ||
errno == EINTR) {
/* per accept's man page, for Linux sockets the following errors might already be pending errors
* and should be treated as EAGAIN. To avoid an infinite loop in case of errors, we use the retry count*/
if (++sock->errorRetries == ncclParamRetryCnt()) {
WARN("socketTryAccept: exceeded error retry count (%d), %s", sock->errorRetries, strerror(errno));
WARN("socketTryAccept: exceeded error retry count after %d attempts, %s", sock->errorRetries, strerror(errno));
return ncclSystemError;
}
INFO(NCCL_ALL, "Call to accept returned %s, retrying", strerror(errno));
} else if (errno != EAGAIN && errno != EWOULDBLOCK) {
INFO(NCCL_NET|NCCL_INIT, "Call to accept returned %s, retrying", strerror(errno));
} else if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
WARN("socketTryAccept: Accept failed: %s", strerror(errno));
return ncclSystemError;
}
return ncclSuccess;
}
NCCL_PARAM(SocketMaxRecvBuff, "SOCKET_RCVBUF", -1);
NCCL_PARAM(SocketMaxSendBuff, "SOCKET_SNDBUF", -1);
static ncclResult_t socketSetFlags(struct ncclSocket* sock) {
const int one = 1;
/* Set socket as non-blocking if async or if we need to be able to abort */
@ -458,34 +468,55 @@ static ncclResult_t socketSetFlags(struct ncclSocket* sock) {
SYSCHECK(flags = fcntl(sock->fd, F_GETFL), "fcntl");
SYSCHECK(fcntl(sock->fd, F_SETFL, flags | O_NONBLOCK), "fcntl");
}
SYSCHECK(setsockopt(sock->fd, IPPROTO_TCP, TCP_NODELAY, (char*)&one, sizeof(int)), "setsockopt");
SYSCHECK(setsockopt(sock->fd, IPPROTO_TCP, TCP_NODELAY, (char*)&one, sizeof(int)), "setsockopt TCP NODELAY");
// setsockopt should not fail even if the sizes are too large, do not change the default if unset by the user (=-1)
int rcvBuf = ncclParamSocketMaxRecvBuff(), sndBuf = ncclParamSocketMaxSendBuff();
if (sndBuf > 0) SYSCHECK(setsockopt(sock->fd, SOL_SOCKET, SO_SNDBUF, (char*)&sndBuf, sizeof(int)), "setsockopt SO_SNDBUF");
if (rcvBuf > 0) SYSCHECK(setsockopt(sock->fd, SOL_SOCKET, SO_RCVBUF, (char*)&rcvBuf, sizeof(int)), "setsockopt SO_RCVBUF");
return ncclSuccess;
}
static void socketResetAccept(struct ncclSocket* sock) {
char line[SOCKET_NAME_MAXLEN+1];
INFO(NCCL_NET|NCCL_INIT, "socketFinalizeAccept: didn't receive a valid magic from %s",
ncclSocketToString(&sock->addr, line));
// Ignore spurious connection and accept again
(void)close(sock->fd);
sock->fd = -1;
sock->state = ncclSocketStateAccepting;
sock->finalizeCounter = 0;
}
static ncclResult_t socketFinalizeAccept(struct ncclSocket* sock) {
uint64_t magic;
enum ncclSocketType type;
int received;
char line[SOCKET_NAME_MAXLEN+1];
// once accepted, linux sockets do NOT inherit file status flags such as O_NONBLOCK (BSD ones do)
NCCLCHECK(socketSetFlags(sock));
if (sock->asyncFlag == 0 || sock->finalizeCounter < sizeof(magic)) {
if (sock->asyncFlag == 0) {
received = 0;
NCCLCHECK(socketWait(NCCL_SOCKET_RECV, sock, &magic, sizeof(magic), &received));
if (socketWait(NCCL_SOCKET_RECV, sock, &magic, sizeof(magic), &received) != ncclSuccess) {
socketResetAccept(sock);
return ncclSuccess;
}
} else {
int closed = 0;
received = sock->finalizeCounter;
NCCLCHECK(socketProgress(NCCL_SOCKET_RECV, sock, sock->finalizeBuffer, sizeof(magic), &received));
NCCLCHECK(socketProgress(NCCL_SOCKET_RECV, sock, sock->finalizeBuffer, sizeof(magic), &received, &closed));
sock->finalizeCounter = received;
if (received < sizeof(magic)) return ncclSuccess;
if (received < sizeof(magic)) {
if (closed) {
socketResetAccept(sock);
}
return ncclSuccess;
}
memcpy(&magic, sock->finalizeBuffer, sizeof(magic));
}
if (magic != sock->magic) {
WARN("socketFinalizeAccept: wrong magic %lx != %lx", magic, sock->magic);
close(sock->fd);
sock->fd = -1;
// Ignore spurious connection and accept again
sock->state = ncclSocketStateAccepting;
socketResetAccept(sock);
return ncclSuccess;
}
}
@ -500,7 +531,7 @@ static ncclResult_t socketFinalizeAccept(struct ncclSocket* sock) {
memcpy(&type, sock->finalizeBuffer, sizeof(type));
}
if (type != sock->type) {
WARN("socketFinalizeAccept: wrong type %d != %d", type, sock->type);
WARN("socketFinalizeAccept from %s: wrong type %d != %d", ncclSocketToString(&sock->addr, line), type, sock->type);
sock->state = ncclSocketStateError;
close(sock->fd);
sock->fd = -1;
@ -532,32 +563,38 @@ cleanup:
}
goto exit;
}
static ncclResult_t socketConnectCheck(struct ncclSocket* sock, int errCode, const char funcName[]) {
char line[SOCKET_NAME_MAXLEN+1];
if (errCode == 0) {
sock->state = ncclSocketStateConnected;
} else if (errCode == EINPROGRESS) {
sock->state = ncclSocketStateConnectPolling;
} else if (errCode == ETIMEDOUT || errCode == EHOSTUNREACH || errCode == ECONNREFUSED) {
} else if (errCode == EINTR || errCode == EWOULDBLOCK || errCode == EAGAIN || errCode == ETIMEDOUT ||
errCode == EHOSTUNREACH || errCode == ECONNREFUSED) {
if (sock->customRetry == 0) {
if (sock->errorRetries++ == ncclParamRetryCnt()) {
sock->state = ncclSocketStateError;
WARN("%s: connect returned %s, exceeded error retry count (%d)", funcName, strerror(errCode), sock->errorRetries);
WARN("%s: connect to %s returned %s, exceeded error retry count after %d attempts",
funcName, ncclSocketToString(&sock->addr, line), strerror(errCode), sock->errorRetries);
return ncclRemoteError;
}
unsigned int sleepTime = sock->errorRetries * ncclParamRetryTimeOut();
INFO(NCCL_ALL, "%s: connect returned %s, retrying (%d/%ld) after sleep for %u msec", funcName, strerror(errCode), sock->errorRetries, ncclParamRetryCnt(), sleepTime);
INFO(NCCL_NET|NCCL_INIT, "%s: connect to %s returned %s, retrying (%d/%ld) after sleep for %u msec",
funcName, ncclSocketToString(&sock->addr, line), strerror(errCode),
sock->errorRetries, ncclParamRetryCnt(), sleepTime);
msleep(sleepTime);
}
NCCLCHECK(socketResetFd(sock)); /* in case of failure in connect, socket state is unspecified */
sock->state = ncclSocketStateConnecting;
} else {
char line[SOCKET_NAME_MAXLEN+1];
sock->state = ncclSocketStateError;
WARN("%s: Connect to %s failed : %s", funcName, ncclSocketToString(&sock->addr, line), strerror(errCode));
WARN("%s: connect to %s failed : %s", funcName, ncclSocketToString(&sock->addr, line), strerror(errCode));
return ncclSystemError;
}
return ncclSuccess;
}
static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
/* blocking/non-blocking connect() is determined by asyncFlag. */
int ret = connect(sock->fd, &sock->addr.sa, sock->salen);
@ -568,6 +605,7 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
struct pollfd pfd;
int timeout = 1, ret;
socklen_t rlen = sizeof(int);
char line[SOCKET_NAME_MAXLEN+1];
memset(&pfd, 0, sizeof(struct pollfd));
pfd.fd = sock->fd;
@ -577,10 +615,7 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
if (ret == 0 || (ret < 0 && errno == EINTR)) {
return ncclSuccess;
} else if (ret < 0) {
WARN("socketPollConnect poll() failed with error %s", strerror(errno));
return ncclRemoteError;
} else if (ret != 1 || (pfd.revents & POLLOUT) == 0) {
WARN("socketPollConnect poll() returned %d%s", ret, (pfd.revents & POLLOUT) ? "" : ", no POLLOUT events");
WARN("socketPollConnect to %s failed with error %s", ncclSocketToString(&sock->addr, line), strerror(errno));
return ncclSystemError;
}
@ -899,7 +934,7 @@ ncclResult_t ncclSocketTryRecv(struct ncclSocket* sock, void* ptr, int size, int
ncclResult_t ncclSocketShutdown(struct ncclSocket* sock, int how) {
if (sock != NULL) {
if (sock->fd >= 0) {
shutdown(sock->fd, how);
SYSCHECK(shutdown(sock->fd, how), "shutdown");
}
sock->state = ncclSocketStateTerminating;
}
@ -921,8 +956,8 @@ ncclResult_t ncclSocketClose(struct ncclSocket* sock, bool wait) {
* by refcount of fd, but close() is. close() won't close a fd and send FIN packet if
* the fd is duplicated (e.g. fork()). So shutdown() guarantees the correct and graceful
* connection close here. */
shutdown(sock->fd, SHUT_RDWR);
close(sock->fd);
(void)shutdown(sock->fd, SHUT_RDWR);
(void)close(sock->fd);
}
sock->state = ncclSocketStateClosed;
sock->fd = -1;


@ -9,13 +9,18 @@
#include "checks.h"
#include "param.h"
#if CUDART_VERSION >= 13000
#define cudaStreamGetCaptureInfo_v3 cudaStreamGetCaptureInfo
#define cudaGraphAddDependencies_v2 cudaGraphAddDependencies
#define cudaStreamUpdateCaptureDependencies_v2 cudaStreamUpdateCaptureDependencies
#endif
// Tracks the captured work a given graph captured identified by its graph id.
struct ncclStrongStreamCapture {
struct ncclStrongStreamCapture* next;
cudaGraph_t graph;
unsigned long long graphId;
cudaStream_t captureStream;
cudaGraphNode_t lastRecord;
void* acquiredBy;
};
@ -89,7 +94,11 @@ ncclResult_t ncclCudaGetCapturingGraph(
} else {
#if CUDART_VERSION >= 11030
cudaStreamCaptureStatus status;
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamGetCaptureInfo_v3(stream, &status, &graph->graphId, &graph->graph, nullptr, nullptr, nullptr));
#else
CUDACHECK(cudaStreamGetCaptureInfo_v2(stream, &status, &graph->graphId, &graph->graph, nullptr, nullptr));
#endif
if (status != cudaStreamCaptureStatusActive) {
graph->origin = nullptr;
graph->graph = nullptr;
@ -206,7 +215,6 @@ ncclResult_t ncclStrongStreamAcquire(
CUDACHECKGOTO(cudaStreamCreateWithFlags(&cap->captureStream, cudaStreamNonBlocking), ret, do_unlock);
}
cap->graphId = graph.graphId;
cap->lastRecord = nullptr;
cap->acquiredBy = localThreadId();
// Push to capturing list.
cap->next = ss->captureHead;
@ -224,7 +232,11 @@ ncclResult_t ncclStrongStreamAcquire(
CUDACHECK(cudaEventRecord(scratch, graph.origin));
CUDACHECK(cudaStreamWaitEvent(cap->captureStream, scratch, 0));
CUDACHECK(cudaEventDestroy(scratch));
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(cap->captureStream, nullptr, nullptr, 0, cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(cap->captureStream, nullptr, 0, cudaStreamSetCaptureDependencies));
#endif
if (mixing && firstCapture) {
CUDACHECK(cudaEventRecord(ss->serialEvent, ss->liveStream));
@ -282,17 +294,15 @@ ncclResult_t ncclStrongStreamRelease(
cudaGraphNode_t recordNode;
CUDACHECK(cudaGraphAddEventRecordNode(&recordNode, graph.graph, nullptr, 0, ss->serialEvent));
// Make this record order after previous record on this stream.
if (cap->lastRecord != nullptr) {
CUDACHECK(cudaGraphAddDependencies(graph.graph, &cap->lastRecord, &recordNode, 1));
}
cap->lastRecord = recordNode;
// Get current nodes from work stream so we can add them as dependencies.
cudaStreamCaptureStatus status;
cudaGraphNode_t const* nodes;
size_t count = 0;
#if CUDART_VERSION >= 13000
cudaError_t res = cudaStreamGetCaptureInfo_v3(cap->captureStream, &status, nullptr, nullptr, &nodes, nullptr, &count);
#else
cudaError_t res = cudaStreamGetCaptureInfo_v2(cap->captureStream, &status, nullptr, nullptr, &nodes, &count);
#endif
#if CUDART_VERSION >= 12030
if (res == cudaErrorLossyQuery) { // CUDA is telling us the dependencies have edge annotations.
@ -308,10 +318,30 @@ ncclResult_t ncclStrongStreamRelease(
else {
CUDACHECK(res /* = cudaStreamGetCaptureInfo_v2(...)*/);
for (int i=0; i < (int)count; i++) {
#if CUDART_VERSION >= 13000
CUDACHECK(cudaGraphAddDependencies_v2(graph.graph, &nodes[i], &recordNode, nullptr, 1));
#else
CUDACHECK(cudaGraphAddDependencies(graph.graph, &nodes[i], &recordNode, 1));
#endif
}
}
// Make every future operation captured on cap->captureStream depend on 'recordNode'.
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(
cap->captureStream,
&recordNode, /* dependencies */
/*edges =*/ nullptr, /* no edge annotations */
1, /* count */
cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(
cap->captureStream,
&recordNode,
1,
cudaStreamSetCaptureDependencies));
#endif
if (cap->acquiredBy != localThreadId() && ncclParamLaunchRaceFatal()) {
WARN("%s", launchRaceFatalMsg);
return ncclInvalidUsage;
@ -339,7 +369,11 @@ ncclResult_t ncclStreamAdvanceToEvent(struct ncclCudaGraph g, cudaStream_t s, cu
cudaStreamCaptureStatus status;
cudaGraphNode_t const* nodes;
size_t count = 0;
#if CUDART_VERSION >= 13000
cudaError_t res = cudaStreamGetCaptureInfo_v3(tmp, &status, nullptr, nullptr, &nodes, nullptr, &count);
#else
cudaError_t res = cudaStreamGetCaptureInfo_v2(tmp, &status, nullptr, nullptr, &nodes, &count);
#endif
#if CUDART_VERSION >= 12030
if (res == cudaErrorLossyQuery) { // CUDA is telling us the dependencies have edge annotations.
@ -352,7 +386,11 @@ ncclResult_t ncclStreamAdvanceToEvent(struct ncclCudaGraph g, cudaStream_t s, cu
#endif
else {
CUDACHECK(res /* = cudaStreamGetCaptureInfo_v2(...)*/);
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(s, (cudaGraphNode_t*)nodes, nullptr, count, cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(s, (cudaGraphNode_t*)nodes, count, cudaStreamSetCaptureDependencies));
#endif
}
CUDACHECK(cudaStreamDestroy(tmp));

Some files were not shown because too many files have changed in this diff.