Compare commits

...

8 Commits

Author SHA1 Message Date
Yehuda Yitschak 1c1f49dcb6
Merge ef96b97524 into 0d1ece2b43 2025-07-22 20:58:01 +02:00
Stephen Sachs 0d1ece2b43 Exclude ongoing issues from auto-closing logic
- Added a check to skip issues labeled "ongoing" in the close-old-issues script
- Adjusted the condition to compare both creation and update dates against six months ago
2025-07-17 21:50:05 +02:00
Stephen Sachs bfedf2629e Add issue templates and GitHub action to remove stale issues
We add 3 different issue types (issue/question/RFE) and some predefined
questions to speed up the debugging process.

We also add a custom action that closes all issues created more than 6
months ago which have not been updated for more than a month.
2025-07-16 17:56:12 +02:00
Kamil Iskra 7c12c627c6 NCCL 2.27.6-1
Improve support for DirectNIC (CX8)
* Add support for XDR speed detection.
* When DirectNIC is enabled, report only the RDMA interfaces.

Extend the P2C (PXN over C2C) support to send/receive operations.

Support compilation with GCC 14 (Issues #1743, #1751).

Fix the unloading of network plugins that also provide tuner capability.

Fix an unintended change of the current device across calls to ncclCommDestroy()
and ncclCommAbort().

A note for users on MNNVL systems: please ensure an adequate stack size for
NCCL threads.  While the default Linux stack size limit of 8192 KB is known
to be sufficient, we've seen crashes if the limit is changed to
"unlimited", as it causes the glibc library to unexpectedly *decrease* the
stack size of NCCL's background threads to just 2048 KB.  Use "ulimit -s"
in bash to print the current limit; if needed, reset it to 8192 KB using
"ulimit -s 8192" (one also needs to ensure that the new setting is
propagated to other nodes when launching a multi-node NCCL job).
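
For reference, the same check can be done programmatically with standard POSIX getrlimit(); the sketch below is generic illustration code, not NCCL's:

```
/* Illustration only: check the stack size limit that NCCL's background
 * threads will inherit, using standard POSIX getrlimit().  This is not
 * NCCL code; it mirrors the "ulimit -s" check described above. */
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) { perror("getrlimit"); return 1; }
  if (rl.rlim_cur == RLIM_INFINITY)
    printf("stack limit: unlimited (consider 'ulimit -s 8192')\n");
  else
    printf("stack limit: %lu KB\n", (unsigned long)(rl.rlim_cur / 1024));
  return 0;
}
```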
2025-07-11 07:32:13 -07:00
Kamil Iskra 3ea7eedf3b NCCL 2.27.5-1
Improvements for GB200 systems
* Optimize the network performance by alternating the direction of the
  rings and the NIC to GPU assignment across communicators to limit
  unnecessary sharing.
* Fix the detection of C2C links in case GPU Direct RDMA is disabled
  between a GPU and a NIC.
* Fix PXN support on MNNVL systems, where NCCL would try (and fail) to
  share regular host memory across multiple nodes.
* Fix P2C (PXN over C2C), which is now preferred over regular PXN.  This
  support is currently preliminary and is disabled by default; use
  NCCL_PXN_C2C=1 to enable.

Further reduce the overheads of CUDA graph capturing, which increased in
NCCL 2.26.2 for large graphs.

Optimize the network performance on DGX B200 systems by adjusting the
bandwidths provided to the graph search algorithm.

Enable fp8 reductions in symmetric kernels on Blackwell with CUDA 12.8.

Restore the plugin name handling logic to make it possible to specify a
path to the plugin (Issue #1732).

Restore the ability to change NCCL_COLLNET_ENABLE during execution
(Issue #1741).

Add an example tuner plugin with CSV-based overrides.

Remove an x86 dependency from the example profiler.
2025-06-18 10:34:47 -07:00
Kamil Iskra 72d2432094 NCCL 2.27.3-1
Symmetric memory API and symmetric kernels
 * Redesign from the ground up, enabling major latency and bandwidth
   improvements.
 * Add new API calls to register user-allocated memory among communicator
   ranks into a NCCL window: ncclCommWindowRegister() and
   ncclCommWindowDeregister(). The calls currently support symmetric
   registration for P2P and NVLS, and require VMM memory buffers (i.e.,
   CUMEM must be operational).
 * Implement specialized kernels taking advantage of symmetrically
   registered memory, with performance gains expected particularly for
   small to medium message sizes.
 * The kernels support 32 bit floating point types and smaller, and sum as
   the reduction operator, with no more than one collective operation per
   group.
 * Floating point summation is always done in fp32 accumulators (with the
   exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
   the accuracy with fp8 and fp16 data types should be much improved.
 * This initial implementation supports non-network communicators only (P2P
   and NVLS transports).
 * To explore this functionality, users need to use the new memory
   registration API calls with the NCCL_WIN_COLL_SYMMETRIC flag, and all
   ranks of a communicator must pass buffers at the same offset in the same
   registration when invoking a collective NCCL operation (see the sketch
   below).
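
A minimal sketch of the registration flow; the argument order (communicator, buffer, size, window out-parameter, flags) and the ncclWindow_t handle type are assumptions, so consult nccl.h for the authoritative signatures:

```
/* Minimal sketch of the window registration flow.  The argument order and
 * the ncclWindow_t handle type are assumptions; consult nccl.h for the
 * authoritative signatures.  The buffer must be a VMM/cuMem allocation,
 * e.g. from ncclMemAlloc(), and every rank registers the same size. */
#include <nccl.h>

ncclResult_t registerSymmetric(ncclComm_t comm, void* buf, size_t bytes, ncclWindow_t* win) {
  /* All ranks later pass the same offset into this window to the collective. */
  return ncclCommWindowRegister(comm, buf, bytes, win, NCCL_WIN_COLL_SYMMETRIC);
}

ncclResult_t deregisterSymmetric(ncclComm_t comm, ncclWindow_t win) {
  return ncclCommWindowDeregister(comm, win);
}
```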

Add support for DGX Spark.

Add support for DirectNIC (CX8) to the internal IB plugin.

Add a new ncclCommShrink() API call
 * It is a non-collective call similar to ncclCommSplit(), which makes it
   possible to exclude some (possibly unresponsive) ranks from the parent
   communicator.

Add support for loading multiple network plugins
 * This enables the creation of generic containers that can work across a
   range of providers.
 * Allow NCCL_NET_PLUGIN to accept a comma-separated list of plugins to
   load.

NVLink SHARP (NVLS) improvements
 * Implement NVLS+IB SHARP support for AllGather and ReduceScatter with
   user buffer registration. This improves performance and reduces the
   number of CTAs needed to achieve peak bandwidth.
 * Gracefully fall back by default to other transports if NVLS
   initialization fails (the old behavior of returning an error code from a
   NCCL call can be preserved by setting NCCL_NVLS_ENABLE=1).
 * Decrease the NVLS channel count to 24 on Blackwell systems with multiple
   NVLink domains per communicator.
 * Enable fine-tuning of NCCL behavior per communicator using new
   "ncclConfig_t" members "collnetEnable", "CTAPolicy", and "nvlsCTAs"
   (see the sketch below).

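A minimal sketch of setting the "ncclConfig_t" members just mentioned; the member names come from the note above, while the values are placeholders rather than recommendations:

```
/* Minimal sketch: per-communicator tuning through the new ncclConfig_t
 * members named above.  The member names come from the release note; the
 * values below are placeholders, not recommendations. */
#include <nccl.h>

ncclResult_t initTunedComm(ncclComm_t* comm, int nranks, ncclUniqueId id, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.collnetEnable = 0;        /* disable CollNet for this communicator only */
  config.nvlsCTAs = 16;            /* CTAs reserved for NVLS on this communicator */
  /* config.CTAPolicy = ...;          valid policy values are defined in nccl.h */
  return ncclCommInitRankConfig(comm, nranks, id, rank, &config);
}
```
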
Profiler improvements
 * Extend the init function by adding communicator name, comm id (hash),
   rank, number of ranks, number of nodes, and the NCCL log function to the
   argument list. This makes the name and the comm id available to all
   events in the communicator without explicitly passing them to each
   individual event. Add the communicator id and rank to the profiler trace
   filename. The communicator name can now be set via a new "ncclConfig_t"
   member "commName" (see the sketch after this list).
 * Improve the accuracy of the GPU kernel events by providing GPU-generated
   timestamps for the start and stop of every NCCL operation.
 * Harmonize proxy events, removing overlaps between ProxyOp and ProxyStep
   states.
 * Add support for network-defined event updates (through
   "recordEventState").
 * Report the correct number of channels used by every collective/p2p
   operation (used to be set to nMaxChannels for collectives and absent for
   p2ps).
 * Fix the logic on proxyCtrl Idle/Active events (Issue #1162).
 * Fix an issue where the network proxy profiler could lose track of an
   event identifier (Issue #1682).
 * Improve the backward compatibility with plugins older than v4.
 * Ensure that the work counters are 0-initialized.
 * Fix a potential race condition in the network profiler that could result
   in an event being linked to a wrong parent.
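
A minimal sketch of an init callback using the extended v4 argument list, based on the signature shown in the profiler README changes further down; the context struct and included header are illustrative assumptions:

```
/* Minimal sketch of a profiler plugin init callback using the extended v4
 * argument list described above.  Type and constant names follow the
 * plugin headers; the context struct and header name are illustrative. */
#include <stdint.h>
#include <stdlib.h>
#include "profiler.h"   /* assumed: provides ncclResult_t, ncclDebugLogger_t, event type flags */

struct myCtx { const char* commName; uint64_t commHash; int nranks; int rank; };
static ncclDebugLogger_t logFn;

static ncclResult_t myProfilerInit(void** context, int* eActivationMask,
                                   const char* commName, uint64_t commHash,
                                   int nNodes, int nranks, int rank,
                                   ncclDebugLogger_t logfn) {
  struct myCtx* ctx = (struct myCtx*)calloc(1, sizeof(*ctx));
  if (ctx == NULL) return ncclInternalError;
  ctx->commName = commName;
  ctx->commHash = commHash;
  ctx->nranks   = nranks;
  ctx->rank     = rank;
  logFn = logfn;                                        /* keep the NCCL logger for later use */
  *eActivationMask = ncclProfileColl | ncclProfileP2p;  /* events this plugin wants to receive */
  *context = ctx;
  return ncclSuccess;
}
```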

MNNVL improvements
 * Increase to 16 the number of NICs used to communicate between MNNVL
   domains on GB200 systems, to optimize the performance of collective
   operations.
 * Add support for more complex MNNVL topologies with up to 32 NICs per
   node.
 * If the MNNVL fabric initialization was unsuccessful, NCCL will now fail
   by default, so as to avoid inadvertently falling back to a potentially
   much slower network transport. Such failures are typically due to a
   misconfigured IMEX support on the system. To continue without MNNVL,
   restart the job with NCCL_MNNVL_ENABLE=0.
 * Fix a potential hang in alltoall-like communication patterns at a scale
   of over 80 ranks.
 * Make NCCL_P2P_DISABLE=1 imply NCCL_MNNVL_ENABLE=0 (so the latter no
   longer needs to be specified on MNNVL systems).
 * Fix an initialization failure when NCCL_TOPO_FILE is used on MNNVL
   systems.
 * Fix the graph search to exclude non-local NICs.
 * Fix the SHM transport to use fabric handles on MNNVL systems.

NIC Fusion improvements
 * Disable the creation of fused NICs for physical devices that haven't
   been merged.
 * Flatten multiple ports to a single PCI device within the internal IB
   plugin and reparent dual-port NICs under the first PCI parent. If the
   parent is not a PCI switch, PCI devices for fused NICs won't be
   duplicated.
 * Route traffic on GB200-CX8 systems through DirectNIC, not the host
   interface.

Improve support for platforms with C2C connectivity (e.g., GB200)
 * Enable GPUDirect RDMA for the NICs by default.
 * Add support for P2C (PXN over C2C) and the LL128 protocol.

Extend NCCL fault tolerance in multithreaded scenarios
 * Support the creation of multiple nonblocking communicators within a
   single group and polling in parallel for the completion using multiple
   threads (one per communicator).
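
A minimal sketch of this pattern, assuming one polling thread per communicator and relying on the existing nonblocking-config and ncclCommGetAsyncError() mechanism; error handling is omitted:

```
/* Minimal sketch of the pattern above: create several nonblocking
 * communicators inside one group, then poll each one from its own thread
 * via ncclCommGetAsyncError().  Helper names are illustrative. */
#include <nccl.h>
#include <pthread.h>

static void* pollComm(void* arg) {
  ncclComm_t comm = (ncclComm_t)arg;
  ncclResult_t state = ncclInProgress;
  while (state == ncclInProgress) ncclCommGetAsyncError(comm, &state);
  return NULL;   /* state is ncclSuccess once initialization has completed */
}

void initManyNonblocking(ncclComm_t* comms, int n, int nranks, ncclUniqueId* ids, int rank) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
  config.blocking = 0;                       /* request nonblocking communicators */
  ncclGroupStart();
  for (int i = 0; i < n; i++) ncclCommInitRankConfig(&comms[i], nranks, ids[i], rank, &config);
  ncclGroupEnd();                            /* may return ncclInProgress */
  pthread_t tid[n];
  for (int i = 0; i < n; i++) pthread_create(&tid[i], NULL, pollComm, comms[i]);
  for (int i = 0; i < n; i++) pthread_join(tid[i], NULL);
}
```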

Enable ncclImplicitOrderLaunch for CUDA 12.9+
 * This can potentially speed up NCCL_IMPLICIT_LAUNCH_ORDER.

Improve the netSocket transport latency and control
 * Provide finer control over the size of the socket send/receive buffers,
   the task size, and the number of sockets that a single peer can open.
 * Add support for the inlining of small messages behind the header when
   using multiple sockets per connection.

Improve the readability of the CPU affinity in the debug output
 * Print it as a range string rather than a bitmask.
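
A sketch of the range-string formatting, written against a generic POSIX cpu_set_t rather than NCCL's internal representation:

```
/* Illustration of the formatting change described above: render a CPU set
 * as a range string such as "0-15,32-47" instead of a raw bitmask.  This
 * is generic POSIX code, not NCCL's internal implementation. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

void printAffinityRanges(const cpu_set_t* set) {
  int first = 1;
  for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
    if (!CPU_ISSET(cpu, set)) continue;
    int start = cpu;
    while (cpu + 1 < CPU_SETSIZE && CPU_ISSET(cpu + 1, set)) cpu++;   /* extend the run */
    printf("%s%d", first ? "" : ",", start);
    if (cpu > start) printf("-%d", cpu);
    first = 0;
  }
  printf("\n");
}
```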

Fix a potential race condition in graph execution
 * A contention could arise when mixing graph and non-graph execution.

Improve PXN connection code
 * Avoid duplicate and unused connections.

RAS fixes
 * Fix a memory corruption at job termination time in case of a previously
   failed initialization of a RAS socket connection.
 * Fix a race condition leading to a crash when generating a RAS report
   during communicator initialization (Issues #1669, #1718).
 * Fix a potential race condition when gathering data for a RAS status
   report.

Fix a potential memory corruption in ncclCommSplit()
 * Memory could get corrupted when resource sharing was in use and the size
   of the NVLink domain in the new communicator was smaller than in the old
   one.

Fix asynchronous graph upload
 * Fix a small memory leak.
 * Fix oversynchronization.

Add a check for out-of-memory conditions in ncclMemAlloc()

Clean up the NCCL socket code
 * accept() will retry also if just reading the magic failed (Issue #1613).
 * connect() will retry also if poll() did not return a POLLOUT event
   (Issue #1618).
 * Add error checking in a few instances (Issue #1539).
 * Fix the loop condition in ncclFindInterfaceMatchSubnet() (Issue #1574).
 * Clean up the debug output, downgrading WARN messages to INFO in
   non-critical cases, and printing the peer's address where relevant.

Switch NCCL_DEBUG_FILE to line buffering
 * This should help avoid mixed-up partial output lines in multithreaded
   cases.
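
As a generic illustration of what line buffering means here (not NCCL's actual code):

```
/* Generic C illustration of line buffering: with _IOLBF every completed
 * line is flushed as a unit, which helps avoid mixed-up partial lines
 * when several threads write to the same file. */
#include <stdio.h>

FILE* openLineBufferedLog(const char* path) {
  FILE* f = fopen(path, "w");
  if (f != NULL) setvbuf(f, NULL, _IOLBF, BUFSIZ);   /* switch to line buffering */
  return f;
}
```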

Other minor fixes
 * Improve the checks for buffer overflows in the graph code (Issue #1585).
 * Extend logging and state clearing to all four events in the internal IB
   plugin (Issue #1650).
 * Fix the error path in case IB communication is not ready (Issue #1489).
 * Add ECE logging for IB fabric.
 * Fix various minor issues in the graph module (Issue #1635).
 * Clean up the debug output in the graph code, downgrading WARN messages
   to INFO in non-critical cases.
 * Add a missing argument to a directSend() call (Issue #1628).
 * Remove duplicate code in sendProxySetup() (Issue #1420).
 * Fix the order of arguments of cudaDeviceCanAccessPeer() (Issue #1507).
 * Fix compiler warnings with GCC 14.
 * Fix a typo in a comment (Issue #1236).
2025-05-29 20:56:40 -07:00
Giuseppe Congiu 8171af656b NCCL 2.26.6-1
Fix profiler_v2 compatibility layer
 * Removing trafficBytes in profiler_v3 breaks casting to ncclProfilerEventDescr_v2_t
   in the compatibility layer for the profiler_v2 interface. This patch fixes the issue
   by making the conversion between the two descriptors explicit.
2025-05-20 04:04:41 -07:00
Yehuda Yitschak ef96b97524 proxy: yield progress only when idle
Currently, if no new ops were added, the proxy thread may sleep while
there are still non-idle ops to process

Signed-off-by: Yehuda Yitschak <yehuday@amazon.com>
2023-09-06 06:21:52 +00:00
125 changed files with 10273 additions and 2153 deletions

77
.github/ISSUE_TEMPLATE/ISSUE.yaml vendored Normal file

@ -0,0 +1,77 @@
name: NCCL issue or bug
description: Report an issue or failure when running NCCL code
title: "[Issue]: "
labels: ["triage"]
body:
- type: markdown
attributes:
value: |
Thanks for reaching out! Before reporting a new issue, please feel free to search the existing issues for the behavior. If you find an issue that is already closed, or you are unsure, open a new issue and reference the old one from it.
You can also check out the [troubleshooting section](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html) in our user guide.
---
To ensure we can assist you quickly and accurately, we often need the following information:
- type: dropdown
id: type
attributes:
label: How is this issue impacting you?
description: What best describes your issue?
options:
- Lower performance than expected
- Application crash
- Data corruption
- Application hang
validations:
required: true
- type: textarea
id: log
attributes:
label: Share Your Debug Logs
description: |
The logs and topo-files are a great tool to pin down issues. You can create them by setting these environment variables before the run.
* `NCCL_DEBUG=INFO` and `NCCL_DEBUG_FILE=ncclDebug.%h.%p` to produce one file per rank
* `NCCL_TOPO_DUMP_FILE=ncclSystem.txt`
- type: textarea
id: repro
attributes:
label: Steps to Reproduce the Issue
description: |
* **Minimal Steps**: Please provide a simple way to recreate the issue (see [Minimal Bug Reports](https://matthewrocklin.com/minimal-bug-reports) for inspiration).
* **Environment Details**: Include software versions and relevant settings.
* **Intermittency**: Is this a sporadic issue? If so, how often does it occur?
* **Previous Success**: Did this work with an older NCCL version?
The easier it is for us to reproduce the issue, the more likely we are to solve it in a timely manner.
- type: input
id: nccl_version
attributes:
label: NCCL Version
description: |
NCCL reports its version string in the debug logs.
You can also determine the version if you know which library was used by running `strings libnccl.so | grep 'NCCL version'`.
placeholder: "e.g. 2.27.1+cuda12.8"
validations:
required: true
- type: textarea
id: platform
attributes:
label: Your platform details
description: |
* **GPU & Network**: Share your architecture and topology (e.g., from `nvidia-smi`, `nvidia-smi topo -m`, `ibstatus`).
* **Environment**: Bare-metal, containers, or cloud?
* **Scalability**: Does this issue occur with a specific number of ranks/nodes?
- type: textarea
id: issue-description
attributes:
label: Error Message & Behavior
description: |
* **First Error**: What was the initial `NCCL WARN` message in your logs?
* **Expected vs. Actual**: Briefly describe the anticipated behavior versus what you're seeing.

15
.github/ISSUE_TEMPLATE/QUESTION.yaml vendored Normal file

@ -0,0 +1,15 @@
name: NCCL question
description: Ask the NCCL team a question
title: "[Question]: "
labels: ["question"]
body:
- type: markdown
attributes:
value: |
Thanks for reaching out! To solve your problem, feel free to check out the [user guide](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html), in particular the troubleshooting section, and also the [release notes](https://docs.nvidia.com/deeplearning/nccl/release-notes/index.html).
---
- type: textarea
id: question
attributes:
label: Question

22
.github/ISSUE_TEMPLATE/RFE.yaml vendored Normal file

@ -0,0 +1,22 @@
name: NCCL request for enhancement
description: Request for enhancement
title: "[RFE]: "
labels: ["enhancement"]
body:
- type: markdown
attributes:
value: |
Thanks for your feedback! Before reporting a new RFE you could quickly check if this already exists in our [existing requests](https://github.com/NVIDIA/nccl/issues?q=sort%3Aupdated-desc%20is%3Aissue%20is%3Aopen%20label%3Aenhancement).
---
- type: textarea
id: rfe-description
attributes:
label: Please provide the below details to ensure we understand your needs
description: |
* What is the goal of this request?
* Who will benefit from this feature?
* Is this request for a specific GPU architecture or network infrastructure?
* How will this feature improve current workflows or processes?
* What is the priority level of this request?

1
.github/ISSUE_TEMPLATE/config.yml vendored Normal file

@ -0,0 +1 @@
blank_issues_enabled: false

79
.github/workflows/close-old-issues.js vendored Normal file

@ -0,0 +1,79 @@
const { Octokit } = require("@octokit/rest");
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = process.env.REPO_OWNER;
const repo = process.env.REPO_NAME.split('/').pop(); // Handles owner/repo format
const now = new Date();
const sixMonthsAgo = new Date(now);
sixMonthsAgo.setMonth(now.getMonth() - 6);
const oneMonthAgo = new Date(now);
oneMonthAgo.setMonth(now.getMonth() - 1);
async function closeOldIssues() {
let page = 1;
let closedCount = 0;
// write a multiline comment into a variable:
let body = `### Issue Cleanup: Helping Us Focus on Current Challenges
We're [reviewing](https://github.com/NVIDIA/nccl/discussions/1761) older issues to ensure we prioritize the most relevant and active ones. Since this issue hasn't seen updates in over 6 months, we'll be closing it for now.
*This change helps us focus our efforts on addressing any current issues our users are facing.* If this issue still affects you, please don't hesitate to reopen it with a quick update (e.g., \"Still relevant on [version=X]\").
Thanks for your understanding and for contributing to NCCL.`;
while (true) {
const { data: issues } = await octokit.issues.listForRepo({
owner,
repo,
state: "open",
per_page: 100,
page,
});
if (issues.length === 0) break;
for (const issue of issues) {
// Ignore PRs
if (issue.pull_request) continue;
// Ignore issues with label "ongoing"
if (issue.labels.some(label => label.name === "ongoing")) continue;
const createdAt = new Date(issue.created_at);
const updatedAt = new Date(issue.updated_at);
if (createdAt < sixMonthsAgo && updatedAt < sixMonthsAgo) {
// Add a comment before closing
await octokit.issues.createComment({
owner,
repo,
issue_number: issue.number,
body: body,
});
await octokit.issues.update({
owner,
repo,
issue_number: issue.number,
state: "closed",
state_reason: "not_planned",
});
closedCount++;
console.log(`Closed issue #${issue.number}`);
// Break out if we have closed 100 issues
if (closedCount >= 100) {
console.log("Closed 100 issues, stopping.");
return;
}
}
}
page++;
}
console.log(`Total closed: ${closedCount}`);
}
closeOldIssues().catch(console.error);

31
.github/workflows/close_old_issues.yaml vendored Normal file

@ -0,0 +1,31 @@
name: Close Old Issues
on:
schedule:
- cron: '30 2 * * *' # Runs daily at 02:30 UTC
workflow_dispatch:
permissions:
issues: write
jobs:
close-old-issues:
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: 20
- name: Install dependencies
run: npm install @octokit/rest@22.0.0
- name: Run close-old-issues script
run: node .github/workflows/close-old-issues.js
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
REPO_OWNER: ${{ github.repository_owner }}
REPO_NAME: ${{ github.event.repository.name || github.repository }}


@ -3,15 +3,20 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME:=../../build/
CUDA_HOME:=/usr/local/cuda
INC:= -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO:=libnccl-net.so
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
default: $(PLUGIN_SO)
SRC_FILES := $(wildcard *.c)
$(PLUGIN_SO): plugin.c
$(CC) $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
build: ${BUILDDIR}/libnccl-net-example.so
${BUILDDIR}/libnccl-net-example.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/libnccl-net-example.so


@ -7,9 +7,15 @@
#ifndef COMMON_H_
#define COMMON_H_
#include <stdint.h>
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
enum { ncclProfilerNetEventStart = 0, ncclProfilerNetEventStop, ncclProfilerNetEventUpdate, ncclProfilerNetEventUpdateAndStop };
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
#endif


@ -8,9 +8,9 @@
#include <stdint.h>
#include <stdlib.h>
#include "common.h"
#include "err.h"
#include "net_device.h"
#include "common.h"
#define NCCL_NET_HANDLE_MAXSIZE 128
#define NCCL_MAX_NET_SIZE_BYTES (1*1024*1024*1024*1024L) //1TB
@ -23,8 +23,6 @@
// Maximum number of requests per comm object
#define NCCL_NET_MAX_REQUESTS 32
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* phandle, int64_t pluginId, void* extData);
#include "net_v10.h"
#include "net_v9.h"
#include "net_v8.h"


@ -49,9 +49,9 @@ of newer ones.
The `nccl/` directory is populated with `profiler_vX.h` files extracting all relevant definitions
from old API versions. It also provides error codes in `err.h`.
# API (v3)
# API (v4)
Below is the main `ncclProfiler_v3` struct. Each function is explained in later sections.
Below is the main `ncclProfiler_v4` struct. Each function is explained in later sections.
```
typedef struct {
@ -60,9 +60,15 @@ typedef struct {
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask);
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
// Input
@ -70,7 +76,7 @@ typedef struct {
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v3_t* eDescr);
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside an event set
// Input
@ -82,13 +88,13 @@ typedef struct {
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v3_t eState, ncclProfilerEventStateArgs_v3_t* eStateArgs);
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v3_t;
} ncclProfiler_v4_t;
```
## Error codes
@ -147,8 +153,6 @@ typedef struct {
int rank; // rank that generated the event
union {
struct { // collective events metadata
const char* name; // string containing name of the communicator
uint64_t commHash; // unique hash/id for the communicator
uint64_t seqNumber; // sequence number of this collective operation in the communicator
const char* func; // string containing name of the collective
void const* sendBuff; // address of send buffer
@ -156,20 +160,19 @@ typedef struct {
size_t count; // data count
int root; // root rank
const char* datatype; // string containing the name of the datatype
uint8_t nMaxChannels; // max number of channels for this collective
uint8_t nChannels; // number of channels for this collective
uint8_t nWarps; // number of GPU warps for this collective
const char* algo; // string containing name of the algorithm for this collective
const char* proto; // string containing name of the protocol for this collective
} coll;
struct { // point-to-point events metadata
const char* name;
uint64_t commHash;
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer; // peer rank for this point-to-point
uint8_t nChannels; // number of channels for this p2p
} p2p;
struct { // proxyOp events metadata
@ -178,7 +181,7 @@ typedef struct {
int peer; // peer rank
int nSteps; // number of network transfers/steps required by the `ncclProxyOp`
int chunkSize; // chunk size for this `ncclProxyOp`
int isSend; // set to 1 for sends and 0 for recvs
int isSend; // type of network operation
} proxyOp;
struct { // proxyStep events metadata
@ -187,6 +190,7 @@ typedef struct {
struct {
uint8_t channelId; // id of the channel used by the kernel
uint64_t ptimer; // kernel supplied timestamp
} kernelCh;
struct {
@ -194,7 +198,7 @@ typedef struct {
void* data; // pointer to network plugin defined event
} netPlugin;
};
} ncclProfilerEventDescr_v3_t;
} ncclProfilerEventDescr_v4_t;
```
NCCL defines the following events: `ncclProfileGroup`, `ncclProfileColl`, `ncclProfileP2p`,
@ -212,45 +216,57 @@ handle after `eventStop` is undefined behavior.
Some events can only be started and stopped. For example, `ncclProfileGroup`, `ncclProfileColl`,
`ncclProfileP2p`, cannot be updated through calls to `recordEventState`.
`ncclProfileProxyOp`, `ncclProfileProxyStep` and `ncclProfileProxyCtrl` can be updated through
calls to `recordEventState`.
`ncclProfileProxyOp`, `ncclProfileProxyStep`, `ncclProfileNetPlugin`, `ncclProfileKernelCh`, and
`ncclProfileProxyCtrl` can be updated through calls to `recordEventState`.
The state of proxy generated events can be updated, along with event attributes, using
`recordEventState`. These events can go through several states during their lifecycle.
The list of supported states for the proxy-defined events is reported below.
The state of these events can be updated, along with event attributes, using `recordEventState`.
These events can go through several states during their lifecycle.
The list of supported states for the updatable events is reported below.
```
typedef enum {
// ncclProfileProxyOp event states
ncclProfilerProxyOpSendPosted, // state marks the posting of send buffer to GPU for given network transfer/step
ncclProfilerProxyOpSendRemFifoWait, // state marks the waiting of CTS credits from peer rank
ncclProfilerProxyOpSendTransmitted, // state marks the sending of network transfer/step to peer rank
ncclProfilerProxyOpSendDone, // state marks the ending of network transfer/step
ncclProfilerProxyOpRecvPosted, // state marks the posting of recv to network for given network transfer/step
ncclProfilerProxyOpRecvReceived, // state marks the recving of network transfer/step from peer rank
ncclProfilerProxyOpRecvTransmitted, // state marks the ending of the network transfer/step
ncclProfilerProxyOpRecvDone, // state marks the consuming of data from GPU
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,// state marks transition of proxy op to progress
// ncclProfileProxyStep event states
ncclProfilerProxyStepSendGPUWait, // state marks the waiting of send data from GPU for given network transfer/step
ncclProfilerProxyStepSendWait, // state marks the waiting of send data from network for given network transfer/step
ncclProfilerProxyStepRecvWait, // state marks the waiting of recv data from network for given network transfer/step
ncclProfilerProxyStepRecvFlushWait, // state marks the waiting of recv data flush to GPU for given network transfer/step
ncclProfilerProxyStepRecvGPUWait, // state marks the waiting of recv data consumption from GPU for given network transfer/step
ncclProfilerProxyStepSendGPUWait = 8, // state marks the waiting of send data from GPU for given network transfer/step
ncclProfilerProxyStepSendPeerWait_v4 = 20,// state marks the waiting of recv clear to send credits for given network transfer/step
ncclProfilerProxyStepSendWait = 9, // state marks the waiting of send data from network for given network transfer/step
ncclProfilerProxyStepRecvWait = 10,// state marks the waiting of recv data from network for given network transfer/step
ncclProfilerProxyStepRecvFlushWait = 11,// state marks the waiting of recv data flush to GPU for given network transfer/step
ncclProfilerProxyStepRecvGPUWait = 12,// state marks the waiting of recv data consumption from GPU for given network transfer/step
// ncclProfileProxyCtrl event states
ncclProfilerProxyCtrlIdle, // state marks proxy progress thread idle
ncclProfilerProxyCtrlActive, // state marks proxy progress thread active
ncclProfilerProxyCtrlSleep, // state marks proxy progress thread sleeping
ncclProfilerProxyCtrlWakeup, // state marks proxy progress thread waking up
ncclProfilerProxyCtrlAppend, // state marks append of new network work item begin
ncclProfilerProxyCtrlAppendEnd, // state marks append of new network work item end
} ncclProfilerEventState_v3_t;
ncclProfilerProxyCtrlIdle = 13,// state marks proxy progress thread idle
ncclProfilerProxyCtrlActive = 14,// state marks proxy progress thread active
ncclProfilerProxyCtrlSleep = 15,// state marks proxy progress thread sleeping
ncclProfilerProxyCtrlWakeup = 16,// state marks proxy progress thread waking up
ncclProfilerProxyCtrlAppend = 17,// state marks append of new network work item begin
ncclProfilerProxyCtrlAppendEnd = 18,// state marks append of new network work item end
// ncclProfileNetPlugin event states
ncclProfilerNetPluginUpdate = 21,// state marks update of network defined event
// ncclProfileKernelCh event states
ncclProfilerKernelChStop = 22,// state marks stop of kernelCh event and timestamp update
} ncclProfilerEventState_v4_t;
```
`ncclProfileProxyOp` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyOp events are generated for every active channel and
provide a summary of the activity of the proxy progress thread for that channel.
provide a summary of the activity of the proxy progress thread for that channel. Most of the
states for this event were duplicated with `ncclProfileProxyStep` events. Therefore, starting
with version 4 of the profiler interface these states have been deprecated. The same level of
information can still be obtained through the `ncclProfileProxyStep` events.
`ncclProfileProxyStep` events are generated by the proxy progress thread while it is processing
network requests for the GPU kernel. ProxyStep events describe individual network transfer in
@ -348,15 +364,22 @@ reason the profiler defines the `ncclProfilerEventStateArgs_t` struct, reported
```
typedef union {
struct { // attributes to update for ncclProfileProxyOp events
size_t transSize; // data transferred thus far
int steps; // network transfer/steps processed thus far
} proxyOp;
struct { // attributes for update for ncclProfileProxyStep events
size_t transSize; // transfer size field for this proxy step
} proxyStep;
struct { // attributes to update for ncclProfileProxyCtrl
struct { // attributes to update for ncclProfileProxyCtrl events
int appendedProxyOps; // number of appended proxy ops thus far
} proxyCtrl;
} ncclProfilerEventStateArgs_v3_t;
struct { // attributes to update for ncclProfileNetPlugin events
void* data; // network plugin opaque update data field
} netPlugin;
struct { // attribute to update for ncclProfileKernelCh events
uint64_t pTimer; // timestamp provided by the NCCL kernel
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
```
The example profiler in `ext-profiler/example` contains details on how to capture and use the events above.
@ -396,12 +419,12 @@ ProxyCtrl event
## Profiling of collective and p2p operations
The NCCL code is instrumented with profiler callbacks at different levels to capture start/stop of groups,
collective and point-to-point operations, as well as proxy progress activity. Due to the asynchronous nature
collective and point-to-point operations, as well as proxy, kernel and network activity. Due to the asynchronous nature
of NCCL operations, events associated to collective and point-to-point operations are not easy to delimit
precisely. For example, without both proxy and/or kernel activity it is impossible for the profiler to
figure out when a collective operation completes. Therefore, `stopEvent` for collectives simply indicates to
the profiler that the collective has been enqueued. The profiler can leverage proxy event information, if
these are enabled, to estimate when the collective ends. In this case, the profiler can look at the `stopEvent`
the profiler that the collective has been enqueued. The profiler can leverage proxy and/or kernel event information, if
these are enabled, to estimate when the collective ends. For example, the profiler can look at the `stopEvent`
call of the last `ncclProfileProxyOp` event to mark the completion of the associated collective event. This
can be achieved by reference counting the collective event and letting calls to `startEvent` and `stopEvent`
increment and decrement the reference counter, respectively.
@ -425,8 +448,14 @@ enqueue can be time stamped by the profiler (at start and stop) to reconstruct t
collective. However, this time only represents the launch time of the collective and not the actual
execution time. To reconstruct the execution time more accurately proxy and kernel events are provided.
With version 3 of the profiler interface network activity is no longer required to do intra-node profiling.
Kernel events instrumentation leverages counters exposed by the kernel to the host and the proxy progress
thread. Thus, the proxy progress thread infrastructure is shared between the network and the profiler. If
the proxy is serving network requests the kernel profiling probing can be delayed, causing loss of
accuracy. Similarly, if the CPU is under heavy load and the scheduling of the proxy progress thread is
delayed, a similar loss of accuracy can be encountered. Keep this in mind when using kernel events.
delayed, a similar loss of accuracy can be encountered.
To mitigate this effect, with version 4 of the profiler NCCL uses a per-channel ring buffer of 64 elements.
Every counter is complemented by two timestamps (ptimers) supplied by the NCCL kernel (one for start and one
for stop of the operation in the kernel). NCCL propagates these timestamps to the profiler plugin so that it can
convert them to the CPU time domain.


@ -3,14 +3,20 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME := ../../build
INC := -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO := libnccl-profiler.so
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
default: $(PLUGIN_SO)
SRC_FILES := $(wildcard *.c)
$(PLUGIN_SO): plugin.c event.c print_event.c
$(CXX) $(INC) -g -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
build: ${BUILDDIR}/libnccl-profiler-example.so
${BUILDDIR}/libnccl-profiler-example.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/libnccl-profiler-example.so


@ -15,24 +15,6 @@
#define MAX_CHANNELS 32
#define MAX_STEPS 16
#define MAX_OPS 16 // Up to 64K ranks for PAT
#define PROXY_OP_SEND_STATE_OFFSET (ncclProfilerProxyOpSendPosted)
#define PROXY_OP_RECV_STATE_OFFSET (ncclProfilerProxyOpRecvPosted)
#define PROXY_STEP_SEND_STATE_OFFSET (ncclProfilerProxyStepSendGPUWait)
#define PROXY_STEP_RECV_STATE_OFFSET (ncclProfilerProxyStepRecvWait)
#define NUM_PROXY_OP_SEND_STATES (ncclProfilerProxyOpSendDone - ncclProfilerProxyOpSendPosted + 1)
#define NUM_PROXY_OP_RECV_STATES (ncclProfilerProxyOpRecvDone - ncclProfilerProxyOpRecvPosted + 1)
#define NUM_PROXY_STEP_SEND_STATES (ncclProfilerProxyStepSendWait - ncclProfilerProxyStepSendGPUWait + 1)
#define NUM_PROXY_STEP_RECV_STATES (ncclProfilerProxyStepRecvGPUWait - ncclProfilerProxyStepRecvWait + 1)
#define PROXY_OP_SEND_STATE_IDX(state) (state - PROXY_OP_SEND_STATE_OFFSET)
#define PROXY_OP_RECV_STATE_IDX(state) (state - PROXY_OP_RECV_STATE_OFFSET)
#define PROXY_STEP_SEND_STATE_IDX(state) (state - PROXY_STEP_SEND_STATE_OFFSET)
#define PROXY_STEP_RECV_STATE_IDX(state) (state - PROXY_STEP_RECV_STATE_OFFSET)
#define MAX_PROXY_OP_STATES ((NUM_PROXY_OP_SEND_STATES > NUM_PROXY_OP_RECV_STATES ) ? NUM_PROXY_OP_SEND_STATES : NUM_PROXY_OP_RECV_STATES)
#define MAX_PROXY_STEP_STATES ((NUM_PROXY_STEP_SEND_STATES > NUM_PROXY_STEP_RECV_STATES) ? NUM_PROXY_STEP_SEND_STATES : NUM_PROXY_STEP_RECV_STATES)
#define MAX_EVENTS_PER_REQ (8)
struct proxyOp;
@ -68,13 +50,24 @@ struct kernelCh {
struct taskEventBase* parent;
double startTs;
double stopTs;
uint64_t startGpuClk;
uint64_t stopGpuClk;
};
#define PROXY_STEP_SEND_GPU_WAIT 0
#define PROXY_STEP_SEND_PEER_WAIT 1
#define PROXY_STEP_SEND_WAIT 2
#define PROXY_STEP_RECV_WAIT 0
#define PROXY_STEP_RECV_FLUSH_WAIT 1
#define PROXY_STEP_RECV_GPU_WAIT 2
#define PROXY_STEP_MAX_STATES 3
struct proxyStep {
uint8_t type; // type of event: network transfer
int state;
int step; // network transfer id in given channel
int isSend; // send/recv channel operation
double timestamp[MAX_PROXY_STEP_STATES];
double timestamp[PROXY_STEP_MAX_STATES];
double startTs;
double stopTs;
struct proxyOp* parent;
@ -92,11 +85,8 @@ struct proxyOp {
int chunkSize; // chunk size for this proxy operation
int isSend; // send/recv channel operation
size_t transSize; // transfer data size for this proxy operation
struct {
int steps; // completed steps for this proxy operation state
double timestamp;
} states[MAX_PROXY_OP_STATES];
double startTs;
double progrTs; // In progress state transition
double stopTs;
int stepCount; // last processed network operation for this proxy operation
struct proxyStep step[MAX_STEPS]; // array of network transfer events
@ -119,8 +109,6 @@ struct proxyCtrl {
struct taskEventBase {
uint8_t type; // event type: collective/p2p
int rank; // rank of the operation in NCCL communicator
const char* name; // FIXME: unused
uint64_t commHash; // communicator identifier
const char* func; // ncclFunc*
int refCount; // number of references for this operation
struct group* parent; // parent event group
@ -137,12 +125,11 @@ struct collective {
size_t count;
int root;
const char* datatype;
uint8_t nMaxChannels;
uint8_t nChannels;
const char* algo;
const char* proto;
int nWarps;
struct proxyOp send[MAX_CHANNELS][MAX_OPS];// array of send proxy operation events
struct proxyOp recv[MAX_CHANNELS][MAX_OPS];// array of recv proxy operation events
struct proxyOp op[MAX_CHANNELS][2*MAX_OPS];
int nProxyOps[MAX_CHANNELS];
struct kernelCh kernel[MAX_CHANNELS];
};
@ -154,6 +141,7 @@ struct p2p {
size_t count;
const char* datatype;
int peer;
uint8_t nChannels;
struct proxyOp op[MAX_CHANNELS];
struct kernelCh kernel[MAX_CHANNELS];
};
@ -172,6 +160,11 @@ struct group {
// arrays for different event objects
struct context {
const char* commName;
uint64_t commHash;
int nranks;
int rank;
int groupPoolSize;
int groupPoolBase;
int groupPoolIndex;


@ -25,42 +25,52 @@ enum {
};
typedef enum {
ncclProfilerProxyOpSendPosted,
ncclProfilerProxyOpSendRemFifoWait,
ncclProfilerProxyOpSendTransmitted,
ncclProfilerProxyOpSendDone,
ncclProfilerProxyOpRecvPosted,
ncclProfilerProxyOpRecvReceived,
ncclProfilerProxyOpRecvTransmitted,
ncclProfilerProxyOpRecvDone,
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,
/* Legacy proxy profiler states */
ncclProfilerProxyStepSendGPUWait,
ncclProfilerProxyStepSendWait,
ncclProfilerProxyStepRecvWait,
ncclProfilerProxyStepRecvFlushWait,
ncclProfilerProxyStepRecvGPUWait,
ncclProfilerProxyStepSendGPUWait = 8,
ncclProfilerProxyStepSendPeerWait_v4 = 20,
ncclProfilerProxyStepSendWait = 9,
ncclProfilerProxyStepRecvWait = 10,
ncclProfilerProxyStepRecvFlushWait = 11,
ncclProfilerProxyStepRecvGPUWait = 12,
/* Legacy proxy control states */
ncclProfilerProxyCtrlIdle,
ncclProfilerProxyCtrlActive,
ncclProfilerProxyCtrlSleep,
ncclProfilerProxyCtrlWakeup,
ncclProfilerProxyCtrlAppend,
ncclProfilerProxyCtrlAppendEnd,
ncclProfilerProxyCtrlIdle = 13,
ncclProfilerProxyCtrlActive = 14,
ncclProfilerProxyCtrlSleep = 15,
ncclProfilerProxyCtrlWakeup = 16,
ncclProfilerProxyCtrlAppend = 17,
ncclProfilerProxyCtrlAppendEnd = 18,
/* Network defined events states */
ncclProfilerNetPluginUpdate = 21,
/* Kernel event states */
ncclProfilerKernelChStop = 22,
} ncclProfilerEventState_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
#include "profiler_v4.h"
#include "profiler_v3.h"
#include "profiler_v2.h"
#include "profiler_v1.h"
#include "profiler_net.h"
typedef ncclProfiler_v3_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v4_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
#endif // end include guard


@ -111,9 +111,4 @@ typedef struct {
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v3_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventState_v3_t ncclProfilerEventState_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v3_t ncclProfiler_t;
#endif


@ -0,0 +1,123 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef PROFILER_V4_H_
#define PROFILER_V4_H_
typedef struct {
uint8_t type; // event type descriptor: ncclProfileColl, ...
void* parentObj; // pointer to the profiler parent object (for coll is the group)
int rank; // originating rank
union {
struct {
uint64_t seqNumber;
const char* func;
void const* sendBuff;
void* recvBuff;
size_t count;
int root;
const char* datatype;
uint8_t nChannels;
uint8_t nWarps;
const char* algo;
const char* proto;
} coll;
struct {
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer;
uint8_t nChannels;
} p2p;
struct {
pid_t pid; // pid of the originating process
uint8_t channelId; // channel id for this proxy operation
int peer; // remote rank for send/recv
int nSteps; // number of steps for this proxy operation
int chunkSize; // amount of data transferred by this proxy operation
int isSend;
} proxyOp;
struct {
int step;
} proxyStep;
struct {
uint8_t channelId;
uint64_t pTimer; // start timestamp from GPU globaltimer
} kernelCh;
struct {
int64_t id;
void* data;
} netPlugin;
};
} ncclProfilerEventDescr_v4_t;
typedef union {
struct {
size_t transSize;
} proxyStep;
struct {
int appendedProxyOps;
} proxyCtrl;
struct {
void* data;
} netPlugin;
struct {
uint64_t pTimer;
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
typedef struct {
const char* name;
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the eventset
// Input
// - context: opaque profiler context object
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside an event set
// Input
// - eHandle: handle to event object
ncclResult_t (*stopEvent)(void* eHandle);
// recordEventState - record event state transitions and event attribute updates
// Input
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v4_t;
#endif


@ -12,7 +12,7 @@
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>
#include <time.h>
#include "event.h"
#include "print_event.h"
@ -38,29 +38,20 @@ static int detachPoolIndex;
static int detachPoolDone;
static struct proxyOp* detachPool;
static double freq = -1;
__hidden void calibrate() {
struct timeval tv;
gettimeofday(&tv, NULL);
uint64_t timeCycles = __rdtsc();
double time = - tv.tv_sec*1e6 - tv.tv_usec;
uint64_t total = 0ULL;
for (int i = 0; i < 10000; i++) total += __rdtsc();
gettimeofday(&tv, NULL);
timeCycles = __rdtsc() - timeCycles;
time += tv.tv_sec*1e6 + tv.tv_usec;
freq = timeCycles / time;
}
ncclDebugLogger_t logFn;
#define INFO(FLAGS, ...) logFn(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
__hidden double gettime(void) {
return __rdtsc() / freq;
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
return (t.tv_sec*1e6 + (t.tv_nsec*1e-3));
}
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pid_t pid;
static int* eActivationMaskPtr;
__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask) {
__hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn) {
pthread_mutex_lock(&lock);
if (__atomic_fetch_add(&initialized, 1, __ATOMIC_RELAXED) == 0) {
// first thread initializes event mask, environment and detach pool
@ -95,8 +86,6 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask)
// process address space.
pid = getpid();
// calibrate and start timer
calibrate();
startTime = gettime();
}
pthread_mutex_unlock(&lock);
@ -106,6 +95,13 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask)
// pre-allocate memory for event object pools in dedicated profiler context
struct context* ctx = (struct context *)calloc(1, sizeof(*ctx));
ctx->commName = commName;
ctx->commHash = commHash;
ctx->nranks = nranks;
ctx->rank = rank;
logFn = logfn;
INFO(NCCL_INIT, "PROFILER/Plugin: init commName: %s commHash: %lu nranks: %d rank: %d", commName ? commName : "", commHash, nranks, rank);
ctx->groupPool = (struct group *)calloc(groupPoolSize, sizeof(*ctx->groupPool));
if (ctx->groupPool == NULL) goto fail;
@ -142,17 +138,16 @@ fail:
__hidden ncclResult_t exampleProfilerFinalize(void* context) {
FILE* fh = NULL;
char filename[PATH_MAX] = { 0 };
char hostname[64] = { 0 };
gethostname(hostname, 64);
struct context* ctx = (struct context *)context;
const char* dump = getenv("NCCL_PROFILE_DUMP_FILE");
if (dump) {
sprintf(filename, "%s-%s-%ld.txt", dump, hostname, syscall(SYS_gettid));
sprintf(filename, "%s_%lu_%d.json", dump, ctx->commHash, ctx->rank);
fh = fopen(filename, "w");
fprintf(fh, "[\n");
}
INFO(NCCL_INIT, "PROFILER/Plugin: finalize commName: %s commHash: %lu nranks: %d rank: %d", ctx->commName ? ctx->commName : "", ctx->commHash, ctx->nranks, ctx->rank);
// print last N groups/collectives/p2ps
struct context* ctx = (struct context *)context;
int start = (ctx->groupPoolIndex - groupPoolSize >= 0) ? ctx->groupPoolIndex - groupPoolSize : 0;
int end = ctx->groupPoolIndex;
for (int i = start; i < end; i++) {
@ -243,8 +238,6 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->base.type = ncclProfileColl;
event->base.rank = eDescr->rank;
event->base.name = eDescr->coll.name;
event->base.commHash = eDescr->coll.commHash;
event->base.func = eDescr->coll.func;
event->base.startTs = gettime() - startTime;
event->base.parent = parent;
@ -254,7 +247,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->count = eDescr->coll.count;
event->root = eDescr->coll.root;
event->datatype = eDescr->coll.datatype;
event->nMaxChannels = eDescr->coll.nMaxChannels;
event->nChannels = eDescr->coll.nChannels;
event->nWarps = eDescr->coll.nWarps;
event->algo = eDescr->coll.algo;
event->proto = eDescr->coll.proto;
@ -281,8 +274,6 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->base.type = ncclProfileP2p;
event->base.rank = eDescr->rank;
event->base.name = eDescr->p2p.name;
event->base.commHash = eDescr->p2p.commHash;
event->base.func = eDescr->p2p.func;
event->base.next = parent->eventHead;
event->base.startTs = gettime() - startTime;
@ -291,6 +282,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->count = eDescr->p2p.count;
event->datatype = eDescr->p2p.datatype;
event->peer = eDescr->p2p.peer;
event->nChannels = eDescr->p2p.nChannels;
*eHandle = event;
// increment the group ref counter so the event will stay open
taskEventQueueEnqueue(parent, (struct taskEventBase *)event);
@ -331,6 +323,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->startTs = gettime() - startTime;
event->parent = NULL;
event->stepCount = 0;
*eHandle = event;
debugEvent(event, "PxnProxyOpStart");
return ncclSuccess;
@ -339,9 +332,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
if (eventBase->type == ncclProfileColl) {
struct collective* parent = (struct collective *)eDescr->parentObj;
int channelId = eDescr->proxyOp.channelId;
struct proxyOp* event = (eDescr->proxyOp.isSend) ?
&parent->send[channelId][parent->nProxyOps[channelId]++] :
&parent->recv[channelId][parent->nProxyOps[channelId]++];
struct proxyOp* event = &parent->op[channelId][parent->nProxyOps[channelId]++];
event->type = ncclProfileProxyOp;
event->channelId = channelId;
@ -353,6 +344,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->parent = eventBase;
event->startTs = gettime() - startTime;
event->stepCount = 0;
*eHandle = event;
__atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
debugEvent(event, "ProxyOpStart");
@ -370,6 +362,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
event->isSend = eDescr->proxyOp.isSend;
event->parent = eventBase;
event->startTs = gettime() - startTime;
event->stepCount = 0;
*eHandle = event;
__atomic_fetch_add(&parent->base.refCount, 1, __ATOMIC_RELAXED);
debugEvent(event, "ProxyOpStart");
@ -382,9 +375,10 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
int s = parent->stepCount++ % MAX_STEPS;
struct proxyStep* event = &parent->step[s];
event->type = ncclProfileProxyStep;
event->state = 0;
event->step = eDescr->proxyStep.step;
event->isSend = parent->isSend;
event->parent = parent;
event->isSend = parent->isSend;
event->startTs = gettime() - startTime;
event->nNetEvents = 0;
*eHandle = event;
@ -397,6 +391,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
event->type = ncclProfileKernelCh;
event->channelId = eDescr->kernelCh.channelId;
event->startGpuClk = eDescr->kernelCh.pTimer;
event->parent = eventBase;
event->startTs = gettime() - startTime;
*eHandle = event;
@ -407,6 +402,7 @@ __hidden ncclResult_t exampleProfilerStartEvent(void* context, void** eHandle, n
struct kernelCh* event = &parent->kernel[eDescr->kernelCh.channelId];
event->type = ncclProfileKernelCh;
event->channelId = eDescr->kernelCh.channelId;
event->startGpuClk = eDescr->kernelCh.pTimer;
event->parent = eventBase;
event->startTs = gettime() - startTime;
*eHandle = event;
@ -563,29 +559,57 @@ __hidden ncclResult_t exampleProfilerRecordEventState(void* eHandle, ncclProfile
// the event handle might be null if we run out of events
if (eHandle == NULL) return ncclSuccess;
debugEvent(eHandle, "RecordEventState");
uint8_t type = *(uint8_t *)eHandle;
if (type == ncclProfileProxyOp) {
struct proxyOp* event = (struct proxyOp *)eHandle;
int steps = event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].steps;
if (eState == ncclProfilerProxyOpSendRemFifoWait && eStateArgs->proxyOp.steps == steps) return ncclSuccess;
event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].steps = eStateArgs->proxyOp.steps;
event->states[event->isSend ? PROXY_OP_SEND_STATE_IDX(eState) : PROXY_OP_RECV_STATE_IDX(eState)].timestamp = gettime() - startTime;
event->transSize = eStateArgs->proxyOp.transSize;
if (eState == ncclProfilerProxyOpInProgress_v4) {
event->progrTs = gettime() - startTime;
}
} else if (type == ncclProfileProxyStep) {
struct proxyStep* event = (struct proxyStep *)eHandle;
event->timestamp[event->isSend ? PROXY_STEP_SEND_STATE_IDX(eState) : PROXY_STEP_RECV_STATE_IDX(eState)] = gettime() - startTime;
struct proxyOp* parent = event->parent;
switch (eState) {
case ncclProfilerProxyStepSendGPUWait:
event->timestamp[PROXY_STEP_SEND_GPU_WAIT] = gettime() - startTime;
break;
case ncclProfilerProxyStepSendPeerWait_v4:
// do not update step event if in SendPeerWait
if (event->state == ncclProfilerProxyStepSendPeerWait_v4) break;
event->timestamp[PROXY_STEP_SEND_PEER_WAIT] = gettime() - startTime;
event->state = ncclProfilerProxyStepSendPeerWait_v4;
break;
case ncclProfilerProxyStepSendWait:
event->timestamp[PROXY_STEP_SEND_WAIT] = gettime() - startTime;
parent->transSize += eStateArgs->proxyStep.transSize;
break;
case ncclProfilerProxyStepRecvWait:
event->timestamp[PROXY_STEP_RECV_WAIT] = gettime() - startTime;
break;
case ncclProfilerProxyStepRecvFlushWait:
event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT] = gettime() - startTime;
parent->transSize += eStateArgs->proxyStep.transSize;
break;
case ncclProfilerProxyStepRecvGPUWait:
event->timestamp[PROXY_STEP_RECV_GPU_WAIT] = gettime() - startTime;
break;
}
} else if (type == ncclProfileProxyCtrl) {
struct proxyCtrl* event = (struct proxyCtrl *)eHandle;
if (eState == ncclProfilerProxyCtrlAppendEnd) {
event->appended = eStateArgs->proxyCtrl.appendedProxyOps;
}
event->state = eState;
} else if (type == ncclProfileKernelCh) {
struct kernelCh* event = (struct kernelCh *)eHandle;
if (eState == ncclProfilerKernelChStop) {
event->stopGpuClk = eStateArgs->kernelCh.pTimer;
}
}
debugEvent(eHandle, "RecordEventState");
return ncclSuccess;
}
ncclProfiler_t ncclProfiler_v3 = {
ncclProfiler_t ncclProfiler_v4 = {
"Example-profiler",
exampleProfilerInit,
exampleProfilerStartEvent,


@ -27,8 +27,8 @@ __hidden void printGroupEventTrailer(FILE* fh, struct group* event) {
static __thread int collId;
__hidden void printCollEventHeader(FILE* fh, struct collective* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"SeqNum\": %lu, \"CommHash\": %lu, \"Rank\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"Algorithm\": \"%s\", \"Protocol\": \"%s\", \"nMaxChannels\": %d}},\n",
event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, event->base.commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nMaxChannels);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"COLL\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"SeqNum\": %lu, \"CommHash\": %lu, \"Rank\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"Algorithm\": \"%s\", \"Protocol\": \"%s\", \"nChannels\": %d}},\n",
event->base.func, collId, getpid(), 1, event->base.startTs, event->seqNumber, event->base.parent->ctx->commHash, event->base.rank, event->count, event->datatype, event->algo, event->proto, event->nChannels);
}
__hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
@ -38,8 +38,8 @@ __hidden void printCollEventTrailer(FILE* fh, struct collective* event) {
static __thread int p2pId;
__hidden void printP2pEventHeader(FILE* fh, struct p2p* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"CommHash\": %lu, \"Rank\": %d, \"Peer\": %d, \"Count\": %lu, \"Datatype\": \"%s\"}},\n",
event->base.func, p2pId, getpid(), 1, event->base.startTs, event->base.commHash, event->base.rank, event->peer, event->count, event->datatype);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"P2P\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"CommHash\": %lu, \"Rank\": %d, \"Peer\": %d, \"Count\": %lu, \"Datatype\": \"%s\", \"nChannels\": %d}},\n",
event->base.func, p2pId, getpid(), 1, event->base.startTs, event->base.parent->ctx->commHash, event->base.rank, event->peer, event->count, event->datatype, event->nChannels);
}
__hidden void printP2pEventTrailer(FILE* fh, struct p2p* event) {
@ -50,47 +50,43 @@ __hidden void printP2pEventTrailer(FILE* fh, struct p2p* event) {
static __thread int proxyOpId;
__hidden void printProxyOpEventHeader(FILE* fh, struct proxyOp* event) {
if (event->isSend) {
int posted = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendPosted);
int remFifoWait = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendRemFifoWait);
int transmitted = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendTransmitted);
int done = PROXY_OP_SEND_STATE_IDX(ncclProfilerProxyOpSendDone);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu, \"POSTED\": {\"step\": %d, \"ts\": %f}, \"REM_FIFO_WAIT\": {\"step\": %d, \"ts\": %f}, \"TRANSMITTED\": {\"step\": %d, \"ts\": %f}, \"DONE\": {\"step\": %d, \"ts\": %f}}},\n",
"Send", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize, event->states[posted].steps, event->states[posted].timestamp, event->states[remFifoWait].steps, event->states[remFifoWait].timestamp, event->states[transmitted].steps, event->states[transmitted].timestamp, event->states[done].steps, event->states[done].timestamp);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ScheduleSend", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"ScheduleSend", proxyOpId, getpid(), 1, event->progrTs);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ProgressSend", proxyOpId, getpid(), 1, event->progrTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
} else {
int posted = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvPosted);
int received = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvReceived);
int transmitted = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvTransmitted);
int done = PROXY_OP_RECV_STATE_IDX(ncclProfilerProxyOpRecvDone);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu, \"POSTED\": {\"step\": %d, \"ts\": %f}, \"RECEIVED\": {\"step\": %d, \"ts\": %f}, \"TRANSMITTED\": {\"step\": %d, \"ts\": %f}, \"DONE\": {\"step\": %d, \"ts\": %f}}},\n",
"Recv", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize, event->states[posted].steps, event->states[posted].timestamp, event->states[received].steps, event->states[received].timestamp, event->states[transmitted].steps, event->states[transmitted].timestamp, event->states[done].steps, event->states[done].timestamp);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ScheduleRecv", proxyOpId, getpid(), 1, event->startTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"ScheduleRecv", proxyOpId, getpid(), 1, event->progrTs);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"Peer\": %d, \"Steps\": %d, \"ChunkSize\": %d, \"transSize\": %lu}},\n",
"ProgressRecv", proxyOpId, getpid(), 1, event->progrTs, event->channelId, event->peer, event->nSteps, event->chunkSize, event->transSize);
}
}
__hidden void printProxyOpEventTrailer(FILE* fh, struct proxyOp* event) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
event->isSend ? "Send" : "Recv", proxyOpId++, getpid(), 1, event->stopTs);
event->isSend ? "ProgressSend" : "ProgressRecv", proxyOpId++, getpid(), 1, event->stopTs);
}
static __thread int proxyStepId;
__hidden void printProxyStepEventHeader(FILE* fh, struct proxyStep* event) {
if (event->isSend) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendBufferWait", proxyStepId, getpid(), 1, event->startTs, event->step);
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_GPU_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"SendBufferWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendGPUWait)]);
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_PEER_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendGPUWait)], event->step);
"SendPeerWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_PEER_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"SendGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendWait)]);
"SendPeerWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"SendWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_STATE_IDX(ncclProfilerProxyStepSendWait)], event->step);
"SendWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_SEND_WAIT], event->step);
} else {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvBufferWait", proxyStepId, getpid(), 1, event->startTs, event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvBufferWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvWait)]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvWait)], event->step);
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_WAIT], event->step);
}
}
@ -100,13 +96,13 @@ __hidden void printProxyStepEventTrailer(FILE* fh, struct proxyStep* event) {
"SendWait", proxyStepId++, getpid(), 1, event->stopTs);
} else {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvFlushWait)]);
"RecvWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvFlushWait)], event->step);
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_FLUSH_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvGPUWait)]);
"RecvFlushWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_GPU_WAIT]);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Step\": %d}},\n",
"RecvGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_STATE_IDX(ncclProfilerProxyStepRecvGPUWait)], event->step);
"RecvGpuWait", proxyStepId, getpid(), 1, event->timestamp[PROXY_STEP_RECV_GPU_WAIT], event->step);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"NET\", \"ph\": \"e\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f},\n",
"RecvGpuWait", proxyStepId++, getpid(), 1, event->stopTs);
}
@ -115,8 +111,8 @@ __hidden void printProxyStepEventTrailer(FILE* fh, struct proxyStep* event) {
static __thread int kernelId;
__hidden void printKernelChEventHeader(FILE* fh, struct kernelCh* event) {
if (event->type != ncclProfileKernelCh) return;
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GPU\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d}},\n",
"KernelCh", kernelId, getpid(), 1, event->startTs, event->channelId);
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"GPU\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"Channel\": %d, \"StartGpuClk\": %lu, \"StopGpuClk\": %lu}},\n",
"KernelCh", kernelId, getpid(), 1, event->startTs, event->channelId, event->startGpuClk, event->stopGpuClk);
}
__hidden void printKernelChEventTrailer(FILE* fh, struct kernelCh* event) {
@ -134,6 +130,8 @@ __hidden void printProxyCtrlEvent(FILE* fh, struct proxyCtrl* event) {
str = "Sleep";
} else if (event->state == ncclProfilerProxyCtrlAppend || event->state == ncclProfilerProxyCtrlAppendEnd) {
str = "Append";
} else {
return;
}
if (event->state == ncclProfilerProxyCtrlAppendEnd) {
fprintf(fh, "{\"name\": \"%s\", \"cat\": \"PROXY\", \"ph\": \"b\", \"id\": %d, \"pid\": %d, \"tid\": %d, \"ts\": %f, \"args\": {\"appended\": %d}},\n",
@ -188,9 +186,8 @@ void debugEvent(void* eHandle, const char* tag) {
fprintf(fh, "Collective event %p tag = %s {\n", event, tag);
fprintf(fh, " refCount = %d\n", __atomic_load_n(&event->base.refCount, __ATOMIC_RELAXED));
fprintf(fh, " parent = %p\n", event->base.parent);
for (int j = 0; j < MAX_OPS; j++) {
for (int i = 0; i < MAX_CHANNELS; i++) if (event->send[i][j].type == ncclProfileProxyOp) fprintf(fh, " send[%d] = %p\n", i, &event->send[i]);
for (int i = 0; i < MAX_CHANNELS; i++) if (event->recv[i][j].type == ncclProfileProxyOp) fprintf(fh, " recv[%d] = %p\n", i, &event->recv[i]);
for (int j = 0; j < 2*MAX_OPS; j++) {
for (int i = 0; i < MAX_CHANNELS; i++) if (event->op[i][j].type == ncclProfileProxyOp) fprintf(fh, " op[%d] = %p\n", i, &event->op[i]);
}
fprintf(fh, " startTs = %f\n", event->base.startTs);
fprintf(fh, " stopTs = %f\n", event->base.stopTs);
@ -207,17 +204,18 @@ void debugEvent(void* eHandle, const char* tag) {
} else if (type == ncclProfileProxyOp) {
struct proxyOp* event = (struct proxyOp *)eHandle;
fprintf(fh, "ProxyOp event %p tag = %s {\n", event, tag);
fprintf(fh, " type = %s\n", event->isSend ? "Send" : "Recv");
fprintf(fh, " type = %s\n", event->isSend < 0 ? "Unknown" : event->isSend ? "Send" : "Recv");
fprintf(fh, " channel = %d\n", event->channelId);
fprintf(fh, " parent = %p\n", event->parent);
fprintf(fh, " rank = %d\n", event->rank);
fprintf(fh, " startTs = %f\n", event->startTs);
fprintf(fh, " progrTs = %f\n", event->progrTs);
fprintf(fh, " stopTs = %f\n", event->stopTs);
fprintf(fh, "}\n");
} else if (type == ncclProfileProxyStep) {
struct proxyStep* event = (struct proxyStep *)eHandle;
fprintf(fh, "ProxyStep event %p tag = %s {\n", event, tag);
fprintf(fh, " type = %s\n", event->isSend ? "Send" : "Recv");
fprintf(fh, " type = %s\n", event->isSend < 0 ? "Unknown" : event->isSend ? "Send" : "Recv");
fprintf(fh, " parent = %p\n", event->parent);
fprintf(fh, " startTs = %f\n", event->startTs);
fprintf(fh, " stopTs = %f\n", event->stopTs);
@ -260,8 +258,7 @@ void printEvent(FILE* fh, void* handle) {
for (int i = 0; i < MAX_CHANNELS; i++) {
printKernelChEventHeader(fh, &c->kernel[i]);
for (int j = 0; j < c->nProxyOps[i]; j++) {
printEvent(fh, &c->send[i][j]);
printEvent(fh, &c->recv[i][j]);
printEvent(fh, &c->op[i][j]);
}
printKernelChEventTrailer(fh, &c->kernel[i]);
}


@ -7,6 +7,9 @@
#ifndef PRINT_EVENT_H_
#define PRINT_EVENT_H_
#include "nccl/common.h"
extern ncclDebugLogger_t logFn;
void debugEvent(void* eHandle, const char* tag);
void printEvent(FILE* fh, void* handle);

ext-tuner/basic/Makefile Normal file

@ -0,0 +1,23 @@
#
# Copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
#
# See LICENSE.txt for license information
#
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
SRC_FILES := $(wildcard *.c)
DST_DIR := $(BUILDDIR)/test/unit/plugins
build: ${BUILDDIR}/libnccl-tuner-basic.so
${BUILDDIR}/libnccl-tuner-basic.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f ${BUILDDIR}/libnccl-tuner-basic.so


@ -0,0 +1,15 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef COMMON_H_
#define COMMON_H_
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
#endif


@ -0,0 +1,17 @@
/*
* Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
*/
#ifndef NCCL_ERR_H_
#define NCCL_ERR_H_
/* Error type for plugins */
typedef enum { ncclSuccess = 0,
ncclUnhandledCudaError = 1,
ncclSystemError = 2,
ncclInternalError = 3,
ncclInvalidArgument = 4,
ncclInvalidUsage = 5,
ncclRemoteError = 6 } ncclResult_t;
#endif


@ -0,0 +1,97 @@
/*************************************************************************
* Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2023, Meta Platforms, Inc. and affiliates.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_TUNER_H_
#define NCCL_TUNER_H_
#include <stdint.h>
#include <stdlib.h>
#include "common.h"
#include "err.h"
#define NCCL_NUM_FUNCTIONS 5 // Send/Recv not included for now
typedef enum {
ncclFuncBroadcast = 0,
ncclFuncReduce = 1,
ncclFuncAllGather = 2,
ncclFuncReduceScatter = 3,
ncclFuncAllReduce = 4,
ncclFuncSendRecv = 5,
ncclFuncSend = 6,
ncclFuncRecv = 7,
ncclNumFuncs = 8
} ncclFunc_t;
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*
#define NCCL_ALGO_UNDEF -1
#define NCCL_ALGO_TREE 0
#define NCCL_ALGO_RING 1
#define NCCL_ALGO_COLLNET_DIRECT 2
#define NCCL_ALGO_COLLNET_CHAIN 3
#define NCCL_ALGO_NVLS 4
#define NCCL_ALGO_NVLS_TREE 5
#define NCCL_ALGO_PAT 6
#define NCCL_NUM_PROTOCOLS 3 // Simple/LL/LL128
#define NCCL_PROTO_UNDEF -1
#define NCCL_PROTO_LL 0
#define NCCL_PROTO_LL128 1
#define NCCL_PROTO_SIMPLE 2
#define NCCL_ALGO_PROTO_IGNORE -1.0
// API to be implemented by external tuner
typedef struct {
// Name of the tuner
const char* name;
// Initializes tuner states.
// Inputs:
// - nRanks: number of ranks in current communicator. Each communicator initializes its own tuner.
// - nNodes: number of nodes in current communicator.
// - logFunction: a logging function that can be used to integrate plugin logging with NCCL core.
// Outputs:
// - context: tuner context object
ncclResult_t (*init)(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context);
// Gets info (algo, protocol, number of ctas and threads) for a given collective.
// Inputs:
// - context: tuner context object
// - collType: collective type, e.g., allreduce, allgather…
// - nBytes: collective size in bytes
// - numPipeOps: number of operations in the group
// - numAlgo: number of algorithms in collCostTable
// - numProto: number of protocols in collCostTable
// - regBuff: whether the user buffer can be registered
//
// Outputs:
// - nChannels: number of channels (hence SMs) to be used.
//
// InOut:
// - collCostTable: collective cost table, generated by NCCL core, containing algo|proto|time entries for collType.
// NCCL core sets ignored algo/proto cost table entries to -1.0 (NCCL_ALGO_PROTO_IGNORE).
//
// If getCollInfo() does not return ncclSuccess, NCCL will fall back to the
// default tuning for the given collective.
// Also, the plugin may leave all outputs unset, or set both the algorithm and
// the protocol together; it may not set only the algorithm or only the protocol.
// Unset fields will be set automatically by NCCL.
ncclResult_t (*getCollInfo)(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels);
// Terminates the plugin and cleans up any resources that the plugin allocated.
// context: tuner context object
ncclResult_t (*destroy)(void* context);
} ncclTuner_v4_t;
typedef ncclTuner_v4_t ncclTuner_t;
#define NCCL_TUNER_PLUGIN_SYMBOL "ncclTunerPlugin_v4"
#endif

ext-tuner/basic/plugin.c Normal file

@ -0,0 +1,34 @@
/*************************************************************************
* Copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "tuner.h"
#define __hidden __attribute__ ((visibility("hidden")))
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) { return ncclSuccess; }
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels) {
// Update NCCL core generated cost table. Updated table will be evaluated by NCCL to pick the best algo/proto combo
float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
if (table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] != NCCL_ALGO_PROTO_IGNORE) {
table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] = 0.0;
}
*nChannels = 1;
return ncclSuccess;
}
__hidden ncclResult_t pluginDestroy(void* context) { return ncclSuccess; }
#define PLUGIN_NAME "Basic"
const ncclTuner_v4_t ncclTunerPlugin_v4 = {
.name = PLUGIN_NAME,
.init = pluginInit,
.getCollInfo = pluginGetCollInfo,
.destroy = pluginDestroy
};


@ -3,15 +3,53 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME:=../../build/
CUDA_HOME:=/usr/local/cuda
INC:= -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO:=libnccl-tuner.so
default: $(PLUGIN_SO)
.DEFAULT_GOAL: build
PLUGIN_SO:=libnccl-tuner-example.so
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
$(PLUGIN_SO): plugin.c
$(CC) $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
SRC_FILES := $(wildcard *.c)
DST_DIR := $(BUILDDIR)/test/unit/plugins
default: ${BUILDDIR}/$(PLUGIN_SO)
build: ${BUILDDIR}/$(PLUGIN_SO)
${BUILDDIR}/$(PLUGIN_SO): plugin.c
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
# Test targets - delegate to test directory
test:
$(MAKE) -C test test TEST_CASE=$(TEST_CASE)
test-verbose:
$(MAKE) -C test test-verbose TEST_CASE=$(TEST_CASE)
# Build tests
test-build:
$(MAKE) -C test all
# Optimize configurations from performance data
optimize-config:
@if [ -z "$(CSV_FILE)" ]; then \
echo "Usage: make optimize-config CSV_FILE=path/to/data.csv [OUTPUT=config.conf] [METRIC=latency_us]"; \
echo "Example: make optimize-config CSV_FILE=scripts/sample_performance_data.csv"; \
exit 1; \
fi
python3 scripts/optimize_config.py $(CSV_FILE) \
$(if $(OUTPUT),-o $(OUTPUT)) \
$(if $(METRIC),-m $(METRIC)) \
$(if $(SIZE_RANGES),--size-ranges $(SIZE_RANGES)) \
$(if $(DRY_RUN),--dry-run) \
$(if $(NO_HEADER),--no-header)
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/$(PLUGIN_SO)
$(MAKE) -C test clean
.PHONY: test test-verbose test-build optimize-config clean

ext-tuner/example/README.md Normal file

@ -0,0 +1,164 @@
# NCCL Example Tuner Plugin
This example plugin demonstrates a practical, CSV-file-based tuning approach that allows tuning parameters to be selectively overridden based on all tuning inputs, without recompiling.
## Features
- **File-based Configuration**: Read tuning parameters from a CSV configuration file
- **Size-based Tuning**: Specify different configurations based on message size ranges
- **Dimension-aware Tuning**: Match configurations based on number of nodes and ranks
- **Optional Channels Configuration**: Set specific channel counts or use -1 to keep NCCL's default
- **Environment Variable Support**: Specify config file location via `NCCL_TUNER_CONFIG_FILE`
- **Fallback Behavior**: Gracefully handles missing config files and invalid entries
## Building
```bash
make
```
This will create `libnccl-tuner-example.so` that can be loaded by NCCL.
## Configuration File Format
The configuration file uses CSV (Comma-Separated Values) format with one configuration per line:
```
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
```
### Parameters
- **collective_type**: The collective operation type
- `broadcast`, `reduce`, `allgather`, `reducescatter`, `allreduce`
- **min_bytes/max_bytes**: The message size range (in bytes) for which this config applies
- Use `0` for minimum and `4294967295` for maximum (covers all sizes)
- **algorithm**: The NCCL algorithm to use
- `tree`, `ring`, `collnet_direct`, `collnet_chain`, `nvls`, `nvls_tree`, `pat`
- **protocol**: The NCCL protocol to use
- `ll`, `ll128`, `simple`
- **channels**: Number of channels (SMs) to use
- Use a positive integer to specify exact channel count
- Use `-1` to keep NCCL's default channel selection
- **nNodes**: Number of nodes to match
- Use a positive integer to match specific node count
- Use `-1` to match any number of nodes
- **nRanks**: Number of ranks to match
- Use a positive integer to match specific rank count
- Use `-1` to match any number of ranks
- **numPipeOps**: Number of pipeline operations to match (optional)
- Use a positive integer to match specific pipeline operation count
- Use `-1` to match any number of pipeline operations
- If omitted, configuration will match any numPipeOps value
- **regBuff**: Whether user buffer can be registered (optional)
- Use `0` to match only non-registered buffers
- Use `1` to match only registered buffers
- Use `-1` to match either registered or non-registered buffers
- If omitted, configuration will match any regBuff value
### Example Configuration
```csv
# Single-node, small allreduce: use tree algorithm, registered buffers only
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# 4-node, 32-rank setup: medium allreduce, single pipeline op, non-registered buffers
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# Any topology: large allreduce with LL128, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Single-node broadcast: prefer tree; the 8-field form matches any pipeOps and any regBuff (backward compatible)
broadcast,0,32768,tree,simple,-1,1,-1
# Multi-node broadcast: optimized for non-registered buffers, single pipeline op
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
```
Comments start with `#` and empty lines are ignored. The CSV format makes it easy to edit configurations in spreadsheet applications like Excel, Google Sheets, or LibreOffice Calc.
### Backward Compatibility
Configurations without the numPipeOps and/or regBuff parameters are fully supported:
- 8 fields: matches any numPipeOps and regBuff values
- 9 fields: matches any regBuff value
- 10 fields: full parameter specification
This ensures existing configuration files continue to work without modification.
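As a quick illustration (the values below are arbitrary), the same allreduce rule can be written with 8, 9, or 10 fields; the shorter forms simply leave the trailing parameters as wildcards:
```csv
# 8 fields: matches any numPipeOps and any regBuff
allreduce,0,65536,tree,simple,2,1,-1
# 9 fields: numPipeOps pinned to 1, any regBuff
allreduce,0,65536,tree,simple,2,1,-1,1
# 10 fields: numPipeOps=1 and regBuff=1
allreduce,0,65536,tree,simple,2,1,-1,1,1
```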
## Usage
### Method 1: Default Config File
Place your configuration in `nccl_tuner.conf` in the current working directory.
### Method 2: Environment Variable
Set the `NCCL_TUNER_CONFIG_FILE` environment variable to specify the config file path:
```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/your/tuner.conf
export LD_LIBRARY_PATH=/path/to/plugin:$LD_LIBRARY_PATH
mpirun -np 4 your_nccl_application
```
## Editing Configuration Files
### Generating Configuration Files from Raw Data
A Python script that generates valid CSV configs from raw performance data is provided; see [Using optimize_config.py](scripts/README.md).
### Spreadsheet Tips
- Use column headers: `collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff`
- Save as CSV format (not Excel format) for the plugin to read
- Use data validation to prevent typos in algorithm/protocol names
## Logging
The plugin uses NCCL's logging system. To see tuner-related messages:
```bash
export NCCL_DEBUG=INFO
```
This will show when configurations are loaded and applied, including the topology information.
For detailed debugging output during tuning decisions:
```bash
export NCCL_DEBUG=TRACE
```
This will show verbose information about which configurations are being evaluated and matched.
## Dimension Matching
Configurations are only applied when the topology matches:
- **Exact Match**: Configuration specifies `nNodes=4,nRanks=32`, only applied when communicator has exactly 4 nodes and 32 ranks
- **Wildcard Nodes**: Configuration specifies `nNodes=-1,nRanks=8`, applied to any topology with exactly 8 ranks
- **Wildcard Ranks**: Configuration specifies `nNodes=2,nRanks=-1`, applied to any 2-node topology regardless of ranks per node
- **Wildcard Both**: Configuration specifies `nNodes=-1,nRanks=-1`, applied to any topology
This allows you to create specialized configurations for different cluster setups while maintaining flexibility.
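For example, the following hypothetical entries exercise each of the matching modes above:
```csv
# Exact match: only applied on 4 nodes with 32 total ranks
allreduce,0,1048576,ring,simple,4,4,32
# Wildcard nodes: applied to any topology with exactly 8 ranks
allreduce,0,1048576,tree,simple,-1,-1,8
# Wildcard ranks: applied to any 2-node topology
allreduce,0,1048576,ring,simple,2,2,-1
# Wildcard both: applied to any topology
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1
```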
## Default Behavior
If no configuration file is found or no matching configuration exists for a collective operation, the plugin falls back to preferring the ring algorithm with the simple protocol. Matching algorithm/protocol combinations from the configuration are given a low cost (0.0) so that NCCL's selection logic prefers them.
When channels is set to `-1`, NCCL's default channel selection logic is preserved, allowing the system to automatically determine the optimal number of channels based on hardware and message size.
## Troubleshooting
1. **Config file not found**: Check the file path and permissions
2. **Configurations not applied**: Verify the collective type, size ranges, algorithm/protocol names, and topology parameters
3. **Plugin not loaded**: Ensure `LD_LIBRARY_PATH` includes the plugin directory
4. **No effect on performance**: Check that NCCL is actually using the tuner plugin with `NCCL_DEBUG=INFO`
5. **Topology mismatch**: Verify that nNodes and nRanks match your actual setup, or use -1 for wildcards
6. **CSV parsing errors**: Ensure no spaces after commas, or quote fields containing spaces


@ -0,0 +1,45 @@
# NCCL Tuner Configuration File (CSV Format)
# Format: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
#
# Collective types: broadcast, reduce, allgather, reducescatter, allreduce
# Algorithms: tree, ring, collnet_direct, collnet_chain, nvls, nvls_tree, pat
# Protocols: ll, ll128, simple
# Channels: number of channels to use, or -1 to keep default
# nNodes: number of nodes to match, or -1 for any number of nodes
# nRanks: number of ranks to match, or -1 for any number of ranks
# numPipeOps: number of pipeline operations to match, or -1 for any number (optional)
# regBuff: whether user buffer can be registered (0=no, 1=yes, -1=any) (optional)
#
# Note: numPipeOps and regBuff parameters are optional - configurations without them will match any value
#
# Examples:
# For single-node configurations with registered buffers
# Small allreduce operations on single node - use tree algorithm, registered buffers
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# For multi-node configurations with 4 nodes, 32 total ranks, single pipeline op, non-registered buffers
# Medium allreduce operations - use ring algorithm
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# For any topology - large allreduce operations with LL128 protocol, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Broadcast operations - different configs for different topologies, pipeline complexity, and buffer types
# Single node broadcast - prefer tree, any pipeOps, registered buffers only
broadcast,0,32768,tree,simple,-1,1,-1,-1,1
# Multi-node broadcast with single pipeline operation, non-registered buffers - use ring
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
# AllGather operations - optimized for 2-node configurations, any pipeOps, any buffer type
allgather,0,4294967295,ring,simple,4,2,-1
# ReduceScatter operations
# Small messages on single node, single pipeline op, registered buffers
reducescatter,0,131072,tree,simple,2,1,-1,1,1
# Large messages on any topology, multiple pipeline ops, non-registered buffers
reducescatter,131073,4294967295,ring,simple,-1,-1,-1,2,0
# Reduce operations - any topology, keep default channels, any pipeOps, any buffer type
reduce,0,4294967295,tree,simple,-1,-1,-1


@ -5,24 +5,443 @@
************************************************************************/
#include "tuner.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define __hidden __attribute__ ((visibility("hidden")))
#define MAX_LINE_LENGTH 256
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) { return ncclSuccess; }
// CSV field indices for configuration parsing
// Format: colltype,minbytes,maxbytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
#define CONFIG_FIELD_COLLTYPE 0
#define CONFIG_FIELD_MINBYTES 1
#define CONFIG_FIELD_MAXBYTES 2
#define CONFIG_FIELD_ALGORITHM 3
#define CONFIG_FIELD_PROTOCOL 4
#define CONFIG_FIELD_CHANNELS 5
#define CONFIG_FIELD_NNODES 6
#define CONFIG_FIELD_NRANKS 7
#define CONFIG_FIELD_PIPEOPS 8 // Optional field
#define CONFIG_FIELD_REGBUFF 9 // Optional field
// Field count constants
#define CONFIG_FIELDS_REQUIRED 8 // Minimum required fields (up to nRanks)
#define CONFIG_FIELDS_WITH_PIPEOPS 9 // Fields including numPipeOps
#define CONFIG_FIELDS_WITH_REGBUFF 10 // Fields including both numPipeOps and regBuff
#define CONFIG_FIELDS_MAX 10 // Maximum number of fields supported
typedef struct {
ncclFunc_t collType;
size_t minBytes;
size_t maxBytes;
int algorithm;
int protocol;
int nChannels;
int nNodes;
int nRanks;
int numPipeOps;
int regBuff;
} TuningConfig;
typedef struct {
TuningConfig* configs; // Changed from static array to dynamic pointer
int numConfigs;
int maxConfigs; // Added to track allocated size
size_t nRanks;
size_t nNodes;
ncclDebugLogger_t logFunction;
} TunerContext;
// Parse collective type from string
static ncclFunc_t parseCollType(const char* str) {
if (strcmp(str, "broadcast") == 0) return ncclFuncBroadcast;
if (strcmp(str, "reduce") == 0) return ncclFuncReduce;
if (strcmp(str, "allgather") == 0) return ncclFuncAllGather;
if (strcmp(str, "reducescatter") == 0) return ncclFuncReduceScatter;
if (strcmp(str, "allreduce") == 0) return ncclFuncAllReduce;
return ncclFuncAllReduce; // default
}
// Convert collective type to string
static const char* collTypeToString(ncclFunc_t collType) {
switch (collType) {
case ncclFuncBroadcast: return "broadcast";
case ncclFuncReduce: return "reduce";
case ncclFuncAllGather: return "allgather";
case ncclFuncReduceScatter: return "reducescatter";
case ncclFuncAllReduce: return "allreduce";
default: return "unknown";
}
}
// Parse algorithm from string
static int parseAlgorithm(const char* str) {
if (strcmp(str, "tree") == 0) return NCCL_ALGO_TREE;
if (strcmp(str, "ring") == 0) return NCCL_ALGO_RING;
if (strcmp(str, "collnet_direct") == 0) return NCCL_ALGO_COLLNET_DIRECT;
if (strcmp(str, "collnet_chain") == 0) return NCCL_ALGO_COLLNET_CHAIN;
if (strcmp(str, "nvls") == 0) return NCCL_ALGO_NVLS;
if (strcmp(str, "nvls_tree") == 0) return NCCL_ALGO_NVLS_TREE;
if (strcmp(str, "pat") == 0) return NCCL_ALGO_PAT;
return NCCL_ALGO_RING; // default
}
// Convert algorithm to string
static const char* algorithmToString(int algorithm) {
switch (algorithm) {
case NCCL_ALGO_TREE: return "tree";
case NCCL_ALGO_RING: return "ring";
case NCCL_ALGO_COLLNET_DIRECT: return "collnet_direct";
case NCCL_ALGO_COLLNET_CHAIN: return "collnet_chain";
case NCCL_ALGO_NVLS: return "nvls";
case NCCL_ALGO_NVLS_TREE: return "nvls_tree";
case NCCL_ALGO_PAT: return "pat";
default: return "unknown";
}
}
// Parse protocol from string
static int parseProtocol(const char* str) {
if (strcmp(str, "ll") == 0) return NCCL_PROTO_LL;
if (strcmp(str, "ll128") == 0) return NCCL_PROTO_LL128;
if (strcmp(str, "simple") == 0) return NCCL_PROTO_SIMPLE;
return NCCL_PROTO_SIMPLE; // default
}
// Convert protocol to string
static const char* protocolToString(int protocol) {
switch (protocol) {
case NCCL_PROTO_LL: return "ll";
case NCCL_PROTO_LL128: return "ll128";
case NCCL_PROTO_SIMPLE: return "simple";
default: return "unknown";
}
}
// Helper function to count valid configuration lines in file
static int countConfigLines(const char* filename) {
FILE* file = fopen(filename, "r");
if (!file) {
return 0;
}
char line[MAX_LINE_LENGTH];
int count = 0;
while (fgets(line, sizeof(line), file)) {
// Skip comments and empty lines
if (line[0] == '#' || line[0] == '\n') continue;
// Remove trailing newline
line[strcspn(line, "\n")] = 0;
// Check if line has content
if (strlen(line) > 0) {
count++;
}
}
fclose(file);
return count;
}
// Load configuration from file
static ncclResult_t loadConfig(TunerContext* ctx, const char* filename) {
FILE* file = fopen(filename, "r");
if (!file) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config file %s not found, using defaults", filename);
}
return ncclSuccess; // Not finding config file is not an error
}
// First pass: count valid configuration lines
int configCount = countConfigLines(filename);
if (configCount == 0) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: No valid configurations found in %s", filename);
}
fclose(file);
return ncclSuccess;
}
// Allocate memory for configurations based on actual count
ctx->configs = (TuningConfig*)malloc(configCount * sizeof(TuningConfig));
if (!ctx->configs) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Failed to allocate memory for %d configurations", configCount);
}
fclose(file);
return ncclSystemError;
}
ctx->maxConfigs = configCount;
ctx->numConfigs = 0;
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Allocated memory for %d configurations", configCount);
}
// Reset file pointer to beginning
fseek(file, 0, SEEK_SET);
char line[MAX_LINE_LENGTH];
while (fgets(line, sizeof(line), file) && ctx->numConfigs < ctx->maxConfigs) {
// Skip comments and empty lines
if (line[0] == '#' || line[0] == '\n') continue;
// Remove trailing newline
line[strcspn(line, "\n")] = 0;
// Parse CSV format: colltype,minbytes,maxbytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
char* token;
char* tokens[CONFIG_FIELDS_MAX];
int tokenCount = 0;
// Make a copy of the line for tokenizing
char lineCopy[MAX_LINE_LENGTH];
strncpy(lineCopy, line, sizeof(lineCopy));
lineCopy[sizeof(lineCopy) - 1] = '\0';
// Tokenize by comma
token = strtok(lineCopy, ",");
while (token != NULL && tokenCount < CONFIG_FIELDS_MAX) {
// Trim whitespace
while (*token == ' ' || *token == '\t') token++;
char* end = token + strlen(token) - 1;
while (end > token && (*end == ' ' || *end == '\t')) {
*end = '\0';
end--;
}
tokens[tokenCount++] = token;
token = strtok(NULL, ",");
}
// Validate field count: support required fields (8), with pipeOps (9), or with regBuff (10)
if (tokenCount >= CONFIG_FIELDS_REQUIRED && tokenCount <= CONFIG_FIELDS_MAX) {
TuningConfig* config = &ctx->configs[ctx->numConfigs];
config->collType = parseCollType(tokens[CONFIG_FIELD_COLLTYPE]);
config->minBytes = (size_t)strtoull(tokens[CONFIG_FIELD_MINBYTES], NULL, 10);
config->maxBytes = (size_t)strtoull(tokens[CONFIG_FIELD_MAXBYTES], NULL, 10);
config->algorithm = parseAlgorithm(tokens[CONFIG_FIELD_ALGORITHM]);
config->protocol = parseProtocol(tokens[CONFIG_FIELD_PROTOCOL]);
config->nChannels = atoi(tokens[CONFIG_FIELD_CHANNELS]);
config->nNodes = atoi(tokens[CONFIG_FIELD_NNODES]);
config->nRanks = atoi(tokens[CONFIG_FIELD_NRANKS]);
// numPipeOps is optional (9th field, index 8)
if (tokenCount >= CONFIG_FIELDS_WITH_PIPEOPS) {
config->numPipeOps = atoi(tokens[CONFIG_FIELD_PIPEOPS]);
} else {
config->numPipeOps = -1; // -1 means match any numPipeOps
}
// regBuff is optional (10th field, index 9)
if (tokenCount >= CONFIG_FIELDS_WITH_REGBUFF) {
config->regBuff = atoi(tokens[CONFIG_FIELD_REGBUFF]);
} else {
config->regBuff = -1; // -1 means match any regBuff value
}
ctx->numConfigs++;
if (ctx->logFunction) {
if (config->numPipeOps == -1 && config->regBuff == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=any regBuff=any",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks);
} else if (config->regBuff == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=%d regBuff=any",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->numPipeOps);
} else if (config->numPipeOps == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=any regBuff=%d",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->regBuff);
} else {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=%d regBuff=%d",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->numPipeOps, config->regBuff);
}
}
}
}
fclose(file);
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded %d tuning configurations from %s", ctx->numConfigs, filename);
}
return ncclSuccess;
}
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) {
TunerContext* ctx = (TunerContext*)malloc(sizeof(TunerContext));
if (!ctx) return ncclSystemError;
ctx->configs = NULL; // Initialize to NULL
ctx->numConfigs = 0;
ctx->maxConfigs = 0; // Initialize to 0
ctx->nRanks = nRanks;
ctx->nNodes = nNodes;
ctx->logFunction = logFunction;
if (logFunction) {
logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Initializing tuner for %zu nodes, %zu ranks", nNodes, nRanks);
}
// Try to load config file from environment variable or default location
const char* configFile = getenv("NCCL_TUNER_CONFIG_FILE");
if (!configFile) {
configFile = "nccl_tuner.conf"; // default config file name
}
ncclResult_t result = loadConfig(ctx, configFile);
if (result != ncclSuccess) {
if (ctx->configs) {
free(ctx->configs); // Clean up allocated memory on error
}
free(ctx);
return result;
}
*context = ctx;
return ncclSuccess;
}
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels) {
// Update NCCL core generated cost table. Updated table will be evaluated by NCCL to pick the best algo/proto combo
float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
if (table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] != NCCL_ALGO_PROTO_IGNORE) {
table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] = 0.0;
}
TunerContext* ctx = (TunerContext*)context;
if (!ctx) return ncclInternalError;
// Default channels
*nChannels = 1;
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: pluginGetCollInfo called - collType=%s, nBytes=%zu, numPipeOps=%d, regBuff=%d, numConfigs=%d",
collTypeToString(collType), nBytes, numPipeOps, regBuff, ctx->numConfigs);
}
// Look for matching configuration
for (int i = 0; i < ctx->numConfigs; i++) {
TuningConfig* config = &ctx->configs[i];
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Checking config %d - collType=%s, minBytes=%zu, maxBytes=%zu, algo=%s, proto=%s, nNodes=%d, nRanks=%d, numPipeOps=%d, regBuff=%d",
i, collTypeToString(config->collType), config->minBytes, config->maxBytes, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nNodes, config->nRanks, config->numPipeOps, config->regBuff);
}
// Check if this config matches the current collective, size range, topology, pipeline ops, and regBuff
if (config->collType == collType &&
nBytes >= config->minBytes &&
nBytes <= config->maxBytes &&
(config->nNodes == -1 || config->nNodes == (int)ctx->nNodes) &&
(config->nRanks == -1 || config->nRanks == (int)ctx->nRanks) &&
(config->numPipeOps == -1 || config->numPipeOps == numPipeOps) &&
(config->regBuff == -1 || config->regBuff == regBuff)) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config matches. Applying algo=%s, proto=%s, channels=%d",
algorithmToString(config->algorithm), protocolToString(config->protocol), config->nChannels);
}
// Check bounds
if (config->algorithm < numAlgo && config->protocol < numProto) {
if (collCostTable[config->algorithm][config->protocol] != NCCL_ALGO_PROTO_IGNORE) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Setting cost table[%s][%s] (%p) = 0.0 (was %.1f)",
algorithmToString(config->algorithm), protocolToString(config->protocol),
&collCostTable[config->algorithm][config->protocol], collCostTable[config->algorithm][config->protocol]);
}
collCostTable[config->algorithm][config->protocol] = 0.0; // Set low cost to prefer this configuration
// Only override channels if not set to -1 (keep default)
if (config->nChannels != -1) {
*nChannels = config->nChannels;
}
if (ctx->logFunction) {
if (config->nChannels == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Applied config for collType=%s, bytes=%zu, pipeOps=%d, regBuff=%d: algo=%s, proto=%s, channels=default (nodes=%d, ranks=%d)",
collTypeToString(config->collType), nBytes, numPipeOps, regBuff, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nNodes, config->nRanks);
} else {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Applied config for collType=%s, bytes=%zu, pipeOps=%d, regBuff=%d: algo=%s, proto=%s, channels=%d (nodes=%d, ranks=%d)",
collTypeToString(config->collType), nBytes, numPipeOps, regBuff, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nChannels, config->nNodes, config->nRanks);
}
}
return ncclSuccess;
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Algorithm/protocol combination [%s][%s] is marked as IGNORE",
algorithmToString(config->algorithm), protocolToString(config->protocol));
}
}
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Algorithm/protocol out of bounds - algo=%s (max %d), proto=%s (max %d)",
algorithmToString(config->algorithm), numAlgo, protocolToString(config->protocol), numProto);
}
}
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config does not match - collType match=%d, size match=%d, nodes match=%d, ranks match=%d, pipeOps match=%d, regBuff match=%d",
config->collType == collType,
(nBytes >= config->minBytes && nBytes <= config->maxBytes),
(config->nNodes == -1 || config->nNodes == (int)ctx->nNodes),
(config->nRanks == -1 || config->nRanks == (int)ctx->nRanks),
(config->numPipeOps == -1 || config->numPipeOps == numPipeOps),
(config->regBuff == -1 || config->regBuff == regBuff));
}
}
}
// If no specific config found, apply default behavior
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: No matching config found");
}
return ncclSuccess;
}
__hidden ncclResult_t pluginDestroy(void* context) { return ncclSuccess; }
__hidden ncclResult_t pluginDestroy(void* context) {
if (context) {
TunerContext* ctx = (TunerContext*)context;
if (ctx->configs) {
free(ctx->configs); // Free dynamically allocated configs array
}
free(context);
}
return ncclSuccess;
}
#define PLUGIN_NAME "Example"


@ -0,0 +1,106 @@
# NCCL Tuner Configuration Scripts
This directory contains scripts for optimizing NCCL tuner configurations based on performance data.
## optimize_config.py
A Python script that reads performance data from CSV files and generates optimal NCCL tuner configurations.
### Usage
```bash
python scripts/optimize_config.py [options] <input_csv_file>
```
### Options
- `-o, --output FILE`: Output NCCL tuner config file (default: `nccl_tuner.conf`)
- `-m, --metric METRIC`: Optimization metric (`cost_metric`, `bandwidth_gbps`, `latency_us`)
- `--no-header`: Don't add header comments to output file
- `--dry-run`: Print configurations without writing to file
### CSV Input Format
The input CSV file should have the following columns:
```csv
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
```
**Required columns:**
- `collective`: NCCL collective type (`allreduce`, `broadcast`, `reduce`, etc.)
- `size_bytes`: Message size in bytes
- `algorithm`: NCCL algorithm (`tree`, `ring`, `nvls`, etc.)
- `protocol`: NCCL protocol (`simple`, `ll`, `ll128`)
- `channels`: Number of channels (or `-1` for default)
- `nodes`: Number of nodes (or `-1` for any)
- `ranks`: Number of ranks (or `-1` for any)
- `pipeOps`: Number of pipeline operations (or `-1` for any)
- `regBuff`: Registered buffer flag (`0`, `1`, or `-1` for any)
**Optional metrics (at least one must be present):**
- `bandwidth_gbps`: Bandwidth in GB/s (higher is better)
- `latency_us`: Latency in microseconds (lower is better)
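For example (all numbers below are made up for illustration), two rows comparing ring/simple and tree/ll128 for a 1 MiB allreduce on a 2-node, 16-rank job could look like this:
```csv
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
allreduce,1048576,ring,simple,4,2,16,1,0,1.0,42.0,180.5
allreduce,1048576,tree,ll128,4,2,16,1,0,1.2,38.5,210.0
```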
### Examples
**Basic usage with cost optimization:**
```bash
python scripts/optimize_config.py sample_performance_data.csv
```
**Optimize for bandwidth and write to custom file:**
```bash
python scripts/optimize_config.py -m bandwidth_gbps -o my_tuner.conf performance_data.csv
```
**Preview configurations without writing:**
```bash
python scripts/optimize_config.py --dry-run performance_data.csv
```
### How It Works
1. **Data Loading**: Reads CSV performance data and validates format
2. **Grouping**: Groups data by collective type, topology (nodes/ranks), and other parameters
3. **Size Ranges**: Automatically bins data into size ranges for optimization
4. **Optimization**: Finds the best performing configuration for each group/size combination
5. **Output**: Generates NCCL tuner config format and appends to specified file
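For instance, if the only data for a 2-node, 16-rank allreduce group were the two illustrative rows shown under *CSV Input Format* and the metric were `latency_us`, the ring/simple row wins; with a single data size in that group, the auto-detected range spans 0 to that size, so a tuner config line roughly like the following would be emitted:
```csv
allreduce,0,1048576,ring,simple,4,2,16,1,0
```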
### Default Size Ranges
By default the script auto-detects per-topology size ranges from the data; when that detection is disabled and no custom ranges are given, it falls back to these default size ranges (in bytes):
- Small: 0 - 1,024
- Medium: 1,025 - 65,536
- Large: 65,537 - 1,048,576
- XLarge: 1,048,577 - 16,777,216
- XXLarge: 16,777,217 - 4,294,967,295
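To control this from the command line, pass `--size-ranges` with comma-separated `min-max` pairs, or `--no-auto-ranges` to use the defaults above (both options appear in the script's usage examples), for example:
```bash
python3 scripts/optimize_config.py sample_performance_data.csv --size-ranges "0-1024,1025-65536,65537-1048576"
```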
### Sample Data
See `sample_performance_data.csv` for an example of the expected input format.
### Integration with NCCL
The generated configuration file can be used directly with the NCCL tuner plugin:
```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/optimized_config.conf
export NCCL_TUNER_PLUGIN=/path/to/libnccl-tuner.so
mpirun -np 8 your_nccl_application
```
### Performance Data Collection
To collect performance data for optimization, you can:
1. **Use NCCL benchmarks** with different algorithm/protocol combinations
2. **Profile your applications** with various tuner settings
3. **Run systematic sweeps** across parameter combinations
4. **Use NCCL debug output** to collect timing information
The key is to have comprehensive data covering:
- Different message sizes (small to large)
- Various topologies (single node, multi-node)
- All relevant algorithm/protocol combinations
- Different channel counts and pipeline configurations


@ -0,0 +1,430 @@
#!/usr/bin/env python3
"""
NCCL Tuner Configuration Optimizer
Reads a CSV file containing performance data across different tuning parameters
and generates optimal NCCL tuner configurations based on the best performing
combinations.
By default, creates growing size ranges that interpolate between the actual data sizes
for each unique dimension (node count, rank count combination). This ensures that
different cluster configurations get their own optimized size boundaries, as
performance characteristics often vary significantly between topologies.
Each dimension gets its own set of ranges starting from 0 and extending to the maximum
size for that dimension, with boundaries at midpoints between consecutive data sizes.
CSV Input Format:
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,bandwidth_gbps,latency_us
Output Format (NCCL Tuner Config):
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
Usage Examples:
# Auto-create dimension-specific interpolated ranges (default)
python3 optimize_config.py data.csv
# Use custom size ranges (applied to all topologies)
python3 optimize_config.py data.csv --size-ranges "0-1024,1025-65536,65537-1048576"
# Use hardcoded default ranges (applied to all topologies)
python3 optimize_config.py data.csv --no-auto-ranges
"""
import csv
import argparse
import sys
import os
from collections import defaultdict
from typing import Dict, List, Tuple, Any
class PerformanceData:
def __init__(self, row: Dict[str, str]):
self.collective = row['collective']
self.size_bytes = int(row['size_bytes'])
self.algorithm = row['algorithm']
self.protocol = row['protocol']
self.channels = int(row['channels']) if row['channels'] != '-1' else -1
self.nodes = int(row['nodes']) if row['nodes'] != '-1' else -1
self.ranks = int(row['ranks']) if row['ranks'] != '-1' else -1
self.pipeOps = int(row['pipeOps']) if row['pipeOps'] != '-1' else -1
self.regBuff = int(row['regBuff']) if row['regBuff'] != '-1' else -1
# Performance metrics
self.bandwidth_gbps = float(row.get('bandwidth_gbps', 0)) # Higher is better
self.latency_us = float(row.get('latency_us', 0)) # Lower is better
def get_config_key(self) -> Tuple:
"""Generate a key for grouping similar configurations"""
return (self.collective, self.nodes, self.ranks, self.pipeOps, self.regBuff)
def get_size_range_key(self, topology_size_ranges: Dict[Tuple[int, int], List[Tuple[int, int]]]) -> Tuple[int, int]:
"""Find which size range this data point belongs to for its dimension"""
topology_key = (self.nodes, self.ranks)
# Get size ranges for this dimension, or fall back to default
if topology_key in topology_size_ranges:
size_ranges = topology_size_ranges[topology_key]
elif (-1, -1) in topology_size_ranges:
size_ranges = topology_size_ranges[(-1, -1)]
else:
# Fallback to first available dimension ranges
size_ranges = next(iter(topology_size_ranges.values()))
for min_size, max_size in size_ranges:
if min_size <= self.size_bytes <= max_size:
return (min_size, max_size)
# If no range found, create a single-point range
return (self.size_bytes, self.size_bytes)
class ConfigOptimizer:
def __init__(self, optimization_metric: str = 'latency_us'):
self.optimization_metric = optimization_metric
# Default size ranges - will be overridden by auto-detection
self.size_ranges = [
(0, 1024),
(1025, 64*1024),
(64*1024+1, 1024*1024),
(1024*1024+1, 16*1024*1024),
(16*1024*1024+1, 4*1024*1024*1024-1)
]
self.auto_size_ranges = True
def set_size_ranges(self, ranges: List[Tuple[int, int]]):
"""Set custom size ranges for optimization"""
self.size_ranges = ranges
self.auto_size_ranges = False
def auto_determine_size_ranges(self, data: List[PerformanceData]) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
"""Create growing size ranges for each unique (nodes, ranks) dimension"""
if not data:
return {(-1, -1): self.size_ranges}
# Group data by dimension (nodes, ranks)
topology_data = defaultdict(list)
for item in data:
topology_key = (item.nodes, item.ranks)
topology_data[topology_key].append(item)
topology_ranges = {}
for topology_key, items in topology_data.items():
nodes, ranks = topology_key
# Extract unique sizes for this dimension and sort them
unique_sizes = sorted(set(item.size_bytes for item in items))
if len(unique_sizes) <= 1:
# Only one size, create a single range from 0 to that size
size = unique_sizes[0] if unique_sizes else 0
ranges = [(0, size)]
else:
# Create growing ranges that interpolate between data points
ranges = []
for i, size in enumerate(unique_sizes):
if i == 0:
# First range: 0 to midpoint between first and second size
if len(unique_sizes) > 1:
next_size = unique_sizes[i + 1]
max_size = (size + next_size) // 2
else:
max_size = size
min_size = 0
elif i == len(unique_sizes) - 1:
# Last range: previous max + 1 to current size (and beyond)
min_size = ranges[-1][1] + 1
max_size = size
else:
# Intermediate ranges: previous max + 1 to midpoint with next size
min_size = ranges[-1][1] + 1
next_size = unique_sizes[i + 1]
max_size = (size + next_size) // 2
ranges.append((min_size, max_size))
topology_ranges[topology_key] = ranges
print(f"Dimension {nodes} nodes, {ranks} ranks: {len(ranges)} size ranges from {len(unique_sizes)} unique sizes:")
for i, (min_size, max_size) in enumerate(ranges):
# Count data points that fall in this range for this dimension
count = sum(1 for item in items if min_size <= item.size_bytes <= max_size)
actual_sizes = sorted(set(item.size_bytes for item in items if min_size <= item.size_bytes <= max_size))
if actual_sizes:
size_list = ', '.join(f"{s:,}" for s in actual_sizes[:3])
if len(actual_sizes) > 3:
size_list += f", ... (+{len(actual_sizes)-3} more)"
print(f" Range {i+1}: {min_size:,} - {max_size:,} bytes ({count} data points, sizes: {size_list})")
return topology_ranges
def load_data(self, csv_file: str) -> List[PerformanceData]:
"""Load performance data from CSV file"""
data = []
try:
with open(csv_file, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
try:
data.append(PerformanceData(row))
except (ValueError, KeyError) as e:
print(f"Warning: Skipping invalid row: {row} - {e}")
except FileNotFoundError:
print(f"Error: File {csv_file} not found")
sys.exit(1)
except Exception as e:
print(f"Error reading {csv_file}: {e}")
sys.exit(1)
print(f"Loaded {len(data)} performance data points")
# Auto-determine size ranges if enabled
if self.auto_size_ranges and data:
self.topology_size_ranges = self.auto_determine_size_ranges(data)
else:
# Use default ranges for all topologies
self.topology_size_ranges = {(-1, -1): self.size_ranges}
return data
def is_better(self, new_data: PerformanceData, current_best: PerformanceData) -> bool:
"""Determine if new_data is better than current_best"""
if self.optimization_metric == 'bandwidth_gbps':
return new_data.bandwidth_gbps > current_best.bandwidth_gbps
elif self.optimization_metric == 'latency_us':
return new_data.latency_us < current_best.latency_us
else:
# Default to latency
return new_data.latency_us < current_best.latency_us
def optimize_configurations(self, data: List[PerformanceData]) -> List[str]:
"""Find optimal configurations and return as NCCL config strings"""
# Group data by configuration key and size range
grouped_data = defaultdict(lambda: defaultdict(list))
for item in data:
config_key = item.get_config_key()
size_range = item.get_size_range_key(self.topology_size_ranges)
grouped_data[config_key][size_range].append(item)
# Store optimal configurations before combining ranges
optimal_configs = []
for config_key, size_ranges_dict in grouped_data.items():
collective, nodes, ranks, pipeOps, regBuff = config_key
for (min_size, max_size), items in size_ranges_dict.items():
if not items:
continue
# Find the best performing configuration for this size range
best_item = items[0]
for item in items[1:]:
if self.is_better(item, best_item):
best_item = item
# Store the optimal configuration with its range
optimal_configs.append({
'collective': collective,
'min_size': min_size,
'max_size': max_size,
'algorithm': best_item.algorithm,
'protocol': best_item.protocol,
'channels': best_item.channels,
'nodes': best_item.nodes,
'ranks': best_item.ranks,
'pipeOps': best_item.pipeOps,
'regBuff': best_item.regBuff,
'metric_value': getattr(best_item, self.optimization_metric)
})
# Combine sequential ranges with identical tunings
combined_configs = self.combine_sequential_ranges(optimal_configs)
# Generate config strings
configs = []
for config in combined_configs:
config_str = f"{config['collective']},{config['min_size']},{config['max_size']},{config['algorithm']},{config['protocol']},{config['channels']},{config['nodes']},{config['ranks']},{config['pipeOps']},{config['regBuff']}"
configs.append(config_str)
print(f"Optimal for {config['collective']} [{config['min_size']}-{config['max_size']}] nodes={config['nodes']} ranks={config['ranks']}: "
f"{config['algorithm']}/{config['protocol']} channels={config['channels']} "
f"({self.optimization_metric}={config['metric_value']:.3f})")
return configs
def combine_sequential_ranges(self, configs: List[Dict]) -> List[Dict]:
"""Combine sequential ranges that have identical tuning parameters"""
if not configs:
return configs
# Group by collective and topology (nodes, ranks)
topology_groups = defaultdict(list)
for config in configs:
topology_key = (config['collective'], config['nodes'], config['ranks'],
config['pipeOps'], config['regBuff'])
topology_groups[topology_key].append(config)
combined_configs = []
for topology_key, topology_configs in topology_groups.items():
# Sort by min_size to ensure proper ordering
topology_configs.sort(key=lambda x: x['min_size'])
# Group by tuning parameters (algorithm, protocol, channels)
tuning_groups = defaultdict(list)
for config in topology_configs:
tuning_key = (config['algorithm'], config['protocol'], config['channels'])
tuning_groups[tuning_key].append(config)
# For each tuning group, combine sequential ranges
for tuning_key, tuning_configs in tuning_groups.items():
if not tuning_configs:
continue
# Sort by min_size
tuning_configs.sort(key=lambda x: x['min_size'])
# Combine sequential ranges
current_config = tuning_configs[0].copy()
for next_config in tuning_configs[1:]:
# Check if ranges are adjacent or overlapping
if current_config['max_size'] + 1 >= next_config['min_size']:
# Extend the current range
current_config['max_size'] = max(current_config['max_size'], next_config['max_size'])
# Update metric value to the better one
if self.optimization_metric == 'bandwidth_gbps':
if next_config['metric_value'] > current_config['metric_value']:
current_config['metric_value'] = next_config['metric_value']
else: # latency_us or default
if next_config['metric_value'] < current_config['metric_value']:
current_config['metric_value'] = next_config['metric_value']
else:
# Gap between ranges, save current and start new one
combined_configs.append(current_config)
current_config = next_config.copy()
# Add the last configuration
combined_configs.append(current_config)
# Sort final configs by collective, nodes, ranks, then min_size
combined_configs.sort(key=lambda x: (x['collective'], x['nodes'], x['ranks'], x['min_size']))
original_count = len(configs)
combined_count = len(combined_configs)
if combined_count < original_count:
print(f"Combined {original_count} ranges into {combined_count} ranges "
f"(reduced by {original_count - combined_count})")
return combined_configs
def append_to_config_file(self, configs: List[str], config_file: str, add_header: bool = True):
"""Append optimized configurations to NCCL tuner config file"""
try:
# Create directory if it doesn't exist
config_dir = os.path.dirname(config_file)
if config_dir and not os.path.exists(config_dir):
os.makedirs(config_dir)
print(f"Created directory: {config_dir}")
# Check if file exists and has content
file_exists = os.path.exists(config_file)
add_separator = False
if file_exists:
with open(config_file, 'r') as f:
content = f.read().strip()
add_separator = len(content) > 0
print(f"Appending to existing file: {config_file}")
else:
print(f"Creating new file: {config_file}")
with open(config_file, 'a') as f:
if add_separator:
f.write("\n\n")
if add_header:
f.write(f"# Optimized configurations generated by optimize_config.py\n")
f.write(f"# Optimization metric: {self.optimization_metric}\n")
f.write(f"# Format: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff\n")
for config in configs:
f.write(f"{config}\n")
if file_exists:
print(f"Appended {len(configs)} optimized configurations to {config_file}")
else:
print(f"Created {config_file} with {len(configs)} optimized configurations")
except PermissionError:
print(f"Error: Permission denied writing to {config_file}")
print("Try running with appropriate permissions or choose a different output location")
sys.exit(1)
except OSError as e:
print(f"Error: Cannot create/write to {config_file}: {e}")
print("Check that the path is valid and you have write permissions")
sys.exit(1)
except Exception as e:
print(f"Unexpected error writing to {config_file}: {e}")
sys.exit(1)
def main():
parser = argparse.ArgumentParser(description="Optimize NCCL tuner configurations from performance data")
parser.add_argument("csv_file", help="Input CSV file with performance data")
parser.add_argument("-o", "--output", default="nccl_tuner.conf",
help="Output NCCL tuner config file (default: nccl_tuner.conf)")
parser.add_argument("-m", "--metric", choices=['bandwidth_gbps', 'latency_us'],
default='latency_us', help="Optimization metric (default: latency_us)")
parser.add_argument("--no-header", action="store_true",
help="Don't add header comments to output file")
parser.add_argument("--dry-run", action="store_true",
help="Print configurations without writing to file")
parser.add_argument("--no-auto-ranges", action="store_true",
help="Disable automatic size range determination (use default ranges)")
parser.add_argument("--size-ranges", type=str,
help="Custom size ranges as comma-separated pairs: 'min1-max1,min2-max2,...'")
args = parser.parse_args()
optimizer = ConfigOptimizer(args.metric)
# Handle size range configuration
if args.size_ranges:
# Parse custom size ranges
try:
ranges = []
for range_str in args.size_ranges.split(','):
min_size, max_size = map(int, range_str.split('-'))
ranges.append((min_size, max_size))
optimizer.set_size_ranges(ranges)
print(f"Using custom size ranges: {ranges}")
except ValueError:
print("Error: Invalid size ranges format. Use 'min1-max1,min2-max2,...'")
sys.exit(1)
elif args.no_auto_ranges:
# Disable auto-ranging
optimizer.auto_size_ranges = False
print("Using default hardcoded size ranges")
else:
# Auto-ranging is enabled by default - creates one bucket per unique size
optimizer.auto_size_ranges = True
print("Auto-ranging enabled: will create one bucket per unique size in data")
# Load and optimize data
data = optimizer.load_data(args.csv_file)
if not data:
print("No valid data found in CSV file")
sys.exit(1)
configs = optimizer.optimize_configurations(data)
if args.dry_run:
print("\nGenerated configurations:")
for config in configs:
print(config)
else:
optimizer.append_to_config_file(configs, args.output, not args.no_header)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,24 @@
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
allreduce,1024,tree,simple,2,1,8,-1,-1,0.15,45.2,12.5
allreduce,1024,ring,simple,4,1,8,-1,-1,0.12,52.1,10.8
allreduce,1024,tree,ll,2,1,8,-1,-1,0.18,41.3,15.2
allreduce,1024,ring,ll,4,1,8,-1,-1,0.14,48.7,12.1
allreduce,32768,tree,simple,2,1,8,-1,-1,0.25,156.8,25.3
allreduce,32768,ring,simple,4,1,8,-1,-1,0.18,189.2,18.4
allreduce,32768,ring,ll128,8,1,8,-1,-1,0.16,201.5,16.2
allreduce,1048576,ring,simple,4,1,8,-1,-1,0.45,425.6,45.1
allreduce,1048576,ring,ll128,8,1,8,-1,-1,0.38,482.3,38.7
allreduce,1048576,nvls,simple,16,1,8,-1,-1,0.32,551.2,32.1
broadcast,1024,tree,simple,2,1,8,-1,-1,0.08,89.4,8.2
broadcast,1024,ring,simple,4,1,8,-1,-1,0.12,71.3,12.1
broadcast,32768,tree,simple,2,1,8,-1,-1,0.18,234.7,18.5
broadcast,32768,ring,ll128,4,1,8,-1,-1,0.15,267.8,15.2
broadcast,1048576,ring,simple,4,1,8,-1,-1,0.35,612.4,35.1
broadcast,1048576,ring,ll128,8,1,8,-1,-1,0.28,702.1,28.3
allreduce,1024,tree,simple,2,2,16,-1,-1,0.22,38.1,22.4
allreduce,1024,ring,simple,4,2,16,-1,-1,0.19,42.7,19.6
allreduce,32768,ring,simple,4,2,16,-1,-1,0.28,145.2,28.1
allreduce,32768,ring,ll128,8,2,16,-1,-1,0.24,167.8,24.3
allreduce,1048576,ring,simple,4,2,16,-1,-1,0.58,387.5,58.2
allreduce,1048576,ring,ll128,8,2,16,-1,-1,0.48,456.9,48.1
allreduce,1048576,nvls,simple,16,2,16,-1,-1,0.42,512.6,42.3

View File

@ -0,0 +1,30 @@
#
# Makefile for NCCL Tuner Plugin Unit Tests
#
CC := gcc
CFLAGS := -Wall -Wextra -g -std=c99 -fPIC
INC := -I. -I../nccl
TARGET := test_plugin
SOURCES := test_plugin.c
# Default target
all: $(TARGET)

# Build the test executable
$(TARGET): $(SOURCES)
	$(CC) $(CFLAGS) $(INC) -o $(TARGET) $(SOURCES)

# Run the tests
test: $(TARGET)
	./$(TARGET) $(TEST_CASE)

# Run tests with verbose output
test-verbose: $(TARGET)
	NCCL_DEBUG=INFO ./$(TARGET) $(TEST_CASE)

# Clean build artifacts
clean:
	rm -f $(TARGET) *.o *.gcov *.gcda *.gcno test_*.conf

.PHONY: all test test-verbose clean

View File

@ -0,0 +1,205 @@
# NCCL Tuner Plugin Unit Tests
This directory contains comprehensive unit tests for the NCCL tuner plugin. The tests verify all major functionality including configuration parsing, matching logic, and cost table updates.
## Test Structure
```
test/
├── test_plugin.c # Main unit test file
├── Makefile # Build system for tests
└── README.md # This file
```
## Building and Running Tests
### Quick Start
```bash
# Build and run all tests
make test
# Or step by step
make # Build test executable
./test_plugin # Run tests
```
### Advanced Testing
```bash
# Run with memory leak detection (requires valgrind)
make test-memory
# Run with verbose logging
make test-verbose
# Generate code coverage report (requires gcov)
make coverage
# Create sample test configuration files
make test-configs
```
## Test Coverage
The unit tests cover the following functionality:
### 1. **Plugin Initialization (`test_plugin_init`)**
- Tests successful plugin initialization
- Verifies context allocation
- Tests cleanup on destroy
### 2. **Configuration Parsing (`test_config_parsing_valid`, `test_config_parsing_invalid`)**
- Valid CSV format parsing
- Comment and empty line handling
- Invalid format graceful handling
- Environment variable configuration
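For reference, a minimal configuration file in the format these tests parse (mirroring the fixture used in `test_config_parsing_valid`) looks like this; comment and blank lines are skipped:
```
# collective,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
allreduce,0,65536,tree,simple,2,1,-1,-1,-1
broadcast,0,32768,ring,ll128,4,2,16,-1,-1
```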
### 3. **Collective Type Matching (`test_collective_matching`)**
- Correct matching of allreduce, broadcast, etc.
- Algorithm/protocol selection
- Channel configuration
### 4. **Size Range Matching (`test_size_matching`)**
- Small, medium, large message size handling
- Proper range boundary checking
- Multiple size-based configurations
### 5. **Topology Matching (`test_topology_matching`)**
- Single-node vs multi-node configurations
- Exact nNodes/nRanks matching
- Wildcard matching (-1 values)
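As an illustration, `test_topology_matching` loads the three entries below: the first applies only on a single node, the second only to exactly 4 nodes and 32 ranks, and the third (wildcard `-1` for nNodes/nRanks) to any topology:
```
allreduce,0,65536,tree,simple,2,1,-1,-1,-1
allreduce,0,65536,ring,simple,4,4,32,-1,-1
allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1
```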
### 6. **Default Channels (`test_default_channels`)**
- Proper handling of -1 channel specification
- Preservation of NCCL default behavior
### 7. **Registered Buffer Matching (`test_regbuff_matching`)**
- Configurations based on regBuff parameter
- Registered vs non-registered buffer handling
- Backward compatibility with configs missing regBuff
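For example, the fixture in `test_regbuff_matching` selects on the final field: `1` matches only registered buffers, `0` only non-registered ones, and `-1` matches either:
```
allreduce,0,65536,tree,simple,2,-1,-1,-1,1
allreduce,0,65536,ring,simple,4,-1,-1,-1,0
allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1
```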
### 8. **Pipeline Operations Matching (`test_pipeops_matching`)**
- Configurations based on numPipeOps parameter
- Single vs multiple pipeline operation handling
- Backward compatibility with configs missing numPipeOps
### 9. **Fallback Behavior (`test_no_match_fallback`)**
- Default behavior when no config matches
- Ring/Simple algorithm fallback
## Test Output
Successful test run:
```
Running NCCL Tuner Plugin Unit Tests
=====================================
PASS: test_plugin_init
PASS: test_config_parsing_valid
PASS: test_config_parsing_invalid
PASS: test_collective_matching
PASS: test_size_matching
PASS: test_topology_matching
PASS: test_default_channels
PASS: test_regbuff_matching
PASS: test_pipeops_matching
PASS: test_no_match_fallback
=====================================
Test Results: 10/10 tests passed
All tests PASSED!
```
Failed test example:
```
FAIL: test_collective_matching - Tree/Simple should have low cost
Test Results: 9/10 tests passed
Some tests FAILED!
```
## Mock NCCL Implementation
The tests use the actual NCCL header files from the `../nccl/` directory:
- `tuner.h` - Complete NCCL tuner interface and type definitions
- `common.h` - Common NCCL types and logging functions
- `err.h` - NCCL error codes
This allows the tests to build against the real NCCL interface definitions without requiring a full NCCL library installation.
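Concretely, the test compiles the plugin implementation directly into the test binary instead of linking against NCCL (the include pattern from `test_plugin.c`):
```c
// Real NCCL tuner interface definitions from ../nccl
#include "tuner.h"
// Plugin implementation compiled into the test executable
#include "../plugin.c"
```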
## Integration with CI/CD
```bash
# Install tests for CI/CD pipeline
make install-test
# Run as part of automated testing
make test && echo "Tests passed" || echo "Tests failed"
```
## Memory Testing
The tests can be run with valgrind for memory leak detection:
```bash
make test-memory
```
This will detect:
- Memory leaks
- Invalid memory access
- Use of uninitialized memory
## Code Coverage
Generate code coverage reports to ensure comprehensive testing:
```bash
make coverage
# Creates test_plugin.c.gcov with line-by-line coverage
```
## Adding New Tests
To add a new test:
1. Create a new test function in `test_plugin.c`:
```c
int test_new_feature() {
  // Test setup
  TEST_ASSERT(condition, "description");
  // Test cleanup
  TEST_PASS();
}
```
2. Register the test in the `test_cases[]` table in `test_plugin.c`, which both runs it as part of the full suite and makes it selectable by name:
```c
{"new-feature", test_new_feature, "Description of the new feature"},
```
3. Rebuild and run:
```bash
make test
```
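Once registered, the new test can be run on its own by its registry name (the name `new-feature` above is illustrative), either directly or via the Makefile's `TEST_CASE` variable:
```bash
./test_plugin new-feature
# or
make test TEST_CASE=new-feature
```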
## Debugging Tests
For debugging failed tests:
```bash
# Compile with debug symbols
make CFLAGS="-g -O0 -DDEBUG"
# Run with gdb
gdb ./test_plugin
```
## Cleaning Up
```bash
# Remove all build artifacts and temporary files
make clean
```
This comprehensive test suite ensures the NCCL tuner plugin works correctly across all supported configurations and edge cases.

View File

@ -0,0 +1,856 @@
/*************************************************************************
* Unit tests for NCCL Tuner Plugin
************************************************************************/
#define _GNU_SOURCE // Enable setenv/unsetenv and other GNU extensions
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdarg.h>
// Include NCCL tuner header (which includes common.h and err.h)
#include "tuner.h"
// Include plugin source for testing
#include "../plugin.c"
// Test framework macros
#define TEST_ASSERT(condition, message) \
do { \
if (!(condition)) { \
printf("FAIL: %s - %s\n", __func__, message); \
return 0; \
} \
} while(0)
#define TEST_PASS() \
do { \
printf("PASS: %s\n", __func__); \
return 1; \
} while(0)
// Global test state
static int test_log_count = 0;
// Mock logger function
void mock_logger(ncclDebugLogLevel level, unsigned long flags,
const char* file, int line, const char* fmt, ...) {
(void)flags; // Suppress unused parameter warning
test_log_count++;
// Check if we should print based on NCCL_DEBUG level
const char* debug_level = getenv("NCCL_DEBUG");
int should_print = 0;
if (debug_level) {
if (strcmp(debug_level, "TRACE") == 0) {
should_print = 1; // Print everything
} else if (strcmp(debug_level, "INFO") == 0 && level <= NCCL_LOG_INFO) {
should_print = 1; // Print INFO and below
} else if (strcmp(debug_level, "WARN") == 0 && level <= NCCL_LOG_WARN) {
should_print = 1; // Print WARN and below
}
}
if (!should_print) return;
// Convert log level to string
const char* level_str;
switch(level) {
case NCCL_LOG_NONE: level_str = "NONE"; break;
case NCCL_LOG_VERSION: level_str = "VERSION"; break;
case NCCL_LOG_WARN: level_str = "WARN"; break;
case NCCL_LOG_INFO: level_str = "INFO"; break;
case NCCL_LOG_ABORT: level_str = "ABORT"; break;
case NCCL_LOG_TRACE: level_str = "TRACE"; break;
default: level_str = "UNKNOWN"; break;
}
// Print log header
printf("[TUNER:%s:%s:%d] ", level_str, file, line);
// Print formatted message
va_list args;
va_start(args, fmt);
vprintf(fmt, args);
va_end(args);
printf("\n");
}
// Helper function to create test config file
void create_test_config(const char* filename, const char* content) {
FILE* f = fopen(filename, "w");
if (f) {
fprintf(f, "%s", content);
fclose(f);
}
}
// Test 1: Plugin initialization
int test_plugin_init() {
void* context = NULL;
// Test successful initialization
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed");
TEST_ASSERT(context != NULL, "Context should be allocated");
// Clean up
pluginDestroy(context);
TEST_PASS();
}
// Test 2: Configuration file parsing - valid CSV
int test_config_parsing_valid() {
const char* test_config =
"# Test configuration\n"
"allreduce,0,65536,tree,simple,2,1,-1,-1,-1\n"
"broadcast,0,32768,ring,ll128,4,2,16,-1,-1\n"
"# Comment line\n"
"\n" // Empty line
"reduce,1024,2048,tree,simple,-1,-1,-1,-1,-1\n";
create_test_config("test_valid.conf", test_config);
// Set environment variable to use our test config
setenv("NCCL_TUNER_CONFIG_FILE", "test_valid.conf", 1);
void* context = NULL;
ncclResult_t result = pluginInit(16, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init with valid config should succeed");
// Clean up
pluginDestroy(context);
unlink("test_valid.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 3: Configuration file parsing - invalid CSV
int test_config_parsing_invalid() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,1 # Missing nRanks and other fields\n"
"invalid_collective,0,1024,ring,simple,1,1,1,-1,-1\n"
"broadcast,abc,def,ring,simple,1,1,1,-1,-1\n"; // Invalid numbers
create_test_config("test_invalid.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_invalid.conf", 1);
void* context = NULL;
ncclResult_t result = pluginInit(8, 1, mock_logger, &context);
// Should still succeed but with no valid configs loaded
TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed even with invalid config");
// Clean up
pluginDestroy(context);
unlink("test_invalid.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 4: Collective type matching
int test_collective_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,8,1,-1,-1,-1\n"
"broadcast,0,32768,ring,ll128,4,-1,-1,-1,-1\n";
create_test_config("test_match.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_match.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
// Create mock cost table
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Default high cost
}
}
int nChannels;
// Test allreduce matching (should match first config)
ncclResult_t result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should succeed");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Checking cost_table[TREE][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 8, "Should set 8 channels");
// Test broadcast matching (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Reset costs
}
}
result = pluginGetCollInfo(context, ncclFuncBroadcast, 16384, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should succeed");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Checking cost_table[RING][LL128] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128], cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 4, "Should set 4 channels");
// Clean up
pluginDestroy(context);
unlink("test_match.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 5: Size range matching
int test_size_matching() {
const char* test_config =
"allreduce,0,1024,tree,simple,2,-1,-1,-1,-1\n"
"allreduce,1025,65536,ring,simple,4,-1,-1,-1,-1\n"
"allreduce,65537,4294967295,ring,ll128,8,-1,-1,-1,-1\n";
create_test_config("test_size.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_size.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels = 1;
pluginGetCollInfo(context, ncclFuncAllReduce, 512, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Small message - checking cost_table[TREE][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Small: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Small: Should set 2 channels");
// Test medium message (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Medium message - checking cost_table[RING][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Medium: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Medium: Should set 4 channels");
// Test large message (should match third config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 1048576, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Large message - checking cost_table[RING][LL128] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128], cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Large: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Large: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_size.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 6: Topology matching
int test_topology_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,1,-1,-1,-1\n" // Single node only
"allreduce,0,65536,ring,simple,4,4,32,-1,-1\n" // 4 nodes, 32 ranks exactly
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any topology
create_test_config("test_topo.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_topo.conf", 1);
// Test with single node setup
void* context1 = NULL;
pluginInit(8, 1, mock_logger, &context1); // 8 ranks, 1 node
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
pluginGetCollInfo(context1, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Single node: Should match tree config");
TEST_ASSERT(nChannels == 2, "Single node: Should set 2 channels");
pluginDestroy(context1);
// Test with 4 nodes, 32 ranks setup
void* context2 = NULL;
pluginInit(32, 4, mock_logger, &context2); // 32 ranks, 4 nodes
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context2, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "4-node: Should match ring/simple config");
TEST_ASSERT(nChannels == 4, "4-node: Should set 4 channels");
// Clean up
pluginDestroy(context2);
unlink("test_topo.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 7: Default channels behavior (-1)
int test_default_channels() {
const char* test_config =
"allreduce,0,65536,tree,simple,-1,-1,-1,-1,-1\n"; // Use default channels
create_test_config("test_default.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_default.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels = 99; // Set to known value
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Should apply algorithm/protocol");
TEST_ASSERT(nChannels == 1, "Should keep default channels (1) when config has -1");
// Clean up
pluginDestroy(context);
unlink("test_default.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 8: regBuff matching
int test_regbuff_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,-1,-1,-1,1\n" // Registered buffers only
"allreduce,0,65536,ring,simple,4,-1,-1,-1,0\n" // Non-registered buffers only
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any buffer type (backward compatible)
create_test_config("test_regbuff.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_regbuff.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
}
int nChannels;
// Test registered buffer (should match first config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
1, &nChannels); // regBuff = 1 (registered)
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Registered buffer: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Registered buffer: Should set 2 channels");
// Test non-registered buffer (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels); // regBuff = 0 (non-registered)
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Non-registered buffer: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Non-registered buffer: Should set 4 channels");
// Test backward compatibility - config without regBuff should match any regBuff value
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
// First try with regBuff=2 (unusual value, should match third config)
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
2, &nChannels); // regBuff = 2 (only third config should match)
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Any regBuff: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Any regBuff: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_regbuff.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 9: numPipeOps matching
int test_pipeops_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,-1,-1,1,-1\n" // Single pipeline op
"allreduce,0,65536,ring,simple,4,-1,-1,4,-1\n" // Multiple pipeline ops
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any pipeline ops (backward compatible)
create_test_config("test_pipeops.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_pipeops.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
}
int nChannels;
// Test single pipeline op (should match first config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Single pipeOp: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Single pipeOp: Should set 2 channels");
// Test multiple pipeline ops (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 4,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Multiple pipeOps: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Multiple pipeOps: Should set 4 channels");
// Test different number of pipeline ops (should match third config - backward compatible)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 2,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Any pipeOps: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Any pipeOps: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_pipeops.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 10: No matching configuration (fallback behavior)
int test_no_match_fallback() {
const char* test_config =
"broadcast,0,1024,tree,simple,2,-1,-1,-1,-1\n"; // Only broadcast config
create_test_config("test_fallback.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_fallback.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
// Try allreduce (should not match, use fallback)
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Fallback test - checking cost_table[RING][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 1.0, "Cost table should pass through unmodified");
TEST_ASSERT(nChannels == 1, "Should use default channels");
// Clean up
pluginDestroy(context);
unlink("test_fallback.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 11: Large configuration files (testing dynamic allocation)
int test_large_config() {
const char* large_config_file = "test_large.conf";
// Create a large configuration file with many entries
// This tests the dynamic allocation functionality
FILE* f = fopen(large_config_file, "w");
TEST_ASSERT(f != NULL, "Should be able to create large config file");
// Write header comment
fprintf(f, "# Large configuration file for testing dynamic allocation\n");
fprintf(f, "# This file contains many configurations to test memory allocation\n");
// Generate a large number of configurations (much more than the old MAX_CONFIGS=100)
const int num_configs = 500; // 5x the old static limit
const char* collectives[] = {"allreduce", "broadcast", "reduce", "allgather", "reducescatter"};
const char* algorithms[] = {"tree", "ring", "collnet_direct", "nvls"};
const char* protocols[] = {"simple", "ll", "ll128"};
for (int i = 0; i < num_configs; i++) {
// Vary the configurations to create realistic test data
const char* coll = collectives[i % 5];
const char* algo = algorithms[i % 4];
const char* proto = protocols[i % 3];
size_t min_bytes = (i * 1024) % 1048576; // Vary from 0 to 1MB
size_t max_bytes = min_bytes + 65536; // 64KB range
int channels = (i % 8) + 1; // 1-8 channels
int nodes = (i % 4) == 0 ? -1 : (i % 4); // Mix of -1 and 1-3 nodes
int ranks = (i % 8) == 0 ? -1 : (i % 32) + 1; // Mix of -1 and 1-32 ranks
int pipeOps = (i % 3) == 0 ? -1 : (i % 4) + 1; // Mix of -1 and 1-4 pipeOps
int regBuff = (i % 3) == 0 ? -1 : (i % 2); // Mix of -1, 0, 1
fprintf(f, "%s,%zu,%zu,%s,%s,%d,%d,%d,%d,%d\n",
coll, min_bytes, max_bytes, algo, proto, channels, nodes, ranks, pipeOps, regBuff);
}
fclose(f);
// Set environment to use our large config file
setenv("NCCL_TUNER_CONFIG_FILE", large_config_file, 1);
// Initialize plugin with large config
void* context = NULL;
ncclResult_t result = pluginInit(16, 4, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init with large config should succeed");
TEST_ASSERT(context != NULL, "Context should be allocated");
// Verify that configurations were loaded
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == num_configs, "Should load all configurations from large file");
TEST_ASSERT(ctx->maxConfigs == num_configs, "maxConfigs should match allocated size");
TEST_ASSERT(ctx->configs != NULL, "Configs array should be dynamically allocated");
// Test that we can access configurations throughout the array
// (This would have failed with the old static MAX_CONFIGS=100 limit)
for (int i = 0; i < ctx->numConfigs; i++) {
TuningConfig* config = &ctx->configs[i];
// Basic sanity checks on the loaded configurations
TEST_ASSERT(config->collType >= ncclFuncBroadcast && config->collType <= ncclFuncAllReduce,
"Collective type should be valid");
TEST_ASSERT(config->maxBytes >= config->minBytes, "maxBytes should be >= minBytes");
TEST_ASSERT(config->nChannels > 0, "nChannels should be positive");
}
// Test specific configuration access at various indices
// Index 0 (first config)
TuningConfig* first_config = &ctx->configs[0];
TEST_ASSERT(first_config != NULL, "First config should be accessible");
// Index in middle
TuningConfig* mid_config = &ctx->configs[num_configs / 2];
TEST_ASSERT(mid_config != NULL, "Middle config should be accessible");
// Index near end (this would have crashed with static array of 100)
TuningConfig* late_config = &ctx->configs[num_configs - 1];
TEST_ASSERT(late_config != NULL, "Last config should be accessible");
// Test memory allocation size - verify we didn't over-allocate
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Successfully loaded %d configurations (dynamic allocation)", ctx->numConfigs);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Memory allocated for %d configurations (%zu bytes total)",
ctx->maxConfigs, ctx->maxConfigs * sizeof(TuningConfig));
// Test that the plugin can still find matching configurations from the large set
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Default high cost
}
}
int nChannels;
// Try to find a matching configuration - should work with large config set
result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with large config set");
// Clean up
pluginDestroy(context);
unlink(large_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 12: Very large configuration stress test
int test_very_large_config_stress() {
const char* stress_config_file = "test_stress.conf";
// Create an even larger configuration file to stress test the implementation
FILE* f = fopen(stress_config_file, "w");
TEST_ASSERT(f != NULL, "Should be able to create stress test config file");
fprintf(f, "# Stress test configuration with very large number of entries\n");
// Generate an extremely large number of configurations
const int stress_configs = 2000; // 20x the old static limit
for (int i = 0; i < stress_configs; i++) {
// Create varied but valid configurations
fprintf(f, "allreduce,%d,%d,ring,simple,4,-1,-1,-1,-1\n",
i * 512, (i * 512) + 1024);
}
fclose(f);
setenv("NCCL_TUNER_CONFIG_FILE", stress_config_file, 1);
// Test initialization with stress config
void* context = NULL;
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin should handle very large config files");
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == stress_configs, "Should load all stress test configurations");
TEST_ASSERT(ctx->configs != NULL, "Stress test configs should be allocated");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Stress test - loaded %d configurations successfully", stress_configs);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Memory usage: %zu bytes for configuration array",
stress_configs * sizeof(TuningConfig));
// Verify we can access configurations throughout the entire range
for (int i = 0; i < stress_configs; i += 100) { // Sample every 100th config
TuningConfig* config = &ctx->configs[i];
TEST_ASSERT(config->collType == ncclFuncAllReduce, "Config should have correct collective type");
TEST_ASSERT(config->minBytes == (size_t)(i * 512), "Config should have correct minBytes");
}
// Clean up
pluginDestroy(context);
unlink(stress_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 13: Edge case - empty config file
int test_empty_config() {
const char* empty_config_file = "test_empty.conf";
// Create empty config file (only comments)
create_test_config(empty_config_file,
"# Empty configuration file\n"
"# No actual configurations\n"
"\n"
"\n");
setenv("NCCL_TUNER_CONFIG_FILE", empty_config_file, 1);
void* context = NULL;
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin should handle empty config files");
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == 0, "Should have zero configurations");
TEST_ASSERT(ctx->maxConfigs == 0, "Should have zero max configurations");
TEST_ASSERT(ctx->configs == NULL, "Should not allocate memory for empty config");
// Test that plugin still works with no configurations (fallback behavior)
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with empty config");
// Clean up
pluginDestroy(context);
unlink(empty_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test runner function pointer type
typedef int (*TestFunction)(void);
// Test registry
typedef struct {
const char* name;
TestFunction func;
const char* description;
} TestCase;
// All available tests
TestCase test_cases[] = {
{"init", test_plugin_init, "Plugin initialization"},
{"config-valid", test_config_parsing_valid, "Valid configuration parsing"},
{"config-invalid", test_config_parsing_invalid, "Invalid configuration parsing"},
{"collective", test_collective_matching, "Collective type matching"},
{"size", test_size_matching, "Size range matching"},
{"topology", test_topology_matching, "Topology matching"},
{"channels", test_default_channels, "Default channels behavior"},
{"regbuff", test_regbuff_matching, "Registered buffer matching"},
{"pipeops", test_pipeops_matching, "Pipeline operations matching"},
{"fallback", test_no_match_fallback, "Fallback behavior"},
{"large-config", test_large_config, "Large configuration files (dynamic allocation)"},
{"stress-config", test_very_large_config_stress, "Very large configuration stress test"},
{"empty-config", test_empty_config, "Empty configuration file handling"},
{NULL, NULL, NULL} // End marker
};
// Show help/usage information
void show_help(const char* program_name) {
printf("Usage: %s [test_name ...]\n\n", program_name);
printf("Available tests:\n");
for (int i = 0; test_cases[i].name != NULL; i++) {
printf(" %-15s - %s\n", test_cases[i].name, test_cases[i].description);
}
printf("\nExamples:\n");
printf(" %s # Run all tests\n", program_name);
printf(" %s init # Run only initialization test\n", program_name);
printf(" %s init collective # Run initialization and collective tests\n", program_name);
printf(" %s --help # Show this help\n", program_name);
}
// Find test by name
TestFunction find_test(const char* name) {
for (int i = 0; test_cases[i].name != NULL; i++) {
if (strcmp(test_cases[i].name, name) == 0) {
return test_cases[i].func;
}
}
return NULL;
}
// Main test runner
int main(int argc, char* argv[]) {
int passed = 0, total = 0;
// Check for help
if (argc > 1 && (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-h") == 0)) {
show_help(argv[0]);
return 0;
}
printf("Running NCCL Tuner Plugin Unit Tests\n");
printf("=====================================\n");
if (argc == 1) {
// No arguments - run all tests
for (int i = 0; test_cases[i].name != NULL; i++) {
total++;
passed += test_cases[i].func();
}
} else {
// Run specific tests
for (int arg = 1; arg < argc; arg++) {
TestFunction test_func = find_test(argv[arg]);
if (test_func) {
total++;
passed += test_func();
} else {
printf("ERROR: Unknown test '%s'\n", argv[arg]);
printf("Use --help to see available tests\n");
return 1;
}
}
}
printf("\n=====================================\n");
printf("Test Results: %d/%d tests passed\n", passed, total);
if (passed == total) {
printf("All tests PASSED!\n");
return 0;
} else {
printf("Some tests FAILED!\n");
return 1;
}
}

View File

@ -17,6 +17,8 @@ PROFAPI ?= 1
NVTX ?= 1
RDMA_CORE ?= 0
NET_PROFILER ?= 0
MLX5DV ?= 0
MAX_EXT_NET_PLUGINS ?= 0
NVCC = $(CUDA_HOME)/bin/nvcc
@ -38,10 +40,12 @@ ifeq ($(shell test "0$(CUDA_MAJOR)" -lt 12; echo $$?),0)
CUDA8_GENCODE += -gencode=arch=compute_35,code=sm_35
endif
CUDA9_GENCODE = -gencode=arch=compute_70,code=sm_70
CUDA10_GENCODE = -gencode=arch=compute_75,code=sm_75
CUDA11_GENCODE = -gencode=arch=compute_80,code=sm_80
CUDA12_GENCODE = -gencode=arch=compute_90,code=sm_90
CUDA13_GENCODE = -gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120
CUDA12_8_GENCODE = -gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120
CUDA13_GENCODE = -gencode=arch=compute_110,code=sm_110
CUDA8_PTX = -gencode=arch=compute_61,code=compute_61
CUDA9_PTX = -gencode=arch=compute_70,code=compute_70
@ -49,10 +53,12 @@ CUDA11_PTX = -gencode=arch=compute_80,code=compute_80
CUDA12_PTX = -gencode=arch=compute_90,code=compute_90
CUDA13_PTX = -gencode=arch=compute_120,code=compute_120
ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -gt 12; echo $$?),0)
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
# Architectures prior to SM75 are deprecated from CUDA 13.0 onwards
NVCC_GENCODE ?= $(CUDA10_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_8_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8; echo $$?),0)
# Include Blackwell support if we're using CUDA12.8 or above
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_8_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 11 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -gt 11; echo $$?),0)
# Include Hopper support if we're using CUDA11.8 or above
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_PTX)
@ -66,14 +72,21 @@ else
endif
$(info NVCC_GENCODE is ${NVCC_GENCODE})
# CUDA 13.0 requires c++17
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
CXXSTD ?= -std=c++17
else
CXXSTD ?= -std=c++14
endif
CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \
-Wall -Wno-unused-function -Wno-sign-compare -std=c++11 -Wvla \
-I $(CUDA_INC) \
-Wall -Wno-unused-function -Wno-sign-compare $(CXXSTD) -Wvla \
-I $(CUDA_INC) -I $(CUDA_INC)/cccl \
$(CXXFLAGS)
# Maxrregcount needs to be set according to NCCL_MAX_NTHREADS (otherwise it will cause kernel launch errors)
# 512 : 120, 640 : 96, 768 : 80, 1024 : 60
# We would not have to set this if we used __launch_bounds__, but this only works on kernels, not on functions.
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) -std=c++11 --expt-extended-lambda -Xptxas -maxrregcount=96 -Xfatbin -compress-all
NVCUFLAGS := -ccbin $(CXX) $(NVCC_GENCODE) $(CXXSTD) --expt-extended-lambda -Xptxas -maxrregcount=96 -Xfatbin -compress-all
# Use addprefix so that we can specify more than one path
NVLDFLAGS := -L${CUDA_LIB} -lcudart -lrt
@ -136,9 +149,17 @@ CXXFLAGS += -DPROFAPI
endif
ifneq ($(RDMA_CORE), 0)
CXXFLAGS += -DNCCL_BUILD_RDMA_CORE=1
CXXFLAGS += -DNCCL_BUILD_RDMA_CORE=1 -libverbs
endif
ifneq ($(MLX5DV), 0)
CXXFLAGS += -DNCCL_BUILD_MLX5DV=1 -lmlx5
endif
ifneq ($(NET_PROFILER), 0)
CXXFLAGS += -DNCCL_ENABLE_NET_PROFILING=1
endif
ifneq ($(MAX_EXT_NET_PLUGINS), 0)
CXXFLAGS += -DNCCL_NET_MAX_PLUGINS=$(MAX_EXT_NET_PLUGINS)
endif

View File

@ -1,6 +1,6 @@
##### version
NCCL_MAJOR := 2
NCCL_MINOR := 26
NCCL_PATCH := 5
NCCL_MINOR := 27
NCCL_PATCH := 6
NCCL_SUFFIX :=
PKG_REVISION := 1

View File

@ -10,7 +10,7 @@ include ../makefiles/version.mk
INCEXPORTS := nccl.h
LIBSRCFILES := \
bootstrap.cc channel.cc collectives.cc debug.cc enqueue.cc group.cc \
init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc \
init.cc init_nvtx.cc proxy.cc transport.cc mnnvl.cc allocator.cc symmetric.cc \
$(wildcard graph/*.cc) \
$(wildcard misc/*.cc) \
$(wildcard transport/*.cc) \

src/allocator.cc Normal file
View File

@ -0,0 +1,196 @@
/*************************************************************************
* Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "comm.h"
#include "transport.h"
#include "group.h"
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc(void **ptr, size_t size) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
#if CUDART_VERSION >= 12010
size_t memGran = 0;
CUdevice currentDev;
CUmemAllocationProp memprop = {};
CUmemAccessDesc accessDesc = {};
CUmemGenericAllocationHandle handle = (CUmemGenericAllocationHandle)-1;
int cudaDev;
int flag;
int dcnt;
if (ptr == NULL || size == 0) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUDACHECK(cudaGetDevice(&cudaDev));
CUCHECK(cuDeviceGet(&currentDev, cudaDev));
if (ncclCuMemEnable()) {
size_t handleSize = size;
int requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
// Query device to see if FABRIC handle support is available
flag = 0;
(void) CUPFN(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, currentDev));
if (flag) requestedHandleTypes |= CU_MEM_HANDLE_TYPE_FABRIC;
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
memprop.location.id = currentDev;
// Query device to see if RDMA support is available
flag = 0;
CUCHECK(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, currentDev));
if (flag) memprop.allocFlags.gpuDirectRDMACapable = 1;
CUCHECK(cuMemGetAllocationGranularity(&memGran, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
CUDACHECK(cudaGetDeviceCount(&dcnt));
ALIGN_SIZE(handleSize, memGran);
if (requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) {
/* First try cuMemCreate() with FABRIC handle support and then remove if it fails */
CUresult err = CUPFN(cuMemCreate(&handle, handleSize, &memprop, 0));
if (err == CUDA_ERROR_NOT_PERMITTED || err == CUDA_ERROR_NOT_SUPPORTED) {
requestedHandleTypes &= ~CU_MEM_HANDLE_TYPE_FABRIC;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
} else if (err != CUDA_SUCCESS) {
// Catch and report any error from above
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
} else {
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
/* Reserve a virtual address range */
CUCHECK(cuMemAddressReserve((CUdeviceptr*)ptr, handleSize, memGran, 0, 0));
/* Map the virtual address range to the physical allocation */
CUCHECK(cuMemMap((CUdeviceptr)*ptr, handleSize, 0, handle, 0));
/* Now allow RW access to the newly mapped memory */
for (int i = 0; i < dcnt; ++i) {
int p2p = 0;
if (i == cudaDev || ((cudaDeviceCanAccessPeer(&p2p, i, cudaDev) == cudaSuccess) && p2p)) {
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = i;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CUCHECK(cuMemSetAccess((CUdeviceptr)*ptr, handleSize, &accessDesc, 1));
}
if (0 == p2p && i != cudaDev) INFO(NCCL_ALLOC, "P2P not supported between GPU%d and GPU%d", cudaDev, i);
}
goto exit;
}
fallback:
#endif
// Coverity is right to complain that we may pass a NULL ptr to cudaMalloc. That's deliberate though:
// we want CUDA to return an error to the caller.
// coverity[var_deref_model]
CUDACHECKGOTO(cudaMalloc(ptr, size), ret, fail);
exit:
return ret;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree(void *ptr) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
int saveDevice;
CUDACHECK(cudaGetDevice(&saveDevice));
#if CUDART_VERSION >= 12010
CUdevice ptrDev = 0;
if (ptr == NULL) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUCHECKGOTO(cuPointerGetAttribute((void*)&ptrDev, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, (CUdeviceptr)ptr), ret, fail);
CUDACHECKGOTO(cudaSetDevice((int)ptrDev), ret, fail);
if (ncclCuMemEnable()) {
NCCLCHECKGOTO(ncclCuMemFree(ptr), ret, fail);
goto exit;
}
fallback:
#endif
CUDACHECKGOTO(cudaFree(ptr), ret, fail);
exit:
CUDACHECK(cudaSetDevice(saveDevice));
return ret;
fail:
goto exit;
}
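For orientation, a minimal, hedged usage sketch of this allocation/free pair from application code. It assumes only the public ncclMemAlloc/ncclMemFree/ncclGetErrorString declarations in nccl.h; error handling is kept minimal.
// Hedged sketch: allocate a NCCL-managed buffer, use it, then free it.
#include <nccl.h>
#include <stdio.h>
static int demoAllocFree(size_t bytes) {
  void* buf = NULL;
  ncclResult_t res = ncclMemAlloc(&buf, bytes);   // takes the cuMem path above when available, cudaMalloc fallback otherwise
  if (res != ncclSuccess) { fprintf(stderr, "ncclMemAlloc: %s\n", ncclGetErrorString(res)); return 1; }
  // ... use buf as a send/recv buffer, optionally register it with ncclCommRegister() ...
  res = ncclMemFree(buf);                         // routes to ncclCuMemFree() or cudaFree(), matching the allocation
  if (res != ncclSuccess) { fprintf(stderr, "ncclMemFree: %s\n", ncclGetErrorString(res)); return 1; }
  return 0;
}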
// This is a collective function and should be called by all ranks in the communicator
ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr) {
ncclResult_t ret = ncclSuccess;
void* regSymAddr = NULL;
size_t allocSize = size;
size_t granularity;
CUdevice cuDev;
CUmemAllocationProp memprop = {};
CUmemGenericAllocationHandle memHandle;
int bit = 0, cnt = 0;
// alignment must be a power of 2
while (bit < sizeof(size_t) * 8) {
if (alignment & (1L << bit)) cnt++;
if (cnt == 2) {
WARN("rank %d alignment %ld is not power of 2", comm->rank, alignment);
goto fail;
}
bit++;
}
// temporarily align the alignment to NCCL_REC_PAGE_SIZE
ALIGN_SIZE(alignment, NCCL_REC_PAGE_SIZE);
CUCHECKGOTO(cuDeviceGet(&cuDev, comm->cudaDev), ret, fail);
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = ncclCuMemHandleType;
memprop.location.id = cuDev;
CUCHECKGOTO(cuMemGetAllocationGranularity(&granularity, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED), ret, fail);
ALIGN_SIZE(allocSize, granularity);
CUCHECKGOTO(cuMemCreate(&memHandle, allocSize, &memprop, 0), ret, fail);
ALIGN_SIZE(comm->symAllocHead, alignment);
NCCLCHECKGOTO(ncclIpcSymmetricMap(comm, comm->symAllocHead, allocSize, memHandle, &regSymAddr), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricMap(comm, comm->symAllocHead, allocSize, regSymAddr), ret, fail);
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
comm->symAllocHead += allocSize;
*symPtr = regSymAddr;
exit:
return ret;
fail:
*symPtr = NULL;
goto exit;
}
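The bit-counting loop above rejects any alignment with more than one bit set; an equivalent, more common idiom is shown below as a hedged standalone sketch.
// Hedged sketch of the power-of-2 check performed by the bit-counting loop above:
// a nonzero value is a power of two exactly when it has a single set bit.
#include <cstddef>
static inline bool isPowerOfTwo(size_t x) {
  return x != 0 && (x & (x - 1)) == 0;
}
// Example: isPowerOfTwo(4096) -> true; isPowerOfTwo(3072) -> false.
// (The loop above additionally lets 0 pass and then rounds the alignment up to NCCL_REC_PAGE_SIZE.)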
ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr) {
CUmemGenericAllocationHandle handle;
size_t size = 0;
ncclResult_t ret = ncclSuccess;
int saveDev = comm->cudaDev;
CUDACHECKGOTO(cudaGetDevice(&saveDev), ret, fail);
if (ncclCuMemEnable()) {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
CUCHECKGOTO(cuMemRetainAllocationHandle(&handle, symPtr), ret, fail);
CUCHECKGOTO(cuMemRelease(handle), ret, fail);
CUCHECKGOTO(cuMemGetAddressRange(NULL, &size, (CUdeviceptr)symPtr), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricFree(comm, size, symPtr), ret, fail);
NCCLCHECKGOTO(ncclIpcSymmetricFree(comm, size, symPtr), ret, fail);
CUCHECKGOTO(cuMemRelease(handle), ret, fail);
}
exit:
CUDACHECK(cudaSetDevice(saveDev));
return ret;
fail:
goto exit;
}

View File

@ -94,6 +94,7 @@ ncclResult_t bootstrapNetInit() {
pthread_mutex_lock(&bootstrapNetLock);
if (bootstrapNetInitDone == 0) {
const char* env = ncclGetEnv("NCCL_COMM_ID");
int nIfs = 0;
if (env) {
union ncclSocketAddress remoteAddr;
if (ncclSocketGetAddrFromString(&remoteAddr, env) != ncclSuccess) {
@ -101,13 +102,15 @@ ncclResult_t bootstrapNetInit() {
pthread_mutex_unlock(&bootstrapNetLock);
return ncclInvalidArgument;
}
if (ncclFindInterfaceMatchSubnet(bootstrapNetIfName, &bootstrapNetIfAddr, &remoteAddr, MAX_IF_NAME_SIZE, 1) <= 0) {
NCCLCHECK(ncclFindInterfaceMatchSubnet(bootstrapNetIfName, &bootstrapNetIfAddr, &remoteAddr, MAX_IF_NAME_SIZE,
&nIfs));
if (nIfs <= 0) {
WARN("NET/Socket : No usable listening interface found");
pthread_mutex_unlock(&bootstrapNetLock);
return ncclSystemError;
}
} else {
int nIfs = ncclFindInterfaces(bootstrapNetIfName, &bootstrapNetIfAddr, MAX_IF_NAME_SIZE, 1);
NCCLCHECK(ncclFindInterfaces(bootstrapNetIfName, &bootstrapNetIfAddr, MAX_IF_NAME_SIZE, 1, &nIfs));
if (nIfs <= 0) {
WARN("Bootstrap : no socket interface found");
pthread_mutex_unlock(&bootstrapNetLock);
@ -828,7 +831,7 @@ ncclResult_t bootstrapSplit(uint64_t magic, struct ncclComm* comm, struct ncclCo
NCCLCHECKGOTO(ncclCalloc(&state->peerP2pAddresses, nranks), ret, fail);
memcpy(state->peerP2pAddresses + rank, &peerSocketAddress, sizeof(union ncclSocketAddress));
if (parent->config.splitShare) {
if (parent->shareResources) {
/* map local rank to top parent local rank. */
for (int i = 0; i < nranks; ++i) {
comm->topParentRanks[i] = parent->topParentRanks[parentRanks[i]];

View File

@ -147,7 +147,7 @@ ncclResult_t initCollnetChannel(struct ncclComm* comm, int channelId, struct ncc
ncclResult_t freeChannel(struct ncclChannel* channel, int nRanks, int collnetNRanks, int nvlsNRanks) {
int nPeers = nRanks + collnetNRanks + nvlsNRanks;
/* channel peers are only valid when async init thread completes commAlloc() and
* the channel is intialized with initChannel(); if either is not done, this channel
* the channel is initialized with initChannel(); if either is not done, this channel
* should never be free. */
if (channel->id == -1 || channel->peers == NULL) return ncclSuccess;

View File

@ -16,6 +16,8 @@
#include <chrono>
#include "param.h"
#define NCCL_DEBUG_RESET_TRIGGERED (-2)
int ncclDebugLevel = -1;
static uint32_t ncclDebugTimestampLevels = 0; // bitmaps of levels that have timestamps turned on
static char ncclDebugTimestampFormat[256]; // with space for subseconds
@ -26,7 +28,7 @@ static int pid = -1;
static char hostname[1024];
thread_local int ncclDebugNoWarn = 0;
char ncclLastError[1024] = ""; // Global string for the last error in human readable form
static uint64_t ncclDebugMask = NCCL_INIT | NCCL_BOOTSTRAP | NCCL_ENV; // Default debug sub-system mask is INIT and ENV
static uint64_t ncclDebugMask = 0;
FILE *ncclDebugFile = stdout;
static pthread_mutex_t ncclDebugLock = PTHREAD_MUTEX_INITIALIZER;
static std::chrono::steady_clock::time_point ncclEpoch;
@ -34,11 +36,16 @@ static bool ncclWarnSetDebugInfo = false;
static __thread int tid = -1;
// This function must be called with ncclDebugLock locked!
static void ncclDebugInit() {
pthread_mutex_lock(&ncclDebugLock);
if (ncclDebugLevel != -1) { pthread_mutex_unlock(&ncclDebugLock); return; }
const char* nccl_debug = ncclGetEnv("NCCL_DEBUG");
int tempNcclDebugLevel = -1;
uint64_t tempNcclDebugMask = NCCL_INIT | NCCL_BOOTSTRAP | NCCL_ENV; // Default debug sub-system mask
if (ncclDebugLevel == NCCL_DEBUG_RESET_TRIGGERED && ncclDebugFile != stdout) {
// Finish the reset initiated via ncclResetDebugInit().
fclose(ncclDebugFile);
ncclDebugFile = stdout;
}
if (nccl_debug == NULL) {
tempNcclDebugLevel = NCCL_LOG_NONE;
} else if (strcasecmp(nccl_debug, "VERSION") == 0) {
@ -61,7 +68,7 @@ static void ncclDebugInit() {
if (ncclDebugSubsysEnv != NULL) {
int invert = 0;
if (ncclDebugSubsysEnv[0] == '^') { invert = 1; ncclDebugSubsysEnv++; }
ncclDebugMask = invert ? ~0ULL : 0ULL;
tempNcclDebugMask = invert ? ~0ULL : 0ULL;
char *ncclDebugSubsys = strdup(ncclDebugSubsysEnv);
char *subsys = strtok(ncclDebugSubsys, ",");
while (subsys != NULL) {
@ -102,7 +109,7 @@ static void ncclDebugInit() {
mask = NCCL_ALL;
}
if (mask) {
if (invert) ncclDebugMask &= ~mask; else ncclDebugMask |= mask;
if (invert) tempNcclDebugMask &= ~mask; else tempNcclDebugMask |= mask;
}
subsys = strtok(NULL, ",");
}
@ -246,15 +253,15 @@ static void ncclDebugInit() {
if (debugFn[0] != '\0') {
FILE *file = fopen(debugFn, "w");
if (file != nullptr) {
setbuf(file, nullptr); // disable buffering
setlinebuf(file); // disable block buffering
ncclDebugFile = file;
}
}
}
ncclEpoch = std::chrono::steady_clock::now();
ncclDebugMask = tempNcclDebugMask;
__atomic_store_n(&ncclDebugLevel, tempNcclDebugLevel, __ATOMIC_RELEASE);
pthread_mutex_unlock(&ncclDebugLock);
}
/* Common logging function used by the INFO, WARN and TRACE macros
@ -262,19 +269,38 @@ static void ncclDebugInit() {
* they can share the debugging mechanisms and output files
*/
void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *filefunc, int line, const char *fmt, ...) {
if (__atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE) == -1) ncclDebugInit();
bool locked = false; // Keeps track of the ncclDebugLock state.
int gotLevel = __atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE);
if (ncclDebugNoWarn != 0 && level == NCCL_LOG_WARN) { level = NCCL_LOG_INFO; flags = ncclDebugNoWarn; }
// Save the last error (WARN) as a human readable string
if (level == NCCL_LOG_WARN) {
pthread_mutex_lock(&ncclDebugLock);
locked = true;
va_list vargs;
va_start(vargs, fmt);
(void) vsnprintf(ncclLastError, sizeof(ncclLastError), fmt, vargs);
va_end(vargs);
pthread_mutex_unlock(&ncclDebugLock);
}
if (ncclDebugLevel < level || ((flags & ncclDebugMask) == 0)) return;
if (gotLevel >= 0 && (gotLevel < level || (flags & ncclDebugMask) == 0)) {
if (locked)
pthread_mutex_unlock(&ncclDebugLock);
return;
}
if (!locked) {
pthread_mutex_lock(&ncclDebugLock);
locked = true;
}
// From this point on ncclDebugLock is always locked so we don't need to check "locked" anymore.
if (ncclDebugLevel < 0)
ncclDebugInit();
if (ncclDebugLevel < level || ((flags & ncclDebugMask) == 0)) {
pthread_mutex_unlock(&ncclDebugLock);
return;
}
if (tid == -1) {
tid = syscall(SYS_gettid);
@ -335,7 +361,7 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
// Add level specific formatting.
if (level == NCCL_LOG_WARN) {
len += snprintf(buffer+len, sizeof(buffer)-len, "[%d] %s:%d NCCL WARN ", cudaDev, filefunc, line);
if (ncclWarnSetDebugInfo) ncclDebugLevel = NCCL_LOG_INFO;
if (ncclWarnSetDebugInfo) __atomic_store_n(&ncclDebugLevel, NCCL_LOG_INFO, __ATOMIC_RELEASE);
} else if (level == NCCL_LOG_INFO) {
len += snprintf(buffer+len, sizeof(buffer)-len, "[%d] NCCL INFO ", cudaDev);
} else if (level == NCCL_LOG_TRACE && flags == NCCL_CALL) {
@ -360,19 +386,17 @@ void ncclDebugLog(ncclDebugLogLevel level, unsigned long flags, const char *file
// necessary since we write bytes instead of the string.
buffer[len++] = '\n';
fwrite(buffer, 1, len, ncclDebugFile);
pthread_mutex_unlock(&ncclDebugLock);
}
NCCL_API(void, ncclResetDebugInit);
void ncclResetDebugInit() {
// Cleans up from a previous ncclDebugInit() and reruns.
// Use this after changing NCCL_DEBUG and related parameters in the environment.
__atomic_load_n(&ncclDebugLevel, __ATOMIC_ACQUIRE);
if (ncclDebugFile != stdout) {
fclose(ncclDebugFile);
ncclDebugFile = stdout;
}
ncclDebugLevel = -1;
ncclDebugInit();
pthread_mutex_lock(&ncclDebugLock);
// Let ncclDebugInit() know to complete the reset.
__atomic_store_n(&ncclDebugLevel, NCCL_DEBUG_RESET_TRIGGERED, __ATOMIC_RELEASE);
pthread_mutex_unlock(&ncclDebugLock);
}
NCCL_PARAM(SetThreadName, "SET_THREAD_NAME", 0);
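The rework above replaces eager re-initialization with a sentinel level value (NCCL_DEBUG_RESET_TRIGGERED) that the next locked ncclDebugInit() call completes. Below is a reduced, hedged sketch of that lock-plus-atomic lazy-init pattern; all names are illustrative placeholders, not NCCL identifiers.
// Hedged sketch: sentinel level value, atomic fast path, mutex-protected slow path.
#include <pthread.h>
#define LEVEL_UNINIT (-1)
#define LEVEL_RESET  (-2)   // set by the reset entry point; init completes it later
static int logLevel = LEVEL_UNINIT;
static pthread_mutex_t logLock = PTHREAD_MUTEX_INITIALIZER;
static void logInitLocked(void) {               // caller must hold logLock
  int level = /* parse environment, (re)open log file, ... */ 0;
  __atomic_store_n(&logLevel, level, __ATOMIC_RELEASE);
}
void logMessage(int msgLevel) {
  int level = __atomic_load_n(&logLevel, __ATOMIC_ACQUIRE);
  if (level >= 0 && level < msgLevel) return;   // fast path: initialized and filtered out
  pthread_mutex_lock(&logLock);
  if (__atomic_load_n(&logLevel, __ATOMIC_ACQUIRE) < 0) logInitLocked();
  // ... format and write the message while holding the lock ...
  pthread_mutex_unlock(&logLock);
}
void logReset(void) {
  pthread_mutex_lock(&logLock);
  __atomic_store_n(&logLevel, LEVEL_RESET, __ATOMIC_RELEASE);  // next logMessage() re-runs init
  pthread_mutex_unlock(&logLock);
}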

View File

@ -23,6 +23,9 @@ INCFLAGS = -I. -I.. -I$(BUILDDIR)/include -I../include
NVCUFLAGS += $(INCFLAGS) --compiler-options "-fPIC -fvisibility=hidden"
CXXFLAGS += $(INCFLAGS)
NVCUFLAGS_SYM := -ccbin $(CXX) $(CXXSTD) --expt-extended-lambda -Xptxas -maxrregcount=128 -Xfatbin -compress-all
NVCUFLAGS_SYM += $(INCFLAGS) --compiler-options "-fPIC -fvisibility=hidden"
SAY = @bash -c 'path="$$2"; [[ "$$(realpath "$$2")" =~ ^$(subst .,\.,$(abspath $(NCCLDIR)))/(.*)$$ ]] && path="$${BASH_REMATCH[1]}"; printf "%-15s %s\n" "$$1" "$$path"' SAY
COMPILE.cu = $(NVCC) $(NVCUFLAGS) -dc $2 -o $1
@ -30,7 +33,21 @@ COMPILE.cc = $(CXX) $(CXXFLAGS) -c $2 -o $1
define COMPILE
@$(SAY) "Compiling" $2;\
mkdir -p $(dir $1);\
$(call COMPILE$(suffix $2),$1,$2)
$(call COMPILE$(or $3,$(suffix $2)),$1,$2)
endef
ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12090))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100f,code=sm_100f
else ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12070))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100a,code=sm_100a
else
NVCC_GENCODE_LDMC_FP8 =
endif
define COMPILE_SYM
@$(SAY) "Compiling" $2;\
mkdir -p $(dir $1);\
$(NVCC) $(NVCUFLAGS_SYM) $3 -dw $2 -o $1
endef
DEPENDS.cu = $(NVCC) $(NVCUFLAGS) -M -dc $1
@ -48,8 +65,6 @@ endef
all: $(MANIFEST)
ifeq (1,1)
# Case if the <gensrc> directory is generated on-demand:
$(OBJDIR)/gensrc: generate.py
@mkdir -p $@
(which python3 >/dev/null || \
@ -57,22 +72,26 @@ $(OBJDIR)/gensrc: generate.py
printf "\n$${bar}\nERROR: Building NCCL requires a Python 3 installation invokable as 'python3'.\n$${bar}\n\n" 1>&2; \
exit 1)) \
&& ./generate.py $@ "$(ONLY_FUNCS)"
else
# Case if the <gensrc> directory is pre-generated and checked in the repo as ./gen:
$(OBJDIR)/gensrc:
@mkdir -p $(OBJDIR); ln -srfn ./gen $@
endif
$(OBJDIR)/gensrc/symmetric: $(OBJDIR)/gensrc symmetric/generate.py
@mkdir -p $@
./symmetric/generate.py $@
# The trailing ";" is necessary to make this an "empty recipe":
# https://www.gnu.org/software/make/manual/html_node/Empty-Recipes.html
$(OBJDIR)/gensrc/rules.mk: $(OBJDIR)/gensrc ;
$(OBJDIR)/gensrc/symmetric/rules.mk: $(OBJDIR)/gensrc/symmetric ;
-include $(OBJDIR)/gensrc/rules.mk
# "gensrc/rules.mk" populates $(LIB_OBJS_GEN)
-include $(OBJDIR)/gensrc/symmetric/rules.mk
# "gensrc/symmetric/rules.mk" populates $(LIB_OBJS_SYM_GEN)
SRCS = common.cu onerank.cu
LIB_OBJS = $(patsubst %, $(OBJDIR)/%.o, $(SRCS)) $(LIB_OBJS_GEN)
LIB_OBJS = $(patsubst %, $(OBJDIR)/%.o, $(SRCS)) $(LIB_OBJS_GEN) $(LIB_OBJS_SYM_GEN)
$(OBJDIR)/%.o: % $(OBJDIR)/%.d
$(call COMPILE,$@,$<)
@ -80,12 +99,18 @@ $(OBJDIR)/%.o: % $(OBJDIR)/%.d
$(OBJDIR)/genobj/%.o: $(OBJDIR)/gensrc $(OBJDIR)/genobj/%.d
$(call COMPILE,$@,$(OBJDIR)/gensrc/$*)
$(OBJDIR)/genobj/symmetric/%.o: $(OBJDIR)/gensrc/symmetric $(OBJDIR)/genobj/symmetric/%.d
$(call COMPILE,$@,$(OBJDIR)/gensrc/symmetric/$*)
$(OBJDIR)/%.d: %
$(call DEPENDS,$@,$<)
$(OBJDIR)/genobj/%.d: $(OBJDIR)/gensrc/%
$(call DEPENDS,$@,$<)
$(OBJDIR)/genobj/symmetric/%.d: $(OBJDIR)/gensrc/symmetric/%
$(call DEPENDS,$@,$<)
$(DEVGLUE_OBJ): $(LIB_OBJS)
$(NVCC) $(NVCUFLAGS) -dlink $^ -o $@
@ -94,6 +119,7 @@ $(MANIFEST): $(LIB_OBJS) $(DEVGLUE_OBJ)
-include $(wildcard $(OBJDIR)/*.d)
-include $(wildcard $(OBJDIR)/genobj/*.d)
-include $(wildcard $(OBJDIR)/genobj/symmetric/*.d)
.PHONY: clean
clean:

View File

@ -173,73 +173,221 @@ struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_PAT, NCCL_PROTO_SIMPLE
template<typename T, typename RedOp>
struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_NVLS, NCCL_PROTO_SIMPLE> {
template<bool BcastSendNotRecv>
struct Scatterer {
struct ncclDevWorkColl* work;
ssize_t chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
) {
static_assert(SlicePerChunk==1, "require: SlicePerChunk==1");
static_assert(MaxDsts<=1 || MaxSrcs<=1, "require: MaxDsts<=1 || MaxSrcs<=1");
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
int nNodes = ncclShmem.comm.nNodes;
int nRails = nvls->nHeads;
int part = ncclShmem.channelId - work->channelLo;
char* inbuf = (char*)work->sendbuff;
char* outbuf = (char*)work->recvbuff;
ssize_t countPerRank = work->collnet.count;
bool inPlace = (inbuf == outbuf + ncclShmem.comm.rank * countPerRank);
ssize_t railAllBeg = min(railGridOffset + part * chunkSize, nNodes * countPerRank);
ssize_t railAllEnd = min(railAllBeg + chunkSize, nNodes * countPerRank);
int railAllSize = railAllEnd - railAllBeg;
int rail = 0;
int src = 0;
if (BcastSendNotRecv) {
rail = nvls->headRank;
} else {
if (work->regUsed) return;
rail = 0;
}
if (tid < nDsts) dstSizes[tid] = railAllSize;
do {
int node = railAllBeg / countPerRank;
int railAllOffset = 0;
while (railAllOffset < railAllSize) {
ssize_t railOneBeg = node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t railOneOffset = (railAllBeg + railAllOffset) - railOneBeg;
int delta = min(railAllEnd, railOneEnd) - (railAllBeg + railAllOffset);
int rank = ncclShmem.comm.collNetDenseToUserRank[node * nRails + rail];
ssize_t userOneBeg = rank * countPerRank + railOneOffset;
int outIsDst = (inPlace && rank == ncclShmem.comm.rank) || BcastSendNotRecv || work->regUsed ? 0 : 1;
if (nSrcs != 0 && outIsDst + nDsts != 0) {
reduceCopy<ncclCollUnroll(), RedOp, T,
/*MultimemSrcs,MinSrcs,MaxSrcs=*/MultimemSrcs, 1, 1,
/*MultimemDsts=*/MultimemDsts, 0 + MultimemDsts + MinDsts, 1 + MaxDsts,
/*PreOpSrcs=*/0>
(tid, tn, 0, nullptr, false,
/*nSrcs=*/1, [=]__device__(int s/*==0*/) -> void* {
return (char*)srcPtrs[src] + railAllOffset;
},
/*nDsts=*/outIsDst + nDsts, [=]__device__(int d) -> void* {
return d < outIsDst ? outbuf + userOneBeg
: work->regUsed ? (char*)dstPtrs[d - outIsDst] + userOneBeg
: (char*)dstPtrs[d - outIsDst] + railAllOffset;
}, delta);
}
railAllOffset += delta;
node += 1;
}
rail += 1;
src += 1;
} while (!BcastSendNotRecv && src < nRails);
}
};
__device__ __forceinline__ void run(int tid, int/*nthreads*/, struct ncclDevWorkColl* work) {
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
const ssize_t rank = ncclShmem.comm.rank;
size_t count, gridOffset, channelCount;
size_t chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
size_t offset;
int nelem;
const int nThreadsBcast = work->regUsed ? (NCCL_MAX_NTHREADS - WARP_SIZE) : 4 * WARP_SIZE;
const int nThreadsGather = work->regUsed ? WARP_SIZE : NCCL_MAX_NTHREADS - nThreadsBcast;
const int tidEndGather = nThreadsGather;
const int tidEndBcast = tidEndGather + nThreadsBcast;
const int nThreadsNetSend = work->oneNode ? 0 : (work->netRegUsed ? WARP_SIZE : 6 * WARP_SIZE);
const int nThreadsGather = work->regUsed ? roundUp(nvls->nHeads << 2, WARP_SIZE) : 8 * WARP_SIZE;
const int nThreadsBcast = NCCL_MAX_NTHREADS - nThreadsNetSend - nThreadsGather;
if (!work->regUsed) {
if (tid < tidEndGather) {
// Gather
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, NULL, NULL, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.gather(offset, nvls->nHeads * count, nelem, count, -1, 0);
const int tidEndGather = nThreadsGather;
const int tidEndNetSend = tidEndGather + nThreadsNetSend;
const int tidEndBcast = tidEndNetSend + nThreadsBcast;
if (work->oneNode) {
const ssize_t rank = ncclShmem.comm.rank;
size_t count, gridOffset, channelCount, offset, chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
if (!work->regUsed) {
if (tid < tidEndGather) {
// Gather
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, NULL, NULL, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.gather(offset, nvls->nHeads * count, nelem, count, -1, 0);
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndBcast) {
// Bcast through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, NULL, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.send(offset, nelem);
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndBcast) {
// Bcast through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, NULL, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.send(offset, nelem);
} else {
if (tid < tidEndGather) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
/* used as sync */
prims.scatter(0, 0, 0, 0, -1, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.gather(0, 0, 0, 0, -1, 0);
}
} else if (tid < tidEndBcast) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, &nvls->down, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 1 * Proto::MaxGroupWidth, 0, 0, work);
/* used as sync */
prims.recv(0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
ssize_t inpOffset = gridOffset + elemOffset;
ssize_t outOffset = inpOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(inpOffset, outOffset, nelem);
}
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
}
} else {
/* direct allgather */
// NVLS + IB SHARP
int nNodes = ncclShmem.comm.nNodes;
int part = ncclShmem.channelId - work->channelLo;
ssize_t countPerRank = work->collnet.count;
const int nChannels = work->channelHi - work->channelLo + 1;
ssize_t chunkCount = work->collnet.chunkCount;
if (tid < tidEndGather) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
/* used as sync */
prims.scatter(0, 0, 0, 0, -1, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.gather(0, 0, 0, 0, -1, 0);
Primitives<T, RedOp, FanAsymmetric<NCCL_MAX_NVLS_ARITY, 0>, /*Direct=*/1, Proto, 0>
prims(tid, nThreadsGather, nvls->up, nullptr, nullptr, work->recvbuff,
/*redOpArg=*/0, 1 * Proto::MaxGroupWidth, 1, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/false> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/0>(scat);
}
} else if (tid < tidEndBcast) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndGather, nThreadsBcast, &nvls->down, &nvls->down, work->sendbuff, NULL,
work->redOpArg, 1 * Proto::MaxGroupWidth, 0, 0, work);
/* used as sync */
prims.recv(0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
ssize_t inpOffset = gridOffset + elemOffset;
ssize_t outOffset = inpOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(inpOffset, outOffset, nelem);
} else {
if (work->netRegUsed) {
using ProtoSend = ProtoSimple<1, 1, COLL_UNROLL>;
using ProtoBcast = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
int maxSteps = (int)divUp(nNodes * countPerRank, nChannels * chunkCount);
int curSteps = -1;
int postThread = tid - tidEndGather == 0 ? 1 : 0;
// for UB, we need to control the send speed to avoid net congestion.
// first unroll 2 steps, then unroll the remaining steps as the data is received.
if (postThread) {
curSteps = min(2, maxSteps);
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/1, ProtoSend, 0>::sendPeerNotify(nvls->out, 1, curSteps);
}
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, ProtoBcast, 0>
prims(tid - tidEndGather, nThreadsNetSend + nThreadsBcast, &nvls->out, &nvls->down, nullptr, nullptr,
/*redOpArg=*/0, 2 * ProtoBcast::MaxGroupWidth, 0, 0, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
if (postThread && curSteps < maxSteps) {
curSteps++;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/1, ProtoSend, 0>::sendPeerNotify(nvls->out, 1, 1);
}
}
} else {
if (tid < tidEndNetSend) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, 1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndGather, nThreadsNetSend, nullptr, &nvls->out, work->sendbuff, nullptr,
/*redOpArg=*/0, 0 * Proto::MaxGroupWidth, 1, 1);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
ssize_t railAllBeg = railGridOffset + part * chunkCount;
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
ssize_t railOneBeg = ncclShmem.comm.node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t beg = max(railAllBeg, railOneBeg);
ssize_t end = min(railAllEnd, railOneEnd);
prims.send(beg - railOneBeg, max(ssize_t(0), end - beg));
}
} else {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 0, 1>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndNetSend, nThreadsBcast, &nvls->out, &nvls->down, nullptr, nullptr,
/*redOpArg=*/0, 2 * Proto::MaxGroupWidth, 0, 0);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*BcastSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkSize = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
}
}
}
}
}
@ -254,7 +402,7 @@ struct RunWorkColl<ncclFuncAllGather, T, RedOp, NCCL_ALGO_COLLNET_DIRECT, NCCL_P
ssize_t chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts>
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag

View File

@ -106,7 +106,7 @@ namespace {
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.directSend(offset, nelem);
prims.directSend(offset, offset, nelem);
}
}
else {

View File

@ -52,7 +52,6 @@ struct ncclShmemData {
uint16_t funcId;
int nWorks;
int workSize;
uint32_t workConsumed;
uint64_t workCounter;
bool profilerEnabled;
struct ncclShmemGroup groups[NCCL_MAX_GROUPS];
@ -182,7 +181,6 @@ __device__ __forceinline__ void loadWorkBatchToShmem(
}
if (tid == 0) {
ncclShmem.workSize = workSize;
ncclShmem.workConsumed = batch.offsetBase + (64-__clzll(batch.offsetBitset))*workSize;
}
// We deliberately replicate these div and mod calculations into the case
// blocks above so that they get constant divisor optimizations by the compiler.
@ -242,6 +240,12 @@ __device__ __forceinline__ void loadWorkBatchToShmem(
}
}
__device__ __forceinline__ unsigned long long int globaltimer() {
unsigned long long int timer;
asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(timer));
return timer;
}
template<ncclFunc_t Fn, typename T, typename RedOp, int Algo, int Proto>
struct RunWorkColl {
__device__ void run(int tid, int tn, struct ncclDevWorkColl* work) {
@ -296,40 +300,30 @@ struct RunWorkBatch {
#define STOP 1
#define FINI 2
__device__ __forceinline__ bool profilerEnabled(void) {
// Check if any of the workItems in the batch is profiled. If so, there is an equivalent
// profiler ProxyOp waiting for the counter update in the host thread. If this check was
// done only for the first workItem the profiler counter for other workItems in the batch
// could never be updated, leaving the host thread spinning forever for the counter update
// and causing a hang.
bool enabled = false;
for (int i = 0; i < ncclShmem.nWorks && !enabled; i++) {
if (ncclShmem.workType == ncclDevWorkTypeP2p)
enabled = ((struct ncclDevWorkP2p*)ncclShmem.workStorage)[i].profilerEnabled;
else
enabled = ((struct ncclDevWorkColl*)ncclShmem.workStorage)[i].profilerEnabled;
}
return enabled;
__device__ __forceinline__ bool profilerEnabled(int workItemIdx) {
return (ncclShmem.workType == ncclDevWorkTypeP2p) ?
((struct ncclDevWorkP2p*)ncclShmem.workStorage)[workItemIdx].profilerEnabled :
((struct ncclDevWorkColl*)ncclShmem.workStorage)[workItemIdx].profilerEnabled;
}
__device__ __forceinline__ void profiler(int action) {
if (action == START) {
if (threadIdx.x == 0) {
// increment workCounter regardless of the profiler being active or not
if (threadIdx.x == 0) {
int idx = 0;
uint64_t wc = ncclShmem.channel.workCounter + 1;
if (action == START) {
for (; wc <= ncclShmem.channel.workCounter + ncclShmem.nWorks; wc++) {
if (!profilerEnabled(idx++)) continue;
ncclShmem.comm.workStarted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp = globaltimer();
ncclShmem.comm.workStarted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].counter = wc;
}
} else {
for (; wc <= ncclShmem.channel.workCounter + ncclShmem.nWorks; wc++) {
if (!profilerEnabled(idx++)) continue;
ncclShmem.comm.workCompleted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].timestamp = globaltimer();
ncclShmem.comm.workCompleted[ncclShmem.channelId].data[wc%MAX_PROFILER_EVENTS_PER_CHANNEL].counter = wc;
}
ncclShmem.channel.workCounter += ncclShmem.nWorks;
if(!profilerEnabled()) return;
ncclShmem.comm.workStarted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
}
} else if (action == STOP) {
if (threadIdx.x == 0 && profilerEnabled()) {
ncclShmem.comm.workCompleted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
}
} else { // FINI
if (threadIdx.x == 0) {
// store the workCounter back to vidmem regardless of the profiler being active or not
((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
if (!profilerEnabled()) return;
ncclShmem.comm.workCompleted[ncclShmem.channelId] = ncclShmem.channel.workCounter;
if (action == FINI) ((ncclDevCommAndChannels*)ncclShmem.args.comm)->channels[ncclShmem.channelId].workCounter = ncclShmem.channel.workCounter;
}
}
}
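The new device-side profiler path above timestamps each enabled work item into a fixed-size per-channel array indexed by workCounter modulo MAX_PROFILER_EVENTS_PER_CHANNEL. Below is a small hedged host-side sketch of that ring shape; the capacity and struct names are illustrative, not the NCCL definitions.
// Hedged sketch of the per-channel event ring implied above: events are written
// at counter % capacity, so the host side can consume them in counter order.
#include <stdint.h>
#define MAX_EVENTS 64
struct WorkEvent {
  uint64_t timestamp;   // e.g. a GPU globaltimer sample
  uint64_t counter;     // monotonically increasing work counter
};
struct ChannelEventRing {
  struct WorkEvent data[MAX_EVENTS];
};
// Record one started/completed work item for a channel.
static inline void recordEvent(struct ChannelEventRing* ring, uint64_t counter, uint64_t now) {
  ring->data[counter % MAX_EVENTS].timestamp = now;
  ring->data[counter % MAX_EVENTS].counter   = counter;
}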
@ -388,11 +382,6 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
}
__syncthreads(); // publish ncclShmem
if (tid == 0 && ncclShmem.args.workStorageType == ncclDevWorkStorageTypeFifo) {
// ncclShmem.workConsumed written by loadWorkBatchToShmem before __syncthreads()
ncclShmem.comm.workConsumed[ncclShmem.channelId] = ncclShmem.workConsumed;
}
while (ncclShmem.aborted == 0) {
profiler(START);
if (0 <= SpecializedFnId && ncclShmem.funcId == (unsigned)SpecializedFnId) {
@ -407,11 +396,6 @@ __device__ __forceinline__ void ncclKernelMain(struct ncclDevKernelArgs const* a
profiler(STOP);
loadWorkBatchToShmem(tid, tn, args, batchIx);
__syncthreads();
if (tid == 0 && ncclShmem.args.workStorageType == ncclDevWorkStorageTypeFifo) {
// ncclShmem.workConsumed written by loadWorkBatchToShmem before __syncthreads()
ncclShmem.comm.workConsumed[ncclShmem.channelId] = ncclShmem.workConsumed;
}
}
profiler(FINI);
}

View File

@ -327,7 +327,7 @@ with open(os.path.join(gensrc, "rules.mk"), "w") as f:
out = f.write
impl_names = sorted(name_to_funcs.keys())
names = impl_names + ["host_table.cc", "device_table.cu"]
out("LIB_OBJS_GEN = $(patsubst %, $(OBJDIR)/genobj/%.o, {names})\n"
out("LIB_OBJS_GEN = $(patsubst %,$(OBJDIR)/genobj/%.o,{names})\n"
.format(names=" ".join(names)))
out("\n")

View File

@ -99,37 +99,60 @@ template<>
union BytePack<0> {};
template<>
union BytePack<1> {
uint8_t u8, native;
uint8_t u8[1], native;
};
template<>
union BytePack<2> {
BytePack<1> half[2];
BytePack<1> b1[2];
uint8_t u8[2];
uint16_t u16, native;
uint16_t u16[1], native;
};
template<>
union BytePack<4> {
BytePack<2> half[2];
BytePack<1> b1[4];
BytePack<2> b2[2];
uint8_t u8[4];
uint16_t u16[2];
uint32_t u32, native;
uint32_t u32[1], native;
};
template<>
union BytePack<8> {
BytePack<4> half[2];
BytePack<1> b1[8];
BytePack<2> b2[4];
BytePack<4> b4[2];
uint8_t u8[8];
uint16_t u16[4];
uint32_t u32[2];
uint64_t u64, native;
uint64_t u64[1], native;
};
template<>
union alignas(16) BytePack<16> {
BytePack<8> half[2];
BytePack<1> b1[16];
BytePack<2> b2[8];
BytePack<4> b4[4];
BytePack<8> b8[2];
uint8_t u8[16];
uint16_t u16[8];
uint32_t u32[4];
uint64_t u64[2];
ulong2 ul2, native;
ulong2 ul2[1], native;
};
template<int Size>
union BytePack {
BytePack<Size/2> half[2];
BytePack<1> b1[Size];
BytePack<2> b2[Size/2];
BytePack<4> b4[Size/4];
BytePack<8> b8[Size/8];
BytePack<16> b16[Size/16];
uint8_t u8[Size];
uint16_t u16[Size/2];
uint32_t u32[Size/4];
uint64_t u64[Size/8];
};
template<typename T>
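For readers unfamiliar with the pattern, here is a hedged, host-compilable illustration of what these BytePack unions provide: one value whose bytes can be reinterpreted at any smaller granularity, which the load/store and funnel-shift code below relies on. This is a simplified standalone analogue, not the device definition.
// Hedged illustration: one 8-byte pack viewed as native u64, two u32 halves, or 8 bytes.
#include <cstdint>
#include <cstdio>
union Pack8 {
  uint64_t native;
  uint32_t u32[2];
  uint8_t  u8[8];
};
int main() {
  Pack8 p;
  p.native = 0x1122334455667788ull;
  // Little-endian view: u32[0] holds the low half, u8[0] the lowest byte.
  std::printf("u32[0]=%08x u32[1]=%08x u8[0]=%02x\n", p.u32[0], p.u32[1], p.u8[0]);
  return 0;
}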
@ -357,19 +380,19 @@ __device__ __forceinline__ void multimem_st_global<0>(uintptr_t addr, BytePack<0
}
template<>
__device__ __forceinline__ void multimem_st_global<1>(uintptr_t addr, BytePack<1> val) {
asm volatile("st.global.b8 [%0], %1;" :: "l"(addr), "r"((uint32_t)val.u8) : "memory");
asm volatile("st.global.b8 [%0], %1;" :: "l"(addr), "r"((uint32_t)val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<2>(uintptr_t addr, BytePack<2> val) {
asm volatile("st.global.b16 [%0], %1;" :: "l"(addr), "h"(val.u16) : "memory");
asm volatile("st.global.b16 [%0], %1;" :: "l"(addr), "h"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<4>(uintptr_t addr, BytePack<4> val) {
asm volatile("multimem.st.global.b32 [%0], %1;" :: "l"(addr), "r"(val.u32) : "memory");
asm volatile("multimem.st.global.b32 [%0], %1;" :: "l"(addr), "r"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<8>(uintptr_t addr, BytePack<8> val) {
asm volatile("multimem.st.global.b64 [%0], %1;" :: "l"(addr), "l"(val.u64) : "memory");
asm volatile("multimem.st.global.b64 [%0], %1;" :: "l"(addr), "l"(val.native) : "memory");
}
template<>
__device__ __forceinline__ void multimem_st_global<16>(uintptr_t addr, BytePack<16> val) {
@ -384,6 +407,56 @@ __device__ __forceinline__ void multimem_st_global(uintptr_t addr, BytePack<Size
}
#endif
// Load pack starting at index in array. Ignore elements past end (length of array).
template<typename Pack, typename T>
__device__ __forceinline__ Pack loadPack(T* ptr, int ix, int end) {
constexpr int Size = sizeof(Pack);
ptr += ix;
int n = end - ix;
if (alignof(T) == Size && sizeof(T) == Size) {
return *(Pack*)ptr;
} else if ((Size+3)/4 + 1 < Size/sizeof(T)) {
union { Pack ans; uint32_t part[Size/4]; };
int misalign = reinterpret_cast<uintptr_t>(ptr) % 4;
uint32_t* down = reinterpret_cast<uint32_t*>(reinterpret_cast<uintptr_t>(ptr) & -uintptr_t(4));
int i;
#pragma unroll
for (i=0; i < Size/4; i++) {
if (i*4/sizeof(T) < 1 || i*4/sizeof(T) < n) part[i] = down[i];
}
uint32_t extra;
if (misalign) extra = down[i];
#pragma unroll
for (i=0; i < Size/4; i++) {
part[i] = __funnelshift_r(part[i], part[i+1], 8*misalign);
}
if (misalign) part[i] = __funnelshift_r(part[i], extra, 8*misalign);
return ans;
} else {
union { Pack ans; BytePack<sizeof(T)> part[Size/sizeof(T)]; };
#pragma unroll
for (int i=0; i < Size/sizeof(T); i++) {
if (i < 1 || i < n) part[i] = ((BytePack<sizeof(T)>*)ptr)[i];
}
return ans;
}
}
// Store pack starting at index in array. Ignore elements past end (length of array).
template<typename Pack, typename T>
__device__ __forceinline__ void storePack(T* ptr, int ix, int end, Pack val) {
constexpr int Size = sizeof(Pack);
union { Pack tmp; BytePack<sizeof(T)> part[Size/sizeof(T)]; };
tmp = val;
ptr += ix;
int n = end - ix;
#pragma unroll
for (int i=0; i < Size/sizeof(T); i++) {
if (i < 1 || i < n) ((BytePack<sizeof(T)>*)ptr)[i] = part[i];
}
}
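As a reference for the intended semantics of loadPack/storePack, here is a hedged scalar sketch; the device versions above additionally vectorize with 4-byte word loads, handle misalignment with funnel shifts, and always touch at least one element.
// Hedged scalar reference: copy at most (end - ix) elements starting at ptr + ix.
// Elements past 'end' are left zero here (the device version leaves them undefined).
#include <cstring>
template<typename Pack, typename T>
Pack loadPackScalar(const T* ptr, int ix, int end) {
  Pack out{};
  int n = end - ix;
  int maxElts = int(sizeof(Pack) / sizeof(T));
  if (n > maxElts) n = maxElts;
  if (n > 0) std::memcpy(&out, ptr + ix, n * sizeof(T));
  return out;
}
template<typename Pack, typename T>
void storePackScalar(T* ptr, int ix, int end, Pack const& val) {
  int n = end - ix;
  int maxElts = int(sizeof(Pack) / sizeof(T));
  if (n > maxElts) n = maxElts;
  if (n > 0) std::memcpy(ptr + ix, &val, n * sizeof(T));
}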
// Warp-uniform memory copy from shared address (not generic) to global memory.
// The number of bytes copied is `min(MaxBytes, nBytesAhead)`; a negative value
// is interpreted as zero. EltSize is the guaranteed alignment of the addresses and sizes.
@ -426,10 +499,10 @@ __device__ __forceinline__ void copyGlobalShared_WarpUnrolled(
b4[3] = ld_shared<4>(srcAddr + 3*4);
if (srcMisalign != 0) {
BytePack<4> b4_4 = ld_shared<4>(srcAddr + 4*4);
b4[0].u32 = __funnelshift_r(b4[0].u32, b4[1].u32, srcMisalign*8);
b4[1].u32 = __funnelshift_r(b4[1].u32, b4[2].u32, srcMisalign*8);
b4[2].u32 = __funnelshift_r(b4[2].u32, b4[3].u32, srcMisalign*8);
b4[3].u32 = __funnelshift_r(b4[3].u32, b4_4.u32, srcMisalign*8);
b4[0].native = __funnelshift_r(b4[0].native, b4[1].native, srcMisalign*8);
b4[1].native = __funnelshift_r(b4[1].native, b4[2].native, srcMisalign*8);
b4[2].native = __funnelshift_r(b4[2].native, b4[3].native, srcMisalign*8);
b4[3].native = __funnelshift_r(b4[3].native, b4_4.native, srcMisalign*8);
}
if (Multimem) multimem_st_global<16>(dstAddr, b16);
else st_global<16>(dstAddr, b16);

View File

@ -125,7 +125,7 @@ class Primitives<
void **ptrs = isSendNotRecv ? (ncclShmem.groups[group].dsts + Dst)
: (ncclShmem.groups[group].srcs + Src);
if (flags & NetRegMode) {
if ((flags & NetRegMode) && ((!isSendNotRecv && DirectRecv) || (isSendNotRecv && DirectSend))) {
if (P2p) {
ptrs[index] = NULL;
} else {
@ -337,7 +337,7 @@ public:
}
template<int Recv, int Send, typename Fn>
__device__ __forceinline__ void process(Fn &&fn, uint32_t sendDirectFlag, uint32_t recvDirectFlag) {
__device__ __forceinline__ void process(Fn &&fn, uint32_t sendDirectFlag = 0, uint32_t recvDirectFlag = 0) {
#pragma unroll 1
for (int slice=0; slice < SlicePerChunk; slice++) {
if (tid < nworkers) {
@ -361,7 +361,7 @@ public:
} else if (flags & DirectRead) { // empty send
ptrs[index] = nullptr;
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
} else {
if (flags & DirectRead) {
@ -372,11 +372,11 @@ public:
else
ptrs[index] = nullptr;
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
}
} else {
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*stepSize;
ptrs[index] = connEltsFifo + (step%NCCL_STEPS)*connStepSize;
}
}
subBarrier();
@ -391,7 +391,7 @@ public:
} else {
nsend = fan.nsend();
}
fn.template operator() < SlicePerChunk, 0, Recv*MaxRecv, 0, Send*MaxSend >
fn.template operator()<SlicePerChunk, 0, Recv*MaxRecv, 0, Send*MaxSend, MultimemSrcs, MultimemDsts>
(tid, nworkers, slice, stepSize * StepPerSlice,
nrecv, ncclShmem.groups[group].srcs,
nsend, ncclShmem.groups[group].dsts, ncclShmem.groups[group].dstSizes, sendDirectFlag, recvDirectFlag);
@ -896,6 +896,12 @@ private:
__device__ __forceinline__ void directRecvDirectSend(intptr_t inpIx, intptr_t outIx, int eltN, bool postOp=false) {
genericOp<1, 1, 1, 1, -1, -1>(inpIx, outIx, eltN, postOp);
}
__device__ __forceinline__ void recvDirectSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<0, 1, 1, 1, -1, -1>(-1, outIx, eltN, postOp);
}
__device__ __forceinline__ void directRecvSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<1, 0, 1, 1, -1, -1>(outIx, outIx, eltN, postOp);
}
__device__ __forceinline__ void recvCopyDirectSend(intptr_t outIx, int eltN, bool postOp=false) {
genericOp<0, 1, 1, 1, -1, Output>(-1, outIx, eltN, postOp);
}

View File

@ -38,18 +38,18 @@ struct IsFloatingPoint<double>: std::true_type {};
// 3. Have constructor taking `uint64_t opArg`.
template<typename T>
struct FuncCopy { using EltType = T; __device__ FuncCopy(uint64_t opArg=0) {}; };
struct FuncCopy { using EltType = T; __device__ __forceinline__ FuncCopy(uint64_t opArg=0) {}; };
template<typename T>
struct FuncSum { using EltType = T; __device__ FuncSum(uint64_t opArg=0) {}; };
struct FuncSum { using EltType = T; __device__ __forceinline__ FuncSum(uint64_t opArg=0) {}; };
template<typename T>
struct FuncProd { using EltType = T; __device__ FuncProd(uint64_t opArg=0) {}; };
struct FuncProd { using EltType = T; __device__ __forceinline__ FuncProd(uint64_t opArg=0) {}; };
template<typename T>
struct FuncMinMax {
using EltType = T;
BytePack<sizeof(T)> xormask; // only used by integers
bool isMinNotMax; // only used by floats
__device__ FuncMinMax(uint64_t opArg=0) {
__device__ __forceinline__ FuncMinMax(uint64_t opArg=0) {
xormask.native = opArg;
isMinNotMax = (opArg&1)==0;
}
@ -64,13 +64,13 @@ template<typename T> struct FuncSumPostDiv;
template<typename Fn>
struct RedOpArg { // default case: no argument
static constexpr bool ArgUsed = false;
__device__ static uint64_t loadArg(void *ptr) { return 0; }
__device__ __forceinline__ static uint64_t loadArg(void *ptr) { return 0; }
};
template<typename T>
struct RedOpArg<FuncMinMax<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
union { uint64_t u64; T val; };
u64 = 0;
val = *(T*)ptr;
@ -84,6 +84,11 @@ struct RedOpArg<FuncMinMax<T>> {
// of elements. These classes are intended to be specialized for specific
// combinations of reduction function and pack size.
template<typename A, typename B, int EltPerPackA>
struct Apply_Cast/*{
static BytePack<EltPerPackA*sizeof(B)/sizeof(A)> cast(BytePack<EltPerPackA*sizeof(A)> a);
}*/;
template<typename Fn, int EltPerPack>
struct Apply_Reduce /*{
static BytePack<EltPerPack*sizeof(T)> reduce(
@ -111,16 +116,60 @@ struct Apply_LoadMultimem/*{
static BytePack<BytePerPack> load(Fn fn, uintptr_t addr);
}*/;
// Helpers for dealing with BytePack<0>'s
template<typename A, typename B, int EltPerPack>
struct Apply_Cast_MaybeEmpty: Apply_Cast<A, B, EltPerPack> {};
template<typename A, typename B>
struct Apply_Cast_MaybeEmpty<A, B, /*EltPerPack=*/0> {
__device__ constexpr static BytePack<0> cast(BytePack<0> a) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_Reduce_MaybeEmpty: Apply_Reduce<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_Reduce_MaybeEmpty<Fn, 0> {
__device__ constexpr static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_PreOp_MaybeEmpty: Apply_PreOp<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_PreOp_MaybeEmpty<Fn, 0> {
static constexpr bool IsIdentity = true;
__device__ constexpr static BytePack<0> preOp(Fn fn, BytePack<0> a) { return {}; }
};
template<typename Fn, int EltPerPack>
struct Apply_PostOp_MaybeEmpty: Apply_PostOp<Fn, EltPerPack> {};
template<typename Fn>
struct Apply_PostOp_MaybeEmpty<Fn, 0> {
static constexpr bool IsIdentity = true;
__device__ constexpr static BytePack<0> postOp(Fn fn, BytePack<0> a) { return {}; }
};
template<typename Fn, int BytePerPack>
struct Apply_LoadMultimem_MaybeEmpty: Apply_LoadMultimem<Fn, BytePerPack> {};
template<typename Fn>
struct Apply_LoadMultimem_MaybeEmpty<Fn, 0> {
__device__ constexpr static BytePack<0> load(Fn fn, uintptr_t addr) { return {}; }
};
////////////////////////////////////////////////////////////////////////////////
// Public API for calling the trait classes. These take the data elements as a
// pack of any type, which could be a BytePack<?> or any integral type (uint64_t,
// uint32_t, etc.), and will return a new pack where each element has been
// transformed appropriately.
template<typename A, typename B, typename PackA>
__device__ __forceinline__ BytePack<BytePackOf<PackA>::Size*sizeof(B)/sizeof(A)> applyCast(PackA a) {
return Apply_Cast_MaybeEmpty<A, B, BytePackOf<PackA>::Size/sizeof(A)>::cast(toPack(a));
}
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyReduce(Fn fn, Pack a, Pack b) {
return fromPack<Pack>(
Apply_Reduce<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_Reduce_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::reduce(fn, toPack(a), toPack(b))
);
}
@ -128,7 +177,7 @@ __device__ __forceinline__ Pack applyReduce(Fn fn, Pack a, Pack b) {
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyPreOp(Fn fn, Pack a) {
return fromPack<Pack>(
Apply_PreOp<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_PreOp_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::preOp(fn, toPack(a))
);
}
@ -136,23 +185,107 @@ __device__ __forceinline__ Pack applyPreOp(Fn fn, Pack a) {
template<typename Fn, typename Pack>
__device__ __forceinline__ Pack applyPostOp(Fn fn, Pack a) {
return fromPack<Pack>(
Apply_PostOp<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
Apply_PostOp_MaybeEmpty<Fn, BytePackOf<Pack>::Size/sizeof(typename Fn::EltType)>
::postOp(fn, toPack(a))
);
}
template<typename Fn, int BytePerPack>
__device__ __forceinline__ BytePack<BytePerPack> applyLoadMultimem(Fn fn, uintptr_t addr) {
return Apply_LoadMultimem<Fn, BytePerPack>::load(fn, addr);
return Apply_LoadMultimem_MaybeEmpty<Fn, BytePerPack>::load(fn, addr);
}
////////////////////////////////////////////////////////////////////////////////
// Apply_Cast
template<typename A, typename B, int EltPerPack>
struct Apply_Cast {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(B)> cast(BytePack<EltPerPack*sizeof(A)> a) {
BytePack<EltPerPack*sizeof(B)> b;
b.half[0] = Apply_Cast<A, B, EltPerPack/2>::cast(a.half[0]);
b.half[1] = Apply_Cast<A, B, EltPerPack/2>::cast(a.half[1]);
return b;
}
};
template<typename A, typename B>
struct Apply_Cast<A, B, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(B)> cast(BytePack<sizeof(A)> a) {
return toPack(B(fromPack<A>(a)));
}
};
template<>
struct Apply_Cast<__half, float, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(float)> cast(BytePack<sizeof(__half)> a) {
return toPack(__half2float(fromPack<__half>(a)));
}
};
template<>
struct Apply_Cast<float, __half, /*EltPerPack=*/1> {
__device__ __forceinline__ static BytePack<sizeof(__half)> cast(BytePack<sizeof(float)> a) {
return toPack(__float2half_rn(fromPack<float>(a)));
}
};
template<>
struct Apply_Cast<__half, float, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<4*2> cast(BytePack<2*2> a) {
return toPack(__half22float2(fromPack<__half2>(a)));
}
};
template<>
struct Apply_Cast<float, __half, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<2*2> cast(BytePack<4*2> a) {
return toPack(__float22half2_rn(fromPack<float2>(a)));
}
};
#if defined(__CUDA_BF16_TYPES_EXIST__) && (CUDART_RUNTIME >= 12000 || __CUDA_ARCH__ >= 800)
template<>
struct Apply_Cast<__nv_bfloat16, float, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<4*2> cast(BytePack<2*2> a) {
return toPack(__bfloat1622float2(fromPack<__nv_bfloat162>(a)));
}
};
template<>
struct Apply_Cast<float ,__nv_bfloat16, /*EltPerPack=*/2> {
__device__ __forceinline__ static BytePack<2*2> cast(BytePack<4*2> a) {
return toPack(__float22bfloat162_rn(fromPack<float2>(a)));
}
};
#endif
#define EASY_CAST(A, B, EltPerPack, VecA, VecB) \
template<> \
struct Apply_Cast<A, B, EltPerPack> { \
__device__ __forceinline__ static BytePack<sizeof(B)*EltPerPack> cast(BytePack<sizeof(A)*EltPerPack> a) { \
return toPack(VecB(fromPack<VecA>(a))); \
} \
}; \
template<> \
struct Apply_Cast<B, A, EltPerPack> { \
__device__ __forceinline__ static BytePack<sizeof(A)*EltPerPack> cast(BytePack<sizeof(B)*EltPerPack> b) { \
return toPack(VecA(fromPack<VecB>(b))); \
} \
};
#if defined(__CUDA_FP8_TYPES_EXIST__)
EASY_CAST(__nv_fp8_e5m2, float, 2, __nv_fp8x2_e5m2, float2)
EASY_CAST(__nv_fp8_e5m2, float, 4, __nv_fp8x4_e5m2, float4)
EASY_CAST(__nv_fp8_e4m3, float, 2, __nv_fp8x2_e4m3, float2)
EASY_CAST(__nv_fp8_e4m3, float, 4, __nv_fp8x4_e4m3, float4)
#endif
#undef EASY_CAST
////////////////////////////////////////////////////////////////////////////////
// Apply_Reduce
// Nonsensical base case
template<typename Fn>
struct Apply_Reduce<Fn, /*EltPerPack=*/0> {
__device__ static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) {
__device__ __forceinline__ static BytePack<0> reduce(Fn fn, BytePack<0> a, BytePack<0> b) {
return {};
}
};
@ -164,7 +297,7 @@ struct Apply_Reduce<Fn, /*EltPerPack=*/0> {
template<typename Fn, int EltPerPack>
struct Apply_Reduce {
template<int Size>
__device__ static BytePack<Size> reduce(Fn fn, BytePack<Size> a, BytePack<Size> b) {
__device__ __forceinline__ static BytePack<Size> reduce(Fn fn, BytePack<Size> a, BytePack<Size> b) {
a.half[0] = Apply_Reduce<Fn, EltPerPack/2>::reduce(fn, a.half[0], b.half[0]);
a.half[1] = Apply_Reduce<Fn, EltPerPack/2>::reduce(fn, a.half[1], b.half[1]);
return a;
@ -174,25 +307,25 @@ struct Apply_Reduce {
// Base case definitions (EltPerPack == 1)
template<typename T>
struct Apply_Reduce<FuncCopy<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncCopy<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncCopy<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return a;
}
};
template<typename T>
struct Apply_Reduce<FuncSum<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncSum<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncSum<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return toPack<T>(fromPack<T>(a) + fromPack<T>(b));
}
};
template<typename T>
struct Apply_Reduce<FuncProd<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncProd<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncProd<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return toPack<T>(fromPack<T>(a) * fromPack<T>(b));
}
};
template<typename T>
struct Apply_Reduce<FuncMinMax<T>, /*EltPerPack=*/1> {
__device__ static BytePack<sizeof(T)> reduce(FuncMinMax<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
__device__ __forceinline__ static BytePack<sizeof(T)> reduce(FuncMinMax<T> fn, BytePack<sizeof(T)> a, BytePack<sizeof(T)> b) {
return (a.native ^ fn.xormask.native) < (b.native ^ fn.xormask.native) ? a : b;
}
};
@ -200,7 +333,7 @@ struct Apply_Reduce<FuncMinMax<T>, /*EltPerPack=*/1> {
// Optimizations for specific types and element count combinations:
template<>
struct Apply_Reduce<FuncSum<uint8_t>, /*EltPerPack=*/4> {
__device__ static BytePack<4> reduce(FuncSum<uint8_t> fn, BytePack<4> a, BytePack<4> b) {
__device__ __forceinline__ static BytePack<4> reduce(FuncSum<uint8_t> fn, BytePack<4> a, BytePack<4> b) {
constexpr uint32_t even = 0x00ff00ffu;
uint32_t x = (a.native & even) + (b.native & even);
uint32_t y = (a.native & ~even) + (b.native & ~even);
@ -236,7 +369,7 @@ struct Apply_Reduce<FuncMinMax<uint8_t>, /*EltPerPack=*/4> {
template<>
struct Apply_Reduce<FuncProd<uint8_t>, /*EltPerPack=*/4> {
__device__ static BytePack<4> reduce(FuncProd<uint8_t> fn, BytePack<4> apack, BytePack<4> bpack) {
__device__ __forceinline__ static BytePack<4> reduce(FuncProd<uint8_t> fn, BytePack<4> apack, BytePack<4> bpack) {
uint32_t a = apack.native;
uint32_t b = bpack.native;
uint32_t ab0 = (a*b) & 0xffu;
@ -332,7 +465,7 @@ template<typename Fn, int EltPerPack>
struct Apply_PreOp {
static constexpr bool IsIdentity = Apply_PreOp<Fn, EltPerPack/2>::IsIdentity;
template<int Size>
__device__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
#if __cpp_if_constexpr
if constexpr(!IsIdentity) {
#else
@ -352,7 +485,7 @@ template<typename Fn>
struct Apply_PreOp<Fn, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = true;
template<int Size>
__device__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> preOp(Fn fn, BytePack<Size> a) {
return a;
}
};
@ -360,7 +493,7 @@ struct Apply_PreOp<Fn, /*EltPerPack=*/1> {
template<typename Fn>
struct Apply_PreOp<Fn, /*EltPerPack=*/0> {
static constexpr bool IsIdentity = true;
__device__ static BytePack<0> preOp(Fn fn, BytePack<0> a) {
__device__ __forceinline__ static BytePack<0> preOp(Fn fn, BytePack<0> a) {
return {};
}
};
@ -373,7 +506,7 @@ template<typename Fn, int EltPerPack>
struct Apply_PostOp {
static constexpr bool IsIdentity = Apply_PostOp<Fn, EltPerPack/2>::IsIdentity;
template<int Size>
__device__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
#if __cpp_if_constexpr
if constexpr(!IsIdentity) {
#else
@ -393,7 +526,7 @@ template<typename Fn>
struct Apply_PostOp<Fn, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = true;
template<int Size>
__device__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
__device__ __forceinline__ static BytePack<Size> postOp(Fn fn, BytePack<Size> a) {
return a;
}
};
@ -401,7 +534,7 @@ struct Apply_PostOp<Fn, /*EltPerPack=*/1> {
template<typename Fn>
struct Apply_PostOp<Fn, /*EltPerPack=*/0> {
static constexpr bool IsIdentity = true;
__device__ static BytePack<0> postOp(Fn fn, BytePack<0> a) {
__device__ __forceinline__ static BytePack<0> postOp(Fn fn, BytePack<0> a) {
return {};
}
};
@ -413,7 +546,7 @@ struct Apply_PostOp<Fn, /*EltPerPack=*/0> {
template<typename T>
struct RedOpArg<FuncPreMulSum<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
union { uint64_t u64; T val; };
u64 = 0;
val = *(T*)ptr;
@ -426,7 +559,7 @@ template<typename T>
struct FuncPreMulSum {
using EltType = T;
T scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; T val; };
u64 = opArg;
scalar = val;
@ -441,7 +574,7 @@ struct FuncPreMulSum<half> {
using EltType = half;
#if __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
__half2 scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __half val; };
u64 = opArg;
scalar.x = val;
@ -449,7 +582,7 @@ struct FuncPreMulSum<half> {
}
#else
float scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __half val; };
u64 = opArg;
scalar = (float)val;
@ -466,7 +599,7 @@ struct FuncPreMulSum<half> {
using EltType = __nv_bfloat16;
#if __CUDA_ARCH__ >= 800
__nv_bfloat162 scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __nv_bfloat16 val; };
u64 = opArg;
scalar.x = val;
@ -474,7 +607,7 @@ struct FuncPreMulSum<half> {
}
#else
float scalar;
__device__ FuncPreMulSum(uint64_t opArg=0) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg=0) {
union { uint64_t u64; __nv_bfloat16 val; };
u64 = opArg;
scalar = __bfloat162float(val);
@ -489,7 +622,7 @@ struct FuncPreMulSum<half> {
struct FuncPreMulSum<__nv_fp8_e4m3> {
using EltType = __nv_fp8_e4m3;
__half2 scalar2;
__device__ FuncPreMulSum(uint64_t opArg) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg) {
union { uint64_t u64; __nv_fp8_storage_t val; };
u64 = opArg;
scalar2.x = __half(__nv_cvt_fp8_to_halfraw(val, __NV_E4M3));
@ -501,7 +634,7 @@ struct FuncPreMulSum<half> {
struct FuncPreMulSum<__nv_fp8_e5m2> {
using EltType = __nv_fp8_e5m2;
__half2 scalar2;
__device__ FuncPreMulSum(uint64_t opArg) {
__device__ __forceinline__ FuncPreMulSum(uint64_t opArg) {
union { uint64_t u64; __nv_fp8_storage_t val; };
u64 = opArg;
scalar2.x = __half(__nv_cvt_fp8_to_halfraw(val, __NV_E5M2));
@ -513,7 +646,7 @@ struct FuncPreMulSum<half> {
template<typename T, int EltPerPack>
struct Apply_Reduce<FuncPreMulSum<T>, EltPerPack> {
__device__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncPreMulSum<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncPreMulSum<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
// FuncPreMulSum reduce dispatches to FuncSum.
return Apply_Reduce<FuncSum<T>, EltPerPack>::reduce(FuncSum<T>(), a, b);
}
@ -523,7 +656,7 @@ struct Apply_Reduce<FuncPreMulSum<T>, EltPerPack> {
template<typename T>
struct Apply_PreOp<FuncPreMulSum<T>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(T)> preOp(FuncPreMulSum<T> fn, BytePack<sizeof(T)> a) {
__device__ __forceinline__ static BytePack<sizeof(T)> preOp(FuncPreMulSum<T> fn, BytePack<sizeof(T)> a) {
return toPack<T>(fromPack<T>(a) * fn.scalar);
}
};
@ -534,7 +667,7 @@ struct Apply_PreOp<FuncPreMulSum<T>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(half)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half)> a) {
__device__ __forceinline__ static BytePack<sizeof(half)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half)> a) {
#if __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
return toPack<half>(__hmul(fromPack<half>(a), fn.scalar.x));
#else
@ -546,7 +679,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(half2)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half2)> a) {
__device__ __forceinline__ static BytePack<sizeof(half2)> preOp(FuncPreMulSum<half> fn, BytePack<sizeof(half2)> a) {
return toPack<half2>(__hmul2(fromPack<half2>(a), fn.scalar));
}
};
@ -559,7 +692,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_bfloat16>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_bfloat16)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_bfloat16)> preOp(
FuncPreMulSum<__nv_bfloat16> fn, BytePack<sizeof(__nv_bfloat16)> a
) {
#if __CUDA_ARCH__ >= 800
@ -573,7 +706,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_bfloat16>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_bfloat162)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_bfloat162)> preOp(
FuncPreMulSum<__nv_bfloat16> fn, BytePack<sizeof(__nv_bfloat162)> a
) {
return toPack<__nv_bfloat162>(__hmul2(fromPack<__nv_bfloat162>(a), fn.scalar));
@ -590,7 +723,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e4m3>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8_e4m3)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8_e4m3)> preOp(
FuncPreMulSum<__nv_fp8_e4m3> fn, BytePack<sizeof(__nv_fp8_e4m3)> a
) {
return toPack<__nv_fp8_e4m3>(__nv_fp8_e4m3(__hmul(__half(fromPack<__nv_fp8_e4m3>(a)), fn.scalar2.x)));
@ -599,7 +732,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e4m3>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8x2_e4m3)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8x2_e4m3)> preOp(
FuncPreMulSum<__nv_fp8_e4m3> fn, BytePack<sizeof(__nv_fp8x2_e4m3)> a
) {
return toPack<__nv_fp8x2_e4m3>(__nv_fp8x2_e4m3(__hmul2(__half2(fromPack<__nv_fp8x2_e4m3>(a)), fn.scalar2)));
@ -609,7 +742,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e5m2>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8_e5m2)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8_e5m2)> preOp(
FuncPreMulSum<__nv_fp8_e5m2> fn, BytePack<sizeof(__nv_fp8_e5m2)> a
) {
return toPack<__nv_fp8_e5m2>(__nv_fp8_e5m2(__hmul(__half(fromPack<__nv_fp8_e5m2>(a)), fn.scalar2.x)));
@ -618,7 +751,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<>
struct Apply_PreOp<FuncPreMulSum<__nv_fp8_e5m2>, /*EltPerPack=*/2> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(__nv_fp8x2_e5m2)> preOp(
__device__ __forceinline__ static BytePack<sizeof(__nv_fp8x2_e5m2)> preOp(
FuncPreMulSum<__nv_fp8_e5m2> fn, BytePack<sizeof(__nv_fp8x2_e5m2)> a
) {
return toPack<__nv_fp8x2_e5m2>(__nv_fp8x2_e5m2(__hmul2(__half2(fromPack<__nv_fp8x2_e5m2>(a)), fn.scalar2)));
@ -633,7 +766,7 @@ struct Apply_PreOp<FuncPreMulSum<half>, /*EltPerPack=*/1> {
template<typename T>
struct RedOpArg<FuncSumPostDiv<T>> {
static constexpr bool ArgUsed = true;
__device__ static uint64_t loadArg(void *ptr) {
__device__ __forceinline__ static uint64_t loadArg(void *ptr) {
return *(uint64_t*)ptr;
}
};
@ -646,12 +779,12 @@ struct FuncSumPostDiv {
uint32_t divisor:31, isSigned:1;
UintType recip;
__device__ FuncSumPostDiv(uint64_t opArg=0) {
__device__ __forceinline__ FuncSumPostDiv(uint64_t opArg=0) {
isSigned = opArg & 1;
divisor = opArg >> 1;
recip = UintType(-1)/divisor;
}
__device__ T divide(T x) {
__device__ __forceinline__ T divide(T x) {
// x is negative iff we are in signed mode and the top bit is set
bool xneg = isSigned && (x & ~(T(-1)>>1));
// Compute abs(x):
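divide() above avoids a hardware integer divide by using the reciprocal precomputed in the constructor (recip = UintType(-1)/divisor). A minimal sketch of that approximate-then-correct idea for 32-bit unsigned operands, with the sign handling from divide() left out (divApprox32 is an illustrative name, not from the source):

// Sketch only: quotient via a precomputed reciprocal, then a small fix-up.
// recip = 0xFFFFFFFFu / divisor, so __umulhi(x, recip) never overshoots x/divisor.
__device__ static uint32_t divApprox32(uint32_t x, uint32_t divisor, uint32_t recip) {
  uint32_t q = __umulhi(x, recip);                 // lower bound of x / divisor
  uint32_t r = x - q*divisor;                      // remainder left by the estimate
  while (r >= divisor) { q += 1; r -= divisor; }   // converges in a couple of steps
  return q;
}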
@ -673,7 +806,7 @@ struct FuncSumPostDiv {
template<typename T, int EltPerPack>
struct Apply_Reduce<FuncSumPostDiv<T>, EltPerPack>:
Apply_Reduce<FuncSum<T>, EltPerPack> {
__device__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncSumPostDiv<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
__device__ __forceinline__ static BytePack<EltPerPack*sizeof(T)> reduce(FuncSumPostDiv<T> fn, BytePack<EltPerPack*sizeof(T)> a, BytePack<EltPerPack*sizeof(T)> b) {
// FuncSumPostDiv reduce dispatches to FuncSum.
return Apply_Reduce<FuncSum<T>, EltPerPack>::reduce(FuncSum<T>(), a, b);
}
@ -682,7 +815,7 @@ struct Apply_Reduce<FuncSumPostDiv<T>, EltPerPack>:
template<typename T>
struct Apply_PostOp<FuncSumPostDiv<T>, /*EltPerPack=*/1> {
static constexpr bool IsIdentity = false;
__device__ static BytePack<sizeof(T)> postOp(FuncSumPostDiv<T> fn, BytePack<sizeof(T)> a) {
__device__ __forceinline__ static BytePack<sizeof(T)> postOp(FuncSumPostDiv<T> fn, BytePack<sizeof(T)> a) {
return toPack<T>(fn.divide(fromPack<T>(a)));
}
};
@ -690,120 +823,145 @@ struct Apply_PostOp<FuncSumPostDiv<T>, /*EltPerPack=*/1> {
////////////////////////////////////////////////////////////////////////////////
// Apply_LoadMultimem
#define SIZEOF_BytePack_field_u16 2
#define PTX_REG_BytePack_field_u16 "h"
#define RegCode_for_size_1 "r"
#define RegCode_for_size_2 "h"
#define RegCode_for_size_4 "r"
#define RegCode_for_size_8 "l"
#define SIZEOF_BytePack_field_u32 4
#define PTX_REG_BytePack_field_u32 "r"
#define RegSize_for_size_1 4
#define RegSize_for_size_2 2
#define RegSize_for_size_4 4
#define RegSize_for_size_8 8
#define SIZEOF_BytePack_field_u64 8
#define PTX_REG_BytePack_field_u64 "l"
#define PtxAcc_for_u32
#define PtxAcc_for_s32
#define PtxAcc_for_s64
#define PtxAcc_for_u64
#define PtxAcc_for_f32
#define PtxAcc_for_f64
#if CUDART_VERSION >= 12020
#define PtxAcc_for_f16 ".acc::f32"
#define PtxAcc_for_bf16 ".acc::f32"
#define PtxAcc_for_f16x2 ".acc::f32"
#define PtxAcc_for_bf16x2 ".acc::f32"
#else
#define PtxAcc_for_f16
#define PtxAcc_for_bf16
#define PtxAcc_for_f16x2
#define PtxAcc_for_bf16x2
#endif
#define PtxAcc_for_e4m3 ".acc::f16"
#define PtxAcc_for_e5m2 ".acc::f16"
#define PtxAcc_for_e4m3x4 ".acc::f16"
#define PtxAcc_for_e5m2x4 ".acc::f16"
#define DEFINE_Apply_LoadMultimem_sum(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum(T, ptx_ty, PackSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, SIZEOF_BytePack_field_##pack_field> { \
static constexpr int PackSize = SIZEOF_BytePack_field_##pack_field; \
__device__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
struct Apply_LoadMultimem<FuncSum<T>, PackSize> { \
__device__ __forceinline__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<RegSize_for_size_##PackSize> reg; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty "." #ptx_ty " %0, [%1];" \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
BytePack<PackSize> ans; \
ans.native = reg.native; \
return ans; \
} \
};
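For reference, DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, 4) — instantiated further down — pastes these tokens together so that load() boils down to roughly the following (a sketch of the preprocessor output, not the literal generated code; PtxAcc_for_u32 is empty and RegCode_for_size_4 is "r"):

// Approximate expansion of Apply_LoadMultimem<FuncSum<uint32_t>, 4>::load (sketch only):
BytePack<4> reg;
asm volatile("multimem.ld_reduce.relaxed.sys.global.add.u32 %0, [%1];"
             : "=r"(reg.native)    // RegCode_for_size_4 == "r" (32-bit register)
             : "l"(addr) : "memory");
BytePack<4> ans;
ans.native = reg.native;
return ans;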
#define DEFINE_Apply_LoadMultimem_minmax(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax(T, ptx_ty, PackSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, SIZEOF_BytePack_field_##pack_field> { \
static constexpr int PackSize = SIZEOF_BytePack_field_##pack_field; \
__device__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
struct Apply_LoadMultimem<FuncMinMax<T>, PackSize> { \
__device__ __forceinline__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<RegSize_for_size_##PackSize> reg; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field) \
: "=" RegCode_for_size_##PackSize(reg.native) \
: "l"(addr) : "memory"); \
} \
BytePack<PackSize> ans; \
ans.native = reg.native; \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, 4*(SIZEOF_BytePack_field_##pack_field)> { \
static constexpr int PackSize = 4*(SIZEOF_BytePack_field_##pack_field); \
__device__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
struct Apply_LoadMultimem<FuncSum<T>, 4*(VecEltSize)> { \
static constexpr int PackSize = 4*(VecEltSize); \
__device__ __forceinline__ static BytePack<PackSize> load(FuncSum<T> fn, uintptr_t addr) { \
union { BytePack<PackSize> ans; BytePack<VecEltSize> elts[4]; }; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty ".v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, 4*(SIZEOF_BytePack_field_##pack_field)> { \
static constexpr int PackSize = 4*(SIZEOF_BytePack_field_##pack_field); \
__device__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<PackSize> ans; \
struct Apply_LoadMultimem<FuncMinMax<T>, 4*(VecEltSize)> { \
static constexpr int PackSize = 4*(VecEltSize); \
__device__ __forceinline__ static BytePack<PackSize> load(FuncMinMax<T> fn, uintptr_t addr) { \
union { BytePack<PackSize> ans; BytePack<VecEltSize> elts[4]; }; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max.v4." #ptx_ty " {%0,%1,%2,%3}, [%4];" \
: "=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[0]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[1]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[2]), \
"=" PTX_REG_BytePack_field_##pack_field(ans.pack_field[3]) \
: "=" RegCode_for_size_##VecEltSize(elts[0].native), \
"=" RegCode_for_size_##VecEltSize(elts[1].native), \
"=" RegCode_for_size_##VecEltSize(elts[2].native), \
"=" RegCode_for_size_##VecEltSize(elts[3].native) \
: "l"(addr) : "memory"); \
} \
return ans; \
} \
};
#define DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(T, ptx_ty, pack_field) \
DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(T, ptx_ty, VecEltSize) \
DEFINE_Apply_LoadMultimem_sum_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncSum<T>, sizeof(T)> { \
__device__ static BytePack<sizeof(T)> load(FuncSum<T> fn, uintptr_t addr) { \
BytePack<2*sizeof(T)> tmp; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
return tmp.half[(addr/sizeof(T))%2]; \
__device__ __forceinline__ static BytePack<sizeof(T)> load(FuncSum<T> fn, uintptr_t addr) { \
union { BytePack<VecEltSize> tmp; BytePack<sizeof(T)> elts[(VecEltSize)/sizeof(T)]; }; \
asm volatile("multimem.ld_reduce.relaxed.sys.global.add" PtxAcc_for_##ptx_ty "." #ptx_ty " %0, [%1];" \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
return elts[(addr/sizeof(T))%((VecEltSize)/sizeof(T))]; \
} \
};
#define DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(T, ptx_ty, pack_field) \
DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, pack_field) \
#define DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(T, ptx_ty, VecEltSize) \
DEFINE_Apply_LoadMultimem_minmax_v4(T, ptx_ty, VecEltSize) \
template<> \
struct Apply_LoadMultimem<FuncMinMax<T>, sizeof(T)> { \
__device__ static BytePack<sizeof(T)> load(FuncMinMax<T> fn, uintptr_t addr) { \
BytePack<2*sizeof(T)> tmp; \
__device__ __forceinline__ static BytePack<sizeof(T)> load(FuncMinMax<T> fn, uintptr_t addr) { \
union { BytePack<VecEltSize> tmp; BytePack<sizeof(T)> elts[(VecEltSize)/sizeof(T)]; }; \
if (fn.isMinNotMax) { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.min." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
} else { \
asm volatile("multimem.ld_reduce.relaxed.sys.global.max." #ptx_ty " %0, [%1];" \
: "=" PTX_REG_BytePack_field_##pack_field(tmp.pack_field) \
: "l"(addr & -uintptr_t(2*sizeof(T))) : "memory"); \
: "=" RegCode_for_size_##VecEltSize(tmp.native) \
: "l"(addr & -uintptr_t(VecEltSize)) : "memory"); \
} \
return tmp.half[(addr/sizeof(T))%2]; \
return elts[(addr/sizeof(T))%((VecEltSize)/sizeof(T))]; \
} \
};
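Besides the v4 form, the *_xparts macros above handle loads narrower than one vector element by rounding the address down to a VecEltSize boundary, loading the whole element, and picking the wanted sub-pack out of a union. A small sketch of that selection for T = half with VecEltSize = 4 (the helper name and example address are illustrative only):

// Sketch of the sub-pack selection used by the xparts macros (illustrative).
__device__ static BytePack<2> pickHalfFromPair(BytePack<4> pair, uintptr_t addr) {
  union { BytePack<4> whole; BytePack<2> elts[2]; } u;
  u.whole = pair;                 // pair was loaded from addr & -uintptr_t(4)
  return u.elts[(addr / 2) % 2];  // e.g. addr 0x1002 selects elts[1], the upper half
}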
template<typename Fn, int BytePerPack>
struct Apply_LoadMultimem {
__device__ static BytePack<BytePerPack> load(Fn fn, uintptr_t addr) {
__device__ __forceinline__ static BytePack<BytePerPack> load(Fn fn, uintptr_t addr) {
__trap();
return {};
}
@ -826,29 +984,36 @@ struct Apply_LoadMultimem {
/*multimem.ld_reduce not supported:*/ 0;
};
DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, u32)
DEFINE_Apply_LoadMultimem_minmax(uint32_t, u32, u32)
DEFINE_Apply_LoadMultimem_sum(uint32_t, u32, 4)
DEFINE_Apply_LoadMultimem_minmax(uint32_t, u32, 4)
DEFINE_Apply_LoadMultimem_sum(int32_t, s32, u32)
DEFINE_Apply_LoadMultimem_minmax(int32_t, s32, u32)
DEFINE_Apply_LoadMultimem_sum(int32_t, s32, 4)
DEFINE_Apply_LoadMultimem_minmax(int32_t, s32, 4)
DEFINE_Apply_LoadMultimem_sum(uint64_t, u64, u64)
DEFINE_Apply_LoadMultimem_minmax(uint64_t, u64, u64)
DEFINE_Apply_LoadMultimem_sum(uint64_t, u64, 8)
DEFINE_Apply_LoadMultimem_minmax(uint64_t, u64, 8)
DEFINE_Apply_LoadMultimem_sum(int64_t, u64, u64)
DEFINE_Apply_LoadMultimem_minmax(int64_t, s64, u64)
DEFINE_Apply_LoadMultimem_sum(int64_t, u64, 8)
DEFINE_Apply_LoadMultimem_minmax(int64_t, s64, 8)
DEFINE_Apply_LoadMultimem_sum(float, f32, u32)
DEFINE_Apply_LoadMultimem_sum_v4(float, f32, u32)
DEFINE_Apply_LoadMultimem_sum(float, f32, 4)
DEFINE_Apply_LoadMultimem_sum_v4(float, f32, 4)
DEFINE_Apply_LoadMultimem_sum(double, f64, u64)
DEFINE_Apply_LoadMultimem_sum(double, f64, 8)
DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(half, f16x2, u32)
DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(half, f16x2, u32)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(half, f16x2, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(half, f16x2, 4)
#if defined(__CUDA_BF16_TYPES_EXIST__)
DEFINE_Apply_LoadMultimem_sum_v4x2_and_subhalf(__nv_bfloat16, bf16x2, u32)
DEFINE_Apply_LoadMultimem_minmax_v4x2_and_subhalf(__nv_bfloat16, bf16x2, u32)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_bfloat16, bf16x2, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_bfloat16, bf16x2, 4)
#endif
#if NCCL_CUDA_ARCH_SPECIFIC == 1000 || NCCL_CUDA_ARCH_SPECIFIC == 1010 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1000 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1010 || NCCL_CUDA_ARCH_SPECIFIC == 1200 || NCCL_CUDA_ARCH_SPECIFIC == 1210
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e5m2, e5m2x4, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_fp8_e5m2, e5m2x4, 4)
#endif
#else
template<typename Fn>
@ -860,11 +1025,29 @@ struct Apply_LoadMultimem {
#undef DEFINE_Apply_LoadMultimem
#undef DEFINE_Apply_LoadMultimem_v4
#undef DEFINE_Apply_LoadMultimem_v4x2_and_subhalf
#undef SIZEOF_BytePack_field_u64
#undef PTX_REG_BytePack_field_u64
#undef SIZEOF_BytePack_field_u32
#undef PTX_REG_BytePack_field_u32
#undef SIZEOF_BytePack_field_u16
#undef PTX_REG_BytePack_field_u16
#undef RegCode_for_size_2
#undef RegCode_for_size_4
#undef RegCode_for_size_8
#undef RegSize_for_size_1
#undef RegSize_for_size_2
#undef RegSize_for_size_4
#undef RegSize_for_size_8
#undef PtxAcc_for_u32
#undef PtxAcc_for_s32
#undef PtxAcc_for_s64
#undef PtxAcc_for_u64
#undef PtxAcc_for_f32
#undef PtxAcc_for_f64
#undef PtxAcc_for_f16
#undef PtxAcc_for_bf16
#undef PtxAcc_for_f16x2
#undef PtxAcc_for_bf16x2
#undef PtxAcc_for_e4m3
#undef PtxAcc_for_e5m2
#undef PtxAcc_for_e4m3x4
#undef PtxAcc_for_e5m2x4
#endif // REDUCE_KERNEL_H_


@ -142,82 +142,206 @@ struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_PAT, NCCL_PROTO_SI
template<typename T, typename RedOp>
struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_NVLS, NCCL_PROTO_SIMPLE> {
template<bool ReduceSendNotRecv>
struct Scatterer {
struct ncclDevWorkColl* work;
int chunkCount;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag
) {
static_assert(SlicePerChunk == 1, "require: SlicePerChunk==1");
static_assert(MaxDsts <= 1 || MaxSrcs <= 1, "require: MaxDsts<=1 || MaxSrcs<=1");
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
int nNodes = ncclShmem.comm.nNodes;
int nRails = nvls->nHeads;
int part = ncclShmem.channelId - work->channelLo;
void* inbuf = (void*)work->sendbuff;
ssize_t countPerRank = work->collnet.count;
ssize_t railAllBeg = min(railGridOffset + part * chunkCount, nNodes * countPerRank);
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
int railAllSize = railAllEnd - railAllBeg;
int rail = nvls->headRank;
int dst = 0;
if (ReduceSendNotRecv) {
if (work->regUsed) return;
rail = 0;
nSrcs = 1;
} else {
rail = nvls->headRank;
}
if (tid < nDsts) dstSizes[tid] = railAllSize;
do {
int node = railAllBeg / countPerRank;
int railAllOffset = 0;
while (railAllOffset < railAllSize) {
ssize_t railOneBeg = node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t railOneOffset = (railAllBeg + railAllOffset) - railOneBeg;
int delta = min(railAllEnd, railOneEnd) - (railAllBeg + railAllOffset);
int rank = ncclShmem.comm.collNetDenseToUserRank[node * nRails + rail];
ssize_t userOneBeg = rank * countPerRank + railOneOffset;
if (nDsts != 0) {
reduceCopy<ncclCollUnroll(), RedOp, T,
/*MultimemSrcs=*/MultimemSrcs, 1, 1 + MaxSrcs,
/*MultimemDsts,MinDsts,MaxDsts=*/MultimemDsts, 1, 1,
/*PreOpSrcs=*/1>
(tid, tn, work->redOpArg, &work->redOpArg, false,
/*nSrcs=*/nSrcs, [=]__device__(int s) {
return work->regUsed ? (T*)srcPtrs[s] + userOneBeg :
!ReduceSendNotRecv ? (T*)srcPtrs[s] + railAllOffset:
(T*)inbuf + userOneBeg;
},
/*nDsts=*/1, [=]__device__(int d/*==0*/) {
return (T*)dstPtrs[dst] + railAllOffset;
}, delta);
}
railAllOffset += delta;
node += 1;
}
dst += 1;
rail += 1;
} while (ReduceSendNotRecv && dst < nRails);
}
};
__device__ __forceinline__ void run(int tid, int/*nthreads*/, struct ncclDevWorkColl* work) {
struct ncclNvls* nvls = &ncclShmem.channel.nvls;
size_t count;
size_t gridOffset;
size_t channelCount;
size_t chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
const int rank = ncclShmem.comm.rank;
const int nranks = ncclShmem.comm.nRanks;
size_t offset;
int nelem;
/* If we are direct NVLS, we only need to allocate 1 warp to scatter for sync;
* otherwise, based on the number of ranks, we allocate 7 or 5 warps to reduce (to saturate bandwidth)
* and the rest to scatter. */
const int nThreadsReduce = work->regUsed ? (NCCL_MAX_NTHREADS - WARP_SIZE) : (nranks <= 6 ? 7 * WARP_SIZE : 5 * WARP_SIZE);
const int nThreadsScatter = work->regUsed ? WARP_SIZE : (NCCL_MAX_NTHREADS - nThreadsReduce);
const int tidEndScatter = nThreadsScatter;
const int nThreadsNetRecv = work->oneNode ? 0 : (work->netRegUsed ? WARP_SIZE : 6 * WARP_SIZE);
const int nThreadsScatter = work->regUsed ? roundUp(nvls->nHeads << 2, WARP_SIZE) : 8 * WARP_SIZE;
const int nThreadsReduce = NCCL_MAX_NTHREADS - nThreadsNetRecv - nThreadsScatter;
const int tidEndNetRecv = nThreadsNetRecv;
const int tidEndScatter = tidEndNetRecv + nThreadsScatter;
const int tidEndReduce = tidEndScatter + nThreadsReduce;
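To make the thread partitioning above concrete, a worked example assuming NCCL_MAX_NTHREADS is 640 and WARP_SIZE is 32, for the multi-node case with netRegUsed == 0 and regUsed == 0 (the numbers are derived from the expressions above, not taken from the source):

// nThreadsNetRecv = 6*32        = 192  -> tid [  0, 192) receives from the network
// nThreadsScatter = 8*32        = 256  -> tid [192, 448) scatters to the NVLS heads
// nThreadsReduce  = 640-192-256 = 192  -> tid [448, 640) reduces through NVLS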
if (!work->regUsed) {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, NULL, nvls->up, work->sendbuff, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.scatter(offset, nvls->nHeads * count, nelem, count, -1, 0);
if (work->oneNode) {
const int rank = ncclShmem.comm.rank;
size_t offset;
size_t count, gridOffset, channelCount, chunkCount;
ncclCollCbdPart(work, ncclShmem.channelId, NCCL_PROTO_SIMPLE, sizeof(T), &count, &gridOffset, &channelCount, &chunkCount);
if (!work->regUsed) {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, NULL, nvls->up, work->sendbuff, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.scatter(offset, nvls->nHeads * count, nelem, count, -1, 0);
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, NULL, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.recv(offset, nelem);
}
}
// coverity[overrun-call] => Coverity thinks prims.index can be greater than 1
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, NULL, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
offset = gridOffset + elemOffset;
nelem = min(chunkCount, channelCount - elemOffset);
prims.recv(offset, nelem);
} else {
if (tid < tidEndScatter) {
// Scatter
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.scatter(0, 0, 0, 0, -1, 0);
}
/* gather used as sync */
prims.gather(0, 0, 0, 0, -1, 0);
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->down, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0, work);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
size_t outOffset = gridOffset + elemOffset;
size_t inpOffset = outOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
// Coverity complains about a possible overrun inside the method invoked below, but that's actually
// a false positive.
// coverity[overrun-call:FALSE]
prims.directRecvCopy(inpOffset, outOffset, nelem);
}
/* send for sync */
prims.send(0, 0);
}
}
} else {
if (tid < tidEndScatter) {
// Scatter
// multi-node
int nNodes = ncclShmem.comm.nNodes;
int part = ncclShmem.channelId - work->channelLo;
ssize_t countPerRank = work->collnet.count;
const int nChannels = work->channelHi - work->channelLo + 1;
ssize_t chunkCount = work->collnet.chunkCount;
if (tid < tidEndNetRecv) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanSymmetric<NCCL_MAX_NVLS_ARITY>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsScatter, nvls->up, nvls->up, NULL, NULL,
work->redOpArg, 0 * Proto::MaxGroupWidth, 1, 1);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
prims.scatter(0, 0, 0, 0, -1, 0);
if (work->netRegUsed) {
if (tid == 0) {
int steps = (int)divUp(nNodes * countPerRank, nChannels * chunkCount);
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>::recvPeerNotify(nvls->out, 0, steps);
}
__syncwarp();
} else {
Primitives<T, RedOp, FanAsymmetric<1, 0>, /*Direct=*/0, Proto, 0>
prims(tid, nThreadsNetRecv, &nvls->out, nullptr, nullptr, work->recvbuff,
work->redOpArg, 0 * Proto::MaxGroupWidth, 0, 0);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
ssize_t railAllBeg = railGridOffset + part * chunkCount;
ssize_t railAllEnd = min(railAllBeg + chunkCount, nNodes * countPerRank);
ssize_t railOneBeg = ncclShmem.comm.node * countPerRank;
ssize_t railOneEnd = railOneBeg + countPerRank;
ssize_t beg = max(railAllBeg, railOneBeg);
ssize_t end = min(railAllEnd, railOneEnd);
prims.recv(beg - railOneBeg, max(ssize_t(0), end - beg), /*postOp=*/true);
}
}
/* gather used as sync */
prims.gather(0, 0, 0, 0, -1, 0);
} else if (tid < tidEndReduce) {
// Reduce through NVLS
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->down, NULL, work->recvbuff,
work->redOpArg, 3 * Proto::MaxGroupWidth, 0, 0, work);
for (size_t elemOffset = 0; elemOffset < channelCount; elemOffset += chunkCount) {
size_t outOffset = gridOffset + elemOffset;
size_t inpOffset = outOffset + rank * count;
nelem = min(chunkCount, channelCount - elemOffset);
// Coverity complains about a possible overrun inside the method invoked below, but that's actually
// a false positive.
// coverity[overrun-call:FALSE]
prims.directRecvCopy(inpOffset, outOffset, nelem);
} else {
if (tid < tidEndScatter) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL>;
Primitives<T, RedOp, FanAsymmetric<0, NCCL_MAX_NVLS_ARITY>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndNetRecv, nThreadsScatter, nullptr, nvls->up, work->sendbuff, nullptr,
work->redOpArg, 1 * Proto::MaxGroupWidth, 1, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*ReduceSendNotRecv=*/true> scat;
scat.work = work;
scat.chunkCount = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/0, /*Send=*/1>(scat);
}
} else if (tid < tidEndReduce) {
using Proto = ProtoSimple<1, 1, COLL_UNROLL, 1, 0>;
Primitives<T, RedOp, FanSymmetric<1>, /*Direct=*/1, Proto, 0>
prims(tid - tidEndScatter, nThreadsReduce, &nvls->down, &nvls->out, nullptr, nullptr,
work->redOpArg, 2 * Proto::MaxGroupWidth, 0, 1, work);
for (ssize_t railGridOffset = 0; railGridOffset < nNodes * countPerRank; railGridOffset += nChannels * chunkCount) {
Scatterer</*ReduceSendNotRecv=*/false> scat;
scat.work = work;
scat.chunkCount = chunkCount;
scat.railGridOffset = railGridOffset;
prims.template process</*Recv=*/1, /*Send=*/1>(scat);
}
}
/* send for sync */
prims.send(0, 0);
}
}
}
@ -231,7 +355,7 @@ struct RunWorkColl<ncclFuncReduceScatter, T, RedOp, NCCL_ALGO_COLLNET_DIRECT, NC
int chunkSize;
ssize_t railGridOffset;
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts>
template<int SlicePerChunk, int MinSrcs, int MaxSrcs, int MinDsts, int MaxDsts, int MultimemSrcs, int MultimemDsts>
__device__ __forceinline__ void operator()(
int tid, int tn, int slice, int maxSliceSize,
int nSrcs, void** srcPtrs, int nDsts, void** dstPtrs, int32_t* dstSizes, uint32_t sendDirectFlag, uint32_t recvDirectFlag


@ -0,0 +1,367 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers>
static __device__ void bcastDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
char* inputHere, char* outputRank0, bool inPlace, int nIters
) {
using Pack = BytePack<BytePerPack>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpHere = (Pack*)inputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack tmp[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = inpHere[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
int dr = inPlace ? 1 : 0;
int r = rank + dr;
if (r == nRanks) r = 0;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && dr == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
add4G(outRank0, r*stride4G)[u*WARP_SIZE] = tmp[u];
}
if (++r == nRanks) r = 0;
}
}
}
inpHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = inpHere[u*WARP_SIZE];
}
}
}
}
template<int UnrollPeers, typename T>
static __device__ void bcastEnds(
ncclSymPrims& prim, int tn, int t,
T* inputHere, T* outputRank0, bool inPlace, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpHere = (BytePack<sizeof(T)>*)inputHere;
BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nPreElts-nSufElts+i;
BytePack<sizeof(T)> tmp = inpHere[elt];
int dr = inPlace ? 1 : 0;
int r = rank + dr;
if (r == nRanks) r = 0;
#pragma unroll 1
for (; dr + UnrollPeers <= nRanks; dr += UnrollPeers) {
#pragma unroll UnrollPeers
for (int u=0; u < UnrollPeers; u++) {
*add4G(outRank0+elt, r*stride4G) = tmp;
if (++r == nRanks) r = 0;
}
}
#pragma unroll UnrollPeers
for (int u=0; u < UnrollPeers; u++) {
if (dr+u == nRanks) break;
*add4G(outRank0+elt, r*stride4G) = tmp;
if (++r == nRanks) r = 0;
}
}
}
template<typename T>
static __device__ void bcast(
ncclSymPrims& prim, int tn, int t, bool waitNeeded, T* input, T* output, size_t nElts
) {
bool inPlace = (input == output);
// Move to rank=0
output = prim.peerPtr(0, output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (128u - inputUptr)%128u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
bcastDeep<BytePerPack, UnrollPacks, UnrollPeers>(
prim, tn, t, waitNeeded,
(char*)input + cursor, (char*)output + cursor, inPlace,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, prim.nBlocks, prim.nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
bcastDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers>(
prim, tn, t, waitNeeded,
(char*)input + cursor, (char*)output + cursor, inPlace,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
bcastEnds<UnrollPeers>(prim, tn, t, input, output, inPlace, nElts, nPreBytes/sizeof(T), nSufElts);
}
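A worked example of how bcast() carves up a buffer; the numbers are illustrative, not from the source:

// Say nBytes = 1<<20, input is 128-byte aligned (so nPreBytes = 0), and input and
// output addresses differ by a multiple of 16.  In the 16-byte pass:
//   BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack = 4*4*32*16 = 8192
//   chunks = (1<<20)/8192 = 128, rounded down to a multiple of prim.nBlocks
// bcastDeep streams those chunks; whatever remains (plus any unaligned head) falls
// through to the 4-byte pass and finally to bcastEnds, one element at a time.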
__device__ __forceinline__ void ncclSymRun_AllGather_ST(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
int const& rank = prim.rank;
// Threads numbered over rank.
int bt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int btn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
bcast(prim, btn, bt, /*waitNeeded=*/true, (char*)args->input, (char*)args->output + rank*args->nElts, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename T>
static __device__ void bcastMultimem(
ncclSymPrims& prim, int tn, int t, T* input, T* output, size_t nElts
) {
// Move output to multimem
output = prim.multimemPtr(output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16-inputUptr)%16;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 8;
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
uint32_t nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = *reinterpret_cast<BytePack<BytePerPack>*>(inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
multimem_st_global(outputUptr + cursor + u*WARP_SIZE*BytePerPack, tmp[u]);
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Handle the prefix+suffix elements one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = *reinterpret_cast<BytePack<sizeof(T)>*>(inputUptr + cursor);
multimem_st_global(outputUptr + cursor, val);
cursor += tn*sizeof(T);
}
}
__device__ __forceinline__ void ncclSymRun_AllGather_STMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
int const& rank = prim.rank;
char* input = args->input;
char* output = args->output;
size_t bytes = args->nElts;
// Round robin memory to blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
bcastMultimem(prim, tn, t, input, output + rank*bytes, bytes);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename EltType>
static __device__ void allgather_LL_body(
ncclSymPrims &prim, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts
) {
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(EltType);
ncclCoopCta cta;
int rank = prim.rank;
int nRanks = prim.nRanks;
constexpr int tn = ncclSymMaxThreads;
int t = threadIdx.x;
#pragma unroll 1
while (0 < nElts) {
int nIterPacks = min(nPacks, tn);
if (t < nIterPacks) {
Pack x = loadPack<Pack>(input, t*EltPerPack, nElts);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, x);
}
int tn_div_nPacks = tn/nIterPacks;
int tn_mod_nPacks = tn%nIterPacks;
int peer = t/nIterPacks;
int pack = t%nIterPacks;
#if 1
// NOTE: Unrolling speedup on eos nranks=8 size=64K: 5.7us vs 6.7us
constexpr int Unroll = 4;
#pragma unroll 1
for (int i = t; i < (nRanks*nIterPacks & -(Unroll*tn)); i += Unroll*tn) {
Pack got[Unroll];
prim.template recvLL<Unroll, Unroll>(i, Unroll, tn, /*&*/got);
#pragma unroll
for (int u=0; u < Unroll; u++) {
storePack<Pack>(output + peer*nStrideElts, pack*EltPerPack, nElts, got[u]);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
}
int i = (nRanks*nIterPacks & -(Unroll*tn)) + t;
int n = (nRanks*nIterPacks)/tn % Unroll;
if (i + n*tn < nRanks*nIterPacks) n += 1;
if (n != 0) {
Pack got[Unroll];
prim.template recvLL<1, Unroll>(i, n, tn, /*&*/got);
#pragma unroll
for (int u=0; u < Unroll; u++) {
if (u != 0 && u == n) break;
storePack(output + peer*nStrideElts, pack*EltPerPack, nElts, got[u]);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
}
#else
// The non-unrolled but "obviously correct" implementation for reference.
#pragma unroll 1
for (int i = t; i < nRanks*nIterPacks; i += tn) {
Pack got = prim.template recvLL<Pack>(i);
storePack(output + peer*nStrideElts, pack*EltPerPack, nElts, got);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
#endif
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
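The LL slot layout the loop above relies on, spelled out (this restates the indexing in the code, nothing new):

// Per iteration there are nIterPacks packs in flight.  Rank r publishes pack p into
// LL slot r*nIterPacks + p (the bcastLL call), and every rank then walks all
// nRanks*nIterPacks slots, recovering
//   peer = slot / nIterPacks,  pack = slot % nIterPacks
// which is exactly the (peer, pack) stepping maintained incrementally in the
// unrolled receive loop.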
static __device__ void ncclSymRun_AllGather_LL_impl(ncclSymDevArgs const* args, bool multimem) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
using Pack = BytePack<8>;
constexpr int BytePerPack = 8;
int nElts = args->nElts;
int nPacks = divUp(nElts, BytePerPack);
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
int nBlockPacks = blockPackEnd - blockPackBegin;
int nBlockElts = nElts - blockPackBegin*BytePerPack;
nBlockElts = min(nBlockElts, nBlockPacks*BytePerPack);
char* blockInput = args->input + blockPackBegin*BytePerPack;
char* blockOutput = args->output + blockPackBegin*BytePerPack;
uint32_t lowBits = args->nElts;
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
if (__builtin_expect(lowBits%8 == 0, true)) {
// NOTE: Specializing for 8-byte alignment in one case helps at size=65K: 8.9us vs 5.6us
allgather_LL_body(prim, (BytePack<8>*)blockInput, (BytePack<8>*)blockOutput, nBlockElts/8, nBlockPacks, nElts/8);
} else {
allgather_LL_body(prim, blockInput, blockOutput, nBlockElts, nBlockPacks, nElts);
}
}
__device__ __forceinline__ void ncclSymRun_AllGather_LL(ncclSymDevArgs const* args) {
ncclSymRun_AllGather_LL_impl(args, /*multimem=*/false);
}
__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(ncclSymDevArgs const* args) {
ncclSymRun_AllGather_LL_impl(args, /*multimem=*/true);
}


@ -0,0 +1,432 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
static __device__ __forceinline__ void allreduceDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, char* inputRank0, char* outputRank0, int32_t nIters
) {
using Pack = BytePack<BytePerPack>;
using Acc = typename Red::EltType;
using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outRank0 = (Pack*)outputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack acc0[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
AccPack acc1[UnrollPacks];
int r = rank;
if (++r == nRanks) r = 0;
{ Pack tmp1[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, applyCast<T, Acc>(acc0[u]), applyCast<T, Acc>(tmp1[u]));
}
}
if (++r == nRanks) r = 0;
int dr = 2;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
Pack tmp1[UnrollPeers][UnrollPacks];
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
if (++r == nRanks) r = 0;
}
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, acc1[u], applyCast<T, Acc>(tmp1[ur][u]));
}
}
}
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) acc0[u] = applyCast<Acc, T>(acc1[u]);
dr = 0;
r = rank;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && dr == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
add4G(outRank0, r*stride4G)[u*WARP_SIZE] = acc0[u];
}
if (++r == nRanks) r = 0;
}
}
}
inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
}
}
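Summarizing the control flow of allreduceDeep above (a restatement of the loops, not new behaviour):

// acc  = input[rank]                                  (loaded before the barrier wait)
// for r = rank+1, rank+2, ... (wrapping mod nRanks):  acc += input[r], in Acc precision
// cast acc back to T, then store it to output[r] for every rank r, starting at rank
// UnrollPeers only batches the peer loads/stores; the "partial" pass mops up the
// remainder when nRanks is not a multiple of UnrollPeers.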
template<int UnrollPeers, typename Red, typename T>
static __device__ __forceinline__ void allreduceEnds(
ncclSymPrims& prim, int tn, int t, Red red,
T* inputRank0, T* outputRank0, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
using Acc = typename Red::EltType;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
BytePack<sizeof(T)>* outRank0 = (BytePack<sizeof(T)>*)outputRank0;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
BytePack<sizeof(Acc)> acc1;
BytePack<sizeof(T)> tmp[UnrollPeers];
int dr = 1;
int r = rank+1;
if (nRanks == r) r = 0;
bool first = true;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j = 0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
tmp[u] = *add4G(inpRank0+elt, r*stride4G);
r += 1;
if (r == nRanks) r = 0;
}
if (first) {
first = false;
acc1 = applyCast<T, Acc>(acc0);
}
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
acc1 = applyReduce(red, acc1, applyCast<T, Acc>(tmp[u]));
}
}
}
acc0 = applyCast<Acc, T>(acc1);
dr = 0;
r = rank;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j=0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && dr+u == nRanks) break;
*add4G(outRank0+elt, r*stride4G) = acc0;
r += 1;
if (r == nRanks) r = 0;
}
}
}
}
}
template<typename Red, typename T>
static __device__ void allreduce(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, T* input, T* output, size_t nElts
) {
int nRanks = prim.nRanks;
int nBlocks = prim.nBlocks;
// Move to rank=0
input = prim.peerPtr(0, input);
output = prim.peerPtr(0, output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16u - inputUptr)%16u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if ((inputUptr-outputUptr)%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
allreduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && (inputUptr-outputUptr)%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
allreduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
allreduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
int /*const&*/ rank = prim.rank;
int /*const&*/ nRanks = prim.nRanks;
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
// Threads numbered globally such that we round robin warps by rank then block.
int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
rank, nRanks,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int gtn = nRanks*prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
allreduce(prim, gtn, gt, /*waitNeeded=*/true, red, (T*)args->input, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename Red, typename T>
static __device__ void allreduceMultimem(
ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
) {
// Move to multimem
input = prim.multimemPtr(input);
output = prim.multimemPtr(output);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
constexpr int BytePerPack = LoadMultimem_BigPackSize<Red>::BigPackSize;
uint32_t nPreBytes = (BytePerPack - inputUptr)%BytePerPack;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if (alignof(T) == BytePerPack || (inputUptr-outputUptr)%BytePerPack == 0) {
constexpr int UnrollPacks = 16*8/BytePerPack;
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
int nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = applyLoadMultimem<Red, BytePerPack>(red, inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
multimem_st_global(outputUptr + cursor + u*WARP_SIZE*BytePerPack, tmp[u]);
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Handle the prefix+suffix elements one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
multimem_st_global(outputUptr + cursor, val);
cursor += tn*sizeof(T);
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
// Threads numbered globally such that we round robin warps by rank then block.
int gt = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.rank, prim.nRanks,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int gtn = prim.nRanks*prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
allreduceMultimem(prim, gtn, gt, red, (T*)args->input, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/true);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R_impl(ncclSymDevArgs const* args, bool multimem) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL | multimem*ncclSymPrims_UseMultimem);
int /*const&*/ rank = prim.rank;
using Acc = typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type;
Red<Acc> red(args->redOpArg);
using Pack = BytePack<8>;
using AccPack = BytePack<8*sizeof(Acc)/sizeof(T)>;
constexpr int EltPerPack = 8/sizeof(T);
int nElts = args->nElts;
int nPacks = divUp(nElts, EltPerPack);
bool packAligned = 8 <= alignof(T) || (
args->nElts*sizeof(T) |
(uint32_t)reinterpret_cast<uintptr_t>(args->input) |
(uint32_t)reinterpret_cast<uintptr_t>(args->output)
)%8 == 0;
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nPacks, prim.nBlocks, prim.nBlocks_rcp32);
int begin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int end = begin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
nPacks = end - begin;
nElts -= begin*EltPerPack;
nElts = min(nElts, nPacks*EltPerPack);
T* input = (T*)args->input + begin*EltPerPack;
T* output = (T*)args->output + begin*EltPerPack;
ncclCoopCta cta;
int t = threadIdx.x;
int tn = ncclSymMaxThreads;
if (__builtin_expect(packAligned, true)) {
#pragma unroll 1
while (0 < nPacks) {
if (t < nPacks) {
int nIterPacks = min(nPacks, tn);
Pack inp = loadPack<Pack>((Pack*)input, t, nPacks);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack((Pack*)output, t, nPacks, out);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nPacks -= tn;
}
} else {
#pragma unroll 1
while (0 < nElts) {
if (t*EltPerPack < nElts) {
int nIterPacks = min(nPacks, tn);
Pack inp = loadPack<Pack>(input, t*EltPerPack, nElts);
prim.bcastLL(/*slot=*/nIterPacks*rank + t, inp);
Pack out = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack(output, t*EltPerPack, nElts, out);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(ncclSymDevArgs const* args) {
ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/false);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(ncclSymDevArgs const* args) {
ncclSymRun_AllReduce_AGxLL_R_impl<Red, T>(args, /*multimem=*/true);
}

src/device/symmetric/generate.py Executable file

@ -0,0 +1,294 @@
#!/usr/bin/env python3
import os
import sys
################################################################################
# The first command line argument is the path to the directory to generate and
# populate.
gensrc = sys.argv[1]
if os.path.exists(gensrc):
for name in os.listdir(gensrc):
os.remove(os.path.join(gensrc, name))
#os.truncate(os.path.join(gensrc, name), 0)
else:
os.mkdir(gensrc)
def paste(sep, *args):
return sep.join(args)
indents = 0
def emitln(f, lines):
global indents
for ln in ((lines,) if isinstance(lines, str) else lines):
f.write(' '*indents + ln + '\n')
def indent(s):
return '\n'.join(' '+l for l in s.splitlines())
class Rec(object):
def __init__(me, **kw):
me.__dict__.update(kw)
def __eq__(x, y):
if len(x.__dict__) != len(y.__dict__): return False
for k in x.__dict__:
if k not in y.__dict__: return False
if x.__dict__[k] != y.__dict__[k]: return False
return True
def __hash__(me):
h = 0
for k in me.__dict__:
h += hash((k, me.__dict__[k]))
return h
################################################################################
# Edit this region for introducing new algos etc
reductions = ["AllReduce","ReduceScatter"]
all_reds = ["sum"]
all_tys = ["f32","f16","bf16","f8e4m3","f8e5m2"]
nvls_algos_by_coll = {
"AllReduce": ["AGxLLMC_R","RSxLDMC_AGxSTMC"],
"ReduceScatter": ["LDMC"]
}
ldmc_algos = ["RSxLDMC_AGxSTMC", "LDMC"]
coll_to_lower = {
"AllGather": "all_gather",
"AllReduce": "all_reduce",
"ReduceScatter": "reduce_scatter"
}
red_to_ncclDevRedOp = {
"sum": "ncclDevSum"
}
red_to_Func = {
"sum": "FuncSum"
}
ty_to_ncclDataType = {
"f32": "ncclFloat32",
"f16": "ncclFloat16",
"bf16": "ncclBfloat16",
"f8e4m3": "ncclFloat8e4m3",
"f8e5m2": "ncclFloat8e5m2"
}
ty_to_cxxtype = {
"f32": "float",
"f16": "half",
"bf16": "__nv_bfloat16",
"f8e4m3": "__nv_fp8_e4m3",
"f8e5m2": "__nv_fp8_e5m2"
}
def enumerate_kernels():
for algo in ["LL","LLMC","ST","STMC"]:
yield Rec(coll="AllGather", algo=algo)
for red in all_reds:
for ty in all_tys:
for algo in ["AGxLL_R","AGxLLMC_R","RSxLD_AGxST","RSxLDMC_AGxSTMC"]:
yield Rec(coll="AllReduce", algo=algo, red=red, ty=ty)
for algo in ["LL","LD","LDMC"]:
yield Rec(coll="ReduceScatter", algo=algo, red=red, ty=ty)
def required_cuda(k):
cudart, arch, specific_sms = 0, 0, None
is_nvls = k.algo in nvls_algos_by_coll.get(k.coll, [])
if is_nvls:
cudart = max(cudart, 12010)
arch = 900
if k.coll in reductions:
if k.ty == "bf16":
cudart = max(cudart, 11000)
if k.ty.startswith("f8"):
cudart = max(cudart, 11080)
arch = 900
if k.algo in ldmc_algos:
cudart = 12070
arch = None
specific_sms = ["100a", "101a", "100f", "101f", "120a", "121a"]
return (cudart, arch, specific_sms)
################################################################################
def kernel_fdep(k):
return coll_to_lower[k.coll] + '.cu'
def kernel_fname(k):
if k.coll in reductions:
if k.algo in ldmc_algos and k.ty.startswith('f8'):
return paste('_', coll_to_lower[k.coll], k.red, k.ty, k.algo) + '.cu'
else:
return paste('_', coll_to_lower[k.coll], k.red, k.ty) + '.cu'
else:
return coll_to_lower[k.coll] + '.cu'
def kernel_gencode(k):
if k.coll in reductions and k.algo in ldmc_algos and k.ty.startswith('f8'):
return "$(NVCC_GENCODE_LDMC_FP8)"
else:
return "$(NVCC_GENCODE)"
def kernel_cname(k):
if k.coll in reductions:
return paste("_", "ncclSymDevKernel", k.coll, k.algo, k.red, k.ty)
else:
return paste("_", "ncclSymDevKernel", k.coll, k.algo)
def kernel_conds(k):
cudart, arch, specific_sms = required_cuda(k)
if cudart == 0: return (None, None)
cudart_cond = "CUDART_VERSION >= %d"%cudart
if not specific_sms:
arch_cond = "__CUDA_ARCH__ >= %d"%arch
else:
arch_cond = " || ".join(["0"] + ["NCCL_CUDA_ARCH_%sSPECIFIC==%d"%("FAMILY_" if sm[-1] == "f" else "", 10*int(sm.replace('a', '').replace('f', ''))) for sm in specific_sms])
return cudart_cond, arch_cond
def instantiate(k):
cudart_cond, arch_cond = kernel_conds(k)
if (cudart_cond, arch_cond) == (None, None):
form_red_ty = (
"__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" ncclSymRun_{id}<{red}, {ty}>(&args);\n"
"}}"
)
form = (
"__global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" ncclSymRun_{id}(&args);\n"
"}}"
)
else:
form_red_ty = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" #if {arch_cond}\n"
" ncclSymRun_{id}<{red}, {ty}>(&args);\n"
" #endif\n"
" }}\n"
"#endif"
)
form = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs NCCL_GRID_CONSTANT const args) {{\n"
" #if {arch_cond}\n"
" ncclSymRun_{id}(&args);\n"
" #endif\n"
" }}\n"
"#endif"
)
id = k.coll+'_'+k.algo
cname = kernel_cname(k)
if k.coll in reductions:
inst = form_red_ty.format(cname=cname, id=id, red=red_to_Func[k.red], ty=ty_to_cxxtype[k.ty], cudart_cond=cudart_cond, arch_cond=arch_cond)
else:
inst = form.format(cname=cname, id=id, cudart_cond=cudart_cond, arch_cond=arch_cond)
return inst
def prototype(k):
cudart_cond, arch_cond = kernel_conds(k)
if cudart_cond is None:
form = "__global__ void {cname}(ncclSymDevArgs const);"
else:
form = (
"#if {cudart_cond}\n"
" __global__ void {cname}(ncclSymDevArgs const);\n"
"#else\n"
" constexpr void* {cname} = nullptr;\n"
"#endif"
)
return form.format(cname=kernel_cname(k), cudart_cond=cudart_cond)
################################################################################
def partition(vals, keyfn):
ans = {}
for x in vals:
k = keyfn(x)
if k not in ans:
ans[k] = []
ans[k].append(x)
return ans
kernels_by_file = partition(enumerate_kernels(), lambda k: (kernel_fname(k), k.coll))
# Add dependency-only files (e.g. allreduce.cu)
for coll in set(k.coll for k in enumerate_kernels()):
fname = coll_to_lower[coll]+'.cu'
if (fname, coll) not in kernels_by_file:
kernels_by_file[fname, coll] = []
# Generate each kernel instantiation file
for (fname, coll), ks in kernels_by_file.items():
with open(os.path.join(gensrc, fname), "w") as f:
emitln(f, '#include "symmetric.h"')
emitln(f, '#include "symmetric/kernel.cuh"')
emitln(f, '#include "symmetric/{coll}.cuh"'.format(coll=coll_to_lower[coll]))
for k in ks:
emitln(f, instantiate(k))
# Generate <gensrc>/symmetric_kernels.cc
with open(os.path.join(gensrc, "symmetric_kernels.cc"), "w") as f:
emitln(f, '#include "symmetric.h"')
emitln(f, '#include "device.h"')
emitln(f, '')
for k in enumerate_kernels():
emitln(f, prototype(k))
emitln(f, '')
emitln(f, 'extern int const ncclSymKernelCount = %d;' % len(list(enumerate_kernels())))
emitln(f, 'extern void* const ncclSymKernelList[] = {')
for k in enumerate_kernels():
emitln(f, '(void*){cname},'.format(cname=kernel_cname(k)))
emitln(f, 'nullptr};')
emitln(f, '')
emitln(f, 'void* ncclSymGetKernelPtr(ncclSymKernelId id, int red, ncclDataType_t ty) {')
indents += 1
emitln(f, 'switch (id) {')
emitln(f, 'default: return nullptr;')
for (coll, algo), coll_algo_ks in partition(enumerate_kernels(), lambda k: (k.coll, k.algo)).items():
emitln(f, 'case ncclSymKernelId_'+coll+'_'+algo+':')
indents += 1
if len(coll_algo_ks) == 1:
emitln(f, 'return (void*)&'+kernel_cname(coll_algo_ks[0])+';')
else:
emitln(f, 'switch ((ncclDevRedOp_t)red) {')
emitln(f, 'default: return nullptr;')
for red, coll_algo_red_ks in partition(coll_algo_ks, lambda k: k.red).items():
emitln(f, 'case '+red_to_ncclDevRedOp[red]+':')
indents += 1
emitln(f, 'switch (ty) {')
emitln(f, 'default: return nullptr;')
for k in coll_algo_red_ks:
emitln(f, 'case '+ty_to_ncclDataType[k.ty]+': return (void*)'+kernel_cname(k)+';')
emitln(f, '}')
indents -= 1
emitln(f, '}')
indents -= 1
emitln(f, '}')
indents -= 1
emitln(f, '}')
# Generate <gensrc>/rules.mk
with open(os.path.join(gensrc, "rules.mk"), "w") as f:
inst_names = sorted(set(kernel_fname(k) for k in enumerate_kernels()))
names = inst_names + ["symmetric_kernels.cc"]
f.write("LIB_OBJS_SYM_GEN = $(patsubst %,$(OBJDIR)/genobj/symmetric/%.o,{names})\n"
.format(names=" ".join(names)))
f.write("\n")
inst_names = sorted(set((k.coll, kernel_fname(k), kernel_gencode(k)) for k in enumerate_kernels()))
for coll, name, gencode in inst_names:
f.write(
"$(OBJDIR)/genobj/symmetric/{name}.o: $(OBJDIR)/gensrc/symmetric $(OBJDIR)/genobj/symmetric/{coll}.cu.d\n"
"\t" "$(call COMPILE_SYM,$@,$(OBJDIR)/gensrc/symmetric/{name},{gencode})\n"
"\n"
.format(name=name, coll=coll_to_lower[coll], gencode=gencode)
)


@ -0,0 +1,27 @@
#ifndef NCCL_DEVICE_SYMMETRIC_KERNEL_H_
#define NCCL_DEVICE_SYMMETRIC_KERNEL_H_
#include "symmetric.h"
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLL_R(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_AGxLLMC_R(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLD_AGxST(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_AllReduce_RSxLDMC_AGxSTMC(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_LL(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_LLMC(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_ST(struct ncclSymDevArgs const* args);
__device__ __forceinline__ void ncclSymRun_AllGather_STMC(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(struct ncclSymDevArgs const* args);
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(struct ncclSymDevArgs const* args);
#endif


@ -0,0 +1,420 @@
#ifndef NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
#define NCCL_DEVICE_SYMMETRIC_PRIMITIVES_H_
#include "symmetric.h"
#include "bitops.h"
#include "collectives.h"
#include "op128.h"
#include "reduce_kernel.h"
#if __CUDA_ARCH__ >= 700
// __grid_constant__ appears to break cuda-gdb
#define NCCL_GRID_CONSTANT __grid_constant__
#else
#define NCCL_GRID_CONSTANT
#endif
// flattenIx(pos0, dim0, pos1, dim1, pos2, dim2, ...)
// Given a position vector `pos` in a rectangular index space with lengths in the `dim`
// vector, flatten that down to a linear index. The fastest moving dimension is given first.
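// Example: flattenIx(x,X, y,Y, z,Z) == x + X*(y + Y*z), so x varies fastest.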
__device__ __forceinline__ int flattenIx() { return 0; }
template<typename Int0, typename Int1, typename ...Ints>
static __device__ Int0 flattenIx(Int0 pos, Int1 size, Ints ...more) {
return pos + size*flattenIx(more...);
}
// Precomputed integer reciprocals for denominator values 1..64 inclusive.
// Pass these to idivFast64() for fast division on the GPU.
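// The reciprocal turns each division into a multiply plus shift, which is far cheaper
// than an integer divide on the GPU.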
static __device__ uint64_t idivRcp64_upto64(int x) {
static constexpr uint64_t table[65] = {
idivRcp64(0x01), idivRcp64(0x01), idivRcp64(0x02), idivRcp64(0x03),
idivRcp64(0x04), idivRcp64(0x05), idivRcp64(0x06), idivRcp64(0x07),
idivRcp64(0x08), idivRcp64(0x09), idivRcp64(0x0a), idivRcp64(0x0b),
idivRcp64(0x0c), idivRcp64(0x0d), idivRcp64(0x0e), idivRcp64(0x0f),
idivRcp64(0x10), idivRcp64(0x11), idivRcp64(0x12), idivRcp64(0x13),
idivRcp64(0x14), idivRcp64(0x15), idivRcp64(0x16), idivRcp64(0x17),
idivRcp64(0x18), idivRcp64(0x19), idivRcp64(0x1a), idivRcp64(0x1b),
idivRcp64(0x1c), idivRcp64(0x1d), idivRcp64(0x1e), idivRcp64(0x1f),
idivRcp64(0x20), idivRcp64(0x21), idivRcp64(0x22), idivRcp64(0x23),
idivRcp64(0x24), idivRcp64(0x25), idivRcp64(0x26), idivRcp64(0x27),
idivRcp64(0x28), idivRcp64(0x29), idivRcp64(0x2a), idivRcp64(0x2b),
idivRcp64(0x2c), idivRcp64(0x2d), idivRcp64(0x2e), idivRcp64(0x2f),
idivRcp64(0x30), idivRcp64(0x31), idivRcp64(0x32), idivRcp64(0x33),
idivRcp64(0x34), idivRcp64(0x35), idivRcp64(0x36), idivRcp64(0x37),
idivRcp64(0x38), idivRcp64(0x39), idivRcp64(0x3a), idivRcp64(0x3b),
idivRcp64(0x3c), idivRcp64(0x3d), idivRcp64(0x3e), idivRcp64(0x3f),
idivRcp64(0x40)
};
return table[x];
}
static __device__ uint32_t idivRcp32_upto64(int x) {
return idivRcp64_upto64(x)>>32;
}
namespace {
struct ncclCoopCta {
__device__ void sync() { __syncthreads(); }
__device__ int self() { return threadIdx.x; }
__device__ int count() { return blockDim.x; }
};
struct ncclCoopWarps {
int log2_nWarps;
__device__ void sync() {
asm volatile("barrier.sync %0, %1;" :: "r"(1 + (threadIdx.x>>(5+log2_nWarps))), "r"(32<<log2_nWarps) : "memory");
}
__device__ int self() { return threadIdx.x & ((32<<log2_nWarps)-1); }
__device__ int count() { return 32<<log2_nWarps; }
};
struct ncclCoopWarp {
__device__ void sync() { __syncwarp(); }
__device__ int self() { return threadIdx.x%32; }
__device__ int count() { return 32; }
};
}
namespace {
static constexpr int ncclSymPrims_UseBarrier = 1;
static constexpr int ncclSymPrims_UseLL = 2;
static constexpr int ncclSymPrims_UseMultimem = 4;
struct ncclSymPrims {
int flags;
int const &rank;
int const &nRanks;
uint32_t const &nRanks_rcp32;
int block, nBlocks;
uint32_t nBlocks_rcp32;
uint32_t nBlocks_nWarps_rcp32;
uint32_t nRanks_nBlocks_rcp32;
uint32_t nWarpPerRank, nWarpPerRank_rcp32;
struct ncclSymDevBase* const &base;
uintptr_t offsetMc;
uint32_t const &stride4G;
uint32_t barEpoch;
uint32_t llEpoch;
__device__ ncclSymPrims(ncclSymDevComm const &comm, int flags):
flags(flags),
rank(comm.rank),
nRanks(comm.nRanks),
nRanks_rcp32(comm.nRanks_rcp32),
block(blockIdx.x),
nBlocks(gridDim.x),
nBlocks_rcp32(idivRcp32_upto64(nBlocks)),
nBlocks_nWarps_rcp32(imulRcp32(nBlocks, nBlocks_rcp32, blockDim.x/32, idivRcp32_upto64(blockDim.x/32))),
nRanks_nBlocks_rcp32(imulRcp32(nRanks, nRanks_rcp32, gridDim.x, nBlocks_rcp32)),
nWarpPerRank(idivFast32(nBlocks*blockDim.x/32, nRanks, nRanks_rcp32)),
nWarpPerRank_rcp32(idivRcp32_upto64(nWarpPerRank)),
base(comm.base),
offsetMc((flags & ncclSymPrims_UseMultimem) ? (char*)comm.baseMc - (char*)base : 0x0),
stride4G(comm.stride4G) {
#if CUDART_VERSION >= 12030 && __CUDA_ARCH__ >= 900
cudaGridDependencySynchronize();
#endif
if ((flags & ncclSymPrims_UseBarrier) && threadIdx.x < nRanks) {
barEpoch = (flags & ncclSymPrims_UseMultimem) ? base->barEpochMc[block] : base->barEpochUc[block];
}
if (flags & ncclSymPrims_UseLL) llEpoch = base->llEpoch[block] + 2;
}
__device__ ~ncclSymPrims() {
if (threadIdx.x == 0) {
if (flags & ncclSymPrims_UseBarrier) {
((flags & ncclSymPrims_UseMultimem) ? base->barEpochMc : base->barEpochUc)[block] = barEpoch;
}
if (flags & ncclSymPrims_UseLL) base->llEpoch[block] = llEpoch - 2;
}
}
template<typename T>
__device__ T* peerPtr(int peer, T* selfPtr) {
return add4G(selfPtr, (peer-rank)*stride4G);
}
template<typename T>
__device__ T* multimemPtr(T* selfPtr) {
return reinterpret_cast<T*>(reinterpret_cast<uintptr_t>(selfPtr) + offsetMc);
}
__device__ void barrierArrive(ncclCoopCta cta, bool release) {
cta.sync();
#if __CUDA_ARCH__ < 700
if (release) {
if (cta.self() == 0) __threadfence_system();
cta.sync();
}
#endif
if (flags & ncclSymPrims_UseMultimem) {
#if __CUDA_ARCH__ >= 900 && CUDART_VERSION >= 12010
if (cta.self() == 0) {
uint32_t* inbox = &multimemPtr(base)->barInboxMc[block];
if (release) {
asm volatile("multimem.red.release.sys.add.u32 [%0],1;" :: "l"(inbox));
} else {
asm volatile("multimem.red.relaxed.sys.add.u32 [%0],1;" :: "l"(inbox));
}
}
#endif
} else {
int r = cta.self();
if (r != rank && r < nRanks) {
uint32_t* inbox = &peerPtr(r, base)->barInboxPerPeer[block*nRanks + rank];
#if __CUDA_ARCH__ >= 700
if (release) {
asm volatile("st.release.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
} else {
asm volatile("st.relaxed.sys.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
}
#else
asm volatile("st.volatile.u32 [%0],%1;" :: "l"(inbox), "r"(barEpoch+1));
#endif
}
}
}
__device__ void barrierWait(ncclCoopCta cta, bool acquire) {
if (flags & ncclSymPrims_UseMultimem) {
#if __CUDA_ARCH__ >= 900
if (cta.self() == 0) {
uint32_t* inbox = &base->barInboxMc[block];
while (true) {
uint32_t got;
if (acquire) {
asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
} else {
asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
}
if (got-(barEpoch+nRanks) <= uint32_t(-1)>>1) break;
}
barEpoch += nRanks;
}
#endif
} else {
int r = cta.self();
if (r != rank && r < nRanks) {
uint32_t* inbox = &base->barInboxPerPeer[block*nRanks + r];
while (true) {
uint32_t got;
#if __CUDA_ARCH__ >= 700
if (acquire) {
asm volatile("ld.acquire.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
} else {
asm volatile("ld.relaxed.sys.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
}
#else
asm volatile("ld.volatile.u32 %0,[%1];" : "=r"(got) : "l"(inbox));
#endif
if (got-(barEpoch+1) <= uint32_t(-1)>>1) break;
}
}
#if __CUDA_ARCH__ < 700
if (acquire) {
cta.sync();
if (cta.self() == 0) __threadfence();
}
#endif
barEpoch += 1;
}
cta.sync();
}
__device__ void endLL(ncclCoopCta cta) {
if (__builtin_expect(llEpoch >= -2u, false)) {
cta.sync();
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch);
int epochSize = ncclSymLLEpochSize(nRanks);
#pragma unroll 4
for (int i=cta.self(); i*16 < epochSize; i += cta.count()) {
buf[i] = uint4{0, 0, 0, 0};
}
}
cta.sync();
llEpoch += (llEpoch == -1u) ? 3 : 1;
}
template<typename T>
__device__ void sendLL(int peer, int slot, T val) {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* buf = ncclSymDevBase_getLLBuf(peerPtr(peer, base), nRanks, block, llEpoch) + slot;
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
}
template<typename T>
__device__ void bcastLL(int slot, T val) {
if (flags & ncclSymPrims_UseMultimem) {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* bufmc = ncclSymDevBase_getLLBuf(multimemPtr(base), nRanks, block, llEpoch) + slot;
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(bufmc + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
} else {
union { T tmp; uint32_t u32[divUp(sizeof(T),8)][2]; };
tmp = val;
uint4* buf0 = ncclSymDevBase_getLLBuf(peerPtr(0, base), nRanks, block, llEpoch) + slot;
int dr = 0;
int r = rank;
#pragma unroll 1
for (; dr+8 <= nRanks; dr += 8) {
#pragma unroll
for (int ur=0; ur < 8; ur++) {
uint4* buf = add4G(buf0, r*stride4G);
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
r += 1;
if (r == nRanks) r = 0;
}
}
#pragma unroll
for (int ur=0; ur < 8; ur++, dr++) {
if (dr == nRanks) break;
uint4* buf = add4G(buf0, r*stride4G);
#pragma unroll
for (int u=0; u < divUp(sizeof(T),8); u++) {
asm volatile("st.volatile.v4.u32 [%0],{%1,%3,%2,%3};" :: "l"(buf + ncclSymLLMaxSlots(sizeof(T))*u), "r"(u32[u][0]), "r"(u32[u][1]), "r"(llEpoch));
}
r += 1;
if (r == nRanks) r = 0;
}
}
}
template<int nSlotsMin, int nSlotsMax, typename T>
__device__ void recvLL(int slot0, int nSlots, int stride, T(&elts)[nSlotsMax]) {
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0;
uint4 tmp[nSlotsMax][divUp(sizeof(T),8)];
//int spins=0;
while (true) {
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
if (u < nSlotsMin || u < nSlots) {
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(tmp[u][v].x), "=r"(tmp[u][v].y), "=r"(tmp[u][v].z), "=r"(tmp[u][v].w) : "l"(buf + u*stride + v*ncclSymLLMaxSlots(sizeof(T))));
}
}
}
bool okAll = true;
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
if (u < nSlotsMin || u < nSlots) {
bool ok = tmp[u][v].y == llEpoch &&
tmp[u][v].w == llEpoch;
okAll &= ok;
}
}
}
if (__builtin_expect(okAll, true)) break;
//if (spins++ == 10<<20) spins=0;
}
#pragma unroll
for (int u=0; u < nSlotsMax; u++) {
if (nSlotsMin <= u && u == nSlots) break;
union { T val; uint32_t u32[divUp(sizeof(T),8)][2]; };
#pragma unroll
for (int v=0; v < divUp(sizeof(T),8); v++) {
u32[v][0] = tmp[u][v].x;
u32[v][1] = tmp[u][v].z;
}
elts[u] = val;
}
}
template<typename Pack, typename T, typename Red, int Unroll=8>
__device__ Pack recvReduceLL(int slot, int stride, Red red) {
using Acc = typename Red::EltType;
using AccPack = BytePack<sizeof(Pack)*sizeof(Acc)/sizeof(T)>;
AccPack acc;
bool first = true;
int r = 0;
#pragma unroll 1
for (; r+Unroll <= nRanks; r += Unroll) {
Pack got[Unroll];
this->template recvLL</*Min=*/Unroll>(slot + r*stride, Unroll, stride, got);
AccPack acc0 = applyCast<T, Acc>(got[0]);
acc = first ? acc0 : applyReduce(red, acc, acc0);
first = false;
#pragma unroll
for (int i=1; i < Unroll; i++) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
}
if (r < nRanks) {
Pack got[Unroll];
this->template recvLL</*Min=*/1>(slot + r*stride, nRanks-r, stride, got);
AccPack acc0 = applyCast<T, Acc>(got[0]);
acc = first ? acc0 : applyReduce(red, acc, acc0);
#pragma unroll
for (int i=1; i < Unroll-1; i++) {
if (r+i < nRanks) acc = applyReduce(red, acc, applyCast<T, Acc>(got[i]));
}
}
return applyCast<Acc, T>(acc);
}
template<typename T>
__device__ T recvLL(int slot) {
T one[1];
this->template recvLL<1, 1, T>(slot, 1, 0, one);
return one[0];
}
template<typename Coop, typename T>
__device__ void coopRecvLL(Coop coop, int slot0, int nSlots, T* dst) {
int me = coop.self();
if (me < nSlots) {
uint4* buf = ncclSymDevBase_getLLBuf(base, nRanks, block, llEpoch) + slot0 + me;
uint4 got[divUp(sizeof(T), 8)];
//int spins=0;
#pragma unroll 1
while (true) {
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
asm volatile("ld.volatile.v4.u32 {%0,%1,%2,%3},[%4];" : "=r"(got[u].x), "=r"(got[u].y), "=r"(got[u].z), "=r"(got[u].w) : "l"(buf + u*ncclSymLLMaxSlots(sizeof(T))));
}
bool ok = true;
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
ok &= got[u].y == llEpoch;
ok &= got[u].w == llEpoch;
}
if (__builtin_expect(ok, true)) break;
//if (++spins == 10<<20) { spins=0; printf("r=%d LL spin @ ix=%d got=%d want=%d\n", rank, slot0+me, got[0].y, llEpoch); }
}
union { T val; uint32_t u32[divUp(sizeof(T), 8)][2]; };
#pragma unroll
for (int u=0; u < divUp(sizeof(T), 8); u++) {
u32[u][0] = got[u].x;
u32[u][1] = got[u].z;
}
dst[slot0 + me] = val;
}
}
};
}
template<template<typename> typename Red, typename T, bool nvls>
struct ncclSymAccumType { using Type = T; };
// Only Red's whose opArg is invariant w.r.t. the datatype can have a different
// accumulator type. At the moment this excludes integer min/max, sumpostdiv,
// and premulsum.
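// For example, FuncSum over __half below accumulates in float to reduce rounding error
// when summing contributions from many ranks.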
template<> struct ncclSymAccumType<FuncSum, __half, false> { using Type = float; };
#if defined(__CUDA_BF16_TYPES_EXIST__)
template<> struct ncclSymAccumType<FuncSum, __nv_bfloat16, false> { using Type = float; };
#endif
#if defined(__CUDA_FP8_TYPES_EXIST__)
template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e4m3, false> { using Type = float; };
template<> struct ncclSymAccumType<FuncSum, __nv_fp8_e5m2, false> { using Type = float; };
#endif
#endif


@ -0,0 +1,387 @@
#include "symmetric.h"
#include "symmetric/kernel.cuh"
#include "symmetric/primitives.cuh"
template<int BytePerPack, int UnrollPacks, int UnrollPeers, typename T, typename Red>
static __device__ void reduceDeep(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, char* inputRank0, char* outputHere, int32_t nIters
) {
using Pack = BytePack<BytePerPack>;
using Acc = typename Red::EltType;
using AccPack = BytePack<BytePerPack*sizeof(Acc)/sizeof(T)>;
int wn = tn/WARP_SIZE;
int w = t/WARP_SIZE;
int lane = t%WARP_SIZE;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
Pack* inpRank0 = (Pack*)inputRank0 + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack* outHere = (Pack*)outputHere + intptr_t(w)*UnrollPacks*WARP_SIZE + lane;
Pack acc0[UnrollPacks];
nIters -= w;
if (0 < nIters) {
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
if (0 < nIters) {
while (true) {
AccPack acc1[UnrollPacks];
int r = rank+1;
if (r == nRanks) r = 0;
{ Pack tmp1[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp1[u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, applyCast<T, Acc>(acc0[u]), applyCast<T, Acc>(tmp1[u]));
}
}
r += 1;
if (r == nRanks) r = 0;
int dr = 2;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int i = 0;
partial ? i < 1 : (dr + UnrollPeers <= nRanks);
partial ? i++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
Pack tmp1[UnrollPeers][UnrollPacks];
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
tmp1[ur][u] = add4G(inpRank0, r*stride4G)[u*WARP_SIZE];
}
r += 1;
if (r == nRanks) r = 0;
}
#pragma unroll
for (int ur=0; ur < UnrollPeers-partial; ur++) {
if (partial && ur!=0 && dr+ur == nRanks) break;
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) {
acc1[u] = applyReduce(red, acc1[u], applyCast<T, Acc>(tmp1[ur][u]));
}
}
}
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) acc0[u] = applyCast<Acc, T>(acc1[u]);
#pragma unroll UnrollPacks
for (int u=0; u < UnrollPacks; u++) outHere[u*WARP_SIZE] = acc0[u];
inpRank0 += intptr_t(wn)*UnrollPacks*WARP_SIZE;
outHere += intptr_t(wn)*UnrollPacks*WARP_SIZE;
nIters -= wn;
if (nIters <= 0) break;
// Load data for next iteration.
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
acc0[u] = add4G(inpRank0, rank*stride4G)[u*WARP_SIZE];
}
}
}
}
template<int UnrollPeers, typename Red, typename T>
static __device__ void reduceEnds(
ncclSymPrims& prim, int tn, int t, Red red,
T* inputRank0, T* outputHere, size_t nElts, uint32_t nPreElts, size_t nSufElts
) {
using Acc = typename Red::EltType;
int const& rank = prim.rank;
int const& nRanks = prim.nRanks;
uint32_t const& stride4G = prim.stride4G;
BytePack<sizeof(T)>* inpRank0 = (BytePack<sizeof(T)>*)inputRank0;
BytePack<sizeof(T)>* outHere = (BytePack<sizeof(T)>*)outputHere;
#pragma unroll 1
for (size_t i = t; i < nPreElts+nSufElts; i += tn) {
size_t elt = i < nPreElts ? i : nElts-nSufElts-nPreElts+i;
BytePack<sizeof(T)> acc0 = *add4G(inpRank0+elt, rank*stride4G);
BytePack<sizeof(Acc)> acc1;
BytePack<sizeof(T)> tmp[UnrollPeers];
int dr = 1;
int r = rank+1;
if (nRanks == r) r = 0;
bool first = true;
#pragma unroll 2
for (int partial=0; partial <= 1; partial++) {
#pragma unroll 1
for (int j = 0;
partial ? j < 1 : (dr + UnrollPeers <= nRanks);
partial ? j++ : (dr += UnrollPeers)) {
if (partial && dr == nRanks) break;
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
tmp[u] = *add4G(inpRank0+elt, r*stride4G);
r += 1;
if (r == nRanks) r = 0;
}
if (first) {
first = false;
acc1 = applyCast<T, Acc>(acc0);
}
#pragma unroll
for (int u=0; u < UnrollPeers-partial; u++) {
if (partial && u!=0 && dr+u == nRanks) break;
acc1 = applyReduce(red, acc1, applyCast<T, Acc>(tmp[u]));
}
}
}
acc0 = applyCast<Acc, T>(acc1);
outHere[elt] = acc0;
}
}
template<typename Red, typename T>
static __device__ void reduce(
ncclSymPrims& prim, int tn, int t, bool waitNeeded,
Red red, T* input, T* output, size_t nElts
) {
int nRanks = prim.nRanks;
int nBlocks = prim.nBlocks;
// Move input to rank=0
input = prim.peerPtr(0, input);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
uint32_t alignment = uint32_t(inputUptr - outputUptr);
size_t nBytes = nElts*sizeof(T);
uint32_t nPreBytes = (16u - inputUptr)%16u;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t cursor = nPreBytes;
constexpr int MinWarpPerBlock = 4;
if (alignment%16 == 0) {
constexpr int BytePerPack = 16, UnrollPacks = 4, UnrollPeers = 2;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
reduceDeep<BytePerPack, UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (sizeof(T) == 4 || (sizeof(T) < 4 && alignment%4 == 0)) {
constexpr int BytePerPack = 4, UnrollPacks = 4, UnrollPeers = 4;
constexpr int BytePerChunk = MinWarpPerBlock*UnrollPacks*WARP_SIZE*BytePerPack;
uint32_t chunks = (nBytes-cursor)/BytePerChunk;
chunks -= imodFast32(chunks, nRanks*nBlocks, prim.nRanks_nBlocks_rcp32);
if (chunks != 0) {
uintptr_t cursorAfter = cursor + uintptr_t(chunks)*BytePerChunk;
reduceDeep<(sizeof(T) <= BytePerPack ? BytePerPack : 0), UnrollPacks, UnrollPeers, T>(
prim, tn, t, waitNeeded, red,
(char*)input + cursor, (char*)output + cursor,
chunks*MinWarpPerBlock
);
cursor = cursorAfter;
waitNeeded = false;
}
}
if (waitNeeded) prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
constexpr int UnrollPeers = 8;
size_t nSufElts = (nBytes-cursor)/sizeof(T);
reduceEnds<UnrollPeers>(prim, tn, t, red, input, output, nElts, nPreBytes/sizeof(T), nSufElts);
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LD(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
// Round robin warps over blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
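// i.e. t = lane + WARP_SIZE*(block + nBlocks*warp): warps are interleaved across blocks.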
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
//prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
reduce(prim, tn, t, /*waitNeeded=*/true, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
template<typename Red, typename T>
static __device__ void reduceMultimem(
ncclSymPrims& prim, int tn, int t, Red red, T* input, T* output, size_t nElts
) {
// Move input to multimem
input = prim.multimemPtr(input);
uintptr_t inputUptr = reinterpret_cast<uintptr_t>(input);
uintptr_t outputUptr = reinterpret_cast<uintptr_t>(output);
size_t nBytes = nElts*sizeof(T);
constexpr int BytePerPack = LoadMultimem_BigPackSize<Red>::BigPackSize;
uint32_t nPreBytes = (BytePerPack - inputUptr)%BytePerPack;
nPreBytes = min((size_t)nPreBytes, nBytes);
uintptr_t nSufBytes;
if (sizeof(T) == BytePerPack || (inputUptr-outputUptr)%BytePerPack == 0) {
constexpr int UnrollPacks = 8*(16/BytePerPack);
constexpr int BytePerChunk = UnrollPacks*WARP_SIZE*BytePerPack;
uintptr_t cursor = nPreBytes;
uint32_t nChunks = (nBytes-cursor)/BytePerChunk;
uintptr_t cursorAfter = cursor + uintptr_t(nChunks)*BytePerChunk;
nSufBytes = nBytes - cursorAfter;
cursor += (t/WARP_SIZE)*UnrollPacks*WARP_SIZE*BytePerPack;
cursor += (t%WARP_SIZE)*BytePerPack;
int nIters = nChunks - t/WARP_SIZE;
#pragma unroll 1
while (0 < nIters) {
BytePack<BytePerPack> tmp[UnrollPacks];
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
tmp[u] = applyLoadMultimem<Red, BytePerPack>(red, inputUptr + cursor + u*WARP_SIZE*BytePerPack);
}
#pragma unroll
for (int u=0; u < UnrollPacks; u++) {
*reinterpret_cast<BytePack<BytePerPack>*>(outputUptr + cursor + u*WARP_SIZE*BytePerPack) = tmp[u];
}
cursor += tn*UnrollPacks*BytePerPack;
nIters -= tn/WARP_SIZE;
}
} else {
nPreBytes = 0;
nSufBytes = nBytes;
}
// Get the prefix+suffix element one at a time.
#pragma unroll 4
for (uintptr_t i = t*sizeof(T); i < nPreBytes + nSufBytes; i += tn*sizeof(T)) {
uintptr_t cursor = i < nPreBytes ? i : nBytes-nSufBytes+(i-nPreBytes);
BytePack<sizeof(T)> val = applyLoadMultimem<Red, sizeof(T)>(red, inputUptr + cursor);
*reinterpret_cast<BytePack<sizeof(T)>*>(outputUptr + cursor) = val;
cursor += tn*sizeof(T);
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LDMC(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseBarrier|ncclSymPrims_UseMultimem);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/true>::Type> red(args->redOpArg);
// Round robin warps over blocks.
int t = flattenIx(threadIdx.x%WARP_SIZE, WARP_SIZE,
prim.block, prim.nBlocks,
threadIdx.x/WARP_SIZE, blockDim.x/WARP_SIZE);
int tn = prim.nBlocks*blockDim.x;
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
reduceMultimem(prim, tn, t, red, (T*)args->input + prim.rank*args->nElts, (T*)args->output, args->nElts);
prim.barrierArrive(ncclCoopCta(), /*release=*/false);
prim.barrierWait(ncclCoopCta(), /*acquire=*/false);
}
// T is user type, EltType is the most aligned type
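// (When the size and both pointers are 8-byte aligned, the caller passes BytePack<8>
// pointers and pack counts; otherwise EltType==T and nElts counts elements.)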
template<typename T, typename Red, typename EltType>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL_body(
ncclSymPrims &prim, Red red, EltType* input, EltType* output, int nElts, int nPacks, int nStrideElts) {
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(EltType);
int nRanks = prim.nRanks;
int rank = prim.rank;
int t = threadIdx.x;
int tn = ncclSymMaxThreads;
ncclCoopCta cta;
#pragma unroll 1
while (0 < nElts) {
int nIterPacks = min(nPacks, tn);
int tn_div_nPacks = tn/nIterPacks;
int tn_mod_nPacks = tn%nIterPacks;
int peer = t/nIterPacks;
int pack = t%nIterPacks;
#pragma unroll 1
for (int i = t; i < nRanks*nIterPacks; i += tn) {
Pack got = loadPack<Pack>(input + peer*nStrideElts, pack*EltPerPack, nElts);
prim.sendLL(peer, rank*nIterPacks + pack, got);
peer += tn_div_nPacks;
pack += tn_mod_nPacks;
if (nIterPacks <= pack) { peer += 1; pack -= nIterPacks; }
}
if (t < nIterPacks) {
Pack got = prim.template recvReduceLL<Pack, T>(t, nIterPacks, red);
storePack(output, t*EltPerPack, nElts, got);
}
prim.endLL(cta);
input += tn*EltPerPack;
output += tn*EltPerPack;
nElts -= tn*EltPerPack;
nPacks -= tn;
}
}
template<template<typename> typename Red, typename T>
__device__ __forceinline__ void ncclSymRun_ReduceScatter_LL(ncclSymDevArgs const* args) {
ncclSymPrims prim(args->comm, ncclSymPrims_UseLL);
Red<typename ncclSymAccumType<Red, T, /*nvls=*/false>::Type> red(args->redOpArg);
using Pack = BytePack<8>;
constexpr int EltPerPack = 8/sizeof(T);
int nAllElts = args->nElts;
int nAllPacks = divUp(nAllElts, EltPerPack);
uint32_t nPackPerBlock, nPackModBlock;
idivmodFast32(&nPackPerBlock, &nPackModBlock, nAllPacks, prim.nBlocks, prim.nBlocks_rcp32);
int blockPackBegin = prim.block*nPackPerBlock + minval<int>(prim.block, nPackModBlock);
int blockPackEnd = blockPackBegin + nPackPerBlock + (prim.block < nPackModBlock ? 1 : 0);
int nPacks = blockPackEnd - blockPackBegin;
int nElts = nAllElts - blockPackBegin*EltPerPack;
nElts = min(nElts, nPacks*EltPerPack);
T* input = (T*)args->input + blockPackBegin*EltPerPack;
T* output = (T*)args->output + blockPackBegin*EltPerPack;
uint32_t lowBits = args->nElts*sizeof(T);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->input);
lowBits |= (uint32_t)reinterpret_cast<uintptr_t>(args->output);
if (__builtin_expect(lowBits%8 == 0, true)) {
ncclSymRun_ReduceScatter_LL_body<T>(prim, red, (Pack*)input, (Pack*)output, nPacks, nPacks, nAllElts/EltPerPack);
} else {
ncclSymRun_ReduceScatter_LL_body<T>(prim, red, input, output, nElts, nPacks, nAllElts);
}
}


@ -13,6 +13,7 @@
#include "cudawrap.h"
#include "profiler.h"
#include "transport.h"
#include "register_inline.h"
#include <cstring> // std::memcpy
#include <cinttypes> // PRIx64
@ -28,34 +29,41 @@ ncclResult_t ncclInitKernelsForDevice(int cudaArch, int maxSharedMem, size_t* ma
int carveout = ncclParamL1SharedMemoryCarveout();
int ncclMaxSharedMem = ncclShmemDynamicSize(cudaArch);
for (int k=0; k < ncclDevKernelCount; k++) {
void* fn = ncclDevKernelList[k];
cudaFuncAttributes attr = {0};
if (fn == nullptr) continue;
for (int sym=0; sym <= 1; sym++) {
int kcount = sym==0 ? ncclDevKernelCount : ncclSymKernelCount;
void* const* kptrs = sym==0 ? ncclDevKernelList : ncclSymKernelList;
for (int k=0; k < kcount; k++) {
void* fn = kptrs[k];
cudaFuncAttributes attr = {0};
if (fn == nullptr) continue;
CUDACHECKGOTO(cudaFuncGetAttributes(&attr, fn), result, ignore0);
if (maxStackSize) {
if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
ignore0:;
}
if (carveout) {
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributePreferredSharedMemoryCarveout, carveout),
result, ignore1);
ignore1:;
}
if (ncclMaxSharedMem != 0) {
int sharedMemSize = ncclMaxSharedMem;
if (sharedMemSize > (maxSharedMem-attr.sharedSizeBytes)) {
WARN("cudaArch %d ncclMaxSharedMem %d exceeds device/fn maxSharedMem %zu",
cudaArch, sharedMemSize, maxSharedMem-attr.sharedSizeBytes);
return ncclSystemError;
cudaError_t errcode = cudaFuncGetAttributes(&attr, fn);
if (errcode == cudaErrorNoKernelImageForDevice) continue;
CUDACHECKGOTO(errcode, result, ignore0);
if (maxStackSize) {
if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
ignore0:;
}
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributeMaxDynamicSharedMemorySize, sharedMemSize),
result, next_kernel);
if (carveout) {
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributePreferredSharedMemoryCarveout, carveout),
result, ignore1);
ignore1:;
}
if (ncclMaxSharedMem != 0) {
int sharedMemSize = ncclMaxSharedMem;
if (sharedMemSize > (maxSharedMem-attr.sharedSizeBytes)) {
WARN("cudaArch %d ncclMaxSharedMem %d exceeds device/fn maxSharedMem %zu",
cudaArch, sharedMemSize, maxSharedMem-attr.sharedSizeBytes);
return ncclSystemError;
}
CUDACHECKGOTO(cudaFuncSetAttribute(fn,
cudaFuncAttributeMaxDynamicSharedMemorySize, sharedMemSize),
result, next_kernel);
}
next_kernel:;
}
next_kernel:;
}
return result;
}
@ -258,8 +266,8 @@ static bool testBudget(
ncclResult_t ncclTasksRegAndEnqueue(struct ncclComm* comm) {
struct ncclKernelPlanner* planner = &comm->planner;
if (planner->isSymColl) return ncclSuccess;
struct ncclTaskColl *task;
task = ncclIntruQueueHead(&planner->collTaskQueue);
while (task != nullptr) {
// Build a ncclDevWorkColl[Reg?] struct for each task.
@ -331,6 +339,38 @@ ncclResult_t ncclPrepareTasks(struct ncclComm* comm, bool* algoNeedConnect, bool
int fnOpTyIndices[ncclNumFuncs*ncclNumDevRedOps*ncclNumTypes];
int fnOpTyCount = 0;
if (comm->nNodes == 1 && planner->nTasksColl == 1 && planner->nTasksP2p == 0) {
void* sendSymPtr;
void* recvSymPtr;
struct ncclReg* sendReg;
struct ncclReg* recvReg;
size_t size = task->count*ncclTypeSize(task->datatype);
NCCLCHECK(ncclRegFindSymmetric(comm, task->sendbuff, size, &sendSymPtr, &sendReg));
NCCLCHECK(ncclRegFindSymmetric(comm, task->recvbuff, size, &recvSymPtr, &recvReg));
bool implemented = ncclSymImplemented(task->func, task->opDev.op, task->datatype);
if (sendReg && recvReg && (sendReg->winFlags & recvReg->winFlags & NCCL_WIN_COLL_SYMMETRIC) && implemented) {
enum ncclSymKernelId kernel;
int nChannels, nWarps;
float estTimeUs = 1.e18;
NCCLCHECK(ncclSymPickKernel(comm, task->func, task->opDev.op, task->datatype, task->count, &estTimeUs, &kernel, &nChannels, &nWarps));
// We should only use the symmetric kernel if it beats the asymmetric kernel. But the
// perf model for asymmetric kernels is too inaccurate and reports too high a
// bandwidth. For now just always use symmetric if available.
if (kernel != ncclSymKernelId_Count) {
task->sendbuff = sendSymPtr;
task->recvbuff = recvSymPtr;
task->devFuncId = (int)kernel;
task->nMaxChannels = nChannels;
task->nWarps = nWarps;
ncclIntruQueueEnqueue(&planner->collTaskQueue, task);
planner->isSymColl = true;
return ncclSuccess;
}
}
}
// Walk the size sorted tasks, binning them by (fn,op,ty).
while (task != nullptr) {
struct ncclTaskColl* next = task->next;
@ -603,6 +643,10 @@ static ncclResult_t scheduleCollTasksToPlan(
(countHi != 0 ? countHi : countLo) -= cells*elementsPerCell - task->count;
nChannels = (countLo!=0 ? 1 : 0) + nMidChannels + (cellsHi!=0 ? 1 : 0);
// Update number of channels propagated to the profiler
task->nChannels = (uint8_t)nChannels;
// Ensure room for worst case of one new batch per channel
if (!testBudget(budget, plan->nWorkBatches + nChannels, plan->workBytes + workNode->size)) {
return ncclSuccess;
@ -860,6 +904,8 @@ static ncclResult_t addP2pToPlan(
partSize = divUp(bytes[dir], nChannels[dir]);
}
}
// Update number of channels propagated to the profiler
if (p2pTasks[dir]) p2pTasks[dir]->nChannels = nChannels[dir];
}
struct ncclWorkList* workNode = ncclMemoryStackAllocInlineArray<ncclWorkList, ncclDevWorkP2p>(&comm->memScoped, 1);
@ -1052,47 +1098,17 @@ static ncclResult_t scheduleP2pTasksToPlan(
}
// Spin until it's safe to increase comm->workFifoProduced to desiredProduced.
static void waitWorkFifoAvailable(struct ncclComm* comm, uint32_t desiredProduced) {
bool hasRoom = (desiredProduced - comm->workFifoConsumedLeast) <= comm->workFifoBytes;
if (hasRoom) return;
while (true) {
// We have to poll for notifications from device.
uint32_t* consumedLive = comm->workFifoConsumed;
uint32_t consumed[MAXCHANNELS];
for (int c=0; c < MAXCHANNELS; c++) {
consumed[c] = __atomic_load_n(&consumedLive[c], __ATOMIC_RELAXED);
static ncclResult_t waitWorkFifoAvailable(struct ncclComm* comm, uint32_t desiredProduced) {
bool hasRoom = (desiredProduced - comm->workFifoConsumed) <= comm->workFifoBytes;
if (!hasRoom) {
while (true) {
NCCLCHECK(ncclCommPollEventCallbacks(comm, /*waitSome=*/true));
hasRoom = (desiredProduced - comm->workFifoConsumed) <= comm->workFifoBytes;
if (hasRoom) break;
sched_yield();
}
// Compiler-only fence to prevent fusion of loops to encourage dense loads.
__atomic_signal_fence(__ATOMIC_SEQ_CST);
uint32_t produced = comm->workFifoProduced;
uint32_t consumedLeast = produced;
for (int c=0; c < MAXCHANNELS; c++) {
// consumedLeast is min over all non-quiesced channels
if (consumed[c] != comm->channels[c].workFifoProduced) {
if ((produced - consumedLeast) < (produced - consumed[c])) {
consumedLeast = consumed[c];
}
}
}
// Compiler only fence to prevent fusion of loops to encourage dense stores.
__atomic_signal_fence(__ATOMIC_SEQ_CST);
for (int c=0; c < MAXCHANNELS; c++) {
// Advance counter on quiesced channels so they don't lag behind
// too far where they could get lost in 32-bit wraparound.
if (consumed[c] == comm->channels[c].workFifoProduced) {
comm->channels[c].workFifoProduced = consumedLeast;
__atomic_store_n(&consumedLive[c], consumedLeast, __ATOMIC_RELAXED);
}
}
comm->workFifoConsumedLeast = consumedLeast;
hasRoom = (desiredProduced - comm->workFifoConsumedLeast) <= comm->workFifoBytes;
if (hasRoom) break;
sched_yield();
}
return ncclSuccess;
}
namespace {
@ -1106,11 +1122,14 @@ namespace {
struct uploadWork_cleanup_t* me = (struct uploadWork_cleanup_t*)cb;
free(me->hostBuf);
CUDACHECK(cudaEventDestroy(me->base.event));
free(me);
return ncclSuccess;
}
}
static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* plan) {
if (plan->isSymColl) return ncclSuccess;
size_t workBytes = plan->workBytes;
size_t batchBytes = plan->nWorkBatches*sizeof(struct ncclDevWorkBatch);
void* fifoBufHost;
@ -1127,7 +1146,7 @@ static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* pla
fifoBufHost = comm->workFifoBuf;
fifoCursor = comm->workFifoProduced;
fifoMask = comm->workFifoBytes-1;
waitWorkFifoAvailable(comm, fifoCursor + workBytes);
NCCLCHECK(waitWorkFifoAvailable(comm, fifoCursor + workBytes));
plan->kernelArgs->workBuf = comm->workFifoBufDev;
break;
case ncclDevWorkStorageTypePersistent:
@ -1208,7 +1227,7 @@ static ncclResult_t uploadWork(struct ncclComm* comm, struct ncclKernelPlan* pla
ncclIntruQueueEnqueue(&comm->eventCallbackQueue, (struct ncclCommEventCallback *)cleanup);
NCCLCHECKGOTO(ncclStrongStreamRelease(ncclCudaGraphNone(), &comm->sharedRes->deviceStream, /*concurrent=*/false), result, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm), result, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm, /*waitSome=*/false), result, fail);
finish_scope:
if (mode != cudaStreamCaptureModeRelaxed) (void)cudaThreadExchangeStreamCaptureMode(&mode);
@ -1226,6 +1245,7 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
uint64_t collOpCount = comm->sharedRes->collOpCount;
uint64_t p2pOpBump[MAXCHANNELS] = {/*0...*/};
// Advance comm's collOpCount by number of colls in this plan.
int hasp2p = 0;
comm->sharedRes->collOpCount += plan->collOpCount;
comm->collOpCount += plan->collOpCount;
@ -1244,6 +1264,7 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
// remember last value to compute max.
p2pOpBump[op->channelId] = (oldId>>1) + 1; // +1 to ensure next plan doesn't collide
op->opCount = (comm->sharedRes->p2pOpCount[op->channelId]<<1) + oldId;
hasp2p = 1;
} else { // coll
op->opCount = (collOpCount<<1) + oldId;
}
@ -1253,9 +1274,11 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
op = op->enqNext;
}
for (int c=0; c < MAXCHANNELS; c++) {
// Advance channel's p2pOpCount by number of p2p's in this plan channel.
comm->sharedRes->p2pOpCount[c] += p2pOpBump[c];
if (hasp2p) {
for (int c=0; c < MAXCHANNELS; c++) {
// Advance channel's p2pOpCount by number of p2p's in this plan channel.
comm->sharedRes->p2pOpCount[c] += p2pOpBump[c];
}
}
return ncclSuccess;
}
@ -1263,8 +1286,10 @@ static ncclResult_t uploadProxyOps(struct ncclComm* comm, struct ncclKernelPlan*
static ncclResult_t hostStreamPlanTask(struct ncclComm* comm, struct ncclKernelPlan* plan) {
NCCLCHECK(ncclProfilerStartGroupEvent(plan));
NCCLCHECK(ncclProfilerStartTaskEvents(plan));
NCCLCHECK(uploadProxyOps(comm, plan));
NCCLCHECK(ncclProxyStart(comm));
if (ncclIntruQueueHead(&plan->proxyOpQueue)) {
NCCLCHECK(uploadProxyOps(comm, plan));
NCCLCHECK(ncclProxyStart(comm));
}
NCCLCHECK(ncclProfilerStopTaskEvents(plan));
NCCLCHECK(ncclProfilerStopGroupEvent(plan));
if (!plan->persistent) {
@ -1281,7 +1306,6 @@ static void CUDART_CB hostStreamPlanCallback(void *plan_) {
if (result != ncclSuccess) {
WARN("hostStreamPlanCallback() failed : %s", ncclGetErrorString(result));
}
if (!plan->persistent) ncclAtomicRefCountDecrement(&plan->comm->sharedRes->noncapturedRefs);
return;
}
@ -1357,9 +1381,8 @@ namespace {
static ncclResult_t getImplicitOrder(enum ncclImplicitOrder *mode, bool capturing, int driver=-1) {
if (ncclParamLaunchOrderImplicit()) {
// Due to an unresolved bug in CUDA ncclImplicitOrderLaunch is not supported in graphs
if (capturing) { *mode = ncclImplicitOrderSerial; return ncclSuccess; }
if (driver < 0) { NCCLCHECK(ncclCudaDriverVersion(&driver)); }
if (capturing && driver < 12090) { *mode = ncclImplicitOrderSerial; return ncclSuccess; }
*mode = 12030 <= std::min<int>(CUDART_VERSION, driver) ? ncclImplicitOrderLaunch : ncclImplicitOrderSerial;
return ncclSuccess;
}
@ -1386,26 +1409,51 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
plan->workStorageType = persistent ? ncclDevWorkStorageTypePersistent
: ncclDevWorkStorageTypeFifo;
struct ncclKernelPlanBudget budget;
budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
// Non-persistent kernels fill up at most half of our fifo per kernel.
budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
if (planner->isSymColl) {
plan->workStorageType = ncclDevWorkStorageTypeArgs;
// Drain coll tasks first. This is essential since we partition tasks based
// on the work budget and p2p work isn't collective. If we were to drain p2p
// first, the place where we cut the kernel could vary by rank which would
// cause the "shortest channel first" channel picker to have divergent results.
if (planner->nTasksColl != 0) {
NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
}
// And only drain p2p tasks once colls are depleted.
if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
}
finishPlan(comm, plan);
if (plan->workBytes != 0) {
struct ncclTaskColl* task = ncclIntruQueueHead(&planner->collTaskQueue);
plan->isSymColl = true;
plan->kernelFn = ncclSymGetKernelPtr((ncclSymKernelId)task->devFuncId, task->opDev.op, task->datatype);
plan->threadPerBlock = task->nWarps*WARP_SIZE;
plan->channelMask = uint64_t(-1) >> (64-task->nMaxChannels);
plan->kernelArgsSize = sizeof(struct ncclSymDevArgs);
plan->kernelSymArgs = ncclMemoryStackAlloc<struct ncclSymDevArgs>(&comm->memScoped);
plan->kernelSymArgs->comm = comm->symDevComm;
plan->kernelSymArgs->rootRank = task->root;
plan->kernelSymArgs->redOpArg = task->opDev.scalarArg;
plan->kernelSymArgs->nElts = task->count;
plan->kernelSymArgs->input = (char*)task->sendbuff;
plan->kernelSymArgs->output = (char*)task->recvbuff;
planner->nTasksColl -= 1;
ncclIntruQueueEnqueue(&planner->planQueue, plan);
INFO(NCCL_TUNING, "%s [Symmetric]: %ld Bytes -> Kernel %s nchannels %d nthreads %d",
ncclFuncToString(task->func), task->count * ncclTypeSize(task->datatype), ncclSymKernelIdToString(task->devFuncId), task->nMaxChannels, plan->threadPerBlock);
nPlans += 1;
} else {
struct ncclKernelPlanBudget budget;
budget.inArgsBytes = comm->workArgsBytes - sizeof(struct ncclDevKernelArgs);
// Non-persistent kernels fill up at most half of our fifo per kernel.
budget.outArgsBytes = plan->persistent ? (1<<30) : comm->workFifoBytes/2;
// Drain coll tasks first. This is essential since we partition tasks based
// on the work budget and p2p work isn't collective. If we were to drain p2p
// first, the place where we cut the kernel could vary by rank which would
// cause the "shortest channel first" channel picker to have divergent results.
if (planner->nTasksColl != 0) {
NCCLCHECKGOTO(scheduleCollTasksToPlan(comm, plan, &budget), result, failure);
}
// And only drain p2p tasks once colls are depleted.
if (planner->nTasksColl == 0 && planner->nTasksP2p != 0) {
NCCLCHECKGOTO(scheduleP2pTasksToPlan(comm, plan, &budget), result, failure);
}
finishPlan(comm, plan);
if (plan->workBytes != 0) {
ncclIntruQueueEnqueue(&planner->planQueue, plan);
nPlans += 1;
}
}
} while (planner->nTasksColl + planner->nTasksP2p != 0);
@ -1428,6 +1476,7 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
bool capturing = ncclCudaGraphValid(planner->capturingGraph);
enum ncclImplicitOrder implicitOrder;
cudaError_t status = cudaSuccess;
NCCLCHECKGOTO(getImplicitOrder(&implicitOrder, capturing), result, failure);
if (implicitOrder != ncclImplicitOrderNone) {
@ -1439,7 +1488,8 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
NCCLCHECKGOTO(ncclStreamWaitStream(launchStream, launchOrder, comm->sharedRes->scratchEvent), result, failure);
}
if (persistent || comm->sharedRes->persistentRefs != 0 || ncclCudaLaunchBlocking || __atomic_load_n(&comm->sharedRes->noncapturedRefs, __ATOMIC_ACQUIRE)) {
if (!persistent && comm->sharedRes->persistentRefs) status = cudaEventQuery(comm->sharedRes->hostStream.serialEvent);
if (persistent || ncclCudaLaunchBlocking || status == cudaErrorNotReady) {
// We have to launch host tasks to push proxy args. We are careful to only
// do this if necessary since host tasks impose a high performance cost in CUDA.
bool acquired = false;
@ -1450,7 +1500,6 @@ ncclResult_t ncclLaunchPrepare(struct ncclComm* comm) {
acquired = true;
NCCLCHECKGOTO(ncclStrongStreamAcquire(planner->capturingGraph, &comm->sharedRes->hostStream, /*concurrent=*/false, &hostStream), result, failure);
}
if (!persistent) ncclAtomicRefCountIncrement(&comm->sharedRes->noncapturedRefs);
plan->isHostCbEnq = true;
CUDACHECKGOTO(cudaLaunchHostFunc(hostStream, hostStreamPlanCallback, plan), result, failure);
}
@ -1485,6 +1534,8 @@ ncclResult_t ncclLaunchKernelBefore_NoUncapturedCuda(struct ncclComm* comm, stru
NCCL_PARAM(MemSyncDomain, "MEM_SYNC_DOMAIN", cudaLaunchMemSyncDomainRemote);
#endif
NCCL_PARAM(NvlinkUtilCentricSchedEnable, "NVLINK_UTIL_CENTRIC_SCHED_ENABLE", 0);
ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan) {
ncclResult_t ret = ncclSuccess;
struct ncclKernelPlanner* planner = &comm->planner;
@ -1512,7 +1563,7 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
unsigned int clusterSize = (compCap >= 90) ? comm->config.cgaClusterSize : 0;
CUlaunchConfig launchConfig = {0};
CUlaunchAttribute launchAttrs[4] = {};
CUlaunchAttribute launchAttrs[6] = {};
int attrs = 0;
/* Cooperative Group Array (CGA)
* On sm90 and later we have an extra level of hierarchy where we
@ -1549,6 +1600,18 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
launchAttrs[attrs].value.launchCompletionEvent.flags = 0;
attrs++;
}
if (comm->planner.isSymColl && compCap >= 90 && driverVersion >= 12030) {
launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_PROGRAMMATIC_STREAM_SERIALIZATION;
launchAttrs[attrs].value.programmaticStreamSerializationAllowed = 1;
attrs++;
}
#endif
#if CUDART_VERSION >= 13000
if (compCap >= 90 && driverVersion >= 13000) {
launchAttrs[attrs].id = CU_LAUNCH_ATTRIBUTE_NVLINK_UTIL_CENTRIC_SCHEDULING;
launchAttrs[attrs].value.nvlinkUtilCentricScheduling = ncclParamNvlinkUtilCentricSchedEnable();
attrs++;
}
#endif
launchConfig.gridDimX = grid.x;
launchConfig.gridDimY = grid.y;
@ -1560,7 +1623,6 @@ ncclResult_t ncclLaunchKernel(struct ncclComm* comm, struct ncclKernelPlan* plan
launchConfig.attrs = launchAttrs;
launchConfig.numAttrs = attrs;
launchConfig.hStream = launchStream;
CUCHECKGOTO(cuLaunchKernelEx(&launchConfig, fn, nullptr, extra), ret, do_return);
#endif
} else {
@ -1573,21 +1635,30 @@ do_return:
}
ncclResult_t ncclLaunchKernelAfter_NoCuda(struct ncclComm* comm, struct ncclKernelPlan* plan) {
if (!(plan->persistent || ncclCudaLaunchBlocking || plan->isHostCbEnq)) {
// We are not using the host stream for proxy ops and reclamation submission.
if (!plan->isHostCbEnq) {
// We are not using the host stream for proxy ops and reclamation submission, so call
// hostStreamPlanTask directly
NCCLCHECK(hostStreamPlanTask(comm, plan));
} else {
// We are using the host stream for proxy ops and reclamation submission.
// Only plans with proxy ops have a callback pushed by ncclLaunchPrepare.
// Since non-persistent plans also require reclamation, we have to do it
// here.
if (!plan->persistent && !plan->hasProxyOps) {
ncclIntruQueueMpscEnqueue(&comm->callbackQueue, &plan->reclaimer);
}
}
return ncclSuccess;
}
namespace {
struct KernelFinishCallback {
struct ncclCommEventCallback base;
uint32_t workFifoConsumed;
};
ncclResult_t KernelFinishCallback_fn(
struct ncclComm* comm, struct ncclCommEventCallback* cb
) {
struct KernelFinishCallback* me = (struct KernelFinishCallback*)cb;
comm->workFifoConsumed = me->workFifoConsumed;
CUDACHECK(cudaEventDestroy(me->base.event));
free(me);
return ncclSuccess;
}
}
ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
struct ncclKernelPlanner* planner = &comm->planner;
if (!ncclIntruQueueEmpty(&planner->planQueue)) {
@ -1597,7 +1668,21 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
cudaStream_t launchStream = planner->streams->stream; // First user stream gets launch
cudaStream_t deviceStream, launchOrder;
CUDACHECK(cudaEventRecord(comm->sharedRes->scratchEvent, launchStream));
cudaEvent_t finishedEvent = comm->sharedRes->scratchEvent;
CUDACHECK(cudaEventRecord(finishedEvent, launchStream));
if (comm->workFifoProduced - comm->workFifoProducedLastRecorded > comm->workFifoBytes/8) {
comm->workFifoProducedLastRecorded = comm->workFifoProduced;
struct KernelFinishCallback* cb;
NCCLCHECK(ncclCalloc(&cb, 1));
cb->base.event = finishedEvent;
cb->base.fn = KernelFinishCallback_fn;
cb->workFifoConsumed = comm->workFifoProduced;
ncclIntruQueueEnqueue(&comm->eventCallbackQueue, &cb->base);
// We just stole scratchEvent so must create a new one.
CUDACHECK(cudaEventCreateWithFlags(&comm->sharedRes->scratchEvent, cudaEventDisableTiming));
}
// deviceStream waits on userStream[0]
NCCLCHECK(ncclStrongStreamAcquiredWorkStream(planner->capturingGraph, &comm->sharedRes->deviceStream, /*concurrent=*/false, &deviceStream));
@ -1606,13 +1691,13 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
// on launchStream as a fast-forward. When building CUDA graphs fast forwards should
// be handled specially so as not to create graphs with a blowup in the number of edges.
// So we could do this:
// CUDACHECK(cudaStreamWaitEvent(deviceStream, comm->sharedRes->scratchEvent, 0));
// CUDACHECK(cudaStreamWaitEvent(deviceStream, finishedEvent, 0));
// But instead we do:
NCCLCHECK(ncclStreamAdvanceToEvent(planner->capturingGraph, deviceStream, comm->sharedRes->scratchEvent));
NCCLCHECK(ncclStreamAdvanceToEvent(planner->capturingGraph, deviceStream, finishedEvent));
// Each userStream[i] waits on userStream[0]
for (struct ncclCudaStreamList* l=planner->streams->next; l != nullptr; l = l->next) {
CUDACHECK(cudaStreamWaitEvent(l->stream, comm->sharedRes->scratchEvent, 0));
CUDACHECK(cudaStreamWaitEvent(l->stream, finishedEvent, 0));
}
bool capturing = ncclCudaGraphValid(planner->capturingGraph);
enum ncclImplicitOrder implicitOrder;
@ -1623,7 +1708,7 @@ ncclResult_t ncclLaunchFinish(struct ncclComm* comm) {
// Incorporate launch event into per-device (context) launch order.
NCCLCHECK(ncclStrongStreamAcquiredWorkStream(planner->capturingGraph, &comm->context->launchOrder, concurrent, &launchOrder));
// If we don't have launch events (requires CUDA 12.3) then just use completion event (serialize execution).
CUDACHECK(cudaStreamWaitEvent(launchOrder, implicitOrder == ncclImplicitOrderLaunch ? comm->sharedRes->launchEvent : comm->sharedRes->scratchEvent));
CUDACHECK(cudaStreamWaitEvent(launchOrder, implicitOrder == ncclImplicitOrderLaunch ? comm->sharedRes->launchEvent : finishedEvent));
// Release launchOrder as acquired in ncclLaunchPrepare()
NCCLCHECK(ncclStrongStreamRelease(planner->capturingGraph, &comm->context->launchOrder, concurrent));
}
@ -1645,7 +1730,7 @@ static inline ncclResult_t getCollNetSupport(
if (info->opDev.op == ncclDevPreMulSum || info->opDev.op == ncclDevSumPostDiv) {
netOp = ncclSum;
}
*collNetSupport = comm->collNetSupport;
*collNetSupport = comm->config.collnetEnable;
switch (info->func) {
case ncclFuncAllReduce:
case ncclFuncReduce:
@ -1683,10 +1768,8 @@ static ncclResult_t updateCollCostTable(
if ((a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) && collNetSupport != 1) continue;
// CollNetDirect is only supported for up to 8 local GPUs
if (a == NCCL_ALGO_COLLNET_DIRECT && comm->maxLocalRanks > NCCL_MAX_DIRECT_ARITY+1) continue;
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) && nvlsSupport != 1 && info->func != ncclFuncAllGather) continue;
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_NVLS_TREE) && (!nvlsSupport || (info->func != ncclFuncAllReduce && comm->localRanks > NCCL_MAX_NVLS_ARITY))) continue;
if (a == NCCL_ALGO_NVLS && collNetSupport != 1 && comm->nNodes > 1) continue;
/* now we only support single-node NVLS allgather and reducescatter */
if (a == NCCL_ALGO_NVLS && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter) && (comm->nNodes > 1 || comm->nRanks > NCCL_MAX_NVLS_ARITY)) continue;
/* Tree reduceScatter doesn't support scaling yet */
if (a == NCCL_ALGO_PAT && info->func == ncclFuncReduceScatter
&& (info->opDev.op == ncclDevPreMulSum || info->opDev.op == ncclDevSumPostDiv)) continue;
@ -1801,7 +1884,14 @@ static ncclResult_t getAlgoInfo(
struct ncclComm* comm, struct ncclTaskColl* info,
int collNetSupport, int nvlsSupport, int numPipeOps, ncclSimInfo_t* simInfo/* = NULL*/
) {
size_t nBytes = ncclTypeSize(info->datatype)*ncclFuncMaxSendRecvCount(info->func, comm->nRanks, info->count);
size_t elementSize = ncclTypeSize(info->datatype);
size_t nBytes = elementSize * ncclFuncMaxSendRecvCount(info->func, comm->nRanks, info->count);
struct ncclReg* regSendBuf = NULL;
struct ncclReg* regRecvBuf = NULL;
int regBuff;
bool isSendValid, isRecvValid;
size_t sendbuffSize = elementSize * ncclFuncSendCount(info->func, comm->nRanks, info->count);
size_t recvbuffSize = elementSize * ncclFuncRecvCount(info->func, comm->nRanks, info->count);
info->algorithm = NCCL_ALGO_UNDEF;
info->protocol = NCCL_PROTO_UNDEF;
int nMaxChannels = 0;
@ -1809,20 +1899,42 @@ static ncclResult_t getAlgoInfo(
initCollCostTable((float **)collCostTable);
NCCLCHECK(updateCollCostTable(comm, info, nBytes, collNetSupport, nvlsSupport, numPipeOps, (float **)collCostTable));
if (comm->tuner != NULL) {
size_t elementSize = ncclTypeSize(info->datatype);
size_t sendbuffSize = elementSize*ncclFuncSendCount(info->func, comm->nRanks, info->count);
size_t recvbuffSize = elementSize*ncclFuncRecvCount(info->func, comm->nRanks, info->count);
struct ncclReg* regSendBuf;
struct ncclReg* regRecvBuf;
NCCLCHECK(ncclRegFind(comm, info->sendbuff, sendbuffSize, &regSendBuf));
NCCLCHECK(ncclRegFind(comm, info->recvbuff, recvbuffSize, &regRecvBuf));
int regBuff = ((regSendBuf && regRecvBuf) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister()));
NCCLCHECK(ncclRegLocalIsValid(regSendBuf, &isSendValid));
NCCLCHECK(ncclRegLocalIsValid(regRecvBuf, &isRecvValid));
regBuff = (regSendBuf && regRecvBuf && isSendValid && isRecvValid) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister());
NCCLCHECK(comm->tuner->getCollInfo(
comm->tunerContext, info->func, nBytes,
numPipeOps, (float **)collCostTable, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
regBuff, &nMaxChannels));
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
} else {
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
// NCCL_CTA_POLICY_EFFICIENCY requires user (non-symmetric) buffer registration (currently unsupported with MNNVL)
if (comm->config.CTAPolicy == NCCL_CTA_POLICY_EFFICIENCY && ncclGetEnv("NCCL_ALGO") == NULL && ncclGetEnv("NCCL_PROTO") == NULL && !comm->MNNVL) {
// Make the algorithm selection based on buffer registration status.
// Other specialized policies for algorithm/protocol selection may be added in the future.
NCCLCHECK(ncclRegFind(comm, info->sendbuff, sendbuffSize, &regSendBuf));
NCCLCHECK(ncclRegFind(comm, info->recvbuff, recvbuffSize, &regRecvBuf));
NCCLCHECK(ncclRegLocalIsValid(regSendBuf, &isSendValid));
NCCLCHECK(ncclRegLocalIsValid(regRecvBuf, &isRecvValid));
regBuff = (regSendBuf && regRecvBuf && isSendValid && isRecvValid) || (ncclCudaGraphValid(comm->planner.capturingGraph) && ncclParamGraphRegister());
if (regBuff && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter)) {
if ((comm->nNodes > 1 && collNetSupport && nvlsSupport) || (comm->nNodes == 1 && nvlsSupport)) {
int recChannels;
NCCLCHECK(ncclNvlsRegResourcesQuery(comm, info, &recChannels));
if (recChannels <= info->nMaxChannels) {
info->algorithm = NCCL_ALGO_NVLS;
info->protocol = NCCL_PROTO_SIMPLE;
info->nMaxChannels = recChannels;
info->nWarps = comm->maxThreads[info->algorithm][info->protocol] / WARP_SIZE;
}
}
}
}
}
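A hedged sketch of the selection rule added above, with local stand-in enums rather than NCCL's definitions: when both buffers are registered (or registrable under graph capture) and the collective is AllGather or ReduceScatter, NVLS with the SIMPLE protocol is preferred as long as the registered-NVLS resource query fits within the channel budget.
#include <cstdio>

enum Algo  { ALGO_UNDEF, ALGO_RING, ALGO_NVLS };   // local stand-ins, not NCCL's enum values
enum Proto { PROTO_UNDEF, PROTO_SIMPLE };

struct Choice { Algo algorithm; Proto protocol; int nMaxChannels; };

// regChannels plays the role of the value returned by ncclNvlsRegResourcesQuery().
bool preferNvlsForRegisteredBuffers(bool regBuff, bool isAllGatherOrReduceScatter,
                                    bool nvlsAvailable, int regChannels, Choice* c) {
  if (!(regBuff && isAllGatherOrReduceScatter && nvlsAvailable)) return false;
  if (regChannels > c->nMaxChannels) return false;  // would exceed the allowed channel count
  c->algorithm = ALGO_NVLS;
  c->protocol = PROTO_SIMPLE;
  c->nMaxChannels = regChannels;
  return true;
}

int main() {
  Choice c{ALGO_UNDEF, PROTO_UNDEF, /*nMaxChannels=*/16};
  if (preferNvlsForRegisteredBuffers(true, true, true, /*regChannels=*/8, &c))
    printf("picked NVLS/SIMPLE with %d channels\n", c.nMaxChannels);
}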
NCCLCHECK(topoGetAlgoInfo(comm, info, nBytes, (float **)collCostTable, simInfo));
info->nMaxChannels = nMaxChannels == 0 ? info->nMaxChannels : nMaxChannels;
return ncclSuccess;
}
@ -1892,16 +2004,20 @@ static ncclResult_t calcCollChunking(
while (nBytes / (nChannels * chunkSize) < comm->channels[0].collnetChain.depth * 8 && chunkSize > 65536) chunkSize /= 2;
while (nBytes / (nChannels * chunkSize) < comm->channels[0].collnetChain.depth && chunkSize > 32768) chunkSize /= 2;
} else if (info->algorithm == NCCL_ALGO_NVLS) {
int maxChunkSize = comm->nvlsChunkSize;
if (comm->nNodes > 1 && comm->bandwidths[ncclFuncAllReduce][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] < 150) maxChunkSize = 32768;
if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
// coverity[overflow_before_widen]
uint64_t concurrentOps = nChannels * comm->channels[0].nvls.nHeads;
if ((nBytes < (64 * (concurrentOps * chunkSize))) && (chunkSize > 65536)) chunkSize = 65536;
if ((nBytes < (8 * (concurrentOps * chunkSize))) && (chunkSize > 32768)) chunkSize = 32768;
if ((nBytes < (2 * (concurrentOps * chunkSize))) && (chunkSize > 16384)) chunkSize = 16384;
if ((info->regBufType & NCCL_NVLS_REG_BUFFER) && (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter)) {
chunkSize = comm->buffSizes[NCCL_PROTO_SIMPLE] / NCCL_STEPS;
} else {
int maxChunkSize = comm->nvlsChunkSize;
if (comm->nNodes > 1 && comm->bandwidths[ncclFuncAllReduce][NCCL_ALGO_NVLS][NCCL_PROTO_SIMPLE] < 150) maxChunkSize = 32768;
if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
// coverity[overflow_before_widen]
uint64_t concurrentOps = nChannels * comm->channels[0].nvls.nHeads;
if ((nBytes < (64 * (concurrentOps * chunkSize))) && (chunkSize > 65536)) chunkSize = 65536;
if ((nBytes < (8 * (concurrentOps * chunkSize))) && (chunkSize > 32768)) chunkSize = 32768;
if ((nBytes < (2 * (concurrentOps * chunkSize))) && (chunkSize > 16384)) chunkSize = 16384;
}
} else if (info->algorithm == NCCL_ALGO_NVLS_TREE) {
// Use uint64_t so that concurrentOps*chunkSize*X does not overflow.
// However, nChannels * comm->channels[0].nvls.nHeads should easily fit in 32 bits.
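An illustrative reimplementation of the chunk-size clamping cascade in the NVLS branch above; the 64x/8x/2x thresholds and the 64K/32K/16K floors come from that code, while the starting values in main() are made up.
#include <cstdint>
#include <cstdio>

uint64_t clampNvlsChunk(uint64_t nBytes, uint64_t nChannels, uint64_t nHeads,
                        uint64_t chunkSize, uint64_t maxChunkSize) {
  if (chunkSize > maxChunkSize) chunkSize = maxChunkSize;
  uint64_t concurrentOps = nChannels * nHeads;  // computed in 64 bits so the products below cannot overflow
  if (nBytes < 64 * concurrentOps * chunkSize && chunkSize > 65536) chunkSize = 65536;
  if (nBytes <  8 * concurrentOps * chunkSize && chunkSize > 32768) chunkSize = 32768;
  if (nBytes <  2 * concurrentOps * chunkSize && chunkSize > 16384) chunkSize = 16384;
  return chunkSize;
}

int main() {
  // 8 MiB spread over 16 channels * 4 heads: the 128 KiB starting chunk is clamped to 32 KiB.
  printf("%llu\n", (unsigned long long)clampNvlsChunk(8ull << 20, 16, 4, 128 << 10, 128 << 10));
}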
@ -2045,7 +2161,7 @@ static ncclResult_t calcCollChunking(
proxyOp->reg = 0;
}
if (pattern == ncclPatternCollnetDirect) {
if (pattern == ncclPatternCollnetDirect || pattern == ncclPatternNvls) {
proxyOp->specifics.collnetDirect.nNodes = comm->nNodes;
proxyOp->specifics.collnetDirect.node = comm->node;
if (info->func == ncclFuncAllGather || info->func == ncclFuncReduceScatter) {
@ -2168,7 +2284,7 @@ static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
bool isSendNotRecv = info->coll == ncclFuncSend;
// Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
ncclGroupCommJoin(info->comm);
ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
struct ncclTaskP2p* p2p = ncclMemoryPoolAlloc<struct ncclTaskP2p>(&comm->memPool_ncclTaskP2p, &comm->memPermanent);
p2p->func = info->coll;
p2p->buff = (void*)info->recvbuff;
@ -2235,7 +2351,7 @@ static ncclResult_t taskAppend(struct ncclComm* comm, struct ncclInfo* info) {
return ncclSuccess;
} else {
// Must be in thread local group before tasks can be alloc'd in `comm->memScoped`.
ncclGroupCommJoin(info->comm);
ncclGroupCommJoin(info->comm, ncclGroupTaskTypeCollective);
struct ncclTaskColl* t = ncclMemoryPoolAlloc<struct ncclTaskColl>(&comm->memPool_ncclTaskColl, &comm->memPermanent);
t->func = info->coll;
t->sendbuff = info->sendbuff;

View File

@ -258,7 +258,7 @@ static ncclResult_t connectNvls(struct ncclComm* comm, int* nvlsHeads, int nHead
channel->nvls.out = -1; // NVLS+SHARP not yet implemented.
channel->nvls.headRank = headRank;
channel->nvls.treeUp = channel->nvls.treeDown[0] = channel->nvls.treeDown[1] = channel->nvls.treeDown[2] = -1;
if (comm->collNetSupport && channel->nvls.headRank != -1) channel->nvls.out = comm->nRanks;
if (comm->config.collnetEnable && channel->nvls.headRank != -1) channel->nvls.out = comm->nRanks;
}
if (comm->nNodes == 1) return ncclSuccess;
@ -330,7 +330,7 @@ int ncclMinNchannels() {
if (ncclParamMinNrings() != -2) minNchannels = ncclParamMinNrings();
if (ncclParamMinNchannels() != -2) minNchannels = ncclParamMinNchannels();
if (minNchannels > MAXCHANNELS) {
WARN("User asked for a minimum of %d channels, limiting to %d", minNchannels, MAXCHANNELS);
INFO(NCCL_GRAPH|NCCL_ENV, "User asked for a minimum of %d channels, limiting to %d", minNchannels, MAXCHANNELS);
minNchannels = MAXCHANNELS;
}
if (minNchannels < 0) minNchannels = 0;
@ -346,7 +346,7 @@ int ncclMaxNchannels() {
maxNchannels = std::min(maxNchannels, ncclDevMaxChannelsForArgsBytes(ncclParamWorkArgsBytes()));
if (maxNchannels > MAXCHANNELS) maxNchannels = MAXCHANNELS;
if (maxNchannels < 1) {
WARN("User asked for a maximum of %d channels, setting it to 1", maxNchannels);
INFO(NCCL_GRAPH|NCCL_ENV, "User asked for a maximum of %d channels, setting it to 1", maxNchannels);
maxNchannels = 1;
}
return maxNchannels;
@ -379,7 +379,7 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
int nNodes = comm->nNodes;
int nChannels = comm->nChannels;
int minHeadNum = INT_MAX;
int shared = parent && parent->nvlsSupport && parent->config.splitShare;
int shared = parent && parent->nvlsSupport && parent->shareResources;
NCCLCHECK(ncclCalloc(&ringRecv, nNodes*MAXCHANNELS));
NCCLCHECKGOTO(ncclCalloc(&ringSend, nNodes*MAXCHANNELS), ret, fail);
NCCLCHECKGOTO(ncclCalloc(&ringPrev, nranks*MAXCHANNELS), ret, fail);
@ -452,7 +452,7 @@ ncclResult_t ncclTopoPostset(struct ncclComm* comm, int* firstRanks, int* treePa
nChannels = comm->nChannels = std::min(MAXCHANNELS,nChannels*2);
// Setup CollNet
if (comm->collNetSupport == 1) {
if (comm->config.collnetEnable) {
struct ncclTopoGraph* collNetChainGraph = graphs[NCCL_ALGO_COLLNET_CHAIN];
// Add more channels to saturate intra-node bandwidth, except the 1 PPN case
if (collNetChainGraph->bwIntra > collNetChainGraph->bwInter && comm->nRanks > comm->nNodes) {

View File

@ -175,6 +175,13 @@ ncclResult_t ncclGetLocalCpu(struct ncclTopoSystem* system, int gpu, int* retCpu
return ncclSuccess;
}
static int mergePathType(int type0, int type1) {
int max = std::max(type0, type1);
int min = std::min(type0, type1);
if (max == PATH_PHB && min == PATH_C2C) return PATH_P2C;
else return max;
}
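A quick standalone check of the merge rule above, using the path constants this change introduces in topo.h later in the diff (PATH_C2C=3, PATH_P2C=6, PATH_PHB=8).
#include <algorithm>
#include <cassert>

#define PATH_C2C 3
#define PATH_P2C 6
#define PATH_PHB 8

static int mergePathType(int type0, int type1) {
  int max = std::max(type0, type1);
  int min = std::min(type0, type1);
  return (max == PATH_PHB && min == PATH_C2C) ? PATH_P2C : max;
}

int main() {
  assert(mergePathType(PATH_C2C, PATH_PHB) == PATH_P2C);  // C2C on one leg, PHB on the other -> P2C
  assert(mergePathType(PATH_C2C, PATH_C2C) == PATH_C2C);  // otherwise the worse (larger) type wins
}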
static ncclResult_t addInterStep(struct ncclTopoSystem* system, int tx, int ix, int t1, int i1, int t2, int i2) {
struct ncclTopoNode* cpuNode = system->nodes[tx].nodes+ix;
struct ncclTopoNode* srcNode = system->nodes[t1].nodes+i1;
@ -187,7 +194,7 @@ static ncclResult_t addInterStep(struct ncclTopoSystem* system, int tx, int ix,
// Update path characteristics
srcNode->paths[t2][i2].count = l;
srcNode->paths[t2][i2].type = std::max(srcNode->paths[tx][ix].type, cpuNode->paths[t2][i2].type);
srcNode->paths[t2][i2].type = mergePathType(srcNode->paths[tx][ix].type, cpuNode->paths[t2][i2].type);
if (tx == GPU) srcNode->paths[t2][i2].type = PATH_PXN;
srcNode->paths[t2][i2].bw = std::min(srcNode->paths[tx][ix].bw, cpuNode->paths[t2][i2].bw);
return ncclSuccess;
@ -214,7 +221,7 @@ ncclResult_t ncclGetLevel(int* level, const char* disableEnv, const char* levelE
const char* str = ncclGetEnv(disableEnv);
if (str) {
int disable = strtol(str, NULL, 0);
if (disable == 1) l = 0;
if (disable == 1) l = PATH_LOC;
if (l >= 0) INFO(NCCL_ALL, "%s set by environment to %d", disableEnv, disable);
}
}
@ -247,7 +254,18 @@ ncclResult_t ncclGetLevel(int* level, const char* disableEnv, const char* levelE
NCCL_PARAM(IgnoreDisabledP2p, "IGNORE_DISABLED_P2P", 0);
int ncclTopoUserP2pLevel = -1;
static int ncclTopoUserP2pLevel = -1; // Initially "uninitialized". When initialized but unset, changes to -2.
// Gets the user-provided value of NCCL_P2P_LEVEL/NCCL_P2P_DISABLE. If the user did not provide any, the value
// of the "level" argument is left unchanged.
ncclResult_t ncclGetUserP2pLevel(int* level) {
if (ncclTopoUserP2pLevel == -1)
NCCLCHECK(ncclGetLevel(&ncclTopoUserP2pLevel, "NCCL_P2P_DISABLE", "NCCL_P2P_LEVEL"));
if (ncclTopoUserP2pLevel != -2)
*level = ncclTopoUserP2pLevel;
return ncclSuccess;
}
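A simplified sketch of the parse-once caching pattern used by ncclGetUserP2pLevel, mirroring the sentinel convention from the comment above (-1 means not parsed yet, -2 means parsed but unset). The real ncclGetLevel also handles NCCL_P2P_DISABLE and symbolic level names; only the numeric NCCL_P2P_LEVEL case is sketched.
#include <cstdio>
#include <cstdlib>

static int gUserP2pLevel = -1;  // -1: env not parsed yet; -2: parsed, user set nothing

// Leaves *level untouched unless the user provided NCCL_P2P_LEVEL.
static void getUserP2pLevel(int* level) {
  if (gUserP2pLevel == -1) {
    const char* s = std::getenv("NCCL_P2P_LEVEL");
    gUserP2pLevel = s ? (int)std::strtol(s, nullptr, 0) : -2;
  }
  if (gUserP2pLevel != -2) *level = gUserP2pLevel;
}

int main() {
  int level = 42;  // caller's computed default
  getUserP2pLevel(&level);
  printf("p2pLevel=%d\n", level);
}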
ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* system, int rank1, int rank2,
int* p2p, int *read, int* intermediateRank) {
int mnnvl = 0;
@ -275,9 +293,9 @@ ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* syst
// Get GPUs from topology
int g1, g2;
NCCLCHECK(ncclTopoRankToIndex(system, rank1, &g1));
NCCLCHECK(ncclTopoRankToIndex(system, rank1, &g1, /*showWarn=*/true));
struct ncclTopoNode* gpu1 = system->nodes[GPU].nodes+g1;
if (ncclTopoRankToIndex(system, rank2, &g2) == ncclInternalError) {
if (ncclTopoRankToIndex(system, rank2, &g2, /*showWarn=*/false) == ncclInternalError) {
// GPU not found, we can't use p2p.
return ncclSuccess;
}
@ -302,15 +320,8 @@ ncclResult_t ncclTopoCheckP2p(struct ncclComm* comm, struct ncclTopoSystem* syst
if ((arch == NCCL_TOPO_CPU_ARCH_X86 && vendor == NCCL_TOPO_CPU_VENDOR_AMD) && system->nodes[GPU].count <= 2) p2pLevel = PATH_SYS;
// User override
if (ncclTopoUserP2pLevel == -1)
NCCLCHECK(ncclGetLevel(&ncclTopoUserP2pLevel, "NCCL_P2P_DISABLE", "NCCL_P2P_LEVEL"));
if (ncclTopoUserP2pLevel != -2) {
p2pLevel = ncclTopoUserP2pLevel;
goto compare;
}
NCCLCHECK(ncclGetUserP2pLevel(&p2pLevel));
compare:
// Compute the PCI distance and compare with the p2pLevel.
if (path->type <= p2pLevel) *p2p = 1;
@ -378,7 +389,8 @@ NCCL_PARAM(NetGdrRead, "NET_GDR_READ", -2);
int ncclTopoUserGdrLevel = -1;
const char* ncclTopoGdrModeStr[ncclTopoGdrModeNum] = { "Disabled", "Default", "PCI" };
NCCL_PARAM(NetGdrC2c, "NET_GDR_C2C", 0);
// On C2C platforms use GDRDMA on NICs which are connected to the CPUs
NCCL_PARAM(NetGdrC2c, "NET_GDR_C2C", 1);
ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t netId, int read, enum ncclTopoGdrMode* gdrMode) {
*gdrMode = ncclTopoGdrModeDisable;
@ -387,7 +399,7 @@ ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t n
int n, g;
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &n));
struct ncclTopoNode* net = system->nodes[NET].nodes+n;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
// Check that both the NIC and GPUs support it
@ -423,29 +435,29 @@ ncclResult_t ncclTopoCheckGdr(struct ncclTopoSystem* system, int rank, int64_t n
// In case of PXN, use the intermediate GPU distance instead
int proxyRank;
NCCLCHECK(ncclTopoGetIntermediateRank(system, gpu->gpu.rank, netId, &proxyRank));
NCCLCHECK(ncclTopoRankToIndex(system, proxyRank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, proxyRank, &g, /*showWarn=*/true));
gpu = system->nodes[GPU].nodes+g;
distance = gpu->paths[NET][n].type;
}
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (ncclParamNetGdrC2c() && distance == PATH_PHB && gpu->paths[CPU][c].type == PATH_C2C) {
// On C2C platforms we can still use GDRDMA on NICs connected to the CPUs
INFO(NCCL_NET, "GPU %d / HCA %lx connected to CPU %d via C2C link", rank, netId, c);
// On C2C platforms we can still use GDRDMA on NICs connected to the CPUs
if (ncclParamNetGdrC2c() && distance == PATH_P2C) {
INFO(NCCL_GRAPH | NCCL_NET, "GPU %d / HCA %lx connected via C2C link", rank, netId);
distance = PATH_C2C;
}
if (distance > netGdrLevel) {
INFO(NCCL_NET,"GPU Direct RDMA Disabled for GPU %d / HCA %lx (distance %d > %d)", rank, netId, distance, netGdrLevel);
INFO(NCCL_GRAPH|NCCL_NET,"GPU Direct RDMA Disabled for GPU %d / HCA %lx (distance %d > %d)", rank, netId, distance, netGdrLevel);
return ncclSuccess;
}
// Force PCIe mapping if path goes through PCI on a C2C system
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (gpu->paths[CPU][c].type == PATH_C2C && distance != PATH_C2C) *gdrMode = ncclTopoGdrModePci;
else *gdrMode = ncclTopoGdrModeDefault;
INFO(NCCL_NET,"GPU Direct RDMA Enabled for GPU %d / HCA %lx (distance %d <= %d), read %d mode %s", rank, netId, distance, netGdrLevel, read, ncclTopoGdrModeStr[*gdrMode]);
INFO(NCCL_GRAPH|NCCL_NET,"GPU Direct RDMA Enabled for GPU %d / HCA %lx (distance %d <= %d), read %d mode %s", rank, netId, distance, netGdrLevel, read, ncclTopoGdrModeStr[*gdrMode]);
return ncclSuccess;
}
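A hypothetical condensed form of the GDR decision above, using path constants from the topo.h hunk in this change; it sketches the control flow only and leaves out the PXN intermediate-GPU resolution and the logging.
#include <cstdio>

#define PATH_C2C 3
#define PATH_P2C 6

enum GdrMode { GdrDisable, GdrDefault, GdrPci };

GdrMode pickGdrMode(int distance, int netGdrLevel, int gpuToCpuPathType, bool netGdrC2c) {
  if (netGdrC2c && distance == PATH_P2C) distance = PATH_C2C;  // C2C-attached NIC still gets GDRDMA
  if (distance > netGdrLevel) return GdrDisable;               // NIC too far from the GPU
  // Force the PCIe mapping when the GPU<->CPU link is C2C but the NIC path itself is not.
  return (gpuToCpuPathType == PATH_C2C && distance != PATH_C2C) ? GdrPci : GdrDefault;
}

int main() {
  // A P2C-distance NIC with the C2C fast path enabled and netGdrLevel of PATH_P2C: prints 1 (GdrDefault).
  printf("%d\n", (int)pickGdrMode(PATH_P2C, PATH_P2C, PATH_C2C, true));
}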
@ -480,7 +492,7 @@ ncclResult_t ncclTopoNeedFlush(struct ncclComm* comm, int64_t netId, int netDev,
if (props.forceFlush == 1 || ncclParamNetForceFlush()) return ncclSuccess;
int g;
struct ncclTopoSystem* system = comm->topo;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
// Flush is required on Ampere and earlier
if (gpu->gpu.cudaCompCap >= 90) *flush = 0;
@ -506,8 +518,8 @@ ncclResult_t ncclTopoCheckNet(struct ncclTopoSystem* system, int rank1, int rank
*net = 1;
// First check the current GPU-to-GPU speed.
int g1, g2;
if (ncclTopoRankToIndex(system, rank1, &g1) != ncclSuccess ||
ncclTopoRankToIndex(system, rank2, &g2) != ncclSuccess) {
if (ncclTopoRankToIndex(system, rank1, &g1, /*showWarn=*/false) != ncclSuccess ||
ncclTopoRankToIndex(system, rank2, &g2, /*showWarn=*/false) != ncclSuccess) {
return ncclSuccess;
}
@ -533,7 +545,7 @@ ncclResult_t ncclTopoGetIntermediateRank(struct ncclTopoSystem* system, int rank
// Get GPU and NET
int n, g;
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &n));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &g, /*showWarn=*/true));
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
struct ncclTopoLinkList* path = gpu->paths[NET]+n;
if (path->type == PATH_PXN) {
@ -601,6 +613,8 @@ ncclResult_t ncclTopoGetPxnRanks(struct ncclComm* comm, int** intermediateRanks,
return ncclSuccess;
}
NCCL_PARAM(PxnC2c, "PXN_C2C", 0);
ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm* comm) {
// Precompute paths between GPUs/NICs.
@ -659,6 +673,20 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
}
}
}
// update the GPU -> NIC path in the case of C2C + PHB
for (int n = 0; n < system->nodes[NET].count; n++) {
struct ncclTopoNode* netNode = system->nodes[NET].nodes + n;
for (int g = 0; g < system->nodes[GPU].count; g++) {
struct ncclTopoNode* gpuNode = system->nodes[GPU].nodes + g;
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (c == -1) continue;
if (mergePathType(gpuNode->paths[CPU][c].type, netNode->paths[CPU][c].type) == PATH_P2C) {
gpuNode->paths[NET][n].type = std::min(PATH_P2C, gpuNode->paths[NET][n].type);
netNode->paths[GPU][g].type = std::min(PATH_P2C, netNode->paths[GPU][g].type);
}
}
}
// Update paths for NICs (no GPU Direct, PXN, ...)
for (int n=0; n<system->nodes[NET].count; n++) {
@ -674,15 +702,19 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
// PXN = PCI + NVLink.
struct ncclTopoNode* peerNode = system->nodes[GPU].nodes+localGpuIndex;
// Only use PXN for NIC n if remote GPU p ...
if (peerNode->paths[NET][n].type <= PATH_PXB && // Is connected to the NIC through PCI
peerNode->paths[GPU][g].type <= PATH_NVL && // Is connected to us through NVLink
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) && // Is on the same node as us
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || // Has either higher BW to that NIC
gpu->paths[NET][n].type > PATH_PXB)) // or avoids going through a CPU
// We can use that GPU as relay to communicate with that NIC.
// Only enabling it in the GPU->NIC direction for now to favor
// receiving locally and sending remotely (consistent with net.cc)
NCCLCHECK(addInterStep(system, GPU, localGpuIndex, GPU, g, NET, n));
int pxnType = ncclParamPxnC2c() ? PATH_P2C : PATH_PXB;
if (/* (1) is connected to the NIC with PxN type*/
peerNode->paths[NET][n].type <= pxnType &&
/* and (2) is connected to us through NVLink */
peerNode->paths[GPU][g].type <= PATH_NVL &&
/* and (3) is on the same node as us */
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) &&
/* and (4) either has higher bw to that NIC or avoids going through the CPU (path type > PATH_PXN) */
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || gpu->paths[NET][n].type > PATH_PXN))
// We can use that GPU as relay to communicate with that NIC.
// Only enabling it in the GPU->NIC direction for now to favor
// receiving locally and sending remotely (consistent with net.cc)
NCCLCHECK(addInterStep(system, GPU, localGpuIndex, GPU, g, NET, n));
}
}
if (gpu->paths[NET][n].type < PATH_PHB) {
@ -699,6 +731,12 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
}
}
}
// Pre-compute NET local gpus to accelerate search
for (int n=0; n<system->nodes[NET].count; n++) {
struct ncclTopoNode* net = system->nodes[NET].nodes+n;
NCCLCHECK(ncclTopoGetLocalGpu(system, net->id, &net->net.localGpu));
}
return ncclSuccess;
}
@ -761,7 +799,7 @@ static ncclResult_t ncclTopoGetNchannels(struct ncclComm* comm, int g /*local gp
int peer;
struct ncclTopoSystem* system = comm->topo;
struct ncclTopoLinkList* path = NULL;
if (ncclTopoRankToIndex(system, peerRank, &peer) == ncclSuccess) {
if (ncclTopoRankToIndex(system, peerRank, &peer, /*showWarn=*/false) == ncclSuccess) {
// Same rank
if (g == peer) {
*nChannels = -1;

View File

@ -137,6 +137,7 @@ static ncclResult_t ncclTopoFollowPath(struct ncclTopoSystem* system, struct ncc
float bw = intra ? graph->bwIntra : graph->bwInter;
int type = intra ? graph->typeIntra : graph->typeInter;
if (path->type >= PATH_DIS) return ncclSuccess;
if (mult == 1 && (path->type > type)) return ncclSuccess;
if (mult == 1 && (graph->pattern == NCCL_TOPO_PATTERN_BALANCED_TREE ||
graph->pattern == NCCL_TOPO_PATTERN_TREE ||
@ -328,8 +329,7 @@ ncclResult_t ncclTopoReplayGetGpu(struct ncclTopoSystem* system, struct ncclTopo
*g = i;
return ncclSuccess;
}
if (*g == -1) return ncclInternalError;
return ncclSuccess;
return ncclInternalError;
}
ncclResult_t ncclTopoSearchRecGpu(struct ncclTopoSystem* system, struct ncclTopoGraph* graph, struct ncclTopoGraph* saveGraph, struct ncclTopoNode* gpu, int step, int backToNet, int backToFirstRank, int forcedOrder, int *time);
@ -437,6 +437,65 @@ ncclResult_t ncclTopoCompareGraphs(struct ncclTopoSystem* system, struct ncclTop
return ncclSuccess;
}
// Add the preferred NICs ordered by GPU first
static ncclResult_t ncclTopoPrefNetsGpuFirst(struct ncclTopoSystem* system, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCount) {
const int nGpus = (gpu == -1) ? system->nodes[GPU].count : 1;
int gpuCount = nGpus;
int gpuIds[NCCL_TOPO_MAX_NODES] = {gpu};
int firstNets[NCCL_TOPO_MAX_NODES];
if (gpu == -1)
for (int g = 0; g < nGpus; g++) gpuIds[g] = g;
for (int c = 0; c < MAXCHANNELS; c++) {
for (int g = 0; g < nGpus; g++) {
if (gpuIds[g] == -1) continue;
int localNet;
int64_t netId;
struct ncclTopoNode* gpu = system->nodes[GPU].nodes + gpuIds[g];
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &localNet));
// store the first net found for each GPU in case of duplicates
if(c == 0) firstNets[g] = localNet;
// if the NET has already been returned for channel 0, that GPU is done
if (c > 0 && firstNets[g] == localNet) {
gpuIds[g] = -1;
gpuCount--;
continue;
}
// only add it to the list if it doesn't already exist
int found = 0;
while (found < (*netCount) && nets[found] != localNet) found++;
if (found == (*netCount)) nets[(*netCount)++] = localNet;
}
if (gpuCount == 0) break;
}
return ncclSuccess;
}
// Add the preferred NICs ordered by channels first
static ncclResult_t ncclTopoPrefNetsChannelFirst(struct ncclTopoSystem* system, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCount) {
for (int g = 0; g < system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
int localNetCount = 0, localNets[MAXCHANNELS];
struct ncclTopoNode* gpu = system->nodes[GPU].nodes + g;
for (int c = 0; c < MAXCHANNELS; c++) {
int64_t netId;
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, localNets + localNetCount));
if (localNetCount > 0 && localNets[localNetCount] == localNets[0]) break;
localNetCount++;
}
// Append NICs to list
for (int i = 0; i < localNetCount; i++) {
int n = localNets[i];
int found = 0;
while (found < (*netCount) && nets[found] != n) found++;
if (found == (*netCount)) nets[(*netCount)++] = n;
}
}
return ncclSuccess;
}
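A toy illustration (made-up preference table) of the two loop orders above: ncclTopoPrefNetsGpuFirst walks channels in the outer loop and GPUs in the inner loop, interleaving NICs across GPUs, while ncclTopoPrefNetsChannelFirst exhausts one GPU's preferred NIC list before moving to the next GPU.
#include <cstdio>
#include <vector>

int main() {
  // pref[g][c]: the NIC that GPU g prefers for channel c (invented values)
  std::vector<std::vector<int>> pref = {{0, 1}, {2, 3}};

  printf("GpuFirst order    :");  // 0 2 1 3
  for (size_t c = 0; c < 2; c++)
    for (size_t g = 0; g < pref.size(); g++) printf(" %d", pref[g][c]);

  printf("\nChannelFirst order:");  // 0 1 2 3
  for (size_t g = 0; g < pref.size(); g++)
    for (size_t c = 0; c < 2; c++) printf(" %d", pref[g][c]);
  printf("\n");
}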
// Build a sorted list of the NETs to try.
//
// "gpu" can be set to -1 to build a list suitable for all GPUs (search start) or to a given gpu
@ -445,39 +504,25 @@ ncclResult_t ncclTopoCompareGraphs(struct ncclTopoSystem* system, struct ncclTop
// The list is built the following way:
// 1. Select NETs starting with those close to GPU(s), based on paths[n].type.
// 2. add other NETs satisfying typeInter but not already in the list.
NCCL_PARAM(ScatterEnable, "MNNVL_SCATTER_NETS_ENABLE", 1);
ncclResult_t ncclTopoSelectNets(struct ncclTopoSystem* system, int typeInter, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCountRet) {
ncclResult_t ret = ncclSuccess;
int netCount = 0;
int localNetCount;
int localNets[MAXCHANNELS];
// First add the preferred NICs
for (int g=0; g<system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
localNetCount = 0;
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
for (int c = 0; c<MAXCHANNELS; c++) {
int64_t netId;
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, localNets+localNetCount));
if (localNetCount > 0 && localNets[localNetCount] == localNets[0]) break;
localNetCount++;
}
// Append NICs to list
for (int i=0; i<localNetCount; i++) {
int n = localNets[i];
int found = 0;
while (found<netCount && nets[found] != n) found++;
if (found == netCount) nets[netCount++] = n;
}
// First add the preferred NETs.
if (system->nHosts > 1 && ncclParamScatterEnable()) {
// For MNNVL systems, we sort the devices by GPU first, then by channel
NCCLCHECK(ncclTopoPrefNetsGpuFirst(system, gpu, nets, &netCount));
} else {
// For other systems, we sort the devices by channel first, then by GPU
NCCLCHECK(ncclTopoPrefNetsChannelFirst(system, gpu, nets, &netCount));
}
// Then add others satisfying typeInter
for (int t=0; t <= typeInter; t++) {
for (int g=0; g<system->nodes[GPU].count; g++) {
for (int g = 0; g < system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
localNetCount = 0;
int localNetCount = 0, localNets[MAXCHANNELS];
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
struct ncclTopoLinkList* paths = gpu->paths[NET];
for (int n=0; n<system->nodes[NET].count && n<MAXCHANNELS; n++) {
@ -625,8 +670,7 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
if (graph->pattern == NCCL_TOPO_PATTERN_NVLS || graph->pattern == NCCL_TOPO_PATTERN_COLLNET_DIRECT) {
// NVLS search only tries to find NIC:GPU combinations to compute the heads.
if (graph->nChannels < netCount) {
int gpu;
NCCLCHECK(ncclTopoGetLocalGpu(system, net->id, &gpu));
int gpu = net->net.localGpu;
if (gpu != -1) {
int duplicate = 0;
// check whether there is duplicate head when one GPU connects with multiple NICs
@ -643,13 +687,12 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
}
}
} else {
if (graph->nChannels > 0) {
if (graph->nChannels > 0 && graph->sameChannels == 1) {
// Try to replay the last channel
int g;
NCCLCHECK(ncclTopoReplayGetGpu(system, graph, -1, &g));
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time, NET, n, g));
}
if (graph->nChannels == 0 || graph->sameChannels == 0) {
} else {
if (graph->nChannels == 0 && system->nodes[NVS].count == 0) {
// Always try the PCI order first to set a reference, but don't count in the timeout nor let it run for long
int t = 1 << 10;
@ -658,24 +701,17 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
}
// Then try the most local GPUs
float maxBw = 0;
int minHops = 0xfffffff;
struct ncclTopoLinkList* paths = net->paths[GPU];
for (int g=0; g<system->nodes[GPU].count; g++) {
if (paths[g].bw > maxBw) {
maxBw = paths[g].bw;
minHops = paths[g].count;
} else if (paths[g].bw == maxBw && paths[g].count > 0 && paths[g].count < minHops) {
minHops = paths[g].count;
}
int localGpu = net->net.localGpu;
if (localGpu != -1) {
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, localGpu));
}
if (maxBw >= bw) {
for (int i=0; i<system->nodes[GPU].count; i++) {
int g = (graph->nChannels+i)%system->nodes[GPU].count;
if (paths[g].bw == maxBw && paths[g].count == minHops) {
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, g));
}
}
int localGpus[NCCL_TOPO_MAX_NODES], localGpuCount, pathType;
NCCLCHECK(ncclTopoGetLocal(system, NET, n, GPU, localGpus, &localGpuCount, &pathType));
// if no GPUs are connected, skip this net
if (pathType == PATH_DIS) continue;
for (int g = 0; g < localGpuCount; ++g) {
if (localGpus[g] == localGpu) continue; // We already tried this one
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, localGpus[g]));
}
}
}
@ -761,6 +797,7 @@ struct kvDict kvDictLinkType[] = {
{ "NVB", PATH_NVB },
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "P2C", PATH_P2C },
{ "PXN", PATH_PXN },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
@ -809,8 +846,10 @@ ncclResult_t ncclTopoGetGraphFromXmlSub(struct ncclXmlNode *xmlGraph, struct ncc
NCCLCHECK(xmlGetAttrInt(xmlGraph, "nchannels", &graph->nChannels));
NCCLCHECK(xmlGetAttrFloat(xmlGraph, "speedintra", &graph->bwIntra));
NCCLCHECK(xmlGetAttrFloat(xmlGraph, "speedinter", &graph->bwInter));
if (xmlGetAttrFloat(xmlGraph, "latencyinter", &graph->latencyInter) != ncclSuccess) graph->latencyInter = 0.0;
const char* str;
NCCLCHECK(xmlGetAttr(xmlGraph, "latencyinter", &str));
if (!str) INFO(NCCL_GRAPH, "latencyinter not found in graph, using 0.0");
graph->latencyInter = str ? strtof(str, NULL) : 0.0;
NCCLCHECK(xmlGetAttr(xmlGraph, "typeintra", &str));
NCCLCHECK(kvConvertToInt(str, &graph->typeIntra, kvDictLinkType));
NCCLCHECK(xmlGetAttr(xmlGraph, "typeinter", &str));
@ -920,8 +959,8 @@ float sm90SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0,
#define NSPEEDSINTRA_SM90 (sizeof(sm90SpeedArrayIntra)/sizeof(float))
#define NSPEEDSINTER_SM90 (sizeof(sm90SpeedArrayInter)/sizeof(float))
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0 };
float sm100SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0, 18.0 };
float sm100SpeedArrayInter[] = { 96.0, 48.0, 45.1, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
#define NSPEEDSINTRA_SM100 (sizeof(sm100SpeedArrayIntra)/sizeof(float))
#define NSPEEDSINTER_SM100 (sizeof(sm100SpeedArrayInter)/sizeof(float))
@ -1060,13 +1099,13 @@ search:
int maxIntra = system->nodes[NET].count > 0 ? tmpGraph.typeInter : maxTypeIntra;
if (tmpGraph.typeIntra < maxIntra && (graph->nChannels == 0 || tmpGraph.typeIntra < graph->typeIntra)) {
tmpGraph.typeIntra += 1;
goto search;
if (tmpGraph.typeIntra < PATH_DIS) goto search;
}
tmpGraph.typeIntra = minTypeIntra;
if (system->nodes[NET].count > 0 && tmpGraph.typeInter < maxTypeInter && (graph->nChannels == 0 || tmpGraph.typeInter < graph->typeInter || tmpGraph.typeInter < PATH_PXN)) {
tmpGraph.typeInter += 1;
goto search;
if (tmpGraph.typeInter < PATH_DIS) goto search;
}
tmpGraph.typeInter = minTypeInter;
@ -1124,7 +1163,7 @@ done:
}
if (graph->nChannels == 0 && graph->collNet == 0 && graph->pattern != NCCL_TOPO_PATTERN_NVLS) {
WARN("Could not find a path for pattern %d, falling back to simple order", graph->pattern);
INFO(NCCL_GRAPH, "Could not find a path for pattern %d, falling back to simple order", graph->pattern);
for (int i=0; i<ngpus; i++) graph->intra[i] = system->nodes[GPU].nodes[i].gpu.rank;
graph->inter[0] = graph->inter[1] = 0;
graph->bwIntra = graph->bwInter = 0.1;
@ -1147,8 +1186,12 @@ ncclResult_t ncclTopoPrintGraph(struct ncclTopoSystem* system, struct ncclTopoGr
offset = strlen(line);
}
for (int i=0; i<ngpus; i++) {
sprintf(line+offset, " %s/%d", topoNodeTypeStr[GPU], graph->intra[ngpus*c+i]);
int g;
ncclTopoRankToIndex(system, graph->intra[ngpus * c + i], &g, true);
int64_t topoId = system->nodes[GPU].nodes[g].id;
sprintf(line + offset, " %s/%lx-%lx", topoNodeTypeStr[GPU], NCCL_TOPO_ID_SYSTEM_ID(topoId), NCCL_TOPO_ID_LOCAL_ID(topoId));
offset = strlen(line);
if (graph->id == 3) break; // NVLS graphs only use the first GPU
}
if (system->nodes[NET].count > 0) {
sprintf(line+offset, " %s/%lx-%lx", topoNodeTypeStr[NET], NCCL_TOPO_ID_SYSTEM_ID(graph->inter[2*c+1]), NCCL_TOPO_ID_LOCAL_ID(graph->inter[2*c+1]));
@ -1248,7 +1291,7 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
}
if (pxnLevel == 1) {
int g, n;
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g, /*showWarn=*/true));
NCCLCHECK(ncclTopoIdToIndex(comm->topo, NET, netId, &n));
struct ncclTopoNode* gpu = comm->topo->nodes[GPU].nodes+g;
if (gpu->paths[NET][n].type <= PATH_PXN) {
@ -1260,11 +1303,12 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
// Check which local GPU corresponds to that NIC and see if we can use PXN.
int n, g1, g2;
NCCLCHECK(ncclTopoIdToIndex(comm->topo, NET, netId, &n));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g1));
NCCLCHECK(ncclTopoRankToIndex(comm->topo, rank, &g1, /*showWarn=*/true));
NCCLCHECK(ncclTopoGetLocalGpu(comm->topo, netId, &g2));
if (g2 != -1) {
struct ncclTopoNode* peerGpu = comm->topo->nodes[GPU].nodes+g2;
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= PATH_PXB) {
int pxnType = ncclParamPxnC2c() ? PATH_P2C : PATH_PXB;
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= pxnType) {
*proxyRank = peerGpu->gpu.rank;
if (dev) *dev = netDev;
if (id) *id = netId;

View File

@ -9,12 +9,10 @@
#include "topo.h"
#include "comm.h"
#include "nvmlwrap.h"
#include "net.h"
#include "coll_net.h"
#include "transport.h"
#include <sys/stat.h>
#include <fcntl.h>
#include "xml.h"
#include "cpuset.h"
#include "bootstrap.h"
@ -22,8 +20,8 @@
#define BUSID_REDUCED_SIZE (sizeof("0000:00"))
const char* topoNodeTypeStr[] = { "GPU", "PCI", "NVS", "CPU", "NIC", "NET" };
const char* topoLinkTypeStr[] = { "LOC", "NVL", "", "C2C", "PCI", "", "", "", "SYS", "NET" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "PXN", "PHB", "SYS", "NET", "DIS" };
const char* topoLinkTypeStr[] = { "LOC", "NVL", "", "C2C", "PCI", "", "", "", "", "SYS", "NET" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "P2C", "PXN", "PHB", "SYS", "NET", "DIS" };
/******************************************************************/
/******************* Graph Creation Functions *********************/
@ -251,7 +249,7 @@ ncclResult_t ncclTopoFlattenBcmSwitches(struct ncclTopoSystem* system) {
pciSwitch->pci.device |= 0xffff;
free(subSwIds);
// Restart, as system->nodes[PCI].nodes has changed.
s = 0;
s = -1; // Will be incremented to 0 in the next loop iteration
continue;
fail:
free(subSwIds);
@ -404,7 +402,9 @@ ncclResult_t ncclTopoAddGpu(struct ncclXmlNode* xmlGpu, struct ncclTopoSystem* s
return ncclSuccess;
}
struct kvDict kvDictPciClass[] = { { "0x060400", PCI }, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
#define PCI_BRIDGE_DEVICE_CLASS "0x060400"
struct kvDict kvDictPciClass[] = { { PCI_BRIDGE_DEVICE_CLASS, PCI }, { "0x068000", NVS }, { "0x068001", CPU }, { "0x03", GPU }, { "0x02", NIC }, { NULL, PCI /* Default fallback value */ } };
struct kvDict kvDictPciGen[] = {
{ "2.5 GT/s", 15 }, { "5 GT/s", 30 }, { "8 GT/s", 60 }, { "16 GT/s", 120 }, { "32 GT/s", 240 }, /* Kernel 5.6 and earlier */
{ "2.5 GT/s PCIe", 15 }, { "5.0 GT/s PCIe", 30 }, { "8.0 GT/s PCIe", 60 }, { "16.0 GT/s PCIe", 120 }, { "32.0 GT/s PCIe", 240 }, { "64.0 GT/s PCIe", 480 },
@ -677,7 +677,14 @@ ncclResult_t ncclTopoGetSystemFromXml(struct ncclXml* xml, struct ncclTopoSystem
struct ncclXmlNode* node = topNode->subs[s];
if (strcmp(node->name, "cpu") == 0) NCCLCHECK(ncclTopoAddCpu(node, *topoSystem));
}
for (int systemId=0; systemId<system->nHosts; systemId++) if (system->hostHashes[systemId] == localHostHash) system->systemId = systemId;
int systemId = 0;
while (systemId < system->nHosts && system->hostHashes[systemId] != localHostHash) systemId++;
system->systemId = systemId;
if (systemId == system->nHosts) {
WARN("localHostHash = 0x%lx not found in the list of system hostHashes", localHostHash);
return ncclInvalidArgument;
}
NCCLCHECK(ncclTopoAddNvLinks(topNode, *topoSystem, NULL, 0));
NCCLCHECK(ncclTopoAddC2c(topNode, *topoSystem, NULL, 0));
@ -699,6 +706,7 @@ static ncclResult_t xmlInitAttrInt(struct ncclXmlNode* node, const char* attrNam
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "%d", value);
}
return ncclSuccess;
@ -709,6 +717,7 @@ static ncclResult_t xmlInitAttrUint64(struct ncclXmlNode* node, const char* attr
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "0x%lx", value);
}
return ncclSuccess;
@ -719,6 +728,7 @@ static ncclResult_t xmlInitAttrFloat(struct ncclXmlNode* node, const char* attrN
if (index == -1) {
index = node->nAttrs++;
strncpy(node->attrs[index].key, attrName, MAX_STR_LEN);
node->attrs[index].key[MAX_STR_LEN] = '\0';
snprintf(node->attrs[index].value, MAX_STR_LEN, "%f", value);
}
return ncclSuccess;
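The explicit terminator added above matters because strncpy() does not NUL-terminate when the source fills the count; a standalone illustration with an invented buffer size:
#include <cstring>
#include <cstdio>

int main() {
  char key[8 + 1];                // stand-in for attrs[].key, sized MAX_STR_LEN + 1
  strncpy(key, "0123456789", 8);  // copies exactly 8 chars, writes no terminator
  key[8] = '\0';                  // the fix applied in xmlInitAttr*() above
  printf("%s\n", key);            // prints "01234567"
}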
@ -799,6 +809,17 @@ typedef struct xmlNodeStack {
} xmlNodeStack;
ncclResult_t ncclFindFirstPciParent(ncclXmlNode** parent) {
ncclXmlNode* newParent = *parent;
while (strcmp(newParent->name, "pci") != 0) {
newParent = newParent->parent;
if (newParent == nullptr) return ncclSuccess;
if (strcmp(newParent->name, "system") == 0) return ncclSuccess;
}
*parent = newParent;
return ncclSuccess;
}
// 1. Find the common parent xmlNode between the given set of nodes
ncclResult_t ncclTopoGetPath(ncclXmlNode** nodes, int nNodes, int* path, ncclXmlNode** parent) {
// Track a stack of parents per-net node being merged
@ -897,6 +918,7 @@ ncclResult_t ncclTopoGetPath(ncclXmlNode** nodes, int nNodes, int* path, ncclXml
}
out:
ncclFindFirstPciParent(&common);
*parent = common;
free(parents);
return ncclSuccess;
@ -960,13 +982,19 @@ ncclResult_t ncclTopoMakePciParent(struct ncclXml* xml, struct ncclXmlNode** par
return ncclSuccess;
}
ncclResult_t ncclTopoMakeVnic(ncclComm_t comm, struct ncclXml* xml, ncclNetVDeviceProps_t* vProps,
struct ncclXmlNode** physNetNodes, struct ncclXmlNode** netNode, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoMakeVnic(struct ncclXml* xml, ncclNetVDeviceProps_t* vProps,
struct ncclXmlNode** physNetNodes, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
if (vProps->ndevs > NCCL_NET_MAX_DEVS_PER_NIC) {
WARN("TOPO/NET : Tried to merge too many NICs. %d > %d", vProps->ndevs, NCCL_NET_MAX_DEVS_PER_NIC);
return ncclInternalError;
}
// Don't make vNics of size 1
if (vProps->ndevs == 1) {
TRACE(NCCL_GRAPH, "TOPO/NET : Skipping vNic of size 1");
return ncclSuccess;
}
// Trigger the merge, then get the new device's properties
int vDevIndex = 0;
ncclResult_t ret = makeVDevice(&vDevIndex, vProps);
@ -976,11 +1004,18 @@ struct ncclXmlNode** physNetNodes, struct ncclXmlNode** netNode, ncclResult_t (*
return ret;
}
// Mark original NICs as keep="0" in the topology
for (int i = 0; i < vProps->ndevs; i++) {
int dev = vProps->devs[i];
struct ncclXmlNode* netNode = physNetNodes[dev];
NCCLCHECK(xmlSetAttrInt(netNode, "keep", 0));
}
INFO(NCCL_GRAPH, "TOPO/NET : Made vNic %d", vDevIndex);
return ncclSuccess;
}
ncclResult_t ncclTopoForceMerge(ncclComm_t comm, struct ncclXml* xml, const char* str, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoForceMerge(struct ncclXml* xml, char* str, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ret = ncclSuccess;
INFO(NCCL_ENV|NCCL_NET, "TOPO/NET : Force-fusing NICs using NCCL_NET_FORCE_MERGE=%s", str);
char* ncStr;
@ -1018,8 +1053,7 @@ ncclResult_t ncclTopoForceMerge(ncclComm_t comm, struct ncclXml* xml, const char
goto fail;
}
struct ncclXmlNode* netNode;
ret = ncclTopoMakeVnic(comm, xml, &vProps, physNetNodes, &netNode, makeVDevice);
ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
if (ret == ncclSuccess) {
// Only set that a device is "placed" after successfully making a vNic (it's possible to exit before this)
for (int i = 0; i < vProps.ndevs; i++) {
@ -1041,7 +1075,7 @@ fail:
goto exit;
}
ncclResult_t ncclTopoAutoMerge(ncclComm_t comm, struct ncclXml* xml, int mergeLevel, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
ncclResult_t ncclTopoAutoMerge(struct ncclXml* xml, int mergeLevel, int* placedDevs, ncclNetProperties_t* propsList, struct ncclXmlNode** physNetNodes, int nPhysDevs, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*)) {
// Compute the path type between each device
int* paths = NULL;
ncclResult_t res = ncclSuccess;
@ -1085,8 +1119,7 @@ ncclResult_t ncclTopoAutoMerge(ncclComm_t comm, struct ncclXml* xml, int mergeLe
return ncclInternalError;
}
struct ncclXmlNode* netNode;
ncclResult_t ret = ncclTopoMakeVnic(comm, xml, &vProps, physNetNodes, &netNode, makeVDevice);
ncclResult_t ret = ncclTopoMakeVnic(xml, &vProps, physNetNodes, makeVDevice);
// Merging failed.
// Mark all as unplaced and increase their distance to disconnected (PATH_DIS)
@ -1117,6 +1150,7 @@ struct kvDict nicPathKvList[] = {
{ "PORT", PATH_PORT },
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "P2C", PATH_P2C },
{ "PXN", PATH_PXN },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
@ -1139,14 +1173,19 @@ ncclResult_t ncclTopoGetVNicParent(struct ncclXml* xml, ncclResult_t (*getProper
if (path == PATH_LOC) {
*parent = NULL;
} else if (parent && strcmp((*parent)->name, "pci") == 0) {
// If the common parent is PCI, we must reparent the new NIC under a made up busId
NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
// Compare PCI class here to avoid NCCL WARN when the "class" attribute doesn't exist
const char* c;
NCCLCHECK(xmlGetAttrStr(*parent, "class", &c));
if (strcmp(c, PCI_BRIDGE_DEVICE_CLASS) == 0) {
// If the common parent is a PCI switch, we must reparent the new NIC under a made up pci device with a unique busid
NCCLCHECK(ncclTopoMakePciParent(xml, parent, physNetNodes[0]));
}
}
TRACE(NCCL_GRAPH, "Selected parent %s with path %d", (*parent)->name, path);
return ncclSuccess;
}
ncclResult_t ncclTopoMakeVNics(ncclComm_t comm, struct ncclXml* xml, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*getProperties)(int, ncclNetProperties_t*), int physicalDevs) {
ncclResult_t ncclTopoMakeVNics(struct ncclXml* xml, ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*getProperties)(int, ncclNetProperties_t*), int physicalDevs) {
int* placedDevs = NULL;
struct ncclXmlNode** physNetNodes = NULL;
if (physicalDevs == 0) return ncclSuccess;
@ -1170,15 +1209,15 @@ ncclResult_t ncclTopoMakeVNics(ncclComm_t comm, struct ncclXml* xml, ncclResult_
{ // Avoids warnings related to jumping to "out"
const char* mergeLevelEnv = ncclGetEnv("NCCL_NET_MERGE_LEVEL");
if (mergeLevelEnv) kvConvertToInt(mergeLevelEnv, &mergeLevel, nicPathKvList);
const char* forceMerge = ncclGetEnv("NCCL_NET_FORCE_MERGE");
char* forceMerge = (char*) ncclGetEnv("NCCL_NET_FORCE_MERGE");
NCCLCHECK(ncclCalloc(&placedDevs, physicalDevs));
memset(placedDevs, 0, sizeof(int)*physicalDevs);
if (forceMerge) {
NCCLCHECKGOTO(ncclTopoForceMerge(comm, xml, forceMerge, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
NCCLCHECKGOTO(ncclTopoForceMerge(xml, forceMerge, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
}
}
NCCLCHECKGOTO(ncclTopoAutoMerge(comm, xml, mergeLevel, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
NCCLCHECKGOTO(ncclTopoAutoMerge(xml, mergeLevel, placedDevs, props, physNetNodes, physicalDevs, makeVDevice), res, out);
out:
free(physNetNodes);
@ -1187,7 +1226,7 @@ out:
return res;
}
static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int startIndex, int endIndex, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), const char* netName, int coll, int keep, int virtualNics) {
static ncclResult_t ncclTopoPopulateNics(ncclXml* xml, int startIndex, int endIndex, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), const char* netName, int coll, int virtualNics, bool dmaBufSupport) {
for (int n = startIndex; n < endIndex; n++) {
ncclNetProperties_t props;
NCCLCHECK(getProperties(n, &props));
@ -1206,15 +1245,17 @@ static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int star
const char* colAttr;
NCCLCHECK(xmlGetAttr(netNode, "coll", &colAttr));
// If coll == 0 but the netNode is tagged as coll, don't update the keep value
if (colAttr == NULL || coll != 0 || strcmp(colAttr,"1") != 0) NCCLCHECK(xmlSetAttrInt(netNode, "keep", keep));
NCCLCHECK(xmlSetAttrInt(netNode, "keep", 1));
int dev;
xmlGetAttrIntDefault(netNode, "dev", &dev, -1);
if (dev != -1 && dev != n) INFO(NCCL_GRAPH, "TOPO/NET : Changing %s dev index from %d to %d", netName, dev, n);
NCCLCHECK(xmlSetAttrInt(netNode, "dev", n));
NCCLCHECK(xmlInitAttrInt(netNode, "latency", props.latency));
NCCLCHECK(xmlInitAttrInt(netNode, "speed", props.speed));
NCCLCHECK(xmlInitAttrInt(netNode, "port", props.port));
NCCLCHECK(xmlInitAttrUint64(netNode, "guid", props.guid));
NCCLCHECK(xmlInitAttrInt(netNode, "maxconn", props.maxComms));
bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (comm->dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
bool gdrSupport = (props.ptrSupport & NCCL_PTR_CUDA) || (dmaBufSupport && (props.ptrSupport & NCCL_PTR_DMABUF));
INFO(NCCL_NET,"NET/%s : GPU Direct RDMA %s for HCA %d '%s'", netName, gdrSupport ? "Enabled" : "Disabled", n, props.name);
NCCLCHECK(xmlInitAttrInt(netNode, "gdr", gdrSupport));
// Only set coll if it's not 0
@ -1230,30 +1271,22 @@ static ncclResult_t ncclTopoPopulateNics(ncclComm_t comm, ncclXml* xml, int star
return ncclSuccess;
}
struct ncclTopoNetState {
int nVirtualNics;
int nPhysicalNics;
const char* name;
};
// Calls to network plugin APIs should be protected. This function should be called inside a per-process lock.
static ncclResult_t ncclTopoProcessNet(ncclComm_t comm, ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName) {
ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport) {
int usePhysicalDevices = (dumpXmlFile || makeVDevice == NULL);
if (state->nPhysicalNics == -1) NCCLCHECK(devices(&state->nPhysicalNics));
// Enumerate physical devices
NCCLCHECK(ncclTopoPopulateNics(comm, xml, 0, state->nPhysicalNics, getProperties, netName, coll, 1, 0));
NCCLCHECK(ncclTopoPopulateNics(xml, 0, state->nPhysicalNics, getProperties, netName, coll, false, dmaBufSupport));
if (!usePhysicalDevices) {
if (state->nVirtualNics == -1) {
NCCLCHECK(ncclTopoMakeVNics(comm, xml, makeVDevice, getProperties, state->nPhysicalNics));
NCCLCHECK(ncclTopoMakeVNics(xml, makeVDevice, getProperties, state->nPhysicalNics));
int nDevs;
NCCLCHECK(devices(&nDevs));
state->nVirtualNics = nDevs - state->nPhysicalNics;
}
// Remove keep=1 for physical collnets
if (state->nVirtualNics > 0) {
NCCLCHECK(ncclTopoPopulateNics(comm, xml, 0, state->nPhysicalNics, getProperties, netName, coll, 0, 0));
// Populate new devices
NCCLCHECK(ncclTopoPopulateNics(comm, xml, state->nPhysicalNics, state->nPhysicalNics+state->nVirtualNics, getProperties, netName, coll, 1, 1));
NCCLCHECK(ncclTopoPopulateNics(xml, state->nPhysicalNics, state->nPhysicalNics+state->nVirtualNics, getProperties, netName, coll, true, dmaBufSupport));
}
}
@ -1301,6 +1334,15 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
// Try default XML topology location
NCCLCHECKGOTO(ncclTopoGetXmlFromFile("/var/run/nvidia-topologyd/virtualTopology.xml", xml, 0), ret, fail);
}
// Fix up each CPU node's host_hash attribute: hashes read from existing XML files are not
// meant to be preserved, so overwrite them with the local host hash.
struct ncclXmlNode* node;
NCCLCHECKGOTO(xmlFindTag(xml, "cpu", &node), ret, fail);
while (node != nullptr) {
NCCLCHECKGOTO(xmlSetAttrLong(node, "host_hash", getHostHash()), ret, fail);
NCCLCHECKGOTO(xmlFindNextTag(xml, "cpu", node, &node), ret, fail);
}
if (xml->maxIndex == 0) {
// Create top tag
struct ncclXmlNode* top;
@ -1313,7 +1355,6 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
// Detect only the GPU managed by this process. We'll get any others through XML fusion.
char busId[NVML_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
NCCLCHECKGOTO(int64ToBusId(comm->peerInfo[comm->rank].busId, busId), ret, fail);
struct ncclXmlNode* node;
NCCLCHECKGOTO(ncclTopoFillGpu(xml, busId, &node), ret, fail);
if (node) {
NCCLCHECKGOTO(xmlSetAttrInt(node, "keep", 1), ret, fail);
@ -1330,12 +1371,12 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
state = NULL;
if (collNetSupport(comm)) {
NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclCollNet->name, collNetStates), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(comm, xml, 1, dumpXmlFile, state,
comm->ncclCollNet->getProperties, comm->ncclCollNet->makeVDevice, comm->ncclCollNet->devices, comm->ncclCollNet->name), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(xml, 1, dumpXmlFile, state,
comm->ncclCollNet->getProperties, comm->ncclCollNet->makeVDevice, comm->ncclCollNet->devices, comm->ncclCollNet->name, comm->dmaBufSupport), ret, fail);
}
NCCLCHECKGOTO(ncclTopoGetSharedState(&state, comm->ncclNet->name, netStates), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(comm, xml, 0, dumpXmlFile, state,
comm->ncclNet->getProperties, comm->ncclNet->makeVDevice, comm->ncclNet->devices, comm->ncclNet->name), ret, fail);
NCCLCHECKGOTO(ncclTopoProcessNet(xml, 0, dumpXmlFile, state,
comm->ncclNet->getProperties, comm->ncclNet->makeVDevice, comm->ncclNet->devices, comm->ncclNet->name, comm->dmaBufSupport), ret, fail);
pthread_mutex_unlock(&netLock);
netLockHeld = 0;
@ -1387,7 +1428,7 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
}
// Only update our topo tracking structure if we aren't dumping (separate steps)
if (dumpXmlFile == NULL) NCCLCHECKGOTO(ncclTopoGetSystemFromXml(xml, system, comm->peerInfo[comm->rank].hostHash), ret, fail);
if (dumpXmlFile == NULL) NCCLCHECKGOTO(ncclTopoGetSystemFromXml(xml, system, getHostHash()), ret, fail);
exit:
if (!comm->MNNVL && localRanks) free(localRanks);
@ -1399,7 +1440,7 @@ fail:
goto exit;
}
static ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType,
ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType,
int locals[NCCL_TOPO_MAX_NODES], int* localCount, int* pathType) {
int minType = PATH_DIS;
float maxBw = 0;
@ -1452,7 +1493,7 @@ ncclResult_t getLocalNetCountByBw(struct ncclTopoSystem* system, int gpu, int *c
ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int channelId, int64_t* id, int* dev) {
int gpu;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpu));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpu, /*showWarn=*/true));
int localNets[NCCL_TOPO_MAX_NODES];
int localNetCount;
@ -1517,7 +1558,7 @@ NCCL_PARAM(IgnoreCpuAffinity, "IGNORE_CPU_AFFINITY", 0);
ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu_set_t* affinity) {
struct ncclTopoNode* cpu = NULL, *gpu = NULL;
int gpuIndex, cpuIndex;
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpuIndex));
NCCLCHECK(ncclTopoRankToIndex(system, rank, &gpuIndex, /*showWarn=*/true));
NCCLCHECK(ncclGetLocalCpu(system, gpuIndex, &cpuIndex));
gpu = system->nodes[GPU].nodes+gpuIndex;
cpu = system->nodes[CPU].nodes+cpuIndex;
@ -1529,8 +1570,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
#ifdef ENABLE_TRACE
{
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&mask, affinityStr));
TRACE(NCCL_INIT, "Current affinity for GPU %d is %s", gpu->gpu.dev, affinityStr);
TRACE(NCCL_INIT, "Current affinity for GPU %d is %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&mask, affinityStr, sizeof(affinityStr)));
}
#endif
@ -1540,8 +1581,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
#ifdef ENABLE_TRACE
{
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&cpuMask, affinityStr));
TRACE(NCCL_INIT, "CPU GPU affinity for GPU %d is %s", gpu->gpu.dev, affinityStr);
TRACE(NCCL_INIT, "CPU GPU affinity for GPU %d is %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&cpuMask, affinityStr, sizeof(affinityStr)));
}
#endif
@ -1558,8 +1599,8 @@ ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu
// If there is a non empty set, use it to set affinity
if (CPU_COUNT(&finalMask)) {
char affinityStr[sizeof(cpu_set_t)*2];
NCCLCHECK(ncclCpusetToStr(&finalMask, affinityStr));
INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev, affinityStr);
INFO(NCCL_INIT, "Setting affinity for GPU %d to %s", gpu->gpu.dev,
ncclCpusetToRangeStr(&finalMask, affinityStr, sizeof(affinityStr)));
}
return ncclSuccess;
}

View File

@ -9,6 +9,8 @@
#include "graph.h"
#include "core.h"
#include "xml.h"
#include "net.h"
#define LOC_BW 5000.0
#define SM60_NVLINK_BW 18.0
@ -16,7 +18,7 @@
#define SM80_NVLINK_BW 20.0
#define SM90_NVLINK_BW 20.6
#define SM86_NVLINK_BW 12.0
#define SM100_NVLINK_BW 40.0
#define SM100_NVLINK_BW 40.1
#define PCI_BW 12.0 // PCI Gen3 x16
#define AMD_BW 16.0
#define BDW_QPI_BW 6.0
@ -50,9 +52,10 @@ extern const char* topoNodeTypeStr[];
#define LINK_PCI 4
// Skipping 5 for PATH_PXB
// Skipping 6 for PATH_PXN
// Skipping 7 for PATH_PHB
#define LINK_SYS 8
#define LINK_NET 9
// Skipping 7 for PATH_P2C
// Skipping 8 for PATH_PHB
#define LINK_SYS 9
#define LINK_NET 10
extern const char* topoLinkTypeStr[];
// Local (myself)
@ -73,25 +76,30 @@ extern const char* topoLinkTypeStr[];
// Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
#define PATH_PXB 5
// Connection between a GPU and a NIC using the C2C connection to the CPU and the PCIe connection to the NIC
#define PATH_P2C 6
// Connection between a GPU and a NIC using an intermediate GPU. Used to enable rail-local, aggregated network send/recv operations.
#define PATH_PXN 6
#define PATH_PXN 7
// Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
#define PATH_PHB 7
#define PATH_PHB 8
// Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
#define PATH_SYS 8
#define PATH_SYS 9
// Connection through the network
#define PATH_NET 9
#define PATH_NET 10
// New type of path which should precede PATH_PIX
#define PATH_PORT PATH_NVL
// Disconnected
#define PATH_DIS 10
#define PATH_DIS 11
extern const char* topoPathTypeStr[];
extern int64_t ncclParamPxnC2c();
struct ncclTopoNode;
struct ncclTopoLink {
int type;
@ -137,6 +145,7 @@ struct ncclTopoNode {
int gdrSupport;
int collSupport;
int maxChannels;
int localGpu;
}net;
struct {
int arch;
@ -181,6 +190,13 @@ ncclResult_t ncclTopoGetGpuMinPath(struct ncclTopoSystem* system, int type, int*
ncclResult_t ncclTopoGetGpuMaxPath(struct ncclTopoSystem* system, int type, int* max);
ncclResult_t ncclTopoSplitNvLink(struct ncclTopoSystem* system, int* splitNvLink);
struct ncclTopoNetState {
int nVirtualNics;
int nPhysicalNics;
const char* name;
};
ncclResult_t ncclTopoProcessNet(ncclXml* xml, int coll, const char* dumpXmlFile, ncclTopoNetState* state, ncclResult_t (*getProperties)(int, ncclNetProperties_t*), ncclResult_t (*makeVDevice)(int*, ncclNetVDeviceProps_t*), ncclResult_t (*devices)(int*), const char* netName, bool dmaBufSupport);
#define NCCL_TOPO_XML_MAX_NODES 256
#define NCCL_GRAPH_XML_MAX_NODES 4096
ncclResult_t ncclTopoGetSystemFromXml(struct ncclXml* xml, struct ncclTopoSystem** topoSystem, uint64_t localHostHash);
@ -200,7 +216,7 @@ static ncclResult_t ncclTopoIdToIndex(struct ncclTopoSystem* system, int type, i
return ncclInternalError;
}
static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank, int* index) {
static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank, int* index, bool showWarn) {
*index = -1;
for (int i=0; i<system->nodes[GPU].count; i++) {
if (system->nodes[GPU].nodes[i].gpu.rank == rank) {
@ -208,6 +224,7 @@ static ncclResult_t ncclTopoRankToIndex(struct ncclTopoSystem* system, int rank,
return ncclSuccess;
}
}
if (showWarn) WARN("ncclTopoRankToIndex could not find rank %d", rank);
return ncclInternalError;
}


@ -16,13 +16,13 @@ static int getNthreads(const char* name, int env, int min, int max, int def) {
int nt = env;
if (nt > 0) {
if (nt % WARP_SIZE != 0) {
WARN("Invalid %s %d (must be a multiple of %d)", name, nt, WARP_SIZE);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (must be a multiple of %d)", name, nt, WARP_SIZE);
nt = max;
} else if (nt > max) {
WARN("Invalid %s %d (maximum %d).", name, nt, max);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (maximum %d).", name, nt, max);
nt = max;
} else if (nt < min) {
WARN("Invalid %s %d (minimum %d).", name, nt, min);
INFO(NCCL_GRAPH|NCCL_ENV, "Invalid %s %d (minimum %d).", name, nt, min);
nt = min;
}
} else {
@ -51,11 +51,14 @@ static int getNthreads(const char* name, int env, int min, int max, int def) {
// NCCL_PROTO="^LL128;allreduce:LL128"
// Enable everything but LL128, but only LL128 for allreduce.
ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes, const char* elems[], int nelems, int* list) {
ncclResult_t ret = ncclSuccess;
char* fullStr = strdup(str);
char* tmpFullStr;
char* fullToken = strtok_r(fullStr, ";", &tmpFullStr);
char* subToken = nullptr;
char* tokStr = nullptr;
while (fullToken) {
char* subToken = strdup(fullToken);
subToken = strdup(fullToken);
char* tmpSubStr;
char* prefix = strtok_r(subToken, ":", &tmpSubStr);
char* elemList = strtok_r(NULL, ":", &tmpSubStr);
@ -65,7 +68,8 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
// because then all the prefixes before the prefix-less entry would be
// overwritten.
WARN("All entries except the first must have a prefix: \"%s\"", str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
elemList = prefix;
prefix = NULL;
@ -84,7 +88,7 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
foundPrefix = true;
for (int e=0; e<nelems; e++) list[p*nelems+e] = unset;
char* tokStr = strdup(elemList);
tokStr = strdup(elemList);
char* tmpStr;
char* elem = strtok_r(tokStr, ",", &tmpStr);
while (elem) {
@ -97,22 +101,32 @@ ncclResult_t parseList(const char* str, const char* prefixElems[], int nprefixes
}
if (e==nelems) {
WARN("Unrecognized element token \"%s\" when parsing \"%s\"", elem, str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
elem = strtok_r(NULL, ",", &tmpStr);
}
free(tokStr);
tokStr = nullptr;
}
if (!foundPrefix) {
WARN("Unrecognized prefix token \"%s\" when parsing \"%s\"", prefix, str);
return ncclInvalidUsage;
ret = ncclInvalidUsage;
goto fail;
}
free(subToken);
subToken = nullptr;
fullToken = strtok_r(NULL, ";", &tmpFullStr);
}
exit:
free(tokStr);
free(subToken);
free(fullStr);
return ncclSuccess;
return ret;
fail:
goto exit;
}
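The parsing above is a two-level tokenization: split on ';' for per-collective entries, then ':' to separate the prefix from its element list, then ',' for the elements themselves. The following is a minimal standalone sketch of that nested strtok_r pattern, using a hypothetical NCCL_PROTO-style string; it only demonstrates the tokenization shape, not NCCL's enable/disable semantics (the '^' negation is left uninterpreted).

#include <cstdio>
#include <cstring>

int main() {
  // Hypothetical input in the same "prefix:elem,elem;prefix:elem" format.
  char input[] = "^LL128;allreduce:LL128";
  char* saveOuter;
  for (char* entry = strtok_r(input, ";", &saveOuter); entry; entry = strtok_r(nullptr, ";", &saveOuter)) {
    char* saveInner;
    char* prefix = strtok_r(entry, ":", &saveInner);
    char* elems  = strtok_r(nullptr, ":", &saveInner);
    if (elems == nullptr) { elems = prefix; prefix = nullptr; }  // prefix-less entry applies to all prefixes
    char* saveElem;
    for (char* e = strtok_r(elems, ",", &saveElem); e; e = strtok_r(nullptr, ",", &saveElem)) {
      printf("prefix=%s elem=%s\n", prefix ? prefix : "(all)", e);
    }
  }
  return 0;
}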
// Latencies in us, Bandwidths in GB/s
@ -194,6 +208,8 @@ static float getNetOverhead(struct ncclComm* comm) {
return 1.0;
}
NCCL_PARAM(Ll128C2c, "LL128_C2C", 1);
ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCompCap, struct ncclTopoGraph** graphs) {
int simpleDefaultThreads = (graphs[NCCL_ALGO_RING]->bwIntra*graphs[NCCL_ALGO_RING]->nChannels <= PCI_BW) ? 256 : NCCL_SIMPLE_MAX_NTHREADS;
comm->maxThreads[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] =
@ -248,7 +264,14 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
&& a == NCCL_ALGO_PAT && (p != NCCL_PROTO_SIMPLE || ncclPatEnable(comm) == 0)) continue;
int collnet = (a == NCCL_ALGO_COLLNET_DIRECT || a == NCCL_ALGO_COLLNET_CHAIN) ? 1 : 0;
float bw = nNodes <= 2 || collnet ? graphs[a]->bwIntra : graphs[a]->bwInter;
if (a == NCCL_ALGO_NVLS) bw = std::min(graphs[a]->bwIntra, graphs[a]->bwInter);
if (a == NCCL_ALGO_NVLS) {
if (coll == ncclFuncAllReduce) {
bw = std::min(graphs[a]->bwIntra, graphs[a]->bwInter);
} else {
// allgather and reducescatter
bw = std::min(graphs[a]->bwIntra * (ppn - 1.0f) / ppn, graphs[a]->bwInter * 0.9f);
}
}
if (a == NCCL_ALGO_NVLS_TREE) bw = std::min(graphs[a]->bwIntra, nNodes <= 2 ? graphs[a]->bwInter : graphs[a]->bwInter/2);
float busBw = graphs[a]->nChannels * bw;
@ -264,19 +287,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
if (a == NCCL_ALGO_COLLNET_CHAIN && p != NCCL_PROTO_SIMPLE) busBw = 0; // Not used
if (a == NCCL_ALGO_COLLNET_DIRECT && p == NCCL_PROTO_SIMPLE) {
if (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter) {
busBw = ppn * bw;
// AllGather/ReduceScatter requires 1:1 GPU:NIC
int nicPerNode = comm->collNetHeadsNum;
if (coll == ncclFuncAllGather && comm->nNodes > 1) {
if (!comm->ncclCollNet || !comm->ncclCollNet->iallgather || ppn > nicPerNode) busBw = 0;
}
if (coll == ncclFuncReduceScatter && comm->nNodes > 1) {
if (!comm->ncclCollNet || !comm->ncclCollNet->ireducescatter || ppn > nicPerNode) busBw = 0;
}
// Measured corrective ratio needed at 1 ppn and 8ppn. Here we hackishly
// interpolate the two.
float w = (ppn-1)/(8-1);
busBw *= w*0.85 + (1-w)*0.95;
busBw = ppn * std::min(graphs[a]->bwIntra, graphs[a]->bwInter * 0.9f);
} else {
// Collnet+Direct requires all GPUs to have a local NIC to work at full speed
float factor = ppn / (1.0*graphs[a]->nChannels); // GPU/NIC ratio
@ -285,6 +296,26 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
if (minCompCap >= 90) busBw *= .85;
}
}
// disable collnet for allgather/reducescatter if #localranks > #heads
// AllGather/ReduceScatter requires 1:1 GPU:NIC
if ((a == NCCL_ALGO_NVLS || a == NCCL_ALGO_COLLNET_DIRECT) && p == NCCL_PROTO_SIMPLE && (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter) && comm->nNodes > 1) {
int nHeads = 0;
if (coll == ncclFuncAllGather && comm->nNodes > 1 && (!comm->ncclCollNet || !comm->ncclCollNet->iallgather)) busBw = 0.0f;
if (coll == ncclFuncReduceScatter && comm->nNodes > 1 && (!comm->ncclCollNet || !comm->ncclCollNet->ireducescatter)) busBw = 0.0f;
if (comm->config.collnetEnable)
nHeads = comm->collNetHeadsNum;
else
busBw = 0.0f;
if (busBw > 0.0f) {
for (int r = 0; r < comm->nRanks; r++) {
int node = comm->rankToNode[r];
if (comm->nodeRanks[node].localRanks > nHeads) {
busBw = 0.0f;
break;
}
}
}
}
// Convert bus BW to algorithm BW
if (!(a != NCCL_ALGO_RING && (coll == ncclFuncAllGather || coll == ncclFuncReduceScatter))) {
@ -411,7 +442,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
// Disable NVLS Tree on a single node
if (comm->nNodes == 1 && a == NCCL_ALGO_NVLS_TREE) disable = 1;
// Disable Collnet+Direct, Collnet+Chain or Collnet+NVLS if collnet is not supported.
if (comm->collNetSupport == 0 &&
if (comm->config.collnetEnable == 0 &&
(a == NCCL_ALGO_COLLNET_DIRECT ||
a == NCCL_ALGO_COLLNET_CHAIN ||
(a == NCCL_ALGO_NVLS && comm->nNodes > 1))) disable = 1;
@ -424,19 +455,19 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
for (int c=0; c<NCCL_NUM_FUNCTIONS; c++) for (int a=0; a<NCCL_NUM_ALGORITHMS; a++) for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) {
int pEnable = protoEnable[c*NCCL_NUM_PROTOCOLS+p];
if (pEnable == 2 && p == NCCL_PROTO_LL128) {
// Enable LL128 by default only on Volta/Ampere/Hopper/Blackwell+NVLink. Other cases are not tested and may cause silent data corruption.
pEnable = 1;
pEnable &= (graphs[a]->typeInter <= PATH_PXB || (minCompCap >= 90 && graphs[a]->typeInter <= PATH_PXN));
if (ncclParamLl128C2c() && minCompCap >= 90) {
// Enable LL128 by default only on Hopper/Blackwell for all connections up to P2C and PXN.
pEnable &= (graphs[a]->typeInter <= PATH_PXN);
} else {
// Enable LL128 only up to PXB. Don't enable LL128 over PxN because PxN can encapsulate PxB or P2C links.
pEnable &= (graphs[a]->typeInter <= PATH_PXB);
if (!ncclParamLl128C2c() && minCompCap >= 90)
INFO(NCCL_GRAPH, "Disabling LL128 over all PxN connections (PXB and C2C). This ensures that no C2C link will be used by LL128.");
}
pEnable &= (graphs[a]->typeIntra <= PATH_NVB);
pEnable &= (minCompCap == maxCompCap);
switch (minCompCap) {
case 70: pEnable &= 1; break;
case 80: pEnable &= 1; break;
case 90: pEnable &= !(CUDART_VERSION == 11080 && c == ncclFuncAllReduce && a == NCCL_ALGO_RING && comm->nRanks == 2); break;
case 100: pEnable &= 1; break;
case 120: pEnable &= 1; break;
default: pEnable &= 0; break;
}
pEnable &= !(minCompCap < 70 || (minCompCap == 90 && CUDART_VERSION == 11080 && c == ncclFuncAllReduce && a == NCCL_ALGO_RING && comm->nRanks == 2));
}
if (pEnable == 0) comm->bandwidths[c][a][p] = 0;
if (algoEnable[c*NCCL_NUM_ALGORITHMS+a] == 0) comm->bandwidths[c][a][p] = 0;
@ -483,7 +514,7 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
}
}
}
// Set per-thread amount of work before we increase nThreads and nChannels
for (int a=0; a<NCCL_NUM_ALGORITHMS; a++) {
comm->threadThresholds[a][NCCL_PROTO_LL] = NCCL_LL_THREAD_THRESHOLD;


@ -39,7 +39,13 @@ ncclResult_t xmlGetValue(FILE* file, char* value, char* last) {
#if INT_OK
int o = 0;
do {
value[o++] = c;
value[o] = c;
if (o == MAX_STR_LEN-1) {
value[o] = '\0';
WARN("Error : value %s too long (max %d)", value, MAX_STR_LEN);
return ncclInternalError;
}
o++;
NCCLCHECK(xmlGetChar(file, &c));
} while (c >= '0' && c <= '9');
value[o] = '\0';
@ -51,10 +57,17 @@ ncclResult_t xmlGetValue(FILE* file, char* value, char* last) {
#endif
}
int o = 0;
char quote = c; // Remember which quote type we started with
do {
NCCLCHECK(xmlGetChar(file, &c));
value[o++] = c;
} while (c != '"');
value[o] = c;
if (o == MAX_STR_LEN-1) {
value[o] = '\0';
WARN("Error : value %s too long (max %d)", value, MAX_STR_LEN);
return ncclInternalError;
}
o++;
} while (c != quote);
value[o-1] = '\0';
NCCLCHECK(xmlGetChar(file, last));
return ncclSuccess;
@ -267,7 +280,7 @@ ncclResult_t ncclTopoDumpXmlRec(int indent, FILE* file, struct ncclXmlNode* node
ncclResult_t ncclTopoDumpXmlToFile(const char* xmlTopoFile, struct ncclXml* xml) {
FILE* file = fopen(xmlTopoFile, "w");
if (file == NULL) {
WARN("Unable to open %s, not dumping topology.", xmlTopoFile);
INFO(NCCL_GRAPH|NCCL_ENV, "Unable to open %s, not dumping topology.", xmlTopoFile);
return ncclSuccess;
}
NCCLCHECK(ncclTopoDumpXmlRec(0, file, xml->nodes));
@ -375,7 +388,7 @@ ncclResult_t ncclTopoGetXmlFromFile(const char* xmlTopoFile, struct ncclXml* xml
FILE* file = fopen(xmlTopoFile, "r");
if (file == NULL) {
if (warn) {
WARN("Could not open XML topology file %s : %s", xmlTopoFile, strerror(errno));
INFO(NCCL_GRAPH|NCCL_ENV, "Could not open XML topology file %s : %s", xmlTopoFile, strerror(errno));
}
return ncclSuccess;
}
@ -759,7 +772,7 @@ ncclResult_t ncclTopoGetXmlFromGpu(struct ncclXmlNode* pciNode, nvmlDevice_t nvm
int maxNvLinks = (sm < 60) ? 0 : (sm < 70) ? 4 : (sm < 80) ? 6 : (sm < 90) ? 12 : 18;
if (maxNvLinks > 0 && nvmlDev == NULL) {
WARN("No NVML device handle. Skipping nvlink detection.");
INFO(NCCL_GRAPH, "No NVML device handle. Skipping nvlink detection.");
maxNvLinks = 0;
}
@ -961,8 +974,16 @@ ncclResult_t ncclTopoTrimXmlRec(struct ncclXmlNode* node, int* keep) {
NCCLCHECK(ncclTopoTrimXmlRec(subs[s], &k));
*keep += k;
}
if (*keep == 0 && // Trim PCI switches or CPU with no used GPU/NIC under them.
(strcmp(node->name, "pci") == 0 || strcmp(node->name, "cpu") == 0)) {
// Remove node if it has no children and no keep attribute
if (*keep == 0 && // Trim PCI switches, CPUs with no used GPU/NIC under them, or pruned NICs
(strcmp(node->name, "pci") == 0 || strcmp(node->name, "cpu") == 0 || strcmp(node->name, "nic") == 0 || strcmp(node->name, "net") == 0)) {
#ifdef ENABLE_TRACE
const char* name;
const char* busid;
NCCLCHECK(xmlGetAttr(node, "name", &name));
NCCLCHECK(xmlGetAttr(node, "busid", &busid));
TRACE(NCCL_GRAPH, "Removing node %s %s %s\n", node->name, name, busid);
#endif
NCCLCHECK(xmlRemoveNode(node));
}
}


@ -117,6 +117,13 @@ static ncclResult_t xmlGetAttrIntDefault(struct ncclXmlNode* node, const char* a
return ncclSuccess;
}
static ncclResult_t xmlGetAttrUint64(struct ncclXmlNode* node, const char* attrName, uint64_t* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
*value = strtoull(str, NULL, 0);
return ncclSuccess;
}
static ncclResult_t xmlGetAttrLong(struct ncclXmlNode* node, const char* attrName, int64_t* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
@ -124,7 +131,6 @@ static ncclResult_t xmlGetAttrLong(struct ncclXmlNode* node, const char* attrNam
return ncclSuccess;
}
static ncclResult_t xmlGetAttrFloat(struct ncclXmlNode* node, const char* attrName, float* value) {
const char* str;
NCCLCHECK(xmlGetAttrStr(node, attrName, &str));
@ -254,7 +260,6 @@ static ncclResult_t xmlSetAttrInt(struct ncclXmlNode* node, const char* attrName
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%d", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}
@ -267,7 +272,6 @@ static ncclResult_t xmlSetAttrFloat(struct ncclXmlNode* node, const char* attrNa
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%g", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}
@ -280,7 +284,6 @@ static ncclResult_t xmlSetAttrLong(struct ncclXmlNode* node, const char* attrNam
node->attrs[index].key[MAX_STR_LEN] = '\0';
}
snprintf(node->attrs[index].value, MAX_STR_LEN, "%#lx", value);
node->attrs[index].value[MAX_STR_LEN] = '\0';
return ncclSuccess;
}


@ -12,16 +12,14 @@
#include <assert.h>
#include "bootstrap.h"
#define GROUP_MAX_RECLAIM_STEPS 10
__thread int ncclGroupDepth = 0; // depth of ncclGroupStart nesting
__thread ncclResult_t ncclGroupError = ncclSuccess;
__thread struct ncclComm* ncclGroupCommHead = nullptr;
__thread struct ncclComm* ncclGroupCommHead[ncclGroupTaskTypeNum] = {nullptr};
__thread struct ncclComm* ncclGroupCommPreconnectHead = nullptr;
__thread struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> ncclAsyncJobs;
__thread struct ncclGroupJob *ncclGroupJobMainPtr = NULL;
__thread struct ncclGroupJob ncclGroupJobMain;
__thread int ncclGroupBlocking = -1; /* default mode */
__thread bool ncclGroupJobAbortFlag = false;
void* ncclAsyncJobMain(void* arg);
ncclResult_t ncclAsyncLaunch(
@ -191,6 +189,66 @@ fail:
goto exit;
}
struct ncclGroupSymmetricJob {
struct ncclAsyncJob base;
struct ncclComm* comm;
};
NCCL_PARAM(WinStride, "WIN_STRIDE", -1);
ncclResult_t ncclCommGroupRegisterSymmetric(struct ncclAsyncJob* job_) {
struct ncclGroupSymmetricJob* job = (struct ncclGroupSymmetricJob*)job_;
struct ncclComm* comm = job->comm;
ncclResult_t ret = ncclSuccess;
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
if (comm->baseStride == 0) {
cudaStream_t hostStream;
// first time to allocate symmetric VA space.
// calling into this function means symmetric is supported.
struct ncclSymDevBase* symBase = NULL;
size_t size = ncclSymDevBase::size(comm->localRanks);
if (ncclParamWinStride() != -1) {
comm->baseStride = ncclParamWinStride();
} else {
size_t maxStride = 0;
for (int r = 0; r < comm->nRanks; ++r)
if (comm->peerInfo[r].totalGlobalMem > maxStride) maxStride = comm->peerInfo[r].totalGlobalMem;
comm->baseStride = maxStride;
}
INFO(NCCL_INIT, "rank %d base stride %zuGB total VM %zuGB", comm->rank, comm->baseStride >> 30, (comm->baseStride * comm->localRanks) >> 30);
NCCLCHECKGOTO(ncclIpcSymmetricInit(comm), ret, fail);
NCCLCHECKGOTO(ncclNvlsSymmetricInit(comm), ret, fail);
comm->symAllocHead = 0;
// Allocate symmetric memory for NCCL internal usage
NCCLCHECKGOTO(ncclCommSymmetricAllocInternal(comm, size, alignof(struct ncclSymDevBase), (void**)&symBase), ret, fail);
assert((void*)symBase == (void*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride));
NCCLCHECKGOTO(ncclStrongStreamAcquire(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false, &hostStream), ret, fail);
CUDACHECKGOTO(cudaMemsetAsync(symBase, 0, size, hostStream), ret, fail);
CUDACHECKGOTO(cudaStreamSynchronize(hostStream), ret, fail);
NCCLCHECKGOTO(ncclStrongStreamRelease(ncclCudaGraphNone(), &comm->sharedRes->hostStream, /*concurrent=*/false), ret, fail);
comm->symDevComm.base = (struct ncclSymDevBase*)(comm->baseUCSymPtr + comm->localRank * comm->baseStride);
comm->symDevComm.baseMc = (struct ncclSymDevBase*)comm->baseMCSymPtr;
comm->symDevComm.nRanks = comm->localRanks;
comm->symDevComm.nRanks_rcp32 = idivRcp32(comm->localRanks);
comm->symDevComm.rank = comm->localRank;
comm->symDevComm.stride4G = comm->baseStride >> 32;
}
while (!ncclIntruQueueEmpty(&comm->symRegTaskQueue)) {
struct ncclSymRegTask* task = ncclIntruQueueDequeue(&comm->symRegTaskQueue);
NCCLCHECKGOTO(ncclCommSymmetricRegisterInternal(comm, task->buff, task->baseSize, task->alignment, task->memHandle, task->regHandle), ret, fail);
free(task);
}
exit:
return ret;
fail:
goto exit;
}
static ncclResult_t doLaunches(struct ncclComm* head) {
ncclResult_t result = ncclSuccess;
struct ncclComm* cliqueHead = head;
@ -207,7 +265,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), result, failure);
NCCLCHECKGOTO(ncclLaunchPrepare(comm), result, failure);
if (useBarrier) ncclCommIntraBarrierIn(comm, 1);
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
cliqueNextHead = comm;
@ -224,7 +282,7 @@ static ncclResult_t doLaunches(struct ncclComm* head) {
bool moreRounds = false;
comm = cliqueHead;
do { // Iterate clique members.
struct ncclComm* next = comm->groupNext;
struct ncclComm* next = comm->groupNext[ncclGroupTaskTypeCollective];
if (useBarrier) {
// Barrier reduction result tells us if this was the final round.
moreRounds = 0 != ncclCommIntraBarrierOut(comm);
@ -259,64 +317,60 @@ failure:
return result;
}
static inline void groupResetJobState(struct ncclGroupJob* job) {
if (job) {
if (job->groupBlockingPtr) *job->groupBlockingPtr = -1;
if (job->abortFlagPtr) *job->abortFlagPtr = false;
if (job->groupErrorPtr) *job->groupErrorPtr = ncclSuccess;
if (job->groupCommHeadPtr) *job->groupCommHeadPtr = NULL;
if (job->groupCommPreconnectHeadPtr) *job->groupCommPreconnectHeadPtr = NULL;
memset(job, 0, sizeof(struct ncclGroupJob));
}
static inline void groupLocalResetJobState() {
ncclGroupError = ncclSuccess;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) ncclGroupCommHead[type] = NULL;
ncclGroupCommPreconnectHead = NULL;
ncclGroupBlocking = -1;
ncclIntruQueueConstruct(&ncclAsyncJobs);
return;
}
static void groupCleanup(struct ncclComm** groupCommHeadPtr, struct ncclComm** groupCommPreconnectHeadPtr, struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next>* asyncJobsPtr, ncclResult_t* groupErrorPtr, int* groupBlockingPtr, volatile bool* groupJobAbortFlagPtr, ncclResult_t error) {
struct ncclComm* comm = *groupCommHeadPtr;
/* reset all thread local variables */
*groupCommHeadPtr = NULL;
*groupCommPreconnectHeadPtr = NULL;
*groupErrorPtr = ncclSuccess;
*groupBlockingPtr = -1;
*groupJobAbortFlagPtr = false;
while (comm != nullptr) {
struct ncclComm* next = comm->groupNext;
(void) ncclGroupCommLeave(comm); // overwrites comm->groupNext
// We don't know if preconnect succeeded or happened at all, so clear
// the flags that let `taskAppend()` skip over checking if preconnect
// is needed.
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < comm->nRanks; i++) {
comm->connectSend[i] = 0UL;
comm->connectRecv[i] = 0UL;
}
// Reclaim abandoned kernel plan memory. Note ncclWork structs were already
// reclaimed by a `ncclMemoryStackPop(&comm->memScoped)` during `ncclGroupCommLeave()`.
while (!ncclIntruQueueEmpty(&comm->planner.planQueue)) {
struct ncclKernelPlan* plan = ncclIntruQueueDequeue(&comm->planner.planQueue);
// Persistent plans will be reclaimed via the callbackQueue when the
// graph drops its UserObject reference.
if (!plan->persistent) {
while (!ncclIntruQueueEmpty(&plan->proxyOpQueue)) {
struct ncclProxyOp* pxop = ncclIntruQueueDequeue(&plan->proxyOpQueue);
ncclMemoryPoolFree(&comm->memPool_ncclProxyOp, pxop);
static void groupCleanup(struct ncclComm** groupCommHeadPtr, struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next>* asyncJobsPtr, ncclResult_t error) {
struct ncclComm* comm;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
comm = groupCommHeadPtr[type];
// reset groupCommHeadPtr[type]
groupCommHeadPtr[type] = nullptr;
while (comm != nullptr) {
struct ncclComm* next = comm->groupNext[type];
(void)ncclGroupCommLeave(comm, type); // overwrites comm->groupNext
// We don't know if preconnect succeeded or happened at all, so clear
// the flags that let `taskAppend()` skip over checking if preconnect
// is needed.
if (type == ncclGroupTaskTypeCollective) {
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < comm->nRanks; i++) {
comm->connectSend[i] = 0UL;
comm->connectRecv[i] = 0UL;
}
// Reclaim abandoned kernel plan memory. Note ncclWork structs were already
// reclaimed by a `ncclMemoryStackPop(&comm->memScoped)` during `ncclGroupCommLeave()`.
while (!ncclIntruQueueEmpty(&comm->planner.planQueue)) {
struct ncclKernelPlan* plan = ncclIntruQueueDequeue(&comm->planner.planQueue);
// Persistent plans will be reclaimed via the callbackQueue when the
// graph drops its UserObject reference.
if (!plan->persistent) {
while (!ncclIntruQueueEmpty(&plan->proxyOpQueue)) {
struct ncclProxyOp* pxop = ncclIntruQueueDequeue(&plan->proxyOpQueue);
ncclMemoryPoolFree(&comm->memPool_ncclProxyOp, pxop);
}
ncclMemoryPoolFree(&comm->memPool_ncclKernelPlan, plan);
}
}
{ // Reset comm->planner to empty.
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (comm->planner.peers != NULL) memset(comm->planner.peers, 0, comm->nRanks * sizeof(comm->planner.peers[0]));
}
ncclMemoryPoolFree(&comm->memPool_ncclKernelPlan, plan);
}
}
{ // Reset comm->planner to empty.
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (comm->planner.peers != NULL) memset(comm->planner.peers, 0, comm->nRanks*sizeof(comm->planner.peers[0]));
if (!comm->config.blocking)
(void)ncclCommSetAsyncError(comm, error);
comm = next;
}
if (!comm->config.blocking)
(void) ncclCommSetAsyncError(comm, error);
comm = next;
}
/* reset everything */
@ -393,11 +447,10 @@ fail:
static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInfo = NULL) {
ncclResult_t ret = ncclSuccess;
struct ncclGroupJob *gjob = (struct ncclGroupJob*) job_;
struct ncclComm *groupCommHeadMain = *gjob->groupCommHeadPtr;
struct ncclComm *groupCommPreconnectHeadMain = *gjob->groupCommPreconnectHeadPtr;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsMain = gjob->asyncJobsPtr;
bool *groupAbortFlag = gjob->abortFlagPtr;
struct ncclComm **groupCommHeadMain = gjob->groupCommHead;
struct ncclComm *groupCommPreconnectHeadMain = gjob->groupCommPreconnectHead;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsMain = &gjob->asyncJobs;
bool *groupAbortFlag = &gjob->abortFlag;
if (!simInfo && groupCommPreconnectHeadMain != nullptr) {
struct ncclComm* comm = groupCommPreconnectHeadMain;
@ -421,9 +474,41 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
NCCLCHECKGOTO(asyncJobLaunch(asyncJobsMain, groupAbortFlag), ret, fail);
// only loop through sym alloc and register tasks
for (int type = ncclGroupTaskTypeSymRegister; type <= ncclGroupTaskTypeSymRegister; ++type) {
if (groupCommHeadMain[type]) {
struct ncclComm* cliqueHead = groupCommHeadMain[type];
struct ncclComm* comm = NULL;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncSymJobs;
ncclIntruQueueConstruct(&asyncSymJobs);
do {
comm = cliqueHead;
do {
struct ncclGroupSymmetricJob* job;
NCCLCHECKGOTO(ncclCalloc(&job, 1), ret, fail);
job->base.func = ncclCommGroupRegisterSymmetric;
job->base.undo = nullptr;
job->base.destructor = free;
job->base.state = ncclGroupJobRunning;
job->base.abortFlag = comm->abortFlag;
job->base.abortFlagDev = comm->abortFlagDev;
job->comm = comm;
ncclIntruQueueEnqueue(&asyncSymJobs, (struct ncclAsyncJob*)job);
comm = comm->groupNext[type];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
NCCLCHECKGOTO(asyncJobLaunch(&asyncSymJobs, groupAbortFlag), ret, fail);
while (!ncclIntruQueueEmpty(&asyncSymJobs)) {
struct ncclAsyncJob* job = ncclIntruQueueDequeue(&asyncSymJobs);
if (job->destructor) job->destructor((void*)job);
}
cliqueHead = comm;
} while (cliqueHead != nullptr);
}
}
/* Connect channels at runtime if cumem is supported */
if (groupCommHeadMain != nullptr) {
struct ncclComm* cliqueHead = groupCommHeadMain;
if (groupCommHeadMain[ncclGroupTaskTypeCollective] != nullptr) {
struct ncclComm* cliqueHead = groupCommHeadMain[ncclGroupTaskTypeCollective];
struct ncclComm* comm = NULL;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncCollJobs;
ncclIntruQueueConstruct(&asyncCollJobs);
@ -454,7 +539,7 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
memcpy(job->algoNeedConnect, algoNeedConnect, sizeof(bool) * NCCL_NUM_ALGORITHMS);
ncclIntruQueueEnqueue(&asyncCollJobs, &job->base);
}
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm != nullptr && comm->intraComm0 == cliqueHead->intraComm0);
// connect
NCCLCHECKGOTO(asyncJobLaunch(&asyncCollJobs, groupAbortFlag), ret, fail);
@ -466,42 +551,49 @@ static ncclResult_t groupLaunch(struct ncclAsyncJob *job_, ncclSimInfo_t* simInf
} while (cliqueHead != nullptr);
// done with all buffer allocation, start registration and enqueue
comm = groupCommHeadMain;
comm = groupCommHeadMain[ncclGroupTaskTypeCollective];
do {
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), ret, fail);
NCCLCHECKGOTO(ncclTasksRegAndEnqueue(comm), ret, fail);
comm = comm->groupNext;
comm = comm->groupNext[ncclGroupTaskTypeCollective];
} while (comm);
}
if ((!simInfo) && (groupCommHeadMain != nullptr)) {
NCCLCHECKGOTO(doLaunches(groupCommHeadMain), ret, fail);
if ((!simInfo) && (groupCommHeadMain[ncclGroupTaskTypeCollective] != nullptr)) {
NCCLCHECKGOTO(doLaunches(groupCommHeadMain[ncclGroupTaskTypeCollective]), ret, fail);
}
while (!ncclIntruQueueEmpty(asyncJobsMain)) {
struct ncclAsyncJob* job = ncclIntruQueueDequeue(asyncJobsMain);
if (!job->destroyFlag && job->comm && !job->comm->config.blocking)
if (!job->destroyFlag && job->comm && !job->comm->config.blocking && groupCommHeadMain[ncclGroupTaskTypeCollective] == nullptr)
(void) ncclCommSetAsyncError(job->comm, ret);
if (job->destructor) job->destructor((void*)job);
}
while (groupCommHeadMain != nullptr) {
struct ncclComm* comm = groupCommHeadMain;
struct ncclComm* next = comm->groupNext;
// Poll for callbacks sent to us from other threads. Typically these free
// resources back to our memory pools and UB
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, /*waitSome=*/false), ret, fail);
(void) ncclGroupCommLeave(comm);
if (!comm->config.blocking) {
(void) ncclCommSetAsyncError(comm, ret);
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
while (groupCommHeadMain[type] != nullptr) {
struct ncclComm* comm = groupCommHeadMain[type];
struct ncclComm* next = comm->groupNext[type];
// Poll for callbacks sent to us from other threads. Typically these free
// resources back to our memory pools and UB
if (comm->reclaimSteps == GROUP_MAX_RECLAIM_STEPS) {
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, /*waitSome=*/false), ret, fail);
comm->reclaimSteps = 0;
} else {
comm->reclaimSteps++;
}
(void)ncclGroupCommLeave(comm, type);
if (!comm->config.blocking) {
(void)ncclCommSetAsyncError(comm, ret);
}
groupCommHeadMain[type] = next;
}
groupCommHeadMain = next;
}
exit:
return ret;
fail:
groupCleanup(gjob->groupCommHeadPtr, gjob->groupCommPreconnectHeadPtr, gjob->asyncJobsPtr, gjob->groupErrorPtr, gjob->groupBlockingPtr, gjob->abortFlagPtr, ret);
groupCleanup(gjob->groupCommHead, &gjob->asyncJobs, ret);
goto exit;
}
@ -514,6 +606,8 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
ncclSimInfo_t internalSimInfo = NCCL_SIM_INFO_INITIALIZER;
ncclSimInfo_t* internalSimInfoPtr = NULL;
size_t realSize = 0;
bool hasCommHead = false;
ncclGroupJob* groupJob = NULL;
internalSimInfo.magic = 0;
@ -539,72 +633,108 @@ ncclResult_t ncclGroupEndInternal(ncclSimInfo_t* simInfo) {
internalSimInfoPtr = &internalSimInfo;
}
if (ncclGroupCommHead != nullptr || !ncclIntruQueueEmpty(&ncclAsyncJobs) || ncclGroupCommPreconnectHead != nullptr) {
ncclGroupJobMain.groupCommHeadPtr = &ncclGroupCommHead;
ncclGroupJobMain.groupCommPreconnectHeadPtr = &ncclGroupCommPreconnectHead;
ncclGroupJobMain.groupErrorPtr = &ncclGroupError;
ncclGroupJobMain.asyncJobsPtr = &ncclAsyncJobs;
ncclGroupJobMain.abortFlagPtr = &ncclGroupJobAbortFlag;
ncclGroupJobMain.groupBlockingPtr = &ncclGroupBlocking;
ncclGroupJobMain.initialized = true;
ncclGroupJobMainPtr = &ncclGroupJobMain;
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
if (ncclGroupCommHead[type]) {
hasCommHead = true;
break;
}
}
NCCLCHECKGOTO(ncclCalloc(&groupJob, 1), ret, fail);
ncclIntruQueueConstruct(&groupJob->asyncJobs);
groupJob->groupRefCount = 0;
groupJob->nonBlockingInit = false;
memcpy(groupJob->groupCommHead, ncclGroupCommHead, sizeof(ncclGroupCommHead));
groupJob->groupCommPreconnectHead = ncclGroupCommPreconnectHead;
groupJob->groupError = ncclSuccess;
groupJob->abortFlag = false;
groupJob->joined = false;
ncclIntruQueueTransfer(&groupJob->asyncJobs, &ncclAsyncJobs);
if (hasCommHead || !ncclIntruQueueEmpty(&groupJob->asyncJobs) || ncclGroupCommPreconnectHead != nullptr) {
/* make sure ncclGroupBlocking has been set. */
assert(ncclGroupBlocking == 0 || ncclGroupBlocking == 1);
if (ncclGroupBlocking == 0) {
/* nonblocking group */
if (!ncclIntruQueueEmpty(&ncclAsyncJobs)) {
ncclAsyncJob* job = ncclIntruQueueHead(&ncclAsyncJobs);
if (!ncclIntruQueueEmpty(&groupJob->asyncJobs)) {
ncclAsyncJob* job = ncclIntruQueueHead(&groupJob->asyncJobs);
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(job->comm, ncclInProgress), ret, fail);
job->comm->groupJob = ncclGroupJobMainPtr;
if (job->comm->groupJob == NULL) {
job->comm->groupJob = groupJob;
groupJob->groupRefCount++;
}
job = job->next;
} while (job);
}
if (ncclGroupCommHead) {
ncclComm_t comm = ncclGroupCommHead;
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(comm, ncclInProgress), ret, fail);
/* link group job to communicators. */
comm->groupJob = ncclGroupJobMainPtr;
comm = comm->groupNext;
} while (comm);
for (int type = 0; type < ncclGroupTaskTypeNum; ++type) {
if (ncclGroupCommHead[type]) {
ncclComm_t comm = ncclGroupCommHead[type];
do {
NCCLCHECKGOTO(ncclCommSetAsyncError(comm, ncclInProgress), ret, fail);
/* link group job to communicators. */
if (comm->groupJob == NULL) {
comm->groupJob = groupJob;
groupJob->groupRefCount++;
}
comm = comm->groupNext[type];
} while (comm);
}
}
ncclGroupJobMainPtr->base.func = groupLaunchNonBlocking;
PTHREADCHECKGOTO(pthread_create(&ncclGroupJobMainPtr->base.thread, NULL, ncclAsyncJobMain, (void*)&ncclGroupJobMainPtr->base), "pthread_create", ret, fail);
groupJob->base.func = groupLaunchNonBlocking;
PTHREADCHECKGOTO(pthread_create(&groupJob->base.thread, NULL, ncclAsyncJobMain, (void*)&groupJob->base), "pthread_create", ret, fail);
groupJob->nonBlockingInit = true;
ret = ncclInProgress;
} else {
/* blocking group */
int savedDev;
CUDACHECKGOTO(cudaGetDevice(&savedDev), ret, fail);
NCCLCHECKGOTO(groupLaunch(&ncclGroupJobMainPtr->base, internalSimInfoPtr), ret, fail);
NCCLCHECKGOTO(groupLaunch(&groupJob->base, internalSimInfoPtr), ret, fail);
CUDACHECKGOTO(cudaSetDevice(savedDev), ret, fail);
if (simInfo) memcpy((void*)simInfo, (void*)internalSimInfoPtr, realSize);
groupResetJobState(ncclGroupJobMainPtr);
free(groupJob);
}
}
/* Reset the job state for the next group call. */
groupLocalResetJobState();
exit:
return ret;
fail:
groupCleanup(&ncclGroupCommHead, &ncclGroupCommPreconnectHead, &ncclAsyncJobs, &ncclGroupError, &ncclGroupBlocking, &ncclGroupJobAbortFlag, ret);
if (groupJob) {
groupCleanup(groupJob->groupCommHead, &groupJob->asyncJobs, ret);
free(groupJob);
} else {
groupCleanup(ncclGroupCommHead, &ncclAsyncJobs, ret);
}
groupLocalResetJobState();
goto exit;
}
ncclResult_t ncclGroupJobComplete(struct ncclGroupJob* groupJob) {
ncclResult_t ret = ncclSuccess;
if (groupJob && groupJob->initialized) {
ret = ncclAsyncJobComplete(&groupJob->base);
groupResetJobState(groupJob);
if (groupJob && groupJob->nonBlockingInit) {
if (!__atomic_exchange_n(&groupJob->joined, true, __ATOMIC_ACQ_REL)) {
ret = ncclAsyncJobComplete(&groupJob->base);
}
if (ncclAtomicRefCountDecrement(&groupJob->groupRefCount) == 0) {
free(groupJob);
}
}
return ret;
}
ncclResult_t ncclGroupJobAbort(struct ncclGroupJob* groupJob) {
if (groupJob && groupJob->initialized) {
__atomic_store_n(groupJob->abortFlagPtr, true, __ATOMIC_RELEASE);
NCCLCHECK(ncclGroupJobComplete(groupJob));
if (groupJob && groupJob->nonBlockingInit) {
if (!__atomic_exchange_n(&groupJob->joined, true, __ATOMIC_ACQ_REL)) {
__atomic_store_n(&groupJob->abortFlag, true, __ATOMIC_RELAXED);
ncclAsyncJobComplete(&groupJob->base);
}
if (ncclAtomicRefCountDecrement(&groupJob->groupRefCount) == 0) {
free(groupJob);
}
}
return ncclSuccess;
}
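Both functions above follow the same shutdown pattern: an atomic exchange on the joined flag guarantees that only the first of ncclGroupJobComplete()/ncclGroupJobAbort() actually waits on the asynchronous job, and a reference count lets whichever caller drops the last reference free the job. Below is a minimal, hypothetical sketch of that join-once-plus-refcount idiom using the same GCC atomic builtins; __atomic_sub_fetch stands in for NCCL's ncclAtomicRefCountDecrement helper, and the Job struct and releaseJob name are invented for illustration.

#include <cstdio>

struct Job {
  bool joined = false;
  int refCount = 2;   // e.g. one reference per caller sharing the job
};

// Join at most once, free on the last release.
static void releaseJob(Job* job) {
  if (!__atomic_exchange_n(&job->joined, true, __ATOMIC_ACQ_REL)) {
    printf("first caller joins the async work\n");
  }
  if (__atomic_sub_fetch(&job->refCount, 1, __ATOMIC_ACQ_REL) == 0) {
    printf("last reference dropped, freeing job\n");
    delete job;
  }
}

int main() {
  Job* job = new Job();
  releaseJob(job);  // joins the work
  releaseJob(job);  // only decrements the refcount and frees
  return 0;
}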

src/include/allocator.h (new file)

@ -0,0 +1,13 @@
/*************************************************************************
* Copyright (c) 2015-2025, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_ALLOCATOR_H_
#define NCCL_ALLOCATOR_H_
ncclResult_t ncclCommSymmetricAllocInternal(struct ncclComm* comm, size_t size, size_t alignment, void** symPtr);
ncclResult_t ncclCommSymmetricFreeInternal(struct ncclComm* comm, void* symPtr);
#endif


@ -19,6 +19,28 @@
#endif
#endif
template<typename Int>
constexpr static __host__ __device__ Int minval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr static __host__ __device__ Int minval(Int a, Int b, More ...more) {
#if __CUDA_ARCH__
return minval(min(a, b), more...);
#else
return minval(a < b ? a : b, more...);
#endif
}
template<typename Int>
constexpr static __host__ __device__ Int maxval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr static __host__ __device__ Int maxval(Int a, Int b, More ...more) {
#if __CUDA_ARCH__
return maxval(max(a, b), more...);
#else
return maxval(a > b ? a : b, more...);
#endif
}
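minval()/maxval() above are variadic recursive folds, so they accept any number of same-typed arguments on both host and device. A small host-only usage sketch (the __host__ __device__ qualifiers are dropped so it compiles without CUDA):

#include <cstdio>

// Host-only copy of the variadic fold above, for illustration.
template<typename Int>
constexpr Int minval(Int a) { return a; }
template<typename Int, typename ...More>
constexpr Int minval(Int a, Int b, More ...more) { return minval(a < b ? a : b, more...); }

int main() {
  static_assert(minval(7, 3, 9, 5) == 3, "fold picks the smallest argument");
  printf("minval(7,3,9,5) = %d\n", minval(7, 3, 9, 5));
  return 0;
}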
#define DIVUP(x, y) \
(((x)+(y)-1)/(y))
@ -32,32 +54,150 @@
size = ((size + (align) - 1) / (align)) * (align);
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z divUp(X x, Y y) {
static __host__ __device__ constexpr Z divUp(X x, Y y) {
return (x+y-1)/y;
}
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z roundUp(X x, Y y) {
static __host__ __device__ constexpr Z roundUp(X x, Y y) {
return (x+y-1) - (x+y-1)%y;
}
template<typename X, typename Y, typename Z = decltype(X()+Y())>
__host__ __device__ constexpr Z roundDown(X x, Y y) {
static __host__ __device__ constexpr Z roundDown(X x, Y y) {
return x - x%y;
}
// assumes second argument is a power of 2
template<typename X, typename Z = decltype(X()+int())>
__host__ __device__ constexpr Z alignUp(X x, int a) {
static __host__ __device__ constexpr Z alignUp(X x, int a) {
return (x + a-1) & Z(-a);
}
// assumes second argument is a power of 2
template<typename X, typename Z = decltype(X()+int())>
__host__ __device__ constexpr Z alignDown(X x, int a) {
static __host__ __device__ constexpr Z alignDown(X x, int a) {
return x & Z(-a);
}
template<typename Int>
inline __host__ __device__ int countOneBits(Int x) {
constexpr __host__ __device__ bool isPow2(Int x) {
return (x & (x-1)) == 0;
}
template<typename T>
static __host__ __device__ T add4G(T base, int delta4G) {
union { T tmp; uint32_t u32[2]; };
tmp = base;
u32[1] += delta4G;
return tmp;
}
template<typename T>
static __host__ __device__ T incWrap4G(T ptr, uint32_t delta4G, uint32_t lo4G, uint32_t hi4G) {
union { T tmp; uint32_t u32[2]; };
tmp = ptr;
u32[1] += delta4G;
if (u32[1] >= hi4G) u32[1] -= hi4G-lo4G;
return tmp;
}
template<typename T>
static __host__ __device__ T decWrap4G(T ptr, uint32_t delta4G, uint32_t lo4G, uint32_t hi4G) {
union { T tmp; uint32_t u32[2]; };
tmp = ptr;
u32[1] -= delta4G;
if (u32[1] < lo4G) u32[1] += hi4G-lo4G;
return tmp;
}
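add4G()/incWrap4G()/decWrap4G() only touch the upper 32 bits of a pointer, i.e. they step in 4 GiB units, which matches a layout where each rank's symmetric window sits at a fixed multiple of the base stride. The sketch below replays the same union overlay on a plain 64-bit value; it assumes a little-endian host (as the union trick itself does) and uses a made-up address purely for illustration.

#include <cstdint>
#include <cstdio>

// Same layout trick as add4G(): bump the high 32-bit half, i.e. add delta * 4 GiB.
static uint64_t add4G(uint64_t base, int delta4G) {
  union { uint64_t tmp; uint32_t u32[2]; };
  tmp = base;
  u32[1] += delta4G;   // little-endian: u32[1] is the upper half
  return tmp;
}

int main() {
  uint64_t p = 0x700000001000ull;      // hypothetical virtual address
  uint64_t q = add4G(p, 3);            // + 3 * 4 GiB
  printf("ok=%d\n", q == p + 3ull * (1ull << 32));
  return 0;
}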
// Produce the reciprocal of x for use in idivByRcp
constexpr __host__ __device__ uint32_t idivRcp32(uint32_t x) {
return uint32_t(uint64_t(0x100000000)/x);
}
constexpr __host__ __device__ uint64_t idivRcp64(uint64_t x) {
return uint64_t(-1)/x + isPow2(x);
}
static __host__ __device__ uint32_t mul32hi(uint32_t a, uint32_t b) {
#if __CUDA_ARCH__
return __umulhi(a, b);
#else
return uint64_t(a)*b >> 32;
#endif
}
static __host__ __device__ uint64_t mul64hi(uint64_t a, uint64_t b) {
#if __CUDA_ARCH__
return __umul64hi(a, b);
#else
return (uint64_t)(((unsigned __int128)a)*b >> 64);
#endif
}
// Produce the reciprocal of x*y given their respective reciprocals. This incurs
// no integer division on device.
static __host__ __device__ uint32_t imulRcp32(uint32_t x, uint32_t xrcp, uint32_t y, uint32_t yrcp) {
if (xrcp == 0) return yrcp;
if (yrcp == 0) return xrcp;
uint32_t rcp = mul32hi(xrcp, yrcp);
uint32_t rem = -x*y*rcp;
if (x*y <= rem) rcp += 1;
return rcp;
}
static __host__ __device__ uint64_t imulRcp64(uint64_t x, uint64_t xrcp, uint64_t y, uint64_t yrcp) {
if (xrcp == 0) return yrcp;
if (yrcp == 0) return xrcp;
uint64_t rcp = mul64hi(xrcp, yrcp);
uint64_t rem = -x*y*rcp;
if (x*y <= rem) rcp += 1;
return rcp;
}
// Fast integer division where divisor has precomputed reciprocal.
// idivFast(x, y, idivRcp(y)) == x/y
static __host__ __device__ void idivmodFast32(uint32_t *quo, uint32_t *rem, uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q = x, r = 0;
if (yrcp != 0) {
q = mul32hi(x, yrcp);
r = x - y*q;
if (r >= y) { q += 1; r -= y; }
}
*quo = q;
*rem = r;
}
static __host__ __device__ void idivmodFast64(uint64_t *quo, uint64_t *rem, uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q = x, r = 0;
if (yrcp != 0) {
q = mul64hi(x, yrcp);
r = x - y*q;
if (r >= y) { q += 1; r -= y; }
}
*quo = q;
*rem = r;
}
static __host__ __device__ uint32_t idivFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q, r;
idivmodFast32(&q, &r, x, y, yrcp);
return q;
}
static __host__ __device__ uint32_t idivFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q, r;
idivmodFast64(&q, &r, x, y, yrcp);
return q;
}
static __host__ __device__ uint32_t imodFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
uint32_t q, r;
idivmodFast32(&q, &r, x, y, yrcp);
return r;
}
static __host__ __device__ uint32_t imodFast64(uint64_t x, uint64_t y, uint64_t yrcp) {
uint64_t q, r;
idivmodFast64(&q, &r, x, y, yrcp);
return r;
}
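The reciprocal trick above replaces x / y with a 32x32-to-64 high multiply: rcp = floor(2^32 / y), q = (x * rcp) >> 32, followed by a single correction step that can only be off by one. The standalone sketch below re-derives the same math (rather than calling the helpers above) and checks it against ordinary division over a sampled range.

#include <cstdint>
#include <cstdio>

static uint32_t idivRcp32(uint32_t y) { return (uint32_t)(0x100000000ull / y); }
static uint32_t mul32hi(uint32_t a, uint32_t b) { return (uint32_t)(((uint64_t)a * b) >> 32); }

// Fast division by y using its precomputed reciprocal, with one correction step.
static uint32_t idivFast32(uint32_t x, uint32_t y, uint32_t yrcp) {
  if (yrcp == 0) return x;            // y == 1: the reciprocal wraps to 0, and x/1 == x
  uint32_t q = mul32hi(x, yrcp);
  uint32_t r = x - y * q;
  if (r >= y) q += 1;                 // q was floor(x/y) - 1; fix it up
  return q;
}

int main() {
  for (uint32_t y = 1; y < 1000; y++)
    for (uint32_t x = 0; x < 100000; x += 37)
      if (idivFast32(x, y, idivRcp32(y)) != x / y) { printf("mismatch x=%u y=%u\n", x, y); return 1; }
  printf("fast division matches x/y on the sampled range\n");
  return 0;
}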
template<typename Int>
static __host__ __device__ int countOneBits(Int x) {
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(unsigned int)) {
return __popc((unsigned int)x);
@ -83,7 +223,7 @@ inline __host__ __device__ int countOneBits(Int x) {
// Returns index of first one bit or returns -1 if mask is zero.
template<typename Int>
inline __host__ __device__ int firstOneBit(Int mask) {
static __host__ __device__ int firstOneBit(Int mask) {
int i;
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(int)) {
@ -108,14 +248,14 @@ inline __host__ __device__ int firstOneBit(Int mask) {
}
template<typename Int>
inline __host__ __device__ int popFirstOneBit(Int* mask) {
static __host__ __device__ int popFirstOneBit(Int* mask) {
Int tmp = *mask;
*mask &= *mask-1;
return firstOneBit(tmp);
}
template<typename Int>
inline __host__ __device__ int log2Down(Int x) {
static __host__ __device__ int log2Down(Int x) {
int w, n;
#if __CUDA_ARCH__
if (sizeof(Int) <= sizeof(int)) {
@ -147,7 +287,7 @@ inline __host__ __device__ int log2Down(Int x) {
}
template<typename Int>
inline __host__ __device__ int log2Up(Int x) {
static __host__ __device__ int log2Up(Int x) {
int w, n;
if (x != 0) x -= 1;
#if __CUDA_ARCH__
@ -180,19 +320,19 @@ inline __host__ __device__ int log2Up(Int x) {
}
template<typename Int>
inline __host__ __device__ Int pow2Up(Int x) {
static __host__ __device__ Int pow2Up(Int x) {
return Int(1)<<log2Up(x);
}
template<typename Int>
inline __host__ __device__ Int pow2Down(Int x) {
static __host__ __device__ Int pow2Down(Int x) {
// True, log2Down can return -1, but we don't normally pass 0 as an argument...
// coverity[negative_shift]
return Int(1)<<log2Down(x);
}
template<typename UInt, int nSubBits>
inline __host__ UInt reverseSubBits(UInt x) {
static __host__ UInt reverseSubBits(UInt x) {
if (nSubBits >= 16 && 8*sizeof(UInt) == nSubBits) {
switch (8*sizeof(UInt)) {
case 16: x = __builtin_bswap16(x); break;
@ -225,7 +365,7 @@ template<> struct ncclToUnsigned<unsigned long long> { using type = unsigned lon
// Reverse the bottom nBits bits of x. The top bits will be overwritten with 0's.
template<typename Int>
inline __host__ __device__ Int reverseBits(Int x, int nBits) {
static __host__ __device__ Int reverseBits(Int x, int nBits) {
using UInt = typename ncclToUnsigned<Int>::type;
union { UInt ux; Int sx; };
sx = x;
@ -249,7 +389,7 @@ inline __host__ __device__ Int reverseBits(Int x, int nBits) {
// has nearly the full range of uint32_t except it only keeps the top 3 bits
// beneath the leading 1 bit and thus has a max value of 0xf0000000.
inline __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
static __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
int log2x;
#if __CUDA_ARCH__
log2x = 31-__clz(x|1);
@ -261,7 +401,7 @@ inline __host__ __device__ uint32_t u32fpEncode(uint32_t x, int bitsPerPow2) {
return exponent<<bitsPerPow2 | mantissa;
}
inline __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
static __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
uint32_t exponent = x>>bitsPerPow2;
uint32_t mantissa = (x & ((1u<<bitsPerPow2)-1)) | (exponent!=0 ? 0x8 : 0);
if (exponent != 0) exponent -= 1;
@ -270,16 +410,16 @@ inline __host__ __device__ uint32_t u32fpDecode(uint32_t x, int bitsPerPow2) {
constexpr uint32_t u32fp8MaxValue() { return 0xf0000000; }
inline __host__ __device__ uint8_t u32fp8Encode(uint32_t x) {
static __host__ __device__ uint8_t u32fp8Encode(uint32_t x) {
return u32fpEncode(x, 3);
}
inline __host__ __device__ uint32_t u32fp8Decode(uint8_t x) {
static __host__ __device__ uint32_t u32fp8Decode(uint8_t x) {
return u32fpDecode(x, 3);
}
// The hash isn't just a function of the bytes but also where the bytes are split
// into different calls to eatHash().
inline __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size_t size) {
static __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size_t size) {
char const* ptr = (char const*)bytes;
acc[0] ^= size;
while (size != 0) {
@ -302,11 +442,11 @@ inline __host__ __device__ void eatHash(uint64_t acc[2], const void* bytes, size
}
template<typename T>
inline __host__ __device__ void eatHash(uint64_t acc[2], const T* bytes) {
static __host__ __device__ void eatHash(uint64_t acc[2], const T* bytes) {
eatHash(acc, (const void*)bytes, sizeof(T));
}
inline __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
static __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
uint64_t h = acc[0];
h ^= h >> 31;
h *= 0xbac3bd562846de6b;
@ -316,13 +456,13 @@ inline __host__ __device__ uint64_t digestHash(uint64_t const acc[2]) {
return h;
}
inline __host__ __device__ uint64_t getHash(const void* bytes, size_t size) {
static __host__ __device__ uint64_t getHash(const void* bytes, size_t size) {
uint64_t acc[2] = {1, 1};
eatHash(acc, bytes, size);
return digestHash(acc);
}
template<typename T>
inline __host__ __device__ uint64_t getHash(const T* bytes) {
static __host__ __device__ uint64_t getHash(const T* bytes) {
return getHash((const void*)bytes, sizeof(T));
}


@ -17,6 +17,7 @@
#include "register.h"
#include "graph.h"
#include "profiler.h"
#include "allocator.h"
#if CUDART_VERSION < 9000
struct cudaLaunchParams {
@ -131,7 +132,6 @@ struct ncclSharedResources {
int* tpRankToLocalRank;
// Internal streams
struct ncclStrongStream deviceStream, hostStream;
int noncapturedRefs; // number of non-captured hostStreamPlanCallback on the stream
int persistentRefs;
cudaEvent_t launchEvent, scratchEvent;
@ -218,6 +218,7 @@ struct ncclTaskColl {
// Profiler plugin
int eActivationMask;
void* eventHandle;
uint8_t nChannels;
};
struct ncclTaskP2p {
struct ncclTaskP2p* next;
@ -231,6 +232,7 @@ struct ncclTaskP2p {
// Profiler plugin
int eActivationMask;
void* eventHandle;
uint8_t nChannels;
};
struct ncclKernelPlan {
@ -243,10 +245,14 @@ struct ncclKernelPlan {
bool persistent; // aka captured in a graph
bool isHostCbEnq;
bool isSymColl;
enum ncclDevWorkStorageType workStorageType;
bool kernelSpecialized;
void *kernelFn;
struct ncclDevKernelArgs* kernelArgs;
void* kernelFn;
union {
struct ncclDevKernelArgs* kernelArgs;
struct ncclSymDevArgs* kernelSymArgs;
};
size_t kernelArgsSize;
uint64_t channelMask; // bitset of which channels are present
bool hasProxyOps; // does any channel have a non-empty proxyOpQueue
@ -355,6 +361,7 @@ struct ncclKernelPlanner {
struct Peer* peers/*[nRanks]*/;
int nTasksColl, nTasksP2p;
bool persistent;
bool isSymColl;
// The list of user streams aggregated over all tasks present.
struct ncclCudaStreamList* streams;
@ -404,6 +411,12 @@ struct ncclKernelPlanner {
#define NCCL_MAGIC 0x0280028002800280 // Nickel atomic number is 28.
typedef enum ncclGroupTaskType {
ncclGroupTaskTypeCollective = 0,
ncclGroupTaskTypeSymRegister = 1,
ncclGroupTaskTypeNum = 2,
} ncclGroupTaskType_t;
struct ncclComm {
uint64_t startMagic;
struct ncclMemoryStack memPermanent, memScoped;
@ -420,9 +433,10 @@ struct ncclComm {
struct ncclTopoSystem* topo;
struct ncclProxyConnector* gproxyConn;
struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next> legacyRegCleanupQueue;
bool peerInfoValid;
int netPluginLoaded;
ncclNet_t* ncclNet;
int netPluginIndex;
int ncclNetVer;
ncclNetDeviceType netDeviceType;
ncclCollNet_t* ncclCollNet;
@ -439,7 +453,6 @@ struct ncclComm {
uint64_t magic; // Magic number for all network communication. Not a security key -- only goal is to detect mismatches.
const char* commName;
uint64_t commHash;
int rank; // my rank in the communicator
int nRanks; // number of GPUs in communicator
@ -515,6 +528,7 @@ struct ncclComm {
// Device side of the communicator (for cudaFree's)
struct ncclDevComm* devComm; // actually = &ncclDevCommAndChannels::comm
struct ncclSymDevComm symDevComm;
uint32_t workArgsBytes; // max size of kernel args
uint32_t workFifoBytes; // size of workFifoBuf, power of 2
@ -522,12 +536,10 @@ struct ncclComm {
void* workFifoBufDev;
void* workFifoBufGdrHandle;
// Monotonic number of bytes (mod 1<<32) consumed per channel. In cudaHost memory.
uint32_t* workFifoConsumed/*[MAXCHANNELS]*/;
// Last observed value of: min(workFifoConsumed[c] for c < MAXCHANNELS)
uint32_t workFifoConsumedLeast;
// Monotonic number of bytes (mod 1<<32) sent to fifo.
uint32_t workFifoProduced;
uint32_t workFifoProducedLastRecorded;
uint32_t workFifoConsumed;
// Intra-process sync
struct ncclComm* intraComm0; // leader of intra-process comms (self possible)
@ -543,10 +555,8 @@ struct ncclComm {
struct ncclProxyState* proxyState;
int proxyRefCountOld; /* store proxy post-atomic-sub refcount */
// Whether this communicator uses collNet
int collNetSupport;
bool isOneRPN;
uint8_t collNetSupportMatrix[4/*sum,prod,max,min*/][ncclNumTypes];
bool intraNodeP2pSupport;
int* collNetHeads;
int collNetHeadsNum;
int* collNetDenseToUserRank;
@ -568,7 +578,7 @@ struct ncclComm {
// Next comm in this thread's active ncclGroup[Start|End](). Holds "0x1" when
// this comm is not yet in a group.
struct ncclComm* groupNext;
struct ncclComm* groupNext[ncclGroupTaskTypeNum];
// Subset of those in groupNext list. Holds 0x1 if not needing preconnect.
struct ncclComm* preconnectNext;
int localPersistentRefs; // number of persistent plan-lists capturing this comm
@ -588,6 +598,7 @@ struct ncclComm {
ncclUserRedOp *userRedOps;
// Queue of things for the main thread to do
int reclaimSteps;
struct ncclIntruQueueMpsc<struct ncclCommCallback, &ncclCommCallback::next> callbackQueue;
ncclConfig_t config;
@ -600,6 +611,9 @@ struct ncclComm {
// group job to support multi-thread FT
struct ncclGroupJob *groupJob;
// Flag indicating if this communicator shares resources with parent or children
bool shareResources;
// Tuning plugin
int tunerPluginLoaded;
ncclTuner_t* tuner;
@ -613,9 +627,18 @@ struct ncclComm {
// buffer registration cache
struct ncclRegCache regCache;
int isAllNvlink;
bool isAllDirectP2p;
int symmetricSupport;
bool useNetPXN;
bool useGdr;
int splitCount;
// symmetric buffer
uint8_t* baseUCSymPtr;
uint8_t* baseMCSymPtr;
size_t baseStride;
size_t symAllocHead;
CUmemGenericAllocationHandle symMCHandle;
struct ncclIntruQueue<struct ncclSymRegTask, &ncclSymRegTask::next> symRegTaskQueue;
uint64_t endMagic;
};
@ -647,15 +670,21 @@ inline ncclResult_t ncclCommPollCallbacks(struct ncclComm* comm, bool waitSome)
return ncclSuccess;
}
inline ncclResult_t ncclCommPollEventCallbacks(struct ncclComm *comm) {
inline ncclResult_t ncclCommPollEventCallbacks(struct ncclComm *comm, bool waitSome) {
ncclResult_t result = ncclSuccess;
cudaStreamCaptureMode mode = cudaStreamCaptureModeRelaxed;
CUDACHECK(cudaThreadExchangeStreamCaptureMode(&mode));
while (true) {
struct ncclCommEventCallback* cb = ncclIntruQueueHead(&comm->eventCallbackQueue);
if (cb == nullptr) break;
cudaError_t ok = cudaEventSynchronize(cb->event);
if (ok == cudaErrorNotReady) break;
cudaError_t ok;
if (waitSome) {
ok = cudaEventSynchronize(cb->event);
waitSome = false;
} else {
ok = cudaEventQuery(cb->event);
if (ok == cudaErrorNotReady) break;
}
ncclIntruQueueDequeue(&comm->eventCallbackQueue);
if (ok == cudaSuccess) {
NCCLCHECKGOTO(cb->fn(comm, cb), result, finish);


@ -58,4 +58,29 @@ static ncclResult_t ncclCpusetToStr(cpu_set_t* mask, char* str) {
return ncclSuccess;
}
static char* ncclCpusetToRangeStr(cpu_set_t* mask, char* str, size_t len) {
int c = 0;
int start = -1;
// Iterate through all possible CPU bits plus one extra position
for (int cpu = 0; cpu <= CPU_SETSIZE; cpu++) {
int isSet = (cpu == CPU_SETSIZE) ? 0 : CPU_ISSET(cpu, mask);
// Start of a new range
if (isSet && start == -1) {
start = cpu;
}
// End of a range, add comma between ranges
if (!isSet && start != -1) {
if (cpu-1 == start) {
c += snprintf(str+c, len-c, "%s%d", c ? "," : "", start);
} else {
c += snprintf(str+c, len-c, "%s%d-%d", c ? "," : "", start, cpu-1);
}
if (c >= len-1) break;
start = -1;
}
}
if (c == 0) str[0] = '\0';
return str;
}
#endif
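ncclCpusetToRangeStr() above compresses an affinity mask into the familiar "first-last" notation (e.g. CPUs {0,1,2,3,8,9} become "0-3,8-9"), which keeps the INFO/TRACE affinity lines short on wide servers. The following is a small standalone sketch exercising the same logic on a hand-built cpu_set_t; it assumes a Linux/glibc host (CPU_* macros from <sched.h>) and reimplements the helper locally so it compiles on its own.

#include <sched.h>
#include <cstdio>

// Same range-compression idea as ncclCpusetToRangeStr(), kept standalone here.
static char* cpusetToRangeStr(cpu_set_t* mask, char* str, size_t len) {
  int c = 0, start = -1;
  for (int cpu = 0; cpu <= CPU_SETSIZE; cpu++) {
    int isSet = (cpu == CPU_SETSIZE) ? 0 : CPU_ISSET(cpu, mask);
    if (isSet && start == -1) start = cpu;          // start of a new range
    if (!isSet && start != -1) {                    // end of a range
      if (cpu - 1 == start) c += snprintf(str + c, len - c, "%s%d", c ? "," : "", start);
      else c += snprintf(str + c, len - c, "%s%d-%d", c ? "," : "", start, cpu - 1);
      if (c >= (int)len - 1) break;
      start = -1;
    }
  }
  if (c == 0) str[0] = '\0';
  return str;
}

int main() {
  cpu_set_t mask; CPU_ZERO(&mask);
  for (int i = 0; i < 4; i++) CPU_SET(i, &mask);    // CPUs 0-3
  CPU_SET(8, &mask); CPU_SET(9, &mask);             // CPUs 8-9
  char buf[128];
  printf("%s\n", cpusetToRangeStr(&mask, buf, sizeof(buf)));  // prints "0-3,8-9"
  return 0;
}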


@ -36,6 +36,10 @@ extern CUmemAllocationHandleType ncclCuMemHandleType;
} \
} while(false)
#define CUCALL(cmd) do { \
pfn_##cmd; \
} while(false)
#define CUCHECKGOTO(cmd, res, label) do { \
CUresult err = pfn_##cmd; \
if( err != CUDA_SUCCESS ) { \
@ -66,49 +70,49 @@ extern CUmemAllocationHandleType ncclCuMemHandleType;
} \
} while(0)
#define DECLARE_CUDA_PFN_EXTERN(symbol) extern PFN_##symbol pfn_##symbol
#define DECLARE_CUDA_PFN_EXTERN(symbol,version) extern PFN_##symbol##_v##version pfn_##symbol
#if CUDART_VERSION >= 11030
/* CUDA Driver functions loaded with cuGetProcAddress for versioning */
DECLARE_CUDA_PFN_EXTERN(cuDeviceGet);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGetAttribute);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorString);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorName);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAddressRange);
DECLARE_CUDA_PFN_EXTERN(cuCtxCreate);
DECLARE_CUDA_PFN_EXTERN(cuCtxDestroy);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetCurrent);
DECLARE_CUDA_PFN_EXTERN(cuCtxSetCurrent);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetDevice);
DECLARE_CUDA_PFN_EXTERN(cuPointerGetAttribute);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernel);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGet, 2000);
DECLARE_CUDA_PFN_EXTERN(cuDeviceGetAttribute, 2000);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorString, 6000);
DECLARE_CUDA_PFN_EXTERN(cuGetErrorName, 6000);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAddressRange, 3020);
DECLARE_CUDA_PFN_EXTERN(cuCtxCreate, 11040);
DECLARE_CUDA_PFN_EXTERN(cuCtxDestroy, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetCurrent, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxSetCurrent, 4000);
DECLARE_CUDA_PFN_EXTERN(cuCtxGetDevice, 2000);
DECLARE_CUDA_PFN_EXTERN(cuPointerGetAttribute, 4000);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernel, 4000);
#if CUDART_VERSION >= 11080
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernelEx);
DECLARE_CUDA_PFN_EXTERN(cuLaunchKernelEx, 11060);
#endif
// cuMem API support
DECLARE_CUDA_PFN_EXTERN(cuMemAddressReserve);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressFree);
DECLARE_CUDA_PFN_EXTERN(cuMemCreate);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationGranularity);
DECLARE_CUDA_PFN_EXTERN(cuMemExportToShareableHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemImportFromShareableHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemMap);
DECLARE_CUDA_PFN_EXTERN(cuMemRelease);
DECLARE_CUDA_PFN_EXTERN(cuMemRetainAllocationHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemSetAccess);
DECLARE_CUDA_PFN_EXTERN(cuMemUnmap);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationPropertiesFromHandle);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressReserve, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemAddressFree, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemCreate, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationGranularity, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemExportToShareableHandle, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemImportFromShareableHandle, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemMap, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemRelease, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemRetainAllocationHandle, 11000);
DECLARE_CUDA_PFN_EXTERN(cuMemSetAccess, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemUnmap, 10020);
DECLARE_CUDA_PFN_EXTERN(cuMemGetAllocationPropertiesFromHandle, 10020);
#if CUDA_VERSION >= 11070
DECLARE_CUDA_PFN_EXTERN(cuMemGetHandleForAddressRange); // DMA-BUF support
DECLARE_CUDA_PFN_EXTERN(cuMemGetHandleForAddressRange, 11070); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
DECLARE_CUDA_PFN_EXTERN(cuMulticastAddDevice);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindMem);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindAddr);
DECLARE_CUDA_PFN_EXTERN(cuMulticastCreate);
DECLARE_CUDA_PFN_EXTERN(cuMulticastGetGranularity);
DECLARE_CUDA_PFN_EXTERN(cuMulticastUnbind);
DECLARE_CUDA_PFN_EXTERN(cuMulticastAddDevice, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindMem, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastBindAddr, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastCreate, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastGetGranularity, 12010);
DECLARE_CUDA_PFN_EXTERN(cuMulticastUnbind, 12010);
#endif
#endif


@ -10,6 +10,7 @@
#include "nccl.h"
#include "nccl_common.h"
#include "bitops.h"
#include "symmetric.h"
#include <algorithm>
#include <stdint.h>
#include <sys/types.h>
@ -29,6 +30,30 @@ extern const char* ncclProtoStr[NCCL_NUM_PROTOCOLS];
#define NCCL_CUDA_ARCH 0
#endif
#ifdef __CUDA_ARCH_SPECIFIC__
#define NCCL_CUDA_ARCH_SPECIFIC __CUDA_ARCH_SPECIFIC__
#elif defined(__CUDA_ARCH_HAS_FEATURE__)
#if __CUDA_ARCH_HAS_FEATURE__(SM90_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 900
#elif __CUDA_ARCH_HAS_FEATURE__(SM100_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1000
#elif __CUDA_ARCH_HAS_FEATURE__(SM101_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1010
#elif __CUDA_ARCH_HAS_FEATURE__(SM120_ALL)
#define NCCL_CUDA_ARCH_SPECIFIC 1200
#else
#define NCCL_CUDA_ARCH_SPECIFIC 0
#endif
#else
#define NCCL_CUDA_ARCH_SPECIFIC 0
#endif
#ifdef __CUDA_ARCH_FAMILY_SPECIFIC__
#define NCCL_CUDA_ARCH_FAMILY_SPECIFIC __CUDA_ARCH_FAMILY_SPECIFIC__
#else
#define NCCL_CUDA_ARCH_FAMILY_SPECIFIC 0
#endif
#include "net_device.h"
enum ncclDevRedOp_t {
@ -380,6 +405,14 @@ struct alignas(16) ncclDevChannel {
uint64_t workCounter;
};
#define MAX_PROFILER_EVENTS_PER_CHANNEL 64
struct ncclDevProfiler {
struct {
uint64_t counter;
uint64_t timestamp;
} data[MAX_PROFILER_EVENTS_PER_CHANNEL];
};
struct ncclDevComm {
int rank;
int nRanks;
@ -389,9 +422,6 @@ struct ncclDevComm {
int p2pChunkSize;
int isAllNvlink;
// Work fifo return credits
uint32_t* workConsumed/*[MAXCHANNELS]*/;
int* collNetDenseToUserRank;
// Flag to ask NCCL kernels to abort
@ -402,8 +432,8 @@ struct ncclDevComm {
int* rankToLocalRank;
// Profiler counters
uint64_t* workStarted/*[MAXCHANNELS]*/;
uint64_t* workCompleted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workStarted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workCompleted/*[MAXCHANNELS]*/;
};
struct alignas(16) ncclDevCommAndChannels {
@ -476,7 +506,7 @@ __host__ __device__ constexpr int ncclCalcUnroll(int bytePerPack, int insns, int
__host__ __device__ constexpr int ncclCollUnroll(int cudaArch = NCCL_CUDA_ARCH) {
// Our collective unroll should move to the same bytes&insns model as NVLS.
return cudaArch >= 800 ? (cudaArch == 1200 ? 6 : 8) : 4;
return cudaArch >= 800 ? (cudaArch / 100 == 12 ? 6 : 8) : 4;
}
__host__ __device__ constexpr int ncclNvlsUnrollBytes(int cudaArch = NCCL_CUDA_ARCH) { return 4*16; }
@ -507,7 +537,6 @@ extern int const ncclDevKernelCount;
extern void* const ncclDevKernelList[/*ncclDevKernelCount*/];
// Table of most specialized kernel function to run given func index.
extern int const ncclDevFuncIdCount;
extern int const ncclDevFuncRowToId[];
extern void* const ncclDevKernelForFunc[/*funcIndex*/];
extern bool const ncclDevKernelForFuncIsSpecialized[/*funcIndex*/];
@ -535,11 +564,7 @@ inline bool ncclNvlsSupported(int devRedOp, int type) {
// `ncclDevFuncIndex()` needs to be in sync with "all_functions()" in "src/device/generate.py"
inline int ncclDevFuncId(int coll, int devRedOp, int type, int algo, int proto) {
#if defined(__CUDA_BF16_TYPES_EXIST__)
constexpr int NumTypes = ncclNumTypes;
#else
constexpr int NumTypes = ncclNumTypes + 1;
#endif
int row;
do {
row = 0; // ncclDevFuncIndex_P2p
@ -564,7 +589,7 @@ inline int ncclDevFuncId(int coll, int devRedOp, int type, int algo, int proto)
}
row += nAlgos*NCCL_NUM_PROTOCOLS;
nAlgos = 6;
nAlgos = 6; // TREE RING COLLNET_DIRECT COLLNET_CHAIN NVLS NVLS_TREE
if (coll == ncclFuncAllReduce) {
row += ((devRedOp*NumTypes + type)*nAlgos + algo)*NCCL_NUM_PROTOCOLS + proto;
break;


@ -50,6 +50,8 @@ int ncclPxnDisable(struct ncclComm* comm);
ncclResult_t ncclTopoGetPxnRanks(struct ncclComm* comm, int** intermediateRanks, int* nranks);
ncclResult_t ncclGetLocalCpu(struct ncclTopoSystem* system, int gpu, int* retCpu);
ncclResult_t ncclGetUserP2pLevel(int* level);
// Find CPU affinity
ncclResult_t ncclTopoGetCpuAffinity(struct ncclTopoSystem* system, int rank, cpu_set_t* affinity);
@ -74,7 +76,9 @@ ncclResult_t ncclTopoGetLocalNet(struct ncclTopoSystem* system, int rank, int ch
ncclResult_t ncclTopoGetLocalGpu(struct ncclTopoSystem* system, int64_t netId, int* gpuIndex);
ncclResult_t getLocalNetCountByBw(struct ncclTopoSystem* system, int gpu, int *count);
#define NCCL_TOPO_MAX_NODES 256
// Allows for up to 32 NICs per node on GB200-NVL72
#define NCCL_TOPO_MAX_NODES 576
ncclResult_t ncclTopoGetLocal(struct ncclTopoSystem* system, int type, int index, int resultType, int locals[NCCL_TOPO_MAX_NODES], int* localCount, int* pathType);
// Init search. Needs to be done before calling ncclTopoCompute
ncclResult_t ncclTopoSearchInit(struct ncclTopoSystem* system);


@ -9,9 +9,11 @@
#include "nccl.h"
#include "comm.h"
#include "allocator.h"
#include "register.h"
ncclResult_t ncclGroupErrCheck(ncclResult_t ret);
void ncclGroupCommJoin(struct ncclComm* comm);
void ncclGroupCommJoin(struct ncclComm* comm, int type);
void ncclGroupCommPreconnect(struct ncclComm* comm);
ncclResult_t ncclGroupCommLeave(struct ncclComm* comm);
ncclResult_t ncclGroupJobAbort(struct ncclGroupJob* groupJob);
@ -52,13 +54,14 @@ ncclResult_t ncclAsyncLaunch(
struct ncclGroupJob {
struct ncclAsyncJob base;
struct ncclComm **groupCommHeadPtr;
struct ncclComm **groupCommPreconnectHeadPtr;
ncclResult_t *groupErrorPtr;
bool *abortFlagPtr;
int *groupBlockingPtr;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> *asyncJobsPtr;
bool initialized;
int groupRefCount;
bool nonBlockingInit;
bool joined;
struct ncclComm *groupCommHead[ncclGroupTaskTypeNum];
struct ncclComm *groupCommPreconnectHead;
ncclResult_t groupError;
bool abortFlag;
struct ncclIntruQueue<struct ncclAsyncJob, &ncclAsyncJob::next> asyncJobs;
};
ncclResult_t ncclGroupStartInternal();
@ -69,27 +72,9 @@ ncclResult_t ncclAsyncJobComplete(struct ncclAsyncJob* job);
extern __thread int ncclGroupDepth; // depth of ncclGroupStart nesting
extern __thread ncclResult_t ncclGroupError;
extern __thread struct ncclComm* ncclGroupCommHead;
extern __thread struct ncclComm* ncclGroupCommHead[ncclGroupTaskTypeNum];
extern __thread struct ncclComm* ncclGroupCommPreconnectHead;
extern __thread int ncclGroupBlocking;
extern __thread struct ncclGroupJob *ncclGroupJobMainPtr;
extern __thread struct ncclGroupJob ncclGroupJobMain;
static inline void groupResetJobState() {
ncclGroupBlocking = -1;
ncclGroupJobMainPtr = NULL;
memset(&ncclGroupJobMain, 0, sizeof(struct ncclGroupJob));
return;
}
static inline ncclResult_t groupJobComplete(struct ncclGroupJob* job) {
ncclResult_t ret = ncclSuccess;
if (job) {
ret = ncclAsyncJobComplete(&job->base);
groupResetJobState();
}
return ret;
}
inline ncclResult_t ncclGroupStartInternal() {
ncclGroupDepth++;
@ -104,31 +89,32 @@ inline ncclResult_t ncclGroupErrCheck(ncclResult_t ret) {
}
// Add comm to this thread's group
inline void ncclGroupCommJoin(struct ncclComm* comm) {
if (comm->groupNext == reinterpret_cast<struct ncclComm*>(0x1)) {
inline void ncclGroupCommJoin(struct ncclComm* comm, int type) {
if (comm->groupNext[type] == reinterpret_cast<struct ncclComm*>(0x1)) {
// Insert comm into ncclGroupCommHead adjacent to sibling comms. This preserves
// the user's program order yet ensures siblings occur consecutively. This
// is required by doLaunches() in "group.cc".
struct ncclComm** pp = &ncclGroupCommHead;
struct ncclComm** pp = &ncclGroupCommHead[type];
while (*pp != nullptr && comm->intraComm0 != (*pp)->intraComm0)
pp = &(*pp)->groupNext;
pp = &(*pp)->groupNext[type];
// didn't find its clique, we need to insert it with ascending order based on commHash
if (*pp == nullptr) {
pp = &ncclGroupCommHead;
while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext;
pp = &ncclGroupCommHead[type];
while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext[type];
}
comm->groupNext = *pp;
comm->groupNext[type] = *pp;
*pp = comm;
// Each comm gets a new memory stack scope upon joining. Each task batched for
// this comm is allocated there.
ncclMemoryStackPush(&comm->memScoped);
// Initialize planner
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
if (type == ncclGroupTaskTypeCollective) {
// Initialize planner
ncclKernelPlanner::Peer* tmp = comm->planner.peers;
memset(&comm->planner, 0, sizeof(comm->planner));
comm->planner.peers = tmp;
}
}
ncclGroupBlocking = comm->config.blocking;
}
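
For reference, a minimal standalone sketch of the clique-grouping insertion performed above (the simplified Comm struct and joinGroup helper are hypothetical, not NCCL types): comms that share an intraComm0 stay adjacent, and a comm whose clique is not yet present is inserted in ascending commHash order.

#include <cstdint>

struct Comm {
  Comm* intraComm0;          // clique identifier: first comm of the intra-process group
  uint64_t commHash;         // ordering key used when a clique is seen for the first time
  Comm* groupNext = nullptr;
};

static void joinGroup(Comm** head, Comm* comm) {
  Comm** pp = head;
  // Walk until a sibling of the same clique is found, so siblings stay consecutive.
  while (*pp != nullptr && comm->intraComm0 != (*pp)->intraComm0)
    pp = &(*pp)->groupNext;
  // No sibling yet: fall back to ascending commHash order for the new clique.
  if (*pp == nullptr) {
    pp = head;
    while (*pp != nullptr && (*pp)->commHash < comm->commHash) pp = &(*pp)->groupNext;
  }
  comm->groupNext = *pp;
  *pp = comm;
}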
@ -141,8 +127,8 @@ inline void ncclGroupCommPreconnect(struct ncclComm* comm) {
}
// Comm has left group
inline ncclResult_t ncclGroupCommLeave(struct ncclComm* comm) {
comm->groupNext = reinterpret_cast<struct ncclComm*>(0x1);
inline ncclResult_t ncclGroupCommLeave(struct ncclComm* comm, int type) {
comm->groupNext[type] = reinterpret_cast<struct ncclComm*>(0x1);
ncclMemoryStackPop(&comm->memScoped);
return ncclSuccess;
}


@ -9,6 +9,7 @@
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#if __GNUC__ >= 3
# define __attribute_const __attribute__((const))
@ -39,7 +40,7 @@ union ibv_gid {
#define vext_field_avail(type, fld, sz) (offsetof(type, fld) < (sz))
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
//static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
enum ibv_node_type {
IBV_NODE_UNKNOWN = -1,
@ -208,7 +209,9 @@ struct ibv_port_attr {
uint8_t active_speed;
uint8_t phys_state;
uint8_t link_layer;
uint8_t reserved;
uint8_t flags;
uint16_t port_cap_flags2;
uint32_t active_speed_ex;
};
enum ibv_event_type {
@ -993,37 +996,50 @@ enum verbs_context_mask {
struct verbs_context {
/* "grows up" - new fields go here */
int (*_reserved_2) (void);
int (*destroy_flow) (struct ibv_flow *flow);
int (*_reserved_1) (void);
struct ibv_flow * (*create_flow) (struct ibv_qp *qp,
struct ibv_flow_attr *flow_attr);
int (*query_port)(struct ibv_context *context, uint8_t port_num,
struct ibv_port_attr *port_attr,
size_t port_attr_len);
int (*_reserved[25]) (void);
struct verbs_ex_private *priv;
int (*query_device_ex)(struct ibv_context *context,
const struct ibv_query_device_ex_input *input,
struct ibv_device_attr_ex *attr,
size_t attr_size);
int (*ibv_destroy_flow) (struct ibv_flow *flow);
void (*ABI_placeholder2) (void); /* DO NOT COPY THIS GARBAGE */
struct ibv_flow * (*ibv_create_flow) (struct ibv_qp *qp,
struct ibv_flow_attr *flow_attr);
void (*ABI_placeholder1) (void); /* DO NOT COPY THIS GARBAGE */
struct ibv_qp * (*open_qp)(struct ibv_context *context,
struct ibv_qp_open_attr *attr);
struct ibv_qp * (*create_qp_ex)(struct ibv_context *context,
struct ibv_qp_init_attr_ex *qp_init_attr_ex);
int (*get_srq_num)(struct ibv_srq *srq, uint32_t *srq_num);
struct ibv_srq * (*create_srq_ex)(struct ibv_context *context,
struct ibv_srq_init_attr_ex *srq_init_attr_ex);
struct ibv_xrcd * (*open_xrcd)(struct ibv_context *context,
struct ibv_xrcd_init_attr *xrcd_init_attr);
int (*close_xrcd)(struct ibv_xrcd *xrcd);
uint64_t has_comp_mask;
size_t sz; /* Must be immediately before struct ibv_context */
struct ibv_context context;/* Must be last field in the struct */
struct ibv_srq * (*create_srq_ex)(struct ibv_context *context,
struct ibv_srq_init_attr_ex *srq_init_attr_ex);
struct ibv_xrcd * (*open_xrcd)(struct ibv_context *context,
struct ibv_xrcd_init_attr *xrcd_init_attr);
int (*close_xrcd)(struct ibv_xrcd *xrcd);
uint64_t _ABI_placeholder3;
size_t sz; /* Must be immediately before struct ibv_context */
struct ibv_context context; /* Must be last field in the struct */
};
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
/*static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
{
return (!ctx || (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)) ?
NULL : container_of(ctx, struct verbs_context, context);
if (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)
return NULL;
/* open code container_of to not pollute the global namespace */
return (struct verbs_context *)(((uintptr_t)ctx) -
offsetof(struct verbs_context,
context));
}
#define verbs_get_ctx_op(ctx, op) ({ \
struct verbs_context *_vctx = verbs_get_ctx(ctx); \
(!_vctx || (_vctx->sz < sizeof(*_vctx) - offsetof(struct verbs_context, op)) || \
!_vctx->op) ? NULL : _vctx; })*/
struct verbs_context *__vctx = verbs_get_ctx(ctx); \
(!__vctx || (__vctx->sz < sizeof(*__vctx) - offsetof(struct verbs_context, op)) || \
!__vctx->op) ? NULL : __vctx; })
#define verbs_set_ctx_op(_vctx, op, ptr) ({ \
struct verbs_context *vctx = _vctx; \
@ -1055,4 +1071,20 @@ struct ibv_ece {
uint32_t comp_mask;
};
/**
* ibv_query_port_ex - Get (extended) port properties
*/
static inline int ibv_query_port_ex(struct ibv_context *context,
uint8_t port_num,
struct ibv_port_attr *port_attr)
{
struct verbs_context *vctx = verbs_get_ctx_op(context, query_port);
if (vctx) {
return vctx->query_port(context, port_num, port_attr, sizeof(*port_attr));
}
return -1;
}
#endif // NCCL_IBV_CORE_H_
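
A hedged usage sketch of the helper above: ctx is assumed to be an ibv_context obtained elsewhere (for example via ibv_open_device()), and the extended fields are only meaningful when the provider implements the extended query_port verb.

// Hypothetical helper, assuming this header is included and ctx is already open.
static inline uint32_t queryActiveSpeedEx(struct ibv_context* ctx, uint8_t port) {
  struct ibv_port_attr attr;
  if (ibv_query_port_ex(ctx, port, &attr) != 0) return 0;  // extended query not available
  return attr.active_speed_ex;                             // extended speed encoding, if provided
}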


@ -0,0 +1,18 @@
#ifndef NCCL_MLX5DV_CORE_H_
#define NCCL_MLX5DV_CORE_H_
/* Basic MLX5 direct verbs structs. Needed to dynamically load MLX5 direct verbs functions without
 * explicitly including the MLX5 direct verbs header.
 */
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include "ibvwrap.h"
enum mlx5dv_reg_dmabuf_access {
MLX5DV_REG_DMABUF_ACCESS_DATA_DIRECT = (1<<0),
};
#endif // NCCL_MLX5DV_CORE_H_


@ -0,0 +1,23 @@
#ifndef NCCL_MLX5DV_SYMBOLS_H_
#define NCCL_MLX5DV_SYMBOLS_H_
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "nccl.h"
/* MLX5 Direct Verbs function pointers */
struct ncclMlx5dvSymbols {
bool (*mlx5dv_internal_is_supported)(struct ibv_device *device);
int (*mlx5dv_internal_get_data_direct_sysfs_path)(struct ibv_context *context, char *buf, size_t buf_len);
/* DMA-BUF support */
struct ibv_mr * (*mlx5dv_internal_reg_dmabuf_mr)(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
};
/* Builds the MLX5 direct verbs symbol table, depending on whether rdma-core is linked in or loaded dynamically */
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols);
#endif // NCCL_MLX5DV_SYMBOLS_H_
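
A hedged sketch of how these pointers might be resolved in the dynamic-loading mode (when NCCL_BUILD_MLX5DV is not set). The library name libmlx5.so.1 and the exported symbol names follow rdma-core conventions but are assumptions; the real buildMlx5dvSymbols() may differ.

#include <dlfcn.h>

static ncclResult_t buildMlx5dvSymbolsSketch(struct ncclMlx5dvSymbols* s) {
  void* lib = dlopen("libmlx5.so.1", RTLD_NOW);   // assumed library name
  if (lib == NULL) return ncclSystemError;
  s->mlx5dv_internal_is_supported =
    (bool (*)(struct ibv_device*)) dlsym(lib, "mlx5dv_is_supported");
  s->mlx5dv_internal_get_data_direct_sysfs_path =
    (int (*)(struct ibv_context*, char*, size_t)) dlsym(lib, "mlx5dv_get_data_direct_sysfs_path");
  s->mlx5dv_internal_reg_dmabuf_mr =
    (struct ibv_mr* (*)(struct ibv_pd*, uint64_t, size_t, uint64_t, int, int, int)) dlsym(lib, "mlx5dv_reg_dmabuf_mr");
  // Only is_supported and reg_dmabuf_mr are treated as mandatory in this sketch.
  return (s->mlx5dv_internal_is_supported && s->mlx5dv_internal_reg_dmabuf_mr) ? ncclSuccess : ncclSystemError;
}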


@ -0,0 +1,41 @@
/*************************************************************************
* Copyright (c) 2004, 2005 Topspin Communications. All rights reserved.
* Copyright (c) 2004, 2011-2012 Intel Corporation. All rights reserved.
* Copyright (c) 2005, 2006, 2007 Cisco Systems, Inc. All rights reserved.
* Copyright (c) 2005 PathScale, Inc. All rights reserved.
*
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_MLX5DVWRAP_H_
#define NCCL_MLX5DVWRAP_H_
#include <arpa/inet.h>
#include <netinet/in.h>
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "core.h"
#include "ibvwrap.h"
#include <sys/types.h>
#include <unistd.h>
typedef enum mlx5dv_return_enum
{
MLX5DV_SUCCESS = 0, //!< The operation was successful
} mlx5dv_return_t;
ncclResult_t wrap_mlx5dv_symbols(void);
/* NCCL wrappers of MLX5 direct verbs functions */
bool wrap_mlx5dv_is_supported(struct ibv_device *device);
ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context *context, char *buf, size_t buf_len);
/* DMA-BUF support */
ncclResult_t wrap_mlx5dv_reg_dmabuf_mr(struct ibv_mr **ret, struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
struct ibv_mr * wrap_direct_mlx5dv_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access);
#endif // NCCL_MLX5DVWRAP_H_


@ -7,6 +7,8 @@
#ifndef NCCL_DEBUG_H_
#define NCCL_DEBUG_H_
#include <cstdint>
typedef enum {
NCCL_LOG_NONE = 0,
NCCL_LOG_VERSION = 1,
@ -38,6 +40,16 @@ typedef enum {
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
// NCCL core profiler callback for network defined events instrumentation
enum {
ncclProfilerNetEventStart = 0,
ncclProfilerNetEventStop,
ncclProfilerNetEventUpdate,
ncclProfilerNetEventUpdateAndStop,
};
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* pHandle, int64_t pluginId, void* extData);
#define NCCL_NUM_FUNCTIONS 5 // Send/Recv not included for now
typedef enum {
ncclFuncBroadcast = 0,
@ -51,7 +63,7 @@ typedef enum {
ncclNumFuncs = 8
} ncclFunc_t;
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*/PAT
#define NCCL_ALGO_UNDEF -1
#define NCCL_ALGO_TREE 0
#define NCCL_ALGO_RING 1


@ -14,8 +14,6 @@
typedef char ncclNetHandle_t[NCCL_NET_HANDLE_MAXSIZE];
ncclResult_t ncclNetPluginLoad(struct ncclComm* comm);
ncclResult_t ncclNetPluginUnload(struct ncclComm* comm);
ncclResult_t ncclNetInit(struct ncclComm* comm);
ncclResult_t ncclNetFinalize(struct ncclComm* comm);


@ -31,10 +31,11 @@
#define NVTX_SID_CommInitRankScalable 12 // same schema as NVTX_SID_CommInitRank
#define NVTX_SID_CommSplit 13
#define NVTX_SID_CommFinalize 14
#define NVTX_SID_CommShrink 15
// When adding new schema IDs, DO NOT re-use/overlap with the enum schema ID below!
// Define static schema ID for the reduction operation.
#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 15 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
#define NVTX_PAYLOAD_ENTRY_NCCL_REDOP 16 + NVTX_PAYLOAD_ENTRY_TYPE_SCHEMA_ID_STATIC_START
extern const nvtxDomainHandle_t ncclNvtxDomainHandle;


@ -67,6 +67,16 @@ NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommSplit, static cons
)
)
NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommShrink, static constexpr,
NCCL_NVTX_PAYLOAD_ENTRIES(
(uint64_t, newcomm, TYPE_UINT64, nccl_nvtxCommStr),
(int, nranks, TYPE_INT, nccl_nvtxNranksStr),
(int, myrank, TYPE_INT, nccl_nvtxRankStr),
(int, cudaDev, TYPE_INT, nccl_nvtxCudaDevStr),
(int, num_exclude, TYPE_INT, "num_exclude")
)
)
NCCL_NVTX_DEFINE_STRUCT_WITH_SCHEMA_ENTRIES(NcclNvtxParamsCommFinalize, static constexpr,
NCCL_NVTX_PAYLOAD_ENTRIES(
(uint64_t, comm, TYPE_UINT64, nccl_nvtxCommStr)


@ -28,10 +28,9 @@
#define NCCL_NET_MAX_REQUESTS 32
// Max number of ncclNet objects which can live in the same process
#define NCCL_NET_MAX_PLUGINS 3
// NCCL core profiler callback for network defined events instrumentation
typedef ncclResult_t (*ncclProfilerCallback_t)(void** eHandle, int type, void* pHandle, int64_t pluginId, void* extData);
#ifndef NCCL_NET_MAX_PLUGINS
#define NCCL_NET_MAX_PLUGINS 16
#endif
#include "net/net_v10.h"
#include "net/net_v9.h"


@ -19,43 +19,53 @@ enum {
};
typedef enum {
ncclProfilerProxyOpSendPosted,
ncclProfilerProxyOpSendRemFifoWait,
ncclProfilerProxyOpSendTransmitted,
ncclProfilerProxyOpSendDone,
ncclProfilerProxyOpRecvPosted,
ncclProfilerProxyOpRecvReceived,
ncclProfilerProxyOpRecvTransmitted,
ncclProfilerProxyOpRecvDone,
ncclProfilerProxyOpSendPosted = 0, // deprecated in v4
ncclProfilerProxyOpSendRemFifoWait = 1, // deprecated in v4
ncclProfilerProxyOpSendTransmitted = 2, // deprecated in v4
ncclProfilerProxyOpSendDone = 3, // deprecated in v4
ncclProfilerProxyOpRecvPosted = 4, // deprecated in v4
ncclProfilerProxyOpRecvReceived = 5, // deprecated in v4
ncclProfilerProxyOpRecvTransmitted = 6, // deprecated in v4
ncclProfilerProxyOpRecvDone = 7, // deprecated in v4
ncclProfilerProxyOpInProgress_v4 = 19,
/* Legacy proxy profiler states */
ncclProfilerProxyStepSendGPUWait,
ncclProfilerProxyStepSendWait,
ncclProfilerProxyStepRecvWait,
ncclProfilerProxyStepRecvFlushWait,
ncclProfilerProxyStepRecvGPUWait,
ncclProfilerProxyStepSendGPUWait = 8,
ncclProfilerProxyStepSendPeerWait_v4 = 20,
ncclProfilerProxyStepSendWait = 9,
ncclProfilerProxyStepRecvWait = 10,
ncclProfilerProxyStepRecvFlushWait = 11,
ncclProfilerProxyStepRecvGPUWait = 12,
/* Legacy proxy control states */
ncclProfilerProxyCtrlIdle,
ncclProfilerProxyCtrlActive,
ncclProfilerProxyCtrlSleep,
ncclProfilerProxyCtrlWakeup,
ncclProfilerProxyCtrlAppend,
ncclProfilerProxyCtrlAppendEnd,
ncclProfilerProxyCtrlIdle = 13,
ncclProfilerProxyCtrlActive = 14,
ncclProfilerProxyCtrlSleep = 15,
ncclProfilerProxyCtrlWakeup = 16,
ncclProfilerProxyCtrlAppend = 17,
ncclProfilerProxyCtrlAppendEnd = 18,
/* Network defined event states */
ncclProfilerNetPluginUpdate = 21,
/* Kernel event states */
ncclProfilerKernelChStop = 22,
} ncclProfilerEventState_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v1_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v2_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v3_t;
typedef ncclProfilerEventState_t ncclProfilerEventState_v4_t;
#include <cstdint>
#include "profiler/profiler_v4.h"
#include "profiler/profiler_v3.h"
#include "profiler/profiler_v2.h"
#include "profiler/profiler_v1.h"
typedef ncclProfiler_v3_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v3_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v3_t ncclProfilerEventStateArgs_t;
typedef ncclProfiler_v4_t ncclProfiler_t;
typedef ncclProfilerEventDescr_v4_t ncclProfilerEventDescr_t;
typedef ncclProfilerEventStateArgs_v4_t ncclProfilerEventStateArgs_t;
#define NCCL_PROFILER_NET_VER_BITS (16)
#define NCCL_PROFILER_NET_VER_MASK (~0U >> NCCL_PROFILER_NET_VER_BITS)


@ -9,10 +9,16 @@
#include "nccl.h"
enum ncclPluginType {
ncclPluginTypeNet,
ncclPluginTypeTuner,
ncclPluginTypeProfiler,
};
void* ncclOpenNetPluginLib(const char* name);
void* ncclOpenTunerPluginLib(const char* name);
void* ncclOpenProfilerPluginLib(const char* name);
void* ncclGetNetPluginLib(void);
ncclResult_t ncclClosePluginLib(void* handle);
void* ncclGetNetPluginLib(enum ncclPluginType type);
ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type);
#endif


@ -0,0 +1,123 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef PROFILER_V4_H_
#define PROFILER_V4_H_
typedef struct {
uint8_t type; // event type descriptor: ncclProfileColl, ...
void* parentObj; // pointer to the profiler parent object (for a coll this is the group)
int rank; // originating rank
union {
struct {
uint64_t seqNumber;
const char* func;
void const* sendBuff;
void* recvBuff;
size_t count;
int root;
const char* datatype;
uint8_t nChannels;
uint8_t nWarps;
const char* algo;
const char* proto;
} coll;
struct {
const char* func;
void* buff;
const char* datatype;
size_t count;
int peer;
uint8_t nChannels;
} p2p;
struct {
pid_t pid; // pid of the originating process
uint8_t channelId; // channel id for this proxy operation
int peer; // remote rank for send/recv
int nSteps; // number of steps for this proxy operation
int chunkSize; // amount of data transferred by this proxy operation
int isSend;
} proxyOp;
struct {
int step;
} proxyStep;
struct {
uint8_t channelId;
uint64_t pTimer; // start timestamp from GPU globaltimer
} kernelCh;
struct {
int64_t id;
void* data;
} netPlugin;
};
} ncclProfilerEventDescr_v4_t;
typedef union {
struct {
size_t transSize;
} proxyStep;
struct {
int appendedProxyOps;
} proxyCtrl;
struct {
void* data;
} netPlugin;
struct {
uint64_t pTimer;
} kernelCh;
} ncclProfilerEventStateArgs_v4_t;
typedef struct {
const char* name;
// init - initialize the profiler plugin
// Input
// - context : opaque profiler context object for separating profiler behavior across comms
// - commName : user assigned communicator name
// - commHash : communicator id
// - nNodes : number of nodes in communicator
// - nranks : number of ranks in communicator
// - rank : rank identifier in communicator
// - logfn : logger function
// Output
// - eActivationMask: bitmask of active events set by the plugin
ncclResult_t (*init)(void** context, int* eActivationMask, const char* commName, uint64_t commHash, int nNodes, int nranks, int rank, ncclDebugLogger_t logfn);
// startEvent - initialize and start a new event for the supplied event descriptor inside the event set
// Input
// - context: opaque profiler context object
// - eDescr : pointer to ncclProfilerEventDescr_t object
// Output
// - eHandle: return event handle for supplied event descriptor object
ncclResult_t (*startEvent)(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr);
// stopEvent - stop/finalize an event inside an event set
// Input
// - eHandle: handle to event object
ncclResult_t (*stopEvent)(void* eHandle);
// recordEventState - record event state transitions and event attribute updates
// Input
// - eHandle : handle to event object created through startEvent
// - eStateArgs: optional argument used to capture event attribute updates associated with the state transition
// - eState : event state transition
ncclResult_t (*recordEventState)(void* eHandle, ncclProfilerEventState_v4_t eState, ncclProfilerEventStateArgs_v4_t* eStateArgs);
// finalize - finalize the profiler plugin
// Input
// - context: opaque profiler context object
ncclResult_t (*finalize)(void* context);
} ncclProfiler_v4_t;
#endif
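
A minimal, hypothetical sketch of a plugin filling in the v4 interface above. It assumes nccl.h and this header are on the include path; the exported symbol name ncclProfiler_v4 and the do-nothing bodies are illustrative only, not a real profiler.

#include <stdint.h>
#include <stdlib.h>

static ncclResult_t exampleInit(void** context, int* eActivationMask, const char* commName,
                                uint64_t commHash, int nNodes, int nranks, int rank,
                                ncclDebugLogger_t logfn) {
  *context = calloc(1, sizeof(long)); // per-communicator event counter
  *eActivationMask = 0;               // plugin selects which event types it wants to receive
  return ncclSuccess;
}
static ncclResult_t exampleStartEvent(void* context, void** eHandle, ncclProfilerEventDescr_v4_t* eDescr) {
  if (context) (*(long*)context)++;   // just count started events in this sketch
  *eHandle = NULL;                    // no per-event state kept
  return ncclSuccess;
}
static ncclResult_t exampleStopEvent(void* eHandle) { return ncclSuccess; }
static ncclResult_t exampleRecordEventState(void* eHandle, ncclProfilerEventState_v4_t eState,
                                            ncclProfilerEventStateArgs_v4_t* eStateArgs) {
  return ncclSuccess;
}
static ncclResult_t exampleFinalize(void* context) { free(context); return ncclSuccess; }

// Assumed export name; fields follow the member order of ncclProfiler_v4_t above.
ncclProfiler_v4_t ncclProfiler_v4 = {
  "ExampleProfiler", exampleInit, exampleStartEvent, exampleStopEvent, exampleRecordEventState, exampleFinalize
};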


@ -21,8 +21,8 @@ struct ncclProxyConnector;
struct ncclProfilerProxy {
bool initialized;
uint64_t* workStarted/*[MAXCHANNELS]*/;
uint64_t* workCompleted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workStarted/*[MAXCHANNELS]*/;
struct ncclDevProfiler* workCompleted/*[MAXCHANNELS]*/;
uint64_t workCounter[MAXCHANNELS]; // host work counter
struct ncclProxyConnector sendProxyConn[MAXCHANNELS];
struct ncclProxyConnector recvProxyConn[MAXCHANNELS];
@ -43,8 +43,7 @@ ncclResult_t ncclProfilerStartTaskEvents(struct ncclKernelPlan* plan);
ncclResult_t ncclProfilerStopTaskEvents(struct ncclKernelPlan* plan);
// Proxy Op Start/Stop Event Wrappers
ncclResult_t ncclProfilerStartSendProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStartRecvProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStartProxyOpEvent(int sub, struct ncclProxyArgs* args);
ncclResult_t ncclProfilerStopProxyOpEvent(int sub, struct ncclProxyArgs* args);
// Proxy Step Start/Stop Event Wrappers
@ -57,11 +56,11 @@ ncclResult_t ncclProfilerStartProxyCtrlEvent(void* profilerContext, void** eHand
ncclResult_t ncclProfilerStopProxyCtrlEvent(void* eHandle);
// Kernel Channel Start/Stop Event Wrappers
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s);
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s);
ncclResult_t ncclProfilerStartKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t start);
ncclResult_t ncclProfilerStopKernelChEvent(struct ncclProxyArgs* args, int s, uint64_t stop);
// Record Event Wrappers
ncclResult_t ncclProfilerRecordProxyOpEventState(int sub, struct ncclProxyArgs* args, int steps, size_t transSize, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyOpEventState(int sub, struct ncclProxyArgs* args, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyStepEventState(int sub, struct ncclProxyArgs* args, int stepId, ncclProfilerEventState_t eState);
ncclResult_t ncclProfilerRecordProxyCtrlEventState(void*eHandle, int appended, ncclProfilerEventState_t eState);


@ -105,6 +105,13 @@ struct ncclProxyOp {
struct ncclProxyOp *enqNext;
};
struct ncclProxySubArgs;
struct ncclProxyEventHandle {
void* stepEventHandle;
struct ncclProxySubArgs* subArgPtr;
};
struct ncclProxySubArgs {
struct ncclProxyConnection* connection;
int reg;
@ -137,13 +144,12 @@ struct ncclProxySubArgs {
// Profiler plugin
int eActivationMask;
int rank;
uint64_t profilerSteps;
pid_t pid;
void* profilerContext;
void* taskEventHandle;
void* opEventHandle;
void* kernelEventHandle;
void* stepEventHandles[NCCL_STEPS];
struct ncclProxyEventHandle pHandles[NCCL_STEPS];
size_t transSize;
uint64_t workCounter;
@ -226,6 +232,8 @@ struct ncclProxyPeer {
};
struct ncclSharedNetComms {
int activeConnect[MAXCHANNELS];
int activeAccept[MAXCHANNELS];
void* sendComm[MAXCHANNELS];
void* recvComm[MAXCHANNELS];
int sendRefCount[MAXCHANNELS];


@ -29,18 +29,24 @@ struct ncclRegNetHandles {
struct ncclRegNetHandles* next;
};
struct ncclSymRegTask {
struct ncclSymRegTask *next;
void* buff;
size_t baseSize;
CUmemGenericAllocationHandle memHandle;
struct ncclReg* regHandle;
size_t alignment;
};
struct ncclReg {
// common attributes
size_t pages;
uintptr_t begAddr, endAddr; // page aligned
int localRefs;
int graphRefs;
uintptr_t addr;
uint32_t state;
// net reg
struct ncclRegNetHandles* netHandleHead;
// nvls reg
uintptr_t baseAddr;
size_t baseSize;
CUdeviceptr regAddr;
size_t regUCSize, regMCSize;
int dev;
@ -52,6 +58,10 @@ struct ncclReg {
// general ipc reg
struct ncclPeerRegIpcAddr regIpcAddrs;
struct ncclIpcRegInfo* ipcInfos[NCCL_MAX_LOCAL_RANKS];
// symmetric reg
void* baseSymPtr;
size_t symSize;
int winFlags;
};
struct ncclRegCache {
@ -60,10 +70,14 @@ struct ncclRegCache {
uintptr_t pageSize;
};
struct ncclWindow {
struct ncclReg* handle;
};
ncclResult_t ncclRegCleanup(struct ncclComm* comm);
ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data, size_t size, struct ncclReg** reg);
ncclResult_t ncclCommGraphRegister(const ncclComm_t comm, void* buff, size_t size, void** handle);
ncclResult_t ncclCommGraphDeregister(const ncclComm_t comm, struct ncclReg *handle);
ncclResult_t ncclRegLocalIsValid(struct ncclReg *reg, bool *isValid);
ncclResult_t ncclCommSymmetricRegisterInternal(struct ncclComm* comm, void* buff, size_t baseSize, size_t alignment, CUmemGenericAllocationHandle memHandle, struct ncclReg* regHandle);
#endif


@ -0,0 +1,33 @@
#ifndef NCCL_REGISTER_INLINE_H_
#define NCCL_REGISTER_INLINE_H_
#include "comm.h"
#include "register.h"
static inline ncclResult_t ncclRegFind(struct ncclComm* comm, const void* data, size_t size, struct ncclReg** outReg) {
struct ncclRegCache* cache = &comm->regCache;
*outReg = NULL;
for (int slot=0; /*true*/; slot++) {
if (slot == cache->population) return ncclSuccess;
struct ncclReg *reg = cache->slots[slot];
if ((uintptr_t)data < reg->begAddr) return ncclSuccess;
if ((uintptr_t)data + size <= reg->endAddr) {
*outReg = reg;
return ncclSuccess;
}
}
}
static inline ncclResult_t ncclRegFindSymmetric(struct ncclComm* comm, const void* data, size_t size, void** symPtr, struct ncclReg** outReg) {
struct ncclReg* regRecord = NULL;
*symPtr = NULL;
*outReg = NULL;
NCCLCHECK(ncclRegFind(comm, data, size, &regRecord));
if (regRecord && regRecord->baseSymPtr) {
*symPtr = (void*)((uintptr_t)regRecord->baseSymPtr + (uintptr_t)data - (uintptr_t)regRecord->begAddr);
*outReg = regRecord;
}
return ncclSuccess;
}
#endif
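
A worked illustration (hypothetical addresses) of the translation performed by ncclRegFindSymmetric(): a user pointer inside a registered window maps into the symmetric window at the same offset from the registered base.

#include <cassert>
#include <cstdint>

int main() {
  uintptr_t begAddr    = 0x7f0000000000;  // registered region base (page aligned)
  uintptr_t baseSymPtr = 0x200000000000;  // symmetric mapping of the same region
  uintptr_t data       = 0x7f0000001240;  // user buffer inside the region
  uintptr_t symPtr = baseSymPtr + (data - begAddr);  // same formula as above
  assert(symPtr == 0x200000001240);
  return 0;
}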


@ -69,8 +69,10 @@ struct ncclSocket {
const char *ncclSocketToString(const union ncclSocketAddress *addr, char *buf, const int numericHostForm = 1);
ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char* ip_port_pair);
int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAddrs, union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int maxIfs);
int ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs);
ncclResult_t ncclFindInterfaceMatchSubnet(char* ifName, union ncclSocketAddress* localAddr,
union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int* found);
ncclResult_t ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs,
int* nIfs);
// Initialize a socket
ncclResult_t ncclSocketInit(struct ncclSocket* sock, const union ncclSocketAddress* addr = NULL, uint64_t magic = NCCL_SOCKET_MAGIC, enum ncclSocketType type = ncclSocketTypeUnknown, volatile uint32_t* abortFlag = NULL, int asyncFlag = 0, int customRetry = 0);

src/include/symmetric.h (new file, 90 lines)

@ -0,0 +1,90 @@
#ifndef NCCL_DEVICE_SYMMETRIC_H_
#define NCCL_DEVICE_SYMMETRIC_H_
#include "nccl.h"
#include "nccl_common.h"
#include "bitops.h"
constexpr int ncclSymMaxBlocks = 64;
constexpr int ncclSymMaxThreads = 512;
constexpr int ncclSymLLMaxEltSize = 64;
constexpr __host__ __device__ int ncclSymLLMaxSlots(int eltSize = ncclSymLLMaxEltSize) {
return ncclSymMaxThreads*ncclSymLLMaxEltSize/eltSize;
}
constexpr __host__ __device__ int ncclSymLLEpochSize(int nRanks) {
return /*LL Overhead*/2 * maxval(ncclSymMaxThreads*nRanks*8, ncclSymLLMaxSlots(ncclSymLLMaxEltSize)*ncclSymLLMaxEltSize);
}
struct alignas(16) ncclSymDevBase {
uint32_t llEpoch[ncclSymMaxBlocks];
uint32_t barEpochMc[ncclSymMaxBlocks], barEpochUc[ncclSymMaxBlocks];
uint32_t barInboxMc[ncclSymMaxBlocks];
uint32_t barInboxPerPeer[];
static constexpr size_t size(int nRanks) {
return sizeof(ncclSymDevBase) +
alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16) +
ncclSymMaxBlocks * /*epochs=*/2 * ncclSymLLEpochSize(nRanks);
}
};
static __device__ uint4* ncclSymDevBase_getLLBuf(struct ncclSymDevBase* base, int nRanks, int block, uint32_t epoch) {
// Get pointer to buffer trailing the header struct.
char* ans = (char*)(base + 1);
// Skip over barInboxPerPeer[]
ans += alignUp(ncclSymMaxBlocks*nRanks*sizeof(uint32_t), 16);
// Skip to our block
int epochSize = ncclSymLLEpochSize(nRanks);
ans += block * /*epochs=*/2 * epochSize;
ans += (epoch & 1)*epochSize;
return (uint4*)ans;
}
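
A standalone sketch (constants copied from above, helper names hypothetical) that re-derives the layout arithmetic: for nRanks = 8 each block's LL epoch is 64 KiB, the per-peer inbox area is 2 KiB, and the LL buffers total 8 MiB.

#include <cstdint>
#include <cstdio>
#include <cstddef>

constexpr int kMaxBlocks = 64, kMaxThreads = 512, kLLMaxEltSize = 64;
constexpr size_t alignUp16(size_t x) { return (x + 15) & ~size_t(15); }
constexpr int llMaxSlots(int eltSize) { return kMaxThreads * kLLMaxEltSize / eltSize; }
constexpr int llEpochSize(int nRanks) {
  int a = kMaxThreads * nRanks * 8;                  // first argument of maxval() above
  int b = llMaxSlots(kLLMaxEltSize) * kLLMaxEltSize; // second argument: full slot payload
  return 2 * (a > b ? a : b);
}

int main() {
  int nRanks = 8;
  size_t inbox   = alignUp16(size_t(kMaxBlocks) * nRanks * sizeof(uint32_t));
  size_t llBytes = size_t(kMaxBlocks) * 2 * llEpochSize(nRanks);
  printf("epochSize=%d inboxBytes=%zu llBytes=%zu\n", llEpochSize(nRanks), inbox, llBytes);
  return 0;
}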
struct ncclSymDevComm {
ncclSymDevBase* base;
ncclSymDevBase* baseMc;
uint32_t stride4G;
int nRanks, rank;
uint32_t nRanks_rcp32; // idivRcp32(nRanks)
};
struct alignas(16) ncclSymDevArgs {
struct ncclSymDevComm comm;
int rootRank;
uint64_t redOpArg; // must be collectively uniform
size_t nElts;
char* input;
char* output;
};
enum ncclSymKernelId {
ncclSymKernelId_AllReduce_AGxLL_R,
ncclSymKernelId_AllReduce_AGxLLMC_R,
ncclSymKernelId_AllReduce_RSxLD_AGxST,
ncclSymKernelId_AllReduce_RSxLDMC_AGxSTMC,
ncclSymKernelId_AllGather_LL,
ncclSymKernelId_AllGather_LLMC,
ncclSymKernelId_AllGather_ST,
ncclSymKernelId_AllGather_STMC,
ncclSymKernelId_ReduceScatter_LL,
ncclSymKernelId_ReduceScatter_LD,
ncclSymKernelId_ReduceScatter_LDMC,
ncclSymKernelId_Count
};
bool ncclSymImplemented(ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
ncclResult_t ncclSymPickKernel(struct ncclComm* comm, ncclFunc_t fn, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty, size_t nElts, float* estTimeUs, ncclSymKernelId* kernelId, int* nBlocks, int* nWarps);
// Generated by src/device/symmetric/generate.py
extern int const ncclSymKernelCount;
extern void* const ncclSymKernelList[];
void* ncclSymGetKernelPtr(ncclSymKernelId kernelId, int/*ncclDevRedOp_t*/ red, ncclDataType_t ty);
const char* ncclSymKernelIdToString(int kernelId);
#endif


@ -22,6 +22,7 @@
#include "proxy.h"
#include "comm.h"
#include "bootstrap.h"
extern struct ncclTransport p2pTransport;
extern struct ncclTransport shmTransport;
@ -46,6 +47,7 @@ struct ncclPeerInfo {
int64_t busId;
struct ncclComm* comm;
int cudaCompCap;
size_t totalGlobalMem;
// MNNVL support
nvmlGpuFabricInfoV_t fabricInfo;
int cuMemSupport;
@ -53,6 +55,8 @@ struct ncclPeerInfo {
};
#define CONNECT_SIZE 256
#define NCCL_MAX_PAGE_SIZE (512L * 1024L * 1024L)
#define NCCL_REC_PAGE_SIZE (2L * 1024L * 1024L)
struct ncclConnect {
char data[CONNECT_SIZE];
};
@ -80,6 +84,7 @@ struct ncclNvlsSharedRes {
char* ucBuff; // Unicast NVLS buffer address
char* ucCredit; // Unicast NVLS credit address
int nChannels;
int nHeads;
struct ncclShmemCollBuff nvlsShmem;
void *nvlsShmemHandle;
};
@ -119,7 +124,8 @@ struct ncclTransport {
ncclResult_t ncclTransportP2pConnect(struct ncclComm* comm, int channelId, int nrecv, int* peerRecv, int nsend, int* peerSend, int connIndex);
ncclResult_t ncclTransportP2pSetup(struct ncclComm* comm, struct ncclTopoGraph* graph, int connIndex);
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* intraNodeP2pSupport, bool* directMode);
ncclResult_t ncclTransportCheckP2pType(struct ncclComm* comm, bool* isAllDirectP2p, bool* directMode);
ncclResult_t ncclTransportIsAllDirectP2p(struct ncclComm* comm, int* isAllDirectP2p);
ncclResult_t ncclNvlsInit(struct ncclComm* comm);
ncclResult_t ncclNvlsSetup(struct ncclComm* comm, struct ncclComm* parent);
@ -154,5 +160,15 @@ ncclResult_t ncclRegisterP2pIpcBuffer(struct ncclComm* comm, void* userbuff, siz
ncclResult_t ncclRegisterP2pNetBuffer(struct ncclComm* comm, void* userbuff, size_t size, struct ncclConnector* conn, int* regFlag, void** handle, struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue);
ncclResult_t ncclRegisterCollBuffers(struct ncclComm* comm, struct ncclTaskColl* info, void* outRegBufSend[NCCL_MAX_LOCAL_RANKS], void* outRegBufRecv[NCCL_MAX_LOCAL_RANKS], struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue, bool* regNeedConnect);
ncclResult_t ncclRegisterCollNvlsBuffers(struct ncclComm* comm, struct ncclTaskColl* info, void* outRegBufSend[NCCL_MAX_LOCAL_RANKS], void* outRegBufRecv[NCCL_MAX_LOCAL_RANKS], struct ncclIntruQueue<struct ncclCommCallback, &ncclCommCallback::next>* cleanupQueue, bool* regNeedConnect);
ncclResult_t ncclNvlsRegResourcesQuery(struct ncclComm* comm, struct ncclTaskColl* info, int* recChannels);
ncclResult_t ncclIpcSymmetricInit(struct ncclComm* comm);
ncclResult_t ncclIpcSymmetricMap(struct ncclComm* comm, size_t offset, size_t size, CUmemGenericAllocationHandle memHandle, void** symPtr);
ncclResult_t ncclIpcSymmetricFree(struct ncclComm* comm, size_t size, void* symPtr);
ncclResult_t ncclIpcSymmetricFinalize(struct ncclComm* comm);
ncclResult_t ncclNvlsSymmetricInit(struct ncclComm* comm);
ncclResult_t ncclNvlsSymmetricMap(struct ncclComm* comm, size_t offset, size_t ucsize, void* ucaddr);
ncclResult_t ncclNvlsSymmetricFree(struct ncclComm* comm, size_t ucsize, void* ucaddr);
ncclResult_t ncclNvlsSymmetricFinalize(struct ncclComm* comm);
#endif


@ -43,6 +43,12 @@ static long log2i(long n) {
return log2Down(n);
}
// Comparator function for qsort/bsearch to compare integers
static int compareInts(const void *a, const void *b) {
int ia = *(const int*)a, ib = *(const int*)b;
return (ia > ib) - (ia < ib);
}
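
A brief, hypothetical usage note: qsort() and bsearch() from <stdlib.h> take exactly this comparator, for example to keep a rank list sorted before membership checks (containsRank is illustrative, not an NCCL function).

// Assumes <stdlib.h> is included.
static bool containsRank(int* ranks, int n, int key) {
  qsort(ranks, n, sizeof(int), compareInts);                         // e.g. {7,2,5} -> {2,5,7}
  return bsearch(&key, ranks, n, sizeof(int), compareInts) != NULL;
}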
inline uint64_t clockNano() {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);


@ -18,6 +18,7 @@
#include "argcheck.h"
#include "tuner.h"
#include "ras.h"
#include "profiler.h"
#include "mnnvl.h"
#include <fcntl.h>
#include <string.h>
@ -29,6 +30,7 @@
#include <unistd.h>
#include "param.h"
#include "nvtx_payload_schemas.h"
#include "utils.h"
#define STR2(v) #v
#define STR(v) STR2(v)
@ -48,6 +50,10 @@ NCCL_PARAM(GroupCudaStream, "GROUP_CUDA_STREAM", NCCL_GROUP_CUDA_STREAM);
NCCL_PARAM(CheckPointers, "CHECK_POINTERS", 0);
NCCL_PARAM(CommBlocking, "COMM_BLOCKING", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(RuntimeConnect, "RUNTIME_CONNECT", 1);
NCCL_PARAM(WinEnable, "WIN_ENABLE", 1);
NCCL_PARAM(CollnetEnable, "COLLNET_ENABLE", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(CtaPolicy, "CTA_POLICY", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(NvlsChannels, "NVLS_NCHANNELS", NCCL_CONFIG_UNDEF_INT);
static ncclResult_t commReclaim(ncclComm_t comm);
@ -174,6 +180,10 @@ static ncclResult_t commFree(ncclComm_t comm) {
if (comm == NULL)
return ncclSuccess;
if (comm->symmetricSupport && comm->symDevComm.base) {
NCCLCHECK(ncclCommSymmetricFreeInternal(comm, comm->baseUCSymPtr + comm->rank * comm->baseStride));
}
NCCLCHECK(ncclRasCommFini(comm));
/* in commReclaim, we have guaranteed only last rank which calls ncclCommDestroy() will
@ -253,15 +263,16 @@ static ncclResult_t commFree(ncclComm_t comm) {
NCCLCHECK(ncclRegCleanup(comm));
if (comm->symmetricSupport) {
NCCLCHECK(ncclNvlsSymmetricFinalize(comm));
NCCLCHECK(ncclIpcSymmetricFinalize(comm));
}
INFO(NCCL_INIT,"comm %p rank %d nranks %d cudaDev %d busId %lx - %s COMPLETE", comm, comm->rank, comm->nRanks, comm->cudaDev, comm->busId, abort ? "Abort" : "Destroy");
commPoison(comm); // poison comm before free to avoid comm reuse.
NCCLCHECK(ncclProfilerPluginFinalize(comm));
NCCLCHECK(ncclNetFinalize(comm));
NCCLCHECK(ncclNetPluginUnload(comm));
ncclCudaContextDrop(comm->context);
free(comm);
return ncclSuccess;
@ -271,7 +282,7 @@ NCCL_PARAM(DisableGraphHelper, "GRAPH_HELPER_DISABLE", 0);
// GDRCOPY support: FIFO_ENABLE when enabled locates a workFifo in CUDA memory
NCCL_PARAM(GdrCopyFifoEnable, "GDRCOPY_FIFO_ENABLE", 1);
#define NCCL_WORK_FIFO_BYTES_DEFAULT (1<<20)
NCCL_PARAM(WorkFifoBytes, "WORK_FIFO_BYTES", -1);
NCCL_PARAM(WorkFifoBytes, "WORK_FIFO_BYTES", NCCL_WORK_FIFO_BYTES_DEFAULT);
NCCL_PARAM(WorkArgsBytes, "WORK_ARGS_BYTES", INT64_MAX);
enum ncclLaunchMode ncclParamLaunchMode;
@ -331,12 +342,10 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
comm->rank = rank;
comm->nRanks = ndev;
NCCLCHECK(ncclNetPluginLoad(comm));
NCCLCHECK(ncclNetInit(comm));
NCCLCHECK(ncclProfilerPluginInit(comm));
INFO(NCCL_INIT, "Using network %s", comm->ncclNet->name);
if (parent && parent->config.splitShare) {
if (parent && parent->shareResources) {
if (parent->ncclNet != comm->ncclNet) {
WARN("Split shares resources, but parent comm netName %s is different from child comm netName %s", parent->ncclNet->name, comm->ncclNet->name);
return ncclInvalidUsage;
@ -361,13 +370,14 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
comm->checkPointers = ncclParamCheckPointers() == 1 ? true : false;
comm->dmaBufSupport = (dmaBufSupported(comm) == ncclSuccess) ? true : false;
comm->collNetSupport = 0;
memset(comm->collNetSupportMatrix, 0, sizeof(comm->collNetSupportMatrix));
ncclMemoryPoolConstruct(&comm->memPool_ncclKernelPlan);
ncclMemoryPoolConstruct(&comm->memPool_ncclProxyOp);
comm->groupNext = reinterpret_cast<struct ncclComm*>(0x1);
for (int i = 0; i < ncclGroupTaskTypeNum; i++) {
comm->groupNext[i] = reinterpret_cast<struct ncclComm*>(0x1);
}
comm->preconnectNext = reinterpret_cast<struct ncclComm*>(0x1);
static_assert(MAXCHANNELS <= sizeof(*comm->connectSend)*8, "comm->connectSend must have enough bits for all channels");
@ -378,7 +388,7 @@ static ncclResult_t commAlloc(struct ncclComm* comm, struct ncclComm* parent, in
// Mark channels as non initialized.
for (int c=0; c < MAXCHANNELS; c++) comm->channels[c].id = -1;
if (parent == NULL || !parent->config.splitShare) {
if (parent == NULL || !parent->shareResources) {
struct ncclSharedResources* sharedRes = NULL;
NCCLCHECK(ncclCalloc(&sharedRes, 1));
/* most of attributes are assigned later in initTransportsRank(). */
@ -432,6 +442,7 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
bool ccEnable;
cudaStream_t deviceStream;
memset(&tmpCommAndChans, '\0', sizeof(tmpCommAndChans));
NCCLCHECKGOTO(ncclStrongStreamAcquire(ncclCudaGraphNone(), &comm->sharedRes->deviceStream, /*concurrent=*/false, &deviceStream), ret, fail);
NCCLCHECKGOTO(ncclCudaCallocAsync(&devCommAndChans, 1, deviceStream), ret, fail);
ncclCommPushCudaFree(comm, devCommAndChans);
@ -458,22 +469,12 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
if (ccEnable) {
comm->workFifoBytes = 0;
} else {
int64_t workFifoBytesParam = ncclParamWorkFifoBytes();
if (workFifoBytesParam == -1) {
if (comm->MNNVL && (comm->compCap >= 100)) {
// WAR: Disable work fifo for Blackwell all2all hang issue on MNNVL
INFO(NCCL_INIT, "Disabling work fifo");
comm->workFifoBytes = 0;
} else {
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
} else {
if (0 != (workFifoBytesParam & (workFifoBytesParam-1))) {
WARN("NCCL_WORK_FIFO_BYTES=%ld is being ignored because it is not a power of 2.", workFifoBytesParam);
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
comm->workFifoBytes = std::min<uint64_t>(workFifoBytesParam, 1ul<<30);
comm->workFifoBytes = ncclParamWorkFifoBytes();
if (0 != (comm->workFifoBytes & (comm->workFifoBytes-1))) {
WARN("NCCL_WORK_FIFO_BYTES=%d is being ignored because it is not a power of 2.", comm->workFifoBytes);
comm->workFifoBytes = NCCL_WORK_FIFO_BYTES_DEFAULT;
}
comm->workFifoBytes = std::min(comm->workFifoBytes, 1u<<30);
}
if (comm->rank == 0) {
@ -492,11 +493,9 @@ static ncclResult_t devCommSetup(ncclComm_t comm) {
comm->workFifoBufDev = comm->workFifoBuf;
}
NCCLCHECKGOTO(ncclCudaHostCalloc(&comm->workFifoConsumed, MAXCHANNELS), ret, fail);
ncclCommPushCudaHostFree(comm, comm->workFifoConsumed);
comm->workFifoProduced = 0;
comm->workFifoConsumedLeast = 0;
tmpCommAndChans.comm.workConsumed = comm->workFifoConsumed;
comm->workFifoProducedLastRecorded = 0;
comm->workFifoConsumed = 0;
// Alloc profiler counters for the kernel
NCCLCHECKGOTO(ncclCudaHostCalloc(&comm->profiler.workStarted, MAXCHANNELS), ret, fail);
@ -549,6 +548,7 @@ NCCL_PARAM(MNNVLUUID, "MNNVL_UUID", -1);
NCCL_PARAM(MNNVLCliqueId, "MNNVL_CLIQUE_ID", -1);
static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, uint64_t commHash) {
cudaDeviceProp prop;
info->rank = comm->rank;
info->cudaDev = comm->cudaDev;
info->nvmlDev = comm->nvmlDev;
@ -556,6 +556,8 @@ static ncclResult_t fillInfo(struct ncclComm* comm, struct ncclPeerInfo* info, u
info->hostHash=getHostHash()+commHash;
info->pidHash=getPidHash()+commHash;
info->cuMemSupport = ncclCuMemEnable();
CUDACHECK(cudaGetDeviceProperties(&prop, comm->cudaDev));
info->totalGlobalMem = ROUNDUP(prop.totalGlobalMem, (1L << 32));
// Get the device MAJOR:MINOR of /dev/shm so we can use that
// information to decide whether we can use SHM for inter-process
@ -700,6 +702,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
struct ncclTopoRanks topoRanks;
int cpuArch;
int cpuVendor;
int localRanks;
};
int nChannelsOrig;
@ -711,12 +714,14 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
struct ncclProxyConnector proxyConn;
int* pxnPeers = NULL;
int *topParentLocalRanks = NULL;
int p2pLevel = -1;
timers[TIMER_INIT_ALLGATHER] = clockNano();
// AllGather1 - begin
NCCLCHECKGOTO(ncclCalloc(&comm->peerInfo, nranks+1), ret, fail); // Extra rank to represent CollNet root
NCCLCHECKGOTO(fillInfo(comm, comm->peerInfo+rank, comm->commHash), ret, fail);
NCCLCHECKGOTO(bootstrapAllGather(comm->bootstrap, comm->peerInfo, sizeof(struct ncclPeerInfo)), ret, fail);
__atomic_store_n(&comm->peerInfoValid, true, __ATOMIC_RELEASE);
comm->cuMemSupport = 1;
for (int i = 0; i < nranks; i++) {
@ -738,7 +743,8 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
timers[TIMER_INIT_ALLGATHER] = clockNano() - timers[TIMER_INIT_ALLGATHER];
// Check for MNNVL support
if ((nNodes > 1 && ncclParamMNNVLEnable() != 0) || ncclParamMNNVLEnable() == 1) {
NCCLCHECKGOTO(ncclGetUserP2pLevel(&p2pLevel), ret, fail);
if ((nNodes > 1 && ncclParamMNNVLEnable() != 0 && p2pLevel != 0) || ncclParamMNNVLEnable() == 1) {
NCCLCHECKGOTO(ncclMnnvlCheck(comm), ret, fail);
}
@ -829,14 +835,8 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
// Determine local CollNet support
if (collNetSupport(comm)) {
const char *collNetEnable = ncclGetEnv("NCCL_COLLNET_ENABLE");
if (collNetEnable != NULL) {
INFO(NCCL_ALL, "NCCL_COLLNET_ENABLE set by environment to %s.", collNetEnable);
if (strcmp(collNetEnable, "1") == 0) {
comm->collNetSupport = 1;
}
}
if (!collNetSupport(comm)) {
comm->config.collnetEnable = 0;
}
// Determine local Nvls support
@ -873,7 +873,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
collNetDirectGraph->collNet = 1;
collNetDirectGraph->minChannels = 1;
collNetDirectGraph->maxChannels = MAXCHANNELS;
if (comm->collNetSupport) {
if (comm->config.collnetEnable) {
NCCLCHECKGOTO(ncclTopoCompute(comm->topo, collNetChainGraph), ret, fail);
NCCLCHECKGOTO(ncclTopoPrintGraph(comm->topo, collNetChainGraph), ret, fail);
NCCLCHECKGOTO(ncclTopoCompute(comm->topo, collNetDirectGraph), ret, fail);
@ -1014,7 +1014,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
comm->maxTreePattern = std::max(comm->maxTreePattern, allGather3Data[i].graphInfo[NCCL_ALGO_TREE].pattern);
}
if (graphs[NCCL_ALGO_COLLNET_CHAIN]->nChannels == 0) comm->collNetSupport = 0;
if (graphs[NCCL_ALGO_COLLNET_CHAIN]->nChannels == 0) comm->config.collnetEnable = 0;
if (graphs[NCCL_ALGO_NVLS]->nChannels == 0) comm->nvlsSupport = comm->nvlsChannels = 0;
comm->nChannels = treeGraph->nChannels = ringGraph->nChannels = std::min(treeGraph->nChannels, ringGraph->nChannels);
@ -1025,11 +1025,11 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
// Determine CollNet support after all-gather now that we know nNodes and each node localRanks
if (comm->collNetSupport == 1) {
if (comm->config.collnetEnable == 1) {
int collNetNodeThreshold = ncclParamCollNetNodeThreshold();
if (comm->nNodes < collNetNodeThreshold) {
INFO(NCCL_INIT, "Communicator has %d nodes which is less than CollNet node threshold %d, disabling CollNet", comm->nNodes, collNetNodeThreshold);
comm->collNetSupport = 0;
comm->config.collnetEnable = 0;
}
}
NCCLCHECK(ncclTopoPathAllNVLink(comm->topo, &comm->isAllNvlink));
@ -1075,9 +1075,12 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
comm->topParentLocalRanks = topParentLocalRanks;
NCCLCHECKGOTO(ncclTransportCheckP2pType(comm, &comm->intraNodeP2pSupport, &comm->directMode), ret, fail);
// Profiler plugin context has to be initialized before proxy thread
NCCLCHECK(ncclProfilerPluginInit(comm));
NCCLCHECKGOTO(ncclTransportCheckP2pType(comm, &comm->isAllDirectP2p, &comm->directMode), ret, fail);
// Launch proxy service thread, after this, the proxy calls can be used.
if (parent && parent->config.splitShare) {
if (parent && parent->shareResources) {
comm->proxyState = parent->sharedRes->proxyState;
ncclAtomicRefCountIncrement(&parent->sharedRes->proxyState->refCount);
} else {
@ -1147,10 +1150,10 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
for (int c=0; c<comm->nChannels; c++) {
NCCLCHECKGOTO(setupChannel(comm, c, rank, nranks, rings+c*nranks), ret, fail);
}
// Setup NVLS
// Attempt to set up NVLS; this may silently fail and disable NVLS
NCCLCHECKGOTO(ncclNvlsSetup(comm, parent), ret, fail);
// Check if we can setup CollNet
if (comm->collNetSupport > 0) ncclCollNetSetup(comm, parent, graphs);
if (comm->config.collnetEnable) ncclCollNetSetup(comm, parent, graphs);
} else {
for (int c=0; c<comm->nChannels; c++) {
NCCLCHECKGOTO(setupChannel(comm, c, rank, nranks, rings+c*nranks), ret, fail);
@ -1163,7 +1166,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
// Connect PAT only for communicators with 1 GPU per node
if (comm->maxLocalRanks == 1) NCCLCHECKGOTO(ncclTransportPatConnect(comm), ret, fail);
// Setup NVLS
// Attempt to set up NVLS; this may silently fail and disable NVLS
NCCLCHECKGOTO(ncclNvlsSetup(comm, parent), ret, fail);
NCCLCHECKGOTO(ncclNvlsBufferSetup(comm), ret, fail);
@ -1171,7 +1174,7 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
NCCLCHECKGOTO(ncclNvlsTreeConnect(comm), ret, fail);
// Check if we can setup CollNet
if (comm->collNetSupport > 0) {
if (comm->config.collnetEnable) {
ncclCollNetSetup(comm, parent, graphs);
NCCLCHECKGOTO(ncclCollNetChainBufferSetup(comm), ret, fail);
if (comm->maxLocalRanks <= NCCL_MAX_DIRECT_ARITY+1) {
@ -1244,9 +1247,13 @@ static ncclResult_t initTransportsRank(struct ncclComm* comm, struct ncclComm* p
}
}
comm->symmetricSupport = comm->isAllDirectP2p && comm->nNodes == 1 && ncclParamWinEnable() && ncclCuMemEnable();
comm->baseStride = 0;
// Call devCommSetup before the last barrier, making sure we don't have a thread running in front and starting to
// launch NCCL kernels before all cuda mem allocation is complete. That could cause a deadlock.
NCCLCHECKGOTO(devCommSetup(comm), ret, fail);
timers[TIMER_INIT_CONNECT] = clockNano() - timers[TIMER_INIT_CONNECT];
/* Local intra-node barrier */
NCCLCHECKGOTO(bootstrapIntraNodeBarrier(comm->bootstrap, comm->localRankToRank, comm->localRank, comm->localRanks, comm->localRankToRank[0]), ret, fail);
@ -1260,7 +1267,7 @@ exit:
/* If split resource is shared, we are not able to unlink the proxy ops pool here since the child comm can
* attach the proxy ops pool of parent at any time; otherwise, unlink it here to make sure the pool will be
* properly cleaned up. */
if (comm->sharedRes->owner == comm && !comm->config.splitShare && ret == ncclSuccess && !ncclCuMemEnable()) ncclProxyShmUnlink(comm);
if (comm->sharedRes->owner == comm && !comm->shareResources && ret == ncclSuccess && !ncclCuMemEnable()) ncclProxyShmUnlink(comm);
free(allTopoRanks);
free(nodesTreePatterns);
free(nodesFirstRank);
@ -1293,6 +1300,9 @@ struct ncclCommInitRankAsyncJob {
struct ncclComm* parent;
int color, key;
int splitCount;
// For Shrink
int* excludeRanksList;
int excludeRanksCount;
// name of the function calling
char funcName[NCCL_COMMINIT_FUNCNAME_LEN];
};
@ -1303,6 +1313,7 @@ struct ncclCommFinalizeAsyncJob {
};
NCCL_PARAM(CommSplitShareResources, "COMM_SPLIT_SHARE_RESOURCES", NCCL_CONFIG_UNDEF_INT);
NCCL_PARAM(CommShrinkShareResources, "COMM_SHRINK_SHARE_RESOURCES", NCCL_CONFIG_UNDEF_INT);
typedef struct{
int key;
@ -1350,6 +1361,21 @@ fail:
goto exit;
}
static ncclResult_t getParentRanks(int parentRanks, int parentRank, int* excludeRanksList, int excludeRanksCount, int* nRanksRet, int* myRankRet, int* parentRanksRet) {
int count = 0, j = 0;
for (int i = 0; i < parentRanks; i++) {
// we assume excludeRanksList is sorted
if (j < excludeRanksCount && excludeRanksList[j] == i) {
j++;
continue;
}
if (i == parentRank) *myRankRet = count;
parentRanksRet[count++] = i;
}
*nRanksRet = parentRanks - excludeRanksCount;
return ncclSuccess;
}
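
A standalone worked example (hypothetical ranks, copy of the logic above) of the shrink remapping: excluding ranks {1,4} from a 6-rank parent leaves {0,2,3,5}, and parent rank 3 becomes rank 2 in the shrunk communicator.

#include <cassert>

// Standalone copy of the remapping above; excl[] is assumed sorted, as in the original.
static void shrinkMap(int parentRanks, int parentRank, const int* excl, int exclCount,
                      int* nRanksRet, int* myRankRet, int* out) {
  int count = 0, j = 0;
  for (int i = 0; i < parentRanks; i++) {
    if (j < exclCount && excl[j] == i) { j++; continue; }
    if (i == parentRank) *myRankRet = count;
    out[count++] = i;
  }
  *nRanksRet = parentRanks - exclCount;
}

int main() {
  int excl[2] = {1, 4}, out[6], nRanks = 0, myRank = -1;
  shrinkMap(6, 3, excl, 2, &nRanks, &myRank, out);
  assert(nRanks == 4 && myRank == 2 && out[0] == 0 && out[3] == 5);
  return 0;
}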
static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
struct ncclCommInitRankAsyncJob* job = (struct ncclCommInitRankAsyncJob*)job_;
ncclComm_t comm = job->comm;
@ -1383,9 +1409,13 @@ static ncclResult_t ncclCommInitRankFunc(struct ncclAsyncJob* job_) {
if (job->parent) {
NCCLCHECKGOTO(ncclCalloc(&parentRanks, job->parent->nRanks), res, fail);
NCCLCHECKGOTO(commGetSplitInfo(comm, job->parent, job->color, job->key, &job->nranks, &job->myrank, parentRanks), res, fail);
// Negative color does not create a new comm object. We needed to take part in the allgather, but we're done now.
if (job->color == NCCL_SPLIT_NOCOLOR) goto exit;
if (job->excludeRanksCount) {
NCCLCHECKGOTO(getParentRanks(job->parent->nRanks, job->parent->rank, job->excludeRanksList, job->excludeRanksCount, &job->nranks, &job->myrank, parentRanks), res, fail);
} else {
NCCLCHECKGOTO(commGetSplitInfo(comm, job->parent, job->color, job->key, &job->nranks, &job->myrank, parentRanks), res, fail);
// Negative color does not create a new comm object. We needed to take part in the allgather, but we're done now.
if (job->color == NCCL_SPLIT_NOCOLOR) goto exit;
}
timers[TIMER_INIT_ALLOC] = clockNano();
NCCLCHECKGOTO(commAlloc(comm, job->parent, job->nranks, job->myrank), res, fail);
timers[TIMER_INIT_ALLOC] = clockNano() - timers[TIMER_INIT_ALLOC];
@ -1477,6 +1507,10 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
int minCTAsEnv;
int maxCTAsEnv;
int splitShareEnv;
const char* collnetEnableEnv;
int ctaPolicyEnv;
int shrinkShareEnv;
int nvlsCTAsEnv;
/* override configuration from env variable. */
blockingEnv = ncclParamCommBlocking();
@ -1522,6 +1556,31 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
if (splitShareEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.splitShare = splitShareEnv;
}
shrinkShareEnv = ncclParamCommShrinkShareResources();
if (shrinkShareEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.shrinkShare = shrinkShareEnv;
}
// NCCL_COLLNET_ENABLE needs to be reloaded for each comm init
// since users might change the env on the fly to enable/disable collnet
collnetEnableEnv = ncclGetEnv("NCCL_COLLNET_ENABLE");
if (collnetEnableEnv != NULL) {
int collnetEnableInt = (int)strtol(collnetEnableEnv, NULL, 0);
if (collnetEnableInt != NCCL_CONFIG_UNDEF_INT) {
comm->config.collnetEnable = collnetEnableInt;
INFO(NCCL_ENV, "NCCL_COLLNET_ENABLE set by environment to %d.", collnetEnableInt);
}
}
ctaPolicyEnv = ncclParamCtaPolicy();
if (ctaPolicyEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.CTAPolicy = ctaPolicyEnv;
}
nvlsCTAsEnv = ncclParamNvlsChannels();
if (nvlsCTAsEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.nvlsCTAs = nvlsCTAsEnv;
}
/* cap channels if needed */
if (comm->config.minCTAs > MAXCHANNELS) {
@ -1544,6 +1603,20 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
comm->config.splitShare = 0;
}
if (comm->config.collnetEnable != 1 && comm->config.collnetEnable != 0) {
INFO(NCCL_ENV, "collnetEnable %d is not a valid value 0/1, set it to 0", comm->config.collnetEnable);
comm->config.collnetEnable = 0;
}
if (comm->config.CTAPolicy < NCCL_CTA_POLICY_DEFAULT || comm->config.CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY) {
INFO(NCCL_ENV, "CTAPolicy %d is not a valid value, set it to %d", comm->config.CTAPolicy, NCCL_CTA_POLICY_DEFAULT);
comm->config.CTAPolicy = NCCL_CTA_POLICY_DEFAULT;
}
if (comm->config.nvlsCTAs != NCCL_CONFIG_UNDEF_INT && comm->config.nvlsCTAs <= 0) {
INFO(NCCL_ENV, "nvlsCTAs %d is not a valid value, NCCL will decide the default value automatically", comm->config.nvlsCTAs);
comm->config.nvlsCTAs = NCCL_CONFIG_UNDEF_INT;
}
return ret;
}
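The override order is therefore: user-supplied ncclConfig_t first, then these environment variables at communicator creation time. Below is a minimal sketch of driving the new knobs from the environment. NCCL_COLLNET_ENABLE is read verbatim above; NCCL_COMM_SHRINK_SHARE_RESOURCES is an assumption based on the NCCL_PARAM("COMM_SHRINK_SHARE_RESOURCES") declaration and the usual NCCL_ prefix convention.

#include <stdlib.h>
#include <nccl.h>

// Sketch only: set the overrides before the communicator is created, since
// envConfigOverride() runs during init (and NCCL_COLLNET_ENABLE is re-read on every init).
ncclResult_t initWithEnvOverrides(ncclComm_t* comm, int nRanks, ncclUniqueId id, int myRank) {
  setenv("NCCL_COLLNET_ENABLE", "1", 1);               // -> comm->config.collnetEnable = 1
  setenv("NCCL_COMM_SHRINK_SHARE_RESOURCES", "1", 1);  // assumed name -> comm->config.shrinkShare = 1
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;       // fields left at their undefined defaults
  return ncclCommInitRankConfig(comm, nRanks, id, myRank, &config);
}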
@ -1584,6 +1657,17 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
internalConfigPtr->maxCTAs = defaultConfig.maxCTAs;
internalConfigPtr->netName = defaultConfig.netName;
}
if (internalConfigPtr->version < NCCL_VERSION(2, 25, 0)) {
internalConfigPtr->trafficClass = defaultConfig.trafficClass;
}
if (internalConfigPtr->version < NCCL_VERSION(2, 27, 0)) {
internalConfigPtr->collnetEnable = defaultConfig.collnetEnable;
internalConfigPtr->CTAPolicy = defaultConfig.CTAPolicy;
internalConfigPtr->shrinkShare = defaultConfig.shrinkShare;
internalConfigPtr->nvlsCTAs = defaultConfig.nvlsCTAs;
}
}
/* check input config attributes, -1 means user-undefined and we should use default value from NCCL. */
@ -1615,6 +1699,31 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
goto fail;
}
if (internalConfigPtr->collnetEnable != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->collnetEnable < 0 || internalConfigPtr->collnetEnable > 1)) {
WARN("Invalid config collnetEnable attribute value %d", internalConfigPtr->collnetEnable);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->CTAPolicy != NCCL_CONFIG_UNDEF_INT && (internalConfigPtr->CTAPolicy < NCCL_CTA_POLICY_DEFAULT ||
internalConfigPtr->CTAPolicy > NCCL_CTA_POLICY_EFFICIENCY)) {
WARN("Invalid config policy attribute value %d", internalConfigPtr->CTAPolicy);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->shrinkShare != NCCL_CONFIG_UNDEF_INT && internalConfigPtr->shrinkShare != 0 && internalConfigPtr->shrinkShare != 1) {
WARN("Invalid config shrinkShare attribute value %d", internalConfigPtr->shrinkShare);
ret = ncclInvalidArgument;
goto fail;
}
if (internalConfigPtr->nvlsCTAs != NCCL_CONFIG_UNDEF_INT && internalConfigPtr->nvlsCTAs <= 0) {
WARN("Invalid config nvlsCTAs attribute value %d", internalConfigPtr->nvlsCTAs);
ret = ncclInvalidArgument;
goto fail;
}
/* default config value can be tuned on different platform. */
NCCL_CONFIG_DEFAULT(internalConfigPtr, blocking, NCCL_CONFIG_UNDEF_INT, 1, "Blocking", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, cgaClusterSize, NCCL_CONFIG_UNDEF_INT, 4, "CGA cluster size", "%d");
@ -1623,6 +1732,11 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
NCCL_CONFIG_DEFAULT(internalConfigPtr, netName, NCCL_CONFIG_UNDEF_PTR, NULL, "Net name", "%s");
NCCL_CONFIG_DEFAULT(internalConfigPtr, splitShare, NCCL_CONFIG_UNDEF_INT, 0, "Split share", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, trafficClass, NCCL_CONFIG_UNDEF_INT, NCCL_CONFIG_UNDEF_INT, "Traffic class", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, commName, NCCL_CONFIG_UNDEF_PTR, NULL, "Comm name", "%s");
NCCL_CONFIG_DEFAULT(internalConfigPtr, collnetEnable, NCCL_CONFIG_UNDEF_INT, 0, "Collnet enable", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, CTAPolicy, NCCL_CONFIG_UNDEF_INT, NCCL_CTA_POLICY_DEFAULT, "CTA policy flags", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, shrinkShare, NCCL_CONFIG_UNDEF_INT, 0, "shrinkShare", "%d");
NCCL_CONFIG_DEFAULT(internalConfigPtr, nvlsCTAs, NCCL_CONFIG_UNDEF_INT, NCCL_CONFIG_UNDEF_INT, "nvlsCTAs", "%d");
/* assign config to communicator */
comm->config.blocking = internalConfigPtr->blocking;
@ -1632,7 +1746,11 @@ static ncclResult_t parseCommConfig(ncclComm_t comm, ncclConfig_t *config) {
comm->config.netName = internalConfigPtr->netName;
comm->config.splitShare = internalConfigPtr->splitShare;
comm->config.trafficClass = internalConfigPtr->trafficClass;
comm->config.commName = internalConfigPtr->commName;
comm->config.collnetEnable = internalConfigPtr->collnetEnable;
comm->config.CTAPolicy = internalConfigPtr->CTAPolicy;
comm->config.shrinkShare = internalConfigPtr->shrinkShare;
comm->config.nvlsCTAs = internalConfigPtr->nvlsCTAs;
NCCLCHECKGOTO(envConfigOverride(comm), ret, fail);
exit:
@ -1909,7 +2027,7 @@ static ncclResult_t commDestroySync(struct ncclAsyncJob* job_) {
WARN("commDestroySync: comm %p rank %d sync deviceStream error %d\n", comm, comm->rank, ret);
}
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm), ret, fail);
NCCLCHECKGOTO(ncclCommPollEventCallbacks(comm, true), ret, fail);
NCCLCHECKGOTO(ncclCommPollCallbacks(comm, false), ret, fail);
// And keep polling until all graphs referencing us die.
while (comm->localPersistentRefs != 0) {
@ -2074,6 +2192,17 @@ fail:
goto exit;
}
static ncclResult_t setCommAbortFlags(ncclComm_t comm, int value) {
// Set abort flags
if (comm->childAbortFlag != nullptr) {
__atomic_store_n(comm->childAbortFlag, value, __ATOMIC_RELEASE);
__atomic_store_n(comm->childAbortFlagDev, value, __ATOMIC_RELEASE);
}
__atomic_store_n(comm->abortFlag, value, __ATOMIC_RELEASE);
__atomic_store_n(comm->abortFlagDev, value, __ATOMIC_RELEASE);
return ncclSuccess;
}
NCCL_API(ncclResult_t, ncclCommAbort, ncclComm_t comm);
ncclResult_t ncclCommAbort(ncclComm_t comm) {
NVTX3_RANGE(NcclNvtxParamsCommAbort);
@ -2083,12 +2212,7 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
}
NCCLCHECK(ncclGroupStartInternal());
// Ask anything that might still be running on the device to quit
if (comm->childAbortFlag != nullptr) {
__atomic_store_n(comm->childAbortFlag, 1, __ATOMIC_RELEASE);
__atomic_store_n(comm->childAbortFlagDev, 1, __ATOMIC_RELEASE);
}
__atomic_store_n(comm->abortFlag, 1, __ATOMIC_RELEASE);
__atomic_store_n(comm->abortFlagDev, 1, __ATOMIC_RELEASE);
NCCLCHECK(setCommAbortFlags(comm,1));
comm->destroyFlag = 1;
/* init thread must be joined before we destroy the comm,
* and we should ignore the init error here. */
@ -2111,36 +2235,51 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
exit:
ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return ncclSuccess;
return res;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclCommSplit, ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config);
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config) {
static void childCommCleanupJob(void* job) {
struct ncclCommInitRankAsyncJob* initJob = (struct ncclCommInitRankAsyncJob*)job;
if (initJob->excludeRanksList) free(initJob->excludeRanksList);
free(job);
}
// initializing a child communicator (for both split and shrink)
static ncclResult_t ncclCommInitChildComm(ncclComm_t comm, ncclComm_t* newcomm, bool isShrink, int flags, int color, int key, int* excludeRanksList, int excludeRanksCount,
ncclConfig_t* config, const char* caller) {
struct ncclCommInitRankAsyncJob *job = NULL;
struct ncclComm* childComm = NCCL_COMM_NULL;
ncclResult_t res = ncclSuccess;
NVTX3_RANGE(NcclNvtxParamsCommSplit)
int oldDev;
CUDACHECK(cudaGetDevice(&oldDev));
NCCLCHECKGOTO(CommCheck(comm, caller, "comm"), res, exit);
NCCLCHECKGOTO(PtrCheck(newcomm, caller, "newcomm"), res, exit);
if (isShrink) {
NCCLCHECKGOTO(PtrCheck(excludeRanksList, caller, "excludeRanksList"), res, exit);
NCCLCHECKGOTO(excludeRanksCount > 0 ? ncclSuccess : ncclInvalidArgument, res, exit);
// excludeRanksList may not be sorted, need to sort it
qsort(excludeRanksList, excludeRanksCount, sizeof(int), compareInts);
// ranks in excludeRanksList should not call into this function
NCCLCHECKGOTO(bsearch(&comm->rank, excludeRanksList, excludeRanksCount, sizeof(int), compareInts) ? ncclInvalidArgument : ncclSuccess, res, exit);
}
NCCLCHECKGOTO(ncclCommEnsureReady(comm), res, exit);
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), res, exit);
NCCLCHECK(ncclGroupStartInternal());
NCCLCHECKGOTO(CommCheck(comm, "CommSplit", "comm"), res, fail);
NCCLCHECKGOTO(PtrCheck(newcomm, "CommSplit", "newcomm"), res, fail);
NCCLCHECKGOTO(ncclCommEnsureReady(comm), res, fail);
CUDACHECKGOTO(cudaSetDevice(comm->cudaDev), res, fail);
/* *newcomm should be NCCL_COMM_NULL until comm split fully complete. */
*newcomm = NCCL_COMM_NULL;
if (color == NCCL_SPLIT_NOCOLOR) {
if (!isShrink && color == NCCL_SPLIT_NOCOLOR) {
INFO(NCCL_INIT, "Rank %d has color with NCCL_SPLIT_NOCOLOR, not creating a new communicator", comm->rank);
} else {
NCCLCHECKGOTO(ncclCalloc(&childComm, 1), res, fail);
childComm->startMagic = childComm->endMagic = NCCL_MAGIC;
if (comm->config.splitShare) {
// Set the shareResources field; it is used throughout init and must be reset every time.
// If we shrink, we only reuse resources if we shrink in the default mode
comm->shareResources = isShrink ? (!(flags & NCCL_SHRINK_ABORT) && comm->config.shrinkShare) : comm->config.splitShare;
if (comm->shareResources) {
childComm->abortFlag = comm->abortFlag;
childComm->abortFlagDev = comm->abortFlagDev;
childComm->abortFlagRefCount = comm->abortFlagRefCount;
@ -2161,38 +2300,39 @@ ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newc
NCCLCHECKGOTO(parseCommConfig(childComm, config), res, fail);
}
/* start with ncclInProgress and will be changed to ncclSuccess if init succeeds. */
childComm->initState = ncclInProgress;
/* start with ncclInternalError and will be changed to ncclSuccess if init succeeds. */
childComm->initState = ncclInternalError;
}
NCCLCHECKGOTO(ncclCalloc(&job, 1), res, fail);
job->comm = childComm;
job->newcomm = newcomm;
job->parent = comm;
job->splitCount = ++comm->splitCount;
job->color = color;
job->key = key;
if (excludeRanksList) {
// need to copy the list of ranks to exclude because the job is async
job->excludeRanksCount = excludeRanksCount;
NCCLCHECKGOTO(ncclCalloc(&job->excludeRanksList, excludeRanksCount), res, fail);
memcpy(job->excludeRanksList, excludeRanksList, excludeRanksCount * sizeof(int));
} else {
// each split has to lead to a unique comm, so increment the splitCount
job->splitCount = ++comm->splitCount;
job->excludeRanksList = NULL;
}
job->cudaDev = comm->cudaDev;
snprintf(job->funcName, NCCL_COMMINIT_FUNCNAME_LEN, "%s", __func__);
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, ncclCommInitRankFunc, NULL, free, comm), res, fail);
snprintf(job->funcName, NCCL_COMMINIT_FUNCNAME_LEN, "%s", caller);
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, ncclCommInitRankFunc, /*undo=*/NULL, /*destructor=*/childCommCleanupJob, comm), res, fail);
exit:
(void)cudaSetDevice(oldDev);
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
if (res == ncclSuccess && *newcomm) {
NVTX3_RANGE_ADD_PAYLOAD(CommSplit, NcclNvtxParamsCommSplitSchema,
NVTX3_PAYLOAD((*newcomm)->commHash, comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, color, key));
}
return res;
fail:
if (childComm) {
if (!comm->config.splitShare) {
free(childComm->abortFlag);
if (!comm->shareResources) {
if (childComm->abortFlag) free(childComm->abortFlag);
if (childComm->abortFlagDev) ncclCudaHostFree(childComm->abortFlagDev);
free(childComm->abortFlagRefCount);
if (childComm->abortFlagRefCount) free(childComm->abortFlagRefCount);
}
free(childComm);
}
@ -2200,6 +2340,44 @@ fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclCommShrink, ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t* newcomm, ncclConfig_t* config, int shrinkFlags);
ncclResult_t ncclCommShrink(ncclComm_t comm, int* excludeRanksList, int excludeRanksCount, ncclComm_t *newcomm, ncclConfig_t* config, int shrinkFlags) {
NVTX3_RANGE(NcclNvtxParamsCommShrink)
ncclResult_t res = ncclSuccess;
NCCLCHECK(ncclGroupStartInternal());
// Handle the abort (error) mode: set the abort flags, wait for outstanding kernels to complete, then clear the flags to avoid bootstrap issues
if (shrinkFlags & NCCL_SHRINK_ABORT) {
NCCLCHECKGOTO(setCommAbortFlags(comm, 1), res, exit);
NCCLCHECKGOTO(ncclStrongStreamSynchronize(&comm->sharedRes->deviceStream), res, exit);
NCCLCHECKGOTO(setCommAbortFlags(comm, 0), res, exit);
}
NCCLCHECKGOTO(ncclCommInitChildComm(comm, newcomm, /*isShrink=*/true, shrinkFlags, /*color=*/0, /*key=*/comm->rank, excludeRanksList, excludeRanksCount, config, __func__), res, exit);
if (*newcomm) NVTX3_RANGE_ADD_PAYLOAD(CommShrink, NcclNvtxParamsCommShrinkSchema, NVTX3_PAYLOAD(comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, excludeRanksCount));
exit:
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
}
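A minimal usage sketch for the new API follows. The failedRank value is a placeholder, and passing config == NULL is assumed to inherit the parent configuration, as with ncclCommSplit; NCCL_SHRINK_DEFAULT and NCCL_SHRINK_ABORT are the two modes referenced above.

#include <nccl.h>

// Every surviving rank makes the same call with the same exclude list.
ncclResult_t shrinkAroundFailure(ncclComm_t comm, int failedRank, ncclComm_t* shrunkComm) {
  int excludeRanks[1] = { failedRank };
  // Use NCCL_SHRINK_ABORT instead of NCCL_SHRINK_DEFAULT if the parent comm still
  // has outstanding work that must be interrupted first (see the abort-flag
  // handling at the top of ncclCommShrink).
  return ncclCommShrink(comm, excludeRanks, 1, shrunkComm, /*config=*/NULL, NCCL_SHRINK_DEFAULT);
}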
NCCL_API(ncclResult_t, ncclCommSplit, ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config);
ncclResult_t ncclCommSplit(ncclComm_t comm, int color, int key, ncclComm_t *newcomm, ncclConfig_t *config) {
NVTX3_RANGE(NcclNvtxParamsCommSplit)
ncclResult_t res = ncclSuccess;
NCCLCHECK(ncclGroupStartInternal());
NCCLCHECKGOTO(ncclCommInitChildComm(comm, newcomm, /*isShrink=*/false, /*shrink mode=*/NCCL_SHRINK_DEFAULT, color, key, NULL, 0, config, __func__), res, exit);
if (*newcomm)
NVTX3_RANGE_ADD_PAYLOAD(CommSplit, NcclNvtxParamsCommSplitSchema, NVTX3_PAYLOAD((*newcomm)->commHash, comm->commHash, comm->nRanks, comm->rank, comm->cudaDev, color, key));
exit:
(void)ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
}
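For completeness, here is a hedged sketch of splitting with a child-specific config that exercises the fields added in this diff (collnetEnable, CTAPolicy, shrinkShare, nvlsCTAs). The NCCL_CTA_POLICY_EFFICIENCY value is taken from the validation code earlier in this change; everything else uses the public API as declared in nccl.h.

#include <nccl.h>

ncclResult_t splitWithConfig(ncclComm_t parent, int color, int key, ncclComm_t* child) {
  ncclConfig_t config = NCCL_CONFIG_INITIALIZER;  // version field lets parseCommConfig accept the 2.27 fields
  config.collnetEnable = 0;                       // valid values are 0 or 1
  config.CTAPolicy = NCCL_CTA_POLICY_EFFICIENCY;
  config.shrinkShare = 1;                         // lets a later default-mode ncclCommShrink reuse resources
  // config.nvlsCTAs is left at its initializer default so NCCL picks the value automatically.
  return ncclCommSplit(parent, color, key, child, &config);
}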
NCCL_API(const char*, ncclGetErrorString, ncclResult_t code);
const char* ncclGetErrorString(ncclResult_t code) {
switch (code) {
@ -2277,119 +2455,3 @@ ncclResult_t ncclCommUserRank(const ncclComm_t comm, int* rank) {
*rank = comm->rank;
return ncclSuccess;
}
NCCL_API(ncclResult_t, ncclMemAlloc, void **ptr, size_t size);
ncclResult_t ncclMemAlloc(void **ptr, size_t size) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
#if CUDART_VERSION >= 12010
size_t memGran = 0;
CUdevice currentDev;
CUmemAllocationProp memprop = {};
CUmemAccessDesc accessDesc = {};
CUmemGenericAllocationHandle handle;
int cudaDev;
int flag;
int dcnt;
if (ptr == NULL || size == 0) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUDACHECK(cudaGetDevice(&cudaDev));
CUCHECK(cuDeviceGet(&currentDev, cudaDev));
if (ncclCuMemEnable()) {
size_t handleSize = size;
int requestedHandleTypes = CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
// Query device to see if FABRIC handle support is available
flag = 0;
(void) CUPFN(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_HANDLE_TYPE_FABRIC_SUPPORTED, currentDev));
if (flag) requestedHandleTypes |= CU_MEM_HANDLE_TYPE_FABRIC;
memprop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
memprop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
memprop.location.id = currentDev;
// Query device to see if RDMA support is available
flag = 0;
CUCHECK(cuDeviceGetAttribute(&flag, CU_DEVICE_ATTRIBUTE_GPU_DIRECT_RDMA_WITH_CUDA_VMM_SUPPORTED, currentDev));
if (flag) memprop.allocFlags.gpuDirectRDMACapable = 1;
CUCHECK(cuMemGetAllocationGranularity(&memGran, &memprop, CU_MEM_ALLOC_GRANULARITY_RECOMMENDED));
CUDACHECK(cudaGetDeviceCount(&dcnt));
ALIGN_SIZE(handleSize, memGran);
if (requestedHandleTypes & CU_MEM_HANDLE_TYPE_FABRIC) {
/* First try cuMemCreate() with FABRIC handle support and then remove if it fails */
CUresult err = CUPFN(cuMemCreate(&handle, handleSize, &memprop, 0));
if (err == CUDA_ERROR_NOT_PERMITTED || err == CUDA_ERROR_NOT_SUPPORTED) {
requestedHandleTypes &= ~CU_MEM_HANDLE_TYPE_FABRIC;
memprop.requestedHandleTypes = (CUmemAllocationHandleType) requestedHandleTypes;
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
} else {
/* Allocate the physical memory on the device */
CUCHECK(cuMemCreate(&handle, handleSize, &memprop, 0));
}
/* Reserve a virtual address range */
CUCHECK(cuMemAddressReserve((CUdeviceptr*)ptr, handleSize, memGran, 0, 0));
/* Map the virtual address range to the physical allocation */
CUCHECK(cuMemMap((CUdeviceptr)*ptr, handleSize, 0, handle, 0));
/* Now allow RW access to the newly mapped memory */
for (int i = 0; i < dcnt; ++i) {
int p2p = 0;
if (i == cudaDev || ((cudaDeviceCanAccessPeer(&p2p, cudaDev, i) == cudaSuccess) && p2p)) {
accessDesc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
accessDesc.location.id = i;
accessDesc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CUCHECK(cuMemSetAccess((CUdeviceptr)*ptr, handleSize, &accessDesc, 1));
}
if (0 == p2p && i != cudaDev) INFO(NCCL_ALLOC, "P2P not supported between GPU%d and GPU%d", cudaDev, i);
}
goto exit;
}
fallback:
#endif
// Coverity is right to complain that we may pass a NULL ptr to cudaMalloc. That's deliberate though:
// we want CUDA to return an error to the caller.
// coverity[var_deref_model]
CUDACHECKGOTO(cudaMalloc(ptr, size), ret, fail);
exit:
return ret;
fail:
goto exit;
}
NCCL_API(ncclResult_t, ncclMemFree, void *ptr);
ncclResult_t ncclMemFree(void *ptr) {
NVTX3_FUNC_RANGE_IN(nccl_domain);
ncclResult_t ret = ncclSuccess;
int saveDevice;
CUDACHECK(cudaGetDevice(&saveDevice));
#if CUDART_VERSION >= 12010
CUdevice ptrDev = 0;
if (ptr == NULL) goto fallback;
if (ncclCudaLibraryInit() != ncclSuccess) goto fallback;
CUCHECKGOTO(cuPointerGetAttribute((void*)&ptrDev, CU_POINTER_ATTRIBUTE_DEVICE_ORDINAL, (CUdeviceptr)ptr), ret, fail);
CUDACHECKGOTO(cudaSetDevice((int)ptrDev), ret, fail);
if (ncclCuMemEnable()) {
NCCLCHECKGOTO(ncclCuMemFree(ptr), ret, fail);
goto exit;
}
fallback:
#endif
CUDACHECKGOTO(cudaFree(ptr), ret, fail);
exit:
CUDACHECK(cudaSetDevice(saveDevice));
return ret;
fail:
goto exit;
}
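Usage-wise, the two calls above form a pair: buffers obtained from ncclMemAlloc() must be released with ncclMemFree(), which re-selects the owning device before freeing. A minimal sketch (buffer size and element type are arbitrary):

#include <nccl.h>

ncclResult_t allocDeviceBuffer(void** buf, size_t count) {
  // cuMem-backed when available, plain cudaMalloc otherwise (see the fallback above).
  return ncclMemAlloc(buf, count * sizeof(float));
}

ncclResult_t freeDeviceBuffer(void* buf) {
  return ncclMemFree(buf);  // only for pointers obtained from ncclMemAlloc
}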


@ -105,53 +105,53 @@ error:
#endif
}
#define DECLARE_CUDA_PFN(symbol) PFN_##symbol pfn_##symbol = nullptr
#define DECLARE_CUDA_PFN(symbol,version) PFN_##symbol##_v##version pfn_##symbol = nullptr
#if CUDART_VERSION >= 11030
/* CUDA Driver functions loaded with cuGetProcAddress for versioning */
DECLARE_CUDA_PFN(cuDeviceGet);
DECLARE_CUDA_PFN(cuDeviceGetAttribute);
DECLARE_CUDA_PFN(cuGetErrorString);
DECLARE_CUDA_PFN(cuGetErrorName);
DECLARE_CUDA_PFN(cuDeviceGet, 2000);
DECLARE_CUDA_PFN(cuDeviceGetAttribute, 2000);
DECLARE_CUDA_PFN(cuGetErrorString, 6000);
DECLARE_CUDA_PFN(cuGetErrorName, 6000);
/* enqueue.cc */
DECLARE_CUDA_PFN(cuMemGetAddressRange);
DECLARE_CUDA_PFN(cuLaunchKernel);
DECLARE_CUDA_PFN(cuMemGetAddressRange, 3020);
DECLARE_CUDA_PFN(cuLaunchKernel, 4000);
#if CUDA_VERSION >= 11080
DECLARE_CUDA_PFN(cuLaunchKernelEx);
DECLARE_CUDA_PFN(cuLaunchKernelEx, 11060);
#endif
/* proxy.cc */
DECLARE_CUDA_PFN(cuCtxCreate);
DECLARE_CUDA_PFN(cuCtxDestroy);
DECLARE_CUDA_PFN(cuCtxGetCurrent);
DECLARE_CUDA_PFN(cuCtxSetCurrent);
DECLARE_CUDA_PFN(cuCtxGetDevice);
DECLARE_CUDA_PFN(cuCtxCreate, 11040);
DECLARE_CUDA_PFN(cuCtxDestroy, 4000);
DECLARE_CUDA_PFN(cuCtxGetCurrent, 4000);
DECLARE_CUDA_PFN(cuCtxSetCurrent, 4000);
DECLARE_CUDA_PFN(cuCtxGetDevice, 2000);
/* cuMem API support */
DECLARE_CUDA_PFN(cuMemAddressReserve);
DECLARE_CUDA_PFN(cuMemAddressFree);
DECLARE_CUDA_PFN(cuMemCreate);
DECLARE_CUDA_PFN(cuMemGetAllocationGranularity);
DECLARE_CUDA_PFN(cuMemExportToShareableHandle);
DECLARE_CUDA_PFN(cuMemImportFromShareableHandle);
DECLARE_CUDA_PFN(cuMemMap);
DECLARE_CUDA_PFN(cuMemRelease);
DECLARE_CUDA_PFN(cuMemRetainAllocationHandle);
DECLARE_CUDA_PFN(cuMemSetAccess);
DECLARE_CUDA_PFN(cuMemUnmap);
DECLARE_CUDA_PFN(cuMemGetAllocationPropertiesFromHandle);
DECLARE_CUDA_PFN(cuMemAddressReserve, 10020);
DECLARE_CUDA_PFN(cuMemAddressFree, 10020);
DECLARE_CUDA_PFN(cuMemCreate, 10020);
DECLARE_CUDA_PFN(cuMemGetAllocationGranularity, 10020);
DECLARE_CUDA_PFN(cuMemExportToShareableHandle, 10020);
DECLARE_CUDA_PFN(cuMemImportFromShareableHandle, 10020);
DECLARE_CUDA_PFN(cuMemMap, 10020);
DECLARE_CUDA_PFN(cuMemRelease, 10020);
DECLARE_CUDA_PFN(cuMemRetainAllocationHandle, 11000);
DECLARE_CUDA_PFN(cuMemSetAccess, 10020);
DECLARE_CUDA_PFN(cuMemUnmap, 10020);
DECLARE_CUDA_PFN(cuMemGetAllocationPropertiesFromHandle, 10020);
/* ncclMemAlloc/Free */
DECLARE_CUDA_PFN(cuPointerGetAttribute);
DECLARE_CUDA_PFN(cuPointerGetAttribute, 4000);
#if CUDA_VERSION >= 11070
/* transport/collNet.cc/net.cc*/
DECLARE_CUDA_PFN(cuMemGetHandleForAddressRange); // DMA-BUF support
DECLARE_CUDA_PFN(cuMemGetHandleForAddressRange, 11070); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
DECLARE_CUDA_PFN(cuMulticastAddDevice);
DECLARE_CUDA_PFN(cuMulticastBindMem);
DECLARE_CUDA_PFN(cuMulticastBindAddr);
DECLARE_CUDA_PFN(cuMulticastCreate);
DECLARE_CUDA_PFN(cuMulticastGetGranularity);
DECLARE_CUDA_PFN(cuMulticastUnbind);
DECLARE_CUDA_PFN(cuMulticastAddDevice, 12010);
DECLARE_CUDA_PFN(cuMulticastBindMem, 12010);
DECLARE_CUDA_PFN(cuMulticastBindAddr, 12010);
DECLARE_CUDA_PFN(cuMulticastCreate, 12010);
DECLARE_CUDA_PFN(cuMulticastGetGranularity, 12010);
DECLARE_CUDA_PFN(cuMulticastUnbind, 12010);
#endif
#endif
@ -162,8 +162,17 @@ bool ncclCudaLaunchBlocking = false;
#if CUDART_VERSION >= 11030
#if CUDART_VERSION >= 12000
#define LOAD_SYM(symbol, ignore) do { \
#if CUDART_VERSION >= 13000
#define LOAD_SYM(symbol, version, ignore) do { \
cudaDriverEntryPointQueryResult driverStatus = cudaDriverEntryPointSymbolNotFound; \
res = cudaGetDriverEntryPointByVersion(#symbol, (void **) (&pfn_##symbol), version, cudaEnableDefault, &driverStatus); \
if (res != cudaSuccess || driverStatus != cudaDriverEntryPointSuccess) { \
if (!ignore) { \
WARN("Retrieve %s version %d failed with %d status %d", #symbol, version, res, driverStatus); \
return ncclSystemError; } \
} } while(0)
#elif CUDART_VERSION >= 12000
#define LOAD_SYM(symbol, version, ignore) do { \
cudaDriverEntryPointQueryResult driverStatus = cudaDriverEntryPointSymbolNotFound; \
res = cudaGetDriverEntryPoint(#symbol, (void **) (&pfn_##symbol), cudaEnableDefault, &driverStatus); \
if (res != cudaSuccess || driverStatus != cudaDriverEntryPointSuccess) { \
@ -172,7 +181,7 @@ bool ncclCudaLaunchBlocking = false;
return ncclSystemError; } \
} } while(0)
#else
#define LOAD_SYM(symbol, ignore) do { \
#define LOAD_SYM(symbol, version, ignore) do { \
res = cudaGetDriverEntryPoint(#symbol, (void **) (&pfn_##symbol), cudaEnableDefault); \
if (res != cudaSuccess) { \
if (!ignore) { \
@ -188,46 +197,46 @@ static ncclResult_t cudaPfnFuncLoader(void) {
cudaError_t res;
LOAD_SYM(cuGetErrorString, 0);
LOAD_SYM(cuGetErrorName, 0);
LOAD_SYM(cuDeviceGet, 0);
LOAD_SYM(cuDeviceGetAttribute, 0);
LOAD_SYM(cuMemGetAddressRange, 1);
LOAD_SYM(cuCtxCreate, 1);
LOAD_SYM(cuCtxDestroy, 1);
LOAD_SYM(cuCtxGetCurrent, 1);
LOAD_SYM(cuCtxSetCurrent, 1);
LOAD_SYM(cuCtxGetDevice, 1);
LOAD_SYM(cuLaunchKernel, 1);
LOAD_SYM(cuGetErrorString, 6000, 0);
LOAD_SYM(cuGetErrorName, 6000, 0);
LOAD_SYM(cuDeviceGet, 2000, 0);
LOAD_SYM(cuDeviceGetAttribute, 2000, 0);
LOAD_SYM(cuMemGetAddressRange, 3020, 1);
LOAD_SYM(cuCtxCreate, 11040, 1);
LOAD_SYM(cuCtxDestroy, 4000, 1);
LOAD_SYM(cuCtxGetCurrent, 4000, 1);
LOAD_SYM(cuCtxSetCurrent, 4000, 1);
LOAD_SYM(cuCtxGetDevice, 2000, 1);
LOAD_SYM(cuLaunchKernel, 4000, 1);
#if CUDA_VERSION >= 11080
LOAD_SYM(cuLaunchKernelEx, 1);
LOAD_SYM(cuLaunchKernelEx, 11060, 1);
#endif
/* cuMem API support */
LOAD_SYM(cuMemAddressReserve, 1);
LOAD_SYM(cuMemAddressFree, 1);
LOAD_SYM(cuMemCreate, 1);
LOAD_SYM(cuMemGetAllocationGranularity, 1);
LOAD_SYM(cuMemExportToShareableHandle, 1);
LOAD_SYM(cuMemImportFromShareableHandle, 1);
LOAD_SYM(cuMemMap, 1);
LOAD_SYM(cuMemRelease, 1);
LOAD_SYM(cuMemRetainAllocationHandle, 1);
LOAD_SYM(cuMemSetAccess, 1);
LOAD_SYM(cuMemUnmap, 1);
LOAD_SYM(cuMemGetAllocationPropertiesFromHandle, 1);
LOAD_SYM(cuMemAddressReserve, 10020, 1);
LOAD_SYM(cuMemAddressFree, 10020, 1);
LOAD_SYM(cuMemCreate, 10020, 1);
LOAD_SYM(cuMemGetAllocationGranularity, 10020, 1);
LOAD_SYM(cuMemExportToShareableHandle, 10020, 1);
LOAD_SYM(cuMemImportFromShareableHandle, 10020, 1);
LOAD_SYM(cuMemMap, 10020, 1);
LOAD_SYM(cuMemRelease, 10020, 1);
LOAD_SYM(cuMemRetainAllocationHandle, 11000, 1);
LOAD_SYM(cuMemSetAccess, 10020, 1);
LOAD_SYM(cuMemUnmap, 10020, 1);
LOAD_SYM(cuMemGetAllocationPropertiesFromHandle, 10020, 1);
/* ncclMemAlloc/Free */
LOAD_SYM(cuPointerGetAttribute, 1);
LOAD_SYM(cuPointerGetAttribute, 4000, 1);
#if CUDA_VERSION >= 11070
LOAD_SYM(cuMemGetHandleForAddressRange, 1); // DMA-BUF support
LOAD_SYM(cuMemGetHandleForAddressRange, 11070, 1); // DMA-BUF support
#endif
#if CUDA_VERSION >= 12010
/* NVSwitch Multicast support */
LOAD_SYM(cuMulticastAddDevice, 1);
LOAD_SYM(cuMulticastBindMem, 1);
LOAD_SYM(cuMulticastBindAddr, 1);
LOAD_SYM(cuMulticastCreate, 1);
LOAD_SYM(cuMulticastGetGranularity, 1);
LOAD_SYM(cuMulticastUnbind, 1);
LOAD_SYM(cuMulticastAddDevice, 12010, 1);
LOAD_SYM(cuMulticastBindMem, 12010, 1);
LOAD_SYM(cuMulticastBindAddr, 12010, 1);
LOAD_SYM(cuMulticastCreate, 12010, 1);
LOAD_SYM(cuMulticastGetGranularity, 12010, 1);
LOAD_SYM(cuMulticastUnbind, 12010, 1);
#endif
return ncclSuccess;
}


@ -8,7 +8,11 @@
#include <sys/types.h>
#include <unistd.h>
#ifdef NCCL_BUILD_RDMA_CORE
#include <infiniband/verbs.h>
#else
#include "ibvcore.h"
#endif
#include "ibvsymbols.h"
static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
@ -138,8 +142,14 @@ ncclResult_t wrap_ibv_query_device(struct ibv_context *context, struct ibv_devic
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_device, ibv_internal_query_device(context, device_attr), 0, "ibv_query_device");
}
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) { /*returns 0 on success, or the value of errno on failure (which indicates the failure reason)*/
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_port, ibv_internal_query_port(context, port_num, port_attr), 0, "ibv_query_port");
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) {
// First try to query the extended port attributes (e.g. active_speed_ex)
if (ibv_query_port_ex(context, port_num, port_attr) != 0) {
// Fall back to the original attribute API call, but zero all members first
memset(port_attr, 0, sizeof(*port_attr));
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_port, ibv_internal_query_port(context, port_num, port_attr), 0, "ibv_query_port");
}
return ncclSuccess;
}
ncclResult_t wrap_ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid *gid) {

src/misc/mlx5dvsymbols.cc Normal file

@ -0,0 +1,77 @@
#include <sys/types.h>
#include <unistd.h>
#include "mlx5/mlx5dvsymbols.h"
#ifdef NCCL_BUILD_MLX5DV
/* Mlx5dv linking mode. Symbols are pointers to linked MLX5 Direct Verbs */
#define ASSIGN_SYM(container, symbol, name) container->name= &symbol;
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols) {
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_is_supported, mlx5dv_internal_is_supported);
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_get_data_direct_sysfs_path, mlx5dv_internal_get_data_direct_sysfs_path);
ASSIGN_SYM(mlx5dvSymbols, mlx5dv_reg_dmabuf_mr, mlx5dv_internal_reg_dmabuf_mr);
return ncclSuccess;
}
#else
/* Mlx5dv dynamic loading mode. Symbols are loaded from shared objects. */
#include <dlfcn.h>
#include "core.h"
// MLX5DV Library versioning
#define MLX5DV_VERSION "MLX5_1.8"
ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols) {
static void* mlx5dvhandle = NULL;
void* tmp;
void** cast;
mlx5dvhandle=dlopen("libmlx5.so", RTLD_NOW);
if (!mlx5dvhandle) {
mlx5dvhandle=dlopen("libmlx5.so.1", RTLD_NOW);
if (!mlx5dvhandle) {
INFO(NCCL_INIT, "Failed to open libmlx5.so[.1]");
goto teardown;
}
}
#define LOAD_SYM(handle, symbol, funcptr) do { \
cast = (void**)&funcptr; \
tmp = dlvsym(handle, symbol, MLX5DV_VERSION); \
if (tmp == NULL) { \
WARN("dlvsym failed on %s - %s version %s", symbol, dlerror(), MLX5DV_VERSION); \
goto teardown; \
} \
*cast = tmp; \
} while (0)
// Attempt to load a specific symbol version - fail silently
#define LOAD_SYM_VERSION(handle, symbol, funcptr, version) do { \
cast = (void**)&funcptr; \
*cast = dlvsym(handle, symbol, version); \
if (*cast == NULL) { \
INFO(NCCL_NET, "dlvsym failed on %s - %s version %s", symbol, dlerror(), version); \
} \
} while (0)
LOAD_SYM(mlx5dvhandle, "mlx5dv_is_supported", mlx5dvSymbols->mlx5dv_internal_is_supported);
// Cherry-pick the mlx5dv_get_data_direct_sysfs_path API from MLX5 1.25
LOAD_SYM_VERSION(mlx5dvhandle, "mlx5dv_get_data_direct_sysfs_path", mlx5dvSymbols->mlx5dv_internal_get_data_direct_sysfs_path, "MLX5_1.25");
// Cherry-pick the mlx5dv_reg_dmabuf_mr API from MLX5 1.25
LOAD_SYM_VERSION(mlx5dvhandle, "mlx5dv_reg_dmabuf_mr", mlx5dvSymbols->mlx5dv_internal_reg_dmabuf_mr, "MLX5_1.25");
return ncclSuccess;
teardown:
mlx5dvSymbols->mlx5dv_internal_is_supported = NULL;
mlx5dvSymbols->mlx5dv_internal_get_data_direct_sysfs_path = NULL;
mlx5dvSymbols->mlx5dv_internal_reg_dmabuf_mr = NULL;
if (mlx5dvhandle != NULL) dlclose(mlx5dvhandle);
return ncclSystemError;
}
#endif

src/misc/mlx5dvwrap.cc Normal file

@ -0,0 +1,75 @@
/*************************************************************************
* Copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "mlx5/mlx5dvwrap.h"
#include <sys/types.h>
#include <unistd.h>
#ifdef NCCL_BUILD_MLX5DV
#include <infiniband/mlx5dv.h>
#else
#include "mlx5/mlx5dvcore.h"
#endif
#include "mlx5/mlx5dvsymbols.h"
static pthread_once_t initOnceControl = PTHREAD_ONCE_INIT;
static ncclResult_t initResult;
struct ncclMlx5dvSymbols mlx5dvSymbols;
ncclResult_t wrap_mlx5dv_symbols(void) {
pthread_once(&initOnceControl,
[](){ initResult = buildMlx5dvSymbols(&mlx5dvSymbols); });
return initResult;
}
/* CHECK_NOT_NULL: helper macro to check for NULL symbol */
#define CHECK_NOT_NULL(container, internal_name) \
if (container.internal_name == NULL) { \
WARN("lib wrapper not initialized."); \
return ncclInternalError; \
}
#define MLX5DV_PTR_CHECK_ERRNO(container, internal_name, call, retval, error_retval, name) \
CHECK_NOT_NULL(container, internal_name); \
retval = container.call; \
if (retval == error_retval) { \
WARN("Call to " name " failed with error %s", strerror(errno)); \
return ncclSystemError; \
} \
return ncclSuccess;
#define MLX5DV_INT_CHECK_RET_ERRNO(container, internal_name, call, success_retval, name) \
CHECK_NOT_NULL(container, internal_name); \
int ret = container.call; \
if (ret != success_retval) { \
INFO(NCCL_NET, "Call to " name " failed with error %s errno %d", strerror(ret), ret); \
return ncclSystemError; \
} \
return ncclSuccess;
bool wrap_mlx5dv_is_supported(struct ibv_device *device) {
if (mlx5dvSymbols.mlx5dv_internal_is_supported == NULL) {
return 0;
}
return mlx5dvSymbols.mlx5dv_internal_is_supported(device);
}
ncclResult_t wrap_mlx5dv_get_data_direct_sysfs_path(struct ibv_context *context, char *buf, size_t buf_len) {
MLX5DV_INT_CHECK_RET_ERRNO(mlx5dvSymbols, mlx5dv_internal_get_data_direct_sysfs_path, mlx5dv_internal_get_data_direct_sysfs_path(context, buf, buf_len), 0, "mlx5dv_get_data_direct_sysfs_path");
}
/* DMA-BUF support */
ncclResult_t wrap_mlx5dv_reg_dmabuf_mr(struct ibv_mr **ret, struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access) {
MLX5DV_PTR_CHECK_ERRNO(mlx5dvSymbols, mlx5dv_internal_reg_dmabuf_mr, mlx5dv_internal_reg_dmabuf_mr(pd, offset, length, iova, fd, access, mlx5_access), *ret, NULL, "mlx5dv_reg_dmabuf_mr");
}
struct ibv_mr * wrap_direct_mlx5dv_reg_dmabuf_mr(struct ibv_pd *pd, uint64_t offset, size_t length, uint64_t iova, int fd, int access, int mlx5_access) {
if (mlx5dvSymbols.mlx5dv_internal_reg_dmabuf_mr == NULL) {
errno = EOPNOTSUPP; // ncclIbDmaBufSupport() requires this errno being set
return NULL;
}
return mlx5dvSymbols.mlx5dv_internal_reg_dmabuf_mr(pd, offset, length, iova, fd, access, mlx5_access);
}


@ -68,7 +68,8 @@ static ncclResult_t socketProgress(int op, struct ncclSocket* sock, void* ptr, i
return ncclSuccess;
} else {
char line[SOCKET_NAME_MAXLEN+1];
WARN("socketProgress: Connection closed by remote peer %s", ncclSocketToString(&sock->addr, line, 0));
WARN("socketProgress: Connection closed by remote peer %s",
ncclSocketToString(&sock->addr, line, /*numericHostForm*/0));
return ncclRemoteError;
}
}
@ -86,17 +87,22 @@ static ncclResult_t socketWait(int op, struct ncclSocket* sock, void* ptr, int s
* Output: "IPv4/IPv6 address<port>"
*/
const char *ncclSocketToString(const union ncclSocketAddress *addr, char *buf, const int numericHostForm /*= 1*/) {
if (buf == NULL || addr == NULL) return NULL;
const struct sockaddr *saddr = &addr->sa;
if (saddr->sa_family != AF_INET && saddr->sa_family != AF_INET6) { buf[0]='\0'; return buf; }
const struct sockaddr *saddr;
char host[NI_MAXHOST], service[NI_MAXSERV];
int flag = NI_NUMERICSERV | (numericHostForm ? NI_NUMERICHOST : 0);
if (buf == NULL || addr == NULL) goto fail;
saddr = &addr->sa;
if (saddr->sa_family != AF_INET && saddr->sa_family != AF_INET6) goto fail;
/* NI_NUMERICHOST: If set, then the numeric form of the hostname is returned.
* (When not set, this will still happen in case the node's name cannot be determined.)
*/
int flag = NI_NUMERICSERV | (numericHostForm ? NI_NUMERICHOST : 0);
(void) getnameinfo(saddr, sizeof(union ncclSocketAddress), host, NI_MAXHOST, service, NI_MAXSERV, flag);
if (getnameinfo(saddr, sizeof(union ncclSocketAddress), host, NI_MAXHOST, service, NI_MAXSERV, flag)) goto fail;
sprintf(buf, "%s<%s>", host, service);
return buf;
fail:
if (buf)
buf[0] = '\0';
return buf;
}
static uint16_t socketToPort(union ncclSocketAddress *addr) {
@ -120,7 +126,8 @@ static int envSocketFamily(void) {
return family;
}
static int findInterfaces(const char* prefixList, char* names, union ncclSocketAddress *addrs, int sock_family, int maxIfNameSize, int maxIfs) {
static ncclResult_t findInterfaces(const char* prefixList, char* names, union ncclSocketAddress *addrs, int sock_family,
int maxIfNameSize, int maxIfs, int* found) {
#ifdef ENABLE_TRACE
char line[SOCKET_NAME_MAXLEN+1];
#endif
@ -131,10 +138,10 @@ static int findInterfaces(const char* prefixList, char* names, union ncclSocketA
if (searchExact) prefixList++;
int nUserIfs = parseStringList(prefixList, userIfs, MAX_IFS);
int found = 0;
*found = 0;
struct ifaddrs *interfaces, *interface;
getifaddrs(&interfaces);
for (interface = interfaces; interface && found < maxIfs; interface = interface->ifa_next) {
SYSCHECK(getifaddrs(&interfaces), "getifaddrs");
for (interface = interfaces; interface && *found < maxIfs; interface = interface->ifa_next) {
if (interface->ifa_addr == NULL) continue;
/* We only support IPv4 & IPv6 */
@ -162,23 +169,23 @@ static int findInterfaces(const char* prefixList, char* names, union ncclSocketA
// Check that this interface has not already been saved
// getifaddrs() normal order appears to be; IPv4, IPv6 Global, IPv6 Link
bool duplicate = false;
for (int i = 0; i < found; i++) {
for (int i = 0; i < *found; i++) {
if (strcmp(interface->ifa_name, names+i*maxIfNameSize) == 0) { duplicate = true; break; }
}
if (!duplicate) {
// Store the interface name
strncpy(names+found*maxIfNameSize, interface->ifa_name, maxIfNameSize);
strncpy(names + (*found)*maxIfNameSize, interface->ifa_name, maxIfNameSize);
// Store the IP address
int salen = (family == AF_INET) ? sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6);
memset(addrs+found, '\0', sizeof(*addrs));
memcpy(addrs+found, interface->ifa_addr, salen);
found++;
memset(addrs + *found, '\0', sizeof(*addrs));
memcpy(addrs + *found, interface->ifa_addr, salen);
(*found)++;
}
}
freeifaddrs(interfaces);
return found;
return ncclSuccess;
}
static bool matchSubnet(struct ifaddrs local_if, union ncclSocketAddress* remote) {
@ -219,20 +226,21 @@ static bool matchSubnet(struct ifaddrs local_if, union ncclSocketAddress* remote
same &= (local_addr->sin6_scope_id == remote_addr.sin6_scope_id);
return same;
} else {
WARN("Net : Unsupported address family type");
INFO(NCCL_NET, "Net : Unsupported address family type");
return false;
}
}
int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAddrs, union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int maxIfs) {
ncclResult_t ncclFindInterfaceMatchSubnet(char* ifName, union ncclSocketAddress* localAddr,
union ncclSocketAddress* remoteAddr, int ifNameMaxSize, int* found) {
#ifdef ENABLE_TRACE
char line[SOCKET_NAME_MAXLEN+1];
#endif
char line_a[SOCKET_NAME_MAXLEN+1];
int found = 0;
#endif
*found = 0;
struct ifaddrs *interfaces, *interface;
getifaddrs(&interfaces);
for (interface = interfaces; interface && !found; interface = interface->ifa_next) {
SYSCHECK(getifaddrs(&interfaces), "getifaddrs");
for (interface = interfaces; interface && !*found; interface = interface->ifa_next) {
if (interface->ifa_addr == NULL) continue;
/* We only support IPv4 & IPv6 */
@ -247,21 +255,18 @@ int ncclFindInterfaceMatchSubnet(char* ifNames, union ncclSocketAddress* localAd
// Store the local IP address
int salen = (family == AF_INET) ? sizeof(struct sockaddr_in) : sizeof(struct sockaddr_in6);
memcpy(localAddrs+found, interface->ifa_addr, salen);
memcpy(localAddr, interface->ifa_addr, salen);
// Store the interface name
strncpy(ifNames+found*ifNameMaxSize, interface->ifa_name, ifNameMaxSize);
strncpy(ifName, interface->ifa_name, ifNameMaxSize);
TRACE(NCCL_INIT|NCCL_NET,"NET : Found interface %s:%s in the same subnet as remote address %s", interface->ifa_name, ncclSocketToString(localAddrs+found, line), ncclSocketToString(remoteAddr, line_a));
found++;
if (found == maxIfs) break;
TRACE(NCCL_INIT|NCCL_NET,"NET : Found interface %s:%s in the same subnet as remote address %s",
interface->ifa_name, ncclSocketToString(localAddr, line), ncclSocketToString(remoteAddr, line_a));
*found = 1;
}
if (found == 0) {
WARN("Net : No interface found in the same subnet as remote address %s", ncclSocketToString(remoteAddr, line_a));
}
freeifaddrs(interfaces);
return found;
return ncclSuccess;
}
ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char* ip_port_pair) {
@ -344,40 +349,41 @@ ncclResult_t ncclSocketGetAddrFromString(union ncclSocketAddress* ua, const char
return ncclSuccess;
}
int ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs) {
ncclResult_t ncclFindInterfaces(char* ifNames, union ncclSocketAddress *ifAddrs, int ifNameMaxSize, int maxIfs,
int* nIfs) {
static int shownIfName = 0;
int nIfs = 0;
// Allow user to force the INET socket family selection
int sock_family = envSocketFamily();
// User specified interface
const char* env = ncclGetEnv("NCCL_SOCKET_IFNAME");
*nIfs = 0;
if (env && strlen(env) > 1) {
INFO(NCCL_ENV, "NCCL_SOCKET_IFNAME set by environment to %s", env);
// Specified by user : find or fail
if (shownIfName++ == 0) INFO(NCCL_NET, "NCCL_SOCKET_IFNAME set to %s", env);
nIfs = findInterfaces(env, ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
NCCLCHECK(findInterfaces(env, ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
} else {
// Try to automatically pick the right one
// Start with IB
nIfs = findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
NCCLCHECK(findInterfaces("ib", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
// else see if we can get some hint from COMM ID
if (nIfs == 0) {
if (*nIfs == 0) {
const char* commId = ncclGetEnv("NCCL_COMM_ID");
if (commId && strlen(commId) > 1) {
INFO(NCCL_ENV, "NCCL_COMM_ID set by environment to %s", commId);
// Try to find interface that is in the same subnet as the IP in comm id
union ncclSocketAddress idAddr;
ncclSocketGetAddrFromString(&idAddr, commId);
nIfs = ncclFindInterfaceMatchSubnet(ifNames, ifAddrs, &idAddr, ifNameMaxSize, maxIfs);
NCCLCHECK(ncclSocketGetAddrFromString(&idAddr, commId));
NCCLCHECK(ncclFindInterfaceMatchSubnet(ifNames, ifAddrs, &idAddr, ifNameMaxSize, nIfs));
}
}
// Then look for anything else (but not docker or lo)
if (nIfs == 0) nIfs = findInterfaces("^docker,lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (*nIfs == 0) NCCLCHECK(findInterfaces("^docker,lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
// Finally look for docker, then lo.
if (nIfs == 0) nIfs = findInterfaces("docker", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (nIfs == 0) nIfs = findInterfaces("lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs);
if (*nIfs == 0) NCCLCHECK(findInterfaces("docker", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
if (*nIfs == 0) NCCLCHECK(findInterfaces("lo", ifNames, ifAddrs, sock_family, ifNameMaxSize, maxIfs, nIfs));
}
return nIfs;
return ncclSuccess;
}
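Call sites adapt to the new signature by receiving the interface count through an out-parameter and propagating ncclResult_t. A hedged internal-style sketch (assuming the usual internal socket/checks headers are included; the 16-byte name size is an arbitrary choice):

static ncclResult_t pickSocketInterfaces(void) {
  char ifNames[MAX_IFS][16];
  union ncclSocketAddress ifAddrs[MAX_IFS];
  int nIfs = 0;
  NCCLCHECK(ncclFindInterfaces((char*)ifNames, ifAddrs, /*ifNameMaxSize=*/16, MAX_IFS, &nIfs));
  if (nIfs == 0) {
    WARN("NET/Socket : no interface found");
    return ncclInternalError;
  }
  return ncclSuccess;
}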
ncclResult_t ncclSocketListen(struct ncclSocket* sock) {
@ -435,21 +441,25 @@ static ncclResult_t socketTryAccept(struct ncclSocket* sock) {
if (sock->fd != -1) {
sock->state = ncclSocketStateAccepted;
} else if (errno == ENETDOWN || errno == EPROTO || errno == ENOPROTOOPT || errno == EHOSTDOWN ||
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH) {
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH ||
errno == EINTR) {
/* per accept's man page, for Linux sockets the following errors might already be pending errors
* and should be treated as EAGAIN. To avoid an infinite loop in case of errors, we use the retry count*/
if (++sock->errorRetries == ncclParamRetryCnt()) {
WARN("socketTryAccept: exceeded error retry count (%d), %s", sock->errorRetries, strerror(errno));
WARN("socketTryAccept: exceeded error retry count after %d attempts, %s", sock->errorRetries, strerror(errno));
return ncclSystemError;
}
INFO(NCCL_ALL, "Call to accept returned %s, retrying", strerror(errno));
} else if (errno != EAGAIN && errno != EWOULDBLOCK) {
INFO(NCCL_NET|NCCL_INIT, "Call to accept returned %s, retrying", strerror(errno));
} else if (errno != EINTR && errno != EAGAIN && errno != EWOULDBLOCK) {
WARN("socketTryAccept: Accept failed: %s", strerror(errno));
return ncclSystemError;
}
return ncclSuccess;
}
NCCL_PARAM(SocketMaxRecvBuff, "SOCKET_RCVBUF", -1);
NCCL_PARAM(SocketMaxSendBuff, "SOCKET_SNDBUF", -1);
static ncclResult_t socketSetFlags(struct ncclSocket* sock) {
const int one = 1;
/* Set socket as non-blocking if async or if we need to be able to abort */
@ -458,34 +468,55 @@ static ncclResult_t socketSetFlags(struct ncclSocket* sock) {
SYSCHECK(flags = fcntl(sock->fd, F_GETFL), "fcntl");
SYSCHECK(fcntl(sock->fd, F_SETFL, flags | O_NONBLOCK), "fcntl");
}
SYSCHECK(setsockopt(sock->fd, IPPROTO_TCP, TCP_NODELAY, (char*)&one, sizeof(int)), "setsockopt");
SYSCHECK(setsockopt(sock->fd, IPPROTO_TCP, TCP_NODELAY, (char*)&one, sizeof(int)), "setsockopt TCP NODELAY");
// setsockopt should not fail even if the sizes are too large, do not change the default if unset by the user (=-1)
int rcvBuf = ncclParamSocketMaxRecvBuff(), sndBuf = ncclParamSocketMaxSendBuff();
if (sndBuf > 0) SYSCHECK(setsockopt(sock->fd, SOL_SOCKET, SO_SNDBUF, (char*)&sndBuf, sizeof(int)), "setsockopt SO_SNDBUF");
if (rcvBuf > 0) SYSCHECK(setsockopt(sock->fd, SOL_SOCKET, SO_RCVBUF, (char*)&rcvBuf, sizeof(int)), "setsockopt SO_RCVBUF");
return ncclSuccess;
}
static void socketResetAccept(struct ncclSocket* sock) {
char line[SOCKET_NAME_MAXLEN+1];
INFO(NCCL_NET|NCCL_INIT, "socketFinalizeAccept: didn't receive a valid magic from %s",
ncclSocketToString(&sock->addr, line));
// Ignore spurious connection and accept again
(void)close(sock->fd);
sock->fd = -1;
sock->state = ncclSocketStateAccepting;
sock->finalizeCounter = 0;
}
static ncclResult_t socketFinalizeAccept(struct ncclSocket* sock) {
uint64_t magic;
enum ncclSocketType type;
int received;
char line[SOCKET_NAME_MAXLEN+1];
// once accepted, linux sockets do NOT inherit file status flags such as O_NONBLOCK (BSD ones do)
NCCLCHECK(socketSetFlags(sock));
if (sock->asyncFlag == 0 || sock->finalizeCounter < sizeof(magic)) {
if (sock->asyncFlag == 0) {
received = 0;
NCCLCHECK(socketWait(NCCL_SOCKET_RECV, sock, &magic, sizeof(magic), &received));
if (socketWait(NCCL_SOCKET_RECV, sock, &magic, sizeof(magic), &received) != ncclSuccess) {
socketResetAccept(sock);
return ncclSuccess;
}
} else {
int closed = 0;
received = sock->finalizeCounter;
NCCLCHECK(socketProgress(NCCL_SOCKET_RECV, sock, sock->finalizeBuffer, sizeof(magic), &received));
NCCLCHECK(socketProgress(NCCL_SOCKET_RECV, sock, sock->finalizeBuffer, sizeof(magic), &received, &closed));
sock->finalizeCounter = received;
if (received < sizeof(magic)) return ncclSuccess;
if (received < sizeof(magic)) {
if (closed) {
socketResetAccept(sock);
}
return ncclSuccess;
}
memcpy(&magic, sock->finalizeBuffer, sizeof(magic));
}
if (magic != sock->magic) {
WARN("socketFinalizeAccept: wrong magic %lx != %lx", magic, sock->magic);
close(sock->fd);
sock->fd = -1;
// Ignore spurious connection and accept again
sock->state = ncclSocketStateAccepting;
socketResetAccept(sock);
return ncclSuccess;
}
}
@ -500,7 +531,7 @@ static ncclResult_t socketFinalizeAccept(struct ncclSocket* sock) {
memcpy(&type, sock->finalizeBuffer, sizeof(type));
}
if (type != sock->type) {
WARN("socketFinalizeAccept: wrong type %d != %d", type, sock->type);
WARN("socketFinalizeAccept from %s: wrong type %d != %d", ncclSocketToString(&sock->addr, line), type, sock->type);
sock->state = ncclSocketStateError;
close(sock->fd);
sock->fd = -1;
@ -532,32 +563,38 @@ cleanup:
}
goto exit;
}
static ncclResult_t socketConnectCheck(struct ncclSocket* sock, int errCode, const char funcName[]) {
char line[SOCKET_NAME_MAXLEN+1];
if (errCode == 0) {
sock->state = ncclSocketStateConnected;
} else if (errCode == EINPROGRESS) {
sock->state = ncclSocketStateConnectPolling;
} else if (errCode == ETIMEDOUT || errCode == EHOSTUNREACH || errCode == ECONNREFUSED) {
} else if (errCode == EINTR || errCode == EWOULDBLOCK || errCode == EAGAIN || errCode == ETIMEDOUT ||
errCode == EHOSTUNREACH || errCode == ECONNREFUSED) {
if (sock->customRetry == 0) {
if (sock->errorRetries++ == ncclParamRetryCnt()) {
sock->state = ncclSocketStateError;
WARN("%s: connect returned %s, exceeded error retry count (%d)", funcName, strerror(errCode), sock->errorRetries);
WARN("%s: connect to %s returned %s, exceeded error retry count after %d attempts",
funcName, ncclSocketToString(&sock->addr, line), strerror(errCode), sock->errorRetries);
return ncclRemoteError;
}
unsigned int sleepTime = sock->errorRetries * ncclParamRetryTimeOut();
INFO(NCCL_ALL, "%s: connect returned %s, retrying (%d/%ld) after sleep for %u msec", funcName, strerror(errCode), sock->errorRetries, ncclParamRetryCnt(), sleepTime);
INFO(NCCL_NET|NCCL_INIT, "%s: connect to %s returned %s, retrying (%d/%ld) after sleep for %u msec",
funcName, ncclSocketToString(&sock->addr, line), strerror(errCode),
sock->errorRetries, ncclParamRetryCnt(), sleepTime);
msleep(sleepTime);
}
NCCLCHECK(socketResetFd(sock)); /* in case of failure in connect, socket state is unspecified */
sock->state = ncclSocketStateConnecting;
} else {
char line[SOCKET_NAME_MAXLEN+1];
sock->state = ncclSocketStateError;
WARN("%s: Connect to %s failed : %s", funcName, ncclSocketToString(&sock->addr, line), strerror(errCode));
WARN("%s: connect to %s failed : %s", funcName, ncclSocketToString(&sock->addr, line), strerror(errCode));
return ncclSystemError;
}
return ncclSuccess;
}
static ncclResult_t socketStartConnect(struct ncclSocket* sock) {
/* blocking/non-blocking connect() is determined by asyncFlag. */
int ret = connect(sock->fd, &sock->addr.sa, sock->salen);
@ -568,6 +605,7 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
struct pollfd pfd;
int timeout = 1, ret;
socklen_t rlen = sizeof(int);
char line[SOCKET_NAME_MAXLEN+1];
memset(&pfd, 0, sizeof(struct pollfd));
pfd.fd = sock->fd;
@ -577,10 +615,7 @@ static ncclResult_t socketPollConnect(struct ncclSocket* sock) {
if (ret == 0 || (ret < 0 && errno == EINTR)) {
return ncclSuccess;
} else if (ret < 0) {
WARN("socketPollConnect poll() failed with error %s", strerror(errno));
return ncclRemoteError;
} else if (ret != 1 || (pfd.revents & POLLOUT) == 0) {
WARN("socketPollConnect poll() returned %d%s", ret, (pfd.revents & POLLOUT) ? "" : ", no POLLOUT events");
WARN("socketPollConnect to %s failed with error %s", ncclSocketToString(&sock->addr, line), strerror(errno));
return ncclSystemError;
}
@ -899,7 +934,7 @@ ncclResult_t ncclSocketTryRecv(struct ncclSocket* sock, void* ptr, int size, int
ncclResult_t ncclSocketShutdown(struct ncclSocket* sock, int how) {
if (sock != NULL) {
if (sock->fd >= 0) {
shutdown(sock->fd, how);
SYSCHECK(shutdown(sock->fd, how), "shutdown");
}
sock->state = ncclSocketStateTerminating;
}
@ -921,8 +956,8 @@ ncclResult_t ncclSocketClose(struct ncclSocket* sock, bool wait) {
* by refcount of fd, but close() is. close() won't close a fd and send FIN packet if
* the fd is duplicated (e.g. fork()). So shutdown() guarantees the correct and graceful
* connection close here. */
shutdown(sock->fd, SHUT_RDWR);
close(sock->fd);
(void)shutdown(sock->fd, SHUT_RDWR);
(void)close(sock->fd);
}
sock->state = ncclSocketStateClosed;
sock->fd = -1;


@ -9,13 +9,18 @@
#include "checks.h"
#include "param.h"
#if CUDART_VERSION >= 13000
#define cudaStreamGetCaptureInfo_v3 cudaStreamGetCaptureInfo
#define cudaGraphAddDependencies_v2 cudaGraphAddDependencies
#define cudaStreamUpdateCaptureDependencies_v2 cudaStreamUpdateCaptureDependencies
#endif
// Tracks the captured work a given graph captured identified by its graph id.
struct ncclStrongStreamCapture {
struct ncclStrongStreamCapture* next;
cudaGraph_t graph;
unsigned long long graphId;
cudaStream_t captureStream;
cudaGraphNode_t lastRecord;
void* acquiredBy;
};
@ -89,7 +94,11 @@ ncclResult_t ncclCudaGetCapturingGraph(
} else {
#if CUDART_VERSION >= 11030
cudaStreamCaptureStatus status;
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamGetCaptureInfo_v3(stream, &status, &graph->graphId, &graph->graph, nullptr, nullptr, nullptr));
#else
CUDACHECK(cudaStreamGetCaptureInfo_v2(stream, &status, &graph->graphId, &graph->graph, nullptr, nullptr));
#endif
if (status != cudaStreamCaptureStatusActive) {
graph->origin = nullptr;
graph->graph = nullptr;
@ -206,7 +215,6 @@ ncclResult_t ncclStrongStreamAcquire(
CUDACHECKGOTO(cudaStreamCreateWithFlags(&cap->captureStream, cudaStreamNonBlocking), ret, do_unlock);
}
cap->graphId = graph.graphId;
cap->lastRecord = nullptr;
cap->acquiredBy = localThreadId();
// Push to capturing list.
cap->next = ss->captureHead;
@ -224,7 +232,11 @@ ncclResult_t ncclStrongStreamAcquire(
CUDACHECK(cudaEventRecord(scratch, graph.origin));
CUDACHECK(cudaStreamWaitEvent(cap->captureStream, scratch, 0));
CUDACHECK(cudaEventDestroy(scratch));
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(cap->captureStream, nullptr, nullptr, 0, cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(cap->captureStream, nullptr, 0, cudaStreamSetCaptureDependencies));
#endif
if (mixing && firstCapture) {
CUDACHECK(cudaEventRecord(ss->serialEvent, ss->liveStream));
@ -282,17 +294,15 @@ ncclResult_t ncclStrongStreamRelease(
cudaGraphNode_t recordNode;
CUDACHECK(cudaGraphAddEventRecordNode(&recordNode, graph.graph, nullptr, 0, ss->serialEvent));
// Make this record order after previous record on this stream.
if (cap->lastRecord != nullptr) {
CUDACHECK(cudaGraphAddDependencies(graph.graph, &cap->lastRecord, &recordNode, 1));
}
cap->lastRecord = recordNode;
// Get current nodes from work stream so we can add them as dependencies.
cudaStreamCaptureStatus status;
cudaGraphNode_t const* nodes;
size_t count = 0;
#if CUDART_VERSION >= 13000
cudaError_t res = cudaStreamGetCaptureInfo_v3(cap->captureStream, &status, nullptr, nullptr, &nodes, nullptr, &count);
#else
cudaError_t res = cudaStreamGetCaptureInfo_v2(cap->captureStream, &status, nullptr, nullptr, &nodes, &count);
#endif
#if CUDART_VERSION >= 12030
if (res == cudaErrorLossyQuery) { // CUDA is telling us the dependencies have edge annotations.
@ -308,10 +318,30 @@ ncclResult_t ncclStrongStreamRelease(
else {
CUDACHECK(res /* = cudaStreamGetCaptureInfo_v2(...)*/);
for (int i=0; i < (int)count; i++) {
#if CUDART_VERSION >= 13000
CUDACHECK(cudaGraphAddDependencies_v2(graph.graph, &nodes[i], &recordNode, nullptr, 1));
#else
CUDACHECK(cudaGraphAddDependencies(graph.graph, &nodes[i], &recordNode, 1));
#endif
}
}
// Make every future operation captured on cap->captureStream depend on 'recordNode'.
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(
cap->captureStream,
&recordNode, /* dependencies */
/*edges =*/ nullptr, /* no edge annotations */
1, /* count */
cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(
cap->captureStream,
&recordNode,
1,
cudaStreamSetCaptureDependencies));
#endif
if (cap->acquiredBy != localThreadId() && ncclParamLaunchRaceFatal()) {
WARN("%s", launchRaceFatalMsg);
return ncclInvalidUsage;
@ -339,7 +369,11 @@ ncclResult_t ncclStreamAdvanceToEvent(struct ncclCudaGraph g, cudaStream_t s, cu
cudaStreamCaptureStatus status;
cudaGraphNode_t const* nodes;
size_t count = 0;
#if CUDART_VERSION >= 13000
cudaError_t res = cudaStreamGetCaptureInfo_v3(tmp, &status, nullptr, nullptr, &nodes, nullptr, &count);
#else
cudaError_t res = cudaStreamGetCaptureInfo_v2(tmp, &status, nullptr, nullptr, &nodes, &count);
#endif
#if CUDART_VERSION >= 12030
if (res == cudaErrorLossyQuery) { // CUDA is telling us the dependencies have edge annotations.
@ -352,7 +386,11 @@ ncclResult_t ncclStreamAdvanceToEvent(struct ncclCudaGraph g, cudaStream_t s, cu
#endif
else {
CUDACHECK(res /* = cudaStreamGetCaptureInfo_v2(...)*/);
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(s, (cudaGraphNode_t*)nodes, nullptr, count, cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(s, (cudaGraphNode_t*)nodes, count, cudaStreamSetCaptureDependencies));
#endif
}
CUDACHECK(cudaStreamDestroy(tmp));

Some files were not shown because too many files have changed in this diff.