Compare commits

...

6 Commits

Author SHA1 Message Date
cyy b1c424d1a6 Init cmake commit
Signed-off-by: cyy <cyyever@outlook.com>
2025-07-30 16:10:36 +08:00
Kamil Iskra 593de54e52 NCCL 2.27.7-1
Prevent initialization failures in certain configurations when attempting
to load fp8-specific symmetric multicast kernels on GPUs older than
Blackwell.
2025-07-24 10:39:53 -07:00
Stephen Sachs 0d1ece2b43 Exclude ongoing issues from auto-closing logic
- Added a check to skip issues labeled "ongoing" in the close-old-issues script
- Adjusted the condition to compare both creation and update dates against six months ago
2025-07-17 21:50:05 +02:00
Stephen Sachs bfedf2629e Add issues templates and Github action to remove stale issues
We add 3 different issue types (issue/question/RFE) and some predefined
questions to speed up the debugging process.

We also add a custom action which will close all issues created more than 6
months ago which have not been updated for more than a month.
2025-07-16 17:56:12 +02:00
Kamil Iskra 7c12c627c6 NCCL 2.27.6-1
Improve support for DirectNIC (CX8)
* Add support for XDR speed detection.
* When DirectNIC is enabled, report only the RDMA interfaces.

Extend the P2C (PXN over C2C) support to send/receive operations.

Support compilation with GCC 14 (Issues #1743, #1751).

Fix the unloading of network plugins that also provide tuner capability.

Fix the change of the current device across the calls to ncclCommDestroy()
and ncclCommAbort().

A note for users on MNNVL systems: please ensure an adequate stack size for
NCCL threads.  While the default Linux stack size limit of 8192 KB is known
to be sufficient, we've seen crashes if the limit is changed to
"unlimited", as it causes the glibc library to unexpectedly *decrease* the
stack size of NCCL's background threads to just 2048 KB.  Use "ulimit -s"
in bash to print the current limit; if needed, reset it to 8192 KB using
"ulimit -s 8192" (one also needs to ensure that the new setting is
propagated to other nodes when launching a multi-node NCCL job).
2025-07-11 07:32:13 -07:00
Kamil Iskra 3ea7eedf3b NCCL 2.27.5-1
Improvements for GB200 systems
* Optimize the network performance by alternating the direction of the
  rings and the NIC to GPU assignment across communicators to limit
  unnecessary sharing.
* Fix the detection of C2C links in case GPU Direct RDMA is disabled
  between a GPU and a NIC.
* Fix PXN support on MNNVL systems, where NCCL would try (and fail) to
  share regular host memory across multiple nodes.
* Fix P2C (PXN over C2C), which is now preferred over regular PXN.  This
  support is currently preliminary and is disabled by default; use
  NCCL_PXN_C2C=1 to enable.

Further reduce the overheads of CUDA graph capturing, which increased in
NCCL 2.26.2 for large graphs.

Optimize the network performance on DGX B200 systems by adjusting the
bandwidths provided to the graph search algorithm.

Enable fp8 reductions in symmetric kernels on Blackwell with CUDA 12.8.

Restore the plugin name handling logic to make it possible to specify a
path to the plugin (Issue #1732).

Restore the ability to change NCCL_COLLNET_ENABLE during execution
(Issue #1741).

Add an example tuner plugin with CSV-based overrides.

Remove an x86 dependency from the example profiler.
2025-06-18 10:34:47 -07:00
54 changed files with 3622 additions and 214 deletions

.github/ISSUE_TEMPLATE/ISSUE.yaml (new file)

@ -0,0 +1,77 @@
name: NCCL issue or bug
description: Report an issue or failure when running NCCL code
title: "[Issue]: "
labels: ["triage"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for reaching out! Before reporting a new issue, please feel free to search the existing issues for the behavior you are seeing. If you find an issue that is already closed, or you are unsure, open a new issue and reference the old one from it.
        You can also check out the [troubleshooting section](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html) in our user guide.
        ---
        To ensure we can assist you quickly and accurately, we often need the following information:
  - type: dropdown
    id: type
    attributes:
      label: How is this issue impacting you?
      description: What best describes your issue?
      options:
        - Lower performance than expected
        - Application crash
        - Data corruption
        - Application hang
    validations:
      required: true
  - type: textarea
    id: log
    attributes:
      label: Share Your Debug Logs
      description: |
        The logs and topo files are a great tool for pinning down issues. You can create them by setting these environment variables before the run:
        * `NCCL_DEBUG=INFO` and `NCCL_DEBUG_FILE=ncclDebug.%h.%p` to produce one file per rank
        * `NCCL_TOPO_DUMP_FILE=ncclSystem.txt`
  - type: textarea
    id: repro
    attributes:
      label: Steps to Reproduce the Issue
      description: |
        * **Minimal Steps**: Please provide a simple way to recreate the issue (see [Minimal Bug Reports](https://matthewrocklin.com/minimal-bug-reports) for inspiration).
        * **Environment Details**: Include software versions and relevant settings.
        * **Intermittency**: Is this a sporadic issue? If so, how often does it occur?
        * **Previous Success**: Did this work with an older NCCL version?
        The easier it is for us to reproduce the issue on our side, the more likely we are to be able to solve it in a timely manner.
  - type: input
    id: nccl_version
    attributes:
      label: NCCL Version
      description: |
        NCCL reports its version string in the debug logs.
        If you know which library was used, you can also determine the version by running `strings libnccl.so | grep 'NCCL version'`.
      placeholder: "e.g. 2.27.1+cuda12.8"
    validations:
      required: true
  - type: textarea
    id: platform
    attributes:
      label: Your platform details
      description: |
        * **GPU & Network**: Share your architecture and topology (e.g., from `nvidia-smi`, `nvidia-smi topo -m`, `ibstatus`).
        * **Environment**: Bare-metal, containers, or cloud?
        * **Scalability**: Does this issue occur with a specific number of ranks/nodes?
  - type: textarea
    id: issue-description
    attributes:
      label: Error Message & Behavior
      description: |
        * **First Error**: What was the initial `NCCL WARN` message in your logs?
        * **Expected vs. Actual**: Briefly describe the anticipated behavior versus what you're seeing.

.github/ISSUE_TEMPLATE/QUESTION.yaml (new file)

@ -0,0 +1,15 @@
name: NCCL question
description: Ask the NCCL team a question
title: "[Question]: "
labels: ["question"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for reaching out! To solve your problem, feel free to check out the [user guide](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html), in particular the troubleshooting section, and also the [release notes](https://docs.nvidia.com/deeplearning/nccl/release-notes/index.html).
        ---
  - type: textarea
    id: question
    attributes:
      label: Question

.github/ISSUE_TEMPLATE/RFE.yaml (new file)

@ -0,0 +1,22 @@
name: NCCL request for enhancement
description: Request for enhancement
title: "[RFE]: "
labels: ["enhancement"]
body:
  - type: markdown
    attributes:
      value: |
        Thanks for your feedback! Before reporting a new RFE, please take a quick look at whether it already exists among our [existing requests](https://github.com/NVIDIA/nccl/issues?q=sort%3Aupdated-desc%20is%3Aissue%20is%3Aopen%20label%3Aenhancement).
        ---
  - type: textarea
    id: rfe-description
    attributes:
      label: Please provide the details below to ensure we understand your needs
      description: |
        * What is the goal of this request?
        * Who will benefit from this feature?
        * Is this request for a specific GPU architecture or network infrastructure?
        * How will this feature improve current workflows or processes?
        * What is the priority level of this request?

.github/ISSUE_TEMPLATE/config.yml (new file)

@ -0,0 +1 @@
blank_issues_enabled: false

.github/workflows/close-old-issues.js (new file)

@ -0,0 +1,79 @@
const { Octokit } = require("@octokit/rest");

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const owner = process.env.REPO_OWNER;
const repo = process.env.REPO_NAME.split('/').pop(); // Handles owner/repo format

const now = new Date();
const sixMonthsAgo = new Date(now);
sixMonthsAgo.setMonth(now.getMonth() - 6);
const oneMonthAgo = new Date(now);
oneMonthAgo.setMonth(now.getMonth() - 1);

async function closeOldIssues() {
  let page = 1;
  let closedCount = 0;
  // write a multiline comment into a variable:
  let body = `### Issue Cleanup: Helping Us Focus on Current Challenges
We're [reviewing](https://github.com/NVIDIA/nccl/discussions/1761) older issues to ensure we prioritize the most relevant and active ones. Since this issue hasn't seen updates in over 6 months, we'll be closing it for now.
*This change helps us focus our efforts on addressing any current issues our users are facing.* If this issue still affects you, please don't hesitate to reopen it with a quick update (e.g., \"Still relevant on [version=X]\").
Thanks for your understanding and for contributing to NCCL.`;

  while (true) {
    const { data: issues } = await octokit.issues.listForRepo({
      owner,
      repo,
      state: "open",
      per_page: 100,
      page,
    });
    if (issues.length === 0) break;

    for (const issue of issues) {
      // Ignore PRs
      if (issue.pull_request) continue;
      // Ignore issues with label "ongoing"
      if (issue.labels.some(label => label.name === "ongoing")) continue;

      const createdAt = new Date(issue.created_at);
      const updatedAt = new Date(issue.updated_at);
      if (createdAt < sixMonthsAgo && updatedAt < sixMonthsAgo) {
        // Add a comment before closing
        await octokit.issues.createComment({
          owner,
          repo,
          issue_number: issue.number,
          body: body,
        });
        await octokit.issues.update({
          owner,
          repo,
          issue_number: issue.number,
          state: "closed",
          state_reason: "not_planned",
        });
        closedCount++;
        console.log(`Closed issue #${issue.number}`);
        // Break out if we have closed 100 issues
        if (closedCount >= 100) {
          console.log("Closed 100 issues, stopping.");
          return;
        }
      }
    }
    page++;
  }
  console.log(`Total closed: ${closedCount}`);
}

closeOldIssues().catch(console.error);

.github/workflows/close_old_issues.yaml (new file)

@ -0,0 +1,31 @@
name: Close Old Issues

on:
  schedule:
    - cron: '30 2 * * *' # Runs daily at 02:30 UTC
  workflow_dispatch:

permissions:
  issues: write

jobs:
  close-old-issues:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Install dependencies
        run: npm install @octokit/rest@22.0.0
      - name: Run close-old-issues script
        run: node .github/workflows/close-old-issues.js
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_OWNER: ${{ github.repository_owner }}
          REPO_NAME: ${{ github.event.repository.name || github.repository }}

CMakeLists.txt (new file)

@ -0,0 +1,24 @@
cmake_minimum_required(VERSION 4.0)
project(nccl LANGUAGES CUDA CXX VERSION 2.27.7)
option(VERBOSE "VERBOSE" OFF)
option(KEEP "KEEP" OFF)
option(TRACE "TRACE" OFF)
option(PROFAPI "PROFAPI" OFF)
option(NVTX "NVTX" ON)
option(NET_PROFILER "NET_PROFILER" OFF)
set(CMAKE_EXPORT_COMPILE_COMMANDS ON)
add_subdirectory(src)
install(
  TARGETS nccl nccl_static
  EXPORT NCCLConfig
  FILE_SET public_headers
  DESTINATION include)
install(
  EXPORT NCCLConfig
  DESTINATION lib/cmake/nccl
  NAMESPACE NCCL::)

cmake/common.cmake (new file)

@ -0,0 +1,39 @@
function(nccl_add_target_options target)
  target_compile_options(${target} PRIVATE $<$<CONFIG:Debug>:-ggdb3>)
  target_compile_options(${target} PRIVATE $<$<NOT:$<CONFIG:Debug>>:-O3>)
  target_compile_options(
    ${target} PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:--expt-extended-lambda -Xptxas
    -maxrregcount=96 -Xfatbin -compress-all -fPIC>)
  target_compile_options(${target} PRIVATE -fPIC -Wall -Wno-unused-function
                                           -Wno-sign-compare -Wvla)
  set_property(TARGET ${target} PROPERTY CXX_STANDARD 17)
  set_property(TARGET ${target} PROPERTY CUDA_STANDARD 17)
  set_property(TARGET ${target} PROPERTY CXX_VISIBILITY_PRESET hidden)
  set_property(TARGET ${target} PROPERTY VISIBILITY_INLINES_HIDDEN 1)
  set_property(TARGET ${target} PROPERTY CUDA_RESOLVE_DEVICE_SYMBOLS ON)
  if(VERBOSE)
    target_compile_options(${target} PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-Xptxas
                                             -v -Xcompiler -Wall,-Wextra>)
    target_compile_options(${target} PRIVATE -Wall -Wextra)
  endif()
  if(TRACE)
    target_compile_options(${target} PRIVATE ENABLE_TRACE)
  endif()
  if(NOT NVTX)
    target_compile_options(${target} PRIVATE NVTX_DISABLE)
  endif()
  if(KEEP)
    target_compile_options(${target} PRIVATE $<$<COMPILE_LANGUAGE:CUDA>:-keep>)
  endif()
  if(PROFAPI)
    target_compile_options(${target} PRIVATE PROFAPI)
  endif()
  if(NET_PROFILER)
    target_compile_options(${target} PRIVATE NET_PROFILER)
  endif()
endfunction()


@ -3,15 +3,20 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME:=../../build/
CUDA_HOME:=/usr/local/cuda
INC:= -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO:=libnccl-net.so
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
default: $(PLUGIN_SO)
SRC_FILES := $(wildcard *.c)
$(PLUGIN_SO): plugin.c
$(CC) $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
build: ${BUILDDIR}/libnccl-net-example.so
${BUILDDIR}/libnccl-net-example.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/libnccl-net-example.so


@ -3,14 +3,20 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME := ../../build
INC := -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO := libnccl-profiler.so
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
default: $(PLUGIN_SO)
SRC_FILES := $(wildcard *.c)
$(PLUGIN_SO): plugin.c event.c print_event.c
$(CXX) $(INC) -g -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
build: ${BUILDDIR}/libnccl-profiler-example.so
${BUILDDIR}/libnccl-profiler-example.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/libnccl-profiler-example.so


@ -12,7 +12,7 @@
#include <sys/types.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <x86intrin.h>
#include <time.h>
#include "event.h"
#include "print_event.h"
@ -41,22 +41,10 @@ static struct proxyOp* detachPool;
ncclDebugLogger_t logFn;
#define INFO(FLAGS, ...) logFn(NCCL_LOG_INFO, (FLAGS), __func__, __LINE__, __VA_ARGS__)
static double freq = -1;
__hidden void calibrate() {
struct timeval tv;
gettimeofday(&tv, NULL);
uint64_t timeCycles = __rdtsc();
double time = - tv.tv_sec*1e6 - tv.tv_usec;
uint64_t total = 0ULL;
for (int i = 0; i < 10000; i++) total += __rdtsc();
gettimeofday(&tv, NULL);
timeCycles = __rdtsc() - timeCycles;
time += tv.tv_sec*1e6 + tv.tv_usec;
freq = timeCycles / time;
}
__hidden double gettime(void) {
return __rdtsc() / freq;
struct timespec t;
clock_gettime(CLOCK_MONOTONIC, &t);
return (t.tv_sec*1e6 + (t.tv_nsec*1e-3));
}
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
@ -98,8 +86,6 @@ __hidden ncclResult_t exampleProfilerInit(void** context, int* eActivationMask,
// process address space.
pid = getpid();
// calibrate and start timer
calibrate();
startTime = gettime();
}
pthread_mutex_unlock(&lock);

ext-tuner/README.md (new file)

@ -0,0 +1,182 @@
# NCCL Tuner Plugin Development
This directory contains resources and examples for developing NCCL tuner plugins. Tuner plugins allow you to customize NCCL's algorithm and protocol selection behavior to optimize performance for specific workloads and hardware configurations.
## Overview
NCCL tuner plugins provide a way to influence NCCL's automatic algorithm and protocol selection by modifying the cost tables that NCCL uses to make decisions. This allows you to:
- Override default algorithm/protocol combinations for specific collective operations
- Customize tuning based on message size, topology, and other parameters
- Implement sophisticated tuning strategies without recompiling NCCL
- Optimize performance for specific hardware configurations or workloads
## Tuner Plugin Interface
NCCL tuner plugins must implement the `ncclTuner_t` interface defined in `nccl_tuner.h` within `nccl/src/include/plugin`. These definitions have been forked to `tuner.h` in each example plugin, and it is expected that any plugin implementor forks the internal NCCL definitions as well. The current interface includes:
```c
// Initialize the tuner plugin
ncclResult_t (*init)(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context);
// Get and modify collective operation cost information
ncclResult_t (*getCollInfo)(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels);
// Clean up plugin resources
ncclResult_t (*destroy)(void* context);
```
## Development Guidelines
### 1. Plugin Structure
A typical tuner plugin should:
- Include the necessary forked NCCL headers (`tuner.h`)
- Implement all required interface functions
- Export the plugin structure with appropriate version
- Handle all input parameters gracefully
### 2. Cost Table Modification
The `getCollInfo` function receives a cost table that maps algorithm/protocol combinations to performance costs. Lower costs indicate preferred combinations. You can:
- Set costs to `0.0` to make combinations highly preferred
- Set costs to `NCCL_ALGO_PROTO_IGNORE` to disable combinations
- Use relative costs to create preferences between options
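For example, a minimal `getCollInfo` body applying these rules could look like the sketch below (the function name and the specific algorithm/protocol choices are illustrative; the constants come from the forked `tuner.h` each plugin ships in its `nccl/` directory):
```c
#include "tuner.h"

// Illustrative only: prefer RING/SIMPLE and rule out TREE/LL for every call.
ncclResult_t myGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
                           int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                           int regBuff, int* nChannels) {
  float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
  // Stay within the table NCCL handed us.
  if (numAlgo <= NCCL_ALGO_RING || numProto <= NCCL_PROTO_SIMPLE) return ncclSuccess;
  // A cost of 0.0 makes RING/SIMPLE the preferred combination, unless NCCL
  // already marked it as unusable for this call.
  if (table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] != NCCL_ALGO_PROTO_IGNORE)
    table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] = 0.0;
  // Disable the TREE/LL combination entirely for this call.
  table[NCCL_ALGO_TREE][NCCL_PROTO_LL] = NCCL_ALGO_PROTO_IGNORE;
  return ncclSuccess;
}
```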
### 3. Channel Management
The `nChannels` parameter allows you to:
- Set a specific number of channels to use
- Return the original value to preserve NCCL's default behavior
- Implement dynamic channel selection based on message size or topology
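As a sketch of the dynamic case, a hypothetical helper could derive a channel count from the message size (the thresholds and the `-1` "keep NCCL's default" convention below are made up for illustration):
```c
#include <stddef.h>

// Hypothetical helper: pick a channel count from the message size.
// A return value of -1 tells the caller not to touch *nChannels,
// preserving NCCL's default selection.
static int pickChannels(size_t nBytes) {
  if (nBytes >= 4UL * 1024 * 1024) return 8;   // large messages: use more channels
  if (nBytes >= 256UL * 1024)      return 4;   // medium messages
  return -1;                                   // small messages: keep the default
}
```
Inside `getCollInfo`, write to `*nChannels` only when the helper returns a positive value.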
### 4. Error Handling
Always return appropriate `ncclResult_t` values:
- `ncclSuccess` for successful or ignored operations
- `ncclInternalError` for plugin-specific errors. Returning an error is advisable only during plugin initialization and destruction, since the overhead users pay for a failed plugin call on the tuning path can be immense.
- Other NCCL error codes as appropriate
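As a sketch of that advice, a tuning-path callback can fail soft: leave the cost table untouched and still report success, so NCCL simply keeps its default choice (the function name is illustrative and assumes the forked `tuner.h`):
```c
#include "tuner.h"

ncclResult_t cautiousGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
                                 int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                                 int regBuff, int* nChannels) {
  // Unexpected input: do nothing rather than derail the collective.
  if (context == NULL || collCostTable == NULL) return ncclSuccess;
  // ... apply overrides here; if nothing matches, simply fall through ...
  return ncclSuccess;
}
```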
## Getting Started
### Option 1: Start with the Example Plugin
If you're new to tuner plugin development, start with the `example/` directory:
```bash
cd example/
make
```
This provides a CSV-based configuration system that you can customize or use as a template.
## Building and Testing
### Build Requirements
- GCC or compatible C compiler
- NCCL headers (included in `nccl/` subdirectories)
- Make
### Option 2: Use the Basic Plugin
For more customized tuning needs, you might want to start from a clean baseline. In that case, build on the basic plugin in the `basic/` directory:
```bash
cd basic/
make
```
### Build Process
Each plugin directory contains a Makefile:
```bash
cd basic/ # or example/
make
```
This generates a shared library (`.so` file) that can be loaded by NCCL.
### Loading the Plugin
Set the `LD_LIBRARY_PATH` to include your plugin directory:
```bash
export LD_LIBRARY_PATH=/path/to/your/plugin:$LD_LIBRARY_PATH
```
Set `NCCL_TUNER_PLUGIN` to either the plugin name or the absolute path to the plugin file. Any of the following will work:
```bash
export NCCL_TUNER_PLUGIN=example
export NCCL_TUNER_PLUGIN=libnccl-tuner-example.so
export NCCL_TUNER_PLUGIN=/path/to/your/plugin/libnccl-tuner-example.so
```
NCCL will automatically discover and load the plugin based on the exported symbol names.
## Advanced Topics
### Plugin Versioning
NCCL supports multiple plugin interface versions. Make sure your plugin exports the correct version:
```c
const ncclTuner_v4_t ncclTunerPlugin_v4 = {
.name = "YourPluginName",
.init = yourInitFunction,
.getCollInfo = yourGetCollInfoFunction,
.destroy = yourDestroyFunction
};
```
### Multi-GPU and Multi-Node Considerations
Your plugin receives topology information (`nRanks`, `nNodes`) during initialization. Use this to:
- Implement topology-aware tuning strategies
- Handle single-node vs. multi-node optimizations differently
- Scale channel counts based on available hardware
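For instance, a plugin could record the topology passed to `init` and branch on it later; this is a hypothetical sketch (the struct and field names are invented) built on the `ncclTuner_v4_t` signatures:
```c
#include <stdlib.h>
#include "tuner.h"

// Hypothetical context capturing the topology handed to init.
struct topoContext {
  size_t nRanks;
  size_t nNodes;
  int singleNode;   // derived: non-zero when all ranks share one node
};

ncclResult_t topoAwareInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void** context) {
  struct topoContext* ctx = (struct topoContext*)calloc(1, sizeof(*ctx));
  if (ctx == NULL) return ncclSystemError;
  ctx->nRanks = nRanks;
  ctx->nNodes = nNodes;
  ctx->singleNode = (nNodes <= 1);
  *context = ctx;
  return ncclSuccess;
}
```
`getCollInfo` can then consult `ctx->singleNode` (or the raw counts) to apply different overrides for single-node and multi-node communicators, and `destroy` should free the context.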
### Performance Optimization
- Keep plugin logic lightweight to avoid impacting NCCL performance
- Cache expensive computations when possible
- Use the logging system for debugging but avoid excessive output in production
## Debugging and Logging
Use NCCL's debug logging system:
```bash
export NCCL_DEBUG=INFO # General information
export NCCL_DEBUG_SUBSYS=TUNING
```
Within your plugin, use the provided `ncclDebugLogger_t` function for consistent logging.
## Best Practices
1. **Test thoroughly**: Verify your plugin works with various message sizes and topologies
2. **Handle edge cases**: Ensure your plugin behaves correctly with unusual input parameters
3. **Document your approach**: Clearly document your tuning strategy and configuration options
4. **Version your plugin**: Use meaningful version numbers and maintain backward compatibility
5. **Performance validation**: Measure the impact of your tuning decisions on real workloads
## Contributing
When developing new tuner plugins:
- Follow the existing code style and structure
- Include comprehensive documentation
- Add example configurations and test cases
- Consider contributing useful plugins back to the community
## Resources
- [NCCL Documentation](https://docs.nvidia.com/deeplearning/nccl/)
- Example plugin implementations in this directory
For questions and support, refer to the NCCL community resources and documentation.

ext-tuner/basic/Makefile (new file)

@ -0,0 +1,23 @@
#
# Copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
#
# See LICENSE.txt for license information
#
.DEFAULT_GOAL: build
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
SRC_FILES := $(wildcard *.c)
DST_DIR := $(BUILDDIR)/test/unit/plugins
build: ${BUILDDIR}/libnccl-tuner-basic.so
${BUILDDIR}/libnccl-tuner-basic.so: ${SRC_FILES}
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl -fPIC -shared -o $@ $^
clean:
rm -f ${BUILDDIR}/libnccl-tuner-basic.so

ext-tuner/basic/README.md (new file)

@ -0,0 +1,197 @@
# Basic NCCL Tuner Plugin
This directory contains a minimal placeholder implementation of an NCCL tuner plugin. It serves as a starting point for developing custom tuner plugins by providing the essential function stubs and interface structure required by NCCL.
## Purpose
This basic plugin is designed to:
- Provide a minimal working example of the NCCL tuner plugin interface
- Serve as a template for developing custom tuner plugins
- Demonstrate the required function signatures and structure
- Implement placeholder functionality that can be extended
## Implementation Details
The plugin implements the following functions:
### `pluginInit`
```c
ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context)
```
- **Purpose**: Initialize the plugin with communicator information
- **Current Implementation**: Simple placeholder that returns success
- **Parameters**:
- `nRanks`: Total number of ranks in the communicator
- `nNodes`: Total number of nodes in the communicator
- `logFunction`: NCCL debug logging function
- `context`: Plugin context pointer (output)
### `pluginGetCollInfo`
```c
ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels)
```
- **Purpose**: Modify cost tables for collective operations
- **Current Implementation**:
- Sets RING+SIMPLE algorithm to cost 0.0 (highest preference)
- Sets channel count to 1
- **Parameters**:
- `context`: Plugin context from init
- `collType`: Type of collective operation
- `nBytes`: Message size in bytes
- `numPipeOps`: Number of pipeline operations
- `collCostTable`: Cost table to modify
- `numAlgo`: Number of algorithms
- `numProto`: Number of protocols
- `regBuff`: Whether buffer can be registered
- `nChannels`: Number of channels to use (output)
### `pluginDestroy`
```c
ncclResult_t pluginDestroy(void* context)
```
- **Purpose**: Clean up plugin resources
- **Current Implementation**: Simple placeholder that returns success
## Cost Table Structure
The plugin demonstrates how to modify NCCL's cost tables:
```c
float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
```
The cost table is a 2D array where:
- First dimension: Algorithm index (e.g., `NCCL_ALGO_RING`)
- Second dimension: Protocol index (e.g., `NCCL_PROTO_SIMPLE`)
- Values: Cost for that algorithm/protocol combination
### Cost Values
- **0.0**: Highest preference (lowest cost)
- **Positive values**: Relative costs (lower is better)
- **`NCCL_ALGO_PROTO_IGNORE`**: Disable this combination
## Building
```bash
make
```
This creates `libnccl-tuner-basic.so` which can be loaded by NCCL.
## Usage
### Loading the Plugin
```bash
export LD_LIBRARY_PATH=/path/to/basic:$LD_LIBRARY_PATH
mpirun -np 4 your_nccl_application
```
Then set `NCCL_TUNER_PLUGIN` to the plugin name or to an absolute path to the plugin file; any of the following will work:
```bash
export NCCL_TUNER_PLUGIN=basic
export NCCL_TUNER_PLUGIN=libnccl-tuner-basic.so
export NCCL_TUNER_PLUGIN=/path/to/your/plugin/libnccl-tuner-basic.so
```
### Verifying Plugin Loading
Enable NCCL debug output to see if the plugin is loaded:
```bash
export NCCL_DEBUG=INFO
```
You should see messages indicating the tuner plugin is being used.
## Extending the Plugin
This basic plugin provides a foundation that you can extend:
### 1. Add Configuration Logic
Modify `pluginGetCollInfo` to implement your tuning strategy:
```c
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
                                        int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                                        int regBuff, int* nChannels) {
  // Your custom tuning logic here
  float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
  if (nBytes < 1024) {
    // Small message optimization
    table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] = 0.0;
  } else {
    // Large message optimization
    table[NCCL_ALGO_RING][NCCL_PROTO_LL128] = 0.0;
  }
  // Dynamic channel selection
  *nChannels = (nBytes > 1024*1024) ? 4 : 1;
  return ncclSuccess;
}
```
### 2. Add Context Management
Use the context pointer to store plugin state:
```c
struct pluginContext {
  int initialized;
  size_t nRanks;
  size_t nNodes;
  // Add your plugin-specific data here
};
```
### 3. Add File-Based Configuration
Read configuration from files, environment variables, or other sources.
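A minimal sketch, mirroring how the example plugin in `../example` uses `NCCL_TUNER_CONFIG_FILE`:
```c
#include <stdlib.h>

// Pick the config path from the environment, falling back to a default name
// in the current working directory.
static const char* configPath(void) {
  const char* path = getenv("NCCL_TUNER_CONFIG_FILE");
  return path ? path : "nccl_tuner.conf";
}
```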
### 4. Add Topology Awareness
Use the `nRanks` and `nNodes` parameters to implement topology-specific tuning.
## File Structure
```
basic/
├── README.md # This file
├── plugin.c # Plugin implementation
├── Makefile # Build configuration
└── nccl/ # NCCL header files
└── tuner.h # Tuner plugin interface definitions
```
## Next Steps
1. **Understand the Interface**: Study the function signatures and parameters
2. **Implement Your Logic**: Add your tuning strategy to `pluginGetCollInfo`
3. **Test Thoroughly**: Verify your plugin works with different message sizes and topologies
4. **Add Error Handling**: Implement proper error checking and resource management
5. **Document Your Changes**: Update this README with your specific implementation details
## Comparison with Example Plugin
- **Basic Plugin**: Minimal implementation, good for learning and simple use cases
- **Example Plugin**: Full-featured CSV-based configuration system, good for production use
Choose the basic plugin if you want to:
- Learn the tuner plugin interface
- Implement simple, hardcoded tuning strategies
- Build a custom plugin from scratch
Choose the example plugin if you want:
- File-based configuration
- Complex tuning strategies
- Production-ready features
## Resources
- [Parent Directory README](../README.md) - General tuner plugin development guide
- [Example Plugin](../example/README.md) - Fully featured implementation
This basic plugin provides the foundation you need to start developing custom NCCL tuner plugins. Extend it with your specific tuning logic and requirements.


@ -0,0 +1,15 @@
/*************************************************************************
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef COMMON_H_
#define COMMON_H_
typedef enum {NCCL_LOG_NONE=0, NCCL_LOG_VERSION=1, NCCL_LOG_WARN=2, NCCL_LOG_INFO=3, NCCL_LOG_ABORT=4, NCCL_LOG_TRACE=5} ncclDebugLogLevel;
typedef enum {NCCL_INIT=1, NCCL_COLL=2, NCCL_P2P=4, NCCL_SHM=8, NCCL_NET=16, NCCL_GRAPH=32, NCCL_TUNING=64, NCCL_ENV=128, NCCL_ALLOC=256, NCCL_CALL=512, NCCL_PROXY=1024, NCCL_NVLS=2048, NCCL_BOOTSTRAP=4096, NCCL_REG=8192, NCCL_ALL=~0} ncclDebugLogSubSys;
typedef void (*ncclDebugLogger_t)(ncclDebugLogLevel level, unsigned long flags, const char *file, int line, const char *fmt, ...);
#endif


@ -0,0 +1,17 @@
/*
* Copyright (c) 2017-2022, NVIDIA CORPORATION. All rights reserved.
*/
#ifndef NCCL_ERR_H_
#define NCCL_ERR_H_
/* Error type for plugins */
typedef enum { ncclSuccess = 0,
ncclUnhandledCudaError = 1,
ncclSystemError = 2,
ncclInternalError = 3,
ncclInvalidArgument = 4,
ncclInvalidUsage = 5,
ncclRemoteError = 6 } ncclResult_t;
#endif


@ -0,0 +1,97 @@
/*************************************************************************
* Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
* Copyright (c) 2023, Meta Platforms, Inc. and affiliates.
*
* See LICENSE.txt for license information
************************************************************************/
#ifndef NCCL_TUNER_H_
#define NCCL_TUNER_H_
#include <stdint.h>
#include <stdlib.h>
#include "common.h"
#include "err.h"
#define NCCL_NUM_FUNCTIONS 5 // Send/Recv not included for now
typedef enum {
ncclFuncBroadcast = 0,
ncclFuncReduce = 1,
ncclFuncAllGather = 2,
ncclFuncReduceScatter = 3,
ncclFuncAllReduce = 4,
ncclFuncSendRecv = 5,
ncclFuncSend = 6,
ncclFuncRecv = 7,
ncclNumFuncs = 8
} ncclFunc_t;
#define NCCL_NUM_ALGORITHMS 7 // Tree/Ring/CollNet*
#define NCCL_ALGO_UNDEF -1
#define NCCL_ALGO_TREE 0
#define NCCL_ALGO_RING 1
#define NCCL_ALGO_COLLNET_DIRECT 2
#define NCCL_ALGO_COLLNET_CHAIN 3
#define NCCL_ALGO_NVLS 4
#define NCCL_ALGO_NVLS_TREE 5
#define NCCL_ALGO_PAT 6
#define NCCL_NUM_PROTOCOLS 3 // Simple/LL/LL128
#define NCCL_PROTO_UNDEF -1
#define NCCL_PROTO_LL 0
#define NCCL_PROTO_LL128 1
#define NCCL_PROTO_SIMPLE 2
#define NCCL_ALGO_PROTO_IGNORE -1.0
// API to be implemented by external tuner
typedef struct {
// Name of the tuner
const char* name;
// Initializes tuner states.
// Inputs:
// - nRanks: number of ranks in the current communicator. Each communicator initializes its own tuner.
// - nNodes: number of nodes in current communicator.
// - logFunction: a logFunction can be useful to integrate logging together with NCCL core.
// Outputs:
// - context: tuner context object
ncclResult_t (*init)(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context);
// Gets info (algo, protocol, number of ctas and threads) for a given collective.
// Inputs:
// - context: tuner context object
// - collType: collective type, e.g., allreduce, allgather…
// - nBytes: collective size in bytes
// - numPipeOps: number of operations in the group
// - numAlgo: number of algorithms in collCostTable
// - numProto: number of protocols in collCostTable
// - regBuff: can register user buffer
//
// Outputs:
// - nChannels: number of channels (hence SMs) to be used.
//
// InOut:
// - collCostTable: collective cost table, generated by NCCL core, containing algo|proto|time entries for collType.
// NCCL core sets ignored algo/proto cost table entries to -1.0 (NCCL_ALGO_PROTO_IGNORE).
//
// If getCollInfo() does not return ncclSuccess, NCCL will fall back to the
// default tuning for the given collective.
// Also, the plugin may set no outputs at all, or set both the algorithm and
// the protocol, but it may not set only the algorithm or only the protocol.
// Unset fields will be set automatically by NCCL.
ncclResult_t (*getCollInfo)(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels);
// Terminates the plugin and cleans up any resources that the plugin allocated.
// context: tuner context object
ncclResult_t (*destroy)(void* context);
} ncclTuner_v4_t;
typedef ncclTuner_v4_t ncclTuner_t;
#define NCCL_TUNER_PLUGIN_SYMBOL "ncclTunerPlugin_v4"
#endif

ext-tuner/basic/plugin.c (new file)

@ -0,0 +1,34 @@
/*************************************************************************
* Copyright (c) 2015-2019, NVIDIA CORPORATION. All rights reserved.
*
* See LICENSE.txt for license information
************************************************************************/
#include "tuner.h"
#define __hidden __attribute__ ((visibility("hidden")))
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) { return ncclSuccess; }
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
                                        int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                                        int regBuff, int* nChannels) {
  // Update NCCL core generated cost table. Updated table will be evaluated by NCCL to pick the best algo/proto combo
  float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
  if (table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] != NCCL_ALGO_PROTO_IGNORE) {
    table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] = 0.0;
  }
  *nChannels = 1;
  return ncclSuccess;
}

__hidden ncclResult_t pluginDestroy(void* context) { return ncclSuccess; }

#define PLUGIN_NAME "Basic"

const ncclTuner_v4_t ncclTunerPlugin_v4 = {
  .name = PLUGIN_NAME,
  .init = pluginInit,
  .getCollInfo = pluginGetCollInfo,
  .destroy = pluginDestroy
};


@ -3,15 +3,53 @@
#
# See LICENSE.txt for license information
#
NCCL_HOME:=../../build/
CUDA_HOME:=/usr/local/cuda
INC:= -I$(NCCL_HOME)/include -I$(CUDA_HOME)/include -Inccl
PLUGIN_SO:=libnccl-tuner.so
default: $(PLUGIN_SO)
.DEFAULT_GOAL: build
PLUGIN_SO:=libnccl-tuner-example.so
include ../../makefiles/common.mk
SRCDIR ?= $(abspath ../..)
BUILDDIR ?= .
NCCLDIR := $(BUILDDIR)
$(PLUGIN_SO): plugin.c
$(CC) $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
SRC_FILES := $(wildcard *.c)
DST_DIR := $(BUILDDIR)/test/unit/plugins
default: ${BUILDDIR}/$(PLUGIN_SO)
build: ${BUILDDIR}/$(PLUGIN_SO)
${BUILDDIR}/$(PLUGIN_SO): plugin.c
@printf "Compiling %-35s > %s\n" $< $@
@mkdir -p ${BUILDDIR}
$(CC) -Inccl $(INC) -fPIC -shared -o $@ -Wl,-soname,$(PLUGIN_SO) $^
# Test targets - delegate to test directory
test:
$(MAKE) -C test test TEST_CASE=$(TEST_CASE)
test-verbose:
$(MAKE) -C test test-verbose TEST_CASE=$(TEST_CASE)
# Build tests
test-build:
$(MAKE) -C test all
# Optimize configurations from performance data
optimize-config:
@if [ -z "$(CSV_FILE)" ]; then \
echo "Usage: make optimize-config CSV_FILE=path/to/data.csv [OUTPUT=config.conf] [METRIC=latency_us]"; \
echo "Example: make optimize-config CSV_FILE=scripts/sample_performance_data.csv"; \
exit 1; \
fi
python3 scripts/optimize_config.py $(CSV_FILE) \
$(if $(OUTPUT),-o $(OUTPUT)) \
$(if $(METRIC),-m $(METRIC)) \
$(if $(SIZE_RANGES),--size-ranges $(SIZE_RANGES)) \
$(if $(DRY_RUN),--dry-run) \
$(if $(NO_HEADER),--no-header)
clean:
rm -f $(PLUGIN_SO)
rm -f ${BUILDDIR}/$(PLUGIN_SO)
$(MAKE) -C test clean
.PHONY: test test-verbose test-build optimize-config clean

ext-tuner/example/README.md (new file)

@ -0,0 +1,163 @@
# NCCL Example Tuner Plugin
This example plugin shows a practical example of a CSV file-based tuning approach, allowing selective overrides for tuning parameters based on all tuning inputs without recompiling.
## Features
- **File-based Configuration**: Read tuning parameters from a CSV configuration file
- **Size-based Tuning**: Specify different configurations based on message size ranges
- **Dimension-aware Tuning**: Match configurations based on number of nodes and ranks
- **Optional Channels Configuration**: Set specific channel counts or use -1 to keep NCCL's default
- **Environment Variable Support**: Specify config file location via `NCCL_TUNER_CONFIG_FILE`
- **Fallback Behavior**: Gracefully handles missing config files and invalid entries
## Building
```bash
make
```
This will create `libnccl-tuner-example.so` that can be loaded by NCCL.
## Configuration File Format
The configuration file uses CSV (Comma-Separated Values) format with one configuration per line:
```
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
```
### Parameters
- **collective_type**: The collective operation type
- `broadcast`, `reduce`, `allgather`, `reducescatter`, `allreduce`
- **min_bytes/max_bytes**: The message size range (in bytes) for which this config applies
- Use `0` for minimum and `4294967295` for maximum (covers all sizes)
- **algorithm**: The NCCL algorithm to use
- `tree`, `ring`, `collnet_direct`, `collnet_chain`, `nvls`, `nvls_tree`, `pat`
- **protocol**: The NCCL protocol to use
- `ll`, `ll128`, `simple`
- **channels**: Number of channels (SMs) to use
- Use a positive integer to specify exact channel count
- Use `-1` to keep NCCL's default channel selection
- **nNodes**: Number of nodes to match
- Use a positive integer to match specific node count
- Use `-1` to match any number of nodes
- **nRanks**: Number of ranks to match
- Use a positive integer to match specific rank count
- Use `-1` to match any number of ranks
- **numPipeOps**: Number of pipeline operations to match (optional)
- Use a positive integer to match specific pipeline operation count
- Use `-1` to match any number of pipeline operations
- If omitted, configuration will match any numPipeOps value
- **regBuff**: Whether user buffer can be registered (optional)
- Use `0` to match only non-registered buffers
- Use `1` to match only registered buffers
- Use `-1` to match either registered or non-registered buffers
- If omitted, configuration will match any regBuff value
### Example Configuration
```csv
# Single-node, small allreduce: use tree algorithm, registered buffers only
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# 4-node, 32-rank setup: medium allreduce, single pipeline op, non-registered buffers
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# Any topology: large allreduce with LL128, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Single-node broadcast: prefer tree, any pipeOps, any buffer type (8-field backward-compatible form)
broadcast,0,32768,tree,simple,-1,1,-1
# Multi-node broadcast: optimized for non-registered buffers, single pipeline op
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
```
Comments start with `#` and empty lines are ignored. The CSV format makes it easy to edit configurations in spreadsheet applications like Excel, Google Sheets, or LibreOffice Calc.
### Backward Compatibility
Configurations without the numPipeOps and/or regBuff parameters are fully supported:
- 8 fields: matches any numPipeOps and regBuff values
- 9 fields: matches any regBuff value
- 10 fields: full parameter specification
This ensures existing configuration files continue to work without modification.
## Usage
### Method 1: Default Config File
Place your configuration in `nccl_tuner.conf` in the current working directory.
### Method 2: Environment Variable
Set the `NCCL_TUNER_CONFIG_FILE` environment variable to specify the config file path:
```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/your/tuner.conf
mpirun -np 4 your_nccl_application
```
## Editing Configuration Files
### Generating Configuration Files from Raw Data
A Python script to generate valid CSV configs is provided; see [Using optimize_config.py](scripts/README.md).
### Spreadsheet Tips:
- Use column headers: `collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff`
- Save as CSV format (not Excel format) for the plugin to read
- Use data validation to prevent typos in algorithm/protocol names
## Logging
The plugin uses NCCL's logging system. To see tuner-related messages:
```bash
export NCCL_DEBUG=INFO
```
This will show when configurations are loaded and applied, including the topology information.
For detailed debugging output during tuning decisions:
```bash
export NCCL_DEBUG=TRACE
```
This will show verbose information about which configurations are being evaluated and matched.
## Dimension Matching
Configurations are only applied when the topology matches:
- **Exact Match**: Configuration specifies `nNodes=4,nRanks=32`, only applied when communicator has exactly 4 nodes and 32 ranks
- **Wildcard Nodes**: Configuration specifies `nNodes=-1,nRanks=8`, applied to any topology with exactly 8 ranks
- **Wildcard Ranks**: Configuration specifies `nNodes=2,nRanks=-1`, applied to any 2-node topology regardless of ranks per node
- **Wildcard Both**: Configuration specifies `nNodes=-1,nRanks=-1`, applied to any topology
This allows you to create specialized configurations for different cluster setups while maintaining flexibility.
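In code, the topology part of the match reduces to a small predicate; the sketch below keeps only the two topology fields (the example plugin's `TuningConfig` in `plugin.c` carries them alongside the size, collective-type, pipeline-op, and buffer-registration fields):
```c
#include <stddef.h>

// Condensed sketch: -1 in a configuration field acts as a wildcard.
typedef struct {
  int nNodes;   // -1 matches any number of nodes
  int nRanks;   // -1 matches any number of ranks
} TopoMatch;

static int topologyMatches(const TopoMatch* cfg, size_t nNodes, size_t nRanks) {
  return (cfg->nNodes == -1 || cfg->nNodes == (int)nNodes) &&
         (cfg->nRanks == -1 || cfg->nRanks == (int)nRanks);
}
```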
## Default Behavior
If no configuration file is found or no matching configuration exists for a collective operation, the plugin falls back to preferring the ring algorithm with simple protocol. All configured algorithm/protocol combinations are given a low cost (0.0) to make them preferred by NCCL's selection logic.
When channels is set to `-1`, NCCL's default channel selection logic is preserved, allowing the system to automatically determine the optimal number of channels based on hardware and message size.
## Troubleshooting
1. **Config file not found**: Check the file path and permissions
2. **Configurations not applied**: Verify the collective type, size ranges, algorithm/protocol names, and topology parameters
3. **Plugin not loaded**: Ensure `LD_LIBRARY_PATH` includes the plugin directory and that `NCCL_TUNER_PLUGIN` either specifies the plugin name, or an absolute path to the plugin shared library.
4. **No effect on performance**: Check that NCCL is actually using the tuner plugin with `NCCL_DEBUG=INFO`
5. **Topology mismatch**: Verify that nNodes and nRanks match your actual setup, or use -1 for wildcards
6. **CSV parsing errors**: Ensure no spaces after commas, or quote fields containing spaces


@ -0,0 +1,45 @@
# NCCL Tuner Configuration File (CSV Format)
# Format: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
#
# Collective types: broadcast, reduce, allgather, reducescatter, allreduce
# Algorithms: tree, ring, collnet_direct, collnet_chain, nvls, nvls_tree, pat
# Protocols: ll, ll128, simple
# Channels: number of channels to use, or -1 to keep default
# nNodes: number of nodes to match, or -1 for any number of nodes
# nRanks: number of ranks to match, or -1 for any number of ranks
# numPipeOps: number of pipeline operations to match, or -1 for any number (optional)
# regBuff: whether user buffer can be registered (0=no, 1=yes, -1=any) (optional)
#
# Note: numPipeOps and regBuff parameters are optional - configurations without them will match any value
#
# Examples:
# For single-node configurations with registered buffers
# Small allreduce operations on single node - use tree algorithm, registered buffers
allreduce,0,65536,tree,simple,2,1,-1,-1,1
# For multi-node configurations with 4 nodes, 32 total ranks, single pipeline op, non-registered buffers
# Medium allreduce operations - use ring algorithm
allreduce,65537,1048576,ring,simple,4,4,32,1,0
# For any topology - large allreduce operations with LL128 protocol, multiple pipeline ops, any buffer type
allreduce,1048577,4294967295,ring,ll128,-1,-1,-1,4,-1
# Broadcast operations - different configs for different topologies, pipeline complexity, and buffer types
# Single node broadcast - prefer tree, any pipeOps, registered buffers only
broadcast,0,32768,tree,simple,-1,1,-1,-1,1
# Multi-node broadcast with single pipeline operation, non-registered buffers - use ring
broadcast,32769,4294967295,ring,simple,2,-1,-1,1,0
# AllGather operations - optimized for 2-node configurations, any pipeOps, any buffer type
allgather,0,4294967295,ring,simple,4,2,-1
# ReduceScatter operations
# Small messages on single node, single pipeline op, registered buffers
reducescatter,0,131072,tree,simple,2,1,-1,1,1
# Large messages on any topology, multiple pipeline ops, non-registered buffers
reducescatter,131073,4294967295,ring,simple,-1,-1,-1,2,0
# Reduce operations - any topology, keep default channels, any pipeOps, any buffer type
reduce,0,4294967295,tree,simple,-1,-1,-1


@ -5,24 +5,443 @@
************************************************************************/
#include "tuner.h"
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#define __hidden __attribute__ ((visibility("hidden")))
#define MAX_LINE_LENGTH 256
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) { return ncclSuccess; }
// CSV field indices for configuration parsing
// Format: colltype,minbytes,maxbytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
#define CONFIG_FIELD_COLLTYPE 0
#define CONFIG_FIELD_MINBYTES 1
#define CONFIG_FIELD_MAXBYTES 2
#define CONFIG_FIELD_ALGORITHM 3
#define CONFIG_FIELD_PROTOCOL 4
#define CONFIG_FIELD_CHANNELS 5
#define CONFIG_FIELD_NNODES 6
#define CONFIG_FIELD_NRANKS 7
#define CONFIG_FIELD_PIPEOPS 8 // Optional field
#define CONFIG_FIELD_REGBUFF 9 // Optional field
// Field count constants
#define CONFIG_FIELDS_REQUIRED 8 // Minimum required fields (up to nRanks)
#define CONFIG_FIELDS_WITH_PIPEOPS 9 // Fields including numPipeOps
#define CONFIG_FIELDS_WITH_REGBUFF 10 // Fields including both numPipeOps and regBuff
#define CONFIG_FIELDS_MAX 10 // Maximum number of fields supported
typedef struct {
ncclFunc_t collType;
size_t minBytes;
size_t maxBytes;
int algorithm;
int protocol;
int nChannels;
int nNodes;
int nRanks;
int numPipeOps;
int regBuff;
} TuningConfig;
typedef struct {
TuningConfig* configs; // Changed from static array to dynamic pointer
int numConfigs;
int maxConfigs; // Added to track allocated size
size_t nRanks;
size_t nNodes;
ncclDebugLogger_t logFunction;
} TunerContext;
// Parse collective type from string
static ncclFunc_t parseCollType(const char* str) {
if (strcmp(str, "broadcast") == 0) return ncclFuncBroadcast;
if (strcmp(str, "reduce") == 0) return ncclFuncReduce;
if (strcmp(str, "allgather") == 0) return ncclFuncAllGather;
if (strcmp(str, "reducescatter") == 0) return ncclFuncReduceScatter;
if (strcmp(str, "allreduce") == 0) return ncclFuncAllReduce;
return ncclFuncAllReduce; // default
}
// Convert collective type to string
static const char* collTypeToString(ncclFunc_t collType) {
switch (collType) {
case ncclFuncBroadcast: return "broadcast";
case ncclFuncReduce: return "reduce";
case ncclFuncAllGather: return "allgather";
case ncclFuncReduceScatter: return "reducescatter";
case ncclFuncAllReduce: return "allreduce";
default: return "unknown";
}
}
// Parse algorithm from string
static int parseAlgorithm(const char* str) {
if (strcmp(str, "tree") == 0) return NCCL_ALGO_TREE;
if (strcmp(str, "ring") == 0) return NCCL_ALGO_RING;
if (strcmp(str, "collnet_direct") == 0) return NCCL_ALGO_COLLNET_DIRECT;
if (strcmp(str, "collnet_chain") == 0) return NCCL_ALGO_COLLNET_CHAIN;
if (strcmp(str, "nvls") == 0) return NCCL_ALGO_NVLS;
if (strcmp(str, "nvls_tree") == 0) return NCCL_ALGO_NVLS_TREE;
if (strcmp(str, "pat") == 0) return NCCL_ALGO_PAT;
return NCCL_ALGO_RING; // default
}
// Convert algorithm to string
static const char* algorithmToString(int algorithm) {
switch (algorithm) {
case NCCL_ALGO_TREE: return "tree";
case NCCL_ALGO_RING: return "ring";
case NCCL_ALGO_COLLNET_DIRECT: return "collnet_direct";
case NCCL_ALGO_COLLNET_CHAIN: return "collnet_chain";
case NCCL_ALGO_NVLS: return "nvls";
case NCCL_ALGO_NVLS_TREE: return "nvls_tree";
case NCCL_ALGO_PAT: return "pat";
default: return "unknown";
}
}
// Parse protocol from string
static int parseProtocol(const char* str) {
if (strcmp(str, "ll") == 0) return NCCL_PROTO_LL;
if (strcmp(str, "ll128") == 0) return NCCL_PROTO_LL128;
if (strcmp(str, "simple") == 0) return NCCL_PROTO_SIMPLE;
return NCCL_PROTO_SIMPLE; // default
}
// Convert protocol to string
static const char* protocolToString(int protocol) {
switch (protocol) {
case NCCL_PROTO_LL: return "ll";
case NCCL_PROTO_LL128: return "ll128";
case NCCL_PROTO_SIMPLE: return "simple";
default: return "unknown";
}
}
// Helper function to count valid configuration lines in file
static int countConfigLines(const char* filename) {
FILE* file = fopen(filename, "r");
if (!file) {
return 0;
}
char line[MAX_LINE_LENGTH];
int count = 0;
while (fgets(line, sizeof(line), file)) {
// Skip comments and empty lines
if (line[0] == '#' || line[0] == '\n') continue;
// Remove trailing newline
line[strcspn(line, "\n")] = 0;
// Check if line has content
if (strlen(line) > 0) {
count++;
}
}
fclose(file);
return count;
}
// Load configuration from file
static ncclResult_t loadConfig(TunerContext* ctx, const char* filename) {
FILE* file = fopen(filename, "r");
if (!file) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config file %s not found, using defaults", filename);
}
return ncclSuccess; // Not finding config file is not an error
}
// First pass: count valid configuration lines
int configCount = countConfigLines(filename);
if (configCount == 0) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: No valid configurations found in %s", filename);
}
fclose(file);
return ncclSuccess;
}
// Allocate memory for configurations based on actual count
ctx->configs = (TuningConfig*)malloc(configCount * sizeof(TuningConfig));
if (!ctx->configs) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Failed to allocate memory for %d configurations", configCount);
}
fclose(file);
return ncclSystemError;
}
ctx->maxConfigs = configCount;
ctx->numConfigs = 0;
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Allocated memory for %d configurations", configCount);
}
// Reset file pointer to beginning
fseek(file, 0, SEEK_SET);
char line[MAX_LINE_LENGTH];
while (fgets(line, sizeof(line), file) && ctx->numConfigs < ctx->maxConfigs) {
// Skip comments and empty lines
if (line[0] == '#' || line[0] == '\n') continue;
// Remove trailing newline
line[strcspn(line, "\n")] = 0;
// Parse CSV format: colltype,minbytes,maxbytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
char* token;
char* tokens[CONFIG_FIELDS_MAX];
int tokenCount = 0;
// Make a copy of the line for tokenizing
char lineCopy[MAX_LINE_LENGTH];
strncpy(lineCopy, line, sizeof(lineCopy));
lineCopy[sizeof(lineCopy) - 1] = '\0';
// Tokenize by comma
token = strtok(lineCopy, ",");
while (token != NULL && tokenCount < CONFIG_FIELDS_MAX) {
// Trim whitespace
while (*token == ' ' || *token == '\t') token++;
char* end = token + strlen(token) - 1;
while (end > token && (*end == ' ' || *end == '\t')) {
*end = '\0';
end--;
}
tokens[tokenCount++] = token;
token = strtok(NULL, ",");
}
// Validate field count: support required fields (8), with pipeOps (9), or with regBuff (10)
if (tokenCount >= CONFIG_FIELDS_REQUIRED && tokenCount <= CONFIG_FIELDS_MAX) {
TuningConfig* config = &ctx->configs[ctx->numConfigs];
config->collType = parseCollType(tokens[CONFIG_FIELD_COLLTYPE]);
config->minBytes = (size_t)strtoull(tokens[CONFIG_FIELD_MINBYTES], NULL, 10);
config->maxBytes = (size_t)strtoull(tokens[CONFIG_FIELD_MAXBYTES], NULL, 10);
config->algorithm = parseAlgorithm(tokens[CONFIG_FIELD_ALGORITHM]);
config->protocol = parseProtocol(tokens[CONFIG_FIELD_PROTOCOL]);
config->nChannels = atoi(tokens[CONFIG_FIELD_CHANNELS]);
config->nNodes = atoi(tokens[CONFIG_FIELD_NNODES]);
config->nRanks = atoi(tokens[CONFIG_FIELD_NRANKS]);
// numPipeOps is optional (9th field, index 8)
if (tokenCount >= CONFIG_FIELDS_WITH_PIPEOPS) {
config->numPipeOps = atoi(tokens[CONFIG_FIELD_PIPEOPS]);
} else {
config->numPipeOps = -1; // -1 means match any numPipeOps
}
// regBuff is optional (10th field, index 9)
if (tokenCount >= CONFIG_FIELDS_WITH_REGBUFF) {
config->regBuff = atoi(tokens[CONFIG_FIELD_REGBUFF]);
} else {
config->regBuff = -1; // -1 means match any regBuff value
}
ctx->numConfigs++;
if (ctx->logFunction) {
if (config->numPipeOps == -1 && config->regBuff == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=any regBuff=any",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks);
} else if (config->regBuff == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=%d regBuff=any",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->numPipeOps);
} else if (config->numPipeOps == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=any regBuff=%d",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->regBuff);
} else {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded config: %s [%zu-%zu] %s/%s channels=%d nodes=%d ranks=%d pipeOps=%d regBuff=%d",
tokens[CONFIG_FIELD_COLLTYPE], config->minBytes, config->maxBytes,
tokens[CONFIG_FIELD_ALGORITHM], tokens[CONFIG_FIELD_PROTOCOL],
config->nChannels, config->nNodes, config->nRanks, config->numPipeOps, config->regBuff);
}
}
}
}
fclose(file);
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Loaded %d tuning configurations from %s", ctx->numConfigs, filename);
}
return ncclSuccess;
}
__hidden ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context) {
TunerContext* ctx = (TunerContext*)malloc(sizeof(TunerContext));
if (!ctx) return ncclSystemError;
ctx->configs = NULL; // Initialize to NULL
ctx->numConfigs = 0;
ctx->maxConfigs = 0; // Initialize to 0
ctx->nRanks = nRanks;
ctx->nNodes = nNodes;
ctx->logFunction = logFunction;
if (logFunction) {
logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Initializing tuner for %zu nodes, %zu ranks", nNodes, nRanks);
}
// Try to load config file from environment variable or default location
const char* configFile = getenv("NCCL_TUNER_CONFIG_FILE");
if (!configFile) {
configFile = "nccl_tuner.conf"; // default config file name
}
ncclResult_t result = loadConfig(ctx, configFile);
if (result != ncclSuccess) {
if (ctx->configs) {
free(ctx->configs); // Clean up allocated memory on error
}
free(ctx);
return result;
}
*context = ctx;
return ncclSuccess;
}
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
int numPipeOps, float** collCostTable, int numAlgo, int numProto,
int regBuff, int* nChannels) {
// Update NCCL core generated cost table. Updated table will be evaluated by NCCL to pick the best algo/proto combo
float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
if (table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] != NCCL_ALGO_PROTO_IGNORE) {
table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] = 0.0;
}
TunerContext* ctx = (TunerContext*)context;
if (!ctx) return ncclInternalError;
// Default channels
*nChannels = 1;
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: pluginGetCollInfo called - collType=%s, nBytes=%zu, numPipeOps=%d, regBuff=%d, numConfigs=%d",
collTypeToString(collType), nBytes, numPipeOps, regBuff, ctx->numConfigs);
}
// Look for matching configuration
for (int i = 0; i < ctx->numConfigs; i++) {
TuningConfig* config = &ctx->configs[i];
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Checking config %d - collType=%s, minBytes=%zu, maxBytes=%zu, algo=%s, proto=%s, nNodes=%d, nRanks=%d, numPipeOps=%d, regBuff=%d",
i, collTypeToString(config->collType), config->minBytes, config->maxBytes, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nNodes, config->nRanks, config->numPipeOps, config->regBuff);
}
// Check if this config matches the current collective, size range, topology, pipeline ops, and regBuff
if (config->collType == collType &&
nBytes >= config->minBytes &&
nBytes <= config->maxBytes &&
(config->nNodes == -1 || config->nNodes == (int)ctx->nNodes) &&
(config->nRanks == -1 || config->nRanks == (int)ctx->nRanks) &&
(config->numPipeOps == -1 || config->numPipeOps == numPipeOps) &&
(config->regBuff == -1 || config->regBuff == regBuff)) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config matches. Applying algo=%s, proto=%s, channels=%d",
algorithmToString(config->algorithm), protocolToString(config->protocol), config->nChannels);
}
// Check bounds
if (config->algorithm < numAlgo && config->protocol < numProto) {
if (collCostTable[config->algorithm][config->protocol] != NCCL_ALGO_PROTO_IGNORE) {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_TRACE, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Setting cost table[%s][%s] (%p) = 0.0 (was %.1f)",
algorithmToString(config->algorithm), protocolToString(config->protocol),
&collCostTable[config->algorithm][config->protocol], collCostTable[config->algorithm][config->protocol]);
}
collCostTable[config->algorithm][config->protocol] = 0.0; // Set low cost to prefer this configuration
// Only override channels if not set to -1 (keep default)
if (config->nChannels != -1) {
*nChannels = config->nChannels;
}
if (ctx->logFunction) {
if (config->nChannels == -1) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Applied config for collType=%s, bytes=%zu, pipeOps=%d, regBuff=%d: algo=%s, proto=%s, channels=default (nodes=%d, ranks=%d)",
collTypeToString(config->collType), nBytes, numPipeOps, regBuff, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nNodes, config->nRanks);
} else {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Applied config for collType=%s, bytes=%zu, pipeOps=%d, regBuff=%d: algo=%s, proto=%s, channels=%d (nodes=%d, ranks=%d)",
collTypeToString(config->collType), nBytes, numPipeOps, regBuff, algorithmToString(config->algorithm), protocolToString(config->protocol),
config->nChannels, config->nNodes, config->nRanks);
}
}
return ncclSuccess;
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Algorithm/protocol combination [%s][%s] is marked as IGNORE",
algorithmToString(config->algorithm), protocolToString(config->protocol));
}
}
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Algorithm/protocol out of bounds - algo=%s (max %d), proto=%s (max %d)",
algorithmToString(config->algorithm), numAlgo, protocolToString(config->protocol), numProto);
}
}
} else {
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: Config does not match - collType match=%d, size match=%d, nodes match=%d, ranks match=%d, pipeOps match=%d, regBuff match=%d",
config->collType == collType,
(nBytes >= config->minBytes && nBytes <= config->maxBytes),
(config->nNodes == -1 || config->nNodes == (int)ctx->nNodes),
(config->nRanks == -1 || config->nRanks == (int)ctx->nRanks),
(config->numPipeOps == -1 || config->numPipeOps == numPipeOps),
(config->regBuff == -1 || config->regBuff == regBuff));
}
}
}
// If no specific config found, apply default behavior
if (ctx->logFunction) {
ctx->logFunction(NCCL_LOG_INFO, NCCL_TUNING, __FILE__, __LINE__,
"TUNER/ExamplePlugin: No matching config found");
}
return ncclSuccess;
}
__hidden ncclResult_t pluginDestroy(void* context) {
if (context) {
TunerContext* ctx = (TunerContext*)context;
if (ctx->configs) {
free(ctx->configs); // Free dynamically allocated configs array
}
free(context);
}
return ncclSuccess;
}
#define PLUGIN_NAME "Example"

View File

@ -0,0 +1,106 @@
# NCCL Tuner Configuration Scripts
This directory contains scripts for optimizing NCCL tuner configurations based on performance data.
## optimize_config.py
A Python script that reads performance data from CSV files and generates optimal NCCL tuner configurations.
### Usage
```bash
python scripts/optimize_config.py [options] <input_csv_file>
```
### Options
- `-o, --output FILE`: Output NCCL tuner config file (default: `nccl_tuner.conf`)
- `-m, --metric METRIC`: Optimization metric: `bandwidth_gbps` or `latency_us` (default: `latency_us`)
- `--no-header`: Don't add header comments to output file
- `--dry-run`: Print configurations without writing to file
### CSV Input Format
The input CSV file should have the following columns:
```csv
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
```
**Required columns:**
- `collective`: NCCL collective type (`allreduce`, `broadcast`, `reduce`, etc.)
- `size_bytes`: Message size in bytes
- `algorithm`: NCCL algorithm (`tree`, `ring`, `nvls`, etc.)
- `protocol`: NCCL protocol (`simple`, `ll`, `ll128`)
- `channels`: Number of channels (or `-1` for default)
- `nodes`: Number of nodes (or `-1` for any)
- `ranks`: Number of ranks (or `-1` for any)
- `pipeOps`: Number of pipeline operations (or `-1` for any)
- `regBuff`: Registered buffer flag (`0`, `1`, or `-1` for any)
**Performance metric columns (at least one must be present):**
- `bandwidth_gbps`: Bandwidth in GB/s (higher is better)
- `latency_us`: Latency in microseconds (lower is better)
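For reference, the first data rows of the bundled `sample_performance_data.csv` follow this layout:
```csv
allreduce,1024,tree,simple,2,1,8,-1,-1,0.15,45.2,12.5
allreduce,1024,ring,simple,4,1,8,-1,-1,0.12,52.1,10.8
```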
### Examples
**Basic usage (optimizes for latency by default):**
```bash
python scripts/optimize_config.py sample_performance_data.csv
```
**Optimize for bandwidth and write to custom file:**
```bash
python scripts/optimize_config.py -m bandwidth_gbps -o my_tuner.conf performance_data.csv
```
**Preview configurations without writing:**
```bash
python scripts/optimize_config.py --dry-run performance_data.csv
```
### How It Works
1. **Data Loading**: Reads CSV performance data and validates format
2. **Grouping**: Groups data by collective type, topology (nodes/ranks), and other parameters
3. **Size Ranges**: Automatically bins data into size ranges for optimization
4. **Optimization**: Finds the best performing configuration for each group/size combination
5. **Output**: Generates NCCL tuner config format and appends to specified file
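As an illustration only: with the bundled sample data, the default latency metric, and auto-ranging, the single-node/8-rank allreduce rows reduce to entries roughly like these (exact size boundaries depend on the interpolated ranges):
```
allreduce,0,16896,ring,simple,4,1,8,-1,-1
allreduce,16897,540672,ring,ll128,8,1,8,-1,-1
allreduce,540673,1048576,nvls,simple,16,1,8,-1,-1
```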
### Default Size Ranges
The script uses these default size ranges (in bytes):
- Small: 0 - 1,024
- Medium: 1,025 - 65,536
- Large: 65,537 - 1,048,576
- XLarge: 1,048,577 - 16,777,216
- XXLarge: 16,777,217 - 4,294,967,295
### Sample Data
See `sample_performance_data.csv` for an example of the expected input format.
### Integration with NCCL
The generated configuration file can be used directly with the NCCL tuner plugin:
```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/optimized_config.conf
export NCCL_TUNER_PLUGIN=/path/to/libnccl-tuner.so
mpirun -np 8 your_nccl_application
```
### Performance Data Collection
To collect performance data for optimization, you can:
1. **Use NCCL benchmarks** with different algorithm/protocol combinations
2. **Profile your applications** with various tuner settings
3. **Run systematic sweeps** across parameter combinations (see the sketch below)
4. **Use NCCL debug output** to collect timing information
The key is to have comprehensive data covering:
- Different message sizes (small to large)
- Various topologies (single node, multi-node)
- All relevant algorithm/protocol combinations
- Different channel counts and pipeline configurations
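A minimal sweep sketch (assuming the separate nccl-tests suite is installed and provides binaries such as `all_reduce_perf`; the reported timings still need to be transcribed into the CSV format above):
```bash
# Sweep algorithm/protocol combinations; record the reported latency/bandwidth per size
for algo in RING TREE; do
  for proto in SIMPLE LL LL128; do
    NCCL_ALGO=$algo NCCL_PROTO=$proto \
      mpirun -np 8 ./all_reduce_perf -b 1K -e 1M -f 2
  done
done
```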

View File

@ -0,0 +1,430 @@
#!/usr/bin/env python3
"""
NCCL Tuner Configuration Optimizer
Reads a CSV file containing performance data across different tuning parameters
and generates optimal NCCL tuner configurations based on the best performing
combinations.
By default, creates growing size ranges that interpolate between the actual data sizes
for each unique dimension (node count, rank count combination). This ensures that
different cluster configurations get their own optimized size boundaries, as
performance characteristics often vary significantly between topologies.
Each dimension gets its own set of ranges starting from 0 and extending to the maximum
size for that dimension, with boundaries at midpoints between consecutive data sizes.
CSV Input Format:
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,bandwidth_gbps,latency_us
Output Format (NCCL Tuner Config):
collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff
Usage Examples:
# Auto-create dimension-specific interpolated ranges (default)
python3 optimize_config.py data.csv
# Use custom size ranges (applied to all topologies)
python3 optimize_config.py data.csv --size-ranges "0-1024,1025-65536,65537-1048576"
# Use hardcoded default ranges (applied to all topologies)
python3 optimize_config.py data.csv --no-auto-ranges
"""
import csv
import argparse
import sys
import os
from collections import defaultdict
from typing import Dict, List, Tuple, Any
class PerformanceData:
def __init__(self, row: Dict[str, str]):
self.collective = row['collective']
self.size_bytes = int(row['size_bytes'])
self.algorithm = row['algorithm']
self.protocol = row['protocol']
self.channels = int(row['channels']) if row['channels'] != '-1' else -1
self.nodes = int(row['nodes']) if row['nodes'] != '-1' else -1
self.ranks = int(row['ranks']) if row['ranks'] != '-1' else -1
self.pipeOps = int(row['pipeOps']) if row['pipeOps'] != '-1' else -1
self.regBuff = int(row['regBuff']) if row['regBuff'] != '-1' else -1
# Performance metrics
self.bandwidth_gbps = float(row.get('bandwidth_gbps', 0)) # Higher is better
self.latency_us = float(row.get('latency_us', 0)) # Lower is better
def get_config_key(self) -> Tuple:
"""Generate a key for grouping similar configurations"""
return (self.collective, self.nodes, self.ranks, self.pipeOps, self.regBuff)
def get_size_range_key(self, topology_size_ranges: Dict[Tuple[int, int], List[Tuple[int, int]]]) -> Tuple[int, int]:
"""Find which size range this data point belongs to for its dimension"""
topology_key = (self.nodes, self.ranks)
# Get size ranges for this dimension, or fall back to default
if topology_key in topology_size_ranges:
size_ranges = topology_size_ranges[topology_key]
elif (-1, -1) in topology_size_ranges:
size_ranges = topology_size_ranges[(-1, -1)]
else:
# Fallback to first available dimension ranges
size_ranges = next(iter(topology_size_ranges.values()))
for min_size, max_size in size_ranges:
if min_size <= self.size_bytes <= max_size:
return (min_size, max_size)
# If no range found, create a single-point range
return (self.size_bytes, self.size_bytes)
class ConfigOptimizer:
def __init__(self, optimization_metric: str = 'latency_us'):
self.optimization_metric = optimization_metric
# Default size ranges - will be overridden by auto-detection
self.size_ranges = [
(0, 1024),
(1025, 64*1024),
(64*1024+1, 1024*1024),
(1024*1024+1, 16*1024*1024),
(16*1024*1024+1, 4*1024*1024*1024-1)
]
self.auto_size_ranges = True
def set_size_ranges(self, ranges: List[Tuple[int, int]]):
"""Set custom size ranges for optimization"""
self.size_ranges = ranges
self.auto_size_ranges = False
def auto_determine_size_ranges(self, data: List[PerformanceData]) -> Dict[Tuple[int, int], List[Tuple[int, int]]]:
"""Create growing size ranges for each unique (nodes, ranks) dimension"""
if not data:
return {(-1, -1): self.size_ranges}
# Group data by dimension (nodes, ranks)
topology_data = defaultdict(list)
for item in data:
topology_key = (item.nodes, item.ranks)
topology_data[topology_key].append(item)
topology_ranges = {}
for topology_key, items in topology_data.items():
nodes, ranks = topology_key
# Extract unique sizes for this dimension and sort them
unique_sizes = sorted(set(item.size_bytes for item in items))
if len(unique_sizes) <= 1:
# Only one size, create a single range from 0 to that size
size = unique_sizes[0] if unique_sizes else 0
ranges = [(0, size)]
else:
# Create growing ranges that interpolate between data points
ranges = []
for i, size in enumerate(unique_sizes):
if i == 0:
# First range: 0 to midpoint between first and second size
if len(unique_sizes) > 1:
next_size = unique_sizes[i + 1]
max_size = (size + next_size) // 2
else:
max_size = size
min_size = 0
elif i == len(unique_sizes) - 1:
# Last range: previous max + 1 to current size (and beyond)
min_size = ranges[-1][1] + 1
max_size = size
else:
# Intermediate ranges: previous max + 1 to midpoint with next size
min_size = ranges[-1][1] + 1
next_size = unique_sizes[i + 1]
max_size = (size + next_size) // 2
ranges.append((min_size, max_size))
topology_ranges[topology_key] = ranges
print(f"Dimension {nodes} nodes, {ranks} ranks: {len(ranges)} size ranges from {len(unique_sizes)} unique sizes:")
for i, (min_size, max_size) in enumerate(ranges):
# Count data points that fall in this range for this dimension
count = sum(1 for item in items if min_size <= item.size_bytes <= max_size)
actual_sizes = sorted(set(item.size_bytes for item in items if min_size <= item.size_bytes <= max_size))
if actual_sizes:
size_list = ', '.join(f"{s:,}" for s in actual_sizes[:3])
if len(actual_sizes) > 3:
size_list += f", ... (+{len(actual_sizes)-3} more)"
print(f" Range {i+1}: {min_size:,} - {max_size:,} bytes ({count} data points, sizes: {size_list})")
return topology_ranges
def load_data(self, csv_file: str) -> List[PerformanceData]:
"""Load performance data from CSV file"""
data = []
try:
with open(csv_file, 'r') as f:
reader = csv.DictReader(f)
for row in reader:
try:
data.append(PerformanceData(row))
except (ValueError, KeyError) as e:
print(f"Warning: Skipping invalid row: {row} - {e}")
except FileNotFoundError:
print(f"Error: File {csv_file} not found")
sys.exit(1)
except Exception as e:
print(f"Error reading {csv_file}: {e}")
sys.exit(1)
print(f"Loaded {len(data)} performance data points")
# Auto-determine size ranges if enabled
if self.auto_size_ranges and data:
self.topology_size_ranges = self.auto_determine_size_ranges(data)
else:
# Use default ranges for all topologies
self.topology_size_ranges = {(-1, -1): self.size_ranges}
return data
def is_better(self, new_data: PerformanceData, current_best: PerformanceData) -> bool:
"""Determine if new_data is better than current_best"""
if self.optimization_metric == 'bandwidth_gbps':
return new_data.bandwidth_gbps > current_best.bandwidth_gbps
elif self.optimization_metric == 'latency_us':
return new_data.latency_us < current_best.latency_us
else:
# Default to latency
return new_data.latency_us < current_best.latency_us
def optimize_configurations(self, data: List[PerformanceData]) -> List[str]:
"""Find optimal configurations and return as NCCL config strings"""
# Group data by configuration key and size range
grouped_data = defaultdict(lambda: defaultdict(list))
for item in data:
config_key = item.get_config_key()
size_range = item.get_size_range_key(self.topology_size_ranges)
grouped_data[config_key][size_range].append(item)
# Store optimal configurations before combining ranges
optimal_configs = []
for config_key, size_ranges_dict in grouped_data.items():
collective, nodes, ranks, pipeOps, regBuff = config_key
for (min_size, max_size), items in size_ranges_dict.items():
if not items:
continue
# Find the best performing configuration for this size range
best_item = items[0]
for item in items[1:]:
if self.is_better(item, best_item):
best_item = item
# Store the optimal configuration with its range
optimal_configs.append({
'collective': collective,
'min_size': min_size,
'max_size': max_size,
'algorithm': best_item.algorithm,
'protocol': best_item.protocol,
'channels': best_item.channels,
'nodes': best_item.nodes,
'ranks': best_item.ranks,
'pipeOps': best_item.pipeOps,
'regBuff': best_item.regBuff,
'metric_value': getattr(best_item, self.optimization_metric)
})
# Combine sequential ranges with identical tunings
combined_configs = self.combine_sequential_ranges(optimal_configs)
# Generate config strings
configs = []
for config in combined_configs:
config_str = f"{config['collective']},{config['min_size']},{config['max_size']},{config['algorithm']},{config['protocol']},{config['channels']},{config['nodes']},{config['ranks']},{config['pipeOps']},{config['regBuff']}"
configs.append(config_str)
print(f"Optimal for {config['collective']} [{config['min_size']}-{config['max_size']}] nodes={config['nodes']} ranks={config['ranks']}: "
f"{config['algorithm']}/{config['protocol']} channels={config['channels']} "
f"({self.optimization_metric}={config['metric_value']:.3f})")
return configs
def combine_sequential_ranges(self, configs: List[Dict]) -> List[Dict]:
"""Combine sequential ranges that have identical tuning parameters"""
if not configs:
return configs
# Group by collective and topology (nodes, ranks)
topology_groups = defaultdict(list)
for config in configs:
topology_key = (config['collective'], config['nodes'], config['ranks'],
config['pipeOps'], config['regBuff'])
topology_groups[topology_key].append(config)
combined_configs = []
for topology_key, topology_configs in topology_groups.items():
# Sort by min_size to ensure proper ordering
topology_configs.sort(key=lambda x: x['min_size'])
# Group by tuning parameters (algorithm, protocol, channels)
tuning_groups = defaultdict(list)
for config in topology_configs:
tuning_key = (config['algorithm'], config['protocol'], config['channels'])
tuning_groups[tuning_key].append(config)
# For each tuning group, combine sequential ranges
for tuning_key, tuning_configs in tuning_groups.items():
if not tuning_configs:
continue
# Sort by min_size
tuning_configs.sort(key=lambda x: x['min_size'])
# Combine sequential ranges
current_config = tuning_configs[0].copy()
for next_config in tuning_configs[1:]:
# Check if ranges are adjacent or overlapping
if current_config['max_size'] + 1 >= next_config['min_size']:
# Extend the current range
current_config['max_size'] = max(current_config['max_size'], next_config['max_size'])
# Update metric value to the better one
if self.optimization_metric == 'bandwidth_gbps':
if next_config['metric_value'] > current_config['metric_value']:
current_config['metric_value'] = next_config['metric_value']
else: # latency_us or default
if next_config['metric_value'] < current_config['metric_value']:
current_config['metric_value'] = next_config['metric_value']
else:
# Gap between ranges, save current and start new one
combined_configs.append(current_config)
current_config = next_config.copy()
# Add the last configuration
combined_configs.append(current_config)
# Sort final configs by collective, nodes, ranks, then min_size
combined_configs.sort(key=lambda x: (x['collective'], x['nodes'], x['ranks'], x['min_size']))
original_count = len(configs)
combined_count = len(combined_configs)
if combined_count < original_count:
print(f"Combined {original_count} ranges into {combined_count} ranges "
f"(reduced by {original_count - combined_count})")
return combined_configs
def append_to_config_file(self, configs: List[str], config_file: str, add_header: bool = True):
"""Append optimized configurations to NCCL tuner config file"""
try:
# Create directory if it doesn't exist
config_dir = os.path.dirname(config_file)
if config_dir and not os.path.exists(config_dir):
os.makedirs(config_dir)
print(f"Created directory: {config_dir}")
# Check if file exists and has content
file_exists = os.path.exists(config_file)
add_separator = False
if file_exists:
with open(config_file, 'r') as f:
content = f.read().strip()
add_separator = len(content) > 0
print(f"Appending to existing file: {config_file}")
else:
print(f"Creating new file: {config_file}")
with open(config_file, 'a') as f:
if add_separator:
f.write("\n\n")
if add_header:
f.write(f"# Optimized configurations generated by optimize_config.py\n")
f.write(f"# Optimization metric: {self.optimization_metric}\n")
f.write(f"# Format: collective_type,min_bytes,max_bytes,algorithm,protocol,channels,nNodes,nRanks,numPipeOps,regBuff\n")
for config in configs:
f.write(f"{config}\n")
if file_exists:
print(f"Appended {len(configs)} optimized configurations to {config_file}")
else:
print(f"Created {config_file} with {len(configs)} optimized configurations")
except PermissionError:
print(f"Error: Permission denied writing to {config_file}")
print("Try running with appropriate permissions or choose a different output location")
sys.exit(1)
except OSError as e:
print(f"Error: Cannot create/write to {config_file}: {e}")
print("Check that the path is valid and you have write permissions")
sys.exit(1)
except Exception as e:
print(f"Unexpected error writing to {config_file}: {e}")
sys.exit(1)
def main():
parser = argparse.ArgumentParser(description="Optimize NCCL tuner configurations from performance data")
parser.add_argument("csv_file", help="Input CSV file with performance data")
parser.add_argument("-o", "--output", default="nccl_tuner.conf",
help="Output NCCL tuner config file (default: nccl_tuner.conf)")
parser.add_argument("-m", "--metric", choices=['bandwidth_gbps', 'latency_us'],
default='latency_us', help="Optimization metric (default: latency_us)")
parser.add_argument("--no-header", action="store_true",
help="Don't add header comments to output file")
parser.add_argument("--dry-run", action="store_true",
help="Print configurations without writing to file")
parser.add_argument("--no-auto-ranges", action="store_true",
help="Disable automatic size range determination (use default ranges)")
parser.add_argument("--size-ranges", type=str,
help="Custom size ranges as comma-separated pairs: 'min1-max1,min2-max2,...'")
args = parser.parse_args()
optimizer = ConfigOptimizer(args.metric)
# Handle size range configuration
if args.size_ranges:
# Parse custom size ranges
try:
ranges = []
for range_str in args.size_ranges.split(','):
min_size, max_size = map(int, range_str.split('-'))
ranges.append((min_size, max_size))
optimizer.set_size_ranges(ranges)
print(f"Using custom size ranges: {ranges}")
except ValueError:
print("Error: Invalid size ranges format. Use 'min1-max1,min2-max2,...'")
sys.exit(1)
elif args.no_auto_ranges:
# Disable auto-ranging
optimizer.auto_size_ranges = False
print("Using default hardcoded size ranges")
else:
# Auto-ranging is enabled by default - creates one bucket per unique size
optimizer.auto_size_ranges = True
print("Auto-ranging enabled: will create one bucket per unique size in data")
# Load and optimize data
data = optimizer.load_data(args.csv_file)
if not data:
print("No valid data found in CSV file")
sys.exit(1)
configs = optimizer.optimize_configurations(data)
if args.dry_run:
print("\nGenerated configurations:")
for config in configs:
print(config)
else:
optimizer.append_to_config_file(configs, args.output, not args.no_header)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,24 @@
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
allreduce,1024,tree,simple,2,1,8,-1,-1,0.15,45.2,12.5
allreduce,1024,ring,simple,4,1,8,-1,-1,0.12,52.1,10.8
allreduce,1024,tree,ll,2,1,8,-1,-1,0.18,41.3,15.2
allreduce,1024,ring,ll,4,1,8,-1,-1,0.14,48.7,12.1
allreduce,32768,tree,simple,2,1,8,-1,-1,0.25,156.8,25.3
allreduce,32768,ring,simple,4,1,8,-1,-1,0.18,189.2,18.4
allreduce,32768,ring,ll128,8,1,8,-1,-1,0.16,201.5,16.2
allreduce,1048576,ring,simple,4,1,8,-1,-1,0.45,425.6,45.1
allreduce,1048576,ring,ll128,8,1,8,-1,-1,0.38,482.3,38.7
allreduce,1048576,nvls,simple,16,1,8,-1,-1,0.32,551.2,32.1
broadcast,1024,tree,simple,2,1,8,-1,-1,0.08,89.4,8.2
broadcast,1024,ring,simple,4,1,8,-1,-1,0.12,71.3,12.1
broadcast,32768,tree,simple,2,1,8,-1,-1,0.18,234.7,18.5
broadcast,32768,ring,ll128,4,1,8,-1,-1,0.15,267.8,15.2
broadcast,1048576,ring,simple,4,1,8,-1,-1,0.35,612.4,35.1
broadcast,1048576,ring,ll128,8,1,8,-1,-1,0.28,702.1,28.3
allreduce,1024,tree,simple,2,2,16,-1,-1,0.22,38.1,22.4
allreduce,1024,ring,simple,4,2,16,-1,-1,0.19,42.7,19.6
allreduce,32768,ring,simple,4,2,16,-1,-1,0.28,145.2,28.1
allreduce,32768,ring,ll128,8,2,16,-1,-1,0.24,167.8,24.3
allreduce,1048576,ring,simple,4,2,16,-1,-1,0.58,387.5,58.2
allreduce,1048576,ring,ll128,8,2,16,-1,-1,0.48,456.9,48.1
allreduce,1048576,nvls,simple,16,2,16,-1,-1,0.42,512.6,42.3

View File

@ -0,0 +1,30 @@
#
# Makefile for NCCL Tuner Plugin Unit Tests
#
CC := gcc
CFLAGS := -Wall -Wextra -g -std=c99 -fPIC
INC := -I. -I../nccl
TARGET := test_plugin
SOURCES := test_plugin.c
# Default target
all: $(TARGET)
# Build the test executable
$(TARGET): $(SOURCES)
$(CC) $(CFLAGS) $(INC) -o $(TARGET) $(SOURCES)
# Run the tests
test: $(TARGET)
./$(TARGET) $(TEST_CASE)
# Run tests with verbose output
test-verbose: $(TARGET)
NCCL_DEBUG=INFO ./$(TARGET) $(TEST_CASE)
# Clean build artifacts
clean:
rm -f $(TARGET) *.o *.gcov *.gcda *.gcno test_*.conf
.PHONY: all test test-verbose clean

View File

@ -0,0 +1,205 @@
# NCCL Tuner Plugin Unit Tests
This directory contains comprehensive unit tests for the NCCL tuner plugin. The tests verify all major functionality including configuration parsing, matching logic, and cost table updates.
## Test Structure
```
test/
├── test_plugin.c # Main unit test file
├── Makefile # Build system for tests
└── README.md # This file
```
## Building and Running Tests
### Quick Start
```bash
# Build and run all tests
make test
# Or step by step
make # Build test executable
./test_plugin # Run tests
```
### Advanced Testing
```bash
# Run with memory leak detection (requires valgrind)
make test-memory
# Run with verbose logging
make test-verbose
# Generate code coverage report (requires gcov)
make coverage
# Create sample test configuration files
make test-configs
```
## Test Coverage
The unit tests cover the following functionality:
### 1. **Plugin Initialization (`test_plugin_init`)**
- Tests successful plugin initialization
- Verifies context allocation
- Tests cleanup on destroy
### 2. **Configuration Parsing (`test_config_parsing_valid`, `test_config_parsing_invalid`)**
- Valid CSV format parsing
- Comment and empty line handling
- Invalid format graceful handling
- Environment variable configuration
### 3. **Collective Type Matching (`test_collective_matching`)**
- Correct matching of allreduce, broadcast, etc.
- Algorithm/protocol selection
- Channel configuration
### 4. **Size Range Matching (`test_size_matching`)**
- Small, medium, large message size handling
- Proper range boundary checking
- Multiple size-based configurations
### 5. **Topology Matching (`test_topology_matching`)**
- Single-node vs multi-node configurations
- Exact nNodes/nRanks matching
- Wildcard matching (-1 values); the config excerpt below shows all three cases
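The config used by this test (excerpted from `test_plugin.c`) exercises all three cases: the first line matches single-node runs only, the second requires exactly 4 nodes and 32 ranks, and the third matches any topology.
```
allreduce,0,65536,tree,simple,2,1,-1,-1,-1
allreduce,0,65536,ring,simple,4,4,32,-1,-1
allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1
```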
### 6. **Default Channels (`test_default_channels`)**
- Proper handling of -1 channel specification
- Preservation of NCCL default behavior
### 7. **Registered Buffer Matching (`test_regbuff_matching`)**
- Configurations based on regBuff parameter
- Registered vs non-registered buffer handling
- Backward compatibility with configs missing regBuff
### 8. **Pipeline Operations Matching (`test_pipeops_matching`)**
- Configurations based on numPipeOps parameter
- Single vs multiple pipeline operation handling
- Backward compatibility with configs missing numPipeOps
### 9. **Fallback Behavior (`test_no_match_fallback`)**
- Default behavior when no config matches
- Cost table left unmodified so NCCL's internal tuning takes over
## Test Output
Successful test run:
```
Running NCCL Tuner Plugin Unit Tests
=====================================
PASS: test_plugin_init
PASS: test_config_parsing_valid
PASS: test_config_parsing_invalid
PASS: test_collective_matching
PASS: test_size_matching
PASS: test_topology_matching
PASS: test_default_channels
PASS: test_regbuff_matching
PASS: test_pipeops_matching
PASS: test_no_match_fallback
=====================================
Test Results: 10/10 tests passed
All tests PASSED!
```
Failed test example:
```
FAIL: test_collective_matching - Tree/Simple should have low cost
Test Results: 9/10 tests passed
Some tests FAILED!
```
## Mock NCCL Implementation
The tests use the actual NCCL header files from the `../nccl/` directory:
- `tuner.h` - Complete NCCL tuner interface and type definitions
- `common.h` - Common NCCL types and logging functions
- `err.h` - NCCL error codes
This allows testing with the real NCCL interface definitions while still being able to run tests without the full NCCL library installation.
## Integration with CI/CD
```bash
# Install tests for CI/CD pipeline
make install-test
# Run as part of automated testing
make test && echo "Tests passed" || echo "Tests failed"
```
## Memory Testing
The tests can be run with valgrind for memory leak detection:
```bash
make test-memory
```
This will detect:
- Memory leaks
- Invalid memory access
- Use of uninitialized memory
## Code Coverage
Generate code coverage reports to ensure comprehensive testing:
```bash
make coverage
# Creates test_plugin.c.gcov with line-by-line coverage
```
## Adding New Tests
To add a new test:
1. Create a new test function in `test_plugin.c`:
```c
int test_new_feature() {
// Test setup
TEST_ASSERT(condition, "description");
// Test cleanup
TEST_PASS();
}
```
2. Add the test to the main function:
```c
total++; passed += test_new_feature();
```
3. Rebuild and run:
```bash
make test
```
## Debugging Tests
For debugging failed tests:
```bash
# Compile with debug symbols
make CFLAGS="-g -O0 -DDEBUG"
# Run with gdb
gdb ./test_plugin
```
## Cleaning Up
```bash
# Remove all build artifacts and temporary files
make clean
```
This comprehensive test suite ensures the NCCL tuner plugin works correctly across all supported configurations and edge cases.

View File

@ -0,0 +1,856 @@
/*************************************************************************
* Unit tests for NCCL Tuner Plugin
************************************************************************/
#define _GNU_SOURCE // Enable setenv/unsetenv and other GNU extensions
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <unistd.h>
#include <sys/stat.h>
#include <stdarg.h>
// Include NCCL tuner header (which includes common.h and err.h)
#include "tuner.h"
// Include plugin source for testing
#include "../plugin.c"
// Test framework macros
#define TEST_ASSERT(condition, message) \
do { \
if (!(condition)) { \
printf("FAIL: %s - %s\n", __func__, message); \
return 0; \
} \
} while(0)
#define TEST_PASS() \
do { \
printf("PASS: %s\n", __func__); \
return 1; \
} while(0)
// Global test state
static int test_log_count = 0;
// Mock logger function
void mock_logger(ncclDebugLogLevel level, unsigned long flags,
const char* file, int line, const char* fmt, ...) {
(void)flags; // Suppress unused parameter warning
test_log_count++;
// Check if we should print based on NCCL_DEBUG level
const char* debug_level = getenv("NCCL_DEBUG");
int should_print = 0;
if (debug_level) {
if (strcmp(debug_level, "TRACE") == 0) {
should_print = 1; // Print everything
} else if (strcmp(debug_level, "INFO") == 0 && level <= NCCL_LOG_INFO) {
should_print = 1; // Print INFO and below
} else if (strcmp(debug_level, "WARN") == 0 && level <= NCCL_LOG_WARN) {
should_print = 1; // Print WARN and below
}
}
if (!should_print) return;
// Convert log level to string
const char* level_str;
switch(level) {
case NCCL_LOG_NONE: level_str = "NONE"; break;
case NCCL_LOG_VERSION: level_str = "VERSION"; break;
case NCCL_LOG_WARN: level_str = "WARN"; break;
case NCCL_LOG_INFO: level_str = "INFO"; break;
case NCCL_LOG_ABORT: level_str = "ABORT"; break;
case NCCL_LOG_TRACE: level_str = "TRACE"; break;
default: level_str = "UNKNOWN"; break;
}
// Print log header
printf("[TUNER:%s:%s:%d] ", level_str, file, line);
// Print formatted message
va_list args;
va_start(args, fmt);
vprintf(fmt, args);
va_end(args);
printf("\n");
}
// Helper function to create test config file
void create_test_config(const char* filename, const char* content) {
FILE* f = fopen(filename, "w");
if (f) {
fprintf(f, "%s", content);
fclose(f);
}
}
// Test 1: Plugin initialization
int test_plugin_init() {
void* context = NULL;
// Test successful initialization
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed");
TEST_ASSERT(context != NULL, "Context should be allocated");
// Clean up
pluginDestroy(context);
TEST_PASS();
}
// Test 2: Configuration file parsing - valid CSV
int test_config_parsing_valid() {
const char* test_config =
"# Test configuration\n"
"allreduce,0,65536,tree,simple,2,1,-1,-1,-1\n"
"broadcast,0,32768,ring,ll128,4,2,16,-1,-1\n"
"# Comment line\n"
"\n" // Empty line
"reduce,1024,2048,tree,simple,-1,-1,-1,-1,-1\n";
create_test_config("test_valid.conf", test_config);
// Set environment variable to use our test config
setenv("NCCL_TUNER_CONFIG_FILE", "test_valid.conf", 1);
void* context = NULL;
ncclResult_t result = pluginInit(16, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init with valid config should succeed");
// Clean up
pluginDestroy(context);
unlink("test_valid.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 3: Configuration file parsing - invalid CSV
int test_config_parsing_invalid() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,1 # Missing nRanks and other fields\n"
"invalid_collective,0,1024,ring,simple,1,1,1,-1,-1\n"
"broadcast,abc,def,ring,simple,1,1,1,-1,-1\n"; // Invalid numbers
create_test_config("test_invalid.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_invalid.conf", 1);
void* context = NULL;
ncclResult_t result = pluginInit(8, 1, mock_logger, &context);
// Should still succeed but with no valid configs loaded
TEST_ASSERT(result == ncclSuccess, "Plugin init should succeed even with invalid config");
// Clean up
pluginDestroy(context);
unlink("test_invalid.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 4: Collective type matching
int test_collective_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,8,1,-1,-1,-1\n"
"broadcast,0,32768,ring,ll128,4,-1,-1,-1,-1\n";
create_test_config("test_match.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_match.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
// Create mock cost table
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Default high cost
}
}
int nChannels;
// Test allreduce matching (should match first config)
ncclResult_t result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should succeed");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Checking cost_table[TREE][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 8, "Should set 8 channels");
// Test broadcast matching (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Reset costs
}
}
result = pluginGetCollInfo(context, ncclFuncBroadcast, 16384, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should succeed");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Checking cost_table[RING][LL128] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128], cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 4, "Should set 4 channels");
// Clean up
pluginDestroy(context);
unlink("test_match.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 5: Size range matching
int test_size_matching() {
const char* test_config =
"allreduce,0,1024,tree,simple,2,-1,-1,-1,-1\n"
"allreduce,1025,65536,ring,simple,4,-1,-1,-1,-1\n"
"allreduce,65537,4294967295,ring,ll128,8,-1,-1,-1,-1\n";
create_test_config("test_size.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_size.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels = 1;
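// Test small message (should match first config)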
pluginGetCollInfo(context, ncclFuncAllReduce, 512, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Small message - checking cost_table[TREE][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Small: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Small: Should set 2 channels");
// Test medium message (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Medium message - checking cost_table[RING][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Medium: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Medium: Should set 4 channels");
// Test large message (should match third config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 1048576, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Large message - checking cost_table[RING][LL128] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128], cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Large: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Large: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_size.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 6: Topology matching
int test_topology_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,1,-1,-1,-1\n" // Single node only
"allreduce,0,65536,ring,simple,4,4,32,-1,-1\n" // 4 nodes, 32 ranks exactly
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any topology
create_test_config("test_topo.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_topo.conf", 1);
// Test with single node setup
void* context1 = NULL;
pluginInit(8, 1, mock_logger, &context1); // 8 ranks, 1 node
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
pluginGetCollInfo(context1, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Single node: Should match tree config");
TEST_ASSERT(nChannels == 2, "Single node: Should set 2 channels");
pluginDestroy(context1);
// Test with 4 nodes, 32 ranks setup
void* context2 = NULL;
pluginInit(32, 4, mock_logger, &context2); // 32 ranks, 4 nodes
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context2, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "4-node: Should match ring/simple config");
TEST_ASSERT(nChannels == 4, "4-node: Should set 4 channels");
// Clean up
pluginDestroy(context2);
unlink("test_topo.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 7: Default channels behavior (-1)
int test_default_channels() {
const char* test_config =
"allreduce,0,65536,tree,simple,-1,-1,-1,-1,-1\n"; // Use default channels
create_test_config("test_default.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_default.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels = 99; // Set to known value
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Should apply algorithm/protocol");
TEST_ASSERT(nChannels == 1, "Should keep default channels (1) when config has -1");
// Clean up
pluginDestroy(context);
unlink("test_default.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 8: regBuff matching
int test_regbuff_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,-1,-1,-1,1\n" // Registered buffers only
"allreduce,0,65536,ring,simple,4,-1,-1,-1,0\n" // Non-registered buffers only
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any buffer type (backward compatible)
create_test_config("test_regbuff.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_regbuff.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
}
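// Cost values are reset to 1.0 before each sub-test below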
int nChannels;
// Test registered buffer (should match first config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
1, &nChannels); // regBuff = 1 (registered)
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Registered buffer: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Registered buffer: Should set 2 channels");
// Test non-registered buffer (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels); // regBuff = 0 (non-registered)
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Non-registered buffer: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Non-registered buffer: Should set 4 channels");
// Test backward compatibility - config without regBuff should match any regBuff value
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
// First try with regBuff=2 (unusual value, should match third config)
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
2, &nChannels); // regBuff = 2 (only third config should match)
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Any regBuff: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Any regBuff: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_regbuff.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 9: numPipeOps matching
int test_pipeops_matching() {
const char* test_config =
"allreduce,0,65536,tree,simple,2,-1,-1,1,-1\n" // Single pipeline op
"allreduce,0,65536,ring,simple,4,-1,-1,4,-1\n" // Multiple pipeline ops
"allreduce,0,65536,ring,ll128,8,-1,-1,-1,-1\n"; // Any pipeline ops (backward compatible)
create_test_config("test_pipeops.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_pipeops.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
}
int nChannels;
// Test single pipeline op (should match first config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] == 0.0, "Single pipeOp: Tree/Simple should have low cost");
TEST_ASSERT(nChannels == 2, "Single pipeOp: Should set 2 channels");
// Test multiple pipeline ops (should match second config)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 4,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 0.0, "Multiple pipeOps: Ring/Simple should have low cost");
TEST_ASSERT(nChannels == 4, "Multiple pipeOps: Should set 4 channels");
// Test different number of pipeline ops (should match third config - backward compatible)
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 2,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_LL128] == 0.0, "Any pipeOps: Ring/LL128 should have low cost");
TEST_ASSERT(nChannels == 8, "Any pipeOps: Should set 8 channels");
// Clean up
pluginDestroy(context);
unlink("test_pipeops.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 10: No matching configuration (fallback behavior)
int test_no_match_fallback() {
const char* test_config =
"broadcast,0,1024,tree,simple,2,-1,-1,-1,-1\n"; // Only broadcast config
create_test_config("test_fallback.conf", test_config);
setenv("NCCL_TUNER_CONFIG_FILE", "test_fallback.conf", 1);
void* context = NULL;
pluginInit(8, 1, mock_logger, &context);
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
// Try allreduce (should not match, use fallback)
pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"DEBUG: Fallback test - checking cost_table[RING][SIMPLE] (%p) = %.1f (expecting 0.0)",
&cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE], cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE]);
TEST_ASSERT(cost_table[NCCL_ALGO_RING][NCCL_PROTO_SIMPLE] == 1.0, "Cost table should pass through unmodified");
TEST_ASSERT(nChannels == 1, "Should use default channels");
// Clean up
pluginDestroy(context);
unlink("test_fallback.conf");
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 11: Large configuration files (testing dynamic allocation)
int test_large_config() {
const char* large_config_file = "test_large.conf";
// Create a large configuration file with many entries
// This tests the dynamic allocation functionality
FILE* f = fopen(large_config_file, "w");
TEST_ASSERT(f != NULL, "Should be able to create large config file");
// Write header comment
fprintf(f, "# Large configuration file for testing dynamic allocation\n");
fprintf(f, "# This file contains many configurations to test memory allocation\n");
// Generate a large number of configurations (much more than the old MAX_CONFIGS=100)
const int num_configs = 500; // 5x the old static limit
const char* collectives[] = {"allreduce", "broadcast", "reduce", "allgather", "reducescatter"};
const char* algorithms[] = {"tree", "ring", "collnet_direct", "nvls"};
const char* protocols[] = {"simple", "ll", "ll128"};
for (int i = 0; i < num_configs; i++) {
// Vary the configurations to create realistic test data
const char* coll = collectives[i % 5];
const char* algo = algorithms[i % 4];
const char* proto = protocols[i % 3];
size_t min_bytes = (i * 1024) % 1048576; // Vary from 0 to 1MB
size_t max_bytes = min_bytes + 65536; // 64KB range
int channels = (i % 8) + 1; // 1-8 channels
int nodes = (i % 4) == 0 ? -1 : (i % 4); // Mix of -1 and 1-3 nodes
int ranks = (i % 8) == 0 ? -1 : (i % 32) + 1; // Mix of -1 and 1-32 ranks
int pipeOps = (i % 3) == 0 ? -1 : (i % 4) + 1; // Mix of -1 and 1-4 pipeOps
int regBuff = (i % 3) == 0 ? -1 : (i % 2); // Mix of -1, 0, 1
fprintf(f, "%s,%zu,%zu,%s,%s,%d,%d,%d,%d,%d\n",
coll, min_bytes, max_bytes, algo, proto, channels, nodes, ranks, pipeOps, regBuff);
}
fclose(f);
// Set environment to use our large config file
setenv("NCCL_TUNER_CONFIG_FILE", large_config_file, 1);
// Initialize plugin with large config
void* context = NULL;
ncclResult_t result = pluginInit(16, 4, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin init with large config should succeed");
TEST_ASSERT(context != NULL, "Context should be allocated");
// Verify that configurations were loaded
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == num_configs, "Should load all configurations from large file");
TEST_ASSERT(ctx->maxConfigs == num_configs, "maxConfigs should match allocated size");
TEST_ASSERT(ctx->configs != NULL, "Configs array should be dynamically allocated");
// Test that we can access configurations throughout the array
// (This would have failed with the old static MAX_CONFIGS=100 limit)
for (int i = 0; i < ctx->numConfigs; i++) {
TuningConfig* config = &ctx->configs[i];
// Basic sanity checks on the loaded configurations
TEST_ASSERT(config->collType >= ncclFuncBroadcast && config->collType <= ncclFuncAllReduce,
"Collective type should be valid");
TEST_ASSERT(config->maxBytes >= config->minBytes, "maxBytes should be >= minBytes");
TEST_ASSERT(config->nChannels > 0, "nChannels should be positive");
}
// Test specific configuration access at various indices
// Index 0 (first config)
TuningConfig* first_config = &ctx->configs[0];
TEST_ASSERT(first_config != NULL, "First config should be accessible");
// Index in middle
TuningConfig* mid_config = &ctx->configs[num_configs / 2];
TEST_ASSERT(mid_config != NULL, "Middle config should be accessible");
// Index near end (this would have crashed with static array of 100)
TuningConfig* late_config = &ctx->configs[num_configs - 1];
TEST_ASSERT(late_config != NULL, "Last config should be accessible");
// Test memory allocation size - verify we didn't over-allocate
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Successfully loaded %d configurations (dynamic allocation)", ctx->numConfigs);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Memory allocated for %d configurations (%zu bytes total)",
ctx->maxConfigs, ctx->maxConfigs * sizeof(TuningConfig));
// Test that the plugin can still find matching configurations from the large set
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0; // Default high cost
}
}
int nChannels;
// Try to find a matching configuration - should work with large config set
result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with large config set");
// Clean up
pluginDestroy(context);
unlink(large_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 12: Very large configuration stress test
int test_very_large_config_stress() {
const char* stress_config_file = "test_stress.conf";
// Create an even larger configuration file to stress test the implementation
FILE* f = fopen(stress_config_file, "w");
TEST_ASSERT(f != NULL, "Should be able to create stress test config file");
fprintf(f, "# Stress test configuration with very large number of entries\n");
// Generate an extremely large number of configurations
const int stress_configs = 2000; // 20x the old static limit
for (int i = 0; i < stress_configs; i++) {
// Create varied but valid configurations
fprintf(f, "allreduce,%d,%d,ring,simple,4,-1,-1,-1,-1\n",
i * 512, (i * 512) + 1024);
}
fclose(f);
setenv("NCCL_TUNER_CONFIG_FILE", stress_config_file, 1);
// Test initialization with stress config
void* context = NULL;
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin should handle very large config files");
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == stress_configs, "Should load all stress test configurations");
TEST_ASSERT(ctx->configs != NULL, "Stress test configs should be allocated");
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Stress test - loaded %d configurations successfully", stress_configs);
mock_logger(NCCL_LOG_INFO, NCCL_ALL, __FILE__, __LINE__,
"Memory usage: %zu bytes for configuration array",
stress_configs * sizeof(TuningConfig));
// Verify we can access configurations throughout the entire range
for (int i = 0; i < stress_configs; i += 100) { // Sample every 100th config
TuningConfig* config = &ctx->configs[i];
TEST_ASSERT(config->collType == ncclFuncAllReduce, "Config should have correct collective type");
TEST_ASSERT(config->minBytes == (size_t)(i * 512), "Config should have correct minBytes");
}
// Clean up
pluginDestroy(context);
unlink(stress_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test 13: Edge case - empty config file
int test_empty_config() {
const char* empty_config_file = "test_empty.conf";
// Create empty config file (only comments)
create_test_config(empty_config_file,
"# Empty configuration file\n"
"# No actual configurations\n"
"\n"
"\n");
setenv("NCCL_TUNER_CONFIG_FILE", empty_config_file, 1);
void* context = NULL;
ncclResult_t result = pluginInit(8, 2, mock_logger, &context);
TEST_ASSERT(result == ncclSuccess, "Plugin should handle empty config files");
TunerContext* ctx = (TunerContext*)context;
TEST_ASSERT(ctx->numConfigs == 0, "Should have zero configurations");
TEST_ASSERT(ctx->maxConfigs == 0, "Should have zero max configurations");
TEST_ASSERT(ctx->configs == NULL, "Should not allocate memory for empty config");
// Test that plugin still works with no configurations (fallback behavior)
float cost_table[NCCL_NUM_ALGORITHMS][NCCL_NUM_PROTOCOLS];
float* cost_table_ptr[NCCL_NUM_ALGORITHMS];
for (int i = 0; i < NCCL_NUM_ALGORITHMS; i++) {
cost_table_ptr[i] = cost_table[i];
for (int j = 0; j < NCCL_NUM_PROTOCOLS; j++) {
cost_table[i][j] = 1.0;
}
}
int nChannels;
result = pluginGetCollInfo(context, ncclFuncAllReduce, 32768, 1,
cost_table_ptr, NCCL_NUM_ALGORITHMS, NCCL_NUM_PROTOCOLS,
0, &nChannels);
TEST_ASSERT(result == ncclSuccess, "GetCollInfo should work with empty config");
// Clean up
pluginDestroy(context);
unlink(empty_config_file);
unsetenv("NCCL_TUNER_CONFIG_FILE");
TEST_PASS();
}
// Test runner function pointer type
typedef int (*TestFunction)(void);
// Test registry
typedef struct {
const char* name;
TestFunction func;
const char* description;
} TestCase;
// All available tests
TestCase test_cases[] = {
{"init", test_plugin_init, "Plugin initialization"},
{"config-valid", test_config_parsing_valid, "Valid configuration parsing"},
{"config-invalid", test_config_parsing_invalid, "Invalid configuration parsing"},
{"collective", test_collective_matching, "Collective type matching"},
{"size", test_size_matching, "Size range matching"},
{"topology", test_topology_matching, "Topology matching"},
{"channels", test_default_channels, "Default channels behavior"},
{"regbuff", test_regbuff_matching, "Registered buffer matching"},
{"pipeops", test_pipeops_matching, "Pipeline operations matching"},
{"fallback", test_no_match_fallback, "Fallback behavior"},
{"large-config", test_large_config, "Large configuration files (dynamic allocation)"},
{"stress-config", test_very_large_config_stress, "Very large configuration stress test"},
{"empty-config", test_empty_config, "Empty configuration file handling"},
{NULL, NULL, NULL} // End marker
};
// Show help/usage information
void show_help(const char* program_name) {
printf("Usage: %s [test_name ...]\n\n", program_name);
printf("Available tests:\n");
for (int i = 0; test_cases[i].name != NULL; i++) {
printf(" %-15s - %s\n", test_cases[i].name, test_cases[i].description);
}
printf("\nExamples:\n");
printf(" %s # Run all tests\n", program_name);
printf(" %s init # Run only initialization test\n", program_name);
printf(" %s init collective # Run initialization and collective tests\n", program_name);
printf(" %s --help # Show this help\n", program_name);
}
// Find test by name
TestFunction find_test(const char* name) {
for (int i = 0; test_cases[i].name != NULL; i++) {
if (strcmp(test_cases[i].name, name) == 0) {
return test_cases[i].func;
}
}
return NULL;
}
// Main test runner
int main(int argc, char* argv[]) {
int passed = 0, total = 0;
// Check for help
if (argc > 1 && (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-h") == 0)) {
show_help(argv[0]);
return 0;
}
printf("Running NCCL Tuner Plugin Unit Tests\n");
printf("=====================================\n");
if (argc == 1) {
// No arguments - run all tests
for (int i = 0; test_cases[i].name != NULL; i++) {
total++;
passed += test_cases[i].func();
}
} else {
// Run specific tests
for (int arg = 1; arg < argc; arg++) {
TestFunction test_func = find_test(argv[arg]);
if (test_func) {
total++;
passed += test_func();
} else {
printf("ERROR: Unknown test '%s'\n", argv[arg]);
printf("Use --help to see available tests\n");
return 1;
}
}
}
printf("\n=====================================\n");
printf("Test Results: %d/%d tests passed\n", passed, total);
if (passed == total) {
printf("All tests PASSED!\n");
return 0;
} else {
printf("Some tests FAILED!\n");
return 1;
}
}

View File

@ -40,10 +40,12 @@ ifeq ($(shell test "0$(CUDA_MAJOR)" -lt 12; echo $$?),0)
CUDA8_GENCODE += -gencode=arch=compute_35,code=sm_35
endif
CUDA9_GENCODE = -gencode=arch=compute_70,code=sm_70
CUDA10_GENCODE = -gencode=arch=compute_75,code=sm_75
CUDA11_GENCODE = -gencode=arch=compute_80,code=sm_80
CUDA12_GENCODE = -gencode=arch=compute_90,code=sm_90
CUDA13_GENCODE = -gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120
CUDA12_8_GENCODE = -gencode=arch=compute_100,code=sm_100 \
-gencode=arch=compute_120,code=sm_120
CUDA13_GENCODE = -gencode=arch=compute_110,code=sm_110
CUDA8_PTX = -gencode=arch=compute_61,code=compute_61
CUDA9_PTX = -gencode=arch=compute_70,code=compute_70
@ -53,10 +55,10 @@ CUDA13_PTX = -gencode=arch=compute_120,code=compute_120
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
# Prior to SM75 is deprecated from CUDA13.0 onwards
NVCC_GENCODE ?= $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
NVCC_GENCODE ?= $(CUDA10_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_8_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 12 -a "0$(CUDA_MINOR)" -ge 8; echo $$?),0)
# Include Blackwell support if we're using CUDA12.8 or above
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA13_GENCODE) $(CUDA13_PTX)
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_8_GENCODE) $(CUDA13_PTX)
else ifeq ($(shell test "0$(CUDA_MAJOR)" -eq 11 -a "0$(CUDA_MINOR)" -ge 8 -o "0$(CUDA_MAJOR)" -gt 11; echo $$?),0)
# Include Hopper support if we're using CUDA11.8 or above
NVCC_GENCODE ?= $(CUDA8_GENCODE) $(CUDA9_GENCODE) $(CUDA11_GENCODE) $(CUDA12_GENCODE) $(CUDA12_PTX)
@ -74,7 +76,7 @@ $(info NVCC_GENCODE is ${NVCC_GENCODE})
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
CXXSTD ?= -std=c++17
else
CXXSTD ?= -std=c++11
CXXSTD ?= -std=c++14
endif
CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \

View File

@ -1,6 +1,6 @@
##### version
NCCL_MAJOR := 2
NCCL_MINOR := 27
NCCL_PATCH := 3
NCCL_PATCH := 7
NCCL_SUFFIX :=
PKG_REVISION := 1

51
src/CMakeLists.txt Normal file
View File

@ -0,0 +1,51 @@
include(../cmake/common.cmake)
find_package(CUDAToolkit REQUIRED)
set(nccl_Major ${nccl_VERSION_MAJOR})
set(nccl_Minor ${nccl_VERSION_MINOR})
set(nccl_Patch ${nccl_VERSION_PATCH})
# NCCL_VERSION(X,Y,Z) ((X) * 10000 + (Y) * 100 + (Z))
math(
EXPR
nccl_Version
"${nccl_VERSION_MAJOR} * 10000 + ${nccl_VERSION_MINOR} * 100 + ${nccl_VERSION_PATCH}"
)
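# For example, version 2.27.7 resolves to 2 * 10000 + 27 * 100 + 7 = 22707.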
set(nccl_Suffix)
configure_file(${CMAKE_CURRENT_SOURCE_DIR}/nccl.h.in
${CMAKE_CURRENT_SOURCE_DIR}/include/nccl.h)
file(
GLOB
SRC_FILES
"${CMAKE_CURRENT_SOURCE_DIR}/*.cc"
"${CMAKE_CURRENT_SOURCE_DIR}/misc/*.cc"
"${CMAKE_CURRENT_SOURCE_DIR}/transport/*.cc"
"${CMAKE_CURRENT_SOURCE_DIR}/collectives/*.cc"
"${CMAKE_CURRENT_SOURCE_DIR}/graph/*.cc")
set(HEADER_FILES "${CMAKE_CURRENT_SOURCE_DIR}/include/nccl.h")
set(NCCL_LIBS nccl;nccl_static)
add_library(nccl SHARED ${SRC_FILES})
add_library(nccl_static STATIC ${SRC_FILES})
foreach(lib_name IN LISTS NCCL_LIBS)
nccl_add_target_options(${lib_name})
target_include_directories(
${lib_name}
PRIVATE $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include>
$<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/include/plugin>)
target_include_directories(${lib_name} PRIVATE "${CUDAToolkit_INCLUDE_DIRS}")
target_sources(
${lib_name}
PUBLIC FILE_SET
public_headers
TYPE
HEADERS
BASE_DIRS
"${CMAKE_CURRENT_SOURCE_DIR}"
FILES
${HEADER_FILES})
endforeach()

35
src/device/CMakeLists.txt Normal file
View File

@ -0,0 +1,35 @@
set(CU_FILES onerank_reduce.cu functions.cu)
add_library(colldevice OBJECT ${CU_FILES})
set(datatypes "i8;u8;i32;u32;i64;u64;f16;f32;f64")
if(CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL "11")
list(APPEND datatypes bf16)
endif()
set(ops "sum;prod;min;max;premulsum;sumpostdiv")
list(LENGTH ops op_num)
math(EXPR op_num "${op_num} - 1")
list(LENGTH datatypes datatype_num)
math(EXPR datatype_num "${datatype_num} - 1")
set(base_files "sendrecv;all_reduce;all_gather;broadcast;reduce;reduce_scatter")
foreach(base IN LISTS base_files)
foreach(opn RANGE ${op_num})
list(GET ops ${opn} op)
foreach(dtn RANGE ${datatype_num})
list(GET datatypes ${dtn} dt)
set(new_file ${CMAKE_CURRENT_BINARY_DIR}/${base}_${op}_${dt}.cu)
configure_file(${CMAKE_CURRENT_SOURCE_DIR}/${base}.cu ${new_file}
COPYONLY)
set_property(SOURCE ${new_file} PROPERTY COMPILE_DEFINITIONS
NCCL_OP=${opn} NCCL_TYPE=${dtn})
target_sources(colldevice PRIVATE ${new_file})
endforeach()
endforeach()
endforeach()
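# Illustration: base "all_reduce" with op "sum" (index 0) and datatype "f32" (index 7)
# yields all_reduce_sum_f32.cu in the build tree, compiled with NCCL_OP=0 and NCCL_TYPE=7.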
target_include_directories(
colldevice PRIVATE ${CMAKE_CURRENT_SOURCE_DIR}/../../include
${CMAKE_CURRENT_SOURCE_DIR})
# Compiled kernels and collectives with relocatable device code ...
set_property(TARGET colldevice PROPERTY CUDA_SEPARABLE_COMPILATION ON)

View File

@ -36,9 +36,8 @@ define COMPILE
$(call COMPILE$(or $3,$(suffix $2)),$1,$2)
endef
ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12080))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100a,code=sm_100a \
-gencode=arch=compute_120a,code=sm_120a
ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12090))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100f,code=sm_100f
else ifeq ($(shell echo "$$((1000*$(CUDA_MAJOR) + 10*$(CUDA_MINOR) >= 12070))"),1)
NVCC_GENCODE_LDMC_FP8 = -gencode=arch=compute_100a,code=sm_100a
else

View File

@ -1009,7 +1009,7 @@ struct Apply_LoadMultimem {
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_bfloat16, bf16x2, 4)
#endif
#if NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1000 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1010 || NCCL_CUDA_ARCH_SPECIFIC == 1200 || NCCL_CUDA_ARCH_SPECIFIC == 1210
#if NCCL_CUDA_ARCH_SPECIFIC == 1000 || NCCL_CUDA_ARCH_SPECIFIC == 1010 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1000 || NCCL_CUDA_ARCH_FAMILY_SPECIFIC == 1010 || NCCL_CUDA_ARCH_SPECIFIC == 1200 || NCCL_CUDA_ARCH_SPECIFIC == 1210
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_minmax_v4_and_xparts(__nv_fp8_e4m3, e4m3x4, 4)
DEFINE_Apply_LoadMultimem_sum_v4_and_xparts(__nv_fp8_e5m2, e5m2x4, 4)

View File

@ -108,7 +108,7 @@ def required_cuda(k):
if k.algo in ldmc_algos:
cudart = 12070
arch = None
specific_sms = [100, 120]
specific_sms = ["100a", "101a", "100f", "101f", "120a", "121a"]
return (cudart, arch, specific_sms)
################################################################################
@ -145,7 +145,7 @@ def kernel_conds(k):
if not specific_sms:
arch_cond = "__CUDA_ARCH__ >= %d"%arch
else:
arch_cond = " || ".join(["0"] + ["NCCL_CUDA_ARCH_SPECIFIC==%d"%(10*sm) for sm in specific_sms])
arch_cond = " || ".join(["0"] + ["NCCL_CUDA_ARCH_%sSPECIFIC==%d"%("FAMILY_" if sm[-1] == "f" else "", 10*int(sm.replace('a', '').replace('f', ''))) for sm in specific_sms])
return cudart_cond, arch_cond
def instantiate(k):

View File

@ -38,12 +38,9 @@ ncclResult_t ncclInitKernelsForDevice(int cudaArch, int maxSharedMem, size_t* ma
if (fn == nullptr) continue;
cudaError_t errcode = cudaFuncGetAttributes(&attr, fn);
if (errcode == cudaErrorNoKernelImageForDevice) continue;
CUDACHECKGOTO(errcode, result, ignore0);
if (errcode != cudaSuccess) continue; // Silently ignore failures
if (maxStackSize) {
if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
ignore0:;
}
if (carveout) {
CUDACHECKGOTO(cudaFuncSetAttribute(fn,

View File

@ -175,6 +175,13 @@ ncclResult_t ncclGetLocalCpu(struct ncclTopoSystem* system, int gpu, int* retCpu
return ncclSuccess;
}
static int mergePathType(int type0, int type1){
int max = std::max(type0,type1);
int min = std::min(type0,type1);
if(max == PATH_PHB && min == PATH_C2C) return PATH_P2C;
else return max;
}
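// For example, mergePathType(PATH_PHB, PATH_C2C) yields PATH_P2C; for any other
// combination the larger (i.e. more distant) of the two path types is kept.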
static ncclResult_t addInterStep(struct ncclTopoSystem* system, int tx, int ix, int t1, int i1, int t2, int i2) {
struct ncclTopoNode* cpuNode = system->nodes[tx].nodes+ix;
struct ncclTopoNode* srcNode = system->nodes[t1].nodes+i1;
@ -187,7 +194,7 @@ static ncclResult_t addInterStep(struct ncclTopoSystem* system, int tx, int ix,
// Update path characteristics
srcNode->paths[t2][i2].count = l;
srcNode->paths[t2][i2].type = std::max(srcNode->paths[tx][ix].type, cpuNode->paths[t2][i2].type);
srcNode->paths[t2][i2].type = mergePathType(srcNode->paths[tx][ix].type, cpuNode->paths[t2][i2].type);
if (tx == GPU) srcNode->paths[t2][i2].type = PATH_PXN;
srcNode->paths[t2][i2].bw = std::min(srcNode->paths[tx][ix].bw, cpuNode->paths[t2][i2].bw);
return ncclSuccess;
@ -674,9 +681,9 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
int c;
NCCLCHECK(ncclGetLocalCpu(system, g, &c));
if (c == -1) continue;
if (gpuNode->paths[NET][n].type == PATH_PHB && gpuNode->paths[CPU][c].type == PATH_C2C) {
gpuNode->paths[NET][n].type = PATH_P2C;
netNode->paths[GPU][g].type = PATH_P2C;
if (mergePathType(gpuNode->paths[CPU][c].type, netNode->paths[CPU][c].type) == PATH_P2C) {
gpuNode->paths[NET][n].type = std::min(PATH_P2C, gpuNode->paths[NET][n].type);
netNode->paths[GPU][g].type = std::min(PATH_P2C, netNode->paths[GPU][g].type);
}
}
}
@ -695,16 +702,15 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
// PXN = PCI + NVLink.
struct ncclTopoNode* peerNode = system->nodes[GPU].nodes+localGpuIndex;
// Only use PXN for NIC n if remote GPU p ...
if (/* (1) is either connected to the NIC with PXB*/
(peerNode->paths[NET][n].type <= PATH_PXB ||
/* or with P2C and PxN over C2C is enabled */
(ncclParamPxnC2c() && peerNode->paths[NET][n].type == PATH_P2C)) &&
int pxnType = ncclParamPxnC2c() ? PATH_P2C : PATH_PXB;
if (/* (1) is connected to the NIC with PxN type*/
peerNode->paths[NET][n].type <= pxnType &&
/* and (2) is connected to us through NVLink */
peerNode->paths[GPU][g].type <= PATH_NVL &&
/* and (3) is on the same node as us */
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) &&
/* and (4) has either higher bw to that NIC or avoid going through the CPU*/
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || gpu->paths[NET][n].type > PATH_PXB))
/* and (4) has either higher bw to that NIC or avoid going through the CPU (path.type is > PATH_PXN)*/
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || gpu->paths[NET][n].type > PATH_PXN))
// We can use that GPU as relay to communicate with that NIC.
// Only enabling it in the GPU->NIC direction for now to favor
// receiving locally and sending remotely (consistent with net.cc)
@ -725,6 +731,12 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
}
}
}
// Pre-compute NET local gpus to accelerate search
for (int n=0; n<system->nodes[NET].count; n++) {
struct ncclTopoNode* net = system->nodes[NET].nodes+n;
NCCLCHECK(ncclTopoGetLocalGpu(system, net->id, &net->net.localGpu));
}
return ncclSuccess;
}

View File

@ -437,6 +437,65 @@ ncclResult_t ncclTopoCompareGraphs(struct ncclTopoSystem* system, struct ncclTop
return ncclSuccess;
}
// Add the preferred NICs ordered by GPU first
static ncclResult_t ncclTopoPrefNetsGpuFirst(struct ncclTopoSystem* system, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCount) {
const int nGpus = (gpu == -1) ? system->nodes[GPU].count : 1;
int gpuCount = nGpus;
int gpuIds[NCCL_TOPO_MAX_NODES] = {gpu};
int firstNets[NCCL_TOPO_MAX_NODES];
if (gpu == -1)
for (int g = 0; g < nGpus; g++) gpuIds[g] = g;
for (int c = 0; c < MAXCHANNELS; c++) {
for (int g = 0; g < nGpus; g++) {
if (gpuIds[g] == -1) continue;
int localNet;
int64_t netId;
struct ncclTopoNode* gpu = system->nodes[GPU].nodes + gpuIds[g];
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, &localNet));
// store the first net found for each GPU in case of duplicates
if(c == 0) firstNets[g] = localNet;
// if the NET has already been returned for channel 0, that GPU is done
if (c > 0 && firstNets[g] == localNet) {
gpuIds[g] = -1;
gpuCount--;
continue;
}
// only add it to the list if it doesn't already exist
int found = 0;
while (found < (*netCount) && nets[found] != localNet) found++;
if (found == (*netCount)) nets[(*netCount)++] = localNet;
}
if (gpuCount == 0) break;
}
return ncclSuccess;
}
// Add the preferred NICs ordered by channels first
static ncclResult_t ncclTopoPrefNetsChannelFirst(struct ncclTopoSystem* system, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCount) {
for (int g = 0; g < system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
int localNetCount = 0, localNets[MAXCHANNELS];
struct ncclTopoNode* gpu = system->nodes[GPU].nodes + g;
for (int c = 0; c < MAXCHANNELS; c++) {
int64_t netId;
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, localNets + localNetCount));
if (localNetCount > 0 && localNets[localNetCount] == localNets[0]) break;
localNetCount++;
}
// Append NICs to list
for (int i = 0; i < localNetCount; i++) {
int n = localNets[i];
int found = 0;
while (found < (*netCount) && nets[found] != n) found++;
if (found == (*netCount)) nets[(*netCount)++] = n;
}
}
return ncclSuccess;
}
// Build a sorted list of the NETs to try.
//
// "gpu" can be set to -1 to build a list suitable for all GPUs (search start) or to a given gpu
@ -445,39 +504,25 @@ ncclResult_t ncclTopoCompareGraphs(struct ncclTopoSystem* system, struct ncclTop
// The list is built the following way:
// 1. Select NETs starting with those close to GPU(s), based on paths[n].type.
// 2. add other NETs satisfying typeInter but not already in the list.
NCCL_PARAM(ScatterEnable, "MNNVL_SCATTER_NETS_ENABLE", 1);
ncclResult_t ncclTopoSelectNets(struct ncclTopoSystem* system, int typeInter, int gpu, int nets[NCCL_TOPO_MAX_NODES], int* netCountRet) {
ncclResult_t ret = ncclSuccess;
int netCount = 0;
int localNetCount;
int localNets[MAXCHANNELS];
// First add the preferred NICs
for (int g=0; g<system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
localNetCount = 0;
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
for (int c = 0; c<MAXCHANNELS; c++) {
int64_t netId;
NCCLCHECK(ncclTopoGetLocalNet(system, gpu->gpu.rank, c, &netId, NULL));
NCCLCHECK(ncclTopoIdToIndex(system, NET, netId, localNets+localNetCount));
if (localNetCount > 0 && localNets[localNetCount] == localNets[0]) break;
localNetCount++;
}
// Append NICs to list
for (int i=0; i<localNetCount; i++) {
int n = localNets[i];
int found = 0;
while (found<netCount && nets[found] != n) found++;
if (found == netCount) nets[netCount++] = n;
}
// First add the preferred NETs.
if (system->nHosts > 1 && ncclParamScatterEnable()) {
// For MNNVL systems, we sort the devices by GPU first, then by channel
NCCLCHECK(ncclTopoPrefNetsGpuFirst(system, gpu, nets, &netCount));
} else {
// For other systems, we sort the devices by channel first, then by GPU
NCCLCHECK(ncclTopoPrefNetsChannelFirst(system, gpu, nets, &netCount));
}
// Then add others satisfying typeInter
for (int t=0; t <= typeInter; t++) {
for (int g=0; g<system->nodes[GPU].count; g++) {
for (int g = 0; g < system->nodes[GPU].count; g++) {
if (gpu != -1 && gpu != g) continue;
localNetCount = 0;
int localNetCount = 0, localNets[MAXCHANNELS];
struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
struct ncclTopoLinkList* paths = gpu->paths[NET];
for (int n=0; n<system->nodes[NET].count && n<MAXCHANNELS; n++) {
@ -625,8 +670,7 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
if (graph->pattern == NCCL_TOPO_PATTERN_NVLS || graph->pattern == NCCL_TOPO_PATTERN_COLLNET_DIRECT) {
// NVLS search only tries to find NIC:GPU combinations to compute the heads.
if (graph->nChannels < netCount) {
int gpu;
NCCLCHECK(ncclTopoGetLocalGpu(system, net->id, &gpu));
int gpu = net->net.localGpu;
if (gpu != -1) {
int duplicate = 0;
// check whether there is duplicate head when one GPU connects with multiple NICs
@ -643,13 +687,12 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
}
}
} else {
if (graph->nChannels > 0) {
if (graph->nChannels > 0 && graph->sameChannels == 1) {
// Try to replay the last channel
int g;
NCCLCHECK(ncclTopoReplayGetGpu(system, graph, -1, &g));
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, FORCED_ORDER_REPLAY, time, NET, n, g));
}
if (graph->nChannels == 0 || graph->sameChannels == 0) {
} else {
if (graph->nChannels == 0 && system->nodes[NVS].count == 0) {
// Always try the PCI order first to set a reference, but don't count in the timeout nor let it run for long
int t = 1 << 10;
@ -658,11 +701,16 @@ ncclResult_t ncclTopoSearchRecNet(struct ncclTopoSystem* system, struct ncclTopo
}
// Then try the most local GPUs
int localGpu = net->net.localGpu;
if (localGpu != -1) {
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, localGpu));
}
int localGpus[NCCL_TOPO_MAX_NODES], localGpuCount, pathType;
NCCLCHECK(ncclTopoGetLocal(system, NET, n, GPU, localGpus, &localGpuCount, &pathType));
// if no GPUs are connected, skip this net
if (pathType == PATH_DIS) continue;
for (int g = 0; g < localGpuCount; ++g) {
if (localGpus[g] == localGpu) continue; // We already tried this one
NCCLCHECK(ncclTopoSearchTryGpu(system, graph, saveGraph, 0, backToNet, backToFirstRank, 0, time, NET, n, localGpus[g]));
}
}
@ -749,8 +797,8 @@ struct kvDict kvDictLinkType[] = {
{ "NVB", PATH_NVB },
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "PXN", PATH_PXN },
{ "P2C", PATH_P2C },
{ "PXN", PATH_PXN },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
{ NULL, 0 }
@ -798,8 +846,10 @@ ncclResult_t ncclTopoGetGraphFromXmlSub(struct ncclXmlNode *xmlGraph, struct ncc
NCCLCHECK(xmlGetAttrInt(xmlGraph, "nchannels", &graph->nChannels));
NCCLCHECK(xmlGetAttrFloat(xmlGraph, "speedintra", &graph->bwIntra));
NCCLCHECK(xmlGetAttrFloat(xmlGraph, "speedinter", &graph->bwInter));
if (xmlGetAttrFloat(xmlGraph, "latencyinter", &graph->latencyInter) != ncclSuccess) graph->latencyInter = 0.0;
const char* str;
NCCLCHECK(xmlGetAttr(xmlGraph, "latencyinter", &str));
if (!str) INFO(NCCL_GRAPH, "latencyinter not found in graph, using 0.0");
graph->latencyInter = str ? strtof(str, NULL) : 0.0;
NCCLCHECK(xmlGetAttr(xmlGraph, "typeintra", &str));
NCCLCHECK(kvConvertToInt(str, &graph->typeIntra, kvDictLinkType));
NCCLCHECK(xmlGetAttr(xmlGraph, "typeinter", &str));
@ -910,7 +960,7 @@ float sm90SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0,
#define NSPEEDSINTER_SM90 (sizeof(sm90SpeedArrayInter)/sizeof(float))
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0, 18.0 };
float sm100SpeedArrayInter[] = { 47.9, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
float sm100SpeedArrayInter[] = { 96.0, 48.0, 45.1, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
#define NSPEEDSINTRA_SM100 (sizeof(sm100SpeedArrayIntra)/sizeof(float))
#define NSPEEDSINTER_SM100 (sizeof(sm100SpeedArrayInter)/sizeof(float))
@ -1136,8 +1186,12 @@ ncclResult_t ncclTopoPrintGraph(struct ncclTopoSystem* system, struct ncclTopoGr
offset = strlen(line);
}
for (int i=0; i<ngpus; i++) {
sprintf(line+offset, " %s/%d", topoNodeTypeStr[GPU], graph->intra[ngpus*c+i]);
int g;
ncclTopoRankToIndex(system, graph->intra[ngpus * c + i], &g, true);
int64_t topoId = system->nodes[GPU].nodes[g].id;
sprintf(line + offset, " %s/%lx-%lx", topoNodeTypeStr[GPU], NCCL_TOPO_ID_SYSTEM_ID(topoId), NCCL_TOPO_ID_LOCAL_ID(topoId));
offset = strlen(line);
if (graph->id == 3) break; // NVLS graphs only use the first GPU
}
if (system->nodes[NET].count > 0) {
sprintf(line+offset, " %s/%lx-%lx", topoNodeTypeStr[NET], NCCL_TOPO_ID_SYSTEM_ID(graph->inter[2*c+1]), NCCL_TOPO_ID_LOCAL_ID(graph->inter[2*c+1]));
@ -1253,7 +1307,8 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
NCCLCHECK(ncclTopoGetLocalGpu(comm->topo, netId, &g2));
if (g2 != -1) {
struct ncclTopoNode* peerGpu = comm->topo->nodes[GPU].nodes+g2;
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= PATH_PXB) {
int pxnType = ncclParamPxnC2c() ? PATH_P2C : PATH_PXB;
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= pxnType) {
*proxyRank = peerGpu->gpu.rank;
if (dev) *dev = netDev;
if (id) *id = netId;

View File

@ -21,7 +21,7 @@
const char* topoNodeTypeStr[] = { "GPU", "PCI", "NVS", "CPU", "NIC", "NET" };
const char* topoLinkTypeStr[] = { "LOC", "NVL", "", "C2C", "PCI", "", "", "", "", "SYS", "NET" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "PXN", "P2C", "PHB", "SYS", "NET", "DIS" };
const char* topoPathTypeStr[] = { "LOC", "NVL", "NVB", "C2C", "PIX", "PXB", "P2C", "PXN", "PHB", "SYS", "NET", "DIS" };
/******************************************************************/
/******************* Graph Creation Functions *********************/
@ -677,7 +677,14 @@ ncclResult_t ncclTopoGetSystemFromXml(struct ncclXml* xml, struct ncclTopoSystem
struct ncclXmlNode* node = topNode->subs[s];
if (strcmp(node->name, "cpu") == 0) NCCLCHECK(ncclTopoAddCpu(node, *topoSystem));
}
for (int systemId=0; systemId<system->nHosts; systemId++) if (system->hostHashes[systemId] == localHostHash) system->systemId = systemId;
int systemId = 0;
while (systemId < system->nHosts && system->hostHashes[systemId] != localHostHash) systemId++;
system->systemId = systemId;
if(systemId == system->nHosts){
WARN("localHostHash = 0x%lx not found in the list of system hostHashes",localHostHash);
return ncclInvalidArgument;
}
NCCLCHECK(ncclTopoAddNvLinks(topNode, *topoSystem, NULL, 0));
NCCLCHECK(ncclTopoAddC2c(topNode, *topoSystem, NULL, 0));
@ -1143,8 +1150,8 @@ struct kvDict nicPathKvList[] = {
{ "PORT", PATH_PORT },
{ "PIX", PATH_PIX },
{ "PXB", PATH_PXB },
{ "PXN", PATH_PXN },
{ "P2C", PATH_P2C },
{ "PXN", PATH_PXN },
{ "PHB", PATH_PHB },
{ "SYS", PATH_SYS },
{ NULL, 0 }
@ -1421,7 +1428,7 @@ ncclResult_t ncclTopoGetSystem(struct ncclComm* comm, struct ncclTopoSystem** sy
}
// Only update our topo tracking structure if we aren't dumping (separate steps)
if (dumpXmlFile == NULL) NCCLCHECKGOTO(ncclTopoGetSystemFromXml(xml, system, comm->peerInfo[comm->rank].hostHash), ret, fail);
if (dumpXmlFile == NULL) NCCLCHECKGOTO(ncclTopoGetSystemFromXml(xml, system, getHostHash()), ret, fail);
exit:
if (!comm->MNNVL && localRanks) free(localRanks);

View File

@ -18,7 +18,7 @@
#define SM80_NVLINK_BW 20.0
#define SM90_NVLINK_BW 20.6
#define SM86_NVLINK_BW 12.0
#define SM100_NVLINK_BW 40.0
#define SM100_NVLINK_BW 40.1
#define PCI_BW 12.0 // PCI Gen3 x16
#define AMD_BW 16.0
#define BDW_QPI_BW 6.0
@ -76,11 +76,11 @@ extern const char* topoLinkTypeStr[];
// Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
#define PATH_PXB 5
// Connection between a GPU and a NIC using an intermediate GPU. Used to enable rail-local, aggregated network send/recv operations.
#define PATH_PXN 6
// Connection between a GPU and a NIC using the C2C connection to the CPU and the PCIe connection to the NIC
#define PATH_P2C 7
#define PATH_P2C 6
// Connection between a GPU and a NIC using an intermediate GPU. Used to enable rail-local, aggregated network send/recv operations.
#define PATH_PXN 7
// Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
#define PATH_PHB 8
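// (A rough note on the reordering above: smaller values denote closer connections, so
//  with P2C now placed before PXN, existing "type <= PATH_PXN" checks also accept P2C paths.)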
@ -98,6 +98,8 @@ extern const char* topoLinkTypeStr[];
#define PATH_DIS 11
extern const char* topoPathTypeStr[];
extern int64_t ncclParamPxnC2c();
struct ncclTopoNode;
struct ncclTopoLink {
int type;
@ -143,6 +145,7 @@ struct ncclTopoNode {
int gdrSupport;
int collSupport;
int maxChannels;
int localGpu;
}net;
struct {
int arch;

View File

@ -455,9 +455,16 @@ ncclResult_t ncclTopoTuneModel(struct ncclComm* comm, int minCompCap, int maxCom
for (int c=0; c<NCCL_NUM_FUNCTIONS; c++) for (int a=0; a<NCCL_NUM_ALGORITHMS; a++) for (int p=0; p<NCCL_NUM_PROTOCOLS; p++) {
int pEnable = protoEnable[c*NCCL_NUM_PROTOCOLS+p];
if (pEnable == 2 && p == NCCL_PROTO_LL128) {
// Enable LL128 by default only on Volta/Ampere/Hopper/Blackwell+NVLink. Other cases are not tested and may cause silent data corruption.
pEnable = 1;
pEnable &= (graphs[a]->typeInter <= PATH_PXB || (minCompCap >= 90 && graphs[a]->typeInter <= (ncclParamLl128C2c() ? PATH_P2C : PATH_PXN)));
if (ncclParamLl128C2c() && minCompCap >= 90) {
// Enable LL128 by default only on Hopper/Blackwell for all connections up to P2C and PXN.
pEnable &= (graphs[a]->typeInter <= PATH_PXN);
} else {
// Enable LL128 only up to PXB. Don't enable LL128 over PxN because PxN can encapsulate PxB or P2C links.
pEnable &= (graphs[a]->typeInter <= PATH_PXB);
if (!ncclParamLl128C2c() && minCompCap >= 90)
INFO(NCCL_GRAPH, "Disabling LL128 over all PxN connections (PXB and C2C). This ensures that no C2C link will be used by LL128.");
}
pEnable &= (graphs[a]->typeIntra <= PATH_NVB);
pEnable &= (minCompCap == maxCompCap);
pEnable &= !(minCompCap < 70 || (minCompCap == 90 && CUDART_VERSION == 11080 && c == ncclFuncAllReduce && a == NCCL_ALGO_RING && comm->nRanks == 2));

View File

@ -9,6 +9,7 @@
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#if __GNUC__ >= 3
# define __attribute_const __attribute__((const))
@ -39,7 +40,7 @@ union ibv_gid {
#define vext_field_avail(type, fld, sz) (offsetof(type, fld) < (sz))
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
//static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
enum ibv_node_type {
IBV_NODE_UNKNOWN = -1,
@ -208,7 +209,9 @@ struct ibv_port_attr {
uint8_t active_speed;
uint8_t phys_state;
uint8_t link_layer;
uint8_t reserved;
uint8_t flags;
uint16_t port_cap_flags2;
uint32_t active_speed_ex;
};
enum ibv_event_type {
@ -993,37 +996,50 @@ enum verbs_context_mask {
struct verbs_context {
/* "grows up" - new fields go here */
int (*_reserved_2) (void);
int (*destroy_flow) (struct ibv_flow *flow);
int (*_reserved_1) (void);
struct ibv_flow * (*create_flow) (struct ibv_qp *qp,
struct ibv_flow_attr *flow_attr);
int (*query_port)(struct ibv_context *context, uint8_t port_num,
struct ibv_port_attr *port_attr,
size_t port_attr_len);
int (*_reserved[25]) (void);
struct verbs_ex_private *priv;
int (*query_device_ex)(struct ibv_context *context,
const struct ibv_query_device_ex_input *input,
struct ibv_device_attr_ex *attr,
size_t attr_size);
int (*ibv_destroy_flow) (struct ibv_flow *flow);
void (*ABI_placeholder2) (void); /* DO NOT COPY THIS GARBAGE */
struct ibv_flow * (*ibv_create_flow) (struct ibv_qp *qp,
struct ibv_flow_attr *flow_attr);
void (*ABI_placeholder1) (void); /* DO NOT COPY THIS GARBAGE */
struct ibv_qp * (*open_qp)(struct ibv_context *context,
struct ibv_qp_open_attr *attr);
struct ibv_qp * (*create_qp_ex)(struct ibv_context *context,
struct ibv_qp_init_attr_ex *qp_init_attr_ex);
int (*get_srq_num)(struct ibv_srq *srq, uint32_t *srq_num);
struct ibv_srq * (*create_srq_ex)(struct ibv_context *context,
struct ibv_srq_init_attr_ex *srq_init_attr_ex);
struct ibv_xrcd * (*open_xrcd)(struct ibv_context *context,
struct ibv_xrcd_init_attr *xrcd_init_attr);
int (*close_xrcd)(struct ibv_xrcd *xrcd);
uint64_t has_comp_mask;
size_t sz; /* Must be immediately before struct ibv_context */
struct ibv_context context;/* Must be last field in the struct */
struct ibv_srq * (*create_srq_ex)(struct ibv_context *context,
struct ibv_srq_init_attr_ex *srq_init_attr_ex);
struct ibv_xrcd * (*open_xrcd)(struct ibv_context *context,
struct ibv_xrcd_init_attr *xrcd_init_attr);
int (*close_xrcd)(struct ibv_xrcd *xrcd);
uint64_t _ABI_placeholder3;
size_t sz; /* Must be immediately before struct ibv_context */
struct ibv_context context; /* Must be last field in the struct */
};
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
/*static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
{
return (!ctx || (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)) ?
NULL : container_of(ctx, struct verbs_context, context);
if (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)
return NULL;
/* open code container_of to not pollute the global namespace */
return (struct verbs_context *)(((uintptr_t)ctx) -
offsetof(struct verbs_context,
context));
}
#define verbs_get_ctx_op(ctx, op) ({ \
struct verbs_context *_vctx = verbs_get_ctx(ctx); \
(!_vctx || (_vctx->sz < sizeof(*_vctx) - offsetof(struct verbs_context, op)) || \
!_vctx->op) ? NULL : _vctx; })*/
struct verbs_context *__vctx = verbs_get_ctx(ctx); \
(!__vctx || (__vctx->sz < sizeof(*__vctx) - offsetof(struct verbs_context, op)) || \
!__vctx->op) ? NULL : __vctx; })
#define verbs_set_ctx_op(_vctx, op, ptr) ({ \
struct verbs_context *vctx = _vctx; \
@ -1055,4 +1071,20 @@ struct ibv_ece {
uint32_t comp_mask;
};
/**
* ibv_query_port_ex - Get (extended) port properties
*/
static inline int ibv_query_port_ex(struct ibv_context *context,
uint8_t port_num,
struct ibv_port_attr *port_attr)
{
struct verbs_context *vctx = verbs_get_ctx_op(context, query_port);
if (vctx) {
return vctx->query_port(context, port_num, port_attr, sizeof(*port_attr));
}
return -1;
}
#endif // NCCL_IBV_CORE_H_

View File

@ -9,10 +9,16 @@
#include "nccl.h"
enum ncclPluginType {
ncclPluginTypeNet,
ncclPluginTypeTuner,
ncclPluginTypeProfiler,
};
void* ncclOpenNetPluginLib(const char* name);
void* ncclOpenTunerPluginLib(const char* name);
void* ncclOpenProfilerPluginLib(const char* name);
void* ncclGetNetPluginLib(void);
ncclResult_t ncclClosePluginLib(void* handle);
void* ncclGetNetPluginLib(enum ncclPluginType type);
ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type);
#endif

View File

@ -1507,7 +1507,7 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
int minCTAsEnv;
int maxCTAsEnv;
int splitShareEnv;
int collnetEnableEnv;
const char* collnetEnableEnv;
int ctaPolicyEnv;
int shrinkShareEnv;
int nvlsCTAsEnv;
@ -1561,9 +1561,15 @@ static ncclResult_t envConfigOverride(ncclComm_t comm) {
comm->config.shrinkShare = shrinkShareEnv;
}
collnetEnableEnv = ncclParamCollnetEnable();
if (collnetEnableEnv != NCCL_CONFIG_UNDEF_INT) {
comm->config.collnetEnable = collnetEnableEnv;
// NCCL_COLLNET_ENABLE needs to be re-read on every comm init,
// since users might change the env on the fly to enable/disable collnet
collnetEnableEnv = ncclGetEnv("NCCL_COLLNET_ENABLE");
if (collnetEnableEnv != NULL) {
int collnetEnableInt = (int)strtol(collnetEnableEnv, NULL, 0);
if (collnetEnableInt != NCCL_CONFIG_UNDEF_INT) {
comm->config.collnetEnable = collnetEnableInt;
INFO(NCCL_ENV, "NCCL_COLLNET_ENABLE set by environment to %d.", collnetEnableInt);
}
}
ctaPolicyEnv = ncclParamCtaPolicy();
@ -2164,6 +2170,7 @@ ncclResult_t ncclCommDestroy(ncclComm_t comm) {
NVTX3_PAYLOAD(comm->commHash, nranks, rank, cudaDev));
TRACE(NCCL_INIT, "comm %p rank %d nRanks %d cudaDev %d busId %lx", comm, rank, nranks, cudaDev, comm->busId);
NCCLCHECK(ncclGroupStartInternal());
// Try to prevent a double free of the comm struct (user error)
if (comm->rank == -1 || comm->nRanks == -1 || comm->cudaDev == -1 || comm->busId == -1) {
WARN("comm %p has already been destroyed", comm);
@ -2178,6 +2185,8 @@ ncclResult_t ncclCommDestroy(ncclComm_t comm) {
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, commReclaim, NULL, free, comm), res, fail);
exit:
ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
fail:
goto exit;
@ -2201,6 +2210,7 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
if (comm == NULL) {
return ncclSuccess;
}
NCCLCHECK(ncclGroupStartInternal());
// Ask anything that might still be running on the device to quit
NCCLCHECK(setCommAbortFlags(comm,1));
comm->destroyFlag = 1;
@ -2223,7 +2233,9 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, commReclaim, NULL, free, comm), res, fail);
exit:
return ncclSuccess;
ncclGroupErrCheck(res);
NCCLCHECK(ncclGroupEndInternal());
return res;
fail:
goto exit;
}

View File

@ -142,8 +142,14 @@ ncclResult_t wrap_ibv_query_device(struct ibv_context *context, struct ibv_devic
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_device, ibv_internal_query_device(context, device_attr), 0, "ibv_query_device");
}
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) { /*returns 0 on success, or the value of errno on failure (which indicates the failure reason)*/
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_port, ibv_internal_query_port(context, port_num, port_attr), 0, "ibv_query_port");
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) {
// First try to query the extended port attributes (e.g. active_speed_ex)
if (ibv_query_port_ex(context, port_num, port_attr) != 0) {
// Fall back to the original attribute API call, but zero all members first
memset(port_attr, 0, sizeof(*port_attr));
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_port, ibv_internal_query_port(context, port_num, port_attr), 0, "ibv_query_port");
}
return ncclSuccess;
}
ncclResult_t wrap_ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid *gid) {

View File

@ -52,6 +52,9 @@ ncclResult_t buildMlx5dvSymbols(struct ncclMlx5dvSymbols* mlx5dvSymbols) {
#define LOAD_SYM_VERSION(handle, symbol, funcptr, version) do { \
cast = (void**)&funcptr; \
*cast = dlvsym(handle, symbol, version); \
if (*cast == NULL) { \
INFO(NCCL_NET, "dlvsym failed on %s - %s version %s", symbol, dlerror(), version); \
} \
} while (0)
LOAD_SYM(mlx5dvhandle, "mlx5dv_is_supported", mlx5dvSymbols->mlx5dv_internal_is_supported);

View File

@ -441,7 +441,8 @@ static ncclResult_t socketTryAccept(struct ncclSocket* sock) {
if (sock->fd != -1) {
sock->state = ncclSocketStateAccepted;
} else if (errno == ENETDOWN || errno == EPROTO || errno == ENOPROTOOPT || errno == EHOSTDOWN ||
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH) {
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH ||
errno == EINTR) {
/* per accept's man page, for linux sockets, the following errors might be already pending errors
* and should be considered as EAGAIN. To avoid infinite loop in case of errors, we use the retry count*/
if (++sock->errorRetries == ncclParamRetryCnt()) {

View File

@ -21,7 +21,6 @@ struct ncclStrongStreamCapture {
cudaGraph_t graph;
unsigned long long graphId;
cudaStream_t captureStream;
cudaGraphNode_t lastRecord;
void* acquiredBy;
};
@ -216,7 +215,6 @@ ncclResult_t ncclStrongStreamAcquire(
CUDACHECKGOTO(cudaStreamCreateWithFlags(&cap->captureStream, cudaStreamNonBlocking), ret, do_unlock);
}
cap->graphId = graph.graphId;
cap->lastRecord = nullptr;
cap->acquiredBy = localThreadId();
// Push to capturing list.
cap->next = ss->captureHead;
@ -296,16 +294,6 @@ ncclResult_t ncclStrongStreamRelease(
cudaGraphNode_t recordNode;
CUDACHECK(cudaGraphAddEventRecordNode(&recordNode, graph.graph, nullptr, 0, ss->serialEvent));
// Make this record order after previous record on this stream.
if (cap->lastRecord != nullptr) {
#if CUDART_VERSION >= 13000
CUDACHECK(cudaGraphAddDependencies_v2(graph.graph, &cap->lastRecord, &recordNode, nullptr, 1));
#else
CUDACHECK(cudaGraphAddDependencies(graph.graph, &cap->lastRecord, &recordNode, 1));
#endif
}
cap->lastRecord = recordNode;
// Get current nodes from work stream so we can add them as dependencies.
cudaStreamCaptureStatus status;
cudaGraphNode_t const* nodes;
@ -338,6 +326,22 @@ ncclResult_t ncclStrongStreamRelease(
}
}
// Make every future operation captured on cap->captureStream depend on 'recordNode'.
#if CUDART_VERSION >= 13000
CUDACHECK(cudaStreamUpdateCaptureDependencies_v2(
cap->captureStream,
&recordNode, /* dependencies */
/*edges =*/ nullptr, /* no edge annotations */
1, /* count */
cudaStreamSetCaptureDependencies));
#else
CUDACHECK(cudaStreamUpdateCaptureDependencies(
cap->captureStream,
&recordNode,
1,
cudaStreamSetCaptureDependencies));
#endif
if (cap->acquiredBy != localThreadId() && ncclParamLaunchRaceFatal()) {
WARN("%s", launchRaceFatalMsg);
return ncclInvalidUsage;

View File

@ -16,12 +16,12 @@
#include <cuda_fp8.h>
#endif
#define NCCL_MAJOR ${nccl:Major}
#define NCCL_MINOR ${nccl:Minor}
#define NCCL_PATCH ${nccl:Patch}
#define NCCL_SUFFIX "${nccl:Suffix}"
#define NCCL_MAJOR ${nccl_Major}
#define NCCL_MINOR ${nccl_Minor}
#define NCCL_PATCH ${nccl_Patch}
#define NCCL_SUFFIX "${nccl_Suffix}"
#define NCCL_VERSION_CODE ${nccl:Version}
#define NCCL_VERSION_CODE ${nccl_Version}
#define NCCL_VERSION(X,Y,Z) (((X) <= 2 && (Y) <= 8) ? (X) * 1000 + (Y) * 100 + (Z) : (X) * 10000 + (Y) * 100 + (Z))
#ifdef __cplusplus

View File

@ -67,7 +67,7 @@ static pthread_once_t initPluginLibsOnceControl = PTHREAD_ONCE_INIT;
static ncclResult_t ncclNetPluginUnload(netPluginLib_t* pluginLib) {
if ((pluginLib->dlHandle) && ((pluginLib->ncclNetPluginRefCount) == 0)) {
INFO(NCCL_INIT|NCCL_NET, "Unloading plugin %s", pluginLib->name);
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle));
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle, ncclPluginTypeNet));
memset(pluginLib, 0, sizeof(netPluginLib_t));
}
return ncclSuccess;
@ -105,8 +105,9 @@ exit:
return ncclSuccess;
fail:
if (pluginLib->dlHandle) {
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle));
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle, ncclPluginTypeNet));
}
pluginLib->dlHandle = nullptr;
pluginLib->ncclNetPluginState = ncclNetPluginStateLoadFailed;
pluginLib->ncclCollNetPluginState = ncclNetPluginStateLoadFailed;
goto exit;

View File

@ -10,16 +10,12 @@
#include <dlfcn.h>
#include "debug.h"
#include "plugin.h"
#define MAX_STR_LEN 255
enum ncclPluginType {
ncclPluginTypeNet,
ncclPluginTypeTuner,
ncclPluginTypeProfiler,
};
#define NUM_LIBS 3
static char* libNames[NUM_LIBS];
static void *libHandles[NUM_LIBS];
static const char *pluginNames[NUM_LIBS] = { "NET", "TUNER", "PROFILER" };
static const char *pluginPrefix[NUM_LIBS] = { "libnccl-net", "libnccl-tuner", "libnccl-profiler" };
@ -61,24 +57,26 @@ static void* openPluginLib(enum ncclPluginType type, const char* libName) {
char eNoEntNameList[PATH_MAX] = { 0 };
if (libName && strlen(libName)) {
// match names that start with 'lib' and end with '.so'
if (strlen(libName) >= strlen("libX.so") && strncmp(libName, "lib", strlen("lib")) == 0 && strncmp(libName + strlen(libName) - strlen(".so"), ".so", strlen(".so")) == 0) {
snprintf(libName_, MAX_STR_LEN, "%s", libName);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
appendNameToList(eNoEntNameList, &len, libName_);
} else {
INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
}
snprintf(libName_, MAX_STR_LEN, "%s", libName);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
libNames[type] = strdup(libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
appendNameToList(eNoEntNameList, &len, libName_);
} else {
INFO(subsys[type], "%s/Plugin: %s", pluginNames[type], openErrStr);
}
// libName can't be a relative or absolute path (start with '.' or contain any '/'). It can't be a library name either (start with 'lib' or end with '.so')
if (strchr(libName, '/') == nullptr && (strncmp(libName, "lib", strlen("lib")) || strlen(libName) < strlen(".so") || strncmp(libName + strlen(libName) - strlen(".so"), ".so", strlen(".so")))) {
snprintf(libName_, MAX_STR_LEN, "%s-%s.so", pluginPrefix[type], libName);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
libNames[type] = strdup(libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
@ -91,6 +89,7 @@ static void* openPluginLib(enum ncclPluginType type, const char* libName) {
snprintf(libName_, MAX_STR_LEN, "%s.so", pluginPrefix[type]);
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
if (libHandles[type]) {
libNames[type] = strdup(libName_);
return libHandles[type];
}
if (openErr == ENOENT) {
@ -120,22 +119,21 @@ void* ncclOpenProfilerPluginLib(const char* name) {
return openPluginLib(ncclPluginTypeProfiler, name);
}
void* ncclGetNetPluginLib(void) {
return libHandles[ncclPluginTypeNet];
void* ncclGetNetPluginLib(enum ncclPluginType type) {
if (libNames[ncclPluginTypeNet]) {
// increment the reference counter of the net library
libNames[type] = strdup(libNames[ncclPluginTypeNet]);
libHandles[type] = dlopen(libNames[ncclPluginTypeNet], RTLD_NOW | RTLD_LOCAL);
}
return libHandles[type];
}
ncclResult_t ncclClosePluginLib(void* handle) {
bool found = false;
for (int l=0; l<NUM_LIBS; l++) {
if (libHandles[l] == handle) {
libHandles[l] = nullptr;
if (!found) {
if (handle) {
dlclose(handle);
}
found = true;
}
}
ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type) {
if (handle && libHandles[type] == handle) {
dlclose(handle);
libHandles[type] = nullptr;
free(libNames[type]);
libNames[type] = nullptr;
}
return ncclSuccess;
}

View File

@ -77,7 +77,8 @@ exit:
pthread_mutex_unlock(&profilerLock);
return ncclSuccess;
fail:
if (profilerPluginLib) NCCLCHECK(ncclClosePluginLib(profilerPluginLib));
if (profilerPluginLib) NCCLCHECK(ncclClosePluginLib(profilerPluginLib, ncclPluginTypeProfiler));
profilerPluginLib = nullptr;
profilerPluginStatus = profilerPluginLoadFailed;
goto exit;
}
@ -86,7 +87,7 @@ static ncclResult_t ncclProfilerPluginUnload(void) {
pthread_mutex_lock(&profilerLock);
if (0 == (--profilerPluginRefCount)) {
INFO(NCCL_ENV, "PROFILER/Plugin: Closing profiler plugin %s", ncclProfiler->name);
NCCLCHECK(ncclClosePluginLib(profilerPluginLib));
NCCLCHECK(ncclClosePluginLib(profilerPluginLib, ncclPluginTypeProfiler));
profilerPluginLib = nullptr;
ncclProfiler = nullptr;
profilerPluginStatus = profilerPluginLoadReady;

View File

@ -52,7 +52,7 @@ ncclResult_t ncclTunerPluginLoad(struct ncclComm* comm) {
tunerPluginLib = ncclOpenTunerPluginLib(ncclGetEnv("NCCL_TUNER_PLUGIN"));
if (nullptr == tunerPluginLib) {
tunerPluginLib = ncclGetNetPluginLib();
tunerPluginLib = ncclGetNetPluginLib(ncclPluginTypeTuner);
if (nullptr == tunerPluginLib) {
goto fail;
}
@ -78,6 +78,7 @@ exit:
pthread_mutex_unlock(&tunerPluginLock);
return ncclSuccess;
fail:
if (tunerPluginLib) NCCLCHECK(ncclClosePluginLib(tunerPluginLib, ncclPluginTypeTuner));
tunerPluginLib = nullptr;
status = tunerPluginLoadFailed;
goto exit;
@ -87,7 +88,7 @@ ncclResult_t ncclTunerPluginUnload(struct ncclComm* comm) {
pthread_mutex_lock(&tunerPluginLock);
if (comm->tunerPluginLoaded && 0 == (--tunerPluginRefCount)) {
INFO(NCCL_TUNING, "TUNER/Plugin: Closing tuner: '%s'", tunerSymbol->name);
NCCLCHECK(ncclClosePluginLib(tunerPluginLib));
NCCLCHECK(ncclClosePluginLib(tunerPluginLib, ncclPluginTypeTuner));
tunerPluginLib = nullptr;
tunerSymbol = nullptr;
comm->tuner = nullptr;

View File

@ -494,7 +494,9 @@ static int ibvSpeeds[] = {
14000, /* FDR */
25000, /* EDR */
50000, /* HDR */
100000 /* NDR */ };
100000, /* NDR */
200000 /* XDR */
};
static int firstBitSet(int val, int max) {
int i = 0;
@ -650,12 +652,15 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
enum ncclIbProvider ibProvider = IB_PROVIDER_NONE;
char dataDirectDevicePath[PATH_MAX];
int dataDirectSupported = 0;
int skipNetDevForDataDirect = 0;
if (wrap_mlx5dv_is_supported(devices[d])) {
ibProvider = IB_PROVIDER_MLX5;
snprintf(dataDirectDevicePath, PATH_MAX, "/sys");
if((ncclMlx5dvDmaBufCapable(context)) && (wrap_mlx5dv_get_data_direct_sysfs_path(context, dataDirectDevicePath + 4, PATH_MAX - 4) == ncclSuccess)) {
INFO(NCCL_NET, "Data Direct DMA Interface is detected for device:%s", devices[d]->name);
if(ncclParamIbDataDirect()) dataDirectSupported = 1;
INFO(NCCL_INIT|NCCL_NET, "NET/IB: Data Direct DMA Interface is detected for device:%s", devices[d]->name);
// Now check whether Data Direct has been disabled by the user
if(ncclParamIbDataDirect() == 1) { dataDirectSupported = 1; skipNetDevForDataDirect = 1; }
if(ncclParamIbDataDirect() == 2) { dataDirectSupported = 1; skipNetDevForDataDirect = 0; }
}
}
int nPorts = 0;
@ -667,7 +672,8 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
continue;
}
for (int port_num = 1; port_num <= devAttr.phys_port_cnt; port_num++) {
for (int dataDirect = 0; dataDirect < 1 + dataDirectSupported; ++dataDirect) {
// dataDirect = 0 exposes the devices normally, dataDirect = 1 exposes the devices through direct NIC
for (int dataDirect = skipNetDevForDataDirect; dataDirect < 1 + dataDirectSupported; ++dataDirect) {
struct ibv_port_attr portAttr;
if (ncclSuccess != wrap_ibv_query_port(context, port_num, &portAttr)) {
WARN("NET/IB : Unable to query port_num %d", port_num);
@ -688,15 +694,18 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
ncclIbDevs[ncclNIbDevs].portAttr = portAttr;
ncclIbDevs[ncclNIbDevs].portNum = port_num;
ncclIbDevs[ncclNIbDevs].link = portAttr.link_layer;
ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
if (portAttr.active_speed_ex)
// A non-zero active_speed_ex indicates XDR rate (0x100) or higher
ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed_ex) * ncclIbWidth(portAttr.active_width);
else
ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
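// Rough illustration: an XDR port maps to the new 200000 Mb/s entry in ibvSpeeds above,
// so a 4-lane link would report 200000 * 4 = 800000 Mb/s (assuming ncclIbWidth() returns
// the lane count for the reported active_width).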
ncclIbDevs[ncclNIbDevs].context = context;
ncclIbDevs[ncclNIbDevs].pdRefs = 0;
ncclIbDevs[ncclNIbDevs].pd = NULL;
if (!dataDirect) {
strncpy(ncclIbDevs[ncclNIbDevs].devName, devices[d]->name, MAXNAMESIZE);
NCCLCHECKGOTO(ncclIbGetPciPath(ncclIbDevs[ncclNIbDevs].devName, &ncclIbDevs[ncclNIbDevs].pciPath, &ncclIbDevs[ncclNIbDevs].realPort), ret, fail);
}
else {
} else {
snprintf(ncclIbDevs[ncclNIbDevs].devName, MAXNAMESIZE, "%s_dma", devices[d]->name);
NCCLCHECK(ncclCalloc(&ncclIbDevs[ncclNIbDevs].pciPath, PATH_MAX));
strncpy(ncclIbDevs[ncclNIbDevs].pciPath, dataDirectDevicePath, PATH_MAX);