mirror of https://github.com/NVIDIA/nccl.git
Compare commits
4 Commits
e96ec28ae2...56cbf555c5
| Author | SHA1 | Date |
|---|---|---|
| | 56cbf555c5 | |
| | 593de54e52 | |
| | 0d1ece2b43 | |
| | bfedf2629e | |
@@ -0,0 +1,77 @@
name: NCCL issue or bug
description: Report an issue or failure when running NCCL code
title: "[Issue]: "
labels: ["triage"]

body:
  - type: markdown
    attributes:
      value: |
        Thanks for reaching out! Before reporting a new issue, please feel free to search for the behavior in the existing issues. If you found an issue which is already closed or you are unsure, open a new issue and reference the old one from it.
        You can also check out the [troubleshooting section](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html) in our user guide.

        ---

        To ensure we can assist you quickly and accurately, we often need the following information:
  - type: dropdown
    id: type
    attributes:
      label: How is this issue impacting you?
      description: What best describes your issue?
      options:
        - Lower performance than expected
        - Application crash
        - Data corruption
        - Application hang
    validations:
      required: true

  - type: textarea
    id: log
    attributes:
      label: Share Your Debug Logs
      description: |

        The logs and topo-files are a great tool to pin down issues. You can create them by setting these environment variables before the run.
        * `NCCL_DEBUG=INFO` and `NCCL_DEBUG_FILE=ncclDebug.%h.%p` to produce one file per rank
        * `NCCL_TOPO_DUMP_FILE=ncclSystem.txt`

  - type: textarea
    id: repro
    attributes:
      label: Steps to Reproduce the Issue
      description: |
        * **Minimal Steps**: Please provide a simple way to recreate the issue (see [Minimal Bug Reports](https://matthewrocklin.com/minimal-bug-reports) for inspiration).
        * **Environment Details**: Include software versions and relevant settings.
        * **Intermittency**: Is this a sporadic issue? If so, how often does it occur?
        * **Previous Success**: Did this work with an older NCCL version?

        The easier we can reproduce on our side the more likely we are to be able to solve it in a timely manner.

  - type: input
    id: nccl_version
    attributes:
      label: NCCL Version
      description: |
        NCCL reports its version string in the debug logs.
        You can also determine the version if you know which library was used by running `strings libnccl.so | grep 'NCCL version'`.
      placeholder: "e.g. 2.27.1+cuda12.8"
    validations:
      required: true

  - type: textarea
    id: platform
    attributes:
      label: Your platform details
      description: |
        * **GPU & Network**: Share your architecture and topology (e.g., from `nvidia-smi`, `nvidia-smi topo -m`, `ibstatus`).
        * **Environment**: Bare-metal, containers, or cloud?
        * **Scalability**: Does this issue occur with a specific number of ranks/nodes?

  - type: textarea
    id: issue-description
    attributes:
      label: Error Message & Behavior
      description: |
        * **First Error**: What was the initial `NCCL WARN` message in your logs?
        * **Expected vs. Actual**: Briefly describe the anticipated behavior versus what you're seeing.
@@ -0,0 +1,15 @@
name: NCCL question
description: Ask the NCCL team a question
title: "[Question]: "
labels: ["question"]

body:
  - type: markdown
    attributes:
      value: |
        Thanks for reaching out! To solve your problem, feel free to check out the [user guide](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html), in particular the troubleshooting section, and also the [release notes](https://docs.nvidia.com/deeplearning/nccl/release-notes/index.html).
        ---
  - type: textarea
    id: question
    attributes:
      label: Question
@@ -0,0 +1,22 @@
name: NCCL request for enhancement
description: Request for enhancement
title: "[RFE]: "
labels: ["enhancement"]
body:
  - type: markdown
    attributes:
      value: |

        Thanks for your feedback! Before reporting a new RFE you could quickly check if this already exists in our [existing requests](https://github.com/NVIDIA/nccl/issues?q=sort%3Aupdated-desc%20is%3Aissue%20is%3Aopen%20label%3Aenhancement).

        ---
  - type: textarea
    id: rfe-description
    attributes:
      label: Please provide the below details to ensure we understand your needs
      description: |
        * What is the goal of this request?
        * Who will benefit from this feature?
        * Is this request for a specific GPU architecture or network infrastructure?
        * How will this feature improve current workflows or processes?
        * What is the priority level of this request?
@@ -0,0 +1 @@
blank_issues_enabled: false
@@ -0,0 +1,79 @@
const { Octokit } = require("@octokit/rest");

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

const owner = process.env.REPO_OWNER;
const repo = process.env.REPO_NAME.split('/').pop(); // Handles owner/repo format

const now = new Date();
const sixMonthsAgo = new Date(now);
sixMonthsAgo.setMonth(now.getMonth() - 6);
const oneMonthAgo = new Date(now);
oneMonthAgo.setMonth(now.getMonth() - 1);

async function closeOldIssues() {
  let page = 1;
  let closedCount = 0;

  // write a multiline comment into a variable:
  let body = `### Issue Cleanup: Helping Us Focus on Current Challenges

We're [reviewing](https://github.com/NVIDIA/nccl/discussions/1761) older issues to ensure we prioritize the most relevant and active ones. Since this issue hasn't seen updates in over 6 months, we'll be closing it for now.

*This change helps us focus our efforts on addressing any current issues our users are facing.* If this issue still affects you, please don't hesitate to reopen it with a quick update (e.g., \"Still relevant on [version=X]\").
Thanks for your understanding and for contributing to NCCL.`;

  while (true) {
    const { data: issues } = await octokit.issues.listForRepo({
      owner,
      repo,
      state: "open",
      per_page: 100,
      page,
    });

    if (issues.length === 0) break;

    for (const issue of issues) {
      // Ignore PRs
      if (issue.pull_request) continue;

      // Ignore issues with label "ongoing"
      if (issue.labels.some(label => label.name === "ongoing")) continue;

      const createdAt = new Date(issue.created_at);
      const updatedAt = new Date(issue.updated_at);

      if (createdAt < sixMonthsAgo && updatedAt < sixMonthsAgo) {

        // Add a comment before closing
        await octokit.issues.createComment({
          owner,
          repo,
          issue_number: issue.number,
          body: body,
        });

        await octokit.issues.update({
          owner,
          repo,
          issue_number: issue.number,
          state: "closed",
          state_reason: "not_planned",
        });
        closedCount++;
        console.log(`Closed issue #${issue.number}`);

        // Break out if we have closed 100 issues
        if (closedCount >= 100) {
          console.log("Closed 100 issues, stopping.");
          return;
        }
      }
    }
    page++;
  }
  console.log(`Total closed: ${closedCount}`);
}

closeOldIssues().catch(console.error);
@@ -0,0 +1,31 @@
name: Close Old Issues

on:
  schedule:
    - cron: '30 2 * * *' # Runs daily at 02:30 UTC
  workflow_dispatch:

permissions:
  issues: write

jobs:
  close-old-issues:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: 20

      - name: Install dependencies
        run: npm install @octokit/rest@22.0.0

      - name: Run close-old-issues script
        run: node .github/workflows/close-old-issues.js
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          REPO_OWNER: ${{ github.repository_owner }}
          REPO_NAME: ${{ github.event.repository.name || github.repository }}
@@ -0,0 +1,182 @@
# NCCL Tuner Plugin Development

This directory contains resources and examples for developing NCCL tuner plugins. Tuner plugins allow you to customize NCCL's algorithm and protocol selection behavior to optimize performance for specific workloads and hardware configurations.

## Overview

NCCL tuner plugins provide a way to influence NCCL's automatic algorithm and protocol selection by modifying the cost tables that NCCL uses to make decisions. This allows you to:

- Override default algorithm/protocol combinations for specific collective operations
- Customize tuning based on message size, topology, and other parameters
- Implement sophisticated tuning strategies without recompiling NCCL
- Optimize performance for specific hardware configurations or workloads

## Tuner Plugin Interface

NCCL tuner plugins must implement the `ncclTuner_t` interface defined in `nccl_tuner.h` under `nccl/src/include/plugin`. Each example plugin carries its own forked copy of these definitions in `tuner.h`, and plugin implementors are expected to fork the internal NCCL definitions in the same way. The current interface includes:

```c
// Initialize the tuner plugin
ncclResult_t (*init)(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context);

// Get and modify collective operation cost information
ncclResult_t (*getCollInfo)(void* context, ncclFunc_t collType, size_t nBytes,
                            int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                            int regBuff, int* nChannels);

// Clean up plugin resources
ncclResult_t (*destroy)(void* context);
```

## Development Guidelines

### 1. Plugin Structure

A typical tuner plugin should:
- Include the necessary forked NCCL headers (`tuner.h`)
- Implement all required interface functions
- Export the plugin structure with the appropriate version
- Handle all input parameters gracefully

### 2. Cost Table Modification

The `getCollInfo` function receives a cost table that maps algorithm/protocol combinations to performance costs. Lower costs indicate preferred combinations. You can:

- Set costs to `0.0` to make combinations highly preferred
- Set costs to `NCCL_ALGO_PROTO_IGNORE` to disable combinations
- Use relative costs to create preferences between options
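
The three adjustments above can be sketched as follows. This is a minimal illustration, not the reference implementation: the numeric values of the macros are stand-ins (a real plugin takes them from its forked `tuner.h`), and `DEMO_NUM_ALGOS`, `setCost`, and `applyPreferences` are hypothetical names. The sketch also assumes that an entry already marked `NCCL_ALGO_PROTO_IGNORE` should stay disabled.

```c
#include <assert.h>

/* Stand-in values for illustration only; a real plugin takes these from tuner.h. */
#define DEMO_NUM_ALGOS 2               /* hypothetical: this sketch only models two algorithms */
#define NCCL_NUM_PROTOCOLS 3
#define NCCL_ALGO_TREE 0
#define NCCL_ALGO_RING 1
#define NCCL_PROTO_LL128 1
#define NCCL_PROTO_SIMPLE 2
#define NCCL_ALGO_PROTO_IGNORE -1.0f

/* Hypothetical helper: overwrite one entry if it is in range and not already disabled. */
static void setCost(float (*table)[NCCL_NUM_PROTOCOLS], int numAlgo, int numProto,
                    int algo, int proto, float cost) {
  if (algo < 0 || algo >= numAlgo || proto < 0 || proto >= numProto) return;
  if (table[algo][proto] == NCCL_ALGO_PROTO_IGNORE) return;
  table[algo][proto] = cost;
}

/* Hypothetical policy combining the three kinds of adjustment. */
static void applyPreferences(float** collCostTable, int numAlgo, int numProto) {
  float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
  /* Highly preferred: cost 0.0 */
  setCost(table, numAlgo, numProto, NCCL_ALGO_RING, NCCL_PROTO_SIMPLE, 0.0f);
  /* Relative cost: second choice after RING+SIMPLE */
  setCost(table, numAlgo, numProto, NCCL_ALGO_RING, NCCL_PROTO_LL128, 1.0f);
  /* Disabled combination */
  setCost(table, numAlgo, numProto, NCCL_ALGO_TREE, NCCL_PROTO_LL128, NCCL_ALGO_PROTO_IGNORE);
}
```

Entries the policy does not touch keep whatever cost NCCL placed in the table, so the plugin only expresses the preferences it actually has.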

### 3. Channel Management

The `nChannels` parameter allows you to:
- Set a specific number of channels to use
- Return the original value to preserve NCCL's default behavior
- Implement dynamic channel selection based on message size or topology
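
A minimal sketch of size-based channel selection, assuming hypothetical thresholds and channel counts (`pickChannels`, the 4 KiB and 8 MiB cut-offs, and the values 1 and 8 are illustrative, not recommendations):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical policy: returns the channel count to write into *nChannels.
 * `current` is the value the plugin received; returning it unchanged
 * preserves NCCL's default behavior. */
static int pickChannels(size_t nBytes, int current) {
  if (nBytes <= 4096) return 1;                 /* small messages: one channel */
  if (nBytes >= ((size_t)8 << 20)) return 8;    /* large messages: more channels */
  return current;                               /* otherwise keep the original value */
}
```

Inside `getCollInfo` this would be used as `*nChannels = pickChannels(nBytes, *nChannels);`.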

### 4. Error Handling

Always return appropriate `ncclResult_t` values:
- `ncclSuccess` for successful or ignored operations
- `ncclInternalError` for plugin-specific errors. Returning an error is advisable only during plugin initialization and destruction, because a failed plugin call at any other point can impose a very large overhead on users.
- Other NCCL error codes as appropriate
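
One way to follow this guidance is sketched below. The `ncclResult_t` enum is a stand-in for the definition in the forked headers, the function signatures are simplified, and `demoInit`, `demoGetCollInfo`, and the "zero ranks is invalid" check are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the NCCL result type; a real plugin uses the forked NCCL headers. */
typedef enum { ncclSuccess = 0, ncclInternalError = 3 } ncclResult_t;

/* Initialization is a reasonable place to report a plugin-specific error. */
static ncclResult_t demoInit(size_t nRanks, size_t nNodes) {
  if (nRanks == 0 || nNodes == 0) return ncclInternalError;  /* hypothetical validation */
  return ncclSuccess;
}

/* On the per-collective path, unusable input is ignored rather than reported:
 * the function leaves *nChannels untouched and still returns ncclSuccess. */
static ncclResult_t demoGetCollInfo(void* context, size_t nBytes, int* nChannels) {
  if (context == NULL || nChannels == NULL || nBytes == 0) return ncclSuccess;
  *nChannels = 1;  /* hypothetical tuning decision */
  return ncclSuccess;
}
```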

## Getting Started

### Option 1: Start with the Example Plugin

If you're new to tuner plugin development, start with the `example/` directory:

```bash
cd example/
make
```

This provides a CSV-based configuration system that you can customize or use as a template.

### Option 2: Use the Basic Plugin

For more customized tuning needs, you might want to start with a clean baseline. In that case, start from the basic plugin in the `basic/` directory:

```bash
cd basic/
make
```

## Building and Testing

### Build Requirements

- GCC or compatible C compiler
- NCCL headers (included in `nccl/` subdirectories)
- Make

### Build Process

Each plugin directory contains a Makefile:

```bash
cd basic/ # or example/
make
```

This generates a shared library (`.so` file) that can be loaded by NCCL.

### Loading the Plugin

Set `LD_LIBRARY_PATH` to include your plugin directory:

```bash
export LD_LIBRARY_PATH=/path/to/your/plugin:$LD_LIBRARY_PATH
```

Set `NCCL_TUNER_PLUGIN` to either the plugin name or the absolute path to the plugin file. Any one of the following works:

```bash
export NCCL_TUNER_PLUGIN=example
export NCCL_TUNER_PLUGIN=libnccl-tuner-example.so
export NCCL_TUNER_PLUGIN=/path/to/your/plugin/libnccl-tuner-example.so
```

NCCL will automatically discover and load the plugin based on the exported symbol names.

## Advanced Topics

### Plugin Versioning

NCCL supports multiple plugin interface versions. Make sure your plugin exports the correct version:

```c
const ncclTuner_v4_t ncclTunerPlugin_v4 = {
  .name = "YourPluginName",
  .init = yourInitFunction,
  .getCollInfo = yourGetCollInfoFunction,
  .destroy = yourDestroyFunction
};
```

### Multi-GPU and Multi-Node Considerations

Your plugin receives topology information (`nRanks`, `nNodes`) during initialization. Use this to:
- Implement topology-aware tuning strategies
- Handle single-node vs. multi-node optimizations differently
- Scale channel counts based on available hardware
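
As a rough sketch of how the two values might drive a decision, assuming they were saved at initialization: `demoTopology`, `ranksPerNode`, `isMultiNode`, and `channelsFor` are hypothetical names, and the "8 ranks per node gets 8 channels" rule is purely illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of what init() was told about the communicator. */
struct demoTopology {
  size_t nRanks;
  size_t nNodes;
};

/* Ranks per node, guarding against a zero node count. */
static size_t ranksPerNode(const struct demoTopology* t) {
  return t->nNodes == 0 ? 0 : t->nRanks / t->nNodes;
}

static int isMultiNode(const struct demoTopology* t) {
  return t->nNodes > 1;
}

/* Hypothetical policy: only override the channel count for dense multi-node jobs. */
static int channelsFor(const struct demoTopology* t, int current) {
  if (!isMultiNode(t)) return current;        /* single node: keep NCCL's value */
  return ranksPerNode(t) >= 8 ? 8 : current;  /* illustrative threshold */
}
```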

### Performance Optimization

- Keep plugin logic lightweight to avoid impacting NCCL performance
- Cache expensive computations when possible
- Use the logging system for debugging but avoid excessive output in production

## Debugging and Logging

Use NCCL's debug logging system:

```bash
export NCCL_DEBUG=INFO # General information
export NCCL_DEBUG_SUBSYS=TUNING
```

Within your plugin, use the provided `ncclDebugLogger_t` function for consistent logging.

## Best Practices

1. **Test thoroughly**: Verify your plugin works with various message sizes and topologies
2. **Handle edge cases**: Ensure your plugin behaves correctly with unusual input parameters
3. **Document your approach**: Clearly document your tuning strategy and configuration options
4. **Version your plugin**: Use meaningful version numbers and maintain backward compatibility
5. **Performance validation**: Measure the impact of your tuning decisions on real workloads

## Contributing

When developing new tuner plugins:
- Follow the existing code style and structure
- Include comprehensive documentation
- Add example configurations and test cases
- Consider contributing useful plugins back to the community

## Resources

- [NCCL Documentation](https://docs.nvidia.com/deeplearning/nccl/)
- Example plugin implementations in this directory

For questions and support, refer to the NCCL community resources and documentation.
@@ -0,0 +1,197 @@
# Basic NCCL Tuner Plugin

This directory contains a minimal placeholder implementation of an NCCL tuner plugin. It serves as a starting point for developing custom tuner plugins by providing the essential function stubs and interface structure required by NCCL.

## Purpose

This basic plugin is designed to:
- Provide a minimal working example of the NCCL tuner plugin interface
- Serve as a template for developing custom tuner plugins
- Demonstrate the required function signatures and structure
- Implement placeholder functionality that can be extended

## Implementation Details

The plugin implements the following functions:

### `pluginInit`
```c
ncclResult_t pluginInit(size_t nRanks, size_t nNodes, ncclDebugLogger_t logFunction, void **context)
```
- **Purpose**: Initialize the plugin with communicator information
- **Current Implementation**: Simple placeholder that returns success
- **Parameters**:
  - `nRanks`: Total number of ranks in the communicator
  - `nNodes`: Total number of nodes in the communicator
  - `logFunction`: NCCL debug logging function
  - `context`: Plugin context pointer (output)

### `pluginGetCollInfo`
```c
ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
                               int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                               int regBuff, int* nChannels)
```
- **Purpose**: Modify cost tables for collective operations
- **Current Implementation**:
  - Sets the RING+SIMPLE algorithm/protocol combination to cost 0.0 (highest preference)
  - Sets channel count to 1
- **Parameters**:
  - `context`: Plugin context from init
  - `collType`: Type of collective operation
  - `nBytes`: Message size in bytes
  - `numPipeOps`: Number of pipeline operations
  - `collCostTable`: Cost table to modify
  - `numAlgo`: Number of algorithms
  - `numProto`: Number of protocols
  - `regBuff`: Whether buffer can be registered
  - `nChannels`: Number of channels to use (output)

### `pluginDestroy`
```c
ncclResult_t pluginDestroy(void* context)
```
- **Purpose**: Clean up plugin resources
- **Current Implementation**: Simple placeholder that returns success

## Cost Table Structure

The plugin demonstrates how to modify NCCL's cost tables:

```c
float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;
```

The cost table is a 2D array where:
- First dimension: Algorithm index (e.g., `NCCL_ALGO_RING`)
- Second dimension: Protocol index (e.g., `NCCL_PROTO_SIMPLE`)
- Values: Cost for that algorithm/protocol combination

### Cost Values
- **0.0**: Highest preference (lowest cost)
- **Positive values**: Relative costs (lower is better)
- **`NCCL_ALGO_PROTO_IGNORE`**: Disable this combination
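
To make the "lower is better" rule concrete, the sketch below scans a table and reports the cheapest combination that is not disabled. It only illustrates how the cost values relate to each other and is not NCCL's actual selection code; `findCheapest` is a hypothetical helper and the macro values are stand-ins for the definitions in `tuner.h`.

```c
#include <assert.h>

/* Stand-in values for illustration only; a real plugin takes these from tuner.h. */
#define NCCL_NUM_PROTOCOLS 3
#define NCCL_ALGO_PROTO_IGNORE -1.0f

/* Hypothetical helper: find the lowest-cost entry, skipping disabled ones.
 * Returns 1 and fills *algo / *proto on success, 0 if every entry is disabled. */
static int findCheapest(float (*table)[NCCL_NUM_PROTOCOLS], int numAlgo, int numProto,
                        int* algo, int* proto) {
  int found = 0;
  float best = 0.0f;
  for (int a = 0; a < numAlgo; a++) {
    for (int p = 0; p < numProto; p++) {
      float cost = table[a][p];
      if (cost == NCCL_ALGO_PROTO_IGNORE) continue;  /* disabled combination */
      if (!found || cost < best) {
        best = cost;
        *algo = a;
        *proto = p;
        found = 1;
      }
    }
  }
  return found;
}
```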

## Building

```bash
make
```

This creates `libnccl-tuner-basic.so` which can be loaded by NCCL.

## Usage

### Loading the Plugin

Add the plugin directory to `LD_LIBRARY_PATH`:

```bash
export LD_LIBRARY_PATH=/path/to/basic:$LD_LIBRARY_PATH
```

Set `NCCL_TUNER_PLUGIN` to either the plugin name or the absolute path to the plugin file. Any one of the three assignments below works; then launch your application:

```bash
export NCCL_TUNER_PLUGIN=basic
export NCCL_TUNER_PLUGIN=libnccl-tuner-basic.so
export NCCL_TUNER_PLUGIN=/path/to/your/plugin/libnccl-tuner-basic.so
mpirun -np 4 your_nccl_application
```

### Verifying Plugin Loading

Enable NCCL debug output to see if the plugin is loaded:

```bash
export NCCL_DEBUG=INFO
```

You should see messages indicating the tuner plugin is being used.

## Extending the Plugin

This basic plugin provides a foundation that you can extend:

### 1. Add Configuration Logic

Modify `pluginGetCollInfo` to implement your tuning strategy:

```c
__hidden ncclResult_t pluginGetCollInfo(void* context, ncclFunc_t collType, size_t nBytes,
                                        int numPipeOps, float** collCostTable, int numAlgo, int numProto,
                                        int regBuff, int* nChannels) {
  // View the cost table as a 2D array indexed by [algorithm][protocol]
  float (*table)[NCCL_NUM_PROTOCOLS] = (float (*)[NCCL_NUM_PROTOCOLS])collCostTable;

  // Your custom tuning logic here
  if (nBytes < 1024) {
    // Small message optimization
    table[NCCL_ALGO_TREE][NCCL_PROTO_SIMPLE] = 0.0;
  } else {
    // Large message optimization
    table[NCCL_ALGO_RING][NCCL_PROTO_LL128] = 0.0;
  }

  // Dynamic channel selection
  *nChannels = (nBytes > 1024*1024) ? 4 : 1;

  return ncclSuccess;
}
```

### 2. Add Context Management

Use the context pointer to store plugin state:

```c
struct pluginContext {
  int initialized;
  size_t nRanks;
  size_t nNodes;
  // Add your plugin-specific data here
};
```
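
A minimal sketch of allocating this structure at initialization and releasing it at destruction. The `ncclResult_t` enum is a stand-in for the definition in the forked headers, and `contextCreate` and `contextDestroy` are hypothetical helpers that `pluginInit` and `pluginDestroy` could call; the real functions take additional parameters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the NCCL result type; a real plugin uses the forked NCCL headers. */
typedef enum { ncclSuccess = 0, ncclInternalError = 3 } ncclResult_t;

struct pluginContext {
  int initialized;
  size_t nRanks;
  size_t nNodes;
};

/* Hypothetical helper for pluginInit: allocate the state and hand it back via *context. */
static ncclResult_t contextCreate(size_t nRanks, size_t nNodes, void** context) {
  if (context == NULL) return ncclInternalError;
  struct pluginContext* ctx = calloc(1, sizeof(*ctx));
  if (ctx == NULL) return ncclInternalError;  /* allocation failure at init time */
  ctx->initialized = 1;
  ctx->nRanks = nRanks;
  ctx->nNodes = nNodes;
  *context = ctx;
  return ncclSuccess;
}

/* Hypothetical helper for pluginDestroy: release the state (free(NULL) is a no-op). */
static ncclResult_t contextDestroy(void* context) {
  free(context);
  return ncclSuccess;
}
```

Later calls such as `pluginGetCollInfo` can then cast their `context` argument back to `struct pluginContext*` to read the saved values.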

### 3. Add File-Based Configuration

Read configuration from files, environment variables, or other sources.

### 4. Add Topology Awareness

Use the `nRanks` and `nNodes` parameters to implement topology-specific tuning.

## File Structure

```
basic/
├── README.md          # This file
├── plugin.c           # Plugin implementation
├── Makefile           # Build configuration
└── nccl/              # NCCL header files
    └── tuner.h        # Tuner plugin interface definitions
```

## Next Steps

1. **Understand the Interface**: Study the function signatures and parameters
2. **Implement Your Logic**: Add your tuning strategy to `pluginGetCollInfo`
3. **Test Thoroughly**: Verify your plugin works with different message sizes and topologies
4. **Add Error Handling**: Implement proper error checking and resource management
5. **Document Your Changes**: Update this README with your specific implementation details

## Comparison with Example Plugin

- **Basic Plugin**: Minimal implementation, good for learning and simple use cases
- **Example Plugin**: Full-featured CSV-based configuration system, good for production use

Choose the basic plugin if you want to:
- Learn the tuner plugin interface
- Implement simple, hardcoded tuning strategies
- Build a custom plugin from scratch

Choose the example plugin if you want:
- File-based configuration
- Complex tuning strategies
- Production-ready features

## Resources

- [Parent Directory README](../README.md) - General tuner plugin development guide
- [Example Plugin](../example/README.md) - Fully featured implementation

This basic plugin provides the foundation you need to start developing custom NCCL tuner plugins. Extend it with your specific tuning logic and requirements.
@@ -104,7 +104,6 @@ Set the `NCCL_TUNER_CONFIG_FILE` environment variable to specify the config file
 
 ```bash
 export NCCL_TUNER_CONFIG_FILE=/path/to/your/tuner.conf
-export LD_LIBRARY_PATH=/path/to/plugin:$LD_LIBRARY_PATH
 mpirun -np 4 your_nccl_application
 ```
 
@@ -158,7 +157,7 @@ When channels is set to `-1`, NCCL's default channel selection logic is preserve
 
 1. **Config file not found**: Check the file path and permissions
 2. **Configurations not applied**: Verify the collective type, size ranges, algorithm/protocol names, and topology parameters
-3. **Plugin not loaded**: Ensure `LD_LIBRARY_PATH` includes the plugin directory
+3. **Plugin not loaded**: Ensure `LD_LIBRARY_PATH` includes the plugin directory and that `NCCL_TUNER_PLUGIN` either specifies the plugin name, or an absolute path to the plugin shared library.
 4. **No effect on performance**: Check that NCCL is actually using the tuner plugin with `NCCL_DEBUG=INFO`
 5. **Topology mismatch**: Verify that nNodes and nRanks match your actual setup, or use -1 for wildcards
 6. **CSV parsing errors**: Ensure no spaces after commas, or quote fields containing spaces
@@ -1,6 +1,6 @@
 ##### version
 NCCL_MAJOR := 2
 NCCL_MINOR := 27
-NCCL_PATCH := 6
+NCCL_PATCH := 7
 NCCL_SUFFIX :=
 PKG_REVISION := 1
@@ -38,12 +38,9 @@ ncclResult_t ncclInitKernelsForDevice(int cudaArch, int maxSharedMem, size_t* ma
     if (fn == nullptr) continue;
 
     cudaError_t errcode = cudaFuncGetAttributes(&attr, fn);
-    if (errcode == cudaErrorNoKernelImageForDevice) continue;
-    CUDACHECKGOTO(errcode, result, ignore0);
-
+    if (errcode != cudaSuccess) continue; // Silently ignore failures
     if (maxStackSize) {
       if (attr.localSizeBytes > *maxStackSize) *maxStackSize = attr.localSizeBytes;
-    ignore0:;
     }
     if (carveout) {
       CUDACHECKGOTO(cudaFuncSetAttribute(fn,