mirror of https://github.com/NVIDIA/nccl.git
Compare commits
4 Commits
01ed5e5087
...
cdafb2a10f
Author | SHA1 | Date |
---|---|---|
![]() |
cdafb2a10f | |
![]() |
0d1ece2b43 | |
![]() |
bfedf2629e | |
![]() |
7c12c627c6 |
|
@ -0,0 +1,77 @@
|
|||
name: NCCL issue or bug
|
||||
description: Report an issue or failure when running NCCL code
|
||||
title: "[Issue]: "
|
||||
labels: ["triage"]
|
||||
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
Thanks for reaching out! Before reporting a new issue, please feel free to search for the behavior in the existing issues. If you found an issue which is already closed or you are unsure, open a new issue and reference the old one from it.
|
||||
You can also check out the [troubleshooting section](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html) in our user guide.
|
||||
|
||||
---
|
||||
|
||||
To ensure we can assist you quickly and accurately, we often need the following information:
|
||||
- type: dropdown
|
||||
id: type
|
||||
attributes:
|
||||
label: How is this issue impacting you?
|
||||
description: What best describes your issue?
|
||||
options:
|
||||
- Lower performance than expected
|
||||
- Application crash
|
||||
- Data corruption
|
||||
- Application hang
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
id: log
|
||||
attributes:
|
||||
label: Share Your Debug Logs
|
||||
description: |
|
||||
|
||||
The logs and topo-files are a great tool to pin down issues. You can create them by setting these environment variables before the run.
|
||||
* `NCCL_DEBUG=INFO` and `NCCL_DEBUG_FILE=ncclDebug.%h.%p` to produce one file per rank
|
||||
* `NCCL_TOPO_DUMP_FILE=ncclSystem.txt`
|
||||
|
||||
- type: textarea
|
||||
id: repro
|
||||
attributes:
|
||||
label: Steps to Reproduce the Issue
|
||||
description: |
|
||||
* **Minimal Steps**: Please provide a simple way to recreate the issue (see [Minimal Bug Reports](https://matthewrocklin.com/minimal-bug-reports) for inspiration).
|
||||
* **Environment Details**: Include software versions and relevant settings.
|
||||
* **Intermittency**: Is this a sporadic issue? If so, how often does it occur?
|
||||
* **Previous Success**: Did this work with an older NCCL version?
|
||||
|
||||
The easier we can reproduce on our side the more likely we are to be able to solve it in a timely manner.
|
||||
|
||||
- type: input
|
||||
id: nccl_version
|
||||
attributes:
|
||||
label: NCCL Version
|
||||
description: |
|
||||
NCCL reports its version string in the debug logs.
|
||||
You can also determine the version if you know which library was used by running `strings libnccl.so | grep 'NCCL version'`.
|
||||
placeholder: "e.g. 2.27.1+cuda12.8"
|
||||
validations:
|
||||
required: true
|
||||
|
||||
- type: textarea
|
||||
id: platform
|
||||
attributes:
|
||||
label: Your platform details
|
||||
description: |
|
||||
* **GPU & Network**: Share your architecture and topology (e.g., from `nvidia-smi`, `nvidia-smi topo -m`, `ibstatus`).
|
||||
* **Environment**: Bare-metal, containers, or cloud?
|
||||
* **Scalability**: Does this issue occur with a specific number of ranks/nodes?
|
||||
|
||||
- type: textarea
|
||||
id: issue-description
|
||||
attributes:
|
||||
label: Error Message & Behavior
|
||||
description: |
|
||||
* **First Error**: What was the initial `NCCL WARN` message in your logs?
|
||||
* **Expected vs. Actual**: Briefly describe the anticipated behavior versus what you're seeing.
|
|
@ -0,0 +1,15 @@
|
|||
name: NCCL question
|
||||
description: Ask the NCCL team a question
|
||||
title: "[Question]: "
|
||||
labels: ["question"]
|
||||
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
Thanks for reaching out! To solve your problem, feel free to check out the [user guide](https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html), in particular the troubleshooting section, and also the [release notes](https://docs.nvidia.com/deeplearning/nccl/release-notes/index.html).
|
||||
---
|
||||
- type: textarea
|
||||
id: question
|
||||
attributes:
|
||||
label: Question
|
|
@ -0,0 +1,22 @@
|
|||
name: NCCL request for enhancement
|
||||
description: Request for enhancement
|
||||
title: "[RFE]: "
|
||||
labels: ["enhancement"]
|
||||
body:
|
||||
- type: markdown
|
||||
attributes:
|
||||
value: |
|
||||
|
||||
Thanks for your feedback! Before reporting a new RFE you could quickly check if this already exists in our [existing requests](https://github.com/NVIDIA/nccl/issues?q=sort%3Aupdated-desc%20is%3Aissue%20is%3Aopen%20label%3Aenhancement).
|
||||
|
||||
---
|
||||
- type: textarea
|
||||
id: rfe-description
|
||||
attributes:
|
||||
label: Please provide the below details to ensure we understand your needs
|
||||
description: |
|
||||
* What is the goal of this request?
|
||||
* Who will benefit from this feature?
|
||||
* Is this request for a specific GPU architecture or network infrastructure?
|
||||
* How will this feature improve current workflows or processes?
|
||||
* What is the priority level of this request?
|
|
@ -0,0 +1 @@
|
|||
blank_issues_enabled: false
|
|
@ -0,0 +1,79 @@
|
|||
const { Octokit } = require("@octokit/rest");
|
||||
|
||||
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
|
||||
|
||||
const owner = process.env.REPO_OWNER;
|
||||
const repo = process.env.REPO_NAME.split('/').pop(); // Handles owner/repo format
|
||||
|
||||
const now = new Date();
|
||||
const sixMonthsAgo = new Date(now);
|
||||
sixMonthsAgo.setMonth(now.getMonth() - 6);
|
||||
const oneMonthAgo = new Date(now);
|
||||
oneMonthAgo.setMonth(now.getMonth() - 1);
|
||||
|
||||
async function closeOldIssues() {
|
||||
let page = 1;
|
||||
let closedCount = 0;
|
||||
|
||||
// write a multiline comment into a variable:
|
||||
let body = `### Issue Cleanup: Helping Us Focus on Current Challenges
|
||||
|
||||
We're [reviewing](https://github.com/NVIDIA/nccl/discussions/1761) older issues to ensure we prioritize the most relevant and active ones. Since this issue hasn't seen updates in over 6 months, we'll be closing it for now.
|
||||
|
||||
*This change helps us focus our efforts on addressing any current issues our users are facing.* If this issue still affects you, please don't hesitate to reopen it with a quick update (e.g., \"Still relevant on [version=X]\").
|
||||
Thanks for your understanding and for contributing to NCCL.`;
|
||||
|
||||
while (true) {
|
||||
const { data: issues } = await octokit.issues.listForRepo({
|
||||
owner,
|
||||
repo,
|
||||
state: "open",
|
||||
per_page: 100,
|
||||
page,
|
||||
});
|
||||
|
||||
if (issues.length === 0) break;
|
||||
|
||||
for (const issue of issues) {
|
||||
// Ignore PRs
|
||||
if (issue.pull_request) continue;
|
||||
|
||||
// Ignore issues with label "ongoing"
|
||||
if (issue.labels.some(label => label.name === "ongoing")) continue;
|
||||
|
||||
const createdAt = new Date(issue.created_at);
|
||||
const updatedAt = new Date(issue.updated_at);
|
||||
|
||||
if (createdAt < sixMonthsAgo && updatedAt < sixMonthsAgo) {
|
||||
|
||||
// Add a comment before closing
|
||||
await octokit.issues.createComment({
|
||||
owner,
|
||||
repo,
|
||||
issue_number: issue.number,
|
||||
body: body,
|
||||
});
|
||||
|
||||
await octokit.issues.update({
|
||||
owner,
|
||||
repo,
|
||||
issue_number: issue.number,
|
||||
state: "closed",
|
||||
state_reason: "not_planned",
|
||||
});
|
||||
closedCount++;
|
||||
console.log(`Closed issue #${issue.number}`);
|
||||
|
||||
// Break out if we have closed 100 issues
|
||||
if (closedCount >= 100) {
|
||||
console.log("Closed 100 issues, stopping.");
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
page++;
|
||||
}
|
||||
console.log(`Total closed: ${closedCount}`);
|
||||
}
|
||||
|
||||
closeOldIssues().catch(console.error);
|
|
@ -0,0 +1,31 @@
|
|||
name: Close Old Issues
|
||||
|
||||
on:
|
||||
schedule:
|
||||
- cron: '30 2 * * *' # Runs daily at 02:30 UTC
|
||||
workflow_dispatch:
|
||||
|
||||
permissions:
|
||||
issues: write
|
||||
|
||||
jobs:
|
||||
close-old-issues:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Checkout repository
|
||||
uses: actions/checkout@v4
|
||||
|
||||
- name: Setup Node.js
|
||||
uses: actions/setup-node@v4
|
||||
with:
|
||||
node-version: 20
|
||||
|
||||
- name: Install dependencies
|
||||
run: npm install @octokit/rest@22.0.0
|
||||
|
||||
- name: Run close-old-issues script
|
||||
run: node .github/workflows/close-old-issues.js
|
||||
env:
|
||||
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
|
||||
REPO_OWNER: ${{ github.repository_owner }}
|
||||
REPO_NAME: ${{ github.event.repository.name || github.repository }}
|
|
@ -76,7 +76,7 @@ $(info NVCC_GENCODE is ${NVCC_GENCODE})
|
|||
ifeq ($(shell test "0$(CUDA_MAJOR)" -ge 13; echo $$?),0)
|
||||
CXXSTD ?= -std=c++17
|
||||
else
|
||||
CXXSTD ?= -std=c++11
|
||||
CXXSTD ?= -std=c++14
|
||||
endif
|
||||
|
||||
CXXFLAGS := -DCUDA_MAJOR=$(CUDA_MAJOR) -DCUDA_MINOR=$(CUDA_MINOR) -fPIC -fvisibility=hidden \
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
##### version
|
||||
NCCL_MAJOR := 2
|
||||
NCCL_MINOR := 27
|
||||
NCCL_PATCH := 5
|
||||
NCCL_PATCH := 6
|
||||
NCCL_SUFFIX :=
|
||||
PKG_REVISION := 1
|
||||
|
|
|
@ -709,8 +709,8 @@ ncclResult_t ncclTopoComputePaths(struct ncclTopoSystem* system, struct ncclComm
|
|||
peerNode->paths[GPU][g].type <= PATH_NVL &&
|
||||
/* and (3) is on the same node as us */
|
||||
NCCL_TOPO_ID_SYSTEM_ID(peerNode->id) == NCCL_TOPO_ID_SYSTEM_ID(gpu->id) &&
|
||||
/* and (4) has either higher bw to that NIC or avoid going through the CPU*/
|
||||
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || gpu->paths[NET][n].type > pxnType))
|
||||
/* and (4) has either higher bw to that NIC or avoid going through the CPU (path.type is > PATH_PXN)*/
|
||||
(peerNode->paths[NET][n].bw > gpu->paths[NET][n].bw || gpu->paths[NET][n].type > PATH_PXN))
|
||||
// We can use that GPU as relay to communicate with that NIC.
|
||||
// Only enabling it in the GPU->NIC direction for now to favor
|
||||
// receiving locally and sending remotely (consistent with net.cc)
|
||||
|
|
|
@ -960,7 +960,7 @@ float sm90SpeedArrayInter[] = { 48.0, 45.0, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0,
|
|||
#define NSPEEDSINTER_SM90 (sizeof(sm90SpeedArrayInter)/sizeof(float))
|
||||
|
||||
float sm100SpeedArrayIntra[] = { 90.0, 80.0, 70.0, 60.0, 50.0, 40.0, 30.0, 24.0, 20.0, 19.0, 18.0 };
|
||||
float sm100SpeedArrayInter[] = { 48.0, 45.1, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
|
||||
float sm100SpeedArrayInter[] = { 96.0, 48.0, 45.1, 42.0, 40.0, 30.0, 24.0, 22.0, 20.0, 17.5, 15.0, 12.0, 6.0, 3.0, 2.4, 1.2, 0.24, 0.12 };
|
||||
#define NSPEEDSINTRA_SM100 (sizeof(sm100SpeedArrayIntra)/sizeof(float))
|
||||
#define NSPEEDSINTER_SM100 (sizeof(sm100SpeedArrayInter)/sizeof(float))
|
||||
|
||||
|
@ -1307,7 +1307,8 @@ ncclResult_t ncclTopoGetNetDev(struct ncclComm* comm, int rank, struct ncclTopoG
|
|||
NCCLCHECK(ncclTopoGetLocalGpu(comm->topo, netId, &g2));
|
||||
if (g2 != -1) {
|
||||
struct ncclTopoNode* peerGpu = comm->topo->nodes[GPU].nodes+g2;
|
||||
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= PATH_PXB) {
|
||||
int pxnType = ncclParamPxnC2c() ? PATH_P2C : PATH_PXB;
|
||||
if (peerGpu->paths[GPU][g1].type <= PATH_NVL && peerGpu->paths[NET][n].type <= pxnType) {
|
||||
*proxyRank = peerGpu->gpu.rank;
|
||||
if (dev) *dev = netDev;
|
||||
if (id) *id = netId;
|
||||
|
|
|
@ -98,6 +98,8 @@ extern const char* topoLinkTypeStr[];
|
|||
#define PATH_DIS 11
|
||||
extern const char* topoPathTypeStr[];
|
||||
|
||||
extern int64_t ncclParamPxnC2c();
|
||||
|
||||
struct ncclTopoNode;
|
||||
struct ncclTopoLink {
|
||||
int type;
|
||||
|
|
|
@ -9,6 +9,7 @@
|
|||
#include <stdint.h>
|
||||
#include <sys/types.h>
|
||||
#include <unistd.h>
|
||||
#include <string.h>
|
||||
|
||||
#if __GNUC__ >= 3
|
||||
# define __attribute_const __attribute__((const))
|
||||
|
@ -39,7 +40,7 @@ union ibv_gid {
|
|||
#define vext_field_avail(type, fld, sz) (offsetof(type, fld) < (sz))
|
||||
|
||||
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
|
||||
//static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
|
||||
static void *__VERBS_ABI_IS_EXTENDED = ((uint8_t *)NULL) - 1;
|
||||
|
||||
enum ibv_node_type {
|
||||
IBV_NODE_UNKNOWN = -1,
|
||||
|
@ -208,7 +209,9 @@ struct ibv_port_attr {
|
|||
uint8_t active_speed;
|
||||
uint8_t phys_state;
|
||||
uint8_t link_layer;
|
||||
uint8_t reserved;
|
||||
uint8_t flags;
|
||||
uint16_t port_cap_flags2;
|
||||
uint32_t active_speed_ex;
|
||||
};
|
||||
|
||||
enum ibv_event_type {
|
||||
|
@ -993,11 +996,20 @@ enum verbs_context_mask {
|
|||
|
||||
struct verbs_context {
|
||||
/* "grows up" - new fields go here */
|
||||
int (*_reserved_2) (void);
|
||||
int (*destroy_flow) (struct ibv_flow *flow);
|
||||
int (*_reserved_1) (void);
|
||||
struct ibv_flow * (*create_flow) (struct ibv_qp *qp,
|
||||
int (*query_port)(struct ibv_context *context, uint8_t port_num,
|
||||
struct ibv_port_attr *port_attr,
|
||||
size_t port_attr_len);
|
||||
int (*_reserved[25]) (void);
|
||||
struct verbs_ex_private *priv;
|
||||
int (*query_device_ex)(struct ibv_context *context,
|
||||
const struct ibv_query_device_ex_input *input,
|
||||
struct ibv_device_attr_ex *attr,
|
||||
size_t attr_size);
|
||||
int (*ibv_destroy_flow) (struct ibv_flow *flow);
|
||||
void (*ABI_placeholder2) (void); /* DO NOT COPY THIS GARBAGE */
|
||||
struct ibv_flow * (*ibv_create_flow) (struct ibv_qp *qp,
|
||||
struct ibv_flow_attr *flow_attr);
|
||||
void (*ABI_placeholder1) (void); /* DO NOT COPY THIS GARBAGE */
|
||||
struct ibv_qp * (*open_qp)(struct ibv_context *context,
|
||||
struct ibv_qp_open_attr *attr);
|
||||
struct ibv_qp * (*create_qp_ex)(struct ibv_context *context,
|
||||
|
@ -1008,22 +1020,26 @@ struct verbs_context {
|
|||
struct ibv_xrcd * (*open_xrcd)(struct ibv_context *context,
|
||||
struct ibv_xrcd_init_attr *xrcd_init_attr);
|
||||
int (*close_xrcd)(struct ibv_xrcd *xrcd);
|
||||
uint64_t has_comp_mask;
|
||||
uint64_t _ABI_placeholder3;
|
||||
size_t sz; /* Must be immediately before struct ibv_context */
|
||||
struct ibv_context context; /* Must be last field in the struct */
|
||||
};
|
||||
|
||||
/*XXX:__VERBS_ABI_IS_EXTENDED produces warning "integer operation result is out of range" with g++ 4.8.2*/
|
||||
/*static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
|
||||
static inline struct verbs_context *verbs_get_ctx(struct ibv_context *ctx)
|
||||
{
|
||||
return (!ctx || (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)) ?
|
||||
NULL : container_of(ctx, struct verbs_context, context);
|
||||
if (ctx->abi_compat != __VERBS_ABI_IS_EXTENDED)
|
||||
return NULL;
|
||||
|
||||
/* open code container_of to not pollute the global namespace */
|
||||
return (struct verbs_context *)(((uintptr_t)ctx) -
|
||||
offsetof(struct verbs_context,
|
||||
context));
|
||||
}
|
||||
|
||||
#define verbs_get_ctx_op(ctx, op) ({ \
|
||||
struct verbs_context *_vctx = verbs_get_ctx(ctx); \
|
||||
(!_vctx || (_vctx->sz < sizeof(*_vctx) - offsetof(struct verbs_context, op)) || \
|
||||
!_vctx->op) ? NULL : _vctx; })*/
|
||||
struct verbs_context *__vctx = verbs_get_ctx(ctx); \
|
||||
(!__vctx || (__vctx->sz < sizeof(*__vctx) - offsetof(struct verbs_context, op)) || \
|
||||
!__vctx->op) ? NULL : __vctx; })
|
||||
|
||||
#define verbs_set_ctx_op(_vctx, op, ptr) ({ \
|
||||
struct verbs_context *vctx = _vctx; \
|
||||
|
@ -1055,4 +1071,20 @@ struct ibv_ece {
|
|||
uint32_t comp_mask;
|
||||
};
|
||||
|
||||
/**
|
||||
* ibv_query_port_ex - Get (extended) port properties
|
||||
*/
|
||||
static inline int ibv_query_port_ex(struct ibv_context *context,
|
||||
uint8_t port_num,
|
||||
struct ibv_port_attr *port_attr)
|
||||
{
|
||||
struct verbs_context *vctx = verbs_get_ctx_op(context, query_port);
|
||||
|
||||
if (vctx) {
|
||||
return vctx->query_port(context, port_num, port_attr, sizeof(*port_attr));
|
||||
}
|
||||
|
||||
return -1;
|
||||
}
|
||||
|
||||
#endif // NCCL_IBV_CORE_H_
|
||||
|
|
|
@ -9,10 +9,16 @@
|
|||
|
||||
#include "nccl.h"
|
||||
|
||||
enum ncclPluginType {
|
||||
ncclPluginTypeNet,
|
||||
ncclPluginTypeTuner,
|
||||
ncclPluginTypeProfiler,
|
||||
};
|
||||
|
||||
void* ncclOpenNetPluginLib(const char* name);
|
||||
void* ncclOpenTunerPluginLib(const char* name);
|
||||
void* ncclOpenProfilerPluginLib(const char* name);
|
||||
void* ncclGetNetPluginLib(void);
|
||||
ncclResult_t ncclClosePluginLib(void* handle);
|
||||
void* ncclGetNetPluginLib(enum ncclPluginType type);
|
||||
ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type);
|
||||
|
||||
#endif
|
||||
|
|
|
@ -2170,6 +2170,7 @@ ncclResult_t ncclCommDestroy(ncclComm_t comm) {
|
|||
NVTX3_PAYLOAD(comm->commHash, nranks, rank, cudaDev));
|
||||
|
||||
TRACE(NCCL_INIT, "comm %p rank %d nRanks %d cudaDev %d busId %lx", comm, rank, nranks, cudaDev, comm->busId);
|
||||
NCCLCHECK(ncclGroupStartInternal());
|
||||
// Try and prevent a double free of the comm struct (user error)
|
||||
if (comm->rank == -1 || comm->nRanks == -1 || comm->cudaDev == -1 || comm->busId == -1) {
|
||||
WARN("comm %p has already been destroyed", comm);
|
||||
|
@ -2184,6 +2185,8 @@ ncclResult_t ncclCommDestroy(ncclComm_t comm) {
|
|||
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, commReclaim, NULL, free, comm), res, fail);
|
||||
|
||||
exit:
|
||||
ncclGroupErrCheck(res);
|
||||
NCCLCHECK(ncclGroupEndInternal());
|
||||
return res;
|
||||
fail:
|
||||
goto exit;
|
||||
|
@ -2207,6 +2210,7 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
|
|||
if (comm == NULL) {
|
||||
return ncclSuccess;
|
||||
}
|
||||
NCCLCHECK(ncclGroupStartInternal());
|
||||
// Ask anything that might still be running on the device to quit
|
||||
NCCLCHECK(setCommAbortFlags(comm,1));
|
||||
comm->destroyFlag = 1;
|
||||
|
@ -2229,7 +2233,9 @@ ncclResult_t ncclCommAbort(ncclComm_t comm) {
|
|||
NCCLCHECKGOTO(ncclAsyncLaunch((struct ncclAsyncJob*)job, commReclaim, NULL, free, comm), res, fail);
|
||||
|
||||
exit:
|
||||
return ncclSuccess;
|
||||
ncclGroupErrCheck(res);
|
||||
NCCLCHECK(ncclGroupEndInternal());
|
||||
return res;
|
||||
fail:
|
||||
goto exit;
|
||||
}
|
||||
|
|
|
@ -142,9 +142,15 @@ ncclResult_t wrap_ibv_query_device(struct ibv_context *context, struct ibv_devic
|
|||
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_device, ibv_internal_query_device(context, device_attr), 0, "ibv_query_device");
|
||||
}
|
||||
|
||||
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) { /*returns 0 on success, or the value of errno on failure (which indicates the failure reason)*/
|
||||
ncclResult_t wrap_ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr) {
|
||||
// First try and query the extended port attributes (e.g. active_speed_ex)
|
||||
if (ibv_query_port_ex(context, port_num, port_attr) != 0) {
|
||||
// Fall back to the original attribute API call, but zero all members first
|
||||
memset(port_attr, 0, sizeof(*port_attr));
|
||||
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_port, ibv_internal_query_port(context, port_num, port_attr), 0, "ibv_query_port");
|
||||
}
|
||||
return ncclSuccess;
|
||||
}
|
||||
|
||||
ncclResult_t wrap_ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid *gid) {
|
||||
IBV_INT_CHECK_RET_ERRNO(ibvSymbols, ibv_internal_query_gid, ibv_internal_query_gid(context, port_num, index, gid), 0, "ibv_query_gid");
|
||||
|
|
|
@ -441,7 +441,8 @@ static ncclResult_t socketTryAccept(struct ncclSocket* sock) {
|
|||
if (sock->fd != -1) {
|
||||
sock->state = ncclSocketStateAccepted;
|
||||
} else if (errno == ENETDOWN || errno == EPROTO || errno == ENOPROTOOPT || errno == EHOSTDOWN ||
|
||||
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH) {
|
||||
errno == ENONET || errno == EHOSTUNREACH || errno == EOPNOTSUPP || errno == ENETUNREACH ||
|
||||
errno == EINTR) {
|
||||
/* per accept's man page, for linux sockets, the following errors might be already pending errors
|
||||
* and should be considered as EAGAIN. To avoid infinite loop in case of errors, we use the retry count*/
|
||||
if (++sock->errorRetries == ncclParamRetryCnt()) {
|
||||
|
|
|
@ -67,7 +67,7 @@ static pthread_once_t initPluginLibsOnceControl = PTHREAD_ONCE_INIT;
|
|||
static ncclResult_t ncclNetPluginUnload(netPluginLib_t* pluginLib) {
|
||||
if ((pluginLib->dlHandle) && ((pluginLib->ncclNetPluginRefCount) == 0)) {
|
||||
INFO(NCCL_INIT|NCCL_NET, "Unloading plugin %s", pluginLib->name);
|
||||
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle));
|
||||
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle, ncclPluginTypeNet));
|
||||
memset(pluginLib, 0, sizeof(netPluginLib_t));
|
||||
}
|
||||
return ncclSuccess;
|
||||
|
@ -105,8 +105,9 @@ exit:
|
|||
return ncclSuccess;
|
||||
fail:
|
||||
if (pluginLib->dlHandle) {
|
||||
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle));
|
||||
NCCLCHECK(ncclClosePluginLib(pluginLib->dlHandle, ncclPluginTypeNet));
|
||||
}
|
||||
pluginLib->dlHandle = nullptr;
|
||||
pluginLib->ncclNetPluginState = ncclNetPluginStateLoadFailed;
|
||||
pluginLib->ncclCollNetPluginState = ncclNetPluginStateLoadFailed;
|
||||
goto exit;
|
||||
|
|
|
@ -10,16 +10,12 @@
|
|||
#include <dlfcn.h>
|
||||
|
||||
#include "debug.h"
|
||||
#include "plugin.h"
|
||||
|
||||
#define MAX_STR_LEN 255
|
||||
|
||||
enum ncclPluginType {
|
||||
ncclPluginTypeNet,
|
||||
ncclPluginTypeTuner,
|
||||
ncclPluginTypeProfiler,
|
||||
};
|
||||
|
||||
#define NUM_LIBS 3
|
||||
static char* libNames[NUM_LIBS];
|
||||
static void *libHandles[NUM_LIBS];
|
||||
static const char *pluginNames[NUM_LIBS] = { "NET", "TUNER", "PROFILER" };
|
||||
static const char *pluginPrefix[NUM_LIBS] = { "libnccl-net", "libnccl-tuner", "libnccl-profiler" };
|
||||
|
@ -65,6 +61,7 @@ static void* openPluginLib(enum ncclPluginType type, const char* libName) {
|
|||
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
|
||||
if (libHandles[type]) {
|
||||
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
|
||||
libNames[type] = strdup(libName_);
|
||||
return libHandles[type];
|
||||
}
|
||||
if (openErr == ENOENT) {
|
||||
|
@ -79,6 +76,7 @@ static void* openPluginLib(enum ncclPluginType type, const char* libName) {
|
|||
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
|
||||
if (libHandles[type]) {
|
||||
INFO(subsys[type], "%s/Plugin: Plugin name set by env to %s", pluginNames[type], libName_);
|
||||
libNames[type] = strdup(libName_);
|
||||
return libHandles[type];
|
||||
}
|
||||
if (openErr == ENOENT) {
|
||||
|
@ -91,6 +89,7 @@ static void* openPluginLib(enum ncclPluginType type, const char* libName) {
|
|||
snprintf(libName_, MAX_STR_LEN, "%s.so", pluginPrefix[type]);
|
||||
libHandles[type] = tryOpenLib(libName_, &openErr, openErrStr);
|
||||
if (libHandles[type]) {
|
||||
libNames[type] = strdup(libName_);
|
||||
return libHandles[type];
|
||||
}
|
||||
if (openErr == ENOENT) {
|
||||
|
@ -120,22 +119,21 @@ void* ncclOpenProfilerPluginLib(const char* name) {
|
|||
return openPluginLib(ncclPluginTypeProfiler, name);
|
||||
}
|
||||
|
||||
void* ncclGetNetPluginLib(void) {
|
||||
return libHandles[ncclPluginTypeNet];
|
||||
void* ncclGetNetPluginLib(enum ncclPluginType type) {
|
||||
if (libNames[ncclPluginTypeNet]) {
|
||||
// increment the reference counter of the net library
|
||||
libNames[type] = strdup(libNames[ncclPluginTypeNet]);
|
||||
libHandles[type] = dlopen(libNames[ncclPluginTypeNet], RTLD_NOW | RTLD_LOCAL);
|
||||
}
|
||||
return libHandles[type];
|
||||
}
|
||||
|
||||
ncclResult_t ncclClosePluginLib(void* handle) {
|
||||
bool found = false;
|
||||
for (int l=0; l<NUM_LIBS; l++) {
|
||||
if (libHandles[l] == handle) {
|
||||
libHandles[l] = nullptr;
|
||||
if (!found) {
|
||||
if (handle) {
|
||||
ncclResult_t ncclClosePluginLib(void* handle, enum ncclPluginType type) {
|
||||
if (handle && libHandles[type] == handle) {
|
||||
dlclose(handle);
|
||||
}
|
||||
found = true;
|
||||
}
|
||||
}
|
||||
libHandles[type] = nullptr;
|
||||
free(libNames[type]);
|
||||
libNames[type] = nullptr;
|
||||
}
|
||||
return ncclSuccess;
|
||||
}
|
||||
|
|
|
@ -77,7 +77,8 @@ exit:
|
|||
pthread_mutex_unlock(&profilerLock);
|
||||
return ncclSuccess;
|
||||
fail:
|
||||
if (profilerPluginLib) NCCLCHECK(ncclClosePluginLib(profilerPluginLib));
|
||||
if (profilerPluginLib) NCCLCHECK(ncclClosePluginLib(profilerPluginLib, ncclPluginTypeProfiler));
|
||||
profilerPluginLib = nullptr;
|
||||
profilerPluginStatus = profilerPluginLoadFailed;
|
||||
goto exit;
|
||||
}
|
||||
|
@ -86,7 +87,7 @@ static ncclResult_t ncclProfilerPluginUnload(void) {
|
|||
pthread_mutex_lock(&profilerLock);
|
||||
if (0 == (--profilerPluginRefCount)) {
|
||||
INFO(NCCL_ENV, "PROFILER/Plugin: Closing profiler plugin %s", ncclProfiler->name);
|
||||
NCCLCHECK(ncclClosePluginLib(profilerPluginLib));
|
||||
NCCLCHECK(ncclClosePluginLib(profilerPluginLib, ncclPluginTypeProfiler));
|
||||
profilerPluginLib = nullptr;
|
||||
ncclProfiler = nullptr;
|
||||
profilerPluginStatus = profilerPluginLoadReady;
|
||||
|
|
|
@ -52,7 +52,7 @@ ncclResult_t ncclTunerPluginLoad(struct ncclComm* comm) {
|
|||
|
||||
tunerPluginLib = ncclOpenTunerPluginLib(ncclGetEnv("NCCL_TUNER_PLUGIN"));
|
||||
if (nullptr == tunerPluginLib) {
|
||||
tunerPluginLib = ncclGetNetPluginLib();
|
||||
tunerPluginLib = ncclGetNetPluginLib(ncclPluginTypeTuner);
|
||||
if (nullptr == tunerPluginLib) {
|
||||
goto fail;
|
||||
}
|
||||
|
@ -78,6 +78,7 @@ exit:
|
|||
pthread_mutex_unlock(&tunerPluginLock);
|
||||
return ncclSuccess;
|
||||
fail:
|
||||
if (tunerPluginLib) NCCLCHECK(ncclClosePluginLib(tunerPluginLib, ncclPluginTypeTuner));
|
||||
tunerPluginLib = nullptr;
|
||||
status = tunerPluginLoadFailed;
|
||||
goto exit;
|
||||
|
@ -87,7 +88,7 @@ ncclResult_t ncclTunerPluginUnload(struct ncclComm* comm) {
|
|||
pthread_mutex_lock(&tunerPluginLock);
|
||||
if (comm->tunerPluginLoaded && 0 == (--tunerPluginRefCount)) {
|
||||
INFO(NCCL_TUNING, "TUNER/Plugin: Closing tuner: '%s'", tunerSymbol->name);
|
||||
NCCLCHECK(ncclClosePluginLib(tunerPluginLib));
|
||||
NCCLCHECK(ncclClosePluginLib(tunerPluginLib, ncclPluginTypeTuner));
|
||||
tunerPluginLib = nullptr;
|
||||
tunerSymbol = nullptr;
|
||||
comm->tuner = nullptr;
|
||||
|
|
|
@ -652,12 +652,15 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
|
|||
enum ncclIbProvider ibProvider = IB_PROVIDER_NONE;
|
||||
char dataDirectDevicePath[PATH_MAX];
|
||||
int dataDirectSupported = 0;
|
||||
int skipNetDevForDataDirect = 0;
|
||||
if (wrap_mlx5dv_is_supported(devices[d])) {
|
||||
ibProvider = IB_PROVIDER_MLX5;
|
||||
snprintf(dataDirectDevicePath, PATH_MAX, "/sys");
|
||||
if((ncclMlx5dvDmaBufCapable(context)) && (wrap_mlx5dv_get_data_direct_sysfs_path(context, dataDirectDevicePath + 4, PATH_MAX - 4) == ncclSuccess)) {
|
||||
INFO(NCCL_INIT|NCCL_NET, "Data Direct DMA Interface is detected for device:%s", devices[d]->name);
|
||||
if(ncclParamIbDataDirect()) dataDirectSupported = 1;
|
||||
INFO(NCCL_INIT|NCCL_NET, "NET/IB: Data Direct DMA Interface is detected for device:%s", devices[d]->name);
|
||||
// Now check whether Data Direct has been disabled by the user
|
||||
if(ncclParamIbDataDirect() == 1) { dataDirectSupported = 1; skipNetDevForDataDirect = 1; }
|
||||
if(ncclParamIbDataDirect() == 2) { dataDirectSupported = 1; skipNetDevForDataDirect = 0; }
|
||||
}
|
||||
}
|
||||
int nPorts = 0;
|
||||
|
@ -669,7 +672,8 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
|
|||
continue;
|
||||
}
|
||||
for (int port_num = 1; port_num <= devAttr.phys_port_cnt; port_num++) {
|
||||
for (int dataDirect = 0; dataDirect < 1 + dataDirectSupported; ++dataDirect) {
|
||||
// dataDirect = 0 exposes the devices normally, dataDirect = 1 exposes the devices through direct NIC
|
||||
for (int dataDirect = skipNetDevForDataDirect; dataDirect < 1 + dataDirectSupported; ++dataDirect) {
|
||||
struct ibv_port_attr portAttr;
|
||||
if (ncclSuccess != wrap_ibv_query_port(context, port_num, &portAttr)) {
|
||||
WARN("NET/IB : Unable to query port_num %d", port_num);
|
||||
|
@ -690,6 +694,10 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
|
|||
ncclIbDevs[ncclNIbDevs].portAttr = portAttr;
|
||||
ncclIbDevs[ncclNIbDevs].portNum = port_num;
|
||||
ncclIbDevs[ncclNIbDevs].link = portAttr.link_layer;
|
||||
if (portAttr.active_speed_ex)
|
||||
// A non-zero active_speed_ex indicates XDR rate (0x100) or higher
|
||||
ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed_ex) * ncclIbWidth(portAttr.active_width);
|
||||
else
|
||||
ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
|
||||
ncclIbDevs[ncclNIbDevs].context = context;
|
||||
ncclIbDevs[ncclNIbDevs].pdRefs = 0;
|
||||
|
@ -697,8 +705,7 @@ ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction, ncclProfilerCallback_t pr
|
|||
if (!dataDirect) {
|
||||
strncpy(ncclIbDevs[ncclNIbDevs].devName, devices[d]->name, MAXNAMESIZE);
|
||||
NCCLCHECKGOTO(ncclIbGetPciPath(ncclIbDevs[ncclNIbDevs].devName, &ncclIbDevs[ncclNIbDevs].pciPath, &ncclIbDevs[ncclNIbDevs].realPort), ret, fail);
|
||||
}
|
||||
else {
|
||||
} else {
|
||||
snprintf(ncclIbDevs[ncclNIbDevs].devName, MAXNAMESIZE, "%s_dma", devices[d]->name);
|
||||
NCCLCHECK(ncclCalloc(&ncclIbDevs[ncclNIbDevs].pciPath, PATH_MAX));
|
||||
strncpy(ncclIbDevs[ncclNIbDevs].pciPath, dataDirectDevicePath, PATH_MAX);
|
||||
|
|
Loading…
Reference in New Issue