CUDA error when running model training in a GPU Docker environment

Resolved

2024-12-14

Hello,

In the OE J6E GPU Docker environment, QAT training following the Horizon Torch Samples fails with the following error:

RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

See the attachment for the full error log.

-----------------------------------------------------------------------------

Checking in the same environment

returns True.

Running

>>> import torch

>>> print(torch.rand(2,3).cuda())

prints

tensor([[0.0229, 0.4223, 0.2337], [0.4514, 0.2438, 0.6422]], device='cuda:0')
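A tensor landing on `cuda:0` only shows that device index 0 is usable. "invalid device ordinal" is generally raised when a process requests a device index that is not among the GPUs visible to it, so one thing worth checking (a possible cause, not confirmed by this post) is whether the training config asks for more devices than the container exposes. Below is a minimal sketch of that check; the `requested` list and the helper names are hypothetical, not part of the Horizon samples.

```python
import os


def parse_visible_devices(env_value):
    """Split a CUDA_VISIBLE_DEVICES value into its entries; None if unset."""
    if env_value is None:
        return None
    return [tok.strip() for tok in env_value.split(",") if tok.strip()]


def invalid_ordinals(requested_ids, device_count):
    """Return the requested device indices outside [0, device_count)."""
    return [i for i in requested_ids if not 0 <= i < device_count]


def visible_device_count():
    """Number of CUDA devices PyTorch can see (0 if torch or CUDA is missing)."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count() if torch.cuda.is_available() else 0


if __name__ == "__main__":
    # Hypothetical: the device ids that a training config asks for.
    requested = [0, 1, 2, 3]
    count = visible_device_count()
    env = os.environ.get("CUDA_VISIBLE_DEVICES")
    print("CUDA_VISIBLE_DEVICES entries:", parse_visible_devices(env))
    print("visible device count:", count)
    print("requested ids out of range:", invalid_ordinals(requested, count))
```

If the last line lists any ids, the device list in the training config (or the GPUs passed to the container) would need to be adjusted so that every requested index exists inside the container.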

Attachment:

Algorithm Toolchain

Journey 6