
Swin-T algorithm model: float model training issue

Resolved
kja23 · 2023-05-19

Hello, please describe the problem you are running into in detail; this will help us locate it quickly~

1. Chip model: J5
2. OpenExplorer (TianGongKaiWu) SDK version: J5_OE_1.1.37
3. Problem area: Swin-T algorithm model, float model training
4. Description: Running the command python3 tools/train.py --stage float --config configs/classification/horizon_swin_transformer.py fails with OSError: /usr/lib64/libcuda.so.1: file too short. Do I need to install CUDA on the dev machine, and if so, which CUDA version?
Attachments:
Tags: Algorithm Toolchain
Comments (1)
  • 颜值即正义
Hello, we recommend following the "reference environment setup" chapter of the Swin-T reference algorithm documentation and running Swin-T inside the Docker image provided by Horizon.
The Docker image is available from this thread: https://developer.horizon.ai/forumDetail/118363912788935318. Also, the CUDA version currently supported by the toolchain is 11.1.
    2023-05-19
• kja23 replying to 颜值即正义:
  I pulled the OE 1.1.37 package from ftp://j5ftp@vrftp.horizon.ai/ and entered Docker via sh run_docker.sh data. One more question: where do I get the ImageNet dataset used by the Swin-T reference algorithm?
      2023-05-19
• 颜值即正义 replying to kja23:
  If you have already run docker load on the offline image locally and run_docker.sh executed successfully, but you still get OSError: /usr/lib64/libcuda.so.1: file too short, then please check that your local machine meets the following environment requirements:

  (environment requirements table shown as an image in the original post)

  Note: Python 3.6 support will be dropped in a later release.

  In addition, you need to obtain the ImageNet dataset used by the Swin-T reference algorithm yourself from the official site https://www.image-net.org/download.php, then run the dataset packing script to pack it:

```shell
# pack train set
python3 tools/datasets/imagenet_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name train --num-workers 10 --pack-type lmdb
# pack val (test) set
python3 tools/datasets/imagenet_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name val --num-workers 10 --pack-type lmdb
```
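  Before running the packer, it can help to sanity-check that the raw ImageNet split directory follows the usual folder-per-class layout. The sketch below is generic and not taken from the toolchain itself; the function name and the layout assumption (one subdirectory per class under the split directory) are illustrative:

```python
import os

def check_imagenet_split(split_dir):
    """Sanity-check a raw ImageNet split directory: expect one
    subfolder per class, each containing at least one image file.
    Returns (number of class folders, total number of files)."""
    if not os.path.isdir(split_dir):
        raise FileNotFoundError(f"split directory not found: {split_dir}")
    class_dirs = [d for d in os.listdir(split_dir)
                  if os.path.isdir(os.path.join(split_dir, d))]
    n_images = sum(
        len(os.listdir(os.path.join(split_dir, d))) for d in class_dirs
    )
    return len(class_dirs), n_images
```

  A full ImageNet train split should report 1000 class folders; a very different number usually means the archive was extracted with the wrong structure and the packer will fail or produce an empty lmdb.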
      2023-05-19
• kja23 replying to 颜值即正义:
  Can I skip training and test an already-trained model directly on the board? What are the concrete steps? When I ran the quantize-and-compile command python3 tools/compile_perf.py --config configs/classification/horizon_swin_transformer.py --out-dir ./ --opt 3 directly, I got this error:

```
/root/.local/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2023-05-19 15:53:52,041 INFO Successfully convert float model to qat model.
Traceback (most recent call last):
  File "tools/compile_perf.py", line 190, in <module>
    compile_then_perf(
  File "tools/compile_perf.py", line 71, in compile_then_perf
    int_infer_trainer = build_from_registry(int_infer_trainer)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/registry.py", line 223, in _impl
    obj = build_from_cfg(OBJECT_REGISTRY, x)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/registry.py", line 98, in build_from_cfg
    instance = obj_cls(**cfg)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/engine/trainer.py", line 86, in __init__
    super(Trainer, self).__init__(
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/engine/loop_base.py", line 248, in __init__
    self.model = model_convert_pipeline(self.model)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/models/model_convert/pipelines.py", line 52, in __call__
    model = converter(model)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/models/model_convert/converters.py", line 234, in __call__
    model_checkpoint = load_checkpoint(
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/utils/checkpoint.py", line 103, in load_checkpoint
    path = get_hash_file_if_hashed_and_local(
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/utils/hash.py", line 202, in get_hash_file_if_hashed_and_local
    for name in os.listdir(dir_path):
FileNotFoundError: [Errno 2] No such file or directory: 'tmp_models/horizon_swin_transformer_cls'
```

      2023-05-19
• 颜值即正义 replying to kja23:
  Hello, the README under horizon_model_train_sample\scripts\configs\classification provides FTP download links for the model weights and the board-side hbm model. Download the hbm model, transfer it to the board, and then use the hrt_model_exec perf tool on the board to evaluate performance.

      2023-05-22
• 颜值即正义 replying to 颜值即正义:
  Latency test: hrt_model_exec perf --model_file model.hbm --core_id 1 --frame_count 1000 --thread_num 1 --profile_path "./"
  Dual-core FPS test: hrt_model_exec perf --model_file model.hbm --core_id 0 --frame_count 1000 --thread_num 8 --profile_path "./"

      2023-05-22
• 颜值即正义 replying to kja23:
  Also, the cause of your compile error is that the ckpt_dir path in the config file is not set correctly. The fix is to download the checkpoint via the link in the README and then set ckpt_dir to the actual path where you stored it.

      2023-05-22
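  As an illustration, if the downloaded checkpoint were stored under a local tmp_models folder, the config entry would look roughly like this. This is a sketch only: the exact variable layout inside configs/classification/horizon_swin_transformer.py may differ, and the path shown is a placeholder for wherever you actually saved the file:

```python
# In configs/classification/horizon_swin_transformer.py (sketch):
# point ckpt_dir at the directory holding the downloaded checkpoint,
# e.g. float-checkpoint-best.pth.tar fetched from the README's FTP link.
ckpt_dir = "./tmp_models/horizon_swin_transformer_cls"
```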
• kja23 replying to 颜值即正义:
  Sorry, going by the README under horizon_model_train_sample\scripts\configs\classification that you mentioned, I could not find model.hbm at ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/horizon_swin_transformer_cls/* --ftp-password='c5R,2!pG'. Which directory is it supposed to be in?

  # classification

  | model | dataset | backbone | Input shape | config | ckpt download |
  | :----------: | :-------:| :--------: | :------------: | :------: | :--------: |
  | efficientnasnetm | ImageNet | efficientnasnetm | 332x332 | configs/classification/efficientnasnetm.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/efficientnasnetm_cls/float-checkpoint-best.pth.tar --ftp-password='c5R,2!pG' |
  | efficientnasnets | ImageNet | efficientnasnets | 300x300 | configs/classification/efficientnasnets.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/efficientnasnets_cls/float-checkpoint-best.pth.tar --ftp-password='c5R,2!pG' |
  | efficientnet | ImageNet | efficientnet | 224x224 | configs/classification/efficientnet.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/efficientnet_cls/* --ftp-password='c5R,2!pG' |
  | swin_transformer | ImageNet | swin_transformer | 224x224 | configs/classification/horizon_swin_transformer.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/horizon_swin_transformer_cls/* --ftp-password='c5R,2!pG' |

      2023-05-22
• 颜值即正义 replying to kja23:
  Hello, you can also get the hbm model from the ai_benchmark examples: go to the ddk/samples/ai_toolchain/model_zoo/runtime/ai_benchmark folder in the OE package, run the resolve_ai_benchmark_qat.sh script to download the models, and then pick up model.hbm from the /ddk/samples/model_zoo/runtime/ai_benchmark/qat/horizon_swin_transformer_cls/compile folder.

      2023-05-22
• kja23 replying to 颜值即正义:
  On-board model testing works now, but training still fails. The CUDA driver is already installed:

```
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
```

  Running python3 tools/train.py --stage float --config configs/classification/horizon_swin_transformer.py reports the following error:

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 394, in _main_func
    torch.cuda.set_device(local_rank % num_devices)
  File "/root/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

      2023-05-24
• 颜值即正义 replying to kja23:
  Hello, this error is related to the device configuration. First, please check whether the device_ids setting in configs/classification/horizon_swin_transformer.py matches the GPUs actually present on your machine.
      2023-05-24
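  A quick way to confirm this kind of mismatch is to compare the ids requested in the config against the number of visible GPUs. The helper below is a generic sketch, not part of the toolchain; in practice num_devices would come from torch.cuda.device_count():

```python
def validate_device_ids(device_ids, num_devices):
    """Raise if any requested GPU id does not exist on this machine.
    An out-of-range id is exactly what produces the
    'CUDA error: invalid device ordinal' at torch.cuda.set_device()."""
    bad = [i for i in device_ids if i < 0 or i >= num_devices]
    if bad:
        raise ValueError(
            f"device_ids {bad} out of range for {num_devices} visible GPU(s)"
        )
    return device_ids

# Example: a single-GPU machine whose config asks for device_ids=[0, 1, 2, 3]
# fails this check, matching the 'invalid device ordinal' error above.
```

  If the config requests more GPUs than the machine has, either trim device_ids or restrict visibility with CUDA_VISIBLE_DEVICES so the two agree.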