
Swin-T algorithm model: float model training issue

Resolved
kja23 · 2023-05-19

Hello, please describe the problem you are running into in detail; this will help us locate it quickly~

1. Chip model: J5
2. OpenExplorer (TianGongKaiWu) SDK version: J5_OE_1.1.37
3. Problem area: Swin-T algorithm model, float model training
4. Description: Running the command python3 tools/train.py --stage float --config configs/classification/horizon_swin_transformer.py fails with OSError: /usr/lib64/libcuda.so.1: file too short. Do I need to install CUDA on the dev machine, and if so, which CUDA version?
Attachments:
Tags: Algorithm Toolchain
Comments (1)
  • 颜值即正义
Hello, we recommend following the "reference environment setup" chapter of the Swin-T reference algorithm documentation and running Swin-T inside the Docker image provided by Horizon.
The Docker image is available from this thread: https://developer.horizon.ai/forumDetail/118363912788935318. Also, the CUDA version currently supported by the toolchain is 11.1.
    2023-05-19
• kja23 replying to 颜值即正义:
  I pulled the OE 1.1.37 package from ftp://j5ftp@vrftp.horizon.ai/ and entered Docker via sh run_docker.sh data. One more question: where do I get the ImageNet dataset used by the Swin-T reference algorithm?
      2023-05-19
• 颜值即正义 replying to kja23:
  If you have already run docker load on the offline image locally and run_docker.sh executed successfully, but you still get OSError: /usr/lib64/libcuda.so.1: file too short, then please check that your local machine meets the following environment requirements:

  (environment requirements table shown as an image in the original post)

  Note: Python 3.6 support will be dropped in a later release.

  In addition, you need to obtain the ImageNet dataset used by the Swin-T reference algorithm yourself from the official site https://www.image-net.org/download.php, then run the dataset packing script to pack it:

```shell
# pack train set
python3 tools/datasets/imagenet_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name train --num-workers 10 --pack-type lmdb
# pack val (test) set
python3 tools/datasets/imagenet_packer.py --src-data-dir ${src-data-dir} --target-data-dir ${target-data-dir} --split-name val --num-workers 10 --pack-type lmdb
```
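  Before running the packer, it can help to sanity-check that the raw ImageNet split directory follows the usual folder-per-class layout. The sketch below is generic and not taken from the toolchain itself; the function name and the layout assumption (one subdirectory per class under the split directory) are illustrative:

```python
import os

def check_imagenet_split(split_dir):
    """Sanity-check a raw ImageNet split directory: expect one
    subfolder per class, each containing at least one image file.
    Returns (number of class folders, total number of files)."""
    if not os.path.isdir(split_dir):
        raise FileNotFoundError(f"split directory not found: {split_dir}")
    class_dirs = [d for d in os.listdir(split_dir)
                  if os.path.isdir(os.path.join(split_dir, d))]
    n_images = sum(
        len(os.listdir(os.path.join(split_dir, d))) for d in class_dirs
    )
    return len(class_dirs), n_images
```

  A full ImageNet train split should report 1000 class folders; a very different number usually means the archive was extracted with the wrong structure and the packer will fail or produce an empty lmdb.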
      2023-05-19
• kja23 replying to 颜值即正义:
  Can I skip training and test an already-trained model directly on the board? What are the concrete steps? When I ran the quantize-and-compile command python3 tools/compile_perf.py --config configs/classification/horizon_swin_transformer.py --out-dir ./ --opt 3 directly, I got this error:

```
/root/.local/lib/python3.8/site-packages/torch/functional.py:445: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2157.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
2023-05-19 15:53:52,041 INFO Successfully convert float model to qat model.
Traceback (most recent call last):
  File "tools/compile_perf.py", line 190, in <module>
    compile_then_perf(
  File "tools/compile_perf.py", line 71, in compile_then_perf
    int_infer_trainer = build_from_registry(int_infer_trainer)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/registry.py", line 223, in _impl
    obj = build_from_cfg(OBJECT_REGISTRY, x)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/registry.py", line 98, in build_from_cfg
    instance = obj_cls(**cfg)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/engine/trainer.py", line 86, in __init__
    super(Trainer, self).__init__(
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/engine/loop_base.py", line 248, in __init__
    self.model = model_convert_pipeline(self.model)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/models/model_convert/pipelines.py", line 52, in __call__
    model = converter(model)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/models/model_convert/converters.py", line 234, in __call__
    model_checkpoint = load_checkpoint(
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/utils/checkpoint.py", line 103, in load_checkpoint
    path = get_hash_file_if_hashed_and_local(
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/utils/hash.py", line 202, in get_hash_file_if_hashed_and_local
    for name in os.listdir(dir_path):
FileNotFoundError: [Errno 2] No such file or directory: 'tmp_models/horizon_swin_transformer_cls'
```

      2023-05-19
• 颜值即正义 replying to kja23:
  Hello, the README under horizon_model_train_sample\scripts\configs\classification provides FTP download links for the model weights and the board-side hbm model. Download the hbm model, transfer it to the board, and then use the hrt_model_exec perf tool on the board to evaluate performance.

      2023-05-22
• 颜值即正义 replying to 颜值即正义:
  Latency test: hrt_model_exec perf --model_file model.hbm --core_id 1 --frame_count 1000 --thread_num 1 --profile_path "./"
  Dual-core FPS test: hrt_model_exec perf --model_file model.hbm --core_id 0 --frame_count 1000 --thread_num 8 --profile_path "./"

      2023-05-22
• 颜值即正义 replying to kja23:
  Also, the cause of your compile error is that the ckpt_dir path in the config file is not set correctly. The fix is to download the checkpoint via the link in the README and then set ckpt_dir to the actual path where you stored it.

      2023-05-22
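  As an illustration, if the downloaded checkpoint were stored under a local tmp_models folder, the config entry would look roughly like this. This is a sketch only: the exact variable layout inside configs/classification/horizon_swin_transformer.py may differ, and the path shown is a placeholder for wherever you actually saved the file:

```python
# In configs/classification/horizon_swin_transformer.py (sketch):
# point ckpt_dir at the directory holding the downloaded checkpoint,
# e.g. float-checkpoint-best.pth.tar fetched from the README's FTP link.
ckpt_dir = "./tmp_models/horizon_swin_transformer_cls"
```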
• kja23 replying to 颜值即正义:
  Sorry, going by the README under horizon_model_train_sample\scripts\configs\classification that you mentioned, I could not find model.hbm at ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/horizon_swin_transformer_cls/* --ftp-password='c5R,2!pG'. Which directory is it supposed to be in?

  # classification

  | model | dataset | backbone | Input shape | config | ckpt download |
  | :----------: | :-------:| :--------: | :------------: | :------: | :--------: |
  | efficientnasnetm | ImageNet | efficientnasnetm | 332x332 | configs/classification/efficientnasnetm.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/efficientnasnetm_cls/float-checkpoint-best.pth.tar --ftp-password='c5R,2!pG' |
  | efficientnasnets | ImageNet | efficientnasnets | 300x300 | configs/classification/efficientnasnets.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/efficientnasnets_cls/float-checkpoint-best.pth.tar --ftp-password='c5R,2!pG' |
  | efficientnet | ImageNet | efficientnet | 224x224 | configs/classification/efficientnet.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/efficientnet_cls/* --ftp-password='c5R,2!pG' |
  | swin_transformer | ImageNet | swin_transformer | 224x224 | configs/classification/horizon_swin_transformer.py | wget -c ftp://openexplorer@vrftp.horizon.ai/openexplorer_j5/1.1.48/py36/modelzoo/qat_origin_modelzoo/horizon_swin_transformer_cls/* --ftp-password='c5R,2!pG' |

      2023-05-22
• 颜值即正义 replying to kja23:
  Hello, you can also get the hbm model from the ai_benchmark examples: go to the ddk/samples/ai_toolchain/model_zoo/runtime/ai_benchmark folder in the OE package, run the resolve_ai_benchmark_qat.sh script to download the models, and then pick up model.hbm from the /ddk/samples/model_zoo/runtime/ai_benchmark/qat/horizon_swin_transformer_cls/compile folder.

      2023-05-22
• kja23 replying to 颜值即正义:
  On-board model testing works now, but training still fails. The CUDA driver is already installed:

```
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0
```

  Running python3 tools/train.py --stage float --config configs/classification/horizon_swin_transformer.py reports the following error:

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/usr/local/python3.8/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 394, in _main_func
    torch.cuda.set_device(local_rank % num_devices)
  File "/root/.local/lib/python3.8/site-packages/torch/cuda/__init__.py", line 311, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

      2023-05-24
• 颜值即正义 replying to kja23:
  Hello, this error is related to the device configuration. First, please check whether the device_ids setting in configs/classification/horizon_swin_transformer.py matches the GPUs actually present on your machine.
      2023-05-24
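  A quick way to confirm this kind of mismatch is to compare the ids requested in the config against the number of visible GPUs. The helper below is a generic sketch, not part of the toolchain; in practice num_devices would come from torch.cuda.device_count():

```python
def validate_device_ids(device_ids, num_devices):
    """Raise if any requested GPU id does not exist on this machine.
    An out-of-range id is exactly what produces the
    'CUDA error: invalid device ordinal' at torch.cuda.set_device()."""
    bad = [i for i in device_ids if i < 0 or i >= num_devices]
    if bad:
        raise ValueError(
            f"device_ids {bad} out of range for {num_devices} visible GPU(s)"
        )
    return device_ids

# Example: a single-GPU machine whose config asks for device_ids=[0, 1, 2, 3]
# fails this check, matching the 'invalid device ordinal' error above.
```

  If the config requests more GPUs than the machine has, either trim device_ids or restrict visibility with CUDA_VISIBLE_DEVICES so the two agree.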