Error when training BEV

Resolved
mmario, 2023-03-03

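Hello, please describe the problem you encountered in detail.

1. Hardware acquisition channel: purchased a J5 chip

2. Current system image version: docker_openexplorer_ubuntu_20_j5_gpu_v1.1.40_py38

3. Current OpenExplorer (天工开物) version: horizon_j5_open_explorer_v1.1.40_py38_20230210

4. Problem: running the BEV training command fails

5. Demo/case being developed: bev_release_package-1.6.16

6. Solution requested:

Under the ddk/package/host/ path, I ran bash install.sh and then ran the following training command: python3 tools/train.py --config configs/bev/bev_mt_lss.py --stage float

The following error occurred: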

2023-03-03 17:25:12,057 INFO [logger.py:147] Node[0] ==================================================BEGIN FLOAT STAGE==================================================
2023-03-03 17:25:12,090 INFO [thread_init.py:38] Node[1] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,
2023-03-03 17:25:12,108 INFO [thread_init.py:38] Node[3] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,
2023-03-03 17:25:12,108 INFO [thread_init.py:38] Node[2] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,
2023-03-03 17:25:12,111 INFO [thread_init.py:38] Node[0] init torch_num_thread is `12`,opencv_num_thread is `12`,openblas_num_thread is `12`,mkl_num_thread is `12`,omp_num_thread is `12`,
2023-03-03 17:25:12,143 ERROR [ddp_trainer.py:363] Node[1] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

2023-03-03 17:25:12,157 ERROR [ddp_trainer.py:363] Node[0] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

2023-03-03 17:25:12,157 ERROR [ddp_trainer.py:363] Node[3] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

2023-03-03 17:25:12,158 ERROR [ddp_trainer.py:363] Node[2] Traceback (most recent call last):
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 359, in _with_exception
    fn(*args)
  File "/open_explorer/bev_release_package/tools/train.py", line 185, in train_entrance
    trainer = build_from_registry(trainer)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 236, in build_from_registry
    return _impl(x)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in _impl
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 196, in <genexpr>
    build_x = dict(((key, _impl(value)) for key, value in x.items())) # noqa
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 213, in _impl
    _raise_invalid_type_error(object_type)
  File "/root/.local/lib/python3.8/site-packages/hat/registry.py", line 75, in _raise_invalid_type_error
    raise TypeError(
TypeError: LSSTransformer has not registered in any of registry ['HAT_OBJECT_REGISTRY'] and is not a class, which is not allowed

ERROR:__main__:launch trainer failed! process 0 terminated with exit code 1
Traceback (most recent call last):
  File "tools/train.py", line 277, in <module>
    train(
  File "tools/train.py", line 272, in train
    raise e
  File "tools/train.py", line 255, in train
    launch(
  File "/root/.local/lib/python3.8/site-packages/hat/engine/ddp_trainer.py", line 328, in launch
    mp.spawn(
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/root/.local/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 139, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with exit code 1
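
For readers unfamiliar with this kind of error: the TypeError says that the string LSSTransformer in the config could not be resolved to a registered class. The sketch below shows the general pattern with a deliberately tiny name-based registry. It is an illustration only and is not hat's actual implementation; Registry, DEMO_REGISTRY, Backbone and try_build are hypothetical names introduced here.

```python
# Illustrative only: a tiny name-based registry, NOT hat's real code.
from typing import Any, Dict


class Registry:
    """Maps class names to classes so a config can refer to them by string."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._classes: Dict[str, type] = {}

    def register(self, cls: type) -> type:
        # Registration only happens when the defining module is imported.
        self._classes[cls.__name__] = cls
        return cls

    def build(self, cfg: Dict[str, Any]) -> Any:
        cfg = dict(cfg)  # copy so the caller's config is not mutated
        type_name = cfg.pop("type")
        if type_name not in self._classes:
            raise TypeError(
                f"{type_name} has not registered in registry {self.name!r}"
            )
        return self._classes[type_name](**cfg)


REGISTRY = Registry("DEMO_REGISTRY")


@REGISTRY.register
class Backbone:
    def __init__(self, depth: int) -> None:
        self.depth = depth


def try_build(cfg: Dict[str, Any]) -> Any:
    """Return the built object, or the TypeError if the name is unknown."""
    try:
        return REGISTRY.build(cfg)
    except TypeError as err:
        return err


if __name__ == "__main__":
    print(try_build({"type": "Backbone", "depth": 18}))
    print(try_build({"type": "LSSTransformer"}))
```

In a registry of this style, an unknown name usually means that the module defining the class was never imported, or that the installed package version does not contain it. Both point to the environment rather than to the config file.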

Algorithm Toolchain
Journey 5
Comments (5)
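  • VictorChen
    Hello, would it be possible to share the bev_release_package-1.6.16 package?
    2023-03-13
    • 颜值即正义 replying to VictorChen:
      Hello, bev_release_package currently has to be obtained through a project engagement and cannot be shared directly for now.
      2023-03-13
    • VictorChen replying to 颜值即正义:
      Is there any way to get it for personal study and research?
      2023-03-28
    • 颜值即正义 replying to VictorChen:
      For now you can read this post: https://developer.horizon.ai/forumDetail/146177165367615505. The code is not available yet. I suggest keeping an eye on it, as it will be opened up in the future.
      2023-03-28
    • 颜值即正义 replying to VictorChen:
      Hello, starting from J5 OE1.1.57, BEV is merged into the OE package, so individuals can obtain it as well. I suggest using the current latest version, J5 OE1.1.60, directly. Download link: https://developer.horizon.ai/forumDetail/118363912788935318
      2023-07-18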
  • llll
    Has this problem been solved?
    2023-07-18
    • mmario replying to llll:
      Yes, it is solved. The environment had not been configured properly.
      2023-07-24
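  • 颜值即正义
    Hello, the docker environment already meets the requirements for running BEV, so there is no need to run bash install.sh (install.sh is the deployment script for a local environment). Please deploy the environment by following the tutorial in the documentation, then run the floating-point training command.
    2023-03-03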
  • 颜值即正义
    See step 3 (adding environment variables) of section 3.1.1, environment deployment, in the documentation: https://developer.horizon.ai/forumDetail/143772473308124163
    2023-03-04
  • 颜值即正义
    2023-04-24
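
The replies above attribute the failure to the environment setup. In the traceback, hat is imported from /root/.local/lib/python3.8/site-packages, which is a per-user site-packages directory. A quick way to see which copy of a package Python would pick up is sketched below. This is a generic diagnostic, not part of the Horizon tooling. The thread does not say which environment variable step 3 of the documentation sets, so printing PYTHONPATH here is only an assumption, and locate_package, is_user_site and report are hypothetical helper names.

```python
import importlib.util
import os
import site
from typing import Optional


def locate_package(name: str) -> Optional[str]:
    """Return where a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    # Namespace packages have no single origin file.
    return spec.origin or "<namespace package>"


def is_user_site(path: Optional[str]) -> bool:
    """True if path lies under the per-user site-packages directory."""
    if path is None:
        return False
    user_site = os.path.abspath(site.getusersitepackages())
    return os.path.abspath(path).startswith(user_site + os.sep)


def report(name: str) -> str:
    """Summarize whether a package is importable and where it resolves from."""
    origin = locate_package(name)
    if origin is None:
        return f"{name}: not importable"
    where = "user site-packages" if is_user_site(origin) else "elsewhere"
    return f"{name}: {origin} ({where})"


if __name__ == "__main__":
    # "hat" is the package name that appears in the traceback.
    print(report("hat"))
    # Assumption: PYTHONPATH is shown only as a common example of a
    # variable that affects import resolution.
    print("PYTHONPATH =", os.environ.get("PYTHONPATH", "<unset>"))
```

If the reported location is not the one that the docker image and the documented setup are expected to provide, that mismatch is a reasonable first thing to check before changing the training config.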