
Loading the yolov5s model bin file fails on the J3 board

Resolved
wwwwswwx 2023-02-28

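System software version: x3j3_lnx_db_20221121 debug

OpenExplorer (天工开物) version: horizon_xj3_open_explorer_v1.11.4_20220413

Technical area: operating system

Problem description: Calling the hbDNNInitializeFromFiles interface to load the yolov5s model bin file fails with "ion alloc ion_opened[344] failed ret:-1!", yet the ./hrt_model_exec perf tool can still measure the model's performance.

Reproduction rate: always. Troubleshooting steps taken, analysis, and results:

Error log: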

[BPU_PLAT]BPU Platform Version(1.3.1)!

[HBRT] set log level as 0. version = 3.13.27

[000:000] (keros_util.cpp:99): keros_authentication failed, ret = 0

[000:000] (configuration.cpp:147): Keros key init failed.

[DNN] Runtime version = 1.8.1_(3.13.27 HBRT)

[HorizonRT] The model builder version = 1.6.8

[HorizonRT] The model builder version = 1.6.8

[HorizonRT] The model builder version = 1.6.8

[HorizonRT] The model builder version = 1.6.8

ion alloc ion_opened[344] failed ret:-1!

ion phys failed ret:-1!

HBMEM alloc[0x2a8d960] failed

The process open fds num : 359

hbmem buffer number = 343

#

# Fatal error in /home/jenkins/workspace/_ap_toolchain_horizonrtd_v1.8.1g/src/plan/hbm_exec_plan.cpp, line 191

# last system error: 12

# Check failed: one_exec_info.hbm_exec_output_internal_alloc_ != 0 (0 vs. 0)

#

#

==== C stack trace ===============================

/lib/adas/aarch64/track/libdnn.so(+0x3b3c24) [0x7fb2325c24]

/lib/adas/aarch64/track/libdnn.so(+0x12a070) [0x7fb209c070]

/lib/adas/aarch64/track/libdnn.so(+0x131f68) [0x7fb20a3f68]

/lib/adas/aarch64/track/libdnn.so(+0x134e04) [0x7fb20a6e04]

/lib/adas/aarch64/track/libdnn.so(+0xaf924) [0x7fb2021924]

/lib/adas/aarch64/track/libdnn.so(+0xb6564) [0x7fb2028564]

/lib/adas/aarch64/track/libdnn.so(+0xb8820) [0x7fb202a820]

8: hbDNNInitializeFromFiles
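A note for readers of the log above: the line "last system error: 12" looks like a plain errno value. Assuming that is what the runtime prints, a minimal sketch for decoding it is shown below. On Linux, errno 12 is ENOMEM, which fits the failed ion/HBMEM allocations in the same log. The helper name describe_errno is made up for this illustration.

```cpp
#include <cerrno>
#include <cstring>
#include <string>

// Hypothetical helper: turn a raw errno value (as printed after
// "last system error:") into the C library's description text.
std::string describe_errno(int err) {
    return std::string(std::strerror(err));
}
```

Calling describe_errno(12) on the board would give the C library's text for ENOMEM; the exact wording depends on the libc in use.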

4 comments
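  • 颜值即正义

    Based on the log, the failure is indeed caused by insufficient BPU memory: a BPU memory allocation failed, so the program aborted. Since the line "[HorizonRT] The model builder version = 1.6.8" appears several times, our preliminary judgment is that the model is being loaded multiple times. Starting several processes at the same time can also exhaust BPU memory. Please check these two possible causes.

    2023-03-01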
    • wwwwswwx replying to 颜值即正义:

      Loading this model three times already fails, while the model you provide can be loaded 10 times without error.

      2023-03-01
    • wwwwswwx replying to 颜值即正义:

      Do you know why this model takes up so much BPU memory?

      2023-03-01
    • 颜值即正义 replying to wwwwswwx:

      When things run normally, BPU memory is generally sufficient.

      2023-03-01
    • wwwwswwx replying to 颜值即正义:

      std::vector<hbPackedDNNHandle_t> packed_dnn_handles;
      packed_dnn_handles.resize(10);
      dnn_handles.resize(10);
      int model_count = 0;
      m_strModelFileName = "exp6_yolov5s_class10_best.bin";
      const char* pModelFileName = m_strModelFileName.data();
      for (auto i = 0; i < 10; ++i)
      {
        // Each iteration loads the same bin file again into a new packed handle.
        HB_CHECK_RET_FALSE(hbDNNInitializeFromFiles(&packed_dnn_handles[i], &pModelFileName, 1),
                           "hbDNNInitializeFromFiles failed.");
        ...
      }

      2023-03-01
    • 颜值即正义 replying to wwwwswwx:

      Do you want to load the same model 10 times in one go? What is the use case? Loading it once is enough.

      2023-03-01
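One way to make "load it once" hard to get wrong is to route every load through a small cache keyed by file name. The sketch below is only an illustration of that idea: FakeModel and ModelRegistry are hypothetical stand-ins and do not call the Horizon runtime. In real code, the place marked in the comment is where the single hbDNNInitializeFromFiles call would go.

```cpp
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a loaded model; not the runtime's handle type.
struct FakeModel {
    std::string file;
};

class ModelRegistry {
public:
    // Returns the already-loaded model for `file`, loading only on first use.
    const FakeModel& get(const std::string& file) {
        std::lock_guard<std::mutex> lock(mu_);
        auto it = models_.find(file);
        if (it == models_.end()) {
            ++load_count_;  // a real loader would initialize the model here
            it = models_.emplace(file, FakeModel{file}).first;
        }
        return it->second;
    }

    // Number of actual loads performed so far.
    int load_count() const {
        std::lock_guard<std::mutex> lock(mu_);
        return load_count_;
    }

private:
    mutable std::mutex mu_;
    std::map<std::string, FakeModel> models_;  // map nodes keep stable addresses
    int load_count_ = 0;
};
```

With this shape, a loop that asks for "exp6_yolov5s_class10_best.bin" ten times triggers one load and returns the same object every time.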
    • wwwwswwx replying to 颜值即正义:

      I want to run the inference tasks in multiple threads.

      2023-03-01
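    • 颜值即正义 replying to wwwwswwx:

      So essentially you want one model to run inference on several images at the same time? You can set batch_size in the yaml at compile time, but on J3 this setting brings little performance gain.

      2023-03-01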
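    • wwwwswwx replying to 颜值即正义:

      We also run other programs on the J3 board, something like an L2 system. The images to be detected are captured by the camera in real time, so they cannot be inferred at the same time. With only one inference thread, the time from loading an image to getting the result is too long, which makes the visual detection rate look very low.

      2023-03-01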
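    • 颜值即正义 replying to wwwwswwx:

      Then please first try enabling multithreading when running hrt_model_exec perf, and check whether the frame rate meets your requirement while the other programs are running.

      2023-03-01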
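    • 颜值即正义 replying to wwwwswwx:

      If you want to run multithreaded, take sample 00 of the runtime samples as a starting point: read the model only once (that is, keep a single dnn_handle), define several task handles, and call hbDNNInfer several times, passing the same dnn_handle but a different task_handle to each call.

      2023-03-01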
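The pattern described in this reply (one shared model, one task per in-flight inference) can be sketched without the board as follows. Everything here is a hypothetical stand-in: SharedModel plays the role of the single dnn_handle, Task the role of a task handle, and run_inference replaces the real submit-and-wait calls with trivial arithmetic. It is meant to show the ownership layout only, not the runtime API.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical stand-ins for the single model handle and a per-call task.
struct SharedModel { int scale; };
struct Task { int result = 0; };

// Stand-in for "submit inference and wait": reads the shared model,
// writes only to the task that belongs to this call.
void run_inference(const SharedModel& model, const std::vector<int>& frame, Task& task) {
    task.result = model.scale * std::accumulate(frame.begin(), frame.end(), 0);
}

// The model is created once by the caller and shared read-only by all workers.
std::vector<int> infer_parallel(const SharedModel& model,
                                const std::vector<std::vector<int>>& frames) {
    std::vector<Task> tasks(frames.size());  // one task per in-flight inference
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < frames.size(); ++i) {
        workers.emplace_back(run_inference, std::cref(model),
                             std::cref(frames[i]), std::ref(tasks[i]));
    }
    for (auto& w : workers) w.join();

    std::vector<int> results;
    for (const auto& t : tasks) results.push_back(t.result);
    return results;
}
```

The design point is that nothing a worker writes to is shared: the model is only read, and each worker owns its task and its frame.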
    • wwwwswwx replying to 颜值即正义:

      Where is the runtime sample 00 code located? I would like to use it as a reference.

      2023-03-01
    • 颜值即正义 replying to wwwwswwx:

      ddk/samples/ai_toolchain/horizon_runtime_sample/code/00_quick_start

      2023-03-01
    • wwwwswwx replying to 颜值即正义:

      With a single dnn_handle, running multiple threads means the threads would have to share input_tensors and output_tensors:

      hbDNNGetInputTensorProperties(&input[i].properties, dnn_handle, i)
      hbDNNGetOutputTensorProperties(&output[i].properties, dnn_handle, i)

      2023-03-01
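The thread does not record an answer to this concern. One possible reading, stated here as an assumption, is that the two property queries only describe the tensors (shape, size), so each thread could query the same dnn_handle and then allocate its own private input and output buffers. The sketch below illustrates that layout with hypothetical stand-ins: TensorProps, TensorSet, and make_tensor_set are invented for this example and use plain heap memory rather than the runtime's memory allocation calls.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for what a properties query returns: only a byte size.
struct TensorProps { std::size_t byte_size; };

// One private set of input/output buffers, meant to be owned by one worker.
struct TensorSet {
    std::vector<std::vector<unsigned char>> inputs;
    std::vector<std::vector<unsigned char>> outputs;
};

// Builds a fresh, zero-filled buffer set from shared read-only properties.
TensorSet make_tensor_set(const std::vector<TensorProps>& in_props,
                          const std::vector<TensorProps>& out_props) {
    TensorSet set;
    for (const auto& p : in_props) set.inputs.emplace_back(p.byte_size);
    for (const auto& p : out_props) set.outputs.emplace_back(p.byte_size);
    return set;
}
```

Each worker thread would call make_tensor_set once at startup and reuse its set for every frame, so no two threads ever write to the same buffer.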
  • 颜值即正义

    Hello. It looks like BPU memory is insufficient. Are several models being loaded at the same time?

    2023-03-01
    • wwwwswwx replying to 颜值即正义:

      Only one model file is loaded.

      2023-03-01
    • 颜值即正义 replying to wwwwswwx:

      OK. Could you share the yolov5s onnx and bin models so that we can analyze them on our side? Baidu Netdisk (百度网盘) would work.

      2023-03-01
    • wwwwswwx replying to 颜值即正义:

      Link: https://pan.baidu.com/s/1BpYDATB6KvnZl782t7cCvw Extraction code: 4s8k

      2023-03-01
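  • 颜值即正义

    Also, the system software log is rather sparse. You could run ulimit -n 65535 to raise the open-file limit, then run the program again and see whether a more detailed error message appears.

    2023-03-01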
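Since ulimit -n only changes the limit for the current shell and its children, it can help to confirm from inside the process which limit is actually in effect, and to compare it with the "The process open fds num" line in the log. A minimal sketch using the POSIX getrlimit call is shown below. The function name open_fd_soft_limit is made up for this example.

```cpp
#include <climits>
#include <sys/resource.h>

// Returns the soft limit on open file descriptors for this process,
// LLONG_MAX if the limit is unlimited, or -1 if the query fails.
long long open_fd_soft_limit() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return -1;
    }
    if (lim.rlim_cur == RLIM_INFINITY) {
        return LLONG_MAX;
    }
    return static_cast<long long>(lim.rlim_cur);
}
```

Printing this value at program start shows whether the ulimit setting reached the process that loads the model.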
    • wwwwswwx replying to 颜值即正义:

      root@PML:/userdata/wangyajun# ulimit -n 65535

      [BPU_PLAT]BPU Platform Version(1.3.1)!

      [HBRT] set log level as 0. version = 3.13.27

      [000:000] (keros_util.cpp:99): keros_authentication failed, ret = 0

      [000:002] (configuration.cpp:147): Keros key init failed.

      [DNN] Runtime version = 1.8.1_(3.13.27 HBRT)

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      ion alloc ion_opened[109] failed ret:-1!

      ion phys failed ret:-1!

      HBMEM alloc[0x1b9000] failed

      The process open fds num : 124

      hbmem buffer number = 108

      #

      # Fatal error in /home/jenkins/workspace/_ap_toolchain_horizonrtd_v1.8.1g/src/plan/hbm_exec_plan.cpp, line 376

      # last system error: 12

      # Check failed: tmp_region.one_output_ptr != 0 (0 vs. 0)

      #

      #

      ==== C stack trace ===============================

      /lib/adas/aarch64/track/libdnn.so(+0x3b3c24) [0x7f825a2c24]

      /lib/adas/aarch64/track/libdnn.so(+0x127168) [0x7f82316168]

      /lib/adas/aarch64/track/libdnn.so(+0x12726c) [0x7f8231626c]

      /lib/adas/aarch64/track/libdnn.so(+0x131f68) [0x7f82320f68]

      /lib/adas/aarch64/track/libdnn.so(+0x134e04) [0x7f82323e04]

      /lib/adas/aarch64/track/libdnn.so(+0xaf924) [0x7f8229e924]

      /lib/adas/aarch64/track/libdnn.so(+0xb6564) [0x7f822a5564]

      /lib/adas/aarch64/track/libdnn.so(+0xb8820) [0x7f822a7820]

      9: hbDNNInitializeFromFiles

      2023-03-01
    • wwwwswwx replying to 颜值即正义:

      The model file quantized from the YOLOv5s.onnx provided by Horizon does not have this problem. Loading it prints the following:

      [BPU_PLAT]BPU Platform Version(1.3.1)!

      [HBRT] set log level as 0. version = 3.13.27

      [000:000] (keros_util.cpp:99): keros_authentication failed, ret = 0

      [000:001] (configuration.cpp:147): Keros key init failed.

      [DNN] Runtime version = 1.8.1_(3.13.27 HBRT)

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      [HorizonRT] The model builder version = 1.6.8

      2023-03-01
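    • wwwwswwx replying to 颜值即正义:

      The model I am using now was trained with the yolov5 3.1 code, with common.py modified as follows: in the Conv layer, the activation function is changed from the default Hardswish to LeakyReLU, so that the trained model's structure matches the yolov5s provided on the official site.

      2023-03-01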
  • 颜值即正义
    2023-04-24