Model conversion optimization: one convolution operator is unusually slow

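Hello, please describe the problem you encountered in detail.
1. Hardware source:
2. Current system image version: x3j3_lnx_db_20220407
3. Current 天工开物 (OpenExplorer) toolchain version: ai_toolchain_centos_7_xj3: v2.2.3a
4. Problem category: model conversion optimization, one convolution operator is unusually slow
5. Demo/case being developed:
6. Solution requested: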

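I built a deep learning model whose final part consists of repeated residual blocks, each made up of several convolutions, as shown in the figure.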

[Figure: model structure]
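However, after compilation I found that Conv #334 takes far more computing time and load/store time than the other convolutions. The costs are listed in the table below.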

layer	ops	computing cost (no DDR)	load/store cost
Conv_323-conv	66,355,200	134 us (0.1% of model)	1 us (0% of model)
Conv_325-conv	597,196,800	1184 us (1.4% of model)	3 us (0% of model)
Conv_327-conv	66,355,200	132 us (0.1% of model)	120 us (0.1% of model)
Conv_330-conv	66,355,200	132 us (0.1% of model)	2 us (0% of model)
Conv_332-conv	597,196,800	1167 us (1.4% of model)	122 us (0.1% of model)
Conv_334	66,355,200	3646 us (4.5% of model)	761 us (0.9% of model)

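Why is Conv #334 so much slower when the structure is the same? Is it because it is the last convolution before the output? Is there a way to work around this?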