AMurz.1
Associate II
June 9, 2025
Solved

STM32N6 NPU acceleration sometimes not used for 1x1 Conv or Gemm operations


Hi,

I'm trying to use the NPU of the STM32N6 to run the ONNX model in attachment.

The issue I'm trying to fix is that stedgeai does not use the NPU for all operations. Some of them, such as Conv and Gemm, run in SW instead of HW, and I don't understand what is preventing the acceleration.

Here is the stedgeai output:

/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/stedgeai generate --target stm32n6 --name network -m denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx --st-neural-art "n6-noextmem@/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/user_neuralart.json" --verbosity 1
ST Edge AI Core v2.1.0-20194 329b0e98d
WARNING: Unsupported keys in the current profile n6-noextmem are ignored: memory_desc 
 > memory_desc is not a valid key anymore, use machine_desc instead 
 >>>> EXECUTING NEURAL ART COMPILER 
 /home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/atonn -i "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0.onnx" --json-quant-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0_Q.json" -g "network.c" --load-mdesc "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/configs/stm32n6.mdesc" --load-mpool "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/my_mpools/stm32n6__noextmem.mpool" --save-mpool-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/stm32n6__noextmem.mpool" --out-dir-prefix "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/" --optimization 3 --all-buffers-info --mvei --no-hw-sw-parallelism --cache-maintenance --Oalt-sched --native-float --enable-virtual-mem-pools --Omax-ca-pipe 4 --Oshuffle-dma --Ocache-opt --Os --output-info-file "c_info.json"
 <<<< DONE EXECUTING NEURAL ART COMPILER 
 
 Exec/report summary (generate)
 ---------------------------------------------------------------------------------------------------------------------------------
 model file : /media/doc/USB5_EXT4/Projects/IA/TestAI/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx 
 type : onnx 
 c_name : network 
 options : allocate-inputs, allocate-outputs 
 optimization : balanced 
 target/series : stm32n6npu 
 workspace dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws 
 output dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output 
 model_fmt : ss/sa per channel 
 model_name : denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q 
 model_hash : 0x53769866cbc27812c941e6bf8eee7d23 
 params # : 1,250,317 items (4.77 MiB) 
 ---------------------------------------------------------------------------------------------------------------------------------
 input 1/5 : 'Input_18_out_0', int8(1x257x1), 257 Bytes, QLinear(0.024541926,2,int8), activations 
 input 2/5 : 'Input_13_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 input 3/5 : 'Input_9_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 input 4/5 : 'Input_4_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 input 5/5 : 'Input_0_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 inputs (total) : 0 Bytes 
 output 1/5 : 'Quantize_84_out_0', int8(1x257x1), 257 Bytes, QLinear(0.003748817,-128,int8), activations 
 output 2/5 : 'Quantize_47_out_0', int8(1x256), 256 Bytes, QLinear(0.007790764,0,int8), activations 
 output 3/5 : 'Quantize_49_out_0', int8(1x256), 256 Bytes, QLinear(0.007674512,0,int8), activations 
 output 4/5 : 'Quantize_70_out_0', int8(1x256), 256 Bytes, QLinear(0.007619916,-2,int8), activations 
 output 5/5 : 'Quantize_72_out_0', int8(1x256), 256 Bytes, QLinear(0.007183936,-1,int8), activations 
 outputs (total) : 0 Bytes 
 macc : 0 
 weights (ro) : 1,287,937 B (1.23 MiB) (4 segments) / -3,713,331(-74.2%) vs float model 
 activations (rw) : 435,585 B (425.38 KiB) (1 segment) * 
 ram (total) : 435,585 B (425.38 KiB) = 435,585 + 0 + 0 
 ---------------------------------------------------------------------------------------------------------------------------------
 (*) 'input'/'output' buffers can be used from the activations buffer
 
[...]
 
Total number of epochs: 23 of which 3 implemented in software 
 
epoch ID HW/SW/EC Operation (SW only) 
epoch 1 HW 
epoch 2 -SW- ( Conv ) 
epoch 3 -SW- ( Conv ) 
epoch 4 HW 
epoch 5 HW 
epoch 6 HW 
epoch 7 HW 
epoch 8 HW 
epoch 9 HW 
epoch 10 HW 
epoch 11 HW 
epoch 12 HW 
epoch 13 HW 
epoch 14 HW 
epoch 15 HW 
epoch 16 HW 
epoch 17 HW 
epoch 18 HW 
epoch 19 HW 
epoch 20 HW 
epoch 21 -SW- ( Conv ) 
epoch 22 HW 
epoch 23 HW 

[...]

The model is quantized using this script:

import numpy

from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, preprocess, CalibrationMethod
from onnxruntime.quantization import CalibrationDataReader


# Random calibration samples matching the model's input shapes.
example_inputs = numpy.random.randn(1, 257, 1).astype(numpy.float32)
example_hidden = numpy.random.randn(1, 256).astype(numpy.float32)

class XXXDataReader(CalibrationDataReader):
    def __init__(self):
        self.enum_data = None

    def get_next(self):
        # Yield a single calibration batch, then None to stop calibration.
        if self.enum_data is None:
            self.enum_data = iter(
                [{"input": example_inputs,
                  "lstm_hidden_input_h_0": example_hidden,
                  "lstm_hidden_input_c_0": example_hidden,
                  "lstm_hidden_input_h_1": example_hidden,
                  "lstm_hidden_input_c_1": example_hidden}]
            )
        return next(self.enum_data, None)

    def rewind(self):
        pass

dr = XXXDataReader()

# QDQ static quantization: int8 activations, per-channel int8 weights.
conf = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    #op_types_to_quantize=["Conv","Slice"],
    extra_options={
        "ForceQuantizeNoInputCheck": True,
    },
    # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
    #nodes_to_quantize=['/conv1/Conv'],
    per_channel=True)

preprocess.quant_pre_process("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx")
quantize("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx", conf)

 

When I run atonn with "--d-lower 50", I see logs like the following, but I don't know what "scale-offset format" means, as the issue was the same with symmetric input and weights:

 Lowering Conv2D_23 id=80 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_23 because output tensor (id=522) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering Conv2D_23 id=80 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_23
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257

[...]

 Lowering Gemm_28_conv_16 id=94 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Gemm_28_conv_16 because output tensor (id=611) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering Gemm_28_conv_16 id=94 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Gemm_28_conv_16
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 1024

[...]

 Lowering Conv2D_79 id=218 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_79 because output tensor (id=1458) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering Conv2D_79 id=218 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_79
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257

 

What is preventing the above three operations from running on the NPU?

Especially the Gemm_28_conv_16 node, as other Gemm operations are accelerated on the NPU.

The quantized model is in attachment.

 

Thanks for your help,

Alexis Murzeau


2 replies

Julian E. (Best answer)
Technical Moderator
June 10, 2025

Hello @AMurz.1,

 

It is caused by a bug that fails to split layers whose channel count equals certain prime numbers. We are aware of it, it has been escalated to the dev team, and we are waiting for a fix.

In your case, 257 is one of these problematic numbers.

To work around this, please change the number of channels in the affected layer to 256; that should avoid the bug.

Have a good day,

Julian
AMurz.1 (Author)
Associate II
June 11, 2025

Hi,

 

Thanks, splitting the Gemm operation into two (144 + 113) makes both of them accelerated on the NPU.
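
For reference, such a split can be done offline with onnx graph surgery. This is a hypothetical sketch applied to the float model before re-quantizing: the file and node names ("model.onnx", "Gemm_28") are placeholders, transB=0 and no bias input are assumed, and any QDQ nodes around the Gemm would need the same treatment:

import onnx
from onnx import helper, numpy_helper

model = onnx.load("model.onnx")
graph = model.graph

# Locate the Gemm whose output dimension is 257 (placeholder name).
gemm = next(n for n in graph.node if n.op_type == "Gemm" and n.name == "Gemm_28")
W = next(t for t in graph.initializer if t.name == gemm.input[1])
w = numpy_helper.to_array(W)  # shape (K, 257) with transB=0

# Split the 257 output columns into 144 + 113.
graph.initializer.extend([
    numpy_helper.from_array(w[:, :144].copy(), "W_split0"),
    numpy_helper.from_array(w[:, 144:].copy(), "W_split1")])

# Two smaller Gemms, then a Concat that rebuilds the original output tensor.
g0 = helper.make_node("Gemm", [gemm.input[0], "W_split0"], ["y_split0"], name="Gemm_s0")
g1 = helper.make_node("Gemm", [gemm.input[0], "W_split1"], ["y_split1"], name="Gemm_s1")
cat = helper.make_node("Concat", ["y_split0", "y_split1"], [gemm.output[0]],
                       name="Concat_split", axis=1)

graph.node.remove(gemm)
graph.node.extend([g0, g1, cat])  # re-sort topologically if downstream nodes follow
onnx.save(model, "model_split.onnx")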

But I have a question about the MAC/cycle rate of power-of-two Gemm operations performing 1024*256 MACs.

[Image attachment: AMurz1_5-1749678402142.png]

It seems to not use the full CONVACC MAC/cycles, according to the reference manual, one CONVACC is able to do at most 36 16*8 MACs:

[Image attachment: AMurz1_0-1749677630958.png]

But the 1024*256 Gemm operation seems to achieve only 2 MAC/cycle, according to both the tmp.dot graph output of atonn and real timing measurements on STM32N657 hardware:

[Image attachment: AMurz1_3-1749677970802.png]

 

Is there a way to improve the MAC/cycle ratio to speed up inference?

I see this paragraph in the ST Edge AI documentation, which may be related:

[Image attachment: AMurz1_4-1749678001065.png]

What "run at the speed" means ? The inference speed would be the same ?

As I have ICH = 256 and OCH = 1024 in my case, am I limited by OCH = N*16 (with N = 64)?
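
For scale, here is the arithmetic behind that question, assuming the 2 MAC/cycle figure from the graph applies to the whole operation:

\[
1024 \times 256 = 262{,}144~\text{MACs},
\qquad
\frac{262{,}144~\text{MACs}}{2~\text{MAC/cycle}} = 131{,}072~\text{cycles},
\]

versus roughly \(262{,}144 / 36 \approx 7{,}282\) cycles if the peak CONVACC rate were sustained.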

 

I'm attaching the quantized model generated by stedgeai in the st_ai_output folder, and the tmp.dot graph converted to SVG format.

Julian E.
Technical Moderator
June 12, 2025

Hello @AMurz.1,

 

Here is the answer from our experts:

 

The 2 MAC/cycle there is the maximum theoretical value that can be obtained here, given all the surrounding conditions, including memory access, etc.

[Image attachment: JulianE_0-1749728922273.jpeg]

 

In the screenshot of the SVG graph, you can see that the conv node has a property saying "choked ports = (weights)". This is because there is no data reuse with a GEMM node: each weight value is read and used only once. That's why we don't get MAC/cycle figures closer to the maximums cited by the user here:

[Image attachment: JulianE_1-1749728922274.jpeg]
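
One way to read this (an assumption on the reader's part, not an ST statement): if each int8 weight is fetched exactly once and the weight port sustains \(B\) bytes per cycle, the sustained throughput is bounded by the fetch rate rather than by the MAC array:

\[
\text{MAC/cycle} \;\le\; \frac{B~\text{bytes/cycle}}{1~\text{byte/weight}} = B,
\]

so a port delivering 2 weight bytes per cycle would give exactly the observed 2 MAC/cycle ceiling.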

 

 

 

Furthermore, the 16x8 SIMD mode is enabled (chosen since FSUB != 0), and thus the Deep1x1 mode cannot be enabled.

 

Have a good day,

Julian