AMurz.1
Associate II
June 9, 2025
Solved

STM32N6 NPU acceleration sometimes not used for 1x1 Conv or Gemm operations


Hi,

I'm trying to use the NPU of the STM32N6 to run the ONNX model in attachment.

The issue I'm trying to fix is that stedgeai does not use the NPU for all operations. Some of them, such as Conv and Gemm, run in SW instead of HW, and I don't understand what is preventing the acceleration.

Here is the stedgeai output:

/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/stedgeai generate --target stm32n6 --name network -m denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx --st-neural-art "n6-noextmem@/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/user_neuralart.json" --verbosity 1
ST Edge AI Core v2.1.0-20194 329b0e98d
WARNING: Unsupported keys in the current profile n6-noextmem are ignored: memory_desc 
 > memory_desc is not a valid key anymore, use machine_desc instead 
 >>>> EXECUTING NEURAL ART COMPILER 
 /home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/linux/atonn -i "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0.onnx" --json-quant-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q_OE_3_2_0_Q.json" -g "network.c" --load-mdesc "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.1.0/Utilities/configs/stm32n6.mdesc" --load-mpool "/home/doc/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/10.0.0/scripts/N6_scripts/my_mpools/stm32n6__noextmem.mpool" --save-mpool-file "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/stm32n6__noextmem.mpool" --out-dir-prefix "/media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws/neural_art__network/" --optimization 3 --all-buffers-info --mvei --no-hw-sw-parallelism --cache-maintenance --Oalt-sched --native-float --enable-virtual-mem-pools --Omax-ca-pipe 4 --Oshuffle-dma --Ocache-opt --Os --output-info-file "c_info.json"
 <<<< DONE EXECUTING NEURAL ART COMPILER 
 
 Exec/report summary (generate)
 ---------------------------------------------------------------------------------------------------------------------------------
 model file : /media/doc/USB5_EXT4/Projects/IA/TestAI/denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx 
 type : onnx 
 c_name : network 
 options : allocate-inputs, allocate-outputs 
 optimization : balanced 
 target/series : stm32n6npu 
 workspace dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_ws 
 output dir : /media/doc/USB5_EXT4/Projects/IA/TestAI/st_ai_output 
 model_fmt : ss/sa per channel 
 model_name : denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q 
 model_hash : 0x53769866cbc27812c941e6bf8eee7d23 
 params # : 1,250,317 items (4.77 MiB) 
 ---------------------------------------------------------------------------------------------------------------------------------
 input 1/5 : 'Input_18_out_0', int8(1x257x1), 257 Bytes, QLinear(0.024541926,2,int8), activations 
 input 2/5 : 'Input_13_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 input 3/5 : 'Input_9_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 input 4/5 : 'Input_4_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 input 5/5 : 'Input_0_out_0', int8(1x256), 256 Bytes, QLinear(0.023921722,5,int8), activations 
 inputs (total) : 0 Bytes 
 output 1/5 : 'Quantize_84_out_0', int8(1x257x1), 257 Bytes, QLinear(0.003748817,-128,int8), activations 
 output 2/5 : 'Quantize_47_out_0', int8(1x256), 256 Bytes, QLinear(0.007790764,0,int8), activations 
 output 3/5 : 'Quantize_49_out_0', int8(1x256), 256 Bytes, QLinear(0.007674512,0,int8), activations 
 output 4/5 : 'Quantize_70_out_0', int8(1x256), 256 Bytes, QLinear(0.007619916,-2,int8), activations 
 output 5/5 : 'Quantize_72_out_0', int8(1x256), 256 Bytes, QLinear(0.007183936,-1,int8), activations 
 outputs (total) : 0 Bytes 
 macc : 0 
 weights (ro) : 1,287,937 B (1.23 MiB) (4 segments) / -3,713,331(-74.2%) vs float model 
 activations (rw) : 435,585 B (425.38 KiB) (1 segment) * 
 ram (total) : 435,585 B (425.38 KiB) = 435,585 + 0 + 0 
 ---------------------------------------------------------------------------------------------------------------------------------
 (*) 'input'/'output' buffers can be used from the activations buffer
 
[...]
 
Total number of epochs: 23 of which 3 implemented in software 
 
epoch ID HW/SW/EC Operation (SW only) 
epoch 1 HW 
epoch 2 -SW- ( Conv ) 
epoch 3 -SW- ( Conv ) 
epoch 4 HW 
epoch 5 HW 
epoch 6 HW 
epoch 7 HW 
epoch 8 HW 
epoch 9 HW 
epoch 10 HW 
epoch 11 HW 
epoch 12 HW 
epoch 13 HW 
epoch 14 HW 
epoch 15 HW 
epoch 16 HW 
epoch 17 HW 
epoch 18 HW 
epoch 19 HW 
epoch 20 HW 
epoch 21 -SW- ( Conv ) 
epoch 22 HW 
epoch 23 HW 

[...]

The model is quantized using this script:

import numpy

from onnxruntime.quantization import QuantFormat, QuantType, StaticQuantConfig, quantize, preprocess, CalibrationMethod
from onnxruntime.quantization import CalibrationDataReader


# Random calibration samples matching the model's input shapes.
example_inputs = numpy.random.randn(1, 257, 1).astype(numpy.float32)
example_hidden = numpy.random.randn(1, 256).astype(numpy.float32)

class XXXDataReader(CalibrationDataReader):
    def __init__(self):
        self.enum_data = None

    def get_next(self):
        # Yield a single calibration batch, then None to stop calibration.
        if self.enum_data is None:
            self.enum_data = iter(
                [{"input": example_inputs,
                  "lstm_hidden_input_h_0": example_hidden,
                  "lstm_hidden_input_c_0": example_hidden,
                  "lstm_hidden_input_h_1": example_hidden,
                  "lstm_hidden_input_c_1": example_hidden}]
            )
        return next(self.enum_data, None)

    def rewind(self):
        pass

dr = XXXDataReader()

# QDQ static quantization: int8 activations, per-channel int8 weights.
conf = StaticQuantConfig(
    calibration_data_reader=dr,
    quant_format=QuantFormat.QDQ,
    calibrate_method=CalibrationMethod.MinMax,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    #op_types_to_quantize=["Conv","Slice"],
    extra_options={
        "ForceQuantizeNoInputCheck": True,
    },
    # nodes_to_exclude=['resnetv17_dense0_fwd', ..],
    #nodes_to_quantize=['/conv1/Conv'],
    per_channel=True)

preprocess.quant_pre_process("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx")
quantize("denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_p.onnx", "denoiser_LSTM_Valetini_nobatchsize_preprocess_simplified_q.onnx", conf)

 

When I run atonn with "--d-lower 50", I see logs like the following, but I don't know what "scale-offset format" means, as the issue was the same with symmetric input and weights:

 Lowering Conv2D_23 id=80 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_23 because output tensor (id=522) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering Conv2D_23 id=80 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_23
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257

[...]

 Lowering Gemm_28_conv_16 id=94 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Gemm_28_conv_16 because output tensor (id=611) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering Gemm_28_conv_16 id=94 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Gemm_28_conv_16
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 1024

[...]

 Lowering Conv2D_79 id=218 with Lowerer HW
Lowering Conv ...
HW lowering not done for Conv node=Conv2D_79 because output tensor (id=1458) has scale-offset format
 Let's try with the next lowerer
 Let's try with the next lowerer
 Lowering Conv2D_79 id=218 with Lowerer SW (scale offset)
Standard Software Lowering operations for node Conv2D_79
tensor w: 1
tensor h: 1
tensor ch: 257
tensor chin: 257

 

What is preventing the above three operations from running on the NPU?

Especially the Gemm_28_conv_16 node, as other Gemm operations are accelerated on the NPU.

The quantized model is in attachment.

 

Thanks for your help,

Alexis Murzeau


2 replies

Julian E. (Best answer)
Technical Moderator
June 10, 2025

Hello @AMurz.1,

 

It is caused by a bug that fails to split layers whose channel count equals certain prime numbers. We are aware of it, it has been escalated to the dev team, and we are waiting for a fix.

In your case, 257 is one of these problematic numbers.

To work around this, please change the number of channels in the affected layer to 256; that should avoid the bug.

Have a good day,

Julian
AMurz.1 (Author)
Associate II
June 11, 2025

Hi,

 

Thanks, splitting the Gemm operation into two (144 + 113) makes both of them accelerated on the NPU.
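
For reference, such a split can be done offline with onnx graph surgery. This is a hypothetical sketch applied to the float model before re-quantizing: the file and node names ("model.onnx", "Gemm_28") are placeholders, transB=0 and no bias input are assumed, and any QDQ nodes around the Gemm would need the same treatment:

import onnx
from onnx import helper, numpy_helper

model = onnx.load("model.onnx")
graph = model.graph

# Locate the Gemm whose output dimension is 257 (placeholder name).
gemm = next(n for n in graph.node if n.op_type == "Gemm" and n.name == "Gemm_28")
W = next(t for t in graph.initializer if t.name == gemm.input[1])
w = numpy_helper.to_array(W)  # shape (K, 257) with transB=0

# Split the 257 output columns into 144 + 113.
graph.initializer.extend([
    numpy_helper.from_array(w[:, :144].copy(), "W_split0"),
    numpy_helper.from_array(w[:, 144:].copy(), "W_split1")])

# Two smaller Gemms, then a Concat that rebuilds the original output tensor.
g0 = helper.make_node("Gemm", [gemm.input[0], "W_split0"], ["y_split0"], name="Gemm_s0")
g1 = helper.make_node("Gemm", [gemm.input[0], "W_split1"], ["y_split1"], name="Gemm_s1")
cat = helper.make_node("Concat", ["y_split0", "y_split1"], [gemm.output[0]],
                       name="Concat_split", axis=1)

graph.node.remove(gemm)
graph.node.extend([g0, g1, cat])  # re-sort topologically if downstream nodes follow
onnx.save(model, "model_split.onnx")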

But I have a question about the MAC/cycle rate of power-of-two Gemm operations performing 1024*256 MACs.

[Image attachment: AMurz1_5-1749678402142.png]

It seems to not use the full CONVACC MAC/cycles, according to the reference manual, one CONVACC is able to do at most 36 16*8 MACs:

[Image attachment: AMurz1_0-1749677630958.png]

But the 1024*256 Gemm operation seems to achieve only 2 MAC/cycle, according to both the tmp.dot graph output of atonn and real timing measurements on STM32N657 hardware:

[Image attachment: AMurz1_3-1749677970802.png]

 

Is there a way to improve the MAC/cycle ratio to speed up inference?

I see this paragraph in the ST Edge AI documentation, which may be related:

[Image attachment: AMurz1_4-1749678001065.png]

What "run at the speed" means ? The inference speed would be the same ?

As I have ICH = 256 and OCH = 1024 in my case, am I limited by OCH = N*16 (with N = 64)?
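
For scale, here is the arithmetic behind that question, assuming the 2 MAC/cycle figure from the graph applies to the whole operation:

\[
1024 \times 256 = 262{,}144~\text{MACs},
\qquad
\frac{262{,}144~\text{MACs}}{2~\text{MAC/cycle}} = 131{,}072~\text{cycles},
\]

versus roughly \(262{,}144 / 36 \approx 7{,}282\) cycles if the peak CONVACC rate were sustained.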

 

I'm attaching the quantized model generated by stedgeai in the st_ai_output folder, and the tmp.dot graph converted to SVG format.

Julian E.
Technical Moderator
June 12, 2025

Hello @AMurz.1,

 

Here is the answer from our experts:

 

The 2 MAC/cycle there is the maximum theoretical value that can be obtained here, given all the surrounding conditions, including memory access, etc.

[Image attachment: JulianE_0-1749728922273.jpeg]

 

In the screenshot of the SVG graph, you can see that the conv node has a property saying "choked ports = (weights)". This is because there is no data reuse with a GEMM node: each weight value is read and used only once. That's why we don't get MAC/cycle figures closer to the maximums cited by the user here:

[Image attachment: JulianE_1-1749728922274.jpeg]
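
One way to read this (an assumption on the reader's part, not an ST statement): if each int8 weight is fetched exactly once and the weight port sustains \(B\) bytes per cycle, the sustained throughput is bounded by the fetch rate rather than by the MAC array:

\[
\text{MAC/cycle} \;\le\; \frac{B~\text{bytes/cycle}}{1~\text{byte/weight}} = B,
\]

so a port delivering 2 weight bytes per cycle would give exactly the observed 2 MAC/cycle ceiling.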

 

 

 

Furthermore, the 16x8 SIMD mode is enabled (chosen since FSUB != 0), and thus the Deep1x1 mode cannot be enabled.

 

Have a good day,

Julian