Question

STM32N6 – ONNX model performs well in Python (including ST optimized model) but degrades on target

February 15, 2026 · 1 reply · 229 views

Hi,

I am compiling a MiniFASNet-based liveness ONNX model for STM32N6 using ST Edge AI Core v2.2.0.

The model behaves correctly in Python, but when deployed on STM32N6 the performance degrades noticeably.

What is confusing is the following:

  • Original ONNX model → good results in Python

  • ST-generated optimized model (*_OE_3_3_0.onnx) → also good results in Python (though not directly compared against the original model's outputs)

  • Same compiled model running on STM32N6 → significantly worse liveness performance

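Since the optimized model was never compared against the original, a quick ONNX Runtime check can rule out drift introduced by the optimizer itself. A minimal sketch, assuming `onnxruntime` is installed and the file paths match those in the log below; note the input layout is an assumption here (the original model may expect NCHW while the compiled target input is NHWC, so transpose as needed for your model):

```python
# Sketch: sanity-check the ST-optimized ONNX against the original in ONNX
# Runtime on the same input. Paths and input layout are assumptions.
import os
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened output tensors."""
    a, b = np.ravel(a).astype(np.float64), np.ravel(b).astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_onnx(path, x):
    import onnxruntime as ort  # lazy import; requires `pip install onnxruntime`
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return sess.run(None, {sess.get_inputs()[0].name: x})[0]

if __name__ == "__main__" and os.path.exists("best_model_quantized_calib.onnx"):
    x = np.random.rand(1, 3, 128, 128).astype(np.float32)  # NCHW assumed here
    ref = run_onnx("best_model_quantized_calib.onnx", x)
    opt = run_onnx("st_ai_output/best_model_quantized_calib_OE_3_3_0.onnx", x)
    print("max |diff|:", float(np.abs(ref - opt).max()), "cos:", cosine(ref, opt))
```

If the cosine similarity is already below ~1.0 at this stage, the discrepancy predates the target entirely. Real face crops from the calibration set would be more representative than random input here.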

Model & Compilation Details

Command used:

./stedgeai generate \
  --model best_model_quantized_calib.onnx \
  --target stm32n6 \
  --input-data-type float32 \
  --output-data-type float32 \
  --inputs-ch-position chlast \
  --no-onnx-optimizer \
  --verbosity 3

Compilation summary (excerpt):

  • Input: f32 (1x128x128x3)

  • Output: f32 (1x2)

  • Model format: ss/sa per tensor

  • 119 epochs (2 implemented in software: QuantizeLinear, DequantizeLinear)

  • Native float enabled

  • Activations allocated in NPU RAM regions

(Full log pasted below)


What Has Been Verified

  1. Preprocessing on STM32 matches Python exactly:

    • Resize size identical (128x128)

    • Same interpolation

    • Same normalization

    • Same channel order

  2. Postprocessing matches Python:

    • Same logit difference logic

    • Same threshold

    • Same decision rule

  3. ST optimized ONNX (*_OE_3_3_0.onnx) produces correct results in Python.

The discrepancy only appears when executing on STM32N6.
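For concreteness, the postprocessing being matched between Python and the target is a logit-difference decision rule along these lines. This is a sketch only: the threshold value, class order, and names are placeholders, not the poster's actual values; the point is that both implementations must agree on them exactly:

```python
# Sketch of a logit-difference liveness decision rule. Threshold and class
# order are placeholder assumptions that must match on both PC and target.
import numpy as np

LIVENESS_THRESHOLD = 0.0  # placeholder value

def classify(logits, threshold=LIVENESS_THRESHOLD):
    """Return True (live) when the live-minus-spoof logit margin exceeds threshold."""
    live, spoof = float(logits[0]), float(logits[1])  # class order assumed
    return (live - spoof) > threshold

print(classify(np.array([2.1, -0.3], dtype=np.float32)))  # -> True
print(classify(np.array([-1.0, 1.5], dtype=np.float32)))  # -> False
```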


Observed Behavior on Target

  • Outputs are not random

  • Inference runs successfully

  • Logits are reasonable values

  • However, classification confidence is consistently lower than in Python


Questions

  1. Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution?

  2. Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model?


Goal

I want to isolate whether this is:

  • A runtime configuration issue

  • A memory/cache issue

  • Or something specific to the STM32N6 execution environment

Any guidance on how to systematically debug numeric differences between PC and STM32N6 execution would be appreciated.
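One way to make the comparison systematic is to log the target's output logits for a set of real samples (e.g. over UART), run the same preprocessed inputs through the Python model, and compute per-sample error statistics. A minimal sketch; the function and key names are illustrative, not from any ST tool:

```python
# Sketch: systematic PC-vs-target comparison over N samples of a 2-class model.
# pc_logits / tgt_logits are (N x 2) arrays collected for identical inputs.
import numpy as np

def compare(pc_logits, tgt_logits):
    """Error statistics between PC and target logits for a 2-class model."""
    pc, tgt = np.asarray(pc_logits, float), np.asarray(tgt_logits, float)
    diff = np.abs(pc - tgt)
    # Margin = live-minus-spoof logit difference, i.e. the decision quantity
    margin_pc = pc[:, 0] - pc[:, 1]
    margin_tgt = tgt[:, 0] - tgt[:, 1]
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "margin_shift": float(np.mean(margin_tgt - margin_pc)),
        "sign_flips": int(np.sum(np.sign(margin_pc) != np.sign(margin_tgt))),
    }

# Toy illustration with made-up logits
pc = [[2.0, -1.0], [0.4, 0.6]]
tgt = [[1.8, -0.9], [0.7, 0.5]]
print(compare(pc, tgt))
```

A consistent `margin_shift` would point at a systematic numeric bias (e.g. quantization or accumulation differences), while a large `max_abs_diff` on isolated samples would point more toward a memory or buffer issue.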


PS C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows> ./stedgeai generate --model .\best_model_quantized_calib.onnx --target stm32n6 --st-neural-art default@user_neuralart.json --input-data-type float32 --output-data-type float32 --inputs-ch-position chlast --no-onnx-optimizer --verbosity 3
ST Edge AI Core v2.2.0-20266 2adc00962
WARNING: Unsupported keys in the current profile default are ignored: memory_desc
 > memory_desc is not a valid key anymore, use machine_desc instead
>>>> EXECUTING NEURAL ART COMPILER
 C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/atonn.exe -i "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0.onnx" --json-quant-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0_Q.json" -g "network.c" --load-mdesc "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/configs/stm32n6.mdesc" --load-mpool "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/my_mpools/stm32n6-app2.mpool" --save-mpool-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/stm32n6-app2.mpool" --out-dir-prefix "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/" --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file "c_info.json"
<<<< DONE EXECUTING NEURAL ART COMPILER

Exec/report summary (generate)
--------------------------------------------------------------------------------------------------------------
model file : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\best_model_quantized_calib.onnx
type : onnx
c_name : network
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32n6npu
workspace dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws
output dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output
model_fmt : ss/sa per tensor
model_name : best_model_quantized_calib
model_hash : 0x72a0c3e8b907f5eb00804c4e2a91e8d1
params # : 468,048 items (1.79 MiB)
--------------------------------------------------------------------------------------------------------------
input 1/1 : 'Input_0_out_0', f32(1x128x128x3), 192.00 KBytes, activations
output 1/1 : 'Dequantize_273_out_0', f32(1x2), 8 Bytes, activations
macc : 0
weights (ro) : 513,105 B (501.08 KiB) (1 segment) / -1,359,087(-72.6%) vs float model
activations (rw) : 1,476,608 B (1.41 MiB) (4 segments) *
ram (total) : 1,476,608 B (1.41 MiB) = 1,476,608 + 0 + 0
--------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers are allocated in the activations buffer

Computing AI RT data/code size (target=stm32n6npu)..
-> compiler "gcc:arm-none-eabi-gcc" is not in the PATH

Compilation details
 ---------------------------------------------------------------------------------
Compiler version: 1.1.1-14
Compiler arguments: -i C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx --json-quant-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json -g network.c --load-mdesc C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\configs\stm32n6.mdesc --load-mpool C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\my_mpools\stm32n6-app2.mpool --save-mpool-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network\stm32n6-app2.mpool --out-dir-prefix C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network/ --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file c_info.json
====================================================================================
Memory usage information (input/output buffers are included in activations)
 ---------------------------------------------------------------------------------
 npuRAM3 [0x34200000 - 0x34270000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
 npuRAM4 [0x34270000 - 0x342E0000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
 npuRAM5 [0x342E0000 - 0x34350000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
 npuRAM6 [0x34350000 - 0x343C0000]: 98.000 kB / 448.000 kB ( 21.88 % used) -- weights: 0 B ( 0.00 % used) activations: 98.000 kB ( 21.88 % used)
 octoFlash [0x72880000 - 0x72C80000]: 501.079 kB / 4.000 MB ( 12.23 % used) -- weights: 501.079 kB ( 12.23 % used) activations: 0 B ( 0.00 % used)
 hyperRAM [0x90000000 - 0x91000000]: 0 B / 16.000 MB ( 0.00 % used) -- weights: 0 B ( 0.00 % used) activations: 0 B ( 0.00 % used)

Total: 1.898 MB -- weights: 501.079 kB activations: 1.408 MB
====================================================================================
Used memory ranges
 ---------------------------------------------------------------------------------
 npuRAM3 [0x34200000 - 0x34270000]: 0x34200000-0x34270000
 npuRAM4 [0x34270000 - 0x342E0000]: 0x34270000-0x342E0000
 npuRAM5 [0x342E0000 - 0x34350000]: 0x342E0000-0x34350000
 npuRAM6 [0x34350000 - 0x343C0000]: 0x34350000-0x34368800
 octoFlash [0x72880000 - 0x72C80000]: 0x72880000-0x728FD460
====================================================================================
Epochs details
 ---------------------------------------------------------------------------------
Total number of epochs: 119 of which 2 implemented in software

epoch ID HW/SW/EC Operation (SW only)
epoch 1 HW
epoch 2 -SW- ( QuantizeLinear )
epoch 3 HW
epoch 4 HW
epoch 5 HW
epoch 6 HW
epoch 7 HW
epoch 8 HW
epoch 9 HW
epoch 10 HW
epoch 11 HW
epoch 12 HW
epoch 13 HW
epoch 14 HW
epoch 15 HW
epoch 16 HW
epoch 17 HW
epoch 18 HW
epoch 19 HW
epoch 20 HW
epoch 21 HW
epoch 22 HW
epoch 23 HW
epoch 24 HW
epoch 25 HW
epoch 26 HW
epoch 27 HW
epoch 28 HW
epoch 29 HW
epoch 30 HW
epoch 31 HW
epoch 32 HW
epoch 33 HW
epoch 34 HW
epoch 35 HW
epoch 36 HW
epoch 37 HW
epoch 38 HW
epoch 39 HW
epoch 40 HW
epoch 41 HW
epoch 42 HW
epoch 43 HW
epoch 44 HW
epoch 45 HW
epoch 46 HW
epoch 47 HW
epoch 48 HW
epoch 49 HW
epoch 50 HW
epoch 51 HW
epoch 52 HW
epoch 53 HW
epoch 54 HW
epoch 55 HW
epoch 56 HW
epoch 57 HW
epoch 58 HW
epoch 59 HW
epoch 60 HW
epoch 61 HW
epoch 62 HW
epoch 63 HW
epoch 64 HW
epoch 65 HW
epoch 66 HW
epoch 67 HW
epoch 68 HW
epoch 69 HW
epoch 70 HW
epoch 71 HW
epoch 72 HW
epoch 73 HW
epoch 74 HW
epoch 75 HW
epoch 76 HW
epoch 77 HW
epoch 78 HW
epoch 79 HW
epoch 80 HW
epoch 81 HW
epoch 82 HW
epoch 83 HW
epoch 84 HW
epoch 85 HW
epoch 86 HW
epoch 87 HW
epoch 88 HW
epoch 89 HW
epoch 90 HW
epoch 91 HW
epoch 92 HW
epoch 93 HW
epoch 94 HW
epoch 95 HW
epoch 96 HW
epoch 97 HW
epoch 98 HW
epoch 99 HW
epoch 100 HW
epoch 101 HW
epoch 102 HW
epoch 103 HW
epoch 104 HW
epoch 105 HW
epoch 106 HW
epoch 107 HW
epoch 108 HW
epoch 109 HW
epoch 110 HW
epoch 111 HW
epoch 112 HW
epoch 113 HW
epoch 114 HW
epoch 115 HW
epoch 116 HW
epoch 117 HW
epoch 118 HW
epoch 119 -SW- ( DequantizeLinear )
====================================================================================

Generated files (5)
--------------------------------------------------------------------------------------------------------------
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.c
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_atonbuf.xSPI2.raw
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.h

Creating txt report file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_generate_report.txt
elapsed time (generate): 271.131s

1 reply

Julian E.
Technical Moderator
February 19, 2026

Hi @Afreen,

 

First, to answer your questions:

  1. Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution? Yes

  2. Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model? I don't think so

 

I would suggest updating ST Edge AI Core to version 3.0, then installing the new tool that replaces X-CUBE-AI so you can validate your model on 3.0. More info here: Introducing STM32CubeAI Studio - STMicroelectronics Community

 

I suggest validating your model on target both with and without the NPU and checking the "COS" metric in the report. It should be very close to 1; if it is not, the output of the compiled model differs from the original model's, which could indicate a bug.
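For reference, the COS metric is the cosine similarity between the reference outputs and the outputs read back from the target. A minimal sketch of the same computation (the sample values are illustrative only):

```python
# Sketch: the cosine-similarity ("COS") check between reference outputs
# (host / ONNX Runtime) and outputs read back from the target.
import numpy as np

def cos_metric(ref, target):
    ref, target = np.ravel(ref).astype(np.float64), np.ravel(target).astype(np.float64)
    return float(np.dot(ref, target) / (np.linalg.norm(ref) * np.linalg.norm(target)))

print(cos_metric([2.1, -0.3], [2.1, -0.3]))  # ~1.0 for identical outputs
print(cos_metric([2.1, -0.3], [1.4, 0.2]))   # < 1.0 when outputs diverge
```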

 

Validating the model with and without the NPU tells you whether the problem lies in the STM32 software libraries or in the Neural-ART (NPU) library.

 

Note that it is better to validate with real data rather than random data.

 

Have a good day,

Julian

 

In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.