Question

STM32N6 – ONNX model performs well in Python (including ST optimized model) but degrades on target

February 15, 2026 · 1 reply · 229 views

Hi,

I am compiling a MiniFASNet-based liveness ONNX model for STM32N6 using ST Edge AI Core v2.2.0.

The model behaves correctly in Python, but when deployed on STM32N6 the performance degrades noticeably.

What is confusing is the following:

  • Original ONNX model → good results in Python

  • ST-generated optimized model (*_OE_3_3_0.onnx) → also good results in Python (though not directly compared against the original model's outputs)

  • Same compiled model running on STM32N6 → significantly worse liveness performance

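Since the optimized model was never compared against the original, a quick ONNX Runtime check can rule out drift introduced by the optimizer itself. A minimal sketch, assuming `onnxruntime` is installed and the file paths match those in the log below; note the input layout is an assumption here (the original model may expect NCHW while the compiled target input is NHWC, so transpose as needed for your model):

```python
# Sketch: sanity-check the ST-optimized ONNX against the original in ONNX
# Runtime on the same input. Paths and input layout are assumptions.
import os
import numpy as np

def cosine(a, b):
    """Cosine similarity between two flattened output tensors."""
    a, b = np.ravel(a).astype(np.float64), np.ravel(b).astype(np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_onnx(path, x):
    import onnxruntime as ort  # lazy import; requires `pip install onnxruntime`
    sess = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return sess.run(None, {sess.get_inputs()[0].name: x})[0]

if __name__ == "__main__" and os.path.exists("best_model_quantized_calib.onnx"):
    x = np.random.rand(1, 3, 128, 128).astype(np.float32)  # NCHW assumed here
    ref = run_onnx("best_model_quantized_calib.onnx", x)
    opt = run_onnx("st_ai_output/best_model_quantized_calib_OE_3_3_0.onnx", x)
    print("max |diff|:", float(np.abs(ref - opt).max()), "cos:", cosine(ref, opt))
```

If the cosine similarity is already below ~1.0 at this stage, the discrepancy predates the target entirely. Real face crops from the calibration set would be more representative than random input here.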

Model & Compilation Details

Command used:

./stedgeai generate \
  --model best_model_quantized_calib.onnx \
  --target stm32n6 \
  --input-data-type float32 \
  --output-data-type float32 \
  --inputs-ch-position chlast \
  --no-onnx-optimizer \
  --verbosity 3

Compilation summary (excerpt):

  • Input: f32 (1x128x128x3)

  • Output: f32 (1x2)

  • Model format: ss/sa per tensor

  • 119 epochs (2 implemented in software: QuantizeLinear, DequantizeLinear)

  • Native float enabled

  • Activations allocated in NPU RAM regions

(Full log pasted below)


What Has Been Verified

  1. Preprocessing on STM32 matches Python exactly:

    • Resize size identical (128x128)

    • Same interpolation

    • Same normalization

    • Same channel order

  2. Postprocessing matches Python:

    • Same logit difference logic

    • Same threshold

    • Same decision rule

  3. ST optimized ONNX (*_OE_3_3_0.onnx) produces correct results in Python.

The discrepancy only appears when executing on STM32N6.
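For concreteness, the postprocessing being matched between Python and the target is a logit-difference decision rule along these lines. This is a sketch only: the threshold value, class order, and names are placeholders, not the poster's actual values; the point is that both implementations must agree on them exactly:

```python
# Sketch of a logit-difference liveness decision rule. Threshold and class
# order are placeholder assumptions that must match on both PC and target.
import numpy as np

LIVENESS_THRESHOLD = 0.0  # placeholder value

def classify(logits, threshold=LIVENESS_THRESHOLD):
    """Return True (live) when the live-minus-spoof logit margin exceeds threshold."""
    live, spoof = float(logits[0]), float(logits[1])  # class order assumed
    return (live - spoof) > threshold

print(classify(np.array([2.1, -0.3], dtype=np.float32)))  # -> True
print(classify(np.array([-1.0, 1.5], dtype=np.float32)))  # -> False
```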


Observed Behavior on Target

  • Outputs are not random

  • Inference runs successfully

  • Logits are reasonable values

  • However, classification confidence is consistently lower than in Python


Questions

  1. Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution?

  2. Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model?


Goal

I want to isolate whether this is:

  • A runtime configuration issue

  • A memory/cache issue

  • Or something specific to the STM32N6 execution environment

Any guidance on how to systematically debug numeric differences between PC and STM32N6 execution would be appreciated.
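One way to make the comparison systematic is to log the target's output logits for a set of real samples (e.g. over UART), run the same preprocessed inputs through the Python model, and compute per-sample error statistics. A minimal sketch; the function and key names are illustrative, not from any ST tool:

```python
# Sketch: systematic PC-vs-target comparison over N samples of a 2-class model.
# pc_logits / tgt_logits are (N x 2) arrays collected for identical inputs.
import numpy as np

def compare(pc_logits, tgt_logits):
    """Error statistics between PC and target logits for a 2-class model."""
    pc, tgt = np.asarray(pc_logits, float), np.asarray(tgt_logits, float)
    diff = np.abs(pc - tgt)
    # Margin = live-minus-spoof logit difference, i.e. the decision quantity
    margin_pc = pc[:, 0] - pc[:, 1]
    margin_tgt = tgt[:, 0] - tgt[:, 1]
    return {
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "margin_shift": float(np.mean(margin_tgt - margin_pc)),
        "sign_flips": int(np.sum(np.sign(margin_pc) != np.sign(margin_tgt))),
    }

# Toy illustration with made-up logits
pc = [[2.0, -1.0], [0.4, 0.6]]
tgt = [[1.8, -0.9], [0.7, 0.5]]
print(compare(pc, tgt))
```

A consistent `margin_shift` would point at a systematic numeric bias (e.g. quantization or accumulation differences), while a large `max_abs_diff` on isolated samples would point more toward a memory or buffer issue.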


PS C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows> ./stedgeai generate --model .\best_model_quantized_calib.onnx --target stm32n6 --st-neural-art default@user_neuralart.json --input-data-type float32 --output-data-type float32 --inputs-ch-position chlast --no-onnx-optimizer --verbosity 3
ST Edge AI Core v2.2.0-20266 2adc00962
WARNING: Unsupported keys in the current profile default are ignored: memory_desc
 > memory_desc is not a valid key anymore, use machine_desc instead
>>>> EXECUTING NEURAL ART COMPILER
 C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/atonn.exe -i "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0.onnx" --json-quant-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_output/best_model_quantized_calib_OE_3_3_0_Q.json" -g "network.c" --load-mdesc "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/configs/stm32n6.mdesc" --load-mpool "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/my_mpools/stm32n6-app2.mpool" --save-mpool-file "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/stm32n6-app2.mpool" --out-dir-prefix "C:/ST/STEdgeAI_Feb_2026_Latest/2.2/Utilities/windows/st_ai_ws/neural_art__network/" --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file "c_info.json"
<<<< DONE EXECUTING NEURAL ART COMPILER

Exec/report summary (generate)
--------------------------------------------------------------------------------------------------------------
model file : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\best_model_quantized_calib.onnx
type : onnx
c_name : network
options : allocate-inputs, allocate-outputs
optimization : balanced
target/series : stm32n6npu
workspace dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws
output dir : C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output
model_fmt : ss/sa per tensor
model_name : best_model_quantized_calib
model_hash : 0x72a0c3e8b907f5eb00804c4e2a91e8d1
params # : 468,048 items (1.79 MiB)
--------------------------------------------------------------------------------------------------------------
input 1/1 : 'Input_0_out_0', f32(1x128x128x3), 192.00 KBytes, activations
output 1/1 : 'Dequantize_273_out_0', f32(1x2), 8 Bytes, activations
macc : 0
weights (ro) : 513,105 B (501.08 KiB) (1 segment) / -1,359,087(-72.6%) vs float model
activations (rw) : 1,476,608 B (1.41 MiB) (4 segments) *
ram (total) : 1,476,608 B (1.41 MiB) = 1,476,608 + 0 + 0
--------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers are allocated in the activations buffer

Computing AI RT data/code size (target=stm32n6npu)..
-> compiler "gcc:arm-none-eabi-gcc" is not in the PATH

Compilation details
 ---------------------------------------------------------------------------------
Compiler version: 1.1.1-14
Compiler arguments: -i C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx --json-quant-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json -g network.c --load-mdesc C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\configs\stm32n6.mdesc --load-mpool C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\my_mpools\stm32n6-app2.mpool --save-mpool-file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network\stm32n6-app2.mpool --out-dir-prefix C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_ws\neural_art__network/ --all-buffers-info --no-hw-sw-parallelism --cache-maintenance --enable-virtual-mem-pools --native-float --optimization 3 --Os --Omax-ca-pipe 4 --Ocache-opt --output-info-file c_info.json
====================================================================================
Memory usage information (input/output buffers are included in activations)
 ---------------------------------------------------------------------------------
 npuRAM3 [0x34200000 - 0x34270000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
 npuRAM4 [0x34270000 - 0x342E0000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
 npuRAM5 [0x342E0000 - 0x34350000]: 448.000 kB / 448.000 kB (100.00 % used) -- weights: 0 B ( 0.00 % used) activations: 448.000 kB (100.00 % used)
 npuRAM6 [0x34350000 - 0x343C0000]: 98.000 kB / 448.000 kB ( 21.88 % used) -- weights: 0 B ( 0.00 % used) activations: 98.000 kB ( 21.88 % used)
 octoFlash [0x72880000 - 0x72C80000]: 501.079 kB / 4.000 MB ( 12.23 % used) -- weights: 501.079 kB ( 12.23 % used) activations: 0 B ( 0.00 % used)
 hyperRAM [0x90000000 - 0x91000000]: 0 B / 16.000 MB ( 0.00 % used) -- weights: 0 B ( 0.00 % used) activations: 0 B ( 0.00 % used)

Total: 1.898 MB -- weights: 501.079 kB activations: 1.408 MB
====================================================================================
Used memory ranges
 ---------------------------------------------------------------------------------
 npuRAM3 [0x34200000 - 0x34270000]: 0x34200000-0x34270000
 npuRAM4 [0x34270000 - 0x342E0000]: 0x34270000-0x342E0000
 npuRAM5 [0x342E0000 - 0x34350000]: 0x342E0000-0x34350000
 npuRAM6 [0x34350000 - 0x343C0000]: 0x34350000-0x34368800
 octoFlash [0x72880000 - 0x72C80000]: 0x72880000-0x728FD460
====================================================================================
Epochs details
 ---------------------------------------------------------------------------------
Total number of epochs: 119 of which 2 implemented in software

epoch ID HW/SW/EC Operation (SW only)
epoch 1 HW
epoch 2 -SW- ( QuantizeLinear )
epoch 3 HW
epoch 4 HW
epoch 5 HW
epoch 6 HW
epoch 7 HW
epoch 8 HW
epoch 9 HW
epoch 10 HW
epoch 11 HW
epoch 12 HW
epoch 13 HW
epoch 14 HW
epoch 15 HW
epoch 16 HW
epoch 17 HW
epoch 18 HW
epoch 19 HW
epoch 20 HW
epoch 21 HW
epoch 22 HW
epoch 23 HW
epoch 24 HW
epoch 25 HW
epoch 26 HW
epoch 27 HW
epoch 28 HW
epoch 29 HW
epoch 30 HW
epoch 31 HW
epoch 32 HW
epoch 33 HW
epoch 34 HW
epoch 35 HW
epoch 36 HW
epoch 37 HW
epoch 38 HW
epoch 39 HW
epoch 40 HW
epoch 41 HW
epoch 42 HW
epoch 43 HW
epoch 44 HW
epoch 45 HW
epoch 46 HW
epoch 47 HW
epoch 48 HW
epoch 49 HW
epoch 50 HW
epoch 51 HW
epoch 52 HW
epoch 53 HW
epoch 54 HW
epoch 55 HW
epoch 56 HW
epoch 57 HW
epoch 58 HW
epoch 59 HW
epoch 60 HW
epoch 61 HW
epoch 62 HW
epoch 63 HW
epoch 64 HW
epoch 65 HW
epoch 66 HW
epoch 67 HW
epoch 68 HW
epoch 69 HW
epoch 70 HW
epoch 71 HW
epoch 72 HW
epoch 73 HW
epoch 74 HW
epoch 75 HW
epoch 76 HW
epoch 77 HW
epoch 78 HW
epoch 79 HW
epoch 80 HW
epoch 81 HW
epoch 82 HW
epoch 83 HW
epoch 84 HW
epoch 85 HW
epoch 86 HW
epoch 87 HW
epoch 88 HW
epoch 89 HW
epoch 90 HW
epoch 91 HW
epoch 92 HW
epoch 93 HW
epoch 94 HW
epoch 95 HW
epoch 96 HW
epoch 97 HW
epoch 98 HW
epoch 99 HW
epoch 100 HW
epoch 101 HW
epoch 102 HW
epoch 103 HW
epoch 104 HW
epoch 105 HW
epoch 106 HW
epoch 107 HW
epoch 108 HW
epoch 109 HW
epoch 110 HW
epoch 111 HW
epoch 112 HW
epoch 113 HW
epoch 114 HW
epoch 115 HW
epoch 116 HW
epoch 117 HW
epoch 118 HW
epoch 119 -SW- ( DequantizeLinear )
====================================================================================

Generated files (5)
--------------------------------------------------------------------------------------------------------------
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0.onnx
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\best_model_quantized_calib_OE_3_3_0_Q.json
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.c
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_atonbuf.xSPI2.raw
C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network.h

Creating txt report file C:\ST\STEdgeAI_Feb_2026_Latest\2.2\Utilities\windows\st_ai_output\network_generate_report.txt
elapsed time (generate): 271.131s

1 reply

Julian E.
Technical Moderator
February 19, 2026

Hi @Afreen,

 

First, to answer your questions:

  1. Could there be any known STM32N6 NPU runtime considerations that could cause numeric drift compared to PC execution? Yes

  2. Is there any scenario where per-tensor quantization fallback could behave differently on target compared to ONNX Runtime execution of the optimized model? I don't think so

 

I would suggest updating ST Edge AI Core to version 3.0, then installing the new tool that replaces X-CUBE-AI so you can validate your model on 3.0. More info here: Introducing STM32CubeAI Studio - STMicroelectronics Community

 

I suggest validating your model on target both with and without the NPU and checking the "COS" metric in the report. It should be very close to 1; if it is not, the output of the compiled model differs from the original model's, which could indicate a bug.
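For reference, the COS metric is the cosine similarity between the reference outputs and the outputs read back from the target. A minimal sketch of the same computation (the sample values are illustrative only):

```python
# Sketch: the cosine-similarity ("COS") check between reference outputs
# (host / ONNX Runtime) and outputs read back from the target.
import numpy as np

def cos_metric(ref, target):
    ref, target = np.ravel(ref).astype(np.float64), np.ravel(target).astype(np.float64)
    return float(np.dot(ref, target) / (np.linalg.norm(ref) * np.linalg.norm(target)))

print(cos_metric([2.1, -0.3], [2.1, -0.3]))  # ~1.0 for identical outputs
print(cos_metric([2.1, -0.3], [1.4, 0.2]))   # < 1.0 when outputs diverge
```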

 

Validating the model with and without the NPU tells you whether the problem lies in the STM32 software libraries or in the Neural-ART (NPU) library.

 

Note that it is better to validate with real data rather than random data.

 

Have a good day,

Julian

 

In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.