STM32N6 FP32 model on CPU+NPU flow gives higher latency than CPU-only flow.
Is FP32 forcing software kernels and preventing NPU acceleration?
Hi,
I am testing a CIFAR-10 ResNet on the STM32N6570-DK and need help understanding a latency gap between two STEdgeAI generation flows.
I used the same FP32 model as input to both flows:
- CPU-only generation flow
- CPU+NPU generation flow
Both flows generated C files and hex successfully, and both run correctly on target.
However, the CPU+NPU build is significantly slower than CPU-only.
What I would like ST to confirm:
- For STM32N6, can full FP32 ResNet graphs be truly accelerated on Neural-ART, or are they generally expected to run as software float kernels?
- Is it normal that CPU+NPU generation for FP32 can be slower than CPU-only generation?
- For real NPU acceleration, is calibrated INT8 mandatory in practice?
- Which exact generator settings should I use to ensure that most layers map to NPU kernels (and how to verify mapping from generated artifacts)?
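For context on the calibrated-INT8 question: my understanding is that calibration runs a representative dataset through the model to pick per-tensor scale/zero-point parameters. A minimal pure-Python sketch of that math (illustrative only, not the actual STEdgeAI/TFLite implementation; all function names are my own):

```python
def calibrate_affine_int8(samples):
    """Pick scale/zero-point from representative data (asymmetric affine int8)."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # quantized range must contain real 0
    scale = (hi - lo) / 255.0            # int8 spans 256 levels: [-128, 127]
    zero_point = round(-128 - lo / scale)
    return scale, zero_point

def quantize(x, scale, zp):
    """Real value -> int8 code, clamped to the int8 range."""
    return max(-128, min(127, round(x / scale) + zp))

def dequantize(q, scale, zp):
    """int8 code -> approximate real value."""
    return (q - zp) * scale
```

With parameters like these, tensors can be stored and processed as int8, which (as far as I understand) is what lets the NPU's integer datapath do the work instead of software float kernels.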
Observed latency:
- CPU-only flow: about 211 ms
- CPU+NPU flow: about 439 ms (roughly 2.1x slower)
My main question:
Is this expected for FP32 models on STM32N6, i.e. are the operations mapped to software float kernels (CPU path) anyway, so that the CPU+NPU flow only adds runtime overhead without any real NPU acceleration?
My target and setup:
- Board: STM32N6570-DK
- Model: ResNet (FP32, TFLite)
- Tool: STEdgeAI-generated C sources and hex
- Measurement: on-board DWT cycle counter, single-sample inference loop
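For reference, this is how I convert the DWT cycle-count deltas into the millisecond figures above (the 800 MHz CPU clock is an assumption on my side; the actual value depends on my clock-tree configuration):

```python
CPU_HZ = 800_000_000  # assumed CPU clock; verify against the actual clock configuration

def dwt_cycles_to_ms(cycles: int, cpu_hz: int = CPU_HZ) -> float:
    """Convert a DWT CYCCNT delta into milliseconds."""
    return cycles * 1000.0 / cpu_hz
```

At 800 MHz, the ~211 ms CPU-only figure corresponds to roughly 169 million cycles per inference.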
