STM32N6 FP32 model on CPU+NPU flow gives higher latency than CPU-only flow.
Is FP32 forcing software kernels and preventing NPU acceleration?
Hi,
I am testing a CIFAR-10 ResNet on the STM32N6570-DK and need help understanding a latency gap between two STEdgeAI generation flows.
I used the same FP32 model as input to both flows:
- CPU-only generation flow
- CPU+NPU generation flow
Both flows generated C files and hex successfully, and both run correctly on target.
However, the CPU+NPU build is significantly slower than CPU-only.
What I would like ST to confirm:
- For STM32N6, can full FP32 ResNet graphs be truly accelerated on Neural-ART, or are they generally expected to run as software float kernels?
- Is it normal that CPU+NPU generation for FP32 can be slower than CPU-only generation?
- For real NPU acceleration, is calibrated INT8 mandatory in practice?
- Which exact generator settings should I use to ensure that most layers map to NPU kernels (and how to verify mapping from generated artifacts)?
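For context on the calibrated-INT8 question: my understanding is that calibration runs a representative dataset through the model to pick per-tensor scale/zero-point parameters. A minimal pure-Python sketch of that math (illustrative only, not the actual STEdgeAI/TFLite implementation; all function names are my own):

```python
def calibrate_affine_int8(samples):
    """Pick scale/zero-point from representative data (asymmetric affine int8)."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # quantized range must contain real 0
    scale = (hi - lo) / 255.0            # int8 spans 256 levels: [-128, 127]
    zero_point = round(-128 - lo / scale)
    return scale, zero_point

def quantize(x, scale, zp):
    """Real value -> int8 code, clamped to the int8 range."""
    return max(-128, min(127, round(x / scale) + zp))

def dequantize(q, scale, zp):
    """int8 code -> approximate real value."""
    return (q - zp) * scale
```

With parameters like these, tensors can be stored and processed as int8, which (as far as I understand) is what lets the NPU's integer datapath do the work instead of software float kernels.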
Observed latency:
- CPU-only flow: about 211 ms
- CPU+NPU flow: about 439 ms (roughly 2.1x slower)
My main question:
Is this expected for FP32 models on STM32N6, i.e. are the operations mapped to software float kernels (CPU path) anyway, so that the CPU+NPU flow only adds runtime overhead without any real NPU acceleration?
My target and setup:
- Board: STM32N6570-DK
- Model: ResNet (FP32, TFLite)
- Tool: STEdgeAI-generated C sources and hex
- Measurement: on-board DWT cycle counter, single-sample inference loop
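For reference, this is how I convert the DWT cycle-count deltas into the millisecond figures above (the 800 MHz CPU clock is an assumption on my side; the actual value depends on my clock-tree configuration):

```python
CPU_HZ = 800_000_000  # assumed CPU clock; verify against the actual clock configuration

def dwt_cycles_to_ms(cycles: int, cpu_hz: int = CPU_HZ) -> float:
    """Convert a DWT CYCCNT delta into milliseconds."""
    return cycles * 1000.0 / cpu_hz
```

At 800 MHz, the ~211 ms CPU-only figure corresponds to roughly 169 million cycles per inference.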
