Associate II
June 23, 2025
Solved

STM32N6: CubeAI ?? Epoch Issue and Why PReLU Runs in Software After Quantization


Hello everyone,

I am using CubeAI 10.1.0 and ST Edge AI 2.1 to analyze my model for the STM32N6, and I encountered an issue where some epochs show ?? instead of the expected result. Here is the log:

epoch ID   HW/SW/EC   Operation (SW only)
epoch 1    EC
epoch 2    EC
epoch 3    -SW-       (DequantizeLinear)
epoch 4    -SW-       (PRelu)
epoch 5    -SW-       (QuantizeLinear)
epoch 6    -SW-       (MaxPool)
epoch 7    EC
epoch 8    EC
epoch 9    -SW-       (DequantizeLinear)
epoch 10   -SW-       (PRelu)
epoch 11   -SW-       (QuantizeLinear)
epoch 12   EC
epoch 13   EC
epoch 14   -SW-       (DequantizeLinear)
epoch 15   -SW-       (PRelu)
epoch 16   EC
epoch 17   -SW-       (Conv)
epoch 18   -SW-       (Add)
epoch 19   EC
epoch 20   -SW-       (Conv)
epoch 21   -SW-       (Add)
epoch 22   ??
epoch 23   -SW-       (Add)
epoch 24   ??
epoch 25   -SW-       (Add)
epoch 26   EC

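For context, my reading of the log above is that each DequantizeLinear → PRelu → QuantizeLinear triplet (epochs 3–5, 9–11) means the activation is computed in float on the CPU, with int8 conversions wrapped around it. Below is a minimal NumPy sketch of what I believe such a software epoch computes; the scale and zero-point values are invented for illustration:

```python
import numpy as np

# Sketch of a software Dequantize -> PReLU -> Quantize epoch.
# Scale/zero-point values are made up for this example.

def dequantize(q, scale, zero_point):
    # int8 -> float: (q - zero_point) * scale
    return (q.astype(np.int32) - zero_point) * scale

def prelu(x, slope):
    # PReLU: identity for x >= 0, slope * x for x < 0
    return np.where(x >= 0, x, slope * x)

def quantize(x, scale, zero_point):
    # float -> int8: round(x / scale) + zero_point, clipped to int8 range
    q = np.round(x / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)

q_in = np.array([-100, -10, 0, 25, 120], dtype=np.int8)
x = dequantize(q_in, scale=0.05, zero_point=0)
y = prelu(x, slope=0.25)
q_out = quantize(y, scale=0.05, zero_point=0)
```

If that reading is right, every such triplet costs two extra tensor conversions on top of the float PReLU itself, which would explain part of the overhead.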
In epochs 22 and 24 the result is shown as ??, and I could not retrieve any computation results. I have a few questions:

1. What does ?? mean?

  • Does ?? mean that some operators failed to execute during these epochs? Does it imply those operators are not supported on the STM32N6, or could it be due to hardware resource limitations?

2. Will this affect model results?

  • If an epoch shows ??, will it impact the final recognition or inference accuracy of the model? Should I be concerned that this issue may lead to unreliable results?

3. Why is the PReLU operator still executed in software after quantization?

  • The official documentation states that PReLU is supported on the STM32N6, yet after quantization the PReLU computation is still executed in software rather than on the hardware. Why is that? Is it a hardware limitation, is hardware acceleration for this operator not fully optimized, or is there some other reason PReLU still runs in software?

4. How can I optimize the model to avoid these issues?

  • If these issues occur, are there any recommended optimization methods or adjustment strategies to address them and ensure the model runs smoothly and gives accurate results? Should I consider simplifying the model, or replacing PReLU with another activation function so the operator is not executed in software?
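To see what swapping the activation would change, here is a small NumPy comparison (the slope value is invented for illustration). ReLU and PReLU agree for non-negative inputs and only diverge on the negative side, so I assume replacing PReLU with ReLU would need some fine-tuning to recover accuracy:

```python
import numpy as np

# Comparing ReLU with PReLU to estimate the impact of swapping
# the activation. The slope value is made up for this example.

def relu(x):
    return np.maximum(x, 0.0)

def prelu(x, slope):
    return np.where(x >= 0, x, slope * x)

x = np.linspace(-3.0, 3.0, 7)   # [-3, -2, -1, 0, 1, 2, 3]
slope = 0.2

# The two activations differ only on negative inputs.
diff = np.abs(relu(x) - prelu(x, slope))
```

The divergence on the negative side is exactly `slope * |x|`, so the smaller the learned slopes in the trained model, the cheaper the swap should be in accuracy terms.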

Thank you in advance for your help and suggestions!

Best answer by Julian E.

Hello @qiqi,

 

The PReLU falling back to software is a bug in the CLI front end; the operator itself is supported by the ATON compiler.

The bug is fixed and will be part of the next version (2.2), planned for beginning/mid-July.

 

Concerning the ?? bug, I opened an internal ticket, and I will update you.

Until I know more, I would suggest either not using the option causing the issue, or running validate-on-target with and without the option to compare the results and make sure they are correct.

 

Have a good day,

Julian

1 reply

Julian E.
Technical Moderator
June 24, 2025

Hello @qiqi,

 

Could you please share your model in a .zip file?

 

Concerning the PReLU, it is indeed supported. As for why it ends up in a SW epoch, it could be that the compiler decided it is faster to execute in software. I will look at it in more detail if you share your model.

 

Have a good day,

Julian

In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.
qiqi (Author)
Associate II
June 24, 2025

Dear Julian,

Thank you so much for your help! I have packed the models into a .zip file and attached it for your review. The zip file contains three models: mobilefacenet.onnx, ONet.onnx, and RNet.onnx, all of which are quantized. During analysis, both ONet.onnx and RNet.onnx showed ?? epochs. Could you kindly take a look, identify any issues, and suggest possible solutions?

Additionally, if you don't mind, I would like to ask one more question. The mobilefacenet.onnx feature-extraction model is relatively large, and the analysis shows a total of 164 epochs, of which 111 are implemented in software. In empirical testing the inference time is around 100 ms, which feels a bit long. Is there a way to move more epochs to hardware execution instead of software?

Furthermore, the model's activations total 3.062 MB, so apart from npuRAM3, npuRAM4, npuRAM5, and npuRAM6, they must also occupy some space in hyperRAM. According to the official documentation I reviewed, this might affect inference speed. Is that the case? If so, can it be optimized by adjusting the options in the user_neuralart.json file?

Apologies for all the questions, and I really appreciate your help in answering them and optimizing the model.

Thanks again for your support, and I look forward to your reply!

Best regards,
QiQi

Julian E.
Technical Moderator
June 24, 2025

Hello @qiqi,

 

Thank you for the models, I will first take a look at this ?? issue.

 

Regarding optimization: if the activations do not fit into internal RAM, it will indeed have a big impact on inference time. The weights are in external flash, but because each weight is read only once when needed, the impact is limited. Activations, however, require multiple reads and writes, and each access to external memory adds to the inference time.

 

I will take a look with my colleague to see if we can provide you with some tips to help you.

In the meantime, you can look at this piece of information, if you have not already seen it:
https://stedgeai-dc.st.com/assets/embedded-docs/stneuralart_neural_art_compiler.html#tips-variations-around-the-basic-use-case 

 

Have a good day,

Julian
