HKim.16.78
Associate II
August 19, 2020
Solved

Does compressing the model speed up inference (prediction)?

Hi

I imported a simple CNN to an STM32L462RCT using the STM32CUBE-AI v5.1.2 ApplicationTemplate.

I found that compressing the model has no effect on inference time.

The aiRun procedure takes 115 ms in both the 8-bit compression and "none" configurations, although the accuracy drops a bit.

I thought compressing the float network parameters to uint8_t would not only save memory but also speed up inference.

So, is compressing the model supposed to speed up the inference?

2 replies

HKim.16.78
Associate II
September 13, 2020

Several weeks ago I found that my model has no fully connected layers and the compression only applies to the FC layers.

jean-michel.d
Best answer
ST Employee
September 15, 2020

Hi HKim,

Indeed, for a floating-point model, compression is applied only to the FC (fully connected) layers. Only the weights are compressed, to reduce the flash memory footprint. No significant change in inference time is expected: for a compressed FC layer (x8 or x4), the number of operations stays the same, and the only extra step is an indirection through a look-up table (LUT) to retrieve each weight value. The only side effect that can appear is a loss of accuracy due to the "compression" of the weights.

br,

Jean-Michel
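
For readers who want to see why the LUT indirection does not reduce the work, here is a minimal sketch in plain C. It is not the X-CUBE-AI kernel. The function names (fc_float, fc_lut), the 8-bit index format and the returned operation counter are all assumptions made for illustration. Each 32-bit float weight is replaced by an 8-bit index into a small table of float values, which shrinks the weight storage roughly by 4, but the inner loop still performs one float multiply-accumulate per weight.

```c
#include <stddef.h>
#include <stdint.h>

/* Uncompressed FC layer: out[o] = bias[o] + sum_i w[o][i] * in[i].
 * Returns the number of multiply-accumulate operations performed. */
size_t fc_float(const float *w, const float *bias, const float *in,
                float *out, size_t n_in, size_t n_out)
{
    size_t macs = 0;
    for (size_t o = 0; o < n_out; ++o) {
        float acc = bias[o];
        for (size_t i = 0; i < n_in; ++i) {
            acc += w[o * n_in + i] * in[i];
            ++macs;
        }
        out[o] = acc;
    }
    return macs;
}

/* Hypothetical LUT-compressed FC layer: each weight is stored as an
 * 8-bit index into a table of float values (lut). The arithmetic is
 * identical to fc_float; only the weight fetch gains one indirection. */
size_t fc_lut(const uint8_t *w_idx, const float *lut, const float *bias,
              const float *in, float *out, size_t n_in, size_t n_out)
{
    size_t macs = 0;
    for (size_t o = 0; o < n_out; ++o) {
        float acc = bias[o];
        for (size_t i = 0; i < n_in; ++i) {
            float w = lut[w_idx[o * n_in + i]]; /* extra table look-up */
            acc += w * in[i];                   /* same float MAC as before */
            ++macs;
        }
        out[o] = acc;
    }
    return macs;
}
```

Both functions execute n_in * n_out float multiply-accumulates, so the run time is expected to be about the same. Only the storage of the weights gets smaller. A real speed-up would need the arithmetic itself to change, which this kind of weight compression does not do.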