Does compressing the model speed up inference (prediction)?

August 19, 2020
2 replies
1417 views

Hi

I imported a simple CNN to an STM32L462RCT using the STM32CUBE-AI v5.1.2 ApplicationTemplate.

I found that compressing the model has no effect on inference time.

The aiRun procedure takes 115 ms with both the 8-bit compression and the "none" configurations, although accuracy drops a bit with compression.

I thought compressing the float network parameters to uint8_t would not only save memory but also speed up inference.

So, is compressing the model supposed to speed up inference?


Best answer by jean-michel.d (ST Employee)

Hi HKim,

Indeed, for a floating-point model, compression is applied only to the fully connected (FC) layers, and only their weights are compressed, to reduce the flash memory footprint. No significant change in inference time is expected: for a compressed FC layer (x8 or x4), the number of operations stays the same; the only difference is an extra indirection to retrieve each weight value from a look-up table (LUT). The only visible effect may be some loss of accuracy caused by the "compression" of the weights.

br,

Jean-Michel

HKim.16.78 (Author, Associate II)

Several weeks ago I found that compression applies only to fully connected (FC) layers, and my model has no FC layers.

