Possible suboptimal STM32CubeAI Conv2D kernels? Errors for documented parameters?

Question

Hi there, I'm not sure whether it's better to post here or on GitHub. If there is a preferred place, please let me know!

TL;DR

I'd appreciate any guidance on how to correctly configure the pad values for 1-bit conv operations within X-CUBE-AI.

I appreciate that those at ST may not be able to comment fully on the inner workings of ST's kernel implementation, but I have a concern that 1-bit convolutions may be implemented sub-optimally.

Problem and Setup

The problem I'm working on is generating a Binary Neural Network (BNN). I've decided to trial the X-CUBE-AI framework for my company.

A terse description of my environment, to help replicate this (most of it should be irrelevant, but just in case):

OS: macOS 15.1
Chip: M2 Pro
Python: 3.11.8
TensorFlow: 2.15.0
Larq: 0.13.3
X-CUBE-AI: 9.1.0

Simple Self-Contained Example

I've got a very simple Python script called "test.py" (see below). It imports tensorflow and larq and creates a very simple model with a single larq binary 2D convolution (weights and inputs), where we take the sign of the inputs and the weights (i.e. map them to the values [-1, 1]) to make a simple binary conv op. Note we have "same" padding, so there will typically be some padding at the edges.

import tensorflow as tf
import larq as lq

def build_model(pad_values: int, kernel_size: int = 3, stride: int = 1, out_channels: int = 16):
    x = tf.keras.Input(shape=(32, 32, 1))
    layer = lq.layers.QuantConv2D(
        filters=out_channels,
        kernel_size=(kernel_size, kernel_size),
        kernel_quantizer="ste_sign",
        input_quantizer="ste_sign",
        use_bias=False,
        strides=(stride, stride),
        padding="same",
        pad_values=pad_values,
        groups=1,
    )
    y = layer(x)
    return tf.keras.Model(inputs=x, outputs=y)

print(tf.__version__)
model = build_model(pad_values=0) # <-- TRY VARYING THIS VALUE
# Save the model
model.save("test.h5")

The Python script ends by saving the model to a file. I have a second script (bash this time) called "test.sh", see below:

#!/bin/bash
set -e

# Ensure we always start from a new model
rm -f test.h5

# Create a model
python test.py

# Change this path if your install is at a different location
~/STM32Cube/Repository/Packs/STMicroelectronics/X-CUBE-AI/9.1.0/Utilities/macarm/stedgeai \
 analyze \
 --model test.h5 \
 --target stm32 \
 --type keras

Note how it calls the Python script test.py to build the model and then passes the model "test.h5" to the "stedgeai" CLI.

What I've noticed is that if "pad_values" is 0, "stedgeai" works fine and transpiles from Keras to the ST Edge AI internal framework successfully.

However, if I return to the Python script above and change pad_values to either -1 or 1, we crash and burn! We see an error like:

NOT IMPLEMENTED: QuantConv2D (padded with 1) with formats {'out_0': (FLOAT, 32 bit, C Size: 32 bits), 'weights': (SIGNED, 1 bit, C Size: 1 bits Scales: [2.0] Zeros: [-0.5] Quantizer: UNIFORM), 'in_0': (SIGNED, 1 bit, C Size: 1 bits Scales: [2.0] Zeros: [-0.5] Quantizer: UNIFORM)} not supported

For the record, I think this error message (and the several other errors I've worked through) leaves a little to be desired. That is not the objective of this post, but as feedback: if you have a closed-source library, error messages are crucial for guiding users to a working solution. I only realised that pad_values broke the conv op through trial and error on a number of other parameters; in my opinion this would have been much clearer with better, more verbose error messages!

Why I Think There is a Problem

I've implemented a few binary convs by hand along the way. Compared with a naive conv2d implementation, we'd typically use XNOR and popcount to do the actual convolution efficiently (as well as leveraging GEMM via im2col to help improve performance).

What I don't understand is this: in my example above, my larq conv2d layer uses the sign operator to quantise, meaning we have [-1, 1] binary values rather than [0, 1]. If we add zero "same" padding, i.e. pad the edges with zero, then the signal being fed to the Keras operation is ternary [-1, 0, 1]. In Keras this is fine: training is done in float32, so it probably doesn't matter.
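To make the XNOR/popcount point concrete, here is a minimal pure-Python sketch of the trick as I understand it. It is not a claim about how ST's kernels work; the helper names pack_signs and binary_dot are hypothetical and exist only for this illustration. It maps +1 to bit 1 and -1 to bit 0, and uses the identity dot = 2 * matches - n, where matches is the popcount of the XNOR of the two bit vectors.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an int bitmask (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v not in (-1, 1):
            # A single bit has only two states, so 0 (zero padding) cannot be encoded.
            raise ValueError(f"cannot encode {v!r} in one bit")
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +/-1 vectors of length n via XNOR + popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 wherever the two signs agree
    matches = bin(xnor).count("1")     # popcount
    return 2 * matches - n             # agreements minus disagreements


# One flattened 3x3 window and one 3x3 kernel, both already binarised with sign().
window = [1, -1, -1, 1, 1, -1, 1, 1, -1]
kernel = [1, 1, -1, -1, 1, -1, -1, 1, 1]

reference = sum(x * w for x, w in zip(window, kernel))
fast = binary_dot(pack_signs(window), pack_signs(kernel), len(window))
assert fast == reference

# A corner window under zero "same" padding contains 0s and is no longer binary.
padded_window = [0, 0, 0, 0, 1, -1, 0, -1, 1]
try:
    pack_signs(padded_window)
except ValueError as err:
    print("zero padding breaks 1-bit packing:", err)
```

The last call fails because 0 has no 1-bit encoding. Padding with -1 or 1 instead would keep every window binary, which is presumably why the ST docs ask for those values.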
But when we get to an efficient implementation in C, how are the 0 values represented? Ideally the convolution would convert [-1, 1] to [0, 1] and then go on its merry way computing the conv with XNOR/popcount. But if we have ternary [-1, 0, 1] values, I'm not sure what's going on in the kernel under the hood, nor whether it's optimal - read here for more on this topic.

Confusing Docs Don't Help Either

I'll be honest, I've found a number of discrepancies in the documentation and GitHub examples (model zoo), as well as missing documentation, which make the X-CUBE-AI framework difficult to use.

I'm sharing this in case anyone at ST wants to improve these areas, or show me where I've misunderstood things or got it wrong (highly possible!).

ST Larq Docs

https://wiki.st.com/stm32mcu/wiki/AI:Deep_Quantized_Neural_Network_support#Supported_Larq_layers

See the above link; on larq Conv2D it says:

for binary quantization, 'pad_values=-1 or 1' is requested if 'padding="same"'

Here I can't get pad values of -1 or 1 to work with "stedgeai" at all. The documentation suggests the writer thought -1 or 1 is required if padding is "same" (which I've set), so I find this documentation incorrect.

ST Model Zoo Code

https://github.com/STMicroelectronics/stm32ai-modelzoo/blob/e5361e76f8427b0907b67d9815101d05c32e7407/image_classification/src/models/st_resnet_8_hybrid.py#L38

One of the things I did when trying to work out how "stedgeai" deals with binary convs was to search for all uses of larq in the model zoo repo. I'm really confused by how this code suggests we can use pad_values=1, whereas I've demonstrated in my example that this causes an obscure error.

Question(s)

So what's going on here?

Why does the model zoo suggest that padding values of 1 are okay, while "stedgeai" fails on them?
What is happening under the hood in the ST kernels if we pad with zero for a 1-bit convolution?

Thanks in advance, and I hope this hasn't come across as too negative about the framework. I'm just keen to test out the most optimised neural networks with it and could do with a hand!

tiny-incy-wincy-weeny-ml · Answer

As a follow-up to this, I've noticed that if I set "pad_values" to 0 and analyze the model, it gets converted to a float32 conv anyway, so (as far as I can tell) no optimised kernel is being used at the end of the day!

I used the above python script to generate the 1-layer 1-bit conv op, and then I see the following graph:

Screenshot 2024-11-03 at 17.31.25.png

And I also get the following report:

ST Edge AI Core v1.0.0-19899
Created date : 2024-11-03 17:30:57
Parameters : analyze --target stm32h7 --name network -m /Users/xxx/weight_layer/test.h5 --compression lossless --verbosity 1 --allocate-inputs --allocate-outputs --custom /Users/xxx/custom_layers/custom_layers.json --workspace /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6097541673667914955950694509463940 --output /Users/xxx/.stm32cubemx/network_output

Exec/report summary (analyze)
---------------------------------------------------------------------------------------------------------------------------
model file : /Users/xxx/weight_layer/test.h5 
type : keras 
c_name : network 
compression : lossless 
options : allocate-inputs, allocate-outputs 
optimization : balanced 
target/series : stm32h7 
workspace dir : /var/folders/l4/jlz9n1z53wldxg0vsqldyxg80000gn/T/mxAI_workspace6097541673667914955950694509463940 
output dir : /Users/xxx/.stm32cubemx/network_output 
model_fmt : float 
model_name : test 
model_hash : 0x697014b17b79c93775e307a402d3e471 
params # : 144 items (576 B) 
---------------------------------------------------------------------------------------------------------------------------
input 1/1 : 'input_1', int1(1x32x32x1), 4.00 KBytes, 1b-32bpacked, activations 
output 1/1 : 'quant_conv2d', f32(1x32x32x16), 64.00 KBytes, activations 
macc : 149,520 
weights (ro) : 640 B (640 B) (1 segment) / +64(+11.1%) vs float model 
activations (rw) : 69,668 B (68.04 KiB) (1 segment) * 
ram (total) : 69,668 B (68.04 KiB) = 69,668 + 0 + 0 
---------------------------------------------------------------------------------------------------------------------------
(*) 'input'/'output' buffers can be used from the activations buffer

Model name - test
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
m_id layer (type,original) oshape param/size macc connected to | c_size c_macc c_type 
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
0 input_1 (Input, InputLayer) [b:1,h:32,w:32,c:1] | +2,048(+100.0%) Conversion_[0] 
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
1 quant_conv2d_conv (Conversion, QuantConv2D) [b:1,h:32,w:32,c:1] 2,048 input_1 | +640(+100.0%) +145,424(+7100.8%) Conv2D_[o][1] 
 quant_conv2d (Conv2D, QuantConv2D) [b:1,h:32,w:32,c:16] 144/576 147,456 quant_conv2d_conv | -576(-100.0%) -147,456(-100.0%) 
------ --------------------------------------------- ---------------------- ------------ --------- ------------------- --- --------------- -------------------- ---------------- 
model/c-model: macc=149,504/149,520 +16(+0.0%) weights=576/640 +64(+11.1%) activations=--/69,668 io=--/0



Generated C-graph summary
------------------------------------------------------------------------------------------------------------------------
model name : test
c-name : network
c-node # : 2
c-array # : 6
activations size : 69668 (1 segment)
weights size : 640 (1 segment)
macc : 149520
inputs : ['input_1_output']
outputs : ['quant_conv2d_output']

C-Arrays (6)
------ ----------------------------- ------------- ------------------------- ------------- ---------
c_id   name (*_array)                item/size     domain/mem-pool           c-type        comment
------ ----------------------------- ------------- ------------------------- ------------- ---------
0      input_1_0_conversion_output   1024/4096     activations/**default**   float
1      input_1_output                1024/4096     activations/**default**   s1            /input
2      quant_conv2d_bias             16/64         weights/weights           const float
3      quant_conv2d_output           16384/65536   activations/**default**   float         /output
4      quant_conv2d_scratch0         9/36          activations/**default**   float
5      quant_conv2d_weights          144/576       weights/weights           const float
------ ----------------------------- ------------- ------------------------- ------------- ---------

C-Layers (2)
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 
c_id name (*_layer) id layer_type macc rom tensors shape (array id) 
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 
0 input_1_0_conversion 0 Conversion 2048 0 I: input_1_output int1(1x32x32x1) (1) 
 O: input_1_0_conversion_output f32(1x32x32x1) (0) 
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 
1 quant_conv2d 1 Conv2D 147472 640 I: input_1_0_conversion_output f32(1x32x32x1) (0) 
 S: quant_conv2d_scratch0 
 W: quant_conv2d_weights f32(16x3x3x1) (5) 
 W: quant_conv2d_bias f32(16) (2) 
 O: quant_conv2d_output f32(1x32x32x16) (3) 
------ ---------------------- ---- ------------- -------- ----- -------------------------------- --------------------- 



Number of operations per c-layer
------- ------ ----------------------------------- --------- --------------
c_id    m_id   name (type)                         #op       type
------- ------ ----------------------------------- --------- --------------
0       0      input_1_0_conversion (Conversion)   2,048     smul_s1_f32
1       1      quant_conv2d (Conv2D)               147,472   smul_f32_f32
------- ------ ----------------------------------- --------- --------------
total                                              149,520

Number of operation types
---------------- --------- -----------
operation type   #         %
---------------- --------- -----------
smul_s1_f32      2,048     1.4%
smul_f32_f32     147,472   98.6%

Complexity report (model)
------ ------------------- ------------------------- ------------------------- ------ 
m_id name c_macc c_rom c_id 
------ ------------------- ------------------------- ------------------------- ------ 
0 input_1 | 1.4% | 0.0% [0] 
1 quant_conv2d_conv |||||||||||||||| 98.6% |||||||||||||||| 100.0% [1] 
------ ------------------- ------------------------- ------------------------- ------ 
macc=149,520 weights=640 act=69,668 ram_io=0

In this report we see the majority of ops being smul_f32_f32, which I interpret as the layer being converted to a float32 convolution. I've read the documentation here, which mentions this fallback to float32, but it doesn't indicate any way for us to find out which egregious parameters are causing the fallback (it would be great to have more helpful error messages in the framework here).
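The sizes in the report are consistent with float32 storage too. The sketch below recomputes the parameter count, weight size and conv MACC from the layer shapes in the report. The 1-bit figure is only what ideal bit-packing would need; that number is my assumption and not something the tool reports.

```python
# Shapes taken from the report: 32x32 output, 16 filters, 3x3 kernel, 1 input channel.
out_h, out_w, out_c = 32, 32, 16
k_h, k_w, in_c = 3, 3, 1

params = k_h * k_w * in_c * out_c                      # weight items
conv_macc = out_h * out_w * out_c * k_h * k_w * in_c   # multiply-accumulates

float32_weight_bytes = params * 4                # 4 bytes per float32 weight
packed_1bit_weight_bytes = (params + 7) // 8     # ideal 1-bit packing (assumption)

print("params:", params)                                      # report: 144 items
print("float32 weight bytes:", float32_weight_bytes)          # report: 576 B
print("ideal 1-bit weight bytes:", packed_1bit_weight_bytes)
print("conv macc:", conv_macc)                                # report (model): 147,456
```

The 576 B reported for 144 parameters is exactly 4 bytes per weight, whereas ideally packed 1-bit weights would fit in 18 bytes.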

Is there any warning, error or stated reason when this falls back to the float32 conv? Does anyone know where in the docs it states exactly which combinations of parameters are, and are not, supported to get this working?
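In the absence of such a warning, a small script over the report text can at least flag the fallback automatically. This is a rough sketch that assumes the report layout shown above, and it assumes that an operation type ending in "f32_f32" means both operands are float32, which I have inferred only from this one report. The function names parse_operation_types and float_fallback_ops are hypothetical and not part of any ST tool.

```python
import re


def parse_operation_types(report_text):
    """Return {operation_type: op_count} from the 'Number of operation types' table."""
    ops = {}
    in_table = False
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Number of operation types"):
            in_table = True
            continue
        if not in_table:
            continue
        # Data rows look like: "smul_f32_f32 147,472 98.6%"
        match = re.match(r"^(\w+)\s+([\d,]+)\s+[\d.]+%$", stripped)
        if match:
            ops[match.group(1)] = int(match.group(2).replace(",", ""))
        elif ops and not stripped:
            break  # a blank line after the data rows ends the table
    return ops


def float_fallback_ops(ops):
    """Operation types whose name suggests float32 inputs and weights (assumption)."""
    return {name: count for name, count in ops.items() if name.endswith("f32_f32")}


# Excerpt of the report from this post, used as sample input.
sample_report = """\
Number of operation types
---------------- --------- -----------
operation type # %
---------------- --------- -----------
smul_s1_f32 2,048 1.4%
smul_f32_f32 147,472 98.6%

Complexity report (model)
"""

ops = parse_operation_types(sample_report)
fallback = float_fallback_ops(ops)
if fallback:
    print("float32 fallback detected:", fallback)
```

Run against the saved report file, this could fail a CI step whenever a supposedly binary layer ends up as a float32 kernel.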

Has anyone got a working example of larq/1-bit convs on GitHub or elsewhere that they could kindly point me to?

Thanks in advance
