Visitor II
May 10, 2024
Question

stm32h732 matrix-vector multiply throughput

  • May 10, 2024
  • 5 replies
  • 4364 views

I am trying to get a sense of how much f32 128x128 matrix-vector multiply throughput I should expect from this type of processor (for the purpose of running small control neural networks on the M7 core).

As I understand the M7 core, it should be capable of executing one fused multiply-add per clock cycle as long as there is data available in the core. I don't have a definitive reference for that, so if anyone knows more I'd love to hear it.

 

What I find most difficult to get a grip on is the memory system.

 

As I understand it, the tightly-coupled memory (TCM) can be read with zero latency; I imagine that means a read takes a single cycle and that the latency can be hidden by the instruction pipeline.

The way I see it, I could page matrix entries from data SRAM into DTCM using DMA.

Moreover, I could page matrix entries from flash into data SRAM using DMA.

 

The questions I am most in the dark about concern bandwidth:

* how many 4-byte f32 transfers can I expect per M7 core clock tick using DMA from data SRAM to DTCM?

* how many 4-byte f32 transfers can I expect per M7 core clock tick using DMA from flash to data SRAM?
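For context on why I ask: under my own assumption that the 128-entry vector stays resident in DTCM and only the matrix entries are streamed in, a 1-FMA-per-cycle kernel needs exactly one fresh 32-bit transfer per core clock:

```c
/* Back-of-envelope feed-rate check for y = A*x with f32 entries.
 * Assumption (mine, not from any datasheet): the 128-entry input
 * vector stays resident in DTCM and is reused, so only matrix
 * entries are streamed -- one fresh 4-byte load per FMA. At the
 * hoped-for 1 FMA per cycle, that is 4 bytes per core cycle of
 * sustained bandwidth on the path feeding DTCM. */
static long matvec_fmas(int n)         { return (long)n * n; }       /* FMAs per matvec */
static long matvec_stream_bytes(int n) { return matvec_fmas(n) * 4; } /* matrix bytes streamed */
```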

 

And perhaps even more useful than these theoretical questions would be real-world benchmarks on this type of workload. If anyone knows of any, I'd love to see them. Experiences with similar processor models are definitely welcome too.


    EelcoAuthor
    Visitor II
    May 10, 2024

    I was just reading up on the 16-bit FMAC functionality, and I saw that it has a maximum throughput of one FMA per two core clock cycles, since it takes two load instructions to fetch both arguments. The document on the 16-bit FMAC did not explicitly promise that the one op per two cycles can be sustained, given possible constraints on keeping the local memory fed with relevant data, though it seemed implied at least.

    I suppose that implies I'm not going to get more than one f32 FMA per two clock cycles either. Or does it? The FMAC is additional silicon intended to run without bothering the rest of the core; I suppose its limitations do not necessarily imply limitations of the full core?

    In any case... going for 16-bit quantization is likely an option for me, so understanding what the 16-bit FMAC can do is also interesting in itself.

    Super User
    May 10, 2024

    To know the real timing, just try it.

    A simple loop, 1000x transfer (1000 int32 or f32 with MDMA -> to DTCM) - then you know.

    +

    * how many 4-byte f32 transfers can I expect per M7 core clock tick using DMA from data SRAM to DTCM?

    The AXI bus is clocked at core speed (afaik), so 1 clk -> one 32-bit transfer (maximum - if "nobody" else is requesting the bus, maybe the CPU or another DMA...)

    +

    * how many 4-byte f32 transfers can I expect per M7 core clock tick using DMA from flash to data SRAM?

    Depends on the size of the transfer; flash is accessed 256 bits wide, at 3 WS (4 clks) at max speed (550 MHz);

    so 4 clks for the first word, then 1 clk for each of the next 7 x 32-bit words.
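    Those figures work out to the following effective rate; the 4 + 7 clock split is taken straight from the wait-state description above:

```c
/* Effective flash read rate from the figures above: a 256-bit (8-word)
 * flash line costs 4 clocks for the first word (3 wait states + 1)
 * plus 1 clock for each of the remaining 7, i.e. 11 clocks per
 * 8 words. */
static double flash_words_per_clock(void)
{
    const int first = 4, rest = 7;   /* clocks: 3WS + 1, then 1 each */
    return 8.0 / (first + rest);     /* 8/11, roughly 0.73 words/clock */
}
```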


    see rm.

    If the D-cache is enabled and the cache line is loaded, then maybe without any wait states.

     

    But anyway: try a simple test, then you know.

    EelcoAuthor
    Visitor II
    May 10, 2024

    Thanks; those numbers you mention inspire confidence.

    But I don't have that much confidence in the 'just try it' approach, considering I'm very new to STM32 programming, and I expect that all my synthetic benchmarking would prove is that I have no clue what I'm doing. So I'm hoping to teach myself some understanding of what is going on, independent of what any narrow benchmark might (seem to) show.

     

    Note that in my other reply I quoted a paper where they achieve only about one FMA every 16 clock cycles, with code generated by Cube.AI; perhaps that isn't the best generated code, but I also wouldn't expect it to be the worst. That's for an STM32L4, but yeah...

    Graduate II
    May 10, 2024

    But the layman's 'just try it' approach will at least give a lower bound...

    EelcoAuthor
    Visitor II
    May 10, 2024

    Just found this paper with some rare numbers relevant to my use case: some 5M FMAs/s on an 80 MHz STM32L4 processor. That's a lower ratio than I'd be hoping for... but that's on a processor without an FMAC, and using Cube.AI autogenerated C code. From the examples it seems quite heavily geared toward convolutional applications, not RNNs, where you deal with a single token at a time at inference and are thus more constrained by memory bandwidth...
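    The paper's ratio is easy to back out (my own arithmetic, using the 80 MHz and 5M MACs/s figures above):

```c
/* Sanity check on the paper's figure: 5M MACs/s on an 80 MHz STM32L4
 * works out to 16 core clocks per multiply-accumulate -- far from
 * the one-per-cycle ideal. */
static double clocks_per_mac(double f_hz, double macs_per_s)
{
    return f_hz / macs_per_s;   /* e.g. 80e6 / 5e6 = 16 */
}
```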

    I should probably take a look at that Cube.AI autogenerated code, to see what it looks like and whether it seems close to making good use of the hardware for my intended use case.

    Explorer
    May 11, 2024

    Read:

    AN4891
    Application note
    STM32H72x, STM32H73x, and single-core STM32H74x/75x
    system architecture and performance

    EelcoAuthor
    Visitor II
    May 11, 2024

    Thanks; that does indeed go into some more detail. I like this part:

    `It is split into two DTCM-RAMs with 32-bit access each. Both memories are connected respectively to the D0TCM and D1TCM ports of the Cortex®-M7 (not represented in the figures) and can be used in parallel (for load/store operations) thanks to the Cortex®-M7 dual issue capability`

    So it seems I could load an f32 matrix entry and a vector component in a single cycle if they are stored in different DTCM banks. Which somewhat contradicts what I read about the FMAC; but that might have been an FMAC-specific limitation then.

    Piecing together all the tidbits of information I've found, it seems to me that someone who knows exactly what they are doing could get one fused multiply-add per cycle out of the STM32H7... but I also suspect it's going to be a lot of work to get there, manually orchestrating all the memory movement from flash to RAM to TCM to core. Writing a single matmul benchmark might not be too much work, but tying it together into an actual neural network would probably end up a little like writing your own Cube.AI code generator.
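    As a sanity check on that best case (my own arithmetic, assuming the 550 MHz maximum clock mentioned above; the part actually used may run slower):

```c
/* Rough lower bound on matvec time implied by the 1-FMA/cycle reading:
 * a 128x128 f32 matvec is 16384 FMAs, so at best 16384 cycles,
 * about 30 microseconds at 550 MHz. */
static double matvec_us(int n, double f_hz)
{
    long cycles = (long)n * n;     /* one FMA (and cycle) per element */
    return cycles / f_hz * 1e6;    /* microseconds per matvec */
}
```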

     

    Speaking of trying things out, I should certainly give Cube.AI a spin, to see if it is anywhere near as bad for RNNs as the paper linked above suggests. Maybe I'm lucky; and the autogenerated code will probably teach me a lot either way.

    Explorer
    May 11, 2024

    "Piecing together all the tidbits of information I've found, it seems to me that someone who knows exactly what they are doing could get one fused multiply-add per cycle out of the STM32H7..."

    I've been in a similar situation. Experimenting with FFT (the butterfly operation, multiply-accumulate in the core), I was not able to get any better than 2 Msps processing time, about 240 instructions per sample on a 480 MHz CPU. Even counting the multiply-add of two complex numbers as one operation, it's more than 20 cycles per single operation.

    My understanding is that the Cortex-M7 core is Arm's design rather than ST's, and the same applies to GCC, so ST is likely not the one to blame for such low performance.

    Graduate II
    May 11, 2024

    I have no real idea what this is about - but it looks interesting! ;)

    Therefore, just 2 things:

    - DMA has no access to TCM

    - grab an H723 Nucleo and try; it's only about 30 $ / €

     

    EelcoAuthor
    Visitor II
    May 11, 2024

    According to the STM32H7 documentation, the MDMA does have access to the DTCM; so that should be good?

    As mentioned before, the $30 isn't the issue here. The issue is the many months it'd take me to convince myself I'd coded up a benchmark that's representative of what the chip is capable of.

    Graduate II
    May 11, 2024

    Oops, when it comes to DMA I always think of the peripherals.

    So yes, per the reference manual the MDMA can access DTCM.