stm32h732 matrix-vector multiply throughput
I am trying to gain an appreciation of how much f32 128x128 element matrix-vector multiply compute I should expect from this type of processor (for the purpose of running small control neural networks on the m7 core).
As I understand the m7 core, it should be capable of executing one fused multiply-add per clock cycle, as long as the data is already in the core. I don't have a definite reference for that, so if anyone knows more I'd love to hear it.
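To make the workload concrete, here is a minimal sketch of the kernel I have in mind, plus the cycle count it would imply. The function names are made up for illustration, and the cycle figure rests entirely on the unverified assumption of one multiply-accumulate per cycle.

```c
#include <stddef.h>

#define N 128  /* matrix dimension from this post */

/* y = W * x for a row-major N x N matrix.
   The inner loop runs N * N multiply-accumulates in total. */
void matvec_f32(const float *w, const float *x, float *y)
{
    for (size_t row = 0; row < N; ++row) {
        float acc = 0.0f;
        for (size_t col = 0; col < N; ++col) {
            acc += w[row * N + col] * x[col];
        }
        y[row] = acc;
    }
}

/* Lower bound on cycles per product, ASSUMING one multiply-accumulate
   per cycle and no stalls on loads. Not a measured figure. */
unsigned long matvec_min_cycles(void)
{
    return (unsigned long)N * N;
}
```

Under that assumption a 128x128 product is 16384 multiply-accumulates, so at least 16384 cycles. The single accumulator forms a dependency chain, so a real kernel may need several independent accumulators to get anywhere near that bound.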
What I find most difficult to appreciate is the memory system.
As I understand, the tightly-coupled-memory (TCM) can be read with zero-latency; I imagine that means it takes a single cycle and that latency can get hidden by the instruction pipeline.
The way I see it, I could page matrix entries from data-sram into dTCM using DMA.
Moreover, I could page matrix entries from flash into data-sram using DMA.
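The paging scheme I am picturing is a ping-pong (double-buffer) arrangement: compute on one tile while the next one is being fetched. Below is a rough sketch. `fetch_tile`, `matvec_tiled` and `TILE_ROWS` are hypothetical names, and `memcpy` stands in for the DMA transfer, so as written nothing actually overlaps. Which DMA controller on the part can reach dTCM is something I would still need to check in the reference manual.

```c
#include <stddef.h>
#include <string.h>

#define N         128
#define TILE_ROWS 16   /* hypothetical tile height: 16 * 128 * 4 B = 8 KiB */

/* Two tile buffers; on target these would be placed in dTCM (assumption). */
static float tile_buf[2][TILE_ROWS * N];

/* Stand-in for starting a DMA transfer. A real port would program the
   DMA controller here and return immediately. */
static void fetch_tile(float *dst, const float *src_matrix, size_t tile)
{
    memcpy(dst, src_matrix + tile * TILE_ROWS * N,
           sizeof(float) * TILE_ROWS * N);
}

/* y = W * x, with W living in slower memory and streamed tile by tile. */
void matvec_tiled(const float *w_slow, const float *x, float *y)
{
    const size_t n_tiles = N / TILE_ROWS;

    fetch_tile(tile_buf[0], w_slow, 0);  /* prime the first buffer */

    for (size_t t = 0; t < n_tiles; ++t) {
        const float *cur = tile_buf[t & 1];

        /* Kick off the fetch of the next tile into the other buffer. */
        if (t + 1 < n_tiles) {
            fetch_tile(tile_buf[(t + 1) & 1], w_slow, t + 1);
        }

        /* A real port would wait here until the DMA that filled
           `cur` has completed. */
        for (size_t r = 0; r < TILE_ROWS; ++r) {
            float acc = 0.0f;
            for (size_t c = 0; c < N; ++c) {
                acc += cur[r * N + c] * x[c];
            }
            y[t * TILE_ROWS + r] = acc;
        }
    }
}
```

The same structure would apply one level up, for flash to data-sram. Whether it pays off depends on the transfer rates asked about in this post.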
The questions I am most in the dark about are about bandwidth:
* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from data-sram to dTCM?
* how many 4-byte f32 transfers can I expect per m7 core clock tick using DMA from flash to data-sram?
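The reason these two numbers matter so much: in a matrix-vector product every matrix entry is used exactly once, so each multiply-accumulate needs a fresh 4-byte weight. A back-of-the-envelope model, again assuming one multiply-accumulate per cycle and transfers that overlap perfectly with compute (the function names are just for illustration):

```c
#include <stddef.h>

#define N 128

/* Bytes of weights that must be streamed for one matrix-vector product. */
size_t matrix_bytes(void)
{
    return (size_t)N * N * sizeof(float);
}

/* Estimated cycles per product for a sustained transfer rate given in
   bytes per core cycle. ASSUMES one multiply-accumulate per cycle and
   perfect overlap of transfer and compute, so the slower of the two
   sets the total. */
double matvec_cycles(double bytes_per_cycle)
{
    double compute  = (double)N * N;
    double transfer = (double)matrix_bytes() / bytes_per_cycle;
    return transfer > compute ? transfer : compute;
}
```

With 4-byte floats the matrix is 64 KiB. In this model anything below 4 bytes per core cycle makes the transfer the bottleneck: at 2 bytes per cycle, for example, the product would take twice the compute-bound 16384 cycles.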
Perhaps even more useful than answers to these theoretical questions would be real-world benchmarks on this type of workload. If anyone knows of any, I'd love to see them. Experiences with similar processor models are also very welcome.
