STM32H7 code execution time question
Hello All,
I needed a very fast vector addition routine for the H7, so I wrote it in assembly. However, I discovered that very small changes in the code cause huge differences in execution time. Both of the following functions add a 16-bit vector to a 32-bit sum vector. The vectors are 32-bit word aligned. The sum vector is in DTCM and the 16-bit vector is in external SRAM. The first routine adds sequential data to the sum; the second adds every fourth point of a 4x larger vector to a sum vector of the same size. So both process the same number of data points, but the first is 3 times faster. Does anyone know what would cause such a large difference in execution time for these nearly identical functions?
Thanks
Dan
3X faster one:
loop:
LDRH r3, [r0, r2, lsl #1] // load 16bit raw data in r3
LDR r4, [r1, r2, lsl #2] // load 32bit sum in r4
ADD r4,r3 // add raw data to sum
STR r4, [r1, r2, lsl #2] // store new sum
SUBS r2,#1 // next data point
BPL loop
Slower one:
loop:
LDRH r3, [r0, r2, lsl #1] // load 16bit raw otdr data in r3
LDR r4, [r1, r2] // load 32bit sum in r4
ADD r4,r3 // add raw data to sum
STR r4, [r1, r2] // store new sum
SUBS r2,#4 // next data point
BPL loop
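
For reference, here is a rough C model of what the two loops compute. It is only a sketch. The function names are made up for illustration, and it assumes that r2 starts at the last index (n-1) in the first routine and at the last byte offset 4*(n-1) in the second, which the listings above do not show. LDRH zero-extends, so the raw data is treated as unsigned 16-bit.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the faster loop: sum[i] += raw[i], counting down from n-1 to 0.
 * raw is read at consecutive halfwords (2-byte stride). */
void add_sequential(uint32_t *sum, const uint16_t *raw, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        sum[i] += raw[i];
    }
}

/* Model of the slower loop: sum[i] += raw[4*i], counting down from n-1 to 0.
 * raw is read at every fourth halfword (8-byte stride), so raw must hold
 * at least 4*(n-1)+1 elements. */
void add_every_fourth(uint32_t *sum, const uint16_t *raw, size_t n)
{
    for (size_t i = n; i-- > 0; ) {
        sum[i] += raw[4 * i];
    }
}
```

Both versions do the same number of loads, adds and stores. The only difference visible in the model is that the second one reads the external-SRAM vector with an 8-byte stride instead of a 2-byte stride.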
