CMSIS DSP Library performance
I'm using the STM32F769DI and STM32Cube for a project that will use the CMSIS DSP libraries in a computationally demanding application.
My C code is built with the fast optimisation setting. __FPU_USED and __FPU_PRESENT are set, and ARM_MATH_CM7 is defined. I'm comparing two blocks of simple float32 multiplication, one using the CMSIS library and one using a simple loop. I'm linking in the arm_cortexM7lfdp_math library from the GCC directory, and I check the timing with a scope on LED2.
// 711 us
BSP_LED_On(LED2);
arm_mult_f32(x, y, z, 10000);
BSP_LED_Off(LED2);
// 50 us
BSP_LED_On(LED2);
for (int k = 0; k < 10000; k++) {
    z[k] = x[k] * y[k];
}
BSP_LED_Off(LED2);
Does anyone have any insight into why the simple for loop should be so much faster than the SIMD-based CMSIS library? Is there some additional initialisation that I've missed? I've tried compiling the CMSIS sources into my own library, but got pretty much the same results.
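
For anyone who wants to reproduce the comparison without the board, here is a minimal sketch in portable C. The names mult_plain, mult_unrolled4 and max_abs_diff are hypothetical, and mult_unrolled4 is only an assumed stand-in for a 4x unrolled scalar multiply, not the actual arm_mult_f32 source. It only checks that the two code paths produce identical results, so any timing difference can be put down to how each path was compiled rather than to what it computes.

```c
#include <stddef.h>

/* Plain loop, equivalent to the hand-written for loop in the post. */
void mult_plain(const float *x, const float *y, float *z, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        z[k] = x[k] * y[k];
    }
}

/* Hypothetical stand-in for a library-style routine: the same
 * multiplication, manually unrolled by four, with a tail loop
 * for the remaining 0..3 elements. */
void mult_unrolled4(const float *x, const float *y, float *z, size_t n)
{
    size_t blocks = n / 4;   /* number of full groups of four */
    size_t tail = n % 4;     /* leftover elements */

    while (blocks > 0) {
        z[0] = x[0] * y[0];
        z[1] = x[1] * y[1];
        z[2] = x[2] * y[2];
        z[3] = x[3] * y[3];
        x += 4;
        y += 4;
        z += 4;
        blocks--;
    }
    while (tail > 0) {
        *z++ = (*x++) * (*y++);
        tail--;
    }
}

/* Largest absolute element-wise difference between two buffers. */
float max_abs_diff(const float *a, const float *b, size_t n)
{
    float worst = 0.0f;
    for (size_t k = 0; k < n; k++) {
        float d = a[k] - b[k];
        if (d < 0.0f) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

To use it, call both functions on the same x and y buffers, writing into two separate output buffers, then check that max_abs_diff on the two outputs returns 0.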
