Graduate
April 16, 2024
Solved

Do all FPU operations proceed in a single SYSCLK tick?


This should be simple, but I can't find documentation. I am running an STM32L4P5 at max clock speed (120 MHz), and I have moved all code to SRAM so it is running with zero wait states. I have an extensive calculation to do using the FPU, and I am optimizing it by retaining constants and such in the FPU registers.

I am wondering if any of the FPU instructions take longer than a single clock tick to execute. Divide? Square root? Are there any inserted wait states?

If there is a solid answer and it is in the documentation, please point to where it is documented. Thanks!


    5 replies

    Graduate II
    April 16, 2024

    A clock interrupting at kHz rates is not suitable for measuring nanoseconds of elapsed time.

    SysTick is a 24-bit Down Counter, typically at 1/8th the MCU clock.

    For precise cycle counts use the DWT's CYCCNT instead. It's a full range 32-bit Up Counter so easy to delta the counts.
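A minimal C sketch of the CYCCNT approach, assuming a Cortex-M4 target. The register addresses are the architectural ones for DEMCR and the DWT; the `ON_TARGET` macro and the function names are hypothetical, introduced here so the delta arithmetic can also be compiled on a host. Because the counter is a full 32-bit up-counter, plain unsigned subtraction gives the right delta even across one wraparound.

```c
#include <stdint.h>

/* Elapsed cycles between two CYCCNT samples. Unsigned 32-bit subtraction
 * wraps modulo 2^32, so a single counter rollover is handled for free. */
static inline uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

#ifdef ON_TARGET  /* hypothetical macro: define only when building for the MCU */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

/* Turn on the DWT cycle counter (assumes nothing else owns the DWT). */
static inline void cyccnt_enable(void)
{
    DEMCR     |= (1u << 24);  /* TRCENA: enable the DWT unit */
    DWT_CYCCNT = 0u;          /* start from a known value */
    DWT_CTRL  |= (1u << 0);   /* CYCCNTENA: count core clock cycles */
}

/* Sample the free-running cycle counter. */
static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
#endif
```

On target, one would sample `cyccnt_read()` before and after the code under test and pass both values to `cycles_elapsed()`. The reads themselves cost a few cycles, so timing an empty region first gives an overhead figure to subtract.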

    Not sure I have a table of FPU cycles, but it runs concurrently with the MCU and has its own pipeline. So it would be more helpful to look at throughput. Although you could probably make something with a chained dependency if you want to worst-case it.

    Graduate II
    April 16, 2024
    Super User
    April 16, 2024
    JCase.1 (Author)
    Graduate
    April 16, 2024

    You guys are great. This is what I expected from my benchmarking on the scope. I suspected that the expression (A*B*C)/(D*E*F) is far more efficient when coded as:

        ANS = A*B,  ANS = ANS*C,  S0 = D*E,  S0 = S0*F,  ANS = ANS/S0

    than as:

        ANS = A*B,  ANS = ANS*C,  ANS = ANS/D,  ANS = ANS/E,  ANS = ANS/F

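    f32 multiplies take 1 clock tick and f32 divides take 14 clock ticks, so the former (4 multiplies and 1 divide, about 18 ticks) is clearly faster than the latter (2 multiplies and 3 divides, about 44 ticks).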
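The two codings above can be sketched in C as follows, together with a rough cost estimate that uses only the figures quoted in this thread (1 tick per f32 multiply, 14 per f32 divide). The function names are hypothetical, and the estimate deliberately ignores loads, stores, and pipeline dependency stalls, so it is a lower bound rather than a measurement.

```c
#include <stdint.h>

/* Single-divide form: 4 multiplies + 1 divide. */
float ratio_one_divide(float a, float b, float c, float d, float e, float f)
{
    float num = a * b;
    num = num * c;
    float den = d * e;
    den = den * f;
    return num / den;
}

/* Chained-divide form: 2 multiplies + 3 divides. */
float ratio_three_divides(float a, float b, float c, float d, float e, float f)
{
    float ans = a * b;
    ans = ans * c;
    ans = ans / d;
    ans = ans / e;
    return ans / f;
}

/* Rough tick estimate from the per-instruction figures quoted in the thread. */
uint32_t estimate_ticks(uint32_t multiplies, uint32_t divides)
{
    return multiplies * 1u + divides * 14u;
}
```

With these figures, `estimate_ticks(4, 1)` gives 18 and `estimate_ticks(2, 3)` gives 44. The two forms are not bit-identical in general: they round at different points, and the product D*E*F can overflow or underflow in cases where dividing one factor at a time would not.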

    If either of you passes through Cedar City, Utah, I'll buy you a beer. Thanks! Jeff


    JCase.1 (Author)
    Graduate
    April 16, 2024

    Another question on the same topic. If the f32 divide and square root instructions take 14 clock ticks, and the FPU is a true co-processor with a separate pipeline, does that mean that I can launch a divide (or square root) and do 14 ticks' worth of instructions without the CPU stalling, as long as none of those instructions use the FPU?

    JCase.1 (Author)
    Graduate
    April 16, 2024

    Never mind, the same documentation page answered that second question in the footnotes -- they DO proceed in parallel, apparently.   I'll have to interleave FPU and non-FPU functions cleverly to get more speed.
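A minimal C sketch of that interleaving idea, with hypothetical names throughout: the divide is written first, integer-only work that does not depend on the quotient follows, and the quotient is consumed only at the end. This is an illustration of the source ordering, not a guarantee. The compiler is free to reorder these statements, so the generated assembly should be inspected (or the loop written in assembly) before relying on any overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: one FPU result plus one integer result. */
typedef struct {
    float    quotient;
    uint32_t checksum;
} div_and_sum_t;

div_and_sum_t divide_while_summing(float num, float den,
                                   const uint32_t *data, size_t n)
{
    /* Long-latency FPU operation written first; result not used yet. */
    float q = num / den;

    /* Integer-only work that is independent of q. */
    uint32_t sum = 0u;
    for (size_t i = 0; i < n; ++i) {
        sum += data[i];
    }

    /* First use of the quotient comes after the integer work. */
    div_and_sum_t out;
    out.quotient = q;
    out.checksum = sum;
    return out;
}
```

Measuring this with the cycle counter, once as written and once with the quotient used immediately after the divide, would show how much of the divide latency is actually hidden on a given build.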