Visitor II
November 27, 2020
Question

memset() execution slower at some addresses

  • 12 replies
  • 4047 views

Hello,

After some investigation, I found that memset() takes a different amount of time depending on where it is placed in flash. Data and instruction caches are off! The micro used is an stm32h743xi.

The function is called with the following arguments --> memset(dummy, 0, 64)

Its execution time is ~5us when the function is placed at:

..., 0x8040c34, 0x8040c54, 0x8040c74, ...

Its execution time is ~1us when the function is placed at:

..., 0x8040c3c, 0x8040c44, 0x8040c4c, 0x8040c5c, 0x8040c64, 0x8040c6c, ...

Any ideas?

Thanks


    Visitor II
    April 2, 2025

    There's nothing particularly special about the memset() function; similar behavior can be observed with any code. In my case, I was working with the STM32F767 microcontroller. Using a free-running timer, I measured the runtime of a for-loop that writes to a variable in SRAM1. By systematically moving the code to different locations in flash memory, I discovered a distinct pattern: 11 addresses were fast, while 6 addresses were slow. At the slow addresses, the loop ran nearly six times slower when executed via the AXI bus (0x0800'0000) and three times slower via the ITCM bus (0x0020'0000). Throughout all tests, ICACHE, DCACHE, ART and PREFETCH were disabled. Interestingly, even slight modifications to the code within the for-loop altered this pattern.
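For anyone wanting to repeat this kind of measurement, here is a minimal sketch of the arithmetic around a free-running 32-bit counter. The post does not say which timer was used, so the helper names `elapsed_ticks` and `ticks_to_ns` and the clock frequency passed in are assumptions for illustration only. On a Cortex-M7 target, the two counter values could for example be read from a hardware timer's count register or from the CMSIS `DWT->CYCCNT` cycle counter immediately before and after the code under test.

```c
#include <stdint.h>

/* Elapsed ticks between two reads of a free-running 32-bit up-counter.
 * The result is reduced modulo 2^32 by the uint32_t return type, so one
 * counter wraparound between the two reads is handled correctly. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a tick count to nanoseconds for a counter clocked at timer_hz.
 * timer_hz is whatever the counter actually runs at on the board in use
 * (an input here, not a value taken from the thread). */
static uint64_t ticks_to_ns(uint32_t ticks, uint32_t timer_hz)
{
    return ((uint64_t)ticks * 1000000000ull) / timer_hz;
}
```

With a counter that read 0xFFFFFFF0 before and 0x00000010 after, `elapsed_ticks` returns 0x20 ticks despite the wrap. Interrupts should be masked around the measured region if the goal is to compare code placements rather than interrupt load.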

    Super User
    April 2, 2025

    Interesting is that addresses where the execution is slower are +0x20 from each other

    0x20 is exactly the FLASH "word" size, 256 bits as Jan has mentioned.
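The +0x20 spacing can be checked directly by reducing each reported address modulo the 32-byte flash word. The sketch below does only that arithmetic on the addresses listed in the question; the names `flash_word_offset` and `any_at_offset` are made up for this example, and the code says nothing about why a given offset is slow.

```c
#include <stddef.h>
#include <stdint.h>

#define FLASH_WORD_BYTES 32u /* 256-bit flash word, as stated in the thread */

/* Placement addresses reported in the question. */
static const uint32_t slow_addrs[] = { 0x8040c34u, 0x8040c54u, 0x8040c74u };
static const uint32_t fast_addrs[] = { 0x8040c3cu, 0x8040c44u, 0x8040c4cu,
                                       0x8040c5cu, 0x8040c64u, 0x8040c6cu };

/* Byte offset of an address inside its 32-byte flash word. */
static uint32_t flash_word_offset(uint32_t addr)
{
    return addr % FLASH_WORD_BYTES;
}

/* Returns 1 if any address in the list sits at the given offset, else 0. */
static int any_at_offset(const uint32_t *addrs, size_t n, uint32_t offset)
{
    for (size_t i = 0; i < n; ++i) {
        if (flash_word_offset(addrs[i]) == offset) {
            return 1;
        }
    }
    return 0;
}
```

For the addresses in the question, every slow placement lands at offset 0x14 within its flash word, while the fast placements land at 0x04, 0x0c and 0x1c.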

    What is more interesting to me... can interrupt latency be affected in the same way?

    The good news is that instruction cache is solving the issue.

    Indeed good news for tight looping code, but not for latency of (rarely occurring) interrupts?

    As @pavlo1r tested on a 'F767, this is not specific to the bus matrix of 'H7.

     

    Visitor II
    April 15, 2025

     

    Both the STM32F767 and STM32H743xI implement the Cortex-M7 core. The issue of code position-dependent performance is not necessarily tied to a specific bus matrix, but rather to the behavior of the Cortex-M7's prefetch unit. This unit can be partially configured—for example, the branch target address cache (BTAC) used for branch prediction can be disabled.

    While the instruction cache (ICache) generally improves overall performance, it may not directly solve the issue. In fact, predictability can sometimes decrease when the ICache is enabled. I still believe that the same code—particularly tight loops—can exhibit different performance characteristics depending on its memory location, even with the ICache active. In any case, tight loops should be avoided when possible due to these sensitivity issues.
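To compare placements reproducibly, it can help to pin the start address of the code under test so that it does not move when unrelated code changes. A minimal sketch, assuming a GCC or Clang toolchain (the `aligned` function attribute is a compiler extension, not standard C): `fill_zero` and `code_address` are hypothetical names, and the loop is only a stand-in for whatever tight loop is being timed. Note that a 32-byte aligned start corresponds to offset 0x00 in the flash word, which is not one of the offsets reported in this thread, so whether it is fast or slow would still have to be measured.

```c
#include <stddef.h>
#include <stdint.h>

#define FLASH_WORD_BYTES 32 /* 256-bit flash word, as stated in the thread */

/* GCC/Clang extension: ask for the function's first instruction to start
 * on a 32-byte boundary; noinline keeps it as a separately placed function. */
__attribute__((aligned(FLASH_WORD_BYTES), noinline))
void fill_zero(volatile uint8_t *dst, size_t len)
{
    for (size_t i = 0; i < len; ++i) {
        dst[i] = 0;
    }
}

/* Code address of the function with bit 0 cleared. On Cortex-M, function
 * pointers carry the Thumb bit in bit 0, so it is masked off before the
 * alignment is inspected. */
static uintptr_t code_address(void (*fn)(volatile uint8_t *, size_t))
{
    return (uintptr_t)fn & ~(uintptr_t)1;
}
```

Checking `code_address(fill_zero) % FLASH_WORD_BYTES` at startup, or looking the symbol up in the linker map file, confirms where the function actually ended up. Other offsets can be explored by placing the function in its own section and adjusting the linker script, which is toolchain-specific and not shown here.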

    The offset of +0x20 also sounds like the size of the prefetch queue. Even if the code is already present in the ICache, a poor branch prediction can flush the queue, requiring it to be refilled from the ICache.

    If an interrupt occurs, the prefetch queue is typically invalidated, and there's only a slim chance that the target code is already in the ICache. This leads to increased interrupt entry latency.