Visitor II
November 27, 2020
Question

memset() execution slower at some addresses


Hello,

After some investigation I found that memset() behaves differently depending on where it is placed in flash. Data and instruction caches are off! The micro used is an STM32H743XI.

The function is called with the following arguments: memset(dummy, 0, 64)

Its execution time is ~5 µs when the function is placed at:

..., 0x8040c34, 0x8040c54, 0x8040c74, ...

Its execution time is ~1 µs when the function is placed at:

..., 0x8040c3c, 0x8040c44, 0x8040c4c, 0x8040c5c, 0x8040c64, 0x8040c6c, ...

Any ideas?

Thanks

    This topic has been closed for replies.


    Super User
    November 27, 2020

    Where is dummy located?

    How are you measuring execution time?

    DApo.1 (Author)
    Visitor II
    November 27, 2020

    dummy is located in RAM.

    Time is measured via a free-running timer used as a clock. Its CNT register is read before and after execution.
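A minimal sketch of that measurement approach (names hypothetical; assumes a 32-bit free-running up-counter, with the unsigned subtraction tolerating a single counter wrap):

```c
#include <stdint.h>

/* Elapsed ticks between two reads of a 32-bit free-running up-counter.
 * Modulo-2^32 unsigned subtraction gives the right answer even if the
 * counter wrapped once between the two reads. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target this would be used as e.g. `t0 = TIMx->CNT; memset(dummy, 0, 64); t1 = TIMx->CNT;` and then divided by the timer ticks per microsecond.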

    Super User
    November 27, 2020

    How do you "place" a function? gcc has built-in versions of memset etc. and may decide to inline/unroll its implementation for small values of size. Take a look at the assembler code.

    DApo.1 (Author)
    Visitor II
    November 27, 2020

    This is the asm code. It is the original gcc byte-wise memset, nothing strange, and it is the same no matter where memset is placed:

    08040c54:  add    r2, r0
    08040c56:  mov    r3, r0
    08040c58:  cmp    r3, r2
    08040c5a:  bne.n  0x8040c5e <memset+10>
    08040c5c:  bx     lr
    08040c5e:  strb.w r1, [r3], #1
    08040c62:  b.n    0x8040c58 <memset+4>
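In C, that disassembly corresponds to the classic byte-at-a-time loop. A sketch of the equivalent (not the actual library source):

```c
#include <stddef.h>

/* Byte-wise memset matching the disassembly above:
 * advance a byte pointer until it reaches buf + n. */
static void *naive_memset(void *buf, int c, size_t n)
{
    unsigned char *p   = buf;
    unsigned char *end = p + n;      /* add r2, r0 */
    while (p != end)                 /* cmp r3, r2 / bne.n */
        *p++ = (unsigned char)c;     /* strb.w r1, [r3], #1 */
    return buf;                      /* bx lr */
}
```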

    To manipulate the address of memset I just add dummy code somewhere else.

    Interestingly, the addresses where execution is slower are +0x20 apart from each other.
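0x20 is exactly the 256-bit (32-byte) FLASH read width of the H7, so one speculative check is where the hot loop falls within a 32-byte line. A small host-side helper (the boundary interpretation is a guess, not a confirmed mechanism):

```c
#include <stdint.h>

#define FLASH_LINE 0x20u  /* 256-bit FLASH read width = 32 bytes */

/* Offset of an address within a 32-byte FLASH line. */
static uint32_t line_offset(uint32_t addr)
{
    return addr & (FLASH_LINE - 1u);
}

/* Does the byte range [addr, addr + len) straddle a line boundary? */
static int crosses_line(uint32_t addr, uint32_t len)
{
    return (addr / FLASH_LINE) != ((addr + len - 1u) / FLASH_LINE);
}
```

All the slow placements (0x8040c34, 0x8040c54, 0x8040c74) sit at line offset 0x14, where the 12-byte loop starting at function offset +4 straddles a 32-byte boundary; at the listed fast placements it fits inside a single line.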

    Super User
    November 27, 2020

    > Data and instruction cash are off! 

    It's spelled "cache" and I don't believe they are off.

    Which RAM? How are clocks and FLASH latency set?

    Try 64000 bytes.

    JW

    DApo.1 (Author)
    Visitor II
    November 29, 2020

    >>It's spelled "cache"

    Thanks for the spelling, sorry for the mistake; corrected in the description.

    >> I don't believe they are off.

    SCB->CCR = 0x40200, read before the memset call, clearly shows that both caches are off. If you mean something else, please be more specific.
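Decoding that value bears it out (bit positions per the Cortex-M7 SCB->CCR layout; a host-side check):

```c
#include <stdint.h>

/* Cortex-M7 SCB->CCR cache-enable bits. */
#define CCR_DC_BIT (1u << 16)  /* data cache enable */
#define CCR_IC_BIT (1u << 17)  /* instruction cache enable */

static int icache_enabled(uint32_t ccr) { return (ccr & CCR_IC_BIT) != 0; }
static int dcache_enabled(uint32_t ccr) { return (ccr & CCR_DC_BIT) != 0; }
```

In 0x40200 only bits 18 and 9 are set (features that read-as-one on the M7); bits 16 and 17 are clear, so both caches are indeed off.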

    >>Which RAM?

    dummy is a locally defined array, placed in AXI-SRAM. But its address is the same in both cases, so I do not see why this matters.

    >>Try 64000 bytes

    Here are the measurements for different sizes:

    bytes (B)   slow (µs)   fast (µs)
    64          5.11        0.99
    640         45.4        5.22
    6400        120.88      48.42
    64000       220.72      152.74

    >> clocks

    The clock config can be seen in the attached picture, but it is the same in both cases, so I cannot see your point. It looks to me that this cannot be the reason.

    Super User
    November 29, 2020

    I haven't used an H7, but the Cortex-M7 has a quite complex microarchitecture (6-stage pipeline, dual-instruction issue). You could use the DWT counters to get more info about what's going on. The ratios between your figures vary a lot, hmm.
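Enabling the DWT cycle counter is a few CMSIS register writes (set TRCENA in CoreDebug->DEMCR, zero DWT->CYCCNT, set CYCCNTENA in DWT->CTRL); converting a count delta to time is then just a division by the core clock. A sketch, assuming for illustration a 480 MHz H743 core clock:

```c
#include <stdint.h>

/* Convert a DWT CYCCNT delta to microseconds at a given core clock.
 * The 64-bit intermediate avoids overflow for any 32-bit delta. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```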

    Super User
    November 29, 2020

    >>> I don't believe they are off.

    > SCB-> CCR = 0x40200, read before memset call clearly show that both the caches are off. If you mean something else pls specify more detailed.

    No, I meant this.

    Okay, so this may be the more complex case (as compared to the instruction cache being switched on). As there's no caching, the processor requests each instruction word from FLASH. Instructions are 16-bit wide and go through a 6-stage pipeline to the asymmetric two-core execution unit (as KnarfB noted above); the processor fetch is probably 32-bit wide (I am too lazy to look it up), and it goes through the 64-bit AXI bus to the FLASH controller. FLASH is 256 bits wide (a qword) and is accessed through a 3-qword read queue, see FLASH read operations/Read operation overview in the RM. Add branch prediction to the mix. The detailed behaviour of all the components mentioned above is simply undocumented.

    I would expect that the behaviour would vary at any position within the 256-bit window, and also depend on previous execution state, with a short sequence making this more pronounced - exactly as you've experienced:

    >Here are the measurements for different sizes:

    > bytes (B)   slow (µs)   fast (µs)
    > 64          5.11        0.99
    > 640         45.4        5.22
    > 6400        120.88      48.42
    > 64000       220.72      152.74

    where you can see that not only the relative difference decreases, but the execution time per transferred word decreases too, with an increasing number of loops.

    >>>Which RAM?

    >dummy as locally defined array, defined in AXI-SRAM. But its address is the same in both cases, so i do not see why this matters

    Writing to RAM goes through the AXI matrix too. That's by no means a passive interconnect; it's a beast, again poorly documented. It may well be that writes are delayed for some reason (e.g. to group them together into dwords), that this then slows down subsequent writes, and that this is somehow dependent on the relative phase between the various involved clocks; but this of course is pure speculation and doesn't sound like the main reason for the difference here.

    The "optically" high processor speed brings higher number crunching capabilities, but the real-time-control aspect generally stays the same as it used to be.

    In Cortex-M7, generally, the TCM buses (and associated memories) are intended to bring down the uncertainties/jitter; but they of course have their own set of issues, and the uncertainties inherent in the processor (stemming from dual-issue, branch prediction etc.) remain.

    Welcome to the wonderful world of 32-bit mcus.

    JW

    Super User
    November 29, 2020

    Just a note, this change

    >6400   120.88  48.42

    >64000   220.72   152.74

    is suspiciously low - hasn't the timer overflowed there?

    JW

    DApo.1 (Author)
    Visitor II
    November 30, 2020

    You caught me again :). I had to go down to a 1 MHz timer frequency to avoid the overflow in the slow measurement for the 64000-byte case.

    Here are the corrected measurements:

    bytes (B)   slow (µs)   fast (µs)
    64          5.11        0.99
    640         45.4        5.22
    6400        449         49
    64000       4481        480
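With the corrected numbers, the per-byte cost for the large runs settles at roughly 70 ns/byte (slow) versus 7.5 ns/byte (fast), i.e. a fairly constant ~9x ratio. A quick check of that arithmetic:

```c
/* Nanoseconds per byte from a measurement in microseconds. */
static double ns_per_byte(double us, double bytes)
{
    return us * 1000.0 / bytes;
}
```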

    The good news is that enabling the instruction cache solves the issue.
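For reference, with CMSIS headers on a Cortex-M7 that fix is a single call early in startup (a target-only fragment, assuming the usual stm32h7xx device header; it will not run on a host):

```c
#include "stm32h7xx.h"  /* pulls in CMSIS core_cm7.h */

/* Enable the instruction cache early in startup. */
void enable_caches(void)
{
    SCB_EnableICache();
    /* SCB_EnableDCache() would also help speed, but then DMA buffers
       need cache maintenance or placement in non-cacheable memory. */
}
```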