Visitor II
November 27, 2020
Question

memset() execution slower at some addresses


Hello,

After some investigation I found that memset() behaves differently depending on where it is placed in flash. Data and instruction caches are off! The micro used is an STM32H743XI.

The function is called with the following arguments: memset(dummy, 0, 64)

Its execution time is ~5 µs when the function is placed at:

..., 0x8040c34, 0x8040c54, 0x8040c74, ...

Its execution time is ~1 µs when the function is placed at:

..., 0x8040c3c, 0x8040c44, 0x8040c4c, 0x8040c5c, 0x8040c64, 0x8040c6c, ...

Any ideas?

Thanks

    This topic has been closed for replies.


    Super User
    November 27, 2020

    Where is dummy located?

    How are you measuring execution time?

    DApo.1 (Author)
    Visitor II
    November 27, 2020

    dummy is located in RAM.

    Time is measured via a free-running timer used as a clock. Its CNT register is read before and after execution.
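A minimal sketch of that measurement approach (names hypothetical; assumes a 32-bit free-running up-counter, with the unsigned subtraction tolerating a single counter wrap):

```c
#include <stdint.h>

/* Elapsed ticks between two reads of a 32-bit free-running up-counter.
 * Modulo-2^32 unsigned subtraction gives the right answer even if the
 * counter wrapped once between the two reads. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}
```

On the target this would be used as e.g. `t0 = TIMx->CNT; memset(dummy, 0, 64); t1 = TIMx->CNT;` and then divided by the timer ticks per microsecond.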

    Super User
    November 27, 2020

    How do you "place" a function? gcc has built-in versions of memset etc. and may decide to inline/unroll its implementation for small values of size. Take a look at the assembler code.

    DApo.1 (Author)
    Visitor II
    November 27, 2020

    This is the asm code. It is the original gcc byte-wise memset, nothing strange, and it is the same no matter where memset is placed:

    08040c54:  add    r2, r0
    08040c56:  mov    r3, r0
    08040c58:  cmp    r3, r2
    08040c5a:  bne.n  0x8040c5e <memset+10>
    08040c5c:  bx     lr
    08040c5e:  strb.w r1, [r3], #1
    08040c62:  b.n    0x8040c58 <memset+4>
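In C, that disassembly corresponds to the classic byte-at-a-time loop. A sketch of the equivalent (not the actual library source):

```c
#include <stddef.h>

/* Byte-wise memset matching the disassembly above:
 * advance a byte pointer until it reaches buf + n. */
static void *naive_memset(void *buf, int c, size_t n)
{
    unsigned char *p   = buf;
    unsigned char *end = p + n;      /* add r2, r0 */
    while (p != end)                 /* cmp r3, r2 / bne.n */
        *p++ = (unsigned char)c;     /* strb.w r1, [r3], #1 */
    return buf;                      /* bx lr */
}
```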

    To manipulate the address of memset I just add dummy code somewhere else.

    Interestingly, the addresses where execution is slower are +0x20 apart from each other.
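0x20 is exactly the 256-bit (32-byte) FLASH read width of the H7, so one speculative check is where the hot loop falls within a 32-byte line. A small host-side helper (the boundary interpretation is a guess, not a confirmed mechanism):

```c
#include <stdint.h>

#define FLASH_LINE 0x20u  /* 256-bit FLASH read width = 32 bytes */

/* Offset of an address within a 32-byte FLASH line. */
static uint32_t line_offset(uint32_t addr)
{
    return addr & (FLASH_LINE - 1u);
}

/* Does the byte range [addr, addr + len) straddle a line boundary? */
static int crosses_line(uint32_t addr, uint32_t len)
{
    return (addr / FLASH_LINE) != ((addr + len - 1u) / FLASH_LINE);
}
```

All the slow placements (0x8040c34, 0x8040c54, 0x8040c74) sit at line offset 0x14, where the 12-byte loop starting at function offset +4 straddles a 32-byte boundary; at the listed fast placements it fits inside a single line.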

    Super User
    November 27, 2020

    > Data and instruction cash are off! 

    It's spelled "cache" and I don't believe they are off.

    Which RAM? How are clocks and FLASH latency set?

    Try 64000 bytes.

    JW

    DApo.1 (Author)
    Visitor II
    November 29, 2020

    >>It's spelled "cache"

    Thanks for the spelling, sorry for the mistake; corrected in the description.

    >> I don't believe they are off.

    SCB->CCR = 0x40200, read before the memset call, clearly shows that both caches are off. If you mean something else, please be more specific.
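Decoding that value bears it out (bit positions per the Cortex-M7 SCB->CCR layout; a host-side check):

```c
#include <stdint.h>

/* Cortex-M7 SCB->CCR cache-enable bits. */
#define CCR_DC_BIT (1u << 16)  /* data cache enable */
#define CCR_IC_BIT (1u << 17)  /* instruction cache enable */

static int icache_enabled(uint32_t ccr) { return (ccr & CCR_IC_BIT) != 0; }
static int dcache_enabled(uint32_t ccr) { return (ccr & CCR_DC_BIT) != 0; }
```

In 0x40200 only bits 18 and 9 are set (features that read-as-one on the M7); bits 16 and 17 are clear, so both caches are indeed off.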

    >>Which RAM?

    dummy is a locally defined array, placed in AXI-SRAM. But its address is the same in both cases, so I do not see why this matters.

    >>Try 64000 bytes

    Here are the measurements for different sizes:

    bytes (B)   slow (µs)   fast (µs)
    64          5.11        0.99
    640         45.4        5.22
    6400        120.88      48.42
    64000       220.72      152.74

    >> clocks

    The clock config can be seen in the attached picture, but it is the same in both cases, so I cannot see your point. It looks to me that this cannot be the reason.

    Super User
    November 29, 2020

    I haven't used an H7, but the Cortex-M7 has a quite complex microarchitecture (6-stage pipeline, dual-instruction issue). You could use the DWT counters to get more info about what's going on. The ratios between your figures vary a lot, hmm.
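Enabling the DWT cycle counter is a few CMSIS register writes (set TRCENA in CoreDebug->DEMCR, zero DWT->CYCCNT, set CYCCNTENA in DWT->CTRL); converting a count delta to time is then just a division by the core clock. A sketch, assuming for illustration a 480 MHz H743 core clock:

```c
#include <stdint.h>

/* Convert a DWT CYCCNT delta to microseconds at a given core clock.
 * The 64-bit intermediate avoids overflow for any 32-bit delta. */
static uint32_t cycles_to_us(uint32_t cycles, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)cycles * 1000000u) / core_hz);
}
```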

    Super User
    November 29, 2020

    >>> I don't believe they are off.

    > SCB-> CCR = 0x40200, read before memset call clearly show that both the caches are off. If you mean something else pls specify more detailed.

    No, I meant this.

    Okay, so this may be the more complex case (as compared to the instruction cache being switched on). As there's no caching, the processor requests each instruction word from FLASH. Instructions are 16-bit wide and go through a 6-stage pipeline to the asymmetric two-core execution unit (as KnarfB noted above); the processor fetch is probably 32-bit wide (I am too lazy to look it up), and it goes through the 64-bit AXI bus to the FLASH controller. FLASH is 256 bits wide (a qword) and is accessed through a 3-qword read queue, see FLASH read operations/Read operation overview in the RM. Add branch prediction to the mix. The detailed behaviour of all the components mentioned above is simply undocumented.

    I would expect that the behaviour would vary at any position within the 256-bit window, and also depend on previous execution state, with a short sequence making this more pronounced - exactly as you've experienced:

    >Here are the measurements for different sizes:

    > bytes (B)   slow (µs)   fast (µs)
    > 64          5.11        0.99
    > 640         45.4        5.22
    > 6400        120.88      48.42
    > 64000       220.72      152.74

    where you can see that not only the relative difference decreases, but the execution time per transferred word decreases too, with an increasing number of loops.

    >>>Which RAM?

    >dummy as locally defined array, defined in AXI-SRAM. But its address is the same in both cases, so i do not see why this matters

    Writing to RAM goes through the AXI matrix too. That's by no means a passive interconnect; it's a beast, again poorly documented. It may well be that writes are delayed for some reason (e.g. to group them together into dwords), that this then slows down subsequent writes, and that this is somehow dependent on the relative phase between the various involved clocks; but this of course is pure speculation and doesn't sound like the main reason for the difference here.

    The "optically" high processor speed brings higher number crunching capabilities, but the real-time-control aspect generally stays the same as it used to be.

    In Cortex-M7, generally, the TCM buses (and associated memories) are intended to bring down the uncertainties/jitter; but they of course have their own set of issues, and the uncertainties inherent in the processor (stemming from dual-issue, branch prediction etc.) remain.

    Welcome to the wonderful world of 32-bit mcus.

    JW

    Super User
    November 29, 2020

    Just a note, this change

    >6400   120.88  48.42

    >64000   220.72   152.74

    is suspiciously low - hasn't the timer overflowed there?

    JW

    DApo.1 (Author)
    Visitor II
    November 30, 2020

    You caught me again :). I had to go down to a 1 MHz timer frequency to avoid the overflow in the slow measurement for the 64000-byte case.

    Here are the corrected measurements:

    bytes (B)   slow (µs)   fast (µs)
    64          5.11        0.99
    640         45.4        5.22
    6400        449         49
    64000       4481        480
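With the corrected numbers, the per-byte cost for the large runs settles at roughly 70 ns/byte (slow) versus 7.5 ns/byte (fast), i.e. a fairly constant ~9x ratio. A quick check of that arithmetic:

```c
/* Nanoseconds per byte from a measurement in microseconds. */
static double ns_per_byte(double us, double bytes)
{
    return us * 1000.0 / bytes;
}
```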

    The good news is that enabling the instruction cache solves the issue.
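For reference, with CMSIS headers on a Cortex-M7 that fix is a single call early in startup (a target-only fragment, assuming the usual stm32h7xx device header; it will not run on a host):

```c
#include "stm32h7xx.h"  /* pulls in CMSIS core_cm7.h */

/* Enable the instruction cache early in startup. */
void enable_caches(void)
{
    SCB_EnableICache();
    /* SCB_EnableDCache() would also help speed, but then DMA buffers
       need cache maintenance or placement in non-cacheable memory. */
}
```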