Graduate II
August 18, 2025
Solved

Very bad performance on the STM32N657

  • August 18, 2025
  • 8 replies
  • 2079 views

Dear all,
I am facing very poor performance with the STM32N657.
I have some benchmarks that manipulate arrays of data in different ways.

I ran these benchmarks on
Nucleo_H753 @ 480 MHz, with caches ON
Nucleo_N657 @ 600 MHz, with caches ON

For the STM32N657 I measured the CPU and AXI clocks via MCO2:
fCPU = 600 MHz
fAXI = 400 MHz

The compiler used is gcc-15.2.0

Bench for the Nucleo_H753
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
 the internal memory. Then, compute the
 X-Y projections and the histogram.
 Fill the array t = 17 [us]
 X projection t = 41 [us]
 Y projection t = 18 [us]
 Histogram t = 30 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
 in the internal memory. Then, compute the
 X-Y projections and the histogram.
 Fill the array t = 171 [us]
 X projection t = 672 [us]
 Y projection t = 288 [us]
 Histogram t = 451 [us]

Bench 02: Fill a small 1D array (1000) elements in
 the internal memory with a random pattern.
 Then, compute the min / max values.
 Number of tests n = 1000 [-]
 Min / Max t = 1110 [us]

Bench 03: Fill a big 1D array (50000) elements in
 the internal memory with a random pattern.
 Then, compute the min / max values.
 Number of tests n = 100 [-]
 Min / Max t = 107 [us]

Bench 04: Compute the integer atan2 using the CORDIC
 algorithm
 Number of tests n = 1000 [-]
 1000 x atan2(y, x) t = 1088 [us]


Bench for the Nucleo_N657
-------------------------

uKOS-X > bench
System bench.
Bench 00: Fill a small 2D array (50 x 50) elements in
 the internal memory. Then, compute the
 X-Y projections and the histogram.
 Fill the array t = 29 [us]
 X projection t = 173 [us]
 Y projection t = 167 [us]
 Histogram t = 330 [us]

Bench 01: Fill a small 2D array (200 x 200) elements
 in the internal memory. Then, compute the
 X-Y projections and the histogram.
 Fill the array t = 400 [us]
 X projection t = 2766 [us]
 Y projection t = 2640 [us]
 Histogram t = 5369 [us]

Bench 02: Fill a small 1D array (1000) elements in
 the internal memory with a random pattern.
 Then, compute the min / max values.
 Number of tests n = 1000 [-]
 Min / Max t = 3127 [us]

Bench 03: Fill a big 1D array (50000) elements in
 the internal memory with a random pattern.
 Then, compute the min / max values.
 Number of tests n = 100 [-]
 Min / Max t = 323 [us]

Bench 04: Compute the integer atan2 using the CORDIC
 algorithm
 Number of tests n = 1000 [-]
 1000 x atan2(y, x) t = 2726 [us]


As shown, the N6 performance is not acceptable!
The clock measurements on MCO2 reflect the PLL values, but maybe some other elements
(not clearly identified) are influencing the code execution.

Any clue to get more decent results for the N6?
Kind regards,
Edo

 

    This topic has been closed for replies.
    Best answer by Franzi.Edo


    8 replies

    Super User
    August 18, 2025

    Hi,

what optimizer setting did you use?

Try -O2, compile, then check again.

    Graduate II
    August 18, 2025

    Hi AScha.3,

    Thank you for the suggestion.

Both targets use the same gcc settings, and the optimisation level is -Os. Here are the N6 results for -O2 and -O3. Even with these optimisations we are very far from the -Os results of the H753. More probably something is not right with the hardware, but I can only measure the PLL clocks!

Here are the new results:

    -O2
    ---
    
    uKOS-X > bench
    System bench.
    Bench 00: Fill a small 2D array (50 x 50) elements in
     the internal memory. Then, compute the
     X-Y projections and the histogram.
     Fill the array t = 24 [us]
     X projection t = 169 [us]
     Y projection t = 172 [us]
     Histogram t = 331 [us]
    
    Bench 01: Fill a small 2D array (200 x 200) elements
     in the internal memory. Then, compute the
     X-Y projections and the histogram.
     Fill the array t = 334 [us]
     X projection t = 2752 [us]
     Y projection t = 2632 [us]
     Histogram t = 5349 [us]
    
    Bench 02: Fill a small 1D array (1000) elements in
     the internal memory with a random pattern.
     Then, compute the min / max values.
     Number of tests n = 1000 [-]
     Min / Max t = 2963 [us]
    
    Bench 03: Fill a big 1D array (50000) elements in
     the internal memory with a random pattern.
     Then, compute the min / max values.
     Number of tests n = 100 [-]
     Min / Max t = 307 [us]
    
    Bench 04: Compute the integer atan2 using the CORDIC
     algorithm
     Number of tests n = 1000 [-]
     1000 x atan2(y, x) t = 2747 [us]
    
    -O3
    ---
    
    uKOS-X > bench
    System bench.
    Bench 00: Fill a small 2D array (50 x 50) elements in
     the internal memory. Then, compute the
     X-Y projections and the histogram.
     Fill the array t = 24 [us]
     X projection t = 169 [us]
     Y projection t = 171 [us]
     Histogram t = 330 [us]
    
    Bench 01: Fill a small 2D array (200 x 200) elements
     in the internal memory. Then, compute the
     X-Y projections and the histogram.
     Fill the array t = 334 [us]
     X projection t = 2752 [us]
     Y projection t = 2633 [us]
     Histogram t = 5348 [us]
    
    Bench 02: Fill a small 1D array (1000) elements in
     the internal memory with a random pattern.
     Then, compute the min / max values.
     Number of tests n = 1000 [-]
     Min / Max t = 2962 [us]
    
    Bench 03: Fill a big 1D array (50000) elements in
     the internal memory with a random pattern.
     Then, compute the min / max values.
     Number of tests n = 100 [-]
     Min / Max t = 292 [us]
    
    Bench 04: Compute the integer atan2 using the CORDIC
     algorithm
     Number of tests n = 1000 [-]
     1000 x atan2(y, x) t = 2862 [us]
    Super User
    August 18, 2025

    Ok,

It was just because you didn't state the optimizer setting.

(I don't have the N6 with its M55 core, so I'm just guessing...)

For your tests: where does the program run from?

Did you try loading it to RAM, or better, to TCM RAM? And is the supply set to VOS high?

Did you check with a scope on MCO that the clock setting is correct?

    Super User
    August 18, 2025

     

    Bench 02: Fill a small 1D array (1000) elements in
     the internal memory with a random pattern.
     Then, compute the min / max values.
     Number of tests n = 1000 [-]
     Min / Max t = 1110 [us]
    
    Bench 03: Fill a big 1D array (50000) elements in
     the internal memory with a random pattern.
     Then, compute the min / max values.
     Number of tests n = 100 [-]
     Min / Max t = 107 [us]

     

What exactly is being reported on the "Min / Max" line? The two readings aren't consistent with each other as far as I can see. If it's time per test, the smaller array should be faster. If it's total time for all tests, the math doesn't add up: 1000 × 1000 values take 1110 us, but 100 × 50000 values only take 107 us? Nah.

     

    Showing actual code being used here might help and avoid the 20-questions back and forth and get an answer faster.

    Graduate II
    August 18, 2025

    Hi TDK,

you are right; let me investigate this inconsistency.

    Btw, here is the benched routine:

    /*
     * \brief local_minMax
     *
     * - Compute the min / max of an array
     *
     */
    static	void	local_minMax(uint32_t *array, uint64_t *time, uint32_t *min, uint32_t *max) {
    	uint64_t	tStamp[2];
    	uint32_t	i;
    
    	kern_getTickCount(&tStamp[0]);
    	*min = 0xFFFFFFFF; *max = 0x00000000;
    	for (i = 0; i < KNB_ELEMENTS; i++) {
    		if (*(array + i) < *min) { *min = *(array + i); }
    		if (*(array + i) > *max) { *max = *(array + i); }
    	}
    	kern_getTickCount(&tStamp[1]);
    
    	*time = tStamp[1] - tStamp[0];
    }
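
    The inconsistency TDK points out can be made concrete by normalizing both "Min / Max" readings to time per element (a host-side sketch of the arithmetic, assuming each line reports the total time over all n tests):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        /* Bench 02 (H7): 1000 tests over 1000 elements, 1110 us total */
        double ns_small = 1110e3 / (1000.0 * 1000.0);   /* ~1.11 ns/element  */
        /* Bench 03 (H7): 100 tests over 50000 elements, 107 us total */
        double ns_big   = 107e3  / (100.0 * 50000.0);   /* ~0.021 ns/element */

        printf("small: %.3f ns/elem, big: %.4f ns/elem, ratio %.1f\n",
               ns_small, ns_big, ns_small / ns_big);
        /* If both lines reported total time, the per-element costs should be
         * comparable; a ~52x gap means the two lines cannot both be what
         * they claim to be. */
        assert(ns_small / ns_big > 40.0);
        return 0;
    }
    ```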
    Graduate II
    August 19, 2025

    Dear all,

    To investigate my speed problem, I created a very simple test.
    I run an infinite loop (with interrupts off) where I execute a simple NOP loop 1,000,000 times.
    At the end of each loop execution, I toggle a GPIO pin.

    Here is the C code:

    #include	"uKOS.h"
    
    #define	KNB_TESTS			1000000
    
    // CLI tool specific
    // =================
    
    static	void	 local_loop(uint32_t nb);
    
    /*
     * \brief bench_05
     *
     * - loop
     *
     */
    bool	bench_05(void) {
    
    	dprintf(KSYST, "Bench 05: For scope tests!\n");
    
    	kern_suspendProcess(1000);
    
    	INTERRUPTION_OFF_HARD
    	while (true) {
    
    		ANALYSER_TOGGLE;
    		local_loop(KNB_TESTS);
    	}
    
    	return (true);
    }
    
    // Local routines
    // ==============
    
    /*
     * \brief local_loop
     *
     * - Execute the nop
     *
     */
    static	void local_loop(uint32_t nb) {
    	volatile	uint32_t	i;
    
    	for (i = 0; i < nb; i++) {
    		NOP;
    	}
    }


    Here is the H7 assembly
    
    08016278 <bench_05>:
     8016278:	b507 	push	{r0, r1, r2, lr}
     801627a:	490e 	ldr	r1, [pc, #56]	@ (80162b4 <bench_05+0x3c>)
     801627c:	480e 	ldr	r0, [pc, #56]	@ (80162b8 <bench_05+0x40>)
     801627e:	f01f ff21 	bl	80360c4 <dprintf>
     8016282:	f44f 707a 	mov.w	r0, #1000	@ 0x3e8
     8016286:	f7ed fa4b 	bl	8003720 <kern_suspendProcess>
     801628a:	b672 	cpsid	i
     801628c:	f3bf 8f6f 	isb	sy
     8016290:	4b0a 	ldr	r3, [pc, #40]	@ (80162bc <bench_05+0x44>)
     8016292:	2100 	movs	r1, #0
     8016294:	480a 	ldr	r0, [pc, #40]	@ (80162c0 <bench_05+0x48>)
     8016296:	695a 	ldr	r2, [r3, #20]
     8016298:	f082 0201 	eor.w	r2, r2, #1
     801629c:	615a 	str	r2, [r3, #20]
     801629e:	695a 	ldr	r2, [r3, #20]
     80162a0:	9101 	str	r1, [sp, #4]
     80162a2:	9a01 	ldr	r2, [sp, #4]
     80162a4:	4282 	cmp	r2, r0
     80162a6:	d8f6 	bhi.n	8016296 <bench_05+0x1e>
     80162a8:	bf00 	nop
     80162aa:	9a01 	ldr	r2, [sp, #4]
     80162ac:	3201 	adds	r2, #1
     80162ae:	9201 	str	r2, [sp, #4]
     80162b0:	e7f7 	b.n	80162a2 <bench_05+0x2a>
     80162b2:	bf00 	nop
     80162b4:	08043ee1 			@ <UNDEFINED> instruction: 08043ee1
     80162b8:	73797374 			@ <UNDEFINED> instruction: 73797374
     80162bc:	58020400 			@ <UNDEFINED> instruction: 58020400
     80162c0:	000f423f 			@ <UNDEFINED> instruction: 000f423f
     80162c4:	00000000 			@ <UNDEFINED> instruction: 00000000
    
    The inner loop, executed 1,000,000 times, is:
    8016296:	695a 	ldr	r2, [r3, #20]
     8016298:	f082 0201 	eor.w	r2, r2, #1
     801629c:	615a 	str	r2, [r3, #20]
     801629e:	695a 	ldr	r2, [r3, #20]
     80162a0:	9101 	str	r1, [sp, #4]
     80162a2:	9a01 	ldr	r2, [sp, #4]
     80162a4:	4282 	cmp	r2, r0
     80162a6:	d8f6 	bhi.n	8016296 <bench_05+0x1e>


    Here is the N6 assembly
    
    34014bd4 <bench_05>:
    34014bd4:	b507 	push	{r0, r1, r2, lr}
    34014bd6:	490e 	ldr	r1, [pc, #56]	@ (34014c10 <bench_05+0x3c>)
    34014bd8:	480e 	ldr	r0, [pc, #56]	@ (34014c14 <bench_05+0x40>)
    34014bda:	f017 fd91 	bl	3402c700 <dprintf>
    34014bde:	f44f 707a 	mov.w	r0, #1000	@ 0x3e8
    34014be2:	f7ef f80d 	bl	34003c00 <kern_suspendProcess>
    34014be6:	b672 	cpsid	i
    34014be8:	f3bf 8f6f 	isb	sy
    34014bec:	2100 	movs	r1, #0
    34014bee:	4b0a 	ldr	r3, [pc, #40]	@ (34014c18 <bench_05+0x44>)
    34014bf0:	480a 	ldr	r0, [pc, #40]	@ (34014c1c <bench_05+0x48>)
    34014bf2:	695a 	ldr	r2, [r3, #20]
    34014bf4:	f082 0202 	eor.w	r2, r2, #2
    34014bf8:	615a 	str	r2, [r3, #20]
    34014bfa:	695a 	ldr	r2, [r3, #20]
    34014bfc:	9101 	str	r1, [sp, #4]
    34014bfe:	9a01 	ldr	r2, [sp, #4]
    34014c00:	4282 	cmp	r2, r0
    34014c02:	d8f6 	bhi.n	34014bf2 <bench_05+0x1e>
    34014c04:	bf00 	nop
    34014c06:	9a01 	ldr	r2, [sp, #4]
    34014c08:	3201 	adds	r2, #1
    34014c0a:	9201 	str	r2, [sp, #4]
    34014c0c:	e7f7 	b.n	34014bfe <bench_05+0x2a>
    34014c0e:	bf00 	nop
    34014c10:	9fdc 	ldr	r7, [sp, #880]	@ 0x370
    34014c12:	3403 	adds	r4, #3
    34014c14:	7374 	strb	r4, [r6, #13]
    34014c16:	7379 	strb	r1, [r7, #13]
    34014c18:	1800 	adds	r0, r0, r0
    34014c1a:	5602 	ldrsb	r2, [r0, r0]
    34014c1c:	423f 	tst	r7, r7
    34014c1e:	000f 	movs	r7, r1
    
    The inner loop, executed 1,000,000 times, is:
    34014bf2:	695a 	ldr	r2, [r3, #20]
    34014bf4:	f082 0202 	eor.w	r2, r2, #2
    34014bf8:	615a 	str	r2, [r3, #20]
    34014bfa:	695a 	ldr	r2, [r3, #20]
    34014bfc:	9101 	str	r1, [sp, #4]
    34014bfe:	9a01 	ldr	r2, [sp, #4]
    34014c00:	4282 	cmp	r2, r0
    34014c02:	d8f6 	bhi.n	34014bf2 <bench_05+0x1e>

    As you can see, the two inner loops are identical.
    However, the logic analyzer on the GPIO shows:
    H7: 10.4 ms
    N6: 21.59 ms

    Execution time ratio = 2.75

    The measured frequency on MCO2 is:
    H7 = 480 MHz
    N6 = 588 MHz
    Clock ratio = 1.22


    So, the H7 @ 480 MHz is effectively 3.35× faster than the N6 @ 588 MHz.
    Clearly something is wrong somewhere in the chain, but at the moment the only concrete measurements I have are the MCO2 frequency values and the GPIO timing.
    Question: Where am I losing this factor of 3?
    Best regards,

    ST Employee
    August 19, 2025

    Hello @Franzi.Edo 

    On the N6 side, can you read and share the contents of the MSCR register?
    Refer to PM0273 - Rev 3 section 6.8.2 Memory System Control Register, MSCR
    You should verify that bits 12 (DCACTIVE) and 13 (ICACTIVE), which enable the L1 data and instruction cache memory interfaces, are set.
    This can be done using the macros in core_cm55.h and implementing the line below at the beginning of your code:

    MEMSYSCTL->MSCR |= MEMSYSCTL_MSCR_DCACTIVE_Msk|MEMSYSCTL_MSCR_ICACTIVE_Msk;


    Regarding measuring execution times, I suggest using DWT_CYCCNT (also available on H7) instead of a hardware timer. Count CPU cycles instead of microseconds, and compare with your assembler code. Once your CPU cycles reach what you expect, convert to time from the actual H7 and N6 CPU frequencies.

    In attachment the main.c shows how to configure and use DWT_CYCCNT on N6.
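
    A minimal sketch of that approach (the register sequence assumes the standard CMSIS CoreDebug/DWT definitions from core_cm55.h / core_cm7.h; it is shown as comments so the wrap-safe arithmetic below stays host-testable). CYCCNT is a free-running 32-bit counter, so elapsed cycles should be computed with unsigned subtraction, which handles one counter wrap automatically:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* On target (CMSIS), enable the cycle counter once at startup:
     *   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable DWT/ITM
     *   DWT->CYCCNT = 0;
     *   DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            // start counting
     * then sample DWT->CYCCNT before and after the benched code.
     */

    /* Wrap-safe elapsed cycles via unsigned 32-bit subtraction. */
    static uint32_t dwt_elapsed(uint32_t start, uint32_t end) {
        return end - start;
    }

    int main(void) {
        /* normal case */
        assert(dwt_elapsed(100u, 350u) == 250u);
        /* counter wrapped between the two samples */
        assert(dwt_elapsed(0xFFFFFFF0u, 0x00000010u) == 0x20u);
        printf("dwt_elapsed ok\n");
        return 0;
    }
    ```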

    Let me know if it helps?
    Best regards,

    Romain, 

    Graduate II
    August 19, 2025

    Hi RomainR,
    Thank you for your suggestion.
    I just printed the content of MEMSYSCTL->MSCR, and its value is 0x300A. Unfortunately, the bits you suggested to set are already enabled.


    Do you have any other suggestions I could try? It feels like there is some kind of divider between the clock observed on MCO2 and the actual CPU clock.


    Regarding the DWT_CYCCNT, you are absolutely right. The issue is that these benchmarks are exactly the same across all the architectures supported by my OS (Cortex, RISC-V). So the simplest solution was to rely on a timer value provided by the OS, in order to avoid multiple code implementations.
    Best regards,
    Edo

     

     

     

    Super User
    August 19, 2025

    Does the N6 test run in the RAM or external flash?

     

    Graduate II
    August 19, 2025

    Hi Pavel, the N6 runs from the internal AXI SRAM1.

    BR, Edo

     

    Graduate II
    August 20, 2025

    For any accurate timing measurements, I would:

    - use the ARM cycle counter

    - turn off all interrupts (__disable_irq() if possible; otherwise disable all interrupts not related to the functions under test)

    Graduate II
    August 20, 2025

    Hi LCE,

    Thank you for your advice.

    Regarding the DWT_CYCCNT, you are absolutely right. The issue is that these benchmarks are exactly the same across all the architectures supported by my OS (Cortex, RISC-V). So the simplest solution was to rely on a timer value provided by the OS, in order to avoid multiple code implementations.

    For the moment I do not need sub-µs measurements. Here the problem is the effective speed of the CPU. Just check my previous test (the simple NOP loop): it turns out the H7 is 3.3× faster than the N6, and I cannot believe that.
    Best regards,
    Edo

    Graduate II
    August 20, 2025

    CYCCNT:

    I prefer it not only because of its cycle-level accuracy, but also because it is not an STM32 peripheral, and thus does not depend on bus clocks or peripheral settings.

    Super User
    August 20, 2025

    Note also that the M7 core is faster than the M55 on a per-MHz level.

    You might also be running into bus contention issues, with code and data being transferred over the same bus. Some sources on the internet report a 2-3× speed difference, which is what you're seeing.

     

    > So, the H7 @ 480 MHz is effectively 3.35× faster than the N6 @ 588 MHz.

    I calculate a 2.53x difference, not 3.35x.
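
    That 2.53× figure can be reproduced from the measurements reported earlier (a quick host-side sketch of the arithmetic, converting each toggle period into cycles per iteration of the identical 8-instruction inner loop):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void) {
        const double iterations = 1e6;
        /* measured GPIO toggle periods and MCO2 clocks from the earlier post */
        double h7_cycles = 10.4e-3  * 480e6 / iterations;  /* ~5.0  cycles/iter */
        double n6_cycles = 21.59e-3 * 588e6 / iterations;  /* ~12.7 cycles/iter */

        printf("H7: %.2f cycles/iter, N6: %.2f cycles/iter, ratio %.2f\n",
               h7_cycles, n6_cycles, n6_cycles / h7_cycles);
        /* Same generated code on both cores, so the gap in cycles per
         * iteration is architectural or memory-system related. */
        assert(h7_cycles > 4.5 && h7_cycles < 5.5);
        assert(n6_cycles / h7_cycles > 2.4 && n6_cycles / h7_cycles < 2.7);
        return 0;
    }
    ```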

     

    The N6 has a lot of NPU-specific computational power which is not being exercised here at all. That's what it was built for, not single-thread execution.

    Graduate II
    August 20, 2025

    Hi AScha.3

    You are right: the M7 is faster than the M55 per MHz.
    In my test, the N6 has a clock advantage of a factor of 1.22, which should place it roughly at the same level as the H7: the 588 MHz N6 CoreMark is 4.40 × 588 = 2587, and the 480 MHz H7 CoreMark is 5.29 × 480 = 2539. So both machines should give very similar results on my tests.

    In some tests, I even see an execution time ratio of >4.
    But I’ve identified the problem — I’ll explain in a moment.
    Thanks.

    Franzi.Edo (Author, Best Answer)
    Graduate II
    August 20, 2025

    Dear all,
    I have identified the main cause of my problem.
    The poor N6 performance compared to the H7 was due to the MPU configuration.

    I had initially specified the RAM as Sharable, which — for reasons that are not entirely clear — degraded memory performance.
    After changing the RAM to Non-sharable, the results are now much more consistent and explainable.
    With this adjustment, the N6 @ 588 MHz and the H7 @ 480 MHz deliver very close performance.
    I consider this issue mostly resolved, although a few open points remain:
    Why does Sharable RAM perform so poorly?

    In my NOP test, the H7 is still about 30% faster than the N6, despite the N6 running at a higher clock speed. My assumption is that the H7’s memory scheme (code executed in FLASH with the ART accelerator) is more efficient than the N6’s cache-based approach.


    Anyway, I’d like to thank you all for your great support.
    Best regards,
    Edo
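
    For reference, the Shareable / Non-shareable choice ends up in the SH field (bits [4:3]) of each ARMv8-M MPU region's MPU_RBAR register. The sketch below mirrors the CMSIS mpu_armv8.h encoding locally for host-side illustration (on target you would use the real ARM_MPU_RBAR macro and ARM_MPU_SetRegion); the N6 AXI SRAM base 0x34000000 is an assumed example value:

    ```c
    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Local mirror of the CMSIS mpu_armv8.h SH encodings (illustration only). */
    #define MPU_SH_NON    0u  /* Non-shareable   */
    #define MPU_SH_OUTER  2u  /* Outer Shareable */
    #define MPU_SH_INNER  3u  /* Inner Shareable */

    /* RBAR layout: BASE[31:5] | SH[4:3] | AP[2:1] | XN[0] */
    static uint32_t mpu_rbar(uint32_t base, uint32_t sh, uint32_t ro,
                             uint32_t np, uint32_t xn) {
        uint32_t ap = ((ro & 1u) << 1) | (np & 1u);  /* access permissions */
        return (base & ~0x1Fu) | ((sh & 3u) << 3) | (ap << 1) | (xn & 1u);
    }

    int main(void) {
        const uint32_t sram1 = 0x34000000u;  /* assumed N6 AXI SRAM base */

        uint32_t shared    = mpu_rbar(sram1, MPU_SH_INNER, 0u, 1u, 0u);
        uint32_t nonshared = mpu_rbar(sram1, MPU_SH_NON,   0u, 1u, 0u);

        /* Only the SH bits [4:3] differ between the two configurations. */
        assert((shared ^ nonshared) == (3u << 3));
        assert(((nonshared >> 3) & 3u) == MPU_SH_NON);
        printf("shared RBAR=0x%08X, non-shared RBAR=0x%08X\n",
               (unsigned)shared, (unsigned)nonshared);
        return 0;
    }
    ```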

    ST Employee
    August 20, 2025

    Hi @Franzi.Edo 

    Thank you for sharing your tests and results. It was almost certainly a matter of MPU configuration.

    A memory area on the ST NOC AXI (SRAM1 and 2 of the N6) configured with shareable cacheable attributes will be translated by the CM55 processor as Normal Shareable Non-cacheable. This can penalize processor accesses and explains the degraded performance.

    Here is a note on this subject in the Arm Cortex-M55 Processor Technical Reference Manual.
    Section Memory system/Manager-AXI interface then Memory attribute conversion on M-AXI:

    https://developer.arm.com/documentation/101051/0101/Memory-system/Manager-AXI-interface/Memory-attribute-conversion-on-M-AXI

    It is also possible that, as on Cortex-M7, a shareable and cacheable area may not have the data cache enabled, only the instruction cache is used:

    https://www.youtube.com/watch?v=6IUfxSAFhlw&list=PLnMKNibPkDnEQXu4S6QUUHuSKj81MeqCz&ab_channel=STMicroelectronics

    Best regards,

    Romain,

    Super User
    August 21, 2025

    A memory area on the ST NOC AXI (SRAM1 and 2 of the N6) configured with shareable cacheable attributes will be translated by the CM55 processor as Normal Shareable Non-cacheable.

    @RomainR. Is this the case even if the area is defined as cacheable via the MPU? Or only when the area falls in the background region (without an MPU region)?