acapola
Associate II
January 29, 2026
Solved

STM32N657: how to measure execution time with cycle accuracy


The best I have achieved so far, using both the DWT and the PMU, is accuracy to within a few cycles. How can I measure execution time such that I get a 1-cycle difference when I insert a NOP in the measured code? (Assume interrupts and caches are disabled.)

Both functions below return 1, even though test_pmu1 has an extra NOP between the two counter reads!

	.syntax	unified
	.thumb
	.text

.global test_pmu0
.type test_pmu0, %function
.align 2
test_pmu0:
	isb	sy
	dsb	sy
	ldr	r3, .test_pmu0.pmu_base
	ldr	r0, [r3, #0x7C] //read PMU->CCNTR
	ldr	r1, [r3, #0x7C] //read PMU->CCNTR
	sub r0,r1,r0
	bx lr
.align 2
.test_pmu0.pmu_base:
	.word	0xE0003000 //PMU base address
.global test_pmu1
.type test_pmu1, %function
.align 2
test_pmu1:
	isb	sy
	dsb	sy
	ldr	r3, .test_pmu1.pmu_base
	ldr	r0, [r3, #0x7C] //read PMU->CCNTR
	nop
	ldr	r1, [r3, #0x7C] //read PMU->CCNTR
	sub r0,r1,r0
	bx lr
.align 2
.test_pmu1.pmu_base:
	.word	0xE0003000 //PMU base address


3 replies

mbarg.1
Senior III
January 29, 2026

I always have an oscilloscope on my bench, and toggling a pin is the best marker for interval measurement.
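A minimal sketch of the pin-marker approach in C, assuming an STM32-style GPIO BSRR register (writing bit n drives pin n high, writing bit n+16 drives it low). The pointer `marker_bsrr` and the pin number `MARKER_PIN` are hypothetical; on the target you would point `marker_bsrr` at the BSRR of a push-pull output pin (for example `&GPIOx->BSRR` from the device header) rather than at a plain variable.

```c
#include <stdint.h>

/* Hypothetical marker register: on an STM32 target, point this at the
   BSRR of a GPIO port whose pin MARKER_PIN is configured as an output. */
static volatile uint32_t *marker_bsrr;

/* Hypothetical marker pin number (0..15). */
enum { MARKER_PIN = 5 };

/* BSRR semantics: writing 1 to bit n drives pin n high, writing 1 to
   bit n + 16 drives it low. Each marker is a single store, so there is
   no read-modify-write race with other code touching the same port. */
static inline void marker_high(void) {
    *marker_bsrr = UINT32_C(1) << MARKER_PIN;
}

static inline void marker_low(void) {
    *marker_bsrr = UINT32_C(1) << (MARKER_PIN + 16);
}

/* Usage (code_under_test is hypothetical):
 *   marker_high();
 *   code_under_test();
 *   marker_low();
 * The scope then measures the width of the high pulse. */
```

The pulse width includes the latency of the two GPIO writes themselves, so it is most useful for comparing variants of the same code rather than as an absolute cycle count.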

You can also use a counter in interval-measurement mode, but that is not a usual tool for software developers.

Lastly, you can use a periodic interrupt and count ticks in between: less accurate, but more software-oriented.
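A sketch of the tick-counting variant, assuming a periodic timer interrupt (for example SysTick) whose handler increments a counter. `tick_isr` and `g_ticks` are hypothetical names. The resolution is one tick period, so a single reading can be off by up to one tick.

```c
#include <stdint.h>

/* Hypothetical tick counter, incremented by a periodic timer interrupt.
   volatile because it changes outside the normal program flow. */
static volatile uint32_t g_ticks;

/* Hypothetical handler body; on the target this would be called from
   the timer's interrupt vector (e.g. SysTick_Handler). */
void tick_isr(void) {
    g_ticks++;
}

/* Ticks elapsed between two samples of g_ticks. Unsigned subtraction
   gives the right answer even if the counter wrapped once in between. */
static inline uint32_t ticks_elapsed(uint32_t start, uint32_t end) {
    return end - start;
}

/* Usage (code_under_test is hypothetical):
 *   uint32_t t0 = g_ticks;
 *   code_under_test();
 *   uint32_t dt = ticks_elapsed(t0, g_ticks);
 */
```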

Andrew Neil
Super User
January 29, 2026

@acapola wrote:

how I measure execution time such that I get 1 cycle difference when I insert a nop in the measured code ?


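You're working on a false assumption there: inserting a NOP does not necessarily cause a 1-cycle difference. The instruction description states that NOP is not necessarily a time-consuming instruction; the processor might remove it from the pipeline before it reaches the execution stage. See page 392 of PM0273: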


https://www.st.com/resource/en/programming_manual/pm0273-stm32-cortexm55-mcus-programming-manual-stmicroelectronics.pdf#page=392

TDK
Best answer
Super User
January 29, 2026

You can't. The Cortex-M55 has a complex pipeline. Instructions do not complete serially and in isolation; nearby instructions affect how fast the others go. This is in contrast to something like the Cortex-M4, where instructions have documented cycle counts.

If the goal is to optimize a piece of code, profile a chunk that takes some nontrivial amount of time. Then change it and re-profile. Compare the delta. That's the proper approach to optimizing code on platforms like this.
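A minimal C sketch of that workflow. It assumes a free-running 32-bit cycle counter that can be read as an expression (on the target, possibly `DWT->CYCCNT` after it has been enabled; here it is just a macro parameter), and it uses the minimum over repeated runs as the summary statistic, which is one common choice (the median is another). `TIME_ONCE`, `cycle_sample`, `min_duration` and `profile_delta` are illustrative names, not an ST or Arm API.

```c
#include <stddef.h>
#include <stdint.h>

/* One timed run: raw counter values sampled before and after the chunk. */
typedef struct {
    uint32_t start;
    uint32_t end;
} cycle_sample;

/* Time one execution of `stmt`. `now` is any expression that reads a
   free-running 32-bit counter. The compiler may move non-volatile work
   across the reads, so timing an out-of-line function call is safest. */
#define TIME_ONCE(sample, now, stmt) \
    do {                             \
        (sample).start = (now);      \
        stmt;                        \
        (sample).end = (now);        \
    } while (0)

/* Shortest run in a set of samples. Unsigned subtraction keeps each
   duration correct across a single counter wrap. Returns UINT32_MAX
   for an empty set. */
static uint32_t min_duration(const cycle_sample *s, size_t n) {
    uint32_t best = UINT32_MAX;
    for (size_t i = 0; i < n; i++) {
        uint32_t d = s[i].end - s[i].start;
        if (d < best) {
            best = d;
        }
    }
    return best;
}

/* Change in cost between a baseline and a modified version of the same
   chunk, both timed with the same harness so the fixed overhead of
   reading the counter largely cancels. Positive: the variant is slower. */
static int64_t profile_delta(const cycle_sample *base, size_t nb,
                             const cycle_sample *variant, size_t nv) {
    return (int64_t)min_duration(variant, nv) - (int64_t)min_duration(base, nb);
}
```

On the target you would fill two arrays in loops, e.g. `TIME_ONCE(base[i], DWT->CYCCNT, chunk_v1())` and `TIME_ONCE(var[i], DWT->CYCCNT, chunk_v2())` (with `chunk_v1`/`chunk_v2` standing in for your hypothetical code variants), then compare them with `profile_delta`. The chunk should be long enough that the delta is well above the few-cycle jitter described in the question.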

Arm-Cortex-M55-Processor-Datasheet.pdf

https://documentation-service.arm.com/static/61952957f45f0b1fbf3a89e4
