Explorer
July 16, 2025
Question

Interrupt curiosity (STM32G491)

  • July 16, 2025
  • 14 replies
  • 1319 views

Hello friends : )

I've recently tested the NUCLEO-G491RE board for an upcoming redesign.

The current design relies on quite tight interrupt latency, so this is where I began.

As I understand it, with no FPU usage in the ISR (ASPEN = LSPEN = 0), one could expect up to 12 SYSCLK cycles of latency.

In this case that is 75 ns at 160 MHz, which sounds pretty decent (today it's about 73 ns).
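
The arithmetic behind that figure, as a minimal sketch: 12 cycles at 160 MHz is 12 × 6.25 ns = 75 ns. The helper names below are hypothetical and only do the unit conversion; they make no claim about what the core actually achieves.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a SYSCLK cycle count to picoseconds, rounded to nearest.
 * Integer math only; valid for cycle counts up to a few million. */
uint64_t cycles_to_ps(uint32_t cycles, uint32_t sysclk_hz)
{
    assert(sysclk_hz != 0u);
    return ((uint64_t)cycles * 1000000000000ull + sysclk_hz / 2u) / sysclk_hz;
}

/* Convert a duration in nanoseconds to the nearest whole SYSCLK cycle count. */
uint32_t ns_to_cycles(uint32_t ns, uint32_t sysclk_hz)
{
    return (uint32_t)(((uint64_t)ns * sysclk_hz + 500000000ull) / 1000000000ull);
}
```

cycles_to_ps(12, 160000000) gives 75000 ps, i.e. 75 ns. The same conversion run backwards turns a scope reading in ns into SYSCLK cycles.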

 

I started off by creating two very similar interrupt service routines that I intended to toggle between.

Each would pulse an output pin and trigger the other one:

Attributes: 'interrupt' + 'optimize("-O2")' + 'section(".RamFunc")' + 'aligned(8)' + 'naked'

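static void onTimeStampEvent(void)
{
 GPIOA->BSRR = 1<<12;           // PA12 high: start of pulse
 NVIC->ISPR[1] = 1<<(39-1*32);  // set USART3 (IRQ 39) pending
 GPIOA->BRR = 1<<12;            // PA12 low: end of pulse
#ifdef NAKED
 __ASM volatile ("BX LR":::);   // naked: no compiler epilogue, so return explicitly
#endif
 return;
}

static void onReceiveEvent(void)
{
 GPIOB->BSRR = 1<<14;           // PB14 high: start of pulse
 NVIC->ISPR[0] = 1<<(11-0*32);  // set DMA1_CH1 (IRQ 11) pending
 GPIOB->BRR = 1<<14;            // PB14 low: end of pulse
#ifdef NAKED
 __ASM volatile ("BX LR":::);   // naked: no compiler epilogue, so return explicitly
#endif
 return;
}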
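
A minimal sketch of the index arithmetic behind the two ISPR writes above: each NVIC set-pending word covers 32 interrupt numbers, so the word index is the IRQ number divided by 32 and the bit position is the remainder. The helper names are made up for illustration; the IRQ numbers 39 (USART3) and 11 (DMA1_CH1) are the ones from the comments in the handlers.

```c
#include <stdint.h>

/* Index of the 32-bit NVIC set-pending word that holds this IRQ number. */
static inline uint32_t nvic_word_index(uint32_t irqn)
{
    return irqn >> 5;                    /* irqn / 32 */
}

/* Bit mask of this IRQ number inside its set-pending word. */
static inline uint32_t nvic_bit_mask(uint32_t irqn)
{
    return UINT32_C(1) << (irqn & 31u);  /* irqn % 32 */
}
```

With these, IRQ 39 maps to word 1 with mask 0x80, and IRQ 11 maps to word 0 with mask 0x800. CMSIS provides NVIC_SetPendingIRQ() for the same purpose if the raw register write is not required.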

In main the usual suspects:

  • HAL and system initialization
  • Peripheral initialization
  • Setting up interrupts

Finally, the main loop:

 u = 0;
 while(1)
 {
  NVIC->ISPR[0] = 1<<(11-0*32); // DMA1_CH1
  u++;
 }

 

This works splendidly, sort of...

<PicoScope shot 1>

The total time for a complete round trip is about 381ns or 61 SYSCLK.

The variable "u" in main never changes from 0, which suggests the interrupts run back to back as expected.

The thing is, shouldn't tail-chaining occur?

 

Next I tried different priority levels, with yellow being the lower priority:

<PicoScope shot 2>

Priority in action, indeed; however, a round trip now takes 562 ns (90 SYSCLK).

Still no tail-chaining and, worse, lots of extra time for the same amount of work.

 

Where have I gone wrong?

Any help appreciated = )

/Hen

 


    Super User
    July 16, 2025

    Try to interpret the timing we see using the disasm.

    JW

    Explorer
    July 16, 2025

    I'm sorry, I'm not too knowledgeable in STM matters.

    Would you please explain a little bit further?

    BTW, thank you = )

    /Hen

    Super User
    July 16, 2025

    If you want to discuss clock-level timing, you have to have a look at the particular instructions executed, not the source code.

    JW

    Explorer
    July 16, 2025

    Yes, that makes perfect sense : )

     onTimeStampEvent:
    20000010: mov.w r3, #1207959552 @ 0x48000000
    20000014: mov.w r2, #4096 @ 0x1000
    20000018: str r2, [r3, #24]
     140 NVIC->ISPR[1] = 1<<(39-1*32); // USART3
    2000001a: ldr r3, [pc, #20] @ (0x20000030 <onTimeStampEvent+32>)
    2000001c: movs r2, #128 @ 0x80
    2000001e: str.w r2, [r3, #260] @ 0x104
     141 GPIOA->BRR = 1<<12;
    20000022: mov.w r3, #1207959552 @ 0x48000000
    20000026: mov.w r2, #4096 @ 0x1000
    2000002a: str r2, [r3, #40] @ 0x28
     143 __ASM volatile ("BX LR":::);
    2000002c: bx lr
     145 return;
     onReceiveEvent:
    20000038: ldr r3, [pc, #28] @ (0x20000058 <onReceiveEvent+32>)
    2000003a: mov.w r2, #16384 @ 0x4000
    2000003e: str r2, [r3, #24]
     161 NVIC->ISPR[0] = 1<<(11-0*32); // DMA1_CH1
    20000040: ldr r3, [pc, #24] @ (0x2000005c <onReceiveEvent+36>)
    20000042: mov.w r2, #2048 @ 0x800
    20000046: str.w r2, [r3, #256] @ 0x100
     162 GPIOB->BRR = 1<<14;
    2000004a: ldr r3, [pc, #12] @ (0x20000058 <onReceiveEvent+32>)
    2000004c: mov.w r2, #16384 @ 0x4000
    20000050: str r2, [r3, #40] @ 0x28
     164 __ASM volatile ("BX LR":::);
    20000052: bx lr
     166 return;

     Is this what you're talking about?

    /Hen

    Explorer
    July 16, 2025

    Will be offline a couple of days, but some extras before I go.

     

    I've set a breakpoint on every handler (with SysTick stopped) and none of them is hit.

    HAL_SuspendTick();

     

    The debug variable is at file scope and volatile.

    volatile unsigned u;

     

    Tested different priorities with no change (except when both are the same).

     

    Verified no FPU registers being stacked.
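
One way such a check could be done in code, sketched under assumptions: in FPCCR, ASPEN is bit 31 and LSPEN is bit 30, and bit 4 of the EXC_RETURN value (LR at handler entry) is set when only the basic frame without FP registers was stacked. The helper names are hypothetical; on the target the inputs would come from FPU->FPCCR and from LR captured at the start of the handler.

```c
#include <stdbool.h>
#include <stdint.h>

#define FPCCR_ASPEN (UINT32_C(1) << 31)  /* automatic FP state preservation */
#define FPCCR_LSPEN (UINT32_C(1) << 30)  /* lazy FP state preservation */

/* True when both ASPEN and LSPEN are cleared in the given FPCCR value. */
bool fp_stacking_disabled(uint32_t fpccr)
{
    return (fpccr & (FPCCR_ASPEN | FPCCR_LSPEN)) == 0u;
}

/* True when EXC_RETURN reports the basic frame (no FP registers stacked). */
bool exc_return_basic_frame(uint32_t exc_return)
{
    return (exc_return & (UINT32_C(1) << 4)) != 0u;
}
```

For example, an EXC_RETURN of 0xFFFFFFF9 reports a basic frame, while 0xFFFFFFE9 reports an extended frame that includes FP state.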

     

    Any insight is much appreciated.

    /Hen

    Super User
    July 16, 2025
    NVIC->ISPR[0] = 1<<(39-1*32); // DMA1_CH1

    What do you think this statement will do? (Hint: C is not Python!)

    This MCU has core-coupled RAM (CCM SRAM), you can put these functions there for better latency.
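
A sketch of how a function might be placed in CCM SRAM with GCC. It assumes the linker script provides an output section named .ccmram that is mapped to CCM SRAM, and that the startup code copies it from flash; both are assumptions to verify against the actual project (the original post uses a section called .RamFunc). CCM_CODE and pulse_pin() are hypothetical names.

```c
#include <stdint.h>

/* Assumption: ".ccmram" exists in the linker script and is loaded at startup.
 * On non-ARM hosts the attribute is dropped so the code still builds. */
#if defined(__arm__)
#define CCM_CODE __attribute__((section(".ccmram"), aligned(8)))
#else
#define CCM_CODE
#endif

/* Pulse a pin by writing the same mask to a set register and a reset register,
 * e.g. &GPIOA->BSRR and &GPIOA->BRR on the target. */
CCM_CODE void pulse_pin(volatile uint32_t *set_reg,
                        volatile uint32_t *reset_reg,
                        uint32_t mask)
{
    *set_reg = mask;    /* pin high */
    *reset_reg = mask;  /* pin low */
}
```

Whether the function really ended up in CCM SRAM can be confirmed from its address in the map file or in the disassembly.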

     

    Explorer
    July 20, 2025

    Hmm, where did you copy this line from?

    Either "NVIC->ISPR[1] = 1<<(39-1*32); // USART3"

    or: "NVIC->ISPR[0] = 1<<(11-0*32); // DMA1_CH1"

    My intention was to use bit-banding, but I have not got that far yet.

     

    On the CCM: indeed, cycles were shaved off, more than I expected...

    On equal priority: from 61 to 52 SYSCLK (PicoScope1.png)

    With escalating priority: from 90 to 69 SYSCLK (PicoScope2.png)

    But still, more SYSCLK with differentiated priorities.

     

    What happened with "Interrupt tail-chaining"?

     

    BTW, I tried booting without debugger but no changes...

    Super User
    July 20, 2025

    Where is the vector table located?

    Where is the stack located?

    JW

    Explorer
    July 24, 2025

    Good points : )

    SP->SRAM2 // 0x2001BFF0 (BG) and 0x2001BFD0 (IC)

    VTOR->SRAM1 // 0x20000800

    IC @ CCMRAM // 0x10000000..0x10000044

    Maybe swapping SP and VTOR linkage would be more efficient?

    I probably need a stack frame (non-naked) in the end.

     

    Super User
    July 24, 2025

    Try moving stack to CCMRAM.

    Fetching the ISR address from the vector table should occur in parallel with the register stacking, so the two should be in different memories; at the same time, the vector fetch should have less of an impact if it takes longer.

    These are very complex SoCs, where cycle counting is cumbersome due to the many elements involved, and the theoretical numbers from the processor's specs alone are in practice usually impossible to reach, again due to the huge influence of the whole SoC. Generally, consider the processor's specs to be just sweet marketing speech.

    JW

    Explorer
    July 24, 2025

    Yes, fetching the vector probably occurs once for every interrupt taken.

    And because several registers are stacked on entry, the vector fetch may matter less for latency.

    However, wouldn't the stack starve the instruction pipeline if it resides in CCMRAM as well?

     

    Regarding cycle counting, yes there's a lot going on in parallel here and exact numbers may not exist.

    That's not the core of my hacks; I just want a feel for what's feasible and a way to compare setups side by side.

    I think I can get away with about 60ns latency spread on The One important event.

    Today it's 30ns. Well, assuming no degraded performance, which also could be considered.

     

    Then the unexpected scenario happened, interrupt escalation with performance penalty.

    That's the real bugger, is it not?

    Super User
    July 24, 2025

    Latency and its jitter are at least an order of magnitude worse in these SoCs than in the 8-bit micro*controllers*, and so are their controllability and state of documentation. So, I don't consider interrupts to be a viable option for timing-sensitive tasks anymore, and always resort to hardware.

    JW

    Explorer
    July 24, 2025

    Yes, so it is.

    I do utilize DMAs, as the snippet may reveal.

    The competing core is also a *modern* MCU albeit a bit more recent than the G4.

    It is quite up to the task, but for 'platformics' we need to move here.

    In the near future we aim at the H7+

    Super User
    July 24, 2025

    > In the near future we aim at the H7+

    The interrupt latency and overall timing controllability of Cortex-M7 are of course worse than in Cortex-M4.

    That's the price we pay for raw speed.

    JW 

    Explorer
    July 27, 2025

    I just realized this may not work for us in this way?

    CCMRAM seems to be the best performance option.

    I tried several linkages for the stack (at the highest address) vs. the RAM code, with not so obvious results.

    The stack (at high end CCM area) and the ram-code (at the low end CCM area).

    1. Both in 0x10000000[0x4000] --- 52ck vs 67ck.

    2. Both in 0x20018000[0x4000] --- 72ck vs 86ck.

    3. stack in 0x1xxx code in 0x2xxx --- 72ck vs 92ck.

    4. stack in 0x2xxx code in 0x1xxx --- 26ck vs 69ck. ???
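
To make the four results easier to compare, here is a small sketch that computes the gap between the two figures in each row. It assumes, as in the earlier posts, that the first figure is the equal-priority case and the second the differentiated-priority case; that reading and all names below are assumptions.

```c
#include <stdint.h>

/* One measured memory layout: cycle counts copied from the list above. */
struct layout_result {
    const char *layout;
    uint32_t first_ck;   /* assumed: equal priorities */
    uint32_t second_ck;  /* assumed: differentiated priorities */
};

static const struct layout_result results[] = {
    { "stack and code at 0x10000000",   52u, 67u },
    { "stack and code at 0x20018000",   72u, 86u },
    { "stack at 0x1xxx, code at 0x2xxx", 72u, 92u },
    { "stack at 0x2xxx, code at 0x1xxx", 26u, 69u },
};

/* Extra cycles of the second figure over the first for one layout. */
uint32_t extra_cycles(const struct layout_result *r)
{
    return r->second_ck - r->first_ck;
}
```

The gaps come out as 15, 14, 20 and 43 cycles: the second figure is higher in every layout, and the fourth row stands out mainly because its first figure is so low.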

     

    There is no spread present because this is the only thing the core does.

    What happens when all bells and whistles are in place?

    Will the spread (for the most prominent interrupt) be more than 60 ns (about 10 SYSCLK)?

    I understand the tests do not say squat about this, but they do worry me...