Graduate
February 12, 2025
Solved

NMI Fault without any obvious fault bits set

Hi everyone,

I'm currently experiencing a strange crash on an STM32G473 and I'm a bit stumped on how to debug it.

First the crash:

The system is an STM32G473 running FreeRTOS V10.5.1. I have a simple FDCAN ISR that takes the incoming CAN frames and pushes them to a FreeRTOS queue for a task to consume. Nothing fancy: it just transforms the data into a nicer structure, clears the FIFO, and so on. During high bus loads, however, the FDCAN ISR will sometimes fire during a FreeRTOS context switch. When it does, the system crashes. That is a problem in and of itself, but the main issue is that the NMI is triggered rather than the HardFault.

SRAM Parity is enabled, but not Clock Security.

Attached is the NMI handler. It reads HFSR, CFSR and AFSR, plus the SRAM parity error bit, the flash ECC bits and even the CSS bits. Stepping through the function, I can't find anything that could have triggered the NMI: none of the relevant bits are set to 1. I can clearly see that the memcpy() in the FreeRTOS queue push is triggering the NMI, and the call stack in the debugger is very clear that an FDCAN ISR fired during the context switch.
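For reference, the bit-by-bit CFSR checks described above can also be written as a lookup table, which makes a register snapshot easier to read at a glance. This is only a sketch: the bit positions follow the Armv7-M CFSR layout, while `fault_bit_t`, `cfsr_bits` and `cfsr_decode` are names made up for this example. As far as I know, an NMI on its own does not set any CFSR bits, so an all-zero result here would not be surprising.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: one CFSR flag and its architectural name. */
typedef struct {
    uint32_t mask;
    const char *name;
} fault_bit_t;

static const fault_bit_t cfsr_bits[] = {
    /* MemManage status (bits 0-7) */
    { 1UL << 0,  "IACCVIOL" },  { 1UL << 1,  "DACCVIOL" },
    { 1UL << 3,  "MUNSTKERR" }, { 1UL << 4,  "MSTKERR" },
    { 1UL << 5,  "MLSPERR" },   { 1UL << 7,  "MMARVALID" },
    /* BusFault status (bits 8-15) */
    { 1UL << 8,  "IBUSERR" },   { 1UL << 9,  "PRECISERR" },
    { 1UL << 10, "IMPRECISERR" }, { 1UL << 11, "UNSTKERR" },
    { 1UL << 12, "STKERR" },    { 1UL << 13, "LSPERR" },
    { 1UL << 15, "BFARVALID" },
    /* UsageFault status (bits 16-31) */
    { 1UL << 16, "UNDEFINSTR" }, { 1UL << 17, "INVSTATE" },
    { 1UL << 18, "INVPC" },     { 1UL << 19, "NOCP" },
    { 1UL << 24, "UNALIGNED" }, { 1UL << 25, "DIVBYZERO" },
};

/* Writes the names of all set flags into out (space separated) and
 * returns how many flags were set. Output is truncated if out is too small. */
static size_t cfsr_decode(uint32_t cfsr, char *out, size_t out_len)
{
    size_t count = 0;
    size_t used = 0;

    if (out_len > 0U) {
        out[0] = '\0';
    }
    for (size_t i = 0; i < sizeof cfsr_bits / sizeof cfsr_bits[0]; ++i) {
        if ((cfsr & cfsr_bits[i].mask) == 0U) {
            continue;
        }
        ++count;
        if (used < out_len) {
            int n = snprintf(out + used, out_len - used, "%s%s",
                             (used > 0U) ? " " : "", cfsr_bits[i].name);
            if (n > 0) {
                used += (size_t)n;
            }
        }
    }
    return count;
}
```

On the target you would pass in the value read from SCB->CFSR and inspect the string in the debugger.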

Other things tried: I've checked the vector table to see if the NMI handler had ended up at another position, which it hasn't. The HardFault handler works as intended (it's a simple while(1) { __asm("nop"); } right now). I've checked the errata and couldn't find anything related to the NMI.

So my questions are: how do I properly debug the NMI, and why does it trigger instead of a HardFault? Are there any more registers I need to check to determine why we are in the NMI handler?

 

/**
 * @brief Assembler part of the NMI handler
 *
 * Determine which stack pointer (MSP or PSP) was in use when the system crashed.
 * Put the stack pointer into r0 and call a C function to handle the exception.
 * R0 will be the first argument to the C function and we can unwind the stack
 */
__attribute__((naked)) void NMI_Handler(void)
{
    __asm(
        "TST LR, #4\n"        /* Check bit 2 of the EXC_RETURN value in LR */
        "ITE EQ\n"            /* If equal (zero), use MSP; else, use PSP */
        "MRSEQ R0, MSP\n"     /* Move MSP to r0 if LR[2] == 0 */
        "MRSNE R0, PSP\n"     /* Move PSP to r0 if LR[2] != 0 */
        "B nmi_handler_c\n"); /* Branch to the C handler passing r0 (stack pointer) as argument. */
}

/**
 * @brief C-part of the NMI handler
 *
 * @param stacked_registers Pointer to the exception stack frame (R0-R3, R12, LR, PC, xPSR)
 */
void nmi_handler_c(unsigned int* stacked_registers)
{
    volatile unsigned int hfsr = SCB->HFSR;   /* Hard Fault Status Register */
    volatile unsigned int cfsr = SCB->CFSR;   /* Configurable Fault Status Register */
    volatile unsigned int mmfar = SCB->MMFAR; /* Memory Management Fault Address Register */
    volatile unsigned int bfar = SCB->BFAR;   /* Bus Fault Address Register */
    volatile unsigned int afsr = SCB->AFSR;   /* Auxiliary Fault Status Register */
    volatile unsigned int sram_parity = SYSCFG->CFGR2 & SYSCFG_CFGR2_SPF;
    volatile unsigned int flash_error = FLASH->ECCR & (FLASH_ECCR_ECCD2 | FLASH_ECCR_ECCD);
    volatile unsigned int css_error = RCC->CIFR & (RCC_CIFR_CSSF | RCC_CIFR_LSECSSF);

    // --- SRAM parity, Flash ECC and Clock Security errors ---
    if (sram_parity)
    {
        // SRAM parity failed
        __asm("nop");
    }

    if (flash_error)
    {
        // Flash ECC error
        __asm("nop");
    }

    if (css_error)
    {
        // Clock Security error
        __asm("nop");
    }

    // --- Memory Management Fault Analysis (CFSR bits 0-7) ---
    if (cfsr & (1 << 0))
    {
        // IACCVIOL: An instruction access violation occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 1))
    {
        // DACCVIOL: A data access violation occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 3))
    {
        // MUNSTKERR: Unstacking error during exception return.
        __asm("nop");
    }
    if (cfsr & (1 << 4))
    {
        // MSTKERR: Stacking error during exception entry.
        __asm("nop");
    }
    if (cfsr & (1 << 5))
    {
        // MLSPERR: Lazy state preservation error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 7))
    {
        // MMARVALID is set: The MMFAR register holds a valid memory fault address.
        // Check mmfar to see the address that triggered the memory management fault.
        mmfar = mmfar;
        __asm("nop");
    }

    // --- Bus Fault Analysis (CFSR bits 8-15) ---
    if (cfsr & (1 << 8))
    {
        // IBUSERR: An instruction bus error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 9))
    {
        // PRECISERR: A precise data bus error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 10))
    {
        // IMPRECISERR: An imprecise data bus error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 11))
    {
        // UNSTKERR: Unstacking error during exception return (bus fault).
        __asm("nop");
    }
    if (cfsr & (1 << 12))
    {
        // STKERR: Stacking error during exception entry (bus fault).
        __asm("nop");
    }
    if (cfsr & (1 << 13))
    {
        // LSPERR: Lazy state preservation error on bus fault.
        __asm("nop");
    }
    if (cfsr & (1 << 15))
    {
        // BFARVALID is set: The BFAR register holds a valid bus fault address.
        // Check bfar to see the address related to the bus fault.
        bfar = bfar;
        __asm("nop");
    }

    // --- Usage Fault Analysis (CFSR bits 16-31) ---
    if (cfsr & (1 << 16))
    {
        // UNDEFINSTR: An undefined instruction was executed.
        __asm("nop");
    }
    if (cfsr & (1 << 17))
    {
        // INVSTATE: Invalid state occurred (possibly an invalid EPSR value).
        __asm("nop");
    }
    if (cfsr & (1 << 18))
    {
        // INVPC: Invalid PC load; may indicate a bad EXC_RETURN value.
        __asm("nop");
    }
    if (cfsr & (1 << 19))
    {
        // NOCP: Attempted to use a coprocessor that is not present.
        __asm("nop");
    }
    if (cfsr & (1 << 24))
    {
        // UNALIGNED: Unaligned access error occurred.
        __asm("nop");
    }
    if (cfsr & (1 << 25))
    {
        // DIVBYZERO: Division by zero error occurred.
        __asm("nop");
    }

    // --- Hard Fault Status Analysis (HFSR) ---
    if (hfsr & (1 << 1))
    {
        // VECTTBL: Bus fault on vector table read during exception processing.
        __asm("nop");
    }
    if (hfsr & (1 << 30))
    {
        // FORCED: A configurable fault (memory management, bus, or usage fault) escalated to a hard fault.
        __asm("nop");
    }
    __asm("bkpt 1");
}
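
The handler above receives the stacked registers but never looks at them. A minimal sketch of how that frame could be unpacked is below. It assumes the standard Cortex-M exception frame layout (R0-R3, R12, LR, PC, xPSR, lowest address first); `exception_frame_t`, `read_exception_frame`, `frame_active_exception` and `exc_return_uses_psp` are hypothetical names for this example, not part of CMSIS or the original code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic Cortex-M exception frame as pushed by hardware, lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;   /* LR of the interrupted code */
    uint32_t pc;   /* return address into the interrupted code */
    uint32_t xpsr; /* program status of the interrupted code */
} exception_frame_t;

/* EXC_RETURN bit 2: 0 = frame is on MSP, 1 = frame is on PSP.
 * This is the same test that "TST LR, #4" performs in the assembler stub. */
static bool exc_return_uses_psp(uint32_t exc_return)
{
    return (exc_return & (1UL << 2)) != 0U;
}

/* Copy the eight hardware-stacked words into a named structure. */
static exception_frame_t read_exception_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    exception_frame_t f = { sp[0], sp[1], sp[2], sp[3],
                            sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Low 9 bits of the stacked xPSR: exception number that was active when
 * the NMI hit. 0 means thread mode, 16 + n means external interrupt n. */
static uint32_t frame_active_exception(const exception_frame_t *f)
{
    return f->xpsr & 0x1FFU;
}
```

Inside nmi_handler_c this could be used as `exception_frame_t f = read_exception_frame((const uint32_t *)stacked_registers);`. The stacked PC then shows where the interrupted code was executing, and a non-zero `frame_active_exception()` would confirm that the NMI preempted another handler such as the FDCAN ISR.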

 

 

    3 replies

    Visitor II
    February 12, 2025

    The crash during memcpy() may occur because of a possible stack overflow or misaligned access.

    Check the task stack size and ISR stack size:

    uxTaskGetStackHighWaterMark(myTaskHandle);

    Increase configMINIMAL_STACK_SIZE and check the FreeRTOS heap settings.
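
As a rough illustration of the suggestion above, FreeRTOS can also be asked to check task stacks on its own. The fragment below is a sketch of the relevant FreeRTOSConfig.h settings; the values are illustrative and not taken from the original project.

```c
/* FreeRTOSConfig.h fragment (illustrative values) */

/* Method 2: check a fill pattern at the end of the task stack
 * each time the task is switched out. */
#define configCHECK_FOR_STACK_OVERFLOW       2

/* Makes uxTaskGetStackHighWaterMark() available. */
#define INCLUDE_uxTaskGetStackHighWaterMark  1
```

With the overflow check enabled the application has to provide vApplicationStackOverflowHook(). Note that this only covers task stacks; ISRs on Cortex-M run on the main stack, which FreeRTOS does not check.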

    turbofishAuthor
    Graduate
    February 12, 2025

    Stack overflows and misaligned accesses should result in a HardFault, not an NMI, if I understand it correctly. I've had plenty of stack overflows and misalignments in this project and we always end up in a HardFault.

    Or do those errors in an ISR automatically get escalated to an NMI instead of a HardFault, and if so, how do you debug them?

    Super User
    February 12, 2025

    OK, but I would just try making the stack and FreeRTOS buffer areas bigger, just as a test.

    If nothing changes, you know it is something else you need to look for.

    turbofishAuthor
    Graduate
    February 12, 2025

    I've tried increasing the stacks, but not the FreeRTOS buffer areas. It's quite a big, complex system with lots of peripherals active and lots of ISRs firing. It's interesting that it's only the FDCAN that messes up the core, and only during context switching; shouldn't smashing the stack result in a HardFault?

    The test is simply spamming the unit with short CAN frames (DLC=1) with a CAN ID not handled by the application, just to stress-test the ISR. When sending one frame every ms it crashes quite fast. I've tried disabling most if not all of the other tasks just to see if it made any difference, but it didn't.

    As I understand it, the only ways to reach the NMI are flash ECC errors, SRAM parity errors, clock security errors and (this I'm not too certain about) faults-in-faults, as when you mess up your HardFault handler and generate another HardFault. There has to be some bit set in some register somewhere telling me WHY I'm fiddling around in the NMI, but I can't find it.

    turbofishAuthorAnswer
    Graduate
    February 18, 2025

    Hi again,

    We managed to solve the issue, but it still leaves some questions.

    In the reset handler, where we zero out the BSS and copy variables from flash to RAM, we added a memory check (this was planned anyway) to see if there was any hardware fault. We write 0x55 and 0xAA (to test all bits) over the entire SRAM and then read it back to verify that nothing funky is happening, and voila, no more NMIs. Parity was already enabled in the option bytes. The check only adds a few ms to startup time.
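
A minimal sketch of such a write/read-back pass is shown below. It is not the code from the project: `ram_pattern_pass` and `ram_init_and_test` are hypothetical names, and the start address and word count are assumed to come from the linker script. On the target, the region must not include the stack that is in use while the test runs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fill the region with one pattern, then read every word back. */
static bool ram_pattern_pass(volatile uint32_t *start, size_t words,
                             uint32_t pattern)
{
    for (size_t i = 0; i < words; ++i) {
        start[i] = pattern;
    }
    for (size_t i = 0; i < words; ++i) {
        if (start[i] != pattern) {
            return false;
        }
    }
    return true;
}

/* Run the 0x55 / 0xAA passes so every bit is tested both as 0 and as 1,
 * then leave the region zeroed. Returns true if both passes succeeded. */
bool ram_init_and_test(volatile uint32_t *start, size_t words)
{
    bool ok = ram_pattern_pass(start, words, 0x55555555UL) &&
              ram_pattern_pass(start, words, 0xAAAAAAAAUL);

    for (size_t i = 0; i < words; ++i) {
        start[i] = 0U;
    }
    return ok;
}
```

A side effect of a test like this is that every word in the region is written at least once before anything reads it, which is presumably what matters for the parity check.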

    From the datasheet:

    turbofish_0-1739883605438.png

    It's only advised to do this, but it seems to be required in order to avoid the NMI. Also, when the crash occurs, the SRAM Parity Error Flag (SPF) in SYSCFG_CFGR2 is most definitely NOT set.

    turbofish_1-1739883843073.png

    If we turn off the SRAM parity setting in the option bytes, the system works as intended even without the RAM test.

    So the questions are: is it required, or only advised, to initialize the entire SRAM during startup to avoid NMIs with parity turned on? And is there a way to check for a parity error other than the Parity Error Flag, or should this maybe be in the errata?

     

    Thanks for all the feedback folks!