Graduate
November 18, 2024
Question

HardFault, possibly caused by stack overflow, but breakpoint in HardFault_Handler() is not hit

  • November 18, 2024
  • 11 replies
  • 3795 views

I am using the STM32H562VGT6 and IAR Workbench.

I have set up a 4kHz interrupt with LPTIM3 and started transmitting 32 bytes of data using HAL_SPI_Transmit_DMA.

If I break and resume execution after starting the timer interrupt, the SP becomes 0xffff ffd8, causing a HardFault error.  The call stack only contains (Exception frame), making it untraceable. The stack area is from 0x2000'0f90 to 0x2000'4f8f.

Although HardFault_Handler() is defined in stm32h5xx_it.c, a breakpoint set there is not hit when the HardFault occurs.
Breakpoints in MemManage_Handler and BusFault_Handler are not hit either.

Commenting out the HAL_SPI_Transmit_DMA call in the 4kHz interrupt prevents the HardFault, but a macro added inside the HAL API to check whether the SP is out of range never triggers.

#define CHECK_MSP() {\
    uint32_t msp; \
    __asm volatile ("MRS %0, msp" : "=r" (msp)); /* read the main stack pointer */ \
    if (msp < 0x20000f80 || 0x20004f7f < msp) { \
        __BKPT(0); /* halt in the debugger when MSP is out of range */ \
    } \
}


How can I capture the cause of this issue?

    This topic has been closed for replies.


    Graduate II
    November 18, 2024

    Please put code in a code block.

    A stack overflow doesn't cause a HardFault unless your linker file is set up so that the overflow accesses an illegal memory region, or unless you have set up the MPU to define an illegal region past the stack.

    Check in STM32CubeMX in NVIC if MemManage_Handler and BusFault_Handler interrupts are enabled.

    ThoufielAuthor
    Graduate
    November 18, 2024

    Here is the 4kHz interrupt handler, followed by the registers at the time of the HardFault:

    {
    	static const int clkTbl[8 + 1][8] = {
    		{ 0, 0, 0, 0, 0, 0, 0, 0 },
    		{ 0, 0, 0, 0, 0, 0, 0, 1 },
    		{ 0, 0, 0, 1, 0, 0, 0, 1 },
    		{ 0, 0, 1, 0, 0, 1, 0, 1 },
    		{ 0, 1, 0, 1, 0, 1, 0, 1 },
    		{ 0, 1, 0, 1, 1, 0, 1, 1 },
    		{ 0, 1, 1, 1, 0, 1, 1, 1 },
    		{ 0, 1, 1, 1, 1, 1, 1, 1 },
    		{ 1, 1, 1, 1, 1, 1, 1, 1 },
    	};
    
    	int step;
    	int clk[4];
    	int dir[4];
    	BYTE val;
    
    	int side = c_count & 1;
    	if(HAL_SPI_Transmit_DMA(&hspi4, (uint8_t*)(&c_dirclk[(c_count +1 )& 1][0]), 32) != HAL_OK){
    		Error_Handler();
    	}
    	for (int i = 0; i < 4; i++) {
    		for (int j = 0; j < 4; j++) {
    			id_type id = i * 4 + j;
    			if (id < MOTOR_COUNT) {
    				step = (int)m_motors[id].proceed();
    				clk[j]= abs(step);
    				dir[j] = (m_motors[id].direction() ? 1 : 0);
    			} else {
    				clk[j] = 0;
    				dir[j] = 1;
    			}
    		}
    		for (int seq = 0; seq < 8; seq++) {
    			val = 0;
    			for (int k = 0; k < 4; k++) {
    				val |= ((dir[k] << (k + 4))) | (clkTbl[clk[k]][seq] << k);
    			}
    			c_dirclk[side][(seq * 4 + i)] = val;
    		}
    	}
    	c_count++;
    }

     

     

    regs.png

    Graduate II
    November 18, 2024
    ThoufielAuthor
    Graduate
    November 18, 2024

    Thank you for replying.

    CFSR : 0x1001

    SHCSR : 0x4

     

    And here is the debug log message in IAR Workbench:

    ------------------------------------------------------------------------------------------------

    Mon Nov 18, 2024 19:46:50: MemManage fault escalated into HardFault
    Mon Nov 18, 2024 19:46:50: The MemManage handler is disabled
    Mon Nov 18, 2024 19:46:50: HardFault exception.
    Mon Nov 18, 2024 19:46:50: The processor has escalated a configurable-priority exception to HardFault.
    Mon Nov 18, 2024 19:46:50:
    Mon Nov 18, 2024 19:46:50: An MPU or Execute Never (XN) default memory map access violation has occurred on an instruction fetch (CFSR.IACCVIOL, MMFAR).
    Mon Nov 18, 2024 19:46:50:
    Mon Nov 18, 2024 19:46:50: A derived bus fault has occurred on exception entry (CFSR.STKERR, BFAR).
    Mon Nov 18, 2024 19:46:50:
    Mon Nov 18, 2024 19:46:50: Could not determine the location where the exception occurred.
    Mon Nov 18, 2024 19:46:50:
    Mon Nov 18, 2024 19:46:50: See the call stack for more information.
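The reported CFSR value can be cross-checked against the debugger log by hand. The following is a minimal sketch, assuming the standard Cortex-M CFSR layout in which IACCVIOL is bit 0 (MMFSR bit 0) and STKERR is bit 12 (BFSR bit 4). The helper names are hypothetical and not part of CMSIS or the HAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CFSR bit masks for the two faults named in the IAR log.
 * Assumption: standard Cortex-M CFSR layout. */
#define CFSR_IACCVIOL (1u << 0)   /* MMFSR bit 0: instruction access violation */
#define CFSR_STKERR   (1u << 12)  /* BFSR bit 4: bus fault while stacking on exception entry */

/* Hypothetical helpers: true when the given fault bit is set. */
static bool cfsr_has_iaccviol(uint32_t cfsr) { return (cfsr & CFSR_IACCVIOL) != 0u; }
static bool cfsr_has_stkerr(uint32_t cfsr)   { return (cfsr & CFSR_STKERR) != 0u; }

/* Returns the bits of cfsr that the two masks above do not account for. */
static uint32_t cfsr_unexplained(uint32_t cfsr)
{
    return cfsr & ~(CFSR_IACCVIOL | CFSR_STKERR);
}
```

With the posted value 0x1001, both helpers return true and no bit is left unexplained, which agrees with the two causes listed in the log.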

    Graduate II
    November 18, 2024

    @Thoufiel wrote:

    SHCSR : 0x4

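    None of the configurable fault handlers is enabled: on the Cortex-M33, bit 2 is HARDFAULTACT, and the enable bits (16 to 18) are all clear.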


    @Thoufiel wrote:

    Mon Nov 18, 2024 19:46:50: The MemManage handler is disabled


    As I suspected.

    Have you enabled the MPU?

    Can you step through the interrupt until the error occurs?
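For reference, the SHCSR value can be decoded as follows. This is a sketch that assumes the Armv8-M layout of SHCSR used by the Cortex-M33 (HARDFAULTACT at bit 2, MEMFAULTENA, BUSFAULTENA and USGFAULTENA at bits 16 to 18). The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: Armv8-M SHCSR bit positions. */
#define SHCSR_HARDFAULTACT (1u << 2)   /* HardFault is currently active */
#define SHCSR_MEMFAULTENA  (1u << 16)  /* MemManage handler enabled */
#define SHCSR_BUSFAULTENA  (1u << 17)  /* BusFault handler enabled */
#define SHCSR_USGFAULTENA  (1u << 18)  /* UsageFault handler enabled */

/* Counts how many of the three configurable fault handlers are enabled. */
static unsigned shcsr_enabled_count(uint32_t shcsr)
{
    unsigned n = 0u;
    if (shcsr & SHCSR_MEMFAULTENA) { n++; }
    if (shcsr & SHCSR_BUSFAULTENA) { n++; }
    if (shcsr & SHCSR_USGFAULTENA) { n++; }
    return n;
}

/* True when the HardFault active bit is set. */
static bool shcsr_hardfault_active(uint32_t shcsr)
{
    return (shcsr & SHCSR_HARDFAULTACT) != 0u;
}
```

Under that assumption, the posted value 0x4 reads as "HardFault active, no configurable fault handler enabled", which matches the IAR message that the MemManage handler is disabled.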

    ThoufielAuthor
    Graduate
    November 19, 2024

    MPU is not used.

     

    >Can step through the interrupt until the error occurs?

    I have tried several times, but the error does not reproduce while single-stepping, so I am not able to stop just before it.

    Explorer
    November 19, 2024

    >Can step through the interrupt until the error occurs?

    With core faults involved, this usually implies stepping through on instruction/assembler level.
    Just saying.

    Graduate II
    November 19, 2024

    Any DMA into an auto/local array?

    ThoufielAuthor
    Graduate
    November 19, 2024

    >Any DMA into an auto/local array?

    No. c_count and c_dirclk are static members of the class to which this interrupt handler function belongs.

    Graduate II
    November 19, 2024

    So CPP? Perhaps issues with stack or heap or constructors.

    Not sure how SP goes into 0xFFFFxxxx space. Perhaps PSP/MSP initialization.

    Or double faulting. 

    Perhaps MemManage Handler?

    Can you fish in the primary stack for a context frame for a real PC/LR?
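"Fishing" for a context frame means reading the eight words the core pushes on exception entry. A minimal sketch, assuming the basic frame without floating-point context (R0, R1, R2, R3, R12, LR, PC, xPSR in ascending address order); the type and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic Cortex-M exception stack frame (no FP context), lowest address first. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;    /* LR of the interrupted code */
    uint32_t pc;    /* address at which the interrupted code would resume */
    uint32_t xpsr;
} exc_frame_t;

/* Copies the eight stacked words starting at sp into a frame structure.
 * sp must point at the start of a frame, e.g. a copy of stack memory. */
static exc_frame_t read_exc_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    exc_frame_t f = { sp[0], sp[1], sp[2], sp[3], sp[4], sp[5], sp[6], sp[7] };
    return f;
}
```

On the target, sp would be the stack pointer value on handler entry. If that value itself is invalid, as with the 0xffff ffd8 reported here, the frame cannot be read from it and the valid stack region has to be searched instead.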

    Graduate II
    November 19, 2024

    Have you checked alignment requirements for the DMA? Sometimes source/destination arrays have to be aligned at more than 4 bytes.

     

    I don't know what happens in

    m_motors[id].proceed()

    Are you sure it doesn't crash there?
    Have you tried toggling spare GPIO pins at certain points in the code, so that a logic analyzer shows when and where the code crashes?

    (Also please use stdint.h int32_t instead of "int" if you intend a specific size integer (such as for clkTbl), it makes code more readable and portable.)

    ThoufielAuthor
    Graduate
    November 19, 2024

    -Step Execution

    I stepped through HAL_SPI_Transmit_DMA at the assembler level using the disassembler, and it exits the function successfully.
    After resuming from a pause, I stepped through HAL_SPI_Transmit_DMA this way several times, and each time it exited the function without error.
    (If I remove the breakpoint and resume as is, the error occurs.)
    If I stop in the interrupt dozens of times, I may be able to find the code that causes the problem, but I have not tried that yet.

     

    -primary stack for a context frame for a real PC/LR

    Does "primary stack for a context frame for a real PC/LR" mean the PC/LR values immediately before the error?
    Unfortunately, the direct cause of the error has not been identified, so I have not been able to verify those values.

    The register values after the error are in a previous post in this thread.

     

    -m_motors[id].proceed()

    I added an assert that fires when the value of step is out of range for clkTbl, but it never triggered, so I do not think this part is the problem.

    I have not verified using GPIO, so I will consider it.

     

    Graduate II
    November 20, 2024

    @Thoufiel wrote:

    -Step Execution

    I stepped through HAL_SPI_Transmit_DMA at the assembler level using the disassembler, and it exits the function successfully.
    After resuming from a pause, I stepped through HAL_SPI_Transmit_DMA this way several times, and each time it exited the function without error.
    (If I remove the breakpoint and resume as is, the error occurs.)

    Tip for this type of problem: increase a counter (uint32_t counter) in the interrupt and when the error occurs read this count value. You can repeat this to see if it consistently fails at the same count. You can then set a counting breakpoint (a breakpoint that will trigger after it has hit that breakpoint x times). If a counting breakpoint doesn't work you can make your own using an if statement with a breakable line of code in that block.
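The hand-made counting breakpoint described in this tip could look like the sketch below. The names isr_count and isr_count_reached are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of times the interrupt has run; read this in the debugger after the
 * fault to see on which invocation it happened. */
static volatile uint32_t isr_count;

/* Call once per interrupt. Returns true only on the invocation whose
 * 1-based number equals target. */
static bool isr_count_reached(uint32_t target)
{
    isr_count++;
    return isr_count == target;
}
```

In the handler, the call would guard a line that can take a breakpoint, for example an if block containing a single no-op statement. The target is the count observed at the previous failure, or one less to stop just before it.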

    ThoufielAuthor
    Graduate
    November 20, 2024

    As it turns out, no progress has been made...

    --------------------------------------------------------------------------------------------
    I enabled the fault handlers at the beginning of the program:

    int main(void)
    {
     SCB->SHCSR|= SCB_SHCSR_USGFAULTENA_Msk
     | SCB_SHCSR_BUSFAULTENA_Msk
     | SCB_SHCSR_MEMFAULTENA_Msk;
    

     

     

    and set breakpoints in the handlers, but none of them was hit.

    Attached are the CFSR results when the error occurs.

    --------------------------------------------------------------------------------------------

     

    ThoufielAuthor
    Graduate
    November 20, 2024

    Progress on the problem:

    This time the program was set to start from 0x0803 0000 (= Area B).
    After placing the same program in 0x0800 0000 (=Area A), I started the program from Area B and repeated the pause and resume in the debugger.


    After resuming, the message “CPU status reset” sometimes appears.
    When the program is paused again, the PC is below 0x0803 0000 and breakpoints in the IDE are not hit.

     

    At the time of the first report, Area A was filled with 0xff because the program had been started after erasing the entire Flash.
    I think a reset caused execution to start from Area A, which resulted in the exception because the registers were loaded with abnormal values.
    I believe this also explains why the breakpoint in the HardFault_Handler placed in Area B was not hit.

     

    The breakpoint in ResetHandler is hit when the program starts in Area B, but not when the “CPU status reset” occurs.
    (I am assuming it probably jumps to the ResetHandler in Area A.)

     

    I would like to know if you have any ideas on a better way to find out why the CPU resets, or why it does not jump to the ResetHandler in Area B when it resets.

    Graduate II
    November 20, 2024

    @Thoufiel wrote:

    I would like to know if you have any ideas on a better way to find out why the CPU resets, or why it does not jump to the ResetHandler in Area B when it resets.


    You can read the reset cause from various registers in the CPU. You can use __HAL_RCC_GET_FLAG to check various flags (you have to check them in a certain order since multiple flags can be set.)
    The reset jump address is inside the interrupt vector table. If you erase flash it has an invalid address.
    You can flash different areas without erasing all flash and you can debug without flashing (just make sure the flashed binary matches the ELF).
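To illustrate the point about the vector table: its first two words are the initial stack pointer and the reset handler address, and in erased flash both read back as 0xFFFFFFFF. The following sketch is a plausibility check only. The function name, the range type, and the address ranges passed in by the caller are hypothetical, not taken from the HAL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [lo, hi). */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} addr_range_t;

/* vt[0] is the initial MSP, vt[1] the reset handler address (bit 0 set for Thumb).
 * Returns true when both values look plausible for the given RAM and flash ranges. */
static bool vector_table_looks_valid(const uint32_t vt[2],
                                     addr_range_t ram, addr_range_t flash)
{
    uint32_t initial_sp = vt[0];
    uint32_t reset = vt[1];
    uint32_t reset_addr = reset & ~1u;

    /* The initial SP may equal the end of RAM and must be 8-byte aligned. */
    bool sp_ok = (initial_sp > ram.lo) && (initial_sp <= ram.hi) &&
                 ((initial_sp & 7u) == 0u);
    /* The reset vector must have the Thumb bit set and point into flash. */
    bool pc_ok = ((reset & 1u) != 0u) &&
                 (reset_addr >= flash.lo) && (reset_addr < flash.hi);
    return sp_ok && pc_ok;
}
```

A check like this rejects an erased table, because 0xFFFFFFFF is neither a RAM address nor a flash address.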

    ThoufielAuthor
    Graduate
    November 21, 2024

    Thank you for your advice.

     

    Debugging a program placed at 0x0800 0000 (Area A) also caused a CPU reset, so I checked the RCC flags in ResetHandler.

    After the CPU reset, bit 29 (IWDGRSTF) and bit 26 (PINRSTF) of RSR were set.

    I had probably missed this earlier because the breakpoint I set for the IWDG reset was never hit before the CPU reset occurred.

    In my initial investigation I concluded that WatchDog was irrelevant, but I was wrong...
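The flag check described above can be sketched as a small decoder. It uses only the two bit positions mentioned in this post (bit 29 for IWDGRSTF, bit 26 for PINRSTF) and tests the watchdog flag first, since the pin flag was set together with it here. The function name is hypothetical, and other reset flags are deliberately not handled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit positions as reported in this thread for RCC RSR. */
#define RSR_PINRSTF  (1u << 26)  /* reset seen on the NRST pin */
#define RSR_IWDGRSTF (1u << 29)  /* independent watchdog reset */

/* Returns a short description of the reset cause for a raw RSR value.
 * The more specific flag (IWDG) is checked before the pin flag. */
static const char *reset_cause(uint32_t rsr)
{
    if (rsr & RSR_IWDGRSTF) {
        return "independent watchdog";
    }
    if (rsr & RSR_PINRSTF) {
        return "NRST pin";
    }
    return "other";
}
```

On the target, the raw value would be read once early in startup, before the flags are cleared.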

     

    This would explain why the CPU reset occurs only when pausing and resuming in the debugger, without any problem during normal operation (since I had not enabled DBG_IWDG_STOP).

     

    It seems that enabling DBG_IWDG_STOP requires enabling the TrustZone setting, but considering the impact on operation, I would like to take other steps if possible.

     

    Any ideas?