Associate III
November 18, 2024
Question

HardFault, possibly from a stack overflow, but the breakpoint in HardFault_Handler() is not hit


I am using the STM32H562VGT6 and IAR Workbench.

I have set up a 4kHz interrupt with LPTIM3 and started transmitting 32 bytes of data using HAL_SPI_Transmit_DMA.

If I break and resume execution after starting the timer interrupt, the SP becomes 0xffff ffd8 and a HardFault occurs. The call stack contains only (Exception frame), so the fault cannot be traced. The stack area is from 0x2000'0f90 to 0x2000'4f8f.

Although HardFault_Handler() is defined in stm32h5xx_it.c, a breakpoint set there is not hit when the HardFault occurs.
Breakpoints in MemManage_Handler and BusFault_Handler are not hit either.

Commenting out HAL_SPI_Transmit_DMA in the 4kHz interrupt prevents the HardFault. However, when I add the following macro inside the HAL API to check whether the SP is out of range, it never triggers.

#define CHECK_MSP() { \
    uint32_t msp; \
    __asm volatile ("MRS %0, msp" : "=r" (msp)); \
    if (msp < 0x20000f80 || 0x20004f7f < msp) { \
        __BKPT(0); \
    } \
}
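
As a minimal sketch of the same idea, the range test can be pulled out into a plain function so that it can be exercised off-target. The bounds here are the stack area stated in the post (0x2000'0f90 to 0x2000'4f8f); the names STACK_BASE, STACK_TOP and sp_in_stack are hypothetical. On the target, the value passed in would typically come from the CMSIS intrinsic __get_MSP().

```c
#include <stdbool.h>
#include <stdint.h>

/* Stack area as stated in the post; the macro names are hypothetical. */
#define STACK_BASE 0x20000f90u /* lowest address of the stack area  */
#define STACK_TOP  0x20004f8fu /* highest address of the stack area */

/*
 * The Cortex-M stack is full-descending, so an empty stack has
 * SP == STACK_TOP + 1. That value is therefore still valid.
 */
static bool sp_in_stack(uint32_t sp)
{
    return sp >= STACK_BASE && sp <= STACK_TOP + 1u;
}
```

With these bounds, the SP value seen at the fault (0xffff ffd8) is reported as out of range, while the initial SP of an empty stack (0x2000'4f90) is accepted.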


How can I capture the cause of this issue?

11 replies

Lead III
November 18, 2024

Please put code in a code block.

A stack overflow does not cause a HardFault by itself. It only does so if your linker file places the stack so that an overflow accesses an illegal memory region, or if you set up the MPU to define an illegal region past the stack.

In STM32CubeMX, check under NVIC whether the MemManage_Handler and BusFault_Handler interrupts are enabled.

ThoufielAuthor
Associate III
November 18, 2024

Here is the 4kHz interrupt handler, followed by the register values at the HardFault:

{
	static const int clkTbl[8 + 1][8] = {
		{ 0, 0, 0, 0, 0, 0, 0, 0 },
		{ 0, 0, 0, 0, 0, 0, 0, 1 },
		{ 0, 0, 0, 1, 0, 0, 0, 1 },
		{ 0, 0, 1, 0, 0, 1, 0, 1 },
		{ 0, 1, 0, 1, 0, 1, 0, 1 },
		{ 0, 1, 0, 1, 1, 0, 1, 1 },
		{ 0, 1, 1, 1, 0, 1, 1, 1 },
		{ 0, 1, 1, 1, 1, 1, 1, 1 },
		{ 1, 1, 1, 1, 1, 1, 1, 1 },
	};

	int step;
	int clk[4];
	int dir[4];
	BYTE val;

	int side = c_count & 1;
	if(HAL_SPI_Transmit_DMA(&hspi4, (uint8_t*)(&c_dirclk[(c_count +1 )& 1][0]), 32) != HAL_OK){
		Error_Handler();
	}
	for (int i = 0; i < 4; i++) {
		for (int j = 0; j < 4; j++) {
			id_type id = i * 4 + j;
			if (id < MOTOR_COUNT) {
				step = (int)m_motors[id].proceed();
				clk[j]= abs(step);
				dir[j] = (m_motors[id].direction() ? 1 : 0);
			} else {
				clk[j] = 0;
				dir[j] = 1;
			}
		}
		for (int seq = 0; seq < 8; seq++) {
			val = 0;
			for (int k = 0; k < 4; k++) {
				val |= ((dir[k] << (k + 4))) | (clkTbl[clk[k]][seq] << k);
			}
			c_dirclk[side][(seq * 4 + i)] = val;
		}
	}
	c_count++;
}

 

 

[Attached image: regs.png, register values at the HardFault]

Lead III
November 18, 2024
ThoufielAuthor
Associate III
November 18, 2024

Thank you for replying.

CFSR : 0x1001

SHCSR : 0x4

 

And here is the debug log message in IAR Workbench:

------------------------------------------------------------------------------------------------

Mon Nov 18, 2024 19:46:50: MemManage fault escalated into HardFault
Mon Nov 18, 2024 19:46:50: The MemManage handler is disabled
Mon Nov 18, 2024 19:46:50: HardFault exception.
Mon Nov 18, 2024 19:46:50: The processor has escalated a configurable-priority exception to HardFault.
Mon Nov 18, 2024 19:46:50:
Mon Nov 18, 2024 19:46:50: An MPU or Execute Never (XN) default memory map access violation has occurred on an instruction fetch (CFSR.IACCVIOL, MMFAR).
Mon Nov 18, 2024 19:46:50:
Mon Nov 18, 2024 19:46:50: A derived bus fault has occurred on exception entry (CFSR.STKERR, BFAR).
Mon Nov 18, 2024 19:46:50:
Mon Nov 18, 2024 19:46:50: Could not determine the location where the exception occurred.
Mon Nov 18, 2024 19:46:50:
Mon Nov 18, 2024 19:46:50: See the call stack for more information.
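
The CFSR value above can also be decoded by hand. The following is a minimal sketch, assuming the standard Cortex-M CFSR layout (MMFSR in bits 0 to 7, BFSR in bits 8 to 15); the table cfsr_flags and the function cfsr_decode are hypothetical names, and only a subset of the flags is listed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t    mask;
    const char *name;
} cfsr_flag;

/* Subset of the CFSR flags: MemManage (bits 0-7) and BusFault (bits 8-15). */
static const cfsr_flag cfsr_flags[] = {
    { 1u << 0,  "IACCVIOL"    }, /* instruction access violation   */
    { 1u << 1,  "DACCVIOL"    }, /* data access violation          */
    { 1u << 3,  "MUNSTKERR"   }, /* MemManage fault on unstacking  */
    { 1u << 4,  "MSTKERR"     }, /* MemManage fault on stacking    */
    { 1u << 8,  "IBUSERR"     }, /* instruction bus error          */
    { 1u << 9,  "PRECISERR"   }, /* precise data bus error         */
    { 1u << 10, "IMPRECISERR" }, /* imprecise data bus error       */
    { 1u << 11, "UNSTKERR"    }, /* bus fault on unstacking        */
    { 1u << 12, "STKERR"      }, /* bus fault on exception stacking */
};

/* Prints the names of the flags set in cfsr and returns how many were set. */
static int cfsr_decode(uint32_t cfsr)
{
    int count = 0;
    for (size_t i = 0; i < sizeof cfsr_flags / sizeof cfsr_flags[0]; i++) {
        if ((cfsr & cfsr_flags[i].mask) != 0u) {
            printf("%s\n", cfsr_flags[i].name);
            count++;
        }
    }
    return count;
}
```

For 0x1001 this reports IACCVIOL and STKERR, which matches the two conditions named in the IAR log.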

Lead III
November 18, 2024

@Thoufiel wrote:

SHCSR : 0x4

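Bit 2 is HARDFAULTACT, so a HardFault is active. None of the enable bits for the MemManage, BusFault, or UsageFault handlers (bits 16 to 18) are set.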


@Thoufiel wrote:

Mon Nov 18, 2024 19:46:50: The MemManage handler is disabled


As I suspected.

Have you enabled the MPU?

Can you step through the interrupt until the error occurs?

ThoufielAuthor
Associate III
November 19, 2024

MPU is not used.

 

>Can you step through the interrupt until the error occurs?

I have tried several times, but the error does not reproduce while single-stepping, so I am not able to stop just before it occurs.

Ozone
Principal
November 19, 2024

>Can you step through the interrupt until the error occurs?

With core faults involved, this usually implies stepping through on instruction/assembler level.
Just saying.

Tesla DeLorean
Guru
November 19, 2024

Any DMA into an auto/local array?

ThoufielAuthor
Associate III
November 19, 2024

>Any DMA into an auto/local array?

No. c_count and c_dirclk are static members of the class to which this interrupt handler function belongs.

Tesla DeLorean
Guru
November 19, 2024

So CPP? Perhaps issues with stack or heap or constructors.

Not sure how SP goes into 0xFFFFxxxx space. Perhaps PSP/MSP initialization.

Or double faulting. 

Perhaps MemManage Handler?

Can you fish in the primary stack for a context frame for a real PC/LR?
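
One way to "fish" for a frame, sketched under assumptions: on Cortex-M the basic exception frame is eight words (r0, r1, r2, r3, r12, lr, pc, xPSR, lowest address first), and a stacked xPSR has the Thumb bit (bit 24) set. The function find_frame and the code range passed to it are hypothetical; the scan is only a heuristic over a memory dump of the stack and can produce false positives.

```c
#include <stddef.h>
#include <stdint.h>

/* Word offsets inside the basic (no FPU state) exception frame. */
enum { FRAME_WORDS = 8, FRAME_LR = 5, FRAME_PC = 6, FRAME_XPSR = 7 };

/*
 * Scans a dump of stack words (lowest address first) for the first
 * position that looks like an exception frame: the stacked xPSR has the
 * Thumb bit set and the stacked PC lies inside [code_lo, code_hi).
 * Returns the word index of the frame, or -1 if none is found.
 */
static long find_frame(const uint32_t *stack, size_t n_words,
                       uint32_t code_lo, uint32_t code_hi)
{
    for (size_t i = 0; i + FRAME_WORDS <= n_words; i++) {
        uint32_t pc   = stack[i + FRAME_PC];
        uint32_t xpsr = stack[i + FRAME_XPSR];
        if ((xpsr & (1u << 24)) != 0u && pc >= code_lo && pc < code_hi) {
            return (long)i;
        }
    }
    return -1;
}
```

If a candidate is found, stack[index + FRAME_PC] and stack[index + FRAME_LR] are the values to look up in the map file or disassembly.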

Lead III
November 19, 2024

Have you checked alignment requirements for the DMA? Sometimes source/destination arrays have to be aligned at more than 4 bytes.

 

I don't know what happens in

m_motors[id].proceed()

Are you sure it doesn't crash there?
Have you tried toggling spare GPIO pins at certain points in the code, so that you can check with a logic analyzer when and where the code crashes?

(Also please use stdint.h int32_t instead of "int" if you intend a specific size integer (such as for clkTbl), it makes code more readable and portable.)

ThoufielAuthor
Associate III
November 19, 2024

-Step Execution

I have tried stepping through HAL_SPI_Transmit_DMA at the assembler level using the disassembler, and it exits the function successfully.
After resuming from a pause, I stepped through HAL_SPI_Transmit_DMA this way several times, and each time it exited the function without error.
(If I remove the breakpoint and resume debugging as is, the error occurs.)
If I stopped in the interrupt dozens of times, I might be able to find the code that causes the problem, but I haven't tried that yet.

 

-primary stack for a context frame for a real PC/LR

Does "primary stack for a context frame for a real PC/LR" here mean the PC/LR values immediately before the error?
Unfortunately, the direct cause of the error has not been identified, so I have not been able to verify those values.

The register values after the error are in a previous post in this thread.

 

-m_motors[id].proceed()

I added an assert that fires when the value of step is out of range for clkTbl, but it never triggered, so I do not think this part is the problem.

I have not verified using GPIO, so I will consider it.

 

Lead III
November 20, 2024

@Thoufiel wrote:

-Step Execution

I have tried stepping through HAL_SPI_Transmit_DMA at the assembler level using the disassembler, and it exits the function successfully.
After resuming from a pause, I stepped through HAL_SPI_Transmit_DMA this way several times, and each time it exited the function without error.
(If I remove the breakpoint and resume debugging as is, the error occurs.)

Tip for this type of problem: increment a counter (uint32_t counter) in the interrupt, and read its value when the error occurs. Repeat this to see whether it consistently fails at the same count. You can then set a counting breakpoint (a breakpoint that triggers only after it has been hit x times). If a counting breakpoint doesn't work, you can make your own using an if statement with a breakable line of code inside that block.
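
A hand-made counting breakpoint could look like the sketch below. The names isr_count, break_at and isr_tick are hypothetical; isr_tick would be called at the top of the 4kHz interrupt handler, and break_at would be set from the debugger or in code to the count observed at the failure.

```c
#include <stdbool.h>
#include <stdint.h>

static volatile uint32_t isr_count; /* incremented on every interrupt        */
static uint32_t break_at;           /* count at which to stop; 0 disables it */

/* Returns true exactly once, on the interrupt whose count equals break_at. */
static bool isr_tick(void)
{
    isr_count++;
    if (break_at != 0u && isr_count == break_at) {
        return true; /* set the breakpoint on this line */
    }
    return false;
}
```

Because the counter is volatile, its value can still be read in the debugger after the fault, which gives the number to put into break_at for the next run.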

ThoufielAuthor
Associate III
November 20, 2024

As it turns out, no progress has been made...

--------------------------------------------------------------------------------------------
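I enabled the fault handlers at the beginning of the program:

int main(void)
{
    /* Enable the UsageFault, BusFault, and MemManage handlers. */
    SCB->SHCSR |= SCB_SHCSR_USGFAULTENA_Msk
               |  SCB_SHCSR_BUSFAULTENA_Msk
               |  SCB_SHCSR_MEMFAULTENA_Msk;

I then set a breakpoint in each handler, but none of them was hit.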

Attached are the CFSR results when the error occurs.

--------------------------------------------------------------------------------------------

 

ThoufielAuthor
Associate III
November 20, 2024

Progress on the problem:

This time the program was set to start from 0x0803 0000 (= Area B).
After placing the same program at 0x0800 0000 (= Area A), I started the program from Area B and repeatedly paused and resumed it in the debugger.


After resuming, the message "CPU status reset" sometimes appears.
When the program is paused again, the PC is below 0x0803 0000 and breakpoints in the IDE are not hit.

 

At the time of the first report, Area A was filled with 0xff because the program had been started after the entire flash was erased.
I think a reset caused execution to start from Area A, which resulted in the exception because the registers were loaded with invalid values.
I believe this also explains why the breakpoint in HardFault_Handler, which is placed in Area B, was never hit.

 

The breakpoint in ResetHandler is hit at the start of the program in Area B, but not when the CPU reset occurs.
(I assume execution jumps to the ResetHandler in Area A instead.)

 

I would like to know if you have any ideas on a better way to find out why the CPU resets, or why it does not jump to the ResetHandler in Area B when it resets.

Lead III
November 20, 2024

@Thoufiel wrote:

I would like to know if you have any ideas on a better way to find out why the CPU resets, or why it does not jump to the ResetHandler in Area B when it resets.


You can read the reset cause from various registers in the CPU. You can use __HAL_RCC_GET_FLAG to check various flags (you have to check them in a certain order since multiple flags can be set.)
The reset jump address is inside the interrupt vector table. If you erase flash it has an invalid address.
You can flash different areas without erasing all flash and you can debug without flashing (just make sure the flashed binary matches the ELF).
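
The point about checking order can be sketched as a small classifier over the raw RCC reset status value. This is only an illustration: the bit positions 26 (PINRSTF) and 29 (IWDGRSTF) are the ones reported later in this thread, and the names reset_cause and classify_reset are hypothetical. On the target, the value would be read from the RCC reset status register, or the individual flags tested with __HAL_RCC_GET_FLAG.

```c
#include <stdint.h>

/* Bit positions as reported in this thread for the RSR register. */
#define RSR_PINRSTF  (1u << 26) /* reset seen on the NRST pin    */
#define RSR_IWDGRSTF (1u << 29) /* independent watchdog reset    */

typedef enum { RESET_OTHER, RESET_PIN, RESET_IWDG } reset_cause;

/*
 * The pin flag can be set together with a more specific flag, so the
 * more specific source is tested first and the pin flag last.
 */
static reset_cause classify_reset(uint32_t rsr)
{
    if ((rsr & RSR_IWDGRSTF) != 0u) {
        return RESET_IWDG;
    }
    if ((rsr & RSR_PINRSTF) != 0u) {
        return RESET_PIN;
    }
    return RESET_OTHER;
}
```

A value with both bits 29 and 26 set is classified as a watchdog reset rather than a pin reset. The flags should be cleared after reading so that the next reset is not misattributed.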

ThoufielAuthor
Associate III
November 21, 2024

Thank you for your advice.

 

Debugging a program placed at 0x0800 0000 (Area A) also caused a CPU reset, so I checked the RCC flags in ResetHandler.

After the CPU reset, bit 29 (IWDGRSTF) and bit 26 (PINRSTF) of RSR were set.

This was probably because execution did not pass the breakpoint that I had set at the IWDG reset before the CPU reset occurred.

In my initial investigation I concluded that WatchDog was irrelevant, but I was wrong...

 

This would explain why the CPU reset occurs only when pausing and resuming in the debugger, without any problem during normal operation (since I had not enabled DBG_IWDG_STOP).

 

It seems that enabling DBG_IWDG_STOP requires enabling the TrustZone setting, but considering the impact on operation, I would like to take other steps if possible.

 

Any ideas?