Solved

Imprecise error flag after writing to the flash

June 13, 2024

Hi experts,

I'm adding a function to store a word in flash to my existing code (which runs on an STM32F105).
I basically copied the code from the official HAL-based example. Here is my function:

uint32_t FLASH_Write_Data(uint32_t StartPageAddress, uint32_t *Data, uint16_t NWords)
{
	static FLASH_EraseInitTypeDef EraseInitStruct;
	uint32_t PageError;

	/* Unlock the Flash memory to enable the flash control register access */
	HAL_FLASH_Unlock();
	/* Erase the FLASH area*/
	EraseInitStruct.TypeErase = FLASH_TYPEERASE_PAGES;
	EraseInitStruct.PageAddress = FLASH_USER_START_ADDR;
	EraseInitStruct.NbPages = (FLASH_USER_END_ADDR - FLASH_USER_START_ADDR) / FLASH_PAGE_SIZE;

	if (HAL_FLASHEx_Erase(&EraseInitStruct, &PageError) != HAL_OK)
	{
		/*Error occurred while page erase.*/
		return HAL_FLASH_GetError();
	}
	/* Program the user FLASH area word by word*/
	uint32_t i = 0;
	while (i < NWords)
	{
		if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, StartPageAddress, Data[i]) == HAL_OK)
		{
			StartPageAddress += MEMORY_OFFSET;
			i++;
		}
		else
		{
			/* Error occurred while writing data in Flash memory*/
			return HAL_FLASH_GetError();
		}
	}
	HAL_FLASH_Lock();
	return HAL_OK;
}

The very first strange thing is that as soon as I start debugging, the program counter often jumps to the HardFault_Handler function.

From there, I can reset the chip and restart the debug session without any apparent problem.

However, when I execute the program, it consistently ends up in the HardFault_Handler function.

If I comment out the flash write function, the program works as expected.

I started commenting out part of the flash function code, and I noticed that when the HAL_FLASHEx_Erase and the HAL_FLASH_Program functions are not executed, the program continues to work as expected.

I stepped through the code and didn't notice anything obviously wrong, as I can erase and even write the passed value to the passed memory address without any immediate error!

The only thing that happens before the program counter jumps into the HardFault_Handler function is that, after several instructions, the IMPRECISERR flag of the CFSR register is set.

I tried moving the flash function into the main code, either just after SystemClock_Config or just before the while(1), but surprisingly, the instruction where the IMPRECISERR flag is set does not change.

The point where this happens is the closing brace (yes, the "}" that ends the function) of this function:

bool Shaft_Measure(void)
{
	static int32_t aShaft_old;
	int32_t aShaft;
	msg.a_shaft = (int16_t)aShaft;
	if (ABS(aShaft - aShaft_old) > 1){
		bRead_speed = true;
	}
	else{
		bRead_speed = false;
	}
	aShaft_old = aShaft;
	return bRead_speed;
}

I'm sure that the problem is not there, but unfortunately I don't know how to proceed to identify the issue.

The MCU I'm using is the STM32F105RC, which has 256 kB of flash, and the value I'm trying to save is just a single uint32_t number.
The constants used by the flash functions are defined in my flash.h, reproduced here:

#define FLASH_ADDR_PAGE_127 ((uint32_t)0x0803F800)
#define FLASH_USER_START_ADDR FLASH_ADDR_PAGE_127
#define FLASH_USER_END_ADDR (FLASH_ADDR_PAGE_127 + FLASH_PAGE_SIZE)
#define MEMORY_OFFSET ((uint32_t)0x4U)

These addresses have been copied from the device datasheet.

Here is my setup:

Segger JLink base (with the latest ver: 7.96l)

STM32CubeIDE (ver: 1.15.1)

STM32CubeMX (ver: 6.11.1)

HAL libraries for STM32F1 (ver: 1.8.5)

Any help would be greatly appreciated!

Tesla DeLorean

Graduate II

In this context "Imprecise" means it's a deferred write, i.e. one that went through the write buffers.

The code address of the fault is therefore not exact, as it was started a few cycles earlier in the pipe-line, and you're now executing later instructions.

Look at what's actually reported, look at the address of the failed write, which will be correct, and look a little earlier in the code instructions.

Addresses must be 4-byte / 32-bit aligned, and can't be written more than once per erase cycle.
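A minimal sketch of a check for the alignment rule above, run before each program operation. The helper names and the USER_PAGE_* constants are hypothetical; the values only mirror the single 2 KB user page at 0x0803F800 described in the first post and would need adjusting for another layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants mirroring the poster's flash.h (one 2 KB page). */
#define USER_PAGE_START 0x0803F800u
#define USER_PAGE_SIZE  0x800u

/* True if a 32-bit word at addr is 4-byte aligned and lies fully inside the user page. */
bool flash_addr_is_writable(uint32_t addr)
{
    if ((addr & 0x3u) != 0u) {
        return false;                       /* not word aligned */
    }
    if (addr < USER_PAGE_START) {
        return false;                       /* below the user page */
    }
    return (addr - USER_PAGE_START) <= (USER_PAGE_SIZE - 4u);
}

/* Debug-build guard a caller might place before HAL_FLASH_Program(). */
void flash_assert_writable(uint32_t addr)
{
    assert(flash_addr_is_writable(addr));
}
```

This does not cover the "once per erase cycle" rule; that would need a read-back of the target word to confirm it is still erased.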

Have a Hard Fault handler that outputs actionable data, during development and in the field, so support techs can actually identify and fix issues.

https://github.com/cturvey/RandomNinjaChef/blob/main/KeilHardFault.c

aga (Author)

Graduate II

Hi @Tesla DeLorean ,

first of all, many thanks for your message and for providing the code to catch the HardFault exceptions!

I spent the entire day integrating the code you suggested into my project and I'm sure that it was time well spent.

Here's what I ended up with:

[Hard Fault]
CPU registers dump:
r0 = 00000000, r1 = 20000578, r2 = 200009EC, r3 = 00000000
r4 = 20000A18, r5 = 24264CEE, r6 = 8715A7C8, sp = 2000FFE8
r12= 0803F802, lr = 080045EB, pc = 080045F0, psr= 01000000
bfar=E000ED38, cfsr=00000400, hfsr=40000000, dfsr=00000000, afsr=00000000
Stack dump:
00000001
0800DC81
20000A18
24264CEE
2000FF90
08005833
Instructions dump:
B2DB 2B00 D00A F001 F8A9 4603 71FB 79FB (4618) F001 F8D1 4B3B 2200 701A

As I have Joseph Yiu's book "The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors", I started reading about this fault and discovered that I could disable the write buffer to catch the exact point that triggered the bus fault. Unfortunately I couldn't set the DISDEFWBUF bit (SCnSCB->ACTLR), as my MCU (ARM Cortex-M3 revision r0p1) does not have it.
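The cfsr and hfsr values in the dump above can be decoded by hand. Below is a small sketch of that decoding; the bit positions are the ARMv7-M ones (the bus fault status byte sits in CFSR bits 8 to 15), while the macro and function names are made up for this example and are not the CMSIS names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-M fault status bits; the names here are illustrative, not CMSIS. */
#define CFSR_PRECISERR   (1u << 9)   /* precise data bus error              */
#define CFSR_IMPRECISERR (1u << 10)  /* imprecise (buffered) data bus error */
#define CFSR_BFARVALID   (1u << 15)  /* BFAR holds the faulting address     */
#define HFSR_FORCED      (1u << 30)  /* escalated from a configurable fault */

/* The dump shows cfsr=00000400, which is exactly the IMPRECISERR bit. */
static_assert(CFSR_IMPRECISERR == 0x00000400u, "cfsr value from the dump");

bool fault_is_imprecise(uint32_t cfsr)
{
    return (cfsr & CFSR_IMPRECISERR) != 0u;
}

bool bfar_is_valid(uint32_t cfsr)
{
    return (cfsr & CFSR_BFARVALID) != 0u;
}

bool hard_fault_is_forced(uint32_t hfsr)
{
    return (hfsr & HFSR_FORCED) != 0u;
}
```

With cfsr=00000400 and hfsr=40000000 this reads as an imprecise bus fault escalated to a hard fault. BFARVALID is clear, so the printed bfar value should probably not be treated as the faulting address.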

So I went through the stack, starting from the last address (0x08005833), and found these instructions (in the disassembly view):

 Reset_Handler:
08005800: bl 0x80051a4 <SystemInit>
 68 ldr r0, =_sdata
08005804: ldr r0, [pc, #44] @ (0x8005834 <Reset_Handler+51>)
 69 ldr r1, =_edata
08005806: ldr r1, [pc, #48] @ (0x8005838 <LoopFillZerobss+18>)
 70 ldr r2, =_sidata
08005808: ldr r2, [pc, #48] @ (0x800583c <LoopFillZerobss+22>)
 71 movs r3, #0
0800580a: movs r3, #0
 72 b LoopCopyDataInit
0800580c: b.n 0x8005814 <Reset_Handler+19>
 75 ldr r4, [r2, r3]
0800580e: ldr r4, [r2, r3]
 76 str r4, [r0, r3]
08005810: str r4, [r0, r3]
 77 adds r3, r3, #4
08005812: adds r3, #4
 80 adds r4, r0, r3
08005814: adds r4, r0, r3
 81 cmp r4, r1
08005816: cmp r4, r1
 82 bcc CopyDataInit
08005818: bcc.n 0x800580e <Reset_Handler+13>
 85 ldr r2, =_sbss
0800581a: ldr r2, [pc, #36] @ (0x8005840 <LoopFillZerobss+26>)
 86 ldr r4, =_ebss
0800581c: ldr r4, [pc, #36] @ (0x8005844 <LoopFillZerobss+30>)
 87 movs r3, #0
0800581e: movs r3, #0
 88 b LoopFillZerobss
08005820: b.n 0x8005826 <Reset_Handler+37>
 91 str r3, [r2]
08005822: str r3, [r2, #0]
 92 adds r2, r2, #4
08005824: adds r2, #4
 95 cmp r2, r4
08005826: cmp r2, r4
 96 bcc FillZerobss
08005828: bcc.n 0x8005822 <Reset_Handler+33>
 99 bl __libc_init_array
0800582a: bl 0x800dc4c <__libc_init_array>
101 bl main
0800582e: bl 0x80044d4 <main>
102 bx lr
08005832: bx lr
 68 ldr r0, =_sdata
08005834: movs r0, r0
08005836: movs r0, #0
 69 ldr r1, =_edata
08005838: movs r4, r1
0800583a: movs r0, #0
 70 ldr r2, =_sidata
0800583c: b.n 0x80050f8 <HAL_CAN_RxFifo0MsgPendingCallback+92>
0800583e: lsrs r0, r0, #32
 85 ldr r2, =_sbss
08005840: movs r0, r2
08005842: movs r0, #0
 86 ldr r4, =_ebss
08005844: lsrs r0, r3, #8
08005846: movs r0, #0
115 b Infinite_Loop
 WWDG_IRQHandler:
08005848: b.n 0x8005848 <WWDG_IRQHandler>
266 TST lr, #4

To me the location (0x08005833) looks like the reset handler, although the address does not match perfectly.

Moreover, I don't know how to read the instructions dump.

What steps should I take next?

Many thanks! :folded_hands:

waclawek.jan

Super User

pc = 080045F0

What you want is to look at disasm a couple of instructions before this address, and from content of other registers in the fault handler discern, which was the offending instruction. If you have mixed disasm/C view, it's usually quite obvious what's the problem in the source.

JW

aga (Author)

Graduate II

Hi @waclawek.jan,

thanks for your help.

Here is the register dump, followed by the disassembly around the address the PC points to after the HardFault is triggered.

[Hard Fault]
CPU registers dump:
r0 = 00000000, r1 = 20000578, r2 = 200009E8, r3 = 00000000
r4 = 20000A10, r5 = 64264CEE, r6 = 8715A5C8, sp = 2000FFE8
r12= 0803F802, lr = 0800460F, pc = 08004614, psr= 01000000
bfar=E000ED38, cfsr=00000400, hfsr=40000000, dfsr=00000000, afsr=00000000
Stack dump:
00000001
0800DCA9
20000A10
64264CEE
2000FF90
0800585B
Instructions dump:
B2DB 2B00 D00A F001 F8A9 4603 71FB 79FB (4618) F001 F8D3 4B3B 2200 701A

Disassembly around the address: 0x08004614

279 		if (DEV_VAR_X == NDevice_variant){
080045f8: ldr r3, [pc, #252] @ (0x80046f8 <main+512>)
080045fa: ldrb r3, [r3, #0]
080045fc: cmp r3, #0
080045fe: bne.n 0x8004620 <main+296>
282 			if (bRun_task_shaft_meas){
08004600: ldr r3, [pc, #260] @ (0x8004708 <main+528>)
08004602: ldrb r3, [r3, #0]
08004604: uxtb r3, r3
08004606: cmp r3, #0
08004608: beq.n 0x8004620 <main+296>
285 				bool bRead_speed = Shaft_Measure();
0800460a: bl 0x8005760 <Shaft_Measure>
0800460e: mov r3, r0
08004610: strb r3, [r7, #7]
286 				Shaft_Speed(bRead_speed);
08004612: ldrb r3, [r7, #7]
08004614: mov r0, r3
08004616: bl 0x80057c0 <Shaft_Speed>
289 				bRun_task_shaft_meas = false;
0800461a: ldr r3, [pc, #236] @ (0x8004708 <main+528>)
0800461c: movs r2, #0
0800461e: strb r2, [r3, #0]
298 		if (DEV_VAR_X == NDevice_variant){
08004620: ldr r3, [pc, #212] @ (0x80046f8 <main+512>)
08004622: ldrb r3, [r3, #0]
08004624: cmp r3, #0
08004626: bne.n 0x8004638 <main+320>
301 			if (bRun_task_LEDs){
08004628: ldr r3, [pc, #224] @ (0x800470c <main+532>)
0800462a: ldrb r3, [r3, #0]
0800462c: uxtb r3, r3
0800462e: cmp r3, #0
08004630: beq.n 0x8004638 <main+320>
306 				bRun_task_LEDs = false;
08004632: ldr r3, [pc, #216] @ (0x800470c <main+532>)
08004634: movs r2, #0
08004636: strb r2, [r3, #0]

The corresponding C code is this:

#ifdef	USE_HALL_SENSORS

		if (DEV_VAR_X == NDevice_variant){

			/* Run the shaft measurement task - triggered by HAL_TIM_IC_CaptureCallback() */
			if (bRun_task_shaft_meas){

				bool bRead_speed = Shaft_Measure();
				Shaft_Speed(bRead_speed);

				bRun_task_shaft_meas = false;
			}
		}

#endif	// USE_HALL_SENSORS

Following your suggestion: since the PC points to the Shaft_Speed(bRead_speed) call, the error should be around here.

void Shaft_Speed(bool bRead_speed)
{
	if (bRead_speed){
		WG_TX_1_msg.n_shaft = (uint16_t)(1000000/htim4.Instance->CCR1);
	}
	else{
		WG_TX_1_msg.n_shaft = 0U;
	}
}

I don't really see anything wrong here, as the code is very simple. Looking a bit earlier, there is the other function, Shaft_Measure (shown in the first post), which also looks fine to me, so I have no idea where the issue could be.

Any other help please?

Many thanks!

waclawek.jan

Super User

I'd say the problem happens in

08004610: strb r3, [r7, #7]

can we know r7 from the hardfault handler?

OTOH it's a local variable so should be located at the stack; don't quite understand what might have happened to r7 (which is probably the local frame pointer, i.e. points to the stack where local variables are allocated).

Is the fault reproducible? Can you get content of r7?
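The same strb also appears in the instructions dump posted earlier: the halfwords 4603 71FB 79FB (4618) line up with the disassembly at 0800460E to 08004614, so the value in parentheses appears to be the opcode at the reported pc. A rough sketch of decoding such a halfword follows. It handles only the 16-bit Thumb STRB/LDRB immediate form, and the struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of a 16-bit Thumb "STRB/LDRB Rt, [Rn, #imm5]" instruction. */
struct thumb_byte_access {
    bool    is_load;  /* true for LDRB, false for STRB */
    uint8_t rt;       /* register stored or loaded      */
    uint8_t rn;       /* base register                  */
    uint8_t imm5;     /* byte offset from the base      */
};

/* Returns true and fills *out if opcode is STRB (01110...) or LDRB (01111...). */
bool decode_thumb_byte_access(uint16_t opcode, struct thumb_byte_access *out)
{
    assert(out != NULL);
    if ((opcode >> 12) != 0x7u) {
        return false;                        /* some other instruction */
    }
    out->is_load = ((opcode >> 11) & 0x1u) != 0u;
    out->imm5    = (uint8_t)((opcode >> 6) & 0x1Fu);
    out->rn      = (uint8_t)((opcode >> 3) & 0x7u);
    out->rt      = (uint8_t)(opcode & 0x7u);
    return true;
}
```

Feeding it 0x71FB gives a store of r3 to [r7, #7], and 0x79FB gives the matching load, which agrees with the two instructions around 08004610 in the disassembly.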

JW

aga (Author)

Graduate II

Hi @waclawek.jan,

the fault happens every time I run the code, but I'm not sure whether it would be reproducible in a separate program; it would take some time (which I don't have) to create one.

The value of r7 while Shaft_Measure executes is always 0x2000FFD8 (which contains 0x02F80308), and it never changes.

Then, after the HardFault handler in the startup assembly is invoked, it changes to 0xFFFFFFFF.

When hard_fault_handler_c executes, it changes again to 0x2000F90, which now contains 0xC8A50587 (that was missing from the first dump - sorry).

I really hope that this could help to find the issue.

Many thanks!

waclawek.jan

Super User

Why does the offending code's address keep changing?

Are you adding/removing code for the various experiments? That makes it a moving target, harder to aim and hit.

In the above code, Shaft_Measure() is irrelevant - r7 is pushed to the stack at the beginning and popped back at the end, i.e. it's unchanged. As your comment also indicates, r7 was already 0xFFFFFFFF at the point where Shaft_Measure() was called, and that's an incorrect value, as r7 is used as the stack frame pointer, i.e. it should point somewhere near the top of the stack.

In other words, go back to the beginning of the calling function (i.e. top of function which called Shaft_Measure()), and observe how r7 is set up there.

JW

aga (Author)

Graduate II

Hi @waclawek.jan,

I noticed some inconsistency as well, because I changed the code a bit (though never its functionality). You are right: next time I'll make a new branch to avoid changing the code under test.

Understood, I'll try to trace r7 and see where it changes to the wrong value (0xFFFFFFFF).

Many thanks!

waclawek.jan (Answer)

Super User

@Tesla DeLorean,

it's simpler than that. Look at the end of loop condition of this function:

void FLASH_Read_Data (uint32_t StartPageAddress, uint32_t *data, uint16_t n_words)
{
	while (1)
	{
		*data = *(__IO uint32_t *)StartPageAddress;
		StartPageAddress += MEMORY_OFFSET;
		data++;
		if (!(n_words--)) break;
	}
}

It reads from FLASH and writes to destination (which is given by pointer) one more word than is n_words, so

uint32_t last_boot_counter, new_boot_counter;
	FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U);

writes 2 words to stack (as last_boot_counter is a local variable), and thrashes what's just after last_boot_counter, which happens to be the stacked stack frame pointer in r7 from the caller function. And that thrashed stack frame pointer then just propagates to the point where it's actually used, resulting in the fault.
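A small host-side sketch of why that loop condition runs one time too many. It keeps the original `if (!(n_words--)) break;` test but replaces the flash read and the store with a counter; the function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Counts how many words the original loop would write for a given n_words. */
unsigned words_written_by_original_loop(uint16_t n_words)
{
    unsigned written = 0u;
    while (1) {
        written++;                 /* stands in for "*data = ...; data++"       */
        if (!(n_words--)) break;   /* tests the OLD value, then decrements it   */
    }
    return written;
}

/* Self-check: a request for 1 word results in 2 writes. */
void check_off_by_one(void)
{
    assert(words_written_by_original_loop(1u) == 2u);
}
```

Because the post-decrement tests the value before decrementing, the loop only stops on the pass where n_words was already 0, so it always performs n_words + 1 writes.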

@aga,

> how should I start to write code to defend myself from these hidden traps?

I could preach here "proper coding methods" etc. Some folks believe religiously in the power of tools like static checkers or various fancy programming languages.

But the naked truth is, humans make errors. So, the practical solution is to a) try to be as meticulous as possible within reasonable boundaries; b) be prepared for errors at various levels.

In this particular case, for example, in FLASH_Read_Data(), I would use the for() loop rather than do/while() or while(). There's a reason for the for() loop - it's usually perceived as a simple alternative and it's more readily understood, if one sticks to the simple for(i = 0; i < max; i++) pattern; so it's less likely to result in error.
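A minimal sketch of the read routine written with that plain for() pattern. It is not the poster's final code: the name flash_read_words is hypothetical, and the source is passed as a pointer (on the target this would be the flash address cast to a pointer) so that the sketch can also be run on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy exactly n_words 32-bit words from src to dst, no more. */
void flash_read_words(const volatile uint32_t *src, uint32_t *dst, size_t n_words)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = src[i];
    }
}
```

With n_words equal to 1 the loop body runs once, so only the word the caller provided is written.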

And, also, in this particular case, you have been able to track down the root cause and mitigate it (by using the correct FLASH_Read_Data()). I never consider the "the problem went away, although I'm not sure why" to be an acceptable solution; that's why to me it's very important to maintain the problem (i.e. don't make any subsequent changes, unless I can surely undo them) until I am absolutely sure what was the root cause and that I removed it.

JW

aga (Author)

Graduate II

Hi @Tesla DeLorean and @waclawek.jan.

Many thanks for the time spent on this thread. Very appreciated! :folded_hands:

@waclawek.jan

I now understand the issue. As you probably noticed, I hadn't marked my post as a solution because I hadn't fully grasped the root cause.

I've marked it just now thanks to your explanation. However, I still want to understand why the stack pointer retained that value. Do you have any idea?

I agree about the for loop too, I normally use it.

Cheers

waclawek.jan

Super User

It's not "stack pointer" but "pointer to stack frame". Stack frame is the name of the space on stack the compiler makes upon entry to a function, so that it can place local variables there and, in some cases, also temporary variables it is unable to hold in registers (so called "register spills"). Compiler needs to maintain a pointer to that space, as it is not a fixed value, it depends on where top of stack (pointed to by stack pointer) was at the moment the function was called.

But upon function entry, before creating the stack frame, the compiler needs to push on stack registers it intends to modify in that function (except r0-r3, which are used as parameters upon entry/return values upon exit/freely modifiable in the function).

For example:

 FLASH_BootCounter:
08002f98: push {r7, lr}		--> r7 = 0x2000FFF0 (contains 0x20000A10), lr contains the returning address
08002f9a: sub sp, #8			--> sp = 0x2000FFE0 (contains 0x2000FF0)
08002f9c: add r7, sp, #0		--> r7 = 0x2000FFF0 sp = 0x2000FFD8 (contains 0x8003B27)

Here, in the first instruction, lr (the return address) and r7 are pushed, as the compiler intends to use r7 in that function. It then moves the top of the stack to make space for the stack frame (here for two local variables) and stores that address in r7, to be used as the pointer whenever those local variables are accessed.

So, on stack, we have

r7-> [new_boot_counter] [last_boot_counter] [saved r7 of caller] [lr i.e. return address] [... stack content of caller ...]

Now calling FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U); wrote two words instead of one, so it wrote last_boot_counter, but also corrupted [saved r7 of caller]. When FLASH_BootCounter finished:

08002fc4: pop {r7, pc}		--> r7 = sp = 0x2000FFE0 (contains 0xFFFFFFFF)
									--> sp = 0x2000FFE8, r7 = 0xFFFFFFFF

that corrupted r7 is popped from the stack. Execution returned to the function which called FLASH_BootCounter() (I don't know what it is called, so let's call it X), where apparently r7 is again used as the stack frame pointer for the local variables of function X. And then later, in function X, a local variable bRead_speed is accessed:

bool bRead_speed = Shaft_Measure();

(While bRead_speed is defined in the middle of the function, the compiler gathers all local variables throughout the function and creates one common frame at the beginning of the function.) However, r7 now points to an invalid address, and that access then causes the fault.
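The frame layout described above can be played through on a PC with an ordinary array standing in for the stack. This is only a simulation under assumed values: the enum names follow the layout shown earlier, the loop keeps the original exit condition, and the saved r7 and lr contents used with it are placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated stack frame of FLASH_BootCounter, lowest address first. */
enum { NEW_BOOT_COUNTER, LAST_BOOT_COUNTER, SAVED_R7, SAVED_LR, FRAME_WORDS };

/* Same exit condition as the original FLASH_Read_Data: writes n_words + 1 words.
 * src must therefore provide at least n_words + 1 readable words. */
void read_like_original(const uint32_t *src, uint32_t *frame,
                        size_t start, uint16_t n_words)
{
    /* Keep the simulated overrun inside the array so the demo stays well defined. */
    assert(start + (size_t)n_words + 1u <= (size_t)FRAME_WORDS);

    uint32_t *data = &frame[start];
    while (1) {
        *data = *src;
        src++;
        data++;
        if (!(n_words--)) break;
    }
}
```

Called with start set to LAST_BOOT_COUNTER, n_words set to 1 and a source whose second word is 0xFFFFFFFF (erased flash), the slot holding the saved r7 ends up as 0xFFFFFFFF while the saved lr is untouched, which matches the r7 value reported after the return.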

JW

waclawek.jan

Super User

> is there any (debug) tool that could have saved me from all of this?

This is similar to your previous question:

>> how should I start to write code to defend myself from these hidden traps?

And, again, the answer is - as you probably know - disappointing: there is no magical debug tool in the same way as there is no magic tool preventing errors. (There are folks promoting such tools, nonetheless.)

Errors do happen, and some errors are simply very complicated. The best thing we can do is to try to prevent making errors as much as possible (see above); and have a full toolbox of debugging tools (i.e. not hoping in one single magic tool).

In other words, the hard way.

Btw. note that you got quite lucky here, as the fault happened not too far from the place where the bug was. Had the bug overwritten some unrelated variable on the stack, or set off a complicated chain of events, the symptoms might have shown up very far from, and seemingly unrelated to, the point where the bug "acted", with much more time and effort spent on hunting it down. But this is all part of our job; it does and will happen, and we have to be prepared for its occurrences.

JW
