Graduate II
June 13, 2024
Solved

Imprecise error flag after writing to the flash


Hi experts,


I'm adding a function to store a word in flash to my existing code, which runs on an STM32F105.
I basically copied the code from the official HAL-based example. Here is my function:

uint32_t FLASH_Write_Data(uint32_t StartPageAddress, uint32_t *Data, uint16_t NWords)
{
	static FLASH_EraseInitTypeDef EraseInitStruct;
	uint32_t PageError;

	/* Unlock the Flash memory to enable the flash control register access */
	HAL_FLASH_Unlock();
	/* Erase the FLASH area*/
	EraseInitStruct.TypeErase = FLASH_TYPEERASE_PAGES;
	EraseInitStruct.PageAddress = FLASH_USER_START_ADDR;
	EraseInitStruct.NbPages = (FLASH_USER_END_ADDR - FLASH_USER_START_ADDR) / FLASH_PAGE_SIZE;

	if (HAL_FLASHEx_Erase(&EraseInitStruct, &PageError) != HAL_OK)
	{
		/*Error occurred while page erase.*/
		return HAL_FLASH_GetError();
	}
	/* Program the user FLASH area word by word*/
	uint32_t i = 0;
	while (i < NWords)
	{
		if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, StartPageAddress, Data[i]) == HAL_OK)
		{
			StartPageAddress += MEMORY_OFFSET;
			i++;
		}
		else
		{
			/* Error occurred while writing data in Flash memory*/
			return HAL_FLASH_GetError();
		}
	}
	HAL_FLASH_Lock();
	return HAL_OK;
}

The very first strange thing is that as soon as I start debugging, the program counter often jumps to the HardFault_Handler function.

From there, I can reset the chip and restart the debug session without any apparent problem.

However, when I execute the program, it consistently ends up in the HardFault_Handler function.

If I comment out the flash write function, the program works as expected.

I started commenting out part of the flash function code, and I noticed that when the HAL_FLASHEx_Erase and the HAL_FLASH_Program functions are not executed, the program continues to work as expected.

I stepped through the code and didn't notice anything obviously wrong: I can erase the page and even write the passed value to the passed memory address without any immediate error!

The only thing that happens before the program counter jumps into the HardFault_Handler function is that, after several instructions, the IMPRECISERR flag of the CFSR register is set.

I tried moving the flash function call into the main code, either just after SystemClock_Config or just before the while(1), but surprisingly, the instruction where the IMPRECISERR flag is set does not change.

The point where this happens is the closing brace (yes, the "}" that ends the function) of this function:

bool Shaft_Measure(void)
{
	static int32_t aShaft_old;
	int32_t aShaft;
	msg.a_shaft = (int16_t)aShaft;
	if (ABS(aShaft - aShaft_old) > 1){
		bRead_speed = true;
	}
	else{
		bRead_speed = false;
	}
	aShaft_old = aShaft;
	return bRead_speed;
}

I'm sure that the problem is not there, but unfortunately I don't know how to proceed to identify the issue.

The MCU I'm using is the STM32F105RC, which has 256 kB of flash, and the value I'm trying to save is just a uint32_t number.
The constants used by the flash functions are defined in my flash.h, reported here:

#define FLASH_ADDR_PAGE_127 ((uint32_t)0x0803F800)
#define FLASH_USER_START_ADDR FLASH_ADDR_PAGE_127
#define FLASH_USER_END_ADDR (FLASH_ADDR_PAGE_127 + FLASH_PAGE_SIZE)
#define MEMORY_OFFSET ((uint32_t)0x4U)

These addresses have been copied from the device datasheet.
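The page address can also be cross-checked with a little arithmetic. The following is only a sketch: the base address 0x08000000, the 2 KB page size and the helper names are assumptions made for illustration (they are not taken from flash.h), so check them against the reference manual for your part.

```c
#include <stdint.h>

/* Assumed values for illustration only, not copied from the poster's flash.h. */
#define SKETCH_FLASH_BASE   0x08000000UL      /* assumed start of main flash */
#define SKETCH_PAGE_SIZE    0x800UL           /* assumed 2 KB page size      */
#define SKETCH_FLASH_BYTES  (256UL * 1024UL)  /* 256 kB, as stated in the post */

/* Start address of a given flash page (hypothetical helper). */
uint32_t sketch_page_address(uint32_t page)
{
    return (uint32_t)(SKETCH_FLASH_BASE + page * SKETCH_PAGE_SIZE);
}

/* Number of whole pages between two addresses, as used for NbPages. */
uint32_t sketch_page_count(uint32_t start, uint32_t end)
{
    return (uint32_t)((end - start) / SKETCH_PAGE_SIZE);
}
```

Under these assumptions page 127 starts at 0x0803F800 and is the last page of a 256 kB device, and the erase range from that address to the end of flash covers exactly one page.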

Here is my system:

Segger JLink base (with the latest ver: 7.96l)

STM32CubeIDE (ver: 1.15.1)

STM32CubeMX (ver: 6.11.1)

HAL libraries for STM32F1 (ver: 1.8.5)

 

Any help would be greatly appreciated!

    Best answer by waclawek.jan


    6 replies

    Graduate II
    June 13, 2024

    In this context "imprecise" means the fault comes from a deferred write, i.e. one that went through the write buffer.

    The code address of the fault is therefore not exact, as the write was started a few cycles earlier in the pipeline, and you're now executing later instructions.

    Look at what's actually reported, look at the address of the failed write, which will be correct, and look a little earlier in the code instructions.

    Addresses must be 4-byte / 32-bit aligned, and can't be written more than once per erase cycle.

    Have a Hard Fault handler that outputs actionable data, during development and in the field, so support techs can actually identify and fix issues.

    https://github.com/cturvey/RandomNinjaChef/blob/main/KeilHardFault.c
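As a minimal sketch of how the status words from such a handler can be read: the bit positions below are the ARMv7-M ones for CFSR and HFSR, while the macro and function names are hypothetical and not part of CMSIS or of the linked handler.

```c
#include <stdint.h>

/* Fault status bits as defined for ARMv7-M cores (Cortex-M3/M4). */
#define SKETCH_CFSR_PRECISERR    (1UL << 9)   /* precise data bus error          */
#define SKETCH_CFSR_IMPRECISERR  (1UL << 10)  /* imprecise data bus error        */
#define SKETCH_CFSR_BFARVALID    (1UL << 15)  /* BFAR holds the faulting address */
#define SKETCH_HFSR_FORCED       (1UL << 30)  /* escalated from another fault    */

/* Hypothetical helpers for reading a captured dump. Each returns 1 or 0. */
int sketch_is_precise_bus_fault(uint32_t cfsr)
{
    return (cfsr & SKETCH_CFSR_PRECISERR) != 0;
}

int sketch_is_imprecise_bus_fault(uint32_t cfsr)
{
    return (cfsr & SKETCH_CFSR_IMPRECISERR) != 0;
}

int sketch_bfar_is_valid(uint32_t cfsr)
{
    return (cfsr & SKETCH_CFSR_BFARVALID) != 0;
}

int sketch_is_forced_hard_fault(uint32_t hfsr)
{
    return (hfsr & SKETCH_HFSR_FORCED) != 0;
}
```

Fed with the values that appear in the dumps later in this thread (cfsr=00000400, hfsr=40000000), these report an imprecise bus fault escalated to a hard fault, with BFARVALID clear, so the bfar field in those dumps should not be read as the faulting address.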

    aga (Author)
    Graduate II
    June 17, 2024

    Hi @Tesla DeLorean ,

    first of all, many thanks for your message and for providing the code to catch the HardFault exceptions!

    I spent the entire day integrating the code you suggested into my project and I'm sure that it was time well spent.

    Here's what I ended up with:

    [Hard Fault]
    CPU registers dump:
    r0 = 00000000, r1 = 20000578, r2 = 200009EC, r3 = 00000000
    r4 = 20000A18, r5 = 24264CEE, r6 = 8715A7C8, sp = 2000FFE8
    r12= 0803F802, lr = 080045EB, pc = 080045F0, psr= 01000000
    bfar=E000ED38, cfsr=00000400, hfsr=40000000, dfsr=00000000, afsr=00000000
    Stack dump:
    00000001
    0800DC81
    20000A18
    24264CEE
    2000FF90
    08005833
    Instructions dump:
    B2DB 2B00 D00A F001 F8A9 4603 71FB 79FB (4618) F001 F8D1 4B3B 2200 701A 

     

    As I have Joseph Yiu's book "The definitive guide to ARM Cortex-M3 and Cortex-M4 processors...", I started reading about this fault and discovered that I could disable the write buffer to properly catch the point that triggered the bus fault. Unfortunately I couldn't set the DISDEFWBUF bit (SCnSCB->ACTLR), as my MCU (ARM Cortex-M3 revision r0p1) does not have it.

    So I went through the stack starting from the last address (0x08005833) and found these instructions (in the disassembly view):

     Reset_Handler:
    08005800: bl 0x80051a4 <SystemInit>
     68 ldr r0, =_sdata
    08005804: ldr r0, [pc, #44] @ (0x8005834 <Reset_Handler+51>)
     69 ldr r1, =_edata
    08005806: ldr r1, [pc, #48] @ (0x8005838 <LoopFillZerobss+18>)
     70 ldr r2, =_sidata
    08005808: ldr r2, [pc, #48] @ (0x800583c <LoopFillZerobss+22>)
     71 movs r3, #0
    0800580a: movs r3, #0
     72 b LoopCopyDataInit
    0800580c: b.n 0x8005814 <Reset_Handler+19>
     75 ldr r4, [r2, r3]
    0800580e: ldr r4, [r2, r3]
     76 str r4, [r0, r3]
    08005810: str r4, [r0, r3]
     77 adds r3, r3, #4
    08005812: adds r3, #4
     80 adds r4, r0, r3
    08005814: adds r4, r0, r3
     81 cmp r4, r1
    08005816: cmp r4, r1
     82 bcc CopyDataInit
    08005818: bcc.n 0x800580e <Reset_Handler+13>
     85 ldr r2, =_sbss
    0800581a: ldr r2, [pc, #36] @ (0x8005840 <LoopFillZerobss+26>)
     86 ldr r4, =_ebss
    0800581c: ldr r4, [pc, #36] @ (0x8005844 <LoopFillZerobss+30>)
     87 movs r3, #0
    0800581e: movs r3, #0
     88 b LoopFillZerobss
    08005820: b.n 0x8005826 <Reset_Handler+37>
     91 str r3, [r2]
    08005822: str r3, [r2, #0]
     92 adds r2, r2, #4
    08005824: adds r2, #4
     95 cmp r2, r4
    08005826: cmp r2, r4
     96 bcc FillZerobss
    08005828: bcc.n 0x8005822 <Reset_Handler+33>
     99 bl __libc_init_array
    0800582a: bl 0x800dc4c <__libc_init_array>
    101 bl main
    0800582e: bl 0x80044d4 <main>
    102 bx lr
    08005832: bx lr
     68 ldr r0, =_sdata
    08005834: movs r0, r0
    08005836: movs r0, #0
     69 ldr r1, =_edata
    08005838: movs r4, r1
    0800583a: movs r0, #0
     70 ldr r2, =_sidata
    0800583c: b.n 0x80050f8 <HAL_CAN_RxFifo0MsgPendingCallback+92>
    0800583e: lsrs r0, r0, #32
     85 ldr r2, =_sbss
    08005840: movs r0, r2
    08005842: movs r0, #0
     86 ldr r4, =_ebss
    08005844: lsrs r0, r3, #8
    08005846: movs r0, #0
    115 b Infinite_Loop
     WWDG_IRQHandler:
    08005848: b.n 0x8005848 <WWDG_IRQHandler>
    266 TST lr, #4

     

    To me the location (0x08005833) looks like it is in the reset handler, although the address does not match perfectly.

    Moreover, I don't know how to read the instructions dump.

    What steps should I take next?

    Many thanks! :folded_hands:
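On the address mismatch: code addresses stored in lr or on the stack of a Cortex-M carry the Thumb state in bit 0, so they appear one higher than the address shown in the disassembly. A small sketch of that arithmetic follows; the helper names are hypothetical, and the call-site helper assumes the call was a 4-byte `bl`, which is not true for every call instruction. Not every word on the stack is a return address either, so treat the result as a hint.

```c
#include <stdint.h>

/* Strip the Thumb bit (bit 0) from a stacked code address. */
uint32_t sketch_code_address(uint32_t stacked_value)
{
    return stacked_value & ~(uint32_t)1;
}

/* Address of the calling instruction, ASSUMING it was a 32-bit `bl`. */
uint32_t sketch_bl_call_site(uint32_t stacked_return_address)
{
    return sketch_code_address(stacked_return_address) - 4u;
}
```

With the values from the dump above, 0x08005833 becomes 0x08005832, the instruction right after `bl main` at 0x0800582e in the Reset_Handler listing. The halfword shown in parentheses in the instructions dump (4618) appears to be the one located at the reported pc.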

    Super User
    June 17, 2024

     

    pc = 080045F0

     

    What you want is to look at the disasm a couple of instructions before this address and, from the content of the other registers in the fault handler, discern which was the offending instruction. If you have a mixed disasm/C view, it's usually quite obvious what the problem in the source is.

    JW

    aga (Author)
    Graduate II
    June 17, 2024

    Hi @waclawek.jan,

    thanks for your help.

    Here are the register dump values and, below them, the disassembly around the address the PC points to after the HardFault is triggered.

    [Hard Fault]
    CPU registers dump:
    r0 = 00000000, r1 = 20000578, r2 = 200009E8, r3 = 00000000
    r4 = 20000A10, r5 = 64264CEE, r6 = 8715A5C8, sp = 2000FFE8
    r12= 0803F802, lr = 0800460F, pc = 08004614, psr= 01000000
    bfar=E000ED38, cfsr=00000400, hfsr=40000000, dfsr=00000000, afsr=00000000
    Stack dump:
    00000001
    0800DCA9
    20000A10
    64264CEE
    2000FF90
    0800585B
    Instructions dump:
    B2DB 2B00 D00A F001 F8A9 4603 71FB 79FB (4618) F001 F8D3 4B3B 2200 701A 

    Disassembly around the address: 0x08004614

    279 		if (DEV_VAR_X == NDevice_variant){
    080045f8: ldr r3, [pc, #252] @ (0x80046f8 <main+512>)
    080045fa: ldrb r3, [r3, #0]
    080045fc: cmp r3, #0
    080045fe: bne.n 0x8004620 <main+296>
    282 			if (bRun_task_shaft_meas){
    08004600: ldr r3, [pc, #260] @ (0x8004708 <main+528>)
    08004602: ldrb r3, [r3, #0]
    08004604: uxtb r3, r3
    08004606: cmp r3, #0
    08004608: beq.n 0x8004620 <main+296>
    285 				bool bRead_speed = Shaft_Measure();
    0800460a: bl 0x8005760 <Shaft_Measure>
    0800460e: mov r3, r0
    08004610: strb r3, [r7, #7]
    286 				Shaft_Speed(bRead_speed);
    08004612: ldrb r3, [r7, #7]
    08004614: mov r0, r3
    08004616: bl 0x80057c0 <Shaft_Speed>
    289 				bRun_task_shaft_meas = false;
    0800461a: ldr r3, [pc, #236] @ (0x8004708 <main+528>)
    0800461c: movs r2, #0
    0800461e: strb r2, [r3, #0]
    298 		if (DEV_VAR_X == NDevice_variant){
    08004620: ldr r3, [pc, #212] @ (0x80046f8 <main+512>)
    08004622: ldrb r3, [r3, #0]
    08004624: cmp r3, #0
    08004626: bne.n 0x8004638 <main+320>
    301 			if (bRun_task_LEDs){
    08004628: ldr r3, [pc, #224] @ (0x800470c <main+532>)
    0800462a: ldrb r3, [r3, #0]
    0800462c: uxtb r3, r3
    0800462e: cmp r3, #0
    08004630: beq.n 0x8004638 <main+320>
    306 				bRun_task_LEDs = false;
    08004632: ldr r3, [pc, #216] @ (0x800470c <main+532>)
    08004634: movs r2, #0
    08004636: strb r2, [r3, #0]

    The corresponding C code is this:

    #ifdef	USE_HALL_SENSORS
    
    		if (DEV_VAR_X == NDevice_variant){
    
    			/* Run the shaft measurement task - triggered by HAL_TIM_IC_CaptureCallback() */
    			if (bRun_task_shaft_meas){
    
    				bool bRead_speed = Shaft_Measure();
    				Shaft_Speed(bRead_speed);
    
    				bRun_task_shaft_meas = false;
    			}
    		}
    
    #endif	// USE_HALL_SENSORS

    Following your suggestion, as the PC points to the Shaft_Speed(bRead_speed) call, the error should be around here.

    void Shaft_Speed(bool bRead_speed)
    {
    	if (bRead_speed){
    		WG_TX_1_msg.n_shaft = (uint16_t)(1000000/htim4.Instance->CCR1);
    	}
    	else{
    		WG_TX_1_msg.n_shaft = 0U;
    	}
    }

    Here I don't really see anything wrong, as the code is very simple. Looking a bit earlier, there is the other function, Shaft_Measure (reported in the first post), which also looks good to me, so I have no idea where the issue could be.

    Any other help please?

    Many thanks!

    Super User
    June 17, 2024

    I'd say the problem happens in

    08004610: strb r3, [r7, #7]

    Can we know r7 from the hardfault handler?

    OTOH it's a local variable, so it should be located on the stack; I don't quite understand what might have happened to r7 (which is probably the local frame pointer, i.e. it points to the stack where local variables are allocated).

    Is the fault reproducible? Can you get content of r7?

    JW

    aga (Author)
    Graduate II
    June 18, 2024

    Hi @waclawek.jan,

    the fault happens every time I run the code, but I'm not sure I could reproduce it in a minimal example... It would take some time (that I do not have) to create a similar program.

    The value of r7 while the function Shaft_Measure executes is always 0x2000FFD8, which contains 0x02F80308, and it never changes.

    Then, after the HardFault handler in the startup assembly is invoked, it changes to 0xFFFFFFFF.

    When the hard_fault_handler_c function executes, it changes again to 0x2000F90, which now contains 0xC8A50587 (that was missing from the first dump - sorry).

    I really hope that this could help to find the issue.

    Many thanks!

    Super User
    June 18, 2024

    Why does the offending code's address keep changing?

    Are you adding/removing code for the various experiments? That makes it a moving target, harder to aim and hit.

    In the above code, Shaft_Measure() is irrelevant - r7 is pushed to the stack at the beginning and popped back at the end, i.e. it's unchanged. As your comment also indicates, r7 was already 0xFFFFFFFF at the point where Shaft_Measure() was called, and that's an incorrect value, as r7 is used as the stack frame pointer, i.e. it should point somewhere near the top of the stack.

    In other words, go back to the beginning of the calling function (i.e. top of function which called Shaft_Measure()), and observe how r7 is set up there.

    JW

    aga (Author)
    Graduate II
    June 18, 2024

    Hi @waclawek.jan,

    I noticed some inconsistency as well, because I changed the code a bit (though never its functionality), but you are right: next time I'll make a new branch to avoid changes in the code.

    Understood, I'll try to trace r7 and see where it changes to the wrong value (0xFFFFFFFF).

    Many thanks!

    Super User
    June 20, 2024

    @Tesla DeLorean,

    it's simpler than that. Look at the end-of-loop condition of this function:

    void FLASH_Read_Data (uint32_t StartPageAddress, uint32_t *data, uint16_t n_words)
    {
    	while (1)
    	{
    		*data = *(__IO uint32_t *)StartPageAddress;
    		StartPageAddress += MEMORY_OFFSET;
    		data++;
    		if (!(n_words--)) break;
    	}
    }

    It reads from FLASH and writes to the destination (which is given by pointer) one more word than n_words, so

    uint32_t last_boot_counter, new_boot_counter;
    	FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U);

    writes 2 words to the stack (as last_boot_counter is a local variable), and trashes what's just after last_boot_counter, which happens to be the stacked stack frame pointer in r7 from the caller function. And that trashed stack frame pointer then just propagates to the point where it's actually used, resulting in the fault.

    @aga,

    > how should I start to write code to defend myself from these hidden traps?

    I could preach here "proper coding methods" etc. Some folks believe religiously in the power of tools like static checkers or various fancy programming languages.

    But the naked truth is, humans make errors. So, the practical solution is to a) try to be as meticulous as possible within reasonable boundaries; b) be prepared for errors at various levels.

    In this particular case, for example, in FLASH_Read_Data(), I would use the for() loop rather than do/while() or while(). There's a reason for the for() loop - it's usually perceived as a simple alternative and it's more readily understood, if one sticks to the simple for(i = 0; i < max; i++) pattern; so it's less likely to result in error.
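A minimal sketch of that for() pattern applied to the read function. The signature is changed here as an assumption, for illustration: it takes a source pointer rather than a raw address, so the same code can be exercised on a PC against an ordinary array; the name flash_read_words is hypothetical.

```c
#include <stdint.h>

/*
 * Copy exactly n_words 32-bit words from src to dst.
 * On the target, src would be something like
 * (const volatile uint32_t *)FLASH_USER_START_ADDR.
 */
void flash_read_words(const volatile uint32_t *src, uint32_t *dst, uint16_t n_words)
{
    for (uint16_t i = 0; i < n_words; i++) {
        dst[i] = src[i];   /* one word per iteration, none when n_words is 0 */
    }
}
```

With n_words equal to 1 this writes a single word, and with n_words equal to 0 it writes nothing, which is the behaviour the original caller expected.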

    And, also, in this particular case, you have been able to track down the root cause and mitigate it (by using the correct FLASH_Read_Data()). I never consider "the problem went away, although I'm not sure why" to be an acceptable solution; that's why to me it's very important to maintain the problem (i.e. don't make any subsequent changes, unless I can surely undo them) until I am absolutely sure what the root cause was and that I have removed it.

    JW

     

    aga (Author)
    Graduate II
    June 20, 2024

    Hi @Tesla DeLorean  and @waclawek.jan.

    Many thanks for the time spent on this thread. Much appreciated! :folded_hands:

    @waclawek.jan

    I now understand the issue. As you probably noticed, I hadn't marked my post as a solution because I hadn't fully grasped the root cause.

    I've marked it just now thanks to your explanation. However, I still want to understand why the stack pointer retained that value. Do you have any idea?

    I agree about the for loop too, I normally use it.

    Cheers

    Super User
    June 20, 2024

    It's not "stack pointer" but "pointer to stack frame". Stack frame is the name of the space on stack the compiler makes upon entry to a function, so that it can place local variables there and, in some cases, also temporary variables it is unable to hold in registers (so called "register spills"). Compiler needs to maintain a pointer to that space, as it is not a fixed value, it depends on where top of stack (pointed to by stack pointer) was at the moment the function was called.

    But upon function entry, before creating the stack frame, the compiler needs to push on stack registers it intends to modify in that function (except r0-r3, which are used as parameters upon entry/return values upon exit/freely modifiable in the function).

    For example:

     FLASH_BootCounter:
    08002f98: push {r7, lr}		--> r7 = 0x2000FFF0 (contains 0x20000A10), lr contains the returning address
    08002f9a: sub sp, #8			--> sp = 0x2000FFE0 (contains 0x2000FF0)
    08002f9c: add r7, sp, #0		--> r7 = 0x2000FFF0 sp = 0x2000FFD8 (contains 0x8003B27)

    Here, in the first instruction, r7 and lr (the return address) are pushed, as the compiler intends to use r7 in that function; it then moves the top of stack to make space for the stack frame - here for two local variables - and stores that address into r7, to be used as a pointer when those local variables are accessed.

    So, on stack, we have

    r7-> [new_boot_counter] [last_boot_counter] [saved r7 of caller] [lr i.e. return address] [... stack content of caller ...]

    Now calling FLASH_Read_Data(FLASH_USER_START_ADDR, &last_boot_counter, 0x1U); wrote two words instead of one, so it wrote last_boot_counter, but also corrupted [saved r7 of caller]. When FLASH_BootCounter finished:

    08002fc4: pop {r7, pc}		--> r7 = sp = 0x2000FFE0 (contains 0xFFFFFFFF)
    									--> sp = 0x2000FFE8, r7 = 0xFFFFFFFF

    that corrupted r7 is popped from the stack. Execution returned to the function which called FLASH_BootCounter() (I don't know what it is called, let's call it X), where apparently r7 is again used as the stack frame pointer for the local variables of function X. And then later, in function X, a local variable bRead_speed is accessed:

    bool bRead_speed = Shaft_Measure();

     (while bRead_speed is defined in the middle of the function, the compiler gathers all local variables throughout the function and creates one common frame at the beginning of the function). However, r7 now points to invalid address and that access then causes the fault.
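The overwrite can be reproduced on a PC with a small model of that frame. This is only a sketch under stated assumptions: the frame is modelled as a plain array in the order given above, the saved r7 and lr values are illustrative, the flash page is assumed to be erased (all 0xFFFFFFFF), and the loop reads from an array rather than from a flash address. All names are hypothetical.

```c
#include <stdint.h>

/* Frame layout from the explanation above, lowest address first. */
enum { NEW_BOOT_COUNTER, LAST_BOOT_COUNTER, SAVED_R7, SAVED_LR, FRAME_WORDS };

/* Same loop as FLASH_Read_Data() in the thread, with an array as source. */
static void sketch_buggy_read(const uint32_t *src, uint32_t *data, uint16_t n_words)
{
    while (1)
    {
        *data = *src;
        src++;
        data++;
        if (!(n_words--)) break;   /* tests before decrementing: n_words + 1 passes */
    }
}

/* Returns the word the function epilogue would pop into r7. */
uint32_t sketch_popped_r7(void)
{
    const uint32_t erased_flash[2] = { 0xFFFFFFFFu, 0xFFFFFFFFu };  /* assumed erased */
    uint32_t frame[FRAME_WORDS] = { 0u, 0u, 0x2000FFF0u, 0x08003B27u };

    /* Asks for 1 word; the loop writes LAST_BOOT_COUNTER and SAVED_R7. */
    sketch_buggy_read(erased_flash, &frame[LAST_BOOT_COUNTER], 1);
    return frame[SAVED_R7];
}
```

In this model the slot holding the caller's r7 ends up containing 0xFFFFFFFF, matching the value reported in the thread, while the return address slot is left untouched, which is why execution still returns to the right place.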

    JW

    Super User
    June 22, 2024

    > is there any (debug) tool that could have saved me from all of this?

    This is similar to your previous question:

    >> how should I start to write code to defend myself from these hidden traps?

    And, again, the answer is - as you probably know - disappointing: there is no magical debug tool in the same way as there is no magic tool preventing errors. (There are folks promoting such tools, nonetheless.)

    Errors do happen, and some errors are simply very complicated. The best thing we can do is to try to prevent making errors as much as possible (see above), and to have a full toolbox of debugging tools (i.e. not hoping for one single magic tool).

    In other words, the hard way.

    Btw. note that you got quite lucky here, as the fault happened not too far from the place where the bug was. Had the bug overwritten some unrelated variable on the stack, or resulted in a complicated chain of events, the symptoms might have appeared very far from, and seemingly unrelated to, the point where the bug "acted", with much more time and effort spent on hunting it down. But this is all part of our job; it does and will happen, and we have to be prepared for its occurrences.

    JW