Skip to main content
Graduate II
May 14, 2024
Question

Data-like flash read with memory barrier?

  • May 14, 2024
  • 18 replies
  • 4296 views

Heyho,

I recently finished - or so I thought - the ethernet bootloader for STM32H73x.

The bootloader can update the application and vice versa, updating, jumping back and forth, all working fine.

Until yesterday all of a sudden (when I was working on some other http POST stuff unrelated to the internal flash) the bootloader somehow got stuck when verifying the freshly written application.
Verification goes like:
- read (Octo-) SPI flash image page (256 bytes)
- compare SPI flash buffer to internal flash via pointer

The hard fault info that I got was not really useful, I was looking for all kinds of stuff, but could not solve the problem.

Until I added some memory barriers, as you can see below - combined with loading each flash byte into a variable, and then comparing (before that I used directly used the flash byte pointer sFlashIntCtl.pu8ChkAddr[i] in the for loop).
Now it's working again.

BUT... does that make sense?
Or was that just coincidence, and I have another problem?

				/* ++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
				/* read from SPI flash
				 *	BLOCKING &
				 *	### outside of state machine ###
				 */
				sFlashIntCtl.u8ChkError = OspiFlashRdPage();
				__DSB();

				if( sFlashIntCtl.u8ChkError != HAL_OK )
				{
					#if DEBUG_FLASH_INT
						uart_printf("\n\r# ERR: FLINT_STATE_CHECK OspiFlashRdPage()\n\r");
					#endif 	/* DEBUG_FLASH_INT */

					sFlashIntCtl.u32ChkErrors++;
					u8FlashIntState = FLINT_STATE_ERROR;
				}
				else
				{
					/* compare */
					uint32_t u32ErrOld = sFlashIntCtl.u32ChkErrors;
					uint8_t u8FlashByte = 0;

					/* compare internal flash to SPI flash page buffer */
					for( uint32_t i = 0; i < (uint32_t)OSPI_FLASH_PAGE_SIZE; i++ )
					{
						u8FlashByte = sFlashIntCtl.pu8ChkAddr[i];
						__DSB();
						if( u8OspiFlashPageBuf[i] != u8FlashByte )
						{
							sFlashIntCtl.u32ChkErrors++;
						}
						if( (sFlashIntCtl.u32ChkBtDone + i) >= (sSpiFileInfo.u32Size - 1) ) break;
					}
					sFlashIntCtl.u32ChkBtDone += sFlashIntCtl.u32ChkLen;

 

PS: why is that very useful </> code insert button back in line 2? This should be the first one to appear, it's surely more often used than other stuff that appears first.

    This topic has been closed for replies.

    18 replies

    Super User
    May 16, 2024

    As I've said, you have to look at the stack for the PC before the fault handler is called (actual content of PC points to the fault handler). Then look at the code where this reconstructed PC points, in particular look before that point (in case of precise error it should be the previous instruction, in case of imprecise error it may be a couple of instructions before that. You will probably see some access - load/store - to the offending address (i.e. here through [r1]. To find out what and why is that access, you have to read a bit the asm, together with sources - that's why mixed source/asm is the best way to look at these things.

    IDEs may help to decipher this for you, i.e. show the reconstructed registers, point to the source where the reconstructed PC points, show stack backtrace - I don't use IDE.

    JW

    LCEAuthor
    Graduate II
    May 16, 2024

    I hope and think I got it:

    lwIP's httpd calls http_parse_request() calls http_find_file() which also checks the URI for CGI commands.

    CGI application info is stored in a table containing 2 pointers for each entry, one to a constant string (like "/fileup.cgi") and the other to the respective function (called something like CGI_Handler_FileUp()).
    These are constants stored in flash.

    I check the contents and addresses of these at http server init and get these results, which make sense:

    CGI HttpServerInit():
    CgiTable[0].pszCGIName = /fileup.cgi @ 080239BC
    CgiTable[1].pszCGIName = /admin.cgi @ 080239B1

    When not working with the internal flash, these pointers seem to work well.

    But when there's flash writing going on, these constants cannot be accessed, and somehow some weird value is assigned, see picture below.
    And these are the values found in BFAR and MMAR (0x53535353).

    HardFault_FlashWrite_H735_httpd.png

    LCEAuthor
    Graduate II
    May 16, 2024

    Further info:

    I trigger the flash programming, then let the STM32 do other stuff - which makes absolutely no sense, because this H7 canNOT do anything while flash is erased or programmed.

    So I changed that to "blocking".

    Now even with that, the SysTick seems to be crazy, which I use for a timeout check... SysTick okay, stoopid me...

    BTW, in debug mode I was looking for the stack and stuff in the right top window, until I found it very comdortable on the left...

    Super User
    May 16, 2024

    > somehow some weird value is assigned

    Place data breakpoint (a.k.a. watchpoint) at the variables which are overwritten, and observe what overwrites them.

    JW

    LCEAuthor
    Graduate II
    May 17, 2024

    > Place data breakpoint (a.k.a. watchpoint) at the variables which are overwritten, and observe what overwrites them.

    I cannot find the setting for that in STM32CubeIDE.
    I have these variables in the "Expressions" window, but that doesn't show me when they are changed.

    BUT again my UART debugging helps:

    Right before the very first FlashIntWrPage() (programming 1 flash page in application area):

    httpd_cgis[0].pszCGIName = /fileup.cgi @ 08023B27
    httpd_cgis[0].pfnCGIHandler is @ 080025E1
    httpd_cgis[1].pszCGIName = /admin.cgi @ 08023B1C
    httpd_cgis[1].pfnCGIHandler is @ 080026B5

    Right after the very first FlashIntWrPage() these pointers are reset:
    httpd_cgis[0].pszCGIName = @ 00000000
    httpd_cgis[0].pfnCGIHandler is @ 00000000
    httpd_cgis[1].pszCGIName = @ 00000000
    httpd_cgis[1].pfnCGIHandler is @ 00000000

    This is really crazy...

    Here's the FlashIntWrPage function - it checks the address to be written to make sure that only addresses in the application area are written (FLASH_INT_ADDR_START_APP = 0x08040000).
    It's basically like HAL_FLASH_Program(), without the extra bank option.

    /* +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
    /* FLASH write page
     */
    /**
     * @brief write page / word of 256 bits = 8x 32bit words
     *				waiting for programming page done in state machine
     *				only writing APPLICATION flash from bootloader
     * @PAram u8WrBlock	wait here for finish
     *				write settings elsewhere,
     *				for now fixed page buffer is used
     * @retval 0 = ok
     */
    uint8_t FlashIntWrPage(uint8_t u8WrBlock)
    {
    	uint8_t u8RetVal = 0;
    
    	/* check source pointer */
    	if( NULL == sFlashIntCtl.pu8WrBuf )
    	{
    		#if DEBUG_FLASH_INT
    			uart_printf("# ERR: FlashIntWrPage(): pu8WrBuf = NULL\n\r");
    		#endif 	/* DEBUG_FLASH_INT */
    		return HAL_ERROR;
    	}
    
    	/* check flash destination address */
    	if( sFlashIntCtl.u32WrAddr < FLASH_INT_ADDR_START_APP ||
    		sFlashIntCtl.u32WrAddr >= FLASH_INT_ADDR_END_APP )
    	{
    		#if DEBUG_FLASH_INT
    			uart_printf("# ERR: FlashIntWrPage(): u32WrAddr out of range\n\r");
    		#endif 	/* DEBUG_FLASH_INT */
    		return HAL_ERROR;
    	}
    
    	/* check LOCK bit */
    	if( FLASH->CR1 & FLASH_CR_LOCK )
    	{
    		if( FlashIntUnlock() != HAL_OK )
    		{
    			#if DEBUG_FLASH_INT
    				uart_printf("# ERR: FlashIntWrPage(): FlashIntUnlock()\n\r");
    			#endif 	/* DEBUG_FLASH_INT */
    			return HAL_ERROR;
    		}
    	}
    
    	/* check if last operation done */
    	if( FlashIntWaitLastOp(FLASH_INT_TO_BUSY_MS) != HAL_OK )
    	{
    		#if DEBUG_FLASH_INT
    			uart_printf("# ERR: FlashIntWrPage(): QW\n\r");
    		#endif 	/* DEBUG_FLASH_INT */
    		return HAL_ERROR;
    	}
    
    	/* wait for BUSY release to write registers */
    	if( FlashIntCheckBusyTo(FLASH_INT_TO_BUSY_MS) != HAL_OK )
    	{
    		#if DEBUG_FLASH_INT
    			uart_printf("# ERR: FlashIntWrPage(): BUSY\n\r");
    		#endif 	/* DEBUG_FLASH_INT */
    		return HAL_ERROR;
    	}
    
    	/* reset some error flags */
    	/* get error flags */
    	uint32_t u32RegVal = (FLASH->SR1 & FLASH_INT_SR_ERR_ALL);
    
    	/* in case of error reported in Flash SR1 */
    	if( u32RegVal != 0 )
    	{
    		/* save error code*/
    		sFlashIntCtl.u32SrErrors |= u32RegVal;
    		/* clear error programming flags */
    		FLASH->CCR1 = u32RegVal;
    
    		#if DEBUG_FLASH_INT
    			uart_printf("# ERR: FlashIntWrPage(): SR errors = %08lX\n\r", sFlashIntCtl.u32SrErrors);
    		#endif 	/* DEBUG_FLASH_INT */
    		return HAL_ERROR;
    	}
    
    	/* check SR1 once more */
    	u32RegVal = FLASH->SR1;
    	if( u32RegVal != 0 )
    	{
    		#if DEBUG_FLASH_INT
    			uart_printf("# ERR: FlashIntWrPage(): SR = %08lX\n\r", u32RegVal);
    		#endif 	/* DEBUG_FLASH_INT */
    		return HAL_ERROR;
    	}
    
    	/* +++++++++++++++++++++++++++++++++++++++++++ */
    	/* start programming */
    
    __disable_irq();
    
    	__IO uint32_t *pu32SrcAddr = (__IO uint32_t *)sFlashIntCtl.pu8WrBuf;
    	__IO uint32_t *pu32DstAddr = (__IO uint32_t *)sFlashIntCtl.u32WrAddr;
    	uint8_t u8RowIndex = FLASH_NB_32BITWORD_IN_FLASHWORD;	/* = 8 -> * 32= 256 */
    
    	/* set PG program enable bit */
    	FLASH->CR1 |= FLASH_CR_PG;
    
    	__ISB();
    	__DSB();
    
    	/* program the flash word */
    	do
    	{
    		*pu32DstAddr = *pu32SrcAddr;
    		pu32DstAddr++;
    		pu32SrcAddr++;
    		u8RowIndex--;
    	} while( u8RowIndex != 0 );
    
    	__ISB();
    	__DSB();
    
    __enable_irq();
    
    /* wait in state machine ? */
    #if( 1 )
    	if( FLASH_INT_WRITE_BLOCK == u8WrBlock )
    	{
    		/* wait for last operation to be completed */
    		//u8RetVal = FlashIntWaitLastOp(FLASH_INT_TO_PROG_PAGE);
    		uint32_t u32TickWrStr = HAL_GetTick();
    		do
    		{
    			u8RetVal = FlashIntCheckLastOp();
    			if( (HAL_GetTick() - u32TickWrStr) > FLASH_INT_TO_PROG_PAGE )
    			{
    				u8RetVal = HAL_TIMEOUT;
    				break;
    			}
    		} while( u8RetVal != HAL_OK );
    
    		/* if the program operation is completed, disable PG bit */
    		FLASH->CR1 &= ~FLASH_CR_PG;
    	}
    #endif
    
    	return u8RetVal;
    }

     

     

    Super User
    May 17, 2024
    LCEAuthor
    Graduate II
    May 17, 2024

    All of a sudden the compiler moved the variable CGI table to the start of AXI SRAM D1 - no more problems.
    No matter what I tried, the compiler didn't move it anywhere else in further compilations.

    But I still have the "application", which showed the same behavior when updating the bootloader.

    And yes, the current app does the same:

    It looks like the variable holding the CGI table pointers above gets reset while reading from OSPI flash, when the "bytes received counter" is incremented.

    It looks like the pointer is moved to somewhere else, in this case to one of the http constants for pdfs...   

    Now I'd like to understand: how can this happen?

    Picture 1: CGI table as it should be

    HardFault_H735_CGI_good.png

     

    Picture 2: CGI table changed by OSPI flash reading

    HardFault_H735_CGI_bad.png

     

    Super User
    May 17, 2024

    This may be a case of runaway pointer which causes random places in RAM to be overwritten.

    That's why I recommended to capture the moment when it's getting overwritten using data breakpoint (watchpoint).

    However, note, that when using data breakpoints, execution does not stop exactly on the instruction which causes the access to watched variable/area, but a couple of instructions after that.

    JW

    LCEAuthor
    Graduate II
    May 17, 2024

    @waclawek.jan thanks again...

    > This may be a case of runaway pointer which causes random places in RAM to be overwritten.

    How is that possible, any idea?

    In this case, the CGI table is set / initialized at start, and then never written / set again.

    I don't use any memory allocation - except for lwIP, which has it's own heap, and that is big enough for sure (checked with lwIP memory statistics, and confirmed by lots of troubles if it gets too small).

    I now know how to solve this particular case, but I'd like to understand what's going on to prevent that in the future.

    Super User
    May 18, 2024

    I put forth, I don't claim runaway pointer *is* the root of your problem (but it much sounds like that).

    > How is that possible, any idea?

    The causes of runaway pointer (or array index, which in C is the same) are many and may be very indirect. For example, accepting a pointer or index from "outside" of the program (e.g. from some data stored in external memory, which got damaged through hardware problem either with the memory or with transport paths; from communication with other system, etc.); bug in program using data from an area already reused (e.g. incorrectly used data from circular buffer already overwritten by other incoming data e.g. by DMA); bug in program stemming from atomicity-related issues (i.e. sharing nonatomic data between interrupt and "main", or  between threads in multitasking, which is the same). In other words, I can't pinpoint, but not even give you a clue, as for where the error stems from, as it may be very subtle and very much depends on everything (e.g. even timing) in your particular program and its surrounding environment.

    > In this case, the CGI table is set / initialized at start, and then never written / set again.

    Result of runaway pointer is damaged variable(s), but this variable has absolutely no relationship to the pointer which run away. So you can't/shouldn't investigate the ways how CGI table is intended to be set in your program. In other words, you should mentally abstract from your source, and think how the brutal (from Latin brūtus (dull, ***)) machine (the processor) works - it simply doesn't care about your source, variables, structures, etc.

    > I'd like to understand what's going on to prevent that in the future.

    That's why it's important to find the exact cause, not just cover up by a symptomatic patch.

    JW