STM32H7 with FatFS on eMMC sometimes failing

January 26, 2024

Hello everyone,

I have a project using the STM32H7 with FreeRTOS and FatFS (ChaN) running on a 4GB eMMC (Kingston) through the SDMMC1 peripheral.

Normally everything works fine. SDMMC1 and FatFS (FAT32) are initialized in main, before the OS starts, and are used later on for many things such as creating, writing and reading files.

However, we found some devices that were working fine and suddenly (after a couple of thousand writes to the eMMC, which I'd assume is not too intense) started showing a constant and reproducible error. FatFS initializes OK, but as soon as we want to write something (still in main, before the OS starts), it fails. Basically, opening a file (that already exists) fails with FR_DISK_ERR and we cannot use FatFS anymore. Debugging a bit, I saw that at the lowest level HAL_MMC_ReadBlocks (from stm32h7xx_hal_mmc.c) is failing with HAL_TIMEOUT.

However, if I send a "reset" command to our device (it basically just executes CMSIS's __NVIC_SystemReset on the STM32), everything works fine on the following boot. Both the issue and the "fix" are consistent: once they have happened once, they happen every time. FYI, the eMMC is also powered off/on when booting (we control this with dedicated circuits driven by the main uC).

I'm aware there are many layers where the failure could be:
1) HW? But then why do most devices work fine, and why are the failing ones consistent once they start failing?
2) ST HAL: could some corner case be triggering a bug?
3) FatFS: does something related to the FAT get corrupted inside the eMMC? We do mount and access the eMMC via USB, so I tried formatting it (from a PC with Windows) and it didn't make a difference...
4) A combination of electrical/eMMC/firmware conditions?

I suspect that the conditions are somehow also related to using Standby low-power mode, since I cannot reproduce the issue if I only use Stop mode. I don't see why this would matter, since as far as I understand the STM32 reboots when waking up from Standby mode, and therefore everything should be initialized the same way.

Since the problem goes away after a SW reset but comes back on the next boot after a power off/on, I figured maybe someone could identify another potential cause... Any ideas of what it could be or what I should check?

Thanks in advance!

This topic has been closed for replies.

Tesla DeLorean

Graduate II

The FatFs that ST ships is quite old and has some issues.

If the file system gets trashed that tends to be persistent.

Use a USB MSC to allow connection to PC and run CHKDSK against the problem volume.

Error propagation in FatFs isn't particularly good. You'd want to instrument DISKIO to understand if the error was occurring at the peripheral or card/chip level, and what error specifically. Use DMA, polled mode is too fragile.
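To make the instrumentation idea concrete, here is a minimal sketch of a helper that a DISKIO glue layer could call after a failed transfer to sort the error word into "host", "bus" or "card" buckets. The `DIAG_ERR_*` names, the bit values (chosen to mirror the SDMMC error bit layout as I remember it from the ST driver headers) and the three buckets are assumptions for illustration; check them against your HAL version before relying on them.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout, modelled on the SDMMC error word; verify against your HAL headers. */
#define DIAG_ERR_CMD_CRC_FAIL     0x00000001u
#define DIAG_ERR_DATA_CRC_FAIL    0x00000002u
#define DIAG_ERR_CMD_RSP_TIMEOUT  0x00000004u
#define DIAG_ERR_DATA_TIMEOUT     0x00000008u
#define DIAG_ERR_TX_UNDERRUN      0x00000010u
#define DIAG_ERR_RX_OVERRUN       0x00000020u

/* Hypothetical buckets: where the failure most likely originated. */
typedef enum { DIAG_NONE, DIAG_HOST, DIAG_BUS, DIAG_CARD } diag_origin_t;

/* Map an error word to the most likely origin of the failure. */
diag_origin_t diag_classify(uint32_t err)
{
    if (err == 0u) {
        return DIAG_NONE;
    }
    /* FIFO underrun/overrun: the MCU side did not service the FIFO in time. */
    if (err & (DIAG_ERR_TX_UNDERRUN | DIAG_ERR_RX_OVERRUN)) {
        return DIAG_HOST;
    }
    /* CRC failures usually point at signal integrity or clocking. */
    if (err & (DIAG_ERR_CMD_CRC_FAIL | DIAG_ERR_DATA_CRC_FAIL)) {
        return DIAG_BUS;
    }
    /* Everything else (timeouts, unknown bits): the device did not answer as expected. */
    return DIAG_CARD;
}

/* Short label for log output. */
const char *diag_origin_name(diag_origin_t origin)
{
    static const char *const names[] = { "none", "host", "bus", "card" };
    assert((unsigned)origin < sizeof names / sizeof names[0]);
    return names[origin];
}
```

In a disk_read/disk_write implementation one would log `diag_origin_name(diag_classify(hmmc.ErrorCode))` together with the HAL return status, since a software timeout may leave the error word unchanged.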

TVare.1 (Author)

Explorer

Thanks everyone for your answers!

@Tesla DeLorean
Actually I don't think I have the version that ST ships, since I remember having updated it recently. I'm using R0.15 w/patch1.
We are able to expose the device as USB MSC, so I tried CHKDSK on Windows. It said that the drive had no issues. I ran it anyway, and it seemed to actually do something, as the device worked with no errors after that. However, after a couple of boots the issue was back. So this gave me the idea that the issue might indeed be related to FatFS corruption, but that CHKDSK is not enough to fix it completely.
Regarding the failure, I'll try to debug it again to be sure of the exact reason. We'll also try to implement DMA for write/read.
Since you mentioned that once the file system gets trashed the problem tends to persist, and considering that this FatFS implementation doesn't have any repair tool, would you recommend any other measure to keep the FatFS from getting corrupted in the first place?

@AScha.3
Thanks for your insight. In the near future I'll actually have access to some other memories, even from SanDisk. So this idea of measuring access times seems like a good approach to evaluate "quality" of the memory, as I guess that could increase the chances of the FatFS getting corrupted eventually.

@tjaekel
It is a 4GB NAND flash memory with eMMC interface. It is located on the same board, just 2cm away. The hardware is supposedly already validated (although I'm not ruling anything out at the moment). I did try lowering the speed on devices that already had the failure, and it didn't make a difference. It might be useful to lower the speed even on devices that are still OK (as long as this doesn't affect performance) to keep the FatFS from getting corrupted. Anyway, we are already using the slowest modes that the memory supports (32MHz at the moment).
As I told Tesla, I'll try to narrow down the failure reason and will let you know.

Thanks again!

AScha.3

Super User

Hi,

I cannot say anything about eMMC, because we use (at work) only SD cards, but afaik the memory is also flash, and we have (maybe) a similar problem.

Some cards fail with FR_DISK_ERR after months or more than a year of use. I tried to get the "bad" cards and found:

One was "dead": nothing can be done with it any more in a (Windows) PC card adapter; on my Linux PC the "drive tool" shows one unknown partition, with no read, mount, format or test possible, no access at all. (Kingston and Transcend is written on the cards.)

Other "bad" ones cannot be read or mounted, but formatting is still possible. After a new format they work again.

So I put some music (WAV files) on them to see how they perform in my music player (on H743, SDIO 4-bit mode).

Interesting symptom: looking with the scope at the read time (I set/reset a pin when f_read starts/finishes), always with 8 KB blocks, the read time jitters a lot; everything from 1 ms to >50 ms shows up.

A new card (SanDisk) shows some jitter, 1 ms to 1.5 ms, not more.

A new Kingston "Canvas Select" 64GB shows 1...40 ms!! Most of the time around 1 ms, but with some "dropouts" up to 40 ms.

So my assumption now: this time is an indication of the real quality of the accessed flash area. Long and very long read times indicate that the card controller has to repeat reads and do a lot of error correction until it has pieced the data together. Here the (more expensive) SanDisk cards are obviously better, so for now we use only SanDisk, and next year I can tell whether they are really better/more durable or not.

Maybe you could do the same test: write some big files to the card, read constant-size blocks in a loop and watch a pin that you set/reset around the read.
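If no scope is at hand, the same measurement could be collected in firmware. The following is only a sketch: it assumes some microsecond timestamp source is sampled before and after each f_read, and the `lat_*` names and the "slow read" threshold are hypothetical. It keeps min, max, mean and a count of slow reads so that cards can be compared by their jitter.

```c
#include <assert.h>
#include <stdint.h>

/* Running statistics over read durations, in microseconds. */
typedef struct {
    uint32_t min_us;
    uint32_t max_us;
    uint64_t sum_us;
    uint32_t count;
    uint32_t over_limit;  /* number of samples slower than limit_us */
    uint32_t limit_us;    /* threshold for counting a read as a "dropout" */
} lat_stats_t;

void lat_init(lat_stats_t *s, uint32_t limit_us)
{
    assert(s != 0);
    s->min_us = UINT32_MAX;
    s->max_us = 0u;
    s->sum_us = 0u;
    s->count = 0u;
    s->over_limit = 0u;
    s->limit_us = limit_us;
}

/* Record one measured duration. */
void lat_add(lat_stats_t *s, uint32_t sample_us)
{
    assert(s != 0);
    if (sample_us < s->min_us) { s->min_us = sample_us; }
    if (sample_us > s->max_us) { s->max_us = sample_us; }
    if (sample_us > s->limit_us) { s->over_limit++; }
    s->sum_us += sample_us;
    s->count++;
}

/* Integer mean of all samples; 0 when nothing has been recorded. */
uint32_t lat_mean(const lat_stats_t *s)
{
    assert(s != 0);
    return (s->count == 0u) ? 0u : (uint32_t)(s->sum_us / s->count);
}
```

A wide gap between `min_us` and `max_us`, or a growing `over_limit` count on the same file, would correspond to the jitter seen on the scope.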

tjaekel

Visitor II

Try to lower the eMMC (or is it a SD Card? SDMMC1 for SD cards?) clocks a bit. Check also the external wiring (if it is an external SD Card adapter).

A few days ago I tried an SD card adapter with flying wires on a NUCLEO-U575ZI-Q board, and it was not working (I got CRC errors on commands). Only with a proper main board and the SD card adapter connected directly to the PCB did it work.

Step through the code where it starts to fail: in my case I saw on SD card commands a response with CRC error code. Later, higher up in file system: it would just say "failed". Trace it down to the original point where it fails.

SD cards often switch from a slow clock (for init, on a single lane) to a faster clock (and 4-bit mode). Try to find out if you can lower one of the frequencies.

Also possible, that a specific SD card does not work: in this case, during the initialization, e.g. setting a voltage, setting features, e.g. 4-bit-mode, can fail. Important to know where it really starts to fail (use breakpoints to trace).

TVare.1 (Author)

Explorer

Hi!

As an update: I managed to further debug and identify the exact moment when the issue was happening. Basically what I saw is:

The first write operation on the eMMC after powering it takes a lot of time, and this time seems to increase in proportion to how much the eMMC has been used.
Once the first operation finishes, the following write attempts are normal. This is why my reset was "fixing" the issue: during the reset my uC boots so fast that the eMMC stays powered ON (thanks to some capacitors), so the "first write operation" has already happened.
The failure happens when this first write operation takes more than 100ms, because of this piece of code in the HAL (HAL_MMC_WriteBlocks()):

while (!__HAL_MMC_GET_FLAG(hmmc,
                           SDMMC_FLAG_TXUNDERR | SDMMC_FLAG_DCRCFAIL | SDMMC_FLAG_DTIMEOUT | SDMMC_FLAG_DATAEND))
{
  if (__HAL_MMC_GET_FLAG(hmmc, SDMMC_FLAG_TXFIFOHE) && (dataremaining >= 32U))
  {
    /* Write data to SDMMC Tx FIFO */
    for (count = 0U; count < 8U; count++)
    {
      data = (uint32_t)(*tempbuff);
      tempbuff++;
      data |= ((uint32_t)(*tempbuff) << 8U);
      tempbuff++;
      data |= ((uint32_t)(*tempbuff) << 16U);
      tempbuff++;
      data |= ((uint32_t)(*tempbuff) << 24U);
      tempbuff++;
      (void)SDMMC_WriteFIFO(hmmc->Instance, &data);
    }
    dataremaining -= 32U;
  }

  if (((HAL_GetTick() - tickstart) >= Timeout) || (Timeout == 0U))
  {
    /* Clear all the static flags */
    __HAL_MMC_CLEAR_FLAG(hmmc, SDMMC_STATIC_FLAGS);
    hmmc->ErrorCode |= errorstate;
    hmmc->State = HAL_MMC_STATE_READY;
    return HAL_TIMEOUT;
  }
}

The TXFIFOHE flag takes too long to be set, so the code stays stuck here until it eventually fails because Timeout is configured as 100ms. I guess this delay is somehow related to the hardware flow control and how the eMMC and the SDMMC controller communicate. Again, this only happens during the first write operation on the eMMC.

My guess is that the eMMC internal controller has some kind of wear leveling/garbage collection/whatever algorithm that runs on the first operation after powering ON. After that, performance gets "normal". Could someone confirm this? Or could there be another reason?

I'll try to measure these times using other eMMCs, to see if this is due to the quality of the card itself, or if it's something that will always happen and I'll have to live with it (i.e. increasing timeout value).
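One way to "live with it" without raising the timeout for every access would be to give only the first write after a power cycle a longer budget. The sketch below is an assumption-laden illustration, not HAL code: `emmc_writer_t`, `emmc_write` and `emmc_on_power_cycle` are hypothetical names, the block-write function is injected as a pointer so the logic can run off-target, and `fake_write` merely stands in for a real transfer and records the timeout it was given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Signature of the underlying block write; returns 0 on success (hypothetical). */
typedef int (*blk_write_fn)(const uint8_t *buf, uint32_t lba,
                            uint32_t nblocks, uint32_t timeout_ms);

typedef struct {
    blk_write_fn write;
    uint32_t first_timeout_ms;   /* budget for the first write after power-on */
    uint32_t normal_timeout_ms;  /* budget for every later write */
    bool warmed_up;              /* true once one write has completed */
} emmc_writer_t;

/* Write blocks, using the longer timeout until one write has succeeded. */
int emmc_write(emmc_writer_t *w, const uint8_t *buf, uint32_t lba, uint32_t nblocks)
{
    assert(w != 0 && w->write != 0 && buf != 0);
    uint32_t timeout = w->warmed_up ? w->normal_timeout_ms : w->first_timeout_ms;
    int rc = w->write(buf, lba, nblocks, timeout);
    if (rc == 0) {
        w->warmed_up = true;
    }
    return rc;
}

/* Call whenever the eMMC supply has been switched off and on again. */
void emmc_on_power_cycle(emmc_writer_t *w)
{
    assert(w != 0);
    w->warmed_up = false;
}

/* Host-side stand-in for the real transfer: remembers the timeout it was given. */
uint32_t fake_last_timeout_ms;

int fake_write(const uint8_t *buf, uint32_t lba, uint32_t nblocks, uint32_t timeout_ms)
{
    (void)buf; (void)lba; (void)nblocks;
    fake_last_timeout_ms = timeout_ms;
    return 0;
}
```

On target, the injected function would wrap the actual HAL write call, and the two timeout values would come from measurements on the specific eMMC rather than from this sketch.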

@Tesla DeLorean I'm not sure whether implementing DMA will benefit me. When you say that polling mode is fragile, are you talking about the timeout issue I'm experiencing? Or is your recommendation about keeping the CPU free for other operations while a long eMMC write is in progress? If it is the latter, it may not be that important for my use case. But please let me know if there are other reasons why you recommend DMA.

Thanks!

Stefano Ugolini

Visitor II

We experience the same issue.

The first write operation after power-up on an eMMC (IS21ES08G) is much slower than the subsequent ones.
We use HAL_MMC_WriteBlocks_DMA and we wait for the callback to happen.

We write 1 block (512B)

The first write is around 80ms while any consecutive one is around 1ms.

@TVare.1 did you find anything related to it?

MMARI.1

Graduate

hi @Stefano Ugolini

Initially I planned to use eMMC as the storage device, but I ran into many difficulties connecting it, and nobody seems to have reached a conclusion yet about the stability of eMMC performance, including this failure and the handling of large storage.

It seems an SD card (4-bit mode) would be better than eMMC (8-bit mode); only the speed may be a little lower, but that is OK for me.

ST now also ships FileX (part of Azure RTOS) as file system middleware, as an alternative to FatFs.

tjaekel

Visitor II

I think, the first write can be delayed a lot due to the File System.
Before you write new data into a file (a sector), the File System has to scan the entire FAT in order to know where to write this sector. The File System will cache (read) the MBR and the FAT, and find the cluster that contains the sector to write. Not sure if it would also read the entire cluster (I think that is not necessary, SD card sectors can be written in 512-byte chunks).

But what will happen as well: if you write less than a sector size (512 bytes), this sector must be read first, modified (overwritten with your data) and then written back. It is also possible that this write-back is delayed, with the written data still sitting in a sector cache. If you write more data into the sector, it goes into the cache memory. If no flush is forced, the cached sector may only be written after a timeout (elapsed time before writing back from cache to the SD card sector).
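The read-modify-write step described here can be sketched against an in-memory stand-in for the medium. Everything in this block is illustrative: `fake_medium`, `sector_read`, `sector_write` and `partial_write` are hypothetical names, and a real file system layer keeps the sector window cached instead of re-reading it each time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u
#define NUM_SECTORS 4u

/* In-memory stand-in for the storage medium, plus access counters. */
static uint8_t  fake_medium[NUM_SECTORS][SECTOR_SIZE];
static uint32_t reads_issued;
static uint32_t writes_issued;

static void sector_read(uint32_t lba, uint8_t *dst)
{
    memcpy(dst, fake_medium[lba], SECTOR_SIZE);
    reads_issued++;
}

static void sector_write(uint32_t lba, const uint8_t *src)
{
    memcpy(fake_medium[lba], src, SECTOR_SIZE);
    writes_issued++;
}

/* Write len bytes at byte offset off inside a single sector. */
void partial_write(uint32_t lba, uint32_t off, const uint8_t *data, uint32_t len)
{
    uint8_t window[SECTOR_SIZE];

    assert(data != 0 && lba < NUM_SECTORS);
    assert(off <= SECTOR_SIZE && len <= SECTOR_SIZE - off);

    /* Only a partial update needs the old contents; a full sector skips the read. */
    if (len < SECTOR_SIZE) {
        sector_read(lba, window);
    }
    memcpy(window + off, data, len);
    sector_write(lba, window);
}
```

A 4-byte update therefore costs one sector read plus one sector write, which is why small appends can generate more bus traffic than their size suggests.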

So, consider also the overhead in File System, what happens in particular on a write. You will potentially see a lot of reads before the first write.
After the first write done and continuous writes into same file can be way faster: the FAT and all the needed information, e.g. cluster number, next sector number, are known. Just if you cross a cluster it might be delayed again a bit. It depends how much of the FAT and sectors is cached by the File System. Also a write will be done on a cached sector in memory before it will be written back.

I would assume, this behavior is related to the overhead in the File System and how it works to figure out which sector to write (several iterations over the FAT which must be read first).

TVare.1 (Author)

Explorer

Just to be sure I understand you: when you talk about the File System, are you referring to the FatFS library running on the STM32? If so, I should clarify that the delays I saw are not there but at the lowest level, when the FatFS library calls disk_write.

In my case, this happens when closing the first file I write (I write less than 512 bytes), which is the moment when f_sync happens and the actual communication with the eMMC takes place. This is why I'm pretty sure that the delay comes from the communication with the eMMC itself: I measured the time spent in the lowest layer, in HAL_MMC_WriteBlocks (once the file system has already decided which cluster to write and so on).

After this first direct write into the eMMC happens, following writes, either in the same file or different files, take way less time.

I must say that in some cases I did see a slight delay that I think I managed to identify being caused by the FatFS library looking for a cluster to write, but if I'm not wrong this delay was never significant and was not breaking anything of the normal behaviour of the device and libraries.

tjaekel

Visitor II

OK, you measure the duration of the first write itself. If this first write takes longer than usual, there is something else to bear in mind:

When the SD Card gets a write command, for one sector, this sector has to be erased first. The SD card might be "smart" to realize: before it can write - this sector has to be erased first (all to 0xFF). This could block the actual write command to be done until the erase command has finished (inside the SD card).

To be honest: it does not explain why all following writes are faster. All writes to a sector (with full sector size) should have this delay.

Just what I mean: do not blame the MCU and FW first, maybe it is also the SD card itself.
Is there a difference between a fresh formatted and empty SD card vs. one where you want to write (overwrite) an existing file?
Or the SD card needs time on the first time to prepare for erase and write, e.g. to increase internal voltages needed for the erase and write. If the SD card has an internal power management: I could imagine that a first write takes longer in order to prepare the SD card that it will be written (and needs also always an erase cycle).

Personally, I would accept that the very first write is slow, as long as all following writes are within the spec. And a write will always be slower than a read. Maybe this behavior is there so that SD cards can be used as fast read-only memory, e.g. for booting from them, before doing any write.

I think, it could be the SD card: do you have chance to trace the signals and see the write command? There is potentially a status read for the FW function doing the write. The FW might poll the status for the "write done" and at the first time it takes longer to get the bit set in the "status polling".
Or: check if your FW write command does such a status polling: check there how often the status read will be done, how long does it take to get a "write done" status. If this varies between first and following writes: it is the SD card which slows it down (not your FW).
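The status-polling check suggested above could be instrumented roughly as follows. This is a sketch under assumptions: `poll_until_ready` is a hypothetical helper, the "ready" test and the millisecond clock are injected as function pointers, and `fake_tick`/`fake_ready` only simulate a device so the logic can run on a PC. Comparing `polls` and `elapsed_ms` between the first write and later writes would show whether the device itself is the slow part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef bool (*ready_fn)(void *ctx);   /* true once the device reports "write done" */
typedef uint32_t (*tick_fn)(void);     /* free-running millisecond counter */

typedef struct {
    uint32_t polls;       /* how many times the status was read */
    uint32_t elapsed_ms;  /* time from start until ready or timeout */
    bool timed_out;
} poll_result_t;

/* Poll until ready or until timeout_ms has elapsed, counting status reads. */
poll_result_t poll_until_ready(ready_fn ready, void *ctx, tick_fn now_ms, uint32_t timeout_ms)
{
    poll_result_t r = { 0u, 0u, false };
    assert(ready != 0 && now_ms != 0);
    uint32_t start = now_ms();

    for (;;) {
        r.polls++;
        if (ready(ctx)) {
            break;
        }
        /* Unsigned subtraction stays correct when the tick counter wraps. */
        if ((now_ms() - start) >= timeout_ms) {
            r.timed_out = true;
            break;
        }
    }
    r.elapsed_ms = now_ms() - start;
    return r;
}

/* Simulation helpers: the clock advances 1 ms per call, and the device
 * becomes ready after the countdown stored in ctx reaches zero. */
uint32_t fake_now;

uint32_t fake_tick(void)
{
    return fake_now++;
}

bool fake_ready(void *ctx)
{
    uint32_t *left = (uint32_t *)ctx;
    if (*left == 0u) {
        return true;
    }
    (*left)--;
    return false;
}
```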

TVare.1 (Author)

Explorer

@tjaekel Exactly, everything seems to point to the eMMC itself. There is indeed a difference between a fresh eMMC and a used one. As far as I have observed, the more used the eMMC, the longer this "first write" takes. So far I got around 200ms in the worst cases, which would be acceptable for my use case. However, if the delay of the first write keeps rising too much (idk, >500ms) it could affect the user experience of my device. At the moment I'm running stress tests with many writes, and I'll check whether this time keeps increasing or eventually reaches a constant maximum.

I also suspected something electrical, so I added delays between powering ON the eMMC and the first write operation, to give the voltages some time to stabilize and so on, but I didn't see any change.

It's interesting what you mention about the eMMC doing something different (random write) on the first operation. That's actually my guess as well, but I cannot find actual information to confirm it.

So far I have not been able to trace the signals. I can only tell you that the delay comes from inside HAL_MMC_WriteBlocks, specifically the while loop that calls SDMMC_WriteFIFO. (Sorry for not posting the code snippet; it keeps getting marked as SPAM and erased...)

I'm also researching the eMMC standard, to see if there is some configuration/cmd or approach I can use to mitigate this first operation delay.

@AScha.3 not yet, at the moment I'm trying to get some PCBs with other eMMCs from different manufacturers (or at least one with SanDisk). Then I'll run the same tests and write operations on them and share the results. Hopefully it gets better just by changing the component...

MMARI.1

Graduate

hi @TVare.1 ,

Have you solved your issue? I am planning to use eMMC in SDMMC mode.

hi @Tesla DeLorean

Should the eMMC RST pin be left open, or should it be pulled up with 4.7K?

Tesla DeLorean

Graduate II

Depends on the nature of the signal you're presenting as to whether you'd need another at the eMMC

But if driving by a GPIO-OD, a pull-up in the 4K7 to 47K range would be fine. You should review the datasheet(s) for the part(s) of interest

A reasonably complete implementation is presented here:

https://www.st.com/resource/en/schematic_pack/mb1381-h745xi-b03-schematic.pdf#page=10

Speed/performance is going to depend on the file system using large aligned blocks, and informing the device of the size of transfer (where supported) for writes so it can pre-erase and not churn through resources.
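As an illustration of the "large aligned blocks" point, small records can be gathered in RAM and handed to the storage layer only in full, fixed-size chunks. This is a sketch under assumptions: `CHUNK_SIZE` (4096 bytes here) should really match a multiple of the sector and ideally the cluster size of the volume, `chunk_writer_t` and `chunk_append` are hypothetical names, and `fake_flush` stands in for the real write call.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE 4096u  /* assumed: a multiple of the 512-byte sector size */

/* Called with exactly CHUNK_SIZE bytes; returns 0 on success (hypothetical). */
typedef int (*flush_fn)(const uint8_t *chunk, uint32_t len);

typedef struct {
    uint8_t  buf[CHUNK_SIZE];
    uint32_t used;
    flush_fn flush;
} chunk_writer_t;

/* Append data; every time the buffer fills, flush one full chunk. */
int chunk_append(chunk_writer_t *cw, const uint8_t *data, uint32_t len)
{
    assert(cw != 0 && cw->flush != 0 && data != 0);
    while (len > 0u) {
        uint32_t room = CHUNK_SIZE - cw->used;
        uint32_t n = (len < room) ? len : room;

        memcpy(cw->buf + cw->used, data, n);
        cw->used += n;
        data += n;
        len -= n;

        if (cw->used == CHUNK_SIZE) {
            int rc = cw->flush(cw->buf, CHUNK_SIZE);
            if (rc != 0) {
                return rc;
            }
            cw->used = 0u;
        }
    }
    return 0;
}

/* Host-side stand-in for the storage write: counts calls and remembers the size. */
uint32_t fake_flush_calls;
uint32_t fake_flush_last_len;

int fake_flush(const uint8_t *chunk, uint32_t len)
{
    (void)chunk;
    fake_flush_calls++;
    fake_flush_last_len = len;
    return 0;
}
```

The trade-off is that whatever is still sitting in `buf` is lost on a power cut, so a design like this would need an explicit flush of the partial chunk at safe points.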
