Explorer
May 5, 2022
Question

SCB_InvalidateDCache_by_Addr not operating correctly

  • May 5, 2022
  • 8 replies
  • 11826 views

Using twin STM32H7B3I EVAL boards to develop master and slave firmware.

Found that slave board was experiencing overruns on SPI slave receive.

Revised SPI slave implementation to use DMA.

Found incredibly bizarre data corruption after some number of good packets had moved from the EVAL master to the EVAL slave.

SCB_InvalidateDCache_by_Addr was called after each DMA receive completion to invalidate the data cache for the receive buffer.

However, once data cache was completely disabled, data corruption of receive DMA packets ceased.

Are there any timing or special considerations for the use of SCB_InvalidateDCache_by_Addr?

    This topic has been closed for replies.


    Super User
    May 5, 2022

    > SCB_InvalidateDCache_by_Addr was called after each DMA receive completion

    Try to call it [also] before starting DMA.

    Try to use MPU to set the RX buffers non-cacheable.
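
    For illustration only (not from the original post), one way such a non-cacheable region could be set up with the STM32Cube HAL CORTEX/MPU driver; the region number, base address and size below are placeholder assumptions and would have to match the actual placement of the RX buffers:

    void mpu_config_noncacheable_rx(void)
    {
        MPU_Region_InitTypeDef region = {0};

        HAL_MPU_Disable();

        region.Enable           = MPU_REGION_ENABLE;
        region.Number           = MPU_REGION_NUMBER0;               /* assumed to be a free MPU region */
        region.BaseAddress      = 0x24000000;                       /* hypothetical buffer address, must be aligned to the region size */
        region.Size             = MPU_REGION_SIZE_1KB;              /* hypothetical size covering the RX buffers */
        region.SubRegionDisable = 0x00;
        region.TypeExtField     = MPU_TEX_LEVEL1;                   /* TEX=1, C=0, B=0: normal, non-cacheable memory */
        region.AccessPermission = MPU_REGION_FULL_ACCESS;
        region.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
        region.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
        region.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
        region.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;

        HAL_MPU_ConfigRegion(&region);
        HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
    }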

    Graduate II
    May 5, 2022

    > Try to call it [also] before starting DMA.

    This isn't a matter of experimenting with options: it simply must be called before starting the reception on a particular buffer. Also, invalidating a second time after the reception is useless. It can make a difference only if the CPU is accessing the data buffer while it is being processed by the DMA, but then the code is broken anyway.

    Explorer
    May 6, 2022

    I will revise and advise of results.

    However, apparently I do not understand what it means to "invalidate" a cache.

    My understanding, apparently incorrect, was:

    1. CPU has cache of memory bytes
    2. DMA writes to those memory bytes, unbeknownst to the CPU
    3. Assume no call to SCB_InvalidateDCache_by_Addr
    4. Program accesses those memory bytes but gets what is in cache not what DMA wrote

    I thought by calling SCB_InvalidateDCache_by_Addr at step #3, the problem at step #4 is avoided.

    If I move the call to SCB_InvalidateDCache_by_Addr to between steps #1 and #2, there is still the possibility of the CPU placing data in the data cache before reaching step #4, hence the same problem.

    However, I will revise as suggested and advise of results.

    Thanks for your time.

    Graduate II
    May 5, 2022

    Watch for issues related to 32-byte alignment: the alignment of the structure in question, sufficient coverage of its full extent, and collateral damage to abutting/surrounding structures.
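
    For illustration (my own sketch, not code from this thread), one way to keep a DMA RX buffer both 32-byte aligned and padded to whole cache lines, so that cache maintenance cannot clip it or damage neighbouring data; the names and sizes are placeholders:

    #define CACHE_LINE       32u
    #define RX_PAYLOAD_SIZE  100u    /* hypothetical payload size */
    #define RX_BUF_SIZE      (((RX_PAYLOAD_SIZE + CACHE_LINE - 1u) / CACHE_LINE) * CACHE_LINE)

    static uint8_t rx_buf[RX_BUF_SIZE] __attribute__((aligned(CACHE_LINE)));   /* aligned and padded to whole cache lines */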

    Super User
    May 7, 2022

    Apologies for bringing this topic up again... speculative reads.

    Can it be that a speculative read is issued to the cached RX buffer during the DMA and pulls incomplete data into the D-cache?

    Super User
    May 16, 2022

    (answering to myself)

    The definitive answer can be found in the Cortex-M7 TRM r1p2, section 5.2:

    * "Speculative data reads can be initiated to any Normal, read/write, or read-only memory address. In some rare cases, this can occur regardless of whether there is any instruction that causes the data read."

    * "Speculative cache linefills are never made to Non-cacheable memory addresses"

    My conclusion from this:

    A cache invalidate before the DMA read cannot avoid speculative linefills during the DMA read if the buffer is in normal, cacheable memory.

    If the DMA read buffer is defined non-cacheable, a speculative read can still occur, but it won't pollute the D-cache, so it is harmless. And speculative linefills from non-cacheable memory are forbidden.

    So, defining a DMA buffer as non-cacheable, or non-normal (strongly-ordered, device), should prevent D-cache pollution.

    A cache invalidate *after* the DMA read should work too, given that any writes to the buffer have been flushed or invalidated before the DMA start (double invalidate, before and after).
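
    As a minimal sketch of that double-invalidate pattern (my own illustration, assuming a HAL-style SPI DMA receive; hspi, rx_buf and RX_BUF_SIZE are placeholder names, not from this thread):

    void receive_one_packet(SPI_HandleTypeDef *hspi, uint8_t *rx_buf)
    {
        SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, RX_BUF_SIZE);   /* before: discard stale or speculatively filled lines */
        HAL_SPI_Receive_DMA(hspi, rx_buf, RX_BUF_SIZE);                  /* DMA fills the buffer behind the cache */
        /* ... wait for HAL_SPI_RxCpltCallback() to signal completion ... */
        SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, RX_BUF_SIZE);   /* after: discard lines pulled in during the transfer */
        /* CPU reads of rx_buf now see exactly what the DMA wrote */
    }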

    Comments are welcome.

    Explorer
    May 16, 2022

    Have resolved the issue.

    The thread's main loop was zeroing two buffers that the DMA would write into, then starting the DMA, then waiting for DMA completion, then calling SCB_InvalidateDCache_by_Addr for one or the other buffer, then looking at the data.

    What was apparently happening was that the buffer zeroing, at the top of the thread's main loop, was not immediately flushed to SDRAM, but was dawdling in the data cache.

    By some race condition, sometimes the CPU beat the DMA to writing the buffers, sometimes the DMA beat the CPU.

    Once I put SCB_CleanInvalidateDCache_by_Addr immediately after the two memset-zeros of the buffers, everything started working perfectly.

    The moral of the story: if the CPU has written memory, call SCB_CleanInvalidateDCache_by_Addr before starting any DMA operation that writes to the same memory.
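
    A sketch of that corrected sequence (my reconstruction; rx_buf, RX_BUF_SIZE, start_dma_receive() and wait_for_dma_complete() are placeholders, not the actual project code):

    memset(rx_buf, 0, RX_BUF_SIZE);                                      /* CPU zeros land in the D-cache */
    SCB_CleanInvalidateDCache_by_Addr((uint32_t *)rx_buf, RX_BUF_SIZE);  /* write the zeros out to SDRAM and drop the lines */
    start_dma_receive(rx_buf, RX_BUF_SIZE);                              /* DMA now writes the packet to SDRAM */
    wait_for_dma_complete();
    SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, RX_BUF_SIZE);       /* drop anything cached during the transfer */
    /* the CPU now reads what the DMA wrote, not stale zeros from the cache */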

    Graduate II
    May 17, 2022

    Zeroing and cleaning are not required. If your implementation specifically needs the zeroing, then I suggest zeroing only the part of the buffer that was not written by the DMA, after the read operation.

    Also read my discussion with Pavel. It turns out that invalidating the buffer before the DMA read is also not enough...

    Super User
    May 18, 2022

    @Piranha

    > So indeed it turns out that the correct solution is doing invalidation before and after the DMA read.

    Yes, I'm still pondering on the answer on the ARM forum. Still reading.

    Double invalidation is too much trouble. It has overhead, after all, if you look at the source.

    Maybe only invalidation after the DMA read is enough, if the buffer (cache) was clean before starting the DMA.

    Making the buffer non-cacheable seems better (especially for small buffers, where the overhead of SCB_Invalidate... is about the same as a non-cached read), but managing the MPU is not too easy on the CM7. I've read that MPU management will be much easier on the new Cortex-M85...

    By the way: SCB_Invalidate... can hard-fault when the cache is disabled (at least, I'm seeing this on H7), and other developers can leave the cache disabled. Because of that I wrap the SCB_... calls in an "if (D-cache enabled) {...}" check.
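
    That guard could look roughly like this (my own sketch, using the CMSIS D-cache enable bit in SCB->CCR; rx_buf and RX_BUF_SIZE are placeholders):

    if (SCB->CCR & SCB_CCR_DC_Msk)   /* only perform cache maintenance if the D-cache is enabled */
    {
        SCB_InvalidateDCache_by_Addr((uint32_t *)rx_buf, RX_BUF_SIZE);
    }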

    Graduate II
    May 18, 2022

    Did a test on my demo firmware and compared the CPU load in the invalidation "before" vs. "before and after" scenarios. The CPU load in the Rx test is 14.65% vs. 16.00% respectively. So, in relative terms, the correct code gives up the illegal 8.44% gain, but that is only 1.35% of the total CPU time. And that's at the maximum of 94.9 Mbps while receiving 8127 frames per second on an F7. In most real-life scenarios the loss will shrink proportionally with smaller Rx traffic, and on an H7 it will also shrink proportionally with the higher CPU clock speed.

    For the cache to be clean, it requires cleaning, but cleaning writes back dirty lines, which takes additional time, loads the buses, and is generally useless. Invalidation just marks all of the relevant lines invalid regardless of their previous state (invalid, clean, dirty) and doesn't involve any memory access at all. Also, invalidation frees those cache lines instead of keeping them allocated, which could be an overall advantage.

    For buffers significantly smaller than the cache line size (32 B on Cortex-M7), cache management is probably an inefficient choice, also because one has to waste 32 B of memory on each buffer anyway. In such scenarios, disabling the cache on those buffers could indeed be a better choice. But for buffers larger than the cache line size, and especially in zero-copy solutions, the cache will have a major positive impact on performance - much larger than the loss from an additional SCB_InvalidateDCache_by_Addr() call.

    When the D-cache is disabled, I remember having issues with older SCB_***() function implementations, but the current ones seem to be harmless. I can even enable/disable the whole D-cache at runtime repeatedly without any issue. On H7, even ST has been shipping up-to-date versions for some time now. Contrary to F7, where those turtles are still shipping the 4-year-old broken ones... Are you using up-to-date implementations, or still the old broken ones wrapped in an additional code layer?

    Super User
    May 18, 2022

    I'm using CMSIS from CubeH7 lib v. 1.8...1.10. This is ver. V5.6.0 (Core-M ver. 5.3.0 ... whatever this means)

    Graduate II
    May 18, 2022

    Here is another interesting fact. I have a project on F7 with FreeRTOS, where the MCU continuously receives a stream of at least 100 data packets per second through Ethernet with lwIP. The same project also has a web server based on the "httpd" app provided by lwIP, and the web page does AJAX requests back to the MCU once per second. Almost everything was working fine, except for one thing. When the web page was open simultaneously with the data transfer, the AJAX requests were causing missed data packets in the data transfer. Keeping the web page open for a minute (60 requests) caused approximately 55 missed data packets. Reloading the web page repeatedly caused even more missed packets. So there was a more than 90% chance of losing a data packet because of any web request arriving simultaneously with the data transfer.

    Because those requests are processed in a callback on the thread on which all of the network processing depends, and those requests involve some non-trivial processing, my initial blind guess was that the request processing simply takes too long. And, as the web page is only for configuration and not necessary for day-to-day use, it is not a significant issue for the specific usage scenario.

    Until now my Ethernet Rx code was doing D-cache invalidation only before the DMA reception. Now I added the second invalidation after the DMA reception. I set up the data stream reception and a browser with the web page open (AJAX requests once per second) running simultaneously, and left it overnight. The result: not a single packet missed! As I can easily repeat this test with absolute reliability, it basically shows that speculative reads are not just a theoretical concept but actually do happen. And, at least in some scenarios, they are not a rare occurrence either.

    Super User
    May 18, 2022

    Terrific.

    Graduate II
    November 4, 2023

    @DRega.1, note that, after the discussion in this topic, I created an article about D-cache maintenance, which explains all of the requirements and shows a correct and complete example:

    https://community.st.com/t5/stm32-mcus-products/maintaining-cpu-data-cache-coherence-for-dma-buffers/m-p/95746

    @David Littell, for the flush/clean operation the buffer address and size alignment is not necessary, and for the invalidate operation doing it just after the reception is not enough - it must be done both before and after the reception. It is all explained in the link above and in earlier posts in this topic.

    Visitor II
    January 28, 2025

    Hello,

    I also struggled with this problem for some time. The problem is the command itself.
    Instead of SCB_InvalidateDCache_by_Addr you must call the command SCB_CleanInvalidateDCache_by_Addr,
    because you want any dirty data to be written back from the cache to memory, not just have the cached lines discarded.
    I have attached a small function below that I use. This function aligns the address and the buffer length to 32 bytes (I do this because it was recommended here in the forum). It calls the SCB_CleanInvalidateDCache_by_Addr command and then issues a __DSB() data synchronization barrier (probably not needed, but just to be safe).

    Here is the code:

    void align32Byte_cacheClean(void* address, size_t length) {
        uintptr_t newAdr;
        size_t newLength;
        newAdr = ((uintptr_t)address) & ~(uintptr_t)0x1F;                              /* round the start address down to a 32-byte cache line */
        newLength = (((uintptr_t)address - newAdr) + length + 31u) & ~(size_t)0x1F;    /* round the length up to whole cache lines */
        SCB_CleanInvalidateDCache_by_Addr((uint32_t*)newAdr, (int32_t)newLength);
        __DSB();    /* data synchronization barrier (probably redundant, kept to be safe) */
    }
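
    For example (with a hypothetical buffer that the CPU has just written and that a DMA is about to read), it would be called right before starting the transfer:

    align32Byte_cacheClean((void*)txBuf, sizeof(txBuf));   /* hypothetical txBuf; ensures the DMA sees the CPU's writes */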

    Kind regards

    Alex