Graduate II
May 20, 2022
Question

Maintaining CPU data cache coherence for DMA buffers

  • May 20, 2022
  • 6 replies
  • 16431 views

This topic is inspired by discussions in the ST forum and the ARM forum, where proper cache maintenance was sorted out and a real-life example of a speculative read was detected. There is also another discussion, where a real-life example of cache eviction was detected.

For Tx (from memory to peripheral) transfers the maintenance is rather simple:

// Application code.
GenerateDataToTransmit(pbData, nbData);
// Prepare and start the DMA Tx transfer.
SCB_CleanDCache_by_Addr(pbData, nbData);
DMA_TxStart(pbData, nbData);

For Rx (from peripheral to memory) transfers the maintenance is a bit more complex:

#define ALIGN_BASE2_CEIL(nSize, nAlign) ( ((nSize) + ((nAlign) - 1)) & ~((nAlign) - 1) )
 
uint8_t abBuffer[ALIGN_BASE2_CEIL(67, __SCB_DCACHE_LINE_SIZE)] __ALIGNED(__SCB_DCACHE_LINE_SIZE);
 
// Prepare and start the DMA Rx transfer.
SCB_InvalidateDCache_by_Addr(abBuffer, sizeof(abBuffer));
DMA_RxStart(abBuffer, sizeof(abBuffer));
 
// Later, when the DMA has completed the transfer.
size_t nbReceived = DMA_RxGetReceivedDataSize();
SCB_InvalidateDCache_by_Addr(abBuffer, nbReceived);
// Application code.
ProcessReceivedData(abBuffer, nbReceived);

The first cache invalidation, before the DMA transfer starts, ensures that the cache holds no dirty lines associated with the buffer, which could otherwise be written back to memory by cache eviction during the transfer. The second cache invalidation, after the DMA transfer completes, discards any cache lines that could have been read from memory by speculative reads during the transfer. Therefore cache invalidation for Rx buffers must be done both before and after the DMA transfer, and skipping either of these will lead to Rx buffer corruption.

Doing cache invalidation on an arbitrary buffer can corrupt adjacent memory before and after that buffer. To ensure that this does not happen, the buffer has to fill exactly an integer number of cache lines, which means both the buffer address and the buffer size must be aligned to the cache line size. The CMSIS-defined constant for the data cache line size is __SCB_DCACHE_LINE_SIZE, which is 32 bytes for the Cortex-M7 processor. __ALIGNED() is a CMSIS-defined macro for aligning the address of a variable, and ALIGN_BASE2_CEIL() is a custom macro which rounds an arbitrary number up to the nearest multiple of a power of two. In this example the 67 is rounded up to a multiple of 32, so the buffer size becomes 96 bytes.

Unfortunately, for Cortex-M processors ARM doesn't provide a clear explanation or example, but it does provide a short explanation for the Cortex-A and Cortex-R series processors.

    This topic has been closed for replies.


    Super User
    May 20, 2022

    Thanks for summing things up cleanly for us.

    To me "speculative access" sounds like there's absolutely no guarantee that the processor won't overwrite the physical memory once it's cached. In other words, it sounds like explicitly evicting a properly aligned DMA Rx buffer before starting the Rx doesn't guarantee that the processor won't re-read and re-evict any cache line related to the given buffer during the DMA Rx process.

    I'd like to ask you to discuss merits of having DMA buffers cached at all. Additionally, do you think it is a bad idea to switch DMA buffers "cachedness" dynamically in MPU?

    JW

    Super User
    May 20, 2022

    My understanding of "speculative access" is that the access is a memory read. As long as the memory served by that cache line is only written to by the DMA, there is no danger of the speculatively cached data being marked as "dirty" and flushed back to RAM (which would overwrite the DMA data). If that cache line needs to be evicted, it is simply discarded.

    Super User
    May 20, 2022

    > If that cache line needs to be evicted, it would simply be discarded.

    Oh, yes. Silly me. Thanks.

    JW

    Graduate
    May 20, 2022

    My approach for the stm32f7 is deliberately not to buffer or cache any buffers that I DMA into or out of.

    I set up a section of memory for all the buffers, then program an MPU region for that with:

    MPU_InitStruct.IsBufferable = MPU_ACCESS_NOT_BUFFERABLE;
    MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;
    MPU_InitStruct.DisableExec = MPU_INSTRUCTION_ACCESS_DISABLE;

    I don't know how that compares, in terms of overall processor performance, with invalidating/flushing the caches as appropriate prior to and on completion of DMA operations. I assume (rightly or wrongly) that the processor doesn't need to make many reads/writes to each location in a DMA buffer, so the benefit of caching such locations is relatively small.

    Regards, Danish

    Visitor II
    September 3, 2022

    Hi Danish,

    I'm interested in this approach! Can you share a more complete example?

    Thanks Andrea

    PiranhaAuthor
    Graduate II
    September 4, 2022

    Many ST examples for the Cortex-M7 configure DMA memory regions as non-cacheable in the MPU instead of doing cache maintenance. Take a look at the Ethernet examples - linker script, MPU configuration and variable definitions. Also read AN4838 and AN4839 for more details.
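    As a rough sketch of what such an MPU configuration looks like with the STM32 HAL (the region number, base address and size below are illustrative placeholders, not values taken from a specific ST example; the region placement would normally come from the linker script):

```c
/* Sketch: mark a 16 KB region holding all DMA buffers as normal,
 * non-cacheable, non-bufferable, no-execute memory. */
MPU_Region_InitTypeDef MPU_InitStruct = {0};

HAL_MPU_Disable();

MPU_InitStruct.Enable           = MPU_REGION_ENABLE;
MPU_InitStruct.Number           = MPU_REGION_NUMBER0;       /* illustrative */
MPU_InitStruct.BaseAddress      = 0x20010000;               /* illustrative */
MPU_InitStruct.Size             = MPU_REGION_SIZE_16KB;     /* illustrative */
MPU_InitStruct.SubRegionDisable = 0x00;
MPU_InitStruct.TypeExtField     = MPU_TEX_LEVEL1;           /* normal memory */
MPU_InitStruct.AccessPermission = MPU_REGION_FULL_ACCESS;
MPU_InitStruct.DisableExec      = MPU_INSTRUCTION_ACCESS_DISABLE;
MPU_InitStruct.IsShareable      = MPU_ACCESS_NOT_SHAREABLE;
MPU_InitStruct.IsCacheable      = MPU_ACCESS_NOT_CACHEABLE;
MPU_InitStruct.IsBufferable     = MPU_ACCESS_NOT_BUFFERABLE;
HAL_MPU_ConfigRegion(&MPU_InitStruct);

HAL_MPU_Enable(MPU_PRIVILEGED_DEFAULT);
```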

    PiranhaAuthor
    Graduate II
    May 21, 2022

    For buffers significantly smaller than the cache line size (32 B on Cortex-M7), cache maintenance can be inefficient for two reasons. First, the CPU time spent on the SCB_***() calls can exceed the gain from using the cache. Second, on the Rx side the buffers still have to fill a whole cache line and therefore waste a relatively large amount of memory (compared to the actually used part of the buffer). In such scenarios disabling the cache for those buffers can indeed be the better choice. But for buffers larger than the cache line size, and especially in zero-copy solutions, the cache has a major positive impact on performance - much larger than the loss from the SCB_***() calls. For example, Ethernet DMA descriptors with a size of 32 B or 16 B perform better when put in non-cacheable memory, but Rx data buffers with a typical required size of 1522 B gain from the D-cache tremendously.

    It is also a misconception that the cache improves performance only for repetitive accesses. For example, imagine that code reads an array byte by byte and does some simple processing on each byte. After the first byte is read, the other 31 bytes are already in the cache. Without the cache, the CPU would have to wait on 31 additional separate reads going through the buses to the memory, which could also be in use by other bus masters at that moment. Advanced CPUs also have a data prefetch feature, which detects data access patterns and reads the next/previous cache lines speculatively before they are accessed by code. At least the Cortex-A55, Cortex-A9 and even the Cortex-A5 do it. It seems that such a feature is not present in the Cortex-M7, but it's definitely coming in the Cortex-M55 and Cortex-M85. Anyway, in a project similar to my demo firmware on STM32F7 I am receiving Ethernet frames and doing just one full linear forward scan on every frame, and still enabling the D-cache approximately halves the CPU time.

    Reconfiguring the MPU dynamically should be possible, but my guess is that it will still require cache cleaning and invalidation at least after disabling the cacheability of the buffer. And if that is the case, then it results in lower performance and more code, which is irrational.

    Super User
    May 25, 2022

    Thanks for the comments.

    Sounds much like this might be a case for benchmarking (as you did yourself).

    JW

    PiranhaAuthor
    Graduate II
    May 25, 2022

    Let's fix the documentation related to these issues. For ST it's the AN4839 rev 2 that must be corrected and updated.

    Section 3.2, page 7:

    "Another case is when the DMA is writing to the SRAM1 and the CPU is going to read data from the SRAM1. To ensure the data coherency between the cache and the SRAM1, the software must perform a cache invalidate before reading the updated data from the SRAM1."

    Section 4, page 8:

    "After the DMA transfer complete, when reading the data from the peripheral, the software must perform a cache invalidate before reading the DMA updated memory region."

    Those explanations are wrong and should be updated according to the explanation presented in the head post of this topic. For Rx transfers the cache invalidation must be done both before and after the DMA transfer, and the address and size of Rx buffers must be aligned to the cache line size.

    It is pretty strange that section 3.2 explains how to manage cache coherence for Tx with different options and suggestions, but for Rx there is only a single poor sentence. It recommends the SCB_CleanDCache() function for Tx cache maintenance, but nothing for Rx. It seems that the person who wrote it already saw that something was wrong here and deliberately did not recommend anything specific. And there is a good reason for it - the seemingly opposite function SCB_InvalidateDCache() unavoidably corrupts unrelated memory and is not practically usable. SCB_CleanInvalidateDCache() can be used, but it still requires the Rx buffers to be properly aligned. Anyway, operations on the whole cache are terribly inefficient, which is why the SCB_***_by_Addr() functions were introduced by ARM and should be used as presented. Those functions were already there in 2015 and the application note was written in 2016, but, as always, ST used old code. Therefore table 1 "CMSIS cache functions" should also be updated.

    @Imen DAHMEN, @Amel NASRI, let's make ST the first of the MCU manufacturers to correct this global-scale mess! ;)

    PiranhaAuthor
    Graduate II
    May 25, 2022

    Also, here is a review of incorrect and incomplete related documents from other manufacturers.

    • NXP (Freescale) AN12042 rev 1. Sections 4.3.1 and 6.
    • Atmel AN-15679. Slides 25, 27 and code example on slide 42.

    Neither document mentions the necessity to invalidate the cache before the DMA Rx transfer, nor the necessity for the Rx buffer size to be aligned to the cache line size.

    • Microchip (Atmel) TB3195. Section 4, page 7 and code examples on pages 9 and 10.

    It does not mention the necessity to invalidate the cache before the DMA Rx transfer. It also states alignment requirements for the addr and dsize parameters of the SCB_***_by_Addr() functions, which are not required since this improvement.

    • Infineon (Cypress) AN224432 rev E. Section 6.4.2.1 "Solution 2: Use cache maintenance APIs" on page 44 and "Code Listing 31" on page 38.
    • Infineon AN234254 rev A. Section 5.4.2.1 "Solution 2: Use cache maintenance APIs" on page 33 and "Code Listing 3" on page 27.

    Neither document mentions the necessity to invalidate the cache before the DMA Rx transfer; instead they show an example of a cache clean done on the destination (Rx) buffer, which is suboptimal. They also state an alignment requirement for the addr parameter of the SCB_***_by_Addr() functions, which is not required since this improvement. The code shows an address alignment for the source (Tx) buffer, which is not necessary. The text does not mention the necessity for the Rx buffer size to be aligned to the cache line size.

    So, of the seven companies which have been involved in making MCUs with the Cortex-M7, all seven have failed at this. And since 2014, when the Cortex-M7 was announced, ARM also hasn't stepped up and helped to fix the situation by giving a clear explanation and examples.

    ST Employee
    May 23, 2024

    Multiple cases:

    1) RX buffer not cacheable

    Rationale: no real CPU processing on it
    Alignment: probably the non-cacheable region is already aligned to a multiple of cache line, so no real constraint (except maybe DMA constraint)
    CPU can consume the buffer without any prerequisite

    2) RX buffer cacheable

    Rationale: CPU processing on it
    Alignment: probably the cacheable region is already aligned to a multiple of cache line, so no real constraint (except maybe DMA constraint)
    CPU can consume the buffer with prerequisite: once DMA transfer is over, need to invalidate the RX buffer area.

    3) TX buffer non cacheable

    Rationale: no real CPU processing on it
    Alignment: probably the non-cacheable region is already aligned to a multiple of cache line, so no real constraint (except maybe DMA constraint)
    CPU can submit the buffer to the HW (DMA here) without any prerequisite

    4) TX buffer cacheable

    Rationale: CPU processing on it
    Alignment: probably the cacheable region is already aligned to a multiple of cache line, so no real constraint (except maybe DMA constraint)
    CPU can submit the buffer to the HW (DMA here) with prerequisite: need to clean & invalidate (aka flush) the TX buffer area.

    On top of that, the cache policy (write-through, write-back) may save some actions, such as the "clean & invalidate before submitting to the HW" for write-through, but the actions listed above are generally applicable (write-back).

     

    Graduate II
    May 25, 2024

    2) RX buffer cacheable

    This is incomplete, as explained in the OP and subsequent posts. Invalidation needs to be done before the DMA to address line eviction, and after the DMA completes to address speculative accesses. Further, the buffer should be aligned and sized to the cache line geometry.

     

    4.) TX buffer cacheable

    The invalidate step isn't needed, only the clean. It's also a Really Good Idea for TX buffers to be aligned and sized to the cache line geometry. (Although Piranha will debate the "and sized" portion, I do it for simplicity and consistency.)

     

    Read the OP and Piranha's subsequent posts multiple times as needed until you get it.

    ST Employee
    May 27, 2024

     Invalidate need to be done before the DMA to address line eviction

    What are you trying to achieve by doing so? Eviction due to dirty lines (slide 31 of http://events17.linuxfoundation.org/sites/events/files/slides/slides_17.pdf)? But you fixed the rules: the buffer is cache-line aligned and its size is a cache-line multiple. This is good practice to avoid fragmented cache lines.

    Slides 2 & 3 are important to consider: the statements about elaborate caches don't always apply in the same way; some assumptions may shorten/enhance your life, typically the real use of the buffer and its life-cycle.

     

    The invalidate step isn't needed, only the clean.

    It is general practice to clean & invalidate.

    https://stackoverflow.com/questions/76155579/whats-the-point-of-cache-clean-and-invalidate-in-arm-cortex-processors

    https://stackoverflow.com/questions/77677914/cache-clean-invalidate

     

     

    Read the OP and Piranha's subsequent posts multiple times as needed until you get it.

    Please keep such sentences to yourself.