Visitor II
February 18, 2016
Question

Hint: DMA and cache coherency

Posted on February 19, 2016 at 00:36

Sharing some experience with all:

The STM32F7 has DMA controllers and caches (I have the DCACHE in mind here). You can use a DMA for Peripheral-to-Memory or even Memory-to-Memory transfers (I use it as a hardware-based 'background' memcpy()).

But bear in mind: a DMA transfer does not go through the MCU's DCACHE. It writes directly to memory. If the DCACHE is enabled and the same memory location is already held in the cache, an update to memory done by the DMA is not 'visible' to the MCU. The MCU still sees the 'old' content because it reads from the cache.

It means: the DMAs are not coherent; they do not force an update of the DCACHE (there is no Cache Coherent Interconnect, CCI, in the system).

Some conclusions:

1) Before you send something from memory via DMA, do a Cache Clean maintenance operation: it forces the cache content to be written back to memory.

2) After something was received into memory via DMA, do a Cache Invalidate maintenance operation: it forces the cache to be refilled from memory, so that the MCU sees the changes.
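The two rules above can be illustrated with a toy model that runs on a PC. This is only a sketch under stated assumptions: toy_line_t and all function names are made up for the illustration, the model covers a single 32-byte line with a write-back policy, and it is not how the real cache is programmed (on the target, CMSIS calls such as SCB_CleanDCache_by_Addr and SCB_InvalidateDCache_by_Addr do the actual work).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u  /* D-cache line size on Cortex-M7 */

/* Toy model (hypothetical): one line of physical memory with one
 * write-back cache line in front of it. */
typedef struct {
    uint8_t mem[LINE_SIZE];    /* what the DMA reads and writes */
    uint8_t cache[LINE_SIZE];  /* what the CPU sees once the line is cached */
    int     valid;             /* cache holds a copy of mem */
    int     dirty;             /* cache was modified, not yet written back */
} toy_line_t;

/* Load the line into the cache on a miss. */
static void toy_fill(toy_line_t *l)
{
    if (!l->valid) {
        memcpy(l->cache, l->mem, LINE_SIZE);
        l->valid = 1;
        l->dirty = 0;
    }
}

uint8_t cpu_read(toy_line_t *l, unsigned i)
{
    assert(i < LINE_SIZE);
    toy_fill(l);
    return l->cache[i];
}

void cpu_write(toy_line_t *l, unsigned i, uint8_t v)
{
    assert(i < LINE_SIZE);
    toy_fill(l);          /* write-allocate */
    l->cache[i] = v;
    l->dirty = 1;         /* write-back: memory is NOT updated yet */
}

/* The DMA bypasses the cache completely. */
uint8_t dma_read(const toy_line_t *l, unsigned i) { return l->mem[i]; }
void dma_write(toy_line_t *l, unsigned i, uint8_t v) { l->mem[i] = v; }

/* Clean: write a dirty line back to memory (before the DMA reads it). */
void cache_clean(toy_line_t *l)
{
    if (l->valid && l->dirty) {
        memcpy(l->mem, l->cache, LINE_SIZE);
        l->dirty = 0;
    }
}

/* Invalidate: drop the cached copy (after the DMA wrote the memory). */
void cache_invalidate(toy_line_t *l)
{
    l->valid = 0;
    l->dirty = 0;
}
```

In this model, a value written with cpu_write() is not returned by dma_read() until cache_clean() has run, and a value written with dma_write() is not returned by cpu_read() until cache_invalidate() has run.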

But I think there is a faster (and easier) way: use the DTCM memory region.

If you manage to place the DMA buffers in DTCM, you should be fine: the DCACHE is not involved, the memory is tightly coupled to the MCU, and the DMA has a dedicated path to it as well.

On the DTCM you have 'coherency' and no need to deal with cache maintenance.

With regular memories that have the DCACHE 'in between', it can look as if 'some data is missing' (not coherent).

BTW: even the C keyword 'volatile' can be 'tricky': it tells the compiler not to optimize the accesses away, so the variable is really read and written every time (in order to see the 'side effect'). But it is not related to caches; it is not a cache maintenance operation.

If such a volatile variable is updated by a non-coherent master (the DMAs are such masters), the MCU might still not see the new value, even though volatile is used and the variable is really read again (but from the cache, not from memory).

A DCACHE in the system needs careful consideration of what it means for specific features such as DMAs.

#dcache #dma #stm32f7

    Technical Moderator
    December 26, 2016
    Posted on December 26, 2016 at 11:28

    Hi,

    Thanks for your helpful hint.

    More hints on the same topic, with further details, can be found in the application note

    http://www.st.com/content/ccc/resource/technical/document/application_note/group0/08/dd/25/9c/4d/83/43/12/DM00272913/files/DM00272913.pdf/jcr:content/translations/en.DM00272913.pdf

    (Level 1 cache on STM32F7 Series).

    -Amel-

    Graduate
    May 4, 2018
    Posted on May 04, 2018 at 19:22

    BTW, I have a project using the H743i-Eval board, and I am using the RAM_D1 area for .data and .bss. I found that it did not matter whether the cache was enabled or not. When I set up a buffer to send data out of the UART via a DMA HAL call, the only way to make sure the data out of the UART matched the data in the buffer was to use Invalidate Cache just before the call that starts the DMA. Clean Cache did not fix the incoherence. Before that I thought everything was working, because my initial tests used const data, i.e. a literal string placed in the call. Interestingly, that data comes from Flash, and it worked 100%. Thanks for pointing to AN4839. Just know that the sentence claiming that Clean Cache is a solution is not always true.

    Visitor II
    May 7, 2018
    Posted on May 07, 2018 at 20:51

    I am also working on an STM32H7 Nucleo project: SPI with DMA. Yes, the DMA is not coherent (no CCI on the bus matrix), so DMA from/to memory combined with an MCU with caches needs careful cache maintenance.

    I have realized:

    a) Using DTCM is not possible: the DMA cannot access it, so a buffer in DTCM results in a DMA error (obvious).

    b) Not enabling the cache works fine for me: the DMA transfers properly in both directions (the MCU as well as the DMA access the physical memory directly).

    c) With caches enabled we have to use Clean and/or Invalidate. It works for me, using SCB_CleanDCache_by_Addr etc.

    BTW: when using SCB_InvalidateDCache_by_Addr (or, likewise, SCB_CleanDCache_by_Addr) you need to make sure that the buffer is aligned with the cache line size, i.e. starts on a 32-byte boundary:

    __ALIGNED(32)

    is needed in the buffer definition.
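A minimal sketch of such a buffer definition in portable C11, so that it can be checked on a PC: alignas(32) plays the role of __ALIGNED(32) here. The names rx_dma_buf, RX_PAYLOAD_SIZE and CACHE_ROUND_UP are assumptions made for this example. Padding the size to whole cache lines is an extra precaution, so that no unrelated variable shares a line with the DMA buffer.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u   /* D-cache line size on Cortex-M7 */

/* Round a byte count up to a whole number of cache lines. */
#define CACHE_ROUND_UP(n) \
    ((((n) + CACHE_LINE_SIZE - 1u) / CACHE_LINE_SIZE) * CACHE_LINE_SIZE)

#define RX_PAYLOAD_SIZE 100u  /* hypothetical payload size in bytes */

/* Starts on a line boundary and occupies whole lines only. */
static alignas(CACHE_LINE_SIZE) uint8_t rx_dma_buf[CACHE_ROUND_UP(RX_PAYLOAD_SIZE)];

/* Compile-time check: the buffer size is a multiple of the line size. */
static_assert(sizeof rx_dma_buf % CACHE_LINE_SIZE == 0u,
              "DMA buffer must cover whole cache lines");

uint8_t *rx_dma_buffer(void)      { return rx_dma_buf; }
size_t   rx_dma_buffer_size(void) { return sizeof rx_dma_buf; }
```

With a 100-byte payload the buffer occupies 128 bytes, i.e. four complete cache lines.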

    Which one to use depends on the direction. If the MCU generates or updates a buffer which should afterwards be transferred via DMA (Mem-to-Peri), we have to Clean the cache (update the memory with the modified cache content, i.e. write it back). If the DMA receives and writes into memory (Peri-to-Mem), the MCU (cache) must be 'informed' about the 'out-of-sync' state (the cache does not match memory anymore and must be refilled). Then an Invalidate is needed ('forget' the current cache content and refill the cache).

    It works fine for me.

    Example code from my project:

    //align buffer with cache line size
    uint8_t uartRxBuf[UART_RX_STR_SIZE] __ALIGNED(32) __attribute__((section(".ram1")));

    //...

            if (xSemaphoreTake(xSemaphoreSPI1Tx, portMAX_DELAY /*1000*/) == pdTRUE)
            {
                //clean the buffer for DMA to see it
                //(length rounded up to a multiple of the 32-byte cache line)
                SCB_CleanDCache_by_Addr((uint32_t *)txBuf, ((len+31)/32)*32);

                //start the receiver first, then the transmitter
                if (HAL_SPI_Receive_DMA(&hspi4, rxBuf, len) != HAL_OK)
                {
                    Error_Handler();
                }

                if (HAL_SPI_Transmit_DMA(&hspi1, txBuf, len) != HAL_OK)
                {
                    Error_Handler();
                }

                //wait for Rx complete
                if (xSemaphoreTake(xSemaphoreSPI4Rx, portMAX_DELAY /*1000*/) == pdTRUE)
                {
                    //invalidate to see DMA results
                    SCB_InvalidateDCache_by_Addr((uint32_t *)rxBuf, ((len+31)/32)*32);

                    return HAL_OK;
                }
            }

    BTW:

    We could also initialize and use the MPU: we could configure one RAM region without caching, or as 'write-through' (for MCU -> DMA -> Peripheral).

    Just bear in mind: with caches enabled, the MCU uses the cache content, but a DMA uses physical memory. They can be 'out-of-sync', and we need cache maintenance operations in order to make the DMA coherent with the MCU (cache).

    Remark: if you use a lot of other data memory together with the DCache, it can look as if it works. The cache is updated often when a lot of other data memory is in use: some cache lines are evicted, so memory updated by a DMA is reloaded because it is not in the cache anymore. But it will fail if all the data memory we use fits into the DCache. In this case the MCU runs completely out of the cache, and the cache is not in sync with memory anymore. Therefore: when using DMAs, do not forget to handle the cache (Clean to update memory before a DMA is launched, Invalidate to update the cache from memory after a DMA is done).
    Graduate
    May 8, 2018
    Posted on May 08, 2018 at 18:29

    Thanks Jaekel, I have confirmed the same behavior using the UART DMA. However, for me, not enabling the cache by not calling SCB_EnableDCache() does not work. In fact there is an SCB_DisableDCache(), and using it does not fix the data corruption. The only thing that works for me is to enable the cache and judiciously use Clean or Invalidate, depending on the direction, just before or after the DMA call.

    Graduate
    May 8, 2018
    Posted on May 08, 2018 at 19:52

    Just for completeness, and to make sure I was not missing something about the MPU operation, I went back to see if I could manage to make the SRAM_D1 region at 0x24000000 exhibit write-through without buffering. I went to the extent of setting a breakpoint in the HAL code after the MPU is set up for the region and double-checking that C, B, S and TEX were set to the values given in AN4838 for Normal, Write-back, no write allocate, with the Share bit on. I also did it for Strongly Ordered operation. Regardless of the setup, the UART data does not match the expected data. The only solution that works is to not place SRAM_D1 under MPU control, enable the cache, and use the Clean and Invalidate Cache calls at the right time.

    Visitor II
    May 31, 2018
    Posted on May 31, 2018 at 12:49

    Hi All,

    I was doing an SPI DMA transmit operation and captured some observations which confused me. Please help.

    Observation 1:

    I was using a global buffer ( uint8_t txBuf[5]; ) and enabled the D-Cache inside main() (using STM32CubeMX). To perform the DMA transfer I then need to call SCB_CleanDCache_by_Addr((uint32_t*)&txBuf[0], 5) before HAL_SPI_Transmit_DMA(&hspi4, txBuf, 5), or I need to configure the MPU via MPU_Config(); otherwise the DMA doesn't work.

    Observation 2:

    I was using a global buffer ( uint8_t txBuf[10]; ) and did not enable the D-Cache inside main() (using STM32CubeMX). In that case I don't need to call SCB_CleanDCache_by_Addr((uint32_t*)&txBuf[0], 5) before HAL_SPI_Transmit_DMA(&hspi4, txBuf, 5), and the DMA works fine.

    I am confused by the results, and I checked them 10-20 times. I am unable to reach a conclusion as I am new to this.

    Regards

    Manish

    Visitor II
    May 31, 2018
    Posted on May 31, 2018 at 18:09

    Actually, your observations seem to be correct. Just bear in mind: the DMA is a master, like the MCU (in terms of memory access). But the DMAs in such MCUs are 'NOT COHERENT'. It means there is no CCI (Cache Coherent Interconnect) on the bus fabric. A DMA accesses memory directly, without any caches involved. But the MCU accesses memory with the caches involved, not 'really' from/to memory.

    So, with caches enabled, the MCU quite likely reads from the cache, whereas the DMA reads and writes physical memory directly. There are two cases: (A) the DMA writes new memory content but the MCU still reads from the cache, or (B) the MCU writes new content but it still 'hangs' in the cache and the DMA does not see it in physical memory yet. Both cases need these cache maintenance functions to be called (if the cache is enabled or it is not configured as 'write back').

    You, as the software engineer, have to take care of making it coherent between the MCU (caches) and the DMA (memories).

    Comment:

    When you do cache maintenance via CleanDCache (which is used to write the MCU caches back to memory before a DMA is kicked off, see (B)) or InvalidateDCache (which is used after a DMA is done to let the MCU caches refill, see (A)), you have to bear in mind the ALIGNMENT with the cache line size (here 32 bytes).

    These _by_Addr functions clean or invalidate whole cache lines! So the start address of your buffer should be 32-byte aligned, e.g. use __ALIGNED(32) in the definition, or you should take the address of your buffer and round it down to the next lower 32-byte boundary when you call the function (with the length parameter rounded up to a multiple of 32!).

    If you don't, and the address passed to the _by_Addr cache maintenance function is not aligned or the length is not a multiple of 32, then either a) nothing is done (because the address is not cache-line aligned) or b) part of your buffer, e.g. the first bytes in the first cache line, is not updated.

    ==> Align with the cache line size and make sure the length covers all needed cache lines (multiples of 32 bytes).
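The rounding described above can be sketched as a small helper that runs on a PC. The names cache_range_t and cache_range_for are made up for this example; on the target, the resulting start address and size would be what gets passed to the _by_Addr functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u   /* D-cache line size on Cortex-M7 */

/* The mask arithmetic below relies on a power-of-two line size. */
static_assert((CACHE_LINE_SIZE & (CACHE_LINE_SIZE - 1u)) == 0u,
              "line size must be a power of two");

typedef struct {
    uintptr_t start;  /* buffer address rounded down to a line boundary */
    size_t    size;   /* bytes covering every touched line (multiple of 32) */
} cache_range_t;

/* Widen [addr, addr + len) so that it covers complete cache lines. */
cache_range_t cache_range_for(uintptr_t addr, size_t len)
{
    const uintptr_t mask = (uintptr_t)CACHE_LINE_SIZE - 1u;
    cache_range_t r;

    r.start = addr & ~mask;                         /* round start down */
    r.size  = (size_t)(((addr + len + mask) & ~mask) - r.start); /* round end up */
    return r;
}
```

For example, a 5-byte buffer at 0x2400001E touches two cache lines, so the helper returns start 0x24000000 and size 64. Note that invalidating such a widened range also discards unrelated data that happens to live in the same lines, which is one more reason to define DMA buffers aligned and padded to whole cache lines.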

    Visitor II
    May 31, 2018
    Posted on May 31, 2018 at 18:20

    Sorry, a correction: I think it should say the cache is configured as 'write-through'. In that case, if the MCU writes, the DMA should see the updated memory even without a cache maintenance function being called (any MCU write goes directly to memory).

    But there is still a need to use Invalidate for the other direction (the MCU reads from the cache but the DMA wrote to memory: still not coherent).
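The difference between the two policies can be shown with a toy model that runs on a PC. It is a sketch only: wt_line_t, policy_t and the function names are invented for this illustration, a single 32-byte line is modeled, write-through is modeled without write allocation and write-back with write allocation, and the DMA is represented by direct accesses to the mem field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u

typedef enum { POLICY_WRITE_BACK, POLICY_WRITE_THROUGH } policy_t;

/* Toy model (hypothetical): one memory line behind one cache line. */
typedef struct {
    policy_t policy;
    uint8_t  mem[LINE_SIZE];    /* physical memory, as seen by the DMA */
    uint8_t  cache[LINE_SIZE];  /* cached copy, as seen by the CPU */
    int      valid;             /* cache currently holds the line */
} wt_line_t;

static void wt_fill(wt_line_t *l)
{
    if (!l->valid) {
        memcpy(l->cache, l->mem, LINE_SIZE);
        l->valid = 1;
    }
}

uint8_t wt_cpu_read(wt_line_t *l, unsigned i)
{
    assert(i < LINE_SIZE);
    wt_fill(l);
    return l->cache[i];
}

void wt_cpu_write(wt_line_t *l, unsigned i, uint8_t v)
{
    assert(i < LINE_SIZE);
    if (l->policy == POLICY_WRITE_THROUGH) {
        if (l->valid) {
            l->cache[i] = v;   /* keep a cached copy up to date */
        }
        l->mem[i] = v;         /* every write also reaches memory */
    } else {
        wt_fill(l);
        l->cache[i] = v;       /* write-back: stays in the cache for now */
    }
}

/* Invalidate: drop the cached copy so the next read refills from memory. */
void wt_invalidate(wt_line_t *l) { l->valid = 0; }
```

With POLICY_WRITE_THROUGH, a CPU write is immediately visible in mem (the DMA Tx direction needs no Clean), but a value placed into mem behind a valid cache line is still not returned by wt_cpu_read() until wt_invalidate() has been called.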

    So, an MPU configuration seems to be needed if caches are enabled (and DMAs are used). Check the manual for what the default is without the MPU enabled: different SRAM regions can differ in their default cache modes.

    I suggest, if caches are enabled and DMAs are used:

    a) do and check the MPU configuration (regions)

    b) check which DMA can access which memory (esp. which memory is NOT accessible by DMA, e.g. DTCM).

        (BTW: DTCM does not have caches involved, so it could be 'coherent'. But not all DMA engines can access DTCM.)
    Graduate
    June 4, 2018
    Posted on June 04, 2018 at 18:59

    Observation 1:

    Conclusion: Does it mean that the default cache policy is write-through?

    I would concur with this based on my observations.

    As well, garsi.khouloud has confirmed that on the H7 devices:

    So then using a statement like:

    MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;

    will override:

    MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;

    in the same region making that region NOT cacheable.

    In the same vein:

    MPU_InitStruct.IsCacheable = MPU_ACCESS_NOT_CACHEABLE;

    MPU_InitStruct.IsShareable = MPU_ACCESS_SHAREABLE;

    in the same region is redundant.
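A small sketch of how these two HAL fields end up as bits of the MPU region attribute register (RASR) and how the rule stated above combines them. The bit positions are those of the ARMv7-M MPU; the function names are made up for this example, and the 'effectively cached' rule simply encodes the behavior reported in this thread (S = 1 treated as non-cacheable), looking only at the C and S bits and ignoring TEX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions of the attribute fields in the ARMv7-M MPU RASR register */
#define RASR_B_POS   16u  /* bufferable */
#define RASR_C_POS   17u  /* cacheable  */
#define RASR_S_POS   18u  /* shareable  */
#define RASR_TEX_POS 19u  /* type extension, 3 bits */

/* Pack TEX, S, C and B into their RASR bit positions. */
uint32_t rasr_attr_bits(unsigned tex, bool c, bool b, bool s)
{
    assert(tex <= 7u);
    return ((uint32_t)tex << RASR_TEX_POS) |
           ((uint32_t)s   << RASR_S_POS)   |
           ((uint32_t)c   << RASR_C_POS)   |
           ((uint32_t)b   << RASR_B_POS);
}

/* Rule as reported in this thread: a shareable region behaves as
 * non-cacheable, whatever the C bit says. */
bool rasr_effectively_cached(uint32_t rasr)
{
    bool c = ((rasr >> RASR_C_POS) & 1u) != 0u;
    bool s = ((rasr >> RASR_S_POS) & 1u) != 0u;
    return c && !s;
}
```

Under this rule a region configured as cacheable and shareable is not cached, and a region configured as not cacheable is not cached regardless of the S bit, which is why the second combination above is called redundant.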

    See thread:

    https://community.st.com/message/197787-an4838-s-field-equivalent-to-non-cacheable

    Visitor II
    June 19, 2018
    Posted on June 19, 2018 at 06:46

    Thanks for your reply.

    What if I do it like this:

    MPU_InitStruct.IsShareable = MPU_ACCESS_NOT_SHAREABLE;

    MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;

    Is it cacheable?

    I would love it if you could address the above in the context of DMA.

    My observation:

    1) I allocated a global buffer: uint8_t __attribute__((section (".RAM_D2"))) txBuf[1000];

    2) I enabled the D-Cache.

    3) I enabled the MPU (Cacheable, Not Shareable).

    For the DMA to work, we need to configure the MPU as 'not cacheable', but in my observation I configured it as 'cacheable' and the DMA (write operation) still works.

    Regards,

    Manish

    Visitor II
    June 19, 2018
    Posted on June 19, 2018 at 23:10

    Hi Manish,

    a) shareable vs. non-shareable:

    The sentence 'The STM32F7 Series and STM32H7 Series do not support hardware coherency. the S field is equivalent to non-cacheable mem' in

    http://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf

    means, in my understanding:

    The MPU is an ARM IP core:

    https://developer.arm.com/docs/ddi0439/latest/memory-protection-unit/about-the-mpu

    It is capable of supporting a multi-core system (or a system with several masters which can have their own L1 caches). In this case shareable or non-shareable matters. These are HW signals which can be used with a CCI (Cache Coherent Interconnect). If there were a CCI (and several cores/masters), the caches would be informed to 'synchronize', so that 'coherency' between the different cores/masters and (their) L1 caches is guaranteed.

    But ST has not integrated such a CCI (too complex for a small MCU, a CM4/CM7 system). So the shareable attribute (S-bit) does not have any function or meaning.

    'Is equivalent' means here, in my understanding: 'if you have shareable memory, e.g. shared between MCU and DMA, you have to configure it as non-cacheable. Then it is a shared memory, coherent for both cores/masters. Shared without a CCI means non-cacheable.'

    So I assume this S-bit does not have any effect (because it is not wired from the MPU). For shareable memory you have to use non-cacheable: this is 'equivalent'.

    b) DMA still working even with cacheable:

    It is not enough just to look at MPU_InitStruct.IsCacheable = MPU_ACCESS_CACHEABLE;. You also have to check the other bits, e.g. the TEX, C and B fields!

    Cacheable comes in different 'flavors' (policies), e.g. Write-Back or Write-Through (WBWA, WTNWA etc.). See page 7 here:

    http://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf
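As a rough sketch, the most common TEX/C/B encodings of the ARMv7-M MPU can be decoded as below, together with the cache maintenance each policy implies for DMA buffers. The enum and function names are invented for this example, the S bit is ignored, and encodings outside TEX = 000/001 are lumped together as 'other'; the application note linked above remains the reference.

```c
#include <assert.h>

typedef enum {
    MEM_STRONGLY_ORDERED,      /* TEX=000 C=0 B=0 */
    MEM_DEVICE,                /* TEX=000 C=0 B=1 */
    MEM_NORMAL_WT_NO_WA,       /* TEX=000 C=1 B=0: write-through, no write allocate */
    MEM_NORMAL_WB_NO_WA,       /* TEX=000 C=1 B=1: write-back, no write allocate */
    MEM_NORMAL_NON_CACHEABLE,  /* TEX=001 C=0 B=0 */
    MEM_NORMAL_WB_WA,          /* TEX=001 C=1 B=1: write-back, write and read allocate */
    MEM_OTHER                  /* encodings not covered by this sketch */
} mem_type_t;

/* Decode the TEX, C and B fields of an MPU region into a memory type. */
mem_type_t decode_tex_c_b(unsigned tex, unsigned c, unsigned b)
{
    assert(tex <= 7u && c <= 1u && b <= 1u);
    if (tex == 0u) {
        if (!c && !b) return MEM_STRONGLY_ORDERED;
        if (!c &&  b) return MEM_DEVICE;
        if ( c && !b) return MEM_NORMAL_WT_NO_WA;
        return MEM_NORMAL_WB_NO_WA;
    }
    if (tex == 1u) {
        if (!c && !b) return MEM_NORMAL_NON_CACHEABLE;
        if ( c &&  b) return MEM_NORMAL_WB_WA;
    }
    return MEM_OTHER;
}

/* Write-back policies keep CPU writes in the cache: Clean before DMA Tx. */
int needs_clean_before_dma_tx(mem_type_t t)
{
    return t == MEM_NORMAL_WB_NO_WA || t == MEM_NORMAL_WB_WA;
}

/* Any cached policy can hold stale data: Invalidate after DMA Rx. */
int needs_invalidate_after_dma_rx(mem_type_t t)
{
    return t == MEM_NORMAL_WT_NO_WA ||
           t == MEM_NORMAL_WB_NO_WA ||
           t == MEM_NORMAL_WB_WA;
}
```

For instance, TEX=000, C=1, B=0 decodes to write-through, for which this sketch reports that no Clean is needed before a DMA Tx but an Invalidate is still needed after a DMA Rx.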

     

    If you enable the cache but the policy is 'Write-Through', the DMA will still work (in one direction): when the MCU writes data, with 'write-through' it ends up in memory immediately (writing behaves like non-cached; MCU writes are 'coherent' with DMA and memory).

    So it is not enough just to say 'cacheable' and that the cache is enabled: the policy matters much more.

    I would suggest:

    - try to understand how the MPU can be configured (all the modes)

    - understand what all these bits mean, esp. the difference between 'write-back' and 'write-through', and 'write-allocate' vs. 'non-write-allocate'

    - use 'write-through' for DMAs (maybe you did already, hence 'still working'): if the MCU places data into memory and the DMA should grab it (DMA Tx), 'write-through' works with caches enabled and without cache maintenance being called

    - or use the cache maintenance functions (my recommendation anyway: it does not hurt to always do cache maintenance, Clean and Invalidate, even if the cache is maybe not enabled; if you enable the cache later, it will still work)

    - check also the original ARM TRM:

    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0646a/BIHJJABA.html
    Graduate
    July 26, 2024

    Very good explanation here. I was invalidating the cache right after entering the UART Rx callback (DMA with idle interrupt) and cleaning it before leaving. It worked well with the debug build, but in the release build the cache was not being invalidated. Resolved by just aligning the UART DMA buffer: uint8_t uart_reception_dma[1024] __attribute__((aligned(32)));