ST Employee

DMA is not working on STM32H7 devices

July 23, 2018

The problem is related to two things: memory layout on STM32H7 and internal data cache (D-Cache) of the Cortex-M7 core.

In summary, the possible causes are:

DMA buffers placed in DTCM RAM for D1/D2 peripherals. Unfortunately, this memory is the default in some projects, including examples.
DMA buffers not placed in D3 SRAM4 for D3 peripherals.
D-Cache enabled for DMA buffers, so the content of the cache and of the SRAM memory differs.
DMA started immediately after writing data to the TX buffer, without a __DSB() instruction in between.

For Ethernet-related problems, please see the separate FAQ: Ethernet not working on STM32H7x3

1. Explanation: memory layout

The STM32H7 device consists of three bus matrix domains (D1, D2, and D3), as shown in the picture below. D1 and D2 are connected through bus bridges, and both can access data in the D3 domain. However, there is no connection from the D3 domain to the D1 or D2 domain. Some devices (STM32H7A3/7B3 and STM32H7B0) have only two domains: D1 and D2 are merged into a single domain called the CD domain, and D3 is named the SRD domain.

The DMA1 and DMA2 controllers are located in the D2 domain and can access almost all memories, with the exception of ITCM RAM (located at 0x00000000) and DTCM RAM (located at 0x20000000). These controllers are used in most cases.

The BDMA controller is located in the D3 domain and can access only SRAM4 and backup SRAM in the D3 domain.

The MDMA controller is located in D1 domain and can access all memories, including ITCM and DTCM. This controller is primarily used for handling D1 peripherals and memory-to-memory transfers.

From a performance perspective, it is better to place DMA buffers inside the D2 domain (SRAM1, SRAM2, and SRAM3), since the D2-to-D1 bridge can introduce additional delay.
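
The access rules above can be sketched as a small lookup that runs on a host PC. This is only an illustration: the base addresses and sizes are assumptions based on the STM32H743 memory map and must be checked against the reference manual of your device, and the names `classify_address` and `dma_can_access` are hypothetical helpers, not part of any ST library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* RAM regions. Addresses and sizes are ASSUMPTIONS (STM32H743 memory map). */
typedef enum { MEM_ITCM, MEM_DTCM, MEM_AXI_SRAM, MEM_SRAM1, MEM_SRAM2,
               MEM_SRAM3, MEM_SRAM4, MEM_BACKUP_SRAM, MEM_OTHER } mem_region_t;

typedef enum { CTRL_MDMA, CTRL_DMA1, CTRL_DMA2, CTRL_BDMA } dma_ctrl_t;

typedef struct { uint32_t base; uint32_t size; mem_region_t region; } mem_range_t;

static const mem_range_t k_ranges[] = {
    { 0x00000000u,  64u * 1024u, MEM_ITCM },
    { 0x20000000u, 128u * 1024u, MEM_DTCM },
    { 0x24000000u, 512u * 1024u, MEM_AXI_SRAM },    /* D1 domain */
    { 0x30000000u, 128u * 1024u, MEM_SRAM1 },       /* D2 domain */
    { 0x30020000u, 128u * 1024u, MEM_SRAM2 },       /* D2 domain */
    { 0x30040000u,  32u * 1024u, MEM_SRAM3 },       /* D2 domain */
    { 0x38000000u,  64u * 1024u, MEM_SRAM4 },       /* D3 domain */
    { 0x38800000u,   4u * 1024u, MEM_BACKUP_SRAM }, /* D3 domain */
};

/* Map an address to the RAM region that contains it. */
mem_region_t classify_address(uint32_t addr)
{
    for (size_t i = 0; i < sizeof k_ranges / sizeof k_ranges[0]; ++i) {
        /* Unsigned subtraction wraps for addr < base, so the test fails. */
        if (addr - k_ranges[i].base < k_ranges[i].size) {
            return k_ranges[i].region;
        }
    }
    return MEM_OTHER;
}

/* True if a buffer at addr is reachable by the given controller,
 * following the rules stated in the text above. */
bool dma_can_access(dma_ctrl_t ctrl, uint32_t addr)
{
    mem_region_t r = classify_address(addr);
    if (r == MEM_OTHER) {
        return false; /* not a RAM region known to this sketch */
    }
    switch (ctrl) {
    case CTRL_MDMA: return true;
    case CTRL_DMA1:
    case CTRL_DMA2: return r != MEM_ITCM && r != MEM_DTCM;
    case CTRL_BDMA: return r == MEM_SRAM4 || r == MEM_BACKUP_SRAM;
    }
    assert(!"unknown DMA controller");
    return false;
}
```

A check like this can be useful in a debug build, for example to assert that a buffer passed to a DMA1 driver does not live in DTCM RAM.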

2. Explanation: handling DMA buffers with D-Cache enabled

The Cortex-M7 contains two internal caches: I-Cache for loading instructions and D-Cache for data. The D-Cache can affect the functionality of DMA transfers because it can hold newly written data in the cache without writing it to the SRAM memory. The DMA controller, however, reads the data from SRAM memory, not from the D-Cache.

If the DMA transfer starts immediately after the code writes data to the tx_buffer, the data might still reside in the write buffer inside the CPU when the DMA has already started. The solution is to configure the memory that holds the tx_buffer as Device type (through the MPU), which forces the CPU to order memory operations, or to add the __DSB() instruction before starting the DMA.

There are several ways to manage DMA buffers with D-Cache:

Disable the D-cache globally. This is the simplest solution, but not an efficient one, because you can lose a significant portion of performance. However, it is useful for debugging, to check whether the problem is related to the D-cache.
Disable the D-cache for a portion of the memory by configuring the memory protection unit (MPU). However, the MPU regions have specific alignment restrictions, and it is necessary to place the DMA buffers in designated parts of the memory. Each toolchain (GCC, IAR, KEIL) must be configured differently.
- Note that MPU regions can overlap, and the higher region number has priority. Together with subregion disable bits, this feature can soften the alignment and size restrictions.
- Note that Device and Strongly Ordered memory types do not allow unaligned access to memory.
Configure a part of the memory as write-through. This configuration can only be used for TX DMA. Note that on some revisions (r1p1 and older, excluding r0p0) of the Cortex-M7 core, there is an erratum concerning the write-through configuration. This issue affects only STM32H74x and STM32H75x devices from the STM32H7 family.
Use cache maintenance operations to manage data consistency. You can write data stored in the cache back to memory using the "clean" operation for a specific address range. Additionally, you can discard data stored in the cache using the "invalidate" operation.
- The downside is that these operations work with a cache-line size of 32 bytes, so you cannot clean or invalidate a single byte from the cache. This limitation can lead to errors when the RX buffer shares the cache line with other data or the TX buffer (see the figure below).
- Beware that with an uninitialized D-cache, the maintenance operations "clean" or "clean and invalidate" can lead to a BusFault exception. This issue is caused by uninitialized ECC (error correction code) after a power-on reset. If your project involves frequent maintenance operations and you want to temporarily disable the D-cache, you can call the SCB_InvalidateDCache function. This function invalidates the entire cache and sets the correct ECC without enabling the cache.
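
To make the "clean" and "invalidate" operations more concrete, here is a toy model that runs on a host PC. It is not how the hardware is implemented: it assumes a single 32-byte cache line with a write-back, write-allocate policy, and all names (`toy_cache_t`, `cpu_write`, `dma_read`, and so on) are hypothetical. It only illustrates why the DMA can see stale data, and why an invalidate can discard neighbouring data in the same line.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u /* Cortex-M7 D-cache line size in bytes */

/* Toy model: one line of SRAM with one write-back cache line in front. */
typedef struct {
    uint8_t sram[LINE_SIZE];  /* what the DMA controller sees */
    uint8_t cache[LINE_SIZE]; /* what the CPU sees while the line is valid */
    int valid;
    int dirty;
} toy_cache_t;

void toy_init(toy_cache_t *c) { memset(c, 0, sizeof *c); }

/* Load the line from SRAM if the cache does not hold it yet. */
static void line_fill(toy_cache_t *c)
{
    if (!c->valid) {
        memcpy(c->cache, c->sram, LINE_SIZE);
        c->valid = 1;
        c->dirty = 0;
    }
}

/* CPU write: the data lands in the cache only (write-back policy). */
void cpu_write(toy_cache_t *c, unsigned idx, uint8_t value)
{
    assert(idx < LINE_SIZE);
    line_fill(c);
    c->cache[idx] = value;
    c->dirty = 1;
}

/* CPU read: served from the cache while the line is valid. */
uint8_t cpu_read(toy_cache_t *c, unsigned idx)
{
    assert(idx < LINE_SIZE);
    line_fill(c);
    return c->cache[idx];
}

/* The DMA bypasses the cache and works on SRAM directly. */
uint8_t dma_read(const toy_cache_t *c, unsigned idx)
{
    assert(idx < LINE_SIZE);
    return c->sram[idx];
}

void dma_write(toy_cache_t *c, unsigned idx, uint8_t value)
{
    assert(idx < LINE_SIZE);
    c->sram[idx] = value;
}

/* "Clean": write a dirty line back to SRAM; the line stays valid. */
void cache_clean(toy_cache_t *c)
{
    if (c->valid && c->dirty) {
        memcpy(c->sram, c->cache, LINE_SIZE);
        c->dirty = 0;
    }
}

/* "Invalidate": discard the whole line, including unwritten dirty data. */
void cache_invalidate(toy_cache_t *c)
{
    c->valid = 0;
    c->dirty = 0;
}
```

In this model, a `cpu_write` followed by `dma_read` returns the old SRAM content until `cache_clean` is called (the TX case), and a `dma_write` followed by `cpu_read` returns the old cached content until `cache_invalidate` is called (the RX case). A `cpu_write` to another byte of the same line is lost if `cache_invalidate` runs before a clean, which is the shared cache-line hazard described above.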

Below are the possible MPU configurations. Green configurations are suitable for DMA buffers, blue configurations are suitable only for TX-only DMA buffers, and red configurations are forbidden. Other configurations are not suitable for DMA buffers and require cache maintenance operations:

3. Solution example 1: simple placement of all memory in the D1 domain

D-Cache must be disabled globally for this solution to work.

GCC (Atollic TrueStudio/System Workbench for STM32/Eclipse)

Replace DTCMRAM with RAM_D1 for section placement in the linker script (.ld file extension), for example:

.data : 
{
 ... /* Keep same */
} >RAM_D1 AT> FLASH

Do the same for the .bss and ._user_heap_stack sections.

In some linker scripts, the initial stack is defined separately. Therefore, you must either update it with the section or define it inside the section, as shown below:

._user_heap_stack :
{
  . = ALIGN(8);
  PROVIDE ( end = . );
  PROVIDE ( _end = . );
  . = . + _Min_Heap_Size;
  . = . + _Min_Stack_Size;
  _estack = .; /* <<<< line added */
  . = ALIGN(8);
} >RAM_D1

Then remove the original _estack definition.

IAR (in project settings):

For Keil:

4. Solution example 2: placing buffers in separated memory part

D-cache must be disabled through the MPU for the specific memory region where the DMA buffer is placed. Note that the MPU region size must be a power of two. Additionally, the region's start address must have the same alignment as its size. For example, if the region is 512 bytes, the start address must be aligned to 512 bytes (the 9 least significant bits must be zero).
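
The size and alignment rules can be checked with a few lines of host-side C. This is a minimal sketch with hypothetical helper names. It assumes the ARMv7-M MPU used by the Cortex-M7, where the minimum region size is 32 bytes and the SIZE field of the region attribute register encodes a region of 2^(SIZE + 1) bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if size is a power of two and at least 32 bytes
 * (assumed ARMv7-M MPU minimum region size). */
bool mpu_size_is_valid(uint32_t size)
{
    return size >= 32u && (size & (size - 1u)) == 0u;
}

/* True if the region size is valid and the base address is aligned to it. */
bool mpu_region_is_valid(uint32_t base, uint32_t size)
{
    return mpu_size_is_valid(size) && (base & (size - 1u)) == 0u;
}

/* SIZE field value for a region: size = 2^(SIZE + 1) bytes. */
uint32_t mpu_size_field(uint32_t size)
{
    assert(mpu_size_is_valid(size));
    uint32_t exponent = 0u;
    while ((size >> exponent) != 1u) {
        ++exponent; /* exponent ends up as log2(size) */
    }
    return exponent - 1u;
}
```

For the 512-byte example above, `mpu_size_field(512)` gives 8, and a base address such as 0x30040100 is rejected because its 9 least significant bits are not all zero.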

NOTE: The IAR compiler and Keil compiler versions <= 5 allow placing variables at an absolute address in code using compiler-specific extensions.

C code:

Define placement macro:

#if defined( __ICCARM__ )
  #define DMA_BUFFER \
      _Pragma("location=\".dma_buffer\"")
#else
  #define DMA_BUFFER \
      __attribute__((section(".dma_buffer")))
#endif

Specify DMA buffers in code:

DMA_BUFFER uint8_t rx_buffer[256];

GCC linkerscript (*.ld file)

Place the section in D2 RAM. You can also specify custom memory regions in the linker script file.

.dma_buffer : /* Space before ':' is critical */
{
 *(.dma_buffer)
} >RAM_D2

Variables placed in this section are not initialized to their default values at startup. If you need initialization, you must define special symbols in the linker script and add your own initialization code.
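
If initialization is needed, the custom code is essentially the same copy loop that the startup code runs for .data. Below is a minimal sketch of that loop, written so that it can be tried on a host PC. The function name `copy_section_init` is hypothetical. On the target, the three pointers would come from symbols that you define in the linker script yourself (for example `_sdma_buffer`, `_edma_buffer`, and `_sidma_buffer`, which are illustrative names, not symbols that exist by default).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy initial values word by word from the load image (normally in flash)
 * to the section in RAM. dst_end points one word past the last word. */
void copy_section_init(uint32_t *dst_start, const uint32_t *dst_end,
                       const uint32_t *src)
{
    assert(dst_start <= dst_end);
    for (uint32_t *dst = dst_start; dst < dst_end; ++dst) {
        *dst = *src++;
    }
}
```

On the target, such a function would have to run before any code uses the buffers, and only after the RAM that holds the section is accessible.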

IAR linker file (*.icf file)

define region D2_SRAM2_region = mem:[from 0x30020000 to 0x3003FFFF];
place in D2_SRAM2_region { section .dma_buffer};
initialize by copy { section .dma_buffer}; /* optional initialization of default values */

Keil scatter file (*.sct file)

LR_IROM1 0x08000000 0x00200000 { ; load region size_region
  ER_IROM1 0x08000000 0x00200000 { ; load address = execution address
    *.o (RESET, +First)
    *(InRoot$$Sections)
    .ANY (+RO)
  }
  RW_IRAM2 0x24000000 0x00080000 { ; RW data
    .ANY (+RW +ZI)
  }
  ; Added new region
  DMA_BUFFER 0x30040000 0x200 {
    *(.dma_buffer)
  }
}

Generation of scatter file should be disabled in Keil:

5. Solution example 3: Use Cache maintenance functions

Transmitting data:

#define TX_LENGTH (16)
uint8_t tx_buffer[TX_LENGTH];

/* Write data */
tx_buffer[0] = 0x0;
tx_buffer[1] = 0x1;

/* Clean D-cache */
/* Make sure the address is 32-byte aligned and add 32-bytes to length, in case it overlaps cacheline */
SCB_CleanDCache_by_Addr((uint32_t*)(((uint32_t)tx_buffer) & ~(uint32_t)0x1F), TX_LENGTH+32);

/* Start DMA transfer */
HAL_UART_Transmit_DMA(&huart1, tx_buffer, TX_LENGTH);
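
The code above always adds 32 bytes to the length. As an alternative, the exact range of cache lines that a buffer touches can be computed. The following is a host-testable sketch; `cache_range_t` and `cache_line_range` are hypothetical names, and the 32-byte line size is the Cortex-M7 value mentioned earlier.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u

typedef struct {
    uintptr_t addr; /* start address, rounded down to a cache line */
    uint32_t size;  /* length in bytes, a multiple of the line size */
} cache_range_t;

/* Expand [addr, addr + len) to the whole cache lines that it touches. */
cache_range_t cache_line_range(uintptr_t addr, uint32_t len)
{
    assert(len > 0u);
    const uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1u);
    uintptr_t start = addr & mask;
    uintptr_t end = (addr + len + CACHE_LINE_SIZE - 1u) & mask;
    cache_range_t range = { start, (uint32_t)(end - start) };
    return range;
}
```

On the target, the result would be passed to the maintenance function, for example `SCB_CleanDCache_by_Addr((uint32_t *)range.addr, (int32_t)range.size)`. The exact parameter types of that function differ between CMSIS versions, so check the header in your project.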

Receiving data:

#define RX_LENGTH (16)
uint8_t rx_buffer[RX_LENGTH];

/* Invalidate D-cache before reception */
/* Make sure the address is 32-byte aligned and add 32-bytes to length, in case it overlaps cacheline */
SCB_InvalidateDCache_by_Addr((uint32_t*)(((uint32_t)rx_buffer) & ~(uint32_t)0x1F), RX_LENGTH+32);

/* Start DMA transfer */

HAL_UART_Receive_DMA(&huart1, rx_buffer, RX_LENGTH);

/* No access to rx_buffer should be made before DMA transfer is completed */

Note that reception can cause a problem if the rx_buffer is not aligned to the cache-line size (32 bytes) or does not fill whole cache lines. The invalidate operation discards entire cache lines, so other data that shares a cache line with the rx_buffer might be lost.
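
One way to avoid sharing a cache line is to align the buffer to 32 bytes and round its size up to a multiple of 32 bytes. The sketch below uses C11 `alignas` so that it also builds on a host PC; the macro `CACHE_ALIGN_UP` and the helper `is_cache_line_exclusive` are hypothetical names introduced for illustration only.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u

/* Round n up to a multiple of the cache-line size. */
#define CACHE_ALIGN_UP(n) \
    (((n) + CACHE_LINE_SIZE - 1u) & ~(CACHE_LINE_SIZE - 1u))

#define RX_LENGTH 16u

/* The buffer starts on a cache-line boundary and occupies whole lines,
 * so no other variable can share a cache line with it. */
static alignas(CACHE_LINE_SIZE) uint8_t rx_buffer[CACHE_ALIGN_UP(RX_LENGTH)];

/* True if invalidating [p, p + size) cannot touch neighbouring data. */
int is_cache_line_exclusive(const void *p, size_t size)
{
    return size > 0u
        && ((uintptr_t)p % CACHE_LINE_SIZE) == 0u
        && (size % CACHE_LINE_SIZE) == 0u;
}

/* Return the RX buffer after checking the alignment assumption. */
uint8_t *get_rx_buffer(void)
{
    assert(is_cache_line_exclusive(rx_buffer, sizeof rx_buffer));
    return rx_buffer;
}
```

The DMA still transfers only RX_LENGTH bytes; the extra bytes at the end of the array are padding that keeps other variables out of the last cache line.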

6. References

"AN4838: Managing memory protection unit (MPU) in STM32 MCUs":

https://www.st.com/content/ccc/resource/technical/document/application_note/group0/bc/2d/f7/bd/fb/3f/48/47/DM00272912/files/DM00272912.pdf/jcr:content/translations/en.DM00272912.pdf

"AN4839: Level 1 cache on STM32F7 Series and STM32H7 Series":

https://www.st.com/content/ccc/resource/technical/document/application_note/group0/08/dd/25/9c/4d/83/43/12/DM00272913/files/DM00272913.pdf/jcr:content/translations/en.DM00272913.pdf

"AN4296: Overview and tips for using STM32F303/328/334/358xx CCM RAM with IAR EWARM, Keil MDK-ARM and GNU-based toolchains":

https://www.st.com/content/ccc/resource/technical/document/application_note/bb/09/ca/83/14/e9/44/c5/DM00083249.pdf/files/DM00083249.pdf/jcr:content/translations/en.DM00083249.pdf

"AN4891: STM32H7x3 system architecture and performance software expansion for STM32Cube":

https://www.st.com/content/ccc/resource/technical/document/application_note/group0/0d/b5/e7/b7/47/0c/4a/ae/DM00306681/files/DM00306681.pdf/jcr:content/translations/en.DM00306681.pdf


HTD

Senior II

I'm glad it worked ;) I forgot to mention that `HAL_ADC_RegisterCallback` is just a special way to do it, but it's not available until you enable it in STM32CubeIDE Project Manager / Advanced Settings / Register Callbacks / ADC. Most HAL peripherals can be set up either to use weak function overrides to provide callbacks, or to register any (matching) function as a callback. Each setting has its pros and cons. Registering callbacks allows easier integration with C++ code; overriding the weak function is simpler overall and more straightforward to do in C.

magene

Senior

I've been trying to get the LPUART working with the BDMA module on a STM32H7A3 using the information in this conversation along with the reference manual. I can TX characters byte by byte using polling and can see them on the logic analyzer but when I try to TX using the BDMA module, no characters are showing up on the logic analyzer. I have been using DMA with regular UARTs and the character match functionality for a while so I understand the basic concepts but still haven't gotten the LPUART working with the BDMA module. The whole problem is described in detail here https://community.st.com/t5/stm32-mcus-products/stm32h7a3-lpuart-and-bdma/m-p/623558#M231216 and I'm hoping to get a little more help here.

Thanks - Gene

brymat

Associate

For anyone trying Solution example 2: Placing buffers in separated memory part, remember that by default RAM_D2 is not powered on.

To make this example work I had to add this to the start of main:

__HAL_RCC_D2SRAM1_CLK_ENABLE();

And because it isn't part of the normal MCU initialization, any data mapped to this section will NOT be initialized. So, for me, it was easiest to only place my receive buffer in RAM_D2.

And here is my CubeMx setup:


帅气王老板

Associate

Solution 3 solved my problem

Thank you, very useful