Graduate
August 30, 2024
Question

STM32H743 memory bandwidth issues with DCMI, FMC, ADC, parallel bus, DMA1, DMA2 / DCACHE issue


I am developing a product based on an STM32H743 and am writing the firmware for the first PCBA prototype. The product receives a 30 fps video signal from a 720i analog video decoder through DCMI and stores it in external SDRAM (every second byte is stored, on all lines, so 360x240 frames are captured). The STM32 converts the image and sends the converted data from internal RAM to a second processing unit over a parallel bus. The second processing unit displays the data. The parallel bus is a 16-bit interface with a 10 MHz clock (20 MB/s data rate with a ~44% duty cycle: 44% reading, 56% idle). During operation, the device also captures two audio channels.

The ideal operation flow is the following:

  • The DCMI interface uses DMA1 Stream0 (very high priority, FIFO) in double-buffer mode. The DMA moves data from DCMI to the external SDRAM on the FMC. It generates two interrupts: half transfer complete and full transfer complete. When the full transfer completes, the transfer is restarted. At 30 fps I get both the half-transfer and full-transfer signals, so this seems to work properly.
  • When the half-transfer-complete flag is set, the CPU runs a conversion function, let's call it ycrycbToRGB16(), which converts the YCrYCb video format to 16-bit RGB. This function finishes before the full-transfer signal is set (<15 ms execution time).
  • When the full transfer has completed and ycrycbToRGB16() has finished (which is the case), we signal the external processor (bus enable high) that it can start reading out the converted data through the parallel interface. On the STM32 side, the parallel interface is implemented with a timer input capture and the whole of GPIO port B. The TIM5 input capture captures the parallel bus clock and triggers a DMA transfer (DMA2 Stream0, TIM5_CH1, FIFO) that copies data from internal SRAM to the GPIOB port. When all data has been read, the external processor signals this on a GPIO input and goes idle.
  • ADC capture runs in the background on two channels. Slow channels are used: the ADC clock is 10 MHz maximum, 48 ksps, scan conversion mode, DMA circular mode (DMA2 Stream1; the DMA transfer cycle is ~15 kHz). This data is also sent, concatenated with the video capture data.

The whole process must stay synchronized: if something is pending or a calculation takes too long, frames are skipped and the ideal operation breaks down.

These are the issues I am observing:

  1. The timing requirements are met only if the D-cache is enabled. In this case everything executes properly and the data acquisition runs smoothly. However, the image data contains artifacts. This is because DMA2 copies the internal SRAM content to GPIOB while the CPU still holds data in the D-cache that has not yet been written back to internal SRAM; in fact it took approx. 60 ms for the whole frame to be updated in SRAM. Here is an example image with a dummy signal where the color bars swap colors at each VSYNC event, e.g. the red bar becomes pink and the pink bar becomes red. Some pixels remain at the previous color, but most are updated and were read out successfully over the parallel bus.
    robbits_0-1725041231489.png

     

  2. If I disable the D-cache, performance drops too much and the timing requirements are not met. E.g. the execution time of ycrycbToRGB16() increases from ~7 ms to 20 ms, which is not acceptable. With the D-cache the algorithm finished sooner, but the data was not yet in internal SRAM. So neither option works.
    Here is another test frame:
    robbits_1-1725041231491.png

     


    The image does not contain any artifacts, but the capture rate dropped from 30 fps to 15 fps because of the slower access to SDRAM.
  3. If I increase the clock rate of the parallel bus from 10 MHz to 20 MHz, the timing breaks down. It seems the STM32 is missing clocks from the external controller, so the data transmission breaks. The clock signal looks like this on the prototype (with a 20 MHz clock):
    robbits_2-1725041231496.png

     

All the parallel bus lines currently have a 120 ohm series resistor.

Some additional info:

  • The STM32 is configured for 480 MHz SYSCLK from a 16 MHz HSE crystal
  • The FMC is configured for a 240 MHz FMC clock (SDRAM common clock 2 HCLK: 120 MHz; CAS latency 3 clocks: 80 MHz)
  • TIM5 is configured with no prescaler and an auto-reload value of 2
  • Each frame consists of ~320 kB of data; in double-buffer mode that is 640 kB

I think I have a memory bandwidth issue. There must be wait cycles when the CPU writes to internal SRAM, or latency when DCMI/DMA1 writes to external SDRAM at the same time as TIM5 triggers a DMA2 transfer to the GPIO port.
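To put rough numbers on the suspected bandwidth problem, the figures quoted in this post can be turned into average data rates. This is a minimal sketch, not a measurement: the frame size, frame rate, bus width, bus clock and duty cycle come from the post, while the 16-bit ADC sample width and the reading of 48 ksps as a per-channel rate are assumptions.

```c
#include <stdint.h>

/* Figures from the post; the two ADC items marked "assumed" are guesses. */
#define FRAME_BYTES        (320u * 1000u)  /* ~320 kB per frame */
#define FRAMES_PER_SECOND  30u
#define BUS_WIDTH_BYTES    2u              /* 16-bit parallel bus */
#define BUS_CLOCK_HZ       10000000u       /* 10 MHz */
#define BUS_DUTY_PERCENT   44u             /* 44% reading, 56% idle */
#define ADC_CHANNELS       2u
#define ADC_RATE_SPS       48000u          /* assumed: per channel */
#define ADC_SAMPLE_BYTES   2u              /* assumed: 16-bit samples */

/* DCMI -> external SDRAM write traffic (DMA1), bytes per second. */
uint32_t dcmi_bytes_per_s(void)
{
    return FRAME_BYTES * FRAMES_PER_SECOND;
}

/* Parallel bus traffic while a readout burst is active (DMA2 Stream0). */
uint32_t bus_peak_bytes_per_s(void)
{
    return BUS_WIDTH_BYTES * BUS_CLOCK_HZ;
}

/* Parallel bus traffic averaged over the duty cycle. */
uint32_t bus_avg_bytes_per_s(void)
{
    return bus_peak_bytes_per_s() / 100u * BUS_DUTY_PERCENT;
}

/* ADC -> memory traffic (DMA2 Stream1). */
uint32_t adc_bytes_per_s(void)
{
    return ADC_CHANNELS * ADC_RATE_SPS * ADC_SAMPLE_BYTES;
}
```

Under these assumptions the averages come out at about 9.6 MB/s for DCMI, 8.8 MB/s (20 MB/s peak) for the parallel bus and 0.19 MB/s for the ADC. If those numbers hold, the average load is modest, so per-transfer latency and bus arbitration may be worth checking before raw throughput.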

I am thinking about the following options:

  1. Use the MDMA, BDMA, DMAMUX or DMA2D controller for more advanced memory operations, especially with RAMs and DMA controllers in different domains (D1, D2, D3)
  2. Mix D-cache with non-cached SRAM, e.g. a non-cached region for the parallel bus data
  3. Keep the D-cache but prepare the RAM buffers before each DMA transfer, e.g. clean/invalidate the D-cache
  4. Raise the external SDRAM clock to 200 MHz (the SRAM maximum) and lower SYSCLK
  5. Use a different memory layout, e.g. smaller internal SRAM buffers for DCMI, and write the data to external SDRAM with DMA when an internal buffer is full

Is there anything else I could do to improve the acquisition? Do you see any problem with the concept of the ideal operation flow?

 


    Super User
    August 30, 2024

    This seems to be a problem with cache management. How do you handle it?

    (You didn't write about that...)

    rob-bitsAuthor
    Graduate
    August 30, 2024

    Thanks for the comment. Indeed, I do nothing about cache management. In previous projects I used STM32F4xx and L4xx controllers, where cache was not a thing. In STM32CubeIDE I just clicked the "magic" button to enable the D-cache, and that is all I did.

    Do you have any good documentation about this?

    Any advice on the best approach for my use case?

    What I am unsure about is how the data moves on the internal buses: when it is writing data to external SDRAM, when it is writing to internal SRAM, and whether there are any collisions or wait cycles that could be optimized... It would be great to have a way to measure how this happens in my use case.

    Thanks!

    Super User
    August 30, 2024

    OK, just think about it... the D-cache keeps the CPU's data, but if the data is changed in memory (by DMA), the cache still holds the old data.

    You have to "tell" it to refresh the data...

    So draw a picture/diagram of which data is changed by DMA (or anything other than the CPU), and when it is used and has to be the real/new data because DMA is sending it ...somewhere.

    For cache management you have:

    -  SCB_InvalidateDCache_by_Addr(..)  -> discard cache lines, because they now hold old data

    -  SCB_CleanDCache_by_Addr(...)  -> write cache to memory, because the cache/CPU has new data that must be written out to update memory

    +

    all addresses you use for cache management have to be aligned to the cache line size (32 bytes), like this:

    uint8_t inbuf[4096*8] __attribute__((aligned(32)));
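The two calls and the alignment rule can be wrapped in small helpers, as a minimal sketch. The helper names (`cache_clean_for_dma_read`, `cache_invalidate_after_dma_write`, `line_floor`, `line_span`) are hypothetical. The stand-in functions inside the `#ifndef` exist only so the sketch compiles off-target; on the STM32H7 the real `SCB_*` functions come from the CMSIS core header included through `stm32h7xx.h`, and their exact parameter types differ slightly between CMSIS versions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 32u   /* Cortex-M7 D-cache line size in bytes */

#ifndef __DCACHE_PRESENT
/* Host stand-ins that only record the last call. On target, include
   stm32h7xx.h first so the real CMSIS functions are used instead. */
static uintptr_t last_addr;
static int32_t   last_size;
static void SCB_CleanDCache_by_Addr(uint32_t *addr, int32_t dsize)
{
    last_addr = (uintptr_t)addr;
    last_size = dsize;
}
static void SCB_InvalidateDCache_by_Addr(uint32_t *addr, int32_t dsize)
{
    last_addr = (uintptr_t)addr;
    last_size = dsize;
}
#endif

/* Round an address down to the start of its cache line. */
static uintptr_t line_floor(uintptr_t a)
{
    return a & ~(uintptr_t)(CACHE_LINE - 1u);
}

/* Size in bytes of the whole cache lines covering [a, a + len). */
static size_t line_span(uintptr_t a, size_t len)
{
    uintptr_t end = (a + len + CACHE_LINE - 1u) & ~(uintptr_t)(CACHE_LINE - 1u);
    return (size_t)(end - line_floor(a));
}

/* The CPU wrote buf and a DMA will read it: push dirty lines to memory. */
void cache_clean_for_dma_read(const void *buf, size_t len)
{
    uintptr_t a = (uintptr_t)buf;
    SCB_CleanDCache_by_Addr((uint32_t *)line_floor(a), (int32_t)line_span(a, len));
}

/* A DMA wrote buf and the CPU will read it: drop the stale lines.
   The buffer must own its cache lines completely, otherwise invalidating
   could discard pending CPU writes to neighbouring variables. */
void cache_invalidate_after_dma_write(void *buf, size_t len)
{
    uintptr_t a = (uintptr_t)buf;
    assert((a % CACHE_LINE) == 0u && (len % CACHE_LINE) == 0u);
    SCB_InvalidateDCache_by_Addr((uint32_t *)a, (int32_t)len);
}
```

In the use case from the question, the clean would go after the conversion function has written the output buffer and before the parallel-bus DMA is released, and the invalidate would go before the CPU reads a frame half that the DCMI DMA has just written.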

     

    Graduate II
    September 2, 2024

    Do you use an OS?

    Make sure that the DMA has enough time for bus access, so let the CPU sleep whenever possible.

    Without an OS, I made the mistake of not having a "sleep state" in my main state machine, which made the CPU constantly check some peripherals and variables although it was not needed at all.
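A "sleep state" in a bare-metal main loop could look like the following minimal sketch. The flag and counter names are hypothetical, and the macros inside the `#ifndef` are host stand-ins; on the STM32H7, `__WFI()`, `__disable_irq()` and `__enable_irq()` are CMSIS intrinsics provided through `stm32h7xx.h`.

```c
#include <stdbool.h>
#include <stdint.h>

#ifndef __CORTEX_M
/* Host stand-ins so the sketch compiles off-target. */
static uint32_t wfi_count;
#define __WFI()          ((void)wfi_count++)
#define __disable_irq()  ((void)0)
#define __enable_irq()   ((void)0)
#endif

/* Hypothetical flags, set from the DMA interrupt handlers. */
volatile bool half_frame_ready;
volatile bool full_frame_ready;

/* Hypothetical counters standing in for the real work. */
static uint32_t half_frames_handled;
static uint32_t full_frames_handled;

/* One pass of the main loop. Returns true if any work was done. */
bool main_loop_step(void)
{
    bool worked = false;

    if (half_frame_ready) {
        half_frame_ready = false;
        half_frames_handled++;      /* e.g. run the conversion here */
        worked = true;
    }
    if (full_frame_ready) {
        full_frame_ready = false;
        full_frames_handled++;      /* e.g. signal the external processor here */
        worked = true;
    }

    if (!worked) {
        /* Re-check with interrupts masked so a flag set in the meantime is
           not slept through. WFI still wakes on a pending interrupt while
           interrupts are masked; the handler runs after __enable_irq(). */
        __disable_irq();
        if (!half_frame_ready && !full_frame_ready) {
            __WFI();
        }
        __enable_irq();
    }
    return worked;
}
```

Calling `main_loop_step()` from `while (1)` keeps the CPU off the buses whenever no flag is pending, instead of polling continuously.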

    Hardware / 20 MHz:
    - the 120 ohms seem a little high, maybe the edges are not steep enough for some IOs
    - have you set all GPIOs to the highest possible speed?

    rob-bitsAuthor
    Graduate
    September 3, 2024

    The issue does not seem to be related to the series resistors, but I replaced them with 10R anyway. So here is the thing: I managed to increase the parallel bus speed from 10 MHz to 25 MHz, but it only works if the audio capture is disabled. Without audio capture, the video stream runs smoothly. When I enable the audio capture, the DMA stream for the parallel output gets corrupted somehow... Both peripherals use DMA2 (Stream0 and Stream1). The DMA streams are configured as:

    robbits_0-1725366998328.png

    TIM5_CH1 is the parallel interface clock input; it is configured with very high priority. The analog capture has low priority. The following picture shows "normal operation". CH9 is the interesting one: it shows the DMA2 Stream0 interrupt. CH5 shows the parallel read clock. There is ~1.895 ms between two DMA interrupts.

    robbits_1-1725367094817.png

    And here it is the corrupted one:

    robbits_2-1725367262363.png

    The third transfer does not finish properly, and the last DMA interrupt occurs at ~2.95 ms instead of 1.89 ms, although the first arrived in time. You can also see that the pulse trains on the parallel clock are "doubled" in each read cycle, because the MCU did not signal the end of the parallel communication, so the external controller started a new read cycle.

    I have already checked plenty of things in the code, but I feel lost in the woods... It seems that DMA2 Stream0 and Stream1 somehow interfere with each other, or there is a bandwidth issue on the internal buses. If I change the parallel read speed back to 10 MHz, everything works properly.

    Any idea what to look for?
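One way to see how tight the timing gets is to count the bus clock cycles available between two parallel-clock edges, since each edge triggers one DMA request that has to complete a memory read and a GPIO write. The sketch below does only that arithmetic. It assumes HCLK is 240 MHz (SYSCLK 480 MHz divided by 2), which is not stated explicitly in the thread, and it does not model the actual DMA latency.

```c
#include <stdint.h>

/* HCLK cycles available per DMA request, in tenths of a cycle,
   for a given parallel bus clock (one request per clock edge). */
uint32_t hclk_cycles_x10_per_request(uint32_t hclk_hz, uint32_t bus_clock_hz)
{
    return (uint32_t)(((uint64_t)hclk_hz * 10u) / bus_clock_hz);
}
```

With the assumed 240 MHz HCLK this gives 24 cycles per request at 10 MHz, 12 at 20 MHz and 9.6 at 25 MHz. Any extra bus access that lands in that window, such as an ADC DMA transfer on the same controller, takes a proportionally larger share of it at the higher clock rates.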

     

     

    Graduate II
    October 11, 2025

    Hi @rob-bits

    I'm running into a similar issue with the H743 and the FMC bus.  Were you able to achieve better performance?

    thanks

    Matt

    Graduate II
    September 2, 2024

    ... but first start with some cache management!

    I have no ideas about that, though...

    rob-bitsAuthor
    Graduate
    September 2, 2024

    Thanks for the input, it makes a lot of sense. Indeed, I am not using an OS and I have an always-running while loop. I will try to add some sleep; it should fit there.

    As for the series resistor, I first used 27R and then changed to 120R based on what I read on other forums. The datasheet gives 5 pF input capacitance, so with 120R the cutoff frequency is still pretty high.
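The cutoff estimate can be checked with the first-order RC formulas, as a minimal sketch. It treats the series resistor and the 5 pF input capacitance as an ideal low-pass and ignores trace and driver capacitance, which would lower the result; the function names are made up for illustration.

```c
/* First-order RC low-pass corner frequency: f_c = 1 / (2 * pi * R * C). */
double rc_cutoff_hz(double r_ohm, double c_farad)
{
    const double pi = 3.14159265358979;
    return 1.0 / (2.0 * pi * r_ohm * c_farad);
}

/* Approximate 10%-90% rise time of a first-order RC: t_r = 2.2 * R * C. */
double rc_rise_time_s(double r_ohm, double c_farad)
{
    return 2.2 * r_ohm * c_farad;
}
```

For 120 ohms and 5 pF this gives a corner frequency of about 265 MHz and a rise time of about 1.3 ns. With a larger total capacitance the edges slow down in proportion, for example about 5.3 ns at an assumed 20 pF.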

    The output speed of all GPIOB pins is very high, as they are configured as digital outputs. For the clock input, I am not sure whether I can set high speed for a timer input. In the GPIO config of CubeIDE I see this:

    robbits_0-1725286100224.png

    I do not know if it makes sense to change the "Maximum output speed", since the signal is an input. I will check this.

    And I have this timer config:

    robbits_1-1725286181301.png

     

     

     

    Graduate II
    September 2, 2024

    The speed register settings only apply to outputs. But mind that the data lines to a memory are usually bi-directional.

     

    120R: It's not only about the RC low-pass corner frequency, it is also about edge steepness.

    Graduate II
    October 11, 2025

    Hi @rob-bits 

    I'm still debugging on my side, but first... my setup...

    1- FMC clock is set at 100MHz  (max - according to datasheet/errata)

    2- SDRAM1 seems to be running (32bit wide),

    3- I have an FPGA configured as an SRAM (32bit wide) - with a 13bit address space.  

    4- While debugging, I'm not accessing the SDRAM, and I removed the DMA from the equation; I'm just using the processor to do the transfer.

    	while (c) {
    		*(volatile uint32_t *)0x64000010 = i;  /* 32-bit write to the FPGA on the FMC; volatile keeps every store */
    		c--;
    	}

    5- This code maxes out at a transfer rate of 10M transfers/sec (40 MBytes/sec), regardless of the SRAM timing parameters in the FMC controller.

    6- The strange thing is... if I modify the code to:

    	while (c) {
    		*(volatile uint64_t *)0x64000010 = i;  /* 64-bit write, seen as 2x32-bit on the bus */
    		c--;
    	}

    then I get a transfer rate of 20M transfers/sec or (80MBytes/sec) - on the scope I see a burst of 2x32bits for every chip select with the chip select period still at 10MHz

    7- I don't think the C compiler supports 128bit data, so instead I changed the data path in the FMC SRAM settings from 32 to 16 bits and ran the same code. This time I got a transfer rate of 40M transfers/sec, which on a 16bit data path is again 80 MBytes/sec - on the scope I see a burst of 4x16 bit for every chip select, with the chip select period still at 10MHz.

     

    I can't seem to get the FMC to run faster than the 10MHz chip select period, or > than 80MBytes/sec (either using a 32 bit data path or 16bit data path).
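The three measurements fit one simple relation: throughput equals chip-select rate times beats per chip select times bus width. The sketch below only restates that arithmetic with the numbers from this post; the function names are made up for illustration.

```c
#include <stdint.h>

/* Bytes per second moved over the FMC for a given chip-select rate,
   number of data beats per chip-select assertion, and bus width. */
uint32_t fmc_bytes_per_s(uint32_t cs_rate_hz, uint32_t beats_per_cs,
                         uint32_t bus_width_bytes)
{
    return cs_rate_hz * beats_per_cs * bus_width_bytes;
}

/* FMC clock cycles spent per chip-select period. */
uint32_t fmc_clocks_per_cs(uint32_t fmc_clock_hz, uint32_t cs_rate_hz)
{
    return fmc_clock_hz / cs_rate_hz;
}
```

With a 10 MHz chip-select rate this gives 40 MB/s for one 32-bit beat, 80 MB/s for two 32-bit beats and 80 MB/s for four 16-bit beats, matching the scope readings. At a 100 MHz FMC clock, each chip-select period is 10 FMC clocks, which may point to a fixed per-access overhead instead of a limit in the data phase.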