Graduate
June 17, 2024
Question

Optimizing FMC Read / Writes for Audio Application STM32H7

  • June 17, 2024
  • 6 replies
  • 4987 views

Hi, I am working on a real-time audio DSP product using the STM32H7 with an AS4C4M16SA-7BCN SDRAM chip for long delay-line memory. I am using the FMC controller with the settings in the attached photo:

EPala2_0-1718652361505.png

The product processes an incoming audio stream in real time, so this is a very runtime critical application. I have found that reads and writes to and from the delay memory on the SDRAM are by far the biggest drag on overall performance. 

Currently I am just accessing SDRAM memory automatically through the C++ compiler, declaring as follows and accessing as I would any other variable: 
static float delay_line[48000 * 30] __attribute__((section(".sdram"))); //48000 sample rate * 30 seconds
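For context, a declaration like this also needs a matching `.sdram` output section in the linker script. A minimal fragment might look like the following sketch; the region name, attributes, and length are assumptions (0xC0000000 is the FMC SDRAM bank 1 base on the STM32H7, and the AS4C4M16SA is 8 MB), so adapt them to the actual board:

```ld
/* Hypothetical linker fragment: map the .sdram input section to external
 * SDRAM behind the FMC. NOLOAD because the array is not initialized at boot. */
MEMORY
{
  SDRAM (xrw) : ORIGIN = 0xC0000000, LENGTH = 8M
}

SECTIONS
{
  .sdram (NOLOAD) :
  {
    *(.sdram)
  } > SDRAM
}
```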

I am wondering if there are any ways to optimize SDRAM reads and writes to get better performance, either through how I structure my code, or through settings in the CubeMX configurator. 

In particular, would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory rather than just accessing at random points based on my code behavior? Is there a vector style function that can quickly copy a block of data from the SDRAM to local memory? Would this approach be likely to provide a noticeable performance increase?

Please advise, thanks!

 

    This topic has been closed for replies.


    Super User
    June 17, 2024

     would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory 

    Yes. This is essentially what the data cache on the Cortex-M7 does for you.

     

    Graduate II
    June 17, 2024

    For MCU access you can make the SDRAM cacheable; see the MPUConfig() examples.

    DMA into memory won't go through the cache.

    You should be able to use MEM2MEM DMA modes to move data in the background, but that might add contention.

    You'll have to benchmark to see how much performance you can gain by doing the processing on-board and then migrating out to SDRAM. Generally, the fewer the moves and the simpler the pipeline, the better.

    On the F4, SDRAM was on the order of 6x slower than internal SRAM.

    The DTCM is not cached, and it is faster than any 0-wait-state memory outside the core. If you can keep things small and fast, do that.

    If you use the SDRAM as the dynamic memory pool (heap) and access it through pointers, you can likely test and adapt things more quickly.

    Don't use SDRAM for the STACK.
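As a sketch of the "small and fast" advice, a hot scratch buffer can be pinned to DTCM with a section attribute. The section name `.dtcm` is an assumption; it must exist in the linker script (DTCM is typically mapped at 0x20000000 on the H7). On a host build the attribute just names an ELF section, so the snippet compiles anywhere:

```cpp
#include <cstddef>

// Hypothetical placement: on target, ".dtcm" would be mapped to DTCM RAM in
// the linker script; the array then lives in zero-wait-state memory.
static float scratch[1024] __attribute__((section(".dtcm")));

float sum_scratch(std::size_t n) {
    // Tight per-sample work runs from DTCM, never touching SDRAM.
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += scratch[i];
    return acc;
}
```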

    EPala.2 (Author)
    Graduate
    June 17, 2024

    Thank you! What are the MPUConfig examples that you are referring to? I tried googling but wasn't seeing any clear results. 

    EPala.2 (Author)
    Graduate
    June 17, 2024

    In my application I am writing incoming audio to a very long delay line (30+ seconds at a 48 kHz sample rate) and executing a lot of reads from different points, which are then mixed together. Maybe it would be possible to execute the writes to SDRAM via DMA (since there is only one write per delay line happening per callback), and then do the reads as memcpy calls from SDRAM into local buffers the size of my audio callback buffer. I could store the local buffers in DTCM RAM for faster execution.

    Does that sound like a good approach?

    Graduate II
    June 17, 2024

    For unmanageably long / large amounts of data, my gut says move it once into / out of SDRAM. Least complicated, fewest moves.

    If you're pre-processing, do it in the fastest memory first, and move/generate the results into the SDRAM, ideally directly.

    EPala.2 (Author)
    Graduate
    July 5, 2024

    Hi all, I've found that this method of loading a whole buffer at a time is running faster than the original version:

    void mdsp_pedal_yy::process_grain_cloud_sd(T* in_b, T* out_b){
    
    	memcpy(&gfxl[write_ptr], in_b, sizeof(T) * PROC_BUFFER_SIZE);
    
    	if(write_ptr <= PROC_BUFFER_SIZE * 2){ //copy the beginning of delay memory to end to ensure contiguous reads
    		memcpy(&gfxl[write_ptr + GRAIN_DELAY], in_b, sizeof(T) * PROC_BUFFER_SIZE);
    	}
    
    	write_ptr += PROC_BUFFER_SIZE;
    
    	if(write_ptr >= GRAIN_DELAY){
    		write_ptr -= GRAIN_DELAY;
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		if(sd_grains[g].pos >= sd_grains[g].size || !sd_grains[g].active){
    			sd_grains[g].active = true;
    			start_grain_cloud_sd(g);
    		}
    		sd_grains[g].read_ptr = write_ptr + GRAIN_DELAY - sd_grains[g].read;
    		if(sd_grains[g].read_ptr >= GRAIN_DELAY){ sd_grains[g].read_ptr -= GRAIN_DELAY; }
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		memcpy(sd_grains[g].buffer, &gfxl[sd_grains[g].read_ptr], sizeof(T) * sd_grains[g].buffer_len);
    	}
    
    	memset(out_b, 0, sizeof(T) * PROC_BUFFER_SIZE);
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
    			T env = sinf((sd_grains[g].pos / sd_grains[g].size) * M_PI) * 1.0f;
    			if(env < 0){ env = 0; }
    			if(env > 1){ env = 1; }
    			out_b[i] += sd_grains[g].buffer[i] * env;
    			sd_grains[g].pos += 1;
    		}
    	}
    }

    I do have an issue: for a pitch-shifted octave-up process I need to read every other sample from SDRAM memory (to double the playback speed of the sample).

    Do y'all think it would be faster to do this via a for loop that reads every other sample from SDRAM, or to just read a double-length buffer with memcpy()?
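Both options from this question can be sketched like so (names are illustrative; which one wins depends on cache behavior, so benchmark both on target):

```cpp
#include <cstring>
#include <cstdint>

constexpr int N = 64;  // output samples per callback

// Option A: strided reads straight from (cached) SDRAM.
void decimate_strided(const int16_t* sdram, int16_t* out) {
    for (int i = 0; i < N; ++i) out[i] = sdram[2 * i];
}

// Option B: one sequential memcpy of 2*N samples, then decimate locally.
void decimate_blocked(const int16_t* sdram, int16_t* out) {
    int16_t tmp[2 * N];  // local buffer (would be DTCM on target)
    std::memcpy(tmp, sdram, sizeof(tmp));
    for (int i = 0; i < N; ++i) out[i] = tmp[2 * i];
}
```

Option B turns the SDRAM traffic into one burst at the cost of copying samples that are then discarded; with the D-cache enabled the difference may be small, since a cache-line fill already pulls in the skipped neighbors.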

    Super User
    July 5, 2024

    Hi,

    Just asking: why do you use float? > static float delay_line[48000 * 30]

    To get hi-fi, 16 bits would be OK; for top studio quality, 24 bits. If you want some extra headroom... 32 bits. (Integer.)

    And how do you load the delay? Circular buffer...?

    And why such a super long delay line? (30 sec = a 9 km room... even "over" the top for simulating an open-air concert in a stadium.)

    EPala.2 (Author)
    Graduate
    July 5, 2024

    16-bit is an approach we are probably going to take. Some other parts of the processing need float resolution, but for the granular process 16-bit will be enough. I just have not added that yet; I'm focusing on the read / write functions for the time being.

    As for the delay memory: this is a granular process, a creative effect, different from a regular delay or reverb. Long memory is part of how it works.

    Do you have any thoughts on the question of my octave up use case?

    In this scenario I need every other sample played back in order to achieve a pitch shift effect. 

    Would it be faster to do this by using memcpy to read a buffer of twice the length, or by using a for loop to read every other sample from an SDRAM segment into local memory?

     

    Please advise.

    Super User
    July 5, 2024

    Faster? So use 16-bit! (Min. 200% faster than float.)

    +

    Your H7 is at 400 MHz or so; the core is always faster than the (external) memory access, so it needs wait states for every access. To make it faster:

    1. smaller data: float -> int16_t;

    2. memcpy or a direct "take the int" is about the same (it's always the CPU doing the work and waiting...);

    3. maybe (!) faster: copy the needed part of memory by DMA or MDMA to internal RAM. "Maybe" because, while the DMA transfer is going on, the internal bus is busy, so the CPU may have to wait for free bus access.

    What is really fastest you have to try - and make a plan in advance of where the data is going and which bus is blocked then. So (I don't know exactly what you are doing, you didn't give details) one way could be to copy data by DMA into a block B while the CPU works on data in block A; when finished, the CPU works on B while A is loaded with new data by the DMA. (Circular DMA might be your friend... and half/full-transfer callbacks.)
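The A/B ping-pong scheme described above can be sketched in plain C++; here a real DMA engine is simulated by a plain copy, and `fill_from_dma` is an illustrative stand-in for the half/full-transfer callback:

```cpp
#include <cstring>
#include <cstdint>

constexpr int kHalf = 32;  // samples per half-buffer

struct PingPong {
    int16_t buf[2][kHalf];  // two halves of one circular "DMA" buffer
    int dma_half = 0;       // the half the DMA writes next

    // Simulated DMA half-transfer complete: fill the current half, flip.
    void fill_from_dma(const int16_t* src) {
        std::memcpy(buf[dma_half], src, sizeof(buf[0]));
        dma_half ^= 1;
    }

    // The CPU always processes the half the DMA is NOT currently writing.
    const int16_t* cpu_half() const { return buf[dma_half ^ 1]; }
};
```

On target the flip would happen in the DMA half- and full-transfer interrupts, and the CPU-side pointer is what the audio callback consumes.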

    EPala.2 (Author)
    Graduate
    July 9, 2024
    void mdsp_pedal_yy::process_grain_cloud_sd(T* in_b, T* out_b){
    
    	int16_t cpy_buffer[PROC_BUFFER_SIZE];
    
    	for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
    		cpy_buffer[i] = in_b[i] * FLOAT_TO_INT16;
    	}
    
    	memcpy(&gfxlsd[write_ptr], cpy_buffer, sizeof(int16_t) * PROC_BUFFER_SIZE);
    
    	if(write_ptr <= PROC_BUFFER_SIZE * 2){ //copy the beginning of delay memory to end to ensure contiguous reads
    		memcpy(&gfxlsd[write_ptr + GRAIN_DELAY], cpy_buffer, sizeof(int16_t) * PROC_BUFFER_SIZE);
    	}
    
    	write_ptr += PROC_BUFFER_SIZE;
    
    	if(write_ptr >= GRAIN_DELAY){
    		write_ptr -= GRAIN_DELAY;
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		if(sd_grains[g].pos >= sd_grains[g].size || !sd_grains[g].active){
    			sd_grains[g].active = true;
    			start_grain_cloud_sd(g);
    		}
    		sd_grains[g].read_ptr = write_ptr + GRAIN_DELAY - sd_grains[g].read;
    		if(sd_grains[g].read_ptr >= GRAIN_DELAY){ sd_grains[g].read_ptr -= GRAIN_DELAY; }
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		int pitch = sd_grains[g].pitch;
    		uint32_t read = sd_grains[g].read_ptr;
    		if(pitch == normal_speed){
    			memcpy(sd_grains[g].buffer, &gfxlsd[read], sizeof(int16_t) * PROC_BUFFER_SIZE);
    		}else if(pitch == double_speed){
    			for(int i = 0; i < PROC_BUFFER_SIZE * 2; i += 2){
    				sd_grains[g].buffer[i>>1] = gfxlsd[read + i];
    			}
    			sd_grains[g].read -= PROC_BUFFER_SIZE;
    		}else if(pitch == half_speed){
    			memcpy(cpy_buffer, &gfxlsd[read], sizeof(int16_t) * ((PROC_BUFFER_SIZE >> 1) + 1));
    			for(int i = 0; i < PROC_BUFFER_SIZE; i += 2){
    				int loc = i >> 1;
    				sd_grains[g].buffer[i + 1] = (cpy_buffer[loc] >> 1) + (cpy_buffer[loc + 1] >> 1);
    				sd_grains[g].buffer[i] = cpy_buffer[loc];
    			}
    			sd_grains[g].read += PROC_BUFFER_SIZE >> 1;
    		}else if(pitch == reverse){
    			for(int i = PROC_BUFFER_SIZE - 1; i >= 0; --i){
    				int loc = PROC_BUFFER_SIZE - 1 - i;
    				sd_grains[g].buffer[loc] = gfxlsd[read + i];
    			}
    			sd_grains[g].read += PROC_BUFFER_SIZE * 2;
    		}
    	}
    
    	memset(out_b, 0, sizeof(T) * PROC_BUFFER_SIZE);
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
    			T env = sinf((sd_grains[g].pos / sd_grains[g].size) * M_PI) * 1.0f;
    			if(env < 0){ env = 0; }
    			if(env > 1){ env = 1; }
    			out_b[i] += float(sd_grains[g].buffer[i]) * INT16_TO_FLOAT * env * sd_grains[g].vol;
    			sd_grains[g].pos += 1;
    		}
    	}
    }

    Here's an updated version using int16_t for the granular delay memory. Only getting a marginal increase in performance from doing this (+2 read pointers), perhaps because of all the extra multiplication I need to do to convert from int to float and back. Is there any way to further optimize this using vector functions for multiplication or something similar?
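On the vector-function question: CMSIS-DSP ships block conversion helpers, `arm_q15_to_float` and `arm_float_to_q15`, which process a whole buffer per call instead of converting sample by sample in the mixing loop. A scalar sketch of the same Q15 scaling (the 1/32768 factor) for reference:

```cpp
#include <cstdint>

// Scalar equivalent of CMSIS-DSP arm_q15_to_float / arm_float_to_q15.
// Q15: int16_t in [-32768, 32767] maps to float in [-1.0, 1.0).
void q15_to_float(const int16_t* src, float* dst, int n) {
    const float k = 1.0f / 32768.0f;
    for (int i = 0; i < n; ++i) dst[i] = src[i] * k;
}

void float_to_q15(const float* src, int16_t* dst, int n) {
    for (int i = 0; i < n; ++i) {
        float v = src[i] * 32768.0f;
        if (v > 32767.0f) v = 32767.0f;    // saturate, like the CMSIS version
        if (v < -32768.0f) v = -32768.0f;
        dst[i] = static_cast<int16_t>(v);
    }
}
```

Converting once per buffer (and folding the grain envelope/volume into the float path afterwards) keeps the conversion cost out of the inner per-sample loop.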

    Super User
    July 10, 2024

    Ok, not much gain. :)

    Basically an INT multiplication takes one cycle, but on a CPU with an FPU a float multiply is the same speed (the H7 has a double-precision FPU); only loading the FPU registers needs extra clock cycles, which is why INT is a little faster. BUT if you have to do a conversion for every value, from float to int and back, then you lose that higher speed again, as happens here.

    So doing it with INT will be faster only if everything is in INT, without conversion.

    >an incoming audio stream

    This is INT16, so keep it INT16... without any conversion to float and then back to int and then to float, etc.

    And what's your optimizer setting? (This has a strong effect on speed...) I use -O2, but try -Ofast also.

    EPala.2 (Author)
    Graduate
    July 10, 2024

    My thinking in using int16_t as the memory format was that the SDRAM read / write process is the slowest thing happening and therefore the biggest bottleneck to performance. int16_t means that half the amount of data is read off the SDRAM compared to floating point (2 bytes versus 4), but it seems the extra conversion overhead negates some of this advantage.

    I need the rest of the system to be floating point.

    Is there any way to implement something like a float16_t? That would be the best of both worlds, if the option exists.
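For what it's worth, the Cortex-M7 FPU can convert to and from IEEE half precision (GCC exposes this as the storage-only `__fp16` type on ARM), but it cannot do arithmetic in it, so values still widen to float for processing. A portable sketch of the binary16 storage format, simplified to normal-range values only (subnormals and NaN handling are omitted for brevity, and the mantissa is truncated rather than rounded):

```cpp
#include <cstdint>
#include <cstring>

// Simplified IEEE binary16 pack/unpack for normal-range values.
uint16_t float_to_half(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint16_t sign = uint16_t((bits >> 16) & 0x8000u);
    int32_t exp = int32_t((bits >> 23) & 0xFF) - 127 + 15;  // rebias 8 -> 5 bit
    uint16_t mant = uint16_t((bits >> 13) & 0x3FFu);        // keep top 10 bits
    if (exp <= 0) return sign;              // flush tiny values to signed zero
    if (exp >= 31) return sign | 0x7C00u;   // overflow -> infinity
    return sign | uint16_t(exp << 10) | mant;
}

float half_to_float(uint16_t h) {
    uint32_t sign = uint32_t(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = (exp == 0) ? sign                       // signed zero
                               : sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

At 2 bytes per sample this halves SDRAM traffic just like int16_t, but the 11-bit effective mantissa is below 16-bit PCM resolution at full scale, so it trades precision differently (better for quiet signals, worse for loud ones) rather than strictly winning.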