Graduate
June 17, 2024
Question

Optimizing FMC Read / Writes for Audio Application STM32H7

  • June 17, 2024
  • 6 replies
  • 4987 views

Hi, I am working on a real-time audio DSP product using the STM32H7 with an AS4C4M16SA-7BCN SDRAM chip for long delay-line memory. I am using the FMC controller with the settings in the attached photo:

EPala2_0-1718652361505.png

The product processes an incoming audio stream in real time, so this is a very runtime critical application. I have found that reads and writes to and from the delay memory on the SDRAM are by far the biggest drag on overall performance. 

Currently I am just accessing SDRAM memory automatically through the C++ compiler, declaring as follows and accessing as I would any other variable: 
static float delay_line[48000 * 30] __attribute__((section(".sdram"))); //48000 sample rate * 30 seconds
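For context, a declaration like this also needs a matching `.sdram` output section in the linker script. A minimal fragment might look like the following sketch; the region name, attributes, and length are assumptions (0xC0000000 is the FMC SDRAM bank 1 base on the STM32H7, and the AS4C4M16SA is 8 MB), so adapt them to the actual board:

```ld
/* Hypothetical linker fragment: map the .sdram input section to external
 * SDRAM behind the FMC. NOLOAD because the array is not initialized at boot. */
MEMORY
{
  SDRAM (xrw) : ORIGIN = 0xC0000000, LENGTH = 8M
}

SECTIONS
{
  .sdram (NOLOAD) :
  {
    *(.sdram)
  } > SDRAM
}
```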

I am wondering if there are any ways to optimize SDRAM reads and writes to get better performance, either through how I structure my code, or through settings in the CubeMX configurator. 

In particular, would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory rather than just accessing at random points based on my code behavior? Is there a vector style function that can quickly copy a block of data from the SDRAM to local memory? Would this approach be likely to provide a noticeable performance increase?

Please advise, thanks!

 

    This topic has been closed for replies.


    Super User
    June 17, 2024

     would it be faster to do sequential reads from consecutive SDRAM locations to a buffer in onboard memory 

    Yes. This is essentially what the data cache on the Cortex-M7 does for you.

     

    Graduate II
    June 17, 2024

    For MCU access you can make the SDRAM cacheable; see the MPUConfig() examples.

    DMA into memory won't go through the cache.

    You should be able to use MEM2MEM DMA modes to move data in the background, but that might add contention.

    You'll have to benchmark to see how much performance you can gain by doing the processing on-board and then migrating out to SDRAM. Generally, the fewer the moves and the simpler the pipeline, the better.

    On the F4, SDRAM was on the order of 6x slower than internal SRAM.

    The DTCM is not cached, and it is faster than any 0-wait-state memory outside the core. If you can keep things small and fast, do that.

    If you use the SDRAM as the dynamic memory pool (heap) and access it through pointers, you can likely test and adapt things more quickly.

    Don't use SDRAM for the STACK.
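As a sketch of the "small and fast" advice, a hot scratch buffer can be pinned to DTCM with a section attribute. The section name `.dtcm` is an assumption; it must exist in the linker script (DTCM is typically mapped at 0x20000000 on the H7). On a host build the attribute just names an ELF section, so the snippet compiles anywhere:

```cpp
#include <cstddef>

// Hypothetical placement: on target, ".dtcm" would be mapped to DTCM RAM in
// the linker script; the array then lives in zero-wait-state memory.
static float scratch[1024] __attribute__((section(".dtcm")));

float sum_scratch(std::size_t n) {
    // Tight per-sample work runs from DTCM, never touching SDRAM.
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += scratch[i];
    return acc;
}
```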

    EPala.2 (Author)
    Graduate
    June 17, 2024

    Thank you! What are the MPUConfig examples that you are referring to? I tried googling but wasn't seeing any clear results. 

    EPala.2 (Author)
    Graduate
    June 17, 2024

    In my application I am writing incoming audio to a very long delay line (30+ seconds at a 48 kHz sample rate) and executing a lot of reads from different points, which are then mixed together. Maybe it would be possible to execute the writes to SDRAM via DMA (since there is only one write per delay line happening per callback), and then do the reads as memcpy calls from SDRAM into local buffers the size of my audio callback buffer. I could store the local buffers in DTCM RAM for faster execution.

    Does that sound like a good approach?

    Graduate II
    June 17, 2024

    For unmanageably long / large amounts of data, my gut says move it once into / out of SDRAM. Least complicated, fewest moves.

    If you're pre-processing, do it in the fastest memory first, and move/generate the results into the SDRAM, ideally directly.

    EPala.2 (Author)
    Graduate
    July 5, 2024

    Hi all, I've found that this method of loading a whole buffer at a time is running faster than the original version:

    void mdsp_pedal_yy::process_grain_cloud_sd(T* in_b, T* out_b){
    
    	memcpy(&gfxl[write_ptr], in_b, sizeof(T) * PROC_BUFFER_SIZE);
    
    	if(write_ptr <= PROC_BUFFER_SIZE * 2){ //copy the beginning of delay memory to end to ensure contiguous reads
    		memcpy(&gfxl[write_ptr + GRAIN_DELAY], in_b, sizeof(T) * PROC_BUFFER_SIZE);
    	}
    
    	write_ptr += PROC_BUFFER_SIZE;
    
    	if(write_ptr >= GRAIN_DELAY){
    		write_ptr -= GRAIN_DELAY;
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		if(sd_grains[g].pos >= sd_grains[g].size || !sd_grains[g].active){
    			sd_grains[g].active = true;
    			start_grain_cloud_sd(g);
    		}
    		sd_grains[g].read_ptr = write_ptr + GRAIN_DELAY - sd_grains[g].read;
    		if(sd_grains[g].read_ptr >= GRAIN_DELAY){ sd_grains[g].read_ptr -= GRAIN_DELAY; }
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		memcpy(sd_grains[g].buffer, &gfxl[sd_grains[g].read_ptr], sizeof(T) * sd_grains[g].buffer_len);
    	}
    
    	memset(out_b, 0, sizeof(T) * PROC_BUFFER_SIZE);
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
    			T env = sinf((sd_grains[g].pos / sd_grains[g].size) * M_PI) * 1.0f;
    			if(env < 0){ env = 0; }
    			if(env > 1){ env = 1; }
    			out_b[i] += sd_grains[g].buffer[i] * env;
    			sd_grains[g].pos += 1;
    		}
    	}
    }

    I do have an issue: for a pitch-shifted octave-up process I need to read every other sample from SDRAM memory (to double the playback speed of the sample).

    Do y'all think it would be faster to do this via a for loop that reads every other sample from SDRAM, or to just read a double-length buffer with memcpy()?
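Both options from this question can be sketched like so (names are illustrative; which one wins depends on cache behavior, so benchmark both on target):

```cpp
#include <cstring>
#include <cstdint>

constexpr int N = 64;  // output samples per callback

// Option A: strided reads straight from (cached) SDRAM.
void decimate_strided(const int16_t* sdram, int16_t* out) {
    for (int i = 0; i < N; ++i) out[i] = sdram[2 * i];
}

// Option B: one sequential memcpy of 2*N samples, then decimate locally.
void decimate_blocked(const int16_t* sdram, int16_t* out) {
    int16_t tmp[2 * N];  // local buffer (would be DTCM on target)
    std::memcpy(tmp, sdram, sizeof(tmp));
    for (int i = 0; i < N; ++i) out[i] = tmp[2 * i];
}
```

Option B turns the SDRAM traffic into one burst at the cost of copying samples that are then discarded; with the D-cache enabled the difference may be small, since a cache-line fill already pulls in the skipped neighbors.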

    Super User
    July 5, 2024

    Hi,

    Just asking: why do you use float? > static float delay_line[48000 * 30]

    To get hi-fi, 16 bits would be OK; for top studio quality, 24 bits. If you want some extra headroom... 32 bits. (Integer.)

    And how do you load the delay? Circular buffer...?

    And why such a super long delay line? (30 sec = a 9 km room... even "over" the top for simulating an open-air concert in a stadium.)

    EPala.2 (Author)
    Graduate
    July 5, 2024

    16-bit is an approach we are probably going to take. Some other parts of the processing need float resolution, but for the granular process 16-bit will be enough. I just have not added that yet; I'm focusing on the read / write functions for the time being.

    As for the delay memory: this is a granular process, a creative effect, different from a regular delay or reverb. Long memory is part of how it works.

    Do you have any thoughts on the question of my octave up use case?

    In this scenario I need every other sample played back in order to achieve a pitch shift effect. 

    Would it be faster to do this by using memcpy to read a buffer of twice the length, or by using a for loop to read every other sample from an SDRAM segment into local memory?

     

    Please advise.

    Super User
    July 5, 2024

    Faster? So use 16-bit! (Min. 200% faster than float.)

    +

    Your H7 is at 400 MHz or so; the core is always faster than the (external) memory access, so it needs wait states for every access. To make it faster:

    1. smaller data: float -> int16_t;

    2. memcpy or a direct "take the int" is about the same (it's always the CPU doing the work and waiting...);

    3. maybe (!) faster: copy the needed part of memory by DMA or MDMA to internal RAM. "Maybe" because, while the DMA transfer is going on, the internal bus is busy, so the CPU may have to wait for free bus access.

    What is really fastest you have to try - and make a plan in advance of where the data is going and which bus is blocked then. So (I don't know exactly what you are doing, you didn't give details) one way could be to copy data by DMA into a block B while the CPU works on data in block A; when finished, the CPU works on B while A is loaded with new data by the DMA. (Circular DMA might be your friend... and half/full-transfer callbacks.)
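The A/B ping-pong scheme described above can be sketched in plain C++; here a real DMA engine is simulated by a plain copy, and `fill_from_dma` is an illustrative stand-in for the half/full-transfer callback:

```cpp
#include <cstring>
#include <cstdint>

constexpr int kHalf = 32;  // samples per half-buffer

struct PingPong {
    int16_t buf[2][kHalf];  // two halves of one circular "DMA" buffer
    int dma_half = 0;       // the half the DMA writes next

    // Simulated DMA half-transfer complete: fill the current half, flip.
    void fill_from_dma(const int16_t* src) {
        std::memcpy(buf[dma_half], src, sizeof(buf[0]));
        dma_half ^= 1;
    }

    // The CPU always processes the half the DMA is NOT currently writing.
    const int16_t* cpu_half() const { return buf[dma_half ^ 1]; }
};
```

On target the flip would happen in the DMA half- and full-transfer interrupts, and the CPU-side pointer is what the audio callback consumes.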

    EPala.2 (Author)
    Graduate
    July 9, 2024
    void mdsp_pedal_yy::process_grain_cloud_sd(T* in_b, T* out_b){
    
    	int16_t cpy_buffer[PROC_BUFFER_SIZE];
    
    	for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
    		cpy_buffer[i] = in_b[i] * FLOAT_TO_INT16;
    	}
    
    	memcpy(&gfxlsd[write_ptr], cpy_buffer, sizeof(int16_t) * PROC_BUFFER_SIZE);
    
    	if(write_ptr <= PROC_BUFFER_SIZE * 2){ //copy the beginning of delay memory to end to ensure contiguous reads
    		memcpy(&gfxlsd[write_ptr + GRAIN_DELAY], cpy_buffer, sizeof(int16_t) * PROC_BUFFER_SIZE);
    	}
    
    	write_ptr += PROC_BUFFER_SIZE;
    
    	if(write_ptr >= GRAIN_DELAY){
    		write_ptr -= GRAIN_DELAY;
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		if(sd_grains[g].pos >= sd_grains[g].size || !sd_grains[g].active){
    			sd_grains[g].active = true;
    			start_grain_cloud_sd(g);
    		}
    		sd_grains[g].read_ptr = write_ptr + GRAIN_DELAY - sd_grains[g].read;
    		if(sd_grains[g].read_ptr >= GRAIN_DELAY){ sd_grains[g].read_ptr -= GRAIN_DELAY; }
    	}
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		int pitch = sd_grains[g].pitch;
    		uint32_t read = sd_grains[g].read_ptr;
    		if(pitch == normal_speed){
    			memcpy(sd_grains[g].buffer, &gfxlsd[read], sizeof(int16_t) * PROC_BUFFER_SIZE);
    		}else if(pitch == double_speed){
    			for(int i = 0; i < PROC_BUFFER_SIZE * 2; i += 2){
    				sd_grains[g].buffer[i>>1] = gfxlsd[read + i];
    			}
    			sd_grains[g].read -= PROC_BUFFER_SIZE;
    		}else if(pitch == half_speed){
    			memcpy(cpy_buffer, &gfxlsd[read], sizeof(int16_t) * ((PROC_BUFFER_SIZE >> 1) + 1));
    			for(int i = 0; i < PROC_BUFFER_SIZE; i += 2){
    				int loc = i >> 1;
    				sd_grains[g].buffer[i + 1] = (cpy_buffer[loc] >> 1) + (cpy_buffer[loc + 1] >> 1);
    				sd_grains[g].buffer[i] = cpy_buffer[loc];
    			}
    			sd_grains[g].read += PROC_BUFFER_SIZE >> 1;
    		}else if(pitch == reverse){
    			for(int i = PROC_BUFFER_SIZE - 1; i >= 0; --i){
    				int loc = PROC_BUFFER_SIZE - 1 - i;
    				sd_grains[g].buffer[loc] = gfxlsd[read + i];
    			}
    			sd_grains[g].read += PROC_BUFFER_SIZE * 2;
    		}
    	}
    
    	memset(out_b, 0, sizeof(T) * PROC_BUFFER_SIZE);
    
    	for(int g = 0; g < num_sd_grains; ++g){
    		for(int i = 0; i < PROC_BUFFER_SIZE; ++i){
    			T env = sinf((sd_grains[g].pos / sd_grains[g].size) * M_PI) * 1.0f;
    			if(env < 0){ env = 0; }
    			if(env > 1){ env = 1; }
    			out_b[i] += float(sd_grains[g].buffer[i]) * INT16_TO_FLOAT * env * sd_grains[g].vol;
    			sd_grains[g].pos += 1;
    		}
    	}
    }

    Here's an updated version using int16_t for the granular delay memory. Only getting a marginal increase in performance from doing this (+2 read pointers), perhaps because of all the extra multiplication I need to do to convert from int to float and back. Is there any way to further optimize this using vector functions for multiplication or something similar?
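On the vector-function question: CMSIS-DSP ships block conversion helpers, `arm_q15_to_float` and `arm_float_to_q15`, which process a whole buffer per call instead of converting sample by sample in the mixing loop. A scalar sketch of the same Q15 scaling (the 1/32768 factor) for reference:

```cpp
#include <cstdint>

// Scalar equivalent of CMSIS-DSP arm_q15_to_float / arm_float_to_q15.
// Q15: int16_t in [-32768, 32767] maps to float in [-1.0, 1.0).
void q15_to_float(const int16_t* src, float* dst, int n) {
    const float k = 1.0f / 32768.0f;
    for (int i = 0; i < n; ++i) dst[i] = src[i] * k;
}

void float_to_q15(const float* src, int16_t* dst, int n) {
    for (int i = 0; i < n; ++i) {
        float v = src[i] * 32768.0f;
        if (v > 32767.0f) v = 32767.0f;    // saturate, like the CMSIS version
        if (v < -32768.0f) v = -32768.0f;
        dst[i] = static_cast<int16_t>(v);
    }
}
```

Converting once per buffer (and folding the grain envelope/volume into the float path afterwards) keeps the conversion cost out of the inner per-sample loop.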

    Super User
    July 10, 2024

    Ok, not much gain. :)

    Basically an INT multiplication takes one cycle, but on a CPU with an FPU a float multiply is the same speed (the H7 has a double-precision FPU); only loading the FPU registers needs extra clock cycles, which is why INT is a little faster. BUT if you have to do a conversion for every value, from float to int and back, then you lose that higher speed again, as happens here.

    So doing it with INT will be faster only if everything is in INT, without conversion.

    >an incoming audio stream

    This is INT16, so keep it INT16... without any conversion to float and then back to int and then to float, etc.

    And what's your optimizer setting? (This has a strong effect on speed...) I use -O2, but try -Ofast also.

    EPala.2 (Author)
    Graduate
    July 10, 2024

    My thinking in using int16_t as the memory format was that the SDRAM read / write process is the slowest thing happening and therefore the biggest bottleneck to performance. int16_t means that half the amount of data is read off the SDRAM compared to floating point (2 bytes versus 4), but it seems the extra conversion overhead negates some of this advantage.

    I need the rest of the system to be floating point.

    Is there any way to implement something like a float16_t? That would be the best of both worlds, if the option exists.
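For what it's worth, the Cortex-M7 FPU can convert to and from IEEE half precision (GCC exposes this as the storage-only `__fp16` type on ARM), but it cannot do arithmetic in it, so values still widen to float for processing. A portable sketch of the binary16 storage format, simplified to normal-range values only (subnormals and NaN handling are omitted for brevity, and the mantissa is truncated rather than rounded):

```cpp
#include <cstdint>
#include <cstring>

// Simplified IEEE binary16 pack/unpack for normal-range values.
uint16_t float_to_half(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint16_t sign = uint16_t((bits >> 16) & 0x8000u);
    int32_t exp = int32_t((bits >> 23) & 0xFF) - 127 + 15;  // rebias 8 -> 5 bit
    uint16_t mant = uint16_t((bits >> 13) & 0x3FFu);        // keep top 10 bits
    if (exp <= 0) return sign;              // flush tiny values to signed zero
    if (exp >= 31) return sign | 0x7C00u;   // overflow -> infinity
    return sign | uint16_t(exp << 10) | mant;
}

float half_to_float(uint16_t h) {
    uint32_t sign = uint32_t(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1Fu;
    uint32_t mant = h & 0x3FFu;
    uint32_t bits = (exp == 0) ? sign                       // signed zero
                               : sign | ((exp - 15 + 127) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```

At 2 bytes per sample this halves SDRAM traffic just like int16_t, but the 11-bit effective mantissa is below 16-bit PCM resolution at full scale, so it trades precision differently (better for quiet signals, worse for loud ones) rather than strictly winning.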