Explorer II
May 4, 2025
Solved

STM32H723: How to optimize summation of an array?


Hi folks.

I am trying to optimize the following piece of code for execution time.

	for (uint32_t i = 6 + adc_data_index; i < 35 + adc_data_index; i++)
	{
		raw[0] += (adc_data[i]);
		raw[1] += (adc_data[i + 35]);
		raw[2] += (adc_data[i + 70]);
		raw[3] += (adc_data[i + 115]);
	}

It currently takes 3.5 µs at a 250 MHz clock.

I want to reduce that by at least a factor of 2.

Do you have any ideas?

What I tried:

1. Changing the optimization level to -Ofast

2. Using pointers

3. I also thought about using the FMAC and DFSDM

How can I achieve that?

Thanks

Yonatan

    Best answer by TDK

    7 replies

    Graduate
    May 4, 2025

    Suggestion: avoid computations inside the loop, e.g. by replacing i with an array prepared before running the loop, and run the loop from a down to zero to simplify the end-of-loop check.
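
One possible reading of this suggestion, as a minimal sketch: the start addresses are computed once into a small array of pointers, and the loop counter runs down to zero so the exit test is a compare against zero. It assumes adc_data holds uint16_t samples and raw is a uint32_t[4]; `sum_channels_countdown`, `start`, `offs` and `count` are hypothetical names, not from the original code.

```c
#include <stdint.h>

/* Hypothetical helper: sums `count` samples for each of four channels.
 * Assumes uint16_t samples; offs[] holds the per-channel offsets
 * (0, 35, 70, 115 in the original post). */
static void sum_channels_countdown(const uint16_t *adc_data, uint32_t start,
                                   const uint32_t offs[4], uint32_t count,
                                   uint32_t raw[4])
{
    /* Address arithmetic is done once, before the loop. */
    const uint16_t *p[4];
    for (int ch = 0; ch < 4; ch++) {
        p[ch] = &adc_data[start + offs[ch]];
    }

    /* Counting down lets the loop end on a compare against zero. */
    for (uint32_t n = count; n != 0; n--) {
        raw[0] += *p[0]++;
        raw[1] += *p[1]++;
        raw[2] += *p[2]++;
        raw[3] += *p[3]++;
    }
}
```

Whether this is actually faster than the indexed loop depends on the compiler and optimization level, so it is worth checking the disassembly.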

     

     

    yonatanAuthor
    Explorer II
    May 4, 2025

    Thanks @mbarg.1 

    What do you mean by "replacing i with an array"?

     

    TDKAnswer
    Super User
    May 4, 2025

    > 3.5 micro-second at 250 MHz clock

    So that is 875 cycles for 116 (4*29) summations, or roughly 7.5 cycles per summation. There is probably some improvement to be made.

     

    Storing raw and adc_data in DTCMRAM will help.

    Enabling data cache if not already enabled will help a lot.

    Executing the function out of ITCMRAM will also help.

     

    Looking at the disassembly will be the most useful step here, to understand what the compiler is doing and see what is unnecessary. That can help guide you to the right solution. I imagine that using a pointer for access and comparing the loop variable to a pointer constant rather than 35 + X will help a bit.
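
A rough sketch of how these suggestions could look together, under several assumptions: adc_data holds uint16_t samples, the linker script provides TCM sections (the names ".dtcm_data" and ".itcm_text" are placeholders), and `USE_TCM_SECTIONS` is a hypothetical build flag so the same file also compiles on a host. The loop compares a moving pointer against a precomputed end pointer and accumulates in local variables, which lets the compiler keep the sums in registers.

```c
#include <stdint.h>

/* Placeholder section names: they must match the project's linker script. */
#ifdef USE_TCM_SECTIONS
#define IN_DTCM __attribute__((section(".dtcm_data")))
#define IN_ITCM __attribute__((section(".itcm_text")))
#else
#define IN_DTCM
#define IN_ITCM
#endif

IN_DTCM uint32_t raw[4];

/* Hypothetical function: adds `count` samples per channel to raw[].
 * The offsets 35, 70 and 115 are taken from the original post. */
IN_ITCM void sum_channels(const uint16_t *adc_data, uint32_t first, uint32_t count)
{
    const uint16_t *p = &adc_data[first];
    const uint16_t *const end = p + count;   /* loop bound as a pointer */

    /* Local accumulators avoid a load/store of raw[] on every iteration. */
    uint32_t s0 = raw[0], s1 = raw[1], s2 = raw[2], s3 = raw[3];

    while (p != end) {
        s0 += p[0];
        s1 += p[35];
        s2 += p[70];
        s3 += p[115];
        p++;
    }

    raw[0] = s0;
    raw[1] = s1;
    raw[2] = s2;
    raw[3] = s3;
}
```

Variables and code placed in custom sections may need matching linker-script and startup handling (zeroing the data, copying the function into ITCM), which is not shown here.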

    yonatanAuthor
    Explorer II
    May 5, 2025

    Thanks.

    1. Does enabling the I/D cache have any downsides?

    2. Should I protect the adc_data buffer with the MPU? Is this mandatory?

    3. Does placing the adc_data in the DTCM eliminate the need to use the MPU (Is DTCM always protected from cache issues?)

    Graduate
    May 5, 2025

    Cache will speed up execution, BUT you must manage it. It is up to you to decide whether the extra load and complexity are worth it.

    Protecting data is application dependent. ADC samples are typically primitives (e.g. uint16_t) that cannot be invalid, but you may need the whole set to be valid before processing. Again, it is up to you to decide.
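
To illustrate what "managing" the cache can involve for a DMA-filled buffer: before the CPU reads new samples, the matching data-cache lines are usually invalidated, and that is done on whole 32-byte lines on the Cortex-M7. The sketch below only computes the line-aligned range; `cache_range_t` and `cache_align_range` are hypothetical names, and the CMSIS call appears only in a comment so the code compiles on a host.

```c
#include <stdint.h>
#include <stddef.h>

#define DCACHE_LINE 32u   /* Cortex-M7 data cache line size in bytes */

typedef struct {
    uintptr_t addr;   /* start address, rounded down to a cache line */
    size_t    size;   /* length in bytes, a multiple of the line size */
} cache_range_t;

/* Hypothetical helper: expands [buf, buf + len) to whole cache lines. */
static cache_range_t cache_align_range(const void *buf, size_t len)
{
    const uintptr_t mask  = (uintptr_t)(DCACHE_LINE - 1u);
    const uintptr_t start = (uintptr_t)buf & ~mask;
    const uintptr_t end   = ((uintptr_t)buf + len + mask) & ~mask;
    cache_range_t r = { start, (size_t)(end - start) };
    return r;
}

/* On target, assuming the CMSIS core header is included, the range
 * could then be invalidated before reading the samples:
 *   cache_range_t r = cache_align_range(adc_data, sizeof adc_data);
 *   SCB_InvalidateDCache_by_Addr((uint32_t *)r.addr, (int32_t)r.size);
 */
```

If the buffer shares a cache line with other variables, invalidating that line can discard pending writes to them, so aligning the buffer to 32 bytes and padding it to a multiple of 32 bytes is the safer layout.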

     

    Graduate II
    May 5, 2025

    A mix of all of the above might help - although I'm afraid of caches... :D

    But you probably use the ADC with DMA, so the ADC buffer cannot be placed in DTCM.

    So I would try:

    uint16_t *pu16Adat0 = &adc_data[adc_data_index + 6 + 0]; // pointer type must be same as adc_data!
    uint16_t *pu16Adat1 = &adc_data[adc_data_index + 6 + 35];
    uint16_t *pu16Adat2 = &adc_data[adc_data_index + 6 + 70];
    uint16_t *pu16Adat3 = &adc_data[adc_data_index + 6 + 105]; // or is it really "115" ?
    
    for( uint32_t i = 0; i < 29; i++ )
    {
     raw[0] += pu16Adat0[i];
     raw[1] += pu16Adat1[i];
     raw[2] += pu16Adat2[i];
     raw[3] += pu16Adat3[i];
    }

     

    It would be interesting to see whether using pointers and incrementing them speeds things up, like

    raw[0] += *(pu16Adat0++);

    yonatanAuthor
    Explorer II
    May 5, 2025

    Thanks!

    It saved me ~250 ns

    I am counting every clock.

    Graduate II
    May 5, 2025

    > It saved me ~250 ns

    Oh my, that's disappointing... :(

    Graduate II
    May 5, 2025

    Maybe it helps if you place at least the iteration variable i and the destination buffer raw[] into DTCM.

    And / or using data cache might help.

    Graduate II
    May 6, 2025

    Thanks for coming back with the working code!

    How's the timing with the function in ITCM RAM?

    yonatanAuthor
    Explorer II
    May 6, 2025

    It saved me something like another ~500 ns

    (BTW, I also enabled the ICache so the "profit" is small)

    Super User
    May 6, 2025

    Was explicit loop unrolling already mentioned?

    uint32_t* p = &adc_data[6 + adc_data_index]; // pointer type must match adc_data's element type
    // 29 terms per block (p[0]..p[28]), matching the original loop; this assigns, use += to accumulate into raw[]
    raw[0] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
     p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
     p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
     p[24] + p[25] + p[26] + p[27] + p[28];
    p += 35;
    raw[1] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
     p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
     p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
     p[24] + p[25] + p[26] + p[27] + p[28];
    p += 35;
    raw[2] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
     p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
     p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
     p[24] + p[25] + p[26] + p[27] + p[28];
    p += 35; // use 45 here if the fourth block really starts at offset 115, as in the original post
    raw[3] = p[0] + p[1] + p[2] + p[3] + p[4] + p[5] + p[6] + p[7] + 
     p[8] + p[9] + p[10] + p[11] + p[12] + p[13] + p[14] + p[15] + 
     p[16] + p[17] + p[18] + p[19] + p[20] + p[21] + p[22] + p[23] + 
     p[24] + p[25] + p[26] + p[27] + p[28];

    Inspect the disassembly of the resulting code; you should see a repeating pattern of load/add instructions.

    JW

     

    PS. You can MDMA into DTCM. The idea is to gather data from peripherals into SRAM using the "normal" DMA, and then the DMA's transfer complete would trigger MDMA which would in turn move all that data to DTCM for the processor to process further.

    yonatanAuthor
    Explorer II
    May 6, 2025

    Hi @waclawek.jan 

    You are right, but the problem is that '6' and '35' are not known at compile time.

    They are initialized to their values at run time.

    Regarding the MDMA...

    In general it is possible, but the DMA runs in circular mode, so I am afraid of missing some signals (interrupts etc.)

    Super User
    May 6, 2025

    > the problem is that '6' and '35' are not known at compile time

    That makes things more complicated but not hopeless.

    If there's a limited number of '6' and '35' variants, you can have a separate function for each combination (i.e. you compile many functions) and then choose the appropriate one at run time.

    If there are more variants than can reasonably be managed, you can use "calculated jumps" into the middle of the series of additions. Switch/case may accomplish this, but you need to check whether the compiler actually compiles it reasonably.

    nr = var35 - var6;
    p = &adc_data[var6];
    sum = 0;
    switch(nr) {
     case 29: sum += *p++; // note the intentional fallthrough 
     case 28: sum += *p++;
     case 27: sum += *p++;
     [etc.]
    }
    p += whatever_remains;

    One may also want to resort to asm here, inline or not, if C does not provide enough control over the resulting code. I'm not sure whether any compiler recognizes the pattern and actually calculates the jump; most of them should at least use the table-jump instruction (TBB/TBH), but some may be stubborn and generate a chain of conditional jumps, which is useless here.

    A partial unroll combined with a calculated jump can be used as a simplified, slightly worse version, too. This combination is known as Duff's device.
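
For illustration, a minimal sketch of Duff's device applied to a run-time sample count: the loop body is unrolled four times, and the switch jumps into the middle of it to handle the remainder. It assumes uint16_t samples; `sum_duff` is a hypothetical name. Whether a given compiler turns the switch into an efficient jump still has to be checked in the disassembly.

```c
#include <stdint.h>

/* Hypothetical helper: sums `count` samples with a 4x unrolled loop.
 * The switch enters the loop body part-way through, so the first pass
 * handles count % 4 samples and every later pass handles 4. */
static uint32_t sum_duff(const uint16_t *p, uint32_t count)
{
    uint32_t sum = 0;
    if (count == 0) {
        return 0;
    }

    uint32_t passes = (count + 3u) / 4u;
    switch (count % 4u) {
    case 0: do { sum += *p++; /* fall through */
    case 3:      sum += *p++; /* fall through */
    case 2:      sum += *p++; /* fall through */
    case 1:      sum += *p++;
            } while (--passes > 0);
    }
    return sum;
}
```

With 29 samples, the first pass adds one sample (29 % 4 = 1) and the remaining seven passes add four each.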

    Another option is to generate the code into RAM in runtime, or use self-modifying code (which may be as simple as inserting at the appropriate place in a sequence of additions a jump out of the sequence).

    >> MDMA
    > In general it is possible, but the DMA runs in circular mode, so I am afraid of missing some signals (interrupts etc.)

    I don't see why anything would get missed here, but I also don't know your whole application.

    JW

    Graduate II
    May 6, 2025

    MDMA:

    If you're afraid of losing data, you could also use the DMA's half-transfer-complete interrupt and then trigger the MDMA for the first half of the buffer.

    And / or the DMA's double buffer mode (DBM).
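
A host-side sketch of the half-buffer idea, with memcpy standing in for the MDMA transfer: while the DMA fills one half of the circular buffer, the other half is copied out for processing. All names here (`dma_buf`, `tcm_copy`, `on_dma_half_complete`, `on_dma_complete`, `BUF_LEN`) are hypothetical. With the STM32 HAL ADC driver, the two handlers would typically be called from `HAL_ADC_ConvHalfCpltCallback` and `HAL_ADC_ConvCpltCallback`.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BUF_LEN  8u              /* hypothetical circular buffer length */
#define HALF_LEN (BUF_LEN / 2u)

uint16_t dma_buf[BUF_LEN];       /* filled by the circular DMA (SRAM on target) */
uint16_t tcm_copy[BUF_LEN];      /* processing copy (DTCM on target) */

/* Copies one half of the DMA buffer; on target this is where the
 * MDMA transfer for that half would be started instead. */
static void copy_half(size_t half)
{
    const size_t off = half * HALF_LEN;
    memcpy(&tcm_copy[off], &dma_buf[off], HALF_LEN * sizeof dma_buf[0]);
}

/* First half is complete; the DMA is now writing the second half. */
void on_dma_half_complete(void) { copy_half(0); }

/* Second half is complete; the DMA has wrapped to the first half. */
void on_dma_complete(void) { copy_half(1); }
```

The copy of each half has to finish before the DMA wraps around and overwrites it, which sets the timing budget for this scheme.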