Visitor II
January 14, 2020
Question

STM32F767 Execution time is more compared with STM32F429


I created an OS task 100 times in the FreeRTOS example code generated by STM32CubeMX, for both the F429 and the F767, and observed the following.

F429 - 6 Ticks

F767 - 16 Ticks

Difference - 10 Ticks

What is the reason for the delay, and is there any way to speed it up?


    11 replies

    Graduate II
    January 14, 2020

    Different number of wait states? Code in RAM? Show the relevant parts!

    RBGAuthor
    Visitor II
    January 16, 2020

    Below is the code snippet used for testing on both the F429 and F767 boards. The tick difference was measured across the highlighted part.

    0690X00000Bw3kbQAB.jpg

    RBGAuthor
    Visitor II
    January 20, 2020

    Even for a simple malloc I observed a tick difference between the F429 and the F767.

    For 10000 memory allocations:

    F429 - 68 Ticks

    F767 - 78 Ticks

    Tick difference - 10 Ticks

    Below is the code.

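    void StartDefaultTask(void const * argument)
    {
      int *ptr;
      /* USER CODE BEGIN 5 */
      /* Infinite loop */
      for(;;)
      {
        /* Tick count before the allocation loop */
        printf( "Tick_test_1:%lu\n", (unsigned long) xTaskGetTickCount() );
        for(long i=0;i<10000;i++)
        {
          /* Allocate 5 ints; the block is never freed, so the heap eventually runs out */
          ptr = (int*) malloc(5*sizeof(int));
        }
        /* Tick count after the allocation loop */
        printf( "Tick_test_2:%lu\n", (unsigned long) xTaskGetTickCount() );
        osDelay(1);
      }
    }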

    Graduate II
    January 20, 2020

    Do you understand that the first printf() and (I guess) the UART transmission underneath it are included in your measurement? Also, xTaskCreate() and malloc() both use dynamic memory and are not deterministic, in terms of both processing time and whether the call succeeds.

    RBGAuthor
    Visitor II
    January 23, 2020

    Yes. I tried another approach; is this a better method to check the performance?

    I incremented a variable for the duration of one tick, and the results are below.

    F429 - a=976

    F767 - a=691

    Within one tick, the F767 does not get through the code as many times as the F429 does.

    The only task running is this default task, and the code base is the default simple example code taken from STM32CubeMX.
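The code for this test is only in the attached image, so the following is a minimal sketch of how such a "count increments during one tick" test might look, not the poster's actual code. `iterations_in_one_tick` and `fake_tick` are hypothetical names; on target the tick source could be a small wrapper around `xTaskGetTickCount()`. The function first waits for a tick edge so that the counted window is a full tick rather than a partial one.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tick source; on target this could wrap xTaskGetTickCount(). */
typedef uint32_t (*tick_fn)(void);

/* Count how many loop iterations fit into one full tick period. */
uint32_t iterations_in_one_tick(tick_fn get_tick)
{
    assert(get_tick != 0);

    /* Wait for the next tick edge so the window starts at a tick boundary. */
    const uint32_t t0 = get_tick();
    uint32_t start;
    do {
        start = get_tick();
    } while (start == t0);

    /* volatile keeps the compiler from optimising the increments away. */
    volatile uint32_t a = 0;
    while (get_tick() == start) {
        a++;
    }
    return a;
}

/* Host-side stand-in for a tick source (assumption, for trying the code
 * off-target): the tick value advances once every 10 calls. */
uint32_t fake_tick(void)
{
    static uint32_t calls = 0;
    return calls++ / 10u;
}
```

Each iteration here includes a call to the tick source, so the resulting count reflects the cost of that call as well as the increment. Counts from two boards are only comparable if both run the same loop at known core clocks.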

    0690X00000BwNgZQAV.jpg

    Graduate II
    January 23, 2020

    Disable all interrupts (__disable_irq()/__enable_irq()) and use DWT->CYCCNT for precise measurement.

    How are clocks, PLL, buses, flash and cache configured?
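A minimal sketch of the cycle-counter measurement suggested above, written so that it also compiles off-target. `measure_cycles`, `cycles_elapsed`, `fake_counter` and `fake_work` are hypothetical names introduced for illustration. On the board, the counter pointer would be `&DWT->CYCCNT`, after enabling the counter with the CMSIS registers shown in the comment; some Cortex-M7 parts may additionally need the DWT unlocked through `DWT->LAR`, so check the reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two samples of a free-running 32-bit counter.
 * Unsigned subtraction stays correct across a single wraparound. */
uint32_t cycles_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Time one call of work() using the counter located at *counter.
 *
 * On target (CMSIS names; sketch only):
 *   CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable DWT
 *   DWT->CYCCNT = 0;
 *   DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            // start counting
 *   __disable_irq();
 *   uint32_t n = measure_cycles(&DWT->CYCCNT, code_under_test);
 *   __enable_irq();
 */
uint32_t measure_cycles(volatile const uint32_t *counter, void (*work)(void))
{
    assert(counter != 0 && work != 0);

    const uint32_t start = *counter;
    work();
    return cycles_elapsed(start, *counter);
}

/* Host-side stand-ins (assumptions, for trying the code off-target):
 * a fake counter and a workload that "costs" 42 cycles. */
volatile uint32_t fake_counter = 0;

void fake_work(void)
{
    fake_counter += 42u;
}
```

Because the result is in core cycles rather than RTOS ticks, it does not depend on the tick rate, and with interrupts disabled it excludes the tick ISR and any UART activity.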

    RBGAuthor
    Visitor II
    January 24, 2020

    I tried attaching the complete code, but that is not allowed here. I am attaching snapshots of the main function and the system clock configuration function.

    Code is taken from STM32CubeMX V 4.24

    Firmware package versions

    F429 - STM32Cube_FW_F4_V1.9.0

    F767 - STM32Cube_FW_F7_V1.15.0

    Nothing else is changed in that example.

    Results for the code below, with the variable (a) kept in the live watch:

    F429 - a=998

    F767 - a=661

    0690X00000BwWP8QAN.jpg

    0690X00000BwWOyQAN.jpg

    Graduate II
    January 25, 2020

    Compare how HAL_Init() configures FLASH_ACR in both cases.

    RBGAuthor
    Visitor II
    January 27, 2020

    @Piranha​ @Uwe Bonnes​ 

    The major difference in HAL_Init() is the data cache, instruction cache and prefetch.

    I tried the combinations and didn't find much difference.

    F429 with cache and prefetch enabled - a=998

    F429 with cache and prefetch disabled - a=997

    F767 with cache and prefetch enabled - a=661

    F767 with cache and prefetch disabled - a=661

    Is the F767 slow because there is no data caching?

    HAL_Init() comparison, F767 vs. F429:

    0690X00000BwaylQAB.jpg

    F429_Flash_register_status

    0690X00000Bwb24QAB.jpg

    F767_Flash_register_status

    0690X00000Bwb2TQAR.jpg

    Graduate
    January 27, 2020

    One reason the 'F7 can be slower is that it has a longer pipeline. On any branch (function call, if, goto, loop), any partly executed instructions in the pipeline have to be abandoned and the new instruction sequence has to be loaded. (Inlining a function call eliminates this.)

    Why do this? Because that means the processor can be clocked at a higher frequency - if you choose to do so.

    The F7 has an advantage that it can sometimes execute two instructions simultaneously, which the 'F4 cannot. This very much depends on the data dependencies between successive instructions, and it takes a clever compiler run at high optimisation-level to take full advantage of this.

    What optimisation-level were you compiling at? F7 is likely to optimise better.

    You will find examples where F4 wins over F7 in terms of cycle count. And you'll find examples where F7 wins.
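To make the cycle-count versus clock-frequency trade-off concrete, here is a small sketch that converts a cycle count into time at a given core clock. `cycles_to_ns` is a hypothetical helper, and the cycle counts in the usage note are invented for illustration, not measured. The 180 MHz and 216 MHz figures are the datasheet maximum core clocks of the STM32F429 and STM32F767; the clocks actually configured in this thread are not stated.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds at the given core clock (Hz).
 * The result is truncated; the multiplication overflows for cycle
 * counts above roughly 1.8e10. */
uint64_t cycles_to_ns(uint64_t cycles, uint32_t core_hz)
{
    assert(core_hz != 0);
    return (cycles * 1000000000ull) / core_hz;
}
```

For example, a routine that needed 1000 cycles on an F4 at 180 MHz would take `cycles_to_ns(1000, 180000000u)`, about 5555 ns, while the same routine needing 1100 cycles on an F7 at 216 MHz would take `cycles_to_ns(1100, 216000000u)`, about 5092 ns. A higher cycle count can still mean a shorter execution time if the clock is higher.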

    Hope this helps,

    Danish

    RBGAuthor
    Visitor II
    January 27, 2020

    Is a longer pipeline a disadvantage, then? The F429 performs better than the F767 on simple code.

    I am building for both boards at high optimization.

    0690X00000BwbZDQAZ.jpg

    My concern is that the F767 takes more time than the F429 for malloc, creating OS tasks, printf and other major tasks.

    Is there some configuration I can change to increase the speed, or should I conclude that the F767 is slower than the F429?

    RBGAuthor
    Visitor II
    January 27, 2020

    Yes, that was a different approach, using a task created with FreeRTOS.

    Now I have tried with simple code that increments an initialized global integer (a).

    0690X00000BwbeNQAR.jpg

    0690X00000BwWP8QAN.jpg