Graduate II
October 16, 2024
Question

H7 OCTOSPI HyperRAM data throughput changing with compilation


Heyho,

I've been using the H733 (custom board) / H735 (eval kit) with Infineon's HyperRAM S70KL1281 / S70KL1282 at 100 MHz for some time now. Everything works great, except for one very annoying thing:

  • the data throughput from / to HyperRAM seems to depend on the compilation, even though the OCTOSPI peripheral was not changed
  • after some compilations it's about 178 MB/s, after others only 54 MB/s
  • data throughput is constant for one compilation, no matter whether I call the test function at MCU power-up or during operation with all other peripherals running
  • no caching anywhere

I'm pretty sure that it's not a "faulty" timing measurement: I'm using the cycle counter and disabling all interrupts around the for loops.

  • Is there something wrong in my test function?
  • Is it maybe "only" how the for loop / iteration is compiled?
  • right now I can't get it back to the high speed, so no map / list file
  • my scope here is too old and slow to check the signal lines

Here's the test function, first writing to HyperRAM, then reading:

/* +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ */
/* OCTOSPI HyperRAM test
 */
#define HYPER_TEST_UART		1

uint32_t OspiHypRamTest(uint8_t u8CountDown)
{
	uint32_t i = 0;
	uint32_t u32Val = 0xFFFFFFFF;
	uint32_t u32MaxLen = (uint32_t)((uint32_t)OSPI_HYPERRAM_END_ADDR / 4);
	uint32_t u32Errors = 0;
	uint32_t u32Data = 0;
	uint32_t u32CycStart = 0;
	uint32_t u32Cycles = 0;
	float flClockMHz = (float)HAL_RCC_GetSysClockFreq() / 1E6;
	float flVal = 0.0f;
	uint32_t *pu32MemAddr = NULL;

	if( 	 OCTOSPI1 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI1_BASE;
	else if( OCTOSPI2 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;

#if HYPER_TEST_UART
	uart_printf("\n\r+++++++++++++++++++++++++++++++++++++++++++++++++\n\r");
	uart_printf("OCTOSPI HyperRAM test, memory mapped, IRQs OFF\n\rcounting ");
	if( 0 == u8CountDown ) uart_printf("UP, start with 0\n\r\n\r");
	else uart_printf("DOWN, start with %08lX\n\r\n\r", u32Val);

	uart_printf("writing bytes: %lu\n\r", (uint32_t)OSPI_HYPERRAM_END_ADDR);
#endif

__DSB();
__disable_irq();

/* write complete HyperRAM */
	/* UP - should be faster */
	if( 0 == u8CountDown )
	{
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			pu32MemAddr[i] = i;
		}
		__DMB();
		__DSB();
		u32Cycles = DWT->CYCCNT;
	}
	/* DOWN */
	else
	{
		u32Val = 0xFFFFFFFF;
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			pu32MemAddr[i] = u32Val;
			u32Val--;
		}
		__DMB();
		__DSB();
		u32Cycles = DWT->CYCCNT;
	}

__enable_irq();
__DSB();

	u32Cycles -= u32CycStart;

	flVal = (float)u32Cycles / flClockMHz;
	flOspiRamSpeedMBpsMmWr = (float)OSPI_HYPERRAM_END_ADDR / flVal;
	flOspiRamSpeedMBpsMmWr *= (float)MEGA_CORRECTION;

#if HYPER_TEST_UART
	uart_printf("%lu CPU cycles = %.1f ms\n\r", u32Cycles, (flVal / 1000.0f));
	uart_printf("\n\r-> %.2f MB/s (%.0f Mbit/s) WRITE\n\r\n\r", flOspiRamSpeedMBpsMmWr, (8.0f * flOspiRamSpeedMBpsMmWr));

	uart_printf("reading & comparing bytes: %lu\n\r", (uint32_t)OSPI_HYPERRAM_END_ADDR);
#endif

__DSB();

	if( 	 OCTOSPI1 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI1_BASE;
	else if( OCTOSPI2 == pOspiHypRam ) pu32MemAddr = (uint32_t *)OCTOSPI2_BASE;

__disable_irq();
__DSB();

/* read complete HyperRAM and compare */
	/* UP - should be faster */
	if( 0 == u8CountDown )
	{
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			u32Data = pu32MemAddr[i];
			if( u32Data != i ) u32Errors++;
		}
		__DMB();
		__DSB();

		u32Cycles = DWT->CYCCNT;
	}
	/* DOWN */
	else
	{
		u32Val = 0xFFFFFFFF;
		u32CycStart = DWT->CYCCNT;
		for( i = 0; i < u32MaxLen; i++ )
		{
			u32Data = pu32MemAddr[i];
			if( u32Data != (u32Val - i) ) u32Errors++;
		}
		__DMB();
		__DSB();

		u32Cycles = DWT->CYCCNT;
	}
__enable_irq();

	u32Cycles -= u32CycStart;

	flVal = (float)u32Cycles / flClockMHz;
	flOspiRamSpeedMBpsMmRd = (float)OSPI_HYPERRAM_END_ADDR / flVal;
	flOspiRamSpeedMBpsMmRd *= (float)MEGA_CORRECTION;

#if HYPER_TEST_UART
	uart_printf("%lu CPU cycles = %.1f ms\n\r", u32Cycles, (flVal / 1000.0f));
	uart_printf("\n\r-> %.2f MB/s (%.0f Mbit/s) READ\n\r", flOspiRamSpeedMBpsMmRd, (8.0f * flOspiRamSpeedMBpsMmRd));

	if( 0 == u32Errors ) uart_printf("\n\rNULL errors\n\r");
	else uart_printf("\n\r# ERR: u32Errors = %lu\n\r", u32Errors);
	uart_printf("-------------------------------------------------\n\r");
#endif

	return u32Errors;
}
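
As a cross-check of the arithmetic in the test function, the conversion from a DWT->CYCCNT difference to throughput can be sketched as a small helper that also runs on a host PC. The function name and the `bytes_per_unit` parameter are assumptions made for illustration. MEGA_CORRECTION is not defined in the post; the constant 0x3f742400 in the list file further down decodes to about 0.9537 = 10^6 / 2^20, which suggests (but does not prove) that the reported figures are scaled to 2^20-byte units.

```c
#include <stdint.h>

/* Hypothetical helper: convert a cycle count into throughput.
 * bytes_per_unit selects the unit: 1e6 for MB/s, 1048576.0 for MiB/s. */
double throughput_units_per_s(uint32_t bytes, uint32_t cycles,
                              uint32_t cpu_hz, double bytes_per_unit)
{
    if (cycles == 0u || cpu_hz == 0u || bytes_per_unit <= 0.0) {
        return 0.0; /* avoid division by zero on degenerate input */
    }
    /* elapsed time in seconds for the measured loop */
    double seconds = (double)cycles / (double)cpu_hz;
    /* bytes per second, scaled to the requested unit */
    return ((double)bytes / seconds) / bytes_per_unit;
}
```

For example, 16 MiB transferred in 40,000,000 cycles at 400 MHz takes 0.1 s, which is about 167.8 MB/s or 160 MiB/s depending on the unit chosen.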

Anybody any ideas?

Thanks in advance!


    13 replies

    Technical Moderator
    October 16, 2024

    Dear @LCE ,

    Thanks for the interesting use case. Could you detail the exact IDE / compiler environment so that we can try to reproduce and then analyze this on our end? @KDJEM.1

    Ciao

    STOne-32. 

    LCEAuthor
    Graduate II
    October 16, 2024

    I'm using

    - H735 EVK or H733 custom board

    - STM32CubeIDE Version: 1.10.1

    - optimization FAST

    - CPU clock 400 MHz

    - OSPI 100 MHz

    - HyperRAM setup via direct register access (doesn't make a difference to HAL setup)

    LCEAuthor
    Graduate II
    October 16, 2024

    I just got the "fast" version again. Maybe there are some bus issues in the background, depending on the UART use:

    UART 3 is used for debugging, in TX DMA mode.

    The output function uart_printf() fills the TX DMA buffer; at the beginning it just waits for previous transfers to finish by checking TC and other flags in a function UartDbgDmaTxWait().

    When I put UartDbgDmaTxWait() after each uart_printf() around OspiHypRamTest() I get the high speed - for now at least...

    The question remains: before I did that, why did I sometimes get fast and sometimes slow results, without changing anything concerning the OCTOSPI peripheral and the test function?
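
The idea of letting background DMA traffic finish before starting the measurement can be sketched as a bounded polling loop. Everything here is hypothetical: `wait_until_idle`, `busy_fn` and the stand-in `fake_dma_busy` are illustration names, not the poster's UartDbgDmaTxWait() and not an ST HAL API. On the target, the predicate would read whatever DMA / UART status flags indicate an ongoing transfer.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate type: returns true while a bus master is busy. */
typedef bool (*busy_fn)(void);

/* Poll until the predicate reports idle or the poll budget is used up.
 * Returns true if idle was observed, false on timeout. */
bool wait_until_idle(busy_fn is_busy, uint32_t max_polls)
{
    while (max_polls-- > 0u) {
        if (!is_busy()) {
            return true;
        }
    }
    return false;
}

/* Host-side stand-in for a DMA status check (assumption, for testing only):
 * reports busy for the next fake_busy_countdown calls, then idle. */
uint32_t fake_busy_countdown = 0u;

bool fake_dma_busy(void)
{
    if (fake_busy_countdown > 0u) {
        fake_busy_countdown--;
        return true;
    }
    return false;
}
```

Calling such a wait for every active DMA stream right before `__disable_irq()` would make the bus state at the start of the timed loops the same for every build, which is one way to test the "background traffic" theory.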

     

     

    LCEAuthor
    Graduate II
    October 16, 2024

    I also compared the assembler in the list files, between slow / fast version:

    the important loops reading / writing HyperRAM and comparing - while the interrupts are disabled - basically look the same

    Graduate II
    October 16, 2024

    Hello LCE,

    Have you considered providing protection if

    u32Cycles -= u32CycStart;

    wraps around?  Perhaps that would account for the two consistent values...

    Regards,

    Dave

    LCEAuthor
    Graduate II
    October 16, 2024

    That's not necessary with (C's ?) unsigned integer math.

    (I think I did that before, it didn't change anything.)

    That would only explain the values at start-up, a rather defined time, but I also get the exact same timing values if I start OspiHypRamTest() by UART anytime the application is running.

    And I checked also with the 1 ms SysTick, giving the same results.
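
The point about unsigned math can be shown with a minimal sketch (the function name is an assumption): subtraction of `uint32_t` values is performed modulo 2^32, so the difference of two counter reads is still correct if the counter wrapped once in between. It only breaks down when the measured interval itself exceeds 2^32 cycles, which at 400 MHz is roughly 10.7 s.

```c
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter.
 * Unsigned subtraction wraps modulo 2^32, so one counter overflow
 * between the two reads does not corrupt the result. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}
```

With `start = 0xFFFFFFF0` and `end = 0x00000010` the result is 0x20 (32 cycles), exactly as if no wrap had happened.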

    Super User
    October 16, 2024

    So what is different in "compilation"? Debug vs Release? Optimization?

     

    LCEAuthor
    Graduate II
    October 17, 2024

    So what is different in "compilation"? Debug vs Release? Optimization?

    That would be too easy and too obvious! ;)

    No, that happens with a new compilation with no change of release / debug mode or optimization settings.

    And even without any change of the relevant HyperRam init and test files.

     

    So it can be only something happening in the background, using the same bus as OCTOSPI, my guess. 

    The test is performed at start-up; the only things using buses "in the background" until then are:

    • ADC3 (AHB4) via DMA to SRAM4 (AHB)
    • UART 3 (APB1) - with TX DMA from AXI SRAM 

    See above, the UART is my best guess for now, as it is using DMA and the AXI SRAM, where also OCTOSPI is connected. And as said above, waiting until UART3 TX DMA was finished already helped.

    I'll keep an eye on this with my next compilations...

    Technical Moderator
    October 18, 2024

    Dear @LCE ,

    This is a follow-up: can you change all of your variables and data buffers to be 64 bits wide instead of one word, i.e. use

    uint64_t instead of uint32_t

    and let us know whether the results are then stable across compilations? The idea is to use the maximum optimized width for the AXI bus to which the OctoSPI is connected.
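
A minimal sketch of what the 64-bit variant of the write and compare loops could look like. The function names are assumptions, and the pointer would be the memory-mapped HyperRAM base on the target. Whether the compiler actually emits 64-bit bus accesses for these loops is not guaranteed and is best verified in the list file.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill a region with an incrementing pattern using 64-bit stores.
 * volatile keeps the compiler from merging or dropping the accesses. */
void fill_u64(volatile uint64_t *dst, size_t n_words)
{
    for (size_t i = 0; i < n_words; i++) {
        dst[i] = (uint64_t)i;
    }
}

/* Read the region back with 64-bit loads and count mismatches. */
uint32_t check_u64(const volatile uint64_t *src, size_t n_words)
{
    uint32_t errors = 0u;
    for (size_t i = 0; i < n_words; i++) {
        if (src[i] != (uint64_t)i) {
            errors++;
        }
    }
    return errors;
}
```

For the 16 MiB device in this thread, `n_words` would be half of the 32-bit word count used in the original loops.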

    Hope it helps ,

    STOne-32

     

    LCEAuthor
    Graduate II
    October 21, 2024

    Hello @STOne-32 ,

    thanks for taking a look at this!

    So, right now I get these results:

    counting UP:
    Read = 144.61 MB/s
    Write = 58.69 MB/s

     

    counting DOWN:
    Read = 54.50 MB/s
    Write = 179.32 MB/s

    I would have expected the results to be the other way round...

    Because Write UP is simply:  for( i = 0; i < u32MaxLen; i++ ) pu32MemAddr[i] = i;
    And in the list file that's only 4 lines of assembler (also only 4 lines for write DOWN)
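
To put these figures into perspective, they can be compared with the theoretical peak of the bus. The sketch below assumes an 8-bit DDR HyperBus (two bytes per clock, so 200 MB/s at 100 MHz, ignoring command / address, latency and refresh overhead) and assumes the reported values are in 2^20-byte units, as the scaling constant in the list file suggests. All function names are made up for illustration.

```c
/* Convert MiB/s (2^20 bytes) to MB/s (10^6 bytes). */
float mib_to_mb_per_s(float mib_per_s)
{
    return mib_per_s * 1.048576f;
}

/* Theoretical peak of an 8-bit DDR bus: two bytes per bus clock. */
float hyperbus_peak_mb_per_s(float bus_clock_mhz)
{
    return 2.0f * bus_clock_mhz;
}

/* Measured throughput as a percentage of the peak. */
float bus_efficiency_percent(float measured_mb_per_s, float peak_mb_per_s)
{
    if (peak_mb_per_s <= 0.0f) {
        return 0.0f;
    }
    return 100.0f * measured_mb_per_s / peak_mb_per_s;
}

/* Average CPU cycles spent per 32-bit access at a given throughput. */
float cpu_cycles_per_word(float cpu_mhz, float measured_mb_per_s)
{
    if (measured_mb_per_s <= 0.0f) {
        return 0.0f;
    }
    return 4.0f * cpu_mhz / measured_mb_per_s;
}
```

Under these assumptions, 179.32 corresponds to about 94 % of the peak and roughly 8.5 CPU cycles per 32-bit word at 400 MHz, while 58.69 corresponds to about 31 % and roughly 26 cycles per word. The fast case is therefore close to what the bus can deliver, and the slow case costs around 17 to 18 extra CPU cycles per access.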

    Here's the list file part with the function OspiHypRamTest(uint8_t u8CountDown)

    
    080523b0 <OspiHypRamTest>:
     80523b0:	b538 	push	{r3, r4, r5, lr}
     80523b2:	4604 	mov	r4, r0
     80523b4:	f01d fa74 	bl	806f8a0 <HAL_RCC_GetSysClockFreq>
     80523b8:	ee07 0a90 	vmov	s15, r0
     80523bc:	4d4c 	ldr	r5, [pc, #304]	; (80524f0 <OspiHypRamTest+0x140>)
     80523be:	4a4d 	ldr	r2, [pc, #308]	; (80524f4 <OspiHypRamTest+0x144>)
     80523c0:	eeb8 7a67 	vcvt.f32.u32	s14, s15
     80523c4:	682b 	ldr	r3, [r5, #0]
     80523c6:	ed9f 6b48 	vldr	d6, [pc, #288]	; 80524e8 <OspiHypRamTest+0x138>
     80523ca:	eeb7 7ac7 	vcvt.f64.f32	d7, s14
     80523ce:	4293 	cmp	r3, r2
     80523d0:	ee27 7b06 	vmul.f64	d7, d7, d6
     80523d4:	eeb7 7bc7 	vcvt.f32.f64	s14, d7
     80523d8:	d07f 	beq.n	80524da <OspiHypRamTest+0x12a>	; load address of OCTOSPI 1
     80523da:	f502 42a0 	add.w	r2, r2, #20480	; 0x5000
     80523de:	4293 	cmp	r3, r2
     80523e0:	bf0c 	ite	eq
     80523e2:	f04f 42e0 	moveq.w	r2, #1879048192	; 0x70000000 OCTOSPI 2 unused
     80523e6:	2200 	movne	r2, #0
     80523e8:	f3bf 8f4f 	dsb	sy
     
    ; WRITE loops in between cpsid / cpsie
     80523ec:	b672 	cpsid	i
     80523ee:	4b42 	ldr	r3, [pc, #264]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
     
     80523f0:	3a04 	subs	r2, #4
     80523f2:	6858 	ldr	r0, [r3, #4]
     80523f4:	4611 	mov	r1, r2
     80523f6:	2c00 	cmp	r4, #0
     80523f8:	d165 	bne.n	80524c6 <OspiHypRamTest+0x116>		; u8CountDown != 0, jump to write DOWN
     80523fa:	4623 	mov	r3, r4
     
    ; write UP loop
     80523fc:	f841 3f04 	str.w	r3, [r1, #4]!
     8052400:	3301 	adds	r3, #1
     8052402:	f5b3 0f80 	cmp.w	r3, #4194304	; 0x400000 memory size in 32b
     8052406:	d1f9 	bne.n	80523fc <OspiHypRamTest+0x4c>
     
     8052408:	f3bf 8f5f 	dmb	sy
     805240c:	f3bf 8f4f 	dsb	sy
     8052410:	4b39 	ldr	r3, [pc, #228]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
     8052412:	685b 	ldr	r3, [r3, #4]
     8052414:	b662 	cpsie	i
    ; WRITE end
     
     8052416:	f3bf 8f4f 	dsb	sy
     805241a:	1a1b 	subs	r3, r3, r0
     805241c:	ed9f 6a37 	vldr	s12, [pc, #220]	; 80524fc <OspiHypRamTest+0x14c>
     8052420:	eddf 6a37 	vldr	s13, [pc, #220]	; 8052500 <OspiHypRamTest+0x150>
     8052424:	ee07 3a90 	vmov	s15, r3
     8052428:	4936 	ldr	r1, [pc, #216]	; (8052504 <OspiHypRamTest+0x154>)
     805242a:	ee27 7a26 	vmul.f32	s14, s14, s13
     805242e:	eef8 7a67 	vcvt.f32.u32	s15, s15
     8052432:	eec6 6a27 	vdiv.f32	s13, s12, s15
     8052436:	ee66 7a87 	vmul.f32	s15, s13, s14
     805243a:	edc1 7a00 	vstr	s15, [r1]
     805243e:	f3bf 8f4f 	dsb	sy
     8052442:	492c 	ldr	r1, [pc, #176]	; (80524f4 <OspiHypRamTest+0x144>)
     8052444:	682b 	ldr	r3, [r5, #0]
     8052446:	428b 	cmp	r3, r1
     8052448:	d04a 	beq.n	80524e0 <OspiHypRamTest+0x130>
     805244a:	482f 	ldr	r0, [pc, #188]	; (8052508 <OspiHypRamTest+0x158>)
     805244c:	492f 	ldr	r1, [pc, #188]	; (805250c <OspiHypRamTest+0x15c>)
     805244e:	4283 	cmp	r3, r0
     8052450:	bf08 	it	eq
     8052452:	460a 	moveq	r2, r1
    
    ; READ & compare loops in between cpsid / cpsie
     8052454:	b672 	cpsid	i
     8052456:	f3bf 8f4f 	dsb	sy
     805245a:	bb1c 	cbnz	r4, 80524a4 <OspiHypRamTest+0xf4>		; u8CountDown != 0, jump to read DOWN
     805245c:	4b26 	ldr	r3, [pc, #152]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
     805245e:	4620 	mov	r0, r4
     8052460:	685c 	ldr	r4, [r3, #4]
     8052462:	4603 	mov	r3, r0
     
    ; read & compare UP loop
     8052464:	f852 1f04 	ldr.w	r1, [r2, #4]!
     8052468:	4299 	cmp	r1, r3
     805246a:	f103 0301 	add.w	r3, r3, #1
     805246e:	bf18 	it	ne
     8052470:	3001 	addne	r0, #1
     8052472:	f5b3 0f80 	cmp.w	r3, #4194304	; 0x400000 memory size in 32b
     8052476:	d1f5 	bne.n	8052464 <OspiHypRamTest+0xb4>
     
     8052478:	f3bf 8f5f 	dmb	sy
     805247c:	f3bf 8f4f 	dsb	sy
     8052480:	4b1d 	ldr	r3, [pc, #116]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
     8052482:	685b 	ldr	r3, [r3, #4]
     8052484:	b662 	cpsie	i
    ; READ & compare end
     
     8052486:	1b1b 	subs	r3, r3, r4
     8052488:	ed9f 6a1c 	vldr	s12, [pc, #112]	; 80524fc <OspiHypRamTest+0x14c>
     805248c:	4a20 	ldr	r2, [pc, #128]	; (8052510 <OspiHypRamTest+0x160>)
     805248e:	ee07 3a90 	vmov	s15, r3
     8052492:	eef8 7a67 	vcvt.f32.u32	s15, s15
     8052496:	eec6 6a27 	vdiv.f32	s13, s12, s15
     805249a:	ee26 7a87 	vmul.f32	s14, s13, s14
     805249e:	ed82 7a00 	vstr	s14, [r2]
     80524a2:	bd38 	pop	{r3, r4, r5, pc}
     
    ; read & compare prepare
    80524a4:	4914 	ldr	r1, [pc, #80]	; (80524f8 <OspiHypRamTest+0x148>) DWT->CYCCNT
     80524a6:	f04f 33ff 	mov.w	r3, #4294967295		; 0xFFFFFFFF
     80524aa:	2000 	movs	r0, #0
     80524ac:	f46f 0c80 	mvn.w	ip, #4194304		; 0x400000 memory size in 32b
     80524b0:	684c 	ldr	r4, [r1, #4]
     
    ; read & compare DOWN loop
     80524b2:	f852 1f04 	ldr.w	r1, [r2, #4]!
     80524b6:	4299 	cmp	r1, r3
     80524b8:	f103 33ff 	add.w	r3, r3, #4294967295	; 0xFFFFFFFF
     80524bc:	bf18 	it	ne
     80524be:	3001 	addne	r0, #1
     80524c0:	4563 	cmp	r3, ip
     80524c2:	d1f6 	bne.n	80524b2 <OspiHypRamTest+0x102>
     80524c4:	e7d8 	b.n	8052478 <OspiHypRamTest+0xc8>	; back to end of read
     
     80524c6:	f04f 33ff 	mov.w	r3, #4294967295		; 0xFFFFFFFF
     80524ca:	f46f 0c80 	mvn.w	ip, #4194304		; 0x400000 memory size in 32b
     
    ; write DOWN loop
     80524ce:	f841 3f04 	str.w	r3, [r1, #4]!
     80524d2:	3b01 	subs	r3, #1
     80524d4:	4563 	cmp	r3, ip
     80524d6:	d1fa 	bne.n	80524ce <OspiHypRamTest+0x11e>
     
     80524d8:	e796 	b.n	8052408 <OspiHypRamTest+0x58>	; back to end of write
     
     80524da:	f04f 4210 	mov.w	r2, #2415919104	; 0x90000000	OCTOSPI 1 with HyperRAM
     80524de:	e783 	b.n	80523e8 <OspiHypRamTest+0x38>		; back
     
     80524e0:	4a0c 	ldr	r2, [pc, #48]	; (8052514 <OspiHypRamTest+0x164>)
     80524e2:	e7b7 	b.n	8052454 <OspiHypRamTest+0xa4>		; back to 1st loop
     80524e4:	f3af 8000 	nop.w
     
     80524e8:	a0b5ed8d 	.word	0xa0b5ed8d
     80524ec:	3eb0c6f7 	.word	0x3eb0c6f7
     80524f0:	24002cbc 	.word	0x24002cbc
     80524f4:	52005000 	.word	0x52005000
     80524f8:	e0001000 	.word	0xe0001000
     80524fc:	4b800000 	.word	0x4b800000
     8052500:	3f742400 	.word	0x3f742400
     8052504:	24002bfc 	.word	0x24002bfc
     8052508:	5200a000 	.word	0x5200a000
     805250c:	6ffffffc 	.word	0x6ffffffc
     8052510:	24002bf8 	.word	0x24002bf8
     8052514:	8ffffffc 	.word	0x8ffffffc
    

     

    LCEAuthor
    Graduate II
    October 21, 2024

    ... including some comments.

    I didn't find the option to select "assembler" for source code posting.

    LCEAuthor
    Graduate II
    October 21, 2024

    I just found that I had set the alignment of the HyperRAM in the linker file to "ALIGN(8)" = 64 bit.

    I changed it to ALIGN(4) - and it didn't change anything.