Graduate
May 12, 2024
Question

Clock cycle shift on GPIO output STM32F103

  • 15 replies
  • 14832 views

Dear Community,

I am porting an old application from AVR to STM32, and I am facing a strange timing issue.

In a nutshell, the application reads a sector (512 bytes) from an SD card and outputs the content of the buffer to GPIO with a 4 µs cycle (meaning 3 µs low, 1 µs data signal).

The SD card read works fine, and I have written a small assembly routine to drive the GPIO with precise MCU cycle counting.

Using DWT in the debugger, it gives a very stable and precise count (288 cycles for a total of 4 µs).

However, when observing the output with a 24 MHz logic analyzer, I can see the signal shifted by 1 or 2 CPU cycles, and therefore a delay.

I have tried writing ODR directly as well as BSRR, but with no luck.

Attached :

- Screenshot of the logic analyzer

Screenshot 2024-05-12 at 06.30.59.png
As you can see, I do not get 3 µs but 3.042 µs, and this is not consistent.
 

Clock configuration

Screenshot 2024-05-12 at 06.32.34.png

Port configuration:

 

GPIO_InitStruct.Pin = GPIO_PIN_13| READ_PULSE_Pin|READ_CLK_Pin;
GPIO_InitStruct.Mode = GPIO_MODE_OUTPUT_PP;
GPIO_InitStruct.Pull = GPIO_NOPULL;
GPIO_InitStruct.Speed = GPIO_SPEED_FREQ_HIGH;
HAL_GPIO_Init(GPIOC, &GPIO_InitStruct);
 
Assembly code : 
 
.global wait_1us
wait_1us: // busy-wait ~1 us at 72 MHz; clobbers r2
.fnstart
push {lr}
nop ;// 1 1
nop ;// 1 2
mov r2,#20 ;// 1 3 // loop counter: 20 iterations of subs+bne
wait_1us_1:
subs r2,r2,#1 ;// 1 1
bne wait_1us_1 ;// 1 2
pop {lr}
bx lr // return from function call
.fnend

.global wait_3us
wait_3us: // busy-wait ~3 us; the caller loads the loop count into r2 first
.fnstart
push {lr}
nop
nop
wait_3us_1:
subs r2,r2,#1
bne wait_3us_1
pop {lr}
bx lr // return from function call
.fnend
 
 
sendByte:

and r5,r3,#0x80000000 ;// 1 1 // isolate the MSB of r3 (next bit to send)
lsl r3,r3,#1 ;// 1 2 // left shift r3 by 1 to queue the following bit
subs r4,r4,#1 ;// 1 3 // decrement the r4 bit counter
//mov r6,#0 // Reset the DWT cycle counter for debug cycle counting
//ldr r6,=DWTCYCNT
//mov r2,#0
//str r2,[r6] // end
bne sendBit ;// 1 4
beq process ;// 1 5
// GPIO: Clk 14, Data (read pulse) 15, Enable 13
sendBit:
ldr r6,=PIN_BSRR ;// 2 2
LDR r2, [r6] ;// 3 5 // BSRR reads back as 0, so r2 starts as an empty mask
cmp r5,#0 ;// 1 6
ITE EQ ;// 1 7
ORREQ r2,r2, #0x80000000 ;// 1 8 // bit is 0: BSRR bit 31 resets pin 15 (data low)
ORRNE r2,r2, #0x00008000 ;// 1 9 // bit is 1: BSRR bit 15 sets pin 15 (data high)
ORR r2,r2, #0x00004000 ;// 1 10 // BSRR bit 14 sets pin 14 (clock high)
STR r2, [r6] ;// 1 11 // write BSRR -> from this point we need 1us, 72 CPU cycles (to be confirmed)
bl wait_1us ;// 65 75 144 209
ORR r2,r2, #0xC0000000 ;// 1 12 // BSRR bits 30/31 reset pins 14 and 15
STR r2,[r6] ;// 1 13
// Adjust the duration of the 3us wait if this is the first bit (coming from process, less 10 cycles)
cmp r4,#1
ite eq
moveq r2,#56 // loop count passed to wait_3us
movne r2,#62
bl wait_3us // wait for 3 us in total
b sendByte

 

To be honest, I do not know where to look.

 

    This topic has been closed for replies.


    Graduate II
    May 14, 2024

    Use a logic analyzer or oscilloscope with a much higher sampling frequency than your MCU frequency, or you will not be able to distinguish measurement artifacts from real MCU output jitter. Imagine that you are generating a perfect square wave at 72 MHz / 20 = 7.2 MHz (period 277.8 ns). If you sample this signal with a 24 MHz analyzer (period 41.67 ns), then you will see a jittering output: two consecutive periods of 6 × 41.67 ns (250 ns) and one period of 7 × 41.67 ns (291.7 ns) - on a perfect square wave! And that can lead you to the wrong conclusion that the jitter comes from the MCU output ...

    vbesson (Author)
    Graduate
    May 14, 2024

    Hello Michal, 

    Thanks, and I genuinely agree with you. However, it is not easy to find a logic analyzer with a sampling rate above 24 MHz.

    What I do is look at the data stream over the whole transmission period, where I should not see significant drift. Unfortunately, I do see a delay, and not a stable 250 kHz data frequency.

    Vincent 

    Graduate II
    May 14, 2024

    An ordinary oscilloscope should handle it with ease. As others have already written: use SPI, USART, or Timer + DMA, and you will get an easy, seamless pulse stream with minimal CPU load.

    vbesson (Author)
    Graduate
    May 14, 2024

    I will do some test and get back to this thread. 

    Quick question: if I use DMA buffering with USART, I cannot have a stop bit after each byte - is there a way to remove the stop bit? It also means I would set the clock speed so that one bit lasts 1 µs (72 cycles), and I would rearrange the data stream as 0.0.0.DATABITS.

     

    Graduate II
    May 14, 2024

    How is the data sampled on the receiving side? On which edge? I do not see any sensible setup/hold time guard!

    vbesson (Author)
    Graduate
    May 14, 2024

    I need to bit-stream data over a 1-wire interface with a 1 µs data pulse (high or low), then 3 µs with the data line low, no start bit, no stop bit. What is best? SPI? USART seems to have start and stop bits.

    Graduate II
    May 14, 2024

    @vbesson wrote:

    I need to bit-stream data over a 1-wire interface with a 1 µs data pulse (high or low), then 3 µs with the data line low, no start bit, no stop bit. What is best? SPI? USART seems to have start and stop bits.


    1-wire doesn't have critical timing at all. A 1 has a low pulse of 1-15us and a 0 is a low pulse of 60us. Nanosecond dither is irrelevant.

    vbesson (Author)
    Graduate
    May 22, 2024

     

    Hello All, 

    Quick update on my test based on the feedback you gave me.

    What I have tested:

    • Double buffer DMA with USART
    • Double buffer DMA with SPI
    • Double buffer DMA with GPIO & BSR
    • Disable all IRQ with bit bang SPI and ASM
    • Reducing the clock speed to avoid congestion on the bus

    and combination of all the above.

    My feedback:

    USART was a great approach to reduce the buffer size; indeed, I needed a 2048-byte buffer (512 bytes × 4 clock cycles).

    The issue with USART is the pause between bytes: even without stop bits, the USART waits a few cycles between bytes. So I cannot use this approach, as I need a continuous stream where each bit takes 3 µs low plus 1 µs of data.

     

    SPI: same as USART, giving the same results. The good thing is that, using DMA, I see more accuracy in the data stream.

     

    Disabling all IRQs and bit-banging SPI with GPIO output in ASM: disabling the IRQs does not change anything; the accuracy is not there. This is the most frustrating part - an ASM function should behave identically cycle for cycle... there must be a way to do it. Maybe ST can help and provide a more detailed explanation.

     

    Reducing the clock speed: it has no effect on accuracy, and in any case I need the CPU speed to manage the SD card over SPI while driving the GPIO output, since I am not bit-banging both SPI and GPIO.

     

    Double-buffer DMA with GPIO & BSRR: this is for the moment the best approach, even if from a memory perspective it is pretty ugly. For the record, I have a buffer of 402 bytes to send on a 4 µs cycle (3 µs delay, 1 µs data cycle), i.e. 13 chunks of 32 bytes. Since BSRR is a uint32, I needed a uint32 buffer of 2048 entries = 8192 bytes (64 bytes × 8 bits × 4 timer cycles × 4 bytes per uint32!). What I could do, but have not done yet, is write ODR directly so each entry is a uint16, dividing the buffer size by 2. I need to test this, as my program is not finished and I will need more memory to manage the OLED screen.
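    A minimal sketch of that (untested) ODR variant, under the thread's assumptions: 64 data bytes per DMA window (64 × 8 × 4 = 2048 slots), pin 14 = clock and pin 15 = data, and `initODRBuffer`/`DMA_ODR_BUFFER` as illustrative names. Note that an ODR write drives all 16 port pins at once, so any other GPIOC outputs would need to be folded into the constants:

```c
#include <stdint.h>

#define DMA_BUFFER_SIZE 64  /* data bytes per DMA window, per the math above */

/* 2048 half-word entries = 4096 bytes, half the BSRR version */
static uint16_t DMA_ODR_BUFFER[DMA_BUFFER_SIZE * 8 * 4];

void initODRBuffer(const char *buffer)
{
    const uint16_t IDLE     = 0x0000;                         /* clock low, data low   */
    const uint16_t CLK_DATA = (uint16_t)((1u<<14)|(1u<<15));  /* clock high, data high */
    const uint16_t CLK_ONLY = (uint16_t)(1u<<14);             /* clock high, data low  */

    int l = 0;
    for (int j = 0; j < DMA_BUFFER_SIZE; j++) {
        unsigned char c = (unsigned char)buffer[j];
        for (int k = 0; k < 8; k++) {
            DMA_ODR_BUFFER[l]   = IDLE;  /* 3 x 1us of line low ... */
            DMA_ODR_BUFFER[l+1] = IDLE;
            DMA_ODR_BUFFER[l+2] = IDLE;
            /* ... then the 1us data slot, MSB first */
            DMA_ODR_BUFFER[l+3] = (c & 0x80) ? CLK_DATA : CLK_ONLY;
            c = (unsigned char)(c << 1);
            l += 4;
        }
    }
}
```

    The DMA destination would then be `&GPIOC->ODR` with half-word data width instead of `&GPIOC->BSRR` with word width.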

    NB: I scratched my head over the DMA interrupt not triggering. I had used

     

    HAL_DMA_Start(&hdma_tim2_up, (uint32_t)DMA_BUFFER, (uint32_t)&(GPIOC->BSRR), 2048);
    
    //instead of
    
    HAL_DMA_Start_IT(&hdma_tim2_up, (uint32_t)DMA_BUFFER, (uint32_t)&(GPIOC->BSRR), 2048);
     
     

     

    This is the way I prepare the buffer :

     

    void initeDMABuffer(char * buffer){ // TODO check number of CPU cycles in C and assembly

        uint32_t GPIO_14L_15L = 0xC0000000; // BSRR: reset 14 & 15 -> no data pulse, no clock
        uint32_t GPIO_14H_15H = 0x0000C000; // BSRR: set 14 & 15 -> data HIGH, clock HIGH
        uint32_t GPIO_14H_15L = 0x80004000; // BSRR: set 14, reset 15 -> data LOW, clock HIGH

        char c = 0;
        int l = 0;

        for (int j = 0; j < DMA_BUFFER_SIZE; j++){ // each byte becomes 8 bits x 4 x 1us steps
            c = buffer[j]; // DMA to GPIO runs on a 1us period, i.e. 72 clock cycles on an STM32F103
            for (int k = 0; k < 8; k++){
                // computed up front as an optimization
                DMA_BUFFER[l]   = GPIO_14L_15L; // cycle 1: wait
                DMA_BUFFER[l+1] = GPIO_14L_15L; // cycle 2: wait
                DMA_BUFFER[l+2] = GPIO_14L_15L; // cycle 3: wait

                if (c & 0x80) // test if bit 7 (MSB of the current byte) is 1
                    DMA_BUFFER[l+3] = GPIO_14H_15H; // 4th slot: 1us data cycle, data high
                else
                    DMA_BUFFER[l+3] = GPIO_14H_15L; // bit is 0: 1us data cycle, data low
                c = c << 1; // shift the next bit into the MSB position
                l += 4;
            }
        }
    }

     

    This is the half-buffer preparation during the DMA cycle:

     

    void populateHalfDMABuffer(char * buffer, int pos, int half){

        // GPIO13 -> chip enable (active low)
        // GPIO14 -> clock pulse
        // GPIO15 -> data pulse

        // buffer corresponds to the sector char buffer,
        // pos is the current position in the buffer (%64),
        // half selects the first (0) or second (1) half of the DMA array.
        // Only slot l+3 is rewritten: the three wait slots filled by
        // initeDMABuffer never change.

        uint32_t GPIO_14H_15H = 0x0000C000;
        uint32_t GPIO_14H_15L = 0x80004000;

        char c = 0;
        unsigned int l = half * 1024;
        unsigned int bsize = DMA_BUFFER_SIZE / 2;
        for (int i = 0; i < bsize; i++){
            c = buffer[pos+i];
            for (int j = 0; j < 8; j++){
                if (c & 0x80)
                    DMA_BUFFER[l+3] = GPIO_14H_15H;
                else
                    DMA_BUFFER[l+3] = GPIO_14H_15L;
                c = c << 1;
                l += 4;
            }
        }
    }

     

     

    These are my 2 DMA Buffer callback functions:

     

    volatile int ClusterSlice;

    void HAL_DMA_HalfTxIntCallback(DMA_HandleTypeDef *hdma)
    {
        if (ClusterSlice < 13){
            ClusterSlice++;
            // Half the buffer has been transmitted: refill the first half
            populateHalfDMABuffer(sectorBuf, ClusterSlice*DMA_BUFFER_SIZE/2, 0);
        }else{
            __disable_irq();
            HAL_TIM_Base_Stop_DMA(&htim2);
            __enable_irq();
            prepareNewSector = 1;
        }
    }

    void HAL_DMA_FullTxIntCallback(DMA_HandleTypeDef *hdma)
    {
        if (ClusterSlice < 13){
            ClusterSlice++;
            populateHalfDMABuffer(sectorBuf, ClusterSlice*DMA_BUFFER_SIZE/2, 1);
        }else{
            __disable_irq();
            /* might not be necessary */
            //hdma_tim2_up.XferCpltCallback = NULL;
            HAL_TIM_Base_Stop_DMA(&htim2);
            __enable_irq();
            prepareNewSector = 1;
            //printf("End of DMA\n");
        }
        // end of the initial buffer
    }

     

    In the end, this is the output on the logic analyzer:

    Full 402-byte data chunk: Screenshot 2024-05-22 at 06.23.38.png

    What is left to do:

    - Manage disk head movement based on a GPIO interrupt (and then move to the right SD card sector and cluster)

    - Do some testing to see if the timing is accurate enough.

    I will keep you posted on how I progress. 

    Vincent

     

    vbesson (Author)
    Graduate
    May 28, 2024

    Hello All, 

    the delay between two data chunks (of 512 bytes) is causing some trouble.

    I am heading toward using DMA SPI to send the bytes.

    I am having some issues with the DMA interrupts and I need a little help.

    I want both the half-transfer and the transfer-complete DMA interrupts.

    I did 

     hdma_spi1_tx.XferHalfCpltCallback=HAL_DMA_HalfSpiTxIntCallback;
     hdma_spi1_tx.XferCpltCallback=HAL_DMA_FullSpiTxIntCallback;
     
     HAL_SPI_Transmit_DMA(&hspi1,DMA_BIT_BUFFER,1608); 

    The interrupts never get called...

    Should I use this instead?

     hdma_spi1_tx.XferHalfCpltCallback=HAL_DMA_HalfSpiTxIntCallback;
     hdma_spi1_tx.XferCpltCallback=HAL_DMA_FullSpiTxIntCallback;
     
     //HAL_SPI_Transmit_DMA(&hspi1,DMA_BIT_BUFFER,1608); // 402*8*4
     HAL_SPI_Transmit_IT(&hdma_spi1_tx,DMA_BIT_BUFFER,1608);

    In that case, I assume I have to attach the timer to hdma_spi1_tx?

    Thanks for your help 

    Vincent 

     

     

     

    vbesson (Author)
    Graduate
    May 28, 2024

    OK, I found it. HAL_SPI_Transmit_DMA installs its own internal DMA callbacks, overwriting anything assigned to hdma_spi1_tx.XferCpltCallback; those internal handlers then call the weak HAL_SPI_Tx*CpltCallback functions, so those are the ones to override.

    It needs to have

    void HAL_SPI_TxCpltCallback(SPI_HandleTypeDef *hspi)
    {
     printf("debug full\n");
    }
    
    
    void HAL_SPI_TxHalfCpltCallback(SPI_HandleTypeDef *hspi){
     printf("debug half\n");
    }

    along with

     HAL_SPI_Transmit_DMA(&hspi1,DMA_BIT_BUFFER,1608); // 402*8*4

    By the way, SPI TX over DMA seems to be the best approach: it saves RAM and is very precise and accurate.

     

    Vincent