Explorer
February 11, 2020
Solved

[bug fixes] STM32H7 Ethernet

  • February 11, 2020
  • 34 replies
  • 44437 views

@Amel NASRI, @ranran, @Piranha, @Harrold, @Pavel A.

V2 of my fixes and improvements to H7_FW V1.5.0/V1.6.0 Ethernet...

Changes include

  • Decoupling receive buffers from receive descriptors, so buffers may be held for any length of time without choking receive.
  • Optionally queuing transmit, so transmit doesn't need to block until complete.
  • Many bug fixes.

Find full details and source in the attached zip. V1 was posted to another developer's question. Please post any questions about V2 here.

    This topic has been closed for replies.
    Best answer by Amel NASRI

    Dear All,

    Our experts tried to address almost all the limitations reported in this thread.

    Please refer to this post for more details.

    At this point, I suggest closing this discussion, as it has become difficult for us to follow with the great number of comments.

    Don't hesitate to submit your new posts asking new questions.

    Thanks to all those involved in making ST solutions more efficient.

    -Amel


    Graduate
    April 3, 2020

    From some testing I have been doing today, I found some potential improvements on transmit. I believe alister mentioned in his documentation that transmit was not a priority, and it is not so important for my application either.

    Rather than calling HAL_ETH_Transmit when not using a tx queue, calling HAL_ETH_Transmit_IT, waiting on an event, and signalling from the interrupt callback seems to yield better performance (the mechanism is already there for the tx queue). When doing load testing, it seemed that the application was spending significant time spinning in HAL_ETH_Transmit, potentially starving other threads (particularly the lwIP thread).

    However, this may not be the case for infrequent transmission of small packets rather than frequent transmission of large packets. Some testing with setting/clearing a port pin on entry/exit to the transmit function may be required to quantify the difference exactly.

    Personally I would not run this driver without the transmit queue, as that yields a 41-51 MBps improvement in my application and has little impact on memory footprint.

    Just thought I would open it up for discussion...

    alisterAuthor
    Explorer
    April 3, 2020

    Tx-queuing, using HAL_ETH_Transmit_IT, is enabled by default. Are you sure you ported all of ethernetif.c?

    It's controlled by the ETH_TX_QUEUE_ENABLE macro.

    Take care: non-blocking transmit (tx-queuing) is only safe if you know your app won't change a pbuf (or its buffer) after it's been queued, until its transmission is complete.

    Graduate
    April 3, 2020

    I tried with/without the tx queue, and blocking/non-blocking transmit with the tx queue disabled. The fully blocking option did not work well, and it is not good design practice to use blocking/polling functions unless absolutely necessary. This application has four different netifs, plus a heap of other stuff going on, so it needs to stay responsive.

    Transmit with interrupt is still a better option than a blocking transmit when there is no queue. Neither option prevents another thread from modifying the buffer; both options prevent the lwIP TCP thread from modifying it, as that thread is blocked either way. I am running with the queue enabled, as I want to keep the lwIP thread running to service the other netifs.

    AFAIK, there is no way my application can modify the buffers. But I do remember reading somewhere that lwIP may modify them in certain cases; maybe during TCP segmentation, or on transmit failure. I can't remember where I saw that, though.

    Visitor II
    April 5, 2020

    Hello,

    I have also used your code. It works very well (94 Mbit send/receive). But I have one problem I cannot resolve.

    After resetting the uC, or on first connecting the RJ45 cable to the PC, I get maximum transfer. But when I reconnect the RJ45 cable, the transfer slows down (<500 kbps). In debug I get the message "memp_malloc: out of memory in pool TCPIP_MSG_INPKT".

    After disconnect, the RxBuffFree function is called, so the rx buffers should be free.

    What could be wrong?

    alisterAuthor
    Explorer
    April 5, 2020

    >i get message "memp_malloc: out of memory in pool TCPIP_MSG_INPKT".

    MEMP_TCPIP_MSG_INPKT is lwIP's container for received packets. What's MEMP_NUM_TCPIP_MSG_INPKT?

    >After disconnect function RxBuffFree is called

    Are you calling HAL_ETH_DeInit or HAL_ETH_Stop? You shouldn't need to. What's calling RxBuffFree?

    Visitor II
    April 6, 2020

    MEMP_NUM_TCPIP_MSG_INPKT is 8.

    There is an additional task to handle the PHY connection (from the CubeH7 example):

    void ethernet_link_thread(void const *argument)
    {
      ETH_MACConfigTypeDef MACConf;
      int32_t PHYLinkState;
      uint32_t linkchanged = 0, speed = 0, duplex = 0;
      struct netif *netif = (struct netif *)argument;

      for (;;)
      {
        PHYLinkState = LAN8742_GetLinkState(&LAN8742);

        if (netif_is_link_up(netif) && (PHYLinkState <= LAN8742_STATUS_LINK_DOWN))
        {
          HAL_ETH_Stop_IT(&heth);
          netif_set_down(netif);
          netif_set_link_down(netif);
        }
        else if (!netif_is_link_up(netif) && (PHYLinkState > LAN8742_STATUS_LINK_DOWN))
        {
          switch (PHYLinkState)
          {
          case LAN8742_STATUS_100MBITS_FULLDUPLEX:
            duplex = ETH_FULLDUPLEX_MODE;
            speed = ETH_SPEED_100M;
            linkchanged = 1;
            break;
          case LAN8742_STATUS_100MBITS_HALFDUPLEX:
            duplex = ETH_HALFDUPLEX_MODE;
            speed = ETH_SPEED_100M;
            linkchanged = 1;
            break;
          case LAN8742_STATUS_10MBITS_FULLDUPLEX:
            duplex = ETH_FULLDUPLEX_MODE;
            speed = ETH_SPEED_10M;
            linkchanged = 1;
            break;
          case LAN8742_STATUS_10MBITS_HALFDUPLEX:
            duplex = ETH_HALFDUPLEX_MODE;
            speed = ETH_SPEED_10M;
            linkchanged = 1;
            break;
          default:
            break;
          }

          if (linkchanged)
          {
            /* Get the current MAC configuration, update speed/duplex, restart */
            HAL_ETH_GetMACConfig(&heth, &MACConf);
            MACConf.DuplexMode = duplex;
            MACConf.Speed = speed;
            HAL_ETH_SetMACConfig(&heth, &MACConf);
            HAL_ETH_Start_IT(&heth);
            netif_set_up(netif);
            netif_set_link_up(netif);
          }
        }
        osDelay(100);
      }
    }

    alisterAuthor
    Explorer
    April 6, 2020

    MEMP_NUM_TCPIP_MSG_INPKT  = 8 isn't enough. This is possibly the only problem.

    About HAL_ETH_Stop_IT... I did make some changes there. But I'm still unsure its stop/start is good, as I'd only inspected it, didn't use HAL_ETH_Stop_IT, and (sorry) only tested what I'd used. Are you sure you need HAL_ETH_Stop_IT?

    If you do, the first HAL_ETH_Start_IT is in low_level_init. Before your next HAL_ETH_Start_IT, make sure all the rx buffers (from the EthIfRxBuff pool) previously held by the ETH driver have been freed. Anyway, a problem with this isn't indicated here, and it would surely manifest after a few tens of link up/down cycles.

    Visitor II
    April 6, 2020

    How many MEMP_NUM_TCPIP_MSG_INPKT do you suggest?

    I left HAL_ETH_Stop_IT in from the Cube example. If it is not necessary I will remove it.

    Thanks for suggestion.

    alisterAuthor
    Explorer
    April 6, 2020

    Please reply to the specific post so the conversations stay delineated.

    >How many MEMP_NUM_TCPIP_MSG_INPKT do you suggest?

    Double or triple.
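    For reference, a hedged lwipopts.h fragment reflecting this advice. The value shown is illustrative, not a tuned figure; it should be sized against the actual receive load.

```c
/* lwipopts.h -- illustrative value only; tune for your traffic load.
 * MEMP_TCPIP_MSG_INPKT entries carry received packets into tcpip_thread,
 * so running out of them stalls reception until entries are recycled. */
#define MEMP_NUM_TCPIP_MSG_INPKT  24   /* was 8; "double or triple" per the advice above */
```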

    >I left HAL_ETH_Stop_IT from Cube example.

    You'll have to decide. My attitude is: everything ought to have a reason, and nothing ought to exist without one. That's backed up by studies showing bugs are proportional to lines of code (which doesn't mean remove comments).

    Visitor II
    April 6, 2020

    Removing the HAL_ETH_Stop_IT resolved the problem.

    But there are still messages saying "out of memory in pool TCP_PCB".

    Graduate II
    April 8, 2020

    > There is additional task to handle PHY connection (from CubeH7 example)

    Which has the flaws described shortly but pretty clearly in "lwIP API related" part of my topic:

    https://community.st.com/s/question/0D50X0000BOtfhnSQB/how-to-make-ethernet-and-lwip-working-on-stm32

    April 15, 2020

    Keil has its own version of the STM32H7 drivers (current version 2.5.0), including the drivers for Ethernet MAC and PHY. Has anybody checked if they have all the same problems as HAL?

    alisterAuthor
    Explorer
    April 16, 2020

    I see at https://community.st.com/s/question/0D50X0000BWqXETSQ3/ethernet-complexity you'd mentioned "STM32Cube_FW_H7 1.5.0 does work".

    But it doesn't. This page describes FW_H7 1.5.0/1.6.0 bugs and my fixes/improvements to it. I don't use Keil.

    April 16, 2020

    When I said that it works I just meant it does the simple things described in the readme (because this example in many other releases of STM32Cube_FW_H7 doesn't even ping); I didn't mean it is bug-free.

    My question about Keil drivers still remains for those who do use Keil.

    Visitor II
    April 21, 2020

    Any similar fix for the STM32F7 firmware?

    alisterAuthor
    Explorer
    April 22, 2020

    >Any similar fix for the STM32F7 firmware?

    Check Piranha's issues list at https://community.st.com/s/question/0D50X0000BOtfhnSQB/how-to-make-ethernet-and-lwip-working-on-stm32, and search Community for posts about STM32F7 Ethernet.

    Visitor II
    June 20, 2020

    Hi Alister, and other community members. I've tested your code on an STM32H745; the speed is awesome (90 Mbit/s) compared to the original HAL. But I've noticed that CPU utilization is rather high during TCP transmission from the board (about 50% on the 480 MHz M7 core), while during reception it is only about 20-25%. Could someone point me in a direction for digging to get more free CPU time? Or maybe I'm doing something wrong?

    Thank you in advance.

    alisterAuthor
    Explorer
    June 21, 2020

    Thanks for the feedback.

    I'd only made easy improvements to the ETH driver's transmit code. I'd identified that its throughput would be sub-optimal: its implementation is tentative in that it doesn't start any descriptors until all the buffers are successfully linked, and so it transmits late. But transmit was not a priority for me and, apart from adding the queuing in ethernetif.c, its shape is unchanged.

    For transmit's high CPU utilization, are you able to cast fresh eyes over it?

    alisterAuthor
    Explorer
    June 21, 2020

    >Could someone show me direction for digging to get more free CPU time.

    First, without changing any of the ETH driver...

    1. Is ETH_TX_QUEUE_ENABLE enabled? ETH_TX_QUEUE_ENABLE's HAL_ETH_Transmit_IT should use fewer cycles than HAL_ETH_Transmit.
    2. In low_level_output, is TxQueueFree ever empty? Try increasing ETH_TX_QUEUE_SIZE.
    3. Is ETH_TX_BUFFERS_ARE_CACHED enabled and is d-cache enabled? Enabling ETH_TX_BUFFERS_ARE_CACHED would be faster than sticking lwIP's heap in an uncached MPU region.
    4. Is i-cache enabled?
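    The checks above map to a handful of compile-time switches. A hedged sketch of what the relevant configuration might look like; the macro names come from the posted driver, but the values are illustrative only.

```c
/* Illustrative configuration for the modified ethernetif.c driver.
 * Values are examples, not recommendations. */
#define ETH_TX_QUEUE_ENABLE        1   /* queue tx via HAL_ETH_Transmit_IT instead of blocking */
#define ETH_TX_QUEUE_SIZE          8   /* increase if low_level_output ever finds TxQueueFree empty */
#define ETH_TX_BUFFERS_ARE_CACHED  1   /* maintain d-cache for tx buffers rather than using an uncached MPU region */
```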

    Next, determine where the cycles are being used:

    1. At https://community.st.com/s/question/0D50X0000AhNBoWSQW/actually-working-stm32-ethernet-and-lwip-demonstration-firmware, @Piranha​ describes using DWT->CYCCNT to count cycles. You could try instrumenting your application with that or a fast 32-bit timer counter to determine where the cycles are being spent.

    Improving the ETH driver:

    1. Don't start this without first determining where the cycles are being spent. Redesign HAL_ETH_Transmit_IT to start the descriptors immediately. I haven't done the reading and couldn't start this without study. But some thoughts and questions to probe it:
       (a) Is the number of buffers linked per transmit packet deterministic? You'd need to study your app and lwIP. You'd want to avoid counting them prior to transmit because it costs cycles.
       (b) Assuming you link each buffer to a descriptor and start it immediately: if you run out of descriptors, how would you quickly and efficiently resume linking the buffers after a previous transmit completes? Remember, your transmit is performed by lwIP's task, tcpip_thread.
       (c) As immediately linking and starting a descriptor should be a design goal, would it matter if the app or lwIP linked more buffers than you have descriptors? If it matters, you'll need a way to recover and drop that packet.
       (d) Descriptors are cheap. Can you do more than one transmit per interrupt? Interrupts cost cycles too. A bit in the descriptor controls that.

    @Piranha, can you share any thoughts please?

    Graduate II
    July 23, 2020

    Hi, guys, and sorry for a long delay. I should really limit my time and effort spent on hopeless users here and concentrate more on specific useful topics...

    Regarding the overall design, I can say that descriptor lists are queues which are partly managed by hardware. Therefore I'm not using any additional queues, neither for Rx nor for Tx. The lwIP memory pool for Rx and the TxQueue for Tx are unnecessary. You don't need lwIP pool management for your array, which is managed by hardware and your code anyway. And there is no real sense in an additional Tx queue, because it also has a limit, exactly like the descriptor array; if it's not enough, just increase the number of descriptors. In my driver I have 3 separate arrays for Rx (descriptors, data buffers and pbuf_custom) and 1 array (descriptors) for Tx. The numbers of elements in the Rx arrays are equal, but keeping them separate makes them more efficient regarding size/alignment and makes it possible to put them in different memories/locations.

    For this to work, I added an additional pointer member at the end of the descriptor structure. I use it to attach a pbuf/pbuf_custom to a descriptor. While descriptors are used and recycled incrementally, the use and release of Rx data buffer and pbuf_custom structure pairs depends on application code and is not deterministic, but the pairs are always fully synchronized: the indexes are always the same. That way, when pbuf_free_custom_fn() is called, I calculate the respective data buffer index from the pbuf pointer, because at that point the payload member contains junk. Then I just attach the released pair to the next free Rx descriptor.

    For Tx, the pbuf segment count is not deterministic, because it depends on many factors. Just as an example, the stack itself always adds the combined Ethernet+IP+UDP/TCP header in front of the sent data by chaining an additional pbuf at the front. Counting pbufs before queuing is necessary, but as a pbuf chain typically has 1-3 segments, the CPU cycles spent on that are negligible. If there are not enough descriptors, just drop the frame altogether. TCP will retransmit, and UDP and Ethernet itself don't have to guarantee delivery anyway; it's the same situation as with a broken network. Stalling the whole tcpip_thread() can potentially be even worse than dropping some frames under abnormally high load. As alister said, descriptors are cheap! One can set their numbers to tens or even hundreds if necessary. For example, my demo firmware uses 16 Rx descriptors (with 1536 B data buffers) and 48 Tx descriptors.

    P.S. Of course, ask for more details if/when necessary. :)

    alisterAuthor
    Explorer
    July 23, 2020

    Really thought-provoking ideas. Thanks for sharing.

    The results of Piranha's effort are at https://community.st.com/s/question/0D50X0000AhNBoWSQW/actually-working-stm32-ethernet-and-lwip-demonstration-firmware.

    @SLuka.1 this is an answer to your post. I'll add that it's not unexpected for tx cycles to exceed rx cycles, because rx is an external event that the software merely responds to, whereas tx needs to be prepared by the app and scheduled (especially for TCP) by the stack.

    Graduate
    September 6, 2020

    FYI, V1.8.0 of STM32CubeH7 has some minor changes to the Ethernet driver. I didn't study them in much detail. They look to address some of the issues, but not much appears to have changed.