Explorer
February 11, 2020
Solved

[bug fixes] STM32H7 Ethernet

  • February 11, 2020
  • 34 replies
  • 44437 views

@Amel NASRI, @ranran, @Piranha, @Harrold, @Pavel A.

V2 of my fixes and improvements to H7_FW V1.5.0/V1.6.0 Ethernet...

Changes include

  • Decoupling receive buffers from receive descriptors, so buffers may be held for any length of time without choking receive.
  • Optionally queuing transmit, so transmit doesn't need to block until complete.
  • Many bug fixes.

Find full details and source in the attached zip. V1 was posted to another developer's question. Please post any questions about V2 here.

    This topic has been closed for replies.
    Best answer by Amel NASRI

    Dear All,

    Our experts tried to address almost all the limitations reported in this thread.

    Please refer to this post for more details.

    At this point, I suggest closing this discussion, as it has become difficult for us to follow with the great number of comments.

    Don't hesitate to submit your new posts asking new questions.

    Thanks to all those involved in making ST solutions more efficient.

    -Amel


    Graduate
    April 3, 2020

    From some testing I have been doing today, I found some potential improvements on transmit. I believe alister mentioned in his documentation that transmit was not a priority, and it is not so important for my application either.

    Rather than calling HAL_ETH_Transmit when not using a tx queue, calling HAL_ETH_Transmit_IT, waiting on an event, and signalling from the interrupt callback seems to yield better performance (the mechanism is already there for the tx queue). When doing load testing, it seemed that the application was spending significant time spinning in HAL_ETH_Transmit, potentially starving other threads (particularly the lwIP thread).

    However, this may not be the case for infrequent transmission of small packets rather than frequent transmission of large packets. Some testing with setting/clearing a port pin on entry/exit to the transmit function may be required to quantify the difference exactly.

    Personally I would not run this driver without the transmit queue, as that yields a 41-51 MBps improvement in my application and has little impact on memory footprint.

    Just thought I would open it up for discussion...

    alisterAuthor
    Explorer
    April 3, 2020

    Tx-queuing, using HAL_ETH_Transmit_IT, is enabled by default. Are you sure you ported all of ethernetif.c?

    It's controlled by the ETH_TX_QUEUE_ENABLE macro.

    Take care: non-blocking transmit (tx-queuing) is only safe if you know your app won't change a pbuf (or its buffer) after it's been queued, until its transmission is complete.

    Graduate
    April 3, 2020

    I tried with/without the tx queue, and blocking/non-blocking transmit with the tx queue disabled. The fully blocking option did not work well, and it is not good design practice to use blocking/polling functions unless absolutely necessary. This application has four different netifs, plus a heap of other stuff going on, so it needs to stay responsive.

    Transmit with interrupt is still a better option than a blocking transmit when there is no queue. Neither option prevents another thread from modifying the buffer; both options prevent the lwIP TCP thread from modifying it, as that thread is blocked either way. I am running with the queue enabled, as I want to keep the lwIP thread running to service the other netifs.

    AFAIK, there is no way my application can modify the buffers. But I do remember reading somewhere that lwIP may modify them in certain cases; maybe during TCP segmentation, or on transmit failure. I can't remember where I saw that, though.

    Visitor II
    April 5, 2020

    Hello,

    I have also used your code. It works very well (94 Mbit send/receive). But I have one problem I cannot resolve.

    After resetting the uC, or on first connecting the RJ45 cable to the PC, I get maximum transfer. But when I reconnect the RJ45 cable, the transfer slows down (<500 kbps). In debug I get the message "memp_malloc: out of memory in pool TCPIP_MSG_INPKT".

    After disconnect, the RxBuffFree function is called, so the rx buffers should be free.

    What could be wrong?

    alisterAuthor
    Explorer
    April 5, 2020

    >i get message "memp_malloc: out of memory in pool TCPIP_MSG_INPKT".

    MEMP_TCPIP_MSG_INPKT is lwIP's container for received packets. What's MEMP_NUM_TCPIP_MSG_INPKT?

    >After disconnect function RxBuffFree is called

    Are you calling HAL_ETH_DeInit or HAL_ETH_Stop? You shouldn't need to. What's calling RxBuffFree?

    Visitor II
    April 6, 2020

    MEMP_NUM_TCPIP_MSG_INPKT is 8.

    There is an additional task to handle the PHY connection (from the CubeH7 example):

    void ethernet_link_thread(void const *argument)
    {
      ETH_MACConfigTypeDef MACConf;
      int32_t PHYLinkState;
      uint32_t linkchanged = 0, speed = 0, duplex = 0;
      struct netif *netif = (struct netif *)argument;

      for (;;)
      {
        PHYLinkState = LAN8742_GetLinkState(&LAN8742);

        if (netif_is_link_up(netif) && (PHYLinkState <= LAN8742_STATUS_LINK_DOWN))
        {
          HAL_ETH_Stop_IT(&heth);
          netif_set_down(netif);
          netif_set_link_down(netif);
        }
        else if (!netif_is_link_up(netif) && (PHYLinkState > LAN8742_STATUS_LINK_DOWN))
        {
          switch (PHYLinkState)
          {
          case LAN8742_STATUS_100MBITS_FULLDUPLEX:
            duplex = ETH_FULLDUPLEX_MODE;
            speed = ETH_SPEED_100M;
            linkchanged = 1;
            break;
          case LAN8742_STATUS_100MBITS_HALFDUPLEX:
            duplex = ETH_HALFDUPLEX_MODE;
            speed = ETH_SPEED_100M;
            linkchanged = 1;
            break;
          case LAN8742_STATUS_10MBITS_FULLDUPLEX:
            duplex = ETH_FULLDUPLEX_MODE;
            speed = ETH_SPEED_10M;
            linkchanged = 1;
            break;
          case LAN8742_STATUS_10MBITS_HALFDUPLEX:
            duplex = ETH_HALFDUPLEX_MODE;
            speed = ETH_SPEED_10M;
            linkchanged = 1;
            break;
          default:
            break;
          }

          if (linkchanged)
          {
            /* Get the current MAC configuration, update speed/duplex, restart */
            HAL_ETH_GetMACConfig(&heth, &MACConf);
            MACConf.DuplexMode = duplex;
            MACConf.Speed = speed;
            HAL_ETH_SetMACConfig(&heth, &MACConf);
            HAL_ETH_Start_IT(&heth);
            netif_set_up(netif);
            netif_set_link_up(netif);
          }
        }
        osDelay(100);
      }
    }

    alisterAuthor
    Explorer
    April 6, 2020

    MEMP_NUM_TCPIP_MSG_INPKT  = 8 isn't enough. This is possibly the only problem.

    About HAL_ETH_Stop_IT... I did make some changes there. But I'm still unsure its stop/start is good, as I'd only inspected it, didn't use HAL_ETH_Stop_IT, and (sorry) only tested what I'd used. Are you sure you need HAL_ETH_Stop_IT?

    If you do, the first HAL_ETH_Start_IT is in low_level_init. Before your next HAL_ETH_Start_IT, make sure all the rx buffers (from the EthIfRxBuff pool) previously held by the ETH driver have been freed. Anyway, a problem with this isn't indicated here, and it would surely manifest after a few tens of link up/down cycles.

    Visitor II
    April 6, 2020

    How many MEMP_NUM_TCPIP_MSG_INPKT do you suggest?

    I left HAL_ETH_Stop_IT in from the Cube example. If it is not necessary I will remove it.

    Thanks for suggestion.

    alisterAuthor
    Explorer
    April 6, 2020

    Please reply to the specific post so the conversations stay delineated.

    >How many MEMP_NUM_TCPIP_MSG_INPKT do you suggest?

    Double or triple.
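    For reference, a hedged lwipopts.h fragment reflecting this advice. The value shown is illustrative, not a tuned figure; it should be sized against the actual receive load.

```c
/* lwipopts.h -- illustrative value only; tune for your traffic load.
 * MEMP_TCPIP_MSG_INPKT entries carry received packets into tcpip_thread,
 * so running out of them stalls reception until entries are recycled. */
#define MEMP_NUM_TCPIP_MSG_INPKT  24   /* was 8; "double or triple" per the advice above */
```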

    >I left HAL_ETH_Stop_IT from Cube example.

    You'll have to decide. My attitude is: everything ought to have a reason, and nothing ought to exist without one. That's backed up by studies showing bugs are proportional to lines of code (which doesn't mean remove comments).

    Visitor II
    April 6, 2020

    Removing the HAL_ETH_Stop_IT resolved the problem.

    But there are still messages saying "out of memory in pool TCP_PCB".

    Graduate II
    April 8, 2020

    > There is additional task to handle PHY connection (from CubeH7 example)

    Which has the flaws described shortly but pretty clearly in "lwIP API related" part of my topic:

    https://community.st.com/s/question/0D50X0000BOtfhnSQB/how-to-make-ethernet-and-lwip-working-on-stm32

    April 15, 2020

    Keil has its own version of the STM32H7 drivers (current version 2.5.0), including the drivers for Ethernet MAC and PHY. Has anybody checked if they have all the same problems as HAL?

    alisterAuthor
    Explorer
    April 16, 2020

    I see at https://community.st.com/s/question/0D50X0000BWqXETSQ3/ethernet-complexity you'd mentioned "STM32Cube_FW_H7 1.5.0 does work".

    But it doesn't. This page describes FW_H7 1.5.0/1.6.0 bugs and my fixes/improvements to it. I don't use Keil.

    April 16, 2020

    When I said that it works I just meant it does the simple things described in the readme (because this example in many other releases of STM32Cube_FW_H7 doesn't even ping); I didn't mean it is bug-free.

    My question about Keil drivers still remains for those who do use Keil.

    Visitor II
    April 21, 2020

    Any similar fix for the STM32F7 firmware?

    alisterAuthor
    Explorer
    April 22, 2020

    >Any similar fix for the STM32F7 firmware?

    Check Piranha's issues list at https://community.st.com/s/question/0D50X0000BOtfhnSQB/how-to-make-ethernet-and-lwip-working-on-stm32, and search Community for posts about STM32F7 Ethernet.

    Visitor II
    June 20, 2020

    Hi Alister, and other community members. I've tested your code on an STM32H745; the speed is awesome (90 Mbit/s) compared to the original HAL. But I've noticed that CPU utilization is rather high during TCP transmission from the board (about 50% on the 480 MHz M7 core), while during reception it is only about 20-25%. Could someone point me in a direction for digging to get more free CPU time? Or maybe I'm doing something wrong?

    Thank you in advance.

    alisterAuthor
    Explorer
    June 21, 2020

    Thanks for the feedback.

    I'd only made easy improvements to the ETH driver's transmit code. I'd identified that its throughput would be sub-optimal: its implementation is tentative in that it doesn't start any descriptors until all the buffers are successfully linked, and so it transmits late. But transmit was not a priority for me and, apart from adding the queuing in ethernetif.c, its shape is unchanged.

    For transmit's high CPU utilization, are you able to cast fresh eyes over it?

    alisterAuthor
    Explorer
    June 21, 2020

    >Could someone show me direction for digging to get more free CPU time.

    First, without changing any of the ETH driver...

    1. Is ETH_TX_QUEUE_ENABLE enabled? ETH_TX_QUEUE_ENABLE's HAL_ETH_Transmit_IT should use fewer cycles than HAL_ETH_Transmit.
    2. In low_level_output, is TxQueueFree ever empty? Try increasing ETH_TX_QUEUE_SIZE.
    3. Is ETH_TX_BUFFERS_ARE_CACHED enabled and is d-cache enabled? Enabling ETH_TX_BUFFERS_ARE_CACHED would be faster than sticking lwIP's heap in an uncached MPU region.
    4. Is i-cache enabled?
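    The checks above map to a handful of compile-time switches. A hedged sketch of what the relevant configuration might look like; the macro names come from the posted driver, but the values are illustrative only.

```c
/* Illustrative configuration for the modified ethernetif.c driver.
 * Values are examples, not recommendations. */
#define ETH_TX_QUEUE_ENABLE        1   /* queue tx via HAL_ETH_Transmit_IT instead of blocking */
#define ETH_TX_QUEUE_SIZE          8   /* increase if low_level_output ever finds TxQueueFree empty */
#define ETH_TX_BUFFERS_ARE_CACHED  1   /* maintain d-cache for tx buffers rather than using an uncached MPU region */
```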

    Next, determine where the cycles are being used:

    1. At https://community.st.com/s/question/0D50X0000AhNBoWSQW/actually-working-stm32-ethernet-and-lwip-demonstration-firmware, @Piranha​ describes using DWT->CYCCNT to count cycles. You could try instrumenting your application with that or a fast 32-bit timer counter to determine where the cycles are being spent.

    Improving the ETH driver:

    1. Don't start this without first determining where the cycles are being spent. Redesign HAL_ETH_Transmit_IT to start the descriptors immediately. I haven't done the reading and couldn't start this without study. But some thoughts and questions to probe it:
       (a) Is the number of buffers linked per transmit packet deterministic? You'd need to study your app and lwIP. You'd want to avoid counting them prior to transmit because it costs cycles.
       (b) Assuming you link each buffer to a descriptor and start it immediately: if you run out of descriptors, how would you quickly and efficiently resume linking the buffers after a previous transmit completes? Remember, your transmit is performed by lwIP's task, tcpip_thread.
       (c) As immediately linking and starting a descriptor should be a design goal, would it matter if the app or lwIP linked more buffers than you have descriptors? If it matters, you'll need a way to recover and drop that packet.
       (d) Descriptors are cheap. Can you do more than one transmit per interrupt? Interrupts cost cycles too. A bit in the descriptor controls that.

    @Piranha, can you share any thoughts please?

    Graduate II
    July 23, 2020

    Hi, guys, and sorry for a long delay. I should really limit my time and effort spent on hopeless users here and concentrate more on specific useful topics...

    Regarding the overall design, I can say that descriptor lists are queues which are partly managed by hardware. Therefore I'm not using any additional queues, neither for Rx nor for Tx. The lwIP memory pool for Rx and the TxQueue for Tx are unnecessary. You don't need lwIP pool management for your array, which is managed by hardware and your code anyway. And there is no real sense in an additional Tx queue, because it also has a limit, exactly like the descriptor array; if it's not enough, just increase the number of descriptors. In my driver I have 3 separate arrays for Rx (descriptors, data buffers and pbuf_custom) and 1 array (descriptors) for Tx. The numbers of elements in the Rx arrays are equal, but keeping them separate makes them more efficient regarding size/alignment and makes it possible to put them in different memories/locations.

    For this to work, I added an additional pointer member at the end of the descriptor structure. I use it to attach a pbuf/pbuf_custom to a descriptor. While descriptors are used and recycled incrementally, the use and release of Rx data buffer and pbuf_custom structure pairs depends on application code and is not deterministic, but the pairs are always fully synchronized: the indexes are always the same. That way, when pbuf_free_custom_fn() is called, I calculate the respective data buffer index from the pbuf pointer, because at that point the payload member contains junk. Then I just attach the released pair to the next free Rx descriptor.

    For Tx, the pbuf segment count is not deterministic, because it depends on many factors. Just as an example, the stack itself always adds the combined Ethernet+IP+UDP/TCP header in front of the sent data by chaining an additional pbuf at the front. Counting pbufs before queuing is necessary, but as a pbuf chain typically has 1-3 segments, the CPU cycles spent on that are negligible. If there are not enough descriptors, just drop the frame altogether. TCP will retransmit, and UDP and Ethernet itself don't have to guarantee delivery anyway; it's the same situation as with a broken network. Stalling the whole tcpip_thread() can potentially be even worse than dropping some frames under abnormally high load. As alister said, descriptors are cheap! One can set their numbers to tens or even hundreds if necessary. For example, my demo firmware uses 16 Rx descriptors (with 1536 B data buffers) and 48 Tx descriptors.

    P.S. Of course, ask for more details if/when necessary. :)

    alisterAuthor
    Explorer
    July 23, 2020

    Really thought-provoking ideas. Thanks for sharing.

    The results of Piranha's effort are at https://community.st.com/s/question/0D50X0000AhNBoWSQW/actually-working-stm32-ethernet-and-lwip-demonstration-firmware.

    @SLuka.1 this is an answer to your post. I'll add that it's not unexpected for tx cycles to exceed rx cycles, because rx is an external event that the software merely responds to, whereas tx needs to be prepared by the app and scheduled (especially for TCP) by the stack.

    Graduate
    September 6, 2020

    FYI, V1.8.0 of STM32CubeH7 has some minor changes to the Ethernet driver. I didn't study them in much detail. They look to address some of the issues, but not much appears to have changed.