Random hardfault bug with STM32F730

Visitor II

It is somewhat relieving for us to learn that someone else has exactly the same problems. If you, @pprovencher, ever find out the reason for your problems, please post it here.

waclawek.jan

Super User

> how to catch it when we do not even have a hardfault occurring!?

Oh, bugs are a live thing, and not all of them fail in a neat, reproducible way that clearly indicates the point of failure. The "random bug" is the one worth catching, and how to hunt it down is the art of this trade.

So you have to have a whole repertoire of tools at hand - hardware like probes, an oscilloscope you have a good grip on, a logic analyzer, even the humble multimeter. "But I am a software engineer" is a whine to be left outside the door. Of course you have to have documentation at hand, and consult it often. And software - have a good grip on the toolchain and its darker corners, know thy mapfile. It's good to have a sleeveful of tricks in the mcu's software, too - like knowing how to toggle pins to be observed by an LA, how to output or store relevant debug info/files without relying on printf() and/or any "semihosting" automagic, how to use otherwise unused resources of the mcu like memories embedded in the peripherals, etc. But when it comes to bugs, the disassembler is your biggest friend, together with the on-chip debugging utilities (but those have to be known and understood well, too). And remember - software, or the box on the table, may be called a debugger, but the real debugger sits on your chair.
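One way to store debug info without printf() is a small ring buffer in RAM that you read out later with the debugger. This is only a minimal sketch: all names (trace_log, trace_recent, TRACE_SIZE) are made up for illustration, the size is assumed to be a power of two, and no locking is done, so logging from both interrupts and the main loop would need extra care.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names throughout. TRACE_SIZE must be a power of two. */
#define TRACE_SIZE 64u

typedef struct {
    uint32_t tag;    /* caller-chosen event id */
    uint32_t value;  /* payload, e.g. a register value or a counter */
} trace_entry_t;

static volatile trace_entry_t trace_buf[TRACE_SIZE];
static volatile uint32_t trace_head;  /* total number of events logged */

/* Record one event; once the buffer is full the oldest entry is overwritten. */
void trace_log(uint32_t tag, uint32_t value)
{
    uint32_t i = trace_head & (TRACE_SIZE - 1u);
    trace_buf[i].tag = tag;
    trace_buf[i].value = value;
    trace_head++;
}

/* Fetch the n-th most recent event (0 = newest).
   Returns 1 on success, 0 if that event is not in the buffer.
   The count below is only meaningful until trace_head wraps at 2^32. */
int trace_recent(uint32_t n, trace_entry_t *out)
{
    assert(out != 0);
    uint32_t count = (trace_head < TRACE_SIZE) ? trace_head : TRACE_SIZE;
    if (n >= count) {
        return 0;
    }
    uint32_t i = (trace_head - 1u - n) & (TRACE_SIZE - 1u);
    out->tag = trace_buf[i].tag;
    out->value = trace_buf[i].value;
    return 1;
}
```

After a crash, halting in the fault handler and dumping trace_buf and trace_head from the debugger shows the last 64 events leading up to it.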

So be innovative. The first thing is to get the bug to reproduce, at least in some way. Enhance reproduction by stressing the application - push communication through it at maximum speed, connect pushbuttons to random pulse generators, feed the audio processor with noise or some 10-hour youtube ***, deliberately overdriving the input. Carefully observe symptoms - even if they seem unrelated to your initial complaint or to any theory you might have formulated, a deviation of any form from what is "normal" *must* be explained.

Try to catch the bug. The symptoms may indicate that the bug happened long before they appeared, but try to halt execution as soon as observable symptoms occur. For example, if stack overflow is suspected to be the root cause, then with the DEADBEEF method this theory can be safely disproved even at a late catch if some DEADBEEF remains (it can't be safely confirmed, but suspicion certainly increases in that direction if there's no DEADBEEF left).
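For those who haven't met the DEADBEEF method, here is a minimal sketch of it. The function names are made up for illustration, and the region is passed in as a plain array so the logic can be tried on a PC; on the target, the bounds would come from the linker script, and the painting would run early in startup over the part of the stack not yet in use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_PAINT 0xDEADBEEFu

/* Fill a stack region with the pattern, starting at its lowest address. */
void stack_paint(volatile uint32_t *lowest, size_t words)
{
    assert(lowest != NULL);
    for (size_t i = 0; i < words; i++) {
        lowest[i] = STACK_PAINT;
    }
}

/* Count pattern words still intact, scanning upward from the lowest address.
   Assumes a full-descending stack (as on Cortex-M), so the words nearest
   the lowest address are the last ones to be overwritten. */
size_t stack_unused_words(const volatile uint32_t *lowest, size_t words)
{
    assert(lowest != NULL);
    size_t n = 0;
    while (n < words && lowest[n] == STACK_PAINT) {
        n++;
    }
    return n;
}
```

A nonzero result at the time of the catch means the stack never grew that far. Zero does not prove an overflow, it only makes one more likely.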

Then try carefully crafted changes and observe the behaviour. Understand what the changes really mean. Try to make the changes very local - this is hard to do, I know. Make sure you can always return to the reproducible state. It may help to take notes at this point.

Formulate theories, then devise experiments to prove or disprove them. You suspect stack overflow? How could an overflow be caught? It's a write to an address beyond the last address allocated to the stack, so what about using the on-chip-debugger's data breakpoint facility; or, if desperate, the MPU? Or, maybe a simpler

You may also try the divide et impera method I mentioned - omitting whole blocks of program. Sometimes it gets obvious what's wrong. Not often; but hey, this is the Real Bug, so there's little to lose. Still make sure you can return to the reproducible state.

Sleep well. Discuss the problem with your colleagues, with the cleaning lady, with your partner at home, with your teddy bear.

Be persistent. When the Real Bug creeps in, deadlines are void. Getting a product out of the door knowing there's a bug - well, that gives "deadline" its real meaning. If a manager pops in demanding progress, explain the problem to him in great technical detail (and don't let him slip out), and then ask him to hold an oscilloscope probe - yes, ON the PIN 57 of LQFP176, NOT on the pad! - while you finally make your software experiments with some comfort. An hour or two should suffice.

All this takes years to get good at. And then, one day, comes a Real Real Bug, and I - with all those years and knowledge and experience - look like a complete a****ole anyway... ;)

JW

Visitor II

Thanx JW for the supportive thoughts. I've been in the business for thirty years and still occasionally encounter problems that seem to challenge all one has ever learnt. But all the bugs have been solved - at least as I remember - or they could have been attributed to the hardware.

Visitor II

That text should be put at the beginning of every book about learning embedded development.

Visitor II

I still think it's a good idea to check the fault registers and the exception stack frame, if it is a synchronous fault.
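A minimal sketch of what that check looks at, assuming an ARMv7-M core. The struct and helper names are made up for illustration; the bit positions are the UsageFault bits in the upper half of CFSR. On the target, cfsr would be read from SCB->CFSR and the frame pointer would be MSP or PSP at exception entry, depending on EXC_RETURN.

```c
#include <assert.h>
#include <stdint.h>

/* Basic exception frame as stacked by the core, lowest address first.
   With an active FPU context, S0-S15 and FPSCR follow these 8 words. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
} exc_frame_t;

/* UsageFault status bits within CFSR. */
#define CFSR_UNDEFINSTR (1u << 16)
#define CFSR_INVSTATE   (1u << 17)
#define CFSR_INVPC      (1u << 18)
#define CFSR_NOCP       (1u << 19)
#define CFSR_UNALIGNED  (1u << 24)
#define CFSR_DIVBYZERO  (1u << 25)

/* Return address stacked on exception entry (hypothetical helper). */
uint32_t stacked_pc(const exc_frame_t *frame)
{
    assert(frame != 0);
    return frame->pc;
}

/* Name of the first UsageFault bit set in cfsr, or "none". */
const char *usage_fault_reason(uint32_t cfsr)
{
    if (cfsr & CFSR_UNDEFINSTR) return "UNDEFINSTR";
    if (cfsr & CFSR_INVSTATE)   return "INVSTATE";
    if (cfsr & CFSR_INVPC)      return "INVPC";
    if (cfsr & CFSR_NOCP)       return "NOCP";
    if (cfsr & CFSR_UNALIGNED)  return "UNALIGNED";
    if (cfsr & CFSR_DIVBYZERO)  return "DIVBYZERO";
    return "none";
}
```

For a synchronous fault, the stacked PC is the instruction to look up in the disassembly.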

Visitor II

We spent yesterday examining the faulty build and the fault registers of the exception I already posted on June 10, 2019.

First we thought that the odd value in LR was the reason for the exception, and it took a while for us to understand that some library functions use the register for their own purposes and the actual return address is popped from the stack.

We also learned that the stack had not overflowed - as it never had in the earlier exceptions either. There were no interrupts active at the exception point.

What we learned - with this specific build - is that the line where the exception occurred is the following:

90001ed6: blx r7 // atof call

90001ed8: ldr r6, [pc, #64] ; (0x90001f1c <MC60E_Gnss::ParseGGA(char const*, unsigned long)+228>)

90001eda: vmov r0, r1, d0

90001ede: blx r6

and the usage fault reason for the exception is: Attempt to execute a coprocessor instruction (NOCP)

and this is the line - which indeed seems to be a coprocessor instruction - that has already been run several times before the exception occurs. It has also been calculating the correct floating point values. Stack and register values seem to be correct at the exception point.

We also have a very different failing build, which we continue to examine today.

waclawek.jan

Super User

> and the usage fault reason for the exception is: Attempt to execute a coprocessor instruction (NOCP)

Can you reproduce this fault? If yes, can you then read out the content of SCB_CPACR?

JW

Visitor II

That's why I pushed to check the CFSR.

The CPACR that Jan Waclawek mentioned holds a two-bit access field for each coprocessor; a coprocessor that is not implemented reads back as zero there. The FPU is implemented as coprocessors 10 and 11.
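A minimal sketch of decoding a CPACR readout, assuming the ARMv7-M layout where the field for coprocessor n sits at bits [2n+1:2n]. The enum and function names are made up for illustration. The value would come from SCB->CPACR or from the debugger.

```c
#include <assert.h>
#include <stdint.h>

/* Access encodings of a two-bit CPACR field. */
typedef enum {
    CP_DENIED    = 0,  /* any access generates a NOCP UsageFault */
    CP_PRIV_ONLY = 1,  /* privileged access only */
    CP_RESERVED  = 2,
    CP_FULL      = 3   /* full access */
} cp_access_t;

/* Extract the access field for coprocessor cp (hypothetical helper). */
cp_access_t cpacr_field(uint32_t cpacr, unsigned cp)
{
    assert(cp <= 11u);
    return (cp_access_t)((cpacr >> (2u * cp)) & 3u);
}

/* The FPU needs CP10 and CP11 set to the same value; this checks for
   full access on both. */
int fpu_fully_enabled(uint32_t cpacr)
{
    return cpacr_field(cpacr, 10u) == CP_FULL
        && cpacr_field(cpacr, 11u) == CP_FULL;
}
```

A readout of 0x00F00000 means both CP10 and CP11 have full access. If either field reads 0 at the moment of the fault, that would explain a NOCP.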

You might also want to check FPCCR.

BTW, when you have problems with core (as opposed to peripherals), Arm-manuals are more "it". In your case: https://static.docs.arm.com/ddi0403/eb/DDI0403E_B_armv7m_arm.pdf

Visitor II

Here are some register values from the crashing point:

To us they seem quite normal - though our understanding of this area is not at a very high (if any) level.

In our compiler settings the FPU is FPv5-SP-D16 and - as said before - the calculations are performed successfully and correctly until this crash occurs.

waclawek.jan

Super User

I see no reason for the NOCP fault, as both CP10 and CP11 are enabled.

At this point, I'd start to suspect hardware. Does the fault occur on multiple instances of the hardware? Is the power supply rock solid, as observed directly on the power pins? What's the voltage, can it be changed to slightly lower/higher? Are all supply and ground pins connected properly? Have all pins been checked for bad solder joints? Have the decoupling capacitors, especially VCAP, been scrutinized?

JW

Visitor II

We don't believe that it is about the HW; to us it seems to be all about the builds: a badly behaving build crashes on all the devices, a good build is stable on all the devices.

Piranha

Graduate II

PC points to the line:

float hdop = atof(data);

---

The question is still valid - are data and other string variables in all places guaranteed to always be zero-terminated, when passing them to strchr and other string processing functions? :)
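One way to make that guarantee explicit at the call site is to check for the terminator within the buffer's capacity before handing the pointer to atof. This is only a sketch: parse_field and its parameters are made-up names, and it assumes the caller knows the capacity of the buffer that data points into.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Convert buf to a number only if a NUL terminator exists within cap bytes.
   Returns 1 and stores the value on success; returns 0 and leaves *out
   untouched otherwise. */
int parse_field(const char *buf, size_t cap, double *out)
{
    assert(buf != NULL && out != NULL);
    if (memchr(buf, '\0', cap) == NULL) {
        return 0;  /* unterminated: atof could read past the buffer */
    }
    *out = atof(buf);
    return 1;
}
```

A 0 return during a stress run would point at the string handling rather than the FPU.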

Visitor II

Just rechecked the data and string variables used in the crashing thread: they are always NUL-terminated; the collecting functions can't exit without setting the terminating NUL.