Visitor II
June 6, 2019
Question

Random hardfault bug with STM32F730

  • 27 replies
  • 8031 views

We are struggling with a mysterious bug in our STM32F730 project; even the bug itself is not easy to explain.

We use FreeRTOS, and our code runs XIP from an SST26VF016B QSPI flash.

IDE used is Atollic version: 9.1.0.

The problem is that the application crashes into the hardfault handler, but not always.

We can sometimes build a binary which runs perfectly for hours or "forever". But if we add a random line of code somewhere, the binary crashes after a random time (seconds, minutes or hours), and nowhere near the line where the code was added.

When debugging the problem, one build always crashes at about the same place and another build in a totally different place. There seems to be nothing in common between these places.

With the debugger we can get the LR, SP and PC values at the point where the hardfault occurs, but that is not helpful: they point to code in which there is nothing wrong and which has already run thousands of times before suddenly crashing.

The debug trace usually shows the last "signal handler called" address as 0xFFFFFFF1 or 0xFFFFFFFD. The problem seems to be somehow asynchronous, maybe related to interrupts.
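
One way to read those two values, sketched here under the assumption that they are ARMv7-M EXC_RETURN codes (the value placed in LR on exception entry): bit 3 says whether the interrupted code was in Thread or Handler mode, bit 2 which stack pointer was in use, and bit 4 whether the stacked frame includes floating-point state. The type and function names below are hypothetical, not taken from any library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of an ARMv7-M EXC_RETURN value (hypothetical helper type). */
typedef struct {
    bool valid;          /* bits [31:5] are all ones, as EXC_RETURN requires */
    bool to_thread_mode; /* bit 3: 1 = interrupted code ran in Thread mode */
    bool uses_psp;       /* bit 2: 1 = frame was stacked on PSP, 0 = on MSP */
    bool basic_frame;    /* bit 4: 1 = 8-word frame, 0 = extended FP frame */
} exc_return_info;

static exc_return_info decode_exc_return(uint32_t lr)
{
    exc_return_info info;
    info.valid          = (lr & 0xFFFFFFE0u) == 0xFFFFFFE0u;
    info.to_thread_mode = (lr & (1u << 3)) != 0;
    info.uses_psp       = (lr & (1u << 2)) != 0;
    info.basic_frame    = (lr & (1u << 4)) != 0;
    return info;
}

/* Usage example with the two values seen in the debug trace. */
static void exc_return_example(void)
{
    /* 0xFFFFFFF1: the fault interrupted code already running in Handler mode. */
    assert(!decode_exc_return(0xFFFFFFF1u).to_thread_mode);
    /* 0xFFFFFFFD: the fault interrupted Thread-mode code running on PSP. */
    assert(decode_exc_return(0xFFFFFFFDu).uses_psp);
}
```

Read this way, 0xFFFFFFF1 would mean the fault hit while another handler was running, and 0xFFFFFFFD that it hit Thread-mode code using the process stack.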

We do have a simple test application built with the same drivers, and it has never crashed, although it uses the same interfaces: SPI, three UARTs, A/D, floating-point calculations.

We have tried software and hardware floating point, different QSPI speeds and different compiler options.

What we need is some trace of the code flow just before the hardfault occurs, but we can't put a breakpoint anywhere in the code.

Does anybody have any hints on how to track down the bug? Has anybody encountered similar problems?

    This topic has been closed for replies.

    27 replies

    Visitor II
    June 11, 2019

    We are experiencing something very similar with an STM32F413, not using an RTOS but using UART, CAN and SDIO as the eMMC interface, with FatFs for high-level storage management. We are using the ST HAL, but the rest of the code is ported from a Freescale HCS12X project. Atollic 9.3, latest HAL libs.

    But it was working pretty well since we started using the STM32. We were adding new libraries one at a time, making each work before adding the next one. At some point, the eMMC was functional with FatFs, I added some lines in one of the functions, and our nightmare started! If I added more lines to debug the problem, the problem disappeared. If I change the return type of the function from bool to byte, it works!?!? At some point, changing the compiler optimization to none fixed the problem too. In fact, the weird thing here is that my function seems to return true all the time, but the function which calls it is receiving/reading/changing the result to false!!!

    My colleague had something similar. Using the same code as mine, he was testing a new library, but when he added a new function call in the main loop, we stopped processing one CAN message while others were still processed correctly! He changed some other lines and a hardfault occurred.

    I know my post doesn't help much, but maybe it shows that this can happen not only with QSPI or an RTOS but with many different setups.

    I'm not as advanced as you guys are, but my feeling tells me it is related to a buffer overflow or a stack overflow, depending on the case. But how to catch it when we do not even have a hardfault occurring!?

    PP

    Visitor II
    June 13, 2019

    It is somewhat relieving for us to learn that someone else has exactly the same problems. If you @pprovencher​ ever find out the reason for your problems, please post it here.

    Super User
    June 11, 2019

    > how to catch it when we do not even have a hardfault occurring!?

    Oh, bugs are a live thing and not all of them fail in a neat reproducible way clearly indicating the point of failure. The "random bug" is the worthy one to catch, and how to hunt them down is the art of this trade.

    So you have to have at hand a whole repertoire of tools - hardware like probes, an oscilloscope you have a good grip on, a logic analyzer, even the humble multimeter. "But I am a software engineer" is a whine to be left outside the door. Of course you have to have documentation at hand, and consult it often. And software - have a good grip on the toolchain and its darker corners, know thy mapfile. It's good to have a sleeveful of tricks in the mcu's software, too - like knowing how to toggle pins to be observed by an LA, how to output or store relevant debug info/files without relying on printf() and/or any "semihosting" automagic, how to use otherwise unused resources of the mcu like memories embedded in the peripherals, etc. But, when it comes to bugs, the disassembler is your biggest friend, together with the on-chip-debugging utilities (but those have to be known and understood well, too). And remember - software, or the box on the table, may be called debugger, but the real debugger sits on your chair.

    So be innovative. The first thing is to get the bug to reproduce, at least in some way. Enhance reproduction by stressing the application - throw communication at it at maximum speed, connect pushbuttons to random pulse generators, feed the audio processor with noise or some 10-hour youtube ***, deliberately overdriving the input. Carefully observe symptoms - even if they may be unrelated to your initial complaint or any theory you might have formulated, deviation of any form from what is "normal" *must* be explained.

    Try to catch the bug. It may be that the symptoms indicate the bug has happened long before the symptoms occur, but try to halt execution as soon as observable symptoms occur. For example, if stack overflow is suspected to be the root cause, then using the DEADBEEF method this theory can be safely disproved even at a late catch if some DEADBEEF remains (it can't be safely confirmed, but suspicion certainly increases in that direction if there's no DEADBEEF left).
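
    A minimal sketch of the DEADBEEF method, assuming a stack that grows downward (as on Cortex-M) and a region you can fill before it is in use. The function names are hypothetical, and the array in the usage note merely stands in for a real stack region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu

/* Fill the stack region with the pattern before the task or main loop starts. */
static void stack_paint(uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; ++i) {
        base[i] = STACK_FILL_PATTERN;
    }
}

/* Count untouched words starting at the lowest address. With a downward-growing
 * stack these are the last words to be overwritten, so a result of 0 means the
 * stack has reached (and possibly passed) its limit. */
static size_t stack_unused_words(const uint32_t *base, size_t words)
{
    assert(base != NULL);
    size_t n = 0;
    while (n < words && base[n] == STACK_FILL_PATTERN) {
        ++n;
    }
    return n;
}
```

    Call stack_paint() once at startup, then inspect stack_unused_words() from the debugger or a fault handler when the symptoms appear. Under FreeRTOS, uxTaskGetStackHighWaterMark() reports a comparable figure per task.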

    Then try carefully crafted changes, observe the behaviour. Understand, what the changes really mean. Try to make the changes very local - this is hard to do I know. Make sure you can always return to the reproducible state. It may help to take notes at this point.

    Formulate theories, then devise experiments to prove or disprove them. You suspect stack overflow? How could overflow be caught? It's a write to an address beyond the last address allocated to the stack, so what about using the on-chip-debugger's data breakpoint facility; or, if desperate, the MPU? Or, maybe a simpler

    You may also try the divide et impera method I mentioned - omitting whole blocks of program. Sometimes it gets obvious what's wrong. Not often; but hey, this is the Real Bug, so there's little to lose. Still make sure you can return to the reproducible state.

    Sleep well. Discuss the problem with your colleagues, with the cleaning lady, with your partner at home, with your teddy bear.

    Be persistent. When the Real Bug creeps in, deadlines are void. Getting a product out of the door knowing there's a bug - well, that gives "deadline" the real meaning. If a manager pops in demanding progress, explain the problem to him in great technical detail (and don't let him slip out), and then ask him to hold an oscilloscope probe - yes, ON the PIN 57 of LQFP176, NOT on the pad! - while making your software experiments finally with some comfort. An hour or two should suffice.

    All this takes years to get good at. And then, one day, comes a Real Real Bug, and I - with all those years and knowledge and experience - then look like a complete a****ole anyway... ;)

    JW

    Visitor II
    June 13, 2019

    Thanx JW for the supportive thoughts. I've been in the business for thirty years and still occasionally encounter problems that seem to challenge all one has ever learnt. But all the bugs have been solved - at least as I remember - or they could be attributed to the hardware.

    Visitor II
    June 12, 2019

    That text should be put in the beginning of every book about learning embedded development.

    Visitor II
    June 12, 2019

    I still think it's a good idea to check the fault registers and the exception stack frame, if it is a synchronous fault.
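
    As a rough illustration of checking the fault registers: the sketch below decodes a CFSR value (Configurable Fault Status Register, at 0xE000ED28 on ARMv7-M) into flag names. The bit positions follow the ARMv7-M layout; the table and function names are hypothetical, and on the target you would read the value from the register or from the debugger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t mask;
    const char *name;
} cfsr_flag;

/* CFSR = MMFSR (bits 7:0) | BFSR (bits 15:8) | UFSR (bits 31:16). */
static const cfsr_flag CFSR_FLAGS[] = {
    {1u << 0,  "IACCVIOL"},   {1u << 1,  "DACCVIOL"},  {1u << 3,  "MUNSTKERR"},
    {1u << 4,  "MSTKERR"},    {1u << 5,  "MLSPERR"},   {1u << 7,  "MMARVALID"},
    {1u << 8,  "IBUSERR"},    {1u << 9,  "PRECISERR"}, {1u << 10, "IMPRECISERR"},
    {1u << 11, "UNSTKERR"},   {1u << 12, "STKERR"},    {1u << 13, "LSPERR"},
    {1u << 15, "BFARVALID"},  {1u << 16, "UNDEFINSTR"},{1u << 17, "INVSTATE"},
    {1u << 18, "INVPC"},      {1u << 19, "NOCP"},      {1u << 24, "UNALIGNED"},
    {1u << 25, "DIVBYZERO"},
};

/* Write the space-separated names of the set flags into out.
 * Returns the number of flag names written; stops early if out is full. */
static int cfsr_describe(uint32_t cfsr, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    int count = 0;
    size_t used = 0;
    out[0] = '\0';
    for (size_t i = 0; i < sizeof CFSR_FLAGS / sizeof CFSR_FLAGS[0]; ++i) {
        if ((cfsr & CFSR_FLAGS[i].mask) == 0) {
            continue;
        }
        int n = snprintf(out + used, out_size - used, "%s%s",
                         count ? " " : "", CFSR_FLAGS[i].name);
        if (n < 0 || (size_t)n >= out_size - used) {
            out[used] = '\0'; /* drop the partially written name */
            break;
        }
        used += (size_t)n;
        ++count;
    }
    return count;
}
```

    For a synchronous fault, the stacked PC in the exception frame then points at (or just after) the faulting instruction, which is what makes the combination of fault registers and stack frame worth capturing.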

    Visitor II
    June 13, 2019

    We spent yesterday examining the faulty build and the fault registers of the exception I already posted on June 10, 2019.

    First we thought that the odd value in LR was the reason for the exception, and it took a while for us to understand that some library functions use the register for their own purposes and the actual return address is popped from the stack.

    We also learned that the stack had not overflowed - as it never had in the earlier exceptions either. There were no interrupts active at the exception point.

    What we learned - with this specific build - is that the line where the exception occurred is the following:

    90001ed6:  blx    r7 // atof call

    90001ed8:  ldr    r6, [pc, #64]  ; (0x90001f1c <MC60E_Gnss::ParseGGA(char const*, unsigned long)+228>)

    90001eda:  vmov   r0, r1, d0

    90001ede:  blx    r6

    and the usage fault reason for the exception is: Attempt to execute a coprocessor instruction (NOCP)

    and this is the line - which indeed seems to be a coprocessor instruction - that has already been run several times before the exception occurs. It has also been calculating the correct floating point values. Stack and register values seem to be correct at the exception point.

    We also have a very different failing build, which we continue to examine today.

    Super User
    June 13, 2019

    > and the usage fault reason for the exception is: Attempt to execute a coprocessor instruction (NOCP)

    Can you reproduce this fault? If yes, can you then read out the content of SCB_CPACR?

    JW

    Visitor II
    June 13, 2019

    That's why I pushed to check the CFSR.

    The CPACR that Jan Waclawek mentioned holds the access-control bit fields for each coprocessor. The FPU is implemented as coprocessors 10 and 11.
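
    A hedged sketch of interpreting a CPACR readout (register at 0xE000ED88 on ARMv7-M): each coprocessor n has a 2-bit access field at bits [2n+1:2n], so CP10 sits in bits [21:20] and CP11 in bits [23:22]. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Encoding of one 2-bit CPACR access field. */
typedef enum {
    CP_ACCESS_DENIED     = 0, /* any access raises a NOCP UsageFault */
    CP_ACCESS_PRIVILEGED = 1, /* unprivileged access raises NOCP */
    CP_ACCESS_RESERVED   = 2,
    CP_ACCESS_FULL       = 3
} cp_access;

/* Extract the access field of coprocessor cp from a CPACR value. */
static cp_access cpacr_field(uint32_t cpacr, unsigned cp)
{
    assert(cp <= 15u);
    return (cp_access)((cpacr >> (2u * cp)) & 0x3u);
}

/* True when both FPU coprocessors are accessible from any privilege level. */
static bool fpu_fully_enabled(uint32_t cpacr)
{
    return cpacr_field(cpacr, 10u) == CP_ACCESS_FULL &&
           cpacr_field(cpacr, 11u) == CP_ACCESS_FULL;
}
```

    A readout of 0x00F00000 gives full access for both fields. Anything else at the moment of the fault would be one possible explanation for NOCP.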

    You might also want to check FPCCR.

    BTW, when you have problems with core (as opposed to peripherals), Arm-manuals are more "it". In your case: https://static.docs.arm.com/ddi0403/eb/DDI0403E_B_armv7m_arm.pdf

    Visitor II
    June 14, 2019

    Here are some register values from the crashing point:

    [Two screenshots of register values: 0690X000008idmDQAQ.png, 0690X000008idm8QAA.png]

    To us they seem to be quite normal - though our understanding of this area is not at a very high (if any) level.

    In our compiler settings the FPU is FPv5-SP-D16 and - as said before - the calculations are performed successfully and correctly until this crash occurs

    Super User
    June 15, 2019

    I see no reason for the NOCP fault, as both CP10 and CP11 are enabled.

    At this point, I'd start to suspect hardware. Does the fault occur on multiple instances of the hardware? Is the power supply rock solid, as observed directly on the power pins? What's the voltage, can it be changed to slightly lower/higher? Are all supply and ground pins connected properly? Have all pins been checked for bad solder joints? Have the decoupling capacitors, especially VCAP, been scrutinized?

    JW

    Visitor II
    June 17, 2019

    We don't believe that it is about the HW; to us it seems to be all about the builds: a badly behaving build crashes on all the devices, a good build is stable on all the devices.

    Graduate II
    June 14, 2019

    PC points to the line:

    float hdop = atof(data);

    ---

    The question is still valid - are data and other string variables in all places guaranteed to always be zero-terminated, when passing them to strchr and other string processing functions? :)
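
    One cheap way to test that assumption at run time, sketched with a hypothetical helper: before handing a buffer to atof, strchr and friends, verify that a terminator exists within the buffer's capacity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if a terminating '\0' exists within the first capacity bytes of buf. */
static bool is_nul_terminated(const char *buf, size_t capacity)
{
    assert(buf != NULL);
    return memchr(buf, '\0', capacity) != NULL;
}
```

    Asserting on this (or setting a breakpoint on the false branch) just before each parse call would catch an unterminated field at the point where it is produced, rather than wherever the overrun later shows up.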

    Visitor II
    June 17, 2019

    Just rechecked the data and string variables used in the crashing thread: they are always nul terminated; the collecting functions can't exit without setting the terminating nul.

    Visitor II
    June 15, 2019

    Also, it might reveal something if you disassembled some code and searched for coprocessor accesses (CDP or CDP2). Maybe some other coprocessor gets accessed. Shouldn't, but then again this fault shouldn't happen in the first place. Maybe - just maybe - there is a bug in the compiler or compiler setup.

    Maybe the execution gets into some literal pool or something...