Visitor II
August 5, 2020
Solved

Memory/Instruction barriers before writing to the backup SRAM

The following code did not work because of a missing data barrier before the write to the backup SRAM:

HAL_PWR_EnableBkUpAccess();
std::copy(buffer, buffer + num_bytes, BaseAddress + address);
HAL_PWR_DisableBkUpAccess();

Adding a DSB solved the issue for now.

HAL_PWR_EnableBkUpAccess();
__DSB();
std::copy(buffer, buffer + num_bytes, BaseAddress + address);
HAL_PWR_DisableBkUpAccess();

It is understandable that enabling backup access (which sets a single bit) needs to have fully completed before the actual memory addresses are written.

What I do not fully understand is whether a DMB would be enough (it also works), and whether I need an additional ISB so that the enable and disable cannot both happen before the actual copy, like so:

HAL_PWR_EnableBkUpAccess();
__DSB();
std::copy(buffer, buffer + num_bytes, BaseAddress + address);
__ISB();
HAL_PWR_DisableBkUpAccess();

Can someone tell me what the correct way would be?

    This topic has been closed for replies.
    Best answer by waclawek.jan

    12 replies

    Super User
    August 5, 2020

    Which STM32?

    This has nothing to do with barriers as such; the DSB there acts as a simple delay.

    JW

    Visitor II
    August 5, 2020

    STM32F446

    If it has nothing to do with barriers, how can I be sure the effect of enable is effective?

    Without the barrier and with optimization enabled (-Os), the write always fails for the first byte.

    Super User
    August 5, 2020

    Seen this on STM32H743 (Nucleo) and 753.

    After writing to the PWR registers and the backup RAM, __DSB is needed (but not ISB).

    (Maybe an MPU region can be set up to make it work without the flush; I have not tried.)

    -- pa

    Visitor II
    August 5, 2020

    Thank you for your answer. So this way?

    HAL_PWR_EnableBkUpAccess();
    __DSB();
    std::copy(buffer, buffer + num_bytes, BaseAddress + address);
    __DSB();
    HAL_PWR_DisableBkUpAccess();
    __DSB();

    Graduate II
    August 5, 2020

    Replace them with DMB and remove the last one. :)
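A minimal sketch of what this suggestion amounts to, for illustration only. `PwrRegs`, `kDbp`, `data_memory_barrier()` and `backup_sram_write_dmb()` are hypothetical names introduced here, and the DBP bit position is an assumption that should be taken from the device header. On target, the CMSIS `__DMB()` intrinsic would normally replace the hand-written barrier. Note that a later reply in this thread argues that on the OP's 'F446 the underlying problem is bus timing rather than reordering, so this variant is mainly of interest on Cortex-M7 parts.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the PWR register block so the sketch builds on a host.
// On target, use the PWR definitions from the device header instead.
struct PwrRegs {
    volatile std::uint32_t CR;
};
constexpr std::uint32_t kDbp = 1u << 8;  // assumed position of the DBP bit

// Emits DMB when compiled for Armv7-M / Armv7E-M (Cortex-M3/M4/M7) with GCC or Clang.
// On any other target it falls back to a C++ fence so the sketch still compiles.
inline void data_memory_barrier() {
#if defined(__ARM_ARCH_7M__) || defined(__ARM_ARCH_7EM__)
    __asm__ volatile("dmb" ::: "memory");
#else
    std::atomic_thread_fence(std::memory_order_seq_cst);
#endif
}

inline void backup_sram_write_dmb(PwrRegs& pwr, std::uint8_t* dst,
                                  const std::uint8_t* src, std::size_t n) {
    pwr.CR = pwr.CR | kDbp;        // enable backup domain access
    data_memory_barrier();         // order the register write before the SRAM stores
    std::copy(src, src + n, dst);  // plain stores to (Normal) memory
    data_memory_barrier();         // order the SRAM stores before the register write below
    pwr.CR = pwr.CR & ~kDbp;       // disable access again; no trailing barrier, as suggested
}
```

The two barriers bracket the copy, which is the only part that touches Normal memory; the register accesses themselves are volatile.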

    Graduate II
    August 5, 2020

    ISB is not a memory barrier and is totally unrelated. DSB can be used but is unnecessarily restrictive. DMB is sufficient and optimal.

    @Community member, by default SRAM is of the Normal memory type while peripheral registers are of the Device memory type, and the CPU is allowed to reorder accesses to Normal memory. Before Cortex-M7 this wasn't a thing, but Cortex-M7 is capable of it and actually does it. Therefore a memory barrier is required.

    @Pavel A., you don't believe in memory barriers, do you? Only some hackers believe in those... ;)

    But seriously, AN4838 section 3.1:

    Normal memory: allows the load and store of bytes, half-words and words to be arranged by the CPU in an efficient manner (the compiler is not aware of memory region types). For the normal memory region the load / store is not necessarily performed by the CPU in the order listed in the program.

    Device memory: within the device region, the loads and stores are done strictly in order. This is to ensure the registers are set in the proper order.

    And my topic on this:

    https://community.st.com/s/question/0D50X0000C4Nk4GSQS/bug-missing-compiler-and-cpu-memory-barriers

    Visitor II
    August 5, 2020

    Thank you very much for your detailed explanation. :)

    Super User
    August 5, 2020

    @Piranha,

    OP uses a 'F446. See my post below.

    JW

    Super User
    August 5, 2020

    > how can I be sure the effect of enable is effective?

    The DSB may not be sufficient if APB1 is slow, or if there are busmaster (DMA) conflicts on APB1. See below.

    The 'H7 is very, VERY different in this - there, backup SRAM sits by default in a Normal area, which can reorder writes even if the area is not cached. The bus structure is different, too. Barriers may be necessary and still not sufficient; I am not interested enough in the 'H7 to pay it more than casual attention.

    In 'F4, writes never get reordered, not even in Normal area. The issue here is given by the relatively slow APB1 bus on which PWR sits, versus the relatively fast AHB1 bus on which BKPSRAM (and RCC with BDCR) sits. The same applies to all backup domain items.

    I've talked about it here already a couple of years ago, ST "discovered" it only recently (see the 'F407 erratum "Possible delay in backup domain protection disabling/enabling after programming the DBP bit"). Use the recommended workaround from there - for your case, only the readback is applicable (C, I don't ++):

    PWR->CR |= PWR_CR_DBP;
     (void)PWR->CR; // readback to ensure the bit is set before commencing the SRAM/RTC access, as PWR is on APB1 whereas RTC and SRAM are on AHB1

    There is no such issue the other way round, i.e. after writing to BKPSRAM, there is no need for any delay before writing to PWR_CR.DBP.

    You may want to make sure the compiler won't reorder accesses, though (see volatile and sequence points; again, I don't ++).

    JW
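For readers working in C++, the readback sequence described in the post above could be wrapped roughly as follows. This is only a sketch under stated assumptions: `PwrRegs`, `kDbp` and `backup_sram_write()` are hypothetical names, and `PwrRegs` is a host-side stand-in for the real `PWR` block (on target, `PWR->CR` and `PWR_CR_DBP` from the device header would be used). The destination pointer is volatile so that the compiler keeps the stores between the two register accesses.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the PWR register block so the sketch builds on a host.
// On target, use PWR and PWR_CR_DBP from the device header instead.
struct PwrRegs {
    volatile std::uint32_t CR;
};
constexpr std::uint32_t kDbp = 1u << 8;  // DBP is bit 8 of PWR_CR on the 'F4

// Copies n bytes into backup SRAM in the order described above:
// set DBP, read PWR_CR back, write the data, clear DBP.
inline void backup_sram_write(PwrRegs& pwr, volatile std::uint8_t* dst,
                              const std::uint8_t* src, std::size_t n) {
    pwr.CR = pwr.CR | kDbp;                 // enable backup domain access
    const std::uint32_t readback = pwr.CR;  // volatile readback, as in the erratum workaround
    (void)readback;                         // the value itself is not needed
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i];                    // volatile stores stay between the PWR accesses
    }
    pwr.CR = pwr.CR & ~kDbp;                // no extra delay needed in this direction
}
```

A byte-wise loop is used instead of `std::copy` only to keep every store volatile; whether that matters in a given build depends on the compiler and optimization level.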

    Super User
    August 5, 2020

    +1. Read-back is an intuitive and portable way to ensure both the flushing of writes and the delay, at once.

    Yes, H7 has its own can of worms...

    Visitor II
    August 5, 2020

    So read-back or DMB? Or are both valid? :(

    Graduate II
    August 5, 2020

    All of this raises the question about AHB buses - do these guarantee that a write followed by a memory barrier has completed over the bus, or can there be some delays as well? What about synchronization of AXI and AHB at the same (F7) and different (H7) frequencies?

    @Amel NASRI, @Imen DAHMEN, or someone from ST - can someone finally comment on/solve the long-standing mystery of bus synchronization and delays?

    Super User
    August 5, 2020

    The processor - and its facilities - does not "see" beyond its boundaries. In other words, all barriers etc. act only upon the processor and the attached write buffer (that seems to include the bitbanding attachment in case of CM3/CM4 - I have a fun story with that one on the NXP LPC17xx, where GPIO is in bit-bandable *memory* (thus normal) area).

    In the case of the CM7, the barriers probably also cover some or all of the AXIM stuff; I'm not sure. As I've said, I am not interested, exactly because of the complexity: I work more on the "control" side, so I give up processing power in favour of control.

    In other words, whatever is beyond the busmatrix, is not controlled by processor, and may and does involve various timing issues. The biggest fun is with inter-bus inter-module interconnections. It's ST which is supposed to describe it. I understand it's a hard task, OTOH, they solve it generally by massive handwavings. (Not that other manufacturers are better, but that's no argument of course)

    JW

    Graduate II
    August 5, 2020

    Are you sure that the CPU cannot see the completion of access even over strongly ordered memory type?

    From AN 4838:

    Strongly ordered memory: everything is always done in the programmatically listed order, where the CPU waits the end of load/store instruction execution (effective bus access) before executing the next instruction in the program stream. This can cause a performance hit.

    Super User
    August 5, 2020

    > Are you sure that the CPU cannot see the completion of access

    This will be an exercise in modal verbs... It *may* see them.

    First, as you all are already painfully aware, these are SoC rather than microcontrollers, i.e. not peripherals tightly integrated around a core sharing clock, but IPs slapped to the core through the bus fabric with transactions based on handshakes. If a peripheral can't store/retrieve data immediately, it signals WAIT, this is propagated back to the originator. To avoid slowdowns from slower peripherals/sub-buses (e.g. APB), buffers (FIFOs, usually one-transaction) are inserted. WAIT is then propagated back through buffer only if it is full.

    The barriers outwardly wait until all WAITs cease (they perform some tasks internal to the processor and its nearest kin, too). If a write still sits in a buffer, the processor does not know about it. There may be some extra "are all buffers empty?" signal from the processor, but I doubt there is any.

    So, barriers on straightforward memories will work. On APB buses they probably won't. On the intermatrix interconnects in H7, I don't know and don't care.

    JW

    Super User
    August 9, 2020

    Ok, then a small concrete question, if I may:

    Will a write-then-read sequence on the same, properly aligned address work "correctly" no matter what the bus matrix or the other things mentioned are? That is, is the write guaranteed to be completed by its target before the CPU gets the read value (with all the needed waits)?

    -- pa

    Graduate II
    August 10, 2020

    ARM architectural requirements:

    For device and strongly-ordered memory types - yes. For normal memory type - no, DMB between write-read is required.

    Cortex-M implementation details:

    At the moment of writing, except for Cortex-M7 and the upcoming Cortex-M55, all other Cortex-M cores do not have a capability of instruction reordering and will work correctly on all memory types even without memory barrier instructions.

    Also remember that both of those accesses need to be volatile, or there must be a compiler barrier between them, for the respective instructions to be compiled in the order in which they are written in the code.
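The two compiler-level options mentioned above can be sketched as follows. The function names are hypothetical. `std::atomic_signal_fence` is used here as a portable stand-in for a compiler barrier: it emits no CPU instruction, and GCC and Clang treat it as a barrier against compile-time reordering of surrounding memory accesses. Neither variant is a CPU barrier, so on Cortex-M7 with Normal memory the DMB mentioned above would still be needed in addition.

```cpp
#include <atomic>
#include <cstdint>

// Variant 1: both accesses are volatile, so the compiler has to emit
// the store and then the load, in program order.
inline std::uint32_t write_then_read_volatile(volatile std::uint32_t* reg,
                                              std::uint32_t value) {
    *reg = value;
    return *reg;
}

// Variant 2: plain accesses separated by a compiler-only fence.
inline std::uint32_t write_then_read_fenced(std::uint32_t* word,
                                            std::uint32_t value) {
    *word = value;
    std::atomic_signal_fence(std::memory_order_seq_cst);  // compiler barrier, no instruction emitted
    return *word;
}
```

For peripheral registers the volatile variant is the usual choice, since the device headers already declare the registers volatile.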

    Super User
    August 10, 2020

    > if this is true about the normal memory, it would break most "normal" programs.

    Why do you think so?

    The sole purpose of a program in an mcu is to perform accesses to the peripherals. These have to happen in the order in which they are written in the program - and this indeed is ensured, at the compiler level by qualifying the registers as volatile, and at the processor level by having them located in a Device area. Everything else has an effect only on the timing. Allowing reordering and buffering/caching speeds up execution, and that's what everybody wants.

    Note, that even if a value written by program to a Normal memory is not physically stored to the memory, the execution is still correct. Either the processor can infer from the flow that the written value is not yet needed (because it's not read, and if the same address is written in program again, the old value may be safely forgotten without being written ever), so it can delay its writing to some suitable later time; or it can serve the value to a read from a cache or a buffer at the processor boundary. That is, provided that the variable in question has not been eliminated altogether already when compiling.

    The backup SRAM is sort of a crossover between a peripheral and memory. The default mapping in H7 puts it at Normal (and cached), which may be quite OK, once you perform the "special procedure" to unlock it after reset, and perhaps ensure proper writeback in an "early powerdown warning" interrupt. Or, you can remap it using the MPU. Or, you can avoid using the F7/H7...

    JW

    Super User
    August 11, 2020

    > Why do you think so?

    Because most normal programs just write and read from a normal memory without any explicit barriers.

    Any decent MCU must have at least some amount of normal memory (stack..)

    /* Yes, there is speculative execution and other such things ... the Intel folks thought they are so smart and can get away with it, but it ended badly */

    > Or, you can avoid using the F7/H7

    Sorry, cannot. Must cope with what the customer wants ;)

    -- pa