STM32H7x SPI Slave FIFO data loading
Hello,
We have been trying to use the STM32H747I-DISCO as an SPI slave that responds to a command byte, behaving much like a sensor or EEPROM would.
That is, the first byte sent from the master is a command, and our slave device should respond to it. Unfortunately, the best we have managed so far is to respond 3 bytes late.
In the following, the top is what we'd like, the bottom is what we are currently seeing.

It appears that to get the slave to send back any data, the TXFIFO must have at least 32 bits loaded into it at the start of the transfer. It doesn't matter whether I do this via a single 32-bit register write
hspi5.Instance->TXDR = 0x89ABCDEF; // 32 bits of "junk"
or four 8-bit writes
*(volatile uint8_t *)&hspi5.Instance->TXDR = 0x89; // "junk" byte 1
*(volatile uint8_t *)&hspi5.Instance->TXDR = 0xAB; // "junk" byte 2
*(volatile uint8_t *)&hspi5.Instance->TXDR = 0xCD; // "junk" byte 3
*(volatile uint8_t *)&hspi5.Instance->TXDR = 0xEF; // "junk" byte 4
Either way, we see the behavior in the bottom half of the image.
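For anyone reproducing the 8-bit writes: the cast has to be applied to the address of the data register, not to its value. The following is a minimal host-side sketch of that access pattern. `fake_spi_t`, `write_tx_byte`, `first_byte_of_txdr` and `demo_byte_write` are hypothetical names made up for illustration; they are not part of the HAL, and the struct only stands in for the real `SPI_TypeDef`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the SPI register block; only TXDR is modelled. */
typedef struct {
    volatile uint32_t TXDR;
} fake_spi_t;

/* Store one byte using an 8-bit access.
 * Note the '&': the cast is applied to the register's address, not its value. */
static void write_tx_byte(fake_spi_t *spi, uint8_t value)
{
    *(volatile uint8_t *)&spi->TXDR = value;
}

/* Read back the byte at the register's lowest address (host-only check). */
static uint8_t first_byte_of_txdr(const fake_spi_t *spi)
{
    return *(const volatile uint8_t *)&spi->TXDR;
}

/* Small self-check: an 8-bit store touches only the first byte of the word. */
static void demo_byte_write(void)
{
    fake_spi_t spi = { 0 };
    write_tx_byte(&spi, 0x89);
    assert(first_byte_of_txdr(&spi) == 0x89);
}
```

On real hardware the same expression would be used with `hspi5.Instance` in place of the fake struct. The sketch shows only the pointer arithmetic and says nothing about how the peripheral reacts to the write.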
In more detail, if we do what we'd expect to be correct and load the FIFO with a single 8-bit write:
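// Make 8-bit pointers to the TX and RX data registers
// (volatile so the compiler never drops or reorders the accesses)
volatile uint8_t* spi_rx_ptr;
spi_rx_ptr = (volatile uint8_t*)&(hspi5.Instance->RXDR);
volatile uint8_t* spi_tx_ptr;
spi_tx_ptr = (volatile uint8_t*)&(hspi5.Instance->TXDR);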
// Disable and re-enable the SPI peripheral
CLEAR_BIT(hspi5.Instance->CR1, SPI_CR1_SPE);
SET_BIT(hspi5.Instance->CR1, SPI_CR1_SPE);
// Set junk byte so we can receive from master
*spi_tx_ptr = 0xAB; // "junk" byte 1
// Wait for RXP (blocking)
while ((hspi5.Instance->SR & SPI_FLAG_RXP) == 0);
// Get the command from the master
uint8_t command_byte = *spi_rx_ptr;
// Normally here we'd switch on the command to change behavior,
// but that's added complexity not needed for this question.
// Wait for TXP so we can load the first (possibly only)
// byte to be transmitted in response to the command.
while ((hspi5.Instance->SR & SPI_FLAG_TXP) == 0);
// Load the byte (We have a little extra logic here
// for stepping though a LUT, incrementing the
// address if it's multi-byte, but, again, let's keep this minimal)
// and just hard code the response for testing
*spi_tx_ptr = 0xCD;
// We have some more logic after this to wait for the end of the
// transfer. For testing, this can be replaced by
// Wait for EOT (blocking)
while ((hspi5.Instance->SR & SPI_FLAG_EOT) == 0);
// Though this may only work once after each reboot.
We only see:

The `0xAB` of the 1st load to the TXFIFO works, but the second (setting it to `0xCD`) doesn't, instead sending out `0xFF` on MISO until the end of the transfer.
If instead we change the code to
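// Make 8-bit pointers to the TX and RX data registers
// (volatile so the compiler never drops or reorders the accesses)
volatile uint8_t* spi_rx_ptr;
spi_rx_ptr = (volatile uint8_t*)&(hspi5.Instance->RXDR);
volatile uint8_t* spi_tx_ptr;
spi_tx_ptr = (volatile uint8_t*)&(hspi5.Instance->TXDR);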
// Disable and re-enable the SPI peripheral
CLEAR_BIT(hspi5.Instance->CR1, SPI_CR1_SPE);
SET_BIT(hspi5.Instance->CR1, SPI_CR1_SPE);
// Set junk byte so we can receive from master
*spi_tx_ptr = 0x12; // "junk" byte 1
*spi_tx_ptr = 0x34; // "junk" byte 2
*spi_tx_ptr = 0x56; // "junk" byte 3
*spi_tx_ptr = 0x78; // "junk" byte 4
// Wait for RXP (blocking)
while ((hspi5.Instance->SR & SPI_FLAG_RXP) == 0);
// Get the command from the master
uint8_t command_byte = *spi_rx_ptr;
// Wait for TXP so we can load the response
while ((hspi5.Instance->SR & SPI_FLAG_TXP) == 0);
// Load the response
*spi_tx_ptr = 0xCD;
// Wait for EOT (blocking)
while ((hspi5.Instance->SR & SPI_FLAG_EOT) == 0);
We are able to at least send the response:

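To make the "3 bytes late" figure concrete, here is a small host-side model of plain first-in, first-out ordering. It is only a sketch under stated assumptions: `tx_fifo_model_t`, `fifo_push`, `fifo_pop` and `simulate_miso` are hypothetical names, the depth of 16 is arbitrary, and the `0xFF` returned on an empty FIFO simply mirrors what we observe on MISO. It does not claim to reproduce the real peripheral.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 16u /* illustrative depth, not taken from the reference manual */

typedef struct {
    uint8_t data[FIFO_CAPACITY];
    size_t head;  /* index of the next byte to shift out */
    size_t count; /* bytes currently queued */
} tx_fifo_model_t;

static void fifo_push(tx_fifo_model_t *f, uint8_t b)
{
    assert(f->count < FIFO_CAPACITY);
    f->data[(f->head + f->count) % FIFO_CAPACITY] = b;
    f->count++;
}

/* Shift out one byte; an empty FIFO yields 0xFF (assumed idle value). */
static uint8_t fifo_pop(tx_fifo_model_t *f)
{
    if (f->count == 0) {
        return 0xFF;
    }
    uint8_t b = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return b;
}

/* Preload the junk bytes, then exchange `len` bytes. The response is queued
 * right after the first byte (the command) has been exchanged. */
static void simulate_miso(const uint8_t *junk, size_t n_junk,
                          uint8_t response, uint8_t *miso, size_t len)
{
    tx_fifo_model_t f = { 0 };
    for (size_t i = 0; i < n_junk; i++) {
        fifo_push(&f, junk[i]);
    }
    for (size_t i = 0; i < len; i++) {
        miso[i] = fifo_pop(&f);
        if (i == 0) {
            fifo_push(&f, response); /* command received, queue the response */
        }
    }
}
```

With the four junk bytes `0x12 0x34 0x56 0x78` and response `0xCD`, the model puts the response in the fifth byte slot (index 4) rather than the second (index 1), which is the 3-byte delay we see. With a single junk byte the model puts `0xCD` at index 1, which is what we want but not what the hardware does, so simple queue ordering alone does not explain the single-byte case.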
For added context, here is an abridged version of our init function (removed pin setup, error handling, etc.)
hspi5.Instance = SPI5;
hspi5.Init.Mode = SPI_MODE_SLAVE;
hspi5.Init.Direction = SPI_DIRECTION_2LINES;
hspi5.Init.DataSize = SPI_DATASIZE_8BIT;
hspi5.Init.CLKPolarity = SPI_POLARITY_LOW;
hspi5.Init.CLKPhase = SPI_PHASE_1EDGE;
hspi5.Init.NSS = SPI_NSS_HARD_INPUT;
hspi5.Init.FirstBit = SPI_FIRSTBIT_MSB;
hspi5.Init.TIMode = SPI_TIMODE_DISABLE;
hspi5.Init.CRCCalculation = SPI_CRCCALCULATION_DISABLE;
hspi5.Init.CRCPolynomial = 0x0;
hspi5.Init.NSSPMode = SPI_NSS_PULSE_DISABLE;
hspi5.Init.NSSPolarity = SPI_NSS_POLARITY_LOW;
hspi5.Init.FifoThreshold = SPI_FIFO_THRESHOLD_01DATA;
hspi5.Init.TxCRCInitializationPattern
= SPI_CRC_INITIALIZATION_ALL_ZERO_PATTERN;
hspi5.Init.RxCRCInitializationPattern
= SPI_CRC_INITIALIZATION_ALL_ZERO_PATTERN;
hspi5.Init.MasterSSIdleness = SPI_MASTER_SS_IDLENESS_00CYCLE;
hspi5.Init.MasterInterDataIdleness = SPI_MASTER_INTERDATA_IDLENESS_00CYCLE;
hspi5.Init.MasterReceiverAutoSusp = SPI_MASTER_RX_AUTOSUSP_DISABLE;
hspi5.Init.MasterKeepIOState = SPI_MASTER_KEEP_IO_STATE_DISABLE;
hspi5.Init.IOSwap = SPI_IO_SWAP_DISABLE;
// Set TSER = 0, TSIZE = 0 (RM0399 53.11.2)
hspi5.Instance->CR2 = 0;
Also note that SPI5 is clocked directly from HSE (25 MHz), as Table 120 "SPI dynamic characteristics" of the STM32H747xI/G datasheet implies that a full-duplex slave can't run above ~30 MHz, depending on voltage. I'm not sure, though, whether that limit refers to the peripheral clock or to the input clock from the master.
I am using the CM7 exclusively, with the CM4 asleep. The CM7 is clocked as high as possible, at 480MHz.
I have tested with the master clock set as low as 100 Hz and as high as 10 MHz and seen the behavior remain the same, only breaking down at higher clock speeds.
When testing on the STM32F412G-DISCOVERY, with its different SPI peripheral, we had very similar code working completely, albeit only up to ~1 MHz SPI clock. Since there is very little time between receiving the command byte and when we need to load the response data, we moved to the H747 in hopes of achieving 5 MHz or greater. In the full code, the bottleneck was the time needed to run the FSM that determines what data should be sent in response to the command byte.
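To put rough numbers on that time budget, the sketch below converts an SPI clock into CM7 core cycles per bit and per byte. The helper names are hypothetical, and the result is only an upper bound on the cycles available: it ignores interrupt latency, bus wait states, and any synchronization delay between the SPI kernel clock and the core.

```c
#include <assert.h>
#include <stdint.h>

/* Core cycles that elapse during one SPI bit period (rounded down). */
static uint32_t cpu_cycles_per_spi_bit(uint32_t cpu_hz, uint32_t sck_hz)
{
    assert(sck_hz != 0u);
    return cpu_hz / sck_hz;
}

/* Core cycles that elapse while one 8-bit frame is clocked (rounded down).
 * The multiplication is done in 64 bits so it cannot overflow. */
static uint32_t cpu_cycles_per_spi_byte(uint32_t cpu_hz, uint32_t sck_hz)
{
    assert(sck_hz != 0u);
    return (uint32_t)((8ull * cpu_hz) / sck_hz);
}
```

For a 480 MHz core and a 5 MHz SPI clock this gives 96 cycles per bit and 768 cycles per byte. At 1 MHz the figures are 480 and 3840, and at 10 MHz they are 48 and 384. If the master sends bytes back to back, the response would presumably have to be queued within about one bit period of the command's last bit, so the per-bit figure is likely the more relevant one. That depends on whether the master inserts idle time between bytes.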
