Audio input USB host

Question

I've been attempting to stream audio in from a USB microphone using the USB host functionality on my STM32F411. Unfortunately, it seems the audio class has been left unfinished: about 50% of the code is there, but the other, important 50% that actually streams the input is missing. I have spent a significant amount of time testing and googling (and frankly using a bit of ChatGPT, because I don't have much time to spend digging around in the USB spec) to try to piece together what is missing.
I'm at a point where I am receiving data, but the data seems to be entirely wrong. It doesn't seem garbled, but it tends to repeat the same packet data over and over again, and sometimes nothing is received at all and the buffer stays filled with zeros.

This post was very helpful in laying the groundwork, but I have since expanded on it a little. The first important thing I added was to set the microphone to the highest sample rate it supports, which in my case is 48 kHz. This is purely so I am 100% confident that the packet size I calculate is correct; otherwise I would just have to guess that the microphone is set to that frequency. Below is the code used to do this, which I have placed inside the `USBH_AUDIO_ClassRequest` function. It's not amazing but it works.

case AUDIO_REQ_GET_FREQ:
    if (AUDIO_Handle->microphone.supported == 1U) {
        uint32_t max_freq_idx = 0;
        uint32_t max_freq = 0;

        for (int i = 0; i < AUDIO_MAX_SAMFREQ_NBR; i++) {
            uint32_t freq = get_sample_rate(AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc, i);
            if (max_freq < freq) {
                max_freq = freq;
                max_freq_idx = i;
            }
        }

        uint8_t bit_depth = get_bit_depth(AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc);
        uint8_t n_chan = get_n_channels(AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc);

        AUDIO_Handle->microphone.frequency = max_freq;
        // !!!!! frame_length is important as this is the size of packet
        AUDIO_Handle->microphone.frame_length = ((max_freq * bit_depth * n_chan) / 8000U);

        // setup single packet buffer
        if (AUDIO_Handle->microphone.buf) {
            USBH_free(AUDIO_Handle->microphone.buf);
        }

        uint16_t buf_len = AUDIO_Handle->microphone.frame_length;
        AUDIO_Handle->microphone.buf = (uint8_t *)USBH_malloc(buf_len);

        if (AUDIO_Handle->microphone.buf == NULL) {
            return USBH_FAIL;
        }

        memset(AUDIO_Handle->microphone.buf, 0, buf_len);

        AUDIO_Handle->req_state = AUDIO_REQ_SET_FREQ;
    }

    break;

case AUDIO_REQ_SET_FREQ:
    if (AUDIO_Handle->microphone.supported == 1U) {
        if (AUDIO_Handle->freq_state == AUDIO_FREQ_SET_INFERFACE1) {
            USBH_StatusTypeDef status = USBH_SetInterface(phost, AUDIO_INTERFACE_NUM, 0);

            if (status == USBH_OK) {
                AUDIO_Handle->freq_state = AUDIO_FREQ_SET_INFERFACE2;
            } else if (status == USBH_NOT_SUPPORTED) {
                status = USBH_FAIL;
            }

        } else if (AUDIO_Handle->freq_state == AUDIO_FREQ_SET_INFERFACE2) {
            USBH_StatusTypeDef status = USBH_SetInterface(phost, AUDIO_INTERFACE_NUM, 1);

            if (status == USBH_OK) {
                AUDIO_Handle->freq_state = AUDIO_FREQ_URB_OUT;
            } else if (status == USBH_NOT_SUPPORTED) {
                status = USBH_FAIL;
            }
        } else if (AUDIO_Handle->freq_state == AUDIO_FREQ_URB_OUT) {
            AUDIO_Handle->mem[0] = AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc->tSamFreq[max_freq_idx][0];
            AUDIO_Handle->mem[1] = AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc->tSamFreq[max_freq_idx][1];
            AUDIO_Handle->mem[2] = AUDIO_Handle->class_desc.as_desc[0].FormatTypeDesc->tSamFreq[max_freq_idx][2];

            USBH_StatusTypeDef status = USBH_AUDIO_SetEndpointControls(phost, AUDIO_Handle->microphone.Ep, (uint8_t *)(AUDIO_Handle->mem));

            if (status == USBH_OK) {
                AUDIO_Handle->freq_state = AUDIO_FREQ_SET_INFERFACE1;
                AUDIO_Handle->req_state = AUDIO_REQ_IDLE;

            } else if (status == USBH_NOT_SUPPORTED) {
                status = USBH_FAIL;
            }
        }
    }

    break;

This makes sure the microphone is set to 48 kHz, so frame_length will be ((48000 * 16 * 1) / 8000), which gives a packet size of 96 bytes. That is correct and within the 100-byte `wMaxPacketSize`.
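As a quick sanity check of that arithmetic, here is a minimal standalone sketch. The helper name `iso_packet_bytes` is made up for illustration and is not part of the ST library, and it assumes a full-speed device that sends one isochronous packet per 1 ms frame.

```c
#include <stdint.h>

/* Hypothetical helper (not from the ST USB host library): bytes of PCM
 * carried by one 1 ms full-speed frame, using the same integer math as
 * the frame_length calculation above. */
static uint32_t iso_packet_bytes(uint32_t sample_rate_hz,
                                 uint32_t bit_depth,
                                 uint32_t channels)
{
    /* bits per second / 8 = bytes per second; / 1000 = bytes per 1 ms frame */
    return (sample_rate_hz * bit_depth * channels) / 8000U;
}
```

With 48000 Hz, 16-bit, mono this gives 96. With 44100 Hz the exact value would be 88.2, so the integer division truncates to 88, which is worth keeping in mind if the microphone ever ends up at a rate that is not a multiple of 1000.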

The next important part is the replacement of the `USBH_AUDIO_SOFProcess` and `USBH_AUDIO_InputStream` functions, which are what handle the incoming data.

static USBH_StatusTypeDef USBH_AUDIO_SOFProcess(USBH_HandleTypeDef *phost) {
    AUDIO_HandleTypeDef *AUDIO_Handle = (AUDIO_HandleTypeDef *)phost->pActiveClass->pData;

    if (audio_flag == 0) {
        audio_flag = 1; // set flag to indicate new data should be received
    }

    return USBH_OK;
}

static USBH_StatusTypeDef USBH_AUDIO_InputStream(USBH_HandleTypeDef *phost) {
    USBH_StatusTypeDef status = USBH_OK;
    AUDIO_HandleTypeDef *AUDIO_Handle;

    AUDIO_Handle = (AUDIO_HandleTypeDef *)phost->pActiveClass->pData;

    if (audio_flag == 1) {
        audio_flag = 0; // reset flag

        USBH_StatusTypeDef status2 = USBH_IsocReceiveData(phost, AUDIO_Handle->microphone.buf, AUDIO_Handle->microphone.frame_length, AUDIO_Handle->microphone.Pipe);

        // print the buffer to get an idea of the data
        if (status2 == USBH_OK) {
            for (int i = 0; i < AUDIO_Handle->microphone.frame_length; i++) {
                printf("%d,", AUDIO_Handle->microphone.buf[i]);
            }
            printf("\n");
        }
    } else {
        status = USBH_BUSY;
    }

    return status;
}

Here is where something is wrong. The buffer never seems to be filled with any real PCM data, and it often just seems to repeat the same packet data over and over again, but I cannot for the life of me figure out why. I have tried flipping around the code, where I read the packet inside the SOF callback instead, and then process it later in the InputStream function, but it replicates the same behaviour.
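One possible explanation, offered as a sketch and not as a confirmed fix: in ST's host library `USBH_IsocReceiveData` only submits the transfer and returns `USBH_OK` straight away, so a buffer read on the very next line can still hold the previous packet or the initial zeros. The usual pattern is to arm the pipe, wait until `USBH_LL_GetURBState` reports `USBH_URB_DONE`, read `USBH_LL_GetLastXferSize` bytes, and only then re-arm. The code below models that sequence on a PC with a made-up `mock_pipe_t` standing in for the hardware. None of the `mock_*` names exist in the library.

```c
#include <stdint.h>
#include <string.h>

/* Everything prefixed mock_ is hypothetical and only imitates the
 * submit / poll / read cycle; it is not ST library code. */
typedef enum { MOCK_URB_IDLE, MOCK_URB_DONE } mock_urb_state_t;

typedef struct {
    uint8_t *target;          /* buffer handed over when the transfer was armed */
    uint32_t length;          /* requested length in bytes */
    mock_urb_state_t state;   /* IDLE until the "hardware" completes the transfer */
    uint32_t last_xfer_size;  /* bytes actually received */
} mock_pipe_t;

/* Stands in for USBH_IsocReceiveData: only arms the transfer and returns. */
static void mock_submit(mock_pipe_t *p, uint8_t *buf, uint32_t len)
{
    p->target = buf;
    p->length = len;
    p->state = MOCK_URB_IDLE;
    p->last_xfer_size = 0;
}

/* Stands in for the controller finishing the transfer some time later. */
static void mock_hw_complete(mock_pipe_t *p, const uint8_t *data, uint32_t len)
{
    uint32_t n = (len < p->length) ? len : p->length;
    memcpy(p->target, data, n);
    p->last_xfer_size = n;
    p->state = MOCK_URB_DONE;
}

/* Called once per SOF: consume a finished packet (if any) and re-arm.
 * Returns the number of bytes copied into out, or 0 if nothing was ready. */
static uint32_t mock_poll(mock_pipe_t *p, uint8_t *buf, uint32_t len, uint8_t *out)
{
    uint32_t got = 0;
    if (p->state == MOCK_URB_DONE) {
        got = p->last_xfer_size;
        memcpy(out, buf, got);      /* copy out before the buffer is reused */
        mock_submit(p, buf, len);   /* re-arm for the next frame */
    }
    return got;
}
```

In the real class driver the equivalent of `mock_poll` would check the URB state for `AUDIO_Handle->microphone.Pipe` before touching `microphone.buf`, and would copy the data out (or switch buffers) before submitting the next receive.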

One point of confusion: I have used USBPcap to look at the packets in Wireshark and, at least from what Wireshark is showing, each frame actually contains 10 packets, whereas in my case each frame is a single packet. However, according to ChatGPT, `wMaxPacketSize` is supposed to show the maximum size of a frame, and if each frame contained ten 96-byte packets it should be 960 bytes, yet it still shows just 100 bytes, which is only one packet. This also matches the behaviour I see: if I raise the buffer to 960 bytes, it still only fills with a single packet's worth of bytes, not ten. So I'm a little lost as to what to believe, since I can only receive one packet, but I know my code is broken, and if Wireshark is showing ten packets per frame, maybe that is correct?
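On the `wMaxPacketSize` question, the USB 2.0 endpoint descriptor defines bits 10..0 as the maximum payload of a single packet and bits 12..11 as the number of additional transactions per microframe (high-speed, high-bandwidth endpoints only). A value of 100 therefore describes one packet of at most 100 bytes, not a group of ten. A likely reading of the capture, stated here as an assumption, is that the Windows driver submits one URB covering ten 1 ms isochronous packets and USBPcap shows that URB as a single Wireshark frame. A minimal decoding sketch, with hypothetical helper names:

```c
#include <stdint.h>

/* Hypothetical helpers for decoding wMaxPacketSize from a USB 2.0
 * endpoint descriptor; not part of the ST library. */

/* Bits 10..0: maximum payload of one packet, in bytes. */
static uint16_t ep_max_payload(uint16_t wMaxPacketSize)
{
    return (uint16_t)(wMaxPacketSize & 0x07FFU);
}

/* Bits 12..11: additional transactions per microframe (0, 1 or 2), only
 * meaningful for high-speed isochronous/interrupt endpoints.
 * Returns the total number of transactions per (micro)frame. */
static uint8_t ep_transactions(uint16_t wMaxPacketSize)
{
    return (uint8_t)(((wMaxPacketSize >> 11) & 0x3U) + 1U);
}
```

For the microphone described here, `ep_max_payload(100)` is 100 and `ep_transactions(100)` is 1, which is consistent with receiving one 96-byte packet per 1 ms frame.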

I wish I could be a little more helpful in providing info but I have just been desperately trying anything and so it's hard to keep track of what I have tested.

If anyone has any idea how I can receive these PCM packets correctly that would be massively appreciated since I'm at a total loss!

AScha.3 · Answer

About the Audio Host I can only say this: I tried it with the STM "examples", which are more or less just ONE example in several copies, and that one is only for connecting a headphone set and nothing else. So maybe you can use it to get a microphone signal in, but that's all, if it works at all. I didn't bother any further with that useless "Audio Host" and tried Azure RTOS instead, because there the Audio Host can enumerate a standard USB sound card and it really works. But then I gave up on that too, because for hi-end audio it seems useless: only at 48 kHz is a correct stream possible, but never at 44.1 kHz, because no packet can hold 44.1 samples, only 44, so even this basic requirement cannot work on the USB bus. I just don't know how they do it with the hi-end audio DACs; maybe they connect in bulk mode, give the DAC some buffer memory, and do the correct timing there.
