MIPI Camera GPU Acceleration on STM32MP257F-DK – High CPU Load, GPU Mostly Idle
Hi everyone,
I’m currently working on a camera preview pipeline on the STM32MP257F-DK using the Sony IMX335 (MIPI CSI-2) sensor.
One important point is that STM32MP2 uses a media-controller-based camera architecture (DCMIPP + CSI subdevices). Because of this, Qt Multimedia cannot directly detect the MIPI camera as a standard /dev/videoX capture device.
For that reason, I am using libcamera as the capture backend. It correctly handles the media graph configuration internally and exposes a usable video stream to GStreamer, which makes camera streaming stable and reliable.
The functional pipeline is working. However, the goal is to display the live feed inside a Qt6 QML application with proper GPU acceleration, and this is where I am facing a major performance bottleneck: CPU usage is very high during preview, while the Vivante GPU remains mostly idle.
I’d really appreciate guidance from anyone who has implemented a zero-copy GPU camera pipeline on STM32MP2.
Platform Overview
Board: STM32MP257F-DK
SoC: STM32MP257 (Cortex-A35 + Vivante GC7000L GPU)
Camera: Sony IMX335 (5MP, MIPI CSI-2)
ISP: DCMIPP
OS: OpenSTLinux 6.0 (Scarthgap)
Kernel: 6.6.48
Qt: 6.6.3 (QML / QtMultimedia)
GStreamer: 1.22.12
libcamera: 0.3.0
Display: Wayland + EGL
What I’m Trying to Achieve
My goal is simple:
Show live camera preview in Qt6 QML with the GPU doing the heavy work — not the CPU.
Ideally:
No CPU pixel format conversion
No memcpy per frame
No CPU → GPU texture upload copies
DMA-BUF zero-copy from camera to GPU
What Is Currently Working
Using libcamera, the following pipeline works:
libcamerasrc
→ video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1
→ videoconvert
→ video/x-raw,format=BGRA
→ appsink
→ QVideoSink
→ QML VideoOutput
Preview is stable at 25 fps, 1280x720.
So functionally everything is correct.
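For reference, the same capture path can be reproduced outside the Qt application with gst-launch-1.0. This is a sketch: waylandsink stands in for the appsink → QVideoSink → VideoOutput stage, which cannot be expressed on the command line, but it exercises the same libcamerasrc + videoconvert path:

```shell
# Standalone reproduction of the working (CPU-bound) pipeline.
# waylandsink replaces appsink -> QVideoSink for command-line testing.
gst-launch-1.0 libcamerasrc \
  ! video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1 \
  ! videoconvert \
  ! video/x-raw,format=BGRA \
  ! waylandsink
```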
The Real Problem
CPU usage is between 60% and 75%, just for preview.
At the same time:
GPU usage is around 5% (essentially idle)
This clearly means the camera path is CPU-bound.
After profiling, I see:
RGB16 → BGRA conversion (videoconvert) consumes significant CPU
In appsink, frames are copied multiple times
Qt uploads texture from CPU to GPU
Around 250+ MB/s of memory bandwidth is being used for nothing but copying pixels
So even though we have a GPU (Vivante GC7000L), almost the entire pipeline is CPU-based.
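The bandwidth figure is easy to sanity-check. Assuming one read of the RGB16 buffer plus one write and two further touches of the 32-bit buffer (the videoconvert write, the appsink copy, and the texture upload; the exact copy count is my assumption), the arithmetic comes out in the same range as the observed 250+ MB/s:

```shell
# Back-of-envelope memory traffic at 1280x720 @ 25 fps.
W=1280; H=720; FPS=25
RGB16=$((W * H * 2 * FPS))        # one read of the 16-bit camera buffer
BGRA=$((W * H * 4 * FPS))         # each touch of a 32-bit buffer
TOTAL=$((RGB16 + 3 * BGRA))       # convert write + appsink copy + texture upload
echo "$((TOTAL / 1000000)) MB/s"  # prints "322 MB/s"
```

With fewer copies the number drops toward 250 MB/s; either way it is pure overhead that a DMA-BUF path would eliminate.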
Why This Feels Wrong
The hardware clearly supports:
DCMIPP ISP
DMA
Vivante GPU with EGL
Wayland + OpenGL ES
Architecturally, this should be possible:
IMX335
→ DCMIPP ISP
→ libcamerasrc (DMA-BUF)
→ glupload (EGL import)
→ glcolorconvert (shader)
→ qmlglsink
→ QML scene graph
This would keep frames on the GPU from capture to display.
But currently I cannot reach this architecture.
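As a standalone check of the GL elements, independent of Qt, the GPU path can be sketched with glimagesink in place of the QML sink. The caps here are assumptions carried over from the working pipeline; whether glupload actually imports via DMA-BUF (rather than falling back to a raw upload) depends on what libcamerasrc offers:

```shell
# GPU-side conversion sketch: glupload imports the buffer (zero-copy only
# if DMA-BUF caps are negotiated), glcolorconvert runs in a shader.
gst-launch-1.0 libcamerasrc \
  ! video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1 \
  ! glupload ! glcolorconvert ! glimagesink
```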
Main Blockers
1) qmlglsink Not Available
The correct solution seems to be:
libcamerasrc → glupload → glcolorconvert → qmlglsink
However:
gst-inspect-1.0 qmlglsink
→ No such element
It seems the Qt GStreamer plugin is not packaged in OpenSTLinux 6.0. (Note that the Qt6 element is named qml6glsink and, as of GStreamer 1.22, ships in the qt6 plugin of gst-plugins-good rather than gst-plugins-bad.)
Is there an official ST package or Yocto recipe for this?
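I have not found one myself. Since GStreamer 1.22 already contains the qml6 plugin sources in gst-plugins-good, one route may be enabling the corresponding PACKAGECONFIG in the image. The fragment below is a hypothetical local.conf sketch; the PACKAGECONFIG and package names must be verified against the gstreamer1.0-plugins-good recipe actually present in the BSP layers:

```conf
# Hypothetical local.conf fragment -- verify the PACKAGECONFIG/package
# names against the gstreamer1.0-plugins-good recipe in your layers.
PACKAGECONFIG:append:pn-gstreamer1.0-plugins-good = " qt6"
IMAGE_INSTALL:append = " gstreamer1.0-plugins-good-qt6"
```

This also pulls in a Qt6 build dependency, so meta-qt6 must already be in the build.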
2) DMA-BUF Heap Not Enabled
There is no /dev/dma_heap/ directory, and the kernel config options appear to be missing:
CONFIG_DMABUF_HEAPS
CONFIG_DMABUF_HEAPS_SYSTEM
CONFIG_DMABUF_HEAPS_CMA
Without DMA-BUF heap support, true zero-copy EGL import may not be possible.
Is this intentionally disabled in STM32MP2 BSP?
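If it is merely disabled rather than unsupported, re-enabling it should be a kernel config fragment (the same three options listed above), applied via the kernel recipe's SRC_URI or menuconfig:

```conf
# dmabuf-heaps.cfg -- kernel config fragment to expose /dev/dma_heap/*
CONFIG_DMABUF_HEAPS=y
CONFIG_DMABUF_HEAPS_SYSTEM=y
CONFIG_DMABUF_HEAPS_CMA=y
```

After rebooting, `ls /dev/dma_heap/` should show at least a `system` heap, plus a CMA heap if a default CMA region is configured. Whether the DCMIPP/libcamera stack then exports its buffers as DMA-BUFs is a separate question I cannot confirm.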
3) Qt6 Removed RGB565 Support
libcamera outputs RGB16 (RGB565) efficiently.
But Qt6 QVideoFrameFormat does not support RGB565 anymore.
So I am forced to convert to 32-bit (BGRA/RGBx) before sending to QVideoSink.
That conversion alone costs a lot of CPU.
Is there a recommended Qt6-based approach on STM32MP2 to avoid this conversion?
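Two things seem worth testing here. First, if the sink side is GL-based (qml6glsink or any glimagesink-family element), RGB565 stays a GPU-side problem handled by glcolorconvert, and QVideoFrameFormat never sees it. Second, it may be that the DCMIPP pixel packer can emit a 32-bit format directly, which would remove videoconvert entirely; that is an assumption to check against the caps libcamerasrc actually advertises:

```shell
# Check which formats libcamerasrc can negotiate before assuming RGB16 is
# the only efficient output.
gst-inspect-1.0 libcamerasrc | sed -n '/Pad Templates/,/^$/p'

# If a 32-bit format is offered natively, try it end to end:
gst-launch-1.0 libcamerasrc \
  ! video/x-raw,format=BGRx,width=1280,height=720 ! waylandsink
```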
My Question to the Community
Has anyone successfully implemented:
libcamera
Qt6 QML
GPU-accelerated preview
Zero-copy DMA-BUF path
on STM32MP257 or STM32MP2 family?
