MIPI Camera GPU Acceleration on STM32MP257F-DK – High CPU Load, GPU Mostly Idle
Hi everyone,
I’m currently working on a camera preview pipeline on the STM32MP257F-DK using the Sony IMX335 (MIPI CSI-2) sensor.
One important point is that STM32MP2 uses a media-controller-based camera architecture (DCMIPP + CSI subdevices). Because of this, Qt Multimedia cannot directly detect the MIPI camera as a standard /dev/videoX capture device.
For that reason, I am using libcamera as the capture backend. It correctly handles the media graph configuration internally and exposes a usable video stream to GStreamer, which makes camera streaming stable and reliable.
The functional pipeline is working. However, the goal is to display the live feed inside a Qt6 QML application with proper GPU acceleration, and this is where I am facing a major performance bottleneck: CPU usage is very high during preview, while the Vivante GPU remains mostly idle.
I’d really appreciate guidance from anyone who has implemented a zero-copy GPU camera pipeline on STM32MP2.
Platform Overview
Board: STM32MP257F-DK
SoC: STM32MP257 (Cortex-A35 + Vivante GC7000L GPU)
Camera: Sony IMX335 (5MP, MIPI CSI-2)
ISP: DCMIPP
OS: OpenSTLinux 6.0 (Scarthgap)
Kernel: 6.6.48
Qt: 6.6.3 (QML / QtMultimedia)
GStreamer: 1.22.12
libcamera: 0.3.0
Display: Wayland + EGL
What I’m Trying to Achieve
My goal is simple:
Show live camera preview in Qt6 QML with the GPU doing the heavy work — not the CPU.
Ideally:
No CPU pixel format conversion
No memcpy per frame
No CPU → GPU texture upload copies
DMA-BUF zero-copy from camera to GPU
What Is Currently Working
Using libcamera, the following pipeline works:
libcamerasrc
→ video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1
→ videoconvert
→ video/x-raw,format=BGRA
→ appsink
→ QVideoSink
→ QML VideoOutput
Preview is stable at 25 fps, 1280x720.
So functionally everything is correct.
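For reference, the same capture path can be reproduced outside the Qt application with gst-launch-1.0. This is a sketch: waylandsink stands in for the appsink → QVideoSink → VideoOutput stage, which cannot be expressed on the command line, but it exercises the same libcamerasrc + videoconvert path:

```shell
# Standalone reproduction of the working (CPU-bound) pipeline.
# waylandsink replaces appsink -> QVideoSink for command-line testing.
gst-launch-1.0 libcamerasrc \
  ! video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1 \
  ! videoconvert \
  ! video/x-raw,format=BGRA \
  ! waylandsink
```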
The Real Problem
CPU usage is between 60% and 75%, just for preview.
At the same time:
GPU usage is around 5% (essentially idle)
This clearly means the camera path is CPU-bound.
After profiling, I see:
RGB16 → BGRA conversion (videoconvert) consumes significant CPU
In appsink, frames are copied multiple times
Qt uploads texture from CPU to GPU
Around 250+ MB/s of memory bandwidth is being used for nothing but copying pixels
So even though we have a GPU (Vivante GC7000L), almost the entire pipeline is CPU-based.
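The bandwidth figure is easy to sanity-check. Assuming one read of the RGB16 buffer plus one write and two further touches of the 32-bit buffer (the videoconvert write, the appsink copy, and the texture upload; the exact copy count is my assumption), the arithmetic comes out in the same range as the observed 250+ MB/s:

```shell
# Back-of-envelope memory traffic at 1280x720 @ 25 fps.
W=1280; H=720; FPS=25
RGB16=$((W * H * 2 * FPS))        # one read of the 16-bit camera buffer
BGRA=$((W * H * 4 * FPS))         # each touch of a 32-bit buffer
TOTAL=$((RGB16 + 3 * BGRA))       # convert write + appsink copy + texture upload
echo "$((TOTAL / 1000000)) MB/s"  # prints "322 MB/s"
```

With fewer copies the number drops toward 250 MB/s; either way it is pure overhead that a DMA-BUF path would eliminate.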
Why This Feels Wrong
The hardware clearly supports:
DCMIPP ISP
DMA
Vivante GPU with EGL
Wayland + OpenGL ES
Architecturally, this should be possible:
IMX335
→ DCMIPP ISP
→ libcamerasrc (DMA-BUF)
→ glupload (EGL import)
→ glcolorconvert (shader)
→ qmlglsink
→ QML scene graph
This would keep frames on the GPU from capture to display.
But currently I cannot reach this architecture.
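As a standalone check of the GL elements, independent of Qt, the GPU path can be sketched with glimagesink in place of the QML sink. The caps here are assumptions carried over from the working pipeline; whether glupload actually imports via DMA-BUF (rather than falling back to a raw upload) depends on what libcamerasrc offers:

```shell
# GPU-side conversion sketch: glupload imports the buffer (zero-copy only
# if DMA-BUF caps are negotiated), glcolorconvert runs in a shader.
gst-launch-1.0 libcamerasrc \
  ! video/x-raw,format=RGB16,width=1280,height=720,framerate=25/1 \
  ! glupload ! glcolorconvert ! glimagesink
```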
Main Blockers
1) qmlglsink Not Available
The correct solution seems to be:
libcamerasrc → glupload → glcolorconvert → qmlglsink
However:
gst-inspect-1.0 qmlglsink
→ No such element
It seems the Qt GStreamer plugin is not packaged in OpenSTLinux 6.0. (Note that the Qt6 element is named qml6glsink and, as of GStreamer 1.22, ships in the qt6 plugin of gst-plugins-good rather than gst-plugins-bad.)
Is there an official ST package or Yocto recipe for this?
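I have not found one myself. Since GStreamer 1.22 already contains the qml6 plugin sources in gst-plugins-good, one route may be enabling the corresponding PACKAGECONFIG in the image. The fragment below is a hypothetical local.conf sketch; the PACKAGECONFIG and package names must be verified against the gstreamer1.0-plugins-good recipe actually present in the BSP layers:

```conf
# Hypothetical local.conf fragment -- verify the PACKAGECONFIG/package
# names against the gstreamer1.0-plugins-good recipe in your layers.
PACKAGECONFIG:append:pn-gstreamer1.0-plugins-good = " qt6"
IMAGE_INSTALL:append = " gstreamer1.0-plugins-good-qt6"
```

This also pulls in a Qt6 build dependency, so meta-qt6 must already be in the build.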
2) DMA-BUF Heap Not Enabled
There is no /dev/dma_heap/ directory, and the kernel config options appear to be missing:
CONFIG_DMABUF_HEAPS
CONFIG_DMABUF_HEAPS_SYSTEM
CONFIG_DMABUF_HEAPS_CMA
Without DMA-BUF heap support, true zero-copy EGL import may not be possible.
Is this intentionally disabled in STM32MP2 BSP?
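If it is merely disabled rather than unsupported, re-enabling it should be a kernel config fragment (the same three options listed above), applied via the kernel recipe's SRC_URI or menuconfig:

```conf
# dmabuf-heaps.cfg -- kernel config fragment to expose /dev/dma_heap/*
CONFIG_DMABUF_HEAPS=y
CONFIG_DMABUF_HEAPS_SYSTEM=y
CONFIG_DMABUF_HEAPS_CMA=y
```

After rebooting, `ls /dev/dma_heap/` should show at least a `system` heap, plus a CMA heap if a default CMA region is configured. Whether the DCMIPP/libcamera stack then exports its buffers as DMA-BUFs is a separate question I cannot confirm.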
3) Qt6 Removed RGB565 Support
libcamera outputs RGB16 (RGB565) efficiently.
But Qt6 QVideoFrameFormat does not support RGB565 anymore.
So I am forced to convert to 32-bit (BGRA/RGBx) before sending to QVideoSink.
That conversion alone costs a lot of CPU.
Is there a recommended Qt6-based approach on STM32MP2 to avoid this conversion?
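Two things seem worth testing here. First, if the sink side is GL-based (qml6glsink or any glimagesink-family element), RGB565 stays a GPU-side problem handled by glcolorconvert, and QVideoFrameFormat never sees it. Second, it may be that the DCMIPP pixel packer can emit a 32-bit format directly, which would remove videoconvert entirely; that is an assumption to check against the caps libcamerasrc actually advertises:

```shell
# Check which formats libcamerasrc can negotiate before assuming RGB16 is
# the only efficient output.
gst-inspect-1.0 libcamerasrc | sed -n '/Pad Templates/,/^$/p'

# If a 32-bit format is offered natively, try it end to end:
gst-launch-1.0 libcamerasrc \
  ! video/x-raw,format=BGRx,width=1280,height=720 ! waylandsink
```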
My Question to the Community
Has anyone successfully implemented:
libcamera
Qt6 QML
GPU-accelerated preview
Zero-copy DMA-BUF path
on STM32MP257 or STM32MP2 family?
