ramkumarkoppu
Associate III
April 25, 2025
Question

Quantized Gemma Model Inference on STM32MP257F-DK Board


Hi,

Could you share documentation or examples for running quantized foundational models (e.g. Google Gemma) on the STM32MP257F-DK, first in Python, then in C/C++ using the STM32MP2 NPU? Specifically:

  • Does the STM32MP2 NPU support transformer-based architectures, or is it limited to CNNs (like the STM32N6)?

  • Which inference frameworks are supported for GenAI on this platform? Has ST ported llama.cpp to this NPU?

Sorry, I couldn't find the required info on the STM32 MPU wiki pages.

Thanks!

3 replies

Visitor II
November 7, 2025

Hello

We are currently evaluating hardware options and have the same question. Can somebody from ST answer it here?


Thank you and best regards
Jan

Associate III
January 6, 2026

Hello,

I have the same question. Is it possible to run LLMs on the STM32MP2 series?

Additionally, what is the expected performance/inference efficiency?

Thanks!

Technical Moderator
January 8, 2026

Hello, 

The NPU architecture of the STM32MP2 series does not support transformer-based models.
LLMs can, however, be run on the CPU.

The frameworks supported by the X-LINUX-AI expansion package are listed on this wiki page:
https://wiki.st.com/stm32mpu/wiki/Category:X-LINUX-AI_expansion_package
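Since inference runs on the CPU, memory footprint and DRAM bandwidth are the main limits on throughput. A rough back-of-envelope sketch (the parameter count, quantization width, and bandwidth figures are illustrative assumptions, not ST benchmarks or measured STM32MP2 numbers):

```python
# Back-of-envelope sizing for CPU-only LLM inference. All figures below
# (parameter count, quantization width, DRAM bandwidth) are illustrative
# assumptions, not measured STM32MP2 numbers.

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of a quantized weight file, ignoring metadata."""
    return n_params * bits_per_weight / 8

def decode_tokens_per_s(weights: float, dram_bytes_per_s: float) -> float:
    """Memory-bound decode: each generated token streams all weights from DRAM,
    so throughput is capped at bandwidth / weight size."""
    return dram_bytes_per_s / weights

# A 2B-parameter model (Gemma-class) quantized to 4 bits per weight:
gemma_2b_q4 = weight_bytes(2e9, 4)
print(f"weights: {gemma_2b_q4 / 1e9:.1f} GB")        # weights: 1.0 GB

# Assuming ~4 GB/s of effective DRAM bandwidth (hypothetical figure):
print(f"ceiling: {decode_tokens_per_s(gemma_2b_q4, 4e9):.1f} tok/s")  # ceiling: 4.0 tok/s
```

Real throughput will be lower once KV-cache traffic, CPU compute, and thermals are accounted for; the second number is only a memory-bandwidth ceiling.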

BR


In order to give better visibility on the answered topics, please click on 'Accept as Solution' on the reply which solved your issue or answered your question.
PatrickF
Technical Moderator
January 8, 2026

Here is an example of running an LLM locally on the STM32MP257F-EV1 (as said, using the CPU only):
https://www.linkedin.com/posts/danilopietropau_another-great-example-of-llm-on-stm32mp2-activity-7293222309333495809-Qe0d

Regards.
