Edge AI · July 14, 2025 · 14 min read

TinyML Solutions on STM32: Deploying Machine Learning Models on Microcontrollers

Learn how a TinyML solutions company deploys TensorFlow Lite models on STM32 MCUs using X-CUBE-AI. From model training to on-device inference for embedded AI applications.

TinyML on STM32 microcontrollers enables on-device machine learning inference for applications like keyword spotting, gesture recognition, anomaly detection, and predictive maintenance without cloud connectivity. STMicroelectronics provides X-CUBE-AI, a comprehensive tool integrated into STM32CubeMX that converts pre-trained neural networks from TensorFlow, Keras, PyTorch (via ONNX), and scikit-learn into optimized C code targeting Cortex-M processors. The workflow involves training a model on a PC or cloud platform, converting it to TensorFlow Lite format, quantizing from FP32 to INT8 to reduce size by 4x, and using X-CUBE-AI to generate a memory-mapped inference engine that leverages CMSIS-NN optimized kernels. A typical keyword spotting model (with MFCC preprocessing) fits within 50 KB of flash and 20 KB of RAM on an STM32L4, achieving 10 ms inference latency while consuming under 5 mW. This makes STM32-based TinyML viable for battery-powered consumer and industrial IoT products.

How Do You Train a Model for STM32 Deployment?

The training pipeline for TinyML differs from standard ML in several ways. First, the model architecture must be constrained to fit within MCU resources. Common architectures include MobileNetV2 (for vision), DS-CNN (for audio), and simple dense/convolutional networks for sensor-data classification. Second, the input preprocessing must be implementable on the MCU: for audio, this means MFCC or mel-spectrogram computation must run in real time. Edge Impulse provides an end-to-end platform that handles data collection, model training, and STM32 deployment, but you can also use a manual workflow with TensorFlow/Keras and STM32Cube.AI, as shown below.

# Training a keyword spotting model for STM32
import tensorflow as tf
from tensorflow.keras import layers, models

# Model architecture for keyword spotting
model = models.Sequential([
    layers.Input(shape=(49, 10, 1)),  # MFCC features
    layers.Conv2D(8, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (3, 3), activation='relu', padding='same'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(32, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(4, activation='softmax')  # 4 keywords
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model (X_train: MFCC feature windows of shape (N, 49, 10, 1),
# y_train: one-hot keyword labels; data loading is omitted here)
model.fit(X_train, y_train, epochs=50, validation_split=0.2)

# Convert to TFLite with INT8 quantization
def representative_dataset():
    for sample in X_train[:100]:
        yield [sample.reshape(1, 49, 10, 1).astype('float32')]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open('keyword_model.tflite', 'wb') as f:
    f.write(tflite_model)
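
Before handing the model to X-CUBE-AI, it is worth sanity-checking the quantized file on the desktop with the TFLite interpreter. Below is a minimal sketch reusing X_train from the training snippet above; the manual quantize/dequantize steps mirror what the on-device runtime does:

# Sanity-check the INT8 model before generating C code
import numpy as np

interpreter = tf.lite.Interpreter(model_path='keyword_model.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize one float sample using the scale/zero-point chosen by the converter
scale, zero_point = inp['quantization']
sample = X_train[0].reshape(1, 49, 10, 1).astype('float32')
q_sample = np.clip(np.round(sample / scale) + zero_point, -128, 127).astype(np.int8)

interpreter.set_tensor(inp['index'], q_sample)
interpreter.invoke()

# Dequantize the INT8 output back to probabilities for inspection
o_scale, o_zero = out['quantization']
probs = (interpreter.get_tensor(out['index']).astype('float32') - o_zero) * o_scale
print('class:', probs.argmax(), 'confidence:', probs.max())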

How Does X-CUBE-AI Optimize Models for STM32?

X-CUBE-AI performs several critical optimizations when converting models for STM32. It fuses consecutive layers (such as Conv2D + BatchNorm + ReLU) into single operations, reducing memory traffic and computation. It maps operations to CMSIS-NN optimized kernels that leverage Cortex-M DSP instructions (SIMD) for 2-4x speedup over naive implementations. The tool provides detailed memory and performance reports, showing per-layer flash usage, RAM requirements (for activation buffers), and estimated inference time. X-CUBE-AI also supports model compression through weight sharing and pruning, further reducing model size.
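
The batch-norm case makes layer fusion concrete. Here is a minimal numpy sketch of folding BatchNorm parameters into the preceding convolution's weights; it illustrates the arithmetic involved conceptually, not X-CUBE-AI's internal implementation:

import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-3):
    """Fold a BatchNorm layer into the preceding conv's weights and bias.

    w: kernels of shape (kh, kw, c_in, c_out); b, gamma, beta, mean, var: (c_out,).
    Afterwards, conv(x, w_f) + b_f equals batchnorm(conv(x, w) + b),
    so the BatchNorm layer disappears at inference time.
    """
    scale = gamma / np.sqrt(var + eps)   # per-output-channel scale factor
    w_folded = w * scale                 # broadcasts across the last axis
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded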

What STM32 Boards Are Best for TinyML?

Recommended STM32 boards for TinyML development:

  • STM32L4 (Cortex-M4, 1 MB flash, 320 KB SRAM): Best balance of performance and power for sensor-based ML applications.
  • STM32H7 (Cortex-M7, 2 MB flash, 1 MB SRAM): Highest performance for vision-based ML; supports larger models.
  • STM32U5 (Cortex-M33 with TrustZone, ultra-low power): Ideal for secure, battery-powered ML applications.
  • STM32N6 (Cortex-M55 with Helium MVE + NPU): Next-generation MCU with dedicated neural processing, 5-10x faster ML inference.
  • B-U585I-IOT02A Discovery Kit: Built-in sensors (accelerometer, microphone, temperature) make it ideal for TinyML prototyping.

What Are Real-World TinyML Use Cases on STM32?

Production TinyML deployments on STM32 span diverse industries. In consumer electronics, always-on keyword detection (like "Hey Siri" functionality) runs on low-power Cortex-M4 MCUs consuming under 1 mW in listening mode. Industrial applications use vibration-based anomaly detection on rotating machinery, where an accelerometer streams data through an FFT and a CNN classifier detects bearing failures weeks before they occur. Smart agriculture devices use TinyML to classify pest sounds or analyze soil sensor data locally, transmitting only actionable alerts over LoRaWAN. Wearable health monitors run heart rhythm analysis on-device to detect atrial fibrillation, using convolutional neural networks processing single-lead ECG data in real-time.
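
To make the vibration pipeline concrete, here is a minimal sketch of windowed FFT feature extraction feeding such a classifier; the 256-sample window and log compression are illustrative choices, not values from a specific deployment:

import numpy as np

def vibration_features(samples, window=256):
    """Turn a raw accelerometer stream into FFT-magnitude windows for a CNN.

    samples: 1-D array of accelerometer readings.
    Returns an array of shape (n_windows, window // 2).
    """
    n = len(samples) // window
    frames = samples[:n * window].reshape(n, window)
    frames = frames - frames.mean(axis=1, keepdims=True)  # remove DC offset
    spectra = np.abs(np.fft.rfft(frames, axis=1))[:, 1:]  # drop the DC bin
    return np.log1p(spectra)                              # compress dynamic range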

Key takeaway: Deploy TinyML models on STM32 microcontrollers using X-CUBE-AI, which converts TensorFlow, Keras, and ONNX models into CMSIS-NN-optimized C code. INT8 quantization reduces model size by 4x at a typical cost of only 1-3% accuracy, enabling keyword spotting in 50 KB of flash and anomaly detection in under 10 KB on Cortex-M4 MCUs.

How Did We Build a Production TinyML Anomaly Detector?

In a recent EmbedCrest project, we developed a TinyML-based motor current anomaly detector for a food processing plant with 120 electric motors driving conveyor belts and mixers. We chose the STM32L4R5 (Cortex-M4, 2 MB flash, 640 KB SRAM) paired with an ACS712 current sensor sampling at 4 kHz. The ML model was a lightweight autoencoder with 3 encoder layers (128-64-16 neurons) and 3 decoder layers, trained on 2 weeks of healthy operation data collected at 1-second intervals. After INT8 quantization through X-CUBE-AI, the model occupied 12.4 KB flash and 3.2 KB activation RAM, completing inference in 1.8 ms at 80 MHz. The autoencoder reconstruction error served as the anomaly score: values exceeding a threshold calibrated during commissioning triggered alerts via RS-485 Modbus to the plant's SCADA system. Over 8 months of operation, the system detected 14 genuine anomalies (including 3 developing winding insulation failures) with zero false positives after threshold tuning.
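
The commissioning-time calibration is straightforward to sketch. The following is an illustrative version of such a reconstruction-error threshold; the 3-sigma margin is an assumption for illustration, not the project's exact tuning:

import numpy as np

def calibrate_threshold(autoencoder, healthy_windows, k=3.0):
    """Derive an anomaly threshold from reconstruction error on healthy data.

    healthy_windows: feature windows recorded during commissioning.
    k: how many standard deviations above the mean error to alert at.
    """
    recon = autoencoder.predict(healthy_windows)
    errors = np.mean((healthy_windows - recon) ** 2, axis=1)
    return float(errors.mean() + k * errors.std())

# At runtime: a per-window error above the threshold raises a Modbus alert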

What Are the Most Common TinyML Deployment Mistakes?

The most frequent TinyML deployment mistake is training on desktop-collected data that does not match real-world sensor characteristics. ADC quantization noise, sensor drift over temperature, and mounting-dependent vibration coupling all create distributional shift between lab training data and field inference data. Always collect training data from the actual deployment hardware in the target environment. Second, developers often skip quantization-aware training (QAT) and rely solely on post-training quantization (PTQ). While PTQ is faster, QAT typically recovers 1-3% accuracy by simulating quantization effects during training backpropagation. Third, forgetting to account for preprocessing latency is common: if your MFCC computation takes 15 ms and your CNN inference takes 5 ms, your total latency is 20 ms, not 5 ms. Fourth, not implementing a confidence threshold leads to spurious predictions. Always check model output confidence and report "unknown" when the maximum softmax probability falls below 0.7-0.8, rather than forcing a classification on out-of-distribution inputs.
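
Quantization-aware training is a small addition to the Keras workflow shown earlier. A minimal sketch using the TensorFlow Model Optimization Toolkit (the five fine-tuning epochs are an illustrative choice):

import tensorflow_model_optimization as tfmot

# Wrap the trained float model with fake-quantization nodes
q_aware_model = tfmot.quantization.keras.quantize_model(model)
q_aware_model.compile(optimizer='adam',
                      loss='categorical_crossentropy',
                      metrics=['accuracy'])

# Fine-tune so the weights adapt to INT8 rounding before conversion
q_aware_model.fit(X_train, y_train, epochs=5, validation_split=0.2)

# Then convert with the same TFLiteConverter settings as in the training section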

How Do Different STM32 Families Compare for TinyML Performance?

TinyML inference speed scales predictably across STM32 families due to CMSIS-NN kernel optimizations. On a keyword spotting DS-CNN model (49x10 MFCC input, 4 classes, 20 KB INT8): the STM32L0 (Cortex-M0+ at 32 MHz) completes inference in 180 ms, too slow for real-time audio. The STM32L4 (Cortex-M4 at 80 MHz with DSP/SIMD) completes the same model in 11 ms, suitable for always-on keyword detection. The STM32H7 (Cortex-M7 at 480 MHz with double-precision FPU and L1 cache) achieves 2.1 ms, enabling more complex models or higher throughput. The new STM32N6 (Cortex-M55 with Helium MVE and a dedicated NPU) provides a 5-10x speedup over Cortex-M7 for INT8 convolutions, bringing MobileNetV2-based image classification (224x224 input) down to under 50 ms on an MCU. For most sensor-data TinyML applications (vibration, current, temperature time-series), the STM32L4 provides the optimal cost-performance balance at $3-5 per unit in volume.

TinyML · STM32 · TensorFlow Lite · Machine Learning · X-CUBE-AI

Rajdatt

Lead Embedded Systems Engineer at EmbedCrest Technology

Delivering enterprise-grade embedded systems, IoT, and Edge AI engineering solutions.

FAQ

Frequently Asked Questions

How much flash memory does a TinyML model require on STM32?

A quantized INT8 model for keyword spotting typically uses 20-50 KB of flash. Vision models like MobileNetV2 require 300-500 KB quantized. Simple anomaly detection models can be as small as 5-10 KB.

Can I use PyTorch models with X-CUBE-AI?

Yes, by first exporting the PyTorch model to ONNX format using torch.onnx.export(), then importing the ONNX model into X-CUBE-AI. Ensure all operations in your model are supported by the ONNX opset and X-CUBE-AI.
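
A minimal export sketch, where net and the dummy input shape are placeholders for your own model and feature layout:

import torch

net.eval()  # your trained PyTorch model (placeholder)
dummy_input = torch.randn(1, 1, 49, 10)  # batch, channels, height, width

torch.onnx.export(net, dummy_input, 'keyword_model.onnx',
                  opset_version=13,
                  input_names=['input'],
                  output_names=['output'])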

What is the accuracy loss from INT8 quantization?

Typical accuracy loss from FP32 to INT8 quantization is 1-3% for well-designed models. Using quantization-aware training (QAT) instead of post-training quantization can reduce this loss to under 1%.

Does STM32 support real-time audio ML inference?

Yes, a Cortex-M4 at 80 MHz can compute MFCC features and run a keyword spotting model in under 30 ms, well within the 100 ms budget for real-time audio. The STM32H7 at 480 MHz enables more complex audio models.

Ready to Build Your Embedded Solution?

From Edge AI to industrial IoT, our engineering team delivers end-to-end embedded systems solutions. Let's discuss your project requirements.

Get in Touch