NPUs in embedded SoCs: edge AI without sending everything to the cloud

Introduction: the evolution of embedded computing

In recent years, embedded systems have undergone a profound transformation. Once designed exclusively for specific, inflexible tasks, they are evolving into platforms capable of hosting complex, intelligent applications. Growing demand for local processing, especially for artificial intelligence, has made it clear that CPUs and GPUs alone are no longer sufficient. The introduction of Neural Processing Units (NPUs) in Systems-on-Chip (SoCs) therefore represents a technological turning point: these are hardware accelerators dedicated to neural computing, designed to run deep neural networks in real time directly on the device, without constant reliance on the cloud.

Architecture and operating principles of NPUs

An NPU stands out from other computing units because it is built to perform the operations typical of deep learning, such as matrix multiplications and convolutions, with very high efficiency. Unlike CPUs, which are designed for general-purpose, largely sequential workloads, and GPUs, which are optimized for graphics-oriented parallelism, NPUs use architectures such as systolic arrays, which process large amounts of data in parallel at low power. The heart of this approach is optimization for low numerical precision: working with formats like INT8 or FP16 instead of traditional 32-bit arithmetic accelerates operations while maintaining sufficient accuracy for machine learning applications. Techniques such as quantization and pruning further reduce model complexity, making models lighter and better suited to platforms with limited memory and resources.
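As a minimal sketch of what INT8 quantization means in practice, the Python below maps a few floating-point values to 8-bit integer codes with a single scale factor, then computes a dot product with integer multiply-accumulates and one final rescale. The symmetric per-tensor scheme, the function names and the sample values are all assumptions chosen for illustration; real toolchains typically add per-channel scales, zero points and calibration.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: floats -> INT8 codes plus one scale."""
    max_abs = max(abs(v) for v in values)
    # Map the largest magnitude to 127; use 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from INT8 codes."""
    return [c * scale for c in codes]


def int8_dot(a_codes, a_scale, b_codes, b_scale):
    """Integer multiply-accumulate followed by a single floating-point rescale."""
    acc = sum(x * y for x, y in zip(a_codes, b_codes))
    return acc * a_scale * b_scale


# Illustrative values only (assumption, not taken from any real model).
weights = [0.42, -1.27, 0.05, 0.88]
activations = [1.0, 0.5, -2.0, 0.25]

w_q, w_s = quantize_int8(weights)
a_q, a_s = quantize_int8(activations)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(w_q, w_s, a_q, a_s)

print("INT8 weight codes:", w_q)
print("float32-style result:", exact)
print("INT8 result after rescale:", approx)
```

The two results differ slightly: that gap is the quantization error, which is usually acceptable for inference and is the price paid for smaller weights and cheaper arithmetic.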

Advantages of integration into embedded SoCs

Integrating an NPU into an SoC is not only a performance gain but a real paradigm shift. Inference can run locally, which reduces latency and ensures real-time responses, a key requirement in areas such as autonomous driving, robotics and portable medical systems. Local processing also reduces the need to continuously send sensitive data to the cloud, improving security and privacy. Furthermore, an SoC with an NPU consumes significantly less power for inference than a solution that relies on traditional GPUs or CPUs, making it possible to design compact, battery-powered, always-connected devices. This combination of efficiency, speed and security opens the way to Edge AI, in which artificial intelligence is no longer centralized but distributed directly to the devices.

Scientific and industrial applications

The application possibilities of NPUs in embedded systems are extremely broad. In the automotive sector, for example, these units are used in advanced driver assistance systems (ADAS), where the ability to recognize obstacles, pedestrians and road signs within milliseconds can make the difference between safety and risk. In industry and robotics, NPUs allow collaborative robots to recognize objects, optimize routes and adapt to changing conditions without the need for constant connectivity. In healthcare, portable devices equipped with an NPU can analyze diagnostic images or physiological signals in real time, providing immediate support to doctors even in the absence of network infrastructure. The impact is also evident in the IoT and smart home sector: a voice assistant or a video surveillance system with on-chip AI processing can work faster and with greater confidentiality than solutions based exclusively on the cloud.

State of the art and commercial solutions

The race to adopt NPUs in embedded SoCs has already begun. Manufacturers like NXP have introduced processors like the i.MX 8M Plus, equipped with a 2.3 TOPS NPU designed for embedded computer vision and machine learning applications. STMicroelectronics has followed a similar path with the STM32N6 series, which integrates a neural accelerator directly into a microcontroller, demonstrating that artificial intelligence can be brought even to very low-power systems. Arm, with the Ethos-U line, instead offers scalable solutions that enable partners to design custom SoCs with dedicated inference capabilities. A key parameter that distinguishes these solutions is TOPS/Watt, i.e. the ratio of operations performed per second to power consumption: this is the metric that guides the design of new architectures and determines the competitiveness of an NPU compared to alternatives such as FPGAs or embedded GPUs.
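To make the TOPS/Watt metric concrete, here is a small worked calculation in Python. Only the 2.3 TOPS figure comes from the text above; the power draw, the operations per inference and the utilization factor are hypothetical placeholders, not datasheet values, and the function names are illustrative.

```python
def tops_per_watt(tops, power_watts):
    """Efficiency metric: tera-operations per second divided by power in watts."""
    if power_watts <= 0:
        raise ValueError("power_watts must be positive")
    return tops / power_watts


def ideal_inferences_per_second(tops, ops_per_inference, utilization):
    """Upper-bound throughput estimate from peak TOPS and an assumed utilization."""
    return tops * 1e12 * utilization / ops_per_inference


PEAK_TOPS = 2.3            # from the text (i.MX 8M Plus NPU)
ASSUMED_POWER_W = 2.0      # hypothetical power draw, for illustration only
ASSUMED_OPS = 1.2e9        # hypothetical operations per inference
ASSUMED_UTILIZATION = 0.3  # hypothetical fraction of peak actually achieved

efficiency = tops_per_watt(PEAK_TOPS, ASSUMED_POWER_W)
throughput = ideal_inferences_per_second(PEAK_TOPS, ASSUMED_OPS, ASSUMED_UTILIZATION)

print(f"Efficiency: {efficiency:.2f} TOPS/W")
print(f"Estimated throughput: {throughput:.0f} inferences/s")
```

The utilization term is the important design choice here: peak TOPS is rarely sustained on real models, so any estimate built on it should be treated as an upper bound.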

Emerging trends and new research directions

Current trends show that the industry is moving beyond simply increasing performance. One research direction concerns numerical formats even more compact than the classic INT8, with "narrow precision" representations designed to further reduce memory footprint and increase throughput in the most demanding networks. This approach is particularly useful for transformer models and multimodal networks, which combine textual, visual and audio data in a single pipeline.
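The memory argument for narrower formats is simple arithmetic, sketched below in Python. The parameter count is a hypothetical example, and the 4-bit row is included only as an illustration of a format narrower than INT8; the estimate covers weights alone and ignores scale factors, zero points and activation buffers.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage needed for the weights alone, rounded up to whole bytes."""
    return (num_params * bits_per_weight + 7) // 8


NUM_PARAMS = 5_000_000  # hypothetical model size, for illustration only

FORMATS = [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]

footprints = {name: weight_bytes(NUM_PARAMS, bits) for name, bits in FORMATS}

for name, size in footprints.items():
    print(f"{name:>5}: {size / 1_000_000:.1f} MB")
```

Halving the bit width halves the weight storage, which matters most on microcontroller-class devices where on-chip memory is measured in a few megabytes or less.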

Another emerging area is co-design between hardware and models via Neural Architecture Search techniques, where hybrid CNN/ViT architectures are automatically adapted to heterogeneous executions on NPUs and Compute-in-Memory blocks. This strategy, in addition to improving latency and consumption, paves the way for increasingly optimized SoCs at a systemic level.

On the architectural front, the virtualization of NPUs is also under discussion. While CPUs and GPUs already have established solutions for sharing resources, bringing the same concept to NPUs requires deep changes to the ISA and internal logic so that multiple applications can use the same accelerator simultaneously without loss of isolation.

Finally, recent benchmark studies have highlighted that ultra-low-power micro-NPUs exhibit uneven performance scaling. Inference speed does not always scale linearly with model complexity, because of memory bottlenecks and the internal organization of the pipelines. This invites developers and companies to validate models in the field rather than relying only on the theoretical numbers declared by manufacturers.
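Field validation usually starts with a simple latency harness like the Python sketch below. The `benchmark` helper and the `fake_inference` stand-in are hypothetical names; on a real target, the callable would wrap whatever inference call the vendor runtime provides, and measurements would be taken on the device itself.

```python
import statistics
import time


def benchmark(run_inference, warmup=5, runs=50):
    """Time repeated calls and report median, 95th percentile and worst case in ms."""
    for _ in range(warmup):
        run_inference()  # warm caches and lazy initialization before measuring

    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        run_inference()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p95_index = min(len(samples_ms) - 1, int(0.95 * len(samples_ms)))
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[p95_index],
        "max_ms": samples_ms[-1],
    }


def fake_inference():
    """Stand-in workload (assumption): replace with the real inference call."""
    sum(i * i for i in range(10_000))


stats = benchmark(fake_inference)
print(stats)
```

Reporting the tail (95th percentile and maximum) alongside the median matters for real-time systems, where an occasional slow inference can be more damaging than a slightly higher average.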

Future challenges and prospects

Despite this progress, NPUs still pose some challenges. One of the main ones concerns the software ecosystem. Models trained in frameworks like TensorFlow or PyTorch must be converted to optimized formats to run properly on the hardware accelerator, with toolchains that often vary from manufacturer to manufacturer. This fragmentation can hinder portability and increase development time. At the same time, the system design must balance CPU, GPU and NPU, assigning workloads correctly to make the most of the available resources. However, the trajectory is clear: in the near future NPUs will become an integral part of every embedded SoC, just as GPUs are standard in consumer devices today. The goal is to bring increasingly complex models into ever smaller, more reliable and lower-power devices.
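The workload-assignment problem can be illustrated with a simplified Python sketch: operators the NPU supports run on the accelerator, and everything else falls back to the CPU. The supported-operator set, the operator names and the `partition` function are hypothetical; actual toolchains apply their own, vendor-specific rules.

```python
# Hypothetical set of operators the accelerator can execute (assumption).
NPU_SUPPORTED_OPS = {"conv2d", "depthwise_conv2d", "fully_connected", "relu", "add"}


def partition(ops, npu_supported=NPU_SUPPORTED_OPS):
    """Group consecutive operators into segments that share an execution target."""
    segments = []
    for op in ops:
        target = "NPU" if op in npu_supported else "CPU"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)  # extend the current segment
        else:
            segments.append((target, [op]))  # target changed: start a new segment
    return segments


# Illustrative operator sequence; "softmax_custom" stands for an unsupported op.
model_ops = ["conv2d", "relu", "conv2d", "softmax_custom", "fully_connected"]

segments = partition(model_ops)
handoffs = len(segments) - 1  # each boundary implies a CPU/NPU data transfer

for target, ops in segments:
    print(target, ops)
print("handoffs:", handoffs)
```

Every handoff between segments implies moving tensors between processors, so a single unsupported operator in the middle of a network can cost more than its own execution time suggests.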

Conclusion

Neural Processing Units represent one of the most relevant innovations in the evolution of embedded SoCs. Thanks to their ability to perform neural computations efficiently, securely and with low latency, they are transforming embedded devices into intelligent nodes capable of making autonomous decisions. On-chip AI is no longer a futuristic concept but a reality that is already reshaping the technological landscape in scientific, industrial and consumer fields. For those working in the embedded electronics industry, mastering the use of NPUs means gaining access to new design possibilities and paving the way for an era in which distributed intelligence will become the norm.

Do you want to bring on-chip AI to your embedded projects?

Neural Processing Units (NPUs) are redefining edge computing and opening up new possibilities in IoT, industrial and automotive applications. If you would like to evaluate how to integrate AI accelerators into your systems or design a tailor-made solution, Silicon LogiX can support you from the analysis phase to full development.
