Fast Isn’t Fast Enough: Redefining Metrics for Edge AI
Key Takeaways:
- Edge AI performance is about low latency and power efficiency, not peak TOPS.
- Memory bandwidth and data movement now limit edge AI more than compute.
- Successful edge AI requires balanced hardware, software, and fast model updates.
Experts At The Table: Today’s chip architects must contend with multiple factors when designing AI processors for fast and efficient performance against the backdrop of rapidly evolving AI models. Semiconductor Engineering sat down to discuss this with James McNiven, vice president of client computing, Edge AI, at Arm; Amol Borkar, group director, product management for Tensilica DSPs at Cadence; Jason Lawley, director of product marketing, AI IP at Cadence; Sharad Chole, chief scientist and co-founder at Expedera; Justin Endo, director of marketing at Mixel, a Silvaco company; Steve Roddy, chief marketing officer at Quadric; Steven Woo, fellow and distinguished inventor at Rambus; Sathishkumar Balasubramanian, head of products for IC verification and EDA AI at Siemens EDA; and Gordon Cooper, principal product manager at Synopsys. What follows are excerpts of that discussion.

Top Row: Cadence’s Borkar, Cadence’s Lawley, Expedera’s Chole, Mixel’s Endo.
Bottom Row: Quadric’s Roddy, Rambus’ Woo, Siemens EDA’s Balasubramanian, and Synopsys’ Cooper.
SE: How do you define fast and efficient in the context of edge AI processing?
McNiven: At the edge, fast and efficient means delivering useful AI performance within real-world device constraints, not chasing peak compute. It’s about how quickly the system responds, how much energy it consumes, and how effectively it manages memory and bandwidth within a compact, cost-sensitive design. In practice, this means low, predictable latency and real-time on-device decision-making across use cases like smart cameras, industrial systems, wearables, or smart home endpoints — without overwhelming the power budget, creating excessive thermal demands, or forcing tradeoffs that compromise the rest of the product. Arm approaches this as a system-level optimization challenge across compute, memory, interconnect, and software. True efficiency comes from how CPUs, AI accelerators, and the memory subsystem work together, supported by a scalable software stack. That is increasingly important as workloads become more sophisticated and as developers need solutions that are not only performant, but also portable and scalable across a broad ecosystem of devices. This means that for today’s designs, AI must be treated as a core requirement from the outset, shaping compute partitioning, memory design, software choices, and security. At the same time, the rapid evolution of edge AI – from CNNs to transformer-based networks and increasingly multimodal workloads – means design teams need architectures that are efficient today, but also flexible enough to accommodate the next generation of software and model evolution. For tomorrow, this requires building in headroom, software portability, and secure scalability across a broad spectrum of devices and use cases. Future-ready edge AI will depend on a combination of efficiency, flexibility, and ecosystem support, because the devices that succeed will be the ones that can evolve as workloads and user expectations grow.
Borkar: Fast means you’re targeting a specific application segment or use case and meeting its requirements easily. In the latest context of generative AI applications and different sorts of AI agents, or even super agents, whatever modules are required to run them should deliver 40 or 50 tokens per second on an edge device, allowing real-time performance for the solution. That’s at a very high level, obviously, and there are a lot more details. But in terms of end-to-end execution of the entire pipeline, it should allow real-time usage and open a gateway for additional applications. Efficiency is a common problem most of us in the embedded space run into. We’re always consuming too much power or too much energy, or the design is a bit larger than it needs to be. Everybody asks for zero energy, zero area, and the highest performance, and that’s what we’re always striving for. Chris Jones in our organization always says zero-calorie, sugar-free, and fat-free ice cream does not exist. But that’s what we keep pushing the envelope for, because as these applications come out, they demand a lot more compute, and more compute also means more power. It becomes a vicious circle. We improve our processors to be more efficient, so more can get done faster and more efficiently. But that efficiency opens up new requirements for new applications, which are also more demanding, so you have to build new processors for those, and it just keeps going around and around.
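Borkar’s 40-to-50-token-per-second figure can be sanity-checked with a quick memory-bound estimate. The sketch below uses assumed, illustrative numbers (model size, quantization, and sustained bandwidth are placeholders, not figures from the discussion) to show why on-device LLM decode throughput is usually capped by memory bandwidth rather than peak TOPS.

```python
# Back-of-envelope check: can an edge device hit ~40-50 tokens/s on a small LLM?
# Assumed, illustrative numbers only -- real figures depend on the model,
# quantization, and the actual memory subsystem.

params_billion = 3.0          # e.g., a ~3B-parameter small language model
bytes_per_param = 0.5         # 4-bit quantized weights ~= 0.5 bytes/parameter
effective_bw_gbs = 60.0       # sustained (not peak) LPDDR bandwidth, GB/s

# Batch-1 autoregressive decode is usually memory-bound: every generated token
# must stream (roughly) the full weight footprint from DRAM.
bytes_per_token = params_billion * 1e9 * bytes_per_param
tokens_per_sec = effective_bw_gbs * 1e9 / bytes_per_token

print(f"Weight footprint per token: {bytes_per_token / 1e9:.2f} GB")
print(f"Upper bound on decode rate: {tokens_per_sec:.1f} tokens/s")
# ~40 tokens/s here -- which is why bandwidth, not TOPS, often sets the ceiling.
```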
Woo: Fast means the system hits its latency target every time, not just on average. Efficient means not exceeding the power budget and consuming the fewest resources necessary. Excessive data movement and underutilized resources are primary sources of inefficiency. The most efficient designs pay careful attention to minimizing data movement and choose memory systems that are power-efficient and deliver low latency and predictable performance. For today’s designs, this means architects are increasingly aware of the memory wall and data transfer costs, because compute scaling alone will still be bottlenecked by memory systems and data movement. In the future, there will be more heterogeneous processing pipelines, more specialized silicon, and greater attention to memory architecture and data placement and movement. Customers are pushing for real-time behavior with tight, predictable latency, and the fastest way to miss that is to underfeed the compute. Target memory bandwidths for edge inferencing can be above 300 to 500 GB/s, as workloads are often bandwidth-limited rather than compute-limited. Power efficiency is also important. Performance per watt matters most for battery-powered devices, which have limited ability to supply power and manage thermals.
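Woo’s distinction between average and worst-case behavior is easy to make concrete: what matters is whether the tail of the latency distribution stays under the deadline. A minimal sketch, with made-up latency samples and an assumed 20 ms deadline:

```python
# Minimal sketch: judging "fast" by the tail, not the mean.
# The latency samples and the 20 ms deadline below are made-up values
# standing in for measurements from a real edge inference pipeline.

latencies_ms = [8.1, 8.3, 8.0, 8.4, 9.0, 8.2, 27.5, 8.1, 8.3, 8.2]  # one outlier
deadline_ms = 20.0

def percentile(samples, pct):
    """Nearest-rank percentile, good enough for a quick check."""
    ordered = sorted(samples)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p99_ms = percentile(latencies_ms, 99)

print(f"mean = {mean_ms:.1f} ms, p99 = {p99_ms:.1f} ms")
print("meets target" if p99_ms <= deadline_ms else "misses target at the tail")
# The mean (~10 ms) looks fine; the p99 (27.5 ms) is what the user experiences.
```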
Lawley: As an IP vendor, our customers need to be better than their competition. So when we talk about fast, it’s all relative to the performance the actual end customer needs to achieve. Typically, that’s inferences per second or frames per second. In terms of efficiency, there are a lot of different avenues. Some of it is whether we can fit into the power budgets they have. Can we fit the area requirements they have? But then you go beyond that to the software side. How much effort do they have to put in, in terms of actual dollars or resources, to map those networks onto the IP itself? From that perspective, efficiency is more than just power and more than just area. It’s also how easy it is for them to get what they need out of the solution.
Roddy: I’ll echo what Jason said in a couple of areas. Obviously, efficiency. The first thing you think of is power budgets, etc. Everyone wants that. That’s zero-calorie ice cream. But efficiency and speed of landing new models are critical. Especially over the last six months of rapid innovation of agentic AI, we’ve all seen AI models and LLMs on the edge, where people are desperate to get the latest and greatest model and get it landed on the platform as soon as possible. So that has to be one of the vectors that someone designing a new chip or designing a new product with a chip absolutely is going to factor in. When models change — and they’re going to change — how quickly can that new model land on the target? Can I do it myself as the OEM? Do I need to go to somebody else to get it ported? That’s one of the key aspects and attributes of efficient and fast. And, of course, zero power, take no area, no cost. Those things are a given.
Chole: We already have ‘fast’ in the data center, and we already have the technologies to train large models there. What matters at the edge is how you fit into that small footprint, so it’s not always the ‘fast’ that matters. Latency is always bound by either the sensors or the users, so what you want to do is take the technology that’s already running models with billions of parameters, bring it down to the edge, and make it run in real time and efficiently. If you boil it down to matrix multiplications, it’s TOPS per watt and TOPS per millimeter squared, and I might as well add ‘effective’ on top of that: effective TOPS per watt and effective TOPS per millimeter squared. What it takes to actually deploy and support new models is a secondary part of the solution we are creating. We have to create a solution that keeps the footprint small and whose latency supports latency-sensitive applications, such as real-time systems or control systems in automotive. Those require very latency-sensitive deployments, and we need to make sure they are hit within a small footprint. This is not just a hardware architecture problem. It starts with the models, with quantization, and with the application. This is a whole-stack problem.
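Chole’s ‘effective’ qualifier is the difference between datasheet peak and what the workload actually achieves. A small sketch with hypothetical numbers for a notional NPU (peak TOPS, utilization, power, and area are all assumed):

```python
# Sketch of "effective" TOPS/W and TOPS/mm^2 as distinct from peak figures.
# All numbers are hypothetical placeholders for a notional NPU.

peak_tops = 8.0          # datasheet peak, dense INT8
utilization = 0.45       # fraction of MACs actually kept busy on the real workload
power_w = 1.2            # measured power while running that workload
area_mm2 = 4.0           # silicon area of the accelerator

effective_tops = peak_tops * utilization

print(f"peak:      {peak_tops / power_w:.1f} TOPS/W, {peak_tops / area_mm2:.1f} TOPS/mm^2")
print(f"effective: {effective_tops / power_w:.1f} TOPS/W, {effective_tops / area_mm2:.1f} TOPS/mm^2")
# 6.7 vs 3.0 TOPS/W here -- the gap between marketing numbers and delivered work.
```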
Balasubramanian: The key thing we are seeing from our customers in terms of edge AI processing is that latency is critical. Even in GenAI, for any application it’s interacting with, even regular human interaction, the target is one millisecond. That’s where it feels seamless. It gets even more critical in applications like automotive or industrial. The second thing, which Amol and Sharad also touched on, is efficiency, which is mostly about power. But it’s also how you handle and get the right data, so you can make the right inference at the edge and handle SLMs, or industry foundation models on the industrial side. The boilers in industrial and factory settings each have different AI processing requirements. The requirements are different, so how you handle those kinds of models for different applications, and the ability to handle all these real-world circumstances without the need for human intervention, is going to be key.
Cooper: If we talk about real-time edge AI, certainly it’s anything with sensors. On the processing side, it’s power, performance, and area, all those things we’ve talked about. Software is critical. I would add manufacturability, too, because particularly at the larger sizes, the step of efficiently getting these designs into silicon is not insignificant. What’s changed is video, for example. That sensor has played a crucial role in edge processing, with significant emphasis on how fast the computations can be performed. As AI moves toward physical systems and generative AI increasingly operates at the edge, new challenges emerge in balancing compute power and memory. Large language models typically are memory-limited. So efficiency is no longer just about optimizing power and performance. It now also centers on bandwidth — how effectively data can be transferred — which has become even more critical than before. That’s key.
Endo: ‘Fast and efficient’ is about making the right processing decision in the right place. Fast means low latency or processing data close to the sensor so that systems can react in real-time, such as with object detection, wake-up triggers, and safety responses. Efficient means minimizing energy per decision, which is heavily influenced by data movement, not just compute. Generally speaking, moving data often consumes significantly more energy than computing on it. This is why latency-critical functions are increasingly pushed to the edge, and systems aim to ingest sensor data efficiently and process it immediately. For example, we’re seeing edge processors designed to support fast system bring-up and event detection, leveraging MIPI interfaces for always-on, low-power data streaming and rapid transition from idle to active processing states. From an IP provider perspective, one of the most consistent trends is the need for flexible, scalable interconnect between sensors and processors. In today’s designs, we’re seeing strong demand for silicon-proven MIPI PHYs, particularly our C-PHY/D-PHY combo, which allows SoCs to interface with a broad and evolving ecosystem. This also reduces risk, since designers often don’t control both ends of the link. In tomorrow’s designs, we expect continued growth in aggregate sensor bandwidth with greater emphasis on pin efficiency (favoring C-PHY and D-PHY with embedded clock), energy per bit (pJ/bit) optimization, and scalable multipoint architectures. At the same time, MIPI standards continue to evolve to meet these needs, while foundries push process limits to enable higher speeds and lower power. However, developing PHY IP in-house is becoming increasingly challenging due to rapid standard evolution, advanced node complexity, and aggressive time-to-market requirements. This is why we’re seeing strong adoption of production-proven IP solutions, which help customers reduce integration risk, accelerate time to market, and ensure first-pass silicon success.
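Endo’s point that data movement, not compute, often dominates energy per decision can be illustrated with a rough split. The per-operation energies below are assumed, order-of-magnitude values, not numbers tied to any specific process or product:

```python
# Rough energy-per-decision split between data movement and compute, to show why
# moving data often costs more than computing on it. Per-operation energies are
# assumed, order-of-magnitude values, not figures from any specific process node.

frame_bytes = 1920 * 1080 * 2        # one 1080p frame, 16-bit pixels
macs_per_inference = 500e6           # a small detection network, ~0.5 GMAC

pj_per_byte_dram = 100.0             # off-chip DRAM access, pJ/byte (assumed)
pj_per_mac_int8 = 0.3                # INT8 MAC on-chip, pJ (assumed)

move_uj = frame_bytes * pj_per_byte_dram / 1e6
compute_uj = macs_per_inference * pj_per_mac_int8 / 1e6

print(f"moving the frame from DRAM: {move_uj:.0f} uJ")
print(f"running the network:        {compute_uj:.0f} uJ")
# With these assumptions, hauling one frame off-chip costs more energy than the
# inference itself -- the argument for processing close to the sensor.
```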
SE: What do today’s leading-edge applications need for AI processing?
Woo: Memory capacity and bandwidth are critical across the board for AI processing. Cost-effective and power-efficient inference are increasingly important to the industry, with special-purpose architectures designed for AI workloads that are memory-limited. The proliferation of inference workloads and platforms has resulted in memory system designs ranging from on-chip SRAM memory to low-power DRAM in the 50 to 100 GB/s range, and all the way up to edge platforms chasing 300 to 500 GB/s or more. The ‘right’ answer depends on the workload mix and form factor, but the requirement is the same — move data quickly and efficiently, move it predictably, and keep the compute engines busy. The primary tradeoff is between memory bandwidth, power, and cost. Higher memory bandwidth requires more pins, more power, tougher SI/PI, and bigger thermal challenges. The secondary tradeoff is how much data you put on-chip versus off-chip, and how you manage the movement of data. SRAM is a precious resource, but it isn’t scaling the way logic does, and it forces hard choices about caching and tiling. If design choices aren’t made intentionally, performance and power efficiency suffer as a result.
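A roofline-style check makes Woo’s bandwidth-limited versus compute-limited distinction concrete: compare a kernel’s arithmetic intensity (operations per byte moved) against the machine balance of the platform. The peak throughput and bandwidth below are placeholder values for a notional edge SoC:

```python
# Quick roofline-style check of whether a layer is bandwidth- or compute-limited.
# Peak TOPS and bandwidth below are placeholder values for a notional edge SoC.

peak_ops_per_s = 4e12        # 4 TOPS (INT8)
bandwidth_bytes_per_s = 50e9 # 50 GB/s sustained LPDDR

machine_balance = peak_ops_per_s / bandwidth_bytes_per_s  # ops/byte needed to stay compute-bound

def bound(ops, bytes_moved):
    """Return which resource limits this kernel under the roofline model."""
    intensity = ops / bytes_moved  # arithmetic intensity, ops/byte
    return "compute-limited" if intensity >= machine_balance else "bandwidth-limited"

# A large matrix multiply reuses data heavily; a GEMV (LLM decode step) does not.
print("machine balance:", machine_balance, "ops/byte")
print("conv layer  :", bound(ops=2e9, bytes_moved=8e6))    # ~250 ops/byte
print("decode GEMV :", bound(ops=2e9, bytes_moved=1.5e9))  # ~1.3 ops/byte
```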
Endo: Fundamentally, edge AI starts with data acquisition. Whether it’s automotive ADAS, industrial vision, AR/VR, wearables, or smart surveillance, systems need high-quality, real-time data to make decisions. In practice, much of this data comes from imaging and vision sensors, which are rapidly increasing in resolution, frame rate, and dynamic range (HDR). This creates two simultaneous requirements — high bandwidth to move large volumes of sensor data, and low-power operation, since edge systems are often power-constrained (from sub-watt IoT devices to thermally limited automotive modules). From a connectivity standpoint, this is exactly where MIPI interfaces like MIPI CSI-2 over D-PHY and/or C-PHY play a critical role. MIPI enables efficient, high-throughput data movement from sensors into the processing domain. At a high level, the primary tradeoffs are still power, performance, area, and cost (PPA/C). However, in edge AI systems, especially from a PHY perspective, there are some important nuances. The primary levers are still bandwidth versus power (higher data rates increase dynamic power), pin count versus throughput (fewer wires require higher signaling efficiency, e.g., C-PHY encoding), and data movement versus compute (moving data off-chip can dominate total system power). Secondary considerations include PHY efficiency (picojoules/bit) and burst versus continuous operation. It is often more efficient to transmit data at high speed and then enter a low-power or idle state. This drives event-driven architectures, incorporating event detection or wake-on-motion, which enables systems to operate in a highly power-efficient manner.
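The burst-versus-continuous tradeoff Endo describes comes down to duty cycling. A minimal sketch, with assumed active, idle, and streaming power numbers, shows why transmitting fast and then dropping to a low-power state often wins at low duty cycles:

```python
# Sketch of the burst-then-idle point: average power for a duty-cycled sensor link
# versus streaming continuously at a lower rate. All power numbers are assumed.

active_mw = 120.0    # PHY + ISP active power while bursting at high speed (assumed)
idle_mw = 2.0        # low-power idle state (assumed)
stream_mw = 45.0     # continuous streaming at a lower line rate (assumed)

def avg_power_mw(duty_cycle):
    """Average power when the link bursts for duty_cycle of the time, then idles."""
    return duty_cycle * active_mw + (1.0 - duty_cycle) * idle_mw

for duty in (0.05, 0.10, 0.25):
    print(f"burst {duty:.0%} of the time: {avg_power_mw(duty):5.1f} mW")
print(f"continuous streaming:    {stream_mw:5.1f} mW")
# At low duty cycles, bursting fast and dropping to idle wins -- which is what
# drives event-driven, wake-on-motion architectures.
```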
McNiven: Today’s leading-edge applications need AI processing that is not only high-performance, but also responsive, efficient, and deployable in real-world edge environments. The focus has shifted away from peak TOPS, towards executing AI where the data is created, with low latency, within practical power and thermal limits, and across increasingly diverse workloads. This is critical as use cases expand beyond image classification into more complex and dynamic tasks such as multimodal sensing, voice and vision interaction, industrial automation, and intelligent human-machine interfaces. These require a balanced compute approach, where CPUs, NPUs, and the broader system architecture work together efficiently. This reflects the broader move toward distributed AI, with intelligence increasingly running at the edge. CPUs play a central role in both AI processing and orchestration, alongside specialized accelerators, enabling real-time performance, improved privacy, and reduced reliance on the cloud. The primary tradeoffs remain the classic system-level constraints: performance, power, silicon area, memory bandwidth, and cost. Edge devices must deliver increasingly capable AI experiences within strict limits on battery life, thermals, size, and bill of materials. That forces careful decisions around workload placement, the level of acceleration required, and overall system architecture, so that AI can be designed in as a core, optimized part of the system from the outset. Remember, the cheaper chip is only cheaper until the first model update breaks the product. Secondary tradeoffs are becoming just as important. These include software portability, framework support, development complexity, security, and the ability to update or scale AI capabilities over time. These factors often determine how quickly a product can move from concept to deployment, and its viability across future software generations. This is why Arm emphasizes a platform approach to edge AI. The goal is not just more compute, but a balanced system where hardware, software, and tools are aligned, supported by optimized frameworks, libraries, and toolchains that map efficiently to Arm-based platforms.
