A New Era For Co-Processing
Key Takeaways:
- There is no single processor capable of executing everything efficiently, meaning that multiple processors are required.
- Maximum efficiency is gained by minimizing the movement of data.
- Architects must maximize efficiency for today’s workloads, while also adding enough flexibility to handle tomorrow’s.
New processor architectures are rapidly evolving thanks to changing workloads associated with AI, but no one processor can do it all. Coordination is easy on paper, but a lot more difficult in practice.
There has never been one processor architecture that can do everything. The humble central processing unit (CPU) has been the primary workhorse for 50 years, but even in the early days of personal computers there was an acknowledgement that some workloads required more focused processing. The 8086 processor had its sidekick, the 8087 floating-point co-processor.
The advent of audio processing and cell phones quickly made the digital signal processor (DSP) a necessary second processor. DSP designers recognized that data transfer was a performance limiter, so they separated the data and instruction streams while adding the specialized multiply/accumulate (MAC) logic needed to perform Fourier transforms quickly. Later, that functionality was extended to support encoding/decoding, compression, modulation, demodulation, and error correction.
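To see why dedicated MAC hardware mattered, consider the inner loop of a discrete Fourier transform. The Python sketch below is purely illustrative. A real DSP would execute each iteration as a single-cycle fused multiply/accumulate, streaming samples and coefficients from separate memories:

```python
import math

def dft_bin(samples, k):
    """Compute one frequency bin of an N-point DFT.

    Each loop iteration is one multiply/accumulate (MAC). A DSP runs
    this as a fused single-cycle operation, fetching samples and
    coefficients from separate memories (Harvard architecture).
    """
    n_points = len(samples)
    acc_re, acc_im = 0.0, 0.0
    for n, x in enumerate(samples):
        angle = -2.0 * math.pi * k * n / n_points
        acc_re += x * math.cos(angle)   # MAC 1
        acc_im += x * math.sin(angle)   # MAC 2
    return acc_re, acc_im

# A 64-sample tone at bin 5 shows up as a large magnitude at k=5.
tone = [math.cos(2 * math.pi * 5 * n / 64) for n in range(64)]
re, im = dft_bin(tone, 5)
print(f"bin 5 magnitude: {math.hypot(re, im):.1f}")  # ~32.0
```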
Applications such as CAD demanded much faster graphics processing, and that demand, amplified by the commercial games market, resulted in rapidly evolving GPU architectures. These processors enabled AI to move from being rule-based to model-based, which brings us to the present decade.
Migration to new architectures is not easy. “Three decades of SoC evolution show a consistent pattern — power–performance motivates new processor categories, but full programmability determines which ones succeed,” says Steve Roddy, chief marketing officer at Quadric. “If a workload can run within power and performance limits on a CPU, it will. Architects only introduce specialization when the CPU becomes inefficient.”
The rapid evolution of AI has driven similarly rapid innovation in hardware architectures, and workloads appear to be evolving faster than hardware can be designed, verified, implemented, and put into operation. “The key question for co-processors is fundamentally about workload,” says William Wang, CEO of ChipAgents. “As AI systems evolve, workloads are shifting from short, kernel-style inference tasks to long-running agentic workloads that involve reasoning loops, tool use, memory access, and interaction across many software components. In this world, the challenge is less about building ever-faster compute blocks and more about balancing general-purpose programmability with ASIC-level efficiency.”
Many companies have tried to introduce new processor architectures that, while impressive on paper, failed to deliver. “The winning co-processor is usually the one that minimizes data movement, software friction, and verification risk at the same time,” says Simon Davidmann, AI and EDA researcher at the University of Southampton. “In AI, the best co-processor is not the one with the highest peak TOPS. It is the one that wastes the least energy moving data.”
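Davidmann’s point about data movement is easy to quantify. The following sketch is an analytic model with illustrative numbers, not any vendor’s kernel. It estimates the main-memory traffic of a naive matrix multiply versus a tiled version that reuses blocks held in on-chip memory. The MAC count is identical in both cases; only the movement, and therefore the energy, changes:

```python
def dram_traffic_bytes(n, tile=None, elem_bytes=4):
    """Estimate DRAM traffic for an n x n matrix multiply C = A @ B.

    Naive: every element of A and B is re-fetched for each use.
    Tiled: each (tile x tile) block is loaded once into on-chip SRAM
    and reused, so traffic shrinks by roughly a factor of `tile`.
    Simplified model -- ignores C write-back and cache effects.
    """
    if tile is None:
        # n^3 inner-loop iterations, each reading one A and one B element
        return 2 * n**3 * elem_bytes
    # Each of the (n/tile)^3 block-multiplies loads two tile^2 blocks.
    blocks = (n // tile) ** 3
    return 2 * blocks * tile**2 * elem_bytes

n = 4096
for t in (None, 32, 128):
    label = "naive" if t is None else f"tile={t}"
    print(f"{label:9s}: {dram_traffic_bytes(n, t) / 1e9:8.1f} GB")
# Same 2*n^3 multiply/accumulates every time; only the data moves less.
```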
Architectures
Within a processing environment that includes multiple heterogeneous processing units, all working toward a common goal, there is often one processor that acts as the coordinator. “In every case, there is some high-level host, usually a CPU,” says Gordon Cooper, principal product manager at Synopsys. “Everything else can be considered a co-processor. We have a neural processing unit (NPU) IP, which is a full processor, but it does what the host tells it to. As we look at large language models, the host could do some of that work, and hosts often have some level of capability built in, but they’re going to offload most of it because it’s certainly much more efficient to do the math in an NPU for a large language model or a vision language model. But it all starts with control from the main processor. Within our NPU, we’ve got multiple scalar processors, vector processors, and some specialized engines to do the math. It is a heterogeneous collection of processing power all within something that we call an NPU.”
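That control relationship can be captured in a few lines. In the hypothetical sketch below, the `NpuDevice` class and the operator names are invented for illustration, but the pattern matches Cooper’s description: the host owns the graph and offloads the math-heavy layers.

```python
# Minimal sketch of the host/co-processor control pattern described
# above. The NpuDevice API is hypothetical -- real NPU runtimes
# differ -- but the shape is the same: the host walks the graph and
# decides, per operation, what to offload.

MATH_HEAVY = {"matmul", "conv2d", "attention"}  # offload candidates

class NpuDevice:
    """Stand-in for an NPU driver queue (hypothetical)."""
    def submit(self, op, tensors):
        print(f"NPU  <- {op['kind']}")
        return f"npu_result({op['kind']})"

def run_on_host(op, tensors):
    print(f"host <- {op['kind']}")
    return f"host_result({op['kind']})"

def execute(graph, npu):
    """Host-side scheduler: offload the math-heavy layers, keep
    control flow and glue operations on the CPU."""
    results = {}
    for op in graph:
        inputs = [results.get(name) for name in op["inputs"]]
        if op["kind"] in MATH_HEAVY:
            results[op["name"]] = npu.submit(op, inputs)
        else:
            results[op["name"]] = run_on_host(op, inputs)
    return results

graph = [
    {"name": "t0", "kind": "embed",     "inputs": []},
    {"name": "t1", "kind": "attention", "inputs": ["t0"]},
    {"name": "t2", "kind": "layernorm", "inputs": ["t1"]},
    {"name": "t3", "kind": "matmul",    "inputs": ["t2"]},
]
execute(graph, NpuDevice())
```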
The NPU has quickly evolved. “NPUs are meant to run the AI models, but they were typically very specialized, fixed-function hardware blocks,” says Amol Borkar, product marketing director for AI IP and software at Cadence. “But now the AI models have become more complicated, doing more than just MAC computations. You may have small bits of hardware that help with non-MAC operations, or activation functions. The advantage is that it allows the NPU to become a little more flexible. But what we have learned is that anytime a new layer comes in — a new operator, a new type of Llama, a new type of Claude model — we face a challenge. If this extra hardware is not designed to run those operators, the network might not run.”
Arm has taken a very different approach with its newest CPUs, ramping up the performance per watt of the CPU itself. “As agentic AI becomes mainstream, all of the work required to make that happen is CPU-bound,” according to Rene Haas, Arm’s CEO. “The data center is choking. These accelerators, which are very expensive and which generate the tokens, now need to send those tokens back through the cloud. So what you see is a huge bottleneck. That means you need more and more CPUs.”
Fig. 1: Bottleneck in the data center. Source: Arm
It also adds yet another variable into this equation. Defining what counts as a processor versus a co-processor is becoming more difficult. “Co-processing architectures now span tightly coupled units, loosely coupled accelerators, and fabric-based distributed systems,” says Andy Nightingale, vice president of product management and marketing for Arteris. “Tightly coupled designs benefit from low latency, shared memory, and simpler programmability, making them efficient for smaller-scale or latency-sensitive workloads. However, they struggle to scale due to contention and coherence overhead. Loosely coupled approaches, often implemented as chiplets, enable modular scaling and specialization across functions such as training, inference, and networking. But they introduce higher latency and significant coordination complexity in both hardware and software. Fabric-based architectures strike a balance, enabling scalable and dynamic resource sharing, but place heavy demands on the interconnect and introduce substantial system-level complexity.”
The RISC-V ecosystem is attempting to add a new possibility, where merging of processor and co-processor can happen. “For accelerators and very specialized processor architectures, the RISC-V Instruction Set Architecture (ISA) is uniquely qualified,” says Dave Kelf, CEO for Breker Verification Systems. “As such, we are seeing the emergence of RISC-V based accelerators where the processing element is part of the accelerator, eliminating the overhead of control and data transfer between separate units. For low-power applications, the combination of just the required processor elements with the accelerator can save significant wattage. This appears most effective for AI device applications, where standardized software stacks can be applied directly to the accelerators themselves. It’s a new paradigm enabled by RISC-V, and probably the future of the open ISA.”
This can apply to adding functions into the CPU, or adding more general processing capability into the NPU. “Switching between processors has a time and distance penalty,” says Jason Lawley, director of product marketing for AI IP at Cadence. “You start trying to figure out how much area you can give to vector processing and scalar processing, knowing that it can’t do everything a CPU does. That’s why you start seeing small RISC-V cores sitting much closer to the MAC arrays. Those small RISC-V cores are not going to be able to do everything that your big CPU can do. The software developers have to figure out how to segment those workloads to be maximally effective.”
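Lawley’s “time and distance penalty” lends itself to a back-of-the-envelope break-even test. In the sketch below, every number (host and accelerator throughput, link bandwidth, launch overhead) is an illustrative placeholder rather than a measured figure, but it shows why a large matrix multiply is worth offloading while a small activation is not:

```python
def offload_wins(op_flops, bytes_moved,
                 host_gflops=50.0, accel_gflops=2000.0,
                 link_gb_s=10.0, launch_us=20.0):
    """Crude break-even test for handing an operator to a co-processor.

    All figures are illustrative placeholders, not measured values.
    Offloading pays only when the accelerator's speedup outweighs the
    cost of moving data across the link plus the launch overhead.
    """
    t_host = op_flops / (host_gflops * 1e9)
    t_accel = (op_flops / (accel_gflops * 1e9)
               + bytes_moved / (link_gb_s * 1e9)
               + launch_us * 1e-6)
    return t_accel < t_host, t_host, t_accel

# A large matmul amortizes the transfer; a small activation does not.
for name, flops, nbytes in [("4Kx4K matmul", 2 * 4096**3, 3 * 4096**2 * 4),
                            ("small relu",   1e6,          4e6)]:
    wins, th, ta = offload_wins(flops, nbytes)
    print(f"{name:13s} host={th*1e3:8.3f} ms  accel={ta*1e3:8.3f} ms  "
          f"offload={'yes' if wins else 'no'}")
```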
To see the complete picture, you also have to look outside of the pure electronics space. “There are other types of processors, such as photonic AI accelerators,” says Jan van Hese, high speed digital portfolio manager at Keysight Technologies. “These can have great advantages. They may be hard to design, but if you can make them work, they are very fast and have low power consumption.”
The architectures that bind them are also fluid. “When you talk about co-processors, there is a continual ebb and flow of where that compute is done, and that influences how you move data, and where you store data,” says Cadence’s Lawley. “A lot of compute is being done in the GPUs and the NPUs. As AI is starting to mature, especially with agents, you’re starting to see that there needs to be more CPU work done. It used to be one CPU and a large MAC array, but you’re starting to see that for a certain number of MACs you need a CPU, just because of the extra compute that’s needed. If I’m sitting with my NPU hat on, I look at the NPU as the center of the world, and everything else is a co-processor to me, whereas, if I’m a CPU designer, everything else is a co-processor to me. You’re always going to have different specialized functions that don’t always fit in one processor. And that’s when you’re going to get co-processors.”
History may show us there is a point of convergence in the future. “Task-specific processors gain efficiency by aligning native data types and compute primitives to the workload,” says Quadric’s Roddy. “But specialization alone isn’t enough. A tightly attached ‘helper’ accelerator does not truly free the CPU. Partitioned execution increases interconnect traffic, latency, and power. System-level efficiency depends on independence. History reinforces this point. Early graphics engines were attached accelerators. Real scaling occurred only when fully programmable GPUs emerged and decoupled from the CPU. The same transition happened in DSP domains. AI appears to be crossing that same boundary—from fixed-function accelerators toward fully programmable, independent AI processors. Beyond power and performance gains, independence simplifies integration, verification, modeling, and chiplet-based scaling.”
There are tradeoffs to be made. “CPU-adjacent acceleration is easier to program and easier to integrate into existing software flows, but it rarely wins on sustained performance-per-watt,” says University of Southampton’s Davidmann. “GPU-style engines are flexible and powerful, but they bring a heavy software stack and a large data-movement bill. Dedicated accelerators usually deliver the best efficiency, but only if the compiler, runtime, and model coverage are mature enough to prevent the hardware from becoming a specialized island. Heterogeneous subsystems sit in the middle. They are often the best system answer, but also the most demanding architectural answer.”
Arm’s approach is roughly in line with that view. “Twenty-four hours a day, these agents are going to be running,” says Mohamed Awad, executive vice president for Cloud AI at Arm. “And if they’re not performing fast enough, then the rest of that infrastructure that’s relying on it grinds to a halt.”
Challenges
Concentrating on the processing architecture may be missing the big picture. “People like to view it as a math problem,” says Synopsys’ Cooper. “But it’s really about data movement, particularly with LLMs that have massive numbers of parameters. It’s how efficiently you can move data to one place, process it, and then not have to continue moving it around. It’s really about data flow. You have to find the right balance between processing power and data bandwidth. It makes no sense to have more MACs if you don’t have enough data flow and end up starving the MACs.”
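Cooper’s balance can be expressed as simple roofline arithmetic. The sketch below uses illustrative figures, not a model of any particular NPU. It computes the minimum data reuse, in operations per byte, that a workload must supply before a MAC array stops starving on memory. Doubling the MACs without adding bandwidth simply doubles that requirement:

```python
def min_reuse_ops_per_byte(num_macs, clock_ghz, mem_gb_s):
    """Minimum arithmetic intensity (ops/byte) at which a MAC array
    stays busy instead of starving on memory. Illustrative roofline
    arithmetic, not a model of any particular NPU."""
    peak_ops = 2 * num_macs * clock_ghz * 1e9   # 1 MAC = 2 ops
    return peak_ops / (mem_gb_s * 1e9)

# More MACs with the same bandwidth raises the reuse the workload
# must supply -- otherwise the extra MACs sit idle.
for macs in (1024, 2048, 4096):
    ridge = min_reuse_ops_per_byte(macs, clock_ghz=1.0, mem_gb_s=100.0)
    print(f"{macs:5d} MACs @ 1 GHz, 100 GB/s -> needs {ridge:5.1f} ops/byte")
```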
That starts with system-level planning. “You need to shift left in the design cycle,” says Keysight’s van Hese. “The ideal is to do things at the IC level, at the package level, at the system level – simultaneously. You need to co-design between all these types of building blocks to make the whole system work. For instance, your system contains an IC plus a package, or two dies communicating through an interposer with UCIe. System-level simulations need to be done while you’re designing these entire systems.”
Distributing the processing may make some aspects simpler, but it complicates others. “While chiplets and heterogeneous co-processors promise a more open and flexible ecosystem, they introduce significant integration challenges beyond basic interoperability,” says Arteris’ Nightingale. “Standards such as UCIe and CXL address physical and protocol compatibility, but they do not solve system-level behavioral integration. Differences in traffic management, memory ordering, QoS expectations, and latency tolerance across vendors can lead to unpredictable performance when components are combined. A consistent interconnect layer becomes essential, not just to connect components, but to enforce predictable system behavior across them. Without that, ecosystems risk becoming technically compatible but operationally unreliable under real workloads.”
Extensibility
Until the development of AI-related models and tasks slows down, hardware will always be behind where the software would like it to be. “You are designing something that will take a year to create a chip, another year to integrate into a product, and it’s got to live in the market for a number of years,” says Cooper. “If I’m making an SoC, it’s an interesting challenge to figure out how to future-proof it.”
What you provide for the future probably costs you today. “If we created specific hardware that is tightly coupled to exactly the workload they need to run today, we would probably have much higher efficiency,” says Borkar. “I need to make sure that, within this timeframe, this is the hardware I’m going to provide. You can maximize, tune, and optimize everything you need to do. But the challenge, obviously, is that if they decide to change their spec, or they give us a different network or a different model, you’re in really bad shape.”
Every hardware developer has to find the right balance. “Architects need enough specialization to achieve power and performance, but enough flexibility to support rapidly evolving AI workloads,” says ChipAgents’ Wang. “This makes system-level scheduling, data movement, and software integration just as important as raw compute throughput, and it is exactly where agentic AI can help engineers reason about the tradeoffs and manage the growing complexity of heterogeneous co-processing systems.”
It is not just the operations that need to be considered when attempting to future-proof. “The original NPUs were designed to run CNN workloads,” says Cooper. “Matrix multiplies were pretty straightforward. Then you get the transformers, and it’s a little more complicated. TOPS becomes less relevant, because it is not just multiply/accumulates. You’ve got other things with tensor networks to deal with. That evolves into LLMs. Where before you had the balance down, now you are memory-bound. Then you migrate to mixed mode, or multi-modal, and now you have some vision processing that needs to be done again. There are different parameters associated with how you can mix and match, and any of those could be called an NPU.”
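The shift Cooper describes is visible in the arithmetic intensity of the underlying matrix operations. In the simplified sketch below (fp16 operands, each tensor counted once, layer width chosen arbitrarily), a batched GEMM of the kind CNNs and prefill generate reuses its weights hundreds of times, while the matrix-vector multiply of LLM token generation performs roughly one MAC per byte fetched, which is why it is memory-bound:

```python
def arithmetic_intensity(m, n, k, elem_bytes=2):
    """ops per byte for an (m x k) @ (k x n) multiply, fp16 operands.
    Simplified: counts reads of A and B plus the write of C, once each."""
    ops = 2 * m * n * k
    traffic = (m * k + k * n + m * n) * elem_bytes
    return ops / traffic

d = 4096  # hypothetical layer width
# CNN/prefill-style GEMM: a big batch of activations reuses the weights.
print(f"GEMM (256 x {d} x {d}): {arithmetic_intensity(256, d, d):7.1f} ops/byte")
# LLM decode GEMV: one token, every weight fetched for ~1 MAC each.
print(f"GEMV (  1 x {d} x {d}): {arithmetic_intensity(1, d, d):7.1f} ops/byte")
```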
Optimizing the operators is perhaps the easy part. “If you break down the network into a sequence of operators, usually a large chunk of the operators would be common between many of these customers,” says Borkar. “For those, we should be able to provide very high performance and high efficiency. It’s usually the ones that we have typically not planned for, and we need to find a way to run them. That’s where, many times, you run into efficiency issues.”
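A hedged sketch of that coverage problem appears below. The operator names, the supported set, and the 10x fallback penalty are all illustrative assumptions, but they show how a single unplanned operator splits the network into segments and drags down overall efficiency:

```python
# Sketch of the operator-coverage problem Borkar describes: partition
# a network into segments the NPU was designed for and segments that
# must fall back to a slower, general-purpose path.

NPU_OPS = {"conv2d", "matmul", "relu", "softmax"}  # planned-for set
NPU_COST, FALLBACK_COST = 1.0, 10.0                # relative per-op cost

def partition(ops):
    """Group consecutive operators by where they can run; every switch
    between groups is an extra host round-trip for tensors."""
    segments = []
    for op in ops:
        target = "npu" if op in NPU_OPS else "fallback"
        if segments and segments[-1][0] == target:
            segments[-1][1].append(op)
        else:
            segments.append((target, [op]))
    return segments

net = ["conv2d", "relu", "conv2d", "gelu_new", "matmul", "softmax"]
segs = partition(net)
cost = sum(len(ops) * (NPU_COST if t == "npu" else FALLBACK_COST)
           for t, ops in segs)
print(segs)
print(f"relative cost {cost:.0f} vs {len(net)} if fully supported, "
      f"{len(segs) - 1} crossings")
```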
The same is true for datatypes. “You can support any number of existing data types,” says Cooper. “But as more are developed, you need some engine to support any data type that might come through. There’s a point where you have to really design your product to be future-proof by making it flexible. But there’s a tradeoff with area, and in an NPU you want to be as efficient as possible and still be programmable.”
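One way to picture that flexibility is a parameterized datatype decoder. The sketch below interprets the same 8-bit pattern under two different minifloat formats. It is illustrative software only (subnormals and specials are ignored); a future-proof NPU would implement the equivalent with configurable decode logic rather than code:

```python
def decode_minifloat(byte, exp_bits, man_bits, bias=None):
    """Decode one 8-bit floating-point value with a parameterized
    format -- the 'any datatype that might come through' flexibility
    Cooper describes. Subnormals and specials (inf/NaN) are ignored
    for brevity."""
    assert 1 + exp_bits + man_bits == 8
    if bias is None:
        bias = (1 << (exp_bits - 1)) - 1
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)

# The same byte means different things under E4M3- and E5M2-style rules.
b = 0b0_0111_100  # illustrative bit pattern
print("as E4M3:", decode_minifloat(b, exp_bits=4, man_bits=3))  # 1.5
print("as E5M2:", decode_minifloat(b, exp_bits=5, man_bits=2))  # 1.0
```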
Editor’s Note: There are many aspects to efficiency, including power and energy, as well as the development and verification effort for both hardware and software. These subjects will be discussed in a future article.
