Inference is giving AI chip startups a second chance to make their mark

AI adoption is reaching an inflection point as the focus shifts from training new models to serving them. For the AI startups vying for a slice of Nvidia's pie, it's now or never.

Compared to training, inference is a much more diverse workload, which presents an opportunity for chip startups to carve out a niche for themselves. Large batch inference requires a different mix of compute, memory, and bandwidth than an AI assistant or code agent.

Because of this, inference has become increasingly heterogeneous: certain aspects of it may be better suited to GPUs, and others to more specialized hardware.

Nvidia's $20 billion acquihire of Groq back in December is a prime example. The startup's SRAM-heavy chip architecture meant that, with enough of them, Groq's LPUs could churn out tokens faster than any GPU. However, their limited compute capacity and aging chip tech meant they couldn't scale all that efficiently.

Nvidia sidestepped this problem by moving the compute-heavy prefill stage of the inference pipeline to its GPUs while keeping the bandwidth-constrained decode operations on its shiny new LPUs.
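For a rough feel for why the two phases favor different silicon, consider the back-of-the-envelope sketch below. It leans on the common rule of thumb of about two floating-point operations per parameter per token for a dense model, and it counts only weight traffic, ignoring activations and the KV cache. The accelerator figures are hypothetical round numbers picked for illustration, not the specs of any chip mentioned here.

```python
def arithmetic_intensity(tokens_in_flight: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one pass over a dense model.

    Assumes ~2 FLOPs per parameter per token and that each weight is read
    once per pass. Activations and KV-cache traffic are ignored.
    """
    return 2.0 * tokens_in_flight / bytes_per_param


def bound_by(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the hardware's roofline ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte needed to saturate compute
    return "compute" if intensity >= ridge else "bandwidth"


# Hypothetical accelerator (assumption): 1 petaFLOPS peak, 4 TB/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 4e12

# Prefill chews through the whole prompt at once; decode emits one token per step.
prefill_intensity = arithmetic_intensity(tokens_in_flight=2048)
decode_intensity = arithmetic_intensity(tokens_in_flight=1)

prefill_limit = bound_by(prefill_intensity, PEAK_FLOPS, MEM_BANDWIDTH)
decode_limit = bound_by(decode_intensity, PEAK_FLOPS, MEM_BANDWIDTH)

print(f"prefill: {prefill_intensity:.0f} FLOPs/byte, {prefill_limit}-bound")
print(f"decode:  {decode_intensity:.0f} FLOPs/byte, {decode_limit}-bound")
```

Under these assumptions, a 2,048-token prompt does thousands of operations for every byte of weights fetched, while single-stream decode does about one. Batching more requests into each decode step raises that figure, which is part of why the mix of hardware that makes sense varies so much between workloads.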

This combination isn't unique to Nvidia. The week after GTC, AWS announced a disaggregated compute platform of its own that used its custom Trainium accelerators for prefill and Cerebras Systems' dinner-plate-sized wafer-scale accelerators for decode.

Even Intel has gotten in on the fun, announcing a reference design that'll use GPUs - presumably the one they teased last northern hemisphere fall - for prefill and AI chip startup SambaNova's new RDUs for decode.

So far, most of the AI chip startups' wins have been on the decode side of the equation. SRAM, while not particularly capacious, is stupendously fast. So with enough chips, or at least a big enough chip in the case of Cerebras, SRAM-heavy designs are well suited to accelerating decode operations. But chip startups aren't limited to this regime.
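To put "enough chips" in perspective, the sketch below estimates how many accelerators it takes just to hold a model's weights in on-chip SRAM. The per-chip SRAM capacity and the one-byte-per-parameter precision are assumptions chosen for illustration, not figures from any vendor named above, and the estimate leaves out the KV cache and activations.

```python
import math


def chips_to_hold_weights(params: float, bytes_per_param: float,
                          sram_bytes_per_chip: float) -> int:
    """Minimum number of chips whose combined SRAM fits the model weights."""
    return math.ceil(params * bytes_per_param / sram_bytes_per_chip)


# Hypothetical inputs (assumptions): a 70-billion-parameter model stored at
# 1 byte per parameter, on chips with 250 MB of SRAM each.
chips_needed = chips_to_hold_weights(
    params=70e9,
    bytes_per_param=1.0,
    sram_bytes_per_chip=250e6,
)

print(f"chips needed for weights alone: {chips_needed}")
```

With those inputs the answer lands in the hundreds of chips, which illustrates the trade-off described above: the memory is fast, but there isn't much of it per chip.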

This week, Lumai detailed its optical inference accelerator, which uses light, rather than electrons, to perform the matrix multiplication operations at the heart of most machine learning workloads, and does so using a fraction of the power of a purely digital architecture.

Lumai expects its next-gen Iris Tetra systems will achieve an exaOPS of AI performance in a 10kW power budget by 2029.
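Taken at face value, that target is simple to convert into an efficiency figure. The snippet below only does the unit arithmetic on the two numbers quoted above. The precision behind the "OPS" figure isn't stated here, so the result shouldn't be compared directly against other chips' numbers without checking that.

```python
def tops_per_watt(ops_per_second: float, watts: float) -> float:
    """Tera-operations per second delivered for each watt consumed."""
    return ops_per_second / watts / 1e12


def femtojoules_per_op(ops_per_second: float, watts: float) -> float:
    """Energy spent on each operation, in femtojoules (1 W = 1 J/s)."""
    return watts / ops_per_second * 1e15


EXA = 1e18            # one exaOPS, in operations per second
POWER_BUDGET_W = 10e3  # 10 kW

efficiency = tops_per_watt(EXA, POWER_BUDGET_W)
energy = femtojoules_per_op(EXA, POWER_BUDGET_W)

print(f"{efficiency:.0f} TOPS/W, about {energy:.0f} fJ per operation")
```

That works out to 100 TOPS per watt, or roughly 10 femtojoules per operation.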

Technically, the chips use a hybrid electro-optical architecture, but the bulk of the compute done during inference is handled by the chip's optical tensor core.

Initially, the company is positioning the chip as a standalone alternative to GPUs for compute-bound inference workloads, such as batch processing. Longer-term, the company also plans to use its optical accelerators as prefill processors.

The architecture is still in its infancy. Today it can run multi-billion-parameter models like Llama 3.1 8B or 70B, but it's far enough along that the UK-based startup has opened its chips up to neoclouds and hyperscalers for evaluation.

Having said that, not every AI chip startup is keen on using different chips for prefill and decode. Earlier this week, Tenstorrent unveiled its RISC-V-based Galaxy Blackhole compute platforms, and suffice it to say, the company's CEO Jim Keller isn't a fan of the disaggregated inference formula.

"Every company in the industry is pairing up to build the accelerator accelerator accelerator. CPUs run code. GPUs accelerate CPUs. TPUs accelerate GPUs. LPUs accelerate TPUs. And so on. This leads to complex solutions which are unlikely to be compatible with changes in AI models and uses. At Tenstorrent, we thought something more general and simpler would work," he said in a statement. ®
