SambaNova makes Llama gallop in inference cloud debut

Not to be outdone by rival AI systems upstarts, SambaNova has launched an inference cloud of its own that it says is ready to serve up Meta's largest models faster than the rest.

The cloud offering is one of several which have cropped up amid the AI boom, offering API access to popular open-weight models. Most of these are GPU-based, but for the more boutique vendors dealing in specialized hardware, like Cerebras, Groq, and now SambaNova, it seems whoever can get the largest model to spit out tokens the fastest has a leg up.

If you're not familiar, tokens here refer to how large language models encode words, word fragments, punctuation, and figures. So, the faster your infrastructure can generate tokens, the less time you're left waiting for a response.
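
To make that concrete, here's a quick sketch using the open source tiktoken library. It uses OpenAI's cl100k_base encoding rather than Llama's own tokenizer, so treat the exact counts as illustrative rather than gospel.

```python
# Rough illustration of how text gets chopped into tokens.
# Uses OpenAI's cl100k_base encoding via tiktoken; Llama 3.1 has its own
# tokenizer, so the precise split will differ - this is only illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "SambaNova says it can serve Llama 3.1 405B at 132 tokens per second."
tokens = enc.encode(text)

print(len(text.split()), "words became", len(tokens), "tokens")
print("First few tokens decoded back:", enc.decode(tokens[:5]))
```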

According to CEO Rodrigo Liang, SambaNova has managed to get Meta's 405 billion parameter Llama 3.1 model (more than twice the size of OpenAI's GPT-3.5 model) to churn out tokens at a rate of 132 per second, and at the full 16-bit precision it was trained at, no less.

To put that in perspective, it's estimated the average person can read at about five words per second. At 132 tokens a second, SambaNova's system is nearly twice as fast as the next fastest GPU systems, at least according to Artificial Analysis data cited in SambaNova's announcement.
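
If you want to sanity check those numbers yourself, the back-of-the-envelope arithmetic looks something like this; the three-quarters-of-a-word-per-token ratio is a common rule of thumb, not a measured figure.

```python
# Back-of-the-envelope comparison of generation speed versus reading speed.
# Assumes roughly 0.75 words per token - a rule of thumb, not a measurement.
WORDS_PER_TOKEN = 0.75

sambanova_tps = 132   # tokens/sec SambaNova claims for Llama 3.1 405B
best_gpu_tps = 72     # fastest GPU-based figure cited from Artificial Analysis
reading_wps = 5       # rough average human reading speed in words/sec

print(f"SambaNova: ~{sambanova_tps * WORDS_PER_TOKEN:.0f} words/sec generated")
print(f"Versus the best GPU figure: {sambanova_tps / best_gpu_tps:.1f}x faster")
print(f"Versus a human reader: {sambanova_tps * WORDS_PER_TOKEN / reading_wps:.0f}x faster")
```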

Pedal to the metal

Introduced earlier this summer, Llama 3.1 405B is Meta's first frontier-class model capable of going toe-to-toe with much larger models from the likes of OpenAI, Anthropic, and Google.

And while far smaller than competing models, running 405B at 16-bit precision isn't an easy feat, as simply fitting it into memory requires 810 GB of capacity. That's not even counting the space required by the key-value cache.
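
That 810 GB figure falls straight out of the parameter count and precision:

```python
# Memory needed just to hold a 405-billion-parameter model's weights at 16-bit precision.
params = 405e9
bytes_per_param = 2                    # FP16/BF16 = two bytes per parameter
print(f"{params * bytes_per_param / 1e9:.0f} GB of weights, before any KV cache")  # -> 810 GB
```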

To run the model, SambaNova used 16 of its SN40L accelerators, each with 64 GB of speedy HBM3 memory and 520 MB of on-die SRAM. You can find a full breakdown of the chip, codenamed Cerulean 1, on our sibling site The Next Platform.
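
A quick tally shows why 16 chips is the magic number here, at least in terms of raw capacity; how SambaNova actually shards the weights across them isn't something the biz has detailed.

```python
# Aggregate memory across the 16-chip SN40L configuration SambaNova describes.
chips = 16
hbm_gb_per_chip = 64        # HBM3 per SN40L
sram_mb_per_chip = 520      # on-die SRAM per SN40L

print(f"Total HBM3: {chips * hbm_gb_per_chip} GB versus 810 GB of 16-bit weights")  # 1,024 GB
print(f"Total on-die SRAM: {chips * sram_mb_per_chip / 1024:.1f} GB")               # ~8.1 GB
```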

Using this configuration, SambaNova boasts it's achieved a throughput of 132 tokens per second in 405B and 461 tokens a second when running the smaller 70 billion parameter variant. By comparison, data from Artificial Analysis shows that even the best GPU-based systems can only manage to serve Meta's 405B model at 72 tokens per second, with most much slower than that.

What's more, the startup claims it's able to maintain performance in excess of 100 tokens per second up to a batch size of four. Or, in other words, for up to four simultaneous requests. According to Anton McGonnell, head of SambaNova's software products division, there may be some additional headroom to scale that even further.
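
Taken at face value, that works out to a healthy aggregate rate, though real serving systems rarely scale perfectly linearly.

```python
# Aggregate throughput if each of four concurrent requests still sees 100+ tokens/sec.
per_request_tps = 100
batch_size = 4
print(f"At least {per_request_tps * batch_size} tokens/sec across the whole batch")
```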

This level of performance is possible in part thanks to the SN40L's larger caches, McGonnell told The Register. This, he added, allows it to avoid the performance overheads commonly seen in multi-GPU systems.

"If GPUs could truly utilize their memory bandwidth, they will be much faster, but they can't," he explained.

But, while SambaNova was able to get Llama 3.1 405B running at 16-bit precision, it wasn't without compromise. One of the biggest concessions is that the model isn't running at its full 128k token context window, which has instead been cut back to 8k.

"For the purposes of launch, we're just making the 8k version available, if only because of traffic," McGonnell said. "If people start using 128k, then it slows everything down for everybody else."

While this is unlikely to negatively impact performance in something like a customer service chatbot, it will limit the service's practicality for longer-context applications like document summarization.
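
The maths behind that trade-off comes down to the key-value cache, which grows linearly with context length. A rough estimate using Llama 3.1 405B's published architecture (126 layers and eight key-value heads of dimension 128 under grouped-query attention) looks like this; exact serving overheads will vary.

```python
# Rough KV-cache footprint for Llama 3.1 405B at 16-bit precision.
# Architecture figures are from Meta's published model details; serving
# overheads and batching strategy are not accounted for here.
layers, kv_heads, head_dim = 126, 8, 128
bytes_per_value = 2                                  # FP16/BF16

kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value   # keys and values

for context in (8_192, 131_072):
    gb = context * kv_bytes_per_token / 1e9
    print(f"{context:>7}-token context: ~{gb:.1f} GB of KV cache per request")
```

A single 128k-token request would claim roughly sixteen times the cache of an 8k one, which squares with McGonnell's worry about one user slowing things down for everybody else.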

The competition heats up

SambaNova Cloud's free and paid enterprise tiers are available starting today. The infrastructure provider also plans to roll out a developer tier later this year which, in addition to higher rate limits, will let devs build models based on Llama 3.1.
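
Services like this are commonly consumed through an OpenAI-style chat completions API. Here's a minimal sketch of what calling such an endpoint looks like; the base URL and model identifier are placeholders rather than SambaNova's documented values.

```python
# Minimal sketch of calling an OpenAI-compatible chat completions endpoint.
# The base_url and model name below are placeholders, not documented SambaNova values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-inference-cloud.com/v1",   # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="llama-3.1-405b",                                   # hypothetical model identifier
    messages=[{"role": "user", "content": "Explain why token throughput matters, briefly."}],
    max_tokens=200,
)

print(response.choices[0].message.content)
```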

However, as we mentioned earlier, SambaNova is far from the only infrastructure vendor leaning on speed to differentiate itself from a sea of GPU-based offerings. Cerebras, which announced its own inference cloud at the Hot Chips conference late last month, already boasts performance of up to 450 tokens per second in Llama 3.1 70B and anticipates it will be able to achieve 350 tokens per second when running the 405B variant. If Cerebras can actually pull that off, it'll put the company well ahead of SambaNova, even if doing so will require 12 of its wafer-scale chips.

There's also Groq, which has previously managed to achieve throughputs of 300 tokens a second in Llama 2 70B using some 576 of its language processing units. The firm recently nabbed $640 million in a series-D funding round, which among other things will help it ramp up the development of its next-gen accelerators. ®
