At its re:Invent conference, AWS today announced the general availability of its Trainium2 (T2) chips for training and deploying large language models (LLMs). These chips, which AWS first announced a year ago, are four times as fast as their predecessors, with a single Trainium2-powered EC2 instance of 16 T2 chips providing up to 20.8 petaflops of compute performance. In practice, that means running inference for Meta's massive Llama 405B model on Amazon's Bedrock LLM platform will offer "3x higher token-generation throughput compared to other available offerings by major cloud providers," according to AWS.
These new chips will also be deployed in what AWS calls 'EC2 Trn2 UltraServers.' These instances will feature 64 interconnected Trainium2 chips, which can scale up to 83.2 peak petaflops of compute. An AWS spokesperson told us that the 20.8-petaflops figure is for dense models at FP8 precision; the 83.2-petaflops figure is for FP8 with sparse models.
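For what it's worth, the published figures scale linearly per chip. Here is a minimal back-of-the-envelope check in Python (our own arithmetic on AWS's quoted numbers, not an AWS-published breakdown, and it assumes the instance and UltraServer figures are counted the same way):

```python
# Sanity check of AWS's quoted FP8 figures.
instance_pflops = 20.8      # one Trn2 instance, per AWS (dense FP8)
chips_per_instance = 16

per_chip_pflops = instance_pflops / chips_per_instance  # 1.3 petaflops per chip

ultraserver_chips = 64
print(per_chip_pflops * ultraserver_chips)  # 83.2
# Linear scaling of the dense instance figure lands exactly on the
# 83.2-petaflop number AWS quotes for the 64-chip UltraServer.
```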
AWS notes that these UltraServers use a NeuronLink interconnect to link all of those Trainium chips together.
The company is working with Anthropic, the LLM provider AWS has placed its (financial) bets on, to build a massive cluster of these UltraServers with "hundreds of thousands of Trainium2 chips" to train Anthropic's models. This new cluster, AWS says, will be 5x as powerful (in terms of exaflops of compute) as the cluster Anthropic used to train its current generation of models and, AWS also notes, "is expected to be the world's largest AI compute cluster reported to date."
Overall, these specs are an improvement over Nvidia's current generation of GPUs, which remain in high demand and short supply. They are dwarfed, however, by what Nvidia has promised for its next-gen Blackwell chips (up to 720 petaflops of FP8 performance in a rack with 72 Blackwell GPUs), which should arrive, after a bit of a delay, early next year.
Trainium3: 4x faster, coming in 2025
Maybe that's why AWS also used this moment to announce its next generation of chips, the Trainium3, as well. AWS expects another 4x performance gain for its Trainium3-based UltraServers, and it promises to deliver this next iteration, built on a 3-nanometer process, in late 2025. That is a very fast release cycle, though it remains to be seen how long the Trainium3 chips will stay in preview and when they will actually get into the hands of developers.
"Trainium2 is the highest performing AWS chip created to date," said David Brown, vice president of Compute and Networking at AWS, in the announcement. "And with models approaching trillions of parameters, we knew customers would need a novel approach to train and run those massive models. The new Trn2 UltraServers offer the fastest training and inference performance on AWS for the world's largest models. And with our third-generation Trainium3 chips, we will enable customers to build bigger models faster and deliver superior real-time performance when deploying them."
The Trn2 instances are now generally available in AWS' US East (Ohio) region (with other regions launching soon), while the UltraServers are currently in preview.