16 Wafer Scale Chips for an Exaflop AI Supercomputer

Cerebras Systems, the pioneer in accelerating artificial intelligence (AI) compute, today unveiled Andromeda, a 13.5 million core AI supercomputer, now available and being used for commercial and academic work. Built with a cluster of 16 Cerebras CS-2 systems and leveraging Cerebras MemoryX and SwarmX technologies, Andromeda delivers more than 1 Exaflop of AI compute and 120 Petaflops of dense compute at 16-bit half precision. It is the only AI supercomputer to ever demonstrate near-perfect linear scaling on large language model workloads relying on simple data parallelism alone.

Its 13.5 million AI-optimized compute cores, fed by 18,176 3rd Gen AMD EPYC™ processors, give Andromeda more cores than 1,953 Nvidia A100 GPUs and 1.6 times as many as the largest supercomputer in the world, Frontier, which has 8.7 million cores. Unlike any known GPU-based cluster, Andromeda delivers near-perfect scaling via simple data parallelism across GPT-class large language models, including GPT-3, GPT-J, and GPT-NeoX.

Near-perfect scaling means that as additional CS-2s are added, training time is reduced in near-perfect proportion. This holds even for large language models with very long sequence lengths, a task that is impossible to achieve on GPUs. Such GPU-impossible work was demonstrated by one of Andromeda's first users, who achieved near-perfect scaling on GPT-J at 2.5 billion and 25 billion parameters with long sequence lengths (a maximum sequence length, or MSL, of 10,240). The same users attempted the work on Polaris, a 2,000-GPU Nvidia A100 cluster, and the GPUs were unable to complete it because of GPU memory and memory-bandwidth limitations.
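To make the scaling claim concrete, here is a minimal sketch of how scaling efficiency is typically computed for a data-parallel cluster. The timings below are hypothetical placeholders for illustration, not Andromeda's published measurements:

```python
# Sketch: scaling efficiency for a data-parallel training run.
# All timings here are hypothetical, for illustration only.

def scaling_efficiency(t_single: float, t_cluster: float, n_systems: int) -> float:
    """Ratio of achieved speedup to ideal (linear) speedup on n_systems."""
    speedup = t_single / t_cluster
    return speedup / n_systems

# Hypothetical example: one CS-2 takes 160 hours; 16 CS-2s take 10.2 hours.
eff = scaling_efficiency(t_single=160.0, t_cluster=10.2, n_systems=16)
print(f"speedup: {160.0 / 10.2:.1f}x, efficiency: {eff:.1%}")
```

An efficiency near 100% is what "near-perfect linear scaling" refers to; adding a 17th system would ideally cut training time by a further 1/17.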

The Wafer-Scale Engine (WSE-2), which powers the Cerebras CS-2 system, is the largest chip ever built. The WSE-2 is 56 times larger than the largest GPU, has 123 times more compute cores, and 1,000 times more high-performance on-chip memory. The only wafer-scale processor ever produced, it contains 2.6 trillion transistors, 850,000 AI-optimized cores, and 40 gigabytes of high-performance on-wafer memory, all aimed at accelerating your AI work.

Cluster-Scale in a Single Chip

Unlike traditional devices with tiny amounts of on-chip cache memory and limited communication bandwidth, the WSE-2 features 40GB of on-chip SRAM, spread evenly across the entire surface of the chip, providing every core with single-clock-cycle access to fast memory at an extremely high bandwidth of 20PB/s. This is 1,000x more capacity and 9,800x greater bandwidth than the leading GPU.
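The headline ratios can be reproduced with back-of-the-envelope arithmetic. The GPU reference figures below (roughly 40 MB of on-chip cache and about 2 TB/s of memory bandwidth, in the range of an Nvidia A100) are assumptions for illustration, not numbers from this announcement:

```python
# Sketch: back-of-envelope check of the WSE-2 memory comparison.
# GPU reference figures are assumptions (roughly an Nvidia A100).

WSE2_SRAM_BYTES = 40e9     # 40 GB of on-chip SRAM
WSE2_MEM_BW = 20e15        # 20 PB/s aggregate on-chip memory bandwidth

GPU_ONCHIP_BYTES = 40e6    # ~40 MB on-chip cache (assumed)
GPU_MEM_BW = 2.04e12       # ~2 TB/s off-chip HBM bandwidth (assumed)

capacity_ratio = WSE2_SRAM_BYTES / GPU_ONCHIP_BYTES   # ~1,000x
bandwidth_ratio = WSE2_MEM_BW / GPU_MEM_BW            # ~9,800x
print(f"capacity: {capacity_ratio:,.0f}x, bandwidth: {bandwidth_ratio:,.0f}x")
```

Under these assumed GPU figures, the ratios land at roughly 1,000x capacity and 9,800x bandwidth, matching the claims in the text.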

High Bandwidth, Low Latency
The WSE-2 on-wafer interconnect eliminates the communication slowdown and inefficiencies of connecting hundreds of small devices via wires and cables. It delivers an astonishing 220 Pb/s interconnect bandwidth between cores. That’s more than 45,000x the bandwidth delivered between graphics processors. The result is faster, more efficient execution for your deep learning work at a fraction of the power draw of traditional GPU clusters.
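As a similar rough check, the interconnect comparison follows from dividing the on-wafer fabric bandwidth by a per-GPU link figure. The GPU-to-GPU number below (~600 GB/s, or about 4.8 Tb/s, an NVLink-class link) is an assumption for illustration:

```python
# Sketch: comparing on-wafer fabric bandwidth to an assumed
# GPU-to-GPU link (~600 GB/s = ~4.8 Tb/s, NVLink-class).

WSE2_FABRIC_BPS = 220e15   # 220 Pb/s on-wafer interconnect bandwidth
GPU_LINK_BPS = 4.8e12      # ~4.8 Tb/s per GPU link (assumed)

ratio = WSE2_FABRIC_BPS / GPU_LINK_BPS   # on the order of 45,000x
print(f"~{ratio:,.0f}x")
```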
