Microsoft Unveils Maia 200, An FP4 and FP8 Optimized AI Inference Accelerator for Azure Datacenters

Maia 200 is Microsoft’s new in-house AI accelerator designed for inference in Azure datacenters. It targets the cost of token generation for large language models and other reasoning workloads by combining narrow-precision compute, a dense on-chip memory hierarchy and an Ethernet-based scale-up fabric.

Why Microsoft built a dedicated inference chip

Training and inference stress hardware in different ways. Training needs very large all-to-all communication and long-running jobs. Inference cares about tokens per second, latency and tokens per dollar. Microsoft positions Maia 200 as its most efficient inference system, with about 30 percent better performance per dollar than the latest hardware in its fleet.
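
A 30 percent gain in performance per dollar is not the same as a 30 percent cut in cost. The short sketch below works through the arithmetic, assuming that "performance" means token throughput and that the same workload runs on both systems. The function name is illustrative and not from Microsoft.

```python
def relative_cost_per_token(perf_per_dollar_gain: float) -> float:
    """Cost per token relative to the baseline, given a fractional
    improvement in tokens per dollar (0.30 means 30 percent better)."""
    if perf_per_dollar_gain <= -1.0:
        raise ValueError("gain must be greater than -100 percent")
    return 1.0 / (1.0 + perf_per_dollar_gain)


ratio = relative_cost_per_token(0.30)
saving_percent = (1.0 - ratio) * 100.0
print(f"cost per token: {ratio:.3f}x baseline ({saving_percent:.1f} percent lower)")
```

Under those assumptions, the same token volume costs roughly 23 percent less, not 30 percent less.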

Maia 200 is part of a heterogeneous Azure stack. It will serve multiple models, including the latest GPT-5.2 models from OpenAI, and will power workloads in Microsoft Foundry and Microsoft 365 Copilot. The Microsoft Superintelligence team will use the chip for synthetic data generation and reinforcement learning to improve in-house models.

Core silicon and numeric specifications

Each Maia 200 die is fabricated on TSMC’s 3-nanometer process. The chip integrates more than 140 billion transistors.

The compute pipeline is built around native FP8 and FP4 tensor cores. A single chip delivers more than 10 petaFLOPS in FP4 and more than 5 petaFLOPS in FP8, within a 750 W SoC TDP envelope.

Memory is split between stacked HBM and on-die SRAM. Maia 200 provides 216 GB of HBM3e with about 7 TB per second of bandwidth and 272 MB of on-die SRAM. The SRAM is organized into tile-level SRAM and cluster-level SRAM and is fully software managed. Compilers and runtimes can place working sets explicitly to keep attention and GEMM kernels close to compute.
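
The compute and bandwidth figures above imply how much data reuse a kernel needs before it stops being limited by HBM. The sketch below is a back-of-the-envelope roofline calculation that uses only the quoted lower bounds. The 100 billion parameter model and its FP4 weight format are hypothetical and chosen purely for illustration.

```python
# Figures quoted in the article (treated as lower bounds or approximations).
PEAK_FP4_FLOPS = 10e15   # "more than 10 petaFLOPS" in FP4
PEAK_FP8_FLOPS = 5e15    # "more than 5 petaFLOPS" in FP8
HBM_CAPACITY = 216e9     # 216 GB of HBM3e, in bytes
HBM_BANDWIDTH = 7e12     # "about 7 TB per second", in bytes/s
TDP_WATTS = 750          # SoC TDP envelope


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """FLOPs a kernel must perform per byte of HBM traffic to be compute bound."""
    return peak_flops / bandwidth


def max_decode_steps_per_second(weight_bytes: float, bandwidth: float) -> float:
    """Upper bound on decode steps per second if each step streams all weights once."""
    return bandwidth / weight_bytes


if __name__ == "__main__":
    print(f"FP4 ridge point: {ridge_point(PEAK_FP4_FLOPS, HBM_BANDWIDTH):.0f} FLOPs/byte")
    print(f"FP8 ridge point: {ridge_point(PEAK_FP8_FLOPS, HBM_BANDWIDTH):.0f} FLOPs/byte")
    print(f"FP4 efficiency:  {PEAK_FP4_FLOPS / TDP_WATTS / 1e12:.1f} TFLOPS per watt at TDP")

    # Hypothetical model: 100e9 parameters stored in FP4 (0.5 bytes each).
    weight_bytes = 100e9 * 0.5
    assert weight_bytes <= HBM_CAPACITY  # the weights fit on one chip
    steps = max_decode_steps_per_second(weight_bytes, HBM_BANDWIDTH)
    print(f"Bandwidth-bound decode ceiling: {steps:.0f} steps/s per sequence batch")
```

At roughly 1,400 FP4 FLOPs per HBM byte, small-batch decoding sits far below the ridge point, which is why the article emphasizes SRAM placement and data movement rather than peak compute.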

Tile-based microarchitecture and memory hierarchy

The Maia 200 microarchitecture is hierarchical. The base unit is the tile, the smallest autonomous compute and storage unit on the chip. Each tile includes a Tile Tensor Unit for high-throughput matrix operations and a Tile Vector Processor as a programmable SIMD engine. Tile SRAM feeds both units, and tile DMA engines move data in and out of SRAM without stalling compute. A Tile Control Processor orchestrates the sequence of tensor and DMA work.

Multiple tiles form a cluster. Each cluster exposes a larger multi-banked Cluster SRAM that is shared across tiles in that cluster. Cluster-level DMA engines move data between Cluster SRAM and the co-packaged HBM stacks. A cluster core coordinates multi-tile execution and uses redundancy schemes for tiles and SRAM to improve yield while keeping the same programming model.

This hierarchy lets the software stack pin different parts of the model in different tiers. For example, attention kernels can keep Q, K, V tensors in tile SRAM, while collective communication kernels can stage payloads in cluster SRAM and reduce HBM pressure. The design goal is sustained high utilization as models grow in size and sequence length.
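
To make the idea of software-managed tiers concrete, here is a minimal sketch of a greedy placement pass. It is not the Maia compiler or runtime. The class and function names are hypothetical, and the tile and cluster SRAM capacities are invented placeholders, since the article only gives the 272 MB total.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    capacity_bytes: int
    used_bytes: int = 0

    def try_place(self, size: int) -> bool:
        """Reserve space if the buffer fits; report whether it did."""
        if self.used_bytes + size <= self.capacity_bytes:
            self.used_bytes += size
            return True
        return False


def place(buffers, tiers):
    """Place each (name, size, preferred_tier_index) buffer in its preferred
    tier, spilling to the next slower tier when the preferred one is full."""
    placement = {}
    for name, size, preferred in buffers:
        for tier in tiers[preferred:]:
            if tier.try_place(size):
                placement[name] = tier.name
                break
        else:
            raise MemoryError(f"{name} does not fit in any tier")
    return placement


KIB, MIB = 1024, 1024 * 1024
tiers = [
    Tier("tile_sram", 256 * KIB),    # hypothetical per-tile capacity
    Tier("cluster_sram", 8 * MIB),   # hypothetical per-cluster capacity
    Tier("hbm", 216 * 10**9),        # 216 GB, from the article
]
buffers = [
    ("q_block", 64 * KIB, 0),
    ("k_block", 64 * KIB, 0),
    ("v_block", 64 * KIB, 0),
    ("attn_scores", 128 * KIB, 0),       # will not fit next to Q, K, V
    ("allreduce_payload", 4 * MIB, 1),
    ("kv_cache", 20 * 10**9, 2),
]
plan = place(buffers, tiers)
for name, tier_name in plan.items():
    print(f"{name:18s} -> {tier_name}")
```

With these placeholder sizes, the Q, K and V blocks stay in tile SRAM, the attention scores spill to cluster SRAM, and the KV cache lives in HBM. A real compiler would also account for buffer lifetimes and DMA overlap, which this sketch ignores.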

On-chip data movement and Ethernet scale-up fabric

Inference is often limited by data movement, not peak compute. Maia 200 uses a custom Network on Chip along with a hierarchy of DMA engines. The Network on Chip spans tiles, clusters, memory controllers and I/O units. It has separate planes for large tensor traffic and for small control messages. This separation keeps synchronization and small outputs from being blocked behind large transfers.
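
The benefit of separate planes can be seen in a toy queueing model. The sketch below is not a model of the actual Network on Chip arbitration. It assumes all messages are queued at time zero, each plane is a single FIFO link, and the message names and cycle counts are made up.

```python
def fifo_finish_times(queue):
    """Serve (name, cycles) messages in order on one link; return finish times."""
    clock = 0
    finished = {}
    for name, cycles in queue:
        clock += cycles
        finished[name] = clock
    return finished


def shared_plane(messages):
    """All traffic shares one FIFO, so small messages wait behind large ones."""
    return fifo_finish_times([(name, cycles) for name, cycles, _ in messages])


def split_planes(messages):
    """Data and control traffic each get their own FIFO and proceed in parallel."""
    finished = {}
    for plane in ("data", "control"):
        queue = [(name, cycles) for name, cycles, p in messages if p == plane]
        finished.update(fifo_finish_times(queue))
    return finished


# (name, transfer time in cycles, plane) in arrival order; values are illustrative.
messages = [
    ("tensor_a", 1000, "data"),
    ("sync_flag", 1, "control"),
    ("tensor_b", 1000, "data"),
    ("token_out", 2, "control"),
]

shared = shared_plane(messages)
split = split_planes(messages)
for name, _, _ in messages:
    print(f"{name:10s} shared={shared[name]:5d}  split={split[name]:5d}")
```

In the shared case the one-cycle synchronization flag finishes at cycle 1001 because it sits behind a large tensor transfer. With a dedicated control plane it finishes at cycle 1, while the tensor transfers are essentially unaffected.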

Beyond the chip boundary, Maia 200 integrates its own NIC and an Ethernet-based scale-up network that runs the AI Transport Layer protocol. The on-die NIC exposes about 1.4 TB per second in each direction, or 2.8 TB per second of bidirectional bandwidth, and scales to 6,144 accelerators in a two-tier domain.

Within each tray, four Maia accelerators form a Fully Connected Quad. These four devices have direct non-switched links to each other. Most tensor-parallel traffic stays inside this group, while only lighter collective traffic goes out to switches. This improves latency and reduces switch port count for typical inference collectives.
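
The topology numbers above can be checked with a few lines of arithmetic. The sketch below enumerates the direct links in one quad and counts the quads in a full domain. The device labels and helper names are illustrative, and the transfer-time helper is only a lower bound, because the article does not say how the 1.4 TB per second is divided between quad links and switch-facing links.

```python
from itertools import combinations

QUAD_SIZE = 4                  # accelerators per Fully Connected Quad (from the article)
DOMAIN_SIZE = 6144             # accelerators in a two-tier domain (from the article)
NIC_BW_PER_DIRECTION = 1.4e12  # bytes/s, "about 1.4 TB per second" (from the article)


def quad_links(devices):
    """Every direct point-to-point link in a fully connected group."""
    return list(combinations(devices, 2))


def quads_in_domain(total=DOMAIN_SIZE, quad_size=QUAD_SIZE):
    """Number of quads needed to cover the domain exactly."""
    quads, leftover = divmod(total, quad_size)
    if leftover:
        raise ValueError("domain size is not a multiple of the quad size")
    return quads


def min_send_seconds(payload_bytes, bandwidth=NIC_BW_PER_DIRECTION):
    """Lower bound on send time if the full one-direction rate were available."""
    return payload_bytes / bandwidth


links = quad_links(["maia0", "maia1", "maia2", "maia3"])
print(f"direct links per quad: {len(links)}")
print(f"peers per device:      {QUAD_SIZE - 1}")
print(f"quads per domain:      {quads_in_domain()}")
print(f"1 GB send lower bound: {min_send_seconds(1e9) * 1e3:.3f} ms")
```

Six direct links per quad means each accelerator reaches its three tensor-parallel peers without crossing a switch, and a full 6,144-accelerator domain contains 1,536 such quads.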

Azure system integration and cooling

At the system level, Maia 200 follows the same rack, power and mechanical standards as Azure GPU servers. It supports air-cooled and liquid-cooled configurations and uses a second-generation closed-loop liquid cooling Heat Exchanger Unit for high-density racks. This allows mixed deployments of GPUs and Maia accelerators in the same datacenter footprint.

The accelerator integrates with the Azure control plane. Firmware management, health monitoring and telemetry use the same workflows as other Azure compute services. This enables fleet-wide rollouts and maintenance without disrupting running AI workloads.

Key Takeaways

Here are four concise technical takeaways:

Inference-first design: Maia 200 is Microsoft’s first silicon and system platform built only for AI inference, optimized for large-scale token generation in modern reasoning models and large language models.
Numeric specs and memory hierarchy: The chip is fabricated on TSMC’s 3 nm process, integrates more than 140 billion transistors and delivers more than 10 PFLOPS in FP4 and more than 5 PFLOPS in FP8, with 216 GB of HBM3e at about 7 TB per second and 272 MB of on-chip SRAM split into tile SRAM and cluster SRAM and managed in software.
Performance versus other cloud accelerators: Microsoft reports about 30 percent better performance per dollar than the latest Azure inference systems and claims three times the FP4 performance of third-generation Amazon Trainium and higher FP8 performance than Google TPU v7 at the accelerator level.
Tile-based architecture and Ethernet fabric: Maia 200 organizes compute into tiles and clusters with local SRAM, DMA engines and a Network on Chip, and exposes an integrated NIC with about 1.4 TB per second of Ethernet bandwidth per direction that scales to 6,144 accelerators, using Fully Connected Quad groups as the local tensor-parallel domain.

Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
