Google Cloud Next 2025: Infrastructure Updates Focused on AI Inference Advancements

Is AI the Perfect Umbrella for Zonal Object Storage, 400G Networking, and Managed SLURM Clusters?

At the recent Google Cloud Next ’25 event, the company emphasized AI as the overarching theme connecting several key technological advancements. These include zonal object storage solutions, 400G network bandwidth, and managed clusters using the SLURM workload manager. These three pillars have now been unified under the umbrella of the AI Hypercomputer brand, representing Google’s strategic focus on infrastructure optimized for artificial intelligence workloads.

In a blog post welcoming attendees to Google Cloud Next ’25, Thomas Kurian highlighted recent developments within the AI Hypercomputer portfolio. Notable features include new chips such as Ironwood, virtual machine (VM) types A4 and A4X, inference capabilities on Google Kubernetes Engine (GKE), and the Pathways runtime. A dedicated article expands on these updates, providing details on the hardware and software enhancements, including the upcoming 400G networking capacity. However, many of the rollout timelines remain uncertain, with availability dates often only broadly announced as “later this year.”

GB200 in Preview, Ironwood Chips in Development

The general availability of the A4 VM series was announced at the NVIDIA GTC conference in mid-March. A single configuration is currently offered: 8 B200 GPUs with 1,440 GB of combined GPU memory, 224 vCPUs, 3,968 GB of host memory, 12 TB of local SSD, and 3,600 Gbit/s of network bandwidth. These high-performance VMs target AI training and inference workloads.
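
For readers who want to try the A4 shape, a minimal provisioning sketch with the google-cloud-compute Python client is shown below. The machine type name a4-highgpu-8g, the boot image, and the default network are assumptions to check against the Compute Engine documentation for your project and region.

```python
# Minimal sketch: create an A4 VM with the google-cloud-compute client.
# The machine type "a4-highgpu-8g", the boot image and the network below
# are placeholders/assumptions; adjust them to your project and region.
from google.cloud import compute_v1


def create_a4_vm(project: str, zone: str, name: str) -> None:
    instance = compute_v1.Instance(
        name=name,
        # Assumed A4 machine type: 8x B200, 224 vCPUs, 3,968 GB of RAM.
        machine_type=f"zones/{zone}/machineTypes/a4-highgpu-8g",
        disks=[
            compute_v1.AttachedDisk(
                boot=True,
                auto_delete=True,
                initialize_params=compute_v1.AttachedDiskInitializeParams(
                    # Placeholder image; GPU fleets typically use a curated image.
                    source_image="projects/debian-cloud/global/images/family/debian-12",
                    disk_size_gb=200,
                ),
            )
        ],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
        # GPU VMs cannot live-migrate; stop them during host maintenance.
        scheduling=compute_v1.Scheduling(on_host_maintenance="TERMINATE"),
    )
    operation = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance
    )
    operation.result()  # Block until the create operation finishes.
```

In practice, A4 capacity is usually obtained through reservations or the Dynamic Workload Scheduler mentioned later in this article rather than ad-hoc instance creation.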

The A4X VM instances, introduced in preview at the same event, are still being tested. Each A4X VM hosts four GB200 chips, targeting even larger-scale AI workloads.

Looking ahead, Google plans to release its next-generation ASICs, the Ironwood chips, later this year. These are the seventh generation of Google’s Tensor Processing Units (TPUs) and will ship in pod configurations of 256 or 9,216 chips. Google claims the largest setup delivers 42.5 exaflops of compute, surpassing the El Capitan supercomputer; the figure is measured at FP8 precision, however, which suits AI inference rather than the FP64 arithmetic used for scientific workloads and for El Capitan’s official ranking. The company also reports double the performance per watt of the previous TPU generation and highlights improvements to the SparseCores, which accelerate the embedding (vector) representations used in recommendation models.
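
The exaflops figure can be sanity-checked with a one-liner, assuming the per-chip peak of roughly 4,614 TFLOPS (FP8) that Google quoted for Ironwood:

```python
# Back-of-the-envelope check of the Ironwood pod claim.
PER_CHIP_TFLOPS_FP8 = 4_614          # assumed per-chip peak from Google's announcement
CHIPS_PER_POD = 9_216                # largest announced pod size

pod_exaflops = CHIPS_PER_POD * PER_CHIP_TFLOPS_FP8 / 1_000_000  # TFLOPS -> EFLOPS
print(f"{pod_exaflops:.1f} EFLOPS")  # ~42.5, matching the headline figure
```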

Google’s Concept of “RDMA Firewall”… When Will It Arrive?

One of the anticipated network innovations is the introduction of 400G connectivity options for Cloud Interconnect and Cross-Cloud Interconnect services. Currently, these services provide physical network connections up to 100 Gbit/s, connecting Google Cloud’s VPCs with on-premises data centers or across multiple cloud providers.

Google is also working on a feature referred to as “RDMA Firewall,” although specifics remain limited. The idea is to implement security policies directly at the network interface level, potentially securing Remote Direct Memory Access (RDMA) traffic. The company has not disclosed a timeline for this feature’s release, but it signals a move toward more secure and high-performance network interconnectivity.

Exabytes of Storage on the Horizon

By the end of June, Google plans to let users experiment with Hyperdisk Exapools, a larger-scale variant of the Hyperdisk storage pools launched last year. These pools promise to provision exabytes of capacity (one exabyte is a billion gigabytes) with throughput measured in terabytes per second, enabling massive-scale data handling for AI and data analytics applications.

Traditional storage pools have also been expanded, from a maximum of 1 petabyte (PB) to 5 PB. The core principle remains the same: provision a defined budget of capacity, bandwidth, and IOPS (Input/Output Operations Per Second), then distribute these resources among workloads as needed.
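
To make that model concrete, here is a toy Python sketch of the shared-budget idea; the class, field names, and figures are illustrative and are not the actual Google Cloud API.

```python
# Toy model of the Hyperdisk storage-pool idea: a single provisioned budget of
# capacity, throughput and IOPS that individual volumes draw from. Class names
# and numbers are illustrative only, not a Google Cloud API.
from dataclasses import dataclass, field


@dataclass
class StoragePool:
    capacity_tb: float
    throughput_mbps: int
    iops: int
    volumes: list = field(default_factory=list)

    def _used(self, index: int) -> float:
        return sum(v[index] for v in self.volumes)

    def carve_volume(self, name: str, size_tb: float, mbps: int, iops: int) -> None:
        if (self._used(1) + size_tb > self.capacity_tb
                or self._used(2) + mbps > self.throughput_mbps
                or self._used(3) + iops > self.iops):
            raise ValueError(f"pool budget exceeded for {name}")
        self.volumes.append((name, size_tb, mbps, iops))


# A 5 PB pool (the new upper bound for classic pools) shared by two workloads.
pool = StoragePool(capacity_tb=5_000, throughput_mbps=500_000, iops=2_000_000)
pool.carve_volume("training-scratch", size_tb=1_000, mbps=100_000, iops=400_000)
pool.carve_volume("feature-store", size_tb=2_000, mbps=150_000, iops=600_000)
```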

Meanwhile, zonal object storage is currently in private preview under the Rapid Storage brand, positioning it against Amazon’s S3 Express One Zone. The service appears to be a thin layer over Colossus, Google’s internal distributed file system, exposed through a gRPC-based interface. Google claims read performance up to 20 times faster for random access patterns compared to regional buckets.

Similarly, Anywhere Cache is now generally available. It allows users to deploy caches within the same zone as their workloads, reducing inter-zone data transfer costs and latency by eliminating the need to read data from multi-region buckets.

Rebranding: From Hypercompute Cluster to Cluster Director

Google has renamed its orchestration platform previously known as Hypercompute Cluster to Cluster Director. The core functionality remains unchanged: providing a unified management layer for large fleets of accelerators—including GPUs, TPUs, and other hardware—using APIs from Compute Engine or GKE. This simplifies the administration of complex AI and HPC (High Performance Computing) environments.

Google currently offers Cluster Director for GKE, with promises to add new features later this year focusing on observability and ensuring workload continuity. A version tailored for SLURM clusters is also available in early access, giving users flexibility across different workload management systems.

Pushing Inference Capabilities into Kubernetes

Public preview is underway for new “special inference capabilities” within GKE. These include the following, with a short illustrative sketch after the list:

  • Utilizing model server metrics—such as cache utilization and queue length—for autoscaling and load balancing
  • Multiplexing multiple LoRA (Low-Rank Adaptation) models on a single accelerator
  • Enhanced observability of inference request flows
  • Routing inference requests based on Kubernetes API specifications
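
As a rough illustration of the first two bullets, the sketch below shows metric-aware replica selection in plain Python; the metric names, threshold, and replica labels are placeholders rather than the actual GKE gateway logic.

```python
# Toy sketch of the routing idea behind the new GKE inference features:
# pick a model-server replica using live metrics (queue depth, KV-cache use)
# instead of plain round-robin. Names and thresholds are illustrative only.
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    queue_length: int             # pending requests reported by the model server
    kv_cache_utilization: float   # 0.0 - 1.0


def pick_replica(replicas: list[Replica]) -> Replica:
    # Prefer replicas with free KV cache, then the shortest queue.
    healthy = [r for r in replicas if r.kv_cache_utilization < 0.9]
    pool = healthy or replicas
    return min(pool, key=lambda r: (r.queue_length, r.kv_cache_utilization))


replicas = [
    Replica("vllm-0", queue_length=12, kv_cache_utilization=0.95),
    Replica("vllm-1", queue_length=3, kv_cache_utilization=0.40),
]
print(pick_replica(replicas).name)   # -> vllm-1
```

In GKE these signals feed the platform’s own load balancing and autoscaling machinery rather than application code.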

Additionally, Google is testing the GKE Inference Quickstart. This utility allows users to specify inference requirements and automatically generates optimized Kubernetes configurations aligned with Google’s best practices, streamlining AI deployment in containers.
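
The idea behind such a generator can be sketched in a few lines; everything below (field values, serving image, scaling rule) is a made-up illustration, not the Quickstart’s actual output.

```python
# Toy sketch of what an "inference quickstart" style generator does: map a
# requirements spec to a Kubernetes Deployment skeleton. Field names, model
# names and resource choices are placeholders.
def generate_manifest(model: str, target_latency_ms: int) -> dict:
    # Naive rule: tighter latency budgets get more accelerators per replica.
    gpus = 8 if target_latency_ms < 50 else 1
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model}-server"},
        "spec": {
            "replicas": 2,
            "template": {"spec": {"containers": [{
                "name": "model-server",
                "image": "vllm/vllm-openai:latest",   # placeholder serving image
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }]}},
        },
    }


print(generate_manifest("gemma", target_latency_ms=40)["spec"])
```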

Expanding Pathways for Advanced Model Training

Google’s custom runtime for training its Gemini models is based on the Pathways framework. This system leverages asynchronous dataflow orchestration, enabling flexible management of workloads using a single JAX client. It simplifies the deployment of parallel and distributed training patterns essential for large-scale AI model development.
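
For context, the single-client pattern that Pathways generalizes looks like the following minimal JAX sketch, which runs on whatever devices are locally visible; Pathways itself supplies the orchestration layer that stretches this single controller across many hosts and accelerator slices.

```python
# Minimal single-client sketch with stock JAX sharding APIs: one Python
# process drives a computation sharded across every visible device.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())            # whatever accelerators are visible
mesh = Mesh(devices, axis_names=("data",))   # 1-D mesh over all devices
sharding = NamedSharding(mesh, P("data"))    # shard the batch dimension

x = jax.device_put(jnp.ones((8 * devices.size, 128)), sharding)


@jax.jit
def step(batch):
    # Compiled once; XLA executes it across all shards from this one client.
    return jnp.tanh(batch @ batch.T).mean()


print(step(x))
```

The appeal of this single-controller model is that sharding decisions live in one program rather than being replicated across per-host launcher scripts.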

Pathways is now open to Google Cloud customers in pre-general availability (pre-GA), alongside the activation of vLLM on TPUs and extensions to the AI Hypercomputer’s dynamic workload scheduler. The scheduler now supports newer chips such as Trillium and TPU v5e, as well as the A3 Ultra (H200) and A4 (B200) VM types. Initially, these features are available in a flexible, on-demand mode dubbed “Flex Start”; a reserved-scheduling option called “Calendar mode,” which lets users book compute time for larger training jobs, is due later this month.

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.