IaaS, AI Inference, and Office Apps: How Microsoft Is Making Its Cloud Smarter

No trusted-launch VMs, no GPUs on AKS, no remediation actions in the policy engine: in offline mode, Azure Local comes with functional limitations.

Nevertheless, this mode has just reached general availability. It complements the so-called “limited connectivity” mode, which does not require hosting the control plane locally but does send some data to the cloud, starting with logs.

In offline mode, Azure Local currently supports creating Windows VMs (Windows 10 Enterprise, Windows Server 2022 and 2025) and Linux VMs (Ubuntu 22.04 and 24.04 LTS). Management of vanilla Kubernetes clusters and AKS is still in preview, as are trusted-launch VMs (secure boot, vTPM and attestation).

The management cluster must comprise at least three physical nodes, each with at least 96 GB of RAM, 24 physical cores and 2 TB of NVMe storage. Some operations, such as creating network interfaces and SSH keys (for AKS), cannot be performed from the Azure portal. Identity synchronization cannot be triggered on demand; it runs every 15 minutes.
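To make the sizing rule concrete, here is a minimal sketch that checks a proposed set of nodes against the minimums reported above (three nodes, 96 GB of RAM, 24 physical cores and 2 TB of NVMe per node). The `Node` type and the `check_management_cluster` helper are hypothetical illustrations, not part of any Azure Local tooling; the thresholds are simply the figures quoted in this article.

```python
from dataclasses import dataclass

# Minimums as reported in the article for an offline-mode management cluster.
MIN_NODES = 3
MIN_RAM_GB = 96
MIN_PHYSICAL_CORES = 24
MIN_NVME_TB = 2


@dataclass
class Node:
    # Hypothetical description of one physical node.
    name: str
    ram_gb: int
    physical_cores: int
    nvme_tb: float


def check_management_cluster(nodes: list[Node]) -> list[str]:
    """Return a list of problems; an empty list means the minimums are met."""
    problems = []
    if len(nodes) < MIN_NODES:
        problems.append(f"need at least {MIN_NODES} nodes, got {len(nodes)}")
    for node in nodes:
        if node.ram_gb < MIN_RAM_GB:
            problems.append(f"{node.name}: {node.ram_gb} GB RAM < {MIN_RAM_GB} GB")
        if node.physical_cores < MIN_PHYSICAL_CORES:
            problems.append(f"{node.name}: {node.physical_cores} cores < {MIN_PHYSICAL_CORES}")
        if node.nvme_tb < MIN_NVME_TB:
            problems.append(f"{node.name}: {node.nvme_tb} TB NVMe < {MIN_NVME_TB} TB")
    return problems


if __name__ == "__main__":
    cluster = [Node(f"node{i}", 96, 24, 2) for i in range(1, 4)]
    print(check_management_cluster(cluster))  # prints [] (minimums met)
    # Aggregate capacity of a minimal cluster: prints 288 72
    print(sum(n.ram_gb for n in cluster), sum(n.physical_cores for n in cluster))
```

In other words, the smallest compliant cluster adds up to 288 GB of RAM, 72 physical cores and 6 TB of NVMe storage.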

Microsoft 365 tailored for Azure Local

Another offering has also reached general availability: Microsoft 365 Local. It lets organizations deploy Exchange Server, SharePoint Server and Skype for Business Server (Subscription Edition) on Azure Local reference architectures. The prerequisite: Premier-certified hardware (about twenty configurations are available, across Dell AX and APEX, Lenovo ThinkAgile and HPE ProLiant).


Microsoft has committed to supporting all three products until at least the end of 2035.

Expanded catalog for Foundry Local

Foundry Local remains in preview, but is welcoming larger models to its catalog.

This local version of Microsoft Foundry (formerly Azure AI Foundry) can be installed on Windows 10 (x64), Windows 11 (x64/Arm), Windows Server 2025 and macOS (Apple silicon). It provides an API and a REST server, an SDK (C#, Python, JavaScript) and the ONNX Runtime. Inference runs locally, but the network may be used to download models and components and, potentially, to share logs.

For now, the API only supports chat completions, although the SDK also allows Whisper speech recognition models to be used. Designed for single-node deployments, Foundry Local supports neither autoscaling, nor concurrent requests (parallelism must be handled at the application level), nor continuous batching. As for the catalog, its 25 models are still far from the more than 8,000 offered by the cloud version of Foundry.
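Since concurrency has to be managed by the application, one common approach is to cap the number of requests in flight on the client side. The sketch below does this with an `asyncio.Semaphore`; `fake_chat_completion` is a hypothetical stand-in for a real chat-completions call to the local REST server (no actual Foundry Local endpoint, port or SDK function is assumed), and the cap of one request at a time is an illustrative choice.

```python
import asyncio


# Hypothetical stand-in for a chat-completions call to the local REST server.
async def fake_chat_completion(prompt: str) -> str:
    await asyncio.sleep(0.01)  # simulate inference latency
    return f"answer to: {prompt}"


class LimitedClient:
    """Lets at most `max_in_flight` requests reach the local server at once."""

    def __init__(self, send, max_in_flight: int = 1):
        self._send = send
        self._sem = asyncio.Semaphore(max_in_flight)
        self.in_flight = 0
        self.peak = 0  # highest number of simultaneous requests observed

    async def chat(self, prompt: str) -> str:
        # Requests beyond the cap wait here until a slot frees up.
        async with self._sem:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await self._send(prompt)
            finally:
                self.in_flight -= 1


async def main() -> tuple[list[str], int]:
    client = LimitedClient(fake_chat_completion, max_in_flight=1)
    prompts = [f"question {i}" for i in range(5)]
    # The application fans out freely; the semaphore queues the excess.
    answers = await asyncio.gather(*(client.chat(p) for p in prompts))
    return answers, client.peak


if __name__ == "__main__":
    answers, peak = asyncio.run(main())
    print(peak)  # prints 1: requests were handled one at a time
```

Keeping the limit in a single client object means the rest of the application can issue requests freely while the local model only ever sees the load it can handle.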

The 25 available models

Model	Size	License	Hardware variants
Phi-3-mini-4k-instruct	2.1 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (QNN, Vitis)
Phi-3-mini-128k-instruct	2.1 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (QNN, Vitis)
Phi-3.5-mini-instruct	2.1 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (QNN, Vitis)
Phi-4-mini-instruct	3.6 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (OpenVINO, Vitis)
Phi-4-mini-reasoning	3.1 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO); NPU (OpenVINO, Vitis)
Phi-4	8.4 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT)
Phi-4-reasoning	8.4 GB	MIT	CPU; GPU (CUDA, WebGPU)
DeepSeek-R1-Distill-Qwen-1.5B	1.4 GB	MIT	GPU (TensorRT)
DeepSeek-R1-Distill-Qwen-7B	5.3 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (OpenVINO, Vitis)
DeepSeek-R1-Distill-Qwen-14B	9.8 GB	MIT	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (QNN)
Qwen2.5-0.5B-Instruct	0.5 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (OpenVINO, Vitis)
Qwen2.5-Coder-0.5B-Instruct	0.5 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (OpenVINO, Vitis)
Qwen2.5-1.5B-Instruct	1.3 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (OpenVINO, QNN)
Qwen2.5-Coder-1.5B-Instruct	1.3 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (OpenVINO, Vitis)
Qwen2.5-7B-Instruct	4.7 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (Vitis)
Qwen2.5-Coder-7B-Instruct	4.7 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT); NPU (OpenVINO, Vitis)
Qwen2.5-14B-Instruct	8.8 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT)
Qwen2.5-Coder-14B-Instruct	8.8 GB	Apache 2.0	CPU; GPU (CUDA, WebGPU, OpenVINO, TensorRT)
Mistral-7B-Instruct-v0.2	4.3 GB	Apache 2.0	GPU (OpenVINO); NPU (OpenVINO, Vitis)
gpt-oss-20b	9.7 GB	Apache 2.0	CPU; GPU (CUDA)
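As an illustration of how the table can guide a choice, the sketch below picks the largest listed model that offers a given hardware variant and fits under a size budget. It transcribes only a handful of rows, and treating the listed size as a proxy for the memory a model needs is an assumption made for the example; the `Model` type, the `"GPU:CUDA"`-style variant labels and `largest_fitting` are hypothetical and do not correspond to any Foundry Local API.

```python
from typing import NamedTuple, Optional


class Model(NamedTuple):
    name: str
    size_gb: float
    license: str
    variants: frozenset[str]  # hypothetical labels such as "CPU", "GPU:CUDA", "NPU:QNN"


GPU_ALL = {"GPU:CUDA", "GPU:WebGPU", "GPU:OpenVINO", "GPU:TensorRT"}

# A few rows transcribed from the table above.
CATALOG = [
    Model("Phi-4-mini-instruct", 3.6, "MIT", frozenset({"CPU", *GPU_ALL, "NPU:OpenVINO", "NPU:Vitis"})),
    Model("Phi-4", 8.4, "MIT", frozenset({"CPU", *GPU_ALL})),
    Model("DeepSeek-R1-Distill-Qwen-14B", 9.8, "MIT", frozenset({"CPU", *GPU_ALL, "NPU:QNN"})),
    Model("Qwen2.5-7B-Instruct", 4.7, "Apache 2.0", frozenset({"CPU", *GPU_ALL, "NPU:Vitis"})),
    Model("Mistral-7B-Instruct-v0.2", 4.3, "Apache 2.0", frozenset({"GPU:OpenVINO", "NPU:OpenVINO", "NPU:Vitis"})),
    Model("gpt-oss-20b", 9.7, "Apache 2.0", frozenset({"CPU", "GPU:CUDA"})),
]


def largest_fitting(catalog: list[Model], variant: str, budget_gb: float) -> Optional[Model]:
    """Largest model offering `variant` whose listed size fits the budget, else None."""
    candidates = [m for m in catalog if variant in m.variants and m.size_gb <= budget_gb]
    return max(candidates, key=lambda m: m.size_gb, default=None)


if __name__ == "__main__":
    print(largest_fitting(CATALOG, "GPU:CUDA", 10).name)  # DeepSeek-R1-Distill-Qwen-14B
    print(largest_fitting(CATALOG, "NPU:Vitis", 8).name)  # Qwen2.5-7B-Instruct
    print(largest_fitting(CATALOG, "CPU", 1))  # None
```

In practice, the actual memory footprint also depends on context length and runtime overhead, so a budget well below the available memory is the safer setting.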