Local AI

Install and Run AI Models Locally with Microsoft Foundry Local

Set up private, cloud-free AI inference on your machine in under 20 minutes

Trần Quang Hùng, Chief Explainer of Things
December 10, 2025 · 9 min read

QUICK INFO

Difficulty: Beginner
Time Required: 15-20 minutes
Prerequisites: Terminal/command line basics, 8GB RAM minimum (16GB recommended), 3GB free disk space
Tools Needed: Windows 10/11 (x64/ARM), Windows Server 2025, or macOS with Apple silicon

What You'll Learn:

  • Install Foundry Local via package manager or manual download
  • Run your first local AI model with a single command
  • Integrate local models into applications using Python, JavaScript, or C# SDKs
  • Manage cached models and control the Foundry service

Foundry Local runs generative AI models directly on your hardware with no Azure subscription, no API keys, and no data leaving your device. This guide covers installation, running models via CLI, and integrating with applications through the OpenAI-compatible API.

Getting Started

System Requirements

Operating Systems:

  • Windows 10 (x64), Windows 11 (x64/ARM), Windows Server 2025
  • macOS (Apple silicon only)

Hardware:

  • Minimum: 8GB RAM, 3GB free disk space
  • Recommended: 16GB RAM, 15GB free disk space

Optional Hardware Acceleration:

  • NVIDIA GPU (RTX 2000 series or newer)
  • AMD GPU (6000 series or newer)
  • Intel iGPU, Intel NPU (requires 32GB+ system memory)
  • Qualcomm Snapdragon X Elite (8GB+ memory)
  • Apple silicon (GPU acceleration included)

NPU support on Windows requires version 24H2 or later. Intel NPU users should install the Intel NPU driver separately.

Installation

Windows (via WinGet):

Open PowerShell or Command Prompt and run:

winget install Microsoft.FoundryLocal

macOS (via Homebrew):

brew tap microsoft/foundrylocal
brew install foundrylocal

Verify the installation:

foundry --help

Expected result: A list of available commands and their descriptions.

Manual Installation (Alternative):

Download installers from the GitHub releases page. On Windows, download the .msix package matching your architecture (x64 or arm64) and install via PowerShell:

Add-AppxPackage .\FoundryLocal.msix

Running Your First Model

Start an Interactive Chat

Run a model with a single command:

foundry model run phi-3.5-mini

Foundry Local downloads the model variant optimized for your hardware automatically. On NVIDIA systems, it fetches the CUDA version. On Qualcomm NPUs, it downloads the NPU-optimized variant. Without GPU or NPU, it defaults to the CPU version.

Expected result: After download completes, an interactive chat session starts. Type your prompt and press Enter to receive responses.

Available Models

List all models in the catalog:

foundry model list

The output displays model aliases, supported devices (CPU/GPU/NPU), file sizes, and licenses.

Common models include:

Model              Parameters  Use Case
phi-3.5-mini       3.8B        General chat, coding assistance
phi-4              14B         Complex reasoning, analysis
qwen2.5-0.5b       0.5B        Lightweight tasks, low-resource devices
qwen2.5-coder-14b  14B         Code generation
mistral-7b-v0.2    7B          General purpose, multilingual
deepseek-r1        Various     Reasoning tasks

Model Information

Get details about a specific model before running:

foundry model info phi-3.5-mini

This shows available variants, hardware requirements, license terms, and download size.

CLI Command Reference

Model Commands

Command                         Description
foundry model list              List all available models
foundry model run <model>       Download (if needed) and start interactive chat
foundry model download <model>  Download model without running
foundry model load <model>      Load model into service memory
foundry model unload <model>    Remove model from memory
foundry model info <model>      Display model details

Service Commands

Command                  Description
foundry service start    Start the Foundry Local service
foundry service stop     Stop the service
foundry service restart  Restart the service
foundry service status   Check service status and endpoint
foundry service ps       List currently loaded models

Cache Commands

Command                       Description
foundry cache list            List downloaded models
foundry cache location        Show cache directory path
foundry cache remove <model>  Delete a model from cache
foundry cache cd <path>       Change cache directory

Filtering Models

Filter the model list by hardware or task:

foundry model list --filter device=GPU
foundry model list --filter task=chat-completion
foundry model list --filter alias=phi*

Negate filters with !:

foundry model list --filter device=!CPU

Integrating with Applications

Foundry Local exposes an OpenAI-compatible REST API. After loading a model, applications can send requests to the local endpoint.

Check the API Endpoint

foundry service status

The output shows the endpoint URL, typically http://127.0.0.1:5273/v1.
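Before wiring up an SDK, you can sanity-check the endpoint from Python alone. This is a minimal sketch assuming the service runs on the default port and that, like other OpenAI-compatible servers, it answers GET /v1/models with an OpenAI-style model list; the helper names here are illustrative, not part of any SDK.

```python
import json
import urllib.request

# Default local endpoint reported by `foundry service status` (yours may differ).
ENDPOINT = "http://127.0.0.1:5273/v1"

def extract_model_ids(models_json: dict) -> list[str]:
    """Pull model IDs out of an OpenAI-style GET /v1/models response."""
    return [m["id"] for m in models_json.get("data", [])]

def list_loaded_models() -> list[str]:
    """Query the local service for the models it currently exposes."""
    with urllib.request.urlopen(f"{ENDPOINT}/models") as resp:
        return extract_model_ids(json.load(resp))

# Example (requires a running Foundry Local service):
# print(list_loaded_models())
```

If the call raises a connection error, run `foundry service start` and check the endpoint again.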

Python Integration

Install the SDK:

pip install foundry-local-sdk openai

Example script:

import openai
from foundry_local import FoundryLocalManager

alias = "phi-3.5-mini"
manager = FoundryLocalManager(alias)

client = openai.OpenAI(
    base_url=manager.endpoint,
    api_key=manager.api_key
)

response = client.chat.completions.create(
    model=manager.get_model_info(alias).id,
    messages=[{"role": "user", "content": "Explain recursion in programming."}],
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

The FoundryLocalManager starts the service if not running and loads the specified model automatically.

JavaScript/Node.js Integration

Install dependencies:

npm install foundry-local-sdk openai

Example:

import { OpenAI } from "openai";
import { FoundryLocalManager } from "foundry-local-sdk";

const alias = "phi-3.5-mini";
const manager = new FoundryLocalManager();
const modelInfo = await manager.init(alias);

const openai = new OpenAI({
  baseURL: manager.endpoint,
  apiKey: manager.apiKey,
});

const stream = await openai.chat.completions.create({
  model: modelInfo.id,
  messages: [{ role: "user", content: "What is the golden ratio?" }],
  stream: true,
});

for await (const chunk of stream) {
  if (chunk.choices[0]?.delta?.content) {
    process.stdout.write(chunk.choices[0].delta.content);
  }
}

C# Integration

Add the NuGet package:

dotnet add package Microsoft.AI.Foundry.Local.WinML

The C# SDK runs entirely in-process without requiring the Foundry CLI or HTTP calls to the local service.

using Microsoft.AI.Foundry.Local;

var config = new Configuration { AppName = "my-app" };
await FoundryLocalManager.CreateAsync(config);
var mgr = FoundryLocalManager.Instance;

var catalog = await mgr.GetCatalogAsync();
var model = await catalog.GetModelAsync("qwen2.5-0.5b");

await model.DownloadAsync(progress => Console.Write($"\rDownloading: {progress:F1}%"));
await model.LoadAsync();

var chatClient = await model.GetChatClientAsync();
var messages = new List<ChatMessage> 
{
    new ChatMessage { Role = "user", Content = "Why is the sky blue?" }
};

var response = chatClient.CompleteChatStreamingAsync(messages, CancellationToken.None);
await foreach (var chunk in response)
{
    Console.Write(chunk.Choices[0].Message.Content);
}

await model.UnloadAsync();

Direct REST API Usage

Without SDKs, send requests directly to the endpoint:

curl http://127.0.0.1:5273/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Phi-3.5-mini-instruct-generic-gpu",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false
  }'

Use the exact model ID from foundry model list, not the alias, when calling the API directly.
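The same request can be made from the Python standard library with no SDK installed. A minimal sketch mirroring the curl call above, assuming the default port (5273) and a model that has already been loaded; `build_payload` and `chat` are illustrative helper names, not part of any SDK.

```python
import json
import urllib.request

ENDPOINT = "http://127.0.0.1:5273/v1/chat/completions"

def build_payload(model_id: str, prompt: str) -> bytes:
    """Build the JSON request body the chat-completions endpoint expects."""
    return json.dumps({
        "model": model_id,  # exact model ID, not the alias
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }).encode("utf-8")

def chat(model_id: str, prompt: str) -> str:
    """Send one non-streaming chat request and return the reply text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=build_payload(model_id, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (requires a running service with the model loaded):
# print(chat("Phi-3.5-mini-instruct-generic-gpu", "Hello"))
```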

Troubleshooting

Symptom: foundry: command not found after installation

Fix: Close and reopen your terminal to refresh the PATH. On Windows, try opening a new PowerShell window as Administrator.

Symptom: Exception: Request to local service failed when running model list

Fix: The service may have crashed. Run foundry service restart and try again.

Symptom: Model download stalls or fails

Fix: Check your internet connection. If resuming a partial download, run foundry cache remove <model> then retry the download.

Symptom: Out of memory errors during inference

Fix: Try a smaller model (e.g., qwen2.5-0.5b instead of phi-4). Close other memory-intensive applications. On systems with less than 16GB RAM, avoid models larger than 7B parameters.

Symptom: GPU not being utilized (inference running on CPU)

Fix: Verify GPU drivers are up to date. NVIDIA requires driver version 32.0.15.5585 or newer with CUDA 12.5+. Run foundry model list to confirm GPU variants appear.

Symptom: NPU model not available on Windows ARM

Fix: NPU support requires Windows 24H2 or later. Check Windows Update for the latest version. Intel NPU users must install the Intel NPU driver separately.

What's Next

You now have Foundry Local running AI models locally with full privacy. For production deployments, see the Foundry Local documentation for converting custom Hugging Face models to ONNX format using Microsoft Olive.


PRO TIPS

  • Use foundry model download <model> to pre-cache models before going offline
  • Set a custom cache location on a larger drive with foundry cache cd /path/to/drive
  • Run foundry service set --port 8081 to change the default API port (5273) if it conflicts with other services
  • Models stay in memory for 10 minutes by default after the last request (TTL). Load explicitly with foundry model load for persistent availability
  • On multi-GPU systems, use foundry service set --gpu 1 to specify which GPU to use

COMMON MISTAKES

  • Using the model alias instead of the full model ID when calling the REST API directly: The alias (e.g., phi-3.5-mini) works in CLI commands but REST calls require the exact model ID (e.g., Phi-3.5-mini-instruct-generic-gpu). Get IDs from foundry model list.
  • Forgetting to load the model before API calls: The SDKs handle this automatically, but direct REST API users must run foundry model load <model> first or the service returns 404.
  • Installing on unsupported macOS (Intel): Foundry Local requires Apple silicon. Intel Macs are not supported.
  • Running multiple large models simultaneously: Each loaded model consumes RAM/VRAM. Unload unused models with foundry model unload <model> before loading another.

PROMPT TEMPLATES

System Prompt for Structured Output

You are a data extraction assistant. Extract information from user text and return valid JSON only. No explanations. No markdown formatting.

Schema: {"name": string, "date": string, "amount": number}

Customize by: Modify the JSON schema to match your data structure.

Example output:

{"name": "Invoice #4521", "date": "2025-03-15", "amount": 1250.00}
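To use this template against the local endpoint, wrap it in a messages list and validate the reply before trusting it, since small local models occasionally deviate from the schema. A minimal sketch: `build_messages` and `parse_extraction` are illustrative helpers I'm introducing here, and the schema keys mirror the prompt above.

```python
import json

# The structured-output system prompt from above, verbatim.
SYSTEM_PROMPT = (
    "You are a data extraction assistant. Extract information from user "
    "text and return valid JSON only. No explanations. No markdown "
    "formatting.\n\n"
    'Schema: {"name": string, "date": string, "amount": number}'
)

def build_messages(user_text: str) -> list[dict]:
    """Messages list ready for any OpenAI-compatible chat-completions call."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_text},
    ]

def parse_extraction(reply: str) -> dict:
    """Parse the model's reply and verify the expected schema keys exist."""
    data = json.loads(reply)
    missing = {"name", "date", "amount"} - data.keys()
    if missing:
        raise ValueError(f"reply missing fields: {sorted(missing)}")
    return data
```

Pass `build_messages(...)` as the `messages` argument in any of the integration examples earlier in this guide, then run the reply through `parse_extraction` before using it.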

Code Review Assistant

Review this code for bugs, security issues, and performance problems. List findings as:
- [SEVERITY] Location: Description

Severity levels: CRITICAL, HIGH, MEDIUM, LOW

Customize by: Add language-specific rules or focus areas (e.g., "Focus on SQL injection vulnerabilities").

Example output:

- [HIGH] Line 23: User input passed directly to SQL query without sanitization
- [MEDIUM] Line 45: Unnecessary database call inside loop, move outside
- [LOW] Line 12: Variable 'temp' declared but never used

FAQ

Q: Does Foundry Local send any data to Microsoft or the cloud?
A: No. All inference happens locally. An internet connection is only needed to download models initially. After caching, models run entirely offline.

Q: Can I use models from Hugging Face that aren't in the catalog?
A: Yes. Convert models to ONNX format using Microsoft Olive, then place them in a custom directory structure that Foundry Local can read. See the custom models documentation.

Q: What's the difference between the alias and model ID?
A: The alias (e.g., phi-3.5-mini) is a friendly name that auto-selects the best hardware variant. The model ID (e.g., Phi-3.5-mini-instruct-generic-gpu) specifies an exact variant. Use aliases in CLI commands; use IDs for direct API calls.

Q: How do I update Foundry Local?
A: On Windows: winget upgrade --id Microsoft.FoundryLocal. On macOS: brew upgrade foundrylocal.

Q: Can I run Foundry Local in a Docker container or CI/CD pipeline?
A: Foundry Local is designed for desktop/edge devices with direct hardware access. Container support is limited. For server deployments, consider Azure AI Foundry instead.

Q: What licenses apply to the models?
A: Each model has its own license (MIT, Apache 2.0, etc.) shown in foundry model info <model>. Review terms before commercial use.



Tags: Microsoft Foundry Local, local AI, ONNX Runtime, Phi models, Qwen models, OpenAI API, on-device inference, privacy AI, edge AI, CLI tools
Trần Quang Hùng
Chief Explainer of Things

Hùng is the guy his friends text when their Wi-Fi breaks, their code won't compile, or their furniture instructions make no sense. Now he's channeling that energy into guides that help thousands of readers solve problems without the panic.
