Open-Source AI

How to Run Local LLMs: Complete Setup Guide for 2026

Get a private ChatGPT alternative running on your machine, no API keys required.

Trần Quang Hùng
Chief Explainer of Things
December 20, 2025 · 9 min read
[Image: Laptop displaying a local AI chat interface in a home office setting]

SEO METADATA

Meta Title: How to Run Local LLMs: Complete Setup Guide for 2026
Meta Description: Set up a local AI chatbot in under 30 minutes. Covers Ollama, hardware requirements, model selection, and Open WebUI for a ChatGPT-like experience.
URL Slug: how-to-run-local-llm-guide
Primary Keyword: how to run local llm
Secondary Keywords: ollama tutorial, local ai setup, llama.cpp guide, open webui setup, local chatgpt alternative
Tags: ["local-llm", "ollama", "open-webui", "llama", "qwen", "ai-privacy", "self-hosted-ai", "gpu-inference", "quantization", "gguf"]


QUICK INFO

Difficulty Beginner
Time Required 20-45 minutes
Prerequisites Basic command line familiarity, admin rights on your machine
Tools Needed Ollama (free), Docker (optional), 8GB+ RAM minimum

What You'll Learn:

  • Install and configure Ollama to run AI models locally
  • Choose the right model for your hardware
  • Set up Open WebUI for a browser-based chat interface
  • Troubleshoot common issues with GPU detection and memory

GUIDE

Run a Local LLM Without Touching the Cloud

Get a private ChatGPT alternative running on your machine, no API keys required.

This guide walks you through setting up local language model inference from scratch. You'll end up with a working chat interface that runs entirely on your hardware. The whole stack is free and open source.

Who this is for: anyone who wants to experiment with AI models privately, avoid API costs, or just understand how this stuff works under the hood. You don't need machine learning experience.

The Stack

There are roughly three layers to a local LLM setup: an inference engine that actually runs the model, a model file you download, and optionally a frontend that makes chatting less painful than typing into a terminal.

For most people, the fastest path is Ollama plus Open WebUI. Ollama handles downloading models and running inference. Open WebUI gives you a browser interface that looks and feels like ChatGPT. Both are free.

If you want more control or have unusual hardware, you might eventually look at llama.cpp directly (Ollama uses it internally) or vLLM for high-throughput scenarios. But start with Ollama. Seriously. You can always dig deeper later.

Step 1: Install Ollama

Head to ollama.com and grab the installer for your platform. On macOS and Windows, it's a standard installer. On Linux, there's a one-liner:

curl -fsSL https://ollama.com/install.sh | sh

After installation, Ollama runs as a background service. You can verify it's working:

ollama --version

Expected result: Something like ollama version 0.5.x. The exact number doesn't matter much.
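Ollama also runs a local HTTP API on port 11434, and hitting it is a quick sanity check that the background service is actually up (the default port and root endpoint come from Ollama's docs; adjust if you've changed OLLAMA_HOST):

```shell
# Confirm the Ollama background service is listening (default port 11434)
curl -s http://localhost:11434/
# A healthy install responds with: Ollama is running
```
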

GPU Detection

Ollama automatically detects NVIDIA GPUs if you have CUDA drivers installed. For AMD cards on Linux, ROCm support exists but can be finicky. Apple Silicon Macs use Metal acceleration out of the box.

If you're on Windows with an NVIDIA card and inference seems slow, check that you have reasonably recent GPU drivers. Ollama should report using your GPU when you run a model.
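One quick way to check is `ollama ps`, which reports where a loaded model is running. A sketch, using the small model from the next step:

```shell
# Load a model briefly, then check where it landed
ollama run llama3.2:3b "hello" > /dev/null
ollama ps
# The PROCESSOR column should read "100% GPU" if acceleration is working;
# "100% CPU" means Ollama isn't seeing your graphics card
```
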

Step 2: Download Your First Model

Models come from Ollama's registry, similar to Docker images. To download and immediately start chatting with a model:

ollama run llama3.2:3b

This pulls the 3-billion parameter Llama 3.2 model (about 2GB download) and drops you into an interactive prompt. Type a message, hit enter, watch it generate.

Type /bye to exit.
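You're not limited to the interactive prompt. `ollama run` accepts a prompt as an argument, and the same background service answers HTTP requests, which is handy for scripting (request shape per Ollama's `/api/generate` API docs):

```shell
# One-shot prompt, no interactive session
ollama run llama3.2:3b "Summarize why quantization reduces memory use."

# Same request over the local REST API
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Summarize why quantization reduces memory use.",
  "stream": false
}'
```
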

Picking a Model Size

The number after the colon indicates the parameter count, which roughly correlates with capability and resource requirements. Here's the practical breakdown:

For 8GB VRAM or 16GB system RAM, you can comfortably run 7-8B parameter models at 4-bit quantization. That's models like llama3.1:8b, qwen3:8b, or mistral:7b. These handle most everyday tasks: summarization, Q&A, basic coding help, and brainstorming.

With 12-16GB VRAM, you can push into 13-14B territory or run smaller models at higher quality settings. The qwen3:14b and gemma3:12b models hit a nice sweet spot here.

24GB VRAM (RTX 3090/4090) opens up 30B-class models, which are a noticeable step up in reasoning and coding quality on many tasks.

Below 8GB total, stick to the smallest models: qwen3:0.6b, gemma3:1b, or phi3.5. They're limited but surprisingly usable for simple tasks.

What's Quantization?

When you download a model from Ollama, you're usually getting a quantized version. The raw model weights use 16 or 32 bits per parameter. Quantization compresses them to 4 or 8 bits, dramatically reducing memory requirements with modest quality loss.

A 7B model that would need ~14GB at full precision fits in ~4GB when quantized to 4-bit. Ollama defaults to Q4_K_M quantization for most models, which is a reasonable quality/size tradeoff.

You'll sometimes see filenames like model.Q4_K_M.gguf on Hugging Face. The Q4 means 4-bit, K_M is a specific quantization method. For most users, the Ollama defaults work fine.
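The memory math above is easy to reproduce yourself. A rough sketch that covers weights only, ignoring KV cache and runtime overhead (which is why a "3.5GB" model needs roughly 4GB in practice):

```shell
# Approximate weight memory: parameters (billions) x bits per weight / 8 = GB
estimate_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

estimate_gb 7 16   # FP16 7B model: 14.0 GB
estimate_gb 7 4    # 4-bit 7B model: 3.5 GB
```
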

Step 3: Manage Your Models

List what you've downloaded:

ollama list

Remove a model you don't need:

ollama rm llama3.2:3b

Models live in ~/.ollama/models on Linux/macOS and somewhere in AppData on Windows. They add up fast if you're experimenting.
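When disk space gets tight, a couple of commands show where it went (paths as above; on Windows, check the Ollama directory under AppData instead):

```shell
# Total disk space used by downloaded models (Linux/macOS)
du -sh ~/.ollama/models

# Per-model sizes, to pick pruning candidates
ollama list
```
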

Adding a Web Interface

The terminal interface works, but it's clunky for longer conversations. Open WebUI gives you a proper chat interface with conversation history, model switching, and features like document uploads.

Docker Install (Recommended)

If you have Docker installed, this is the fastest path:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

After the container starts, open http://localhost:3000 in your browser. You'll need to create an account (stored locally, not sent anywhere).

Open WebUI automatically connects to Ollama running on the same machine. Select a model from the dropdown and start chatting.

All-in-One Option

If you don't have Ollama installed separately, Open WebUI has a bundled image that includes both:

docker run -d -p 3000:8080 --gpus=all \
  -v ollama:/root/.ollama \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:ollama

Drop the --gpus=all flag if you don't have an NVIDIA GPU.

Non-Docker Install

Open WebUI can also run directly via pip, though it requires Python 3.11+ and has more dependencies:

pip install open-webui
open-webui serve

This serves on port 8080 by default. Add --host localhost if you want to restrict it to local access only.
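If you go the pip route, installing into a virtual environment keeps Open WebUI's fairly heavy dependency tree away from the rest of your Python setup. A sketch (the env path is arbitrary; package and serve command as above):

```shell
# Isolate Open WebUI in its own virtual environment (needs Python 3.11+)
python3 -m venv ~/open-webui-env
source ~/open-webui-env/bin/activate
pip install open-webui
open-webui serve --host localhost --port 8080
```
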

Troubleshooting

Symptom: "connection refused" when Open WebUI tries to reach Ollama

The Docker container can't see localhost the same way your host machine does. Ensure you used --add-host=host.docker.internal:host-gateway in the run command. If you're on Linux and that doesn't work, try --network=host instead.
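Another fix on Linux is to make Ollama listen on all interfaces so the container can reach it through the host gateway. OLLAMA_HOST is Ollama's documented listen-address variable; note this also exposes port 11434 to your LAN unless you firewall it:

```shell
# Bind Ollama to all interfaces instead of loopback only
export OLLAMA_HOST=0.0.0.0:11434
ollama serve

# For a systemd-managed install, set it in the service override instead:
sudo systemctl edit ollama    # add: Environment="OLLAMA_HOST=0.0.0.0:11434"
sudo systemctl restart ollama
```
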

Symptom: Model runs extremely slowly (under 1 token/second)

This usually means the model is running on CPU instead of GPU, or it's too large for your VRAM and is spilling into system RAM. Check nvidia-smi while generating to see GPU utilization. If it's near zero, Ollama isn't using your GPU; update your NVIDIA drivers and reinstall Ollama.

Also check the model size. If you're running a 70B model on 8GB VRAM, you're going to have a bad time regardless of what the docs promise.
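To watch this live, kick off a generation in the background and sample the GPU while it runs (nvidia-smi ships with the NVIDIA drivers; exact numbers will vary):

```shell
# Start a generation in the background
ollama run llama3.2:3b "Write a short paragraph about anything." > /dev/null &

# Sample GPU state every second while it generates
watch -n 1 nvidia-smi
# Look for the ollama process in the process table and non-zero GPU-Util;
# zero utilization during generation means inference is on the CPU
```
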

Symptom: Out of memory errors

Try a smaller model or more aggressive quantization. You can also reduce context length in Open WebUI settings (Admin Panel > Settings > Connections > Ollama), though this limits how much conversation history the model can reference.

Model Recommendations

For general chat, Qwen3 models have been performing well relative to their size. The 8B variant handles most tasks competently and runs on mid-range hardware. Llama 3.2 is another solid choice with good instruction-following.

For coding tasks, Qwen3-Coder variants or DeepSeek-Coder are worth trying. They're trained specifically on code and perform better than general-purpose models on programming questions.

For truly constrained environments (4GB VRAM, older laptops), Microsoft's Phi-3.5-mini or Google's Gemma 3 1B punch above their weight. Don't expect miracles, but they can summarize text and answer simple questions.

I haven't tested every model combination extensively, and the ecosystem moves fast. Check r/LocalLLaMA for current recommendations; the community there tracks what's working in practice.

What's Next

Once you have the basics running, there are rabbit holes in every direction. RAG (Retrieval-Augmented Generation) lets you chat with your own documents. Open WebUI supports this natively. Fine-tuning lets you specialize a model for specific tasks. And if Ollama feels limiting, llama.cpp gives you direct control over quantization, context size, and inference parameters.

But honestly? Start by just using it. Run a model, have some conversations, see where it helps and where it falls short. The theoretical stuff makes more sense once you've hit the practical limitations.


PRO TIPS

The keyboard shortcut to stop generation mid-response in Open WebUI is Escape. Useful when the model goes off track.

If you're running multiple models, ollama ps shows what's currently loaded in memory. Models stay loaded for a few minutes after use, which speeds up subsequent requests but consumes VRAM.

Context length matters more than people realize. A model with 8K context can "remember" about 6,000 words of conversation. Once you exceed that, early messages get dropped. Open WebUI shows token count in the interface.
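You can also raise the context window on the Ollama side. One documented route is a Modelfile that sets num_ctx (parameter name per Ollama's Modelfile reference; larger contexts consume more VRAM, so don't set it higher than you need):

```shell
# Build a variant of a model with a larger context window
cat > Modelfile <<'EOF'
FROM llama3.2:3b
PARAMETER num_ctx 8192
EOF
ollama create llama3.2-8k -f Modelfile
ollama run llama3.2-8k
```
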

Running Ollama with OLLAMA_DEBUG=1 environment variable dumps verbose logs. Helpful when GPU detection isn't working.


FAQ

Q: Can I run this completely offline? A: Yes. Once models are downloaded, everything runs locally. No internet required for inference.

Q: How does local LLM quality compare to ChatGPT? A: Smaller models (7-8B) are roughly GPT-3.5 tier on many tasks. Larger local models (70B+) can approach GPT-4 quality but require serious hardware. No local model currently matches GPT-4o or Claude across the board, but the gap is shrinking.

Q: Is my data private? A: With this setup, prompts never leave your machine. No telemetry, no logging to external servers. That's the main appeal for many users.

Q: Can I use these models commercially? A: Depends on the model license. Llama models have restrictions above certain revenue thresholds. Qwen, Mistral 7B, and Gemma use Apache 2.0 licenses that allow commercial use. Check individual model cards.

Q: Why is the first response slow but subsequent ones faster? A: The model needs to load into VRAM before inference starts. Ollama keeps models loaded briefly after use, so immediate follow-up requests skip this step.


RESOURCES

Tags: local-llm, ollama, llama, qwen, ai-privacy, self-hosted-ai
Trần Quang Hùng

Chief Explainer of Things

Hùng is the guy his friends text when their Wi-Fi breaks, their code won't compile, or their furniture instructions make no sense. Now he's channeling that energy into guides that help thousands of readers solve problems without the panic.

