QUICK INFO
| Difficulty | Beginner to Intermediate |
| Time Required | 30-45 minutes |
| Prerequisites | Python 3.9+, basic command line familiarity, pip |
| Tools Needed | Python environment, 4GB+ RAM, internet connection |
What You'll Learn:
- Find and evaluate models from the Hub without wading through garbage
- Use pipelines for quick inference vs. AutoModel for control
- Configure caching so you stop re-downloading multi-GB models
- Deploy a working demo on Spaces
GUIDE
How to Use Hugging Face Without Wasting Your Time
A practical guide to the model hub, transformers library, and Spaces
This guide covers what you actually need to know about Hugging Face to get productive. It skips the marketing pitch and focuses on the parts that matter: finding models that work, loading them correctly, and not filling your disk with duplicate downloads. If you've tried Hugging Face before and found it confusing, or if you're starting from scratch, this is written for you.
What Hugging Face Actually Is
Hugging Face started as an NLP company but evolved into something closer to GitHub for machine learning. The platform has three main components you'll interact with: the Hub (where models, datasets, and demos live), the transformers library (Python code for using those models), and Spaces (free hosting for demos).
The Hub currently hosts over 2 million models according to their documentation. That number sounds impressive until you realize most of them are fine-tunes of fine-tunes, abandoned experiments, or duplicates. The skill isn't accessing models; it's finding the ones worth using.
Getting Started
Install the core libraries first:
pip install transformers datasets huggingface_hub
You'll also need a deep learning framework. PyTorch is the safer choice for compatibility:
pip install torch
If you're on a Mac with Apple Silicon, PyTorch works natively with MPS acceleration. On Linux with NVIDIA GPUs, make sure your CUDA version matches what PyTorch expects. I won't cover CUDA troubleshooting here because it's a rabbit hole, but the PyTorch website has a configurator that generates the right install command.
Create a Hugging Face account at huggingface.co if you haven't. You don't need it for basic usage, but gated models (like Llama 2 and some others) require you to accept licenses through your account.
Finding Models That Actually Work
The search interface at huggingface.co/models looks straightforward but rewards some strategy. Here's how to filter effectively:
Start with the task filter. If you need text generation, sentiment analysis, or image classification, the task dropdown narrows thousands of options to hundreds. Sort by "Most Downloads" rather than "Trending" because download counts indicate models that people actually use in production, not just models that got social media attention this week.
Check the model card before downloading anything. A good model card lists: what the model does, what it was trained on, known limitations, and example code. If the model card is empty or just says "fine-tuned BERT," that's a red flag. The author either didn't document it or doesn't expect anyone else to use it.
Look at the "Files and versions" tab. Models with hundreds of megabytes of weights are common; models with tens of gigabytes need a plan. A 7B parameter model in fp16 is roughly 14GB. If that's going to eat your disk space or crash your laptop, consider the quantized versions (look for "GGUF" or "AWQ" in model names).
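The size arithmetic is simple enough to script before you commit to a download. A rough sketch (parameter count and bytes-per-parameter are the only inputs; real checkpoints add a small amount of overhead for metadata and non-weight files):

```python
def model_size_gb(num_params: float, bytes_per_param: int) -> float:
    """Rough on-disk / in-memory size of a model's weights."""
    return num_params * bytes_per_param / 1e9

# A 7B-parameter model at different precisions:
print(model_size_gb(7e9, 4))  # fp32: 28.0 GB
print(model_size_gb(7e9, 2))  # fp16: 14.0 GB
print(model_size_gb(7e9, 1))  # int8: 7.0 GB
```

This is why quantized variants matter: dropping from fp16 to 4-bit cuts a 14GB download to roughly 4GB.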
For text generation specifically, models from Meta (Llama family), Mistral AI, and Qwen tend to be well-documented and widely tested. For embeddings, sentence-transformers models are battle-tested. For image generation, Stable Diffusion models from Stability AI or Runway have the most community support.
Using Pipelines for Quick Inference
The pipeline API is the fastest path from zero to working code. It handles tokenization, model loading, and output formatting automatically.
from transformers import pipeline
# Sentiment analysis
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product")
print(result) # [{'label': 'POSITIVE', 'score': 0.9998}]
The first call downloads and caches the default model for that task. Subsequent runs load from cache, which takes a few seconds instead of minutes.
To use a specific model instead of the default:
generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")
response = generator("Hello, how are you?", max_new_tokens=50)
Expected result: A list containing one dictionary with a "generated_text" key. The actual text varies based on the model and any sampling parameters you set.
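Pulling the text out of that structure looks like this — shown here against a hypothetical result literal rather than a live model call, since the actual generated text varies:

```python
# Shape of a text-generation pipeline result; the text itself is made up here
response = [{"generated_text": "Hello, how are you? I'm doing well, thanks."}]

text = response[0]["generated_text"]
print(text)
```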
Pipelines work for dozens of tasks: text-generation, summarization, translation, question-answering, fill-mask, image-classification, object-detection, automatic-speech-recognition, and more. The full list is in the transformers documentation.
One thing that trips people up: pipeline downloads go to ~/.cache/huggingface/hub by default. On shared systems or when disk space matters, you'll want to change that. More on caching in a bit.
Using AutoModel for More Control
Pipelines abstract away details you sometimes need. When you want to access embeddings, modify generation parameters precisely, or integrate with training code, use AutoModel and AutoTokenizer directly.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
inputs = tokenizer("The weather today is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
The from_pretrained method does the heavy lifting: it downloads model weights, loads configuration, and returns a ready-to-use object. The tokenizer handles converting text to token IDs and back.
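The text-to-IDs round trip is easy to picture with a toy vocabulary. This is an illustration of the concept only, not the real GPT-2 tokenizer (which uses byte-pair encoding over subwords, not whole words):

```python
# Toy word-level "tokenizer" with a made-up four-word vocabulary
vocab = {"The": 0, "weather": 1, "today": 2, "is": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Text -> token IDs (word-level split, illustration only)."""
    return [vocab[w] for w in text.split()]

def decode(ids):
    """Token IDs -> text."""
    return " ".join(id_to_token[i] for i in ids)

ids = encode("The weather today is")
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # The weather today is
```

A real tokenizer does the same mapping over a vocabulary of tens of thousands of subword pieces, which is why it ships alongside the model weights: the IDs only mean something relative to that specific vocabulary.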
For larger models, you'll want to specify device and precision:
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    torch_dtype=torch.float16,
    device_map="auto",
)
The device_map="auto" setting distributes the model across available GPUs (or CPU if no GPU). This is essential for models that don't fit in a single GPU's memory.
Expected result: The model loads without memory errors. If you see CUDA out of memory, either your GPU doesn't have enough VRAM or you need to use quantization.
Managing Cache (Stop Re-downloading Everything)
The default cache location is ~/.cache/huggingface/. On my machine, this grew to 40GB before I noticed. Models don't get cleaned up automatically.
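A quick way to check how big the cache has grown before it surprises you (adjust the path if you've moved HF_HOME):

```python
from pathlib import Path

def dir_size_gb(path):
    """Total size of all files under path, in GB."""
    p = Path(path).expanduser()
    if not p.exists():
        return 0.0
    return sum(f.stat().st_size for f in p.rglob("*") if f.is_file()) / 1e9

print(f"{dir_size_gb('~/.cache/huggingface'):.1f} GB")
```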
To change the cache location, set an environment variable before importing transformers. HF_HOME is the current variable; TRANSFORMERS_CACHE is older and deprecated in recent versions, so setting HF_HOME alone is enough:
export HF_HOME="/path/to/your/cache"
Or in Python:
import os
os.environ["HF_HOME"] = "/data/model_cache"
from transformers import pipeline # Import after setting env vars
To pre-download models (useful for offline environments or CI pipelines):
huggingface-cli download gpt2
To see what's cached:
huggingface-cli scan-cache
To delete specific models from cache:
huggingface-cli delete-cache
The delete command is interactive and shows you size information before confirming. I've recovered 30GB+ by clearing models I downloaded once for testing and never used again.
Authentication for Gated Models
Some models require accepting a license before download. Llama models, certain research checkpoints, and proprietary model distillations fall into this category.
Log in via the CLI:
huggingface-cli login
This prompts for an access token, which you generate at huggingface.co/settings/tokens. The token gets stored locally, and subsequent downloads authenticate automatically.
In code, if you're scripting deployments:
from huggingface_hub import login
login(token="hf_xxxxxxxxxxxxx")
Or set the HF_TOKEN environment variable. The library checks for it automatically.
Spaces: Deploying Demos
Spaces gives you free hosting for Gradio or Streamlit apps. This is genuinely useful for sharing demos with people who can't run Python locally.
Create a Space through the web interface or the CLI:
huggingface-cli repo create my-demo-space --type=space --space_sdk=gradio
A minimal Gradio app in app.py:
import gradio as gr
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
def analyze(text):
    result = classifier(text)[0]
    return f"{result['label']}: {result['score']:.2%}"
demo = gr.Interface(fn=analyze, inputs="text", outputs="text")
demo.launch()
Push it like a Git repo:
git clone https://huggingface.co/spaces/your-username/my-demo-space
cd my-demo-space
# Add app.py and requirements.txt
git add .
git commit -m "Initial commit"
git push
Spaces builds and deploys automatically. Free tier has limited CPU and RAM, which means large model inference is slow or crashes. For anything beyond demos, you'll need either Inference Endpoints (paid) or your own infrastructure.
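Spaces installs Python dependencies from a requirements.txt at the repo root. For the sentiment demo above, something like this is enough (pin versions if you want reproducible builds):

```text
transformers
torch
```

Note that on Gradio-SDK Spaces, gradio itself is provided by the platform and pinned via the sdk_version field in the Space's README metadata, so it usually doesn't need to appear in requirements.txt.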
Troubleshooting
Symptom: OSError: Can't load tokenizer for 'model-name'
Fix: The model might require accepting a license. Check the model card for access requirements and make sure you're logged in.
Symptom: OutOfMemoryError: CUDA out of memory
Fix: Your model is too large for your GPU. Options: use a smaller model, load in 8-bit quantization (load_in_8bit=True; newer transformers versions prefer passing a BitsAndBytesConfig), or fall back to CPU with device_map="cpu".
Symptom: Model downloads every time despite caching
Fix: You might be setting cache_dir inconsistently between runs, or an old transformers version has a different cache structure. Check your environment variables and update to the latest transformers.
Symptom: trust_remote_code warning or error
Fix: Some models include custom Python code that runs during loading. You have to explicitly allow this with trust_remote_code=True. Only do this for models from sources you trust.
What's Next
You now have the core workflow: finding models, loading them efficiently, managing disk space, and deploying demos. For fine-tuning models on custom data, look at the Trainer API documentation or the PEFT library for parameter-efficient methods. The datasets library (which we didn't cover in depth) handles loading and preprocessing training data.
PRO TIPS
Set TOKENIZERS_PARALLELISM=false if you see fork-related warnings when using tokenizers in multiprocessing code. The warning is harmless but annoying.
Use model.half() or torch.float16 dtype for inference when possible. Half precision uses half the memory with negligible quality loss for most tasks.
The accelerate library handles multi-GPU setups better than manual device management. Install it and use device_map="auto" for models that need distribution.
When testing code interactively, load the tokenizer first and verify it works before loading the model. Tokenizer problems surface faster and waste less time than waiting for a multi-GB download to fail.
FAQ
Q: Which models are free to use commercially? A: Check the license field on each model card. Apache 2.0 and MIT are permissive. Llama 2 has a custom license allowing commercial use with conditions. Many research models are "non-commercial only."
Q: How do I run models offline?
A: Download models in advance with huggingface-cli download model-name, then set HF_HUB_OFFLINE=1 or use local_files_only=True in from_pretrained().
Q: Why is my first inference slow but subsequent ones fast? A: The first call compiles computation graphs (especially with PyTorch 2.0's compile feature) and warms up GPU kernels. This is normal.
Q: Can I use Hugging Face models with TensorFlow instead of PyTorch?
A: Yes, use TFAutoModel classes instead of AutoModel. But honestly, most examples and community support assume PyTorch.
RESOURCES
- Transformers documentation: API reference and tutorials
- Model Hub: Browse and filter available models
- Hugging Face course: Free video lessons (covers more than this guide)
- Gradio documentation: For building Spaces demos