How to Get Started with the LLMs-from-scratch Repository

QUICK INFO


Difficulty	Intermediate
Time Required	40-80 hours for full curriculum
Prerequisites	Python proficiency, basic calculus, linear algebra fundamentals
Tools Needed	Python 3.10+, PyTorch 2.x, Jupyter Notebook, 8GB+ RAM

What You'll Learn:

Implement a GPT-style language model from the ground up in PyTorch
Build the complete LLM pipeline: tokenization, attention mechanisms, pretraining, and finetuning
Load pretrained GPT-2 weights in the main chapters and Llama weights in the bonus materials
Apply instruction finetuning and text classification techniques

GUIDE

This guide walks you through setting up Sebastian Raschka's LLMs-from-scratch repository and navigating its seven-chapter curriculum. You'll learn the environment requirements, chapter structure, and how to work through the material effectively. The guide assumes Python proficiency and targets developers who want to understand transformer architectures by building them.

Getting Started

The repository's setup folder contains a setup guide with OS-specific instructions. Consult it if any of the steps below behave differently on your system.

Prerequisites Check

You need confident Python skills: classes, functions, NumPy operations. Neural network familiarity helps but isn't required (Appendix A covers PyTorch basics). Mathematical prerequisites include derivatives, chain rule concepts, and matrix multiplication.

Step 1: Clone the Repository

git clone --depth 1 https://github.com/rasbt/LLMs-from-scratch.git
cd LLMs-from-scratch

Expected result: Folders ch01 through ch07, appendix directories, and a setup folder.
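To confirm the clone produced the expected layout, a small check like the following can help. It is a minimal sketch: the function name `missing_chapter_folders` is hypothetical, and it only looks for the `ch01` through `ch07` and `setup` folders mentioned above, not the appendix directories.

```python
from pathlib import Path


def missing_chapter_folders(repo_root):
    """Return the expected folder names that are absent under repo_root."""
    root = Path(repo_root)
    # ch01 ... ch07 plus the setup folder named in the expected result.
    expected = [f"ch{n:02d}" for n in range(1, 8)] + ["setup"]
    return [name for name in expected if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_chapter_folders(".")
    print("All expected folders found." if not missing else f"Missing: {missing}")
```

Run it from inside the cloned LLMs-from-scratch directory. An empty result means every expected folder is present.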

Step 2: Create a Python Environment

python -m venv llm-env
source llm-env/bin/activate  # Windows: llm-env\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Installation typically takes 5-10 minutes, depending on your connection speed. Packages include PyTorch, tiktoken, matplotlib, and numpy.
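If you want to confirm the install before opening a notebook, you can ask Python whether each package is importable. This is a sketch only: the helper `missing_packages` is hypothetical, and the import names in `REQUIRED` are assumed from the package list above.

```python
import importlib.util

# Import names assumed from the packages listed above.
REQUIRED = ["torch", "tiktoken", "matplotlib", "numpy"]


def missing_packages(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
print("Environment looks complete." if not missing else f"Not installed: {missing}")
```

If anything is reported as not installed, check that the virtual environment from Step 2 is active and rerun the install command.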

Step 4: Verify Your Setup

jupyter notebook ch02/01_main-chapter-code/ch02.ipynb

Run the first few cells. If they execute without errors, your environment works.

To check GPU availability:

import torch
print(torch.cuda.is_available())

The notebooks use a GPU automatically when one is available and otherwise run on the CPU without modification.
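Automatic device selection usually comes down to one line. The following is a minimal sketch, assuming a CUDA GPU with CPU fallback; the tensor `x` is there only to show where data ends up, and the notebooks' own code may differ.

```python
import torch

# Prefer the GPU when PyTorch can see one, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Tensors (and models, via model.to(device)) must live on the same device.
x = torch.ones(2, 3, device=device)
print(device, x.sum().item())
```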

Hardware Expectations

The main chapter code runs on conventional laptops. Training in Chapter 5 takes 1-2 hours on CPU, minutes on GPU. Bonus materials covering Llama 3.2 and Qwen3 benefit from GPU access.

Curriculum Structure

Each chapter builds on previous ones. Work through them sequentially.

Chapter 1: Understanding Large Language Models

Theory only, no code. Covers what LLMs are, how they differ from earlier NLP approaches, and the architecture you'll implement. Read this even if you want to jump to code.

Chapter 2: Working with Text Data

Concepts: Tokenization, byte pair encoding, vocabulary building, training batches.

Main notebook: `ch02.ipynb`

You'll build a data loader that chunks text into training sequences. This component feeds every subsequent chapter.
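The chunking idea can be sketched in plain Python before you meet the PyTorch version. This is an illustration under assumptions, not the chapter's code: `sliding_windows`, `max_length`, and `stride` are names chosen here, and the integer list stands in for real tokenizer output.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token IDs into (input, target) pairs; targets are inputs shifted by one."""
    pairs = []
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = list(range(10))  # stand-in for real tokenizer output
for inputs, targets in sliding_windows(token_ids, max_length=4, stride=4):
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

Each target sequence is the input shifted one position to the right, which is what lets the model learn next-token prediction.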

Chapter 3: Coding Attention Mechanisms

Concepts: Self-attention, scaled dot-product attention, causal masking, multi-head attention.

Main notebook: `ch03.ipynb`

Summary reference: `multihead-attention.ipynb`
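As a preview of how scaled dot-product attention and causal masking fit together, here is a simplified sketch. It assumes a single head and deliberately omits the learned query, key, and value projections, so it is not the chapter's implementation; the function name `causal_attention` is hypothetical.

```python
import torch


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where each position sees only itself and earlier positions."""
    num_tokens, d_k = keys.shape
    scores = queries @ keys.T / d_k**0.5
    # True above the diagonal marks "future" positions to hide.
    future = torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ values, weights


torch.manual_seed(0)
x = torch.randn(4, 8)  # 4 tokens, 8-dimensional embeddings
context, weights = causal_attention(x, x, x)
print(weights)
```

The printed weight matrix is lower-triangular: every entry above the diagonal is zero, so no token attends to a later one.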

Chapter 4: Implementing a GPT Model from Scratch

Concepts: Transformer blocks, layer normalization, residual connections, model configuration.

Main notebook: `ch04.ipynb`

Clean implementation: `gpt.py`

Expected result: A complete untrained GPT model that accepts input tokens and produces output logits.
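To get a feel for how a model configuration translates into model size, the arithmetic can be worked through by hand. The sketch below assumes GPT-2-small-style settings, a feed-forward layer that expands to four times the embedding size, bias-free query/key/value projections, and an output head that is not tied to the token embedding. These are illustrative assumptions rather than values taken from this guide.

```python
# GPT-2-small-style settings; these values are assumptions for illustration.
cfg = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,   # splits emb_dim across heads; does not change the count
    "n_layers": 12,
    "qkv_bias": False,
}


def count_parameters(cfg):
    d, v, ctx = cfg["emb_dim"], cfg["vocab_size"], cfg["context_length"]
    embeddings = v * d + ctx * d                       # token + positional embeddings
    qkv = 3 * (d * d + (d if cfg["qkv_bias"] else 0))  # query, key, value projections
    out_proj = d * d + d                               # attention output projection
    feed_forward = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4*d, project back
    layer_norms = 2 * 2 * d                            # two norms, each with scale and shift
    block = qkv + out_proj + feed_forward + layer_norms
    final_norm = 2 * d
    output_head = d * v                                # no bias, not weight-tied
    return embeddings + cfg["n_layers"] * block + final_norm + output_head


print(f"{count_parameters(cfg):,}")  # 163,009,536
```

Under these assumptions the total is larger than the familiar "124M" figure because the output head is counted separately. Subtracting its 38,597,376 weights, as weight tying would, gives 124,412,160.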

Chapter 5: Pretraining on Unlabeled Data

Concepts: Cross-entropy loss, training loops, learning rate scheduling, sampling strategies (temperature, top-k).

Main notebook: `ch05.ipynb`

This chapter includes loading pretrained GPT-2 weights from OpenAI, letting you skip expensive pretraining.
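The two sampling strategies named above can be sketched without PyTorch. This is a plain-Python illustration under assumptions: `sample_next_token` is a hypothetical helper, the logits are made up, and `temperature` must be greater than zero.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token index from raw logits using top-k filtering and temperature scaling."""
    indices = list(range(len(logits)))
    if top_k is not None:
        # Keep only the k highest-scoring candidates.
        indices = sorted(indices, key=lambda i: logits[i], reverse=True)[:top_k]
    # Lower temperatures sharpen the distribution, higher ones flatten it.
    scaled = [logits[i] / temperature for i in indices]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(indices, weights=weights, k=1)[0]


logits = [2.0, 0.5, -1.0, 4.0]  # made-up scores for a 4-token vocabulary
print(sample_next_token(logits, temperature=0.7, top_k=2, rng=random.Random(0)))
```

With `top_k=1` the function always returns the highest-scoring index, which is equivalent to greedy decoding. Larger `top_k` and higher `temperature` values make the output more varied.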

Chapter 6: Finetuning for Text Classification

Concepts: Classification heads, transfer learning, supervised dataset preparation.

Main notebook: `ch06.ipynb`

Uses spam detection as the example task.

Chapter 7: Finetuning to Follow Instructions

Concepts: Instruction datasets, chat formatting, evaluation strategies.

Main notebook: `ch07.ipynb`

Bonus: Direct Preference Optimization (DPO) implementation.
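Instruction datasets are usually turned into a fixed prompt layout before training. The sketch below assumes an Alpaca-style template with `instruction` and optional `input` fields; the exact wording used in the chapter may differ, and `format_instruction` is a hypothetical name.

```python
def format_instruction(entry):
    """Build an Alpaca-style prompt; the optional input section is skipped when empty."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    # The model is trained to continue the text after this marker.
    return prompt + "\n\n### Response:\n"


example = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
}
print(format_instruction(example))
```

Using the same template at training and inference time matters, because the model learns to respond to this exact layout.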

Appendices

Appendix A: Introduction to PyTorch - Tensors, autograd, training basics. Start here if new to PyTorch.
Appendix B: References and Further Reading - Curated links for deeper exploration.
Appendix C: Exercise Solutions - Complete solutions index.
Appendix D: Training Loop Improvements - Learning rate schedulers, gradient clipping.
Appendix E: LoRA Finetuning - Parameter-efficient finetuning implementation.

Repository Navigation

Each chapter folder follows this structure:

ch0X/
├── 01_main-chapter-code/
│   ├── ch0X.ipynb          # Primary notebook
│   ├── exercise-solutions.ipynb
│   └── summary files
└── 02_bonus_*/             # Optional extensions

Start with the main notebook. Use summary files (.py scripts) as reference after understanding the material. Bonus folders contain advanced topics: alternative implementations, performance optimizations, and larger model ports.

Troubleshooting

Symptom: ModuleNotFoundError when importing chapter modules

Fix: Run Jupyter from the repository root directory with your virtual environment activated.

Symptom: CUDA out of memory error during training

Fix: Reduce the batch size in the configuration dictionary. Alternatively, set device = 'cpu' to bypass the GPU.

Symptom: Training loss stays flat or increases

Fix: Verify optimizer.zero_grad() is called before each backward pass. Print a few data loader batches to confirm correct data formatting.
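For comparison, the expected order of calls in a training step looks like the following. It is a toy sketch: a small linear layer and random tensors stand in for the GPT model and real batches, and the learning rate is an arbitrary choice.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # toy stand-in for the GPT model
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
inputs, targets = torch.randn(16, 4), torch.randn(16, 1)

losses = []
for step in range(20):
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()        # compute fresh gradients
    optimizer.step()       # update the weights
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

If the gradients are never cleared, they accumulate across steps and the updates stop reflecting the current batch, which often shows up as a loss that stalls or climbs.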

Symptom: Generated text is random characters after training

Fix: Check that the same tokenizer instance is used for both training and generation. Verify the tokenizer encodes and decodes a test string correctly.
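A round-trip check takes only a few lines. The sketch below uses a toy character-level tokenizer so that it runs without extra packages; `CharTokenizer` and `round_trip_ok` are hypothetical names, and the same check works with any tokenizer object that exposes `encode` and `decode`.

```python
class CharTokenizer:
    """Toy character-level tokenizer, used only to make the check runnable."""

    def __init__(self, text):
        self.stoi = {ch: i for i, ch in enumerate(sorted(set(text)))}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)


def round_trip_ok(tokenizer, text):
    """True when decoding the encoded text reproduces it exactly."""
    return tokenizer.decode(tokenizer.encode(text)) == text


tokenizer = CharTokenizer("hello world")
print(round_trip_ok(tokenizer, "hello world"))  # True

# Decoding with a tokenizer built from a different vocabulary scrambles the text.
other = CharTokenizer("hello world!")
print(other.decode(tokenizer.encode("hello")))  # edhhl
```

The second print shows the failure mode from the symptom: token IDs produced by one vocabulary are meaningless to another.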

Symptom: CPU training takes hours

Fix: Expected behavior. Reduce training iterations for learning purposes, or use the pretrained weight loading approach in Chapter 5.

What's Next

After completing the main curriculum, explore the Llama 3.2 implementation, try LoRA finetuning, or continue to the sequel "Build a Reasoning Model (From Scratch)".

PRO TIPS

Run git pull periodically; the repository receives regular updates and bug fixes
Use Shift+Enter to execute cells sequentially in Jupyter; avoid running cells out of order
Set torch.manual_seed(42) at the start of notebooks for reproducible results during debugging
Export trained models with torch.save(model.state_dict(), 'model.pt') to avoid retraining
Use nvidia-smi -l 1 in a separate terminal to monitor GPU memory during training
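The seeding and export tips can be combined into a short round trip. This sketch uses a small linear layer as a stand-in for a trained model and writes to a temporary directory; the file name `model.pt` follows the tip above, while everything else is an assumption for illustration.

```python
import os
import tempfile

import torch

torch.manual_seed(42)          # same seed -> same initial weights on every run
model = torch.nn.Linear(4, 2)  # toy stand-in for a trained model

path = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(model.state_dict(), path)  # stores the weights only, not the Python class

restored = torch.nn.Linear(4, 2)      # rebuild the same architecture first
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()                       # switch to inference mode before generating
print(torch.equal(model.weight, restored.weight))  # True
```

Because only the state dictionary is saved, loading requires constructing a model with the same configuration before calling `load_state_dict`.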

COMMON MISTAKES

Skipping Chapter 1: The theory chapter establishes vocabulary and concepts referenced throughout. Skipping it creates confusion when terminology appears in code comments.
Running notebooks from wrong directory: Import statements assume you're in the repository root. Running from a subdirectory causes import failures.
Modifying code before understanding it: The notebooks are designed to work as-is. Make a copy before experimenting with architecture changes.
Ignoring the exercise solutions: The exercises reinforce understanding. Attempting them before checking solutions produces better retention than passive reading.

PROMPT TEMPLATES

Understanding Chapter Concepts

I'm working through Chapter [X] of "Build a Large Language Model From Scratch" and need clarification on [specific concept].

Here's the code I'm confused about:
[paste code]

Explain what this code does and how it fits into the overall GPT architecture.

Customize: Replace chapter number, specify the concept (causal masking, layer normalization, etc.), paste the actual code block.

Example output: For attention masking code, you'd receive an explanation of why future tokens must be masked during training, how the triangular matrix achieves this, and how it connects to autoregressive generation.

Debugging Training Issues

My GPT training from Chapter 5 shows [specific problem].

Configuration:
- Batch size: [X]
- Learning rate: [X]
- Device: [CPU/GPU]

Training loop code:
[paste code]

What's causing this and how do I fix it?

Extending the Implementation

I've completed Chapter [X] and want to add [specific modification].

Current implementation:
[paste relevant code section]

What changes are needed and what might break?

FAQ

Q: Is the GitHub repository sufficient, or do I need the book?

A: The repository contains all code with substantial notebook explanations. The book adds context and structured prose. Many readers complete the curriculum using only the repository.

Q: Can I run this on Google Colab?

A: Yes. Upload notebooks or clone to Google Drive. Colab's free GPU handles all main chapter code.

Q: How does this compare to Andrej Karpathy's "Let's build GPT" video?

A: Karpathy's video covers similar ground faster (2 hours). Raschka's material is more comprehensive with book-length explanations and exercises. They complement each other well.

Q: What if I get stuck?

A: Check the chapter's exercise solutions, then search GitHub Discussions. The Manning forum is also active.

Q: When can I build something practical?

A: After Chapter 5 (20-30 hours), you can generate text. After Chapter 7, you can finetune for specific tasks.

Q: Can someone without a CS degree complete this?

A: Yes, given Python proficiency. The material is self-contained and explains concepts as they appear.

RESOURCES

Official GitHub Repository: All code, notebooks, bonus materials
Manning Book Page: Print and ebook versions
17-Hour Video Course: Companion video coding through each chapter
GitHub Discussions: Community Q&A
Free Quiz PDF: 30 questions per chapter for self-assessment
Reasoning From Scratch: Sequel repository covering advanced reasoning techniques
