Run an Offline LLM on Raspberry Pi 5: End‑to‑End Guide with AI HAT+ 2
Step-by-step 2026 guide to run an offline LLM on Raspberry Pi 5 with AI HAT+ 2—model selection, quantization, tuning, and micro‑app UX.
Run a tiny, private LLM on a Pi that fits in your pocket
If you’ve ever wanted a private, low-latency assistant that runs entirely offline—no cloud, no API keys, no monthly bills—this guide gets you there on a Raspberry Pi 5 with the new AI HAT+ 2. I’ll walk you end-to-end: picking a model, quantizing it for ARM, integrating the AI HAT+ 2 accelerator, squeezing performance out of limited RAM/CPU, and wrapping the result into a snappy micro‑app UX you can ship to friends or use yourself.
What you’ll build and why it matters (TL;DR)
By the end of this guide you will have a working local LLM on a Raspberry Pi 5 that responds to prompts without internet access. We focus on practical choices that make the project realistic in 2026 when edge-optimized quantization and HAT accelerators have matured. Expect to be able to run small 3B–7B models (quantized) with usable latency for micro‑apps like personal assistants, search agents, or conversational IoT interfaces.
Why 2026 is the right time to do on‑device LLM
Recent community and industry advances (late 2024 through 2025) made efficient quantization and small model architectures practical on edge boards:
- High-quality 3–7B models with instruction tuning and permissive community checkpoints are widely available—ideal for offline tasks.
- Quantization methods (GPTQ, AWQ, and refined 3-bit/4-bit schemes) let you fit these weights into a fraction of the original memory while maintaining strong output quality.
- Edge accelerators and HATs now ship with SDKs to plug into common runtimes, allowing hardware offload where pure CPU inference would be too slow—pair this with good edge telemetry and device tooling for production deployments.
Put simply: combining a Pi 5, AI HAT+ 2, and modern quantized weights lets you run a useful LLM fully on-device.
What you need
Hardware
- Raspberry Pi 5 (64-bit OS recommended; 8GB of RAM or more helps, especially for 7B models)
- AI HAT+ 2 (HAT that exposes an NPU/accelerator and drivers for on-device inference)
- Fast storage: an NVMe SSD in a USB 3.0 enclosure or a high-speed SD card
- Power supply and network for setup (final run can be offline)
Software & tools
- Raspberry Pi OS 64-bit or a Debian 12+ based 64-bit image
- Build tools: git, make, gcc, cmake, python3, pip
- Inference runtimes: llama.cpp (ggml backend), llama-cpp-python (optional), AutoGPTQ/AWQ for quantization
- Optional: Flask/FastAPI for the micro‑app, nginx for local hosting
Step 1 — Prepare the OS and storage
Start with a clean 64-bit Raspberry Pi OS. Use a fast NVMe or high-quality SD card to minimize swap I/O and latency.
sudo apt update && sudo apt upgrade -y
sudo apt install -y git build-essential cmake python3-pip libopenblas-dev libomp-dev
# Optional: prepare swap for large model builds (size depends on model)
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Note: swap helps when memory is tight but will slow inference if used frequently—prefer more RAM or an accelerator for production micro-apps.
Step 2 — Install and build an ARM-optimized runtime
The fastest, most battle-tested path on tiny hardware is llama.cpp (ggml). It has ARM NEON optimizations and a small footprint.
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
# Plain make auto-detects NEON on ARM in older releases and produces ./main;
# newer llama.cpp versions build with CMake (cmake -B build && cmake --build build -j 4)
# and name the binary llama-cli. The examples below use the older ./main naming.
make -j4
If you plan to call the model from Python, install the binding:
pip3 install -U pip setuptools wheel
pip3 install llama-cpp-python
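Once the binding is installed, a quick sanity check is to load a model and generate a few tokens from Python. A minimal sketch (the gguf path is a placeholder for the file you will create in Step 4):
from llama_cpp import Llama

# Placeholder path: point this at the quantized gguf you copy over in Step 4
llm = Llama(model_path='/home/pi/models/model_q4.gguf', n_ctx=512, n_threads=4)
out = llm.create_completion(prompt='Say hello in five words.', max_tokens=16)
print(out['choices'][0]['text'])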
Step 3 — Choose the right model for Pi 5 + AI HAT+ 2
Model selection is the single most important decision. Focus on these criteria:
- Size: 3B–7B models are the practical sweet spot for Pi-class devices. 3B for faster response; 7B for better capabilities.
- License: confirm local-use legality (some weights are restricted). For production projects you may also want to review compliance and procurement notes like those around regulated AI platforms.
- Architecture: decoder-only models (GPT-like) are simpler to run on lightweight runtimes.
Examples commonly used by the community in 2026 include distilled/instruction-tuned 3B and 7B checkpoints and tiny Mistral variants. Always check the model’s license and community quantized checkpoints.
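A useful sanity check before downloading anything: weight memory is roughly parameter count × bits per weight ÷ 8, and the KV cache plus runtime overhead come on top. A back-of-the-envelope sketch (illustrative numbers, not exact file sizes):
# Rough weight-memory estimate: params * bits / 8 (KV cache and runtime overhead are extra)
def approx_weight_gb(params_billion, bits):
    return params_billion * 1e9 * bits / 8 / 1e9

print(approx_weight_gb(3, 4))   # ~1.5 GB: a 3B model at 4-bit fits comfortably on an 8 GB Pi 5
print(approx_weight_gb(7, 4))   # ~3.5 GB: a 7B model at 4-bit is workable but leaves less headroom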
Step 4 — Quantize the model for ARM
Quantization reduces memory and compute. In 2026, 4-bit and advanced 3-bit (AWQ/GPTQ) quantizations are standard practice for edge devices.
Options
- q4_0 / q4_k (4-bit): fastest and simplest—good latency, moderate quality tradeoff.
- 3-bit (q3_k / AWQ): better quality per memory but sometimes slower or more complex to run.
- q8_0 (8-bit): least compression and near-lossless quality; occasionally faster on runtimes with well-optimized 8-bit paths, but the files are roughly twice the size of q4.
Workflow: obtain FP16/FP32 weights, quantize on a machine with enough RAM (desktop/GPU), then transfer the resulting gguf file to your Pi. For llama.cpp targets the usual path is llama.cpp's own convert and quantize scripts; GPTQ/AWQ checkpoints produced by AutoGPTQ or AWQ tooling either need converting to gguf or a runtime that reads them natively. Many teams do the heavy conversion on cloud or desktop rigs and then copy the artifact to the device; this mirrors best practices in portable development and field workflows.
# Example: produce a q4 gguf with llama.cpp's own scripts (run on a desktop/GPU box, not on the Pi)
# Script and binary names vary by llama.cpp version (convert.py vs convert_hf_to_gguf.py, quantize vs llama-quantize)
git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp && make -j8
pip3 install -r requirements.txt
python3 convert_hf_to_gguf.py PATH/TO/FP16 --outfile /tmp/model_f16.gguf
./quantize /tmp/model_f16.gguf /tmp/model_q4.gguf q4_0
# Copy to the Pi
scp /tmp/model_q4.gguf pi@pi5:/home/pi/models/
llama.cpp now supports gguf files directly. Use the quantized gguf on the Pi.
Step 5 — Integrate AI HAT+ 2
The AI HAT+ 2 adds hardware acceleration and optimized drivers. The exact SDK will vary by vendor, but the integration pattern is similar:
- Install the HAT driver and runtime (follow vendor instructions).
- Install or build the runtime variant that supports the accelerator (a fork of llama.cpp or a proprietary runtime).
- Set environment variables or CLI flags to offload tensor work to the NPU.
Example (vendor-abstract):
# Install HAT runtime (pseudo-commands)
git clone https://github.com/vendor/ai-hat-runtime.git
cd ai-hat-runtime
./install.sh
# Run the llama.cpp executable compiled with HAT support
./main -m /home/pi/models/model_q4.gguf --use-hat -t 4 -c 1024   # --use-hat is a placeholder; the real offload flag comes from the vendor runtime
In practice, the HAT reduces CPU load and improves tokens/sec, with the biggest gains on 7B models where CPU-only inference is slowest. For real-world deployments, pairing the HAT with robust edge message brokers and telemetry helps you spot load or network issues quickly.
Step 6 — Run baseline inference and benchmark
Start with a short prompt and measure tokens per second and latency. Keep tests reproducible.
# Simple run (llama.cpp)
./main -m /home/pi/models/model_q4.gguf -p "Translate to French: Hello, how are you?" -t 4 -n 128
# With timing (bash)
time ./main -m /home/pi/models/model_q4.gguf -p "Write a 100-word summary of quantum computing." -t 4 -n 100
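If you prefer to measure from Python, here is a minimal timing sketch using llama-cpp-python (assumes the binding from Step 2 and the gguf copied over in Step 4; tokens/sec is derived from the usage counters in the response):
import time
from llama_cpp import Llama

llm = Llama(model_path='/home/pi/models/model_q4.gguf', n_ctx=1024, n_threads=4)

start = time.perf_counter()
resp = llm.create_completion(prompt='Write a 100-word summary of quantum computing.', max_tokens=100)
elapsed = time.perf_counter() - start

generated = resp['usage']['completion_tokens']
print(f'{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.2f} tokens/sec')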
What to expect in 2026 (typical ranges):
- 3B quantized (q4) on Pi 5 CPU-only: ~1–5 tokens/sec depending on thread count and storage speed.
- 7B quantized (q4 or q3) with AI HAT+ 2: ~5–20 tokens/sec depending on accelerator efficiency and model.
- Single-token latency often falls in the 200 ms to 1s range; accelerators push this toward the low end for short-context responses.
Those ranges are broad—run benchmarks that match your use case (chat vs. single-shot generation). For teams shipping micro-apps, it’s worth reading field reviews of compact devkits and cloud-PC hybrids to understand portable performance expectations (see the Nimbus Deck Pro review and similar write-ups).
Step 7 — Performance tuning checklist
- Threads: experiment with different --threads values; too many threads cause contention, too few leave cores idle (see the sweep sketch after this checklist).
- Swap: keep swap minimal. If swap is hit during inference, latency will spike.
- Storage: use NVMe if possible; SD card I/O can bottleneck loading and paging.
- Quantization: 4-bit is the usual sweet spot for latency; 3-bit buys more headroom per GB but can cost quality and tooling support. Test both.
- Context length: reduce n_ctx for micro-apps to lower memory and compute.
- Batching: for streaming responses, avoid large batch sizes; for offline generation, batching can help throughput.
- Model trimming: slice out unused tokens or reduce vocabulary if using a customized tokenizer for a micro-app.
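To make the thread experiment concrete, here is a small sweep that reloads the model with different n_threads values and compares generation speed. This is a sketch, not a rigorous benchmark; each reload takes a while on a Pi, so keep the prompt and output short:
import time
from llama_cpp import Llama

MODEL = '/home/pi/models/model_q4.gguf'
PROMPT = 'List three uses for a Raspberry Pi.'

for n_threads in (2, 3, 4):
    # Reload per setting: thread count is fixed when the model is constructed
    llm = Llama(model_path=MODEL, n_ctx=512, n_threads=n_threads, verbose=False)
    start = time.perf_counter()
    resp = llm.create_completion(prompt=PROMPT, max_tokens=64)
    elapsed = time.perf_counter() - start
    tokens = resp['usage']['completion_tokens']
    print(f'n_threads={n_threads}: {tokens / elapsed:.2f} tokens/sec')
    del llm  # free memory before the next load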
Step 8 — Build a micro‑app UX
Micro‑apps are the perfect use case for on-device LLMs: single-purpose, private, and fast. Below is a minimal Flask API that wraps local inference with llama-cpp-python. It assumes the quantized model from Step 4 is available at /home/pi/models/model_q4.gguf.
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)
# n_ctx and n_threads are illustrative; tune them per Step 7
llm = Llama(model_path='/home/pi/models/model_q4.gguf', n_ctx=1024, n_threads=4)

@app.route('/api/generate', methods=['POST'])
def generate():
    payload = request.get_json(silent=True) or {}
    prompt = payload.get('prompt', '')
    resp = llm.create_completion(prompt=prompt, max_tokens=128, temperature=0.7)
    return jsonify({'text': resp['choices'][0]['text']})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
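To exercise the endpoint from a laptop or phone on the same network, a minimal client sketch (pi5.local is a placeholder for your Pi's hostname or IP):
import requests

# Placeholder host: replace pi5.local with your Pi's hostname or IP
r = requests.post('http://pi5.local:5000/api/generate',
                  json={'prompt': 'Suggest a 20-minute pasta dinner.'},
                  timeout=120)
print(r.json()['text'])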
Frontend ideas for micro-apps:
- A minimal single-page app (HTML/CSS/JS) that hits /api/generate.
- Voice interface: use local speech-to-text and text-to-speech to stay offline.
- Hardware buttons: map physical button presses on the Pi header to pre-canned prompts.
UX best practices for micro‑apps
- Keep prompts short to save tokens and latency.
- Use system messages to narrow the model’s behavior (e.g., "You are a helpful recipe assistant").
- Pre-cache likely prompts/answers as templates for instant replies.
- Progressive responses: stream tokens to the client for perceived speed, even if total generation takes the same time (a streaming sketch follows this list).
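Here is a streaming variant of the Flask endpoint above, sketched under the assumption that llama-cpp-python's stream=True iterator is available and that the frontend simply appends plain-text chunks as they arrive:
from flask import Flask, Response, request, stream_with_context
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path='/home/pi/models/model_q4.gguf', n_ctx=1024, n_threads=4)

@app.route('/api/stream', methods=['POST'])
def stream():
    prompt = (request.get_json(silent=True) or {}).get('prompt', '')

    def tokens():
        # stream=True yields completion chunks as they are generated
        for chunk in llm.create_completion(prompt=prompt, max_tokens=128, stream=True):
            yield chunk['choices'][0]['text']

    return Response(stream_with_context(tokens()), mimetype='text/plain')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)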
Privacy, security, and legal notes
Running a model locally gives you a strong privacy advantage, but you still must:
- Confirm model licensing allows local use and redistribution.
- Secure the Pi: disable open network ports, use a strong password, and keep the system patched.
- Handle sensitive data carefully: on-device does not automatically mean compliant with regulations—always follow local data rules. Use a starter privacy policy template if you plan to expose endpoints or share device images.
Troubleshooting common issues
Model load fails with out-of-memory
- Use a more aggressively compressed quant (q4_0 instead of q8_0) or a smaller model.
- If you must convert weights on the Pi, increase swap temporarily, but avoid hitting swap at inference time.
Very slow responses
- Check if swap is being used heavily—move to faster storage.
- Enable HAT offloading if available and supported by your runtime.
- Lower the context length and n_predict values.
Bad or repetitive outputs
- Improve the prompt or use instruction-tuned models.
- Adjust decoding params: raise temperature for creativity, use nucleus sampling (top_p) to vary output, and apply a repeat penalty to curb loops (see the sketch below).
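With llama-cpp-python these knobs map to keyword arguments on the completion call. The values below are illustrative starting points, not tuned defaults:
from llama_cpp import Llama

llm = Llama(model_path='/home/pi/models/model_q4.gguf', n_ctx=512, n_threads=4)
resp = llm.create_completion(
    prompt='Suggest three side dishes for grilled salmon.',
    max_tokens=128,
    temperature=0.8,      # higher values give more varied, creative output
    top_p=0.95,           # nucleus sampling: sample from the top 95% of probability mass
    repeat_penalty=1.1,   # values above 1.0 discourage verbatim repetition
)
print(resp['choices'][0]['text'])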
Advanced strategies (for power users)
- LoRA/Adapters: keep a base quantized model and attach small LoRA adapters for domain-specific behavior—cheap and fast to transfer.
- Distillation: distill a larger model into a smaller one tailored for your tasks (requires more tooling and compute).
- Hybrid pipeline: compute embeddings and run small vector searches on-device, reserving the heavy generative calls for the HAT (a retrieval sketch follows this list).
- Model sharding: split model across RAM and HAT memory if the runtime supports it.
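A minimal on-device retrieval sketch for the hybrid pipeline idea, assuming llama-cpp-python's embedding support (embedding=True at load time) and a tiny in-memory corpus; a real micro-app would persist the vectors and likely use a smaller embedding model:
import numpy as np
from llama_cpp import Llama

# Load in embedding mode; this can be a different (smaller) gguf than the chat model
emb_model = Llama(model_path='/home/pi/models/model_q4.gguf', embedding=True, n_ctx=512)

docs = ['Pasta carbonara recipe', 'How to proof bread dough', 'Weeknight stir-fry basics']

def embed(text):
    # create_embedding mirrors the OpenAI-style response shape
    return np.array(emb_model.create_embedding(text)['data'][0]['embedding'])

doc_vecs = np.stack([embed(d) for d in docs])

def search(query, k=2):
    q = embed(query)
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
    return [docs[i] for i in np.argsort(-scores)[:k]]

print(search('quick dinner ideas'))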
Benchmarks & expectations (practical guidance)
Benchmarks vary with quantization, model, and HAT efficiency. Use these as ballpark figures for planning a micro-app:
- Simple Q&A with a 3B q4 model — ~1–5 tokens/sec (CPU-only)
- Conversational 7B q4 on Pi + HAT — ~5–20 tokens/sec
- Small instruction tasks (single-turn) — ~200–800 ms perceived latency for short outputs
Why offline LLMs still win in 2026
Cloud models remain powerful, but on-device LLMs offer:
- Privacy — data never leaves the device.
- Predictable cost — no per-token bills.
- Low-latency local interactivity for micro-apps and embedded agents.
- Resilience — works without connectivity and is easier to deploy in sensitive environments. Consider pairing with resilient edge message brokers when you need offline sync.
Case study: a personal recipe assistant
In under a weekend, a hobbyist built a Pi 5 + AI HAT+ 2 device that answers cooking questions in the kitchen. They used a 3B instruction-tuned quantized model and a small Flask front-end served to their phone over a local Wi-Fi hotspot. The device stored a 500‑recipe local database and used embeddings to match user queries—no cloud involved. The user reported faster, more private interactions than using a cloud assistant.
"Running an offline assistant let me customize behavior and keep family recipes private—plus the HAT reduced response time enough for real-time kitchen use." — Community contributor, 2025
Final checklist before you ship
- Confirm model licensing for local usage.
- Quantize model and validate output quality on a desktop before copying to Pi.
- Install HAT drivers and test offload paths.
- Measure latency and tune threads/storage/context length.
- Harden the device: firewall, keys, and restricted access for production.
Actionable takeaways
- Start small: pick a 3B quantized model to validate architecture and UX.
- Quantize off-device: conversion is faster and less painful on a desktop GPU.
- Use AI HAT+ 2: when available, it changes the performance tradeoffs and makes 7B models feasible.
- Design micro-apps around short prompts, streaming, and local caching to mask remaining latency.
Where to go next (resources)
- llama.cpp and community forks for ARM optimization
- AutoGPTQ / AWQ for quantization workflows
- llama-cpp-python for easy Python integration
- Vendor docs for AI HAT+ 2 (SDK, drivers, and runtime examples)
Call to action
Ready to try it yourself? Clone this starter repo with build scripts, a sample Flask micro-app, and a checklist to move from zero to a private Pi-based assistant: github.com/your-org/pi5-llm-starter. Share your project, benchmarks, and UX ideas—let’s build the best micro-apps for on-device AI together.
Related Reading
- Build a Privacy‑Preserving Restaurant Recommender Microservice (Maps + Local ML)
- The Evolution of Cloud‑Native Hosting in 2026: Multi‑Cloud, Edge & On‑Device AI
- Privacy Policy Template for Allowing LLMs Access to Corporate Files
- Edge+Cloud Telemetry: Integrating RISC-V NVLink-enabled Devices with Firebase
- Field Review: Edge Message Brokers for Distributed Teams