Build a Tiny Local AI That Runs in Your Mobile Browser (No Cloud Required)
Struggling to learn ML on a slow laptop or worried about sending student data to the cloud? In 2026 it's practical to run compact machine-learning models directly in a mobile browser using WebAssembly and WebGPU. Inspired by Puma's local-AI browser approach, this step-by-step project shows students how to load a compact, quantized model into an Android or iPhone browser and run on-device inference with strong performance and privacy.
Why this matters in 2026
Browsers and mobile GPUs have matured fast. In late 2024–2025 WebGPU moved from experimental to widely enabled in major browsers, and WebAssembly gained stable SIMD and threading support across mobile platforms. That momentum means you can now do practical on-device inference in the browser — no remote servers, no account, no latency spikes. Universities and privacy-conscious apps are adopting this approach for classroom experiments, demos, and prototyping.
Project overview: Local sentiment classifier that runs in your mobile browser
This guide builds a small, practical project: a sentiment classifier (positive/negative) that runs entirely in the mobile browser. The model is:
- Compact — distilled and quantized to reduce size.
- On-device — fetched once and cached to IndexedDB for offline reuse.
- Accelerated — uses WebGPU when available, falls back to WebAssembly.
- Student-friendly — clear build steps and simple JS code.
Why a sentiment classifier?
It’s small, meaningful, and a great student project. You learn model conversion, quantization, WebAssembly/WebGPU backends, performance tuning, and privacy best practices — all transferable skills for larger models later.
What you’ll need
- A modern mobile device (Android or iPhone) with an up-to-date browser (Chrome, Edge, Firefox, or Safari with WebGPU support in 2026).
- Node.js (for a tiny static server) and Python for model conversion.
- ONNX Runtime Web (recommended) or TensorFlow.js (alternatives exist).
- A compact pretrained model (we’ll use a DistilBERT-style text classifier converted to ONNX and quantized).
Step 1 — Prepare a compact model (Python)
Start with a small, fine-tuned transformer or a lightweight classifier. The steps below use Hugging Face Transformers to export a DistilBERT-based sentiment model to ONNX and quantize it with ONNX Runtime tools. If you’re new, run these on a desktop — the final artifact will be small enough for mobile browsers.
Python: export and quantize
Install the required packages (late-2025/early-2026 compatible):
pip install transformers torch onnx onnxruntime
Export a simple model to ONNX (example uses Hugging Face Transformers and torch.onnx):
import torch
from pathlib import Path
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
out_dir = Path('model')
out_dir.mkdir(exist_ok=True)
# Load the fine-tuned model and its tokenizer
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.eval()
# Trace a sample input and export the graph with dynamic batch/sequence axes
sample = tokenizer('hello world', return_tensors='pt')
torch.onnx.export(
    model,
    (sample['input_ids'], sample['attention_mask']),
    str(out_dir / 'model.onnx'),
    input_names=['input_ids', 'attention_mask'],
    output_names=['logits'],
    dynamic_axes={'input_ids': {0: 'batch', 1: 'seq'},
                  'attention_mask': {0: 'batch', 1: 'seq'},
                  'logits': {0: 'batch'}},
    opset_version=14,
)
Quantize with ONNX Runtime's dynamic quantization to shrink weights (8-bit). Dynamic 8-bit quantization is safe and yields 2–4x size reduction.
from onnxruntime.quantization import quantize_dynamic, QuantType
quantize_dynamic('model/model.onnx', 'model/model.quant.onnx', weight_type=QuantType.QInt8)
For extreme size trade-offs you can research 4-bit quantization (ggml-style or advanced tools) — these reduce quality more but are used in 2025–2026 for tiny local LLM prototypes.
Step 2 — Serve and cache the model in the browser
We host the quantized ONNX model as a static asset and cache it via IndexedDB to avoid repeated downloads. Use a tiny static server for development (e.g., the http-server npm package).
Simple server
npm install -g http-server
http-server . -p 8080
On the client, fetch the model using fetch() and store the ArrayBuffer in IndexedDB (or let ONNXRuntime fetch directly if you prefer). Caching is important for mobile data and offline demos.
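The fetch-then-cache flow can be sketched as below. This is a minimal sketch, not a library API: the helper names (modelCacheKey, openModelStore, loadModel), the database name, and the versioning scheme are all illustrative choices for the demo.

```javascript
// Pure helper: a versioned cache key so a new model release invalidates old bytes.
function modelCacheKey(url, version) {
  return `${url}@v${version}`;
}

// Open (or create) a small IndexedDB store for model bytes.
function openModelStore() {
  return new Promise((resolve, reject) => {
    const req = indexedDB.open('model-cache', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('models');
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
}

// Return cached bytes if present; otherwise download once and store.
async function loadModel(url, version) {
  const db = await openModelStore();
  const key = modelCacheKey(url, version);
  const cached = await new Promise((resolve, reject) => {
    const req = db.transaction('models').objectStore('models').get(key);
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });
  if (cached) return cached; // ArrayBuffer from a previous visit
  const buf = await (await fetch(url)).arrayBuffer();
  await new Promise((resolve, reject) => {
    const tx = db.transaction('models', 'readwrite');
    tx.objectStore('models').put(buf, key);
    tx.oncomplete = resolve;
    tx.onerror = () => reject(tx.error);
  });
  return buf;
}
```

Bumping the version in the key is the simplest way to push a new model without fighting stale cache entries.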
Step 3 — Run ONNX Runtime Web with WebGPU and WASM backends
ONNX Runtime Web is a practical, well-supported option in 2026. It supports multiple execution providers: webgpu (fast, modern) and wasm (broad compatibility). We'll try WebGPU first and fall back to Wasm.
HTML skeleton
<!-- index.html -->
<body>
<textarea id="text" placeholder="Type text..."></textarea>
<button id="predict">Predict</button>
<div id="result"></div>
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web/dist/ort.min.js"></script>
<script src="app.js"></script>
</body>
Client JS: load model and run inference
This snippet demonstrates key steps: initialize ONNX Runtime, attempt WebGPU, fall back to WASM, create input tensor, and run the session.
// app.js
(async () => {
// 1) Configure ONNX Runtime Web.
// Enable threads and SIMD for Wasm where supported.
if (ort.env && ort.env.wasm) {
ort.env.wasm.numThreads = 2; // start with 2 threads on mid-tier phones; tune later
ort.env.wasm.wasmPaths = './wasm/'; // only needed if you self-host the .wasm files
}
// 2) Choose execution providers: prefer WebGPU, fall back to Wasm.
const providers = ['webgpu', 'wasm'];
// 3) Load the model.
const modelUrl = './model/model.quant.onnx';
const session = await ort.InferenceSession.create(modelUrl, { executionProviders: providers });
// 4) Tokenize in JS (keep the tokenizer small). simpleTokenizer() is a placeholder
// you implement; it must match the tokenizer used at export time.
const tokenizer = simpleTokenizer();
document.getElementById('predict').addEventListener('click', async () => {
const text = document.getElementById('text').value;
const tokens = tokenizer.encode(text); // returns { ids, mask } arrays
// Input names and integer width depend on the export; models exported from
// PyTorch typically expect int64 input_ids and attention_mask.
const ids = BigInt64Array.from(tokens.ids.map((x) => BigInt(x)));
const mask = BigInt64Array.from(tokens.mask.map((x) => BigInt(x)));
const feeds = {
input_ids: new ort.Tensor('int64', ids, [1, tokens.ids.length]),
attention_mask: new ort.Tensor('int64', mask, [1, tokens.ids.length]),
};
const results = await session.run(feeds);
// Interpret the logits: for SST-2, index 0 = negative, index 1 = positive.
const logits = results.logits.data; // Float32Array
const pred = logits[0] > logits[1] ? 'Negative' : 'Positive';
document.getElementById('result').innerText = `Prediction: ${pred}`;
});
})();
Notes:
- Tokenization: keep it small. For students, implement a tiny WordPiece or use a minimal tokenizer converted from Python to JS.
- ONNX input/output names may differ. Inspect the ONNX graph if needed.
- WebGPU requires a secure context (HTTPS or localhost).
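As a classroom-grade stand-in for the simpleTokenizer() placeholder in the snippet above, here is a toy whitespace tokenizer sketch. The vocabulary and special-token ids below are invented for illustration; a real DistilBERT model needs the WordPiece tokenizer it was exported with.

```javascript
// Toy whitespace tokenizer: splits on spaces, maps unknown words to [UNK],
// and wraps the sequence in [CLS]/[SEP] like BERT-style models expect.
function simpleTokenizer() {
  // Invented demo vocabulary; a real model ships its own vocab file.
  const vocab = { '[PAD]': 0, '[UNK]': 1, '[CLS]': 2, '[SEP]': 3, good: 4, bad: 5, movie: 6 };
  return {
    encode(text, maxLen = 16) {
      const words = text.toLowerCase().split(/\s+/).filter(Boolean);
      const ids = [vocab['[CLS]']];
      for (const w of words.slice(0, maxLen - 2)) {
        ids.push(vocab[w] ?? vocab['[UNK]']);
      }
      ids.push(vocab['[SEP]']);
      const mask = ids.map(() => 1); // every real token attends
      return { ids, mask };
    },
  };
}
```

Swapping this out for a proper WordPiece implementation is a good follow-up exercise; the interface (encode returning ids and mask) stays the same.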
Step 4 — Performance tuning for mobile
Mobile phones in 2026 are powerful, but energy and memory constraints still matter. Here are practical tuning steps to squeeze good performance.
1. Prefer WebGPU
WebGPU leverages the GPU’s compute units. On supported browsers it’s usually the fastest. Detect availability and prefer it. In our example, ONNX Runtime Web handles provider selection when given a list.
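Detection can be as simple as checking for navigator.gpu before building the provider list. The pickProviders helper below is an illustrative name, not an ONNX Runtime API.

```javascript
// Prefer WebGPU when the browser exposes navigator.gpu; otherwise go
// straight to the Wasm backend so session creation never fails outright.
function pickProviders(hasWebGPU) {
  return hasWebGPU ? ['webgpu', 'wasm'] : ['wasm'];
}

// In the browser:
// const providers = pickProviders(typeof navigator !== 'undefined' && !!navigator.gpu);
// const session = await ort.InferenceSession.create(modelUrl, { executionProviders: providers });
```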
2. Quantization levels
8-bit quantization is the best first step: dramatic size reduction with minimal accuracy loss. If you need a smaller model, investigate 4-bit quantization; expect larger accuracy trade-offs and more complex tooling.
3. Model architecture
Use distilled or mobile-friendly model families (DistilBERT, MobileBERT, TinyBERT). For non-text tasks, MobileNetV3 or EfficientNet-lite variants are small and fast.
4. Tokenization and pre/post-processing
Compute tokenization in JavaScript efficiently (avoid huge lookup tables). If the tokenizer is heavy, consider precomputing or using a simpler scheme for classroom demos.
5. Threads and memory
If using Wasm, set a small number of threads (1–4) depending on the device. Too many threads increase context-switching and memory usage on low-RAM phones.
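One hedged way to pick that number is to derive it from the device's reported core count and clamp it. chooseThreads is an illustrative helper, not part of ONNX Runtime Web, and the half-the-cores heuristic is just a reasonable starting point to tune from.

```javascript
// Pick a conservative Wasm thread count: half the reported cores,
// clamped to [1, 4] so low-RAM phones aren't oversubscribed.
function chooseThreads(hardwareConcurrency) {
  const cores = Number.isInteger(hardwareConcurrency) ? hardwareConcurrency : 1;
  return Math.min(4, Math.max(1, Math.floor(cores / 2)));
}

// In the browser:
// ort.env.wasm.numThreads = chooseThreads(navigator.hardwareConcurrency);
```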
6. Progressive loading
Break large models into shards and load the smallest shard first to deliver a quick demo. You can then fetch extra shards in background.
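A sketch of the stitching step, assuming you have already split the model file into ordered shards at build time (the shard URLs and the simple byte-concatenation scheme are assumptions for this demo):

```javascript
// Concatenate downloaded shard ArrayBuffers back into one model buffer.
function concatBuffers(buffers) {
  const total = buffers.reduce((n, b) => n + b.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const b of buffers) {
    out.set(new Uint8Array(b), offset);
    offset += b.byteLength;
  }
  return out.buffer;
}

// In the browser:
// const shards = await Promise.all(urls.map((u) => fetch(u).then((r) => r.arrayBuffer())));
// const modelBytes = concatBuffers(shards);
```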
7. Measure, iterate
- Use Chrome DevTools on Android and Safari Web Inspector on iOS to profile memory, CPU, and GPU timing.
- Measure cold start (first inference) vs warm start (subsequent runs).
- Log inference time to tune thread counts and backend selection programmatically.
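The measurement loop above can be summarized with a small helper that separates the cold start from warm runs. summarizeTimes is an illustrative name for classroom reports, not a library function.

```javascript
// Split recorded latencies into cold start (first run) and warm-run average.
function summarizeTimes(ms) {
  if (ms.length === 0) return null;
  const [cold, ...warm] = ms;
  const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;
  return {
    coldStartMs: cold,
    warmMeanMs: warm.length ? mean(warm) : null,
    runs: ms.length,
  };
}

// In the browser, wrap each inference:
// const t0 = performance.now();
// await session.run(feeds);
// times.push(performance.now() - t0);
```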
Privacy and security
One of the big wins for local-AI in the browser is privacy: inputs never leave the device. But pay attention to these details:
- HTTPS/secure context — WebGPU and many APIs require a secure origin.
- Model provenance — know the model’s training data and license before distributing in class.
- Storage safety — store models encrypted in IndexedDB if they contain proprietary weights.
- Permissions — if you access camera/microphone for multimodal demos, request explicit permissions and make the privacy trade-offs clear to students.
Advanced strategies and future-proofing
As of 2026, the ecosystem keeps evolving. Here are recommendations that reflect late-2025/early-2026 trends.
1. Leverage WebNN when it lands
WebNN aims to be a browser-standard inference API. In 2025–2026 browser vendors accelerated implementations. When available on your platform, WebNN can simplify portable acceleration.
2. Consider GGML-style toolchains for tiny LLMs
For local LLM experiments, GGML and similar toolchains brought quantized LLM runtimes to mobile devices in 2024–2025. They can produce very small (4-bit) files but typically run natively; bridging them to the browser is an active area of work.
3. Progressive model updates
Design your app so you can push model updates as small delta patches — useful for classroom iterations and bug fixes without re-downloading large files.
4. On-device personalization
Local fine-tuning and adaptation (federated or on-device) became more accessible by 2026. For student demos you can simulate personalization by adjusting small linear heads locally without retraining the full model.
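One way to simulate that in class: keep the base model frozen and apply a tiny per-user linear head (a scale and bias per class) on top of its logits. The helper name and head values below are illustrative, not a standard technique's API.

```javascript
// Apply a small per-user linear head to frozen base-model logits:
// adjusted[i] = logits[i] * scale[i] + bias[i].
function personalizedLogits(logits, head) {
  return logits.map((z, i) => z * head.scale[i] + head.bias[i]);
}

// e.g. a user whose short texts skew positive could carry
// const head = { scale: [1, 1], bias: [0, 0.5] };
// and the head can be nudged locally after each confirmed prediction.
```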
Example student assignments
- Modify the tokenizer to support emojis and measure how prediction accuracy changes.
- Quantize to 8-bit and compare inference time and accuracy vs the FP32 model.
- Implement progressive loading and show a perceived-performance improvement.
- Measure battery impact on different phones and report trade-offs.
Common pitfalls and troubleshooting
- WebGPU not available: ensure secure context and up-to-date browser; fallback to Wasm works reliably.
- Large model size: quantize, distill, or use a lighter architecture.
- Tokenization mismatches: confirm the tokenizer used during export matches your JS tokenizer.
- Indexing and caching: use checksums/versioning to avoid stale model loads from IndexedDB.
“Local-AI in the mobile browser is no longer an experiment — it’s a practical learning environment for students.”
Case study: classroom demo inspired by Puma (short)
At a university lab in late 2025, instructors built an in-browser demo where each student’s phone ran a quantized sentiment model locally. The demo loaded in under 3 seconds on most modern phones, preserved student text locally, and allowed instructors to push model updates overnight. Using WebGPU reduced latency by 2–3x vs Wasm for batch inference and made live demos feel responsive.
Final checklist before you demo to a classroom
- Model quantized and tested for accuracy.
- WebGPU enabled and fallback to Wasm implemented.
- Model caching and offline use verified.
- Privacy and license statements prepared for students.
- Performance metrics collected for at least two device classes (low-end and high-end phones).
Where to go next (resources & further reading)
- ONNX Runtime Web documentation — run and tune ONNX models in the browser.
- WebGPU spec and tutorials — learn GPU compute in the browser.
- WebAssembly SIMD and threading guides — squeeze Wasm performance.
- Quantization toolkits (ONNX quantization, GGML guides) — reduce model size.
Takeaways
Local AI in a mobile browser is practical in 2026. With WebGPU and WebAssembly improvements, small, quantized models deliver responsive on-device inference for student projects. Start with a compact model, use ONNX Runtime Web for a reliable WebGPU/WASM path, quantize aggressively, and always test on target phones. The result is a private, fast, and educational local-AI experience — just like Puma’s browser-first thinking, but tailored for students.
Call to action
Ready to build your own tiny local AI? Clone the starter repo (includes Python export scripts, quantization steps, and a demo web app), run it on your phone over HTTPS, and share your results with classmates. Start small, quantify the trade-offs, and post your findings to student forums.