Running large language models on your own Mac instead of calling Claude / OpenAI / Gemini has become genuinely viable in 2026. Apple Silicon's unified memory architecture is great for inference, the open-weights model field is competitive, and the tooling (Ollama, LM Studio) is friendly. This post is the practical setup guide and an honest comparison to cloud frontier models.
Why run LLMs locally
- Privacy. The conversation never leaves your machine. Useful for sensitive material, regulated industries, or work you'd rather not log to a third party.
- Offline use. Models work on a plane, in a coffee shop with no Wi-Fi, in a basement.
- No API costs. Inference is free per request (you pay for electricity and the upfront hardware).
- Latency. No network round-trip. First-token latency can be sub-100ms on M-series chips for smaller models.
- Learning. Running a model yourself is the best way to understand what these things actually are.
- Custom deployments. Fine-tune a small model on your own data; embed it in your own app.
Hardware: what your Mac can actually run
Apple Silicon (M1, M2, M3, M4 chips) is excellent for LLM inference because the GPU and CPU share unified memory — the model fits in RAM and the GPU can access it directly without copying.
Practical RAM guidance (rough rule of thumb: a 4-bit-quantized model needs about 0.5-0.6 GB of RAM per billion parameters, plus headroom for the KV cache and the OS):
- 8 GB RAM: 2-3B parameter models (Gemma 2B, Phi-3 Mini, Llama 3.2 3B). Workable for simple tasks.
- 16 GB RAM: up to ~8B parameter models comfortably. Llama 3.1 8B, Gemma 9B, Qwen 7B.
- 32 GB RAM: ~14B-24B parameter models. Mistral Small 24B, Qwen 14B, Gemma 27B.
- 64 GB RAM: ~32-70B parameter models with aggressive quantization. Llama 3.3 70B at Q4.
- 128 GB+ RAM: 70B models comfortably; some 100B+ models possible.
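If you're not sure how much unified memory your Mac has, a quick terminal check works (both commands are standard macOS tools; sysctl reports bytes):

```bash
# Total unified memory, in bytes
sysctl -n hw.memsize

# Human-readable hardware summary, including memory
system_profiler SPHardwareDataType | grep "Memory:"
```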
For your typical M3 or M4 MacBook Pro with 32-64GB, you can run models that are comparable in capability to GPT-3.5 or older Claude versions. Frontier 2026 models (Opus 4.7, GPT-5, Gemini Ultra) are far larger than anything that fits locally.
The major open-weight model families
- Gemma (Google) — 2B, 7B, 9B, 27B variants. Strong general performance, multimodal versions, friendly licensing for commercial use. Covered in detail below.
- Llama (Meta) — 3.2 (1B / 3B), 3.1 (8B / 70B), 3.3 (70B); the workhorse open family. Llama 3.1 8B is the sweet spot for most local setups.
- Qwen (Alibaba) — 2.5 family in 0.5B-72B. Excellent at math, coding, multilingual.
- Mistral / Mixtral — Mistral Small 24B and Mixtral 8x7B / 8x22B (mixture-of-experts).
- Phi (Microsoft) — small models (3.8B-14B) optimized for reasoning, run well on modest hardware.
- DeepSeek-R1 / DeepSeek-V3 — reasoning-focused models with permissive licensing.
- Code-specialized: Qwen2.5-Coder, DeepSeek-Coder, CodeLlama.
Gemma specifically
Gemma is Google's open-weight model family, related to but separate from Gemini (which is closed). Gemma is permissively licensed for commercial use under Google's open-weights terms.
The current Gemma lineup (2026):
- Gemma 2B / 3B — runs on phones and laptops. Surprisingly capable at simple tasks.
- Gemma 7B / 9B — mid-range, fits in 8-16 GB RAM. Good general assistant for short tasks.
- Gemma 27B — the high-end open variant. Competitive with mid-size proprietary models on many benchmarks.
- Gemma multimodal — vision-capable variants for image understanding.
- Gemma Code — coding-specialized fine-tunes.
Strengths: clean licensing, well-documented, integrates with Hugging Face and Vertex AI. Weaknesses: not at the bleeding edge of capability — if you compare Gemma 27B to Claude Sonnet 4.6, the gap is real.
Ollama — the easy starter
Ollama is the friendliest way to start. Install once, then run any model with one command.
```bash
# Install Ollama on macOS
brew install ollama

# Start the Ollama daemon
ollama serve   # (or it auto-runs after install)

# Pull and run a model
ollama run gemma2:9b

# Now chat with it interactively at the >>> prompt
>>> Hi, what can you do?
```
Other useful commands:
- ollama list — show installed models.
- ollama pull llama3.1:8b — download a model.
- ollama rm gemma2:9b — delete a model (they're big — 5-50 GB each).
- ollama show gemma2:9b — show model details.
Ollama exposes a local HTTP API on localhost:11434 that mimics the OpenAI API. Code calling OpenAI's /v1/chat/completions works against Ollama with one URL change.
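As a minimal sketch, assuming you've already pulled gemma2:9b, you can exercise that OpenAI-compatible endpoint with curl:

```bash
# Chat via Ollama's OpenAI-compatible endpoint on localhost:11434
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma2:9b",
    "messages": [
      {"role": "user", "content": "Explain unified memory in one paragraph."}
    ]
  }'
```

Point an OpenAI SDK at the base URL http://localhost:11434/v1 and existing client code runs unchanged (the SDK still demands an API key, but Ollama ignores it).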
LM Studio
LM Studio is a GUI desktop app for running local models. Browse and download models, chat with them, configure parameters via UI. Friendlier than Ollama for non-CLI users.
Strengths: clean interface, model catalog browser, performance tuning UI, supports many model formats. Weaknesses: closed-source, heavier than Ollama, slightly less scriptable.
If you're more comfortable in a GUI and don't need to integrate the model into other tools, start with LM Studio. If you're scripting / building / wiring local models into Claude Code or other tools, start with Ollama.
llama.cpp under the hood
Both Ollama and LM Studio use llama.cpp as their inference engine. llama.cpp is the C++ library that runs quantized models efficiently on CPU / Apple Silicon. You can use llama.cpp directly via brew install llama.cpp if you want maximum control and minimum dependencies.
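A minimal sketch of direct use, assuming you've already downloaded a GGUF model file (the file path and sampling flags below are illustrative):

```bash
# Install the llama.cpp CLI tools
brew install llama.cpp

# Run a one-off prompt against a local GGUF file
llama-cli -m ~/models/gemma-2-9b-it-Q4_K_M.gguf \
  -p "Explain quantization in two sentences." -n 128

# Or serve the same file behind a local HTTP endpoint
llama-server -m ~/models/gemma-2-9b-it-Q4_K_M.gguf --port 8080
```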
For most users, Ollama is the right level of abstraction. llama.cpp itself is what you reach for if you're building custom integrations or want to fine-tune.
Real use cases for local LLMs
- Privacy-sensitive notes / writing. Drafts you don't want logged.
- Code completion in restricted environments (defense, healthcare, finance).
- Offline assistant on a laptop while traveling.
- Batch processing where you don't want the per-call cost of a cloud API.
- Embeddings generation for RAG pipelines — small embedding models run very fast locally; see the sketch after this list.
- Fine-tuned models for specific domains (a model fine-tuned on your specific documentation).
- Educational — understanding how these things actually work.
- Backup / fallback when your cloud provider has an outage or your monthly quota is exhausted.
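To make the embeddings item concrete, here's a sketch using Ollama's embeddings endpoint; nomic-embed-text is one small embedding model from the Ollama catalog, and the response values shown are illustrative:

```bash
# Pull a small embedding model, then request an embedding vector
ollama pull nomic-embed-text

curl http://localhost:11434/api/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "prompt": "Arthur Morgan is the protagonist of Red Dead Redemption 2."
  }'
# -> {"embedding": [0.12, -0.03, ...]}
```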
Honest limitations vs Claude / GPT / Gemini
- Capability gap. The best frontier model (Opus 4.7, GPT-5, Gemini Ultra) is significantly more capable than the best local model. On complex reasoning, instruction following, and long-context coherence, the gap is real.
- Speed. A 70B model on an M3 Max generates ~5-15 tokens/second. A cloud API delivers 50-200 tokens/second. For interactive use, local feels slow.
- Context window. Most local models top out at 32K-128K tokens. The 1M-token Opus and Gemini are very different beasts.
- Tool use / agent capability. Local models are catching up but still trail frontier models on multi-step tool use.
- Vision / multimodal. Local multimodal models work but lag the cloud versions on detail.
- Updates. A local model is frozen at the moment you downloaded it. New knowledge / capability comes only when you replace it.
For the RDR2 Companion's main "answer questions about the game" workload, Claude in the cloud beats any local model on quality. Use local LLMs as a tool, not a replacement.
Bundling a model in an iOS app
You can ship a small model inside your iOS app for on-device inference. Apple's Core ML framework supports running quantized models locally on iPhone hardware (Neural Engine, GPU, CPU).
Practical models that fit on a phone:
- Apple's own Foundation Models (available via the OS-level Apple Intelligence framework).
- Llama 3.2 1B-3B, Gemma 2B, Phi-3 Mini — converted to Core ML or via MLX.
- Whisper-tiny / Whisper-base for on-device speech recognition.
Tradeoffs: on-device inference works offline and protects user privacy, but the model is small (capability constrained), the app binary is large, and battery / thermal cost is real for sustained use. For RDR2 / GTA V Companion-style "deep knowledge" use cases, the on-device model is too small — cloud Claude is still the right call.
10-minute setup for Ollama + Gemma
- brew install ollama
- ollama serve (auto-starts in the background)
- ollama pull gemma2:9b (downloads ~5 GB)
- ollama run gemma2:9b — chat interactively
- To use from code: POST to http://localhost:11434/api/chat with a JSON body specifying model + messages, as shown in the sketch below.
- To use with Claude Code or other tools: point any "OpenAI-compatible" model setting at http://localhost:11434/v1.
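A minimal sketch of that /api/chat call, assuming the gemma2:9b pull above has finished ("stream": false returns one JSON object instead of a token stream):

```bash
# Native Ollama chat endpoint, non-streaming
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma2:9b",
    "messages": [
      {"role": "user", "content": "Summarize what a quantized model is."}
    ],
    "stream": false
  }'
```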
If you've never run a local model, this is a perfectly worthwhile evening. Even if you don't end up integrating it into your work, you'll understand the technology better.
See also: Claude vs ChatGPT vs Gemini, Google Gemini, What is RAG?.
- Ollama — Documentation
- Google — Gemma model family
- LM Studio — Desktop GUI
- llama.cpp — Inference engine