
Local LLMs and Gemma 2026: Running AI Models on Your Own Mac


Running large language models on your own Mac instead of calling Claude / OpenAI / Gemini has become genuinely viable in 2026. Apple Silicon's unified memory architecture is great for inference, the open-weights model field is competitive, and the tooling (Ollama, LM Studio) is friendly. This post is the practical setup guide and an honest comparison to cloud frontier models.

Why run LLMs locally

The usual reasons: privacy (prompts and data never leave your machine), offline use, zero marginal cost per request, no rate limits, and full control over which model version you run.

Hardware: what your Mac can actually run

Apple Silicon (M1, M2, M3, M4 chips) is excellent for LLM inference because the GPU and CPU share unified memory — the model fits in RAM and the GPU can access it directly without copying.
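To check what you're working with, macOS ships two commands that report the chip and the total unified memory (nothing to install):

# Identify the chip, e.g. "Apple M3 Pro"
sysctl -n machdep.cpu.brand_string

# Total unified memory in bytes (divide by 1073741824 for GB)
sysctl -n hw.memsize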

Practical RAM guidance: a 4-bit quantized model needs roughly half its parameter count in GB of RAM for its weights, plus headroom for the context cache. A typical M3 or M4 MacBook Pro with 32-64GB can run models comparable in capability to GPT-3.5 or older Claude versions. Frontier 2026 models (Opus 4.7, GPT-5, Gemini Ultra) are far larger than anything that fits locally.
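As a rough back-of-envelope (approximate, assuming the common Q4 quantization; real usage varies with quantization variant and context length):

# weights ~ parameter count x 0.5 bytes at 4-bit (Q4) quantization
#  9B x 0.5 ~  4.5GB -> comfortable on 16GB
# 27B x 0.5 ~ 13.5GB -> wants 24-32GB
# 70B x 0.5 ~ 35GB   -> needs 48GB or more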

The major open-weight model families

The usual suspects: Meta's Llama, Google's Gemma, Mistral's models, Alibaba's Qwen, DeepSeek, and Microsoft's Phi. All publish downloadable weights; licensing terms vary by family.

Gemma specifically

Gemma is Google's open-weight model family, related to but separate from Gemini (which is closed and API-only). Gemma's weights are downloadable and licensed for commercial use under Google's own Gemma terms, which include a prohibited-use policy: open weights, not strictly open source.

The current Gemma lineup (2026) spans small on-device sizes around 2B parameters up to a 27B flagship, with the 9B middle size used in the examples below as the sweet spot for a 16-32GB Mac.

Strengths: clean licensing, well-documented, integrates with Hugging Face and Vertex AI. Weaknesses: not at the bleeding edge of capability — if you compare Gemma 27B to Claude Sonnet 4.6, the gap is real.

Ollama: the easy starter

Ollama is the friendliest way to start. Install once, then run any model with one command.

# Install Ollama on macOS
brew install ollama

# Start the Ollama daemon
ollama serve   # or keep it running in the background: brew services start ollama

# Pull and run a model
ollama run gemma2:9b

# Now chat with it interactively
>>> Hi, what can you do?

Other useful commands:
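
# List downloaded models and their sizes
ollama list

# Show models currently loaded in memory
ollama ps

# Download a model without starting a chat
ollama pull gemma2:2b

# Print a model's parameters and prompt template
ollama show gemma2:9b

# Remove a model you no longer need
ollama rm gemma2:2b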

Ollama exposes a local HTTP API on localhost:11434 that mimics the OpenAI API. Code calling OpenAI's /v1/chat/completions works against Ollama with one URL change.
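For example, the same request shape you'd send to OpenAI works locally (gemma2:9b here stands in for whatever model you've pulled):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma2:9b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'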

LM Studio

LM Studio is a GUI desktop app for running local models. Browse and download models, chat with them, configure parameters via UI. Friendlier than Ollama for non-CLI users.

Strengths: clean interface, model catalog browser, performance tuning UI, supports many model formats. Weaknesses: closed-source, heavier than Ollama, slightly less scriptable.

If you're more comfortable in a GUI and don't need to integrate the model into other tools, start with LM Studio. If you're scripting / building / wiring local models into Claude Code or other tools, start with Ollama.

llama.cpp under the hood

Both Ollama and LM Studio use llama.cpp as their inference engine. llama.cpp is the C++ library that runs quantized models efficiently on CPU / Apple Silicon. You can use llama.cpp directly via brew install llama.cpp if you want maximum control and minimum dependencies.
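A minimal direct run, assuming you've already downloaded a GGUF model file (the filename below is illustrative, not a real download):

# -m points at the model file, -p is the prompt, -n caps generated tokens
llama-cli -m gemma-2-9b-it-Q4_K_M.gguf \
  -p "Explain unified memory in one paragraph." -n 200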

For most users, Ollama is the right level of abstraction. Reach for llama.cpp directly when you're building custom integrations or need fine-grained control over inference (quantization formats, sampling parameters, running your own server).

Real use cases for local LLMs

Where local models earn their keep: summarizing or querying private documents that can't leave your machine, working offline, high-volume batch jobs where per-token API costs add up, and cheap experimentation while prototyping.

Honest limitations vs Claude / GPT / Gemini

The gap shows up in reasoning depth, factual reliability, and long-context handling. For the RDR2 Companion's main "answer questions about the game" workload, Claude in the cloud beats any local model on quality. Use local LLMs as a tool, not a replacement.

Bundling a model in an iOS app

You can ship a small model inside your iOS app for on-device inference. Apple's Core ML framework supports running quantized models locally on iPhone hardware (Neural Engine, GPU, CPU).

Practical models that fit on a phone sit in the 1-3B parameter class, heavily quantized: think Gemma 2B or Llama 3.2 1B/3B scale.

Tradeoffs: on-device inference works offline and protects user privacy, but the model is small (capability constrained), the app binary is large, and battery / thermal cost is real for sustained use. For RDR2 / GTA V Companion-style "deep knowledge" use cases, the on-device model is too small — cloud Claude is still the right call.

10-minute setup for Ollama + Gemma

  1. brew install ollama
  2. ollama serve (or brew services start ollama to keep it running in the background)
  3. ollama pull gemma2:9b (downloads ~5GB)
  4. ollama run gemma2:9b — chat interactively
  5. To use from code: POST to http://localhost:11434/api/chat with a JSON body specifying model + messages (see the example after this list).
  6. To use with Claude Code or other tools: point any "OpenAI-compatible" model setting at http://localhost:11434/v1.
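As a concrete sketch of step 5 ("stream": false returns one JSON reply instead of a token stream):

curl http://localhost:11434/api/chat -d '{
  "model": "gemma2:9b",
  "messages": [{"role": "user", "content": "Hi, what can you do?"}],
  "stream": false
}'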

If you've never run a local model, this is a perfectly worthwhile evening. Even if you don't end up integrating it into your work, you'll understand the technology better.


See also: Claude vs ChatGPT vs Gemini, Google Gemini, What is RAG?.

Sources & References
  1. Ollama – documentation
  2. Google – Gemma model family
  3. LM Studio – desktop GUI
  4. llama.cpp – inference engine