Cloud AI services like GPT-4o and Claude require sending data to third-party servers — a critical limitation for medical records, legal documents, trade secrets, or personal information. Local LLMs run open-source models on your own hardware, solving data sovereignty problems while enabling offline use and eliminating per-token costs.
## Major Local LLM Tools
**Ollama**: the most popular local LLM runner, supporting macOS, Linux, and Windows. A single command downloads and runs Llama 3, Mistral, Phi-3, Gemma, and other leading open-source models. Provides an OpenAI-compatible API for smooth replacement of cloud endpoints. See [ollama.com](https://ollama.com).
**LM Studio**: GUI-first, accessible to non-technical users. Downloads GGUF-format models directly from Hugging Face, with a built-in chat interface and local server.
**llama.cpp**: the open-source inference framework that enabled the local LLM movement, using quantization (4-bit, 8-bit) to run 7B–70B models on consumer CPUs and GPUs.
**Jan**: an open-source local AI assistant with conversation history management and tool integration, similar to a local ChatGPT.
## Notable Open-Source Models
**Llama 3** (Meta): 8B and 70B variants approaching GPT-3.5 capability on most benchmarks. The 8B runs efficiently on 8GB VRAM; the 70B needs 40GB+ or quantization.
**Mistral 7B / Mixtral 8x7B**: from French startup Mistral AI. The 7B outperforms earlier much larger models; Mixtral’s mixture-of-experts architecture achieves higher performance at lower computational cost.
**Phi-3** (Microsoft): 3.8B and 7B models optimized for constrained devices (phones, laptops), with performance disproportionate to their parameter count.
**Qwen 2.5** (Alibaba): balanced Chinese-English bilingual performance; strongest option for Chinese-language local AI applications.
**DeepSeek Coder**: code-generation-optimized open-source model, providing near-Copilot code assistance running entirely locally.
## Hardware Requirements
A quantized 7B model runs on 8GB RAM/VRAM; a 13B needs 16GB; a 70B needs 48GB in 4-bit quantization. Apple Silicon (M1/M2/M3) is particularly well-suited due to unified memory architecture: a 64GB M2 Max comfortably runs quantized 70B models.
## When Local vs. Cloud
Local LLMs are the right choice for: processing sensitive medical, legal, or financial data; enterprise intranet AI tools (data never leaves the network); offline work environments; developer fine-tuning experiments; and cost-sensitive high-volume applications.
Cloud APIs remain preferable for: frontier capability (top performance); multimodal tasks; large-scale production deployment; and real-time web search integration.
See [AI Agent Workflows](https://sunqi.org/ai-agent-workflow-en/), [Ollama documentation](https://ollama.com), and the [Hugging Face model hub](https://huggingface.co/models).
—




