Ollama MLX: Local AI on Apple Silicon Gets Faster

On March 31, 2026, Ollama shipped version 0.19 with a preview feature that most coverage treated as a straightforward performance upgrade: MLX support on Apple Silicon. The benchmarks are real. Prompt processing is 1.6x faster, token generation nearly doubles, and the M5's GPU Neural Accelerators push time-to-first-token for a dense 14B model under ten seconds. But treating this as a mere speed boost misses the structural shift underneath. On Apple Silicon, Ollama just ripped out the engine that made it famous and replaced it with one built by Apple. That decision tells us far more about where local AI is headed than any benchmark ever could.

From llama.cpp to MLX: Why Ollama Switched Engines

Since its launch in 2023, Ollama has been a polished wrapper around llama.cpp, the C/C++ inference engine created by Georgi Gerganov that kicked off the local LLM revolution. That architecture made sense. llama.cpp ran everywhere, supported dozens of quantization formats, and had a massive contributor base grinding out CUDA kernels and ARM optimizations. Ollama's genius was never the inference layer. It was the Docker-like Modelfile system, the one-command setup, and the REST API that turned running a local model from a weekend project into a five-minute task. By late 2025, Ollama had accumulated over 112 million model pulls for Llama 3.1 alone, making it the dominant interface for local AI.

So why swap out a proven backend? Because llama.cpp was designed to be maximally portable, not maximally fast on any single platform. Its Metal backend for Apple GPUs worked, but it treated Apple Silicon like any other GPU target, bolting on support through a compatibility layer rather than building for the hardware's unique properties. MLX, which Apple open-sourced in December 2023, takes the opposite approach. It was built from the ground up for Apple Silicon's unified memory architecture, where CPU, GPU, and Neural Engine share a single memory pool with no copying overhead. The difference is not incremental. It is architectural.

On a Mac mini M4 Pro with 64GB of unified memory running Qwen3-Coder-30B-A3B, MLX delivers approximately 130 tokens per second compared to 43 tokens per second on Ollama's previous llama.cpp backend. That is a 3x improvement on the same hardware, same model, same quantization. The bottleneck was never the Mac's compute. It was the software layer failing to exploit what the hardware could actually do.

The Unified Memory Advantage Nobody Fully Appreciates

Every discussion of Apple Silicon for AI eventually mentions unified memory, and almost every discussion undersells it. Here is why it matters so much for local inference specifically.

In a discrete GPU system, running a 70B parameter model requires loading the model weights into VRAM. An NVIDIA RTX 4090 has 24GB of VRAM. A 70B model quantized to 4-bit precision needs roughly 35GB. It does not fit. You either offload layers to system RAM (and accept a 10-50x latency penalty on those layers as data crosses the PCIe bus) or you do not run the model at all. The RTX 5090, with 32GB of VRAM, still cannot hold it.
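
The arithmetic behind "it does not fit" is easy to check. The following is a minimal sketch that counts weight storage only; the function names are illustrative, and it deliberately ignores the KV cache, activations, and runtime overhead, all of which make the real requirement larger.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in decimal gigabytes.

    params (in billions) * bits per weight / 8 bits per byte.
    Assumption: weights only; KV cache and runtime overhead are ignored.
    """
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_footprint_gb(params_billions, bits_per_weight) <= memory_gb


# VRAM figures quoted in the paragraph above.
for name, vram_gb in [("RTX 4090", 24), ("RTX 5090", 32)]:
    needed = weight_footprint_gb(70, 4)
    verdict = "fits" if fits(70, 4, vram_gb) else "does not fit"
    print(f"{name} ({vram_gb} GB): 70B at 4-bit needs about {needed:.0f} GB, {verdict}")
```

Because the estimate leaves out everything except the weights, a model that passes this check may still fail to load in practice, while a model that fails it will certainly not fit.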

A Mac Studio with an M3 Ultra can be configured with up to 512GB of unified memory, all of it accessible to the GPU at full bandwidth. A 70B model loads entirely into GPU-accessible memory. No offloading. No PCIe bottleneck. No layer splitting. The Mac runs the model that the $2,000 gaming GPU literally cannot, and it does so while consuming 80 watts instead of 450.

MLX exploits this architecture in ways llama.cpp's Metal backend never did. MLX uses lazy evaluation and a unified memory model where arrays live in shared memory and operations on them can be scheduled across any available compute unit without data movement. When Ollama runs on MLX, the model weights stay exactly where they are. The GPU reads them in place. The CPU can inspect the same data without a copy. This is not just faster. It eliminates an entire category of memory management complexity that plagues every other inference stack.
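
To make lazy evaluation and shared arrays concrete, here is a minimal sketch using MLX's Python package (`mlx.core`). It illustrates the framework's programming model, not how Ollama uses MLX internally. The function name and the import guard are additions for illustration; the guard only exists so the snippet degrades gracefully on machines without MLX or without a Metal GPU.

```python
try:
    import mlx.core as mx  # MLX's Python bindings
except ImportError:
    mx = None  # MLX is not installed on this machine


def shared_array_demo(n: int = 1024):
    """Run a GPU op and a CPU op on the same arrays with no explicit copy.

    Returns the two result shapes, or None when MLX or Metal is unavailable.
    """
    if mx is None or not mx.metal.is_available():
        return None

    # Arrays are not tied to a device; they live in unified memory.
    a = mx.random.normal((n, n))
    b = mx.random.normal((n, n))

    # These calls only record work in a graph. The device is chosen per
    # operation via `stream`, and both ops read the same a and b.
    gpu_result = mx.matmul(a, b, stream=mx.gpu)
    cpu_result = mx.add(a, b, stream=mx.cpu)

    # Lazy evaluation: computation actually happens here.
    mx.eval(gpu_result, cpu_result)
    return gpu_result.shape, cpu_result.shape
```

The point to notice is what is absent: there is no `.to(device)` call and no host-to-device transfer step anywhere in the function.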

The M5 chips introduced GPU Neural Accelerators, which Apple's own research shows yield up to 4x speedup over M4 for time-to-first-token. Generating a 1024x1024 image with FLUX-dev-4bit is 3.8x faster on M5 than M4. These accelerators are purpose-built matrix multiplication units embedded in the GPU, analogous to NVIDIA's Tensor Cores but designed for MLX's execution model. They are not accessible through llama.cpp's Metal backend. They require MLX.

Apple's Quiet Platform Play

Apple rarely telegraphs its platform strategies, but the pattern here is legible if you track the sequence of moves.

December 2023: Apple open-sources MLX, positioning it as "an array framework for machine learning research on Apple silicon." The framing is modest. Research tool. Apple silicon only.

Throughout 2024 and 2025: Apple's ML research team publishes a steady stream of optimized model implementations in MLX format. Quantized Llama, Mistral, Phi, Qwen variants appear in the mlx-community repository on Hugging Face. The MLX ecosystem quietly grows a library of ready-to-run models.

Early 2026: MLX adds CUDA support, meaning the framework now runs on NVIDIA hardware too. This is the move that reframes everything. MLX is no longer just an Apple Silicon optimization. It is a cross-platform ML framework that happens to be best on Apple hardware. This is the same strategic playbook Apple used with WebKit and Swift: open the standard, optimize for your own platform, and let every other platform be a second-class citizen.

March 2026: Ollama, the most popular local LLM tool with millions of users, adopts MLX as its Apple Silicon backend. Apple did not need to ship its own inference app. It got the market leader to adopt its framework voluntarily, because the performance advantage was too large to ignore.

This is textbook platform strategy. Apple is building the runtime layer for on-device AI the same way it built Core Animation for graphics and Core ML for traditional machine learning. Once developers target MLX, they are targeting Apple Silicon's specific capabilities. Every optimization they make feeds back into Apple's hardware advantage. Every model published in MLX format works best on a Mac. The framework becomes the moat.

What This Means for the NVIDIA Monopoly

NVIDIA's dominance in AI is real but narrower than most people assume. In the datacenter, for training and high-throughput multi-user inference, NVIDIA has no serious competitor. Blackwell's native FP4 support, the NVLink interconnect, and the CUDA ecosystem are years ahead of anything else.

But local, single-user inference is a different market with different economics. Here, the relevant metrics are not throughput-per-dollar or FLOPS-per-watt at datacenter scale. They are: can this model fit in memory, how fast does it respond to one user, and how much does the complete system cost?

On those metrics, Apple Silicon is already competitive and MLX widens the gap. A Mac Studio with an M3 Ultra and 256GB of unified memory costs around $6,000 and runs a 70B model at 15-25 tokens per second. To match that capacity with NVIDIA hardware, you need multiple GPUs, a workstation chassis, and a power supply that could heat a small apartment, for roughly the same cost but with dramatically higher complexity and power draw.

The RTX 5090 generates 50+ tokens per second on models that fit in its 32GB VRAM. It is faster when the model fits. But "when the model fits" is an increasingly important qualifier as models grow. The trend in open-weight AI is toward larger models with better reasoning capabilities: Qwen3-30B, Llama 4 Scout at 109B parameters with mixture-of-experts, DeepSeek variants pushing past 200B. The models that matter are getting bigger, and NVIDIA's consumer GPU memory is not keeping pace.

NVIDIA knows this. The company's response has been to push cloud inference (GeForce NOW for AI, essentially) and to keep memory-rich hardware at professional price points. A single H100 with 80GB costs more than a fully loaded Mac Studio. Apple is offering the only consumer-accessible path to running frontier-class models locally, and MLX is the software that makes it work.

The Limitations Are Real but Temporary

Ollama's MLX preview is genuinely limited right now. It supports exactly one model: Qwen3.5-35B-A3B, a mixture-of-experts model configured for coding. It requires 32GB or more of unified memory, excluding the base M1, M2, and M3 Macs that millions of people own. And the preview designation means bugs and performance regressions are expected.

These constraints will not last. Ollama explicitly states they are working to support additional models and architectures. The MLX framework itself already supports dozens of model architectures through the mlx-lm library. The gap is in Ollama's integration layer, not in MLX's capabilities. Expect broad model support by mid-2026.

The memory requirement is more fundamental. Running meaningful local AI requires memory, and 8GB or 16GB Macs are simply too constrained. This is Apple's upsell engine working as designed. The company that famously charged $200 to upgrade from 8GB to 16GB of RAM now has a compelling reason for customers to buy the 64GB or 128GB configuration. "It runs AI" is the new "it runs Photoshop." Every Mac sold with 32GB or more of unified memory is a potential local AI workstation, and Apple's margins on memory upgrades are extraordinary.

What Builders Should Do Now

If you are building tools, agents, or applications that use local LLM inference, the Ollama MLX integration changes your hardware recommendations and architectural assumptions.

For individual developers: A Mac with 32GB or more of unified memory is now the best local AI development machine at any price point below $5,000. The combination of Ollama's ease of use, MLX's performance, and macOS's general development tooling creates an environment where you can prototype with 30B+ parameter models on the same machine where you write code. Update your Ollama installation and test with the MLX preview immediately.
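
One way to test the preview on your own machine is to ask Ollama for the timing data it already reports. The sketch below assumes a local Ollama server on its default port and uses the `eval_count` and `eval_duration` fields (duration in nanoseconds) returned by a non-streaming call to `/api/generate`. The helper names and the model tag in the usage comment are placeholders, not values from the release notes.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response: dict) -> float:
    """Generation throughput from a non-streaming /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the time
    spent generating them, reported in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one prompt and return the parsed JSON response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)


# Usage, with a placeholder model tag (replace with a model you have pulled):
#   result = run_prompt("your-model-tag", "Write a binary search in Python.")
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

Running the same prompt before and after updating gives a like-for-like comparison on your own hardware, which is more useful than any published benchmark.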

For startups building AI-powered desktop apps: Target MLX directly for Mac builds. The performance gap between MLX and llama.cpp on Apple Silicon is large enough that using the generic backend is leaving 2-3x performance on the table. Ship two inference paths: MLX on Mac, llama.cpp (or CUDA-optimized) on Windows/Linux.
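
The two-path approach can start as a simple runtime check. This is a hypothetical sketch: `select_backend` and its return values are illustrative names, and the 32GB default reuses the memory floor of Ollama's current preview as an assumed threshold, not a rule that applies to MLX in general.

```python
import platform


def select_backend(
    system: str,
    machine: str,
    memory_gb: float,
    min_mlx_memory_gb: float = 32,  # assumption: mirrors the preview's 32GB floor
) -> str:
    """Pick an inference backend name for the current machine.

    Apple Silicon reports system "Darwin" and machine "arm64". Everything
    else, including Intel Macs, falls back to the portable backend.
    """
    is_apple_silicon = system == "Darwin" and machine == "arm64"
    if is_apple_silicon and memory_gb >= min_mlx_memory_gb:
        return "mlx"
    return "llama.cpp"


# Usage: detect the platform at startup; memory_gb would come from your own
# system probe (hard-coded here for illustration).
backend = select_backend(platform.system(), platform.machine(), memory_gb=64)
print(f"Selected backend: {backend}")
```

Keeping the decision in one pure function makes it easy to unit-test every platform combination without owning the hardware.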

For enterprises evaluating on-device AI: Apple Silicon's unified memory model fundamentally changes the cost analysis for edge AI deployments. A fleet of Mac Minis with 64GB of memory running local inference through Ollama could replace a centralized GPU server for many internal use cases, with better per-user latency, simpler networking, and no cloud inference costs. The MLX integration makes this viable where it was marginal before.

For model developers: Publish MLX-format weights alongside GGUF. The audience is large and growing. The mlx-community organization on Hugging Face already hosts hundreds of converted models, but first-party MLX weights are likely to perform better than conversions from other formats.

Where This Goes Next

Three predictions. First, Apple will integrate MLX capabilities directly into macOS within the next twelve months. Not as a developer framework buried in Xcode, but as a system service that any application can call to run local models. The groundwork is already laid with Core ML and the Neural Engine. MLX-powered inference will become an OS feature, the same way Core Spotlight made search an OS feature.

Second, Ollama's MLX adoption will force LM Studio, Jan, and every other local LLM tool to follow. The performance gap is too visible in side-by-side comparisons. By the end of 2026, MLX will be the default backend for local AI on macOS across the ecosystem, not just in Ollama.

Third, and most importantly: the combination of 128GB+ unified memory Macs and MLX-optimized inference will make local AI good enough to replace cloud API calls for a meaningful percentage of use cases. Not for training. Not for serving millions of users. But for the developer running an AI coding assistant, the writer using a local chatbot, the small business running document processing, local inference on Apple Silicon will be fast enough, private enough, and cheap enough to be the default choice. That is not a performance story. That is a market structure story. And it starts with Ollama 0.19.
