Malcolm Stanley.

Memory Bandwidth Is the New CPU Speed: A Pre-WWDC Guide to Local AI Hardware

The squeeze on remote inference changes what personal hardware is for. A decision framework for early adopters — written two weeks before WWDC, when Apple could change everything overnight.


AI practitioners optimizing for capability have largely focused on data center hardware and metrics. But as remote inference becomes more expensive, local inference becomes more attractive, and more personal. Hardware buyers who optimize for the wrong metrics (clock speed, benchmark scores, TOPS) are spending money on things that do not move the needle for local inference. The squeeze on remote inference has made this mismatch expensive.

What your personal hardware is for has fundamentally changed. Anthropic and other AI providers are tightening token budgets and raising prices. AI capability is increasingly understood as a key competitive skill, while the cost of accessing it keeps rising. The only productive engineering response to this squeeze is to offload inference from remote providers whenever possible — and to do that well, you need to understand what your hardware actually needs to do.

We Have Been Here Before

In the early days of personal computing, every component forced a choice. Network cards came in different speeds and you had to pick one. CGA, EGA, and VGA graphics differed substantially in cost and quality. The 286 ran real-mode software fine but could not address enough memory for the next generation of applications. The 386 unlocked 32-bit protected mode, paged virtual memory, and 32-bit processing, and within three or four years it was the baseline.

We are at the same inflection point for local AI. A current MacBook Pro with 24GB of RAM runs 7–9 billion parameter models — useful but limited, roughly analogous to running Lotus 1-2-3 in real mode. The "386 moment" for local inference is around 96–128GB of unified memory, where 70B models run comfortably and agentic workflows become genuinely practical. The hardware to do this already exists and is already at consumer prices. The decisions required to get there are new, the terminology is unfamiliar, and the timing is awkward — WWDC 2026 is weeks away. This is a decision framework, not a shopping list.

Scope of This Guide

  • Affordable personal hardware: approximately $2,500 USD, not more than $3,000.
  • Primary form factor: a single desktop or laptop, with clustering discussed as an option.
  • Cloud reserved instances are excluded — that is a different post.
  • The specific use case: offloading agentic inference from commercial remote models to local hardware whenever possible without significant quality loss. This is not always achievable, but the results of comparative tests are often surprising, and the cost savings are real.

Four Definitions Worth Having

VRAM (Video RAM). On discrete GPU systems — NVIDIA cards — VRAM is dedicated memory physically on the GPU, separate from system RAM. A model must fit entirely within VRAM to run at full GPU speed. Anything that overflows gets offloaded to system RAM across a bus that throttles performance sharply. The RTX 5090 has 32GB of VRAM. That hard ceiling is why, despite having the highest memory bandwidth of any consumer card at 1,792 GB/s, it cannot run models that a $2,000 unified-memory mini PC can. Extraordinary bandwidth, wrong ceiling for large models.

Unified Memory. Apple Silicon (all M-series chips) and AMD's Ryzen AI Max+ use a single memory pool shared by the CPU, GPU, and neural accelerator (Apple's Neural Engine, AMD's NPU), all drawing on the same memory bandwidth. There is no separate VRAM ceiling. On a 128GB unified memory system, the GPU draws on that same 128GB pool rather than a separate, smaller one. This is why the comparison of a $2,000 Ryzen mini PC against a $2,000 NVIDIA GPU flips in the mini PC's favor once a model no longer fits in 32GB of VRAM, which in practice means models beyond roughly the 32B class: the GPU runs out of VRAM; the mini PC does not.
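The VRAM ceiling argument can be sketched as simple arithmetic. This is a rough illustration, not a sizing tool: the 4-bit quantization and the flat 4GB allowance for KV cache and runtime overhead are assumptions, and real systems reserve some unified memory for the OS.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus an assumed flat overhead fit in the given memory."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= memory_gb


if __name__ == "__main__":
    systems = {"RTX 5090 (32GB VRAM)": 32, "128GB unified memory": 128}
    for name, memory_gb in systems.items():
        for params in (8, 32, 70):
            verdict = "fits" if fits(params, 4, memory_gb) else "does not fit"
            print(f"{name}: {params}B at 4-bit ({weights_gb(params, 4):.0f}GB) {verdict}")
```

Under these assumptions a 32B model fits on both systems, while a 70B model (about 35GB of weights) fits only in the unified-memory pool.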

Memory Transfer Speed (Memory Bandwidth). Measured in GB/s, this is how fast data moves between the processor and memory. For LLM inference, generating each token requires loading model weights from memory, running a forward pass, and writing results back. The limiting factor in this loop is almost always how fast weights can be loaded. More GPU cores do not help when you are bandwidth-limited, which during token generation you almost always are. Clock speed does not help either. The numbers that matter: M5 Max 40-core GPU at 614 GB/s. M4 Max 40-core at 546 GB/s. M4 Pro at 273 GB/s. Ryzen AI Max+ 395 at 256 GB/s theoretical, approximately 215 GB/s measured. Bandwidth is the leading indicator of inference speed. It should be the first number you look at when evaluating hardware for this use case.
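If every weight has to be read from memory once per generated token, bandwidth divided by model size gives a ceiling on tokens per second. The sketch below applies that rule of thumb to the bandwidth figures above. The 40GB model size (roughly a 70B model at 4 to 5 bits per weight) is an assumption for illustration, and the estimate only holds for dense models; real throughput lands below this ceiling.

```python
# Bandwidth figures (GB/s) as quoted in the text above.
BANDWIDTH_GBPS = {
    "M5 Max (40-core GPU)": 614,
    "M4 Max (40-core GPU)": 546,
    "M4 Pro": 273,
    "Ryzen AI Max+ 395 (measured)": 215,
}


def max_tokens_per_second(bandwidth_gbps: float, weights_gb: float) -> float:
    """Upper bound on decode speed if all weights are read once per token."""
    return bandwidth_gbps / weights_gb


if __name__ == "__main__":
    assumed_weights_gb = 40  # assumption: ~70B dense model, 4-5 bits per weight
    for chip, bandwidth in BANDWIDTH_GBPS.items():
        ceiling = max_tokens_per_second(bandwidth, assumed_weights_gb)
        print(f"{chip}: at most ~{ceiling:.1f} tokens/s")
```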

RDMA, 10 Gigabit Ethernet, and NVLink — for clustering. If you want to pool memory across multiple machines to run larger models than any single node can hold, the interconnect technology determines how well that works.

  • RDMA over Thunderbolt 5 (Apple Silicon clusters): enabled in macOS Tahoe 26.2, dropping inter-node latency to under 50 microseconds. Bandwidth ceiling is approximately 10 GB/s. Memory is not truly pooled — each node keeps its own memory, with the model sharded and distributed across machines.
  • 10 Gigabit Ethernet (Ryzen mini PC clusters): ~1.25 GB/s effective. Used for Framework Desktop clusters and Ryzen mini PC clusters via USB4-to-10GbE adapters. Sufficient for inference sharding when each node has high local bandwidth.
  • NVLink-C2C (DGX Spark): NVIDIA's chip-to-chip interconnect within a single unit runs at 600 GB/s. The inter-unit fabric (two DGX Sparks linked via ConnectX-7) runs at approximately 25 GB/s: 20× faster than 10GbE and 2.5× the Thunderbolt 5 RDMA ceiling, but a fraction of the on-chip speed. True memory pooling within a single unit; model sharding across units.
  • NVLink SXM (datacenter): up to 900 GB/s per GPU, sub-microsecond latency. Not available in consumer hardware. The gap between consumer clustering options and datacenter NVLink is approximately 90× in bandwidth (900 GB/s against the ~10 GB/s Thunderbolt 5 ceiling), which is not a price problem but a physics problem.
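To put the interconnect figures in this list side by side, here is a small sketch that computes bandwidth ratios and the time to move a payload between nodes. The 16MB payload is a hypothetical figure chosen only for illustration, and the transfer time ignores latency and protocol overhead.

```python
# Interconnect bandwidth in GB/s, as quoted in the list above.
INTERCONNECT_GBPS = {
    "10GbE": 1.25,
    "Thunderbolt 5 RDMA": 10.0,
    "ConnectX-7 (2x DGX Spark)": 25.0,
    "NVLink-C2C (within DGX Spark)": 600.0,
    "NVLink SXM (datacenter)": 900.0,
}


def ratio(faster: str, slower: str) -> float:
    """How many times more bandwidth one link has than another."""
    return INTERCONNECT_GBPS[faster] / INTERCONNECT_GBPS[slower]


def transfer_ms(payload_mb: float, link: str) -> float:
    """Milliseconds to move payload_mb over a link, bandwidth only."""
    return payload_mb / 1000 / INTERCONNECT_GBPS[link] * 1000


if __name__ == "__main__":
    print(f"NVLink SXM vs Thunderbolt 5: {ratio('NVLink SXM (datacenter)', 'Thunderbolt 5 RDMA'):.0f}x")
    print(f"ConnectX-7 vs 10GbE: {ratio('ConnectX-7 (2x DGX Spark)', '10GbE'):.0f}x")
    for link in INTERCONNECT_GBPS:
        # 16MB is a hypothetical per-step payload, not a measured value.
        print(f"{link}: {transfer_ms(16, link):.3f} ms per 16MB")
```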

The Hardware Landscape

The current market divides into two meaningful tiers for local inference. At 48GB, the Mac Mini M4 Pro (~$1,399) is the lowest-friction entry: Apple ecosystem, quiet, energy-efficient, runs 32B models comfortably, and pairs well with an existing MacBook, either as a shared Ollama server or via EXO to pool memory across the two machines. If your primary use case is coding assistance or single-model chat, this is the pragmatic move, especially if you are waiting to see what WWDC brings before committing more capital.

At 128GB, you have a different conversation. The Ryzen AI Max+ 395 implementations — Framework Desktop (~$1,999), Beelink GTR9 Pro (~$2,100) — and Apple's Mac Studio M4 Max (~$3,699) all land here. 128GB puts 70B models within reach today, and 235B Mixture-of-Experts models within reach via quantization. The choice between the Ryzen and Apple options within this tier comes down to a specific tradeoff: the Ryzen implementations offer better value and a cleaner always-on server profile; Apple's M4 Max delivers higher memory bandwidth and tighter ecosystem integration at a significant price premium.

The memory bandwidth numbers are worth sitting with. An M4 Max running DeepSeek R1 Distilled Llama 70B achieves 6.31 tokens per second. The Ryzen AI Max+ 395, with identical memory capacity, achieves 5.0 tokens per second. Bandwidth is the main driver of that difference: 546 GB/s versus 215 GB/s measured. The throughput gap (about 1.26×) is smaller than the bandwidth gap (about 2.5×), so runtime and software differences evidently absorb part of the advantage. The bandwidth gap widens further with the M5 Max at 614 GB/s. For single-model chat and coding tasks where speed per response is the metric, the bandwidth premium of Apple Max silicon is real and matters. For agentic workflows, where you are orchestrating many smaller calls with tool use between them, the per-call latency difference compounds differently, and the calculus shifts.

Agentic inference has a different hardware profile than single-model chat. An agentic stack holds multiple models resident simultaneously: an orchestrator at 32–70B (~20–40GB), a specialized coding or tool-use model at 7–14B (~5–10GB), an embeddings model (~1–2GB), a reranker (~1GB), and KV cache overhead for long contexts (10–30GB at 32K+ tokens). That is 40–80GB of memory just to keep things running without model loading and unloading — which can take 30–90 seconds per swap, making a multi-agent system that constantly swaps models effectively unusable. For agentic use, 128GB is not a luxury. It is the minimum viable configuration. The 48GB tier is a capable coding assistant; the 128GB tier is an agentic infrastructure platform.
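The resident-stack budget can be added up directly. This sketch sums the per-component ranges from the paragraph above and checks them against the two memory tiers; it leaves out OS and application overhead, which would tighten the margins further.

```python
# (low, high) memory estimates in GB for each resident component,
# taken from the ranges given in the text.
STACK_GB = {
    "orchestrator (32-70B)": (20, 40),
    "coding / tool-use model (7-14B)": (5, 10),
    "embeddings model": (1, 2),
    "reranker": (1, 1),
    "KV cache (32K+ context)": (10, 30),
}


def total_range(stack: dict) -> tuple:
    """Sum the low and high estimates across all components."""
    low = sum(lo for lo, _ in stack.values())
    high = sum(hi for _, hi in stack.values())
    return low, high


if __name__ == "__main__":
    low, high = total_range(STACK_GB)
    print(f"Resident stack: {low}-{high}GB")
    for tier_gb in (48, 128):
        print(f"{tier_gb}GB tier: headroom {tier_gb - high}GB at the high end, "
              f"{tier_gb - low}GB at the low end")
```

The itemized ranges sum to 37–83GB, in line with the 40–80GB figure. The 48GB tier only holds the stack at the very low end of every range; the 128GB tier has headroom even at the high end.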

A final note before the comparison table: this article is being written two weeks before Apple's WWDC 2026. An M5 Mac Mini and an M5 Ultra Mac Studio are widely expected. The M5 Max already ships in the current MacBook Pro at 614 GB/s. An M5 Mac Mini at that bandwidth would meaningfully change the 48GB tier recommendation, and an M5 Ultra Mac Studio with projected ~1TB/s bandwidth and 256GB would change the entire upper tier. If you can wait until late June, wait. If you cannot, the framework below holds; just know the numbers may shift.

Hardware Comparison

| Device | RAM | Bandwidth | Price | Best For |
| --- | --- | --- | --- | --- |
| Mac Mini M4 Pro | 48GB | 273 GB/s | ~$1,399 | Chat, coding, low-friction entry |
| Framework Desktop (Ryzen AI Max+ 395) | 128GB | 215 GB/s* | ~$1,999 | Agentic inference, always-on server, best TCO |
| Beelink GTR9 Pro (Ryzen AI Max+ 395) | 128GB | 215 GB/s* | ~$2,100 | Same as Framework; dual 10GbE onboard |
| Mac Studio M4 Max | 128GB | 546 GB/s | ~$3,699 | Single-model performance, Apple ecosystem |
| Mac Studio M5 Max (expected post-WWDC) | 128GB | 614 GB/s | TBA | Best bandwidth per dollar at launch, if history holds |
| NVIDIA DGX Spark | 128GB | 273 GB/s | $4,699 | CUDA-dependent workflows, PyTorch fine-tuning |

* Measured bandwidth; theoretical is 256 GB/s.

Clustering: When One Machine Is Not Enough

If your workload exceeds what a single 128GB machine can handle — you want to run 235B+ parameter models, or run multiple large models simultaneously — clustering is the path forward. EXO (github.com/exo-explore/exo) is the software that makes this practical: it supports heterogeneous device clustering across Apple Silicon, AMD ROCm, and NVIDIA CUDA hardware with automatic model sharding and zero-configuration peer discovery. Two machines on the same local network find each other automatically.

The most immediately practical option if you already own a MacBook: pair it with a Framework Desktop via EXO. Your MacBook's 24GB and the Framework's 128GB pool to 152GB, comfortably reaching 120B+ parameter models. The MacBook contributes when it is on the network; the Framework carries the full load when it is not. Cost: ~$1,999 in new hardware, using what you already own.

For a dedicated two-node cluster, two Framework Desktops linked via USB4-to-10GbE adapters on a 10GbE switch give you 256GB pooled memory and a clean, always-on infrastructure. The Framework's onboard networking is 5GbE, but USB4 ports support 10GbE adapters, and a pair of adapters plus a small switch runs approximately $200 total — closing the networking gap with the Beelink GTR9 Pro's onboard dual 10GbE while preserving the Framework's unique advantage: a mainboard upgrade path. AMD's next-generation "Medusa Halo" APU, expected 2027–28, is projected to bring 460–691 GB/s bandwidth — dropping into the same chassis via a ~$1,200 mainboard swap. No other unit in this class offers that.

The NVIDIA DGX Spark two-unit configuration is the only consumer option with higher inter-node bandwidth: 25 GB/s, versus roughly 10 GB/s for Thunderbolt 5 RDMA and about 1.25 GB/s for 10GbE clusters. What you are buying with the $9,400 two-unit price is the CUDA ecosystem and that bandwidth, not additional memory capacity, which is identical to a two-Framework setup at 256GB. For developers building PyTorch-native agentic systems or doing local fine-tuning, that CUDA premium has genuine value. For inference-only use cases, it is hard to justify.

| Cluster Setup | RAM Pool | Inter-node BW | Min. Cost | Notes |
| --- | --- | --- | --- | --- |
| MacBook M4 Pro + Framework Desktop (EXO) | 152GB | ~1.25 GB/s (10GbE) | ~$2,200 | Uses existing laptop; MacBook must be on-network |
| 2× Mac Mini M4 Pro (RDMA / EXO) | 96GB | ~10 GB/s (Thunderbolt 5) | ~$2,800 | Apple ecosystem; mature software; lower total RAM |
| 2× Framework Desktop (10GbE via USB4 adapter) | 256GB | ~1.25 GB/s (10GbE) | ~$4,200 | Best TCO; both nodes mainboard-upgradeable to 2028 |
| 2× DGX Spark (ConnectX-7) | 256GB | ~25 GB/s | ~$9,400 | CUDA ecosystem; highest inter-node bandwidth; Linux-first |
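One way to compare the cluster options is cost per pooled gigabyte. The sketch below uses the per-node memory and minimum cost figures from the table above. Keep in mind that "pooled" here means the model is sharded across nodes, so a lower cost per GB does not account for inter-node bandwidth.

```python
# Per-node memory (GB) and minimum cluster cost (USD), from the table above.
CLUSTERS = {
    "MacBook M4 Pro + Framework Desktop": ([24, 128], 2200),
    "2x Mac Mini M4 Pro": ([48, 48], 2800),
    "2x Framework Desktop": ([128, 128], 4200),
    "2x DGX Spark": ([128, 128], 9400),
}


def pooled_gb(node_gb: list) -> int:
    """Total memory across nodes (sharded, not truly shared)."""
    return sum(node_gb)


def usd_per_pooled_gb(node_gb: list, cost_usd: float) -> float:
    """Minimum cluster cost divided by total memory across nodes."""
    return cost_usd / pooled_gb(node_gb)


if __name__ == "__main__":
    for name, (nodes, cost) in CLUSTERS.items():
        print(f"{name}: {pooled_gb(nodes)}GB pooled, "
              f"${usd_per_pooled_gb(nodes, cost):.2f} per GB")
```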

Total Cost of Ownership

Purchase price is the wrong lens. Hardware for always-on agentic inference runs continuously, upgrades on a two-to-three year cycle, and draws power every hour it operates. The four-year TCO for each major option, assuming 12 hours of inference load daily at $0.14/kWh:

| Platform | Initial Hardware | 2028 Upgrade | 4yr Electricity | 4yr Total |
| --- | --- | --- | --- | --- |
| Mac Mini M4 Pro (48GB) | $1,399 | ~$1,600 (new unit) | ~$300 | ~$3,300 |
| Framework Desktop (128GB) | $1,999 | ~$1,200 (mainboard only) | ~$700 | ~$3,900 |
| Mac Studio M4 Max (128GB) | $3,699 | ~$3,699 (new unit) | ~$900 | ~$8,300 |
| DGX Spark | $4,699 | ~$4,699 (new unit) | ~$880 | ~$10,300 |
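The totals in the table are the sum of three columns, and the electricity column can be sanity-checked against the stated assumptions (12 hours a day, $0.14/kWh, four years). The sketch below recomputes the totals and back-calculates the average power draw each electricity figure implies. The implied wattages are derived here for illustration and are not measurements.

```python
# (initial hardware, 2028 upgrade, 4-year electricity) in USD, from the table.
PLATFORMS = {
    "Mac Mini M4 Pro (48GB)": (1399, 1600, 300),
    "Framework Desktop (128GB)": (1999, 1200, 700),
    "Mac Studio M4 Max (128GB)": (3699, 3699, 900),
    "DGX Spark": (4699, 4699, 880),
}


def four_year_total(initial: float, upgrade: float, electricity: float) -> float:
    """Four-year TCO as the plain sum of the three cost columns."""
    return initial + upgrade + electricity


def implied_average_watts(electricity_usd: float, hours_per_day: float = 12,
                          usd_per_kwh: float = 0.14, years: int = 4) -> float:
    """Average draw (W) that would produce the given electricity cost."""
    kwh = electricity_usd / usd_per_kwh
    hours = hours_per_day * 365 * years
    return kwh / hours * 1000


if __name__ == "__main__":
    for name, (initial, upgrade, electricity) in PLATFORMS.items():
        total = four_year_total(initial, upgrade, electricity)
        watts = implied_average_watts(electricity)
        print(f"{name}: ${total:,.0f} over 4 years, ~{watts:.0f}W implied average draw")
```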

For a consulting practice, business-use hardware generally qualifies under Section 179 and can be fully expensed in the acquisition year, reducing effective cost by roughly your marginal tax rate, approximately 30–37%. This applies equally to all options and does not change the relative ranking.

The Framework TCO advantage is driven entirely by the mainboard upgrade path. No other unit in this class lets you swap the compute for ~$1,200 while keeping the chassis, power supply, storage, and expansion cards. The Mac Studio at $3,699 initial cost requires a full unit replacement in 2028. The DGX Spark is the most expensive platform on this chart in purchase price, upgrade cost, and four-year total; only its electricity estimate comes in slightly under the Mac Studio's. The Mac Mini's TCO looks attractive, but its 48GB ceiling limits it to use cases that do not require a resident multi-model agentic stack, in which case it is not actually doing the same job as the others.

What I Would Actually Do

For chat, coding assistance, and single-model inference: a Mac Mini M4 Pro 48GB paired with the MacBook I already own, with EXO sharding models across both when I am at my desk. ~$1,399 in new hardware, 72GB effective pooled memory, runs 32–40B models comfortably. If you need nothing more than this for the next two years, this is the right call and the timing is fine: even post-WWDC, an M5 Mac Mini at this tier is likely to be incrementally better, not transformatively different.

For agentic inference: a Framework Desktop at 128GB as a dedicated always-on inference server, running Ollama in server mode, accessed from the MacBook as a client. ~$1,999. All your agent models stay resident. The x86/Linux environment means every agentic framework (LangChain, LlamaIndex, AutoGen, CrewAI) runs without adaptation. When the 2028 generation of AMD APUs arrives, the upgrade is a mainboard swap rather than a full unit replacement. If you have to make one hardware decision now, this is it.
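As a rough sketch of the client side of that setup, the snippet below calls Ollama's HTTP API on a remote machine using only the Python standard library. The hostname `framework.local` and the model tag `llama3.1:70b` are placeholders, not details from this post, and the server is assumed to have been started with `OLLAMA_HOST=0.0.0.0` so that it accepts connections from the local network on the default port 11434.

```python
import json
import urllib.request


def build_generate_request(host: str, model: str, prompt: str,
                           port: int = 11434) -> urllib.request.Request:
    """Build a non-streaming POST to Ollama's /api/generate endpoint."""
    url = f"http://{host}:{port}/api/generate"
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(host: str, model: str, prompt: str, timeout: float = 300) -> str:
    """Send the prompt to the remote server and return the generated text."""
    request = build_generate_request(host, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # Placeholder host and model; substitute your own server and a pulled model.
    print(generate("framework.local", "llama3.1:70b", "Summarize RDMA in one sentence."))
```

The agentic frameworks listed above ship their own Ollama integrations, so in practice you would point those at the same host and port rather than calling the API by hand.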

For maximum performance without clustering: the Mac Studio M4 Max, or — after WWDC — whatever Apple announces for the M5 Studio. Higher bandwidth means each individual agent call is faster, which matters when you are chaining 20–30 calls in a single agentic run. The latency compounds. A 30-step agent loop where each step takes 2 seconds instead of 3 is a 30-second difference per run, which adds up over a day of use. This is the right call if your workflow involves sustained high-throughput generation from a single large model and the TCO premium is acceptable.
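The compounding in that example is plain multiplication, sketched here so you can plug in your own numbers. The 40 runs per day is a hypothetical workload, not a figure from this post.

```python
def seconds_saved_per_run(steps: int, slow_step_s: float, fast_step_s: float) -> float:
    """Time saved in one agent run when every step gets faster."""
    return steps * (slow_step_s - fast_step_s)


def minutes_saved_per_day(steps: int, slow_step_s: float, fast_step_s: float,
                          runs_per_day: int) -> float:
    """Daily saving in minutes across repeated runs."""
    return seconds_saved_per_run(steps, slow_step_s, fast_step_s) * runs_per_day / 60


if __name__ == "__main__":
    per_run = seconds_saved_per_run(30, 3.0, 2.0)
    per_day = minutes_saved_per_day(30, 3.0, 2.0, runs_per_day=40)  # hypothetical load
    print(f"{per_run:.0f}s saved per run, {per_day:.0f} minutes per day at 40 runs")
```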

The one thing I would avoid: buying anything before WWDC if you can wait two weeks. Apple has the opportunity to change the value calculation at every tier — a faster Mac Mini, a higher-memory Mac Studio — at prices that may undercut the current Ryzen advantage. The Framework TCO story survives a better Mac Mini announcement. It might not survive a dramatically repriced M5 Mac Studio Ultra. The WWDC wildcard is real.

Gear & resources referenced in this post
NVIDIA DGX Spark † AI Workstation $4,699
Ollama — local model runtime Software (free)
LM Studio — local model UI Software (free)
Links marked with † are Amazon Associates affiliate links. If you purchase through them, I earn a small commission at no extra cost to you. I only link gear directly relevant to the analysis.