How Flash-MoE Runs a 397B Parameter Model on a MacBook Pro at 4.4 tok/s
A developer ran Qwen3.5-397B—a model bigger than GPT-4—on a laptop with no Python and no frameworks. Here's exactly how.
A 397-billion-parameter AI model just ran on a MacBook Pro. Not via a cloud API. Not through some clever proxy. Locally, at 4.4 tokens per second, with full tool-calling support — streamed directly off a consumer NVMe SSD into Apple's Metal GPU compute pipeline. No Python. No PyTorch. No transformers. Just C, Objective-C, and hand-tuned Metal shaders.
That's Flash-MoE, a project that exploded onto Hacker News this week with 356 upvotes and 114 comments, and for good reason: it proves that the hardware barrier between "laptop" and "frontier model" is thinner than anyone assumed. The key insight isn't a research breakthrough — it's a stack of engineering decisions that most inference frameworks got wrong. Here's how it works, what it takes to replicate, and what it means for local AI going forward.
📁 Full source code for this article is available on GitHub: github.com/aistackinsights/stackinsights/how-flash-moe-runs-397b-model-on-macbook
Why This Matters
The default assumption in local AI is that model size is gated by VRAM. A 70B model needs ~40GB of GPU memory. A 405B model needs multiple A100s. Consumer hardware tops out around 24GB for most GPUs, which caps you somewhere around 13B parameters in FP16.
Mixture-of-Experts (MoE) architectures shatter this assumption — but only if you're willing to rethink inference from first principles.
Qwen3.5-397B-A17B has 397 billion total parameters, but only 17 billion are activated per token. The model routes each token through 4 of its 512 expert networks per layer, leaving the other 508 dormant. If you never load the dormant experts into memory, you only need to hold the active 4 — roughly 27MB per layer instead of the full 209GB.
Flash-MoE does exactly this. It stores all 512 experts per layer on the SSD and reads only the 4 that matter, on demand, for each token. The rest of the model — attention weights, norms, routing logic — fits in 5.5GB of RAM. The SSD becomes a 17.5 GB/s extension of your memory hierarchy.
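The active-vs-total gap is worth sanity-checking with the article's own figures (a back-of-envelope sketch; the constants below all come from the numbers quoted above, and the disk total lands close to the stated 209GB once non-expert tensors are added):

```python
# Back-of-envelope MoE memory math using the article's figures.
# This is a sanity check on the numbers above, not part of the engine.
TOTAL_EXPERTS = 512      # experts per MoE sublayer
ACTIVE_EXPERTS = 4       # experts routed per token (Flash-MoE's K)
LAYERS = 60              # transformer layers
EXPERT_MB = 6.75         # one 4-bit-quantized expert, in MB

active_per_layer_mb = ACTIVE_EXPERTS * EXPERT_MB            # must be resident
all_experts_gb = TOTAL_EXPERTS * EXPERT_MB * LAYERS / 1024  # stays on SSD

print(f"active per layer: {active_per_layer_mb:.0f} MB")    # ~27 MB
print(f"all experts on disk: {all_experts_gb:.0f} GB")      # ~202 GB
```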
The Architecture: Qwen3.5-397B Under the Hood
Before diving into the optimizations, you need to understand why this model is uniquely suited to SSD-streaming inference.
Qwen3.5-397B-A17B has a hybrid architecture across 60 transformer layers:
- 45 GatedDeltaNet layers — linear attention (O(n) complexity, not O(n²))
- 15 standard full-attention layers — traditional multi-head attention
- 512 experts per MoE sublayer, with K=4 routed + 1 shared expert per token
- Hidden dimension: 4096, expert intermediate dimension: 1024
- Context window: 262,144 tokens natively, extensible to 1M
The GatedDeltaNet layers are critical. Linear attention means the recurrence state is a fixed-size 128×128 matrix per head — no growing KV cache, no quadratic scaling with sequence length. This is what lets the model handle long contexts without blowing up memory.
Each expert in 4-bit quantization is ~6.75MB. Loading 4 of them per layer (60 layers) means each forward pass reads approximately 1.6GB from SSD. At 17.5 GB/s sequential read speed, that's ~91ms of I/O per token — the single biggest bottleneck in the pipeline.
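The same constants reproduce the per-token I/O budget quoted above (again a sketch, using only figures from the text):

```python
# Per-token SSD read volume and latency, from the article's constants.
EXPERT_MB = 6.75         # one 4-bit expert
ACTIVE_EXPERTS = 4       # routed experts per layer
LAYERS = 60
SSD_GBPS = 17.5          # sequential read bandwidth

read_gb = ACTIVE_EXPERTS * EXPERT_MB * LAYERS / 1024  # ~1.6 GB per forward pass
io_ms = read_gb / SSD_GBPS * 1000                     # ~90 ms of I/O per token

print(f"per-token read: {read_gb:.2f} GB, I/O time: {io_ms:.0f} ms")
```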
The Three Tricks That Make It Work
1. Trust the OS — Don't Fight the Page Cache
The first thing most engineers do when building a custom inference engine is implement a custom cache. Flash-MoE tried this too — Metal LRU cache, malloc-based cache, LZ4-compressed cache. Every approach was slower than doing nothing.
Left alone, the OS page cache achieves a ~71% hit rate on expert data. That's because MoE routing isn't uniform: certain experts activate far more often than others across typical workloads, and the kernel's standard LRU eviction keeps those hot experts warm. Fighting it with a custom cache adds overhead without improving the hit rate.
The project abandoned every custom caching approach and adopted a "Trust the OS" principle: use standard pread() with GCD dispatch groups, let the OS manage caching via LRU, and stay out of the way. This single decision delivered a 38% performance improvement over the next-best caching approach.
The dispatch_io trap: Using Apple's dispatch_io for expert streaming was tested and rejected — it was 70% slower due to dispatch_data management overhead. Raw pread() with parallel dispatch groups outperforms the higher-level abstraction when you control the access pattern.
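The pattern itself is small. Flash-MoE issues raw pread() calls fanned out with GCD dispatch groups; the sketch below substitutes a Python thread pool for GCD, and the file layout, offsets, and expert sizes are hypothetical. Only the parallel positioned-read pattern matches what the text describes:

```python
# Sketch: read K=4 experts in parallel with positioned reads, letting the OS
# page cache do all caching. ThreadPoolExecutor stands in for GCD dispatch
# groups; the tiny expert size and offsets here are made up for illustration.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

EXPERT_SIZE = 4096  # hypothetical blob size (the real expert is ~6.75 MB)

def read_expert(fd, expert_id):
    # os.pread: positioned read with no shared file offset, thread-safe
    return os.pread(fd, EXPERT_SIZE, expert_id * EXPERT_SIZE)

# Demo against a throwaway file holding 8 fake "experts"
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(8):
        f.write(bytes([i]) * EXPERT_SIZE)
    path = f.name

fd = os.open(path, os.O_RDONLY)
routed = [1, 3, 5, 7]  # the K=4 experts the router picked (made up)
with ThreadPoolExecutor(max_workers=4) as pool:
    blobs = list(pool.map(lambda e: read_expert(fd, e), routed))
os.close(fd)
os.unlink(path)

assert all(blob == bytes([e]) * EXPERT_SIZE for blob, e in zip(blobs, routed))
```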
2. FMA-Optimized Dequantization Kernel
4-bit quantization stores two weight values per byte. To use them in matrix multiplication, you dequantize on the fly: for each nibble w, compute (w * scale + bias) * x.
The naive formulation requires three operations: multiply, add, multiply. Flash-MoE restructures this into a fused multiply-add (FMA) instruction by pre-computing scale*x and bias*x once per input vector, then computing fma(w, scale_x, bias_x) for each weight. The GPU's FMA unit can execute this in a single instruction.
This algebraic rearrangement sounds trivial. It delivered a 12% throughput increase.
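The rearrangement is just distributing x: (w·scale + bias)·x = w·(scale·x) + bias·x, and the right-hand side is a single fused multiply-add once scale·x and bias·x sit in registers. A quick numerical check of the identity (plain Python standing in for the Metal fma intrinsic):

```python
# Verify the algebraic identity behind the FMA rewrite:
#   (w*scale + bias) * x  ==  w*(scale*x) + bias*x
# On the GPU the right-hand side becomes one fma(w, scale_x, bias_x).
import math
import random

random.seed(0)
for _ in range(1000):
    w = float(random.randrange(16))            # a 4-bit nibble, 0..15
    scale = random.uniform(-2, 2)
    bias = random.uniform(-2, 2)
    x = random.uniform(-2, 2)
    naive = (w * scale + bias) * x
    fused = w * (scale * x) + bias * x         # fma-shaped form
    assert math.isclose(naive, fused, rel_tol=1e-9, abs_tol=1e-9)
print("identity holds")
```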
// shaders.metal — FMA-optimized 4-bit dequantization inner loop
// Naive: result = (float(nibble) * scale + bias) * x
// FMA: pre-compute scale_x = scale * x, bias_x = bias * x
// result = fma(float(nibble), scale_x, bias_x)
kernel void matvec_4bit_fma(
    device const uchar* weights [[ buffer(0) ]],
    device const float* input   [[ buffer(1) ]],
    device float*       output  [[ buffer(2) ]],
    device const float* scales  [[ buffer(3) ]],
    device const float* biases  [[ buffer(4) ]],
    uint tid [[ thread_position_in_grid ]]
) {
    // Each thread computes one output element (one row of the weight matrix)
    float acc = 0.0f;
    // Per-row quantization parameters
    float scale = scales[tid];
    float bias = biases[tid];
    for (uint i = 0; i < INPUT_DIM / 2; i++) {
        uchar packed = weights[tid * (INPUT_DIM / 2) + i];
        // Unpack two 4-bit weights from one byte
        float w0 = float(packed & 0x0F);
        float w1 = float(packed >> 4);
        float x0 = input[i * 2 + 0];
        float x1 = input[i * 2 + 1];
        // fma(w, scale*x, bias*x) — a single fused multiply-add per weight
        acc += fma(w0, scale * x0, bias * x0);
        acc += fma(w1, scale * x1, bias * x1);
    }
    output[tid] = acc;
}

3. BLAS for Linear Attention — Delegate to Accelerate
The GatedDeltaNet layers require updating a 128×128 state matrix per head, 64 heads per layer. This is a sequence of sscal, sgemv, and sger BLAS operations — exactly the kind of highly-optimized, well-parallelized math that Apple's Accelerate framework has been tuned for on Silicon for years.
Flash-MoE delegates this to cblas_sscal, cblas_sgemv, and cblas_sger from the Accelerate framework. The result: CPU attention time dropped from 0.78ms to 0.28ms — a 64% reduction — versus hand-written scalar code.
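For readers who don't live in BLAS: sscal scales a vector, sgemv is a matrix-vector product, and sger is a rank-1 update (A += α·x·yᵀ), exactly the pieces of a delta-rule state update. A plain-Python sketch of what one cblas_sger call computes, with the size shrunk to 3×3 and the vectors made up for illustration (the real state is 128×128 per head):

```python
# What cblas_sger(CblasRowMajor, m, n, alpha, x, 1, y, 1, A, n) computes:
#   A += alpha * outer(x, y)  — a rank-1 update of the recurrence state.
def sger(alpha, x, y, A):
    for i in range(len(x)):
        for j in range(len(y)):
            A[i][j] += alpha * x[i] * y[j]
    return A

state = [[0.0] * 3 for _ in range(3)]  # illustrative 3x3 state matrix
k = [1.0, 2.0, 3.0]                    # illustrative key vector
v = [0.5, 0.5, 0.5]                    # illustrative value vector
sger(0.1, k, v, state)

assert abs(state[2][0] - 0.15) < 1e-12   # 0.1 * 3.0 * 0.5
```

Accelerate runs the same update vectorized across the AMX/NEON units, which is where the 0.78ms-to-0.28ms drop comes from.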
This is the same principle as trusting the OS page cache: sometimes the right answer is knowing what battle not to fight.
The Pipeline: Every Millisecond Accounted For
The per-layer execution timeline looks like this:
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
→ CPU: flush results [0.01ms CPU]
→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]
→ CPU: softmax + topK routing [0.003ms CPU]
→ I/O: parallel pread K=4 experts [2.41ms SSD]
→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
Total per-layer: ~4.2ms, dominated by SSD I/O at 2.41ms.
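Those per-layer numbers line up with the headline throughput. A naive serial sum over 60 layers gives a lower bound just under the measured 4.36 tok/s (which benefits from cache warmth not captured here):

```python
# Cross-check: summing the per-layer timeline over 60 layers gives
# ~254 ms per token, i.e. ~3.9 tok/s, consistent with the 4.36 tok/s
# benchmark once warm-cache effects are added.
LAYERS = 60
per_layer_ms = 1.22 + 0.01 + 0.55 + 0.003 + 2.41 + 0.04  # from the timeline
per_token_s = LAYERS * per_layer_ms / 1000

print(f"~{1 / per_token_s:.1f} tok/s")   # ≈ 3.9 tok/s lower bound
```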
One crucial discovery: SSD DMA and GPU compute cannot be overlapped on Apple Silicon. They share the same memory controller, and even small background SSD reads cause disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal, not a limitation of the implementation.
Don't try to overlap GPU and SSD on Apple Silicon. The unified memory architecture means both fight for the same memory controller. F_RDADVISE prefetch was tested and caused a net 73% GPU slowdown from SSD DMA interference. Serial execution is the correct approach.
Step-by-Step: How to Run Flash-MoE
You'll need a MacBook Pro with an M3 Max chip and 48GB unified memory. The 4-bit model requires 209GB of SSD space; the 2-bit variant needs 120GB but breaks tool calling.
Prerequisites:
- macOS 26+ (Darwin 25.2.0)
- Xcode command-line tools
- ~210GB free SSD
- Qwen3.5-397B-A17B weights from HuggingFace
Step 1: Clone and build
# clone-and-build.sh
git clone https://github.com/danveloper/flash-moe.git
cd flash-moe/metal_infer
make

Step 2: Extract weights from HuggingFace safetensors
# extract_weights.py (already in the repo)
# Downloads ~397GB safetensors, converts to flash-moe binary format
# Non-expert weights → model_weights.bin (~5.5GB, mmap'd)
# Expert weights → packed_experts/ directory (209GB at 4-bit)
python extract_weights.py --model-dir /path/to/Qwen3.5-397B-A17B \
    --output-dir /path/to/flash-moe-weights

Step 3: Run inference
# inference-commands.sh
# Standard inference (4-bit, production quality)
./infer --prompt "Explain quantum computing" --tokens 100
# Interactive chat with full tool calling support
./chat
# Per-layer timing breakdown (useful for profiling)
./infer --prompt "Hello" --tokens 20 --timing
# 2-bit mode (faster, ~5.7 tok/s, but breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

The first run will be cold — the OS page cache is empty and every expert read hits SSD. After a few hundred tokens, the cache warms and throughput increases noticeably. The engine makes no attempt to force expert data into memory; it simply lets the OS do its job.
Benchmarks: What the Numbers Actually Say
| Configuration | Tokens/sec | Quality | Notes |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB disk. |
| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization |
| 2-bit experts, OS cache | 5.74 | Good* | 120GB disk. *Breaks JSON/tool calling |
| 2-bit peak (warm cache) | 7.05 | Good* | Single-token burst. Not for sustained use. |
For comparison, llama.cpp reports roughly 20 tok/s generating from empty context with Qwen3.5-397B at Q8_0, dropping to ~8 tok/s at 250k context, but Q8_0 needs ~113GB resident, far beyond the 48GB this MacBook has. Flash-MoE's SSD-streaming approach is the only way to run this model on consumer hardware at all without truncating it.
On the model quality side, Qwen3.5-397B-A17B benchmarks show:
- GPQA Diamond: 88.4% (vs GPT-5.2 at 92.4%, Claude 4.5 Opus at 87.0%)
- MMLU-Pro: 87.8% (near GPT-5.2's 87.4%)
- SWE-bench Verified: 76.4% (vs GPT-5.2 at 80.0%)
- IFBench: 76.5% (beating GPT-5.2 at 75.4%)
This isn't a toy. The 2-bit flash-moe variant scored 82% on GPQA Diamond in community testing — within 6 points of the BF16 reference at 88%.
Optimizations That Failed (And What We Learn From Them)
The project's 90+ experiments are as instructive as the wins:
| Approach | Outcome | Why |
|---|---|---|
| LZ4 expert compression | -13% | Decompress overhead exceeded warm-cache savings |
| Temporal expert prediction | -18% | 25% hit rate, wasted SSD bandwidth on wrong prefetch |
| MLP routing predictor | 31% accuracy | Worse than temporal baseline — not worth it |
| GPU LUT dequantization | -2% | Indirect register access serializes lanes |
| Spin-poll GPU wait | -23% | CPU thermal throttling competes with GPU |
| Speculative early routing | -38% | Cache pollution + overhead negated all gains |
| MTP speculative decoding | Break-even | MoE I/O scales per-token unlike dense models |
The pattern: complexity tends to hurt. Every approach that introduced more moving parts — predictors, compressors, custom caches — lost to the simple baseline. The highest-value optimizations (OS page cache, BLAS, FMA) were about removing complexity, not adding it.
Limitations and What to Watch
Hardware requirements are steep. This requires an M3 Max with 48GB. The M4 Max (also 48GB, ~600 GB/s bandwidth) would likely push to 6+ tok/s. Standard 16GB or 24GB Macs cannot run this — the non-expert weights alone are 5.5GB, and the OS needs the rest for page cache to work.
macOS only. Metal shaders and the Apple Accelerate framework aren't portable. A CUDA/ROCm port would require rewriting the entire compute layer. The architecture is sound, but the implementation is Apple-native.
2-bit quality tradeoff. The 2-bit variant produces \name\ instead of "name" in JSON output, making tool calling unreliable. This isn't a Flash-MoE bug — it's a known limitation of aggressive quantization for structured output. Community testing suggests 2.46 BPW with mixed precision (some tensors at Q8_0, Q6_K) outperforms naive 2-bit while maintaining quality.
Speed is real but not fast. 4.4 tok/s is usable for exploratory use. It's not production throughput. For agentic workloads generating thousands of tokens per task, this is a slow machine — but it's a local, private, offline 397B model.
Expert count reduction. Flash-MoE activates K=4 experts per token vs. the official K=10. This is a deliberate tradeoff for memory bandwidth. Quality holds up well at 4, but it's not identical to the full model.
Final Thoughts
Flash-MoE is a proof-of-concept that answers a question the field hadn't seriously asked: what happens when you design an inference engine around SSD bandwidth instead of VRAM?
The answer is surprisingly good. A 397B parameter model, running locally, with tool calling, on hardware you can buy at an Apple store. Not via cloud APIs. Not through a company's servers. On a laptop.
The techniques here — SSD expert streaming, OS page cache delegation, FMA kernel restructuring, BLAS for linear attention — aren't Flash-MoE-specific. They're transferable to any MoE model inference engine. As Qwen3.5, Mixtral, and DeepSeek-style sparse architectures proliferate, these patterns become more valuable, not less.
The inspiration for this approach came from Apple's own research — the "LLM in a Flash" paper (ACL 2024) demonstrated 4-5x speedup for CPU inference and 20-25x for GPU by streaming model weights from flash storage. Flash-MoE took that concept, applied it specifically to the sparse activation patterns of MoE models, and tuned every millisecond of the pipeline against real hardware.
The story of local AI isn't just getting bigger models on cloud hardware. It's getting smarter about what "memory" even means. Your SSD holds 1TB. Your model holds 209GB. At 17.5 GB/s, the math works.
Sources cited: Flash-MoE GitHub · HN Discussion · Qwen3.5-397B-A17B on HuggingFace · Apple "LLM in a Flash" (arXiv:2312.11514) · llama.cpp · Flash-MoE Technical Paper · Apple Metal Developer Documentation · Apple Accelerate Framework · MoEKD: MoE Knowledge Distillation (arXiv, March 2026)