How Flash-MoE Runs a 397B Parameter Model on a MacBook Pro at 4.4 tok/s
A developer ran Qwen3.5-397B—a model bigger than GPT-4—on a laptop with no Python and no frameworks. Here's exactly how.
A 397-billion-parameter AI model just ran on a MacBook Pro. Not via a cloud API. Not through some clever proxy. Locally, at 4.4 tokens per second, with full tool-calling support — streamed directly off a consumer NVMe SSD into Apple's Metal GPU compute pipeline. No Python. No PyTorch. No transformers. Just C, Objective-C, and hand-tuned Metal shaders.
That's Flash-MoE, a project that exploded onto Hacker News this week with 356 upvotes and 114 comments, and for good reason: it proves that the hardware barrier between "laptop" and "frontier model" is thinner than anyone assumed. The key insight isn't a research breakthrough — it's a stack of engineering decisions that most inference frameworks got wrong. Here's how it works, what it takes to replicate, and what it means for local AI going forward.
📁 Full source code for this article is available on GitHub: github.com/aistackinsights/stackinsights/how-flash-moe-runs-397b-model-on-macbook
Why This Matters
The default assumption in local AI is that model size is gated by VRAM. A 70B model needs ~40GB of GPU memory. A 405B model needs multiple A100s. Consumer hardware tops out around 24GB for most GPUs, which caps you somewhere around 13B parameters in FP16.
Mixture-of-Experts (MoE) architectures shatter this assumption — but only if you're willing to rethink inference from first principles.
Qwen3.5-397B-A17B has 397 billion total parameters, but only 17 billion are activated per token. The model routes each token through 4 of its 512 expert networks per layer, leaving the other 508 dormant. If you never load the dormant experts into memory, you only need to hold the active 4 — roughly 27MB per layer instead of the full 209GB.
Flash-MoE does exactly this. It stores all 512 experts per layer on the SSD and reads only the 4 that matter, on demand, for each token. The rest of the model — attention weights, norms, routing logic — fits in 5.5GB of RAM. The SSD becomes a 17.5 GB/s extension of your memory hierarchy.
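The active-vs-total gap is worth sanity-checking with the article's own figures (a back-of-envelope sketch; the constants below all come from the numbers quoted above, and the disk total lands close to the stated 209GB once non-expert tensors are added):

```python
# Back-of-envelope MoE memory math using the article's figures.
# This is a sanity check on the numbers above, not part of the engine.
TOTAL_EXPERTS = 512      # experts per MoE sublayer
ACTIVE_EXPERTS = 4       # experts routed per token (Flash-MoE's K)
LAYERS = 60              # transformer layers
EXPERT_MB = 6.75         # one 4-bit-quantized expert, in MB

active_per_layer_mb = ACTIVE_EXPERTS * EXPERT_MB            # must be resident
all_experts_gb = TOTAL_EXPERTS * EXPERT_MB * LAYERS / 1024  # stays on SSD

print(f"active per layer: {active_per_layer_mb:.0f} MB")    # ~27 MB
print(f"all experts on disk: {all_experts_gb:.0f} GB")      # ~202 GB
```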
The Architecture: Qwen3.5-397B Under the Hood
Before diving into the optimizations, you need to understand why this model is uniquely suited to SSD-streaming inference.
Qwen3.5-397B-A17B has a hybrid architecture across 60 transformer layers:
- 45 GatedDeltaNet layers — linear attention (O(n) complexity, not O(n²))
- 15 standard full-attention layers — traditional multi-head attention
- 512 experts per MoE sublayer, with K=4 routed + 1 shared expert per token
- Hidden dimension: 4096, expert intermediate dimension: 1024
- Context window: 262,144 tokens natively, extensible to 1M
The GatedDeltaNet layers are critical. Linear attention means the recurrence state is a fixed-size 128×128 matrix per head — no growing KV cache, no quadratic scaling with sequence length. This is what lets the model handle long contexts without blowing up memory.
Each expert in 4-bit quantization is ~6.75MB. Loading 4 of them per layer (60 layers) means each forward pass reads approximately 1.6GB from SSD. At 17.5 GB/s sequential read speed, that's ~91ms of I/O per token — the single biggest bottleneck in the pipeline.
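The same constants reproduce the per-token I/O budget quoted above (again a sketch, using only figures from the text):

```python
# Per-token SSD read volume and latency, from the article's constants.
EXPERT_MB = 6.75         # one 4-bit expert
ACTIVE_EXPERTS = 4       # routed experts per layer
LAYERS = 60
SSD_GBPS = 17.5          # sequential read bandwidth

read_gb = ACTIVE_EXPERTS * EXPERT_MB * LAYERS / 1024  # ~1.6 GB per forward pass
io_ms = read_gb / SSD_GBPS * 1000                     # ~90 ms of I/O per token

print(f"per-token read: {read_gb:.2f} GB, I/O time: {io_ms:.0f} ms")
```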
The Three Tricks That Make It Work
1. Trust the OS — Don't Fight the Page Cache
The first thing most engineers do when building a custom inference engine is implement a custom cache. Flash-MoE tried this too — Metal LRU cache, malloc-based cache, LZ4-compressed cache. Every approach was slower than doing nothing.
Left alone, the OS page cache achieves a ~71% hit rate on expert data. That's because MoE routing isn't uniform: certain experts activate far more often than others across typical workloads, and the kernel's standard LRU eviction keeps those hot experts warm. Fighting it with a custom cache adds overhead without improving the hit rate.
The project abandoned every custom caching approach and adopted a "Trust the OS" principle: use standard pread() with GCD dispatch groups, let the OS manage caching via LRU, and stay out of the way. This single decision delivered a 38% performance improvement over the next-best caching approach.
The dispatch_io trap: Using Apple's dispatch_io for expert streaming was tested and rejected — it was 70% slower due to dispatch_data management overhead. Raw pread() with parallel dispatch groups outperforms the higher-level abstraction when you control the access pattern.
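The pattern itself is small. Flash-MoE issues raw pread() calls fanned out with GCD dispatch groups; the sketch below substitutes a Python thread pool for GCD, and the file layout, offsets, and expert sizes are hypothetical. Only the parallel positioned-read pattern matches what the text describes:

```python
# Sketch: read K=4 experts in parallel with positioned reads, letting the OS
# page cache do all caching. ThreadPoolExecutor stands in for GCD dispatch
# groups; the tiny expert size and offsets here are made up for illustration.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

EXPERT_SIZE = 4096  # hypothetical blob size (the real expert is ~6.75 MB)

def read_expert(fd, expert_id):
    # os.pread: positioned read with no shared file offset, thread-safe
    return os.pread(fd, EXPERT_SIZE, expert_id * EXPERT_SIZE)

# Demo against a throwaway file holding 8 fake "experts"
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(8):
        f.write(bytes([i]) * EXPERT_SIZE)
    path = f.name

fd = os.open(path, os.O_RDONLY)
routed = [1, 3, 5, 7]  # the K=4 experts the router picked (made up)
with ThreadPoolExecutor(max_workers=4) as pool:
    blobs = list(pool.map(lambda e: read_expert(fd, e), routed))
os.close(fd)
os.unlink(path)

assert all(blob == bytes([e]) * EXPERT_SIZE for blob, e in zip(blobs, routed))
```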
2. FMA-Optimized Dequantization Kernel
4-bit quantization stores two weight values per byte. To use them in matrix multiplication, you dequantize on the fly: for each nibble w, compute (w * scale + bias) * x.
The naive formulation requires three operations: multiply, add, multiply. Flash-MoE restructures this into a fused multiply-add (FMA) instruction by pre-computing scale*x and bias*x once per input vector, then computing fma(w, scale_x, bias_x) for each weight. The GPU's FMA unit can execute this in a single instruction.
This algebraic rearrangement sounds trivial. It delivered a 12% throughput increase.
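The rearrangement is just distributing x: (w·scale + bias)·x = w·(scale·x) + bias·x, and the right-hand side is a single fused multiply-add once scale·x and bias·x sit in registers. A quick numerical check of the identity (plain Python standing in for the Metal fma intrinsic):

```python
# Verify the algebraic identity behind the FMA rewrite:
#   (w*scale + bias) * x  ==  w*(scale*x) + bias*x
# On the GPU the right-hand side becomes one fma(w, scale_x, bias_x).
import math
import random

random.seed(0)
for _ in range(1000):
    w = float(random.randrange(16))            # a 4-bit nibble, 0..15
    scale = random.uniform(-2, 2)
    bias = random.uniform(-2, 2)
    x = random.uniform(-2, 2)
    naive = (w * scale + bias) * x
    fused = w * (scale * x) + bias * x         # fma-shaped form
    assert math.isclose(naive, fused, rel_tol=1e-9, abs_tol=1e-9)
print("identity holds")
```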
// shaders.metal — FMA-optimized 4-bit dequantization inner loop
// Naive: result = (float(nibble) * scale + bias) * x
// FMA: pre-compute scale_x = scale * x, bias_x = bias * x
// result = fma(float(nibble), scale_x, bias_x)
kernel void matvec_4bit_fma(
    device const uchar* weights [[ buffer(0) ]],
    device const float* input   [[ buffer(1) ]],
    device float*       output  [[ buffer(2) ]],
    device const float* scales  [[ buffer(3) ]],
    device const float* biases  [[ buffer(4) ]],
    uint tid [[ thread_position_in_grid ]]
) {
    // Each thread computes one output element (one row of the weight matrix)
    float acc = 0.0f;
    // Per-row quantization parameters
    float scale = scales[tid];
    float bias = biases[tid];
    for (uint i = 0; i < INPUT_DIM / 2; i++) {
        uchar packed = weights[tid * (INPUT_DIM / 2) + i];
        // Unpack two 4-bit weights from one byte
        float w0 = float(packed & 0x0F);
        float w1 = float(packed >> 4);
        float x0 = input[i * 2 + 0];
        float x1 = input[i * 2 + 1];
        // fma(w, scale*x, bias*x) — a single fused multiply-add per weight
        acc += fma(w0, scale * x0, bias * x0);
        acc += fma(w1, scale * x1, bias * x1);
    }
    output[tid] = acc;
}

3. BLAS for Linear Attention — Delegate to Accelerate
The GatedDeltaNet layers require updating a 128×128 state matrix per head, 64 heads per layer. This is a sequence of sscal, sgemv, and sger BLAS operations — exactly the kind of highly-optimized, well-parallelized math that Apple's Accelerate framework has been tuned for on Silicon for years.
Flash-MoE delegates this to cblas_sscal, cblas_sgemv, and cblas_sger from the Accelerate framework. The result: CPU attention time dropped from 0.78ms to 0.28ms — a 64% reduction — versus hand-written scalar code.
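For readers who don't live in BLAS: sscal scales a vector, sgemv is a matrix-vector product, and sger is a rank-1 update (A += α·x·yᵀ), exactly the pieces of a delta-rule state update. A plain-Python sketch of what one cblas_sger call computes, with the size shrunk to 3×3 and the vectors made up for illustration (the real state is 128×128 per head):

```python
# What cblas_sger(CblasRowMajor, m, n, alpha, x, 1, y, 1, A, n) computes:
#   A += alpha * outer(x, y)  — a rank-1 update of the recurrence state.
def sger(alpha, x, y, A):
    for i in range(len(x)):
        for j in range(len(y)):
            A[i][j] += alpha * x[i] * y[j]
    return A

state = [[0.0] * 3 for _ in range(3)]  # illustrative 3x3 state matrix
k = [1.0, 2.0, 3.0]                    # illustrative key vector
v = [0.5, 0.5, 0.5]                    # illustrative value vector
sger(0.1, k, v, state)

assert abs(state[2][0] - 0.15) < 1e-12   # 0.1 * 3.0 * 0.5
```

Accelerate runs the same update vectorized across the AMX/NEON units, which is where the 0.78ms-to-0.28ms drop comes from.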
This is the same principle as trusting the OS page cache: sometimes the right answer is knowing what battle not to fight.
The Pipeline: Every Millisecond Accounted For
The per-layer execution timeline looks like this:
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU]
→ CPU: flush results [0.01ms CPU]
→ CMD2: o_proj + norm + routing + shared [0.55ms GPU]
→ CPU: softmax + topK routing [0.003ms CPU]
→ I/O: parallel pread K=4 experts [2.41ms SSD]
→ CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
Total per-layer: ~4.2ms, dominated by SSD I/O at 2.41ms.
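Those per-layer numbers line up with the headline throughput. A naive serial sum over 60 layers gives a lower bound just under the measured 4.36 tok/s (which benefits from cache warmth not captured here):

```python
# Cross-check: summing the per-layer timeline over 60 layers gives
# ~254 ms per token, i.e. ~3.9 tok/s, consistent with the 4.36 tok/s
# benchmark once warm-cache effects are added.
LAYERS = 60
per_layer_ms = 1.22 + 0.01 + 0.55 + 0.003 + 2.41 + 0.04  # from the timeline
per_token_s = LAYERS * per_layer_ms / 1000

print(f"~{1 / per_token_s:.1f} tok/s")   # ≈ 3.9 tok/s lower bound
```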
One crucial discovery: SSD DMA and GPU compute cannot be overlapped on Apple Silicon. They share the same memory controller, and even small background SSD reads cause disproportionate GPU latency spikes through memory controller arbitration. The serial pipeline (GPU → SSD → GPU) is hardware-optimal, not a limitation of the implementation.
Don't try to overlap GPU and SSD on Apple Silicon. The unified memory architecture means both fight for the same memory controller. F_RDADVISE prefetch was tested and caused a net 73% GPU slowdown from SSD DMA interference. Serial execution is the correct approach.
Step-by-Step: How to Run Flash-MoE
You'll need a MacBook Pro with an M3 Max chip and 48GB unified memory. The 4-bit model requires 209GB of SSD space; the 2-bit variant needs 120GB but breaks tool calling.
Prerequisites:
- macOS 26+ (Darwin 25.2.0)
- Xcode command-line tools
- ~210GB free SSD
- Qwen3.5-397B-A17B weights from HuggingFace
Step 1: Clone and build
# clone-and-build.sh
git clone https://github.com/danveloper/flash-moe.git
cd flash-moe/metal_infer
make

Step 2: Extract weights from HuggingFace safetensors
# extract_weights.py (already in the repo)
# Downloads ~397GB safetensors, converts to flash-moe binary format
# Non-expert weights → model_weights.bin (~5.5GB, mmap'd)
# Expert weights → packed_experts/ directory (209GB at 4-bit)
python extract_weights.py --model-dir /path/to/Qwen3.5-397B-A17B \
    --output-dir /path/to/flash-moe-weights

Step 3: Run inference
# inference-commands.sh
# Standard inference (4-bit, production quality)
./infer --prompt "Explain quantum computing" --tokens 100
# Interactive chat with full tool calling support
./chat
# Per-layer timing breakdown (useful for profiling)
./infer --prompt "Hello" --tokens 20 --timing
# 2-bit mode (faster, ~5.7 tok/s, but breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

The first run will be cold — the OS page cache is empty and every expert read hits SSD. After a few hundred tokens, the cache warms and throughput increases noticeably. The engine makes no attempt to force expert data into memory; it simply lets the OS do its job.
Benchmarks: What the Numbers Actually Say
| Configuration | Tokens/sec | Quality | Notes |
|---|---|---|---|
| 4-bit experts, FMA kernel | 4.36 | Excellent | Current best. Full tool calling. 209GB disk. |
| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization |
| 2-bit experts, OS cache | 5.74 | Good* | 120GB disk. *Breaks JSON/tool calling |
| 2-bit peak (warm cache) | 7.05 | Good* | Single-token burst. Not for sustained use. |
For comparison, llama.cpp reports roughly 20 tok/s generating from empty context with Qwen3.5-397B at Q8_0, dropping to ~8 tok/s at 250k context, but Q8_0 needs ~113GB resident, far beyond the 48GB this MacBook has. Flash-MoE's SSD-streaming approach is the only way to run this model on consumer hardware at all without truncating it.
On the model quality side, Qwen3.5-397B-A17B benchmarks show:
- GPQA Diamond: 88.4% (vs GPT-5.2 at 92.4%, Claude 4.5 Opus at 87.0%)
- MMLU-Pro: 87.8% (near GPT-5.2's 87.4%)
- SWE-bench Verified: 76.4% (vs GPT-5.2 at 80.0%)
- IFBench: 76.5% (beating GPT-5.2 at 75.4%)
This isn't a toy. The 2-bit flash-moe variant scored 82% on GPQA Diamond in community testing — within 6 points of the BF16 reference at 88%.
Optimizations That Failed (And What We Learn From Them)
The project's 90+ experiments are as instructive as the wins:
| Approach | Outcome | Why |
|---|---|---|
| LZ4 expert compression | -13% | Decompress overhead exceeded warm-cache savings |
| Temporal expert prediction | -18% | 25% hit rate, wasted SSD bandwidth on wrong prefetch |
| MLP routing predictor | 31% accuracy | Worse than temporal baseline — not worth it |
| GPU LUT dequantization | -2% | Indirect register access serializes lanes |
| Spin-poll GPU wait | -23% | CPU thermal throttling competes with GPU |
| Speculative early routing | -38% | Cache pollution + overhead negated all gains |
| MTP speculative decoding | Break-even | MoE I/O scales per-token unlike dense models |
The pattern: complexity tends to hurt. Every approach that introduced more moving parts — predictors, compressors, custom caches — lost to the simple baseline. The highest-value optimizations (OS page cache, BLAS, FMA) were about removing complexity, not adding it.
Limitations and What to Watch
Hardware requirements are steep. This requires an M3 Max with 48GB. The M4 Max (also 48GB, ~600 GB/s bandwidth) would likely push to 6+ tok/s. Standard 16GB or 24GB Macs cannot run this — the non-expert weights alone are 5.5GB, and the OS needs the rest for page cache to work.
macOS only. Metal shaders and the Apple Accelerate framework aren't portable. A CUDA/ROCm port would require rewriting the entire compute layer. The architecture is sound, but the implementation is Apple-native.
2-bit quality tradeoff. The 2-bit variant produces \name\ instead of "name" in JSON output, making tool calling unreliable. This isn't a Flash-MoE bug — it's a known limitation of aggressive quantization for structured output. Community testing suggests 2.46 BPW with mixed precision (some tensors at Q8_0, Q6_K) outperforms naive 2-bit while maintaining quality.
Speed is real but not fast. 4.4 tok/s is usable for exploratory use. It's not production throughput. For agentic workloads generating thousands of tokens per task, this is a slow machine — but it's a local, private, offline 397B model.
Expert count reduction. Flash-MoE activates K=4 experts per token vs. the official K=10. This is a deliberate tradeoff for memory bandwidth. Quality holds up well at 4, but it's not identical to the full model.
Final Thoughts
Flash-MoE is a proof-of-concept that answers a question the field hadn't seriously asked: what happens when you design an inference engine around SSD bandwidth instead of VRAM?
The answer is surprisingly good. A 397B parameter model, running locally, with tool calling, on hardware you can buy at an Apple store. Not via cloud APIs. Not through a company's servers. On a laptop.
The techniques here — SSD expert streaming, OS page cache delegation, FMA kernel restructuring, BLAS for linear attention — aren't Flash-MoE-specific. They're transferable to any MoE model inference engine. As Qwen3.5, Mixtral, and DeepSeek-style sparse architectures proliferate, these patterns become more valuable, not less.
The inspiration for this approach came from Apple's own research — the "LLM in a Flash" paper (ACL 2024) demonstrated 4-5x speedup for CPU inference and 20-25x for GPU by streaming model weights from flash storage. Flash-MoE took that concept, applied it specifically to the sparse activation patterns of MoE models, and tuned every millisecond of the pipeline against real hardware.
The story of local AI isn't just getting bigger models on cloud hardware. It's getting smarter about what "memory" even means. Your SSD holds 1TB. Your model holds 209GB. At 17.5 GB/s, the math works.
Sources cited: Flash-MoE GitHub · HN Discussion · Qwen3.5-397B-A17B on HuggingFace · Apple "LLM in a Flash" (arXiv:2312.11514) · llama.cpp · Flash-MoE Technical Paper · Apple Metal Developer Documentation · Apple Accelerate Framework · MoEKD: MoE Knowledge Distillation (arXiv, March 2026)