AIStackInsights


Tutorials

The iPhone 17 Pro Is Running a 400B LLM. Here's the Engineering That Makes It Possible.

An iPhone with 12GB of RAM just ran a 400-billion-parameter model. The trick is streaming weights from flash — and the implications are massive.

AIStackInsights Team · March 24, 2026 · 13 min read
on-device-ai · apple-neural-engine · llm-inference · anemll · mixture-of-experts

Your iPhone has 12 gigabytes of RAM. The Qwen3.5-397B model has roughly 800 gigabytes of weights. By every law of conventional engineering, running the latter on the former should be flatly impossible. And yet, this week, a demo landed at the top of Hacker News — 620 upvotes, 279 comments, still climbing — showing exactly that: an iPhone 17 Pro running a 400-billion-parameter LLM, locally, offline, with no cloud call.

The demo isn't production-ready. Tokens come out slowly, the time to first token is brutal, and the quantization is aggressive. But that's not the point. The point is that a set of engineering techniques — flash weight streaming, Mixture-of-Experts sparsity, hardware-aware OS caching, and the open-source ANEMLL framework — has broken through an architectural limit that the AI industry assumed was fixed until at least 2028.

This is a line in the sand. Here's the full technical breakdown.

Why This Matters

📁 Full source code for this article is available on GitHub: github.com/aistackinsights/stackinsights/iphone-17-pro-400b-llm-flash-streaming-anemll

The AI industry is built on a premise: frontier models need frontier hardware. GPT-4-class reasoning requires H100s. The cloud is the only path to intelligence at scale. That premise is what justifies $500B in data center investment and the entire model-as-a-service economy.

On-device inference chips away at that premise from the consumer end. Apple's Neural Engine, included in every iPhone since the A11 Bionic, is a purpose-built 16-core matrix multiplication accelerator. The A17 Pro (iPhone 15 Pro) and the A18 Pro in the iPhone 17 Pro push this further, with Apple's CoreML framework providing a high-level API to target it directly.

But raw compute isn't the bottleneck for frontier models — memory is. A 70B dense model at 4-bit quantization needs roughly 35GB of VRAM just to sit in memory. iPhone 17 Pro ships with 12GB of shared RAM. This gap is why "on-device frontier inference" has been a research curiosity rather than a deployment target.
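The 35GB figure falls straight out of the arithmetic. A quick sketch, using a hypothetical helper function (not any particular library's API), makes the memory wall concrete:

```python
def model_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory for a dense model (weights only,
    ignoring KV cache and activation buffers)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

print(model_memory_gb(70, 4))    # 70B at 4-bit -> 35.0 GB
print(model_memory_gb(397, 16))  # 397B at FP16 -> 794.0 GB, the "~800GB" above
```

Either way you slice it, the weights dwarf the 12GB of DRAM on the phone.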

Flash streaming changes that equation.

The Model: What "400B" Actually Means for a MoE

The model in the demo is Qwen3.5-397B-A17B — a Mixture-of-Experts (MoE) architecture from Alibaba's Qwen team. The naming convention is deliberate: 397B is the total parameter count, A17B is the active parameter count per token.

This distinction is everything.

In a dense transformer like LLaMA-3 70B, every forward pass activates all 70B parameters. In a MoE model, each layer routes each token through only a small subset of "expert" sub-networks. Qwen3.5-397B has 512 experts per layer, but routes each token through only 4–10 of them. The rest sit cold on storage, never touched for that token.

This creates a surprising property: the effective compute cost of Qwen3.5-397B per token is closer to a 17B dense model, while its world knowledge capacity benefits from training across all 397B parameters. HN commenter @ykumards summarized it well:

"It behaves more like a ~80B parameter model (geometric mean of active and total params), and has world knowledge closer to a 400B parameter model."

So when someone says "400B model on a phone," the pedantic-but-important clarification is: the phone is computing 17B-equivalent operations per token, but loading expert weights on demand from the full 400B pool. That's still extraordinary — and it's what makes flash streaming tractable.
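The active/total distinction is easy to check numerically. The "~80B effective" figure quoted above is the geometric mean of active and total parameters, a community heuristic rather than a formal result:

```python
import math

total_params = 397e9   # full expert pool (Qwen3.5-397B)
active_params = 17e9   # parameters touched per token (the "A17B")

# Fraction of the model actually computed on each forward pass
active_fraction = active_params / total_params
print(f"active fraction per token: {active_fraction:.1%}")   # 4.3%

# Heuristic: effective dense-equivalent size is the geometric
# mean of active and total parameter counts
effective = math.sqrt(active_params * total_params)
print(f"effective dense-equivalent: ~{effective / 1e9:.0f}B")  # ~82B
```

So roughly 4% of the model is computed per token, while quality lands near what an ~82B dense model would deliver, matching the HN estimate.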

The Core Technique: Streaming Weights from Flash

The theoretical foundation comes from an Apple Research paper published at ACL 2024: "Efficient Large Language Model Inference with Limited Memory."

The key insight: flash storage is 100x larger than DRAM, and modern NVMe SSDs are now fast enough that careful scheduling can keep weight streaming roughly in step with computation. On M5-generation Apple silicon, internal SSD bandwidth has roughly doubled compared to M3, hitting sustained reads well above 10 GB/s. That's still far short of DRAM bandwidth, but it's enough if you're clever about what you read and when.

Apple's paper identifies two principal optimizations:

1. Windowing: Neurons activated in one forward pass are statistically likely to be activated again in the next few passes. By keeping recently-used expert weights in a small "window" buffer in DRAM and only evicting them when needed, you dramatically reduce flash reads per token.

2. Row-column bundling: Flash memory is dramatically faster for large sequential reads than for scattered random reads. By restructuring weight tensors so that logically adjacent expert weights are physically co-located on flash, you convert many small reads into fewer large ones — getting near-peak SSD bandwidth per token.
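The windowing idea can be sketched as a small LRU buffer of recently routed experts, with everything outside the window modeled as a flash read. The ExpertCache class and its cost model below are illustrative assumptions, not code from the paper or the demo:

```python
from collections import OrderedDict

class ExpertCache:
    """LRU 'window' of expert weights held in DRAM.
    A miss models a flash read; a hit costs nothing."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.window = OrderedDict()   # expert_id -> weights (stubbed)
        self.flash_reads = 0

    def fetch(self, expert_id: int):
        if expert_id in self.window:
            self.window.move_to_end(expert_id)   # refresh recency
            return self.window[expert_id]
        self.flash_reads += 1                    # simulate a flash read
        if len(self.window) >= self.capacity:
            self.window.popitem(last=False)      # evict least-recently-used
        self.window[expert_id] = f"weights[{expert_id}]"
        return self.window[expert_id]

# Temporal locality: consecutive tokens often reuse the same experts,
# so a small window absorbs most fetches.
cache = ExpertCache(capacity=8)
routing = [1, 2, 3, 1, 2, 3, 4, 1, 2, 5, 1, 3]  # toy per-token expert picks
for expert_id in routing:
    cache.fetch(expert_id)
print(cache.flash_reads)  # 5 distinct experts -> only 5 flash reads for 12 fetches
```

Real routing distributions are less friendly than this toy trace, but the mechanism is the same: the window turns repeated activations into cache hits.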

The Apple paper reports that these two techniques together enable running models up to 2× the available DRAM capacity, with 4–5× inference speedup on CPU and 20–25× on GPU compared to naive weight loading. The iPhone demo pushes this to ~66× DRAM (12GB RAM, ~800GB model).

The demo takes the "Trust the OS" approach even further: rather than implementing a custom memory manager, it relies on the iOS/macOS filesystem page cache to handle expert eviction naturally. The OS sees frequent reads from certain expert-layer files and keeps them hot in page cache. Rarely-used experts get evicted. The result is a self-tuning cache that mirrors the model's actual routing distribution.
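The "Trust the OS" approach can be sketched with mmap: map the weights file read-only and slice into it on demand, letting the kernel's page cache keep hot experts resident and evict cold ones under memory pressure. The file layout and sizes here are invented for illustration:

```python
import mmap
import os
import tempfile

EXPERT_BYTES = 4096      # toy expert size; real experts are tens of MB
NUM_EXPERTS = 16

# Build a toy "weights file" with experts stored contiguously
# (row-column bundling would lay out real tensors this way)
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
with open(path, "wb") as f:
    for i in range(NUM_EXPERTS):
        f.write(bytes([i]) * EXPERT_BYTES)

def load_expert(mm: mmap.mmap, expert_id: int) -> bytes:
    # Slicing an mmap faults in only the cold pages; the OS page
    # cache keeps frequently routed experts resident in DRAM
    off = expert_id * EXPERT_BYTES
    return mm[off : off + EXPERT_BYTES]

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    hot = load_expert(mm, 3)        # first touch: read from storage
    hot_again = load_expert(mm, 3)  # likely served from page cache
    print(hot[0], len(hot))
    mm.close()
```

No eviction logic appears anywhere in this code; that is exactly the point. The kernel's page replacement policy ends up approximating the model's routing distribution for free.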

ANEMLL: The Open-Source Stack

The infrastructure making this demo reproducible is ANEMLL — the Artificial Neural Engine Machine Learning Library (pronounced "animal"), now at version 0.3.5.

ANEMLL provides a complete pipeline from HuggingFace weights to on-device ANE inference:

  1. Model Conversion: Takes HuggingFace-format weights (LLaMA, Qwen, Gemma 3) and converts them to CoreML format via Apple's coremltools. The conversion handles:

    • Chunked model splitting to fit iOS (1GB) and macOS (2GB) CoreML file limits
    • In-model argmax — moving the vocabulary selection step inside the CoreML model, reducing ANE-to-host data transfer by eliminating full logit tensor transfers
    • ANEMLL-Dedup, a surgical weight deduplication pass that cuts converted model size by ~50%
  2. ANE-Optimized Inference: The inference engine uses IOSurface-backed buffers and a serial prediction queue to eliminate race conditions unique to iOS's ANE scheduler.

  3. Chat Interface: A TestFlight iOS app with voice input, AirDrop model sharing, and streaming inference display.
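The in-model argmax win is easy to quantify: moving vocabulary selection inside the model means transferring one token id per step instead of a full logit row. The vocabulary size below is an assumed round number on the order of Qwen tokenizers, not a spec from ANEMLL:

```python
VOCAB = 152_000          # assumed vocab size, roughly Qwen-scale
FP16_BYTES = 2
TOKEN_ID_BYTES = 4       # one int32 index

logits_transfer = VOCAB * FP16_BYTES     # full logit row per decode step
argmax_transfer = TOKEN_ID_BYTES         # in-model argmax result

print(f"per-token transfer without argmax: {logits_transfer / 1024:.0f} KiB")
print(f"with in-model argmax: {argmax_transfer} bytes "
      f"({logits_transfer // argmax_transfer:,}x less)")
```

Hundreds of KiB versus 4 bytes per decode step is why this matters: ANE-to-host transfers happen on every token, so the savings compound across an entire generation.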

ANEMLL's benchmarks on Llama-class models show near-parity with HuggingFace FP16 on standard evals:

Task            HF-FP16    ANEMLL-FP16    Delta
ARC Challenge   31.66%     30.97%         -0.69%
ARC Easy        60.65%     60.94%         +0.29%
BoolQ           63.91%     64.68%         +0.77%
PiQA            66.81%     67.74%         +0.93%
WinoGrande      56.43%     56.67%         +0.24%

The ANEMLL-converted models slightly outperform their HuggingFace equivalents on 4 of 5 benchmarks — a result of ANE-specific numerical precision optimizations in the new RMSNorm implementation shipped in v0.3.4.

A complementary technique for reducing latency is covered in Apple's KV Prediction paper: using a small auxiliary model to pre-compute an approximate KV cache, reducing time-to-first-token by 15–50% on TriviaQA at fixed FLOPs budgets. ANEMLL contributors are already exploring integration.

Step-by-Step: Running Large Models via Flash Streaming

Here's how to reproduce the core technique on your own macOS machine using MLX — Apple's open-source array framework for Apple Silicon — and mlx-lm.

Prerequisites

  • Apple Silicon Mac (M1 or later; M3/M4/M5 recommended for larger models)
  • macOS 14+, Python 3.11+
  • At least 16GB unified memory for 8B models; 64GB+ for 70B+

Step 1: Install the stack

# install.sh
# Install MLX and the mlx-lm convenience layer
pip install mlx mlx-lm
 
# Verify your hardware
python -c "import mlx.core as mx; print(mx.default_device())"
# Expected: Device(gpu, 0)  <- uses ANE/GPU unified memory

Step 2: Streaming inference with MLX (the key part)

The parameters to watch for memory are kv_bits, which quantizes the KV cache, and max_kv_size, which bounds DRAM usage with a rotating cache. The minimal example below keeps the defaults; reach for both knobs when long contexts push memory:

# flash_inference.py
# Stream token generation from a model larger than your available RAM
# Works on Apple Silicon via MLX's lazy evaluation + unified memory
 
from mlx_lm import load, stream_generate
 
# Load a large model — MLX uses lazy evaluation so weights
# aren't all loaded to unified memory at once
model, tokenizer = load(
    "mlx-community/Qwen2.5-72B-Instruct-4bit",
    # tokenizer_config overrides for chat formatting
    tokenizer_config={"trust_remote_code": True},
)
 
prompt = "Explain the difference between MoE and dense transformers in 3 bullet points."
 
messages = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False,
)
 
# stream_generate yields text chunks as they are produced
# (recent mlx-lm versions yield response objects; print their .text instead)
# max_tokens caps generation length; temp controls sampling
print("Response: ", end="", flush=True)
for token_text in stream_generate(
    model,
    tokenizer,
    prompt=formatted,
    max_tokens=512,
    temp=0.6,
):
    print(token_text, end="", flush=True)
print()

Step 3: Converting to ANE via ANEMLL (for iOS targets)

# convert_to_ane.sh
# Convert a HuggingFace model to CoreML for Apple Neural Engine
 
# Clone ANEMLL
git clone https://github.com/Anemll/Anemll && cd Anemll
 
# Set up environment (uses uv for fast installs)
brew install uv
./create_uv_env.sh
source env-anemll/bin/activate
./install_dependencies.sh
 
# Convert Qwen3 0.6B (small — good for testing the pipeline)
./anemll/utils/convert_model.sh \
  --model Qwen/Qwen3-0.6B-Instruct \
  --output ./output/qwen3-0.6b-ane \
  --context 512
 
# Run the chat CLI against converted model
python anemll/utils/chat.py --model ./output/qwen3-0.6b-ane

For iOS deployment: After conversion, use ANEMLL's anemll-dedup tool to run weight deduplication before bundling into your Xcode project. This typically cuts the on-disk model size by ~50%, which matters when you're distributing over the App Store.
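The deduplication idea itself is simple to sketch: hash each tensor's bytes, store one copy per unique hash, and have layers reference shared blobs. This is a toy illustration of the concept, not ANEMLL's actual dedup pass:

```python
import hashlib

def dedup(tensors: dict[str, bytes]):
    """Map tensor names to shared storage keyed by content hash."""
    storage: dict[str, bytes] = {}   # sha256 hex digest -> unique blob
    refs: dict[str, str] = {}        # tensor name -> sha256 hex digest
    for name, blob in tensors.items():
        digest = hashlib.sha256(blob).hexdigest()
        storage.setdefault(digest, blob)  # keep each unique blob once
        refs[name] = digest
    return storage, refs

# Tied embeddings and repeated chunk weights collapse to one copy
tensors = {
    "embed.weight": b"\x01" * 64,
    "lm_head.weight": b"\x01" * 64,   # tied with embed -> deduplicated
    "layer0.mlp": b"\x02" * 64,
}
storage, refs = dedup(tensors)
print(len(tensors), "named tensors ->", len(storage), "unique blobs")
```

Chunked CoreML conversion duplicates shared weights across chunk boundaries, which is why a content-addressed pass like this can recover so much space.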

Benchmarks and What the Numbers Actually Say

Let's be direct about the performance profile of the iPhone 400B demo:

Metric                iPhone 17 Pro (Qwen3.5-397B Q1)   MacBook Pro M4 Max (Q2)   Cloud API (GPT-4o)
Time to first token   45–90 seconds                     8–15 seconds              0.3–1.5 seconds
Tokens/sec            ~0.3–0.8 t/s                      ~2–4 t/s                  40–80 t/s
Model quality         Degraded (Q1)                     Good (Q2)                 Excellent
Privacy               Complete                          Complete                  None
Cost per 1M tokens    $0                                $0                        $5–$15
Requires internet     No                                No                        Yes

Estimates based on community-reported benchmarks and ANEMLL-bench measurements. Flash streaming speeds scale with SSD bandwidth — M5-generation hardware will improve these figures.

For the more practical 8B–27B range on Apple Silicon, MLX on an M4 Pro delivers genuinely useful speeds:

Model          Device            Quantization   Tokens/sec
Qwen3.5-8B     M4 Pro (48GB)     Q4             55–70 t/s
Qwen3.5-27B    M4 Max (128GB)    Q4             28–40 t/s
Qwen3.5-72B    M4 Max (128GB)    Q4             12–18 t/s
Qwen3.5-397B   M4 Max (128GB)    Q2 (flash)     2–4 t/s

The 400B demo is a proof of concept. The 8B–72B range is where on-device inference is genuinely useful today.

Limitations and What to Watch

The HN discussion surfaced several important technical caveats worth taking seriously:

Quantization quality at Q1 is a real problem. Qwen3.5-397B on iPhone uses 1-bit quantization for the cold weights — aggressive even by edge inference standards. The ANEMLL docs acknowledge: "LUT4 quality is fairly low due to lack of Block Quantization on Apple Neural Engine." At Q1, you're running a shadow of the original model's capability. The Hacker News consensus is that practical use requires at least Q4 for acceptable output quality, which pushes the feasible model size on current hardware to roughly 27–70B.
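A toy round-trip shows why 1-bit hurts: quantize a weight vector to 1-bit (effectively sign only) versus 4-bit and compare reconstruction error. This uses plain uniform scalar quantization for illustration; real schemes (grouped, LUT-based) are more sophisticated:

```python
import random

def quantize_roundtrip(xs, bits):
    """Uniform symmetric quantization to 2**bits levels, then dequantize."""
    levels = 2 ** bits
    scale = max(abs(x) for x in xs)
    step = 2 * scale / (levels - 1)
    out = []
    for x in xs:
        q = round((x + scale) / step)       # nearest quantization level
        q = min(max(q, 0), levels - 1)      # clamp to valid range
        out.append(q * step - scale)        # dequantize back to a float
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(10_000)]
for bits in (1, 2, 4, 8):
    err = mse(weights, quantize_roundtrip(weights, bits))
    print(f"{bits}-bit reconstruction MSE: {err:.5f}")
```

The error drops sharply with each added bit; at 1-bit every weight collapses to plus or minus the scale, which is the numerical intuition behind the "shadow of the original model" observation.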

Expert routing is not as sparse as the demo implies. While Qwen3.5-397B only activates 4–10 experts per token, experts don't neatly specialize by domain — routing choices change on every single token. One commenter put it plainly: "It's just swapping experts out constantly." The OS page cache does a tolerable job, but this is sustained random I/O. Thermal throttling on iPhone becomes a real issue within minutes.

The "400B" headline is technically misleading. As HN user @anemll_dev noted, the correct framing is a 17B-active model that draws on a 400B parameter pool. For most downstream tasks, your quality ceiling is closer to what a well-quantized 80B dense model would deliver — still extraordinary for an offline phone demo, but not GPT-4-class.

Do not ship Q1 quantized models to production users. The output degradation at 1-bit is significant — hallucination rates increase materially and reasoning chains can collapse entirely on complex prompts. Use Q4 minimum for any user-facing deployment.

What to watch in 2026:

  • Block quantization on ANE: Apple hasn't enabled block quantization on the Neural Engine yet. When they do (expected in upcoming CoreML releases), Q4 quality on edge devices will improve substantially — potentially enabling 70B dense models with acceptable quality on iPhone.
  • M5 Ultra SSD bandwidth: The SSD bandwidth doubling from M3→M5 that made this demo possible will continue. M5 Ultra is expected in Mac Pro class hardware later in 2026; that generation should push flash streaming from "demo" to "plausible deployment."
  • ANEMLL v0.4: The roadmap includes Qwen3 MoE support and improved KV Prediction integration. Watch the GitHub — this project is shipping fast.
  • llama.cpp SSD offload: The --flash-attn and experimental SSD offload features in llama.cpp are converging on similar territory for non-Apple hardware.

Final Thoughts

What happened this week isn't just a cool party trick. It's a proof that the "you need a data center for frontier AI" assumption has a structural crack in it — and that crack will widen.

The economic implications run deep. If you can run a 400B-parameter model locally, the marginal cost of inference drops to zero. No API call fees, no data leaving your device, no dependency on uptime agreements from a provider burning $500B on GPUs. For regulated industries — healthcare, legal, finance — the privacy properties alone justify the engineering investment.

The practical path for developers right now isn't to run 400B on a phone. It's to:

  1. Target 8B–27B models on MLX for macOS apps — these run at usable speeds today
  2. Use ANEMLL for iOS targets — the conversion pipeline is mature, and the TestFlight app proves the UX works
  3. Watch the MoE + flash streaming trajectory — within 12–18 months, Q4 quality at 100B+ scale on consumer Apple Silicon is plausible

The cloud isn't going anywhere. But the assumption that only the cloud can run meaningful AI just died on a 6-inch phone screen.


Sources

  1. Hacker News: iPhone 17 Pro Demonstrated Running a 400B LLM — Community discussion with technical breakdown
  2. ANEMLL GitHub Repository (v0.3.5) — Open-source ANE inference library
  3. ANEMLL Official Site — Project overview and goals
  4. Apple Research: Efficient LLM Inference with Limited Memory (ACL 2024) — The foundational flash streaming paper
  5. KV Prediction for Improved Time to First Token (arXiv 2410.08391) — Apple Research on prefill latency reduction
  6. Apple MLX Framework — Array framework for Apple Silicon ML
  7. Qwen2.5 Model Card — HuggingFace — Qwen architecture details
  8. Apple CoreML Documentation — CoreML framework reference
  9. ANEMLL-Bench: Apple Neural Engine Benchmarking — Performance metrics and eval results
