
Large Language Models

AI Solved a Frontier Math Problem This Week. It Also Scored 1% on Tasks a Child Masters in Minutes.

ARC-AGI-3 just launched and current AI scores under 5%. The same week GPT-5.4 solved an open research math problem. This is not a contradiction. It is the most important insight about intelligence published this decade.

AIStackInsights Team · March 25, 2026 · 15 min read
ai-agents · llms · machine-learning · ai-tools · tutorials

This week, two things happened that appear to be in direct contradiction.

First: Epoch AI confirmed that GPT-5.4 Pro solved a genuine open mathematics research problem — a Ramsey-style problem on hypergraphs that had stumped researchers. The problem contributor, Will Brian, said: "I had previously wondered if the AI's approach might be possible, but it seemed hard to work out. Now I see that it works out perfectly." GPT-5.4 wasn't alone. Claude Opus 4.6, Gemini 3.1 Pro, and GPT-5.4 (xhigh) all solved it independently. Frontier AI models are now contributing to peer-reviewed mathematics.

Second: ARC-AGI-3 launched. François Chollet's team — the people who have built the definitive benchmark for measuring real AI reasoning — released their third iteration. They put current frontier models through it.

Every single model scored under 5%.

Not 50%. Not 20%. Under 5% — on tasks that 6 out of 10 randomly selected humans solve on their first try.

These two facts feel like they cannot both be true. An AI that can advance mathematical research should not score in the low single digits on puzzles most people solve on their first try. Unless... the thing we call "intelligence" is not what we think it is.

📁 Agent starter kit and ARC-AGI-3 integration code for this article: github.com/aistackinsights/stackinsights/arc-agi-3-what-1-percent-score-reveals-about-intelligence

The Difference Between Knowing and Learning

The easiest way to understand the gap is to watch a child encounter a new game.

Give a 7-year-old a Nintendo Switch game they have never seen. Within 5 minutes, they have formed a mental model of the physics, figured out the goal, and are adjusting their strategy based on feedback. They have never been trained on this specific game. They have never seen its sprites. They are doing something fundamentally different from retrieval — they are building a new world model, on the fly, from scratch.

This is what ARC-AGI-3 measures.

Not: "Do you know the answer?"
But: "Can you figure out a world you have never seen before, with only your behavior as feedback?"

The benchmark places an agent inside novel interactive environments. There are no text instructions. No pre-loaded knowledge helps. The agent must perceive what matters, select actions, observe outcomes, update its model, and adapt — over long horizons with sparse rewards. The scoring is unforgiving: it measures not just whether you solve the level, but how efficiently you solve it compared to a skilled human (specifically, the second-best human first-run solution per level).

A model that takes 10x as many steps as a human gets a score of 1% on that level — because the metric is (human_steps / agent_steps)². The squaring is intentional. It brutally penalizes inefficiency because efficient, directed learning is the signal. Brute force exploration is not intelligence; it is compute.
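In code, the per-level score works out like this. This is a reconstruction from the description above, not the official ARC Prize scoring implementation, and the cap at 1.0 is an assumption based on the "approaching 100%" framing:

```python
# Per-level score: squared ratio of human steps to agent steps.
# Reconstructed from the article's description; the 1.0 cap is an assumption.
def level_score(human_steps: int, agent_steps: int) -> float:
    if agent_steps <= 0:
        return 0.0
    return min(1.0, (human_steps / agent_steps) ** 2)

# An agent that needs 10x the human's steps scores 1% on that level:
print(round(level_score(12, 120), 3))  # → 0.01
# Half the human's efficiency yields 25%, not 50%, because of the squaring:
print(level_score(60, 120))            # → 0.25
```

The squaring is what makes small efficiency gaps compound into large score gaps.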

Current frontier models: under 5% average. A median human: roughly 25–30%. The best humans: approaching 100% by definition.

Why GPT-5.4 Can Solve Math and Still Score 1%

The math problem result is real and remarkable. The Ramsey hypergraph problem required finding a new construction for an infinite sequence, improving known lower bounds — a genuine creative mathematical contribution. The AI did not look it up. It reasoned through it. The full conversation transcript is published and the problem contributor confirmed the solution is correct and novel.

But here is the mechanism that reconciles everything:

GPT-5.4 was given the problem in natural language, with full mathematical definitions, symbol conventions, and context. It was operating in its strongest mode — a rich structured text problem, with immediate expert feedback confirming correctness. It could write down its reasoning, check it, revise it. The entire problem space was presented to it as a static artifact.

ARC-AGI-3 gives nothing. No text. No rules. No explanations. The agent sees a grid of pixels, takes an action, and sees what changes. It must infer the physics of the world, the goal, and the optimal path — all from behavioral feedback alone. The model's training, no matter how vast, gives it zero information about this specific never-before-seen environment.

This is the distinction Chollet has been making for years, and that most AI commentators have been reluctant to accept: language models are extraordinary interpolators of trained knowledge. They are poor at out-of-distribution inductive reasoning from first principles.

The math problem was solvable by a good mathematician with access to mathematical notation. GPT-5.4 is, among other things, an extremely good mathematical reasoner in a notation it has seen trillions of times. It performed well at a task it was implicitly trained for — just at a domain (open research math) where the training distribution is sparse.

ARC-AGI-3 tests something orthogonal: the ability to build models of arbitrary new things efficiently. And there, current architecture hits a wall.

What ARC-AGI-3 Measures: The Four Pillars

The benchmark is designed around four capabilities that together define what Chollet calls general fluid intelligence — as opposed to crystallized knowledge retrieval:

1. Skill acquisition efficiency — how quickly do you learn a new task from scratch? Not from memory. From experience. A human needs ~3 trials to infer the rules of a new environment. Current AI needs hundreds or thousands of interaction steps to achieve equivalent understanding, and even then often fails.

2. Long-horizon planning with sparse feedback — you cannot solve the level by optimizing greedily. You need to hold a plan across many steps where most actions provide no direct reward signal. Short-context greedy optimization — which is what transformer architectures fundamentally do — performs poorly here.

3. Experience-driven belief updating — you see something unexpected. You revise your world model. You act differently. This requires maintaining a belief state that compresses prior observations into a useful representation and updates it coherently. Attention mechanisms approximate this, but the approximation breaks down under extended novel interaction.

4. Adaptive strategy across environments — what works in level 3 does not work in level 7. Each environment has different physics, different goals, different feedback structure. The agent must transfer the meta-skill of "how to learn new environments quickly" — not specific environment strategies. This is meta-learning in the strict sense: learning how to learn, not learning what to do.

The scoring (squared efficiency vs. second-best human) means even small capability gaps compound severely. A model that is 50% as efficient as a human on each step gets a score of 25% — not 50% — because the squaring captures that efficiency is multiplicative across decision quality.

The Architecture Gap — and What Might Close It

The question that matters for developers: what architectural properties would allow an agent to score above 50% on ARC-AGI-3?

Research points to three properties that current transformer-based models lack or approximate poorly:

Persistent compressed world models. A child playing a new game is not just running forward passes through a fixed neural network. They are continuously updating a compact world model that summarizes what they have learned. Transformers use attention over full context — which scales quadratically and does not compress. State-space models (Mamba, RWKV), memory-augmented architectures (Differentiable Neural Computers, LSTM with external memory), and hybrid approaches are more promising here. The Mamba architecture in particular showed that selective state spaces can maintain relevant history far more efficiently than attention for sequential tasks.
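To make the contrast concrete, here is a toy constant-size belief state in plain Python: it maintains a running estimate and a surprise signal instead of storing the whole observation history. This is a pedagogical sketch with assumed scalar observations, not any of the architectures named above:

```python
# O(1) belief state updated recurrently, vs. attention over the full history.
# Toy sketch: scalar observations, exponential decay. Illustrative only.
class CompressedBelief:
    def __init__(self, decay: float = 0.9):
        self.decay = decay
        self.mean = 0.0       # running estimate of the environment signal
        self.surprise = 0.0   # running measure of prediction error
        self.n = 0

    def update(self, obs: float) -> None:
        error = obs - self.mean
        # Surprise tracks how far recent observations deviate from belief.
        self.surprise = self.decay * self.surprise + (1 - self.decay) * abs(error)
        # Belief moves a fraction of the way toward each new observation.
        self.mean += (1 - self.decay) * error
        self.n += 1

belief = CompressedBelief()
for obs in [1.0, 1.0, 1.0, 5.0]:  # a sudden change in the world
    belief.update(obs)
print(belief.n, round(belief.surprise, 2))  # → 4 0.69
```

The state is two floats regardless of episode length; the surprise signal jumps when the world changes, which is exactly the trigger an adaptive agent needs for re-planning.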

Explicit hypothesis testing loops. Humans in new environments are constantly running mini-experiments: "I'll press left and see if the block moves." ARC-AGI-3 rewards this. An agent that forms explicit hypotheses about the environment, designs actions to test them, and updates beliefs accordingly will be far more efficient than one that takes actions based on immediate policy outputs. This is the cognitive science concept of active inference — the brain as a hypothesis-testing engine, not a pattern-matching engine.
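A minimal version of such a loop fits in a few lines. The environment, actions, and candidate rules below are invented for illustration; the point is the structure: predict, act, eliminate inconsistent hypotheses:

```python
# Toy hypothesis-testing loop: the agent keeps candidate rules for an unknown
# environment and picks the action on which the survivors disagree most.
# Hypothetical example, not the ARC-AGI-3 API.
from collections import Counter

def hidden_env(action: str) -> str:
    """The hidden rule (unknown to the agent): only 'left' moves the block."""
    return "moved" if action == "left" else "no_change"

# Each hypothesis predicts the outcome of every action.
hypotheses = {
    "left_moves":  {"left": "moved", "right": "no_change", "up": "no_change"},
    "right_moves": {"left": "no_change", "right": "moved", "up": "no_change"},
    "up_moves":    {"left": "no_change", "right": "no_change", "up": "moved"},
}

def most_informative_action(live: dict) -> str:
    # More distinct predictions among surviving hypotheses = more informative.
    def disagreement(action):
        return len(Counter(h[action] for h in live.values()))
    return max(["left", "right", "up"], key=disagreement)

live = dict(hypotheses)
steps = 0
while len(live) > 1:
    action = most_informative_action(live)
    outcome = hidden_env(action)
    # Belief update: discard hypotheses inconsistent with the observation.
    live = {name: h for name, h in live.items() if h[action] == outcome}
    steps += 1

print(steps, list(live))  # → 1 ['left_moves']
```

One deliberately chosen action settles the question; a random-action agent would average more steps for the same information, and the squared scoring punishes every wasted step.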

Meta-learning inner loops. The outer loop optimizes the agent's general learning ability. The inner loop adapts quickly to each specific environment. Model-Agnostic Meta-Learning (MAML) and its successors showed this was possible in principle. The practical challenge is that ARC-AGI-3 environments are more diverse and structurally complex than typical meta-learning benchmarks. But this remains the most promising direction: a model that has learned how to learn rather than learned what to do.

Building an Agent for ARC-AGI-3: Where to Start

ARC-AGI-3 ships with a developer toolkit and documented API. Any developer can build and benchmark an agent today. Here is a minimal skeleton showing the integration pattern:

# arc_agi3_agent.py — minimal agent framework for ARC-AGI-3
# Docs: https://docs.arcprize.org/
import httpx, json, time
from dataclasses import dataclass, field
from typing import Any
 
API_BASE = "https://api.arcprize.org/v3"
 
@dataclass
class WorldModel:
    """Compact belief state updated from observations."""
    observations: list[dict] = field(default_factory=list)
    hypotheses: list[str] = field(default_factory=list)
    step_count: int = 0
 
    def observe(self, obs: dict) -> None:
        self.observations.append(obs)
        self.step_count += 1
 
    def summarize(self) -> str:
        """Compress observations into a structured world model summary."""
        if not self.observations:
            return "No observations yet."
        recent = self.observations[-5:]  # last 5 steps
        changes = []
        for i in range(1, len(recent)):
            prev, curr = recent[i-1], recent[i]
            if prev != curr:
                changes.append(f"Step {self.step_count - len(recent) + i}: state changed")
        return f"Steps taken: {self.step_count}. Recent changes: {'; '.join(changes) or 'none'}."
 
 
class ARCAgent:
    def __init__(self, api_key: str, model_fn):
        """
        api_key: ARC-AGI-3 API key from arcprize.org
        model_fn: callable(prompt: str) -> str  (your LLM or policy)
        """
        self.client = httpx.Client(
            base_url=API_BASE,
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30
        )
        self.model_fn = model_fn
        self.world = WorldModel()
 
    def start_episode(self, task_id: str) -> dict:
        resp = self.client.post("/episodes", json={"task_id": task_id})
        resp.raise_for_status()
        data = resp.json()
        self.world = WorldModel()
        return data
 
    def act(self, episode_id: str, observation: dict) -> str:
        """Decide action using world model + LLM reasoning."""
        self.world.observe(observation)
        world_summary = self.world.summarize()
 
        prompt = f"""You are solving an unknown interactive environment.
Your world model so far:
{world_summary}
 
Current observation:
{json.dumps(observation, indent=2)}
 
Available actions: {observation.get('available_actions', ['up', 'down', 'left', 'right', 'interact'])}
 
Think step by step:
1. What does the current state tell you about the environment's rules?
2. What hypothesis can you test with your next action?
3. What action is most informative OR most likely to progress toward the goal?
 
Respond with ONLY the action name."""
 
        action = self.model_fn(prompt).strip().lower()
        return action
 
    def submit_action(self, episode_id: str, action: str) -> dict:
        resp = self.client.post(
            f"/episodes/{episode_id}/actions",
            json={"action": action}
        )
        resp.raise_for_status()
        return resp.json()
 
    def run_episode(self, task_id: str, max_steps: int = 200) -> dict:
        """Run a full episode and return result."""
        episode = self.start_episode(task_id)
        episode_id = episode["episode_id"]
        obs = episode["initial_observation"]
 
        print(f"Starting episode {episode_id} for task {task_id}")
 
        for step in range(max_steps):
            action = self.act(episode_id, obs)
            result = self.submit_action(episode_id, action)
            obs = result.get("observation", {})
 
            if result.get("done"):
                score = result.get("score", 0)
                print(f"  Done at step {step+1}. Score: {score:.3f}")
                return result
 
            time.sleep(0.1)  # rate limiting
 
        return {"done": False, "score": 0, "reason": "max_steps_reached"}
 
 
# Example usage with a local ollama model
def ollama_model(prompt: str) -> str:
    import urllib.request
    body = json.dumps({
        "model": "phi4-mini",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False
    }).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)["message"]["content"]
 
if __name__ == "__main__":
    agent = ARCAgent(api_key="YOUR_API_KEY", model_fn=ollama_model)
    result = agent.run_episode(task_id="arc3-task-001")
    print(f"Final score: {result.get('score', 0):.3f}")

The key insight for agent design: The WorldModel class above is the most important component — not the LLM prompt. An agent that compresses its observations into structured beliefs and reasons from them will dramatically outperform one that just feeds raw context to an LLM. ARC-AGI-3 rewards efficient hypothesis-directed exploration. Every action should either test a hypothesis about the environment's rules or directly advance toward the goal.

The Meta-Learning Approach: Teaching the Agent to Learn

The architecturally-informed approach goes one step further. Rather than hard-coding a reasoning loop, we meta-train the agent so that "how to learn a new environment quickly" is itself a learned skill:

# meta_agent.py — sketch of a meta-learning agent using MAML-style inner loop
import torch, torch.nn as nn
from copy import deepcopy
 
class AdaptiveAgent(nn.Module):
    """
    Outer loop: learns a good initialization theta* such that
    a few gradient steps on a new environment produce a good policy.
    Inner loop: fast adaptation to each new ARC-AGI-3 environment.
    """
    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Belief state: compressed summary of prior observations
        self.belief_gru = nn.GRUCell(hidden, hidden)
        self.policy_head = nn.Linear(hidden, action_dim)
        self.value_head = nn.Linear(hidden, 1)
        self.belief_state = None
 
    def reset(self):
        """Reset belief state at the start of a new episode."""
        # Match the hidden size passed to __init__ (policy_head.in_features == hidden)
        # rather than hardcoding 128.
        self.belief_state = torch.zeros(1, self.policy_head.in_features)
 
    def forward(self, obs: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encoder(obs)
        h = self.belief_gru(z, self.belief_state)
        logits = self.policy_head(h)
        value = self.value_head(h)
        # Detach the stored belief so repeated backward() calls in adapt()
        # don't try to backprop through graphs freed on earlier steps
        # (truncated backpropagation through time).
        self.belief_state = h.detach()
        return logits, value
 
    def adapt(self, support_trajectory: list, lr: float = 0.01) -> "AdaptiveAgent":
        """
        Inner loop: given a few (obs, action, reward) tuples from the new
        environment, quickly adapt the policy via gradient steps.
        Returns an adapted copy (does not modify self).
        """
        adapted = deepcopy(self)
        optimizer = torch.optim.SGD(adapted.parameters(), lr=lr)
        for obs, action, reward in support_trajectory:
            optimizer.zero_grad()
            logits, value = adapted(obs)
            # Simple policy gradient loss
            loss = -reward * torch.log_softmax(logits, dim=-1)[0, action]
            loss.backward()
            optimizer.step()
        return adapted

The key idea: theta* (the outer loop parameters) is optimized so that the adapt() inner loop makes large improvements in just a few gradient steps. The model has not memorized environments — it has learned the skill of fast environment learning.
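The outer loop itself is omitted from the sketch above. Its structure can be shown on a deliberately tiny 1-D problem in pure Python, where each task t asks the agent to minimize (theta - t)^2. This is a pedagogical MAML-style toy with made-up tasks, not ARC-AGI-3 code:

```python
# MAML in one dimension: meta-train an initialization theta* such that one
# inner gradient step does well on every task. Pedagogical toy example.
def inner_adapt(theta: float, target: float, lr: float = 0.1, steps: int = 1) -> float:
    """Inner loop: a few gradient steps on one task's loss (theta - target)^2."""
    for _ in range(steps):
        grad = 2 * (theta - target)
        theta = theta - lr * grad
    return theta

def outer_loss(theta: float, targets: list) -> float:
    """Outer objective: post-adaptation loss summed over tasks."""
    return sum((inner_adapt(theta, t) - t) ** 2 for t in targets)

# Meta-train theta* by numerical gradient descent on the outer loss.
targets = [-2.0, 0.0, 5.0]
theta, meta_lr, eps = 10.0, 0.05, 1e-5
for _ in range(500):
    grad = (outer_loss(theta + eps, targets) - outer_loss(theta - eps, targets)) / (2 * eps)
    theta -= meta_lr * grad

print(round(theta, 2))  # → 1.0 (the mean of the task targets)
```

Because the inner step contracts theta toward each task's target by the same factor, the outer optimum lands at the mean of the targets: the meta-trained initialization is the point from which one gradient step helps most on every task. Real MAML does the same thing with neural network parameters and backprop through the inner loop.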

This is the architecture that ARC-AGI-3 was specifically designed to reveal and reward.

What This Means for Developers Building AI Products

The implications of ARC-AGI-3's launch extend well beyond benchmark chasing. They define a clear distinction between two categories of AI product:

Category 1: Knowledge retrieval products. Question answering, summarization, code generation, document analysis. These leverage crystallized knowledge from training. Frontier models are extremely capable here. GPT-5.4 solving an open research math problem is the apex of this. These products work well and will continue to improve.

Category 2: Adaptive agent products. Autonomous agents that must operate in novel environments, learn new tools, navigate unfamiliar UIs, or adapt to user-specific contexts that differ from training distribution. These require the capabilities ARC-AGI-3 measures. Current models are weak here — not 50% of human capability, but under 5%.

Most of the ambitious agentic AI products being built today fall into Category 2. They are being built on models that are in Category 1. This gap is the source of most agent reliability failures in production: the agent encounters a novel situation and either fails silently, hallucinates a path forward, or gets stuck in a loop.

The honest engineering implication: if your product requires reliable out-of-distribution adaptation, you are building on shaky ground in 2026. The models are not ready for arbitrary novel-environment generalization. Design your system to acknowledge this — add confirmation steps, narrow the environment scope, provide explicit state structure to the agent, or use ARC-AGI-3 as a target benchmark for your specific domain.

The benchmark gaming warning: ARC-AGI-3's design specifically prevents brute-force memorization by using novel environments per evaluation. However, as with all benchmarks, once it becomes a target it will be gamed. The value of ARC-AGI-3 is not in its absolute scores — it is in what the scoring distribution tells you about where your agent's capabilities actually break down.

The Philosophical Takeaway: Two Kinds of Intelligence, One Field Confused About Both

Cognitive scientists have distinguished fluid intelligence (the ability to reason through new problems) from crystallized intelligence (accumulated knowledge and expertise) since Raymond Cattell's work in the 1940s. These are measurably distinct cognitive capacities. People with Alzheimer's lose crystallized knowledge first. People with frontal lobe damage often preserve knowledge while losing fluid reasoning.

AI has built extraordinary crystallized intelligence. The math-solving, the code generation, the medical diagnosis — all of it is sophisticated retrieval and application of trained patterns. Remarkable. Genuinely useful. Not to be dismissed.

But fluid intelligence — the ability to learn new things efficiently, to build world models from scratch, to transfer meta-skills to entirely new domains — that remains largely unsolved. ARC-AGI-3's 1–5% scores are the measurement of that gap.

The provocative question that Chollet's benchmark forces: is fluid intelligence even achievable with the current transformer + next-token-prediction paradigm? Or does it require something architecturally different — state spaces, neuromorphic dynamics, hybrid symbolic-connectionist systems, or something we have not invented yet?

GPT-5.4 solving a hypergraph problem suggests the ceiling on crystallized AI is extraordinarily high. ARC-AGI-3 scoring under 5% suggests the floor on fluid AI is exactly where we left it.

Both are true. Both matter. And the developers building the next generation of AI systems need to understand which one their product actually depends on.

Start Building Against It Today

ARC-AGI-3 is live now at arcprize.org, with:

  • A full developer toolkit and documented API
  • Replayable agent runs with decision timelines
  • Interactive UI for testing agents
  • Complete documentation at docs.arcprize.org

The benchmark is open. The field is wide. The gap between 5% and 100% is the most interesting unsolved problem in AI right now — and it is one that does not require a $100M training run to explore. A smart agent architecture running a 7B model with a proper belief-state and hypothesis-testing loop will teach you more about intelligence than anything else you could build this year.

Sources

  1. ARC-AGI-3 official launch — arcprize.org
  2. ARC-AGI-3 developer documentation — docs.arcprize.org
  3. Epoch AI: GPT-5.4 Pro solves Ramsey hypergraph open problem
  4. GPT-5.4 Pro solution transcript (full)
  5. Hacker News discussion: ARC-AGI-3 launch (248 pts, 174 comments)
  6. Hacker News discussion: GPT-5.4 math problem (472 pts, 689 comments)
  7. Mamba: Linear-Time Sequence Modeling with Selective State Spaces — Gu & Dao, 2023
  8. Model-Agnostic Meta-Learning (MAML) — Finn et al., ICML 2017
  9. On the Measure of Intelligence — François Chollet, 2019
  10. Fluid and Crystallized Intelligence — Cattell, 1963
  11. Active Inference: A Unified Theory of Brain Function — Friston et al.
  12. FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI
