The $80 Brain: A Billion Tiny AI Agents Are About to Run on Everything You Own
AI is leaving the cloud. The next revolution isn't AGI — it's a billion cheap, autonomous agents running on the device in your hand, your wall, and your factory floor.
We've been thinking about AI wrong.
For the past three years the mental model has been the same: you have a question, a device, a sensor, a problem — you send it to a server somewhere, a giant model thinks about it, and the answer comes back. The cloud is the brain. The device is the mouth and ears.
That model is ending.
A Raspberry Pi 5 costs $80. An NVIDIA Jetson Orin Nano Super fits in your palm and delivers 67 trillion AI operations per second. Apple's M4 chip runs a full 7B reasoning model at 60 tokens per second — faster than you can read. The hardware that held AI hostage in the data center for a decade is now cheap, fanless, and shipping inside children's toys, hospital beds, and industrial sensors.
But here's what almost nobody is talking about yet: it's not just that models can run on small devices. It's that agents can. Not a model that answers a question — an agent that wakes up, reads sensors, forms a plan, executes tools, observes results, corrects mistakes, and completes a task. Autonomously. Without asking a server for permission. Without an internet connection. Without you.
A security camera that doesn't just detect motion — it reasons about whether it's a threat, decides on a response, and logs its own chain of thought. A medical device that monitors vitals, detects early warning patterns, and adjusts dosage — on data it is legally forbidden from sending to a cloud. A factory robot arm that diagnoses its own mechanical drift, orders its own replacement part, and schedules its own maintenance window.
This is not a roadmap. It is buildable today, with tools you can download in the next ten minutes.
This is the architecture.
📁 Full source code for this article is on GitHub: github.com/aistackinsights/stackinsights/agentic-ai-on-edge-devices-autonomous-workflows
Why Edge Changes the Agentic Equation
Cloud-based agentic AI has a fundamental architecture problem: every reasoning step requires a network round trip. A ReAct loop with 5 steps and 200ms average latency per call burns a full second of pure network time before your agent has thought about anything. Add token generation time and you're at 5–10 seconds for a simple task.
On edge hardware, that loop runs in memory. A Phi-4-mini model (3.8B parameters) on a Jetson Orin Nano completes a reasoning step in under 400ms — fully local, no network at all. The agentic loop compresses from seconds to milliseconds.
Three other forces make this the right moment:
- Model miniaturization: Phi-4-mini (3.8B), Qwen2.5-3B, Gemma 3 1B — models small enough to fit in 4GB of RAM that score 70%+ on coding benchmarks.
- Tool-calling on small models: Phi-4-mini supports native function calling. A 3.8B model can reliably choose the right tool and parse its output without a cloud backbone.
- Hardware NPUs: The Jetson Orin Nano's 1024-core Ampere GPU, Apple's Neural Engine, and Qualcomm's Hexagon DSP are purpose-built for transformer inference — they deliver 10–40x the efficiency of general CPU inference.
Model selection rule of thumb for edge agents: 3B models for single-step tool use, 7B for multi-step reasoning chains, 14B+ only if you have 8GB+ VRAM dedicated to inference. For most edge automation tasks, 3B is enough.
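As a sketch, that rule of thumb can be encoded in a small selection helper. The model tags here are illustrative ollama names chosen for this example, not recommendations from any benchmark:

```python
def pick_edge_model(reasoning_steps: int, vram_gb: float) -> str:
    """Map task depth and available memory to a model tier, following the
    3B / 7B / 14B+ rule of thumb above."""
    if reasoning_steps <= 1:
        return "qwen2.5:3b"    # single-step tool use
    if vram_gb >= 8 and reasoning_steps > 4:
        return "qwen2.5:14b"   # deep chains, only with 8GB+ dedicated to inference
    return "qwen2.5:7b"        # multi-step reasoning chains

print(pick_edge_model(reasoning_steps=1, vram_gb=4))  # → qwen2.5:3b
```

The point is not the specific tags but that the tier decision should be explicit and auditable, not buried in a config file.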
The Agentic Loop on Edge: Architecture Overview
A cloud-based agent and an edge agent share the same logical structure — the difference is where each component runs:
```
Cloud Agent:                           Edge Agent:

Sensor / Input                         Sensor / Input
      ↓                                      ↓
HTTP → LLM API (remote)                Local LLM (llama.cpp / ollama)
      ↓                                      ↓
Tool call → HTTP → External API        Tool call → Local function / local API
      ↓                                      ↓
Observation → LLM API (remote)         Observation → Local LLM
      ↓                                      ↓
Output → HTTP → Application            Output → Local action
```
The edge version has zero network hops in the critical path. The only external calls are optional — sync results, trigger cloud processes, send alerts — and they're non-blocking.
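Those optional, non-blocking external calls can ride a background queue so they never touch the critical path. A minimal sketch, assuming the cloud POST is a placeholder you replace with a real endpoint:

```python
import json
import queue
import threading

# Results queue: the agent loop enqueues, a background worker drains.
sync_queue: queue.Queue = queue.Queue()

def cloud_sync_worker() -> None:
    """Drain queued results in the background so the agent loop never blocks."""
    while True:
        item = sync_queue.get()
        try:
            # Placeholder for a real POST to a cloud endpoint; failures here
            # must never stall the agent's critical path.
            print(f"synced: {json.dumps(item)}")
        finally:
            sync_queue.task_done()

threading.Thread(target=cloud_sync_worker, daemon=True).start()

# In the agent loop: enqueue and move on, adding zero latency to the critical path.
sync_queue.put({"event": "anomaly", "sensor": "temp_01", "value": 91.2})
sync_queue.join()  # only for demonstration; the agent itself would not wait
```

A daemon thread dies with the process, so a crash loses at most the unsynced tail of the queue, which is the right trade for telemetry.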
The minimal stack:
- Inference runtime: llama.cpp (C++, runs on CPU/GPU/NPU), ollama (Docker-friendly wrapper), or transformers with quantization
- Agent framework: smolagents (HuggingFace, lightweight), or a hand-rolled ReAct loop
- Tool layer: Python functions, local SQLite, file system, GPIO/sensor APIs
- Persistence: SQLite for state, JSON files for configuration
Build a Full Edge Agent in Python
Here's a complete, working agentic loop that runs on any device with 4GB RAM using ollama for local inference:
# edge_agent.py — full agentic loop running entirely on device
import json, sqlite3, datetime
import urllib.request
from typing import Any

OLLAMA_MODEL = "phi4-mini"  # or qwen2.5:3b, gemma3:1b

def call_local_llm(messages: list[dict], tools: list[dict] | None = None) -> dict:
    """Call the ollama REST API on localhost — no network dependency."""
    payload = {"model": OLLAMA_MODEL, "messages": messages, "stream": False}
    if tools:
        payload["tools"] = tools
    # The REST API gives clean tool_call support (the `ollama run` CLI does not)
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST"
    )
    with urllib.request.urlopen(req, timeout=30) as r:
        return json.load(r)
# --- Tool definitions ---
TOOLS = [
{
"type": "function",
"function": {
"name": "read_sensor",
"description": "Read a sensor value by sensor ID. Returns current float reading.",
"parameters": {
"type": "object",
"properties": {
"sensor_id": {"type": "string", "description": "Sensor identifier, e.g. 'temp_01'"}
},
"required": ["sensor_id"]
}
}
},
{
"type": "function",
"function": {
"name": "log_event",
"description": "Log an event to the local database with severity and message.",
"parameters": {
"type": "object",
"properties": {
"severity": {"type": "string", "enum": ["info", "warning", "critical"]},
"message": {"type": "string"}
},
"required": ["severity", "message"]
}
}
},
{
"type": "function",
"function": {
"name": "trigger_action",
"description": "Trigger a physical action: fan_on, fan_off, alert_led, shutdown.",
"parameters": {
"type": "object",
"properties": {
"action": {"type": "string", "enum": ["fan_on", "fan_off", "alert_led", "shutdown"]}
},
"required": ["action"]
}
}
}
]
# --- Tool implementations ---
def read_sensor(sensor_id: str) -> float:
"""Simulated sensor read — replace with GPIO/I2C/MQTT call."""
import random
mock_values = {"temp_01": 78.5 + random.uniform(-5, 15), "humidity_01": 45.0}
return mock_values.get(sensor_id, 0.0)
def log_event(severity: str, message: str) -> str:
conn = sqlite3.connect("edge_agent.db")
conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, severity TEXT, message TEXT)")
conn.execute("INSERT INTO events VALUES (?, ?, ?)",
(datetime.datetime.now().isoformat(), severity, message))
conn.commit(); conn.close()
print(f"[{severity.upper()}] {message}")
return "logged"
def trigger_action(action: str) -> str:
print(f">>> ACTION: {action}")
# Replace with actual GPIO: import RPi.GPIO as GPIO ...
return f"{action} executed"
TOOL_DISPATCH = {"read_sensor": read_sensor, "log_event": log_event, "trigger_action": trigger_action}
def execute_tool(name: str, args: dict) -> Any:
if name not in TOOL_DISPATCH:
return f"Unknown tool: {name}"
return TOOL_DISPATCH[name](**args)
# --- Agentic loop ---
def run_agent(task: str, max_steps: int = 6) -> str:
messages = [
{"role": "system", "content": (
"You are an autonomous edge AI agent managing industrial sensors. "
"Use tools to observe the environment, reason about readings, and take action. "
"Always read sensors before acting. Log events when anomalies are detected."
)},
{"role": "user", "content": task}
]
for step in range(max_steps):
response = call_local_llm(messages, tools=TOOLS)
message = response.get("message", {})
tool_calls = message.get("tool_calls", [])
if not tool_calls:
# Agent reached a conclusion
return message.get("content", "Task complete.")
# Execute all tool calls and collect observations
        messages.append({"role": "assistant", "content": message.get("content") or "", "tool_calls": tool_calls})
        for tc in tool_calls:
            fn = tc["function"]
            args = fn.get("arguments", {})
            if isinstance(args, str):  # some backends return arguments as a JSON string
                args = json.loads(args)
            result = execute_tool(fn["name"], args)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.get("id", ""),  # ollama may omit an id field
                "content": str(result)
            })
return "Max steps reached."
if __name__ == "__main__":
    print(run_agent("Check all sensors, identify any anomalies, and take appropriate action."))

Run it with a 3.8B model:
# Install ollama and pull the model (one-time setup)
curl -fsSL https://ollama.com/install.sh | sh
ollama pull phi4-mini # 3.8B — fits in 4GB RAM
# ollama pull qwen2.5:3b # Alternative: Qwen 2.5 3B
# ollama pull gemma3:1b    # Alternative: Gemma 3 1B
# Run the edge agent (stdlib only, no pip installs needed)
python edge_agent.py

On a Raspberry Pi 5 (8GB), phi4-mini runs at ~8 tok/s — fast enough for non-latency-critical automation. On a Jetson Orin Nano Super, you're at 25–35 tok/s. On Apple M4, 60+ tok/s.
Persistent Agent State: Surviving Reboots
Cloud agents are stateless by design — the API is the state boundary. Edge agents live on a device that reboots, loses power, and runs indefinitely. State must be durable.
# agent_state.py — durable state management for edge agents
import sqlite3, json
from datetime import datetime, UTC
from pathlib import Path
class EdgeAgentState:
def __init__(self, db_path: str = "agent_state.db"):
self.conn = sqlite3.connect(db_path, check_same_thread=False)
self._init_schema()
def _init_schema(self):
self.conn.executescript("""
CREATE TABLE IF NOT EXISTS memory (
key TEXT PRIMARY KEY,
value TEXT NOT NULL,
updated_at TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS task_log (
id INTEGER PRIMARY KEY AUTOINCREMENT,
task TEXT NOT NULL,
result TEXT,
steps INTEGER,
started_at TEXT NOT NULL,
completed_at TEXT
);
CREATE TABLE IF NOT EXISTS sensor_history (
sensor_id TEXT NOT NULL,
value REAL NOT NULL,
recorded_at TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_sensor_time ON sensor_history(sensor_id, recorded_at DESC);
""")
self.conn.commit()
def remember(self, key: str, value: object) -> None:
self.conn.execute(
"INSERT OR REPLACE INTO memory VALUES (?, ?, ?)",
(key, json.dumps(value), datetime.now(UTC).isoformat())
)
self.conn.commit()
def recall(self, key: str, default=None) -> object:
row = self.conn.execute("SELECT value FROM memory WHERE key = ?", (key,)).fetchone()
return json.loads(row[0]) if row else default
def record_sensor(self, sensor_id: str, value: float) -> None:
self.conn.execute(
"INSERT INTO sensor_history VALUES (?, ?, ?)",
(sensor_id, value, datetime.now(UTC).isoformat())
)
self.conn.commit()
def get_sensor_trend(self, sensor_id: str, last_n: int = 10) -> list[float]:
rows = self.conn.execute(
"SELECT value FROM sensor_history WHERE sensor_id = ? ORDER BY recorded_at DESC LIMIT ?",
(sensor_id, last_n)
).fetchall()
        return [r[0] for r in reversed(rows)]

The agent now remembers what it decided last cycle, tracks sensor trends across reboots, and can detect slow-drift anomalies that a stateless agent would miss entirely.
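With trend history in place, slow-drift detection needs only a few lines. A sketch that compares the mean of the older half of the readings against the newer half; the 10% threshold is an arbitrary assumption you would tune per sensor:

```python
def detect_drift(readings: list[float], threshold_pct: float = 10.0) -> bool:
    """Flag slow drift: compare the mean of the older half of the readings
    against the mean of the newer half."""
    if len(readings) < 4:
        return False                      # not enough history to judge
    mid = len(readings) // 2
    old_mean = sum(readings[:mid]) / mid
    new_mean = sum(readings[mid:]) / (len(readings) - mid)
    if old_mean == 0:
        return False                      # avoid division by zero on dead sensors
    drift_pct = abs(new_mean - old_mean) / abs(old_mean) * 100
    return drift_pct > threshold_pct

# Feed it the output of get_sensor_trend("temp_01"):
print(detect_drift([78.0, 78.2, 78.1, 78.3, 85.0, 86.1, 87.2, 88.0]))  # → True
```

This runs on plain floats, so the agent can call it on every cycle without touching the model at all; the LLM only gets invoked once drift crosses the threshold.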
Hardware Guide: Which Edge Device for Which Workload
| Device | AI Compute | RAM | Best For |
|---|---|---|---|
| Raspberry Pi 5 | CPU only (~2 TOPS) | 8GB | 1–3B models, light automation |
| NVIDIA Jetson Orin Nano Super | 67 TOPS | 8GB | 3–7B models, real-time vision + language |
| Apple M4 Mac Mini | ~38 TOPS Neural Engine | 16–64GB | 7–30B models, complex reasoning |
| Qualcomm RB3 Gen 2 | 73 TOPS (Hexagon) | 8GB | Mobile-class edge, robotics |
| Intel Core Ultra (NPU) | ~34 TOPS | 32GB | Windows edge servers, enterprise |
NVIDIA's Jetson Orin Nano Super is the current sweet spot for most agentic edge deployments: 67 TOPS at $249, supported by the full NVIDIA software stack (TensorRT, DeepStream, JetPack), and small enough to mount inside equipment.
The Privacy and Latency Case
Two arguments for edge that aren't about hardware performance:
Privacy: Healthcare, legal, and financial workloads face regulations — HIPAA, GDPR, financial data residency — that make cloud inference legally impossible. An agentic AI that audits patient records, flags billing anomalies, or processes attorney-client communications must stay on premises. Edge is not just faster — it's the only compliant option.
Offline resilience: A cloud-dependent agent in a factory goes down when the internet goes down. Edge agents survive network partitions. This is the entire reason industrial automation has always favored local compute over centralized systems.
Model accuracy vs. size tradeoff: 3B models on edge hardware will make reasoning errors that a 70B cloud model would not. Design your agentic loop with explicit retry logic, confidence thresholds, and human escalation paths for critical actions. Never let a 3B model autonomously trigger irreversible operations without a confirmation gate.
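Such a confirmation gate can be a thin wrapper around the tool dispatcher. A hypothetical sketch; the IRREVERSIBLE set and the callback policy are assumptions for illustration, not part of any framework:

```python
IRREVERSIBLE = {"shutdown"}   # actions a small edge model may never trigger alone

def gated_execute(name: str, args: dict, confirm=None) -> str:
    """Run a tool, but route irreversible actions through a confirmation
    callback: a human prompt, a second model, or a hard policy check."""
    action = args.get("action", "")
    if name == "trigger_action" and action in IRREVERSIBLE:
        approved = confirm(action) if confirm else False   # deny by default
        if not approved:
            return f"BLOCKED: '{action}' requires confirmation"
    return f"{name}({args}) executed"   # stand-in for the real tool dispatcher

print(gated_execute("trigger_action", {"action": "shutdown"}))
# → BLOCKED: 'shutdown' requires confirmation
```

Note that the gate denies by default: if the confirmation channel is down, the safe state is "do nothing", which matches how industrial interlocks are built.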
Hybrid Architecture: Edge + Cloud in the Right Places
Edge-only is not always the right answer. The winning architecture for most production deployments:
- Edge: Sensor ingestion, real-time loop, local tool execution, state persistence, routine decisions
- Cloud: Model updates, analytics aggregation, anomaly reporting, fallback for edge failures, complex tasks that exceed local model capability
# hybrid_agent.py — edge-first with selective cloud escalation
# run_local_agent / run_cloud_agent are placeholders for your local loop and cloud client
async def run_hybrid_agent(task: str, state: EdgeAgentState) -> str:
# Try edge first
try:
result = await run_local_agent(task, state, timeout_seconds=10)
confidence = state.recall("last_confidence", 0.5)
if confidence >= 0.75:
return result
# Low confidence — escalate to cloud
print("Edge confidence low, escalating to cloud...")
return await run_cloud_agent(task, local_context=result)
except TimeoutError:
        return await run_cloud_agent(task)

The edge agent handles 95% of decisions locally. Only low-confidence or complex cases hit the cloud, cutting API costs by an order of magnitude and keeping latency near zero for the common path.
Limitations to Plan Around
- Context window: Most edge-deployable models cap at 8K–32K tokens. Long multi-step tasks that accumulate observations can hit this limit. Implement context pruning — keep the last N observations, not all of them.
- Tool calling reliability: 3B models drop tool calls ~15% of the time on complex tool schemas. Keep tool schemas simple (≤4 tools, ≤3 parameters each). Add explicit retry logic.
- Quantization artifacts: INT4 quantization (needed to fit 7B in 6GB) degrades JSON-following ability. Use Q5_K_M or Q8_0 if your RAM allows; fall back to Q4_K_M only if necessary.
- Thermal throttling: Sustained inference on Jetson and Raspberry Pi generates heat. Budget for cooling — passive heatsinks are insufficient for agents running 24/7.
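The context pruning mentioned in the first bullet can be a sliding window over the message list. A minimal sketch that keeps the system prompt and original task and drops all but the most recent observations:

```python
def prune_context(messages: list[dict], keep_last: int = 6) -> list[dict]:
    """Keep the system prompt and original task; drop all but the most
    recent reasoning/observation messages to fit a small context window."""
    head = [m for m in messages[:2] if m["role"] in ("system", "user")]
    tail = messages[2:][-keep_last:]      # most recent steps only
    return head + tail

msgs = [{"role": "system", "content": "agent"},
        {"role": "user", "content": "task"}] + \
       [{"role": "tool", "content": f"obs {i}"} for i in range(20)]
pruned = prune_context(msgs)
print(len(pruned))  # → 8
```

A refinement worth considering: summarize the dropped observations into one compact message instead of discarding them, so the agent keeps a lossy memory of earlier steps.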
Sources
- NVIDIA Jetson Orin Nano Super developer kit
- Phi-4-mini model card — Microsoft / HuggingFace
- Qwen2.5-3B-Instruct — Alibaba / HuggingFace
- Gemma 3 1B — Google / HuggingFace
- ollama — run LLMs locally
- smolagents — lightweight agentic framework by HuggingFace
- LLM inference optimization — HuggingFace Transformers
- Qualcomm RB3 Gen 2 development platform
- ReAct: Reasoning and Acting in Language Models — arxiv
- Apple M4 Neural Engine performance