Alexander — Autonomous AI Agent with Causal Reasoning

01 Problem

Large language models today are incredibly capable conversationalists, but they remain fundamentally limited as autonomous agents. They cannot reliably simulate the causal consequences of their actions, maintain persistent structured memory across sessions, self-calibrate their confidence, or audit their own reasoning for contradictions and fallacies.

Existing agent frameworks provide tool-calling abstractions but treat reasoning, memory, and autonomy as shallow bolt-ons rather than first-class architectural pillars. The result: agents that hallucinate confidently, forget context between turns, fail to detect when they are wrong, and cannot pursue long-term goals without human intervention.

Core Thesis

To build agents that can be trusted with autonomous tasks, we need more than an LLM wrapped in a tool loop. We need a system that can think causally, remember persistently, and act autonomously — with metacognitive oversight at every layer.

02 Architecture

Alexander is built as a layered system on top of pydantic-ai. The core design decomposes agent intelligence into three orthogonal pillars — Think, Remember, and Act — each with its own internal subsystems and async lifecycle.

AlexanderAgent (wraps pydantic_ai.Agent)
├── Think (reasoning engine)
│   ├── Simulator — AST-based causal execution simulation
│   ├── Planner — Hierarchical task network decomposition
│   ├── Confidence — Multi-signal calibrator (logprobs, consistency, verification)
│   └── Metacognition — Contradiction + fallacy + overconfidence detection
├── Remember (memory system)
│   ├── Working — Priority-evicted, capacity-limited, in-memory
│   ├── Episodic — Session-structured, time-indexed, SQLite-backed
│   ├── Semantic — Fact store with FTS5 + contradiction detection
│   ├── Associative — Cross-tier graph linking (temporal, semantic, causal)
│   └── Consolidation — Background: promote, extract, link, decay (async loop)
└── Act (autonomy layer)
    ├── Goals — Registry with lifecycle (proposed→active→paused→completed→archived)
    ├── Scheduler — Async autonomous loop with tick-based execution
    ├── Tools — Registry wrapping pydantic-ai tools with safety levels
    └── Triggers — Temporal, state, pattern, novelty detectors

The entire system is orchestrated under a single AlexanderAgent class that wraps pydantic-ai’s Agent. Each pillar runs its own async event loop, and the pillars communicate through well-defined internal interfaces — for example, the Confidence module in Think feeds calibration data into Metacognition, which can in turn pause goal execution in the Act layer.
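
The goal lifecycle in the tree, together with the pause path from Metacognition into the Act layer, can be sketched as a small state machine. This is a minimal illustration rather than Alexander's actual code: the names GoalState, Goal, ALLOWED, and on_metacognition_flag are hypothetical, and so is the assumption that a paused goal can resume to active or be archived directly.

```python
from dataclasses import dataclass, field
from enum import Enum


class GoalState(Enum):
    PROPOSED = "proposed"
    ACTIVE = "active"
    PAUSED = "paused"
    COMPLETED = "completed"
    ARCHIVED = "archived"


# Hypothetical transition table; the write-up only gives the linear order.
ALLOWED: dict[GoalState, set[GoalState]] = {
    GoalState.PROPOSED: {GoalState.ACTIVE, GoalState.ARCHIVED},
    GoalState.ACTIVE: {GoalState.PAUSED, GoalState.COMPLETED},
    GoalState.PAUSED: {GoalState.ACTIVE, GoalState.ARCHIVED},
    GoalState.COMPLETED: {GoalState.ARCHIVED},
    GoalState.ARCHIVED: set(),
}


@dataclass
class Goal:
    name: str
    state: GoalState = GoalState.PROPOSED
    history: list[GoalState] = field(default_factory=list)

    def transition(self, new_state: GoalState) -> None:
        """Move to new_state, rejecting transitions the table does not allow."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.history.append(self.state)
        self.state = new_state


def on_metacognition_flag(goal: Goal) -> None:
    """Pause an active goal when the Think layer reports a problem."""
    if goal.state is GoalState.ACTIVE:
        goal.transition(GoalState.PAUSED)
```

Keeping the allowed transitions in one table makes the lifecycle easy to audit, and it gives the Think layer a single narrow entry point into Act.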

03 Tech Stack

Alexander is built entirely in Python, leveraging pydantic-ai for LLM integration and a carefully chosen set of libraries for persistence, UI, and developer experience.

Technology	Role
Python 3.14	Core language with match/case, improved asyncio, and performance enhancements
pydantic-ai	LLM abstraction layer providing 25+ model providers, FallbackModel, structured output via Pydantic v2, tool system, and async infrastructure
SQLite + FTS5	Persistent memory store with full-text search for semantic memory queries
FastAPI + Jinja2	Web dashboard with 5 pages: overview, goals, memory explorer, logs, and configuration
Typer	8 CLI commands for full agent interaction (think, remember, act, config, run, status, init, version)
Rich	Terminal formatting with live progress bars for memory consolidation and scheduler ticks
asyncio	Async I/O across all layers — concurrent scheduler, memory consolidation, and tool execution
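
To make the SQLite + FTS5 row concrete, here is a minimal sketch of a full-text fact store using only the standard library sqlite3 module. The table name facts, its two columns, and the helper functions are assumptions made for illustration, not Alexander's real schema. It also assumes the local SQLite build has FTS5 compiled in, which is common but not guaranteed.

```python
import sqlite3


def open_fact_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a connection with an FTS5-backed `facts` table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS facts USING fts5(subject, statement)"
    )
    return conn


def add_fact(conn: sqlite3.Connection, subject: str, statement: str) -> None:
    """Store one fact; both columns are indexed for full-text search."""
    conn.execute(
        "INSERT INTO facts (subject, statement) VALUES (?, ?)",
        (subject, statement),
    )
    conn.commit()


def search_facts(
    conn: sqlite3.Connection, query: str, limit: int = 5
) -> list[tuple[str, str]]:
    """Return matching facts, best match first (FTS5's built-in rank)."""
    rows = conn.execute(
        "SELECT subject, statement FROM facts WHERE facts MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

The query string is passed to FTS5 as-is, so characters with special meaning in FTS5 query syntax (quotes, hyphens, colons) would need escaping in real use.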

04 Key Challenges

Causal Simulation

The most technically demanding component. The Simulator must predict execution outcomes by analyzing AST representations of code — understanding control flow, data dependencies, and side effects without running the code. This requires a lightweight symbolic execution engine that can handle conditionals, loops, function calls, and exception paths.

Multi-Tier Memory Consolidation

Information flows from Working Memory → Episodic Memory → Semantic Memory through an automated consolidation engine. The challenge: deciding what to promote, when to promote it, and how to detect and resolve contradictions when newly acquired facts conflict with established knowledge. The consolidation loop runs asynchronously in the background to avoid blocking agent interaction.

Confidence Calibration

LLMs are notoriously overconfident. Alexander’s Confidence module combines three orthogonal signals into a single calibrated score:

Token logprobs — raw model uncertainty at the token level
Consistency across re-sampling — sampling an answer to the same prompt several times and measuring agreement
Verifiability of claims — cross-referencing generated statements against facts in semantic memory

The calibration model must avoid overfitting to any single signal and remain robust across diverse query types.
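
The re-sampling signal in the list above can be approximated with a simple majority vote. The sketch below assumes answers are short strings that can be compared after light normalization; self_consistency and normalize are hypothetical helpers, and free-form answers would need a semantic comparison instead of exact matching.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(answer.lower().split())


def self_consistency(samples: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(normalize(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)
```

The returned fraction lies in (0, 1], so it can be fed into the calibrator as one signal alongside the others.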

Metacognitive Audit

A self-monitoring loop that runs after each reasoning step to detect:

Contradictions — the agent asserting A and not-A in the same context
Fallacies — circular reasoning, false dichotomies, hasty generalizations
Overconfidence — high confidence on claims that are unverifiable or contradicted

When a metacognitive issue is detected, the agent can backtrack, revise its reasoning, or escalate to a human operator.

Design Decision

Rather than running the metacognitive audit as a separate LLM call after every step (which would roughly double latency and cost), Alexander implements it as a set of lightweight heuristic detectors, plus a targeted verification LLM call that fires only when heuristic thresholds are exceeded. This keeps the overhead at ~15% of total reasoning time.
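
The "cheap heuristics first, LLM only when needed" idea can be sketched as a gate over two toy detectors. Everything here is hypothetical (Claim, the literal "not ..." contradiction check, the 0.9 overconfidence limit, max_issues); real detectors would need to be considerably more robust than string matching.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    confidence: float
    verified: bool


def find_contradictions(claims: list[Claim]) -> list[tuple[str, str]]:
    """Toy check: flag pairs where one claim is literally 'not <other claim>'."""
    seen = {c.text.strip().lower() for c in claims}
    return [(t, f"not {t}") for t in sorted(seen) if f"not {t}" in seen]


def find_overconfident(claims: list[Claim], limit: float = 0.9) -> list[Claim]:
    """High-confidence claims that have not been verified."""
    return [c for c in claims if c.confidence >= limit and not c.verified]


def needs_llm_verification(claims: list[Claim], max_issues: int = 0) -> bool:
    """Escalate to the expensive LLM check only if heuristics find too many issues."""
    issues = len(find_contradictions(claims)) + len(find_overconfident(claims))
    return issues > max_issues
```

Because the detectors are plain functions over in-memory data, they can run after every reasoning step at negligible cost, and only the flagged steps pay for an extra model call.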

⚡ Code Highlight

The confidence calibrator combines four signals into a single calibrated score, adding a coherence check to the three signals described above, so that overconfidence is caught before it reaches the user.

class ConfidenceCalibrator:
    """Multi-signal confidence estimator.

    Combines four signals into a calibrated score:
    - model_logprob: LLM token probability
    - self_consistency: agreement across reasoning paths
    - verifiability: objective checkability
    - coherence: consistency with established knowledge
    """

    def __init__(self, threshold: float = 0.7) -> None:
        self.threshold = threshold

    async def evaluate(
        self, thought: Thought,
        signals: dict[str, float] | None = None,
    ) -> ConfidenceReport:
        """Evaluate confidence for a single thought."""
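        # Fall back to internal estimates when the caller supplies no signals.
        signals = signals or {
            "model_logprob":     self._estimate_logprob(thought),
            "self_consistency":  self._estimate_consistency(thought),
            "verifiability":     self._estimate_verifiability(thought),
            "coherence":         self._estimate_coherence(thought),
        }
        # Average over the signals actually present rather than a hard-coded 4,
        # so caller-supplied dicts of a different size are scored correctly.
        raw_score = sum(signals.values()) / len(signals)
        calibrated = self._calibrate(raw_score)
        # Flag for review when calibrated confidence falls below the threshold.
        needs_review = calibrated < self.threshold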
        return ConfidenceReport(
            raw_score=raw_score, calibrated_score=calibrated,
            signals=signals, threshold=self.threshold,
            needs_review=needs_review,
        )
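
The highlight above calls a _calibrate step without showing it. One plausible shape for such a step, offered purely as an assumption, is piecewise-linear interpolation over (raw score, observed accuracy) pairs measured on held-out data. The function calibrate and the values in CALIBRATION_POINTS below are made-up placeholders, not measurements from Alexander.

```python
from bisect import bisect_right

# Placeholder (raw_score, observed_accuracy) pairs; real values would come
# from evaluating the agent on a labeled held-out set.
CALIBRATION_POINTS: list[tuple[float, float]] = [
    (0.0, 0.0),
    (0.5, 0.35),
    (0.8, 0.6),
    (1.0, 0.9),
]


def calibrate(
    raw_score: float,
    points: list[tuple[float, float]] = CALIBRATION_POINTS,
) -> float:
    """Map a raw score to a calibrated one by linear interpolation."""
    xs = [x for x, _ in points]
    raw = min(max(raw_score, xs[0]), xs[-1])  # clamp into the table's range
    hi = min(bisect_right(xs, raw), len(xs) - 1)
    lo = hi - 1
    (x0, y0), (x1, y1) = points[lo], points[hi]
    return y0 + (y1 - y0) * (raw - x0) / (x1 - x0)
```

With a table like this, a raw average that looks confident can map to a calibrated score below the review threshold, which is the behavior the needs_review flag relies on.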

05 Results

The project has achieved a fully functional 3-pillar architecture with concrete metrics across codebase, testing, and usability:

3 Pillars: Think · Remember · Act

8 CLI Commands

4 Think Modules

5 Memory Modules

80+ Python Source Files

50+ Passing Tests

Key milestones delivered:

3-pillar architecture complete with 80+ Python source files across think, remember, and act modules
SQLite persistence with FTS5 full-text search for semantic memory — enabling fact retrieval, contradiction detection, and similarity queries
Web dashboard built with FastAPI + Jinja2, with pages for the overview, goal tracking, memory exploration, logs, and configuration
CLI with 8 commands designed for power users — direct access to each pillar without needing the web UI
50+ passing tests covering the core subsystems in think, remember, and act pillars

What Stands Out

The architecture diagram above isn’t aspirational — every node in that tree corresponds to a real Python module with tests, CLI access, and internal APIs. The foundation is complete and running; the next phase focuses on cross-pillar integration and real-world task evaluation.

06 What I Learned

Agents need structured reasoning, not just prompt engineering

Early experiments showed that even carefully tuned prompts couldn’t prevent LLMs from making reasoning errors on multi-step tasks. The Simulator and Planner modules showed that encoding reasoning structure explicitly (rather than relying on the LLM to discover it) makes outcomes markedly more reliable on those tasks.

Memory is the bottleneck to autonomy

Without persistent, structured memory, every agent session starts from zero. The multi-tier memory system revealed that the key design challenge isn’t storage — it’s consolidation policy: deciding what to remember, how to link new knowledge with old, and when to forget. The background async consolidation loop was essential; blocking the agent for memory operations made it feel sluggish and unresponsive.
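
The non-blocking consolidation pattern can be sketched with a background asyncio task that runs a consolidation step on an interval while the foreground keeps working. The names consolidation_loop, step, and demo are hypothetical, and the step here only increments a counter in place of real promote, extract, link, and decay work.

```python
import asyncio
from collections.abc import Awaitable, Callable


async def consolidation_loop(
    step: Callable[[], Awaitable[None]],
    interval: float,
    stop: asyncio.Event,
) -> None:
    """Run `step` every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        await step()
        try:
            # Sleep for one interval, but wake early if shutdown is requested.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def demo() -> tuple[int, int]:
    """Run consolidation in the background while handling foreground work."""
    passes = 0
    stop = asyncio.Event()

    async def step() -> None:
        nonlocal passes
        passes += 1  # stand-in for one consolidation pass

    task = asyncio.create_task(consolidation_loop(step, 0.01, stop))
    handled = 0
    for _ in range(3):  # foreground: the agent keeps serving requests
        await asyncio.sleep(0.02)
        handled += 1
    stop.set()
    await task  # let the loop finish its current pass and exit cleanly
    return passes, handled
```

Waiting on the stop event instead of a bare sleep means shutdown does not have to wait out a full interval.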

Confidence calibration requires multiple signals

Relying on token logprobs alone produces brittle calibration. Logprobs reflect model uncertainty but don’t capture whether the model should be uncertain about a particular factual claim. Combining logprobs with consistency re-sampling and semantic memory verification produces a much more robust signal — though it adds latency that must be managed carefully.

pydantic-ai is the right foundation

Building on pydantic-ai saved months of work. Its provider abstraction, structured output system, and async-first design meant we could focus on the three pillars rather than reinventing LLM integration. The FallbackModel pattern alone (automatically falling back to a different provider on failure) is a killer feature for production agent systems.

Bottom Line

Alexander makes the case that the ceiling on LLM-based agents is architectural, not fundamental. With the right components (causal reasoning, persistent memory, confidence calibration, and metacognitive oversight), autonomous agents can operate well beyond what prompt engineering alone enables. The real-world task evaluation planned for the next phase will test how far that holds.