Alexander
Autonomous AI Agent with Causal Reasoning — engineered to overcome fundamental LLM ceilings through causal simulation, three-tier memory, metacognitive audit, and goal-driven autonomy.
01 Problem
Large language models today are incredibly capable conversationalists, but they remain fundamentally limited as autonomous agents. They cannot reliably simulate the causal consequences of their actions, maintain persistent structured memory across sessions, self-calibrate their confidence, or audit their own reasoning for contradictions and fallacies.
Existing agent frameworks provide tool-calling abstractions but treat reasoning, memory, and autonomy as shallow bolt-ons rather than first-class architectural pillars. The result: agents that hallucinate confidently, forget context between turns, fail to detect when they are wrong, and cannot pursue long-term goals without human intervention.
To build agents that can be trusted with autonomous tasks, we need more than an LLM wrapped in a tool loop. We need a system that can think causally, remember persistently, and act autonomously — with metacognitive oversight at every layer.
02 Architecture
Alexander is built as a layered system on top of pydantic-ai. The core design decomposes agent intelligence into three orthogonal pillars — Think, Remember, and Act — each with its own internal subsystems and async lifecycle.
AlexanderAgent (wraps pydantic-ai.Agent)
├── Think (reasoning engine)
│ ├── Simulator — AST-based causal execution simulation
│ ├── Planner — Hierarchical task network decomposition
│ ├── Confidence — Multi-signal calibrator (logprobs, consistency, verification)
│ └── Metacognition — Contradiction + fallacy + overconfidence detection
├── Remember (memory system)
│ ├── Working — Priority-evicted, capacity-limited, in-memory
│ ├── Episodic — Session-structured, time-indexed, SQLite-backed
│ ├── Semantic — Fact store with FTS5 + contradiction detection
│ ├── Associative — Cross-tier graph linking (temporal, semantic, causal)
│ └── Consolidation — Background: promote, extract, link, decay (async loop)
└── Act (autonomy layer)
  ├── Goals — Registry with lifecycle (proposed→active→paused→completed→archived)
  ├── Scheduler — Async autonomous loop with tick-based execution
  ├── Tools — Registry wrapping pydantic-ai tools with safety levels
  └── Triggers — Temporal, state, pattern, novelty detectors
The entire system is orchestrated under a single AlexanderAgent class that wraps pydantic-ai’s Agent. Each pillar runs its own async event loop, and the pillars communicate through well-defined internal interfaces — for example, the Confidence module in Think feeds calibration data into Metacognition, which can in turn pause goal execution in the Act layer.
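As a rough illustration of that cross-pillar interaction, the sketch below models the goal lifecycle from the diagram as a small state machine and pauses an active goal when a calibrated confidence score drops below a threshold. The names `GoalState`, `Goal`, `ALLOWED`, and `pause_if_unreliable` are hypothetical, and every transition other than the forward path shown in the diagram (for example resuming a paused goal) is an assumption rather than Alexander's actual rule set.

```python
from enum import Enum


class GoalState(Enum):
    PROPOSED = "proposed"
    ACTIVE = "active"
    PAUSED = "paused"
    COMPLETED = "completed"
    ARCHIVED = "archived"


# Assumed transition table: the forward path comes from the diagram;
# resuming a paused goal and archiving early are illustrative guesses.
ALLOWED = {
    GoalState.PROPOSED: {GoalState.ACTIVE, GoalState.ARCHIVED},
    GoalState.ACTIVE: {GoalState.PAUSED, GoalState.COMPLETED, GoalState.ARCHIVED},
    GoalState.PAUSED: {GoalState.ACTIVE, GoalState.ARCHIVED},
    GoalState.COMPLETED: {GoalState.ARCHIVED},
    GoalState.ARCHIVED: set(),
}


class Goal:
    def __init__(self, description):
        self.description = description
        self.state = GoalState.PROPOSED

    def transition(self, new_state):
        """Move to new_state, rejecting transitions not in the table."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {new_state.value}"
            )
        self.state = new_state


def pause_if_unreliable(goal, calibrated_score, threshold=0.7):
    """Pause an active goal when confidence falls below the threshold."""
    if goal.state is GoalState.ACTIVE and calibrated_score < threshold:
        goal.transition(GoalState.PAUSED)
        return True
    return False
```

Keeping the legal transitions in one table means an out-of-order request from any pillar fails loudly instead of silently corrupting goal state.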
03 Tech Stack
Alexander is built entirely in Python, leveraging pydantic-ai for LLM integration and a carefully chosen set of libraries for persistence, UI, and developer experience.
| Technology | Role |
|---|---|
| Python 3.14 | Core language with match/case, improved asyncio, and performance enhancements |
| pydantic-ai | LLM abstraction layer providing 25+ model providers, FallbackModel, structured output via Pydantic v2, tool system, and async infrastructure |
| SQLite + FTS5 | Persistent memory store with full-text search for semantic memory queries |
| FastAPI + Jinja2 | Web dashboard with 5 pages: overview, goals, memory explorer, logs, and configuration |
| Typer | 8 CLI commands for full agent interaction (think, remember, act, config, run, status, init, version) |
| Rich | Terminal formatting with live progress bars for memory consolidation and scheduler ticks |
| asyncio | Async I/O across all layers — concurrent scheduler, memory consolidation, and tool execution |
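To make the SQLite + FTS5 row concrete, here is a minimal sketch of a full-text fact store using Python's standard `sqlite3` module. The table name `facts`, its two columns, and the helper functions are assumptions for illustration, not Alexander's actual schema; the sketch also assumes the local SQLite build has FTS5 enabled, which is common but not guaranteed.

```python
import sqlite3


def open_fact_store(path=":memory:"):
    """Open a database and create a hypothetical FTS5-backed facts table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS facts USING fts5(subject, statement)"
    )
    return conn


def add_fact(conn, subject, statement):
    """Store one fact; both columns are indexed for full-text search."""
    conn.execute(
        "INSERT INTO facts (subject, statement) VALUES (?, ?)",
        (subject, statement),
    )
    conn.commit()


def search_facts(conn, query, limit=5):
    """Return (subject, statement) rows matching the query, best match first."""
    rows = conn.execute(
        "SELECT subject, statement FROM facts "
        "WHERE facts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()


conn = open_fact_store()
add_fact(conn, "Python", "Python 3.14 is the core language")
add_fact(conn, "SQLite", "FTS5 provides search over stored facts")
matches = search_facts(conn, "language")
```

Here `matches` contains only the Python fact, since it is the only row with the token `language`.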
04 Key Challenges
Causal Simulation
The most technically demanding component. The Simulator must predict execution outcomes by analyzing AST representations of code — understanding control flow, data dependencies, and side effects without running the code. This requires a lightweight symbolic execution engine that can handle conditionals, loops, function calls, and exception paths.
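A full symbolic execution engine is beyond a short example, but the first step (reading structure out of the AST without running anything) can be sketched with the standard `ast` module. The function `summarize_effects` and its output keys are hypothetical, and the analysis is deliberately flow-insensitive: it reports which names are written, which calls appear, which names are read but never assigned, and how many control-flow nodes exist.

```python
import ast


def summarize_effects(source):
    """Statically summarize a snippet without executing it."""
    tree = ast.parse(source)
    written, read, calls = set(), set(), set()
    branches = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                written.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                read.add(node.id)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.add(node.func.id)
        elif isinstance(node, (ast.If, ast.For, ast.While, ast.Try)):
            branches += 1
    return {
        "writes": written,
        "calls": calls,
        # Names read but never assigned or called here: external inputs.
        # Flow-insensitive, so a read-before-write of the same name is missed.
        "inputs": read - written - calls,
        "branches": branches,
    }


snippet = (
    "total = 0\n"
    "for x in items:\n"
    "    if x > 0:\n"
    "        total = total + x\n"
    "print(total)\n"
)
summary = summarize_effects(snippet)
```

For this snippet the summary reports writes to `total` and `x`, a call to `print`, a dependency on the external name `items`, and two control-flow nodes. A real simulator would have to go further and track values along each path.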
Multi-Tier Memory Consolidation
Information flows from Working Memory → Episodic Memory → Semantic Memory through an automated consolidation engine. The challenge: deciding what to promote, when to promote it, and how to detect and resolve contradictions when newly acquired facts conflict with established knowledge. The consolidation loop runs asynchronously in the background to avoid blocking agent interaction.
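The sketch below illustrates two of the ideas in this paragraph under simplifying assumptions: priority-based eviction in a capacity-limited working memory, and a background `asyncio` task that promotes high-priority items without blocking the caller. The class names, the `promote_at` threshold, and the use of a plain list as episodic memory are all hypothetical stand-ins; contradiction handling is left out.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    content: str
    priority: float


@dataclass
class WorkingMemory:
    capacity: int = 3
    items: list = field(default_factory=list)

    def add(self, item):
        """Insert an item, evicting the lowest-priority one when over capacity."""
        self.items.append(item)
        evicted = None
        if len(self.items) > self.capacity:
            evicted = min(self.items, key=lambda i: i.priority)
            self.items.remove(evicted)
        return evicted


def consolidate_once(working, episodic, promote_at=0.8):
    """Move items at or above promote_at from working into episodic memory."""
    promoted = [i for i in working.items if i.priority >= promote_at]
    working.items = [i for i in working.items if i.priority < promote_at]
    episodic.extend(promoted)
    return len(promoted)


async def consolidation_loop(working, episodic, interval=0.01, ticks=3):
    """Background loop: consolidate, then yield control to the event loop."""
    for _ in range(ticks):
        consolidate_once(working, episodic)
        await asyncio.sleep(interval)


async def main():
    working, episodic = WorkingMemory(), []
    task = asyncio.create_task(consolidation_loop(working, episodic))
    # The caller keeps adding memories while consolidation is scheduled.
    working.add(MemoryItem("user prefers short answers", 0.9))
    working.add(MemoryItem("small talk", 0.1))
    await task
    return working, episodic
```

Running `asyncio.run(main())` leaves the high-priority item in episodic memory and the low-priority one in working memory. The fixed threshold is the simplest possible promotion policy; the harder questions the paragraph raises (when to promote, how to reconcile conflicts) would replace it.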
Confidence Calibration
LLMs are notoriously overconfident. Alexander’s Confidence module combines complementary signals into a single calibrated score, including:
- Token logprobs — raw model uncertainty at the token level
- Consistency across re-sampling — sampling an answer to the same query several times and measuring agreement
- Verifiability of claims — cross-referencing generated statements against facts in semantic memory
The calibration model must avoid overfitting to any single signal and remain robust across diverse query types.
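One plausible way to turn each of the listed signals into a number in [0, 1] is sketched below. These function names and formulas are assumptions for illustration: the logprob signal uses the geometric-mean token probability, the consistency signal uses the share of samples agreeing with the most common answer, and the verifiability signal uses exact membership in a set of known facts as a stand-in for a semantic memory lookup.

```python
import math
from collections import Counter


def logprob_signal(token_logprobs):
    """Geometric-mean token probability: exp of the mean log-probability."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def consistency_signal(samples):
    """Fraction of re-sampled answers that agree with the most common one."""
    if not samples:
        return 0.0
    normalized = [s.strip().lower() for s in samples]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)


def verifiability_signal(claims, known_facts):
    """Fraction of claims found verbatim in a set of known facts."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c in known_facts) / len(claims)
```

For example, `consistency_signal(["Paris", "paris ", "Lyon", "Paris"])` gives 0.75, because three of the four normalized samples agree.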
Metacognitive Audit
A self-monitoring loop that runs after each reasoning step to detect:
- Contradictions — the agent asserting A and not-A in the same context
- Fallacies — circular reasoning, false dichotomies, hasty generalizations
- Overconfidence — high confidence on claims that are unverifiable or contradicted
When a metacognitive issue is detected, the agent can backtrack, revise its reasoning, or escalate to a human operator.
Rather than running the metacognitive audit as a separate LLM call on every step (which would roughly double latency and cost), Alexander implements it as a set of lightweight heuristic detectors, plus a targeted verification LLM call that fires only when heuristic thresholds are exceeded. This keeps the overhead at ~15% of total reasoning time.
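The gating idea can be sketched as follows. This is a toy version under strong assumptions: the contradiction detector only understands sentences of the form "X is Y" and "X is not Y", the overconfidence rule is a single threshold, and the names `find_contradictions`, `audit`, and `needs_llm_verification` are hypothetical. The real detectors are presumably richer; the point is only that cheap checks decide whether the expensive call happens.

```python
import re

# Matches simple "X is Y" / "X is not Y" statements; deliberately naive.
_CLAIM = re.compile(
    r"^(?P<subject>.+?) is (?P<neg>not )?(?P<value>.+?)\.?$", re.IGNORECASE
)


def find_contradictions(statements):
    """Return (subject, value) pairs asserted both positively and negatively."""
    seen = {}
    contradictions = []
    for text in statements:
        m = _CLAIM.match(text.strip())
        if not m:
            continue
        key = (m["subject"].lower(), m["value"].lower())
        negated = m["neg"] is not None
        if key in seen and seen[key] != negated:
            contradictions.append(key)
        seen[key] = negated
    return contradictions


def audit(statements, confidence, verified, overconfidence_at=0.9):
    """Run the cheap heuristics and collect human-readable issues."""
    issues = []
    for subject, value in find_contradictions(statements):
        issues.append(f"contradiction: {subject} / {value}")
    if confidence >= overconfidence_at and not verified:
        issues.append("overconfidence: high confidence on an unverified claim")
    return issues


def needs_llm_verification(issues, max_issues=0):
    """Gate the expensive verification call behind the cheap heuristics."""
    return len(issues) > max_issues
```

With the statements "The cache is warm." and "The cache is not warm." at confidence 0.95 and no verification, `audit` reports two issues and the gate opens; a single low-confidence statement produces none and the LLM call is skipped.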
⚡ Code Highlight
The confidence calibrator combines four signals into a single calibrated score — catching overconfidence before it reaches the user.
```python
# Excerpt: Thought, ConfidenceReport, and the _estimate_* / _calibrate
# helpers are not shown here.
class ConfidenceCalibrator:
    """Multi-signal confidence estimator.

    Combines four signals into a calibrated score:
    - model_logprob: LLM token probability
    - self_consistency: agreement across reasoning paths
    - verifiability: objective checkability
    - coherence: consistency with established knowledge
    """

    def __init__(self, threshold: float = 0.7) -> None:
        self.threshold = threshold

    async def evaluate(
        self,
        thought: Thought,
        signals: dict[str, float] | None = None,
    ) -> ConfidenceReport:
        """Evaluate confidence for a single thought."""
        # Fall back to internally estimated signals when none are supplied.
        signals = signals or {
            "model_logprob": self._estimate_logprob(thought),
            "self_consistency": self._estimate_consistency(thought),
            "verifiability": self._estimate_verifiability(thought),
            "coherence": self._estimate_coherence(thought),
        }
        # Average over the signals actually present, so a caller-supplied
        # dict with fewer or more than four entries is not mis-scaled.
        raw_score = sum(signals.values()) / len(signals)
        calibrated = self._calibrate(raw_score)
        # Anything below the threshold is flagged for review.
        needs_review = calibrated < self.threshold
        return ConfidenceReport(
            raw_score=raw_score,
            calibrated_score=calibrated,
            signals=signals,
            threshold=self.threshold,
            needs_review=needs_review,
        )
```
05 Results
The project has a fully functional 3-pillar architecture, with concrete results across codebase, testing, and usability.
Key milestones delivered:
- 3-pillar architecture complete with 80+ Python source files across think, remember, and act modules
- SQLite persistence with FTS5 full-text search for semantic memory — enabling fact retrieval, contradiction detection, and similarity queries
- Web dashboard built with FastAPI + Jinja2 providing clean interfaces for dashboard overview, goal tracking, and memory exploration
- CLI with 8 commands designed for power users — direct access to each pillar without needing the web UI
- 50+ passing tests covering the core subsystems in think, remember, and act pillars
The architecture diagram above isn’t aspirational — every node in that tree corresponds to a real Python module with tests, CLI access, and internal APIs. The foundation is complete and running; the next phase focuses on cross-pillar integration and real-world task evaluation.
06 What I Learned
Agents need structured reasoning, not just prompt engineering
Early experiments showed that even the best prompts couldn’t prevent LLMs from making reasoning errors on multi-step tasks. The Simulator and Planner modules demonstrated that encoding reasoning structure explicitly (rather than relying on the LLM to discover it) dramatically improves outcome reliability.
Memory is the bottleneck to autonomy
Without persistent, structured memory, every agent session starts from zero. The multi-tier memory system revealed that the key design challenge isn’t storage — it’s consolidation policy: deciding what to remember, how to link new knowledge with old, and when to forget. The background async consolidation loop was essential; blocking the agent for memory operations made it feel sluggish and unresponsive.
Confidence calibration requires multiple signals
Relying on token logprobs alone produces brittle calibration. Logprobs reflect model uncertainty but don’t capture whether the model should be uncertain about a particular factual claim. Combining logprobs with consistency re-sampling and semantic memory verification produces a much more robust signal — though it adds latency that must be managed carefully.
pydantic-ai is the right foundation
Building on pydantic-ai saved months of work. Its provider abstraction, structured output system, and async-first design meant we could focus on the three pillars rather than reinventing LLM integration. The FallbackModel pattern alone (automatically falling back to a different provider on failure) is a killer feature for production agent systems.
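The fallback idea itself is simple enough to show without the library. The sketch below is a library-free illustration of the pattern, not pydantic-ai's FallbackModel API: `run_with_fallback` and the two stand-in provider coroutines are hypothetical names, and a production version would catch narrower exception types.

```python
import asyncio


async def run_with_fallback(prompt, providers):
    """Try each (name, coroutine function) in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, await call(prompt)
        except Exception as exc:  # illustration only: catch narrower errors in practice
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def flaky_provider(prompt):
    """Stand-in for a primary provider that is currently down."""
    raise ConnectionError("primary provider unavailable")


async def backup_provider(prompt):
    """Stand-in for a secondary provider that answers normally."""
    return f"echo: {prompt}"


name, reply = asyncio.run(
    run_with_fallback(
        "hi", [("primary", flaky_provider), ("backup", backup_provider)]
    )
)
```

Here the primary call raises, so the result comes from the backup provider. The caller never sees the first failure unless every provider fails.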
Alexander makes the case that the ceiling on LLM-based agents is architectural, not fundamental. With the right components (causal reasoning, persistent memory, confidence calibration, and metacognitive oversight), autonomous agents can operate well beyond what prompt engineering alone enables.