AION — On-Device AI Agent for Android

01 Problem

Mobile AI assistants today — Siri, Google Assistant, Bixby — are constrained to narrow, pre-defined intents. They cannot adapt to novel tasks, execute multi-step workflows, or integrate deeply with the operating system. Meanwhile, powerful LLM-based agents require cloud infrastructure and a host PC, negating the mobility advantage of a smartphone.

AION was conceived to bridge this gap: an autonomous, persistent AI agent that lives on-device, perceives its environment through screen content and notifications, and acts through system APIs — SMS, calls, timers, automations — all without tethering to a desktop.

02 Architecture

AION runs as an Android Foreground Service with a persistent agent loop. User input enters through a Jetpack Compose chat UI and flows through the agent loop (context management and intent classification) to the dual-engine LLM layer. Tool requests are then routed to the appropriate skill by BM25 keyword matching.

┌─────────────────────────────────────┐
│         AION Agent App              │
│  ┌───────────────────────────────┐  │
│  │  Chat UI (Jetpack Compose)    │  │
│  └───────────┬───────────────────┘  │
│              ▼                      │
│  ┌───────────────────────────────┐  │
│  │  Agent Loop                   │  │
│  │  ┌─────────┐ ┌─────────────┐  │  │
│  │  │Context  │ │Intent       │  │  │
│  │  │Manager  │ │Classifier   │  │  │
│  │  └─────────┘ └──────┬──────┘  │  │
│  │         ▼           ▼         │  │
│  │  ┌─────────────────────────┐  │  │
│  │  │  LLM Engine             │  │  │
│  │  │  Cloud (OpenRouter)     │  │  │
│  │  │  Local (llama.cpp)      │  │  │
│  │  └─────────────────────────┘  │  │
│  └───────────────────────────────┘  │
│              ▼                      │
│  ┌───────────────────────────────┐  │
│  │  Skill Router (BM25)          │  │
│  │  ┌──────┐┌──────┐┌────────┐   │  │
│  │  │SMS   ││Call  ││Notif.  │   │  │
│  │  │Tool  ││Tool  ││Reader  │   │  │
│  │  └──────┘└──────┘└────────┘   │  │
│  │  ┌────────┐┌────────┐         │  │
│  │  │Screen  ││Timer   │         │  │
│  │  │Skill   ││Skill   │         │  │
│  │  └────────┘└────────┘         │  │
│  └───────────────────────────────┘  │
│              ▼                      │
│  ┌───────────────────────────────┐  │
│  │  Persistence (Room DB)        │  │
│  │  Memory · Settings · Skills   │  │
│  └───────────────────────────────┘  │
└─────────────────────────────────────┘

The architecture is modular by design: the Agent Loop orchestrates context and intent, the LLM Engine supports both cloud (OpenRouter) and local (llama.cpp) inference, and the Skill Router ranks entries in the skill index with BM25 to dispatch each intent to the correct tool. BM25 is a lexical ranking function, so it matches on shared terms rather than on meaning. All state persists via Room DB, and the MCP layer enables structured tool calling without a host PC.
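One way to picture the dual-engine layer is a shared interface with a cloud-first, local-fallback policy. The sketch below is an assumption about how such a layer could look, not AION's actual code: `LlmEngine`, `DualEngine`, and the two stub engines are hypothetical names, and real engines would wrap OpenRouter and llama.cpp instead of returning canned strings.

```kotlin
// Hypothetical sketch: names and fallback policy are assumptions, not AION's API.
interface LlmEngine {
    val name: String
    fun complete(prompt: String): String
}

// Stand-in for a network-backed engine; throws when the device is offline.
class StubCloudEngine(private val online: () -> Boolean) : LlmEngine {
    override val name = "cloud"
    override fun complete(prompt: String): String {
        if (!online()) throw IllegalStateException("network unavailable")
        return "[cloud] $prompt"
    }
}

// Stand-in for an on-device engine; always available.
class StubLocalEngine : LlmEngine {
    override val name = "local"
    override fun complete(prompt: String): String = "[local] $prompt"
}

// Uses the primary engine unless the caller demands on-device inference,
// and falls back to the secondary engine if the primary call fails.
class DualEngine(
    private val primary: LlmEngine,
    private val fallback: LlmEngine,
) : LlmEngine {
    override val name = "dual"

    fun complete(prompt: String, localOnly: Boolean): String {
        if (localOnly) return fallback.complete(prompt)
        return try {
            primary.complete(prompt)
        } catch (e: Exception) {
            fallback.complete(prompt)
        }
    }

    override fun complete(prompt: String): String = complete(prompt, localOnly = false)
}
```

Because `DualEngine` itself implements `LlmEngine`, the agent loop can hold a single engine reference and stay unaware of which backend answered. The `localOnly` flag is one possible hook for privacy-sensitive requests.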

03 Tech Stack

Layer	Technology
Language	Kotlin
UI Framework	Jetpack Compose (state management, navigation)
Cloud LLM	OpenRouter API (GPT-4o, Claude, or any other model OpenRouter serves)
Local LLM	llama.cpp (on-device inference)
Persistence	Room DB (conversations, encrypted settings)
Skill Routing	BM25 ranking (lexical keyword search)
Tool Protocol	MCP (Model Context Protocol, server/client)
Background Service	Android Foreground Service (persistent agent loop)

04 Key Challenges

Android Lifecycle Management

Keeping the agent alive through Doze mode, background restrictions, and aggressive OEM battery optimisation required careful use of Foreground Services, wake locks, and periodic alarms. Testing covered a Nothing Phone 2 and an Oppo device running ColorOS to validate cross-vendor reliability.

On-Device LLM Latency

Running llama.cpp on a phone with acceptable response times demanded aggressive model quantisation (Q4_K_M), GPU offload through llama.cpp's Vulkan backend, and prompt caching. Balancing model size against output quality remains an ongoing optimisation.
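Prompt caching helps because the system prompt and earlier turns repeat from one request to the next, so only the new suffix needs to be evaluated. The sketch below illustrates that idea only. `PrefixCache` is a hypothetical class, token IDs are plain integers, and a real engine would keep the matching KV-cache entries instead of a token list.

```kotlin
// Hypothetical sketch of prompt-prefix caching; not llama.cpp's API.
class PrefixCache {
    private var cached: List<Int> = emptyList()

    // Returns how many leading tokens match the previous prompt (and could be
    // reused), then remembers the new prompt for the next call.
    fun reuseAndUpdate(prompt: List<Int>): Int {
        var shared = 0
        while (shared < cached.size &&
            shared < prompt.size &&
            cached[shared] == prompt[shared]
        ) {
            shared++
        }
        cached = prompt
        return shared
    }
}
```

With a long, stable system prompt, most of each request falls inside the shared prefix, which is where the latency saving comes from.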

Skill Routing Accuracy vs. Speed

BM25 routing had to balance classification accuracy against near-real-time response latency. Because BM25 scores term overlap rather than meaning, tuning the tokeniser, stop-word lists, and score thresholds was essential to prevent misrouted intents without introducing delay.
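The three tuning knobs named above (tokeniser, stop words, threshold) can be seen in a minimal BM25 router. This is a sketch under assumptions, not AION's router: `SkillDoc` and `Bm25Router` are hypothetical names, the stop-word list and `minScore` are illustrative rather than tuned values, and the IDF uses the common non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)).

```kotlin
import kotlin.math.ln

// Hypothetical sketch of BM25-based skill routing; values are illustrative.
data class SkillDoc(val name: String, val description: String)

class Bm25Router(
    skills: List<SkillDoc>,
    private val stopWords: Set<String> =
        setOf("a", "an", "the", "to", "my", "me", "please", "in", "for"),
    private val k1: Double = 1.2,
    private val b: Double = 0.75,
    private val minScore: Double = 0.5,
) {
    private val docs: List<Pair<String, List<String>>> =
        skills.map { it.name to tokenize(it.description) }
    private val avgLen: Double = docs.map { it.second.size }.average()

    // Document frequency: number of skill descriptions containing each term.
    private val docFreq: Map<String, Int> =
        docs.flatMap { it.second.distinct() }.groupingBy { it }.eachCount()

    // Lowercase, split on non-alphanumerics, drop stop words.
    private fun tokenize(text: String): List<String> =
        text.lowercase().split(Regex("[^a-z0-9]+"))
            .filter { it.isNotEmpty() && it !in stopWords }

    private fun idf(term: String): Double {
        val n = docFreq[term] ?: 0
        return ln(1.0 + (docs.size - n + 0.5) / (n + 0.5))
    }

    private fun score(query: List<String>, doc: List<String>): Double =
        query.distinct().sumOf { term ->
            val tf = doc.count { it == term }.toDouble()
            if (tf == 0.0) 0.0
            else idf(term) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc.size / avgLen))
        }

    // Returns the best-scoring skill name, or null when nothing clears the threshold.
    fun route(utterance: String): String? {
        val query = tokenize(utterance)
        val best = docs.map { (name, tokens) -> name to score(query, tokens) }
            .maxByOrNull { it.second } ?: return null
        return if (best.second >= minScore) best.first else null
    }
}
```

A `null` result is the useful part of the threshold: an utterance that shares no meaningful terms with any skill description can fall through to plain chat instead of being forced onto the nearest tool.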

MCP on Android

Building an MCP server that runs directly on Android — without a host PC — required a custom transport layer. The standard HTTP+SSE transport was replaced with an in-process bridge to keep the tool-calling loop local and fast.
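The idea of an in-process bridge can be sketched as a transport whose `send` is a direct function call into the server. Everything here is hypothetical and simplified: the types are not taken from any MCP SDK, JSON-RPC messages are modelled as data classes instead of serialised JSON, and tool results are plain strings instead of MCP's structured content. Only the method names `tools/list` and `tools/call` come from the protocol.

```kotlin
// Hypothetical sketch of an in-process transport; not an MCP SDK API.
data class RpcRequest(
    val id: Int,
    val method: String,
    val params: Map<String, Any?> = emptyMap(),
)

data class RpcResponse(val id: Int, val result: Any? = null, val error: String? = null)

// A transport only has to deliver a request and bring back the response.
fun interface Transport {
    fun send(request: RpcRequest): RpcResponse
}

class ToolServer(private val tools: Map<String, (Map<String, Any?>) -> String>) {
    fun handle(request: RpcRequest): RpcResponse = when (request.method) {
        "tools/list" -> RpcResponse(request.id, result = tools.keys.sorted())
        "tools/call" -> {
            val name = request.params["name"] as? String
            val tool = name?.let { tools[it] }
            if (tool == null) {
                RpcResponse(request.id, error = "unknown tool: $name")
            } else {
                @Suppress("UNCHECKED_CAST")
                val args = request.params["arguments"] as? Map<String, Any?> ?: emptyMap()
                RpcResponse(request.id, result = tool(args))
            }
        }
        else -> RpcResponse(request.id, error = "method not found: ${request.method}")
    }
}

// The bridge: no sockets and no event stream, just a function call.
fun inProcessTransport(server: ToolServer): Transport =
    Transport { request -> server.handle(request) }
```

Keeping `Transport` as an interface means a network-backed implementation could be swapped in later without touching the client side of the tool-calling loop.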

⚡ Code Highlight

The agent loop uses Kotlin Flow to stream events through a clean intent-routing pipeline — classify, route, emit.

import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.emitAll
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn

fun processUserMessage(
    conversationId: String,
    userText: String,
): Flow<AgentEvent> = flow {
    emit(AgentEvent.AssistantStarted)

    // Persist user input immediately
    conversationRepository.appendMessage(
        conversationId, role = "user", content = userText,
    )

    // Classify intent, then route to the matching handler
    when (val intent = intentClassifier.classify(userText)) {
        is AgentIntent.Empty -> {
            // Nothing to answer: finish here so Done is emitted only once
            emit(AgentEvent.Done(DoneReason.EmptyInput))
            return@flow
        }
        is AgentIntent.Chat -> emitAll(streamLlmReply(conversationId, userText))
        is AgentIntent.ToolCall -> emitAll(handleToolCall(conversationId, intent))
    }
    emit(AgentEvent.Done(reason = DoneReason.Normal))
}.flowOn(Dispatchers.Default) // run the upstream block off the collector's dispatcher
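The pipeline above refers to several types it does not define. A plausible shape for them is sketched here so the snippet can be read on its own. These definitions are assumptions: only the names `AgentIntent`, `AgentEvent`, `DoneReason`, and their variants come from the snippet, while the fields, the `Token` event, and the keyword-table classifier are hypothetical stand-ins for the real logic.

```kotlin
// Hypothetical definitions for the types the pipeline refers to.
sealed interface AgentIntent {
    object Empty : AgentIntent
    data class Chat(val text: String) : AgentIntent
    data class ToolCall(val skill: String, val text: String) : AgentIntent
}

enum class DoneReason { Normal, EmptyInput }

sealed interface AgentEvent {
    object AssistantStarted : AgentEvent
    data class Token(val text: String) : AgentEvent
    data class Done(val reason: DoneReason) : AgentEvent
}

// Toy classifier: a keyword table stands in for the real routing logic.
class KeywordIntentClassifier(private val keywords: Map<String, String>) {
    fun classify(userText: String): AgentIntent {
        val text = userText.trim()
        if (text.isEmpty()) return AgentIntent.Empty
        val words = text.lowercase().split(Regex("\\s+"))
        val skill = words.firstNotNullOfOrNull { keywords[it] }
        return if (skill != null) AgentIntent.ToolCall(skill, text) else AgentIntent.Chat(text)
    }
}
```

Modelling intents and events as sealed hierarchies is what lets the `when` in the pipeline be exhaustive: adding a new intent variant becomes a compile error until the loop handles it.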

05 Results

6+ phases planned
Cloud + local dual LLM engine
BM25 skill routing
Screenshot-based screen awareness (Phase 3)

Phase 1 stable on Nothing Phone 2 with production-grade reliability
Cloud LLM response time consistently under 2 seconds
SMS tool with full send / receive capability via Android Telephony APIs
Encrypted API key storage using Android Keystore + Room DB encryption
40+ Kotlin source files organised into a modular, testable architecture

Key Insight

The dual-LLM architecture pairs cloud models (GPT-4o, Claude) for complex reasoning with local inference for privacy-sensitive or offline tasks. The BM25 skill router then maps intent to execution, so the user never has to think about which model or tool is running.

06 What I Learned

Lifecycle Mastery

Building a persistent agent on Android is fundamentally a battle against the OS's battery-saving mechanisms. Foreground Service + proper lifecycle-aware architecture is non-negotiable. Every OEM has quirks — test early, test often.

Local LLM Practicality

Running LLMs on-device is feasible today with the right quantisation strategy and GPU acceleration. The latency gap between cloud and local is shrinking, and for many tasks (classification, simple Q&A), local models are already good enough — with the added benefit of zero data leaving the device.

Modularity Matters

The skill-router pattern (BM25 + MCP) proved extremely flexible. Adding a new tool is as simple as writing a Kotlin class, registering it in the skill index, and letting the router handle dispatch. This pattern should generalise well to any on-device agent framework.
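The "write a class, register it" workflow might look like the following. This is a sketch under assumptions rather than AION's real interfaces: `Skill`, `SkillRegistry`, and `TimerSkill` are hypothetical names, and arguments are simplified to a string map.

```kotlin
// Hypothetical sketch of skill registration; names are illustrative only.
interface Skill {
    val name: String
    val description: String // text a router would index
    fun execute(args: Map<String, String>): String
}

class SkillRegistry {
    private val skills = LinkedHashMap<String, Skill>()

    fun register(skill: Skill) {
        require(skill.name !in skills) { "duplicate skill: ${skill.name}" }
        skills[skill.name] = skill
    }

    // Descriptions keyed by skill name, ready to feed into a search index.
    fun index(): Map<String, String> = skills.mapValues { it.value.description }

    fun dispatch(name: String, args: Map<String, String>): String =
        skills[name]?.execute(args) ?: error("no skill named $name")
}

// A new tool is one class plus one register() call.
class TimerSkill : Skill {
    override val name = "timer"
    override val description = "set a countdown timer or alarm"
    override fun execute(args: Map<String, String>): String =
        "timer set for ${args["minutes"] ?: "1"} min"
}
```

The registry never needs to know which skills exist at compile time, which is what keeps new tools from touching the router or the agent loop.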