Optimizing LLM Costs with MuXD: A Hybrid Router for ARKONA
At ARKONA, we're building an autonomous multi-agent AI ecosystem, currently comprising 47 services running across 23 ports on Tailscale HTTPS. This ecosystem places heavy demand on Large Language Models (LLMs) for tasks like agent orchestration, natural-language processing for reverse engineering (the CIPHER pipeline), and even newsroom editorial workflows. As we scaled, API costs became a serious concern: we needed to maintain performance while drastically reducing reliance on expensive cloud LLM APIs. The result is MuXD, our hybrid LLM router, which has consistently delivered a 40% reduction in API costs.
The Problem: Balancing Performance and Cost
Our agents, operating on a battle rhythm, perform a diverse range of tasks. Some require the sophisticated reasoning of models like Claude 3 Opus, accessed via API. Others – simple summarization, keyword extraction, basic question answering – are perfectly suited to smaller, locally hosted LLMs. Blindly sending *everything* to Claude was wasteful and introduced unnecessary latency.
We considered a simple rule-based routing system, but quickly realized this was brittle and wouldn't scale. Defining exhaustive rules for every possible task was impractical. Furthermore, the “complexity” of a task isn’t always obvious upfront. We needed a dynamic system capable of assessing task requirements and intelligently routing requests.
Introducing MuXD: The Hybrid Router
MuXD (Multi-eXecution Dispatcher) is designed to intelligently route LLM requests between our local Ollama models and the Claude API. It’s built as a microservice, exposed on port 8082 within our Tailscale network, and acts as a reverse proxy for all LLM requests originating from our agents. The core of MuXD is a multi-layered decision engine:
- Task Complexity Assessment: When a request arrives, MuXD first analyzes the prompt. We leverage a lightweight, locally-run LLM (currently a quantized Mistral 7B via Ollama) to estimate the cognitive load of the task. This is achieved by prompting the local model to categorize the request based on a predefined taxonomy of complexity (e.g., “simple”, “medium”, “complex”).
- Token Usage Estimation: MuXD estimates the expected token count for both the input prompt and the predicted output. This is crucial, as Claude API pricing is heavily influenced by token usage. The local LLM assists with this estimation.
- Cost-Benefit Analysis: MuXD compares the estimated cost of executing the task on Claude (based on token count and API pricing) with the cost of running it locally (effectively negligible, as the infrastructure is already provisioned). A configurable threshold determines the routing decision.
- Dynamic Routing: Based on the cost-benefit analysis, MuXD forwards the request to either the appropriate Ollama model or the Claude API.
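The cost-benefit step above boils down to comparing estimated API spend against the near-zero marginal cost of local inference. Here's a minimal sketch of that comparison; the per-token prices, thresholds, and function names are illustrative assumptions, not MuXD's actual configuration:

```python
# Illustrative cost-benefit routing decision. Prices and thresholds below
# are placeholder assumptions, not ARKONA's production values.

API_INPUT_PRICE = 15.00   # hypothetical USD per 1M input tokens
API_OUTPUT_PRICE = 75.00  # hypothetical USD per 1M output tokens
LOCAL_COST = 0.0          # local inference treated as already provisioned
COST_THRESHOLD = 0.01     # route locally only if the API call would cost more

def estimated_api_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of sending this request to the cloud API."""
    return (input_tokens * API_INPUT_PRICE
            + output_tokens * API_OUTPUT_PRICE) / 1_000_000

def should_route_locally(complexity: float, input_tokens: int,
                         output_tokens: int,
                         complexity_threshold: float = 0.6) -> bool:
    """Route locally when the task is simple enough and the API cost is non-trivial."""
    if complexity >= complexity_threshold:
        return False  # complex tasks always go to the cloud model
    return estimated_api_cost(input_tokens, output_tokens) - LOCAL_COST > COST_THRESHOLD

print(should_route_locally(0.3, 1500, 800))
```

Note that under this scheme a very cheap request may still go to the API when the projected saving is below the threshold; the configurable threshold is what lets us tune that trade-off.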
Technical Implementation Details
MuXD is implemented in Python using FastAPI, offering a lightweight and asynchronous framework ideal for a reverse proxy. Here’s a simplified snippet illustrating the routing logic:
```python
from fastapi import FastAPI, Request, HTTPException
import os
import ollama_client  # Assuming a custom Ollama client library
import claude_client  # Assuming a custom Claude client library

app = FastAPI()

OLLAMA_ENABLED = os.environ.get("OLLAMA_ENABLED", "True").lower() == "true"
COMPLEXITY_THRESHOLD = 0.6  # Adjust based on testing


@app.post("/llm")
async def route_llm_request(request: Request):
    data = await request.json()
    prompt = data.get("prompt")
    if not prompt:
        raise HTTPException(status_code=400, detail="Prompt is required")

    complexity_score = estimate_complexity(prompt)
    token_count = estimate_token_count(prompt)

    if OLLAMA_ENABLED and complexity_score < COMPLEXITY_THRESHOLD and token_count < 2048:
        # Route to the local Ollama model
        return ollama_client.generate(prompt)
    # Route to the Claude API
    return claude_client.generate(prompt)


def estimate_complexity(prompt: str) -> float:
    # Implementation uses the local LLM to categorize prompt complexity
    # ... (omitted for brevity)
    return 0.5  # Example value


def estimate_token_count(prompt: str) -> int:
    # Implementation uses the local LLM or a tokenizer to estimate token count
    # ... (omitted for brevity)
    return 1024  # Example value
```
The `ollama_client` and `claude_client` are custom libraries we’ve developed to handle communication with each respective service. We utilize environment variables for configuration, allowing us to easily enable/disable Ollama support and adjust the complexity threshold without code changes.
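As an aside, token estimation doesn't strictly require an LLM call. A common fallback is a character-based heuristic (roughly four characters per token for English text). The sketch below illustrates that fallback; it is not MuXD's actual estimator:

```python
def estimate_token_count(prompt: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4 chars/token rule of thumb for English.

    A real deployment would prefer the model's own tokenizer (e.g. the
    Hugging Face tokenizer for the local model); this heuristic is a cheap,
    dependency-free fallback.
    """
    if not prompt:
        return 0
    # Round up via max() so short prompts are never estimated at zero tokens.
    return max(1, round(len(prompt) / chars_per_token))

print(estimate_token_count("Summarize the following article in one sentence."))
```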
Integration with ARKONA Ecosystem
All agent communication within ARKONA leverages a pub/sub system built on ZeroMQ, running on port 5555. When an agent requires LLM assistance, it publishes a message to the appropriate topic. MuXD subscribes to these topics, intercepts the requests, and handles the routing.
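The wiring can be sketched with pyzmq. The example below uses an in-process transport and a hypothetical topic name (`llm.request`) purely for illustration; the real deployment binds TCP port 5555 and uses ARKONA's own topic taxonomy:

```python
import time
import zmq

# One shared context is required for inproc transports.
ctx = zmq.Context.instance()

# MuXD side: subscribe to LLM request topics.
sub = ctx.socket(zmq.SUB)
pub = ctx.socket(zmq.PUB)
pub.bind("inproc://arkona-bus")          # stand-in for tcp://*:5555
sub.connect("inproc://arkona-bus")
sub.setsockopt_string(zmq.SUBSCRIBE, "llm.request")  # hypothetical topic prefix
time.sleep(0.1)                          # let the subscription propagate

# Agent side: publish a request under the topic.
pub.send_multipart([b"llm.request", b'{"prompt": "Summarize this article."}'])

topic, payload = sub.recv_multipart()
print(topic, payload)
```

The short sleep guards against ZeroMQ's slow-joiner behavior, where messages published before the subscription has propagated are silently dropped.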
The newsroom editorial pipeline, for instance, heavily utilizes MuXD. Tasks like generating headline options (simple) are almost exclusively handled by local Ollama models. However, fact-checking and in-depth analysis (complex) are routed to Claude, ensuring accuracy and reliability. This is guided by our COMET framework—a 7-step human-AI delegation framework grounded in IEEE and NIST standards—which emphasizes appropriate levels of AI autonomy and oversight.
Monitoring and Optimization
We continuously monitor MuXD's performance using Prometheus and Grafana, tracking metrics like request routing rates, average response times, and cost savings. We’ve implemented a feedback loop where agents can report the quality of the LLM response, allowing us to fine-tune the complexity threshold and improve routing accuracy. We also leverage our NIST 800-30 grounded risk evaluation engine to assess the potential impact of routing decisions on overall system security and reliability.
Beyond Cost Savings
While cost reduction is a significant benefit, MuXD offers additional advantages. Local execution reduces latency for simple tasks, improving the responsiveness of our agents. It also enhances resilience; even if the Claude API is temporarily unavailable, our agents can continue to function with reduced capabilities. The provenance signing, using SHA-256 throughout the ecosystem, ensures the integrity of the LLM responses, regardless of where they are generated.
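The provenance record is straightforward to sketch with the standard library. The field names below are illustrative, not ARKONA's actual schema, and a plain digest only proves integrity; true signing would add an HMAC or asymmetric key on top:

```python
import hashlib

def sign_response(response_text: str, model: str) -> dict:
    """Attach a SHA-256 digest so downstream agents can verify integrity.

    Field names ("model", "text", "sha256") are hypothetical placeholders.
    """
    digest = hashlib.sha256(response_text.encode("utf-8")).hexdigest()
    return {"model": model, "text": response_text, "sha256": digest}

def verify_response(record: dict) -> bool:
    """Recompute the digest and compare against the stored one."""
    expected = hashlib.sha256(record["text"].encode("utf-8")).hexdigest()
    return expected == record["sha256"]

record = sign_response("Paris is the capital of France.", "mistral:7b")
print(verify_response(record))
```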
Currently, we have 5 local Ollama models running on our hardware (dual Tesla P40 GPUs, 440GB DDR4): Mistral 7B, Llama 2 7B Chat, Gemma 7B, Phi-2, and a specialized model fine-tuned for reverse engineering tasks in the CIPHER pipeline.
Lessons Learned
Building MuXD wasn’t without challenges. Initially, accurately estimating task complexity proved difficult. We found that relying solely on prompt length was insufficient. The key was to leverage a secondary LLM to *reason* about the prompt’s requirements.
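Concretely, that secondary-LLM assessment amounts to asking the local model for a category label and mapping the label to a score. The prompt wording and score mapping below are illustrative, and the model call is stubbed out since the actual Ollama client is internal:

```python
# Hypothetical sketch of the secondary-LLM complexity assessment.
# `ask_local_model` is a stand-in for the internal Ollama client call.

CATEGORY_SCORES = {"simple": 0.2, "medium": 0.5, "complex": 0.9}  # illustrative

CLASSIFY_TEMPLATE = (
    "Classify the cognitive load of the following request as exactly one of: "
    "simple, medium, complex. Reply with the single word only.\n\nRequest: {prompt}"
)

def estimate_complexity(prompt: str, ask_local_model) -> float:
    """Ask a small local model to categorize the prompt, then map to a score."""
    label = ask_local_model(CLASSIFY_TEMPLATE.format(prompt=prompt)).strip().lower()
    # Default to "complex" on an unrecognized label, so odd cases go to Claude.
    return CATEGORY_SCORES.get(label, CATEGORY_SCORES["complex"])

# Stubbed model for demonstration: always answers "simple".
print(estimate_complexity("Extract keywords from this text.", lambda _: "Simple\n"))
```

Defaulting unrecognized labels to "complex" is a deliberately conservative choice: a misclassification costs money rather than quality.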
The most crucial takeaway is this: Don’t treat LLMs as a monolithic resource. Embrace a hybrid approach that leverages the strengths of both cloud APIs and locally-hosted models. By intelligently routing tasks based on complexity and cost, you can unlock significant savings without compromising performance. This strategy is fundamental to scaling our ARKONA ecosystem sustainably.