Optimizing LLM Costs with MuXD: A Hybrid Router for ARKONA

At ARKONA, we're building an autonomous multi-agent AI ecosystem, currently comprising 47 services running across 23 ports on Tailscale HTTPS. This places significant demand on Large Language Models (LLMs) for tasks like agent orchestration, natural language processing for reverse engineering (the CIPHER pipeline), and even newsroom editorial workflows. As we scaled, API costs became a serious concern: we needed a solution that could maintain performance while drastically reducing reliance on expensive cloud LLM APIs. The result is MuXD, our hybrid LLM router, which has consistently delivered a 40% reduction in API costs.

The Problem: Balancing Performance and Cost

Our agents, operating on a battle rhythm, perform a diverse range of tasks. Some require the sophisticated reasoning capabilities of models like Claude 3 Opus, accessed via API. Others – simple summarization, keyword extraction, basic question answering – are perfectly suited for smaller, locally-hosted LLMs. Blindly sending *everything* to Claude was both wasteful and introduced unnecessary latency.

We considered a simple rule-based routing system, but quickly realized this was brittle and wouldn't scale. Defining exhaustive rules for every possible task was impractical. Furthermore, the “complexity” of a task isn’t always obvious upfront. We needed a dynamic system capable of assessing task requirements and intelligently routing requests.

Introducing MuXD: The Hybrid Router

MuXD (Multi-eXecution Dispatcher) is designed to intelligently route LLM requests between our local Ollama models and the Claude API. It's built as a microservice, exposed on port 8082 within our Tailscale network, and acts as a reverse proxy for all LLM requests originating from our agents. At its core is a multi-layered decision engine that weighs an estimated complexity score, an estimated token count, and local model availability before dispatching each request.

Technical Implementation Details

MuXD is implemented in Python using FastAPI, offering a lightweight and asynchronous framework ideal for a reverse proxy. Here’s a simplified snippet illustrating the routing logic:


```python
from fastapi import FastAPI, Request, HTTPException
import os

import ollama_client  # custom in-house Ollama client library
import claude_client  # custom in-house Claude client library

app = FastAPI()

# Both knobs are read from the environment so routing behavior can be
# changed without a code deploy.
OLLAMA_ENABLED = os.environ.get("OLLAMA_ENABLED", "true").lower() == "true"
COMPLEXITY_THRESHOLD = float(os.environ.get("COMPLEXITY_THRESHOLD", "0.6"))
OLLAMA_MAX_TOKENS = 2048  # local models handle shorter contexts

@app.post("/llm")
async def route_llm_request(request: Request):
    data = await request.json()
    prompt = data.get("prompt")

    if not prompt:
        raise HTTPException(status_code=400, detail="Prompt is required")

    complexity_score = estimate_complexity(prompt)
    token_count = estimate_token_count(prompt)

    # Cheap, short, simple prompts stay local; everything else goes to Claude.
    if OLLAMA_ENABLED and complexity_score < COMPLEXITY_THRESHOLD and token_count < OLLAMA_MAX_TOKENS:
        return ollama_client.generate(prompt)
    return claude_client.generate(prompt)

def estimate_complexity(prompt: str) -> float:
    # Uses a local LLM to categorize prompt complexity
    # ... (omitted for brevity)
    return 0.5  # example value

def estimate_token_count(prompt: str) -> int:
    # Uses a tokenizer to estimate token count
    # ... (omitted for brevity)
    return 1024  # example value
```

The `ollama_client` and `claude_client` are custom libraries we’ve developed to handle communication with each respective service. We utilize environment variables for configuration, allowing us to easily enable/disable Ollama support and adjust the complexity threshold without code changes.
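The `estimate_token_count` stub can be filled in cheaply with a character-based heuristic. This is a rough sketch, not our production tokenizer; a real tokenizer (e.g. the model's own) is more accurate, but the heuristic is dependency-free and good enough for a routing decision:

```python
def estimate_token_count(prompt: str) -> int:
    """Rough token estimate: roughly 4 characters per token for English text.

    This intentionally overestimates nothing fancy; it only needs to be
    accurate enough to keep oversized prompts away from the local models.
    """
    return max(1, len(prompt) // 4)
```

The divisor of 4 is a common rule of thumb for English; prompts dominated by code or non-Latin scripts may warrant a smaller divisor.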

Integration with ARKONA Ecosystem

All agent communication within ARKONA leverages a pub/sub system built on ZeroMQ, running on port 5555. When an agent requires LLM assistance, it publishes a message to the appropriate topic. MuXD subscribes to these topics, intercepts the requests, and handles the routing.
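A rough sketch of that subscription path follows. The two-frame message shape (topic, then JSON payload) and the `llm.` topic prefix are illustrative assumptions, not the actual ARKONA wire format:

```python
import json

# Hypothetical topic prefix agents publish LLM requests under (assumption).
LLM_TOPIC_PREFIX = b"llm."

def parse_llm_message(frames):
    """Decode a two-frame pub/sub message: [topic, JSON payload].

    Returns the request dict (with the topic attached) for LLM topics,
    or None for messages MuXD should ignore.
    """
    topic, payload = frames
    if not topic.startswith(LLM_TOPIC_PREFIX):
        return None
    request = json.loads(payload)
    request["topic"] = topic.decode()
    return request

def subscribe_and_route(endpoint="tcp://127.0.0.1:5555"):
    """Minimal subscriber loop (requires pyzmq; imported lazily so the
    parsing helper above stays usable without it)."""
    import zmq
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.SUB)
    sock.connect(endpoint)
    sock.setsockopt(zmq.SUBSCRIBE, LLM_TOPIC_PREFIX)
    while True:
        request = parse_llm_message(sock.recv_multipart())
        if request is not None:
            yield request  # hand off to the routing logic
```

Keeping the frame parsing separate from the socket loop makes the interception logic testable without a running broker.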

The newsroom editorial pipeline, for instance, heavily utilizes MuXD. Tasks like generating headline options (simple) are almost exclusively handled by local Ollama models. However, fact-checking and in-depth analysis (complex) are routed to Claude, ensuring accuracy and reliability. This is guided by our COMET framework—a 7-step human-AI delegation framework grounded in IEEE and NIST standards—which emphasizes appropriate levels of AI autonomy and oversight.

Monitoring and Optimization

We continuously monitor MuXD's performance using Prometheus and Grafana, tracking metrics like request routing rates, average response times, and cost savings. We’ve implemented a feedback loop where agents can report the quality of the LLM response, allowing us to fine-tune the complexity threshold and improve routing accuracy. We also leverage our NIST 800-30 grounded risk evaluation engine to assess the potential impact of routing decisions on overall system security and reliability.
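As an illustration of how such a feedback loop might nudge the complexity threshold, here is a minimal sketch; the class name, step sizes, and bounds are all hypothetical rather than our actual tuning logic:

```python
class ThresholdTuner:
    """Adjusts the local-vs-cloud routing threshold from quality feedback."""

    def __init__(self, threshold=0.6, step=0.01, lo=0.1, hi=0.9):
        self.threshold = threshold
        self.step = step
        self.lo = lo  # floor: never route everything to Claude
        self.hi = hi  # ceiling: never route everything to Ollama

    def report(self, routed_local: bool, acceptable: bool) -> None:
        if not routed_local:
            return  # only locally routed responses carry signal here
        if acceptable:
            # Local model did fine: creep upward to capture more traffic.
            self.threshold = min(self.hi, self.threshold + self.step / 10)
        else:
            # Local model fell short: back off quickly toward Claude.
            self.threshold = max(self.lo, self.threshold - self.step)
```

The asymmetry (backing off ten times faster than creeping up) reflects that a bad local answer costs more than an unnecessarily routed cloud call.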

Beyond Cost Savings

While cost reduction is a significant benefit, MuXD offers additional advantages. Local execution reduces latency for simple tasks, improving the responsiveness of our agents. It also enhances resilience: even if the Claude API is temporarily unavailable, our agents can continue to function with reduced capabilities. Provenance signing, using SHA-256 throughout the ecosystem, ensures the integrity of LLM responses regardless of where they are generated.
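One way such provenance tagging could look is a keyed HMAC-SHA-256 over the response plus its backend label. This is a sketch only: the payload shape, field names, and key handling are assumptions, not our actual signing scheme:

```python
import hashlib
import hmac
import json

def stamp_provenance(response_text: str, backend: str, key: bytes) -> dict:
    """Attach an HMAC-SHA-256 provenance tag binding the response to its backend."""
    payload = json.dumps({"backend": backend, "text": response_text}, sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"backend": backend, "text": response_text, "provenance": tag}

def verify_provenance(stamped: dict, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    payload = json.dumps(
        {"backend": stamped["backend"], "text": stamped["text"]}, sort_keys=True
    )
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, stamped["provenance"])
```

Using a keyed HMAC rather than a bare SHA-256 digest means a tampering party who can rewrite the response cannot also forge a matching tag without the key.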

Currently, we have 5 local Ollama models running on our hardware (dual Tesla P40 GPUs, 440GB DDR4): Mistral 7B, Llama 2 7B Chat, Gemma 7B, Phi-2, and a specialized model fine-tuned for reverse engineering tasks in the CIPHER pipeline.

Lessons Learned

Building MuXD wasn’t without challenges. Initially, accurately estimating task complexity proved difficult. We found that relying solely on prompt length was insufficient. The key was to leverage a secondary LLM to *reason* about the prompt’s requirements.

The most crucial takeaway is this: Don’t treat LLMs as a monolithic resource. Embrace a hybrid approach that leverages the strengths of both cloud APIs and locally-hosted models. By intelligently routing tasks based on complexity and cost, you can unlock significant savings without compromising performance. This strategy is fundamental to scaling our ARKONA ecosystem sustainably.