The case for local inference: when Ollama on a Tesla P40 beats cloud API calls

For the past year, building ARKONA – an autonomous multi-agent AI ecosystem – has forced some very practical decisions about where and how we run large language models. We currently operate 47 services across 23 Tailscale HTTPS ports, all interconnected via an inter-agent communication broker, and increasingly reliant on LLMs for everything from code generation (thanks, Claude Code!) to risk assessment. While we heavily leverage Claude’s API for specific capabilities, especially in the BizOps and COMET domains, we’ve found that running inference locally – specifically with Ollama on our dual Tesla P40 GPUs – provides significant advantages, advantages that are becoming critical to scaling ARKONA’s capabilities. This isn’t a blanket dismissal of cloud APIs; it’s a nuanced exploration of when local inference demonstrably outperforms, even given the upfront investment.

The Architecture and the Problem

ARKONA’s core architecture is built around autonomous agents operating on a battle rhythm. These agents, currently numbering 26, perform tasks ranging from research and editorial work to system monitoring and synchronization. A central component is MuXD, our hybrid LLM router. MuXD intelligently distributes prompts between local Ollama models and the Claude API, prioritizing locality for speed and cost, and offloading to the cloud when specialized models or knowledge are required. The key challenges are latency and cost. Each agent frequently interacts with MuXD, and every API call to Claude, while powerful, introduces network latency and incurs a per-token cost. That cost, multiplied across 26 agents running continuous tasks, adds up quickly.
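To make the local-first routing decision concrete, here is a rough sketch of the principle in Python. The names (`Route`, `choose_backend`) are ours for illustration, not MuXD's actual internals, which are more involved:

```python
# Illustrative sketch of a local-first routing decision. Names and
# structure are assumptions for this post, not MuXD's real code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Route:
    name: str
    local_first: bool            # try the local Ollama model before the cloud
    ollama_model: Optional[str]  # None when no local model is configured
    cloud_model: str

def choose_backend(route: Route, local_available: bool) -> str:
    """Pick which model a prompt should be sent to first."""
    if route.local_first and local_available and route.ollama_model:
        return route.ollama_model  # locality wins: lower latency, zero per-token cost
    return route.cloud_model       # specialized or fallback work goes to the cloud

# A CIPHER-style route prefers the local model whenever it is healthy.
cipher = Route("CIPHER_ANALYSIS", True, "llama3-8b", "claude-3-opus-20240229")
print(choose_backend(cipher, local_available=True))   # llama3-8b
print(choose_backend(cipher, local_available=False))  # claude-3-opus-20240229
```

The real router also weighs priority, token budgets, and per-domain policy, but the core decision is this simple: prefer local when a healthy local model exists.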

Consider the REOps domain, specifically our CIPHER hardware reverse engineering pipeline. CIPHER utilizes Ghidra, and several agents are responsible for analyzing disassembly output, identifying potential vulnerabilities, and documenting findings. These agents need rapid feedback – generating explanations for code blocks, suggesting alternative approaches, and even producing initial draft reports. We initially routed all these tasks to Claude. The response times, even with optimizations, were impacting the pipeline’s throughput. Furthermore, the cost associated with processing the large codebases in Ghidra was unsustainable.

Why Ollama on P40?

We deployed local Ollama instances running on our two Tesla P40 GPUs. These GPUs, coupled with 440GB of DDR4 RAM, provide substantial compute and memory resources. We started with a mixture of models – Llama 3 8B, Mistral 7B, and even some smaller, specialized models for specific tasks. The initial results were promising, but required careful tuning.

The primary benefit was drastically reduced latency. For a typical task – explaining a 500-line function in Ghidra disassembly – the round trip time to Claude averaged 2.5-3 seconds. With Ollama, running locally on the P40, this dropped to 400-800 milliseconds. That roughly 68-87% reduction in latency is *significant* for an interactive system like CIPHER. It allowed us to increase the throughput of the pipeline by approximately 30% without altering the underlying logic.

But latency isn't the only factor. We use a token-savings optimization within MuXD. We’ve profiled agent prompts and discovered substantial redundancy: many agents were essentially requesting the same information repeatedly. MuXD caches responses, but every response that has to be generated (or regenerated after invalidation) by Claude still incurs API costs. With local Ollama models, those responses cost nothing to serve. Our internal calculations showed that shifting approximately 40% of CIPHER’s LLM workload to local inference reduced our monthly Claude API spend by 22%, a savings of roughly $1,800. This number is projected to increase as CIPHER scales.
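A minimal sketch of the caching idea, keyed on model plus prompt. The key scheme and the absence of eviction are simplifications for illustration, not MuXD's actual cache:

```python
# Toy response cache keyed on (model, prompt); a sketch of the idea only.
# MuXD's real cache handles eviction, TTLs, and prompt normalization.
import hashlib

class ResponseCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model: str, prompt: str) -> str:
        # Hash model and prompt together so identical prompts to
        # different models never collide.
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None  # caller must run inference and put() the result

    def put(self, model: str, prompt: str, response: str):
        self._store[self._key(model, prompt)] = response

cache = ResponseCache()
assert cache.get("llama3-8b", "explain sub_401000") is None  # miss: run inference
cache.put("llama3-8b", "explain sub_401000", "This function initializes ...")
assert cache.get("llama3-8b", "explain sub_401000") is not None  # hit: free to serve
```

The economic point is in the last two lines: a hit served from a local model's cached output costs nothing, whereas the same hit against a cloud backend still reflects a paid generation somewhere upstream.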

Technical Details & Configuration

Here’s a snippet of our MuXD configuration file (YAML) demonstrating the routing logic. This is a simplified example, but illustrates the core principle:


```yaml
routes:
  - name: "CIPHER_ANALYSIS"
    domain: "REOps"
    priority: 1
    local_first: true # Try Ollama first
    ollama_model: "llama3-8b"
    cloud_model: "claude-3-opus-20240229"
    max_tokens: 1024
    fallback: "cloud" # If Ollama fails, fall back to Claude
  - name: "BIZ_REPORT_GENERATION"
    domain: "BizOps"
    priority: 2
    local_first: false # Always use Claude for BizOps reports
    cloud_model: "claude-3-haiku-20240307"
    max_tokens: 2048
    fallback: "cloud"
```

The `local_first: true` directive is critical. It instructs MuXD to attempt inference with the specified `ollama_model` before routing the request to the cloud. We also implement a robust error handling mechanism. If the Ollama model fails to respond within a defined timeout, MuXD automatically retries with the cloud model. This ensures reliability without sacrificing performance when local inference is available.
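The timeout-then-fallback pattern that paragraph describes can be sketched in a few lines. The function names (`call_local`, `call_cloud`) and the thread-pool timeout mechanism are our illustration, not MuXD's actual client code:

```python
# Sketch of local-first inference with a timeout fallback to the cloud.
# The helpers passed in are placeholders for real Ollama/Claude clients.
import concurrent.futures
import time

def infer_with_fallback(prompt, call_local, call_cloud, timeout_s=5.0):
    """Try the local backend first; on error or a missed deadline,
    retry against the cloud model (mirrors `fallback: "cloud"`)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(call_local, prompt).result(timeout=timeout_s)
    except Exception:
        # Covers both timeouts and local inference failures.
        return call_cloud(prompt)
    finally:
        pool.shutdown(wait=False)

def fast_local(prompt):
    return "local-answer"

def slow_local(prompt):
    time.sleep(0.5)  # simulate a hung local model
    return "local-answer"

print(infer_with_fallback("explain this", fast_local, lambda p: "cloud-answer"))
# -> local-answer
print(infer_with_fallback("explain this", slow_local, lambda p: "cloud-answer",
                          timeout_s=0.05))
# -> cloud-answer
```

One design note: the deadline matters more than the retry. A local model that answers in 800 ms is a win; one that hangs for 30 seconds would erase the latency advantage entirely, so the timeout should sit just above the local model's observed p99 latency.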

We’re leveraging the Tailscale network to ensure secure communication between agents and MuXD. Ollama is exposed on port 11434 within the Tailscale network, allowing agents to access it directly via HTTPS. This eliminates the need for public IP addresses and simplifies network configuration.
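From an agent's perspective, calling that shared instance is a plain HTTP POST to Ollama's `/api/generate` endpoint. The hostname `muxd-host` below is a placeholder for the real tailnet name; port 11434 is Ollama's default:

```python
# Sketch of an agent calling the shared Ollama instance over the tailnet.
# "muxd-host" is a placeholder hostname, not our real tailnet name.
import json
import urllib.request

OLLAMA_PORT = 11434  # Ollama's default API port

def build_generate_request(prompt, model="llama3-8b", host="muxd-host"):
    """Build the URL and JSON body for Ollama's /api/generate endpoint.
    `stream: False` requests one complete JSON reply instead of chunks."""
    url = f"https://{host}:{OLLAMA_PORT}/api/generate"
    body = {"model": model, "prompt": prompt, "stream": False}
    return url, json.dumps(body).encode()

def ollama_generate(prompt, **kwargs):
    url, body = build_generate_request(prompt, **kwargs)
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Ollama returns the full completion under the "response" key.
        return json.loads(resp.read())["response"]
```

Because every agent is already on the tailnet, no API keys or public endpoints are involved; Tailscale's node identity is the access control.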

Beyond Performance: Governance & Risk

The benefits extend beyond pure performance and cost savings. We’re heavily invested in AI governance, anchored by the COMET framework – a 7-step human↔AI delegation framework drawing on IEEE and NIST standards. Running inference locally allows us to maintain greater control over the data and models used by our agents. This is particularly important for sensitive data within the CoreOps and BizOps domains. While Claude provides strong data privacy guarantees, local inference eliminates the data transfer entirely, reducing our overall risk exposure. This aligns with our NIST 800-30 grounded risk evaluation engine and strengthens our commitment to responsible AI development.

Furthermore, we've integrated ecosystem-wide SHA-256 provenance signing. Every output generated by an agent – whether from a local Ollama model or a cloud API – is digitally signed, establishing a clear audit trail. This allows us to track the origin and integrity of information, ensuring accountability and preventing unauthorized modification. The ability to verify the provenance of locally generated content is a key advantage.
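A minimal sketch of what such a provenance record can look like, using a SHA-256 content digest authenticated with an HMAC. The field names and the shared-key scheme are our assumptions for illustration; ARKONA's actual pipeline may use asymmetric signatures and different metadata:

```python
# Illustrative provenance record: SHA-256 digest of the output plus
# metadata, authenticated with HMAC-SHA256. A sketch, not ARKONA's
# real signing pipeline (which may use asymmetric keys).
import hashlib
import hmac
import json

def sign_output(output: str, agent: str, model: str, key: bytes) -> dict:
    record = {
        "agent": agent,
        "model": model,  # local Ollama model or cloud model name
        "sha256": hashlib.sha256(output.encode()).hexdigest(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record

def verify_output(output: str, record: dict, key: bytes) -> bool:
    body = {k: v for k, v in record.items() if k != "signature"}
    if body["sha256"] != hashlib.sha256(output.encode()).hexdigest():
        return False  # content was modified after signing
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["signature"])

key = b"shared-secret"  # placeholder; real deployments need key management
rec = sign_output("analysis of sub_401000 ...", "cipher-01", "llama3-8b", key)
assert verify_output("analysis of sub_401000 ...", rec, key)
assert not verify_output("tampered analysis", rec, key)
```

Embedding the model name in the signed metadata is what makes the audit trail useful: a verifier can tell not just that an output is intact, but whether it originated from a local model or a cloud API call.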

Lessons Learned

The experience has reinforced a crucial principle: there's no one-size-fits-all solution. Cloud APIs remain invaluable for access to cutting-edge models and specialized knowledge. However, for latency-sensitive tasks, high-volume workloads, and scenarios demanding greater data control, local inference – powered by hardware like our Tesla P40s and optimized with tools like Ollama – offers a compelling alternative.

The key takeaway? Don't blindly offload everything to the cloud. Invest in the infrastructure to bring inference closer to the data, and intelligently route workloads based on a clear understanding of your application's requirements. We’ve proven that a hybrid approach, with MuXD acting as the intelligent orchestrator, provides the optimal balance of performance, cost, security, and governance for ARKONA.