Building COMET: A 7-Step Framework for Human-AI Task Ownership

At ARKONA, we’ve spent the last year building an autonomous multi-agent AI ecosystem. It’s currently managing 47 services across 23 Tailscale HTTPS ports, from the deep hardware reverse engineering in CIPHER (running on port 8001) to the editorial pipeline for our newsroom. A core challenge wasn’t just *building* the AI, but deciding *what* it should do – and crucially, what tasks still require human oversight. That’s led us to develop COMET, a 7-step framework for intelligently delegating tasks between humans and AI agents.

The Problem: Beyond Automation

Simple automation is no longer sufficient. We’re moving beyond automating *how* we do things to augmenting *what* we do. The first instinct is to throw AI at everything, but that’s a recipe for disaster. Indiscriminate delegation leads to unchecked errors, hallucinations, and, ultimately, a loss of trust. We needed a deliberate, repeatable process. Existing frameworks like MITRE’s ATT&CK, while valuable for threat modeling, don’t address the granular task delegation needed within a complex, autonomous system. Instead, we drew inspiration from distributed cognition and human-machine teaming principles, adapting them to our specific context.

Introducing COMET: The 7-Step Framework

COMET – standing for Collaborative Ownership, Mitigation, Evaluation, and Transparency – provides a structured approach. It’s grounded in NIST SP 800-30 guidance on risk assessment, and informed by IEEE standards on responsible AI. The framework isn’t about eliminating human roles, but about optimizing them for creativity, complex judgment, and ethical considerations.

Step 1: Task Decomposition & Dependency Mapping

Every large initiative at ARKONA is broken down into discrete tasks. Crucially, we map the dependencies between them. For example, the newsroom pipeline has tasks like ‘source identification’, ‘article drafting’, ‘fact-checking’, and ‘publication’. We use a directed acyclic graph (DAG) to visualize these dependencies, identifying critical paths and potential bottlenecks. This isn’t just a project management exercise; it’s foundational for understanding where AI can be safely deployed.
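As a minimal sketch, a dependency map like the newsroom pipeline’s can be expressed as a DAG directly in Python’s standard library (the task names below are illustrative, not our actual task IDs):

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
pipeline = {
    "source_identification": set(),
    "article_drafting": {"source_identification"},
    "fact_checking": {"article_drafting"},
    "publication": {"fact_checking"},
}

# static_order() raises CycleError if the graph is not a DAG,
# which doubles as a sanity check on the dependency map.
order = list(TopologicalSorter(pipeline).static_order())
print(order)
```

The topological order gives a valid execution sequence; walking it also exposes the critical path for a linear pipeline like this one.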

Step 2: Risk & Impact Assessment (NIST SP 800-30)

We apply a simplified version of the NIST SP 800-30 risk assessment methodology to each task. This involves evaluating the *likelihood* of failure and the *impact* of that failure. A task with high impact (e.g., a miscalculation in CIPHER impacting hardware analysis) and even moderate likelihood gets a higher risk score. Our internal risk scale ranges from 1 (negligible) to 5 (critical). We’ve integrated this scoring into our inter-agent communication broker, enabling agents to request human review for high-risk tasks automatically.
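One hedged way to fold likelihood and impact into the 1–5 scale is a rounded geometric mean; NIST SP 800-30 describes qualitative likelihood-times-impact matrices, and the exact mapping below is an illustrative assumption rather than our production scoring:

```python
def risk_score(likelihood, impact):
    """Map likelihood (1-5) and impact (1-5) onto a 1-5 risk scale.

    Illustrative sketch: a rounded geometric mean, so a single
    high axis cannot dominate a negligible one.
    """
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be on a 1-5 scale")
    return max(1, min(5, round((likelihood * impact) ** 0.5)))

# e.g. a moderate-likelihood, critical-impact CIPHER miscalculation:
print(risk_score(3, 5))
```

Scores of 4 or 5 are exactly the ones the broker routes to human review in Step 5.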

Step 3: AI Capability Evaluation

We assess whether our current AI capabilities (specifically, the models deployed through MuXD – our hybrid LLM router) can reliably perform the task. MuXD intelligently routes requests to either local Ollama models (for speed and privacy, like our Llama3-based REOps agent on port 8005) or cloud-based Claude (for complex reasoning). We consider factors such as historical error rates, human override frequency, and task completion time – the same metrics our monitoring agents track in Step 6.

We combine these metrics into a “confidence score,” assigning a value between 0 and 1.
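A minimal sketch of that distillation, assuming rate-style inputs in [0, 1]; the weights are illustrative assumptions, not ARKONA’s production values:

```python
def confidence_score(error_rate, override_rate, on_time_rate):
    """Distill Step 3 metrics into a single 0-1 confidence value.

    Inputs are rates in [0, 1]; weights are illustrative assumptions.
    """
    score = (0.5 * (1 - error_rate)
             + 0.3 * (1 - override_rate)
             + 0.2 * on_time_rate)
    return max(0.0, min(1.0, score))

# e.g. a drafting agent with 5% errors, 10% overrides, 95% on-time:
print(confidence_score(0.05, 0.10, 0.95))
```

Whatever the weighting, the useful property is monotonicity: more errors or more overrides can only lower the score that feeds the Step 5 matrix.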

Step 4: Human Skillset Matching

For tasks where AI capability is insufficient, we identify the human skillset required. This isn't just about technical expertise; it’s about judgment, creativity, and ethical considerations. For instance, our fact-checking agent (part of the newsroom pipeline) can flag potentially misleading information, but a human editor (currently on port 8012) makes the final decision, leveraging nuanced understanding of context.

Step 5: Delegation Decision Matrix

This is where the rubber meets the road. We use a decision matrix combining the risk score (from Step 2) and the AI confidence score (from Step 3). A simplified example:


def determine_ownership(risk_score, ai_confidence):
    """Map a task's risk score (1-5) and AI confidence (0-1) to an owner."""
    if risk_score >= 5 or (risk_score >= 4 and ai_confidence < 0.8):
        return "Human"  # critical risk, or high risk without high confidence
    elif ai_confidence >= 0.9:
        return "AI"
    else:
        return "Human-in-the-Loop"  # AI performs task, human reviews
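A few sample calls illustrate how the matrix behaves at its edges (the function is repeated so the snippet runs on its own; the inputs are illustrative):

```python
def determine_ownership(risk_score, ai_confidence):
    # Same decision matrix as in the article, repeated for a runnable snippet.
    if risk_score >= 5 or (risk_score >= 4 and ai_confidence < 0.8):
        return "Human"
    elif ai_confidence >= 0.9:
        return "AI"
    else:
        return "Human-in-the-Loop"

print(determine_ownership(5, 0.95))  # critical risk stays with a human
print(determine_ownership(2, 0.95))  # low risk, high confidence
print(determine_ownership(3, 0.50))  # middle ground
```

Note that a critical risk score overrides even very high AI confidence: no confidence value can move a level-5 task out of human ownership.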

The “Human-in-the-Loop” designation is critical. It signifies tasks where AI handles the initial processing, but a human provides oversight and validation. This is common in REOps, where CIPHER generates Ghidra project files (on port 8004), which are then reviewed by a human reverse engineer before being incorporated into our knowledge base.

Step 6: Monitoring & Feedback Loop

Delegation isn’t a one-time decision. We continuously monitor the performance of both AI and human agents. We track metrics like task completion time, error rates, and human override frequency. This data feeds back into the AI capability evaluation (Step 3), refining the confidence scores and informing future delegation decisions. Our battle rhythm of 26 autonomous agents includes dedicated “monitoring” agents that specifically track these KPIs.
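One simple, hedged way to close that loop is an exponential moving average over task outcomes: each completed task nudges the confidence score toward 1.0 (output accepted) or 0.0 (human override). The smoothing factor below is an assumption, not a production value:

```python
def update_confidence(confidence, human_overrode, alpha=0.1):
    """EMA update of a task's AI confidence from one completed task.

    alpha controls how fast recent outcomes outweigh history
    (illustrative value, not ARKONA's actual tuning).
    """
    outcome = 0.0 if human_overrode else 1.0
    return (1 - alpha) * confidence + alpha * outcome

# Four tasks: two accepted, one overridden, one accepted.
c = 0.9
for overrode in [False, False, True, False]:
    c = update_confidence(c, overrode)
print(round(c, 3))
```

Because the updated score feeds back into Step 3, a run of overrides automatically pushes a task back toward “Human-in-the-Loop” in the Step 5 matrix.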

Step 7: Transparency & Provenance (SHA-256 Signing)

Regardless of who performs a task, we maintain a complete audit trail. Every action is logged and its SHA-256 digest cryptographically signed, ensuring provenance and accountability. This is particularly important in CoreOps, where even minor modifications to cyber-physical systems can have significant consequences. We use this provenance data for incident response and to demonstrate compliance with AI governance principles – a core focus of our COMET domain.
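A minimal sketch of such a log, using HMAC-SHA256 over a hash chain so that tampering with any entry invalidates every later digest. The key, field names, and genesis value are illustrative assumptions, not our actual log format:

```python
import hashlib
import hmac
import json

SECRET = b"demo-key"  # assumption: each agent holds its own signing key

def sign_entry(prev_digest, action):
    """Chain an audit entry: HMAC-SHA256 over the previous digest plus
    the canonicalized action, so edits anywhere break the chain."""
    payload = prev_digest + json.dumps(action, sort_keys=True).encode()
    return hmac.new(SECRET, payload, hashlib.sha256).hexdigest()

d0 = "0" * 64  # illustrative genesis digest
d1 = sign_entry(d0.encode(), {"agent": "fact-checker", "task": "review"})
d2 = sign_entry(d1.encode(), {"agent": "editor", "task": "approve"})
print(len(d1), d1 != d2)
```

Verification replays the chain from the genesis digest; any mismatch pinpoints the first tampered entry, which is exactly what incident response needs.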

Lessons Learned & Future Directions

Building COMET has been a challenging but incredibly rewarding experience. The biggest lesson? Don't overcomplicate things initially. Start with a simplified framework and iterate based on real-world data. We initially tried to automate too much, leading to increased errors and decreased trust. Focusing on *appropriate* delegation, informed by rigorous risk assessment and continuous monitoring, has been key.

Looking ahead, we’re exploring techniques for dynamic delegation – adjusting task ownership in real-time based on changing conditions. We’re also investigating the use of reinforcement learning to train the delegation decision matrix itself, optimizing for both efficiency and safety. The goal isn’t to eliminate humans, but to empower them to focus on the things they do best, augmenting their abilities with the power of AI.