ARGUS — ArgusLabs

Introduction

What is ARGUS, why it exists, and how it gives you forensic observability for AI agent pipelines.

What is ARGUS?

ARGUS is a forensic observability layer for AI agent pipelines. It wraps your LangGraph (or any Python-based) workflow and watches every node execution, state transition, and tool call, then runs a multi-layered detection system to catch the failures that don't throw exceptions.

Think of it as a flight recorder for your AI pipeline. When something goes wrong — and in agent systems, it's almost always silent — ARGUS gives you the trace, the root cause, and the replay ability to fix it.

The Problem

AI agent pipelines fail differently from traditional software. They don't crash — they degrade. A retrieval step returns irrelevant documents. A planning node generates a reasonable-looking but wrong plan. A tool call succeeds with bad parameters. The pipeline finishes, returns a result, and nobody knows it's garbage until a human reads it.

These are silent failures — the pipeline technically succeeds while the output quality collapses. Standard monitoring (latency, error rates, uptime) is blind to them. You need something that understands what your pipeline is supposed to do and can tell when it stops doing it.
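To make the idea concrete, here is a small, self-contained sketch of a silent failure. It does not use ARGUS; the `retrieve` and `keyword_overlap` functions are hypothetical, and the keyword check is only a crude stand-in for real quality detection.

```python
def retrieve(query: str, corpus: list[str]) -> list[str]:
    """Toy retriever with a bug: it ignores the query and returns the first two documents."""
    return corpus[:2]


def keyword_overlap(query: str, docs: list[str]) -> float:
    """Fraction of query words that appear in at least one retrieved document."""
    query_words = set(query.lower().split())
    doc_words = set(" ".join(docs).lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


corpus = [
    "Quarterly revenue grew in the retail segment",
    "Office relocation schedule for the spring",
    "How to rotate database credentials safely",
]
query = "rotate database credentials"

docs = retrieve(query, corpus)           # succeeds: no exception, two documents returned
overlap = keyword_overlap(query, docs)   # but none of the query words appear in them

print(f"retrieved {len(docs)} docs, keyword overlap = {overlap:.2f}")
```

Latency and error-rate dashboards would report this run as healthy. Only a check that knows what the step is supposed to return can tell that the retrieved documents are irrelevant.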

How ARGUS Works

ARGUS wraps your pipeline with a single call and instruments every execution step automatically. No manual tracing. No decorators on every function. One wrapper, full visibility.

```python
from argus import ArgusWatcher

watcher = ArgusWatcher(
    max_field_size=50_000,       # max chars per captured state field
    strict=False,                # True = raise on detection (useful for CI)
    investigate=True,            # run root cause analysis on failures
    redact_keys=["api_key"],     # scrub sensitive fields from traces
    persist_state=True,          # save state at each step for replay
    record_http=False,           # record HTTP calls for mocked replay
    semantic_judge=False,        # enable LLM-as-judge evaluation
    judge_model="gpt-4o",        # model for semantic judging
)

# `graph` and `state` are assumed to be defined elsewhere:
# your LangGraph graph and the input state you invoke it with.
watcher.watch(graph)             # instrument your LangGraph
app = graph.compile()
result = app.invoke(state)
watcher.finalize()               # run detectors, generate trace
```

After finalize(), ARGUS has captured every node's input/output, timed each step, and run four layers of detection against the trace. If something went wrong — even something subtle — you'll know about it.
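For intuition about what "instrumenting every step" involves, the sketch below wraps a node function so that each call records a redacted, size-limited snapshot of its input and output along with its duration. This is not ARGUS's implementation: `TraceRecorder`, `wrap`, and the `"[REDACTED]"` placeholder are hypothetical names, and only the `max_field_size` and `redact_keys` options mirror the constructor arguments shown above.

```python
import copy
import time
from typing import Any, Callable

State = dict[str, Any]


class TraceRecorder:
    """Hypothetical stand-in that records one trace entry per node call."""

    def __init__(self, max_field_size: int = 50_000, redact_keys: tuple[str, ...] = ()):
        self.max_field_size = max_field_size
        self.redact_keys = set(redact_keys)
        self.steps: list[dict[str, Any]] = []

    def _snapshot(self, state: State) -> State:
        """Copy the state, scrubbing redacted keys and truncating long strings."""
        snap: State = {}
        for key, value in state.items():
            if key in self.redact_keys:
                snap[key] = "[REDACTED]"
            elif isinstance(value, str) and len(value) > self.max_field_size:
                snap[key] = value[: self.max_field_size]
            else:
                snap[key] = copy.deepcopy(value)
        return snap

    def wrap(self, name: str, node: Callable[[State], State]) -> Callable[[State], State]:
        """Return a function that behaves like `node` but also appends to the trace."""
        def wrapped(state: State) -> State:
            captured_input = self._snapshot(state)  # taken first, in case the node mutates state
            started = time.perf_counter()
            output = node(state)
            elapsed = time.perf_counter() - started
            self.steps.append({
                "node": name,
                "input": captured_input,
                "output": self._snapshot(output),
                "seconds": elapsed,
            })
            return output
        return wrapped


def plan(state: State) -> State:
    """Toy node that adds a long plan string to the state."""
    return {**state, "plan": "x" * 100}


recorder = TraceRecorder(max_field_size=10, redact_keys=("api_key",))
traced_plan = recorder.wrap("plan", plan)
result = traced_plan({"question": "hi", "api_key": "sk-secret"})
step = recorder.steps[0]
```

The node still receives and returns the real state, so the pipeline behaves the same; only the recorded copy is redacted and truncated.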

Figure: ARGUS wraps your pipeline, runs multi-layer detection on the trace, and produces forensic output.

Key Capabilities

  • Silent failure detection — catches semantic degradation, hallucinated outputs, and logic errors that don't raise exceptions
  • Root cause analysis — traces failures back through the execution graph to the node that caused the problem
  • Execution replay — re-run any trace from any step with modified inputs to test fixes
  • Four detection layers — statistical, semantic, behavioral, and structural analysis working together
  • Zero-config instrumentation — one wrapper call, no decorators, no manual span creation
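Root cause analysis and execution replay can be pictured with a toy pipeline. The sketch below is a simplified illustration under stated assumptions, not the ARGUS API: `run_from`, `first_failing_step`, the node functions, and the per-node checks are all hypothetical. It keeps the state after every step, finds the earliest node whose output fails its check, then re-runs the pipeline from a later step with a patched state.

```python
from typing import Any, Callable, Optional

State = dict[str, Any]
Node = tuple[str, Callable[[State], State]]


def run_from(nodes: list[Node], state: State, start: int = 0) -> list[State]:
    """Run nodes[start:]; return the starting state followed by the state after each node."""
    history = [dict(state)]  # shallow copies are enough for this toy example
    for _, node in nodes[start:]:
        state = node(dict(state))
        history.append(dict(state))
    return history


def first_failing_step(
    nodes: list[Node],
    history: list[State],
    checks: dict[str, Callable[[State], bool]],
) -> Optional[str]:
    """Name of the earliest node whose output fails its check, or None if all pass."""
    for index, (name, _) in enumerate(nodes):
        check = checks.get(name)
        if check is not None and not check(history[index + 1]):
            return name
    return None


def retrieve(state: State) -> State:
    return {**state, "docs": []}  # silently finds nothing, raises nothing


def answer(state: State) -> State:
    docs = state["docs"]
    return {**state, "answer": docs[0] if docs else "I don't know"}


nodes: list[Node] = [("retrieve", retrieve), ("answer", answer)]
checks = {
    "retrieve": lambda s: len(s["docs"]) > 0,
    "answer": lambda s: s["answer"] != "I don't know",
}

history = run_from(nodes, {"question": "What is the refund window?"})
culprit = first_failing_step(nodes, history, checks)

# Replay from the "answer" step, using the saved state with the retrieval output patched.
patched = {**history[1], "docs": ["Refunds are accepted within 30 days."]}
replayed = run_from(nodes, patched, start=1)

print(culprit, "->", replayed[-1]["answer"])
```

Both checks fail on the original run, but reporting the earliest failing node points at retrieval rather than at the answer step that merely inherited the bad input. The replay then confirms that the answer step behaves correctly once retrieval is fixed.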

Who Is It For?

ARGUS is built for engineers shipping AI agent pipelines to production. If you're building with LangGraph, LangChain, or any Python-based agent framework and you need to know when your pipeline is silently producing bad output — ARGUS is for you.

Beta

ARGUS is currently in beta. The core API is stable, but some detection layers and CLI commands are still being refined. Join the Discord for early access and to shape the roadmap.

Ready to try it? Jump to the Quickstart.