What is ARGUS?
ARGUS is a forensic observability layerfor AI agent pipelines. It wraps your LangGraph (or any Python-based) workflow and watches every node execution, state transition, and tool call — then runs a multi-layered detection system to catch the failures that don't throw exceptions.
Think of it as a flight recorder for your AI pipeline. When something goes wrong — and in agent systems, it's almost always silent — ARGUS gives you the trace, the root cause, and the replay ability to fix it.
The Problem
AI agent pipelines fail differently from traditional software. They don't crash — they degrade. A retrieval step returns irrelevant documents. A planning node generates a reasonable-looking but wrong plan. A tool call succeeds with bad parameters. The pipeline finishes, returns a result, and nobody knows it's garbage until a human reads it.
These are silent failures — the pipeline technically succeeds while the output quality collapses. Standard monitoring (latency, error rates, uptime) is blind to them. You need something that understands what your pipeline is supposed to do and can tell when it stops doing it.
How ARGUS Works
ARGUS wraps your pipeline with a single call and instruments every execution step automatically. No manual tracing. No decorators on every function. One wrapper, full visibility.
from argus import ArgusWatcher
watcher = ArgusWatcher(
max_field_size=50_000, # max chars per captured state field
strict=False, # True = raise on detection (useful for CI)
investigate=True, # run root cause analysis on failures
redact_keys=["api_key"], # scrub sensitive fields from traces
persist_state=True, # save state at each step for replay
record_http=False, # record HTTP calls for mocked replay
semantic_judge=False, # enable LLM-as-judge evaluation
judge_model="gpt-4o", # model for semantic judging
)
watcher.watch(graph) # instrument your LangGraph
app = graph.compile()
result = app.invoke(state)
watcher.finalize() # run detectors, generate traceAfter finalize(), ARGUS has captured every node's input/output, timed each step, and run four layers of detection against the trace. If something went wrong — even something subtle — you'll know about it.

Key Capabilities
- ‣Silent failure detection— catches semantic degradation, hallucinated outputs, and logic errors that don't raise exceptions
- ‣Root cause analysis — traces failures back through the execution graph to the node that caused the problem
- ‣Execution replay — re-run any trace from any step with modified inputs to test fixes
- ‣Four detection layers — statistical, semantic, behavioral, and structural analysis working together
- ‣Zero-config instrumentation — one wrapper call, no decorators, no manual span creation
Who Is It For?
ARGUS is built for engineers shipping AI agent pipelines to production. If you're building with LangGraph, LangChain, or any Python-based agent framework and you need to know when your pipeline is silently producing bad output — ARGUS is for you.
Beta
Ready to try it? Jump to the Quickstart.
