ARGUS — ArgusLabs

Detection Layers

The four detection layers that catch silent failures in your AI pipelines.

Overview

ARGUS runs four detection layers against every trace. Each layer looks for a different category of failure. Together, they cover the full spectrum of things that can go wrong in an AI pipeline — from numerical anomalies to meaning drift.

Figure: The four detection layers (Statistical, Semantic, Behavioral, Structural) run in parallel after each pipeline execution.

All layers are enabled by default. You can configure sensitivity thresholds or disable individual layers in your argus.yaml.

Statistical Detection

Catches numerical anomalies in execution metrics. This layer doesn't understand what your pipeline does; it just knows when something looks quantitatively different from normal.

What it detects:

  • Execution time spikes or drops (a node that usually takes 2s now takes 30s)
  • Output length anomalies (a response that's 10x shorter than average)
  • Token count deviations beyond the Z-score threshold
  • Retry count anomalies and error rate changes
argus.yaml:
detection:
  statistical:
    enabled: true
    z_threshold: 2.5      # standard deviations from mean
    min_samples: 5        # minimum runs before baselining
    metrics:
      - execution_time
      - output_length
      - token_count

Baselining

Statistical detection needs history to establish a baseline. Until min_samples runs have been recorded (5 in the example above), ARGUS is still building its model of "normal" for your pipeline and won't raise statistical detections.
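To make the Z-score and baselining behavior concrete, here is a minimal Python sketch. It is not ARGUS's internal implementation: the function name is_anomalous and the sample data are assumptions for illustration, and only the z_threshold and min_samples parameters mirror the config above.

```python
from statistics import mean, stdev

def is_anomalous(history, value, z_threshold=2.5, min_samples=5):
    """Return True if `value` is more than z_threshold standard
    deviations away from the mean of `history` (hypothetical helper)."""
    # Not enough history yet: still baselining, so never flag.
    # stdev() also needs at least two data points.
    if len(history) < max(min_samples, 2):
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past runs were identical; any change counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > z_threshold

# Execution times (seconds) for a node that usually takes about 2s.
execution_times = [2.1, 1.9, 2.0, 2.2, 1.8]

print(is_anomalous(execution_times, 30.0))      # True: far outside the baseline
print(is_anomalous(execution_times, 2.05))      # False: within normal variation
print(is_anomalous(execution_times[:3], 30.0))  # False: fewer than min_samples runs
```

A lower z_threshold flags smaller deviations at the cost of more false positives; a higher min_samples delays detection but gives a more stable baseline.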

Semantic Detection

Catches meaning drift and quality degradation. This is the layer that understands what your pipeline is supposed to produce and can tell when the output is technically valid but semantically wrong.

What it detects:

  • Relevance loss — retrieval returns documents that don't match the query
  • Hallucination patterns — output contains claims not supported by the input context
  • Topic drift — the response wanders away from the original intent
  • Contradiction — output contradicts information from earlier in the pipeline
argus.yaml:
detection:
  semantic:
    enabled: true
    similarity_threshold: 0.7    # cosine similarity floor
    judge: false                 # enable LLM-as-judge
    judge_model: "gpt-4o"       # model for semantic eval

There are two modes: embedding similarity (fast, cheap, always-on) and LLM-as-judge (slower, costs API calls, much more accurate). Use embeddings for production monitoring and LLM-as-judge for staging/CI evaluation.
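The embedding mode can be pictured as a cosine similarity check against the similarity_threshold floor. The sketch below is an illustration under assumptions: the function names, the toy three-dimensional vectors, and the choice to compare a query embedding with an output embedding are all hypothetical. This page does not specify which texts ARGUS embeds or which embedding model it uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

def below_similarity_floor(query_vec, output_vec, similarity_threshold=0.7):
    """Hypothetical check: flag the output when it falls below the floor."""
    return cosine_similarity(query_vec, output_vec) < similarity_threshold

# Toy embeddings; real embeddings have hundreds or thousands of dimensions.
query = [1.0, 0.0, 1.0]
on_topic = [0.9, 0.1, 0.8]
off_topic = [0.0, 1.0, 0.1]

print(below_similarity_floor(query, on_topic))   # False: similarity is well above 0.7
print(below_similarity_floor(query, off_topic))  # True: similarity is close to 0
```

A check like this is cheap enough to run on every trace, which is why it suits production monitoring, but it only measures closeness in embedding space. It cannot tell whether a claim is actually supported by the context; that is the job of the judge mode.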

Behavioral Detection

Catches unexpected execution patterns — the shape of how your pipeline runs, not what it produces.

  • Infinite loops — a node re-executing beyond the configured threshold
  • Skipped steps — expected nodes that never executed
  • Unexpected transitions — edges that shouldn't fire but did
  • State corruption — state fields modified by nodes that shouldn't touch them
argus.yaml:
detection:
  behavioral:
    enabled: true
    max_loop_count: 10      # max times a node can re-execute
    detect_skipped: true    # flag nodes that should have run
    detect_mutations: true  # flag unexpected state changes
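As a rough illustration of the loop and skipped-step checks, the sketch below inspects a list of executed node names. The trace format, the node names, and the check_trace function are assumptions for illustration, not ARGUS's trace model. It also assumes that max_loop_count limits re-executions, that is, runs after the first one.

```python
from collections import Counter

def check_trace(executed, expected, max_loop_count=10):
    """Hypothetical behavioral check over an ordered list of node names."""
    counts = Counter(executed)
    # A node that ran c times has re-executed c - 1 times.
    looping = sorted(n for n, c in counts.items() if c - 1 > max_loop_count)
    # Expected nodes that never appear in the trace were skipped.
    skipped = [n for n in expected if n not in counts]
    return {"looping": looping, "skipped": skipped}

# "rewrite_query" ran 12 times (11 re-executions); "rerank" never ran.
trace = ["retrieve"] + ["rewrite_query"] * 12 + ["generate"]
report = check_trace(trace, expected=["retrieve", "rerank", "generate"])

print(report)  # {'looping': ['rewrite_query'], 'skipped': ['rerank']}
```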

Structural Detection

Catches contract violations and schema breaks. This layer validates the data flowing through your pipeline against expected shapes and types.

  • Missing required fields — a node output lacks expected keys
  • Type mismatches — a field that should be a list is a string
  • Empty results — a node returns an empty response when content is expected
  • Schema drift — output shape changed from what the next node expects
argus.yaml:
detection:
  structural:
    enabled: true
    check_required: true    # validate required fields
    check_types: true       # validate field types
    check_empty: true       # flag empty outputs
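The three checks in this config can be sketched in a few lines of Python. This is an illustration only: the schema format (field name mapped to expected type), the field names, and the check_output function are hypothetical, and this page does not describe how ARGUS learns or declares the expected shape.

```python
def check_output(output, schema):
    """Hypothetical structural check.

    `schema` maps each required field name to its expected Python type.
    Returns a list of human-readable violations (empty if none).
    """
    problems = []
    for field, expected_type in schema.items():
        if field not in output:
            problems.append(f"missing required field: {field}")
            continue
        value = output[field]
        if not isinstance(value, expected_type):
            problems.append(
                f"type mismatch: {field} is {type(value).__name__}, "
                f"expected {expected_type.__name__}"
            )
        elif isinstance(value, (str, list, dict)) and len(value) == 0:
            problems.append(f"empty result: {field}")
    return problems

schema = {"documents": list, "answer": str, "sources": list}
output = {"documents": "doc-1, doc-2", "answer": ""}

for problem in check_output(output, schema):
    print(problem)
# type mismatch: documents is str, expected list
# empty result: answer
# missing required field: sources
```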

Custom validators

For domain-specific structural checks, use the validators parameter on ArgusWatcher. This lets you define custom validation functions per field.
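The exact shape of the validators parameter is not shown on this page, so the sketch below does not call ArgusWatcher at all. It only illustrates the idea of per-field validation functions, assuming a mapping from field name to a function that returns True when the value is acceptable. The field names and the run_validators helper are hypothetical.

```python
def run_validators(output, validators):
    """Hypothetical runner: return the names of fields whose validator fails."""
    failures = []
    for field, validate in validators.items():
        # Only validate fields that are present; missing fields are
        # the concern of the required-field check, not of this sketch.
        if field in output and not validate(output[field]):
            failures.append(field)
    return failures

# Assumed shape: field name -> function returning True when the value is valid.
validators = {
    "confidence": lambda v: isinstance(v, float) and 0.0 <= v <= 1.0,
    "sources": lambda v: all(s.startswith("https://") for s in v),
}

output = {"confidence": 1.4, "sources": ["https://example.com/a"]}
print(run_validators(output, validators))  # ['confidence']
```

Keeping each validator a small function of a single field value makes domain rules easy to test in isolation before wiring them into a watcher.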

Adaptive Learning

The detection layers aren't static. When the semantic judge (LLM) discovers a new failure pattern, it proposes a candidate signature. After human approval, the pattern is added to the heuristic engine — so future runs catch it without needing an LLM call.

Patterns can be approved as Private (local only) or Shared (synced to all ARGUS users via cloud). See the Adaptive Learning page for details.