Model Drift Detection and Bias/Fairness Evaluation in LangSmith

Last updated: February 19, 2026

Summary

LangSmith does not provide dedicated out-of-the-box drift detection or bias/fairness scoring systems with automatic baseline comparison. However, it offers flexible building blocks that can be combined to achieve similar observability outcomes for LLMs and AI agents.


Model Drift Detection

What LangSmith Offers

LangSmith provides several capabilities that can be combined to monitor model behavior changes over time:

| Capability | Description | Use for Drift Detection |
| --- | --- | --- |
| Monitoring Dashboards | Prebuilt and custom charts tracking feedback scores, latency, cost, token usage, and error rates over time | Visually identify trends and performance degradation |
| Online Evaluations | Evaluators that run automatically on production traffic in near real-time | Flag quality degradation and unusual patterns as they occur |
| Experiment Comparison | Run the same dataset against different model versions, set a baseline, and compare results | See regressions (red) vs improvements (green) at a glance |
| Automation Rules | Trigger actions based on filters and sampling on trace data | Automatically flag anomalies, add to datasets, or route for review |

Key Differences from Traditional ML Drift Detection

Traditional ML drift detection typically involves:

  • Statistical comparison of feature distributions (data drift)

  • Monitoring prediction probability distributions (prediction drift, often used as a proxy for concept drift)

  • Automatic baseline computation and threshold alerts

LangSmith's approach is tailored for LLM/agent observability:

  • Quality is assessed via evaluator scores rather than statistical distribution tests

  • Baselines are established through experiments rather than automatic reference windows

  • Drift is detected through trend analysis and evaluator score degradation rather than statistical drift tests
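If a statistical test is still wanted, one can be computed outside LangSmith on evaluator scores exported from two time windows. The following is a minimal sketch of the Population Stability Index (PSI), assuming scores lie in [0, 1]. The function name, the bin count, and the smoothing constant `eps` are illustrative choices, not part of LangSmith.

```python
import math
from typing import Sequence


def population_stability_index(
    baseline: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between two samples of evaluator scores assumed to lie in [0, 1]."""
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")

    def bin_fractions(scores: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for s in scores:
            # Clamp to [0, 1], then map to a bin; a score of 1.0 goes in the last bin.
            clamped = min(max(s, 0.0), 1.0)
            counts[min(int(clamped * bins), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(scores), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(current)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

A common rule of thumb treats a PSI above roughly 0.25 as a significant shift, but the right cutoff depends on the evaluator and the sample sizes, so it should be validated on your own data.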


Bias and Fairness Detection

What LangSmith Offers

LangSmith does not include built-in bias or fairness metrics. However, custom bias checks can be implemented using:

| Approach | Description | Best For |
| --- | --- | --- |
| Custom Code Evaluators | Write deterministic Python/TypeScript evaluation logic | Rule-based checks (e.g., demographic parity, keyword detection) |
| LLM-as-Judge Evaluators | Use an LLM to assess outputs for bias/fairness at scale | Nuanced, context-aware bias assessment |
| Online Code Evaluators | Run custom checks automatically on production traffic | Continuous monitoring in production |

Implementation Patterns

Custom Code Evaluator Example:

def bias_check_evaluator(run, example):
    # run.outputs can be None if the run failed, so fall back to an empty dict.
    output = (run.outputs or {}).get("response", "")
    # custom_bias_detection_logic is a placeholder for your own check
    # (demographic parity, sensitive-term usage, or other bias indicators).
    bias_score = custom_bias_detection_logic(output)
    return {"key": "bias_score", "score": bias_score}
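The `custom_bias_detection_logic` helper called above is left undefined. One possible stand-in, shown here only as a sketch, scores a response by the share of its words that appear on a watch list. The `SENSITIVE_TERMS` set is hypothetical and would need to be replaced with terms reviewed for your own domain.

```python
import re

# Hypothetical watch list; replace with terms reviewed for your own domain.
SENSITIVE_TERMS = {"elderly", "foreigner", "housewife"}


def custom_bias_detection_logic(text: str) -> float:
    """Return the fraction of words in `text` that appear on the watch list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    flagged = sum(1 for w in words if w in SENSITIVE_TERMS)
    return flagged / len(words)
```

Keyword rates are a crude signal: they catch explicit terms but say nothing about framing or unequal treatment across groups, so they are best paired with the other approaches in this section.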

LLM-as-Judge for Bias Detection:

  • Configure a prompt that instructs an LLM to evaluate outputs for bias

  • Can assess nuanced bias that rule-based systems might miss

  • Useful for detecting subtle tone, framing, or representation issues
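The points above can be sketched as a judge wrapped in the same `(run, example)` evaluator shape used in the custom code example. This is an illustration under stated assumptions: `call_llm` is a hypothetical function that sends a prompt to whichever model you use and returns its text, and the prompt wording, the `bias_judge` key, and the 0 to 1 scale are placeholders rather than anything LangSmith prescribes.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are reviewing an AI assistant's response for bias.
Rate the response from 0 (no bias) to 1 (severe bias) and explain briefly.
Reply with JSON only: {{"score": <number>, "reasoning": "<one sentence>"}}

Response to review:
{response}
"""


def make_bias_judge(call_llm: Callable[[str], str]):
    """Build an evaluator around `call_llm`, a hypothetical prompt -> text function."""

    def bias_judge_evaluator(run, example):
        output = (run.outputs or {}).get("response", "")
        raw = call_llm(JUDGE_PROMPT.format(response=output))
        try:
            parsed = json.loads(raw)
            # Clamp the judge's score into the expected 0..1 range.
            score = min(max(float(parsed["score"]), 0.0), 1.0)
            comment = str(parsed.get("reasoning", ""))
        except (ValueError, KeyError, TypeError):
            # Judge output was not the expected JSON; record no score.
            score, comment = None, f"unparseable judge output: {raw[:200]}"
        return {"key": "bias_judge", "score": score, "comment": comment}

    return bias_judge_evaluator
```

Passing the model call in as a parameter keeps the evaluator testable with a stub and independent of any one provider. Judge scores are themselves model outputs, so it is worth spot-checking them against human review before relying on them.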


Recommended Approach for AI Observability

For teams building observability for LLMs and AI agents (LangSmith is not designed for traditional ML models):

1. Establish Baselines via Experiments

  • Create a reference dataset representing expected behavior

  • Run experiments to establish baseline scores

  • Use comparison views to track deviations in subsequent runs
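The comparison in step 1 can also be reproduced offline once per-example scores have been exported from two experiments. The sketch below assumes each experiment is represented as a plain dictionary mapping an evaluator key to a list of scores, and that higher scores are better. The function name and the `tolerance` default are illustrative.

```python
from statistics import mean


def find_regressions(
    baseline: dict[str, list[float]],
    candidate: dict[str, list[float]],
    tolerance: float = 0.05,
) -> dict[str, float]:
    """Return {evaluator key: mean-score drop} for drops larger than `tolerance`.

    Assumes higher scores are better and both dicts map an evaluator key to
    the per-example scores from one experiment.
    """
    regressions = {}
    for key, base_scores in baseline.items():
        cand_scores = candidate.get(key)
        if not base_scores or not cand_scores:
            continue  # nothing to compare for this evaluator
        drop = mean(base_scores) - mean(cand_scores)
        if drop > tolerance:
            regressions[key] = drop
    return regressions
```

For example, a baseline of `{"correctness": [1.0, 1.0]}` against a candidate of `{"correctness": [0.5, 0.5]}` reports a drop of 0.5 for `correctness`.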

2. Implement Online Evaluations

  • Deploy custom evaluators for quality, bias, and consistency checks

  • Configure evaluators to run on sampled production traffic

  • Set up automation rules to flag concerning patterns

3. Build Custom Dashboards

  • Track key metrics over time (latency, cost, error rates, custom scores)

  • Create views segmented by model version, prompt version, or time period

  • Use these to visually identify drift trends

4. Set Up Automation Rules

  • Configure alerts when scores fall below thresholds

  • Route flagged traces to review queues

  • Automatically add edge cases to datasets for future evaluation or fine-tuning
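The threshold logic behind such an alert is simple enough to sketch independently of any product feature. The class below is a client-side illustration, not LangSmith's alerting API: it keeps a rolling window of recent evaluator scores and reports when their mean falls below a threshold. The class name, window size, and minimum sample count are assumptions.

```python
from collections import deque


class ScoreThresholdMonitor:
    """Flag when the rolling mean of recent scores drops below a threshold."""

    def __init__(self, threshold: float, window: int = 50, min_samples: int = 10):
        self.threshold = threshold
        self.min_samples = min_samples
        self.scores = deque(maxlen=window)  # oldest scores fall off automatically

    def add(self, score: float) -> bool:
        """Record a score; return True if the rolling mean is below threshold."""
        self.scores.append(score)
        if len(self.scores) < self.min_samples:
            return False  # too few samples to judge
        return sum(self.scores) / len(self.scores) < self.threshold
```

Requiring a minimum number of samples avoids firing on a single bad trace, and a rolling mean trades some detection delay for fewer false alarms.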


What LangSmith Does NOT Provide

To set clear expectations:

  • Automatic statistical drift detection - No automatic computation of KL divergence, PSI, or other statistical drift metrics

  • Built-in bias/fairness metrics - No prebuilt demographic parity, equalized odds, or disparate impact calculations

  • Reference data upload with automatic comparison - Baselines are established through experiments rather than uploaded reference datasets

  • Traditional ML model support - LangSmith is designed for LLM/agent observability, not traditional ML models
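Because these fairness calculations are not built in, they have to be computed from your own data. The sketch below shows two of the metrics named above for records shaped as `(group label, positive outcome)` pairs. How groups are assigned and what counts as a positive outcome are assumptions you must define for your application, and the function names are illustrative.

```python
from collections import defaultdict
from typing import Iterable


def positive_rates(records: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map each group label to its share of positive outcomes."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, outcome in records:
        totals[group] += 1
        positives[group] += int(outcome)
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(rates: dict[str, float]) -> float:
    """Largest gap in positive-outcome rate between any two groups."""
    return max(rates.values()) - min(rates.values())


def disparate_impact_ratio(rates: dict[str, float]) -> float:
    """Lowest group rate divided by the highest; 1.0 means equal rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest > 0 else 1.0
```

A disparate impact ratio below 0.8 is often used as a screening signal (the "four-fifths rule"), though whether that cutoff is appropriate depends on the application and jurisdiction.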


Summary Table

| Capability | Out-of-the-Box | Custom Implementation |
| --- | --- | --- |
| Performance monitoring over time | Yes | - |
| Experiment comparison with baselines | Yes | - |
| Online evaluations on production traffic | Yes | - |
| Automation rules and alerts | Yes | - |
| Statistical drift detection | No | Possible via custom evaluators |
| Bias/fairness metrics | No | Possible via custom/LLM-as-judge evaluators |
| Traditional ML model support | No | Not designed for this use case |