Productionizing and Scaling Self-Hosted LangSmith - Best Practices

Last updated: November 27, 2025

Overview

This guide provides comprehensive best practices for scaling and productionizing your self-hosted LangSmith instance. Following these recommendations will help ensure your deployment can handle production workloads efficiently and reliably.

1. Scaling Recommendations

Pod Replicas and Database Sizing

Configure the number of pod replicas and database resources based on your expected read/write throughput.

Reference: LangSmith Self-Host Scaling Documentation

This documentation covers:

  • Recommended number of pod replicas for different throughput levels

  • Database sizing guidelines

  • Performance optimization strategies

2. Autoscaling Configuration

Set up autoscalers for LangSmith pods to automatically scale up or down based on request load. Use the scaling documentation above for replica recommendations based on your volume.

Key Components to Autoscale

LangSmith Backend

Configure autoscaling for the main LangSmith backend service.

Configuration: LangSmith Backend Values

# Example autoscaling configuration
langsmith:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

Platform Backend

Configure autoscaling for the platform backend service.

Configuration: Platform Backend Values

# Example autoscaling configuration
platformBackend:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70

Queue Service

Configure autoscaling for the queue service to handle background processing.

Configuration: Queue Values

# Example autoscaling configuration
queue:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 6
    targetCPUUtilizationPercentage: 70

3. External Datastores (Highly Recommended)

Instead of managing Kubernetes StatefulSets for your datastores, use managed external services for better reliability and easier maintenance.

External Redis

Use a managed Redis service like AWS ElastiCache with OSS Redis (not Redis Enterprise).

Benefits:

  • Better reliability and availability

  • Automated backups and failover

  • Reduced operational overhead

  • No Kubernetes StatefulSet management

Documentation: External Redis Setup

Recommended Providers:

  • AWS ElastiCache (OSS Redis)

  • Azure Cache for Redis

  • Google Cloud Memorystore
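As a sketch, pointing the LangSmith Helm chart at an external Redis typically means setting the external connection in your values file. Key names can vary between chart versions, and the connection URL below is a placeholder, so verify both against your chart's values.yaml:

# Illustrative values snippet: use a managed Redis instead of the
# bundled in-cluster StatefulSet. Verify key names against your
# chart version; the endpoint shown is a placeholder.
redis:
  external:
    enabled: true
    connectionUrl: "redis://my-elasticache-endpoint:6379"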

External PostgreSQL

Use a managed PostgreSQL service like AWS RDS for your primary database.

Benefits:

  • Automated backups and point-in-time recovery

  • High availability with multi-AZ deployments

  • Automated maintenance and patching

  • Better performance monitoring

Documentation: External Postgres Setup

Recommended Providers:

  • AWS RDS for PostgreSQL

  • Azure Database for PostgreSQL

  • Google Cloud SQL for PostgreSQL
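Connecting to a managed PostgreSQL instance follows the same pattern as external Redis. A minimal sketch, assuming the chart accepts a full connection URL (key names and the endpoint below are placeholders to check against your chart version):

# Illustrative values snippet: use a managed PostgreSQL (e.g. RDS)
# instead of the bundled in-cluster database. Verify key names
# against your chart version's values.yaml.
postgres:
  external:
    enabled: true
    connectionUrl: "postgres://user:password@my-rds-endpoint:5432/langsmith"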

4. ClickHouse Configuration

High Availability Replicated Deployment

Set up a highly available (HA) replicated ClickHouse deployment in your Kubernetes cluster using Zookeeper for coordination.

Benefits:

  • Data redundancy and fault tolerance

  • Improved query performance with distributed reads

  • Zero downtime during maintenance

Instructions: Replicated ClickHouse Setup
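The linked instructions cover the supported setup in detail. As one illustrative route, assuming you run the Altinity clickhouse-operator (an assumption, not the only option), a Zookeeper-coordinated replicated cluster can be declared roughly like this:

# Illustrative ClickHouseInstallation for the Altinity clickhouse-operator.
# Names, namespaces, and counts are placeholders; follow the LangSmith
# replicated ClickHouse instructions for the supported configuration.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: langsmith-clickhouse
spec:
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper.langsmith.svc.cluster.local
          port: 2181
    clusters:
      - name: langsmith
        layout:
          shardsCount: 1
          replicasCount: 3   # three replicas for fault tolerance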

Enable Blob Storage

Enable blob storage to offload large trace data from ClickHouse, alleviating database pressure without sacrificing performance.

Benefits:

  • Reduced ClickHouse storage requirements

  • Lower database costs

  • Maintains query performance

  • Better separation of hot and cold data

Documentation: Blob Storage Setup

Recommended Providers:

  • AWS S3

  • Azure Blob Storage

  • Google Cloud Storage

  • MinIO (self-hosted)
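A rough sketch of enabling blob storage in the chart values, assuming an S3-compatible backend; the key names, engine value, and endpoints here are illustrative and should be checked against your chart version's values.yaml:

# Illustrative values snippet: offload large trace payloads to
# S3-compatible blob storage. All values below are placeholders.
config:
  blobStorage:
    enabled: true
    engine: "S3"                                   # provider-specific
    bucketName: "langsmith-traces"
    apiURL: "https://s3.us-east-1.amazonaws.com"
    accessKey: "<access-key-id>"
    accessKeySecret: "<secret-access-key>"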

5. Observability and Monitoring

Set up comprehensive observability on your LangSmith instance to gather logs, metrics, and traces for troubleshooting and alerting.

Documentation: Export Backend Observability

Key Observability Components

Logs

  • Application logs from all services

  • Error tracking and aggregation

  • Centralized log management

Metrics

  • Request rates and latencies

  • Resource utilization (CPU, memory, disk)

  • Queue depths and processing times

  • Database performance metrics

Traces

  • Distributed tracing across services

  • Request flow visualization

  • Performance bottleneck identification

Critical Alerts to Configure

  • High error rates

  • Service downtime

  • Database connection pool exhaustion

  • High queue depths

  • Storage capacity thresholds

  • Unusual latency spikes
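If you export metrics to Prometheus, the alerts above map naturally onto alerting rules. A minimal sketch for the high-error-rate case; the metric and job names are placeholders for whatever your exporters actually emit:

# Illustrative Prometheus alerting rule: fire when the backend's 5xx
# ratio exceeds 5% for 10 minutes. Metric/job names are assumptions.
groups:
  - name: langsmith-alerts
    rules:
      - alert: LangSmithHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="langsmith-backend", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="langsmith-backend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "LangSmith backend 5xx rate above 5% for 10 minutes"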

Quick Reference Checklist

  •  Configure pod replicas based on throughput requirements

  •  Enable autoscaling for LangSmith Backend

  •  Enable autoscaling for Platform Backend

  •  Enable autoscaling for Queue service

  •  Migrate to external managed Redis (e.g., AWS ElastiCache)

  •  Migrate to external managed PostgreSQL (e.g., AWS RDS)

  •  Set up HA replicated ClickHouse with Zookeeper

  •  Enable blob storage for trace data

  •  Configure log collection and aggregation

  •  Set up metrics monitoring and dashboards

  •  Enable distributed tracing

  •  Configure alerting rules and notifications
