Productionizing and Scaling Self-Hosted LangSmith - Best Practices

Last updated: November 27, 2025

Overview

This guide provides comprehensive best practices for scaling and productionizing your self-hosted LangSmith instance. Following these recommendations will help ensure your deployment can handle production workloads efficiently and reliably.

1. Scaling Recommendations

Pod Replicas and Database Sizing

Configure the number of pod replicas and database resources based on your expected read/write throughput.

Reference: LangSmith Self-Host Scaling Documentation

This documentation covers:

  • Recommended number of pod replicas for different throughput levels

  • Database sizing guidelines

  • Performance optimization strategies

2. Autoscaling Configuration

Set up autoscalers for LangSmith pods to automatically scale up or down based on request load. Use the scaling documentation above for replica recommendations based on your volume.

Key Components to Autoscale

LangSmith Backend

Configure autoscaling for the main LangSmith backend service.

Configuration: LangSmith Backend Values

# Example autoscaling configuration
langsmith:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70

Platform Backend

Configure autoscaling for the platform backend service.

Configuration: Platform Backend Values

# Example autoscaling configuration
platformBackend:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 8
    targetCPUUtilizationPercentage: 70

Queue Service

Configure autoscaling for the queue service to handle background processing.

Configuration: Queue Values

# Example autoscaling configuration
queue:
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 6
    targetCPUUtilizationPercentage: 70

3. External Datastores (Highly Recommended)

Instead of managing Kubernetes StatefulSets for your datastores, use managed external services for better reliability and easier maintenance.

External Redis

Use a managed Redis service like AWS ElastiCache with OSS Redis (not Redis Enterprise).

Benefits:

  • Better reliability and availability

  • Automated backups and failover

  • Reduced operational overhead

  • No Kubernetes StatefulSet management

Documentation: External Redis Setup

Recommended Providers:

  • AWS ElastiCache (OSS Redis)

  • Azure Cache for Redis

  • Google Cloud Memorystore
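As a sketch, pointing the LangSmith Helm chart at an external Redis typically means setting the external connection in your values file. Key names can vary between chart versions, and the connection URL below is a placeholder, so verify both against your chart's values.yaml:

# Illustrative values snippet: use a managed Redis instead of the
# bundled in-cluster StatefulSet. Verify key names against your
# chart version; the endpoint shown is a placeholder.
redis:
  external:
    enabled: true
    connectionUrl: "redis://my-elasticache-endpoint:6379"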

External PostgreSQL

Use a managed PostgreSQL service like AWS RDS for your primary database.

Benefits:

  • Automated backups and point-in-time recovery

  • High availability with multi-AZ deployments

  • Automated maintenance and patching

  • Better performance monitoring

Documentation: External Postgres Setup

Recommended Providers:

  • AWS RDS for PostgreSQL

  • Azure Database for PostgreSQL

  • Google Cloud SQL for PostgreSQL
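Connecting to a managed PostgreSQL instance follows the same pattern as external Redis. A minimal sketch, assuming the chart accepts a full connection URL (key names and the endpoint below are placeholders to check against your chart version):

# Illustrative values snippet: use a managed PostgreSQL (e.g. RDS)
# instead of the bundled in-cluster database. Verify key names
# against your chart version's values.yaml.
postgres:
  external:
    enabled: true
    connectionUrl: "postgres://user:password@my-rds-endpoint:5432/langsmith"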

4. ClickHouse Configuration

High Availability Replicated Deployment

Set up a highly available (HA) replicated ClickHouse deployment in your Kubernetes cluster using Zookeeper for coordination.

Benefits:

  • Data redundancy and fault tolerance

  • Improved query performance with distributed reads

  • Zero downtime during maintenance

Instructions: Replicated ClickHouse Setup
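The linked instructions cover the supported setup in detail. As one illustrative route, assuming you run the Altinity clickhouse-operator (an assumption, not the only option), a Zookeeper-coordinated replicated cluster can be declared roughly like this:

# Illustrative ClickHouseInstallation for the Altinity clickhouse-operator.
# Names, namespaces, and counts are placeholders; follow the LangSmith
# replicated ClickHouse instructions for the supported configuration.
apiVersion: "clickhouse.altinity.com/v1"
kind: "ClickHouseInstallation"
metadata:
  name: langsmith-clickhouse
spec:
  configuration:
    zookeeper:
      nodes:
        - host: zookeeper.langsmith.svc.cluster.local
          port: 2181
    clusters:
      - name: langsmith
        layout:
          shardsCount: 1
          replicasCount: 3   # three replicas for fault tolerance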

Enable Blob Storage

Enable blob storage to offload large trace data from ClickHouse, alleviating database pressure without sacrificing performance.

Benefits:

  • Reduced ClickHouse storage requirements

  • Lower database costs

  • Maintains query performance

  • Better separation of hot and cold data

Documentation: Blob Storage Setup

Recommended Providers:

  • AWS S3

  • Azure Blob Storage

  • Google Cloud Storage

  • MinIO (self-hosted)
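A rough sketch of enabling blob storage in the chart values, assuming an S3-compatible backend; the key names, engine value, and endpoints here are illustrative and should be checked against your chart version's values.yaml:

# Illustrative values snippet: offload large trace payloads to
# S3-compatible blob storage. All values below are placeholders.
config:
  blobStorage:
    enabled: true
    engine: "S3"                                   # provider-specific
    bucketName: "langsmith-traces"
    apiURL: "https://s3.us-east-1.amazonaws.com"
    accessKey: "<access-key-id>"
    accessKeySecret: "<secret-access-key>"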

5. Observability and Monitoring

Set up comprehensive observability on your LangSmith instance to gather logs, metrics, and traces for troubleshooting and alerting.

Documentation: Export Backend Observability

Key Observability Components

Logs

  • Application logs from all services

  • Error tracking and aggregation

  • Centralized log management

Metrics

  • Request rates and latencies

  • Resource utilization (CPU, memory, disk)

  • Queue depths and processing times

  • Database performance metrics

Traces

  • Distributed tracing across services

  • Request flow visualization

  • Performance bottleneck identification

Critical Alerts to Configure

  • High error rates

  • Service downtime

  • Database connection pool exhaustion

  • High queue depths

  • Storage capacity thresholds

  • Unusual latency spikes
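If you export metrics to Prometheus, the alerts above map naturally onto alerting rules. A minimal sketch for the high-error-rate case; the metric and job names are placeholders for whatever your exporters actually emit:

# Illustrative Prometheus alerting rule: fire when the backend's 5xx
# ratio exceeds 5% for 10 minutes. Metric/job names are assumptions.
groups:
  - name: langsmith-alerts
    rules:
      - alert: LangSmithHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="langsmith-backend", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="langsmith-backend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "LangSmith backend 5xx rate above 5% for 10 minutes"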

Quick Reference Checklist

  •  Configure pod replicas based on throughput requirements

  •  Enable autoscaling for LangSmith Backend

  •  Enable autoscaling for Platform Backend

  •  Enable autoscaling for Queue service

  •  Migrate to external managed Redis (e.g., AWS ElastiCache)

  •  Migrate to external managed PostgreSQL (e.g., AWS RDS)

  •  Set up HA replicated ClickHouse with Zookeeper

  •  Enable blob storage for trace data

  •  Configure log collection and aggregation

  •  Set up metrics monitoring and dashboards

  •  Enable distributed tracing

  •  Configure alerting rules and notifications
