How do I resolve PostgreSQL connection pool issues causing 503 errors and high latency?

Last updated: April 10, 2026

Context

You may see PostgreSQL connection pool contention that causes 503 errors, high latency (up to 50 seconds), and queued requests even though the pool is configured with a high maximum size. The typical symptom is a pool that stays stuck at a low number of open connections, well below its configured limit. This can degrade both thread search operations and overall system performance.

Answer

This issue is commonly caused by mismatched worker pool and database pool configurations, along with insufficient database resources. Here's how to resolve it:

1. Configure Environment Variables

Add these environment variables to your values.yaml file:

apiServer:
  deployment:
    replicaCount: 3

    extraEnv:
      # Worker pool must match DB pool size
      - name: ASYNC_WORKER_POOL_SIZE
        value: "150"
      - name: THREAD_POOL_SIZE
        value: "150"
      - name: N_JOBS_PER_WORKER
        value: "10"

      # Database pool scaling
      - name: POSTGRES_POOL_MAX_SIZE
        value: "150"
      - name: POSTGRES_POOL_MIN_SIZE
        value: "10"

      # Query timeouts (assumed to be in milliseconds, PostgreSQL's default unit: 30 s and 60 s)
      - name: POSTGRES_STATEMENT_TIMEOUT
        value: "30000"
      - name: POSTGRES_IDLE_IN_TRANSACTION_TIMEOUT
        value: "60000"

    readinessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - exec python /api/healthcheck.py
      initialDelaySeconds: 30
      timeoutSeconds: 10
      failureThreshold: 5

    livenessProbe:
      exec:
        command:
          - /bin/sh
          - -c
          - exec python /api/healthcheck.py
      initialDelaySeconds: 60
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 6

Apply the changes using:

helm upgrade <your-deployment-name> <path-to-your-chart> \
  -f values.yaml \
  -n <your-namespace> \
  --wait
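
Before running the upgrade, it can help to confirm that the pool sizes actually line up. The following is a minimal sketch, not part of the product: `EXTRA_ENV` simply mirrors the `extraEnv` entries from the values.yaml above, and the helper names `env_from_extra_env` and `check_pool_alignment` are hypothetical. It only checks the rule stated in the config comment (worker pools equal to the DB pool maximum) plus the obvious min/max ordering.

```python
# Mirrors the extraEnv list from values.yaml (copied by hand for this sketch).
EXTRA_ENV = [
    {"name": "ASYNC_WORKER_POOL_SIZE", "value": "150"},
    {"name": "THREAD_POOL_SIZE", "value": "150"},
    {"name": "N_JOBS_PER_WORKER", "value": "10"},
    {"name": "POSTGRES_POOL_MAX_SIZE", "value": "150"},
    {"name": "POSTGRES_POOL_MIN_SIZE", "value": "10"},
]


def env_from_extra_env(extra_env):
    """Turn the Helm-style list of {name, value} entries into a plain dict."""
    return {item["name"]: item["value"] for item in extra_env}


def check_pool_alignment(env):
    """Return a list of problems; an empty list means the sizes line up."""
    problems = []
    pool_max = int(env["POSTGRES_POOL_MAX_SIZE"])
    pool_min = int(env["POSTGRES_POOL_MIN_SIZE"])

    # The worker pools are expected to match the DB pool maximum.
    for name in ("ASYNC_WORKER_POOL_SIZE", "THREAD_POOL_SIZE"):
        if int(env[name]) != pool_max:
            problems.append(
                f"{name}={env[name]} does not match POSTGRES_POOL_MAX_SIZE={pool_max}"
            )

    # A minimum above the maximum is never a valid pool configuration.
    if pool_min > pool_max:
        problems.append(
            f"POSTGRES_POOL_MIN_SIZE={pool_min} exceeds POSTGRES_POOL_MAX_SIZE={pool_max}"
        )
    return problems


if __name__ == "__main__":
    issues = check_pool_alignment(env_from_extra_env(EXTRA_ENV))
    print("OK" if not issues else "\n".join(issues))
```

With the values shown in this article the check reports no problems; lowering only one of the worker pool sizes would produce a single mismatch message.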

2. Scale Your Database Resources

Ensure your PostgreSQL instance has sufficient resources:

- Use a larger database instance size (refer to the scaling documentation for recommended specifications)
- Ensure your database can handle the maximum number of connections (e.g., Aurora RDS db.r6g.xlarge supports up to 2000 connections)
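
To make the connection requirement concrete, here is a rough worked calculation. It assumes that each API server pod opens its own pool of up to POSTGRES_POOL_MAX_SIZE connections (an assumption, not something this article states explicitly), and it uses the replica count, pool size, and 2000-connection figure quoted above. The function names are hypothetical.

```python
def peak_connections(replicas, pool_max_size):
    """Worst-case connections if every pod fills its own pool (assumed per-pod pools)."""
    return replicas * pool_max_size


def connection_headroom(replicas, pool_max_size, db_max_connections):
    """Connections left over on the database; negative means the pools can exhaust it."""
    return db_max_connections - peak_connections(replicas, pool_max_size)


if __name__ == "__main__":
    # Values from this article: replicaCount 3, POSTGRES_POOL_MAX_SIZE 150,
    # and a database limit of 2000 connections.
    print(peak_connections(3, 150))           # 450
    print(connection_headroom(3, 150, 2000))  # 1550
```

Re-run the calculation whenever you raise the replica count or the pool size, and leave some headroom for migrations, monitoring, and administrative sessions.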

3. Remove Read Replicas

LangGraph and LangSmith do not support read-only database endpoints. If you're using Aurora RDS with read replicas:

- Remove read replicas as they are not utilized
- Use only the write endpoint
- This will also reduce unnecessary resource overhead
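
One way to catch a misconfigured endpoint early is to inspect the host in your connection string. The sketch below relies on the Aurora naming convention in which the cluster reader endpoint contains `.cluster-ro-`; the function name and the example URIs are hypothetical, and the check will not detect custom endpoints or the instance endpoint of an individual replica.

```python
from urllib.parse import urlsplit


def looks_like_aurora_reader(uri):
    """Return True if the URI host follows the Aurora cluster reader endpoint pattern."""
    host = urlsplit(uri).hostname or ""
    return ".cluster-ro-" in host


if __name__ == "__main__":
    # Hypothetical URIs for illustration only.
    writer = "postgresql://app:secret@mycluster.cluster-abc123.us-east-1.rds.amazonaws.com:5432/app"
    reader = "postgresql://app:secret@mycluster.cluster-ro-abc123.us-east-1.rds.amazonaws.com:5432/app"
    for uri in (writer, reader):
        label = "reader endpoint, do not use" if looks_like_aurora_reader(uri) else "ok"
        print(urlsplit(uri).hostname, "->", label)
```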

4. Scale API Servers

Instead of using multiple uvicorn workers on a single API server:

- Deploy multiple API server instances
- Use the default single uvicorn worker per pod configuration
- This provides better resource distribution and fault tolerance

5. Optimize Thread Search Performance

For high-volume thread search operations:

- Implement pagination with reasonable limits
- Use more granular filtering in your search queries
- Ensure proper indexing on metadata fields you're filtering by
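
The pagination advice can be sketched as a small generator that requests one bounded page at a time instead of one very large result set. This is an illustration under assumptions: `fetch_page` is a hypothetical callable that you would implement as a thin wrapper around your client's thread search call, assuming that call accepts a limit and an offset. The in-memory `fake_fetch` below stands in for it so the example runs on its own.

```python
from typing import Callable, Iterator, List, Optional


def iter_threads(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 100,
    max_items: Optional[int] = None,
) -> Iterator[dict]:
    """Yield search results page by page, stopping at a short page or at max_items."""
    if max_items is not None and max_items <= 0:
        return
    offset = 0
    yielded = 0
    while True:
        page = fetch_page(page_size, offset)
        for item in page:
            yield item
            yielded += 1
            if max_items is not None and yielded >= max_items:
                return
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            return
        offset += len(page)


# Stand-in data and fetcher (hypothetical); a real fetcher would query the server
# with a narrow metadata filter rather than filtering in memory.
FAKE_THREADS = [
    {"thread_id": f"t{i}", "metadata": {"team": "a" if i % 2 == 0 else "b"}}
    for i in range(250)
]


def fake_fetch(limit, offset):
    matching = [t for t in FAKE_THREADS if t["metadata"]["team"] == "a"]
    return matching[offset:offset + limit]


if __name__ == "__main__":
    first_sixty = list(iter_threads(fake_fetch, page_size=50, max_items=60))
    print(len(first_sixty))  # 60
```

Keeping `page_size` modest bounds how long each query holds a pooled connection, which is the point of the recommendation above.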

Key Understanding: The connection pool issue occurs when the job limit (N_JOBS_PER_WORKER) is reached before the database pool can scale up. The worker pool settings must be aligned with database pool settings to allow proper scaling.

These configurations apply to both API servers and queue workers when using the combined API/Queue pattern, as they share the same connection pool.