Understanding Checkpointers, Databases, API Memory, and TTL

Last updated: November 23, 2025

When working with LangGraph deployments, it’s important to understand how checkpoints, the database, API runtime memory, and TTL (time-to-live) interact. This article helps in diagnosing issues like out-of-memory (OOM) errors, database growth, and pod crashes.



Quick Reference

Issue

Symptom

Fix

Pod memory growth

OOM errors, run cancellations, pod restarts

Optimize workflow code or increase pod memory

Database growth

Storage cap reached, deployment inoperable

Configure TTL or manually clean up old checkpoints

Large checkpoints

Slow runs, database bloat

Reduce application state size saved to checkpoints or move large data objects to external storage/LangGraph Store

Connection timeouts

Long-running workflows fail with database connection errors

Use ConnectionPool instead of raw Connection with PostgresSaver

MongoDB document size limit

"DocumentTooLarge" errors during checkpoint saves

Switch to PostgreSQL or reduce checkpoint size; avoid checkpointers in subgraphs

For deployments:

  • Checkpoints are always stored in the database (Postgres in production deployments).

  • Connection Management: When using PostgresSaver directly (outside of LangGraph Platform deployments), the default behavior holds a database connection for the entire run duration. For long-running workflows, this can cause connection timeout issues. Use ConnectionPool instead of a raw Connection object to automatically manage connections:

    from psycopg.rows import dict_row
    from psycopg_pool import ConnectionPool
    from langgraph.checkpoint.postgres import PostgresSaver

    pool = ConnectionPool(
    conn_string,
    max_size=10,
    max_idle=300.0, # Time (seconds) before idle connection is closed
    kwargs={"autocommit": True, "row_factory": dict_row}
    )
    checkpointer = PostgresSaver(pool)
    checkpointer.setup()

    Warning: When defining custom context schemas, avoid converting UUID objects to strings in field defaults. PostgreSQL UUID columns expect uuid.UUID objects, not strings. For example, avoid thread_id: str = Field(default_factory=lambda: str(uuid.uuid4())) as this can cause "bad argument type for built-in operation" and "'str' object has no attribute 'hex'" errors during checkpoint saves. Instead, let LangGraph Platform automatically inject the thread_id, or use thread_id: str | None = Field(default=None) if you need to reference it in your schema.

Important: MongoDB checkpointers have a 16MB document size limit, while PostgreSQL supports up to 1GB per field. If you're experiencing "DocumentTooLarge" errors with MongoDB, consider switching to PostgreSQL or reducing your checkpoint size.

  • Checkpoints are always stored in the database (Postgres in production deployments).

  • Connection Management: When using PostgresSaver directly (outside of LangGraph Platform deployments), the default behavior holds a database connection for the entire run duration. For long-running workflows, this can cause connection timeout issues. Use ConnectionPool instead of a raw Connection object to automatically manage connections:

    from psycopg.rows import dict_row
    from psycopg_pool import ConnectionPool
    from langgraph.checkpoint.postgres import PostgresSaver

    pool = ConnectionPool(
    conn_string,
    max_size=10,
    max_idle=300.0, # Time (seconds) before idle connection is closed
    kwargs={"autocommit": True, "row_factory": dict_row}
    )
    checkpointer = PostgresSaver(pool)
    checkpointer.setup()

    Warning: When defining custom context schemas, avoid converting UUID objects to strings in field defaults. PostgreSQL UUID columns expect uuid.UUID objects, not strings. For example, avoid thread_id: str = Field(default_factory=lambda: str(uuid.uuid4())) as this can cause "bad argument type for built-in operation" and "'str' object has no attribute 'hex'" errors during checkpoint saves. Instead, let LangGraph Platform automatically inject the thread_id, or use thread_id: str | None = Field(default=None) if you need to reference it in your schema.

  • The checkpointer is the component that saves them.

  • Checkpoints are never stored in memory. Checkpoints are written to the database at the end of each superstep. Application state is not retained (server state is contained to your graph state, meaning python/JS cleans it up when it goes out of scope), unless your agent/graph code introduces a memory leak. Checkpoints do not persist in pod memory after runs finish.

  • Checkpoints are always stored in the database (Postgres in production deployments).

  • The checkpointer is the component that saves them.

  • Checkpoints are never stored in memory. Checkpoints are written to the database at the end of each superstep. Application state is not retained (server state is contained to your graph state, meaning python/JS cleans it up when it goes out of scope), unless your agent/graph code introduces a memory leak. Checkpoints do not persist in pod memory after runs finish.

For langgraph dev (the 'in-memory' checkpointer that you can run on your local computer):

  • Checkpoints and other application state are stored in your process memory and committed to local disk in the .langgraph_api/ directory periodically.



2. API Memory vs. Database Storage

API Memory

Pod memory usage comes from code running inside the workflow.

Causes of high memory use include:

Configuration considerations for server deployments:

  • Avoid setting WEB_CONCURRENCY as this grows memory linearly and prevents efficient sharing of PostgreSQL/Redis connections across forked processes

  • For isolating API requests from graph invocation while staying within a single process, consider using BG_JOB_ISOLATED_LOOPS to run multiple event loops in threads

  • For production deployments, split the API server and queue into separate components and scale them independently using Kubernetes

  • Loading large data objects (such as spreadsheets, images, or videos).

  • Many parallel tasks creating large in-memory objects (you can limit the amount of parallelism per-pod using N_JOBS_PER_WORKER)

  • Memory leaks in user code that block garbage collection (in-memory caches, etc.)

Typical symptoms: OOM errors, canceled runs, pod restarts.

Technical Note: LangGraph uses Python's ThreadPoolExecutor for JSON serialization (via loop.run_in_executor()) to avoid CPU blocking. This can create up to 32 background threads that remain in a sleep state after runs complete. These threads are bounded and don't indicate a memory leak, but contribute to overall memory usage. Use tools like py-spy to inspect thread creation if needed.

Database Storage

Database usage grows when:

  • Many runs are executed.

  • Checkpoints include large application state.

  • No TTL is configured, allowing old checkpoints to accumulate.

  • Subgraphs are compiled with their own checkpointers, creating separate checkpoint namespaces and duplicating storage

  • Large data objects (images, PDFs, videos, documents) stored directly in state as base64 or binary data can cause checkpoint bloat and database memory errors

  • Many runs are executed.

  • Checkpoints include large application state.

  • No TTL is configured, allowing old checkpoints to accumulate.

Impacts:

For large data objects in state: Instead of storing large payloads directly in graph state, use one of these patterns: - External storage + reference: Upload files to external storage (S3, etc.) and store only the reference key and metadata in state, fetching the data on demand - LangGraph Store: For data that needs to be accessible across threads, store large objects in the LangGraph Store and retrieve them by ID rather than checkpointing them

  • Implement a TTL

  • Use the "exit" durability mode (rather than "async" (default) or "sync" modes. This will write a checkpoint at the end of each run but omit checkpoints for intermediate steps, leading to less data duplication while sacrificing durability in the case of pod restarts.

    • Note that durability="exit" corresponds to "checkpoint_during=False" in langgraph versions < 0.6



3. Role of TTL (Time-to-Live)

TTL defines how long checkpoints and threads are retained in the database.

  • Without TTL:

    • Database tables grow indefinitely (unless you manually delete resources)

    • Disk usage increases until new writes are blocked.

  • With TTL:

    • Old checkpoints are automatically deleted after expiration.

    • Database growth is controlled.

Note: TTL only applies to checkpoints created after it is enabled. Older checkpoints must be cleaned up separately.



4. How These Issues Interrelate

  • Large application state leads to increased pod memory usage and typically would cause larger checkpoints, which accelerates database growth.

  • Memory leaks continuously increase pod memory usage. Increasing pod memory only delays the problem; leaks must be fixed in code.

  • Lack of TTL causes database disk usage to grow indefinitely.

How to resolve common issues:

  • For OOM errors: optimize workflow code or increase pod memory.

  • For disk space issues: configure TTL, clean up checkpoints, or use exit durability mdoe.

  • For large checkpoints: reduce the amount of state being saved.



Summary:

Pod memory issues usually point to workflow code, while database growth usually indicates missing TTL. Addressing both ensures stable and scalable deployments.