How do I resolve CancelledError in LangGraph long-running operations?

Last updated: November 12, 2025

Context

When running LangGraph agents with long-running async operations, particularly in production environments like Kubernetes, users may encounter CancelledError exceptions. These errors commonly occur during extended processing periods or when deploying new revisions.

Answer

There are several configuration settings you can adjust to handle long-running operations and prevent CancelledError exceptions:

  1. Set the background job timeout using the environment variable:
    BG_JOB_TIMEOUT_SECS - Extends the default timeout period for long-running operations

  2. Configure isolated event loops:
    BG_JOB_ISOLATED_LOOPS=true - Helps prevent event loop conflicts

  3. Set a grace period for job shutdown:
    BG_JOB_SHUTDOWN_GRACE_PERIOD_SECS - Determines how long jobs have to complete when a new revision is deployed (recommended value between 240 and 1200 seconds, depending on your typical run duration)
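
As a rough illustration of the three settings above, the following Python sketch assembles them into one set of environment values and checks the grace period against the recommended 240 to 1200 second range. The helper name build_bg_job_env and the example values (a 7200-second timeout, a 600-second grace period) are assumptions made for this example, not defaults or part of any LangGraph API. Rejecting out-of-range grace periods is also a choice made for this sketch, since the range is only a recommendation.

```python
# Recommended bounds for the shutdown grace period, taken from the list above.
GRACE_MIN_SECS = 240
GRACE_MAX_SECS = 1200


def build_bg_job_env(
    timeout_secs: int, grace_secs: int, isolated_loops: bool = True
) -> dict[str, str]:
    """Hypothetical helper: return the three BG_JOB_* settings as strings."""
    if timeout_secs <= 0:
        raise ValueError("timeout_secs must be positive")
    if not GRACE_MIN_SECS <= grace_secs <= GRACE_MAX_SECS:
        # Strict on purpose in this sketch; the range is only a recommendation.
        raise ValueError(
            f"grace_secs={grace_secs} is outside the recommended "
            f"{GRACE_MIN_SECS}-{GRACE_MAX_SECS} second range"
        )
    return {
        "BG_JOB_TIMEOUT_SECS": str(timeout_secs),
        "BG_JOB_ISOLATED_LOOPS": "true" if isolated_loops else "false",
        "BG_JOB_SHUTDOWN_GRACE_PERIOD_SECS": str(grace_secs),
    }


if __name__ == "__main__":
    # Illustrative values only: a 2-hour timeout and a 10-minute grace period.
    for name, value in build_bg_job_env(7200, 600).items():
        print(f"{name}={value}")
```

The printed NAME=value lines can be copied into whatever mechanism your deployment uses to set environment variables.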

For deployments using the streaming API, additional configurations can help:

  • Enable resumable streaming with streamResumable: true

  • Use reconnection with reconnectOnMount: true in the useStream() hook

  • Consider using durability modes (durability: "async" or "sync") for better persistence

What if I need these runs to last more than an hour?

  • Once a run starts, its total duration is limited by BG_JOB_TIMEOUT_SECS, so set that value above your longest expected run.

  • With streamResumable: true, the BG_JOB_TIMEOUT_SECS timer resets each time you rejoin the stream, as long as you rejoin before the timeout expires.

Note: If you see CancelledError while a new revision is being deployed, this is expected behavior: existing runs are cancelled and restarted with the new revision.
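
If your own async code needs to release resources when a run is cancelled, a common asyncio pattern is to catch CancelledError, clean up, and re-raise so the cancellation still propagates. The sketch below uses only the Python standard library. The function long_running_step is a hypothetical stand-in for a slow step in your graph, and the explicit task.cancel() call merely simulates a cancellation; it is not how the platform cancels runs.

```python
import asyncio


async def long_running_step(log: list[str]) -> str:
    """Hypothetical stand-in for a slow step inside a graph node."""
    try:
        await asyncio.sleep(3600)  # simulate an hour of work
        return "done"
    except asyncio.CancelledError:
        # Release or record whatever needs it, then re-raise so the
        # cancellation is not swallowed.
        log.append("cancelled: cleaned up")
        raise


async def main() -> list[str]:
    log: list[str] = []
    task = asyncio.create_task(long_running_step(log))
    await asyncio.sleep(0.01)  # give the task a chance to start
    task.cancel()  # simulate a cancellation arriving mid-run
    try:
        await task
    except asyncio.CancelledError:
        log.append("caller observed CancelledError")
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Swallowing CancelledError without re-raising it can leave a task running after it was asked to stop, which is why the handler ends with a bare raise.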