High Latency and unresolved runs in LangSmith due to OpenSSL Bug

Last updated: September 19, 2025

Summary

Severe latency with the GPT-4.1 model and unresolved runs in a Kubernetes-based self-hosted LangSmith setup were caused by a bug in OpenSSL v3.0.17, leading to memory corruption and pod crashes.

Issue Description

Users experienced significant performance degradation with the GPT-4.1 model, with response times exceeding two minutes. Concurrently, runs in the LangSmith platform were getting stuck in a "pending" state or appearing to hang despite a "success" status. Examination of the pod logs langgraph-queue-546ff6589-fcgp8 revealed the following errors, indicating a memory crash:

DEBUG    | lex_machina.tools.vault:get_document_ids:196 - vault_api_url: https://domain.com/api
double free or corruption (out)
Fatal Python error: Aborted

Current thread 0x00007bf7fe7fc6c0 (most recent call first):
  File "/usr/local/lib/python3.11/ssl.py", line 1382 in do_handshake
  File "/usr/local/lib/python3.11/ssl.py", line 1104 in _create
memory crash.

Environment

  • Products: LangSmith, LangGraph, GPT-4.1

  • Platform: Kubernetes (on-premise)

  • Cloud: Self-hosted cloud environment

  • Operating System: Linux (within the container)

Cause

The root cause of the issue was identified as a bug in OpenSSL v3.0.17 within the queue container. This bug, documented in an active GitHub issue, causes a segmentation fault when using a shared SSL context in a multi-threaded application, leading to memory corruption and crashes. The issue was triggered by parallel tool calls, one of which was to a Vault API.

Workaround

As a temporary workaround, the team switched from the GPT-4.1 model to the GPT-4o model and disabled parallel tool calls by setting parallel_tool_calls=false. This mitigated the issue by avoiding the conditions that triggered the OpenSSL bug.

Resolution

The definitive resolution is to address the OpenSSL bug in the container. This can be achieved by either downgrading or upgrading the OpenSSL version to a stable release that does not have this issue. After adjusting the OpenSSL version, parallel tool execution in LangGraph can be re-enabled.

References