When OpenAI API calls start timing out in production, the real problem is usually not “OpenAI is down.”

The real problem is that you don’t know which hop is failing: DNS, TLS handshake, proxy path, or your own connection pool.

1) Split the timeout into 4 hops

A Go request to OpenAI usually goes through:

  1. DNS resolution
  2. TCP connect + TLS handshake
  3. Proxy forwarding (if any)
  4. Request/response I/O (including streaming)

If you only look at the final error, typically "context deadline exceeded", you cannot tell which of these hops consumed the time.

2) Add minimal observability with httptrace

Use a reusable http.Client with explicit transport timeouts and httptrace hooks.

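// key knobs only (needs imports: net, net/http, time)
tr := &http.Transport{
  Proxy: http.ProxyFromEnvironment,
  // Bound the TCP connect on its own so a slow dial is not mistaken for a TLS or header timeout.
  DialContext:           (&net.Dialer{Timeout: 3 * time.Second, KeepAlive: 30 * time.Second}).DialContext,
  ForceAttemptHTTP2:     true, // keep HTTP/2 enabled despite the custom DialContext
  TLSHandshakeTimeout:   5 * time.Second,
  ResponseHeaderTimeout: 30 * time.Second, // wait for response headers after the request is fully written
  IdleConnTimeout:       90 * time.Second,
  MaxIdleConns:          200,
  MaxIdleConnsPerHost:   50,
  MaxConnsPerHost:       100, // beyond this limit, new dials block until a connection frees up
}
// Client.Timeout covers the whole exchange, including reading the body,
// so 45s also cuts off streaming responses that run longer than that.
client := &http.Client{Transport: tr, Timeout: 45 * time.Second}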

Log DNS/TLS/connect events so you can identify the failing stage in minutes.

3) Troubleshooting order (practical)

DNS timeout

Symptoms: lookup api.openai.com: i/o timeout

dig api.openai.com
nslookup api.openai.com

Fix: point at a stable resolver, set DNS config explicitly in containers, and reuse one client so established connections skip repeated lookups.

TLS handshake timeout

Symptoms: net/http: TLS handshake timeout

openssl s_client -connect api.openai.com:443 -servername api.openai.com -brief
curl -v --connect-timeout 5 https://api.openai.com/v1/models

Fix: verify proxy TLS behavior and cert chain; don’t just increase timeout blindly.

Proxy jitter

Fix: compare direct vs proxy success rate, add health checks, retry idempotent requests only.

Connection pool starvation

Fix: one long-lived client per process, tune pool sizes, and always read the response body to EOF and close it so the connection returns to the pool.

4) Conservative baseline config

  • Dial timeout: 3s
  • TLS handshake: 5s
  • Response header timeout: 30s
  • Client timeout: 45s
  • MaxConnsPerHost: 100
  • MaxIdleConnsPerHost: 50

5) Final takeaway

Timeouts are rarely random. Break the path into hops, instrument each one, and fix the bottleneck instead of retrying blindly.