When OpenAI API calls start timing out in production, the real problem is usually not “OpenAI is down.”
The real problem is you don’t know which hop is failing: DNS, TLS handshake, proxy path, or your own connection pool.
1) Split timeout into 4 hops
A Go request to OpenAI usually goes through:
- DNS resolution
- TCP connect + TLS handshake
- Proxy forwarding (if any)
- Request/response I/O (including streaming)
If the only signal you look at is context deadline exceeded, you are blind: the same error covers all four hops.
2) Add minimal observability with httptrace
Use a reusable http.Client with explicit transport timeouts and httptrace hooks.
// key knobs only; the dial timeout covers the TCP connect hop
tr := &http.Transport{
    Proxy: http.ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   3 * time.Second,
        KeepAlive: 30 * time.Second,
    }).DialContext,
    TLSHandshakeTimeout:   5 * time.Second,
    ResponseHeaderTimeout: 30 * time.Second, // wait for headers after the request is written
    IdleConnTimeout:       90 * time.Second,
    MaxIdleConns:          200,
    MaxIdleConnsPerHost:   50,
    MaxConnsPerHost:       100,
}
// Timeout bounds the whole exchange, including reading a streamed body.
client := &http.Client{Transport: tr, Timeout: 45 * time.Second}
Log DNS/TLS/connect events so you can identify the failing stage in minutes.
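Here is a minimal sketch of those hooks using only the standard library's net/http/httptrace; the tracedGet helper and the log format are illustrative, not part of any SDK.

package main

import (
    "context"
    "crypto/tls"
    "log"
    "net/http"
    "net/http/httptrace"
    "time"
)

// tracedGet issues a GET and logs how long each hop took, so a
// "context deadline exceeded" can be pinned to DNS, connect, or TLS.
func tracedGet(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    var dnsStart, connStart, tlsStart time.Time
    trace := &httptrace.ClientTrace{
        DNSStart: func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
        DNSDone: func(info httptrace.DNSDoneInfo) {
            log.Printf("dns: %v err=%v", time.Since(dnsStart), info.Err)
        },
        ConnectStart: func(_, _ string) { connStart = time.Now() },
        ConnectDone: func(_, addr string, err error) {
            log.Printf("connect %s: %v err=%v", addr, time.Since(connStart), err)
        },
        TLSHandshakeStart: func() { tlsStart = time.Now() },
        TLSHandshakeDone: func(_ tls.ConnectionState, err error) {
            log.Printf("tls: %v err=%v", time.Since(tlsStart), err)
        },
        GotConn: func(info httptrace.GotConnInfo) {
            log.Printf("conn: reused=%v was_idle=%v", info.Reused, info.WasIdle)
        },
    }
    req, err := http.NewRequestWithContext(httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return client.Do(req)
}

func main() {
    client := &http.Client{Timeout: 45 * time.Second}
    resp, err := tracedGet(context.Background(), client, "https://api.openai.com/v1/models")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("status:", resp.Status)
}

One caveat on the sketch: the timing variables are per-request, so give each request its own trace rather than sharing one across goroutines.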
3) Troubleshooting order (practical)
DNS timeout
Symptoms: lookup api.openai.com: i/o timeout
Check:
dig api.openai.com
nslookup api.openai.com
Fix: use a stable resolver, set explicit DNS config in containers, and reuse one client so warm connections skip DNS entirely.
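If the container's default resolver is the unstable part, Go can pin DNS to an explicit resolver. A minimal sketch, assuming 1.1.1.1 stands in for your infra's resolver; this only takes effect for Go's built-in resolver (PreferGo: true).

package main

import (
    "context"
    "log"
    "net"
    "time"
)

func main() {
    // Pin DNS to an explicit resolver instead of the container default.
    // 1.1.1.1:53 is a placeholder; point it at your own resolver.
    net.DefaultResolver = &net.Resolver{
        PreferGo: true,
        Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
            d := net.Dialer{Timeout: 2 * time.Second}
            return d.DialContext(ctx, "udp", "1.1.1.1:53")
        },
    }

    addrs, err := net.DefaultResolver.LookupHost(context.Background(), "api.openai.com")
    if err != nil {
        log.Fatalf("lookup: %v", err)
    }
    log.Printf("resolved: %v", addrs)
}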
TLS handshake timeout
Symptoms: net/http: TLS handshake timeout
Check:
openssl s_client -connect api.openai.com:443 -servername api.openai.com -brief
curl -v --connect-timeout 5 https://api.openai.com/v1/models
Fix: verify the proxy's TLS behavior and the certificate chain; don't just raise the timeout blindly.
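When openssl succeeds but the app still fails, it can help to reproduce the handshake with Go's own TLS stack, separating the TCP connect from the handshake the way the hop split above does. A minimal direct-path probe sketch; a proxied path would need a CONNECT first.

package main

import (
    "crypto/tls"
    "log"
    "net"
    "time"
)

func main() {
    // Hop 1: TCP connect, timed separately from the handshake.
    conn, err := net.DialTimeout("tcp", "api.openai.com:443", 3*time.Second)
    if err != nil {
        log.Fatalf("tcp connect: %v", err)
    }
    defer conn.Close()

    // Hop 2: TLS handshake with its own deadline.
    tlsConn := tls.Client(conn, &tls.Config{ServerName: "api.openai.com"})
    tlsConn.SetDeadline(time.Now().Add(5 * time.Second))

    start := time.Now()
    if err := tlsConn.Handshake(); err != nil {
        log.Fatalf("tls handshake: %v", err)
    }
    state := tlsConn.ConnectionState()
    log.Printf("handshake: %v, %d certs in chain", time.Since(start), len(state.PeerCertificates))
}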
Proxy jitter
Symptoms: intermittent timeouts that shrink or vanish when you bypass the proxy.
Fix: compare direct vs. proxy success rates, add proxy health checks, and retry idempotent requests only.
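A sketch of the retry-idempotent-only rule with capped, jittered backoff; the attempt count and delays are illustrative starting points, not tuned values.

package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "math/rand"
    "net/http"
    "time"
)

// getWithRetry retries only an idempotent GET; a POST that creates
// state should not go through this path.
func getWithRetry(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    var lastErr error
    for attempt := 0; attempt < 3; attempt++ {
        if attempt > 0 {
            // Linear backoff plus jitter, so retries don't synchronize.
            backoff := time.Duration(attempt)*500*time.Millisecond +
                time.Duration(rand.Int63n(int64(250*time.Millisecond)))
            select {
            case <-time.After(backoff):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := client.Do(req)
        if err != nil {
            lastErr = err
            continue
        }
        if resp.StatusCode < 500 {
            return resp, nil
        }
        // Drain and close the 5xx body so the connection can be reused.
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
        lastErr = fmt.Errorf("server error: %s", resp.Status)
    }
    return nil, lastErr
}

func main() {
    client := &http.Client{Timeout: 45 * time.Second}
    resp, err := getWithRetry(context.Background(), client, "https://api.openai.com/v1/models")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("status:", resp.Status)
}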
Connection pool starvation
Symptoms: requests queue under load while DNS, TLS, and the network all look healthy.
Fix: use one long-lived client per process, tune the pool sizes, and always drain and close the response body so connections return to the idle pool.
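A sketch of that drain-and-close discipline with one long-lived client; callOnce and the URL are illustrative.

package main

import (
    "context"
    "io"
    "log"
    "net/http"
    "time"
)

// One long-lived client per process; never one per request.
var apiClient = &http.Client{Timeout: 45 * time.Second}

func callOnce(ctx context.Context, url string) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    resp, err := apiClient.Do(req)
    if err != nil {
        return err
    }
    // Drain anything unread, then close: a half-read body keeps the
    // connection checked out and eventually starves MaxConnsPerHost.
    defer resp.Body.Close()
    defer io.Copy(io.Discard, resp.Body)

    _, err = io.ReadAll(resp.Body) // real code would decode JSON here
    return err
}

func main() {
    if err := callOnce(context.Background(), "https://api.openai.com/v1/models"); err != nil {
        log.Fatal(err)
    }
}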
4) Conservative baseline config
- Dial timeout: 3s
- TLS handshake: 5s
- Response header timeout: 30s
- Client timeout: 45s
- MaxConnsPerHost: 100
- MaxIdleConnsPerHost: 50
5) Final takeaway
Timeouts are rarely random. Break the path into its hops, instrument each one, and fix the bottleneck instead of retrying blindly.