When OpenAI API calls start timing out in production, the real problem is usually not “OpenAI is down.”
The real problem is you don’t know which hop is failing: DNS, TLS handshake, proxy path, or your own connection pool.
1) Split timeout into 4 hops
A Go request to OpenAI usually goes through:
- DNS resolution
- TCP connect + TLS handshake
- Proxy forwarding (if any)
- Request/response I/O (including streaming)
If the only signal you look at is context deadline exceeded, you are blind: the same error covers all four hops.
2) Add minimal observability with httptrace
Use a reusable http.Client with explicit transport timeouts and httptrace hooks.
// key knobs only; the dial timeout covers the TCP connect hop
tr := &http.Transport{
    Proxy: http.ProxyFromEnvironment,
    DialContext: (&net.Dialer{
        Timeout:   3 * time.Second,
        KeepAlive: 30 * time.Second,
    }).DialContext,
    TLSHandshakeTimeout:   5 * time.Second,
    ResponseHeaderTimeout: 30 * time.Second, // wait for headers after the request is written
    IdleConnTimeout:       90 * time.Second,
    MaxIdleConns:          200,
    MaxIdleConnsPerHost:   50,
    MaxConnsPerHost:       100,
}
// Timeout bounds the whole exchange, including reading a streamed body.
client := &http.Client{Transport: tr, Timeout: 45 * time.Second}
Log DNS/TLS/connect events so you can identify the failing stage in minutes.
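Here is a minimal sketch of those hooks using only the standard library's net/http/httptrace; the tracedGet helper and the log format are illustrative, not part of any SDK.

package main

import (
    "context"
    "crypto/tls"
    "log"
    "net/http"
    "net/http/httptrace"
    "time"
)

// tracedGet issues a GET and logs how long each hop took, so a
// "context deadline exceeded" can be pinned to DNS, connect, or TLS.
func tracedGet(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    var dnsStart, connStart, tlsStart time.Time
    trace := &httptrace.ClientTrace{
        DNSStart: func(httptrace.DNSStartInfo) { dnsStart = time.Now() },
        DNSDone: func(info httptrace.DNSDoneInfo) {
            log.Printf("dns: %v err=%v", time.Since(dnsStart), info.Err)
        },
        ConnectStart: func(_, _ string) { connStart = time.Now() },
        ConnectDone: func(_, addr string, err error) {
            log.Printf("connect %s: %v err=%v", addr, time.Since(connStart), err)
        },
        TLSHandshakeStart: func() { tlsStart = time.Now() },
        TLSHandshakeDone: func(_ tls.ConnectionState, err error) {
            log.Printf("tls: %v err=%v", time.Since(tlsStart), err)
        },
        GotConn: func(info httptrace.GotConnInfo) {
            log.Printf("conn: reused=%v was_idle=%v", info.Reused, info.WasIdle)
        },
    }
    req, err := http.NewRequestWithContext(httptrace.WithClientTrace(ctx, trace), http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    return client.Do(req)
}

func main() {
    client := &http.Client{Timeout: 45 * time.Second}
    resp, err := tracedGet(context.Background(), client, "https://api.openai.com/v1/models")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("status:", resp.Status)
}

One caveat on the sketch: the timing variables are per-request, so give each request its own trace rather than sharing one across goroutines.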
3) Troubleshooting order (practical)
DNS timeout
Symptoms: lookup api.openai.com: i/o timeout
Check:
dig api.openai.com
nslookup api.openai.com
Fix: use a stable resolver, set explicit DNS config in containers, and reuse one client so warm connections skip DNS entirely.
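If the container's default resolver is the unstable part, Go can pin DNS to an explicit resolver. A minimal sketch, assuming 1.1.1.1 stands in for your infra's resolver; this only takes effect for Go's built-in resolver (PreferGo: true).

package main

import (
    "context"
    "log"
    "net"
    "time"
)

func main() {
    // Pin DNS to an explicit resolver instead of the container default.
    // 1.1.1.1:53 is a placeholder; point it at your own resolver.
    net.DefaultResolver = &net.Resolver{
        PreferGo: true,
        Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
            d := net.Dialer{Timeout: 2 * time.Second}
            return d.DialContext(ctx, "udp", "1.1.1.1:53")
        },
    }

    addrs, err := net.DefaultResolver.LookupHost(context.Background(), "api.openai.com")
    if err != nil {
        log.Fatalf("lookup: %v", err)
    }
    log.Printf("resolved: %v", addrs)
}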
TLS handshake timeout
Symptoms: net/http: TLS handshake timeout
Check:
openssl s_client -connect api.openai.com:443 -servername api.openai.com -brief
curl -v --connect-timeout 5 https://api.openai.com/v1/models
Fix: verify the proxy's TLS behavior and the certificate chain; don't just raise the timeout blindly.
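When openssl succeeds but the app still fails, it can help to reproduce the handshake with Go's own TLS stack, separating the TCP connect from the handshake the way the hop split above does. A minimal direct-path probe sketch; a proxied path would need a CONNECT first.

package main

import (
    "crypto/tls"
    "log"
    "net"
    "time"
)

func main() {
    // Hop 1: TCP connect, timed separately from the handshake.
    conn, err := net.DialTimeout("tcp", "api.openai.com:443", 3*time.Second)
    if err != nil {
        log.Fatalf("tcp connect: %v", err)
    }
    defer conn.Close()

    // Hop 2: TLS handshake with its own deadline.
    tlsConn := tls.Client(conn, &tls.Config{ServerName: "api.openai.com"})
    tlsConn.SetDeadline(time.Now().Add(5 * time.Second))

    start := time.Now()
    if err := tlsConn.Handshake(); err != nil {
        log.Fatalf("tls handshake: %v", err)
    }
    state := tlsConn.ConnectionState()
    log.Printf("handshake: %v, %d certs in chain", time.Since(start), len(state.PeerCertificates))
}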
Proxy jitter
Symptoms: intermittent timeouts that shrink or vanish when you bypass the proxy.
Fix: compare direct vs. proxy success rates, add proxy health checks, and retry idempotent requests only.
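A sketch of the retry-idempotent-only rule with capped, jittered backoff; the attempt count and delays are illustrative starting points, not tuned values.

package main

import (
    "context"
    "fmt"
    "io"
    "log"
    "math/rand"
    "net/http"
    "time"
)

// getWithRetry retries only an idempotent GET; a POST that creates
// state should not go through this path.
func getWithRetry(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    var lastErr error
    for attempt := 0; attempt < 3; attempt++ {
        if attempt > 0 {
            // Linear backoff plus jitter, so retries don't synchronize.
            backoff := time.Duration(attempt)*500*time.Millisecond +
                time.Duration(rand.Int63n(int64(250*time.Millisecond)))
            select {
            case <-time.After(backoff):
            case <-ctx.Done():
                return nil, ctx.Err()
            }
        }
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := client.Do(req)
        if err != nil {
            lastErr = err
            continue
        }
        if resp.StatusCode < 500 {
            return resp, nil
        }
        // Drain and close the 5xx body so the connection can be reused.
        io.Copy(io.Discard, resp.Body)
        resp.Body.Close()
        lastErr = fmt.Errorf("server error: %s", resp.Status)
    }
    return nil, lastErr
}

func main() {
    client := &http.Client{Timeout: 45 * time.Second}
    resp, err := getWithRetry(context.Background(), client, "https://api.openai.com/v1/models")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    log.Println("status:", resp.Status)
}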
Connection pool starvation
Symptoms: requests queue under load while DNS, TLS, and the network all look healthy.
Fix: use one long-lived client per process, tune the pool sizes, and always drain and close the response body so connections return to the idle pool.
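A sketch of that drain-and-close discipline with one long-lived client; callOnce and the URL are illustrative.

package main

import (
    "context"
    "io"
    "log"
    "net/http"
    "time"
)

// One long-lived client per process; never one per request.
var apiClient = &http.Client{Timeout: 45 * time.Second}

func callOnce(ctx context.Context, url string) error {
    req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
    if err != nil {
        return err
    }
    resp, err := apiClient.Do(req)
    if err != nil {
        return err
    }
    // Drain anything unread, then close: a half-read body keeps the
    // connection checked out and eventually starves MaxConnsPerHost.
    defer resp.Body.Close()
    defer io.Copy(io.Discard, resp.Body)

    _, err = io.ReadAll(resp.Body) // real code would decode JSON here
    return err
}

func main() {
    if err := callOnce(context.Background(), "https://api.openai.com/v1/models"); err != nil {
        log.Fatal(err)
    }
}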
4) Conservative baseline config
- Dial timeout: 3s
- TLS handshake: 5s
- Response header timeout: 30s
- Client timeout: 45s
- MaxConnsPerHost: 100
- MaxIdleConnsPerHost: 50
5) Final takeaway
Timeouts are rarely random. Break the path into its hops, instrument each one, and fix the bottleneck instead of retrying blindly.