If your WebSocket keeps reconnecting every few seconds, it's usually not your app code. In most production incidents the root cause is a misconfigured reverse-proxy chain: Nginx itself, or a proxy layer sitting in front of it.

This guide gives you a copy-paste checklist to stabilize WebSocket connections fast.

1) Missing Upgrade/Connection headers (no 101 handshake)

WebSocket starts as an HTTP/1.1 request that is upgraded in place. The Upgrade and Connection headers are hop-by-hop, so Nginx does not forward them by default; without explicit forwarding, the upstream sees a plain HTTP request and never returns the 101 handshake.

# Map the client's Upgrade header to the Connection header sent upstream:
# "upgrade" for WebSocket requests, "close" for everything else.
map $http_upgrade $connection_upgrade {
  default upgrade;
  ''      close;
}

location /ws/ {
  proxy_pass http://127.0.0.1:8080;
  proxy_http_version 1.1;                           # the upgrade handshake requires HTTP/1.1 to the upstream
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection $connection_upgrade;
}

Verify (forcing HTTP/1.1, since curl may otherwise negotiate HTTP/2 and never attempt the upgrade):

curl -i -N --http1.1 \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: SGVsbG9Xb3JsZDEyMw==" \
  https://example.com/ws/

You should get HTTP/1.1 101 Switching Protocols.

2) Idle timeout is too short

A very common killer: proxy_read_timeout defaults to 60 seconds, so Nginx closes any proxied socket that stays quiet for a minute, even if the connection is perfectly healthy.

location /ws/ {
  proxy_connect_timeout 15s;
  proxy_send_timeout 60s;
  proxy_read_timeout 3600s;   # how long an idle socket may stay silent before Nginx closes it
}

Also send ping/pong heartbeats every 20-30 seconds from the server or the client, so traffic always crosses the socket well inside the timeout window.
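
To check whether heartbeats actually keep a connection alive through every proxy layer, websocat can hold a test connection open with periodic pings, if it is installed. The flag below is from websocat 1.x; treat the exact name as an assumption and check your version:

# hold a test connection open, sending a protocol-level ping every 20s
# (websocat 1.x; the flag name may differ in other versions)
websocat --ping-interval 20 wss://example.com/ws/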

3) Load-balanced upstream without session stickiness

An established WebSocket stays pinned to one upstream, but every reconnect (and any pre-upgrade HTTP request) is a new dice roll: the handshake lands on node A, the retry lands on node B, the in-memory session is gone, and the socket drops again.

upstream ws_backend {
  ip_hash;
  server 10.0.0.11:8080;
  server 10.0.0.12:8080;
}
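
ip_hash keys on the client IP, which breaks down when many clients arrive from one NAT or corporate egress address. If your clients carry an application token, a hedged alternative is consistent hashing on that value; the sketch below assumes the app sets a sessionid cookie before the handshake:

upstream ws_backend {
  # hash on an application cookie instead of the client IP
  # (assumes a "sessionid" cookie is set before the WebSocket handshake)
  hash $cookie_sessionid consistent;
  server 10.0.0.11:8080;
  server 10.0.0.12:8080;
}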

If possible, move session state to a shared store such as Redis and drop the stickiness dependency altogether.

4) Another proxy layer closes first (CDN/LB/Ingress)

Even if Nginx allows an hour of idle time, an outer ALB, CDN, or ingress controller may cut the connection at its own default, often 60 seconds.

Check every layer and align idle timeouts to a sane baseline (300s or more for many real-time apps), keeping the heartbeat interval comfortably below the smallest timeout in the chain.
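
As one example, if the outer layer is an AWS ALB, the idle timeout is a load balancer attribute you can read and change from the CLI (substitute your own load balancer ARN):

# read the ALB's current idle timeout
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn <your-alb-arn> \
  --query "Attributes[?Key=='idle_timeout.timeout_seconds']"

# raise it to 300s to match the rest of the chain
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn <your-alb-arn> \
  --attributes Key=idle_timeout.timeout_seconds,Value=300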

5) Protocol mismatch (http2, ws, wss)

The public edge can negotiate HTTP/2 with browsers, but the classic WebSocket Upgrade handshake is an HTTP/1.1 mechanism, so the hop from Nginx to the upstream must stay on HTTP/1.1.

Keep this consistent:

  • Client uses wss://example.com/ws/
  • Nginx upstream uses proxy_http_version 1.1
  • TLS cert chain is complete

Verify the chain:

openssl s_client -connect example.com:443 -servername example.com </dev/null
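
To see which protocol your edge actually negotiates with clients, the ALPN result is also visible with openssl (the exact output wording varies by version):

# print the negotiated ALPN protocol (h2 vs http/1.1)
openssl s_client -alpn h2,http/1.1 -connect example.com:443 -servername example.com </dev/null 2>/dev/null | grep -i alpn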

6) Buffering/compression side effects

For WebSocket paths, turn off response buffering and compression; both are built for request/response traffic, not long-lived streams.

location /ws/ {
  proxy_buffering off;
  gzip off;
}

If messages arrive in bursts instead of in real time, buffering somewhere in the chain is often the culprit.

7) No observability = blind debugging

Add a focused log format and track handshake success and disconnect rates.

log_format ws '$remote_addr - $host [$time_local] '
              '"$request" $status $body_bytes_sent '
              'rt=$request_time urt=$upstream_response_time '
              'ua="$http_user_agent" up="$upstream_addr"';

access_log /var/log/nginx/ws_access.log ws;
error_log  /var/log/nginx/ws_error.log warn;

Track at least:

  • active WebSocket connections
  • disconnects per minute
  • HTTP 101 success ratio (a quick way to pull this from the log is shown below)
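
As a rough sketch, the 101 ratio can be read straight out of that access log from the shell; the awk field position assumes the usual three-token request line ("GET /path HTTP/1.1") inside $request:

# rough 101 handshake ratio from the ws log
# ($9 is $status only while $request keeps its usual three tokens)
total=$(wc -l < /var/log/nginx/ws_access.log)
ok=$(awk '$9 == 101' /var/log/nginx/ws_access.log | wc -l)
echo "101 handshakes: $ok / $total requests"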

Production-ready baseline snippet

map $http_upgrade $connection_upgrade {
  default upgrade;
  ''      close;
}

upstream ws_backend {
  least_conn;   # swap for ip_hash or a consistent hash if you still need stickiness (see 3)
  server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
}

server {
  listen 443 ssl http2;
  server_name example.com;
  # ssl_certificate / ssl_certificate_key omitted for brevity

  location /ws/ {
    proxy_pass http://ws_backend;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;

    proxy_connect_timeout 15s;
    proxy_send_timeout 60s;
    proxy_read_timeout 3600s;

    proxy_buffering off;
    gzip off;
  }
}
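
After dropping this in, validate the config and reload (assuming a systemd-managed Nginx; adjust for your init system):

sudo nginx -t && sudo systemctl reload nginx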

Summary

Most WebSocket disconnect incidents are boring infrastructure issues, not mysterious app bugs.

Fix these first:

  1. Correct upgrade headers + HTTP/1.1 proxying
  2. Longer read timeout + heartbeat
  3. Aligned idle timeouts across all proxy layers

Do this and your reconnect storm usually disappears.