WebSockets Behind an Nginx Reverse Proxy Keep Disconnecting: 7 Common Pitfalls and Copy-Paste Fixes

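When a WebSocket connects and then drops, the cause is usually not the backend code but an incomplete Nginx reverse-proxy configuration. Typical symptoms: the frontend reconnects in a loop, the error log shows upstream timed out, or the handshake never returns 101 Switching Protocols.

This post gives you a troubleshooting and fix checklist you can copy directly. Work through it in order; in most cases the connection is stable within about 30 minutes.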

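1) `Upgrade/Connection` headers not forwarded, so the handshake never gets a 101

A WebSocket connection starts as an HTTP request that is then upgraded. Upgrade and Connection are hop-by-hop headers, so Nginx does not pass them to the upstream unless you set them explicitly; without them the backend treats the request as plain HTTP.

Minimal working configuration: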

map $http_upgrade $connection_upgrade {
  default upgrade;
  ''      close;
}

server {
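  listen 443 ssl http2;  # on nginx >= 1.25.1, prefer "listen 443 ssl;" plus "http2 on;"
  server_name example.com;
  # Placeholder certificate paths (assumption); point these at your real files.
  ssl_certificate     /etc/nginx/ssl/example.com.fullchain.pem;
  ssl_certificate_key /etc/nginx/ssl/example.com.key;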

  location /ws/ {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
  }
}

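Verify:

curl -i -N --http1.1 \
  -H "Connection: Upgrade" \
  -H "Upgrade: websocket" \
  -H "Sec-WebSocket-Version: 13" \
  -H "Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==" \
  https://example.com/ws/

At a minimum you should see HTTP/1.1 101 Switching Protocols. The --http1.1 flag matters because the server block enables http2: without it curl may negotiate HTTP/2, which has no Upgrade mechanism. The key must be the base64 encoding of exactly 16 bytes (the one here is the sample value from RFC 6455); some backends reject anything else.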
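
To confirm that the 101 really comes from a WebSocket server and not from something in between, you can compare the Sec-WebSocket-Accept response header with the value derived from the key you sent. Below is a minimal Python sketch of that derivation as defined in RFC 6455; the function names expected_accept and is_valid_key are hypothetical helpers, not part of any library.

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def expected_accept(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value a compliant server sends for this key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


def is_valid_key(sec_websocket_key: str) -> bool:
    """A valid key is the base64 encoding of exactly 16 bytes."""
    try:
        return len(base64.b64decode(sec_websocket_key, validate=True)) == 16
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False


if __name__ == "__main__":
    # Sample key from RFC 6455; substitute the key you actually sent with curl.
    print(expected_accept("dGhlIHNhbXBsZSBub25jZQ=="))
```

For the RFC 6455 sample key the result is s3pPLMBiTxaQ9kYGzzhZRbK+xOo=. If the header in the response differs from the computed value, the 101 was probably not produced by a spec-compliant WebSocket endpoint.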

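2) Timeouts too short, so Nginx kills idle connections

The most common drop point: proxy_read_timeout defaults to 60s, so if the upstream sends nothing for 60 seconds, Nginx closes the connection.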

location /ws/ {
  proxy_pass http://127.0.0.1:8080;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection $connection_upgrade;

  proxy_connect_timeout 15s;
  proxy_send_timeout 60s;
  proxy_read_timeout 3600s;
}

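Also have the server or the client send a ping/pong heartbeat every 20-30 seconds; do not treat the connection as a pipe that can stay silent forever.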
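
The heartbeat rule can be sketched as a small piece of timing logic. The class below is a hypothetical helper (HeartbeatTracker is not from any WebSocket library): it does no network I/O and only decides when a ping is due and when a silent peer should be considered gone. The 25s interval and 10s pong timeout are assumed defaults that you should tune to your own timeouts.

```python
import time
from typing import Callable, Optional


class HeartbeatTracker:
    """Timing logic for ping/pong; the caller performs the actual sends."""

    def __init__(self, ping_interval: float = 25.0, pong_timeout: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ping_interval = ping_interval
        self.pong_timeout = pong_timeout
        self._clock = clock
        self._last_traffic = clock()
        self._ping_sent_at: Optional[float] = None

    def on_traffic(self) -> None:
        """Call for every received frame, including pong."""
        self._last_traffic = self._clock()
        self._ping_sent_at = None

    def should_ping(self) -> bool:
        """True when the link has been quiet for ping_interval and no ping is pending."""
        if self._ping_sent_at is not None:
            return False
        return self._clock() - self._last_traffic >= self.ping_interval

    def on_ping_sent(self) -> None:
        """Call right after the caller has sent a ping frame."""
        self._ping_sent_at = self._clock()

    def is_dead(self) -> bool:
        """True when a ping has gone unanswered for longer than pong_timeout."""
        if self._ping_sent_at is None:
            return False
        return self._clock() - self._ping_sent_at > self.pong_timeout
```

In a real client or server you would call should_ping() from a periodic timer, send the ping with whatever WebSocket library you use, and close and reconnect once is_dead() returns True. Because each ping is answered by a pong, traffic flows in both directions and resets the idle timers on every hop.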

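3) The upstream is load-balanced but has no session stickiness

A single WebSocket connection stays on the backend that accepted it, but a reconnect (or any companion HTTP request) can be routed elsewhere. If you handshake with machine A and the next connection lands on machine B, the in-memory session state no longer matches, and the result looks like random disconnects.

If your application relies on in-memory sessions, enable a sticky policy first (or move sessions to a shared store such as Redis).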

upstream ws_backend {
  ip_hash;  # simple stickiness; good enough, but not perfect
  server 10.0.0.11:8080;
  server 10.0.0.12:8080;
}

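location /ws/ {
  proxy_pass http://ws_backend;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection $connection_upgrade;
}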
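
To see why address-based stickiness works, and why it is not perfect, here is an illustrative Python sketch. It is not nginx's actual hash function; the only property it borrows from the documented ip_hash behavior is the hashing key: the first three octets of an IPv4 address, or the whole IPv6 address. The function name pick_backend and the use of CRC32 are assumptions made for the example.

```python
import ipaddress
import zlib
from typing import List


def pick_backend(client_ip: str, backends: List[str]) -> str:
    """Map a client address to a backend so the same client keeps the same server.

    Illustration only: the hash (CRC32) is a stand-in, not nginx's algorithm.
    """
    addr = ipaddress.ip_address(client_ip)
    # IPv4: key on the first three octets; IPv6: key on the full address.
    key = addr.packed[:3] if addr.version == 4 else addr.packed
    return backends[zlib.crc32(key) % len(backends)]


if __name__ == "__main__":
    servers = ["10.0.0.11:8080", "10.0.0.12:8080"]
    # Both clients share the 203.0.113.0/24 prefix, so they get the same backend.
    print(pick_backend("203.0.113.7", servers), pick_backend("203.0.113.200", servers))
```

The second call shows the weak spot: every client in the same /24, or behind the same NAT, is pinned to one server, so load can become uneven.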

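4) Another proxy layer in the chain (CDN/LB/Ingress) disconnects first

Many people look only at Nginx and miss the idle timeout of the ALB/CDN/Ingress in front of it. The result: Nginx is configured for 1 hour, but the layer in front cuts the connection after 60 seconds.

Troubleshooting order:

In the browser developer tools, check the close code and the timing pattern (drops at a fixed interval point to an idle timeout)
Align the Nginx access and error logs on the same timeline
Check the idle timeout of the outer LB/CDN (set every layer to >= 300s)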
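
The connection survives only as long as the tightest layer allows, so it can help to write the timeouts down and check them together. The sketch below assumes you have collected each layer's idle timeout by hand; the layer names, the example values, and the function check_idle_timeouts are all hypothetical. The "heartbeat fits twice into the window" margin is a rule of thumb, not a requirement of any product.

```python
from typing import Dict, Tuple


def check_idle_timeouts(layer_timeouts: Dict[str, float],
                        heartbeat_interval: float) -> Tuple[str, float, bool]:
    """Return (tightest layer, its timeout in seconds, whether the heartbeat is safe).

    "Safe" means the heartbeat fires at least twice within the tightest idle
    window, so a single lost ping does not drop the connection.
    """
    layer = min(layer_timeouts, key=layer_timeouts.get)
    timeout = layer_timeouts[layer]
    return layer, timeout, heartbeat_interval * 2 <= timeout


if __name__ == "__main__":
    # Example values are assumptions; read the real ones from each layer's config.
    layers = {"cdn": 300.0, "load_balancer": 60.0, "nginx_proxy_read": 3600.0}
    print(check_idle_timeouts(layers, heartbeat_interval=25.0))
```

With these example values the load balancer is the limiting layer at 60 seconds, and a 25-second heartbeat still fits; a 45-second heartbeat would not.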

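5) Mixing `http2`, `ws`, and `wss` breaks the handshake

You can enable http2 for clients, but the WebSocket reverse proxy itself still relies on HTTP/1.1 Upgrade semantics. Common pitfalls are mismatched paths, schemes, or certificate chains.

Recommendations:

Have the frontend always use wss://example.com/ws/
Pin proxy_http_version 1.1 when proxying to the upstream
Serve a complete certificate chain to avoid intermittent TLS handshake failures

Quick certificate check: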

openssl s_client -connect example.com:443 -servername example.com </dev/null

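6) Wrong Nginx buffering/compression settings cause latency or errors

WebSocket traffic generally should not go through ordinary response buffering. After the 101 response Nginx relays the connection as a raw tunnel, so these directives mainly affect plain HTTP responses served from the same location, such as a long-polling fallback; turning them off here is a safe default.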

location /ws/ {
  proxy_pass http://127.0.0.1:8080;
  proxy_buffering off;
  gzip off;
}

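If messages seem to arrive in batches instead of one at a time, buffering somewhere in the chain is the first thing to suspect.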

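7) No observability, so you can only guess when something breaks

Log the key fields so you can at least answer who disconnected, when, and at which layer.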

log_format ws '$remote_addr - $host [$time_local] '
              '"$request" $status $body_bytes_sent '
              'rt=$request_time urt=$upstream_response_time '
              'ua="$http_user_agent" up="$upstream_addr"';

access_log /var/log/nginx/ws_access.log ws;
error_log  /var/log/nginx/ws_error.log warn;

Then add minimal monitoring:

Current connection count (active WebSockets)
Disconnects per minute
Handshake 101 success rate
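
The last metric can be derived from the access log alone. Here is a minimal Python sketch that parses lines written with the ws log_format above and counts handshake attempts and 101 responses under /ws/. The function name handshake_stats and the regular expression are assumptions written for this format, so adjust them if you change the log_format.

```python
import re
from typing import Iterable, List, Tuple

# Matches the leading fields of a line written with the "ws" log_format.
LINE_RE = re.compile(
    r'^(?P<addr>\S+) - (?P<host>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+) '
    r'rt=(?P<rt>[\d.]+)'
)


def handshake_stats(lines: Iterable[str],
                    path_prefix: str = "/ws/") -> Tuple[int, int, List[float]]:
    """Return (handshake attempts, 101 responses, session durations in seconds)."""
    attempts = 0
    upgraded = 0
    durations: List[float] = []
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip lines written in another format
        parts = m.group("request").split()
        if len(parts) < 2 or not parts[1].startswith(path_prefix):
            continue  # not a request to the WebSocket location
        attempts += 1
        if m.group("status") == "101":
            upgraded += 1
            # nginx writes the entry when the connection closes, so for an
            # upgraded request $request_time is the session length.
            durations.append(float(m.group("rt")))
    return attempts, upgraded, durations
```

The success rate is upgraded / attempts. The durations list is also worth a look: many sessions ending at almost the same length (for example close to 60 seconds) suggest an idle timeout at some layer rather than an application error.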

A complete snippet you can drop in

map $http_upgrade $connection_upgrade {
  default upgrade;
  ''      close;
}

upstream ws_backend {
  least_conn;
  server 127.0.0.1:8080 max_fails=3 fail_timeout=30s;
}

server {
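  listen 443 ssl http2;  # on nginx >= 1.25.1, prefer "listen 443 ssl;" plus "http2 on;"
  server_name example.com;
  # Placeholder certificate paths (assumption); point these at your real files.
  ssl_certificate     /etc/nginx/ssl/example.com.fullchain.pem;
  ssl_certificate_key /etc/nginx/ssl/example.com.key;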

  location /ws/ {
    proxy_pass http://ws_backend;
    proxy_http_version 1.1;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;

    proxy_connect_timeout 15s;
    proxy_send_timeout 60s;
    proxy_read_timeout 3600s;

    proxy_buffering off;
    gzip off;
  }
}

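Summary

The easiest ways to break a WebSocket connection are not advanced architecture problems but several small configuration gaps stacked together. Fix these four first: upgrade headers, timeouts, upstream consistency, and outer-proxy timeouts. Only then look at the application code; working in that order is far more efficient.

If you just need to put out the fire, do these three things first:

Add Upgrade/Connection plus proxy_http_version 1.1
Raise proxy_read_timeout and add a heartbeat
Align the idle timeout of every intermediate layer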
