Our ACP demo server connects to a backend via WebSocket. The connection should take under 500ms — our Playwright pressure test confirmed this. But when opening the browser, it consistently took 20-30 seconds. The UI showed a blank "connecting..." state for an eternity.
This post documents the debugging strategy that found three stacked root causes, and the fix.
Symptoms
Opening localhost:5688 after a fresh browser launch:
- First WebSocket attempt hangs for exactly 10 seconds, then errors
- Second attempt also hangs for exactly 10 seconds
- Third attempt connects in ~300ms
- Total: ~23 seconds of dead time
Debugging Strategy: Logs on Both Sides
The trap with WebSocket latency is that the problem can live on either side. We started by adding logs to the server — and saw nothing wrong. Connections appeared instantly from the server's perspective. That was the clue: the bottleneck was entirely client-side.
Step 1: Structured Backend Logging
We replaced all hand-rolled console.log calls with pino structured logging. Every WebSocket lifecycle event got a structured log with timing fields:
```ts
// Before: string interpolation scattered through business logic
console.log(`[ws] open client=${clientId} path=${path} ${Date.now()}ms`);

// After: structured fields, automatic timestamps, separate from business logic
log.info({ client: clientId, path, boot: bootMs() }, "ws: open");
```

This gave us clean, filterable server logs with boot (ms since server start), durationMs on every operation, and session IDs for correlation. We could immediately see that the server-side connection handling took <1ms. The problem wasn't here.
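For reference, here is a minimal sketch of the logger setup, assuming a stock pino logger and a hand-rolled bootMs helper (the field names are our convention, not pino's):

```ts
import pino from "pino";

const log = pino({ level: "info" });

// ms since the server process started; attached to every lifecycle log
const serverStart = Date.now();
function bootMs(): number {
  return Date.now() - serverStart;
}

// Usage: one structured event per WebSocket lifecycle point
log.info({ client: "c1", path: "/ws", boot: bootMs() }, "ws: open");
```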
Step 2: Client-Side Profiling Timestamps
We added performance.now() timestamps at every WebSocket lifecycle point in the React component, measured relative to page navigation start:
```ts
function pageMs(): string {
  return performance.now().toFixed(0) + "ms";
}

// Every lifecycle point gets a timestamp
console.log(`[${pageMs()}] ws new WebSocket(${wsUrl})`);
// ...
console.log(`[${pageMs()}] ws OPEN (handshake=${handshakeMs}ms)`);
// ...
console.log(`[${pageMs()}] ws CLOSED code=${ev.code}`);
```

The key points instrumented (pulled together in a sketch after this list):
- useEffect mount/cleanup — when React runs the effect
- new WebSocket() constructor — when the connection attempt starts
- onopen — when the TCP + WS handshake completes
- onerror/onclose — when connections fail
- onmessage (first) — when the first server message arrives
- Connection timeout — when we give up and retry
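Assembled in one place, the instrumented effect looked roughly like this. It is a simplified sketch, not the demo's exact code: the hook name is illustrative, reconnect state is omitted, and it reuses the pageMs helper above.

```tsx
import { useEffect } from "react";

// Simplified sketch of the instrumented effect: reconnect/backoff state
// and error details are omitted for brevity.
function useInstrumentedWs(wsUrl: string) {
  useEffect(() => {
    console.log(`[${pageMs()}] useEffect mount`);
    const t0 = performance.now();
    const ws = new WebSocket(wsUrl);
    console.log(`[${pageMs()}] ws new WebSocket(${wsUrl})`);

    let gotFirstMessage = false;
    ws.onopen = () => {
      const handshakeMs = (performance.now() - t0).toFixed(0);
      console.log(`[${pageMs()}] ws OPEN (handshake=${handshakeMs}ms)`);
    };
    ws.onerror = () => console.log(`[${pageMs()}] ws ERROR`);
    ws.onclose = (ev) => console.log(`[${pageMs()}] ws CLOSED code=${ev.code}`);
    ws.onmessage = () => {
      if (!gotFirstMessage) {
        gotFirstMessage = true;
        console.log(`[${pageMs()}] ws first message`);
      }
    };

    // The connection timeout we had at the time (Root Cause 2 below)
    const connectTimeout = setTimeout(() => {
      if (ws.readyState === WebSocket.CONNECTING) {
        console.log(`[${pageMs()}] ws connect TIMEOUT`);
        ws.close();
      }
    }, 10_000);

    return () => {
      console.log(`[${pageMs()}] useEffect cleanup (readyState=${ws.readyState})`);
      clearTimeout(connectTimeout);
      ws.close();
    };
  }, [wsUrl]);
}
```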
Step 3: Read the Timeline
With both sides instrumented, we could read the full story:
```
[970ms]   useEffect mount → new WebSocket(ws://localhost:5689/ws)
[971ms]   useEffect cleanup → close(readyState=CONNECTING)
[972ms]   useEffect mount → new WebSocket(ws://localhost:5689/ws)
[10981ms] ERROR (10 seconds later!)
[11981ms] retry → new WebSocket
[21983ms] ERROR (another 10 seconds!)
[22984ms] retry → new WebSocket
[23300ms] OPEN (316ms handshake)
```

Three patterns jumped out:
- mount → cleanup → mount within 2ms — React StrictMode double-mount
- Exactly 10s gaps — our connection timeout, not a network issue
- 316ms handshake on the successful attempt — way too slow for localhost
Each pattern pointed to a distinct root cause.
Root Cause 1: React StrictMode Creates Zombie WebSockets
In development, React StrictMode double-mounts every component: mount → unmount → mount. Our WebSocket useEffect:
- Mount 1 (970ms): Creates WebSocket #1, starts TCP handshake
- Cleanup (971ms): Calls ws.close() on WebSocket #1 — now in CLOSING state
- Mount 2 (972ms): Creates WebSocket #2, starts TCP handshake
WebSocket #1 is now a zombie — its JavaScript handlers are nulled, but the browser is still completing the TCP close handshake with the server. Chrome's connection pool sees an active connection to localhost:5689 and queues WebSocket #2 behind it.
The result: WebSocket #2 can't connect until WebSocket #1's TCP close fully completes, which is gated by our 10-second connection timeout.
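For the record, a common workaround that keeps StrictMode (not the route we took) is to defer the connect by a timer tick, so the dev-only mount/cleanup/mount cycle cancels the first attempt before any socket exists. A sketch:

```tsx
import { useEffect } from "react";

// Sketch: schedule the connect instead of opening it synchronously.
// StrictMode's extra mount/cleanup typically runs before the 0ms timer
// fires, so the first mount's connect is cancelled, only the surviving
// mount constructs a WebSocket, and nothing is left in CLOSING state.
function useDeferredWs(wsUrl: string) {
  useEffect(() => {
    let ws: WebSocket | undefined;
    const timer = setTimeout(() => {
      ws = new WebSocket(wsUrl);
    }, 0);
    return () => {
      clearTimeout(timer);
      ws?.close();
    };
  }, [wsUrl]);
}
```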
Root Cause 2: 10-Second Connection Timeout
We had a "safety" timeout that killed WebSocket connections after 10 seconds:
```ts
connectTimeout = setTimeout(() => {
  if (ws.readyState === WebSocket.CONNECTING) {
    ws.close(); // kill it, try again
    reconnectTimer = setTimeout(connect, 1000);
  }
}, 10_000); // 10 seconds!
```

For a localhost connection that should complete in <100ms, 10 seconds is absurd. Each failed attempt burned the full 10 seconds before retrying.
Root Cause 3: localhost Triggers IPv6 Fallback
The WebSocket URL used localhost:
```ts
const wsUrl = `ws://localhost:5689/ws`;
```

Chrome resolves localhost and tries IPv6 (::1) first. Our Bun server listens on 0.0.0.0 (IPv4 only). Chrome's IPv6 attempt fails, then it falls back to IPv4 127.0.0.1. This added hundreds of milliseconds to every successful connection — the "316ms handshake" for what should be a <10ms localhost connection.
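A complementary server-side option, noted here as an assumption rather than something we shipped, is to bind the listener to an address that also accepts IPv6. A sketch against Bun.serve, with placeholder handler bodies:

```ts
// Sketch: binding to "::" gives a dual-stack listener on most OSes, so
// Chrome's first attempt against ::1 succeeds instead of failing over
// to 127.0.0.1. Whether "::" is truly dual-stack depends on the OS's
// IPV6_V6ONLY default, so verify on your target platform.
Bun.serve({
  port: 5689,
  hostname: "::", // instead of "0.0.0.0" (IPv4-only)
  fetch(req, server) {
    if (server.upgrade(req)) return; // hand off WebSocket upgrades
    return new Response("expected a WebSocket upgrade", { status: 400 });
  },
  websocket: {
    message(_ws, _data) {
      // placeholder: the demo's real message handling goes here
    },
  },
});
```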
The Fix
1. Remove StrictMode. This project is a dev tool, not a production app, so losing StrictMode's checks is an acceptable trade. No double-mount means no zombie WebSockets and no queued connections.
```tsx
// Before
createRoot(root).render(<StrictMode><App /></StrictMode>);

// After
createRoot(root).render(<App />);
```

2. Reduce the connection timeout to 1 second with a 100ms retry.
```ts
connectTimeout = setTimeout(() => {
  if (ws.readyState === WebSocket.CONNECTING) {
    ws.close();
    setTimeout(connect, 100);
  }
}, 1000);
```

3. Use 127.0.0.1 instead of localhost to skip IPv6 resolution entirely.
```ts
const wsHost = location.port === "5688"
  ? "127.0.0.1:5689"
  : location.host;
```

Result
Before: 23 seconds. After: under 500ms, matching what our Playwright pressure test had shown all along.
Validating with a Pressure Test
We wrote a Playwright-based pressure test that measures real cold-start latency — spawning the server and connecting immediately without waiting for warmup, just like a real user:
```ts
// Phase 1: Cold start — connect immediately, retry until server is ready
const coldResult = await measureTab(context, "cold-1", "cold",
  `ws://127.0.0.1:${port}/ws`);

// Phase 2: Warm — open N tabs simultaneously
const warmResults = await Promise.all(
  Array.from({ length: 20 }, (_, i) =>
    measureTab(context, `warm-${i + 1}`, "warm",
      `ws://127.0.0.1:${port}/ws`)));
```

Results after the fix: cold start p99 < 200ms, warm p99 < 500ms for 20 simultaneous tabs.
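measureTab itself isn't shown above. Here is a plausible reconstruction, assuming it opens a tab, times new WebSocket() to onopen inside the page, and retries on failure so the cold phase tolerates a server that isn't listening yet; the retry interval and return shape are assumptions, not the demo's actual code.

```ts
import type { BrowserContext } from "playwright";

// Hypothetical sketch of measureTab, not the demo's real implementation.
async function measureTab(
  context: BrowserContext,
  id: string,
  phase: "cold" | "warm",
  url: string,
) {
  const page = await context.newPage();
  const connectMs = await page.evaluate(async (wsUrl) => {
    // Retry in-page until the server accepts the handshake
    for (;;) {
      const t0 = performance.now();
      try {
        await new Promise<void>((resolve, reject) => {
          const ws = new WebSocket(wsUrl);
          ws.onopen = () => { ws.close(); resolve(); };
          ws.onerror = () => reject(new Error("connect failed"));
        });
        return performance.now() - t0;
      } catch {
        await new Promise((r) => setTimeout(r, 100)); // server not up yet
      }
    }
  }, url);
  await page.close();
  return { id, phase, connectMs };
}
```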
Key Takeaway
The connection was fast. The infrastructure around it — React StrictMode, aggressive timeouts, DNS resolution — made it slow. None of these would show up in a server-side profiling session.
The debugging strategy that worked:
- Add structured logs to the backend — ruled out server-side causes instantly
- Add timestamped profiling to the frontend — revealed the exact timeline of what the browser was doing
- Read the numbers — the patterns (mount/cleanup/mount, exact 10s gaps, 316ms handshake) each pointed directly at a root cause
- Validate with a pressure test — confirmed the fix works under load, not just for a single tab
When debugging latency: measure from the user's perspective first. The server was fine. The browser was the bottleneck.