Mengboy Tech Notes

Go Dual-Provider LLM Routing (OpenAI + Claude): Timeout Tiers, Cost Caps, and Fallback Control

If your Go service relies on one LLM provider, two failures hurt the most, timeout spikes and billing spikes. A real production setup is not just “add another provider”, it is a single control plane for routing, timeout tiers, cost caps, and fallback. This guide gives you a practical OpenAI + Claude dual-provider pattern with one priority, keep uptime first, then optimize quality. ...

Claude Code + GitHub Actions CI Self-Healing Pipeline: Error Attribution, Minimal Patches, and Human Approval Gates

If your CI keeps failing and engineers keep babysitting logs, you’re paying an invisible velocity tax. A production-grade AI self-healing pipeline is not “let the agent edit anything”. It’s a controlled loop: attribution, patching, approval, rollback. This post gives you a deployable baseline: Claude Code proposes a minimal fix patch, GitHub Actions enforces risk gates and regression checks, and humans only approve at high-impact checkpoints. ...

Claude API Rate-Limit Storm Playbook: Adaptive Concurrency, Jittered Backoff, and Quota Isolation

When Claude API starts returning 429 under high load, most systems don’t just slow down—they collapse: queue buildup, retry storms, upstream timeout chains, and pager noise. ...

Claude 3.7 + OpenAI Responses Dual-Stack Degradation Playbook: Timeout Probing, Circuit Cutover, and Error-Budget Dashboard

Running both Claude and OpenAI in production sounds resilient—until a slow failure hits: latency climbs, 429s spike, quality drifts, and everything still looks “up.” This guide gives you a practical dual-stack degradation runbook: timeout probing first, circuit-based cutover second, and an error-budget dashboard to keep business impact bounded. ...

Claude + OpenAI Dual-Provider Gateway Failover: Health Probes, Circuit Breaking, and SLA Fallback

If your production stack calls both Claude and OpenAI, the hard part is not API integration. The hard part is keeping user experience stable when one provider starts throwing 429/5xx spikes, regional latency, or timeout storms. This guide gives you a practical dual-provider gateway playbook: health probes, circuit breaking, SLA-aware fallback, and observability loops. The goal is not “never fail.” The goal is controlled failure with controlled cost and controlled latency. ...

OpenAI Responses Streaming in Production: Backpressure, Chunk Reassembly, and Timeout Budget

Most streaming failures are not about “can it stream”, but “does it stay stable under load”: broken chunks, stuck clients, timeout cascades, and retry storms. ...

Claude + OpenAI Model Routing Gateway: Latency Tiers, Cost Caps, and Quality Guardrails

Connecting both Claude and OpenAI in production is the easy part. The hard part is keeping the system stable across the quality-latency-cost triangle. Without a routing gateway, you usually get latency spikes, runaway bills, and ugly cascading failures. ...

OpenAI Responses + Go Stream Recovery: Delta Persistence, Resume Tokens, and Duplicate Chunk Dedup

In production, the painful part is not “streaming is slow.” It’s “streaming breaks and then duplicates output after reconnect.” This guide gives you a practical recovery loop: delta persistence + resume token + idempotent dedup, so reconnection does not replay garbage. ...

OpenAI Responses in Go Multi-Tenant Quota Governance: Token Buckets, Budget Circuit Breakers, and Cost Attribution

Most multi-tenant AI platforms fail for two boring reasons: one tenant saturates shared capacity, and finance discovers the burn too late. This guide gives you a practical Go blueprint: token-bucket throttling, budget circuit breakers, and request-level cost attribution. ...

Go + OpenAI Responses Agent Memory Layering: Short-Term Context, Long-Term Index, and Cost Caps

In production Go agents, the first thing that breaks is usually not model quality. It is memory management: context grows, bills spike, and answers drift. Use a 3-layer memory design: L1: short-term conversational window (seconds) L2: rolling summary (minutes) L3: long-term retrieval memory (days) ...