Recovery System
The satellite automatically detects and recovers from MCP server failures without manual intervention. Recovery works for HTTP/SSE servers (network failures) and stdio servers (process crashes).
Overview
The recovery system handles HTTP/SSE Servers (network failures, server downtime, connection timeouts) and Stdio Servers (process crashes up to 3 times in 5 minutes). Recovery is fully automatic for recoverable failures. Permanent failures (3+ crashes, OAuth token expired) require manual action.
Recovery Detection
Tool Execution Recovery
When a tool is executed on a server that was previously marked offline or in an error state, recovery is detected automatically.
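In concept the check is simple: if the server's last known status was unhealthy and a tool call now succeeds, the server has recovered. A minimal sketch, with hypothetical types and an `onRecovered` callback standing in for the real implementation:

```typescript
// Illustrative sketch: detect recovery when a tool call succeeds on a server
// that was previously marked offline or in an error state.
type ServerStatus = "online" | "offline" | "error" | "requires_reauth" | "permanently_failed";

interface ServerState {
  id: string;
  status: ServerStatus;
}

async function executeToolWithRecoveryDetection(
  server: ServerState,
  callTool: () => Promise<unknown>,
  onRecovered: (serverId: string) => void,
): Promise<unknown> {
  const wasUnhealthy = server.status === "offline" || server.status === "error";
  const result = await callTool(); // throws if the server is still down

  if (wasUnhealthy) {
    // The call succeeded, so the server is back: mark it online and let the
    // caller kick off asynchronous re-discovery (see "Automatic Re-Discovery").
    server.status = "online";
    onRecovered(server.id);
  }
  return result;
}
```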
Health Check Recovery
Backend health checks periodically test offline servers. When they respond again, the server's status is updated and its tools are re-discovered.
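A rough sketch of that polling behavior, assuming a hypothetical health endpoint per server and a `markRecovered` callback; the real backend check may differ:

```typescript
// Illustrative sketch: periodically probe servers currently marked offline
// and report the ones that respond again.
interface HealthCheckTarget {
  id: string;
  healthUrl: string; // hypothetical health endpoint
}

async function pollOfflineServers(
  targets: HealthCheckTarget[],
  markRecovered: (serverId: string) => Promise<void>,
): Promise<void> {
  for (const target of targets) {
    try {
      const response = await fetch(target.healthUrl);
      if (response.ok) {
        // Server answered: update its status and trigger re-discovery downstream.
        await markRecovered(target.id);
      }
    } catch {
      // Still unreachable; it will be probed again on the next polling cycle.
    }
  }
}

// Polling cadence matching the documented detection time (up to ~3 minutes):
// setInterval(() => pollOfflineServers(offlineTargets, markRecovered), 3 * 60 * 1000);
```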
Retry Logic (HTTP/SSE)
Before marking a server as offline, the satellite retries tool execution with exponential backoff. The table below lists which errors are retried; a sketch of the retry loop follows it.
Retryable vs Non-Retryable Errors
| Error Type | Action | Reason |
|---|---|---|
| ECONNREFUSED | Retry | Server may be restarting |
| ETIMEDOUT | Retry | Network hiccup, may recover |
| ENOTFOUND | Retry | DNS issue, may be temporary |
| fetch failed | Retry | Network error, transient |
| 401 Unauthorized | No retry | Token expired, retrying won’t help |
| 403 Forbidden | No retry | Access denied, retrying won’t help |
| OAuth errors | No retry | Auth issue, needs user action |
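A hedged sketch of how this classification and backoff could fit together. The pattern lists, attempt count, and delays below are illustrative, not the satellite's actual configuration:

```typescript
// Illustrative sketch: classify errors per the table above, then retry
// retryable failures with exponential backoff before giving up.
const RETRYABLE_PATTERNS = ["econnrefused", "etimedout", "enotfound", "fetch failed"];
const NON_RETRYABLE_PATTERNS = ["401", "403", "unauthorized", "forbidden", "oauth"];

function isRetryable(error: Error): boolean {
  const message = error.message.toLowerCase();
  if (NON_RETRYABLE_PATTERNS.some((p) => message.includes(p))) return false;
  return RETRYABLE_PATTERNS.some((p) => message.includes(p));
}

async function executeWithRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3, // illustrative attempt count
  baseDelayMs = 500, // illustrative base delay
): Promise<T> {
  let lastError: Error = new Error("no attempts made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err instanceof Error ? err : new Error(String(err));
      if (!isRetryable(lastError) || attempt === maxAttempts) break;
      // Exponential backoff between attempts: 500ms, 1s, 2s, ...
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  // Exhausted retries (or hit a non-retryable error): the caller maps this
  // error to a status value (see "Failure Status Mapping" below).
  throw lastError;
}
```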
Recovery Flow
When servers recover from failure, the satellite updates status and triggers re-discovery asynchronously without blocking tool execution responses. See Status Tracking - Status Lifecycle for complete recovery flow diagrams, including successful recovery, failed recovery, and status transitions.
Automatic Re-Discovery
When recovery is detected, tools are refreshed from the server without blocking the user (see the sketch after the list below).
Why Asynchronous Re-Discovery?
User Experience:
- Tool execution result returned immediately
- User doesn’t wait for tool discovery (can take 1-5 seconds)
- If re-discovery fails, user already got their result
- Tool response isn’t blocked by discovery errors
- Discovery failure doesn’t affect user’s current request
- Recovery can be retried later
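A minimal sketch of this fire-and-forget pattern, with hypothetical `executeTool` and `rediscoverTools` callbacks standing in for the real satellite internals:

```typescript
// Illustrative sketch: return the tool result immediately and refresh the
// server's tool list in the background, without awaiting the refresh.
async function handleToolExecution(
  executeTool: () => Promise<unknown>,
  rediscoverTools: () => Promise<void>,
  serverJustRecovered: boolean,
): Promise<unknown> {
  const result = await executeTool();

  if (serverJustRecovered) {
    // Deliberately not awaited: discovery (1-5 seconds) must not delay the
    // response, and a discovery failure must not turn a successful tool call
    // into an error for the user.
    rediscoverTools().catch((err) => {
      console.warn("Re-discovery after recovery failed; it can be retried later", err);
    });
  }
  return result;
}
```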
Tool Preservation
When re-discovery fails, tools are NOT removed from cache (a sketch follows this list):
- User can still see what tools are available
- Tools may work if server recovers later
- Better UX than empty tool list
- Discovery can be retried without losing tool metadata
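A small sketch of that preservation behavior, using a hypothetical `toolCache` map and `listTools` callback:

```typescript
// Illustrative sketch: replace the cached tool list only when discovery
// succeeds; on failure the previous entry is left untouched.
interface ToolInfo {
  name: string;
  description?: string;
}

const toolCache = new Map<string, ToolInfo[]>();

async function refreshTools(
  serverId: string,
  listTools: () => Promise<ToolInfo[]>,
): Promise<void> {
  try {
    const tools = await listTools();
    toolCache.set(serverId, tools); // update cache only on success
  } catch (err) {
    // Keep the stale tool list: the user still sees what the server offers,
    // and discovery can be retried later without losing tool metadata.
    console.warn(`Tool re-discovery failed for ${serverId}; keeping cached tools`, err);
  }
}
```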
Stdio Process Recovery
Stdio servers auto-restart after crashes (up to 3 times in 5 minutes).
Stdio Recovery Timeline
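The recovery window can be pictured as a rolling five-minute history of crash timestamps. A rough sketch of that bookkeeping, with hypothetical callbacks and an illustrative threshold check:

```typescript
// Illustrative sketch: restart a crashed stdio process unless it has crashed
// too many times within the rolling 5-minute window, in which case it is
// marked permanently_failed and left for manual recovery.
const CRASH_WINDOW_MS = 5 * 60 * 1000;
const MAX_CRASHES = 3;

const crashHistory = new Map<string, number[]>(); // serverId -> crash timestamps

function onStdioProcessExit(
  serverId: string,
  restart: (serverId: string) => void,
  markPermanentlyFailed: (serverId: string) => void,
): void {
  const now = Date.now();
  const recentCrashes = (crashHistory.get(serverId) ?? []).filter(
    (timestamp) => now - timestamp < CRASH_WINDOW_MS,
  );
  recentCrashes.push(now);
  crashHistory.set(serverId, recentCrashes);

  if (recentCrashes.length >= MAX_CRASHES) {
    // 3+ crashes in 5 minutes: give up on automatic recovery.
    markPermanentlyFailed(serverId);
  } else {
    restart(serverId);
  }
}
```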
Failure Status Mapping
When tool execution fails after all retries, error messages are mapped to appropriate status values.
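A sketch of such a mapping; the message patterns and resulting statuses below are illustrative rather than the satellite's authoritative list:

```typescript
// Illustrative sketch: map the final error (after all retries) to a status.
type FailureStatus = "offline" | "requires_reauth" | "error";

function mapFailureToStatus(error: Error): FailureStatus {
  const message = error.message.toLowerCase();

  // Auth failures need user action in the dashboard, not more retries.
  if (message.includes("401") || message.includes("unauthorized") || message.includes("oauth")) {
    return "requires_reauth";
  }
  // Network-level failures mean the server is unreachable.
  if (
    message.includes("econnrefused") ||
    message.includes("etimedout") ||
    message.includes("enotfound") ||
    message.includes("fetch failed")
  ) {
    return "offline";
  }
  // Anything else is treated as a generic error state.
  return "error";
}
```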
Debouncing Concurrent Recovery
Multiple tool executions may detect recovery simultaneously. Debouncing prevents duplicate re-discoveries (a sketch follows the sequence below):
- LLM executes 3 tools from the same server concurrently
- All 3 detect recovery (server was offline)
- Only first execution triggers re-discovery
- Other 2 skip (already in progress)
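A minimal sketch of this debouncing, using an in-flight set keyed by server ID (names are hypothetical):

```typescript
// Illustrative sketch: track in-flight re-discoveries so that concurrent tool
// executions detecting recovery at the same time trigger only one refresh.
const rediscoveryInProgress = new Set<string>();

async function triggerRediscovery(
  serverId: string,
  rediscover: (serverId: string) => Promise<void>,
): Promise<void> {
  if (rediscoveryInProgress.has(serverId)) {
    return; // another execution already started re-discovery for this server
  }
  rediscoveryInProgress.add(serverId);
  try {
    await rediscover(serverId);
  } finally {
    rediscoveryInProgress.delete(serverId);
  }
}
```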
Recovery Timing
| Recovery Type | Detection Time | Re-Discovery Time | Total |
|---|---|---|---|
| Tool Execution | Immediate (on next tool call) | 1-5 seconds | ~1-5s |
| Health Check | Up to 3 minutes (polling interval) | 1-5 seconds | ~3-8 min |
Manual Recovery (Requires User Action)
Some failures cannot auto-recover:
| Status | Reason | User Action |
|---|---|---|
| requires_reauth | OAuth token expired/revoked | Re-authenticate in dashboard |
| permanently_failed | 3+ crashes in 5 minutes (stdio) | Check logs, fix issue, manual restart |
Implementation Components
The recovery system consists of several integrated components:
- Stdio auto-recovery and permanently_failed status
- Tool execution retry logic and recovery detection
- Health check recovery via backend
Related Documentation
- Status Tracking - Status values and transitions
- Event Emission - Recovery status events
- Tool Discovery - Re-discovery after recovery
- Process Management - Stdio crash recovery

