# Recovery System

> Automatic recovery and failure handling for MCP servers

The satellite automatically detects and recovers from MCP server failures without manual intervention. Recovery works for HTTP/SSE servers (network failures) and stdio servers (process crashes).

## Overview

The recovery system handles two classes of failure:

* **HTTP/SSE servers:** network failures, server downtime, connection timeouts
* **Stdio servers:** process crashes, with auto-restart for the first two crashes in a 5-minute window

Recovery is fully automatic for recoverable failures. Permanent failures (a third crash within 5 minutes, an expired OAuth token) require manual action.

## Recovery Detection

### Tool Execution Recovery

When a tool call succeeds on a server whose status was previously `offline` or `error`, the satellite treats the success as a recovery:

```typescript
// services/satellite/src/core/mcp-server-wrapper.ts

async handleExecuteTool(toolPath: string, toolArguments: unknown) {
  const serverSlug = toolPath.split(':')[0];
  const statusEntry = this.toolDiscoveryManager?.getServerStatus(serverSlug);
  const wasOfflineOrError = statusEntry && ['offline', 'error'].includes(statusEntry.status);

  // Execute tool with retry logic
  const result = await this.executeHttpToolCallWithRetry(...);

  // If execution succeeded but server was offline/error → RECOVERY DETECTED
  // (not awaited, so the tool response is not delayed by re-discovery)
  if (wasOfflineOrError) {
    this.handleServerRecovery(serverSlug, config);
  }

  return result;
}
```

### Health Check Recovery

Backend health checks periodically test offline servers.
When they respond again:

```
Backend health check runs (every 3 minutes)
         ↓
Offline template now responds
         ↓
Backend sets installations to 'connecting'
         ↓
Backend sends 'configure' command with event='mcp_recovery'
         ↓
Satellite receives command and triggers re-discovery
         ↓
Status progresses: connecting → discovering_tools → online
```

## Retry Logic (HTTP/SSE)

Before marking a server as offline, the satellite attempts the tool call up to three times, backing off exponentially between attempts:

```typescript
// services/satellite/src/core/mcp-server-wrapper.ts

interface RetryConfig {
  maxRetries: 3;
  backoffMs: [500, 1000, 2000];  // Exponential: 500ms, 1s, 2s
}

async executeHttpToolCallWithRetry(
  serverConfig: McpServerConfig,
  toolName: string,
  args: unknown
): Promise<unknown> { // result type is not shown in this excerpt
  let lastError: Error;

  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      const response = await this.executeHttpToolCall(serverConfig, toolName, args);
      return response; // Success - no retry needed
    } catch (error) {
      lastError = error;

      // Non-retryable errors (auth failures) → fail immediately
      if (this.isNonRetryableError(error)) {
        throw error;
      }

      // Retryable errors (connection refused) → backoff and retry
      // (with 3 attempts, only the 500ms and 1s delays are actually used)
      if (attempt < 3) {
        const backoffMs = [500, 1000, 2000][attempt - 1];
        await new Promise(resolve => setTimeout(resolve, backoffMs));
      }
    }
  }

  // All retries exhausted → throw last error
  throw lastError;
}

private isNonRetryableError(error: Error): boolean {
  const msg = error.message.toLowerCase();
  return msg.includes('401') ||
         msg.includes('403') ||
         msg.includes('unauthorized') ||
         msg.includes('forbidden') ||
         msg.includes('oauth') ||
         msg.includes('authorization required');
}
```

### Retryable vs Non-Retryable Errors

| Error Type       | Action       | Reason                             |
| ---------------- | ------------ | ---------------------------------- |
| ECONNREFUSED     | **Retry**    | Server may be restarting           |
| ETIMEDOUT        | **Retry**    | Network hiccup, may recover        |
| ENOTFOUND        | **Retry**    | DNS issue, may be temporary        |
| fetch failed     | **Retry**    | Network error, transient           |
| 401 Unauthorized | **No retry** | Token expired, retrying won't help |
| 403 Forbidden    | **No retry** | Access denied, retrying won't help |
| OAuth errors     | **No retry** | Auth issue, needs user action      |

## Recovery Flow

When a server recovers from failure, the satellite updates its status and triggers re-discovery asynchronously, without blocking the tool execution response.

See [Status Tracking - Status Lifecycle](/development/satellite/status-tracking#status-lifecycle) for complete recovery flow diagrams including successful recovery, failed recovery, and status transitions.

## Automatic Re-Discovery

When recovery is detected, tools are refreshed from the server without blocking the user:

```typescript
// services/satellite/src/core/mcp-server-wrapper.ts

private async handleServerRecovery(
  serverSlug: string,
  config: McpServerConfig
): Promise<void> {
  // Prevent duplicate recovery attempts
  if (this.recoveryInProgress.has(serverSlug)) {
    return; // Already recovering
  }

  this.recoveryInProgress.add(serverSlug);

  try {
    this.logger.info({ serverSlug }, 'Server recovered - triggering re-discovery');

    // Emit status change to backend
    this.eventBus?.emit('mcp.server.status_changed', {
      installation_id: config.installation_id,
      team_id: config.team_id,
      status: 'connecting',
      status_message: 'Server recovered, re-discovering tools',
      timestamp: new Date().toISOString()
    });

    // Trigger re-discovery asynchronously (doesn't block tool response)
    await this.toolDiscoveryManager?.remoteToolManager?.discoverServerTools(serverSlug);

    this.logger.info({ serverSlug }, 'Tool re-discovery successful after recovery');
  } catch (error) {
    // Re-discovery failed (non-fatal, tool response still returned)
    this.logger.error({ serverSlug, error }, 'Tool re-discovery failed after recovery');
  } finally {
    this.recoveryInProgress.delete(serverSlug);
  }
}
```

### Why Asynchronous Re-Discovery?
**User Experience:**

* Tool execution result returned immediately
* User doesn't wait for tool discovery (can take 1-5 seconds)
* If re-discovery fails, user already got their result

**Reliability:**

* Tool response isn't blocked by discovery errors
* Discovery failure doesn't affect user's current request
* Recovery can be retried later

## Tool Preservation

When re-discovery fails, tools are NOT removed from cache:

```typescript
// services/satellite/src/services/remote-tool-discovery-manager.ts

async rediscoverServerTools(serverSlug: string): Promise<void> {
  try {
    // Attempt discovery
    const newTools = await this.fetchToolsFromServer(serverSlug);

    // Discovery succeeded → remove old tools and add new ones
    this.removeToolsForServer(serverSlug);
    this.addTools(newTools);

    this.statusCallback?.(serverSlug, 'online');
  } catch (error) {
    // Discovery failed → keep old tools in cache
    // Tools remain available for future attempts
    this.statusCallback?.(serverSlug, 'error', error.message);
  }
}
```

**Why preserve tools on failure?**

* User can still see what tools are available
* Tools may work if server recovers later
* Better UX than empty tool list
* Discovery can be retried without losing tool metadata

## Stdio Process Recovery

Stdio servers are restarted automatically after a crash. The third crash within a 5-minute window marks the server as permanently failed:

```typescript
// services/satellite/src/process/manager.ts

async handleProcessExit(processInfo: ProcessInfo, exitCode: number) {
  const now = Date.now();
  const fiveMinutesAgo = now - 5 * 60 * 1000;

  // Track crashes in 5-minute window
  processInfo.crashHistory = processInfo.crashHistory.filter(t => t > fiveMinutesAgo);
  processInfo.crashHistory.push(now);

  const crashCount = processInfo.crashHistory.length;

  if (crashCount >= 3) {
    // Permanent failure - emit status event
    this.eventBus?.emit('mcp.server.permanently_failed', {
      installation_id: processInfo.config.installation_id,
      team_id: processInfo.config.team_id,
      process_id: processInfo.processId,
      crash_count: crashCount,
      message: `Process crashed ${crashCount} times in 5 minutes`,
      timestamp: new Date().toISOString()
    });

    // Also emit status_changed for database update
    this.eventBus?.emit('mcp.server.status_changed', {
      installation_id: processInfo.config.installation_id,
      team_id: processInfo.config.team_id,
      status: 'permanently_failed',
      status_message: `Process crashed ${crashCount} times in 5 minutes. Manual restart required.`,
      timestamp: new Date().toISOString()
    });

    return; // No auto-restart
  }

  // Auto-restart (crash count < 3)
  this.logger.info({ processId: processInfo.processId, crashCount }, 'Auto-restarting crashed process');
  await this.startProcess(processInfo.config);
}
```

### Stdio Recovery Timeline

```
Process crashes (crash #1)
         ↓
Auto-restart immediately
         ↓
Process crashes again (crash #2, within 5 min)
         ↓
Auto-restart immediately
         ↓
Process crashes again (crash #3, within 5 min)
         ↓
Status → 'permanently_failed'
         ↓
No auto-restart (manual action required)
```

## Failure Status Mapping

When tool execution fails after all retries, the error message is mapped to a status value:

```typescript
// services/satellite/src/services/remote-tool-discovery-manager.ts

static getStatusFromError(error: Error): { status: string; message: string } {
  const msg = error.message.toLowerCase();

  // Auth errors → requires_reauth
  if (msg.includes('401') || msg.includes('unauthorized')) {
    return { status: 'requires_reauth', message: 'Authentication failed (HTTP 401)' };
  }
  if (msg.includes('403') || msg.includes('forbidden')) {
    return { status: 'requires_reauth', message: 'Access forbidden (HTTP 403)' };
  }

  // Connection errors → offline
  if (msg.includes('econnrefused') || msg.includes('etimedout') ||
      msg.includes('enotfound') || msg.includes('fetch failed')) {
    return { status: 'offline', message: 'Server unreachable' };
  }

  // Other errors → error
  return { status: 'error', message: error.message };
}
```

## Debouncing Concurrent Recovery

Multiple tool executions may detect
recovery simultaneously. Debouncing prevents duplicate re-discoveries:

```typescript
class McpServerWrapper {
  private recoveryInProgress: Set<string> = new Set();

  private async handleServerRecovery(serverSlug: string, config: McpServerConfig) {
    // Check if already recovering
    if (this.recoveryInProgress.has(serverSlug)) {
      return; // Skip duplicate recovery
    }

    this.recoveryInProgress.add(serverSlug);

    try {
      await this.performRecovery(serverSlug, config);
    } finally {
      this.recoveryInProgress.delete(serverSlug);
    }
  }
}
```

**Scenario:**

* LLM executes 3 tools from same server concurrently
* All 3 detect recovery (server was offline)
* Only first execution triggers re-discovery
* Other 2 skip (already in progress)

## Recovery Timing

| Recovery Type      | Detection Time                     | Re-Discovery Time | Total     |
| ------------------ | ---------------------------------- | ----------------- | --------- |
| **Tool Execution** | Immediate (on next tool call)      | 1-5 seconds       | \~1-5s    |
| **Health Check**   | Up to 3 minutes (polling interval) | 1-5 seconds       | \~3-8 min |

**Recommendation:** Tool execution recovery is faster and more responsive than health check recovery.
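The concurrent-recovery scenario described under Debouncing Concurrent Recovery can be reproduced in isolation. The following is a minimal sketch, not the satellite's actual implementation: `RecoveryGuard`, its `recover` method, the `DiscoverFn` type, and the simulated 50 ms discovery delay are hypothetical names and values chosen for illustration. It only shows that an in-progress set lets one of three simultaneous recovery attempts run re-discovery while the other two return immediately.

```typescript
// Hypothetical stand-in for the tool discovery call (an assumption, not the real API).
type DiscoverFn = (serverSlug: string) => Promise<void>;

class RecoveryGuard {
  private readonly recoveryInProgress = new Set<string>();
  private readonly discover: DiscoverFn;

  constructor(discover: DiscoverFn) {
    this.discover = discover;
  }

  // Resolves to true if this call ran re-discovery, false if it was skipped
  // because another recovery for the same server was already in progress.
  async recover(serverSlug: string): Promise<boolean> {
    if (this.recoveryInProgress.has(serverSlug)) {
      return false;
    }
    this.recoveryInProgress.add(serverSlug);
    try {
      await this.discover(serverSlug);
      return true;
    } finally {
      this.recoveryInProgress.delete(serverSlug);
    }
  }
}

let discoveryCalls = 0;

const guard = new RecoveryGuard(async () => {
  discoveryCalls += 1;
  // Simulated discovery latency (illustrative value only).
  await new Promise<void>((resolve) => setTimeout(resolve, 50));
});

// Three tool executions detect recovery of the same server at the same time.
const outcomes = Promise.all([
  guard.recover("example-server"),
  guard.recover("example-server"),
  guard.recover("example-server"),
]);

outcomes.then((results) => {
  console.log(results, discoveryCalls); // → [ true, false, false ] 1
});
```

Because the slug is added to the set before the first `await`, the check-and-add step runs without interleaving on the single-threaded event loop, which is why this sketch needs no lock.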
## Manual Recovery (Requires User Action)

Some failures cannot auto-recover and require user intervention:

| Status               | Reason                          | User Action                                            | Recovery Type                   |
| -------------------- | ------------------------------- | ------------------------------------------------------ | ------------------------------- |
| `requires_reauth`    | OAuth token expired/revoked     | Click "Re-authenticate" button in installation details | Self-service (all team members) |
| `permanently_failed` | 3+ crashes in 5 minutes (stdio) | Check logs, fix issue, manual restart                  | Admin intervention              |

**Re-Authentication Details**:

* Available to all team members (OAuth is per-user)
* Preserves installation configuration
* Updates tokens in-place (no reinstall needed)
* See [MCP Server OAuth - Token Expiration Handling](/development/backend/mcp-server-oauth#token-expiration-handling) for technical details

See [Process Management - Auto-Restart System](/development/satellite/process-management#auto-restart-system) for complete stdio restart policy details (3 crashes in 5-minute window, backoff delays).

## Implementation Components

The recovery system consists of several integrated components:

* Stdio auto-recovery and permanently\_failed status
* Tool execution retry logic and recovery detection
* Health check recovery via backend

## Related Documentation

* [Status Tracking](/development/satellite/status-tracking) - Status values and transitions
* [Event Emission](/development/satellite/event-emission) - Recovery status events
* [Tool Discovery](/development/satellite/tool-discovery) - Re-discovery after recovery
* [Process Management](/development/satellite/process-management) - Stdio crash recovery