Overview
Core Components:
- ProcessManager: Handles spawning, communication, and lifecycle of stdio-based MCP servers
- RuntimeState: Maintains in-memory state of all processes with team-grouped tracking
- TeamIsolationService: Validates team-based access control for process operations
Operating Modes:
- Development: Direct spawn without isolation (cross-platform)
- Production: nsjail isolation with resource limits (Linux only)
Process Spawning
Spawning Modes
The system automatically selects the appropriate spawning mode based on the environment:
Direct Spawn (Development):
- Standard Node.js child_process.spawn() without isolation
- Full environment variable inheritance
- No resource limits or namespace isolation
- Works on all platforms (macOS, Windows, Linux)
nsjail Isolation (Production):
- Resource limits: 50MB RAM, 60s CPU time, and one process per started MCP server
- Namespace isolation: PID, mount, UTS, IPC
- Filesystem isolation: Read-only mounts for /usr, /lib, /lib64, /bin with writable /tmp
- Team-specific hostname: mcp-{team_id}
- Non-root user (99999:99999)
- Network access enabled
Mode Selection: The system uses process.env.NODE_ENV === 'production' && process.platform === 'linux' to determine isolation mode. This ensures development works seamlessly on all platforms while production deployments get full security.
Process Configuration
Processes are spawned using MCPServerConfig containing:
- installation_name: Unique identifier in format {server_slug}-{team_slug}-{installation_id}
- installation_id: Database UUID for the installation
- team_id: Team owning the process
- command: Executable command (e.g., npx, node)
- args: Command arguments
- env: Environment variables (credentials, configuration)
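As a rough illustration of the configuration shape and the mode check described in this section, the sketch below models MCPServerConfig as a TypeScript interface and expresses the mode rule as a pure function. The field types, the SpawnMode type, the selectSpawnMode name, and the placeholder values are assumptions for illustration; only the field names and the NODE_ENV/platform condition come from this page.

```typescript
// Field names follow the MCPServerConfig description; the types are assumed.
interface MCPServerConfig {
  installation_name: string; // {server_slug}-{team_slug}-{installation_id}
  installation_id: string;
  team_id: string;
  command: string; // e.g. "npx" or "node"
  args: string[];
  env: Record<string, string>;
}

type SpawnMode = "direct" | "nsjail";

// Hypothetical helper: nsjail is chosen only for production on Linux.
function selectSpawnMode(nodeEnv: string | undefined, platform: string): SpawnMode {
  return nodeEnv === "production" && platform === "linux" ? "nsjail" : "direct";
}

const exampleConfig: MCPServerConfig = {
  installation_name: "filesystem-john-R36no6FGoMFEZO9nWJJLT",
  installation_id: "R36no6FGoMFEZO9nWJJLT",
  team_id: "team-123", // placeholder value
  command: "npx",
  args: ["-y", "some-mcp-server"], // placeholder package name
  env: {},
};
```

In a running satellite the two arguments would come from process.env.NODE_ENV and process.platform; passing them in keeps the rule easy to test.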
MCP Handshake Protocol
After spawning, processes must complete an MCP handshake before becoming operational:
Two-Step Process:
- Initialize Request: Sent to process via stdin
  - Protocol version: 2025-11-05
  - Client info: deploystack-satellite v1.0.0
  - Capabilities: roots.listChanged=false, sampling=
- Initialized Notification: Sent after successful initialization response
Handshake Requirements:
- 30-second timeout (accounts for npx package downloads)
- Response must include serverInfo with name and version
- Process marked ‘failed’ and terminated if handshake fails
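The two handshake messages and the response check can be sketched as follows. The method names initialize and notifications/initialized come from the MCP specification; the helper names are hypothetical, the protocol version and client info are copied from the list above, and the empty object for the sampling capability is an assumption because its value is not shown here.

```typescript
interface JsonRpcRequest { jsonrpc: "2.0"; id: number; method: string; params?: unknown }
interface JsonRpcNotification { jsonrpc: "2.0"; method: string; params?: unknown }

// Step 1: the initialize request, written to the process's stdin.
function buildInitializeRequest(id: number): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: {
      protocolVersion: "2025-11-05", // value as stated on this page
      clientInfo: { name: "deploystack-satellite", version: "1.0.0" },
      capabilities: { roots: { listChanged: false }, sampling: {} }, // sampling value assumed
    },
  };
}

// Step 2: sent only after a valid initialize response. No id, so no reply is expected.
function buildInitializedNotification(): JsonRpcNotification {
  return { jsonrpc: "2.0", method: "notifications/initialized" };
}

// The initialize result must carry serverInfo with a name and a version.
function isValidInitializeResult(result: unknown): boolean {
  if (typeof result !== "object" || result === null) return false;
  const info = (result as { serverInfo?: unknown }).serverInfo;
  if (typeof info !== "object" || info === null) return false;
  const { name, version } = info as { name?: unknown; version?: unknown };
  return typeof name === "string" && typeof version === "string";
}
```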
stdio Communication Protocol
Message Format
All communication uses newline-delimited JSON following the JSON-RPC 2.0 specification:
stdin (Satellite → Process):
- Write JSON-RPC messages followed by \n
- Requests include an id field for response matching
- Notifications omit the id field (no response expected)
stdout (Process → Satellite):
- Buffer-based parsing accumulates chunks
- Split on newlines to extract complete messages
- Incomplete lines remain in buffer for next chunk
- Parse complete lines as JSON
Message Types:
- Requests (with id): Expect response, tracked in active requests map
- Notifications (no id): Fire-and-forget, no response tracking
- Responses: Match id to active request, resolve or reject promise
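The buffering and classification rules above can be sketched as a small line buffer plus a classifier. This is a minimal illustration, not the ProcessManager's code; LineBuffer and classify are hypothetical names.

```typescript
type Classified =
  | { kind: "request"; id: number | string; method: string }
  | { kind: "notification"; method: string }
  | { kind: "response"; id: number | string };

// Accumulates stdout chunks and returns only complete, parseable lines.
class LineBuffer {
  private buffer = "";

  push(chunk: string): unknown[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    // The last element is an incomplete line (or ""), so keep it for the next chunk.
    this.buffer = lines.pop() ?? "";
    const messages: unknown[] = [];
    for (const line of lines) {
      const trimmed = line.trim();
      if (trimmed === "") continue;
      try {
        messages.push(JSON.parse(trimmed));
      } catch {
        // Malformed line: skipped here; a real implementation would also log it.
      }
    }
    return messages;
  }
}

// A method plus an id is a request, a method alone is a notification,
// and an id alone is a response.
function classify(msg: { id?: number | string | null; method?: string }): Classified | null {
  const id = msg.id;
  if (msg.method !== undefined) {
    if (id !== undefined && id !== null) return { kind: "request", id, method: msg.method };
    return { kind: "notification", method: msg.method };
  }
  if (id !== undefined && id !== null) return { kind: "response", id };
  return null;
}
```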
Request/Response Handling
Active Request Tracking:
- Map of request ID → {resolve, reject, timeout, startTime}
- Configurable timeout per request (default 30s)
- Automatic cleanup on response or timeout
Request Flow:
- Validate process status (must be ‘starting’ or ‘running’)
- Register timeout handler
- Write JSON-RPC message to stdin
- Wait for response via stdout parsing
- Resolve/reject promise based on response
Error Handling:
- Write errors: Immediate rejection
- Timeout errors: Clean up active request, reject with timeout message
- JSON-RPC errors: Extract error.message from response
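A minimal sketch of the active request map, assuming numeric request IDs. RequestTracker is a hypothetical name, and the status check and stdin write are left out to keep the focus on timeout and cleanup handling.

```typescript
interface PendingRequest {
  resolve: (result: unknown) => void;
  reject: (error: Error) => void;
  timeout: ReturnType<typeof setTimeout>;
  startTime: number;
}

class RequestTracker {
  private active = new Map<number, PendingRequest>();

  // Registers a request; the promise settles on a matching response or on timeout.
  register(id: number, timeoutMs = 30000): Promise<unknown> {
    return new Promise((resolve, reject) => {
      const timeout = setTimeout(() => {
        this.active.delete(id); // clean up before rejecting
        reject(new Error(`Request ${id} timed out after ${timeoutMs}ms`));
      }, timeoutMs);
      this.active.set(id, { resolve, reject, timeout, startTime: Date.now() });
    });
  }

  // Matches a response to its pending request; returns false for unknown ids.
  handleResponse(msg: { id: number; result?: unknown; error?: { message: string } }): boolean {
    const pending = this.active.get(msg.id);
    if (!pending) return false;
    clearTimeout(pending.timeout);
    this.active.delete(msg.id);
    if (msg.error) pending.reject(new Error(msg.error.message));
    else pending.resolve(msg.result);
    return true;
  }

  get size(): number {
    return this.active.size;
  }
}
```

Clearing the timer in handleResponse matters as much as deleting the map entry: a timer left running would later reject a promise that has already settled and keep the entry's closure alive until it fires.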
Process Lifecycle
Idle Process Management: Processes that remain inactive for extended periods are automatically terminated and respawned on-demand to optimize memory usage. See Idle Process Management for details on automatic termination, dormant state tracking, and respawning.
Lifecycle States
starting:
- Process spawned with handlers attached
- MCP handshake in progress
- Accepts handshake messages only
running:
- Handshake completed successfully
- Ready for JSON-RPC requests
- Tools discovered and cached
terminating:
- Graceful shutdown initiated
- Active requests cancelled
- Awaiting process exit
terminated:
- Process exited
- Removed from tracking maps
failed:
- Spawn or handshake failure
- Not operational
Graceful Termination
Process termination follows a two-phase graceful shutdown approach to ensure clean process exit and proper resource cleanup.
Termination Phases
Phase 1: SIGTERM (Graceful Shutdown)
- Send SIGTERM signal to the process
- Process has 10 seconds (default timeout) to shut down gracefully
- Process can complete in-flight operations and clean up resources
- Wait for process to exit voluntarily
Phase 2: SIGKILL (Forced Termination)
- If process doesn’t exit within timeout period
- Send SIGKILL signal to force immediate termination
- Guaranteed process termination (cannot be caught or ignored)
- Used as last resort for unresponsive processes
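The two phases can be sketched with a single timer. The Killable interface below is an assumption that mirrors the kill() and once('exit') methods of a Node.js child process, and terminateGracefully is a hypothetical name.

```typescript
// Minimal view of a child process: only what the shutdown sequence needs.
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
  once(event: "exit", listener: () => void): unknown;
}

// Phase 1 sends SIGTERM; Phase 2 sends SIGKILL if no exit arrives within timeoutMs.
function terminateGracefully(proc: Killable, timeoutMs = 10000): Promise<"graceful" | "forced"> {
  return new Promise((resolve) => {
    let forced = false;
    const killTimer = setTimeout(() => {
      forced = true;
      proc.kill("SIGKILL");
    }, timeoutMs);
    // Register the exit listener before signalling so a fast exit is not missed.
    proc.once("exit", () => {
      clearTimeout(killTimer);
      resolve(forced ? "forced" : "graceful");
    });
    proc.kill("SIGTERM");
  });
}
```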
Termination Types
The system handles three types of intentional terminations differently:
1. Manual Termination
- Triggered by explicit restart or stop commands
- Status set to 'terminating' before sending signals
- No auto-restart triggered
- Standard graceful shutdown with SIGTERM → SIGKILL
2. Idle Timeout (Dormant Shutdown)
- Triggered by idle timeout (default: 180 seconds of inactivity)
- Process marked with isDormantShutdown flag
- Configuration stored in dormant map for fast respawn
- Tools remain cached for instant availability
- No auto-restart triggered (intentional shutdown)
- See Idle Process Management for details
3. Uninstall Shutdown
- Triggered when server removed from configuration
- Process marked with isUninstallShutdown flag
- Complete cleanup: process, dormant config, tools, restart tracking
- No auto-restart triggered (intentional removal)
- Invoked via removeServerCompletely() method
Crash Detection vs Intentional Shutdown
The system distinguishes between crashes and intentional shutdowns:
Crash Detection Logic:
- A process stopped by SIGTERM conventionally reports exit code 143 (128 + 15), which is non-zero
- Without flags, graceful termination would trigger auto-restart
- Flags prevent unwanted restarts for intentional shutdowns
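One way to express this distinction is a pure classification function that consults the status and shutdown flags before looking at the exit code. The ExitContext shape and classifyExit name are assumptions; only the flag names and the 'terminating' status come from this page.

```typescript
interface ExitContext {
  exitCode: number | null;
  signal: string | null;
  status: string; // e.g. "running" or "terminating"
  isDormantShutdown?: boolean;
  isUninstallShutdown?: boolean;
}

type ExitKind = "intentional" | "crash" | "clean";

function classifyExit(ctx: ExitContext): ExitKind {
  // Intentional shutdowns are recognised by state or flag, not by exit code,
  // because a SIGTERM exit (code 143) looks like a failure on its own.
  if (ctx.status === "terminating" || ctx.isDormantShutdown || ctx.isUninstallShutdown) {
    return "intentional";
  }
  const abnormal = (ctx.exitCode !== null && ctx.exitCode !== 0) || ctx.signal !== null;
  return abnormal ? "crash" : "clean";
}
```

Only the "crash" result would feed the auto-restart logic; the other two leave restart tracking untouched.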
Cleanup Operations
During termination, the following cleanup operations occur:
- Active Request Cancellation
  - All pending JSON-RPC requests are rejected
  - Active requests map is cleared
  - Clients receive termination error
- State Cleanup
  - Remove from processes map (by process ID)
  - Remove from processIdsByName map (by installation name)
  - Remove from team tracking sets
  - Clear dormant config if exists (for uninstall)
- Resource Tracking
  - Restart attempts cleared (for uninstall)
  - Respawn promises cleared
  - Process metrics finalized
- Event Emission
  - Emit processTerminated internal event
  - Emit processExit with exit code and signal
  - Emit mcp.server.crashed if crash detected (Backend event)
Complete Server Removal
The removeServerCompletely() method provides comprehensive cleanup for server uninstall:
Method Signature:
Removal Steps:
- Check for active process
  - If found: Set isUninstallShutdown flag
  - Terminate with graceful shutdown
  - Return active: true
- Check for dormant config
  - If found: Remove from dormant map
  - Return dormant: true
- Clear restart tracking
  - Delete restart attempts history
  - Prevent any future restart attempts
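The removal flow can be sketched as below. This is not the method's actual signature, which is not reproduced on this page: the RemovalResult shape is inferred from the active and dormant return values described above, and the class name, map names, and terminate callback are hypothetical stand-ins.

```typescript
interface RemovalResult { active: boolean; dormant: boolean }

class ServerRegistrySketch {
  // Simplified stand-ins for the manager's internal maps (names assumed).
  activeProcesses = new Map<string, { isUninstallShutdown: boolean }>();
  dormantConfigs = new Map<string, unknown>();
  restartAttempts = new Map<string, number[]>();
  private terminate: (installationName: string) => Promise<void>;

  // terminate is an assumed callback that performs the SIGTERM → SIGKILL shutdown.
  constructor(terminate: (installationName: string) => Promise<void>) {
    this.terminate = terminate;
  }

  async removeServerCompletely(installationName: string): Promise<RemovalResult> {
    const result: RemovalResult = { active: false, dormant: false };
    const proc = this.activeProcesses.get(installationName);
    if (proc) {
      proc.isUninstallShutdown = true; // set before terminating to suppress auto-restart
      await this.terminate(installationName);
      this.activeProcesses.delete(installationName);
      result.active = true;
    }
    if (this.dormantConfigs.delete(installationName)) result.dormant = true;
    this.restartAttempts.delete(installationName); // no history, no future restarts
    return result;
  }
}
```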
Termination Timing
Normal Termination:
- SIGTERM sent: ~1ms
- Process cleanup: 10-500ms (application-dependent)
- Total time: 11-501ms
Forced Termination:
- SIGTERM sent: ~1ms
- Timeout wait: 10,000ms
- SIGKILL sent: ~1ms
- Immediate kill: ~10ms
- Total time: ~10,012ms
Best Practices:
- MCP servers should handle SIGTERM gracefully
- Complete in-flight requests within timeout
- Close file handles and network connections
- Exit with code 0 for clean shutdown
Auto-Restart System
Crash Detection
The system detects crashes based on exit conditions:
- Non-zero exit code
- Process not in ‘terminating’ state
- Unexpected signal termination
Restart Policy
Limits:
- Maximum 3 restart attempts in 5-minute window
- After limit exceeded: Process marked ‘permanently_failed’ in RuntimeState
Backoff Strategy:
- Process ran >60 seconds before crash: Immediate restart
- Quick crashes: Escalating backoff (1s → 5s → 15s)
Restart Flow:
- Detect crash with exit code and signal
- Check restart eligibility (3 attempts in 5 minutes)
- Apply backoff delay based on uptime
- Attempt restart via spawnProcess()
- Emit ‘processRestarted’ or ‘restartLimitExceeded’ event
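The restart policy can be sketched as a pure decision function. The limits and delays are the ones listed above; the decideRestart name, the timestamp-array representation, and the rule that the backoff step is chosen by the number of recent attempts are assumptions.

```typescript
const MAX_RESTARTS = 3;
const WINDOW_MS = 5 * 60 * 1000;
const STABLE_UPTIME_MS = 60 * 1000;

type RestartDecision = { restart: true; delayMs: number } | { restart: false };

// attempts: timestamps (ms) of earlier restart attempts for this installation.
function decideRestart(attempts: number[], uptimeMs: number, now: number): RestartDecision {
  const recent = attempts.filter((t) => now - t < WINDOW_MS);
  // Limit reached: the caller would mark the process permanently_failed.
  if (recent.length >= MAX_RESTARTS) return { restart: false };
  // A process that ran longer than 60 seconds restarts immediately.
  if (uptimeMs > STABLE_UPTIME_MS) return { restart: true, delayMs: 0 };
  // Quick crash: 1s, then 5s, then 15s (assumed to follow the recent attempt count).
  const delayMs = recent.length === 0 ? 1000 : recent.length === 1 ? 5000 : 15000;
  return { restart: true, delayMs };
}
```

Attempts older than the 5-minute window drop out of the count, so a server that crashes occasionally is never marked as permanently failed.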
RuntimeState Integration
RuntimeState maintains in-memory tracking of all MCP server processes:
Tracking Methods:
- By process ID (UUID)
- By installation name (for lookups)
- By team ID (for team-grouped operations)
Process Information:
- Extends ProcessInfo with: installationId, installationName, teamId
- Health status: unknown/healthy/unhealthy
- Last health check timestamp
Additional State:
- Permanently Failed Map: Separate storage for processes exceeding restart limits
- Team-Grouped Sets: Map of team_id → Set of process IDs for heartbeat reporting
Query Operations:
- Get all processes (includes permanently failed for reporting)
- Get team processes (filter by team_id)
- Get running team processes (status=‘running’)
- Get process count by status
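A minimal sketch of the three lookup paths, assuming string IDs and a reduced process record. RuntimeStateSketch and its field and method names are hypothetical; health status and the permanently failed map are omitted.

```typescript
interface TrackedProcess {
  processId: string;
  installationName: string;
  teamId: string;
  status: string; // e.g. "starting" or "running"
}

class RuntimeStateSketch {
  private byId = new Map<string, TrackedProcess>();
  private idByName = new Map<string, string>();
  private idsByTeam = new Map<string, Set<string>>();

  add(p: TrackedProcess): void {
    this.byId.set(p.processId, p);
    this.idByName.set(p.installationName, p.processId);
    const ids = this.idsByTeam.get(p.teamId) ?? new Set<string>();
    ids.add(p.processId);
    this.idsByTeam.set(p.teamId, ids);
  }

  // Removes the process from all three indexes so they stay consistent.
  remove(processId: string): void {
    const p = this.byId.get(processId);
    if (!p) return;
    this.byId.delete(processId);
    this.idByName.delete(p.installationName);
    const ids = this.idsByTeam.get(p.teamId);
    if (ids) {
      ids.delete(processId);
      if (ids.size === 0) this.idsByTeam.delete(p.teamId);
    }
  }

  getTeamProcesses(teamId: string): TrackedProcess[] {
    const out: TrackedProcess[] = [];
    const ids = this.idsByTeam.get(teamId);
    if (!ids) return out;
    ids.forEach((id) => {
      const p = this.byId.get(id);
      if (p) out.push(p);
    });
    return out;
  }

  getRunningTeamProcesses(teamId: string): TrackedProcess[] {
    return this.getTeamProcesses(teamId).filter((p) => p.status === "running");
  }
}
```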
Process Monitoring
Metrics Tracked
Each process tracks operational metrics:
- Message count: Total requests sent to process
- Error count: Communication failures
- Last activity: Timestamp of last message sent/received
- Uptime: Calculated from start time
- Active requests: Count of pending requests
Events Emitted
The ProcessManager emits events for monitoring and integration:
- processSpawned: New process started successfully
- processRestarted: Process restarted after crash
- processTerminated: Process shut down
- processExit: Process exited (any reason)
- processError: Spawn or runtime error
- serverNotification: Notification received from MCP server
- restartLimitExceeded: Max restart attempts reached
- restartFailed: Restart attempt failed
Logging
stderr Handling:
- Logged at debug level (informational output, not errors)
- MCP servers often write logs to stderr
Parse Errors:
- Malformed JSON lines logged and skipped
- Does not crash the process or satellite
Structured Logging:
- All operations include: installation_name, installation_id, team_id
- Request tracking includes: request_id, method, duration_ms
- Error context includes: error messages, exit codes, signals
Event Emission
The ProcessManager emits real-time events to the Backend for operational visibility and audit trails. These events are batched every 3 seconds and sent via the Event System.
Lifecycle Events
mcp.server.started
- Emitted after successful spawn and handshake completion
- Includes: server_id, process_id, spawn_duration_ms, tool_count
- Provides immediate visibility into new MCP server availability
mcp.server.crashed
- Emitted on unexpected process exit with non-zero code
- Includes: exit_code, signal, uptime_seconds, crash_count, will_restart
- Enables real-time alerting for process failures
Automatic restart event
- Emitted after successful automatic restart
- Includes: old_process_id, new_process_id, restart_reason, attempt_number
- Tracks restart attempts for reliability monitoring
Restart limit exceeded event
- Emitted when restart limit (3 attempts) is exceeded
- Includes: total_crashes, last_error, failed_at timestamp
- Critical alert requiring manual intervention
Internal Events vs. Event System Events
- ProcessManager internal events (processSpawned, processTerminated, etc.) are for satellite-internal coordination
- Event System events (mcp.server.started, etc.) are sent to Backend for external visibility
- Both work together: Internal events trigger state changes, Event System events provide audit trail
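The 3-second batching can be sketched as a queue that is flushed on an interval. This is an illustration only: the EventBatcher class, the event envelope, and the send callback are assumptions, and the Event System's real interface is not described on this page.

```typescript
interface SatelliteEvent {
  type: string; // e.g. "mcp.server.started"
  timestamp: number;
  data: Record<string, unknown>;
}

class EventBatcher {
  private queue: SatelliteEvent[] = [];
  private timer: ReturnType<typeof setInterval> | undefined;
  private send: (batch: SatelliteEvent[]) => void;
  private intervalMs: number;

  constructor(send: (batch: SatelliteEvent[]) => void, intervalMs = 3000) {
    this.send = send;
    this.intervalMs = intervalMs;
  }

  emit(type: string, data: Record<string, unknown>): void {
    this.queue.push({ type, timestamp: Date.now(), data });
  }

  // Sends everything queued so far as one batch; empty queues send nothing.
  flush(): void {
    if (this.queue.length === 0) return;
    const batch = this.queue;
    this.queue = [];
    this.send(batch);
  }

  start(): void {
    if (this.timer === undefined) this.timer = setInterval(() => this.flush(), this.intervalMs);
  }

  // Stops the interval and flushes whatever is still queued.
  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
    this.flush();
  }
}
```

An internal processSpawned handler could call emit("mcp.server.started", { server_id, process_id, spawn_duration_ms, tool_count }) and leave delivery to the next flush.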
Team Isolation
Installation Name Format
Installation names follow a strict format for team isolation:
- filesystem-john-R36no6FGoMFEZO9nWJJLT
- context7-alice-S47mp8GHpNGFZP0oWKKMU
Team Access Validation
TeamIsolationService provides:
- extractTeamInfo(): Parse installation name into components
- validateTeamAccess(): Ensure request team matches process team
- isValidInstallationName(): Validate name format
Team-Based Tracking:
- RuntimeState groups processes by team_id
- nsjail uses team-specific hostname: mcp-{team_id}
- Heartbeat reports processes grouped by team
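The name parsing and team check can be illustrated as follows. These helpers are hypothetical and are not the TeamIsolationService implementation; they assume the {server_slug}-{team_slug}-{installation_id} format, and that the team slug and installation ID contain no hyphens while the server slug may.

```typescript
interface TeamInfo {
  serverSlug: string;
  teamSlug: string;
  installationId: string;
}

// Assumption: the last two hyphen-separated segments are the team slug and the
// installation id; everything before them is the server slug.
function parseInstallationName(installationName: string): TeamInfo | null {
  const parts = installationName.split("-");
  const installationId = parts.pop();
  const teamSlug = parts.pop();
  if (!installationId || !teamSlug) return null;
  if (parts.length === 0 || parts.some((p) => p === "")) return null;
  return { serverSlug: parts.join("-"), teamSlug, installationId };
}

// A request is allowed only when its team matches the team encoded in the name.
function teamMatches(installationName: string, requestTeamSlug: string): boolean {
  const info = parseInstallationName(installationName);
  return info !== null && info.teamSlug === requestTeamSlug;
}
```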
Performance Characteristics
Timing:
- Spawn time: 1-3 seconds (includes handshake and tool discovery)
- Message latency: ~10-50ms for stdio communication
- Handshake timeout: 30 seconds
Resource Usage:
- Memory per process: Base ~10-20MB (application-dependent, limited to 50MB in production)
- Event-driven architecture: Handles multiple processes concurrently
- CPU overhead: Minimal (background event loop processing)
Scalability:
- No hard limit on process count (bounded by system resources)
- Team-grouped tracking enables efficient filtering
- Permanent failure tracking prevents infinite restart loops
Development & Testing
Local Development
Development Mode:
- Uses direct spawn (no nsjail required)
- Works on macOS, Windows, Linux
- Full environment inheritance simplifies debugging
Testing Processes
Manual Testing Methods:
- getAllProcesses(): Inspect all active processes
- getServerStatus(installationName): Get detailed process status
- restartServer(installationName): Test restart functionality
- terminateProcess(processInfo): Test graceful shutdown
Platform Support:
- Development: All platforms (macOS/Windows/Linux)
- Production: Linux only (nsjail requirement)
Security Considerations
Environment Injection:
- Credentials passed securely via environment variables
- No credentials stored in process arguments or logs
Resource Limits:
- nsjail enforces hard limits: 50MB RAM, 60s CPU, one process
- Prevents resource exhaustion attacks
Namespace Isolation:
- Complete process isolation per team
- Separate PID, mount, UTS, IPC namespaces
Filesystem Isolation:
- System directories mounted read-only
- Only /tmp writable
- Prevents filesystem tampering
Network Access:
- Enabled by default (MCP servers need external connectivity)
- Can be disabled for higher security requirements
Related Documentation
- Satellite Architecture Design - Overall system architecture
- Idle Process Management - Automatic termination and respawning of idle processes
- Tool Discovery Implementation - How tools are discovered from processes
- Team Isolation Implementation - Team-based access control
- Backend Communication - Integration with Backend commands

