# Process Management

> Technical implementation of per-user stdio subprocess management for local MCP servers in DeployStack Satellite.

DeployStack Satellite implements per-user stdio subprocess management for local MCP servers through the ProcessManager component. Each team member gets their own isolated process for each MCP server installation. Processes are spawned directly in development and inside an nsjail sandbox in production.

## Overview

### Per-User Process Architecture

DeployStack manages MCP server processes on a **per-user basis**:

* **1 Installation × N Users = N Processes**: Each team member has their own process for each MCP server
* **Independent Lifecycle**: Terminating one user's process doesn't affect other users' processes
* **User-Specific Config**: Each process runs with merged 3-tier config (Template + Team + User)
* **ProcessId Format**: `{server_slug}-{team_slug}-{user_slug}-{installation_id}`

**Example:**

```
Team "Acme Corp" installs Filesystem MCP (Node.js) with 3 members:
- Process 1: filesystem-acme-alice-abc123 (Alice's instance, npx command)
- Process 2: filesystem-acme-bob-abc123 (Bob's instance, npx command)
- Process 3: filesystem-acme-charlie-abc123 (Charlie's instance, npx command)

Team "Acme Corp" installs DuckDuckGo MCP (Python) with 3 members:
- Process 1: duckduckgo-acme-alice-def456 (Alice's instance, uvx command)
- Process 2: duckduckgo-acme-bob-def456 (Bob's instance, uvx command)
- Process 3: duckduckgo-acme-charlie-def456 (Charlie's instance, uvx command)

Each process runs independently with user-specific configuration and runtime-aware isolation.
```
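As a rough illustration of the ProcessId format, the identifier can be assembled from its four parts. `buildProcessId` is a hypothetical helper, not a documented DeployStack function, and it assumes the slugs are already normalized.

```typescript
// Hypothetical helper: joins the four documented parts into a processId.
// Assumes every slug is already lowercase and URL-safe.
function buildProcessId(
  serverSlug: string,
  teamSlug: string,
  userSlug: string,
  installationId: string
): string {
  return [serverSlug, teamSlug, userSlug, installationId].join("-");
}

// One installation shared by three users yields three distinct process IDs.
const ids = ["alice", "bob", "charlie"].map((user) =>
  buildProcessId("filesystem", "acme", user, "abc123")
);
// ids[0] === "filesystem-acme-alice-abc123"
```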

**Core Components:**

* **ProcessManager**: Handles spawning, communication, and lifecycle of per-user stdio processes
* **RuntimeState**: Maintains in-memory state of all processes with team AND user tracking
* **TeamIsolationService**: Validates team and user access control for process operations

**Deployment Modes:**

* **Development**: Direct spawn without isolation (cross-platform)
* **Production**: nsjail isolation with resource limits (Linux only)

## Process Spawning

### Spawning Modes

The system automatically selects the spawning mode based on the runtime environment and host platform:

**Direct Spawn (Development):**

* Standard Node.js `child_process.spawn()` without isolation
* Full environment variable inheritance
* No resource limits or namespace isolation
* Works on all platforms (macOS, Windows, Linux)

**nsjail Spawn (Production Linux):**

* **Resource limits** (rlimit-based):
  * Virtual memory: unlimited (rlimit\_as=inf — required for Node.js WASM, which reserves \~10GB virtual address space)
  * CPU time: 60 seconds (rlimit\_cpu)
  * Processes: 1000 (rlimit\_nproc)
  * File descriptors: 1024 (rlimit\_nofile)
  * File size: 50MB (rlimit\_fsize)
* **Namespace isolation** (primary security):
  * PID namespace: Process tree isolation
  * Mount namespace: Filesystem isolation
  * User namespace: UID/GID mapping (prevents privilege escalation)
  * UTS namespace: Hostname isolation (`mcp-{team_id}`)
  * IPC namespace: Inter-process communication isolation
* **Filesystem isolation**:
  * Read-only system mounts: `/usr`, `/lib`, `/lib64`, `/bin`, `/sbin`, `/etc`
  * Writable tmpfs: `/tmp` (100MB limit)
  * Runtime-aware cache directories: `/home/node` for Node.js, `/home/python` for Python
* **Network access**: Enabled (required for package downloads)
* **User**: UID/GID mapping via user namespace (deploystack user in production)

<Info>
  **Mode Selection**: The system uses `process.env.NODE_ENV === 'production' && process.platform === 'linux'` to determine isolation mode. This ensures development works seamlessly on all platforms while production deployments get full security.
</Info>
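The mode check can be pictured as a small pure function. `selectSpawnMode` and `SpawnMode` are hypothetical names; only the condition itself comes from the note above.

```typescript
type SpawnMode = "direct" | "nsjail";

// Mirrors the documented condition. Taking the values as parameters
// (instead of reading process.env directly) keeps the check easy to test.
function selectSpawnMode(nodeEnv: string | undefined, platform: string): SpawnMode {
  return nodeEnv === "production" && platform === "linux" ? "nsjail" : "direct";
}

const prodLinux = selectSpawnMode("production", "linux");   // "nsjail"
const prodMac = selectSpawnMode("production", "darwin");    // "direct" (nsjail is Linux-only)
const devLinux = selectSpawnMode("development", "linux");   // "direct"
```

In a real satellite the arguments would come from `process.env.NODE_ENV` and `process.platform`.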

### Process Configuration

Each process is spawned from an `MCPServerConfig` object with the following fields:

* `installation_name`: Unique per-user identifier in format `{server_slug}-{team_slug}-{user_slug}-{installation_id}`
* `installation_id`: Database UUID for the installation
* `team_id`: Team owning the process
* `user_id`: User owning this specific instance
* `command`: Executable command (e.g., `npx` for Node.js, `uvx` for Python, `node`, `python3`)
* `args`: Command arguments (merged Template + Team + User args)
* `env`: Environment variables (merged Template + Team + User env vars, plus credentials)
* `runtime`: Runtime identifier (`node`, `python`) for runtime-aware environment configuration and cache isolation
* `language`: Programming language (`javascript`, `typescript`, `python`) for categorization
* `git_commit_sha`: GitHub commit SHA (GitHub deployments only, used for dynamic args reconstruction)
* `repository_url`: GitHub repository URL (GitHub deployments only)
* `git_branch`: Git branch name (GitHub deployments only)
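Written as a TypeScript type, the configuration might look like the sketch below. The field names come from the list above. The concrete types, which fields are optional, and every value in the example object are assumptions made for illustration.

```typescript
// Sketch of the config shape; types and optionality are assumptions.
interface MCPServerConfig {
  installation_name: string;
  installation_id: string;
  team_id: string;
  user_id: string;
  command: string;
  args: string[];
  env: Record<string, string>;
  runtime: "node" | "python";
  language: "javascript" | "typescript" | "python";
  git_commit_sha?: string; // GitHub deployments only
  repository_url?: string; // GitHub deployments only
  git_branch?: string;     // GitHub deployments only
}

// Placeholder values only; IDs, package name, and path are made up.
const exampleConfig: MCPServerConfig = {
  installation_name: "filesystem-acme-alice-abc123",
  installation_id: "abc123",
  team_id: "team-acme",
  user_id: "user-alice",
  command: "npx",
  args: ["-y", "example-filesystem-server", "/workspace"],
  env: { EXAMPLE_API_KEY: "placeholder" },
  runtime: "node",
  language: "typescript",
};
```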

<Info>
  **ProcessId Includes User:** The `installation_name` (also called `processId`) now includes the `user_slug` to uniquely identify each user's instance. This enables per-user process isolation and independent lifecycle management.

  Example: `filesystem-acme-alice-abc123`
</Info>

<Info>
  **GitHub Deployments:** For GitHub-deployed servers (`source: 'github'`), the `args` field contains the base GitHub reference WITHOUT the commit SHA (e.g., `github:owner/repo`). The satellite dynamically reconstructs the full reference using `git_commit_sha` during the deployment preparation phase. This ensures redeploys always use the latest SHA without stale template\_args. See [Dynamic Args Reconstruction](/development/satellite/github-deployment#dynamic-args-reconstruction).
</Info>

### Backend Filtering (awaiting\_user\_config)

The satellite does NOT receive configurations for instances with `awaiting_user_config` status:

**Why:**

* MCP servers with required user-level fields (e.g., personal API keys) cannot spawn without complete configuration
* Backend filters out these instances in the config endpoint
* Satellite never attempts to spawn incomplete configurations
* Prevents process crashes from missing required arguments/environment variables

**When it applies:**

* New team member joins but hasn't configured their personal credentials
* Admin installs MCP server but doesn't provide required user-level config during installation
* User's instance remains in `awaiting_user_config` until they complete configuration

**Status transition:**

```
awaiting_user_config
    ↓
[User configures personal settings via dashboard]
    ↓
provisioning (backend updates status, sends satellite command)
    ↓
Satellite receives config and spawns process normally
```

See [Instance Lifecycle](/development/satellite/instance-lifecycle) and [Status Tracking](/development/satellite/status-tracking) for complete details.
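The filtering rule can be sketched as a single predicate on the backend side. `UserInstance` and `filterSpawnable` are hypothetical names, and the real config endpoint is not shown in this document.

```typescript
// Hypothetical, reduced view of a per-user instance record.
interface UserInstance {
  installation_name: string;
  status: string;
}

// Instances still waiting for user configuration are never sent to the satellite.
function filterSpawnable(instances: UserInstance[]): UserInstance[] {
  return instances.filter((instance) => instance.status !== "awaiting_user_config");
}

const spawnable = filterSpawnable([
  { installation_name: "filesystem-acme-alice-abc123", status: "provisioning" },
  { installation_name: "filesystem-acme-bob-abc123", status: "awaiting_user_config" },
]);
// Only Alice's instance remains; Bob's appears once he completes his configuration.
```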

## MCP Handshake Protocol

After spawning, processes must complete an MCP handshake before becoming operational:

**Two-Step Process:**

1. **Initialize Request**: Sent to process via stdin
   * Protocol version: 2024-11-05
   * Client info: deploystack-satellite v1.0.0
   * Capabilities: roots.listChanged=false, sampling={}
2. **Initialized Notification**: Sent after successful initialization response

**Handshake Requirements:**

* 30-second timeout (accounts for npx package downloads)
* Response must include `serverInfo` with name and version
* Process marked 'failed' and terminated if handshake fails
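The two handshake messages and the `serverInfo` check could be built as shown below. The method names `initialize` and `notifications/initialized` are the ones defined by MCP. The helper names are hypothetical, and the protocol version is passed in by the caller rather than hard-coded.

```typescript
interface JsonRpcRequest { jsonrpc: "2.0"; id: number; method: string; params?: unknown }
interface JsonRpcNotification { jsonrpc: "2.0"; method: string; params?: unknown }

// Step 1: initialize request, written to the process's stdin.
function buildInitializeRequest(id: number, protocolVersion: string): JsonRpcRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "initialize",
    params: {
      protocolVersion,
      capabilities: { roots: { listChanged: false }, sampling: {} },
      clientInfo: { name: "deploystack-satellite", version: "1.0.0" },
    },
  };
}

// Step 2: notification sent after a successful initialize response (no id).
function buildInitializedNotification(): JsonRpcNotification {
  return { jsonrpc: "2.0", method: "notifications/initialized" };
}

// The handshake only counts as successful if serverInfo has a name and a version.
function hasValidServerInfo(result: unknown): boolean {
  if (typeof result !== "object" || result === null) return false;
  const info = (result as { serverInfo?: { name?: unknown; version?: unknown } }).serverInfo;
  return typeof info?.name === "string" && typeof info?.version === "string";
}
```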

## stdio Communication Protocol

### Message Format

All communication uses newline-delimited JSON following JSON-RPC 2.0 specification:

**stdin (Satellite → Process):**

* Write JSON-RPC messages followed by `\n`
* Requests include `id` field for response matching
* Notifications omit `id` field (no response expected)

**stdout (Process → Satellite):**

* Buffer-based parsing accumulates chunks
* Split on newlines to extract complete messages
* Incomplete lines remain in buffer for next chunk
* Parse complete lines as JSON

**Message Types:**

* **Requests** (with `id`): Expect response, tracked in active requests map
* **Notifications** (no `id`): Fire-and-forget, no response tracking
* **Responses**: Match `id` to active request, resolve or reject promise
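A minimal sketch of the stdout parsing and message classification described above. `StdoutLineParser` and `classify` are hypothetical names, and a real implementation would log malformed lines instead of dropping them silently.

```typescript
interface JsonRpcMessage {
  jsonrpc?: string;
  id?: number | string;
  method?: string;
  result?: unknown;
  error?: { message: string };
}

class StdoutLineParser {
  private buffer = "";

  // Feed one stdout chunk; returns every message completed by this chunk.
  push(chunk: string): JsonRpcMessage[] {
    this.buffer += chunk;
    const lines = this.buffer.split("\n");
    this.buffer = lines.pop() ?? ""; // the last piece is the incomplete tail
    const messages: JsonRpcMessage[] = [];
    for (const line of lines) {
      if (line.trim() === "") continue;
      try {
        messages.push(JSON.parse(line) as JsonRpcMessage);
      } catch {
        // Malformed JSON line: skip it rather than crash.
      }
    }
    return messages;
  }
}

// Requests carry a method and an id, notifications a method only,
// and anything without a method is treated as a response.
function classify(msg: JsonRpcMessage): "request" | "notification" | "response" {
  if (msg.method !== undefined) return msg.id !== undefined ? "request" : "notification";
  return "response";
}
```

Keeping the incomplete tail in the buffer matters because a single JSON message can arrive split across several stdout chunks.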

### Request/Response Handling

**Active Request Tracking:**

* Map of request ID → `{resolve, reject, timeout, startTime}`
* Configurable timeout per request (default 30s)
* Automatic cleanup on response or timeout

**Request Flow:**

1. Validate process status (must be 'starting' or 'running')
2. Register timeout handler
3. Write JSON-RPC message to stdin
4. Wait for response via stdout parsing
5. Resolve/reject promise based on response

**Error Handling:**

* Write errors: Immediate rejection
* Timeout errors: Clean up active request, reject with timeout message
* JSON-RPC errors: Extract `error.message` from response
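The request flow and its three error paths can be sketched with a small tracker class. `RequestTracker` is a hypothetical name, the `write` callback stands in for the process's stdin, and the process status check from step 1 is omitted to keep the example short.

```typescript
interface PendingRequest {
  resolve: (result: unknown) => void;
  reject: (error: Error) => void;
  timeout: ReturnType<typeof setTimeout>;
  startTime: number;
}

class RequestTracker {
  private active = new Map<number, PendingRequest>();
  private nextId = 1;
  private write: (line: string) => void;

  constructor(write: (line: string) => void) {
    this.write = write;
  }

  get pendingCount(): number {
    return this.active.size;
  }

  send(method: string, params: unknown, timeoutMs = 30_000): Promise<unknown> {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      // Timeout error: clean up the active request, then reject.
      const timeout = setTimeout(() => {
        this.active.delete(id);
        reject(new Error(`Request ${id} (${method}) timed out after ${timeoutMs}ms`));
      }, timeoutMs);
      this.active.set(id, { resolve, reject, timeout, startTime: Date.now() });
      try {
        this.write(JSON.stringify({ jsonrpc: "2.0", id, method, params }) + "\n");
      } catch (err) {
        // Write error: reject immediately.
        clearTimeout(timeout);
        this.active.delete(id);
        reject(err instanceof Error ? err : new Error(String(err)));
      }
    });
  }

  // Called by the stdout parser for every message that carries an id.
  handleResponse(msg: { id: number; result?: unknown; error?: { message: string } }): void {
    const pending = this.active.get(msg.id);
    if (!pending) return; // unknown id or request already timed out
    clearTimeout(pending.timeout);
    this.active.delete(msg.id);
    if (msg.error) pending.reject(new Error(msg.error.message));
    else pending.resolve(msg.result);
  }
}
```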

## Process Lifecycle

<Info>
  **Idle Process Management**: Processes that remain inactive for extended periods are automatically terminated and respawned on-demand to optimize memory usage. See [Idle Process Management](/development/satellite/idle-process-management) for details on automatic termination, dormant state tracking, and respawning.
</Info>

<Info>
  **Configuration Updates**: When a user updates their MCP server configuration (args, env) via the dashboard, the backend sends a configure command to the satellite. For stdio servers, the satellite automatically restarts the process with the new configuration. See [Backend Communication](/development/satellite/backend-communication) for the command flow.
</Info>

### Lifecycle States

**starting:**

* Process spawned with handlers attached
* MCP handshake in progress
* Accepts handshake messages only

**running:**

* Handshake completed successfully
* Ready for JSON-RPC requests
* Tools discovered and cached

**terminating:**

* Graceful shutdown initiated
* Active requests cancelled
* Awaiting process exit

**terminated:**

* Process exited
* Removed from tracking maps

**failed:**

* Spawn or handshake failure
* Not operational

### Graceful Termination

Process termination follows a two-step graceful shutdown approach to ensure clean process exit and proper resource cleanup.

#### Termination Steps

**Step 1: SIGTERM (Graceful Shutdown)**

* Send SIGTERM signal to the process
* Process has 10 seconds (default timeout) to shut down gracefully
* Process can complete in-flight operations and clean up resources
* Wait for process to exit voluntarily

**Step 2: SIGKILL (Force Termination)**

* If the process doesn't exit within the timeout period, send SIGKILL to force immediate termination
* Guaranteed process termination (cannot be caught or ignored)
* Used as last resort for unresponsive processes
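The two steps can be sketched as one function. `terminateGracefully` and the `Killable` interface are hypothetical. `Killable` is a minimal slice of Node's `ChildProcess`, so an object returned by `child_process.spawn()` should fit it.

```typescript
// Minimal slice of Node's ChildProcess that this sketch relies on.
interface Killable {
  kill(signal: "SIGTERM" | "SIGKILL"): boolean;
  once(event: "exit", listener: () => void): unknown;
}

function terminateGracefully(
  proc: Killable,
  timeoutMs = 10_000
): Promise<"graceful" | "forced"> {
  return new Promise((resolve) => {
    let forced = false;
    // Step 2 (armed first): force termination if the process outlives the timeout.
    const timer = setTimeout(() => {
      forced = true;
      proc.kill("SIGKILL");
    }, timeoutMs);
    proc.once("exit", () => {
      clearTimeout(timer);
      resolve(forced ? "forced" : "graceful");
    });
    // Step 1: ask the process to shut down on its own.
    proc.kill("SIGTERM");
  });
}
```

Registering the `exit` listener before sending SIGTERM avoids missing an exit that happens immediately.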

#### Termination Types

The system handles four types of intentional terminations differently:

**1. Manual Termination**

* Triggered by explicit restart or stop commands
* Status set to `'terminating'` before sending signals
* No auto-restart triggered
* Standard graceful shutdown with SIGTERM → SIGKILL

**2. Idle/Dormant Termination**

* Triggered by idle timeout (default: 180 seconds of inactivity)
* Process marked with `isDormantShutdown` flag
* Configuration stored in dormant map for fast respawn
* Tools remain cached for instant availability
* No auto-restart triggered (intentional shutdown)
* See [Idle Process Management](/development/satellite/idle-process-management) for details

**3. Uninstall Termination**

* Triggered when server removed from configuration
* Process marked with `isUninstallShutdown` flag
* Complete cleanup: process, dormant config, tools, restart tracking
* No auto-restart triggered (intentional removal)
* Invoked via `removeServerCompletely()` method

**4. Configuration Update Restart**

* Triggered when stdio server configuration is modified (e.g., user args change)
* Detected via `DynamicConfigManager` comparing old vs new configuration
* Existing process terminated with graceful shutdown
* Tools cleared from cache via `stdioToolDiscoveryManager.clearServerTools()`
* New process spawned with updated configuration (new args, env)
* Tool discovery runs automatically on the new process
* Enables real-time configuration updates without satellite restart

<Info>
  **HTTP/SSE Servers**: Unlike stdio servers, HTTP/SSE servers don't require restart on config changes. Their configuration (headers, query params, URL) is read fresh on each request, so updates are immediate.
</Info>

#### Crash Detection vs Intentional Shutdown

The system distinguishes between crashes and intentional shutdowns:

**Crash Detection Logic:**

```typescript
// Process is considered crashed if:
// 1. Exit code is non-zero (e.g., 1, 143)
// 2. Status is NOT 'terminating'
// 3. NOT marked as intentional shutdown (isDormantShutdown or isUninstallShutdown)
const wasCrash = code !== 0 && code !== null && 
                 processInfo.status !== 'terminating' &&
                 !processInfo.isDormantShutdown &&
                 !processInfo.isUninstallShutdown;
```

**Why This Matters:**

* SIGTERM exit code is 143 (non-zero)
* Without flags, graceful termination would trigger auto-restart
* Flags prevent unwanted restarts for intentional shutdowns

#### Cleanup Operations

During termination, the following cleanup operations occur:

1. **Active Request Cancellation**
   * All pending JSON-RPC requests are rejected
   * Active requests map is cleared
   * Clients receive termination error

2. **State Cleanup**
   * Remove from processes map (by process ID)
   * Remove from processIdsByName map (by installation name)
   * Remove from team tracking sets
   * Clear dormant config if exists (for uninstall)

3. **Resource Tracking**
   * Restart attempts cleared (for uninstall)
   * Respawn promises cleared
   * Process metrics finalized

4. **Event Emission**
   * Emit `processTerminated` internal event
   * Emit `processExit` with exit code and signal
   * Emit `mcp.server.crashed` if crash detected (Backend event)

#### Complete Server Removal

The `removeServerCompletely()` method provides comprehensive cleanup for server uninstall:

**Method Signature:**

```typescript
async removeServerCompletely(
  installationName: string,
  timeout: number = 10000
): Promise<{ active: boolean; dormant: boolean }>
```

**Operation Flow:**

1. Check for active process
   * If found: Set `isUninstallShutdown` flag
   * Terminate with graceful shutdown
   * Return `active: true`

2. Check for dormant config
   * If found: Remove from dormant map
   * Return `dormant: true`

3. Clear restart tracking
   * Delete restart attempts history
   * Prevent any future restart attempts

**Usage Example:**

```typescript
// Called when server removed from configuration
const result = await processManager.removeServerCompletely(
  'sequential-thinking-team-name-abc123'
);

// Result: { active: true, dormant: false }
// - Active process was terminated
// - No dormant config existed
```

**Logging Output:**

```
INFO: Removing server completely: sequential-thinking-team-name-abc123
INFO: Terminating active process: sequential-thinking-team-name-abc123
DEBUG: Sent SIGTERM to sequential-thinking-team-name-abc123
INFO: Process terminated for uninstall (not a crash)
INFO: Server removed completely (active: true, dormant: false)
```

#### Termination Timing

**Normal Termination:**

* SIGTERM sent: \~1ms
* Process cleanup: 10-500ms (application-dependent)
* Total time: 11-501ms

**Forced Termination:**

* SIGTERM sent: \~1ms
* Timeout wait: 10,000ms
* SIGKILL sent: \~1ms
* Immediate kill: \~10ms
* Total time: \~10,012ms

**Best Practices:**

* MCP servers should handle SIGTERM gracefully
* Complete in-flight requests within timeout
* Close file handles and network connections
* Exit with code 0 for clean shutdown

## Auto-Restart System

### Crash Detection

The system detects crashes based on exit conditions:

* Non-zero exit code
* Process not in 'terminating' state
* Unexpected signal termination

### Restart Policy

**Limits:**

* Maximum 3 restart attempts in 5-minute window
* After limit exceeded: Process marked 'permanently\_failed' in RuntimeState

**Backoff Delays:**

* Process ran >60 seconds before crash: Immediate restart
* Quick crashes: Exponential backoff (1s → 5s → 15s)

**Restart Flow:**

1. Detect crash with exit code and signal
2. Check restart eligibility (3 attempts in 5 minutes)
3. Apply backoff delay based on uptime
4. Attempt restart via `spawnProcess()`
5. Emit 'processRestarted' or 'restartLimitExceeded' event
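The eligibility check and backoff selection could be expressed as a pure function. `decideRestart` is a hypothetical name, and picking the backoff step from the number of recent attempts is an assumption, since the document only lists the delay sequence.

```typescript
const MAX_ATTEMPTS = 3;
const WINDOW_MS = 5 * 60 * 1000;       // 5-minute window
const STABLE_UPTIME_MS = 60 * 1000;    // ran longer than 60s: restart immediately
const BACKOFF_MS = [1_000, 5_000, 15_000];

interface RestartDecision {
  allowed: boolean;
  delayMs: number;
}

// attempts: timestamps (ms) of earlier restart attempts for this process.
function decideRestart(attempts: number[], now: number, uptimeMs: number): RestartDecision {
  const recent = attempts.filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_ATTEMPTS) return { allowed: false, delayMs: 0 };
  if (uptimeMs > STABLE_UPTIME_MS) return { allowed: true, delayMs: 0 };
  // Assumption: the nth recent attempt selects the nth backoff step.
  return { allowed: true, delayMs: BACKOFF_MS[recent.length] };
}

const firstCrash = decideRestart([], 0, 500);                    // allowed, 1000ms delay
const thirdCrash = decideRestart([0, 2_000], 8_000, 500);        // allowed, 15000ms delay
const overLimit = decideRestart([0, 2_000, 8_000], 30_000, 500); // not allowed
```

When the decision is not allowed, the caller would mark the process as permanently failed instead of scheduling another spawn.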

<Warning>
  **Permanently Failed State**: After 3 failed restart attempts, processes enter a permanently\_failed state and are tracked separately for reporting. They will not be restarted automatically and require manual intervention.
</Warning>

## RuntimeState Integration

RuntimeState maintains in-memory tracking of all MCP server processes:

**Tracking Methods:**

* By process ID (UUID)
* By installation name (for lookups)
* By team ID (for team-grouped operations)

**RuntimeProcessInfo Fields:**

* Extends ProcessInfo with: `installationId`, `installationName`, `teamId`
* Health status: unknown/healthy/unhealthy
* Last health check timestamp

**Special Tracking:**

* **Permanently Failed Map**: Separate storage for processes exceeding restart limits
* **Team-Grouped Sets**: Map of team\_id → Set of process IDs for heartbeat reporting

**State Queries:**

* Get all processes (includes permanently failed for reporting)
* Get team processes (filter by team\_id)
* Get running team processes (status='running')
* Get process count by status
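A reduced sketch of the three lookup structures and the state queries. `RuntimeStateSketch` and its method names are hypothetical, `RuntimeProcessInfo` is trimmed to the fields needed here, and the permanently failed map is left out.

```typescript
type ProcessStatus = "starting" | "running" | "terminating" | "terminated" | "failed";

interface RuntimeProcessInfo {
  processId: string;
  installationId: string;
  installationName: string;
  teamId: string;
  status: ProcessStatus;
}

class RuntimeStateSketch {
  private processes = new Map<string, RuntimeProcessInfo>();   // by process ID
  private processIdsByName = new Map<string, string>();        // installation name -> process ID
  private teamProcesses = new Map<string, Set<string>>();      // team_id -> process IDs

  add(info: RuntimeProcessInfo): void {
    this.processes.set(info.processId, info);
    this.processIdsByName.set(info.installationName, info.processId);
    const ids = this.teamProcesses.get(info.teamId) ?? new Set<string>();
    ids.add(info.processId);
    this.teamProcesses.set(info.teamId, ids);
  }

  getByName(installationName: string): RuntimeProcessInfo | undefined {
    const id = this.processIdsByName.get(installationName);
    return id === undefined ? undefined : this.processes.get(id);
  }

  getTeamProcesses(teamId: string): RuntimeProcessInfo[] {
    const result: RuntimeProcessInfo[] = [];
    this.teamProcesses.get(teamId)?.forEach((id) => {
      const info = this.processes.get(id);
      if (info) result.push(info);
    });
    return result;
  }

  getRunningTeamProcesses(teamId: string): RuntimeProcessInfo[] {
    return this.getTeamProcesses(teamId).filter((p) => p.status === "running");
  }

  countByStatus(): Record<string, number> {
    const counts: Record<string, number> = {};
    this.processes.forEach((p) => {
      counts[p.status] = (counts[p.status] ?? 0) + 1;
    });
    return counts;
  }
}
```

The team-keyed sets let heartbeat reporting list one team's processes without scanning every entry.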

## Process Monitoring

### Metrics Tracked

Each process tracks operational metrics:

* **Message count**: Total requests sent to process
* **Error count**: Communication failures
* **Last activity**: Timestamp of last message sent/received
* **Uptime**: Calculated from start time
* **Active requests**: Count of pending requests

### Events Emitted

The ProcessManager emits events for monitoring and integration:

* `processSpawned`: New process started successfully
* `processRestarted`: Process restarted after crash
* `processTerminated`: Process shut down
* `processExit`: Process exited (any reason)
* `processError`: Spawn or runtime error
* `serverNotification`: Notification received from MCP server
* `restartLimitExceeded`: Max restart attempts reached
* `restartFailed`: Restart attempt failed

### Logging

**stderr Handling:**

* Logged at debug level (informational output, not errors)
* MCP servers often write logs to stderr

**stdout Parse Errors:**

* Malformed JSON lines logged and skipped
* Does not crash the process or satellite

**Structured Logging:**

* All operations include: `installation_name`, `installation_id`, `team_id`
* Request tracking includes: `request_id`, `method`, `duration_ms`
* Error context includes: error messages, exit codes, signals

## Event Emission

The ProcessManager emits real-time lifecycle events (started, crashed, restarted, permanently\_failed) to the Backend for operational visibility and audit trails.

ProcessManager internal events (processSpawned, processTerminated) are for satellite-internal coordination. Event System events (mcp.server.started, etc.) are sent to Backend for external visibility.

See [Event Emission - Process Lifecycle Events](/development/satellite/event-emission#event-types-reference) for complete event types, payloads, and batching configuration.

## Team Isolation

### Installation Name Format

Installation names follow a strict format for team isolation:

```
{server_slug}-{team_slug}-{user_slug}-{installation_id}
```

**Examples:**

* `filesystem-acme-alice-abc123`
* `duckduckgo-acme-bob-def456`

### Team Access Validation

TeamIsolationService provides:

* `extractTeamInfo()`: Parse installation name into components
* `validateTeamAccess()`: Ensure request team matches process team
* `isValidInstallationName()`: Validate name format

**Team-Specific Features:**

* RuntimeState groups processes by team\_id
* nsjail uses team-specific hostname: `mcp-{team_id}`
* Heartbeat reports processes grouped by team

## Performance Characteristics

**Timing:**

* Spawn time: 1-3 seconds (includes handshake and tool discovery)
* Message latency: \~10-50ms for stdio communication
* Handshake timeout: 30 seconds

**Resource Usage:**

* Memory per process: Base \~10-20MB (application-dependent; virtual address space unlimited, physical RAM capped at 512MB via cgroup when enabled)
* Runtime-aware cache isolation: Separate cache directories per runtime (`/mcp-cache/node/{team_id}`, `/mcp-cache/python/{team_id}`)
* Event-driven architecture: Handles multiple processes concurrently
* CPU overhead: Minimal (background event loop processing)

**Scalability:**

* No hard limit on process count (bounded by system resources)
* Team-grouped tracking enables efficient filtering
* Permanent failure tracking prevents infinite restart loops

## Development & Testing

### Local Development

**Development Mode:**

* Uses direct spawn (no nsjail required)
* Works on macOS, Windows, Linux
* Full environment inheritance simplifies debugging

**Debug Logging:**

```bash
# Enable detailed stdio communication logs
LOG_LEVEL=debug npm run dev
```

### Testing Processes

**Manual Testing Methods:**

* `getAllProcesses()`: Inspect all active processes
* `getServerStatus(installationName)`: Get detailed process status
* `restartServer(installationName)`: Test restart functionality
* `terminateProcess(processInfo)`: Test graceful shutdown

**Platform Support:**

* Development: All platforms (macOS/Windows/Linux)
* Production: Linux only (nsjail requirement)

## Security Considerations

**Environment Injection:**

* Credentials passed securely via environment variables
* No credentials stored in process arguments or logs

**Resource Limits (Production):**

* nsjail enforces hard limits via rlimits:
  * Virtual memory: unlimited (rlimit\_as=inf — Node.js WASM requires \~10GB virtual address space)
  * CPU time: 60 seconds (rlimit\_cpu)
  * Processes: 1000 (rlimit\_nproc)
  * File descriptors: 1024 (rlimit\_nofile)
  * File size: 50MB (rlimit\_fsize)
* tmpfs for /tmp: 100MB limit
* tmpfs for GitHub deployments (/app): 300MB kernel-enforced quota
* Physical memory: 512MB per process via cgroup — auto-detected at startup; active only when satellite runs as a systemd service with `Delegate=yes` (see [Enable Cgroup Limits](/self-hosted/production-satellite#enable-cgroup-limits))
* Prevents resource exhaustion attacks
* Runtime-specific cache directories prevent cross-runtime contamination

**Namespace Isolation (Production):**

* Complete process isolation per team and runtime
* Separate PID, mount, user, UTS, and IPC namespaces
* Runtime-aware home directories: `/home/node` for Node.js, `/home/python` for Python

**Filesystem Jailing (Production):**

* System directories mounted read-only
* Only `/tmp` writable
* Runtime-specific cache mounts: `/mcp-cache/node/{team_id}`, `/mcp-cache/python/{team_id}`
* Prevents filesystem tampering and cross-team access

**Network Access:**

* Enabled by default (MCP servers need external connectivity)
* Can be disabled for higher security requirements

## Status Events

The process lifecycle emits status events to the backend for real-time monitoring:

**Status Event Emission:**

* `connecting` - When process spawn starts
* `online` - After successful handshake and tool discovery
* `permanently_failed` - When process crashes 3 times in 5 minutes

See [Event Emission](/development/satellite/event-emission) for complete event types and payloads.

## Log Buffering

Process stderr output is buffered and batched before emission:

**Buffering Strategy:**

* Batch interval: 3 seconds after first log
* Max batch size: 20 logs (forces immediate flush)
* Grouping: By installation\_id + team\_id

**Log Levels:**

* Inferred from message content (`error` if contains "error", etc.)
* Metadata includes process\_id for debugging
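The buffering rules can be sketched as a small batcher, with one instance per `installation_id` + `team_id` group. `LogBatcher` and `inferLevel` are hypothetical names. Only the "error" rule is documented, so the "warn" rule below is an illustrative assumption.

```typescript
interface LogEntry {
  level: "error" | "warn" | "info";
  message: string;
}

// Level inference from message content; the "warn" rule is an assumption.
function inferLevel(message: string): LogEntry["level"] {
  const text = message.toLowerCase();
  if (text.includes("error")) return "error";
  if (text.includes("warn")) return "warn";
  return "info";
}

class LogBatcher {
  private buffer: LogEntry[] = [];
  private timer: ReturnType<typeof setTimeout> | undefined;
  private emit: (batch: LogEntry[]) => void;

  constructor(emit: (batch: LogEntry[]) => void) {
    this.emit = emit;
  }

  add(message: string): void {
    this.buffer.push({ level: inferLevel(message), message });
    if (this.buffer.length >= 20) {
      this.flush(); // max batch size reached: flush immediately
      return;
    }
    if (this.timer === undefined) {
      // The 3-second interval starts with the first log of a batch.
      this.timer = setTimeout(() => this.flush(), 3000);
    }
  }

  flush(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.emit(batch);
  }
}
```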

See [Log Capture](/development/satellite/log-capture) for buffer management details.

## Configuration Restart Flow

When configuration is updated (env vars, args, headers, query params):

1. Backend sets installation status to `restarting`
2. Backend sends `configure` command to satellite
3. Satellite receives command and stops old process
4. Satellite clears tool cache for installation
5. Satellite spawns new process with updated configuration
6. Status progresses: `restarting` → `connecting` → `discovering_tools` → `online`

See [Status Tracking](/development/satellite/status-tracking) for configuration update status transitions.

## Related Documentation

* [Satellite Architecture Design](/development/satellite/architecture) - Overall system architecture
* [Idle Process Management](/development/satellite/idle-process-management) - Automatic termination and respawning of idle processes
* [Tool Discovery Implementation](/development/satellite/tool-discovery) - How tools are discovered from processes
* [Event Emission](/development/satellite/event-emission) - Process lifecycle events
* [Log Capture](/development/satellite/log-capture) - stderr log buffering
* [Status Tracking](/development/satellite/status-tracking) - Process status management
* [Backend Communication](/development/satellite/backend-communication) - Integration with Backend commands
