Virtual-Machines Sync Stage HTTP 502 Fix¶
Problem¶
The virtual-machines sync stage was failing after ~2 minutes with:
- Error: RuntimeError("Stage 'virtual-machines' failed (HTTP 502): Response ended prematurely")
- Root Cause: PostgreSQL connection pool exhaustion during backup synchronization
Evidence¶
[06/Apr/2026 22:37:27-31] Hundreds of concurrent PATCH/POST to /api/plugins/proxbox/backups/
[06/Apr/2026 22:37:30] "Skipping config initialization (database unavailable)"
[06/Apr/2026 22:37:31] HTTP 500 errors for all backup operations
NetBox's PostgreSQL connection pool was completely exhausted, preventing any database operations.
Solution¶
Changes Made¶
1. Environment Variables (proxbox_api/routes/virtualization/virtual_machines/backups_vm.py)¶
Added configurable backup sync throttling:
_DEFAULT_BACKUP_BATCH_SIZE = max(1, int(os.getenv("PROXBOX_BACKUP_BATCH_SIZE", "5")))
_DEFAULT_BACKUP_BATCH_DELAY_MS = max(0, int(os.getenv("PROXBOX_BACKUP_BATCH_DELAY_MS", "200")))
- PROXBOX_BACKUP_BATCH_SIZE: Default reduced from 10 → 5
- PROXBOX_BACKUP_BATCH_DELAY_MS: Default 200ms inter-batch delay
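The clamping behaviour of the two settings can be illustrated with a small sketch. The helper name `read_int_env` is hypothetical and not part of proxbox-api; it only mirrors the `max(..., int(os.getenv(...)))` pattern shown above to make the lower bounds visible.

```python
import os


def read_int_env(name: str, default: str, minimum: int) -> int:
    """Read an integer setting and clamp it to a lower bound.

    Hypothetical helper that mirrors the max(..., int(os.getenv(...))) pattern.
    """
    # int() raises ValueError for non-numeric input such as "fast";
    # this sketch deliberately does not catch that error.
    return max(minimum, int(os.getenv(name, default)))


# Out-of-range values are raised to the minimum instead of being rejected.
os.environ["PROXBOX_BACKUP_BATCH_SIZE"] = "0"        # below 1, clamped to 1
os.environ["PROXBOX_BACKUP_BATCH_DELAY_MS"] = "-50"  # negative, clamped to 0

batch_size = read_int_env("PROXBOX_BACKUP_BATCH_SIZE", "5", minimum=1)
delay_ms = read_int_env("PROXBOX_BACKUP_BATCH_DELAY_MS", "200", minimum=0)
print(batch_size, delay_ms)  # 1 0
```

Because the module-level constants above are evaluated at import time, the variables presumably need to be set before the proxbox-api process starts.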
2. Enhanced process_backups_batch()¶
async def process_backups_batch(
    backup_tasks: list,
    batch_size: int = 10,
    delay_ms: int = 200,
) -> tuple[list, int]:
Improvements:
- ✅ Added configurable delay_ms parameter
- ✅ Added asyncio.sleep(delay_ms / 1000.0) between batches
- ✅ Progress logging every 10 batches (reduces log spam)
- ✅ Total batch count tracking for progress reporting
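The improvements listed above can be sketched as a self-contained function. This is not the actual body of `process_backups_batch()`; the names `run_in_batches` and `fake_backup` are hypothetical, and the sketch assumes that failures are counted rather than raised and that no sleep follows the final batch.

```python
import asyncio
import logging
from typing import Any, Awaitable

logger = logging.getLogger("backup_batch_sketch")


async def run_in_batches(
    tasks: list[Awaitable[Any]],
    batch_size: int = 10,
    delay_ms: int = 200,
) -> tuple[list[Any], int]:
    """Await tasks in fixed-size batches, pausing between batches."""
    results: list[Any] = []
    failure_count = 0
    total_batches = (len(tasks) + batch_size - 1) // batch_size

    for batch_number, start in enumerate(range(0, len(tasks), batch_size), start=1):
        batch = tasks[start:start + batch_size]
        # return_exceptions=True returns failures as values instead of
        # raising on the first one, so they can be counted.
        outcomes = await asyncio.gather(*batch, return_exceptions=True)
        for outcome in outcomes:
            if isinstance(outcome, BaseException):
                failure_count += 1
            else:
                results.append(outcome)

        # Log every 10th batch (and the last one) to limit log volume.
        if batch_number % 10 == 0 or batch_number == total_batches:
            logger.info("Processed batch %d/%d", batch_number, total_batches)

        # Pause only between batches, not after the last one.
        if delay_ms > 0 and batch_number < total_batches:
            await asyncio.sleep(delay_ms / 1000.0)

    return results, failure_count


async def fake_backup(index: int) -> int:
    """Stand-in for one backup sync call; every seventh call fails."""
    if index % 7 == 0:
        raise RuntimeError(f"backup {index} failed")
    return index


async def demo() -> tuple[list[int], int]:
    tasks = [fake_backup(i) for i in range(1, 23)]
    return await run_in_batches(tasks, batch_size=5, delay_ms=10)


results, failures = asyncio.run(demo())
print(len(results), failures)  # 19 3
```

Passing a `delay_ms` of 0 skips the sleep entirely, which matches the lower bound the environment variable allows.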
3. Caller Updates¶
Updated _create_all_virtual_machine_backups() to use new defaults:
results, failure_count = await process_backups_batch(
    all_backup_tasks,
    batch_size=_DEFAULT_BACKUP_BATCH_SIZE,
    delay_ms=_DEFAULT_BACKUP_BATCH_DELAY_MS,
)
4. Documentation (README.md)¶
Added comprehensive "Backup Sync Throttling" section with:
- Configuration examples
- Tuning guidelines based on NetBox capacity
- Symptom-based troubleshooting guide
Technical Details¶
Why This Fixes the Issue¶
Before:
1. 100+ backups → 10+ batches of 10 concurrent tasks
2. Each batch fires asyncio.gather(*batch) immediately
3. All coroutines queue up, even though the NetBox semaphore limits concurrency
4. PostgreSQL connection pool (typically 20-40 connections) is exhausted
5. NetBox returns HTTP 500 "database unavailable"
6. proxbox-api HTTP stream times out → HTTP 502
After:
1. Smaller batches (5 instead of 10) reduce burst pressure
2. 200ms delay between batches allows PostgreSQL connections to release
3. NetBox connection pool stays under capacity
4. Existing retry logic in netbox_rest.py handles transient failures
5. Sync completes successfully
Existing Safety Mechanisms¶
The fix complements existing protections:
- Semaphore in netbox_rest.py (default concurrency=1)
- Retry logic with exponential backoff
- Transient error detection
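How these three mechanisms typically fit together can be shown with a generic sketch. It is not the code in `netbox_rest.py` or `utils/retry.py`: the names `HttpError`, `is_transient`, `backoff_seconds`, and `call_with_retry`, the set of status codes treated as transient, and the backoff constants are all assumptions made for illustration.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

# Assumed for illustration; the project's real list may differ.
TRANSIENT_STATUS_CODES = {500, 502, 503, 504}


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def is_transient(error: Exception) -> bool:
    """Transient error detection: retry only errors that may clear up."""
    return isinstance(error, HttpError) and error.status in TRANSIENT_STATUS_CODES


def backoff_seconds(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Exponential backoff: base * 2**attempt, limited to cap."""
    return min(cap, base * (2 ** attempt))


async def call_with_retry(
    semaphore: asyncio.Semaphore,
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 4,
    base: float = 0.5,
) -> T:
    for attempt in range(max_attempts):
        try:
            # The semaphore caps how many requests are in flight at once.
            async with semaphore:
                return await operation()
        except Exception as error:
            if not is_transient(error) or attempt == max_attempts - 1:
                raise
        # Sleep outside the semaphore so a waiting retry does not hold a slot.
        await asyncio.sleep(backoff_seconds(attempt, base=base))
    raise AssertionError("unreachable")


async def demo() -> tuple[str, int]:
    semaphore = asyncio.Semaphore(1)  # concurrency of 1, as in the default above
    calls = 0

    async def flaky_patch() -> str:
        nonlocal calls
        calls += 1
        if calls < 3:
            raise HttpError(500)  # simulated "database unavailable"
        return "ok"

    result = await call_with_retry(semaphore, flaky_patch, base=0.01)
    return result, calls


result, calls = asyncio.run(demo())
print(result, calls)  # ok 3
```

Retries alone cannot help while the pool stays exhausted, which is why the inter-batch delay is described here as a complement to these protections rather than a replacement.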
Configuration Guide¶
Conservative (Default)¶
Use when: Standard NetBox deployment, shared database, multiple users
Aggressive (High-Performance)¶
Use when: Dedicated NetBox server, large PostgreSQL pool (50+ connections), few concurrent users
Ultra-Conservative (Troubled Systems)¶
Use when: Frequent "database unavailable" errors, shared/overloaded database
Testing Recommendations¶
- Run full sync with 100+ backups
- Check PostgreSQL connection usage
- Verify sync completion
  - Check NetBox job status shows "Completed" (not "Errored")
  - Confirm backup count matches Proxmox
  - Confirm there are no HTTP 502 errors in logs
Rollout Impact¶
- ✅ Backwards Compatible: Existing code works unchanged
- ✅ Conservative Defaults: Safer out-of-the-box (batch_size 5 vs 10)
- ✅ No Breaking Changes: All existing APIs unchanged
- ✅ Self-Documenting: Environment variables are clearly named
- ⚠️ Slight Slowdown: The 200ms inter-batch delay adds roughly 2s (batch size 10) to 4s (batch size 5) per 100 backups (acceptable for stability)
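The slowdown estimate can be checked with a few lines of arithmetic. The function name `added_delay_seconds` is hypothetical, and the sketch assumes that no delay follows the final batch, so a run of N batches sleeps N - 1 times.

```python
import math


def added_delay_seconds(backup_count: int, batch_size: int, delay_ms: int) -> float:
    """Total sleep time added by inter-batch delays (none after the last batch)."""
    batches = math.ceil(backup_count / batch_size)
    return max(0, batches - 1) * delay_ms / 1000.0


print(added_delay_seconds(100, batch_size=5, delay_ms=200))   # 3.8
print(added_delay_seconds(100, batch_size=10, delay_ms=200))  # 1.8
```

The overhead grows linearly with the number of batches, so halving the batch size roughly doubles the added wait.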
Related Files¶
- proxbox_api/routes/virtualization/virtual_machines/backups_vm.py - Main implementation
- proxbox_api/netbox_rest.py - Retry logic and semaphore
- proxbox_api/utils/retry.py - Error detection and backoff computation
- README.md - User documentation
- netbox_proxbox/jobs.py - NetBox job runner (unchanged, uses proxbox-api)
Future Improvements¶
- Adaptive throttling: Auto-adjust batch size based on error rate
- Metrics collection: Track batch processing time and failures
- Health endpoint: Expose current batch processing status
- Rate limiter: Global request rate limit across all sync types