Error Recovery
TimeTiles includes an automatic error recovery system that intelligently retries failed import jobs based on error classification and recovery strategies.
Overview
The ErrorRecoveryService automatically handles import failures through:
- Error Classification: Analyzes error patterns to determine retry eligibility
- Exponential Backoff: Prevents overwhelming external services with intelligent retry delays
- Quota Integration: Respects user quota limits when scheduling retries
- Stage Recovery: Restarts from the optimal recovery point in the pipeline
- Manual Fallback: Provides APIs for manual intervention when automatic recovery fails
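As a rough orientation, the service's responsibilities map onto an interface along the lines of the sketch below. The type and method names are illustrative assumptions, not the actual ErrorRecoveryService API; the classification shape mirrors the `errorClassification` object returned by the recommendations endpoint later on this page.

```typescript
// Illustrative sketch only - these names are assumptions, not the real API surface.
type ErrorClassification = {
  type: "recoverable" | "permanent" | "user-action-required";
  retryable: boolean;
  reason: string;
};

interface ErrorRecoveryServiceLike {
  /** Decide whether a failure can be retried automatically. */
  classifyError(error: Error): ErrorClassification;
  /** Schedule a retry with exponential backoff, respecting the user's import quota. */
  scheduleRetry(importJobId: number): Promise<{ scheduled: boolean; nextRetryAt?: Date }>;
  /** Pick the pipeline stage the retry should restart from. */
  selectRecoveryStage(failedStage: string): string;
}
```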
Error Classification
Import failures are classified into three categories to determine the appropriate recovery strategy.
Recoverable Errors (Automatic Retry)
Transient errors that typically resolve with retry:
- Network Errors: Connection failures, timeouts, DNS issues (`ECONNREFUSED`, `ETIMEDOUT`)
- Database Issues: Connection timeouts, deadlocks, temporary unavailability
- Rate Limits: API rate limiting, 429 responses, quota exceeded
- Resource Constraints: Temporary memory issues, disk space (may resolve)
Recovery Strategy: Automatic retry with exponential backoff
Permanent Errors (No Retry)
Errors that won’t resolve without intervention:
- File Errors: File not found, invalid format (`ENOENT`)
- Permission Issues: Unauthorized access, invalid credentials (401, 403)
- Invalid Data: Malformed file structure, unsupported format
- Configuration Errors: Missing required settings, invalid API keys
Recovery Strategy: Manual intervention required, no automatic retry
User Action Required (May Retry After Fix)
Errors requiring user decisions but potentially recoverable:
- Quota Limits: User exceeded daily import quota (retries after quota reset)
- Schema Changes: Breaking changes requiring approval (retries after approval)
- Validation Failures: Data validation errors requiring cleanup
Recovery Strategy: Automatic retry after condition changes (quota reset, approval granted)
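To make the three categories concrete, the sketch below shows what a pattern-based classifier can look like, using the error codes and status codes listed above. The function shape, the message patterns, and the permanent-by-default fallback are assumptions for illustration; the real rules live inside the ErrorRecoveryService.

```typescript
type ErrorType = "recoverable" | "permanent" | "user-action-required";

// Illustrative classifier - patterns and fallback behavior are assumptions.
function classifyError(error: { message: string; code?: string; status?: number }): ErrorType {
  const code = error.code ?? "";
  const status = error.status ?? 0;

  // User action required: quota limits and pending schema approvals.
  if (/quota exceeded|approval required/i.test(error.message)) return "user-action-required";

  // Permanent: missing files, bad credentials, malformed input.
  if (code === "ENOENT" || status === 401 || status === 403) return "permanent";

  // Recoverable: transient network, database, and rate-limit errors.
  if (code === "ECONNREFUSED" || code === "ETIMEDOUT" || status === 429) return "recoverable";
  if (/deadlock|connection timeout/i.test(error.message)) return "recoverable";

  // Unknown errors default to permanent so they are surfaced instead of retried blindly.
  return "permanent";
}
```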
Retry Mechanism
Exponential Backoff
Retry delays increase exponentially to prevent overwhelming services:
```
Retry 1: 30 seconds
Retry 2: 60 seconds (30s × 2)
Retry 3: 120 seconds (60s × 2)
Max delay: 300 seconds (5 minutes)
Max retries: 3 attempts
```

Configuration
Configure retry behavior via environment variables:
```bash
RETRY_BASE_DELAY_MS=30000        # 30 seconds initial delay
RETRY_BACKOFF_MULTIPLIER=2.0     # Double delay each retry
RETRY_MAX_DELAY_MS=300000        # 5 minute maximum delay
RETRY_MAX_ATTEMPTS=3             # Maximum retry attempts
```
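Putting the schedule and configuration together: the delay before retry N is the base delay multiplied by the backoff multiplier raised to N−1, capped at the maximum delay. A minimal sketch of that formula, assuming the environment variables and defaults shown above (the helper name is illustrative):

```typescript
// Reads the documented environment variables, falling back to the documented defaults.
const retryConfig = {
  baseDelayMs: Number(process.env.RETRY_BASE_DELAY_MS ?? 30_000),
  multiplier: Number(process.env.RETRY_BACKOFF_MULTIPLIER ?? 2.0),
  maxDelayMs: Number(process.env.RETRY_MAX_DELAY_MS ?? 300_000),
  maxAttempts: Number(process.env.RETRY_MAX_ATTEMPTS ?? 3),
};

// Delay before retry `attempt` (1-based), capped at the configured maximum.
function retryDelayMs(attempt: number): number {
  const delay = retryConfig.baseDelayMs * retryConfig.multiplier ** (attempt - 1);
  return Math.min(delay, retryConfig.maxDelayMs);
}

// With the defaults this prints 30s, 60s, 120s - the schedule shown above.
for (let attempt = 1; attempt <= retryConfig.maxAttempts; attempt++) {
  console.log(`Retry ${attempt}: ${retryDelayMs(attempt) / 1000}s`);
}
```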
Quota Integration

Before scheduling a retry, the system:
- Checks user’s remaining import quota
- Verifies retry won’t exceed daily limits
- Blocks retry if quota exhausted (retries after quota reset)
- Logs quota warnings for monitoring
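A simplified sketch of that gate is shown below. The quota service and logger shapes are assumptions made for illustration, not the actual TimeTiles interfaces.

```typescript
// Hypothetical quota gate evaluated before a retry is scheduled.
interface QuotaServiceLike {
  getRemainingImports(userId: number): Promise<number>;
}

interface LoggerLike {
  warn(message: string, context: Record<string, unknown>): void;
}

async function canScheduleRetry(quota: QuotaServiceLike, logger: LoggerLike, userId: number): Promise<boolean> {
  const remaining = await quota.getRemainingImports(userId);
  if (remaining <= 0) {
    // Blocked: the retry will be attempted again after the daily quota resets.
    logger.warn("Retry blocked due to quota limit", { userId, remaining });
    return false;
  }
  return true;
}
```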
Recovery Stage Selection
The system determines the optimal restart point based on where failure occurred:
| Failure Stage | Recovery Stage | Reason |
|---|---|---|
| `ANALYZE_DUPLICATES` | `ANALYZE_DUPLICATES` | Start of pipeline |
| `DETECT_SCHEMA` | `DETECT_SCHEMA` | Early stage, minimal cost |
| `VALIDATE_SCHEMA` | `VALIDATE_SCHEMA` | Schema validation specific |
| `CREATE_SCHEMA_VERSION` | `VALIDATE_SCHEMA` | Retry validation first |
| `GEOCODE_BATCH` | `GEOCODE_BATCH` | Resume geocoding |
| `CREATE_EVENTS` | `GEOCODE_BATCH` | Ensure geocoding complete |
Important: Recovery restarts from the beginning of the selected stage to ensure data integrity.
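The table translates directly into a lookup. The sketch below mirrors it; the fallback to a full restart for unknown stages is an assumption, not documented behavior.

```typescript
// Failure stage -> stage the retry restarts from, mirroring the table above.
const RECOVERY_STAGE: Record<string, string> = {
  ANALYZE_DUPLICATES: "ANALYZE_DUPLICATES",
  DETECT_SCHEMA: "DETECT_SCHEMA",
  VALIDATE_SCHEMA: "VALIDATE_SCHEMA",
  CREATE_SCHEMA_VERSION: "VALIDATE_SCHEMA", // retry validation before re-creating the version
  GEOCODE_BATCH: "GEOCODE_BATCH",
  CREATE_EVENTS: "GEOCODE_BATCH", // ensure geocoding is complete before creating events
};

function selectRecoveryStage(failedStage: string): string {
  // Assumed fallback: restart the whole pipeline when the failed stage is unknown.
  return RECOVERY_STAGE[failedStage] ?? "ANALYZE_DUPLICATES";
}
```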
API Endpoints
Retry Failed Import
Endpoint: `POST /api/import-jobs/{id}/retry`
Purpose: Manually trigger retry for a failed import job
Example Request:
```bash
curl -X POST https://your-domain.com/api/import-jobs/123/retry \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Response (Success):
```json
{
  "success": true,
  "action": "retry_scheduled",
  "message": "Import job retry scheduled successfully",
  "retryScheduled": true,
  "nextRetryAt": "2024-01-15T10:35:00.000Z"
}
```

Response (Quota Exceeded):
```json
{
  "success": false,
  "action": "quota_exceeded",
  "error": "User has exceeded their daily import quota. Retry will be attempted after quota resets.",
  "statusCode": 403
}
```

Response (Not Retryable):
```json
{
  "success": false,
  "action": "not_retryable",
  "error": "Error is not retryable: File not found (permanent error)",
  "statusCode": 400
}
```

When to Use:
- Automatic retry hasn’t triggered yet
- Need to retry immediately after fixing underlying issue
- Testing recovery after configuration changes
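For programmatic use, a minimal client might look like the sketch below, branching on the documented `action` values. `BASE_URL`, the token handling, and the function name are placeholders.

```typescript
const BASE_URL = "https://your-domain.com";
const TOKEN = process.env.API_TOKEN ?? "";

// Calls the retry endpoint and reacts to the documented response actions.
async function retryImportJob(importJobId: number): Promise<void> {
  const response = await fetch(`${BASE_URL}/api/import-jobs/${importJobId}/retry`, {
    method: "POST",
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const body = await response.json();

  switch (body.action) {
    case "retry_scheduled":
      console.log(`Retry scheduled, next attempt at ${body.nextRetryAt}`);
      break;
    case "quota_exceeded":
      console.warn("Quota exhausted - retry will run after the quota resets");
      break;
    case "not_retryable":
      console.error(`Manual fix required: ${body.error}`);
      break;
    default:
      console.error(`Unexpected response: ${JSON.stringify(body)}`);
  }
}
```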
Reset Import Job
Endpoint: `POST /api/import-jobs/{id}/reset`
Purpose: Reset import to a specific stage for fresh restart
Example Request:
```bash
curl -X POST https://your-domain.com/api/import-jobs/123/reset \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"stage": "ANALYZE_DUPLICATES"}'
```

Response:
```json
{
  "success": true,
  "message": "Import job reset successfully",
  "stage": "ANALYZE_DUPLICATES"
}
```

Available Reset Stages:
- `ANALYZE_DUPLICATES` - Complete restart
- `DETECT_SCHEMA` - After duplicate analysis
- `VALIDATE_SCHEMA` - After schema detection
- `GEOCODE_BATCH` - After schema validation/approval
When to Use:
- Need to restart from specific stage
- Configuration changes require reprocessing
- Data corruption requires fresh start from known good stage
- Testing stage-specific behavior
Cautions:
- Resetting clears progress from later stages
- May re-process data (ensure idempotency)
- Doesn’t delete already created events (use with caution)
Get Recovery Recommendations
Endpoint: `GET /api/import-jobs/failed/recommendations`
Purpose: Get actionable recommendations for all failed imports
Example Request:
```bash
curl https://your-domain.com/api/import-jobs/failed/recommendations \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Example Response:
```json
{
  "recommendations": [
    {
      "importJobId": 123,
      "importFileName": "events-2024.csv",
      "stage": "GEOCODE_BATCH",
      "failedAt": "2024-01-15T10:00:00.000Z",
      "errorClassification": {
        "type": "recoverable",
        "retryable": true,
        "reason": "Rate limit exceeded (429)"
      },
      "recommendedAction": "automatic_retry",
      "retryCount": 1,
      "nextRetryAt": "2024-01-15T10:30:00.000Z",
      "details": "Automatic retry scheduled. System will retry in 30 seconds with exponential backoff."
    },
    {
      "importJobId": 456,
      "importFileName": "invalid-data.csv",
      "stage": "VALIDATE_SCHEMA",
      "failedAt": "2024-01-15T09:00:00.000Z",
      "errorClassification": {
        "type": "permanent",
        "retryable": false,
        "reason": "File not found (ENOENT)"
      },
      "recommendedAction": "manual_intervention",
      "retryCount": 0,
      "details": "File not found. Verify file exists and is accessible. Upload file again or delete this import job."
    }
  ]
}
```

Recommendation Categories:
| Category | Meaning | User Action |
|---|---|---|
| `automatic_retry` | System will retry automatically | Wait for scheduled retry |
| `wait_for_quota_reset` | Blocked by quota, will retry after reset | Wait until tomorrow |
| `manual_intervention` | Permanent error, needs manual fix | Fix underlying issue and retry |
| `approve_schema_changes` | Awaiting user approval | Review and approve schema changes |
| `max_retries_exceeded` | Retry attempts exhausted | Investigate root cause, reset if needed |
When to Use:
- Monitoring dashboard to show failed imports
- Batch processing recovery for multiple failed imports
- Generating user notifications about failed imports
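A monitoring task could poll this endpoint and branch on `recommendedAction`, as in the sketch below. Everything except the endpoint path and response fields (which come from the example above) is an assumption.

```typescript
const BASE_URL = "https://your-domain.com";
const TOKEN = process.env.API_TOKEN ?? "";

type Recommendation = {
  importJobId: number;
  importFileName: string;
  recommendedAction: string;
  details: string;
};

// Fetches recommendations and reports the ones that need human attention.
async function reportFailedImports(): Promise<void> {
  const res = await fetch(`${BASE_URL}/api/import-jobs/failed/recommendations`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const { recommendations } = (await res.json()) as { recommendations: Recommendation[] };

  for (const rec of recommendations) {
    switch (rec.recommendedAction) {
      case "automatic_retry":
      case "wait_for_quota_reset":
        // Nothing to do: the system retries on its own once conditions allow.
        break;
      case "approve_schema_changes":
        console.info(`Import ${rec.importJobId} (${rec.importFileName}) awaits schema approval`);
        break;
      case "manual_intervention":
      case "max_retries_exceeded":
      default:
        console.warn(`Import ${rec.importJobId} needs attention: ${rec.details}`);
    }
  }
}
```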
Monitoring
Logging
All recovery operations are logged with context:
```typescript
// Automatic retry scheduled
logger.info("Scheduled job recovery", {
  importJobId: 123,
  retryAttempt: 2,
  recoveryStage: "GEOCODE_BATCH",
  nextRetryAt: "2024-01-15T10:30:00.000Z",
});

// Quota blocked retry
logger.warn("Retry blocked due to quota limit", {
  importJobId: 123,
  userId: 456,
  retryAttempt: 2,
  quotaLimit: 10,
  currentUsage: 10,
});

// Permanent error identified
logger.error("Import job has permanent error", {
  importJobId: 123,
  errorType: "permanent",
  reason: "File not found (ENOENT)",
});
```

Import Job State
Track recovery progress in import-job record:
```json
{
  "id": 123,
  "stage": "FAILED",
  "retryAttempts": 2,
  "lastRetryAt": "2024-01-15T10:00:00.000Z",
  "nextRetryAt": "2024-01-15T10:30:00.000Z",
  "errorLog": {
    "lastError": "Rate limit exceeded (429)",
    "recoveryAttempt": {
      "attempt": 2,
      "previousError": "Network timeout",
      "recoveryStage": "GEOCODE_BATCH",
      "classification": "recoverable"
    }
  }
}
```

Best Practices
When to Use Automatic Recovery
✅ Good Use Cases:
- Transient network failures
- API rate limiting
- Temporary resource constraints
- Database connection issues
❌ Poor Use Cases:
- Invalid file formats (won’t fix automatically)
- Missing API credentials (needs configuration)
- Data validation errors (needs data cleanup)
When to Use Manual Intervention
Use the `/retry` endpoint when:
- Fixed underlying issue and want immediate retry
- Need to retry before scheduled automatic retry
- Testing recovery after configuration changes
Use the `/reset` endpoint when:
- Need to restart from specific stage
- Configuration changes require reprocessing
- Automatic recovery not selecting optimal stage
Preventing Repeated Failures
- Check Recommendations First: Review `/api/import-jobs/failed/recommendations` before a manual retry
- Fix Root Cause: Don’t retry permanent errors without fixing the underlying issue
- Monitor Quota Usage: Ensure adequate quota before scheduling retries
- Test Configuration: Verify fixes in staging before retrying production imports
- Review Error Logs: Understand why failure occurred before attempting recovery
Integration with Scheduled Imports
For scheduled imports (automated URL-based imports), error recovery is handled automatically:
- Import Fails: Scheduled import encounters error during processing
- Classification: Error is classified (recoverable, permanent, user-action-required)
- Automatic Retry: If recoverable, system schedules retry with exponential backoff
- Max Retries: After 3 failed attempts, import is marked as failed
- Next Scheduled Run: Next scheduled run will attempt import again from scratch
Important: Retry attempts are separate from scheduled runs. A scheduled import that fails will:
- Retry up to 3 times with exponential backoff (if recoverable)
- Still run again at the next scheduled time regardless of retry outcome
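A conceptual sketch of that separation is below; all of these names are illustrative and none come from the TimeTiles codebase.

```typescript
// Retries belong to the previous run; the schedule fires regardless of their outcome.
type ScheduledImport = {
  nextRunAt: Date;        // fixed by the schedule, e.g. daily at 02:00
  lastRunFailed: boolean; // outcome of the most recent run
  retryAttempts: number;  // backoff retries used by that failed run (max 3)
};

function shouldRunNow(job: ScheduledImport, now: Date): boolean {
  // The next scheduled run starts from scratch, independent of retry state.
  return now >= job.nextRunAt;
}
```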