Error Recovery
TimeTiles includes an automatic error recovery system that intelligently retries failed import jobs based on error classification and recovery strategies.
Overview
The ErrorRecoveryService automatically handles import failures through:
- Error Classification: Analyzes error patterns to determine retry eligibility
- Exponential Backoff: Prevents overwhelming external services with intelligent retry delays
- Quota Integration: Respects user quota limits when scheduling retries
- Stage Recovery: Restarts from the optimal recovery point in the pipeline
- Manual Fallback: Provides APIs for manual intervention when automatic recovery fails
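As a rough orientation, the service's responsibilities map onto an interface along the lines of the sketch below. The type and method names are illustrative assumptions, not the actual ErrorRecoveryService API; the classification shape mirrors the `errorClassification` object returned by the recommendations endpoint later on this page.

```typescript
// Illustrative sketch only - these names are assumptions, not the real API surface.
type ErrorClassification = {
  type: "recoverable" | "permanent" | "user-action-required";
  retryable: boolean;
  reason: string;
};

interface ErrorRecoveryServiceLike {
  /** Decide whether a failure can be retried automatically. */
  classifyError(error: Error): ErrorClassification;
  /** Schedule a retry with exponential backoff, respecting the user's import quota. */
  scheduleRetry(importJobId: number): Promise<{ scheduled: boolean; nextRetryAt?: Date }>;
  /** Pick the pipeline stage the retry should restart from. */
  selectRecoveryStage(failedStage: string): string;
}
```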
Error Classification
Import failures are classified into three categories to determine the appropriate recovery strategy.
Recoverable Errors (Automatic Retry)
Transient errors that typically resolve with retry:
- Network Errors: Connection failures, timeouts, DNS issues (`ECONNREFUSED`, `ETIMEDOUT`)
- Database Issues: Connection timeouts, deadlocks, temporary unavailability
- Rate Limits: API rate limiting, 429 responses, quota exceeded
- Resource Constraints: Temporary memory issues, disk space (may resolve)
Recovery Strategy: Automatic retry with exponential backoff
Permanent Errors (No Retry)
Errors that won’t resolve without intervention:
- File Errors: File not found, invalid format (`ENOENT`)
- Permission Issues: Unauthorized access, invalid credentials (401, 403)
- Invalid Data: Malformed file structure, unsupported format
- Configuration Errors: Missing required settings, invalid API keys
Recovery Strategy: Manual intervention required, no automatic retry
User Action Required (May Retry After Fix)
Errors requiring user decisions but potentially recoverable:
- Quota Limits: User exceeded daily import quota (retries after quota reset)
- Schema Changes: Breaking changes requiring approval (retries after approval)
- Validation Failures: Data validation errors requiring cleanup
Recovery Strategy: Automatic retry after condition changes (quota reset, approval granted)
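To make the three categories concrete, the sketch below shows what a pattern-based classifier can look like, using the error codes and status codes listed above. The function shape, the message patterns, and the permanent-by-default fallback are assumptions for illustration; the real rules live inside the ErrorRecoveryService.

```typescript
type ErrorType = "recoverable" | "permanent" | "user-action-required";

// Illustrative classifier - patterns and fallback behavior are assumptions.
function classifyError(error: { message: string; code?: string; status?: number }): ErrorType {
  const code = error.code ?? "";
  const status = error.status ?? 0;

  // User action required: quota limits and pending schema approvals.
  if (/quota exceeded|approval required/i.test(error.message)) return "user-action-required";

  // Permanent: missing files, bad credentials, malformed input.
  if (code === "ENOENT" || status === 401 || status === 403) return "permanent";

  // Recoverable: transient network, database, and rate-limit errors.
  if (code === "ECONNREFUSED" || code === "ETIMEDOUT" || status === 429) return "recoverable";
  if (/deadlock|connection timeout/i.test(error.message)) return "recoverable";

  // Unknown errors default to permanent so they are surfaced instead of retried blindly.
  return "permanent";
}
```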
Retry Mechanism
Exponential Backoff
Retry delays increase exponentially to prevent overwhelming services:
```
Retry 1: 30 seconds
Retry 2: 60 seconds (30s × 2)
Retry 3: 120 seconds (60s × 2)
Max delay: 300 seconds (5 minutes)
Max retries: 3 attempts
```

Configuration
Configure retry behavior via environment variables:
```bash
RETRY_BASE_DELAY_MS=30000        # 30 seconds initial delay
RETRY_BACKOFF_MULTIPLIER=2.0     # Double delay each retry
RETRY_MAX_DELAY_MS=300000        # 5 minute maximum delay
RETRY_MAX_ATTEMPTS=3             # Maximum retry attempts
```
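Putting the schedule and configuration together: the delay before retry N is the base delay multiplied by the backoff multiplier raised to N−1, capped at the maximum delay. A minimal sketch of that formula, assuming the environment variables and defaults shown above (the helper name is illustrative):

```typescript
// Reads the documented environment variables, falling back to the documented defaults.
const retryConfig = {
  baseDelayMs: Number(process.env.RETRY_BASE_DELAY_MS ?? 30_000),
  multiplier: Number(process.env.RETRY_BACKOFF_MULTIPLIER ?? 2.0),
  maxDelayMs: Number(process.env.RETRY_MAX_DELAY_MS ?? 300_000),
  maxAttempts: Number(process.env.RETRY_MAX_ATTEMPTS ?? 3),
};

// Delay before retry `attempt` (1-based), capped at the configured maximum.
function retryDelayMs(attempt: number): number {
  const delay = retryConfig.baseDelayMs * retryConfig.multiplier ** (attempt - 1);
  return Math.min(delay, retryConfig.maxDelayMs);
}

// With the defaults this prints 30s, 60s, 120s - the schedule shown above.
for (let attempt = 1; attempt <= retryConfig.maxAttempts; attempt++) {
  console.log(`Retry ${attempt}: ${retryDelayMs(attempt) / 1000}s`);
}
```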
Quota Integration

Before scheduling a retry, the system:
- Checks user’s remaining import quota
- Verifies retry won’t exceed daily limits
- Blocks retry if quota exhausted (retries after quota reset)
- Logs quota warnings for monitoring
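A simplified sketch of that gate is shown below. The quota service and logger shapes are assumptions made for illustration, not the actual TimeTiles interfaces.

```typescript
// Hypothetical quota gate evaluated before a retry is scheduled.
interface QuotaServiceLike {
  getRemainingImports(userId: number): Promise<number>;
}

interface LoggerLike {
  warn(message: string, context: Record<string, unknown>): void;
}

async function canScheduleRetry(quota: QuotaServiceLike, logger: LoggerLike, userId: number): Promise<boolean> {
  const remaining = await quota.getRemainingImports(userId);
  if (remaining <= 0) {
    // Blocked: the retry will be attempted again after the daily quota resets.
    logger.warn("Retry blocked due to quota limit", { userId, remaining });
    return false;
  }
  return true;
}
```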
Recovery Stage Selection
The system determines the optimal restart point based on where failure occurred:
| Failure Stage | Recovery Stage | Reason |
|---|---|---|
| `ANALYZE_DUPLICATES` | `ANALYZE_DUPLICATES` | Start of pipeline |
| `DETECT_SCHEMA` | `DETECT_SCHEMA` | Early stage, minimal cost |
| `VALIDATE_SCHEMA` | `VALIDATE_SCHEMA` | Schema validation specific |
| `CREATE_SCHEMA_VERSION` | `VALIDATE_SCHEMA` | Retry validation first |
| `GEOCODE_BATCH` | `GEOCODE_BATCH` | Resume geocoding |
| `CREATE_EVENTS` | `GEOCODE_BATCH` | Ensure geocoding complete |
Important: Recovery restarts from the beginning of the selected stage to ensure data integrity.
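The table translates directly into a lookup. The sketch below mirrors it; the fallback to a full restart for unknown stages is an assumption, not documented behavior.

```typescript
// Failure stage -> stage the retry restarts from, mirroring the table above.
const RECOVERY_STAGE: Record<string, string> = {
  ANALYZE_DUPLICATES: "ANALYZE_DUPLICATES",
  DETECT_SCHEMA: "DETECT_SCHEMA",
  VALIDATE_SCHEMA: "VALIDATE_SCHEMA",
  CREATE_SCHEMA_VERSION: "VALIDATE_SCHEMA", // retry validation before re-creating the version
  GEOCODE_BATCH: "GEOCODE_BATCH",
  CREATE_EVENTS: "GEOCODE_BATCH", // ensure geocoding is complete before creating events
};

function selectRecoveryStage(failedStage: string): string {
  // Assumed fallback: restart the whole pipeline when the failed stage is unknown.
  return RECOVERY_STAGE[failedStage] ?? "ANALYZE_DUPLICATES";
}
```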
API Endpoints
Retry Failed Import
Endpoint: `POST /api/import-jobs/{id}/retry`
Purpose: Manually trigger retry for a failed import job
Example Request:
```bash
curl -X POST https://your-domain.com/api/import-jobs/123/retry \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Response (Success):
```json
{
  "success": true,
  "action": "retry_scheduled",
  "message": "Import job retry scheduled successfully",
  "retryScheduled": true,
  "nextRetryAt": "2024-01-15T10:35:00.000Z"
}
```

Response (Quota Exceeded):
```json
{
  "success": false,
  "action": "quota_exceeded",
  "error": "User has exceeded their daily import quota. Retry will be attempted after quota resets.",
  "statusCode": 403
}
```

Response (Not Retryable):
```json
{
  "success": false,
  "action": "not_retryable",
  "error": "Error is not retryable: File not found (permanent error)",
  "statusCode": 400
}
```

When to Use:
- Automatic retry hasn’t triggered yet
- Need to retry immediately after fixing underlying issue
- Testing recovery after configuration changes
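For programmatic use, a minimal client might look like the sketch below, branching on the documented `action` values. `BASE_URL`, the token handling, and the function name are placeholders.

```typescript
const BASE_URL = "https://your-domain.com";
const TOKEN = process.env.API_TOKEN ?? "";

// Calls the retry endpoint and reacts to the documented response actions.
async function retryImportJob(importJobId: number): Promise<void> {
  const response = await fetch(`${BASE_URL}/api/import-jobs/${importJobId}/retry`, {
    method: "POST",
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const body = await response.json();

  switch (body.action) {
    case "retry_scheduled":
      console.log(`Retry scheduled, next attempt at ${body.nextRetryAt}`);
      break;
    case "quota_exceeded":
      console.warn("Quota exhausted - retry will run after the quota resets");
      break;
    case "not_retryable":
      console.error(`Manual fix required: ${body.error}`);
      break;
    default:
      console.error(`Unexpected response: ${JSON.stringify(body)}`);
  }
}
```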
Reset Import Job
Endpoint: `POST /api/import-jobs/{id}/reset`
Purpose: Reset import to a specific stage for fresh restart
Example Request:
```bash
curl -X POST https://your-domain.com/api/import-jobs/123/reset \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"stage": "ANALYZE_DUPLICATES"}'
```

Response:
```json
{
  "success": true,
  "message": "Import job reset successfully",
  "stage": "ANALYZE_DUPLICATES"
}
```

Available Reset Stages:
- `ANALYZE_DUPLICATES` - Complete restart
- `DETECT_SCHEMA` - After duplicate analysis
- `VALIDATE_SCHEMA` - After schema detection
- `GEOCODE_BATCH` - After schema validation/approval
When to Use:
- Need to restart from specific stage
- Configuration changes require reprocessing
- Data corruption requires fresh start from known good stage
- Testing stage-specific behavior
Cautions:
- Resetting clears progress from later stages
- May re-process data (ensure idempotency)
- Doesn’t delete already created events (use with caution)
Get Recovery Recommendations
Endpoint: `GET /api/import-jobs/failed/recommendations`
Purpose: Get actionable recommendations for all failed imports
Example Request:
```bash
curl https://your-domain.com/api/import-jobs/failed/recommendations \
  -H "Authorization: Bearer YOUR_TOKEN"
```

Example Response:
```json
{
  "recommendations": [
    {
      "importJobId": 123,
      "importFileName": "events-2024.csv",
      "stage": "GEOCODE_BATCH",
      "failedAt": "2024-01-15T10:00:00.000Z",
      "errorClassification": {
        "type": "recoverable",
        "retryable": true,
        "reason": "Rate limit exceeded (429)"
      },
      "recommendedAction": "automatic_retry",
      "retryCount": 1,
      "nextRetryAt": "2024-01-15T10:30:00.000Z",
      "details": "Automatic retry scheduled. System will retry in 30 seconds with exponential backoff."
    },
    {
      "importJobId": 456,
      "importFileName": "invalid-data.csv",
      "stage": "VALIDATE_SCHEMA",
      "failedAt": "2024-01-15T09:00:00.000Z",
      "errorClassification": {
        "type": "permanent",
        "retryable": false,
        "reason": "File not found (ENOENT)"
      },
      "recommendedAction": "manual_intervention",
      "retryCount": 0,
      "details": "File not found. Verify file exists and is accessible. Upload file again or delete this import job."
    }
  ]
}
```

Recommendation Categories:
| Category | Meaning | User Action |
|---|---|---|
| `automatic_retry` | System will retry automatically | Wait for scheduled retry |
| `wait_for_quota_reset` | Blocked by quota, will retry after reset | Wait until tomorrow |
| `manual_intervention` | Permanent error, needs manual fix | Fix underlying issue and retry |
| `approve_schema_changes` | Awaiting user approval | Review and approve schema changes |
| `max_retries_exceeded` | Retry attempts exhausted | Investigate root cause, reset if needed |
When to Use:
- Monitoring dashboard to show failed imports
- Batch processing recovery for multiple failed imports
- Generating user notifications about failed imports
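A monitoring task could poll this endpoint and branch on `recommendedAction`, as in the sketch below. Everything except the endpoint path and response fields (which come from the example above) is an assumption.

```typescript
const BASE_URL = "https://your-domain.com";
const TOKEN = process.env.API_TOKEN ?? "";

type Recommendation = {
  importJobId: number;
  importFileName: string;
  recommendedAction: string;
  details: string;
};

// Fetches recommendations and reports the ones that need human attention.
async function reportFailedImports(): Promise<void> {
  const res = await fetch(`${BASE_URL}/api/import-jobs/failed/recommendations`, {
    headers: { Authorization: `Bearer ${TOKEN}` },
  });
  const { recommendations } = (await res.json()) as { recommendations: Recommendation[] };

  for (const rec of recommendations) {
    switch (rec.recommendedAction) {
      case "automatic_retry":
      case "wait_for_quota_reset":
        // Nothing to do: the system retries on its own once conditions allow.
        break;
      case "approve_schema_changes":
        console.info(`Import ${rec.importJobId} (${rec.importFileName}) awaits schema approval`);
        break;
      case "manual_intervention":
      case "max_retries_exceeded":
      default:
        console.warn(`Import ${rec.importJobId} needs attention: ${rec.details}`);
    }
  }
}
```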
Monitoring
Logging
All recovery operations are logged with context:
```typescript
// Automatic retry scheduled
logger.info("Scheduled job recovery", {
  importJobId: 123,
  retryAttempt: 2,
  recoveryStage: "GEOCODE_BATCH",
  nextRetryAt: "2024-01-15T10:30:00.000Z",
});

// Quota blocked retry
logger.warn("Retry blocked due to quota limit", {
  importJobId: 123,
  userId: 456,
  retryAttempt: 2,
  quotaLimit: 10,
  currentUsage: 10,
});

// Permanent error identified
logger.error("Import job has permanent error", {
  importJobId: 123,
  errorType: "permanent",
  reason: "File not found (ENOENT)",
});
```

Import Job State
Track recovery progress in import-job record:
```json
{
  "id": 123,
  "stage": "FAILED",
  "retryAttempts": 2,
  "lastRetryAt": "2024-01-15T10:00:00.000Z",
  "nextRetryAt": "2024-01-15T10:30:00.000Z",
  "errorLog": {
    "lastError": "Rate limit exceeded (429)",
    "recoveryAttempt": {
      "attempt": 2,
      "previousError": "Network timeout",
      "recoveryStage": "GEOCODE_BATCH",
      "classification": "recoverable"
    }
  }
}
```

Best Practices
When to Use Automatic Recovery
✅ Good Use Cases:
- Transient network failures
- API rate limiting
- Temporary resource constraints
- Database connection issues
❌ Poor Use Cases:
- Invalid file formats (won’t fix automatically)
- Missing API credentials (needs configuration)
- Data validation errors (needs data cleanup)
When to Use Manual Intervention
Use the `/retry` endpoint when:
- Fixed underlying issue and want immediate retry
- Need to retry before scheduled automatic retry
- Testing recovery after configuration changes
Use the `/reset` endpoint when:
- Need to restart from specific stage
- Configuration changes require reprocessing
- Automatic recovery not selecting optimal stage
Preventing Repeated Failures
- Check Recommendations First: Review `/api/import-jobs/failed/recommendations` before a manual retry
- Fix Root Cause: Don’t retry permanent errors without fixing the underlying issue
- Monitor Quota Usage: Ensure adequate quota before scheduling retries
- Test Configuration: Verify fixes in staging before retrying production imports
- Review Error Logs: Understand why failure occurred before attempting recovery
Integration with Scheduled Imports
For scheduled imports (automated URL-based imports), error recovery is handled automatically:
- Import Fails: Scheduled import encounters error during processing
- Classification: Error is classified (recoverable, permanent, user-action-required)
- Automatic Retry: If recoverable, system schedules retry with exponential backoff
- Max Retries: After 3 failed attempts, import is marked as failed
- Next Scheduled Run: Next scheduled run will attempt import again from scratch
Important: Retry attempts are separate from scheduled runs. A scheduled import that fails will:
- Retry up to 3 times with exponential backoff (if recoverable)
- Still run again at the next scheduled time regardless of retry outcome
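A conceptual sketch of that separation is below; all of these names are illustrative and none come from the TimeTiles codebase.

```typescript
// Retries belong to the previous run; the schedule fires regardless of their outcome.
type ScheduledImport = {
  nextRunAt: Date;        // fixed by the schedule, e.g. daily at 02:00
  lastRunFailed: boolean; // outcome of the most recent run
  retryAttempts: number;  // backoff retries used by that failed run (max 3)
};

function shouldRunNow(job: ScheduledImport, now: Date): boolean {
  // The next scheduled run starts from scratch, independent of retry state.
  return now >= job.nextRunAt;
}
```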