⚠️Active Development Notice: TimeTiles is under active development. Information may be placeholder content or not up-to-date.

Error Recovery

TimeTiles includes an automatic error recovery system that retries failed import jobs according to how each error is classified and which recovery strategy applies to it.

Overview

The ErrorRecoveryService automatically handles import failures through:

  • Error Classification: Analyzes error patterns to determine retry eligibility
  • Exponential Backoff: Spaces out retries with increasing delays so external services are not overwhelmed
  • Quota Integration: Respects user quota limits when scheduling retries
  • Stage Recovery: Restarts from the optimal recovery point in the pipeline
  • Manual Fallback: Provides APIs for manual intervention when automatic recovery fails

Error Classification

Import failures are classified into three categories to determine the appropriate recovery strategy.

Recoverable Errors (Automatic Retry)

Transient errors that typically resolve with retry:

  • Network Errors: Connection failures, timeouts, DNS issues (ECONNREFUSED, ETIMEDOUT)
  • Database Issues: Connection timeouts, deadlocks, temporary unavailability
  • Rate Limits: API rate limiting, 429 responses, quota exceeded
  • Resource Constraints: Temporary memory pressure or low disk space (may resolve on its own)

Recovery Strategy: Automatic retry with exponential backoff

Permanent Errors (No Retry)

Errors that won’t resolve without intervention:

  • File Errors: File not found, invalid format (ENOENT)
  • Permission Issues: Unauthorized access, invalid credentials (401, 403)
  • Invalid Data: Malformed file structure, unsupported format
  • Configuration Errors: Missing required settings, invalid API keys

Recovery Strategy: Manual intervention required, no automatic retry

User Action Required (May Retry After Fix)

Errors requiring user decisions but potentially recoverable:

  • Quota Limits: User exceeded daily import quota (retries after quota reset)
  • Schema Changes: Breaking changes requiring approval (retries after approval)
  • Validation Failures: Data validation errors requiring cleanup

Recovery Strategy: Automatic retry after condition changes (quota reset, approval granted)
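
The three categories above could be modeled as a small classifier. This is a minimal sketch, not the actual ErrorRecoveryService logic: the function name classifyError, the ImportError shape, its reason field, and the matching rules are assumptions made for illustration. Only the category names and the example codes (ECONNREFUSED, ETIMEDOUT, ENOENT, 401, 403, 429) come from the lists above.

```typescript
// Hypothetical sketch: names and matching rules are assumptions, not TimeTiles internals.
type ErrorCategory = "recoverable" | "permanent" | "user-action-required";

interface ImportError {
  code?: string; // Node-style error code, e.g. "ETIMEDOUT"
  httpStatus?: number; // HTTP status returned by an upstream service, if any
  reason?: "quota" | "schema-approval" | "validation"; // hypothetical app-level reason
}

function classifyError(err: ImportError): ErrorCategory {
  // Conditions the user must resolve (or wait out) are checked first.
  if (err.reason !== undefined) return "user-action-required";

  switch (err.code) {
    case "ENOENT":
      return "permanent";
    case "ECONNREFUSED":
    case "ETIMEDOUT":
      return "recoverable";
  }

  switch (err.httpStatus) {
    case 401:
    case 403:
      return "permanent";
    case 429:
      return "recoverable";
  }

  // Assumption: unknown errors are treated as permanent so they are never retried blindly.
  return "permanent";
}
```

Checking the user-action reasons first keeps a quota or approval problem from being misread as a transient failure and retried with backoff.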

Retry Mechanism

Exponential Backoff

Retry delays increase exponentially to prevent overwhelming services:

    Retry 1: 30 seconds
    Retry 2: 60 seconds (30s × 2)
    Retry 3: 120 seconds (60s × 2)
    Max delay: 300 seconds (5 minutes)
    Max retries: 3 attempts

Configuration

Configure retry behavior via environment variables:

    RETRY_BASE_DELAY_MS=30000       # 30 seconds initial delay
    RETRY_BACKOFF_MULTIPLIER=2.0    # Double delay each retry
    RETRY_MAX_DELAY_MS=300000       # 5 minute maximum delay
    RETRY_MAX_ATTEMPTS=3            # Maximum retry attempts
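
To show how these four settings combine into the delay schedule, here is a minimal sketch. The helper names (numberFromEnv, loadRetryConfig, retryDelayMs) are illustrative and not part of TimeTiles. Using the values shown above as fallbacks when a variable is unset is also an assumption.

```typescript
// Sketch under assumptions: helper names and fallback behavior are illustrative.
interface RetryConfig {
  baseDelayMs: number;
  backoffMultiplier: number;
  maxDelayMs: number;
  maxAttempts: number;
}

type Env = Record<string, string | undefined>;

// Parse a numeric variable; fall back when it is unset, blank, or not a number.
function numberFromEnv(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return isFinite(parsed) ? parsed : fallback;
}

function loadRetryConfig(env: Env): RetryConfig {
  return {
    baseDelayMs: numberFromEnv(env, "RETRY_BASE_DELAY_MS", 30000),
    backoffMultiplier: numberFromEnv(env, "RETRY_BACKOFF_MULTIPLIER", 2.0),
    maxDelayMs: numberFromEnv(env, "RETRY_MAX_DELAY_MS", 300000),
    maxAttempts: numberFromEnv(env, "RETRY_MAX_ATTEMPTS", 3),
  };
}

// Delay before retry number `attempt` (1-based); null once attempts are exhausted.
function retryDelayMs(attempt: number, config: RetryConfig): number | null {
  if (attempt < 1 || attempt > config.maxAttempts) return null;
  const uncapped = config.baseDelayMs * Math.pow(config.backoffMultiplier, attempt - 1);
  return Math.min(uncapped, config.maxDelayMs);
}
```

In a Node.js process this would be called as loadRetryConfig(process.env). With the values above, retries 1 to 3 wait 30, 60, and 120 seconds. The 300-second cap only takes effect if RETRY_MAX_ATTEMPTS is raised to 5 or more.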

Quota Integration

Before scheduling a retry, the system:

  1. Checks user’s remaining import quota
  2. Verifies retry won’t exceed daily limits
  3. Blocks retry if quota exhausted (retries after quota reset)
  4. Logs quota warnings for monitoring
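
The four steps above can be sketched as a single decision function. This is an illustration under stated assumptions, not the service's actual code: QuotaStatus, RetryDecision, and decideRetry are hypothetical names, the resetsAt field is invented for the example, and the sketch assumes each retry counts as one import against the daily quota. The quotaLimit and currentUsage field names are borrowed from the logging example in the Monitoring section.

```typescript
// Hypothetical sketch: type and function names are assumptions, not TimeTiles internals.
interface QuotaStatus {
  quotaLimit: number; // maximum imports per day
  currentUsage: number; // imports already counted today
  resetsAt: string; // ISO timestamp of the next quota reset (hypothetical field)
}

type RetryDecision =
  | { allowed: true; warning?: string }
  | { allowed: false; retryAfter: string };

function decideRetry(quota: QuotaStatus): RetryDecision {
  // Steps 1 and 2: compute what is left and check the retry still fits.
  const remaining = quota.quotaLimit - quota.currentUsage;

  // Step 3: with no quota left, the retry is blocked until the reset.
  if (remaining <= 0) {
    return { allowed: false, retryAfter: quota.resetsAt };
  }

  // Step 4: surface a warning when this retry uses the last available slot.
  if (remaining === 1) {
    return { allowed: true, warning: "Retry consumes the last import in today's quota" };
  }

  return { allowed: true };
}
```

Returning the reset time with a blocked decision lets the caller reschedule the retry rather than drop it.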

Recovery Stage Selection

The system determines the optimal restart point based on where failure occurred:

| Failure Stage         | Recovery Stage     | Reason                     |
| --------------------- | ------------------ | -------------------------- |
| ANALYZE_DUPLICATES    | ANALYZE_DUPLICATES | Start of pipeline          |
| DETECT_SCHEMA         | DETECT_SCHEMA      | Early stage, minimal cost  |
| VALIDATE_SCHEMA       | VALIDATE_SCHEMA    | Schema validation specific |
| CREATE_SCHEMA_VERSION | VALIDATE_SCHEMA    | Retry validation first     |
| GEOCODE_BATCH         | GEOCODE_BATCH      | Resume geocoding           |
| CREATE_EVENTS         | GEOCODE_BATCH      | Ensure geocoding complete  |

Important: Recovery restarts from the beginning of the selected stage to ensure data integrity.
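
The stage mapping above is a plain lookup, which could be expressed as follows. The names Stage, RECOVERY_STAGE, and recoveryStageFor are illustrative. The stage identifiers and their pairings are taken from the table.

```typescript
// Sketch: identifiers come from the table above; the type and function names are assumptions.
type Stage =
  | "ANALYZE_DUPLICATES"
  | "DETECT_SCHEMA"
  | "VALIDATE_SCHEMA"
  | "CREATE_SCHEMA_VERSION"
  | "GEOCODE_BATCH"
  | "CREATE_EVENTS";

// Failure stage -> stage the job restarts from.
const RECOVERY_STAGE: Record<Stage, Stage> = {
  ANALYZE_DUPLICATES: "ANALYZE_DUPLICATES",
  DETECT_SCHEMA: "DETECT_SCHEMA",
  VALIDATE_SCHEMA: "VALIDATE_SCHEMA",
  CREATE_SCHEMA_VERSION: "VALIDATE_SCHEMA", // re-validate before creating a version
  GEOCODE_BATCH: "GEOCODE_BATCH",
  CREATE_EVENTS: "GEOCODE_BATCH", // make sure geocoding is complete first
};

function recoveryStageFor(failedStage: Stage): Stage {
  return RECOVERY_STAGE[failedStage];
}
```

Typing the table as Record<Stage, Stage> makes the compiler reject a mapping that omits a stage, which helps if stages are added to the pipeline later.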

API Endpoints

Retry Failed Import

Endpoint: POST /api/import-jobs/{id}/retry

Purpose: Manually trigger retry for a failed import job

Example Request:

    curl -X POST https://your-domain.com/api/import-jobs/123/retry \
      -H "Authorization: Bearer YOUR_TOKEN"

Response (Success):

    {
      "success": true,
      "action": "retry_scheduled",
      "message": "Import job retry scheduled successfully",
      "retryScheduled": true,
      "nextRetryAt": "2024-01-15T10:35:00.000Z"
    }

Response (Quota Exceeded):

    {
      "success": false,
      "action": "quota_exceeded",
      "error": "User has exceeded their daily import quota. Retry will be attempted after quota resets.",
      "statusCode": 403
    }

Response (Not Retryable):

    {
      "success": false,
      "action": "not_retryable",
      "error": "Error is not retryable: File not found (permanent error)",
      "statusCode": 400
    }

When to Use:

  • Automatic retry hasn’t triggered yet
  • Need to retry immediately after fixing underlying issue
  • Testing recovery after configuration changes
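
A client calling this endpoint has to handle all three response shapes. One way to do that in TypeScript is a discriminated union keyed on the action field. This is a sketch: the field names mirror the example responses above, but the RetryResponse type, the describeRetryOutcome helper, and its messages are illustrative, and the real API may return further fields or actions.

```typescript
// Sketch: shapes mirror the example responses; type and function names are assumptions.
type RetryResponse =
  | {
      success: true;
      action: "retry_scheduled";
      message: string;
      retryScheduled: boolean;
      nextRetryAt: string;
    }
  | { success: false; action: "quota_exceeded"; error: string; statusCode: number }
  | { success: false; action: "not_retryable"; error: string; statusCode: number };

// Turn a parsed response body into a short message for the operator.
function describeRetryOutcome(res: RetryResponse): string {
  switch (res.action) {
    case "retry_scheduled":
      return "Retry scheduled for " + res.nextRetryAt;
    case "quota_exceeded":
      return "Blocked by quota; the retry will run after the quota resets";
    case "not_retryable":
      return "Fix the underlying problem first: " + res.error;
  }
}
```

Inside each case the compiler narrows res to the matching shape, so nextRetryAt can only be read where the response actually carries it.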

Reset Import Job

Endpoint: POST /api/import-jobs/{id}/reset

Purpose: Reset import to a specific stage for fresh restart

Example Request:

    curl -X POST https://your-domain.com/api/import-jobs/123/reset \
      -H "Authorization: Bearer YOUR_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{"stage": "ANALYZE_DUPLICATES"}'

Response:

    {
      "success": true,
      "message": "Import job reset successfully",
      "stage": "ANALYZE_DUPLICATES"
    }

Available Reset Stages:

  • ANALYZE_DUPLICATES - Complete restart
  • DETECT_SCHEMA - After duplicate analysis
  • VALIDATE_SCHEMA - After schema detection
  • GEOCODE_BATCH - After schema validation/approval

When to Use:

  • Need to restart from specific stage
  • Configuration changes require reprocessing
  • Data corruption requires fresh start from known good stage
  • Testing stage-specific behavior

Cautions:

  • Resetting clears progress from later stages
  • May re-process data (ensure idempotency)
  • Doesn’t delete already created events (use with caution)
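
Because a reset discards progress from later stages, a client may want to validate the target stage before sending the request. The sketch below builds the JSON request body only for the four stages listed under Available Reset Stages. RESET_STAGES, isResetStage, and buildResetBody are hypothetical names, and how the server responds to other stage values is not documented here.

```typescript
// Sketch: the stage list comes from "Available Reset Stages"; the helpers are assumptions.
const RESET_STAGES = [
  "ANALYZE_DUPLICATES",
  "DETECT_SCHEMA",
  "VALIDATE_SCHEMA",
  "GEOCODE_BATCH",
] as const;

type ResetStage = (typeof RESET_STAGES)[number];

// Type guard: true only for stages the reset endpoint is documented to accept.
function isResetStage(value: string): value is ResetStage {
  return (RESET_STAGES as readonly string[]).indexOf(value) !== -1;
}

// Build the request body for POST /api/import-jobs/{id}/reset, rejecting unknown stages.
function buildResetBody(stage: string): string {
  if (!isResetStage(stage)) {
    throw new Error(`Unsupported reset stage: ${stage}`);
  }
  return JSON.stringify({ stage });
}
```

Rejecting an unknown stage on the client catches a mistyped stage name before any request is sent.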

Get Recovery Recommendations

Endpoint: GET /api/import-jobs/failed/recommendations

Purpose: Get actionable recommendations for all failed imports

Example Request:

    curl https://your-domain.com/api/import-jobs/failed/recommendations \
      -H "Authorization: Bearer YOUR_TOKEN"

Example Response:

    {
      "recommendations": [
        {
          "importJobId": 123,
          "importFileName": "events-2024.csv",
          "stage": "GEOCODE_BATCH",
          "failedAt": "2024-01-15T10:00:00.000Z",
          "errorClassification": {
            "type": "recoverable",
            "retryable": true,
            "reason": "Rate limit exceeded (429)"
          },
          "recommendedAction": "automatic_retry",
          "retryCount": 1,
          "nextRetryAt": "2024-01-15T10:30:00.000Z",
          "details": "Automatic retry scheduled. System will retry in 30 seconds with exponential backoff."
        },
        {
          "importJobId": 456,
          "importFileName": "invalid-data.csv",
          "stage": "VALIDATE_SCHEMA",
          "failedAt": "2024-01-15T09:00:00.000Z",
          "errorClassification": {
            "type": "permanent",
            "retryable": false,
            "reason": "File not found (ENOENT)"
          },
          "recommendedAction": "manual_intervention",
          "retryCount": 0,
          "details": "File not found. Verify file exists and is accessible. Upload file again or delete this import job."
        }
      ]
    }

Recommendation Categories:

| Category               | Meaning                                  | User Action                             |
| ---------------------- | ---------------------------------------- | --------------------------------------- |
| automatic_retry        | System will retry automatically          | Wait for scheduled retry                |
| wait_for_quota_reset   | Blocked by quota, will retry after reset | Wait until tomorrow                     |
| manual_intervention    | Permanent error, needs manual fix        | Fix underlying issue and retry          |
| approve_schema_changes | Awaiting user approval                   | Review and approve schema changes       |
| max_retries_exceeded   | Retry attempts exhausted                 | Investigate root cause, reset if needed |

When to Use:

  • Monitoring dashboard to show failed imports
  • Batch processing recovery for multiple failed imports
  • Generating user notifications about failed imports
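
For the dashboard and batch-processing cases, a consumer would typically group the recommendations by category. The sketch below does that for an already parsed response. The Recommendation interface covers only a subset of the fields in the example response, and groupByAction is a hypothetical helper, not part of TimeTiles.

```typescript
// Sketch: field names follow the example response; the helper itself is an assumption.
type RecommendedAction =
  | "automatic_retry"
  | "wait_for_quota_reset"
  | "manual_intervention"
  | "approve_schema_changes"
  | "max_retries_exceeded";

interface Recommendation {
  importJobId: number;
  importFileName: string;
  recommendedAction: RecommendedAction;
  details: string;
}

type JobIdsByAction = Partial<Record<RecommendedAction, number[]>>;

// Collect import job IDs under each recommended action, preserving response order.
function groupByAction(recommendations: Recommendation[]): JobIdsByAction {
  const groups: JobIdsByAction = {};
  for (const rec of recommendations) {
    const ids = groups[rec.recommendedAction] ?? [];
    ids.push(rec.importJobId);
    groups[rec.recommendedAction] = ids;
  }
  return groups;
}
```

A dashboard could render one panel per key. A batch job might act only on the manual_intervention and max_retries_exceeded groups, since the other categories resolve without operator action or wait on a user decision.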

Monitoring

Logging

All recovery operations are logged with context:

    // Automatic retry scheduled
    logger.info("Scheduled job recovery", {
      importJobId: 123,
      retryAttempt: 2,
      recoveryStage: "GEOCODE_BATCH",
      nextRetryAt: "2024-01-15T10:30:00.000Z",
    });

    // Quota blocked retry
    logger.warn("Retry blocked due to quota limit", {
      importJobId: 123,
      userId: 456,
      retryAttempt: 2,
      quotaLimit: 10,
      currentUsage: 10,
    });

    // Permanent error identified
    logger.error("Import job has permanent error", {
      importJobId: 123,
      errorType: "permanent",
      reason: "File not found (ENOENT)",
    });

Import Job State

Track recovery progress in the import-job record:

    {
      "id": 123,
      "stage": "FAILED",
      "retryAttempts": 2,
      "lastRetryAt": "2024-01-15T10:00:00.000Z",
      "nextRetryAt": "2024-01-15T10:30:00.000Z",
      "errorLog": {
        "lastError": "Rate limit exceeded (429)",
        "recoveryAttempt": {
          "attempt": 2,
          "previousError": "Network timeout",
          "recoveryStage": "GEOCODE_BATCH",
          "classification": "recoverable"
        }
      }
    }

Best Practices

When to Use Automatic Recovery

Good Use Cases:

  • Transient network failures
  • API rate limiting
  • Temporary resource constraints
  • Database connection issues

Poor Use Cases:

  • Invalid file formats (won’t fix automatically)
  • Missing API credentials (needs configuration)
  • Data validation errors (needs data cleanup)

When to Use Manual Intervention

Use /retry endpoint when:

  • Fixed underlying issue and want immediate retry
  • Need to retry before scheduled automatic retry
  • Testing recovery after configuration changes

Use /reset endpoint when:

  • Need to restart from specific stage
  • Configuration changes require reprocessing
  • Automatic recovery not selecting optimal stage

Preventing Repeated Failures

  1. Check Recommendations First: Review /api/import-jobs/failed/recommendations before manual retry
  2. Fix Root Cause: Don’t retry permanent errors without fixing underlying issue
  3. Monitor Quota Usage: Ensure adequate quota before scheduling retries
  4. Test Configuration: Verify fixes in staging before retrying production imports
  5. Review Error Logs: Understand why failure occurred before attempting recovery

Integration with Scheduled Imports

For scheduled imports (automated URL-based imports), error recovery is handled automatically:

  1. Import Fails: Scheduled import encounters error during processing
  2. Classification: Error is classified (recoverable, permanent, user-action-required)
  3. Automatic Retry: If recoverable, system schedules retry with exponential backoff
  4. Max Retries: After 3 failed attempts, import is marked as failed
  5. Next Scheduled Run: Next scheduled run will attempt import again from scratch

Important: Retry attempts are separate from scheduled runs. A scheduled import that fails will:

  • Retry up to 3 times with exponential backoff (if recoverable)
  • Still run again at the next scheduled time regardless of retry outcome