lib/services/error-recovery
Provides error recovery mechanisms for failed import jobs.
This service handles recovery from various failure scenarios in the import pipeline. It provides retry logic, error classification, and automatic recovery strategies to improve system resilience and reduce manual intervention requirements.
Key responsibilities:
- Retry failed jobs with exponential backoff
- Classify errors as recoverable vs permanent
- Reset job state for recovery attempts
- Track retry attempts and failure patterns
- Provide manual recovery tools for operators
Classes
ErrorRecoveryService
Service for handling import job error recovery.
Provides automatic and manual recovery mechanisms for failed import jobs, including error classification, exponential backoff retry scheduling, quota enforcement, and operator intervention tools.
Examples
Basic usage - automatic retry:
import { ErrorRecoveryService } from "@/lib/services/error-recovery";
const result = await ErrorRecoveryService.recoverFailedJob(payload, jobId);
if (result.success) {
  console.log(`Retry scheduled for ${result.nextRetryAt}`);
}
Manual reset by administrator:
await ErrorRecoveryService.resetJobToStage(
  payload,
  jobId,
  PROCESSING_STAGE.GEOCODE_BATCH,
  true // Clear retry counter
);
Get recommendations for all failed jobs:
const recommendations = await ErrorRecoveryService.getRecoveryRecommendations(payload);
const autoRetryable = recommendations.filter(r => r.recommendedAction === "Automatic retry available");
Constructors
Constructor
new ErrorRecoveryService(): ErrorRecoveryService
Returns
ErrorRecoveryService
Methods
recoverFailedJob()
static recoverFailedJob(payload, jobId, retryConfig): Promise<RecoveryResult>
Attempt to recover a failed import job.
This is the primary entry point for the error recovery system. It:
- Validates the job exists and is in a failed state
- Classifies the error to determine if it’s retryable
- Checks retry count hasn’t exceeded the maximum
- Verifies user quota if applicable
- Calculates exponential backoff delay
- Updates the job to retry from the appropriate recovery stage
Parameters
payload
BasePayload
Payload CMS instance for database access
jobId
string | number
ID of the failed import job to recover
retryConfig
Partial<RetryConfig> = {}
Optional retry configuration to override defaults
Returns
Promise<RecoveryResult>
Recovery result indicating success/failure and next retry time
Example
const result = await ErrorRecoveryService.recoverFailedJob(payload, 123);
if (result.success) {
  console.log(`Retry scheduled for ${result.nextRetryAt}`);
} else {
  console.error(`Recovery failed: ${result.error}`);
}
Notes:
- Uses exponential backoff: 30s, 60s, 120s (base 30s, multiplier 2x, max 5min)
- Default max retries: 3
- Respects user quota limits to prevent abuse
- Jobs are not automatically executed; they’re scheduled for pickup by the process-pending-retries job
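A minimal sketch of the backoff calculation implied by these defaults, using the RetryConfig interface documented below (the helper name and defaults object are illustrative, not part of the public API):
const DEFAULT_RETRY_CONFIG: RetryConfig = {
  maxRetries: 3,
  baseDelayMs: 30_000,  // 30s
  maxDelayMs: 300_000,  // 5min cap
  backoffMultiplier: 2,
};
function computeBackoffDelayMs(retryCount: number, config: RetryConfig = DEFAULT_RETRY_CONFIG): number {
  // retryCount 0 -> 30s, 1 -> 60s, 2 -> 120s, then capped at maxDelayMs
  const delay = config.baseDelayMs * Math.pow(config.backoffMultiplier, retryCount);
  return Math.min(delay, config.maxDelayMs);
}
const nextRetryAt = new Date(Date.now() + computeBackoffDelayMs(2)); // 120s from now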
processPendingRetries()
static processPendingRetries(payload): Promise<void>
Process pending retries (should be called periodically).
Scans for failed jobs that are scheduled for retry (based on nextRetryAt) and automatically restarts them from the appropriate recovery stage. This method should be invoked by a scheduled background job every 5 minutes.
Parameters
payload
BasePayload
Payload CMS instance for database access
Returns
Promise<void>
Promise that resolves when processing is complete
Implementation notes:
- Processes up to 10 retries per invocation to avoid overwhelming the system
- Only processes jobs where nextRetryAt <= current time
- Skips jobs with non-retryable error classifications
- Clears nextRetryAt after queueing to prevent duplicate processing
- Should be configured as a Payload scheduled task running every 5 minutes
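A hedged sketch of the scan this method performs, assuming Payload's find API and an import-jobs collection (the collection slug and field names are assumptions, not confirmed by this module):
const now = new Date().toISOString();
const dueJobs = await payload.find({
  collection: "import-jobs",            // assumed collection slug
  where: {
    and: [
      { status: { equals: "failed" } }, // assumed status field
      { nextRetryAt: { less_than_equal: now } },
    ],
  },
  limit: 10, // at most 10 retries per invocation
});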
Example
Configure in payload.config.ts (cron runs every 5 minutes):
jobs: {
  tasks: [
    {
      slug: "process-pending-retries",
      handler: async ({ req }) => {
        await ErrorRecoveryService.processPendingRetries(req.payload);
      },
      schedule: [{ cron: "0,5,10,15,20,25,30,35,40,45,50,55 * * * *", queue: "maintenance" }]
    }
  ]
}
resetJobToStage()
static resetJobToStage(payload, jobId, targetStage, clearRetries): Promise<RecoveryResult>
Manually reset a job to a specific stage (for operator intervention).
Allows administrators to manually override the automatic recovery logic and force a job to restart from a specific stage. Useful for debugging, testing, or handling edge cases that the automatic system can’t resolve.
Parameters
payload
BasePayload
Payload CMS instance for database access
jobId
string | number
ID of the import job to reset
targetStage
Processing stage to reset the job to
clearRetries
boolean = true
Whether to reset the retry counter to 0 (default: true)
Returns
Promise<RecoveryResult>
Recovery result indicating success or failure
Example
// Reset job to geocoding stage and clear retry count
const result = await ErrorRecoveryService.resetJobToStage(
  payload,
  123,
  PROCESSING_STAGE.GEOCODE_BATCH,
  true
);
Important notes:
- Records manual reset in error log with timestamp and stage information
- Bypasses all validation checks (use with caution)
- Does not queue the job automatically; it will be picked up by normal processing
- Should only be used by administrators via the reset API endpoint
- If clearRetries is false, retry count is preserved (useful for debugging retry logic)
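When debugging retry logic, the retry count can be preserved by passing false as the last argument:
// Reset the stage but keep the existing retry count (clearRetries = false)
await ErrorRecoveryService.resetJobToStage(
  payload,
  123,
  PROCESSING_STAGE.GEOCODE_BATCH,
  false
);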
getRecoveryRecommendations()
static getRecoveryRecommendations(payload): Promise<object[]>
Get recovery recommendations for failed jobs.
Analyzes all failed jobs in the system and provides actionable recommendations for each. Used by the recommendations API endpoint to help administrators understand which jobs need attention.
Parameters
payload
BasePayload
Payload CMS instance for database access
Returns
Promise<object[]>
Array of job recommendations with classifications and suggested actions
Example
const recommendations = await ErrorRecoveryService.getRecoveryRecommendations(payload);
recommendations.forEach(rec => {
  console.log(`Job ${rec.jobId}: ${rec.recommendedAction}`);
});
Recommendation categories:
- "Automatic retry available" - Job can be retried automatically
- "Manual review required" - User action needed (from classification.suggestedAction)
- "Manual intervention required - max retries exceeded" - Retry limit hit
- "No action recommended" - Non-retryable permanent error
Limited to 100 failed jobs per query to prevent performance issues. Access control should be applied by the calling API endpoint.
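As a sketch, the results can be grouped by recommended action for triage; the jobId and recommendedAction fields follow the example above, and the inline type assertion is only illustrative since the declared return type is object[]:
const recommendations = await ErrorRecoveryService.getRecoveryRecommendations(payload);
// Group job IDs by their recommended action so operators can triage in bulk
const byAction = new Map<string, (string | number)[]>();
for (const rec of recommendations as { jobId: string | number; recommendedAction: string }[]) {
  const group = byAction.get(rec.recommendedAction) ?? [];
  group.push(rec.jobId);
  byAction.set(rec.recommendedAction, group);
}
for (const [action, jobIds] of byAction) {
  console.log(`${action}: ${jobIds.length} job(s)`);
}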
Interfaces
RetryConfig
Configuration for retry behavior.
Properties
maxRetries
maxRetries: number
Maximum number of retry attempts before giving up
baseDelayMs
baseDelayMs: number
Initial delay in milliseconds before first retry
maxDelayMs
maxDelayMs: number
Maximum delay in milliseconds between retries
backoffMultiplier
backoffMultiplier: number
Multiplier for exponential backoff (e.g., 2 = double delay each time)
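Callers can override part of this configuration when invoking recoverFailedJob, since it accepts a Partial<RetryConfig> (the values below are illustrative, not alternative defaults):
const retryConfig: Partial<RetryConfig> = {
  maxRetries: 5,       // allow two extra attempts for this job
  baseDelayMs: 60_000, // start with a 60s delay instead of 30s
};
const result = await ErrorRecoveryService.recoverFailedJob(payload, jobId, retryConfig);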
ErrorClassification
Result of error classification analysis.
Properties
type
type: "recoverable" | "permanent" | "user-action-required"
Category of error determining recovery strategy
reason
reason: string
Human-readable explanation of the error
suggestedAction?
optional suggestedAction: string
Optional suggestion for user to resolve the issue
retryable
retryable: boolean
Whether this error can be retried automatically
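A minimal sketch of how calling code might branch on a classification; the classification itself is produced internally by the service, and this helper is hypothetical:
function describeClassification(classification: ErrorClassification): string {
  if (classification.retryable) {
    return `Retryable: ${classification.reason}`;
  }
  if (classification.type === "user-action-required") {
    // suggestedAction is optional, so fall back to the reason
    return `User action needed: ${classification.suggestedAction ?? classification.reason}`;
  }
  return `Permanent failure: ${classification.reason}`;
}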
RecoveryResult
Result of recovery operation.
Properties
success
success: boolean
Whether the recovery operation succeeded
action
action: string
Action taken or error code (e.g., "retry_scheduled", "job_not_found", "quota_exceeded")
error?
optional error: string
Error message if recovery failed
retryScheduled?
optional retryScheduled: boolean
Whether a retry was successfully scheduled
nextRetryAt?
optional nextRetryAt: Date
Timestamp when the next retry will occur