lib/services/error-recovery
Provides error recovery mechanisms for failed import jobs.
This service handles recovery from various failure scenarios in the import pipeline. It provides retry logic, error classification, and automatic recovery strategies to improve system resilience and reduce manual intervention requirements.
Key responsibilities:
- Retry failed jobs with exponential backoff
- Classify errors as recoverable vs permanent
- Reset job state for recovery attempts
- Track retry attempts and failure patterns
- Provide manual recovery tools for operators
Classes
ErrorRecoveryService
Service for handling import job error recovery.
Provides automatic and manual recovery mechanisms for failed import jobs, including error classification, exponential backoff retry scheduling, quota enforcement, and operator intervention tools.
Examples
Basic usage - automatic retry:
import { ErrorRecoveryService } from "@/lib/services/error-recovery";
const result = await ErrorRecoveryService.recoverFailedJob(payload, jobId);
if (result.success) {
  console.log(`Retry scheduled for ${result.nextRetryAt}`);
}
Manual reset by administrator:
await ErrorRecoveryService.resetJobToStage(
  payload,
  jobId,
  PROCESSING_STAGE.GEOCODE_BATCH,
  true // Clear retry counter
);
Get recommendations for all failed jobs:
const recommendations = await ErrorRecoveryService.getRecoveryRecommendations(payload);
const autoRetryable = recommendations.filter(r => r.recommendedAction === "Automatic retry available");
Constructors
Constructor
new ErrorRecoveryService(): ErrorRecoveryService
Returns
ErrorRecoveryService
Methods
recoverFailedJob()
static recoverFailedJob(payload, jobId, retryConfig): Promise<RecoveryResult>
Attempt to recover a failed import job.
This is the primary entry point for the error recovery system. It:
- Validates the job exists and is in a failed state
- Classifies the error to determine if it’s retryable
- Checks retry count hasn’t exceeded the maximum
- Verifies user quota if applicable
- Calculates exponential backoff delay
- Updates the job to retry from the appropriate recovery stage
Parameters
payload
BasePayload
Payload CMS instance for database access
jobId
string | number
ID of the failed import job to recover
retryConfig
Partial<RetryConfig> = {}
Optional retry configuration to override defaults
Returns
Promise<RecoveryResult>
Recovery result indicating success/failure and next retry time
Example
const result = await ErrorRecoveryService.recoverFailedJob(payload, 123);
if (result.success) {
  console.log(`Retry scheduled for ${result.nextRetryAt}`);
} else {
  console.error(`Recovery failed: ${result.error}`);
}
Notes:
- Uses exponential backoff: 30s, 60s, 120s (base 30s, multiplier 2x, max 5min)
- Default max retries: 3
- Respects user quota limits to prevent abuse
- Jobs are not automatically executed; they’re scheduled for pickup by the process-pending-retries job
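A minimal sketch of the backoff calculation implied by these defaults, using the RetryConfig interface documented below (the helper name and defaults object are illustrative, not part of the public API):
const DEFAULT_RETRY_CONFIG: RetryConfig = {
  maxRetries: 3,
  baseDelayMs: 30_000,  // 30s
  maxDelayMs: 300_000,  // 5min cap
  backoffMultiplier: 2,
};
function computeBackoffDelayMs(retryCount: number, config: RetryConfig = DEFAULT_RETRY_CONFIG): number {
  // retryCount 0 -> 30s, 1 -> 60s, 2 -> 120s, then capped at maxDelayMs
  const delay = config.baseDelayMs * Math.pow(config.backoffMultiplier, retryCount);
  return Math.min(delay, config.maxDelayMs);
}
const nextRetryAt = new Date(Date.now() + computeBackoffDelayMs(2)); // 120s from now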
processPendingRetries()
static processPendingRetries(payload): Promise<void>
Process pending retries (should be called periodically).
Scans for failed jobs that are scheduled for retry (based on nextRetryAt) and automatically restarts them from the appropriate recovery stage. This method should be invoked by a scheduled background job every 5 minutes.
Parameters
payload
BasePayload
Payload CMS instance for database access
Returns
Promise<void>
Promise that resolves when processing is complete
Implementation notes:
- Processes up to 10 retries per invocation to avoid overwhelming the system
- Only processes jobs where nextRetryAt <= current time
- Skips jobs with non-retryable error classifications
- Clears nextRetryAt after queueing to prevent duplicate processing
- Should be configured as a Payload scheduled task running every 5 minutes
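A hedged sketch of the scan this method performs, assuming Payload's find API and an import-jobs collection (the collection slug and field names are assumptions, not confirmed by this module):
const now = new Date().toISOString();
const dueJobs = await payload.find({
  collection: "import-jobs",            // assumed collection slug
  where: {
    and: [
      { status: { equals: "failed" } }, // assumed status field
      { nextRetryAt: { less_than_equal: now } },
    ],
  },
  limit: 10, // at most 10 retries per invocation
});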
Example
Configure in payload.config.ts (cron runs every 5 minutes):
jobs: {
  tasks: [
    {
      slug: "process-pending-retries",
      handler: async ({ req }) => {
        await ErrorRecoveryService.processPendingRetries(req.payload);
      },
      schedule: [{ cron: "0,5,10,15,20,25,30,35,40,45,50,55 * * * *", queue: "maintenance" }]
    }
  ]
}
resetJobToStage()
static resetJobToStage(payload, jobId, targetStage, clearRetries): Promise<RecoveryResult>
Manually reset a job to a specific stage (for operator intervention).
Allows administrators to manually override the automatic recovery logic and force a job to restart from a specific stage. Useful for debugging, testing, or handling edge cases that the automatic system can’t resolve.
Parameters
payload
BasePayload
Payload CMS instance for database access
jobId
string | number
ID of the import job to reset
targetStage
Processing stage to reset the job to
clearRetries
boolean = true
Whether to reset the retry counter to 0 (default: true)
Returns
Promise<RecoveryResult>
Recovery result indicating success or failure
Example
// Reset job to geocoding stage and clear retry count
const result = await ErrorRecoveryService.resetJobToStage(
  payload,
  123,
  PROCESSING_STAGE.GEOCODE_BATCH,
  true
);
Important notes:
- Records manual reset in error log with timestamp and stage information
- Bypasses all validation checks (use with caution)
- Does not queue the job automatically; it will be picked up by normal processing
- Should only be used by administrators via the reset API endpoint
- If clearRetries is false, retry count is preserved (useful for debugging retry logic)
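When debugging retry logic, the retry count can be preserved by passing false as the last argument:
// Reset the stage but keep the existing retry count (clearRetries = false)
await ErrorRecoveryService.resetJobToStage(
  payload,
  123,
  PROCESSING_STAGE.GEOCODE_BATCH,
  false
);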
getRecoveryRecommendations()
static getRecoveryRecommendations(payload): Promise<object[]>
Get recovery recommendations for failed jobs.
Analyzes all failed jobs in the system and provides actionable recommendations for each. Used by the recommendations API endpoint to help administrators understand which jobs need attention.
Parameters
payload
BasePayload
Payload CMS instance for database access
Returns
Promise<object[]>
Array of job recommendations with classifications and suggested actions
Example
const recommendations = await ErrorRecoveryService.getRecoveryRecommendations(payload);
recommendations.forEach(rec => {
  console.log(`Job ${rec.jobId}: ${rec.recommendedAction}`);
});
Recommendation categories:
- "Automatic retry available" - Job can be retried automatically
- "Manual review required" - User action needed (from classification.suggestedAction)
- "Manual intervention required - max retries exceeded" - Retry limit hit
- "No action recommended" - Non-retryable permanent error
Limited to 100 failed jobs per query to prevent performance issues. Access control should be applied by the calling API endpoint.
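As a sketch, the results can be grouped by recommended action for triage; the jobId and recommendedAction fields follow the example above, and the inline type assertion is only illustrative since the declared return type is object[]:
const recommendations = await ErrorRecoveryService.getRecoveryRecommendations(payload);
// Group job IDs by their recommended action so operators can triage in bulk
const byAction = new Map<string, (string | number)[]>();
for (const rec of recommendations as { jobId: string | number; recommendedAction: string }[]) {
  const group = byAction.get(rec.recommendedAction) ?? [];
  group.push(rec.jobId);
  byAction.set(rec.recommendedAction, group);
}
for (const [action, jobIds] of byAction) {
  console.log(`${action}: ${jobIds.length} job(s)`);
}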
Interfaces
RetryConfig
Configuration for retry behavior.
Properties
maxRetries
maxRetries: number
Maximum number of retry attempts before giving up
baseDelayMs
baseDelayMs: number
Initial delay in milliseconds before first retry
maxDelayMs
maxDelayMs: number
Maximum delay in milliseconds between retries
backoffMultiplier
backoffMultiplier: number
Multiplier for exponential backoff (e.g., 2 = double delay each time)
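Callers can override part of this configuration when invoking recoverFailedJob, since it accepts a Partial<RetryConfig> (the values below are illustrative, not alternative defaults):
const retryConfig: Partial<RetryConfig> = {
  maxRetries: 5,       // allow two extra attempts for this job
  baseDelayMs: 60_000, // start with a 60s delay instead of 30s
};
const result = await ErrorRecoveryService.recoverFailedJob(payload, jobId, retryConfig);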
ErrorClassification
Result of error classification analysis.
Properties
type
type: "recoverable" | "permanent" | "user-action-required"
Category of error determining recovery strategy
reason
reason: string
Human-readable explanation of the error
suggestedAction?
optional suggestedAction: string
Optional suggestion for user to resolve the issue
retryable
retryable: boolean
Whether this error can be retried automatically
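A minimal sketch of how calling code might branch on a classification; the classification itself is produced internally by the service, and this helper is hypothetical:
function describeClassification(classification: ErrorClassification): string {
  if (classification.retryable) {
    return `Retryable: ${classification.reason}`;
  }
  if (classification.type === "user-action-required") {
    // suggestedAction is optional, so fall back to the reason
    return `User action needed: ${classification.suggestedAction ?? classification.reason}`;
  }
  return `Permanent failure: ${classification.reason}`;
}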
RecoveryResult
Result of recovery operation.
Properties
success
success: boolean
Whether the recovery operation succeeded
action
action: string
Action taken or error code (e.g., "retry_scheduled", "job_not_found", "quota_exceeded")
error?
optional error: string
Error message if recovery failed
retryScheduled?
optional retryScheduled: boolean
Whether a retry was successfully scheduled
nextRetryAt?
optional nextRetryAt: Date
Timestamp when the next retry will occur