Data Processing Pipeline
The TimeTiles data processing pipeline is a hook-driven, event-based system that transforms uploaded files into structured events. Payload CMS collection hooks and background jobs automatically orchestrate the sequence of processing stages, carrying each upload from raw file to finished events.
Quick Links
- Architecture & Design - Design principles, core collections, and data flow
- Processing Stages - Detailed explanation of all eight pipeline stages
- Configuration - Dataset and system configuration options
- Troubleshooting - Common issues and debugging tools
- Best Practices - Guidelines for data providers and administrators
Pipeline Overview
The pipeline processes data through eight sequential stages, from file upload to final event creation (a sketch of the stage ordering follows the list):
Eight Processing Stages
- Dataset Detection - Parse file structure and create import jobs (one per sheet for Excel files)
- Analyze Duplicates - Identify internal and external duplicates before processing
- Detect Schema - Build progressive schema from non-duplicate data
- Validate Schema - Compare detected schema with existing dataset schema
- Manual Approval (Optional) - Human review of detected schema changes that require approval before processing continues
- Create Schema Version - Persist new schema version after approval
- Geocode Batch - Enrich data with geographic coordinates
- Create Events - Generate final event records with all enrichments
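As a rough illustration, the stage ordering could be captured as a constant list that the pipeline walks in order. The slugs and helper below are hypothetical; the actual stage identifiers live in the codebase.

```typescript
// Hypothetical stage slugs -- the real identifiers in the codebase may differ.
export const PIPELINE_STAGES = [
  "detect-dataset",
  "analyze-duplicates",
  "detect-schema",
  "validate-schema",
  "await-approval", // optional: only entered when changes require human review
  "create-schema-version",
  "geocode-batch",
  "create-events",
] as const;

export type PipelineStage = (typeof PIPELINE_STAGES)[number];

// The pipeline advances to the next entry, or stops at the end of the list.
export const nextStage = (stage: PipelineStage): PipelineStage | null => {
  const index = PIPELINE_STAGES.indexOf(stage);
  return index < PIPELINE_STAGES.length - 1 ? PIPELINE_STAGES[index + 1] : null;
};
```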
Key Features
Event-Driven Architecture: Collection hooks automatically advance the pipeline through stages without manual intervention.
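As a hedged sketch of this mechanism (assuming Payload 3's hook and job-queue APIs, with hypothetical field and task names), an afterChange hook on the import-jobs collection might queue a background job whenever a job's stage field advances:

```typescript
import type { CollectionAfterChangeHook } from "payload";

// Hypothetical hook -- field names, task slugs, and queueing details are assumptions.
export const queueStageJob: CollectionAfterChangeHook = async ({ doc, previousDoc, req }) => {
  // Only react when the stage actually changes; other updates pass through untouched.
  if (doc.stage === previousDoc?.stage) return doc;

  // Queue the background job for the new stage so the pipeline advances on its own.
  await req.payload.jobs.queue({
    task: doc.stage, // e.g. "analyze-duplicates"
    input: { importJobId: doc.id },
  });

  return doc;
};
```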
Batch Processing: Large files are processed in configurable batches to manage memory and respect API rate limits.
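A minimal sketch of the batching idea, with hypothetical defaults (the real values come from dataset and system configuration):

```typescript
// Hypothetical defaults -- actual batch size and delay are configurable.
const BATCH_SIZE = 100;
const BATCH_DELAY_MS = 1_000;

// Process rows in fixed-size batches, pausing between batches so external
// APIs (e.g. the geocoder) are not hit faster than their rate limits allow.
async function processInBatches<T>(rows: T[], handle: (batch: T[]) => Promise<void>): Promise<void> {
  for (let start = 0; start < rows.length; start += BATCH_SIZE) {
    await handle(rows.slice(start, start + BATCH_SIZE));
    if (start + BATCH_SIZE < rows.length) {
      await new Promise((resolve) => setTimeout(resolve, BATCH_DELAY_MS));
    }
  }
}
```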
Error Recovery: Comprehensive error handling at both stage and batch levels, allowing recovery from partial failures.
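One way to picture batch-level recovery, as a sketch with assumed shapes: record each batch's outcome instead of aborting the stage, so only the failed batches need to be retried.

```typescript
// Hypothetical per-batch result stored with the import job's processing state.
interface BatchResult {
  batchIndex: number;
  status: "completed" | "failed";
  error?: string;
}

async function runBatchWithRecovery(batchIndex: number, run: () => Promise<void>): Promise<BatchResult> {
  try {
    await run();
    return { batchIndex, status: "completed" };
  } catch (error) {
    // Persist the failure rather than throwing, so the remaining batches still
    // run and a later retry can target only the batches that failed.
    return {
      batchIndex,
      status: "failed",
      error: error instanceof Error ? error.message : String(error),
    };
  }
}
```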
Schema Versioning: Complete version history of all schema changes with audit trail.
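For illustration, a schema version record might look roughly like this (field names are assumptions, not the actual collection definition):

```typescript
// Hypothetical shape of a persisted schema version with its audit trail.
interface SchemaVersionRecord {
  dataset: string;                  // dataset the schema belongs to
  versionNumber: number;            // increments with every approved change
  schema: Record<string, unknown>;  // the detected (and approved) schema
  changeSummary?: string;           // what changed relative to the previous version
  approvedBy?: string;              // present when manual approval was required
  createdAt: string;
}
```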
Multi-Sheet Support: Excel files with multiple sheets create separate import jobs that process independently.
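A hedged sketch of the fan-out, assuming SheetJS (`xlsx`) for parsing and an `import-jobs` collection slug; both are illustrative choices, not necessarily what the project uses:

```typescript
import * as XLSX from "xlsx";
import type { Payload } from "payload";

// Hypothetical sketch: one import job per sheet, so each sheet moves through
// the pipeline stages independently of its siblings.
async function createImportJobsForWorkbook(payload: Payload, filePath: string, importFileId: string) {
  const workbook = XLSX.readFile(filePath);
  return Promise.all(
    workbook.SheetNames.map((sheetName, sheetIndex) =>
      payload.create({
        collection: "import-jobs", // assumed slug
        data: { importFile: importFileId, sheetName, sheetIndex, stage: "analyze-duplicates" },
      }),
    ),
  );
}
```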
Core Principles
- Single Source of Truth: Raw data files stay on disk; only processing state and results are stored in the database
- File-Based Processing: Data flows from file → memory → database, avoiding storage of large intermediate data during processing
- Metadata Tracking: Processing state, schemas, duplicate maps, and geocoding results are persisted for resumable operations
- Atomic Transitions: StageTransitionService uses locking to ensure each stage job is executed exactly once (see the sketch below)
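A compare-and-set style sketch of what such a transition could look like (collection slug and field names are assumptions; the real StageTransitionService may use a dedicated lock field or a database transaction):

```typescript
import type { Payload } from "payload";

// Advance an import job from one stage to the next exactly once. The `where`
// clause only matches if the job is still in the expected stage, so a second
// worker attempting the same transition finds nothing to update.
export async function transitionStage(
  payload: Payload,
  importJobId: string,
  fromStage: string,
  toStage: string,
): Promise<boolean> {
  const result = await payload.update({
    collection: "import-jobs", // assumed slug
    where: {
      id: { equals: importJobId },
      stage: { equals: fromStage },
    },
    data: { stage: toStage },
  });

  // Zero matched documents means another worker already performed the
  // transition; the caller should not queue the stage job again.
  return result.docs.length === 1;
}
```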
Next Steps
New to the pipeline? Start with Architecture & Design to understand the system’s core principles.
Working on a specific stage? See Processing Stages for detailed documentation of each stage.
Need to configure behavior? Check Configuration for dataset and system options.
Having issues? Visit Troubleshooting for common problems and solutions.