Architecture & Design
The TimeTiles data ingestion pipeline is built on several core architectural principles that ensure scalability, reliability, and maintainability.
Core Architectural Principles
Single Source of Truth
The pipeline follows a single source of truth principle where uploaded files remain the authoritative data source throughout processing:
- Files stay on disk: Raw data files are never moved or duplicated during processing
- Immutable source: Original files remain unchanged from upload through completion
- Reproducible processing: Any stage can be re-run from the original file
- Audit trail: Complete history of what was processed and when
File-Based Processing Model
Data flows through the system in a carefully controlled path:
File → Memory → Database
- File reads: Data is read in configurable batches from disk
- In-memory processing: Transformations and analysis happen in memory
- Selective persistence: Only final results and processing state are written to the database
This approach provides several benefits:
- Memory efficiency: Batch processing prevents memory exhaustion on large files
- Database efficiency: Minimal database writes during processing
- Clean separation: Processing logic is independent of storage
- Recovery support: Processing can resume from any batch
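The File → Memory → Database path can be sketched as a generator that groups rows into fixed-size batches, so only one batch is held in memory at a time. This is a minimal illustration under assumptions, not the pipeline's actual reader: `inBatches` and the `Row` shape are hypothetical names, and a real source would be a streaming file parser rather than an in-memory array.

```typescript
// Hypothetical row shape; the pipeline's real row type is not shown in this document.
type Row = Record<string, unknown>;

// Group any iterable of rows into batches of at most `batchSize` rows.
// Only the current batch is buffered, so memory use is bounded by the batch size.
function* inBatches<T>(rows: Iterable<T>, batchSize: number): Generator<T[]> {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  let batch: T[] = [];
  for (const row of rows) {
    batch.push(row);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) {
    yield batch; // final, possibly partial, batch
  }
}

// Usage: process each batch in memory and keep only a small summary.
const sampleRows: Row[] = [{ id: 1 }, { id: 2 }, { id: 3 }, { id: 4 }, { id: 5 }];
const batchSizes: number[] = [];
for (const batch of inBatches(sampleRows, 2)) {
  batchSizes.push(batch.length); // stand-in for an in-memory transformation
}
```

Because the generator is lazy, a caller that records the index of the last completed batch can skip ahead to that point when resuming.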
Data Storage Strategy
The system carefully distinguishes between what gets stored and what remains transient:
What IS Stored in Database:
- Processing state and current stage
- Detected schemas and field statistics
- Duplicate analysis results (row number mappings)
- Geocoding results (location lookups by row number)
- Progress tracking (current/total counts)
- Schema version history
- Final event records
What is NOT Stored:
- Raw row data from source files
- Complete file contents
- Intermediate processing results
- Temporary calculations
This selective storage ensures the database remains efficient while still supporting resumable operations and complete audit trails.
Core Collections
The pipeline orchestrates processing across five main Payload CMS collections:
1. ingest-files
Purpose: File upload and metadata storage
- Stores uploaded files on disk
- Tracks file metadata (name, size, type, upload date)
- Maintains overall import status
- Links to all ingest-jobs created from the file
Multi-Sheet Handling: A single Excel file can generate multiple ingest-jobs (one per sheet), all linked back to the parent ingest-file.
2. ingest-jobs
Purpose: Processing state and orchestration
- Tracks current processing stage
- Stores progress information (current/total rows)
- Maintains duplicate analysis results
- Caches geocoding lookups
- Preserves schema builder state across batches
- Links to associated dataset and ingest-file
Central Role: This collection is the heart of the pipeline, coordinating all processing activity.
3. datasets
Purpose: Configuration and schema management
- Defines ID strategies for deduplication
- Configures schema behavior (locked, auto-grow, strict validation)
- Specifies type transformations
- Sets processing limits
- Contains geographic field detection settings
- Links to all events created for this dataset
4. dataset-schemas
Purpose: Schema versioning and history
- Maintains complete version history of all schema changes
- Stores field metadata and statistics for each version
- Tracks who approved each schema change and when
- Enables schema rollback and evolution tracking
- Links schema versions to the imports that created them
Critical for Compliance: This collection provides the audit trail needed for data governance and compliance requirements.
5. events
Purpose: Final processed data storage
- Stores complete event records with all enrichments
- Includes geocoded location data
- References source dataset and ingest-job
- Links to specific schema version used
- Contains validation status and metadata
Workflow Architecture
4 Payload Workflows
The pipeline uses Payload CMS Workflows to define linear task pipelines. Each workflow sequences multiple task handlers, with automatic retry and caching of completed tasks on re-run.
| Workflow | Trigger | Purpose |
|---|---|---|
| manual-ingest | ingest-files afterChange hook | Full pipeline for user uploads |
| scheduled-ingest | schedule-manager job | URL fetch + full pipeline |
| scraper-ingest | schedule-manager job | Scraper execution + full pipeline |
| ingest-process | ingest-jobs afterChange hook (NEEDS_REVIEW approval) | Resume after schema drift resolution |
Hook-Driven Workflow Queueing
Collection afterChange hooks queue workflows automatically:
- ingest-files afterChange: Queues the manual-ingest workflow when a file is uploaded
- ingest-jobs afterChange: Queues the ingest-process workflow when a NEEDS_REVIEW job is approved
Workflows are never queued manually from task handlers or tests. The hooks ensure exactly-once execution.
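The hook-driven queueing described above might look roughly like the following sketch. It uses small local stand-in types rather than Payload's real hook types, so everything here is an assumption for illustration: the `JobQueue` interface, the `ingestFileId` and `ingestJobId` input fields, and in particular the `reviewApproved` flag used to detect approval, which this document does not specify.

```typescript
// Local stand-ins, NOT Payload's actual types. They only mirror the general
// shape of an afterChange hook (doc, previousDoc, operation).
type Operation = "create" | "update";

interface QueueRequest {
  workflow: string;
  input: Record<string, unknown>;
}

// Stand-in for the job queue a real hook would reach through Payload.
interface JobQueue {
  queue(request: QueueRequest): Promise<void>;
}

interface AfterChangeArgs<T> {
  doc: T;
  previousDoc?: T;
  operation: Operation;
  jobs: JobQueue;
}

interface IngestFile {
  id: string;
}

interface IngestJob {
  id: string;
  stage: string;
  reviewApproved: boolean; // hypothetical field, assumed for this sketch
}

// ingest-files: queue the full pipeline once, when the file is first created.
async function ingestFilesAfterChange(args: AfterChangeArgs<IngestFile>): Promise<IngestFile> {
  if (args.operation === "create") {
    await args.jobs.queue({ workflow: "manual-ingest", input: { ingestFileId: args.doc.id } });
  }
  return args.doc;
}

// ingest-jobs: queue the resume workflow only on the transition to "approved"
// for a job that was waiting in NEEDS_REVIEW.
async function ingestJobsAfterChange(args: AfterChangeArgs<IngestJob>): Promise<IngestJob> {
  const previous = args.previousDoc;
  const wasAwaitingReview =
    previous !== undefined && previous.stage === "NEEDS_REVIEW" && !previous.reviewApproved;
  if (args.operation === "update" && wasAwaitingReview && args.doc.reviewApproved) {
    await args.jobs.queue({ workflow: "ingest-process", input: { ingestJobId: args.doc.id } });
  }
  return args.doc;
}
```

Checking the transition rather than the current value keeps later, unrelated updates to the same record from queueing the workflow again.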
Concurrency Control
Each workflow uses a per-resource concurrency key to prevent duplicate processing:
- manual-ingest:file:{importFileId}
- scheduled-ingest:sched:{scheduledImportId}
- scraper-ingest:scraper:{scraperId}
- ingest-process:ingest:{ingestJobId}
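The key formats above follow a single pattern: workflow name, resource prefix, resource ID. A small helper can make that pattern explicit. The sketch below is hypothetical (`concurrencyKey` and `KEY_PREFIX` are illustrative names, not part of the codebase) and only builds the strings; how the keys are registered with the job system is not shown here.

```typescript
// Workflow names and key prefixes copied from the key formats listed above.
const KEY_PREFIX = {
  "manual-ingest": "file",
  "scheduled-ingest": "sched",
  "scraper-ingest": "scraper",
  "ingest-process": "ingest",
} as const;

type WorkflowName = keyof typeof KEY_PREFIX;

// Hypothetical helper: builds "<workflow>:<prefix>:<resourceId>".
function concurrencyKey(workflow: WorkflowName, resourceId: string | number): string {
  return `${workflow}:${KEY_PREFIX[workflow]}:${resourceId}`;
}

// Two runs for the same file share a key; runs for different files do not.
const firstRun = concurrencyKey("manual-ingest", 42);
const secondRun = concurrencyKey("manual-ingest", 42);
const otherFile = concurrencyKey("manual-ingest", 43);
const sameResource = firstRun === secondRun; // true
const differentResource = firstRun === otherFile; // false
```

Scoping the key to one resource means two uploads can be processed at the same time, while a second run for the same file is recognised as a duplicate.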
Error Model
Task handlers use three distinct patterns:
- Throw for transient failures: Payload retries the task automatically
- Return { needsReview: true } for human review: the pipeline pauses for that sheet
- Return data for success: the pipeline continues to the next task
Multi-sheet workflows process sheets via Promise.allSettled with per-sheet try/catch. A failure in one sheet does not block other sheets.
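A minimal sketch of this error model, assuming hypothetical result shapes (`SheetResult`, `SheetOutcome`) and a caller-supplied per-sheet handler. It shows how `Promise.allSettled` combined with a per-sheet try/catch keeps one failing sheet from blocking the others. What the workflow does with a failed sheet afterwards, such as rethrowing so the task is retried, is not covered here.

```typescript
// Hypothetical handler result: either a review request or success data.
type SheetResult =
  | { needsReview: true; reason: string }
  | { needsReview?: false; eventsCreated: number };

// Hypothetical per-sheet summary collected by the workflow.
type SheetOutcome =
  | { sheet: string; status: "completed"; eventsCreated: number }
  | { sheet: string; status: "needs-review"; reason: string }
  | { sheet: string; status: "failed"; error: string };

async function processSheets(
  sheets: string[],
  handler: (sheet: string) => Promise<SheetResult>,
): Promise<SheetOutcome[]> {
  const settled = await Promise.allSettled(
    sheets.map(async (sheet): Promise<SheetOutcome> => {
      try {
        const result = await handler(sheet);
        if (result.needsReview === true) {
          // Pause this sheet for human review; other sheets carry on.
          return { sheet, status: "needs-review", reason: result.reason };
        }
        return { sheet, status: "completed", eventsCreated: result.eventsCreated };
      } catch (error) {
        // A thrown error is recorded against this sheet only.
        const message = error instanceof Error ? error.message : String(error);
        return { sheet, status: "failed", error: message };
      }
    }),
  );
  // Every promise resolves because errors are caught per sheet; the
  // fulfilled check is kept as a safeguard.
  return settled.flatMap((entry) => (entry.status === "fulfilled" ? [entry.value] : []));
}
```

Returning one outcome per sheet lets the workflow report partial success instead of collapsing the whole file into a single pass or fail.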
Background Job Lifecycle
Payload CMS deletes completed jobs by default. This has important implications:
- No Double-Queueing: Workflows are queued by hooks, not by task handlers
- Testing Implications: Tests must check side effects (events created, data changed) rather than job records
- Drain Loop: Tests call payload.jobs.run() in a loop to process chained workflow tasks
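The drain loop can be sketched generically. The helper below is hypothetical: `runOnce` stands in for a wrapper around `payload.jobs.run()`, and the sketch assumes that wrapper can report how many jobs it processed. The actual return value of `payload.jobs.run()` is deliberately not relied on here, so the wrapper is where a test suite would adapt to it.

```typescript
// Repeatedly run the queue until a pass processes nothing.
// `runOnce` is an assumed adapter that resolves with the number of jobs
// handled in that pass (0 means the queue is empty).
async function drainJobs(runOnce: () => Promise<number>, maxIterations = 20): Promise<number> {
  let totalProcessed = 0;
  for (let iteration = 0; iteration < maxIterations; iteration++) {
    const processed = await runOnce();
    if (processed === 0) {
      return totalProcessed; // nothing left, including chained tasks
    }
    totalProcessed += processed;
  }
  // Guard against a workflow that keeps re-queueing itself forever.
  throw new Error(`Job queue not drained after ${maxIterations} iterations`);
}
```

Once the loop returns, the test asserts on side effects such as created events, since the job records themselves may already have been deleted.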
Batch Processing Architecture
Configurable Batch Sizes
Different stages use different batch sizes optimized for their specific needs. See Processing Stages for batch sizes per stage.
- Duplicate analysis — memory-efficient for hash map operations; allows pause/resume at batch boundaries
- Schema detection — larger batches for efficiency; builder state persisted between batches for progressive schema building
- Geocoding — processes all unique locations in a single pass (not batched by row); each unique value geocoded once, with small API batch sizes to respect rate limits
- Event creation — smaller batches to avoid database transaction timeouts; individual row failures don’t stop the batch
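The geocoding strategy in this list (each unique value geocoded once, sent in small API batches) can be sketched as follows. All names are hypothetical, including `geocodeUnique`, the `Geocoder` signature, and the default API batch size of 10. Rate-limit handling between batches is left out to keep the example short.

```typescript
type Coordinates = { lat: number; lng: number };

// Assumed geocoder shape: one result (or null) per address, in input order.
type Geocoder = (addresses: string[]) => Promise<(Coordinates | null)[]>;

interface LocatedRow {
  rowNumber: number;
  location: string;
}

// Split a list into consecutive chunks of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let start = 0; start < items.length; start += size) {
    chunks.push(items.slice(start, start + size));
  }
  return chunks;
}

async function geocodeUnique(
  rows: LocatedRow[],
  geocode: Geocoder,
  apiBatchSize = 10, // assumed default, small to respect rate limits
): Promise<Map<number, Coordinates | null>> {
  // Collect each distinct, non-empty location once.
  const unique = Array.from(
    new Set(rows.map((row) => row.location.trim()).filter((value) => value.length > 0)),
  );

  // Geocode unique values in small API batches.
  const byLocation = new Map<string, Coordinates | null>();
  for (const group of chunk(unique, apiBatchSize)) {
    const results = await geocode(group);
    group.forEach((location, index) => byLocation.set(location, results[index] ?? null));
  }

  // Fan the results back out to row numbers.
  const byRow = new Map<number, Coordinates | null>();
  for (const row of rows) {
    const key = row.location.trim();
    if (byLocation.has(key)) {
      byRow.set(row.rowNumber, byLocation.get(key) ?? null);
    }
  }
  return byRow;
}
```

With this shape, the number of API lookups depends on how many distinct locations the file contains, not on how many rows it has.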
Progressive Processing
Many stages build up results progressively across batches:
- Schema detection: Refines type understanding with each batch
- Duplicate analysis: Builds complete duplicate map across all batches
- Event creation: Processes file in manageable chunks
This approach enables:
- Processing files larger than available memory
- Pause and resume at batch boundaries
- Incremental progress tracking
- Partial recovery from failures
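As an illustration of progressive processing, the sketch below builds a duplicate map across batches while keeping its state JSON-serialisable, so it could be persisted between batches and reloaded on resume. The state layout (`DuplicateState`) and the `uniqueId` field are assumptions made for this example, not the pipeline's actual data model.

```typescript
interface AnalyzedRow {
  rowNumber: number;
  uniqueId: string; // assumed to come from the dataset's ID strategy
}

// Stored as entry arrays so the state survives JSON.stringify / JSON.parse.
interface DuplicateState {
  firstSeen: [string, number][]; // unique ID -> first row number
  duplicates: [number, number][]; // duplicate row -> row it repeats
  rowsProcessed: number;
}

function emptyDuplicateState(): DuplicateState {
  return { firstSeen: [], duplicates: [], rowsProcessed: 0 };
}

// Fold one batch into the running state and return the new state.
function analyzeBatch(state: DuplicateState, batch: AnalyzedRow[]): DuplicateState {
  const firstSeen = new Map<string, number>(state.firstSeen);
  const duplicates = new Map<number, number>(state.duplicates);
  for (const row of batch) {
    const original = firstSeen.get(row.uniqueId);
    if (original === undefined) {
      firstSeen.set(row.uniqueId, row.rowNumber);
    } else {
      duplicates.set(row.rowNumber, original);
    }
  }
  return {
    firstSeen: Array.from(firstSeen),
    duplicates: Array.from(duplicates),
    rowsProcessed: state.rowsProcessed + batch.length,
  };
}
```

Because each call takes the previous state and returns the next one, a paused job can reload the last saved state and continue with the following batch. Rebuilding the maps on every batch is simple but costs time proportional to the state size.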
Data Flow Principles
Metadata Tracking
Throughout processing, the pipeline tracks comprehensive metadata:
Stage Progression:
- Current stage and previous stage
- Stage transition timestamps
- User who initiated transitions (for approvals)
Progress Information:
- Rows processed vs total rows
- Batches completed vs total batches
- Percentage completion
Processing Results:
- Duplicate row numbers (internal and external)
- Geocoding results by row number
- Schema builder state for resumption
- Error details for failed rows
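The progress figures listed above can all be derived from three numbers: rows processed, total rows, and batch size. The following sketch shows one way to compute them. `computeProgress` and the `Progress` shape are hypothetical, and treating an empty file as 100% complete is a choice made for this example.

```typescript
interface Progress {
  rowsProcessed: number;
  totalRows: number;
  batchesCompleted: number;
  totalBatches: number;
  percent: number; // whole-number percentage, 0 to 100
}

function computeProgress(rowsProcessed: number, totalRows: number, batchSize: number): Progress {
  if (batchSize < 1) {
    throw new RangeError("batchSize must be at least 1");
  }
  const done = Math.min(Math.max(rowsProcessed, 0), totalRows);
  const totalBatches = Math.ceil(totalRows / batchSize);
  // The last batch may be partial, so it counts as complete once every row is done.
  const batchesCompleted = done >= totalRows ? totalBatches : Math.floor(done / batchSize);
  const percent = totalRows === 0 ? 100 : Math.floor((done / totalRows) * 100);
  return { rowsProcessed: done, totalRows, batchesCompleted, totalBatches, percent };
}
```

Rounding down means the percentage only reaches 100 when every row has been processed.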
Version Tracking
Payload’s built-in versioning provides complete audit capability:
- Each stage change creates a new version of the ingest-job
- Full history of processing decisions
- Ability to analyze bottlenecks and failures
- Recovery information for debugging
- Compliance and governance support
Performance Characteristics
Scalability
The architecture scales effectively across multiple dimensions:
Memory Efficiency:
- Batch processing prevents memory exhaustion
- Configurable batch sizes tune memory usage
- Progressive processing reduces peak memory
Database Efficiency:
- Minimal writes during processing
- Selective persistence of only essential data
- Bulk operations where possible
API Friendliness:
- Geocoding respects rate limits
- Exponential backoff on failures
- Efficient handling of unique values
Parallelization:
- Multiple ingest-jobs can run concurrently
- Different stages of different imports run simultaneously
- Independent processing of Excel sheets
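The exponential backoff mentioned under API Friendliness can be sketched as a generic retry wrapper. This is an illustration under assumptions, not the pipeline's geocoding client: `withBackoff`, its option names, and the default values (5 attempts, 500 ms base delay) are hypothetical. The `sleep` function is injectable so the behaviour can be checked without real waiting.

```typescript
interface BackoffOptions {
  maxAttempts?: number; // assumed default: 5
  baseDelayMs?: number; // assumed default: 500
  sleep?: (ms: number) => Promise<void>;
}

function defaultSleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retry `operation`, doubling the wait after each failure:
// baseDelayMs, 2 * baseDelayMs, 4 * baseDelayMs, and so on.
async function withBackoff<T>(operation: () => Promise<T>, options: BackoffOptions = {}): Promise<T> {
  const { maxAttempts = 5, baseDelayMs = 500, sleep = defaultSleep } = options;
  let lastError: unknown = new Error("withBackoff: maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError; // every attempt failed
}
```

A production version would typically add random jitter to the delay and retry only on errors known to be transient, such as rate-limit responses.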
Monitoring Capabilities
The pipeline provides extensive observability:
Progress Tracking:
- Real-time progress for each stage
- Batch-level granularity
- Overall import status
Performance Metrics:
- Processing times per stage
- Throughput measurements
- API usage and quota tracking
Error Logging:
- Detailed error information per row/batch
- Stage-level failure tracking
- Complete error context for debugging
Resource Usage:
- Memory consumption monitoring
- Database query performance
- API quota consumption
Design Trade-offs
File-Based vs Database-Based
Choice: Keep raw data in files, not database
Trade-offs:
- ✅ Simpler database schema
- ✅ Better performance for large datasets
- ✅ Easier to re-process from source
- ❌ Requires file system management
- ❌ Files must remain accessible
Workflow-Based vs Hook-Driven State Machine
Choice: Payload Workflows with hook-based queueing
Trade-offs:
- ✅ Explicit, linear workflow handlers (readable control flow)
- ✅ Built-in retry with cached completed tasks
- ✅ Per-resource concurrency control
- ✅ One workflow per import type (clear separation)
- ❌ Slight overhead for single-sheet imports (workflow wrapper job)
- ❌ Sequential sheet processing within a workflow
Batch vs Stream Processing
Choice: Batch processing with configurable sizes
Trade-offs:
- ✅ Memory-efficient for large files
- ✅ Pause/resume capability
- ✅ Tunable performance
- ❌ Higher latency than streaming
- ❌ Batch size tuning required
Development Guidelines
Respect Workflow Architecture
- Let afterChange hooks on collections queue workflows; never queue workflows manually from task handlers or tests
- Each task handler does its work and returns structured output; the workflow handler sequences tasks and inspects results
- This prevents double-queueing and maintains exactly-once semantics
Task Handler Focus
- Each handler does one thing well
- Return structured output: data for success, needsReview: true for human review, or throw for transient failures
- Never queue the next task from a handler; the workflow handler owns sequencing
Testing Implications
- Jobs are deleted after completion (deleteJobOnComplete: true); check side effects (events created, data changed), not job history
- Use a drain loop with payload.jobs.run() to process chained workflow tasks in tests
- Verify progress and state changes on the ingest-job record, then query actual results
These architectural decisions create a pipeline that is scalable, maintainable, and robust while leveraging Payload CMS Workflows for task sequencing, retry, and concurrency control.