# Data Ingestion Pipeline
Terminology: The ingestion pipeline is called “Import” in the UI and `import-*` in the codebase.
The TimeTiles data ingestion pipeline uses Payload CMS Workflows to transform uploaded files into structured events. Four workflows orchestrate task handlers into linear pipelines, with collection afterChange hooks triggering workflow execution.
## Quick Links
- Architecture & Design - Design principles, core collections, and data flow
- Processing Stages - Detailed explanation of all 8 pipeline stages
- Configuration - Dataset and system configuration options
- Troubleshooting - Common issues and debugging tools
- Best Practices - Guidelines for data providers and administrators
## Pipeline Overview
The pipeline uses 4 Payload Workflows, one per import type plus a post-review workflow:
### Four Workflows
| Workflow | Trigger | Purpose |
| --- | --- | --- |
| `manual-ingest` | File upload (`ingest-files` afterChange hook) | Full pipeline for user uploads |
| `scheduled-ingest` | `schedule-manager` job | URL fetch + full pipeline |
| `scraper-ingest` | `schedule-manager` job | Scraper execution + full pipeline |
| `ingest-process` | Approval of a NEEDS_REVIEW job (`ingest-jobs` afterChange hook) | Resume after schema drift resolution |
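As a rough illustration of the trigger wiring, an afterChange hook on the uploads collection can queue the `manual-ingest` workflow when a file is created. The sketch below is an assumption, not the project's actual code: the `jobs` stub stands in for Payload's job queue, and the `fileId` input field is illustrative.

```typescript
// Sketch only: how an afterChange hook on ingest-files might queue the
// manual-ingest workflow. The jobs object is a stand-in for Payload's
// job queue; the fileId field name is an illustrative assumption.
type QueuedJob = { workflow: string; input: Record<string, unknown> };

const queued: QueuedJob[] = [];
const jobs = {
  queue: async (job: QueuedJob): Promise<void> => {
    queued.push(job); // real code would persist the job for a worker to pick up
  },
};

// afterChange fires on both create and update; only a fresh upload
// should start the full pipeline.
async function ingestFilesAfterChange(args: {
  doc: { id: string };
  operation: "create" | "update";
}): Promise<void> {
  if (args.operation !== "create") return;
  await jobs.queue({ workflow: "manual-ingest", input: { fileId: args.doc.id } });
}
```

Gating on `operation === "create"` keeps routine document updates (for example, status changes written back by the pipeline itself) from re-triggering ingestion.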
### Per-Sheet Task Pipeline
Each sheet within a workflow runs through these tasks sequentially:
1. Analyze Duplicates - Identify internal and external duplicates before processing
2. Detect Schema - Build progressive schema from non-duplicate data
3. Validate Schema - Compare detected schema with existing dataset schema
4. Create Schema Version - Persist new schema version
5. Geocode Batch - Enrich data with geographic coordinates
6. Create Events - Generate final event records with all enrichments
Multi-sheet Excel files process all sheets within a single workflow instance. If one sheet fails or needs review, the remaining sheets continue processing.
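The per-sheet failure isolation described above can be sketched as a loop that catches errors per sheet rather than per workflow. Task names and types here are illustrative assumptions, not the real handlers:

```typescript
// Sketch: run each sheet through the task sequence independently, so one
// failing sheet does not stop the others. Names are illustrative.
type Sheet = { name: string; rows: unknown[] };
type SheetResult =
  | { sheet: string; status: "completed" }
  | { sheet: string; status: "failed"; error: string };
type Task = (sheet: Sheet) => void;

function runWorkflow(sheets: Sheet[], tasks: Task[]): SheetResult[] {
  return sheets.map((sheet) => {
    try {
      for (const task of tasks) task(sheet); // sequential task pipeline
      return { sheet: sheet.name, status: "completed" as const };
    } catch (err) {
      // A failure (or NEEDS_REVIEW pause) is recorded for this sheet only;
      // the remaining sheets keep processing.
      return {
        sheet: sheet.name,
        status: "failed" as const,
        error: err instanceof Error ? err.message : String(err),
      };
    }
  });
}
```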
## Key Features
Workflow Orchestration: Payload CMS Workflows define linear task pipelines with built-in retry and caching of completed tasks.
Batch Processing: Large files are processed in configurable batches to manage memory and respect API rate limits.
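A minimal sketch of the batching step, assuming batches are plain fixed-size slices (the actual batch sizes and chunking strategy are configuration-dependent):

```typescript
// Sketch: split rows into fixed-size batches so a large file never has to
// be held fully materialized between stages, and downstream API calls
// (e.g. geocoding) can be rate-limited per batch.
function toBatches<T>(rows: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize)); // last batch may be smaller
  }
  return batches;
}
```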
Built-in Error Recovery: Failed tasks are retried by Payload’s workflow system. Completed tasks return cached output on retry, so work is never repeated.
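The retry-with-cached-output behavior can be sketched as a task runner that memoizes completed results by task id. This is a simplified synchronous stand-in for Payload's workflow machinery, not its implementation (real task handlers are async):

```typescript
// Sketch: completed task outputs are cached by task id, so a workflow
// retry re-runs only the tasks that have not completed yet.
const completed = new Map<string, unknown>();
let runCount = 0; // tracks actual handler executions, for illustration

function runTask<T>(taskId: string, handler: () => T): T {
  if (completed.has(taskId)) {
    return completed.get(taskId) as T; // cached output, handler not re-run
  }
  const output = handler();
  completed.set(taskId, output); // only successful runs are cached
  return output;
}
```

On a workflow retry, every task up to the failure point returns its cached output instantly, and execution resumes at the failed task.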
Schema Versioning: Complete version history of all schema changes with audit trail.
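An append-only version log is one way to picture this; the record fields below are illustrative assumptions rather than the actual collection shape:

```typescript
// Sketch: each accepted schema change appends an immutable version record,
// giving a complete audit trail. Field names are illustrative.
type SchemaVersion = {
  version: number; // monotonically increasing, never reused
  schema: Record<string, string>; // field name -> detected type
  createdAt: string;
  reason: "initial" | "approved-drift";
};

const history: SchemaVersion[] = [];

function createSchemaVersion(
  schema: Record<string, string>,
  reason: SchemaVersion["reason"],
): SchemaVersion {
  const record: SchemaVersion = {
    version: history.length + 1,
    schema,
    createdAt: new Date().toISOString(),
    reason,
  };
  history.push(record); // history is append-only; old versions are kept
  return record;
}
```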
Multi-Sheet Support: Excel files with multiple sheets are processed within a single workflow, with per-sheet failure isolation.
## Core Principles
- Single Source of Truth: Raw data files stay on disk; only processing state and results are stored in the database
- File-Based Processing: Data flows from file to memory to database, so large intermediate payloads are never stored during processing
- Metadata Tracking: Processing state, schemas, duplicate maps, and geocoding results are persisted for resumable operations
- Concurrency Control: Per-resource concurrency keys prevent duplicate processing of the same file or scheduled import
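The concurrency-key principle can be sketched as an acquire/release guard keyed by resource (the key format `file:<id>` below is an illustrative assumption):

```typescript
// Sketch: a per-resource concurrency key (e.g. a file id or schedule id)
// ensures only one workflow runs for a given resource at a time.
const inFlight = new Set<string>();

function tryAcquire(concurrencyKey: string): boolean {
  if (inFlight.has(concurrencyKey)) return false; // already being processed
  inFlight.add(concurrencyKey);
  return true;
}

function release(concurrencyKey: string): void {
  inFlight.delete(concurrencyKey); // called when the workflow finishes or fails
}
```

A second trigger for the same file (say, a double-click on upload) would fail to acquire the key and be dropped or deferred instead of running a duplicate pipeline.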
## Next Steps
New to the pipeline? Start with Architecture & Design to understand the system’s core principles.
Working on a specific stage? See Processing Stages for detailed documentation of each stage.
Need to configure behavior? Check Configuration for dataset and system options.
Having issues? Visit Troubleshooting for common problems and solutions.
Looking for the user-facing overview? See the Data Import guide.