⚠️Active Development Notice: TimeTiles is under active development. Information may be placeholder content or not up-to-date.

Architecture & Design

The TimeTiles data ingestion pipeline is designed around a small set of architectural principles aimed at scalability, reliability, and maintainability.

Core Architectural Principles

Single Source of Truth

The pipeline follows a single source of truth principle where uploaded files remain the authoritative data source throughout processing:

  • Files stay on disk: Raw data files are never moved or duplicated during processing
  • Immutable source: Original files remain unchanged from upload through completion
  • Reproducible processing: Any stage can be re-run from the original file
  • Audit trail: Complete history of what was processed and when

File-Based Processing Model

Data moves through the system along a single, one-directional path:

File → Memory → Database

  • File reads: Data is read in configurable batches from disk
  • In-memory processing: Transformations and analysis happen in memory
  • Selective persistence: Only final results and processing state are written to the database

This approach provides several benefits:

  • Memory efficiency: Batch processing prevents memory exhaustion on large files
  • Database efficiency: Minimal database writes during processing
  • Clean separation: Processing logic is independent of storage
  • Recovery support: Processing can resume from any batch
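
The File → Memory → Database path can be sketched as a small batch reader. This is a minimal illustration under stated assumptions, not the TimeTiles implementation: `readBatches`, `countRows`, and the `startBatch` parameter are hypothetical names, and an in-memory array stands in for rows parsed from the file on disk.

```typescript
// Hypothetical helper: yields fixed-size batches from any row source.
// `startBatch` lets a caller resume after a number of completed batches.
// Skipped batches are still read from the source, but never yielded.
function* readBatches<T>(
  rows: Iterable<T>,
  batchSize: number,
  startBatch = 0,
): Generator<{ batchIndex: number; rows: T[] }> {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  let batch: T[] = [];
  let batchIndex = 0;
  for (const row of rows) {
    batch.push(row);
    if (batch.length === batchSize) {
      if (batchIndex >= startBatch) yield { batchIndex, rows: batch };
      batch = [];
      batchIndex++;
    }
  }
  // Emit the final partial batch, if there is one.
  if (batch.length > 0 && batchIndex >= startBatch) {
    yield { batchIndex, rows: batch };
  }
}

// In-memory processing step: only a small summary (here, a count) would be
// persisted, never the rows themselves.
function countRows(rows: Iterable<string>, batchSize: number, startBatch = 0): number {
  let processed = 0;
  for (const batch of readBatches(rows, batchSize, startBatch)) {
    processed += batch.rows.length;
  }
  return processed;
}
```

Because only one batch is held in memory at a time, peak memory depends on the batch size rather than the file size, and a stored batch index is enough to resume.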

Data Storage Strategy

The system carefully distinguishes between what gets stored and what remains transient:

What IS Stored in the Database:

  • Processing state and current stage
  • Detected schemas and field statistics
  • Duplicate analysis results (row number mappings)
  • Geocoding results (location lookups by row number)
  • Progress tracking (current/total counts)
  • Schema version history
  • Final event records

What is NOT Stored:

  • Raw row data from source files
  • Complete file contents
  • Intermediate processing results
  • Temporary calculations

This selective storage ensures the database remains efficient while still supporting resumable operations and complete audit trails.
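
As an illustration of selective storage, the sketch below runs a duplicate analysis and keeps only row-number mappings as its result. It is an assumption-laden example: `analyzeDuplicates`, `DuplicateSummary`, the `idField` parameter, and 1-based row numbers are all hypothetical, and it covers only duplicates within a single file.

```typescript
// Persisted summary: row numbers only, never the row contents.
interface DuplicateSummary {
  totalRows: number;
  // Maps each duplicate row number to the row where its ID first appeared.
  duplicateOf: Record<number, number>;
}

function analyzeDuplicates(
  rows: Array<Record<string, unknown>>,
  idField: string, // hypothetical: the field chosen by the dataset's ID strategy
): DuplicateSummary {
  const firstSeen = new Map<string, number>(); // transient, lives only in memory
  const duplicateOf: Record<number, number> = {};
  rows.forEach((row, index) => {
    const rowNumber = index + 1; // assumption: row numbers are 1-based
    const id = String(row[idField]);
    const original = firstSeen.get(id);
    if (original === undefined) {
      firstSeen.set(id, rowNumber);
    } else {
      duplicateOf[rowNumber] = original;
    }
  });
  return { totalRows: rows.length, duplicateOf };
}
```

The returned object is small and JSON-serializable, so it could sit on a job record, while the raw values stay in the source file.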

Core Collections

The pipeline orchestrates processing across five main Payload CMS collections:

1. ingest-files

Purpose: File upload and metadata storage

  • Stores uploaded files on disk
  • Tracks file metadata (name, size, type, upload date)
  • Maintains overall import status
  • Links to all ingest-jobs created from the file

Multi-Sheet Handling: A single Excel file can generate multiple ingest-jobs (one per sheet), all linked back to the parent ingest-file.

2. ingest-jobs

Purpose: Processing state and orchestration

  • Tracks current processing stage
  • Stores progress information (current/total rows)
  • Maintains duplicate analysis results
  • Caches geocoding lookups
  • Preserves schema builder state across batches
  • Links to associated dataset and ingest-file

Central Role: This collection is the heart of the pipeline, coordinating all processing activity.

3. datasets

Purpose: Configuration and schema management

  • Defines ID strategies for deduplication
  • Configures schema behavior (locked, auto-grow, strict validation)
  • Specifies type transformations
  • Sets processing limits
  • Contains geographic field detection settings
  • Links to all events created for this dataset

4. dataset-schemas

Purpose: Schema versioning and history

  • Maintains complete version history of all schema changes
  • Stores field metadata and statistics for each version
  • Tracks who approved each schema change and when
  • Enables schema rollback and evolution tracking
  • Links schema versions to the imports that created them

Critical for Compliance: This collection provides the audit trail needed for data governance and compliance requirements.

5. events

Purpose: Final processed data storage

  • Stores complete event records with all enrichments
  • Includes geocoded location data
  • References source dataset and ingest-job
  • Links to specific schema version used
  • Contains validation status and metadata

Workflow Architecture

4 Payload Workflows

The pipeline uses Payload CMS Workflows to define linear task pipelines. Each workflow sequences multiple task handlers, with automatic retry and caching of completed tasks on re-run.

Workflow         | Trigger                                              | Purpose
manual-ingest    | ingest-files afterChange hook                        | Full pipeline for user uploads
scheduled-ingest | schedule-manager job                                 | URL fetch + full pipeline
scraper-ingest   | schedule-manager job                                 | Scraper execution + full pipeline
ingest-process   | ingest-jobs afterChange hook (NEEDS_REVIEW approval) | Resume after schema drift resolution

Hook-Driven Workflow Queueing

Collection afterChange hooks queue workflows automatically:

  • ingest-files afterChange: Queues manual-ingest workflow when a file is uploaded
  • ingest-jobs afterChange: Queues ingest-process workflow when a NEEDS_REVIEW job is approved

Workflows are never queued manually from task handlers or tests. The hooks ensure exactly-once execution.
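
The two hooks described above might look roughly like the following sketch. To keep it self-contained, the types are local stand-ins rather than imports from Payload, the `queue` function is injected (in a real hook it would presumably be `req.payload.jobs.queue`), and the `stage` field, the non-review stage names, and the input property names are assumptions.

```typescript
// Local stand-ins for the parts of Payload's hook arguments used here.
type QueueFn = (args: {
  workflow: string;
  input: Record<string, unknown>;
}) => Promise<unknown>;

interface AfterChangeArgs<TDoc> {
  doc: TDoc;
  previousDoc?: TDoc;
  operation: "create" | "update";
  queue: QueueFn; // assumption: injected instead of reading req.payload.jobs.queue
}

interface IngestFileDoc { id: string }
interface IngestJobDoc { id: string; stage: string } // `stage` is a hypothetical field name

// Queue manual-ingest once, when the file document is first created.
async function ingestFilesAfterChange(
  args: AfterChangeArgs<IngestFileDoc>,
): Promise<IngestFileDoc> {
  if (args.operation === "create") {
    await args.queue({ workflow: "manual-ingest", input: { ingestFileId: args.doc.id } });
  }
  return args.doc;
}

// Queue ingest-process only on the transition out of NEEDS_REVIEW.
async function ingestJobsAfterChange(
  args: AfterChangeArgs<IngestJobDoc>,
): Promise<IngestJobDoc> {
  const wasInReview = args.previousDoc?.stage === "NEEDS_REVIEW";
  const approved = wasInReview && args.doc.stage !== "NEEDS_REVIEW";
  if (args.operation === "update" && approved) {
    await args.queue({ workflow: "ingest-process", input: { ingestJobId: args.doc.id } });
  }
  return args.doc;
}
```

Guarding on the operation and on the stage transition is what keeps ordinary updates to the same document from queueing a second workflow.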

Concurrency Control

Each workflow uses a per-resource concurrency key to prevent duplicate processing:

  • manual-ingest: file:{importFileId}
  • scheduled-ingest: sched:{scheduledImportId}
  • scraper-ingest: scraper:{scraperId}
  • ingest-process: ingest:{ingestJobId}
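
The key formats above can be produced by one small function. The sketch below also includes a hypothetical in-memory `KeyedLock` to show the effect of a per-resource key. It is only an illustration and says nothing about how Payload enforces concurrency internally.

```typescript
type WorkflowSlug =
  | "manual-ingest"
  | "scheduled-ingest"
  | "scraper-ingest"
  | "ingest-process";

// Prefixes taken from the key formats listed above.
const KEY_PREFIX: Record<WorkflowSlug, string> = {
  "manual-ingest": "file",
  "scheduled-ingest": "sched",
  "scraper-ingest": "scraper",
  "ingest-process": "ingest",
};

function concurrencyKey(workflow: WorkflowSlug, resourceId: string | number): string {
  return `${KEY_PREFIX[workflow]}:${resourceId}`;
}

// Hypothetical guard: at most one active run per key.
class KeyedLock {
  private active = new Set<string>();

  tryAcquire(key: string): boolean {
    if (this.active.has(key)) return false;
    this.active.add(key);
    return true;
  }

  release(key: string): void {
    this.active.delete(key);
  }
}
```

Two uploads of different files get different keys and can run side by side, while a second run for the same file is rejected until the first releases its key.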

Error Model

Task handlers use three distinct patterns:

  • Throw for transient failures — Payload retries the task automatically
  • Return { needsReview: true } for human review — pipeline pauses for that sheet
  • Return data for success — pipeline continues to the next task

Multi-sheet workflows process sheets via Promise.allSettled with per-sheet try/catch. A failure in one sheet does not block other sheets.
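
A minimal sketch of this error model follows, assuming a hypothetical `runTask` callback that stands in for a task handler and hypothetical `TaskOutput` and `SheetOutcome` shapes. It shows the three outcomes and how `Promise.allSettled` with a per-sheet try/catch isolates failures.

```typescript
// Hypothetical handler output: either a review request or success data.
type TaskOutput =
  | { needsReview: true }
  | { needsReview?: false; rowCount: number };

type SheetOutcome =
  | { sheet: string; status: "completed"; rowCount: number }
  | { sheet: string; status: "needs-review" }
  | { sheet: string; status: "failed"; error: string };

async function processSheets(
  sheets: string[],
  runTask: (sheet: string) => Promise<TaskOutput>,
): Promise<SheetOutcome[]> {
  const settled = await Promise.allSettled(
    sheets.map(async (sheet): Promise<SheetOutcome> => {
      try {
        const output = await runTask(sheet);
        if (output.needsReview === true) {
          return { sheet, status: "needs-review" }; // pause this sheet only
        }
        return { sheet, status: "completed", rowCount: output.rowCount };
      } catch (err) {
        // A thrown error marks this sheet as failed without touching the others.
        const message = err instanceof Error ? err.message : String(err);
        return { sheet, status: "failed", error: message };
      }
    }),
  );
  // Defensive: the inner try/catch should mean nothing is rejected here.
  return settled.map((result, i): SheetOutcome =>
    result.status === "fulfilled"
      ? result.value
      : { sheet: sheets[i], status: "failed", error: String(result.reason) },
  );
}
```

In this sketch the workflow handler would inspect the returned outcomes to decide what happens next for each sheet.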

Background Job Lifecycle

Payload CMS deletes completed jobs by default. This has important implications:

  • No Double-Queueing: Workflows are queued by hooks, not by task handlers
  • Testing Implications: Tests must check side effects (events created, data changed) rather than job records
  • Drain Loop: Tests use payload.jobs.run() in a loop to process chained workflow tasks
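
A drain loop for tests could be sketched as below. `JobRunner` is a local stand-in for the `payload.jobs` object, and `drainJobs`, `isSettled`, and `maxPasses` are hypothetical names. The loop stops on an observable side effect rather than on job records, since completed jobs are deleted.

```typescript
// Stand-in for the part of `payload.jobs` used by the loop.
interface JobRunner {
  run(): Promise<unknown>;
}

// Runs queued jobs repeatedly until the caller's side-effect check passes.
// Returns the number of passes that were needed.
async function drainJobs(
  jobs: JobRunner,
  isSettled: () => Promise<boolean>, // e.g. "have the expected events been created?"
  maxPasses = 20,
): Promise<number> {
  for (let pass = 1; pass <= maxPasses; pass++) {
    await jobs.run();
    if (await isSettled()) return pass;
  }
  throw new Error(`pipeline did not settle after ${maxPasses} passes`);
}
```

The `maxPasses` cap turns a pipeline that never finishes into a clear test failure instead of a hung test run.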

Batch Processing Architecture

Configurable Batch Sizes

Different stages use different batch sizes optimized for their specific needs. See Processing Stages for batch sizes per stage.

  • Duplicate analysis — memory-efficient for hash map operations; allows pause/resume at batch boundaries
  • Schema detection — larger batches for efficiency; builder state persisted between batches for progressive schema building
  • Geocoding — processes all unique locations in a single pass (not batched by row); each unique value geocoded once, with small API batch sizes to respect rate limits
  • Event creation — smaller batches to avoid database transaction timeouts; individual row failures don’t stop the batch
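
The geocoding strategy (each unique value geocoded once, in small API batches, then mapped back to rows) can be sketched as follows. The `Geocoder` callback, the `apiBatchSize` default, and the trim-and-lowercase normalization are assumptions made for the example, not the project's actual settings.

```typescript
type Coordinates = { lat: number; lng: number };

// Hypothetical geocoding client: resolves a small batch of location strings.
type Geocoder = (locations: string[]) => Promise<Map<string, Coordinates>>;

async function geocodeByRow(
  rows: Array<{ rowNumber: number; location: string }>,
  geocode: Geocoder,
  apiBatchSize = 10,
): Promise<Map<number, Coordinates>> {
  // Group row numbers by normalized location so each value is looked up once.
  const rowsByLocation = new Map<string, number[]>();
  for (const { rowNumber, location } of rows) {
    const key = location.trim().toLowerCase(); // assumed normalization
    if (key === "") continue;
    const rowNumbers = rowsByLocation.get(key) ?? [];
    rowNumbers.push(rowNumber);
    rowsByLocation.set(key, rowNumbers);
  }

  const uniqueLocations = Array.from(rowsByLocation.keys());
  const byRow = new Map<number, Coordinates>();

  // Small sequential API batches keep the request rate low.
  for (let i = 0; i < uniqueLocations.length; i += apiBatchSize) {
    const found = await geocode(uniqueLocations.slice(i, i + apiBatchSize));
    for (const [key, coords] of found) {
      for (const rowNumber of rowsByLocation.get(key) ?? []) {
        byRow.set(rowNumber, coords);
      }
    }
  }
  return byRow;
}
```

The result is keyed by row number, which matches the "location lookups by row number" form that the pipeline stores.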

Progressive Processing

Many stages build up results progressively across batches:

  • Schema detection: Refines type understanding with each batch
  • Duplicate analysis: Builds complete duplicate map across all batches
  • Event creation: Processes file in manageable chunks

This approach enables:

  • Processing files larger than available memory
  • Pause and resume at batch boundaries
  • Incremental progress tracking
  • Partial recovery from failures
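
Progressive schema detection can be illustrated with a builder state that is plain JSON, so it can be saved after one batch and restored before the next. The type-inference rules and all names here (`SchemaBuilderState`, `applyBatch`, `resolveType`) are simplified assumptions, not the project's actual schema builder.

```typescript
type FieldType = "number" | "boolean" | "string";
type FieldStats = { counts: Record<FieldType, number>; nulls: number };
type SchemaBuilderState = { rowsSeen: number; fields: Record<string, FieldStats> };

function detectType(value: unknown): FieldType | null {
  if (value === null || value === undefined || value === "") return null;
  if (typeof value === "number") return "number";
  if (typeof value === "boolean" || value === "true" || value === "false") return "boolean";
  if (typeof value === "string" && value.trim() !== "" && Number.isFinite(Number(value))) {
    return "number";
  }
  return "string";
}

// Folds one batch into the state and returns a new state (input is not mutated).
function applyBatch(
  state: SchemaBuilderState,
  batch: Array<Record<string, unknown>>,
): SchemaBuilderState {
  const next: SchemaBuilderState = JSON.parse(JSON.stringify(state));
  for (const row of batch) {
    next.rowsSeen++;
    for (const [name, value] of Object.entries(row)) {
      if (next.fields[name] === undefined) {
        next.fields[name] = { counts: { number: 0, boolean: 0, string: 0 }, nulls: 0 };
      }
      const detected = detectType(value);
      if (detected === null) next.fields[name].nulls++;
      else next.fields[name].counts[detected]++;
    }
  }
  return next;
}

// A field keeps a narrow type only while every observed value agrees with it.
function resolveType(stats: FieldStats): FieldType {
  const { number, boolean, string } = stats.counts;
  if (string > 0) return "string";
  if (number > 0 && boolean === 0) return "number";
  if (boolean > 0 && number === 0) return "boolean";
  return "string"; // mixed or empty column
}
```

A column that looks numeric after the first batch is widened to a string as soon as a later batch contains a non-numeric value, which is the refinement behavior described above.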

Data Flow Principles

Metadata Tracking

Throughout processing, the pipeline tracks comprehensive metadata:

Stage Progression:

  • Current stage and previous stage
  • Stage transition timestamps
  • User who initiated transitions (for approvals)

Progress Information:

  • Rows processed vs total rows
  • Batches completed vs total batches
  • Percentage completion

Processing Results:

  • Duplicate row numbers (internal and external)
  • Geocoding results by row number
  • Schema builder state for resumption
  • Error details for failed rows
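
The progress fields listed above can all be derived from three numbers. The sketch below is illustrative only: `computeProgress` and `ProgressSnapshot` are hypothetical names, and treating an empty file as 100% complete is an assumption.

```typescript
interface ProgressSnapshot {
  rowsProcessed: number;
  totalRows: number;
  batchesCompleted: number;
  totalBatches: number;
  percent: number;
}

function computeProgress(
  rowsProcessed: number,
  totalRows: number,
  batchSize: number,
): ProgressSnapshot {
  if (batchSize <= 0) throw new Error("batchSize must be positive");
  // Clamp so a miscounted value never reports more than 100%.
  const done = Math.min(Math.max(rowsProcessed, 0), totalRows);
  const totalBatches = Math.ceil(totalRows / batchSize);
  // The last batch may be partial, so finishing all rows completes all batches.
  const batchesCompleted = done === totalRows ? totalBatches : Math.floor(done / batchSize);
  const percent = totalRows === 0 ? 100 : Math.round((done / totalRows) * 100);
  return { rowsProcessed: done, totalRows, batchesCompleted, totalBatches, percent };
}
```

For example, 600 of 1000 rows with a batch size of 300 gives 2 of 4 batches and 60%.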

Version Tracking

Payload’s built-in versioning provides complete audit capability:

  • Each stage change creates a new version of the ingest-job
  • Full history of processing decisions
  • Ability to analyze bottlenecks and failures
  • Recovery information for debugging
  • Compliance and governance support

Performance Characteristics

Scalability

The architecture scales effectively across multiple dimensions:

Memory Efficiency:

  • Batch processing prevents memory exhaustion
  • Configurable batch sizes tune memory usage
  • Progressive processing reduces peak memory

Database Efficiency:

  • Minimal writes during processing
  • Selective persistence of only essential data
  • Bulk operations where possible

API Friendliness:

  • Geocoding respects rate limits
  • Exponential backoff on failures
  • Efficient handling of unique values

Parallelization:

  • Multiple ingest-jobs can run concurrently
  • Different stages of different imports run simultaneously
  • Independent processing of Excel sheets

Monitoring Capabilities

The pipeline provides extensive observability:

Progress Tracking:

  • Real-time progress for each stage
  • Batch-level granularity
  • Overall import status

Performance Metrics:

  • Processing times per stage
  • Throughput measurements
  • API usage and quota tracking

Error Logging:

  • Detailed error information per row/batch
  • Stage-level failure tracking
  • Complete error context for debugging

Resource Usage:

  • Memory consumption monitoring
  • Database query performance
  • API quota consumption

Design Trade-offs

File-Based vs Database-Based

Choice: Keep raw data in files, not database

Trade-offs:

  • ✅ Simpler database schema
  • ✅ Better performance for large datasets
  • ✅ Easier to re-process from source
  • ❌ Requires file system management
  • ❌ Files must remain accessible

Workflow-Based vs Hook-Driven State Machine

Choice: Payload Workflows with hook-based queueing

Trade-offs:

  • ✅ Explicit, linear workflow handlers (readable control flow)
  • ✅ Built-in retry with cached completed tasks
  • ✅ Per-resource concurrency control
  • ✅ One workflow per import type (clear separation)
  • ❌ Slight overhead for single-sheet imports (workflow wrapper job)
  • ❌ Sequential sheet processing within a workflow

Batch vs Stream Processing

Choice: Batch processing with configurable sizes

Trade-offs:

  • ✅ Memory-efficient for large files
  • ✅ Pause/resume capability
  • ✅ Tunable performance
  • ❌ Higher latency than streaming
  • ❌ Batch size tuning required

Development Guidelines

Respect Workflow Architecture

  • Let afterChange hooks on collections queue workflows — never queue workflows manually from task handlers or tests
  • Each task handler does its work and returns structured output; the workflow handler sequences tasks and inspects results
  • This prevents double-queueing and maintains exactly-once semantics

Task Handler Focus

  • Each handler does one thing well
  • Return structured output: data for success, needsReview: true for human review, or throw for transient failures
  • Never queue the next task from a handler — the workflow handler owns sequencing

Testing Implications

  • Jobs are deleted after completion (deleteJobOnComplete: true) — check side effects (events created, data changed), not job history
  • Use a drain loop with payload.jobs.run() to process chained workflow tasks in tests
  • Verify progress and state changes on the ingest-job record, then query actual results

These architectural decisions create a pipeline that is scalable, maintainable, and robust while leveraging Payload CMS Workflows for task sequencing, retry, and concurrency control.
