Configuration
The TimeTiles data ingestion pipeline provides extensive configuration options at both the dataset and system levels. This document explains all available configuration options and their effects on pipeline behavior.
Dataset Configuration
Each dataset can configure its ingestion behavior independently. Configuration is stored in the datasets collection and affects all ingestion runs for that dataset.
ID Strategy Configuration
Controls how unique identifiers are generated and how duplicates are detected.
Strategy Types:
External ID:
- Uses a specific field from the data as the unique identifier
- Best for data with explicit, reliable IDs (UUID, database ID, etc.)
- Requires specifying the field path (e.g., “event_id”, “data.uuid”)
- Fastest strategy for duplicate detection
Computed Hash:
- Generates ID by hashing a combination of specified fields
- Best for data without explicit IDs but with identifying field combinations
- Requires specifying which fields to include in hash
- More flexible than external ID, but slightly slower
Auto (Auto-detect Duplicates by Content):
- Automatically detects duplicates by comparing event content
- Best for datasets without explicit IDs where content uniqueness matters
- No configuration needed beyond enabling
- Default strategy for new datasets
Hybrid:
- Tries external ID first, falls back to computed hash if external ID is missing
- Best for datasets with partial ID coverage
- Combines reliability of external ID with flexibility of computed hash
- Requires configuring both strategies
Duplicate Strategy:
Controls what happens when duplicates are detected:
- Skip: Ignore duplicate rows (most common)
- Update: Update existing event with new data
- Version: Create new version of existing event
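As a sketch, the options above might combine into a dataset-level ID configuration like the following; the type and field names here are illustrative, not the exact datasets collection schema:

```typescript
// Hypothetical shape of a dataset's ID strategy configuration.
// Names are illustrative; consult the datasets collection for the real schema.
type DuplicateStrategy = "skip" | "update" | "version";

type IdStrategy =
  | { type: "external"; idField: string }                      // e.g. "event_id" or "data.uuid"
  | { type: "computed-hash"; hashFields: string[] }            // fields combined into a hash
  | { type: "auto" }                                           // content-based duplicate detection
  | { type: "hybrid"; idField: string; hashFields: string[] }; // external first, hash fallback

interface IdConfig {
  strategy: IdStrategy;
  duplicateStrategy: DuplicateStrategy; // what to do when a duplicate is found
}

// Example: external IDs with a computed-hash fallback, skipping duplicates.
const idConfig: IdConfig = {
  strategy: { type: "hybrid", idField: "event_id", hashFields: ["name", "date", "location"] },
  duplicateStrategy: "skip",
};
```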
Deduplication Configuration
Controls early duplicate detection in Stage 2 (Analyze Duplicates).
Enabled/Disabled:
- Enable to reduce processing volume and prevent duplicate events
- Disable for datasets where every row should create an event
Strategy:
- Same options as duplicate strategy: skip, update, or version
- Controls what happens when a duplicate is found during early detection
- Should typically match the ID strategy’s duplicate strategy for consistency
Field Specification:
- For computed-hash: specify which fields constitute a duplicate
- Can include nested field paths (e.g., “data.event.id”)
- Order doesn’t matter: the hash is computed deterministically regardless of field order (see the sketch below)
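As an illustration of that order independence, here is a minimal sketch of deterministic hashing over a field specification, assuming Node’s built-in `crypto` module; the actual implementation may differ:

```typescript
import { createHash } from "node:crypto";

// Resolve a dotted path like "data.event.id" against a row object.
const getPath = (row: Record<string, unknown>, path: string): unknown =>
  path.split(".").reduce<unknown>((obj, key) => (obj as Record<string, unknown> | undefined)?.[key], row);

// Hash the specified fields in a canonical (sorted) order, so that
// ["name", "date"] and ["date", "name"] produce the same ID.
export function computeDuplicateHash(row: Record<string, unknown>, fields: string[]): string {
  const canonical = [...fields]
    .sort()
    .map((f) => `${f}=${JSON.stringify(getPath(row, f) ?? null)}`)
    .join("|");
  return createHash("sha256").update(canonical).digest("hex");
}
```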
Schema Configuration
Controls schema behavior, validation, and evolution.
Locked:
- When true: Require approval for ALL schema changes (even non-breaking)
- When false: Allow auto-approval based on change classification
- Use for production datasets with strict governance requirements
Auto-Grow:
- When true: Allow schema to grow with new optional fields automatically
- When false: Require approval for any schema changes
- Prerequisite for auto-approval of non-breaking changes
Auto-Approve Non-Breaking:
- When true: Non-breaking changes skip manual approval
- When false: All changes require manual approval
- Only effective when auto-grow is also enabled
- Requires locked to be false
Strict Validation:
- When true: Block imports that don’t match schema exactly
- When false: Allow type transformations and best-effort parsing
- Use strict mode for high-quality, well-structured data
Allow Transformations:
- When true: Apply configured type transformations automatically
- When false: Reject type mismatches without attempting transformation
- Enables flexible handling of common type variations
Max Schema Depth:
- Maximum nesting depth for nested objects in data
- Prevents excessively deep schemas that impact performance
- Typical values: 3-5 levels
- Higher values increase schema complexity and query costs
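Taken together, the schema flags interact: a locked schema overrides the auto-grow and auto-approve settings. A minimal sketch, assuming illustrative field names:

```typescript
// Illustrative schema configuration. Locked wins over the auto-* flags:
// auto-approval only applies when locked is false and autoGrow is true.
interface SchemaConfig {
  locked: boolean;                 // true: every schema change needs approval
  autoGrow: boolean;               // true: new optional fields may be added automatically
  autoApproveNonBreaking: boolean; // true: non-breaking changes skip manual approval
  strictValidation: boolean;       // true: reject rows that don't match the schema exactly
  allowTransformations: boolean;   // true: apply configured type transformations
  maxSchemaDepth: number;          // typical: 3-5
}

// Whether a detected non-breaking change can be applied without approval.
const canAutoApprove = (c: SchemaConfig): boolean =>
  !c.locked && c.autoGrow && c.autoApproveNonBreaking;
```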
Ingestion Transforms
Transform rules applied to incoming data before validation. Each transform has a type, relevant source/target fields, and an active toggle.
Transform Types:
Rename (`rename`):
- Moves a field from one path to another
- Specify `from` (source field path) and `to` (target field path)
- Use for standardizing field names across different data sources (e.g., “EventName” to “name”)
Date Parse (`date-parse`):
- Parses date strings from one format to another
- Specify `from` (source field), `inputFormat`, and `outputFormat`
- Supported input formats: `DD/MM/YYYY`, `MM/DD/YYYY`, `YYYY-MM-DD`, `DD-MM-YYYY`, `MM-DD-YYYY`, `DD.MM.YYYY`
- Supported output formats: `YYYY-MM-DD` (ISO, default), `DD/MM/YYYY`, `MM/DD/YYYY`
- Optional `timezone` field (e.g., “America/New_York”)
String Operation (`string-op`):
- Applies a string operation to a source field
- Specify `from` (source field) and `operation`
- Operations: `uppercase`, `lowercase`, `trim`, `replace` (with `pattern` and `replacement` fields), `expression` (custom safe expression using built-in functions)
- Expression functions include: `upper`, `lower`, `trim`, `concat`, `replace`, `substring`, `toNumber`, `parseDate`, `parseBool`, `round`, `floor`, `ceil`, `abs`, `len`, `ifEmpty`
Concatenate (`concatenate`):
- Joins multiple source fields into a single target field
- Specify `fromFields` (JSON array of source field paths), `to` (target field), and `separator` (default: space)
- Example: combine “firstName” + “lastName” into “fullName”
Split (`split`):
- Splits a single field by a delimiter into multiple target fields
- Specify `from` (source field), `delimiter` (default: comma), and `toFields` (JSON array of target field names)
- Example: split “full_name” into “first_name” and “last_name”
Common Fields:
Each transform rule also includes:
- `active`: Checkbox to disable a transform without deleting it (default: true)
- `autoDetected`: Whether the transform was suggested by auto-detection
- `confidence`: Confidence score (0-100) for auto-detected transforms
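Putting the rule types and common fields together, a dataset’s transform list might look like the following sketch (the stored shape may differ):

```typescript
// Illustrative transform rules; shapes mirror the fields described above.
const transforms = [
  { type: "rename", from: "EventName", to: "name", active: true },
  {
    type: "date-parse",
    from: "event_date",
    inputFormat: "DD/MM/YYYY",
    outputFormat: "YYYY-MM-DD",
    timezone: "America/New_York",
    active: true,
  },
  { type: "string-op", from: "category", operation: "lowercase", active: true },
  {
    type: "concatenate",
    fromFields: ["firstName", "lastName"],
    to: "fullName",
    separator: " ",
    active: true,
  },
  {
    type: "split",
    from: "full_name",
    delimiter: ",",
    toFields: ["first_name", "last_name"],
    active: false, // disabled without being deleted
    autoDetected: true,
    confidence: 82, // 0-100 for auto-detected rules
  },
];
```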
Field Mapping Overrides
Manual specification of field mappings when auto-detection isn’t sufficient.
Override Types:
Geocoding Field Overrides:
- Manually specify which field contains addresses
- Manually specify latitude and longitude field pairs
- Override auto-detection when field names are non-standard
Timestamp Field Overrides:
- Manually specify which field contains the event timestamp
- Override default priority (timestamp, date, datetime, etc.)
- Handle custom timestamp field names
Required Field Overrides:
- Force specific fields to be required
- Mark optional fields as required for validation
- Enforce data quality standards
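A combined sketch of these overrides, with illustrative field names (`locationPath` is the address mapping referenced under Geographic Field Detection below):

```typescript
// Illustrative field mapping overrides; names are not the exact schema.
const fieldMappingOverrides = {
  locationPath: "venue.address",           // field containing addresses for geocoding
  latitudePath: "coords.lat",              // explicit coordinate pair
  longitudePath: "coords.lng",
  timestampPath: "occurred_at",            // overrides the default timestamp priority
  requiredFields: ["name", "occurred_at"], // force these fields to be required
};
```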
Enum Detection Configuration
Controls how the system identifies enumerated (categorical) fields.
Detection Mode:
Count:
- Field is enum if unique values ≤ threshold
- Example: threshold=50 means fields with ≤50 unique values become enums
- Best for small datasets or when you know the expected enum size
Percentage:
- Field is enum if (unique values / total values) ≤ threshold
- Example: threshold=0.05 means fields with ≤5% unique values become enums
- Best for large datasets where absolute counts are misleading
Threshold Values:
- Count mode: Typical values 20-100 unique values
- Percentage mode: Typical values 0.01-0.10 (1%-10%)
- Higher values create more enums (more permissive)
- Lower values create fewer enums (more strict)
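Both modes reduce to a simple predicate, sketched below with illustrative names:

```typescript
type EnumDetection =
  | { mode: "count"; threshold: number }       // e.g. 50 unique values
  | { mode: "percentage"; threshold: number }; // e.g. 0.05 = 5%

// A field is treated as an enum when its uniqueness falls at or below the threshold.
function isEnumField(uniqueCount: number, totalCount: number, cfg: EnumDetection): boolean {
  return cfg.mode === "count"
    ? uniqueCount <= cfg.threshold
    : totalCount > 0 && uniqueCount / totalCount <= cfg.threshold;
}

// 40 unique values among 10,000 rows qualifies under both example configs.
isEnumField(40, 10_000, { mode: "count", threshold: 50 });        // true
isEnumField(40, 10_000, { mode: "percentage", threshold: 0.05 }); // true (0.4%)
```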
Geographic Field Detection
Controls automatic detection of location fields for geocoding.
Auto-Detect:
- When true: Automatically identify address and coordinate fields
- When false: Only use manual field mappings
- Recommended: true for initial imports, false after field verification
Manual Overrides:
Latitude Path:
- Specify exact field containing latitude values
- Overrides auto-detection
- Use when field name doesn’t match common patterns
Longitude Path:
- Specify exact field containing longitude values
- Must be specified if latitude path is specified
- Use when field name doesn’t match common patterns
Note that the `geoFieldDetection` group does not include an address/location path field. Location field mapping is configured separately in the `fieldMappingOverrides` group via `locationPath` (see Field Mapping Overrides).
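Auto-detection generally works by matching common field-name patterns; the following is a simplified sketch of the idea, not the actual detector:

```typescript
// Simplified name-based coordinate detection; the real detector may be richer.
const LAT_PATTERN = /^(lat|latitude)$/i;
const LNG_PATTERN = /^(lon|lng|long|longitude)$/i;

function detectCoordinateFields(fieldNames: string[]): { latitudePath?: string; longitudePath?: string } {
  const latitudePath = fieldNames.find((n) => LAT_PATTERN.test(n.split(".").pop() ?? n));
  const longitudePath = fieldNames.find((n) => LNG_PATTERN.test(n.split(".").pop() ?? n));
  // Only return a complete pair: latitude without longitude is not usable.
  return latitudePath && longitudePath ? { latitudePath, longitudePath } : {};
}

detectCoordinateFields(["name", "coords.lat", "coords.lng"]);
// => { latitudePath: "coords.lat", longitudePath: "coords.lng" }
```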
System Configuration
Global pipeline settings that affect all datasets and ingestion runs.
Batch Size Configuration
Controls how many rows are processed in each batch for various stages. Batch sizes are configured via `config/timetiles.yml` (under the `batchSizes` key) and fall back to hardcoded defaults when the YAML file is absent. There are no environment variable overrides for batch sizes.
Duplicate Analysis Batch Size (`batchSizes.duplicateAnalysis`):
- Default: 5,000 rows
- Memory-efficient for hash map operations
- Larger values: faster but more memory
- Smaller values: slower but less memory
Schema Detection Batch Size (`batchSizes.schemaDetection`):
- Default: 10,000 rows
- Larger batches for schema building efficiency
- Larger values: faster schema convergence
- Smaller values: more batches, slower convergence
Event Creation Batch Size (`batchSizes.eventCreation`):
- Default: 1,000 rows
- Balances throughput with transaction reliability
- Larger values: faster but higher transaction timeout risk
- Smaller values: slower but more reliable
Database Chunk Size (`batchSizes.databaseChunk`):
- Default: 1,000 records
- Batch size for bulk database operations
- Affects memory and transaction duration
Geocoding Processing:
- Processes ALL unique locations in one pass (not batched by rows)
- API request batching and rate limits are configured separately (typically 100 requests/minute)
- Extracts unique addresses/coordinates from entire file first
- Results cached in lookup map for all rows
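A sketch of the `batchSizes` section of `config/timetiles.yml` and a defaults-merging loader; this loader is illustrative, not the actual `getAppConfig()` implementation, and assumes the `js-yaml` package:

```typescript
// config/timetiles.yml (sketch):
//
// batchSizes:
//   duplicateAnalysis: 5000
//   schemaDetection: 10000
//   eventCreation: 1000
//   databaseChunk: 1000

import { readFileSync } from "node:fs";
import { load } from "js-yaml";

const DEFAULT_BATCH_SIZES = {
  duplicateAnalysis: 5_000,
  schemaDetection: 10_000,
  eventCreation: 1_000,
  databaseChunk: 1_000,
};

// Merge YAML overrides onto the hardcoded defaults; fall back entirely
// to the defaults when the file is absent (no environment variable overrides).
export function loadBatchSizes(path = "config/timetiles.yml") {
  try {
    const parsed = load(readFileSync(path, "utf8")) as {
      batchSizes?: Partial<typeof DEFAULT_BATCH_SIZES>;
    };
    return { ...DEFAULT_BATCH_SIZES, ...parsed?.batchSizes };
  } catch {
    return DEFAULT_BATCH_SIZES;
  }
}
```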
Geocoding
Geocoding providers are configured via the Payload CMS Settings global (`/dashboard/globals/settings`), not via config files. See the Self-Hosting Configuration docs for setup details.
Workers
Background job workers run as separate processes, one per queue. In production, each queue gets a dedicated Docker container started with `pnpm payload jobs:run --cron --queue <name>`. Concurrency is controlled by the number of containers deployed, not by a configuration setting. Retry behavior is handled by Payload’s built-in workflow system.
Configuration Changes
Runtime Changes
Most dataset configuration changes take effect immediately for new imports:
- ID strategy changes
- Schema configuration changes
- Type transformation changes
- Field mapping overrides
Requires Restart
- Batch size changes in `config/timetiles.yml` (loaded once at startup via `getAppConfig()`)
Migration Required
- Changing ID strategy on datasets with existing events
- Significantly changing the deduplication strategy