Configuration
The TimeTiles data ingestion pipeline provides extensive configuration options at both the dataset and system levels. This document explains all available configuration options and their effects on pipeline behavior.
Dataset Configuration
Each dataset can configure its ingestion behavior independently. Configuration is stored in the datasets collection and affects all ingestion runs for that dataset.
ID Strategy Configuration
Controls how unique identifiers are generated and how duplicates are detected.
Strategy Types:
External ID:
- Uses a specific field from the data as the unique identifier
- Best for data with explicit, reliable IDs (UUID, database ID, etc.)
- Requires specifying the field path (e.g., “event_id”, “data.uuid”)
- Fastest strategy for duplicate detection
Computed Hash:
- Generates ID by hashing a combination of specified fields
- Best for data without explicit IDs but with identifying field combinations
- Requires specifying which fields to include in hash
- More flexible than external ID, but slightly slower
Auto (Auto-detect Duplicates by Content):
- Automatically detects duplicates by comparing event content
- Best for datasets without explicit IDs where content uniqueness matters
- No configuration needed beyond enabling
- Default strategy for new datasets
Hybrid:
- Tries external ID first, falls back to computed hash if external ID is missing
- Best for datasets with partial ID coverage
- Combines reliability of external ID with flexibility of computed hash
- Requires configuring both strategies
Duplicate Strategy:
Controls what happens when duplicates are detected:
- Skip: Ignore duplicate rows (most common)
- Update: Update existing event with new data
- Version: Create new version of existing event
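As a sketch, the options above might combine into a dataset-level ID configuration like the following; the type and field names here are illustrative, not the exact datasets collection schema:

```typescript
// Hypothetical shape of a dataset's ID strategy configuration.
// Names are illustrative; consult the datasets collection for the real schema.
type DuplicateStrategy = "skip" | "update" | "version";

type IdStrategy =
  | { type: "external"; idField: string }                      // e.g. "event_id" or "data.uuid"
  | { type: "computed-hash"; hashFields: string[] }            // fields combined into a hash
  | { type: "auto" }                                           // content-based duplicate detection
  | { type: "hybrid"; idField: string; hashFields: string[] }; // external first, hash fallback

interface IdConfig {
  strategy: IdStrategy;
  duplicateStrategy: DuplicateStrategy; // what to do when a duplicate is found
}

// Example: external IDs with a computed-hash fallback, skipping duplicates.
const idConfig: IdConfig = {
  strategy: { type: "hybrid", idField: "event_id", hashFields: ["name", "date", "location"] },
  duplicateStrategy: "skip",
};
```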
Deduplication Configuration
Controls early duplicate detection in Stage 2 (Analyze Duplicates).
Enabled/Disabled:
- Enable to reduce processing volume and prevent duplicate events
- Disable for datasets where every row should create an event
Strategy:
- Same options as duplicate strategy: skip, update, or version
- Controls what happens when a duplicate is found during early detection
- Should typically match the ID strategy’s duplicate strategy for consistency
Field Specification:
- For computed-hash: specify which fields constitute a duplicate
- Can include nested field paths (e.g., “data.event.id”)
- Order doesn’t matter: the hash is computed deterministically regardless of field order (see the sketch below)
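As an illustration of that order independence, here is a minimal sketch of deterministic hashing over a field specification, assuming Node’s built-in `crypto` module; the actual implementation may differ:

```typescript
import { createHash } from "node:crypto";

// Resolve a dotted path like "data.event.id" against a row object.
const getPath = (row: Record<string, unknown>, path: string): unknown =>
  path.split(".").reduce<unknown>((obj, key) => (obj as Record<string, unknown> | undefined)?.[key], row);

// Hash the specified fields in a canonical (sorted) order, so that
// ["name", "date"] and ["date", "name"] produce the same ID.
export function computeDuplicateHash(row: Record<string, unknown>, fields: string[]): string {
  const canonical = [...fields]
    .sort()
    .map((f) => `${f}=${JSON.stringify(getPath(row, f) ?? null)}`)
    .join("|");
  return createHash("sha256").update(canonical).digest("hex");
}
```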
Schema Configuration
Controls schema behavior, validation, and evolution.
Locked:
- When true: Require approval for ALL schema changes (even non-breaking)
- When false: Allow auto-approval based on change classification
- Use for production datasets with strict governance requirements
Auto-Grow:
- When true: Allow schema to grow with new optional fields automatically
- When false: Require approval for any schema changes
- Prerequisite for auto-approval of non-breaking changes
Auto-Approve Non-Breaking:
- When true: Non-breaking changes skip manual approval
- When false: All changes require manual approval
- Only effective when auto-grow is also enabled
- Requires locked to be false
Strict Validation:
- When true: Block imports that don’t match schema exactly
- When false: Allow type transformations and best-effort parsing
- Use strict mode for high-quality, well-structured data
Allow Transformations:
- When true: Apply configured type transformations automatically
- When false: Reject type mismatches without attempting transformation
- Enables flexible handling of common type variations
Max Schema Depth:
- Maximum nesting depth for nested objects in data
- Prevents excessively deep schemas that impact performance
- Typical values: 3-5 levels
- Higher values increase schema complexity and query costs
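Taken together, the schema flags interact: a locked schema overrides the auto-grow and auto-approve settings. A minimal sketch, assuming illustrative field names:

```typescript
// Illustrative schema configuration. Locked wins over the auto-* flags:
// auto-approval only applies when locked is false and autoGrow is true.
interface SchemaConfig {
  locked: boolean;                 // true: every schema change needs approval
  autoGrow: boolean;               // true: new optional fields may be added automatically
  autoApproveNonBreaking: boolean; // true: non-breaking changes skip manual approval
  strictValidation: boolean;       // true: reject rows that don't match the schema exactly
  allowTransformations: boolean;   // true: apply configured type transformations
  maxSchemaDepth: number;          // typical: 3-5
}

// Whether a detected non-breaking change can be applied without approval.
const canAutoApprove = (c: SchemaConfig): boolean =>
  !c.locked && c.autoGrow && c.autoApproveNonBreaking;
```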
Ingestion Transforms
Transform rules applied to incoming data before validation. Each transform has a type, relevant source/target fields, and an active toggle.
Transform Types:
Rename (`rename`):
- Moves a field from one path to another
- Specify `from` (source field path) and `to` (target field path)
- Use for standardizing field names across different data sources (e.g., “EventName” to “name”)
Date Parse (`date-parse`):
- Parses date strings from one format to another
- Specify `from` (source field), `inputFormat`, and `outputFormat`
- Supported input formats: `DD/MM/YYYY`, `MM/DD/YYYY`, `YYYY-MM-DD`, `DD-MM-YYYY`, `MM-DD-YYYY`, `DD.MM.YYYY`
- Supported output formats: `YYYY-MM-DD` (ISO, default), `DD/MM/YYYY`, `MM/DD/YYYY`
- Optional `timezone` field (e.g., “America/New_York”)
String Operation (`string-op`):
- Applies a string operation to a source field
- Specify `from` (source field) and `operation`
- Operations: `uppercase`, `lowercase`, `trim`, `replace` (with `pattern` and `replacement` fields), `expression` (custom safe expression using built-in functions)
- Expression functions include: `upper`, `lower`, `trim`, `concat`, `replace`, `substring`, `toNumber`, `parseDate`, `parseBool`, `round`, `floor`, `ceil`, `abs`, `len`, `ifEmpty`
Concatenate (`concatenate`):
- Joins multiple source fields into a single target field
- Specify `fromFields` (JSON array of source field paths), `to` (target field), and `separator` (default: space)
- Example: combine “firstName” + “lastName” into “fullName”
Split (`split`):
- Splits a single field by a delimiter into multiple target fields
- Specify `from` (source field), `delimiter` (default: comma), and `toFields` (JSON array of target field names)
- Example: split “full_name” into “first_name” and “last_name”
Common Fields:
Each transform rule also includes:
- `active`: Checkbox to disable a transform without deleting it (default: true)
- `autoDetected`: Whether the transform was suggested by auto-detection
- `confidence`: Confidence score (0-100) for auto-detected transforms
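Putting the rule types and common fields together, a dataset’s transform list might look like the following sketch (the stored shape may differ):

```typescript
// Illustrative transform rules; shapes mirror the fields described above.
const transforms = [
  { type: "rename", from: "EventName", to: "name", active: true },
  {
    type: "date-parse",
    from: "event_date",
    inputFormat: "DD/MM/YYYY",
    outputFormat: "YYYY-MM-DD",
    timezone: "America/New_York",
    active: true,
  },
  { type: "string-op", from: "category", operation: "lowercase", active: true },
  {
    type: "concatenate",
    fromFields: ["firstName", "lastName"],
    to: "fullName",
    separator: " ",
    active: true,
  },
  {
    type: "split",
    from: "full_name",
    delimiter: ",",
    toFields: ["first_name", "last_name"],
    active: false, // disabled without being deleted
    autoDetected: true,
    confidence: 82, // 0-100 for auto-detected rules
  },
];
```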
Field Mapping Overrides
Manual specification of field mappings when auto-detection isn’t sufficient.
Override Types:
Geocoding Field Overrides:
- Manually specify which field contains addresses
- Manually specify latitude and longitude field pairs
- Override auto-detection when field names are non-standard
Timestamp Field Overrides:
- Manually specify which field contains the event timestamp
- Override default priority (timestamp, date, datetime, etc.)
- Handle custom timestamp field names
Required Field Overrides:
- Force specific fields to be required
- Mark optional fields as required for validation
- Enforce data quality standards
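A combined sketch of these overrides, with illustrative field names (`locationPath` is the address mapping referenced under Geographic Field Detection below):

```typescript
// Illustrative field mapping overrides; names are not the exact schema.
const fieldMappingOverrides = {
  locationPath: "venue.address",           // field containing addresses for geocoding
  latitudePath: "coords.lat",              // explicit coordinate pair
  longitudePath: "coords.lng",
  timestampPath: "occurred_at",            // overrides the default timestamp priority
  requiredFields: ["name", "occurred_at"], // force these fields to be required
};
```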
Enum Detection Configuration
Controls how the system identifies enumerated (categorical) fields.
Detection Mode:
Count:
- Field is enum if unique values ≤ threshold
- Example: threshold=50 means fields with ≤50 unique values become enums
- Best for small datasets or when you know the expected enum size
Percentage:
- Field is enum if (unique values / total values) ≤ threshold
- Example: threshold=0.05 means fields with ≤5% unique values become enums
- Best for large datasets where absolute counts are misleading
Threshold Values:
- Count mode: Typical values 20-100 unique values
- Percentage mode: Typical values 0.01-0.10 (1%-10%)
- Higher values create more enums (more permissive)
- Lower values create fewer enums (more strict)
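Both modes reduce to a simple predicate, sketched below with illustrative names:

```typescript
type EnumDetection =
  | { mode: "count"; threshold: number }       // e.g. 50 unique values
  | { mode: "percentage"; threshold: number }; // e.g. 0.05 = 5%

// A field is treated as an enum when its uniqueness falls at or below the threshold.
function isEnumField(uniqueCount: number, totalCount: number, cfg: EnumDetection): boolean {
  return cfg.mode === "count"
    ? uniqueCount <= cfg.threshold
    : totalCount > 0 && uniqueCount / totalCount <= cfg.threshold;
}

// 40 unique values among 10,000 rows qualifies under both example configs.
isEnumField(40, 10_000, { mode: "count", threshold: 50 });        // true
isEnumField(40, 10_000, { mode: "percentage", threshold: 0.05 }); // true (0.4%)
```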
Geographic Field Detection
Controls automatic detection of location fields for geocoding.
Auto-Detect:
- When true: Automatically identify address and coordinate fields
- When false: Only use manual field mappings
- Recommended: true for initial imports, false after field verification
Manual Overrides:
Latitude Path:
- Specify exact field containing latitude values
- Overrides auto-detection
- Use when field name doesn’t match common patterns
Longitude Path:
- Specify exact field containing longitude values
- Must be specified if latitude path is specified
- Use when field name doesn’t match common patterns
Note that the `geoFieldDetection` group does not include an address/location path field. Location field mapping is configured separately in the `fieldMappingOverrides` group via `locationPath` (see Field Mapping Overrides).
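Auto-detection generally works by matching common field-name patterns; the following is a simplified sketch of the idea, not the actual detector:

```typescript
// Simplified name-based coordinate detection; the real detector may be richer.
const LAT_PATTERN = /^(lat|latitude)$/i;
const LNG_PATTERN = /^(lon|lng|long|longitude)$/i;

function detectCoordinateFields(fieldNames: string[]): { latitudePath?: string; longitudePath?: string } {
  const latitudePath = fieldNames.find((n) => LAT_PATTERN.test(n.split(".").pop() ?? n));
  const longitudePath = fieldNames.find((n) => LNG_PATTERN.test(n.split(".").pop() ?? n));
  // Only return a complete pair: latitude without longitude is not usable.
  return latitudePath && longitudePath ? { latitudePath, longitudePath } : {};
}

detectCoordinateFields(["name", "coords.lat", "coords.lng"]);
// => { latitudePath: "coords.lat", longitudePath: "coords.lng" }
```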
System Configuration
Global pipeline settings that affect all datasets and ingestion runs.
Batch Size Configuration
Controls how many rows are processed in each batch for various stages. Batch sizes are configured via `config/timetiles.yml` (under the `batchSizes` key) and fall back to hardcoded defaults when the YAML file is absent. There are no environment variable overrides for batch sizes.
Duplicate Analysis Batch Size (`batchSizes.duplicateAnalysis`):
- Default: 5,000 rows
- Memory-efficient for hash map operations
- Larger values: faster but more memory
- Smaller values: slower but less memory
Schema Detection Batch Size (`batchSizes.schemaDetection`):
- Default: 10,000 rows
- Larger batches for schema building efficiency
- Larger values: faster schema convergence
- Smaller values: more batches, slower convergence
Event Creation Batch Size (`batchSizes.eventCreation`):
- Default: 1,000 rows
- Balances throughput with transaction reliability
- Larger values: faster but higher transaction timeout risk
- Smaller values: slower but more reliable
Database Chunk Size (`batchSizes.databaseChunk`):
- Default: 1,000 records
- Batch size for bulk database operations
- Affects memory and transaction duration
Geocoding Processing:
- Processes ALL unique locations in one pass (not batched by rows)
- API request batching and rate limits are configured separately (typically 100 requests/minute)
- Extracts unique addresses/coordinates from entire file first
- Results cached in lookup map for all rows
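A sketch of the `batchSizes` section of `config/timetiles.yml` and a defaults-merging loader; this loader is illustrative, not the actual `getAppConfig()` implementation, and assumes the `js-yaml` package:

```typescript
// config/timetiles.yml (sketch):
//
// batchSizes:
//   duplicateAnalysis: 5000
//   schemaDetection: 10000
//   eventCreation: 1000
//   databaseChunk: 1000

import { readFileSync } from "node:fs";
import { load } from "js-yaml";

const DEFAULT_BATCH_SIZES = {
  duplicateAnalysis: 5_000,
  schemaDetection: 10_000,
  eventCreation: 1_000,
  databaseChunk: 1_000,
};

// Merge YAML overrides onto the hardcoded defaults; fall back entirely
// to the defaults when the file is absent (no environment variable overrides).
export function loadBatchSizes(path = "config/timetiles.yml") {
  try {
    const parsed = load(readFileSync(path, "utf8")) as {
      batchSizes?: Partial<typeof DEFAULT_BATCH_SIZES>;
    };
    return { ...DEFAULT_BATCH_SIZES, ...parsed?.batchSizes };
  } catch {
    return DEFAULT_BATCH_SIZES;
  }
}
```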
Geocoding
Geocoding providers are configured via the Payload CMS Settings global (`/dashboard/globals/settings`), not via config files. See the Self-Hosting Configuration docs for setup details.
Workers
Background job workers run as separate processes, one per queue. In production, each queue gets a dedicated Docker container started with `pnpm payload jobs:run --cron --queue <name>`. Concurrency is controlled by the number of containers deployed, not by a configuration setting. Retry behavior is handled by Payload’s built-in workflow system.
Configuration Changes
Runtime Changes
Most dataset configuration changes take effect immediately for new imports:
- ID strategy changes
- Schema configuration changes
- Type transformation changes
- Field mapping overrides
Requires Restart
- Batch size changes in `config/timetiles.yml` (loaded once at startup via `getAppConfig()`)
Migration Required
- Changing ID strategy on datasets with existing events
- Significantly changing the deduplication strategy