Configuration
The TimeTiles data processing pipeline provides extensive configuration options at both the dataset and system levels. This document explains all available configuration options and their effects on pipeline behavior.
Dataset Configuration
Each dataset can configure its processing behavior independently. Configuration is stored in the datasets collection and affects all imports for that dataset.
ID Strategy Configuration
Controls how unique identifiers are generated and how duplicates are detected.
Strategy Types:
External ID:
- Uses a specific field from the data as the unique identifier
- Best for data with explicit, reliable IDs (UUID, database ID, etc.)
- Requires specifying the field path (e.g., “event_id”, “data.uuid”)
- Fastest strategy for duplicate detection
Computed Hash:
- Generates ID by hashing a combination of specified fields
- Best for data without explicit IDs but with identifying field combinations
- Requires specifying which fields to include in hash
- More flexible than external ID, but slightly slower
Auto-Increment:
- Automatically generates sequential IDs
- Best for datasets where duplicate detection isn't needed
- Simplest strategy, no configuration needed
- Cannot detect duplicates (all rows considered unique)
Hybrid:
- Tries external ID first, falls back to computed hash if external ID is missing
- Best for datasets with partial ID coverage
- Combines reliability of external ID with flexibility of computed hash
- Requires configuring both strategies
Duplicate Strategy:
Controls what happens when duplicates are detected:
- Skip: Ignore duplicate rows (most common)
- Update: Update existing event with new data
- Version: Create new version of existing event
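As a rough sketch, the ID strategy for a dataset might be captured in a configuration object along these lines. The property names (type, externalIdPath, hashFields, duplicateStrategy) are illustrative assumptions, not the exact TimeTiles collection schema:

```typescript
// Illustrative sketch only: property names are assumptions, not the exact TimeTiles schema.
type DuplicateStrategy = "skip" | "update" | "version";

interface IdStrategyConfig {
  type: "external" | "computed-hash" | "auto-increment" | "hybrid";
  externalIdPath?: string; // e.g. "event_id" or "data.uuid" (external / hybrid)
  hashFields?: string[];   // fields combined into the hash (computed-hash / hybrid)
  duplicateStrategy: DuplicateStrategy;
}

// Hybrid: try the external ID first, fall back to a hash of identifying fields.
const idStrategy: IdStrategyConfig = {
  type: "hybrid",
  externalIdPath: "event_id",
  hashFields: ["title", "date", "location"],
  duplicateStrategy: "skip",
};
```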
Deduplication Configuration
Controls early duplicate detection in Stage 2 (Analyze Duplicates).
Enabled/Disabled:
- Enable to reduce processing volume and prevent duplicate events
- Disable for datasets where every row should create an event
Strategy:
- Same options as ID strategy (external-id, computed-hash, content-hash, hybrid)
- Should typically match ID strategy for consistency
- Can differ if deduplication logic differs from ID generation
Field Specification:
- For computed-hash: specify which fields constitute a duplicate
- Can include nested field paths (e.g., “data.event.id”)
- Order doesn’t matter (hashing is deterministic)
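A hypothetical deduplication block, assuming property names such as enabled, strategy, and fields (the real dataset fields may be named differently), could look like this:

```typescript
// Hypothetical shape; the actual fields in the datasets collection may differ.
const deduplication = {
  enabled: true,
  strategy: "computed-hash" as const,              // usually mirrors the ID strategy
  fields: ["data.event.id", "title", "startDate"], // order is irrelevant: hashing is deterministic
};
```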
Schema Configuration
Controls schema behavior, validation, and evolution.
Locked:
- When true: Require approval for ALL schema changes (even non-breaking)
- When false: Allow auto-approval based on change classification
- Use for production datasets with strict governance requirements
Auto-Grow:
- When true: Allow schema to grow with new optional fields automatically
- When false: Require approval for any schema changes
- Prerequisite for auto-approval of non-breaking changes
Auto-Approve Non-Breaking:
- When true: Non-breaking changes skip manual approval
- When false: All changes require manual approval
- Only effective when auto-grow is also enabled
- Requires locked to be false
Strict Validation:
- When true: Block imports that don’t match schema exactly
- When false: Allow type transformations and best-effort parsing
- Use strict mode for high-quality, well-structured data
Allow Transformations:
- When true: Apply configured type transformations automatically
- When false: Reject type mismatches without attempting transformation
- Enables flexible handling of common type variations
Max Schema Depth:
- Maximum nesting depth for nested objects in data
- Prevents excessively deep schemas that impact performance
- Typical values: 3-5 levels
- Higher values increase schema complexity and query costs
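Taken together, the schema options above could be expressed as a flags object like the following sketch (flag names simply mirror the descriptions above and are not a guaranteed API):

```typescript
// Illustrative only: flag names mirror the options described above.
const schemaConfig = {
  locked: false,                // false allows auto-approval based on change classification
  autoGrow: true,               // new optional fields can be added automatically
  autoApproveNonBreaking: true, // only effective when autoGrow is true and locked is false
  strictValidation: false,      // allow type transformations and best-effort parsing
  allowTransformations: true,
  maxSchemaDepth: 3,            // typical range: 3-5 levels
};
```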
Type Transformations
Automatic conversions between data types to handle common mismatches.
Transformation Configuration:
Each transformation specifies:
- Field Path: Which field to transform (supports nested paths)
- From Type: Expected incoming type
- To Type: Desired target type
- Strategy: How to perform the transformation
- Enabled: Whether this transformation is active
Transformation Strategies:
Parse:
- Smart parsing with type inference
- Examples: “123” → 123, “true” → true, “2024-01-15” → Date
- Best for well-formatted data with predictable patterns
- Most commonly used strategy
Cast:
- Simple type coercion
- Examples: 123 → “123”, true → “true”
- Best for simple type conversions
- Faster than parsing but less intelligent
Custom:
- User-defined transformation function
- Handles complex cases (e.g., European number formats, custom date formats)
- Most flexible but requires custom code
- Use when parse/cast aren’t sufficient
Reject:
- Fail validation on type mismatch
- Strict type checking with no forgiveness
- Use for high-quality data where mismatches indicate errors
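A transformation list might look like the following sketch; the interface and strategy literals are assumptions used for illustration:

```typescript
// Sketch of a transformation list; property names and literals are illustrative.
type TransformStrategy = "parse" | "cast" | "custom" | "reject";

interface TypeTransformation {
  fieldPath: string; // supports nested paths, e.g. "metrics.count"
  fromType: string;
  toType: string;
  strategy: TransformStrategy;
  enabled: boolean;
}

const typeTransformations: TypeTransformation[] = [
  { fieldPath: "attendees", fromType: "string", toType: "number", strategy: "parse", enabled: true },
  { fieldPath: "isPublic", fromType: "string", toType: "boolean", strategy: "parse", enabled: true },
  { fieldPath: "legacyId", fromType: "number", toType: "string", strategy: "cast", enabled: true },
];
```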
Import Transforms
Field renaming and restructuring during import.
Transform Configuration:
Each transform specifies:
- Source Field: Original field name in import file
- Target Field: Desired field name in events
- Transform Type: Rename, restructure, or extract
Use Cases:
- Standardizing field names across different data sources
- Extracting nested fields to top level
- Combining multiple source fields into one
- Splitting a single field into multiple fields
Example Scenarios:
- Rename “EventName” to “name” for consistency
- Extract “address.city” to “city”
- Combine “firstName” + “lastName” to “fullName”
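For instance, the rename and extract scenarios above could be declared roughly as follows (property names are illustrative, not the actual transform format):

```typescript
// Hypothetical transform entries matching the example scenarios above.
const importTransforms = [
  { sourceField: "EventName", targetField: "name", type: "rename" as const },
  { sourceField: "address.city", targetField: "city", type: "extract" as const },
];
```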
Field Mapping Overrides
Manual specification of field mappings when auto-detection isn’t sufficient.
Override Types:
Geocoding Field Overrides:
- Manually specify which field contains addresses
- Manually specify latitude and longitude field pairs
- Override auto-detection when field names are non-standard
Timestamp Field Overrides:
- Manually specify which field contains the event timestamp
- Override default priority (timestamp, date, datetime, etc.)
- Handle custom timestamp field names
Required Field Overrides:
- Force specific fields to be required
- Mark optional fields as required for validation
- Enforce data quality standards
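A combined overrides object might look like this sketch (the actual override keys may be named differently in TimeTiles):

```typescript
// Illustrative overrides; key names are assumptions.
const fieldMappingOverrides = {
  geocoding: {
    addressPath: "venue.full_address",
    latitudePath: "coords.lat",
    longitudePath: "coords.lng",
  },
  timestampPath: "occurred_at",            // takes precedence over the default priority list
  requiredFields: ["name", "occurred_at"], // force these fields to be required
};
```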
Enum Detection Configuration
Controls how the system identifies enumerated (categorical) fields.
Detection Mode:
Count:
- Field is enum if unique values ≤ threshold
- Example: threshold=50 means fields with ≤50 unique values become enums
- Best for small datasets or when you know the expected enum size
Percentage:
- Field is enum if (unique values / total values) ≤ threshold
- Example: threshold=0.05 means fields with ≤5% unique values become enums
- Best for large datasets where absolute counts are misleading
Threshold Values:
- Count mode: Typical values 20-100 unique values
- Percentage mode: Typical values 0.01-0.10 (1%-10%)
- Higher values create more enums (more permissive)
- Lower values create fewer enums (more strict)
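The decision rule for both modes can be summarized in a small sketch (not the actual implementation):

```typescript
// Sketch of the enum detection rule described above.
function isEnumField(
  uniqueCount: number,
  totalCount: number,
  mode: "count" | "percentage",
  threshold: number,
): boolean {
  return mode === "count"
    ? uniqueCount <= threshold               // e.g. threshold = 50 unique values
    : uniqueCount / totalCount <= threshold; // e.g. threshold = 0.05 (5%)
}
```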
Geographic Field Detection
Controls automatic detection of location fields for geocoding.
Auto-Detect:
- When true: Automatically identify address and coordinate fields
- When false: Only use manual field mappings
- Recommended: true for initial imports, false after field verification
Manual Overrides:
Latitude Path:
- Specify exact field containing latitude values
- Overrides auto-detection
- Use when field name doesn’t match common patterns
Longitude Path:
- Specify exact field containing longitude values
- Must be specified if latitude path is specified
- Use when field name doesn’t match common patterns
Address Path:
- Specify exact field containing full address strings
- Overrides auto-detection
- Use when field name doesn’t match common patterns (“venue”, “place”, etc.)
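As an example, a dataset with verified but non-standard location fields might be configured like this (key names are illustrative):

```typescript
// Hypothetical geographic detection settings.
const geoDetection = {
  autoDetect: false,             // rely on manual mappings after fields are verified
  latitudePath: "position.lat",
  longitudePath: "position.lng", // must accompany latitudePath
  addressPath: "venue",          // non-standard name that auto-detection would miss
};
```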
Processing Limits
Controls resource usage and prevents runaway processing.
Max Concurrent Jobs:
- Maximum number of import-jobs for this dataset running simultaneously
- Prevents resource exhaustion when processing multiple files
- Typical values: 1-5 depending on dataset complexity and size
Processing Timeout:
- Maximum time (milliseconds) for entire import to complete
- Timeout stops processing and marks import as failed
- Prevents hung imports from consuming resources indefinitely
- Typical values: 3600000 (1 hour) to 86400000 (24 hours)
Max File Size:
- Maximum file size (bytes) for imports to this dataset
- Files exceeding limit are rejected at upload
- Prevents memory exhaustion from extremely large files
- Typical values: 104857600 (100MB) to 1073741824 (1GB)
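A sketch of per-dataset limits using values from the typical ranges above (property names are illustrative):

```typescript
// Illustrative per-dataset processing limits.
const processingLimits = {
  maxConcurrentJobs: 3,
  processingTimeout: 6 * 60 * 60 * 1000, // 6 hours, in milliseconds
  maxFileSize: 500 * 1024 * 1024,        // 500 MB, in bytes
};
```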
System Configuration
Global pipeline settings that affect all datasets and imports.
File Processing Configuration
Supported Formats:
- List of allowed file extensions (e.g., .csv, .xlsx, .xls)
- Files with other extensions are rejected at upload
- Standard configuration: .csv, .xlsx, .xls
Max File Size:
- Global maximum file size (can be overridden per dataset)
- System-wide safety limit
- Typical value: 100MB to 1GB depending on infrastructure
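For illustration, the global file processing settings might be represented like this (the shape is an assumption):

```typescript
// Illustrative file processing settings.
const fileProcessing = {
  supportedFormats: [".csv", ".xlsx", ".xls"],
  maxFileSize: 1024 * 1024 * 1024, // 1 GB global ceiling; datasets may set a lower limit
};
```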
Batch Size Configuration
Controls how many rows are processed in each batch for various stages.
Duplicate Analysis Batch Size:
- Default: 5,000 rows
- Environment variable: BATCH_SIZE_DUPLICATE_ANALYSIS
- Memory-efficient for hash map operations
- Larger values: faster but more memory
- Smaller values: slower but less memory
Schema Detection Batch Size:
- Default: 10,000 rows
- Environment variable: BATCH_SIZE_SCHEMA_DETECTION
- Larger batches for schema building efficiency
- Larger values: faster schema convergence
- Smaller values: more batches, slower convergence
Event Creation Batch Size:
- Default: 1,000 rows
- Environment variable: BATCH_SIZE_EVENT_CREATION
- Balances throughput with transaction reliability
- Larger values: faster but higher transaction timeout risk
- Smaller values: slower but more reliable
Database Chunk Size:
- Default: 1,000 records
- Environment variable: BATCH_SIZE_DATABASE_CHUNK
- Batch size for bulk database operations
- Affects memory and transaction duration
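For example, the four batch sizes could be read from the documented environment variables with their defaults, along the lines of this sketch (the actual parsing code in TimeTiles may differ):

```typescript
// Sketch: read the documented environment variables, falling back to the defaults above.
const intFromEnv = (name: string, fallback: number): number => {
  const parsed = Number.parseInt(process.env[name] ?? "", 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
};

const batchSizes = {
  duplicateAnalysis: intFromEnv("BATCH_SIZE_DUPLICATE_ANALYSIS", 5_000),
  schemaDetection: intFromEnv("BATCH_SIZE_SCHEMA_DETECTION", 10_000),
  eventCreation: intFromEnv("BATCH_SIZE_EVENT_CREATION", 1_000),
  databaseChunk: intFromEnv("BATCH_SIZE_DATABASE_CHUNK", 1_000),
};
```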
Geocoding Processing:
- Processes ALL unique locations in one pass (not batched by rows)
- API batch size is configured separately (typically limited to 100 requests/minute)
- Extracts unique addresses/coordinates from entire file first
- Results cached in lookup map for all rows
Concurrency Configuration
Max Concurrent Imports:
- Maximum number of import files processing simultaneously across all datasets
- Prevents system overload from too many concurrent operations
- Typical values: 10-50 depending on infrastructure
Job Worker Count:
- Number of background job workers processing pipeline stages
- Higher values: more parallelism, more resource usage
- Lower values: less parallelism, more queuing
- Typical values: 4-16 workers
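As a sketch, these two settings might be grouped like this (names are illustrative):

```typescript
// Hypothetical concurrency settings.
const concurrency = {
  maxConcurrentImports: 20, // across all datasets
  jobWorkers: 8,            // background workers processing pipeline stages
};
```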
Retry Configuration
Retry Attempts:
- Number of times to retry failed operations
- Applies to transient failures (network errors, temporary database issues)
- Typical values: 3-5 retries
Retry Backoff:
- Strategy for delaying retries
- Exponential: Delay doubles each retry (1s, 2s, 4s, 8s, …)
- Linear: Delay increases linearly (1s, 2s, 3s, 4s, …)
- Constant: Same delay each retry
- Recommended: Exponential for most scenarios
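The delay schedules described above can be sketched as a small helper (illustrative only, using a 1-second base delay):

```typescript
// Sketch of the retry delay schedules described above.
function retryDelayMs(
  attempt: number, // 1-based retry attempt
  strategy: "exponential" | "linear" | "constant",
  baseMs = 1_000,
): number {
  switch (strategy) {
    case "exponential":
      return baseMs * 2 ** (attempt - 1); // 1s, 2s, 4s, 8s, ...
    case "linear":
      return baseMs * attempt; // 1s, 2s, 3s, 4s, ...
    case "constant":
      return baseMs; // same delay every time
  }
}
```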
Geocoding API Configuration
API Provider:
- Which geocoding service to use (Nominatim, Google Maps, etc.)
- Affects accuracy, cost, and rate limits
API Key:
- Authentication key for geocoding service
- Required for most commercial providers
- Free services (Nominatim) may not require key
Rate Limit:
- Maximum requests per minute/second
- Must match API provider’s limit
- Typical values: 50-2500 requests/minute depending on plan
Timeout:
- Maximum time to wait for geocoding response
- Longer timeout: more patient, slower on failures
- Shorter timeout: faster failure detection, may miss slow responses
- Typical values: 5000-30000 milliseconds (5-30 seconds)
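A hypothetical geocoding configuration, with placeholder provider and environment variable names, might look like this:

```typescript
// Hypothetical geocoding settings; provider and env var names are placeholders.
const geocodingConfig = {
  provider: "nominatim",                 // or a commercial provider such as Google Maps
  apiKey: process.env.GEOCODING_API_KEY, // placeholder; free services like Nominatim may not need one
  rateLimitPerMinute: 60,                // must not exceed the provider's published limit
  timeoutMs: 10_000,                     // 10 seconds
};
```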
Storage Configuration
Import File Directory:
- Path where uploaded files are stored
- Must be persistent across deployments
- Should be backed up regularly
Temp Directory:
- Path for temporary files during processing
- Can be ephemeral (cleared on restart)
- Should have ample space for the largest expected file
Monitoring Configuration
Log Level:
- Verbosity of logging (error, warn, info, debug, trace)
- More verbose levels (debug, trace): more detail, greater storage and performance impact
- Production: info or warn
- Development: debug or trace
Metrics Collection:
- Enable/disable performance metrics collection
- Tracks processing times, throughput, success rates
- Minimal performance impact when enabled
Error Reporting:
- Integration with error tracking services (Sentry, etc.)
- Automatic error notification and aggregation
- Helps identify systemic issues quickly
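For illustration, storage and monitoring settings could be grouped as follows (paths, names, and the Sentry DSN variable are placeholders):

```typescript
// Illustrative storage and monitoring settings; all values are placeholders.
const operations = {
  importFileDir: "/var/lib/timetiles/imports", // persistent and backed up
  tempDir: "/tmp/timetiles",                   // ephemeral scratch space
  logLevel: "info" as const,                   // production: info or warn
  metricsEnabled: true,
  errorReporting: { provider: "sentry", dsn: process.env.SENTRY_DSN }, // hypothetical wiring
};
```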
Configuration Best Practices
Development Environment
- Stricter validation to catch issues early
- Smaller batch sizes for faster iteration
- Verbose logging for debugging
- Shorter timeouts to fail fast
- Lower concurrency to avoid resource competition
Staging Environment
- Production-like configuration
- Medium batch sizes
- Moderate logging
- Production timeouts
- Production concurrency levels
- Test auto-approval and transformations
Production Environment
- Optimized batch sizes for your infrastructure
- Warn- or error-level logging to reduce verbosity
- Generous timeouts (allow large imports to complete)
- Appropriate concurrency (based on capacity testing)
- Enable metrics and monitoring
- Careful approval settings (prefer manual review initially)
Per-Dataset Tuning
High-Volume Datasets:
- Larger batch sizes for throughput
- Higher concurrency limits
- Longer processing timeouts
- Auto-approval for non-breaking changes
High-Quality Datasets:
- Strict validation enabled
- Locked schemas
- Manual approval required
- Reject transformation strategy
Experimental Datasets:
- Auto-grow enabled
- Allow transformations
- Parse transformation strategy
- Auto-approve non-breaking
Configuration Changes
Runtime Changes
Most dataset configuration changes take effect immediately for new imports:
- ID strategy changes
- Schema configuration changes
- Type transformation changes
- Field mapping overrides
Requires Restart
Some system configuration changes require application restart:
- Batch size environment variables
- Worker count changes
- API keys and credentials
- Storage paths
Migration Required
Some configuration changes may require data migration:
- Changing ID strategy on datasets with existing events
- Modifying schema depth limits
- Changing deduplication strategy significantly
Recommended Starting Configuration
For new TimeTiles installations, start with these conservative settings:
Dataset Defaults:
- ID Strategy: Hybrid (external ID first, computed-hash fallback)
- Schema: Unlocked, Auto-grow enabled, Auto-approve non-breaking
- Validation: Non-strict, Allow transformations
- Transformations: Parse strategy
- Max Schema Depth: 3
System Defaults:
- Batch Sizes: Defaults (5k duplicate analysis, 10k schema detection, 1k event creation)
- Max Concurrent Imports: 10
- Job Workers: 8
- Retry Attempts: 3
- Retry Backoff: Exponential
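These starting values, consolidated into a single sketch (the object shape is illustrative, not the actual configuration format):

```typescript
// Consolidated view of the recommended starting values; shape is illustrative.
const recommendedDefaults = {
  dataset: {
    idStrategy: "hybrid", // external ID first, computed-hash fallback
    schema: { locked: false, autoGrow: true, autoApproveNonBreaking: true, maxSchemaDepth: 3 },
    validation: { strict: false, allowTransformations: true, defaultStrategy: "parse" },
  },
  system: {
    batchSizes: { duplicateAnalysis: 5_000, schemaDetection: 10_000, eventCreation: 1_000 },
    maxConcurrentImports: 10,
    jobWorkers: 8,
    retry: { attempts: 3, backoff: "exponential" },
  },
};
```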
Adjust these settings based on your specific needs, infrastructure capacity, and data quality requirements.