Best Practices
This guide provides recommendations for different stakeholders working with the TimeTiles data processing pipeline.
For Data Providers
These practices help ensure smooth imports and high-quality event data.
Data Structure Consistency
Keep structure consistent across imports (see the example below):
- Use the same field names in every import
- Maintain consistent data types (don’t switch between “123” and 123)
- Keep nesting structure stable
- Use consistent date formats
Why this matters:
- Reduces schema conflicts and approval friction
- Enables auto-approval of imports
- Makes data more queryable and reliable
- Reduces processing time and errors
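To make these rules concrete, here is a minimal sketch of a stable record shape across two imports. The field names, types, and values are illustrative assumptions, not a schema TimeTiles requires:

```typescript
// Illustrative event shape -- same field names, types, and date format in every import.
interface EventRow {
  external_id: string;
  event_name: string;
  event_date: string; // ISO 8601, e.g. "2024-03-15"
  attendance: number; // always a number, never the string "123"
  latitude: number;
  longitude: number;
}

// Two rows from consecutive imports following the same structure.
const marchRow: EventRow = {
  external_id: "evt-0001",
  event_name: "River Cleanup",
  event_date: "2024-03-15",
  attendance: 120,
  latitude: 52.52,
  longitude: 13.405,
};

const aprilRow: EventRow = {
  external_id: "evt-0002",
  event_name: "Tree Planting",
  event_date: "2024-04-20",
  attendance: 85,
  latitude: 52.5,
  longitude: 13.39,
};
```

Because both rows share one shape, the inferred schema stays stable and later imports are good candidates for auto-approval.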
Meaningful Field Names
Use descriptive, standard field names:
- Prefer “event_name” over cryptic names like “n” or ambiguous ones like “title”
- Use “event_date” over “d” or “when”
- Follow common conventions (latitude, longitude, address)
- Avoid abbreviations that aren’t widely understood
Why this matters:
- Improves auto-detection of geocoding fields
- Makes data self-documenting
- Reduces need for manual field mappings
- Easier for others to understand your data
Include Unique Identifiers
Provide stable, unique IDs when possible:
- Include explicit ID fields (uuid, event_id, external_id)
- Ensure IDs are truly unique across all your data
- Keep IDs stable (don’t change them between imports)
- Use standard ID formats (UUID, sequential numbers)
Why this matters:
- Enables efficient deduplication
- Allows update workflows for changing data
- Faster processing (matching on an external ID is the fastest deduplication strategy; see the sketch below)
- Prevents duplicate events
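As a rough illustration of why stable IDs pay off, deduplication can reduce to a set-membership check when every row carries an external_id. The helper below is a sketch, not TimeTiles' actual deduplication code:

```typescript
// Illustrative only: split incoming rows by whether their external_id was seen before.
interface IncomingRow {
  external_id: string;
  [field: string]: unknown;
}

function splitNewAndExisting(rows: IncomingRow[], knownIds: Set<string>) {
  const fresh: IncomingRow[] = [];
  const updates: IncomingRow[] = [];
  for (const row of rows) {
    // A stable, unique external_id makes this a constant-time membership check.
    if (knownIds.has(row.external_id)) {
      updates.push(row); // already imported: candidate for an update workflow
    } else {
      fresh.push(row); // never seen: create a new event
    }
  }
  return { fresh, updates };
}
```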
Type Consistency
Keep field types consistent:
- Numbers should always be numbers, not sometimes strings
- Dates should use a consistent format
- Booleans should be true/false rather than yes/no or 1/0; if another convention is unavoidable, apply it consistently
- Empty values should be null, not empty strings or “N/A” (a cleanup sketch follows this section)
Why this matters:
- Avoids transformation overhead
- Reduces schema conflicts
- Improves data quality
- Faster processing
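If your source system mixes representations, a small cleanup pass before export can enforce these rules. This is a provider-side sketch that assumes yes/no strings are meant as booleans and “N/A” marks missing data; adjust it to your own conventions:

```typescript
// Provider-side cleanup sketch: normalize mixed representations before exporting.
function normalizeValue(value: unknown): unknown {
  if (value === undefined || value === "" || value === "N/A") return null; // missing data -> null
  if (value === "yes" || value === "true") return true; // assumes these strings always mean booleans
  if (value === "no" || value === "false") return false;
  if (typeof value === "string" && /^-?\d+(\.\d+)?$/.test(value.trim())) {
    // "123" -> 123; skip this rule for fields where leading zeros matter (e.g. postal codes)
    return Number(value);
  }
  return value;
}

function normalizeRow(row: Record<string, unknown>): Record<string, unknown> {
  const cleaned: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(row)) {
    cleaned[key] = normalizeValue(value);
  }
  return cleaned;
}
```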
Geographic Data Standards
Use standard field names and formats (see the example below):
- Use “latitude” and “longitude” for coordinates
- Use “address” for full address strings
- Include separate fields (street, city, state, country) when possible
- Ensure coordinates are in decimal degrees (-90 to 90, -180 to 180)
Why this matters:
- Auto-detection works reliably
- Geocoding succeeds more often
- Reduces manual field mapping
- Higher geocoding confidence
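A well-formed geographic record, plus a quick bounds check you can run before export, might look like this. The field names follow the conventions above; the ranges are standard decimal-degree bounds:

```typescript
// Illustrative record using the conventional geographic field names.
const sampleRecord = {
  event_name: "Harbor Festival",
  latitude: 47.6062,    // decimal degrees, -90 to 90
  longitude: -122.3321, // decimal degrees, -180 to 180
  address: "801 Alaskan Way, Seattle, WA 98104, USA",
  city: "Seattle",
  state: "WA",
  country: "USA",
};

// Quick sanity check: are the coordinates within decimal-degree bounds?
function hasValidCoordinates(row: { latitude: number; longitude: number }): boolean {
  return (
    Number.isFinite(row.latitude) &&
    Number.isFinite(row.longitude) &&
    Math.abs(row.latitude) <= 90 &&
    Math.abs(row.longitude) <= 180
  );
}

console.log(hasValidCoordinates(sampleRecord)); // true
```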
Data Validation
Validate data before importing:
- Check for required fields
- Verify data types match expected schema
- Remove obviously invalid rows
- Test a small sample first (see the validation sketch below)
Why this matters:
- Reduces row-level errors
- Faster approval process
- Higher quality final events
- Less troubleshooting needed
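A lightweight pre-flight check along these lines catches most problems before upload. The required fields and expected types below are assumptions for illustration; adapt them to your dataset:

```typescript
// Pre-import sanity check sketch: validate a small sample of rows before uploading.
type Row = Record<string, unknown>;

const REQUIRED_FIELDS = ["event_name", "event_date", "latitude", "longitude"]; // assumed
const EXPECTED_TYPES: Record<string, string> = { latitude: "number", longitude: "number" };

function validateRow(row: Row, index: number): string[] {
  const problems: string[] = [];
  for (const field of REQUIRED_FIELDS) {
    const value = row[field];
    if (value === undefined || value === null || value === "") {
      problems.push(`row ${index}: missing required field "${field}"`);
    }
  }
  for (const [field, expected] of Object.entries(EXPECTED_TYPES)) {
    if (row[field] !== undefined && typeof row[field] !== expected) {
      problems.push(`row ${index}: "${field}" should be ${expected}, got ${typeof row[field]}`);
    }
  }
  return problems;
}

// Test a small sample first; if it is clean, validate (or import) the rest.
function validateSample(rows: Row[], sampleSize = 100): string[] {
  return rows.slice(0, sampleSize).flatMap(validateRow);
}
```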
For System Administrators
These practices help maintain a healthy, performant pipeline.
Monitor Schema Changes
Check pending approvals regularly:
- Review pending approvals daily or weekly
- Understand why schemas are changing
- Communicate with data providers about changes
- Document approval decisions for future reference
Why this matters:
- Prevents import backlogs
- Catches data quality issues early
- Maintains data governance
- Builds institutional knowledge
Configure Transformations Proactively
Set up rules for known type mismatches:
- Add transformations for common patterns (European numbers, custom dates; see the sketch below)
- Document why transformations were added
- Test transformations with sample data
- Review transformation usage periodically
Why this matters:
- Reduces manual intervention
- Enables auto-approval
- Consistent handling of type variations
- Faster imports
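The exact way transformations are registered is specific to your TimeTiles configuration; the sketch below only illustrates the two patterns named above, parsing European-formatted numbers and DD.MM.YYYY dates, so there is something concrete to test against sample data:

```typescript
// Illustrative transformation logic -- the patterns worth covering,
// not TimeTiles' actual transformation configuration format.

// "1.234,56" (European grouping and decimal separator) -> 1234.56
function parseEuropeanNumber(value: string): number | null {
  const normalized = value.trim().replace(/\./g, "").replace(",", ".");
  const parsed = Number(normalized);
  return normalized !== "" && Number.isFinite(parsed) ? parsed : null;
}

// "31.12.2024" (DD.MM.YYYY) -> "2024-12-31" (ISO 8601)
function parseDottedDate(value: string): string | null {
  const match = /^(\d{1,2})\.(\d{1,2})\.(\d{4})$/.exec(value.trim());
  if (!match) return null;
  const [, day, month, year] = match;
  return `${year}-${month.padStart(2, "0")}-${day.padStart(2, "0")}`;
}
```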
Monitor Field Growth
Watch for excessive field proliferation:
- Review new fields being added
- Question fields with very low usage
- Consolidate similar fields when possible
- Set alerts for schemas exceeding 100 fields (see the monitoring sketch below)
Why this matters:
- Large schemas impact query performance
- Too many fields indicate data quality issues
- Schema complexity makes maintenance harder
- Storage costs increase
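One low-effort way to watch field growth is to count leaf fields (including nested ones) in a representative record and warn past your alert threshold. This is a monitoring sketch, not a built-in feature:

```typescript
// Monitoring sketch: count leaf fields in a record, including nested objects.
function listLeafFields(value: unknown, prefix = ""): string[] {
  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return [prefix]; // primitives and arrays count as a single leaf field
  }
  return Object.entries(value as Record<string, unknown>).flatMap(([key, child]) =>
    listLeafFields(child, prefix ? `${prefix}.${key}` : key),
  );
}

const FIELD_ALERT_THRESHOLD = 100;

function warnIfTooWide(sampleRecord: Record<string, unknown>): void {
  const fields = listLeafFields(sampleRecord);
  if (fields.length > FIELD_ALERT_THRESHOLD) {
    console.warn(`schema has ${fields.length} fields (alert threshold: ${FIELD_ALERT_THRESHOLD})`);
  }
}
```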
Document Schema Decisions
Record why changes were approved or rejected:
- Note the approval reason on the import-job record
- Maintain change log for datasets
- Communicate decisions to data providers
- Review decisions periodically for patterns
Why this matters:
- Builds institutional knowledge
- Helps train new administrators
- Provides audit trail for compliance
- Identifies systemic issues
Set Appropriate Batch Sizes
Tune batch sizes based on infrastructure:
- Monitor memory usage during imports
- Test different batch sizes in staging
- Adjust based on file sizes and complexity
- Document why specific sizes were chosen
Why this matters:
- Prevents memory exhaustion
- Optimizes throughput
- Balances performance and reliability
- The right size is system-specific (there is no one-size-fits-all value)
Regular Maintenance
Keep the pipeline healthy:
- Archive old import-jobs periodically
- Clean up failed imports
- Review and update geocoding API keys
- Monitor disk space for import files
- Update dependencies and security patches
Why this matters:
- Prevents performance degradation
- Reduces storage costs
- Maintains security
- Catches issues before they become critical
For Large Datasets
Special considerations when processing large files or high-volume imports.
Monitor Field Count
Datasets with >1000 fields may impact performance:
- Review schema regularly for bloat
- Question fields with null rate >95% (see the sketch below)
- Consider restructuring extremely wide datasets
- Use nested objects to group related fields
Why this matters:
- Query performance degrades with field count
- Storage efficiency decreases
- Schema becomes unmanageable
- Indicates possible data quality issues
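One way to spot bloat in a wide dataset is to measure the null rate per field over a sample and flag fields that are almost always empty. A rough sketch using the 95% figure from the list above:

```typescript
// Sketch: flag fields whose null rate in a sample exceeds a threshold (default 95%).
type SampleRow = Record<string, unknown>;

function nullRates(rows: SampleRow[]): Map<string, number> {
  // Collect the union of field names first so fields absent from a row count as null.
  const fields = new Set<string>();
  for (const row of rows) for (const key of Object.keys(row)) fields.add(key);

  const rates = new Map<string, number>();
  for (const field of fields) {
    const nulls = rows.filter((row) => {
      const value = row[field];
      return value === undefined || value === null || value === "";
    }).length;
    rates.set(field, nulls / rows.length);
  }
  return rates;
}

function mostlyEmptyFields(rows: SampleRow[], threshold = 0.95): string[] {
  return [...nullRates(rows)]
    .filter(([, rate]) => rate > threshold)
    .map(([field]) => field);
}
```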
Set Appropriate Thresholds
Configure enum detection for your data patterns (see the sketch below):
- Use count mode for small datasets
- Use percentage mode for large datasets
- Test thresholds with sample data
- Adjust based on actual field cardinality
Why this matters:
- Incorrect thresholds create unwieldy enums
- Too strict: misses categorization opportunities
- Too loose: creates enums with thousands of values
- Impacts query performance and UX
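The intent behind the two modes can be shown with a small sketch: count mode treats a field as an enum when it has at most a fixed number of distinct values, percentage mode when distinct values stay under a fraction of total rows. The thresholds below are placeholders, not TimeTiles defaults:

```typescript
// Sketch of the two enum-detection strategies; the thresholds are placeholders.
function isEnumByCount(values: unknown[], maxDistinct = 50): boolean {
  return new Set(values).size <= maxDistinct;
}

function isEnumByPercentage(values: unknown[], maxDistinctRatio = 0.01): boolean {
  if (values.length === 0) return false;
  return new Set(values).size / values.length <= maxDistinctRatio;
}
```

Count mode fits small datasets, where a percentage would be too coarse; percentage mode scales better, but note that 1% of a million rows still allows 10,000 distinct values, which is why thresholds should be checked against the actual cardinality of your fields.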
Use Progressive Import
Let schema build incrementally for large files:
- Don’t force immediate completion
- Allow processing to pause between batches
- Monitor memory during processing
- Accept that large imports take time
Why this matters:
- Prevents memory exhaustion
- Allows recovery from interruptions
- More reliable for very large files
- System remains responsive
Plan for Geocoding
Large address datasets have special considerations:
- Account for API rate limits (large datasets may take hours to geocode)
- Budget for API costs (most providers charge per request)
- Consider caching geocoding results (see the sketch below)
- Pre-geocode data when possible
Why this matters:
- Geocoding is the slowest stage for large datasets
- API costs can be significant
- Rate limits may slow processing substantially
- Caching saves money and time
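Caching can be as simple as keying results on a normalized address string so repeated addresses never hit the API twice. The geocode parameter below is a placeholder for whichever provider you call:

```typescript
// Geocoding cache sketch; `geocode` stands in for your provider's API call.
interface Coordinates {
  latitude: number;
  longitude: number;
}

const geocodeCache = new Map<string, Coordinates | null>();

function normalizeAddress(address: string): string {
  return address.trim().toLowerCase().replace(/\s+/g, " ");
}

async function geocodeCached(
  address: string,
  geocode: (address: string) => Promise<Coordinates | null>,
): Promise<Coordinates | null> {
  const key = normalizeAddress(address);
  if (geocodeCache.has(key)) {
    return geocodeCache.get(key) ?? null; // cache hit: no API request, no cost
  }
  const result = await geocode(address); // cache miss: one rate-limited, billable request
  geocodeCache.set(key, result); // failed lookups are cached too, to avoid repeat charges
  return result;
}
```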
Consider Partitioning
Very large datasets may benefit from splitting:
- Split by time period (yearly, quarterly; see the sketch below)
- Split by category or type
- Split by geographic region
- Maintain relationships with metadata
Why this matters:
- Smaller imports are more manageable
- Failures affect smaller portions
- Partitions can be processed in parallel
- Easier to retry failed partitions
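Splitting by time period can be a simple grouping step over the export before upload. The event_date field name and ISO date format are assumptions carried over from the earlier examples:

```typescript
// Sketch: partition rows by the year of their event_date before importing.
type DatedRow = { event_date: string } & Record<string, unknown>;

function partitionByYear(rows: DatedRow[]): Map<string, DatedRow[]> {
  const partitions = new Map<string, DatedRow[]>();
  for (const row of rows) {
    const year = row.event_date.slice(0, 4); // assumes ISO 8601 dates, e.g. "2024-03-15"
    const bucket = partitions.get(year) ?? [];
    bucket.push(row);
    partitions.set(year, bucket);
  }
  return partitions; // import each partition as its own file, and retry partitions independently
}
```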
Performance Testing
Test at scale before production:
- Use production-sized files in staging
- Measure processing time per stage
- Monitor resource usage (memory, CPU, disk)
- Identify bottlenecks before production
Why this matters:
- Surprises in production are costly
- Can adjust configuration proactively
- Identifies infrastructure needs
- Builds confidence in system capacity
For Development Teams
Guidance for teams building on or extending the pipeline.
Respect Hook Architecture
Let hooks handle orchestration:
- Don’t queue jobs manually from job handlers
- Use StageTransitionService for stage changes
- Trust the hook system to advance the pipeline
- Only queue jobs from hooks
Why this matters:
- Prevents double-queueing
- Maintains exactly-once semantics
- Reduces debugging complexity
- Follows established patterns
Job Handler Focus
Keep job handlers focused:
- Each handler does one thing well
- Don’t queue the next job from a handler
- Update the stage and let hooks queue the next job (see the sketch below)
- Maintain separation of concerns
Why this matters:
- Easier to test and debug
- Clear responsibilities
- Simpler to extend
- Reduces coupling
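The division of labour described in this section and the previous one can be sketched as follows. Everything here, the function names, the stage names, and the queueing call, is illustrative TypeScript rather than the pipeline's real API; the point is only where queueing is allowed to happen:

```typescript
// Pattern sketch only -- names, stages, and signatures are illustrative, not TimeTiles' API.

// A job handler does its one job and records the new stage. It never queues the next job.
async function runGeocodingBatch(importJobId: string): Promise<void> {
  await geocodeBatch(importJobId);                  // the handler's single responsibility
  await updateStage(importJobId, "geocoding-done"); // state change only
}

// A hook reacts to the stage change and queues the next job -- exactly once.
async function afterStageChange(importJobId: string, newStage: string): Promise<void> {
  if (newStage === "geocoding-done") {
    await queueJob("create-events", { importJobId }); // orchestration lives in hooks
  }
}

// Placeholders so the sketch stands alone; the real implementations live in the pipeline.
async function geocodeBatch(_importJobId: string): Promise<void> {}
async function updateStage(_importJobId: string, _stage: string): Promise<void> {}
async function queueJob(_slug: string, _input: object): Promise<void> {}
```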
Testing Implications
Understand how jobs work in tests:
- Jobs are deleted after completion
- Check side effects, not job history (see the test sketch below)
- Verify progress and state changes
- Query actual results (events, data)
Why this matters:
- Test the right things
- Avoid flaky tests
- Matches production behavior
- Tests remain valid
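In practice this means a test asserts on what the pipeline produced, not on job records. Below is a sketch in a Vitest/Jest style; runImportToCompletion and findEventsForImport stand in for whatever helpers your test setup actually provides:

```typescript
// Test-style sketch: verify side effects, not job history.
import { describe, expect, it } from "vitest";

// Stand-ins for your real test helpers -- declared here so the sketch is self-contained.
declare function runImportToCompletion(file: string): Promise<{ importJobId: string }>;
declare function findEventsForImport(importJobId: string): Promise<Array<{ event_name: string }>>;

describe("event import", () => {
  it("creates events from a valid file", async () => {
    const { importJobId } = await runImportToCompletion("fixtures/events.csv");

    // Don't assert that specific job records exist -- completed jobs are deleted.
    // Assert on the actual outcome instead.
    const events = await findEventsForImport(importJobId);
    expect(events.length).toBeGreaterThan(0);
    expect(events[0].event_name).toBeDefined();
  });
});
```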
Error Handling
Handle errors at appropriate levels (see the sketch below):
- Row-level errors: log and continue
- Batch-level errors: retry batch
- Stage-level errors: mark import failed
- Preserve partial progress when possible
Why this matters:
- Resilient to transient failures
- Doesn’t lose work on errors
- Clear error reporting
- Enables recovery
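A sketch of the layering: row errors are logged and skipped so partial progress survives, a whole batch can be retried for transient failures, and only exhausted retries escalate to a stage-level failure. Names, types, and the retry count are illustrative:

```typescript
// Layered error-handling sketch; names, types, and retry counts are illustrative.
type BatchRow = Record<string, unknown>;

async function processBatch(
  rows: BatchRow[],
  handleRow: (row: BatchRow) => Promise<void>,
): Promise<{ processed: number; rowErrors: string[] }> {
  const rowErrors: string[] = [];
  let processed = 0;
  for (const [index, row] of rows.entries()) {
    try {
      await handleRow(row);
      processed++; // partial progress is preserved even if later rows fail
    } catch (error) {
      rowErrors.push(`row ${index}: ${String(error)}`); // row-level: log and continue
    }
  }
  return { processed, rowErrors };
}

async function withRetries<T>(operation: () => Promise<T>, maxAttempts = 3): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(); // batch-level: retry transient failures (e.g. a dropped connection)
    } catch (error) {
      lastError = error;
    }
  }
  // Stage-level: retries exhausted -- surface the failure so the import can be marked failed.
  throw new Error(`failed after ${maxAttempts} attempts: ${String(lastError)}`);
}
```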
Schema Versioning
Always create schema versions:
- Create version after approval
- Link events to the schema version (see the sketch below)
- Maintain version history
- Enable rollback capability
Why this matters:
- Audit trail for compliance
- Understand schema evolution
- Debug issues related to schema changes
- Support for schema rollback
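The shape of a version record can stay simple; what matters is that every event points at the version that was current when it was created. The fields below are an assumption about what such a record might hold, not the actual collection definition:

```typescript
// Illustrative shapes only -- not the actual collection definitions.
interface SchemaVersion {
  dataset: string;    // which dataset this version belongs to
  version: number;    // monotonically increasing
  approvedAt: string; // ISO timestamp of the approval
  approvedBy: string; // who approved the change (audit trail)
  fields: Record<string, { type: string; required: boolean }>;
}

interface StoredEvent {
  external_id: string;
  schemaVersion: number; // links back to SchemaVersion.version, enabling rollback and debugging
  data: Record<string, unknown>;
}
```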
Universal Principles
Practices that apply to everyone working with the pipeline.
Start Conservative, Iterate
- Begin with strict settings
- Loosen based on actual needs
- Test configuration changes in staging
- Monitor impact of changes
Document Everything
- Why configuration was chosen
- Why approval was given/denied
- What went wrong in failures
- How issues were resolved
Monitor Continuously
- Set up alerts for problems
- Review metrics regularly
- Track trends over time
- Act on warnings early
Communicate
- Data providers inform admins of upcoming changes
- Admins communicate policies to providers
- Developers document behavior changes
- Everyone shares knowledge
Learn from Failures
- Analyze what went wrong
- Document root cause
- Implement preventive measures
- Share lessons learned
These best practices will help you run a smooth, efficient, and reliable data processing pipeline.