⚠️ Active Development Notice: TimeTiles is under active development. Information may be placeholder content or not up-to-date.

Best Practices

This guide provides recommendations for different stakeholders working with the TimeTiles data processing pipeline.

For Data Providers

These practices help ensure smooth imports and high-quality event data.

Data Structure Consistency

Keep structure consistent across imports:

  • Use the same field names in every import
  • Maintain consistent data types (don’t switch between “123” and 123)
  • Keep nesting structure stable
  • Use consistent date formats
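
For example, two imports that follow these rules might look like the sketch below (the field names and values are illustrative):

```typescript
// Two hypothetical imports that keep the same shape: identical field names,
// identical types, identical nesting, and one date format (ISO 8601).
const januaryImport = {
  event_name: "City Council Meeting",
  event_date: "2024-01-15",
  attendance: 42, // always a number, never "42"
  location: { city: "Portland", state: "OR" },
};

const februaryImport = {
  event_name: "Budget Hearing", // same field names as January
  event_date: "2024-02-12",     // same date format
  attendance: 117,              // same type
  location: { city: "Portland", state: "OR" }, // same nesting
};
```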

Why this matters:

  • Reduces schema conflicts and approval friction
  • Enables auto-approval of imports
  • Makes data more queryable and reliable
  • Reduces processing time and errors

Meaningful Field Names

Use descriptive, standard field names:

  • Prefer “event_name” over a cryptic “n” or an ambiguous “title”
  • Use “event_date” over “d” or “when”
  • Follow common conventions (latitude, longitude, address)
  • Avoid abbreviations that aren’t widely understood
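
A minimal before/after sketch (hypothetical fields):

```typescript
// Cryptic abbreviations vs. descriptive, conventional names that the
// pipeline can recognize automatically.
const cryptic = { n: "Harvest Festival", d: "2024-09-21", lat: 45.52, lon: -122.68 };

const descriptive = {
  event_name: "Harvest Festival",
  event_date: "2024-09-21",
  latitude: 45.52, // standard names improve auto-detection of geocoding fields
  longitude: -122.68,
};
```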

Why this matters:

  • Improves auto-detection of geocoding fields
  • Makes data self-documenting
  • Reduces need for manual field mappings
  • Makes it easier for others to understand your data

Include Unique Identifiers

Provide stable, unique IDs when possible:

  • Include explicit ID fields (uuid, event_id, external_id)
  • Ensure IDs are truly unique across all your data
  • Keep IDs stable (don’t change them between imports)
  • Use standard ID formats (UUID, sequential numbers)
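
A sketch of assigning a stable ID at the source, using Node's built-in randomUUID (the field names are illustrative):

```typescript
import { randomUUID } from "node:crypto";

// Mint a UUID once in the source system, persist it, and reuse the same
// value in every subsequent export of this event. Do not regenerate it.
const event = {
  external_id: randomUUID(), // e.g. "3b241101-e2bb-4255-8caf-4136c566a962"
  event_name: "River Cleanup",
  event_date: "2024-06-01",
};
```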

Why this matters:

  • Enables efficient deduplication
  • Allows update workflows for changing data
  • Faster processing (matching on an external ID is the fastest deduplication strategy)
  • Prevents duplicate events

Type Consistency

Keep field types consistent:

  • Numbers should always be numbers, not sometimes strings
  • Dates should use consistent format
  • Booleans should be true/false, not yes/no or 1/0 (if you must use another convention, apply it consistently)
  • Empty values should be null, not empty strings or “N/A”
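
A minimal normalization pass along these lines (a hypothetical helper, not part of TimeTiles):

```typescript
// Coerce numeric strings, map yes/no to booleans, and turn ""/"N/A" into null.
function normalizeValue(value: unknown): unknown {
  if (value === "" || value === "N/A") return null;
  if (typeof value === "string") {
    if (/^-?\d+(\.\d+)?$/.test(value)) return Number(value);
    if (value.toLowerCase() === "yes") return true;
    if (value.toLowerCase() === "no") return false;
  }
  return value;
}

// normalizeValue("123") -> 123, normalizeValue("yes") -> true, normalizeValue("N/A") -> null
```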

Why this matters:

  • Avoids transformation overhead
  • Reduces schema conflicts
  • Improves data quality
  • Faster processing

Geographic Data Standards

Use standard field names and formats:

  • Use “latitude” and “longitude” for coordinates
  • Use “address” for full address strings
  • Include separate fields (street, city, state, country) when possible
  • Ensure coordinates are in decimal degrees (-90 to 90, -180 to 180)
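
A quick sanity check for coordinate ranges before export (hypothetical helper):

```typescript
// Valid decimal-degree coordinates fall within [-90, 90] and [-180, 180].
function isValidCoordinate(latitude: number, longitude: number): boolean {
  return (
    Number.isFinite(latitude) &&
    Number.isFinite(longitude) &&
    latitude >= -90 &&
    latitude <= 90 &&
    longitude >= -180 &&
    longitude <= 180
  );
}
```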

Why this matters:

  • Auto-detection works reliably
  • Geocoding succeeds more often
  • Reduces manual field mapping
  • Higher geocoding confidence

Data Validation

Validate data before importing:

  • Check for required fields
  • Verify data types match expected schema
  • Remove obviously invalid rows
  • Test a small sample first
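
A pre-import check might look like this sketch (the required fields and helper names are hypothetical; adjust to your schema):

```typescript
type Row = Record<string, unknown>;

const REQUIRED_FIELDS = ["event_name", "event_date"]; // adjust to your schema

// Returns the problems found in a sample of rows; an empty array means
// the sample looks safe to import.
function findInvalidRows(rows: Row[]): { index: number; reason: string }[] {
  const problems: { index: number; reason: string }[] = [];
  rows.forEach((row, index) => {
    for (const field of REQUIRED_FIELDS) {
      if (row[field] == null || row[field] === "") {
        problems.push({ index, reason: `missing ${field}` });
      }
    }
    if (row.event_date && Number.isNaN(Date.parse(String(row.event_date)))) {
      problems.push({ index, reason: "unparseable event_date" });
    }
  });
  return problems;
}

// Test a small sample first: findInvalidRows(allRows.slice(0, 100))
```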

Why this matters:

  • Reduces row-level errors
  • Faster approval process
  • Higher quality final events
  • Less troubleshooting needed

For System Administrators

These practices help maintain a healthy, performant pipeline.

Monitor Schema Changes

Check pending approvals regularly:

  • Review pending approvals daily or weekly
  • Understand why schemas are changing
  • Communicate with data providers about changes
  • Document approval decisions for future reference

Why this matters:

  • Prevents import backlogs
  • Catches data quality issues early
  • Maintains data governance
  • Builds institutional knowledge

Configure Transformations Proactively

Set up rules for known type mismatches:

  • Add transformations for common patterns (European numbers, custom dates)
  • Document why transformations were added
  • Test transformations with sample data
  • Review transformation usage periodically
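
As a sketch, a rule for European-formatted numbers might look like this (the rule shape is illustrative, not the actual TimeTiles configuration format):

```typescript
// Converts "1.234,56" -> 1234.56 by stripping thousands separators and
// swapping the decimal comma for a point.
const europeanNumberRule = {
  field: "amount",
  reason: "Provider exports numbers with comma decimals",
  transform: (raw: string): number =>
    Number(raw.replace(/\./g, "").replace(",", ".")),
};

// europeanNumberRule.transform("1.234,56") === 1234.56
```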

Why this matters:

  • Reduces manual intervention
  • Enables auto-approval
  • Consistent handling of type variations
  • Faster imports

Monitor Field Growth

Watch for excessive field proliferation:

  • Review new fields being added
  • Question fields with very low usage
  • Consolidate similar fields when possible
  • Set alerts for schemas exceeding 100 fields

Why this matters:

  • Large schemas impact query performance
  • Too many fields indicate data quality issues
  • Schema complexity makes maintenance harder
  • Storage costs increase

Document Schema Decisions

Record why changes were approved or rejected:

  • Note the approval reason on the import-job record
  • Maintain change log for datasets
  • Communicate decisions to data providers
  • Review decisions periodically for patterns

Why this matters:

  • Builds institutional knowledge
  • Helps train new administrators
  • Provides audit trail for compliance
  • Identifies systemic issues

Set Appropriate Batch Sizes

Tune batch sizes based on infrastructure:

  • Monitor memory usage during imports
  • Test different batch sizes in staging
  • Adjust based on file sizes and complexity
  • Document why specific sizes were chosen

Why this matters:

  • Prevents memory exhaustion
  • Optimizes throughput
  • Balances performance and reliability
  • Optimal sizes are system-specific (there is no one-size-fits-all value)

Regular Maintenance

Keep the pipeline healthy:

  • Archive old import-jobs periodically
  • Clean up failed imports
  • Review and update geocoding API keys
  • Monitor disk space for import files
  • Update dependencies and security patches

Why this matters:

  • Prevents performance degradation
  • Reduces storage costs
  • Maintains security
  • Catches issues before they become critical

For Large Datasets

Special considerations when processing large files or high-volume imports.

Monitor Field Count

Datasets with more than 1,000 fields can degrade performance:

  • Review schema regularly for bloat
  • Question fields with null rate >95%
  • Consider restructuring extremely wide datasets
  • Use nested objects to group related fields
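
For example, flat prefixed fields can be grouped into a nested object (hypothetical fields):

```typescript
// Collapse prefixed flat fields into one nested object, keeping the schema
// narrow and related data grouped together.
const wide = {
  contact_name: "A. Lee",
  contact_email: "lee@example.com",
  contact_phone: "555-0100",
};

const nested = {
  contact: { name: "A. Lee", email: "lee@example.com", phone: "555-0100" },
};
```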

Why this matters:

  • Query performance degrades with field count
  • Storage efficiency decreases
  • Schema becomes unmanageable
  • Indicates possible data quality issues

Set Appropriate Thresholds

Configure enum detection for your data patterns:

  • Use count mode for small datasets
  • Use percentage mode for large datasets
  • Test thresholds with sample data
  • Adjust based on actual field cardinality
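
A sketch of how the two modes might be chosen (the option names are hypothetical, not the actual TimeTiles settings schema):

```typescript
// Small datasets suit an absolute cap on unique values; large datasets
// suit a relative cap that scales with row count.
function enumDetectionConfig(totalRows: number) {
  return totalRows < 10_000
    ? { mode: "count", maxUniqueValues: 50 }
    : { mode: "percentage", maxUniquePercent: 0.5 };
}
```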

Why this matters:

  • Incorrect thresholds create unwieldy enums
  • Too strict: misses categorization opportunities
  • Too loose: creates enums with thousands of values
  • Impacts query performance and UX

Use Progressive Import

Let schema build incrementally for large files:

  • Don’t force immediate completion
  • Allow processing to pause between batches
  • Monitor memory during processing
  • Accept that large imports take time

Why this matters:

  • Prevents memory exhaustion
  • Allows recovery from interruptions
  • More reliable for very large files
  • System remains responsive

Plan for Geocoding

Large address datasets have special considerations:

  • Account for API rate limits (large jobs may take hours)
  • Budget for API costs (most providers charge per request)
  • Consider caching geocoding results
  • Pre-geocode data when possible
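
A minimal caching sketch (the geocode function and its signature are assumptions standing in for your provider's client):

```typescript
declare function geocode(address: string): Promise<{ lat: number; lon: number }>;

const geocodeCache = new Map<string, { lat: number; lon: number }>();

async function cachedGeocode(address: string): Promise<{ lat: number; lon: number }> {
  const key = address.trim().toLowerCase();
  const hit = geocodeCache.get(key);
  if (hit) return hit; // repeated addresses cost no API call
  const result = await geocode(key); // one billable request per unique address
  geocodeCache.set(key, result);
  return result;
}
```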

Why this matters:

  • Geocoding is slowest stage for large datasets
  • API costs can be significant
  • Rate limits may slow processing substantially
  • Caching saves money and time

Consider Partitioning

Very large datasets may benefit from splitting:

  • Split by time period (yearly, quarterly)
  • Split by category or type
  • Split by geographic region
  • Maintain relationships with metadata
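
A sketch of partitioning an export by year (assumes ISO 8601 dates; the helper is hypothetical):

```typescript
// Split one large export into per-year buckets that can be imported
// (and retried) independently.
type PartitionedRow = { event_date: string } & Record<string, unknown>;

function partitionByYear(rows: PartitionedRow[]): Map<string, PartitionedRow[]> {
  const partitions = new Map<string, PartitionedRow[]>();
  for (const row of rows) {
    const year = row.event_date.slice(0, 4);
    const bucket = partitions.get(year) ?? [];
    bucket.push(row);
    partitions.set(year, bucket);
  }
  return partitions;
}
```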

Why this matters:

  • Smaller imports are more manageable
  • Failures affect smaller portions
  • Partitions can be processed in parallel
  • Easier to retry failed partitions

Performance Testing

Test at scale before production:

  • Use production-sized files in staging
  • Measure processing time per stage
  • Monitor resource usage (memory, CPU, disk)
  • Identify bottlenecks before production

Why this matters:

  • Surprises in production are costly
  • Can adjust configuration proactively
  • Identifies infrastructure needs
  • Builds confidence in system capacity

For Development Teams

Guidance for teams building on or extending the pipeline.

Respect Hook Architecture

Let hooks handle orchestration:

  • Don’t queue jobs manually from job handlers
  • Use StageTransitionService for stage changes
  • Trust the hook system to advance the pipeline
  • Only queue jobs from hooks
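
A sketch of this division of labor (StageTransitionService is named in this guide, but the method names and hook wiring below are illustrative):

```typescript
declare const StageTransitionService: {
  transition(importJobId: string, stage: string): Promise<void>;
};
declare function queueJob(slug: string, input: object): Promise<void>;

// Job handler: do one stage's work, record the transition, and stop.
async function detectSchemaHandler(input: { importJobId: string }): Promise<void> {
  // ... perform schema detection ...
  await StageTransitionService.transition(input.importJobId, "schema-detected");
  // Deliberately no queueJob() call here.
}

// Hook (e.g. a collection afterChange): the only place the next job is queued.
async function onStageChange(
  doc: { id: string; stage: string },
  previous: { stage: string },
): Promise<void> {
  if (doc.stage !== previous.stage && doc.stage === "schema-detected") {
    await queueJob("validate-schema", { importJobId: doc.id });
  }
}
```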

Why this matters:

  • Prevents double-queueing
  • Maintains exactly-once semantics
  • Reduces debugging complexity
  • Follows established patterns

Job Handler Focus

Keep job handlers focused:

  • Each handler does one thing well
  • Don’t queue next job from handler
  • Update the stage and let hooks queue the next job
  • Maintain separation of concerns

Why this matters:

  • Easier to test and debug
  • Clear responsibilities
  • Simpler to extend
  • Reduces coupling

Testing Implications

Understand how jobs work in tests:

  • Jobs are deleted after completion
  • Check side effects, not job history
  • Verify progress and state changes
  • Query actual results (events, data)
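
A sketch of a test that checks side effects rather than job history (the test framework and helpers are assumptions):

```typescript
import { expect, test } from "vitest"; // framework assumed; jest reads the same

// runImport and findEvents are hypothetical helpers for this sketch.
declare function runImport(path: string): Promise<void>;
declare function findEvents(filter: object): Promise<Record<string, unknown>[]>;

test("import produces events, not job records", async () => {
  await runImport("fixtures/sample.csv");

  // Jobs are deleted after completion, so don't assert on the job queue.
  // Verify the observable outcome instead:
  const events = await findEvents({ source: "fixtures/sample.csv" });
  expect(events.length).toBeGreaterThan(0);
  expect(events[0]).toHaveProperty("event_name");
});
```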

Why this matters:

  • Test the right things
  • Avoid flaky tests
  • Matches production behavior
  • Tests remain valid

Error Handling

Handle errors at appropriate levels:

  • Row-level errors: log and continue
  • Batch-level errors: retry batch
  • Stage-level errors: mark import failed
  • Preserve partial progress when possible
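
A sketch of row-level handling inside a batch processor (the helper names are illustrative):

```typescript
type ImportRow = Record<string, unknown>;

declare function processRow(row: ImportRow): Promise<void>;
declare function logRowError(jobId: string, rowIndex: number, err: unknown): Promise<void>;

async function processBatch(rows: ImportRow[], importJobId: string): Promise<void> {
  for (const [index, row] of rows.entries()) {
    try {
      await processRow(row);
    } catch (error) {
      // Row-level: record the error and keep going; partial progress survives.
      await logRowError(importJobId, index, error);
    }
  }
  // Batch-level errors (e.g. a dropped DB connection) propagate to the caller,
  // which retries the batch; repeated failure marks the import stage as failed.
}
```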

Why this matters:

  • Resilient to transient failures
  • Doesn’t lose work on errors
  • Clear error reporting
  • Enables recovery

Schema Versioning

Always create schema versions:

  • Create version after approval
  • Link events to schema version
  • Maintain version history
  • Enable rollback capability
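
A sketch of the data shapes this implies (field names are hypothetical):

```typescript
// Each approval produces an immutable version record, and every event
// links back to the version it was validated against.
interface SchemaVersion {
  id: string;
  datasetId: string;
  version: number; // monotonically increasing per dataset
  schema: Record<string, unknown>; // snapshot of the approved schema
  approvedAt: string;
}

interface EventRecord {
  id: string;
  schemaVersionId: string; // enables auditing and rollback
  data: Record<string, unknown>;
}
```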

Why this matters:

  • Audit trail for compliance
  • Understand schema evolution
  • Debug issues related to schema changes
  • Support for schema rollback

Universal Principles

Practices that apply to everyone working with the pipeline.

Start Conservative, Iterate

  • Begin with strict settings
  • Loosen based on actual needs
  • Test configuration changes in staging
  • Monitor impact of changes

Document Everything

  • Why configuration was chosen
  • Why approval was given/denied
  • What went wrong in failures
  • How issues were resolved

Monitor Continuously

  • Set up alerts for problems
  • Review metrics regularly
  • Track trends over time
  • Act on warnings early

Communicate

  • Data providers inform admins of upcoming changes
  • Admins communicate policies to providers
  • Developers document behavior changes
  • Everyone shares knowledge

Learn from Failures

  • Analyze what went wrong
  • Document root cause
  • Implement preventive measures
  • Share lessons learned

These best practices will help you run a smooth, efficient, and reliable data processing pipeline.
