⚠️ Active Development Notice: TimeTiles is under active development. Information may be placeholder content or not up-to-date.

Writing Scrapers

This guide covers everything you need to write scrapers that produce clean CSV data for TimeTiles. Scrapers are short scripts that fetch data from websites or APIs, extract the relevant fields, and output rows using the provided helper library.

Python Scrapers

Python is the default runtime. The base image includes Python 3.12 with several popular libraries pre-installed.

Basic Example: Fetching from an API

```python
"""Scrape German public holidays from a public API."""

import requests

from timetiles.scraper import output

response = requests.get(
    "https://date.nager.at/api/v3/PublicHolidays/2026/DE",
    timeout=30,
)
response.raise_for_status()

holidays = response.json()
for holiday in holidays:
    output.write_row({
        "title": holiday["localName"],
        "date": holiday["date"],
        "location": "Germany",
        "description": holiday.get("name", ""),
        "url": f"https://date.nager.at/publicholiday/{holiday['date']}/DE",
    })

output.save()
print(f"Scraped {output.row_count} events")
```

Scraping HTML with BeautifulSoup

```python
"""Scrape event listings from an HTML page."""

import requests
from bs4 import BeautifulSoup

from timetiles.scraper import output

response = requests.get("https://example.com/events", timeout=30)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")
for card in soup.select(".event-card"):
    title = card.select_one(".event-title")
    date = card.select_one(".event-date")
    venue = card.select_one(".event-venue")
    if title and date:
        output.write_row({
            "title": title.get_text(strip=True),
            "date": date.get_text(strip=True),
            "location": venue.get_text(strip=True) if venue else "",
        })

output.save()
print(f"Scraped {output.row_count} events")
```

Paginated API with Error Handling

```python
"""Scrape a paginated API with error handling."""

import requests

from timetiles.scraper import output

BASE_URL = "https://api.example.com/events"
MAX_PAGES = 50

session = requests.Session()
session.headers.update({"Accept": "application/json"})

page = 1
while page <= MAX_PAGES:
    try:
        response = session.get(
            BASE_URL,
            params={"page": page, "per_page": 100},
            timeout=30,
        )
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching page {page}: {e}")
        break

    data = response.json()
    events = data.get("results", [])
    if not events:
        break

    for event in events:
        output.write_row({
            "title": event["name"],
            "date": event["start_date"],
            "location": event.get("venue", {}).get("name", ""),
            "latitude": event.get("venue", {}).get("lat", ""),
            "longitude": event.get("venue", {}).get("lng", ""),
            "description": event.get("description", ""),
            "url": event.get("url", ""),
        })

    page += 1

output.save()
print(f"Scraped {output.row_count} events across {page - 1} pages")
```

Using Pandas for Data Transformation

```python
"""Fetch a JSON dataset and transform it with pandas."""

import pandas as pd
import requests

from timetiles.scraper import output

response = requests.get("https://api.example.com/dataset.json", timeout=60)
response.raise_for_status()

df = pd.DataFrame(response.json())

# Filter and transform
df = df[df["status"] == "confirmed"]
df["date"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")
df = df.rename(columns={"venue_name": "location", "event_name": "title"})

for _, row in df.iterrows():
    output.write_row({
        "title": row["title"],
        "date": row["date"],
        "location": row["location"],
    })

output.save()
print(f"Scraped {output.row_count} events")
```

Using Environment Variables

Scrapers can receive environment variables for API keys and configuration. Set them on the scraper record in the admin panel (the `envVars` JSON field) or in the manifest.

```python
"""Scrape using an API key from environment variables."""

import os
import sys

import requests

from timetiles.scraper import output

API_KEY = os.environ.get("EVENT_API_KEY")
if not API_KEY:
    print("ERROR: EVENT_API_KEY environment variable is not set", file=sys.stderr)
    sys.exit(1)

response = requests.get(
    "https://api.example.com/events",
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

for event in response.json():
    output.write_row({
        "title": event["title"],
        "date": event["date"],
        "location": event["location"],
    })

output.save()
```

Available Python Libraries

These libraries are pre-installed in the Python runtime image. No other third-party libraries can be imported.

| Library | Version | Purpose |
| --- | --- | --- |
| requests | Latest | HTTP client |
| beautifulsoup4 | Latest | HTML/XML parsing |
| lxml | Latest | Fast XML/HTML parser, used as BeautifulSoup backend |
| pandas | Latest | Data manipulation and analysis |
| cssselect | Latest | CSS selector support for lxml |
| timetiles.scraper | Built-in | CSV output helper (always available) |

Standard library modules (json, csv, re, os, datetime, urllib, etc.) are always available.


Node.js Scrapers

The Node.js runtime uses Node.js 24 with ESM module support.

Basic Example: Fetching from an API

```javascript
/**
 * Scrape Austrian public holidays from a public API.
 */
import axios from "axios";
import { output } from "@timetiles/scraper";

const response = await axios.get(
  "https://date.nager.at/api/v3/PublicHolidays/2026/AT",
  { timeout: 30_000 },
);

for (const holiday of response.data) {
  output.writeRow({
    title: holiday.localName,
    date: holiday.date,
    location: "Austria",
    description: holiday.name ?? "",
    url: `https://date.nager.at/publicholiday/${holiday.date}/AT`,
  });
}

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

Scraping HTML with Cheerio

```javascript
import axios from "axios";
import * as cheerio from "cheerio";
import { output } from "@timetiles/scraper";

const { data: html } = await axios.get("https://example.com/events", { timeout: 30_000 });
const $ = cheerio.load(html);

$(".event-card").each((_, el) => {
  const title = $(el).find(".event-title").text().trim();
  const date = $(el).find(".event-date").text().trim();
  const venue = $(el).find(".event-venue").text().trim();
  if (title && date) {
    output.writeRow({ title, date, location: venue });
  }
});

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

Paginated API with Error Handling

```javascript
import axios from "axios";
import { output } from "@timetiles/scraper";

const BASE_URL = "https://api.example.com/events";
const MAX_PAGES = 50;

for (let page = 1; page <= MAX_PAGES; page++) {
  let response;
  try {
    response = await axios.get(BASE_URL, {
      params: { page, per_page: 100 },
      timeout: 30_000,
    });
  } catch (error) {
    console.error(`Error fetching page ${page}:`, error.message);
    break;
  }

  const events = response.data.results ?? [];
  if (events.length === 0) break;

  for (const event of events) {
    output.writeRow({
      title: event.name,
      date: event.start_date,
      location: event.venue?.name ?? "",
      description: event.description ?? "",
      url: event.url ?? "",
    });
  }
}

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

Available Node.js Libraries

| Library | Version | Purpose |
| --- | --- | --- |
| cheerio | Latest | HTML parsing with jQuery-like API |
| axios | Latest | HTTP client |
| @timetiles/scraper | Built-in | CSV output helper (always available) |

Node.js built-in modules (fs, path, url, crypto, https, etc.) are always available.


Best Practices

Handle Errors Gracefully

If your scraper exits with a non-zero exit code, the run is marked as failed. Catch exceptions and print informative error messages before exiting.

```python
import sys

import requests

from timetiles.scraper import output

try:
    response = requests.get("https://api.example.com/events", timeout=30)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Failed to fetch events: {e}", file=sys.stderr)
    sys.exit(1)

# ... process data ...

output.save()
```

Use Timeouts on HTTP Requests

Always set a timeout on outbound HTTP requests. Without a timeout, a slow or unresponsive server could cause your scraper to hang until the container timeout kills it.

```python
# Python -- always pass timeout
response = requests.get(url, timeout=30)
```

```javascript
// Node.js -- always pass timeout
const response = await axios.get(url, { timeout: 30_000 });
```

Respect Rate Limits

When scraping third-party websites:

  • Add delays between requests if the site has rate limits.
  • Check for Retry-After headers and honor them.
  • Use a single session/client to reuse connections.
  • Scrape during off-peak hours (use cron scheduling).
```python
import time

import requests

session = requests.Session()
urls = [...]  # list of page URLs to fetch

for url in urls:
    response = session.get(url, timeout=30)
    if response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", 60))
        print(f"Rate limited, waiting {retry_after}s")
        time.sleep(retry_after)
        response = session.get(url, timeout=30)
    response.raise_for_status()
    # ... process ...
    time.sleep(1)  # Be polite
```

Output Clean CSV

  • Use consistent column names across all rows.
  • Include a date column in a parseable format (YYYY-MM-DD is best).
  • Include a location or address column for geocoding.
  • Include latitude and longitude if you already have coordinates (skips geocoding).
  • Avoid empty rows and rows with all-empty values.

Columns that work well with the TimeTiles import pipeline:

| Column | Purpose | Notes |
| --- | --- | --- |
| title | Event name | Required for meaningful display |
| date | Event date | YYYY-MM-DD format preferred |
| location or address | Event location | Used for geocoding |
| latitude | Latitude coordinate | Skips geocoding if both lat/lng present |
| longitude | Longitude coordinate | Must be paired with latitude |
| description | Event description | Optional |
| url | Source URL | Optional |
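Putting the column table together, a row with the full recommended set might look like the sketch below. The event values are made-up illustrations; the standard-library csv module is used here only to show the shape of the resulting file — in a real scraper you would pass the same dict to output.write_row instead.

```python
import csv
import io

# The recommended column set from the table above.
FIELDS = ["title", "date", "location", "latitude", "longitude", "description", "url"]

row = {
    "title": "Long Night of Museums",       # illustrative event
    "date": "2026-08-29",                   # YYYY-MM-DD parses reliably
    "location": "Berlin, Germany",          # used for geocoding if no coordinates
    "latitude": "52.5200",                  # both coordinates present -> geocoding skipped
    "longitude": "13.4050",
    "description": "Annual city-wide museum event",
    "url": "https://example.com/events/long-night",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Every row should use this same header set, even when some values are empty strings, so the import pipeline sees consistent columns.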

Keep Scrapers Focused

One scraper should produce one dataset. If you need to scrape events from multiple sources, define multiple scrapers in the same repo using the scrapers.yml manifest.

```yaml
scrapers:
  - name: "Berlin Events"
    slug: berlin-events
    entrypoint: scrapers/berlin.py
    output: output/berlin.csv
    schedule: "0 6 * * *"
  - name: "Munich Events"
    slug: munich-events
    entrypoint: scrapers/munich.py
    output: output/munich.csv
    schedule: "0 7 * * *"

defaults:
  runtime: python
  limits:
    timeout: 300
    memory: 512
```

Log Progress

Standard output (stdout) and standard error (stderr) are captured and stored in the run record. Use print statements to log progress; this helps when debugging failed runs.

```python
print(f"Fetching page {page}...")
print(f"Found {len(events)} events on this page")
print(f"Total: {output.row_count} events so far")
```

Test Locally Before Deploying

You can test your scraper locally before deploying to TimeTiles:

```sh
# Python
cd your-scraper-repo
TIMESCRAPE_OUTPUT_DIR=./output python scraper.py

# Node.js
cd your-scraper-repo
TIMESCRAPE_OUTPUT_DIR=./output node scraper.js

# Check the output
cat output/data.csv
```

Set the `TIMESCRAPE_OUTPUT_DIR` environment variable to write the CSV to a local directory instead of the container's `/output` path.


Manifest Reference

This is the complete reference for the scrapers.yml file format.

Complete Schema

```yaml
# scrapers.yml -- manifest for a TimeScrape repository
scrapers:
  - name: "Human-Readable Name"    # required, max 255 characters
    slug: url-safe-slug            # required, max 128 characters
                                   # lowercase letters, numbers, hyphens only
                                   # must be unique within the repo
    runtime: python                # optional, "python" or "node"
                                   # defaults to defaults.runtime or "python"
    entrypoint: path/to/script.py  # required, relative to repo root
                                   # must not contain ".."
    output: output/filename.csv    # optional, defaults to "data.csv"
    schedule: "0 6 * * *"          # optional, cron expression
                                   # omit for manual-only execution
    limits:
      timeout: 300                 # optional, seconds (10-3600)
                                   # defaults to defaults.limits.timeout or 300
      memory: 512                  # optional, MB (64-4096)
                                   # defaults to defaults.limits.memory or 512

defaults:                          # optional block
  runtime: python                  # default runtime for all scrapers
  limits:
    timeout: 300                   # default timeout for all scrapers
    memory: 512                    # default memory for all scrapers
```

Validation Rules

  • The scrapers array must contain at least one entry.
  • Each slug must be unique within the manifest.
  • Slugs must match the pattern ^[a-z0-9]+(?:-[a-z0-9]+)*$.
  • Entrypoints must not contain .. (path traversal is rejected).
  • Timeout must be between 10 and 3600 seconds.
  • Memory must be between 64 and 4096 MB.
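The slug rule above is the trickiest to get right by eye. As a sketch, the stated pattern can be checked with the standard-library `re` module (the function name `is_valid_slug` is illustrative, not part of TimeTiles):

```python
import re

# Pattern from the validation rules: lowercase alphanumeric groups
# joined by single hyphens, no leading or trailing hyphen.
SLUG_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_valid_slug(slug: str) -> bool:
    return bool(SLUG_PATTERN.fullmatch(slug))

print(is_valid_slug("berlin-events"))   # valid
print(is_valid_slug("Berlin-Events"))   # invalid: uppercase
print(is_valid_slug("berlin--events"))  # invalid: double hyphen
print(is_valid_slug("-berlin"))         # invalid: leading hyphen
```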

Defaults Merging

Individual scraper fields always override the defaults block. The resolution order for each field is:

  1. Scraper-level value (if set)
  2. defaults block value (if set)
  3. System default (python for runtime, 300 for timeout, 512 for memory, data.csv for output)
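The three-step resolution order can be sketched as a small lookup function. This is a simplified illustration, not the actual TimeTiles implementation: it treats `timeout` and `memory` as flat fields, whereas the real manifest nests them under `limits`.

```python
# System defaults from step 3 of the resolution order.
SYSTEM_DEFAULTS = {"runtime": "python", "timeout": 300, "memory": 512, "output": "data.csv"}

def resolve(field: str, scraper: dict, defaults: dict):
    """Resolve one field: scraper-level value, then defaults block, then system default."""
    if field in scraper:
        return scraper[field]
    if field in defaults:
        return defaults[field]
    return SYSTEM_DEFAULTS[field]

defaults = {"runtime": "python", "timeout": 120}
scraper = {"timeout": 600}
print(resolve("timeout", scraper, defaults))  # 600 -- scraper-level wins
print(resolve("runtime", scraper, defaults))  # python -- from the defaults block
print(resolve("memory", scraper, defaults))   # 512 -- system default
```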

Example: Single Scraper (Minimal)

```yaml
scrapers:
  - name: "My Events"
    slug: my-events
    entrypoint: scraper.py
```

Example: Multi-Scraper with Shared Defaults

```yaml
scrapers:
  - name: "German Holidays"
    slug: german-holidays
    entrypoint: scrapers/germany.py
    output: output/germany.csv
    schedule: "0 6 * * 1"
  - name: "Austrian Holidays"
    slug: austrian-holidays
    entrypoint: scrapers/austria.py
    output: output/austria.csv
    schedule: "0 6 * * 1"

defaults:
  runtime: python
  limits:
    timeout: 120
    memory: 256
```

Example: Mixed Runtimes

```yaml
scrapers:
  - name: "API Scraper"
    slug: api-scraper
    runtime: python
    entrypoint: scrapers/api.py
    schedule: "0 */6 * * *"
    limits:
      timeout: 600
      memory: 1024
  - name: "HTML Scraper"
    slug: html-scraper
    runtime: node
    entrypoint: scrapers/html.js
    schedule: "0 8 * * *"

defaults:
  limits:
    timeout: 300
    memory: 512
```