# Writing Scrapers
This guide covers everything you need to write scrapers that produce clean CSV data for TimeTiles. Scrapers are short scripts that fetch data from websites or APIs, extract the relevant fields, and output rows using the provided helper library.
## Python Scrapers
Python is the default runtime. The base image includes Python 3.12 with several popular libraries pre-installed.
### Basic Example: Fetching from an API
"""Scrape German public holidays from a public API."""
import requests
from timetiles.scraper import output
response = requests.get(
"https://date.nager.at/api/v3/PublicHolidays/2026/DE",
timeout=30,
)
response.raise_for_status()
holidays = response.json()
for holiday in holidays:
output.write_row({
"title": holiday["localName"],
"date": holiday["date"],
"location": "Germany",
"description": holiday.get("name", ""),
"url": f"https://date.nager.at/publicholiday/{holiday['date']}/DE",
})
output.save()
print(f"Scraped {output.row_count} events")Scraping HTML with BeautifulSoup
"""Scrape event listings from an HTML page."""
import requests
from bs4 import BeautifulSoup
from timetiles.scraper import output
response = requests.get("https://example.com/events", timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
for card in soup.select(".event-card"):
title = card.select_one(".event-title")
date = card.select_one(".event-date")
venue = card.select_one(".event-venue")
if title and date:
output.write_row({
"title": title.get_text(strip=True),
"date": date.get_text(strip=True),
"location": venue.get_text(strip=True) if venue else "",
})
output.save()
print(f"Scraped {output.row_count} events")Paginated API with Error Handling
"""Scrape a paginated API with retry logic."""
import requests
from timetiles.scraper import output
BASE_URL = "https://api.example.com/events"
MAX_PAGES = 50
session = requests.Session()
session.headers.update({"Accept": "application/json"})
page = 1
while page <= MAX_PAGES:
try:
response = session.get(
BASE_URL,
params={"page": page, "per_page": 100},
timeout=30,
)
response.raise_for_status()
except requests.RequestException as e:
print(f"Error fetching page {page}: {e}")
break
data = response.json()
events = data.get("results", [])
if not events:
break
for event in events:
output.write_row({
"title": event["name"],
"date": event["start_date"],
"location": event.get("venue", {}).get("name", ""),
"latitude": event.get("venue", {}).get("lat", ""),
"longitude": event.get("venue", {}).get("lng", ""),
"description": event.get("description", ""),
"url": event.get("url", ""),
})
page += 1
output.save()
print(f"Scraped {output.row_count} events across {page - 1} pages")Using Pandas for Data Transformation
"""Fetch a JSON dataset and transform it with pandas."""
import pandas as pd
import requests
from timetiles.scraper import output
response = requests.get("https://api.example.com/dataset.json", timeout=60)
response.raise_for_status()
df = pd.DataFrame(response.json())
# Filter and transform
df = df[df["status"] == "confirmed"]
df["date"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")
df = df.rename(columns={"venue_name": "location", "event_name": "title"})
for _, row in df.iterrows():
output.write_row({
"title": row["title"],
"date": row["date"],
"location": row["location"],
})
output.save()
print(f"Scraped {output.row_count} events")Using Environment Variables
Scrapers can receive environment variables for API keys and configuration. Set them on the scraper record in the admin panel (the `envVars` JSON field) or in the manifest.
"""Scrape using an API key from environment variables."""
import os
import requests
from timetiles.scraper import output
API_KEY = os.environ.get("EVENT_API_KEY")
if not API_KEY:
print("ERROR: EVENT_API_KEY environment variable is not set")
exit(1)
response = requests.get(
"https://api.example.com/events",
headers={"Authorization": f"Bearer {API_KEY}"},
timeout=30,
)
response.raise_for_status()
for event in response.json():
output.write_row({
"title": event["title"],
"date": event["date"],
"location": event["location"],
})
output.save()Available Python Libraries
These libraries are pre-installed in the Python runtime image. No other libraries can be imported.
| Library | Version | Purpose |
|---|---|---|
| `requests` | Latest | HTTP client |
| `beautifulsoup4` | Latest | HTML/XML parsing |
| `lxml` | Latest | Fast XML/HTML parser, used as the BeautifulSoup backend |
| `pandas` | Latest | Data manipulation and analysis |
| `cssselect` | Latest | CSS selector support for lxml |
| `timetiles.scraper` | Built-in | CSV output helper (always available) |

Standard library modules (`json`, `csv`, `re`, `os`, `datetime`, `urllib`, etc.) are always available.
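The standard library alone covers a lot of cleanup work. A small sketch (the payload shape and field names here are made up for illustration) that parses JSON with `json`, normalizes a US-style date with `datetime`, and tidies whitespace with `re`:

```python
import json
import re
from datetime import datetime

# Illustrative raw payload, shaped like a typical API response.
raw = '[{"event_name": "Open  Day", "when": "03/14/2026"}]'

rows = []
for item in json.loads(raw):
    rows.append({
        # Collapse stray internal whitespace in the title.
        "title": re.sub(r"\s+", " ", item["event_name"]).strip(),
        # Convert MM/DD/YYYY to the YYYY-MM-DD format the import pipeline prefers.
        "date": datetime.strptime(item["when"], "%m/%d/%Y").strftime("%Y-%m-%d"),
    })
```

Each dict in `rows` would then be passed to `output.write_row` as in the examples above.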
## Node.js Scrapers
The Node.js runtime uses Node.js 24 with ESM module support.
### Basic Example: Fetching from an API
```javascript
/**
 * Scrape Austrian public holidays from a public API.
 */
import axios from "axios";
import { output } from "@timetiles/scraper";

const response = await axios.get("https://date.nager.at/api/v3/PublicHolidays/2026/AT", { timeout: 30_000 });

for (const holiday of response.data) {
  output.writeRow({
    title: holiday.localName,
    date: holiday.date,
    location: "Austria",
    description: holiday.name ?? "",
    url: `https://date.nager.at/publicholiday/${holiday.date}/AT`,
  });
}

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

### Scraping HTML with Cheerio
```javascript
import axios from "axios";
import * as cheerio from "cheerio";
import { output } from "@timetiles/scraper";

const { data: html } = await axios.get("https://example.com/events", { timeout: 30_000 });
const $ = cheerio.load(html);

$(".event-card").each((_, el) => {
  const title = $(el).find(".event-title").text().trim();
  const date = $(el).find(".event-date").text().trim();
  const venue = $(el).find(".event-venue").text().trim();
  if (title && date) {
    output.writeRow({ title, date, location: venue });
  }
});

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

### Paginated API with Error Handling
```javascript
import axios from "axios";
import { output } from "@timetiles/scraper";

const BASE_URL = "https://api.example.com/events";
const MAX_PAGES = 50;

for (let page = 1; page <= MAX_PAGES; page++) {
  let response;
  try {
    response = await axios.get(BASE_URL, { params: { page, per_page: 100 }, timeout: 30_000 });
  } catch (error) {
    console.error(`Error fetching page ${page}:`, error.message);
    break;
  }

  const events = response.data.results ?? [];
  if (events.length === 0) break;

  for (const event of events) {
    output.writeRow({
      title: event.name,
      date: event.start_date,
      location: event.venue?.name ?? "",
      description: event.description ?? "",
      url: event.url ?? "",
    });
  }
}

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

### Available Node.js Libraries
| Library | Version | Purpose |
|---|---|---|
| `cheerio` | Latest | HTML parsing with a jQuery-like API |
| `axios` | Latest | HTTP client |
| `@timetiles/scraper` | Built-in | CSV output helper (always available) |

Node.js built-in modules (`fs`, `path`, `url`, `crypto`, `https`, etc.) are always available.
## Best Practices

### Handle Errors Gracefully
If your scraper exits with a non-zero exit code, the run is marked as failed. Catch exceptions and print informative error messages before exiting.
```python
import sys

import requests

from timetiles.scraper import output

try:
    response = requests.get("https://api.example.com/events", timeout=30)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Failed to fetch events: {e}", file=sys.stderr)
    sys.exit(1)

# ... process data ...
output.save()
```

### Use Timeouts on HTTP Requests
Always set a timeout on outbound HTTP requests. Without a timeout, a slow or unresponsive server could cause your scraper to hang until the container timeout kills it.
```python
# Python -- always pass timeout
response = requests.get(url, timeout=30)
```

```javascript
// Node.js -- always pass timeout
const response = await axios.get(url, { timeout: 30_000 });
```

### Respect Rate Limits
When scraping third-party websites:
- Add delays between requests if the site has rate limits.
- Check for `Retry-After` headers and honor them.
- Use a single session/client to reuse connections.
- Scrape during off-peak hours (use cron scheduling).
```python
import time

import requests

session = requests.Session()
for url in urls:
    response = session.get(url, timeout=30)
    if response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", 60))
        print(f"Rate limited, waiting {retry_after}s")
        time.sleep(retry_after)
        response = session.get(url, timeout=30)
    response.raise_for_status()
    # ... process ...
    time.sleep(1)  # Be polite
```

### Output Clean CSV
- Use consistent column names across all rows.
- Include a `date` column in a parseable format (`YYYY-MM-DD` is best).
- Include a `location` or `address` column for geocoding.
- Include `latitude` and `longitude` if you already have coordinates (skips geocoding).
- Avoid empty rows and rows with all-empty values.
Columns that work well with the TimeTiles import pipeline:
| Column | Purpose | Notes |
|---|---|---|
| `title` | Event name | Required for meaningful display |
| `date` | Event date | `YYYY-MM-DD` format preferred |
| `location` or `address` | Event location | Used for geocoding |
| `latitude` | Latitude coordinate | Skips geocoding if both lat/lng present |
| `longitude` | Longitude coordinate | Must be paired with `latitude` |
| `description` | Event description | Optional |
| `url` | Source URL | Optional |
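Dates are the column most likely to arrive in inconsistent shapes. One approach, sketched below, is a small normalizer that tries a list of candidate formats and falls back to the raw value. The format list is an assumption; adapt it to whatever your source actually emits.

```python
from datetime import datetime

# Candidate input formats, tried in order; extend this for your source.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%B %d, %Y")

def normalize_date(raw: str) -> str:
    """Return YYYY-MM-DD if any known format matches, else the raw value."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # leave unparseable values for the import pipeline to flag
```

Call it on each scraped date before passing the row to `output.write_row`.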
### Keep Scrapers Focused

One scraper should produce one dataset. If you need to scrape events from multiple sources, define multiple scrapers in the same repo using the `scrapers.yml` manifest.
```yaml
scrapers:
  - name: "Berlin Events"
    slug: berlin-events
    entrypoint: scrapers/berlin.py
    output: output/berlin.csv
    schedule: "0 6 * * *"
  - name: "Munich Events"
    slug: munich-events
    entrypoint: scrapers/munich.py
    output: output/munich.csv
    schedule: "0 7 * * *"

defaults:
  runtime: python
  limits:
    timeout: 300
    memory: 512
```

### Print Progress Information
Standard output (stdout) and standard error (stderr) are captured and stored in the run record. Use print statements to log progress, which helps with debugging failed runs.
print(f"Fetching page {page}...")
print(f"Found {len(events)} events on this page")
print(f"Total: {output.row_count} events so far")Test Locally Before Deploying
You can test your scraper locally before deploying to TimeTiles:
```bash
# Python
cd your-scraper-repo
TIMESCRAPE_OUTPUT_DIR=./output python scraper.py

# Node.js
cd your-scraper-repo
TIMESCRAPE_OUTPUT_DIR=./output node scraper.js

# Check the output
cat output/data.csv
```

Set the `TIMESCRAPE_OUTPUT_DIR` environment variable to write the CSV to a local directory instead of the container's `/output` path.
## Manifest Reference

This is the complete reference for the `scrapers.yml` file format.

### Complete Schema
```yaml
# scrapers.yml -- manifest for a TimeScrape repository
scrapers:
  - name: "Human-Readable Name"    # required, max 255 characters
    slug: url-safe-slug            # required, max 128 characters
                                   # lowercase letters, numbers, hyphens only
                                   # must be unique within the repo
    runtime: python                # optional, "python" or "node"
                                   # defaults to defaults.runtime or "python"
    entrypoint: path/to/script.py  # required, relative to repo root
                                   # must not contain ".."
    output: output/filename.csv    # optional, defaults to "data.csv"
    schedule: "0 6 * * *"          # optional, cron expression
                                   # omit for manual-only execution
    limits:
      timeout: 300                 # optional, seconds (10-3600)
                                   # defaults to defaults.limits.timeout or 300
      memory: 512                  # optional, MB (64-4096)
                                   # defaults to defaults.limits.memory or 512

defaults:                          # optional block
  runtime: python                  # default runtime for all scrapers
  limits:
    timeout: 300                   # default timeout for all scrapers
    memory: 512                    # default memory for all scrapers
```

### Validation Rules
- The `scrapers` array must contain at least one entry.
- Each `slug` must be unique within the manifest.
- Slugs must match the pattern `^[a-z0-9]+(?:-[a-z0-9]+)*$`.
- Entrypoints must not contain `..` (path traversal is rejected).
- Timeout must be between 10 and 3600 seconds.
- Memory must be between 64 and 4096 MB.
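If you want to check slugs before pushing, the pattern from the rules above can be applied directly; a minimal sketch:

```python
import re

# Slug pattern and length limit from the validation rules above.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_valid_slug(slug: str) -> bool:
    """True if the slug is lowercase alphanumeric with single hyphens, <= 128 chars."""
    return len(slug) <= 128 and bool(SLUG_RE.fullmatch(slug))
```

Note that the pattern rejects leading or trailing hyphens and consecutive hyphens, not just uppercase characters.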
### Defaults Merging
Individual scraper fields always override the defaults block. The resolution order for each field is:
1. Scraper-level value (if set)
2. `defaults` block value (if set)
3. System default (`python` for runtime, `300` for timeout, `512` for memory, `data.csv` for output)
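The resolution order can be expressed as a small helper. This is an illustrative sketch, not the actual implementation, and it flattens the `limits` nesting for brevity:

```python
# System defaults, per the resolution order above.
SYSTEM_DEFAULTS = {"runtime": "python", "timeout": 300, "memory": 512, "output": "data.csv"}

def resolve(field: str, scraper: dict, defaults: dict):
    """Scraper-level value wins, then the defaults block, then the system default."""
    if field in scraper:
        return scraper[field]
    if field in defaults:
        return defaults[field]
    return SYSTEM_DEFAULTS[field]
```

For example, a scraper with `runtime: node` keeps `node` even if the `defaults` block says `python`, while a scraper that sets nothing falls through to the system defaults.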
### Example: Single Scraper (Minimal)
```yaml
scrapers:
  - name: "My Events"
    slug: my-events
    entrypoint: scraper.py
```

### Example: Multi-Scraper with Shared Defaults
```yaml
scrapers:
  - name: "German Holidays"
    slug: german-holidays
    entrypoint: scrapers/germany.py
    output: output/germany.csv
    schedule: "0 6 * * 1"
  - name: "Austrian Holidays"
    slug: austrian-holidays
    entrypoint: scrapers/austria.py
    output: output/austria.csv
    schedule: "0 6 * * 1"

defaults:
  runtime: python
  limits:
    timeout: 120
    memory: 256
```

### Example: Mixed Runtimes
```yaml
scrapers:
  - name: "API Scraper"
    slug: api-scraper
    runtime: python
    entrypoint: scrapers/api.py
    schedule: "0 */6 * * *"
    limits:
      timeout: 600
      memory: 1024
  - name: "HTML Scraper"
    slug: html-scraper
    runtime: node
    entrypoint: scrapers/html.js
    schedule: "0 8 * * *"

defaults:
  limits:
    timeout: 300
    memory: 512
```