# Writing Scrapers
This guide covers everything you need to write scrapers that produce clean CSV data for TimeTiles. Scrapers are short scripts that fetch data from websites or APIs, extract the relevant fields, and output rows using the provided helper library.
## Python Scrapers
Python is the default runtime. The base image includes Python 3.12 with several popular libraries pre-installed.
### Basic Example: Fetching from an API
"""Scrape German public holidays from a public API."""
import requests
from timetiles.scraper import output
response = requests.get(
"https://date.nager.at/api/v3/PublicHolidays/2026/DE",
timeout=30,
)
response.raise_for_status()
holidays = response.json()
for holiday in holidays:
output.write_row({
"title": holiday["localName"],
"date": holiday["date"],
"location": "Germany",
"description": holiday.get("name", ""),
"url": f"https://date.nager.at/publicholiday/{holiday['date']}/DE",
})
output.save()
print(f"Scraped {output.row_count} events")Scraping HTML with BeautifulSoup
"""Scrape event listings from an HTML page."""
import requests
from bs4 import BeautifulSoup
from timetiles.scraper import output
response = requests.get("https://example.com/events", timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "lxml")
for card in soup.select(".event-card"):
title = card.select_one(".event-title")
date = card.select_one(".event-date")
venue = card.select_one(".event-venue")
if title and date:
output.write_row({
"title": title.get_text(strip=True),
"date": date.get_text(strip=True),
"location": venue.get_text(strip=True) if venue else "",
})
output.save()
print(f"Scraped {output.row_count} events")Paginated API with Error Handling
"""Scrape a paginated API with retry logic."""
import requests
from timetiles.scraper import output
BASE_URL = "https://api.example.com/events"
MAX_PAGES = 50
session = requests.Session()
session.headers.update({"Accept": "application/json"})
page = 1
while page <= MAX_PAGES:
try:
response = session.get(
BASE_URL,
params={"page": page, "per_page": 100},
timeout=30,
)
response.raise_for_status()
except requests.RequestException as e:
print(f"Error fetching page {page}: {e}")
break
data = response.json()
events = data.get("results", [])
if not events:
break
for event in events:
output.write_row({
"title": event["name"],
"date": event["start_date"],
"location": event.get("venue", {}).get("name", ""),
"latitude": event.get("venue", {}).get("lat", ""),
"longitude": event.get("venue", {}).get("lng", ""),
"description": event.get("description", ""),
"url": event.get("url", ""),
})
page += 1
output.save()
print(f"Scraped {output.row_count} events across {page - 1} pages")Using Pandas for Data Transformation
"""Fetch a JSON dataset and transform it with pandas."""
import pandas as pd
import requests
from timetiles.scraper import output
response = requests.get("https://api.example.com/dataset.json", timeout=60)
response.raise_for_status()
df = pd.DataFrame(response.json())
# Filter and transform
df = df[df["status"] == "confirmed"]
df["date"] = pd.to_datetime(df["timestamp"]).dt.strftime("%Y-%m-%d")
df = df.rename(columns={"venue_name": "location", "event_name": "title"})
for _, row in df.iterrows():
output.write_row({
"title": row["title"],
"date": row["date"],
"location": row["location"],
})
output.save()
print(f"Scraped {output.row_count} events")Using Environment Variables
Scrapers can receive environment variables for API keys and configuration. Set them on the scraper record in the admin panel (the `envVars` JSON field) or in the manifest.
"""Scrape using an API key from environment variables."""
import os
import requests
from timetiles.scraper import output
API_KEY = os.environ.get("EVENT_API_KEY")
if not API_KEY:
print("ERROR: EVENT_API_KEY environment variable is not set")
exit(1)
response = requests.get(
"https://api.example.com/events",
headers={"Authorization": f"Bearer {API_KEY}"},
timeout=30,
)
response.raise_for_status()
for event in response.json():
output.write_row({
"title": event["title"],
"date": event["date"],
"location": event["location"],
})
output.save()Available Python Libraries
These libraries are pre-installed in the Python runtime image. No other libraries can be imported.
| Library | Version | Purpose |
|---|---|---|
| `requests` | Latest | HTTP client |
| `beautifulsoup4` | Latest | HTML/XML parsing |
| `lxml` | Latest | Fast XML/HTML parser, used as the BeautifulSoup backend |
| `pandas` | Latest | Data manipulation and analysis |
| `cssselect` | Latest | CSS selector support for lxml |
| `timetiles.scraper` | Built-in | CSV output helper (always available) |

Standard library modules (`json`, `csv`, `re`, `os`, `datetime`, `urllib`, etc.) are always available.
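The standard library alone covers a lot of cleanup work. A small sketch (the payload shape and field names here are made up for illustration) that parses JSON with `json`, normalizes a US-style date with `datetime`, and tidies whitespace with `re`:

```python
import json
import re
from datetime import datetime

# Illustrative raw payload, shaped like a typical API response.
raw = '[{"event_name": "Open  Day", "when": "03/14/2026"}]'

rows = []
for item in json.loads(raw):
    rows.append({
        # Collapse stray internal whitespace in the title.
        "title": re.sub(r"\s+", " ", item["event_name"]).strip(),
        # Convert MM/DD/YYYY to the YYYY-MM-DD format the import pipeline prefers.
        "date": datetime.strptime(item["when"], "%m/%d/%Y").strftime("%Y-%m-%d"),
    })
```

Each dict in `rows` would then be passed to `output.write_row` as in the examples above.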
## Node.js Scrapers
The Node.js runtime uses Node.js 24 with ESM module support.
### Basic Example: Fetching from an API
```javascript
/**
 * Scrape Austrian public holidays from a public API.
 */
import axios from "axios";
import { output } from "@timetiles/scraper";

const response = await axios.get("https://date.nager.at/api/v3/PublicHolidays/2026/AT", { timeout: 30_000 });

for (const holiday of response.data) {
  output.writeRow({
    title: holiday.localName,
    date: holiday.date,
    location: "Austria",
    description: holiday.name ?? "",
    url: `https://date.nager.at/publicholiday/${holiday.date}/AT`,
  });
}

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

### Scraping HTML with Cheerio
```javascript
import axios from "axios";
import * as cheerio from "cheerio";
import { output } from "@timetiles/scraper";

const { data: html } = await axios.get("https://example.com/events", { timeout: 30_000 });
const $ = cheerio.load(html);

$(".event-card").each((_, el) => {
  const title = $(el).find(".event-title").text().trim();
  const date = $(el).find(".event-date").text().trim();
  const venue = $(el).find(".event-venue").text().trim();
  if (title && date) {
    output.writeRow({ title, date, location: venue });
  }
});

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

### Paginated API with Error Handling
```javascript
import axios from "axios";
import { output } from "@timetiles/scraper";

const BASE_URL = "https://api.example.com/events";
const MAX_PAGES = 50;

for (let page = 1; page <= MAX_PAGES; page++) {
  let response;
  try {
    response = await axios.get(BASE_URL, { params: { page, per_page: 100 }, timeout: 30_000 });
  } catch (error) {
    console.error(`Error fetching page ${page}:`, error.message);
    break;
  }

  const events = response.data.results ?? [];
  if (events.length === 0) break;

  for (const event of events) {
    output.writeRow({
      title: event.name,
      date: event.start_date,
      location: event.venue?.name ?? "",
      description: event.description ?? "",
      url: event.url ?? "",
    });
  }
}

output.save();
console.log(`Scraped ${output.rowCount} events`);
```

### Available Node.js Libraries
| Library | Version | Purpose |
|---|---|---|
| `cheerio` | Latest | HTML parsing with a jQuery-like API |
| `axios` | Latest | HTTP client |
| `@timetiles/scraper` | Built-in | CSV output helper (always available) |

Node.js built-in modules (`fs`, `path`, `url`, `crypto`, `https`, etc.) are always available.
## Best Practices

### Handle Errors Gracefully
If your scraper exits with a non-zero exit code, the run is marked as failed. Catch exceptions and print informative error messages before exiting.
```python
import sys

import requests

from timetiles.scraper import output

try:
    response = requests.get("https://api.example.com/events", timeout=30)
    response.raise_for_status()
except requests.RequestException as e:
    print(f"Failed to fetch events: {e}", file=sys.stderr)
    sys.exit(1)

# ... process data ...
output.save()
```

### Use Timeouts on HTTP Requests
Always set a timeout on outbound HTTP requests. Without a timeout, a slow or unresponsive server could cause your scraper to hang until the container timeout kills it.
```python
# Python -- always pass timeout
response = requests.get(url, timeout=30)
```

```javascript
// Node.js -- always pass timeout
const response = await axios.get(url, { timeout: 30_000 });
```

### Respect Rate Limits
When scraping third-party websites:
- Add delays between requests if the site has rate limits.
- Check for `Retry-After` headers and honor them.
- Use a single session/client to reuse connections.
- Scrape during off-peak hours (use cron scheduling).
```python
import time

import requests

session = requests.Session()
for url in urls:
    response = session.get(url, timeout=30)
    if response.status_code == 429:
        retry_after = int(response.headers.get("Retry-After", 60))
        print(f"Rate limited, waiting {retry_after}s")
        time.sleep(retry_after)
        response = session.get(url, timeout=30)
    response.raise_for_status()
    # ... process ...
    time.sleep(1)  # Be polite
```

### Output Clean CSV
- Use consistent column names across all rows.
- Include a `date` column in a parseable format (`YYYY-MM-DD` is best).
- Include a `location` or `address` column for geocoding.
- Include `latitude` and `longitude` if you already have coordinates (skips geocoding).
- Avoid empty rows and rows with all-empty values.
Columns that work well with the TimeTiles import pipeline:
| Column | Purpose | Notes |
|---|---|---|
| `title` | Event name | Required for meaningful display |
| `date` | Event date | `YYYY-MM-DD` format preferred |
| `location` or `address` | Event location | Used for geocoding |
| `latitude` | Latitude coordinate | Skips geocoding if both lat/lng present |
| `longitude` | Longitude coordinate | Must be paired with `latitude` |
| `description` | Event description | Optional |
| `url` | Source URL | Optional |
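Dates are the column most likely to arrive in inconsistent shapes. One approach, sketched below, is a small normalizer that tries a list of candidate formats and falls back to the raw value. The format list is an assumption; adapt it to whatever your source actually emits.

```python
from datetime import datetime

# Candidate input formats, tried in order; extend this for your source.
DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y", "%B %d, %Y")

def normalize_date(raw: str) -> str:
    """Return YYYY-MM-DD if any known format matches, else the raw value."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # leave unparseable values for the import pipeline to flag
```

Call it on each scraped date before passing the row to `output.write_row`.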
### Keep Scrapers Focused

One scraper should produce one dataset. If you need to scrape events from multiple sources, define multiple scrapers in the same repo using the `scrapers.yml` manifest.
```yaml
scrapers:
  - name: "Berlin Events"
    slug: berlin-events
    entrypoint: scrapers/berlin.py
    output: output/berlin.csv
    schedule: "0 6 * * *"
  - name: "Munich Events"
    slug: munich-events
    entrypoint: scrapers/munich.py
    output: output/munich.csv
    schedule: "0 7 * * *"

defaults:
  runtime: python
  limits:
    timeout: 300
    memory: 512
```

### Print Progress Information
Standard output (stdout) and standard error (stderr) are captured and stored in the run record. Use print statements to log progress, which helps with debugging failed runs.
print(f"Fetching page {page}...")
print(f"Found {len(events)} events on this page")
print(f"Total: {output.row_count} events so far")Test Locally Before Deploying
You can test your scraper locally before deploying to TimeTiles:
```bash
# Python
cd your-scraper-repo
TIMESCRAPE_OUTPUT_DIR=./output python scraper.py

# Node.js
cd your-scraper-repo
TIMESCRAPE_OUTPUT_DIR=./output node scraper.js

# Check the output
cat output/data.csv
```

Set the `TIMESCRAPE_OUTPUT_DIR` environment variable to write the CSV to a local directory instead of the container's `/output` path.
## Manifest Reference

This is the complete reference for the `scrapers.yml` file format.

### Complete Schema
```yaml
# scrapers.yml -- manifest for a TimeScrape repository
scrapers:
  - name: "Human-Readable Name"    # required, max 255 characters
    slug: url-safe-slug            # required, max 128 characters
                                   # lowercase letters, numbers, hyphens only
                                   # must be unique within the repo
    runtime: python                # optional, "python" or "node"
                                   # defaults to defaults.runtime or "python"
    entrypoint: path/to/script.py  # required, relative to repo root
                                   # must not contain ".."
    output: output/filename.csv    # optional, defaults to "data.csv"
    schedule: "0 6 * * *"          # optional, cron expression
                                   # omit for manual-only execution
    limits:
      timeout: 300                 # optional, seconds (10-3600)
                                   # defaults to defaults.limits.timeout or 300
      memory: 512                  # optional, MB (64-4096)
                                   # defaults to defaults.limits.memory or 512

defaults:                          # optional block
  runtime: python                  # default runtime for all scrapers
  limits:
    timeout: 300                   # default timeout for all scrapers
    memory: 512                    # default memory for all scrapers
```

### Validation Rules
- The `scrapers` array must contain at least one entry.
- Each `slug` must be unique within the manifest.
- Slugs must match the pattern `^[a-z0-9]+(?:-[a-z0-9]+)*$`.
- Entrypoints must not contain `..` (path traversal is rejected).
- Timeout must be between 10 and 3600 seconds.
- Memory must be between 64 and 4096 MB.
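If you want to check slugs before pushing, the pattern from the rules above can be applied directly; a minimal sketch:

```python
import re

# Slug pattern and length limit from the validation rules above.
SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_valid_slug(slug: str) -> bool:
    """True if the slug is lowercase alphanumeric with single hyphens, <= 128 chars."""
    return len(slug) <= 128 and bool(SLUG_RE.fullmatch(slug))
```

Note that the pattern rejects leading or trailing hyphens and consecutive hyphens, not just uppercase characters.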
### Defaults Merging
Individual scraper fields always override the defaults block. The resolution order for each field is:
1. Scraper-level value (if set)
2. `defaults` block value (if set)
3. System default (`python` for runtime, `300` for timeout, `512` for memory, `data.csv` for output)
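The resolution order can be expressed as a small helper. This is an illustrative sketch, not the actual implementation, and it flattens the `limits` nesting for brevity:

```python
# System defaults, per the resolution order above.
SYSTEM_DEFAULTS = {"runtime": "python", "timeout": 300, "memory": 512, "output": "data.csv"}

def resolve(field: str, scraper: dict, defaults: dict):
    """Scraper-level value wins, then the defaults block, then the system default."""
    if field in scraper:
        return scraper[field]
    if field in defaults:
        return defaults[field]
    return SYSTEM_DEFAULTS[field]
```

For example, a scraper with `runtime: node` keeps `node` even if the `defaults` block says `python`, while a scraper that sets nothing falls through to the system defaults.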
### Example: Single Scraper (Minimal)
```yaml
scrapers:
  - name: "My Events"
    slug: my-events
    entrypoint: scraper.py
```

### Example: Multi-Scraper with Shared Defaults
```yaml
scrapers:
  - name: "German Holidays"
    slug: german-holidays
    entrypoint: scrapers/germany.py
    output: output/germany.csv
    schedule: "0 6 * * 1"
  - name: "Austrian Holidays"
    slug: austrian-holidays
    entrypoint: scrapers/austria.py
    output: output/austria.csv
    schedule: "0 6 * * 1"

defaults:
  runtime: python
  limits:
    timeout: 120
    memory: 256
```

### Example: Mixed Runtimes
```yaml
scrapers:
  - name: "API Scraper"
    slug: api-scraper
    runtime: python
    entrypoint: scrapers/api.py
    schedule: "0 */6 * * *"
    limits:
      timeout: 600
      memory: 1024
  - name: "HTML Scraper"
    slug: html-scraper
    runtime: node
    entrypoint: scrapers/html.js
    schedule: "0 8 * * *"

defaults:
  limits:
    timeout: 300
    memory: 512
```