# Scraper Deployment
This guide covers deploying the TimeScrape runner, the stateless service that executes scraper code inside hardened Podman containers. The runner is an optional component — deploy it only if you need scraper functionality.
## Prerequisites
| Requirement | Version | Notes |
|---|---|---|
| Podman | 4.0+ | Rootless mode required |
| Node.js | 24+ | For running the service |
| Git | Any recent | For cloning scraper repos at execution time |
The runner does not need access to the TimeTiles database. It communicates with TimeTiles only via HTTP using a shared API key.
## Installation
### 1. Install Podman (Rootless)
macOS:

```bash
brew install podman
podman machine init
podman machine start
```

Ubuntu / Debian:

```bash
sudo apt update
sudo apt install -y podman
```

Verify rootless mode:

```bash
podman info --format '{{.Host.Security.Rootless}}'
```

This must return `true`. If it does not, consult the Podman rootless tutorial.
### 2. Build Base Images
The runner spawns containers from two pre-built base images. Build them from the `apps/timescrape/images/` directory:

```bash
cd apps/timescrape

# Python runtime (requests, beautifulsoup4, lxml, pandas, cssselect)
podman build -t timescrape-python images/python/

# Node.js runtime (cheerio, axios)
podman build -t timescrape-node images/node/
```

The image names `timescrape-python` and `timescrape-node` are expected by the runner. Do not change the tags.

Verify:

```bash
podman images | grep timescrape
```

### 3. Create the Sandbox Network
Scraper containers run on an isolated network that allows internet access but blocks access to internal services:

```bash
podman network create scraper-sandbox
```

Verify:

```bash
podman network ls | grep scraper-sandbox
```

Without this network, scraper containers could reach the database, the web app, or other internal services.
### 4. Configure Environment
Create an `.env` file in `apps/timescrape/`:

```bash
# REQUIRED: Shared secret for API authentication (minimum 16 characters).
# Must match the SCRAPER_API_KEY value in the TimeTiles .env.
SCRAPER_API_KEY=your-secret-key-at-least-16-chars

# HTTP server port (default: 4000)
SCRAPER_PORT=4000

# Maximum simultaneous container runs (default: 3)
SCRAPER_MAX_CONCURRENT=3

# Default timeout per run in seconds (default: 300)
SCRAPER_DEFAULT_TIMEOUT=300

# Default memory limit per container in MB (default: 512)
SCRAPER_DEFAULT_MEMORY=512

# Maximum Git repo clone size in MB (default: 50)
SCRAPER_MAX_REPO_SIZE_MB=50

# Maximum CSV output file size in MB (default: 100)
SCRAPER_MAX_OUTPUT_SIZE_MB=100

# Temp directory for run workspaces (default: /tmp/timescrape)
SCRAPER_DATA_DIR=/tmp/timescrape
```

Generate a strong API key:

```bash
openssl rand -hex 32
```

### 5. Connect to TimeTiles
Add two variables to the TimeTiles `.env` file (`apps/web/.env`):

```bash
# URL where TimeTiles can reach the runner
SCRAPER_RUNNER_URL=http://localhost:4000

# Must match the runner's SCRAPER_API_KEY exactly
SCRAPER_API_KEY=your-secret-key-at-least-16-chars
```

Then enable the feature flag:
1. Log in to TimeTiles as an admin.
2. Go to **Settings** in the Payload dashboard.
3. Enable `enableScrapers`.
Without the feature flag, users cannot create scraper repos and execution jobs will not run.
## Running the Service
### Development
From the monorepo root:

```bash
pnpm --filter scraper dev
```

The server starts on the configured port (default 4000).
### Production: Direct on Host
Build and run directly on the host machine:
```bash
cd apps/timescrape
pnpm build
node dist/index.js
```

Use a process manager such as systemd or PM2 to keep the service running:

```ini
# /etc/systemd/system/timescrape-runner.service
[Unit]
Description=TimeScrape Runner
After=network.target

[Service]
Type=simple
User=timescrape
WorkingDirectory=/opt/timescrape-runner
EnvironmentFile=/opt/timescrape-runner/.env
ExecStart=/usr/bin/node dist/index.js
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

```bash
sudo systemctl enable timescrape-runner
sudo systemctl start timescrape-runner
```

### Production: Containerized Runner
Build and run the runner itself as a Podman container:
```bash
cd apps/timescrape
podman build -t timescrape-runner .
podman run -d \
  --name timescrape-runner \
  -p 4000:4000 \
  --env-file .env \
  -v /run/user/$(id -u)/podman/podman.sock:/run/podman/podman.sock \
  timescrape-runner
```

The runner needs access to the Podman socket to spawn scraper containers. When running the runner inside a container, mount the rootless Podman socket as shown above. When running directly on the host, no socket mount is needed — the runner invokes `podman` as a CLI command.
### Health Check
Verify the service is running:
```bash
curl http://localhost:4000/health
```

Expected response:

```json
{ "status": "ok", "active_runs": 0, "timestamp": "2026-03-16T10:00:00.000Z" }
```

## Docker Compose Example
If you run TimeTiles with Docker Compose, you can add the runner as an additional service. The runner still requires Podman on the host: the example mounts the host's rootless Podman socket into the runner container so it can spawn scraper containers.
```yaml
# docker-compose.scraper.yml
# Use alongside your main docker-compose.prod.yml
services:
  timescrape-runner:
    build:
      context: ./apps/timescrape
      dockerfile: Dockerfile
    ports:
      - "4000:4000"
    environment:
      - SCRAPER_API_KEY=${SCRAPER_API_KEY}
      - SCRAPER_PORT=4000
      - SCRAPER_MAX_CONCURRENT=${SCRAPER_MAX_CONCURRENT:-3}
      - SCRAPER_DEFAULT_TIMEOUT=${SCRAPER_DEFAULT_TIMEOUT:-300}
      - SCRAPER_DEFAULT_MEMORY=${SCRAPER_DEFAULT_MEMORY:-512}
    volumes:
      # Mount Podman socket for container management
      - /run/user/1000/podman/podman.sock:/run/podman/podman.sock
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
```

Start with:

```bash
docker compose -f docker-compose.prod.yml -f docker-compose.scraper.yml up -d
```

## Environment Variables Reference
### Runner Variables (`apps/timescrape`)
| Variable | Default | Required | Description |
|---|---|---|---|
| `SCRAPER_API_KEY` | — | Yes | Shared secret for API authentication. Minimum 16 characters. |
| `SCRAPER_PORT` | `4000` | No | HTTP server port. |
| `SCRAPER_MAX_CONCURRENT` | `3` | No | Maximum simultaneous container runs. Additional requests are rejected with HTTP 429. |
| `SCRAPER_DEFAULT_TIMEOUT` | `300` | No | Default timeout per run in seconds. Individual scrapers can override. |
| `SCRAPER_DEFAULT_MEMORY` | `512` | No | Default memory limit per container in MB. Individual scrapers can override. |
| `SCRAPER_MAX_REPO_SIZE_MB` | `50` | No | Maximum Git repo size for shallow clones. |
| `SCRAPER_MAX_OUTPUT_SIZE_MB` | `100` | No | Maximum CSV output file size. Runs exceeding this fail with `INVALID_OUTPUT`. |
| `SCRAPER_DATA_DIR` | `/tmp/timescrape` | No | Temporary directory for run workspaces. Cleaned up after each run. |
| `NODE_ENV` | `development` | No | Node.js environment mode. |
### TimeTiles Variables (`apps/web`)
| Variable | Default | Required | Description |
|---|---|---|---|
| `SCRAPER_RUNNER_URL` | — | Yes (if scrapers enabled) | URL where TimeTiles can reach the runner (e.g., `http://localhost:4000`). |
| `SCRAPER_API_KEY` | — | Yes (if scrapers enabled) | Must match the runner's `SCRAPER_API_KEY` exactly. |
## Monitoring
### Logs
The runner logs to stdout in structured format. In production, pipe logs to your preferred log aggregation system.
With systemd:

```bash
journalctl -u timescrape-runner -f
```

With Podman:

```bash
podman logs -f timescrape-runner
```

### Health Endpoint
The `/health` endpoint returns the current state of the runner:

```bash
curl -s http://localhost:4000/health | jq
```

```json
{ "status": "ok", "active_runs": 2, "timestamp": "2026-03-16T10:00:00.000Z" }
```

Monitor `active_runs` to track utilization. If it consistently equals `SCRAPER_MAX_CONCURRENT`, consider increasing the limit or adding another runner instance.
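As a small monitoring sketch (in Python; the helper name is ours, not part of the runner, but the response shape matches the example above), utilization can be derived directly from the parsed `/health` payload:

```python
def utilization(health: dict, max_concurrent: int) -> float:
    """Fraction of run slots in use, computed from a parsed /health response.

    `health` is the decoded JSON from GET /health; `max_concurrent` should
    match the runner's SCRAPER_MAX_CONCURRENT setting.
    """
    return health["active_runs"] / max_concurrent

# With the example response above and the default limit of 3:
print(utilization({"status": "ok", "active_runs": 2}, 3))  # ~0.67
```

A value pinned at 1.0 over time is the signal to raise the limit or add an instance.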
### Metrics Endpoint
The `/metrics` endpoint provides aggregate run statistics (no authentication required):

```bash
curl -s http://localhost:4000/metrics | jq
```

```json
{
  "active_runs": 1,
  "total_runs": 145,
  "total_success": 130,
  "total_failed": 12,
  "total_timeout": 3,
  "uptime_seconds": 86400,
  "queue_capacity": 3
}
```

Counters reset when the runner process restarts. For persistent metrics, scrape this endpoint with Prometheus or a similar tool.
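As a hedged sketch (a hypothetical helper, not part of the runner), a parsed `/metrics` payload can be turned into a success rate worth alerting on:

```python
def success_rate(metrics: dict) -> float:
    """Share of completed runs that succeeded, from a parsed /metrics response."""
    total = metrics["total_runs"]
    return metrics["total_success"] / total if total else 0.0

# With the sample payload above: 130 successes out of 145 runs.
rate = success_rate({"total_runs": 145, "total_success": 130})
print(f"{rate:.1%}")  # 89.7%
```

Because the counters reset on restart, compute rates over scraped deltas rather than raw totals if the runner restarts frequently.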
### TimeTiles Health Integration
When `SCRAPER_RUNNER_URL` is configured, the TimeTiles health endpoint (`/api/health`) automatically pings the runner's `/health` endpoint and includes the result in its response. If the runner is unreachable, the health check reports the runner as degraded.
### Run History
All run outcomes are stored in the `scraper-runs` Payload collection. Query from the admin dashboard or via the REST API:

```bash
# Recent failed runs
curl -s "https://your-timetiles-instance/api/scraper-runs?where[status][equals]=failed&sort=-createdAt&limit=10" \
  -H "Authorization: Bearer YOUR_TOKEN"
```

## Scaling
### Vertical Scaling
Increase resources on a single runner:
- **More concurrent runs:** Raise `SCRAPER_MAX_CONCURRENT` (ensure the host has enough memory and CPU).
- **More memory per run:** Increase `SCRAPER_DEFAULT_MEMORY` or set per-scraper limits.
- **Longer timeouts:** Increase `SCRAPER_DEFAULT_TIMEOUT` for slow scrapers.
Rule of thumb: each concurrent run needs the configured memory limit plus overhead. For 5 concurrent runs at 512 MB each, the host should have at least 4 GB of available RAM.
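The rule of thumb can be written out as a quick sizing check. This is a sketch: the 256 MB per-run overhead is our assumption covering the clone workspace and output buffering, not a figure measured from the runner.

```python
def required_ram_mb(concurrent: int, memory_limit_mb: int, overhead_mb: int = 256) -> int:
    """Rough host RAM needed: each run slot gets its memory limit plus overhead."""
    return concurrent * (memory_limit_mb + overhead_mb)

# 5 concurrent runs at the default 512 MB limit:
print(required_ram_mb(5, 512))  # 3840 MB, in line with the ~4 GB guideline
```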
### Horizontal Scaling
Multiple runner instances can serve the same TimeTiles deployment. Each runner is stateless, so scaling out is straightforward:
- Deploy additional runner instances on separate hosts.
- Place a load balancer in front of them.
- Point `SCRAPER_RUNNER_URL` to the load balancer.
- All runners must use the same `SCRAPER_API_KEY` and have the same base images built.
Each runner tracks active runs independently in memory. The concurrency limit (`SCRAPER_MAX_CONCURRENT`) applies per instance.
## Backup Considerations
Scraper output is transient. The CSV data returned from each run is either:
- Stored as a base64 field in the `scraper-runs` collection (for review), or
- Fed into the ingestion pipeline, which creates events in the database.
There is no persistent state on the runner itself. All durable data lives in the TimeTiles PostgreSQL database, which is covered by your standard database backup strategy.
What to back up:
| Data | Location | Backup Strategy |
|---|---|---|
| Scraper repos, scrapers, run history | PostgreSQL (apps/web) | Standard database backup |
| Scraper source code (git type) | External Git hosting | Backed up by Git host |
| Scraper source code (upload type) | PostgreSQL (apps/web) | Standard database backup |
| Runner configuration | .env file on runner host | Include in host backup or config management |
| Base images | Built from Dockerfiles in apps/timescrape/images/ | Rebuild from source, or push to a container registry |
## Security
### Podman Rootless vs Docker
The runner is designed for Podman rootless mode. Docker's daemon runs as root, and access to `/var/run/docker.sock` grants full root control of the host. Podman has no daemon and runs containers as unprivileged user processes.
Do not run this service with Docker unless you fully understand the security implications.
### Container Hardening Summary
Every scraper container is launched with multiple layers of defense:
| Protection | Podman Flag | Purpose |
|---|---|---|
| Drop all capabilities | `--cap-drop=ALL` | Block kernel capability exploits |
| Block privilege escalation | `--security-opt=no-new-privileges` | Prevent setuid/setgid escalation |
| Custom seccomp profile | `--security-opt=seccomp=...` | Restrict to ~100 allowed syscalls |
| Read-only filesystem | `--read-only` | Prevent on-disk persistence of malware |
| Writable /tmp with noexec | `--tmpfs=/tmp:rw,size=64m,noexec` | Prevent execution of downloaded binaries |
| Process limit | `--pids-limit=256` | Prevent fork bombs |
| Memory and CPU limits | `--memory`, `--cpus` | Prevent resource exhaustion |
| User namespace remapping | `--userns=auto` | Prevent container escape to root |
| Network isolation | `--network=scraper-sandbox` | Block lateral movement |
| External DNS only | `--dns=1.1.1.1` | Block internal service discovery |
### API Key Rotation
The `SCRAPER_API_KEY` is the sole authentication mechanism between TimeTiles and the runner. If compromised, an attacker could submit code for container execution (within the hardening limits above).
To rotate the key:
1. Generate a new key: `openssl rand -hex 32`
2. Update `SCRAPER_API_KEY` in both `apps/timescrape/.env` and `apps/web/.env`.
3. Restart both the runner and the TimeTiles web app.
Rotate the key periodically and whenever you suspect it may have been exposed.
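The rotation steps can be scripted so the two `.env` files cannot drift apart. The sketch below is our own helper, not a shipped tool; the file paths are the ones assumed throughout this guide, and restarting the two services remains a manual step.

```python
import re
import secrets
from pathlib import Path

def rotate_scraper_key(env_files=("apps/timescrape/.env", "apps/web/.env")) -> str:
    """Generate a fresh key and write it to every listed .env file."""
    new_key = secrets.token_hex(32)  # 64 hex chars, same as `openssl rand -hex 32`
    for path in env_files:
        env = Path(path)
        text = env.read_text()
        # Replace the existing SCRAPER_API_KEY line in place, leaving other lines untouched.
        text = re.sub(r"^SCRAPER_API_KEY=.*$", f"SCRAPER_API_KEY={new_key}",
                      text, flags=re.MULTILINE)
        env.write_text(text)
    return new_key
```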
### Network Architecture
```text
+-------------------------+      +---------------------+
|    Internal Network     |      |   scraper-sandbox   |
|                         |      |  (internet access)  |
|  PostgreSQL             |      |                     |
|  TimeTiles web app      |      |  Scraper containers |---> Internet
|                         |      |                     |
|  TimeScrape runner  <---|------|  (spawns containers)|
|   (API server)      ----|----->|                     |
+-------------------------+      +---------------------+
```

- The runner lives on the internal network (it needs to communicate with TimeTiles).
- Scraper containers live on the `scraper-sandbox` network (internet access only, no internal services).
- The `scraper-sandbox` network must not be able to reach PostgreSQL, the web app, or the runner API.
Verify isolation by running a test container:
```bash
# This should fail (cannot reach internal services)
podman run --rm --network=scraper-sandbox timescrape-python \
  python -c "import requests; requests.get('http://your-internal-host:3000', timeout=5)"

# This should succeed (can reach the internet)
podman run --rm --network=scraper-sandbox timescrape-python \
  python -c "import requests; print(requests.get('https://httpbin.org/get', timeout=10).status_code)"
```

## Troubleshooting
| Problem | Solution |
|---|---|
| `podman: command not found` | Install Podman. See the Installation section above. |
| Rootless check returns `false` | On macOS: `podman machine init && podman machine start`. On Linux: consult the Podman rootless docs. |
| Health check returns connection refused | Verify `SCRAPER_PORT` and that the service is running. Check firewall rules. |
| Runs fail with "image not found" | Build the base images: `podman build -t timescrape-python images/python/` and `podman build -t timescrape-node images/node/`. |
| Runs fail with "network not found" | Create the sandbox network: `podman network create scraper-sandbox`. |
| TimeTiles cannot reach the runner | Verify `SCRAPER_RUNNER_URL` is correct and the port is accessible from the TimeTiles host. |
| "API key must be at least 16 characters" | Set `SCRAPER_API_KEY` to a string of 16 or more characters. |
| Runs fail with "max concurrent runs reached" | Increase `SCRAPER_MAX_CONCURRENT` or wait for running scrapers to finish. |
| Runs fail with `INVALID_OUTPUT` | Check the scraper's stdout/stderr logs. The scraper may not be calling `output.save()`, or the output exceeds `SCRAPER_MAX_OUTPUT_SIZE_MB`. |
| Orphaned containers after runner restart | Containers are created with `--rm`, so they self-clean on exit. If the runner crashes mid-run, containers may linger. Clean up with `podman ps -aq --filter name=run- \| xargs podman rm -f`. |
| Scraper can reach internal services | Verify the `scraper-sandbox` network is properly isolated. Re-create it if needed. |
## Upgrading Base Images
When new library versions are available or you need to add a library:
1. Update the Dockerfile in `apps/timescrape/images/python/` or `apps/timescrape/images/node/`.
2. Rebuild the image: `podman build -t timescrape-python images/python/`.
3. New scraper runs will use the updated image immediately (no runner restart needed).
Old containers that are currently running are not affected — they continue with the image they were started with.
## Related Documentation
- Scrapers Overview — Feature overview and quick start
- Writing Scrapers — Scraper authoring guide
- Production Deployment — Main TimeTiles deployment guide
- Configuration — Environment variables reference