⚠️ Active Development Notice: TimeTiles is under active development. Information may be placeholder content or not up-to-date.

Scraper Deployment

This guide covers deploying the TimeScrape runner, the stateless service that executes scraper code inside hardened Podman containers. The runner is an optional component — deploy it only if you need scraper functionality.

Prerequisites

| Requirement | Version | Notes |
| --- | --- | --- |
| Podman | 4.0+ | Rootless mode required |
| Node.js | 24+ | For running the service |
| Git | Any recent | For cloning scraper repos at execution time |

The runner does not need access to the TimeTiles database. It communicates with TimeTiles only via HTTP using a shared API key.


Installation

1. Install Podman (Rootless)

macOS:

brew install podman
podman machine init
podman machine start

Ubuntu / Debian:

sudo apt update
sudo apt install -y podman

Verify rootless mode:

podman info --format '{{.Host.Security.Rootless}}'

This must return true. If it does not, consult the Podman rootless tutorial.

2. Build Base Images

The runner spawns containers from two pre-built base images. Build them from the apps/timescrape/images/ directory:

cd apps/timescrape

# Python runtime (requests, beautifulsoup4, lxml, pandas, cssselect)
podman build -t timescrape-python images/python/

# Node.js runtime (cheerio, axios)
podman build -t timescrape-node images/node/

The image names timescrape-python and timescrape-node are expected by the runner. Do not change the tags.

Verify:

podman images | grep timescrape

3. Create the Sandbox Network

Scraper containers run on an isolated network that allows internet access but blocks access to internal services:

podman network create scraper-sandbox

Verify:

podman network ls | grep scraper-sandbox

Without this network, scraper containers could reach the database, the web app, or other internal services.

4. Configure Environment

Create a .env file in apps/timescrape/:

# REQUIRED: Shared secret for API authentication (minimum 16 characters).
# Must match the SCRAPER_API_KEY value in the TimeTiles .env.
SCRAPER_API_KEY=your-secret-key-at-least-16-chars

# HTTP server port (default: 4000)
SCRAPER_PORT=4000

# Maximum simultaneous container runs (default: 3)
SCRAPER_MAX_CONCURRENT=3

# Default timeout per run in seconds (default: 300)
SCRAPER_DEFAULT_TIMEOUT=300

# Default memory limit per container in MB (default: 512)
SCRAPER_DEFAULT_MEMORY=512

# Maximum Git repo clone size in MB (default: 50)
SCRAPER_MAX_REPO_SIZE_MB=50

# Maximum CSV output file size in MB (default: 100)
SCRAPER_MAX_OUTPUT_SIZE_MB=100

# Temp directory for run workspaces (default: /tmp/timescrape)
SCRAPER_DATA_DIR=/tmp/timescrape

Generate a strong API key:

openssl rand -hex 32

5. Connect to TimeTiles

Add two variables to the TimeTiles .env file (apps/web/.env):

# URL where TimeTiles can reach the runner
SCRAPER_RUNNER_URL=http://localhost:4000

# Must match the runner's SCRAPER_API_KEY exactly
SCRAPER_API_KEY=your-secret-key-at-least-16-chars

Then enable the feature flag:

  1. Log in to TimeTiles as an admin.
  2. Go to Settings in the Payload dashboard.
  3. Enable enableScrapers.

Without the feature flag, users cannot create scraper repos and execution jobs will not run.


Running the Service

Development

From the monorepo root:

pnpm --filter scraper dev

The server starts on the configured port (default 4000).

Production: Direct on Host

Build and run directly on the host machine:

cd apps/timescrape
pnpm build
node dist/index.js

Use a process manager such as systemd or PM2 to keep the service running:

# /etc/systemd/system/timescrape-runner.service
[Unit]
Description=TimeScrape Runner
After=network.target

[Service]
Type=simple
User=timescrape
WorkingDirectory=/opt/timescrape-runner
EnvironmentFile=/opt/timescrape-runner/.env
ExecStart=/usr/bin/node dist/index.js
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target

Then enable and start the service:

sudo systemctl enable timescrape-runner
sudo systemctl start timescrape-runner

Production: Containerized Runner

Build and run the runner itself as a Podman container:

cd apps/timescrape
podman build -t timescrape-runner .
podman run -d \
  --name timescrape-runner \
  -p 4000:4000 \
  --env-file .env \
  -v /run/user/$(id -u)/podman/podman.sock:/run/podman/podman.sock \
  timescrape-runner

The runner needs access to the Podman socket to spawn scraper containers. When running the runner inside a container, mount the rootless Podman socket as shown above. When running directly on the host, no socket mount is needed — the runner invokes podman as a CLI command.

Health Check

Verify the service is running:

curl http://localhost:4000/health

Expected response:

{
  "status": "ok",
  "active_runs": 0,
  "timestamp": "2026-03-16T10:00:00.000Z"
}

Docker Compose Example

If you run TimeTiles with Docker Compose, you can add the runner as an additional service. Note that the runner requires Podman on the host, so this example runs the runner on the host network with access to the Podman socket.

# docker-compose.scraper.yml
# Use alongside your main docker-compose.prod.yml
services:
  timescrape-runner:
    build:
      context: ./apps/timescrape
      dockerfile: Dockerfile
    ports:
      - "4000:4000"
    environment:
      - SCRAPER_API_KEY=${SCRAPER_API_KEY}
      - SCRAPER_PORT=4000
      - SCRAPER_MAX_CONCURRENT=${SCRAPER_MAX_CONCURRENT:-3}
      - SCRAPER_DEFAULT_TIMEOUT=${SCRAPER_DEFAULT_TIMEOUT:-300}
      - SCRAPER_DEFAULT_MEMORY=${SCRAPER_DEFAULT_MEMORY:-512}
    volumes:
      # Mount Podman socket for container management
      - /run/user/1000/podman/podman.sock:/run/podman/podman.sock
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 30s
      timeout: 5s
      retries: 3

Start with:

docker compose -f docker-compose.prod.yml -f docker-compose.scraper.yml up -d

Environment Variables Reference

Runner Variables (apps/timescrape)

| Variable | Default | Required | Description |
| --- | --- | --- | --- |
| SCRAPER_API_KEY | — | Yes | Shared secret for API authentication. Minimum 16 characters. |
| SCRAPER_PORT | 4000 | No | HTTP server port. |
| SCRAPER_MAX_CONCURRENT | 3 | No | Maximum simultaneous container runs. Additional requests are rejected with 429. |
| SCRAPER_DEFAULT_TIMEOUT | 300 | No | Default timeout per run in seconds. Individual scrapers can override. |
| SCRAPER_DEFAULT_MEMORY | 512 | No | Default memory limit per container in MB. Individual scrapers can override. |
| SCRAPER_MAX_REPO_SIZE_MB | 50 | No | Maximum Git repo size for shallow clones. |
| SCRAPER_MAX_OUTPUT_SIZE_MB | 100 | No | Maximum CSV output file size. Runs exceeding this fail with INVALID_OUTPUT. |
| SCRAPER_DATA_DIR | /tmp/timescrape | No | Temporary directory for run workspaces. Cleaned up after each run. |
| NODE_ENV | development | No | Node.js environment mode. |

TimeTiles Variables (apps/web)

| Variable | Default | Required | Description |
| --- | --- | --- | --- |
| SCRAPER_RUNNER_URL | — | Yes (if scrapers enabled) | URL where TimeTiles can reach the runner (e.g., http://localhost:4000). |
| SCRAPER_API_KEY | — | Yes (if scrapers enabled) | Must match the runner's SCRAPER_API_KEY exactly. |

Monitoring

Logs

The runner logs to stdout in structured format. In production, pipe logs to your preferred log aggregation system.

With systemd:

journalctl -u timescrape-runner -f

With Podman:

podman logs -f timescrape-runner

Health Endpoint

The /health endpoint returns the current state of the runner:

curl -s http://localhost:4000/health | jq

{
  "status": "ok",
  "active_runs": 2,
  "timestamp": "2026-03-16T10:00:00.000Z"
}

Monitor active_runs to track utilization. If it consistently equals SCRAPER_MAX_CONCURRENT, consider increasing the limit or adding another runner instance.
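A minimal saturation check along these lines can run from cron or a monitoring agent. The JSON sample below is the documented /health shape, inlined for illustration; in practice fetch it with `health=$(curl -s http://localhost:4000/health)`. The threshold and what you do on saturation are up to you.

```shell
# Hypothetical saturation check against the /health payload.
health='{ "status": "ok", "active_runs": 2, "timestamp": "2026-03-16T10:00:00.000Z" }'
max=3   # your SCRAPER_MAX_CONCURRENT

# Extract active_runs without a JSON tool (POSIX sed is enough here)
active=$(printf '%s' "$health" | sed -n 's/.*"active_runs": *\([0-9][0-9]*\).*/\1/p')

if [ "$active" -ge "$max" ]; then
  echo "saturated: ${active}/${max} runs"
else
  echo "ok: ${active}/${max} runs"
fi
```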

Metrics Endpoint

The /metrics endpoint provides aggregate run statistics (no authentication required):

curl -s http://localhost:4000/metrics | jq

{
  "active_runs": 1,
  "total_runs": 145,
  "total_success": 130,
  "total_failed": 12,
  "total_timeout": 3,
  "uptime_seconds": 86400,
  "queue_capacity": 3
}

Counters reset when the runner process restarts. For persistent metrics, scrape this endpoint with Prometheus or a similar tool.
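Short of a full metrics stack, the counters can be post-processed in plain shell for an ad-hoc readout. The payload below is inlined from the example above; in practice fetch it with `metrics=$(curl -s http://localhost:4000/metrics)`.

```shell
# Quick success-rate readout from the /metrics counters (sample inlined).
metrics='{ "total_runs": 145, "total_success": 130, "total_failed": 12, "total_timeout": 3 }'

# Pull one integer field out of the JSON by name
get() { printf '%s' "$metrics" | sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p"; }

runs=$(get total_runs)
ok=$(get total_success)
echo "success rate: $(( ok * 100 / runs ))% of ${runs} runs"
```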

TimeTiles Health Integration

When SCRAPER_RUNNER_URL is configured, the TimeTiles health endpoint (/api/health) automatically pings the runner’s /health endpoint and includes the result in its response. If the runner is unreachable, the health check reports the runner as degraded.

Run History

All run outcomes are stored in the scraper-runs Payload collection. Query from the admin dashboard or via the REST API:

# Recent failed runs
curl -s "https://your-timetiles-instance/api/scraper-runs?where[status][equals]=failed&sort=-createdAt&limit=10" \
  -H "Authorization: Bearer YOUR_TOKEN"

Scaling

Vertical Scaling

Increase resources on a single runner:

  • More concurrent runs: Raise SCRAPER_MAX_CONCURRENT (ensure the host has enough memory and CPU).
  • More memory per run: Increase SCRAPER_DEFAULT_MEMORY or set per-scraper limits.
  • Longer timeouts: Increase SCRAPER_DEFAULT_TIMEOUT for slow scrapers.

Rule of thumb: each concurrent run needs the configured memory limit plus overhead. For 5 concurrent runs at 512 MB each, the host should have at least 4 GB of available RAM.
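The rule of thumb as arithmetic. The 300 MB per-run overhead below is an assumed figure (runner process, git clone, Podman bookkeeping), not a measured one; tune it for your workload.

```shell
# Back-of-the-envelope sizing for SCRAPER_MAX_CONCURRENT
concurrent=5
memory_mb=512     # SCRAPER_DEFAULT_MEMORY
overhead_mb=300   # assumed per-run overhead

total_mb=$(( concurrent * (memory_mb + overhead_mb) ))
echo "${concurrent} runs need roughly ${total_mb} MB of available RAM"
```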

Horizontal Scaling

Multiple runner instances can serve the same TimeTiles deployment. Each runner is stateless, so scaling out is straightforward:

  1. Deploy additional runner instances on separate hosts.
  2. Place a load balancer in front of them.
  3. Point SCRAPER_RUNNER_URL to the load balancer.
  4. All runners must use the same SCRAPER_API_KEY and have the same base images built.

Each runner tracks active runs independently in memory. The concurrency limit (SCRAPER_MAX_CONCURRENT) applies per instance.
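As a sketch, the load-balancer step could be a simple nginx upstream; the runner hostnames below are placeholders, and any TCP/HTTP load balancer works equally well.

```nginx
# Illustrative nginx config fronting two runner instances
upstream timescrape_runners {
    least_conn;                      # prefer the runner with fewer open requests
    server runner-1.internal:4000;
    server runner-2.internal:4000;
}

server {
    listen 4000;
    location / {
        proxy_pass http://timescrape_runners;
    }
}
```

SCRAPER_RUNNER_URL then points at this listener (e.g., http://lb-host:4000).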


Backup Considerations

Scraper output is transient. The CSV data returned from each run is either:

  • Stored as a base64 field in the scraper-runs collection (for review), or
  • Fed into the ingestion pipeline, which creates events in the database.

There is no persistent state on the runner itself. All durable data lives in the TimeTiles PostgreSQL database, which is covered by your standard database backup strategy.

What to back up:

| Data | Location | Backup Strategy |
| --- | --- | --- |
| Scraper repos, scrapers, run history | PostgreSQL (apps/web) | Standard database backup |
| Scraper source code (git type) | External Git hosting | Backed up by Git host |
| Scraper source code (upload type) | PostgreSQL (apps/web) | Standard database backup |
| Runner configuration | .env file on runner host | Include in host backup or config management |
| Base images | Built from Dockerfiles in apps/timescrape/images/ | Rebuild from source, or push to a container registry |

Security

Podman Rootless vs Docker

The runner is designed for Podman rootless mode. Docker’s daemon runs as root, and access to /var/run/docker.sock grants full root control of the host. Podman has no daemon and runs containers as unprivileged user processes.

Do not run this service with Docker unless you fully understand the security implications.

Container Hardening Summary

Every scraper container is launched with multiple layers of defense:

| Protection | Podman Flag | Purpose |
| --- | --- | --- |
| Drop all capabilities | --cap-drop=ALL | Block kernel capability exploits |
| Block privilege escalation | --security-opt=no-new-privileges | Prevent setuid/setgid escalation |
| Custom seccomp profile | --security-opt=seccomp=... | Restrict to ~100 allowed syscalls |
| Read-only filesystem | --read-only | Prevent persistence and malware |
| Writable /tmp with noexec | --tmpfs=/tmp:rw,size=64m,noexec | Prevent downloaded binary execution |
| Process limit | --pids-limit=256 | Prevent fork bombs |
| Memory and CPU limits | --memory, --cpus | Prevent resource exhaustion |
| User namespace remapping | --userns=auto | Prevent container escape to root |
| Network isolation | --network=scraper-sandbox | Block lateral movement |
| External DNS only | --dns=1.1.1.1 | Block internal service discovery |

API Key Rotation

The SCRAPER_API_KEY is the sole authentication mechanism between TimeTiles and the runner. If compromised, an attacker could submit code for container execution (within the hardening limits above).

To rotate the key:

  1. Generate a new key: openssl rand -hex 32
  2. Update SCRAPER_API_KEY in both apps/timescrape/.env and apps/web/.env.
  3. Restart both the runner and the TimeTiles web app.

Rotate the key periodically and whenever you suspect it may have been exposed.
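A rotation sketch, assuming plain KEY=value .env files. The demo below writes throwaway files under /tmp; in practice, point it at your real apps/timescrape/.env and apps/web/.env, and remember that restarting both services is still a separate step.

```shell
# Hypothetical rotation helper: writes the same fresh key into multiple .env files.
set -eu

rotate() {
  new_key="$1"; shift
  for env_file in "$@"; do
    # Replace the existing SCRAPER_API_KEY line in place (keeps a .bak copy)
    sed -i.bak "s|^SCRAPER_API_KEY=.*|SCRAPER_API_KEY=${new_key}|" "$env_file"
  done
}

# Demo against throwaway files (substitute your real .env paths):
printf 'SCRAPER_API_KEY=old-key-old-key-old\n' > /tmp/runner.env
printf 'SCRAPER_API_KEY=old-key-old-key-old\n' > /tmp/web.env
rotate "$(openssl rand -hex 32)" /tmp/runner.env /tmp/web.env
grep -c '^SCRAPER_API_KEY=' /tmp/runner.env   # still exactly one key line
```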

Network Architecture

+-------------------------+      +----------------------+
|    Internal Network     |      |   scraper-sandbox    |
|                         |      |  (internet access)   |
|  PostgreSQL             |      |                      |
|  TimeTiles web app      |      |  Scraper containers  |---> Internet
|                         |      |                      |
|  TimeScrape runner <----|------|  (spawns containers) |
|  (API server) ----------|----->|                      |
+-------------------------+      +----------------------+
  • The runner lives on the internal network (it needs to communicate with TimeTiles).
  • Scraper containers live on the scraper-sandbox network (internet access only, no internal services).
  • The scraper-sandbox network must not be able to reach PostgreSQL, the web app, or the runner API.

Verify isolation by running a test container:

# This should fail (cannot reach internal services)
podman run --rm --network=scraper-sandbox timescrape-python \
  python -c "import requests; requests.get('http://your-internal-host:3000', timeout=5)"

# This should succeed (can reach the internet)
podman run --rm --network=scraper-sandbox timescrape-python \
  python -c "import requests; print(requests.get('https://httpbin.org/get', timeout=10).status_code)"

Troubleshooting

| Problem | Solution |
| --- | --- |
| podman: command not found | Install Podman. See the Installation section above. |
| Rootless check returns false | On macOS: podman machine init && podman machine start. On Linux: consult the Podman rootless docs. |
| Health check returns connection refused | Verify SCRAPER_PORT and that the service is running. Check firewall rules. |
| Runs fail with "image not found" | Build base images: podman build -t timescrape-python images/python/ and podman build -t timescrape-node images/node/. |
| Runs fail with "network not found" | Create the sandbox network: podman network create scraper-sandbox. |
| TimeTiles cannot reach the runner | Verify SCRAPER_RUNNER_URL is correct and the port is accessible from the TimeTiles host. |
| "API key must be at least 16 characters" | Set SCRAPER_API_KEY to a string of 16 or more characters. |
| Runs fail with "max concurrent runs reached" | Increase SCRAPER_MAX_CONCURRENT or wait for running scrapers to finish. |
| Runs fail with INVALID_OUTPUT | Check the scraper's stdout/stderr logs. The scraper may not be calling output.save(), or the output exceeds SCRAPER_MAX_OUTPUT_SIZE_MB. |
| Orphaned containers after runner restart | Containers are created with --rm, so they self-clean on exit. If the runner crashes mid-run, containers may linger. Clean up with podman ps -a --filter name=run- \| grep -v CONTAINER \| awk '{print $1}' \| xargs podman rm -f. |
| Scraper can reach internal services | Verify the scraper-sandbox network is properly isolated. Re-create it if needed. |


Upgrading Base Images

When new library versions are available or you need to add a library:

  1. Update the Dockerfile in apps/timescrape/images/python/ or apps/timescrape/images/node/.
  2. Rebuild the image: podman build -t timescrape-python images/python/.
  3. New scraper runs will use the updated image immediately (no runner restart needed).

Old containers that are currently running are not affected — they continue with the image they were started with.
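For illustration only — the real Dockerfiles live in apps/timescrape/images/ and may differ — adding a library to the Python image could look like this (the base image tag and the html5lib addition are placeholders):

```dockerfile
# Illustrative sketch of images/python/Dockerfile; check the repo for the real one.
FROM python:3.12-slim

# Libraries available to scrapers; add new ones to this list
RUN pip install --no-cache-dir \
    requests beautifulsoup4 lxml pandas cssselect \
    html5lib   # example of an added library

# Run scraper code as an unprivileged user inside the container
RUN useradd --create-home scraper
USER scraper
WORKDIR /home/scraper
```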

