# Atalaya Uptime Monitor
Atalaya (Spanish for watchtower) is an uptime & status page monitoring service running on Cloudflare Workers and Durable Objects.
Thanks to the generous Cloudflare free tier, Atalaya provides a simple, customizable, self-hosted solution to monitor the status of public network services,
aimed at hobbyists and users who want more control for free and are comfortable with Cloudflare's ecosystem.
Live [example](https://uptime.ifconfig.es/).
:warning: 99% of the code was generated by an AI agent under human supervision, and I hadn't used TypeScript before. You have been warned!
- [Features](#features)
- [Architecture](#architecture)
- [Prerequisites](#prerequisites)
- [Setup](#setup)
- [Configuration](#configuration)
- [Settings](#settings)
- [Monitor Types](#monitor-types)
- [Regional Monitoring](#regional-monitoring)
- [Alerts](#alerts)
- [Status Page](#status-page)
- [Secret Management](#secret-management)
- [Security Notes](#security-notes)
- [Data Retention](#data-retention)
- [Development](#development)
- [Testing](#testing)
- [TODO](#todo)
```ASCII
🏴‍☠️
|
_ _|_ _
|;|_|;|_|;|
\\. . /
\\: . /
||: |
||:. |
||: .|
||: , |
||: |
||: . |
_||_ |
__ ----~ ~`---,
__ ,--~' ~~----____
```
## Features
- HTTP, TCP, and DNS monitoring.
- Regional monitoring from specific Cloudflare locations.
- Configurable retries with immediate retry on failure.
- Configurable failure thresholds before alerting.
- Custom alert templates for notifications (currently only webhooks are supported).
- Historical data stored in Cloudflare D1.
- Status page built with Astro 6 SSR, served by the same Worker via Static Assets.
- 90-day uptime history with daily bars.
- Response time charts (uPlot) with downtime bands.
- Basic auth or public access modes.
- Dark/light mode.
## Architecture
The project is an npm workspace with a single Cloudflare Worker that handles everything:
```text
atalaya/
  src/          Worker source (monitoring engine, JSON API, auth, Astro SSR delegation)
  status-page/  Astro 6 SSR source (status page UI, built and served as static assets)
```
- **Worker** runs cron-triggered health checks, stores results in D1, enforces basic auth on all other routes, serves static assets (CSS, JS) via the `ASSETS` binding, and delegates to the Astro SSR handler for page rendering.
- **Regional checks** run on Durable Objects.
- **Pages** is an Astro 6 SSR site built into `status-page/dist/`. It accesses D1 directly via `import { env } from 'cloudflare:workers'` — no service binding needed since everything runs in the same Worker.
## Prerequisites
- Node.js 22+
- [Wrangler](https://developers.cloudflare.com/workers/wrangler/install-and-update/) CLI
## Setup
1. Install dependencies:
```bash
npm install
```
2. Create the configuration file (`wrangler.toml`):
```bash
cp wrangler.example.toml wrangler.toml
```
3. Create D1 database:
```bash
wrangler d1 create atalaya
```
**Update `database_id`** in `wrangler.toml` with the ID printed by the previous command.
4. Run migrations:
```bash
wrangler d1 migrations apply atalaya --remote
```
5. Configure alerts and monitors in `wrangler.toml`.
**For regional monitoring:** Ensure Durable Objects are configured in `wrangler.toml`. The example configuration (`wrangler.example.toml`) includes the necessary bindings and migrations.
**The status page is disabled by default**. To enable it, see the "Status Page" section in the configuration below.
6. Deploy:
```bash
npm run deploy
```
This builds the Astro site and deploys the Worker with static assets in a single step.
## Configuration
### Settings
Default values:
```yaml
settings:
  title: 'Atalaya Uptime Monitor' # Status page title
  default_retries: 2 # Retry attempts on failure
  default_retry_delay_ms: 1000 # Delay between retries
  default_timeout_ms: 5000 # Request timeout
  default_failure_threshold: 2 # Failures before alerting
```
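As a rough illustration of how `default_failure_threshold` might interact with check results, the TypeScript sketch below tracks consecutive failures and only flags an alert once the threshold is reached. The `MonitorState` type and `applyCheckResult` function are hypothetical names, not taken from the Atalaya source, and the transition rules are an assumption.

```typescript
// Hypothetical per-monitor state; not the actual Atalaya schema.
interface MonitorState {
  status: 'up' | 'down';
  consecutiveFailures: number;
}

interface Transition {
  state: MonitorState;
  alert: 'down' | 'recovery' | null;
}

// Assumed semantics: a monitor flips to "down" once consecutive failed
// checks reach the threshold, and back to "up" on the first success.
function applyCheckResult(
  prev: MonitorState,
  checkOk: boolean,
  failureThreshold: number,
): Transition {
  if (checkOk) {
    // A success resets the counter; alert only if we were down before.
    const alert = prev.status === 'down' ? 'recovery' : null;
    return { state: { status: 'up', consecutiveFailures: 0 }, alert };
  }
  const consecutiveFailures = prev.consecutiveFailures + 1;
  const reached = consecutiveFailures >= failureThreshold;
  const status = reached ? 'down' : prev.status;
  // Alert only on the up -> down transition, not on every later failure.
  const alert = reached && prev.status === 'up' ? 'down' : null;
  return { state: { status, consecutiveFailures }, alert };
}
```

With a threshold of 2, a single failed check leaves the monitor `up` and silent; the second consecutive failure produces one `down` alert, and further failures stay quiet until a recovery.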
### Per-Monitor Overrides
Each monitor can override the global `default_*` settings:
```yaml
- name: 'critical-api'
  type: http
  target: 'https://api.example.com/health'
  timeout_ms: 10000 # Override global default_timeout_ms
  retries: 3 # Override global default_retries
  retry_delay_ms: 500 # Override global default_retry_delay_ms
  failure_threshold: 1 # Override global default_failure_threshold
  alerts: ['alert']
```
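The override rule can be pictured as a simple merge of monitor values over the global defaults. The sketch below is illustrative only: the field names mirror the YAML keys above, while the interfaces and `resolveOptions` are hypothetical and not taken from the Atalaya source.

```typescript
// Field names mirror the YAML keys; the types themselves are assumptions.
interface Settings {
  default_retries: number;
  default_retry_delay_ms: number;
  default_timeout_ms: number;
  default_failure_threshold: number;
}

interface MonitorOverrides {
  retries?: number;
  retry_delay_ms?: number;
  timeout_ms?: number;
  failure_threshold?: number;
}

interface EffectiveOptions {
  retries: number;
  retry_delay_ms: number;
  timeout_ms: number;
  failure_threshold: number;
}

// Use the monitor's value when present, otherwise the global default.
// `??` keeps explicit zeros (e.g. retries: 0) instead of discarding them.
function resolveOptions(settings: Settings, monitor: MonitorOverrides): EffectiveOptions {
  return {
    retries: monitor.retries ?? settings.default_retries,
    retry_delay_ms: monitor.retry_delay_ms ?? settings.default_retry_delay_ms,
    timeout_ms: monitor.timeout_ms ?? settings.default_timeout_ms,
    failure_threshold: monitor.failure_threshold ?? settings.default_failure_threshold,
  };
}
```

Using `??` rather than `||` is a deliberate choice in this sketch, so that a monitor can set a value such as `retries: 0` without silently falling back to the default.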
### Monitor Types
**HTTP**
```yaml
- name: 'api-health'
  type: http
  target: 'https://api.example.com/health'
  method: GET
  expected_status: 200
  headers: # optional, merged with default User-Agent: atalaya-uptime
    Authorization: 'Bearer ${API_TOKEN}'
    Accept: 'application/json'
  alerts: ['alert']
```
All HTTP checks send `User-Agent: atalaya-uptime` by default. Monitor-level `headers` are merged with this default; if a monitor sets its own `User-Agent`, it overrides the default.
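A minimal sketch of the merge described above, assuming header names are compared case-insensitively (as HTTP header names are), so that a monitor-level `user-agent` also replaces the default. `DEFAULT_HEADERS` and `mergeHeaders` are hypothetical names used for illustration.

```typescript
const DEFAULT_HEADERS: Record<string, string> = { 'User-Agent': 'atalaya-uptime' };

// Merge monitor headers over the defaults. Keys are lower-cased for
// comparison only; the spelling of the winning entry is preserved.
function mergeHeaders(monitorHeaders: Record<string, string> = {}): Record<string, string> {
  const merged = new Map<string, [string, string]>();
  for (const [name, value] of Object.entries(DEFAULT_HEADERS)) {
    merged.set(name.toLowerCase(), [name, value]);
  }
  for (const [name, value] of Object.entries(monitorHeaders)) {
    merged.set(name.toLowerCase(), [name, value]);
  }
  const result: Record<string, string> = {};
  merged.forEach(([name, value]) => {
    result[name] = value;
  });
  return result;
}
```

Calling `mergeHeaders({ Accept: 'application/json' })` keeps the default `User-Agent`, while `mergeHeaders({ 'user-agent': 'custom' })` returns only the monitor's value.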
**TCP**
```yaml
- name: 'database'
  type: tcp
  target: 'db.example.com:5432'
  alerts: ['alert']
```
**DNS**
```yaml
- name: 'dns-check'
  type: dns
  target: 'example.com'
  record_type: A
  expected_values: ['93.184.216.34']
  alerts: ['alert']
```
### Regional Monitoring
Atalaya supports running checks from specific Cloudflare regions using Durable Objects. This allows you to test your services from different geographic locations, useful for:
- Testing CDN performance from edge locations
- Verifying geo-blocking configurations
- Measuring regional latency differences
- Validating multi-region deployments
**Valid Region Codes:**
- `weur`: Western Europe
- `enam`: Eastern North America
- `wnam`: Western North America
- `apac`: Asia Pacific
- `eeur`: Eastern Europe
- `oc`: Oceania
- `safr`: South Africa
- `me`: Middle East
- `sam`: South America
**Example:**
```yaml
- name: 'api-eu'
  type: http
  target: 'https://api.example.com/health'
  region: 'weur' # Run from Western Europe
  method: GET
  expected_status: 200
  alerts: ['alert']
- name: 'api-us'
  type: http
  target: 'https://api.example.com/health'
  region: 'enam' # Run from Eastern North America
  method: GET
  expected_status: 200
  alerts: ['alert']
```
**How it works:**
When a monitor specifies a `region`, Atalaya creates a Cloudflare Durable Object in that region, runs the check from there, and returns the result. Durable Objects are terminated after use to conserve resources. If the regional check fails, Atalaya falls back to running the check from the Worker's default region.
**Note:** Regional monitoring requires Durable Objects to be configured in your `wrangler.toml`. See the example configuration for setup details.
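The fallback behaviour can be sketched as follows. This is not the Atalaya implementation: `CheckRunner` and `runWithFallback` are hypothetical, the Durable Object wiring is omitted, and the sketch assumes that "fails" means the regional runner threw (for example because the Durable Object could not be reached) rather than that the target was down.

```typescript
interface CheckResult {
  ok: boolean;
  responseTimeMs: number;
  ranFrom: string; // label of the location that executed the check
}

type CheckRunner = () => Promise<CheckResult>;

// Try the regional runner first; if it throws, fall back to the local
// runner. A monitor without a region passes `null` and runs locally.
async function runWithFallback(
  regional: CheckRunner | null,
  local: CheckRunner,
): Promise<CheckResult> {
  if (regional === null) {
    return local();
  }
  try {
    return await regional();
  } catch {
    return local();
  }
}
```

Keeping the two runners as injected functions makes the fallback easy to exercise in tests without a real Durable Object.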
### Alerts
Alerts are configured as a top-level array. Currently only webhook alerts are supported.
```yaml
alerts:
  - name: 'slack'
    type: webhook
    url: 'https://hooks.slack.com/services/xxx'
    method: POST
    headers:
      Content-Type: 'application/json'
    body_template: |
      {"text": "Monitor {{monitor.name}} is {{status.current}}"}
```
Template variables: `event`, `monitor.name`, `monitor.type`, `monitor.target`, `status.current`, `status.previous`, `status.consecutive_failures`, `status.last_status_change`, `status.downtime_duration_seconds`, `check.error`, `check.timestamp`, `check.response_time_ms`, `check.attempts`
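To illustrate how `{{...}}` placeholders could be expanded, here is a small sketch that looks up dotted paths in a nested context object. `renderTemplate` is a hypothetical helper, and leaving unknown variables untouched is an assumption, not documented Atalaya behaviour.

```typescript
// Replace {{dotted.path}} placeholders with values from a nested context.
// Unknown paths are left as-is (an assumption made for this sketch).
function renderTemplate(template: string, context: Record<string, unknown>): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match: string, path: string) => {
    let value: unknown = context;
    for (const key of path.split('.')) {
      if (typeof value !== 'object' || value === null || !(key in value)) {
        return match;
      }
      value = (value as Record<string, unknown>)[key];
    }
    return String(value);
  });
}
```

Note that this sketch inserts values verbatim; a value containing quotes would need JSON escaping before being placed inside a JSON body.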
### Status Page
The status page is an Astro 6 SSR site (under `status-page/`) served by the same Worker. It accesses D1 directly and renders monitor status, uptime history, and response time charts.
**Configuration (via Wrangler secrets on the Worker):**
```bash
# Set credentials for basic auth
wrangler secret put STATUS_USERNAME
wrangler secret put STATUS_PASSWORD
# Or make it public
wrangler secret put STATUS_PUBLIC # Set value to "true"
```
**Access rules:**
- If `STATUS_PUBLIC` is `"true"`: public access allowed
- If credentials are set: basic auth required
- Otherwise: 403 Forbidden
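The access rules above amount to a small decision function. The sketch below is hypothetical (`StatusEnv`, `decideAccess`, and the `challenge` outcome are illustrative names); it assumes the Basic auth header has already been parsed into a username and password, and that a missing or wrong credential leads to an authentication challenge.

```typescript
interface StatusEnv {
  STATUS_PUBLIC?: string;
  STATUS_USERNAME?: string;
  STATUS_PASSWORD?: string;
}

interface Credentials {
  username: string;
  password: string;
}

type AccessDecision = 'allow' | 'challenge' | 'forbidden';

// Mirrors the three access rules, evaluated in order.
function decideAccess(env: StatusEnv, provided: Credentials | null): AccessDecision {
  if (env.STATUS_PUBLIC === 'true') {
    return 'allow'; // public mode: no credentials needed
  }
  if (!env.STATUS_USERNAME || !env.STATUS_PASSWORD) {
    return 'forbidden'; // neither public nor configured: 403
  }
  const matches =
    provided !== null &&
    provided.username === env.STATUS_USERNAME &&
    provided.password === env.STATUS_PASSWORD;
  return matches ? 'allow' : 'challenge';
}
```

A production implementation would likely compare credentials in constant time; the plain `===` comparison here is only for readability.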
## Secret Management
Secrets are managed via Cloudflare's secret system. To add a new secret:
1. Set the secret value:
```bash
wrangler secret put SECRET_NAME
```
2. Use it in config with `${SECRET_NAME}` syntax.
```yaml
alerts:
  - name: 'slack'
    type: webhook
    url: 'https://hooks.slack.com/services/${SLACK_PATH}'
    method: POST
    headers:
      Authorization: 'Bearer ${WEBHOOK_TOKEN}'
    body_template: |
      {"text": "{{monitor.name}} is {{status.current}}"}
monitors:
  - name: 'private-api'
    type: http
    target: 'https://api.example.com/health'
    method: GET
    expected_status: 200
    headers:
      Authorization: 'Bearer ${API_KEY}'
    alerts: ['slack']
```
**Note:** `${VAR}` is for secrets (resolved at startup). `{{var}}` is for alert body templates (resolved per-alert).
### Security Notes
- Secrets are never logged or exposed in check results
- Unresolved `${VAR}` placeholders remain as-is (useful for debugging missing secrets)
- Worker secrets are encrypted at rest by Cloudflare
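A minimal sketch of `${VAR}` substitution consistent with the notes above: known secrets are replaced, unresolved placeholders stay as-is, and `{{var}}` template syntax is not touched. `resolveSecrets` is a hypothetical helper, and the accepted name pattern is an assumption.

```typescript
// Replace ${NAME} with the matching secret. Placeholders without a
// matching secret are returned unchanged, which makes them easy to spot.
function resolveSecrets(text: string, secrets: Record<string, string | undefined>): string {
  return text.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (match: string, name: string) => {
    const value = secrets[name];
    return value === undefined ? match : value;
  });
}
```

For example, resolving `'Bearer ${API_KEY}'` with no `API_KEY` secret set returns the string unchanged, so the missing secret shows up literally in the outgoing header.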
## Data Retention
- Raw check results: 7 days
- Hourly aggregates: 90 days
An hourly cron job aggregates raw data and cleans up old records automatically.
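The aggregation step can be pictured with the in-memory sketch below. Atalaya stores its data in D1, so the real job presumably works on database rows; the `RawCheck` and `HourlyAggregate` shapes, the function names, and the choice of aggregated fields are all assumptions made for illustration.

```typescript
interface RawCheck {
  monitor: string;
  timestamp: number; // Unix seconds
  ok: boolean;
  responseTimeMs: number;
}

interface HourlyAggregate {
  monitor: string;
  hourStart: number; // Unix seconds, aligned to the hour
  total: number;
  successes: number;
  avgResponseTimeMs: number;
}

const RAW_RETENTION_SECONDS = 7 * 24 * 3600; // raw results: 7 days

// Group raw checks by monitor and hour, counting successes and
// averaging response times within each bucket.
function aggregateHourly(checks: RawCheck[]): HourlyAggregate[] {
  const buckets = new Map<string, { monitor: string; hourStart: number; total: number; successes: number; sumMs: number }>();
  for (const check of checks) {
    const hourStart = Math.floor(check.timestamp / 3600) * 3600;
    const key = `${check.monitor}:${hourStart}`;
    const bucket = buckets.get(key) ?? { monitor: check.monitor, hourStart, total: 0, successes: 0, sumMs: 0 };
    bucket.total += 1;
    bucket.successes += check.ok ? 1 : 0;
    bucket.sumMs += check.responseTimeMs;
    buckets.set(key, bucket);
  }
  const result: HourlyAggregate[] = [];
  buckets.forEach((b) => {
    result.push({
      monitor: b.monitor,
      hourStart: b.hourStart,
      total: b.total,
      successes: b.successes,
      avgResponseTimeMs: b.sumMs / b.total,
    });
  });
  return result;
}

// True when a raw result is older than the raw retention window.
function isExpiredRaw(timestamp: number, now: number): boolean {
  return now - timestamp > RAW_RETENTION_SECONDS;
}
```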
## Development
```bash
# Run worker locally
wrangler dev --test-scheduled
# Run status page locally
npm run dev:pages
# Trigger cron manually
curl "http://localhost:8787/__scheduled?cron=*+*+*+*+*"
```
### Testing
```bash
# First build the status page
npm run build:pages
# Worker tests
npm run test
# Status page tests
npm run test:pages
# Type checking and linting
npm run check # worker
npm run check:pages # pages (astro check + tsc)
```
## TODO
- [ ] Add support for TLS checks (certificate validity, expiration).
- [ ] Refine the status page to look... well... less AI-generated.
- [ ] Initial support for incident management (manual status overrides, incident timeline).
- [ ] Branded status page.
- [ ] Add support for notifications other than webhooks.