# Atalaya Uptime Monitor

Atalaya (Spanish for watchtower) is an uptime & status page monitoring service running on Cloudflare Workers and Durable Objects.

Thanks to the generous Cloudflare free tier, Atalaya provides a simple, customizable, self-hosted solution to monitor the status of public network services,
aimed at hobbyists and users who want more control for free and are comfortable with Cloudflare's ecosystem.

Live [example](https://uptime.ifconfig.es/).

:warning: 99% of the code was generated by an AI agent under human supervision. Bear in mind that I hadn't used TypeScript before. You have been warned!

- [Features](#features)
- [Architecture](#architecture)
- [Prerequisites](#prerequisites)
- [Setup](#setup)
- [Configuration](#configuration)
  - [Settings](#settings)
  - [Per-Monitor Overrides](#per-monitor-overrides)
  - [Monitor Types](#monitor-types)
  - [Regional Monitoring](#regional-monitoring)
  - [Alerts](#alerts)
  - [Status Page](#status-page)
- [Secret Management](#secret-management)
  - [Security Notes](#security-notes)
- [Data Retention](#data-retention)
- [Development](#development)
  - [Testing](#testing)
- [TODO](#todo)

```ASCII
                    🏴‍☠️
                    |
                _  _|_  _
               |;|_|;|_|;|
               \\.    .  /
                \\:  .  /
                 ||:   |
                 ||:.  |
                 ||:  .|
                 ||: , |
                 ||:   |
                 ||: . |
                _||_   |
        __ ----~    ~`---,
__ ,--~'                  ~~----____
```

## Features

- HTTP, TCP, and DNS monitoring.
- Regional monitoring from specific Cloudflare locations.
- Configurable retries with immediate retry on failure.
- Configurable failure thresholds before alerting.
- Custom alert templates for notifications (currently only webhooks are supported).
- Historical data stored in Cloudflare D1.
- Status page built with Astro 6 SSR, served by the same Worker via Static Assets.
  - 90-day uptime history with daily bars.
  - Response time charts (uPlot) with downtime bands.
  - Basic auth or public access modes.
  - Dark/light mode.

## Architecture

The project is an npm workspace with a single Cloudflare Worker that handles everything:

```text
atalaya/
  src/           Worker source (monitoring engine, JSON API, auth, Astro SSR delegation)
  status-page/   Astro 6 SSR source (status page UI, built and served as static assets)
```

- **Worker** runs cron-triggered health checks, stores results in D1, enforces basic auth on all other routes, serves static assets (CSS, JS) via the `ASSETS` binding, and delegates to the Astro SSR handler for page rendering.
- **Regional checks** run on Durable Objects.
- **Pages** is an Astro 6 SSR site built into `status-page/dist/`. It accesses D1 directly via `import { env } from 'cloudflare:workers'` — no service binding needed since everything runs in the same Worker.

## Prerequisites

- Node.js 22+
- [Wrangler](https://developers.cloudflare.com/workers/wrangler/install-and-update/) CLI

## Setup

1. Install dependencies:

   ```bash
   npm install
   ```

2. Create the configuration file (`wrangler.toml`):

   ```bash
   cp wrangler.example.toml wrangler.toml
   ```

3. Create D1 database:

   ```bash
   wrangler d1 create atalaya
   ```

   **Update `database_id`** in `wrangler.toml`.

4. Run migrations:

   ```bash
   wrangler d1 migrations apply atalaya --remote
   ```

5. Configure alerts and monitors in `wrangler.toml`.

   **For regional monitoring:** Ensure Durable Objects are configured in `wrangler.toml`. The example configuration (`wrangler.example.toml`) includes the necessary bindings and migrations.

   **The status page is disabled by default.** To enable it, see the [Status Page](#status-page) section below.

6. Deploy:

   ```bash
   npm run deploy
   ```

   This builds the Astro site and deploys the Worker with static assets in a single step.

## Configuration

### Settings

Default values:

```yaml
settings:
  title: 'Atalaya Uptime Monitor' # Status page title
  default_retries: 2 # Retry attempts on failure
  default_retry_delay_ms: 1000 # Delay between retries
  default_timeout_ms: 5000 # Request timeout
  default_failure_threshold: 2 # Failures before alerting
```
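
The effect of `default_failure_threshold` can be pictured as a small counter per monitor. The sketch below is illustrative only and is not taken from the Atalaya source: `MonitorState`, `Outcome`, and `applyCheckResult` are hypothetical names, and the recovery alert is an assumption.

```typescript
// Hypothetical per-monitor state kept between checks (not Atalaya's real schema).
interface MonitorState {
  consecutiveFailures: number;
  alerted: boolean;
}

type Outcome = 'none' | 'alert-down' | 'alert-recovery';

// Returns the next state and whether an alert should be sent for this check.
function applyCheckResult(
  state: MonitorState,
  ok: boolean,
  failureThreshold: number,
): { state: MonitorState; outcome: Outcome } {
  if (ok) {
    // Assumption: a recovery alert is sent only if a down alert was sent before.
    const outcome: Outcome = state.alerted ? 'alert-recovery' : 'none';
    return { state: { consecutiveFailures: 0, alerted: false }, outcome };
  }
  const consecutiveFailures = state.consecutiveFailures + 1;
  // Alert once, on the check that reaches the threshold.
  const shouldAlert = !state.alerted && consecutiveFailures >= failureThreshold;
  return {
    state: { consecutiveFailures, alerted: state.alerted || shouldAlert },
    outcome: shouldAlert ? 'alert-down' : 'none',
  };
}
```

With a threshold of 2, a single failed check changes the counter but sends nothing; the second consecutive failure is the one that triggers the alert.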

### Per-Monitor Overrides

Each monitor can override the global default_* settings:

```yaml
- name: 'critical-api'
  type: http
  target: 'https://api.example.com/health'
  timeout_ms: 10000 # Override global default_timeout_ms
  retries: 3 # Override global default_retries
  retry_delay_ms: 500 # Override global default_retry_delay_ms
  failure_threshold: 1 # Override global default_failure_threshold
  alerts: ['alert']
```
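
As a rough sketch of how overrides could be resolved against the global defaults, the function below picks the monitor value when present and the `default_*` value otherwise. The field names mirror the YAML above; the interfaces and `resolveSettings` itself are hypothetical and not Atalaya's actual code.

```typescript
interface GlobalDefaults {
  default_retries: number;
  default_retry_delay_ms: number;
  default_timeout_ms: number;
  default_failure_threshold: number;
}

interface MonitorOverrides {
  retries?: number;
  retry_delay_ms?: number;
  timeout_ms?: number;
  failure_threshold?: number;
}

interface EffectiveSettings {
  retries: number;
  retry_delay_ms: number;
  timeout_ms: number;
  failure_threshold: number;
}

// `??` only falls back when the override is undefined or null,
// so an explicit `retries: 0` on a monitor is respected.
function resolveSettings(defaults: GlobalDefaults, monitor: MonitorOverrides): EffectiveSettings {
  return {
    retries: monitor.retries ?? defaults.default_retries,
    retry_delay_ms: monitor.retry_delay_ms ?? defaults.default_retry_delay_ms,
    timeout_ms: monitor.timeout_ms ?? defaults.default_timeout_ms,
    failure_threshold: monitor.failure_threshold ?? defaults.default_failure_threshold,
  };
}
```

Using `??` instead of `||` is a deliberate choice in this sketch: with `||`, a monitor that sets `retries: 0` would silently get the global default.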

### Monitor Types

**HTTP**

```yaml
- name: 'api-health'
  type: http
  target: 'https://api.example.com/health'
  method: GET
  expected_status: 200
  headers: # optional, merged with default User-Agent: atalaya-uptime
    Authorization: 'Bearer ${API_TOKEN}'
    Accept: 'application/json'
  alerts: ['alert']
```

All HTTP checks send `User-Agent: atalaya-uptime` by default. Monitor-level `headers` are merged with this default; if a monitor sets its own `User-Agent`, it overrides the default.
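
The merge rule can be sketched as follows. This is an illustration under assumptions, not the project's implementation: `mergeHeaders` is a hypothetical helper, and matching header names case-insensitively is an assumption based on how HTTP treats header names.

```typescript
const DEFAULT_HEADERS: Record<string, string> = { 'User-Agent': 'atalaya-uptime' };

// Merge monitor-level headers over the defaults. Names are compared
// case-insensitively, so `user-agent` replaces the default `User-Agent`.
function mergeHeaders(monitorHeaders: Record<string, string> = {}): Record<string, string> {
  const byLowerName: Record<string, { name: string; value: string }> = {};
  for (const source of [DEFAULT_HEADERS, monitorHeaders]) {
    for (const name of Object.keys(source)) {
      // Later sources win; the monitor's spelling of the name is kept.
      byLowerName[name.toLowerCase()] = { name, value: source[name] };
    }
  }
  const merged: Record<string, string> = {};
  for (const key of Object.keys(byLowerName)) {
    merged[byLowerName[key].name] = byLowerName[key].value;
  }
  return merged;
}
```

For example, `mergeHeaders({ Accept: 'application/json' })` keeps the default `User-Agent` and adds `Accept`, while `mergeHeaders({ 'user-agent': 'custom' })` yields a single `user-agent` header.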

**TCP**

```yaml
- name: 'database'
  type: tcp
  target: 'db.example.com:5432'
  alerts: ['alert']
```
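
A TCP `target` is written as `host:port`. A minimal sketch of parsing it is shown below; `parseTcpTarget` is a hypothetical helper, and the validation rules (numeric port in 1 to 65535, no IPv6 literal support) are assumptions.

```typescript
interface TcpTarget {
  host: string;
  port: number;
}

// Split "host:port" on the last colon and validate the port range.
// Bracketed IPv6 literals are intentionally not handled in this sketch.
function parseTcpTarget(target: string): TcpTarget {
  const separator = target.lastIndexOf(':');
  if (separator <= 0) {
    throw new Error(`TCP target must be "host:port", got "${target}"`);
  }
  const host = target.slice(0, separator);
  const portText = target.slice(separator + 1);
  const port = Number(portText);
  if (!/^\d+$/.test(portText) || port < 1 || port > 65535) {
    throw new Error(`Invalid port in TCP target "${target}"`);
  }
  return { host, port };
}
```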

**DNS**

```yaml
- name: 'dns-check'
  type: dns
  target: 'example.com'
  record_type: A
  expected_values: ['93.184.216.34']
  alerts: ['alert']
```
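
How resolved records are compared with `expected_values` is not specified here. One plausible reading, sketched below purely as an assumption, is an order-insensitive exact match; `dnsValuesMatch` is a hypothetical helper.

```typescript
// Assumption: the check passes when the resolved values and the expected
// values contain the same entries, regardless of order or letter case.
function dnsValuesMatch(resolved: string[], expected: string[]): boolean {
  const normalize = (values: string[]): string[] =>
    values.map((value) => value.trim().toLowerCase()).sort();
  const a = normalize(resolved);
  const b = normalize(expected);
  return a.length === b.length && a.every((value, index) => value === b[index]);
}
```

DNS servers may return records in any order, which is why this sketch sorts both sides before comparing.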

### Regional Monitoring

Atalaya supports running checks from specific Cloudflare regions using Durable Objects. This allows you to test your services from different geographic locations, useful for:

- Testing CDN performance from edge locations
- Verifying geo-blocking configurations
- Measuring regional latency differences
- Validating multi-region deployments

**Valid Region Codes:**

- `weur`: Western Europe
- `enam`: Eastern North America
- `wnam`: Western North America
- `apac`: Asia Pacific
- `eeur`: Eastern Europe
- `oc`: Oceania
- `safr`: South Africa
- `me`: Middle East
- `sam`: South America
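
If you generate or lint configuration programmatically, the list above can be encoded as a type guard. This is a sketch with hypothetical names (`REGION_NAMES`, `isRegionCode`), not part of Atalaya's API.

```typescript
// Region codes and labels copied from the list above.
const REGION_NAMES = {
  weur: 'Western Europe',
  enam: 'Eastern North America',
  wnam: 'Western North America',
  apac: 'Asia Pacific',
  eeur: 'Eastern Europe',
  oc: 'Oceania',
  safr: 'South Africa',
  me: 'Middle East',
  sam: 'South America',
} as const;

type RegionCode = keyof typeof REGION_NAMES;

// Narrow an arbitrary string to a known region code.
// hasOwnProperty avoids false positives such as "toString".
function isRegionCode(value: string): value is RegionCode {
  return Object.prototype.hasOwnProperty.call(REGION_NAMES, value);
}
```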

**Example:**

```yaml
- name: 'api-eu'
  type: http
  target: 'https://api.example.com/health'
  region: 'weur' # Run from Western Europe
  method: GET
  expected_status: 200
  alerts: ['alert']

- name: 'api-us'
  type: http
  target: 'https://api.example.com/health'
  region: 'enam' # Run from Eastern North America
  method: GET
  expected_status: 200
  alerts: ['alert']
```

**How it works:**
When a monitor specifies a `region`, Atalaya creates a Cloudflare Durable Object in that region, runs the check from there, and returns the result. Durable Objects are terminated after use to conserve resources. If the regional check fails, it falls back to running the check from the worker's default region.

**Note:** Regional monitoring requires Durable Objects to be configured in your `wrangler.toml`. See the example configuration for setup details.
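
The fallback behaviour can be sketched independently of the Workers runtime. Everything below is hypothetical (`CheckResult`, `runWithRegionalFallback`, and both runner callbacks), and it assumes that "fails" means the regional call itself throws, not that the monitored target is down. In a real Worker, the regional runner would be the place to obtain a Durable Object stub with a location hint.

```typescript
interface CheckResult {
  ok: boolean;
  ranFrom: string; // region code, or "default" for the Worker's own location
}

// Run the check in the requested region; if that call throws,
// run it from the Worker's default location instead.
async function runWithRegionalFallback(
  region: string | undefined,
  runRegional: (region: string) => Promise<CheckResult>, // hypothetical: calls the Durable Object
  runLocal: () => Promise<CheckResult>, // hypothetical: runs the check in the Worker
): Promise<CheckResult> {
  if (!region) {
    return runLocal();
  }
  try {
    return await runRegional(region);
  } catch {
    // Assumption: any error reaching or running the regional check triggers the fallback.
    return runLocal();
  }
}
```

Note that a result with `ok: false` is returned as-is in this sketch: a target that is genuinely down in one region should be reported, not masked by a second check from elsewhere.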

### Alerts

Alerts are configured as a top-level array. Currently only webhook alerts are supported.

```yaml
alerts:
  - name: 'slack'
    type: webhook
    url: 'https://hooks.slack.com/services/xxx'
    method: POST
    headers:
      Content-Type: 'application/json'
    body_template: |
      {"text": "Monitor {{monitor.name}} is {{status.current}}"}
```

Template variables: `event`, `monitor.name`, `monitor.type`, `monitor.target`, `status.current`, `status.previous`, `status.consecutive_failures`, `status.last_status_change`, `status.downtime_duration_seconds`, `check.error`, `check.timestamp`, `check.response_time_ms`, `check.attempts`
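
To illustrate how `{{...}}` placeholders map onto these variables, here is a minimal rendering sketch. It is not Atalaya's template engine: `renderTemplate` and `TemplateContext` are hypothetical, unknown placeholders are left untouched by assumption, and values are not JSON-escaped.

```typescript
interface TemplateContext {
  [key: string]: string | number | null | TemplateContext;
}

// Replace each {{dotted.path}} with the value found in the context.
// Placeholders that do not resolve to a scalar are left as-is.
function renderTemplate(template: string, context: TemplateContext): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (placeholder: string, path: string) => {
    let value: unknown = context;
    for (const key of path.split('.')) {
      if (typeof value !== 'object' || value === null || !(key in value)) {
        return placeholder;
      }
      value = (value as Record<string, unknown>)[key];
    }
    // Only scalars are substituted; objects would not render meaningfully.
    return typeof value === 'object' && value !== null ? placeholder : String(value);
  });
}
```

Because this sketch does not escape values, a value containing a double quote (for instance in `check.error`) would produce invalid JSON when substituted into a JSON body.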

### Status Page

The status page is an Astro 6 SSR site (under `status-page/`) served by the same Worker. It accesses D1 directly and renders monitor status, uptime history, and response time charts.

**Configuration (via Wrangler secrets on the Worker):**

```bash
# Set credentials for basic auth
wrangler secret put STATUS_USERNAME
wrangler secret put STATUS_PASSWORD

# Or make it public
wrangler secret put STATUS_PUBLIC  # Set value to "true"
```

**Access rules:**

- If `STATUS_PUBLIC` is `"true"`: public access allowed
- If credentials are set: basic auth required
- Otherwise: 403 Forbidden
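
These rules amount to a three-way decision. The sketch below encodes them in the order listed; `accessMode` and `StatusEnv` are hypothetical names, and the precedence of `STATUS_PUBLIC` when credentials are also set is an assumption drawn from that order.

```typescript
interface StatusEnv {
  STATUS_PUBLIC?: string;
  STATUS_USERNAME?: string;
  STATUS_PASSWORD?: string;
}

type AccessMode = 'public' | 'basic-auth' | 'forbidden';

// Decide how the status page is protected from the configured secrets.
function accessMode(env: StatusEnv): AccessMode {
  if (env.STATUS_PUBLIC === 'true') {
    return 'public';
  }
  // Both username and password must be present for basic auth.
  if (env.STATUS_USERNAME && env.STATUS_PASSWORD) {
    return 'basic-auth';
  }
  return 'forbidden'; // rendered as 403 Forbidden
}
```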

## Secret Management

Secrets are managed via Cloudflare's secret system. To add a new secret:

1. Set the secret value:

```bash
wrangler secret put SECRET_NAME
```

2. Use it in config with `${SECRET_NAME}` syntax.

```yaml
alerts:
  - name: 'slack'
    type: webhook
    url: 'https://hooks.slack.com/services/${SLACK_PATH}'
    method: POST
    headers:
      Authorization: 'Bearer ${WEBHOOK_TOKEN}'
    body_template: |
      {"text": "{{monitor.name}} is {{status.current}}"}

monitors:
  - name: 'private-api'
    type: http
    target: 'https://api.example.com/health'
    method: GET
    expected_status: 200
    headers:
      Authorization: 'Bearer ${API_KEY}'
    alerts: ['slack']
```

**Note:** `${VAR}` is for secrets (resolved at startup). `{{var}}` is for alert body templates (resolved per-alert).

### Security Notes

- Secrets are never logged or exposed in check results
- Unresolved `${VAR}` placeholders remain as-is (useful for debugging missing secrets)
- Worker secrets are encrypted at rest by Cloudflare
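
The `${VAR}` substitution, including the behaviour for unresolved placeholders, can be sketched as a single string replacement. `resolveSecrets` is a hypothetical helper and the accepted variable-name pattern is an assumption.

```typescript
// Replace ${NAME} with the matching secret. Placeholders without a
// matching secret are left untouched, which makes missing secrets visible.
function resolveSecrets(config: string, secrets: Record<string, string | undefined>): string {
  return config.replace(/\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (placeholder: string, name: string) => {
    const value = secrets[name];
    return value === undefined ? placeholder : value;
  });
}
```

The pattern only matches `${...}`, so alert template variables such as `{{monitor.name}}` pass through unchanged and are resolved later, per alert.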

## Data Retention

- Raw check results: 7 days
- Hourly aggregates: 90 days

An hourly cron job aggregates raw data and cleans up old records automatically.
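
Conceptually, the aggregation groups raw results into one-hour buckets and the cleanup drops rows older than the retention window. The sketch below shows that idea in memory; the real job works against D1, and every name and field here (`RawResult`, `HourlyAggregate`, `aggregateHourly`, `isRawResultExpired`) is hypothetical.

```typescript
interface RawResult {
  timestamp: number; // Unix time in seconds
  ok: boolean;
  responseTimeMs: number;
}

interface HourlyAggregate {
  hourStart: number; // Unix time in seconds, aligned to the hour
  total: number;
  successful: number;
  avgResponseTimeMs: number;
}

const RAW_RETENTION_SECONDS = 7 * 24 * 3600; // raw check results: 7 days

// Group raw results by the hour they fall in and summarise each hour.
function aggregateHourly(results: RawResult[]): HourlyAggregate[] {
  const buckets: Record<number, { total: number; successful: number; responseTimeSum: number }> = {};
  for (const result of results) {
    const hourStart = Math.floor(result.timestamp / 3600) * 3600;
    const bucket = buckets[hourStart] ?? { total: 0, successful: 0, responseTimeSum: 0 };
    bucket.total += 1;
    if (result.ok) bucket.successful += 1;
    bucket.responseTimeSum += result.responseTimeMs;
    buckets[hourStart] = bucket;
  }
  return Object.keys(buckets)
    .map(Number)
    .sort((a, b) => a - b)
    .map((hourStart) => {
      const bucket = buckets[hourStart];
      return {
        hourStart,
        total: bucket.total,
        successful: bucket.successful,
        avgResponseTimeMs: bucket.responseTimeSum / bucket.total,
      };
    });
}

// True when a raw result is older than the 7-day retention window.
function isRawResultExpired(timestamp: number, now: number): boolean {
  return now - timestamp > RAW_RETENTION_SECONDS;
}
```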

## Development

```bash
# Run worker locally
wrangler dev --test-scheduled

# Run status page locally
npm run dev:pages

# Trigger cron manually
curl "http://localhost:8787/__scheduled?cron=*+*+*+*+*"
```

### Testing

```bash
# First build the status page
npm run build:pages

# Worker tests
npm run test

# Status page tests
npm run test:pages

# Type checking and linting
npm run check              # worker
npm run check:pages        # pages (astro check + tsc)
```

## TODO

- [ ] Add support for TLS checks (certificate validity, expiration).
- [ ] Refine the status page to look... well... less AI-generated.
- [ ] Initial support for incident management (manual status overrides, incident timeline).
- [ ] Branded status page.
- [ ] Add support for notifications other than webhooks.