Fleet Monitoring

Health checks, telemetry events, heartbeats, and alerting for 10,000+ devices.

Last updated: April 27, 2026

Health Check Endpoint

Request
bash
curl -s https://cdn.familypocket.io/health

Response
json
{
  "status": "ok",
  "storage_backend": "github",
  "active_rollout": "2.5.0",
  "rollout_percentage": 25,
  "checked_at": "2026-04-27T10:00:00Z"
}

No authentication is required. Wire the endpoint into UptimeRobot or Better Stack with a 1-minute check interval, and page via PagerDuty if it stays down for more than 5 minutes.
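As a rough illustration of the paging rule above, the sketch below decides whether to page from a series of 1-minute check results. The function name `should_page` and the `(timestamp, is_up)` tuple shape are hypothetical; UptimeRobot, Better Stack, and PagerDuty apply equivalent logic through their own configuration, so this only shows the rule, not how those services implement it.

```python
from datetime import datetime, timedelta, timezone

DOWN_THRESHOLD = timedelta(minutes=5)  # paging rule: down for more than 5 minutes


def should_page(checks, threshold=DOWN_THRESHOLD):
    """Return True if the endpoint has been continuously down for longer than threshold.

    checks: list of (timestamp, is_up) tuples, oldest first (hypothetical shape).
    """
    down_since = None
    for ts, is_up in checks:
        if is_up:
            down_since = None  # any successful check ends the outage
        elif down_since is None:
            down_since = ts  # first failed check of the current outage
    if down_since is None:
        return False
    latest_ts = checks[-1][0]
    return latest_ts - down_since > threshold


if __name__ == "__main__":
    start = datetime(2026, 4, 27, 10, 0, tzinfo=timezone.utc)
    outage = [(start + timedelta(minutes=m), False) for m in range(7)]
    print(should_page(outage))
```

A single successful check resets the outage clock, so brief blips shorter than the threshold never page.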

Device Heartbeat

Every 15 minutes, each online device sends a heartbeat via the device health endpoint. The heartbeat includes:

  • Battery level
  • Storage free
  • Network type
  • NFC status
  • Current app version
  • Last scan timestamp
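The heartbeat fields listed above might be modelled roughly as follows. This is a sketch only: the payload schema is not documented here, so every field name, the units (percent, MB), and the `to_json` helper are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Heartbeat:
    # All field names and units are illustrative, not the real wire format.
    device_serial: str
    battery_percent: int
    storage_free_mb: int
    network_type: str
    nfc_enabled: bool
    app_version: str
    last_scan_at: Optional[str]  # ISO 8601 timestamp, or None if no scan yet


def to_json(heartbeat: Heartbeat) -> str:
    """Serialize a heartbeat to a JSON string with stable key order."""
    return json.dumps(asdict(heartbeat), sort_keys=True)


if __name__ == "__main__":
    hb = Heartbeat("SN-0001", 87, 2048, "wifi", True, "2.5.0", "2026-04-27T09:58:12Z")
    print(to_json(hb))
```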

Telemetry Events

Devices batch telemetry events and flush every 60 seconds. Events are persisted to Room (SQLite) locally so they survive offline periods and app crashes.
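The persist-then-flush pattern described above can be sketched with Python's built-in `sqlite3` module standing in for Room. The device app itself runs on Android, so this is an analogy rather than the actual implementation; the `TelemetryQueue` class, the `pending_events` table, and the `send` callback (assumed to return True on a successful upload) are all hypothetical.

```python
import json
import sqlite3

FLUSH_INTERVAL_SECONDS = 60  # from the text: flush every 60 seconds


class TelemetryQueue:
    """Durable event queue: rows are deleted only after a successful upload."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending_events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "event_type TEXT NOT NULL, occurred_at TEXT NOT NULL, data TEXT NOT NULL)"
        )
        self.db.commit()

    def record(self, event_type, occurred_at, data):
        # Commit immediately so the event survives a crash before the next flush.
        self.db.execute(
            "INSERT INTO pending_events (event_type, occurred_at, data) VALUES (?, ?, ?)",
            (event_type, occurred_at, json.dumps(data)),
        )
        self.db.commit()

    def flush(self, send):
        """Upload all pending events as one batch; keep them if send() fails."""
        rows = self.db.execute(
            "SELECT id, event_type, occurred_at, data FROM pending_events ORDER BY id"
        ).fetchall()
        if not rows:
            return 0
        batch = [
            {"event_type": t, "occurred_at": at, "data": json.loads(d)}
            for _, t, at, d in rows
        ]
        if not send(batch):  # offline or server error: retry on the next flush
            return 0
        self.db.execute("DELETE FROM pending_events WHERE id <= ?", (rows[-1][0],))
        self.db.commit()
        return len(rows)
```

In a real client a timer would call `flush` every `FLUSH_INTERVAL_SECONDS`. Deleting rows only after `send` succeeds is what lets events outlive offline periods, at the cost of possible duplicates if an upload succeeds but the acknowledgement is lost.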

Update Events

| Event | Meaning |
| --- | --- |
| update.check | Device checked for updates |
| update.available | New version found |
| update.download_started | APK download began |
| update.download_completed | APK downloaded successfully |
| update.checksum_verified | SHA-256 matched |
| update.checksum_failed | SHA-256 mismatch (corrupt download or tampered APK) |
| update.install_started | PackageInstaller session committed |
| update.install_waiting_window | Waiting for safe window (22:00-03:00) |
| update.failed | Update failed at any stage |
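The safe window for update.install_waiting_window crosses midnight, which makes a naive `start <= now <= end` comparison fail. A minimal sketch of a wrap-aware check follows; the function name is hypothetical, and treating the window as start-inclusive and end-exclusive in device-local time is an assumption, since the source only gives the range 22:00-03:00.

```python
from datetime import time

WINDOW_START = time(22, 0)  # 22:00, from the update.install_waiting_window row
WINDOW_END = time(3, 0)     # 03:00


def in_safe_window(now, start=WINDOW_START, end=WINDOW_END):
    """Return True if local time `now` falls inside [start, end)."""
    if start <= end:
        # Window sits inside a single day.
        return start <= now < end
    # Window wraps past midnight: late evening OR early morning.
    return now >= start or now < end


if __name__ == "__main__":
    print(in_safe_window(time(23, 30)), in_safe_window(time(12, 0)))
```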

Device Events

| Event | Meaning |
| --- | --- |
| device.boot | Device booted (battery, storage, uptime, last shutdown reason) |
| engine_start | KioskEngineService started |
| device.announce | Device hardware metadata recorded |
| update_installed | Update completed successfully |
| kiosk.locked | App entered Lock Task mode (trigger: boot/manual/post-update) |
| kiosk.unlocked | Lock Task exited (method: nfc_card/pin, technician_id) |
| kiosk.unlock_attempt_failed | Wrong PIN or invalid NFC card (method, reason) |
| app.crash | Uncaught exception (stack trace, version, device state) |

Monitoring Rollout Progress

During a phased rollout, check how many devices have updated:

  1. Query kiosk_telemetry_events for update.completed events with the target version
  2. Compare against total devices seen in last 24h (from heartbeats)
  3. Read the current active_rollout and rollout_percentage from the health endpoint
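Steps 1 and 2 amount to a set intersection, sketched below over rows already fetched from the database. The function name and the dict shapes are hypothetical. The sketch also assumes that the `current_version` column on the completion event holds the newly installed version, and it takes the event name from step 1 as a default that can be overridden.

```python
from datetime import datetime, timedelta, timezone


def rollout_progress(events, heartbeats, target_version, now,
                     event_type="update.completed"):
    """Return (updated, active, fraction) for devices seen in the last 24 hours.

    events: iterable of dicts with device_serial, event_type, current_version
    heartbeats: iterable of dicts with device_serial, occurred_at (aware datetime)
    """
    cutoff = now - timedelta(hours=24)
    # Step 2: devices seen in the last 24 hours.
    active = {h["device_serial"] for h in heartbeats if h["occurred_at"] >= cutoff}
    # Step 1: devices that reported a completed update to the target version.
    updated = {
        e["device_serial"]
        for e in events
        if e["event_type"] == event_type and e["current_version"] == target_version
    }
    updated_active = updated & active
    if not active:
        return 0, 0, 0.0
    return len(updated_active), len(active), len(updated_active) / len(active)
```

Comparing the returned fraction with rollout_percentage from the health endpoint gives a quick read on whether the rollout is keeping pace with its target.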

Dashboard Panels

Critical operational views to build for fleet visibility:

| Panel | What It Shows |
| --- | --- |
| Fleet Health | Percentage of devices seen in last 24h, grouped by school |
| Rollout Progress | Count of devices on each version, with target version highlighted |
| Stuck Devices | Devices on min_required_version-1 or below, sorted by last_seen_at |
| Recent Failures | update.failed events in last 7 days, grouped by reason |
| Unlock Activity | Who unlocked which device and when (audit trail) |
| Battery/Storage Warnings | Devices below 20% battery or <500MB free storage |
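The Stuck Devices panel depends on comparing version strings numerically, since a plain string comparison would rank "2.10.0" below "2.5.0". The sketch below reads "min_required_version-1 or below" as "strictly below min_required_version" and sorts oldest last_seen_at first; both readings are assumptions, as are the function names and the dotted-integer version format.

```python
def parse_version(version):
    """Turn '2.5.0' into (2, 5, 0) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def stuck_devices(devices, min_required_version):
    """Return devices below the minimum version, oldest last_seen_at first.

    devices: iterable of dicts with device_serial, current_version, last_seen_at.
    """
    minimum = parse_version(min_required_version)
    stuck = [d for d in devices if parse_version(d["current_version"]) < minimum]
    return sorted(stuck, key=lambda d: d["last_seen_at"])


if __name__ == "__main__":
    fleet = [
        {"device_serial": "A", "current_version": "2.10.0", "last_seen_at": "2026-04-27T09:00:00Z"},
        {"device_serial": "B", "current_version": "2.4.9", "last_seen_at": "2026-04-26T09:00:00Z"},
    ]
    print([d["device_serial"] for d in stuck_devices(fleet, "2.5.0")])
```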

Telemetry Database Schema

kiosk_telemetry_events
sql
CREATE TABLE kiosk_telemetry_events (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  device_serial   TEXT NOT NULL,
  event_type      TEXT NOT NULL,
  occurred_at     TIMESTAMPTZ NOT NULL,
  received_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
  data            JSONB NOT NULL,
  current_version TEXT
);

CREATE INDEX idx_telemetry_device ON kiosk_telemetry_events(device_serial, occurred_at DESC);
CREATE INDEX idx_telemetry_type ON kiosk_telemetry_events(event_type, occurred_at DESC);
CREATE INDEX idx_telemetry_received_at ON kiosk_telemetry_events(received_at);
Data retention: Set up a nightly cron job that rolls raw events older than 90 days into daily aggregates, then prunes them from the raw table. This keeps query performance stable as the fleet grows.
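The roll-up step could look roughly like the sketch below, which works on rows already loaded into memory. The aggregate grain (one count per day per event type) is an assumption, since the source does not say what the daily aggregates contain; a production job would more likely do this in SQL inside a transaction.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # from the text: raw events older than 90 days


def roll_up(events, now, retention=RETENTION):
    """Split events into daily aggregates (old rows) and raw rows to keep.

    events: iterable of dicts with event_type and occurred_at (aware datetime).
    Returns (aggregates, kept), where aggregates maps (date, event_type) -> count.
    """
    cutoff = now - retention
    aggregates = Counter()
    kept = []
    for event in events:
        if event["occurred_at"] < cutoff:
            # Old enough to prune: count it under its calendar day.
            day = event["occurred_at"].date()
            aggregates[(day, event["event_type"])] += 1
        else:
            kept.append(event)
    return aggregates, kept
```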

Common Alerts to Set Up

| Alert | Condition | Action |
| --- | --- | --- |
| CDN down | /health returns non-200 for >5min | Check Vercel deployment |
| Update failures spike | >10 update.failed events in 1 hour | Check checksum, storage, network |
| Device not seen | No heartbeat from device in 24h | Check WiFi, power at school |
| Storage full | Heartbeat shows <200MB free | Push cache clear (future) |
| Battery critical | Heartbeat shows <10% | Device may shut down overnight |
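The heartbeat-driven and event-driven conditions in the alert table can be expressed as simple threshold checks, sketched below. The thresholds come from the table; the function names, the heartbeat field names (`received_at`, `storage_free_mb`, `battery_percent`), and the choice to skip battery and storage checks on a stale heartbeat are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the alert table.
STORAGE_MIN_MB = 200
BATTERY_MIN_PERCENT = 10
UNSEEN_AFTER = timedelta(hours=24)
FAILURE_LIMIT = 10
FAILURE_WINDOW = timedelta(hours=1)


def device_alerts(heartbeat, now):
    """Return the alert names raised by a device's most recent heartbeat."""
    if now - heartbeat["received_at"] > UNSEEN_AFTER:
        # Stale readings say nothing about the current battery or storage state.
        return ["Device not seen"]
    alerts = []
    if heartbeat["storage_free_mb"] < STORAGE_MIN_MB:
        alerts.append("Storage full")
    if heartbeat["battery_percent"] < BATTERY_MIN_PERCENT:
        alerts.append("Battery critical")
    return alerts


def update_failures_spiking(failure_times, now):
    """True if more than 10 update.failed events occurred in the last hour."""
    recent = [t for t in failure_times if now - FAILURE_WINDOW <= t <= now]
    return len(recent) > FAILURE_LIMIT
```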

Common Failure Modes

| Cause | Detection | Fix |
| --- | --- | --- |
| Storage full | Heartbeat shows storage_free < 200MB | Remote cache clear (future), or technician on-site |
| Flaky WiFi | Repeated partial downloads for same device | Resumable download eventually succeeds, or change WiFi |
| Checksum mismatch | update.failed reason = checksum_mismatch | DB checksum does not match GitHub asset. Re-upload checksum via webhook. |
| Signature mismatch | INSTALL_FAILED_UPDATE_INCOMPATIBLE | APK signed with wrong key. Investigate keystore before another release. |
| Backend down | Devices stop checking in for updates | Devices continue normal operations, catch up on next successful boot |
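When diagnosing a checksum mismatch, it can help to hash the release asset locally and compare it with the checksum stored in the database. The sketch below uses Python's standard `hashlib`; the function names are hypothetical, and it assumes the stored checksum is a hex-encoded SHA-256 digest. The on-device verification is a separate implementation and is not shown here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash the file in chunks so a large APK is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_hex):
    """Compare against the expected SHA-256, ignoring case and surrounding whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

If the local hash matches the GitHub asset but not the database value, the stored checksum is the stale side, which is the case the re-upload fix addresses.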