Fleet Monitoring
Health checks, telemetry events, heartbeats, and alerting for 10,000+ devices.
Last updated: April 27, 2026
Health Check Endpoint
Request
bash
GET https://cdn.familypocket.io/health
Response
json
{
"status": "ok",
"storage_backend": "github",
"active_rollout": "2.5.0",
"rollout_percentage": 25,
"checked_at": "2026-04-27T10:00:00Z"
}
No authentication is required. Wire the endpoint into UptimeRobot or Better Stack with a 1-minute check interval, and raise a PagerDuty alert if it is down for more than 5 minutes.
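The paging rule (1-minute checks, page only after more than 5 minutes of continuous failure) can be sketched as a small state tracker. This is a minimal illustration, not how UptimeRobot, Better Stack, or PagerDuty are configured; HealthTracker, is_healthy, and record are hypothetical names, and the only payload field relied on is the status value shown in the response.

```python
import json
from datetime import datetime, timedelta
from typing import Optional

PAGE_AFTER = timedelta(minutes=5)  # threshold from the text: down for >5 minutes


def is_healthy(http_status: int, body: str) -> bool:
    """Treat a check as healthy only for HTTP 200 with "status": "ok"."""
    if http_status != 200:
        return False
    try:
        return json.loads(body).get("status") == "ok"
    except (json.JSONDecodeError, AttributeError):
        # Body was not JSON, or was JSON but not an object.
        return False


class HealthTracker:
    """Remembers when the current outage started (hypothetical helper)."""

    def __init__(self) -> None:
        self.down_since: Optional[datetime] = None

    def record(self, healthy: bool, now: datetime) -> bool:
        """Record one check result; return True if a page should fire."""
        if healthy:
            self.down_since = None  # any success ends the outage
            return False
        if self.down_since is None:
            self.down_since = now
        return now - self.down_since > PAGE_AFTER
```

A single failed check never pages on its own; the tracker only reports True once consecutive failures span more than five minutes, which avoids alerts for one-off network blips.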
Device Heartbeat
Every 15 minutes, each online device sends a heartbeat via the device health endpoint. This includes:
- Battery level
- Storage free
- Network type
- NFC status
- Current app version
- Last scan timestamp
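The source lists the heartbeat categories but not the wire format. As a rough sketch of what such a payload might look like, the example below models it as a dataclass with a basic sanity check. Every field name and unit here is an assumption made for illustration, as are the Heartbeat and validate names.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class Heartbeat:
    # All field names and units are assumptions; the source lists only categories.
    device_serial: str
    battery_percent: int
    storage_free_mb: int
    network_type: str            # e.g. "wifi" or "cellular"
    nfc_enabled: bool
    app_version: str
    last_scan_at: Optional[str]  # ISO 8601 timestamp, or None if no scan yet
    sent_at: str                 # ISO 8601 timestamp

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def validate(hb: Heartbeat) -> List[str]:
    """Return a list of problems; an empty list means the heartbeat looks sane."""
    problems = []
    if not 0 <= hb.battery_percent <= 100:
        problems.append("battery_percent out of range")
    if hb.storage_free_mb < 0:
        problems.append("storage_free_mb is negative")
    if not hb.app_version:
        problems.append("app_version missing")
    return problems
```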
Telemetry Events
Devices batch telemetry events and flush every 60 seconds. Events are persisted to Room (SQLite) locally so they survive offline periods and app crashes.
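The persist-then-flush pattern can be illustrated outside Android. The sketch below uses Python's sqlite3 module as a stand-in for Room: events are committed to disk as soon as they are recorded and are deleted only after an upload succeeds. TelemetryQueue, the pending_events table, and the upload callback are hypothetical; the real device code would be a Room DAO plus a scheduled worker.

```python
import json
import sqlite3
from typing import Callable, List

FLUSH_INTERVAL_SECONDS = 60  # flush cadence from the text


class TelemetryQueue:
    """Durable event queue: write to SQLite first, delete only after upload succeeds."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pending_events ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "event_type TEXT NOT NULL, occurred_at TEXT NOT NULL, data TEXT NOT NULL)"
        )
        self.db.commit()

    def record(self, event_type: str, occurred_at: str, data: dict) -> None:
        # Committing immediately means the event survives a crash before the next flush.
        self.db.execute(
            "INSERT INTO pending_events (event_type, occurred_at, data) VALUES (?, ?, ?)",
            (event_type, occurred_at, json.dumps(data)),
        )
        self.db.commit()

    def flush(self, upload: Callable[[List[dict]], bool]) -> int:
        """Send all pending events as one batch; keep them if the upload fails."""
        rows = self.db.execute(
            "SELECT id, event_type, occurred_at, data FROM pending_events ORDER BY id"
        ).fetchall()
        if not rows:
            return 0
        batch = [
            {"event_type": t, "occurred_at": o, "data": json.loads(d)}
            for _, t, o, d in rows
        ]
        if not upload(batch):
            return 0  # offline or server error: retry on the next flush
        # Delete only the rows that were sent; newer rows stay queued.
        self.db.execute("DELETE FROM pending_events WHERE id <= ?", (rows[-1][0],))
        self.db.commit()
        return len(batch)
```

A scheduler would call flush every FLUSH_INTERVAL_SECONDS. Because a failed upload leaves the rows in place, an offline period simply produces a larger batch on the next successful flush.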
Update Events
| Event | Meaning |
|---|---|
| update.check | Device checked for updates |
| update.available | New version found |
| update.download_started | APK download began |
| update.download_completed | APK downloaded successfully |
| update.checksum_verified | SHA-256 matched |
| update.checksum_failed | SHA-256 mismatch (corrupt download or tampered APK) |
| update.install_started | PackageInstaller session committed |
| update.install_waiting_window | Waiting for safe window (22:00-03:00) |
| update.failed | Update failed at any stage |
Device Events
| Event | Meaning |
|---|---|
| device.boot | Device booted (battery, storage, uptime, last shutdown reason) |
| engine_start | KioskEngineService started |
| device.announce | Device hardware metadata recorded |
| update_installed | Update completed successfully |
| kiosk.locked | App entered Lock Task mode (trigger: boot/manual/post-update) |
| kiosk.unlocked | Lock Task exited (method: nfc_card/pin, technician_id) |
| kiosk.unlock_attempt_failed | Wrong PIN or invalid NFC card (method, reason) |
| app.crash | Uncaught exception (stack trace, version, device state) |
Monitoring Rollout Progress
During a phased rollout, check how many devices have updated:
- Query kiosk_telemetry_events for update.completed events with the target version
- Compare against total devices seen in last 24h (from heartbeats)
- Health endpoint shows current active_rollout and rollout_percentage
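The comparison in the steps above amounts to a ratio: devices with a completion event for the target version, divided by devices seen in the last 24 hours. The sketch below computes it over in-memory data. The function name, the tuple layout, and the completed_event parameter are assumptions; the event name is left configurable because the event tables list update_installed rather than update.completed.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple


def rollout_progress(
    events: Iterable[Tuple[str, str, str]],
    last_heartbeat: Dict[str, datetime],
    target_version: str,
    now: datetime,
    completed_event: str = "update.completed",
) -> float:
    """Fraction of devices active in the last 24h that completed the target update.

    events: (device_serial, event_type, current_version) tuples.
    last_heartbeat: device_serial -> time of the most recent heartbeat.
    """
    active = {
        serial
        for serial, seen in last_heartbeat.items()
        if now - seen <= timedelta(hours=24)
    }
    if not active:
        return 0.0
    updated = {
        serial
        for serial, event_type, version in events
        if event_type == completed_event and version == target_version
    }
    # Count each device once, and ignore devices that have gone silent.
    return len(updated & active) / len(active)
```

The result can be compared with rollout_percentage from the health endpoint: a value well below the configured percentage suggests devices are being offered the update but not finishing it.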
Dashboard Panels
Critical operational views to build for fleet visibility:
| Panel | What It Shows |
|---|---|
| Fleet Health | Percentage of devices seen in last 24h, grouped by school |
| Rollout Progress | Count of devices on each version, with target version highlighted |
| Stuck Devices | Devices on min_required_version-1 or below, sorted by last_seen_at |
| Recent Failures | update.failed events in last 7 days, grouped by reason |
| Unlock Activity | Who unlocked which device and when (audit trail) |
| Battery/Storage Warnings | Devices below 20% battery or <500MB free storage |
Telemetry Database Schema
kiosk_telemetry_events
sql
CREATE TABLE kiosk_telemetry_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
device_serial TEXT NOT NULL,
event_type TEXT NOT NULL,
occurred_at TIMESTAMPTZ NOT NULL,
received_at TIMESTAMPTZ NOT NULL DEFAULT now(),
data JSONB NOT NULL,
current_version TEXT
);
CREATE INDEX idx_telemetry_device ON kiosk_telemetry_events(device_serial, occurred_at DESC);
CREATE INDEX idx_telemetry_type ON kiosk_telemetry_events(event_type, occurred_at DESC);
CREATE INDEX idx_telemetry_received_at ON kiosk_telemetry_events(received_at);
Data retention: Set up a nightly cron to roll raw events older than 90 days into daily aggregates, then prune the raw table. This keeps query performance stable as the fleet grows.
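One possible shape for the nightly job is sketched below, using Python's sqlite3 module as a stand-in for the production database so the example runs on its own. The kiosk_telemetry_daily table, its columns, and the roll_up_and_prune function are hypothetical, and timestamps are assumed to be stored as UTC ISO 8601 text so that string comparison orders them correctly. A real job against the schema above would use the database's own date functions instead of substr.

```python
import sqlite3


def roll_up_and_prune(db: sqlite3.Connection, cutoff_iso: str) -> int:
    """Aggregate raw events older than cutoff_iso into daily counts, then delete them.

    Returns the number of raw rows pruned. The insert and delete share one
    transaction, so a failure cannot prune rows that were never aggregated.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS kiosk_telemetry_daily ("
        "day TEXT NOT NULL, event_type TEXT NOT NULL, "
        "event_count INTEGER NOT NULL, PRIMARY KEY (day, event_type))"
    )
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT INTO kiosk_telemetry_daily (day, event_type, event_count) "
            "SELECT substr(occurred_at, 1, 10), event_type, COUNT(*) "
            "FROM kiosk_telemetry_events WHERE occurred_at < ? "
            "GROUP BY substr(occurred_at, 1, 10), event_type "
            "ON CONFLICT(day, event_type) "
            "DO UPDATE SET event_count = event_count + excluded.event_count",
            (cutoff_iso,),
        )
        pruned = db.execute(
            "DELETE FROM kiosk_telemetry_events WHERE occurred_at < ?",
            (cutoff_iso,),
        ).rowcount
    return pruned
```

The cron entry would compute the cutoff as the current time minus 90 days and pass it in. Running the job twice with the same cutoff is harmless, since the second run finds no raw rows left to aggregate.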
Common Alerts to Set Up
| Alert | Condition | Action |
|---|---|---|
| CDN down | /health returns non-200 for >5min | Check Vercel deployment |
| Update failures spike | >10 update.failed events in 1 hour | Check checksum, storage, network |
| Device not seen | No heartbeat from device in 24h | Check WiFi, power at school |
| Storage full | Heartbeat shows <200MB free | Push cache clear (future) |
| Battery critical | Heartbeat shows <10% | Device may shut down overnight |
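The device-level and fleet-level conditions in the table can be expressed as simple threshold checks. The sketch below uses the thresholds from the table; the function names, alert labels, and input parameters are hypothetical and would need to be mapped onto whatever the heartbeat and event queries actually return.

```python
from typing import List

# Thresholds taken from the alert table above.
STORAGE_FULL_MB = 200
BATTERY_CRITICAL_PERCENT = 10
FAILURE_SPIKE_COUNT = 10
DEVICE_SILENT_HOURS = 24


def device_alerts(storage_free_mb: int, battery_percent: int,
                  hours_since_heartbeat: float) -> List[str]:
    """Return the alert labels that apply to one device."""
    if hours_since_heartbeat >= DEVICE_SILENT_HOURS:
        # A silent device has no fresh readings, so skip the other checks.
        return ["device_not_seen"]
    alerts = []
    if storage_free_mb < STORAGE_FULL_MB:
        alerts.append("storage_full")
    if battery_percent < BATTERY_CRITICAL_PERCENT:
        alerts.append("battery_critical")
    return alerts


def failure_spike(failed_events_last_hour: int) -> bool:
    """True when more than 10 update.failed events arrived in the last hour."""
    return failed_events_last_hour > FAILURE_SPIKE_COUNT
```

Keeping the thresholds as named constants in one place makes it easier to keep the alert rules and the dashboard warning levels from drifting apart.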
Common Failure Modes
| Cause | Detection | Fix |
|---|---|---|
| Storage full | Heartbeat shows storage_free < 200MB | Remote cache clear (future), or technician on-site |
| Flaky WiFi | Repeated partial downloads for same device | Resumable download eventually succeeds, or change WiFi |
| Checksum mismatch | update.failed reason = checksum_mismatch | DB checksum does not match GitHub asset. Re-upload checksum via webhook. |
| Signature mismatch | INSTALL_FAILED_UPDATE_INCOMPATIBLE | APK signed with wrong key. Investigate keystore before another release. |
| Backend down | Devices stop checking in for updates | Devices continue normal operations, catch up on next successful boot |