# Vigilry – AI-Driven Incident Risk Analytics Platform
A production-grade, real-time incident detection and alerting platform. Ingests application telemetry, detects error spikes via a sliding-window algorithm, manages incident lifecycles automatically, and pushes live dashboard updates — all under 100ms from error occurrence to alert.
## Problem & Solution
### The problem
- Engineers rely on manual dashboard monitoring
- Alerting systems introduce minutes of delay
- No automated incident lifecycle management
- Error spikes go undetected during off-hours
- No real-time feedback loop from app to engineer
### The solution
- Sliding-window anomaly detection (60s window)
- Automated Open → Investigating → Resolved lifecycle
- Real-time Socket.IO dashboard broadcasts
- Transactional email alerts on incident creation
- API-key SDK integration for any application
## System Architecture
Monorepo structure (Turborepo + pnpm)
### Services
- api-gateway — Express REST API, auth, org/project management
- ingestion-service — Event intake, validation, stream publishing
- worker-risk — BullMQ anomaly processor, incident state management
- websocket-service — Socket.IO broadcaster via Redis Stream consumer group
- web — Next.js 15 real-time dashboard
### Shared packages
- db — Drizzle ORM schema + PostgreSQL client
- redis — Redis client, BullMQ, Socket.IO adapter
- events — Redis Stream publisher helpers
- email — Brevo transactional email templates
- http / logger / types / utils — Cross-cutting concerns
### Infrastructure
- PostgreSQL 16 — primary data store with full relational integrity
- Redis 7 — incident state, BullMQ queues, Socket.IO adapter, Streams
- Docker Compose — local PostgreSQL + Redis with named volumes and AOF persistence
- PM2 — production process manager with auto-restart on reboot
- TypeScript project references — incremental builds, cross-package type safety
## Application Screenshots
Per-project Events tab showing live ERROR-severity events with type, source, correlation ID, and timestamp. Live mode toggles real-time Socket.IO updates.
## Full Data Pipeline
### 1. Ingestion
Client SDKs POST events to the ingestion service with organizationId, projectId, severity, type, correlationId, and a free-form JSONB payload.
- Authenticates the request via its API key
- Inserts the event into PostgreSQL
- Emits EVENT_INGESTED onto a Redis Stream (platform-events)
- Enqueues an anomaly-check job to BullMQ
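A client-side call could look roughly like the following sketch. The helper name `buildIngestRequest` and the endpoint URL are invented for illustration; the field names (`organizationId`, `projectId`, `severity`, `type`, `correlationId`, `payload`) are the ones listed above.

```typescript
// Hypothetical SDK helper that builds the HTTP request the ingestion
// service expects. Endpoint URL is an assumption; field names come
// from the ingestion description above.
interface IngestEvent {
  organizationId: string;
  projectId: string;
  severity: "INFO" | "WARN" | "ERROR";
  type: string;
  correlationId: string;
  payload: Record<string, unknown>; // free-form JSONB payload
}

function buildIngestRequest(
  apiKey: string,
  event: IngestEvent,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: "https://ingest.example.com/v1/events", // assumed endpoint
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Api-Key": apiKey, // machine-to-machine auth (see Authentication System)
      },
      body: JSON.stringify(event),
    },
  };
}
```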
### 2. Anomaly Detection (Worker)
The anomaly worker processes every queued job using a configurable sliding-window algorithm.
- Queries last 60s of ERROR-severity events for the org/project
- If errorCount ≤ 10 — no action taken
- If errorCount > 10 and active incident exists — attaches top 10 events, refreshes Redis TTLs
- If re-spike during quiet period — reinstates active key, status → OPEN, emits INCIDENT_UPDATED
- No existing incident — acquires a 10s Redis creation lock, inserts incident to DB, emits INCIDENT_CREATED, sends notification email (fire-and-forget)
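The branching above condenses into a small decision function. This is an illustrative sketch rather than the actual worker code: the 60s window and threshold of 10 come from the steps above, while the function and action names are invented, and all Redis/DB access is abstracted into the inputs.

```typescript
// Possible outcomes of one anomaly-check job, mirroring the bullets above.
type Decision =
  | { action: "none" }                 // below threshold
  | { action: "attach-and-refresh" }   // active incident: attach events, refresh TTLs
  | { action: "reopen" }               // re-spike during quiet period -> OPEN
  | { action: "create" };              // acquire creation lock, insert incident, notify

function decide(
  errorCount: number,          // ERROR-severity events in the last 60s
  hasActiveIncident: boolean,  // incident:active:* key present
  inQuietPeriod: boolean,      // incident:investigating:* key present
): Decision {
  if (errorCount <= 10) return { action: "none" };
  if (hasActiveIncident) return { action: "attach-and-refresh" };
  if (inQuietPeriod) return { action: "reopen" };
  return { action: "create" };
}
```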
### 3. Incident State Machine (Health Check Worker)
Scheduled every 30 seconds. Manages the Open → Investigating → Resolved lifecycle entirely via Redis key TTLs.
| Condition | Transition |
|---|---|
| Active key present + status OPEN | Stay OPEN, refresh keys |
| Active key absent + status OPEN | → INVESTIGATING |
| Active key present + status INVESTIGATING | → OPEN (re-spike) |
| Investigating TTL ≤ 20% remaining | → RESOLVED (Lua atomic check) |
Two Redis keys drive the lifecycle:
- incident:active:{orgId}:{projectId} — 60s TTL, refreshed per spike
- incident:investigating:{orgId}:{projectId} — 300s TTL, the quiet window
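The table rows map onto a pure transition function. This is a sketch under the assumption that key existence and remaining TTLs are read first and passed in as plain values; the 20% threshold, statuses, and 300s window come from the section above, everything else is illustrative.

```typescript
type Status = "OPEN" | "INVESTIGATING" | "RESOLVED";

interface Snapshot {
  status: Status;
  activeKeyPresent: boolean;         // incident:active:* exists (60s TTL)
  investigatingTtlMs: number | null; // remaining TTL on incident:investigating:*
}

const INVESTIGATING_TTL_MS = 300_000; // 300s quiet window

function nextStatus(s: Snapshot): Status {
  if (s.status === "OPEN") {
    // Active key still alive -> stay OPEN; expired -> quiet period begins.
    return s.activeKeyPresent ? "OPEN" : "INVESTIGATING";
  }
  if (s.status === "INVESTIGATING") {
    if (s.activeKeyPresent) return "OPEN"; // re-spike
    // In the real system this check-and-delete runs as an atomic Lua script.
    if (s.investigatingTtlMs !== null && s.investigatingTtlMs <= 0.2 * INVESTIGATING_TTL_MS) {
      return "RESOLVED";
    }
  }
  return s.status;
}
```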
### 4. Real-Time Broadcast
The WebSocket service runs a continuous Redis Stream consumer loop, pushing events to the dashboard within milliseconds.
- Consumer group ws-service reads from platform-events (blocking, 5s timeout)
- Parses EVENT_INGESTED, INCIDENT_CREATED, INCIDENT_UPDATED
- Broadcasts to Socket.IO rooms keyed by organizationId
- Acknowledges messages (XACK) after successful broadcast
- Frontend useSocket hook updates UI state without any polling
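One iteration of the read-broadcast-ack cycle could look like the sketch below, with the Redis client hidden behind a minimal interface (an assumption made for illustration; the real service uses a concrete client). The stream, group, and event-type names come from the list above; `broadcast` stands in for the Socket.IO room emit.

```typescript
interface StreamMessage { id: string; fields: Record<string, string>; }

interface StreamClient {
  // Models: XREADGROUP GROUP ws-service <consumer> BLOCK 5000 STREAMS platform-events >
  readGroup(group: string, consumer: string, stream: string, blockMs: number): Promise<StreamMessage[]>;
  ack(stream: string, group: string, id: string): Promise<void>; // XACK
}

async function consumeOnce(
  client: StreamClient,
  broadcast: (orgId: string, type: string, fields: Record<string, string>) => void,
): Promise<number> {
  const messages = await client.readGroup("ws-service", "consumer-1", "platform-events", 5000);
  for (const msg of messages) {
    const { type, organizationId } = msg.fields;
    if (type === "EVENT_INGESTED" || type === "INCIDENT_CREATED" || type === "INCIDENT_UPDATED") {
      broadcast(organizationId, type, msg.fields);
    }
    // Acknowledge only after a successful broadcast: a crash before this
    // line leaves the message pending, so it is redelivered on restart.
    await client.ack("platform-events", "ws-service", msg.id);
  }
  return messages.length;
}
```

Because messages are XACKed only after the broadcast succeeds, a crash mid-loop leaves them pending for redelivery, which is the fault-tolerance property described under Key Engineering Highlights.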
## Key Engineering Highlights
### TTL-driven state machine
Incident lifecycle managed entirely via Redis key TTLs and a Lua atomic check-and-delete script. No cron cleanup jobs, no polling — expiry is the state transition.
### Dual-mode authentication
A single middleware transparently handles API key (machine-to-machine SHA-256 hash lookup) and JWT session (browser httpOnly cookie) access with a cleanly typed auth context downstream.
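A sketch of the branching such a middleware might do, reduced to a pure function: the `X-Api-Key` header and hash scheme come from the Authentication System section below, while the cookie name `session` and the helper signatures are assumptions.

```typescript
// The typed auth context handed downstream: project-scoped for API
// keys (ingestion), org-scoped for browser sessions.
type AuthContext =
  | { mode: "api-key"; projectId: string }
  | { mode: "session"; organizationId: string }
  | null;

function resolveAuth(
  headers: Record<string, string | undefined>,
  cookies: Record<string, string | undefined>,
  lookupApiKey: (sha256Hash: string) => string | null, // digest -> projectId
  verifyJwt: (token: string) => string | null,         // token -> organizationId
  sha256: (raw: string) => string,
): AuthContext {
  const apiKey = headers["x-api-key"];
  if (apiKey) {
    const projectId = lookupApiKey(sha256(apiKey)); // only the hash is stored
    return projectId ? { mode: "api-key", projectId } : null;
  }
  const token = cookies["session"]; // httpOnly session cookie (assumed name)
  if (token) {
    const organizationId = verifyJwt(token);
    return organizationId ? { mode: "session", organizationId } : null;
  }
  return null;
}
```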
### Consumer group fault tolerance
WebSocket service uses Redis Stream consumer groups. A crash and restart automatically resumes from unacknowledged messages, with horizontal scaling built in via the Redis Socket.IO adapter.
### Fire-and-forget email delivery
Email sends never block incident creation or auth flows. Failures are logged and observable but never surface as user-facing errors, keeping core system performance intact.
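The pattern is essentially "start the promise, never await it, route the rejection to the logger". A minimal sketch with stand-in `sendEmail`/`logError` parameters (the real names differ):

```typescript
function notifyFireAndForget(
  sendEmail: () => Promise<void>,
  logError: (msg: string) => void,
): void {
  // No await: incident creation continues immediately. The .catch
  // handler makes a failed send observable in logs without ever
  // surfacing as a user-facing error.
  sendEmail().catch((err) => logError(`email failed: ${String(err)}`));
}
```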
### Type-safe monorepo
Full TypeScript project references across all packages. A schema change in packages/db causes immediate compile errors in every consumer service.
### MVC with dependency injection
Strict 4-layer architecture (Routes → Controllers → Services → Repositories) with constructor-injected dependencies across all Express services, enabling easy unit testing and clean separation.
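A minimal sketch of the constructor-injection pattern with an invented `IncidentService` and repository interface; the actual classes and layer wiring in the codebase will differ.

```typescript
interface Incident { id: string; status: string; }

interface IncidentRepository {
  findOpenByProject(projectId: string): Incident[];
}

class IncidentService {
  // The repository is injected, so unit tests can pass an in-memory fake
  // instead of a real Drizzle/PostgreSQL-backed implementation.
  constructor(private readonly repo: IncidentRepository) {}

  openIncidentCount(projectId: string): number {
    return this.repo.findOpenByProject(projectId).length;
  }
}

// In a unit test, no database is needed:
const fakeRepo: IncidentRepository = {
  findOpenByProject: () => [{ id: "inc_1", status: "OPEN" }],
};
const service = new IncidentService(fakeRepo);
```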
## Authentication System
### API Key (machine-to-machine)
- X-Api-Key header → SHA-256 hash → DB lookup
- Raw key shown once on creation, never stored
- Populates req.auth.project for project-scoped ingestion
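The hash flow can be sketched with Node's built-in `crypto` module; the function and variable names here are illustrative, and the DB lookup is reduced to a string comparison.

```typescript
import { createHash } from "node:crypto";

// On creation: store hashApiKey(rawKey), show rawKey once, discard it.
// On each request: hash the X-Api-Key header value and look up the digest.
function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

function apiKeyMatches(rawKeyFromHeader: string, storedDigest: string): boolean {
  return hashApiKey(rawKeyFromHeader) === storedDigest;
}
```

A hardening option worth noting: comparing digests with `crypto.timingSafeEqual` instead of `===` avoids leaking information through comparison timing.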
### JWT Session (browser)
- httpOnly cookie session → jwt.verify → DB lookup
- Org-level only — no project scope for browser sessions
- Email verification on signup (SHA-256 token, 24h expiry)
## Lessons Learned
- Redis TTLs are extremely powerful for time-based state machines. Letting key expiry drive state transitions eliminates entire categories of cleanup jobs and race conditions.
- Streams + consumer groups provide simple but reliable event pipelines. The combination of blocking reads, consumer groups, and XACK gives at-least-once delivery and crash recovery without a dedicated message broker; consumers just need to tolerate occasional redelivery.
- Separating ingestion from analysis prevents API latency spikes. Decoupling the ingestion endpoint from the anomaly detection worker via BullMQ means slow analysis never stalls the client-facing API.
- Fire-and-forget notifications prevent alerting systems from affecting core system performance. Email sends that never throw ensure a broken SMTP connection or Brevo outage cannot cascade into incident creation failures.