Vigilry – AI-Driven Incident Risk Analytics Platform

A production-grade, real-time incident detection and alerting platform. Ingests application telemetry, detects error spikes via a sliding-window algorithm, manages incident lifecycles automatically, and pushes live dashboard updates — all under 100ms from error occurrence to alert.

TypeScript Monorepo
Node.js 22
Redis Streams
BullMQ
Socket.IO
PostgreSQL 16
Next.js 15

Production traits:

Multi-tenant SaaS
Sub-100ms pipeline
TTL-driven state machine
Consumer group fault tolerance

Problem & Solution

The problem

  • Engineers rely on manual dashboard monitoring
  • Alerting systems introduce minutes of delay
  • No automated incident lifecycle management
  • Error spikes go undetected during off-hours
  • No real-time feedback loop from app to engineer

Vigilry solves this with

  • Sliding-window anomaly detection (60s window)
  • Automated Open → Investigating → Resolved lifecycle
  • Real-time Socket.IO dashboard broadcasts
  • Transactional email alerts on incident creation
  • API-key SDK integration for any application

System Architecture

Monorepo structure (Turborepo + pnpm)

Services

  • api-gateway — Express REST API, auth, org/project management
  • ingestion-service — Event intake, validation, stream publishing
  • worker-risk — BullMQ anomaly processor, incident state management
  • websocket-service — Socket.IO broadcaster via Redis Stream consumer group
  • web — Next.js 15 real-time dashboard

Shared packages

  • db — Drizzle ORM schema + PostgreSQL client
  • redis — Redis client, BullMQ, Socket.IO adapter
  • events — Redis Stream publisher helpers
  • email — Brevo transactional email templates
  • http / logger / types / utils — Cross-cutting concerns

Infrastructure

  • PostgreSQL 16 — primary data store with full relational integrity
  • Redis 7 — incident state, BullMQ queues, Socket.IO adapter, Streams
  • Docker Compose — local PostgreSQL + Redis with named volumes and AOF persistence
  • PM2 — production process manager with auto-restart on reboot
  • TypeScript project references — incremental builds, cross-package type safety
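The local stack above can be sketched as a minimal Compose file; service names, ports, passwords, and volume names here are illustrative, not Vigilry's actual configuration:

```yaml
# Sketch only: PostgreSQL 16 + Redis 7 with named volumes and AOF persistence.
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: postgres   # placeholder credential
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
  redis:
    image: redis:7
    command: ["redis-server", "--appendonly", "yes"]  # AOF persistence
    ports:
      - "6379:6379"
    volumes:
      - redisdata:/data
volumes:
  pgdata:
  redisdata:
```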

Application Screenshots

Per-project Events tab showing live ERROR-severity events with type, source, correlation ID, and timestamp. Live mode toggles real-time Socket.IO updates.

Full Data Pipeline

1. Ingestion

Client SDKs POST events to the ingestion service with organizationId, projectId, severity, type, correlationId, and a free-form JSONB payload.

  • Authenticates the request via API key
  • Inserts the event into PostgreSQL
  • Emits EVENT_INGESTED onto a Redis Stream (platform-events)
  • Enqueues an anomaly-check job to BullMQ
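The payload shape and client-side POST can be sketched as follows. Field names match the event fields described above; the endpoint path, helper name, and severity values are assumptions, not Vigilry's actual SDK:

```typescript
// Sketch of the ingestion payload and a minimal SDK request builder.
interface IngestEvent {
  organizationId: string;
  projectId: string;
  severity: "DEBUG" | "INFO" | "WARN" | "ERROR";
  type: string;
  correlationId: string;
  payload: Record<string, unknown>; // free-form JSONB payload
}

// Builds the fetch options for the ingestion POST; the caller supplies the API key.
function buildIngestRequest(baseUrl: string, apiKey: string, event: IngestEvent) {
  return {
    url: `${baseUrl}/events`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Api-Key": apiKey, // machine-to-machine auth header
      },
      body: JSON.stringify(event),
    },
  };
}
```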

2. Anomaly Detection (Worker)

The anomaly worker processes every queued job using a configurable sliding-window algorithm.

  • Queries last 60s of ERROR-severity events for the org/project
  • If errorCount ≤ 10 — no action taken
  • If errorCount > 10 and active incident exists — attaches top 10 events, refreshes Redis TTLs
  • If re-spike during quiet period — reinstates active key, status → OPEN, emits INCIDENT_UPDATED
  • No existing incident — acquires a 10s Redis creation lock, inserts incident to DB, emits INCIDENT_CREATED, sends notification email (fire-and-forget)
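The worker's decision table above can be sketched as a pure function, assuming the threshold of 10 errors per 60s window; type and action names are illustrative:

```typescript
// Pure sketch of the anomaly worker's branching logic.
type IncidentState = "NONE" | "ACTIVE" | "INVESTIGATING";

type Action =
  | "NOOP"              // below spike threshold
  | "REFRESH_ACTIVE"    // attach top events, refresh Redis TTLs
  | "REOPEN"            // re-spike during quiet period → OPEN
  | "CREATE_INCIDENT";  // acquire creation lock, insert, notify

function decideAnomalyAction(errorCount: number, state: IncidentState): Action {
  if (errorCount <= 10) return "NOOP";            // no action taken
  if (state === "ACTIVE") return "REFRESH_ACTIVE";
  if (state === "INVESTIGATING") return "REOPEN"; // reinstate active key
  return "CREATE_INCIDENT";                       // no existing incident
}
```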

3. Incident State Machine (Health Check Worker)

Scheduled every 30 seconds. Manages the Open → Investigating → Resolved lifecycle entirely via Redis key TTLs.

Transitions (condition → result):

  • Active key present + status OPEN → stay OPEN, refresh keys
  • Active key absent + status OPEN → INVESTIGATING
  • Active key present + status INVESTIGATING → OPEN (re-spike)
  • Investigating TTL ≤ 20% remaining → RESOLVED (Lua atomic check)

  • incident:active:{orgId}:{projectId} — 60s TTL, refreshed per spike
  • incident:investigating:{orgId}:{projectId} — 300s TTL, the quiet window
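The transition table above reduces to a pure function. Inputs mirror the Redis key checks; the Lua check-and-delete is represented by the TTL-fraction branch, and the names are illustrative:

```typescript
// Pure sketch of the 30s health-check state transitions.
type Status = "OPEN" | "INVESTIGATING" | "RESOLVED";

function nextStatus(
  status: Status,
  activeKeyPresent: boolean,        // incident:active:{orgId}:{projectId} exists
  investigatingTtlFraction: number  // remaining TTL / 300s on the quiet-window key
): Status {
  if (status === "OPEN") {
    return activeKeyPresent ? "OPEN" : "INVESTIGATING"; // quiet → start 300s window
  }
  if (status === "INVESTIGATING") {
    if (activeKeyPresent) return "OPEN";                    // re-spike
    if (investigatingTtlFraction <= 0.2) return "RESOLVED"; // Lua atomic check
  }
  return status;
}
```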

4. Real-Time Broadcast

The WebSocket service runs an infinite Redis Stream consumer loop, pushing events to the dashboard in milliseconds.

  • Consumer group ws-service reads from platform-events (blocking, 5s timeout)
  • Parses EVENT_INGESTED, INCIDENT_CREATED, INCIDENT_UPDATED
  • Broadcasts to Socket.IO rooms keyed by organizationId
  • Acknowledges messages (XACK) after successful broadcast
  • Frontend useSocket hook updates UI state without any polling
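The routing step in the consumer loop can be sketched as a pure function that maps a parsed stream entry to its Socket.IO room. The event names come from the list above; the `org:` room prefix is an assumption:

```typescript
// Sketch of broadcast routing: which room (if any) receives a stream entry.
const KNOWN_EVENTS = new Set(["EVENT_INGESTED", "INCIDENT_CREATED", "INCIDENT_UPDATED"]);

interface StreamEvent {
  type: string;           // e.g. "EVENT_INGESTED"
  organizationId: string; // rooms are keyed by organization
}

function roomFor(event: StreamEvent): string | null {
  if (!KNOWN_EVENTS.has(event.type)) return null; // unhandled entries are skipped (still XACKed)
  return `org:${event.organizationId}`;           // one room per organization
}
```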

Key Engineering Highlights

TTL-driven state machine

Incident lifecycle managed entirely via Redis key TTLs and a Lua atomic check-and-delete script. No cron cleanup jobs, no polling — expiry is the state transition.

Dual-mode authentication

A single middleware transparently handles API key (machine-to-machine SHA-256 hash lookup) and JWT session (browser httpOnly cookie) access with a cleanly typed auth context downstream.

Consumer group fault tolerance

WebSocket service uses Redis Stream consumer groups. A crash and restart automatically resumes from unacknowledged messages, with horizontal scaling built in via the Redis Socket.IO adapter.

Fire-and-forget email delivery

Email sends never block incident creation or auth flows. Failures are logged and observable but never surface as user-facing errors, keeping core system performance intact.

Type-safe monorepo

Full TypeScript project references across all packages. A schema change in packages/db causes immediate compile errors in every consumer service.
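A consumer service opts in via `references` in its tsconfig; the file below is a sketch, and the relative paths are assumptions about the repo layout:

```jsonc
// services/api-gateway/tsconfig.json (illustrative)
{
  "compilerOptions": {
    "composite": true,    // required for referenced projects
    "incremental": true   // enables incremental builds
  },
  "references": [
    { "path": "../../packages/db" },
    { "path": "../../packages/types" }
  ]
}
```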

MVC with dependency injection

Strict 4-layer architecture (Routes → Controllers → Services → Repositories) with constructor-injected dependencies across all Express services, enabling easy unit testing and clean separation.
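Constructor injection makes each layer testable with an in-memory fake; the interface, class, and method names below are illustrative, not Vigilry's actual code:

```typescript
// Sketch of the Services → Repositories boundary with constructor injection.
interface IncidentRepository {
  findOpenByProject(projectId: string): string[];
}

class IncidentService {
  private readonly repo: IncidentRepository;
  constructor(repo: IncidentRepository) {
    this.repo = repo; // dependency injected, never constructed internally
  }
  openIncidentCount(projectId: string): number {
    return this.repo.findOpenByProject(projectId).length;
  }
}

// In unit tests, the repository is swapped for an in-memory fake:
const fakeRepo: IncidentRepository = {
  findOpenByProject: () => ["inc_1", "inc_2"],
};
const service = new IncidentService(fakeRepo);
```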

Authentication System

API Key (machine-to-machine)

  • X-Api-Key header → SHA-256 hash → DB lookup
  • Raw key shown once on creation, never stored
  • Populates req.auth.project for project-scoped ingestion
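The hash-then-lookup flow above can be sketched with Node's built-in crypto; the `vk_` key prefix and function names are assumptions:

```typescript
import { createHash, randomBytes } from "node:crypto";

// Sketch: the raw key is shown once; only its SHA-256 hash is stored and looked up.
function hashApiKey(raw: string): string {
  return createHash("sha256").update(raw).digest("hex");
}

function generateApiKey(): { raw: string; hash: string } {
  const raw = `vk_${randomBytes(24).toString("hex")}`; // shown once on creation
  return { raw, hash: hashApiKey(raw) };               // only the hash is persisted
}

// Per-request lookup: hash the X-Api-Key header value and compare against storage.
function authenticate(headerKey: string, storedHashes: Set<string>): boolean {
  return storedHashes.has(hashApiKey(headerKey));
}
```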

JWT Session (browser)

  • httpOnly cookie session → jwt.verify → DB lookup
  • Org-level only — no project scope for browser sessions
  • Email verification on signup (SHA-256 token, 24h expiry)
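The verification-token scheme above (SHA-256 token, 24h expiry) can be sketched as follows; function and field names are illustrative:

```typescript
import { createHash, randomBytes } from "node:crypto";

const DAY_MS = 24 * 60 * 60 * 1000;

// A random token goes into the verification email; only its hash and expiry are stored.
function issueVerificationToken(now: number) {
  const token = randomBytes(32).toString("hex");
  return {
    token, // sent in the email link
    record: {
      tokenHash: createHash("sha256").update(token).digest("hex"),
      expiresAt: now + DAY_MS, // 24h expiry
    },
  };
}

function verifyToken(
  candidate: string,
  record: { tokenHash: string; expiresAt: number },
  now: number
): boolean {
  if (now > record.expiresAt) return false; // expired token
  return createHash("sha256").update(candidate).digest("hex") === record.tokenHash;
}
```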

Tech Stack

  • Language: TypeScript 5.6
  • Runtime: Node.js 22
  • API Framework: Express
  • Real-Time: Socket.IO + Redis Adapter
  • Database: PostgreSQL 16 + Drizzle ORM
  • Job Queue: BullMQ (Redis-backed)
  • Cache & Streams: Redis 7
  • Frontend: Next.js 15, React 19, TailwindCSS v4
  • UI Components: shadcn/ui (Radix UI)
  • Forms: react-hook-form + Zod
  • Email: Brevo (Sendinblue)
  • Logging: Pino (structured JSON)
  • Monorepo: Turborepo + pnpm
  • Process Manager: PM2

Lessons Learned

  • Redis TTLs are extremely powerful for time-based state machines. Letting key expiry drive state transitions eliminates entire categories of cleanup jobs and race conditions.
  • Streams + consumer groups provide simple but reliable event pipelines. The combination of blocking reads, consumer groups, and XACK gives at-least-once delivery and crash recovery without a dedicated message broker.
  • Separating ingestion from analysis prevents API latency spikes. Decoupling the ingestion endpoint from the anomaly detection worker via BullMQ means slow analysis never stalls the client-facing API.
  • Fire-and-forget notifications prevent alerting systems from affecting core system performance. Email sends that never throw ensure a broken SMTP connection or Brevo outage cannot cascade into incident creation failures.