# Vigilry – AI-Driven Incident Risk Analytics Platform
A production-grade, real-time incident detection and alerting platform. Ingests application telemetry, detects error spikes via a sliding-window algorithm, manages incident lifecycles automatically, and pushes live dashboard updates — all under 100ms from error occurrence to alert.
## Problem & Solution
### The problem
- Engineers rely on manual dashboard monitoring
- Alerting systems introduce minutes of delay
- No automated incident lifecycle management
- Error spikes go undetected during off-hours
- No real-time feedback loop from app to engineer
### The solution
- Sliding-window anomaly detection (60s window)
- Automated Open → Investigating → Resolved lifecycle
- Real-time Socket.IO dashboard broadcasts
- Transactional email alerts on incident creation
- API-key SDK integration for any application
## System Architecture
Monorepo structure (Turborepo + pnpm)
### Services
- api-gateway — Express REST API, auth, org/project management
- ingestion-service — Event intake, validation, stream publishing
- worker-risk — BullMQ anomaly processor, incident state management
- websocket-service — Socket.IO broadcaster via Redis Stream consumer group
- web — Next.js 15 real-time dashboard
### Shared packages
- db — Drizzle ORM schema + PostgreSQL client
- redis — Redis client, BullMQ, Socket.IO adapter
- events — Redis Stream publisher helpers
- email — Brevo transactional email templates
- http / logger / types / utils — Cross-cutting concerns
### Infrastructure
- PostgreSQL 16 — primary data store with full relational integrity
- Redis 7 — incident state, BullMQ queues, Socket.IO adapter, Streams
- Docker Compose — local PostgreSQL + Redis with named volumes and AOF persistence
- PM2 — production process manager with auto-restart on reboot
- TypeScript project references — incremental builds, cross-package type safety
## Application Screenshots
Per-project Events tab showing live ERROR-severity events with type, source, correlation ID, and timestamp. Live mode toggles real-time Socket.IO updates.
## Full Data Pipeline
### 1. Ingestion
Client SDKs POST events to the ingestion service with organizationId, projectId, severity, type, correlationId, and a free-form JSONB payload.
- Authenticates the request via its API key
- Inserts the event into PostgreSQL
- Emits EVENT_INGESTED onto a Redis Stream (platform-events)
- Enqueues an anomaly-check job to BullMQ
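A client-side call could look roughly like the following sketch. The helper name `buildIngestRequest` and the endpoint URL are invented for illustration; the field names (`organizationId`, `projectId`, `severity`, `type`, `correlationId`, `payload`) are the ones listed above.

```typescript
// Hypothetical SDK helper that builds the HTTP request the ingestion
// service expects. Endpoint URL is an assumption; field names come
// from the ingestion description above.
interface IngestEvent {
  organizationId: string;
  projectId: string;
  severity: "INFO" | "WARN" | "ERROR";
  type: string;
  correlationId: string;
  payload: Record<string, unknown>; // free-form JSONB payload
}

function buildIngestRequest(
  apiKey: string,
  event: IngestEvent,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: "https://ingest.example.com/v1/events", // assumed endpoint
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "X-Api-Key": apiKey, // machine-to-machine auth (see Authentication System)
      },
      body: JSON.stringify(event),
    },
  };
}
```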
### 2. Anomaly Detection (Worker)
The anomaly worker processes every queued job using a configurable sliding-window algorithm.
- Queries last 60s of ERROR-severity events for the org/project
- If errorCount ≤ 10 — no action taken
- If errorCount > 10 and active incident exists — attaches top 10 events, refreshes Redis TTLs
- If re-spike during quiet period — reinstates active key, status → OPEN, emits INCIDENT_UPDATED
- No existing incident — acquires a 10s Redis creation lock, inserts incident to DB, emits INCIDENT_CREATED, sends notification email (fire-and-forget)
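The branching above condenses into a small decision function. This is an illustrative sketch rather than the actual worker code: the 60s window and threshold of 10 come from the steps above, while the function and action names are invented, and all Redis/DB access is abstracted into the inputs.

```typescript
// Possible outcomes of one anomaly-check job, mirroring the bullets above.
type Decision =
  | { action: "none" }                 // below threshold
  | { action: "attach-and-refresh" }   // active incident: attach events, refresh TTLs
  | { action: "reopen" }               // re-spike during quiet period -> OPEN
  | { action: "create" };              // acquire creation lock, insert incident, notify

function decide(
  errorCount: number,          // ERROR-severity events in the last 60s
  hasActiveIncident: boolean,  // incident:active:* key present
  inQuietPeriod: boolean,      // incident:investigating:* key present
): Decision {
  if (errorCount <= 10) return { action: "none" };
  if (hasActiveIncident) return { action: "attach-and-refresh" };
  if (inQuietPeriod) return { action: "reopen" };
  return { action: "create" };
}
```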
### 3. Incident State Machine (Health Check Worker)
Scheduled every 30 seconds. Manages the Open → Investigating → Resolved lifecycle entirely via Redis key TTLs.
| Condition | Transition |
|---|---|
| Active key present + status OPEN | Stay OPEN, refresh keys |
| Active key absent + status OPEN | → INVESTIGATING |
| Active key present + status INVESTIGATING | → OPEN (re-spike) |
| Investigating TTL ≤ 20% remaining | → RESOLVED (Lua atomic check) |
Two Redis keys drive the lifecycle:
- incident:active:{orgId}:{projectId} — 60s TTL, refreshed per spike
- incident:investigating:{orgId}:{projectId} — 300s TTL, the quiet window
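The table rows map onto a pure transition function. This is a sketch under the assumption that key existence and remaining TTLs are read first and passed in as plain values; the 20% threshold, statuses, and 300s window come from the section above, everything else is illustrative.

```typescript
type Status = "OPEN" | "INVESTIGATING" | "RESOLVED";

interface Snapshot {
  status: Status;
  activeKeyPresent: boolean;         // incident:active:* exists (60s TTL)
  investigatingTtlMs: number | null; // remaining TTL on incident:investigating:*
}

const INVESTIGATING_TTL_MS = 300_000; // 300s quiet window

function nextStatus(s: Snapshot): Status {
  if (s.status === "OPEN") {
    // Active key still alive -> stay OPEN; expired -> quiet period begins.
    return s.activeKeyPresent ? "OPEN" : "INVESTIGATING";
  }
  if (s.status === "INVESTIGATING") {
    if (s.activeKeyPresent) return "OPEN"; // re-spike
    // In the real system this check-and-delete runs as an atomic Lua script.
    if (s.investigatingTtlMs !== null && s.investigatingTtlMs <= 0.2 * INVESTIGATING_TTL_MS) {
      return "RESOLVED";
    }
  }
  return s.status;
}
```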
### 4. Real-Time Broadcast
The WebSocket service runs a continuous Redis Stream consumer loop, pushing events to the dashboard within milliseconds.
- Consumer group ws-service reads from platform-events (blocking, 5s timeout)
- Parses EVENT_INGESTED, INCIDENT_CREATED, INCIDENT_UPDATED
- Broadcasts to Socket.IO rooms keyed by organizationId
- Acknowledges messages (XACK) after successful broadcast
- Frontend useSocket hook updates UI state without any polling
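One iteration of the read-broadcast-ack cycle could look like the sketch below, with the Redis client hidden behind a minimal interface (an assumption made for illustration; the real service uses a concrete client). The stream, group, and event-type names come from the list above; `broadcast` stands in for the Socket.IO room emit.

```typescript
interface StreamMessage { id: string; fields: Record<string, string>; }

interface StreamClient {
  // Models: XREADGROUP GROUP ws-service <consumer> BLOCK 5000 STREAMS platform-events >
  readGroup(group: string, consumer: string, stream: string, blockMs: number): Promise<StreamMessage[]>;
  ack(stream: string, group: string, id: string): Promise<void>; // XACK
}

async function consumeOnce(
  client: StreamClient,
  broadcast: (orgId: string, type: string, fields: Record<string, string>) => void,
): Promise<number> {
  const messages = await client.readGroup("ws-service", "consumer-1", "platform-events", 5000);
  for (const msg of messages) {
    const { type, organizationId } = msg.fields;
    if (type === "EVENT_INGESTED" || type === "INCIDENT_CREATED" || type === "INCIDENT_UPDATED") {
      broadcast(organizationId, type, msg.fields);
    }
    // Acknowledge only after a successful broadcast: a crash before this
    // line leaves the message pending, so it is redelivered on restart.
    await client.ack("platform-events", "ws-service", msg.id);
  }
  return messages.length;
}
```

Because messages are XACKed only after the broadcast succeeds, a crash mid-loop leaves them pending for redelivery, which is the fault-tolerance property described under Key Engineering Highlights.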
## Key Engineering Highlights
### TTL-driven state machine
Incident lifecycle managed entirely via Redis key TTLs and a Lua atomic check-and-delete script. No cron cleanup jobs, no polling — expiry is the state transition.
### Dual-mode authentication
A single middleware transparently handles API key (machine-to-machine SHA-256 hash lookup) and JWT session (browser httpOnly cookie) access with a cleanly typed auth context downstream.
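A sketch of the branching such a middleware might do, reduced to a pure function: the `X-Api-Key` header and hash scheme come from the Authentication System section below, while the cookie name `session` and the helper signatures are assumptions.

```typescript
// The typed auth context handed downstream: project-scoped for API
// keys (ingestion), org-scoped for browser sessions.
type AuthContext =
  | { mode: "api-key"; projectId: string }
  | { mode: "session"; organizationId: string }
  | null;

function resolveAuth(
  headers: Record<string, string | undefined>,
  cookies: Record<string, string | undefined>,
  lookupApiKey: (sha256Hash: string) => string | null, // digest -> projectId
  verifyJwt: (token: string) => string | null,         // token -> organizationId
  sha256: (raw: string) => string,
): AuthContext {
  const apiKey = headers["x-api-key"];
  if (apiKey) {
    const projectId = lookupApiKey(sha256(apiKey)); // only the hash is stored
    return projectId ? { mode: "api-key", projectId } : null;
  }
  const token = cookies["session"]; // httpOnly session cookie (assumed name)
  if (token) {
    const organizationId = verifyJwt(token);
    return organizationId ? { mode: "session", organizationId } : null;
  }
  return null;
}
```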
### Consumer group fault tolerance
WebSocket service uses Redis Stream consumer groups. A crash and restart automatically resumes from unacknowledged messages, with horizontal scaling built in via the Redis Socket.IO adapter.
### Fire-and-forget email delivery
Email sends never block incident creation or auth flows. Failures are logged and observable but never surface as user-facing errors, keeping core system performance intact.
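The pattern is essentially "start the promise, never await it, route the rejection to the logger". A minimal sketch with stand-in `sendEmail`/`logError` parameters (the real names differ):

```typescript
function notifyFireAndForget(
  sendEmail: () => Promise<void>,
  logError: (msg: string) => void,
): void {
  // No await: incident creation continues immediately. The .catch
  // handler makes a failed send observable in logs without ever
  // surfacing as a user-facing error.
  sendEmail().catch((err) => logError(`email failed: ${String(err)}`));
}
```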
### Type-safe monorepo
Full TypeScript project references across all packages. A schema change in packages/db causes immediate compile errors in every consumer service.
### MVC with dependency injection
Strict 4-layer architecture (Routes → Controllers → Services → Repositories) with constructor-injected dependencies across all Express services, enabling easy unit testing and clean separation.
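A minimal sketch of the constructor-injection pattern with an invented `IncidentService` and repository interface; the actual classes and layer wiring in the codebase will differ.

```typescript
interface Incident { id: string; status: string; }

interface IncidentRepository {
  findOpenByProject(projectId: string): Incident[];
}

class IncidentService {
  // The repository is injected, so unit tests can pass an in-memory fake
  // instead of a real Drizzle/PostgreSQL-backed implementation.
  constructor(private readonly repo: IncidentRepository) {}

  openIncidentCount(projectId: string): number {
    return this.repo.findOpenByProject(projectId).length;
  }
}

// In a unit test, no database is needed:
const fakeRepo: IncidentRepository = {
  findOpenByProject: () => [{ id: "inc_1", status: "OPEN" }],
};
const service = new IncidentService(fakeRepo);
```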
## Authentication System
### API Key (machine-to-machine)
- X-Api-Key header → SHA-256 hash → DB lookup
- Raw key shown once on creation, never stored
- Populates req.auth.project for project-scoped ingestion
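The hash flow can be sketched with Node's built-in `crypto` module; the function and variable names here are illustrative, and the DB lookup is reduced to a string comparison.

```typescript
import { createHash } from "node:crypto";

// On creation: store hashApiKey(rawKey), show rawKey once, discard it.
// On each request: hash the X-Api-Key header value and look up the digest.
function hashApiKey(rawKey: string): string {
  return createHash("sha256").update(rawKey).digest("hex");
}

function apiKeyMatches(rawKeyFromHeader: string, storedDigest: string): boolean {
  return hashApiKey(rawKeyFromHeader) === storedDigest;
}
```

A hardening option worth noting: comparing digests with `crypto.timingSafeEqual` instead of `===` avoids leaking information through comparison timing.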
### JWT Session (browser)
- httpOnly cookie session → jwt.verify → DB lookup
- Org-level only — no project scope for browser sessions
- Email verification on signup (SHA-256 token, 24h expiry)
## Lessons Learned
- Redis TTLs are extremely powerful for time-based state machines. Letting key expiry drive state transitions eliminates entire categories of cleanup jobs and race conditions.
- Streams + consumer groups provide simple but reliable event pipelines. The combination of blocking reads, consumer groups, and XACK gives at-least-once delivery and crash recovery without a dedicated message broker; consumers just need to tolerate occasional redelivery.
- Separating ingestion from analysis prevents API latency spikes. Decoupling the ingestion endpoint from the anomaly detection worker via BullMQ means slow analysis never stalls the client-facing API.
- Fire-and-forget notifications prevent alerting systems from affecting core system performance. Email sends that never throw ensure a broken SMTP connection or Brevo outage cannot cascade into incident creation failures.