Think of it as the nervous system of your infrastructure: sensing, reporting, and alerting.
The Hospital Analogy
Imagine your application is a patient in a hospital. Doctors don't just guess what's wrong — they use monitoring equipment to understand the patient's health in real-time and historically. The same applies to your systems.
Metrics
Vital signs — CPU, memory, requests/sec
Logs
Medical history — what happened & when
Traces
Diagnostic path — how requests flow
The Problem It Solves
# 3 AM on-call experience...
User: "Site is slow"
You: "Let me SSH into servers..."
You: *checks 15 servers manually*
You: "Hmm, disk looks okay..."
You: *restarts random services*
You: "Try now?"
User: "Still broken"
You: 😭 *4 hours later*
- No idea what's happening in production
- Users report problems before you know
- Debugging is guesswork and SSH
- Can't answer 'why is it slow?'
- No historical data for comparison
# 3 AM with observability...
[Alert] P99 latency > 500ms
[Dashboard] Spike at 02:47
[Trace] Slow query in user-service
[Logs] "Connection pool exhausted"
Fix: Scale connection pool
[Resolved] in 8 minutes ✓
# Back to sleep 😴
- Real-time visibility into all systems
- Alerts before users notice problems
- Trace exact cause of issues
- Historical trends for capacity planning
- Data-driven decisions, not guesswork
The Three Pillars of Observability
Metrics
Numeric measurements over time. The vital signs of your system.
• CPU utilization: 45%
• Requests/second: 1,234
• Error rate: 0.01%
• P99 latency: 120ms
Best for: Dashboards, alerts, trends
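To make a figure like "P99 latency: 120ms" concrete, here is a toy sketch (sample values invented) of a nearest-rank percentile. Prometheus actually estimates quantiles from histogram buckets rather than raw samples, but the intuition is the same:

```python
# Toy sketch: what a percentile means. Latency samples are made up.
latencies_ms = sorted([12, 15, 18, 22, 30, 45, 80, 95, 110, 120])

def percentile(sorted_values, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    k = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[k]

print(percentile(latencies_ms, 50))  # typical request
print(percentile(latencies_ms, 99))  # the slowest 1% of requests
```

Note how the P99 (120ms) is far above the median (30ms): the tail, not the average, is what users complain about.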
Logs
Timestamped records of discrete events. The story of what happened.
2024-01-15 09:23:45 INFO User login
2024-01-15 09:23:46 WARN Slow query
2024-01-15 09:23:47 ERROR Timeout
Best for: Debugging, audit trails, search
Traces
End-to-end request paths across services. The journey of a request.
Best for: Latency analysis, dependencies
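A minimal sketch of the idea in plain Python, with invented service names (real systems use OpenTelemetry or similar): each operation becomes a timed, nested span, and the collected spans reconstruct the request's path.

```python
# Toy tracer sketch: nested, timed spans for one request.
import time
from contextlib import contextmanager

spans = []   # collected as (start, depth, name, duration_ms)
_depth = 0

@contextmanager
def span(name):
    global _depth
    start = time.perf_counter()
    depth = _depth
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        spans.append((start, depth, name, (time.perf_counter() - start) * 1000))

# One simulated request fanning out to two downstream calls
with span("GET /checkout"):
    with span("auth-service"):
        time.sleep(0.01)
    with span("db-query"):
        time.sleep(0.02)

# Print the trace tree in start order, indented by depth
for start, depth, name, ms in sorted(spans):
    print("  " * depth + f"{name} ({ms:.1f}ms)")
```

The root span's duration covers everything beneath it, which is exactly how a trace viewer shows where a slow request spent its time.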
Prometheus + Grafana
Prometheus collects and stores metrics. Grafana visualizes them beautifully. Together, they're the industry standard for open-source metrics.
Pull-based
Prometheus scrapes your apps
Time-series
Optimized for metrics data
Flexible
Any data source in Grafana
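The pull model works because each app exposes its metrics as plain text that Prometheus scrapes over HTTP. A rough sketch of that exposition format (metric names and values invented; real apps use a client library such as prometheus_client):

```python
# Sketch: rendering the plain-text format a /metrics endpoint serves.
def render_metrics(counters):
    lines = []
    for name, labeled_values in counters.items():
        lines.append(f"# TYPE {name} counter")
        for labels, value in labeled_values.items():
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

counters = {
    "http_requests_total": {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "POST"), ("status", "500")): 3,
    }
}
print(render_metrics(counters))
```

Prometheus scrapes this text on its configured interval and stores each labeled line as a separate time series.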
prometheus.yml Configuration
# prometheus.yml
global:
  scrape_interval: 15s       # How often to scrape
  evaluation_interval: 15s   # How often to evaluate rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Your applications
  - job_name: 'web-app'
    static_configs:
      - targets: ['app1:8080', 'app2:8080', 'app3:8080']

  # Node exporters (host metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

PromQL Query Examples
Request rate (per second)
rate(http_requests_total[5m])

Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

P95 latency (histogram)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

CPU usage by instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

ELK Stack (Elastic Stack)
Elasticsearch stores and searches logs. Logstash processes and transforms them. Kibana visualizes everything. Beats ships data from your servers.
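To build intuition for what a Logstash filter stage in this pipeline does, here is a rough Python analogue (field names and sample lines invented): parse each JSON log line, drop noise, and route the record to a dated index.

```python
# Sketch of a log-processing step: parse JSON, drop DEBUG, pick an index.
import json
from datetime import datetime, timezone

def process(line):
    record = json.loads(line)            # like: json { source => "message" }
    if record.get("level") == "DEBUG":   # like: drop { } for debug logs
        return None
    ts = datetime.fromisoformat(record["timestamp"])        # date / ISO8601
    record["@timestamp"] = ts.astimezone(timezone.utc).isoformat()
    record["_index"] = ts.strftime("logs-%Y.%m.%d")         # logs-YYYY.MM.dd
    return record

lines = [
    '{"timestamp": "2024-01-15T09:23:46+00:00", "level": "WARN", "message": "Slow query"}',
    '{"timestamp": "2024-01-15T09:23:47+00:00", "level": "DEBUG", "message": "cache hit"}',
]
for out in filter(None, map(process, lines)):
    print(out["_index"], out["level"], out["message"])
```

Daily indices like `logs-2024.01.15` are what make retention policies (delete indices older than N days) cheap in Elasticsearch.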
Logstash Pipeline Configuration
# logstash.conf
input {
  # Receive from Filebeat
  beats {
    port => 5044
  }
  # Or read directly from files
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }
  # Parse timestamps
  date {
    match => [ "timestamp", "ISO8601" ]
  }
  # Extract fields from message
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  # Add geolocation from IP
  geoip {
    source => "clientip"
  }
  # Drop debug logs in production
  if [level] == "DEBUG" {
    drop { }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Datadog (Commercial Solution)
When to Use a Commercial Platform
Datadog, New Relic, and Splunk are expensive but eliminate operational overhead. You pay money instead of engineering time.
Key Features
- Unified metrics, logs, traces, and APM
- 500+ integrations out of the box
- Auto-instrumentation for many languages
- Advanced ML-powered anomaly detection
- Real User Monitoring (RUM)
- Synthetic monitoring and uptime checks
When to Choose Commercial
- No dedicated ops/SRE team
- Need fast time-to-value
- Compliance requires vendor support
- Budget allows $15-50/host/month
- Multi-cloud or complex architectures
- Need 24/7 support SLAs
| Aspect | Prometheus/Grafana | ELK Stack | Datadog |
|---|---|---|---|
| Cost | Free (OSS) | Free (OSS) | $$$ per host |
| Setup Time | Medium | High | Low |
| Best For | Metrics | Logs/Search | Everything |
| Scaling | Manual | Complex | Automatic |
| Maintenance | You | You | Vendor |
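To put the cost row in perspective, a back-of-envelope sketch (every number here is an illustrative assumption, not a vendor quote):

```python
# Rough cost comparison: SaaS per-host pricing vs. engineer time spent
# operating open-source monitoring. All figures are assumptions.
hosts = 50
saas_per_host_month = 23           # assumed mid-range list price ($/host/mo)
engineer_hourly = 100              # assumed loaded engineer cost ($/hr)
oss_maintenance_hours_month = 20   # assumed upkeep for Prometheus/ELK

saas_monthly = hosts * saas_per_host_month
oss_monthly = oss_maintenance_hours_month * engineer_hourly
print(f"SaaS: ${saas_monthly}/mo, OSS upkeep: ${oss_monthly}/mo")
```

The crossover depends entirely on your host count and how much upkeep your stack really needs, which is why "free" open source can still be the expensive option for a small team.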
Application Instrumentation
Prometheus Metrics with Flask
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.path
    ).observe(latency)
    return response

@app.route('/metrics')
def metrics():
    return generate_latest()

Structured Logging with structlog
# logging_config.py
import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer()  # Output JSON!
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

# Usage
logger = structlog.get_logger()
logger.info("user_login",
    user_id="12345",
    ip_address="192.168.1.1",
    user_agent="Mozilla/5.0..."
)

# Output (JSON, easy to parse in ELK):
# {"event": "user_login", "user_id": "12345", "ip_address": "192.168.1.1", ...}

Alerting Best Practices
Severity Levels
P1 - Critical
Service down, data loss risk. Wake someone up.
P2 - High
Degraded service. Needs attention within an hour.
P3 - Medium
Not urgent. Handle during business hours.
P4 - Low
Informational. Review when time permits.
Avoiding Alert Fatigue
Too many alerts = ignored alerts. Be ruthless.
- Every alert must be actionable
- If you ignore it 3 times, delete it
- Group related alerts together
- Set reasonable thresholds (not too sensitive)
- Use escalation policies, not spam
- Review and prune alerts quarterly
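The "group related alerts" point can be sketched in a few lines (alert payloads invented). Alertmanager does this natively via its `group_by` setting; the effect is one notification per problem, not one per instance:

```python
# Sketch: collapse per-instance alerts into one notification per alertname.
from collections import defaultdict

alerts = [
    {"alertname": "InstanceDown", "instance": "node1"},
    {"alertname": "InstanceDown", "instance": "node2"},
    {"alertname": "HighLatency", "instance": "app1"},
]

grouped = defaultdict(list)
for a in alerts:
    grouped[a["alertname"]].append(a["instance"])

for name, instances in grouped.items():
    print(f"[{name}] affects {len(instances)} instance(s): {', '.join(instances)}")
```

Two pages for the same outage is already alert fatigue; grouping turns a 50-host incident into one page with a list of affected hosts.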
Prometheus Alert Rules Example
# alerts/app_alerts.yml
groups:
  - name: app_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Slow response times
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"

Decision Guide
Trade-offs
Prometheus + Grafana
Pros
- Industry standard, massive community
- Powerful query language (PromQL)
- Pull model is firewall-friendly
- Native Kubernetes integration
- Free and open source
Cons
- Metrics only (no logs/traces)
- Local storage is limited
- High cardinality can explode
- Requires operational knowledge
- Scaling requires Thanos/Cortex
ELK Stack
Pros
- Full-text search is unmatched
- Handles any log format
- Kibana is very powerful
- Elastic APM for traces
- Huge ecosystem of plugins
Cons
- Resource hungry (RAM, disk)
- Complex to operate at scale
- JVM tuning is an art
- License changes (SSPL)
- Index management overhead
Commercial (Datadog, New Relic)
Pros
- Unified platform (metrics+logs+traces)
- Zero infrastructure management
- Enterprise support and SLAs
- Advanced ML features
- Easy to get started
Cons
- Expensive ($15-50+/host/month)
- Vendor lock-in risk
- Data leaves your network
- Costs can spike unexpectedly
- Feature limitations on lower tiers
Key Takeaways
Observability = Metrics + Logs + Traces
Each pillar answers different questions. Use all three for complete visibility into your systems.
Instrument from day one
Adding observability later is painful. Build it into your apps and infrastructure from the start.
Alerts should be actionable
If an alert doesn't require immediate action, it's noise. Ruthlessly prune alerts that get ignored.
Structured logs beat grep
Output JSON logs with consistent fields. Future you (and your log search tool) will thank you.
Match tools to your team
Open source saves money but costs time. Commercial solutions cost money but save time. Know your trade-offs.