
Monitoring & Observability

Build visibility into your systems — metrics, logs, traces, and alerting for production confidence

Think of it as the nervous system of your infrastructure — sensing, reporting, and alerting.

The Hospital Analogy

Imagine your application is a patient in a hospital. Doctors don't just guess what's wrong — they use monitoring equipment to understand the patient's health in real-time and historically. The same applies to your systems.

[Diagram: the hospital analogy. A heart monitor (72 BPM, 120/80) = metrics, numbers over time. A patient chart (09:00 admitted, 09:15 vitals checked, 09:30 fever detected) = logs, events with context. A diagnostic path (Request → Service A → Database) = traces, the request journey. 🏥 Your application is the patient.]
💓

Metrics

Vital signs — CPU, memory, requests/sec

📋

Logs

Medical history — what happened & when

🔬

Traces

Diagnostic path — how requests flow

The Problem It Solves

Flying Blind
# 3 AM on-call experience...
User: "Site is slow"
You: "Let me SSH into servers..."
You: *checks 15 servers manually*
You: "Hmm, disk looks okay..."
You: *restarts random services*
You: "Try now?"
User: "Still broken"
You: 😭 *4 hours later*
  • No idea what's happening in production
  • Users report problems before you know
  • Debugging is guesswork and SSH
  • Can't answer 'why is it slow?'
  • No historical data for comparison
Full Observability
# 3 AM with observability...
[Alert] P99 latency > 500ms
[Dashboard] Spike at 02:47
[Trace] Slow query in user-service
[Logs] "Connection pool exhausted"
Fix: Scale connection pool
[Resolved] in 8 minutes ✓

# Back to sleep 😴
  • Real-time visibility into all systems
  • Alerts before users notice problems
  • Trace exact cause of issues
  • Historical trends for capacity planning
  • Data-driven decisions, not guesswork

The Three Pillars of Observability

Metrics

Numeric measurements over time. The vital signs of your system.

• CPU utilization: 45%

• Requests/second: 1,234

• Error rate: 0.01%

• P99 latency: 120ms

Best for: Dashboards, alerts, trends

Logs

Timestamped records of discrete events. The story of what happened.

2024-01-15 09:23:45 INFO User login

2024-01-15 09:23:46 WARN Slow query

2024-01-15 09:23:47 ERROR Timeout

Best for: Debugging, audit trails, search

Traces

End-to-end request paths across services. The journey of a request.

API Gateway 12ms
Auth 8ms
DB Query 45ms

Best for: Latency analysis, dependencies

Prometheus + Grafana

Prometheus collects and stores metrics. Grafana visualizes them beautifully. Together, they're the industry standard for open-source metrics.

[Diagram: applications expose /metrics endpoints (:8080, :9090) alongside node_exporter; Prometheus scrapes them into its time-series DB (pull model, queried with PromQL) and fires alerts to Alertmanager, which routes to Slack or PagerDuty; Grafana queries Prometheus for dashboards and visualization.]

Pull-based

Prometheus scrapes your apps

Time-series

Optimized for metrics data

Flexible

Any data source in Grafana

1. prometheus.yml Configuration

# prometheus.yml
global:
  scrape_interval: 15s      # How often to scrape
  evaluation_interval: 15s  # How often to evaluate rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Your applications
  - job_name: 'web-app'
    static_configs:
      - targets: ['app1:8080', 'app2:8080', 'app3:8080']

  # Node exporters (host metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
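The pull model in action: any process that serves plain text at /metrics is a valid scrape target. Here is a dependency-free sketch using only the standard library (in practice you would use an official client library such as prometheus_client; the metric name and its value are made up for illustration):

```python
import http.server
import threading
import urllib.request

REQUESTS_SERVED = 0  # toy counter; real client libraries handle thread safety for you

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus text exposition format: HELP/TYPE lines, then samples
        body = (
            "# HELP app_requests_total Total requests served.\n"
            "# TYPE app_requests_total counter\n"
            f"app_requests_total {REQUESTS_SERVED}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# What Prometheus does every scrape_interval: a plain HTTP GET
url = f"http://127.0.0.1:{server.server_port}/metrics"
scraped = urllib.request.urlopen(url).read().decode()
print(scraped)
server.shutdown()
```

Because scraping is just HTTP, you can debug a target with curl before wiring it into scrape_configs.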
2. PromQL Query Examples

Request rate (per second)

rate(http_requests_total[5m])

Error rate percentage

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

P95 latency (histogram)

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

CPU usage by instance

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
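What histogram_quantile works with: a Prometheus histogram stores cumulative bucket counters, and the quantile is estimated by linear interpolation inside the first bucket that reaches the target rank. A toy re-implementation of the idea (an illustration, not Prometheus internals):

```python
import math

bounds = [0.05, 0.1, 0.25, 0.5, 1.0, math.inf]  # bucket upper bounds ("le" labels)
counts = [0] * len(bounds)                       # cumulative count per bucket

def observe(value: float) -> None:
    """Increment every bucket whose upper bound covers the value (cumulative)."""
    for i, le in enumerate(bounds):
        if value <= le:
            counts[i] += 1

def quantile(q: float) -> float:
    """Linear interpolation inside the first bucket reaching rank q * total."""
    total = counts[-1]
    rank = q * total
    for i, cum in enumerate(counts):
        if cum >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = counts[i - 1] if i > 0 else 0
            return lower + (bounds[i] - lower) * (rank - prev) / (cum - prev)
    return bounds[-1]

for latency in [0.02, 0.03, 0.08, 0.12, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]:
    observe(latency)

print(f"estimated p95 ≈ {quantile(0.95):.3f}s")  # an estimate inside the 0.5–1.0s bucket
```

This is why bucket boundaries matter: the quantile is only ever as precise as the bucket it lands in.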

ELK Stack (Elastic Stack)

Elasticsearch stores and searches logs. Logstash processes and transforms them. Kibana visualizes everything. Beats ships data from your servers.

[Diagram: Beats (Filebeat, Metricbeat, APM Agent) ship data to Logstash (parse, transform, enrich), which feeds Elasticsearch (store, index, search, analyze); Kibana sits on top for dashboards, Discover, and alerting. Elastic APM adds distributed tracing.]

Logstash Pipeline Configuration

# logstash.conf
input {
  # Receive from Filebeat
  beats {
    port => 5044
  }
  
  # Or read directly from files
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }
  
  # Parse timestamps
  date {
    match => [ "timestamp", "ISO8601" ]
  }
  
  # Extract fields from message
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  
  # Add geolocation from IP
  geoip {
    source => "clientip"
  }
  
  # Drop debug logs in production
  if [level] == "DEBUG" {
    drop { }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
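The json filter above assumes your app writes one JSON object per line. If you cannot add a dependency, the standard library's logging module can produce that shape; a minimal sketch (field names like level and logger are our choice here, not a standard):

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("user login from %s", "192.168.1.1")
# Each line is parseable by the Logstash json filter (or by jq)
```

With logs in this shape, the grok and date filters become unnecessary for your own application logs.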

Datadog (Commercial Solution)

🐕

When to Use a Commercial Platform

Datadog, New Relic, and Splunk are expensive but eliminate operational overhead. You pay money instead of engineering time.

Key Features

  • Unified metrics, logs, traces, and APM
  • 500+ integrations out of the box
  • Auto-instrumentation for many languages
  • Advanced ML-powered anomaly detection
  • Real User Monitoring (RUM)
  • Synthetic monitoring and uptime checks

When to Choose Commercial

  • No dedicated ops/SRE team
  • Need fast time-to-value
  • Compliance requires vendor support
  • Budget allows $15-50/host/month
  • Multi-cloud or complex architectures
  • Need 24/7 support SLAs
| Aspect      | Prometheus/Grafana | ELK Stack   | Datadog      |
|-------------|--------------------|-------------|--------------|
| Cost        | Free (OSS)         | Free (OSS)  | $$$ per host |
| Setup Time  | Medium             | High        | Low          |
| Best For    | Metrics            | Logs/Search | Everything   |
| Scaling     | Manual             | Complex     | Automatic    |
| Maintenance | You                | You         | Vendor       |

Application Instrumentation

Prometheus Metrics with Flask

# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.path
    ).observe(latency)
    return response

@app.route('/metrics')
def metrics():
    # Expose in the Prometheus text format with the right content type
    return generate_latest(), 200, {'Content-Type': 'text/plain; version=0.0.4'}

Structured Logging with structlog

# logging_config.py
import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer()  # Output JSON!
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

# Usage
logger = structlog.get_logger()

logger.info("user_login",
    user_id="12345",
    ip_address="192.168.1.1",
    user_agent="Mozilla/5.0..."
)

# Output (JSON, easy to parse in ELK):
# {"event": "user_login", "user_id": "12345", "ip_address": "192.168.1.1", ...}

Alerting Best Practices

Severity Levels

P1 - Critical

Service down, data loss risk. Wake someone up.

P2 - High

Degraded service. Needs attention within an hour.

P3 - Medium

Not urgent. Handle during business hours.

P4 - Low

Informational. Review when time permits.

⚠️

Avoiding Alert Fatigue

Too many alerts = ignored alerts. Be ruthless.

  • Every alert must be actionable
  • If you ignore it 3 times, delete it
  • Group related alerts together
  • Set reasonable thresholds (not too sensitive)
  • Use escalation policies, not spam
  • Review and prune alerts quarterly

Prometheus Alert Rules Example

# alerts/app_alerts.yml
groups:
  - name: app_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Slow response times
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"

Decision Guide

[Decision flowchart: start from what you need.
  • Metrics → Prometheus + Grafana: free, powerful, the standard; great for Kubernetes; metrics only.
  • Logs → if you have budget for an ops team to manage an ELK cluster, ELK Stack (self-hosted): powerful log search, free and full-featured, but resource-hungry. If not, Loki or CloudWatch: managed and simpler.
  • Everything → Datadog (or New Relic/Splunk): all-in-one, zero maintenance, but expensive at scale.]

Trade-offs

Prometheus + Grafana

Pros

  • Industry standard, massive community
  • Powerful query language (PromQL)
  • Pull model is firewall-friendly
  • Native Kubernetes integration
  • Free and open source

Cons

  • Metrics only (no logs/traces)
  • Local storage is limited
  • High cardinality can explode
  • Requires operational knowledge
  • Scaling requires Thanos/Cortex
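The cardinality warning is just multiplication: every distinct label combination becomes its own time series, so one unbounded label (user IDs, raw URLs with IDs in them) multiplies your series count. A quick illustration with made-up but realistic numbers:

```python
# Each unique label combination = one time series Prometheus must store.
methods = 5       # GET, POST, ...
endpoints = 50    # bounded: route templates, not raw URLs
statuses = 10     # distinct status codes actually seen
instances = 20

series = methods * endpoints * statuses * instances
print(f"bounded labels: {series:,} series")            # 50,000 — manageable

# Now add one unbounded label, e.g. user_id with 100k active users:
print(f"with user_id:   {series * 100_000:,} series")  # cardinality explosion
```

The rule of thumb: labels should come from small, fixed sets; anything user-generated belongs in logs or traces, not metric labels.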

ELK Stack

Pros

  • Full-text search is unmatched
  • Handles any log format
  • Kibana is very powerful
  • Elastic APM for traces
  • Huge ecosystem of plugins

Cons

  • Resource hungry (RAM, disk)
  • Complex to operate at scale
  • JVM tuning is an art
  • License changes (SSPL)
  • Index management overhead

Commercial (Datadog, New Relic)

Pros

  • Unified platform (metrics+logs+traces)
  • Zero infrastructure management
  • Enterprise support and SLAs
  • Advanced ML features
  • Easy to get started

Cons

  • Expensive ($15-50+/host/month)
  • Vendor lock-in risk
  • Data leaves your network
  • Costs can spike unexpectedly
  • Feature limitations on lower tiers

Key Takeaways

1. Observability = Metrics + Logs + Traces

Each pillar answers different questions. Use all three for complete visibility into your systems.

2. Instrument from day one

Adding observability later is painful. Build it into your apps and infrastructure from the start.

3. Alerts should be actionable

If an alert doesn't require immediate action, it's noise. Ruthlessly prune alerts that get ignored.

4. Structured logs beat grep

Output JSON logs with consistent fields. Future you (and your log search tool) will thank you.

5. Match tools to your team

Open source saves money but costs time. Commercial solutions cost money but save time. Know your trade-offs.