Think of it as the nervous system of your infrastructure: sensing, reporting, and alerting.
The Hospital Analogy
Imagine your application is a patient in a hospital. Doctors don't just guess what's wrong — they use monitoring equipment to understand the patient's health in real-time and historically. The same applies to your systems.
Metrics
Vital signs — CPU, memory, requests/sec
Logs
Medical history — what happened & when
Traces
Diagnostic path — how requests flow
The Problem It Solves
# 3 AM on-call experience...
User: "Site is slow"
You: "Let me SSH into servers..."
You: *checks 15 servers manually*
You: "Hmm, disk looks okay..."
You: *restarts random services*
You: "Try now?"
User: "Still broken"
You: 😭 *4 hours later*
- No idea what's happening in production
- Users report problems before you know
- Debugging is guesswork and SSH
- Can't answer 'why is it slow?'
- No historical data for comparison
# 3 AM with observability...
[Alert] P99 latency > 500ms
[Dashboard] Spike at 02:47
[Trace] Slow query in user-service
[Logs] "Connection pool exhausted"
Fix: Scale connection pool
[Resolved] in 8 minutes ✓
# Back to sleep 😴
- Real-time visibility into all systems
- Alerts before users notice problems
- Trace exact cause of issues
- Historical trends for capacity planning
- Data-driven decisions, not guesswork
The Three Pillars of Observability
Metrics
Numeric measurements over time. The vital signs of your system.
• CPU utilization: 45%
• Requests/second: 1,234
• Error rate: 0.01%
• P99 latency: 120ms
Best for: Dashboards, alerts, trends
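To make a figure like "P99 latency: 120ms" concrete, here is a toy sketch (sample values invented) of a nearest-rank percentile. Prometheus actually estimates quantiles from histogram buckets rather than raw samples, but the intuition is the same:

```python
# Toy sketch: what a percentile means. Latency samples are made up.
latencies_ms = sorted([12, 15, 18, 22, 30, 45, 80, 95, 110, 120])

def percentile(sorted_values, p):
    """Nearest-rank percentile: the value below which roughly p% of samples fall."""
    k = max(0, int(round(p / 100 * len(sorted_values))) - 1)
    return sorted_values[k]

print(percentile(latencies_ms, 50))  # typical request
print(percentile(latencies_ms, 99))  # the slowest 1% of requests
```

Note how the P99 (120ms) is far above the median (30ms): the tail, not the average, is what users complain about.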
Logs
Timestamped records of discrete events. The story of what happened.
2024-01-15 09:23:45 INFO User login
2024-01-15 09:23:46 WARN Slow query
2024-01-15 09:23:47 ERROR Timeout
Best for: Debugging, audit trails, search
Traces
End-to-end request paths across services. The journey of a request.
Best for: Latency analysis, dependencies
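A minimal sketch of the idea in plain Python, with invented service names (real systems use OpenTelemetry or similar): each operation becomes a timed, nested span, and the collected spans reconstruct the request's path.

```python
# Toy tracer sketch: nested, timed spans for one request.
import time
from contextlib import contextmanager

spans = []   # collected as (start, depth, name, duration_ms)
_depth = 0

@contextmanager
def span(name):
    global _depth
    start = time.perf_counter()
    depth = _depth
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        spans.append((start, depth, name, (time.perf_counter() - start) * 1000))

# One simulated request fanning out to two downstream calls
with span("GET /checkout"):
    with span("auth-service"):
        time.sleep(0.01)
    with span("db-query"):
        time.sleep(0.02)

# Print the trace tree in start order, indented by depth
for start, depth, name, ms in sorted(spans):
    print("  " * depth + f"{name} ({ms:.1f}ms)")
```

The root span's duration covers everything beneath it, which is exactly how a trace viewer shows where a slow request spent its time.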
Prometheus + Grafana
Prometheus collects and stores metrics. Grafana visualizes them beautifully. Together, they're the industry standard for open-source metrics.
Pull-based
Prometheus scrapes your apps
Time-series
Optimized for metrics data
Flexible
Any data source in Grafana
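The pull model works because each app exposes its metrics as plain text that Prometheus scrapes over HTTP. A rough sketch of that exposition format (metric names and values invented; real apps use a client library such as prometheus_client):

```python
# Sketch: rendering the plain-text format a /metrics endpoint serves.
def render_metrics(counters):
    lines = []
    for name, labeled_values in counters.items():
        lines.append(f"# TYPE {name} counter")
        for labels, value in labeled_values.items():
            label_str = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

counters = {
    "http_requests_total": {
        (("method", "GET"), ("status", "200")): 1027,
        (("method", "POST"), ("status", "500")): 3,
    }
}
print(render_metrics(counters))
```

Prometheus scrapes this text on its configured interval and stores each labeled line as a separate time series.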
prometheus.yml Configuration
# prometheus.yml
global:
  scrape_interval: 15s       # How often to scrape
  evaluation_interval: 15s   # How often to evaluate rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - "alerts/*.yml"

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Your applications
  - job_name: 'web-app'
    static_configs:
      - targets: ['app1:8080', 'app2:8080', 'app3:8080']

  # Node exporters (host metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node1:9100', 'node2:9100']

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

PromQL Query Examples
Request rate (per second)
rate(http_requests_total[5m])

Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100

P95 latency (histogram)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

CPU usage by instance
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

ELK Stack (Elastic Stack)
Elasticsearch stores and searches logs. Logstash processes and transforms them. Kibana visualizes everything. Beats ships data from your servers.
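To build intuition for what a Logstash filter stage in this pipeline does, here is a rough Python analogue (field names and sample lines invented): parse each JSON log line, drop noise, and route the record to a dated index.

```python
# Sketch of a log-processing step: parse JSON, drop DEBUG, pick an index.
import json
from datetime import datetime, timezone

def process(line):
    record = json.loads(line)            # like: json { source => "message" }
    if record.get("level") == "DEBUG":   # like: drop { } for debug logs
        return None
    ts = datetime.fromisoformat(record["timestamp"])        # date / ISO8601
    record["@timestamp"] = ts.astimezone(timezone.utc).isoformat()
    record["_index"] = ts.strftime("logs-%Y.%m.%d")         # logs-YYYY.MM.dd
    return record

lines = [
    '{"timestamp": "2024-01-15T09:23:46+00:00", "level": "WARN", "message": "Slow query"}',
    '{"timestamp": "2024-01-15T09:23:47+00:00", "level": "DEBUG", "message": "cache hit"}',
]
for out in filter(None, map(process, lines)):
    print(out["_index"], out["level"], out["message"])
```

Daily indices like `logs-2024.01.15` are what make retention policies (delete indices older than N days) cheap in Elasticsearch.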
Logstash Pipeline Configuration
# logstash.conf
input {
  # Receive from Filebeat
  beats {
    port => 5044
  }
  # Or read directly from files
  file {
    path => "/var/log/app/*.log"
    start_position => "beginning"
  }
}

filter {
  # Parse JSON logs
  json {
    source => "message"
  }
  # Parse timestamps
  date {
    match => [ "timestamp", "ISO8601" ]
  }
  # Extract fields from message
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
  # Add geolocation from IP
  geoip {
    source => "clientip"
  }
  # Drop debug logs in production
  if [level] == "DEBUG" {
    drop { }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}

Datadog (Commercial Solution)
When to Use a Commercial Platform
Datadog, New Relic, and Splunk are expensive but eliminate operational overhead. You pay money instead of engineering time.
Key Features
- Unified metrics, logs, traces, and APM
- 500+ integrations out of the box
- Auto-instrumentation for many languages
- Advanced ML-powered anomaly detection
- Real User Monitoring (RUM)
- Synthetic monitoring and uptime checks
When to Choose Commercial
- No dedicated ops/SRE team
- Need fast time-to-value
- Compliance requires vendor support
- Budget allows $15-50/host/month
- Multi-cloud or complex architectures
- Need 24/7 support SLAs
| Aspect | Prometheus/Grafana | ELK Stack | Datadog |
|---|---|---|---|
| Cost | Free (OSS) | Free (OSS) | $$$ per host |
| Setup Time | Medium | High | Low |
| Best For | Metrics | Logs/Search | Everything |
| Scaling | Manual | Complex | Automatic |
| Maintenance | You | You | Vendor |
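To put the cost row in perspective, a back-of-envelope sketch (every number here is an illustrative assumption, not a vendor quote):

```python
# Rough cost comparison: SaaS per-host pricing vs. engineer time spent
# operating open-source monitoring. All figures are assumptions.
hosts = 50
saas_per_host_month = 23           # assumed mid-range list price ($/host/mo)
engineer_hourly = 100              # assumed loaded engineer cost ($/hr)
oss_maintenance_hours_month = 20   # assumed upkeep for Prometheus/ELK

saas_monthly = hosts * saas_per_host_month
oss_monthly = oss_maintenance_hours_month * engineer_hourly
print(f"SaaS: ${saas_monthly}/mo, OSS upkeep: ${oss_monthly}/mo")
```

The crossover depends entirely on your host count and how much upkeep your stack really needs, which is why "free" open source can still be the expensive option for a small team.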
Application Instrumentation
Prometheus Metrics with Flask
# app.py
from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5]
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    latency = time.time() - request.start_time
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=request.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        endpoint=request.path
    ).observe(latency)
    return response

@app.route('/metrics')
def metrics():
    return generate_latest()

Structured Logging with structlog
# logging_config.py
import structlog

# Configure structured logging
structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer()  # Output JSON!
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

# Usage
logger = structlog.get_logger()
logger.info("user_login",
    user_id="12345",
    ip_address="192.168.1.1",
    user_agent="Mozilla/5.0..."
)

# Output (JSON, easy to parse in ELK):
# {"event": "user_login", "user_id": "12345", "ip_address": "192.168.1.1", ...}

Alerting Best Practices
Severity Levels
P1 - Critical
Service down, data loss risk. Wake someone up.
P2 - High
Degraded service. Needs attention within an hour.
P3 - Medium
Not urgent. Handle during business hours.
P4 - Low
Informational. Review when time permits.
Avoiding Alert Fatigue
Too many alerts = ignored alerts. Be ruthless.
- Every alert must be actionable
- If you ignore it 3 times, delete it
- Group related alerts together
- Set reasonable thresholds (not too sensitive)
- Use escalation policies, not spam
- Review and prune alerts quarterly
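The "group related alerts" point can be sketched in a few lines (alert payloads invented). Alertmanager does this natively via its `group_by` setting; the effect is one notification per problem, not one per instance:

```python
# Sketch: collapse per-instance alerts into one notification per alertname.
from collections import defaultdict

alerts = [
    {"alertname": "InstanceDown", "instance": "node1"},
    {"alertname": "InstanceDown", "instance": "node2"},
    {"alertname": "HighLatency", "instance": "app1"},
]

grouped = defaultdict(list)
for a in alerts:
    grouped[a["alertname"]].append(a["instance"])

for name, instances in grouped.items():
    print(f"[{name}] affects {len(instances)} instance(s): {', '.join(instances)}")
```

Two pages for the same outage is already alert fatigue; grouping turns a 50-host incident into one page with a list of affected hosts.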
Prometheus Alert Rules Example
# alerts/app_alerts.yml
groups:
  - name: app_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Slow response times
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(http_request_duration_seconds_bucket[5m])
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value | humanizeDuration }}"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"

Decision Guide
Trade-offs
Prometheus + Grafana
Pros
- Industry standard, massive community
- Powerful query language (PromQL)
- Pull model is firewall-friendly
- Native Kubernetes integration
- Free and open source
Cons
- Metrics only (no logs/traces)
- Local storage is limited
- High cardinality can explode
- Requires operational knowledge
- Scaling requires Thanos/Cortex
ELK Stack
Pros
- Full-text search is unmatched
- Handles any log format
- Kibana is very powerful
- Elastic APM for traces
- Huge ecosystem of plugins
Cons
- Resource hungry (RAM, disk)
- Complex to operate at scale
- JVM tuning is an art
- License changes (SSPL)
- Index management overhead
Commercial (Datadog, New Relic)
Pros
- Unified platform (metrics+logs+traces)
- Zero infrastructure management
- Enterprise support and SLAs
- Advanced ML features
- Easy to get started
Cons
- Expensive ($15-50+/host/month)
- Vendor lock-in risk
- Data leaves your network
- Costs can spike unexpectedly
- Feature limitations on lower tiers
Key Takeaways
Observability = Metrics + Logs + Traces
Each pillar answers different questions. Use all three for complete visibility into your systems.
Instrument from day one
Adding observability later is painful. Build it into your apps and infrastructure from the start.
Alerts should be actionable
If an alert doesn't require immediate action, it's noise. Ruthlessly prune alerts that get ignored.
Structured logs beat grep
Output JSON logs with consistent fields. Future you (and your log search tool) will thank you.
Match tools to your team
Open source saves money but costs time. Commercial solutions cost money but save time. Know your trade-offs.