Section 12 — Phase 4: Production

Observability & Production Engineering

You cannot fix what you cannot see. Observability is the discipline of making your system legible — understanding what it is doing, why it is slow, and where it is failing before your users notice.

Why Observability Matters

Deployable code that runs without crashing is a starting point, not a finish line. Production systems fail in ways your tests never predicted: slow database queries that appear only under load, memory leaks that surface after three days of uptime, a third-party API that starts timing out at 2am. Without observability, your team flies blind — reacting to customer complaints instead of catching failures proactively.

The Three Pillars of Observability

Logs tell you what happened — discrete events with timestamps and context. Metrics tell you how the system is behaving over time — request rates, error rates, latencies. Traces tell you why it is slow — the full journey of a single request across every service it touched. You need all three.

The difference between a junior engineer and a senior engineer in production is largely observability. Seniors instrument their code before the incident happens. They write logs with correlation IDs. They add custom metrics for business-critical paths. They trace slow operations end-to-end. When the alert fires at 3am, they can diagnose root cause in minutes instead of hours.

Observability Architecture
A Spring Boot application emits three telemetry streams (logs, metrics, and spans), each flowing through its own pipeline:

Logs: Logback / SLF4J → Logstash → Elasticsearch → Kibana
Metrics: Micrometer → Prometheus → Grafana dashboards & alerts
Traces: Micrometer Tracing → Zipkin / Jaeger (trace UI)

Structured Logging with SLF4J & Logback

The most common observability mistake is unstructured logging — plain text messages that are impossible to query at scale. When you have a million log lines per minute, log.info("User 42 logged in") is useless. You cannot filter by user ID without fragile regex. Structured logging emits machine-readable JSON that log aggregators can index and query efficiently.
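One concrete way to emit those machine-readable fields from application code is the StructuredArguments helper that ships with the logstash-logback-encoder dependency configured below; a minimal sketch, with an illustrative class and field names:

LoginAuditLogger.java
import static net.logstash.logback.argument.StructuredArguments.kv;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LoginAuditLogger {

    private static final Logger log = LoggerFactory.getLogger(LoginAuditLogger.class);

    public void logLogin(long userId, String authMethod) {
        // With the LogstashEncoder, each kv() pair becomes a top-level JSON field, e.g.
        // {"message":"User logged in userId=42 via=password","userId":42,"via":"password"}
        log.info("User logged in {} {}", kv("userId", userId), kv("via", authMethod));
    }
}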

Never Log Sensitive Data

Logs travel through many systems: files, log shippers, search indexes, dashboards. Never log passwords, tokens, credit card numbers, SSNs, or PII. Use masking utilities or log only non-sensitive identifiers. GDPR and PCI compliance failures have resulted from overly verbose production logs.
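If a value must appear in a log line at all, mask it first. A minimal sketch of such a utility, assuming you own it in your codebase (LogMasker is a hypothetical helper, not a library class):

LogMasker.java
public final class LogMasker {

    private LogMasker() {
    }

    /** Keeps the first and last two characters only, e.g. "jane.doe@example.com" becomes "ja****om". */
    public static String mask(String value) {
        if (value == null || value.length() <= 4) {
            return "****";
        }
        return value.substring(0, 2) + "****" + value.substring(value.length() - 2);
    }
}

// Usage: log a masked form or an opaque identifier, never the raw value
// log.info("Password reset requested for {}", LogMasker.mask(email));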

Configuring Logback for JSON Output

Add the Logstash encoder and configure logback-spring.xml to emit structured JSON in production while keeping human-readable output locally:

pom.xml
<dependency>
  <groupId>net.logstash.logback</groupId>
  <artifactId>logstash-logback-encoder</artifactId>
  <version>7.4</version>
</dependency>
logback-spring.xml
<configuration>

  <!-- Human-readable for local development -->
  <springProfile name="local">
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
      <encoder>
        <pattern>%d{HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n</pattern>
      </encoder>
    </appender>
    <root level="INFO">
      <appender-ref ref="CONSOLE"/>
    </root>
  </springProfile>

  <!-- JSON structured output for staging and production -->
  <springProfile name="prod,staging">
    <appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
      <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <customFields>{"app":"order-service","env":"${SPRING_PROFILES_ACTIVE}"}</customFields>
        <throwableConverter class="net.logstash.logback.stacktrace.ShortenedThrowableConverter">
          <maxDepthPerCause>10</maxDepthPerCause>
          <rootCauseFirst>true</rootCauseFirst>
        </throwableConverter>
      </encoder>
    </appender>
    <root level="INFO">
      <appender-ref ref="JSON_CONSOLE"/>
    </root>
    <logger name="com.yourcompany" level="DEBUG"/>
  </springProfile>

</configuration>

MDC — Mapped Diagnostic Context

MDC is the mechanism for attaching contextual data to every log line within a thread. Set a correlation ID once at request entry and every downstream log line automatically includes it:

RequestLoggingFilter.java
@Component
public class RequestLoggingFilter extends OncePerRequestFilter {

    private static final Logger log = LoggerFactory.getLogger(RequestLoggingFilter.class);

    @Override
    protected void doFilterInternal(HttpServletRequest request,
                                    HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {

        String correlationId = Optional
            .ofNullable(request.getHeader("X-Correlation-Id"))
            .orElseGet(() -> UUID.randomUUID().toString().substring(0, 8));

        MDC.put("correlationId", correlationId);
        MDC.put("userId", Optional.ofNullable(request.getHeader("X-User-Id")).orElse("anonymous"));
        MDC.put("method", request.getMethod());
        MDC.put("path", request.getRequestURI());

        long start = System.currentTimeMillis();
        try {
            response.addHeader("X-Correlation-Id", correlationId);
            chain.doFilter(request, response);
        } finally {
            long duration = System.currentTimeMillis() - start;
            MDC.put("durationMs", String.valueOf(duration));
            MDC.put("statusCode", String.valueOf(response.getStatus()));
            log.info("HTTP {} {} → {} in {}ms",
                request.getMethod(), request.getRequestURI(),
                response.getStatus(), duration);
            MDC.clear(); // ALWAYS clear — thread pools reuse threads
        }
    }
}
Production Bug: Leaking MDC State

Thread pools reuse threads. If you forget MDC.clear() in the finally block, the next request processed by that thread inherits all MDC fields from the previous request — a compliance nightmare. Always clear in finally, never just in the try block.

Logging Best Practices

Logging Patterns
// BAD: String concatenation — evaluated even when log level is OFF
log.debug("Processing order: " + order.getId() + " items: " + order.getItems().size());

// GOOD: Parameterized — string only built if DEBUG is enabled
log.debug("Processing order: {} with {} items", order.getId(), order.getItems().size());

// BAD: Log AND throw — exception logged multiple times up the stack
try {
    processOrder(order);
} catch (Exception e) {
    log.error("Failed to process order", e);
    throw e; // gets logged again at the next catch block
}

// GOOD: Either log OR rethrow at the boundary that handles it
try {
    processOrder(order);
} catch (OrderProcessingException e) {
    log.error("Order {} processing failed: {}", order.getId(), e.getMessage(), e);
    throw new ServiceException("Order processing failed", e);
}

// BAD: Logging sensitive data
log.info("User authenticated: {} password: {}", user.getEmail(), plainTextPassword);

// GOOD: Log only non-sensitive identifiers
log.info("User authenticated: userId={} via={}", user.getId(), authMethod);

Metrics with Micrometer & Prometheus

Metrics answer a different question than logs. Instead of "what happened to request 4a2b", metrics tell you "how many requests per second are we serving, and what fraction are failing?" Micrometer is Spring Boot's metrics facade — vendor-neutral, able to export to Prometheus, Datadog, CloudWatch, or Influx by swapping one dependency.

Actuator and Prometheus Setup

pom.xml
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
application.yml
management:
  endpoints:
    web:
      exposure:
        include: health,info,prometheus,metrics
      base-path: /actuator
  endpoint:
    health:
      show-details: when-authorized
    prometheus:
      enabled: true
  metrics:
    tags:
      application: ${spring.application.name}
      environment: ${spring.profiles.active:local}
    distribution:
      percentiles-histogram:
        http.server.requests: true
      percentiles:
        http.server.requests: 0.5, 0.90, 0.95, 0.99
      slo:
        http.server.requests: 10ms, 50ms, 100ms, 250ms, 500ms, 1s

Custom Business Metrics

Spring Boot auto-instruments JVM, HTTP, datasource, and cache metrics for free. But the most valuable metrics are the ones you write — business metrics that capture what matters to your application:

OrderService.java
@Service
public class OrderService {

    private final MeterRegistry registry;
    private final OrderRepository orderRepository;
    private final Counter ordersPlaced;
    private final Counter ordersFailed;

    public OrderService(MeterRegistry registry, OrderRepository orderRepository) {
        this.registry = registry;
        this.orderRepository = orderRepository;

        // Counter: monotonically increasing — total events
        this.ordersPlaced = Counter.builder("orders.placed")
            .description("Total orders successfully placed")
            .tag("version", "v2")
            .register(registry);

        this.ordersFailed = Counter.builder("orders.failed")
            .description("Total failed order attempts")
            .register(registry);

        // Gauge: current snapshot — use for queue depths, pool sizes.
        // Register once here, not on every request; Prometheus polls the
        // lambda on each scrape.
        Gauge.builder("orders.pending", orderRepository,
                      repo -> repo.countByStatus(OrderStatus.PENDING))
             .description("Orders pending fulfilment")
             .register(registry);
    }

    public Order placeOrder(CreateOrderRequest request) {
        // Timer: measure duration + automatically tracks count, sum, max
        return Timer.builder("orders.processing.time")
            .description("Time to process and persist an order")
            .tag("channel", request.getChannel())
            .publishPercentiles(0.5, 0.95, 0.99)
            .register(registry)
            .recordCallable(() -> {
                try {
                    Order order = createAndPersist(request);
                    ordersPlaced.increment();
                    return order;
                } catch (Exception e) {
                    ordersFailed.increment();
                    registry.counter("orders.failed",
                        "reason", e.getClass().getSimpleName()).increment();
                    throw e;
                }
            });
    }
}
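For straightforward cases, the programmatic Timer above can be replaced by Micrometer's @Timed annotation. A sketch, assuming spring-boot-starter-aop is on the classpath, since @Timed only takes effect once a TimedAspect bean is registered (RefundService and its metric name are illustrative):

MetricsAspectConfig.java
import io.micrometer.core.annotation.Timed;
import io.micrometer.core.aop.TimedAspect;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Service;

@Configuration
public class MetricsAspectConfig {

    // Intercepts @Timed methods and records a Timer per method and tag set
    @Bean
    public TimedAspect timedAspect(MeterRegistry registry) {
        return new TimedAspect(registry);
    }
}

@Service
class RefundService {

    @Timed(value = "refunds.processing.time",
           description = "Time to process a refund",
           percentiles = {0.5, 0.95, 0.99})
    public void processRefund(String orderId) {
        // business logic; duration, count, and max are recorded automatically
    }
}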

The Four Golden Signals

Google SRE popularised four metrics that, if monitored on every service, provide sufficient coverage to detect virtually any production problem:

📈 Latency: How long requests take. Track p50, p95, p99 — averages hide tail latency. A p99 of 5s means 1 in 100 users waits 5 seconds.
🚦 Traffic: How much demand the system receives — requests per second, messages per second. Anomalies (sudden spikes or drops) indicate incidents.
Errors: Rate of failed requests — HTTP 5xx, exceptions, timeouts. Track separately from client errors (400s). A sudden 2% error rate is an incident.
🖥️ Saturation: How "full" the system is — CPU, memory, thread pool, DB connection pool. Systems degrade before they crash; saturation is the warning signal (see the gauge sketch below).

Prometheus & Grafana in Practice

Prometheus is a pull-based time-series database. It scrapes your /actuator/prometheus endpoint on a configurable interval and stores the data. Grafana is the visualisation layer that queries Prometheus and renders dashboards and alert rules.

Docker Compose: Full Observability Stack

docker-compose.observability.yml
version: "3.9"

services:
  prometheus:
    image: prom/prometheus:v2.50.0
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_data:/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=15d
      - --web.enable-lifecycle
    ports:
      - "9090:9090"
    networks: [obs]

  grafana:
    image: grafana/grafana:10.3.0
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./observability/grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
    depends_on: [prometheus]
    networks: [obs]

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./observability/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    ports:
      - "9093:9093"
    networks: [obs]

volumes:
  prometheus_data:
  grafana_data:

networks:
  obs:
    driver: bridge
prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alerts.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: "spring-boot-app"
    metrics_path: /actuator/prometheus
    static_configs:
      - targets: ["app:8080"]

Prometheus Alert Rules

alerts.yml
groups:
  - name: spring-boot
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (instance) (rate(http_server_requests_seconds_count{status=~"5.."}[5m]))
          /
          sum by (instance) (rate(http_server_requests_seconds_count[5m])) > 0.02
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate on {{ $labels.instance }}"
          description: "Error rate {{ $value | humanizePercentage }} over last 5m"

      - alert: SlowP99Latency
        expr: |
          histogram_quantile(0.99,
            sum by (le, instance) (rate(http_server_requests_seconds_bucket[5m]))
          ) > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 latency above 1s on {{ $labels.instance }}"

      - alert: JvmHeapHigh
        expr: |
          sum by (instance) (jvm_memory_used_bytes{area="heap"})
          /
          sum by (instance) (jvm_memory_max_bytes{area="heap"}) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM heap above 85% on {{ $labels.instance }}"

      - alert: DatabaseConnectionPoolExhausted
        expr: hikaricp_connections_pending > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "HikariCP pool has {{ $value }} pending requests"

Distributed Tracing with Micrometer Tracing

Logs and metrics tell you what happened system-wide. Tracing tells you what happened to one specific request — across every service, database query, and external call it triggered. Without tracing, diagnosing a 3-second response requires reconstructing the journey from log fragments across five services.

Traces, Spans, and Trace IDs

A trace is the entire journey of one request — it has a globally unique trace ID. A trace is composed of spans: each span is one unit of work (HTTP call, DB query, cache lookup). Spans form a tree. Every span records start time, duration, tags, and errors.

Setup: Micrometer Tracing + Zipkin

pom.xml
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<dependency>
  <groupId>io.zipkin.reporter2</groupId>
  <artifactId>zipkin-reporter-brave</artifactId>
</dependency>
application.yml
management:
  tracing:
    sampling:
      probability: 1.0   # 100% in dev; use 0.1 in production
  zipkin:
    tracing:
      endpoint: http://zipkin:9411/api/v2/spans

logging:
  pattern:
    level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

Custom Spans for Fine-Grained Tracing

PaymentService.java
@Service
@RequiredArgsConstructor
public class PaymentService {

    private final Tracer tracer;
    private final PaymentGatewayClient gatewayClient;

    public PaymentResult processPayment(PaymentRequest request) {

        Span gatewaySpan = tracer.nextSpan()
            .name("payment.gateway.charge")
            .tag("gateway", "stripe")
            .tag("amount", String.valueOf(request.getAmount()))
            .start();

        try (Tracer.SpanInScope scope = tracer.withSpan(gatewaySpan)) {
            gatewaySpan.event("Sending charge request");
            GatewayResponse response = gatewayClient.charge(request);
            gatewaySpan.event("Charge response received");
            gatewaySpan.tag("gateway.status", response.getStatus());

            if (!response.isSuccessful()) {
                gatewaySpan.tag("error", "true");
                gatewaySpan.tag("error.message", response.getDeclineReason());
            }
            return PaymentResult.from(response);

        } catch (Exception e) {
            gatewaySpan.tag("error", "true");
            gatewaySpan.error(e);
            throw e;
        } finally {
            gatewaySpan.end(); // ALWAYS end spans — leaks cause memory issues
        }
    }
}
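Spring Boot 3 also auto-configures an ObservationRegistry, and Micrometer's Observation API can emit both a timer metric and a span from a single instrumentation point instead of managing the Span by hand. A minimal sketch of the same gateway call expressed as an observation (ObservedPaymentService is illustrative):

ObservedPaymentService.java
import io.micrometer.observation.Observation;
import io.micrometer.observation.ObservationRegistry;
import org.springframework.stereotype.Service;

@Service
public class ObservedPaymentService {

    private final ObservationRegistry observations;
    private final PaymentGatewayClient gatewayClient;

    public ObservedPaymentService(ObservationRegistry observations,
                                  PaymentGatewayClient gatewayClient) {
        this.observations = observations;
        this.gatewayClient = gatewayClient;
    }

    public PaymentResult processPayment(PaymentRequest request) {
        // One observation produces a Timer named payment.gateway.charge and a span of the same name
        return Observation.createNotStarted("payment.gateway.charge", observations)
            .lowCardinalityKeyValue("gateway", "stripe")
            .observe(() -> PaymentResult.from(gatewayClient.charge(request)));
    }
}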

Propagating Trace Context Across Kafka

Spring Boot auto-propagates trace headers for WebClient and RestTemplate instances built from the auto-configured builders. For Kafka, you need explicit propagation:

TracingKafkaProducer.java
@Component
@RequiredArgsConstructor
public class TracingKafkaProducer {

    private final KafkaTemplate<String, Object> kafkaTemplate;
    private final Tracer tracer;
    private final Propagator propagator;

    public void sendWithTracing(String topic, Object payload) {
        Map<String, String> headers = new HashMap<>();
        propagator.inject(tracer.currentTraceContext().context(),
            headers, Map::put);

        ProducerRecord<String, Object> record = new ProducerRecord<>(topic, payload);
        headers.forEach((k, v) ->
            record.headers().add(k, v.getBytes(StandardCharsets.UTF_8)));

        kafkaTemplate.send(record);
    }
}
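On the consumer side, the same headers have to be read back and turned into a span before processing begins, otherwise the consumer's logs and spans start a fresh trace. A sketch using the same Propagator, reading header values out of the ConsumerRecord (topic and span names are illustrative):

TracingKafkaConsumer.java
import java.nio.charset.StandardCharsets;

import io.micrometer.tracing.Span;
import io.micrometer.tracing.Tracer;
import io.micrometer.tracing.propagation.Propagator;
import lombok.RequiredArgsConstructor;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.header.Header;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
@RequiredArgsConstructor
public class TracingKafkaConsumer {

    private static final Logger log = LoggerFactory.getLogger(TracingKafkaConsumer.class);

    private final Tracer tracer;
    private final Propagator propagator;

    @KafkaListener(topics = "orders")
    public void onMessage(ConsumerRecord<String, Object> record) {
        // Rebuild the upstream trace context from the record headers
        Span span = propagator.extract(record, (carrier, key) -> {
                Header header = carrier.headers().lastHeader(key);
                return header == null ? null : new String(header.value(), StandardCharsets.UTF_8);
            })
            .name("orders.consume")
            .start();

        try (Tracer.SpanInScope scope = tracer.withSpan(span)) {
            log.info("Consumed message from partition {}", record.partition());
            // process the payload here
        } finally {
            span.end();
        }
    }
}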

The ELK Stack — Centralized Log Management

ELK stands for Elasticsearch, Logstash, and Kibana. Your Spring Boot app writes JSON logs to stdout, Filebeat ships them to Logstash for parsing, Logstash forwards them to Elasticsearch for indexing, and Kibana provides a web UI for searching and visualising logs across all services.

ELK with Docker Compose

docker-compose.elk.yml
version: "3.9"

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - es_data:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
    networks: [elk]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9200"]
      interval: 10s
      timeout: 5s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.12.0
    volumes:
      - ./observability/logstash/pipeline:/usr/share/logstash/pipeline:ro
    ports:
      - "5044:5044"
      - "5000:5000"
    depends_on:
      elasticsearch:
        condition: service_healthy
    networks: [elk]

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    environment:
      ELASTICSEARCH_HOSTS: http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on: [elasticsearch]
    networks: [elk]

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.12.0
    user: root
    volumes:
      - ./observability/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on: [logstash]
    networks: [elk]

volumes:
  es_data:

networks:
  elk:
    driver: bridge
logstash.conf
input {
  beats { port => 5044 }
  tcp   { port => 5000 codec => json }
}

filter {
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
    mutate {
      rename => {
        "[parsed][traceId]"       => "trace_id"
        "[parsed][spanId]"        => "span_id"
        "[parsed][level]"         => "log_level"
        "[parsed][logger_name]"   => "logger"
        "[parsed][message]"       => "log_message"
        "[parsed][app]"           => "service"
        "[parsed][correlationId]" => "correlation_id"
        "[parsed][userId]"        => "user_id"
        "[parsed][durationMs]"    => "duration_ms"
      }
    }
  }
  date {
    match  => ["[parsed][@timestamp]", "ISO8601"]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts             => ["elasticsearch:9200"]
    index             => "spring-logs-%{+YYYY.MM.dd}"
    template_overwrite => true
  }
}
Kibana Queries That Save You During an Incident

trace_id:"a1b2c3" finds every log line from one request across all services. log_level:"ERROR" AND service:"order-service" filters errors from one service. duration_ms > 1000 finds slow requests. Save these as dashboards and they become your incident runbook.

Spring Boot Actuator — Production Endpoints

Actuator exposes production-ready endpoints for health checks, metrics, thread dumps, heap dumps, and more. It is the first tool you reach for when diagnosing a production issue.

application.yml — Actuator config
management:
  endpoints:
    web:
      exposure:
        # NEVER expose * on a public-facing service
        include: health,info,prometheus,metrics,loggers,threaddump,env
      base-path: /internal
  endpoint:
    health:
      show-details: always
      show-components: always
      probes:
        enabled: true   # Kubernetes liveness/readiness probes
    loggers:
      enabled: true     # Change log level at runtime without restart
    env:
      show-values: never  # Never expose env values in production
PaymentGatewayHealthIndicator.java
@Component
public class PaymentGatewayHealthIndicator implements HealthIndicator {

    private final PaymentGatewayClient client;

    @Override
    public Health health() {
        try {
            GatewayPingResponse ping = client.ping();
            if (ping.isHealthy()) {
                return Health.up()
                    .withDetail("gateway", ping.getVersion())
                    .withDetail("latencyMs", ping.getLatencyMs())
                    .build();
            }
            return Health.down()
                .withDetail("reason", ping.getDegradedReason())
                .build();
        } catch (Exception e) {
            return Health.down()
                .withException(e)
                .withDetail("endpoint", client.getEndpoint())
                .build();
        }
    }
}

Runtime Log Level Changes

Change log levels without restarting. When debugging in production, temporarily enable DEBUG for a specific logger, observe, then revert:

Actuator Loggers API
# Get current level for a logger
GET /internal/loggers/com.yourcompany.payment
# Response: { "configuredLevel": "INFO", "effectiveLevel": "INFO" }

# Temporarily enable DEBUG for investigation
curl -X POST http://app:8080/internal/loggers/com.yourcompany.payment \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel": "DEBUG"}'

# Revert after investigation
curl -X POST http://app:8080/internal/loggers/com.yourcompany.payment \
  -H "Content-Type: application/json" \
  -d '{"configuredLevel": "INFO"}'

Production Incident Response

Observability is worthless without a system for using it. The faster you move from "alert fired" to "root cause identified", the less impact the incident has on users.

The Incident Investigation Flow

Incident Response Playbook
1. Alert fires — Prometheus alerts on error rate > 2%, p99 > 1s, or pod crash loop.
2. Blast radius — How many users affected? Which endpoints? Is the error rate trending up or stable?
3. Recent changes — What deployed in the last hour? Any config or dependency changes?
4. Trace a failing request — Find a failing trace ID in Kibana. Follow spans. Where does the error originate?
5. Correlate metrics — Compare error spike timestamp with CPU, memory, DB pool, external API latency.
6. Mitigate first — Rollback, scale out, or enable circuit breaker. Stop the bleeding before forensic analysis.
7. Blameless postmortem — Timeline, impact, contributing causes, and action items to prevent recurrence.

Debugging Common Production Failures

Scenario: OOM Kill — Pod Restarts

Symptoms: Pod restarts, java.lang.OutOfMemoryError in logs, heap metric flatlines at 100%. Diagnosis: Check /actuator/metrics/jvm.memory.used trend. Take a heap dump via /actuator/heapdump before OOM. Analyse with Eclipse MAT. Look for: caches without eviction, event listeners not removed, ThreadLocal leaks. Mitigate: Add -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof so the dump is captured automatically at crash time.

Scenario: DB Connection Pool Exhaustion

Symptoms: Requests hang then timeout. HikariPool-1 - Connection not available, timed out after 30000ms. hikaricp_connections_pending spikes. Diagnosis: Take a thread dump via /actuator/threaddump. Look for threads blocked on HikariPool.getConnection(). Check DB for long-running transactions: SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC. Root causes: Missing @Transactional propagation; N+1 queries holding connections; slow external call inside a transaction; connection leak.
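The "slow external call inside a transaction" root cause is worth seeing in code, because the fix is structural rather than a tuning knob. A hedged sketch; ShippingQuoteClient and the entity methods are illustrative, and the transactional write lives in a separate bean so the proxy is not bypassed by self-invocation:

CheckoutService.java
@Service
@RequiredArgsConstructor
public class CheckoutService {

    private final OrderRepository orderRepository;
    private final ShippingQuoteClient shippingQuoteClient;
    private final OrderPersistenceService persistence;

    // BAD: the DB connection acquired by the read is held while we wait on a
    // remote HTTP call; under load this drains the HikariCP pool
    @Transactional
    public void checkoutHoldingConnection(Long orderId) {
        Order order = orderRepository.findById(orderId).orElseThrow();
        ShippingQuote quote = shippingQuoteClient.quoteFor(order); // slow remote call inside the transaction
        order.applyShipping(quote); // flushed at commit
    }

    // BETTER: do the slow call with no transaction open, then run a short
    // write-only transaction in a separate bean
    public void checkout(Long orderId) {
        Order order = orderRepository.findById(orderId).orElseThrow();
        ShippingQuote quote = shippingQuoteClient.quoteFor(order);
        persistence.applyShippingAndSave(order, quote);
    }
}

@Service
@RequiredArgsConstructor
class OrderPersistenceService {

    private final OrderRepository orderRepository;

    @Transactional
    public void applyShippingAndSave(Order order, ShippingQuote quote) {
        order.applyShipping(quote);
        orderRepository.save(order);
    }
}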

Scenario: CPU Spike — Thread Starvation

Symptoms: CPU pegged at 100%, throughput collapses, p99 skyrockets, but no error rate increase. Diagnosis: Take a thread dump or use async-profiler. Look for RUNNABLE threads in tight loops. Check for: expensive regex in hot paths, infinite retry loops, serialisation of large objects on every request, or GC pressure (check jvm.gc.pause metric).

Grafana Loki — Lightweight Log Aggregation

ELK is powerful but operationally heavy — Elasticsearch is expensive to run. Grafana Loki indexes only log labels (not full text), making it dramatically cheaper. Loki integrates natively with Grafana so you can correlate logs and metrics on the same dashboard.

pom.xml
<dependency>
  <groupId>com.github.loki4j</groupId>
  <artifactId>loki-logback-appender</artifactId>
  <version>1.5.1</version>
</dependency>
logback-spring.xml — Loki appender
<springProfile name="prod">
  <appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
    <http>
      <url>http://loki:3100/loki/api/v1/push</url>
    </http>
    <format>
      <label>
        <pattern>app=${spring.application.name},env=${spring.profiles.active},level=%level</pattern>
      </label>
      <message class="net.logstash.logback.encoder.LogstashEncoder"/>
    </format>
  </appender>
  <root level="INFO">
    <appender-ref ref="LOKI"/>
  </root>
</springProfile>
Correlated Observability in Grafana

With Prometheus (metrics), Loki (logs), and Tempo (traces) all feeding into Grafana, you get the full "Explore" workflow: click a spike on your error rate graph → Grafana generates a Loki query for that time range → click a log line → Grafana opens the trace in Tempo. This is the modern observability stack — entirely free and open-source.

SLOs, SLAs, and Error Budgets

Senior engineers don't just monitor systems — they define what "working correctly" means quantitatively. SLOs are internal targets. SLAs are customer contracts. Error budgets are the mathematical consequence of SLOs.

🎯 SLI (Indicator): A measurable metric: "the fraction of requests completing in under 200ms" or "the fraction returning a 2xx response".
📋 SLO (Objective): The internal target: "99.9% of requests return 2xx over a 30-day rolling window". Breach triggers incident review, not customer penalty.
📜 SLA (Agreement): The customer contract: "99.5% uptime per month or we issue service credits". SLAs are looser than internal SLOs — that gap is your buffer.
💰 Error Budget: 99.9% availability = 43.8 minutes/month allowed downtime (worked out below). Once spent, freeze non-critical deploys until the window resets.
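Where the 43.8 minutes comes from: an average month is 365.25 / 12 ≈ 30.44 days, so a 99.9% target allows 0.001 × 30.44 × 24 × 60 ≈ 43.8 minutes of downtime; over a strict 30-day window the budget is 43.2 minutes.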

Prometheus SLO Recording Rules

slo-rules.yml
groups:
  - name: slo_rules
    interval: 30s
    rules:
      # SLI: fraction of successful requests over 5-minute windows
      - record: job:http_requests_success_rate:ratio_rate5m
        expr: |
          sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
          /
          sum(rate(http_server_requests_seconds_count[5m]))

      # Alert when error budget burns 14.4x faster than allowed
      # At this rate, the monthly budget is exhausted in ~1 hour
      - alert: ErrorBudgetBurnRateHigh
        expr: |
          (1 - job:http_requests_success_rate:ratio_rate5m) > 14.4 * (1 - 0.999)
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning 14.4x faster than allowed"
          description: "At this rate, monthly error budget exhausted in 1 hour"

Production Pitfalls

Pitfall 1: Logging Too Much (or Too Little)

Excessive DEBUG logging causes I/O saturation — gigabytes per hour of logs is a real CPU and disk cost. Insufficient logging means you cannot diagnose failures. Use ERROR and WARN always; INFO for request boundaries and significant state changes; DEBUG only when actively investigating (toggle via Actuator without restart).

Pitfall 2: Alerting on Symptoms, Not User Experience

Alerting on CPU usage is a symptom. Alerting on error rate or p99 latency measures the user experience directly. Build primary alerts around the four golden signals. Only page a human for something requiring immediate action — noisy alerts cause on-call fatigue and engineers stop responding.

Pitfall 3: 100% Trace Sampling Rate in Production

100% sampling on a service handling 10,000 req/sec means storing 10,000 traces/sec. This saturates your trace backend and adds latency. Use tail-based sampling: 100% of error traces, 1–10% of success traces. Configure this in OpenTelemetry Collector, not in the application.

Pitfall 4: Clock Skew Across Services

If services run in different timezones or have clock drift, cross-service log correlation breaks. Always log in UTC. Verify NTP synchronisation in cloud environments — instance clocks can drift by seconds, making trace timeline reconstruction impossible.

Pitfall 5: Exposing Actuator Endpoints Publicly

/actuator/heapdump dumps your entire JVM heap — containing tokens, passwords in config, and user data. /actuator/env exposes all environment variables. NEVER expose Actuator on the public internet. Restrict with Spring Security to internal IPs or authenticated service accounts only.
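A minimal Spring Security sketch of that restriction, assuming spring-boot-starter-security is on the classpath; the OPS role and HTTP Basic choice are illustrative:

ActuatorSecurityConfig.java
import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ActuatorSecurityConfig {

    @Bean
    public SecurityFilterChain actuatorFilterChain(HttpSecurity http) throws Exception {
        http
            // Apply this chain only to Actuator endpoints
            .securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                // Health stays open so Kubernetes probes keep working
                .requestMatchers(EndpointRequest.to(HealthEndpoint.class)).permitAll()
                // Everything else (heapdump, env, loggers, ...) needs an ops identity
                .anyRequest().hasRole("OPS"))
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}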

Pitfall 6: MDC Lost Across Async Boundaries

MDC context does NOT propagate across @Async, CompletableFuture, virtual threads, or Kafka consumers. Explicitly copy context: Map<String, String> ctx = MDC.getCopyOfContextMap() then MDC.setContextMap(ctx) in the new thread before any logging.
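One way to make that copy automatic for @Async work is a TaskDecorator on the executor; a sketch under the assumption that this executor is the one your @Async methods use (pool size and bean name are illustrative):

AsyncMdcConfig.java
import java.util.Map;

import org.slf4j.MDC;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskDecorator;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class AsyncMdcConfig {

    // Copies the caller's MDC into the worker thread and always cleans up after the task
    @Bean
    public TaskDecorator mdcTaskDecorator() {
        return task -> {
            Map<String, String> context = MDC.getCopyOfContextMap();
            return () -> {
                if (context != null) {
                    MDC.setContextMap(context);
                }
                try {
                    task.run();
                } finally {
                    MDC.clear();
                }
            };
        };
    }

    @Bean(name = "applicationTaskExecutor")
    public ThreadPoolTaskExecutor applicationTaskExecutor(TaskDecorator mdcTaskDecorator) {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8);
        executor.setTaskDecorator(mdcTaskDecorator);
        return executor;
    }
}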

Interview Preparation

Observability appears frequently in staff+ and principal interviews, and increasingly in senior backend roles. Interviewers test whether you have operated real production systems, not just built them.

Q: What is the difference between structured and unstructured logging? Why does it matter at scale?
Unstructured logging produces plain text like "User 42 placed order 99 for $150". Structured logging produces machine-readable JSON: {"event":"order.placed","userId":42,"orderId":99,"amount":150}. At scale, unstructured logs require fragile regex to extract fields. Structured logs can be indexed by field in Elasticsearch — you can query userId:42 AND amount > 100 efficiently across millions of lines. Every field becomes a searchable dimension. The cost is slightly larger log volume; the benefit is dramatically faster incident diagnosis.
Q: What is MDC and what production bug does it prevent and cause?
MDC (Mapped Diagnostic Context) is thread-local storage in SLF4J for attaching key-value metadata to log lines. Setting MDC.put("correlationId", id) at request entry causes every log line from that thread to include the correlation ID automatically. The bug it prevents: without MDC you cannot identify which log lines belong to which request among thousands of concurrent requests. The bug it causes: if you forget MDC.clear() in a finally block, thread pool threads carry stale MDC data from previous requests into new ones — you'll see user A's correlation ID in user B's logs, a compliance and debugging disaster.
Q: How would you change a log level in production without restarting?
Spring Boot Actuator exposes /actuator/loggers/{name} which accepts POST requests to change log levels at runtime. Send {"configuredLevel":"DEBUG"} to enable debug logging for a specific package. This works because Logback supports dynamic reconfiguration at runtime. The change is in-memory only and reverts on restart. It is safe because it is scoped to a specific logger package and can be reverted immediately. Always revert after investigation to avoid performance degradation from excessive log volume.
Q: What are the four golden signals and why are they the right things to monitor?
The four golden signals (from Google SRE) are Latency, Traffic, Errors, and Saturation. They are the right things to monitor because they directly represent user experience. Latency tells you how long users wait. Traffic tells you the load. Errors tell you the fraction of users experiencing failures. Saturation tells you how close to capacity the system is. Monitoring CPU or memory alone tells you resource consumption, not user impact — a system can have 90% CPU but serve all requests correctly, or 10% CPU and be completely broken. The golden signals are user-centric, not infrastructure-centric.
Q: What is an error budget and how does it influence engineering decisions?
An error budget is the allowed quantity of failures within an SLO window. For 99.9% availability over 30 days, the error budget is 43.8 minutes of downtime. Error budgets make reliability a quantitative decision. When the budget is healthy, teams ship features aggressively. When nearly exhausted, teams freeze deploys and focus on reliability work. This eliminates the subjective argument between "ship faster" vs "stabilise" — the budget decides. It also aligns incentives: reliability work has a quantified cost in reduced feature velocity.
Q: Why should you never alert on averages? What should you use instead?
Averages mask outliers. If 90% of requests take 100ms and 10% take 10,000ms, the average is 1,090ms. An alert threshold of 2,000ms never fires, yet 10% of users experience 10-second responses. Percentiles expose the tail: p99 of 10,000ms fires immediately. Alert on p95 and p99 latency, never averages. Use Prometheus histogram metrics (not summaries) when aggregating percentiles across pods — summaries cannot be correctly aggregated across multiple instances.
Q: Explain distributed tracing. What problem does it solve that logs and metrics cannot?
Distributed tracing solves the "why is this specific request slow?" problem. Logs tell you events occurred; metrics tell you aggregated behaviour; neither answers "why did user Alice's checkout take 8 seconds right now?". A trace follows one request across every service it touches — Service A calling B calling Redis and Postgres — recording how long each hop took. In a microservices system with 20 services, a slow request can be caused by any one of hundreds of components. Tracing pinpoints exactly which span was slow: "the DB query in order-service took 7.8 seconds". That is impossible to determine from aggregated metrics or per-service logs alone.
Q: What is tail-based vs head-based sampling in distributed tracing?
Head-based sampling decides at the first service whether to trace a request — all downstream services receive the decision via propagated headers. Simple and low overhead, but samples blindly: a 10% rate discards 90% of traces including error traces. Tail-based sampling buffers all spans and decides at the end — ensuring 100% of error and slow traces are kept while discarding routine successful ones. Tail-based requires a collection agent (OpenTelemetry Collector) that holds spans in memory until the trace completes. In production: head-based for baseline volume control, tail-based sampling in the Collector for errors and slow traces.
Q: Walk me through diagnosing a sudden spike in HTTP 500 errors on a Spring Boot service.
First, assess blast radius: check error rate in Grafana — all endpoints or specific one? Trending up or flat? Then check recent deployments — if errors started at a deploy time, rollback is fastest mitigation. Next, find error traces in Kibana: filter by status 500 in last 5 minutes, open a representative trace, find the span where the error originated. Check the log line for that trace ID — what exception was thrown? Simultaneously check correlated metrics: DB connection pool pending, external API p99 latency, JVM heap. The combination of exception type and correlated metric usually gives root cause in minutes. Mitigate first, then do forensic analysis in the postmortem.
Q: What is a blameless postmortem and why does it matter?
A blameless postmortem is a structured incident review that focuses on systemic failures rather than individual mistakes. The premise: engineers act with the best information and tooling available — if a mistake was easy to make, the system allowed it. Blameless postmortems document timeline, impact, contributing causes, and concrete action items to prevent recurrence. They matter because blame-based cultures suppress incident reporting — engineers hide near-misses, so the organisation never learns. Blameless cultures surface failures early and accumulate engineering knowledge over time. The action items from postmortems drive reliability improvement: better runbooks, new alerts, safer deployment processes, improved tests.

Section 12 Complete

You now understand how to make production systems observable — structured logs, golden signal metrics, distributed traces, and the incident response discipline to use them when it matters most.