Observability & Production Engineering
You cannot fix what you cannot see. Observability is the discipline of making your system legible — understanding what it is doing, why it is slow, and where it is failing before your users notice.
Why Observability Matters
Deployable code that runs without crashing is a starting point, not a finish line. Production systems fail in ways your tests never predicted: slow database queries that appear only under load, memory leaks that surface after three days of uptime, a third-party API that starts timing out at 2am. Without observability, your team flies blind — reacting to customer complaints instead of catching failures proactively.
Logs tell you what happened — discrete events with timestamps and context. Metrics tell you how the system is behaving over time — request rates, error rates, latencies. Traces tell you why it is slow — the full journey of a single request across every service it touched. You need all three.
The difference between a junior engineer and a senior engineer in production is largely observability. Seniors instrument their code before the incident happens. They write logs with correlation IDs. They add custom metrics for business-critical paths. They trace slow operations end-to-end. When the alert fires at 3am, they can diagnose root cause in minutes instead of hours.
Structured Logging with SLF4J & Logback
The most common observability mistake is unstructured logging — plain text messages that are impossible to query at scale. When you have a million log lines per minute, log.info("User 42 logged in") is useless. You cannot filter by user ID without fragile regex. Structured logging emits machine-readable JSON that log aggregators can index and query efficiently.
Logs travel through many systems: files, log shippers, search indexes, dashboards. Never log passwords, tokens, credit card numbers, SSNs, or PII. Use masking utilities or log only non-sensitive identifiers. GDPR and PCI compliance failures have resulted from overly verbose production logs.
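To make the contrast concrete, here is a minimal sketch of what structured logging looks like at the call site, assuming the logstash-logback-encoder dependency configured below (the OrderAuditLogger class and its method are illustrative). StructuredArguments.kv attaches each value as a first-class JSON field instead of burying it in the message string:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import static net.logstash.logback.argument.StructuredArguments.kv;

public class OrderAuditLogger {
    private static final Logger log = LoggerFactory.getLogger(OrderAuditLogger.class);

    public void orderPlaced(long userId, long orderId, int itemCount) {
        // With the JSON encoder configured below, each kv() becomes a top-level field, e.g.
        // {"message":"order placed userId=42 orderId=99 items=3","userId":42,"orderId":99,"items":3,...}
        // so the log aggregator can filter on userId or items without any regex
        log.info("order placed {} {} {}",
                kv("userId", userId), kv("orderId", orderId), kv("items", itemCount));
    }
}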
Configuring Logback for JSON Output
Add the Logstash encoder and configure logback-spring.xml to emit structured JSON in production while keeping human-readable output locally:
<dependency>
<groupId>net.logstash.logback</groupId>
<artifactId>logstash-logback-encoder</artifactId>
<version>7.4</version>
</dependency>
<configuration>
<!-- Human-readable for local development -->
<springProfile name="local">
<appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
<encoder>
<pattern>%d{HH:mm:ss} [%thread] %-5level %logger{36} - %msg%n</pattern>
</encoder>
</appender>
<root level="INFO">
<appender-ref ref="CONSOLE"/>
</root>
</springProfile>
<!-- JSON structured output for staging and production -->
<springProfile name="prod,staging">
<appender name="JSON_CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
<encoder class="net.logstash.logback.encoder.LogstashEncoder">
<customFields>{"app":"order-service","env":"${SPRING_PROFILES_ACTIVE}"}</customFields>
<throwableConverter class="net.logstash.logback.stacktrace.ShortenedThrowableConverter">
<maxDepthPerThrowable>10</maxDepthPerThrowable>
<rootCauseFirst>true</rootCauseFirst>
</throwableConverter>
</encoder>
</appender>
<root level="INFO">
<appender-ref ref="JSON_CONSOLE"/>
</root>
<logger name="com.yourcompany" level="DEBUG"/>
</springProfile>
</configuration>
MDC — Mapped Diagnostic Context
MDC is the mechanism for attaching contextual data to every log line within a thread. Set a correlation ID once at request entry and every downstream log line automatically includes it:
@Component
public class RequestLoggingFilter extends OncePerRequestFilter {
private static final Logger log = LoggerFactory.getLogger(RequestLoggingFilter.class);
@Override
protected void doFilterInternal(HttpServletRequest request,
HttpServletResponse response,
FilterChain chain) throws ServletException, IOException {
String correlationId = Optional
.ofNullable(request.getHeader("X-Correlation-Id"))
.orElse(UUID.randomUUID().toString().substring(0, 8));
MDC.put("correlationId", correlationId);
MDC.put("userId", Optional.ofNullable(request.getHeader("X-User-Id")).orElse("anonymous"));
MDC.put("method", request.getMethod());
MDC.put("path", request.getRequestURI());
long start = System.currentTimeMillis();
try {
response.addHeader("X-Correlation-Id", correlationId);
chain.doFilter(request, response);
} finally {
long duration = System.currentTimeMillis() - start;
MDC.put("durationMs", String.valueOf(duration));
MDC.put("statusCode", String.valueOf(response.getStatus()));
log.info("HTTP {} {} → {} in {}ms",
request.getMethod(), request.getRequestURI(),
response.getStatus(), duration);
MDC.clear(); // ALWAYS clear — thread pools reuse threads
}
}
}
Thread pools reuse threads. If you forget MDC.clear() in the finally block, the next request processed by that thread inherits all MDC fields from the previous request — a compliance nightmare. Always clear in finally, never just in the try block.
Logging Best Practices
// BAD: String concatenation — evaluated even when log level is OFF
log.debug("Processing order: " + order.getId() + " items: " + order.getItems().size());
// GOOD: Parameterized — string only built if DEBUG is enabled
log.debug("Processing order: {} with {} items", order.getId(), order.getItems().size());
// BAD: Log AND throw — exception logged multiple times up the stack
try {
processOrder(order);
} catch (Exception e) {
log.error("Failed to process order", e);
throw e; // gets logged again at the next catch block
}
// GOOD: Either log OR rethrow at the boundary that handles it
try {
processOrder(order);
} catch (OrderProcessingException e) {
log.error("Order {} processing failed: {}", order.getId(), e.getMessage(), e);
throw new ServiceException("Order processing failed", e);
}
// BAD: Logging sensitive data
log.info("User authenticated: {} password: {}", user.getEmail(), plainTextPassword);
// GOOD: Log only non-sensitive identifiers
log.info("User authenticated: userId={} via={}", user.getId(), authMethod);
Metrics with Micrometer & Prometheus
Metrics answer a different question than logs. Instead of "what happened to request 4a2b", metrics tell you "how many requests per second are we serving, and what fraction are failing?" Micrometer is Spring Boot's metrics facade — vendor-neutral, able to export to Prometheus, Datadog, CloudWatch, or Influx by swapping one dependency.
Actuator and Prometheus Setup
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
management:
endpoints:
web:
exposure:
include: health,info,prometheus,metrics
base-path: /actuator
endpoint:
health:
show-details: when-authorized
prometheus:
enabled: true
metrics:
tags:
application: ${spring.application.name}
environment: ${spring.profiles.active:local}
distribution:
percentiles-histogram:
http.server.requests: true
percentiles:
http.server.requests: 0.5, 0.90, 0.95, 0.99
slo:
http.server.requests: 10ms, 50ms, 100ms, 250ms, 500ms, 1s
Custom Business Metrics
Spring Boot auto-instruments JVM, HTTP, datasource, and cache metrics for free. But the most valuable metrics are the ones you write — business metrics that capture what matters to your application:
@Service
public class OrderService {
private final MeterRegistry registry;
private final OrderRepository orderRepository;
private final Counter ordersPlaced;
private final Counter ordersFailed;
public OrderService(MeterRegistry registry, OrderRepository orderRepository) {
this.registry = registry;
this.orderRepository = orderRepository;
// Counter: monotonically increasing — total events
this.ordersPlaced = Counter.builder("orders.placed")
.description("Total orders successfully placed")
.tag("version", "v2")
.register(registry);
this.ordersFailed = Counter.builder("orders.failed")
.description("Total failed order attempts")
.register(registry);
// Gauge: current snapshot; use for queue depths, pool sizes.
// Register once here, not on every request
Gauge.builder("orders.pending", orderRepository,
repo -> repo.countByStatus(OrderStatus.PENDING))
.description("Orders pending fulfilment")
.register(registry);
}
public Order placeOrder(CreateOrderRequest request) {
// Timer: measures duration and automatically tracks count, total time, and max
return Timer.builder("orders.processing.time")
.description("Time to process and persist an order")
.tag("channel", request.getChannel())
.publishPercentiles(0.5, 0.95, 0.99)
.register(registry)
.record(() -> {
try {
Order order = createAndPersist(request);
ordersPlaced.increment();
return order;
} catch (RuntimeException e) {
ordersFailed.increment();
// Distinct name for the per-reason breakdown; reusing "orders.failed"
// with a different tag set is rejected by the Prometheus registry
registry.counter("orders.failed.by.reason",
"reason", e.getClass().getSimpleName()).increment();
throw e;
}
});
}
}
The Four Golden Signals
Google SRE popularised four signals that, if monitored on every service, provide sufficient coverage to detect virtually any production problem:
Latency: how long requests take, tracked separately for successes and failures, because a fast error is still an error.
Traffic: the demand on the system, typically requests per second.
Errors: the rate of failed requests, whether explicit (HTTP 5xx) or implicit (wrong or degraded responses).
Saturation: how full the constrained resources are, such as CPU, heap, connection pools, and queue depth.
Prometheus & Grafana in Practice
Prometheus is a pull-based time-series database. It scrapes your /actuator/prometheus endpoint on a configurable interval and stores the data. Grafana is the visualisation layer that queries Prometheus and renders dashboards and alert rules.
Docker Compose: Full Observability Stack
version: "3.9"
services:
prometheus:
image: prom/prometheus:v2.50.0
volumes:
- ./observability/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- prometheus_data:/prometheus
command:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.retention.time=15d
- --web.enable-lifecycle
ports:
- "9090:9090"
networks: [obs]
grafana:
image: grafana/grafana:10.3.0
environment:
GF_SECURITY_ADMIN_PASSWORD: admin
GF_USERS_ALLOW_SIGN_UP: "false"
volumes:
- grafana_data:/var/lib/grafana
- ./observability/grafana/provisioning:/etc/grafana/provisioning:ro
ports:
- "3000:3000"
depends_on: [prometheus]
networks: [obs]
alertmanager:
image: prom/alertmanager:v0.26.0
volumes:
- ./observability/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
ports:
- "9093:9093"
networks: [obs]
volumes:
prometheus_data:
grafana_data:
networks:
obs:
driver: bridge
The matching prometheus.yml (mounted from ./observability/prometheus.yml above) tells Prometheus how often to scrape, which alert rules to load, and where your app lives:
global:
scrape_interval: 15s
evaluation_interval: 15s
rule_files:
- "alerts.yml"
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
scrape_configs:
- job_name: "spring-boot-app"
metrics_path: /actuator/prometheus
static_configs:
- targets: ["app:8080"]
Prometheus Alert Rules
These rules live in the alerts.yml file referenced by rule_files in the Prometheus config above:
groups:
- name: spring-boot
rules:
- alert: HighErrorRate
expr: |
rate(http_server_requests_seconds_count{status=~"5.."}[5m])
/
rate(http_server_requests_seconds_count[5m]) > 0.02
for: 2m
labels:
severity: critical
annotations:
summary: "High HTTP error rate on {{ $labels.instance }}"
description: "Error rate {{ $value | humanizePercentage }} over last 5m"
- alert: SlowP99Latency
expr: |
histogram_quantile(0.99,
rate(http_server_requests_seconds_bucket[5m])
) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "p99 latency above 1s on {{ $labels.instance }}"
- alert: JvmHeapHigh
expr: |
jvm_memory_used_bytes{area="heap"}
/
jvm_memory_max_bytes{area="heap"} > 0.85
for: 5m
labels:
severity: warning
annotations:
summary: "JVM heap above 85% on {{ $labels.instance }}"
- alert: DatabaseConnectionPoolExhausted
expr: hikaricp_connections_pending > 5
for: 1m
labels:
severity: critical
annotations:
summary: "HikariCP pool has {{ $value }} pending requests"
Distributed Tracing with Micrometer Tracing
Logs and metrics tell you what happened system-wide. Tracing tells you what happened to one specific request — across every service, database query, and external call it triggered. Without tracing, diagnosing a 3-second response requires reconstructing the journey from log fragments across five services.
A trace is the entire journey of one request — it has a globally unique trace ID. A trace is composed of spans: each span is one unit of work (HTTP call, DB query, cache lookup). Spans form a tree. Every span records start time, duration, tags, and errors.
Setup: Micrometer Tracing + Zipkin
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>
<dependency>
<groupId>io.zipkin.reporter2</groupId>
<artifactId>zipkin-reporter-brave</artifactId>
</dependency>
management:
tracing:
sampling:
probability: 1.0 # 100% in dev; use 0.1 in production
zipkin:
tracing:
endpoint: http://zipkin:9411/api/v2/spans
logging:
pattern:
level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"
Custom Spans for Fine-Grained Tracing
@Service
@RequiredArgsConstructor
public class PaymentService {
private final Tracer tracer;
private final PaymentGatewayClient gatewayClient;
public PaymentResult processPayment(PaymentRequest request) {
Span gatewaySpan = tracer.nextSpan()
.name("payment.gateway.charge")
.tag("gateway", "stripe")
.tag("amount", String.valueOf(request.getAmount()))
.start();
try (Tracer.SpanInScope scope = tracer.withSpan(gatewaySpan)) {
gatewaySpan.event("Sending charge request");
GatewayResponse response = gatewayClient.charge(request);
gatewaySpan.event("Charge response received");
gatewaySpan.tag("gateway.status", response.getStatus());
if (!response.isSuccessful()) {
gatewaySpan.tag("error", "true");
gatewaySpan.tag("error.message", response.getDeclineReason());
}
return PaymentResult.from(response);
} catch (Exception e) {
gatewaySpan.tag("error", "true");
gatewaySpan.error(e);
throw e;
} finally {
gatewaySpan.end(); // ALWAYS end spans — leaks cause memory issues
}
}
}
Propagating Trace Context Across Kafka
Spring Boot auto-propagates trace headers for WebClient and RestTemplate instances created through the auto-configured builders. For Kafka, you need explicit propagation:
@Component
@RequiredArgsConstructor
public class TracingKafkaProducer {
private final KafkaTemplate<String, Object> kafkaTemplate;
private final Tracer tracer;
private final Propagator propagator;
public void sendWithTracing(String topic, Object payload) {
Map<String, String> headers = new HashMap<>();
propagator.inject(tracer.currentTraceContext().context(),
headers, Map::put);
ProducerRecord<String, Object> record = new ProducerRecord<>(topic, payload);
headers.forEach((k, v) ->
record.headers().add(k, v.getBytes(StandardCharsets.UTF_8)));
kafkaTemplate.send(record);
}
}
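On the consumer side, the trace context has to be extracted from the record headers and reactivated before any logging or downstream work happens. A minimal sketch using the same Micrometer Propagator API as the producer above (the topic name and the OrderEventHandler are illustrative):
@Component
@RequiredArgsConstructor
public class TracingKafkaConsumer {
    private final Tracer tracer;
    private final Propagator propagator;
    private final OrderEventHandler handler; // hypothetical domain handler

    @KafkaListener(topics = "orders")
    public void onMessage(ConsumerRecord<String, Object> record) {
        // Rebuild the producer's trace context from the Kafka record headers
        Span consumerSpan = propagator.extract(record.headers(), (headers, key) -> {
                    Header header = headers.lastHeader(key);
                    return header == null ? null : new String(header.value(), StandardCharsets.UTF_8);
                })
                .name("orders.consume")
                .start();
        try (Tracer.SpanInScope scope = tracer.withSpan(consumerSpan)) {
            handler.handle(record.value());
        } finally {
            consumerSpan.end(); // same rule as custom spans: always end
        }
    }
}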
The ELK Stack — Centralized Log Management
ELK stands for Elasticsearch, Logstash, and Kibana. Your Spring Boot app writes JSON logs to stdout, Filebeat ships them to Logstash, Logstash parses and forwards them to Elasticsearch for indexing, and Kibana provides a web UI for searching and visualising logs across all services.
ELK with Docker Compose
version: "3.9"
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
environment:
- discovery.type=single-node
- xpack.security.enabled=false
- "ES_JAVA_OPTS=-Xms512m -Xmx512m"
volumes:
- es_data:/usr/share/elasticsearch/data
ports:
- "9200:9200"
networks: [elk]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:9200"]
interval: 10s
timeout: 5s
retries: 5
logstash:
image: docker.elastic.co/logstash/logstash:8.12.0
volumes:
- ./observability/logstash/pipeline:/usr/share/logstash/pipeline:ro
ports:
- "5044:5044"
- "5000:5000"
depends_on:
elasticsearch:
condition: service_healthy
networks: [elk]
kibana:
image: docker.elastic.co/kibana/kibana:8.12.0
environment:
ELASTICSEARCH_HOSTS: http://elasticsearch:9200
ports:
- "5601:5601"
depends_on: [elasticsearch]
networks: [elk]
filebeat:
image: docker.elastic.co/beats/filebeat:8.12.0
user: root
volumes:
- ./observability/filebeat.yml:/usr/share/filebeat/filebeat.yml:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
depends_on: [logstash]
networks: [elk]
volumes:
es_data:
networks:
elk:
driver: bridge
The Logstash pipeline (mounted from ./observability/logstash/pipeline above) parses the JSON log fields into top-level attributes that Kibana can filter on:
input {
beats { port => 5044 }
tcp { port => 5000 codec => json }
}
filter {
if [message] =~ /^\{/ {
json {
source => "message"
target => "parsed"
}
mutate {
rename => {
"[parsed][traceId]" => "trace_id"
"[parsed][spanId]" => "span_id"
"[parsed][level]" => "log_level"
"[parsed][logger_name]" => "logger"
"[parsed][message]" => "log_message"
"[parsed][app]" => "service"
"[parsed][correlationId]" => "correlation_id"
"[parsed][userId]" => "user_id"
"[parsed][durationMs]" => "duration_ms"
}
}
}
date {
match => ["[parsed][@timestamp]", "ISO8601"]
target => "@timestamp"
}
}
output {
elasticsearch {
hosts => ["elasticsearch:9200"]
index => "spring-logs-%{+YYYY.MM.dd}"
template_overwrite => true
}
}
Once those fields are indexed, Kibana queries become precise: trace_id:"a1b2c3" finds every log line from one request across all services; log_level:"ERROR" AND service:"order-service" filters errors from one service; duration_ms > 1000 finds slow requests. Save these as saved searches and dashboards and they become your incident runbook.
Spring Boot Actuator — Production Endpoints
Actuator exposes production-ready endpoints for health checks, metrics, thread dumps, heap dumps, and more. It is the first tool you reach for when diagnosing a production issue.
management:
endpoints:
web:
exposure:
# NEVER expose * on a public-facing service
include: health,info,prometheus,metrics,loggers,threaddump,env
base-path: /internal
endpoint:
health:
show-details: always
show-components: always
probes:
enabled: true # Kubernetes liveness/readiness probes
loggers:
enabled: true # Change log level at runtime without restart
env:
show-values: never # Never expose env values in production
The built-in indicators already cover the datasource, disk space, and similar concerns; a custom HealthIndicator surfaces the health of a business-critical dependency such as a payment gateway:
@Component
public class PaymentGatewayHealthIndicator implements HealthIndicator {
private final PaymentGatewayClient client;
public PaymentGatewayHealthIndicator(PaymentGatewayClient client) {
this.client = client;
}
@Override
public Health health() {
try {
GatewayPingResponse ping = client.ping();
if (ping.isHealthy()) {
return Health.up()
.withDetail("gateway", ping.getVersion())
.withDetail("latencyMs", ping.getLatencyMs())
.build();
}
return Health.down()
.withDetail("reason", ping.getDegradedReason())
.build();
} catch (Exception e) {
return Health.down()
.withException(e)
.withDetail("endpoint", client.getEndpoint())
.build();
}
}
}
Runtime Log Level Changes
Change log levels without restarting. When debugging in production, temporarily enable DEBUG for a specific logger, observe, then revert:
# Get current level for a logger
GET /internal/loggers/com.yourcompany.payment
# Response: { "configuredLevel": "INFO", "effectiveLevel": "INFO" }
# Temporarily enable DEBUG for investigation
curl -X POST http://app:8080/internal/loggers/com.yourcompany.payment \
-H "Content-Type: application/json" \
-d '{"configuredLevel": "DEBUG"}'
# Revert after investigation
curl -X POST http://app:8080/internal/loggers/com.yourcompany.payment \
-H "Content-Type: application/json" \
-d '{"configuredLevel": "INFO"}'
Production Incident Response
Observability is worthless without a system for using it. The faster you move from "alert fired" to "root cause identified", the less impact the incident has on users.
The Incident Investigation Flow
A repeatable flow keeps diagnosis fast: the alert names the symptom (error rate, latency, saturation); the Grafana dashboard scopes it to a service and time window; logs filtered by correlation or trace ID show the failing requests; the trace pinpoints the slow or failing span; only then do you mitigate and write up the root cause.
Debugging Common Production Failures
Memory leaks and OutOfMemoryError
Symptoms: Pod restarts, java.lang.OutOfMemoryError in logs, the heap metric climbs steadily and flatlines at 100%.
Diagnosis: Check /actuator/metrics/jvm.memory.used trend. Take a heap dump via /actuator/heapdump before OOM. Analyse with Eclipse MAT. Look for: caches without eviction, event listeners not removed, ThreadLocal leaks.
Mitigate: Add -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heap.hprof so the dump is captured automatically at crash time.
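To make "caches without eviction" concrete, a typical culprit looks like the sketch below (class and repository names are illustrative); in Eclipse MAT it shows up as a single map retaining most of the heap.
// BAD: nothing is ever evicted, so every distinct SKU stays strongly reachable
// until the heap fills up; a classic slow leak that only surfaces after days of uptime
@Service
public class ProductLookupService {
    private static final Map<String, Product> CACHE = new ConcurrentHashMap<>();
    private final ProductRepository repository;

    public ProductLookupService(ProductRepository repository) {
        this.repository = repository;
    }

    public Product lookup(String sku) {
        return CACHE.computeIfAbsent(sku, repository::findBySku);
    }
}
Bounding the cache (for example Caffeine with maximumSize and expireAfterWrite) removes the leak without giving up the lookup savings.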
Database connection pool exhaustion
Symptoms: Requests hang and then time out. HikariPool-1 - Connection not available, timed out after 30000ms. hikaricp_connections_pending spikes.
Diagnosis: Take a thread dump via /actuator/threaddump. Look for threads blocked on HikariPool.getConnection(). Check DB for long-running transactions: SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY duration DESC.
Root causes: Missing @Transactional propagation; N+1 queries holding connections; slow external call inside a transaction; connection leak.
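One of those root causes is worth seeing concretely: a slow remote call inside a @Transactional method pins a pooled connection for the whole round trip. A hedged sketch, reusing illustrative names from the earlier examples:
// BAD: the database connection is held for the entire gateway call;
// 5 seconds of payment latency means 5 seconds of pool occupancy per request
@Transactional
public Order checkout(CreateOrderRequest request) {
    Order order = orderRepository.save(Order.from(request));
    PaymentResult result = paymentGatewayClient.charge(request); // network I/O inside the transaction
    order.markPaid(result);
    return order;
}

// BETTER: call the gateway outside the transaction, then persist the outcome
// in a short @Transactional method so the connection is borrowed only briefly
public Order checkout(CreateOrderRequest request) {
    PaymentResult result = paymentGatewayClient.charge(request);
    return persistPaidOrder(request, result);
}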
CPU saturation
Symptoms: CPU pegged at 100%, throughput collapses, p99 latency skyrockets, but the error rate does not rise.
Diagnosis: Take a thread dump or use async-profiler. Look for RUNNABLE threads in tight loops. Check for: expensive regex in hot paths, infinite retry loops, serialisation of large objects on every request, or GC pressure (check jvm.gc.pause metric).
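For the regex case specifically, the usual fix is to hoist compilation out of the hot path and keep patterns free of catastrophic backtracking; a small sketch with an illustrative pattern:
// BAD: recompiles the pattern on every request; heavy regex work shows up
// in thread dumps as RUNNABLE threads deep in java.util.regex frames
boolean valid = input.matches("^[A-Z]{2}\\d{6}-[a-f0-9]{8}$");

// GOOD: compile once and reuse; Pattern is immutable and thread-safe
private static final Pattern ORDER_REF = Pattern.compile("^[A-Z]{2}\\d{6}-[a-f0-9]{8}$");
boolean valid = ORDER_REF.matcher(input).matches();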
Grafana Loki — Lightweight Log Aggregation
ELK is powerful but operationally heavy — Elasticsearch is expensive to run. Grafana Loki indexes only log labels (not full text), making it dramatically cheaper. Loki integrates natively with Grafana so you can correlate logs and metrics on the same dashboard.
<dependency>
<groupId>com.github.loki4j</groupId>
<artifactId>loki-logback-appender</artifactId>
<version>1.5.1</version>
</dependency>
<springProfile name="prod">
<!-- Expose Spring properties to Logback; plain ${spring.*} placeholders are not resolved here -->
<springProperty scope="context" name="appName" source="spring.application.name"/>
<springProperty scope="context" name="activeProfile" source="spring.profiles.active" defaultValue="prod"/>
<appender name="LOKI" class="com.github.loki4j.logback.Loki4jAppender">
<http>
<url>http://loki:3100/loki/api/v1/push</url>
</http>
<format>
<label>
<pattern>app=${appName},env=${activeProfile},level=%level</pattern>
</label>
<message class="net.logstash.logback.encoder.LogstashEncoder"/>
</format>
</appender>
<root level="INFO">
<appender-ref ref="LOKI"/>
</root>
</springProfile>
With Prometheus (metrics), Loki (logs), and Tempo (traces) all feeding into Grafana, you get the full "Explore" workflow: click a spike on your error rate graph → Grafana generates a Loki query for that time range → click a log line → Grafana opens the trace in Tempo. This is the modern observability stack — entirely free and open-source.
SLOs, SLAs, and Error Budgets
Senior engineers don't just monitor systems; they define what "working correctly" means quantitatively. SLIs are the measurements (for example, the fraction of successful requests), SLOs are internal targets on those measurements, SLAs are customer contracts with penalties attached, and error budgets are the mathematical consequence of SLOs: a 99.9% monthly availability SLO leaves a 0.1% error budget, roughly 43 minutes of allowed downtime per 30-day month.
Prometheus SLO Recording Rules
groups:
- name: slo_rules
interval: 30s
rules:
# SLI: fraction of successful requests over 5-minute windows
- record: job:http_requests_success_rate:ratio_rate5m
expr: |
sum(rate(http_server_requests_seconds_count{status!~"5.."}[5m]))
/
sum(rate(http_server_requests_seconds_count[5m]))
# Alert when the error budget burns 14.4x faster than allowed.
# At that rate roughly 2% of the 30-day budget is consumed per hour, and the whole budget is gone in about two days
- alert: ErrorBudgetBurnRateHigh
expr: |
(1 - job:http_requests_success_rate:ratio_rate5m) > 14.4 * (1 - 0.999)
for: 2m
labels:
severity: page
annotations:
summary: "Error budget burning 14.4x faster than allowed"
description: "At this burn rate, ~2% of the monthly error budget is consumed every hour"
Production Pitfalls
Excessive DEBUG logging causes I/O saturation — gigabytes per hour of logs is a real CPU and disk cost. Insufficient logging means you cannot diagnose failures. Use ERROR and WARN always; INFO for request boundaries and significant state changes; DEBUG only when actively investigating (toggle via Actuator without restart).
Alerting on CPU usage is a symptom. Alerting on error rate or p99 latency measures the user experience directly. Build primary alerts around the four golden signals. Only page a human for something requiring immediate action — noisy alerts cause on-call fatigue and engineers stop responding.
100% sampling on a service handling 10,000 req/sec means storing 10,000 traces/sec. This saturates your trace backend and adds latency. Use tail-based sampling: 100% of error traces, 1–10% of success traces. Configure this in OpenTelemetry Collector, not in the application.
If services log timestamps in different timezones or their clocks drift, cross-service log correlation breaks. Always log in UTC. Verify NTP synchronisation in cloud environments; instance clocks can drift by seconds, which makes trace timeline reconstruction impossible.
/actuator/heapdump dumps your entire JVM heap — containing tokens, passwords in config, and user data. /actuator/env exposes all environment variables. NEVER expose Actuator on the public internet. Restrict with Spring Security to internal IPs or authenticated service accounts only.
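A sketch of locking Actuator down with Spring Security, assuming spring-boot-starter-security is on the classpath (the OPS role is illustrative): only the health endpoint stays anonymous for Kubernetes probes, everything else requires an authenticated operations user.
import org.springframework.boot.actuate.autoconfigure.security.servlet.EndpointRequest;
import org.springframework.boot.actuate.health.HealthEndpoint;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
public class ActuatorSecurityConfig {

    @Bean
    SecurityFilterChain actuatorSecurityChain(HttpSecurity http) throws Exception {
        // This chain applies only to Actuator endpoints; the application's main chain is defined separately
        http.securityMatcher(EndpointRequest.toAnyEndpoint())
            .authorizeHttpRequests(auth -> auth
                .requestMatchers(EndpointRequest.to(HealthEndpoint.class)).permitAll() // liveness/readiness probes
                .anyRequest().hasRole("OPS"))                                          // heapdump, env, loggers, etc.
            .httpBasic(Customizer.withDefaults());
        return http.build();
    }
}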
MDC context does NOT propagate across @Async, CompletableFuture, virtual threads, or Kafka consumers. Explicitly copy context: Map<String, String> ctx = MDC.getCopyOfContextMap() then MDC.setContextMap(ctx) in the new thread before any logging.
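A sketch of doing that copy automatically for @Async work using Spring's TaskDecorator, so callers don't have to remember it (pool sizes are illustrative):
import java.util.Map;
import org.slf4j.MDC;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncMdcConfig {

    @Bean
    public ThreadPoolTaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(4);
        executor.setMaxPoolSize(16);
        // Capture the submitting thread's MDC, restore it on the worker thread, always clear afterwards
        executor.setTaskDecorator(task -> {
            Map<String, String> context = MDC.getCopyOfContextMap();
            return () -> {
                if (context != null) {
                    MDC.setContextMap(context);
                }
                try {
                    task.run();
                } finally {
                    MDC.clear();
                }
            };
        });
        executor.initialize();
        return executor;
    }
}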
Interview Preparation
Observability appears frequently in staff+ and principal interviews, and increasingly in senior backend roles. Interviewers test whether you have operated real production systems, not just built them.
{"event":"order.placed","userId":42,"orderId":99,"amount":150}. At scale, unstructured logs require fragile regex to extract fields. Structured logs can be indexed by field in Elasticsearch — you can query userId:42 AND amount > 100 efficiently across millions of lines. Every field becomes a searchable dimension. The cost is slightly larger log volume; the benefit is dramatically faster incident diagnosis.MDC.put("correlationId", id) at request entry causes every log line from that thread to include the correlation ID automatically. The bug it prevents: without MDC you cannot identify which log lines belong to which request among thousands of concurrent requests. The bug it causes: if you forget MDC.clear() in a finally block, thread pool threads carry stale MDC data from previous requests into new ones — you'll see user A's correlation ID in user B's logs, a compliance and debugging disaster./actuator/loggers/{name} which accepts POST requests to change log levels at runtime. Send {"configuredLevel":"DEBUG"} to enable debug logging for a specific package. This works because Logback supports dynamic reconfiguration at runtime. The change is in-memory only and reverts on restart. It is safe because it is scoped to a specific logger package and can be reverted immediately. Always revert after investigation to avoid performance degradation from excessive log volume.