DevOps & Deployment Engineering
Writing code is 20% of a backend engineer's job. Getting it running reliably in production — containerized, orchestrated, deployed automatically, and recoverable from failure — is the other 80%. This section teaches you how production Spring Boot applications are actually shipped: Docker, Kubernetes, CI/CD pipelines, Nginx, and cloud deployment strategies used at real companies.
Why DevOps Matters for Backend Engineers
A backend engineer who can only write code but cannot deploy, monitor, or debug production systems is only half an engineer. Modern teams expect engineers who own their services end-to-end: build, ship, run, observe, fix. This "you build it, you run it" culture requires DevOps fluency.
DevOps is not a job title — it's a philosophy that developers are responsible for their code in production. In practice this means: writing Dockerfiles, defining Kubernetes manifests, writing CI/CD pipelines, setting up alerts, and being on-call for services you wrote. You don't need to be a dedicated infrastructure engineer, but you must understand how your application runs, scales, and fails in production.
Docker: Containerizing Spring Boot
Docker packages your application and all its dependencies into a single portable image. The same image runs on your laptop, in CI, and in production — eliminating "works on my machine" problems and ensuring environment consistency.
Production-Grade Dockerfile
# Stage 1: Build — use full JDK to compile
FROM eclipse-temurin:21-jdk-alpine AS builder
WORKDIR /app
# Copy dependency descriptors first (Docker layer cache optimization)
# If pom.xml unchanged, this layer is cached — no re-download
COPY mvnw pom.xml ./
COPY .mvn .mvn
RUN ./mvnw dependency:go-offline -q
# Now copy source and build
COPY src ./src
RUN ./mvnw package -DskipTests -q
# Extract layers for optimal Docker caching with Spring Boot layered JAR
RUN java -Djarmode=layertools -jar target/*.jar extract
# ─────────────────────────────────────────────────────────────
# Stage 2: Runtime — use lightweight JRE only (no compiler)
FROM eclipse-temurin:21-jre-alpine AS runtime
# Security: don't run as root
RUN addgroup -S appgroup && adduser -S appuser -G appgroup
WORKDIR /app
# Copy Spring Boot layers in order of change frequency
# (dependencies change rarely → cached; application changes most → last)
COPY --from=builder /app/dependencies/ ./
COPY --from=builder /app/spring-boot-loader/ ./
COPY --from=builder /app/snapshot-dependencies/ ./
COPY --from=builder /app/application/ ./
# JVM flags for containers:
# -XX:+UseContainerSupport — respect container CPU/memory limits (on by default since JDK 10 and 8u191)
# -XX:MaxRAMPercentage=75.0 — use 75% of container memory for heap
# -XX:+ExitOnOutOfMemoryError — crash fast on OOM (let K8s restart it)
ENV JAVA_OPTS="-XX:+UseContainerSupport \
-XX:MaxRAMPercentage=75.0 \
-XX:+ExitOnOutOfMemoryError \
-Djava.security.egd=file:/dev/./urandom"
USER appuser
EXPOSE 8080
# Exec form; the inner `exec` makes the JVM replace the shell, so SIGTERM reaches it directly
ENTRYPOINT ["sh", "-c", "exec java $JAVA_OPTS org.springframework.boot.loader.launch.JarLauncher"]
A single-stage Dockerfile that copies the fat JAR directly is the most common production mistake: it includes the full JDK (600MB+), runs as root, and doesn't use layer caching. The result is a 700MB image that takes 4 minutes to build. The multi-stage approach above produces a ~180MB image that builds in 45 seconds for typical changes because the dependency layer is cached.
Docker Image Best Practices
# Build and tag
docker build -t myapp:1.2.3 -t myapp:latest .
# Check image size (should be ~150-200MB for a typical Spring Boot app)
docker images myapp
# Inspect layers — find what's making the image large
docker history myapp:latest
# Run locally with environment variables
docker run -p 8080:8080 \
-e SPRING_PROFILES_ACTIVE=local \
-e DB_URL=jdbc:postgresql://host.docker.internal:5432/mydb \
-e DB_PASSWORD=secret \
myapp:latest
# Scan for security vulnerabilities
docker scout cves myapp:latest
# .dockerignore — critical for build speed and security
# Exclude these from build context:
cat > .dockerignore << 'EOF'
target/
*.log
.git/
.env
*.env
node_modules/
.idea/
*.iml
EOF
Docker Compose: Local Development Stack
Docker Compose defines and runs multi-container applications with a single file. For local development, it replaces the need to manually start PostgreSQL, Redis, Kafka, and your application separately — one command starts the entire stack.
# Note: the top-level "version" key is obsolete in Compose v2, so it is omitted here
services:
app:
build:
context: .
target: runtime # Use the runtime stage
ports:
- "8080:8080"
environment:
SPRING_PROFILES_ACTIVE: docker
SPRING_DATASOURCE_URL: jdbc:postgresql://postgres:5432/appdb
SPRING_DATASOURCE_USERNAME: appuser
SPRING_DATASOURCE_PASSWORD: ${DB_PASSWORD:-devpassword}
SPRING_DATA_REDIS_HOST: redis
SPRING_KAFKA_BOOTSTRAP_SERVERS: kafka:9092
depends_on:
postgres:
condition: service_healthy
redis:
condition: service_started
healthcheck:
test: ["CMD", "wget", "-qO-", "http://localhost:8080/actuator/health"]
interval: 15s
timeout: 5s
retries: 5
start_period: 30s
postgres:
image: postgres:16-alpine
environment:
POSTGRES_DB: appdb
POSTGRES_USER: appuser
POSTGRES_PASSWORD: ${DB_PASSWORD:-devpassword}
ports:
- "5432:5432"
volumes:
- postgres_data:/var/lib/postgresql/data
- ./db/init.sql:/docker-entrypoint-initdb.d/init.sql
healthcheck:
test: ["CMD-SHELL", "pg_isready -U appuser -d appdb"]
interval: 5s
timeout: 3s
retries: 10
redis:
image: redis:7-alpine
ports:
- "6379:6379"
command: redis-server --maxmemory 256mb --maxmemory-policy allkeys-lru
kafka:
image: confluentinc/cp-kafka:7.5.0
ports:
- "9092:9092"
environment:
KAFKA_NODE_ID: 1
KAFKA_PROCESS_ROLES: broker,controller
KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
CLUSTER_ID: "MkU3OEVBNTcwNTJENDM2Qg"
# Database admin UI — accessible at http://localhost:5050
pgadmin:
image: dpage/pgadmin4:latest
environment:
PGADMIN_DEFAULT_EMAIL: admin@admin.com
PGADMIN_DEFAULT_PASSWORD: admin
ports:
- "5050:80"
profiles:
- tools # Only starts with: docker compose --profile tools up
volumes:
postgres_data:
# Start everything
docker compose up -d
# Start with rebuild
docker compose up -d --build
# Watch logs from app only
docker compose logs -f app
# Restart just the app (after code change)
docker compose restart app
# Stop and remove everything including volumes (clean slate)
docker compose down -v
# Scale the app service to 3 replicas (needs a load balancer like nginx)
docker compose up -d --scale app=3
CI/CD with GitHub Actions
A CI/CD pipeline automates the path from code commit to production deployment: every push triggers tests, every merge to main builds and pushes a Docker image, every tag deploys to production. Manual deployment steps, and the human error that comes with them, are replaced by a repeatable, audited process.
name: CI/CD Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
# ── Job 1: Test ──────────────────────────────────────────────────────────
test:
runs-on: ubuntu-latest
services:
postgres:
image: postgres:16-alpine
env:
POSTGRES_DB: testdb
POSTGRES_USER: test
POSTGRES_PASSWORD: test
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 5s
--health-timeout 3s
--health-retries 10
redis:
image: redis:7-alpine
ports:
- 6379:6379
steps:
- uses: actions/checkout@v4
- name: Set up JDK 21
uses: actions/setup-java@v4
with:
java-version: '21'
distribution: 'temurin'
cache: maven
- name: Run tests
run: ./mvnw test -Dspring.profiles.active=ci
env:
SPRING_DATASOURCE_URL: jdbc:postgresql://localhost:5432/testdb
SPRING_DATASOURCE_USERNAME: test
SPRING_DATASOURCE_PASSWORD: test
- name: Upload test results
uses: actions/upload-artifact@v4
if: failure()
with:
name: test-results
path: target/surefire-reports/
- name: Code coverage
run: ./mvnw jacoco:report
- name: Upload coverage to Codecov
uses: codecov/codecov-action@v4
with:
          files: target/site/jacoco/jacoco.xml
# ── Job 2: Build & Push Docker Image ────────────────────────────────────
build:
runs-on: ubuntu-latest
needs: test
if: github.ref == 'refs/heads/main'
permissions:
contents: read
packages: write
steps:
- uses: actions/checkout@v4
- name: Log in to Container Registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata (tags, labels)
id: meta
uses: docker/metadata-action@v5
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # format=long so the sha- tag contains the full github.sha referenced by the deploy job
          tags: |
            type=sha,prefix=sha-,format=long
            type=raw,value=latest
- name: Build and push Docker image
uses: docker/build-push-action@v5
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
cache-from: type=gha # GitHub Actions cache for Docker layers
cache-to: type=gha,mode=max
# ── Job 3: Deploy to Staging ─────────────────────────────────────────────
deploy-staging:
runs-on: ubuntu-latest
needs: build
environment: staging
    steps:
      - uses: actions/checkout@v4
      # Cluster credentials must be configured before azure/k8s-deploy can run;
      # here a kubeconfig stored as an Actions secret is assumed (the secret
      # name KUBE_CONFIG is illustrative)
      - name: Set Kubernetes context
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG }}
      - name: Deploy to Kubernetes (staging)
        uses: azure/k8s-deploy@v4
        with:
          namespace: staging
          manifests: k8s/staging/
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:sha-${{ github.sha }}
Never put secrets in your workflow YAML files or environment blocks as plain text. Use GitHub Actions Secrets (encrypted at rest, masked in logs) for CI credentials such as registry tokens. For production, use a dedicated secrets manager: AWS Secrets Manager, HashiCorp Vault, or Google Secret Manager. Inject secrets into pods as environment variables from Kubernetes Secrets (which themselves should be backed by an external secrets operator). The golden rule: no secret should ever appear in a git repository, log file, or build artifact.
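For reference, the Deployment in the next section reads DB_PASSWORD from a Secret named order-service-secrets. A sketch of what that object looks like (the value is a placeholder; in practice an external secrets operator populates it, and this manifest is never committed):
apiVersion: v1
kind: Secret
metadata:
  name: order-service-secrets
  namespace: production
type: Opaque
stringData:
  db-password: "placeholder-value"  # never commit real values
Or imperatively: kubectl create secret generic order-service-secrets --from-literal=db-password='...' -n production.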
Kubernetes: Running Spring Boot at Scale
Kubernetes (K8s) is the de facto standard for container orchestration in production. It handles scheduling, scaling, self-healing, rolling deployments, and service discovery. Understanding core K8s concepts is now a baseline expectation for senior backend engineers.
Complete Kubernetes Manifests for Spring Boot
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
namespace: production
labels:
app: order-service
version: "1.2.3"
spec:
replicas: 3
selector:
matchLabels:
app: order-service
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 1 # Allow 1 extra pod during update
maxUnavailable: 0 # Never reduce below desired count (zero-downtime)
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: order-service
image: ghcr.io/myorg/order-service:sha-abc123
ports:
- containerPort: 8080
# Resource requests and limits — CRITICAL for stability
resources:
requests:
cpu: "250m" # 0.25 CPU cores guaranteed
memory: "512Mi" # 512MB guaranteed
limits:
cpu: "1000m" # Can burst to 1 CPU
memory: "1Gi" # Hard cap — OOM killed above this
env:
- name: SPRING_PROFILES_ACTIVE
value: "production"
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: order-service-secrets
key: db-password
- name: JAVA_OPTS
value: "-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0"
# Liveness probe — is the app alive? K8s restarts if this fails
livenessProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
initialDelaySeconds: 30 # Give JVM time to start
periodSeconds: 10
failureThreshold: 3
# Readiness probe — is the app ready for traffic? Remove from LB if fails
readinessProbe:
httpGet:
path: /actuator/health/readiness
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
# Startup probe — prevents liveness from killing slow-starting pods
startupProbe:
httpGet:
path: /actuator/health/liveness
port: 8080
failureThreshold: 30 # 30 × 10s = 5 minutes max startup time
periodSeconds: 10
# Spread pods across availability zones for resilience
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: order-service
---
apiVersion: v1
kind: Service
metadata:
name: order-service
namespace: production
spec:
selector:
app: order-service
ports:
- port: 80
targetPort: 8080
type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: order-service
namespace: production
annotations:
nginx.ingress.kubernetes.io/ssl-redirect: "true"
cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
tls:
- hosts:
- api.myapp.com
secretName: api-tls-secret
rules:
- host: api.myapp.com
http:
paths:
- path: /api/v1/orders
pathType: Prefix
backend:
service:
name: order-service
port:
number: 80
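The Deployment above pins replicas: 3. As a sketch of the scaling K8s adds on top (the thresholds here are illustrative), a HorizontalPodAutoscaler can scale the same Deployment on CPU utilization, measured against the pods' CPU requests:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: order-service
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: order-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # % of the CPU request (250m), not of the limit
This is one more reason requests must be set realistically: the HPA's utilization math is meaningless without them.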
Spring Boot Actuator Health for K8s Probes
management:
endpoint:
health:
probes:
enabled: true # Enables /actuator/health/liveness and /readiness
      show-details: always  # consider "when-authorized" if health is reachable externally
group:
liveness:
include: livenessState # JVM alive?
readiness:
include: readinessState, db, redis # All dependencies ready?
endpoints:
web:
exposure:
include: health, info, prometheus, metrics
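A quick local sanity check of the probe groups before wiring them into K8s (assumes the app is running on port 8080):
curl -s http://localhost:8080/actuator/health/liveness   # UP as long as the JVM is responsive
curl -s http://localhost:8080/actuator/health/readiness  # UP only when db and redis checks pass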
A pod without CPU/memory limits can consume unlimited resources, starving other pods on the same node. One runaway Spring Boot app with a memory leak can crash every other service on the node. Always set both requests (what K8s schedules against) and limits (the hard cap). Set memory limit to 130% of typical usage, and always match -XX:MaxRAMPercentage so the JVM heap never exceeds the pod limit.
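A worked example of that pairing, using the manifest's numbers:
pod memory limit          = 1Gi = 1024Mi
-XX:MaxRAMPercentage=75.0 → max heap = 0.75 × 1024Mi = 768Mi
non-heap headroom         = 1024Mi − 768Mi = 256Mi  (metaspace, thread stacks, direct buffers, JIT code cache)
Setting the percentage to 100 would leave no room for non-heap memory, and the kernel would OOM-kill the pod before the JVM ever threw OutOfMemoryError.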
Nginx as Reverse Proxy
Nginx sits in front of your Spring Boot application in production, handling TLS termination, load balancing, static file serving, rate limiting, and compression. Spring Boot speaks plain HTTP; Nginx handles the messy internet-facing concerns.
# /etc/nginx/sites-available/myapp.conf
# Rate limiting zone — shared across all worker processes
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=100r/m;
upstream spring_boot {
server 127.0.0.1:8080;
server 127.0.0.1:8081; # Second instance for load balancing
keepalive 32; # Persistent connections to upstream
}
# HTTP → HTTPS redirect
server {
listen 80;
server_name api.myapp.com;
return 301 https://$host$request_uri;
}
server {
listen 443 ssl http2;
server_name api.myapp.com;
# TLS configuration (managed by certbot/Let's Encrypt)
ssl_certificate /etc/letsencrypt/live/api.myapp.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/api.myapp.com/privkey.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256;
ssl_session_cache shared:SSL:10m;
# Security headers
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-Frame-Options "DENY" always;
# Gzip compression — critical for API responses
gzip on;
gzip_types application/json application/javascript text/css;
gzip_min_length 1024;
# Proxy to Spring Boot
location /api/ {
# Apply rate limiting
limit_req zone=api_limit burst=20 nodelay;
proxy_pass http://spring_boot;
proxy_http_version 1.1;
proxy_set_header Connection ""; # Enable keepalive
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_connect_timeout 5s;
proxy_read_timeout 30s;
proxy_send_timeout 30s;
# Buffer settings for upstream responses
proxy_buffer_size 16k;
proxy_buffers 4 32k;
}
# Health check endpoint — not rate limited, not logged
location /actuator/health {
proxy_pass http://spring_boot;
access_log off;
}
# Block access to sensitive actuator endpoints from outside
location /actuator {
deny all;
return 403;
}
}
When Nginx proxies requests, the client IP that Spring Boot sees is 127.0.0.1 (the Nginx server). To recover the real client IP, configure Spring Boot to trust the proxy: set server.forward-headers-strategy=native in application.yml. Then request.getRemoteAddr(), and any IP-based logic such as rate limiting or audit logging, will see the real client IP taken from the X-Forwarded-For header. Without this, IP-based rate limiting is meaningless: all traffic appears to come from the same proxy address.
Environment Management & Configuration
Production, staging, development, and CI all need the same code to run differently. Spring Boot's profiles combined with externalized configuration make this clean — but most teams get it wrong, leading to "it works in dev but fails in prod."
# application.yml — one file: base config, then profile-specific documents separated by ---
spring:
application:
name: order-service
datasource:
url: ${DB_URL} # Always from environment variable in production
username: ${DB_USERNAME}
password: ${DB_PASSWORD}
jpa:
open-in-view: false # Never true in production
hibernate:
ddl-auto: validate # Production: validate only, never auto-create
server:
port: 8080
forward-headers-strategy: native # Trust X-Forwarded-For from proxy
---
# local profile document — developer convenience
spring:
config:
activate:
on-profile: local
datasource:
url: jdbc:postgresql://localhost:5432/orderdb
username: postgres
password: postgres
jpa:
hibernate:
ddl-auto: create-drop # Recreate schema on each local restart
show-sql: true
devtools:
restart:
enabled: true
---
# production profile document — production hardening
spring:
config:
activate:
on-profile: production
jpa:
hibernate:
ddl-auto: validate
datasource:
hikari:
maximum-pool-size: 20
connection-timeout: 5000
validation-timeout: 3000
server:
compression:
enabled: true
mime-types: application/json,application/javascript,text/css
management:
endpoints:
web:
exposure:
include: health,info,prometheus # Minimal exposure in production
// Binding configuration properties to a class (type-safe, validated)
@ConfigurationProperties(prefix = "app.payment")
@Validated
public record PaymentConfig(
@NotBlank String apiKey,
@NotBlank String apiUrl,
@Positive int timeoutSeconds,
@Min(1) @Max(10) int maxRetries
) {}
// Register it (main method included so the snippet is complete)
@SpringBootApplication
@ConfigurationPropertiesScan
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}
// Use it
@Service
@RequiredArgsConstructor
public class PaymentService {
private final PaymentConfig config;
public PaymentResult charge(Money amount) {
// config.apiKey(), config.timeoutSeconds() etc.
}
}
# application.yml binding
app:
payment:
api-key: ${PAYMENT_API_KEY}
api-url: https://api.stripe.com
timeout-seconds: 10
max-retries: 3
Deployment Strategies
How you deploy matters as much as what you deploy. The wrong strategy causes downtime, data corruption, or silent bugs in production. Every senior engineer must understand these patterns.
Database Migration with Flyway in CI/CD
# application.yml — Flyway runs automatically on startup, so migrations apply
# before the app accepts traffic. This is the correct order: migrate DB →
# start app → K8s readiness probe passes → traffic routed.
spring:
  flyway:
    enabled: true
    locations: classpath:db/migration
    baseline-on-migrate: false  # Only true for first migration on an existing DB
    out-of-order: false         # Strictly ordered in production

-- src/main/resources/db/migration/
--   V1__initial_schema.sql
--   V2__add_orders_table.sql
--   V3__add_product_stock_column.sql  ← always additive!
--
-- Safe migration practices:
-- 1. NEVER rename a column in one step (add new + backfill + drop old = 3 deployments)
-- 2. NEVER add NOT NULL without a default (fails on existing rows)
-- 3. NEVER drop a column before removing all code references (two deployments)
-- 4. Always add indexes CONCURRENTLY to avoid table locks in PostgreSQL

-- Example safe NOT NULL column addition (V10__add_status_column.sql):
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'PENDING';
UPDATE orders SET status = 'PENDING' WHERE status IS NULL;
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
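For practice 4, a hypothetical index migration. CREATE INDEX CONCURRENTLY cannot run inside a transaction in PostgreSQL, so Flyway must be told not to wrap the script in one; script-level transaction settings vary by Flyway version, so verify for yours:
-- V11__add_orders_customer_idx.sql (hypothetical)
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_customer_id ON orders (customer_id);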
Linux Essentials for Backend Engineers
When a production service fails at 2am, you'll be SSHed into a Linux server, not in your IDE. These commands are the minimum toolkit every backend engineer must know fluently.
# ── Process & JVM ──────────────────────────────────────────────────────────
ps aux | grep java # Find the Java process
jps -lv # List all JVM processes with flags
jstack $(pgrep -f app.jar) # Thread dump — detect deadlocks/hangs
jcmd $(pgrep -f app.jar) GC.heap_info # Heap usage summary (jmap -heap was removed in JDK 9+)
kill -3 $(pgrep -f app.jar) # Send SIGQUIT = thread dump to stdout
# ── Network & Ports ────────────────────────────────────────────────────────
ss -tlnp | grep 8080 # What's listening on 8080?
curl -v http://localhost:8080/actuator/health # Test health endpoint
curl -w "\n%{time_total}s\n" http://localhost:8080/api/v1/ping # Request timing
netstat -s | grep -i retransmit # Check for TCP retransmissions
# ── Log Analysis ───────────────────────────────────────────────────────────
tail -f /var/log/app/application.log # Follow live logs
grep "ERROR" application.log | tail -50 # Last 50 errors
grep "2025-01-15 14:" application.log | grep -v INFO # Errors in a time window
journalctl -u order-service -f --since "1 hour ago" # systemd service logs
# ── System Resources ───────────────────────────────────────────────────────
top -p $(pgrep -f app.jar) # CPU/memory for specific process
vmstat 1 5 # System-wide memory/IO stats (5 samples)
iostat -x 1 # Disk I/O — detect DB I/O bottleneck
free -h # Memory usage
df -h # Disk space (OOM logs fill disks)
lsof -p $(pgrep -f app.jar) | wc -l # Open file descriptors (watch for leaks)
# ── Kubernetes ────────────────────────────────────────────────────────────
kubectl get pods -n production # Pod status
kubectl logs -f order-service-abc123 -n production # Follow pod logs
kubectl describe pod order-service-abc123 -n production # Events, probe failures
kubectl exec -it order-service-abc123 -n production -- sh # Shell into pod
kubectl rollout history deployment/order-service -n production
kubectl rollout undo deployment/order-service -n production # Rollback!
Production Deployment Pitfalls
Without graceful shutdown, K8s SIGTERM immediately kills the JVM, dropping in-flight requests. Configure server.shutdown=graceful and spring.lifecycle.timeout-per-shutdown-phase=30s. Spring Boot will stop accepting new requests, wait for active ones to complete, then exit cleanly.
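A minimal sketch of that configuration; note the timeout should stay at or below the pod's terminationGracePeriodSeconds (default 30s), or K8s will SIGKILL mid-drain:
# application.yml
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s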
Java 8 pre-update 191 doesn't respect cgroup memory limits — it reads the host's total RAM, sets a heap of 25%, and the pod gets OOM-killed. Always use Java 11+ with -XX:+UseContainerSupport and -XX:MaxRAMPercentage=75.0.
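You can verify what the JVM actually sees inside a memory-limited container with a one-off run (the image tag here is only for the check):
docker run --rm -m 512m eclipse-temurin:21-jre-alpine \
  java -XX:MaxRAMPercentage=75.0 -XX:+PrintFlagsFinal -version | grep -i maxheapsize
# Expect a max heap around 384MB (75% of the 512MB limit), not 25% of the host's RAM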
If your readiness probe is too aggressive (fails fast), pods are removed from the load balancer during slow startup. If it's too lenient (always passes), traffic reaches pods that aren't ready. Configure initialDelaySeconds to be longer than your slowest startup, and use the Spring Boot Actuator readiness endpoint which checks actual downstream dependencies.
A migration that renames a column while the old code is still running causes 500 errors for every in-flight request. Always: (1) add new column, (2) deploy code that writes to both, (3) backfill, (4) deploy code that reads from new only, (5) drop old column. This takes 3 deployments but never causes downtime.
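A hypothetical rename of orders.addr to orders.shipping_address following those steps:
-- Deploy 1 — migration: expand
ALTER TABLE orders ADD COLUMN shipping_address TEXT;
--   ...shipped with code that writes both columns, reads the old one
-- Deploy 2 — migration: backfill
UPDATE orders SET shipping_address = addr WHERE shipping_address IS NULL;
--   ...shipped with code that reads and writes shipping_address only
-- Deploy 3 — migration: contract (only once no code references addr)
ALTER TABLE orders DROP COLUMN addr;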
During a rolling update, old and new pods both connect to the database simultaneously. With a pool size of 20 per pod and 3 replicas, a full-surge rollout runs 6 pods at once and needs 120 DB connections (6 pods × 20); even with the maxSurge: 1 strategy shown earlier, you peak at 4 pods and 80 connections. Size your DB connection pool, and the database's max_connections, for the maximum pod count during deployment, not just steady-state.
If you COPY a file containing secrets during the Dockerfile build and then DELETE it in a later layer, the secret is still accessible in the earlier layer's history. Secrets must NEVER be baked into images. Use --secret mount in BuildKit, or inject at runtime via environment variables or K8s Secrets.
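A sketch of the BuildKit approach: the secret is mounted only for the duration of that RUN step and is never written to a layer (the file names are illustrative):
# In the Dockerfile (requires BuildKit):
RUN --mount=type=secret,id=maven_settings,target=/root/.m2/settings.xml \
    ./mvnw package -DskipTests -q

# At build time:
docker build --secret id=maven_settings,src=./settings.xml -t myapp:1.2.3 .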
Interview Questions
What happens when a liveness probe fails, and how does it differ from a readiness probe?
If the liveness probe fails failureThreshold consecutive times, the kubelet restarts the container (not the pod). The restart count increments. After several rapid restarts, K8s enters CrashLoopBackOff — exponentially increasing the delay between restarts (10s, 20s, 40s, up to 5 minutes). This is a safety mechanism to avoid thrashing. The readiness probe is different: failing readiness removes the pod from the Service's endpoint list (no traffic), but does NOT restart it. Use liveness for "is the app dead and needs a restart?" (JVM deadlock, OOM). Use readiness for "is the app temporarily unable to serve traffic?" (warming up, upstream dependency down). The startup probe prevents liveness from killing a slow-starting pod — it gives the app time to start before liveness takes over.