Real Production Projects
Three complete backend systems built from scratch — architecture decisions, domain modelling, API design, database schema, caching strategy, async processing, security hardening, observability, and deployment. This is where everything learned in sections 1–12 converges into systems you’d find at real companies.
How to Use This Section
Each project is built in layers — starting from a simple monolith and evolving toward a production-grade service. The goal is not to copy-paste code. The goal is to develop the engineering intuition for why decisions are made, when to introduce complexity, and what breaks at each stage of growth.
For each project: first build the simplest possible version that works. Then deliberately break it — load test it, remove the cache, block the queue. Observe what fails and why. Then fix and harden it. This is how real engineering intuition is developed — not by reading about failure modes but by causing them.
Project 1: E-Commerce Order Service
An order service is at the heart of every e-commerce platform. It must handle concurrent checkout attempts, coordinate inventory deduction, trigger payment processing, and emit events to downstream services — all within a consistent, auditable transaction model. This is the canonical backend engineering challenge.
System Requirements
- ✓ Browse product catalogue with search & filters
- ✓ Add/remove items from cart
- ✓ Place order with address and payment
- ✓ Check inventory before confirming
- ✓ Process payment (sync or async)
- ✓ Track order status (PENDING → CONFIRMED → SHIPPED)
- ✓ Send email/push confirmation
- ✓ Handle cancellations and refunds
- ✓ p99 checkout latency < 500ms
- ✓ No double-charges on retry
- ✓ Consistent inventory (no overselling)
- ✓ 99.9% availability
- ✓ Horizontal scalability
- ✓ Full audit trail
- ✓ Fraud detection hooks
- ✓ Rate limiting on checkout
Architecture Evolution: Stage 1 — The Monolith
Start with a single Spring Boot application, one PostgreSQL database, no messaging, no cache. Get the core domain right before adding infrastructure complexity. The biggest mistake engineers make is reaching for microservices before they understand the domain.
domain/ ← entities, value objects, domain logic
  Order.java
  OrderItem.java
  OrderStatus.java
  Product.java
  Money.java ← value object — never use primitives for money
application/ ← use cases, orchestration
  OrderService.java
  CartService.java
  InventoryService.java
infrastructure/ ← JPA, Kafka, HTTP clients
  OrderRepository.java
  PaymentGatewayClient.java
api/ ← controllers, DTOs, validation
  OrderController.java
  CreateOrderRequest.java
  OrderResponse.java
Domain Model
@Entity
@Table(name = "orders")
@Getter
@NoArgsConstructor(access = AccessLevel.PROTECTED)
public class Order {
@Id @GeneratedValue(strategy = GenerationType.UUID)
private UUID id;
@Column(nullable = false)
private UUID customerId;
@Enumerated(EnumType.STRING)
@Column(nullable = false)
private OrderStatus status;
// Use Embedded for value objects — Money is not an entity
@Embedded
private Money totalAmount;
@OneToMany(cascade = CascadeType.ALL, orphanRemoval = true, fetch = FetchType.LAZY)
@JoinColumn(name = "order_id")
private List<OrderItem> items = new ArrayList<>();
@CreationTimestamp
private Instant createdAt;
@UpdateTimestamp
private Instant updatedAt;
// Version for optimistic locking — prevents concurrent modification
@Version
private Long version;
// Factory method — enforce invariants at creation time
public static Order create(UUID customerId, List<OrderItem> items) {
if (items == null || items.isEmpty()) {
throw new DomainException("Order must have at least one item");
}
Order order = new Order();
order.customerId = customerId;
order.status = OrderStatus.PENDING;
order.items = new ArrayList<>(items);
order.totalAmount = items.stream()
.map(OrderItem::getLineTotal)
.reduce(Money.ZERO, Money::add);
return order;
}
// Domain behaviour lives on the entity — not in the service
public void confirm() {
if (this.status != OrderStatus.PENDING) {
throw new DomainException("Cannot confirm order in status: " + this.status);
}
this.status = OrderStatus.CONFIRMED;
}
public void cancel(String reason) {
if (this.status == OrderStatus.SHIPPED) {
throw new DomainException("Cannot cancel a shipped order");
}
this.status = OrderStatus.CANCELLED;
}
public void ship() {
if (this.status != OrderStatus.CONFIRMED) {
throw new DomainException("Order must be confirmed before shipping");
}
this.status = OrderStatus.SHIPPED;
}
}
// Money — never use double or float for currency
@Embeddable
@Value
// Lombok @Value makes fields final; JPA still needs a no-arg constructor,
// so force one that leaves the finals null/zero
@NoArgsConstructor(force = true, access = AccessLevel.PROTECTED)
public class Money {
public static final Money ZERO = new Money(BigDecimal.ZERO, "USD");
@Column(name = "amount", precision = 19, scale = 4)
private BigDecimal amount;
@Column(name = "currency", length = 3)
private String currency;
public Money add(Money other) {
if (!this.currency.equals(other.currency)) {
throw new DomainException("Cannot add different currencies");
}
return new Money(this.amount.add(other.amount), this.currency);
}
public boolean isGreaterThan(Money other) {
return this.amount.compareTo(other.amount) > 0;
}
}
The Checkout Flow — Where All Complexity Lives
Checkout must be idempotent (double-clicking submit should not double-charge), inventory-consistent (no overselling), and transactionally safe (payment failure must roll back inventory). This is where most teams introduce subtle bugs.
@Service
@RequiredArgsConstructor
@Transactional
public class OrderService {
private final OrderRepository orderRepository;
private final InventoryService inventoryService;
private final PaymentService paymentService;
private final ProductCatalogueService productCatalogue; // price lookups in doPlaceOrder
private final ApplicationEventPublisher eventPublisher;
private final IdempotencyKeyStore idempotencyStore;
/**
* Idempotency key prevents double-charges on retry.
* If the same key is submitted twice, return the original result.
*/
public OrderResponse placeOrder(CreateOrderRequest request, String idempotencyKey) {
// Check idempotency — same key = same result, no side effects
return idempotencyStore.getOrExecute(idempotencyKey, () -> {
return doPlaceOrder(request);
});
}
private OrderResponse doPlaceOrder(CreateOrderRequest request) {
// 1. Validate and build order items
List<OrderItem> items = request.getItems().stream()
.map(i -> OrderItem.of(i.getProductId(), i.getQuantity(),
productCatalogue.getPrice(i.getProductId())))
.toList();
// 2. Reserve inventory — pessimistic lock at DB level
// Throws InsufficientInventoryException if any item unavailable
inventoryService.reserve(items);
// 3. Create domain object — enforces business rules
Order order = Order.create(request.getCustomerId(), items);
orderRepository.save(order);
// 4. Attempt payment
try {
PaymentResult payment = paymentService.charge(
order.getId(),
order.getTotalAmount(),
request.getPaymentMethodId()
);
order.confirm();
orderRepository.save(order);
// 5. Publish domain event — downstream services react asynchronously
// Using @TransactionalEventListener ensures this only fires on commit
eventPublisher.publishEvent(new OrderConfirmedEvent(order.getId(),
order.getCustomerId(), order.getTotalAmount()));
return OrderResponse.from(order);
} catch (PaymentDeclinedException e) {
// Roll back inventory reservation on payment failure
inventoryService.release(items);
order.cancel("Payment declined: " + e.getDeclineReason());
orderRepository.save(order);
throw new OrderPaymentFailedException(order.getId(), e.getDeclineReason());
}
}
}
// TransactionalEventListener — fires AFTER the transaction commits
// This prevents the notification being sent if the DB write fails
@Component
@RequiredArgsConstructor
public class OrderEventHandler {
private final NotificationService notificationService;
@TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
@Async // Don't block the HTTP request thread
public void onOrderConfirmed(OrderConfirmedEvent event) {
notificationService.sendOrderConfirmation(
event.getCustomerId(),
event.getOrderId()
);
}
}
Without proper locking, two concurrent checkout requests can both see "5 items in stock", both reserve 3, and both succeed — leaving you with -1 inventory. The fix: use SELECT ... FOR UPDATE to pessimistically lock the inventory row during the check-and-reserve operation. In Spring Data JPA: @Lock(LockModeType.PESSIMISTIC_WRITE) on the repository method. For high-throughput scenarios, use Redis atomic operations (DECR) as the inventory gate instead of DB locks.
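The semantics of the Redis DECR gate mentioned above can be sketched without Redis at all — the essential property is an atomic decrement-and-check. Here an `AtomicInteger` stands in for the Redis counter (the class and method names are illustrative, not part of the project; a real implementation would call `opsForValue().decrement(key, quantity)` on a `StringRedisTemplate`):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a Redis-style inventory gate. The AtomicInteger stands in for a
// Redis counter keyed by product ID (e.g. "inventory:{productId}").
public class RedisStyleInventoryGate {
    private final AtomicInteger stock;

    public RedisStyleInventoryGate(int initialStock) {
        this.stock = new AtomicInteger(initialStock);
    }

    // DECRBY is atomic in Redis: no two clients can act on the same observed
    // value. If the counter goes negative we over-reserved, so compensate
    // (Redis: INCRBY) and report failure.
    public boolean tryReserve(int quantity) {
        int remaining = stock.addAndGet(-quantity); // Redis: DECRBY key quantity
        if (remaining < 0) {
            stock.addAndGet(quantity);              // Redis: INCRBY key quantity
            return false;
        }
        return true;
    }

    public int available() {
        return stock.get();
    }
}
```

The atomic decrement replaces the read-then-write race entirely: there is no window between "check stock" and "reserve stock" for a second request to slip into.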
Inventory — Pessimistic Locking
public interface InventoryRepository extends JpaRepository<Inventory, UUID> {
// Pessimistic write lock (SELECT ... FOR UPDATE) — other transactions trying
// to lock or update this row block until ours commits. Plain MVCC reads are
// unaffected; it is the concurrent reserve/release path that is serialised.
// Prevents overselling.
@Lock(LockModeType.PESSIMISTIC_WRITE)
@Query("SELECT i FROM Inventory i WHERE i.productId = :productId")
Optional<Inventory> findByProductIdForUpdate(@Param("productId") UUID productId);
}
@Service
@RequiredArgsConstructor
public class InventoryService {
private final InventoryRepository repo;
@Transactional
public void reserve(List<OrderItem> items) {
for (OrderItem item : items) {
Inventory inventory = repo.findByProductIdForUpdate(item.getProductId())
.orElseThrow(() -> new ProductNotFoundException(item.getProductId()));
if (inventory.getAvailableQuantity() < item.getQuantity()) {
throw new InsufficientInventoryException(
item.getProductId(), item.getQuantity(),
inventory.getAvailableQuantity()
);
}
inventory.reserve(item.getQuantity());
repo.save(inventory);
}
}
@Transactional
public void release(List<OrderItem> items) {
for (OrderItem item : items) {
Inventory inventory = repo.findByProductIdForUpdate(item.getProductId())
.orElseThrow();
inventory.release(item.getQuantity());
repo.save(inventory);
}
}
}
Stage 2 — Adding Caching and Async
Once the monolith is working correctly, introduce caching for the product catalogue (read-heavy, rarely changes) and Kafka for downstream event processing (notifications, analytics, fulfilment).
@Service
@RequiredArgsConstructor
@CacheConfig(cacheNames = "products")
public class ProductCatalogueService {
private final ProductRepository productRepository;
// Cache products by ID — TTL configured in application.yml
// Products are read millions of times, written rarely
@Cacheable(key = "#productId", unless = "#result == null")
public ProductDto getProduct(UUID productId) {
return productRepository.findById(productId)
.map(ProductDto::from)
.orElseThrow(() -> new ProductNotFoundException(productId));
}
// Evict on update — never serve stale prices
@CacheEvict(key = "#product.id")
public ProductDto updateProduct(Product product) {
Product saved = productRepository.save(product);
return ProductDto.from(saved);
}
// Search results — shorter TTL (30s) because filters change more frequently
@Cacheable(key = "'search:' + #query.cacheKey()", condition = "#query.isCacheable()")
public Page<ProductDto> search(ProductSearchQuery query, Pageable pageable) {
return productRepository.search(query, pageable).map(ProductDto::from);
}
}
@Component
@RequiredArgsConstructor
@Slf4j
public class OrderEventPublisher {
private final KafkaTemplate<String, Object> kafkaTemplate;
// Use order ID as Kafka key — guarantees ordering for one order's events
public void publishOrderConfirmed(OrderConfirmedEvent event) {
kafkaTemplate.send("order.confirmed", event.getOrderId().toString(), event)
.whenComplete((result, ex) -> {
if (ex != null) {
// Log but don't fail the order — notification failure
// is not a reason to fail the checkout
log.error("Failed to publish order.confirmed for orderId={}: {}",
event.getOrderId(), ex.getMessage());
// Could push to a local retry table here
} else {
log.debug("Published order.confirmed orderId={} partition={} offset={}",
event.getOrderId(),
result.getRecordMetadata().partition(),
result.getRecordMetadata().offset());
}
});
}
}
// Fulfilment service consumes the event and triggers warehouse pick
@Component
@RequiredArgsConstructor
@Slf4j
public class FulfilmentEventConsumer {
private final FulfilmentService fulfilmentService;
@KafkaListener(topics = "order.confirmed", groupId = "fulfilment-service",
containerFactory = "orderEventContainerFactory")
public void onOrderConfirmed(OrderConfirmedEvent event,
Acknowledgment ack) {
try {
fulfilmentService.initiatePickAndPack(event.getOrderId());
ack.acknowledge(); // only ack after successful processing
} catch (Exception e) {
log.error("Fulfilment initiation failed for orderId={}", event.getOrderId(), e);
// Do NOT ack — and rethrow so the container's error handler retries the
// record. (Swallowing the exception here would silently skip the message
// until the next rebalance.) After max retries, the error handler routes
// the message to the DLT (dead-letter topic).
throw new RuntimeException("Fulfilment initiation failed", e);
}
}
}
Stage 3 — Production Hardening
@RestController
@RequestMapping("/api/v1/orders")
@RequiredArgsConstructor
@Validated
public class OrderController {
private final OrderService orderService;
@PostMapping
@PreAuthorize("hasRole('CUSTOMER')")
@RateLimiter(name = "checkout") // Max 5 checkout attempts/min per user
public ResponseEntity<OrderResponse> placeOrder(
@RequestBody @Valid CreateOrderRequest request,
@RequestHeader("Idempotency-Key") @NotBlank String idempotencyKey,
@AuthenticationPrincipal UserPrincipal principal) {
// Prevent one user from placing orders as another
if (!principal.getUserId().equals(request.getCustomerId())) {
throw new ForbiddenException("Cannot place order for another customer");
}
OrderResponse response = orderService.placeOrder(request, idempotencyKey);
return ResponseEntity
.status(HttpStatus.CREATED)
.header("Location", "/api/v1/orders/" + response.getId())
.body(response);
}
@GetMapping("/{orderId}")
@PreAuthorize("hasRole('CUSTOMER') and @orderSecurity.isOwner(#orderId, principal)")
public OrderResponse getOrder(@PathVariable UUID orderId) {
return orderService.getOrder(orderId);
}
@GetMapping
@PreAuthorize("hasRole('CUSTOMER')")
public Page<OrderResponse> listOrders(
@AuthenticationPrincipal UserPrincipal principal,
@PageableDefault(size = 20, sort = "createdAt", direction = Sort.Direction.DESC) Pageable pageable) {
return orderService.listOrdersForCustomer(principal.getUserId(), pageable);
}
}
Database Schema — The Foundation
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL,
status VARCHAR(30) NOT NULL,
amount NUMERIC(19, 4) NOT NULL,
currency CHAR(3) NOT NULL DEFAULT 'USD',
version BIGINT NOT NULL DEFAULT 0, -- optimistic lock version
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE order_items (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
order_id UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
product_id UUID NOT NULL,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(19, 4) NOT NULL,
currency CHAR(3) NOT NULL DEFAULT 'USD'
);
CREATE TABLE inventory (
product_id UUID PRIMARY KEY,
available_quantity INT NOT NULL CHECK (available_quantity >= 0),
reserved_quantity INT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Critical indexes for query performance
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
-- Idempotency keys table
CREATE TABLE idempotency_keys (
key VARCHAR(255) PRIMARY KEY,
response JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Clean up old keys automatically
CREATE INDEX idx_idempotency_created ON idempotency_keys(created_at);
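The `idempotency_keys` table above backs the `IdempotencyKeyStore` used in `placeOrder`. Its contract is simple: for a given key, execute the action at most once and always return the first result. A minimal in-memory sketch (a `ConcurrentHashMap` standing in for the table — a real store must persist to the DB or Redis so the guarantee survives restarts; class name is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// In-memory sketch of the getOrExecute contract used by placeOrder.
public class InMemoryIdempotencyKeyStore {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    // Same key = same result: the supplier runs at most once per key, even
    // under concurrent calls (computeIfAbsent is atomic per key).
    @SuppressWarnings("unchecked")
    public <T> T getOrExecute(String idempotencyKey, Supplier<T> action) {
        return (T) results.computeIfAbsent(idempotencyKey, k -> action.get());
    }
}
```

A retried checkout with the same `Idempotency-Key` header hits the stored result and produces no second charge.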
Project 2: URL Shortener at Scale
The URL shortener is one of the most common system design interview questions because it appears trivially simple but hides enormous scaling challenges. A basic implementation works fine for 1,000 users. Getting it to handle 10 billion redirects per month requires careful thinking about cache architecture, ID generation, and read path optimisation.
The Core Challenge
Write path (shorten a URL):
→ Generate unique short code
→ Store {code → url} in DB
→ Cache in Redis
→ Return short URL
Volume: ~100 req/s
Read path (redirect):
→ L1: Redis cache hit? (1ms)
→ L2: DB lookup? (5ms)
→ 301/302 redirect
→ Async: record click analytics
Volume: ~50,000 req/s
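A quick back-of-envelope check on those volumes, starting from the 10-billion-redirects-per-month figure (the peak factor is an assumption for illustration):

```java
// Back-of-envelope capacity estimate for the redirect read path.
public class CapacityEstimate {
    // Average requests per second over a 30-day month.
    public static long averageRps(long requestsPerMonth) {
        long secondsPerMonth = 30L * 24 * 3600; // 2,592,000
        return requestsPerMonth / secondsPerMonth;
    }

    // Peak traffic as a multiple of the average (peakFactor is a guess —
    // real traffic shape comes from your own metrics).
    public static long peakRps(long requestsPerMonth, int peakFactor) {
        return averageRps(requestsPerMonth) * peakFactor;
    }
}
```

10 billion/month works out to roughly 3,900 req/s on average; with a 10–15x peak factor that lands in the tens of thousands, consistent with the ~50,000 req/s read-path figure above.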
Short Code Generation — Getting It Right
The short code must be unique, short (6-8 chars), and generated at high throughput without coordination between instances. Three approaches, with different tradeoffs:
/**
* Strategy 1: Base62 encoding of a database sequence.
* Pros: simple, no collisions, deterministic.
* Cons: DB becomes the bottleneck; sequential codes are guessable.
* Use when: throughput < 10k writes/s.
*/
@Component
public class SequenceBase62Generator {
private static final String CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
public String generate(long sequenceValue) {
StringBuilder sb = new StringBuilder();
while (sequenceValue > 0) {
sb.append(CHARS.charAt((int)(sequenceValue % 62)));
sequenceValue /= 62;
}
// Pad to 6 chars minimum and reverse
while (sb.length() < 6) sb.append('0');
return sb.reverse().toString();
}
}
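A decode counterpart (not shown in the original) makes Strategy 1 testable end to end and lets you map a short code back to its sequence value — useful for debugging and migrations. Class name is illustrative:

```java
// Encode/decode pair for Base62 short codes, matching Strategy 1's scheme
// (pad to 6 chars with leading '0').
public class Base62Codec {
    private static final String CHARS =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    public String encode(long value) {
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(CHARS.charAt((int) (value % 62)));
            value /= 62;
        }
        while (sb.length() < 6) sb.append('0'); // pad, then reverse
        return sb.reverse().toString();
    }

    // Inverse: leading '0' padding contributes nothing to the value.
    public long decode(String code) {
        long value = 0;
        for (char c : code.toCharArray()) {
            value = value * 62 + CHARS.indexOf(c);
        }
        return value;
    }
}
```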
/**
* Strategy 2: Random Base62 with collision detection.
* Pros: non-guessable, no central coordination.
* Cons: collision probability grows as table fills (birthday problem).
* Requires a retry loop + DB unique constraint.
* Use when: security of short codes matters; throughput moderate.
*/
@Component
public class RandomBase62Generator {
private static final int CODE_LENGTH = 7;
private final SecureRandom random = new SecureRandom();
private static final String CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
public String generate() {
StringBuilder sb = new StringBuilder(CODE_LENGTH);
for (int i = 0; i < CODE_LENGTH; i++) {
sb.append(CHARS.charAt(random.nextInt(62)));
}
return sb.toString();
}
}
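Strategy 2's javadoc mentions the required retry loop but the class above only generates one candidate. A sketch of that loop, with a `Predicate` standing in for the collision check (in production the `INSERT` itself detects the collision via the unique constraint, and the loop retries on the constraint-violation exception; class and constant names here are illustrative):

```java
import java.security.SecureRandom;
import java.util.function.Predicate;

// Random Base62 allocation with bounded collision retries.
public class CollisionSafeCodeAllocator {
    private static final String CHARS =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int CODE_LENGTH = 7;
    private static final int MAX_ATTEMPTS = 5; // assumption: tune per table fill rate

    private final SecureRandom random = new SecureRandom();

    public String allocate(Predicate<String> alreadyTaken) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            String code = randomCode();
            if (!alreadyTaken.test(code)) {
                return code; // in production: the successful INSERT claims it
            }
        }
        // Repeated collisions signal the keyspace is filling up — time to
        // increase CODE_LENGTH rather than retry forever.
        throw new IllegalStateException(
            "Could not allocate a unique code after " + MAX_ATTEMPTS + " attempts");
    }

    private String randomCode() {
        StringBuilder sb = new StringBuilder(CODE_LENGTH);
        for (int i = 0; i < CODE_LENGTH; i++) {
            sb.append(CHARS.charAt(random.nextInt(62)));
        }
        return sb.toString();
    }
}
```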
/**
* Strategy 3: Murmur hash of the URL (deterministic).
* Pros: same URL always gets the same code; natural deduplication.
* Cons: hash collisions possible for different URLs.
* Cannot support custom short codes.
* Use when: deduplication is more important than custom codes.
*/
@Component
public class HashBasedGenerator {
public String generate(String originalUrl) {
long hash = Hashing.murmur3_128()
.hashString(originalUrl, StandardCharsets.UTF_8)
.asLong();
// Math.abs(Long.MIN_VALUE) is still negative — clear the sign bit instead
long positive = hash & Long.MAX_VALUE;
// encodeBase62 must pad to at least 7 chars (as in Strategy 1's generator),
// otherwise substring() throws for small hash values
return encodeBase62(positive).substring(0, 7);
}
}
The Redirect Service — Optimising the Hot Path
@Service
@RequiredArgsConstructor
@Slf4j
public class RedirectService {
private final UrlRepository urlRepository;
private final RedisTemplate<String, String> redisTemplate;
private final ClickEventPublisher clickEventPublisher;
private final MeterRegistry meterRegistry;
private static final String CACHE_PREFIX = "url:";
private static final Duration CACHE_TTL = Duration.ofDays(1);
private static final String DELETED_SENTINEL = "__DELETED__";
public String resolveUrl(String shortCode, HttpServletRequest request) {
String cacheKey = CACHE_PREFIX + shortCode;
// L1: Redis cache hit
String cached = redisTemplate.opsForValue().get(cacheKey);
if (cached != null) {
if (DELETED_SENTINEL.equals(cached)) {
// Negative cache entry — URL was deleted; avoid DB lookup
throw new ShortUrlNotFoundException(shortCode);
}
meterRegistry.counter("redirect.cache.hit").increment();
publishClickAsync(shortCode, request, cached);
return cached;
}
// L2: Database lookup (cache miss)
meterRegistry.counter("redirect.cache.miss").increment();
return urlRepository.findByShortCode(shortCode)
.map(url -> {
// Check if URL is expired
if (url.isExpired()) {
redisTemplate.opsForValue().set(cacheKey, DELETED_SENTINEL,
Duration.ofMinutes(5)); // short TTL for expired entries
throw new ShortUrlExpiredException(shortCode);
}
// Populate cache for future requests
redisTemplate.opsForValue().set(cacheKey, url.getOriginalUrl(), CACHE_TTL);
publishClickAsync(shortCode, request, url.getOriginalUrl());
return url.getOriginalUrl();
})
.orElseGet(() -> {
// Cache negative result to prevent DB hammering on non-existent codes
redisTemplate.opsForValue().set(cacheKey, DELETED_SENTINEL,
Duration.ofMinutes(5));
throw new ShortUrlNotFoundException(shortCode);
});
}
// Publish click event asynchronously — never block the redirect on analytics.
// Caveat: @Async is proxy-based. Calling this method from resolveUrl on
// `this` bypasses the proxy, so it would run synchronously. In production,
// move this method to a separate Spring bean (or use AspectJ weaving).
@Async("clickEventExecutor")
public void publishClickAsync(String shortCode, HttpServletRequest request,
String destination) {
try {
clickEventPublisher.publish(ClickEvent.builder()
.shortCode(shortCode)
.destination(destination)
.ipAddress(extractClientIp(request))
.userAgent(request.getHeader("User-Agent"))
.referer(request.getHeader("Referer"))
.timestamp(Instant.now())
.build());
} catch (Exception e) {
log.warn("Failed to publish click event for shortCode={}", shortCode, e);
// Non-critical: analytics loss is acceptable; redirect must always succeed
}
}
}
@RestController
@RequiredArgsConstructor
public class RedirectController {
private final RedirectService redirectService;
@GetMapping("/{shortCode}")
public ResponseEntity<Void> redirect(@PathVariable String shortCode,
HttpServletRequest request) {
String destination = redirectService.resolveUrl(shortCode, request);
// 302 (FOUND), not 301 — a 301 is cached by browsers and kills click tracking
return ResponseEntity.status(HttpStatus.FOUND)
.header(HttpHeaders.LOCATION, destination)
// Cache-Control: no-store prevents the browser from caching the redirect,
// so repeat visits still hit the server and can be tracked
.header(HttpHeaders.CACHE_CONTROL, "no-store")
.build();
}
}
A 301 (Permanent) redirect is cached by browsers. Once cached, the browser goes directly to the destination — your server never sees the click again. Perfect for performance, terrible for analytics. A 302 (Temporary) redirect forces the browser to hit your server every time — great for analytics and A/B testing, but every redirect has server cost. Most URL shorteners use 302 for this reason. Use 301 only when you're certain the destination will never change and you don't need click tracking.
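The 301-vs-302 tradeoff described above collapses into a single decision point. A sketch (names are illustrative):

```java
// Encodes the 301-vs-302 choice: 301 only when the destination is truly
// immutable AND click analytics are not needed; otherwise 302.
public class RedirectPolicy {
    public static final int MOVED_PERMANENTLY = 301; // browser caches; no analytics
    public static final int FOUND = 302;             // every click hits the server

    public static int statusFor(boolean needClickTracking, boolean destinationImmutable) {
        if (destinationImmutable && !needClickTracking) {
            return MOVED_PERMANENTLY;
        }
        return FOUND;
    }
}
```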
Analytics Pipeline
@Component
@RequiredArgsConstructor
@Slf4j
public class ClickAnalyticsConsumer {
private final ClickEventRepository clickRepository;
private final GeoIpService geoIpService;
// Batch processing: collect events and insert in bulk every second
// 1 INSERT per click = DB bottleneck at scale
// Batch inserts = dramatically higher throughput
@KafkaListener(
topics = "url.clicks",
groupId = "analytics-consumer",
batch = "true", // Receive list of messages, not one at a time
containerFactory = "batchKafkaListenerContainerFactory"
)
public void processClickBatch(List<ClickEvent> events, Acknowledgment ack) {
try {
List<ClickRecord> records = events.stream()
.map(event -> ClickRecord.builder()
.shortCode(event.getShortCode())
.country(geoIpService.getCountry(event.getIpAddress()))
.browser(UserAgentParser.getBrowser(event.getUserAgent()))
.refererDomain(extractDomain(event.getReferer()))
.clickedAt(event.getTimestamp())
.build())
.toList();
// Batch insert — one DB roundtrip for up to 500 events
clickRepository.saveAll(records);
ack.acknowledge();
log.debug("Processed {} click events", events.size());
} catch (Exception e) {
log.error("Failed to process click batch of size {}", events.size(), e);
// Don't ack — retry the batch
}
}
}
Project 3: Multi-Channel Notification Platform
Every product eventually needs notifications. Email for receipts. SMS for OTPs. Push for alerts. Building this as a dedicated platform lets every other service just fire an event and forget — without knowing about email providers, SMS gateways, or device tokens. The platform handles reliability, retries, deduplication, and user preferences.
Architecture Overview
Channel adapters, each with its own concerns:
- Email: template rendering, bounce handling
- SMS: rate limiting, OTP handling
- Push: device token mgmt, silent vs alert notifications
The Notification Domain
// Every producer service publishes this common event shape
// The notification platform is the only consumer
@Data
@Builder
@JsonDeserialize(builder = NotificationRequest.NotificationRequestBuilder.class)
public class NotificationRequest {
private UUID requestId; // For idempotency — same ID = same notification
private UUID userId; // Recipient
private NotificationType type; // ORDER_CONFIRMED, OTP_SENT, PAYMENT_FAILED, etc.
private Map<String, String> templateVariables; // {orderId: "X", amount: "$50"}
private NotificationPriority priority; // HIGH (OTP), NORMAL (receipt), LOW (promo)
private Instant expiresAt; // Don't send expired OTPs
// Desired channels — platform will intersect with user preferences
private Set<Channel> requestedChannels;
// Deduplication window — don't send same notification type twice within this period
private Duration deduplicationWindow;
}
// Notification types drive template selection and channel defaults
public enum NotificationType {
ORDER_CONFIRMED(Set.of(EMAIL, PUSH), Duration.ofHours(24)),
PAYMENT_FAILED(Set.of(EMAIL, SMS, PUSH), Duration.ofHours(1)),
OTP_SENT(Set.of(SMS), Duration.ofMinutes(5)), // SMS only for OTPs
PROMOTIONAL(Set.of(PUSH), Duration.ofDays(1)),
ACCOUNT_SECURITY_ALERT(Set.of(EMAIL, SMS), Duration.ZERO); // No dedup
private final Set<Channel> defaultChannels;
private final Duration defaultDeduplicationWindow;
NotificationType(Set<Channel> defaultChannels, Duration defaultDeduplicationWindow) {
this.defaultChannels = defaultChannels;
this.defaultDeduplicationWindow = defaultDeduplicationWindow;
}
}
Retry Logic with Exponential Backoff
External notification providers fail. The email gateway returns 503. SMS gateway throttles you. The push service times out. Proper retry logic with exponential backoff and a dead-letter topic is non-negotiable in a production notification system.
@Component
@RequiredArgsConstructor
@Slf4j
public class EmailNotificationSender {
private final SendGridClient sendGridClient;
private final TemplateRenderer templateRenderer;
private final NotificationRepository notificationRepository;
private final DeadLetterPublisher deadLetterPublisher; // used in the fallback below
// Resilience4j Retry: 3 attempts, exponential backoff 1s → 2s → 4s
@Retry(name = "email-sender", fallbackMethod = "handleEmailFailure")
@CircuitBreaker(name = "sendgrid", fallbackMethod = "handleEmailFailure")
public void send(NotificationRecord notification) {
String htmlBody = templateRenderer.render(
notification.getType().getEmailTemplate(),
notification.getTemplateVariables()
);
SendGridEmail email = SendGridEmail.builder()
.to(notification.getRecipientEmail())
.subject(notification.getSubject())
.htmlContent(htmlBody)
.trackingId(notification.getId().toString()) // For bounce/open tracking
.build();
sendGridClient.send(email);
notification.markDelivered(Channel.EMAIL);
notificationRepository.save(notification);
log.info("Email sent notificationId={} to={}", notification.getId(),
notification.getRecipientEmail());
}
// Called after all retries exhausted OR circuit open
// Saves to DLT (dead-letter) for human review and manual retry
private void handleEmailFailure(NotificationRecord notification, Exception e) {
log.error("Email delivery permanently failed for notificationId={}: {}",
notification.getId(), e.getMessage());
notification.markFailed(Channel.EMAIL, e.getMessage());
notificationRepository.save(notification);
// Emit to dead-letter topic — ops team can inspect and retry
deadLetterPublisher.publish(DeadLetterEvent.builder()
.notificationId(notification.getId())
.channel(Channel.EMAIL)
.failureReason(e.getMessage())
.originalPayload(notification)
.build());
}
}
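The 1s → 2s → 4s schedule in the retry comment follows `delay(n) = initial × multiplier^(n-1)`, capped at a maximum. Resilience4j derives this from its interval-function configuration; the arithmetic itself is just (class name illustrative):

```java
import java.time.Duration;

// Exponential backoff: delay(n) = initial * multiplier^(n-1), capped at maxDelay.
public class BackoffSchedule {
    public static Duration delayForAttempt(int attempt, Duration initial,
                                           double multiplier, Duration maxDelay) {
        double millis = initial.toMillis() * Math.pow(multiplier, attempt - 1);
        long capped = (long) Math.min(millis, maxDelay.toMillis());
        return Duration.ofMillis(capped);
    }
}
```

With `initial = 1s`, `multiplier = 2.0`, this yields the 1s, 2s, 4s schedule for attempts 1–3; the cap matters for long retry chains, where uncapped exponential growth quickly becomes absurd.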
User Preference Management
@Service
@RequiredArgsConstructor
public class NotificationPreferenceService {
private final PreferenceRepository preferenceRepository;
// Preferences are read on every notification dispatch — cache them
@Cacheable(value = "user-prefs", key = "#userId")
public NotificationPreference getPreferences(UUID userId) {
return preferenceRepository.findByUserId(userId)
.orElse(NotificationPreference.defaults(userId));
}
public Set<Channel> resolveChannels(UUID userId, NotificationType type,
Set<Channel> requestedChannels) {
NotificationPreference prefs = getPreferences(userId);
// Intersect: requested channels ∩ user-enabled channels ∩ type defaults
return requestedChannels.stream()
.filter(channel -> prefs.isChannelEnabled(channel))
.filter(channel -> prefs.isTypeEnabled(type, channel))
.filter(channel -> !prefs.isInQuietHours(channel, Instant.now()))
.collect(Collectors.toSet());
}
// Quiet hours: no push notifications 10pm — 8am user's local time
// This is surprisingly important for user retention
@CacheEvict(value = "user-prefs", key = "#preference.userId")
public void updatePreferences(NotificationPreference preference) {
preferenceRepository.save(preference);
}
}
Deduplication — Never Spam Users
@Service
@RequiredArgsConstructor
public class DeduplicationService {
private final RedisTemplate<String, String> redisTemplate;
/**
* Check if a notification of this type was already sent to this user
* within the deduplication window. Uses Redis SET NX (set-if-not-exists)
* for atomic check-and-mark.
*/
public boolean isDuplicate(UUID userId, NotificationType type,
Duration window) {
if (window.isZero()) return false; // Types with ZERO window are never deduped
// Key is deterministic (userId + type); the TTL alone defines the window.
// Adding a time bucket here would break windows that span a bucket boundary.
String key = String.format("notif:dedup:%s:%s", userId, type.name());
Boolean isNew = redisTemplate.opsForValue()
.setIfAbsent(key, "1", window);
// If setIfAbsent returned false, key already existed = duplicate
return Boolean.FALSE.equals(isNew);
}
}
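The atomic check-and-mark that SET NX provides can be demonstrated without Redis: `ConcurrentHashMap.putIfAbsent` has the same first-caller-wins semantics (the TTL-based window expiry is omitted in this sketch; class name is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// SET NX semantics in miniature: the first caller for a key marks it and is
// told "not a duplicate"; every later caller with the same key is a duplicate.
public class InMemoryDeduplicator {
    private final Map<String, String> seen = new ConcurrentHashMap<>();

    public boolean isDuplicate(String dedupKey) {
        // putIfAbsent is atomic — there is no check-then-set race window
        return seen.putIfAbsent(dedupKey, "1") != null;
    }
}
```

Because check and mark are one atomic operation, two concurrent deliveries of the same event cannot both pass the duplicate check.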
// The main notification dispatcher tying it all together
@Service
@RequiredArgsConstructor
@Slf4j
public class NotificationDispatcher {
private final DeduplicationService deduplicationService;
private final NotificationPreferenceService preferenceService;
private final Map<Channel, NotificationSender> senders; // built in a @Configuration bean, one sender per channel
@KafkaListener(topics = "notifications.dispatch", groupId = "notification-platform")
public void dispatch(NotificationRequest request, Acknowledgment ack) {
try {
// 1. Check expiry
if (request.getExpiresAt() != null &&
Instant.now().isAfter(request.getExpiresAt())) {
log.info("Skipping expired notification requestId={}", request.getRequestId());
ack.acknowledge();
return;
}
// 2. Deduplication check
if (deduplicationService.isDuplicate(
request.getUserId(), request.getType(),
request.getDeduplicationWindow())) {
log.info("Skipping duplicate notification requestId={}", request.getRequestId());
ack.acknowledge();
return;
}
// 3. Resolve channels against user preferences
Set<Channel> channels = preferenceService.resolveChannels(
request.getUserId(), request.getType(), request.getRequestedChannels());
if (channels.isEmpty()) {
log.info("No channels enabled for userId={} type={}",
request.getUserId(), request.getType());
ack.acknowledge();
return;
}
// 4. Dispatch to each enabled channel
NotificationRecord record = NotificationRecord.from(request);
channels.forEach(channel ->
senders.get(channel).send(record)
);
ack.acknowledge();
} catch (Exception e) {
log.error("Dispatch failed for requestId={}", request.getRequestId(), e);
// Don't ack — Kafka retries. After max attempts → DLT.
}
}
}
Kafka guarantees at-least-once delivery — meaning the same notification event might be consumed twice on failure and retry. Without deduplication, users receive two "Your order is confirmed" emails. The deduplication service using Redis SET NX solves this. But the deduplication key must be deterministic — based on the original request ID, not a randomly generated notification ID — so the same event always produces the same dedup key across retries.
Architectural Evolution Patterns
All three projects share the same evolution trajectory. Understanding this trajectory helps you make the right architectural decisions at each growth stage — neither over-engineering early nor under-engineering so badly you rewrite everything at scale.
0 → 1k users
- Single Spring Boot app
- PostgreSQL
- No cache
- Synchronous everywhere
- Deploy on VPS
1k → 100k
- Redis for caching
- Kafka for async
- Connection pooling
- Read replica
- Docker + CI/CD
100k → 1M
- Kubernetes HPA
- CDN for static
- DB sharding or PG partition
- Prometheus + Grafana
- Rate limiting
1M+ users
- Microservices by domain
- CQRS for read models
- Global distribution
- Event sourcing
- Multi-region DB
The most expensive architectural mistake in startup engineering: breaking into microservices at Stage 1. The team spends months building service discovery, distributed tracing, inter-service auth, and Kubernetes before they have 100 users. Then the business pivots, the domain changes, and all the service boundaries are wrong. Start as a well-structured monolith. Extract services when you have a specific scaling problem that the monolith cannot solve — not because the architecture "feels right" or because a conference talk told you to.
Interview Preparation
System design interviews at FAANG and top startups use problems identical to the projects in this section. The framework for answering any backend design question is the same: clarify requirements, estimate scale, design the data model, design the API, address the hard problems (consistency, availability, scalability), and evolve the architecture.
Q: How would you prevent overselling and double-charging in a checkout flow?
Use SELECT FOR UPDATE to atomically check and reserve inventory within a transaction — this prevents two concurrent requests from both seeing "5 in stock" and both succeeding. For very high throughput, move the inventory gate to Redis using atomic DECR operations, which are single-threaded and cannot race. Use idempotency keys (a UUID the client generates) so retries never double-charge. Make checkout async-friendly: place the order synchronously but trigger fulfilment, notification, and analytics via Kafka events so the HTTP response is fast. Use optimistic locking (@Version) for order status updates to detect concurrent modification.
Q: Why put business rules on the entity instead of the service?
Because order.confirm() validates that the order is in PENDING status, that invariant is enforced everywhere the method is called — in the API layer, in event handlers, in tests. If you set order.setStatus(CONFIRMED) directly in the service, each callsite must remember to validate the transition. This creates scattered validation logic that is easy to forget and hard to test comprehensively. Rich domain objects (sometimes called the "domain model" pattern) produce business rules that read like the business domain, are self-documenting, and fail fast with meaningful exceptions instead of corrupt state.
Q: Why a Money value object instead of a plain BigDecimal?
A bare BigDecimal amount field loses the currency. Without the currency, you cannot safely add two money amounts (you might add USD to EUR), cannot format for display, and cannot apply currency-specific rounding rules. A Money value object encapsulates amount and currency together, making invalid states unrepresentable — you literally cannot create a money value without specifying its currency. Value objects are immutable (no setters, return new instances), so they are inherently thread-safe. The add method validates currency match at the domain level. This makes the code more correct by construction and eliminates entire classes of currency bugs at compile time rather than runtime.
Q: How would you handle growing Kafka consumer lag?
Either scale out consumption up to the partition count (increase setConcurrency in the container factory) or optimise the processing logic (batch DB inserts, async external calls). If partition count is the bottleneck, increase the topic's partition count — but only the partitions assigned to your group matter, so you may need to rebalance. Add a circuit breaker around external calls in the consumer: if the external service is down, fail fast and let Kafka retry rather than holding the thread. Monitor consumer lag as a Prometheus metric — alert when lag exceeds a threshold that would cause unacceptable processing delay for your SLO.