Real Production Projects
Three complete backend systems built from scratch — architecture decisions, domain modelling, API design, database schema, caching strategy, async processing, security hardening, observability, and deployment. This is where everything learned in sections 1–12 converges into systems you’d find at real companies.
How to Use This Section
Each project is built in layers — starting from a simple monolith and evolving toward a production-grade service. The goal is not to copy-paste code. The goal is to develop the engineering intuition for why decisions are made, when to introduce complexity, and what breaks at each stage of growth.
For each project: first build the simplest possible version that works. Then deliberately break it — load test it, remove the cache, block the queue. Observe what fails and why. Then fix and harden it. This is how real engineering intuition is developed — not by reading about failure modes but by causing them.
Project 1: E-Commerce Order Service
An order service is at the heart of every e-commerce platform. It must handle concurrent checkout attempts, coordinate inventory deduction, trigger payment processing, and emit events to downstream services — all within a consistent, auditable transaction model. This is the canonical backend engineering challenge.
System Requirements
- ✓ Browse product catalogue with search & filters
- ✓ Add/remove items from cart
- ✓ Place order with address and payment
- ✓ Check inventory before confirming
- ✓ Process payment (sync or async)
- ✓ Track order status (PENDING → CONFIRMED → SHIPPED)
- ✓ Send email/push confirmation
- ✓ Handle cancellations and refunds
- ✓ p99 checkout latency < 500ms
- ✓ No double-charges on retry
- ✓ Consistent inventory (no overselling)
- ✓ 99.9% availability
- ✓ Horizontal scalability
- ✓ Full audit trail
- ✓ Fraud detection hooks
- ✓ Rate limiting on checkout
Architecture Evolution: Stage 1 — The Monolith
Start with a single Spring Boot application, one PostgreSQL database, no messaging, no cache. Get the core domain right before adding infrastructure complexity. The biggest mistake engineers make is reaching for microservices before they understand the domain.
domain/ ← entities, value objects, domain logic
  Order.java
  OrderItem.java
  OrderStatus.java
  Product.java
  Money.java ← value object — never use primitives for money
application/ ← use cases, orchestration
  OrderService.java
  CartService.java
  InventoryService.java
infrastructure/ ← JPA, Kafka, HTTP clients
  OrderRepository.java
  PaymentGatewayClient.java
api/ ← controllers, DTOs, validation
  OrderController.java
  CreateOrderRequest.java
  OrderResponse.java
Domain Model
@Entity
@Table(name = "orders")
@Getter
@NoArgsConstructor(access = AccessLevel.PROTECTED)
public class Order {
@Id @GeneratedValue(strategy = GenerationType.UUID)
private UUID id;
@Column(nullable = false)
private UUID customerId;
@Enumerated(EnumType.STRING)
@Column(nullable = false)
private OrderStatus status;
// Use Embedded for value objects — Money is not an entity
@Embedded
private Money totalAmount;
@OneToMany(cascade = CascadeType.ALL, orphanRemoval = true, fetch = FetchType.LAZY)
@JoinColumn(name = "order_id")
private List<OrderItem> items = new ArrayList<>();
@CreationTimestamp
private Instant createdAt;
@UpdateTimestamp
private Instant updatedAt;
// Version for optimistic locking — prevents concurrent modification
@Version
private Long version;
// Factory method — enforce invariants at creation time
public static Order create(UUID customerId, List<OrderItem> items) {
if (items == null || items.isEmpty()) {
throw new DomainException("Order must have at least one item");
}
Order order = new Order();
order.customerId = customerId;
order.status = OrderStatus.PENDING;
order.items = new ArrayList<>(items);
order.totalAmount = items.stream()
.map(OrderItem::getLineTotal)
.reduce(Money.ZERO, Money::add);
return order;
}
// Domain behaviour lives on the entity — not in the service
public void confirm() {
if (this.status != OrderStatus.PENDING) {
throw new DomainException("Cannot confirm order in status: " + this.status);
}
this.status = OrderStatus.CONFIRMED;
}
public void cancel(String reason) {
if (this.status == OrderStatus.SHIPPED) {
throw new DomainException("Cannot cancel a shipped order");
}
this.status = OrderStatus.CANCELLED;
}
public void ship() {
if (this.status != OrderStatus.CONFIRMED) {
throw new DomainException("Order must be confirmed before shipping");
}
this.status = OrderStatus.SHIPPED;
}
}
// Money — never use double or float for currency
@Embeddable
@Value
// Lombok @Value makes fields final; JPA still needs a no-arg constructor,
// so force one that leaves the finals null/zero
@NoArgsConstructor(force = true, access = AccessLevel.PROTECTED)
public class Money {
public static final Money ZERO = new Money(BigDecimal.ZERO, "USD");
@Column(name = "amount", precision = 19, scale = 4)
private BigDecimal amount;
@Column(name = "currency", length = 3)
private String currency;
public Money add(Money other) {
if (!this.currency.equals(other.currency)) {
throw new DomainException("Cannot add different currencies");
}
return new Money(this.amount.add(other.amount), this.currency);
}
public boolean isGreaterThan(Money other) {
return this.amount.compareTo(other.amount) > 0;
}
}
The Checkout Flow — Where All Complexity Lives
Checkout must be idempotent (double-clicking submit should not double-charge), inventory-consistent (no overselling), and transactionally safe (payment failure must roll back inventory). This is where most teams introduce subtle bugs.
@Service
@RequiredArgsConstructor
@Transactional
public class OrderService {
private final OrderRepository orderRepository;
private final InventoryService inventoryService;
private final PaymentService paymentService;
private final ProductCatalogueService productCatalogue; // price lookups in doPlaceOrder
private final ApplicationEventPublisher eventPublisher;
private final IdempotencyKeyStore idempotencyStore;
/**
* Idempotency key prevents double-charges on retry.
* If the same key is submitted twice, return the original result.
*/
public OrderResponse placeOrder(CreateOrderRequest request, String idempotencyKey) {
// Check idempotency — same key = same result, no side effects
return idempotencyStore.getOrExecute(idempotencyKey, () -> {
return doPlaceOrder(request);
});
}
private OrderResponse doPlaceOrder(CreateOrderRequest request) {
// 1. Validate and build order items
List<OrderItem> items = request.getItems().stream()
.map(i -> OrderItem.of(i.getProductId(), i.getQuantity(),
productCatalogue.getPrice(i.getProductId())))
.toList();
// 2. Reserve inventory — pessimistic lock at DB level
// Throws InsufficientInventoryException if any item unavailable
inventoryService.reserve(items);
// 3. Create domain object — enforces business rules
Order order = Order.create(request.getCustomerId(), items);
orderRepository.save(order);
// 4. Attempt payment
try {
PaymentResult payment = paymentService.charge(
order.getId(),
order.getTotalAmount(),
request.getPaymentMethodId()
);
order.confirm();
orderRepository.save(order);
// 5. Publish domain event — downstream services react asynchronously
// Using @TransactionalEventListener ensures this only fires on commit
eventPublisher.publishEvent(new OrderConfirmedEvent(order.getId(),
order.getCustomerId(), order.getTotalAmount()));
return OrderResponse.from(order);
} catch (PaymentDeclinedException e) {
// Roll back inventory reservation on payment failure
inventoryService.release(items);
order.cancel("Payment declined: " + e.getDeclineReason());
orderRepository.save(order);
throw new OrderPaymentFailedException(order.getId(), e.getDeclineReason());
}
}
}
// TransactionalEventListener — fires AFTER the transaction commits
// This prevents the notification being sent if the DB write fails
@Component
@RequiredArgsConstructor
public class OrderEventHandler {
private final NotificationService notificationService;
@TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
@Async // Don't block the HTTP request thread
public void onOrderConfirmed(OrderConfirmedEvent event) {
notificationService.sendOrderConfirmation(
event.getCustomerId(),
event.getOrderId()
);
}
}
Without proper locking, two concurrent checkout requests can both see "5 items in stock", both reserve 3, and both succeed — leaving you with -1 inventory. The fix: use SELECT ... FOR UPDATE to pessimistically lock the inventory row during the check-and-reserve operation. In Spring Data JPA: @Lock(LockModeType.PESSIMISTIC_WRITE) on the repository method. For high-throughput scenarios, use Redis atomic operations (DECR) as the inventory gate instead of DB locks.
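The semantics of the Redis DECR gate mentioned above can be sketched without Redis at all — the essential property is an atomic decrement-and-check. Here an `AtomicInteger` stands in for the Redis counter (the class and method names are illustrative, not part of the project; a real implementation would call `opsForValue().decrement(key, quantity)` on a `StringRedisTemplate`):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a Redis-style inventory gate. The AtomicInteger stands in for a
// Redis counter keyed by product ID (e.g. "inventory:{productId}").
public class RedisStyleInventoryGate {
    private final AtomicInteger stock;

    public RedisStyleInventoryGate(int initialStock) {
        this.stock = new AtomicInteger(initialStock);
    }

    // DECRBY is atomic in Redis: no two clients can act on the same observed
    // value. If the counter goes negative we over-reserved, so compensate
    // (Redis: INCRBY) and report failure.
    public boolean tryReserve(int quantity) {
        int remaining = stock.addAndGet(-quantity); // Redis: DECRBY key quantity
        if (remaining < 0) {
            stock.addAndGet(quantity);              // Redis: INCRBY key quantity
            return false;
        }
        return true;
    }

    public int available() {
        return stock.get();
    }
}
```

The atomic decrement replaces the read-then-write race entirely: there is no window between "check stock" and "reserve stock" for a second request to slip into.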
Inventory — Pessimistic Locking
public interface InventoryRepository extends JpaRepository<Inventory, UUID> {
// Pessimistic write lock (SELECT ... FOR UPDATE) — other transactions trying
// to lock or update this row block until ours commits. Plain MVCC reads are
// unaffected; it is the concurrent reserve/release path that is serialised.
// Prevents overselling.
@Lock(LockModeType.PESSIMISTIC_WRITE)
@Query("SELECT i FROM Inventory i WHERE i.productId = :productId")
Optional<Inventory> findByProductIdForUpdate(@Param("productId") UUID productId);
}
@Service
@RequiredArgsConstructor
public class InventoryService {
private final InventoryRepository repo;
@Transactional
public void reserve(List<OrderItem> items) {
for (OrderItem item : items) {
Inventory inventory = repo.findByProductIdForUpdate(item.getProductId())
.orElseThrow(() -> new ProductNotFoundException(item.getProductId()));
if (inventory.getAvailableQuantity() < item.getQuantity()) {
throw new InsufficientInventoryException(
item.getProductId(), item.getQuantity(),
inventory.getAvailableQuantity()
);
}
inventory.reserve(item.getQuantity());
repo.save(inventory);
}
}
@Transactional
public void release(List<OrderItem> items) {
for (OrderItem item : items) {
Inventory inventory = repo.findByProductIdForUpdate(item.getProductId())
.orElseThrow();
inventory.release(item.getQuantity());
repo.save(inventory);
}
}
}
Stage 2 — Adding Caching and Async
Once the monolith is working correctly, introduce caching for the product catalogue (read-heavy, rarely changes) and Kafka for downstream event processing (notifications, analytics, fulfilment).
@Service
@RequiredArgsConstructor
@CacheConfig(cacheNames = "products")
public class ProductCatalogueService {
private final ProductRepository productRepository;
// Cache products by ID — TTL configured in application.yml
// Products are read millions of times, written rarely
@Cacheable(key = "#productId", unless = "#result == null")
public ProductDto getProduct(UUID productId) {
return productRepository.findById(productId)
.map(ProductDto::from)
.orElseThrow(() -> new ProductNotFoundException(productId));
}
// Evict on update — never serve stale prices
@CacheEvict(key = "#product.id")
public ProductDto updateProduct(Product product) {
Product saved = productRepository.save(product);
return ProductDto.from(saved);
}
// Search results — shorter TTL (30s) because filters change more frequently
@Cacheable(key = "'search:' + #query.cacheKey()", condition = "#query.isCacheable()")
public Page<ProductDto> search(ProductSearchQuery query, Pageable pageable) {
return productRepository.search(query, pageable).map(ProductDto::from);
}
}
@Component
@RequiredArgsConstructor
@Slf4j
public class OrderEventPublisher {
private final KafkaTemplate<String, Object> kafkaTemplate;
// Use order ID as Kafka key — guarantees ordering for one order's events
public void publishOrderConfirmed(OrderConfirmedEvent event) {
kafkaTemplate.send("order.confirmed", event.getOrderId().toString(), event)
.whenComplete((result, ex) -> {
if (ex != null) {
// Log but don't fail the order — notification failure
// is not a reason to fail the checkout
log.error("Failed to publish order.confirmed for orderId={}: {}",
event.getOrderId(), ex.getMessage());
// Could push to a local retry table here
} else {
log.debug("Published order.confirmed orderId={} partition={} offset={}",
event.getOrderId(),
result.getRecordMetadata().partition(),
result.getRecordMetadata().offset());
}
});
}
}
// Fulfilment service consumes the event and triggers warehouse pick
@Component
@RequiredArgsConstructor
@Slf4j
public class FulfilmentEventConsumer {
private final FulfilmentService fulfilmentService;
@KafkaListener(topics = "order.confirmed", groupId = "fulfilment-service",
containerFactory = "orderEventContainerFactory")
public void onOrderConfirmed(OrderConfirmedEvent event,
Acknowledgment ack) {
try {
fulfilmentService.initiatePickAndPack(event.getOrderId());
ack.acknowledge(); // only ack after successful processing
} catch (Exception e) {
log.error("Fulfilment initiation failed for orderId={}", event.getOrderId(), e);
// Do NOT ack — and rethrow so the container's error handler retries the
// record. (Swallowing the exception here would silently skip the message
// until the next rebalance.) After max retries, the error handler routes
// the message to the DLT (dead-letter topic).
throw new RuntimeException("Fulfilment initiation failed", e);
}
}
}
Stage 3 — Production Hardening
@RestController
@RequestMapping("/api/v1/orders")
@RequiredArgsConstructor
@Validated
public class OrderController {
private final OrderService orderService;
@PostMapping
@PreAuthorize("hasRole('CUSTOMER')")
@RateLimiter(name = "checkout") // Max 5 checkout attempts/min per user
public ResponseEntity<OrderResponse> placeOrder(
@RequestBody @Valid CreateOrderRequest request,
@RequestHeader("Idempotency-Key") @NotBlank String idempotencyKey,
@AuthenticationPrincipal UserPrincipal principal) {
// Prevent one user from placing orders as another
if (!principal.getUserId().equals(request.getCustomerId())) {
throw new ForbiddenException("Cannot place order for another customer");
}
OrderResponse response = orderService.placeOrder(request, idempotencyKey);
return ResponseEntity
.status(HttpStatus.CREATED)
.header("Location", "/api/v1/orders/" + response.getId())
.body(response);
}
@GetMapping("/{orderId}")
@PreAuthorize("hasRole('CUSTOMER') and @orderSecurity.isOwner(#orderId, principal)")
public OrderResponse getOrder(@PathVariable UUID orderId) {
return orderService.getOrder(orderId);
}
@GetMapping
@PreAuthorize("hasRole('CUSTOMER')")
public Page<OrderResponse> listOrders(
@AuthenticationPrincipal UserPrincipal principal,
@PageableDefault(size = 20, sort = "createdAt", direction = Sort.Direction.DESC) Pageable pageable) {
return orderService.listOrdersForCustomer(principal.getUserId(), pageable);
}
}
Database Schema — The Foundation
CREATE TABLE orders (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
customer_id UUID NOT NULL,
status VARCHAR(30) NOT NULL,
amount NUMERIC(19, 4) NOT NULL,
currency CHAR(3) NOT NULL DEFAULT 'USD',
version BIGINT NOT NULL DEFAULT 0, -- optimistic lock version
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE order_items (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
order_id UUID NOT NULL REFERENCES orders(id) ON DELETE CASCADE,
product_id UUID NOT NULL,
quantity INT NOT NULL CHECK (quantity > 0),
unit_price NUMERIC(19, 4) NOT NULL,
currency CHAR(3) NOT NULL DEFAULT 'USD'
);
CREATE TABLE inventory (
product_id UUID PRIMARY KEY,
available_quantity INT NOT NULL CHECK (available_quantity >= 0),
reserved_quantity INT NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Critical indexes for query performance
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created_at ON orders(created_at DESC);
CREATE INDEX idx_order_items_order ON order_items(order_id);
CREATE INDEX idx_order_items_product ON order_items(product_id);
-- Idempotency keys table
CREATE TABLE idempotency_keys (
key VARCHAR(255) PRIMARY KEY,
response JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
-- Clean up old keys automatically
CREATE INDEX idx_idempotency_created ON idempotency_keys(created_at);
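The `idempotency_keys` table above backs the `IdempotencyKeyStore` used in `placeOrder`. Its contract is simple: for a given key, execute the action at most once and always return the first result. A minimal in-memory sketch (a `ConcurrentHashMap` standing in for the table — a real store must persist to the DB or Redis so the guarantee survives restarts; class name is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// In-memory sketch of the getOrExecute contract used by placeOrder.
public class InMemoryIdempotencyKeyStore {
    private final Map<String, Object> results = new ConcurrentHashMap<>();

    // Same key = same result: the supplier runs at most once per key, even
    // under concurrent calls (computeIfAbsent is atomic per key).
    @SuppressWarnings("unchecked")
    public <T> T getOrExecute(String idempotencyKey, Supplier<T> action) {
        return (T) results.computeIfAbsent(idempotencyKey, k -> action.get());
    }
}
```

A retried checkout with the same `Idempotency-Key` header hits the stored result and produces no second charge.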
Project 2: URL Shortener at Scale
The URL shortener is one of the most common system design interview questions because it appears trivially simple but hides enormous scaling challenges. A basic implementation works fine for 1,000 users. Getting it to handle 10 billion redirects per month requires careful thinking about cache architecture, ID generation, and read path optimisation.
The Core Challenge
Write path (shorten a URL):
→ Generate unique short code
→ Store {code → url} in DB
→ Cache in Redis
→ Return short URL
Volume: ~100 req/s
Read path (redirect):
→ L1: Redis cache hit? (1ms)
→ L2: DB lookup? (5ms)
→ 301/302 redirect
→ Async: record click analytics
Volume: ~50,000 req/s
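A quick back-of-envelope check on those volumes, starting from the 10-billion-redirects-per-month figure (the peak factor is an assumption for illustration):

```java
// Back-of-envelope capacity estimate for the redirect read path.
public class CapacityEstimate {
    // Average requests per second over a 30-day month.
    public static long averageRps(long requestsPerMonth) {
        long secondsPerMonth = 30L * 24 * 3600; // 2,592,000
        return requestsPerMonth / secondsPerMonth;
    }

    // Peak traffic as a multiple of the average (peakFactor is a guess —
    // real traffic shape comes from your own metrics).
    public static long peakRps(long requestsPerMonth, int peakFactor) {
        return averageRps(requestsPerMonth) * peakFactor;
    }
}
```

10 billion/month works out to roughly 3,900 req/s on average; with a 10–15x peak factor that lands in the tens of thousands, consistent with the ~50,000 req/s read-path figure above.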
Short Code Generation — Getting It Right
The short code must be unique, short (6-8 chars), and generated at high throughput without coordination between instances. Three approaches, with different tradeoffs:
/**
* Strategy 1: Base62 encoding of a database sequence.
* Pros: simple, no collisions, deterministic.
* Cons: DB becomes the bottleneck; sequential codes are guessable.
* Use when: throughput < 10k writes/s.
*/
@Component
public class SequenceBase62Generator {
private static final String CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
public String generate(long sequenceValue) {
StringBuilder sb = new StringBuilder();
while (sequenceValue > 0) {
sb.append(CHARS.charAt((int)(sequenceValue % 62)));
sequenceValue /= 62;
}
// Pad to 6 chars minimum and reverse
while (sb.length() < 6) sb.append('0');
return sb.reverse().toString();
}
}
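A decode counterpart (not shown in the original) makes Strategy 1 testable end to end and lets you map a short code back to its sequence value — useful for debugging and migrations. Class name is illustrative:

```java
// Encode/decode pair for Base62 short codes, matching Strategy 1's scheme
// (pad to 6 chars with leading '0').
public class Base62Codec {
    private static final String CHARS =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    public String encode(long value) {
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(CHARS.charAt((int) (value % 62)));
            value /= 62;
        }
        while (sb.length() < 6) sb.append('0'); // pad, then reverse
        return sb.reverse().toString();
    }

    // Inverse: leading '0' padding contributes nothing to the value.
    public long decode(String code) {
        long value = 0;
        for (char c : code.toCharArray()) {
            value = value * 62 + CHARS.indexOf(c);
        }
        return value;
    }
}
```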
/**
* Strategy 2: Random Base62 with collision detection.
* Pros: non-guessable, no central coordination.
* Cons: collision probability grows as table fills (birthday problem).
* Requires a retry loop + DB unique constraint.
* Use when: security of short codes matters; throughput moderate.
*/
@Component
public class RandomBase62Generator {
private static final int CODE_LENGTH = 7;
private final SecureRandom random = new SecureRandom();
private static final String CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
public String generate() {
StringBuilder sb = new StringBuilder(CODE_LENGTH);
for (int i = 0; i < CODE_LENGTH; i++) {
sb.append(CHARS.charAt(random.nextInt(62)));
}
return sb.toString();
}
}
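Strategy 2's javadoc mentions the required retry loop but the class above only generates one candidate. A sketch of that loop, with a `Predicate` standing in for the collision check (in production the `INSERT` itself detects the collision via the unique constraint, and the loop retries on the constraint-violation exception; class and constant names here are illustrative):

```java
import java.security.SecureRandom;
import java.util.function.Predicate;

// Random Base62 allocation with bounded collision retries.
public class CollisionSafeCodeAllocator {
    private static final String CHARS =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    private static final int CODE_LENGTH = 7;
    private static final int MAX_ATTEMPTS = 5; // assumption: tune per table fill rate

    private final SecureRandom random = new SecureRandom();

    public String allocate(Predicate<String> alreadyTaken) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            String code = randomCode();
            if (!alreadyTaken.test(code)) {
                return code; // in production: the successful INSERT claims it
            }
        }
        // Repeated collisions signal the keyspace is filling up — time to
        // increase CODE_LENGTH rather than retry forever.
        throw new IllegalStateException(
            "Could not allocate a unique code after " + MAX_ATTEMPTS + " attempts");
    }

    private String randomCode() {
        StringBuilder sb = new StringBuilder(CODE_LENGTH);
        for (int i = 0; i < CODE_LENGTH; i++) {
            sb.append(CHARS.charAt(random.nextInt(62)));
        }
        return sb.toString();
    }
}
```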
/**
* Strategy 3: Murmur hash of the URL (deterministic).
* Pros: same URL always gets the same code; natural deduplication.
* Cons: hash collisions possible for different URLs.
* Cannot support custom short codes.
* Use when: deduplication is more important than custom codes.
*/
@Component
public class HashBasedGenerator {
public String generate(String originalUrl) {
long hash = Hashing.murmur3_128()
.hashString(originalUrl, StandardCharsets.UTF_8)
.asLong();
// Math.abs(Long.MIN_VALUE) is still negative — clear the sign bit instead
long positive = hash & Long.MAX_VALUE;
// encodeBase62 must pad to at least 7 chars (as in Strategy 1's generator),
// otherwise substring() throws for small hash values
return encodeBase62(positive).substring(0, 7);
}
}
The Redirect Service — Optimising the Hot Path
@Service
@RequiredArgsConstructor
@Slf4j
public class RedirectService {
private final UrlRepository urlRepository;
private final RedisTemplate<String, String> redisTemplate;
private final ClickEventPublisher clickEventPublisher;
private final MeterRegistry meterRegistry;
private static final String CACHE_PREFIX = "url:";
private static final Duration CACHE_TTL = Duration.ofDays(1);
private static final String DELETED_SENTINEL = "__DELETED__";
public String resolveUrl(String shortCode, HttpServletRequest request) {
String cacheKey = CACHE_PREFIX + shortCode;
// L1: Redis cache hit
String cached = redisTemplate.opsForValue().get(cacheKey);
if (cached != null) {
if (DELETED_SENTINEL.equals(cached)) {
// Negative cache entry — URL was deleted; avoid DB lookup
throw new ShortUrlNotFoundException(shortCode);
}
meterRegistry.counter("redirect.cache.hit").increment();
publishClickAsync(shortCode, request, cached);
return cached;
}
// L2: Database lookup (cache miss)
meterRegistry.counter("redirect.cache.miss").increment();
return urlRepository.findByShortCode(shortCode)
.map(url -> {
// Check if URL is expired
if (url.isExpired()) {
redisTemplate.opsForValue().set(cacheKey, DELETED_SENTINEL,
Duration.ofMinutes(5)); // short TTL for expired entries
throw new ShortUrlExpiredException(shortCode);
}
// Populate cache for future requests
redisTemplate.opsForValue().set(cacheKey, url.getOriginalUrl(), CACHE_TTL);
publishClickAsync(shortCode, request, url.getOriginalUrl());
return url.getOriginalUrl();
})
.orElseGet(() -> {
// Cache negative result to prevent DB hammering on non-existent codes
redisTemplate.opsForValue().set(cacheKey, DELETED_SENTINEL,
Duration.ofMinutes(5));
throw new ShortUrlNotFoundException(shortCode);
});
}
// Publish click event asynchronously — never block the redirect on analytics.
// Caveat: @Async is proxy-based. Calling this method from resolveUrl on
// `this` bypasses the proxy, so it would run synchronously. In production,
// move this method to a separate Spring bean (or use AspectJ weaving).
@Async("clickEventExecutor")
public void publishClickAsync(String shortCode, HttpServletRequest request,
String destination) {
try {
clickEventPublisher.publish(ClickEvent.builder()
.shortCode(shortCode)
.destination(destination)
.ipAddress(extractClientIp(request))
.userAgent(request.getHeader("User-Agent"))
.referer(request.getHeader("Referer"))
.timestamp(Instant.now())
.build());
} catch (Exception e) {
log.warn("Failed to publish click event for shortCode={}", shortCode, e);
// Non-critical: analytics loss is acceptable; redirect must always succeed
}
}
}
@RestController
@RequiredArgsConstructor
public class RedirectController {
private final RedirectService redirectService;
@GetMapping("/{shortCode}")
public ResponseEntity<Void> redirect(@PathVariable String shortCode,
HttpServletRequest request) {
String destination = redirectService.resolveUrl(shortCode, request);
// 302 (FOUND), not 301 — a 301 is cached by browsers and kills click tracking
return ResponseEntity.status(HttpStatus.FOUND)
.header(HttpHeaders.LOCATION, destination)
// Cache-Control: no-store prevents the browser from caching the redirect,
// so repeat visits still hit the server and can be tracked
.header(HttpHeaders.CACHE_CONTROL, "no-store")
.build();
}
}
A 301 (Permanent) redirect is cached by browsers. Once cached, the browser goes directly to the destination — your server never sees the click again. Perfect for performance, terrible for analytics. A 302 (Temporary) redirect forces the browser to hit your server every time — great for analytics and A/B testing, but every redirect has server cost. Most URL shorteners use 302 for this reason. Use 301 only when you're certain the destination will never change and you don't need click tracking.
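The 301-vs-302 tradeoff described above collapses into a single decision point. A sketch (names are illustrative):

```java
// Encodes the 301-vs-302 choice: 301 only when the destination is truly
// immutable AND click analytics are not needed; otherwise 302.
public class RedirectPolicy {
    public static final int MOVED_PERMANENTLY = 301; // browser caches; no analytics
    public static final int FOUND = 302;             // every click hits the server

    public static int statusFor(boolean needClickTracking, boolean destinationImmutable) {
        if (destinationImmutable && !needClickTracking) {
            return MOVED_PERMANENTLY;
        }
        return FOUND;
    }
}
```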
Analytics Pipeline
@Component
@RequiredArgsConstructor
@Slf4j
public class ClickAnalyticsConsumer {
private final ClickEventRepository clickRepository;
private final GeoIpService geoIpService;
// Batch processing: collect events and insert in bulk every second
// 1 INSERT per click = DB bottleneck at scale
// Batch inserts = dramatically higher throughput
@KafkaListener(
topics = "url.clicks",
groupId = "analytics-consumer",
batch = "true", // Receive list of messages, not one at a time
containerFactory = "batchKafkaListenerContainerFactory"
)
public void processClickBatch(List<ClickEvent> events, Acknowledgment ack) {
try {
List<ClickRecord> records = events.stream()
.map(event -> ClickRecord.builder()
.shortCode(event.getShortCode())
.country(geoIpService.getCountry(event.getIpAddress()))
.browser(UserAgentParser.getBrowser(event.getUserAgent()))
.refererDomain(extractDomain(event.getReferer()))
.clickedAt(event.getTimestamp())
.build())
.toList();
// Batch insert — one DB roundtrip for up to 500 events
clickRepository.saveAll(records);
ack.acknowledge();
log.debug("Processed {} click events", events.size());
} catch (Exception e) {
log.error("Failed to process click batch of size {}", events.size(), e);
// Don't ack — retry the batch
}
}
}
Project 3: Multi-Channel Notification Platform
Every product eventually needs notifications. Email for receipts. SMS for OTPs. Push for alerts. Building this as a dedicated platform lets every other service just fire an event and forget — without knowing about email providers, SMS gateways, or device tokens. The platform handles reliability, retries, deduplication, and user preferences.
Architecture Overview
Channel adapters, each with its own concerns:
- Email: template rendering, bounce handling
- SMS: rate limiting, OTP handling
- Push: device token mgmt, silent vs alert notifications
The Notification Domain
// Every producer service publishes this common event shape
// The notification platform is the only consumer
@Data
@Builder
@JsonDeserialize(builder = NotificationRequest.NotificationRequestBuilder.class)
public class NotificationRequest {
private UUID requestId; // For idempotency — same ID = same notification
private UUID userId; // Recipient
private NotificationType type; // ORDER_CONFIRMED, OTP_SENT, PAYMENT_FAILED, etc.
private Map<String, String> templateVariables; // {orderId: "X", amount: "$50"}
private NotificationPriority priority; // HIGH (OTP), NORMAL (receipt), LOW (promo)
private Instant expiresAt; // Don't send expired OTPs
// Desired channels — platform will intersect with user preferences
private Set<Channel> requestedChannels;
// Deduplication window — don't send same notification type twice within this period
private Duration deduplicationWindow;
}
// Notification types drive template selection and channel defaults
public enum NotificationType {
ORDER_CONFIRMED(Set.of(EMAIL, PUSH), Duration.ofHours(24)),
PAYMENT_FAILED(Set.of(EMAIL, SMS, PUSH), Duration.ofHours(1)),
OTP_SENT(Set.of(SMS), Duration.ofMinutes(5)), // SMS only for OTPs
PROMOTIONAL(Set.of(PUSH), Duration.ofDays(1)),
ACCOUNT_SECURITY_ALERT(Set.of(EMAIL, SMS), Duration.ZERO); // No dedup
private final Set<Channel> defaultChannels;
private final Duration defaultDeduplicationWindow;
NotificationType(Set<Channel> defaultChannels, Duration defaultDeduplicationWindow) {
this.defaultChannels = defaultChannels;
this.defaultDeduplicationWindow = defaultDeduplicationWindow;
}
}
Retry Logic with Exponential Backoff
External notification providers fail. The email gateway returns 503. SMS gateway throttles you. The push service times out. Proper retry logic with exponential backoff and a dead-letter topic is non-negotiable in a production notification system.
@Component
@RequiredArgsConstructor
@Slf4j
public class EmailNotificationSender {
private final SendGridClient sendGridClient;
private final TemplateRenderer templateRenderer;
private final NotificationRepository notificationRepository;
private final DeadLetterPublisher deadLetterPublisher; // used in the fallback below
// Resilience4j Retry: 3 attempts, exponential backoff 1s → 2s → 4s
@Retry(name = "email-sender", fallbackMethod = "handleEmailFailure")
@CircuitBreaker(name = "sendgrid", fallbackMethod = "handleEmailFailure")
public void send(NotificationRecord notification) {
String htmlBody = templateRenderer.render(
notification.getType().getEmailTemplate(),
notification.getTemplateVariables()
);
SendGridEmail email = SendGridEmail.builder()
.to(notification.getRecipientEmail())
.subject(notification.getSubject())
.htmlContent(htmlBody)
.trackingId(notification.getId().toString()) // For bounce/open tracking
.build();
sendGridClient.send(email);
notification.markDelivered(Channel.EMAIL);
notificationRepository.save(notification);
log.info("Email sent notificationId={} to={}", notification.getId(),
notification.getRecipientEmail());
}
// Called after all retries exhausted OR circuit open
// Saves to DLT (dead-letter) for human review and manual retry
private void handleEmailFailure(NotificationRecord notification, Exception e) {
log.error("Email delivery permanently failed for notificationId={}: {}",
notification.getId(), e.getMessage());
notification.markFailed(Channel.EMAIL, e.getMessage());
notificationRepository.save(notification);
// Emit to dead-letter topic — ops team can inspect and retry
deadLetterPublisher.publish(DeadLetterEvent.builder()
.notificationId(notification.getId())
.channel(Channel.EMAIL)
.failureReason(e.getMessage())
.originalPayload(notification)
.build());
}
}
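The 1s → 2s → 4s schedule in the retry comment follows `delay(n) = initial × multiplier^(n-1)`, capped at a maximum. Resilience4j derives this from its interval-function configuration; the arithmetic itself is just (class name illustrative):

```java
import java.time.Duration;

// Exponential backoff: delay(n) = initial * multiplier^(n-1), capped at maxDelay.
public class BackoffSchedule {
    public static Duration delayForAttempt(int attempt, Duration initial,
                                           double multiplier, Duration maxDelay) {
        double millis = initial.toMillis() * Math.pow(multiplier, attempt - 1);
        long capped = (long) Math.min(millis, maxDelay.toMillis());
        return Duration.ofMillis(capped);
    }
}
```

With `initial = 1s`, `multiplier = 2.0`, this yields the 1s, 2s, 4s schedule for attempts 1–3; the cap matters for long retry chains, where uncapped exponential growth quickly becomes absurd.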
User Preference Management
@Service
@RequiredArgsConstructor
public class NotificationPreferenceService {
private final PreferenceRepository preferenceRepository;
// Preferences are read on every notification dispatch — cache them
@Cacheable(value = "user-prefs", key = "#userId")
public NotificationPreference getPreferences(UUID userId) {
return preferenceRepository.findByUserId(userId)
.orElse(NotificationPreference.defaults(userId));
}
public Set<Channel> resolveChannels(UUID userId, NotificationType type,
Set<Channel> requestedChannels) {
NotificationPreference prefs = getPreferences(userId);
// Intersect: requested channels ∩ user-enabled channels ∩ type defaults
return requestedChannels.stream()
.filter(channel -> prefs.isChannelEnabled(channel))
.filter(channel -> prefs.isTypeEnabled(type, channel))
.filter(channel -> !prefs.isInQuietHours(channel, Instant.now()))
.collect(Collectors.toSet());
}
// Quiet hours: no push notifications 10pm — 8am user's local time
// This is surprisingly important for user retention
@CacheEvict(value = "user-prefs", key = "#preference.userId")
public void updatePreferences(NotificationPreference preference) {
preferenceRepository.save(preference);
}
}
Deduplication — Never Spam Users
@Service
@RequiredArgsConstructor
public class DeduplicationService {
private final RedisTemplate<String, String> redisTemplate;
/**
* Check if a notification of this type was already sent to this user
* within the deduplication window. Uses Redis SET NX (set-if-not-exists)
* for atomic check-and-mark.
*/
public boolean isDuplicate(UUID userId, NotificationType type,
Duration window) {
if (window.isZero()) return false; // Types with ZERO window are never deduped
// Key is deterministic (userId + type); the TTL alone defines the window.
// Adding a time bucket here would break windows that span a bucket boundary.
String key = String.format("notif:dedup:%s:%s", userId, type.name());
Boolean isNew = redisTemplate.opsForValue()
.setIfAbsent(key, "1", window);
// If setIfAbsent returned false, key already existed = duplicate
return Boolean.FALSE.equals(isNew);
}
}
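The atomic check-and-mark that SET NX provides can be demonstrated without Redis: `ConcurrentHashMap.putIfAbsent` has the same first-caller-wins semantics (the TTL-based window expiry is omitted in this sketch; class name is illustrative):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// SET NX semantics in miniature: the first caller for a key marks it and is
// told "not a duplicate"; every later caller with the same key is a duplicate.
public class InMemoryDeduplicator {
    private final Map<String, String> seen = new ConcurrentHashMap<>();

    public boolean isDuplicate(String dedupKey) {
        // putIfAbsent is atomic — there is no check-then-set race window
        return seen.putIfAbsent(dedupKey, "1") != null;
    }
}
```

Because check and mark are one atomic operation, two concurrent deliveries of the same event cannot both pass the duplicate check.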
// The main notification dispatcher tying it all together
@Service
@RequiredArgsConstructor
@Slf4j
public class NotificationDispatcher {
private final DeduplicationService deduplicationService;
private final NotificationPreferenceService preferenceService;
private final Map<Channel, NotificationSender> senders; // built in a @Configuration bean, one sender per channel
@KafkaListener(topics = "notifications.dispatch", groupId = "notification-platform")
public void dispatch(NotificationRequest request, Acknowledgment ack) {
try {
// 1. Check expiry
if (request.getExpiresAt() != null &&
Instant.now().isAfter(request.getExpiresAt())) {
log.info("Skipping expired notification requestId={}", request.getRequestId());
ack.acknowledge();
return;
}
// 2. Deduplication check
if (deduplicationService.isDuplicate(
request.getUserId(), request.getType(),
request.getDeduplicationWindow())) {
log.info("Skipping duplicate notification requestId={}", request.getRequestId());
ack.acknowledge();
return;
}
// 3. Resolve channels against user preferences
Set<Channel> channels = preferenceService.resolveChannels(
request.getUserId(), request.getType(), request.getRequestedChannels());
if (channels.isEmpty()) {
log.info("No channels enabled for userId={} type={}",
request.getUserId(), request.getType());
ack.acknowledge();
return;
}
// 4. Dispatch to each enabled channel
NotificationRecord record = NotificationRecord.from(request);
channels.forEach(channel ->
senders.get(channel).send(record)
);
ack.acknowledge();
} catch (Exception e) {
log.error("Dispatch failed for requestId={}", request.getRequestId(), e);
// Don't ack — Kafka retries. After max attempts → DLT.
}
}
}
Kafka guarantees at-least-once delivery — meaning the same notification event might be consumed twice on failure and retry. Without deduplication, users receive two "Your order is confirmed" emails. The deduplication service using Redis SET NX solves this. But the deduplication key must be deterministic — based on the original request ID, not a randomly generated notification ID — so the same event always produces the same dedup key across retries.
Architectural Evolution Patterns
All three projects share the same evolution trajectory. Understanding this trajectory helps you make the right architectural decisions at each growth stage — neither over-engineering early nor under-engineering so badly you rewrite everything at scale.
0 → 1k users
- Single Spring Boot app
- PostgreSQL
- No cache
- Synchronous everywhere
- Deploy on VPS
1k → 100k
- Redis for caching
- Kafka for async
- Connection pooling
- Read replica
- Docker + CI/CD
100k → 1M
- Kubernetes HPA
- CDN for static
- DB sharding or PG partition
- Prometheus + Grafana
- Rate limiting
1M+ users
- Microservices by domain
- CQRS for read models
- Global distribution
- Event sourcing
- Multi-region DB
The most expensive architectural mistake in startup engineering: breaking into microservices at Stage 1. The team spends months building service discovery, distributed tracing, inter-service auth, and Kubernetes before they have 100 users. Then the business pivots, the domain changes, and all the service boundaries are wrong. Start as a well-structured monolith. Extract services when you have a specific scaling problem that the monolith cannot solve — not because the architecture "feels right" or because a conference talk told you to.
Interview Preparation
System design interviews at FAANG and top startups use problems identical to the projects in this section. The framework for answering any backend design question is the same: clarify requirements, estimate scale, design the data model, design the API, address the hard problems (consistency, availability, scalability), and evolve the architecture.
Q: How would you prevent overselling and double-charging in a checkout flow?
Use SELECT FOR UPDATE to atomically check and reserve inventory within a transaction — this prevents two concurrent requests from both seeing "5 in stock" and both succeeding. For very high throughput, move the inventory gate to Redis using atomic DECR operations, which are single-threaded and cannot race. Use idempotency keys (a UUID the client generates) so retries never double-charge. Make checkout async-friendly: place the order synchronously but trigger fulfilment, notification, and analytics via Kafka events so the HTTP response is fast. Use optimistic locking (@Version) for order status updates to detect concurrent modification.
Q: Why put business rules on the entity instead of the service?
Because order.confirm() validates that the order is in PENDING status, that invariant is enforced everywhere the method is called — in the API layer, in event handlers, in tests. If you set order.setStatus(CONFIRMED) directly in the service, each callsite must remember to validate the transition. This creates scattered validation logic that is easy to forget and hard to test comprehensively. Rich domain objects (sometimes called the "domain model" pattern) produce business rules that read like the business domain, are self-documenting, and fail fast with meaningful exceptions instead of corrupt state.
Q: Why a Money value object instead of a plain BigDecimal?
A bare BigDecimal amount field loses the currency. Without the currency, you cannot safely add two money amounts (you might add USD to EUR), cannot format for display, and cannot apply currency-specific rounding rules. A Money value object encapsulates amount and currency together, making invalid states unrepresentable — you literally cannot create a money value without specifying its currency. Value objects are immutable (no setters, return new instances), so they are inherently thread-safe. The add method validates currency match at the domain level. This makes the code more correct by construction and eliminates entire classes of currency bugs at compile time rather than runtime.
Q: How would you handle growing Kafka consumer lag?
Either scale out consumption up to the partition count (increase setConcurrency in the container factory) or optimise the processing logic (batch DB inserts, async external calls). If partition count is the bottleneck, increase the topic's partition count — but only the partitions assigned to your group matter, so you may need to rebalance. Add a circuit breaker around external calls in the consumer: if the external service is down, fail fast and let Kafka retry rather than holding the thread. Monitor consumer lag as a Prometheus metric — alert when lag exceeds a threshold that would cause unacceptable processing delay for your SLO.