SWE → AI Engineer

Full Stack System Architecture

15 min read · April 2026 · Free playbook

Every layer, every technology, who owns what

This is what a FAANG-level application actually looks like under the hood, and why building a real product is nothing like vibe-coding a dashboard.

The illusion: most founders think a web app is a frontend, a backend, and a database. That's like thinking a city is just buildings. A production-grade application has 18 distinct layers, hundreds of moving parts, and two distinct ownership tracks — product and engineering. Here is every layer, what it does, what runs it, and who is accountable for what.

The 18 layers

1
Client layer — frontend
Web app · Mobile app · Desktop · Browser extensions

The interface layer that renders in the browser or on a device. It handles UI state, user interactions, routing between pages, and communicating with backend services via API calls. At FAANG scale this is a complex, independently deployable application — not a collection of HTML files. Mobile apps (iOS and Android) are separate codebases with their own release cycles, app store review processes, and platform-specific constraints.

React · Next.js · Vue · Angular · TypeScript · React Native · Swift (iOS) · Kotlin (Android) · Electron (desktop) · Tailwind CSS · Webpack / Vite
PM owns
  • User flows and task completion paths
  • Information architecture decisions
  • What ships behind a feature flag
  • Accessibility requirements
  • Which platforms are prioritised
  • Copy, microcopy, and error state language
SWE owns
  • Framework and rendering strategy (SSR vs CSR vs SSG)
  • Component architecture and design system
  • State management approach
  • Bundle size optimisation and performance
  • Browser compatibility
  • Accessibility implementation
2
Edge network & CDN
Content delivery · DDoS protection · SSL termination · Edge compute

Static assets (JavaScript bundles, images, fonts, HTML) are cached at data centres around the world, close to users, so they load quickly wherever the user is. Edge functions let code run at these nodes without a round trip to the origin servers. DDoS protection and SSL/TLS termination also live here. This is the first layer a request reaches, before it ever touches your servers.

Cloudflare · AWS CloudFront · Akamai · Fastly · Vercel Edge · Cloudflare Workers · WAF · SSL/TLS
PM owns
  • Regional launch decisions
  • Performance SLA requirements by region
  • Content localisation requirements
SWE owns
  • CDN configuration and cache rules
  • Edge function logic
  • DDoS mitigation rules
  • WAF rule sets
  • SSL certificate management and renewal
3
API gateway & load balancer
Traffic routing · Rate limiting · Authentication handoff · Request throttling

Every request from the frontend passes through the API gateway before reaching any backend service. It enforces rate limits, routes requests to the correct service, validates tokens, and distributes traffic across multiple server instances so no single server gets overwhelmed. This is also where API versioning is enforced — v1 requests go to old services, v2 requests go to new ones.
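
Rate limiting at the gateway is often implemented as a token bucket. The sketch below is one minimal way to do it, not how any particular gateway works internally; the class name, the injectable `clock`, and the numbers in the usage note are illustrative assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `refill_per_sec`."""

    def __init__(
        self,
        capacity: float,
        refill_per_sec: float,
        clock: Callable[[], float] = time.monotonic,  # injectable so it can be tested
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity
        self.last_refill = clock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it should be rejected (HTTP 429)."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Top the bucket up for the time that has passed, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway would typically keep one bucket per API key or per pricing tier, for example `TokenBucket(capacity=100, refill_per_sec=10)` for a paid plan, and store the counters in a shared store such as Redis so every gateway instance sees the same state.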

Kong · AWS API Gateway · Nginx · HAProxy · AWS ALB · Envoy · Traefik
PM owns
  • Rate limit policy by pricing tier
  • Which endpoints are public vs. authenticated
  • API versioning strategy and deprecation timeline
  • Partner API access policies
SWE owns
  • Gateway configuration and routing rules
  • Load balancing algorithm
  • Rate limit enforcement implementation
  • Request/response transformation logic
  • Health check configuration
4
Authentication & authorisation
OAuth 2.0 · JWT · SSO · MFA · RBAC · Session management

Two distinct problems that get conflated. Authentication is proving identity — you are who you say you are. Authorisation is enforcing permissions — you are allowed to do what you're trying to do. At FAANG scale these are separate services. Auth includes social login, enterprise single sign-on (SSO), multi-factor authentication, session token management, and token refresh and revocation flows. Role-based access control (RBAC) defines what each user type can see and do.
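
To make the split concrete, here is a small sketch in which authentication and authorisation are two separate functions. The role names, permission strings, and the in-memory session table are hypothetical; a real system would verify a signed token or call an identity provider instead.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Hypothetical role-to-permission mapping; every product defines its own.
ROLE_PERMISSIONS = {
    "viewer": {"report:read"},
    "editor": {"report:read", "report:write"},
    "admin": {"report:read", "report:write", "user:manage"},
}


@dataclass(frozen=True)
class User:
    user_id: str
    roles: Tuple[str, ...]


def authenticate(token: str, sessions: Dict[str, User]) -> Optional[User]:
    """Authentication: map a presented session token to a known identity."""
    return sessions.get(token)


def is_authorised(user: User, permission: str) -> bool:
    """Authorisation: check whether any of the user's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)
```

Keeping the two functions apart is the point: a request can be fully authenticated and still be refused because the role does not carry the permission.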

Auth0 · AWS Cognito · Okta · Firebase Auth · Clerk · OAuth 2.0 · JWT · SAML (SSO) · Casbin (RBAC)
PM owns
  • Which login methods to support
  • MFA requirements by user tier
  • Permission model — who can do what by role
  • Session timeout and security policy
  • Enterprise SSO requirements
  • Account recovery flow UX decisions
SWE owns
  • Token generation, signing, and verification
  • RBAC/ABAC policy enforcement logic
  • OAuth flow implementation
  • Token refresh and revocation
  • Secure credential storage
  • Auth event audit logging
5
Backend application services
Business logic · REST · GraphQL · gRPC · Microservices · Monolith

This is where your product's core functionality lives. At an early stage this is often a monolith: one deployable application containing all business logic. At FAANG scale it is decomposed into dozens to hundreds of microservices, each owned by a separate team, independently deployable, communicating via internal APIs or a service mesh. Each microservice owns its own data store. This layer typically accounts for more code and engineering time than any other.

Node.js · Python / FastAPI · Django · Go · Java / Spring Boot · Ruby on Rails · REST APIs · GraphQL · gRPC · Istio
PM owns
  • What the API must do — requirements, not implementation
  • Business rules and edge case handling
  • Service boundaries — what is one product vs. another
  • API contract with third-party partners
  • Data retention and deletion requirements
SWE owns
  • Service architecture and decomposition
  • Language and framework selection
  • Internal API design and versioning
  • Service mesh configuration
  • Inter-service communication patterns
  • Data ownership per service
6
Async processing & message queues
Event streaming · Background jobs · Scheduled tasks · Webhooks

Not everything needs to happen synchronously in a request-response cycle. Sending an email, processing a video upload, running a nightly report, triggering a webhook — these happen asynchronously. A message queue decouples the service that produces work from the service that processes it. If a downstream service goes down, the queue holds work until it recovers. This is the difference between a resilient system and a brittle one.
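
The decoupling described above can be sketched with an in-memory queue that retries failed messages and parks repeat offenders in a dead-letter list. This is a toy stand-in for a broker such as SQS or RabbitMQ; the class and field names are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Message:
    message_id: str
    payload: dict = field(default_factory=dict)
    attempts: int = 0


class InMemoryQueue:
    def __init__(self, max_attempts: int = 3) -> None:
        self.pending = deque()   # work waiting for a consumer
        self.dead_letter = []    # messages that kept failing
        self.max_attempts = max_attempts

    def publish(self, message: Message) -> None:
        """The producer only enqueues; it never waits for the consumer."""
        self.pending.append(message)

    def drain(self, handler: Callable[[Message], None]) -> None:
        """Process pending messages, requeueing failures until max_attempts is hit."""
        while self.pending:
            message = self.pending.popleft()
            message.attempts += 1
            try:
                handler(message)
            except Exception:
                if message.attempts >= self.max_attempts:
                    self.dead_letter.append(message)  # give up, keep for inspection
                else:
                    self.pending.append(message)      # retry later
```

A real consumer would also wait between retries (exponential backoff) and make its handler idempotent, because a message can be delivered more than once.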

Apache Kafka · RabbitMQ · AWS SQS · AWS SNS · Google Pub/Sub · Celery · Sidekiq · Temporal · Cron jobs
PM owns
  • SLA for async jobs — how fast must email send?
  • Retry and failure behaviour policy
  • Which events trigger downstream workflows
  • Alerting requirements for failed jobs
SWE owns
  • Queue topology and topic design
  • Consumer group configuration
  • Retry and backoff logic
  • Dead-letter queue handling
  • Message schema and versioning
  • Idempotency guarantees
7
Caching layer
In-memory cache · Session store · Distributed cache · Query cache

Reading from a database on every request is slow and expensive. The caching layer stores the results of expensive operations — database queries, API calls, computed values — in memory so they can be retrieved in microseconds rather than milliseconds. Redis is the most widely used: it serves as a cache, session store, rate limiter, and pub/sub broker simultaneously. Cache misses fall back to the database. Cache invalidation — knowing when to expire cached data — is one of the genuinely hard problems in distributed systems.
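
A minimal sketch of the cache-aside pattern (check the cache, fall back to the database on a miss, expire entries after a TTL) looks roughly like this. It uses a plain dictionary where production code would talk to Redis or Memcached, and the class and method names are illustrative assumptions.

```python
import time
from typing import Any, Callable


class TTLCache:
    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock      # injectable so expiry can be tested
        self._entries = {}      # key -> (expires_at, value)
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        """Return the cached value, or call `loader` (the slow path) and cache the result."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        value = loader()
        self._entries[key] = (now + self.ttl_seconds, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key explicitly, e.g. right after the underlying row is updated."""
        self._entries.pop(key, None)
```

The TTL answers the "how stale is acceptable?" question, and `invalidate` covers the cases where a user action must be visible immediately. The `hits` and `misses` counters are the raw material for hit-ratio monitoring.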

Redis · Memcached · AWS ElastiCache · Dragonfly · Application-level cache
PM owns
  • Data freshness requirements — how stale is acceptable?
  • Performance targets that drive caching decisions
  • Cache invalidation on specific user actions
SWE owns
  • Cache architecture (write-through vs. cache-aside)
  • TTL configuration per data type
  • Cache invalidation strategy
  • Cache warming on deployment
  • Cache hit/miss ratio monitoring
8
Search layer
Full-text search · Faceted filters · Autocomplete · Vector / semantic search

Relational databases are not built for search. A search layer is a dedicated system that indexes content for fast, relevance-ranked retrieval — including typo tolerance, synonyms, faceted filtering, and autocomplete. AI has changed search fundamentally: vector databases enable semantic similarity search — finding results by meaning, not exact keywords — which powers recommendation systems, RAG pipelines, and AI-native search experiences.
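
Semantic search boils down to comparing vectors. The sketch below ranks documents by cosine similarity using a brute-force scan; the three-dimensional "embeddings" in the usage note are made up for illustration, whereas real ones come from an embedding model and a vector database uses an approximate index instead of scanning everything.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """1.0 means same direction, 0.0 means unrelated (or an all-zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, Sequence[float]], k: int = 2) -> List[str]:
    """Return the ids of the k documents whose vectors are closest to the query."""
    scored = [(cosine_similarity(query, vector), doc_id) for doc_id, vector in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

With a hypothetical index such as `{"refund-policy": [0.9, 0.1, 0.0], "shipping-times": [0.1, 0.9, 0.0], "careers": [0.0, 0.1, 0.9]}`, a query vector of `[0.8, 0.2, 0.0]` ranks `refund-policy` first even though no keyword was ever compared.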

Elasticsearch · OpenSearch · Algolia · Typesense · Pinecone · Weaviate · pgvector · Qdrant
PM owns
  • Search ranking priorities — recency vs. relevance vs. popularity
  • Which fields are searchable
  • Filter and facet requirements
  • Autocomplete behaviour
  • Zero-results state and fallback handling
SWE owns
  • Index schema design
  • Indexing pipeline — keeping search in sync with DB
  • Relevance tuning and scoring logic
  • Embedding generation for vector search
  • Query parsing and synonym mapping
9
Primary database layer
Relational · Document · Wide-column · Time-series · Graph

Where persistent data lives — the source of truth for your application. Database type is chosen based on data model and access patterns. Relational databases handle structured, transactional data with complex relationships. NoSQL handles flexible schemas, extreme scale, or access patterns that don't fit tables. FAANG-scale apps run multiple database types across different services. Replication (primary/replica) distributes read traffic. Sharding distributes write traffic across nodes. Connection pooling manages the finite number of database connections efficiently.
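
Replication and sharding are both routing decisions, and a rough sketch makes that visible. The connection names and the modulo-based shard choice below are simplifying assumptions; production systems usually prefer consistent hashing or a lookup table, because plain modulo reshuffles most keys when the shard count changes.

```python
import hashlib
import itertools
from typing import List


def shard_for(key: str, shard_count: int) -> int:
    """Map a key (for example a user id) to a stable shard number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ReplicaRouter:
    """Send writes to the primary and spread reads across replicas (round-robin)."""

    def __init__(self, primary: str, replicas: List[str]) -> None:
        self.primary = primary
        self._replicas = itertools.cycle(replicas) if replicas else None

    def route(self, is_write: bool) -> str:
        if is_write or self._replicas is None:
            return self.primary
        return next(self._replicas)
```

Reads served from a replica can lag slightly behind the primary, which is exactly why "is eventual consistency acceptable?" shows up as a product decision in the list that follows.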

PostgreSQL · MySQL · MongoDB · DynamoDB · Cassandra · CockroachDB · Google Spanner · ClickHouse · TimescaleDB · Neo4j · PgBouncer
PM owns
  • Data retention policy — how long to keep user data
  • Data deletion requirements — right to erasure
  • Data residency requirements — EU data in EU
  • Consistency requirements — is eventual consistency acceptable?
  • Compliance obligations that constrain data storage
SWE owns
  • Database selection per service
  • Schema design and migration strategy
  • Index strategy for query performance
  • Replication and failover configuration
  • Backup frequency and restore testing
  • Connection pooling configuration
10
File & object storage
Images · Videos · Documents · Exports · ML model weights

Databases store structured data. Object storage stores unstructured binary files — profile photos, uploaded PDFs, audio, video, CSV exports, trained ML model weights. At scale, raw files are processed before storage: images are resized and compressed into multiple format variants, videos are transcoded into streaming formats, documents are scanned for malicious content. All processed assets are then served through the CDN layer — never directly from the storage bucket.

AWS S3 · Google Cloud Storage · Azure Blob · Cloudflare R2 · AWS Lambda (processing) · FFmpeg · ImageMagick
PM owns
  • File type and size limits
  • Supported media formats
  • User storage quota policy per pricing tier
  • File retention and auto-deletion rules
  • User-facing file management UX
SWE owns
  • Storage bucket architecture and permissions
  • Upload pipeline — direct vs. server-side
  • Image and video processing pipeline
  • Virus and malware scanning integration
  • Lifecycle policies — auto-archiving and deletion
11
AI & ML layer
Model serving · Feature store · LLM APIs · Vector DB · Evaluation · A/B testing models

AI in production is not one thing — it is a stack. The full ML layer includes: data pipelines that build training datasets, a feature store that computes and serves ML features in real time, model training infrastructure, a model registry for version control, model serving infrastructure for inference, evaluation systems to measure model quality, and A/B testing frameworks to compare model versions. For LLM-based products, add prompt management, RAG pipelines, guardrails, and output evaluation. This is the most complex and fastest-changing layer in modern web applications.
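
For the LLM part of the stack, the moving pieces (retrieval, prompt assembly, model call, guardrail) can be sketched in a few functions. Everything here is a stand-in: the word-overlap retriever replaces a vector search, `model` is any callable you inject rather than a real vendor SDK, and the blocked-term list is a hypothetical guardrail policy.

```python
from typing import Callable, Dict, List

BLOCKED_TERMS = ("password", "credit card number")  # hypothetical guardrail policy
FALLBACK = "Sorry, I can't help with that."


def retrieve(question: str, documents: Dict[str, str], k: int = 1) -> List[str]:
    """Rank documents by shared words with the question (stand-in for vector search)."""
    question_words = set(question.lower().split())

    def overlap(doc_id: str) -> int:
        return len(question_words & set(documents[doc_id].lower().split()))

    ranked = sorted(documents, key=overlap, reverse=True)
    return [documents[doc_id] for doc_id in ranked[:k]]


def build_prompt(question: str, context: List[str]) -> str:
    """Prompt management in miniature: one template, filled with retrieved context."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


def answer(question: str, documents: Dict[str, str], model: Callable[[str], str]) -> str:
    """Retrieve, prompt, call the model, then apply an output guardrail."""
    prompt = build_prompt(question, retrieve(question, documents))
    output = model(prompt)
    if any(term in output.lower() for term in BLOCKED_TERMS):
        return FALLBACK
    return output
```

Because the model is injected, the same pipeline can be run against a stub in tests and against two real model versions in an A/B comparison.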

OpenAI API · Anthropic API · AWS SageMaker · Vertex AI · TorchServe · MLflow · Feast · Pinecone · LangChain · Weights & Biases · Guardrails AI · LlamaIndex
PM owns
  • Which problems AI is solving — build vs. no-build decision
  • Quality bar and acceptable failure rate
  • Latency requirements for inference
  • User feedback mechanisms for model improvement
  • Which model version is live — release decision
  • Guardrail requirements — what outputs are unacceptable
  • Cost-per-inference trade-off decisions
SWE owns
  • Model training and retraining pipelines
  • Feature engineering and feature store
  • Model serving infrastructure and latency optimisation
  • Prompt engineering and RAG architecture
  • Evaluation framework and benchmarks
  • A/B infrastructure for model comparison
  • Model monitoring and drift detection
12
Notification & communication layer
Email · Push notifications · SMS · In-app · Webhooks

A dedicated notification system handles all outbound communication across every channel. At scale this is not simply calling an email API from the backend. It includes channel routing, user preference management, frequency capping (preventing spam), delivery tracking, bounce and unsubscribe management, and regulatory compliance with CAN-SPAM and GDPR. Webhook delivery to third-party systems lives here with retry logic, signature verification, and delivery logs.
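
Webhook signature verification usually means an HMAC over the raw request body with a secret shared between sender and receiver. The sketch below shows the core of it; the function names are illustrative, and real providers each define their own header names and often sign a timestamp as well to block replays.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, payload: bytes) -> str:
    """Sender side: compute a hex HMAC-SHA256 signature over the raw body."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, payload: bytes, signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = sign_payload(secret, payload)
    return hmac.compare_digest(expected, signature)
```

The receiver must verify against the exact bytes it received, before any JSON parsing or re-serialisation, otherwise a harmless change in whitespace or key order breaks the signature.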

SendGrid · AWS SES · Twilio (SMS) · Firebase FCM · APNs (iOS) · Courier · Customer.io · Knock
PM owns
  • Notification triggers — what events send what messages
  • Copy and content for all notification types
  • Frequency capping and batching rules
  • User preference controls — what can be disabled
  • Transactional vs. marketing classification
  • Regulatory compliance requirements
SWE owns
  • Notification service architecture
  • Channel routing logic implementation
  • Delivery tracking and retry
  • Template rendering engine
  • Unsubscribe and preference enforcement
  • Webhook delivery and signature verification
13
Payment & billing layer
Payment processing · Subscriptions · Invoicing · Tax calculation · Fraud detection

Payment infrastructure is far more complex than dropping in a checkout form. A production billing layer handles card processing with 3DS authentication, subscription lifecycle management including trials and failed payment recovery, invoice generation, global tax calculation across jurisdictions, refund logic, and fraud detection. To keep PCI DSS scope small, raw card data should never touch your servers: the payment processor collects and tokenises it, and your systems store only the token. A failed payment recovery system (dunning) alone is often credited with recovering 20-30% of involuntary churn.

Stripe · Braintree · Adyen · Stripe Billing · Chargebee · Lago · Avalara (tax) · Stripe Radar (fraud)
PM owns
  • Pricing model — per-seat, usage-based, flat-rate
  • Trial length and conversion strategy
  • Dunning logic — how to recover failed payments
  • Refund policy
  • Which currencies and payment methods to support
  • Plan upgrade / downgrade rules and proration
SWE owns
  • Payment processor integration and webhook handling
  • Subscription state machine implementation
  • PCI DSS compliance scope reduction
  • Tax calculation integration
  • Idempotent payment operations
  • Invoice generation and delivery
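
"Idempotent payment operations" in the list above deserves a concrete picture: if a client retries a charge after a timeout, the customer must not be billed twice. Below is a minimal sketch using an idempotency key. The `PaymentService` class and its in-memory store are hypothetical; a real integration would pass the key to the payment processor and persist results in a database.

```python
import uuid
from typing import Dict


class PaymentService:
    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key: str, customer_id: str, amount_cents: int) -> dict:
        """Charge once per key; a retry with the same key returns the stored result."""
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1  # stands in for the real call to the processor
        result = {
            "charge_id": f"ch_{uuid.uuid4().hex[:12]}",
            "customer_id": customer_id,
            "amount_cents": amount_cents,
            "status": "succeeded",
        }
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per user action (for example one key per click on "Pay") and reuses it for every retry of that action. A stricter version would also reject a reused key whose amount or customer differs from the original request.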
14
Feature management & experimentation
Feature flags · A/B testing · Canary releases · Gradual rollouts

At FAANG scale, features are never switched on for everyone at once. Feature flags allow code to exist in production while only activating for specific users — internal testers, a 1% canary group, a specific market, a specific pricing tier. A/B testing infrastructure runs controlled experiments to measure whether a change improves a target metric. The most sophisticated systems run hundreds of simultaneous experiments with statistical rigour built into the platform. Engineers at Meta and Google deploy to production dozens of times per day behind flags.
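
A percentage rollout needs every user to land in the same bucket on every request. One common way to get that, sketched here under the assumption of a simple hash-based scheme, is to hash the flag key together with the user id; the function names and the allowlist parameter are illustrative, not any vendor's SDK.

```python
import hashlib
from typing import FrozenSet


def bucket(user_id: str, flag_key: str) -> float:
    """Map (flag, user) to a stable number in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def is_enabled(
    user_id: str,
    flag_key: str,
    rollout_percent: float,
    allowlist: FrozenSet[str] = frozenset(),
) -> bool:
    """Allowlisted users (e.g. internal testers) always get the feature."""
    if user_id in allowlist:
        return True
    return bucket(user_id, flag_key) < rollout_percent
```

Because the bucket depends only on the flag and the user, raising the rollout from 1% to 10% keeps the original 1% enabled and adds new users on top, and including the flag key in the hash stops the same users from being the guinea pigs for every feature.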

LaunchDarkly · Statsig · Optimizely · Unleash · GrowthBook · Eppo · Internal experimentation platforms
PM owns
  • Which features launch behind a flag
  • Rollout plan and targeting criteria
  • Experiment hypothesis and primary success metrics
  • Minimum detectable effect and required sample size
  • Launch / no-launch decision after experiment reads
  • Kill switch criteria if metrics degrade
SWE owns
  • Flag SDK integration across the codebase
  • User bucketing and assignment logic
  • Experiment event logging and tracking
  • Statistical analysis infrastructure
  • Flag lifecycle management and code cleanup
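
"Minimum detectable effect and required sample size" is a calculation, not a guess. The sketch below uses the standard normal-approximation formula for comparing two conversion rates with a two-sided test; the function name and the defaults (5% significance, 80% power) are conventional assumptions, and an experimentation platform may use a different method.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.8,
) -> int:
    """Approximate users needed in EACH variant of a two-proportion z-test.

    minimum_detectable_effect is absolute, e.g. 0.02 for a lift from 10% to 12%.
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)
```

For a 10% baseline and a 2-point lift this comes out at a little under 4,000 users per variant. Halving the effect you want to detect roughly quadruples the sample, which is why the choice of minimum detectable effect is a product decision with real cost.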
15
Data, analytics & warehouse
Event tracking · ETL/ELT pipelines · Data warehouse · BI tools · Real-time analytics

The operational database is optimised for transactions, not analysis. The data layer is a separate system optimised for analytical queries across the full history of your data. Events from the application are streamed into a data warehouse via ETL/ELT pipelines. BI tools query the warehouse to produce dashboards. Data scientists run analyses that drive product decisions. At FAANG scale this is a separate engineering discipline — Data Engineering — with its own teams, tooling, and infrastructure entirely separate from product engineering.

Snowflake · BigQuery · Redshift · dbt · Fivetran · Airbyte · Segment · Amplitude · Mixpanel · Apache Kafka · Apache Flink · Looker
PM owns
  • What events to track and what properties to capture
  • Metrics definitions and north star metric
  • Dashboard and reporting requirements
  • Event taxonomy — naming conventions
  • Experiment analysis requirements
  • Which third-party analytics tools to use
SWE owns
  • Event instrumentation in application code
  • ETL/ELT pipeline architecture
  • Data warehouse schema design
  • dbt transformation models
  • Data quality monitoring and alerting
  • PII handling, masking, and access controls
16
Infrastructure, DevOps & CI/CD
Cloud · Kubernetes · Terraform · Docker · GitHub Actions · Secrets management

Infrastructure is the platform all other layers run on. At FAANG scale it is entirely code-defined: every server, network rule, database instance, and permission is declared in version-controlled configuration files, not clicked into existence in a web console. Kubernetes orchestrates containers across fleets of machines. CI/CD pipelines automatically test, build, and deploy code changes many times per day; across an organisation the size of Meta or Google, that adds up to thousands of production deployments daily.

AWS / GCP / Azure · Kubernetes · Docker · Terraform · Pulumi · GitHub Actions · CircleCI · ArgoCD · Helm · HashiCorp Vault · AWS Secrets Manager
PM owns
  • Uptime SLA requirements — 99.9% vs. 99.99%
  • Regional deployment requirements — data residency
  • Disaster recovery objectives — RTO and RPO
  • Compliance certifications that constrain infrastructure
SWE owns
  • Cloud architecture and multi-region strategy
  • Container orchestration and autoscaling rules
  • Infrastructure as Code
  • CI/CD pipeline design and maintenance
  • Secrets management and rotation
  • Infrastructure cost optimisation
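
The gap between the 99.9% and 99.99% uptime targets in the PM list above is easier to judge as minutes of allowed downtime. A quick calculation, with a hypothetical helper name and a 365-day year assumed:

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float = 365) -> float:
    """Minutes of downtime permitted over the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


for sla in (99.9, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

That is roughly 525.6 minutes (about 8.8 hours) per year at 99.9% versus about 52.6 minutes at 99.99%. The extra nine removes around eight hours of slack, which is typically what forces multi-region deployment and automated failover.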
17
Observability & monitoring
Logging · Metrics · Distributed tracing · Alerting · Error tracking · On-call

You cannot manage what you cannot see. Observability is the ability to understand the internal state of a system from its external outputs. The three pillars are logs (a record of what happened), metrics (measurements of system behaviour over time), and traces (the path of a single request across every service it touches). At FAANG scale this is a 24/7 engineering discipline: on-call rotations, automated alerting with runbooks, and blameless post-mortems for every significant outage. Without this layer you are flying blind in production.
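
Logs and traces meet in one small habit: every log line is structured, and every line written while handling a request carries the same trace id. The sketch below shows the idea with plain JSON; the field names are assumptions, and in practice a library such as OpenTelemetry generates and propagates the ids for you.

```python
import json
import time
import uuid


def new_trace_id() -> str:
    """Created once at the edge, then passed along with the request."""
    return uuid.uuid4().hex


def log_event(service: str, trace_id: str, message: str, **fields) -> str:
    """Emit one JSON log line; extra keyword arguments become searchable fields."""
    record = {
        "ts": time.time(),
        "service": service,
        "trace_id": trace_id,
        "message": message,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line
```

If the gateway logs `log_event("gateway", trace_id, "request received", path="/checkout")` and the payments service logs with the same `trace_id`, one search on that id reconstructs the request's journey across both services.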

Datadog · Prometheus · Grafana · ELK Stack · OpenTelemetry · Jaeger · Sentry · PagerDuty · OpsGenie · AWS CloudWatch
PM owns
  • SLA definitions that drive alert thresholds
  • User-facing error rate tolerances
  • Incident communication requirements — status page
  • Business-level monitoring — conversion rate drops, payment failures
SWE owns
  • Logging strategy and structured log format
  • Metrics instrumentation across all services
  • Distributed tracing setup
  • Alert thresholds and on-call rotation
  • Runbook creation and incident response process
  • Post-mortem process and follow-through
18
Security layer
WAF · SIEM · Vulnerability scanning · Penetration testing · Compliance · Secrets rotation

Security is a posture applied across every layer above, but it also has dedicated infrastructure. A Web Application Firewall blocks malicious traffic patterns. A SIEM aggregates security events and detects anomalies. Vulnerability scanners continuously check code dependencies and infrastructure for known vulnerabilities. Penetration testing finds weaknesses before attackers do. Compliance frameworks and regulations (SOC 2, ISO 27001, GDPR, HIPAA) impose specific controls across the entire stack, and in B2B, attestations such as SOC 2 are often a hard requirement to close enterprise deals.

Cloudflare WAF · AWS Shield · Snyk · OWASP ZAP · Splunk (SIEM) · AWS Security Hub · Dependabot · Vanta · Drata · Burp Suite
PM owns
  • Compliance certifications required for target market
  • Data classification policy — what is PII, what is sensitive
  • Security incident user notification policy
  • Penetration test scheduling and scope sign-off
  • Bug bounty programme decisions
SWE owns
  • OWASP Top 10 remediation across the stack
  • Dependency vulnerability monitoring and patching
  • Secret scanning in CI/CD pipelines
  • WAF rule configuration
  • Encryption at rest and in transit
  • Security audit log implementation

The ownership summary

The PM defines what the system must do and why. The SWE decides how it's built and keeps it running. These boundaries are not rigid — great PMs have technical depth, and great engineers think about product. But accountability is clear, and confusion about it is expensive.

What PMs are accountable for — across all 18 layers

  • Requirements, user stories, and acceptance criteria
  • Policy decisions — rate limits, retention, deletion
  • SLA and performance requirements
  • Compliance and regulatory obligations
  • What events to track and how to define metrics
  • Feature flag rollout strategy and launch decisions
  • Experiment design — hypothesis, metrics, sample size
  • Launch / no-launch decisions after data reads
  • Pricing and billing product logic
  • Notification content, triggers, and frequency policy
  • API contract with external partners
  • Security incident communication
  • Which AI model versions ship to production
  • AI guardrail requirements and quality bar
  • Which platforms and markets to support
  • Copy, microcopy, and error state language

What SWEs are accountable for — across all 18 layers

  • Architecture and technology selection at every layer
  • System design — scalability, reliability, fault tolerance
  • Database schema, migrations, and query performance
  • CI/CD pipeline and deployment automation
  • Infrastructure as Code and cloud configuration
  • Security implementation — OWASP, encryption, secrets
  • Caching strategy and cache invalidation
  • Async job architecture and retry logic
  • Observability — logging, metrics, tracing, alerting
  • On-call rotation and incident response
  • ML model training, serving, and monitoring
  • Data pipeline and warehouse architecture
  • Cost optimisation across all infrastructure
  • Technical debt management and refactoring
  • Dependency security and patching
  • Uptime and disaster recovery implementation

The single most important thing to understand

Every layer above exists in a production app — even one with 1,000 users. The difference between a startup and FAANG is scale, redundancy, and team size — not the presence or absence of these layers. A solo engineer builds a simplified version of all 18. A FAANG team of 500 engineers builds a more sophisticated version of the same 18. If you are building a startup and your system has no observability, no caching, no async processing, and no CI/CD — those aren't features you haven't got to yet. They are technical debt accumulating interest every day. Vibe coding gets you a demo. Architecture gets you a product.
