====== Kubernetes Design & Code Review Checklist ======

===== 1. Architecture Review =====

==== Cluster Design ====

  * [ ] Multi-node cluster (avoid single point of failure)
  * [ ] Separate environments (dev/staging/prod)
  * [ ] Proper namespace strategy
  * [ ] ResourceQuota configured
  * [ ] LimitRange configured
  * [ ] RBAC enabled
  * [ ] Network Policies enabled
  * [ ] Audit logging enabled
  * [ ] High availability control plane

==== Namespace Design ====

  * [ ] One namespace per application/domain
  * [ ] Environment isolation
  * [ ] Consistent naming convention
  * [ ] Quotas per namespace

Example:

<code>
production-payment
production-order
staging-payment
staging-order
</code>

----

===== 2. Workload Review =====

==== Workload Type Selection ====

^ Requirement ^ Kubernetes Resource ^
| Stateless API | Deployment |
| Background Worker | Deployment |
| Database | StatefulSet |
| Cache | StatefulSet |
| Scheduled Task | CronJob |
| One-time Task | Job |
| Daemon on every node | DaemonSet |

Checklist:

  * [ ] Correct workload type selected
  * [ ] One responsibility per workload
  * [ ] Horizontal scaling supported

----

===== 3. Deployment Review =====

==== Deployment Configuration ====

  * [ ] replicas > 1 in production
  * [ ] RollingUpdate strategy used
  * [ ] revisionHistoryLimit configured
  * [ ] Proper labels
  * [ ] Proper selectors

<code yaml>
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
</code>

==== Labels ====

Required labels:

<code yaml>
labels:
  app: payment-api
  version: v1.2.0
  env: prod
  team: backend
</code>

Checklist:

  * [ ] app
  * [ ] version
  * [ ] env
  * [ ] team

----

===== 4. Container Review =====

==== Container Image ====

Checklist:

  * [ ] No latest tag
  * [ ] Immutable version tag
  * [ ] Trusted registry
  * [ ] Vulnerability scan performed

Bad:

<code yaml>
image: api:latest
</code>

Good:

<code yaml>
image: api:v1.3.5
</code>

==== Security Context ====

Checklist:

  * [ ] runAsNonRoot
  * [ ] readOnlyRootFilesystem
  * [ ] allowPrivilegeEscalation=false
  * [ ] capabilities dropped

<code yaml>
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
</code>

----

===== 5. Resource Management =====

==== CPU & Memory ====

Checklist:

  * [ ] CPU request
  * [ ] CPU limit
  * [ ] Memory request
  * [ ] Memory limit

<code yaml>
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
</code>

==== Autoscaling ====

Checklist:

  * [ ] HPA configured
  * [ ] minReplicas defined
  * [ ] maxReplicas defined
  * [ ] CPU target configured
  * [ ] Memory target configured

Example:

<code yaml>
minReplicas: 2
maxReplicas: 10
</code>

----

===== 6. Health Checks =====

==== Liveness Probe ====

Checklist:

  * [ ] Configured
  * [ ] Fast endpoint

<code yaml>
livenessProbe:
  httpGet:
    path: /health
    port: 8080
</code>

==== Readiness Probe ====

Checklist:

  * [ ] Configured
  * [ ] Service traffic blocked until ready

<code yaml>
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
</code>

==== Startup Probe ====

Checklist:

  * [ ] Configured for slow startup applications

----

===== 7. Networking Review =====

==== Service Review ====

Checklist:

  * [ ] ClusterIP for internal services
  * [ ] LoadBalancer only when required
  * [ ] NodePort avoided

==== Ingress Review ====

Checklist:

  * [ ] TLS enabled
  * [ ] HTTPS redirect enabled
  * [ ] Rate limiting configured
  * [ ] WAF considered

==== Network Policies ====

Checklist:

  * [ ] Default deny policy
  * [ ] Explicit allow rules
  * [ ] Namespace isolation

<code yaml>
kind: NetworkPolicy
</code>

----

===== 8. Storage Review =====

==== Persistent Volumes ====

Checklist:

  * [ ] Dynamic provisioning
  * [ ] StorageClass used
  * [ ] Backup strategy exists
  * [ ] Recovery tested

==== Stateful Applications ====

Checklist:

  * [ ] StatefulSet used
  * [ ] PVC attached
  * [ ] Data persistence verified

----

===== 9. Configuration Management =====

==== ConfigMap ====

Checklist:

  * [ ] Only non-sensitive data
  * [ ] Version controlled
  * [ ] Environment specific

==== Secret Management ====

Checklist:

  * [ ] No secrets in Git
  * [ ] No secrets in ConfigMap
  * [ ] Rotation process defined
  * [ ] External secret manager preferred

----

===== 10. Security Review =====

==== RBAC ====

Checklist:

  * [ ] Least privilege principle
  * [ ] Dedicated ServiceAccounts
  * [ ] No cluster-admin usage

Bad:

<code>
cluster-admin
</code>

Good:

<code>
Role
RoleBinding
</code>

==== Pod Security ====

Checklist:

  * [ ] Non-root containers
  * [ ] No privileged mode
  * [ ] Seccomp profile
  * [ ] AppArmor profile

==== Supply Chain Security ====

Checklist:

  * [ ] Image signing
  * [ ] SBOM generated
  * [ ] Vulnerability scanning

----

===== 11. Reliability Review =====

==== High Availability ====

Checklist:

  * [ ] Multiple replicas
  * [ ] Pod anti-affinity
  * [ ] Multi-zone deployment

<code yaml>
podAntiAffinity:
</code>

==== Pod Disruption Budget ====

Checklist:

  * [ ] PDB configured

<code yaml>
minAvailable: 1
</code>

==== Graceful Shutdown ====

Checklist:

  * [ ] SIGTERM handled
  * [ ] preStop hook configured
  * [ ] terminationGracePeriodSeconds set

----

===== 12. Observability Review =====

==== Logging ====

Checklist:

  * [ ] Centralized logging
  * [ ] Structured JSON logs
  * [ ] Correlation ID support

==== Metrics ====

Checklist:

  * [ ] CPU metrics
  * [ ] Memory metrics
  * [ ] Request metrics
  * [ ] Error metrics
  * [ ] Business metrics

==== Tracing ====

Checklist:

  * [ ] Distributed tracing enabled
  * [ ] Request correlation supported

----

===== 13. CI/CD Review =====

==== Deployment Pipeline ====

Checklist:

  * [ ] Automated build
  * [ ] Automated test
  * [ ] Automated deployment
  * [ ] Rollback support

==== GitOps ====

Checklist:

  * [ ] Git as source of truth
  * [ ] Pull-based deployment
  * [ ] Drift detection enabled

==== Deployment Strategies ====

Checklist:

  * [ ] Rolling deployment
  * [ ] Canary deployment
  * [ ] Blue-Green deployment

----

===== 14. Cost Optimization =====

Checklist:

  * [ ] Requests properly sized
  * [ ] HPA configured
  * [ ] Cluster Autoscaler configured
  * [ ] Spot instances evaluated
  * [ ] Unused resources removed

----

===== 15. Disaster Recovery =====

==== Backup ====

Checklist:

  * [ ] Database backup
  * [ ] Persistent volume backup
  * [ ] Secret backup
  * [ ] Configuration backup

==== Recovery ====

Checklist:

  * [ ] Restore procedure documented
  * [ ] Recovery tested regularly
  * [ ] Recovery Time Objective (RTO) defined
  * [ ] Recovery Point Objective (RPO) defined

----

===== 16. Production Readiness Scorecard =====

^ Category ^ Target Score ^
| Security | 9/10 |
| Reliability | 9/10 |
| Scalability | 9/10 |
| Observability | 9/10 |
| Maintainability | 9/10 |
| Cost Optimization | 8/10+ |
| Disaster Recovery | 8/10+ |

----

===== Final Production Review Questions =====

  - [ ] Will it survive a Pod crash?
  - [ ] Will it survive a Node crash?
  - [ ] Will it survive a Zone failure?
  - [ ] Can it scale automatically?
  - [ ] Can it be deployed with zero downtime?
  - [ ] Can it be rolled back safely?
  - [ ] Is it secure by default?
  - [ ] Is it observable?
  - [ ] Can another engineer maintain it?
  - [ ] Can it run at 3 AM without waking me up?

If all answers are YES, the Kubernetes platform/workload is considered Production Ready.
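
As a rough illustration of how several checklist items from sections 3 to 7 and 11 might combine in practice, the sketch below shows one possible set of manifests for the ''payment-api'' example. It is a minimal starting point under stated assumptions, not a reference configuration: the registry host ''registry.example.com'', the user ID ''10001'', the replica count, the grace period, and the startup probe thresholds are all hypothetical values chosen for illustration and should be adapted to the actual workload.

<code yaml>
# Illustrative sketch only. Registry host, UID, replica count, and probe
# thresholds are assumptions, not values taken from the checklist.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production-payment
  labels:
    app: payment-api
    version: v1.2.0
    env: prod
    team: backend
spec:
  replicas: 3                        # replicas > 1 in production
  revisionHistoryLimit: 5            # assumed value
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-api
  template:
    metadata:
      labels:
        app: payment-api
        version: v1.2.0
        env: prod
        team: backend
    spec:
      terminationGracePeriodSeconds: 30   # assumed value
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001                  # hypothetical non-root UID
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: payment-api
          # Hypothetical registry; pinned, immutable tag instead of "latest"
          image: registry.example.com/payment-api:v1.2.0
          ports:
            - containerPort: 8080
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
          startupProbe:                   # only needed for slow-starting apps
            httpGet:
              path: /health
              port: 8080
            failureThreshold: 30          # assumed: up to 30 x 5s to start
            periodSeconds: 5
---
# Keeps at least one Pod running during voluntary disruptions (e.g. node drain)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api
  namespace: production-payment
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: payment-api
---
# Default deny: selects every Pod in the namespace and allows no traffic;
# explicit allow rules must be added as separate NetworkPolicy objects.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production-payment
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
</code>

With the default-deny policy in place, the Pods receive no traffic (and cannot resolve DNS) until matching allow rules exist, so such a policy would normally be rolled out together with the explicit allow rules the checklist calls for. An HPA is left out of this sketch; when one manages the Deployment, ''spec.replicas'' is usually omitted so the two do not compete.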