Kubernetes Design & Code Review Checklist
1. Architecture Review
Cluster Design
- [ ] Multi-node cluster (avoid single point of failure)
- [ ] Separate environments (dev/staging/prod)
- [ ] Proper namespace strategy
- [ ] ResourceQuota configured
- [ ] LimitRange configured
- [ ] RBAC enabled
- [ ] Network Policies enabled
- [ ] Audit logging enabled
- [ ] High availability control plane
Namespace Design
- [ ] One namespace per application/domain
- [ ] Environment isolation
- [ ] Consistent naming convention
- [ ] Quotas per namespace
Example:
production-payment
production-order
staging-payment
staging-order
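As a rough sketch of how the quota and limit items above might look for one of these namespaces, the manifests below create `production-payment` with a ResourceQuota and a LimitRange. The object names, the `env` label, and all numeric values are illustrative assumptions, not recommendations from this checklist.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production-payment
  labels:
    env: prod                 # assumed label; any consistent scheme works
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: payment-quota         # hypothetical name
  namespace: production-payment
spec:
  hard:
    requests.cpu: "4"         # total CPU requests allowed in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: payment-defaults      # hypothetical name
  namespace: production-payment
spec:
  limits:
    - type: Container
      defaultRequest:         # applied when a container omits requests
        cpu: 250m
        memory: 256Mi
      default:                # applied when a container omits limits
        cpu: 500m
        memory: 512Mi
```

Once a ResourceQuota restricts CPU or memory, Pods that declare no requests or limits for those resources are rejected unless a LimitRange supplies defaults, which is why the two are usually configured together.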
2. Workload Review
Workload Type Selection
| Requirement | Kubernetes Resource |
|---|---|
| Stateless API | Deployment |
| Background Worker | Deployment |
| Database | StatefulSet |
| Cache | StatefulSet |
| Scheduled Task | CronJob |
| One-time Task | Job |
| Daemon on every node | DaemonSet |
Checklist:
- [ ] Correct workload type selected
- [ ] One responsibility per workload
- [ ] Horizontal scaling supported
3. Deployment Review
Deployment Configuration
- [ ] replicas > 1 in production
- [ ] RollingUpdate strategy used
- [ ] revisionHistoryLimit configured
- [ ] Proper labels
- [ ] Proper selectors
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1
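To show where the strategy fragment sits, here is a minimal Deployment sketch covering the items in this checklist. The name `payment-api`, the replica count, the `revisionHistoryLimit` value, and the container port are assumptions chosen for illustration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  labels:
    app: payment-api
spec:
  replicas: 3                  # more than 1 for production
  revisionHistoryLimit: 5      # old ReplicaSets kept for rollback
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-api         # must match the Pod template labels
  template:
    metadata:
      labels:
        app: payment-api
        version: v1.3.5        # informational only, not part of the selector
    spec:
      containers:
        - name: api
          image: api:v1.3.5
          ports:
            - containerPort: 8080
```

The selector is limited to the stable `app` label because a Deployment's selector cannot be changed after creation; labels that change per release, such as `version`, belong only on the Pod template.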
Labels
Required labels:
labels:
  app: payment-api
  version: v1.2.0
  env: prod
  team: backend
Checklist:
- [ ] app
- [ ] version
- [ ] env
- [ ] team
4. Container Review
Container Image
Checklist:
- [ ] No latest tag
- [ ] Immutable version tag
- [ ] Trusted registry
- [ ] Vulnerability scan performed
Bad:
image: api:latest
Good:
image: api:v1.3.5
Security Context
Checklist:
- [ ] runAsNonRoot
- [ ] readOnlyRootFilesystem
- [ ] allowPrivilegeEscalation=false
- [ ] capabilities dropped
securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
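The fragment above does not cover the "capabilities dropped" item. One possible container-level configuration is sketched below; the Pod name, the container name, and the writable `/tmp` volume are assumptions added for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-api            # hypothetical name
spec:
  containers:
    - name: api
      image: api:v1.3.5
      securityContext:
        runAsNonRoot: true
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]        # add back individual capabilities only if needed
      volumeMounts:
        - name: tmp
          mountPath: /tmp      # writable scratch space despite read-only root
  volumes:
    - name: tmp
      emptyDir: {}
```

Applications that write temporary files usually need a writable volume like the `emptyDir` shown here once the root filesystem is read-only.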
5. Resource Management
CPU & Memory
Checklist:
- [ ] CPU request
- [ ] CPU limit
- [ ] Memory request
- [ ] Memory limit
resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
Autoscaling
Checklist:
- [ ] HPA configured
- [ ] minReplicas defined
- [ ] maxReplicas defined
- [ ] CPU target configured
- [ ] Memory target configured
Example:
minReplicas: 2
maxReplicas: 10
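The two fields above belong to a HorizontalPodAutoscaler. A minimal sketch using the `autoscaling/v2` API follows; the target Deployment name and the 70% and 80% utilization targets are assumptions, not values taken from this checklist.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # percent of the memory request
```

Utilization targets are computed against container requests, so the HPA only works as intended when CPU and memory requests are set, and the cluster must expose resource metrics (typically through metrics-server).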
6. Health Checks
Liveness Probe
Checklist:
- [ ] Configured
- [ ] Fast endpoint
livenessProbe:
  httpGet:
    path: /health
    port: 8080
Readiness Probe
Checklist:
- [ ] Configured
- [ ] Service traffic blocked until ready
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
Startup Probe
Checklist:
- [ ] Configured for slow startup applications
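A startup probe could be sketched as follows, as a container-level fragment. It reuses the `/health` endpoint and port from the liveness example; the period and threshold values are assumptions that would need tuning per application.

```yaml
# Fragment of a container spec, alongside livenessProbe and readinessProbe
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 30     # allows up to 30 * 5s = 150s for startup
```

Liveness and readiness probes do not run until the startup probe has succeeded, so a slow-starting application is not restarted while it is still initializing.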
7. Networking Review
Service Review
Checklist:
- [ ] ClusterIP for internal services
- [ ] LoadBalancer only when required
- [ ] NodePort avoided
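For an internal service, a ClusterIP Service might look like the sketch below. The name, selector label, and port numbers are assumptions consistent with the other examples on this page.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  type: ClusterIP            # the default; reachable only inside the cluster
  selector:
    app: payment-api         # must match the Pod labels
  ports:
    - name: http
      port: 80               # port exposed by the Service
      targetPort: 8080       # port the container listens on
```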
Ingress Review
Checklist:
- [ ] TLS enabled
- [ ] HTTPS redirect enabled
- [ ] Rate limiting configured
- [ ] WAF considered
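A hedged example of an Ingress with TLS is shown below. The `tls` and `rules` fields are part of the standard `networking.k8s.io/v1` API, but the two annotations are specific to the ingress-nginx controller and are only an assumption about which controller is in use; other controllers configure redirects and rate limiting differently. The hostname and Secret name are placeholders.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api
  annotations:
    # ingress-nginx specific; adjust for other controllers
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rps: "20"
spec:
  ingressClassName: nginx          # assumed IngressClass name
  tls:
    - hosts:
        - payments.example.com     # placeholder hostname
      secretName: payments-tls     # hypothetical TLS Secret
  rules:
    - host: payments.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-api
                port:
                  number: 80
```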
Network Policies
Checklist:
- [ ] Default deny policy
- [ ] Explicit allow rules
- [ ] Namespace isolation
kind: NetworkPolicy
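The `kind: NetworkPolicy` fragment can be expanded into a default-deny policy plus one explicit allow rule, as sketched here. The namespace names follow the naming example earlier on this page; the policy names, the `app` label, and the port are assumptions.

```yaml
# Deny all ingress and egress for every Pod in the namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production-payment
spec:
  podSelector: {}              # empty selector matches all Pods
  policyTypes:
    - Ingress
    - Egress
---
# Allow the order namespace to reach payment-api on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-payment
  namespace: production-payment
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production-order
      ports:
        - protocol: TCP
          port: 8080
```

Denying all egress also blocks DNS lookups, so a real setup would normally add an egress rule for the cluster DNS service. Policies are enforced only when the cluster's network plugin supports NetworkPolicy.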
8. Storage Review
Persistent Volumes
Checklist:
- [ ] Dynamic provisioning
- [ ] StorageClass used
- [ ] Backup strategy exists
- [ ] Recovery tested
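Dynamic provisioning is typically requested through a PersistentVolumeClaim that names a StorageClass. In this sketch the claim name, the `fast-ssd` class, and the size are hypothetical.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: payment-data           # hypothetical name
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-ssd   # hypothetical StorageClass; must exist in the cluster
  resources:
    requests:
      storage: 20Gi
```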
Stateful Applications
Checklist:
- [ ] StatefulSet used
- [ ] PVC attached
- [ ] Data persistence verified
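For stateful workloads, the PVC is usually declared through `volumeClaimTemplates` so that each replica gets its own volume. The sketch below assumes a headless Service named `payment-db` already exists; the image, mount path, and StorageClass are placeholders.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payment-db
spec:
  serviceName: payment-db          # headless Service, assumed to exist
  replicas: 3
  selector:
    matchLabels:
      app: payment-db
  template:
    metadata:
      labels:
        app: payment-db
    spec:
      containers:
        - name: db
          image: registry.example.com/payment-db:v1.0.0   # placeholder image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data                    # placeholder path
  volumeClaimTemplates:
    - metadata:
        name: data                 # one PVC per replica: data-payment-db-0, ...
      spec:
        accessModes:
          - ReadWriteOnce
        storageClassName: fast-ssd # hypothetical StorageClass
        resources:
          requests:
            storage: 20Gi
```

The StatefulSet only gives each replica a stable identity and its own volume. Replicating data between replicas remains the responsibility of the database itself.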
9. Configuration Management
ConfigMap
Checklist:
- [ ] Only non-sensitive data
- [ ] Version controlled
- [ ] Environment specific
Secret Management
Checklist:
- [ ] No secrets in Git
- [ ] No secrets in ConfigMap
- [ ] Rotation process defined
- [ ] External secret manager preferred
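One way to keep the two kinds of configuration apart is sketched below: non-sensitive settings come from a ConfigMap, while the password is read from a Secret that is assumed to be created outside Git, for example by an external secret manager. All names, keys, and the URL are illustrative.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-api-config
data:
  LOG_LEVEL: info
  ORDER_SERVICE_URL: http://order-api.production-order.svc.cluster.local
---
apiVersion: v1
kind: Pod
metadata:
  name: payment-api
spec:
  containers:
    - name: api
      image: api:v1.3.5
      envFrom:
        - configMapRef:
            name: payment-api-config         # every key becomes an env var
      env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: payment-db-credentials   # hypothetical Secret, not stored in Git
              key: password
```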
10. Security Review
RBAC
Checklist:
- [ ] Least privilege principle
- [ ] Dedicated ServiceAccounts
- [ ] No cluster-admin usage
Bad:
cluster-admin
Good:
Role + RoleBinding (namespace-scoped)
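A least-privilege setup along these lines could look like the following sketch: a dedicated ServiceAccount bound to a namespace-scoped Role that can only read ConfigMaps. The names and the chosen permissions are assumptions for illustration.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: production-payment
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: config-reader
  namespace: production-payment
rules:
  - apiGroups: [""]                  # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payment-api-config-reader
  namespace: production-payment
subjects:
  - kind: ServiceAccount
    name: payment-api
    namespace: production-payment
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: config-reader
```

The workload opts in by setting `serviceAccountName: payment-api` in its Pod spec.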
Pod Security
Checklist:
- [ ] Non-root containers
- [ ] No privileged mode
- [ ] Seccomp profile
- [ ] AppArmor profile
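As a sketch of the first three items, the Pod below runs as non-root, disables privileged mode, and uses the container runtime's default seccomp profile. The Pod and container names are assumptions. AppArmor configuration depends on the cluster version and node operating system, so it is left out here.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-api
spec:
  securityContext:               # Pod-level settings, inherited by containers
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault       # the container runtime's default seccomp filter
  containers:
    - name: api
      image: api:v1.3.5
      securityContext:
        privileged: false
        allowPrivilegeEscalation: false
```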
Supply Chain Security
Checklist:
- [ ] Image signing
- [ ] SBOM generated
- [ ] Vulnerability scanning
11. Reliability Review
High Availability
Checklist:
- [ ] Multiple replicas
- [ ] Pod anti-affinity
- [ ] Multi-zone deployment
podAntiAffinity:
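The `podAntiAffinity` stanza could be filled in as in the fragment below, which prefers to place replicas on different nodes and spreads them across zones. It assumes Pods are labeled `app: payment-api` and that nodes carry the standard `topology.kubernetes.io/zone` label.

```yaml
# Fragment of a Pod template spec (spec.template.spec in a Deployment)
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payment-api
          topologyKey: kubernetes.io/hostname    # prefer different nodes
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone     # spread evenly across zones
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: payment-api
```

The "preferred" form is a soft rule. Switching to `requiredDuringSchedulingIgnoredDuringExecution` makes it strict, at the cost of Pods staying Pending when there are not enough nodes.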
Pod Disruption Budget
Checklist:
- [ ] PDB configured
minAvailable: 1
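In context, the `minAvailable` setting belongs to a PodDisruptionBudget such as the one sketched here. The name and selector label are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api
spec:
  minAvailable: 1            # voluntary evictions must leave at least 1 Pod running
  selector:
    matchLabels:
      app: payment-api
```

A PDB limits only voluntary disruptions such as node drains. With a single replica and `minAvailable: 1`, drains are blocked entirely, which is one more reason to run multiple replicas.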
Graceful Shutdown
Checklist:
- [ ] SIGTERM handled
- [ ] preStop hook configured
- [ ] terminationGracePeriodSeconds set
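The shutdown settings might be combined as in this Pod template fragment. The 10-second sleep and the 45-second grace period are assumed values, and the `sleep` command only works if the image ships that binary.

```yaml
# Fragment of a Pod template spec
terminationGracePeriodSeconds: 45    # total budget: preStop plus SIGTERM handling
containers:
  - name: api
    image: api:v1.3.5
    lifecycle:
      preStop:
        exec:
          # Delay SIGTERM briefly so endpoints and load balancers stop sending traffic
          command: ["sleep", "10"]
```

The grace period covers both the preStop hook and the time the application needs after SIGTERM. When it expires, the container is killed with SIGKILL.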
12. Observability Review
Logging
Checklist:
- [ ] Centralized logging
- [ ] Structured JSON logs
- [ ] Correlation ID support
Metrics
Checklist:
- [ ] CPU metrics
- [ ] Memory metrics
- [ ] Request metrics
- [ ] Error metrics
- [ ] Business metrics
Tracing
Checklist:
- [ ] Distributed tracing enabled
- [ ] Request correlation supported
13. CI/CD Review
Deployment Pipeline
Checklist:
- [ ] Automated build
- [ ] Automated test
- [ ] Automated deployment
- [ ] Rollback support
GitOps
Checklist:
- [ ] Git as source of truth
- [ ] Pull-based deployment
- [ ] Drift detection enabled
Deployment Strategies
Checklist:
- [ ] Rolling deployment
- [ ] Canary deployment
- [ ] Blue-Green deployment
14. Cost Optimization
Checklist:
- [ ] Requests properly sized
- [ ] HPA configured
- [ ] Cluster Autoscaler configured
- [ ] Spot instances evaluated
- [ ] Unused resources removed
15. Disaster Recovery
Backup
Checklist:
- [ ] Database backup
- [ ] Persistent volume backup
- [ ] Secret backup
- [ ] Configuration backup
Recovery
Checklist:
- [ ] Restore procedure documented
- [ ] Recovery tested regularly
- [ ] Recovery Time Objective (RTO) defined
- [ ] Recovery Point Objective (RPO) defined
16. Production Readiness Scorecard
| Category | Target Score |
|---|---|
| Security | 9/10 |
| Reliability | 9/10 |
| Scalability | 9/10 |
| Observability | 9/10 |
| Maintainability | 9/10 |
| Cost Optimization | 8/10+ |
| Disaster Recovery | 8/10+ |
Final Production Review Questions
- [ ] Will it survive a Pod crash?
- [ ] Will it survive a Node crash?
- [ ] Will it survive a Zone failure?
- [ ] Can it scale automatically?
- [ ] Can it be deployed with zero downtime?
- [ ] Can it be rolled back safely?
- [ ] Is it secure by default?
- [ ] Is it observable?
- [ ] Can another engineer maintain it?
- [ ] Can it run at 3 AM without waking me up?
If all answers are YES, the Kubernetes platform/workload is considered Production Ready.
k8s/kubernestes-checklist.txt · Last modified: by phong2018
