Kubernetes Design & Code Review Checklist

1. Architecture Review

Cluster Design

  • [ ] Multi-node cluster (avoid single point of failure)
  • [ ] Separate environments (dev/staging/prod)
  • [ ] Proper namespace strategy
  • [ ] ResourceQuota configured
  • [ ] LimitRange configured
  • [ ] RBAC enabled
  • [ ] Network Policies enabled
  • [ ] Audit logging enabled
  • [ ] High availability control plane

Namespace Design

  • [ ] One namespace per application/domain
  • [ ] Environment isolation
  • [ ] Consistent naming convention
  • [ ] Quotas per namespace

Example:

production-payment
production-order
staging-payment
staging-order
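A minimal sketch of how one of these namespaces might carry its own quota and default limits. The object names and all numeric values are hypothetical and should be sized from observed usage.

```yaml
# Namespace following the <environment>-<domain> convention shown above.
apiVersion: v1
kind: Namespace
metadata:
  name: production-payment
  labels:
    env: prod
---
# Caps the total resources all Pods in the namespace may request or use.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota          # hypothetical name
  namespace: production-payment
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
---
# Supplies defaults for containers that omit requests/limits.
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults     # hypothetical name
  namespace: production-payment
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
```

Once a ResourceQuota constrains CPU or memory, Pods without requests/limits are rejected, which is why the LimitRange defaults are usually deployed alongside it.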

2. Workload Review

Workload Type Selection

Requirement             Kubernetes Resource
Stateless API           Deployment
Background Worker       Deployment
Database                StatefulSet
Cache                   StatefulSet
Scheduled Task          CronJob
One-time Task           Job
Daemon on every node    DaemonSet

Checklist:

  • [ ] Correct workload type selected
  • [ ] One responsibility per workload
  • [ ] Horizontal scaling supported

3. Deployment Review

Deployment Configuration

  • [ ] replicas > 1 in production
  • [ ] RollingUpdate strategy used
  • [ ] revisionHistoryLimit configured
  • [ ] Proper labels
  • [ ] Proper selectors

Example:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxUnavailable: 1
    maxSurge: 1

Labels

Required labels:

labels:
  app: payment-api
  version: v1.2.0
  env: prod
  team: backend

Checklist:

  • [ ] app
  • [ ] version
  • [ ] env
  • [ ] team
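Putting the deployment settings and the required labels together, a Deployment might look roughly like the sketch below. The registry host, replica count, revision limit, and container port are assumptions for illustration only.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-api
  namespace: production-payment
  labels:
    app: payment-api
    version: v1.2.0
    env: prod
    team: backend
spec:
  replicas: 3                  # > 1 in production (assumed value)
  revisionHistoryLimit: 5      # old ReplicaSets kept for rollback (assumed value)
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  selector:
    matchLabels:
      app: payment-api         # the selector is immutable; use stable labels only
  template:
    metadata:
      labels:                  # must include every label used in the selector
        app: payment-api
        version: v1.2.0
        env: prod
        team: backend
    spec:
      containers:
        - name: payment-api
          image: registry.example.com/payment-api:v1.2.0   # hypothetical registry
          ports:
            - containerPort: 8080
```

Keeping `version` out of the selector lets the label change on every release without forcing the Deployment to be recreated.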

4. Container Review

Container Image

Checklist:

  • [ ] No latest tag
  • [ ] Immutable version tag (or image digest)
  • [ ] Trusted registry
  • [ ] Vulnerability scan performed

Bad:

image: api:latest

Good:

image: api:v1.3.5

Security Context

Checklist:

  • [ ] runAsNonRoot
  • [ ] readOnlyRootFilesystem
  • [ ] allowPrivilegeEscalation=false
  • [ ] capabilities dropped

Example:

securityContext:
  runAsNonRoot: true
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL

5. Resource Management

CPU & Memory

Checklist:

  • [ ] CPU request
  • [ ] CPU limit
  • [ ] Memory request
  • [ ] Memory limit

Example:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

Autoscaling

Checklist:

  • [ ] HPA configured
  • [ ] minReplicas defined
  • [ ] maxReplicas defined
  • [ ] CPU target configured
  • [ ] Memory target configured

Example:

minReplicas: 2
maxReplicas: 10
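The two fields above live inside a HorizontalPodAutoscaler. A fuller sketch using the `autoscaling/v2` API follows; the target name and the utilization percentages are assumptions, not recommendations.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api          # the Deployment to scale (assumed to exist)
  minReplicas: 2
  maxReplicas: 10
  metrics:
    # Utilization is measured relative to the container's resource requests.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```

Because utilization targets are computed against requests, the HPA only behaves sensibly when the CPU and memory requests from the previous section are set.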

6. Health Checks

Liveness Probe

Checklist:

  • [ ] Configured
  • [ ] Fast endpoint

Example:

livenessProbe:
  httpGet:
    path: /health
    port: 8080

Readiness Probe

Checklist:

  • [ ] Configured
  • [ ] Service traffic blocked until ready

Example:

readinessProbe:
  httpGet:
    path: /ready
    port: 8080

Startup Probe

Checklist:

  • [ ] Configured for slow startup applications
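A startup probe for a slow-starting application could be sketched as follows. It reuses the `/health` endpoint and port from the liveness example; the period and threshold are assumed values.

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # tolerates up to 30 * 10s = 300s of startup time
```

Liveness and readiness probes do not run until the startup probe has succeeded, so the liveness probe can stay strict without killing containers that are still booting.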

7. Networking Review

Service Review

Checklist:

  • [ ] ClusterIP for internal services
  • [ ] LoadBalancer only when required
  • [ ] NodePort avoided
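For an internal service, a ClusterIP Service might look like this sketch. The name, selector, and port numbers are assumptions carried over from the earlier examples.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payment-api
spec:
  type: ClusterIP          # the default; reachable only from inside the cluster
  selector:
    app: payment-api       # must match the Pod labels
  ports:
    - name: http
      port: 80             # port exposed by the Service
      targetPort: 8080     # port the container listens on
```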

Ingress Review

Checklist:

  • [ ] TLS enabled
  • [ ] HTTPS redirect enabled
  • [ ] Rate limiting configured
  • [ ] WAF considered
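How TLS, redirects, and rate limiting are configured depends on the ingress controller. The sketch below assumes the ingress-nginx controller and its annotations; the hostname, TLS secret name, and rate value are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: payment-api
  annotations:
    # ingress-nginx specific; other controllers use different mechanisms.
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/limit-rps: "20"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - pay.example.com
      secretName: pay-example-com-tls   # TLS Secret assumed to exist
  rules:
    - host: pay.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: payment-api
                port:
                  number: 80
```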

Network Policies

Checklist:

  • [ ] Default deny policy
  • [ ] Explicit allow rules
  • [ ] Namespace isolation

Example:

kind: NetworkPolicy
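A default-deny policy plus one explicit allow rule might be sketched as below. The policy names, namespaces, and port are assumptions, and enforcement requires a CNI plugin that supports NetworkPolicy.

```yaml
# Deny all ingress and egress for every Pod in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production-payment
spec:
  podSelector: {}            # empty selector matches all Pods
  policyTypes:
    - Ingress
    - Egress
---
# Explicitly allow the order namespace to reach payment-api on port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-order-to-payment
  namespace: production-payment
spec:
  podSelector:
    matchLabels:
      app: payment-api
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: production-order
      ports:
        - protocol: TCP
          port: 8080
```

Denying all egress also blocks DNS lookups, so a real setup would usually add an egress rule for the cluster DNS service.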

8. Storage Review

Persistent Volumes

Checklist:

  • [ ] Dynamic provisioning
  • [ ] StorageClass used
  • [ ] Backup strategy exists
  • [ ] Recovery tested

Stateful Applications

Checklist:

  • [ ] StatefulSet used
  • [ ] PVC attached
  • [ ] Data persistence verified
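The storage items above can be combined in a StatefulSet that provisions one PVC per replica through `volumeClaimTemplates`. In this sketch the image, StorageClass name, mount path, and size are all hypothetical.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: payment-db
spec:
  serviceName: payment-db        # headless Service, assumed to exist
  replicas: 1
  selector:
    matchLabels:
      app: payment-db
  template:
    metadata:
      labels:
        app: payment-db
    spec:
      containers:
        - name: db
          image: registry.example.com/payment-db:v1.0.0   # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /var/lib/data                    # hypothetical path
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: standard   # hypothetical StorageClass (dynamic provisioning)
        resources:
          requests:
            storage: 20Gi
```

The PVCs created from the template are not deleted when the StatefulSet is scaled down or removed by default, which is what makes data persistence verifiable across Pod restarts.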

9. Configuration Management

ConfigMap

Checklist:

  • [ ] Only non-sensitive data
  • [ ] Version controlled
  • [ ] Environment specific

Secret Management

Checklist:

  • [ ] No secrets in Git
  • [ ] No secrets in ConfigMap
  • [ ] Rotation process defined
  • [ ] External secret manager preferred
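A container can consume non-sensitive settings from a ConfigMap and sensitive values from a Secret without either appearing in the manifest. This is a sketch of a container fragment; the ConfigMap name, Secret name, and key are hypothetical and assumed to exist already.

```yaml
# Fragment of a Pod template (spec.template.spec in a Deployment).
containers:
  - name: payment-api
    image: registry.example.com/payment-api:v1.2.0   # hypothetical registry
    envFrom:
      - configMapRef:
          name: payment-api-config       # non-sensitive settings only
    env:
      - name: DB_PASSWORD
        valueFrom:
          secretKeyRef:
            name: payment-db-credentials # created outside Git, e.g. by a secret manager
            key: password
```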

10. Security Review

RBAC

Checklist:

  • [ ] Least privilege principle
  • [ ] Dedicated ServiceAccounts
  • [ ] No cluster-admin usage

Bad:

ClusterRoleBinding to cluster-admin

Good:

Role
RoleBinding
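A least-privilege setup with a dedicated ServiceAccount might look like the sketch below. The names and the granted permissions (read-only access to ConfigMaps) are assumptions chosen for illustration.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-api
  namespace: production-payment
---
# Namespaced Role: grants only what the workload needs.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payment-api-config-reader
  namespace: production-payment
rules:
  - apiGroups: [""]              # "" is the core API group
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
# Binds the Role to the ServiceAccount within the same namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payment-api-config-reader
  namespace: production-payment
subjects:
  - kind: ServiceAccount
    name: payment-api
    namespace: production-payment
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payment-api-config-reader
```

The Pod then opts in with `serviceAccountName: payment-api` in its spec.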

Pod Security

Checklist:

  • [ ] Non-root containers
  • [ ] No privileged mode
  • [ ] Seccomp profile
  • [ ] AppArmor profile
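The first three items can be expressed in a Pod spec roughly as follows. The UID and image are hypothetical. AppArmor is left out because the way it is configured differs between Kubernetes versions.

```yaml
# Fragment of a Pod template (spec.template.spec in a Deployment).
securityContext:
  runAsNonRoot: true
  runAsUser: 10001               # hypothetical non-root UID
  seccompProfile:
    type: RuntimeDefault         # container runtime's default seccomp filter
containers:
  - name: payment-api
    image: registry.example.com/payment-api:v1.2.0   # hypothetical registry
    securityContext:
      privileged: false
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```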

Supply Chain Security

Checklist:

  • [ ] Image signing
  • [ ] SBOM generated
  • [ ] Vulnerability scanning

11. Reliability Review

High Availability

Checklist:

  • [ ] Multiple replicas
  • [ ] Pod anti-affinity
  • [ ] Multi-zone deployment

Example:

podAntiAffinity:
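A fuller sketch of spreading replicas across nodes and zones is shown below. It assumes Pods labelled `app: payment-api` and nodes carrying the standard hostname and zone labels; the soft ("preferred") variants are used so scheduling never blocks.

```yaml
# Fragment of a Pod template (spec.template.spec in a Deployment).
affinity:
  podAntiAffinity:
    # Prefer not to place two replicas on the same node.
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: payment-api
          topologyKey: kubernetes.io/hostname
# Keep the replica count per zone within a skew of 1 where possible.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: payment-api
```

Switching to `requiredDuringSchedulingIgnoredDuringExecution` or `whenUnsatisfiable: DoNotSchedule` gives a hard guarantee, at the cost of Pods staying Pending when capacity is short.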

Pod Disruption Budget

Checklist:

  • [ ] PDB configured

Example:

minAvailable: 1
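The `minAvailable` field belongs to a PodDisruptionBudget. A complete object might look like this sketch, where the name and selector are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api
spec:
  minAvailable: 1          # voluntary evictions may not drop below this count
  selector:
    matchLabels:
      app: payment-api
```

A PDB only limits voluntary disruptions such as node drains; it does not protect against node crashes, and with a single replica `minAvailable: 1` blocks drains entirely.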

Graceful Shutdown

Checklist:

  • [ ] SIGTERM handled
  • [ ] preStop hook configured
  • [ ] terminationGracePeriodSeconds set
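A sketch of the shutdown-related settings in a Pod spec follows. The durations are assumed values, and the `preStop` command assumes the image ships a shell.

```yaml
# Fragment of a Pod template (spec.template.spec in a Deployment).
terminationGracePeriodSeconds: 45   # total budget: preStop + SIGTERM handling
containers:
  - name: payment-api
    image: registry.example.com/payment-api:v1.2.0   # hypothetical registry
    lifecycle:
      preStop:
        exec:
          # Short pause so endpoints and load balancers stop routing
          # traffic before the process receives SIGTERM.
          command: ["sh", "-c", "sleep 10"]
```

The grace period starts counting when termination begins and includes the time spent in the `preStop` hook, so it should exceed the hook duration plus the application's own drain time.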

12. Observability Review

Logging

Checklist:

  • [ ] Centralized logging
  • [ ] Structured JSON logs
  • [ ] Correlation ID support

Metrics

Checklist:

  • [ ] CPU metrics
  • [ ] Memory metrics
  • [ ] Request metrics
  • [ ] Error metrics
  • [ ] Business metrics

Tracing

Checklist:

  • [ ] Distributed tracing enabled
  • [ ] Request correlation supported

13. CI/CD Review

Deployment Pipeline

Checklist:

  • [ ] Automated build
  • [ ] Automated test
  • [ ] Automated deployment
  • [ ] Rollback support

GitOps

Checklist:

  • [ ] Git as source of truth
  • [ ] Pull-based deployment
  • [ ] Drift detection enabled

Deployment Strategies

Checklist:

  • [ ] Rolling deployment
  • [ ] Canary deployment
  • [ ] Blue-Green deployment

14. Cost Optimization

Checklist:

  • [ ] Requests properly sized
  • [ ] HPA configured
  • [ ] Cluster Autoscaler configured
  • [ ] Spot instances evaluated
  • [ ] Unused resources removed

15. Disaster Recovery

Backup

Checklist:

  • [ ] Database backup
  • [ ] Persistent volume backup
  • [ ] Secret backup
  • [ ] Configuration backup

Recovery

Checklist:

  • [ ] Restore procedure documented
  • [ ] Recovery tested regularly
  • [ ] Recovery Time Objective (RTO) defined
  • [ ] Recovery Point Objective (RPO) defined

16. Production Readiness Scorecard

Category             Target Score
Security             9/10
Reliability          9/10
Scalability          9/10
Observability        9/10
Maintainability      9/10
Cost Optimization    8/10+
Disaster Recovery    8/10+

Final Production Review Questions

  1. [ ] Will it survive a Pod crash?
  2. [ ] Will it survive a Node crash?
  3. [ ] Will it survive a Zone failure?
  4. [ ] Can it scale automatically?
  5. [ ] Can it be deployed with zero downtime?
  6. [ ] Can it be rolled back safely?
  7. [ ] Is it secure by default?
  8. [ ] Is it observable?
  9. [ ] Can another engineer maintain it?
  10. [ ] Can it run at 3 AM without waking me up?

If all answers are YES, the Kubernetes platform/workload is considered Production Ready.

k8s/kubernestes-checklist.txt · Last modified: by phong2018