DevOps · 2026-05-09 · 16 min read

10 Things You Must Get Right When Building Kubernetes Infrastructure in 2026

The Kubernetes infrastructure checklist every team needs: RBAC, network policies, resource limits, secrets management, and the production readiness audit.

We followed every Kubernetes tutorial on the internet. We still got breached. A low-level developer account accessed the production etcd backup. Payment history of 40,000 customers was exposed. The tutorials taught us how to deploy apps. They never taught us how to secure a cluster. That knowledge cost us.

The Problem

The gap between "getting Kubernetes running" and "running Kubernetes correctly" is vast. Most teams are stuck between these two states without knowing it. They can deploy services. They cannot recover from a security breach. They have observability. They do not have secrets management. The cluster runs. But it is built on sand.

Why This Happens

Kubernetes documentation shows what Kubernetes can do. It does not show what you must do before going to production. The mental model teams carry is: get it working, then secure it. This is backwards. Security must be built in from the start. By the time teams realize this, they have 50 services depending on the insecure foundation. Refactoring is expensive and risky.

The Solution — 10 Things You Must Get Right

1. Set Resource Requests AND Limits on Every Container

The Kubernetes scheduler uses requests to place pods; limits cap what a running container can consume. Without requests, the scheduler places pods blind. Without limits, one noisy neighbor can starve or crash the other pods on its node.

resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m"

Also enforce this at the namespace level using ResourceQuota so a single team cannot consume all cluster resources.
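
A minimal sketch of namespace-level enforcement, assuming a `production` namespace like the one used elsewhere in this post. The object names and the numbers are placeholders, not recommendations; size them from your own usage data. The LimitRange supplies defaults for containers that omit requests or limits, which matters because a quota on requests and limits causes pods that specify neither to be rejected.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota            # hypothetical name
  namespace: production
spec:
  hard:
    requests.cpu: "20"        # total CPU requests allowed in the namespace
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: container-defaults    # hypothetical name
  namespace: production
spec:
  limits:
  - type: Container
    defaultRequest:           # applied when a container sets no requests
      cpu: 100m
      memory: 128Mi
    default:                  # applied when a container sets no limits
      cpu: 500m
      memory: 512Mi
```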

2. RBAC: Least-Privilege Service Accounts for Every Workload

Kubernetes RBAC is the difference between a cluster where any pod can access any secret and one where each pod can only access what it needs.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-service
  namespace: production

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: api-service-role
  namespace: production
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get"]
  resourceNames: ["api-service-config"]
- apiGroups: [""]
  resources: ["secrets"]
  verbs: ["get"]
  resourceNames: ["api-service-secrets"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: api-service-binding
  namespace: production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: api-service-role
subjects:
- kind: ServiceAccount
  name: api-service
  namespace: production

This service account can ONLY read one specific ConfigMap and one specific Secret. Nothing else. A compromised pod running this account has minimal damage potential.
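
The Role only applies to pods that actually run under this service account; a pod that does not name one runs as the namespace's `default` account. A sketch of the wiring, assuming a Deployment called `api-service` (the Deployment name, replica count, container name, and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-service
  namespace: production
spec:
  replicas: 3                      # illustrative
  selector:
    matchLabels:
      app: api-service
  template:
    metadata:
      labels:
        app: api-service
    spec:
      serviceAccountName: api-service   # binds the pod to the least-privilege account
      containers:
      - name: api
        image: myregistry.azurecr.io/api-service:v1.2.3
```

If the workload never calls the Kubernetes API at all, setting `automountServiceAccountToken: false` in the pod spec removes the token from the container entirely.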

3. Network Policies: Default-Deny Then Permit What Is Needed

By default, every pod in a Kubernetes cluster can talk to every other pod on any port. In a 50-service microservices cluster, that is 2,450 possible service-to-service paths (50 × 49), all open with zero restrictions.

# Deny all traffic by default
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

---
# Allow only payment service to call database service on port 5432
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-to-database
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: database
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: payment-service
    ports:
    - protocol: TCP
      port: 5432

---
# Allow DNS egress (required for service discovery)
# DNS uses UDP 53 and falls back to TCP 53 for large responses
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53

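Start with deny-all. Then selectively allow what is needed. This prevents lateral movement if a pod is compromised. NetworkPolicy objects are only enforced when the cluster's CNI plugin supports them (Calico and Cilium do, for example); on a plugin without support they are accepted and silently ignored.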
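
One detail is easy to miss: the default-deny policy above blocks Egress as well as Ingress, so allowing ingress on the database is not enough. The payment service also needs a matching egress rule. A sketch, reusing the labels from the policies above (the policy name is an assumption):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-payment-egress-to-database   # hypothetical name
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payment-service       # the policy applies to the calling pods
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: database
    ports:
    - protocol: TCP
      port: 5432
```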

4. Never Run Containers as Root

A compromised container running as root is one container-escape vulnerability away from root on the node. Running as non-root, with privilege escalation blocked and capabilities dropped, limits the damage.

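# Container-level securityContext (spec.containers[].securityContext)
securityContext:
  runAsNonRoot: true                # refuse to start if the image would run as UID 0
  runAsUser: 1000
  allowPrivilegeEscalation: false   # block setuid binaries from gaining privileges
  readOnlyRootFilesystem: true      # mount writable paths explicitly (emptyDir, volumes)
  capabilities:
    drop:
    - ALL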
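
Setting this per container relies on every team remembering to do it. One way to enforce it cluster-side is the built-in Pod Security Admission controller, configured with namespace labels. A sketch, assuming the `production` namespace used in this post:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that do not meet the "restricted" Pod Security Standard
    pod-security.kubernetes.io/enforce: restricted
    # Also surface violations as client warnings and audit annotations
    pod-security.kubernetes.io/warn: restricted
    pod-security.kubernetes.io/audit: restricted
```

The restricted profile also requires a seccomp profile, so containers would additionally need `seccompProfile` with `type: RuntimeDefault` in their securityContext to be admitted. On an existing namespace it may be safer to start with only the `warn` and `audit` labels and add `enforce` once the warnings are gone.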

5. Secrets Management: Use External Secrets Operator or Vault

Kubernetes Secrets are base64-encoded, not encrypted. Unless encryption at rest is configured, they are stored in etcd in plaintext. Secret manifests committed to a repository stay in git history. Never hardcode secrets in Deployment manifests.

Use External Secrets Operator (ESO) to sync secrets from AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault:

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-store
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa

---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-service-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: SecretStore
  target:
    name: api-service-secrets
    creationPolicy: Owner
  data:
  - secretKey: database-password
    remoteRef:
      key: prod/api-service/db-password

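ESO re-syncs the Kubernetes Secret from the external store every refreshInterval (1h here), so a value rotated in AWS Secrets Manager reaches the cluster without a manual step. Secret values never appear in git.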

6. Pod Disruption Budgets for Critical Services

During node drains and upgrades, Kubernetes evicts pods so they reschedule on other nodes. Without a PodDisruptionBudget (PDB), every replica of a service can be evicted at the same time, causing downtime.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-service-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-service

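This PDB keeps at least 2 replicas of api-service running during voluntary disruptions such as node drains and upgrades: the eviction API refuses any eviction that would violate it. It does not protect against involuntary disruptions such as node failures, and the service needs at least 3 replicas, otherwise drains will block.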

7. Image Scanning in CI Before Deployment

Scan container images for known vulnerabilities before they are deployed. Trivy is free and open source, and it checks both OS packages and application dependencies against public vulnerability databases.

# Exit non-zero if any HIGH or CRITICAL vulnerability is found
trivy image --exit-code 1 --severity HIGH,CRITICAL myregistry.azurecr.io/api-service:v1.2.3

# Gate the deployment on scan success in CI
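
As one possible shape for that gate, here is a sketch of a GitHub Actions job using the `aquasecurity/trivy-action` action. GitHub Actions is an assumption (the post does not name a CI system), the workflow name and trigger are placeholders, and a private registry would need a login step before the scan.

```yaml
name: image-scan                 # hypothetical workflow name
on:
  push:
    branches: [main]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
    - name: Scan image with Trivy
      # Pin to a release tag or commit SHA in real use
      uses: aquasecurity/trivy-action@master
      with:
        image-ref: myregistry.azurecr.io/api-service:v1.2.3
        severity: HIGH,CRITICAL
        exit-code: "1"           # a finding fails the job, which blocks later deploy jobs
        ignore-unfixed: true     # skip vulnerabilities with no available fix
```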

8. Separate Node Pools for Different Workload Types

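Run cluster add-ons (CoreDNS, monitoring) on dedicated nodes and application pods on separate nodes, so a misbehaving workload cannot starve the components you need for debugging. DaemonSets such as kube-proxy still run on every node. Label nodes through your node pool configuration (or with kubectl label) rather than by applying Node manifests, then select on that label:

# Pod spec fragment: schedule onto application nodes
nodeSelector:
  workload-type: application

# One-off labeling of an existing node:
# kubectl label node <node-name> workload-type=application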
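
A nodeSelector only attracts pods to matching nodes; it does not keep other pods off them. To reserve the system pool, the usual complement is a taint on those nodes plus a toleration on the system workloads. A sketch, assuming a `workload-type=system` label and taint (both names are illustrative):

```yaml
# Taint the system nodes first (or set the taint in the node pool definition):
#   kubectl taint nodes <node-name> workload-type=system:NoSchedule
#
# Pod spec fragment for a system component, for example the monitoring stack
nodeSelector:
  workload-type: system
tolerations:
- key: workload-type
  operator: Equal
  value: system
  effect: NoSchedule     # without this toleration, pods are kept off tainted nodes
```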

9. Cluster Autoscaling with Karpenter (Not Cluster Autoscaler)

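Cluster Autoscaler resizes pre-defined node groups. Karpenter provisions nodes directly from the requirements of pending pods, so it usually scales faster, picks an instance type that fits the workload, and consolidates underutilized nodes to cut cost. Check that Karpenter supports your cloud provider; where it does not, Cluster Autoscaler remains a maintained option.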
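
For orientation, a sketch of a Karpenter NodePool, assuming AWS, the `karpenter.sh/v1` API, and an existing EC2NodeClass named `default`. The pool name, CPU limit, and consolidation delay are placeholders; check the field names against the Karpenter version you run.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: application              # hypothetical name
spec:
  template:
    metadata:
      labels:
        workload-type: application   # matches the nodeSelector used in section 8
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed to exist already
      requirements:
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64"]
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["on-demand", "spot"]
  limits:
    cpu: "200"                   # cap on total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```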

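10. Cluster Backup Automation with Velero

Etcd is the Kubernetes database. If etcd is corrupted or deleted, the entire cluster state is lost. Velero does not snapshot etcd itself: it backs up Kubernetes resources through the API server, plus persistent volume data, on a schedule, which is what you need to restore workloads into a new or repaired cluster. On a self-managed control plane, also take etcd snapshots (etcdctl snapshot save). Restrict and encrypt access to all of these backups; a backup is a full copy of your cluster's Secrets.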

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero  # Schedules live in Velero's install namespace (velero by default)
spec:
  schedule: "0 2 * * *"  # 02:00 daily
  template:
    ttl: 720h  # Keep for 30 days
    storageLocation: default
    volumeSnapshotLocations:
    - default
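
A backup that has never been restored is unproven. A sketch of a restore drill, assuming Velero runs in the `velero` namespace and that `production` is the namespace you want to verify; the Restore name and the scratch namespace are placeholders.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-drill            # hypothetical name
  namespace: velero
spec:
  # With no backupName set, Velero restores from the most recent
  # successful backup created by this schedule
  scheduleName: daily-backup
  includedNamespaces:
  - production
  # Restore into a scratch namespace so the drill does not touch live workloads
  namespaceMapping:
    production: restore-drill
```

After the restore completes, check that the workloads in the scratch namespace start and that their data is intact, then delete the namespace.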

The Production Readiness Checklist

Before promoting any cluster from staging to production, verify:

  1. ☐ All containers have resource requests and limits
  2. ☐ RBAC is configured with least-privilege service accounts
  3. ☐ Network policies are implemented (default deny, selective allow)
  4. ☐ No containers run as root
  5. ☐ Secrets are managed via External Secrets Operator or Vault
  6. ☐ Pod Disruption Budgets protect critical services
  7. ☐ Image scanning gates deployments in CI/CD
  8. ☐ Separate node pools for system vs application workloads
  9. ☐ Karpenter autoscaling is configured
  10. ☐ Velero backups are running and tested
  11. ☐ Monitoring and alerting are operational (Prometheus + Grafana)
  12. ☐ Audit logging is enabled on the API server
  13. ☐ TLS certificates are valid, with expiry monitored and renewal automated
  14. ☐ Disaster recovery runbooks are documented
  15. ☐ On-call escalation procedures are defined
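
For item 12, audit logging is driven by a policy file on self-managed control planes, passed to kube-apiserver with `--audit-policy-file` (and `--audit-log-path` for the output). Managed services usually expose it as a provider setting instead. A minimal sketch of such a policy; the choice of which resources get which level is an assumption to adapt:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
omitStages:
- RequestReceived                # skip the duplicate early-stage event
rules:
# Record who touched Secrets and ConfigMaps, but never log their contents
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets", "configmaps"]
# Record full request and response bodies for RBAC changes
- level: RequestResponse
  verbs: ["create", "update", "patch", "delete"]
  resources:
  - group: "rbac.authorization.k8s.io"
# Everything else: metadata only
- level: Metadata
```

Rules are evaluated in order and the first match wins, so the catch-all rule belongs last.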

Common Mistakes to Avoid

  1. Treating security as an afterthought. Build RBAC, network policies, and secrets management from day 1, not 6 months later.
  2. Running all workloads on the same nodes. A rogue pod can crash the entire monitoring stack and make debugging impossible.
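  3. No backup strategy for etcd. Without a backup, losing etcd means losing the cluster state for good. Test restores regularly.
  4. Resource limits without requests (or vice versa). The scheduler cannot make good placement decisions without requests. Without memory limits, one leaking pod can exhaust a node and get its neighbors OOM-killed.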
  5. Hardcoding secrets in git. Even in "private" repos, secrets in git are findable and exploitable. Use External Secrets.

Key Takeaways

  • The 10 things are not optional: They are the difference between a safe cluster and a vulnerable one.
  • Resource management is foundational: Requests let the scheduler decide, limits prevent cascading failures.
  • Security must be built in from day 1: Adding RBAC after 50 services are deployed is painful.
  • Network policies prevent lateral movement: Default deny, selective allow is the zero-trust model.
  • Automation prevents human error: Velero backups, image scanning, and Karpenter scaling should all run without manual steps.

Struggling with securing your Kubernetes infrastructure or preparing for production? The Skillzmist team has solved this exact problem for engineering teams across the US, UK, and Europe. Reach out for a free technical consultation — we respond within 24 hours.
