Kubernetes Troubleshooting: Common Issues and Solutions

Kubernetes is powerful, but let's be real – it's also complex. You *will* run into problems. Knowing how to diagnose and fix them quickly is the difference between a smooth deployment and a frustrating outage. This article isn't about preventing problems (though good practices help!); it's about what to do when things go wrong. We'll cover some common Kubernetes headaches and how to tackle them.

Why Troubleshooting Kubernetes is Different

Traditional server troubleshooting often involved SSHing into a machine and poking around. Kubernetes abstracts that away. You're dealing with *declarative* state. You tell Kubernetes what you *want*, and it tries to make it happen. This is great for scalability and resilience, but it means errors manifest differently. Instead of a server being down, you might have a pod failing to start, a service unreachable, or unexpected resource usage.
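
To make "declarative state" concrete, here is a minimal Python sketch that compares what a Deployment asks for with what is actually ready. It assumes the object is available as a dict, for example parsed from kubectl get deployment -o json; the sample data and the function name replica_gap are made up for illustration.

```python
def replica_gap(deployment: dict) -> int:
    """Return how many desired replicas are not ready yet (0 means converged)."""
    # spec.replicas defaults to 1 when it is not set.
    desired = deployment.get("spec", {}).get("replicas", 1)
    # status.readyReplicas is omitted from the object when no replica is ready.
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return max(desired - ready, 0)


# Hypothetical sample shaped like the JSON of a single Deployment.
sample = {
    "metadata": {"name": "web"},
    "spec": {"replicas": 3},
    "status": {"replicas": 3, "readyReplicas": 1},
}

print(replica_gap(sample))  # → 2
```

A non-zero gap is the declarative version of "a server is down": the desired state is recorded, but something is stopping Kubernetes from reaching it.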

The key is understanding that Kubernetes *itself* is rarely the problem. What you see is usually a symptom of something else: a misconfigured application, a resource limit, a networking issue, or a problem with your underlying infrastructure. Your troubleshooting workflow needs to reflect this: start from the symptom Kubernetes reports and work back to the cause.

Core Troubleshooting Tools

Before diving into specific issues, let's get familiar with the essential tools:

  • kubectl get: Your workhorse. Use it to check the status of everything: pods, deployments, services, nodes, etc. Add -o wide for more details, and -o yaml or -o json for the full object definition.
  • kubectl describe: Provides detailed information about a resource, including events, which are *crucial* for understanding what Kubernetes is doing (or trying to do).
  • kubectl logs: View the logs from a container within a pod. Use -f to follow the logs in real-time.
  • kubectl exec: Get a shell inside a running container. Useful for debugging, but use sparingly in production.
  • kubectl top: Displays current resource usage (CPU and memory) for nodes and pods. It needs the Metrics Server (or another Metrics API provider) running in the cluster.
  • Kubernetes Dashboard: A web UI for managing and monitoring your cluster. Useful for visualization, but don't rely on it exclusively.
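
If you find yourself running the same kubectl commands repeatedly, they are easy to script. The sketch below is one possible approach, not an official client: it shells out to kubectl with -o json and parses the result. The helper names kubectl_json and pod_phases are hypothetical, and the runner parameter exists only so the function can be exercised without a cluster.

```python
import json
import subprocess


def kubectl_json(*args, runner=subprocess.run):
    """Run `kubectl <args> -o json` and return the parsed JSON output."""
    cmd = ["kubectl", *args, "-o", "json"]
    # check=True raises CalledProcessError if kubectl exits non-zero.
    result = runner(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def pod_phases(pod_list: dict) -> dict:
    """Map pod name -> phase from a `kubectl get pods -o json` style dict."""
    return {p["metadata"]["name"]: p["status"]["phase"] for p in pod_list["items"]}


# Against a real cluster (assumes kubectl is installed and configured):
#   print(pod_phases(kubectl_json("get", "pods", "-n", "default")))
```

For anything beyond quick scripts, the official Kubernetes client libraries are likely a better fit than parsing CLI output.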

Common Issues and Solutions

Let's look at some frequent problems and how to address them.

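1. Pods Stuck in Pending/Waiting

This is often the first sign of trouble. A pod in Pending has been accepted by the cluster, but either the scheduler hasn't placed it on a node yet or its containers haven't been created. A container in the Waiting state belongs to a pod that is already scheduled but isn't running yet, usually because its image is still being pulled (ContainerCreating) or the pull is failing (ErrImagePull, ImagePullBackOff).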

Troubleshooting:

  • kubectl describe pod <pod-name>: Look at the "Events" section. This will tell you *why* the pod isn't starting. Common reasons:
      * Insufficient resources: No node has enough unreserved CPU or memory for the pod's requests. Check kubectl describe node (the "Allocated resources" section) and kubectl top node, then consider lowering the requests, increasing node size, or adding more nodes.
      * NodeSelector/Affinity: The pod has specific node requirements that aren't being met. Review your pod definition.
      * Taints and Tolerations: Nodes might have taints that prevent pods from scheduling unless they have matching tolerations.
      * Image pull errors: The pod was scheduled, but Kubernetes can't pull the container image. Verify the image name, tag, and registry credentials.
  • Check Node Status: kubectl get nodes. Are all nodes Ready? If not, investigate the node itself (SSH in if necessary).

Example Event (from kubectl describe pod):

    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  2m    default-scheduler  0/3 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.

This indicates that none of the three nodes has enough unreserved CPU or memory to satisfy the pod's resource requests.
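
When many pods are affected, reading events one pod at a time gets slow. As a rough sketch, the function below filters a list of events for FailedScheduling warnings. It assumes input shaped like the output of kubectl get events -o json; the sample events and the name scheduling_failures are invented for the example.

```python
def scheduling_failures(event_list: dict) -> list:
    """Return (object name, message) for every FailedScheduling warning."""
    failures = []
    for event in event_list.get("items", []):
        if event.get("type") == "Warning" and event.get("reason") == "FailedScheduling":
            failures.append((event["involvedObject"]["name"], event["message"]))
    return failures


# Hypothetical sample shaped like `kubectl get events -o json`.
sample_events = {
    "items": [
        {"type": "Normal", "reason": "Pulled",
         "involvedObject": {"kind": "Pod", "name": "api-1"},
         "message": "Container image already present on machine"},
        {"type": "Warning", "reason": "FailedScheduling",
         "involvedObject": {"kind": "Pod", "name": "web-1"},
         "message": "0/3 nodes are available: 3 Insufficient cpu."},
    ]
}

for name, message in scheduling_failures(sample_events):
    print(f"{name}: {message}")
# → web-1: 0/3 nodes are available: 3 Insufficient cpu.
```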

2. Pods Crashing (Evicted/Terminated)

Pods that start and then immediately exit are a common headache. They typically show a status of CrashLoopBackOff or Error and a climbing restart count.

Troubleshooting:

  • kubectl get pods: Check the STATUS and RESTARTS columns. High restart counts are a red flag.
  • kubectl logs <pod-name>: The most important step! Look for error messages or exceptions in your application logs. Add --previous to see the logs of the container instance that crashed.
  • kubectl describe pod <pod-name>: Check each container's "Last State" and the "Events" section. Look for reasons like:
      * OOMKilled: The container exceeded its memory limit. Increase the memory limit in your pod definition, or fix the leak if usage keeps growing.
      * Liveness Probe Failure: Your liveness probe is failing, causing Kubernetes to restart the container. Review your probe configuration.
      * Readiness Probe Failure: This doesn't cause a restart, but a pod with a failing readiness probe is removed from service endpoints and receives no traffic.
  • Resource Limits: Are your resource requests and limits appropriately set? Set limits too low, and your application gets CPU-throttled or OOMKilled. Set requests too high, and you reserve capacity that other pods can't use.
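
The checks above can be sketched in a few lines of Python. This example assumes a pod object as a dict (for instance from kubectl get pod -o json) and looks at status.containerStatuses for the two most common signals, CrashLoopBackOff and OOMKilled. The function name crash_reasons and the sample pod are hypothetical.

```python
def crash_reasons(pod: dict) -> list:
    """Return human-readable findings for containers that look unhealthy."""
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        name = cs["name"]
        waiting = cs.get("state", {}).get("waiting") or {}
        last = cs.get("lastState", {}).get("terminated") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            findings.append(
                f"{name}: in CrashLoopBackOff after {cs.get('restartCount', 0)} restarts"
            )
        if last.get("reason") == "OOMKilled":
            findings.append(f"{name}: last instance was OOMKilled")
    return findings


# Hypothetical sample shaped like the JSON of a single pod.
sample_pod = {
    "metadata": {"name": "my-app"},
    "status": {
        "containerStatuses": [
            {
                "name": "my-app-container",
                "restartCount": 7,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}},
            }
        ]
    },
}

for finding in crash_reasons(sample_pod):
    print(finding)
# → my-app-container: in CrashLoopBackOff after 7 restarts
# → my-app-container: last instance was OOMKilled
```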

Example Pod Definition (showing resource limits):

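    apiVersion: v1
    kind: Pod
    metadata:
      name: my-app
    spec:
      containers:
      - name: my-app-container
        image: my-app-image:latest
        resources:
          # Requests: what the scheduler reserves on a node for this container.
          requests:
            cpu: "100m"      # 0.1 CPU core
            memory: "256Mi"
          # Limits: hard caps. Going over the memory limit gets the container
          # OOMKilled; CPU usage above the limit is throttled.
          limits:
            cpu: "500m"
            memory: "512Mi"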
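
The quantity strings in that definition ("100m", "256Mi") are easy to misread. Here is a small sketch, under the assumption that only millicore CPU values and binary memory suffixes (Ki, Mi, Gi, Ti) appear; real Kubernetes quantities allow more forms than this handles. The helper names are made up for illustration.

```python
_MEMORY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as "100m" or "2" to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores -> cores
    return float(quantity)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "256Mi" to bytes (binary suffixes only)."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain number of bytes


# Same values as the pod definition above.
resources = {
    "requests": {"cpu": "100m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "512Mi"},
}

print(cpu_cores(resources["requests"]["cpu"]))      # → 0.1
print(memory_bytes(resources["limits"]["memory"]))  # → 536870912
```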

3. Networking Issues (Service Unreachable)

Your pods are running, but you can't access your service.

Troubleshooting:

  • kubectl get svc <service-name>: Verify the service exists, has the expected type and ports, and has an external IP (if applicable).
  • kubectl describe svc <service-name>: Check the "Endpoints" field. Are there any endpoints listed? If not, the service's selector doesn't match the labels of any ready pod.
  • kubectl get endpoints <service-name>: Confirms which pod IPs the service is targeting.
  • DNS Resolution: Can you resolve the service name to an IP address from within the cluster? Run nslookup <service-name>.<namespace>.svc.cluster.local from a pod.
  • Network Policies: Are network policies blocking traffic to your service? Review your network policy definitions.
  • Ingress Controller: If using an Ingress, check its status and logs.
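
An empty endpoints list is very often a label mismatch. The following sketch mimics the equality-based matching a Service selector performs, so you can see why a pod is or isn't picked up. It ignores readiness and namespaces, and the function name and sample data are hypothetical.

```python
def matching_pods(selector: dict, pods: list) -> list:
    """Return names of pods whose labels contain every key/value in the selector."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return []
    names = []
    for pod in pods:
        labels = pod["metadata"].get("labels", {})
        if all(labels.get(key) == value for key, value in selector.items()):
            names.append(pod["metadata"]["name"])
    return names


# Hypothetical data: the second pod has a typo in its label value.
service_selector = {"app": "web", "tier": "frontend"}
pods = [
    {"metadata": {"name": "web-1", "labels": {"app": "web", "tier": "frontend"}}},
    {"metadata": {"name": "web-2", "labels": {"app": "wbe", "tier": "frontend"}}},
    {"metadata": {"name": "db-1", "labels": {"app": "db"}}},
]

print(matching_pods(service_selector, pods))  # → ['web-1']
```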

4. Resource Constraints (Cluster Slowdown)

The entire cluster feels sluggish.

Troubleshooting:

  • kubectl top node: Identify nodes with high CPU or memory usage.
  • kubectl top pod: Identify pods consuming excessive resources.
  • Horizontal Pod Autoscaler (HPA): If you're using an HPA, ensure it's configured correctly and scaling your deployments appropriately.
  • Vertical Pod Autoscaler (VPA): Consider using a VPA to automatically adjust resource requests and limits for your pods.
  • Node Scaling: Add more nodes to your cluster to increase overall capacity.
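
To judge whether an HPA is scaling "appropriately", it helps to know roughly what it computes. The sketch below follows the formula given in the Kubernetes documentation, desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with the default 10% tolerance. It is a simplification: it ignores min/max replica bounds, pod readiness, and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Approximate the replica count an HPA would aim for."""
    ratio = current_metric / target_metric
    # Within the tolerance band, the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# 4 replicas averaging 90% CPU against a 60% target: scale up.
print(desired_replicas(4, 90, 60))   # → 6
# 62% against a 60% target is within tolerance: no change.
print(desired_replicas(4, 62, 60))   # → 4
# 3 replicas at 50% against a 100% target: scale down.
print(desired_replicas(3, 50, 100))  # → 2
```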

Actionable Next Steps

Kubernetes troubleshooting is a skill honed with practice. Here's what you should do next:

  • Practice with a local Kubernetes cluster: Minikube or kind are great for experimentation.
  • Explore the Kubernetes documentation: [https://kubernetes.io/docs/](https://kubernetes.io/docs/)
  • Learn to read Kubernetes events: They are your best friend.
  • Familiarize yourself with common error messages: Knowing what they mean will save you time.
  • Consider a monitoring solution: Prometheus and Grafana are popular choices for visualizing cluster metrics.

Don't be afraid to experiment and break things (in a safe environment, of course!). The more you practice, the more comfortable you'll become with diagnosing and resolving Kubernetes issues. And remember, the Coding4Bread platform has plenty of resources to help you level up your Kubernetes skills!