
Kubernetes Observability: Best Practices for Monitoring and Troubleshooting

Running Kubernetes without proper observability is like flying blind. When something breaks at 2 AM — and it will — you need to know *what* broke, *why* it broke, and *where* in your cluster it happened. That's exactly what observability gives you: the ability to understand your system's internal state from the outside.

Observability in Kubernetes isn't just "add some metrics and call it a day." It's three interconnected pillars: metrics, logs, and traces. Skip any one of them and you'll spend twice as long debugging the next incident.

Let's build this out properly.


The Three Pillars (and Why You Need All of Them)

Metrics tell you *that* something is wrong — CPU spiking, pod restarts climbing, request latency increasing.

Logs tell you *what* happened — the error message, the stack trace, the context around the failure.

Traces tell you *where* it went wrong across services — which microservice in a chain of ten is actually responsible for that 3-second response time.

You can debug with just metrics and logs for a while, but once you have more than five or six services talking to each other, distributed tracing becomes non-negotiable.


Setting Up Metrics with Prometheus

Prometheus is the de facto standard for Kubernetes metrics. It scrapes metrics endpoints, stores time-series data, and integrates with Kubernetes service discovery out of the box.

The easiest way to get started is with the kube-prometheus-stack Helm chart, which bundles Prometheus, Alertmanager, and Grafana together:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace

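For your own applications, expose a /metrics endpoint. The kube-prometheus-stack chart discovers targets through ServiceMonitor and PodMonitor resources and ignores scrape annotations by default. If your Prometheus is configured with an annotation-based scrape job instead, annotate your pods so it picks them up: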

apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  containers:
    - name: my-app
      image: my-app:latest
      ports:
        - containerPort: 8080
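
If you installed kube-prometheus-stack, a PodMonitor is the usual way to get the same pod scraped. The following is a minimal sketch under stated assumptions, not a definitive manifest. It assumes the pod carries a label app: my-app and that its container port is named metrics (neither appears in the manifest above), that the pod runs in the default namespace, and that the Helm release is called kube-prometheus-stack, since by default the chart's Prometheus only selects monitors labeled with its release name.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-app
  namespace: default                 # assumption: same namespace as the pod
  labels:
    release: kube-prometheus-stack   # assumption: matches your Helm release name
spec:
  selector:
    matchLabels:
      app: my-app                    # assumption: label you add to the pod
  podMetricsEndpoints:
    - port: metrics                  # assumption: name given to containerPort 8080
      path: /metrics
      interval: 30s
```

A PodMonitor refers to ports by name, so the container port needs a name field for this to match anything.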

A few metrics you should be tracking from day one:

  • kube_pod_container_status_restarts_total — pod restart counts (a climbing number here is always a red flag)
  • container_cpu_usage_seconds_total — CPU usage per container
  • container_memory_working_set_bytes — actual memory in use
  • http_request_duration_seconds — your application's request latency (instrument this yourself with a client library)

Alerting That Actually Works

Prometheus alerts without Alertmanager routing are useless noise. Here's a practical alert rule that fires when a container has restarted more than 5 times in the last hour and that condition has held for 5 minutes:

    groups:
      - name: pod-health
        rules:
          - alert: PodCrashLooping
            expr: |
              increase(kube_pod_container_status_restarts_total[1h]) > 5
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.pod }} is crash looping"
              description: "Container {{ $labels.container }} in pod {{ $labels.pod }} has restarted {{ $value }} times in the last hour."
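
The rule above is in plain Prometheus rule-file format. With kube-prometheus-stack, rules are usually loaded through a PrometheusRule resource instead. Here is a hedged sketch of the same alert wrapped that way. The release label assumes a Helm release named kube-prometheus-stack, and the runbook_url value is a hypothetical placeholder, not a real link.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-health
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: matches your Helm release name
spec:
  groups:
    - name: pod-health
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
            # Hypothetical URL: point this at your own runbook.
            runbook_url: "https://runbooks.example.com/pod-crash-looping"
```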

Keep your alerts actionable. Every alert should have a clear owner and a runbook link. If your on-call engineer can't do anything about an alert at 3 AM, it shouldn't page them.
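
One way to enforce that is to route by the severity label, so only critical alerts page someone. This is a sketch of an Alertmanager configuration under assumptions: the receiver names, the Slack channel, and both credentials are placeholders, and the severity values are simply the ones this article's rule happens to use. With kube-prometheus-stack, a block like this would typically go under alertmanager.config in your Helm values.

```yaml
route:
  receiver: default                  # catch-all for anything unmatched
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity = "critical"
      receiver: pager                # wakes someone up
    - matchers:
        - severity = "warning"
      receiver: chat                 # visible, but does not page

receivers:
  - name: default
  - name: pager
    pagerduty_configs:
      - routing_key: "REPLACE_WITH_INTEGRATION_KEY"   # placeholder
  - name: chat
    slack_configs:
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"   # placeholder
        channel: "#alerts"                                       # hypothetical channel
```

With this layout, the PodCrashLooping rule (severity: warning) lands in chat rather than paging anyone.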


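Centralized Logging with the EFK Stack

Kubernetes logs are ephemeral: when a pod is deleted, its logs go with it unless you ship them somewhere first. The EFK stack (Elasticsearch, Fluentd, Kibana) is a solid choice for centralized logging.

Deploy Fluentd as a DaemonSet so it runs on every node and captures all container logs. The /var/lib/docker/containers mount only matters on Docker-based nodes; on containerd nodes the log files live under /var/log/pods, which the /var/log mount already covers: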

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: fluentd
      namespace: logging
    spec:
      selector:
        matchLabels:
          name: fluentd
      template:
        metadata:
          labels:
            name: fluentd
        spec:
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              effect: NoSchedule
          containers:
            - name: fluentd
              image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
              env:
                - name: FLUENT_ELASTICSEARCH_HOST
                  value: "elasticsearch.logging.svc.cluster.local"
                - name: FLUENT_ELASTICSEARCH_PORT
                  value: "9200"
              volumeMounts:
                - name: varlog
                  mountPath: /var/log
                - name: varlibdockercontainers
                  mountPath: /var/lib/docker/containers
                  readOnly: true
          volumes:
            - name: varlog
              hostPath:
                path: /var/log
            - name: varlibdockercontainers
              hostPath:
                path: /var/lib/docker/containers

Structured logging matters here. If your application logs plain text, you're making your own life harder. Log JSON instead:

    {
      "timestamp": "2024-01-15T10:30:00Z",
      "level": "error",
      "service": "payment-service",
      "trace_id": "abc123",
      "message": "Payment processing failed",
      "error": "connection timeout",
      "user_id": "u-789"
    }

Structured logs are searchable, filterable, and can be correlated with traces using that trace_id field.


Distributed Tracing with Jaeger

Once you have more than a handful of services, you need traces. Jaeger integrates well with Kubernetes and supports OpenTelemetry, which is the instrumentation standard you should be using.

Deploy Jaeger with the all-in-one image for development, or use the Jaeger Operator for production. Recent operator releases rely on admission webhooks, so cert-manager must already be installed in the cluster:

    kubectl create namespace observability
    kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml \
      -n observability

Then create a Jaeger instance:

    apiVersion: jaegertracing.io/v1
    kind: Jaeger
    metadata:
      name: jaeger
      namespace: observability
    spec:
      strategy: production
      storage:
        type: elasticsearch
        options:
          es:
            server-urls: http://elasticsearch.logging.svc.cluster.local:9200

Instrument your application with OpenTelemetry (here's a Node.js example):

    const { NodeSDK } = require('@opentelemetry/sdk-node');
    const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

    const sdk = new NodeSDK({
      traceExporter: new OTLPTraceExporter({
        url: 'http://jaeger-collector.observability.svc.cluster.local:4318/v1/traces',
      }),
      serviceName: 'payment-service',
    });

    sdk.start();
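
Hard-coding the collector URL ties the image to one cluster. An alternative is to set the standard OpenTelemetry environment variables on the workload and leave url and serviceName out of the code, since values passed in code generally take precedence. The Deployment below is a sketch only: its name, labels, and image are hypothetical, and the endpoint reuses the collector address from the example above.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service              # hypothetical workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:latest   # hypothetical image
          env:
            # Service name attached to every span.
            - name: OTEL_SERVICE_NAME
              value: payment-service
            # Base OTLP/HTTP endpoint; the exporter appends /v1/traces.
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://jaeger-collector.observability.svc.cluster.local:4318
```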


Grafana: Tying It All Together

Grafana is your single pane of glass. Connect it to Prometheus for metrics, Elasticsearch for logs, and Jaeger for traces. The kube-prometheus-stack chart already includes Grafana with pre-built dashboards for cluster health.

A few dashboards worth importing immediately:

  • Kubernetes Cluster Overview (ID: 7249) — nodes, pods, namespaces at a glance
  • Node Exporter Full (ID: 1860) — deep dive into node-level metrics
  • Kubernetes Pod Monitoring (ID: 6336) — per-pod resource usage
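
To connect the bundled Grafana to Elasticsearch and Jaeger, one option is to declare the data sources in the chart's Helm values. The fragment below is a sketch resting on several assumptions: that Fluentd writes to logstash-prefixed indices with an @timestamp field, that the Jaeger Operator exposes the query service as jaeger-query on port 16686, and that your Grafana version reads the index name from jsonData (older versions used a top-level database field).

```yaml
grafana:
  additionalDataSources:
    - name: Elasticsearch
      type: elasticsearch
      access: proxy
      url: http://elasticsearch.logging.svc.cluster.local:9200
      jsonData:
        index: "logstash-*"          # assumption: Fluentd's index prefix
        timeField: "@timestamp"
    - name: Jaeger
      type: jaeger
      access: proxy
      # assumption: <instance-name>-query service created by the operator
      url: http://jaeger-query.observability.svc.cluster.local:16686
```

Saved as a values file, this can be applied by passing it to helm upgrade with -f for the same release.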

Practical Tips That Save You Time

Set resource requests and limits on everything. Without them, you can't trust your resource metrics. A pod without limits can starve its neighbors and your dashboards won't tell you why.
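
As a rough illustration, here is the my-app pod from earlier with requests and limits added. The numbers are placeholders, not recommendations; size them from the usage you actually observe in container_cpu_usage_seconds_total and container_memory_working_set_bytes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: my-app:latest
      resources:
        requests:            # what the scheduler reserves for the container
          cpu: 100m
          memory: 128Mi
        limits:              # hard ceiling; exceeding memory gets the container OOM-killed
          cpu: 500m
          memory: 256Mi
```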

Use namespace-level separation for your observability stack. Keep monitoring, logging, and observability namespaces separate from your application namespaces. It makes RBAC and resource management much cleaner.

Correlate across pillars. The real power comes when you can jump from a Grafana alert → filter logs by time range → find the trace ID → follow the trace through your services. Build this workflow before you need it in an incident.

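Don't skip liveness and readiness probes. They do more than drive Kubernetes health checks: a failing liveness probe restarts the container, which shows up in kube_pod_container_status_restarts_total and feeds your restart alerts, while a failing readiness probe takes the pod out of its Service endpoints.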
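
A minimal probe setup might look like the following. The /healthz and /readyz paths are hypothetical endpoints your application would need to implement, and the timings are starting points rather than tuned values.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: my-app
      image: my-app:latest
      ports:
        - containerPort: 8080
      livenessProbe:             # failure restarts the container
        httpGet:
          path: /healthz         # hypothetical endpoint
          port: 8080
        initialDelaySeconds: 10
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:            # failure removes the pod from Service endpoints
        httpGet:
          path: /readyz          # hypothetical endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 3
```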


Your Next Steps

  • Deploy kube-prometheus-stack if you haven't already — it gets you metrics and dashboards in under 10 minutes
  • Switch your application logging to structured JSON — this pays dividends immediately in searchability
  • Add OpenTelemetry instrumentation to your two or three most critical services first, then expand
  • Write a runbook for your top three most likely failure modes and link them from your alerts
  • Run a game day — deliberately kill a pod or inject latency and practice using your observability tools to find the problem

Observability isn't a one-time setup. It's a practice. The teams that do it well are the ones that treat it as a first-class engineering concern, not an afterthought bolted on after something breaks.