kOps K8s Control Plane Monitoring with Datadog

Context

Due to a lack of control plane observability, I recently re-integrated the Datadog Helm chart on our kops-provisioned k8s cluster. kops is definitely not the most popular k8s solution, and the official control plane monitoring guide doesn't cover detailed steps for it.

Throughout the process, I ran into one major issue (details later), potentially caused by a compatibility problem between kops and datadog-agent. The investigation kept me busy, and I still don't have a definitive answer on how to "fix" it. However, I came up with a way to bypass the issue and ensure full visibility coverage for the control plane.

Overview

A k8s control plane has 4 major components:

  • kube-apiserver
  • etcd
  • kube-scheduler
  • kube-controller-manager

all of which are supported by native Datadog integrations (shipped with datadog-agent). The recommended approach from the official guide relies on Kubernetes autodiscovery, but it does not work on kops-provisioned control planes.

I’ll walk through the issue and findings, and follow up with a step-by-step guide on how to bypass it.

The details covered here are based on the following setup:

For brevity, I'll refer to "kops-provisioned control plane node(s)" as "control node(s)" unless otherwise specified.

Problem

I'll use kube-scheduler to illustrate the problem (the same applies to all 4 components).

Example integration (values.yaml for the datadog Helm chart):

datadog:
  apiKey: <DATADOG_API_KEY>
  ...
  ignoreAutoConfig:
    - kube_scheduler
  ...
  confd:
    kube_scheduler.yaml: |-
      ad_identifiers:
        - kube-scheduler
      instances:
        - prometheus_url: https://%%host%%:10259/metrics
          ssl_verify: false
          bearer_token_auth: true

This is the recommended approach from the official control plane monitoring guide, and it relies on Kubernetes autodiscovery.

On a control node, the configuration above does NOT turn on the integration(s); you can verify both symptoms with the commands below:

  • a valid configuration file for the integration exists under /etc/datadog-agent/conf.d/
  • the integration is not reported as running in the agent status output
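
Both can be checked by exec-ing into the agent container running on a control node. A quick sketch, assuming the chart's default DaemonSet layout where the main container is named "agent" (the pod name is a placeholder):

# confirm a configuration file for the check exists inside the agent container
kubectl exec <datadog-agent-pod-on-control-node> -c agent -- ls -R /etc/datadog-agent/conf.d/ | grep -i scheduler

# confirm whether the check shows up among the running checks
kubectl exec <datadog-agent-pod-on-control-node> -c agent -- agent status | grep -i kube_scheduler

# `agent configcheck` also lists which configurations were loaded and from where
kubectl exec <datadog-agent-pod-on-control-node> -c agent -- agent configcheck | grep -i kube_scheduler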

Investigation and Findings

The first thing I checked was the configuration content itself, walking through the Helm chart values reference, and I did not see anything wrong.

I've had experience setting up the Datadog Helm chart for control plane monitoring on EKS clusters and on docker-desktop / minikube. The identical configuration doesn't work 100% on those clusters, but at least the integrations are detected correctly via autodiscovery. The container names I saw by running docker ps on the control nodes have the right short name & image name (which datadog-agent uses to derive the ad_identifier), so I'm confident the configuration (especially the ad_identifiers section) is not the problem.

The next thing I did was turn on debug logging (datadog.logLevel: debug; logs available at /var/log/datadog/agent.log) for the Datadog Helm chart on both my kops cluster and a docker-desktop / minikube cluster. From the debug log I figured out roughly how autodiscovery in datadog-agent works:

  • file-based configurations (/etc/datadog-agent/conf.d/) are loaded into memory, and running containers & processes are detected
  • each detected container/process gets an identifier, which is compared against the configurations of integrations that have autodiscovery turned on (via ad_identifiers)
  • once an ad_identifier is matched, the rest of the yaml configuration is used for the integration.

The process above can be verified in the debug log.
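
A minimal sketch of how I followed it, assuming the chart was installed as a Helm release named "datadog" from the datadog/datadog repository (the agent pod name is a placeholder):

# turn on debug logging for the agent
helm upgrade datadog datadog/datadog --reuse-values --set datadog.logLevel=debug

# follow the agent log and watch whether the kube-scheduler container gets matched
kubectl exec <datadog-agent-pod-on-control-node> -c agent -- tail -f /var/log/datadog/agent.log | grep -i kube-scheduler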

On a control node, the desired container (kube-scheduler; same for the other 3 components) is NOT identified as kube-scheduler. I noticed many containers were identified by container id (in the format "docker://<container_id>"), but none of those ids match the actual container id of kube-scheduler (you can find the container id via kubectl describe pod/<kube-scheduler-pod-name>, or by ssh-ing to the control node and running docker ps).

Either kube-scheduler (same for the other 3 components) is not detected at all, or it is detected under a container id that doesn't match its own.
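
A quick way to compare the ids (pod and node names are placeholders; kops runs the control plane components as static pods in kube-system):

# container id as reported by the kubelet (note the docker:// prefix)
kubectl describe pod -n kube-system <kube-scheduler-pod-name> | grep -i 'container id'

# container id as seen by the docker daemon on the control node
ssh <control-node> 'docker ps --no-trunc | grep kube-scheduler'

On my cluster, this id never appeared among the "docker://<container_id>" identifiers in the agent debug log.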

This is where I realized there wasn't much more I could do with this approach. Fortunately, my goal is to get the integrations working on control nodes one way or another, and I was able to come up with an alternative solution.

Solution

The TL;DR version of the solution: use file-based configuration without autodiscovery.

Integrations are driven by configuration files (located under /etc/datadog-agent/conf.d/). The helm-native approach mentioned above works by converting the datadog.confd key-value pairs into one auto_conf.yaml per integration. The non-Helm alternative is to provision your own conf.yaml for each integration.
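
To see what ends up on disk with either approach, you can list an integration's config directory inside the agent container. This is just an illustrative check; the pod name is a placeholder, and conf.yaml only appears once the ConfigMap mounts below are in place:

# the agent ships default autodiscovery configs for these integrations (hence ignoreAutoConfig);
# the static file-based configuration we are about to mount lands as conf.yaml in the same directory
kubectl exec <datadog-agent-pod-on-control-node> -c agent -- ls -l /etc/datadog-agent/conf.d/kube_scheduler.d/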

To bypass the autodiscovery issue on a kops-provisioned cluster, we can:

  • provision a ConfigMap with the desired configurations
  • mount the ConfigMap as volume(s) into datadog-agent: agents.volumes + agents.volumeMounts
  • replace template variables with ones that can be resolved
  • disable auto-config (the shipped autodiscovery defaults) for each integration: datadog.ignoreAutoConfig

Datadog configuration k8s ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: my-datadog-configmap
data:
  kube_apiserver_metrics.yaml: |+
    init_config:
    instances:
      - prometheus_url: https://%%env_DD_KUBERNETES_KUBELET_HOST%%:443/metrics
        tls_verify: false
        bearer_token_auth: true
        bearer_token_path: /var/run/secrets/kubernetes.io/serviceaccount/token
  etcd.yaml: |+
    init_config:
    instances:
      # etcd-manager-main
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4001/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key
      # etcd-manager-events
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4002/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.key
  kube_scheduler.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10251/metrics"
        ssl_verify: false
  kube_controller_manager.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10252/metrics"
        ssl_verify: false

Explanation

  • template variables are specific to the autodiscovery feature. In non-autodiscovery configurations, not all template variables can be resolved: e.g. %%host%% does not resolve, but %%env_<ENV_VAR>%% seems to resolve fine.
  • kops provisions 2 etcd clusters: main and events. 2 instances of the etcd integration are required, with slightly different tls_cert and tls_private_key values (although I've verified these are interchangeable).
  • kops uses etcd-manager as the parent process for etcd. Ports 2380/2381 are for peer communication (server-to-server) and 4001/4002 are for client communication (client-to-server). Since the agent acts as a "client" of the etcd server, ports 4001/4002 are the ones to use (instead of port 2379 in a normal etcd setup).
  • kube-scheduler serves HTTP on port 10251 and HTTPS on port 10259
  • kube-controller-manager serves HTTP on port 10252 and HTTPS on port 10257 (these endpoints can be sanity-checked with the curl commands after this list)
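
You can sanity-check the endpoints directly from a control node. A rough sketch (the apiserver endpoint is omitted since it requires a bearer token or client certificate; the cert paths are the host-side equivalents of the ones referenced in the ConfigMap above):

# etcd-manager-main / etcd-manager-events client ports
sudo curl -sk https://127.0.0.1:4001/metrics \
  --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key | head
sudo curl -sk https://127.0.0.1:4002/metrics \
  --cert /etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.crt \
  --key /etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.key | head

# kube-scheduler and kube-controller-manager HTTP endpoints (no auth required)
curl -s http://127.0.0.1:10251/metrics | head
curl -s http://127.0.0.1:10252/metrics | head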

values.yaml for the datadog Helm chart

datadog:
  ignoreAutoConfig:
    - etcd
    - kube_scheduler
    - kube_controller_manager
    - kube_apiserver_metrics
agents:
  volumes:
    - name: my-config
      configMap:
        name: my-datadog-configmap
    - name: etcd-pki
      hostPath:
        path: /etc/kubernetes/pki
  volumeMounts:
    - name: etcd-pki
      mountPath: /host/etc/kubernetes/pki
      readOnly: true
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_apiserver_metrics.d/conf.yaml
      subPath: kube_apiserver_metrics.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/etcd.d/conf.yaml
      subPath: etcd.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_scheduler.d/conf.yaml
      subPath: kube_scheduler.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_controller_manager.d/conf.yaml
      subPath: kube_controller_manager.yaml

Explanation

  • Certificates and private keys (located under /etc/kubernetes/pki on the host) are required for etcd client-to-server communication.
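
As a final check, once the chart is upgraded with these values, confirm on a control node that all four checks are reported as running by the agent (pod name is a placeholder):

kubectl exec <datadog-agent-pod-on-control-node> -c agent -- agent status | grep -E -A 4 'etcd|kube_scheduler|kube_controller_manager|kube_apiserver_metrics'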
