Context
Due to a lack of control plane observability, I recently re-integrated the Datadog helm chart on our kops-provisioned k8s cluster. kops is definitely not the most popular k8s solution, and the official control plane monitoring guide doesn't cover detailed steps. Throughout the process, I ran into one major issue (details later), potentially caused by a compatibility problem between kops and datadog-agent. The investigation kept me busy, and I still don't have a definitive answer on how to "fix" it. However, I came up with a solution that bypasses the issue and ensures full visibility coverage for the control plane.
Overview
A k8s control plane has 4 major components:
kube-apiserver
etcd
kube-scheduler
kube-controller-manager
all of which are supported by native Datadog integrations (shipped with datadog-agent). The recommended integration guide relies on kubernetes integration auto-discovery, but it does not work on kops-provisioned control planes.
I'll walk through the issue and findings, then follow up with a step-by-step guide on how to bypass it.
The details covered here are based on the following setup:
For brevity, I'll refer to "kops-provisioned control plane node(s)" as "control node(s)" unless explicitly specified.
Problem
I'll use kube-scheduler to illustrate the problem (the same problem applies to all 4 components).
Example integration (values.yaml for the datadog helm chart):
datadog:
  apiKey: <DATADOG_API_KEY>
  ...
  ignoreAutoConfig:
    - kube_scheduler
  ...
  confd:
    kube_scheduler.yaml: |-
      ad_identifiers:
        - kube-scheduler
      instances:
        - prometheus_url: https://%%host%%:10259/metrics
          ssl_verify: false
          bearer_token_auth: true
This is the recommended approach from the official control plane monitoring guide, and it relies on kubernetes integration auto-discovery.
On a control node, the configuration above does NOT turn on the integration(s):
- a valid configuration file for the integration exists under /etc/datadog-agent/conf.d/
- the integration is not detected as enabled in the agent status output
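To reproduce that check, the quickest way I know is to exec into the agent pod running on a control node and inspect the agent status output. This is only a sketch: the namespace, pod placeholder, and container name below are what a typical helm install uses, so adjust them for your release.
# find the agent pod scheduled on a control node, then look for the check in `agent status`
kubectl -n datadog get pods -o wide | grep <control-node-name>
kubectl -n datadog exec -it <datadog-agent-pod> -c agent -- agent status | grep -A 10 kube_scheduler
# a working integration should appear among the running checks; here, kube_scheduler is simply absent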
Investigation and Findings
As a first step, I thoroughly checked the configuration content and walked through the helm chart values reference, and I did not see anything wrong.
I've previously set up the Datadog helm chart for control plane monitoring on EKS clusters and on docker-desktop / minikube. There, the identical configuration doesn't work 100%, but at least the integrations are detected correctly by auto-discovery. The container names I saw by running docker ps on control nodes have the right short name & image name (which are what datadog-agent uses to derive ad_identifiers). So I'm confident the configuration (especially the ad_identifiers section) is not the problem.
The next thing I did was turn on debug logging (datadog.logLevel: debug, logs available at /var/log/datadog/agent.log) for the datadog helm chart on both my kops cluster and docker-desktop / minikube. From the debug log I figured out roughly how datadog-agent auto-discovery works:
- file-based configurations (/etc/datadog-agent/conf.d/) are loaded into memory, and running containers & processes are detected
- each detected container/process gets an identifier, which is compared against the configurations of integrations with auto-discovery turned on (via ad_identifiers)
- once an ad_identifier is matched, the rest of the yaml configuration is used for the integration.
The process above can be verified via the debug log.
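To see the matching (or the lack of it) yourself, you can search the debug log from inside the agent container. The grep patterns below are just keywords I found useful, not exact log messages, and the pod name is a placeholder:
kubectl -n datadog exec -it <datadog-agent-pod> -c agent -- \
  grep -iE 'ad_identifiers|kube-scheduler|docker://' /var/log/datadog/agent.log | tail -n 50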
On a control node, the desired container (kube-scheduler, same for the other 3 components) is NOT identified as kube-scheduler. I noticed many containers were identified by container ids (in the format "docker://<container_id>"), but none of those container ids match the actual container id of kube-scheduler (you can find the container id via kubectl describe pod/<kube-scheduler-pod-name>, or by ssh-ing to the control node and running docker ps).
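For completeness, the two lookups mentioned above look roughly like this; the pod naming follows the usual static-pod convention of <component>-<node-name>, which may differ on your cluster:
kubectl -n kube-system describe pod kube-scheduler-<control-node-name> | grep -i 'container id'
# or, on the control node itself:
docker ps --no-trunc | grep kube-scheduler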
Either kube-scheduler (same for the other 3 components) is not detected, or it is detected under a container id that doesn't match its own.
This is where I realized there weren't any further actionable steps with this approach. Fortunately, my goal is to get the integrations working on control nodes, one way or another, and I was able to come up with an alternative solution.
Solution
The TL;DR version of the solution is: use file-based configuration without auto-discovery.
Integrations are driven by configuration files (located under /etc/datadog-agent/conf.d/). The helm-native approach mentioned above works by converting the datadog.confd key-value pairs into one auto_conf.yaml per integration. The non-helm solution is to provision your own conf.yaml for each integration.
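As a rough sketch of what that means inside the agent container (the directory layout follows the agent's standard conf.d convention; the exact files present depend on your chart version and settings):
ls /etc/datadog-agent/conf.d/kube_scheduler.d/
# auto_conf.yaml  -> autodiscovery-style config (what the helm confd path ends up driving)
# conf.yaml       -> static, file-based config (what we will mount ourselves below)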
To bypass the auto-discovery issue on a kops-provisioned cluster, we can:
- provision a ConfigMap with the desired configurations
- mount the ConfigMap as volume(s) to datadog-agent: agents.volumes + agents.volumeMounts
- replace template variables with resolvable ones
- disable auto-config (synonym for "autodiscovery") for the integrations: datadog.ignoreAutoConfig
Datadog configuration k8s ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-datadog-configmap
data:
  kube_apiserver_metrics.yaml: |+
    init_config:
    instances:
      - prometheus_url: https://%%env_DD_KUBERNETES_KUBELET_HOST%%:443/metrics
        tls_verify: false
        bearer_token_auth: true
        bearer_token_path: /var/run/secrets/kubernetes.io/serviceaccount/token
  etcd.yaml: |+
    init_config:
    instances:
      # etcd-manager-main
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4001/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key
      # etcd-manager-events
      - prometheus_url: "https://%%env_DD_KUBERNETES_KUBELET_HOST%%:4002/metrics"
        tls_verify: false
        tls_cert: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.crt
        tls_private_key: /host/etc/kubernetes/pki/etcd-manager-events/etcd-clients-ca.key
  kube_scheduler.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10251/metrics"
        ssl_verify: false
  kube_controller_manager.yaml: |+
    init_config:
    instances:
      - prometheus_url: "http://%%env_DD_KUBERNETES_KUBELET_HOST%%:10252/metrics"
        ssl_verify: false
Explanation
- Template variables are specific to the autodiscovery feature. In the context of non-autodiscovery configurations, not all template variables can be resolved, e.g. %%host%% does not resolve. Fortunately, %%env_<ENV_VAR>%% resolves fine.
- kops provisions 2 etcd clusters: main and events. 2 instances of the etcd integration are required, with slightly different tls_cert and tls_private_key (although I've verified these are interchangeable).
- kops uses etcd-manager as the parent process for etcd. Ports 2380/2381 are for peer communication (server-to-server), and 4001/4002 are for client communication (client-to-server). In this case the agent acts as a "client" of the etcd server, so ports 4001/4002 are desired (instead of the 2379 port in a normal etcd setup).
- kube-scheduler serves HTTP on port 10251 and HTTPS on port 10259.
- kube-controller-manager serves HTTP on port 10252 and HTTPS on port 10257.
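To double-check these ports, I found it worth curl-ing the endpoints directly from a control node. The commands below assume the components answer on localhost (substitute the node IP if they don't) and reuse the etcd client cert/key paths from the ConfigMap above (host paths, without the /host prefix):
curl -s http://localhost:10251/metrics | head            # kube-scheduler (HTTP)
curl -s http://localhost:10252/metrics | head            # kube-controller-manager (HTTP)
curl -sk https://localhost:4001/metrics \
  --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt \
  --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key | head   # etcd main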
values.yaml for the datadog helm chart
datadog:
  ignoreAutoConfig:
    - etcd
    - kube_scheduler
    - kube_controller_manager
    - kube_apiserver_metrics
agents:
  volumes:
    - name: my-config
      configMap:
        name: my-datadog-configmap
    - name: etcd-pki
      hostPath:
        path: /etc/kubernetes/pki
  volumeMounts:
    - name: etcd-pki
      mountPath: /host/etc/kubernetes/pki
      readOnly: true
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_apiserver_metrics.d/conf.yaml
      subPath: kube_apiserver_metrics.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/etcd.d/conf.yaml
      subPath: etcd.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_scheduler.d/conf.yaml
      subPath: kube_scheduler.yaml
    - name: my-config
      mountPath: /etc/datadog-agent/conf.d/kube_controller_manager.d/conf.yaml
      subPath: kube_controller_manager.yaml
Explanation
- Certificates and private keys (located under /etc/kubernetes/pki on the host) are required for etcd client-to-server communication.
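To tie it together, rolling this out looks roughly like the following; the release name, namespace, and file names are placeholders for whatever your setup uses:
# provision the ConfigMap, then upgrade the Datadog release with the values above
kubectl -n datadog apply -f my-datadog-configmap.yaml
helm -n datadog upgrade datadog datadog/datadog -f values.yaml
# once the agent pods restart, `agent status` on a control node should list
# etcd, kube_scheduler, kube_controller_manager, and kube_apiserver_metrics as running checks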