Use OpenTelemetry with DataDog: good idea or not?
I’ve been testing OpenTelemetry with a DataDog backend over the past few days, comparing the pros and cons of the different ways to do it (and of a non-OpenTelemetry implementation too). It turned out to be something worth sharing, and I hope it will be helpful to others.
What are OpenTelemetry and DataDog?
Googling does a much better job than me trying to explain this (e.g.).
Of the two, I think the former is lesser-known, and I recommend reading the OpenTelemetry documentation a bit before diving deeper.
Who am I
I’m a full-stack software engineer with a primary focus on backend DevOps technologies. I’m a minimalist and I try my best to write code in the same way.
I’m an early adopter of DataDog (since 2016) and have a good understanding of most related topics. Since around the same time, I’ve been working on Observability-related projects for various large public corporations.
TL;DR version of my assessment
The libraries and tools provided by DataDog are better tailored for a DataDog backend. If you:
- are heavily invested in DataDog as your only observability solution, and
- have no intention to migrate to a different stack,
then sticking with DataDog libraries/packages is the best choice for now.
If you have plans for:
- migrating to a different observability stack, or
- supporting more than one observability backend,
then OpenTelemetry is the right way to go (with some caveats).
Reasons for my assessment are covered in detail below.
As far as I know, there are 4 ways to integrate your services with DataDog, 3 of which use OpenTelemetry components.
- Instrument the application with DataDog libraries (e.g. ddtrace for APM) and use datadog-agent to send telemetry to your DataDog backend.
- Instrument the application with opentelemetry-api + opentelemetry-sdk + opentelemetry-exporter-datadog, and use datadog-agent to send telemetry to your DataDog backend.
- Instrument the application with opentelemetry-api + opentelemetry-sdk + opentelemetry-exporter-otlp, and use datadog-agent (with OTEL ingestion turned on) to send telemetry to your DataDog backend.
- Instrument the application with opentelemetry-api + opentelemetry-sdk + opentelemetry-exporter-otlp, and use opentelemetry-collector to send telemetry to your DataDog backend.
Let’s assess these options one at a time. I’ll assess options 2 and 3 first, then go through 1 and 4 in more detail.
Option 2
Pros:
- application code is **almost** vendor-agnostic.
- if you’re already on DataDog and want to try OpenTelemetry, this is an application-level-only change and is compatible with your existing setup.
Cons:
- the application-level telemetry exporters are not available for all languages. DataDog provided implementations for Ruby, JS and Python, but I failed to find equivalents for other languages (e.g. Golang).
- these application-level telemetry exporter libraries are being deprecated by DataDog in favor of option 3.
Example: otel-datadog-apm-py (OTEL ingestion part is turned off)
Option 3
Pros:
- application instrumentation code is vendor-agnostic.
- if you’re already on DataDog and want to try OpenTelemetry, this is a very compatible solution: you continue using datadog-agent.
Cons:
- there’s a minimum agent version requirement (6.32+/7.32+), so you may have to upgrade your agent first.
- based on my experience: as of version 7.35.0, the ingestion part worked, but the agent failed to send the telemetry on to the DataDog backend (issue).
I’d love to be told that the issue I experienced is caused by misconfiguration. But because this is still a work-in-progress feature without much adoption, there aren’t many examples to help me be confident that I set it up correctly.
Part of the pain is the “work in progress” state (the configuration YAML format changes with each version) and the lack of adoption.
Assessment: no until I figure out whether the problem is on my side; after that, it depends on the answer.
Example: otel-datadog-apm-py (same as option 4)
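For reference, turning on OTEL (OTLP) ingestion in the agent looked roughly like this when I tested. The key layout has been moving between versions, exactly as noted above, so this is a sketch for the 7.3x line rather than an authoritative config:

```yaml
# datadog.yaml fragment (agent 7.3x sketch; keys have changed across versions)
otlp_config:
  receiver:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```

With this in place the application exports OTLP to port 4317/4318 on the host, and the agent forwards to DataDog as usual.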
Option 1
Pros:
- simple and widely adopted (by all DataDog users)
Cons:
- instrumentation libraries are tailored for DataDog, and telemetry data is transferred in DataDog’s proprietary protocol.
- the instrumentation implementation is vendor-specific, and this tight coupling makes it harder to change the stack.
- the delay between telemetry data being created and showing up on the backend is on the order of minutes; longer than ideal but not a big deal, and it can probably be reduced with a configuration change on the agent.
Example: otel-datadog-apm-py (same as option 1)
Option 4
Pros:
- the solution is vendor-agnostic.
- telemetry data is sent via the OpenTelemetry protocol, and the collector supports other solutions (Jaeger, Zipkin, Prometheus, etc.) out of the box.
Cons:
- telemetry received on the backend does not have as many details as option 1. It should be possible to address this, provided DataDog supports all of these details in the collector’s DataDog exporter and the “mapping” is well documented and easily discoverable.
- if you’re currently on DataDog, you would need to update your server provisioning to make sure the OpenTelemetry collector is provisioned in a similar way to datadog-agent.
- if you also need datadog-agent for the official integrations, you end up with 2 components per server doing similar jobs.
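For option 4, a minimal collector pipeline with the DataDog exporter looks roughly like this (API key supplied via environment variable; a sketch, not a production config):

```yaml
# opentelemetry-collector config sketch: receive OTLP, export to DataDog
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch:
exporters:
  datadog:
    api:
      key: ${DD_API_KEY}
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]
```

The vendor swap mentioned earlier really is this small: replacing `datadog` with, say, a `jaeger` or `zipkin` exporter in the pipeline is the whole change. Note that the DataDog exporter ships in the opentelemetry-collector-contrib distribution, not the core collector.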
I’ve built a test project with option 1 (dir=http-dd) and option 4 (dir=http-otel) for comparison. Both directories contain a Python Flask HTTP server that talks to 2 different Redis instances. Apart from the instrumentation code, the HTTP servers are identical.
APM services view
- “another-service-redis”: a Redis instance shared by 2 HTTP servers
- “otel-datadog-apm-py-dd-api”: HTTP server instrumented with ddtrace
- “otel-datadog-apm-py-dd-redis”: a Redis dedicated to ddtrace-instrumented HTTP server (logically)
- “otel-datadog-apm-py-otel-api”: HTTP server instrumented with opentelemetry
- “otel-datadog-apm-py-otel-redis”: a Redis dedicated to opentelemetry-instrumented HTTP server (logically)
“Operation” and “Resource”
From the operations and resources breakdown, I like ddtrace’s implementation better. I’m sure it’s only a matter of time before we learn how to configure OTEL telemetry (some specific resource/span attributes, or a collector configuration) to control operations and resources in the APM services view.
Now let’s take a look at the trace detailed view. Comparing the same web API:
You can see that ddtrace provided more details in the trace of the same API (with an identical implementation). As with operations and resources, I believe this can be addressed once DataDog provides documentation and examples mapping OTEL attributes to DataDog APM domain attributes.
I spent a bit more time on the OpenTelemetry implementation, but I’m biased since I’m more experienced with the DataDog libraries than with the OpenTelemetry ones.
For someone who knows neither, I would argue the dev-cost difference is negligible.
Coming into this comparison, I actually favored OpenTelemetry much more than DataDog’s implementation: I like the fact that instrumentation is not vendor-specific, and switching vendors is as simple as a one-line YAML change.
But if you don’t need to support multiple observability solutions, I have to admit that the DataDog libraries are the better option for now (April 2022). It may take some time for DataDog to provide mature OpenTelemetry support, and my assessment may be different at that point.