OpenTelemetry Explained

Every engineering team hits the same moment. Customers are growing, an outage costs real money for the first time, and somebody says "we need monitoring." What follows is usually a rushed vendor decision, a pile of half-configured dashboards, and an on-call rotation that learns to ignore its own alerts within a quarter. The root problem is rarely the vendor. It is that the team skipped the layer underneath the vendor, the layer that decides what data exists at all. That layer is OpenTelemetry, and understanding it properly is the difference between observability you trust and dashboards you decorate.

What OpenTelemetry is, and where it belongs in your stack

OpenTelemetry, abbreviated OTel, is an open source standard for generating, collecting, and exporting telemetry from software systems. Telemetry means three signals. Traces record the path of a single request as it moves through your services, broken into timed units called spans. Metrics record numeric measurements over time, like request counts, latencies, and queue depths. Logs record discrete events with whatever context you attach to them. OTel defines the APIs for producing all three, a wire protocol for shipping them, and a vocabulary for naming things consistently. It reached graduated status in the Cloud Native Computing Foundation in 2026 and is, by contributor velocity, the second largest project in the cloud native ecosystem behind Kubernetes. Every observability backend that matters can ingest it.

The mental model that makes everything click is to split your observability stack into three layers. The bottom layer is instrumentation, the code inside your application that produces signals. The middle layer is the pipeline, which collects, processes, and routes those signals. The top layer is the backend, where data is stored, queried, visualized, and alerted on. OpenTelemetry standardizes the bottom two layers and deliberately stays out of the third. It has no UI, no storage engine, and no alerting. This is a feature, not a gap. By owning instrumentation and transport as a vendor-neutral standard, OTel makes the backend a swappable decision. You instrument once and you can send the same data to Datadog, Grafana, OnePatch, or a ClickHouse table you own, and you can change your mind later by editing a config file instead of re-instrumenting your codebase.

What you get out of the box is substantial. Automatic instrumentation for the common frameworks and clients in most major languages, so the building blocks of a typical service, your HTTP server, database drivers, caches, queues, and outbound calls, produce spans without you writing any tracing code. Context propagation, so a request that touches six services produces one connected trace instead of six orphaned fragments. A protocol, OTLP, that every serious backend accepts. And semantic conventions, which sound boring and are quietly the most valuable part, because they mean an HTTP span from a Python service and an HTTP span from a Node service carry the same attribute names, so any tool, any query, and any AI agent reading your telemetry knows what it is looking at without per-service translation rules.

How OpenTelemetry actually works

The API and the SDK are deliberately separate

The first architectural decision to internalize is the split between the OTel API and the OTel SDK. The API is a set of interfaces: get a tracer, start a span, record a metric, emit a log. The SDK is the concrete implementation that buffers, samples, batches, and exports. The API ships as a near-zero-dependency package, and every call against it is a no-op unless an SDK is installed and configured.

This split exists for library authors, and it is why the whole ecosystem works. A library like an HTTP client or a queue framework can instrument itself against the API without forcing any observability stack on its users. If the end application configures an SDK, the library's spans flow. If it does not, the calls cost nearly nothing and disappear. The application owns the SDK, configures one TracerProvider, one MeterProvider, and one LoggerProvider at startup, and everything in the process, your code and your dependencies alike, emits through them.
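
To make the split concrete, here is a minimal sketch of the idea in plain Python. The classes and functions below (NoOpTracer, RecordingTracer, set_tracer, get_tracer, library_fetch) are hypothetical stand-ins invented for illustration, not the real opentelemetry-api or opentelemetry-sdk packages, and the real SDK does far more (sampling, batching, export).

```python
# Illustrative sketch only: hypothetical stand-ins, not the real OTel packages.
from contextlib import contextmanager


class NoOpTracer:
    """What an API-only process gets: every call is safe and records nothing."""

    @contextmanager
    def start_span(self, name):
        yield None


class RecordingTracer:
    """Stand-in for an SDK tracer that actually keeps what it is given."""

    def __init__(self):
        self.finished = []

    @contextmanager
    def start_span(self, name):
        try:
            yield name
        finally:
            # Record the span when the block exits, even if it raised.
            self.finished.append(name)


_tracer = NoOpTracer()  # the default until an application installs something


def set_tracer(tracer):
    """Called once by the application at startup, never by a library."""
    global _tracer
    _tracer = tracer


def get_tracer():
    return _tracer


def library_fetch(url):
    """A library instruments itself against the API and nothing else."""
    with get_tracer().start_span("GET " + url):
        return "response from " + url


# No SDK installed: the library works and emits nothing.
library_fetch("/health")

# The application installs an SDK: the same library code now emits spans.
sdk = RecordingTracer()
set_tracer(sdk)
library_fetch("/orders")
```

The point of the design is that library_fetch never changes. Whether its spans go anywhere is decided entirely by what the application installs at startup.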

Traces, spans, and context propagation

A trace is a tree. Each node is a span, with a name, a start time, a duration, a status, a bag of key-value attributes, optional timestamped events, and two identifiers that make the tree possible: a span_id of its own and the span_id of its parent. All spans in one request share a trace_id. When a request crosses a process boundary, the calling side serializes the current trace context into a traceparent HTTP header, defined by the W3C Trace Context standard, and the receiving side deserializes it and continues the same trace. This is context propagation, and it is the single most important mechanism in distributed tracing, because it is what turns "six services each logged something" into "here is the exact causal chain, with timing, of what this one request did."
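
The header itself is small enough to build and parse by hand, which helps demystify it. The sketch below follows the version 00 layout from W3C Trace Context (version, trace id, parent span id, flags, all lowercase hex). The helper names inject and extract are chosen for illustration, and the parser is deliberately strict: it rejects any version other than 00, which is a simplification of what the spec asks of real implementations.

```python
import re
import secrets

# version "00" - 32 hex chars of trace id - 16 hex chars of span id - 2 hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id():
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(trace_id, span_id, sampled=True):
    """Caller side: serialize the current context into a header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def extract(header):
    """Receiver side: parse the header, or return None if it is malformed."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero ids are defined as invalid by the spec.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# Service A makes an outbound call.
trace_id = new_trace_id()
caller_span_id = new_span_id()
header = inject(trace_id, caller_span_id)

# Service B continues the same trace with a new span parented on A's span.
ctx = extract(header)
child_span = {
    "trace_id": ctx["trace_id"],
    "span_id": new_span_id(),
    "parent_span_id": ctx["parent_span_id"],
}
```

The child span shares the caller's trace_id and points at the caller's span_id as its parent, which is all a backend needs to reassemble the tree.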

Propagation through HTTP is automatic once instrumentation is installed. Propagation through asynchronous boundaries, like a job queue, requires the producer to write the context into the job payload and the consumer to restore it, which is why first-class queue instrumentation matters.
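
A rough sketch of what queue instrumentation does on your behalf, assuming a JSON job envelope and an in-process queue. The names current_traceparent, enqueue, and work are hypothetical, and the context variable stands in for the active context an SDK would track.

```python
import contextvars
import json
import queue

# Hypothetical stand-in for "the active trace context" an SDK would manage.
current_traceparent = contextvars.ContextVar("current_traceparent", default=None)


def enqueue(q, job_name, args):
    """Producer: copy the active context into the job envelope."""
    envelope = {
        "job": job_name,
        "args": args,
        "traceparent": current_traceparent.get(),
    }
    q.put(json.dumps(envelope))


def work(q, handler):
    """Consumer: restore the context before running the handler."""
    envelope = json.loads(q.get())
    token = current_traceparent.set(envelope.get("traceparent"))
    try:
        return handler(envelope["args"])
    finally:
        # Never leak one job's context into the next job on this worker.
        current_traceparent.reset(token)


jobs = queue.Queue()

# Inside a request that is being traced, a job is enqueued.
token = current_traceparent.set(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
enqueue(jobs, "send_email", {"to": "a@example.com"})
current_traceparent.reset(token)  # the request finishes; its context is gone

# Later, a worker picks the job up and sees the original context again.
seen_by_worker = work(jobs, lambda args: current_traceparent.get())
```

Without the traceparent field in the envelope, the worker would start a brand new trace and the causal link back to the request would be lost.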

Metrics and logs

OTel metrics come in a small number of instrument types: counters for things that only go up, up-down counters for things that fluctuate, gauges for point-in-time observations, and histograms for distributions. Histograms deserve a special mention because latency is a distribution, not a number, and averages lie. OTel supports exponential bucket histograms, which adjust bucket boundaries automatically and give you p95 and p99 estimates within a bounded relative error, without hand-tuning bucket edges. When someone asks why their dashboard average looks fine while customers complain, the answer is almost always hiding in a histogram they did not record.
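
The "averages lie" point is easy to check with arithmetic. The latencies below are made up for illustration, and the percentile helper uses the simple nearest-rank method on raw values rather than anything a real histogram does internally.

```python
import math
import statistics


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p percent
    of the data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Made-up data: 97 requests take 20 ms, 3 requests hit a 2 second slow path.
latencies_ms = [20] * 97 + [2000] * 3

mean_ms = statistics.mean(latencies_ms)   # 79.4
p50_ms = percentile(latencies_ms, 50)     # 20
p99_ms = percentile(latencies_ms, 99)     # 2000
```

A dashboard showing a 79 ms average looks healthy, while three in every hundred customers wait two full seconds. Only the distribution shows both facts at once.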

Logs in OTel take a pragmatically different path from traces and metrics. Rather than asking you to replace your logging library, OTel bridges the one you already use, attaching the active trace_id and span_id to every log record emitted inside a span. This correlation is the payoff: from a slow span you can jump to exactly the log lines that request produced, and from a suspicious log line you can jump to the full trace around it. Teams that have never had log-trace correlation tend to underestimate it, and then cannot live without it.
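
The mechanism behind that correlation can be sketched with nothing but the standard logging module. This is an illustration of the idea, not the actual OTel logging bridge: the active_span context variable and TraceContextFilter are hypothetical names, and a real bridge reads the ids from the SDK's current span.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span an SDK would track.
active_span = contextvars.ContextVar("active_span", default=None)


class TraceContextFilter(logging.Filter):
    """Stamp every record with the ids of whatever span is active."""

    def filter(self, record):
        span = active_span.get()
        record.trace_id = span["trace_id"] if span else "-"
        record.span_id = span["span_id"] if span else "-"
        return True  # never drop the record, only enrich it


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"
    )
)
handler.addFilter(TraceContextFilter())

log = logging.getLogger("checkout")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the example's output in our stream only

log.info("outside any span")

token = active_span.set(
    {"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"}
)
log.info("charging card")
active_span.reset(token)

lines = stream.getvalue().splitlines()
```

The application code still calls log.info exactly as before. The ids arrive on the record as a side effect, which is why the bridge approach does not require replacing your logging library.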

OTLP, the wire protocol

The OpenTelemetry Protocol is how signals leave your process. It is protobuf-encoded by default, shipped over gRPC (default port 4317) or HTTP (default port 4318), and it carries all three signals in one protocol with consistent resource attributes attached, so every span, metric point, and log record identifies the service.name, environment, and version it came from. The practical consequence of OTLP being a stable standard is that "where does my telemetry go" becomes a destination URL, not an architecture decision.
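
Here is a small sketch of how a destination URL and resource attributes fall out of configuration. The environment variable names are the standard OTel ones, and the ports and /v1/ paths are the OTLP defaults. The helper functions, the hostname, and the attribute values are made up for illustration, and the environment attribute key is the one used by recent semantic conventions, which may differ in older setups.

```python
# Default endpoints and per-signal HTTP paths from the OTLP specification.
DEFAULT_ENDPOINTS = {
    "grpc": "http://localhost:4317",
    "http/protobuf": "http://localhost:4318",
}
HTTP_PATHS = {"traces": "/v1/traces", "metrics": "/v1/metrics", "logs": "/v1/logs"}


def signal_url(env, signal):
    """Resolve where one signal is sent, given exporter settings."""
    protocol = env["OTEL_EXPORTER_OTLP_PROTOCOL"]
    base = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINTS[protocol])
    if protocol == "grpc":
        return base  # one endpoint; the gRPC service name selects the signal
    return base.rstrip("/") + HTTP_PATHS[signal]


def resource_attributes(env):
    """Parse the 'key=value,key=value' resource attribute format."""
    attrs = {}
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    if "OTEL_SERVICE_NAME" in env:
        # The dedicated variable wins over service.name in the list above.
        attrs["service.name"] = env["OTEL_SERVICE_NAME"]
    return attrs


# A plain dict stands in for os.environ; all values are placeholders.
env = {
    "OTEL_SERVICE_NAME": "checkout",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector.internal:4318",
    "OTEL_RESOURCE_ATTRIBUTES": "service.version=1.4.2,deployment.environment.name=production",
}

traces_url = signal_url(env, "traces")
resource = resource_attributes(env)
```

Pointing the same service at a different backend is a change to one value in env, which is the whole argument for treating the destination as configuration.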

The Collector

The OpenTelemetry Collector is a standalone process that receives telemetry, processes it, and exports it onward. Its config has three sections that mirror that flow: receivers, processors, and exporters, wired together into pipelines. A minimal production pipeline receives OTLP from your services, applies a memory limiter to protect the Collector itself from telemetry spikes, batches for export efficiency, and ships to one or more backends.
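
The shape of that minimal pipeline is sketched below as a Python dict so it can be checked in code. The real Collector config is YAML with the same keys. The component names (otlp, memory_limiter, batch, otlphttp) are real Collector components, while the backend URL and memory limit are placeholder values, not recommendations, and undefined_components is a hypothetical helper that mimics one check the Collector performs at startup.

```python
# Same structure as a Collector YAML file; values are placeholders.
collector_config = {
    "receivers": {
        "otlp": {
            "protocols": {
                "grpc": {"endpoint": "0.0.0.0:4317"},
                "http": {"endpoint": "0.0.0.0:4318"},
            }
        }
    },
    "processors": {
        "memory_limiter": {"check_interval": "1s", "limit_mib": 512},
        "batch": {},
    },
    "exporters": {
        "otlphttp": {"endpoint": "https://backend.example.com"},
    },
    "service": {
        "pipelines": {
            # memory_limiter goes first so it can push back before other work.
            signal: {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["otlphttp"],
            }
            for signal in ("traces", "metrics", "logs")
        }
    },
}


def undefined_components(config):
    """List pipeline entries that reference a component nobody declared."""
    missing = []
    for pipeline_name, pipeline in config["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for component in pipeline.get(section, []):
                if component not in config.get(section, {}):
                    missing.append((pipeline_name, section, component))
    return missing


problems = undefined_components(collector_config)
```

Declaring a component in the top-level sections does nothing by itself. It only takes effect once a pipeline under service lists it, which is a common source of "I configured it and nothing happened."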

You can run without a Collector by exporting directly from the SDK to a backend, and for a weekend project that is fine. For anything real, the Collector earns its place quickly. It decouples your applications from your backends, so credentials, retries, and buffering live in one place instead of in every service. It lets you fan out, sending the same stream to two backends during a migration or sending traces one place and metrics another. And it is where the grown-up controls live: tail sampling, which decides whether to keep a trace after seeing all of it, so you keep one hundred percent of errors and slow requests while sampling away the boring ninety-nine percent of healthy traffic; filtering and transformation, for scrubbing PII and dropping noisy attributes; and cardinality control, which protects both your bill and your query performance. The common deployment pattern is an agent Collector per host or as a sidecar, forwarding to a small gateway tier of Collectors that hold the centralized policy.
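
The tail sampling decision described above can be sketched in a few lines. This is an illustration of the policy, not the Collector's implementation: the span dict shape, the keep_trace name, the one second threshold, and the one percent baseline are all assumptions made for the example.

```python
import zlib


def keep_trace(spans, slow_ms=1000, baseline_percent=1):
    """Decide whether to keep a trace once all of its spans have arrived."""
    # Policy 1: always keep traces that contain an error.
    if any(span["status"] == "error" for span in spans):
        return True
    # Policy 2: always keep traces that contain a slow span.
    if any(span["duration_ms"] >= slow_ms for span in spans):
        return True
    # Policy 3: keep a small, deterministic share of healthy traffic.
    # Hashing the trace id means every Collector makes the same choice.
    trace_id = spans[0]["trace_id"]
    return zlib.crc32(trace_id.encode("ascii")) % 100 < baseline_percent


healthy = [
    {"trace_id": "a" * 32, "status": "ok", "duration_ms": 40},
    {"trace_id": "a" * 32, "status": "ok", "duration_ms": 12},
]
failed = [
    {"trace_id": "b" * 32, "status": "ok", "duration_ms": 35},
    {"trace_id": "b" * 32, "status": "error", "duration_ms": 8},
]
slow = [
    {"trace_id": "c" * 32, "status": "ok", "duration_ms": 2400},
]

decisions = {
    "failed": keep_trace(failed, baseline_percent=0),
    "slow": keep_trace(slow, baseline_percent=0),
    "healthy": keep_trace(healthy, baseline_percent=0),
}
```

The reason this has to happen in the Collector rather than in the SDK is visible in the function signature: the decision needs every span of the trace, and no single service ever holds all of them.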

Semantic conventions

Semantic conventions are the agreed dictionary: an HTTP server span carries http.request.method and http.response.status_code, a database span carries db.system and the statement, a messaging span carries the queue name and operation, and every signal carries resource attributes like service.name. There are now conventions for generative AI workloads too, standardizing how LLM calls, token usage, and agent tool executions appear as spans. Conventions are what make telemetry machine-readable rather than merely machine-storable. Any system that wants to reason across services, whether it is a dashboard template, an anomaly detector, or an AI agent doing root cause analysis, depends on the same concept having the same name everywhere. When we say OnePatch's agents reason over OTel natively, this dictionary is a large part of what makes that possible.
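
A small illustration of what the shared dictionary buys you. The spans below are invented, and service.name is flattened onto each span for brevity even though it is a resource attribute in real data. The attribute names are the ones mentioned above.

```python
from collections import Counter

# Invented spans from two services written in different languages.
spans = [
    {"service.name": "checkout-py", "http.request.method": "POST", "http.response.status_code": 500},
    {"service.name": "cart-node", "http.request.method": "GET", "http.response.status_code": 200},
    {"service.name": "cart-node", "http.request.method": "POST", "http.response.status_code": 503},
    {"service.name": "checkout-py", "http.request.method": "POST", "http.response.status_code": 201},
]


def server_errors_by_service(spans):
    """Count 5xx responses per service using only conventional names."""
    errors = Counter()
    for span in spans:
        status = span.get("http.response.status_code")
        if status is not None and status >= 500:
            errors[span["service.name"]] += 1
    return dict(errors)


error_counts = server_errors_by_service(spans)
```

The function contains no per-service branches. If the Node service had called the attribute statusCode and the Python service had called it http_status, the same question would need a translation table that someone has to maintain forever.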

How vendors consume OTel: Datadog and Sentry as case studies

The standard ends at the backend's front door, and what vendors do at that door varies in ways worth understanding, because it tells you what your data becomes after you ship it.

Datadog treats OTLP as an ingestion format. You can send OTLP to the Datadog Agent, which has OTLP receivers built in, or run an OTel Collector with the Datadog exporter, or use Datadog's own distribution of the Collector. In all three paths, your OTel data is translated into Datadog's internal model: resource attributes are remapped onto Datadog's tagging conventions, OTel metrics from your SDKs are billed as custom metrics, and a handful of Datadog's proprietary products work only with their native instrumentation libraries rather than OTel SDKs. None of this is a criticism; it is the predictable shape of a platform that predates the standard and carries fifteen years of proprietary machinery. The takeaway for you is simply that OTLP into Datadog is well supported but translated, so test the features you depend on against OTel-instrumented services, not just the demos.

Sentry made the opposite bet, and it is an instructive one. Rather than translating OTel at the door, Sentry rebuilt the door. Their JavaScript SDK from version 8 onward is built directly on OpenTelemetry: under the hood, Sentry registers itself as a span processor inside the OTel pipeline, your code and your frameworks produce ordinary OTel spans, and Sentry's performance product consumes them, with root spans surfacing as Sentry transactions. They have since shipped direct OTLP ingestion as well, so an existing OTel setup can point at Sentry without Sentry-specific instrumentation. The reason this matters beyond Sentry is the direction of travel it demonstrates: the industry's newer architecture decisions collapse proprietary instrumentation into the standard, because maintaining a parallel instrumentation ecosystem stopped making sense once OTel's coverage exceeded what any single vendor could maintain alone.

The general pattern across runtimes and platforms is the same. Frameworks and infrastructure increasingly export OTel natively rather than waiting to be monkey-patched: Temporal's SDKs ship OpenTelemetry interceptors, BullMQ ships a telemetry interface with an official OTel implementation, and runtimes expose hooks the SDKs attach to. Instrumentation is migrating from something vendors bolt on from outside to something software emits from inside, in the standard's vocabulary. That migration is the strongest evidence that learning OTel is learning the durable layer.

One standard, different machinery per language

The OTel API looks nearly identical across languages, but the implementations have personalities, and it is worth knowing the personality of the ones you run. Node and Python make a good illustration. In Node, automatic instrumentation works by patching module loading, which imposes a hard ordering rule: the SDK must initialize before any instrumented module is loaded, typically via a dedicated instrumentation file pulled in with a --require or --import flag. Most "OTel produces no spans" reports in Node are this rule violated somewhere. In Python, instrumentation is packaged operationally instead: the opentelemetry-instrument command wraps your process with zero code changes, configured through environment variables, with the trade-off that pre-forking servers like Gunicorn need the SDK initialized per worker rather than once in the parent.
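
For the Gunicorn case, the usual shape of the fix is a post_fork server hook in the Gunicorn config file, which runs inside each worker right after it is forked. The sketch below shows only that shape: post_fork is Gunicorn's hook name, while init_telemetry is a hypothetical placeholder where your real provider and exporter setup would go, recording the process id here so the behavior is visible.

```python
# gunicorn.conf.py (sketch)
import os

initialized_pids = []


def init_telemetry():
    """Hypothetical placeholder. A real version would build the tracer,
    meter, and logger providers and their exporters here."""
    initialized_pids.append(os.getpid())


def post_fork(server, worker):
    """Gunicorn calls this in each worker process just after forking,
    so every worker gets its own SDK instance and export threads."""
    init_telemetry()
```

The reasoning is that background export threads started in the parent do not carry over into forked children, so setup done once before the fork tends to leave workers that create spans and never ship them.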

The same kind of texture exists in every runtime, whether it is Go's explicit-everything philosophy, the JVM's agent jar, or Ruby's middleware hooks. The differences never change what you should build, but reading your language's current setup docs rather than a two-year-old blog post will save you an afternoon of confusion per service.

What is still rough

A fair account of OpenTelemetry includes the parts that will frustrate you, because they exist and pretending otherwise helps nobody.

The signals are not equally mature. Traces are excellent everywhere. Metrics are solid. Logs are the youngest signal, and the quality of the logging bridge varies meaningfully by language, so in some ecosystems you will keep shipping logs through your existing pipeline next to OTel rather than through it, and that is a reasonable choice, not a failure. The JavaScript SDK in particular has a history of churn, with real breaking changes between major versions and an ecosystem of instrumentation packages that move at different speeds, so pinning versions and reading changelogs is not optional there. Instrumentation quality also varies by library: the popular frameworks are covered superbly, while the long tail ranges from good to abandoned, and you should skim the code of any instrumentation package before trusting it in production. The newer semantic conventions, including the generative AI ones, are still marked as in development, which means attribute names can shift under you. The Collector is genuinely another piece of infrastructure: it needs resources, monitoring of its own, and someone who understands its config. And instrumentation has a cost at runtime, small and worth paying, but measurable under load, so benchmark before and after on your hottest path rather than assuming.

None of this changes the conclusion, because every alternative has the same problems plus vendor lock-in. But teams that go in knowing where the rough patches are configure around them in a day, while teams that expected magic churn for a month and blame the standard.

Where OnePatch fits

Everything in this post is learnable, and we wrote it so you can do all of it yourself. The honest observation, after setting this up many times, is that almost nobody gets the whole way there. Instrumentation lands but conventions drift. The Collector gets deployed but tail sampling never gets configured. Dashboards get built for the services that existed at setup time and quietly miss the three services shipped since. Alert thresholds are whatever someone guessed during onboarding, and nobody owns revisiting them. The gap is not knowledge, it is sustained attention, and sustained attention is exactly what early teams cannot spare.

That gap is what OnePatch is built to close. OnePatch sets up this entire stack, the instrumentation, the pipeline, the dashboards, and the monitors, automatically against your codebase, and absorbs the rough edges above so you never meet them: version churn, instrumentation quality, Collector operations, and convention drift are our problem, not yours. The observability layer is table stakes, and we treat it that way: you describe what you want to see in plain language and dashboards and monitors exist, without the forty knobs. The product is what runs on top. When a monitor fires, an agent investigates with full access to your traces, metrics, logs, and code, separates signal from noise, performs the root cause analysis an on-call engineer would, and closes the loop the way an engineer would: a proposed fix as a pull request, a config change, a postmortem written while the evidence is fresh. OpenTelemetry is the substrate that makes an agent like that trustworthy, because everything it concludes is grounded in open, structured, standard data you can inspect yourself.

If you read this far, you now know the layer underneath the dashboards. If you would rather have it built, maintained, and watched for you, that is onepatch.dev.