Albert Alov
OpenTelemetry just standardized LLM tracing. Here's what it actually looks like in code.

Every LLM tool invents its own tracing format. Langfuse has one. Helicone has one. Arize has one. If you built your own — congratulations, you have one too.

OpenTelemetry just published a standard for all of them.

It defines how to name spans, what attributes a tool call should have, how to log prompts without leaking PII, and which span kind to use for an agent. It's called GenAI Semantic Conventions. It's experimental. And almost nobody has written about what it actually looks like when you implement it.

I know because I searched. "OTel GenAI semantic conventions" gives you spec pages. Zero practical articles. "How to trace LLM agent with OpenTelemetry" gives you StackOverflow questions with no answers.

We implemented it. Four PRs, a gap analysis, real before/after code. We also discovered, mid-implementation, that our traces never exported at all — but that's a different story.

Here's what the spec actually says, where we got it wrong, and what you should do today.


The wild west of LLM tracing

Right now, if you trace LLM calls, you're probably doing something like this:

```javascript
span.setAttribute("llm.provider", "openai");
span.setAttribute("llm.model", "gpt-4o");
span.setAttribute("llm.tokens.input", 150);
span.setAttribute("llm.cost", 0.003);
```

That's what we did in toad-eye v1. Made sense to us. Worked fine in our dashboards.

Problem: nobody else's dashboards understand these attributes. Switch from Jaeger to Arize Phoenix — reconfigure everything. Export traces to Datadog — they see raw spans with no LLM context. Your tracing is a walled garden. You built vendor lock-in into your own code.

This is exactly what OpenTelemetry was created to solve. And now it has a spec for GenAI.

Three types of GenAI spans

The spec defines three operations. Every LLM-related span gets one:

```
chat gpt-4o                    ← model call
invoke_agent orchestrator      ← agent invocation
execute_tool web_search        ← tool execution
```

The span name format is {operation} {name}. Not your custom format. Not gen_ai.openai.gpt-4o (that's what we had — no backend recognizes it).
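A tiny helper makes the convention hard to get wrong. This is a sketch, not part of any SDK: the `genAiSpanName` function is hypothetical, and the set below covers only the three operations this article discusses (the spec defines a few more, such as `embeddings`).

```javascript
// Build a spec-compliant GenAI span name: "{operation} {name}".
// Hypothetical helper; the naming convention itself is from the OTel GenAI spec.
const GENAI_OPERATIONS = new Set(["chat", "invoke_agent", "execute_tool"]);

function genAiSpanName(operation, name) {
  if (!GENAI_OPERATIONS.has(operation)) {
    throw new Error(`unknown gen_ai.operation.name: ${operation}`);
  }
  return `${operation} ${name}`;
}

genAiSpanName("chat", "gpt-4o");             // "chat gpt-4o"
genAiSpanName("execute_tool", "web_search"); // "execute_tool web_search"
```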

Here's what we changed:

Screenshot: span naming migration. The old format was invisible to every GenAI-aware backend.

Agent attributes — we had the paths wrong

If you're building agents (ReAct, tool-use, multi-step), the spec defines identity and tool attributes:

```javascript
// What OTel says:
span.setAttribute("gen_ai.agent.name", "weather-bot");
span.setAttribute("gen_ai.agent.id", "agent-001");
span.setAttribute("gen_ai.tool.name", "search");
span.setAttribute("gen_ai.tool.type", "function");

// What we had:
span.setAttribute("gen_ai.agent.tool.name", "search");  // wrong path
// gen_ai.agent.name — didn't exist at all
```

The gen_ai.agent.tool.name path looks reasonable. It even reads well. But the spec puts tool attributes at gen_ai.tool.*: flat, not nested under agent. Our format was, once again, invisible to any backend that follows the standard.

Content recording — the spec agrees with us (feels good)

This was the one thing we got right from day one, and it's worth calling out because most teams get it wrong.

The spec says: don't record prompts and completions by default. Instrumentations SHOULD NOT capture content unless explicitly enabled.

Three official patterns:

  1. Default: don't record. No prompt, no completion in spans. Privacy first.
  2. Opt-in via span attributes. gen_ai.input.messages and gen_ai.output.messages as JSON strings.
  3. External storage. Store content elsewhere, put a reference on the span.
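Patterns 1 and 2 fit in a few lines. A minimal sketch, assuming a hypothetical `setMessageAttributes` helper; the `gen_ai.input.messages` / `gen_ai.output.messages` attribute names are the spec's, and the `recordContent` flag mirrors toad-eye's privacy-first default:

```javascript
// Content recording, opt-in only. Default (pattern 1): record nothing.
// When explicitly enabled (pattern 2): content as JSON strings on the span.
function setMessageAttributes(span, messages, completion, recordContent = false) {
  if (!recordContent) return; // pattern 1: no prompt, no completion on spans
  span.setAttribute("gen_ai.input.messages", JSON.stringify(messages));
  span.setAttribute("gen_ai.output.messages", JSON.stringify([completion]));
}

// Minimal fake span for demonstration
const demoContentSpan = { attrs: {}, setAttribute(k, v) { this.attrs[k] = v; } };
setMessageAttributes(
  demoContentSpan,
  [{ role: "user", content: "hi" }],
  { role: "assistant", content: "hello" }
);
// demoContentSpan.attrs stays empty: the default records nothing
```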

We had recordContent: false as default since v1. When the spec confirmed this approach, it was one of those rare moments where your gut feeling gets validated by a committee of very smart people.

If you're logging prompts in spans by default — you might want to reconsider before your security team does it for you.

The honest gap analysis

Here's the full picture. No spin, no cherry-picking.

What we got right from day 1

| Our attribute | OTel spec | Verdict |
| --- | --- | --- |
| `gen_ai.provider.name` | `gen_ai.provider.name` | ✅ Exact match |
| `gen_ai.request.model` | `gen_ai.request.model` | ✅ Exact match |
| `gen_ai.usage.input_tokens` | `gen_ai.usage.input_tokens` | ✅ Exact match |
| `error.type` | `error.type` | ✅ Exact match |

What we got wrong

| What | Our version | OTel spec | Status |
| --- | --- | --- | --- |
| Span name | `gen_ai.openai.gpt-4o` | `chat gpt-4o` | Fixed |
| Tool name attribute | `gen_ai.agent.tool.name` | `gen_ai.tool.name` | Fixed |
| Custom attributes | `gen_ai.agent.step.*` | Reserved namespace | Moved to `gen_ai.toad_eye.*` |
| Agent identity | Didn't exist | `gen_ai.agent.name` | Added |

What we built beyond the spec

| Feature | Namespace | Why it's not in OTel |
| --- | --- | --- |
| Cost per request | `gen_ai.toad_eye.cost` | Pricing is vendor-specific |
| Budget guards | `gen_ai.toad_eye.budget.*` | Runtime enforcement ≠ observability |
| Shadow guardrails | `gen_ai.toad_eye.guard.*` | Validation is app-level |
| Semantic drift | `gen_ai.toad_eye.semantic_drift` | Quality metric, not trace standard |
| ReAct step tracking | `gen_ai.toad_eye.agent.step.*` | ReAct is one pattern; spec is pattern-agnostic |

The key insight: OTel spec covers WHAT happened. We cover WHY and HOW MUCH. Not competing — complementary. Your custom metrics go under your namespace. The spec's attributes go where backends expect them.
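In code, that separation is just a matter of which attribute names you use. A sketch with a hypothetical `setGenAiAttributes` helper; the `gen_ai.toad_eye.cost` name comes from the tables above:

```javascript
// Spec attributes go where backends expect them; vendor-specific data
// stays under the vendor's own sub-namespace.
function setGenAiAttributes(span, { provider, model, inputTokens, costUsd }) {
  // Spec-defined names: GenAI-aware backends key off these exactly
  span.setAttribute("gen_ai.provider.name", provider);
  span.setAttribute("gen_ai.request.model", model);
  span.setAttribute("gen_ai.usage.input_tokens", inputTokens);
  // Custom metric (cost is not in the spec): namespaced, not a bare gen_ai.* name
  span.setAttribute("gen_ai.toad_eye.cost", costUsd);
}

// Minimal fake span for demonstration
const demoSpan = { attrs: {}, setAttribute(k, v) { this.attrs[k] = v; } };
setGenAiAttributes(demoSpan, {
  provider: "openai",
  model: "gpt-4o",
  inputTokens: 150,
  costUsd: 0.003,
});
```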

The migration: dual-emit, don't break users

We didn't do a clean break. v2.4 emits both old and new attribute names:

```javascript
// New (OTel spec-compliant)
span.setAttribute("gen_ai.tool.name", toolName);

// Old (deprecated, still emitted for backward compat)
span.setAttribute("gen_ai.agent.tool.name", toolName);
```

Screenshot: attribute prefix migration diff. Old attributes get @deprecated, new ones follow the spec; both are emitted until v3.

An environment variable controls when to stop emitting deprecated attributes:

```
OTEL_SEMCONV_STABILITY_OPT_IN=gen_ai_latest_experimental
```

This was four PRs (#170, #171, #172, #173). v3 will remove the deprecated aliases entirely.
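The gating logic is small. A sketch, assuming a hypothetical `emitToolName` helper; the flag value is the one mentioned above:

```javascript
// Env-gated dual emit: spec attribute always, deprecated alias only
// until the opt-in flag says "latest only".
function emitToolName(span, toolName, optIn = process.env.OTEL_SEMCONV_STABILITY_OPT_IN || "") {
  const latestOnly = optIn.split(",").includes("gen_ai_latest_experimental");
  span.setAttribute("gen_ai.tool.name", toolName); // spec-compliant, always emitted
  if (!latestOnly) {
    // Deprecated alias, kept for backward compat until v3
    span.setAttribute("gen_ai.agent.tool.name", toolName);
  }
}
```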

The irony

While implementing all of this, we did a round of manual testing.

Turns out our traces never exported. At all. Ever. The OTel NodeSDK silently disables trace export when you pass spanProcessors: []. We had 252 passing tests. All of them mocked the SDK.

So we standardized our attributes perfectly — for traces that nobody could see.

We fixed both. Published six patch versions in one day. The full story is in article #2.

Which backends actually support this

This is the reason to care. Emit the right attributes today → six backends visualize your traces tomorrow:

| Backend | Recognizes GenAI spans | Agent visualization | Cost |
| --- | --- | --- | --- |
| Jaeger | Basic (nested spans) | Hierarchy view | Free |
| Arize Phoenix | Full GenAI UI | Agent workflow | Free tier |
| SigNoz | GenAI dashboards | Nested spans | Free / Cloud |
| Datadog | LLM Observability | Agent tracing | Paid |
| Langfuse | Full GenAI UI | Session view | Free tier |
| Grafana + Tempo | Query by attributes | Custom dashboards | Free |

No vendor lock-in. One set of attributes. Six places to visualize them.

What you should do today

If you're tracing LLM calls — even with custom code — aligning with the spec now saves you pain later. The conventions are experimental, but the direction is locked in.

Quick checklist:

  • Set gen_ai.operation.name on every LLM span: chat, invoke_agent, or execute_tool
  • Format span names as {operation} {model_or_agent_name}
  • Use official attributes: gen_ai.agent.name, gen_ai.tool.name, gen_ai.tool.type
  • Put YOUR custom attributes under YOUR namespace — not gen_ai.*
  • Don't record prompt/completion by default — make it opt-in
  • Test your traces in at least 2 backends (Jaeger + one GenAI-specific like Phoenix)
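The checklist applied end to end. This sketch uses an in-memory span stub so it runs standalone; in a real app the spans would come from an OpenTelemetry tracer, and the agent/tool names here are made up for illustration:

```javascript
// Minimal stand-in for an OTel span (real code: tracer.startActiveSpan)
function makeSpan(name) {
  return { name, attrs: {}, setAttribute(k, v) { this.attrs[k] = v; }, end() {} };
}

function traceAgentTurn() {
  // 1. Span name follows "{operation} {name}"
  const agentSpan = makeSpan("invoke_agent weather-bot");
  agentSpan.setAttribute("gen_ai.operation.name", "invoke_agent");
  agentSpan.setAttribute("gen_ai.agent.name", "weather-bot");

  // 2. Tool span uses the spec's flat gen_ai.tool.* paths
  const toolSpan = makeSpan("execute_tool search");
  toolSpan.setAttribute("gen_ai.operation.name", "execute_tool");
  toolSpan.setAttribute("gen_ai.tool.name", "search");
  toolSpan.setAttribute("gen_ai.tool.type", "function");

  // 3. Custom data lives under your own namespace
  agentSpan.setAttribute("gen_ai.toad_eye.cost", 0.003);

  // 4. No prompt/completion content recorded by default
  toolSpan.end();
  agentSpan.end();
  return [agentSpan, toolSpan];
}

traceAgentTurn();
```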

Full spec: OpenTelemetry GenAI Semantic Conventions
Agent spans: GenAI Agent Spans


Previous articles:

toad-eye — open-source LLM observability, OTel-native: GitHub · npm

🐸👁️

Top comments (21)

Mykola Kondratiuk

this is exactly the kind of writeup I was looking for. the fragmentation has been annoying - every observability tool doing its own thing means you learn the vendor not the pattern. been running a few AI agent projects and the tracing story was always the weakest part. having a real before/after with actual attribute names is way more useful than spec docs. the PII logging part especially - curious whether you ended up stripping at the SDK level or filtering in the collector

Lazar Nikolov

Great read @vola-trebla! Shameless plug - Sentry also does AI Agent Monitoring according to the OTel specs, and our free tier includes 5M spans (if you want to include us in your article).

jidonglab

solid writeup. the token counting inconsistency across providers is the part that bites hardest in practice — we've been building agent pipelines where the same prompt goes through Claude and GPT-4 depending on the task, and without a standard like this you end up maintaining two separate cost tracking implementations that drift apart over time.
one thing i'd add: for agentic workloads the context window usage per turn is almost more important than total tokens, since you're re-sending the full conversation each loop. being able to trace that per-span with OTel attributes would make it way easier to catch context bloat before it eats your budget.

Albert Alov • Edited

Great point about context bloat in agentic loops — this is the kind of thing that's invisible until your budget alert fires.

We track gen_ai.usage.input_tokens per span, so in a ReAct loop each chat child span shows how many tokens were re-sent. In Jaeger you can watch input_tokens grow with each iteration — that's your bloat signal. What we don't have yet is a context_window_utilization_ratio (input_tokens / model_max_context). Your comment just pushed this up our backlog (github.com/vola-trebla/toad-eye/is...)

On multi-provider cost drift — we normalize to a single gen_ai.toad_eye.cost USD attribute. Each provider's accumulateChunk() handles the differences (Anthropic splits across message_start/message_delta, OpenAI sends usage in the final chunk, Gemini overwrites per chunk). Different plumbing, same attribute out.

This topic probably deserves its own deep dive — too much nuance for a comment. Might be the next article.

Shaya K. • Edited

Why not route through a model routing system so you don't have multiple billings through diff providers? Something like OpenRouter, or using Azure's AI (they also support Claude) foundry and plugging into their endpoint API system?

jidonglab

yeah routing through something like OpenRouter works for the billing consolidation part. the tricky bit with OTel tracing specifically is that when you add a routing layer, the span hierarchy gets messy — you end up with the router's spans mixed into your application traces. Azure AI Foundry handles this better since the Claude endpoint there still emits standard OTel spans. for the token counting inconsistency though, routing doesn't fully solve it since each provider still reports usage differently. you'd still want a normalization layer sitting between the OTel exporter and your observability backend.

Shaya K.

So for my agentic system I'm running on a VPS with OpenClaw and plan to use model-router you're saying I should use something like this anyways? I'm more technical than the average user by far but I'm not a programmer myself

jidonglab

honestly if you're running on a VPS with OpenClaw and model-router, OTel tracing is probably overkill for your setup right now. it's more useful when you're running multiple models across different providers and need to debug which call is slow or failing. for a single-agent setup the logs from OpenClaw itself should give you enough visibility. where tracing starts paying off is when you add a second model or start chaining agents — then suddenly you need to trace a request across multiple hops and that's where OTel shines.

Azat Resumelink

Excellent breakdown of the GenAI Semantic Conventions. The core insight about 'OTel spec covers WHAT happened vs. WHY and HOW MUCH' resonates deeply. We see a similar challenge in the job market, where verifiable 'Proof of Work' (LCV - Linked & Certified Value) cuts through the noise of AI-generated content. Standardization is key for both technical observability and professional credibility. Thanks for this clear guidance!

jidonglab

the span naming convention with {operation} {name} is a nice touch — makes it way easier to grep through traces when you're debugging a chain of LLM calls versus agent tool invocations. the part that's going to matter most in practice though is how well this plays with existing OTel backends. a lot of teams already have Jaeger or Tempo set up for their microservices, so being able to see LLM latency right next to API call traces without a separate observability tool is the real win here.


Botánica Andina

This is so timely! I've been wrestling with custom LLM tracing attributes for ages, and the 'wild west' analogy hits home. It's tough to move from llm.provider to a standardized approach, but your breakdown makes the OTel GenAI conventions feel much more approachable. Excited to dig into this.

erndob

This is insane. AI generated article with mostly AI generated comments.
Even if a human is behind this content, the voice of humans is lost and everyone sounds like the same bot.

There's 36 em dashes on this page right now, between the article and comments.

Adarsh Kant

This is exactly the kind of practical content the AI observability space needs. The dual-emit migration strategy is smart — we've learned similar lessons building voice AI agents that interact with real DOM elements on websites. When you're tracing voice-to-action pipelines (speech recognition → intent → DOM action → response) across 50+ languages, having standardized span attributes is critical for debugging latency. The gen_ai.usage.input_tokens tracking per span is especially relevant when you're optimizing for sub-700ms voice response times. Great writeup!

Mr. Lin Uncut

how are you handling the span attributes for streaming responses in the opentelemetry spec, because that's where i've seen the most inconsistency across implementations since you don't have a clean start and end token count until the stream closes

Albert Alov

Great question!
We wrap the async iterator with a StreamAccumulator - the span starts before the stream, chunks are accumulated incrementally (text + token counts from provider-specific metadata), and all span attributes are set once when the stream closes. TTFT is tracked separately via onFirstChunk callback. If the consumer breaks out early, the finally block still records what we have.

For token counts: OpenAI sends usage in the final chunk, Anthropic splits it across message_start (input) and message_delta (output) - each provider has its own accumulateChunk() extractor.

Here's the code:
wrapAsyncIterable() and createStreamingHandler() - github.com/vola-trebla/toad-eye/bl...
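A stripped-down sketch of the wrapper described above: accumulate chunks, fire a first-chunk callback (for TTFT), and flush the accumulated totals in `finally` so they survive an early break. All names here are hypothetical simplifications of the linked code:

```javascript
// Wrap a streaming response so span attributes can be set once at close.
async function* wrapStream(stream, { onFirstChunk, onDone } = {}) {
  let first = true;
  const acc = { text: "", chunks: 0 }; // provider-specific token extraction omitted
  try {
    for await (const chunk of stream) {
      if (first) {
        first = false;
        onFirstChunk?.(); // TTFT measured here
      }
      acc.text += chunk.text ?? "";
      acc.chunks += 1;
      yield chunk;
    }
  } finally {
    // Runs on normal close AND when the consumer breaks out early
    onDone?.(acc);
  }
}
```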
