Can AI instrument OpenTelemetry?

23 tasks | 11 languages | 14 models | 14% pass rate | Updated 16 Jan 2026 | QuesmaOrg/otel-bench

Distributed tracing requires stitching together distinct user journeys across complex microservices, rather than just writing isolated functions. We tested whether top models can successfully instrument applications with OpenTelemetry to see if they are actually ready to handle real-world Site Reliability Engineering tasks.

Read our blog post introducing OTelBench: "Benchmarking OpenTelemetry: Can AI trace your failed login?"
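To make concrete what these tasks demand, here is a minimal sketch of manual span instrumentation with the OpenTelemetry Python SDK. It is illustrative only and not taken from a benchmark task; the service name, span name, and login logic are invented for the example, and real benchmark apps would typically export via OTLP to a collector rather than to the console.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.trace import Status, StatusCode

# Set up a tracer provider; here spans are simply printed to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("auth-service")

def check_password(username: str, password: str) -> bool:
    return False  # stub standing in for a real credential check

def login(username: str, password: str) -> bool:
    # Wrap the operation in a span so a failed login is visible in the trace.
    with tracer.start_as_current_span("user.login") as span:
        span.set_attribute("user.name", username)
        ok = check_password(username, password)
        if not ok:
            span.set_status(Status(StatusCode.ERROR, "invalid credentials"))
        return ok

login("alice", "wrong-password")  # prints the finished span to the console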

Model ranking #

Rank | Model | Pass rate
1 | Anthropic claude-opus-4.5 | 29%
2 | OpenAI gpt-5.2 | 26%
3 | Anthropic claude-sonnet-4.5 | 22%
6 | OpenAI gpt-5.2-codex | 16%
7 | OpenAI gpt-5.1 | 14%
8 | Z.ai glm-4.7 | 13%
9 | DeepSeek deepseek-v3.2 | 12%
10 | OpenAI gpt-5.1-codex-max | 12%
12 | Anthropic claude-haiku-4.5 | 6%
13 | Grok grok-4 | 4%
14 | Grok grok-4.1-fast | 3%

Models are ranked by their success rate at modifying code to correctly emit telemetry data. The full results table also reports total cost and time for the complete benchmark run to help contextualize performance. See our full methodology for validation details.

View all models →
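We do not reproduce the benchmark's harness here, but the kind of check implied by "correctly emit telemetry data" can be sketched with the SDK's in-memory exporter: run the instrumented code, then assert that the expected spans were produced. The span name and attributes below are placeholders, not the benchmark's actual assertions.

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Capture spans in memory instead of shipping them to a collector.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

# Exercise the code under test (placeholder for the instrumented application).
with trace.get_tracer("demo").start_as_current_span("user.login") as span:
    span.set_attribute("user.name", "alice")

# Validate that the expected telemetry was actually emitted.
spans = exporter.get_finished_spans()
assert any(
    s.name == "user.login" and s.attributes.get("user.name") == "alice"
    for s in spans
), "expected an instrumented 'user.login' span"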

The benchmark covers a diverse set of coding challenges across 11 languages: .NET, C++, Erlang, Go, Java, JavaScript, PHP, Python, Ruby, Rust, and Swift. We sort tasks by difficulty; those with a 0% pass rate represent currently unsolved problems in automated instrumentation.

View all tasks →

Language | Tasks | Avg. pass rate
C++ | 3 | 37%
Go | 7 | 20%
JavaScript | 1 | 17%
Python | 2 | 15%
.NET | 1 | 10%
PHP | 2 | 6%
Rust | 2 | 1%
Erlang | 1 | 0%
Java | 2 | 0%
Ruby | 1 | 0%
Swift | 1 | 0%

Average pass rate across all models for each programming language. Languages with more training data and mature OpenTelemetry libraries tend to be easier for AI models to instrument correctly.

Model-task matrix #


A detailed view of which tasks each model solved or failed. This helps identify models that handle specific instrumentation patterns well, even if their overall score is lower.

Cost efficiency #

We map total API cost against success rate. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.

Pareto frontier:

Model | Pass rate | Cost
Google gemini-3-flash-preview | 19% | $7
OpenAI gpt-5.2 | 26% | $48
Anthropic claude-opus-4.5 | 29% | $84
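The frontier above can be recomputed from raw (cost, pass rate) points: a model is on the frontier when no other model is both cheaper and at least as accurate. A minimal sketch, using the three frontier points from the table plus one invented, dominated model (this is not the benchmark's plotting code):

from typing import NamedTuple

class Result(NamedTuple):
    model: str
    pass_rate: float  # fraction of tasks solved
    cost: float       # total API cost in USD

results = [
    Result("gemini-3-flash-preview", 0.19, 7.0),
    Result("gpt-5.2", 0.26, 48.0),
    Result("claude-opus-4.5", 0.29, 84.0),
    Result("hypothetical-model", 0.10, 60.0),  # invented point; dominated by cheaper, better models
]

def pareto_frontier(points: list[Result]) -> list[Result]:
    # Keep every point for which no other point is cheaper-or-equal AND better-or-equal.
    frontier = [
        p for p in points
        if not any(q.cost <= p.cost and q.pass_rate >= p.pass_rate and q != p for q in points)
    ]
    return sorted(frontier, key=lambda r: r.cost)

print([r.model for r in pareto_frontier(results)])
# -> ['gemini-3-flash-preview', 'gpt-5.2', 'claude-opus-4.5']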

Speed vs quality #

This chart compares accuracy against average generation time, helping identify models that balance solution quality with response latency.

Pareto frontier:

Model | Pass rate | Avg. time
Google gemini-3-flash-preview | 19% | 6 min
Anthropic claude-sonnet-4.5 | 22% | 11 min
Anthropic claude-opus-4.5 | 29% | 12 min

We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on observability tasks compares across model generations.

Pareto frontier:

Model | Pass rate | Released
Grok grok-4 | 4% | Jul 2025
Anthropic claude-sonnet-4.5 | 22% | Sep 2025
Anthropic claude-opus-4.5 | 29% | Nov 2025

For reproducibility, we open-sourced the full benchmark at QuesmaOrg/otel-bench. It is built on the Harbor framework, so you can verify our findings and test new models and agents; see our post Migrating CompileBench to Harbor: standardizing AI agent evals.

# Install the Harbor CLI
uv tool install harbor
# Fetch the benchmark repository
git clone git@github.com:QuesmaOrg/otel-bench.git
cd otel-bench

We welcome contributions of new tasks. See the repository for details.


All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.