Distributed tracing requires stitching together user journeys as requests flow across complex microservices, rather than just writing isolated functions. We tested whether top models can correctly instrument applications with OpenTelemetry, to see if they are actually ready for real-world Site Reliability Engineering tasks.
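To make the task concrete, here is a minimal sketch (in Python, not taken from the benchmark itself) of the kind of manual instrumentation these tasks require: configuring a tracer provider, wrapping a request handler in a span, and attaching attributes so the operation shows up in a trace. The service and function names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider that exports spans (to the console here;
# a real service would use an OTLP exporter pointed at a collector).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    # Each handled request becomes a span; attributes make it searchable later.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...

handle_checkout("12345")
```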
Models ranked by their success rate in modifying code to correctly emit telemetry data. The table includes total cost and time for the full benchmark run to help contextualize performance. See our full methodology for validation details.
The benchmark covers a diverse set of coding challenges across languages including .NET, C++, Erlang, Go, Java, JavaScript, PHP, Python, Ruby, Rust, and Swift. Tasks are sorted by difficulty; those with a 0% pass rate represent currently unsolved problems in automated instrumentation.
Average pass rate across all models for each programming language. Languages with more training data and mature OpenTelemetry libraries tend to be easier for AI models to instrument correctly.
A detailed view of which tasks each model solved or failed. This helps identify models that handle specific instrumentation patterns well, even if their overall score is lower.
We map total API cost against success rate. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.
Pareto frontier
| Model | Pass Rate | Cost |
|---|---|---|
| gemini-3-flash-preview | 19% | $7 |
| gpt-5.2 | 26% | $48 |
| claude-opus-4.5 | 29% | $84 |
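A model sits on the cost/pass-rate Pareto frontier when no other model is both cheaper and at least as accurate. As a small sketch (not part of the benchmark tooling), this is how such a frontier can be computed from the chart's data points:

```python
from typing import List, Tuple

def pareto_frontier(models: List[Tuple[str, float, float]]) -> List[str]:
    """models: (name, pass_rate, cost). Returns names on the frontier."""
    frontier = []
    best_rate = -1.0
    # Walk models from cheapest to most expensive; keep each one that
    # improves on the best pass rate seen so far.
    for name, rate, cost in sorted(models, key=lambda m: m[2]):
        if rate > best_rate:
            frontier.append(name)
            best_rate = rate
    return frontier

# Example with the numbers from the table above.
print(pareto_frontier([
    ("gemini-3-flash-preview", 0.19, 7),
    ("gpt-5.2", 0.26, 48),
    ("claude-opus-4.5", 0.29, 84),
]))
# -> ['gemini-3-flash-preview', 'gpt-5.2', 'claude-opus-4.5']
```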
This chart compares accuracy against average generation time, helping identify models that balance solution quality with response latency.
Pareto frontier
| Model | Pass Rate | Time |
|---|---|---|
| gemini-3-flash-preview | 19% | 6m |
| claude-sonnet-4.5 | 22% | 11m |
| claude-opus-4.5 | 29% | 12m |
We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on observability tasks compares across model generations.
Pareto frontier
| Model | Pass Rate | Released |
|---|---|---|
| grok-4 | 4% | Jul 25 |
| claude-sonnet-4.5 | 22% | Sept 25 |
| claude-opus-4.5 | 29% | Nov 25 |
For reproducibility, we open-sourced the full benchmark at QuesmaOrg/otel-bench. It is built on the Harbor framework, so you can verify our findings and test new models and agents; see our post Migrating CompileBench to Harbor: standardizing AI agent evals for background.
uv tool install harbor
git clone git@github.com:QuesmaOrg/otel-bench.git
cd otel-bench

We welcome contributions of new tasks. See the repository for details.
All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.