Distributed tracing requires stitching together user journeys as requests flow across complex microservices, rather than just writing isolated functions. We tested whether top models can correctly instrument applications with OpenTelemetry, to see if they are actually ready for real-world Site Reliability Engineering tasks.
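To make the task concrete, here is a minimal sketch (in Python, not taken from the benchmark itself) of the kind of manual instrumentation these tasks require: configuring a tracer provider, wrapping a request handler in a span, and attaching attributes so the operation shows up in a trace. The service and function names are hypothetical.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer provider that exports spans (to the console here;
# a real service would use an OTLP exporter pointed at a collector).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")  # hypothetical service name

def handle_checkout(order_id: str) -> None:
    # Each handled request becomes a span; attributes make it searchable later.
    with tracer.start_as_current_span("handle_checkout") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic ...

handle_checkout("12345")
```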
Models ranked by their success rate in modifying code to correctly emit telemetry data. The table includes total cost and time for the full benchmark run to help contextualize performance. See our full methodology for validation details.
The benchmark covers a diverse set of coding challenges across languages including .NET, C++, Erlang, Go, Java, JavaScript, PHP, Python, Ruby, Rust, and Swift. Tasks are sorted by difficulty; those with a 0% pass rate represent currently unsolved problems in automated instrumentation.
Average pass rate across all models for each programming language. Languages with more training data and mature OpenTelemetry libraries tend to be easier for AI models to instrument correctly.
A detailed view of which tasks each model solved or failed. This helps identify models that handle specific instrumentation patterns well, even if their overall score is lower.
We map total API cost against success rate. The Pareto frontier (blue line) highlights the most cost-efficient models for a given performance level.
Pareto frontier
| Model | Pass Rate | Cost |
|---|---|---|
| gemini-3-flash-preview | 19% | $7 |
| gpt-5.2 | 26% | $48 |
| claude-opus-4.5 | 29% | $84 |
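A model sits on the cost/pass-rate Pareto frontier when no other model is both cheaper and at least as accurate. As a small sketch (not part of the benchmark tooling), this is how such a frontier can be computed from the chart's data points:

```python
from typing import List, Tuple

def pareto_frontier(models: List[Tuple[str, float, float]]) -> List[str]:
    """models: (name, pass_rate, cost). Returns names on the frontier."""
    frontier = []
    best_rate = -1.0
    # Walk models from cheapest to most expensive; keep each one that
    # improves on the best pass rate seen so far.
    for name, rate, cost in sorted(models, key=lambda m: m[2]):
        if rate > best_rate:
            frontier.append(name)
            best_rate = rate
    return frontier

# Example with the numbers from the table above.
print(pareto_frontier([
    ("gemini-3-flash-preview", 0.19, 7),
    ("gpt-5.2", 0.26, 48),
    ("claude-opus-4.5", 0.29, 84),
]))
# -> ['gemini-3-flash-preview', 'gpt-5.2', 'claude-opus-4.5']
```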
This chart compares accuracy against average generation time, helping identify models that balance solution quality with response latency.
Pareto frontier
| Model | Pass Rate | Time |
|---|---|---|
| gemini-3-flash-preview | 19% | 6m |
| claude-sonnet-4.5 | 22% | 11m |
| claude-opus-4.5 | 29% | 12m |
We plot model pass rates against their release dates to track performance changes over time. This timeline shows how capability on observability tasks compares across model generations.
Pareto frontier
| Model | Pass Rate | Released |
|---|---|---|
| grok-4 | 4% | Jul 25 |
| claude-sonnet-4.5 | 22% | Sept 25 |
| claude-opus-4.5 | 29% | Nov 25 |
For reproducibility, we open-sourced the full benchmark at QuesmaOrg/otel-bench. It is built on the Harbor framework, so you can verify our findings and test new models and agents; see our post Migrating CompileBench to Harbor: standardizing AI agent evals for background.
uv tool install harbor
git clone git@github.com:QuesmaOrg/otel-bench.git
cd otel-bench

We welcome contributions of new tasks. See the repository for details.
All product names, logos, and brands (™/®) are the property of their respective owners; they're used here solely for identification and comparison, and their use does not imply affiliation, endorsement, or sponsorship.