Quesma Benchmarks

Open-source benchmarks for evaluating AI coding agents on real-world software engineering tasks.

Stay tuned for new benchmarks and results.

BinaryAudit

Security analysis benchmark for AI coding agents. Tests models on detecting backdoors, timebombs, and malicious code in compiled binaries using reverse-engineering tools such as Ghidra and radare2.

Benchmark

QuesmaOrg/binaryaudit

We hid backdoors in binaries — Opus 4.6 found 49% of them

OTelBench

OpenTelemetry instrumentation benchmark for AI coding agents. Tests models on real-world tasks that add distributed tracing, metrics, and logging to multi-language codebases.

Benchmark

QuesmaOrg/otel-bench

Benchmarking OpenTelemetry: Can AI trace your failed login?

CompileBench

Build system benchmark for AI coding agents. Tests models on fixing compilation errors, updating dependencies, and navigating complex build configurations.

Benchmark

QuesmaOrg/CompileBench

CompileBench: Can AI Compile 22-year-old Code?