A Kubernetes-native benchmark for code-generating LLMs, built around real cluster workloads instead of toy YAML snippets.
Browse the core benchmark overview, full technical report, and our fine-tuning demo that shows how we build domain-specific models.
High-level motivation, evaluation dimensions, and why Kubernetes-native benchmarks matter.
Detailed methodology, scoring framework, and model comparisons across 810 tasks.
A narrative of how we curate data, label tasks, and train KubeBench-tuned models.
KubeBench is a domain-specific benchmark that scores LLMs on production-grade Kubernetes tasks: 810 carefully designed scenarios covering the eight core resource types that power real clusters, from RBAC and namespaces to secrets and workloads.
Instead of grading YAML by string similarity, KubeBench evaluates how models behave against a live Kubernetes API.
For each task, the model generates Kubernetes YAML, which is then validated against a live cluster (tested on both local Minikube and GKE Autopilot).
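As a rough illustration of what live validation can look like, the snippet below pipes a generated manifest through a server-side dry-run with kubectl. This is a hedged sketch, not the KubeBench harness itself: it assumes kubectl is configured against a reachable cluster, and the helper name `validate_manifest` and the sample Namespace task are illustrative.

```python
"""Minimal sketch of live-cluster validation (illustrative, not the official harness).
Assumes kubectl is installed and pointed at a reachable cluster such as Minikube."""
import subprocess


def validate_manifest(yaml_text: str) -> tuple[bool, str]:
    """Run a server-side dry-run: the API server admits or rejects the manifest
    without persisting it, surfacing schema, admission, and authorization errors."""
    result = subprocess.run(
        ["kubectl", "apply", "--dry-run=server", "-f", "-"],
        input=yaml_text,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, (result.stdout + result.stderr).strip()


# Hypothetical model output for a namespace-creation task.
generated = """\
apiVersion: v1
kind: Namespace
metadata:
  name: payments-staging
"""

ok, message = validate_manifest(generated)
print("PASS" if ok else "FAIL", "-", message)
```

A server-side dry-run exercises schema validation, admission control, and authorization without changing cluster state, which is exactly the kind of behavioral signal that text matching cannot provide.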
Kubernetes adoption continues to grow across finance, healthcare, telecom, retail, and manufacturing, and thousands of engineers now depend on LLMs to generate, review, and troubleshoot manifests safely. Traditional text benchmarks ignore the declarative, order-independent nature of Kubernetes YAML and often punish functionally equivalent configurations.
KubeBench closes this gap by grounding evaluation in the only thing that really matters for infrastructure teams: how code behaves when it hits the cluster.
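To make the equivalence problem concrete, consider two manifests that a string-based metric treats as different but the API server treats as the same object. The snippet below is an illustrative sketch (not KubeBench's scorer), using PyYAML and difflib to contrast text similarity with a comparison of the parsed structures.

```python
"""Illustrative sketch: string similarity penalizes a reordered manifest,
while comparing the parsed objects shows they are functionally identical."""
import difflib
import yaml  # PyYAML

reference = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: default
data:
  LOG_LEVEL: info
  TIMEOUT: "30"
"""

# Same object: keys reordered, different quoting style.
candidate = """\
kind: ConfigMap
apiVersion: v1
metadata:
  namespace: default
  name: app-config
data:
  TIMEOUT: '30'
  LOG_LEVEL: info
"""

text_score = difflib.SequenceMatcher(None, reference, candidate).ratio()
semantically_equal = yaml.safe_load(reference) == yaml.safe_load(candidate)

print(f"text similarity: {text_score:.2f}")         # less than 1.0 despite identical meaning
print(f"semantically equal: {semantically_equal}")  # True
```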
Dive into the full experimental setup, scoring framework, and model comparisons in the KubeBench technical report.
Explore an end-to-end walkthrough of our Kubernetes fine-tuning pipeline, including dataset construction, noise filtering, annotation, and evaluation. This document complements KubeBench by showing how domain-specific performance is achieved in practice.