Building a High-Quality Training Dataset
We obtained a rich, real-world dataset with representative coverage of Kubernetes tasks from public GitHub repositories. But this data carries significant noise: license headers, comments, truncated files, and Helm templating artifacts.
This noise showed up in our initial model outputs: repetition, truncation, and licensing messages embedded in generated YAML. Through iterative refinement, we developed a training dataset aligned to domain-specific task objectives, resulting in 200K+ clean task–YAML pairs that teach models to generate production-grade configurations.
Real-World Kubernetes Configurations
Training data is sourced from The Stack, a 6TB permissively licensed code corpus from the BigCode Project. From it, we extracted 100GB+ of Kubernetes YAML representing how developers actually write manifests in production.
This provides breadth—real-world configurations from thousands of public repositories—and representative coverage across all major Kubernetes resource types: workloads, networking, storage, and access control.
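For illustration, here is a minimal sketch of the extraction step: a heuristic that keeps only files parseable as YAML whose documents carry the apiVersion and kind keys that identify Kubernetes resources. The bigcode/the-stack dataset path, the data/yaml subset, and the content field are assumptions about the public Hugging Face release, not details of our pipeline.

```python
import yaml

# Top-level keys that mark a YAML document as a Kubernetes resource.
KUBERNETES_KEYS = {"apiVersion", "kind"}

def is_kubernetes_manifest(text: str) -> bool:
    """Return True if any YAML document in the file looks like a Kubernetes resource."""
    try:
        documents = list(yaml.safe_load_all(text))
    except yaml.YAMLError:
        return False  # unparseable files are dropped outright
    return any(
        isinstance(doc, dict) and KUBERNETES_KEYS.issubset(doc)
        for doc in documents
    )

# Hypothetical driver: stream The Stack's YAML subset and keep Kubernetes files.
# The dataset path, data_dir, and "content" field name are assumptions about
# the BigCode release on the Hugging Face Hub.
if __name__ == "__main__":
    from datasets import load_dataset

    stream = load_dataset(
        "bigcode/the-stack", data_dir="data/yaml", split="train", streaming=True
    )
    kubernetes_files = (
        row["content"] for row in stream if is_kubernetes_manifest(row["content"])
    )
    print(next(kubernetes_files)[:200])  # peek at the first extracted manifest
```

Pairing yaml.safe_load_all with a key check keeps multi-document files, which are common in real repositories, without misclassifying generic YAML such as CI configs.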
Aligning Data to Training Objectives
Raw YAML files lack the conversational context needed for instruction fine-tuning. We used GPT-4o-mini to generate task descriptions, context scenarios, and complexity ratings—creating the structured supervision needed for models to learn the mapping from natural language to valid YAML.
Each label includes a task description (15-25 word imperative), context scenario (business rationale), and complexity level. Processing took approximately 4 hours for the full dataset.
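A hedged sketch of what one labeling call can look like, assuming the OpenAI Python SDK; the prompt wording and the label_manifest helper are illustrative, not our exact production prompt.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative prompt; the exact instructions used for labeling are not shown here.
LABELING_PROMPT = """You are labeling a Kubernetes manifest for instruction fine-tuning.
Return a JSON object with:
  "task": an imperative task description of 15-25 words,
  "context": a one-sentence business scenario motivating this manifest,
  "complexity": one of "low", "medium", "high".

Manifest:
{manifest}
"""

def label_manifest(manifest: str) -> dict:
    """Generate a task description, context scenario, and complexity rating for one manifest."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": LABELING_PROMPT.format(manifest=manifest)}],
        response_format={"type": "json_object"},  # force parseable JSON output
        temperature=0.2,
    )
    return json.loads(response.choices[0].message.content)
```

Each labeled pair then enters the instruction-tuning set as a (task + context) prompt mapped to its original YAML.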
From Noisy to Training-Ready
Real-world code contains significant noise. Our initial fine-tuned models generated outputs with license headers, truncation markers, and Helm template syntax—directly reflecting patterns in the raw training data. We built a multi-stage filtering pipeline to systematically remove these artifacts.
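As a rough sketch of what such a pipeline looks like, the heuristics below drop comment-only license headers and reject files containing Helm templating or truncation artifacts; the specific patterns are illustrative assumptions rather than our exact rules.

```python
import re

# Heuristic noise patterns; the production pipeline's exact rules may differ.
LICENSE_PATTERN = re.compile(
    r"apache license|licensed under|spdx-license-identifier|copyright \(c\)",
    re.IGNORECASE,
)
HELM_PATTERN = re.compile(r"\{\{.*?\}\}")  # unrendered Helm/Go template expressions

def strip_comments(text: str) -> str:
    """Drop full-line YAML comments, where license headers usually live."""
    return "\n".join(
        line for line in text.splitlines() if not line.lstrip().startswith("#")
    )

def has_truncation_marker(text: str) -> bool:
    """Flag files whose last non-empty line trails off with an ellipsis.
    A bare '...' line is YAML's own end-of-document marker, so that case is kept."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    if not lines:
        return True
    last = lines[-1]
    return last != "..." and (last.endswith("...") or last.endswith("…"))

def clean(text: str) -> str | None:
    """Run one manifest through the filtering stages; return None if rejected."""
    text = strip_comments(text)
    if LICENSE_PATTERN.search(text):   # license text that survived comment stripping
        return None
    if HELM_PATTERN.search(text):      # templating belongs to Helm charts, not plain manifests
        return None
    if has_truncation_marker(text):    # file appears cut off
        return None
    return text.strip()
```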
Iterative refinement: Data quality directly determines model behavior. We continuously evaluated model outputs, identified noise patterns, and refined our filtering pipeline—aligning training data to produce clean, production-grade YAML.