Training Data

A Curated Dataset for Domain-Specific Code Generation

A disciplined, high-quality dataset that reliably teaches the model to generate production-grade Kubernetes YAML from natural-language tasks.

Building a High-Quality Training Dataset

We obtained a rich, real-world dataset with representative coverage of Kubernetes tasks from public GitHub repositories. But these data carry significant noise: license headers, comments, truncated files, and Helm templating artifacts.

This noise showed up in our initial model outputs: repetition, truncation, and license text embedded in generated YAML. Through iterative refinement, we developed a training dataset aligned to domain-specific task objectives, resulting in 200K+ clean task–YAML pairs that teach models to generate production-grade configurations.

- 227.9K labelled samples
- 23 resource types
- 3 complexity tiers
- 46.7% composite tasks

Category Distribution
- Deployment: 22.5%
- Service: 17.1%
- Pod: 12.0%
- ConfigMap: 6.4%
- ClusterRole: 5.2%
- Namespace: 4.1%
- PVC: 3.7%
- ServiceAccount: 3.5%
- Secret: 3.4%

Complexity Distribution
- Basic: 119,662 samples (54.7%)
- Intermediate: 86,754 samples (39.6%)
- Advanced: 12,474 samples (5.7%)

Real-World Kubernetes Configurations

Training data is sourced from The Stack, a 6TB permissively licensed code corpus from the BigCode Project. We extracted 100GB+ of Kubernetes YAML representing how developers actually write manifests in production.

Sources: The Stack YAML (100+ GB) · Kubernetes official docs · kubectl CLI references

This provides breadth—real-world configurations from thousands of public repositories—and representative coverage across all major Kubernetes resource types: workloads, networking, storage, and access control.
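To make the extraction step concrete, here is a minimal detection heuristic, assuming PyYAML; the function name and exact rule are illustrative rather than the project's actual code. A file counts as a Kubernetes manifest when it parses cleanly and at least one document declares both apiVersion and kind:

```python
import yaml

def is_k8s_manifest(text: str) -> bool:
    """Heuristic: treat a YAML file as a Kubernetes manifest if it
    parses and some document declares both apiVersion and kind."""
    try:
        docs = list(yaml.safe_load_all(text))
    except yaml.YAMLError:
        return False
    return any(
        isinstance(doc, dict) and "apiVersion" in doc and "kind" in doc
        for doc in docs
    )

# A minimal Service manifest passes; arbitrary YAML does not.
assert is_k8s_manifest("apiVersion: v1\nkind: Service\nmetadata:\n  name: web")
assert not is_k8s_manifest("foo: bar")
```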

Aligning Data to Training Objectives

Raw YAML files lack the conversational context needed for instruction fine-tuning. We used GPT-4o-mini to generate task descriptions, context scenarios, and complexity ratings, creating the structured supervision that lets models learn the mapping from natural language to valid YAML.
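A minimal sketch of what such an annotation call can look like, assuming the openai Python client; the prompt wording and the annotate helper are our own illustration, though the output fields match the label schema described below:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """Given this Kubernetes YAML, return JSON with:
- "task": a 15-25 word imperative task description
- "context": a one-sentence business scenario
- "complexity": one of "basic", "intermediate", "advanced"

YAML:
{manifest}"""

def annotate(manifest: str) -> dict:
    """Ask GPT-4o-mini for a task/context/complexity label (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(manifest=manifest)}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    label = json.loads(response.choices[0].message.content)
    label["yaml"] = manifest  # pair the label with its source YAML
    return label
```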

The cleaning and labelling pipeline has four stages:

1. Noise filtering
2. YAML validation
3. LLM annotation
4. Format alignment
Raw data (what models learned):

```yaml
# Copyright 2019 The Kubernetes Authors.
# Licensed under the Apache License...
---
kind: Deployment
metadata:
  name: nginx
  name: nginx  # ← duplicate key
spec:
  replicas: {{ .Values.replicas }}
  template:
    spec:
      containers:
        ... [truncated]
```
Cleaned + labelled (aligned to objectives):

```json
{
  "task": "Create an Nginx Deployment with 3 replicas",
  "context": "Frontend team needs zero-downtime deploys",
  "complexity": "intermediate",
  "yaml": "kind: Deployment\n..."
}
```

Each label includes a task description (a 15–25-word imperative), a context scenario (the business rationale), and a complexity level. Processing took approximately 4 hours for the full dataset.
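Step 4, format alignment, then turns each labelled record into a supervised pair. A plausible rendering, assuming a chat-style instruction format; the exact template is not specified in this section:

```python
def to_chat_example(record: dict) -> dict:
    """Render one labelled record as a chat-format fine-tuning example:
    task and context become the user turn, the YAML becomes the target."""
    return {
        "messages": [
            {"role": "system",
             "content": "You generate valid, production-grade Kubernetes YAML."},
            {"role": "user",
             "content": f"{record['task']}\nContext: {record['context']}"},
            {"role": "assistant", "content": record["yaml"]},
        ]
    }

record = {
    "task": "Create an Nginx Deployment with 3 replicas",
    "context": "Frontend team needs zero-downtime deploys",
    "complexity": "intermediate",
    "yaml": "kind: Deployment\n...",
}
print(to_chat_example(record)["messages"][1]["content"])
```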

From Noisy to Training-Ready

Real-world code contains significant noise. Our initial fine-tuned models generated outputs with license headers, truncation markers, and Helm template syntax—directly reflecting patterns in the raw training data. We built a multi-stage filtering pipeline to systematically remove these artifacts.
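A condensed sketch of such a filter, assuming PyYAML plus regex rejection rules; the patterns below are illustrative stand-ins for the noise categories listed next, and the strict loader rejects duplicate mapping keys that PyYAML would otherwise silently accept:

```python
import re
import yaml

# Illustrative rejection patterns, one per noise category (not exhaustive).
NOISE_PATTERNS = [
    re.compile(r"Licensed under the Apache License"),  # license blocks
    re.compile(r"{{\s*[.\w]"),                         # Helm templating
    re.compile(r"kind:\s*CustomResourceDefinition"),   # CRD schemas
    re.compile(r"\[truncated\]"),                      # truncation markers
]

class StrictLoader(yaml.SafeLoader):
    """SafeLoader variant wired to fail on duplicate mapping keys."""

def _mapping_no_duplicates(loader, node, deep=False):
    seen = {}
    for key_node, value_node in node.value:
        key = loader.construct_object(key_node, deep=deep)
        if key in seen:
            raise yaml.YAMLError(f"duplicate key: {key!r}")
        seen[key] = loader.construct_object(value_node, deep=deep)
    return seen

StrictLoader.add_constructor(
    yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _mapping_no_duplicates)

def is_training_ready(text: str) -> bool:
    """Reject samples that match a noise pattern or fail strict parsing."""
    if any(p.search(text) for p in NOISE_PATTERNS):
        return False
    try:
        list(yaml.load_all(text, Loader=StrictLoader))
    except yaml.YAMLError:
        return False
    return True
```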

276,520 raw samples (noisy) → 227,926 training-ready samples

Noise categories filtered:
- License blocks: 8.28%
- CRD schemas: 3.00%
- Embedded JSON: 2.99%
- OpenAPI specs: 2.98%
- Helm templates: 2.95%
- Multiline artifacts: 2.77%
- Embedded HTML: 0.40%
- Invalid YAML

Iterative refinement: Data quality directly determines model behavior. We continuously evaluated model outputs, identified noise patterns, and refined our filtering pipeline—aligning training data to produce clean, production-grade YAML.