Training Data

A Curated Dataset for Domain-Specific Code Generation

A disciplined, high-quality dataset that reliably teaches the model to generate production-grade Kubernetes YAML from natural-language tasks.

Building a High-Quality Training Dataset

We obtained a rich, real-world dataset with representative coverage of Kubernetes tasks from public GitHub repositories. But these data carry significant noise: license headers, comments, truncated files, and Helm templating artifacts.

This noise showed up in our initial model outputs: repetition, truncation, and license text embedded in generated YAML. Through iterative refinement, we developed a training dataset aligned to domain-specific task objectives, resulting in 200K+ clean task–YAML pairs that teach models to generate production-grade configurations.

- 227.9K labelled samples
- 23 resource types
- 3 complexity tiers
- 46.7% composite tasks

Category Distribution
- Deployment: 22.5%
- Service: 17.1%
- Pod: 12.0%
- ConfigMap: 6.4%
- ClusterRole: 5.2%
- Namespace: 4.1%
- PVC: 3.7%
- ServiceAccount: 3.5%
- Secret: 3.4%

Complexity Distribution
- Basic: 119,662 samples (54.7%)
- Intermediate: 86,754 samples (39.6%)
- Advanced: 12,474 samples (5.7%)

Real-World Kubernetes Configurations

Training data is sourced from The Stack, a 6TB permissively licensed code corpus from the BigCode Project. We extracted 100GB+ of Kubernetes YAML representing how developers actually write manifests in production.

Sources: The Stack YAML (100+ GB) · Kubernetes official docs · kubectl CLI references

This provides breadth—real-world configurations from thousands of public repositories—and representative coverage across all major Kubernetes resource types: workloads, networking, storage, and access control.
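To make the extraction step concrete, here is a minimal detection heuristic, assuming PyYAML; the function name and exact rule are illustrative rather than the project's actual code. A file counts as a Kubernetes manifest when it parses cleanly and at least one document declares both apiVersion and kind:

```python
import yaml

def is_k8s_manifest(text: str) -> bool:
    """Heuristic: treat a YAML file as a Kubernetes manifest if it
    parses and some document declares both apiVersion and kind."""
    try:
        docs = list(yaml.safe_load_all(text))
    except yaml.YAMLError:
        return False
    return any(
        isinstance(doc, dict) and "apiVersion" in doc and "kind" in doc
        for doc in docs
    )

# A minimal Service manifest passes; arbitrary YAML does not.
assert is_k8s_manifest("apiVersion: v1\nkind: Service\nmetadata:\n  name: web")
assert not is_k8s_manifest("foo: bar")
```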

Aligning Data to Training Objectives

Raw YAML files lack the conversational context needed for instruction fine-tuning. We used GPT-4o-mini to generate task descriptions, context scenarios, and complexity ratings, creating the structured supervision that lets models learn the mapping from natural language to valid YAML.
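A minimal sketch of what such an annotation call can look like, assuming the openai Python client; the prompt wording and the annotate helper are our own illustration, though the output fields match the label schema described below:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """Given this Kubernetes YAML, return JSON with:
- "task": a 15-25 word imperative task description
- "context": a one-sentence business scenario
- "complexity": one of "basic", "intermediate", "advanced"

YAML:
{manifest}"""

def annotate(manifest: str) -> dict:
    """Ask GPT-4o-mini for a task/context/complexity label (illustrative)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(manifest=manifest)}],
        response_format={"type": "json_object"},  # force parseable JSON output
    )
    label = json.loads(response.choices[0].message.content)
    label["yaml"] = manifest  # pair the label with its source YAML
    return label
```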

The cleaning and labelling pipeline has four stages:

1. Noise filtering
2. YAML validation
3. LLM annotation
4. Format alignment
Raw data (what models learned):

```yaml
# Copyright 2019 The Kubernetes Authors.
# Licensed under the Apache License...
---
kind: Deployment
metadata:
  name: nginx
  name: nginx  # ← duplicate key
spec:
  replicas: {{ .Values.replicas }}
  template:
    spec:
      containers:
        ... [truncated]
```
Cleaned + labelled (aligned to objectives):

```json
{
  "task": "Create an Nginx Deployment with 3 replicas",
  "context": "Frontend team needs zero-downtime deploys",
  "complexity": "intermediate",
  "yaml": "kind: Deployment\n..."
}
```

Each label includes a task description (a 15–25-word imperative), a context scenario (the business rationale), and a complexity level. Processing took approximately 4 hours for the full dataset.
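Step 4, format alignment, then turns each labelled record into a supervised pair. A plausible rendering, assuming a chat-style instruction format; the exact template is not specified in this section:

```python
def to_chat_example(record: dict) -> dict:
    """Render one labelled record as a chat-format fine-tuning example:
    task and context become the user turn, the YAML becomes the target."""
    return {
        "messages": [
            {"role": "system",
             "content": "You generate valid, production-grade Kubernetes YAML."},
            {"role": "user",
             "content": f"{record['task']}\nContext: {record['context']}"},
            {"role": "assistant", "content": record["yaml"]},
        ]
    }

record = {
    "task": "Create an Nginx Deployment with 3 replicas",
    "context": "Frontend team needs zero-downtime deploys",
    "complexity": "intermediate",
    "yaml": "kind: Deployment\n...",
}
print(to_chat_example(record)["messages"][1]["content"])
```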

From Noisy to Training-Ready

Real-world code contains significant noise. Our initial fine-tuned models generated outputs with license headers, truncation markers, and Helm template syntax—directly reflecting patterns in the raw training data. We built a multi-stage filtering pipeline to systematically remove these artifacts.
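A condensed sketch of such a filter, assuming PyYAML plus regex rejection rules; the patterns below are illustrative stand-ins for the noise categories listed next, and the strict loader rejects duplicate mapping keys that PyYAML would otherwise silently accept:

```python
import re
import yaml

# Illustrative rejection patterns, one per noise category (not exhaustive).
NOISE_PATTERNS = [
    re.compile(r"Licensed under the Apache License"),  # license blocks
    re.compile(r"{{\s*[.\w]"),                         # Helm templating
    re.compile(r"kind:\s*CustomResourceDefinition"),   # CRD schemas
    re.compile(r"\[truncated\]"),                      # truncation markers
]

class StrictLoader(yaml.SafeLoader):
    """SafeLoader variant wired to fail on duplicate mapping keys."""

def _mapping_no_duplicates(loader, node, deep=False):
    seen = {}
    for key_node, value_node in node.value:
        key = loader.construct_object(key_node, deep=deep)
        if key in seen:
            raise yaml.YAMLError(f"duplicate key: {key!r}")
        seen[key] = loader.construct_object(value_node, deep=deep)
    return seen

StrictLoader.add_constructor(
    yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG, _mapping_no_duplicates)

def is_training_ready(text: str) -> bool:
    """Reject samples that match a noise pattern or fail strict parsing."""
    if any(p.search(text) for p in NOISE_PATTERNS):
        return False
    try:
        list(yaml.load_all(text, Loader=StrictLoader))
    except yaml.YAMLError:
        return False
    return True
```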

276,520 raw samples (noisy) → 227,926 training-ready samples

Noise categories filtered:
- License blocks: 8.28%
- CRD schemas: 3.00%
- Embedded JSON: 2.99%
- OpenAPI specs: 2.98%
- Helm templates: 2.95%
- Multiline artifacts: 2.77%
- Embedded HTML: 0.40%
- Invalid YAML

Iterative refinement: Data quality directly determines model behavior. We continuously evaluated model outputs, identified noise patterns, and refined our filtering pipeline—aligning training data to produce clean, production-grade YAML.