Run MLOps with Coding Agents and On-Premise LLMs
TOC
IntroductionSet up the agent's working environmentManage InferenceServices and LLMInferenceServicesManage gateways: authentication and rate limitsTune service performance to fit your hardwarePlan fine-tuning and generate reportsPick the right tool for the jobA reusable fine-tuning plan templateA reusable fine-tuning report templateA daily MLOps loopBest practices and guardrailsReferencesIntroduction
Once a coding agent is wired to a self-hosted model on Alauda AI (see Use Coding Agents with On-Premise Inference Services), the same agent can drive day-to-day MLOps on the platform. Because both the model and the operations target the same cluster, prompts, manifests, training data references, and benchmark results never leave your environment — which is what makes self-hosted agents attractive for regulated work.
This document describes four workflows where a coding agent is most useful:
- Authoring and managing
InferenceServiceandLLMInferenceServiceresources. - Configuring the inference traffic gateway — authentication and rate limits via Alauda Build of Envoy AI Gateway.
- Iteratively tuning an inference service's performance to fit specific hardware.
- Planning fine-tuning runs and generating structured reports from their results.
It assumes you are already running the agent and that it can reach an on-premise OpenAI-compatible endpoint with tool calling enabled. If not, start with the prerequisites doc above.
A coding agent that can run kubectl against a real cluster can also delete things. Scope its kubeconfig to a single namespace, prefer --dry-run=server for any apply during exploration, and require a human review of every change before it lands in production. Treat the agent like a junior engineer with cluster access, not an autonomous operator.
Set up the agent's working environment
Before delegating MLOps work, give the agent a small, reliable context to operate in. Three things are almost always worth doing once per project:
- Scope cluster access. Create a dedicated namespace (for example,
mlops-demo-ai-testused in the platform samples) and aServiceAccount/ kubeconfig with permissions limited to the resources the agent should touch — typicallyInferenceService,LLMInferenceService,TrainJob,TrainingRuntime,AIGatewayRoute,AIServiceBackend,BackendSecurityPolicy,SecurityPolicy,BackendTrafficPolicy, and the secrets/configmaps they reference. Avoid cluster-wide write access. - Pin a default hardware profile. Platform Hardware Profiles encode the GPU type, taints, tolerations, and node selectors for your fleet. Pick the right profile up front and tell the agent to use it — this prevents the agent from inventing affinity blocks. See Hardware Profiles.
- Commit an agent context file. Most coding agents read a project-level instructions file (for example,
AGENTS.md,CLAUDE.md, oropencode.md). Use it to record the cluster name, target namespace, the on-prem model endpoint, naming conventions, "always runkubectl apply --dry-run=serverfirst", and any internal links the agent should follow. Once this file exists, every subsequent prompt becomes shorter and more accurate.
Manage InferenceServices and LLMInferenceServices
The platform supports two related resources for serving models:
InferenceService(serving.kserve.io/v1beta1) — the standard KServe predictor used in Create Inference Service using CLI. Best for single-container model servers (vLLM, Triton, custom runtimes).LLMInferenceService— KServe's higher-level LLM resource for multi-component LLM serving (orchestrating predictors, optional prefill/decode disaggregation, and gateway/inference-extension integration). It is recognized by platform features such as Hardware Profiles, which mention it alongsideInferenceService(see Hardware Profiles). Use it when a single-containerInferenceServiceis no longer enough.
A good agent loop for either resource is the same:
Useful prompts to start from:
- "Generate an
InferenceServicefor modelQwen2.5-Coder-7B-Instructusing theaml-vllmruntime, hardware profilesingle-a30-24g, namespacemlops-demo-ai-test. Enable prefix caching and tool calling with thehermesparser. Runkubectl apply --dry-run=serverand show me the diff against any existing object before applying." - "Convert this
InferenceServiceto anLLMInferenceServicefor prefill/decode disaggregation; keep the same model, hardware profile, and served-model name. Show me what changes and why." - "List all
InferenceServiceandLLMInferenceServiceobjects inmlops-demo-ai-test, theirREADYstatus, and the model each one serves. Flag any that have beenNotReadyfor more than 10 minutes and summarize the most recent predictor pod events."
For the YAML fields and platform-specific labels/annotations the agent needs to reproduce, point it at Create Inference Service using CLI as the canonical example. For exposing a new service externally, point it at Configure External Access for Inference Services.
Manage gateways: authentication and rate limits
Alauda Build of Envoy AI Gateway is a required dependency of Alauda Build of KServe and fronts inference traffic with an OpenAI-compatible API surface, AI-aware routing, and per-model policies (see Envoy AI Gateway introduction and installation). The agent is well-suited to author its CRDs, which are otherwise verbose:
A practical agent workflow:
- Tell the agent your intent in business terms. For example: "Expose
qwen-2andllama-3-70bbehind one OpenAI-compatible endpoint athttps://ai.example.internal. Require anAuthorization: BearerAPI key from a KubernetesSecretnamedai-gateway-keys. Limit each key to 60 requests/minute and 200k tokens/hour. Sendqwen-2traffic to theqwen-2InferenceServiceinmlops-demo-ai-testandllama-3-70bto theLLMInferenceServiceof the same name." - Have the agent draft the CRDs in a directory under your infra repo, one file per resource, with comments calling out each policy decision.
- Validate before applying. Ask the agent to run
kubectl apply --dry-run=server -f ./gateway/and to summarize what would change. Apply only after you review. - Smoke-test the new policies. Have the agent send a valid request, an unauthenticated request, and a request that exceeds the rate limit, and confirm the expected 200 / 401 / 429 responses. Capture the test as a small script alongside the manifests so future changes can be re-verified.
For the exact field shape of each CRD, defer to the upstream documentation linked below — versions change, and the agent should read the live spec rather than inventing fields.
Tune service performance to fit your hardware
The list of vLLM and KServe knobs is unchanged from Best practices: tune inference service performance — this section focuses on how an agent can drive that tuning instead of you doing it by hand.
A productive loop:
1. Define service-level objectives
Pin numbers before tuning. Tell the agent what "good enough" looks like:
- Maximum first-token latency (TTFT) at the expected concurrency.
- Maximum P95 inter-token latency or total response time for a representative prompt.
- Minimum sustainable throughput (requests/min or tokens/sec).
- Maximum context length the agent traffic will send.
2. Generate a reproducible benchmark
Ask the agent to write a small benchmark script that mirrors your real traffic — typical prompt size, system prompt, concurrency. Useful starting points include the built-in vllm bench serve command, genai-perf, or a k6/Python script that drives /v1/chat/completions directly. Have the agent run it against the current InferenceService and record the results in a markdown table.
3. Have the agent propose one change at a time
Give the agent the benchmark output and the current YAML. Ask for one change with an expected effect, for example:
- "Add
--enable-prefix-cachingand re-run; expected: lower TTFT on the repeated system-prompt prefix." - "Switch the model from FP16 to AWQ INT4 and raise
--gpu-memory-utilizationto 0.92; expected: more KV cache headroom, larger sustainable context length." - "Increase
--max-num-seqs; expected: higher throughput at the cost of higher P95 latency."
One change per iteration keeps cause and effect attributable.
4. Apply, measure, and record
The agent updates the InferenceService YAML, applies it, waits for READY, re-runs the benchmark, and appends a new row to the results table with the configuration delta.
5. Stop on SLO or hardware ceiling
The loop ends when SLOs are met, or when the next sensible knob is "different hardware" or "different model" — at which point the agent should say so explicitly rather than churn. Common ceilings: KV cache saturated at the target context length, tensor-parallel scaling no longer linear, decode-bound at single-request latency.
For model-size vs. GPU-memory selection, see the table in the prior doc's Choose a model that fits your hardware section. For autoscaling and cold-start trade-offs, see Configure Scaling for Inference Services. For interactive-latency wins, see Speculative Decoding for vLLM Inference Services.
Plan fine-tuning and generate reports
Fine-tuning has two failure modes that coding agents are unusually good at preventing: skipping the planning step ("just run SFT") and skipping the reporting step ("the loss looked fine"). The agent's job is to make both explicit.
Pick the right tool for the job
A reusable fine-tuning plan template
Have the agent fill in this template before any job is submitted, and commit the result alongside the training code. This separates "what we intend" from "what we ran," which is exactly the comparison the report needs later.
Useful prompt: "Read plan.md. Draft a Kubeflow Trainer v2 TrainingRuntime and TrainJob (or a Training Hub notebook) that implements exactly this plan in namespace mlops-demo-ai-test. Highlight any field where the plan is ambiguous and ask me before guessing."
A reusable fine-tuning report template
After the job finishes, ask the agent to ingest the training logs, eval outputs, and resource metrics, and fill in this report. Commit it next to the plan.
Useful prompt: "Generate report.md for TrainJob qwen-coder-sft-2026-05-29 in mlops-demo-ai-test. Pull metrics from MLflow run <id>, training logs from the pod, and eval results from s3://aml-evals/<run-id>/. Compare against the previous run qwen-coder-sft-2026-05-15. If any section can't be filled in from the available data, mark it TODO rather than fabricating numbers."
For experiment tracking and run metadata, MLflow on Kubeflow is the platform-native option; tell the agent to log there from inside the training code so the report has a real source of truth.
A daily MLOps loop
A useful end-to-end sequence the agent can drive, given the setup above:
- Triage. "List inference services in my namespace, surface anything
NotReadyor scaled to zero unexpectedly, summarize recent gateway 4xx/5xx rates." - Tune. "P95 on
qwen-2is over budget. Propose one change, apply, re-benchmark, report." - Update. "There's a new model artifact for
qwen-coder-sft-2026-05-29. Draft the YAML to swap it into theqwen-2InferenceService, gate the rollout to one replica first, and write the smoke test." - Plan. "Draft a fine-tuning plan to fix the tool-calling regression we saw in last week's eval. Justify the method choice."
- Report. "Last night's job finished. Generate the report and tell me whether to promote."
Each step is a separate prompt with its own diff to review. The agent is the typist; you are still the engineer of record.
Best practices and guardrails
- Read-only first, write second. Start every new task by asking the agent to read state (
get,describe, logs, metrics) and describe what it would do before making changes. - Always
--dry-run=server. Make it a standing rule in the agent context file; mention it in every prompt that involveskubectl apply. - One change per iteration. Especially for performance tuning, mixing two changes hides which one helped.
- Never let the agent fabricate metrics. Require it to cite the file, log, or run ID it pulled each number from, and to mark
TODOwhen data is missing. - Keep the loop on-prem. Confirm that no fallback model in any agent config points at a hosted provider (see Connect your coding agent for the per-agent settings to check).
- Commit everything. Plans, reports, generated YAML, and benchmark scripts all go into Git so the next person — or the next agent — can pick up where you left off.
References
- Use Coding Agents with On-Premise Inference Services
- Create Inference Service using CLI
- Configure External Access for Inference Services
- Configure Scaling for Inference Services
- Speculative Decoding for vLLM Inference Services
- Extend Inference Runtimes
- Envoy AI Gateway — introduction
- Install Envoy AI Gateway
- Hardware Profiles
- Fine-Tuning with Kubeflow Trainer v2
- Fine-tuning LLMs with Training Hub
- Fine-tuning with Notebooks
- LLM Compressor with Alauda AI
- MLflow on Kubeflow
- Envoy AI Gateway upstream documentation
- Envoy Gateway upstream documentation
- KServe LLMInferenceService
- vLLM benchmarking