Use Coding Agents with On-Premise Inference Services
TOC
IntroductionPrerequisitesHow the pieces fit togetherStep 1: Deploy and smoke-test the endpointStep 2: Enable tool calling on the runtimeStep 3: Connect your coding agentopencodeCodex CLIClaude CodeOption A: point Claude Code directly at the on-premise endpointOption B: front an OpenAI-compatible endpoint with a translation proxyBest practicesChoose a model that fits your hardwareTune inference service performanceGetting started with vibe codingGetting started with MLOpsTroubleshootingReferencesIntroduction
Coding agents such as opencode, Codex CLI, and Claude Code are terminal-based assistants that read your repository, plan changes, edit files, and run commands on your behalf. They normally talk to a hosted model provider over the internet.
This document shows how to point those agents at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise InferenceService that you deploy for any other workload can back an interactive coding agent, as long as it exposes an OpenAI-compatible API and has tool (function) calling enabled.
This page builds directly on the deployment how-tos. It does not repeat how to create or expose an InferenceService; instead it links to them and focuses on the agent-specific configuration and tuning.
Coding agents and their configuration formats evolve quickly. The config snippets below are correct starting points for the versions available at the time of writing. Always confirm field names against the current upstream documentation of the agent you use.
Prerequisites
- A running, ready
InferenceServicethat serves an OpenAI-compatible API. See Create Inference Service using CLI. - Network access from the machine running the agent to the service endpoint. For access from a developer laptop outside the cluster, see Configure External Access for Inference Services.
- A model with tool/function calling support, served with the matching vLLM parser enabled (see Enable tool calling on the runtime). Without this, agents can chat but cannot edit files or run commands.
- The agent CLI installed locally (
opencode,codex, orclaude).
How the pieces fit together
- opencode and Codex CLI speak the OpenAI Chat Completions API natively, so they can call the
InferenceServiceendpoint directly. - Claude Code speaks the Anthropic Messages API, which vLLM does not serve. It requires a small translation proxy in front of the OpenAI-compatible endpoint (see Claude Code).
Step 1: Deploy and smoke-test the endpoint
Deploy your model as an InferenceService following Create Inference Service using CLI, and if the agent runs outside the cluster, expose it following Configure External Access for Inference Services.
Before wiring up any agent, confirm the endpoint answers a chat request. Coding agents fail in confusing ways if the base URL, model name, or auth is wrong, so validate with curl first:
A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: base URL (ending in /v1), model name (the --served-model-name), and API key.
Step 2: Enable tool calling on the runtime
Coding agents work by calling tools (read file, write file, run shell). This requires the model to emit tool calls and vLLM to parse them. Add the following flags to the vLLM launch command in your InferenceService (in the sample from Create Inference Service using CLI, they go on the python3 -m vllm.entrypoints.openai.api_server line):
- The parser must match the model. For example, Qwen2.5 / Qwen3 family models commonly use
hermes; Llama 3.x models usellama3_json; Mistral models usemistral. Check the vLLM tool calling documentation for the current parser list and the value that matches your model. - Some models need a specific chat template to emit tool calls correctly; pass
--chat-templateif the model card calls for it. - If you serve a reasoning model, also enable the matching
--reasoning-parserso the agent receives clean assistant content separated from reasoning traces.
Verify tool calling end-to-end by asking the agent to perform a trivial file operation (for example, "create hello.txt containing the word hi"). If the model replies in prose instead of editing the file, tool calling is not wired up correctly — recheck the parser and model.
Step 3: Connect your coding agent
opencode
opencode reads configuration from opencode.json in the project root or ~/.config/opencode/opencode.json. Define a custom OpenAI-compatible provider that points at your endpoint:
- The model key (
qwen-2) must match the--served-model-nameof theInferenceService. - Export the key the config references, then select the model:
export ONPREM_API_KEY=sk-localand chooseonprem/qwen-2with the/modelscommand inside opencode.
Codex CLI
Codex CLI reads ~/.codex/config.toml. Register your endpoint as a model provider and select it:
base_urlmust end at/v1;modelmust match the--served-model-name.env_keynames the environment variable that holds the API key:export ONPREM_API_KEY=sk-local.- Use
wire_api = "chat"for vLLM's OpenAI Chat Completions API.
Claude Code
Claude Code communicates over the Anthropic Messages API (/v1/messages). There are two ways to back it with an on-premise model — pick the one that matches your runtime.
Option A: point Claude Code directly at the on-premise endpoint
If the on-premise endpoint already speaks the Anthropic Messages API — either natively (for example, some llama.cpp llama-server builds and similar local runners) or because you front your InferenceService with a gateway that exposes /v1/messages — you can configure Claude Code with environment variables alone, no separate proxy needed:
A few notes on the values:
- The
ANTHROPIC_AUTH_TOKEN/ANTHROPIC_API_KEYvalues must be non-empty but their content does not matter if your endpoint does not check them; gate access at the endpoint or in front of it (see Manage gateways for adding auth via Envoy AI Gateway). ANTHROPIC_MODELmust match the model name the endpoint exposes (the--served-model-namefrom yourInferenceService, or whatever your local runner advertises).- The
CLAUDE_CODE_DISABLE_*andCLAUDE_CODE_*=0flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) the on-prem model cannot honor.
Option B: front an OpenAI-compatible endpoint with a translation proxy
If your endpoint is OpenAI-compatible only (for example, a stock vLLM InferenceService exposing /v1/chat/completions but not /v1/messages), run a small gateway that accepts Anthropic-format requests and forwards them. Two common options:
- LiteLLM proxy, which exposes an Anthropic-compatible
/v1/messagesendpoint and routes to any backend model. - claude-code-router, a proxy built specifically to point Claude Code at OpenAI-compatible and other backends.
Then use the same env-var configuration from Option A, with ANTHROPIC_BASE_URL pointing at the proxy and ANTHROPIC_MODEL set to the model alias the proxy exposes. Optionally also set ANTHROPIC_SMALL_FAST_MODEL to an on-prem model so background/low-cost requests stay on-prem too.
Regardless of which option you pick, Claude Code's agentic quality depends heavily on the served model's tool-calling fidelity — prefer a strong instruction- and tool-tuned model, and confirm tool calls round-trip end-to-end before relying on it.
Best practices
Choose a model that fits your hardware
Start from the GPU memory you have, then pick the largest capable model that leaves headroom for the KV cache. A rough weight-size estimate is parameters × bytes-per-parameter — FP16 ≈ 2 bytes, FP8/INT8 ≈ 1 byte, INT4 ≈ 0.5 bytes per parameter — on top of which the KV cache and runtime overhead consume more memory. Leave 15–25% headroom.
Additional selection guidance:
- Prefer code-specialized, instruction-tuned models that natively support tool/function calling. If the model card does not mention tool calling, the agent will not be able to edit files reliably.
- Confirm a matching vLLM parser exists for the model (see Enable tool calling on the runtime) before committing to it.
- Budget for context length. Coding agents send large prompts (system prompt + file and repo context). Pick a model whose context window covers your largest expected prompt, and remember that a longer
--max-model-lenconsumes more KV cache per request, reducing concurrency. - Quantization is a force multiplier on-premise. INT4 (AWQ/GPTQ) or FP8 lets you fit a noticeably more capable model in the same VRAM, which usually matters more for agent quality than raw FP16 precision.
Tune inference service performance
Coding-agent traffic has a distinctive shape: long, highly repetitive prompts (the same system prompt and repo context resent every turn), bursts of short interactive requests, and sensitivity to first-token latency. Tune for it:
- Enable prefix caching (
--enable-prefix-caching). This is the single highest-impact flag for coding agents: the shared prompt prefix is reused across turns instead of being recomputed, cutting prefill cost and latency dramatically. See Automatic Prefix Caching — vLLM. - Raise
--gpu-memory-utilizationtoward0.90–0.95to enlarge the KV cache, which increases concurrency and the context length you can sustain. - Right-size
--max-model-len. Set it to the largest context the agent actually needs, not the model's theoretical maximum — every extra token of capacity costs KV-cache memory. - Enable chunked prefill (
--enable-chunked-prefill) when long prompts cause latency spikes under concurrency, so decode steps are not starved by a large prefill. Note the CLI sample disables it by default. - Allow CUDA graphs for steady-state latency: the CLI sample sets
ENFORCE_EAGER=True(eager mode, which starts faster but runs slower). Once the service is stable, switch to non-eager to capture CUDA graphs, at the cost of longer startup. - Tune batching with
--max-num-seqsand--max-num-batched-tokensto balance throughput against per-request latency for your concurrency level. - Use FP8 KV cache (
--kv-cache-dtype fp8) to stretch context length and concurrency when memory is tight. - Shard large models across GPUs with
--tensor-parallel-sizewhen a model does not fit on one card. - Consider speculative decoding for lower interactive latency on agent loops — see Speculative Decoding for vLLM Inference Services.
- Mind autoscaling and cold starts. For interactive single-user agent use, keep
minReplicas: 1— scaling from zero adds a multi-minute cold start that is painful mid-task. For bursty multi-developer usage, configure autoscaling deliberately; see Configure Scaling for Inference Services and Set Up Autoscaling for Inference Services with KEDA. - Allow long requests. Agent turns can be long-running; size the Knative
serving.knative.dev/progress-deadlineannotation and your client timeouts accordingly. If requests are cut off, see Inference timeout troubleshooting.
Getting started with vibe coding
"Vibe coding" — iterating quickly by describing intent and letting the agent write the code — works well with a self-hosted model once the basics are right:
- Start with a 7–14B code model that fits comfortably on your GPU with headroom; a responsive smaller model beats a sluggish larger one for interactive flow.
- Set a low temperature (around
0–0.2) for code generation to keep edits deterministic and reduce flailing. - Validate tool calling with one trivial task ("create a file and run it") before attempting anything real.
- Keep prompts focused — open or reference only the relevant files so the agent's context stays on-topic and prefill stays cheap.
- Work in small, reviewable steps and read each diff before accepting it. Commit often so you can roll back a bad suggestion cleanly.
Getting started with MLOps
Because the model runs inside your cluster, a coding agent backed by an on-premise InferenceService is a good fit for operating the platform itself — your manifests, configs, and proprietary code never leave the environment, which matters in regulated settings. Productive starting tasks:
- Generate or modify
InferenceServiceYAML — for example, "write anInferenceServicefor model X targeting a 24 GB GPU with prefix caching and tool calling enabled." - Add autoscaling, scheduling, or resource configuration — KEDA/KPA autoscaling, CUDA-version-aware scheduling, or Kueue/Volcano queueing.
- Author and adjust pipelines and monitoring for your model lifecycle.
- Close the loop: deploy a model with the agent, then use that same on-premise model to drive further platform operations.
Troubleshooting
- Agent chats but never edits files or runs commands. Tool calling is not enabled or the parser does not match the model — see Enable tool calling on the runtime.
model not found/ 404. The model name in the agent config does not match the--served-model-name, or the base URL does not end in/v1.- 401 / 403. The agent is sending the wrong (or no) API key for what the endpoint or gateway expects.
- Requests time out on long tasks. Increase the Knative
progress-deadlineannotation and the client timeout — see Inference timeout troubleshooting. - First request after idle is very slow. The service scaled to zero and is cold-starting; set
minReplicas: 1for interactive use.
References
- Create Inference Service using CLI
- Configure External Access for Inference Services
- Configure Scaling for Inference Services
- Set Up Autoscaling for Inference Services with KEDA
- Speculative Decoding for vLLM Inference Services
- Extend Inference Runtimes
- Tool Calling — vLLM
- Automatic Prefix Caching — vLLM
- opencode documentation
- Codex CLI
- Claude Code documentation
- LiteLLM
- claude-code-router