Use Coding Agents with On-Premise Inference Services


Introduction

Coding agents such as opencode, Codex CLI, and Claude Code are terminal-based assistants that read your repository, plan changes, edit files, and run commands on your behalf. They normally talk to a hosted model provider over the internet.

This document shows how to point those agents at a model you serve yourself on Alauda AI, so that your source code, prompts, and infrastructure configuration never leave your cluster. The same on-premise InferenceService that you deploy for any other workload can back an interactive coding agent, as long as it exposes an OpenAI-compatible API and has tool (function) calling enabled.

This page builds directly on the deployment how-tos. It does not repeat how to create or expose an InferenceService; instead it links to them and focuses on the agent-specific configuration and tuning.

WARNING

Coding agents and their configuration formats evolve quickly. The config snippets below are correct starting points for the versions available at the time of writing. Always confirm field names against the current upstream documentation of the agent you use.

Prerequisites

A running, ready InferenceService that serves an OpenAI-compatible API. See Create Inference Service using CLI.
Network access from the machine running the agent to the service endpoint. For access from a developer laptop outside the cluster, see Configure External Access for Inference Services.
A model with tool/function calling support, served with the matching vLLM parser enabled (see Enable tool calling on the runtime). Without this, agents can chat but cannot edit files or run commands.
The agent CLI installed locally (opencode, codex, or claude).

How the pieces fit together

  Coding agent (opencode / Codex / Claude Code)
        │  OpenAI-compatible HTTP  (POST /v1/chat/completions)
        ▼
  External access / Load Balancer  ──►  KServe InferenceService (vLLM)
        ▲                                       │
        └──── Anthropic→OpenAI proxy ───────────┘
             (only required for Claude Code)

opencode and Codex CLI speak the OpenAI Chat Completions API natively, so they can call the InferenceService endpoint directly.
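Claude Code speaks the Anthropic Messages API (/v1/messages), which a stock vLLM OpenAI-compatible endpoint may not expose. Unless your endpoint or gateway already serves /v1/messages, it needs a small translation proxy in front of the OpenAI-compatible endpoint (see Claude Code).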

Step 1: Deploy and smoke-test the endpoint

Deploy your model as an InferenceService following Create Inference Service using CLI, and if the agent runs outside the cluster, expose it following Configure External Access for Inference Services.

Before wiring up any agent, confirm the endpoint answers a chat request. Coding agents fail in confusing ways if the base URL, model name, or auth is wrong, so validate with curl first:

# BASE_URL must end at /v1
BASE_URL="https://your-inference-service-domain.com/v1"
MODEL="qwen-2"        # must match --served-model-name in the InferenceService
API_KEY="sk-local"    # any non-empty value if the server does not enforce auth

curl -sS "${BASE_URL}/chat/completions" \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"${MODEL}"'",
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "max_tokens": 16
  }'

A normal JSON completion confirms the endpoint is reachable and the model name is correct. Note the three values you will reuse for every agent: base URL (ending in /v1), model name (the --served-model-name), and API key.
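The same check can be scripted, which is convenient when the smoke test should run from CI or a notebook. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and it assumes the endpoint returns the standard Chat Completions response shape.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, api_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to {base_url}/chat/completions."""
    base_url = base_url.rstrip("/")
    if not base_url.endswith("/v1"):
        raise ValueError("base URL must end in /v1")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 16,
    }
    return urllib.request.Request(
        url=base_url + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Return the assistant text from a Chat Completions response."""
    return response_body["choices"][0]["message"]["content"]


def smoke_test(base_url: str, model: str, api_key: str) -> str:
    """Send the request and return the reply; raises on HTTP or JSON errors."""
    request = build_chat_request(base_url, model, api_key, "Reply with the single word: ready")
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_reply(json.load(response))
```

Calling smoke_test(BASE_URL, MODEL, API_KEY) with the three values above should return a short reply; an HTTPError with status 404 or 401 points at the model name, base URL, or key.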

Step 2: Enable tool calling on the runtime

Coding agents work by calling tools (read file, write file, run shell). This requires the model to emit tool calls and vLLM to parse them. Add the following flags to the vLLM launch command in your InferenceService (in the sample from Create Inference Service using CLI, they go on the python3 -m vllm.entrypoints.openai.api_server line):

--enable-auto-tool-choice \
--tool-call-parser hermes        # match the parser to your model family

The parser must match the model. For example, Qwen2.5 / Qwen3 family models commonly use hermes; Llama 3.x models use llama3_json; Mistral models use mistral. Check the vLLM tool calling documentation for the current parser list and the value that matches your model.
Some models need a specific chat template to emit tool calls correctly; pass --chat-template if the model card calls for it.
If you serve a reasoning model, also enable the matching --reasoning-parser so the agent receives clean assistant content separated from reasoning traces.

After you connect an agent in Step 3, verify tool calling end-to-end by asking it to perform a trivial file operation (for example, "create hello.txt containing the word hi"). If the model replies in prose instead of editing the file, tool calling is not wired up correctly; recheck the parser and the model.
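Tool calling can also be probed directly against the endpoint, without any agent in the loop. The sketch below builds a Chat Completions request that offers one tool and inspects the response for parsed tool calls. The write_file tool and the helper names are hypothetical and exist only for this check; the request and response fields follow the OpenAI tool-calling format.

```python
import json

# Hypothetical tool used only for this probe; real agents define their own tools.
WRITE_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Write text content to a file path.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}


def build_tool_probe(model: str) -> dict:
    """Request body to POST to {BASE_URL}/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "Create hello.txt containing the word hi."}],
        "tools": [WRITE_FILE_TOOL],
        "tool_choice": "auto",
        "max_tokens": 256,
    }


def parsed_tool_calls(response_body: dict) -> list:
    """Return [(name, arguments_dict), ...]; an empty list means a prose-only reply."""
    message = response_body["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON string and must decode to an object.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls
```

An empty result, or a JSON decoding error on the arguments, suggests the parser does not match the model.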

Step 3: Connect your coding agent

opencode

opencode reads configuration from opencode.json in the project root or ~/.config/opencode/opencode.json. Define a custom OpenAI-compatible provider that points at your endpoint:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "onprem": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "On-Prem Alauda AI",
      "options": {
        "baseURL": "https://your-inference-service-domain.com/v1",
        "apiKey": "{env:ONPREM_API_KEY}"
      },
      "models": {
        "qwen-2": {
          "name": "Qwen2.5-Coder (on-prem)"
        }
      }
    }
  }
}

The model key (qwen-2) must match the --served-model-name of the InferenceService.
Export the key the config references, then select the model: export ONPREM_API_KEY=sk-local and choose onprem/qwen-2 with the /models command inside opencode.
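If several developers or services share one endpoint, generating the file from the three values recorded in Step 1 keeps the model key from drifting away from the served model name. This is a minimal sketch that reproduces the config shown above; the function name and parameters are illustrative, and the field names should be checked against your opencode version.

```python
import json


def opencode_config(base_url: str, served_model_name: str, display_name: str) -> dict:
    """Return an opencode config with one OpenAI-compatible provider named 'onprem'."""
    return {
        "$schema": "https://opencode.ai/config.json",
        "provider": {
            "onprem": {
                "npm": "@ai-sdk/openai-compatible",
                "name": "On-Prem Alauda AI",
                "options": {
                    "baseURL": base_url,
                    # Resolved by opencode from the environment at runtime.
                    "apiKey": "{env:ONPREM_API_KEY}",
                },
                # The key must equal --served-model-name.
                "models": {served_model_name: {"name": display_name}},
            }
        },
    }


config = opencode_config(
    "https://your-inference-service-domain.com/v1",
    "qwen-2",
    "Qwen2.5-Coder (on-prem)",
)
print(json.dumps(config, indent=2))
```

Redirect the output to opencode.json in the project root to use it.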

Codex CLI

Codex CLI reads ~/.codex/config.toml. Register your endpoint as a model provider and select it:

model = "qwen-2"
model_provider = "onprem"

[model_providers.onprem]
name = "On-Prem Alauda AI"
base_url = "https://your-inference-service-domain.com/v1"
env_key = "ONPREM_API_KEY"
wire_api = "chat"

base_url must end at /v1; model must match the --served-model-name.
env_key names the environment variable that holds the API key: export ONPREM_API_KEY=sk-local.
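Use wire_api = "chat" for vLLM's OpenAI Chat Completions API. Newer Codex releases may prefer or require a different wire_api value, so check the Codex configuration reference for the version you run.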

Claude Code

Claude Code communicates over the Anthropic Messages API (/v1/messages). There are two ways to back it with an on-premise model — pick the one that matches your runtime.

Option A: point Claude Code directly at the on-premise endpoint

If the on-premise endpoint already speaks the Anthropic Messages API — either natively (for example, some llama.cpp llama-server builds and similar local runners) or because you front your InferenceService with a gateway that exposes /v1/messages — you can configure Claude Code with environment variables alone, no separate proxy needed:

export ANTHROPIC_BASE_URL="http://127.0.0.1:9123"      # on-premise endpoint speaking the Anthropic Messages API
export ANTHROPIC_AUTH_TOKEN="not_set"                  # any value; the endpoint may ignore it
export ANTHROPIC_API_KEY="not_set_either!"             # any value; both vars are checked
export ANTHROPIC_MODEL="qwen-2"                        # must match what the endpoint exposes (e.g. served-model-name)

# Keep traffic on-premise and trim features the on-prem model can't honor:
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1      # suppress optional traffic to Anthropic-hosted services
export CLAUDE_CODE_ATTRIBUTION_HEADER=0                # drop the Anthropic attribution header
export CLAUDE_CODE_ENABLE_TELEMETRY=0                  # disable telemetry
export CLAUDE_CODE_DISABLE_1M_CONTEXT=1                # disable the 1M-context feature; most on-prem models can't serve it
export CLAUDE_CODE_MAX_OUTPUT_TOKENS=64000             # cap to what the on-prem model and runtime support

claude

A few notes on the values:

The ANTHROPIC_AUTH_TOKEN / ANTHROPIC_API_KEY values must be non-empty but their content does not matter if your endpoint does not check them; gate access at the endpoint or in front of it (see Manage gateways for adding auth via Envoy AI Gateway).
ANTHROPIC_MODEL must match the model name the endpoint exposes (the --served-model-name from your InferenceService, or whatever your local runner advertises).
The CLAUDE_CODE_DISABLE_* and CLAUDE_CODE_*=0 flags are what actually keep an "on-prem" setup on-prem: without them, Claude Code can still emit non-essential requests to Anthropic-hosted endpoints and ask the model for features (1M context, very large outputs) the on-prem model cannot honor.
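Before launching claude, it can help to confirm that the endpoint really answers Anthropic-format requests. The sketch below builds such a request with the Python standard library. It assumes the endpoint accepts the standard Messages API headers (x-api-key, anthropic-version) and that the base URL is given without a /v1 suffix, as in the exports above; the helper names are illustrative.

```python
import json
import urllib.request


def build_messages_request(base_url: str, model: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to {base_url}/v1/messages."""
    body = {
        "model": model,
        "max_tokens": 16,
        "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Join the text blocks of a Messages API response."""
    blocks = response_body.get("content", [])
    return "".join(block["text"] for block in blocks if block.get("type") == "text")


def probe(base_url: str, model: str, api_key: str) -> str:
    """Send the request and return the reply text; raises on HTTP or JSON errors."""
    request = build_messages_request(base_url, model, api_key)
    with urllib.request.urlopen(request, timeout=60) as response:
        return extract_text(json.load(response))
```

A 404 from probe("http://127.0.0.1:9123", "qwen-2", "not_set") usually means the endpoint is OpenAI-compatible only, in which case Option B applies.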

Option B: front an OpenAI-compatible endpoint with a translation proxy

If your endpoint is OpenAI-compatible only (for example, a stock vLLM InferenceService exposing /v1/chat/completions but not /v1/messages), run a small gateway that accepts Anthropic-format requests and forwards them. Two common options:

LiteLLM proxy, which exposes an Anthropic-compatible /v1/messages endpoint and routes to any backend model.
claude-code-router, a proxy built specifically to point Claude Code at OpenAI-compatible and other backends.

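Then use the same env-var configuration from Option A, with ANTHROPIC_BASE_URL pointing at the proxy and ANTHROPIC_MODEL set to the model alias the proxy exposes. Optionally also set ANTHROPIC_SMALL_FAST_MODEL to an on-prem model so background/low-cost requests stay on-prem too; newer Claude Code releases may use ANTHROPIC_DEFAULT_HAIKU_MODEL for this purpose, so confirm the variable name against the current Claude Code settings documentation.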
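To make the proxy's job concrete, the following sketch shows the core of the translation for plain text turns only. It is not how LiteLLM or claude-code-router are implemented; real proxies also translate tool definitions, tool results, images, and streaming events. The function names are illustrative.

```python
def _flatten_text(content) -> str:
    """Anthropic content is a string or a list of blocks; keep the text blocks only."""
    if isinstance(content, str):
        return content
    return "".join(block.get("text", "") for block in content if block.get("type") == "text")


def anthropic_to_openai(request: dict) -> dict:
    """Convert a Messages API request body into a Chat Completions request body."""
    messages = []
    # Anthropic carries the system prompt as a top-level field;
    # OpenAI expects it as the first message.
    if request.get("system"):
        messages.append({"role": "system", "content": _flatten_text(request["system"])})
    for message in request["messages"]:
        messages.append({"role": message["role"], "content": _flatten_text(message["content"])})
    return {
        "model": request["model"],
        "messages": messages,
        "max_tokens": request["max_tokens"],
    }


# OpenAI finish_reason -> Anthropic stop_reason (text-only subset).
_STOP_REASONS = {"stop": "end_turn", "length": "max_tokens"}


def openai_to_anthropic(response: dict) -> dict:
    """Convert a Chat Completions response body into a Messages API response body."""
    choice = response["choices"][0]
    return {
        "type": "message",
        "role": "assistant",
        "model": response.get("model", ""),
        "content": [{"type": "text", "text": choice["message"].get("content") or ""}],
        "stop_reason": _STOP_REASONS.get(choice.get("finish_reason"), "end_turn"),
    }
```

The asymmetry in where the system prompt lives and in how stop reasons are named is why pointing Claude Code straight at /v1/chat/completions does not work.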

Regardless of which option you pick, Claude Code's agentic quality depends heavily on the served model's tool-calling fidelity — prefer a strong instruction- and tool-tuned model, and confirm tool calls round-trip end-to-end before relying on it.

Best practices

Choose a model that fits your hardware

Start from the GPU memory you have, then pick the largest capable model that leaves headroom for the KV cache. A rough weight-size estimate is parameters × bytes per parameter (FP16 ≈ 2 bytes, FP8/INT8 ≈ 1 byte, INT4 ≈ 0.5 bytes); the KV cache and runtime overhead come on top of that. Leave 15–25% of GPU memory as headroom.

GPU memory (single GPU)	Example GPUs	Practical coding-model choices
16–24 GB	L4, A10, A30 (24G), RTX 4090	7–8B at FP16 (on 24 GB cards), or 14B quantized (AWQ/GPTQ INT4)
40–48 GB	A40, L40S, A6000, A100-40G	14B at FP16, or 32B quantized (AWQ/GPTQ INT4)
80 GB	A100-80G, H100, H800	32B at FP16, or 70B at INT4 (70B at FP8 needs about 70 GB for weights alone, leaving little room for KV cache)
Multi-GPU (2–8×)	2–8 × 80 GB	70B+ at FP16 with tensor parallel, or large MoE models
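The weight-size rule of thumb is easy to turn into a quick check. This is a rough sketch of that arithmetic only: it ignores KV cache, activations, and runtime overhead beyond the headroom fraction, and the function names are illustrative.

```python
# Approximate bytes per parameter for common precisions.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate weight size in GB: parameters (in billions) x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def fits(params_billion: float, precision: str, gpu_gb: float, headroom: float = 0.20) -> bool:
    """True if the weights fit while leaving `headroom` of GPU memory free."""
    return weight_gb(params_billion, precision) <= gpu_gb * (1.0 - headroom)


print(weight_gb(14, "fp16"))     # 28.0 GB of weights
print(fits(14, "fp16", 24))      # False: 28 GB does not fit on a 24 GB card
print(fits(14, "int4", 24))      # True: about 7 GB of weights
print(fits(70, "fp8", 80))       # False: 70 GB leaves under 15% headroom
```

Treat a True result as "worth trying", not as a guarantee; long contexts can still exhaust the remaining memory.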

Additional selection guidance:

Prefer code-specialized, instruction-tuned models that natively support tool/function calling. If the model card does not mention tool calling, the agent will not be able to edit files reliably.
Confirm a matching vLLM parser exists for the model (see Enable tool calling on the runtime) before committing to it.
Budget for context length. Coding agents send large prompts (system prompt + file and repo context). Pick a model whose context window covers your largest expected prompt, and remember that a longer --max-model-len consumes more KV cache per request, reducing concurrency.
Quantization is a force multiplier on-premise. INT4 (AWQ/GPTQ) or FP8 lets you fit a noticeably more capable model in the same VRAM, which usually matters more for agent quality than raw FP16 precision.
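The cost of context length can be estimated the same way. For a standard transformer, each cached token stores one key and one value vector per layer and per KV head. The sketch below applies that formula; the model shape is a hypothetical 7B-class example with grouped-query attention, so read the real values from your model's config.json before relying on the numbers.

```python
def kv_cache_bytes_per_token(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_element: int = 2,
) -> int:
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gib(tokens: int, **shape) -> float:
    """KV cache size in GiB for `tokens` cached tokens."""
    return tokens * kv_cache_bytes_per_token(**shape) / 2**30


# Hypothetical shape, for illustration only.
SHAPE = {"num_layers": 28, "num_kv_heads": 4, "head_dim": 128}

print(kv_cache_gib(32_768, **SHAPE))                       # 1.75 (16-bit cache)
print(kv_cache_gib(32_768, bytes_per_element=1, **SHAPE))  # 0.875 (8-bit cache)
```

Under these assumptions a single 32K-token request holds 1.75 GiB of cache, so the memory left after loading weights directly bounds how many such requests can run concurrently.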

Tune inference service performance

Coding-agent traffic has a distinctive shape: long, highly repetitive prompts (the same system prompt and repo context resent every turn), bursts of short interactive requests, and sensitivity to first-token latency. Tune for it:

Enable prefix caching (--enable-prefix-caching). This is the single highest-impact flag for coding agents: the shared prompt prefix is reused across turns instead of being recomputed, cutting prefill cost and latency dramatically. See Automatic Prefix Caching — vLLM.
Raise --gpu-memory-utilization toward 0.90–0.95 to enlarge the KV cache, which increases concurrency and the context length you can sustain.
Right-size --max-model-len. Set it to the largest context the agent actually needs, not the model's theoretical maximum — every extra token of capacity costs KV-cache memory.
Enable chunked prefill (--enable-chunked-prefill) when long prompts cause latency spikes under concurrency, so decode steps are not starved by a large prefill. Note the CLI sample disables it by default.
Allow CUDA graphs for steady-state latency: the CLI sample sets ENFORCE_EAGER=True (eager mode, which starts faster but runs slower). Once the service is stable, switch to non-eager to capture CUDA graphs, at the cost of longer startup.
Tune batching with --max-num-seqs and --max-num-batched-tokens to balance throughput against per-request latency for your concurrency level.
Use FP8 KV cache (--kv-cache-dtype fp8) to stretch context length and concurrency when memory is tight.
Shard large models across GPUs with --tensor-parallel-size when a model does not fit on one card.
Consider speculative decoding for lower interactive latency on agent loops — see Speculative Decoding for vLLM Inference Services.
Mind autoscaling and cold starts. For interactive single-user agent use, keep minReplicas: 1 — scaling from zero adds a multi-minute cold start that is painful mid-task. For bursty multi-developer usage, configure autoscaling deliberately; see Configure Scaling for Inference Services and Set Up Autoscaling for Inference Services with KEDA.
Allow long requests. Agent turns can be long-running; size the Knative serving.knative.dev/progress-deadline annotation and your client timeouts accordingly. If requests are cut off, see Inference timeout troubleshooting.
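As a starting point, the flags discussed above can be assembled in one place so that a change in model or GPU count produces a consistent argument list. This is a sketch, not a recommended production profile: the function name and default values are assumptions, and flag availability depends on your vLLM version.

```python
def vllm_agent_flags(
    served_model_name: str,
    tool_call_parser: str,
    max_model_len: int,
    gpu_memory_utilization: float = 0.90,
    tensor_parallel_size: int = 1,
    fp8_kv_cache: bool = False,
) -> list:
    """Return vLLM server arguments tuned for coding-agent traffic."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    flags = [
        "--served-model-name", served_model_name,
        "--enable-auto-tool-choice",
        "--tool-call-parser", tool_call_parser,
        "--enable-prefix-caching",
        "--enable-chunked-prefill",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if tensor_parallel_size > 1:
        # Shard across GPUs only when one card is not enough.
        flags += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if fp8_kv_cache:
        flags += ["--kv-cache-dtype", "fp8"]
    return flags


print(" ".join(vllm_agent_flags("qwen-2", "hermes", 32768)))
```

Append the result to the vLLM launch command in the InferenceService and adjust one setting at a time while watching latency.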

Getting started with vibe coding

"Vibe coding" — iterating quickly by describing intent and letting the agent write the code — works well with a self-hosted model once the basics are right:

Start with a 7–14B code model that fits comfortably on your GPU with headroom; a responsive smaller model beats a sluggish larger one for interactive flow.
Set a low temperature (around 0–0.2) for code generation to keep edits deterministic and reduce flailing.
Validate tool calling with one trivial task ("create a file and run it") before attempting anything real.
Keep prompts focused — open or reference only the relevant files so the agent's context stays on-topic and prefill stays cheap.
Work in small, reviewable steps and read each diff before accepting it. Commit often so you can roll back a bad suggestion cleanly.

Getting started with MLOps

Because the model runs inside your cluster, a coding agent backed by an on-premise InferenceService is a good fit for operating the platform itself — your manifests, configs, and proprietary code never leave the environment, which matters in regulated settings. Productive starting tasks:

Generate or modify InferenceService YAML — for example, "write an InferenceService for model X targeting a 24 GB GPU with prefix caching and tool calling enabled."
Add autoscaling, scheduling, or resource configuration — KEDA/KPA autoscaling, CUDA-version-aware scheduling, or Kueue/Volcano queueing.
Author and adjust pipelines and monitoring for your model lifecycle.
Close the loop: deploy a model with the agent, then use that same on-premise model to drive further platform operations.

Troubleshooting

Agent chats but never edits files or runs commands. Tool calling is not enabled or the parser does not match the model — see Enable tool calling on the runtime.
model not found / 404. The model name in the agent config does not match the --served-model-name, or the base URL does not end in /v1.
401 / 403. The agent is sending the wrong (or no) API key for what the endpoint or gateway expects.
Requests time out on long tasks. Increase the Knative progress-deadline annotation and the client timeout — see Inference timeout troubleshooting.
First request after idle is very slow. The service scaled to zero and is cold-starting; set minReplicas: 1 for interactive use.
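For the model-name and base-URL cases, comparing the agent's configured values with what the endpoint advertises narrows the problem quickly. The sketch below assumes the endpoint serves the OpenAI-compatible GET /v1/models listing; the helper names are illustrative.

```python
import json
import urllib.request


def check_agent_settings(models_response: dict, base_url: str, configured_model: str) -> list:
    """Return a list of human-readable problems; an empty list means no mismatch found."""
    problems = []
    if not base_url.rstrip("/").endswith("/v1"):
        problems.append("base URL does not end in /v1")
    served = [entry["id"] for entry in models_response.get("data", [])]
    if configured_model not in served:
        offered = ", ".join(served) or "none"
        problems.append(f"model '{configured_model}' is not served (endpoint offers: {offered})")
    return problems


def fetch_models(base_url: str, api_key: str) -> dict:
    """GET {base_url}/models and return the decoded JSON body."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

For example, check_agent_settings(fetch_models(BASE_URL, API_KEY), BASE_URL, "qwen-2") returns an empty list when the agent's settings line up with the service.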
