<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog</id>
    <title>KServe Blog</title>
    <updated>2026-03-13T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog"/>
    <subtitle>KServe Blog</subtitle>
    <icon>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/img/favicon-32x32.png</icon>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.17 - Production-Ready LLM Serving with LLMInferenceService]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release"/>
        <updated>2026-03-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.17 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on March 13, 2026</em></p>
<p>We are excited to announce the release of <strong>KServe v0.17</strong>, a landmark release that brings <strong>LLMInferenceService</strong> to production readiness with a GenAI-first architecture built on the <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer">llm-d</a> framework. This release introduces KV-cache aware intelligent routing, disaggregated prefill-decode, distributed inference with tensor/data/expert parallelism, Envoy AI Gateway integration with token-based rate limiting, and a completely restructured modular Helm chart architecture.</p>
<h2 id="-llminferenceservice-genai-first-architecture">🤖 LLMInferenceService: GenAI-First Architecture</h2>
<p>KServe v0.17 elevates <strong>LLMInferenceService</strong> from an experimental feature to a production-ready CRD purpose-built for generative AI workloads. Built on the <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer">llm-d</a> framework, LLMInferenceService provides a GenAI-first architecture that goes beyond traditional InferenceService to address the unique challenges of serving large language models at scale.</p>
<p>Unlike InferenceService, which is designed for predictive AI workloads, LLMInferenceService natively supports:</p>
<ul>
<li><strong>Distributed inference</strong> across multiple nodes and GPUs</li>
<li><strong>KV-cache aware scheduling</strong> for intelligent request routing</li>
<li><strong>Disaggregated prefill-decode</strong> for optimal resource utilization</li>
<li><strong>Gateway Inference Extension (GIE)</strong> integration for advanced traffic management</li>
<li><strong>Token-based rate limiting</strong> via Envoy AI Gateway</li>
</ul>
<table><thead><tr><th>Feature</th><th>InferenceService</th><th>LLMInferenceService</th></tr></thead><tbody><tr><td><strong>Primary Use Case</strong></td><td>Predictive AI</td><td>Generative AI</td></tr><tr><td><strong>Routing</strong></td><td>Standard Gateway</td><td>KV-cache aware with EPP</td></tr><tr><td><strong>Parallelism</strong></td><td>Worker Spec</td><td>TP, DP, EP native support</td></tr><tr><td><strong>Prefill-Decode</strong></td><td>N/A</td><td>Disaggregated separation</td></tr><tr><td><strong>Scaling</strong></td><td>HPA/KPA</td><td>WVA + KEDA</td></tr></tbody></table>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-serving
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 3
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      pool: {}
</code></pre>
<p>This creates a full serving stack including the Deployment, Service, Gateway, HTTPRoute, <strong>InferencePool</strong>, <strong>InferenceModel</strong>, and <strong>EPP (Endpoint Picker Pod)</strong> — all managed by the LLMInferenceService controller.</p>
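<p>For reference, the InferencePool that the controller manages resembles the following hand-written equivalent. This is an illustrative sketch using Gateway Inference Extension field names; the resource name, selector labels, and EPP Service name are assumptions for this example, not actual controller output:</p>
<pre><code class="language-yaml"># Sketch of a GIE InferencePool similar to what the controller manages.
# Names and labels below are illustrative assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama3-serving-pool        # hypothetical name
spec:
  targetPortNumber: 8000           # port the vLLM containers listen on
  selector:
    app: llama3-serving            # assumed pod label
  extensionRef:
    name: llama3-serving-epp       # Service fronting the Endpoint Picker Pod
</code></pre>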
<h2 id="-key-llminferenceservice-features-in-v017">🚀 Key LLMInferenceService Features in v0.17</h2>
<h3 id="-kv-cache-aware-scheduling-with-gateway-inference-extension">🧠 KV-Cache Aware Scheduling with Gateway Inference Extension</h3>
<p>LLMInferenceService integrates with <a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer"><strong>Gateway Inference Extension (GIE) v1.3.0</strong></a>, a Kubernetes SIG project that extends the Gateway API with AI-specific routing capabilities. At the heart of this integration is the <strong>Endpoint Picker Pod (EPP)</strong> from the <a href="https://github.com/llm-d/llm-d-inference-scheduler" target="_blank" rel="noopener noreferrer">llm-d inference scheduler</a>, an intelligent scheduler that routes requests based on real-time KV-cache state rather than simple round-robin or random load balancing.</p>
<p>Traditional load balancing treats all LLM inference requests equally, but in practice, requests with similar prompts benefit enormously from being routed to the same pod — because that pod already has the relevant KV cache blocks loaded. The EPP solves this by tracking real-time KV cache states across all vLLM instances via ZMQ events (<code>BlockStored</code>, <code>BlockRemoved</code>) and building an index mapping <code>{ModelName, BlockHash}</code> → <code>{PodID, DeviceTier}</code>.</p>
<p>The scheduling behavior is configured through <code>EndpointPickerConfig</code>, which defines a plugin pipeline with weighted scorers:</p>
<pre><code class="language-yaml">apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: single-profile-handler
  - type: prefix-cache-scorer
  - type: load-aware-scorer
    parameters:
      threshold: 100
  - type: max-score-picker
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 2.0
      - pluginRef: load-aware-scorer
        weight: 1.0
      - pluginRef: max-score-picker
</code></pre>
<p>The pipeline uses three types of plugins (see the <a href="https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/architecture.md" target="_blank" rel="noopener noreferrer">llm-d scheduler architecture</a> for details):</p>
<ul>
<li><strong>prefix-cache-scorer</strong> (weight: 2.0): Tracks the actual KV cache contents across all vLLM instances and scores pods based on how many cached prefix blocks match the incoming request's prompt. This reduces Time To First Token (TTFT) by avoiding redundant prefill computation for repeated or similar prompts — particularly beneficial for multi-turn conversations and RAG workloads.</li>
<li><strong>load-aware-scorer</strong> (weight: 1.0): Scores candidate pods based on their current queue depth. Pods with empty queues score 0.5, while pods with growing queues score progressively lower toward 0. The <code>threshold</code> parameter controls the sensitivity — when queue depth exceeds the threshold, the pod scores near zero.</li>
<li><strong>max-score-picker</strong>: After all scorers run, selects the pod with the highest weighted aggregate score.</li>
</ul>
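<p>Because the scorers are weighted, the trade-off between cache affinity and load spreading is tunable per profile. As an illustrative fragment (not a recommended default), a deployment dominated by multi-turn or RAG traffic might weight prefix-cache matches more aggressively:</p>
<pre><code class="language-yaml"># Illustrative profile fragment: the 3.0 weight is a hypothetical tuning choice.
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 3.0   # favor pods that already hold matching KV blocks
      - pluginRef: load-aware-scorer
        weight: 1.0   # still back off from pods with deep queues
      - pluginRef: max-score-picker
</code></pre>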
<p>The <code>EndpointPickerConfig</code> can be provided inline in the LLMInferenceService spec or referenced from a ConfigMap, giving platform teams the flexibility to standardize scheduling behavior across deployments:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-with-scheduler
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 4
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      config:
        ref:
          name: custom-endpoint-picker-config
          key: endpoint-picker-config.yaml
      pool: {}
</code></pre>
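<p>The referenced ConfigMap simply carries an EndpointPickerConfig document under the named key. A minimal sketch, reusing the ConfigMap name and key from the example above (the plugin list inside is illustrative):</p>
<pre><code class="language-yaml"># Sketch of the ConfigMap referenced by spec.router.scheduler.config.ref.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-endpoint-picker-config
data:
  endpoint-picker-config.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
      - type: single-profile-handler
      - type: prefix-cache-scorer
      - type: max-score-picker
    schedulingProfiles:
      - name: default
        plugins:
          - pluginRef: prefix-cache-scorer
          - pluginRef: max-score-picker
</code></pre>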
<p>The Gateway API Inference Extension (GIE) CRDs (<strong>InferencePool</strong> and <strong>InferenceModel</strong>) are now bundled with the KServe installation, simplifying setup.</p>
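<p>For reference, a minimal <strong>InferencePool</strong> manifest looks like the following. This is an illustrative sketch based on the Gateway API Inference Extension <code>v1alpha2</code> schema; the resource name, selector labels, and endpoint-picker name are placeholders, not values KServe requires:</p>
<pre><code class="language-yaml">apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama3-pool          # placeholder name
spec:
  targetPortNumber: 8000     # port the serving pods listen on
  selector:                  # labels matching the model-serving pods
    app: llama3-decode
  extensionRef:
    name: llama3-epp         # endpoint-picker (scheduler) service
</code></pre>
<p>In practice, setting <code>router.scheduler.pool: {}</code> on an LLMInferenceService (as in the examples above) lets the controller create and manage the pool for you, so hand-writing one is typically only needed for bring-your-own-pool setups.</p>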
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-disaggregated-prefill-decode">🔀 Disaggregated Prefill-Decode<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-disaggregated-prefill-decode" class="hash-link" aria-label="Direct link to 🔀 Disaggregated Prefill-Decode" title="Direct link to 🔀 Disaggregated Prefill-Decode" translate="no">​</a></h3>
<p>LLMInferenceService natively supports <strong>disaggregated prefill-decode</strong>, which separates the compute-intensive prefill phase from the memory-intensive decode phase into independent workloads. This allows each phase to be scaled and optimized independently.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha2</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> LLMInferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">prefill</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">decode</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">uri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">prefill</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token 
punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">router</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">gateway</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">managed</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">route</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span 
class="token key atrule" style="color:#00a4db">httpRoute</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">scheduler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">pool</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<p>KV-cache blocks are transferred between prefill and decode pods using <strong>NixlConnector</strong> over RDMA transports such as RoCE (RDMA over Converged Ethernet) for high-throughput, low-latency transfers.</p>
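<p>In vLLM terms, the connector is selected through the <code>--kv-transfer-config</code> engine argument. The snippet below is a sketch of what the serving container's arguments might look like; verify the flag name and JSON keys against the vLLM version you deploy, as KServe/llm-d may inject these settings automatically:</p>
<pre><code class="language-yaml">containers:
  - name: vllm
    args:
      # Assumed vLLM flags; the LLMInferenceService controller
      # may configure the KV connector on your behalf.
      - --kv-transfer-config
      - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
</code></pre>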
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-distributed-inference-tensor-data-and-expert-parallelism">📐 Distributed Inference: Tensor, Data, and Expert Parallelism<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-distributed-inference-tensor-data-and-expert-parallelism" class="hash-link" aria-label="Direct link to 📐 Distributed Inference: Tensor, Data, and Expert Parallelism" title="Direct link to 📐 Distributed Inference: Tensor, Data, and Expert Parallelism" translate="no">​</a></h3>
<p>LLMInferenceService introduces a comprehensive <strong>parallelism specification</strong> for distributed inference across multiple nodes and GPUs using <a href="https://github.com/kubernetes-sigs/lws" target="_blank" rel="noopener noreferrer" class="">LeaderWorkerSet</a>:</p>
<ul>
<li class=""><strong>Tensor Parallelism (TP)</strong>: Shards each layer&#x27;s weight matrices across GPUs within a node</li>
<li class=""><strong>Data Parallelism (DP)</strong>: Runs multiple model replicas for higher throughput</li>
<li class=""><strong>Data-Local Parallelism</strong>: Controls GPUs per node for optimal NUMA affinity</li>
<li class=""><strong>Expert Parallelism (EP)</strong>: Distributes Mixture-of-Experts (MoE) model experts across GPUs</li>
</ul>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha2</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> LLMInferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">multi</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">node</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">uri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">8</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">parallelism</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">tensor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">4</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">8</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">dataLocal</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">4</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">containers</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">worker</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" 
style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">router</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">gateway</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      
</span><span class="token key atrule" style="color:#00a4db">managed</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">route</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">httpRoute</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">scheduler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">pool</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-envoy-ai-gateway-integration-with-token-based-rate-limiting">🌐 Envoy AI Gateway Integration with Token-Based Rate Limiting<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-envoy-ai-gateway-integration-with-token-based-rate-limiting" class="hash-link" aria-label="Direct link to 🌐 Envoy AI Gateway Integration with Token-Based Rate Limiting" title="Direct link to 🌐 Envoy AI Gateway Integration with Token-Based Rate Limiting" translate="no">​</a></h3>
<p>LLMInferenceService integrates with <a href="https://aigateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class=""><strong>Envoy AI Gateway</strong></a> for AI-native traffic management. This enables <strong>token-based rate limiting</strong> — a capability critical for LLM serving where request cost varies dramatically based on input and output token counts rather than simple request counts.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigateway.envoyproxy.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AIGatewayRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">route</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">targetRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HTTPRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">serving</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">llmRequestCosts</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">metadataKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_input_token</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InputToken</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token 
punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">metadataKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_output_token</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> OutputToken</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">metadataKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_total_token</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> TotalToken</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.envoyproxy.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BackendTrafficPolicy</span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">rate</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">limit</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">targetRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigateway.envoyproxy.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AIGatewayRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> llm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">route</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">rateLimit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Global</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">global</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">rules</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">clientSelectors</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">                </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> x</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">user</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">id</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                  </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Distinct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">limit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1000</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">unit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hour</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cost</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">            </span><span class="token key atrule" style="color:#00a4db">request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">from</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Number</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">number</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">response</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">from</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Metadata</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_total_token</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-autoscaling-api-with-wva-support">⚡ Autoscaling API with WVA Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-autoscaling-api-with-wva-support" class="hash-link" aria-label="Direct link to ⚡ Autoscaling API with WVA Support" title="Direct link to ⚡ Autoscaling API with WVA Support" translate="no">​</a></h3>
<p>A new <strong>autoscaling API</strong> has been added to LLMInferenceService with support for the <a href="https://github.com/llm-d/llm-d-workload-variant-autoscaler" target="_blank" rel="noopener noreferrer" class=""><strong>Workload Variant Autoscaler (WVA)</strong></a>, a Kubernetes-based global autoscaler designed specifically for LLM inference workloads. Traditional CPU/memory-based autoscaling is inadequate for LLMs because inference cost is driven by token throughput, KV cache utilization, and queue depth rather than CPU or memory usage.</p>
<p>WVA continuously monitors inference server metrics via Prometheus — specifically <strong>KV cache utilization</strong> and <strong>queue depth</strong> — to determine when servers are approaching saturation. It then computes a <code>wva_desired_replicas</code> metric and emits it to Prometheus, where an actuator backend (<strong>HPA</strong> or <strong>KEDA</strong>) reads it to drive the actual scaling:</p>
<ul>
<li class=""><strong>WVA + KEDA</strong>: Queries Prometheus directly for the <code>wva_desired_replicas</code> metric. Does not require Prometheus Adapter. Supports idle scale-to-zero via <code>idleReplicaCount</code>.</li>
<li class=""><strong>WVA + HPA</strong>: Reads the <code>wva_desired_replicas</code> metric via Kubernetes Metrics API. Requires Prometheus Adapter. Supports standard HPA scaling behaviors.</li>
</ul>
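<p>To make the WVA + KEDA path concrete, the following is a sketch of the kind of KEDA <code>ScaledObject</code> that would consume the <code>wva_desired_replicas</code> metric. The resource names, Prometheus address, and metric label are illustrative assumptions, not the exact objects KServe generates:</p>
<pre><code class="language-yaml"># Illustrative sketch — names, serverAddress, and metric labels are assumptions
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama3-wva-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: llama3-wva-workload        # hypothetical target Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  idleReplicaCount: 0                # enables idle scale-to-zero
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # placeholder
        # Query the desired-replica count WVA emitted for this variant;
        # the label selector here is an assumption.
        query: wva_desired_replicas{variant="llama3-wva"}
        threshold: "1"
</code></pre>
<p>Because WVA already computes the target replica count, the trigger simply tracks the metric value rather than deriving replicas from raw utilization, which is why no Prometheus Adapter is needed on this path.</p>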
<p>A key concept in WVA is the <strong>variant</strong> — a specific deployment configuration (hardware, runtime, parallelism strategy) for serving a model. The same base model might be served by multiple variants: for example, Llama-3 on A100 GPUs with TP=4 is one variant, while Llama-3 on H100 GPUs with TP=2 is another. The <code>variantCost</code> field specifies the relative cost per replica for each variant, enabling WVA to make <strong>cost-aware scaling decisions</strong> across variants — scaling up the cheaper variant first when demand increases, and scaling down the most expensive variant first when demand decreases.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha2</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> LLMInferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">wva</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">autoscaling</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">uri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">scaling</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">maxReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">wva</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">variantCost</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"15.0"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">keda</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">pollingInterval</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">30</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" 
style="color:#00a4db">cooldownPeriod</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">300</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">initialCooldownPeriod</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">120</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">idleReplicaCount</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">fallback</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">failureThreshold</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">router</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">gateway</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">managed</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">route</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">httpRoute</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">scheduler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key 
atrule" style="color:#00a4db">pool</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<p>In the example above, <code>variantCost: "15.0"</code> indicates the relative cost of running each replica of this variant. If another variant of the same model has <code>variantCost: "5.0"</code>, WVA would prefer to add capacity on that cheaper variant before scaling up this one. The default value is <code>"10.0"</code> if not specified. When using the KEDA backend, the <code>fallback</code> field ensures the deployment maintains a minimum replica count (here, 2 replicas) even if the metrics pipeline fails — a critical safety net for production LLM deployments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-scheduler-high-availability">🔧 Scheduler High Availability<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-scheduler-high-availability" class="hash-link" aria-label="Direct link to 🔧 Scheduler High Availability" title="Direct link to 🔧 Scheduler High Availability" translate="no">​</a></h3>
<p>The LLMInferenceService scheduler (EPP) now supports <strong>scaling and high availability</strong>, allowing multiple EPP replicas for production deployments that require fault tolerance and higher routing throughput.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-crd-webhook-validation">🛡️ CRD Webhook Validation<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#%EF%B8%8F-crd-webhook-validation" class="hash-link" aria-label="Direct link to 🛡️ CRD Webhook Validation" title="Direct link to 🛡️ CRD Webhook Validation" translate="no">​</a></h3>
<p>LLMInferenceService now includes <strong>CRD webhook validation</strong> with comprehensive E2E tests, providing early feedback on invalid configurations before they reach the controller. This catches errors in parallelism settings, workload specifications, and router configurations at admission time.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-configuration-composition-with-llminferenceserviceconfig">📋 Configuration Composition with LLMInferenceServiceConfig<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-configuration-composition-with-llminferenceserviceconfig" class="hash-link" aria-label="Direct link to 📋 Configuration Composition with LLMInferenceServiceConfig" title="Direct link to 📋 Configuration Composition with LLMInferenceServiceConfig" translate="no">​</a></h3>
<p>LLMInferenceService supports a <strong>configuration composition model</strong> through LLMInferenceServiceConfig, enabling reusable templates that can be shared across multiple LLMInferenceService resources. Configurations are merged in the following order, with later sources overriding earlier ones:</p>
<ol>
<li class=""><strong>Well-Known Configs</strong></li>
<li class=""><strong>Explicit BaseRefs</strong></li>
<li class=""><strong>LLMInferenceService Spec</strong></li>
</ol>
<p>This allows platform teams to define standardized vLLM worker templates, router/scheduler configurations, and resource defaults while giving application teams the ability to override specific settings.</p>
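<p>A minimal sketch of this composition, with illustrative resource names and field values, might look like the following (the exact merge behavior is described above; consult the API reference for authoritative field names):</p>
<pre><code class="language-yaml"># Illustrative sketch — names and values are assumptions
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: platform-vllm-defaults       # template owned by the platform team
spec:
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
---
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-llm
spec:
  baseRefs:                          # explicit base configs, merged before this spec
    - name: platform-vllm-defaults
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
</code></pre>
<p>Here the application team's spec inherits the platform GPU defaults and only declares what is specific to its service; any field it sets explicitly takes precedence over the base config.</p>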
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-additional-llminferenceservice-improvements">📦 Additional LLMInferenceService Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-additional-llminferenceservice-improvements" class="hash-link" aria-label="Direct link to 📦 Additional LLMInferenceService Improvements" title="Direct link to 📦 Additional LLMInferenceService Improvements" translate="no">​</a></h3>
<ul>
<li class=""><strong>Label and annotation propagation</strong> to downstream workload resources (<a href="https://github.com/kserve/kserve/pull/5009" target="_blank" rel="noopener noreferrer" class="">#5009</a>)</li>
<li class=""><strong>Prometheus annotation propagation</strong> to workloads for metrics collection (<a href="https://github.com/kserve/kserve/pull/5086" target="_blank" rel="noopener noreferrer" class="">#5086</a>)</li>
<li class=""><strong>Certificate management</strong> with DNS/IP SAN and automatic renewal for self-signed certs (<a href="https://github.com/kserve/kserve/pull/5099" target="_blank" rel="noopener noreferrer" class="">#5099</a>)</li>
<li class=""><strong>Improved CA bundle management</strong> for secure communication (<a href="https://github.com/kserve/kserve/pull/4803" target="_blank" rel="noopener noreferrer" class="">#4803</a>)</li>
<li class=""><strong>Optional storageInitializer</strong> — skip model download when using pre-loaded models (<a href="https://github.com/kserve/kserve/pull/4970" target="_blank" rel="noopener noreferrer" class="">#4970</a>)</li>
<li class=""><strong>InferencePool auto-migration</strong> for seamless upgrades (<a href="https://github.com/kserve/kserve/pull/5007" target="_blank" rel="noopener noreferrer" class="">#5007</a>)</li>
<li class=""><strong>Route-only completions through InferencePool</strong> for chat/completion endpoints (<a href="https://github.com/kserve/kserve/pull/5087" target="_blank" rel="noopener noreferrer" class="">#5087</a>)</li>
<li class=""><strong>Startup probes for vLLM containers</strong> for more reliable health monitoring (<a href="https://github.com/kserve/kserve/pull/5063" target="_blank" rel="noopener noreferrer" class="">#5063</a>)</li>
<li class=""><strong>vLLM arguments migrated to command field</strong> for cleaner configuration (<a href="https://github.com/kserve/kserve/pull/5049" target="_blank" rel="noopener noreferrer" class="">#5049</a>)</li>
<li class=""><strong>Versioned well-known config resolution</strong> for stable config management (<a href="https://github.com/kserve/kserve/pull/5096" target="_blank" rel="noopener noreferrer" class="">#5096</a>)</li>
<li class=""><strong>Scheduler config via ConfigMap or inline</strong> for flexible configuration (<a href="https://github.com/kserve/kserve/pull/4856" target="_blank" rel="noopener noreferrer" class="">#4856</a>)</li>
<li class=""><strong>Pod init container failure monitoring</strong> for better observability (<a href="https://github.com/kserve/kserve/pull/5034" target="_blank" rel="noopener noreferrer" class="">#5034</a>)</li>
<li class=""><strong>Preserve externally managed replicas</strong> during reconciliation (<a href="https://github.com/kserve/kserve/pull/4996" target="_blank" rel="noopener noreferrer" class="">#4996</a>)</li>
<li class=""><strong>Allow stopping LLMInferenceService</strong> gracefully (<a href="https://github.com/kserve/kserve/pull/4839" target="_blank" rel="noopener noreferrer" class="">#4839</a>)</li>
<li class=""><strong>Enhanced Gateway API URL discovery</strong> with listener hostname fallback (<a href="https://github.com/kserve/kserve/pull/5104" target="_blank" rel="noopener noreferrer" class="">#5104</a>, <a href="https://github.com/kserve/kserve/pull/5079" target="_blank" rel="noopener noreferrer" class="">#5079</a>)</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-modular-component-architecture">🏗️ Modular Component Architecture<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#%EF%B8%8F-modular-component-architecture" class="hash-link" aria-label="Direct link to 🏗️ Modular Component Architecture" title="Direct link to 🏗️ Modular Component Architecture" translate="no">​</a></h2>
<p>KServe v0.17 introduces a fundamental architectural shift toward <strong>modular, component-based deployment</strong>. KServe now consists of three independent components:</p>
<ul>
<li class=""><strong>kserve</strong> (core): Manages InferenceService, ServingRuntime, ClusterServingRuntime, InferenceGraph, and TrainedModel CRDs.</li>
<li class=""><strong>llmisvc</strong>: The LLMInferenceService controller for generative AI workloads, managing LLMInferenceService and LLMInferenceServiceConfig CRDs.</li>
<li class=""><strong>localmodel</strong> (optional): The LocalModel controller for efficient model caching with LocalModelCache, LocalModelNode, and LocalModelNodeGroup CRDs.</li>
</ul>
<table><thead><tr><th>Combination</th><th>Use Case</th><th>Components</th></tr></thead><tbody><tr><td><strong>KServe Only</strong></td><td>Predictive AI</td><td>kserve</td></tr><tr><td><strong>KServe + LLMIsvc</strong></td><td>Predictive AI + Generative AI</td><td>kserve + llmisvc</td></tr><tr><td><strong>Full Stack</strong></td><td>Predictive AI + Generative AI + Model Caching</td><td>kserve + llmisvc + localmodel</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="helm-chart-restructuring">Helm Chart Restructuring<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#helm-chart-restructuring" class="hash-link" aria-label="Direct link to Helm Chart Restructuring" title="Direct link to Helm Chart Restructuring" translate="no">​</a></h3>
<p>To support the new component architecture, the <strong>Helm charts have been completely restructured</strong> from a single chart into <strong>10 independent Helm charts</strong>:</p>
<p><strong>CRD Charts</strong> (6 charts with full and minimal variants):</p>
<ul>
<li class=""><code>kserve-crd</code> / <code>kserve-crd-minimal</code></li>
<li class=""><code>kserve-llmisvc-crd</code> / <code>kserve-llmisvc-crd-minimal</code></li>
<li class=""><code>kserve-localmodel-crd</code> / <code>kserve-localmodel-crd-minimal</code></li>
</ul>
<p><strong>Resource Charts</strong> (4 charts):</p>
<ul>
<li class=""><code>kserve-resources</code> (renamed from <code>kserve</code>)</li>
<li class=""><code>kserve-llmisvc-resources</code> (new)</li>
<li class=""><code>kserve-localmodel-resources</code> (new)</li>
<li class=""><code>kserve-runtime-configs</code> (new — manages ClusterServingRuntimes and LLMIsvcConfigs)</li>
</ul>
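<p>As an illustration, a fresh "KServe + LLMIsvc" installation under the new chart layout could be assembled from the individual charts roughly as follows. The OCI registry path and version are assumptions; always use the commands from the official installation guide:</p>
<pre><code class="language-bash"># Illustrative only — registry path and version are assumptions
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.17.0
helm install kserve oci://ghcr.io/kserve/charts/kserve-resources \
  --version v0.17.0 -n kserve --create-namespace
helm install kserve-llmisvc-crd oci://ghcr.io/kserve/charts/kserve-llmisvc-crd --version v0.17.0
helm install kserve-llmisvc oci://ghcr.io/kserve/charts/kserve-llmisvc-resources \
  --version v0.17.0 -n kserve
</code></pre>
<p>Installing each component as its own release is what makes the combinations in the table above possible: clusters that only serve predictive models simply omit the llmisvc and localmodel charts.</p>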
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>This is a <strong>breaking change</strong>. Users upgrading from v0.16 <strong>cannot</strong> use a simple <code>helm upgrade</code> command. Please follow the detailed <a href="https://kserve.github.io/website/docs/install/upgrade-guide" target="_blank" rel="noopener noreferrer" class="">upgrade guide</a> for step-by-step migration instructions. We strongly recommend testing the upgrade in a non-production environment first.</p></div></div>
<p>For fresh installations, the new Kustomize component-based architecture also provides composable deployment options via standalone overlays, addon overlays, and all-in-one overlays. See the <a href="https://kserve.github.io/website/docs/install/overview" target="_blank" rel="noopener noreferrer" class="">installation concepts</a> for details.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-inferenceservice-and-platform-improvements">🔧 InferenceService and Platform Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-inferenceservice-and-platform-improvements" class="hash-link" aria-label="Direct link to 🔧 InferenceService and Platform Improvements" title="Direct link to 🔧 InferenceService and Platform Improvements" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="storage-performance">Storage Performance<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#storage-performance" class="hash-link" aria-label="Direct link to Storage Performance" title="Direct link to Storage Performance" translate="no">​</a></h3>
<ul>
<li class=""><strong>Parallelized blob downloads</strong> from Azure and S3 for faster model loading (<a href="https://github.com/kserve/kserve/pull/4709" target="_blank" rel="noopener noreferrer" class="">#4709</a>, <a href="https://github.com/kserve/kserve/pull/4714" target="_blank" rel="noopener noreferrer" class="">#4714</a>)</li>
<li class=""><strong>Faster parallel S3 downloads</strong> with configurable file selection (<a href="https://github.com/kserve/kserve/pull/5102" target="_blank" rel="noopener noreferrer" class="">#5102</a>, <a href="https://github.com/kserve/kserve/pull/5119" target="_blank" rel="noopener noreferrer" class="">#5119</a>)</li>
<li class=""><strong>Git repository support</strong> for downloading models directly from Git repos via HTTPS (<a href="https://github.com/kserve/kserve/pull/4966" target="_blank" rel="noopener noreferrer" class="">#4966</a>)</li>
</ul>
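<p>The parallel-download pattern behind these changes can be sketched with a thread pool. This is illustrative only; the function names and structure are assumptions, not KServe's actual storage-initializer code:</p>

```python
# Illustrative sketch of parallelized blob downloads: fetch multiple
# model files concurrently instead of one at a time. Not KServe's
# actual storage-initializer implementation.
from concurrent.futures import ThreadPoolExecutor

def download_blob(key: str) -> str:
    # Placeholder for a real S3/Azure GET; here we just echo the key.
    return f"downloaded:{key}"

def download_model(keys, max_workers: int = 8):
    # Running downloads concurrently hides per-object latency, which is
    # where most of the speedup for many small model shards comes from.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_blob, keys))

print(download_model(["config.json", "model-00001.safetensors"]))
```

<p>In practice the worker count and the set of files to fetch would come from configuration, which is what the configurable file selection in the PRs above exposes.</p>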
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-serving-runtimes">New Serving Runtimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#new-serving-runtimes" class="hash-link" aria-label="Direct link to New Serving Runtimes" title="Direct link to New Serving Runtimes" translate="no">​</a></h3>
<ul>
<li class=""><strong>OpenVINO Model Server</strong> — Intel's optimized inference runtime for high-performance serving on Intel hardware (<a href="https://github.com/kserve/kserve/pull/4592" target="_blank" rel="noopener noreferrer" class="">#4592</a>)</li>
<li class=""><strong>PredictiveServer</strong> runtime with full build/publish infrastructure and E2E testing (<a href="https://github.com/kserve/kserve/pull/4954" target="_blank" rel="noopener noreferrer" class="">#4954</a>)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="gateway--routing">Gateway &amp; Routing<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#gateway--routing" class="hash-link" aria-label="Direct link to Gateway &amp; Routing" title="Direct link to Gateway &amp; Routing" translate="no">​</a></h3>
<ul>
<li class=""><strong>Gateway API upgraded to v1.4.0</strong> (<a href="https://github.com/kserve/kserve/pull/5038" target="_blank" rel="noopener noreferrer" class="">#5038</a>)</li>
<li class=""><strong>PathTemplate configuration</strong> for flexible inference service routing (<a href="https://github.com/kserve/kserve/pull/4817" target="_blank" rel="noopener noreferrer" class="">#4817</a>)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vllm-backend">vLLM Backend<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#vllm-backend" class="hash-link" aria-label="Direct link to vLLM Backend" title="Direct link to vLLM Backend" translate="no">​</a></h3>
<ul>
<li class=""><strong>Upgraded to vLLM v0.15.1</strong> with performance improvements (<a href="https://github.com/kserve/kserve/pull/5098" target="_blank" rel="noopener noreferrer" class="">#5098</a>)</li>
<li class=""><strong>Removed Python 3.9 support</strong> (<a href="https://github.com/kserve/kserve/pull/4851" target="_blank" rel="noopener noreferrer" class="">#4851</a>)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="additional-enhancements">Additional Enhancements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#additional-enhancements" class="hash-link" aria-label="Direct link to Additional Enhancements" title="Direct link to Additional Enhancements" translate="no">​</a></h3>
<ul>
<li class=""><strong>CSV and Parquet marshallers</strong> for expanded data format support (<a href="https://github.com/kserve/kserve/pull/5115" target="_blank" rel="noopener noreferrer" class="">#5115</a>)</li>
<li class=""><strong>Event loop configuration</strong> with new <code>--event_loop</code> flag supporting <code>auto</code>, <code>asyncio</code>, and <code>uvloop</code> (<a href="https://github.com/kserve/kserve/pull/4971" target="_blank" rel="noopener noreferrer" class="">#4971</a>)</li>
<li class=""><strong>Annotation-based runtime defaults</strong> for MLServer (<a href="https://github.com/kserve/kserve/pull/5064" target="_blank" rel="noopener noreferrer" class="">#5064</a>)</li>
<li class=""><strong><code>INFERENCE_SERVICE_NAME</code> environment variable</strong> exposed to serving containers (<a href="https://github.com/kserve/kserve/pull/5013" target="_blank" rel="noopener noreferrer" class="">#5013</a>)</li>
<li class=""><strong>Failure condition surfacing</strong> in InferenceService status (<a href="https://github.com/kserve/kserve/pull/5114" target="_blank" rel="noopener noreferrer" class="">#5114</a>)</li>
<li class=""><strong>Inference log batching</strong> with external marshalling support (<a href="https://github.com/kserve/kserve/pull/5061" target="_blank" rel="noopener noreferrer" class="">#5061</a>)</li>
</ul>
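<p>The <code>--event_loop</code> flag mentioned above might be implemented along these lines. This is a sketch; the helper name and fallback behavior are assumptions, not KServe's actual code:</p>

```python
# Sketch of "--event_loop auto|asyncio|uvloop" selection. The helper
# name and fallback behavior are illustrative assumptions.
import asyncio

def install_event_loop(policy: str = "auto") -> str:
    if policy not in ("auto", "asyncio", "uvloop"):
        raise ValueError(f"unknown event loop policy: {policy}")
    if policy in ("auto", "uvloop"):
        try:
            import uvloop  # optional dependency
            asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
            return "uvloop"
        except ImportError:
            if policy == "uvloop":
                raise  # explicitly requested, so fail loudly
    # "auto" without uvloop installed, or explicit "asyncio"
    asyncio.set_event_loop_policy(asyncio.DefaultEventLoopPolicy())
    return "asyncio"

print(install_event_loop("asyncio"))  # prints "asyncio"
```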
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="infrastructure-updates">Infrastructure Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#infrastructure-updates" class="hash-link" aria-label="Direct link to Infrastructure Updates" title="Direct link to Infrastructure Updates" translate="no">​</a></h3>
<ul>
<li class="">Kubernetes packages bumped to <strong>v0.34.0</strong></li>
<li class="">Knative Serving updated to <strong>v1.21.1</strong></li>
<li class="">Go updated to <strong>1.25</strong></li>
<li class="">Kubebuilder updated to <strong>1.9.0</strong></li>
<li class="">KEDA bumped from 2.16.1 to <strong>2.17.3</strong></li>
<li class="">MinIO replaced with <strong>SeaweedFS</strong> for testing infrastructure</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-security-fixes">🔒 Security Fixes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-security-fixes" class="hash-link" aria-label="Direct link to 🔒 Security Fixes" title="Direct link to 🔒 Security Fixes" translate="no">​</a></h2>
<p>Multiple security vulnerabilities have been addressed:</p>
<ul>
<li class="">CVE-2025-62727 (Starlette)</li>
<li class="">CVE-2025-22872, CVE-2025-47914, CVE-2025-58181</li>
<li class="">CVE-2024-43598 (LightGBM updated to 4.6.0)</li>
<li class="">CVE-2025-43859 (h11 HTTP parsing)</li>
<li class="">CVE-2025-66418 (decompression chain)</li>
<li class="">CVE-2025-68156 (expr-lang/expr)</li>
<li class="">CVE-2026-26007 (cryptography subgroup attack)</li>
<li class="">CVE-2026-24486 (python-multipart arbitrary file write)</li>
<li class="">Path traversal vulnerabilities in https.go and tar extraction</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For the complete list of all 167 merged pull requests, bug fixes, and known issues, visit the GitHub release pages:</p>
<ul>
<li class=""><a href="https://github.com/kserve/kserve/releases/tag/v0.17.0" target="_blank" rel="noopener noreferrer" class="">v0.17.0</a></li>
<li class=""><a href="https://github.com/kserve/kserve/releases/tag/v0.17.0-rc1" target="_blank" rel="noopener noreferrer" class="">v0.17.0-rc1</a></li>
<li class=""><a href="https://github.com/kserve/kserve/releases/tag/v0.17.0-rc0" target="_blank" rel="noopener noreferrer" class="">v0.17.0-rc0</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We extend our gratitude to all <strong>38+ contributors</strong> who made this release possible, including <strong>21 first-time contributors</strong>. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>New Contributors</strong>: Welcome to all first-time contributors who helped shape this release</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<p>We invite you to explore the new features in KServe v0.17 and contribute to the ongoing development of the project:</p>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>

</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Best of Both Worlds: Cloud-Native AI Inference at Scale using KServe and llm-d]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d"/>
        <updated>2026-03-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Learn how KServe and llm-d combine to deliver a production-ready, Kubernetes-native inference platform with distributed intelligence for generative AI workloads.]]></summary>
        <content type="html"><![CDATA[<p>Enterprises today seek to integrate generative AI (GenAI) capabilities into their applications. However, scaling large AI models introduces complexity: managing high-volume traffic from large language models (LLMs), optimizing inference performance, maintaining predictable latency, and controlling infrastructure costs.</p>
<p>Platform engineering leaders require more than just model deployment capabilities. They need a robust, Kubernetes-native infrastructure that supports:</p>
<ul>
<li class="">Efficient GPU utilization</li>
<li class="">Intelligent request routing</li>
<li class="">Distributed inference patterns</li>
<li class="">Cost-aware autoscaling</li>
<li class="">Production-grade governance</li>
</ul>
<p>This article demonstrates how two open-source solutions, KServe and llm-d, can be combined to address these challenges.</p>
<p>We explore the role of each solution, illustrate their integration architecture, and provide practical guidance for AI platform teams, with deeper focus on KServe's LLMInferenceService, available since KServe v0.16.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kserve-simplified-deployment-of-ai-models-on-kubernetes">KServe: Simplified Deployment of AI Models on Kubernetes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kserve-simplified-deployment-of-ai-models-on-kubernetes" class="hash-link" aria-label="Direct link to KServe: Simplified Deployment of AI Models on Kubernetes" title="Direct link to KServe: Simplified Deployment of AI Models on Kubernetes" translate="no">​</a></h2>
<p>KServe is a Kubernetes-based model serving platform that simplifies deploying and managing ML models, including LLMs, at scale.</p>
<p>For platform engineers, KServe acts as the model serving control plane: the layer responsible for lifecycle, scaling, and operational governance.</p>
<p><img decoding="async" loading="lazy" src="https://kserve.github.io/website/assets/images/kserve_generative_inference-21648e7df404ea6f57b9d3c83e8e0ca4.png" alt="KServe Generative Inference Architecture" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inference-as-a-service">Inference as a Service<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#inference-as-a-service" class="hash-link" aria-label="Direct link to Inference as a Service" title="Direct link to Inference as a Service" translate="no">​</a></h3>
<p>InferenceService serves as KServe's core abstraction for model deployment, encapsulating the full serving lifecycle, including:</p>
<ul>
<li class="">Automatic deployment creation and reconciliation</li>
<li class="">Request-based autoscaling with scale-to-zero and autoscaling based on custom metrics</li>
<li class="">Revision management and canary rollouts</li>
<li class="">Endpoint exposure and traffic routing</li>
<li class="">Runtime abstraction across serving backends for both predictive and generative AI</li>
<li class="">Optional pre-processing/post-processing, inference pipelines, and ensembles</li>
</ul>
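<p>To make the abstraction concrete, a minimal InferenceService manifest can be built as a plain Python dict. This is a sketch; the storage URI is a placeholder, and in practice you would apply the manifest with kubectl or a Kubernetes client:</p>

```python
# Minimal InferenceService manifest expressed as a Python dict
# (illustrative; the storageUri is a placeholder).
import json

def make_inference_service(name: str, storage_uri: str) -> dict:
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": "sklearn"},
                    "storageUri": storage_uri,
                }
            }
        },
    }

svc = make_inference_service("sklearn-iris", "gs://example-bucket/models/sklearn/model")
print(json.dumps(svc, indent=2))
```

<p>Everything else in the list above, from deployment reconciliation to autoscaling and endpoint exposure, is derived by the controller from this small declarative spec.</p>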
<p>ML engineers provide trained models. Platform engineers retain operational control without writing custom deployment code.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="llminferenceservice-in-kserve">LLMInferenceService in KServe<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#llminferenceservice-in-kserve" class="hash-link" aria-label="Direct link to LLMInferenceService in KServe" title="Direct link to LLMInferenceService in KServe" translate="no">​</a></h3>
<p>KServe v0.16 introduces stronger generative AI capabilities, including LLMInferenceService, designed specifically for large language model workloads.</p>
<p>Unlike traditional stateless predictors, LLM workloads require:</p>
<ul>
<li class="">Long-running streaming responses</li>
<li class="">GPU-heavy memory footprints</li>
<li class="">Prefix KV-cache management</li>
<li class="">High-concurrency token streaming</li>
<li class="">OpenAI-compatible APIs</li>
</ul>
<p>LLMInferenceService shares common foundations with InferenceService but introduces additional capabilities tailored for large language models, described in the sections below.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="unlocking-generative-ai-serving-with-llminferenceservice-from-pod-level-speed-to-cluster-wide-intelligence">Unlocking Generative AI Serving with LLMInferenceService: From Pod-Level Speed to Cluster-Wide Intelligence<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#unlocking-generative-ai-serving-with-llminferenceservice-from-pod-level-speed-to-cluster-wide-intelligence" class="hash-link" aria-label="Direct link to Unlocking Generative AI Serving with LLMInferenceService: From Pod-Level Speed to Cluster-Wide Intelligence" title="Direct link to Unlocking Generative AI Serving with LLMInferenceService: From Pod-Level Speed to Cluster-Wide Intelligence" translate="no">​</a></h3>
<p>Imagine you want to bring the power of generative AI directly into your applications without rewriting your entire stack. LLMInferenceService offers OpenAI-compatible endpoints like <code>/v1/chat/completions</code>, complete with streaming token responses and multi-turn support. With prompt templating built in, developers can integrate seamlessly with existing tools, whether that's the OpenAI SDKs, LangChain, LlamaIndex, Llama Stack, RAG frameworks, or even enterprise GenAI gateways.</p>
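<p>As a sketch of what OpenAI compatibility means in practice, the snippet below builds a <code>/v1/chat/completions</code> request body and parses one streamed server-sent-events (SSE) line. The payload shape follows the OpenAI chat API; the sample data is fabricated for illustration:</p>

```python
# Building an OpenAI-compatible chat request and parsing a streamed
# SSE line. Sample data is fabricated for illustration.
import json

def chat_payload(model: str, prompt: str, stream: bool = True) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def parse_sse_line(line: str):
    # Streaming chunks arrive as lines like: data: {"choices": [...]}
    if not line.startswith("data: ") or line == "data: [DONE]":
        return None
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content", "")

sample = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_line(sample))  # -> Hello
```

<p>Because the wire format is the same, existing OpenAI client libraries can simply be pointed at the service's endpoint URL.</p>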
<p>Under the hood, KServe connects to LLM-optimized runtimes such as vLLM, Hugging Face TGI, or other GPU-native backends. These engines bring advanced capabilities like continuous batching, memory-efficient paged attention, and KV-cache reuse, delivering high throughput per GPU.</p>
<p>Yet, while these runtime-level optimizations make each pod lightning fast, true cluster-wide efficiency needs more. That's exactly the role of llm-d: adding an extra layer of intelligence that orchestrates resources and maximizes performance across the entire deployment.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="distributed--multi-node-model-support">Distributed &amp; Multi-Node Model Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#distributed--multi-node-model-support" class="hash-link" aria-label="Direct link to Distributed &amp; Multi-Node Model Support" title="Direct link to Distributed &amp; Multi-Node Model Support" translate="no">​</a></h3>
<p>LLMInferenceService supports advanced parallelism strategies implemented by runtimes, including tensor parallelism, pipeline parallelism, and multi-GPU sharding.</p>
<p>This enables hosting 70B+ parameter models, partitioning models across nodes, and serving models larger than single-GPU memory.</p>
<p>KServe orchestrates the deployment topology, while the runtime manages execution parallelism.</p>
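<p>At its core, tensor parallelism shards each weight matrix across devices so every GPU holds a fraction of the parameters. A minimal sketch of the sharding arithmetic (shapes only, not a runtime implementation):</p>

```python
# Minimal sketch of tensor-parallel sharding: split a weight matrix's
# columns evenly across devices. Illustrative arithmetic only.
def shard_columns(n_cols: int, n_devices: int):
    if n_cols % n_devices:
        raise ValueError("columns must divide evenly across devices")
    per = n_cols // n_devices
    # Each device owns a contiguous [start, end) column range.
    return [(d * per, (d + 1) * per) for d in range(n_devices)]

# E.g. an 8192-wide layer split over 4 GPUs -> 2048 columns each.
print(shard_columns(8192, 4))
```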
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="advanced-autoscaling--networking-including-scale-to-zero">Advanced Autoscaling &amp; Networking (Including Scale-to-Zero)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#advanced-autoscaling--networking-including-scale-to-zero" class="hash-link" aria-label="Direct link to Advanced Autoscaling &amp; Networking (Including Scale-to-Zero)" title="Direct link to Advanced Autoscaling &amp; Networking (Including Scale-to-Zero)" translate="no">​</a></h3>
<p>KServe integrates deeply with Kubernetes to support request- and concurrency-based autoscaling via Knative, GPU-backed scaling, and scale-to-zero for cost control.</p>
<p>It also integrates with the Kubernetes Gateway API for TLS termination, traffic splitting, and advanced routing.</p>
<p>This makes it suitable for development environments, internal copilots, and large-scale production workloads.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubernetes-gateway-api-integration">Kubernetes Gateway API Integration<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kubernetes-gateway-api-integration" class="hash-link" aria-label="Direct link to Kubernetes Gateway API Integration" title="Direct link to Kubernetes Gateway API Integration" translate="no">​</a></h3>
<p>KServe integrates with Kubernetes Gateway API for:</p>
<ul>
<li class="">Enterprise-grade routing</li>
<li class="">TLS termination</li>
<li class="">Traffic splitting</li>
<li class="">Multi-model routing</li>
</ul>
<p>This enables integration with modern Kubernetes networking stacks.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-kserve-alone-is-not-enough">Where KServe Alone Is Not Enough<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#where-kserve-alone-is-not-enough" class="hash-link" aria-label="Direct link to Where KServe Alone Is Not Enough" title="Direct link to Where KServe Alone Is Not Enough" translate="no">​</a></h3>
<p>Even with LLMInferenceService and optimized runtimes, KServe does not inherently:</p>
<ul>
<li class="">Route requests based on KV-cache locality across replicas</li>
<li class="">Separate prefill and decode cluster-wide</li>
<li class="">Perform SLA-aware routing decisions</li>
<li class="">Optimize GPU utilization across multiple pods</li>
</ul>
<p>To address these, we introduce llm-d.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-d-distributed-intelligence-for-llm-inference">llm-d: Distributed Intelligence for LLM Inference<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#llm-d-distributed-intelligence-for-llm-inference" class="hash-link" aria-label="Direct link to llm-d: Distributed Intelligence for LLM Inference" title="Direct link to llm-d: Distributed Intelligence for LLM Inference" translate="no">​</a></h2>
<p>llm-d is a Kubernetes-native distributed inference framework designed to enhance the performance and efficiency of LLM workloads.</p>
<p>If KServe is the control plane for models, llm-d is the distributed intelligence scheduling layer.</p>
<p><img decoding="async" loading="lazy" src="https://github.com/llm-d/llm-d/raw/main/docs/assets/images/llm-d-arch.svg" alt="llm-d Architecture" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kv-cache-aware-scheduling-and-disaggregated-inference-with-llm-d">KV-Cache Aware Scheduling and Disaggregated Inference with llm-d<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kv-cache-aware-scheduling-and-disaggregated-inference-with-llm-d" class="hash-link" aria-label="Direct link to KV-Cache Aware Scheduling and Disaggregated Inference with llm-d" title="Direct link to KV-Cache Aware Scheduling and Disaggregated Inference with llm-d" translate="no">​</a></h3>
<p>As LLM deployments mature, scaling is no longer just about adding GPUs. It's about using them intelligently. Modern runtimes such as vLLM introduced prefix (KV) caching to reduce redundant computation, but without smart scheduling, much of that benefit is lost.</p>
<p>This is where llm-d changes the game.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="disaggregated-inference-prefill--decode-separation">Disaggregated Inference (Prefill / Decode Separation)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#disaggregated-inference-prefill--decode-separation" class="hash-link" aria-label="Direct link to Disaggregated Inference (Prefill / Decode Separation)" title="Direct link to Disaggregated Inference (Prefill / Decode Separation)" translate="no">​</a></h3>
<p>LLM inference consists of two distinct phases: prefill and decode. The prefill phase is compute-heavy, processing the full prompt and building the model's attention context. The decode phase is latency-sensitive, generating tokens step by step where responsiveness directly impacts user experience.</p>
<p>llm-d separates these phases across different GPU groups, assigning compute-optimized resources to prefill and latency-optimized resources to decode. With intelligent scheduling between them, workloads are aligned to the right hardware profile.</p>
<p>This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token by eliminating resource contention between fundamentally different workloads.</p>
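<p>A toy model of the phase split: a request first passes through a prefill pool, then is handed off, along with its KV cache, to a decode pool. The record shape and phase functions below are illustrative, not llm-d's implementation:</p>

```python
# Toy model of prefill/decode disaggregation: prefill workers build the
# KV cache for the prompt; decode workers then generate tokens from it.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    kv_cache: list = field(default_factory=list)
    tokens: list = field(default_factory=list)

def prefill(req: Request) -> Request:
    # Compute-heavy phase: process the whole prompt at once.
    req.kv_cache = req.prompt.split()  # stand-in for real attention state
    return req

def decode(req: Request, n_tokens: int) -> Request:
    # Latency-sensitive phase: generate tokens one step at a time.
    for i in range(n_tokens):
        req.tokens.append(f"tok{i}")
    return req

req = decode(prefill(Request("explain kv cache reuse")), n_tokens=3)
print(req.tokens)  # -> ['tok0', 'tok1', 'tok2']
```

<p>The point of the split is that the two functions above can run on differently provisioned GPU pools, with the KV cache transferred between them.</p>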
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="intelligent-inference-scheduler">Intelligent Inference Scheduler<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#intelligent-inference-scheduler" class="hash-link" aria-label="Direct link to Intelligent Inference Scheduler" title="Direct link to Intelligent Inference Scheduler" translate="no">​</a></h3>
<p>llm-d's inference scheduler evaluates the following signals:</p>
<ul>
<li class="">GPU utilization</li>
<li class="">Queue depth</li>
<li class="">Cache residency</li>
<li class="">SLA constraints</li>
<li class="">Load distribution</li>
</ul>
<p>The scheduler uses these signals to decrease serving latency and increase throughput, combining prefix-cache-aware routing, utilization-based load balancing, fairness and prioritization for multi-tenant serving, and predicted-latency balancing.</p>
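<p>The scheduling signals listed above can be folded into a per-replica score, as in this toy sketch. The weights, field names, and scoring formula are illustrative assumptions, not llm-d's actual scorer:</p>

```python
# Toy replica scorer: prefer replicas that already hold the request's
# prefix in cache, then low queue depth and low GPU utilization.
# Weights and field names are illustrative, not llm-d's implementation.
def score(replica: dict, prefix_hash: str) -> float:
    cache_hit = 1.0 if prefix_hash in replica["resident_prefixes"] else 0.0
    return 5.0 * cache_hit - 1.0 * replica["queue_depth"] - 2.0 * replica["gpu_util"]

def pick_replica(replicas: dict, prefix_hash: str) -> str:
    return max(replicas, key=lambda name: score(replicas[name], prefix_hash))

replicas = {
    "pod-a": {"resident_prefixes": {"p1"}, "queue_depth": 2, "gpu_util": 0.9},
    "pod-b": {"resident_prefixes": set(), "queue_depth": 0, "gpu_util": 0.3},
}
# "p1" is cached on pod-a: 5.0 - 2 - 1.8 = 1.2 beats pod-b's -0.6,
# so the cache hit outweighs pod-a's heavier load.
print(pick_replica(replicas, "p1"))  # -> pod-a
```

<p>With an uncached prefix the same scorer sends traffic to the idler replica instead, which is exactly the cache-locality-versus-load trade-off the real scheduler negotiates.</p>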
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kserve-llminferenceservice-and-llm-d">KServe LLMInferenceService and llm-d<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kserve-llminferenceservice-and-llm-d" class="hash-link" aria-label="Direct link to KServe LLMInferenceService and llm-d" title="Direct link to KServe LLMInferenceService and llm-d" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="responsibility-separation">Responsibility Separation<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#responsibility-separation" class="hash-link" aria-label="Direct link to Responsibility Separation" title="Direct link to Responsibility Separation" translate="no">​</a></h3>
<p>This layered design ensures composability and specialization, providing a complete, production-ready solution for generative AI. KServe acts as the control plane and LLMInferenceService delivers the generative API abstraction, while llm-d provides the cluster-wide optimization.</p>
<table><thead><tr><th>Layer</th><th>Responsibility</th></tr></thead><tbody><tr><td>KServe</td><td>Model lifecycle, scaling, governance</td></tr><tr><td>LLMInferenceService</td><td>Generative API abstraction</td></tr><tr><td>vLLM</td><td>Efficient execution inside runtime</td></tr><tr><td>llm-d</td><td>Cross-runtime routing &amp; cache awareness</td></tr><tr><td>Kubernetes</td><td>Resource orchestration</td></tr></tbody></table>
<p>Together, KServe and llm-d enable a production-ready, Kubernetes-native inference platform that balances scalability, performance, and cost efficiency, providing the best of both worlds for cloud-native AI inference at scale.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-efficiency-comparison-naive-vs-optimized">Cost Efficiency Comparison: Naive vs Optimized<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#cost-efficiency-comparison-naive-vs-optimized" class="hash-link" aria-label="Direct link to Cost Efficiency Comparison: Naive vs Optimized" title="Direct link to Cost Efficiency Comparison: Naive vs Optimized" translate="no">​</a></h2>
<p>Serving LLMs at scale is no longer just a model problem. It is a distributed systems problem, where naive load balancing leads to significant inefficiencies and wasted resources:</p>
<p><strong>Problems with naive load balancing:</strong></p>
<ul>
<li class="">Cache locality loss</li>
<li class="">GPU imbalance</li>
<li class="">Redundant prefill processing</li>
<li class="">High tail latency</li>
<li class="">Overprovisioned GPUs</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimized-architecture-with-kserve--llm-d">Optimized Architecture with KServe + llm-d<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#optimized-architecture-with-kserve--llm-d" class="hash-link" aria-label="Direct link to Optimized Architecture with KServe + llm-d" title="Direct link to Optimized Architecture with KServe + llm-d" translate="no">​</a></h3>
<p>The combined KServe and llm-d solution introduces distributed intelligence to solve the problems of naive architectures, delivering superior performance, scalability, and cost control. The optimized architecture is pluggable and extensible, integrating well with many AI and cloud-native technologies.</p>
<p><img decoding="async" loading="lazy" src="https://kserve.github.io/website/img/kserve-layer.png" alt="KServe Layered Architecture" class="img_ev3q"></p>
<p><strong>Benefits:</strong></p>
<ul>
<li class="">Cache reuse preserved</li>
<li class="">Balanced GPU utilization</li>
<li class="">Reduced recomputation</li>
<li class="">Lower cost per token</li>
<li class="">Controlled autoscaling via LLMInferenceService</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmark-results-why-cluster-level-intelligence-matters">Benchmark Results: Why Cluster-Level Intelligence Matters<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#benchmark-results-why-cluster-level-intelligence-matters" class="hash-link" aria-label="Direct link to Benchmark Results: Why Cluster-Level Intelligence Matters" title="Direct link to Benchmark Results: Why Cluster-Level Intelligence Matters" translate="no">​</a></h2>
<p>By integrating llm-d's cache-aware routing, prefill and decode disaggregation, and SLA-based scheduling with KServe's enterprise-grade generative serving and autoscaling, the system achieves cluster-wide GPU optimization.</p>
<p><em>Note: The following results are based on benchmarks published by the llm-d project</em></p>
<table><thead><tr><th>Optimization Area</th><th>Naive Architecture (Round Robin LB)</th><th>Optimized (KServe + llm-d)</th><th>Source</th></tr></thead><tbody><tr><td>Cache Locality</td><td>Requests routed randomly → KV cache frequently missed</td><td>Cache-aware routing preserves prefix locality</td><td><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Time to First Token (P90)</td><td>Baseline latency under cache-blind scheduling</td><td>Up to ~57× faster P90 TTFT in benchmark</td><td><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Token Throughput</td><td>~4,400 tokens/sec (baseline test cluster)</td><td>~8,730 tokens/sec (~2× improvement)</td><td><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Throughput at Scale</td><td>Degrades under multi-tenant load</td><td>Sustained 4.5k–11k tokens/sec</td><td><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Tail Latency (P95/P99)</td><td>Higher tail latency due to stragglers &amp; imbalance</td><td>~50% tail latency reduction (reported tests)</td><td><a href="https://developers.redhat.com/articles/2025/05/20/llm-d-kubernetes-native-distributed-inferencing" target="_blank" rel="noopener noreferrer" class="">Red Hat Developers</a></td></tr><tr><td>GPU Utilization</td><td>Uneven utilization, idle GPUs possible</td><td>Improved effective utilization via routing intelligence</td><td><a href="https://llm-d.ai/docs/guide/Installation/inference-scheduling" target="_blank" rel="noopener noreferrer" class="">llm-d docs</a></td></tr><tr><td>Autoscaling Control</td><td>Scale reacts to load only</td><td>Works with KServe autoscaling + routing 
intelligence</td><td><a href="https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/kpa-autoscaler" target="_blank" rel="noopener noreferrer" class="">KServe docs</a></td></tr></tbody></table>
<p>Modern GenAI platforms require cache locality awareness, phase-aware scheduling, distributed intelligence, and composable Kubernetes-native design. This combination ensures a production-ready system that meets the demands of large-scale production workloads.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="next-steps">Next Steps<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#next-steps" class="hash-link" aria-label="Direct link to Next Steps" title="Direct link to Next Steps" translate="no">​</a></h2>
<p>Explore detailed project documentation:</p>
<ul>
<li class=""><a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">KServe</a></li>
<li class=""><a href="https://llm-d.ai/" target="_blank" rel="noopener noreferrer" class="">llm-d</a></li>
</ul>
<p>Engage with community resources and Slack channels to stay updated and contribute to ongoing developments:</p>
<ul>
<li class=""><a href="https://kserve.github.io/website/community/get_involved/" target="_blank" rel="noopener noreferrer" class="">KServe community</a></li>
<li class=""><a href="https://llm-d.ai/community/" target="_blank" rel="noopener noreferrer" class="">llm-d community</a></li>
</ul>]]></content>
        <author>
            <name>Yuan Tang</name>
            <uri>https://github.com/terrytangyuan</uri>
        </author>
        <author>
            <name>Ran Pollak</name>
            <uri>https://github.com/RanPollak</uri>
        </author>
        <category label="Community" term="Community"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.15 - Advancing Generative AI Model Serving]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release"/>
        <updated>2025-05-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.15 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on May 27, 2025</em></p>
<p>We are thrilled to announce the release of <strong>KServe v0.15</strong>, marking a significant leap forward in serving both predictive and generative AI models. This release introduces enhanced support for generative AI workloads, including advanced features for serving large language models (LLMs), improved model and KV caching mechanisms, and integration with Envoy AI Gateway.</p>
<p><img decoding="async" loading="lazy" alt="!generative_inference" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve_generative_inference-21648e7df404ea6f57b9d3c83e8e0ca4.png" width="911" height="581" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-embracing-generative-ai-workloads">🤖 Embracing Generative AI Workloads<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-embracing-generative-ai-workloads" class="hash-link" aria-label="Direct link to 🤖 Embracing Generative AI Workloads" title="Direct link to 🤖 Embracing Generative AI Workloads" translate="no">​</a></h2>
<p>KServe v0.15 brings first-class support for generative AI workloads, marking a key evolution beyond traditional predictive AI. Unlike predictive models that infer outcomes from existing data, generative models like large language models (LLMs) create new content from prompts. This fundamental difference introduces new serving challenges. KServe now provides the infrastructure and optimizations needed to serve these models efficiently at scale.</p>
<p>To support these workloads, we've introduced a dedicated <strong>Generative AI</strong> section in our documentation, detailing the new capabilities and configurations tailored for generative models.</p>
<p>KServe now offers a <strong>lightweight</strong> installation for hosting LLMs on Kubernetes; please follow the <a href="https://kserve.github.io/archive/0.15/admin/kubernetes_deployment" target="_blank" rel="noopener noreferrer" class="">generative inference installation guide</a> to get started. KEDA is an optional component for scaling on LLM-specific metrics, and Envoy AI Gateway is integrated for advanced traffic management capabilities, including token rate limiting, a unified API, and intelligent routing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-key-generative-ai-features-in-v015">🚀 Key Generative AI Features in v0.15<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-key-generative-ai-features-in-v015" class="hash-link" aria-label="Direct link to 🚀 Key Generative AI Features in v0.15" title="Direct link to 🚀 Key Generative AI Features in v0.15" translate="no">​</a></h2>
<ul>
<li class=""><strong>Envoy AI Gateway Integration</strong></li>
<li class=""><strong>Multi Node Inference</strong></li>
<li class=""><strong>LLM Autoscaler with KEDA</strong></li>
<li class=""><strong>Distributed KV Cache with LMCache</strong></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-envoy-ai-gateway-support">🌐 Envoy AI Gateway Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-envoy-ai-gateway-support" class="hash-link" aria-label="Direct link to 🌐 Envoy AI Gateway Support" title="Direct link to 🌐 Envoy AI Gateway Support" translate="no">​</a></h3>
<p>KServe v0.15 adds initial support for <a href="https://aigateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class=""><strong>Envoy AI Gateway</strong></a>, a CNCF open source project built on top of <a href="https://gateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Gateway</a> and designed specifically for managing generative AI traffic at scale.</p>
<p><a href="https://gateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Gateway</a> is also now supported in KServe along with <a href="https://gateway-api.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Kubernetes Gateway API</a>. Unlike traditional gateway solutions, Envoy AI Gateway provides advanced capabilities tailored to AI serving, including:</p>
<ul>
<li class="">Dynamic model routing based on request content, model metadata, or user context.</li>
<li class="">Built-in support for multi-tenant inference, with fine-grained access controls and authentication.</li>
<li class="">Unified API for routing and managing LLM/AI traffic easily.</li>
<li class="">Integrated observability for model-level performance insights.</li>
<li class="">Extensibility for inference-specific policies like rate-limiting by token, and model lifecycle management.</li>
<li class="">Automatic failover mechanisms to ensure service reliability.</li>
</ul>
<p>This integration enables a unified, intelligent entrypoint for both predictive and generative workloads—scaling from traditional models to complex LLMs—all while abstracting infrastructure complexity from the user. Please refer to <a href="https://kserve.github.io/archive/0.15/admin/ai-gateway_integration" target="_blank" rel="noopener noreferrer" class="">Envoy AI Gateway integration doc</a> for more details.</p>
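<p>As a rough illustration of the integration, the sketch below routes OpenAI-style requests to a KServe backend based on the requested model name (resource names and the backend reference are hypothetical, and the exact CRDs and fields may differ; see the Envoy AI Gateway integration doc linked above for the authoritative configuration):</p>
<pre><code class="language-yaml">apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: kserve-llm-route
spec:
  schema:
    name: OpenAI              # accept the OpenAI-compatible API at the gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model       # model name extracted from the request
              value: llama3
      backendRefs:
        - name: kserve-llama3-backend   # an AIServiceBackend pointing at the InferenceService
</code></pre>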
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-multi-node-inference">🔗 Multi-Node Inference<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-multi-node-inference" class="hash-link" aria-label="Direct link to 🔗 Multi-Node Inference" title="Direct link to 🔗 Multi-Node Inference" translate="no">​</a></h3>
<p>To support LLMs too large for a single node (e.g., Llama 3.1 405B), KServe v0.15 introduces multi-node inference across distributed GPUs, unlocking large model serving at scale. As models continue to increase in size, multi-node inference capabilities are increasingly important for production deployments that require real-time user experience. Please refer to the <a href="https://kserve.github.io/archive/0.15/modelserving/v1beta1/llm/huggingface/multi-node" target="_blank" rel="noopener noreferrer" class="">Multi Node inference doc</a> for more details.</p>
<p>The community is also working on a <a href="https://github.com/kserve/kserve/issues/4433" target="_blank" rel="noopener noreferrer" class="">new distributed inference API</a> to scale multi-node inference and support disaggregated prefill, targeted at large LLM deployments.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> pvc</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pvc/hf/8b_instruction_tuned</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">workerSpec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">pipelineParallelSize</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" 
style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">tensorParallelSize</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling">⚡ LLM Autoscaler with KEDA <a href="https://keda.sh/" target="_blank" rel="noopener noreferrer" class="">(Kubernetes Event-driven Autoscaling)</a><a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" class="hash-link" aria-label="Direct link to -llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" title="Direct link to -llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" translate="no">​</a></h3>
<p>Autoscaling LLMs is challenging due to their high resource demands and variable inference traffic patterns. The dynamic nature of LLM inference, with varying input lengths and token generation speeds, further complicates the prediction of resource needs, demanding sophisticated and adaptive autoscaling solutions. KServe now integrates with <a href="https://keda.sh/" target="_blank" rel="noopener noreferrer" class=""><strong>KEDA</strong></a> (Kubernetes Event-Driven Autoscaling), which addresses many of these challenges by extending Kubernetes' native Horizontal Pod Autoscaler (HPA) capabilities. Because KEDA can monitor custom metrics, you can expose LLM metrics from your inference servers and scale based on these precise indicators.</p>
<p>This empowers users to efficiently manage LLM workloads with more intelligent scaling decisions based on workload characteristics for improved performance and cost optimization. Please follow the <a href="https://kserve.github.io/archive/0.15/modelserving/autoscaling/keda/autoscaling_llm" target="_blank" rel="noopener noreferrer" class="">tutorial doc</a> for how to autoscale based on vLLM metrics.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">keda</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">annotations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" 
style="color:#00a4db">serving.kserve.io/autoscalerClass</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"keda"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">sidecar.opentelemetry.io/inject</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"huggingface-llama3-keda"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70b</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">maxReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" 
style="color:#36acaa">5</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">autoScaling</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">metrics</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PodMetric</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">podmetric</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">metric</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">backend</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"opentelemetry"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" 
style="color:#00a4db">metricNames</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">num_requests_running</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">query</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"vllm:num_requests_running"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">target</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Value</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4"</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-distributed-kv-cache-with-lmcache">🚀 Distributed KV Cache with <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a><a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-distributed-kv-cache-with-lmcache" class="hash-link" aria-label="Direct link to -distributed-kv-cache-with-lmcache" title="Direct link to -distributed-kv-cache-with-lmcache" translate="no">​</a></h3>
<p>Key-Value (KV) cache offloading is a technique used in large language model (LLM) serving to store and reuse the intermediate key and value tensors generated during model inference. In transformer-based models, these KV caches represent the context for each token processed, and reusing them allows the model to avoid redundant computations for repeated or similar prompts.</p>
<p>Enabling KV cache offloading across multiple requests and serving instances reduces Time To First Token (TTFT), improves scalability by sharing the cache across replicas, and improves the user experience for multi-turn QA and RAG.</p>
<p>KServe integrates <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a>, a state-of-the-art KV cache layer developed by LMCache Lab, to reduce inference costs and ensure SLOs for both latency and throughput at scale. Please follow the <a href="https://kserve.github.io/archive/0.15/modelserving/v1beta1/llm/huggingface/kv_cache_offloading/#overview" target="_blank" rel="noopener noreferrer" class="">LMCache integration doc</a> to optimize your GenAI inference workload.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">lmcache</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span 
class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70b</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">kv</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">transfer</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">config</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">enable</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">chunked</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">prefill</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-advanced-model-caching-mechanisms">📦 Advanced Model Caching Mechanisms<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-advanced-model-caching-mechanisms" class="hash-link" aria-label="Direct link to 📦 Advanced Model Caching Mechanisms" title="Direct link to 📦 Advanced Model Caching Mechanisms" translate="no">​</a></h3>
<p>To reduce model loading times and improve the overall efficiency of serving large models, KServe v0.15 introduces advanced model caching features:</p>
<ul>
<li class=""><strong>LocalModelCache Enhancements:</strong> Improved the LocalModelCache custom resource to support multiple node groups, providing greater flexibility in model placement and caching strategies.</li>
<li class=""><strong>Node Agent Improvements:</strong> Enhanced the local model node agent for better performance and reliability.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-vllm-backend-support">🔧 Enhanced vLLM Backend Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-enhanced-vllm-backend-support" class="hash-link" aria-label="Direct link to 🔧 Enhanced vLLM Backend Support" title="Direct link to 🔧 Enhanced vLLM Backend Support" translate="no">​</a></h3>
<p>The vLLM backend has been significantly upgraded to better serve generative AI models:</p>
<ul>
<li class=""><strong>Version Upgrade:</strong> Updated to vLLM 0.8.5, bringing performance improvements with v1 backend and new features.</li>
<li class=""><strong>Qwen3 &amp; Llama4:</strong> Added support for Qwen3 and Llama4 models.</li>
<li class=""><strong>Reranking Support:</strong> Added support for reranking models.</li>
<li class=""><strong>Embedding Support:</strong> Added support for OpenAI-compatible embeddings API, enabling a broader range of applications.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-additional-improvements">🛠️ Additional Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#%EF%B8%8F-additional-improvements" class="hash-link" aria-label="Direct link to 🛠️ Additional Improvements" title="Direct link to 🛠️ Additional Improvements" translate="no">​</a></h2>
<p>This release also includes several other enhancements:</p>
<ul>
<li class="">Support Deep Health Checks <a href="https://github.com/kserve/kserve/pull/3348" target="_blank" rel="noopener noreferrer" class="">#3348</a></li>
<li class="">Collocated Transformer &amp; Predictor Feature <a href="https://github.com/kserve/kserve/pull/4255" target="_blank" rel="noopener noreferrer" class="">#4255</a></li>
<li class="">Kubernetes Gateway API support <a href="https://github.com/kserve/kserve/pull/3952" target="_blank" rel="noopener noreferrer" class="">#3952</a></li>
<li class="">Security Updates</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.15.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We extend our gratitude to all the contributors who made this release possible. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Special Recognition</strong>: The generative AI community for their valuable input on LLM serving requirements</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<p>We invite you to explore the new features in KServe v0.15 and contribute to the ongoing development of the project:</p>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Alexa Griffith</name>
            <uri>https://github.com/alexagriffith</uri>
        </author>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Yuan Tang</name>
            <uri>https://github.com/terrytangyuan</uri>
        </author>
        <author>
            <name>Johnu George</name>
            <uri>https://github.com/johnugeorge</uri>
        </author>
        <author>
            <name>Lize Cai</name>
            <uri>https://github.com/lizzzcai</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.14]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release"/>
        <updated>2024-12-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.14 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on December 23, 2024</em></p>
<p>We are excited to announce KServe v0.14. In this release we are introducing a new Python client designed for KServe and a new model cache feature, we are promoting OCI storage for models to a stable feature, and we are adding support for deploying models directly from Hugging Face.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-key-features">🚀 Key Features<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-key-features" class="hash-link" aria-label="Direct link to 🚀 Key Features" title="Direct link to 🚀 Key Features" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-inference-client-for-python">Introducing Inference client for Python<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#introducing-inference-client-for-python" class="hash-link" aria-label="Direct link to Introducing Inference client for Python" title="Direct link to Introducing Inference client for Python" translate="no">​</a></h3>
<p>The KServe Python SDK now includes both <a href="https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L388" target="_blank" rel="noopener noreferrer" class="">REST</a> and <a href="https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L61" target="_blank" rel="noopener noreferrer" class="">GRPC</a> inference clients. Both clients are released as <strong>alpha</strong> features.</p>
<p>In line with the features documented in issue <a href="https://github.com/kserve/kserve/issues/3270" target="_blank" rel="noopener noreferrer" class="">#3270</a>, both clients have the following characteristics:</p>
<ul>
<li class="">The clients are asynchronous</li>
<li class="">Support for HTTP/2 (via <a href="https://www.python-httpx.org/" target="_blank" rel="noopener noreferrer" class="">httpx</a> library)</li>
<li class="">Support Open Inference Protocol v1 and v2</li>
<li class="">Allow client send and receive tensor data in binary format for HTTP/REST request, see <a href="https://kserve.github.io/archive/0.14/modelserving/data_plane/binary_tensor_data_extension/" target="_blank" rel="noopener noreferrer" class="">binary tensor data extension docs</a>.</li>
</ul>
<p>As usual, version 0.14.0 of the KServe Python SDK is <a href="https://pypi.org/project/kserve/0.14.0/" target="_blank" rel="noopener noreferrer" class="">published to PyPI</a> and available to install via <code>pip install kserve</code>.</p>
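<p>To make the request shape concrete, here is a minimal sketch of an Open Inference Protocol v2 request body, the wire format both new clients speak; the tensor name, shape, and data are illustrative placeholders, not values from the release notes:</p>

```python
import json

# Sketch of an Open Inference Protocol v2 request body -- the wire
# format the new REST/GRPC clients speak. The tensor name, shape, and
# data below are illustrative placeholders.
def build_v2_request(name, shape, datatype, data):
    """Assemble a v2 inference request payload as a plain dict."""
    return {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }

body = json.dumps(build_v2_request("input-0", [1, 3], "FP32", [1.0, 2.0, 3.0]))
```

<p>The REST client posts a body like this to the v2 infer endpoint, while the gRPC client carries the same fields as protobuf messages.</p>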
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="support-for-oci-storage-for-models-modelcars-becomes-stable">Support for OCI storage for models (modelcars) becomes stable<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#support-for-oci-storage-for-models-modelcars-becomes-stable" class="hash-link" aria-label="Direct link to Support for OCI storage for models (modelcars) becomes stable" title="Direct link to Support for OCI storage for models (modelcars) becomes stable" translate="no">​</a></h3>
<p>In KServe version 0.12, support for using OCI containers for model storage was introduced as an experimental feature. It lets users store models in containers in OCI format and use OCI-compatible registries for publishing the models.</p>
<p>This feature was implemented by configuring the OCI model container as a sidecar in the InferenceService pod, which is why the feature is named modelcars. The model files are made available to the model server by configuring <a href="https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/" target="_blank" rel="noopener noreferrer" class="">process namespace sharing</a> in the pod.</p>
<p>One small but important detail remained unsolved and motivated the experimental status: since the modelcar runs as one of the pod's main containers, there was no guarantee that it would start quickly. The model server would be unstable if it started before the modelcar, and because the model image was not prefetched, this was considered a likely condition.</p>
<p>This instability has been mitigated by configuring the OCI model as an init container in addition to the sidecar. The init container ensures that the model image is fetched before the main containers are started, and this prefetching allows the modelcar to start quickly.
With this change, modelcars are a stable feature as of KServe v0.14.</p>
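<p>With modelcars stable, pointing an InferenceService at an OCI model image only requires the <code>oci://</code> schema in <code>storageUri</code>. A minimal sketch, in which the registry path and model format are placeholders:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-from-oci
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Placeholder image reference; any OCI-compatible registry
      # reachable from the cluster works.
      storageUri: oci://registry.example.com/models/sklearn-iris:v1
```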
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-plan">Future plan<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#future-plan" class="hash-link" aria-label="Direct link to Future plan" title="Direct link to Future plan" translate="no">​</a></h4>
<p>Modelcars is one implementation option for supporting OCI images for model storage. Other alternatives are discussed in <a href="https://github.com/kserve/kserve/issues/4083" target="_blank" rel="noopener noreferrer" class="">issue #4083</a>.</p>
<p>Using volume mounts based on OCI artifacts is the optimal implementation, but this <a href="https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/" target="_blank" rel="noopener noreferrer" class="">only recently became possible with Kubernetes 1.31</a>, which introduces it as a native alpha feature. KServe can now evolve to use this new Kubernetes feature.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-model-cache">Introducing Model Cache<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#introducing-model-cache" class="hash-link" aria-label="Direct link to Introducing Model Cache" title="Direct link to Introducing Model Cache" translate="no">​</a></h3>
<p>With models increasing in size, which is especially true for LLMs, pulling a model from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also provides model caching, its capabilities are not flexible, since cache management is delegated to the cluster.</p>
<p>The Model Cache was proposed as another alternative to enhance KServe usability with large models, and it is released in KServe v0.14 as an <strong>alpha</strong> feature.
In this release, models are stored on local node storage, and the <code>LocalModelCache</code> custom resource controls which models to keep in the cache.
The local model cache state can always be rebuilt from the models stored on persistent storage, such as a model registry or S3.
Read the <a href="https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit" target="_blank" rel="noopener noreferrer" class="">design document for the details</a>.</p>
<p><img decoding="async" loading="lazy" alt="localmodelcache" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/localmodelcache-59f819fe261fb8fcd66a6c875a73b3d6.png" width="2462" height="1416" class="img_ev3q"></p>
<p>By caching the models, you get the following benefits:</p>
<ul>
<li class="">
<p>Minimize the time it takes for LLM pods to start serving requests.</p>
</li>
<li class="">
<p>Share the same storage across pods scheduled on the same GPU node.</p>
</li>
<li class="">
<p>Scale your AI workload efficiently without worrying about slow model server container startup.</p>
</li>
</ul>
<p>The model cache is currently disabled by default. To enable it, set the <code>localmodel.enabled</code> field in the <code>inferenceservice-config</code> ConfigMap.</p>
<p>You can follow the <a href="https://kserve.github.io/archive/0.14/modelserving/storage/modelcache/localmodel/" target="_blank" rel="noopener noreferrer" class="">local model cache tutorial</a> to cache LLMs on the local NVMe drives of your GPU nodes and deploy them with an <code>InferenceService</code> that loads models from the local cache to accelerate container startup.</p>
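<p>As a sketch of what caching a model looks like, assuming the <code>LocalModelCache</code> field names shown in the tutorial (the model URI, size, and node group name below are placeholders):</p>

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  # Persistent source the cache is populated from (and can be rebuilt from).
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
  modelSize: 10Gi
  nodeGroups:
    - workers   # placeholder node group with local NVMe storage
```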
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="support-for-hugging-face-hub-in-storage-initializer">Support for Hugging Face hub in storage initializer<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#support-for-hugging-face-hub-in-storage-initializer" class="hash-link" aria-label="Direct link to Support for Hugging Face hub in storage initializer" title="Direct link to Support for Hugging Face hub in storage initializer" translate="no">​</a></h3>
<p>The KServe storage initializer has been enhanced to support downloading models directly from Hugging Face. For this, the new <code>hf://</code> schema is now supported in the <code>storageUri</code> field of InferenceServices. The following partial YAML shows this:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span></code></pre></div></div>
<p>Both public and private Hugging Face repositories are supported. The credentials can be provided by the usual mechanism of binding Secrets to ServiceAccounts, or by binding the credentials Secret as environment variables in the InferenceService.</p>
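<p>As one example of the environment-variable approach, assuming a pre-created Secret named <code>hf-secret</code> holding a Hugging Face access token under the <code>HF_TOKEN</code> key:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-private-model
spec:
  predictor:
    model:
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
      env:
        # Secret name and key are placeholders; create the Secret
        # beforehand with your Hugging Face access token.
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: HF_TOKEN
```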
<p>Read the <a href="https://kserve.github.io/archive/0.14/modelserving/storage/huggingface/hf/" target="_blank" rel="noopener noreferrer" class="">documentation</a> for more details.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-enhancements-and-improvements">🛠️ Enhancements and Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#%EF%B8%8F-enhancements-and-improvements" class="hash-link" aria-label="Direct link to 🛠️ Enhancements and Improvements" title="Direct link to 🛠️ Enhancements and Improvements" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hugging-face-vllm-backend-changes">Hugging Face vLLM backend changes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#hugging-face-vllm-backend-changes" class="hash-link" aria-label="Direct link to Hugging Face vLLM backend changes" title="Direct link to Hugging Face vLLM backend changes" translate="no">​</a></h3>
<ul>
<li class="">vLLM backend to update to 0.6.1 <a href="https://github.com/kserve/kserve/pull/3948" target="_blank" rel="noopener noreferrer" class="">#3948</a></li>
<li class="">Support trust_remote_code flag for vllm <a href="https://github.com/kserve/kserve/pull/3729" target="_blank" rel="noopener noreferrer" class="">#3729</a></li>
<li class="">Support text embedding task in hugging face server <a href="https://github.com/kserve/kserve/pull/3743" target="_blank" rel="noopener noreferrer" class="">#3743</a></li>
<li class="">Add health endpoint for vLLM backend <a href="https://github.com/kserve/kserve/pull/3850" target="_blank" rel="noopener noreferrer" class="">#3850</a></li>
<li class="">Added <code>hostIPC</code> field to <code>ServingRuntime</code> CRD, for supporting more than one GPU in Serverless mode <a href="https://github.com/kserve/kserve/issues/3791" target="_blank" rel="noopener noreferrer" class="">#3791</a></li>
<li class="">Support shared memory volume for vLLM backend <a href="https://github.com/kserve/kserve/pull/3910" target="_blank" rel="noopener noreferrer" class="">#3910</a></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="other-enhancements">Other Enhancements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#other-enhancements" class="hash-link" aria-label="Direct link to Other Enhancements" title="Direct link to Other Enhancements" translate="no">​</a></h3>
<ul>
<li class="">New flag for automount serviceaccount token by <a href="https://github.com/kserve/kserve/pull/3979" target="_blank" rel="noopener noreferrer" class="">#3979</a></li>
<li class="">TLS support for inference loggers <a href="https://github.com/kserve/kserve/issues/3837" target="_blank" rel="noopener noreferrer" class="">#3837</a></li>
<li class="">Allow PVC storage to be mounted in ReadWrite mode via an annotation <a href="https://github.com/kserve/kserve/issues/3687" target="_blank" rel="noopener noreferrer" class="">#3687</a></li>
<li class="">Support HTTP Headers passing for KServe python custom runtimes <a href="https://github.com/kserve/kserve/pull/3669" target="_blank" rel="noopener noreferrer" class="">#3669</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class="">Ray is now an optional dependency <a href="https://github.com/kserve/kserve/pull/3834" target="_blank" rel="noopener noreferrer" class="">#3834</a></li>
<li class="">Support for Python 3.12 is added, while support Python 3.8 is removed <a href="https://github.com/kserve/kserve/pull/3645" target="_blank" rel="noopener noreferrer" class="">#3645</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.14.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the community" title="Direct link to 🤝 Join the community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Edgar Hernández</name>
            <uri>https://github.com/israel-hdez</uri>
        </author>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[From Serverless Predictive Inference to Generative Inference - Introducing KServe v0.13]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release"/>
        <updated>2024-05-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.13 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on May 15, 2024</em></p>
<p>We are excited to unveil KServe v0.13, marking a significant leap forward in evolving cloud native model serving to meet the demands of Generative AI inference. This release is highlighted by three pivotal updates: enhanced Hugging Face runtime, robust vLLM backend support for Generative Models, and the integration of OpenAI protocol standards.</p>
<p><img decoding="async" loading="lazy" alt="kserve-components" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-layer-08feccc0300cf8608f0a36b6572e70fb.png" width="960" height="540" class="img_ev3q"></p>
<p>Below is a summary of the key changes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-hugging-face-runtime-support">🚀 Enhanced Hugging Face Runtime Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-enhanced-hugging-face-runtime-support" class="hash-link" aria-label="Direct link to 🚀 Enhanced Hugging Face Runtime Support" title="Direct link to 🚀 Enhanced Hugging Face Runtime Support" translate="no">​</a></h2>
<p>KServe v0.13 enriches its Hugging Face runtime and now supports running Hugging Face models out-of-the-box. KServe v0.13 implements a <a href="https://github.com/kserve/kserve/tree/release-0.13/python/huggingfaceserver" target="_blank" rel="noopener noreferrer" class="">KServe Hugging Face Serving Runtime</a>, <code>kserve-huggingfaceserver</code>. With this implementation, KServe can now automatically infer a <a href="https://huggingface.co/tasks" target="_blank" rel="noopener noreferrer" class="">task</a> from the model architecture and select the optimized serving runtime. Currently supported tasks include sequence classification, token classification, fill-mask, text generation, and text-to-text generation.</p>
<p><img decoding="async" loading="lazy" alt="kserve-huggingface" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-huggingface-209566d5f98a98d521606e57b4531a19.png" width="7243" height="2208" class="img_ev3q"></p>
<p>Here is an example of serving a BERT model by deploying an InferenceService with the Hugging Face runtime for a classification task.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">bert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=bert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=bert</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">base</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">uncased</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">tensor_input_names=input_ids</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" 
style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><br></span></code></pre></div></div>
<p>You can also deploy BERT on a more optimized inference runtime such as Triton, using the Hugging Face runtime for pre/post processing; see more details <a href="https://kserve.github.io/archive/0.13/modelserving/v1beta1/triton/huggingface/" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-vllm-support">🔧 vLLM Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-vllm-support" class="hash-link" aria-label="Direct link to 🔧 vLLM Support" title="Direct link to 🔧 vLLM Support" translate="no">​</a></h3>
<p>Version 0.13 introduces dedicated runtime support for <a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noopener noreferrer" class="">vLLM</a> for enhanced transformer model serving. This support now includes automatically mapping vLLM as the backend for supported tasks, streamlining the deployment process and optimizing performance. If vLLM does not support a particular task, serving defaults to the Hugging Face backend. See the example below.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"6"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 24Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"6"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 24Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><br></span></code></pre></div></div>
<p>See more details in our updated docs to <a href="https://kserve.github.io/archive/0.13/modelserving/v1beta1/llm/huggingface/" target="_blank" rel="noopener noreferrer" class="">Deploy the Llama3 model with Hugging Face LLM Serving Runtime</a>.</p>
<p>Additionally, if the Hugging Face backend is preferred over vLLM, vLLM auto-mapping can be disabled with the <code>--backend=huggingface</code> arg.</p>
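<p>For illustration, a minimal sketch of where the backend override arg goes in the predictor spec; the service name and model ID below are placeholders, not from this release:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-example          # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=example
      - --model_id=some-org/some-model   # placeholder model ID
      - --backend=huggingface            # prefer Hugging Face, disabling vLLM auto-mapping
</code></pre>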
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-openai-schema-integration">🌐 OpenAI Schema Integration<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-openai-schema-integration" class="hash-link" aria-label="Direct link to 🌐 OpenAI Schema Integration" title="Direct link to 🌐 OpenAI Schema Integration" translate="no">​</a></h3>
<p>Embracing the OpenAI protocol, KServe v0.13 now supports three specific endpoints for generative transformer models:</p>
<ul>
<li class=""><code>/openai/v1/completions</code></li>
<li class=""><code>/openai/v1/chat/completions</code></li>
<li class=""><code>/openai/v1/models</code></li>
</ul>
<p>These endpoints are useful for generative transformer models, which take in messages and return a model-generated message output. The <a href="https://platform.openai.com/docs/guides/text-generation/chat-completions-api" target="_blank" rel="noopener noreferrer" class="">chat completions endpoint</a> is designed to easily handle multi-turn conversations, while still being useful for single-turn tasks. The <a href="https://platform.openai.com/docs/guides/text-generation/completions-api" target="_blank" rel="noopener noreferrer" class="">completions endpoint</a> is now a legacy endpoint; it differs from the chat completions endpoint in that its interface is a freeform text string called a <code>prompt</code>. Read more about the <a href="https://platform.openai.com/docs/api-reference/chat" target="_blank" rel="noopener noreferrer" class="">chat completions</a> and <a href="https://platform.openai.com/docs/api-reference/completions" target="_blank" rel="noopener noreferrer" class="">completions</a> endpoints in the OpenAI API docs.</p>
<p>This update fosters a standardized approach to transformer model serving, ensuring compatibility with a broader spectrum of models and tools and enhancing the platform's versatility. The API can be used directly with OpenAI's client libraries or third-party tools such as LangChain or LlamaIndex.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-future-plan">🔮 Future Plan<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-future-plan" class="hash-link" aria-label="Direct link to 🔮 Future Plan" title="Direct link to 🔮 Future Plan" translate="no">​</a></h3>
<ul>
<li class="">Support other tasks like text embeddings <a href="https://github.com/kserve/kserve/issues/3572" target="_blank" rel="noopener noreferrer" class="">#3572</a>.</li>
<li class="">Support more LLM backend options in the future, such as TensorRT-LLM.</li>
<li class="">Enrich text generation metrics for Throughput(tokens/sec), TTFT(Time to first token) <a href="https://github.com/kserve/kserve/issues/3461" target="_blank" rel="noopener noreferrer" class="">#3461</a>.</li>
<li class="">KEDA integration for token based LLM Autoscaling <a href="https://github.com/kserve/kserve/issues/3561" target="_blank" rel="noopener noreferrer" class="">#3561</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-other-changes">🛠️ Other Changes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#%EF%B8%8F-other-changes" class="hash-link" aria-label="Direct link to 🛠️ Other Changes" title="Direct link to 🛠️ Other Changes" translate="no">​</a></h2>
<p>This release also includes several enhancements and changes:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">✨ What's New?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-whats-new" class="hash-link" aria-label="Direct link to ✨ What's New?" title="Direct link to ✨ What's New?" translate="no">​</a></h3>
<ul>
<li class="">Async streaming support for v1 endpoints <a href="https://github.com/kserve/kserve/issues/3402" target="_blank" rel="noopener noreferrer" class="">#3402</a>.</li>
<li class="">Support for <code>.json</code> and <code>.ubj</code> model formats in the XGBoost server image <a href="https://github.com/kserve/kserve/issues/3546" target="_blank" rel="noopener noreferrer" class="">#3546</a>.</li>
<li class="">Enhanced flexibility in KServe by allowing the configuration of multiple domains for an inference service <a href="https://github.com/kserve/kserve/issues/2747" target="_blank" rel="noopener noreferrer" class="">#2747</a>.</li>
<li class="">Enhanced the manager setup to dynamically adapt based on available CRDs, improving operational flexibility and reliability across different deployment environments <a href="https://github.com/kserve/kserve/issues/3470" target="_blank" rel="noopener noreferrer" class="">#3470</a>.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h3>
<ul>
<li class="">Removed Seldon Alibi dependency <a href="https://github.com/kserve/kserve/issues/3380" target="_blank" rel="noopener noreferrer" class="">#3380</a>.</li>
<li class="">Removal of conversion webhook from manifests. <a href="https://github.com/kserve/kserve/issues/3344" target="_blank" rel="noopener noreferrer" class="">#3344</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.13.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Special Recognition</strong>: Contributors who helped drive the generative AI capabilities forward</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Alexa Griffith</name>
            <uri>https://github.com/alexagriffith</uri>
        </author>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Yuan Tang</name>
            <uri>https://github.com/terrytangyuan</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.11]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release"/>
        <updated>2023-10-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.11 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on October 8, 2023</em></p>
<p>We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes, made enhancements to the KServe control plane, added Open Inference Protocol support to the Python SDK, and improved dependency management. For ModelMesh, we added PVC, HPA, and payload logging support to ensure feature parity with KServe.</p>
<p>Here is a summary of the key changes:</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-core-inference-enhancements">🚀 KServe Core Inference Enhancements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-kserve-core-inference-enhancements" class="hash-link" aria-label="Direct link to 🚀 KServe Core Inference Enhancements" title="Direct link to 🚀 KServe Core Inference Enhancements" translate="no">​</a></h2>
<ul>
<li class="">
<p><strong>Path-based routing support</strong>, which serves as an alternative to host-based routing; the URL of the <code>InferenceService</code> looks like <code>http://&lt;ingress_domain&gt;/serving/&lt;namespace&gt;/&lt;isvc_name&gt;</code>.
Please refer to the <a href="https://github.com/kserve/kserve/blob/294a10495b6b5cda9c64d3e1573b60aec62aceb9/config/configmap/inferenceservice.yaml#L237" target="_blank" rel="noopener noreferrer" class="">doc</a> for how to enable path-based routing.</p>
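<p>As a rough sketch, and assuming the <code>pathTemplate</code> key from the linked configmap, the ingress configuration that produces URLs of this shape could look like the following:</p>
<pre><code class="language-yaml"># Sketch of the ingress entry in the inferenceservice configmap;
# the pathTemplate value is an assumption matching the URL shape above.
ingress: |-
  {
    "pathTemplate": "/serving/{{ .Namespace }}/{{ .Name }}"
  }
</code></pre>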
</li>
<li class="">
<p><strong>Priority field for the Serving Runtime</strong> custom resource to handle the case where multiple serving runtimes support the same model format; see more details in <a href="https://kserve.github.io/archive/0.11/modelserving/servingruntimes/#priority" target="_blank" rel="noopener noreferrer" class="">the serving runtime doc</a>.</p>
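<p>A minimal sketch of how priority might be set on a serving runtime; the runtime name and image are illustrative, and the field placement follows the linked doc:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-sklearn-runtime      # illustrative name
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2                    # preferred over a runtime advertising a lower priority
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:latest   # illustrative image
</code></pre>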
</li>
<li class="">
<p><strong>Custom Storage Container CRD</strong> to allow customized storage initializer implementations with supported storage URI prefixes; an example use case is private model registry integration:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1alpha1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ClusterStorageContainer</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">container</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> storage</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">initializer</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve/model</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">registry</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100Mi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   
     </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 1Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">supportedUriFormats</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">prefix</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> model</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">registry</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//</span><br></span></code></pre></div></div>
</li>
<li class="">
<p><strong>Inference Graph enhancements</strong> improving the API spec to support pod affinity and resource requirement fields.
A <code>dependency</code> field with the options <code>Soft</code> and <code>Hard</code> is introduced to handle error responses from the inference steps and decide whether to short-circuit the request in case of errors; see the following example with a hard dependency on the node steps:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceGraph</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> graph_with_switch_node</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">nodes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      
</span><span class="token key atrule" style="color:#00a4db">root</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Sequence</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"rootStep1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">nodeName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> node1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">dependency</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hard</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"rootStep2"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> success_200_isvc_id </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">node1</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Switch</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"node1Step1"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> error_404_isvc_id </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">condition</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[@this].#(decision_picker==ERROR)"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">dependency</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hard</span><br></span></code></pre></div></div>
<p>For more details, please refer to the <a href="https://github.com/kserve/kserve/issues/2484" target="_blank" rel="noopener noreferrer" class="">issue</a>.</p>
</li>
<li class="">
<p><strong>Improved InferenceService debugging experience</strong> by adding the aggregated <code>RoutesReady</code> status and the <code>LastDeploymentReady</code> condition to the InferenceService status to differentiate between the endpoint and deployment status.
This applies to serverless mode; for more details, refer to the <a href="https://pkg.go.dev/github.com/kserve/kserve@v0.11.1/pkg/apis/serving/v1beta1#InferenceServiceStatus" target="_blank" rel="noopener noreferrer" class="">API docs</a>.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-python-sdk-dependency-management">📦 Enhanced Python SDK Dependency Management<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-enhanced-python-sdk-dependency-management" class="hash-link" aria-label="Direct link to 📦 Enhanced Python SDK Dependency Management" title="Direct link to 📦 Enhanced Python SDK Dependency Management" translate="no">​</a></h3>
<ul>
<li class="">
<p>KServe has adopted <a href="https://python-poetry.org/docs/" target="_blank" rel="noopener noreferrer" class="">poetry</a> to manage Python dependencies. You can now install the KServe SDK with locked dependencies using <code>poetry install</code>.
While <code>pip install</code> still works, we highly recommend using poetry to ensure predictable dependency management.</p>
</li>
<li class="">
<p>The KServe SDK has also been slimmed down by making the cloud storage dependencies optional. If you need the storage dependencies for custom serving runtimes, you can still install them with <code>pip install kserve[storage]</code>.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-python-runtimes-improvements">🔧 KServe Python Runtimes Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-kserve-python-runtimes-improvements" class="hash-link" aria-label="Direct link to 🔧 KServe Python Runtimes Improvements" title="Direct link to 🔧 KServe Python Runtimes Improvements" translate="no">​</a></h3>
<ul>
<li class="">
<p>KServe Python Runtimes including <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/sklearn/v2/" target="_blank" rel="noopener noreferrer" class="">sklearnserver</a>, <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/lightgbm/" target="_blank" rel="noopener noreferrer" class="">lgbserver</a>, <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/xgboost/" target="_blank" rel="noopener noreferrer" class="">xgbserver</a>
now support the open inference protocol for both REST and gRPC.</p>
</li>
<li class="">
<p>Logging improvements including adding Uvicorn access logging and a default KServe logger.</p>
</li>
<li class="">
<p>The <code>postprocess</code> handler has been aligned with the open inference protocol, abstracting away the complexities of the underlying transport protocol.</p>
</li>
</ul>
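<p>To give a concrete picture of the open inference protocol shape the runtimes now work against, here is a minimal, self-contained sketch using plain dictionaries. The tensor metadata fields (<code>name</code>, <code>shape</code>, <code>datatype</code>, <code>data</code>) follow the v2 protocol; the helper function, model name, and values are illustrative only, not KServe source.</p>

```python
# Toy sketch of an open (v2) inference protocol response body.
# Field names follow the v2 protocol; everything else here is made up.
def build_v2_response(model_name, predictions):
    return {
        "model_name": model_name,
        "outputs": [
            {
                "name": "output-0",
                "shape": [len(predictions)],
                "datatype": "FP32",
                "data": predictions,
            }
        ],
    }

resp = build_v2_response("my-model", [0.1, 0.9])
```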
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-llm-runtimes">🤖 LLM Runtimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-llm-runtimes" class="hash-link" aria-label="Direct link to 🤖 LLM Runtimes" title="Direct link to 🤖 LLM Runtimes" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="torchserve-llm-runtime">TorchServe LLM Runtime<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#torchserve-llm-runtime" class="hash-link" aria-label="Direct link to TorchServe LLM Runtime" title="Direct link to TorchServe LLM Runtime" translate="no">​</a></h4>
<p>KServe now integrates with TorchServe 0.8, offering support for <a href="https://pytorch.org/serve/large_model_inference.html" target="_blank" rel="noopener noreferrer" class="">LLM models</a> that may not fit onto a single GPU.
Hugging Face Accelerate and DeepSpeed are available options to split the model into multiple partitions across multiple GPUs. See the <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/llm/torchserve/accelerate/" target="_blank" rel="noopener noreferrer" class="">detailed example</a> of how to serve an LLM on KServe with the TorchServe runtime.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="vllm-runtime">vLLM Runtime<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#vllm-runtime" class="hash-link" aria-label="Direct link to vLLM Runtime" title="Direct link to vLLM Runtime" translate="no">​</a></h4>
<p>Serving LLMs can be surprisingly slow even on high-end GPUs. <a href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener noreferrer" class="">vLLM</a> is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers.
It supports <a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" target="_blank" rel="noopener noreferrer" class="">continuous batching</a> for increased throughput and GPU utilization, and
<a href="https://vllm.ai/" target="_blank" rel="noopener noreferrer" class="">paged attention</a> to address the memory bottleneck of autoregressive decoding, where all the attention key-value tensors (the KV cache) are kept in GPU memory to generate the next tokens.</p>
<p>The <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/llm/vllm/" target="_blank" rel="noopener noreferrer" class="">example</a> shows how to deploy vLLM on KServe; we expect further integration in KServe 0.12 with the proposed <a href="https://github.com/kserve/open-inference-protocol/pull/7" target="_blank" rel="noopener noreferrer" class="">generate endpoint</a> for the open inference protocol.</p>
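<p>As a rough intuition for why continuous batching helps, the toy scheduler below (not vLLM code; the request lengths and batch limit are made up) admits waiting requests into the running batch as soon as a slot frees up, instead of waiting for the whole batch to drain as static batching does:</p>

```python
# Toy illustration of continuous vs. static batching. Each request needs some
# number of decode steps; the batch can hold at most max_batch requests.
def continuous_batching_steps(request_lengths, max_batch):
    pending = list(request_lengths)
    running = []
    steps = 0
    while pending or running:
        # Admit waiting requests into free batch slots immediately.
        while pending and len(running) < max_batch:
            running.append(pending.pop(0))
        # One decode step advances every running request; finished ones leave.
        steps += 1
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
    return steps

def static_batching_steps(request_lengths, max_batch):
    # Static batching waits for the slowest request in each batch to finish.
    pending = list(request_lengths)
    steps = 0
    while pending:
        batch, pending = pending[:max_batch], pending[max_batch:]
        steps += max(batch)
    return steps
```

With requests needing 3, 1, and 2 decode steps and a batch size of 2, the continuous scheduler finishes in fewer total steps because the short request's slot is reused right away.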
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">📊 ModelMesh Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 📊 ModelMesh Updates" title="Direct link to 📊 ModelMesh Updates" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-storing-models-on-kubernetes-persistent-volumes-pvc">💾 Storing Models on Kubernetes Persistent Volumes (PVC)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-storing-models-on-kubernetes-persistent-volumes-pvc" class="hash-link" aria-label="Direct link to 💾 Storing Models on Kubernetes Persistent Volumes (PVC)" title="Direct link to 💾 Storing Models on Kubernetes Persistent Volumes (PVC)" translate="no">​</a></h3>
<p>ModelMesh now allows you to <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/predictors/setup-storage.md#deploy-a-model-stored-on-a-persistent-volume-claim" target="_blank" rel="noopener noreferrer" class="">directly mount model files onto serving runtime pods</a>
using <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/" target="_blank" rel="noopener noreferrer" class="">Kubernetes Persistent Volumes</a>. Depending on the selected <a href="https://kubernetes.io/docs/concepts/storage/storage-classes/" target="_blank" rel="noopener noreferrer" class="">storage solution</a>, this approach can significantly reduce latency when deploying new predictors and
can potentially remove the need for additional cloud object storage such as AWS S3, GCS, or Azure Blob Storage altogether.</p>
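<p>A minimal sketch of what a PVC-backed predictor can look like, using KServe's <code>pvc://</code> storage URI convention (the claim name, model path, service name, and model format below are illustrative, not defaults):</p>

```yaml
# Illustrative sketch only: names and paths are made up.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-pvc-example
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # pvc://<claim-name>/<path-inside-volume>
      storageUri: pvc://my-models-pvc/sklearn/model.joblib
```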
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-horizontal-pod-autoscaling-hpa">⚡ Horizontal Pod Autoscaling (HPA)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-horizontal-pod-autoscaling-hpa" class="hash-link" aria-label="Direct link to ⚡ Horizontal Pod Autoscaling (HPA)" title="Direct link to ⚡ Horizontal Pod Autoscaling (HPA)" translate="no">​</a></h3>
<p>Kubernetes <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" target="_blank" rel="noopener noreferrer" class="">Horizontal Pod Autoscaling</a> can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a <code>HorizontalPodAutoscaler</code> automatically updates the serving
runtime deployment with the number of Pods to best match the demand.</p>
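<p>As a sketch of what this can look like, a standard <code>HorizontalPodAutoscaler</code> targeting the serving runtime deployment might be written as follows (the deployment name, replica bounds, and CPU threshold here are illustrative, not ModelMesh defaults):</p>

```yaml
# Illustrative sketch only: target name and thresholds are made up.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-runtime-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-mlserver-1.x
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```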
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-model-metrics-metrics-dashboard-payload-event-logging">📈 Model Metrics, Metrics Dashboard, Payload Event Logging<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-model-metrics-metrics-dashboard-payload-event-logging" class="hash-link" aria-label="Direct link to 📈 Model Metrics, Metrics Dashboard, Payload Event Logging" title="Direct link to 📈 Model Metrics, Metrics Dashboard, Payload Event Logging" translate="no">​</a></h3>
<p>ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, such as allocating more resources or increasing the number of replicas for improved responsiveness, or avoiding frequent cache misses.</p>
<p>A new <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md#import-the-grafana-dashboard" target="_blank" rel="noopener noreferrer" class="">Grafana dashboard</a> was added to display the comprehensive set of <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md" target="_blank" rel="noopener noreferrer" class="">Prometheus metrics</a> like model loading
and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment.</p>
<p>The new <a href="https://github.com/kserve/modelmesh/blob/main/src/main/java/com/ibm/watson/modelmesh/payload/" target="_blank" rel="noopener noreferrer" class=""><code>PayloadProcessor</code> interface</a> can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h2>
<ul>
<li class="">
<p>To allow longer InferenceService names despite DNS max-length limits (see the <a href="https://github.com/kserve/kserve/issues/1397" target="_blank" rel="noopener noreferrer" class="">issue</a>), the <code>Default</code> suffix has been removed from the inference service component (predictor/transformer/explainer) names of newly created InferenceServices.
This affects clients that use the component URL directly instead of the top-level InferenceService URL.</p>
</li>
<li class="">
<p><code>Status.address.url</code> is now consistent across serverless and raw deployment modes: the URL path portion is dropped in serverless mode.</p>
</li>
<li class="">
<p>Raw bytes are now accepted in the v1 protocol. For a JSON payload to be recognized and decoded, the <code>Content-Type</code> header must be set to <code>application/json</code>.</p>
</li>
</ul>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.11.0" target="_blank" rel="noopener noreferrer" class="">GitHub release pages</a> for KServe v0.11 and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.11.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh v0.11</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers along with both regular and new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Working Group</strong>: All members of the KServe Working Group for their ongoing collaboration</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.10.0]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release"/>
        <updated>2023-02-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.10 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on February 5, 2023</em></p>
<p>We are excited to announce KServe 0.10 release. In this release we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes and increased support coverage for <a href="https://kserve.github.io/archive/0.10/modelserving/data_plane/v2_protocol/" target="_blank" rel="noopener noreferrer" class="">Open(aka v2) inference protocol</a> for both standard and ModelMesh InferenceService.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-networking-options">🌐 KServe Networking Options<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-networking-options" class="hash-link" aria-label="Direct link to 🌐 KServe Networking Options" title="Direct link to 🌐 KServe Networking Options" translate="no">​</a></h2>
<p>Istio is now optional for both <a href="https://kserve.github.io/archive/0.10/admin/serverless/serverless/" target="_blank" rel="noopener noreferrer" class="">Serverless</a> and <a href="https://kserve.github.io/archive/0.10/admin/kubernetes_deployment/" target="_blank" rel="noopener noreferrer" class="">RawDeployment</a> mode. Please see the <a href="https://kserve.github.io/archive/0.10/admin/serverless/kourier_networking/" target="_blank" rel="noopener noreferrer" class="">alternative networking guide</a> for how you can enable other ingress options supported by Knative with Serverless mode.
For Istio users, if you want to turn on full service mesh mode to secure InferenceService with mutual TLS and enable the traffic policies, please read the <a href="https://kserve.github.io/archive/0.10/admin/serverless/servicemesh/" target="_blank" rel="noopener noreferrer" class="">service mesh setup guideline</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-telemetry-for-serving-runtimes">📊 KServe Telemetry for Serving Runtimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-telemetry-for-serving-runtimes" class="hash-link" aria-label="Direct link to 📊 KServe Telemetry for Serving Runtimes" title="Direct link to 📊 KServe Telemetry for Serving Runtimes" translate="no">​</a></h2>
<p>We have instrumented additional latency metrics in KServe Python ServingRuntimes for <code>preprocess</code>, <code>predict</code> and <code>postprocess</code> handlers.
In Serverless mode we have extended Knative <code>queue-proxy</code> to enable metrics aggregation for both metrics exposed in <code>queue-proxy</code> and <code>kserve-container</code> from each <code>ServingRuntime</code>.
Please read the <a href="https://kserve.github.io/archive/0.10/modelserving/observability/prometheus_metrics/" target="_blank" rel="noopener noreferrer" class="">prometheus metrics setup guideline</a> for how to enable the metrics scraping and aggregations.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-openv2-inference-protocol-support-coverage">🚀 Open(v2) Inference Protocol Support Coverage<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-openv2-inference-protocol-support-coverage" class="hash-link" aria-label="Direct link to 🚀 Open(v2) Inference Protocol Support Coverage" title="Direct link to 🚀 Open(v2) Inference Protocol Support Coverage" translate="no">​</a></h2>
<p>As adoption of the <code>KServe v2 Inference Protocol</code> keeps growing, from the <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/amd/" target="_blank" rel="noopener noreferrer" class="">AMD Inference ServingRuntime</a>, which
supports FPGAs, to OpenVINO, which now provides a KServe-compatible <a href="https://docs.openvino.ai/latest/ovms_docs_rest_api_kfs.html" target="_blank" rel="noopener noreferrer" class="">REST</a> and <a href="https://docs.openvino.ai/latest/ovms_docs_grpc_api_kfs.html" target="_blank" rel="noopener noreferrer" class="">gRPC</a> API,
we have proposed in <a href="https://github.com/kserve/kserve/issues/2663" target="_blank" rel="noopener noreferrer" class="">the issue</a> to rename it to the <code>KServe Open Inference Protocol</code>.</p>
<p>In KServe 0.10, we have added Open(v2) inference protocol support for KServe custom runtimes.
You can now enable v2 REST/gRPC for both custom transformers and predictors with images built by implementing the KServe Python SDK API.
gRPC enables a high-performance inference data plane: it is built on top of HTTP/2 and binary data transport, which is more efficient to send over the wire than REST.
Please see the detailed examples for the <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/transformer/torchserve_image_transformer/" target="_blank" rel="noopener noreferrer" class="">transformer</a> and
the <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/custom/custom_model/" target="_blank" rel="noopener noreferrer" class="">predictor</a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> kserve </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">byte_array</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image_processing </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Compose</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">ToTensor</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Normalize</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.1307</span><span class="token punctuation" style="color:#393A34">,</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.3081</span><span class="token punctuation" style="color:#393A34">,</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Image</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">io</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">BytesIO</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">byte_array</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    tensor </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> image_processing</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">image</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">numpy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> tensor</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomModel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Model</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">predict</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"> request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Dict</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> InferResponse</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">instance</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> instance </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inputs</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">asarray</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        output </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">model</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        torch</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">nn</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functional</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">softmax</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">output</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> dim</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        values</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> top_5 </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> torch</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">topk</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">output</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> values</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">flatten</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">tolist</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        response_id </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> generate_uuid</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_output </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> InferOutput</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"output-0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> shape</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">values</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">shape</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> datatype</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"FP32"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">result</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> InferResponse</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">model_name</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> infer_outputs</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token plain">infer_output</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> response_id</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">response_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> infer_response</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomTransformer</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Model</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">preprocess</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> Dict</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">instance</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> instance </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inputs</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">asarray</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_inputs </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">InferInput</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"INPUT__0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> datatype</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'FP32'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> shape</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">shape</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                                   data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">input_tensors</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_request </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">model_name</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">model_name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> infer_inputs</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">infer_inputs</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> infer_request</span><br></span></code></pre></div></div>
<p>You can use the same Python API types <code>InferRequest</code> and <code>InferResponse</code> for both the REST and gRPC protocols. KServe handles the underlying decoding and encoding according to the protocol.</p>
<p>⚠️ <strong>Warning</strong>: A new <code>headers</code> argument has been added to the custom handlers to pass HTTP/gRPC headers or other metadata. You can also use it as a context dict to pass data between handlers.
If you have an existing custom transformer or predictor, you must now add the <code>headers</code> argument to the <code>preprocess</code>, <code>predict</code>, and <code>postprocess</code> handlers.</p>
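<p>The context-dict pattern can be illustrated with a minimal sketch. The plain functions below stand in for the actual KServe <code>Model</code> handlers, and the header key is illustrative: values written in <code>preprocess</code> remain visible to later handlers for the same request.</p>

```python
# Minimal sketch (plain functions, not the actual KServe Model class) showing
# how the new `headers` argument can double as a per-request context dict.
from typing import Dict, List

def preprocess(payload: List[float], headers: Dict[str, str]) -> List[float]:
    # Stash metadata for later handlers; the key name is illustrative.
    headers["x-input-size"] = str(len(payload))
    return [x * 2.0 for x in payload]

def predict(tensors: List[float], headers: Dict[str, str]) -> List[float]:
    # A downstream handler can read what preprocess stored earlier.
    assert "x-input-size" in headers
    return [sum(tensors)]

headers: Dict[str, str] = {}
result = predict(preprocess([1.0, 2.0], headers), headers)
print(result)  # [6.0]
```

In the real API the same dict instance is passed through <code>preprocess</code>, <code>predict</code>, and <code>postprocess</code>, which is what makes it usable as shared per-request state.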
<p>Please check the following matrix for supported ModelFormats and <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/serving_runtime/" target="_blank" rel="noopener noreferrer" class="">ServingRuntimes</a>.</p>
<table><thead><tr><th>Model Format</th><th>v1</th><th>Open(v2) REST/gRPC</th></tr></thead><tbody><tr><td>Tensorflow</td><td>✅ TFServing</td><td>✅ Triton</td></tr><tr><td>PyTorch</td><td>✅ TorchServe</td><td>✅ TorchServe</td></tr><tr><td>TorchScript</td><td>✅ TorchServe</td><td>✅ Triton</td></tr><tr><td>ONNX</td><td>❌</td><td>✅ Triton</td></tr><tr><td>Scikit-learn</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>XGBoost</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>LightGBM</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>MLFlow</td><td>❌</td><td>✅ MLServer</td></tr><tr><td>Custom</td><td>✅ KServe</td><td>✅ KServe</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-multi-arch-image-support">🏗️ Multi-Arch Image Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#%EF%B8%8F-multi-arch-image-support" class="hash-link" aria-label="Direct link to 🏗️ Multi-Arch Image Support" title="Direct link to 🏗️ Multi-Arch Image Support" translate="no">​</a></h2>
<p>KServe control plane images <a href="https://hub.docker.com/r/kserve/kserve-controller/tags" target="_blank" rel="noopener noreferrer" class="">kserve-controller</a>,
<a href="https://hub.docker.com/r/kserve/agent/tags" target="_blank" rel="noopener noreferrer" class="">kserve/agent</a>, <a href="https://hub.docker.com/r/kserve/router/tags" target="_blank" rel="noopener noreferrer" class="">kserve/router</a> are now supported
for multiple architectures: <code>ppc64le</code>, <code>arm64</code>, <code>amd64</code>, <code>s390x</code>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-storage-credentials-support">🔐 KServe Storage Credentials Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-storage-credentials-support" class="hash-link" aria-label="Direct link to 🔐 KServe Storage Credentials Support" title="Direct link to 🔐 KServe Storage Credentials Support" translate="no">​</a></h2>
<ul>
<li class="">Currently, AWS users need to create a secret with long-term/static IAM credentials to download models stored in S3.
The security best practice is to use an <a href="https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/" target="_blank" rel="noopener noreferrer" class="">IAM role for service accounts (IRSA)</a>,
which enables automatic credential rotation and fine-grained access control; see how to <a href="https://kserve.github.io/archive/0.10/modelserving/storage/s3/s3/#create-service-account-with-iam-role" target="_blank" rel="noopener noreferrer" class="">set up IRSA</a>.</li>
<li class="">Support Azure Blobs with <a href="https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-manage-user-assigned-managed-identities?pivots=identity-mi-methods-azcli" target="_blank" rel="noopener noreferrer" class="">managed identity</a>.</li>
</ul>
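<p>An IRSA setup looks roughly like the following sketch. The role ARN, bucket, and resource names are placeholders: a ServiceAccount annotated with the IAM role is referenced from the predictor spec via <code>serviceAccountName</code>, so no static-credentials secret is needed.</p>

```yaml
# Illustrative only -- the role ARN and bucket are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-read-only
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-read-only
---
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-s3"
spec:
  predictor:
    serviceAccountName: s3-read-only  # IRSA: no static credentials secret required
    sklearn:
      storageUri: "s3://example-bucket/model"
```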
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">📊 ModelMesh Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 📊 ModelMesh Updates" title="Direct link to 📊 ModelMesh Updates" translate="no">​</a></h2>
<p>ModelMesh has continued its integration as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports ClusterServingRuntimes, allowing the use of cluster-scoped ServingRuntimes, which were originally introduced in KServe 0.8.</p>
<p>Additionally, ModelMesh introduced support for TorchServe, enabling users to serve arbitrary PyTorch models (e.g. eager-mode models) in the context of distributed multi-model serving.</p>
<p>Other limitations have been addressed as well, such as adding support for BYTES/string type tensors in the REST inference API for requests that require them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">GitHub release pages</a> for KServe v0.10 and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh v0.10</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/sel" target="_blank" rel="noopener noreferrer" class="">Steve Larkin</a></li>
<li class=""><a href="https://github.com/stephanschielke" target="_blank" rel="noopener noreferrer" class="">Stephan Schielke</a></li>
<li class=""><a href="https://github.com/cmaddalozzo" target="_blank" rel="noopener noreferrer" class="">Curtis Maddalozzo</a></li>
<li class=""><a href="https://github.com/laozc" target="_blank" rel="noopener noreferrer" class="">Zhongcheng Lao</a></li>
<li class=""><a href="https://github.com/dimara" target="_blank" rel="noopener noreferrer" class="">Dimitris Aragiorgis</a></li>
<li class=""><a href="https://github.com/panli889" target="_blank" rel="noopener noreferrer" class="">Pan Li</a></li>
<li class=""><a href="https://github.com/tjandy98" target="_blank" rel="noopener noreferrer" class="">tjandy98</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/rachitchauhan43" target="_blank" rel="noopener noreferrer" class="">Rachit Chauhan</a></li>
<li class=""><a href="https://github.com/rafvasq" target="_blank" rel="noopener noreferrer" class="">Rafael Vasquez</a></li>
<li class=""><a href="https://github.com/TimKleinloog" target="_blank" rel="noopener noreferrer" class="">Tim Kleinloog</a></li>
<li class=""><a href="https://github.com/ckadner" target="_blank" rel="noopener noreferrer" class="">Christian Kadner</a></li>
<li class=""><a href="https://github.com/ddelange" target="_blank" rel="noopener noreferrer" class="">ddelange</a></li>
<li class=""><a href="https://github.com/lizzzcai" target="_blank" rel="noopener noreferrer" class="">Lize Cai</a></li>
<li class=""><a href="https://github.com/park12sj" target="_blank" rel="noopener noreferrer" class="">sangjune.park</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkeran</a></li>
<li class=""><a href="https://github.com/MessKon" target="_blank" rel="noopener noreferrer" class="">Konstantinos Messis</a></li>
<li class=""><a href="https://github.com/matty-rose" target="_blank" rel="noopener noreferrer" class="">Matt Rose</a></li>
<li class=""><a href="https://github.com/alexagriffith" target="_blank" rel="noopener noreferrer" class="">Alexa Griffith</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh J</a></li>
<li class=""><a href="https://github.com/alembiewski" target="_blank" rel="noopener noreferrer" class="">Alex Lembiyeuski</a></li>
<li class=""><a href="https://github.com/tenzen-y" target="_blank" rel="noopener noreferrer" class="">Yuki Iwai</a></li>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/xfu83" target="_blank" rel="noopener noreferrer" class="">Xin Fu</a></li>
<li class=""><a href="https://github.com/adilhusain-s" target="_blank" rel="noopener noreferrer" class="">adilhusain-s</a></li>
<li class=""><a href="https://github.com/pranavpandit1" target="_blank" rel="noopener noreferrer" class="">Pranav Pandit</a></li>
<li class=""><a href="https://github.com/C1berwiz" target="_blank" rel="noopener noreferrer" class="">C1berwiz</a></li>
<li class=""><a href="https://github.com/dilverse" target="_blank" rel="noopener noreferrer" class="">dilverse</a></li>
<li class=""><a href="https://github.com/terrytangyuan" target="_blank" rel="noopener noreferrer" class="">Yuan Tang</a></li>
<li class=""><a href="https://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and working group members</p>
<p><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.9.0]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release"/>
        <updated>2022-07-21T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.9 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on July 21, 2022</em></p>
<p>Today, we are pleased to announce the v0.9.0 release of KServe! <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">KServe</a> has now fully onboarded to <a href="https://lfaidata.foundation/" target="_blank" rel="noopener noreferrer" class="">LF AI &amp; Data Foundation</a> as an <a href="https://lfaidata.foundation/projects/kserve" target="_blank" rel="noopener noreferrer" class="">Incubation Project</a>! 🎉</p>
<p>In this release we are excited to introduce the new <code>InferenceGraph</code> feature, which has long been requested by the community. Continuing the effort from the last release to unify the InferenceService API for deploying models on KServe and ModelMesh, ModelMesh is now fully compatible with the KServe InferenceService API!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-introducing-inferencegraph">🚀 Introducing InferenceGraph<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-introducing-inferencegraph" class="hash-link" aria-label="Direct link to 🚀 Introducing InferenceGraph" title="Direct link to 🚀 Introducing InferenceGraph" translate="no">​</a></h2>
<p>ML inference systems are becoming bigger and more complex; they often consist of many models working together to make a single prediction.
Common use cases include image classification and multi-stage natural language processing pipelines. For example, an image classification pipeline may need to run a top-level classification first, then further downstream classification based on the previous prediction results.</p>
<p>KServe is uniquely positioned to build distributed inference graphs thanks to its native integration of InferenceServices, a standard inference protocol for chaining models, and serverless auto-scaling capabilities. KServe leverages these strengths in the InferenceGraph, enabling users to deploy complex ML inference pipelines to production in a declarative and scalable way.</p>
<p>An <strong>InferenceGraph</strong> is made up of a list of routing nodes, each consisting of a set of routing steps. Each step can route either to an InferenceService or to another node defined in the graph, which makes the InferenceGraph highly composable.
The graph router is deployed behind an HTTP endpoint and can be scaled dynamically based on request volume. The InferenceGraph supports four types of routing nodes: <strong>Sequence</strong>, <strong>Switch</strong>, <strong>Ensemble</strong>, and <strong>Splitter</strong>.</p>
<p><img decoding="async" loading="lazy" alt="InferenceGraph" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/inference_graph-c394dbbe6fb6a1ff7f03706f82566247.png" width="1962" height="834" class="img_ev3q"></p>
<ul>
<li class=""><strong>Sequence Node</strong>: It allows users to define multiple <code>Steps</code> with <code>InferenceServices</code> or <code>Nodes</code> as routing targets in a sequence. The <code>Steps</code> are executed in sequence, and the request/response from the previous step can be passed to the next step as input, based on configuration.</li>
<li class=""><strong>Switch Node</strong>: It allows users to define routing conditions and select a <code>Step</code> to execute if it matches the condition. The response is returned as soon as it finds the first step that matches the condition. If no condition is matched, the graph returns the original request.</li>
<li class=""><strong>Ensemble Node</strong>: A model ensemble requires scoring each model separately and then combining the results into a single prediction response. Different combination methods can then be used to produce the final result. Multiple classification trees, for example, are commonly combined using a "majority vote" method, while multiple regression trees are often combined using various averaging techniques.</li>
<li class=""><strong>Splitter Node</strong>: It allows users to split the traffic to multiple targets using a weighted distribution.</li>
</ul>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceService"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"cat-dog-classifier"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key 
atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">pytorch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//kfserving</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">examples/models/torchserve/cat_dog_classification</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token 
key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceService"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"dog-breed-classifier"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">pytorch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//kfserving</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">examples/models/torchserve/dog_breed_classification</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1alpha1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key 
atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceGraph"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"dog-breed-pipeline"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">nodes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">root</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Sequence</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span 
class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cat</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">dog</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">classifier</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cat_dog_classifier </span><span class="token comment" style="color:#999988;font-style:italic"># step name</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> dog</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">breed</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">classifier</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> dog_breed_classifier</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> $request</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">condition</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[@this].#(predictions.0==\"dog\")"</span><br></span></code></pre></div></div>
<p>Currently, <code>InferenceGraph</code> is supported only in the <code>Serverless</code> deployment mode. You can try it out by following the <a href="https://kserve.github.io/archive/0.9/modelserving/inference_graph/image_pipeline/" target="_blank" rel="noopener noreferrer" class="">tutorial</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-inferenceservice-api-for-modelmesh">🔗 InferenceService API for ModelMesh<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-inferenceservice-api-for-modelmesh" class="hash-link" aria-label="Direct link to 🔗 InferenceService API for ModelMesh" title="Direct link to 🔗 InferenceService API for ModelMesh" translate="no">​</a></h2>
<p>The InferenceService CRD is now the primary interface for interacting with ModelMesh. Several changes were made to the InferenceService spec to better accommodate ModelMesh's needs.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-storage-spec">💾 Storage Spec<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-storage-spec" class="hash-link" aria-label="Direct link to 💾 Storage Spec" title="Direct link to 💾 Storage Spec" translate="no">​</a></h3>
<p>To unify how model storage is defined for both single and multi-model serving, a new storage spec was added to the predictor model spec. With this storage spec, users can specify a key inside a common secret that holds configuration and credentials for each of the storage backends from which models can be loaded. Example:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">storage</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> localMinIO </span><span class="token comment" style="color:#999988;font-style:italic"># Credential key for the destination storage in the common secret</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn </span><span class="token comment" style="color:#999988;font-style:italic"># Model path inside the bucket</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># schemaPath: null # Optional schema files for payload schema</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">parameters</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic"># Parameters to override the default 
values inside the common secret.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">bucket</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">models</span><br></span></code></pre></div></div>
<p>Learn more <a href="https://github.com/kserve/kserve/tree/release-0.9/docs/samples/storage/storageSpec" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
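<p>The <code>key</code> above points at an entry in a shared credentials secret. As an illustrative sketch (the secret name <code>storage-config</code> and all field values here are assumptions for this example; see the linked sample for the exact format expected by your installation), such a secret might look like:</p>

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: storage-config        # common secret holding per-backend credentials
stringData:
  localMinIO: |               # credential key referenced by storage.key
    {
      "type": "s3",
      "endpoint_url": "http://minio:9000",
      "access_key_id": "minioadmin",
      "secret_access_key": "minioadmin",
      "default_bucket": "example-models"
    }
```

Each top-level key in the secret corresponds to one storage backend, and <code>storage.parameters</code> can override individual fields such as <code>bucket</code> per InferenceService.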
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-model-status">📊 Model Status<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-model-status" class="hash-link" aria-label="Direct link to 📊 Model Status" title="Direct link to 📊 Model Status" translate="no">​</a></h3>
<p>For further alignment between ModelMesh and KServe, several additions were made to the InferenceService status. There is now a <code>Model Status</code> section, which contains information about the model loaded in the predictor. The new fields include:</p>
<ul>
<li class=""><code>states</code> - State information of the predictor's model.</li>
<li class=""><code>activeModelState</code> - The state of the model currently being served by the predictor's endpoints.</li>
<li class=""><code>targetModelState</code> - This will be set only when <code>transitionStatus</code> is not <code>UpToDate</code>, meaning that the target model differs from the currently-active model.</li>
<li class=""><code>transitionStatus</code> - Indicates state of the predictor relative to its current spec.</li>
<li class=""><code>modelCopies</code> - Model copy information of the predictor's model.</li>
<li class=""><code>lastFailureInfo</code> - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value.</li>
</ul>
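<p>Put together, these fields might surface in an InferenceService status roughly as follows (an illustrative sketch, not verbatim controller output):</p>

```yaml
status:
  modelStatus:
    transitionStatus: UpToDate      # target model matches the active model
    states:
      activeModelState: Loaded      # model currently served by the endpoints
      targetModelState: ""          # only set when transitionStatus is not UpToDate
    modelCopies:
      failedCopies: 0
      totalCopies: 1
```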
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-deploying-on-modelmesh">🚢 Deploying on ModelMesh<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-deploying-on-modelmesh" class="hash-link" aria-label="Direct link to 🚢 Deploying on ModelMesh" title="Direct link to 🚢 Deploying on ModelMesh" translate="no">​</a></h3>
<p>To deploy InferenceServices on ModelMesh, the ModelMesh and KServe controllers still require the <code>serving.kserve.io/deploymentMode: ModelMesh</code> annotation.
A complete example of an InferenceService using the new storage spec is shown below:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">tensorflow</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">mnist</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">annotations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" 
style="color:#00a4db">serving.kserve.io/deploymentMode</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ModelMesh</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> tensorflow</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storage</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 
localMinIO</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> tensorflow/mnist.savedmodel</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-other-new-features">🛠️ Other New Features<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#%EF%B8%8F-other-new-features" class="hash-link" aria-label="Direct link to 🛠️ Other New Features" title="Direct link to 🛠️ Other New Features" translate="no">​</a></h2>
<ul>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/modelserving/v1beta1/mlflow/v2/" target="_blank" rel="noopener noreferrer" class="">serving MLFlow model format</a> via MLServer serving runtime.</li>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/modelserving/autoscaling/autoscaling/" target="_blank" rel="noopener noreferrer" class="">unified autoscaling target and metric fields</a> for InferenceService components with both Serverless and RawDeployment mode.</li>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/admin/kubernetes_deployment/" target="_blank" rel="noopener noreferrer" class="">InferenceService ingress class and url domain template configuration</a> for RawDeployment mode.</li>
<li class="">ModelMesh now has a default <a href="https://github.com/openvinotoolkit/model_server" target="_blank" rel="noopener noreferrer" class="">OpenVINO Model Server</a> ServingRuntime.</li>
</ul>
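<p>As a sketch of the unified autoscaling fields mentioned above (the service name and values here are illustrative; see the linked documentation for the full set of options), the target and metric can be set directly on a component spec:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: cpu      # metric used to autoscale this component
    scaleTarget: 70       # target value for the chosen metric
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```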
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h2>
<ul>
<li class="">The KServe controller manager is changed from StatefulSet to Deployment to support HA mode.</li>
<li class="">log4j security vulnerability fix</li>
<li class="">Upgrade TorchServe serving runtime to 0.6.0</li>
<li class="">Update MLServer serving runtime to 1.0.0</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes, including all changes, bug fixes, and known issues, visit the GitHub release pages for <a href="https://github.com/kserve/kserve/releases/tag/v0.9.0" target="_blank" rel="noopener noreferrer" class="">KServe</a> and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.9.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and working group members</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>LF AI &amp; Data Foundation</strong>: For supporting KServe's journey as an incubation project</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.8]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release"/>
        <updated>2022-02-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.8 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on February 18, 2022</em></p>
<p>Today, we are pleased to announce the v0.8.0 release of KServe! While the last release was focused on the <a href="https://blog.kubeflow.org/release/official/2021/09/27/kfserving-transition.html" target="_blank" rel="noopener noreferrer" class="">transition</a> of KFServing to KServe, this release was focused on unifying the InferenceService API for deploying models on KServe and ModelMesh.</p>
<blockquote>
<p><strong>Note</strong>: For current users of KFServing/KServe, please take a few minutes to answer this <a href="https://groups.google.com/g/kubeflow-discuss/c/B0trz3qZiJE" target="_blank" rel="noopener noreferrer" class="">short survey</a> and provide your feedback!</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class=""><strong>ONNX Runtime Server</strong> has been removed from the supported serving runtime list. KServe by default now uses the <strong>Triton Inference Server</strong> to serve ONNX models.</li>
<li class="">KServe's <strong>PyTorchServer</strong> has been removed from the supported serving runtime list. KServe by default now uses <strong>TorchServe</strong> to serve PyTorch models.</li>
<li class="">A few main KServe SDK class names have been changed:<!-- -->
<ul>
<li class=""><strong>KFModel</strong> is renamed to <strong>Model</strong></li>
<li class=""><strong>KFServer</strong> is renamed to <strong>ModelServer</strong></li>
<li class=""><strong>KFModelRepository</strong> is renamed to <strong>ModelRepository</strong></li>
</ul>
</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">🚀 What's New<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-whats-new" class="hash-link" aria-label="Direct link to 🚀 What's New" title="Direct link to 🚀 What's New" translate="no">​</a></h2>
<p>Some notable updates are:</p>
<ul>
<li class=""><strong>ClusterServingRuntime</strong> and <strong>ServingRuntime</strong> CRDs are introduced. Learn more <a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-servingruntimes-and-clusterservingruntimes" class="">below</a>.</li>
<li class="">A new <strong>Model Spec</strong> was introduced to the InferenceService Predictor Spec as a new way to specify models. Learn more <a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-updated-inferenceservice-predictor-spec" class="">below</a>.</li>
<li class=""><strong>Knative 1.0</strong> is now supported and certified for the KServe Serverless installation.</li>
<li class=""><strong>gRPC</strong> is now supported for transformer to predictor network communication.</li>
<li class=""><strong>TorchServe</strong> Serving runtime has been updated to 0.5.2 which now supports the KServe V2 REST protocol.</li>
<li class=""><strong>ModelMesh</strong> now has multi-namespace support, and users can now deploy GCS or HTTP(S) hosted models.</li>
</ul>
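<p>As a brief sketch of the new Model Spec mentioned above (the service name and storage URI here are illustrative), a predictor can now declare a model format instead of a framework-specific field, and KServe matches the format against the available serving runtimes:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn   # matched against supportedModelFormats of the runtimes
      storageUri: s3://example-bucket/sklearn/model
```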
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-servingruntimes-and-clusterservingruntimes">🔧 ServingRuntimes and ClusterServingRuntimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-servingruntimes-and-clusterservingruntimes" class="hash-link" aria-label="Direct link to 🔧 ServingRuntimes and ClusterServingRuntimes" title="Direct link to 🔧 ServingRuntimes and ClusterServingRuntimes" translate="no">​</a></h2>
<p>This release introduces two new CRDs, <em>ServingRuntimes</em> and <em>ClusterServingRuntimes</em>; the only difference between the two is that one is namespace-scoped and the other is cluster-scoped. A ServingRuntime defines the template for Pods that can serve one or more particular model formats. Each ServingRuntime specifies key information such as the container image of the runtime and a list of the model formats that the runtime supports.</p>
<p>In previous versions of KServe, supported predictor formats and container images were defined in a <a href="https://github.com/kserve/kserve/blob/release-0.7/config/configmap/inferenceservice.yaml#L7" target="_blank" rel="noopener noreferrer" class="">config map</a> in the control plane namespace. The ServingRuntime CRDs allow for improved flexibility and extensibility, letting you define or customize runtimes as you see fit without modifying any controller code or any resources in the controller namespace.</p>
<p>Several out-of-the-box ClusterServingRuntimes are provided with KServe so that users can continue to use KServe as they did before, without having to define the runtimes themselves.</p>
<p><strong>Example SKLearn ClusterServingRuntime:</strong></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ClusterServingRuntime</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearnserver</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">supportedModelFormats</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">version</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">autoSelect</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean important" style="color:#36acaa">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">container</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> kserve/sklearnserver</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">.Name</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_dir=/mnt/models</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">http_port=8080</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" 
style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-updated-inferenceservice-predictor-spec">📋 Updated InferenceService Predictor Spec<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-updated-inferenceservice-predictor-spec" class="hash-link" aria-label="Direct link to 📋 Updated InferenceService Predictor Spec" title="Direct link to 📋 Updated InferenceService Predictor Spec" translate="no">​</a></h2>
<p>A new Model spec was also introduced as part of the Predictor spec for InferenceServices. Previously, the InferenceService CRD had grown unwieldy because each model serving runtime was a separate object in the Predictor spec. This duplicated many fields across the schema and bloated the overall size of the CRD. Adding support for a new model serving framework meant modifying the CRD and, subsequently, the controller code.</p>
<p>Now, with the Model spec, a user specifies a model format and, optionally, a corresponding version. The KServe control plane then automatically selects and uses the <em>ClusterServingRuntime</em> or <em>ServingRuntime</em> that supports the given format. Each <em>ServingRuntime</em> maintains a list of supported model formats and versions; if a format sets <code>autoSelect</code> to <code>true</code>, that <em>ServingRuntime</em> becomes eligible for automatic model placement for that format.</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">New Schema</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Previous Schema</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearn</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">isvc</span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> s3</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//bucket/sklearn/mnist.joblib</span><br></span></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V 
thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearn</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">isvc</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">sklearn</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> s3</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//bucket/sklearn/mnist.joblib</span><br></span></code></pre></div></div></div></div></div>
<p>The previous way of defining predictors is still supported; however, the new approach is the preferred one going forward. Eventually, the previous schema, which uses framework names as keys in the predictor spec, will be removed.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">🌐 ModelMesh Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 🌐 ModelMesh Updates" title="Direct link to 🌐 ModelMesh Updates" translate="no">​</a></h2>
<p><a href="https://developer.ibm.com/blogs/kserve-and-watson-modelmesh-extreme-scale-model-inferencing-for-trusted-ai/" target="_blank" rel="noopener noreferrer" class="">ModelMesh</a> is being integrated as KServe's multi-model serving backend. With the inclusion of the aforementioned ServingRuntime CRDs and the Predictor Model spec, the two projects are now much more closely aligned, with continual improvements underway.</p>
<p>ModelMesh now supports multi-namespace reconciliation. Previously, the ModelMesh controller only reconciled resources deployed in the same namespace as the controller itself. Now, by default, ModelMesh handles InferenceService deployments in any "modelmesh-enabled" namespace. Learn more <a href="https://github.com/kserve/modelmesh-serving/blob/release-0.8/docs/install/install-script.md#setup-additional-namespaces" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
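<p>In practice, a namespace opts in to ModelMesh via a label. The sketch below follows the ModelMesh install docs for this release; the namespace name is a placeholder, and the label should be verified against your installed version:</p>
<pre><code class="language-yaml"># Label a namespace so the ModelMesh controller reconciles InferenceServices in it
apiVersion: v1
kind: Namespace
metadata:
  name: my-models  # placeholder namespace name
  labels:
    modelmesh-enabled: "true"
</code></pre>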
<p>Also, while ModelMesh previously supported only S3-based storage, we are happy to share that it now also works with models hosted on GCS and over HTTP(S).</p>
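<p>Concretely, the same <code>storageUri</code> field now accepts any of these schemes; the bucket and host names below are placeholders:</p>
<pre><code class="language-yaml"># S3 (previously the only supported option)
storageUri: s3://my-bucket/models/mnist
# Google Cloud Storage
storageUri: gs://my-bucket/models/mnist
# HTTP(S), typically pointing at a downloadable model archive
storageUri: https://models.example.com/mnist.zip
</code></pre>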
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>To see all release updates, check out the KServe <a href="https://github.com/kserve/kserve/releases/tag/v0.8.0" target="_blank" rel="noopener noreferrer" class="">release notes</a> and ModelMesh Serving <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.8.0" target="_blank" rel="noopener noreferrer" class="">release notes</a>!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Authors</strong>: Dan Sun, Paul Van Eck, Vedant Padwal, and Andrews Arokiam on behalf of the KServe Working Group</li>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and working group members</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">#kubeflow-kfserving</a>)</li>
<li class="">Attend a <a href="https://docs.google.com/document/d/1KZUURwr9MnHXqHA08TFbfVbM8EAJSJjmaMhnvstvi-k/edit#heading=h.4i9fb8ndp9vp" target="_blank" rel="noopener noreferrer" class="">biweekly community meeting on Wednesday 9am PST</a></li>
<li class="">View our <a href="https://github.com/kserve/website/blob/v0.8/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.8/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc</a> contribution guides to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Paul Van Eck</name>
            <uri>https://github.com/pvaneck</uri>
        </author>
        <author>
            <name>Vedant Padwal</name>
            <uri>https://github.com/js-ts</uri>
        </author>
        <author>
            <name>Andrews Arokiam</name>
            <uri>https://github.com/andyi2it</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.7 - Smooth Transition from KFServing to KServe]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release"/>
        <updated>2021-10-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.7 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on October 11, 2021</em></p>
<p><a class="" href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition">KFServing is now KServe</a>, and the KServe 0.7 release is now available. This release also ensures a smooth migration experience from KFServing to KServe.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class="">The <code>InferenceService</code> API group is changed from <code>serving.kubeflow.org</code> to <code>serving.kserve.io</code> <a href="https://github.com/kserve/kserve/issues/1826" target="_blank" rel="noopener noreferrer" class="">#1826</a>; <a href="https://kserve.github.io/archive/0.7/admin/migration/" target="_blank" rel="noopener noreferrer" class="">a migration job</a> is provided for a smooth transition.</li>
<li class="">Python SDK name is changed from <a href="https://pypi.org/project/kfserving" target="_blank" rel="noopener noreferrer" class="">kfserving</a> to <a href="https://pypi.org/project/kserve" target="_blank" rel="noopener noreferrer" class="">kserve</a>.</li>
<li class="">KServe installation manifests <a href="https://github.com/kserve/kserve/issues/1824" target="_blank" rel="noopener noreferrer" class="">#1824</a>.</li>
<li class="">The models web app has been split out of the kserve repository into its own <a href="https://github.com/kserve/models-web-app" target="_blank" rel="noopener noreferrer" class="">models-web-app</a> repository.</li>
<li class="">Docs and examples have moved to the separate <a href="https://github.com/kserve/website" target="_blank" rel="noopener noreferrer" class="">website</a> repository.</li>
<li class="">KServe images have been migrated to the kserve Docker Hub account.</li>
<li class="">The v1alpha2 API group is deprecated <a href="https://github.com/kserve/kserve/issues/1850" target="_blank" rel="noopener noreferrer" class="">#1850</a>.</li>
</ul>
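<p>For most users, the API group change means existing manifests need only an updated <code>apiVersion</code>; the sketch below uses an illustrative service name and model path:</p>
<pre><code class="language-yaml"># Before: KFServing API group
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    sklearn:
      storageUri: gs://example-bucket/models/sklearn/iris  # placeholder path
---
# After: KServe API group; the rest of the spec is unchanged
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    sklearn:
      storageUri: gs://example-bucket/models/sklearn/iris  # placeholder path
</code></pre>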
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">🚀 What's New<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-whats-new" class="hash-link" aria-label="Direct link to 🚀 What's New" title="Direct link to 🚀 What's New" translate="no">​</a></h2>
<ul>
<li class="">
<p><strong>ModelMesh project is joining KServe</strong> under repository <a href="https://github.com/kserve/modelmesh-serving" target="_blank" rel="noopener noreferrer" class="">modelmesh-serving</a>!</p>
<p>ModelMesh is designed for high-scale, high-density, and frequently-changing model use cases. It intelligently loads and unloads AI models to and from memory to strike a trade-off between responsiveness to users and computational footprint. To learn more about ModelMesh features and components, check out the <a href="https://developer.ibm.com/blogs/kserve-and-watson-modelmesh-extreme-scale-model-inferencing-for-trusted-ai" target="_blank" rel="noopener noreferrer" class="">ModelMesh announcement blog</a> and <a href="https://www.linkedin.com/feed/update/urn:li:activity:6854064203360280576/" target="_blank" rel="noopener noreferrer" class="">join our talk at KubeCon NA for a deeper dive into ModelMesh and KServe</a>.</p>
</li>
<li class="">
<p><strong>(Alpha feature)</strong> Raw Kubernetes deployment support: the Istio/Knative dependency is now optional. Please follow the <a href="https://kserve.github.io/archive/0.7/admin/kubernetes_deployment" target="_blank" rel="noopener noreferrer" class="">guide</a> to install KServe and enable <code>RawDeployment</code> mode.</p>
</li>
<li class="">
<p>KServe now has its own documentation <a href="https://kserve.github.io/website" target="_blank" rel="noopener noreferrer" class="">website</a>, which is temporarily hosted on GitHub Pages.</p>
</li>
<li class="">
<p>Support for v1 CRD and webhook configurations on Kubernetes 1.22 <a href="https://github.com/kserve/kserve/issues/1837" target="_blank" rel="noopener noreferrer" class="">#1837</a>.</p>
</li>
<li class="">
<p>The Triton model serving runtime now defaults to version 21.09 <a href="https://github.com/kserve/kserve/issues/1840" target="_blank" rel="noopener noreferrer" class="">#1840</a>.</p>
</li>
</ul>
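<p>As a sketch of the new <code>RawDeployment</code> mode, the deployment mode can be selected per InferenceService with an annotation; the annotation name follows the KServe docs for this release, so verify it against your installed version (service name and model path are placeholders):</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-raw  # placeholder name
  annotations:
    # Use plain Deployment/Service/HPA instead of Knative serverless resources
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    sklearn:
      storageUri: gs://example-bucket/models/sklearn  # placeholder path
</code></pre>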
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-fixed">🔧 What's Fixed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-whats-fixed" class="hash-link" aria-label="Direct link to 🔧 What's Fixed" title="Direct link to 🔧 What's Fixed" translate="no">​</a></h2>
<ul>
<li class="">Bug fix for Azure Blob Storage <a href="https://github.com/kserve/kserve/issues/1845" target="_blank" rel="noopener noreferrer" class="">#1845</a>.</li>
<li class="">Tar/Zip archive support for all storage options <a href="https://github.com/kserve/kserve/issues/1836" target="_blank" rel="noopener noreferrer" class="">#1836</a>.</li>
<li class="">Fixed the AWS_REGION environment variable and added AWS_CA_BUNDLE support for S3 <a href="https://github.com/kserve/kserve/issues/1780" target="_blank" rel="noopener noreferrer" class="">#1780</a>.</li>
<li class="">TorchServe custom package installation fix <a href="https://github.com/kserve/kserve/issues/1619" target="_blank" rel="noopener noreferrer" class="">#1619</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.7.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/animeshsingh" target="_blank" rel="noopener noreferrer" class="">Animesh Singh</a></li>
<li class=""><a href="https://github.com/chinhuang007" target="_blank" rel="noopener noreferrer" class="">Chin Huang</a></li>
<li class=""><a href="http://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh</a></li>
<li class=""><a href="https://github.com/jinchihe" target="_blank" rel="noopener noreferrer" class="">Jinchi He</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
<li class=""><a href="https://github.com/pvaneck" target="_blank" rel="noopener noreferrer" class="">Paul Van Eck</a></li>
<li class=""><a href="https://github.com/Iamlovingit" target="_blank" rel="noopener noreferrer" class="">Qianshan Chen</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkiran</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/theofpa" target="_blank" rel="noopener noreferrer" class="">Theofilos Papapanagiotou</a></li>
<li class=""><a href="https://github.com/Tomcli" target="_blank" rel="noopener noreferrer" class="">Tommy Li</a></li>
<li class=""><a href="https://github.com/js-ts" target="_blank" rel="noopener noreferrer" class="">Vedant Padwal</a></li>
<li class=""><a href="https://github.com/PatrickXYS" target="_blank" rel="noopener noreferrer" class="">Yao Xiao</a></li>
<li class=""><a href="https://github.com/yuzliu" target="_blank" rel="noopener noreferrer" class="">Yuzhui Liu</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and working group members</p>
<p><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features during this important transition</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the <a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
<li class="">Attend a <a href="https://docs.google.com/document/d/1KZUURwr9MnHXqHA08TFbfVbM8EAJSJjmaMhnvstvi-k/edit#heading=h.4i9fb8ndp9vp" target="_blank" rel="noopener noreferrer" class="">Biweekly community meeting on Wednesday 9am PST</a></li>
<li class="">Follow the <a href="https://github.com/kserve/website/blob/main/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/main/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guides to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community during this important transition!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Animesh Singh</name>
            <uri>https://github.com/animeshsingh</uri>
        </author>
        <author>
            <name>Yuzhui Liu</name>
            <uri>https://github.com/yuzliu</uri>
        </author>
        <author>
            <name>Vedant Padwal</name>
            <uri>https://github.com/js-ts</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[KServe: The next generation of KFServing]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition"/>
        <updated>2021-09-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Announcing the transition from KFServing to KServe]]></summary>
        <content type="html"><![CDATA[<p><em>Published on September 27, 2021</em></p>
<p>We are excited to announce the next chapter for KFServing. In coordination with the Kubeflow Project Steering Group, the <a href="https://github.com/kubeflow/kfserving" target="_blank" rel="noopener noreferrer" class="">KFServing GitHub repository</a> has now been transferred to an independent <a href="https://github.com/kserve/kserve" target="_blank" rel="noopener noreferrer" class="">KServe GitHub organization</a> under the stewardship of the Kubeflow Serving Working Group leads.</p>
<p>The project has been rebranded from <strong>KFServing</strong> to <strong>KServe</strong>, and we are planning to graduate the project from the Kubeflow Project later this year.</p>
<p><img decoding="async" loading="lazy" alt="KFServing to KServe Transition" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/image1-88ae02ce8957a75ad191a74d1a743bfb.png" width="1256" height="730" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-project-background">🎯 Project Background<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-project-background" class="hash-link" aria-label="Direct link to 🎯 Project Background" title="Direct link to 🎯 Project Background" translate="no">​</a></h2>
<p>Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon, KFServing was published as open source in early 2019. The project set out to provide the following features:</p>
<ul>
<li class="">A simple yet powerful Kubernetes Custom Resource for deploying machine learning (ML) models in production across ML frameworks.</li>
<li class="">A performant, standardized inference protocol.</li>
<li class="">Serverless inference that follows live traffic patterns, supporting "scale-to-zero" on both CPUs and GPUs.</li>
<li class="">A complete story for production ML model serving, including prediction, pre/post-processing, explainability, and monitoring.</li>
<li class="">Support for deploying thousands of models at scale, along with inference graph capability for composing multiple models.</li>
</ul>
<p>KFServing was created to address the challenges organizations face when deploying and monitoring machine learning models in production. After publishing the open source project, we saw an explosion in demand for the software, leading to strong adoption and community growth. The scope of the project has since grown, and we have developed multiple components along the way, including a growing body of documentation that needs its own website and an independent GitHub organization.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-next">🚀 What's Next<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-whats-next" class="hash-link" aria-label="Direct link to 🚀 What's Next" title="Direct link to 🚀 What's Next" translate="no">​</a></h2>
<p>Over the coming weeks, we will be releasing <strong>KServe 0.7</strong> outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruption. The KFServing 0.5.x/0.6.x releases will remain supported for six months after the KServe 0.7 release. We are also working on integrating core Kubeflow APIs and standards for <a href="https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc" target="_blank" rel="noopener noreferrer" class="">the conformance program</a>.</p>
<p>For contributors, please follow the KServe <a href="https://github.com/kserve/website/blob/v0.7/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.7/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!</p>
<p><img decoding="async" loading="lazy" alt="KServe Logo" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-b9befb7647f020cdab9eb81b3f627404.png" width="3322" height="1677" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-key-links">🔗 KServe Key Links<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-kserve-key-links" class="hash-link" aria-label="Direct link to 🔗 KServe Key Links" title="Direct link to 🔗 KServe Key Links" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a></li>
<li class=""><a href="https://github.com/kserve/kserve/" target="_blank" rel="noopener noreferrer" class="">Github</a></li>
<li class=""><a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-contributor-acknowledgement">🙏 Contributor Acknowledgement<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-contributor-acknowledgement" class="hash-link" aria-label="Direct link to 🙏 Contributor Acknowledgement" title="Direct link to 🙏 Contributor Acknowledgement" translate="no">​</a></h2>
<p>We'd like to thank all the KServe contributors for this transition work!</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/animeshsingh" target="_blank" rel="noopener noreferrer" class="">Animesh Singh</a></li>
<li class=""><a href="https://github.com/chinhuang007" target="_blank" rel="noopener noreferrer" class="">Chin Huang</a></li>
<li class=""><a href="http://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh</a></li>
<li class=""><a href="https://github.com/jinchihe" target="_blank" rel="noopener noreferrer" class="">Jinchi He</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
<li class=""><a href="https://github.com/pvaneck" target="_blank" rel="noopener noreferrer" class="">Paul Van Eck</a></li>
<li class=""><a href="https://github.com/Iamlovingit" target="_blank" rel="noopener noreferrer" class="">Qianshan Chen</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkiran</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/theofpa" target="_blank" rel="noopener noreferrer" class="">Theofilos Papapanagiotou</a></li>
<li class=""><a href="https://github.com/Tomcli" target="_blank" rel="noopener noreferrer" class="">Tommy Li</a></li>
<li class=""><a href="https://github.com/js-ts" target="_blank" rel="noopener noreferrer" class="">Vedant Padwal</a></li>
<li class=""><a href="https://github.com/PatrickXYS" target="_blank" rel="noopener noreferrer" class="">Yao Xiao</a></li>
<li class=""><a href="https://github.com/yuzliu" target="_blank" rel="noopener noreferrer" class="">Yuzhui Liu</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and Kubeflow Serving Working Group leads</p>
<p><strong>Community</strong>: Everyone who supported this important transition and helped establish KServe as an independent project</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve/kserve/" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the <a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
<li class="">Follow the KServe <a href="https://github.com/kserve/website/blob/v0.7/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.7/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guides to make contributions</li>
</ul>
<p><strong>Welcome to KServe!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of this exciting transition!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Animesh Singh</name>
            <uri>https://github.com/animeshsingh</uri>
        </author>
        <category label="Announcements" term="Announcements"/>
    </entry>
</feed>