<?xml version="1.0" encoding="utf-8"?><?xml-stylesheet type="text/xsl" href="atom.xsl"?>
<feed xmlns="http://www.w3.org/2005/Atom">
    <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog</id>
    <title>KServe Blog</title>
    <updated>2026-03-13T00:00:00.000Z</updated>
    <generator>https://github.com/jpmonette/feed</generator>
    <link rel="alternate" href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog"/>
    <subtitle>KServe Blog</subtitle>
    <icon>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/img/favicon-32x32.png</icon>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.17 - Production-Ready LLM Serving with LLMInferenceService]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release"/>
        <updated>2026-03-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.17 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on March 13, 2026</em></p>
<p>We are excited to announce the release of <strong>KServe v0.17</strong>, a landmark release that brings <strong>LLMInferenceService</strong> to production readiness with a GenAI-first architecture built on the <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer">llm-d</a> framework. This release introduces KV-cache aware intelligent routing, disaggregated prefill-decode, distributed inference with tensor/data/expert parallelism, Envoy AI Gateway integration with token-based rate limiting, and a completely restructured modular Helm chart architecture.</p>
<h2 id="-llminferenceservice-genai-first-architecture">🤖 LLMInferenceService: GenAI-First Architecture</h2>
<p>KServe v0.17 elevates <strong>LLMInferenceService</strong> from an experimental feature to a production-ready CRD purpose-built for generative AI workloads. Built on the <a href="https://github.com/llm-d/llm-d" target="_blank" rel="noopener noreferrer">llm-d</a> framework, LLMInferenceService provides a GenAI-first architecture that goes beyond traditional InferenceService to address the unique challenges of serving large language models at scale.</p>
<p>Unlike InferenceService, which is designed for predictive AI workloads, LLMInferenceService natively supports:</p>
<ul>
<li><strong>Distributed inference</strong> across multiple nodes and GPUs</li>
<li><strong>KV-cache aware scheduling</strong> for intelligent request routing</li>
<li><strong>Disaggregated prefill-decode</strong> for optimal resource utilization</li>
<li><strong>Gateway Inference Extension (GIE)</strong> integration for advanced traffic management</li>
<li><strong>Token-based rate limiting</strong> via Envoy AI Gateway</li>
</ul>
<table><thead><tr><th>Feature</th><th>InferenceService</th><th>LLMInferenceService</th></tr></thead><tbody><tr><td><strong>Primary Use Case</strong></td><td>Predictive AI</td><td>Generative AI</td></tr><tr><td><strong>Routing</strong></td><td>Standard Gateway</td><td>KV-cache aware with EPP</td></tr><tr><td><strong>Parallelism</strong></td><td>Worker Spec</td><td>TP, DP, EP native support</td></tr><tr><td><strong>Prefill-Decode</strong></td><td>N/A</td><td>Disaggregated separation</td></tr><tr><td><strong>Scaling</strong></td><td>HPA/KPA</td><td>WVA + KEDA</td></tr></tbody></table>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-serving
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 3
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      pool: {}
</code></pre>
<p>This creates a full serving stack including the Deployment, Service, Gateway, HTTPRoute, <strong>InferencePool</strong>, <strong>InferenceModel</strong>, and <strong>EPP (Endpoint Picker Pod)</strong> — all managed by the LLMInferenceService controller.</p>
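<p>For reference, the InferencePool that the controller manages resembles the following hand-written equivalent. This is an illustrative sketch using Gateway Inference Extension field names; the resource name, selector labels, and EPP Service name are assumptions for this example, not actual controller output:</p>
<pre><code class="language-yaml"># Sketch of a GIE InferencePool similar to what the controller manages.
# Names and labels below are illustrative assumptions.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama3-serving-pool        # hypothetical name
spec:
  targetPortNumber: 8000           # port the vLLM containers listen on
  selector:
    app: llama3-serving            # assumed pod label
  extensionRef:
    name: llama3-serving-epp       # Service fronting the Endpoint Picker Pod
</code></pre>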
<h2 id="-key-llminferenceservice-features-in-v017">🚀 Key LLMInferenceService Features in v0.17</h2>
<h3 id="-kv-cache-aware-scheduling-with-gateway-inference-extension">🧠 KV-Cache Aware Scheduling with Gateway Inference Extension</h3>
<p>LLMInferenceService integrates with <a href="https://gateway-api-inference-extension.sigs.k8s.io/" target="_blank" rel="noopener noreferrer"><strong>Gateway Inference Extension (GIE) v1.3.0</strong></a>, a Kubernetes SIG project that extends the Gateway API with AI-specific routing capabilities. At the heart of this integration is the <strong>Endpoint Picker Pod (EPP)</strong> from the <a href="https://github.com/llm-d/llm-d-inference-scheduler" target="_blank" rel="noopener noreferrer">llm-d inference scheduler</a>, an intelligent scheduler that routes requests based on real-time KV-cache state rather than simple round-robin or random load balancing.</p>
<p>Traditional load balancing treats all LLM inference requests equally, but in practice, requests with similar prompts benefit enormously from being routed to the same pod — because that pod already has the relevant KV cache blocks loaded. The EPP solves this by tracking real-time KV cache states across all vLLM instances via ZMQ events (<code>BlockStored</code>, <code>BlockRemoved</code>) and building an index mapping <code>{ModelName, BlockHash}</code> → <code>{PodID, DeviceTier}</code>.</p>
<p>The scheduling behavior is configured through <code>EndpointPickerConfig</code>, which defines a plugin pipeline with weighted scorers:</p>
<pre><code class="language-yaml">apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
  - type: single-profile-handler
  - type: prefix-cache-scorer
  - type: load-aware-scorer
    parameters:
      threshold: 100
  - type: max-score-picker
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 2.0
      - pluginRef: load-aware-scorer
        weight: 1.0
      - pluginRef: max-score-picker
</code></pre>
<p>The pipeline uses three types of plugins (see the <a href="https://github.com/llm-d/llm-d-inference-scheduler/blob/main/docs/architecture.md" target="_blank" rel="noopener noreferrer">llm-d scheduler architecture</a> for details):</p>
<ul>
<li><strong>prefix-cache-scorer</strong> (weight: 2.0): Tracks the actual KV cache contents across all vLLM instances and scores pods based on how many cached prefix blocks match the incoming request's prompt. This reduces Time To First Token (TTFT) by avoiding redundant prefill computation for repeated or similar prompts — particularly beneficial for multi-turn conversations and RAG workloads.</li>
<li><strong>load-aware-scorer</strong> (weight: 1.0): Scores candidate pods based on their current queue depth. Pods with empty queues score 0.5, while pods with growing queues score progressively lower toward 0. The <code>threshold</code> parameter controls the sensitivity — when queue depth exceeds the threshold, the pod scores near zero.</li>
<li><strong>max-score-picker</strong>: After all scorers run, selects the pod with the highest weighted aggregate score.</li>
</ul>
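<p>Because the scorers are weighted, the trade-off between cache affinity and load spreading is tunable per profile. As an illustrative fragment (not a recommended default), a deployment dominated by multi-turn or RAG traffic might weight prefix-cache matches more aggressively:</p>
<pre><code class="language-yaml"># Illustrative profile fragment: the 3.0 weight is a hypothetical tuning choice.
schedulingProfiles:
  - name: default
    plugins:
      - pluginRef: prefix-cache-scorer
        weight: 3.0   # favor pods that already hold matching KV blocks
      - pluginRef: load-aware-scorer
        weight: 1.0   # still back off from pods with deep queues
      - pluginRef: max-score-picker
</code></pre>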
<p>The <code>EndpointPickerConfig</code> can be provided inline in the LLMInferenceService spec or referenced from a ConfigMap, giving platform teams the flexibility to standardize scheduling behavior across deployments:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: llama3-with-scheduler
spec:
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
    name: meta-llama--Llama-3.1-8B-Instruct
  replicas: 4
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
  router:
    gateway:
      managed: {}
    route:
      httpRoute: {}
    scheduler:
      config:
        ref:
          name: custom-endpoint-picker-config
          key: endpoint-picker-config.yaml
      pool: {}
</code></pre>
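<p>The referenced ConfigMap simply carries an EndpointPickerConfig document under the named key. A minimal sketch, reusing the ConfigMap name and key from the example above (the plugin list inside is illustrative):</p>
<pre><code class="language-yaml"># Sketch of the ConfigMap referenced by spec.router.scheduler.config.ref.
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-endpoint-picker-config
data:
  endpoint-picker-config.yaml: |
    apiVersion: inference.networking.x-k8s.io/v1alpha1
    kind: EndpointPickerConfig
    plugins:
      - type: single-profile-handler
      - type: prefix-cache-scorer
      - type: max-score-picker
    schedulingProfiles:
      - name: default
        plugins:
          - pluginRef: prefix-cache-scorer
          - pluginRef: max-score-picker
</code></pre>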
<p>The Gateway API Inference Extension (GIE) CRDs (<strong>InferencePool</strong> and <strong>InferenceModel</strong>) are now bundled with the KServe installation, simplifying setup.</p>
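<p>For reference, a minimal <strong>InferencePool</strong> manifest looks like the following. This is an illustrative sketch based on the Gateway API Inference Extension <code>v1alpha2</code> schema; the resource name, selector labels, and endpoint-picker name are placeholders, not values KServe requires:</p>
<pre><code class="language-yaml">apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: llama3-pool          # placeholder name
spec:
  targetPortNumber: 8000     # port the serving pods listen on
  selector:                  # labels matching the model-serving pods
    app: llama3-decode
  extensionRef:
    name: llama3-epp         # endpoint-picker (scheduler) service
</code></pre>
<p>In practice, setting <code>router.scheduler.pool: {}</code> on an LLMInferenceService (as in the examples above) lets the controller create and manage the pool for you, so hand-writing one is typically only needed for bring-your-own-pool setups.</p>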
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-disaggregated-prefill-decode">🔀 Disaggregated Prefill-Decode<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-disaggregated-prefill-decode" class="hash-link" aria-label="Direct link to 🔀 Disaggregated Prefill-Decode" title="Direct link to 🔀 Disaggregated Prefill-Decode" translate="no">​</a></h3>
<p>LLMInferenceService natively supports <strong>disaggregated prefill-decode</strong>, which separates the compute-intensive prefill phase from the memory-intensive decode phase into independent workloads. This allows each phase to be scaled and optimized independently.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha2</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> LLMInferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">prefill</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">decode</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">uri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" 
style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">prefill</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token 
punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">router</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">gateway</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">managed</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">route</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span 
class="token key atrule" style="color:#00a4db">httpRoute</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">scheduler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">pool</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<p>KV-cache blocks are transferred between prefill and decode pods using <strong>NixlConnector</strong> over RDMA transports such as RoCE (RDMA over Converged Ethernet) for high-throughput, low-latency transfers.</p>
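<p>In vLLM terms, the connector is selected through the <code>--kv-transfer-config</code> engine argument. The snippet below is a sketch of what the serving container's arguments might look like; verify the flag name and JSON keys against the vLLM version you deploy, as KServe/llm-d may inject these settings automatically:</p>
<pre><code class="language-yaml">containers:
  - name: vllm
    args:
      # Assumed vLLM flags; the LLMInferenceService controller
      # may configure the KV connector on your behalf.
      - --kv-transfer-config
      - '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
</code></pre>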
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-distributed-inference-tensor-data-and-expert-parallelism">📐 Distributed Inference: Tensor, Data, and Expert Parallelism<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-distributed-inference-tensor-data-and-expert-parallelism" class="hash-link" aria-label="Direct link to 📐 Distributed Inference: Tensor, Data, and Expert Parallelism" title="Direct link to 📐 Distributed Inference: Tensor, Data, and Expert Parallelism" translate="no">​</a></h3>
<p>LLMInferenceService introduces a comprehensive <strong>parallelism specification</strong> for distributed inference across multiple nodes and GPUs using <a href="https://github.com/kubernetes-sigs/lws" target="_blank" rel="noopener noreferrer" class="">LeaderWorkerSet</a>:</p>
<ul>
<li class=""><strong>Tensor Parallelism (TP)</strong>: Shards each layer&#x27;s weight matrices across GPUs within a node</li>
<li class=""><strong>Data Parallelism (DP)</strong>: Runs multiple model replicas for higher throughput</li>
<li class=""><strong>Data-Local Parallelism</strong>: Controls GPUs per node for optimal NUMA affinity</li>
<li class=""><strong>Expert Parallelism (EP)</strong>: Distributes Mixture-of-Experts (MoE) model experts across GPUs</li>
</ul>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha2</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> LLMInferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">multi</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">node</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">uri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">8</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">parallelism</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">tensor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">4</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">8</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">dataLocal</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">4</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">containers</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">worker</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" 
style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">router</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">gateway</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      
</span><span class="token key atrule" style="color:#00a4db">managed</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">route</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">httpRoute</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">scheduler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">pool</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-envoy-ai-gateway-integration-with-token-based-rate-limiting">🌐 Envoy AI Gateway Integration with Token-Based Rate Limiting<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-envoy-ai-gateway-integration-with-token-based-rate-limiting" class="hash-link" aria-label="Direct link to 🌐 Envoy AI Gateway Integration with Token-Based Rate Limiting" title="Direct link to 🌐 Envoy AI Gateway Integration with Token-Based Rate Limiting" translate="no">​</a></h3>
<p>LLMInferenceService integrates with <a href="https://aigateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class=""><strong>Envoy AI Gateway</strong></a> for AI-native traffic management. This enables <strong>token-based rate limiting</strong> — a capability critical for LLM serving where request cost varies dramatically based on input and output token counts rather than simple request counts.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigateway.envoyproxy.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AIGatewayRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">route</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">targetRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.networking.k8s.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> HTTPRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">serving</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">llmRequestCosts</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">metadataKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_input_token</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InputToken</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token 
punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">metadataKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_output_token</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> OutputToken</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">metadataKey</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_total_token</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> TotalToken</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gateway.envoyproxy.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> BackendTrafficPolicy</span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">rate</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">limit</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">targetRefs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">group</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> aigateway.envoyproxy.io</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> AIGatewayRoute</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> llm</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">route</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">rateLimit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Global</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">global</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">rules</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">clientSelectors</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span 
class="token plain">                </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> x</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">user</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">id</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                  </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Distinct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">limit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1000</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">unit</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hour</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cost</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">            </span><span class="token key atrule" style="color:#00a4db">request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">from</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Number</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">number</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">response</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">from</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Metadata</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llm_total_token</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-autoscaling-api-with-wva-support">⚡ Autoscaling API with WVA Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-autoscaling-api-with-wva-support" class="hash-link" aria-label="Direct link to ⚡ Autoscaling API with WVA Support" title="Direct link to ⚡ Autoscaling API with WVA Support" translate="no">​</a></h3>
<p>A new <strong>autoscaling API</strong> has been added to LLMInferenceService with support for the <a href="https://github.com/llm-d/llm-d-workload-variant-autoscaler" target="_blank" rel="noopener noreferrer" class=""><strong>Workload Variant Autoscaler (WVA)</strong></a>, a Kubernetes-based global autoscaler designed specifically for LLM inference workloads. Traditional CPU/memory-based autoscaling is inadequate for LLMs because inference cost is driven by token throughput, KV cache utilization, and queue depth rather than CPU or memory usage.</p>
<p>WVA continuously monitors inference server metrics via Prometheus — specifically <strong>KV cache utilization</strong> and <strong>queue depth</strong> — to determine when servers are approaching saturation. It then computes a <code>wva_desired_replicas</code> metric and emits it to Prometheus, where an actuator backend (<strong>HPA</strong> or <strong>KEDA</strong>) reads it to drive the actual scaling:</p>
<ul>
<li class=""><strong>WVA + KEDA</strong>: Queries Prometheus directly for the <code>wva_desired_replicas</code> metric. Does not require Prometheus Adapter. Supports idle scale-to-zero via <code>idleReplicaCount</code>.</li>
<li class=""><strong>WVA + HPA</strong>: Reads the <code>wva_desired_replicas</code> metric via Kubernetes Metrics API. Requires Prometheus Adapter. Supports standard HPA scaling behaviors.</li>
</ul>
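<p>To make the WVA + KEDA path concrete, the following is a sketch of the kind of KEDA <code>ScaledObject</code> that would consume the <code>wva_desired_replicas</code> metric. The resource names, Prometheus address, and metric label are illustrative assumptions, not the exact objects KServe generates:</p>
<pre><code class="language-yaml"># Illustrative sketch — names, serverAddress, and metric labels are assumptions
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama3-wva-scaler            # hypothetical name
spec:
  scaleTargetRef:
    name: llama3-wva-workload        # hypothetical target Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  idleReplicaCount: 0                # enables idle scale-to-zero
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # placeholder
        # Query the desired-replica count WVA emitted for this variant;
        # the label selector here is an assumption.
        query: wva_desired_replicas{variant="llama3-wva"}
        threshold: "1"
</code></pre>
<p>Because WVA already computes the target replica count, the trigger simply tracks the metric value rather than deriving replicas from raw utilization, which is why no Prometheus Adapter is needed on this path.</p>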
<p>A key concept in WVA is the <strong>variant</strong> — a specific deployment configuration (hardware, runtime, parallelism strategy) for serving a model. The same base model might be served by multiple variants: for example, Llama-3 on A100 GPUs with TP=4 is one variant, while Llama-3 on H100 GPUs with TP=2 is another. The <code>variantCost</code> field specifies the relative cost per replica for each variant, enabling WVA to make <strong>cost-aware scaling decisions</strong> across variants — scaling up the cheaper variant first when demand increases, and scaling down the most expensive variant first when demand decreases.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha2</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> LLMInferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">wva</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">autoscaling</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">uri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3.1</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8B</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">Instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">scaling</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span 
class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">maxReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">10</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">wva</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">variantCost</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"15.0"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">keda</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">pollingInterval</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">30</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" 
style="color:#00a4db">cooldownPeriod</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">300</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">initialCooldownPeriod</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">120</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">idleReplicaCount</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">0</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">fallback</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">failureThreshold</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">3</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">replicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">template</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> vllm</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">router</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">gateway</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">managed</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">route</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">httpRoute</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">scheduler</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key 
atrule" style="color:#00a4db">pool</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">}</span><br></span></code></pre></div></div>
<p>In the example above, <code>variantCost: "15.0"</code> indicates the relative cost of running each replica of this variant. If another variant of the same model has <code>variantCost: "5.0"</code>, WVA would prefer to add capacity on that cheaper variant before scaling up this one. The default value is <code>"10.0"</code> if not specified. When using the KEDA backend, the <code>fallback</code> field ensures the deployment maintains a minimum replica count (here, 2 replicas) even if the metrics pipeline fails — a critical safety net for production LLM deployments.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-scheduler-high-availability">🔧 Scheduler High Availability<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-scheduler-high-availability" class="hash-link" aria-label="Direct link to 🔧 Scheduler High Availability" title="Direct link to 🔧 Scheduler High Availability" translate="no">​</a></h3>
<p>The LLMInferenceService scheduler (EPP) now supports <strong>scaling and high availability</strong>, allowing multiple EPP replicas for production deployments that require fault tolerance and higher routing throughput.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-crd-webhook-validation">🛡️ CRD Webhook Validation<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#%EF%B8%8F-crd-webhook-validation" class="hash-link" aria-label="Direct link to 🛡️ CRD Webhook Validation" title="Direct link to 🛡️ CRD Webhook Validation" translate="no">​</a></h3>
<p>LLMInferenceService now includes <strong>CRD webhook validation</strong> with comprehensive E2E tests, providing early feedback on invalid configurations before they reach the controller. This catches errors in parallelism settings, workload specifications, and router configurations at admission time.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-configuration-composition-with-llminferenceserviceconfig">📋 Configuration Composition with LLMInferenceServiceConfig<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-configuration-composition-with-llminferenceserviceconfig" class="hash-link" aria-label="Direct link to 📋 Configuration Composition with LLMInferenceServiceConfig" title="Direct link to 📋 Configuration Composition with LLMInferenceServiceConfig" translate="no">​</a></h3>
<p>LLMInferenceService supports a <strong>configuration composition model</strong> through LLMInferenceServiceConfig, enabling reusable templates that can be shared across multiple LLMInferenceService resources. Configurations are merged in the following order, with later sources overriding earlier ones:</p>
<ol>
<li class=""><strong>Well-Known Configs</strong></li>
<li class=""><strong>Explicit BaseRefs</strong></li>
<li class=""><strong>LLMInferenceService Spec</strong></li>
</ol>
<p>This allows platform teams to define standardized vLLM worker templates, router/scheduler configurations, and resource defaults while giving application teams the ability to override specific settings.</p>
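<p>A minimal sketch of this composition, with illustrative resource names and field values, might look like the following (the exact merge behavior is described above; consult the API reference for authoritative field names):</p>
<pre><code class="language-yaml"># Illustrative sketch — names and values are assumptions
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceServiceConfig
metadata:
  name: platform-vllm-defaults       # template owned by the platform team
spec:
  template:
    spec:
      containers:
        - name: vllm
          resources:
            limits:
              nvidia.com/gpu: "1"
---
apiVersion: serving.kserve.io/v1alpha2
kind: LLMInferenceService
metadata:
  name: my-llm
spec:
  baseRefs:                          # explicit base configs, merged before this spec
    - name: platform-vllm-defaults
  model:
    uri: hf://meta-llama/Llama-3.1-8B-Instruct
</code></pre>
<p>Here the application team's spec inherits the platform GPU defaults and only declares what is specific to its service; any field it sets explicitly takes precedence over the base config.</p>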
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-additional-llminferenceservice-improvements">📦 Additional LLMInferenceService Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-additional-llminferenceservice-improvements" class="hash-link" aria-label="Direct link to 📦 Additional LLMInferenceService Improvements" title="Direct link to 📦 Additional LLMInferenceService Improvements" translate="no">​</a></h3>
<ul>
<li class=""><strong>Label and annotation propagation</strong> to downstream workload resources (<a href="https://github.com/kserve/kserve/pull/5009" target="_blank" rel="noopener noreferrer" class="">#5009</a>)</li>
<li class=""><strong>Prometheus annotation propagation</strong> to workloads for metrics collection (<a href="https://github.com/kserve/kserve/pull/5086" target="_blank" rel="noopener noreferrer" class="">#5086</a>)</li>
<li class=""><strong>Certificate management</strong> with DNS/IP SAN and automatic renewal for self-signed certs (<a href="https://github.com/kserve/kserve/pull/5099" target="_blank" rel="noopener noreferrer" class="">#5099</a>)</li>
<li class=""><strong>Improved CA bundle management</strong> for secure communication (<a href="https://github.com/kserve/kserve/pull/4803" target="_blank" rel="noopener noreferrer" class="">#4803</a>)</li>
<li class=""><strong>Optional storageInitializer</strong> — skip model download when using pre-loaded models (<a href="https://github.com/kserve/kserve/pull/4970" target="_blank" rel="noopener noreferrer" class="">#4970</a>)</li>
<li class=""><strong>InferencePool auto-migration</strong> for seamless upgrades (<a href="https://github.com/kserve/kserve/pull/5007" target="_blank" rel="noopener noreferrer" class="">#5007</a>)</li>
<li class=""><strong>Route-only completions through InferencePool</strong> for chat/completion endpoints (<a href="https://github.com/kserve/kserve/pull/5087" target="_blank" rel="noopener noreferrer" class="">#5087</a>)</li>
<li class=""><strong>Startup probes for vLLM containers</strong> for more reliable health monitoring (<a href="https://github.com/kserve/kserve/pull/5063" target="_blank" rel="noopener noreferrer" class="">#5063</a>)</li>
<li class=""><strong>vLLM arguments migrated to command field</strong> for cleaner configuration (<a href="https://github.com/kserve/kserve/pull/5049" target="_blank" rel="noopener noreferrer" class="">#5049</a>)</li>
<li class=""><strong>Versioned well-known config resolution</strong> for stable config management (<a href="https://github.com/kserve/kserve/pull/5096" target="_blank" rel="noopener noreferrer" class="">#5096</a>)</li>
<li class=""><strong>Scheduler config via ConfigMap or inline</strong> for flexible configuration (<a href="https://github.com/kserve/kserve/pull/4856" target="_blank" rel="noopener noreferrer" class="">#4856</a>)</li>
<li class=""><strong>Pod init container failure monitoring</strong> for better observability (<a href="https://github.com/kserve/kserve/pull/5034" target="_blank" rel="noopener noreferrer" class="">#5034</a>)</li>
<li class=""><strong>Preserve externally managed replicas</strong> during reconciliation (<a href="https://github.com/kserve/kserve/pull/4996" target="_blank" rel="noopener noreferrer" class="">#4996</a>)</li>
<li class=""><strong>Allow stopping LLMInferenceService</strong> gracefully (<a href="https://github.com/kserve/kserve/pull/4839" target="_blank" rel="noopener noreferrer" class="">#4839</a>)</li>
<li class=""><strong>Enhanced Gateway API URL discovery</strong> with listener hostname fallback (<a href="https://github.com/kserve/kserve/pull/5104" target="_blank" rel="noopener noreferrer" class="">#5104</a>, <a href="https://github.com/kserve/kserve/pull/5079" target="_blank" rel="noopener noreferrer" class="">#5079</a>)</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-modular-component-architecture">🏗️ Modular Component Architecture<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#%EF%B8%8F-modular-component-architecture" class="hash-link" aria-label="Direct link to 🏗️ Modular Component Architecture" title="Direct link to 🏗️ Modular Component Architecture" translate="no">​</a></h2>
<p>KServe v0.17 introduces a fundamental architectural shift toward <strong>modular, component-based deployment</strong>. KServe now consists of three independent components:</p>
<ul>
<li class=""><strong>kserve</strong> (core): Manages InferenceService, ServingRuntime, ClusterServingRuntime, InferenceGraph, and TrainedModel CRDs.</li>
<li class=""><strong>llmisvc</strong>: The LLMInferenceService controller for generative AI workloads, managing LLMInferenceService and LLMInferenceServiceConfig CRDs.</li>
<li class=""><strong>localmodel</strong> (optional): The LocalModel controller for efficient model caching with LocalModelCache, LocalModelNode, and LocalModelNodeGroup CRDs.</li>
</ul>
<table><thead><tr><th>Combination</th><th>Use Case</th><th>Components</th></tr></thead><tbody><tr><td><strong>KServe Only</strong></td><td>Predictive AI</td><td>kserve</td></tr><tr><td><strong>KServe + LLMIsvc</strong></td><td>Predictive AI + Generative AI</td><td>kserve + llmisvc</td></tr><tr><td><strong>Full Stack</strong></td><td>Predictive AI + Generative AI + Model Caching</td><td>kserve + llmisvc + localmodel</td></tr></tbody></table>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="helm-chart-restructuring">Helm Chart Restructuring<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#helm-chart-restructuring" class="hash-link" aria-label="Direct link to Helm Chart Restructuring" title="Direct link to Helm Chart Restructuring" translate="no">​</a></h3>
<p>To support the new component architecture, the <strong>Helm charts have been completely restructured</strong> from a single chart into <strong>10 independent Helm charts</strong>:</p>
<p><strong>CRD Charts</strong> (6 charts with full and minimal variants):</p>
<ul>
<li class=""><code>kserve-crd</code> / <code>kserve-crd-minimal</code></li>
<li class=""><code>kserve-llmisvc-crd</code> / <code>kserve-llmisvc-crd-minimal</code></li>
<li class=""><code>kserve-localmodel-crd</code> / <code>kserve-localmodel-crd-minimal</code></li>
</ul>
<p><strong>Resource Charts</strong> (4 charts):</p>
<ul>
<li class=""><code>kserve-resources</code> (renamed from <code>kserve</code>)</li>
<li class=""><code>kserve-llmisvc-resources</code> (new)</li>
<li class=""><code>kserve-localmodel-resources</code> (new)</li>
<li class=""><code>kserve-runtime-configs</code> (new — manages ClusterServingRuntimes and LLMIsvcConfigs)</li>
</ul>
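<p>As an illustration, a fresh "KServe + LLMIsvc" installation under the new chart layout could be assembled from the individual charts roughly as follows. The OCI registry path and version are assumptions; always use the commands from the official installation guide:</p>
<pre><code class="language-bash"># Illustrative only — registry path and version are assumptions
helm install kserve-crd oci://ghcr.io/kserve/charts/kserve-crd --version v0.17.0
helm install kserve oci://ghcr.io/kserve/charts/kserve-resources \
  --version v0.17.0 -n kserve --create-namespace
helm install kserve-llmisvc-crd oci://ghcr.io/kserve/charts/kserve-llmisvc-crd --version v0.17.0
helm install kserve-llmisvc oci://ghcr.io/kserve/charts/kserve-llmisvc-resources \
  --version v0.17.0 -n kserve
</code></pre>
<p>Installing each component as its own release is what makes the combinations in the table above possible: clusters that only serve predictive models simply omit the llmisvc and localmodel charts.</p>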
<div class="theme-admonition theme-admonition-warning admonition_xJq3 alert alert--warning"><div class="admonitionHeading_Gvgb"><span class="admonitionIcon_Rf37"><svg viewBox="0 0 16 16"><path fill-rule="evenodd" d="M8.893 1.5c-.183-.31-.52-.5-.887-.5s-.703.19-.886.5L.138 13.499a.98.98 0 0 0 0 1.001c.193.31.53.501.886.501h13.964c.367 0 .704-.19.877-.5a1.03 1.03 0 0 0 .01-1.002L8.893 1.5zm.133 11.497H6.987v-2.003h2.039v2.003zm0-3.004H6.987V5.987h2.039v4.006z"></path></svg></span>warning</div><div class="admonitionContent_BuS1"><p>This is a <strong>breaking change</strong>. Users upgrading from v0.16 <strong>cannot</strong> use a simple <code>helm upgrade</code> command. Please follow the detailed <a href="https://kserve.github.io/website/docs/install/upgrade-guide" target="_blank" rel="noopener noreferrer" class="">upgrade guide</a> for step-by-step migration instructions. We strongly recommend testing the upgrade in a non-production environment first.</p></div></div>
<p>For fresh installations, the new Kustomize component-based architecture also provides composable deployment options via standalone overlays, addon overlays, and all-in-one overlays. See the <a href="https://kserve.github.io/website/docs/install/overview" target="_blank" rel="noopener noreferrer" class="">installation concepts</a> for details.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-inferenceservice-and-platform-improvements">🔧 InferenceService and Platform Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-inferenceservice-and-platform-improvements" class="hash-link" aria-label="Direct link to 🔧 InferenceService and Platform Improvements" title="Direct link to 🔧 InferenceService and Platform Improvements" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="storage-performance">Storage Performance<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#storage-performance" class="hash-link" aria-label="Direct link to Storage Performance" title="Direct link to Storage Performance" translate="no">​</a></h3>
<ul>
<li class=""><strong>Parallelized blob downloads</strong> from Azure and S3 for faster model loading (<a href="https://github.com/kserve/kserve/pull/4709" target="_blank" rel="noopener noreferrer" class="">#4709</a>, <a href="https://github.com/kserve/kserve/pull/4714" target="_blank" rel="noopener noreferrer" class="">#4714</a>)</li>
<li class=""><strong>Faster parallel S3 downloads</strong> with configurable file selection (<a href="https://github.com/kserve/kserve/pull/5102" target="_blank" rel="noopener noreferrer" class="">#5102</a>, <a href="https://github.com/kserve/kserve/pull/5119" target="_blank" rel="noopener noreferrer" class="">#5119</a>)</li>
<li class=""><strong>Git repository support</strong> for downloading models directly from Git repos via HTTPS (<a href="https://github.com/kserve/kserve/pull/4966" target="_blank" rel="noopener noreferrer" class="">#4966</a>)</li>
</ul>
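<p>The parallel-download pattern behind these changes can be sketched with a thread pool. This is illustrative only; the function names and structure are assumptions, not KServe's actual storage-initializer code:</p>

```python
# Illustrative sketch of parallelized blob downloads: fetch multiple
# model files concurrently instead of one at a time. Not KServe's
# actual storage-initializer implementation.
from concurrent.futures import ThreadPoolExecutor

def download_blob(key: str) -> str:
    # Placeholder for a real S3/Azure GET; here we just echo the key.
    return f"downloaded:{key}"

def download_model(keys, max_workers: int = 8):
    # Running downloads concurrently hides per-object latency, which is
    # where most of the speedup for many small model shards comes from.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_blob, keys))

print(download_model(["config.json", "model-00001.safetensors"]))
```

<p>In practice the worker count and the set of files to fetch would come from configuration, which is what the configurable file selection in the PRs above exposes.</p>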
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="new-serving-runtimes">New Serving Runtimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#new-serving-runtimes" class="hash-link" aria-label="Direct link to New Serving Runtimes" title="Direct link to New Serving Runtimes" translate="no">​</a></h3>
<ul>
<li class=""><strong>OpenVINO Model Server</strong> — Intel's optimized inference runtime for high-performance serving on Intel hardware (<a href="https://github.com/kserve/kserve/pull/4592" target="_blank" rel="noopener noreferrer" class="">#4592</a>)</li>
<li class=""><strong>PredictiveServer</strong> runtime with full build/publish infrastructure and E2E testing (<a href="https://github.com/kserve/kserve/pull/4954" target="_blank" rel="noopener noreferrer" class="">#4954</a>)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="gateway--routing">Gateway &amp; Routing<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#gateway--routing" class="hash-link" aria-label="Direct link to Gateway &amp; Routing" title="Direct link to Gateway &amp; Routing" translate="no">​</a></h3>
<ul>
<li class=""><strong>Gateway API upgraded to v1.4.0</strong> (<a href="https://github.com/kserve/kserve/pull/5038" target="_blank" rel="noopener noreferrer" class="">#5038</a>)</li>
<li class=""><strong>PathTemplate configuration</strong> for flexible inference service routing (<a href="https://github.com/kserve/kserve/pull/4817" target="_blank" rel="noopener noreferrer" class="">#4817</a>)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="vllm-backend">vLLM Backend<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#vllm-backend" class="hash-link" aria-label="Direct link to vLLM Backend" title="Direct link to vLLM Backend" translate="no">​</a></h3>
<ul>
<li class=""><strong>Upgraded to vLLM v0.15.1</strong> with performance improvements (<a href="https://github.com/kserve/kserve/pull/5098" target="_blank" rel="noopener noreferrer" class="">#5098</a>)</li>
<li class=""><strong>Removed Python 3.9 support</strong> (<a href="https://github.com/kserve/kserve/pull/4851" target="_blank" rel="noopener noreferrer" class="">#4851</a>)</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="additional-enhancements">Additional Enhancements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#additional-enhancements" class="hash-link" aria-label="Direct link to Additional Enhancements" title="Direct link to Additional Enhancements" translate="no">​</a></h3>
<ul>
<li class=""><strong>CSV and Parquet marshallers</strong> for expanded data format support (<a href="https://github.com/kserve/kserve/pull/5115" target="_blank" rel="noopener noreferrer" class="">#5115</a>)</li>
<li class=""><strong>Event loop configuration</strong> with new <code>--event_loop</code> flag supporting <code>auto</code>, <code>asyncio</code>, and <code>uvloop</code> (<a href="https://github.com/kserve/kserve/pull/4971" target="_blank" rel="noopener noreferrer" class="">#4971</a>)</li>
<li class=""><strong>Annotation-based runtime defaults</strong> for MLServer (<a href="https://github.com/kserve/kserve/pull/5064" target="_blank" rel="noopener noreferrer" class="">#5064</a>)</li>
<li class=""><strong><code>INFERENCE_SERVICE_NAME</code> environment variable</strong> exposed to serving containers (<a href="https://github.com/kserve/kserve/pull/5013" target="_blank" rel="noopener noreferrer" class="">#5013</a>)</li>
<li class=""><strong>Failure condition surfacing</strong> in InferenceService status (<a href="https://github.com/kserve/kserve/pull/5114" target="_blank" rel="noopener noreferrer" class="">#5114</a>)</li>
<li class=""><strong>Inference log batching</strong> with external marshalling support (<a href="https://github.com/kserve/kserve/pull/5061" target="_blank" rel="noopener noreferrer" class="">#5061</a>)</li>
</ul>
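<p>The <code>--event_loop</code> flag mentioned above might be implemented along these lines. This is a sketch; the helper name and fallback behavior are assumptions, not KServe's actual code:</p>

```python
# Sketch of "--event_loop auto|asyncio|uvloop" selection. The helper
# name and fallback behavior are illustrative assumptions.
import asyncio

def install_event_loop(policy: str = "auto") -> str:
    if policy not in ("auto", "asyncio", "uvloop"):
        raise ValueError(f"unknown event loop policy: {policy}")
    if policy in ("auto", "uvloop"):
        try:
            import uvloop  # optional dependency
            asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
            return "uvloop"
        except ImportError:
            if policy == "uvloop":
                raise  # explicitly requested, so fail loudly
    # "auto" without uvloop installed, or explicit "asyncio"
    asyncio.set_event_loop_policy(asyncio.DefaultEventLoopPolicy())
    return "asyncio"

print(install_event_loop("asyncio"))  # prints "asyncio"
```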
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="infrastructure-updates">Infrastructure Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#infrastructure-updates" class="hash-link" aria-label="Direct link to Infrastructure Updates" title="Direct link to Infrastructure Updates" translate="no">​</a></h3>
<ul>
<li class="">Kubernetes packages bumped to <strong>v0.34.0</strong></li>
<li class="">Knative Serving updated to <strong>v1.21.1</strong></li>
<li class="">Go updated to <strong>1.25</strong></li>
<li class="">Kubebuilder updated to <strong>1.9.0</strong></li>
<li class="">KEDA bumped from 2.16.1 to <strong>2.17.3</strong></li>
<li class="">MinIO replaced with <strong>SeaweedFS</strong> for testing infrastructure</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-security-fixes">🔒 Security Fixes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-security-fixes" class="hash-link" aria-label="Direct link to 🔒 Security Fixes" title="Direct link to 🔒 Security Fixes" translate="no">​</a></h2>
<p>Multiple security vulnerabilities have been addressed:</p>
<ul>
<li class="">CVE-2025-62727 (Starlette)</li>
<li class="">CVE-2025-22872, CVE-2025-47914, CVE-2025-58181</li>
<li class="">CVE-2024-43598 (LightGBM updated to 4.6.0)</li>
<li class="">CVE-2025-43859 (h11 HTTP parsing)</li>
<li class="">CVE-2025-66418 (decompression chain)</li>
<li class="">CVE-2025-68156 (expr-lang/expr)</li>
<li class="">CVE-2026-26007 (cryptography subgroup attack)</li>
<li class="">CVE-2026-24486 (python-multipart arbitrary file write)</li>
<li class="">Path traversal vulnerabilities in https.go and tar extraction</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For the complete list of all 167 merged pull requests, bug fixes, and known issues, visit the GitHub release pages:</p>
<ul>
<li class=""><a href="https://github.com/kserve/kserve/releases/tag/v0.17.0" target="_blank" rel="noopener noreferrer" class="">v0.17.0</a></li>
<li class=""><a href="https://github.com/kserve/kserve/releases/tag/v0.17.0-rc1" target="_blank" rel="noopener noreferrer" class="">v0.17.0-rc1</a></li>
<li class=""><a href="https://github.com/kserve/kserve/releases/tag/v0.17.0-rc0" target="_blank" rel="noopener noreferrer" class="">v0.17.0-rc0</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We extend our gratitude to all <strong>38+ contributors</strong> who made this release possible, including <strong>21 first-time contributors</strong>. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>New Contributors</strong>: Welcome to all first-time contributors who helped shape this release</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<p>We invite you to explore the new features in KServe v0.17 and contribute to the ongoing development of the project:</p>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>

</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Best of Both Worlds: Cloud-Native AI Inference at Scale using KServe and llm-d]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d"/>
        <updated>2026-03-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Learn how KServe and llm-d combine to deliver a production-ready, Kubernetes-native inference platform with distributed intelligence for generative AI workloads.]]></summary>
        <content type="html"><![CDATA[<p>Enterprises today seek to integrate generative AI (GenAI) capabilities into their applications. However, scaling large AI models introduces complexity: managing high-volume traffic from large language models (LLMs), optimizing inference performance, maintaining predictable latency, and controlling infrastructure costs.</p>
<p>Platform engineering leaders require more than just model deployment capabilities. They need a robust, Kubernetes-native infrastructure that supports:</p>
<ul>
<li class="">Efficient GPU utilization</li>
<li class="">Intelligent request routing</li>
<li class="">Distributed inference patterns</li>
<li class="">Cost-aware autoscaling</li>
<li class="">Production-grade governance</li>
</ul>
<p>This article demonstrates how two open-source solutions, KServe and llm-d, can be combined to address these challenges.</p>
<p>We explore the role of each solution, illustrate their integration architecture, and provide practical guidance for AI platform teams, with deeper focus on KServe's LLMInferenceService, available since KServe v0.16.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kserve-simplified-deployment-of-ai-models-on-kubernetes">KServe: Simplified Deployment of AI Models on Kubernetes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kserve-simplified-deployment-of-ai-models-on-kubernetes" class="hash-link" aria-label="Direct link to KServe: Simplified Deployment of AI Models on Kubernetes" title="Direct link to KServe: Simplified Deployment of AI Models on Kubernetes" translate="no">​</a></h2>
<p>KServe is a Kubernetes-based model serving platform that simplifies deploying and managing ML models, including LLMs, at scale.</p>
<p>For platform engineers, KServe acts as the model serving control plane: the layer responsible for lifecycle, scaling, and operational governance.</p>
<p><img decoding="async" loading="lazy" src="https://kserve.github.io/website/assets/images/kserve_generative_inference-21648e7df404ea6f57b9d3c83e8e0ca4.png" alt="KServe Generative Inference Architecture" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="inference-as-a-service">Inference as a Service<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#inference-as-a-service" class="hash-link" aria-label="Direct link to Inference as a Service" title="Direct link to Inference as a Service" translate="no">​</a></h3>
<p>InferenceService serves as KServe's core abstraction for model deployment, encapsulating the full serving lifecycle, including:</p>
<ul>
<li class="">Automatic deployment creation and reconciliation</li>
<li class="">Request-based autoscaling with scale-to-zero and autoscaling based on custom metrics</li>
<li class="">Revision management and canary rollouts</li>
<li class="">Endpoint exposure and traffic routing</li>
<li class="">Runtime abstraction across serving backends for both predictive and generative AI</li>
<li class="">Optional pre-processing/post-processing, inference pipelines, and ensembles</li>
</ul>
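<p>To make the abstraction concrete, a minimal InferenceService manifest can be built as a plain Python dict. This is a sketch; the storage URI is a placeholder, and in practice you would apply the manifest with kubectl or a Kubernetes client:</p>

```python
# Minimal InferenceService manifest expressed as a Python dict
# (illustrative; the storageUri is a placeholder).
import json

def make_inference_service(name: str, storage_uri: str) -> dict:
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                "model": {
                    "modelFormat": {"name": "sklearn"},
                    "storageUri": storage_uri,
                }
            }
        },
    }

svc = make_inference_service("sklearn-iris", "gs://example-bucket/models/sklearn/model")
print(json.dumps(svc, indent=2))
```

<p>Everything else in the list above, from deployment reconciliation to autoscaling and endpoint exposure, is derived by the controller from this small declarative spec.</p>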
<p>ML engineers provide trained models. Platform engineers retain operational control without writing custom deployment code.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="llminferenceservice-in-kserve">LLMInferenceService in KServe<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#llminferenceservice-in-kserve" class="hash-link" aria-label="Direct link to LLMInferenceService in KServe" title="Direct link to LLMInferenceService in KServe" translate="no">​</a></h3>
<p>KServe v0.16 introduces stronger generative AI capabilities, including LLMInferenceService, designed specifically for large language model workloads.</p>
<p>Unlike traditional stateless predictors, LLM workloads require:</p>
<ul>
<li class="">Long-running streaming responses</li>
<li class="">GPU-heavy memory footprints</li>
<li class="">Prefix KV-cache management</li>
<li class="">High-concurrency token streaming</li>
<li class="">OpenAI-compatible APIs</li>
</ul>
<p>LLMInferenceService shares common foundations with InferenceService but introduces additional capabilities tailored for large language models, described in the sections below.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="unlocking-generative-ai-serving-with-llminferenceservice-from-pod-level-speed-to-cluster-wide-intelligence">Unlocking Generative AI Serving with LLMInferenceService: From Pod-Level Speed to Cluster-Wide Intelligence<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#unlocking-generative-ai-serving-with-llminferenceservice-from-pod-level-speed-to-cluster-wide-intelligence" class="hash-link" aria-label="Direct link to Unlocking Generative AI Serving with LLMInferenceService: From Pod-Level Speed to Cluster-Wide Intelligence" title="Direct link to Unlocking Generative AI Serving with LLMInferenceService: From Pod-Level Speed to Cluster-Wide Intelligence" translate="no">​</a></h3>
<p>Imagine you want to bring the power of generative AI directly into your applications without rewriting your entire stack. LLMInferenceService offers OpenAI-compatible endpoints like <code>/v1/chat/completions</code>, complete with streaming token responses and multi-turn support. With prompt templating built in, developers can integrate seamlessly with existing tools, whether that's the OpenAI SDKs, LangChain, LlamaIndex, Llama Stack, RAG frameworks, or even enterprise GenAI gateways.</p>
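<p>As a sketch of what OpenAI compatibility means in practice, the snippet below builds a <code>/v1/chat/completions</code> request body and parses one streamed server-sent-events (SSE) line. The payload shape follows the OpenAI chat API; the sample data is fabricated for illustration:</p>

```python
# Building an OpenAI-compatible chat request and parsing a streamed
# SSE line. Sample data is fabricated for illustration.
import json

def chat_payload(model: str, prompt: str, stream: bool = True) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def parse_sse_line(line: str):
    # Streaming chunks arrive as lines like: data: {"choices": [...]}
    if not line.startswith("data: ") or line == "data: [DONE]":
        return None
    chunk = json.loads(line[len("data: "):])
    return chunk["choices"][0]["delta"].get("content", "")

sample = 'data: {"choices": [{"delta": {"content": "Hello"}}]}'
print(parse_sse_line(sample))  # -> Hello
```

<p>Because the wire format is the same, existing OpenAI client libraries can simply be pointed at the service's endpoint URL.</p>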
<p>Under the hood, KServe connects to LLM-optimized runtimes such as vLLM, Hugging Face TGI, or other GPU-native backends. These engines bring advanced capabilities like continuous batching, memory-efficient paged attention, and KV-cache reuse, delivering high throughput per GPU.</p>
<p>Yet, while these runtime-level optimizations make each pod lightning fast, true cluster-wide efficiency needs more. That's exactly the role of llm-d: adding an extra layer of intelligence that orchestrates resources and maximizes performance across the entire deployment.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="distributed--multi-node-model-support">Distributed &amp; Multi-Node Model Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#distributed--multi-node-model-support" class="hash-link" aria-label="Direct link to Distributed &amp; Multi-Node Model Support" title="Direct link to Distributed &amp; Multi-Node Model Support" translate="no">​</a></h3>
<p>LLMInferenceService supports advanced parallelism strategies implemented by runtimes, including tensor parallelism, pipeline parallelism, and multi-GPU sharding.</p>
<p>This enables hosting 70B+ parameter models, partitioning models across nodes, and serving models larger than single-GPU memory.</p>
<p>KServe orchestrates the deployment topology, while the runtime manages execution parallelism.</p>
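<p>At its core, tensor parallelism shards each weight matrix across devices so every GPU holds a fraction of the parameters. A minimal sketch of the sharding arithmetic (shapes only, not a runtime implementation):</p>

```python
# Minimal sketch of tensor-parallel sharding: split a weight matrix's
# columns evenly across devices. Illustrative arithmetic only.
def shard_columns(n_cols: int, n_devices: int):
    if n_cols % n_devices:
        raise ValueError("columns must divide evenly across devices")
    per = n_cols // n_devices
    # Each device owns a contiguous [start, end) column range.
    return [(d * per, (d + 1) * per) for d in range(n_devices)]

# E.g. an 8192-wide layer split over 4 GPUs -> 2048 columns each.
print(shard_columns(8192, 4))
```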
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="advanced-autoscaling--networking-including-scale-to-zero">Advanced Autoscaling &amp; Networking (Including Scale-to-Zero)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#advanced-autoscaling--networking-including-scale-to-zero" class="hash-link" aria-label="Direct link to Advanced Autoscaling &amp; Networking (Including Scale-to-Zero)" title="Direct link to Advanced Autoscaling &amp; Networking (Including Scale-to-Zero)" translate="no">​</a></h3>
<p>KServe integrates deeply with Kubernetes to support request- and concurrency-based autoscaling via Knative, GPU-backed scaling, and scale-to-zero for cost control.</p>
<p>It also integrates with the Kubernetes Gateway API for TLS termination, traffic splitting, and advanced routing.</p>
<p>This makes it suitable for development environments, internal copilots, and large-scale production workloads.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kubernetes-gateway-api-integration">Kubernetes Gateway API Integration<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kubernetes-gateway-api-integration" class="hash-link" aria-label="Direct link to Kubernetes Gateway API Integration" title="Direct link to Kubernetes Gateway API Integration" translate="no">​</a></h3>
<p>KServe integrates with Kubernetes Gateway API for:</p>
<ul>
<li class="">Enterprise-grade routing</li>
<li class="">TLS termination</li>
<li class="">Traffic splitting</li>
<li class="">Multi-model routing</li>
</ul>
<p>This enables integration with modern Kubernetes networking stacks.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="where-kserve-alone-is-not-enough">Where KServe Alone Is Not Enough<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#where-kserve-alone-is-not-enough" class="hash-link" aria-label="Direct link to Where KServe Alone Is Not Enough" title="Direct link to Where KServe Alone Is Not Enough" translate="no">​</a></h3>
<p>Even with LLMInferenceService and optimized runtimes, KServe does not inherently:</p>
<ul>
<li class="">Route requests based on KV-cache locality across replicas</li>
<li class="">Separate prefill and decode cluster-wide</li>
<li class="">Perform SLA-aware routing decisions</li>
<li class="">Optimize GPU utilization across multiple pods</li>
</ul>
<p>To address these, we introduce llm-d.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="llm-d-distributed-intelligence-for-llm-inference">llm-d: Distributed Intelligence for LLM Inference<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#llm-d-distributed-intelligence-for-llm-inference" class="hash-link" aria-label="Direct link to llm-d: Distributed Intelligence for LLM Inference" title="Direct link to llm-d: Distributed Intelligence for LLM Inference" translate="no">​</a></h2>
<p>llm-d is a Kubernetes-native distributed inference framework designed to enhance the performance and efficiency of LLM workloads.</p>
<p>If KServe is the control plane for models, llm-d is the distributed intelligence scheduling layer.</p>
<p><img decoding="async" loading="lazy" src="https://github.com/llm-d/llm-d/raw/main/docs/assets/images/llm-d-arch.svg" alt="llm-d Architecture" class="img_ev3q"></p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="kv-cache-aware-scheduling-and-disaggregated-inference-with-llm-d">KV-Cache Aware Scheduling and Disaggregated Inference with llm-d<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kv-cache-aware-scheduling-and-disaggregated-inference-with-llm-d" class="hash-link" aria-label="Direct link to KV-Cache Aware Scheduling and Disaggregated Inference with llm-d" title="Direct link to KV-Cache Aware Scheduling and Disaggregated Inference with llm-d" translate="no">​</a></h3>
<p>As LLM deployments mature, scaling is no longer just about adding GPUs. It's about using them intelligently. Modern runtimes such as vLLM introduced prefix (KV) caching to reduce redundant computation, but without smart scheduling, much of that benefit is lost.</p>
<p>This is where llm-d changes the game.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="disaggregated-inference-prefill--decode-separation">Disaggregated Inference (Prefill / Decode Separation)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#disaggregated-inference-prefill--decode-separation" class="hash-link" aria-label="Direct link to Disaggregated Inference (Prefill / Decode Separation)" title="Direct link to Disaggregated Inference (Prefill / Decode Separation)" translate="no">​</a></h3>
<p>LLM inference consists of two distinct phases: prefill and decode. The prefill phase is compute-heavy, processing the full prompt and building the model's attention context. The decode phase is latency-sensitive, generating tokens step by step where responsiveness directly impacts user experience.</p>
<p>llm-d separates these phases across different GPU groups, assigning compute-optimized resources to prefill and latency-optimized resources to decode. With intelligent scheduling between them, workloads are aligned to the right hardware profile.</p>
<p>This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token by eliminating resource contention between fundamentally different workloads.</p>
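<p>A toy model of the phase split: a request first passes through a prefill pool, then is handed off, along with its KV cache, to a decode pool. The record shape and phase functions below are illustrative, not llm-d's implementation:</p>

```python
# Toy model of prefill/decode disaggregation: prefill workers build the
# KV cache for the prompt; decode workers then generate tokens from it.
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    kv_cache: list = field(default_factory=list)
    tokens: list = field(default_factory=list)

def prefill(req: Request) -> Request:
    # Compute-heavy phase: process the whole prompt at once.
    req.kv_cache = req.prompt.split()  # stand-in for real attention state
    return req

def decode(req: Request, n_tokens: int) -> Request:
    # Latency-sensitive phase: generate tokens one step at a time.
    for i in range(n_tokens):
        req.tokens.append(f"tok{i}")
    return req

req = decode(prefill(Request("explain kv cache reuse")), n_tokens=3)
print(req.tokens)  # -> ['tok0', 'tok1', 'tok2']
```

<p>The point of the split is that the two functions above can run on differently provisioned GPU pools, with the KV cache transferred between them.</p>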
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="intelligent-inference-scheduler">Intelligent Inference Scheduler<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#intelligent-inference-scheduler" class="hash-link" aria-label="Direct link to Intelligent Inference Scheduler" title="Direct link to Intelligent Inference Scheduler" translate="no">​</a></h3>
<p>llm-d's inference scheduler evaluates the following signals:</p>
<ul>
<li class="">GPU utilization</li>
<li class="">Queue depth</li>
<li class="">Cache residency</li>
<li class="">SLA constraints</li>
<li class="">Load distribution</li>
</ul>
<p>The scheduler uses these signals to decrease serving latency and increase throughput, combining prefix-cache-aware routing, utilization-based load balancing, fairness and prioritization for multi-tenant serving, and predicted-latency balancing.</p>
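<p>The scheduling signals listed above can be folded into a per-replica score, as in this toy sketch. The weights, field names, and scoring formula are illustrative assumptions, not llm-d's actual scorer:</p>

```python
# Toy replica scorer: prefer replicas that already hold the request's
# prefix in cache, then low queue depth and low GPU utilization.
# Weights and field names are illustrative, not llm-d's implementation.
def score(replica: dict, prefix_hash: str) -> float:
    cache_hit = 1.0 if prefix_hash in replica["resident_prefixes"] else 0.0
    return 5.0 * cache_hit - 1.0 * replica["queue_depth"] - 2.0 * replica["gpu_util"]

def pick_replica(replicas: dict, prefix_hash: str) -> str:
    return max(replicas, key=lambda name: score(replicas[name], prefix_hash))

replicas = {
    "pod-a": {"resident_prefixes": {"p1"}, "queue_depth": 2, "gpu_util": 0.9},
    "pod-b": {"resident_prefixes": set(), "queue_depth": 0, "gpu_util": 0.3},
}
# "p1" is cached on pod-a: 5.0 - 2 - 1.8 = 1.2 beats pod-b's -0.6,
# so the cache hit outweighs pod-a's heavier load.
print(pick_replica(replicas, "p1"))  # -> pod-a
```

<p>With an uncached prefix the same scorer sends traffic to the idler replica instead, which is exactly the cache-locality-versus-load trade-off the real scheduler negotiates.</p>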
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="kserve-llminferenceservice-and-llm-d">KServe LLMInferenceService and llm-d<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#kserve-llminferenceservice-and-llm-d" class="hash-link" aria-label="Direct link to KServe LLMInferenceService and llm-d" title="Direct link to KServe LLMInferenceService and llm-d" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="responsibility-separation">Responsibility Separation<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#responsibility-separation" class="hash-link" aria-label="Direct link to Responsibility Separation" title="Direct link to Responsibility Separation" translate="no">​</a></h3>
<p>This layered design ensures composability and specialization, providing a complete, production-ready solution for generative AI. KServe acts as the control plane and LLMInferenceService delivers the generative API abstraction, while llm-d provides the cluster-wide optimization.</p>
<table><thead><tr><th>Layer</th><th>Responsibility</th></tr></thead><tbody><tr><td>KServe</td><td>Model lifecycle, scaling, governance</td></tr><tr><td>LLMInferenceService</td><td>Generative API abstraction</td></tr><tr><td>vLLM</td><td>Efficient execution inside runtime</td></tr><tr><td>llm-d</td><td>Cross-runtime routing &amp; cache awareness</td></tr><tr><td>Kubernetes</td><td>Resource orchestration</td></tr></tbody></table>
<p>Together, KServe and llm-d enable a production-ready, Kubernetes-native inference platform that balances scalability, performance, and cost efficiency, providing the best of both worlds for cloud-native AI inference at scale.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="cost-efficiency-comparison-naive-vs-optimized">Cost Efficiency Comparison: Naive vs Optimized<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#cost-efficiency-comparison-naive-vs-optimized" class="hash-link" aria-label="Direct link to Cost Efficiency Comparison: Naive vs Optimized" title="Direct link to Cost Efficiency Comparison: Naive vs Optimized" translate="no">​</a></h2>
<p>Serving LLMs at scale is no longer just a model problem. It is a distributed systems problem, where naive load balancing leads to significant inefficiencies and wasted resources:</p>
<p><strong>Problems with naive load balancing:</strong></p>
<ul>
<li class="">Cache locality loss</li>
<li class="">GPU imbalance</li>
<li class="">Redundant prefill processing</li>
<li class="">High tail latency</li>
<li class="">Overprovisioned GPUs</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="optimized-architecture-with-kserve--llm-d">Optimized Architecture with KServe + llm-d<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#optimized-architecture-with-kserve--llm-d" class="hash-link" aria-label="Direct link to Optimized Architecture with KServe + llm-d" title="Direct link to Optimized Architecture with KServe + llm-d" translate="no">​</a></h3>
<p>The combined KServe and llm-d solution introduces distributed intelligence to solve the problems of naive architectures, delivering superior performance, scalability, and cost control. The optimized architecture is pluggable and extensible, integrating well with many AI and cloud-native technologies.</p>
<p><img decoding="async" loading="lazy" src="https://kserve.github.io/website/img/kserve-layer.png" alt="KServe Layered Architecture" class="img_ev3q"></p>
<p><strong>Benefits:</strong></p>
<ul>
<li class="">Cache reuse preserved</li>
<li class="">Balanced GPU utilization</li>
<li class="">Reduced recomputation</li>
<li class="">Lower cost per token</li>
<li class="">Controlled autoscaling via LLMInferenceService</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="benchmark-results-why-cluster-level-intelligence-matters">Benchmark Results: Why Cluster-Level Intelligence Matters<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/cloud-native-ai-inference-kserve-llm-d#benchmark-results-why-cluster-level-intelligence-matters" class="hash-link" aria-label="Direct link to Benchmark Results: Why Cluster-Level Intelligence Matters" title="Direct link to Benchmark Results: Why Cluster-Level Intelligence Matters" translate="no">​</a></h2>
<p>By integrating llm-d's cache-aware routing, prefill and decode disaggregation, and SLA-based scheduling with KServe's enterprise-grade generative serving and autoscaling, the system achieves cluster-wide GPU optimization.</p>
<p><em>Note: The following results are based on benchmarks published by the llm-d project</em></p>
<table><thead><tr><th>Optimization Area</th><th>Naive Architecture (Round Robin LB)</th><th>Optimized (KServe + llm-d)</th><th>Source</th></tr></thead><tbody><tr><td>Cache Locality</td><td>Requests routed randomly → KV cache frequently missed</td><td>Cache-aware routing preserves prefix locality</td><td><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Time to First Token (P90)</td><td>Baseline latency under cache-blind scheduling</td><td>Up to ~57× faster P90 TTFT in benchmark</td><td><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Token Throughput</td><td>~4,400 tokens/sec (baseline test cluster)</td><td>~8,730 tokens/sec (~2× improvement)</td><td><a href="https://llm-d.ai/blog/kvcache-wins-you-can-see" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Throughput at Scale</td><td>Degrades under multi-tenant load</td><td>Sustained 4.5k–11k tokens/sec</td><td><a href="https://llm-d.ai/blog/llm-d-v0.5-sustaining-performance-at-scale" target="_blank" rel="noopener noreferrer" class="">llm-d blog</a></td></tr><tr><td>Tail Latency (P95/P99)</td><td>Higher tail latency due to stragglers &amp; imbalance</td><td>~50% tail latency reduction (reported tests)</td><td><a href="https://developers.redhat.com/articles/2025/05/20/llm-d-kubernetes-native-distributed-inferencing" target="_blank" rel="noopener noreferrer" class="">Red Hat Developers</a></td></tr><tr><td>GPU Utilization</td><td>Uneven utilization, idle GPUs possible</td><td>Improved effective utilization via routing intelligence</td><td><a href="https://llm-d.ai/docs/guide/Installation/inference-scheduling" target="_blank" rel="noopener noreferrer" class="">llm-d docs</a></td></tr><tr><td>Autoscaling Control</td><td>Scale reacts to load only</td><td>Works with KServe autoscaling + routing 
intelligence</td><td><a href="https://kserve.github.io/website/docs/model-serving/predictive-inference/autoscaling/kpa-autoscaler" target="_blank" rel="noopener noreferrer" class="">KServe docs</a></td></tr></tbody></table>
<p>Modern GenAI platforms require cache locality awareness, phase-aware scheduling, distributed intelligence, and composable Kubernetes-native design. This combination ensures a production-ready system that meets the demands of large-scale production workloads.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="next-steps">Next Steps<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.17-release#next-steps" class="hash-link" aria-label="Direct link to Next Steps" title="Direct link to Next Steps" translate="no">​</a></h2>
<p>Explore detailed project documentation:</p>
<ul>
<li class=""><a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">KServe</a></li>
<li class=""><a href="https://llm-d.ai/" target="_blank" rel="noopener noreferrer" class="">llm-d</a></li>
</ul>
<p>Engage with community resources and Slack channels to stay updated and contribute to ongoing developments:</p>
<ul>
<li class=""><a href="https://kserve.github.io/website/community/get_involved/" target="_blank" rel="noopener noreferrer" class="">KServe community</a></li>
<li class=""><a href="https://llm-d.ai/community/" target="_blank" rel="noopener noreferrer" class="">llm-d community</a></li>
</ul>]]></content>
        <author>
            <name>Yuan Tang</name>
            <uri>https://github.com/terrytangyuan</uri>
        </author>
        <author>
            <name>Ran Pollak</name>
            <uri>https://github.com/RanPollak</uri>
        </author>
        <category label="Community" term="Community"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.15 - Advancing Generative AI Model Serving]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release"/>
        <updated>2025-05-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.15 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on May 27, 2025</em></p>
<p>We are thrilled to announce the release of <strong>KServe v0.15</strong>, marking a significant leap forward in serving both predictive and generative AI models. This release introduces enhanced support for generative AI workloads, including advanced features for serving large language models (LLMs), improved model and KV caching mechanisms, and integration with Envoy AI Gateway.</p>
<p><img decoding="async" loading="lazy" alt="!generative_inference" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve_generative_inference-21648e7df404ea6f57b9d3c83e8e0ca4.png" width="911" height="581" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-embracing-generative-ai-workloads">🤖 Embracing Generative AI Workloads<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-embracing-generative-ai-workloads" class="hash-link" aria-label="Direct link to 🤖 Embracing Generative AI Workloads" title="Direct link to 🤖 Embracing Generative AI Workloads" translate="no">​</a></h2>
<p>KServe v0.15 brings first-class support for generative AI workloads, marking a key evolution beyond traditional predictive AI. Unlike predictive models that infer outcomes from existing data, generative models like large language models (LLMs) create new content from prompts. This fundamental difference introduces new serving challenges. KServe now provides the infrastructure and optimizations needed to serve these models efficiently at scale.</p>
<p>To support these workloads, we've introduced a dedicated <strong>Generative AI</strong> section in our documentation, detailing the new capabilities and configurations tailored for generative models.</p>
<p>KServe now offers a <strong>lightweight</strong> installation for hosting LLMs on Kubernetes; please follow the <a href="https://kserve.github.io/archive/0.15/admin/kubernetes_deployment" target="_blank" rel="noopener noreferrer" class="">generative inference installation guide</a> to get started. KEDA is an optional component for scaling on LLM-specific metrics, and Envoy AI Gateway is integrated for advanced traffic management capabilities, including token rate limiting, a unified API, and intelligent routing.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-key-generative-ai-features-in-v015">🚀 Key Generative AI Features in v0.15<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-key-generative-ai-features-in-v015" class="hash-link" aria-label="Direct link to 🚀 Key Generative AI Features in v0.15" title="Direct link to 🚀 Key Generative AI Features in v0.15" translate="no">​</a></h2>
<ul>
<li class=""><strong>Envoy AI Gateway Integration</strong></li>
<li class=""><strong>Multi Node Inference</strong></li>
<li class=""><strong>LLM Autoscaler with KEDA</strong></li>
<li class=""><strong>Distributed KV Cache with LMCache</strong></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-envoy-ai-gateway-support">🌐 Envoy AI Gateway Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-envoy-ai-gateway-support" class="hash-link" aria-label="Direct link to 🌐 Envoy AI Gateway Support" title="Direct link to 🌐 Envoy AI Gateway Support" translate="no">​</a></h3>
<p>KServe v0.15 adds initial support for <a href="https://aigateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class=""><strong>Envoy AI Gateway</strong></a>, a CNCF open source project built on top of <a href="https://gateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Gateway</a> and designed specifically for managing generative AI traffic at scale.</p>
<p><a href="https://gateway.envoyproxy.io/" target="_blank" rel="noopener noreferrer" class="">Envoy Gateway</a> is also now supported in KServe along with <a href="https://gateway-api.sigs.k8s.io/" target="_blank" rel="noopener noreferrer" class="">Kubernetes Gateway API</a>. Unlike traditional gateway solutions, Envoy AI Gateway provides advanced capabilities tailored to AI serving, including:</p>
<ul>
<li class="">Dynamic model routing based on request content, model metadata, or user context.</li>
<li class="">Built-in support for multi-tenant inference, with fine-grained access controls and authentication.</li>
<li class="">Unified API for routing and managing LLM/AI traffic easily.</li>
<li class="">Integrated observability for model-level performance insights.</li>
<li class="">Extensibility for inference-specific policies like rate-limiting by token, and model lifecycle management.</li>
<li class="">Automatic failover mechanisms to ensure service reliability.</li>
</ul>
<p>This integration enables a unified, intelligent entrypoint for both predictive and generative workloads—scaling from traditional models to complex LLMs—all while abstracting infrastructure complexity from the user. Please refer to <a href="https://kserve.github.io/archive/0.15/admin/ai-gateway_integration" target="_blank" rel="noopener noreferrer" class="">Envoy AI Gateway integration doc</a> for more details.</p>
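<p>As a rough illustration of the integration, the sketch below routes OpenAI-style requests to a KServe backend based on the requested model name (resource names and the backend reference are hypothetical, and the exact CRDs and fields may differ; see the Envoy AI Gateway integration doc linked above for the authoritative configuration):</p>
<pre><code class="language-yaml">apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: kserve-llm-route
spec:
  schema:
    name: OpenAI              # accept the OpenAI-compatible API at the gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model       # model name extracted from the request
              value: llama3
      backendRefs:
        - name: kserve-llama3-backend   # an AIServiceBackend pointing at the InferenceService
</code></pre>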
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-multi-node-inference">🔗 Multi-Node Inference<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-multi-node-inference" class="hash-link" aria-label="Direct link to 🔗 Multi-Node Inference" title="Direct link to 🔗 Multi-Node Inference" translate="no">​</a></h3>
<p>To support LLMs too large for a single node (e.g., Llama 3.1 405B), KServe v0.15 introduces multi-node inference across distributed GPUs, unlocking large model serving at scale. As models continue to increase in size, multi-node inference capabilities are increasingly important for production deployments that require real-time user experience. Please refer to the <a href="https://kserve.github.io/archive/0.15/modelserving/v1beta1/llm/huggingface/multi-node" target="_blank" rel="noopener noreferrer" class="">Multi Node inference doc</a> for more details.</p>
<p>The community is also working on a <a href="https://github.com/kserve/kserve/issues/4433" target="_blank" rel="noopener noreferrer" class="">new distributed inference API</a> to scale multi-node inference and support disaggregated prefill, targeted at large LLM deployments.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> pvc</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">pvc/hf/8b_instruction_tuned</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">workerSpec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">pipelineParallelSize</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" 
style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">tensorParallelSize</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling">⚡ LLM Autoscaler with KEDA <a href="https://keda.sh/" target="_blank" rel="noopener noreferrer" class="">(Kubernetes Event-driven Autoscaling)</a><a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" class="hash-link" aria-label="Direct link to -llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" title="Direct link to -llm-autoscaler-with-keda-kubernetes-event-driven-autoscaling" translate="no">​</a></h3>
<p>Autoscaling LLMs is challenging due to their high resource demands and variable inference traffic patterns. The dynamic nature of LLM inference, with varying input lengths and token generation speeds, further complicates the prediction of resource needs, demanding sophisticated and adaptive autoscaling solutions. KServe now integrates with <a href="https://keda.sh/" target="_blank" rel="noopener noreferrer" class=""><strong>KEDA</strong></a> (Kubernetes Event-Driven Autoscaling), which addresses many of these challenges by extending Kubernetes' native Horizontal Pod Autoscaler (HPA) capabilities. Because KEDA can monitor custom metrics, you can expose LLM metrics from your inference servers and scale based on these precise indicators.</p>
<p>This empowers users to efficiently manage LLM workloads with more intelligent scaling decisions based on workload characteristics for improved performance and cost optimization. Please follow the <a href="https://kserve.github.io/archive/0.15/modelserving/autoscaling/keda/autoscaling_llm" target="_blank" rel="noopener noreferrer" class="">tutorial doc</a> for how to autoscale based on vLLM metrics.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">keda</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">annotations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" 
style="color:#00a4db">serving.kserve.io/autoscalerClass</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"keda"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">sidecar.opentelemetry.io/inject</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"huggingface-llama3-keda"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70b</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">1</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">maxReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" 
style="color:#36acaa">5</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">autoScaling</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">metrics</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> PodMetric</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">podmetric</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">metric</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">backend</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"opentelemetry"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" 
style="color:#00a4db">metricNames</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> vllm</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">num_requests_running</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">query</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"vllm:num_requests_running"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">target</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">type</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Value</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">              </span><span class="token key atrule" style="color:#00a4db">value</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"4"</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-distributed-kv-cache-with-lmcache">🚀 Distributed KV Cache with <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a><a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-distributed-kv-cache-with-lmcache" class="hash-link" aria-label="Direct link to -distributed-kv-cache-with-lmcache" title="Direct link to -distributed-kv-cache-with-lmcache" translate="no">​</a></h3>
<p>Key-Value (KV) cache offloading is a technique used in large language model (LLM) serving to store and reuse the intermediate key and value tensors generated during model inference. In transformer-based models, these KV caches represent the context for each token processed, and reusing them allows the model to avoid redundant computations for repeated or similar prompts.</p>
<p>Enabling KV cache offloading across multiple requests and serving instances reduces Time To First Token (TTFT), improves scalability by sharing the cache across replicas, and improves the user experience for multi-turn QA and RAG.</p>
<p>KServe integrates <a href="https://lmcache.ai/" target="_blank" rel="noopener noreferrer" class="">LMCache</a>, a state-of-the-art KV cache layer developed by LMCache Lab, to reduce inference costs and ensure SLOs for both latency and throughput at scale. Please follow the <a href="https://kserve.github.io/archive/0.15/modelserving/v1beta1/llm/huggingface/kv_cache_offloading/#overview" target="_blank" rel="noopener noreferrer" class="">LMCache integration doc</a> to optimize your GenAI inference workload.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">lmcache</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">minReplicas</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">2</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span 
class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">70b</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">kv</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">transfer</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">config</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">'{"kv_connector":"LMCacheConnectorV1", "kv_role":"kv_both"}'</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">enable</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">chunked</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">prefill</span><br></span></code></pre></div></div>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-advanced-model-caching-mechanisms">📦 Advanced Model Caching Mechanisms<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-advanced-model-caching-mechanisms" class="hash-link" aria-label="Direct link to 📦 Advanced Model Caching Mechanisms" title="Direct link to 📦 Advanced Model Caching Mechanisms" translate="no">​</a></h3>
<p>To reduce model loading times and improve the overall efficiency of serving large models, KServe v0.15 introduces advanced model caching features:</p>
<ul>
<li class=""><strong>LocalModelCache Enhancements:</strong> Improved the LocalModelCache custom resource to support multiple node groups, providing greater flexibility in model placement and caching strategies.</li>
<li class=""><strong>Node Agent Improvements:</strong> Enhanced the local model node agent for better performance and reliability.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-vllm-backend-support">🔧 Enhanced vLLM Backend Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-enhanced-vllm-backend-support" class="hash-link" aria-label="Direct link to 🔧 Enhanced vLLM Backend Support" title="Direct link to 🔧 Enhanced vLLM Backend Support" translate="no">​</a></h3>
<p>The vLLM backend has been significantly upgraded to better serve generative AI models:</p>
<ul>
<li class=""><strong>Version Upgrade:</strong> Updated to vLLM 0.8.5, bringing performance improvements with v1 backend and new features.</li>
<li class=""><strong>Qwen3 &amp; Llama4:</strong> Added support for Qwen3 and Llama4 models.</li>
<li class=""><strong>Reranking Support:</strong> Added support for reranking models.</li>
<li class=""><strong>Embedding Support:</strong> Added support for OpenAI-compatible embeddings API, enabling a broader range of applications.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-additional-improvements">🛠️ Additional Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#%EF%B8%8F-additional-improvements" class="hash-link" aria-label="Direct link to 🛠️ Additional Improvements" title="Direct link to 🛠️ Additional Improvements" translate="no">​</a></h2>
<p>This release also includes several other enhancements:</p>
<ul>
<li class="">Support Deep Health Checks <a href="https://github.com/kserve/kserve/pull/3348" target="_blank" rel="noopener noreferrer" class="">#3348</a></li>
<li class="">Collocated Transformer &amp; Predictor Feature <a href="https://github.com/kserve/kserve/pull/4255" target="_blank" rel="noopener noreferrer" class="">#4255</a></li>
<li class="">Kubernetes Gateway API support <a href="https://github.com/kserve/kserve/pull/3952" target="_blank" rel="noopener noreferrer" class="">#3952</a></li>
<li class="">Security Updates</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.15.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We extend our gratitude to all the contributors who made this release possible. Your efforts continue to drive the advancement of KServe as a leading platform for serving machine learning models.</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Special Recognition</strong>: The generative AI community for their valuable input on LLM serving requirements</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.15-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<p>We invite you to explore the new features in KServe v0.15 and contribute to the ongoing development of the project:</p>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Alexa Griffith</name>
            <uri>https://github.com/alexagriffith</uri>
        </author>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Yuan Tang</name>
            <uri>https://github.com/terrytangyuan</uri>
        </author>
        <author>
            <name>Johnu George</name>
            <uri>https://github.com/johnugeorge</uri>
        </author>
        <author>
            <name>Lize Cai</name>
            <uri>https://github.com/lizzzcai</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.14]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release"/>
        <updated>2024-12-13T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.14 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on December 23, 2024</em></p>
<p>We are excited to announce KServe v0.14. In this release we are introducing a new Python client designed for KServe and a new model cache feature, we are promoting OCI storage for models to a stable feature, and we are adding support for deploying models directly from Hugging Face.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-key-features">🚀 Key Features<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-key-features" class="hash-link" aria-label="Direct link to 🚀 Key Features" title="Direct link to 🚀 Key Features" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-inference-client-for-python">Introducing Inference client for Python<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#introducing-inference-client-for-python" class="hash-link" aria-label="Direct link to Introducing Inference client for Python" title="Direct link to Introducing Inference client for Python" translate="no">​</a></h3>
<p>The KServe Python SDK now includes both <a href="https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L388" target="_blank" rel="noopener noreferrer" class="">REST</a> and <a href="https://github.com/kserve/kserve/blob/v0.14.0/python/kserve/kserve/inference_client.py#L61" target="_blank" rel="noopener noreferrer" class="">GRPC</a> inference clients. Both clients are released as <strong>alpha</strong> features.</p>
<p>In line with the features documented in issue <a href="https://github.com/kserve/kserve/issues/3270" target="_blank" rel="noopener noreferrer" class="">#3270</a>, both clients have the following characteristics:</p>
<ul>
<li class="">The clients are asynchronous</li>
<li class="">Support for HTTP/2 (via <a href="https://www.python-httpx.org/" target="_blank" rel="noopener noreferrer" class="">httpx</a> library)</li>
<li class="">Support Open Inference Protocol v1 and v2</li>
<li class="">Allow client send and receive tensor data in binary format for HTTP/REST request, see <a href="https://kserve.github.io/archive/0.14/modelserving/data_plane/binary_tensor_data_extension/" target="_blank" rel="noopener noreferrer" class="">binary tensor data extension docs</a>.</li>
</ul>
<p>As usual, version 0.14.0 of the KServe Python SDK is <a href="https://pypi.org/project/kserve/0.14.0/" target="_blank" rel="noopener noreferrer" class="">published to PyPI</a> and available to install via <code>pip install kserve</code>.</p>
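<p>To make the request shape concrete, here is a minimal sketch of an Open Inference Protocol v2 request body, the wire format both new clients speak; the tensor name, shape, and data are illustrative placeholders, not values from the release notes:</p>

```python
import json

# Sketch of an Open Inference Protocol v2 request body -- the wire
# format the new REST/GRPC clients speak. The tensor name, shape, and
# data below are illustrative placeholders.
def build_v2_request(name, shape, datatype, data):
    """Assemble a v2 inference request payload as a plain dict."""
    return {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
        ]
    }

body = json.dumps(build_v2_request("input-0", [1, 3], "FP32", [1.0, 2.0, 3.0]))
```

<p>The REST client posts a body like this to the v2 infer endpoint, while the gRPC client carries the same fields as protobuf messages.</p>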
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="support-for-oci-storage-for-models-modelcars-becomes-stable">Support for OCI storage for models (modelcars) becomes stable<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#support-for-oci-storage-for-models-modelcars-becomes-stable" class="hash-link" aria-label="Direct link to Support for OCI storage for models (modelcars) becomes stable" title="Direct link to Support for OCI storage for models (modelcars) becomes stable" translate="no">​</a></h3>
<p>In KServe version 0.12, support for using OCI containers for model storage was introduced as an experimental feature. It lets users store models in containers in OCI format and use OCI-compatible registries for publishing the models.</p>
<p>This feature was implemented by configuring the OCI model container as a sidecar in the InferenceService pod, which is why the feature is named modelcars. The model files are made available to the model server by configuring <a href="https://kubernetes.io/docs/tasks/configure-pod-container/share-process-namespace/" target="_blank" rel="noopener noreferrer" class="">process namespace sharing</a> in the pod.</p>
<p>One small but important detail remained unsolved and motivated the experimental status: since the modelcar runs as one of the pod's main containers, there was no guarantee that it would start quickly. The model server would be unstable if it started before the modelcar, and because the model image was not prefetched, this was considered a likely condition.</p>
<p>This instability has been mitigated by configuring the OCI model as an init container in addition to the sidecar. The init container ensures that the model image is fetched before the main containers are started, and this prefetching allows the modelcar to start quickly.
With this change, modelcars are a stable feature as of KServe v0.14.</p>
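<p>With modelcars stable, pointing an InferenceService at an OCI model image only requires the <code>oci://</code> schema in <code>storageUri</code>. A minimal sketch, in which the registry path and model format are placeholders:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-from-oci
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Placeholder image reference; any OCI-compatible registry
      # reachable from the cluster works.
      storageUri: oci://registry.example.com/models/sklearn-iris:v1
```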
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="future-plan">Future plan<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#future-plan" class="hash-link" aria-label="Direct link to Future plan" title="Direct link to Future plan" translate="no">​</a></h4>
<p>Modelcars is one implementation option for supporting OCI images for model storage. Other alternatives are discussed in <a href="https://github.com/kserve/kserve/issues/4083" target="_blank" rel="noopener noreferrer" class="">issue #4083</a>.</p>
<p>Using volume mounts based on OCI artifacts is the optimal implementation, but this <a href="https://kubernetes.io/blog/2024/08/16/kubernetes-1-31-image-volume-source/" target="_blank" rel="noopener noreferrer" class="">only recently became possible with Kubernetes 1.31</a>, which introduces it as a native alpha feature. KServe can now evolve to use this new Kubernetes feature.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="introducing-model-cache">Introducing Model Cache<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#introducing-model-cache" class="hash-link" aria-label="Direct link to Introducing Model Cache" title="Direct link to Introducing Model Cache" translate="no">​</a></h3>
<p>With models increasing in size, which is especially true for LLMs, pulling a model from storage each time a pod is created can result in unmanageable start-up times. Although OCI storage also provides model caching, its capabilities are not flexible, since cache management is delegated to the cluster.</p>
<p>The Model Cache was proposed as another alternative to enhance KServe usability with large models, and it is released in KServe v0.14 as an <strong>alpha</strong> feature.
In this release, models are stored on local node storage, and the <code>LocalModelCache</code> custom resource controls which models to keep in the cache.
The local model cache state can always be rebuilt from the models stored on persistent storage, such as a model registry or S3.
Read the <a href="https://docs.google.com/document/d/1nao8Ws3tonO2zNAzdmXTYa0hECZNoP2SV_z9Zg0PzLA/edit" target="_blank" rel="noopener noreferrer" class="">design document for the details</a>.</p>
<p><img decoding="async" loading="lazy" alt="localmodelcache" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/localmodelcache-59f819fe261fb8fcd66a6c875a73b3d6.png" width="2462" height="1416" class="img_ev3q"></p>
<p>By caching the models, you get the following benefits:</p>
<ul>
<li class="">
<p>Minimize the time it takes for LLM pods to start serving requests.</p>
</li>
<li class="">
<p>Share the same storage across pods scheduled on the same GPU node.</p>
</li>
<li class="">
<p>Scale your AI workload efficiently without worrying about slow model server container startup.</p>
</li>
</ul>
<p>The model cache is currently disabled by default. To enable it, set the <code>localmodel.enabled</code> field in the <code>inferenceservice-config</code> ConfigMap.</p>
<p>You can follow the <a href="https://kserve.github.io/archive/0.14/modelserving/storage/modelcache/localmodel/" target="_blank" rel="noopener noreferrer" class="">local model cache tutorial</a> to cache LLMs on the local NVMe drives of your GPU nodes and deploy them with an <code>InferenceService</code> that loads models from the local cache to accelerate container startup.</p>
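<p>As a sketch of what caching a model looks like, assuming the <code>LocalModelCache</code> field names shown in the tutorial (the model URI, size, and node group name below are placeholders):</p>

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: LocalModelCache
metadata:
  name: meta-llama3-8b-instruct
spec:
  # Persistent source the cache is populated from (and can be rebuilt from).
  sourceModelUri: hf://meta-llama/meta-llama-3-8b-instruct
  modelSize: 10Gi
  nodeGroups:
    - workers   # placeholder node group with local NVMe storage
```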
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="support-for-hugging-face-hub-in-storage-initializer">Support for Hugging Face hub in storage initializer<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#support-for-hugging-face-hub-in-storage-initializer" class="hash-link" aria-label="Direct link to Support for Hugging Face hub in storage initializer" title="Direct link to Support for Hugging Face hub in storage initializer" translate="no">​</a></h3>
<p>The KServe storage initializer has been enhanced to support downloading models directly from Hugging Face. For this, the new <code>hf://</code> schema is now supported in the <code>storageUri</code> field of InferenceServices. The following partial YAML shows this:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> hf</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span></code></pre></div></div>
<p>Both public and private Hugging Face repositories are supported. The credentials can be provided by the usual mechanism of binding Secrets to ServiceAccounts, or by binding the credentials Secret as environment variables in the InferenceService.</p>
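<p>As one example of the environment-variable approach, assuming a pre-created Secret named <code>hf-secret</code> holding a Hugging Face access token under the <code>HF_TOKEN</code> key:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-private-model
spec:
  predictor:
    model:
      storageUri: hf://meta-llama/meta-llama-3-8b-instruct
      env:
        # Secret name and key are placeholders; create the Secret
        # beforehand with your Hugging Face access token.
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: HF_TOKEN
```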
<p>Read the <a href="https://kserve.github.io/archive/0.14/modelserving/storage/huggingface/hf/" target="_blank" rel="noopener noreferrer" class="">documentation</a> for more details.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-enhancements-and-improvements">🛠️ Enhancements and Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#%EF%B8%8F-enhancements-and-improvements" class="hash-link" aria-label="Direct link to 🛠️ Enhancements and Improvements" title="Direct link to 🛠️ Enhancements and Improvements" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="hugging-face-vllm-backend-changes">Hugging Face vLLM backend changes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#hugging-face-vllm-backend-changes" class="hash-link" aria-label="Direct link to Hugging Face vLLM backend changes" title="Direct link to Hugging Face vLLM backend changes" translate="no">​</a></h3>
<ul>
<li class="">vLLM backend to update to 0.6.1 <a href="https://github.com/kserve/kserve/pull/3948" target="_blank" rel="noopener noreferrer" class="">#3948</a></li>
<li class="">Support trust_remote_code flag for vllm <a href="https://github.com/kserve/kserve/pull/3729" target="_blank" rel="noopener noreferrer" class="">#3729</a></li>
<li class="">Support text embedding task in hugging face server <a href="https://github.com/kserve/kserve/pull/3743" target="_blank" rel="noopener noreferrer" class="">#3743</a></li>
<li class="">Add health endpoint for vLLM backend <a href="https://github.com/kserve/kserve/pull/3850" target="_blank" rel="noopener noreferrer" class="">#3850</a></li>
<li class="">Added <code>hostIPC</code> field to <code>ServingRuntime</code> CRD, for supporting more than one GPU in Serverless mode <a href="https://github.com/kserve/kserve/issues/3791" target="_blank" rel="noopener noreferrer" class="">#3791</a></li>
<li class="">Support shared memory volume for vLLM backend <a href="https://github.com/kserve/kserve/pull/3910" target="_blank" rel="noopener noreferrer" class="">#3910</a></li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="other-enhancements">Other Enhancements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#other-enhancements" class="hash-link" aria-label="Direct link to Other Enhancements" title="Direct link to Other Enhancements" translate="no">​</a></h3>
<ul>
<li class="">New flag for automount serviceaccount token by <a href="https://github.com/kserve/kserve/pull/3979" target="_blank" rel="noopener noreferrer" class="">#3979</a></li>
<li class="">TLS support for inference loggers <a href="https://github.com/kserve/kserve/issues/3837" target="_blank" rel="noopener noreferrer" class="">#3837</a></li>
<li class="">Allow PVC storage to be mounted in ReadWrite mode via an annotation <a href="https://github.com/kserve/kserve/issues/3687" target="_blank" rel="noopener noreferrer" class="">#3687</a></li>
<li class="">Support HTTP Headers passing for KServe python custom runtimes <a href="https://github.com/kserve/kserve/pull/3669" target="_blank" rel="noopener noreferrer" class="">#3669</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class="">Ray is now an optional dependency <a href="https://github.com/kserve/kserve/pull/3834" target="_blank" rel="noopener noreferrer" class="">#3834</a></li>
<li class="">Support for Python 3.12 is added, while support Python 3.8 is removed <a href="https://github.com/kserve/kserve/pull/3645" target="_blank" rel="noopener noreferrer" class="">#3645</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.14.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.14-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the community" title="Direct link to 🤝 Join the community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Edgar Hernández</name>
            <uri>https://github.com/israel-hdez</uri>
        </author>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[From Serverless Predictive Inference to Generative Inference - Introducing KServe v0.13]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release"/>
        <updated>2024-05-15T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.13 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on May 15, 2024</em></p>
<p>We are excited to unveil KServe v0.13, marking a significant leap forward in evolving cloud native model serving to meet the demands of Generative AI inference. This release is highlighted by three pivotal updates: enhanced Hugging Face runtime, robust vLLM backend support for Generative Models, and the integration of OpenAI protocol standards.</p>
<p><img decoding="async" loading="lazy" alt="kserve-components" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-layer-08feccc0300cf8608f0a36b6572e70fb.png" width="960" height="540" class="img_ev3q"></p>
<p>Below is a summary of the key changes.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-hugging-face-runtime-support">🚀 Enhanced Hugging Face Runtime Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-enhanced-hugging-face-runtime-support" class="hash-link" aria-label="Direct link to 🚀 Enhanced Hugging Face Runtime Support" title="Direct link to 🚀 Enhanced Hugging Face Runtime Support" translate="no">​</a></h2>
<p>KServe v0.13 enriches its Hugging Face runtime and now supports running Hugging Face models out-of-the-box. KServe v0.13 implements a <a href="https://github.com/kserve/kserve/tree/release-0.13/python/huggingfaceserver" target="_blank" rel="noopener noreferrer" class="">KServe Hugging Face Serving Runtime</a>, <code>kserve-huggingfaceserver</code>. With this implementation, KServe can now automatically infer a <a href="https://huggingface.co/tasks" target="_blank" rel="noopener noreferrer" class="">task</a> from the model architecture and select the optimized serving runtime. Currently supported tasks include sequence classification, token classification, fill-mask, text generation, and text-to-text generation.</p>
<p><img decoding="async" loading="lazy" alt="kserve-huggingface" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-huggingface-209566d5f98a98d521606e57b4531a19.png" width="7243" height="2208" class="img_ev3q"></p>
<p>Here is an example of serving a BERT model by deploying an InferenceService with the Hugging Face runtime for a classification task.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">bert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=bert</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=bert</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">base</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">uncased</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">tensor_input_names=input_ids</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" 
style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><br></span></code></pre></div></div>
<p>You can also deploy BERT on a more optimized inference runtime such as Triton, using the Hugging Face runtime for pre/post processing; see more details <a href="https://kserve.github.io/archive/0.13/modelserving/v1beta1/triton/huggingface/" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-vllm-support">🔧 vLLM Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-vllm-support" class="hash-link" aria-label="Direct link to 🔧 vLLM Support" title="Direct link to 🔧 vLLM Support" translate="no">​</a></h3>
<p>Version 0.13 introduces dedicated runtime support for <a href="https://docs.vllm.ai/en/latest/" target="_blank" rel="noopener noreferrer" class="">vLLM</a> for enhanced transformer model serving. This support now includes automatically mapping vLLM as the backend for supported tasks, streamlining the deployment process and optimizing performance. If vLLM does not support a particular task, serving defaults to the Hugging Face backend. See the example below.</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> huggingface</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=llama3</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_id=meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama/meta</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">llama</span><span class="token punctuation" 
style="color:#393A34">-</span><span class="token plain">3</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">8b</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">instruct</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"6"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 24Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"6"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 24Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">nvidia.com/gpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><br></span></code></pre></div></div>
<p>See more details in our updated docs to <a href="https://kserve.github.io/archive/0.13/modelserving/v1beta1/llm/huggingface/" target="_blank" rel="noopener noreferrer" class="">Deploy the Llama3 model with Hugging Face LLM Serving Runtime</a>.</p>
<p>Additionally, if the Hugging Face backend is preferred over vLLM, vLLM auto-mapping can be disabled with the <code>--backend=huggingface</code> arg.</p>
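<p>For illustration, a minimal sketch of where the backend override arg goes in the predictor spec; the service name and model ID below are placeholders, not from this release:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: huggingface-example          # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
      args:
      - --model_name=example
      - --model_id=some-org/some-model   # placeholder model ID
      - --backend=huggingface            # prefer Hugging Face, disabling vLLM auto-mapping
</code></pre>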
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-openai-schema-integration">🌐 OpenAI Schema Integration<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-openai-schema-integration" class="hash-link" aria-label="Direct link to 🌐 OpenAI Schema Integration" title="Direct link to 🌐 OpenAI Schema Integration" translate="no">​</a></h3>
<p>Embracing the OpenAI protocol, KServe v0.13 now supports three specific endpoints for generative transformer models:</p>
<ul>
<li class=""><code>/openai/v1/completions</code></li>
<li class=""><code>/openai/v1/chat/completions</code></li>
<li class=""><code>/openai/v1/models</code></li>
</ul>
<p>These endpoints are useful for generative transformer models, which take in messages and return a model-generated message output. The <a href="https://platform.openai.com/docs/guides/text-generation/chat-completions-api" target="_blank" rel="noopener noreferrer" class="">chat completions endpoint</a> is designed to easily handle multi-turn conversations, while still being useful for single-turn tasks. The <a href="https://platform.openai.com/docs/guides/text-generation/completions-api" target="_blank" rel="noopener noreferrer" class="">completions endpoint</a> is now a legacy endpoint; it differs from the chat completions endpoint in that its interface is a freeform text string called a <code>prompt</code>. Read more about the <a href="https://platform.openai.com/docs/api-reference/chat" target="_blank" rel="noopener noreferrer" class="">chat completions</a> and <a href="https://platform.openai.com/docs/api-reference/completions" target="_blank" rel="noopener noreferrer" class="">completions</a> endpoints in the OpenAI API docs.</p>
<p>This update fosters a standardized approach to transformer model serving, ensuring compatibility with a broader spectrum of models and tools and enhancing the platform's versatility. The API can be used directly with OpenAI's client libraries or third-party tools such as LangChain or LlamaIndex.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-future-plan">🔮 Future Plan<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-future-plan" class="hash-link" aria-label="Direct link to 🔮 Future Plan" title="Direct link to 🔮 Future Plan" translate="no">​</a></h3>
<ul>
<li class="">Support other tasks like text embeddings <a href="https://github.com/kserve/kserve/issues/3572" target="_blank" rel="noopener noreferrer" class="">#3572</a>.</li>
<li class="">Support more LLM backend options in the future, such as TensorRT-LLM.</li>
<li class="">Enrich text generation metrics for Throughput(tokens/sec), TTFT(Time to first token) <a href="https://github.com/kserve/kserve/issues/3461" target="_blank" rel="noopener noreferrer" class="">#3461</a>.</li>
<li class="">KEDA integration for token based LLM Autoscaling <a href="https://github.com/kserve/kserve/issues/3561" target="_blank" rel="noopener noreferrer" class="">#3561</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-other-changes">🛠️ Other Changes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#%EF%B8%8F-other-changes" class="hash-link" aria-label="Direct link to 🛠️ Other Changes" title="Direct link to 🛠️ Other Changes" translate="no">​</a></h2>
<p>This release also includes several enhancements and changes:</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">✨ What's New?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-whats-new" class="hash-link" aria-label="Direct link to ✨ What's New?" title="Direct link to ✨ What's New?" translate="no">​</a></h3>
<ul>
<li class="">Async streaming support for v1 endpoints <a href="https://github.com/kserve/kserve/issues/3402" target="_blank" rel="noopener noreferrer" class="">#3402</a>.</li>
<li class="">Support for <code>.json</code> and <code>.ubj</code> model formats in the XGBoost server image <a href="https://github.com/kserve/kserve/issues/3546" target="_blank" rel="noopener noreferrer" class="">#3546</a>.</li>
<li class="">Enhanced flexibility in KServe by allowing the configuration of multiple domains for an inference service <a href="https://github.com/kserve/kserve/issues/2747" target="_blank" rel="noopener noreferrer" class="">#2747</a>.</li>
<li class="">Enhanced the manager setup to dynamically adapt based on available CRDs, improving operational flexibility and reliability across different deployment environments <a href="https://github.com/kserve/kserve/issues/3470" target="_blank" rel="noopener noreferrer" class="">#3470</a>.</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h3>
<ul>
<li class="">Removed Seldon Alibi dependency <a href="https://github.com/kserve/kserve/issues/3380" target="_blank" rel="noopener noreferrer" class="">#3380</a>.</li>
<li class="">Removal of conversion webhook from manifests. <a href="https://github.com/kserve/kserve/issues/3344" target="_blank" rel="noopener noreferrer" class="">#3344</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.13.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and regular as well as new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Special Recognition</strong>: Contributors who helped drive the generative AI capabilities forward</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.13-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Alexa Griffith</name>
            <uri>https://github.com/alexagriffith</uri>
        </author>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Yuan Tang</name>
            <uri>https://github.com/terrytangyuan</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.11]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release"/>
        <updated>2023-10-08T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.11 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on October 8, 2023</em></p>
<p>We are excited to announce the release of KServe 0.11. In this release we introduced Large Language Model (LLM) runtimes, made enhancements to the KServe control plane, added Open Inference Protocol support to the Python SDK, and improved dependency management. For ModelMesh, we added PVC, HPA, and payload logging support to ensure feature parity with KServe.</p>
<p>Here is a summary of the key changes:</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-core-inference-enhancements">🚀 KServe Core Inference Enhancements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-kserve-core-inference-enhancements" class="hash-link" aria-label="Direct link to 🚀 KServe Core Inference Enhancements" title="Direct link to 🚀 KServe Core Inference Enhancements" translate="no">​</a></h2>
<ul>
<li class="">
<p><strong>Path-based routing support</strong>, which serves as an alternative to host-based routing; the URL of the <code>InferenceService</code> looks like <code>http://&lt;ingress_domain&gt;/serving/&lt;namespace&gt;/&lt;isvc_name&gt;</code>.
Please refer to the <a href="https://github.com/kserve/kserve/blob/294a10495b6b5cda9c64d3e1573b60aec62aceb9/config/configmap/inferenceservice.yaml#L237" target="_blank" rel="noopener noreferrer" class="">doc</a> for how to enable path-based routing.</p>
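<p>As a rough sketch, and assuming the <code>pathTemplate</code> key from the linked configmap, the ingress configuration that produces URLs of this shape could look like the following:</p>
<pre><code class="language-yaml"># Sketch of the ingress entry in the inferenceservice configmap;
# the pathTemplate value is an assumption matching the URL shape above.
ingress: |-
  {
    "pathTemplate": "/serving/{{ .Namespace }}/{{ .Name }}"
  }
</code></pre>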
</li>
<li class="">
<p><strong>Priority field for the Serving Runtime</strong> custom resource to handle the case where multiple serving runtimes support the same model format; see more details in <a href="https://kserve.github.io/archive/0.11/modelserving/servingruntimes/#priority" target="_blank" rel="noopener noreferrer" class="">the serving runtime doc</a>.</p>
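<p>A minimal sketch of how priority might be set on a serving runtime; the runtime name and image are illustrative, and the field placement follows the linked doc:</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-sklearn-runtime      # illustrative name
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true
      priority: 2                    # preferred over a runtime advertising a lower priority
  containers:
    - name: kserve-container
      image: kserve/sklearnserver:latest   # illustrative image
</code></pre>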
</li>
<li class="">
<p><strong>Custom Storage Container CRD</strong> to allow customized storage initializer implementations with supported storage URI prefixes; an example use case is private model registry integration:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1alpha1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ClusterStorageContainer</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> default</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">container</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> storage</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">initializer</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve/model</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">registry</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100Mi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">   
     </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 1Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">supportedUriFormats</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">prefix</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> model</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">registry</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//</span><br></span></code></pre></div></div>
</li>
<li class="">
<p><strong>Inference Graph enhancements</strong> improving the API spec to support pod affinity and resource requirement fields.
A <code>dependency</code> field with the options <code>Soft</code> and <code>Hard</code> is introduced to handle error responses from the inference steps and decide whether to short-circuit the request in case of errors; see the following example with a hard dependency on the node steps:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceGraph</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> graph_with_switch_node</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">nodes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      
</span><span class="token key atrule" style="color:#00a4db">root</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Sequence</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"rootStep1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">nodeName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> node1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">dependency</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hard</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"rootStep2"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> success_200_isvc_id </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">node1</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Switch</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"node1Step1"</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain"> error_404_isvc_id </span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">condition</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[@this].#(decision_picker==ERROR)"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">            </span><span class="token key atrule" style="color:#00a4db">dependency</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Hard</span><br></span></code></pre></div></div>
<p>For more details, please refer to the <a href="https://github.com/kserve/kserve/issues/2484" target="_blank" rel="noopener noreferrer" class="">issue</a>.</p>
</li>
<li class="">
<p><strong>Improved InferenceService debugging experience</strong> by adding the aggregated <code>RoutesReady</code> status and the <code>LastDeploymentReady</code> condition to the InferenceService status to differentiate between the endpoint and deployment status.
This applies to serverless mode; for more details, refer to the <a href="https://pkg.go.dev/github.com/kserve/kserve@v0.11.1/pkg/apis/serving/v1beta1#InferenceServiceStatus" target="_blank" rel="noopener noreferrer" class="">API docs</a>.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-enhanced-python-sdk-dependency-management">📦 Enhanced Python SDK Dependency Management<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-enhanced-python-sdk-dependency-management" class="hash-link" aria-label="Direct link to 📦 Enhanced Python SDK Dependency Management" title="Direct link to 📦 Enhanced Python SDK Dependency Management" translate="no">​</a></h3>
<ul>
<li class="">
<p>KServe has adopted <a href="https://python-poetry.org/docs/" target="_blank" rel="noopener noreferrer" class="">poetry</a> to manage Python dependencies. You can now install the KServe SDK with locked dependencies using <code>poetry install</code>.
While <code>pip install</code> still works, we highly recommend using poetry to ensure predictable dependency management.</p>
</li>
<li class="">
<p>The KServe SDK has also been slimmed down by making the cloud storage dependencies optional. If you need the storage dependencies for custom serving runtimes, you can still install them with <code>pip install kserve[storage]</code>.</p>
</li>
</ul>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-python-runtimes-improvements">🔧 KServe Python Runtimes Improvements<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-kserve-python-runtimes-improvements" class="hash-link" aria-label="Direct link to 🔧 KServe Python Runtimes Improvements" title="Direct link to 🔧 KServe Python Runtimes Improvements" translate="no">​</a></h3>
<ul>
<li class="">
<p>KServe Python Runtimes including <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/sklearn/v2/" target="_blank" rel="noopener noreferrer" class="">sklearnserver</a>, <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/lightgbm/" target="_blank" rel="noopener noreferrer" class="">lgbserver</a>, <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/xgboost/" target="_blank" rel="noopener noreferrer" class="">xgbserver</a>
now support the open inference protocol for both REST and gRPC.</p>
</li>
<li class="">
<p>Logging improvements including adding Uvicorn access logging and a default KServe logger.</p>
</li>
<li class="">
<p>The <code>postprocess</code> handler has been aligned with the open inference protocol, abstracting away the complexities of the underlying transport protocol.</p>
</li>
</ul>
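<p>To give a concrete picture of the open inference protocol shape the runtimes now work against, here is a minimal, self-contained sketch using plain dictionaries. The tensor metadata fields (<code>name</code>, <code>shape</code>, <code>datatype</code>, <code>data</code>) follow the v2 protocol; the helper function, model name, and values are illustrative only, not KServe source.</p>

```python
# Toy sketch of an open (v2) inference protocol response body.
# Field names follow the v2 protocol; everything else here is made up.
def build_v2_response(model_name, predictions):
    return {
        "model_name": model_name,
        "outputs": [
            {
                "name": "output-0",
                "shape": [len(predictions)],
                "datatype": "FP32",
                "data": predictions,
            }
        ],
    }

resp = build_v2_response("my-model", [0.1, 0.9])
```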
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-llm-runtimes">🤖 LLM Runtimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-llm-runtimes" class="hash-link" aria-label="Direct link to 🤖 LLM Runtimes" title="Direct link to 🤖 LLM Runtimes" translate="no">​</a></h3>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="torchserve-llm-runtime">TorchServe LLM Runtime<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#torchserve-llm-runtime" class="hash-link" aria-label="Direct link to TorchServe LLM Runtime" title="Direct link to TorchServe LLM Runtime" translate="no">​</a></h4>
<p>KServe now integrates with TorchServe 0.8, offering support for <a href="https://pytorch.org/serve/large_model_inference.html" target="_blank" rel="noopener noreferrer" class="">LLM models</a> that may not fit onto a single GPU.
Hugging Face Accelerate and DeepSpeed are available options to split the model into multiple partitions across multiple GPUs. See the <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/llm/torchserve/accelerate/" target="_blank" rel="noopener noreferrer" class="">detailed example</a> of how to serve an LLM on KServe with the TorchServe runtime.</p>
<h4 class="anchor anchorTargetStickyNavbar_Vzrq" id="vllm-runtime">vLLM Runtime<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#vllm-runtime" class="hash-link" aria-label="Direct link to vLLM Runtime" title="Direct link to vLLM Runtime" translate="no">​</a></h4>
<p>Serving LLMs can be surprisingly slow even on high-end GPUs. <a href="https://github.com/vllm-project/vllm" target="_blank" rel="noopener noreferrer" class="">vLLM</a> is a fast and easy-to-use LLM inference engine that can achieve 10x-20x higher throughput than Hugging Face Transformers.
It supports <a href="https://www.anyscale.com/blog/continuous-batching-llm-inference" target="_blank" rel="noopener noreferrer" class="">continuous batching</a> for increased throughput and GPU utilization, and
<a href="https://vllm.ai/" target="_blank" rel="noopener noreferrer" class="">paged attention</a> to address the memory bottleneck of autoregressive decoding, where all the attention key-value tensors (the KV cache) are kept in GPU memory to generate the next tokens.</p>
<p>The <a href="https://kserve.github.io/archive/0.11/modelserving/v1beta1/llm/vllm/" target="_blank" rel="noopener noreferrer" class="">example</a> shows how to deploy vLLM on KServe; we expect further integration in KServe 0.12 with the proposed <a href="https://github.com/kserve/open-inference-protocol/pull/7" target="_blank" rel="noopener noreferrer" class="">generate endpoint</a> for the open inference protocol.</p>
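<p>As a rough intuition for why continuous batching helps, the toy scheduler below (not vLLM code; the request lengths and batch limit are made up) admits waiting requests into the running batch as soon as a slot frees up, instead of waiting for the whole batch to drain as static batching does:</p>

```python
# Toy illustration of continuous vs. static batching. Each request needs some
# number of decode steps; the batch can hold at most max_batch requests.
def continuous_batching_steps(request_lengths, max_batch):
    pending = list(request_lengths)
    running = []
    steps = 0
    while pending or running:
        # Admit waiting requests into free batch slots immediately.
        while pending and len(running) < max_batch:
            running.append(pending.pop(0))
        # One decode step advances every running request; finished ones leave.
        steps += 1
        running = [r - 1 for r in running]
        running = [r for r in running if r > 0]
    return steps

def static_batching_steps(request_lengths, max_batch):
    # Static batching waits for the slowest request in each batch to finish.
    pending = list(request_lengths)
    steps = 0
    while pending:
        batch, pending = pending[:max_batch], pending[max_batch:]
        steps += max(batch)
    return steps
```

With requests needing 3, 1, and 2 decode steps and a batch size of 2, the continuous scheduler finishes in fewer total steps because the short request's slot is reused right away.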
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">📊 ModelMesh Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 📊 ModelMesh Updates" title="Direct link to 📊 ModelMesh Updates" translate="no">​</a></h2>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-storing-models-on-kubernetes-persistent-volumes-pvc">💾 Storing Models on Kubernetes Persistent Volumes (PVC)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-storing-models-on-kubernetes-persistent-volumes-pvc" class="hash-link" aria-label="Direct link to 💾 Storing Models on Kubernetes Persistent Volumes (PVC)" title="Direct link to 💾 Storing Models on Kubernetes Persistent Volumes (PVC)" translate="no">​</a></h3>
<p>ModelMesh now allows you to <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/predictors/setup-storage.md#deploy-a-model-stored-on-a-persistent-volume-claim" target="_blank" rel="noopener noreferrer" class="">directly mount model files onto serving runtime pods</a>
using <a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/" target="_blank" rel="noopener noreferrer" class="">Kubernetes Persistent Volumes</a>. Depending on the selected <a href="https://kubernetes.io/docs/concepts/storage/storage-classes/" target="_blank" rel="noopener noreferrer" class="">storage solution</a>, this approach can significantly reduce latency when deploying new predictors and
can potentially remove the need for additional cloud object storage such as AWS S3, GCS, or Azure Blob Storage altogether.</p>
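<p>A minimal sketch of what a PVC-backed predictor can look like, using KServe's <code>pvc://</code> storage URI convention (the claim name, model path, service name, and model format below are illustrative, not defaults):</p>

```yaml
# Illustrative sketch only: names and paths are made up.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-pvc-example
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # pvc://<claim-name>/<path-inside-volume>
      storageUri: pvc://my-models-pvc/sklearn/model.joblib
```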
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-horizontal-pod-autoscaling-hpa">⚡ Horizontal Pod Autoscaling (HPA)<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-horizontal-pod-autoscaling-hpa" class="hash-link" aria-label="Direct link to ⚡ Horizontal Pod Autoscaling (HPA)" title="Direct link to ⚡ Horizontal Pod Autoscaling (HPA)" translate="no">​</a></h3>
<p>Kubernetes <a href="https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/" target="_blank" rel="noopener noreferrer" class="">Horizontal Pod Autoscaling</a> can now be used at the serving runtime pod level. With HPA enabled, the ModelMesh controller no longer manages the number of replicas. Instead, a <code>HorizontalPodAutoscaler</code> automatically updates the serving
runtime deployment with the number of Pods to best match the demand.</p>
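<p>As a sketch of what this can look like, a standard <code>HorizontalPodAutoscaler</code> targeting the serving runtime deployment might be written as follows (the deployment name, replica bounds, and CPU threshold here are illustrative, not ModelMesh defaults):</p>

```yaml
# Illustrative sketch only: target name and thresholds are made up.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-runtime-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: modelmesh-serving-mlserver-1.x
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```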
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-model-metrics-metrics-dashboard-payload-event-logging">📈 Model Metrics, Metrics Dashboard, Payload Event Logging<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-model-metrics-metrics-dashboard-payload-event-logging" class="hash-link" aria-label="Direct link to 📈 Model Metrics, Metrics Dashboard, Payload Event Logging" title="Direct link to 📈 Model Metrics, Metrics Dashboard, Payload Event Logging" translate="no">​</a></h3>
<p>ModelMesh v0.11 introduces a new configuration option to emit a subset of useful metrics at the individual model level. These metrics can help identify outlier or "heavy hitter" models and consequently fine-tune the deployments of those inference services, such as allocating more resources or increasing the number of replicas for improved responsiveness, or avoiding frequent cache misses.</p>
<p>A new <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md#import-the-grafana-dashboard" target="_blank" rel="noopener noreferrer" class="">Grafana dashboard</a> was added to display the comprehensive set of <a href="https://github.com/kserve/modelmesh-serving/blob/main/docs/monitoring.md" target="_blank" rel="noopener noreferrer" class="">Prometheus metrics</a> like model loading
and unloading rates, internal queuing delays, capacity and usage, cache state, etc. to monitor the general health of the ModelMesh Serving deployment.</p>
<p>The new <a href="https://github.com/kserve/modelmesh/blob/main/src/main/java/com/ibm/watson/modelmesh/payload/" target="_blank" rel="noopener noreferrer" class=""><code>PayloadProcessor</code> interface</a> can be implemented to log prediction requests and responses, to create data sinks for data visualization, for model quality assessment, or for drift and outlier detection by external monitoring systems.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h2>
<ul>
<li class="">
<p>To allow longer InferenceService names despite DNS max-length limits (see the <a href="https://github.com/kserve/kserve/issues/1397" target="_blank" rel="noopener noreferrer" class="">issue</a>), the <code>Default</code> suffix has been removed from the inference service component (predictor/transformer/explainer) names of newly created InferenceServices.
This affects clients that use the component URL directly instead of the top-level InferenceService URL.</p>
</li>
<li class="">
<p><code>Status.address.url</code> is now consistent across serverless and raw deployment modes: the URL path portion is dropped in serverless mode.</p>
</li>
<li class="">
<p>Raw bytes are now accepted in the v1 protocol. For a JSON payload to be recognized and decoded, the <code>Content-Type</code> header must be set to <code>application/json</code>.</p>
</li>
</ul>
<div class="language-bash codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-bash codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token plain">curl -v -H "Content-Type: application/json" http://sklearn-iris.kserve-test.${CUSTOM_DOMAIN}/v1/models/sklearn-iris:predict -d @./iris-input.json</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.11.0" target="_blank" rel="noopener noreferrer" class="">GitHub release pages</a> for KServe v0.11 and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.11.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh v0.11</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers along with both regular and new contributors</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>Working Group</strong>: All members of the KServe Working Group for their ongoing collaboration</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.11-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.10.0]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release"/>
        <updated>2023-02-05T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.10 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on February 5, 2023</em></p>
<p>We are excited to announce KServe 0.10 release. In this release we have enabled more KServe networking options, improved KServe telemetry for supported serving runtimes and increased support coverage for <a href="https://kserve.github.io/archive/0.10/modelserving/data_plane/v2_protocol/" target="_blank" rel="noopener noreferrer" class="">Open(aka v2) inference protocol</a> for both standard and ModelMesh InferenceService.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-networking-options">🌐 KServe Networking Options<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-networking-options" class="hash-link" aria-label="Direct link to 🌐 KServe Networking Options" title="Direct link to 🌐 KServe Networking Options" translate="no">​</a></h2>
<p>Istio is now optional for both <a href="https://kserve.github.io/archive/0.10/admin/serverless/serverless/" target="_blank" rel="noopener noreferrer" class="">Serverless</a> and <a href="https://kserve.github.io/archive/0.10/admin/kubernetes_deployment/" target="_blank" rel="noopener noreferrer" class="">RawDeployment</a> mode. Please see the <a href="https://kserve.github.io/archive/0.10/admin/serverless/kourier_networking/" target="_blank" rel="noopener noreferrer" class="">alternative networking guide</a> for how you can enable other ingress options supported by Knative with Serverless mode.
For Istio users, if you want to turn on full service mesh mode to secure InferenceService with mutual TLS and enable the traffic policies, please read the <a href="https://kserve.github.io/archive/0.10/admin/serverless/servicemesh/" target="_blank" rel="noopener noreferrer" class="">service mesh setup guideline</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-telemetry-for-serving-runtimes">📊 KServe Telemetry for Serving Runtimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-telemetry-for-serving-runtimes" class="hash-link" aria-label="Direct link to 📊 KServe Telemetry for Serving Runtimes" title="Direct link to 📊 KServe Telemetry for Serving Runtimes" translate="no">​</a></h2>
<p>We have instrumented additional latency metrics in KServe Python ServingRuntimes for <code>preprocess</code>, <code>predict</code> and <code>postprocess</code> handlers.
In Serverless mode we have extended Knative <code>queue-proxy</code> to enable metrics aggregation for both metrics exposed in <code>queue-proxy</code> and <code>kserve-container</code> from each <code>ServingRuntime</code>.
Please read the <a href="https://kserve.github.io/archive/0.10/modelserving/observability/prometheus_metrics/" target="_blank" rel="noopener noreferrer" class="">prometheus metrics setup guideline</a> for how to enable the metrics scraping and aggregations.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-openv2-inference-protocol-support-coverage">🚀 Open(v2) Inference Protocol Support Coverage<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-openv2-inference-protocol-support-coverage" class="hash-link" aria-label="Direct link to 🚀 Open(v2) Inference Protocol Support Coverage" title="Direct link to 🚀 Open(v2) Inference Protocol Support Coverage" translate="no">​</a></h2>
<p>As adoption of the <code>KServe v2 Inference Protocol</code> keeps growing, from the <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/amd/" target="_blank" rel="noopener noreferrer" class="">AMD Inference ServingRuntime</a>, which
supports FPGAs, to OpenVINO, which now provides a KServe-compatible <a href="https://docs.openvino.ai/latest/ovms_docs_rest_api_kfs.html" target="_blank" rel="noopener noreferrer" class="">REST</a> and <a href="https://docs.openvino.ai/latest/ovms_docs_grpc_api_kfs.html" target="_blank" rel="noopener noreferrer" class="">gRPC</a> API,
we have proposed in <a href="https://github.com/kserve/kserve/issues/2663" target="_blank" rel="noopener noreferrer" class="">the issue</a> to rename it to the <code>KServe Open Inference Protocol</code>.</p>
<p>In KServe 0.10, we have added Open(v2) inference protocol support for KServe custom runtimes.
You can now enable v2 REST/gRPC for both custom transformers and predictors with images built by implementing the KServe Python SDK API.
gRPC enables a high-performance inference data plane: it is built on top of HTTP/2 and binary data transport, which is more efficient to send over the wire than REST.
Please see the detailed examples for the <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/transformer/torchserve_image_transformer/" target="_blank" rel="noopener noreferrer" class="">transformer</a> and
the <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/custom/custom_model/" target="_blank" rel="noopener noreferrer" class="">predictor</a>.</p>
<div class="language-python codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-python codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token keyword" style="color:#00009f">from</span><span class="token plain"> kserve </span><span class="token keyword" style="color:#00009f">import</span><span class="token plain"> Model</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">byte_array</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image_processing </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Compose</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">ToTensor</span><span class="token punctuation" style="color:#393A34">(</span><span class="token 
punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        transforms</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">Normalize</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.1307</span><span class="token punctuation" style="color:#393A34">,</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">(</span><span class="token number" style="color:#36acaa">0.3081</span><span class="token punctuation" style="color:#393A34">,</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    image </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> Image</span><span class="token punctuation" style="color:#393A34">.</span><span class="token builtin">open</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">io</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">BytesIO</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">byte_array</span><span class="token punctuation" 
style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    tensor </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> image_processing</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">image</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">numpy</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> tensor</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomModel</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Model</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">predict</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token 
plain"> request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Dict</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> InferResponse</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">instance</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> instance </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inputs</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span 
class="token punctuation" style="color:#393A34">.</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">asarray</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        output </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">model</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        torch</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">nn</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">functional</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">softmax</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">output</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> dim</span><span class="token operator" style="color:#393A34">=</span><span class="token number" style="color:#36acaa">1</span><span class="token punctuation" style="color:#393A34">)</span><span 
class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        values</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> top_5 </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> torch</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">topk</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">output</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token number" style="color:#36acaa">5</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        result </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> values</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">flatten</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">tolist</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        response_id </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> generate_uuid</span><span class="token punctuation" style="color:#393A34">(</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_output </span><span class="token operator" 
style="color:#393A34">=</span><span class="token plain"> InferOutput</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"output-0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> shape</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">values</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">shape</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> datatype</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"FP32"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">result</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_response </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> InferResponse</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">model_name</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> infer_outputs</span><span class="token operator" style="color:#393A34">=</span><span class="token punctuation" 
style="color:#393A34">[</span><span class="token plain">infer_output</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> response_id</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">response_id</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> infer_response</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain" style="display:inline-block"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token keyword" style="color:#00009f">class</span><span class="token plain"> </span><span class="token class-name">CustomTransformer</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">Model</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token keyword" style="color:#00009f">def</span><span class="token plain"> </span><span class="token function" style="color:#d73a49">preprocess</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> headers</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> Dict</span><span class="token punctuation" style="color:#393A34">[</span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> </span><span class="token builtin">str</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token operator" style="color:#393A34">-</span><span class="token operator" style="color:#393A34">&gt;</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">image_transform</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">instance</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"> </span><span class="token keyword" style="color:#00009f">for</span><span class="token plain"> instance </span><span class="token keyword" style="color:#00009f">in</span><span class="token plain"> request</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">inputs</span><span class="token punctuation" style="color:#393A34">[</span><span class="token number" style="color:#36acaa">0</span><span class="token punctuation" style="color:#393A34">]</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">data</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token 
plain">        input_tensors </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> np</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">asarray</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_inputs </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">[</span><span class="token plain">InferInput</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">name</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">"INPUT__0"</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> datatype</span><span class="token operator" style="color:#393A34">=</span><span class="token string" style="color:#e3116c">'FP32'</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> shape</span><span class="token operator" style="color:#393A34">=</span><span class="token builtin">list</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">input_tensors</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">shape</span><span class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">                                   data</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">input_tensors</span><span 
class="token punctuation" style="color:#393A34">)</span><span class="token punctuation" style="color:#393A34">]</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        infer_request </span><span class="token operator" style="color:#393A34">=</span><span class="token plain"> InferRequest</span><span class="token punctuation" style="color:#393A34">(</span><span class="token plain">model_name</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">self</span><span class="token punctuation" style="color:#393A34">.</span><span class="token plain">model_name</span><span class="token punctuation" style="color:#393A34">,</span><span class="token plain"> infer_inputs</span><span class="token operator" style="color:#393A34">=</span><span class="token plain">infer_inputs</span><span class="token punctuation" style="color:#393A34">)</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token keyword" style="color:#00009f">return</span><span class="token plain"> infer_request</span><br></span></code></pre></div></div>
<p>You can use the same Python API types <code>InferRequest</code> and <code>InferResponse</code> for both the REST and gRPC protocols. KServe handles the underlying decoding and encoding according to the protocol.</p>
<p>⚠️ <strong>Warning</strong>: A new <code>headers</code> argument has been added to the custom handlers to pass HTTP/gRPC headers or other metadata. You can also use it as a context dict to pass data between handlers.
If you have an existing custom transformer or predictor, you must now add the <code>headers</code> argument to the <code>preprocess</code>, <code>predict</code>, and <code>postprocess</code> handlers.</p>
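<p>The context-dict pattern can be illustrated with a minimal sketch. The plain functions below stand in for the actual KServe <code>Model</code> handlers, and the header key is illustrative: values written in <code>preprocess</code> remain visible to later handlers for the same request.</p>

```python
# Minimal sketch (plain functions, not the actual KServe Model class) showing
# how the new `headers` argument can double as a per-request context dict.
from typing import Dict, List

def preprocess(payload: List[float], headers: Dict[str, str]) -> List[float]:
    # Stash metadata for later handlers; the key name is illustrative.
    headers["x-input-size"] = str(len(payload))
    return [x * 2.0 for x in payload]

def predict(tensors: List[float], headers: Dict[str, str]) -> List[float]:
    # A downstream handler can read what preprocess stored earlier.
    assert "x-input-size" in headers
    return [sum(tensors)]

headers: Dict[str, str] = {}
result = predict(preprocess([1.0, 2.0], headers), headers)
print(result)  # [6.0]
```

In the real API the same dict instance is passed through <code>preprocess</code>, <code>predict</code>, and <code>postprocess</code>, which is what makes it usable as shared per-request state.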
<p>Please check the following matrix for supported ModelFormats and <a href="https://kserve.github.io/archive/0.10/modelserving/v1beta1/serving_runtime/" target="_blank" rel="noopener noreferrer" class="">ServingRuntimes</a>.</p>
<table><thead><tr><th>Model Format</th><th>v1</th><th>Open(v2) REST/gRPC</th></tr></thead><tbody><tr><td>Tensorflow</td><td>✅ TFServing</td><td>✅ Triton</td></tr><tr><td>PyTorch</td><td>✅ TorchServe</td><td>✅ TorchServe</td></tr><tr><td>TorchScript</td><td>✅ TorchServe</td><td>✅ Triton</td></tr><tr><td>ONNX</td><td>❌</td><td>✅ Triton</td></tr><tr><td>Scikit-learn</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>XGBoost</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>LightGBM</td><td>✅ KServe</td><td>✅ MLServer</td></tr><tr><td>MLFlow</td><td>❌</td><td>✅ MLServer</td></tr><tr><td>Custom</td><td>✅ KServe</td><td>✅ KServe</td></tr></tbody></table>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-multi-arch-image-support">🏗️ Multi-Arch Image Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#%EF%B8%8F-multi-arch-image-support" class="hash-link" aria-label="Direct link to 🏗️ Multi-Arch Image Support" title="Direct link to 🏗️ Multi-Arch Image Support" translate="no">​</a></h2>
<p>KServe control plane images <a href="https://hub.docker.com/r/kserve/kserve-controller/tags" target="_blank" rel="noopener noreferrer" class="">kserve-controller</a>,
<a href="https://hub.docker.com/r/kserve/agent/tags" target="_blank" rel="noopener noreferrer" class="">kserve/agent</a>, <a href="https://hub.docker.com/r/kserve/router/tags" target="_blank" rel="noopener noreferrer" class="">kserve/router</a> are now supported
for multiple architectures: <code>ppc64le</code>, <code>arm64</code>, <code>amd64</code>, <code>s390x</code>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-storage-credentials-support">🔐 KServe Storage Credentials Support<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-kserve-storage-credentials-support" class="hash-link" aria-label="Direct link to 🔐 KServe Storage Credentials Support" title="Direct link to 🔐 KServe Storage Credentials Support" translate="no">​</a></h2>
<ul>
<li class="">Currently, AWS users need to create a secret with long-term/static IAM credentials to download models stored in S3.
The security best practice is to use an <a href="https://aws.amazon.com/blogs/opensource/introducing-fine-grained-iam-roles-service-accounts/" target="_blank" rel="noopener noreferrer" class="">IAM role for service accounts (IRSA)</a>,
which enables automatic credential rotation and fine-grained access control; see how to <a href="https://kserve.github.io/archive/0.10/modelserving/storage/s3/s3/#create-service-account-with-iam-role" target="_blank" rel="noopener noreferrer" class="">set up IRSA</a>.</li>
<li class="">Support Azure Blobs with <a href="https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-manage-user-assigned-managed-identities?pivots=identity-mi-methods-azcli" target="_blank" rel="noopener noreferrer" class="">managed identity</a>.</li>
</ul>
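<p>An IRSA setup looks roughly like the following sketch. The role ARN, bucket, and resource names are placeholders: a ServiceAccount annotated with the IAM role is referenced from the predictor spec via <code>serviceAccountName</code>, so no static-credentials secret is needed.</p>

```yaml
# Illustrative only -- the role ARN and bucket are placeholders.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: s3-read-only
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/s3-read-only
---
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
  name: "sklearn-s3"
spec:
  predictor:
    serviceAccountName: s3-read-only  # IRSA: no static credentials secret required
    sklearn:
      storageUri: "s3://example-bucket/model"
```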
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">📊 ModelMesh Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 📊 ModelMesh Updates" title="Direct link to 📊 ModelMesh Updates" translate="no">​</a></h2>
<p>ModelMesh has continued its integration as KServe's multi-model serving backend, introducing improvements and features that better align the two projects. For example, it now supports ClusterServingRuntimes, allowing the use of cluster-scoped ServingRuntimes, which were originally introduced in KServe 0.8.</p>
<p>Additionally, ModelMesh introduced support for TorchServe, enabling users to serve arbitrary PyTorch models (e.g. eager-mode models) in the context of distributed multi-model serving.</p>
<p>Other limitations have been addressed as well, such as adding support for BYTES/string type tensors in the REST inference API for requests that require them.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">GitHub release pages</a> for KServe v0.10 and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.10.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh v0.10</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/sel" target="_blank" rel="noopener noreferrer" class="">Steve Larkin</a></li>
<li class=""><a href="https://github.com/stephanschielke" target="_blank" rel="noopener noreferrer" class="">Stephan Schielke</a></li>
<li class=""><a href="https://github.com/cmaddalozzo" target="_blank" rel="noopener noreferrer" class="">Curtis Maddalozzo</a></li>
<li class=""><a href="https://github.com/laozc" target="_blank" rel="noopener noreferrer" class="">Zhongcheng Lao</a></li>
<li class=""><a href="https://github.com/dimara" target="_blank" rel="noopener noreferrer" class="">Dimitris Aragiorgis</a></li>
<li class=""><a href="https://github.com/panli889" target="_blank" rel="noopener noreferrer" class="">Pan Li</a></li>
<li class=""><a href="https://github.com/tjandy98" target="_blank" rel="noopener noreferrer" class="">tjandy98</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/rachitchauhan43" target="_blank" rel="noopener noreferrer" class="">Rachit Chauhan</a></li>
<li class=""><a href="https://github.com/rafvasq" target="_blank" rel="noopener noreferrer" class="">Rafael Vasquez</a></li>
<li class=""><a href="https://github.com/TimKleinloog" target="_blank" rel="noopener noreferrer" class="">Tim Kleinloog</a></li>
<li class=""><a href="https://github.com/ckadner" target="_blank" rel="noopener noreferrer" class="">Christian Kadner</a></li>
<li class=""><a href="https://github.com/ddelange" target="_blank" rel="noopener noreferrer" class="">ddelange</a></li>
<li class=""><a href="https://github.com/lizzzcai" target="_blank" rel="noopener noreferrer" class="">Lize Cai</a></li>
<li class=""><a href="https://github.com/park12sj" target="_blank" rel="noopener noreferrer" class="">sangjune.park</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkeran</a></li>
<li class=""><a href="https://github.com/MessKon" target="_blank" rel="noopener noreferrer" class="">Konstantinos Messis</a></li>
<li class=""><a href="https://github.com/matty-rose" target="_blank" rel="noopener noreferrer" class="">Matt Rose</a></li>
<li class=""><a href="https://github.com/alexagriffith" target="_blank" rel="noopener noreferrer" class="">Alexa Griffith</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh J</a></li>
<li class=""><a href="https://github.com/alembiewski" target="_blank" rel="noopener noreferrer" class="">Alex Lembiyeuski</a></li>
<li class=""><a href="https://github.com/tenzen-y" target="_blank" rel="noopener noreferrer" class="">Yuki Iwai</a></li>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/xfu83" target="_blank" rel="noopener noreferrer" class="">Xin Fu</a></li>
<li class=""><a href="https://github.com/adilhusain-s" target="_blank" rel="noopener noreferrer" class="">adilhusain-s</a></li>
<li class=""><a href="https://github.com/pranavpandit1" target="_blank" rel="noopener noreferrer" class="">Pranav Pandit</a></li>
<li class=""><a href="https://github.com/C1berwiz" target="_blank" rel="noopener noreferrer" class="">C1berwiz</a></li>
<li class=""><a href="https://github.com/dilverse" target="_blank" rel="noopener noreferrer" class="">dilverse</a></li>
<li class=""><a href="https://github.com/terrytangyuan" target="_blank" rel="noopener noreferrer" class="">Yuan Tang</a></li>
<li class=""><a href="https://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and working group members</p>
<p><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.10-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community GitHub repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.9.0]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release"/>
        <updated>2022-07-21T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.9 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on July 21, 2022</em></p>
<p>Today, we are pleased to announce the v0.9.0 release of KServe! <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">KServe</a> has now fully onboarded to <a href="https://lfaidata.foundation/" target="_blank" rel="noopener noreferrer" class="">LF AI &amp; Data Foundation</a> as an <a href="https://lfaidata.foundation/projects/kserve" target="_blank" rel="noopener noreferrer" class="">Incubation Project</a>! 🎉</p>
<p>In this release we are excited to introduce the new <code>InferenceGraph</code> feature, which has long been requested by the community. Continuing the effort from the last release to unify the InferenceService API for deploying models on KServe and ModelMesh, ModelMesh is now fully compatible with the KServe InferenceService API!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-introducing-inferencegraph">🚀 Introducing InferenceGraph<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-introducing-inferencegraph" class="hash-link" aria-label="Direct link to 🚀 Introducing InferenceGraph" title="Direct link to 🚀 Introducing InferenceGraph" translate="no">​</a></h2>
<p>ML inference systems are becoming bigger and more complex; they often consist of many models working together to make a single prediction.
Common use cases include image classification and multi-stage natural language processing pipelines. For example, an image classification pipeline may need to run a top-level classification first, then further downstream classification based on the previous prediction results.</p>
<p>KServe is uniquely positioned to build distributed inference graphs thanks to its native integration of InferenceServices, a standard inference protocol for chaining models, and serverless auto-scaling capabilities. KServe leverages these strengths in the InferenceGraph, enabling users to deploy complex ML inference pipelines to production in a declarative and scalable way.</p>
<p>An <strong>InferenceGraph</strong> is made up of a list of routing nodes, each consisting of a set of routing steps. Each step can route either to an InferenceService or to another node defined in the graph, which makes the InferenceGraph highly composable.
The graph router is deployed behind an HTTP endpoint and can be scaled dynamically based on request volume. The InferenceGraph supports four types of routing nodes: <strong>Sequence</strong>, <strong>Switch</strong>, <strong>Ensemble</strong>, and <strong>Splitter</strong>.</p>
<p><img decoding="async" loading="lazy" alt="InferenceGraph" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/inference_graph-c394dbbe6fb6a1ff7f03706f82566247.png" width="1962" height="834" class="img_ev3q"></p>
<ul>
<li class=""><strong>Sequence Node</strong>: It allows users to define multiple <code>Steps</code> with <code>InferenceServices</code> or <code>Nodes</code> as routing targets in a sequence. The <code>Steps</code> are executed in sequence, and the request/response from the previous step can be passed to the next step as input, based on configuration.</li>
<li class=""><strong>Switch Node</strong>: It allows users to define routing conditions and select a <code>Step</code> to execute if it matches the condition. The response is returned as soon as it finds the first step that matches the condition. If no condition is matched, the graph returns the original request.</li>
<li class=""><strong>Ensemble Node</strong>: A model ensemble requires scoring each model separately and then combining the results into a single prediction response. Different combination methods can then be used to produce the final result. Multiple classification trees, for example, are commonly combined using a "majority vote" method, while multiple regression trees are often combined using various averaging techniques.</li>
<li class=""><strong>Splitter Node</strong>: It allows users to split the traffic to multiple targets using a weighted distribution.</li>
</ul>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceService"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"cat-dog-classifier"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key 
atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">pytorch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//kfserving</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">examples/models/torchserve/cat_dog_classification</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token 
key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1beta1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceService"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"dog-breed-classifier"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">pytorch</span><span class="token punctuation" style="color:#393A34">:</span><span class="token 
plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 100m</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> gs</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//kfserving</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">examples/models/torchserve/dog_breed_classification</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token punctuation" style="color:#393A34">---</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"serving.kserve.io/v1alpha1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key 
atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"InferenceGraph"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"dog-breed-pipeline"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">nodes</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">root</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">routerType</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> Sequence</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span 
class="token key atrule" style="color:#00a4db">steps</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cat</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">dog</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">classifier</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> cat_dog_classifier </span><span class="token comment" style="color:#999988;font-style:italic"># step name</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">serviceName</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> dog</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">breed</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">classifier</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> dog_breed_classifier</span><br></span><span class="token-line" 
style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">data</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> $request</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">condition</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"[@this].#(predictions.0==\"dog\")"</span><br></span></code></pre></div></div>
<p>Currently, <code>InferenceGraph</code> is supported only in the <code>Serverless</code> deployment mode. You can try it out by following the <a href="https://kserve.github.io/archive/0.9/modelserving/inference_graph/image_pipeline/" target="_blank" rel="noopener noreferrer" class="">tutorial</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-inferenceservice-api-for-modelmesh">🔗 InferenceService API for ModelMesh<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-inferenceservice-api-for-modelmesh" class="hash-link" aria-label="Direct link to 🔗 InferenceService API for ModelMesh" title="Direct link to 🔗 InferenceService API for ModelMesh" translate="no">​</a></h2>
<p>The InferenceService CRD is now the primary interface for interacting with ModelMesh. Several changes were made to the InferenceService spec to better accommodate ModelMesh's needs.</p>
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-storage-spec">💾 Storage Spec<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-storage-spec" class="hash-link" aria-label="Direct link to 💾 Storage Spec" title="Direct link to 💾 Storage Spec" translate="no">​</a></h3>
<p>To unify how model storage is defined for both single and multi-model serving, a new storage spec was added to the predictor model spec. With this storage spec, users can specify a key inside a common secret that holds configuration and credentials for each of the storage backends from which models can be loaded. Example:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">storage</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> localMinIO </span><span class="token comment" style="color:#999988;font-style:italic"># Credential key for the destination storage in the common secret</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn </span><span class="token comment" style="color:#999988;font-style:italic"># Model path inside the bucket</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token comment" style="color:#999988;font-style:italic"># schemaPath: null # Optional schema files for payload schema</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">parameters</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token comment" style="color:#999988;font-style:italic"># Parameters to override the default 
values inside the common secret.</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">bucket</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">models</span><br></span></code></pre></div></div>
<p>Learn more <a href="https://github.com/kserve/kserve/tree/release-0.9/docs/samples/storage/storageSpec" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
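<p>The <code>key</code> above points at an entry in a shared credentials secret. As an illustrative sketch (the secret name <code>storage-config</code> and all field values here are assumptions for this example; see the linked sample for the exact format expected by your installation), such a secret might look like:</p>

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: storage-config        # common secret holding per-backend credentials
stringData:
  localMinIO: |               # credential key referenced by storage.key
    {
      "type": "s3",
      "endpoint_url": "http://minio:9000",
      "access_key_id": "minioadmin",
      "secret_access_key": "minioadmin",
      "default_bucket": "example-models"
    }
```

Each top-level key in the secret corresponds to one storage backend, and <code>storage.parameters</code> can override individual fields such as <code>bucket</code> per InferenceService.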
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-model-status">📊 Model Status<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-model-status" class="hash-link" aria-label="Direct link to 📊 Model Status" title="Direct link to 📊 Model Status" translate="no">​</a></h3>
<p>For further alignment between ModelMesh and KServe, several additions were made to the InferenceService status. There is now a <code>Model Status</code> section, which contains information about the model loaded in the predictor. The new fields include:</p>
<ul>
<li class=""><code>states</code> - State information of the predictor's model.</li>
<li class=""><code>activeModelState</code> - The state of the model currently being served by the predictor's endpoints.</li>
<li class=""><code>targetModelState</code> - This will be set only when <code>transitionStatus</code> is not <code>UpToDate</code>, meaning that the target model differs from the currently-active model.</li>
<li class=""><code>transitionStatus</code> - Indicates state of the predictor relative to its current spec.</li>
<li class=""><code>modelCopies</code> - Model copy information of the predictor's model.</li>
<li class=""><code>lastFailureInfo</code> - Details about the most recent error associated with this predictor. Not all of the contained fields will necessarily have a value.</li>
</ul>
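<p>Put together, these fields might surface in an InferenceService status roughly as follows (an illustrative sketch, not verbatim controller output):</p>

```yaml
status:
  modelStatus:
    transitionStatus: UpToDate      # target model matches the active model
    states:
      activeModelState: Loaded      # model currently served by the endpoints
      targetModelState: ""          # only set when transitionStatus is not UpToDate
    modelCopies:
      failedCopies: 0
      totalCopies: 1
```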
<h3 class="anchor anchorTargetStickyNavbar_Vzrq" id="-deploying-on-modelmesh">🚢 Deploying on ModelMesh<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-deploying-on-modelmesh" class="hash-link" aria-label="Direct link to 🚢 Deploying on ModelMesh" title="Direct link to 🚢 Deploying on ModelMesh" translate="no">​</a></h3>
<p>To deploy InferenceServices on ModelMesh, the ModelMesh and KServe controllers still require the <code>serving.kserve.io/deploymentMode: ModelMesh</code> annotation.
A complete example of an InferenceService using the new storage spec is shown below:</p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">tensorflow</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">mnist</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">annotations</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" 
style="color:#00a4db">serving.kserve.io/deploymentMode</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ModelMesh</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> tensorflow</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storage</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">key</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 
localMinIO</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">path</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> tensorflow/mnist.savedmodel</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-other-new-features">🛠️ Other New Features<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#%EF%B8%8F-other-new-features" class="hash-link" aria-label="Direct link to 🛠️ Other New Features" title="Direct link to 🛠️ Other New Features" translate="no">​</a></h2>
<ul>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/modelserving/v1beta1/mlflow/v2/" target="_blank" rel="noopener noreferrer" class="">serving MLFlow model format</a> via MLServer serving runtime.</li>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/modelserving/autoscaling/autoscaling/" target="_blank" rel="noopener noreferrer" class="">unified autoscaling target and metric fields</a> for InferenceService components with both Serverless and RawDeployment mode.</li>
<li class="">Support <a href="https://kserve.github.io/archive/0.9/admin/kubernetes_deployment/" target="_blank" rel="noopener noreferrer" class="">InferenceService ingress class and url domain template configuration</a> for RawDeployment mode.</li>
<li class="">ModelMesh now has a default <a href="https://github.com/openvinotoolkit/model_server" target="_blank" rel="noopener noreferrer" class="">OpenVINO Model Server</a> ServingRuntime.</li>
</ul>
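<p>As a sketch of the unified autoscaling fields mentioned above (the service name and values here are illustrative; see the linked documentation for the full set of options), the target and metric can be set directly on a component spec:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 5
    scaleMetric: cpu      # metric used to autoscale this component
    scaleTarget: 70       # target value for the chosen metric
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```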
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed?<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed?" title="Direct link to ⚠️ What's Changed?" translate="no">​</a></h2>
<ul>
<li class="">The KServe controller manager is changed from StatefulSet to Deployment to support HA mode.</li>
<li class="">log4j security vulnerability fix</li>
<li class="">Upgrade TorchServe serving runtime to 0.6.0</li>
<li class="">Update MLServer serving runtime to 1.0.0</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes, including all changes, bug fixes, and known issues, visit the GitHub release pages for <a href="https://github.com/kserve/kserve/releases/tag/v0.9.0" target="_blank" rel="noopener noreferrer" class="">KServe</a> and <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.9.0" target="_blank" rel="noopener noreferrer" class="">ModelMesh</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and working group members</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
<li class=""><strong>LF AI &amp; Data Foundation</strong>: For supporting KServe's journey as an incubation project</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.9-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://github.com/kserve/community?tab=readme-ov-file#questions-and-issues" target="_blank" rel="noopener noreferrer" class="">#kserve</a>)</li>
<li class="">Attend our community meeting by subscribing to the <a href="https://zoom-lfx.platform.linuxfoundation.org/meetings/kserve?view=month" target="_blank" rel="noopener noreferrer" class="">KServe calendar</a>.</li>
<li class="">View our <a href="https://github.com/kserve/community" target="_blank" rel="noopener noreferrer" class="">community github repository</a> to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.8]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release"/>
        <updated>2022-02-18T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.8 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on February 18, 2022</em></p>
<p>Today, we are pleased to announce the v0.8.0 release of KServe! While the last release was focused on the <a href="https://blog.kubeflow.org/release/official/2021/09/27/kfserving-transition.html" target="_blank" rel="noopener noreferrer" class="">transition</a> of KFServing to KServe, this release was focused on unifying the InferenceService API for deploying models on KServe and ModelMesh.</p>
<blockquote>
<p><strong>Note</strong>: For current users of KFServing/KServe, please take a few minutes to answer this <a href="https://groups.google.com/g/kubeflow-discuss/c/B0trz3qZiJE" target="_blank" rel="noopener noreferrer" class="">short survey</a> and provide your feedback!</p>
</blockquote>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class=""><strong>ONNX Runtime Server</strong> has been removed from the supported serving runtime list. KServe by default now uses the <strong>Triton Inference Server</strong> to serve ONNX models.</li>
<li class="">KServe's <strong>PyTorchServer</strong> has been removed from the supported serving runtime list. KServe by default now uses <strong>TorchServe</strong> to serve PyTorch models.</li>
<li class="">A few main KServe SDK class names have been changed:<!-- -->
<ul>
<li class=""><strong>KFModel</strong> is renamed to <strong>Model</strong></li>
<li class=""><strong>KFServer</strong> is renamed to <strong>ModelServer</strong></li>
<li class=""><strong>KFModelRepository</strong> is renamed to <strong>ModelRepository</strong></li>
</ul>
</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">🚀 What's New<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-whats-new" class="hash-link" aria-label="Direct link to 🚀 What's New" title="Direct link to 🚀 What's New" translate="no">​</a></h2>
<p>Some notable updates are:</p>
<ul>
<li class=""><strong>ClusterServingRuntime</strong> and <strong>ServingRuntime</strong> CRDs are introduced. Learn more <a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-servingruntimes-and-clusterservingruntimes" class="">below</a>.</li>
<li class="">A new <strong>Model Spec</strong> was introduced to the InferenceService Predictor Spec as a new way to specify models. Learn more <a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-updated-inferenceservice-predictor-spec" class="">below</a>.</li>
<li class=""><strong>Knative 1.0</strong> is now supported and certified for the KServe Serverless installation.</li>
<li class=""><strong>gRPC</strong> is now supported for transformer to predictor network communication.</li>
<li class=""><strong>TorchServe</strong> Serving runtime has been updated to 0.5.2 which now supports the KServe V2 REST protocol.</li>
<li class=""><strong>ModelMesh</strong> now has multi-namespace support, and users can now deploy GCS or HTTP(S) hosted models.</li>
</ul>
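<p>As a brief sketch of the new Model Spec mentioned above (the service name and storage URI here are illustrative), a predictor can now declare a model format instead of a framework-specific field, and KServe matches the format against the available serving runtimes:</p>

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-sklearn-isvc
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn   # matched against supportedModelFormats of the runtimes
      storageUri: s3://example-bucket/sklearn/model
```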
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-servingruntimes-and-clusterservingruntimes">🔧 ServingRuntimes and ClusterServingRuntimes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-servingruntimes-and-clusterservingruntimes" class="hash-link" aria-label="Direct link to 🔧 ServingRuntimes and ClusterServingRuntimes" title="Direct link to 🔧 ServingRuntimes and ClusterServingRuntimes" translate="no">​</a></h2>
<p>This release introduces two new CRDs, <em>ServingRuntimes</em> and <em>ClusterServingRuntimes</em>; the only difference between the two is that one is namespace-scoped and the other is cluster-scoped. A ServingRuntime defines the template for Pods that can serve one or more particular model formats. Each ServingRuntime specifies key information such as the container image of the runtime and a list of the model formats that the runtime supports.</p>
<p>In previous versions of KServe, supported predictor formats and container images were defined in a <a href="https://github.com/kserve/kserve/blob/release-0.7/config/configmap/inferenceservice.yaml#L7" target="_blank" rel="noopener noreferrer" class="">config map</a> in the control plane namespace. The ServingRuntime CRDs allow for improved flexibility and extensibility, letting you define or customize runtimes as you see fit without modifying any controller code or any resources in the controller namespace.</p>
<p>Several out-of-the-box ClusterServingRuntimes are provided with KServe so that users can continue to use KServe as they did before, without having to define the runtimes themselves.</p>
<p><strong>Example SKLearn ClusterServingRuntime:</strong></p>
<div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1alpha1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> ClusterServingRuntime</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearnserver</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">supportedModelFormats</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">version</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">autoSelect</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token boolean important" style="color:#36acaa">true</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">containers</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> kserve</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">container</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">image</span><span class="token punctuation" 
style="color:#393A34">:</span><span class="token plain"> kserve/sklearnserver</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">latest</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">args</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_name=</span><span class="token punctuation" style="color:#393A34">{</span><span class="token punctuation" style="color:#393A34">{</span><span class="token plain">.Name</span><span class="token punctuation" style="color:#393A34">}</span><span class="token punctuation" style="color:#393A34">}</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">model_dir=/mnt/models</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain"> </span><span class="token punctuation" style="color:#393A34">-</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">http_port=8080</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" 
style="color:#00a4db">resources</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">requests</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">limits</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">cpu</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> </span><span class="token string" style="color:#e3116c">"1"</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">          </span><span class="token key atrule" style="color:#00a4db">memory</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> 2Gi</span><br></span></code></pre></div></div>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-updated-inferenceservice-predictor-spec">📋 Updated InferenceService Predictor Spec<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-updated-inferenceservice-predictor-spec" class="hash-link" aria-label="Direct link to 📋 Updated InferenceService Predictor Spec" title="Direct link to 📋 Updated InferenceService Predictor Spec" translate="no">​</a></h2>
<p>A new Model spec was also introduced as part of the Predictor spec for InferenceServices. Previously, the InferenceService CRD had grown unwieldy because each model serving runtime was a separate object in the Predictor spec. This duplicated many fields across the schema and bloated the overall size of the CRD. Adding support for a new model serving framework meant modifying the CRD and, subsequently, the controller code.</p>
<p>Now, with the Model spec, a user specifies a model format and, optionally, a corresponding version. The KServe control plane then automatically selects and uses the <em>ClusterServingRuntime</em> or <em>ServingRuntime</em> that supports the given format. Each <em>ServingRuntime</em> maintains a list of supported model formats and versions; if a format sets <code>autoSelect</code> to <code>true</code>, that <em>ServingRuntime</em> becomes eligible for automatic model placement for that format.</p>
<div class="theme-tabs-container tabs-container tabList__CuJ"><ul role="tablist" aria-orientation="horizontal" class="tabs"><li role="tab" tabindex="0" aria-selected="true" class="tabs__item tabItem_LNqP tabs__item--active">New Schema</li><li role="tab" tabindex="-1" aria-selected="false" class="tabs__item tabItem_LNqP">Previous Schema</li></ul><div class="margin-top--md"><div role="tabpanel" class="tabItem_Ymn6"><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearn</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">isvc</span><br></span><span 
class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">model</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">modelFormat</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">        </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> sklearn</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> s3</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//bucket/sklearn/mnist.joblib</span><br></span></code></pre></div></div></div><div role="tabpanel" class="tabItem_Ymn6" hidden=""><div class="language-yaml codeBlockContainer_Ckt0 theme-code-block" style="--prism-color:#393A34;--prism-background-color:#f6f8fa"><div class="codeBlockContent_QJqH"><pre tabindex="0" class="prism-code language-yaml codeBlock_bY9V 
thin-scrollbar" style="color:#393A34;background-color:#f6f8fa"><code class="codeBlockLines_e6Vv"><span class="token-line" style="color:#393A34"><span class="token key atrule" style="color:#00a4db">apiVersion</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> serving.kserve.io/v1beta1</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">kind</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> InferenceService</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">metadata</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">name</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> example</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">sklearn</span><span class="token punctuation" style="color:#393A34">-</span><span class="token plain">isvc</span><br></span><span class="token-line" style="color:#393A34"><span class="token plain"></span><span class="token key atrule" style="color:#00a4db">spec</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">  </span><span class="token key atrule" style="color:#00a4db">predictor</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">    </span><span class="token key atrule" style="color:#00a4db">sklearn</span><span 
class="token punctuation" style="color:#393A34">:</span><span class="token plain"></span><br></span><span class="token-line" style="color:#393A34"><span class="token plain">      </span><span class="token key atrule" style="color:#00a4db">storageUri</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain"> s3</span><span class="token punctuation" style="color:#393A34">:</span><span class="token plain">//bucket/sklearn/mnist.joblib</span><br></span></code></pre></div></div></div></div></div>
<p>The previous way of defining predictors is still supported; however, the new approach is the preferred one going forward. Eventually, the previous schema, which uses framework names as keys in the predictor spec, will be removed.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-modelmesh-updates">🌐 ModelMesh Updates<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-modelmesh-updates" class="hash-link" aria-label="Direct link to 🌐 ModelMesh Updates" title="Direct link to 🌐 ModelMesh Updates" translate="no">​</a></h2>
<p><a href="https://developer.ibm.com/blogs/kserve-and-watson-modelmesh-extreme-scale-model-inferencing-for-trusted-ai/" target="_blank" rel="noopener noreferrer" class="">ModelMesh</a> is being integrated as KServe's multi-model serving backend. With the inclusion of the aforementioned ServingRuntime CRDs and the Predictor Model spec, the two projects are now much more closely aligned, with continual improvements underway.</p>
<p>ModelMesh now supports multi-namespace reconciliation. Previously, the ModelMesh controller only reconciled resources deployed in the same namespace as the controller itself. Now, by default, ModelMesh handles InferenceService deployments in any "modelmesh-enabled" namespace. Learn more <a href="https://github.com/kserve/modelmesh-serving/blob/release-0.8/docs/install/install-script.md#setup-additional-namespaces" target="_blank" rel="noopener noreferrer" class="">here</a>.</p>
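<p>In practice, a namespace opts in to ModelMesh via a label. The sketch below follows the ModelMesh install docs for this release; the namespace name is a placeholder, and the label should be verified against your installed version:</p>
<pre><code class="language-yaml"># Label a namespace so the ModelMesh controller reconciles InferenceServices in it
apiVersion: v1
kind: Namespace
metadata:
  name: my-models  # placeholder namespace name
  labels:
    modelmesh-enabled: "true"
</code></pre>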
<p>Also, while ModelMesh previously supported only S3-based storage, we are happy to share that it now also works with models hosted on GCS and over HTTP(S).</p>
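<p>Concretely, the same <code>storageUri</code> field now accepts any of these schemes; the bucket and host names below are placeholders:</p>
<pre><code class="language-yaml"># S3 (previously the only supported option)
storageUri: s3://my-bucket/models/mnist
# Google Cloud Storage
storageUri: gs://my-bucket/models/mnist
# HTTP(S), typically pointing at a downloadable model archive
storageUri: https://models.example.com/mnist.zip
</code></pre>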
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>To see all release updates, check out the KServe <a href="https://github.com/kserve/kserve/releases/tag/v0.8.0" target="_blank" rel="noopener noreferrer" class="">release notes</a> and ModelMesh Serving <a href="https://github.com/kserve/modelmesh-serving/releases/tag/v0.8.0" target="_blank" rel="noopener noreferrer" class="">release notes</a>!</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<ul>
<li class=""><strong>Authors</strong>: Dan Sun, Paul Van Eck, Vedant Padwal, and Andrews Arokiam on behalf of the KServe Working Group</li>
<li class=""><strong>Core Contributors</strong>: The KServe maintainers and working group members</li>
<li class=""><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.8-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the Slack (<a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">#kubeflow-kfserving</a>)</li>
<li class="">Attend a <a href="https://docs.google.com/document/d/1KZUURwr9MnHXqHA08TFbfVbM8EAJSJjmaMhnvstvi-k/edit#heading=h.4i9fb8ndp9vp" target="_blank" rel="noopener noreferrer" class="">biweekly community meeting on Wednesday 9am PST</a></li>
<li class="">View our <a href="https://github.com/kserve/website/blob/v0.8/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.8/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc</a> contribution guides to learn how to make contributions. We are excited to work with you to make KServe better and promote its adoption!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Paul Van Eck</name>
            <uri>https://github.com/pvaneck</uri>
        </author>
        <author>
            <name>Vedant Padwal</name>
            <uri>https://github.com/js-ts</uri>
        </author>
        <author>
            <name>Andrews Arokiam</name>
            <uri>https://github.com/andyi2it</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[Announcing KServe v0.7 - Smooth Transition from KFServing to KServe]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release"/>
        <updated>2021-10-11T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[KServe 0.7 Release Blog Post]]></summary>
        <content type="html"><![CDATA[<p><em>Published on October 11, 2021</em></p>
<p><a class="" href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition">KFServing is now KServe</a>, and the KServe 0.7 release is now available. This release also ensures a smooth migration experience from KFServing to KServe.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="️-whats-changed">⚠️ What's Changed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#%EF%B8%8F-whats-changed" class="hash-link" aria-label="Direct link to ⚠️ What's Changed" title="Direct link to ⚠️ What's Changed" translate="no">​</a></h2>
<ul>
<li class="">The <code>InferenceService</code> API group is changed from <code>serving.kubeflow.org</code> to <code>serving.kserve.io</code> <a href="https://github.com/kserve/kserve/issues/1826" target="_blank" rel="noopener noreferrer" class="">#1826</a>; <a href="https://kserve.github.io/archive/0.7/admin/migration/" target="_blank" rel="noopener noreferrer" class="">a migration job</a> is provided for a smooth transition.</li>
<li class="">Python SDK name is changed from <a href="https://pypi.org/project/kfserving" target="_blank" rel="noopener noreferrer" class="">kfserving</a> to <a href="https://pypi.org/project/kserve" target="_blank" rel="noopener noreferrer" class="">kserve</a>.</li>
<li class="">KServe installation manifests <a href="https://github.com/kserve/kserve/issues/1824" target="_blank" rel="noopener noreferrer" class="">#1824</a>.</li>
<li class="">The models web app has been split out of the kserve repository into its own <a href="https://github.com/kserve/models-web-app" target="_blank" rel="noopener noreferrer" class="">models-web-app</a> repository.</li>
<li class="">Docs and examples have moved to the separate <a href="https://github.com/kserve/website" target="_blank" rel="noopener noreferrer" class="">website</a> repository.</li>
<li class="">KServe images have been migrated to the kserve Docker Hub account.</li>
<li class="">The v1alpha2 API group is deprecated <a href="https://github.com/kserve/kserve/issues/1850" target="_blank" rel="noopener noreferrer" class="">#1850</a>.</li>
</ul>
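<p>For most users, the API group change means existing manifests need only an updated <code>apiVersion</code>; the sketch below uses an illustrative service name and model path:</p>
<pre><code class="language-yaml"># Before: KFServing API group
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    sklearn:
      storageUri: gs://example-bucket/models/sklearn/iris  # placeholder path
---
# After: KServe API group; the rest of the spec is unchanged
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    sklearn:
      storageUri: gs://example-bucket/models/sklearn/iris  # placeholder path
</code></pre>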
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-new">🚀 What's New<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-whats-new" class="hash-link" aria-label="Direct link to 🚀 What's New" title="Direct link to 🚀 What's New" translate="no">​</a></h2>
<ul>
<li class="">
<p><strong>ModelMesh project is joining KServe</strong> under repository <a href="https://github.com/kserve/modelmesh-serving" target="_blank" rel="noopener noreferrer" class="">modelmesh-serving</a>!</p>
<p>ModelMesh is designed for high-scale, high-density, and frequently-changing model use cases. It intelligently loads and unloads AI models to and from memory to strike a trade-off between responsiveness to users and computational footprint. To learn more about ModelMesh features and components, check out the <a href="https://developer.ibm.com/blogs/kserve-and-watson-modelmesh-extreme-scale-model-inferencing-for-trusted-ai" target="_blank" rel="noopener noreferrer" class="">ModelMesh announcement blog</a> and <a href="https://www.linkedin.com/feed/update/urn:li:activity:6854064203360280576/" target="_blank" rel="noopener noreferrer" class="">join our talk at KubeCon NA for a deeper dive into ModelMesh and KServe</a>.</p>
</li>
<li class="">
<p><strong>(Alpha feature)</strong> Raw Kubernetes deployment support: the Istio/Knative dependency is now optional. Please follow the <a href="https://kserve.github.io/archive/0.7/admin/kubernetes_deployment" target="_blank" rel="noopener noreferrer" class="">guide</a> to install KServe and enable <code>RawDeployment</code> mode.</p>
</li>
<li class="">
<p>KServe now has its own documentation <a href="https://kserve.github.io/website" target="_blank" rel="noopener noreferrer" class="">website</a>, which is temporarily hosted on GitHub Pages.</p>
</li>
<li class="">
<p>Support for v1 CRD and webhook configurations on Kubernetes 1.22 <a href="https://github.com/kserve/kserve/issues/1837" target="_blank" rel="noopener noreferrer" class="">#1837</a>.</p>
</li>
<li class="">
<p>The Triton model serving runtime now defaults to version 21.09 <a href="https://github.com/kserve/kserve/issues/1840" target="_blank" rel="noopener noreferrer" class="">#1840</a>.</p>
</li>
</ul>
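<p>As a sketch of the new <code>RawDeployment</code> mode, the deployment mode can be selected per InferenceService with an annotation; the annotation name follows the KServe docs for this release, so verify it against your installed version (service name and model path are placeholders):</p>
<pre><code class="language-yaml">apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-raw  # placeholder name
  annotations:
    # Use plain Deployment/Service/HPA instead of Knative serverless resources
    serving.kserve.io/deploymentMode: RawDeployment
spec:
  predictor:
    sklearn:
      storageUri: gs://example-bucket/models/sklearn  # placeholder path
</code></pre>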
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-fixed">🔧 What's Fixed<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-whats-fixed" class="hash-link" aria-label="Direct link to 🔧 What's Fixed" title="Direct link to 🔧 What's Fixed" translate="no">​</a></h2>
<ul>
<li class="">Bug fix for Azure Blob Storage <a href="https://github.com/kserve/kserve/issues/1845" target="_blank" rel="noopener noreferrer" class="">#1845</a>.</li>
<li class="">Tar/Zip archive support for all storage options <a href="https://github.com/kserve/kserve/issues/1836" target="_blank" rel="noopener noreferrer" class="">#1836</a>.</li>
<li class="">Fixed the AWS_REGION environment variable and added AWS_CA_BUNDLE support for S3 <a href="https://github.com/kserve/kserve/issues/1780" target="_blank" rel="noopener noreferrer" class="">#1780</a>.</li>
<li class="">TorchServe custom package installation fix <a href="https://github.com/kserve/kserve/issues/1619" target="_blank" rel="noopener noreferrer" class="">#1619</a>.</li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-release-notes">🔍 Release Notes<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-release-notes" class="hash-link" aria-label="Direct link to 🔍 Release Notes" title="Direct link to 🔍 Release Notes" translate="no">​</a></h2>
<p>For complete release notes including all changes, bug fixes, and known issues, visit the <a href="https://github.com/kserve/kserve/releases/tag/v0.7.0" target="_blank" rel="noopener noreferrer" class="">GitHub release page</a>.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-acknowledgments">🙏 Acknowledgments<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-acknowledgments" class="hash-link" aria-label="Direct link to 🙏 Acknowledgments" title="Direct link to 🙏 Acknowledgments" translate="no">​</a></h2>
<p>We want to thank all the contributors who made this release possible:</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/animeshsingh" target="_blank" rel="noopener noreferrer" class="">Animesh Singh</a></li>
<li class=""><a href="https://github.com/chinhuang007" target="_blank" rel="noopener noreferrer" class="">Chin Huang</a></li>
<li class=""><a href="http://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh</a></li>
<li class=""><a href="https://github.com/jinchihe" target="_blank" rel="noopener noreferrer" class="">Jinchi He</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
<li class=""><a href="https://github.com/pvaneck" target="_blank" rel="noopener noreferrer" class="">Paul Van Eck</a></li>
<li class=""><a href="https://github.com/Iamlovingit" target="_blank" rel="noopener noreferrer" class="">Qianshan Chen</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkiran</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/theofpa" target="_blank" rel="noopener noreferrer" class="">Theofilos Papapanagiotou</a></li>
<li class=""><a href="https://github.com/Tomcli" target="_blank" rel="noopener noreferrer" class="">Tommy Li</a></li>
<li class=""><a href="https://github.com/js-ts" target="_blank" rel="noopener noreferrer" class="">Vedant Padwal</a></li>
<li class=""><a href="https://github.com/PatrickXYS" target="_blank" rel="noopener noreferrer" class="">Yao Xiao</a></li>
<li class=""><a href="https://github.com/yuzliu" target="_blank" rel="noopener noreferrer" class="">Yuzhui Liu</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and working group members</p>
<p><strong>Community</strong>: Everyone who reported issues, provided feedback, and tested features during this important transition</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kserve-0.7-release#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the <a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
<li class="">Attend a <a href="https://docs.google.com/document/d/1KZUURwr9MnHXqHA08TFbfVbM8EAJSJjmaMhnvstvi-k/edit#heading=h.4i9fb8ndp9vp" target="_blank" rel="noopener noreferrer" class="">Biweekly community meeting on Wednesday 9am PST</a></li>
<li class="">Follow the <a href="https://github.com/kserve/website/blob/main/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/main/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guides to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!</li>
</ul>
<p><strong>Happy serving!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of our community during this important transition!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Animesh Singh</name>
            <uri>https://github.com/animeshsingh</uri>
        </author>
        <author>
            <name>Yuzhui Liu</name>
            <uri>https://github.com/yuzliu</uri>
        </author>
        <author>
            <name>Vedant Padwal</name>
            <uri>https://github.com/js-ts</uri>
        </author>
        <category label="Releases" term="Releases"/>
    </entry>
    <entry>
        <title type="html"><![CDATA[KServe: The next generation of KFServing]]></title>
        <id>https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition</id>
        <link href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition"/>
        <updated>2021-09-27T00:00:00.000Z</updated>
        <summary type="html"><![CDATA[Announcing the transition from KFServing to KServe]]></summary>
        <content type="html"><![CDATA[<p><em>Published on September 27, 2021</em></p>
<p>We are excited to announce the next chapter for KFServing. In coordination with the Kubeflow Project Steering Group, the <a href="https://github.com/kubeflow/kfserving" target="_blank" rel="noopener noreferrer" class="">KFServing GitHub repository</a> has now been transferred to an independent <a href="https://github.com/kserve/kserve" target="_blank" rel="noopener noreferrer" class="">KServe GitHub organization</a> under the stewardship of the Kubeflow Serving Working Group leads.</p>
<p>The project has been rebranded from <strong>KFServing</strong> to <strong>KServe</strong>, and we are planning to graduate the project from the Kubeflow Project later this year.</p>
<p><img decoding="async" loading="lazy" alt="KFServing to KServe Transition" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/image1-88ae02ce8957a75ad191a74d1a743bfb.png" width="1256" height="730" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-project-background">🎯 Project Background<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-project-background" class="hash-link" aria-label="Direct link to 🎯 Project Background" title="Direct link to 🎯 Project Background" translate="no">​</a></h2>
<p>Developed collaboratively by Google, IBM, Bloomberg, NVIDIA, and Seldon, KFServing was published as open source in early 2019. The project set out to provide the following features:</p>
<ul>
<li class="">A simple yet powerful Kubernetes Custom Resource for deploying machine learning (ML) models in production across ML frameworks.</li>
<li class="">A performant, standardized inference protocol.</li>
<li class="">Serverless inference that follows live traffic patterns, supporting "scale-to-zero" on both CPUs and GPUs.</li>
<li class="">A complete story for production ML model serving, including prediction, pre/post-processing, explainability, and monitoring.</li>
<li class="">Support for deploying thousands of models at scale, along with inference graph capability for composing multiple models.</li>
</ul>
<p>KFServing was created to address the challenges organizations face when deploying and monitoring machine learning models in production. After publishing the open source project, we saw an explosion in demand for the software, leading to strong adoption and community growth. The scope of the project has since grown, and we have developed multiple components along the way, including a growing body of documentation that needs its own website and an independent GitHub organization.</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-whats-next">🚀 What's Next<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-whats-next" class="hash-link" aria-label="Direct link to 🚀 What's Next" title="Direct link to 🚀 What's Next" translate="no">​</a></h2>
<p>Over the coming weeks, we will be releasing <strong>KServe 0.7</strong> outside of the Kubeflow Project and will provide more details on how to migrate from KFServing to KServe with minimal disruption. The KFServing 0.5.x/0.6.x releases will remain supported for six months after the KServe 0.7 release. We are also working on integrating core Kubeflow APIs and standards for <a href="https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc" target="_blank" rel="noopener noreferrer" class="">the conformance program</a>.</p>
<p>For contributors, please follow the KServe <a href="https://github.com/kserve/website/blob/v0.7/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.7/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guide to make code or doc contributions. We are excited to work with you to make KServe better and promote its adoption by more and more users!</p>
<p><img decoding="async" loading="lazy" alt="KServe Logo" src="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/assets/images/kserve-b9befb7647f020cdab9eb81b3f627404.png" width="3322" height="1677" class="img_ev3q"></p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-kserve-key-links">🔗 KServe Key Links<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-kserve-key-links" class="hash-link" aria-label="Direct link to 🔗 KServe Key Links" title="Direct link to 🔗 KServe Key Links" translate="no">​</a></h2>
<ul>
<li class=""><a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a></li>
<li class=""><a href="https://github.com/kserve/kserve/" target="_blank" rel="noopener noreferrer" class="">Github</a></li>
<li class=""><a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
</ul>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-contributor-acknowledgement">🙏 Contributor Acknowledgement<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-contributor-acknowledgement" class="hash-link" aria-label="Direct link to 🙏 Contributor Acknowledgement" title="Direct link to 🙏 Contributor Acknowledgement" translate="no">​</a></h2>
<p>We'd like to thank all the KServe contributors for this transition work!</p>
<p><strong>Individual Contributors:</strong></p>
<ul>
<li class=""><a href="https://github.com/andyi2it" target="_blank" rel="noopener noreferrer" class="">Andrews Arokiam</a></li>
<li class=""><a href="https://github.com/animeshsingh" target="_blank" rel="noopener noreferrer" class="">Animesh Singh</a></li>
<li class=""><a href="https://github.com/chinhuang007" target="_blank" rel="noopener noreferrer" class="">Chin Huang</a></li>
<li class=""><a href="http://github.com/yuzisun" target="_blank" rel="noopener noreferrer" class="">Dan Sun</a></li>
<li class=""><a href="https://github.com/jagadeeshi2i" target="_blank" rel="noopener noreferrer" class="">Jagadeesh</a></li>
<li class=""><a href="https://github.com/jinchihe" target="_blank" rel="noopener noreferrer" class="">Jinchi He</a></li>
<li class=""><a href="https://github.com/njhill" target="_blank" rel="noopener noreferrer" class="">Nick Hill</a></li>
<li class=""><a href="https://github.com/pvaneck" target="_blank" rel="noopener noreferrer" class="">Paul Van Eck</a></li>
<li class=""><a href="https://github.com/Iamlovingit" target="_blank" rel="noopener noreferrer" class="">Qianshan Chen</a></li>
<li class=""><a href="https://github.com/Suresh-Nakkeran" target="_blank" rel="noopener noreferrer" class="">Suresh Nakkiran</a></li>
<li class=""><a href="https://github.com/sukumargaonkar" target="_blank" rel="noopener noreferrer" class="">Sukumar Gaonkar</a></li>
<li class=""><a href="https://github.com/theofpa" target="_blank" rel="noopener noreferrer" class="">Theofilos Papapanagiotou</a></li>
<li class=""><a href="https://github.com/Tomcli" target="_blank" rel="noopener noreferrer" class="">Tommy Li</a></li>
<li class=""><a href="https://github.com/js-ts" target="_blank" rel="noopener noreferrer" class="">Vedant Padwal</a></li>
<li class=""><a href="https://github.com/PatrickXYS" target="_blank" rel="noopener noreferrer" class="">Yao Xiao</a></li>
<li class=""><a href="https://github.com/yuzliu" target="_blank" rel="noopener noreferrer" class="">Yuzhui Liu</a></li>
</ul>
<p><strong>Core Contributors</strong>: The KServe maintainers and Kubeflow Serving Working Group leads</p>
<p><strong>Community</strong>: Everyone who supported this important transition and helped establish KServe as an independent project</p>
<h2 class="anchor anchorTargetStickyNavbar_Vzrq" id="-join-the-community">🤝 Join the Community<a href="https://deploy-preview-640--elastic-nobel-0aef7a.netlify.app/blog/kfserving-transition#-join-the-community" class="hash-link" aria-label="Direct link to 🤝 Join the Community" title="Direct link to 🤝 Join the Community" translate="no">​</a></h2>
<ul>
<li class="">Visit our <a href="https://kserve.github.io/website/" target="_blank" rel="noopener noreferrer" class="">Website</a> or <a href="https://github.com/kserve/kserve/" target="_blank" rel="noopener noreferrer" class="">GitHub</a></li>
<li class="">Join the <a href="https://kubeflow.slack.com/join/shared_invite/zt-n73pfj05-l206djXlXk5qdQKs4o1Zkg#/" target="_blank" rel="noopener noreferrer" class="">Slack (#kubeflow-kfserving)</a></li>
<li class="">Follow the KServe <a href="https://github.com/kserve/website/blob/v0.7/docs/developer/developer.md" target="_blank" rel="noopener noreferrer" class="">developer</a> and <a href="https://github.com/kserve/website/blob/v0.7/docs/help/contributor/mkdocs-contributor-guide.md" target="_blank" rel="noopener noreferrer" class="">doc contribution</a> guides to make contributions</li>
</ul>
<p><strong>Welcome to KServe!</strong></p>
<hr>
<p><em>The KServe team is committed to making machine learning model serving simple, scalable, and standardized. Thank you for being part of this exciting transition!</em></p>]]></content>
        <author>
            <name>Dan Sun</name>
            <uri>https://github.com/yuzisun</uri>
        </author>
        <author>
            <name>Animesh Singh</name>
            <uri>https://github.com/animeshsingh</uri>
        </author>
        <category label="Announcements" term="Announcements"/>
    </entry>
</feed>