A Guide to Kubernetes VPA

January 31, 2025
Tags:
Autoscaling
VPA

In Kubernetes, managing resources like CPU and memory effectively is critical for maintaining performance, reducing costs, and ensuring application stability. Over-provisioning resources wastes money, while under-provisioning can lead to performance bottlenecks or application crashes. Striking the right balance is often challenging, especially in dynamic environments where workloads change frequently.

In this article, we’ll explore how the Vertical Pod Autoscaler (VPA) helps solve these challenges by automatically optimizing resource allocation for Kubernetes workloads. We’ll cover its components, how it works, types of recommendations it provides, and the steps to set it up, along with real-world use cases and practical insights.

Introduction to Vertical Pod Autoscaler (VPA)

The Vertical Pod Autoscaler (VPA) is a Kubernetes component designed to optimize resource allocation for workloads. Unlike the Horizontal Pod Autoscaler (HPA), which scales the number of pod replicas, the VPA adjusts the CPU and memory requests and limits of individual pods. This ensures workloads have the necessary resources without over-provisioning or under-provisioning, reducing costs and improving performance.

In environments where workloads are dynamic and unpredictable, the VPA helps maintain resource efficiency by adapting to changing demands. It’s particularly useful for DevOps engineers, SREs, and FinOps teams aiming to optimize costs while maintaining high availability and performance.

Key Components of VPA

The Vertical Pod Autoscaler (VPA) relies on three main components that work together to monitor, recommend, and apply resource adjustments for Kubernetes pods:

1. VPA Recommender

The VPA Recommender is the brain behind the operation. It analyzes historical resource usage data collected from sources like the Kubernetes metrics server or Prometheus. By studying patterns such as average and peak usage, it estimates the appropriate CPU and memory requests and limits for each pod.

For example, if a pod consistently uses 0.5 CPU cores (500m) but is allocated 2 CPU cores (2000m), the Recommender might suggest reducing the allocation (requests) to 1 CPU (1000m), preventing over-provisioning while maintaining enough capacity for occasional spikes.

[Figure: VPA Recommender]

2. VPA Updater

The VPA Updater is responsible for applying the recommendations generated by the Recommender. However, it doesn’t dynamically adjust resources on running pods since Kubernetes doesn’t support changing resource limits without restarting the pod. Instead, the Updater terminates the pod and allows the deployment controller to recreate it with updated specifications.

This ensures that updated resource limits are applied without disrupting the integrity of Kubernetes scheduling and resource allocation.

Note:

While Kubernetes 1.27 introduced the in-place update of pod resources KEP, which allows resource requests and limits to be adjusted without recreating pods, VPA has not fully adopted this capability yet. As a result, VPA may still require pod restarts to enforce changes.

3. VPA Admission Controller

The VPA Admission Controller acts as a gatekeeper during pod creation and updates. It ensures that new pods are created with the recommended resource requests and limits. This guarantees that the pods are rightsized from the start, reducing the likelihood of inefficiencies.

For instance, if a deployment creates 10 new pods, the Admission Controller automatically injects optimized resource values into their configurations before they are scheduled on the cluster nodes.

[Figure: VPA workflow]

For a more detailed breakdown of the VPA components, refer to this documentation.

Understanding how the VPA Recommender Works

The VPA Recommender analyzes resource usage patterns and generates actionable recommendations to optimize workload performance. Here’s how it operates:

1. Collecting Resource Usage Data

The VPA Recommender collects historical resource usage data from Kubernetes metrics sources, such as the metrics server, Prometheus, or other monitoring tools. This data includes information about CPU and memory usage patterns for each pod in the cluster.

By default, VPA analyzes up to 8 days of historical resource usage data, assigning higher weight to recent samples. This ensures recommendations are aligned with the pod's most current workload behavior.

Tip:

Using Prometheus as your metrics source allows for greater flexibility in data retention policies and fine-grained usage tracking. 
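To use Prometheus as the history provider, the recommender is configured through its startup flags. Here is a sketch of the relevant container arguments; the flag names come from the VPA recommender, while the Prometheus address shown is an assumption for your cluster:

```yaml
# Excerpt from the vpa-recommender Deployment: switch history
# storage to Prometheus (address is illustrative).
spec:
  containers:
    - name: recommender
      args:
        - --storage=prometheus
        - --prometheus-address=http://prometheus.monitoring.svc:9090
        - --history-length=8d   # how far back to query (default: 8 days)
```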

2. Analyzing Resource Usage Patterns

Once the data is collected, the Recommender analyzes usage trends over time. It identifies key patterns, such as:

  • Peak Usage: The maximum resource consumption during spikes.
  • Average Usage: The typical resource consumption during steady-state operation.
  • Usage Trends: Variability in resource usage over different time periods (e.g., daily, weekly).

This analysis enables the Recommender to account for both stable workloads and workloads with dynamic demands.

Note:

Workloads with highly unpredictable spikes may require additional configuration, such as Burstable QoS settings (discussed in the next section), to handle sudden increases in resource usage.

3. Using a Histogram-Based Algorithm

To calculate recommendations, the VPA uses a histogram-based algorithm. Histograms are statistical representations of resource usage data, allowing the VPA to estimate resource needs with high accuracy. Here’s how it works:

  • The algorithm categorizes resource usage into ranges (e.g., 0-10%, 10-20%) and tracks the frequency of each range over time.
  • Percentile analysis is applied to determine:
    • Requests: Typically set at the 90th percentile, ensuring that the pod has enough resources for 90% of observed usage.
    • Limits: Typically set at the 95th or 99th percentile, allowing for occasional spikes while avoiding over-allocation.

For example: If a pod’s CPU usage histogram shows that 90% of usage falls below 400m, the Recommender might suggest setting the CPU request to 400m and the limit to 600m to account for peak loads.
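As a rough illustration of the percentile idea, the calculation can be sketched in Python. This is not the actual VPA code, which uses decaying-weight histograms with fixed bucket boundaries; the usage samples below are hypothetical:

```python
import math

# Illustrative sketch only, NOT the real VPA algorithm (which uses
# exponentially decaying weighted histograms).

def percentile(samples, p):
    """Return the p-th percentile of usage samples (nearest-rank method)."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]

# Hypothetical CPU usage samples in millicores.
cpu_samples = [120, 150, 180, 200, 250, 300, 320, 350, 400, 600]

request = percentile(cpu_samples, 90)  # covers 90% of observed usage
limit = percentile(cpu_samples, 95)    # headroom for occasional spikes

print(f"suggested request: {request}m, limit: {limit}m")
# → suggested request: 400m, limit: 600m
```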

4. Generating Recommendations

Based on the analyzed data, the Recommender generates recommendations for:

  • CPU Requests and Limits: Ensures that workloads have adequate processing power for typical and peak usage scenarios.
  • Memory Requests and Limits: Allocates sufficient memory to avoid out-of-memory (OOM) errors without wasting resources.

These recommendations are designed to balance cost efficiency and performance stability, ensuring that workloads operate smoothly under varying conditions.

5. Recommendation Modes

The Recommender supports four operational modes, allowing flexibility based on your use case:

  • Auto: Automatically applies resource recommendations at pod creation and updates running pods as needed. Until in-place resizing is fully supported, updates to running pods are applied by evicting and recreating them.

  • Recreate: Updates running pods by restarting them when resource requests change significantly. This mode respects Pod Disruption Budgets (if defined) and is useful when updates must guarantee pod restarts.

  • Initial: Applies recommendations only when new pods are created. Existing pods remain unchanged, making it ideal for ensuring optimized resource usage from the start.

  • Off: Calculates recommendations but doesn’t apply them automatically. This mode is great for testing and reviewing recommendations before making changes.

Suggestion:

Start with Off Mode in production clusters to analyze recommendations without disrupting workloads. Once you’re confident in the results, switch to Auto Mode to automatically apply the recommendations.

Can you use VPA and HPA together? - Best Practices & Considerations

While VPA and HPA serve different purposes, using them together is generally not recommended, as they can conflict when scaling based on the same metrics. HPA scales horizontally (adding/removing pods) based on CPU/memory utilization, while VPA modifies CPU/memory requests and limits, which can interfere with HPA’s decision-making.

However, if you must use them together, ensure they do not rely on the same metrics. A best practice is to configure HPA to scale based on custom metrics (e.g., request rate, latency) via tools like Prometheus Adapter instead of CPU/memory utilization, for better stability.
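For instance, a sketch of an HPA driven by a pods metric exposed through Prometheus Adapter; the metric name and target values here are assumptions for your environment:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # exposed via Prometheus Adapter
        target:
          type: AverageValue
          averageValue: "100"
```

With this setup, HPA reacts to request-rate changes while VPA independently tunes CPU and memory requests, so the two never compete over the same signal.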

Types of Recommendations

The Vertical Pod Autoscaler (VPA) provides three types of recommendations based on Kubernetes Quality of Service (QoS) classes: Guaranteed, Burstable, and BestEffort.

These recommendations allow you to optimize resource allocation based on workload requirements, balancing cost efficiency and performance reliability.

1. Guaranteed Recommendations

Guaranteed recommendations are designed for workloads that require consistent and predictable performance. In this mode, both the CPU and memory requests are set equal to the limits, ensuring that the workload always has the resources it needs under all conditions.

When to Use:

  • For critical applications such as databases, backend services, or payment gateways where even minor performance degradation is unacceptable.
  • When you want to prioritize resource allocation and ensure that these pods are never evicted, even during periods of high resource pressure.

Key Characteristics:

  • Resource Allocation: Ensures the pod gets the exact amount of resources specified.
  • High Priority: Pods configured with Guaranteed QoS have the highest priority for scheduling and resource allocation.
  • Resilience Under Resource Pressure: These pods are the least likely to be evicted when nodes experience resource shortages.

Example:

Here’s an example of a pod recommendation using Guaranteed QoS:

containerRecommendations:
  - containerName: database
    target:
      cpu: 500m
      memory: 2Gi

In this case, both the target CPU and memory values will be applied as requests and limits to ensure the database pod operates with maximum stability and reliability.

Tip:

Use Guaranteed QoS for workloads with strict performance SLAs (Service Level Agreements) to ensure uninterrupted service.
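Applied to a container spec, the Guaranteed class means requests equal limits. An illustrative sketch using the same values as the recommendation above:

```yaml
resources:
  requests:
    cpu: 500m
    memory: 2Gi
  limits:
    cpu: 500m      # equal to the request => Guaranteed QoS
    memory: 2Gi
```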

2. Burstable Recommendations

Burstable recommendations are ideal for workloads that don’t need consistent resource usage but can benefit from additional resources during peak demand. In this mode, resource requests (minimum guaranteed resources) are set lower, and limits (maximum allowed resources) are set higher.

When to Use:

  • For workloads like web servers, batch jobs, or microservices where resource usage fluctuates based on traffic or workload demand.
  • When you want to optimize resource efficiency while allowing pods to scale dynamically during spikes.

Key Features:

  • Dynamic Scaling: Pods can use additional resources when available but still function reliably with their guaranteed minimum resources.
  • Cost Efficiency: Reduces unnecessary resource allocation during idle periods while accommodating spikes effectively.
  • Medium Priority: These pods have a lower scheduling and eviction priority compared to Guaranteed pods.

Example Configuration:
Here’s an example of a Burstable QoS recommendation:

containerRecommendations:
  - containerName: nginx
    lowerBound:
      cpu: 200m
      memory: 512Mi
    upperBound:
      cpu: 800m
      memory: 2Gi

In this configuration:

  • The lowerBound represents the guaranteed resources (200m CPU and 512Mi memory), ensuring the pod remains operational.
  • The upperBound allows the pod to scale up to 800m CPU and 2Gi memory during traffic spikes, taking advantage of available resources on the node.
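Applied to a container spec, a Burstable allocation sets requests below limits. A sketch using the bounds above (values illustrative):

```yaml
resources:
  requests:
    cpu: 200m       # guaranteed baseline
    memory: 512Mi
  limits:
    cpu: 800m       # burst ceiling during traffic spikes
    memory: 2Gi
```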

3. BestEffort Recommendations

BestEffort recommendations are designed for workloads that don’t specify CPU or memory requests and limits. These pods rely entirely on unused cluster resources to operate and have no guaranteed allocation.

Key Characteristics:

  • BestEffort pods are the first to be evicted under resource pressure, as they don’t have any guaranteed CPU or memory allocations.
  • These workloads operate opportunistically, consuming idle resources in the cluster.

When to Use

These are suitable for non-critical workloads where performance degradation or eviction is acceptable. They are commonly used for:

  • Low-priority batch jobs.
  • Development or testing environments.
  • Background processes that don’t require consistent performance.

Choosing the Right Recommendation

The choice of recommendation depends on your workload's criticality and resource requirements:

  • Guaranteed Recommendations: Use for mission-critical workloads where consistent performance and reliability are essential. These workloads are prioritized for scheduling and are least likely to be evicted during resource pressure.
  • Burstable Recommendations: Best for applications with variable resource demands. This ensures a balance between cost efficiency and scalability, allowing workloads to scale during peak demand while maintaining guaranteed resources for baseline operations.
  • BestEffort Recommendations: Ideal for non-essential workloads that can operate without guaranteed resources. These are perfect for maximizing cluster utilization without impacting critical applications.

By aligning your workloads with the appropriate QoS class, you can ensure that your Kubernetes environment remains cost-effective and performant.
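The shape of VPA's recommendations can be steered per container through the VerticalPodAutoscaler resourcePolicy. A hedged sketch, where the workload name and bounds are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"          # recommend only; review before applying
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 4Gi
        controlledValues: RequestsAndLimits   # size both requests and limits
```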

Steps to Set Up VPA Recommendations

Setting up the Vertical Pod Autoscaler (VPA) in your Kubernetes cluster involves a few straightforward steps. While we’ll cover the essentials here, you can refer to the official VPA documentation for detailed instructions and advanced configurations.

1. Prerequisites

Before proceeding further, ensure your Kubernetes cluster meets the following requirements:

  • Metrics Server is installed and functioning.
  • Kubernetes cluster version 1.11 or later.
  • kubectl is installed and configured to interact with your cluster.


For additional details, check the official VPA prerequisites documentation.

2. Install the VPA Components

To deploy VPA, clone the official Kubernetes Autoscaler repository and use the following commands:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

This deploys the three key VPA components—Recommender, Updater, and Admission Controller—into your cluster. These components are essential for monitoring resource usage, generating recommendations, and applying adjustments.
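A quick way to confirm the components came up; the exact deployment names may vary slightly by VPA version:

```shell
kubectl get pods -n kube-system | grep vpa
# Expect vpa-recommender, vpa-updater, and vpa-admission-controller Running
```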

3. Configure VPA for your Workloads

To enable VPA for a specific workload, create a VerticalPodAutoscaler resource.

Here’s an example configuration:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: example-vpa
  namespace: default
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto" 

This configuration targets a Deployment named my-app and sets the VPA to automatically update pod resources based on recommendations.

Tip:

Start with updateMode: "Off" to test the recommendations without automatically applying changes. This mode is ideal for production clusters where stability is critical.

4. Test Your VPA Setup

To verify that VPA is functioning correctly, you can deploy a sample application along with a corresponding VPA configuration. 

The VPA repository provides an example with a "hamster" deployment.

5. Monitor VPA Recommendations

To view the recommendations provided by VPA, describe the VPA resource:

kubectl describe vpa example-vpa

The output includes values like target, lowerBound, and upperBound for CPU and memory, providing insights into optimal resource allocation.
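The status section of the describe output looks roughly like this; the container name and values are illustrative:

```
Status:
  Recommendation:
    Container Recommendations:
      Container Name:  my-app
      Lower Bound:
        Cpu:     100m
        Memory:  256Mi
      Target:
        Cpu:     250m
        Memory:  512Mi
      Upper Bound:
        Cpu:     800m
        Memory:  1Gi
```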

Challenges in Implementing VPA

While the Vertical Pod Autoscaler (VPA) simplifies resource management in Kubernetes, its implementation comes with several challenges. Understanding these challenges is essential for effectively deploying VPA in production environments.

1. Manual Configuration Overhead

Configuring VPA objects for every workload in a large-scale environment can be time-consuming and prone to errors. Each workload may have unique resource requirements, making it difficult to standardize configurations.

To overcome this, leverage tools like Helm charts or CI/CD pipelines to dynamically generate VPA configurations based on workload metadata. This approach reduces manual effort, ensures consistency across deployments, and minimizes the risk of configuration errors.
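For instance, a Helm template can stamp out one VPA per workload from chart values. The chart structure and value names below are assumptions, not a standard chart:

```yaml
# templates/vpa.yaml — generates a VPA for each entry in .Values.workloads
{{- range .Values.workloads }}
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: {{ .name }}-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .name }}
  updatePolicy:
    updateMode: {{ .vpaMode | default "Off" | quote }}
---
{{- end }}
```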

2. Pod Restarts and Application Stability

In Auto mode, VPA applies recommendations by restarting pods to update resource requests and limits, as modifying resources on running pods is not yet fully supported. While Kubernetes 1.27 introduced in-place updates for resource requests and limits, the feature is not yet widely available, so VPA may continue to rely on pod restarts.

While necessary for resource adjustments, this process can cause temporary disruptions, particularly for critical or stateful workloads.

For instance, a database pod requiring a memory adjustment will be restarted, potentially interrupting active connections. This could lead to downtime or degraded performance during the update.

To minimize disruptions, start with Off Mode in production clusters to observe recommendations before applying them. Combine this with rolling updates to ensure changes are applied gradually, reducing the impact on application stability.
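Pairing Auto or Recreate mode with a PodDisruptionBudget caps how many replicas the Updater may evict at once, since it goes through the eviction API. The selector label below is an assumption:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1        # VPA Updater respects this when evicting pods
  selector:
    matchLabels:
      app: my-app
```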

3. Delayed Adjustments in Dynamic Environments

The VPA relies on historical usage data (defaults to 8 days) to generate recommendations. In highly dynamic workloads, this can lead to delayed adjustments when handling sudden traffic spikes. 

To balance this, use HPA for real-time scaling based on external/custom metrics (e.g., request rate, latency) while VPA optimizes CPU/memory requests over time.

4. Scalability in Large-Scale Clusters

In large-scale Kubernetes clusters, VPA’s Recommender and Updater must process a growing amount of data. This can result in slower recommendations or increased resource consumption for managing the VPA itself.

To improve scalability and performance, use Prometheus with fine-tuned retention policies to reduce the metric storage overhead on the VPA Recommender. Additionally, partition workloads across namespaces or clusters to distribute the load and ensure faster, more reliable recommendations.

VPA-powered Rightsizing with App Insights

Tuning Kubernetes workloads with VPA can be complex—manual configuration, constant monitoring, and unexpected pod restarts make it challenging to optimize resources efficiently.

Randoli App Insights simplifies this process by leveraging VPA data to provide accurate rightsizing recommendations. The built-in agent automatically installs and configures VPA by default, eliminating manual setup hassle.

Note:

If you're already using VPA for auto-scaling (using updateMode as auto or recreate), we recommend consulting our team before installing App Insights to avoid conflicts.

Want to optimize Kubernetes resource allocation with minimal effort? Let’s talk!

[Figure: Randoli App Insights rightsizing recommendations]

Conclusion

Managing resource allocation in Kubernetes can be challenging, but the Vertical Pod Autoscaler (VPA) makes it easier. By automatically adjusting CPU and memory requests, VPA ensures workloads always have the right amount of resources—helping you reduce costs, improve stability, and prevent performance bottlenecks.

Whether you're a DevOps engineer, SRE, or FinOps professional, integrating VPA into your cluster can streamline resource management and make your workloads more efficient and scalable. With the right configuration, VPA takes the guesswork out of rightsizing, letting you focus on what matters—running reliable applications.
