Running kube-proxy in IPVS mode: why and how

Xing Du
5 min read · Aug 10, 2024


I finally got some time to write down and share some of what I learned about kube-proxy over the past 3 months: something I was aware of but didn't quite understand the "why" (or "why not") behind.

To me, “why” is always more important than “how”. In this post, I’ll focus on explaining why you should configure kube-proxy to use IPVS mode.

Conclusion

TL;DR

  • if your k8s cluster hosts more than 100 Service objects, you should switch kube-proxy to IPVS mode
  • you should almost always prefer using IPVS mode over iptables mode

By “almost always” I mean: if you have reasons to favor iptables over IPVS mode, please comment below; I'd love to learn more from you.

Why

Every decision should be contextual, so I’ll start by sharing some of that context.

Environment setup

I was working on provisioning an EKS cluster in preparation for production workloads (decent traffic load) using a pretty standard AWS stack (probably the most popular):

  • AWS EKS latest stable (1.29 / 1.30 at the time I applied the IPVS mode change)
  • EKS addons for the core components: kube-proxy, aws-vpc-cni, and coredns, with default values and the latest stable compatible versions (see the lookup command after this list)
  • Other add-ons are driven by helm: aws-load-balancer-controller, cert-manager, external-dns, karpenter, etc.
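
In case it helps, the latest compatible addon versions for a given cluster version can be listed with the AWS CLI; the region and Kubernetes version below are placeholders for your own values:

aws eks describe-addon-versions \
  --region us-east-1 \
  --kubernetes-version 1.30 \
  --addon-name kube-proxy \
  --query 'addons[].addonVersions[].addonVersion'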

An issue that surfaced while load testing services led me to this discovery. The load test is set up at a scale similar to (in fact, larger than) the expected peak traffic so that it provides valuable insights:

  • 1000 instances of the same helloworld web service were deployed to the same cluster. Every instance consists of 1 Ingress + 1 Service + 1 Deployment + 1 HPA with a very high ceiling.
  • karpenter is configured with a ceiling equivalent to 100 production-sized instances
  • the Deployment of my helloworld service is configured with nodeAffinity to match the labels of the karpenter pool (see the sketch after this list)
  • API load is configured to exceed the production peak, which is on the order of thousands of requests per second
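
For illustration, a fleet like that could be stamped out with something along these lines; the chart path, values keys, namespace, and node pool name are hypothetical placeholders rather than my exact setup:

# hypothetical sketch: deploy 1000 copies of a helloworld chart pinned to the karpenter pool
for i in $(seq 1 1000); do
  helm upgrade --install "helloworld-${i}" ./charts/helloworld \
    --namespace loadtest --create-namespace \
    --set nodeAffinityKey="karpenter.sh/nodepool" \
    --set nodeAffinityValue="loadtest" \
    --set hpa.maxReplicas=64
done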

The issue

Pod autoscaling and node autoscaling are also triggered as the load ramps up: many pods get scheduled concurrently, and so do the nodes.

There’s a separate unresolved karpenter issue on “how many nodes should be scheduled for concurrently scheduled pods” (and you'll see my comments on what I think should be done) but I won't expand here since it's not the focus of this post.

During concurrent node launches (identical EC2NodeClass / "launch template"), I noticed most of the nodes were stuck in the NotReady state indefinitely.

Investigation

My first guess was an EC2 API rate limit: some nodes made it through before the limit kicked in, while the rest got gated, which would explain what I observed. After checking the EC2 API rate limit metrics in CloudWatch, however, I found there weren’t any API rate limit events.

Then I looked into the aws-vpc-cni repo for helpful resources: in my experience, nodes stuck in NotReady are almost exclusively caused by a vpc-cni failure. While trying to submit an issue, I found the helpful "CNI Log Collection tool" (sudo bash /opt/cni/bin/aws-cni-support.sh), which collects logs from all pods placed on the node.

I didn’t find anything suspicious other than long gaps between blocks of entries in the vpc-cni logs, so I started cross-checking logs from other pods. By ordering log entries by timestamp across the different log streams/files, I eventually figured out why the nodes were stuck in the NotReady state:

Both vpc-cni and kube-proxy need to modify iptables rules, and they starve each other due to lock contention: a single modification took tens of seconds (if not a few minutes), while the applications (kube-proxy specifically) are configured to wait for the xtables lock for up to 5 seconds, retrying on a 30-second interval.
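
If you want to see the lock behavior on an affected node, a minimal check looks like the following (assuming kube-proxy is still in iptables mode, so the KUBE-SERVICES chain exists in the nat table). iptables takes a single host-wide xtables lock, and -w 5 waits up to 5 seconds for it before giving up, which is roughly the budget kube-proxy was working with:

# try to grab the xtables lock within 5 seconds, like kube-proxy does
sudo iptables -w 5 -t nat -L KUBE-SERVICES -n > /dev/null \
  || echo "could not grab the xtables lock within 5s; something else is holding it"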

The issue does not look like a deadlock, and I’m still unsure why the lock contention lasted for hours. If I were to guess, the retry interval being shorter than the operation's duration/timeout could be what kept blocking the initialization sequence from completing.

The root of the issue is that iptables modification is extremely slow on a large cluster. Since my cluster contains 1000+ k8s Services, the iptables update commands used by kube-proxy and vpc-cni each took dozens of seconds (the same order of magnitude as the retry interval, if not more) to complete, saturating the xtables lock.
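
A rough way to see why each update is so slow is to count the rules kube-proxy maintains on a node; with 1000+ Services in iptables mode, the nat table easily reaches tens of thousands of rules, and every sync rewrites large parts of it:

# total nat rules, and specifically the per-Service chains managed by kube-proxy
sudo iptables-save -t nat | wc -l
sudo iptables-save -t nat | grep -c '^-A KUBE-SVC'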

If you’re interested in the context of this particular issue (e.g. detailed logs, how to reproduce, etc.), the 2 issues can be found here: issue#2945 and issue#2948

Solution

After doing more research and consulting AWS, it became clear that I should switch kube-proxy to IPVS mode to avoid the contention.

I believe that, in theory, one can alleviate this situation to a certain degree by increasing the retry interval, but IPVS mode is a far better solution.

I have not collected data to help put things in perspective, but the same contention issue (which leads to nodes stuck in NotReady state) has not happened since the switch.

This post from Tigera has an in-depth performance analysis that matches my observations, and I don't think I could explain it any better than that post does.

How

This is the easy part. The only thing worth calling out is that this change will result in downtime. Note that the length of the downtime does not translate linearly into negative impact.

How long the downtime lasts depends on:

  • how large your cluster is (# of nodes)
  • updateStrategy for kube-proxy
  • # of services your k8s cluster hosts

If you have the luxury of not worrying about downtime, e.g. preparing a new cluster for production, do this as early as possible.
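
If you want to gauge those factors (node count, updateStrategy, Service count) on your own cluster first, a quick check assuming kubectl access could look like this:

kubectl -n kube-system get daemonset kube-proxy -o jsonpath='{.spec.updateStrategy}'
kubectl get nodes --no-headers | wc -l
kubectl get services --all-namespaces --no-headers | wc -l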

The detailed change has 2 components:

  1. addon configuration: tell the addon to use IPVS mode
  2. update the AMI (kernel modules) if needed to enable the scheduler of your preference.

They’re listed in reverse order to make the "why" easier to follow, but you should start with 2 and then work on 1.

addon value override

I’ll walk through the change based on my setup (using the EKS addon distribution of kube-proxy). What your cluster needs depends on your context, but it should be roughly the same.

  • Run aws eks describe-addon-configuration --region us-east-1 --addon-name kube-proxy --addon-version <version> | jq -r '.configurationSchema' | jq .definitions to view the schema of the addon values/configuration
  • Set mode: "ipvs" and ipvs.scheduler: "<scheduler_of_your_choice>" via the addon configuration values (a sketch of the update call follows this list)
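
Here is a minimal sketch of applying that configuration through the EKS addon; the region, cluster name, and scheduler are placeholders, and the exact keys should be double-checked against the schema returned by the describe-addon-configuration call above:

aws eks update-addon \
  --region us-east-1 \
  --cluster-name my-cluster \
  --addon-name kube-proxy \
  --configuration-values '{"mode":"ipvs","ipvs":{"scheduler":"rr"}}'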

Not all schedulers are available by default; the matching kernel modules need to be loaded.

kernel module

I’m using the official EKS AMI, which comes with ipvsadm installed. The following instructions are based on my setup, YMMV.

I'm also using packer to build the AMI, but that may not be the case for everyone. I'll share what needs to be done (a consolidated sketch follows the list) and you can implement the same change based on your setup:

  • install ipvsadm if not available
  • run lsmod to see which kernel modules are loaded; this helps identify what needs to be added in the next step
  • it's also worth checking ls -l /lib/modules/<kernel_version>/kernel/net/netfilter/ipvs/ to see which kernel modules can be loaded
  • make sure ip_vs, nf_conntrack, and one or more scheduler modules (at least the one matching your addon configuration) are loaded. You can test loading a module with sudo modprobe <module_name>
  • to ensure the AMI loads the kernel modules on boot, add the list of modules to /etc/modules-load.d/ipvs.conf
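
Here is what those steps could look like consolidated into a provisioning script (packer or otherwise); the package manager and module list are assumptions based on my setup, so adjust them to your distro and scheduler choice:

#!/usr/bin/env bash
set -euo pipefail

# install ipvsadm if the AMI does not already ship it (Amazon Linux shown)
command -v ipvsadm >/dev/null || sudo yum install -y ipvsadm

# inspect what is already loaded and what is available to load
lsmod | grep -E 'ip_vs|nf_conntrack' || true
ls -l "/lib/modules/$(uname -r)/kernel/net/netfilter/ipvs/"

# load the modules now so they can be verified during the build
for mod in ip_vs nf_conntrack ip_vs_rr ip_vs_wlc ip_vs_mh; do
  sudo modprobe "$mod"
done

# persist the list so every boot of the AMI loads them automatically
printf '%s\n' ip_vs nf_conntrack ip_vs_rr ip_vs_wlc ip_vs_mh | sudo tee /etc/modules-load.d/ipvs.conf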

Alternatively, you can read this post from AWS, which covers most of the steps above.

Lastly, here’s my /etc/modules-load.d/ipvs.conf:

ip_vs
nf_conntrack
ip_vs_rr
ip_vs_mh
ip_vs_wlc

and my addon config sets ipvs.scheduler to mh.
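
Once the addon change has rolled out onto nodes built from the new AMI, a quick way to confirm kube-proxy is actually programming IPVS (rather than iptables) is to list the virtual servers on a node; the scheduler column should show the one you configured (mh in my case):

sudo ipvsadm -Ln | head -20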
