Kubernetes: Databricks’ Load Balancing Options

A good balance between performance and ease of implementation. This is the verdict the Databricks platform team applies to the P2C (Power of Two Choices) algorithm.

Hence its decision to deploy an alternative load-balancing mechanism on Kubernetes. The default stack (CoreDNS + kube-proxy) did not suit the environment of the US-based company, where services communicate over gRPC.

This protocol multiplexes requests over persistent TCP connections. For load balancing to be effective, it must therefore operate at the request level. kube-proxy does not support this, since it runs at Layer 4 and balances whole connections. It also handles only a few basic algorithms, such as round-robin.
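The difference can be illustrated with a toy simulation (pod names and counts are purely illustrative): a Layer 4 balancer picks a backend once per connection, so every request riding a persistent gRPC connection lands on the same pod, while a Layer 7 balancer can spread the individual requests.

```python
import itertools
from collections import Counter

BACKENDS = ["pod-a", "pod-b", "pod-c"]

def l4_assignment(n_requests: int) -> Counter:
    """Layer 4: the balancer only sees the TCP connection. With one
    persistent gRPC connection, every request rides that connection
    and lands on the backend chosen at connect time."""
    conn_backend = BACKENDS[0]  # chosen once, when the connection opens
    return Counter(conn_backend for _ in range(n_requests))

def l7_assignment(n_requests: int) -> Counter:
    """Layer 7: the balancer sees individual requests and can
    round-robin them across all backends."""
    rr = itertools.cycle(BACKENDS)
    return Counter(next(rr) for _ in range(n_requests))

print(l4_assignment(90))  # all 90 requests on pod-a
print(l7_assignment(90))  # 30 requests on each pod
```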

Databricks thus developed a client library paired with a control plane. The latter continuously tracks back-end changes via the Kubernetes API and translates them into xDS responses, which are conveyed to the application servers so they can route traffic accordingly.

Read also: Kubernetes: the first 17 AI-certified platforms

The implementation was eased by the existence of a widely adopted internal framework for inter-service communication, largely written in Scala. Each service integrates the client, which subscribes to updates from the control plane and maintains a dynamic list of all endpoints.
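The client-side bookkeeping might look something like the following sketch. The update format is a hypothetical simplification, not the actual xDS schema:

```python
from dataclasses import dataclass, field

@dataclass
class EndpointRegistry:
    """Client-side view of a service's backends, kept current by
    add/remove updates pushed from the control plane (illustrative)."""
    endpoints: set = field(default_factory=set)

    def apply_update(self, added=(), removed=()):
        # Apply one incremental update from the control plane.
        self.endpoints |= set(added)
        self.endpoints -= set(removed)

    def snapshot(self):
        # Current endpoint list, as the load balancer would consume it.
        return sorted(self.endpoints)

reg = EndpointRegistry()
reg.apply_update(added=["10.0.0.1:8080", "10.0.0.2:8080"])
reg.apply_update(added=["10.0.0.3:8080"], removed=["10.0.0.1:8080"])
print(reg.snapshot())  # ['10.0.0.2:8080', '10.0.0.3:8080']
```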

20% fewer Pods

The implementation thus granted access to P2C, which randomly selects two backend servers and chooses the one with fewer active connections or the lower load. More broadly, it enabled the integration of custom load balancing strategies — although Databricks acknowledges that in their case, “simple and consistent” approaches have often performed better.
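P2C itself fits in a few lines. The sketch below assumes the caller tracks active connection counts per backend; it illustrates the algorithm, not Databricks' implementation:

```python
import random

def p2c_pick(backends: dict, rng=random) -> str:
    """Power of Two Choices: sample two distinct backends at random
    and return the one with fewer active connections."""
    a, b = rng.sample(list(backends), 2)
    return a if backends[a] <= backends[b] else b

# Illustrative state: active connection counts per backend.
active = {"pod-a": 12, "pod-b": 3, "pod-c": 7}
choice = p2c_pick(active)
active[choice] += 1  # the caller records the new connection
```

Sampling only two backends keeps the decision O(1) regardless of fleet size, while still steering most traffic away from overloaded pods.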

The control plane was also wired to Envoy to bring external (ingress) traffic management onto the same plane. In addition to a genuinely more uniform distribution and lower latency, this led, at cluster scale, to a reduction in the number of pods in use (around 20% across several services, we're told).

Next steps: managing load balancing across clusters and regions. Databricks also intends to define strategies tailored to AI workloads, and mentions the prospect of a framework for cold-start management. The new load-balancing approach has indeed surfaced issues in this area. Previously, new pods generally received traffic only as connections were recycled, which gave them time to warm up. With the current mechanism, some may receive traffic before their warm-up is complete. Pending a possible framework, workarounds have been put in place, including diverting traffic away from pods showing the highest error rates.
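One way to implement such a workaround is a weighted pick that down-weights backends with elevated error rates. The weighting formula below is an illustrative assumption, not Databricks' actual heuristic:

```python
import random

def pick_backend(error_rates: dict, rng=random) -> str:
    """Weighted random pick that steers traffic away from backends
    with high error rates (e.g. pods still warming up). The weight
    formula is illustrative: weight ~ (1 - error_rate), floored so
    no backend is starved entirely."""
    backends = list(error_rates)
    weights = [max(1.0 - error_rates[b], 0.01) for b in backends]
    return rng.choices(backends, weights=weights, k=1)[0]

rates = {"pod-warm": 0.0, "pod-cold": 0.95}
picks = [pick_backend(rates) for _ in range(1000)]
# The warmed-up pod receives the vast majority of the traffic.
print(picks.count("pod-warm") > picks.count("pod-cold"))
```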

The choice to embed the logic in a client library excludes languages for which this library is not available. Also outside scope are network flows that still rely on infrastructure-level load balancing.

Headless services, not adopted…

Databricks considered headless services. With them, there is no virtual IP (ClusterIP); instead, each pod gets its own DNS entry, which lets a client connect directly to the pod of its choice.

This approach bypasses the limits of kube-proxy and provides access to Envoy’s advanced load-balancing strategies. In theory, it mitigates the issues related to connection reuse by giving clients direct visibility into the endpoints. It does, however, come with limits:

  • Weights cannot be assigned to endpoints.
  • Clients frequently cache DNS responses, which can lead them to send requests to endpoints that no longer exist or are no longer appropriate.
  • Absence of additional information about the endpoints (zone, region, shard, …) in DNS records, making topology-based routing difficult or impossible.

… as with the service mesh

The sidecar-based service mesh approach also showed its limits in the given environment. Among other issues:

  • Operational complexity (managing thousands of sidecars adds overhead, especially during upgrades)
  • Additional resource consumption within each pod
  • Because the logic is managed externally, it is difficult to implement strategies that rely on the context of the application layer

Databricks also explored Istio's Ambient Mesh mode (sidecar-less; L4 by default, L7 routing "on demand"). The platform team judged that this option wasn't worth pursuing. On the one hand, the company has proprietary systems for elements such as certificate distribution. On the other, its routing patterns are relatively static. Moreover, the relatively homogeneous Scala environment and the monorepo approach make the client-library solution all the more practical.

Read also: Knative, the serverless layer for Kubernetes, matures

For supplementary reading, see Publicis Sapient's reflections on the same topic. One of its teams faced uneven load distribution on an AKS-hosted application: the downscaling of one microservice disrupted load balancing toward a second.

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.