LinkedIn Leverages Kafka and gRPC for Service Discovery

The service discovery overhaul was urgent: at the current pace, the capacity of the control plane could be exhausted by 2025.

LinkedIn had this realization in the summer of 2022. As a result, it launched a modernization program.

Elasticity, Compatibility… The ZooKeeper control plane was hitting its limits

At the time, the control plane relied on ZooKeeper and had a flat structure: server applications registered their endpoints as ephemeral nodes in the form of D2 (Dynamic Discovery) URIs, while client applications issued read requests to follow the clusters they cared about.
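The flat layout can be pictured as a mapping from clusters to ephemeral URI nodes. The sketch below is a hypothetical Python model of that structure (class and URI names are illustrative, not LinkedIn's actual API):

```python
# Hypothetical model of the flat D2 registry: each cluster maps to the
# set of ephemeral URI nodes its server instances have registered.
class FlatRegistry:
    def __init__(self):
        self.clusters = {}  # cluster name -> set of D2 URIs

    def register(self, cluster, uri):
        # A server instance announces itself; in ZooKeeper this would be
        # an ephemeral node that disappears when the session dies.
        self.clusters.setdefault(cluster, set()).add(uri)

    def deregister(self, cluster, uri):
        # Session loss (e.g. a heartbeat timeout) removes the node.
        self.clusters.get(cluster, set()).discard(uri)

    def read(self, cluster):
        # Clients read the cluster they care about.
        return sorted(self.clusters.get(cluster, set()))

registry = FlatRegistry()
registry.register("profile-service", "d2://profile-service/host1:8080")
registry.register("profile-service", "d2://profile-service/host2:8080")
print(registry.read("profile-service"))
```

In the real system, every one of these reads and writes went through ZooKeeper's single consensus pipeline, which is what the next paragraph identifies as the bottleneck.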


This approach showed its limits in terms of elasticity. Because ZooKeeper is a strongly consistent system, all reads, writes, and session health checks went through the same queue. Requests could pile up until some could no longer be processed. Meanwhile, sessions could be closed on a health-check timeout, causing a loss of capacity on the server side and, ultimately, application unavailability.

Another limitation: the D2 entities, tied to LinkedIn-specific schemas, were incompatible with modern data planes such as gRPC and Envoy. Additionally, the read/write logic implemented inside application containers was Java-centric. Furthermore, the absence of an intermediate layer between the service registry and the application instances hindered the development of centralized RPC management techniques, for example for load balancing.

Kafka on the server side, gRPC on the client side

The new control plane introduces Kafka components and an Observer.

Kafka receives the servers’ write requests and their health information as events, called service discovery URIs.

The Observer component consumes these URIs and keeps them in memory. Client applications subscribe to them by opening a gRPC stream. They send their requests using the xDS protocol.
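The Observer's consume-cache-push loop can be sketched as follows. This is a hypothetical Python model in which a per-client queue stands in for a gRPC/xDS stream; the real component is written in Go, and none of these names are LinkedIn's actual code:

```python
import queue
import threading

# Hypothetical sketch of the Observer's consume-cache-push loop.
# Per-client queues stand in for gRPC/xDS streams.
class Observer:
    def __init__(self):
        self.cache = {}        # cluster -> latest list of URIs
        self.subscribers = {}  # cluster -> list of client queues
        self.lock = threading.Lock()

    def on_event(self, event):
        # A "service discovery URI" event from Kafka updates the
        # in-memory cache and is pushed to every client subscribed
        # to that cluster.
        cluster, uris = event["cluster"], event["uris"]
        with self.lock:
            self.cache[cluster] = uris
            targets = list(self.subscribers.get(cluster, []))
        for q in targets:
            q.put(uris)

    def subscribe(self, cluster):
        # Opening a stream returns the current state immediately,
        # then delivers incremental updates as they arrive.
        q = queue.Queue()
        with self.lock:
            self.subscribers.setdefault(cluster, []).append(q)
            if cluster in self.cache:
                q.put(self.cache[cluster])
        return q
```

The key design point is that clients never touch the registry directly: they hold a long-lived stream to the Observer, which absorbs the read load that previously hit ZooKeeper.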

The D2 configurations remain stored in ZooKeeper. Application owners convert them into xDS entities, which are then distributed to the Observer in the same way as the URIs.

Kubernetes readiness probes in the crosshairs

In this architecture, elasticity and availability take precedence over strict consistency. The Observer, written in Go and heavily concurrent, can handle 40,000 client streams and 10,000 updates per second while consuming 11,000 Kafka events per second, according to LinkedIn.


To gain even more elasticity, beyond simply increasing the number of observers, it would be possible to create two types of instances. On one side, instances that consume Kafka events; on the other, instances that respond to client requests.

Because it uses xDS, the control plane is compatible with Envoy, opening the door to multi-language support. And with this intermediary layer in place, it becomes possible to integrate service mesh features, and even to leverage Kubernetes readiness probes to move servers into a passive mode and thus harden the system.

P50 latency brought down to under one second

The rollout was complicated by the variety of clients (dependencies, network access, SSL…). For many, predicting the level of compatibility was difficult.

In addition, the project had to run in parallel for both reads and writes: broadly speaking, without the reads, migrating the writes would be blocked, and vice versa. The original infrastructure was therefore kept in a dual-mode approach, with Kafka serving as the primary source and ZooKeeper as the backup (used when Kafka data was unavailable). A cron job helped gauge how dependent applications still were on ZooKeeper and prioritized migrations accordingly.
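The dual-mode read path amounts to a primary/fallback lookup. A minimal Python sketch, assuming in-memory views of each source and an illustrative `stats` counter (comparable in spirit to the cron job that gauged ZooKeeper reliance):

```python
# Hypothetical dual-source read: the Kafka-backed view is preferred,
# with ZooKeeper as the fallback when its data is unavailable.
# All names here are illustrative, not LinkedIn's actual code.
def resolve(cluster, kafka_view, zookeeper_view, stats):
    uris = kafka_view.get(cluster)
    if uris:
        # Primary source answered.
        stats["kafka"] = stats.get("kafka", 0) + 1
        return uris
    # Fall back to the legacy registry and record the dependency,
    # so residual ZooKeeper reliance can be measured over time.
    stats["zookeeper"] = stats.get("zookeeper", 0) + 1
    return zookeeper_view.get(cluster, [])
```

Counting fallback hits per application is what lets the migration team rank which clients still lean on the old path.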

For reads, the main client-side metrics were the latency from subscription to data reception, errors in resolving those requests, and the consistency between ZooKeeper data and Kafka data. On the Observer side, LinkedIn examined the type, number, and capacity of client connections, the delay between receiving a request and queuing the data to send, as well as resource utilization rates.

For writes, the main metrics measured were:

  • Latency and connection losses on ZooKeeper and Kafka
  • URI similarity score between ZooKeeper and Kafka
  • Cache propagation delay (time from data reception to cache update)
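A URI similarity score can be computed as set overlap between what each source exposes for a cluster. The formula below is an assumption (a Jaccard-style index), not LinkedIn's disclosed metric:

```python
# Hypothetical similarity score between the URI sets exposed by
# ZooKeeper and by Kafka for one cluster: 1.0 means the two sources
# agree exactly, 0.0 means they share nothing. The Jaccard formula
# is illustrative; LinkedIn has not published its exact metric.
def uri_similarity(zk_uris, kafka_uris):
    zk, kafka = set(zk_uris), set(kafka_uris)
    if not zk and not kafka:
        return 1.0  # both sources empty counts as agreement
    return len(zk & kafka) / len(zk | kafka)
```

A score persistently below 1.0 for a cluster would flag writes that reach one registry but not the other, which is exactly what dual-mode operation needs to catch.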

LinkedIn states that 50% of clients now fetch data in under 1 second and 99% in under 5 seconds. On the ZooKeeper control plane, P50 and P99 latencies were 10 seconds and 30 seconds, respectively.


For reference, other engineering write-ups involving Kafka and/or ZooKeeper:

  • Unified configuration deployments at Uber
  • Kafka cost optimization on AWS at Grab
  • Kafka scaling at PayPal
  • Transition to a cellular architecture at Slack

Dawn Liphardt

I'm Dawn Liphardt, the founder and lead writer of this publication. With a background in philosophy and a deep interest in the social impact of technology, I started this platform to explore how innovation shapes — and sometimes disrupts — the world we live in. My work focuses on critical, human-centered storytelling at the frontier of artificial intelligence and emerging tech.