With the right GPU optimizations, exact nearest-neighbor search can prove at least as fast as an approximate search.
LinkedIn notes this in the context of modernizing its job search engine.
The old system, which relied in part on fixed taxonomy methods, was not conducive to adding a semantic dimension. It had also become complex, with components optimized for different objectives. At one point, the pipeline comprised nine steps, often duplicated across multiple channels. It became difficult to identify the source of relevance issues.
LinkedIn therefore set out to simplify the pipeline while aligning its components around shared objectives. At a high level, the approach consisted of building a master model capable of accurately classifying a query and a job posting, then using various fine-tuning and distillation techniques to derive smaller "student" models of near-peer quality.
The matching of queries and postings spans several aspects:
- Estimation of their semantic proximity
- Prediction of engagement (the probability that a candidate views a posting and applies)
- Prediction of value (the probability that the candidate is shortlisted and hired)
Hence the implementation of a multi-objective optimization to ensure alignment between retrieval of postings and their subsequent ranking.
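One way to picture this alignment is a single scoring function, shared by retrieval and ranking, that blends the three objectives. The linear blend and its weights below are illustrative assumptions, not LinkedIn's published formulation:

```python
def combined_score(semantic, p_engage, p_value, weights=(0.5, 0.3, 0.2)):
    """Blend the three objectives into one ranking score.

    semantic: semantic proximity between query and posting
    p_engage: predicted probability the candidate views and applies
    p_value:  predicted probability the candidate is shortlisted and hired
    The weights are illustrative, not LinkedIn's actual values.
    """
    w_sem, w_eng, w_val = weights
    return w_sem * semantic + w_eng * p_engage + w_val * p_value

# A posting with strong engagement and value signals can outrank one
# that is slightly closer semantically.
print(combined_score(0.9, 0.2, 0.1) < combined_score(0.7, 0.8, 0.6))  # → True
```

Because retrieval and ranking optimize the same composite objective, a posting surfaced by the retriever is unlikely to be discarded wholesale by the ranker.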
To feed the system, LinkedIn paired its historical data with synthetic data, initially generated by "advanced LLMs" with prompts focused on the objective of "textual similarity" and aligned with a scoring policy.
From that set emerged records intended to reflect expected user behavior in a semantic search environment. Their evaluation was initially entrusted to humans; it eventually came to involve an LLM.
Tool calls to build the retrieval strategy
The query-processing engine is designed so that it constructs the appropriate retrieval strategy. This is done by classifying the user’s intent, retrieving external data (profile and preferences), and performing entity recognition to tag taxonomy elements needed for filtering.
LinkedIn cites an example: the query "jobs in the New York metropolitan area where I already have connections [= a professional network]". The engine resolves "New York metropolitan area" into a geographic identifier and calls the graph service to find the identifiers of companies where the user has connections. These identifiers are then applied in the search index as strict filters, while the non-strict criteria are added as a complement, in vector form. This pattern rests on tool calls.
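The tool-call pattern described above can be sketched as follows. The tool names (`resolve_geo`, `connection_companies`), identifier formats, and keyword-based dispatch are hypothetical stand-ins for LinkedIn's intent classifier and internal services:

```python
# Hypothetical tool registry; in the real system, an intent classifier and
# entity recognizer would decide which tools to call, not keyword matching.
TOOLS = {
    "resolve_geo": lambda name: {"New York metropolitan area": "geo:70362"}.get(name),
    "connection_companies": lambda user_id: ["company:123", "company:456"],  # graph service stub
}

def build_retrieval_strategy(query, user_id):
    """Turn a query into strict index filters plus soft criteria kept as text
    (to be vectorized downstream)."""
    strategy = {"strict_filters": {}, "soft_criteria_text": query}
    if "New York metropolitan area" in query:
        strategy["strict_filters"]["geo_id"] = TOOLS["resolve_geo"](
            "New York metropolitan area")
    if "connections" in query:
        strategy["strict_filters"]["company_ids"] = TOOLS["connection_companies"](user_id)
    return strategy

strategy = build_retrieval_strategy(
    "jobs in the New York metropolitan area where I already have connections",
    "user-1")
```

The point of the split is that strict filters prune the index cheaply, while everything else survives as a vector that the semantic search scores.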
The engine also generates personalized recommendations, both to help clarify queries and to refine results after retrieval. For example, for the query "jobs in project management," a user might be advised to add attributes such as sector, experience level, or certifications. These attributes are extracted from online job postings, vectorized, and supplied to the engine via RAG. Conversely, once results are delivered, precise filters, such as company tags, let the user drill down further.
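A minimal sketch of such attribute suggestion via vector similarity, assuming a toy character-histogram "embedding" in place of a real embedding model, and an invented attribute list:

```python
import numpy as np

# Illustrative attribute pool extracted from postings (invented examples).
ATTRIBUTES = ["sector: construction", "experience level: senior",
              "certification: PMP", "sector: software"]

def embed(text):
    """Toy embedding: normalized letter-frequency vector. A real system
    would use a trained text-embedding model instead."""
    v = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha() and ch.isascii():
            v[ord(ch) - 97] += 1
    n = np.linalg.norm(v)
    return v / n if n else v

def suggest(query, k=2):
    """Return the k attributes most similar to the query."""
    q = embed(query)
    sims = [q @ embed(a) for a in ATTRIBUTES]
    return [ATTRIBUTES[i] for i in np.argsort(sims)[::-1][:k]]

suggestions = suggest("jobs in project management")
```

In the RAG framing, the retrieved attributes are injected into the engine's context so it can propose them back to the user as query refinements.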
Scaling the engine involves, among other techniques:
- Separate caching of personalized queries (highly dependent on individual profiles) and non-personalized queries
- Key-value caching to avoid duplicating tasks between queries
- Optimization of response schemas (minimizing verbose XML/JSON)
- Reducing the model size through distillation and fine-tuning
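The first caching technique above amounts to two stores with different keys: results that depend only on the query can be shared across users, while profile-dependent results cannot. A minimal illustration of that idea, not LinkedIn's implementation:

```python
class QueryCache:
    """Two-tier cache: non-personalized results keyed by the query alone
    (shareable across users); personalized results keyed by (user_id, query).
    Purely illustrative; eviction and TTLs are omitted."""

    def __init__(self):
        self.shared = {}    # query -> result
        self.personal = {}  # (user_id, query) -> result

    def get(self, query, user_id=None, personalized=False):
        if personalized:
            return self.personal.get((user_id, query))
        return self.shared.get(query)

    def put(self, query, result, user_id=None, personalized=False):
        if personalized:
            self.personal[(user_id, query)] = result
        else:
            self.shared[query] = result

cache = QueryCache()
cache.put("remote data jobs", ["posting-1", "posting-2"])                 # shared hit for everyone
cache.put("jobs for me", ["posting-9"], "user-1", personalized=True)      # visible to user-1 only
```

Keeping the two tiers separate means a surge of identical non-personalized queries never evicts or pollutes per-user entries, and vice versa.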
Vector search: exactness over approximation
In recent years, LinkedIn tended to rely on approximate nearest-neighbor search for vector retrieval. This method, however, struggled to meet the new engine's requirements: low latency, frequent index turnover (job postings stay online for only a few weeks), and support for complex filters.
Typically, for low latency, exact nearest-neighbor search is avoided unless working on small datasets. Yet a GPU-based infrastructure can achieve excellent performance if tasked with repetitive operations (no pointers to follow, no CPU communication…), LinkedIn reasoned. It demonstrated this by restricting itself to a single operation type (dense matrix multiplication), leveraging fused kernels and subdividing the data. The result: lower latency than with more complex index structures, while benefiting from the simplicity of managing a flat list of vectors.
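The core of that approach, exact search over a flat vector list reduced to blocked dense matrix multiplication, can be sketched in NumPy (on GPU, the same pattern maps to fused matmul kernels; the block size and the L2-normalization assumption are illustrative):

```python
import numpy as np

def exact_top_k(query_vecs, index_vecs, k, block=4096):
    """Exact nearest-neighbor search as one dense matmul per block.

    Vectors are assumed L2-normalized, so the dot product equals cosine
    similarity. Blocking over the flat index mirrors the "subdivide the
    data" idea; there are no pointer-chasing index structures to traverse.
    """
    scores = np.concatenate(
        [query_vecs @ index_vecs[i:i + block].T
         for i in range(0, len(index_vecs), block)], axis=1)
    # Unordered top-k per query, then sort those k by score, best first.
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    order = np.take_along_axis(scores, top, axis=1).argsort(axis=1)[:, ::-1]
    return np.take_along_axis(top, order, axis=1)

# Tiny sanity check: with one-hot vectors, the best match for query 2 is row 2.
result = exact_top_k(np.eye(5)[2:3], np.eye(5), k=2, block=2)
```

Because the index is just a flat matrix, inserting or expiring postings is a row-level operation, which suits the frequent turnover the article mentions.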
Once the ideal index was found, it could be integrated into the system to analyze the embeddings of postings generated by the master model and deliver, in a few milliseconds, the K nearest to a given query.
To improve retrieval quality, open-source models were fine-tuned on millions of query-posting pairs. The pipeline included a reinforcement learning loop, with the master model acting as a reward engine. Distillation is supervised, which has the advantage of providing the student model with the master model's logits in addition to the labels.
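A standard way to exploit both signals is a loss mixing cross-entropy on the hard labels with a soft-target term on the teacher's logits. The temperature `T` and mixing weight `alpha` below are conventional knowledge-distillation defaults, not values disclosed by LinkedIn:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Supervised distillation: cross-entropy on hard labels plus a
    soft-target term on temperature-smoothed teacher logits.
    T and alpha are illustrative defaults."""
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T) + 1e-12)
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * (T * T)
    hard = -np.log(
        softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12).mean()
    return alpha * soft + (1 - alpha) * hard

teacher = np.array([[5.0, 0.0]])
labels = np.array([0])
loss_aligned = distillation_loss(np.array([[5.0, 0.0]]), teacher, labels)
loss_opposed = distillation_loss(np.array([[0.0, 5.0]]), teacher, labels)
```

The soft term is what carries the extra information: the teacher's full logit distribution tells the student how wrong each alternative is, not just which label is right.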