Patchscopes
Large Language Models (LLMs) have revolutionized the field of artificial intelligence, demonstrating remarkable capabilities in natural language understanding and generation. These models, composed of layers of interconnected artificial neurons, communicate through vectors of numbers known as hidden representations. However, deciphering the meaning encoded within these hidden representations has been a significant challenge. The field of machine learning interpretability seeks to bridge this gap, and "Patchscopes", a method proposed by Google researchers, offers a way to understand what an LLM “thinks”.
Patchscopes is a novel interpretability method that enables researchers to perform "surgery" on the neurons of an LLM. This involves cutting out and replacing hidden representations between different prompts and layers, allowing for a detailed inspection of the information contained within. The core concept is the "inspection prompt," which acts as a lens into the LLM's mind, facilitating the extraction of human-interpretable meaning. The framework leverages the inherent ability of LLMs to translate their own hidden representations into understandable text.
Understanding the Transformer Architecture: A Foundation for Patchscopes
Patchscopes builds upon a deep understanding of LLMs and the transformer architecture, which forms the backbone of many modern language models. Transformer models process text by first tokenizing the input, breaking it down into smaller units (words or sub-words). Each token is then embedded into a high-dimensional vector space, creating an initial hidden representation.
The transformer architecture consists of multiple layers of transformer blocks. Each layer refines the hidden representation based on the output of the preceding layer and the relationships between tokens in the input sequence. This process continues through the final layer, where the hidden representation is used to generate the output text. Decoder-only models, which are the focus of Patchscopes, only consider preceding tokens when generating the next token, making them particularly well-suited for language generation tasks.
The Patchscopes framework operates on a simple yet powerful premise: LLMs possess the inherent ability to translate their own hidden representations into human-understandable text. By patching hidden representations between different locations during inference, researchers can inspect the information within a hidden representation, understand LLM behavior, and even augment the model's performance.
The process involves several key steps:
Source Prompt: A source prompt is fed into the LLM, generating hidden representations at each layer. This prompt serves as the context from which information will be extracted.
Inspection Prompt: An inspection prompt is designed to elicit a specific type of information from the LLM. This prompt typically includes a placeholder token where the hidden representation from the source prompt will be inserted.
Patching: The hidden representation from a specific layer and token position in the source prompt is "patched" into the placeholder token in the inspection prompt. This effectively replaces the LLM's internal representation with the extracted information.
Generation: The LLM continues decoding from the patched inspection prompt, generating text based on the combined information from the source and inspection prompts.
Analysis: The generated text is analyzed to understand the information encoded in the hidden representation. This can involve evaluating the accuracy of factual information, identifying the concepts captured by the representation, or assessing the model's reasoning process.
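To make these steps concrete, here is a minimal sketch of cross-prompt activation patching in Python, assuming a Hugging Face decoder-only model (GPT-2 is used only for brevity). The prompts, layer index, and token positions are illustrative assumptions, not the exact configurations used in the Patchscopes paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; Patchscopes applies to any decoder-only LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def get_hidden(prompt, layer, token_idx):
    """Run the source prompt and return the hidden state of one token
    at the output of transformer block `layer`."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block `layer` is index layer + 1
    return out.hidden_states[layer + 1][0, token_idx]

def patch_and_generate(inspection_prompt, placeholder_idx, hidden, layer, max_new_tokens=20):
    """Overwrite the placeholder token's hidden state at block `layer` of the
    inspection prompt, then let the model keep decoding from the patched state."""
    def hook(module, inputs, output):
        hs = output[0]
        if hs.shape[1] > 1:              # patch only the prefill pass, not cached decode steps
            hs[0, placeholder_idx] = hidden
    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        ids = tok(inspection_prompt, return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(gen[0], skip_special_tokens=True)

# Source prompt -> grab the entity token's representation, then read it out with a
# few-shot "describe x" style inspection prompt (prompt wording here is an assumption).
h = get_hidden("Diana, Princess of Wales", layer=8, token_idx=0)
inspection = "Syria: country in the Middle East, Leonardo DiCaprio: American actor, x"
print(patch_and_generate(inspection, placeholder_idx=-1, hidden=h, layer=8))
```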
Case Study 1: Entity Resolution
The first case study explores how LLMs resolve entities (people, places, movies, etc.) across different layers of the model. The goal is to understand at what point the model associates a token with its correct meaning. For example, how does the model determine that "Diana" refers to "Princess Diana" rather than the generic name?
To investigate this, a source prompt containing the entity name is fed into the LLM. The hidden representation of the entity token is extracted at each layer and patched into an inspection prompt designed to elicit a description of the entity. By analyzing the generated descriptions, researchers can determine when the model has successfully resolved the entity.
The results of this case study suggest that entity resolution typically occurs in the early layers of the model (before layer 20). This aligns with theories about layer function, which posit that early layers are responsible for establishing context from the prompt. The study also reveals that tokenization (how the input text is broken down into tokens) has a significant impact on how the model navigates its embedding space.
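Reusing the `get_hidden` and `patch_and_generate` helpers from the sketch above, the layer-by-layer analysis in this case study amounts to a simple sweep (the inspection prompt wording is again an assumption):

```python
# Illustrative layer sweep: at which block does "Diana" start to be described as
# Princess Diana rather than as a generic first name?
source_prompt = "Diana"
inspection_prompt = "Tell me about x"   # assumed inspection prompt; "x" is the placeholder
for layer in range(model.config.n_layer):
    h = get_hidden(source_prompt, layer, token_idx=-1)
    desc = patch_and_generate(inspection_prompt, placeholder_idx=-1, hidden=h, layer=layer)
    print(f"layer {layer:2d}: {desc}")
```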
Case Study 2: Attribute Extraction
The second case study focuses on evaluating how accurately the model's hidden representation captures well-known concepts and their attributes. For example, can the model identify that the largest city in Spain is Madrid?
To extract an attribute, a source prompt containing the subject (e.g., "Spain") is fed into the LLM. The hidden representation of the subject token is extracted and patched into an inspection prompt designed to elicit the specific attribute (e.g., "The largest city is x"). By analyzing the generated text, researchers can determine whether the model correctly identifies the attribute.
This case study compares Patchscopes to a technique called "probing," which involves training a classifier to predict an attribute from a hidden representation. Unlike probing, Patchscopes does not require any labeled data or supervised training. The results show that Patchscopes outperforms probing in early layers, suggesting that it is a viable alternative for attribute extraction.
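In the same sketch, attribute extraction only changes the prompts; a hypothetical version of the "largest city" example could look like this:

```python
# Illustrative attribute extraction: does the representation of "Spain" carry the
# fact that its largest city is Madrid? No labels or classifier training required.
h = get_hidden("Spain", layer=8, token_idx=-1)
print(patch_and_generate("The largest city of x", placeholder_idx=-1, hidden=h, layer=8))
```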
Case Study 3: Augmenting Model Behavior
The third case study explores the possibility of using Patchscopes to change model behavior and improve its performance. The focus is on multi-hop reasoning, a problem where the answer depends on making logical connections between disjointed pieces of information. For example, "the largest city in sushi's country of origin" requires the model to connect sushi to Japan and then Japan to Tokyo.
To address this, a dataset of two-clause multi-hop reasoning queries is constructed. The model can correctly answer each clause independently but fails to answer the composite query. By patching the hidden representation from an earlier token in the same prompt, researchers can intervene and correct the model's answer.
The results of this case study demonstrate the potential of Patchscopes as a method for making sense of model behavior and even correcting incorrect outputs. By generalizing the query structure and patching to earlier tokens, the model can be guided to conduct sequential reasoning in the correct order.
Advantages of Patchscopes
Patchscopes offers several advantages over existing interpretability methods:
Flexibility: Patchscopes is a highly flexible method for defining experiments that extract, verify, and characterize the model's information retrieval process.
Expressiveness: Patchscopes enables highly expressive generations, particularly in early layers where models are believed to do most of their processing.
Data Efficiency: Unlike probing, Patchscopes does not require any labeled data or supervised training.
Simplicity: The framework is relatively simple to implement and use, making it accessible to a wide range of researchers.
Unification: Patchscopes improves and unifies prior work (e.g., vocabulary projection, probing classifiers, and computational interventions) into a shared theoretical framework.
Limitations
Despite its many advantages, Patchscopes also has some limitations:
Human Judgment: The effectiveness of Patchscopes depends on human judgment in designing appropriate source and inspection prompts.
Configuration Complexity: Determining the optimal patching configuration (layer, prompt, and token position choices) can be challenging.
Generalizability: More research is needed to understand how to generalize Patchscopes across different modeling tasks and architectures.
A Large Scale Learned Retrieval System at Pinterest
Pinterest has written a very comprehensive blog post on their learned retrieval system, which uses a two-tower model architecture to generate candidates for various recommendation surfaces in the retrieval stage.
Pinterest's historical retrieval systems relied on graph-based heuristics such as Pin-Board relationships and explicit user-followed interests. While effective for specific use cases, these methods lacked the flexibility to adapt to nuanced user behaviors captured through engagement signals. The ranking stage, powered by a transformer-based model analyzing raw user interaction sequences, demonstrated superior personalization capabilities, creating pressure to modernize the retrieval layer.
Specifically, their traditional retrieval methods suffered from three primary constraints:
Static Feature Engineering: Rule-based systems required manual feature engineering, limiting adaptability to emerging content trends and evolving user preferences.
Cold Start Problems: New users and content faced discoverability challenges due to insufficient interaction history for graph-based recommendations.
Scalability Bottlenecks: As Pinterest's content catalog grew exponentially, maintaining real-time retrieval latency while expanding coverage became increasingly difficult.
These limitations motivated the development of an embedding-based retrieval system that could learn representations directly from user engagement data, mirroring the success of deep learning in ranking systems.
Two-Tower Model Design
The core of the system is a dual neural network ("two-tower") architecture, which has become the industry standard for embedding-based retrieval:
User Tower: Processes user-specific features including long-term engagement history (captured through sequence modeling), demographic/profile data, and real-time context (device type, location, etc.).
Item Tower: Encodes item attributes such as visual features (from computer vision models), textual metadata, and historical engagement statistics.
Embeddings from both towers are optimized through dot product similarity, trained to maximize the likelihood of positive user-item interactions. The model architecture diagram (Fig. 2 in original article) illustrates feature concatenation and dense layer transformations preceding the final embedding projection.
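A minimal sketch of this two-tower pattern in PyTorch is shown below; the feature dimensions, layer sizes, and normalization choice are assumptions for illustration, not Pinterest's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    """Concatenated features -> dense layers -> L2-normalized embedding."""
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, features):
        return F.normalize(self.net(features), dim=-1)

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim, item_dim, emb_dim=64):
        super().__init__()
        self.user_tower = Tower(user_dim, emb_dim)  # engagement history, profile, context
        self.item_tower = Tower(item_dim, emb_dim)  # visual, text, engagement statistics

    def forward(self, user_feats, item_feats):
        return self.user_tower(user_feats), self.item_tower(item_feats)

model = TwoTowerModel(user_dim=128, item_dim=96)
u, v = model(torch.randn(32, 128), torch.randn(32, 96))   # a batch of 32 (user, item) pairs
scores = (u * v).sum(dim=-1)                               # dot-product similarity per pair
```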
Training Methodology and Loss Optimization
Given the impracticality of full softmax over billions of items, Pinterest employs in-batch negative sampling with sampled softmax loss. This approach efficiently reuses other items in the training batch as negatives, significantly reducing computational overhead. To counteract popularity bias inherent in sampled negatives, the loss function incorporates a correction term:
s(user, item) = e_user · e_item − log P(item is in batch)

where e_user and e_item denote the user and item embeddings, respectively. This corrected score replaces the raw dot product inside the sampled softmax loss; the subtracted log-probability acts as a debiasing factor, downweighting frequently sampled popular items to prevent the model from collapsing towards trivially popular recommendations.
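Continuing the two-tower sketch above, an illustrative in-batch sampled softmax with this log-probability correction might look like the following; `log_item_prob` (an estimate of each item's probability of appearing in a batch, e.g. from a streaming frequency estimator) is an assumed input.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_emb, item_emb, log_item_prob):
    """In-batch negatives with logQ correction.

    user_emb:      (B, D) user embeddings
    item_emb:      (B, D) embeddings of each user's engaged (positive) item
    log_item_prob: (B,)   log P(item is in batch), used to debias popular items
    """
    logits = user_emb @ item_emb.T            # (B, B): row i scores user i against all items
    logits = logits - log_item_prob[None, :]  # downweight frequently sampled popular items
    labels = torch.arange(user_emb.shape[0])  # the diagonal entries are the true positives
    return F.cross_entropy(logits, labels)

loss = sampled_softmax_loss(u, v, torch.zeros(32))  # zeros = no correction, for illustration
```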
Online Serving Component
Real-Time User Embedding Computation: User features are processed during request handling to generate fresh embeddings reflecting latest interactions. This necessitates low-latency feature pipelines and model inference optimized through techniques like quantization and hardware acceleration.
ANN Query Execution: The generated user embedding queries Pinterest's proprietary ANN service (Manas), which maintains precomputed item embeddings in a Hierarchical Navigable Small World (HNSW) graph structure for efficient similarity search.
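Manas is proprietary, but the same serving pattern can be illustrated with an off-the-shelf HNSW index such as the one in FAISS; the dimensions and parameters below are arbitrary.

```python
import faiss
import numpy as np

d = 64                                                     # embedding dimension
item_embs = np.random.rand(100_000, d).astype("float32")   # precomputed item tower outputs

# Offline: build an HNSW index over the catalog (32 graph neighbors per node),
# using inner product to match the dot-product training objective.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(item_embs)

# Online: embed the user at request time and fetch the top-100 candidate items.
user_emb = np.random.rand(1, d).astype("float32")
scores, item_ids = index.search(user_emb, 100)
```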
Offline Indexing Pipeline
Batch Item Embedding Generation: A distributed compute infrastructure periodically regenerates item embeddings using the latest item tower model. This process handles Pinterest's entire content catalog, requiring efficient sharding and parallelization strategies.
Consistency Management: Strict version control ensures alignment between the item embeddings in the ANN index and the user tower model generating query vectors. Mismatches could render the embedding space meaningless.
Automated Model Retraining System
Maintaining recommendation freshness requires continuous model updates, introducing operational complexity:
Version Synchronization Mechanism
Metadata Attachment: ANN service hosts store model version mappings, allowing the serving infrastructure to dynamically select compatible user tower models during query execution.
Rollout Safety: Progressive index updates ensure that during deployment transitions, each ANN host consistently pairs its item embeddings with the corresponding user model version. This prevents partial deployments from causing embedding space inconsistencies.
Rollback Preparedness
Versioned Model Artifacts: The system retains multiple historical versions of the user tower model, enabling rapid rollbacks without requiring full index rebuilds.
Health Monitoring: Automated alerts trigger rollbacks if embedding space divergence exceeds predefined thresholds, measured through offline metrics like recall@k on holdout datasets.
Performance Metrics and A/B Testing
The learned retrieval system's deployment yielded measurable improvements:
User Coverage: Achieved top coverage among Pinterest's 20+ candidate generators, successfully retrieving relevant items for diverse user segments.
Engagement Lift: Drove statistically significant increases in saves (Pinterest's core engagement metric), enabling retirement of two legacy retrieval systems.
Latency Profile: Maintained p99 latency under 50ms despite processing billions of items, comparable to previous heuristic-based systems.
System-Level Advancements
Feature Iteration Velocity: Machine learning-driven retrieval reduced dependency on manual feature engineering, allowing rapid experimentation with new signal types (e.g., multi-modal features combining visual and textual data).
Infrastructure Consolidation: The success of embedding-based retrieval provided a template for modernizing other recommendation surfaces (notifications, search, etc.), creating platform-wide efficiencies.
Future Directions
Multi-Objective Optimization: Extending the retrieval system to balance engagement with secondary goals like diversity and novelty through modified loss functions or constrained ANN search.
Cross-Modal Retrieval: Integrating visual search capabilities directly into the retrieval stage, enabling content discovery through image similarity alongside behavioral signals.
Real-Time Index Updates: Reducing item embedding refresh latency from hours to minutes to surface trending content more rapidly.
Papers
While language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims, these generally do not reflect a globally coherent, manipulable model of the world. As a consequence, current LMs also generate incorrect or nonsensical content, and are difficult to edit and bring up to date. The authors present Deductive Closure Training (DCT), which uses LMs themselves to identify implications of (and contradictions within) the text that they generate, yielding an efficient self-supervised procedure for improving LM factuality.
DCT prompts LMs to generate additional text implied by these documents, reason globally about the correctness of this generated text, and finally fine-tune on text inferred to be correct. Given seed documents from a trusted source, DCT provides a tool for supervised model updating; if seed documents are sampled from the LM itself, DCT enables fully unsupervised fine-tuning for improved coherence and accuracy. Across the CREAK, MQUaKE, and Reversal Curse datasets, supervised DCT improves LM fact verification and text generation accuracy by 3 - 26%; on CREAK fully unsupervised DCT improves verification accuracy by 12%. These results show that LMs’ reasoning capabilities during inference can be leveraged during training to improve their reliability.
ByteDance presents LatentSync, an end-to-end lip sync framework based on audio-conditioned latent diffusion models without any intermediate motion representation, diverging from previous diffusion-based lip sync methods based on pixel space diffusion or two-stage generation. Their framework can leverage the powerful capabilities of Stable Diffusion to directly model complex audio-visual correlations. Additionally, they found that diffusion-based lip sync methods exhibit inferior temporal consistency due to the inconsistency in the diffusion process across different frames. They propose Temporal REPresentation Alignment (TREPA) to enhance temporal consistency while preserving lip-sync accuracy. TREPA uses temporal representations extracted by large-scale self-supervised video models to align the generated frames with the ground truth frames.
Libraries
Language model pretraining with next token prediction has proved effective for scaling compute but is limited by the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. The report presents Kimi k1.5, Moonshot AI's latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of the approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models.
Long context scaling. Scale the context window of RL to 128k and observe continued improvement of performance with an increased context length. A key idea behind the approach is to use partial rollouts to improve training efficiency, i.e., sampling new trajectories by reusing a large chunk of previous trajectories, avoiding the cost of re-generating new trajectories from scratch.
Improved policy optimization. Derive a formulation of RL with long-CoT and employ a variant of online mirror descent for robust policy optimization. This algorithm is further improved by an effective sampling strategy, length penalty, and optimization of the data recipe.
Simplistic Framework. Long context scaling, combined with the improved policy optimization methods, establishes a simplistic RL framework for learning with LLMs. Since the context length can be scaled, the learned CoTs exhibit the properties of planning, reflection, and correction. An increased context length has the effect of increasing the number of search steps.
Multimodality. The model is jointly trained on text and vision data, giving it the ability to reason jointly over the two modalities.
Cellm's =PROMPT() function outputs the AI response to a range of text, similar to how Excel's =SUM() function outputs the sum of a range of numbers.
For example, you can write =PROMPT(A1, "Extract all person names mentioned in the text.") in a cell's formula and drag the cell to apply the prompt to many rows. Cellm is useful when you want to use AI for repetitive tasks that would normally require copy-pasting data in and out of a chat window many times.
Amurex is your simple yet powerful AI meeting assistant that seamlessly integrates into your workflow. Built with cutting-edge AI, Amurex ensures you never miss a detail, always stay on top of action items, and make every meeting more productive.
With features like real-time suggestions, smart summaries, and follow-up emails, Amurex acts as your personal copilot for all your meetings—saving time and boosting efficiency.
As an open-source tool, Amurex is designed to be transparent, secure, and privacy-focused, giving you confidence in how your data is handled while delivering a seamless AI-driven experience.