Articles
Microsoft Research published a blog post introducing Eureka, an open-source framework for evaluating large foundation models. Eureka aims to provide a standardized approach to assessing AI model capabilities that goes beyond single-score reporting and rankings. The framework focuses on challenging, non-saturated capabilities (those where current models are still far from perfect), which allows for more meaningful comparisons and failure analyses of state-of-the-art models. Eureka evaluates 12 proprietary and open-weights models across various language and multimodal tasks. Key findings from the evaluation:
Multimodal capabilities: State-of-the-art models still struggle with detailed image understanding, particularly in geometric reasoning and spatial awareness. Models generally perform better on language-only tasks compared to multimodal tasks, with GPT-4o 2024-05-13 being an exception.
Language capabilities: Models show improvements in instruction following, but performance declines with longer context in question-answering tasks. There are significant gaps in factuality and grounding for information retrieval, with query fact precision rates below 55% and fact recall rates below 25%.
Non-determinism: Some models, like Gemini 1.5 Pro and GPT-4 1106, exhibit highly non-deterministic output for identical inputs, even with temperature set to zero.
Backward compatibility: Regression rates are high when comparing new model releases to earlier versions within the same family, potentially affecting user trust and application stability.
The relevant delta values in this context are the performance differences between models or between different versions of the same model. These deltas are crucial for understanding progress and identifying areas for improvement in AI capabilities. Ways to analyze them include the following (a short code sketch of the first two items follows the list):
Compare performance metrics: Analyze the differences in accuracy, precision, recall, and other relevant metrics between models or model versions for specific tasks.
Examine regression rates: Look at the percentage of examples where performance decreases between model versions, particularly within the same model family.
Assess capability gaps: Identify the largest performance differences between models for specific tasks, such as geometric reasoning or long-context question answering.
Analyze consistency: Compare the variation in outputs for identical inputs to quantify non-determinism.
Evaluate multimodal vs. language-only performance: Calculate the difference in performance between multimodal and language-only versions of the same task for each model.
Compare instruction following improvements: Measure the rate of improvement in instruction following capabilities across model families and versions.
Assess context length robustness: Calculate the performance drop as context length increases for question-answering tasks.
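As a concrete illustration of the first two points, here is a minimal sketch of computing per-task deltas and a regression rate from per-example scores. The column names and the toy `scores_v1`/`scores_v2` data are hypothetical and not part of Eureka's API.

```python
import pandas as pd

# Hypothetical per-example correctness (1 = correct, 0 = incorrect)
# for two releases of the same model family on the same benchmark.
scores_v1 = pd.DataFrame({
    "example_id": [1, 2, 3, 4, 5],
    "task": ["geometry", "geometry", "long_qa", "long_qa", "long_qa"],
    "correct": [1, 0, 1, 1, 0],
})
scores_v2 = scores_v1.copy()
scores_v2["correct"] = [1, 1, 0, 1, 1]

# Per-task delta: accuracy of the new version minus accuracy of the old one.
delta = (
    scores_v2.groupby("task")["correct"].mean()
    - scores_v1.groupby("task")["correct"].mean()
)
print(delta)

# Regression rate: fraction of examples the old version got right
# but the new version gets wrong (the backward-compatibility concern).
merged = scores_v1.merge(scores_v2, on=["example_id", "task"], suffixes=("_v1", "_v2"))
regressions = (merged["correct_v1"] == 1) & (merged["correct_v2"] == 0)
print("Regression rate:", regressions.mean())
```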
The rapid growth of online video content has driven the need for advanced video-and-language tasks such as video understanding and video summarization. Recent advances in multimodal and large language models, such as Gemini, have also expanded the possibilities for reasoning over longer videos. However, a significant challenge in developing models for long video understanding has been the lack of proper evaluation datasets. Existing test sets primarily focus on short, trimmed clips up to 30 seconds long, which doesn't reflect the reality of longer video content available online.
To address this gap, Google Research has developed Neptune, a new evaluation dataset designed to challenge the abilities of current large multimodal models. Neptune includes:
Multiple-choice and open-ended questions for videos up to 15 minutes long
Questions requiring reasoning over multiple modalities (visual and spoken content)
Tasks that involve long time horizons within videos
Data Pipeline
Creating a dataset for long videos is challenging due to the manual effort required. To streamline this process, the researchers developed a semi-automatic pipeline (a rough sketch of the question-generation step appears after the list):
Video Selection: Filtered for diversity and removed static content, gaming videos, and animated content.
Caption Extraction: Generated two types of captions:
Automatic speech recognition (ASR) captions
Frame captions using vision-language models (VLMs)
Summarization: Segmented videos into shots, grouped by topics, and summarized using large language models (LLMs).
Question Generation: Used LLMs to generate challenging questions and answers based on video captions.
Decoy Answer Generation: Created plausible but incorrect answer options.
Human Verification: Raters filtered and corrected questions, answers, and decoys.
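The paper's exact prompts and model calls are not reproduced here, so the snippet below is only an illustrative sketch of how the question-generation step might be prompted; the prompt wording and the `llm_generate` helper are hypothetical.

```python
# Illustrative sketch of the question-generation step above.
# Neptune's actual pipeline uses carefully engineered prompts to Gemini and other LLMs.

def build_qad_prompt(video_summary: str, asr_captions: str, n_questions: int = 3) -> str:
    return (
        "You are given a summary and the spoken transcript of a long video.\n"
        f"Summary:\n{video_summary}\n\n"
        f"Transcript:\n{asr_captions}\n\n"
        f"Write {n_questions} challenging questions that require reasoning over "
        "long time horizons and over both visual and spoken content. For each question, "
        "give the correct answer and four plausible but incorrect decoy answers."
    )

def generate_qad(video_summary: str, asr_captions: str, llm_generate) -> str:
    """`llm_generate` is any callable that sends a prompt to an LLM and returns its text output."""
    prompt = build_qad_prompt(video_summary, asr_captions)
    return llm_generate(prompt)
```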
Question Types and Video Domains
Neptune covers a broad range of long video reasoning abilities:
Video summarization
Temporal ordering
State changes
Creator intent
The dataset includes videos from various domains such as how-to videos, video blogs (VLOGs), sports, and cooking.
Evaluation Metrics
Neptune offers two evaluation modes:
Multiple-choice question answering
Open-ended question answering
To address limitations of traditional metrics, the researchers developed a new metric called the Gemma Equivalence Metric (GEM). This metric uses a fine-tuned open-source model (Gemma) to score question-answering results, providing a robust and stable evaluation method.
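The GEM checkpoint and scoring prompt are not public in this summary, so the snippet below is only a sketch of the general pattern of model-based answer scoring with a Gemma-style judge; the model id and prompt are assumptions, not the official GEM setup.

```python
from transformers import pipeline

# Sketch only: a Gemma-style judge deciding whether a candidate answer is
# equivalent to the reference answer. Model id and prompt are assumptions.
judge = pipeline("text-generation", model="google/gemma-2b-it")

def gem_style_score(question: str, reference: str, candidate: str) -> bool:
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Is the candidate answer equivalent to the reference answer? Reply Yes or No."
    )
    output = judge(prompt, max_new_tokens=5)[0]["generated_text"]
    # Only inspect the newly generated text after the prompt.
    return "yes" in output[len(prompt):].lower()
```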
Libraries
Simple Preference Optimization (SimPO) contains the code and released models for our paper SimPO: Simple Preference Optimization with a Reference-Free Reward. We propose a simpler and more effective preference optimization algorithm than DPO (Direct Preference Optimization) without using a reference model. SimPO outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings.
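The core of SimPO is a reference-free objective that uses the length-normalized log-likelihood of a response as the implicit reward and adds a target reward margin. The PyTorch snippet below is a minimal sketch of that loss as described in the paper, not the repository's actual implementation; tensor names and hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def simpo_loss(chosen_logps, rejected_logps, chosen_lengths, rejected_lengths,
               beta=2.0, gamma=0.5):
    """Minimal sketch of the SimPO objective.

    chosen_logps / rejected_logps: summed log-probabilities of the chosen and
    rejected responses under the policy model (shape: [batch]).
    chosen_lengths / rejected_lengths: token counts used for length normalization.
    beta, gamma: reward scale and target reward margin (illustrative values).
    """
    # Length-normalized log-likelihood acts as the implicit, reference-free reward.
    chosen_reward = beta * chosen_logps / chosen_lengths
    rejected_reward = beta * rejected_logps / rejected_lengths
    # Bradley-Terry-style loss with a target margin gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()

# Example with dummy values:
loss = simpo_loss(
    chosen_logps=torch.tensor([-120.0]), rejected_logps=torch.tensor([-150.0]),
    chosen_lengths=torch.tensor([100.0]), rejected_lengths=torch.tensor([110.0]),
)
```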
Eureka ML Insights is designed to help researchers and practitioners run reproducible evaluations of generative models efficiently, using a variety of benchmarks and metrics. The framework lets users define custom pipelines for data processing, inference, and evaluation, and provides a set of pre-defined evaluation pipelines for key benchmarks. It has a good project page where you can learn more about the evaluation framework and the thinking that went into it.
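The class names below are not Eureka ML Insights' actual API; they are only a conceptual sketch of the pipeline shape the framework describes (data processing, then inference, then evaluation).

```python
# Conceptual sketch only; not the Eureka ML Insights API.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Pipeline:
    load_data: Callable[[], Iterable[dict]]   # data processing stage
    run_model: Callable[[dict], str]          # inference stage
    score: Callable[[dict, str], float]       # evaluation stage

    def run(self) -> float:
        results = [self.score(ex, self.run_model(ex)) for ex in self.load_data()]
        return sum(results) / len(results)

# Usage: plug in a benchmark loader, a model call, and a metric.
pipeline = Pipeline(
    load_data=lambda: [{"prompt": "2+2=", "target": "4"}],
    run_model=lambda ex: "4",
    score=lambda ex, out: float(out.strip() == ex["target"]),
)
print(pipeline.run())
```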
Neptune is a dataset consisting of challenging question-answer-decoy (QAD) sets for long videos (up to 15 minutes). The goal of this dataset is to test video-language models for a broad range of long video reasoning abilities, which are provided as "question type" labels for each question, for example "video summarization", "temporal ordering", "state changes" and "creator intent" amongst others.
Neptune consists of challenging question-answer-decoy sets for videos to assess a number of long video reasoning abilities.
Neptune allows for two modes of evaluation: multiple-choice and open-ended question answering. For the latter, we provide our own open-ended metric based on Gemma, called Gemma Equivalence Metric (GEM).
Neptune was created using a semi-automatic pipeline that involves careful prompting of LLMs and VLMs, including Gemini. More details are provided in the paper.
Jupyter Scatter is an interactive scatter plot widget for Jupyter Notebook, Lab, and Google Colab that can handle millions of points and supports view linking.
Features
🖱️ Interactive: Pan, zoom, and select data points interactively with your mouse or through the Python API.
🚀 Scalable: Plot up to several million data points smoothly thanks to WebGL rendering.
🔗 Interlinked: Synchronize the view, hover, and selection across multiple scatter plot instances.
✨ Effective Defaults: Rely on Jupyter Scatter to choose perceptually effective point colors and opacity by default.
📚 Friendly API: Enjoy a readable API that integrates deeply with Pandas DataFrames.
🛠️ Integratable: Use Jupyter Scatter in your own widgets by observing its traitlets.
Why?
Imagine trying to explore a dataset of millions of data points as a 2D scatter. Besides plotting, the exploration typically involves three things: First, we want to interactively adjust the view (e.g., via panning & zooming) and the visual point encoding (e.g., the point color, opacity, or size). Second, we want to be able to select and highlight data points. And third, we want to compare multiple datasets or views of the same dataset (e.g., via synchronized interactions). The goal of jupyter-scatter is to support all three requirements and scale to millions of points.
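A minimal usage sketch, assuming a Pandas DataFrame with the columns shown below (check the documentation for the exact options, especially for view linking across multiple plots):

```python
import numpy as np
import pandas as pd
import jscatter

# A toy DataFrame standing in for millions of embedding or measurement points.
df = pd.DataFrame({
    "x": np.random.rand(10_000),
    "y": np.random.rand(10_000),
    "group": np.random.choice(["a", "b", "c"], 10_000),
})

scatter = jscatter.Scatter(data=df, x="x", y="y")
scatter.color(by="group")   # perceptually effective colors are chosen by default
scatter.show()              # renders the interactive WebGL widget in the notebook
```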
EVōC (pronounced "evoke") is Embedding Vector Oriented Clustering: a library for fast and flexible clustering of large datasets of high-dimensional embedding vectors. If you have CLIP vectors, outputs from sentence-transformers, or OpenAI or Cohere embeddings, and you want to quickly get good clusters out, this is the library for you. EVōC takes all the good parts of the UMAP + HDBSCAN combination for embedding clustering, improves upon them, and removes all the time-consuming parts. By specializing directly in embedding vectors, it achieves good-quality clustering with fewer hyperparameters to tune and in a fraction of the time.
EVōC is the library to use if you want:
Fast clustering of embedding vectors on CPU
Multi-granularity clustering, and automatic selection of the number of clusters
Clustering of int8 or binary quantized embedding vectors that works out-of-the-box
As of now this is very much an early beta version of the library; things can and will break. However, we would welcome feedback, use cases, and feature suggestions.
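A minimal usage sketch for EVōC, assuming a scikit-learn-style interface (the exact class name and options should be checked against the project's README):

```python
import numpy as np
import evoc  # the EVōC package

# Embedding vectors from any encoder (CLIP, sentence-transformers, OpenAI, Cohere, ...);
# random data is used here as a stand-in.
embeddings = np.random.rand(10_000, 384).astype(np.float32)

# Sketch assuming a scikit-learn-style fit_predict interface.
clusterer = evoc.EVoC()
labels = clusterer.fit_predict(embeddings)
print("number of clusters:", len(set(labels)) - (1 if -1 in labels else 0))
```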
The Llama Stack defines and standardizes the building blocks needed to bring generative AI applications to market. These blocks span the entire development lifecycle: from model training and fine-tuning, through product evaluation, to building and running AI agents in production. Beyond definitions, we are building providers for the Llama Stack APIs, both by developing open-source versions and by partnering with providers, ensuring developers can assemble AI solutions using consistent, interlocking pieces across platforms. The ultimate goal is to accelerate innovation in the AI space.
httpdbg is a tool for Python developers to easily debug the HTTP(S) client requests in a Python program.
To use it, execute your program using the pyhttpdbg command instead of python, and that's it. Open a browser to http://localhost:4909 to view the requests.
Full documentation is here.
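For example, given a small script that makes an HTTP request (the script name and URL here are just placeholders):

```python
# fetch_example.py -- any ordinary Python program that makes HTTP(S) calls.
import requests

response = requests.get("https://httpbin.org/get")
print(response.status_code)

# Run it through httpdbg instead of the regular interpreter:
#   pyhttpdbg fetch_example.py
# then open http://localhost:4909 in a browser to inspect the captured requests.
```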
A-LLMRec : Large Language Models meet Collaborative Filtering: An Efficient All-round LLM-based Recommender System
This is the source code for the paper A-LLMRec: Large Language Models Meet Collaborative Filtering: An Efficient All-round LLM-based Recommender System, accepted at KDD 2024.
In this paper, the authors propose an efficient all-round LLM-based recommender system, called A-LLMRec (All-round LLM-based Recommender system). The main idea is to enable an LLM to directly leverage the collaborative knowledge contained in a pre-trained collaborative filtering recommender system (CF-RecSys), so that the emergent abilities of the LLM can be jointly exploited. By doing so, A-LLMRec outperforms existing approaches across various scenarios, including warm/cold-item, few-shot, cold-user, and cross-domain settings.
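The core mechanism is aligning the frozen CF-RecSys embeddings with the LLM's token embedding space so the LLM can consume collaborative knowledge. The snippet below is only a rough sketch of such an alignment module under that reading of the paper; the layer sizes and names are illustrative, not the repository's actual code.

```python
import torch
import torch.nn as nn

class CFToLLMAligner(nn.Module):
    """Rough sketch: project frozen CF embeddings into the LLM's embedding space."""

    def __init__(self, cf_dim: int = 64, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(cf_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, cf_item_embeddings: torch.Tensor) -> torch.Tensor:
        # The outputs can be spliced into the LLM prompt as "soft" item tokens.
        return self.proj(cf_item_embeddings)

# Example: map a batch of 8 pre-trained CF item embeddings into the LLM space.
aligner = CFToLLMAligner()
soft_tokens = aligner(torch.randn(8, 64))
print(soft_tokens.shape)  # torch.Size([8, 4096])
```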