Gemini to migrate code, Gemini to do Automatic Speech Recognition
Llama-Factory, micrograd, Inspectus, Livebench
Articles
Google Research published a blog post on an approach that assists Google developers with large-scale code migrations using ML-driven workflows.
Problem To Solve
Over the past decades, source code bases have grown exponentially, making it increasingly difficult to manage and update them. Google's monorepo, which contains billions of lines of code, exemplifies the complexity involved in maintaining such vast codebases. Code transformations, or "migrations," are necessary to accommodate new language versions, framework updates, changing APIs, and data types.
However, keeping up with these changes across a massive codebase and applying them without breaking backwards compatibility has been a major engineering challenge.
Google has traditionally used specialized infrastructure for large-scale changes, employing tools like Kythe and Code Search for static analysis and ClangMR for making changes. While effective for uniform changes with limited edge cases, these tools struggle with complex migrations, such as changing interfaces across multiple components or updating tests. Static analysis and simple migration scripts are insufficient for handling intricate code structures.
Enter the “Human in the Loop”
Even though compilers and static analysis provide some coverage in these migrations, the use cases are very diverse: the behavior of a new API might have changed, or the old API might no longer be supported, so one still needs to understand the semantics of the API change. It is exactly in this type of problem that ML can play a large role.
To address these challenges, Google has developed a new ML-driven approach that assists engineers in the process of code migrations at scale. This approach leverages generative AI to automate significant portions of the migration process, allowing engineers to focus on more complex aspects without being isolated from the process. This “Human in the Loop” approach has been shown to generate the majority of the new code necessary for migrations, significantly reducing the human effort involved, while humans still supervise and handle the portions or code chunks that ML cannot migrate on its own.
Gemini to the Rescue
The core of this new workflow is a Gemini model that has been fine-tuned on internal Google code and data. This model is capable of generating and validating code changes, making it a crucial component of the migration toolkit. The model's ability to adapt to the surrounding code environment allows it to handle complex migrations that traditional tools cannot manage effectively.
The migration process is conceptually divided into three stages: 1) targeting the locations in the codebase that need modifications, 2) edit generation and validation, and 3) change review and rollout. Each stage benefits from AI, but the primary focus is on the second stage—edit generation and validation.
For generating and validating code changes, the Gemini model requires specific inputs, including a set of files with the locations of expected changes (path and line number), one or two prompts describing the change, and optional few-shot examples to determine if a file needs migration. The model ensures that privacy protections are maintained by not altering or surfacing internal codebase IDs.
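For illustration, a request to the migration model might be organized roughly as follows; the field names and structure below are assumptions made for the sake of a concrete sketch, not Google's internal format.

```python
from dataclasses import dataclass, field

# Hypothetical shape of the inputs described above; the names are illustrative,
# not Google's internal schema.
@dataclass
class ChangeLocation:
    path: str   # file expected to need a migration edit
    line: int   # line number where the change is expected

@dataclass
class MigrationRequest:
    locations: list[ChangeLocation]   # where edits are expected
    prompts: list[str]                # one or two natural-language descriptions of the change
    few_shot_examples: list[str] = field(default_factory=list)  # optional examples to decide if a file needs migration

request = MigrationRequest(
    locations=[ChangeLocation(path="accounts/id_service.h", line=42)],
    prompts=["Migrate callers of the deprecated API to the new interface and update the affected tests."],
)
```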
Change Review and Rollout
Once the ML generates the code changes, engineers review and roll out the changes. Although ML handles a significant portion of the work, engineers still need to analyze the files that require changes and review the ML-generated modifications. The ML-driven approach has reduced the total time spent on migrations by an estimated 50%, with 80% of the code modifications in the landed change lists being ML-authored.
Google Research published a blog post discussing a novel approach to evaluating Automatic Speech Recognition (ASR) performance, focusing on meaning preservation rather than traditional metrics like Word Error Rate (WER).
Traditional metrics like Word Error Rate (WER) and Word Accuracy (WACC) have been the standard for evaluating ASR systems. These metrics, however, have significant limitations:
1. Syntactic Focus: They primarily measure syntactic accuracy, not the comprehensibility or usability of the transcriptions.
2. Severity of Errors: They do not account for the severity of transcription errors in terms of meaning preservation. You can interpret this as: not all errors are created equal.
3. Atypical Speech: These metrics are particularly inadequate for evaluating ASR performance on atypical speech patterns, where the preservation of meaning is more critical than syntactic accuracy.
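For reference, WER is just the word-level edit distance between the reference and the hypothesis, normalized by the length of the reference; a minimal sketch of the standard computation:

```python
# Word Error Rate: (substitutions + deletions + insertions) / number of reference words,
# computed here via a standard Levenshtein dynamic program over words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One dropped word yields a 20% error rate even though the meaning is intact.
print(wer("please turn off the lights", "please turn off lights"))  # 0.2
```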
For users with atypical speech patterns, ASR models can still be beneficial even with high WER, as long as the meaning is preserved. This is crucial for applications such as live conversations, voice input for text messages, home automation, and other use cases that can tolerate minor grammatical errors.
However, studies have shown that ASR systems often perform poorly on speech from individuals with disorders or accents, highlighting the need for metrics that better capture the utility of these systems for such users.
This generally matches my own experience: I have a heavy accent, and not a single home system or voice input has worked very well for me when I am speaking English, probably because the recognition is not trained on a large corpus of speech with accents from a wide variety of people.
In order to solve some of these shortcomings, Google is proposing a new system based on Large Language Models (LLMs) to determine if a transcript accurately captures the intended meaning compared to a reference text. This approach aims to provide a more user-centric evaluation of ASR systems, particularly for atypical speech.
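A minimal sketch of how such an LLM-based meaning-preservation check could be framed; the prompt wording and the YES/NO protocol below are assumptions for illustration, not the actual setup used in the paper.

```python
# Conceptual sketch of an LLM-based meaning-preservation classifier.
# `llm` is any callable mapping a prompt string to a response string;
# the prompt text is illustrative, not the prompt used by the researchers.
def preserves_meaning(reference: str, transcript: str, llm) -> bool:
    prompt = (
        "A speaker intended to say the reference sentence below. "
        "An ASR system produced the transcript. "
        "Answer YES if the transcript preserves the intended meaning, otherwise answer NO.\n\n"
        f"Reference: {reference}\n"
        f"Transcript: {transcript}\n"
        "Answer:"
    )
    return llm(prompt).strip().upper().startswith("YES")

# A high-WER transcript can still be judged as meaning-preserving:
# preserves_meaning("turn off the kitchen lights", "turn off kitchen light", my_llm)
```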
In order to evaluate this new system, they built the following components:
1. Project Euphonia Corpus: The researchers utilized the Project Euphonia corpus, which contains disordered speech samples, to develop and test their approach.
2. LLM-Based Classifier: They developed an LLM-based classifier to assess meaning preservation, comparing its performance with human evaluations by speech-language pathologists (SLPs).
3. Comparison with Traditional Metrics: The LLM-based approach was found to be more effective than WER in evaluating ASR usefulness, especially in high error rate scenarios.
The approach they propose has multiple advantages. It centers on how users actually understand the speech, whereas traditional metrics capture neither whether an error matters nor how severe it is. It also benefits from the automation an LLM brings over manual evaluation, which is both more time-consuming and more costly compared to an automated LLM approach.
1. User-Centric Evaluation: This approach better assesses the model's utility for individual users, particularly those with speech impairments.
2. Efficiency: The LLM-based classifier can provide a more efficient and scalable evaluation compared to human assessments, which are laborious and expensive.
They first used a large language model for the meaning preservation task. This initial approach demonstrated the potential for LLMs to provide a more accurate assessment of ASR performance in terms of meaning preservation.
Building on their initial work, the researchers explored using Google's Gemini model:
1. Gemini Nano-1: This model achieved comparable classifier performance with a significantly smaller footprint, demonstrating efficiency without sacrificing accuracy.
2. Cross-Lingual Generalization: Despite being trained only on English examples, the Gemini-based classifier showed cross-lingual generalization, accurately assessing meaning preservation in French and Spanish.
They showed that the Gemini-based approach achieved similar or better performance compared to the initial large language model while being more efficient. Specific quantitative results are not provided, but the overall improvement in evaluation efficiency and accuracy is highlighted.
Further, the classifier's ability to generalize across languages without additional training is a significant finding. This capability was tested and confirmed for French and Spanish, indicating the potential for broader applications in multilingual contexts.
The meaning preservation metric offers a more nuanced and user-centric approach to evaluating ASR systems. This is particularly valuable for users with atypical speech and in low-resource domains or languages. The approach can complement traditional metrics like WER, providing a more comprehensive evaluation of ASR performance.
This work supports the development of fully personalized speech recognition models, such as those used in Project Relate, to help people with atypical speech be better understood. Personalized models can significantly improve the user experience by tailoring the ASR system to individual speech patterns.
The transition from a large language model to Gemini Nano-1 demonstrates the potential for using smaller, more efficient models without significant performance loss. This is crucial for deploying ASR systems in resource-constrained environments, especially on mobile devices, where both connectivity and compute are modest compared to backend servers; the Nano model comes in handy precisely because of its low resource requirements.
This has the potential to significantly improve ASR technologies for people with speech impairments, enhancing their ability to communicate effectively using voice-based interfaces. Improved ASR systems can lead to better accessibility and inclusion for individuals with speech disorders.
By shifting focus from purely syntactic metrics to meaning preservation, this work may influence the direction of ASR model development and evaluation across the industry. Developers may prioritize models that better capture the intended meaning of speech, leading to more practical and user-friendly ASR systems.
Libraries
Llama-Factory is a platform for fine-tuning LLMs with a web UI. It has the following features:
Various models: LLaMA, LLaVA, Mistral, Mixtral-MoE, Qwen, Yi, Gemma, Baichuan, ChatGLM, Phi, etc.
Integrated methods: (Continuous) pre-training, (multimodal) supervised fine-tuning, reward modeling, PPO, DPO, KTO, ORPO, etc.
Scalable resources: 16-bit full-tuning, freeze-tuning, LoRA and 2/3/4/5/6/8-bit QLoRA via AQLM/AWQ/GPTQ/LLM.int8/HQQ/EETQ.
Advanced algorithms: GaLore, BAdam, DoRA, LongLoRA, LLaMA Pro, Mixture-of-Depths, LoRA+, LoftQ, PiSSA and Agent tuning.
Practical tricks: FlashAttention-2, Unsloth, RoPE scaling, NEFTune and rsLoRA.
Experiment monitors: LlamaBoard, TensorBoard, Wandb, MLflow, etc.
Faster inference: OpenAI-style API, Gradio UI and CLI with vLLM worker.
micrograd is an "autograd" engine (short for automatic gradient) that implements the backpropagation algorithm, as it was prominently popularized for training neural networks in the 1986 paper Learning Internal Representations by Error Propagation by Rumelhart, Hinton and Williams. This repository builds on the earlier repo karpathy/micrograd, but modifies it into an LLM101n module.
The code is the heart of neural network training: it allows us to calculate how we should update the parameters of a neural network in order to make it better at some task, such as next-token prediction in autoregressive language models. This exact same algorithm is used in all modern deep learning libraries, such as PyTorch, TensorFlow, and JAX, except that those libraries are much more optimized and feature-rich.
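A small usage example in the spirit of the original karpathy/micrograd engine (the import path may differ in the LLM101n variant of the repo):

```python
from micrograd.engine import Value  # scalar autograd value from karpathy/micrograd

# Build a tiny computation graph of scalar Values.
a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
e = (c - d)**2 / 2.0

# Backpropagation: populate the gradient of e with respect to every node.
e.backward()
print(a.grad)  # de/da
print(b.grad)  # de/db
```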
This repo contains a collection of pluggable state-of-the-art multi-object trackers for segmentation, object detection and pose estimation models. For the methods using appearance description, both heavy (CLIPReID) and lightweight state-of-the-art ReID models (LightMBN, OSNet and more) are available for automatic download. It provides examples of how to use this package together with popular object detection models such as Yolov8, Yolo-NAS and YOLOX.
Human in the Loop with LLM is a system that provides the following components:
AgentServer: a single agent with an OpenAI LLM that answers all queries except those having to do with math
HumanServer: a service for humans to be able to answer queries on math
RabbitMQMessageQueue: the message broker for communication between the other components
ControlPlane with a PipelineOrchestrator that uses a single RouterComponent, which selects between the AgentServer or the HumanServer when processing a task
Under the hood, this is a simple Gradio app where one can submit tasks to the system and watch each task go through the various stages of its lifecycle: namely, "Submitted", "Completed", and "Human Required".
Technically speaking, the Gradio app is a Consumer of the message queue, since it listens for messages that contain "completed" task notifications. This front-end is also wired to the HumanServer so that the human in the loop can use this interface to complete their tasks. Note, however, that these two concerns can be separated into different pages, web apps, or servers of your choosing.
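Conceptually, the routing logic amounts to something like the plain-Python sketch below; this is not the framework's actual API, just an illustration of a router choosing between the LLM agent and the human queue.

```python
import queue

# Toy stand-ins for the components described above; the real system uses the
# framework's services and a RabbitMQ message queue, not these objects.
human_inbox: "queue.Queue[str]" = queue.Queue()              # tasks marked "Human Required"
completed: "queue.Queue[tuple[str, str]]" = queue.Queue()    # (task, answer) notifications

def looks_like_math(task: str) -> bool:
    return any(ch.isdigit() for ch in task) or any(op in task for op in "+-*/=")

def route(task: str, llm) -> None:
    """Send math tasks to the human, everything else to the LLM agent."""
    if looks_like_math(task):
        human_inbox.put(task)
    else:
        completed.put((task, llm(task)))

def human_completes(answer_fn) -> None:
    """Drain the human inbox; the Gradio front-end plays this role."""
    while not human_inbox.empty():
        task = human_inbox.get()
        completed.put((task, answer_fn(task)))
```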
Researchers created a synthetic setup to evaluate the ability of autoregressive Transformers to learn function compositions. They find that: (1) autoregressive Transformers learn function compositions from very few compositions in the training data (unlike LSTMs); (2) generating intermediate outputs when composing functions is more effective for generalizing to new, unseen compositions; (3) the attention layers select which function to apply while the feed-forward layers execute the selected capability.
Inspectus provides visualization tools for attention mechanisms in deep learning models. It provides a set of comprehensive views, making it easier to understand how these models work.
Components
Attention Matrix: Visualizes the attention scores between tokens, highlighting how each token focuses on others during processing.
Query Token Heatmap: Shows the sum of attention scores between each query and selected key tokens.
Key Token Heatmap: Shows the sum of attention scores between each key and selected query tokens.
Dimension Heatmap: Shows the sum of attention scores for each item in dimensions (Layers and Heads) normalized over the dimension.
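For context, the attention scores these views visualize are softmax-normalized query-key similarities; a minimal NumPy sketch (not Inspectus code) of how one such per-head matrix arises:

```python
import numpy as np

def attention_matrix(Q: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Rows are query tokens, columns are key tokens; each row sums to 1."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # scaled dot-product similarities
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# 4 tokens with 8-dimensional queries/keys -> a 4x4 attention map,
# the kind of matrix a tool like Inspectus renders as a heatmap.
rng = np.random.default_rng(0)
attn = attention_matrix(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
print(attn.round(2))
```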
LiveBench: a benchmark for LLMs designed with test set contamination and objective evaluation in mind.
LiveBench has the following properties:
LiveBench is designed to limit potential contamination by releasing new questions monthly, as well as having questions based on recently-released datasets, arXiv papers, news articles, and IMDb movie synopses.
Each question has verifiable, objective ground-truth answers, allowing hard questions to be scored accurately and automatically, without the use of an LLM judge.
LiveBench currently contains a set of 18 diverse tasks across 6 categories, and we will release new, harder tasks over time.
It also has a very good project page if you want to learn more about it.
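To make the "objective ground-truth" idea concrete, here is a toy illustration of scoring against a known answer without an LLM judge; this is not LiveBench's actual scoring code.

```python
# Toy ground-truth scoring: compare against a verifiable answer, no LLM judge involved.
def score(model_answer: str, ground_truth: str) -> float:
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(model_answer) == normalize(ground_truth) else 0.0

assert score(" 42 ", "42") == 1.0
assert score("forty-two", "42") == 0.0
```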
Open-Sora is an initiative dedicated to efficiently producing high-quality video, aiming to make the model, tools, and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation.
ML Code Challenges
Deep-ML has a number of different challenges in the machine learning/deep learning domain. If you are looking for a LeetCode equivalent for ML and deep learning, you might want to check it out.