Google’s ML Research 2023 Coverage
An update to Bard that enabled multilingual capabilities. Since its initial launch, Bard has expanded to more than 40 languages and over 230 countries and territories, and with Extensions it can find and show relevant information from everyday Google tools like Gmail, Google Maps, YouTube, and more.
Search Generative Experience (SGE), which uses LLMs to reimagine both how to organize information and how to help people navigate through it, creating a more fluid, conversational interaction model for Google's core Search product. This work extended the search engine experience from one primarily focused on information retrieval into something much more: capable of retrieval, synthesis, creative generation, and continuation of previous searches, while continuing to serve as a connection point between users and the web content they seek.
MusicLM, a text-to-music model powered by AudioLM and MuLAN, which can generate music from text, humming, images, or video, as well as musical accompaniment for singing.
Duet AI, Google's AI-powered collaborator that provides users with assistance when they use Google Workspace and Google Cloud. Duet AI in Google Workspace, for example, helps users write, create images, analyze spreadsheets, draft and summarize emails and chat messages, and summarize meetings. Duet AI in Google Cloud helps users code, deploy, scale, and monitor applications, as well as identify and accelerate resolution of cybersecurity threats.
At the heart of the most advanced ML models is the Transformer model architecture, developed by Google researchers in 2017. Originally designed for language, it has proven useful in domains as varied as computer vision, audio, genomics, protein folding, and more. This year, Google's work on scaling vision transformers demonstrated state-of-the-art results across a wide variety of vision tasks, and has also been useful in building more capable robots.
Expanding the versatility of models requires the ability to perform higher-level and multi-step reasoning. This year, Google pursued this goal along several research tracks. For example, algorithmic prompting is a new method that teaches language models reasoning by demonstrating a sequence of algorithmic steps, which the model can then apply in new contexts. This approach improves accuracy on one middle-school mathematics benchmark from 25.9% to 61.1%.
By providing algorithmic prompts, they can teach a model the rules of arithmetic via in-context learning.
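To make the idea concrete, here is a minimal sketch of what an algorithmic prompt for multi-digit addition might look like. The carry-by-carry wording and the placeholder `query_llm` call are illustrative assumptions; the actual prompts used in the algorithmic prompting work are more detailed.

```python
# Illustrative sketch of an algorithmic prompt for multi-digit addition.
# The step-by-step wording is an assumption for demonstration purposes;
# the real prompts in the algorithmic-prompting work are more elaborate.

ALGORITHMIC_PROMPT = """\
Problem: 128 + 367
Solution:
Step 1: add the ones digits: 8 + 7 = 15, write 5, carry 1.
Step 2: add the tens digits plus the carry: 2 + 6 + 1 = 9, write 9, carry 0.
Step 3: add the hundreds digits plus the carry: 1 + 3 + 0 = 4, write 4.
Answer: 495

Problem: 254 + 189
Solution:
"""

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a language model API (hypothetical)."""
    raise NotImplementedError("plug in your preferred LLM client here")

if __name__ == "__main__":
    # The in-context demonstration teaches the model to reproduce the same
    # carry-by-carry procedure on the new, unseen problem.
    print(ALGORITHMIC_PROMPT)
    # completion = query_llm(ALGORITHMIC_PROMPT)
```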
In the domain of visual question answering, in a collaboration with UC Berkeley researchers, they showed how to better answer complex visual questions (“Is the carriage to the right of the horse?”) by combining visual models with a language model that answers the question by synthesizing a program to perform multi-step reasoning.
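The following toy sketch shows the flavor of the approach: a language model emits a short program over vision primitives, and executing that program answers the question. The `find` primitive and the bounding-box logic below are hypothetical stand-ins, not the actual API from the paper.

```python
# Schematic illustration of program synthesis for visual question answering.
# `find` and the box comparison are hypothetical stand-ins for the vision
# primitives a real system would expose to the code-generating LLM.

from dataclasses import dataclass

@dataclass
class Box:
    name: str
    x_min: float
    x_max: float

def find(image, name: str) -> Box:
    """Hypothetical detector returning a bounding box for `name`."""
    # A real implementation would call an open-vocabulary detector here.
    return {"carriage": Box("carriage", 0.6, 0.9),
            "horse": Box("horse", 0.2, 0.5)}[name]

def answer_question(image) -> bool:
    # Program an LLM might synthesize for:
    # "Is the carriage to the right of the horse?"
    carriage = find(image, "carriage")
    horse = find(image, "horse")
    return carriage.x_min > horse.x_max

print(answer_question(image=None))  # True for the dummy boxes above
```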
They are now using a general model that understands many aspects of the software development life cycle to automatically generate code review comments, respond to code review comments, make performance-improving suggestions for pieces of code (by learning from past such changes in other contexts), fix code in response to compilation errors, and more.
In a multi-year research collaboration with the Google Maps team, they were able to scale inverse reinforcement learning and apply it to the world-scale problem of improving route suggestions for over 1 billion users. The work culminated in a 16–24% relative improvement in global route match rate, helping to ensure that routes are better aligned with user preferences.
They also continue to work on techniques to improve the inference performance of machine learning models. In work on computationally friendly approaches to pruning connections in neural networks, they devised an approximation algorithm for the computationally intractable best-subset selection problem that can prune 70% of the edges from an image classification model while retaining almost all of the original accuracy.
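For context, the sketch below shows the naive baseline such methods improve on: simple magnitude pruning of a weight matrix at 70% sparsity. It is not the paper's approximation algorithm, only a point of reference for what "pruning 70% of the edges" means in practice.

```python
# Naive magnitude pruning of a weight matrix, shown only as a baseline;
# the paper's method approximates best-subset selection and is more
# accurate than simple magnitude thresholding.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.7) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    threshold = np.partition(flat, k)[k]  # k-th smallest magnitude
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))
w_pruned = magnitude_prune(w, sparsity=0.7)
print(f"sparsity: {1 - np.count_nonzero(w_pruned) / w_pruned.size:.2f}")
```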
In work on accelerating on-device diffusion models, they were also able to apply a variety of optimizations to attention mechanisms, convolutional kernels, and fusion of operations to make it practical to run high quality image generation models on-device; for example, enabling “a photorealistic and high-resolution image of a cute puppy with surrounding flowers” to be generated in just 12 seconds on a smartphone.
Advances in capable language and multimodal models have also benefited Google's robotics research efforts. They combined separately trained language, vision, and robotic control models into PaLM-E, an embodied multi-modal model for robotics, and Robotic Transformer 2 (RT-2), a novel vision-language-action (VLA) model that learns from both web and robotics data, and translates this knowledge into generalized instructions for robotic control.
More details can be found here.
Meta’s 10 Interesting AI Research Advancements in 2023
Segment Anything (SAM): A step toward the first foundation model for image segmentation. Details: https://bit.ly/3tyeJKu
DINOv2: The first method for training computer vision models that uses self-supervised learning to achieve results matching or exceeding industry standards. Details: https://bit.ly/3TGTEIb
Llama 2: The next generation of Meta's open source large language model, available for free for research and commercial use. Details: https://bit.ly/3RY66C6
Emu Video & Emu Edit: Generative AI research for high-quality, diffusion-based text-to-video generation and controlled image editing with text instructions. Details: https://bit.ly/3RZVZwU
I-JEPA: Self-supervised computer vision that learns to understand the world by predicting it. Details: https://bit.ly/3TA9oNk
Audiobox: Meta's new foundation research model for audio generation. Details: https://bit.ly/47ib6pQ
Brain decoding (toward real-time reconstruction of visual perception): Using MEG, this AI system can decode the unfolding of visual representations in the brain with unprecedented temporal resolution. Details: https://bit.ly/3vpgDNR
Open Catalyst demo: A service that allows researchers to accelerate work in materials science by simulating the reactivity of catalyst materials faster than existing computational methods. Details: https://bit.ly/3vphiij
Seamless Communication: A new family of AI translation models that preserve expression and deliver near-real-time streaming translations. Details: https://bit.ly/3toBDE8
ImageBind: The first AI model capable of binding data from six modalities at once, a breakthrough that brings machines one step closer to the human ability to bind together information from many different senses. Details: https://bit.ly/3NLUaBc
Articles
The Transformer Circuits publication released a blog post on “monosemantic features”. It is very interesting if you are familiar with sparse autoencoders, dictionary learning, and, more broadly, L1 minimization in the context of deep learning, as the main argument of the article is that sparse autoencoders can be used to decompose language models into interpretable features. It discusses features that are specific to certain contexts, such as Arabic script or base64.
Why would you use it?
The features produced by the sparse autoencoder are more interpretable than the model's individual neurons. Further, features can be used to intervene on the model's behavior; for example, activating the base64 feature causes the model to generate base64 text.
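For readers new to the setup, a minimal sparse autoencoder of the kind used for this dictionary-learning approach looks roughly like the sketch below. The layer sizes, expansion factor, and L1 coefficient here are illustrative, not the authors' exact configuration.

```python
# Minimal sparse autoencoder over MLP activations, in the spirit of the
# dictionary-learning setup described in the post. Sizes and the L1
# coefficient are illustrative placeholders.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 512, expansion: int = 8):
        super().__init__()
        d_hidden = d_model * expansion          # overcomplete dictionary
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        features = torch.relu(self.encoder(acts))   # sparse, non-negative codes
        recon = self.decoder(features)
        return recon, features

def loss_fn(acts, recon, features, l1_coeff: float = 1e-3):
    # Reconstruction term plus an L1 penalty that pushes feature
    # activations toward zero, encouraging sparsity.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(64, 512)                     # stand-in for MLP activations
recon, feats = sae(acts)
print(loss_fn(acts, recon, feats).item())
```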
Model properties
They are extracted from a one-layer transformer with a 512-neuron MLP layer.
They are learned using a sparse autoencoder with an expansion factor of 256x.
The authors studied 4,096 features in detail.
They found that most of the features are interpretable.
The features can be used to intervene on the model's behavior.
The features are relatively universal, meaning that similar features can be found in different transformer models.
It is a long post, but a very good read if you want more interpretable features in your next transformer model. I am not fully convinced that the approach is universal, in the sense that the same features would be usable regardless of the underlying model, but it is worth a try.
Libraries
AudioSep, a foundation model for open-domain sound separation with natural language queries. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability on numerous tasks, such as audio event separation, musical instrument separation, and speech enhancement.
Vision-language models (VLMs) are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets (e.g. healthcare data) are significantly more complex: each image is often paired with text (e.g. a physician report) that describes many distinct attributes occurring in fine-grained regions of the image. ViLLA has two key contributions:
They introduce a synthetic dataset DocMNIST, which allows the average image-text sample complexity to be directly controlled by altering the number of region-attribute pairs per sample. They use DocMNIST to demonstrate that as the image-text sample complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships.
They present Vision-Language Learning with Attributes (ViLLA), which leverages self-supervised learning in order to capture fine-grained region-attribute relationships from complex datasets. ViLLA involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs.
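As a toy illustration of the second component, the sketch below computes a symmetric InfoNCE-style contrastive loss over matched (region, attribute) embedding pairs. The encoders, embedding dimensions, and temperature are placeholders, not the ViLLA implementation.

```python
# Toy contrastive objective over region-attribute pairs, in the spirit of
# ViLLA's second stage. Embeddings are random placeholders; a real model
# would produce them with region and text encoders.
import torch
import torch.nn.functional as F

def region_attribute_contrastive_loss(region_emb, attr_emb, temperature=0.07):
    """Symmetric InfoNCE: matched (region, attribute) pairs sit on the diagonal."""
    region_emb = F.normalize(region_emb, dim=-1)
    attr_emb = F.normalize(attr_emb, dim=-1)
    logits = region_emb @ attr_emb.T / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

regions = torch.randn(32, 256)   # region embeddings from the mapping model
attrs = torch.randn(32, 256)     # attribute/text embeddings
print(region_attribute_contrastive_loss(regions, attrs).item())
```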
Axolotl is a tool designed to streamline the fine-tuning of various AI models, offering support for multiple configurations and architectures; a minimal configuration sketch follows the feature list below.
Features:
Train various Huggingface models such as llama, pythia, falcon, mpt
Supports fullfinetune, lora, qlora, relora, and gptq
Customize configurations using a simple yaml file or CLI overwrite
Load different dataset formats, use custom formats, or bring your own tokenized datasets
Integrated with xformers, flash attention, rope scaling, and multipacking
Works with single GPU or multiple GPUs via FSDP or Deepspeed
Easily run with Docker locally or on the cloud
Log results and optionally checkpoints to wandb
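To show the YAML-driven workflow, here is a rough sketch that writes a minimal LoRA-style config from Python and notes the launch command. The field names, base model, and dataset identifiers mirror common Axolotl example configs but are assumptions here and should be checked against the current documentation.

```python
# Writes a minimal Axolotl-style LoRA config; field names mirror common
# Axolotl example configs but may differ in your version -- check the docs.
import yaml  # pip install pyyaml

config = {
    "base_model": "openlm-research/open_llama_3b_v2",  # any HF causal LM (assumed)
    "load_in_8bit": True,
    "adapter": "lora",
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "datasets": [{"path": "mhenrichsen/alpaca_2k_test", "type": "alpaca"}],
    "sequence_len": 2048,
    "micro_batch_size": 2,
    "num_epochs": 1,
    "learning_rate": 2e-4,
    "output_dir": "./lora-out",
}

with open("lora.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Then launch training, e.g. with:
#   accelerate launch -m axolotl.cli.train lora.yml
```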
AnyText comprises a diffusion pipeline with two primary elements: an auxiliary latent module and a text embedding module. The former uses inputs like text glyph, position, and masked image to generate latent features for text generation or editing. The latter employs an OCR model for encoding stroke data as embeddings, which blend with image caption embeddings from the tokenizer to generate texts that seamlessly integrate with the background. The authors employed a text-control diffusion loss and a text perceptual loss during training to further enhance writing accuracy.
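The sketch below shows schematically how the two training objectives described above could be combined: a standard diffusion noise-prediction term plus a perceptual term on OCR features of the rendered text. All tensors, modules, and the weighting factor are placeholders for illustration, not the AnyText implementation.

```python
# Schematic combination of a diffusion (noise-prediction) loss with a
# text perceptual loss computed on OCR features, as described for AnyText.
# The inputs and the weighting factor are illustrative placeholders.
import torch

def training_loss(pred_noise, true_noise, ocr_feats_generated, ocr_feats_target,
                  perceptual_weight=0.01):
    # Standard diffusion objective: predict the injected noise.
    diffusion_loss = torch.mean((pred_noise - true_noise) ** 2)
    # Text perceptual term: OCR features of the rendered text region should
    # match those of the target text, encouraging legible, accurate glyphs.
    perceptual_loss = torch.mean((ocr_feats_generated - ocr_feats_target) ** 2)
    return diffusion_loss + perceptual_weight * perceptual_loss

pred = torch.randn(4, 4, 64, 64)     # predicted noise (dummy values)
true = torch.randn(4, 4, 64, 64)     # injected noise (dummy values)
ocr_gen = torch.randn(4, 512)        # OCR features of generated text region
ocr_tgt = torch.randn(4, 512)        # OCR features of target text
print(training_loss(pred, true, ocr_gen, ocr_tgt).item())
```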