Google announces OptFormer: hyperparameter optimization with Transformers
Emergent features, and Skops from HF for sharing scikit-learn models
If you watch one video this week, make it this one from Andrej Karpathy:
Articles
Google announced OptFormer, a new hyperparameter optimization library based on Transformers. It supports a number of policies, such as Regularized Evolution and Google Vizier.
Rather than using only numerical data, as is common with previous methods, their novel approach represents all of the study data as a sequence of tokens, drawing on concepts from natural language and including textual information from the initial metadata.
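To make the idea concrete, here is a minimal sketch of serializing a hyperparameter-tuning study, metadata included, into one token sequence a Transformer could consume. The format below is entirely hypothetical for illustration; it is not OptFormer's actual serialization scheme.

```python
# Hypothetical study record: metadata plus a list of observed trials.
study = {
    "metadata": {"objective": "accuracy", "algorithm": "regularized_evolution"},
    "trials": [
        {"lr": 0.1, "layers": 2, "value": 0.81},
        {"lr": 0.01, "layers": 4, "value": 0.87},
    ],
}

def serialize(study):
    """Flatten metadata and trials into a single text sequence."""
    tokens = [f"{k}={v}" for k, v in study["metadata"].items()]
    for trial in study["trials"]:
        tokens += [f"{k}:{v}" for k, v in trial.items()]
        tokens.append("|")  # trial separator token
    return " ".join(tokens)

seq = serialize(study)
print(seq)
```

Once the study is plain text, the same sequence model can condition on metadata and past trials alike when proposing the next configuration.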
Tim Dettmers wrote about quantization and emergent features in this blog post.
Emergent features is a new concept that was outlined in the post in the following way:
Emergence is a gradual change in a property that suddenly undergoes a phase shift and then changes the quality of its substrate.
The most intuitive explanation of feature outliers is that transformers have two processing streams. One stream learns features that explain the inputs, and the other stream learns features that remove other features. Removing noisy, context-irrelevant features is the key to making accurate predictions. The more noisy, context-irrelevant features you remove in early layers, the less conflicting high-level features you have in later layers.
For example, if you classify dogs vs. cats, it makes sense to “sharpen” the key features that make these animals different (e.g. cat eyes, cat ears) and remove the similar features (fur color and potentially texture). This is particularly relevant if you have many noisy “weak” features as in natural language processing.
If you take this mechanism to an extreme, you can get discretization, which goes hand-in-hand with context-dependent memory and “reasoning” over elements. Discretization means you have, say, 100 features, but you decide to remove 99% of them by setting them to zero, and you amplify the rest. The result is a single feature that is now a discrete entity. Once discretized, this entity can be stored and reused later.
An example:
Substrate: Transformer
Property: Very large features in particular hidden dimensions across the transformer
Gradual change: Decreasing perplexity, more and larger outlier features
Phase shift: Outlier features suddenly become available in all transformer layers and coordinate through a few hidden dimensions.
Change of quality: Highly sparse, almost discrete attention; very dense FFN layers; “dual attention”; long-range attention (?); stable training through increased numerical stability
Emergent features become much more important after the phase shift (i.e., as models grow to larger and larger parameter counts).
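A simple way to see outlier features in practice: scan the hidden states for dimensions whose magnitude exceeds a threshold across tokens. The threshold of 6 below follows Dettmers' observation that outlier features reach magnitudes around 6 and beyond; the data here is synthetic, with two planted outlier dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic hidden states: (tokens, hidden_dim), small normal noise.
hidden = rng.normal(scale=0.5, size=(32, 16))
hidden[:, 3] += 8.0    # plant an outlier feature dimension
hidden[:, 11] -= 7.0   # and another, negative one

# A dimension is an "outlier feature" if any token activates it
# beyond magnitude 6 (the rough threshold reported for LLM.int8()).
outlier_dims = np.where(np.abs(hidden).max(axis=0) > 6.0)[0]
print(outlier_dims)  # → [ 3 11]
```

In real transformers past the phase shift, the same few hidden dimensions light up this way across essentially all layers and sequence positions.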
Hugging Face also wrote a post on how you can use int8 quantization through the Accelerate library here.
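The basic building block behind int8 inference is absmax quantization: scale a tensor so its largest magnitude maps to 127, round to int8, and divide the scale back out on the way back. The sketch below shows only this core idea; the real LLM.int8() implementation quantizes vector-wise and routes outlier features through fp16 separately.

```python
import numpy as np

def quantize_int8(x):
    """Absmax quantization: map x into [-127, 127] and round."""
    scale = 127.0 / np.abs(x).max()
    q = np.round(x * scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximation of the original float values."""
    return q.astype(np.float32) / scale

x = np.array([0.5, -1.2, 3.3, -0.01], dtype=np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)

# Rounding bounds the reconstruction error by half a quantization step.
assert np.max(np.abs(x - x_hat)) <= 0.5 / scale
```

The connection to emergent features: one large outlier in `x` shrinks the scale for every other value, which is exactly why outliers must be handled out-of-band to keep int8 accuracy.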
Hugging Face introduced a new library called Skops that lets you host and share scikit-learn models easily.
Uber published a blog post on how they put together an internal course for ML education within the company, with a large emphasis on the measurability and reproducibility of ML models. The design of this ML education program is covered in another blog post here.
Libraries
Stable Diffusion is a latent text-to-image diffusion model. Thanks to a generous compute donation from Stability AI and support from LAION, the authors were able to train a latent diffusion model on 512x512 images from a subset of the LAION-5B database. Similar to Google's Imagen, this model uses a frozen CLIP ViT-L/14 text encoder to condition the model on text prompts. With its 860M UNet and 123M text encoder, the model is relatively lightweight and runs on a GPU with at least 10GB VRAM. See the repository and the model card for details.
Aesara is a Python library that allows one to define, optimize, and efficiently evaluate mathematical expressions involving multi-dimensional arrays.
Asent is a package for fast and transparent sentiment analysis. The package uses a dictionary of words rated as either positive or negative, together with a series of rules, to determine whether a word, sentence, or document is positive or negative.
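A toy version of this lexicon-plus-rules approach fits in a few lines. The lexicon and the single negation rule below are made up for illustration; they are not Asent's actual dictionaries or rule set.

```python
# Illustrative lexicon: word -> polarity score.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never"}

def polarity(sentence):
    """Sum lexicon scores, flipping sign after a negation word."""
    tokens = sentence.lower().split()
    score = 0.0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            value = LEXICON[tok]
            # Rule: a negation immediately before a rated word flips it.
            if i > 0 and tokens[i - 1] in NEGATIONS:
                value = -value
            score += value
    return score

print(polarity("this is not good"))  # → -1.0
print(polarity("a great movie"))     # → 2.0
```

The appeal of this family of methods is transparency: every score can be traced back to specific words and the rules that modified them.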
PAINT: Patching open-vocabulary models by interpolating weights
Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks. However, there are still settings where their zero-shot performance is far from optimal. We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate. Towards this goal, we introduce PAINT, a patching method that uses interpolations between the weights of a model before fine-tuning and the weights after fine-tuning on a task to be patched. On nine tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to 60 percentage points while preserving accuracy on ImageNet within one percentage point of the zero-shot model. PAINT also allows a single model to be patched on multiple tasks and improves with model scale. Furthermore, we identify cases of broad transfer, where patching on one task increases accuracy on other tasks even when the tasks have disjoint classes. Finally, we investigate applications beyond common benchmarks such as counting or reducing the impact of typographic attacks on CLIP. Our findings demonstrate that it is possible to expand the set of tasks on which open-vocabulary models achieve high accuracy without re-training them from scratch.
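The patching operation itself is just a linear interpolation between two weight sets, with the mixing coefficient chosen on held-out data. A minimal sketch of that operation, using toy dictionaries of arrays in place of real CLIP checkpoints:

```python
import numpy as np

def patch(zero_shot, finetuned, alpha):
    """Interpolate per-parameter between zero-shot and fine-tuned weights."""
    return {k: (1 - alpha) * zero_shot[k] + alpha * finetuned[k]
            for k in zero_shot}

# Toy "checkpoints": same parameter names, different values.
zero_shot = {"w": np.zeros(3), "b": np.zeros(1)}
finetuned = {"w": np.ones(3), "b": np.full(1, 2.0)}

patched = patch(zero_shot, finetuned, alpha=0.5)
print(patched["w"])  # → [0.5 0.5 0.5]
```

Alpha is the only knob: alpha=0 keeps the zero-shot model, alpha=1 keeps the fine-tuned one, and intermediate values trade patched-task accuracy against preserved zero-shot behavior.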
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion is a paper implementation for the following text-to-image generation process:
Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes. In other words, we ask: how can we use language-guided models to turn our cat into a painting, or imagine a new product based on our favorite toy? Here we present a simple approach that allows such creative freedom. Using only 3-5 images of a user-provided concept, like an object or a style, we learn to represent it through new "words" in the embedding space of a frozen text-to-image model. These "words" can be composed into natural language sentences, guiding personalized creation in an intuitive way. Notably, we find evidence that a single word embedding is sufficient for capturing unique and varied concepts. We compare our approach to a wide range of baselines, and demonstrate that it can more faithfully portray the concepts across a range of applications and tasks.
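The core trick is that everything stays frozen except one new embedding vector, which is fit by gradient descent. The toy below captures only that optimization structure; the target vector and loss are stand-ins, not the paper's actual diffusion-model objective.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
frozen_table = rng.normal(size=(10, dim))  # frozen text-encoder embeddings
concept = rng.normal(size=dim)             # stand-in for the concept signal

new_word = rng.normal(size=dim)  # the ONLY trainable parameter
lr = 0.1
for _ in range(200):
    # Gradient of the toy loss ||new_word - concept||^2.
    grad = 2 * (new_word - concept)
    new_word -= lr * grad

# The learned "word" now sits at the concept in embedding space,
# while the frozen table is untouched.
print(np.linalg.norm(new_word - concept))
```

In the real method the loss is the frozen diffusion model's denoising objective on the user's 3-5 images, but the principle is the same: one vector, everything else frozen.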
Theseus is an efficient application-agnostic library for building custom nonlinear optimization layers in PyTorch to support constructing various problems in robotics and vision as end-to-end differentiable architectures.
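To give a flavor of the kind of inner solve such a layer wraps, here is a hand-rolled Gauss-Newton loop on a one-parameter nonlinear least-squares problem. This numpy toy is only an illustration of the optimization pattern; Theseus itself is PyTorch-based and differentiable end to end.

```python
import numpy as np

# Data generated from y = exp(a * x) with true a = 1.5.
x = np.linspace(0, 1, 20)
y = np.exp(1.5 * x)

a = 0.0  # initial guess
for _ in range(20):
    r = np.exp(a * x) - y                    # residuals
    J = (x * np.exp(a * x)).reshape(-1, 1)   # Jacobian dr/da
    # Gauss-Newton step: solve J * step = -r in least squares.
    step = np.linalg.lstsq(J, -r, rcond=None)[0]
    a += step[0]

print(a)  # converges to ~1.5
```

Libraries like Theseus wrap this kind of solver as a differentiable layer, so gradients can flow from a downstream loss through the optimum back into upstream network parameters.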
Berkeley launches POET where the main motivations are in the following:
There is a growing trend to finetune models on edge devices. Fine-tuning models on the edge satisfies privacy constraints and enables offline operation.
Challenge: Limited memory on edge makes training new deep learning models infeasible.
Given a memory budget and a run-time constraint for ML training, POET (Private Optimal Energy Training) finds a provably energy-optimal plan for scheduling nodes of the training graph.
With POET, we are the first to demonstrate how to train memory-hungry SOTA ML models such as BERT and ResNets on smartphones and tiny ARM Cortex-M devices!
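The decision POET automates can be shown with a tiny brute force: for each node of the training graph, either keep its activation in memory or pay extra energy to rematerialize it in the backward pass, minimizing total energy under a memory budget. All the numbers below are made up, and POET actually solves this as an integer linear program rather than by enumeration.

```python
from itertools import product

mem = [4, 2, 6, 3]           # activation size per node (made up)
base_energy = [1, 1, 1, 1]   # energy to compute each node once
remat_energy = [3, 1, 5, 2]  # extra energy if recomputed in backward
BUDGET = 9                   # memory budget for kept activations

best = None
for keep in product([0, 1], repeat=len(mem)):  # 1 = keep in memory
    used = sum(m for m, k in zip(mem, keep) if k)
    if used > BUDGET:
        continue  # plan does not fit in memory
    energy = sum(base_energy) + sum(
        e for e, k in zip(remat_energy, keep) if not k
    )
    if best is None or energy < best[0]:
        best = (energy, keep)

print(best)  # → (8, (0, 0, 1, 1)): keep the expensive-to-recompute nodes
```

Enumeration explodes for real graphs, which is why POET needs a provably optimal solver; but the objective, energy, and the constraint, memory, are the same.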
Books
Tom Mitchell’s introductory Machine Learning book is finally available online. The PDF is available here.
Conferences
The MLSys conference is happening next week:
I will be there in person on Monday and Wednesday; we have a paper at the conference, check it out!
Courses
Fast.ai has published the 2022 update of its deep learning course on their website.