Deepmind publishes AlphaTensor
Google open-sources FILM, Snap's Feature Engineering System -> Robusta
The biggest news this week comes from Deepmind, which introduced a matrix multiplication algorithm that beats the best manually designed ones. I cover this extensively below.
Articles
Deepmind published a post on finding matrix multiplication algorithms that do better than the state of the art. ML for systems is a field I am very passionate about, and this type of development opens up a whole new domain for optimizing the training and inference loops of deep learning models. In the post, they also mention that they can optimize for a given hardware type, and they tried this for NVIDIA GPUs and Google’s TPUs. Given that you can enumerate all of the operations in the static graphs used for training and inference, this kind of meta-learning on top of matrix multiplication can influence hardware design and start a new era of learned procedures for optimizing different types of training loops. As I was thinking through this, the first thing that came to mind was FPGAs. In circuit design, an FPGA lets you optimize and iterate on different architectures before committing to a fixed design, allowing flexible iteration for a given operation. We do not have this in deep learning: we have fixed hardware (GPU/TPU) with a certain topology that is generally good for training and inference, but it is not flexible the way an FPGA is.
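To make the "fewer multiplications" idea concrete, here is the classic hand-derived example that AlphaTensor generalizes: Strassen's algorithm multiplies two 2x2 matrices with 7 scalar multiplications instead of the naive 8. AlphaTensor searches for decompositions of exactly this kind, including ones tailored to specific hardware.

```python
# Strassen's algorithm: 7 scalar multiplications instead of the
# naive 8 for a 2x2 matrix product. Applied recursively to blocks,
# this reduces the asymptotic cost of matrix multiplication --
# the same trade-off AlphaTensor searches for automatically.
def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4,           m1 - m2 + m3 + m6],
    ]
```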
Google presents a frame interpolation algorithm (FILM) that synthesizes an engaging slow-motion video from near-duplicate photos, which often exhibit large scene motion. The code is available on GitHub.
At its core, FILM adopts a scale-agnostic feature pyramid that shares weights across scales, which allows us to build a “scale-agnostic” bi-directional motion estimator that learns from frames with normal motion and generalizes well to frames with large motion. To handle wide disocclusions caused by large scene motion, we supervise FILM by matching the Gram matrix of ImageNet pre-trained VGG-19 features, which results in realistic inpainting and crisp images. FILM performs favorably on large motion, while also handling small and medium motions well, and generates temporally smooth high quality videos.
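The Gram-matrix supervision mentioned above can be made concrete: a Gram matrix captures pairwise correlations between feature channels while discarding spatial layout, which is why matching it encourages realistic texture rather than pixel-exact reconstruction. A minimal sketch of the computation (pure Python for illustration; FILM applies this to VGG-19 feature tensors):

```python
# Gram matrix of a feature map: for C channels each flattened to N
# spatial positions, G[i][j] is the (normalized) dot product of
# channels i and j. Matching G between prediction and ground truth
# supervises texture statistics, not exact pixel placement.
def gram_matrix(features):
    # features: C x N nested lists (channel-major, flattened spatial dims)
    n = len(features[0])
    return [
        [sum(fi * fj for fi, fj in zip(row_i, row_j)) / n
         for row_j in features]
        for row_i in features
    ]
```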
Snap published a blog post on how they implemented a feature engineering system.
The main problems that they want to solve:
ML engineers are generally not familiar with online serving components, so touching them is risky. On the other hand, delegating to infrastructure engineers introduces coordination overhead.
The turnaround time to tell whether a feature is useful or not is extremely long.
Each team builds their own infrastructure with overlapping functionalities. While teams with more engineering resources can build more sophisticated systems, smaller teams don’t have easy access to more advanced features. Besides, it’s almost impossible to share features.
Robusta (the system that handles feature engineering and serving) has the following properties:
Velocity: How to enable new features end to end through a single declarative feature specification, without manual service deployment and without imposing risks to ML infrastructure.
Scale: A typical ML use case has thousands of aggregation features with varying properties. For example, they could have different sliding windows, sometimes aggregated by user id, sometimes by snap id, or a combination of multiple keys (i.e. user id + discover channel + hour of day). Some operations are easily expressed as associative and commutative aggregations, while others require some work to tweak and fit. We need to design a framework that allows for these possibilities while running efficiently on billions of events per day.
Correctness: How to support offline generation of near-real time features, aka solving the so-called point in time correctness problem [3]. We must answer what the feature value was at online inference time. This is especially challenging when we have sliding window features that can be updated at minute level granularity, as the feature value could be changing every minute for most users as time goes by even without any new engagement events.
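The scale and correctness requirements above connect: associative and commutative aggregations can be kept as partial per-bucket aggregates that merge across workers, and a point-in-time query then reconstructs the feature value as of any minute. A minimal sketch of both ideas (this is illustrative, not Snap's actual implementation):

```python
from collections import defaultdict

# Sketch (not Robusta's actual code) of a sliding-window count
# feature built from mergeable per-minute partial aggregates.
class SlidingCountFeature:
    def __init__(self, window_minutes):
        self.window = window_minutes
        self.buckets = defaultdict(int)  # minute -> partial count

    def add_event(self, minute):
        self.buckets[minute] += 1

    def merge(self, other):
        # Counts are associative and commutative, so partial
        # aggregates from different workers can be merged freely.
        for minute, count in other.buckets.items():
            self.buckets[minute] += count

    def value_at(self, minute):
        # Point-in-time correct query: only events in the window
        # ending at `minute` contribute, so the value can change
        # every minute even with no new engagement events.
        lo = minute - self.window + 1
        return sum(c for m, c in self.buckets.items() if lo <= m <= minute)
```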
Libraries
RecSim is a configurable platform for authoring simulation environments for recommender systems (RSs) that naturally supports sequential interaction with users. RecSim allows the creation of new environments that reflect particular aspects of user behavior and item structure at a level of abstraction well-suited to pushing the limits of current reinforcement learning (RL) and RS techniques in sequential interactive recommendation problems. Environments can be easily configured that vary assumptions about: user preferences and item familiarity; user latent state and its dynamics; and choice models and other user response behavior.
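To illustrate the kind of sequential interaction loop RecSim environments model (this toy environment is my own sketch, not RecSim's actual API), here is a minimal setup with a latent user interest state that drifts as the agent recommends items:

```python
import random

# Toy sequential recommendation environment in the spirit of RecSim
# (not its actual API): the user has a latent interest per topic,
# click probability depends on that interest, and interest drifts
# toward consumed topics -- giving an RL agent long-horizon dynamics.
class ToyRecEnv:
    def __init__(self, num_topics=3, seed=0):
        self.rng = random.Random(seed)
        self.num_topics = num_topics

    def reset(self):
        self.interest = [self.rng.random() for _ in range(self.num_topics)]
        return list(self.interest)

    def step(self, topic):
        # User clicks with probability equal to current interest.
        clicked = self.rng.random() < self.interest[topic]
        if clicked:
            # Latent state dynamics: interest drifts toward consumption.
            self.interest[topic] = min(1.0, self.interest[topic] + 0.05)
        reward = 1.0 if clicked else 0.0
        return list(self.interest), reward
```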
IRIS is the codebase for the paper demonstrating that transformers are sample-efficient world models.
IRIS is a data-efficient agent trained over millions of imagined trajectories in a world model.
The world model is composed of a discrete autoencoder and an autoregressive Transformer.
Our approach casts dynamics learning as a sequence modeling problem, where the autoencoder builds a language of image tokens and the Transformer composes that language over time.
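To make the "language of image tokens" idea concrete, here is a minimal vector-quantization sketch (my own illustration, not IRIS's actual code): a discrete autoencoder maps each patch vector to the id of its nearest codebook entry, and the resulting token sequence is what the autoregressive Transformer models over time.

```python
# Minimal vector quantization (not IRIS's actual code): each patch
# vector maps to the id of its nearest codebook entry, turning a
# frame into a sequence of discrete tokens for sequence modeling.
def quantize(patches, codebook):
    def sq_dist(p, c):
        return sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return [min(range(len(codebook)), key=lambda i: sq_dist(p, codebook[i]))
            for p in patches]
```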
Fourier Heat Map lets you investigate the sensitivity of CNNs to high- and low-frequency corruptions via a perturbation analysis in the Fourier domain.
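The core idea behind a Fourier heat map can be sketched in a few lines (this is the general technique, not the library's actual API): perturb an image with a single 2D Fourier basis function at frequency (u, v) and measure how much the model's output changes; low (u, v) probes low-frequency sensitivity, high (u, v) probes high-frequency sensitivity.

```python
import math

# Sketch of Fourier-domain perturbation analysis (not the library's
# actual API): build a single 2D Fourier basis image at frequency
# (u, v) and add it to the input. Sweeping (u, v) over a grid and
# recording the model's error at each point yields the heat map.
def fourier_basis(h, w, u, v, eps=0.1):
    norm = math.sqrt(h * w)
    return [[eps * math.cos(2 * math.pi * (u * y / h + v * x / w)) / norm
             for x in range(w)]
            for y in range(h)]

def perturb(image, u, v, eps=0.1):
    h, w = len(image), len(image[0])
    basis = fourier_basis(h, w, u, v, eps)
    return [[image[y][x] + basis[y][x] for x in range(w)] for y in range(h)]
```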
Mintaka is a large, complex, natural, and multilingual question-answering dataset with 20,000 questions collected in English and professionally translated into eight languages: Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish. Mintaka is also grounded in the Wikidata knowledge graph by linking entities in the question and answer text to Wikidata IDs. Amazon wrote a nice blog post announcing the dataset.
The Omniglot data set is designed for developing more human-like learning algorithms. It contains 1623 different handwritten characters from 50 different alphabets. Each of the 1623 characters was drawn online via Amazon's Mechanical Turk by 20 different people. Each image is paired with stroke data, a sequence of [x, y, t] coordinates with time (t) in milliseconds.
latexify_py is a package that generates LaTeX math descriptions from Python functions.
LAVIS is a Python deep learning library for LAnguage-and-VISion intelligence research and applications. This library aims to provide engineers and researchers with a one-stop solution to rapidly develop models for their specific multimodal scenarios, and benchmark them across standard and customized datasets.