Pix2Seq and Apple Neural Engine

QAT/PAT in TF, Ludwig, Avalanche, Grafog, TorchGeo

Jun 13, 2022

Tensorflow wrote a blog post on how one can use QAT(Quantization Aware Training) and Pruning Aware Training(PAT) through TFMOT(TF Model Optimization Toolkit). QAT was originally supported through the toolkit, now they extend this to Pruning and they also support support for Model garden out of the box for very common mobile use cases.

Transformers are becoming ubiquitous in ML as their capabilities scale up with their size. Deploying Transformers on-devices requires efficient strategies, and Apple published a blog post on a guidance and a library for Apple devices in this blog post. Reference implementation of the Transformer architecture optimized for Apple Neural Engine (ANE) is also available in here. Based on this reference, Transformer models on Apple devices with an A14 or newer and M1 or newer chip to achieve up to 10 times faster and 14 times lower peak memory consumption compared to baseline implementations.

Pix2Seq is an interesting method that Google developed through the following intuition:

if a neural network knows where and what the objects in an image are, one could simply teach it how to read them out. By learning to “describe” objects, the model can learn to ground the descriptions on pixel observations, leading to useful object representations. Given an image, the Pix2Seq model outputs a sequence of object descriptions, where each object is described using five discrete tokens: the coordinates of the bounding box’s corners [ymin, xmin, ymax, xmax] and a class label.

Cross-pollination between computer vision and natural language processing allows a number of innovations and one of the most successful application is definitely this.

In inference time, object descriptions construct sequences as a “dialect” and address the problem via a powerful and general language model with an image encoder and autoregressive language encoder. Similar to language modeling, Pix2Seq is trained to predict tokens, given an image and preceding tokens, with a maximum likelihood loss. At inference time, we sample tokens from model likelihood. The sampled sequence ends when the EOS token is generated. Once the sequence is generated, we split it into chunks of 5 tokens for extracting and de-quantizing the object descriptions (i.e., obtaining the predicted bounding boxes and class labels).

Libraries

We have a number of good libraries from DeepMind for Jax and written in Jax this week.

Mctx is a library with a JAX-native implementation of Monte Carlo tree search (MCTS) algorithms such as AlphaZero, MuZero, and Gumbel MuZero. For computation speed up, the implementation fully supports JIT-compilation. Search algorithms in Mctx are defined for and operate on batches of inputs, in parallel. This allows to make the most of the accelerators and enables the algorithms to work with large learned environment models parameterized by deep neural networks.
KFAC-JAX is a library built on top of JAX for second-order optimization of neural networks and for computing scalable curvature approximations. The main goal of the library is to provide researchers with an easy-to-use implementation of the K-FAC optimizer and curvature estimator.
AUX is an audio processing library in JAX, for JAX.
TF2JAX is an experimental library for converting TensorFlow functions/graphs to JAX functions.

Ludwig is an open-source, declarative machine learning framework that makes it easy to define deep learning pipelines with a simple and flexible data-driven configuration system. Ludwig is suitable for a wide variety of AI tasks, and is hosted by the Linux Foundation AI & Data.

Ludwig allows users to define their deep learning pipeline by simply providing a configuration file, which lists the inputs and outputs, and their respective data types. Ludwig will then assemble and train a deep learning model and based on the configuration file, determine how inputs and outputs are preprocessed, encoded, decoded and which metrics and loss criterion to use.

Megatron (1, 2, and 3) is a large, powerful transformer developed by the Applied Deep Learning Research team at NVIDIA. This repository is for ongoing research on training large transformer language models at scale. We developed efficient, model-parallel (tensor, sequence, and pipeline), and multi-node pre-training of transformer based models such as GPT, BERT, and T5 using mixed precision.

Grafog is a graph data augmentation Library for PyTorch Geometric.

Data augmentations are heavily used in Computer Vision and Natural Language Processing to address data imbalance, data scarcity, and prevent models from overfitting. They have also proven to yield good results in both supervised and self-supervised (contrastive) settings.

grafog (portmanteau of "graph" and "augmentation") provides a set of methods to perform data augmentation on graph-structured data, especially meant for self-supervised node classification. It is built on top of torch_geometric and is easily integrable with its Data API.

Avalanche is an End-to-End Continual Learning Library (part of the PyTorch Ecosystem) powered by ContinualAI with the unique goal of providing a shared and collaborative open-source (MIT licensed) codebase for fast prototyping, training and reproducible evaluation of continual learning algorithms.

TorchGeo is a PyTorch domain library, similar to torchvision, that provides datasets, transforms, samplers, and pre-trained models specific to geospatial data.

Classes

NYU has a new version of the popular Deep Learning Class in here.

MLOps Newsletter

Discussion about this post