Transformers is all you need
ML Research
As models get larger and larger, more and more research has focused on how to shrink and compress them without degrading accuracy. There are many ways to do this; a couple of them are quantization, pruning, and knowledge distillation, though there are many more, as you might imagine. This paper is a survey of these methods and can be read in tandem with the tutorial I shared in the previous newsletter.
How much are we “forgetting” by pruning large-scale models? This paper tries to answer that question, specifically studying the impact of pruning across different data distributions. Even though overall model accuracy does not degrade, pruning can hurt accuracy on certain classes far more than quantization does.
What do feed-forward layers in transformers actually do? Transformer Feed-Forward Layers Are Key-Value Memories explores this question and argues that they act as “neural memories” that more or less emulate key-value pairs for certain input patterns; in NLP, these patterns correspond to the words themselves. The paper goes on to explore how these neural memories should be classified, introducing a taxonomy of shallow, semantic, and shallow-semantic classes.
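The key-value reading of a feed-forward layer can be made concrete with a toy NumPy sketch (toy sizes and random weights are my own, not from the paper): the rows of the first FFN matrix act as keys that the input is matched against, and the rows of the second matrix are the values that get mixed according to those match scores.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # hidden size and FFN inner size (toy values)

x = rng.normal(size=d_model)          # a single token representation
K = rng.normal(size=(d_ff, d_model))  # first FFN matrix: rows act as "keys"
V = rng.normal(size=(d_ff, d_model))  # second FFN matrix: rows act as "values"

# A transformer FFN computes f(x @ K.T) @ V: the nonlinearity scores how
# strongly x matches each key, and the output is the score-weighted sum
# of the corresponding value vectors -- a soft key-value lookup.
scores = np.maximum(0.0, x @ K.T)     # ReLU match scores, shape (d_ff,)
out = scores @ V                      # weighted sum of values, shape (d_model,)
```

Seen this way, each inner dimension of the FFN is one memory slot: a key pattern it fires on, and a value it writes into the residual stream when it does.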
Articles
Chip Huyen wrote about real-time machine learning systems (real time on the order of minutes). The topic is worth discussing both because it improves existing systems and because it is the “right” thing to do from a systems perspective: you want your system to adapt based on the data it has consumed. For certain cases like in-session personalization (reranking products based on your cart/history), this is mandatory rather than a “nice to have”.
Vision Transformer Explainability gives a good overview of existing efforts around transformer explainability, specifically for computer vision tasks. Beyond activation-based methods, it covers various techniques such as Layer-wise Relevance Propagation and Attention Rollout, and the code is available on GitHub. As transformers are more and more widely adopted in computer vision, I expect these saliency maps will make some of the classification tasks much easier.
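Attention Rollout, one of the techniques covered there, is simple enough to sketch: average the attention maps over heads, add an identity matrix to account for the residual connection, renormalize, and multiply the resulting matrices layer by layer. A minimal NumPy version (the function name and toy shapes are my own):

```python
import numpy as np

def attention_rollout(attentions):
    """Propagate attention through layers: average heads, add the
    residual connection as an identity matrix, renormalize rows,
    and compose the layers by matrix multiplication."""
    rollout = None
    for layer_attn in attentions:             # each: (heads, tokens, tokens)
        attn = layer_attn.mean(axis=0)        # average over heads
        attn = attn + np.eye(attn.shape[0])   # residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)  # renormalize rows
        rollout = attn if rollout is None else attn @ rollout
    return rollout

# Toy input: 3 layers, 2 heads, 4 tokens, rows normalized to sum to 1
rng = np.random.default_rng(0)
raw = rng.random(size=(3, 2, 4, 4))
attns = raw / raw.sum(axis=-1, keepdims=True)
r = attention_rollout(attns)
```

Each row of the result is still a probability distribution over tokens, so it can be rendered directly as a saliency map over image patches.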
Looking ahead to the new year
I liked last year’s MLSys conference, and there will be another one next year. It is the best conference I know of that targets both ML and systems engineering. The session on MLOps was also pretty good.
OpenMined has a course coming in its Private AI series, which is very exciting. Federated learning has become a very important topic for building machine learning systems that train only on data available on the device.
Book
Kevin Murphy released a new edition of his book, Probabilistic Machine Learning: An Introduction. I have not had a chance to look at the new edition yet, but I read the first book’s material on kernels and non-parametric models thoroughly and can attest that that part of the book was excellent.
Libraries
HuggingFace Datasets is a library you should check out. It allows you to load and process many NLP datasets easily and efficiently.
Datasets
The Pile is an open-source language-modeling dataset that weighs in at 825GB! The paper describes the dataset in detail. If you are doing any large-scale language modeling, it is worth checking out.