Distributed Training and HyperParameter Optimization
ConViT (Convolution + Transformer) is the new architecture on the block!
This week, we have two great articles from Uber on how they are leveraging their ML platform to do distributed training and hyperparameter optimization. Facebook published a new neural network architecture that combines convolution and transformers into a single architecture (ConViT).
I mentioned last week that Distill is taking a hiatus, and some people reached out asking how to format articles the way Distill does. The good news is that the framework is open-source and available here.
Articles
Uber wrote a blog post on how they are using Ray and XGBoost to do distributed training. It is nice to see how much of the data munging Ray handles and how it distributes that work behind a clean API.
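For reference, here is a minimal sketch of distributed XGBoost training with the xgboost_ray package, following its quickstart; the dataset, parameters, and worker counts are illustrative, not Uber's actual setup.

```python
from sklearn.datasets import load_breast_cancer
from xgboost_ray import RayDMatrix, RayParams, train

X, y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(X, y)  # Ray shards this data across the training actors

evals_result = {}
bst = train(
    {"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
    train_set,
    evals_result=evals_result,
    evals=[(train_set, "train")],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=1),  # 2 distributed workers
)
bst.save_model("model.xgb")
print("Final training error:", evals_result["train"]["error"][-1])
```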
Uber published another article on how they do hyperparameter optimization and architecture search. They automated the following steps through their Michelangelo platform:
setting up the experiment carefully (e.g., extending the date range implies changing the train/test split, if it is date-based)
applying the relevant heuristics (e.g., setting the hyperparameter search ranges)
updating compute resources (e.g., partitioned models need more workers or more parallelism to avoid higher latency)
recording experimental results and identifying the current best model
which provides a seamless experience for engineers who want to pick the best model. A generic sketch of such a search loop is shown below.
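This is not Michelangelo's actual API; it is a hedged sketch of the same idea (heuristic search ranges, many trials, recording the best configuration) using Ray Tune, assuming a Ray version where tune.run and tune.report are available. The objective and search space are made up for illustration.

```python
from ray import tune

def train_model(config):
    # Stand-in for a real training loop; report the metric Tune should optimize.
    score = -(config["lr"] - 0.01) ** 2 - (config["num_trees"] - 200) ** 2 * 1e-6
    tune.report(score=score)

analysis = tune.run(
    train_model,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),   # heuristic search range
        "num_trees": tune.randint(50, 500),
    },
    num_samples=20,                          # number of trials to run
    metric="score",
    mode="max",
)
print(analysis.best_config)                  # record the current best configuration
```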
Google released a dataset to study the effects of gender bias in the translation domain.
Facebook published a post on the recent research work that proposes a new neural network architecture (ConViT) combining CNNs and Transformers. It tries to bring together the best of both worlds.
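If you want to try it yourself, ConViT variants are available through the timm library (assuming a timm version that ships the convit_* models); this is just a quick-start sketch, not Facebook's training code.

```python
import timm
import torch

model = timm.create_model("convit_base", pretrained=True)
model.eval()

logits = model(torch.randn(1, 3, 224, 224))  # ImageNet-style input
print(logits.shape)                          # torch.Size([1, 1000])
```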
Papers
Thinking Like Transformers answers the question: “Is it possible to come up with a DSL (Domain-Specific Language) to specify a transformer architecture?” The authors' answer is a definite yes, and they came up with the RASP language.
Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better is an excellent survey of model optimization techniques and how to make deep learning models more efficient. The author puts efficiency techniques into different buckets:
Compression techniques (quantization, pruning, …)
Learning techniques (distillation, …)
Automation (neural architecture search, …)
Efficient architectures (ViT, …)
It covers these methods purely from an efficiency perspective and gives a good number of pointers to SOTA (state-of-the-art) techniques. A small sketch of one of the compression buckets is below.
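As a minimal, concrete example of the compression bucket, here is post-training dynamic quantization in PyTorch; the model is a toy stand-in, not one from the survey.

```python
import torch
import torch.nn as nn

# Toy model standing in for a trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Convert Linear weights to int8; activations are quantized dynamically at runtime.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(quantized)
```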
Libraries
Google open-sourced SentencePiece, an unsupervised text tokenizer/detokenizer designed for neural network-based text generation. Due to its unsupervised nature, it does not make any assumptions about the language and is highly flexible.
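A minimal usage sketch of the SentencePiece Python API; the corpus file name and vocabulary size are assumptions for illustration.

```python
import sentencepiece as spm

# Train an unsupervised subword model directly on raw text.
spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="m", vocab_size=8000)

sp = spm.SentencePieceProcessor(model_file="m.model")
pieces = sp.encode("This is a test.", out_type=str)  # tokenize into subword pieces
text = sp.decode(pieces)                             # detokenize back to raw text
print(pieces, text)
```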
Preferred.ai has an open-source library called Cornac which allows you to compare multi-modal recommendation systems. It is an experimentation framework that enables fast iteration and comparison between models (TensorFlow, PyTorch). It also allows hyperparameter search on the experiments themselves; the grid search tutorial shows this capability well.
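A Cornac experiment sketch modeled on its quickstart; the dataset choice, models, and hyperparameters are illustrative assumptions.

```python
import cornac
from cornac.datasets import movielens
from cornac.eval_methods import RatioSplit
from cornac.metrics import RMSE, Recall
from cornac.models import BPR, MF

data = movielens.load_feedback()  # (user, item, rating) triples
split = RatioSplit(data=data, test_size=0.2, rating_threshold=4.0, seed=123)

# Compare two models side by side on the same split and metrics.
cornac.Experiment(
    eval_method=split,
    models=[MF(k=10, max_iter=25, seed=123), BPR(k=10, max_iter=200, seed=123)],
    metrics=[RMSE(), Recall(k=10)],
).run()
```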
Evidently.ai has an open-source tool called evidently which allows you to monitor and debug models and to create reports from that monitoring.
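A minimal data-drift report sketch, assuming an evidently version that exposes the Report / DataDriftPreset API; the data here is synthetic and purely illustrative.

```python
import numpy as np
import pandas as pd
from evidently.metric_preset import DataDriftPreset
from evidently.report import Report

rng = np.random.default_rng(0)
reference = pd.DataFrame({"feature": rng.normal(0.0, 1.0, 500)})  # training-time data
current = pd.DataFrame({"feature": rng.normal(0.5, 1.0, 500)})    # production data

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")  # shareable monitoring report
```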
Matchmaker is an evaluation library for information retrieval tasks on top of PyTorch. It powers the Neural IR Explorer.
Videos
Michael Bronstein gave a keynote titled Geometric Deep Learning: The Erlangen Programme of ML, which aims to build a single unified framework that captures all of the common neural network architectures in a single formula, similar to how Klein’s Erlangen program unified various approaches to geometry. I shared Bronstein’s work in previous newsletters, so I will not go into as much detail. However, I see two main benefits of this framework if it is adopted:
New neural network architectures can be designed more methodically/systematically, rather than as an “empirical architecture that happens to work”.
The advantages/disadvantages of architectures can be explained mathematically based on the operations they use.
Andrej Karpathy gave a talk on how they apply deep learning for FSD (Full Self-Driving) at Tesla. It covers how they built a multi-head system with a variety of classifiers on top of a “backbone” model, and how that model enables a number of engineers to improve it over time. He also talks about how computer vision is taking over from LIDAR, which he calls a “legacy system”, as they move more and more towards vision-based systems. What was most memorable for me is that they minimized rules and hand-coded logic and built the application completely on top of neural networks. After they identify high-level concepts (he calls them “triggers”), they use these concepts to make decisions (he calls this “control”).
The workshop on autonomous driving is also available here.
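To make the shared-backbone, multi-head idea concrete, here is a minimal PyTorch sketch; the backbone choice, head names, and class counts are illustrative assumptions, not Tesla's actual architecture.

```python
import torch
import torch.nn as nn
import torchvision

class MultiHeadNet(nn.Module):
    """A shared backbone producing features consumed by several task heads."""

    def __init__(self, num_lane_classes=4, num_sign_classes=10):
        super().__init__()
        resnet = torchvision.models.resnet18()
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final FC
        feat_dim = 512
        # Independent task heads trained on top of the shared features.
        self.lane_head = nn.Linear(feat_dim, num_lane_classes)
        self.sign_head = nn.Linear(feat_dim, num_sign_classes)

    def forward(self, x):
        feats = self.backbone(x).flatten(1)
        return {"lanes": self.lane_head(feats), "signs": self.sign_head(feats)}

model = MultiHeadNet()
out = model(torch.randn(2, 3, 224, 224))
print(out["lanes"].shape, out["signs"].shape)
```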
Philip Isola gave a talk, When and Why Does Contrastive Learning Work; he starts with excellent results from a number of papers and gives an intuition for contrastive learning in terms of mutual information.
This talk is part of the Learning with Limited and Imperfect Data workshop and is available here.
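For context, the contrastive objective behind this mutual-information view is usually instantiated as an InfoNCE-style loss; here is a minimal PyTorch sketch (the embedding size, batch size, and temperature are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE over a batch: z1[i] and z2[i] are two views of the same example."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # cosine similarities between all pairs
    labels = torch.arange(z1.size(0))    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss)
```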
Peyman Milanfar gave a CVPR talk titled Denoising as a Building Block: Theory and Applications; it covers the signal-processing background from which denoising originated and then looks at denoising in the modern era.