If this email was forwarded to you, a brief introduction to the newsletter is here.
I used to maintain ML Newsletter; I changed platforms (MailChimp to Substack, as it is a bit easier) and am planning to focus more on MLOps, which is the reason for the new name.
2020 was a great year for machine learning, systems engineering, and MLOps. This year we saw not only more exciting research and applications (GPT-3, AlphaFold), but also techniques that make existing technologies easier to use and more production-ready.
We will start with research and move to industry applications later.
Machine Learning Research
Transformers
After GANs (the previous next big thing), 2020 started with strong adoption of Transformers, with BERT first and then a number of other research works extending it. RoBERTa, Reformer, DeiT, and a number of other techniques were built through 2019 and 2020. Until the next big thing arrives in ML research, I expect we will continue exploring this area, as transformers are very flexible and enable transfer learning. I also expect different types of model optimization techniques, like DistilBERT, to emerge and optimize for different types of transformers.
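As a sketch of the distillation idea behind DistilBERT (a toy, pure-Python illustration, not DistilBERT's actual training code): the student model is trained to match the teacher's temperature-softened output distribution, so a smaller model absorbs the larger model's behavior. The logit values and temperature below are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature softens the distribution."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student distributions.

    Minimized (by Gibbs' inequality) exactly when the student matches the
    teacher; the T^2 factor keeps gradient magnitudes comparable across
    temperatures, as in the original distillation formulation.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q)) * temperature ** 2
```

A student whose logits already match the teacher's incurs the minimum possible loss; any mismatched student scores strictly higher.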
AlphaFold
DeepMind published another exciting piece of research on the protein folding problem, which they called AlphaFold. This is a breakthrough: the problem had been very hard for multiple decades, and AlphaFold achieved a result that had eluded the field.
GPT-3
OpenAI published their new research work behind an API; it significantly improves on their previous model, GPT-2. Microsoft struck a deal with OpenAI to use this API exclusively.
GPT-2 was published last year. What surprised me is how much larger the model got in a year: for comparison, GPT-2 used 1.5 billion parameters and GPT-3 uses 175 billion. The model output was also significantly better.
These models require a lot of infrastructure to train and deploy at this scale. This and other research shows that model sizes are not getting smaller anytime soon, which is great for MLOps, as we have a lot of work to do.
MLSys Research
This recently published paper gives a good overview of which areas within MLOps are still unsolved and provides some solutions. It tries to tackle multiple aspects of machine learning systems (model training, model selection, data input, etc.), so it is very comprehensive in nature. However, I am not sure the proposed solutions are comprehensive enough. Then again, the point of the paper is to survey the problems, not necessarily to detail solutions for each one.
For large-scale model training and deployment, I loved this ICML 2020 paper. "Once for all" means that we train the model once and then use different model optimization techniques to deploy it to different use cases, like mobile. The code is also available on GitHub.
Machine Learning Libraries
I work on PyTorch and am obviously biased, so I will not do a comparison. However, I will say that PyTorch had an amazing year, building an amazing set of features. There was also a Developer Day in November, which you should check out!
HuggingFace thrived this year, riding the wave of transformer research. They also enjoy the advantage of building backend-agnostic APIs. Their work and engagement with the community are also top-notch.
The "git for models/data" concept was exciting to me, but I now believe this model is not going to work as well as the HuggingFace Models approach or PyTorch Hub.
There are a couple of reasons why users will seek an approach more integrated with the library/language rather than an integration through a mechanism like git.
An integrated approach is more seamless, which shortens time to market and speeds up experimentation.
The git approach's advantage is being able to support different languages, but complete convergence to Python was already a done deal in 2018, and I believe nothing will change (Python-first APIs/libraries) in the short to medium term.
Libraries can version these models, which abstracts even more complexity from the end user. So if I want to use a library, I do not have to worry about a git commit SHA or a version number; I just import DistilBERT and expect it to work.
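The versioning argument can be sketched with a toy registry. Everything here is hypothetical (the `MODEL_INDEX` entries, version strings, and SHAs are made up, and this is not any real library's API); the point is only that the library pins each friendly model name to a tested artifact, so the commit SHA never reaches the user.

```python
# Hypothetical library-side model index (not a real library's API):
# each friendly name maps to a pinned artifact version and the upstream
# commit SHA that the library maintainers vetted. Values are illustrative.
MODEL_INDEX = {
    "distilbert": ("1.2.0", "9f1c3ab"),
    "bert-base": ("2.0.1", "4d0e77c"),
}

def load_model(name):
    """Resolve a friendly name to pinned, versioned model metadata.

    A real library would download and cache weights here; this sketch
    just returns the resolved metadata to show the abstraction.
    """
    if name not in MODEL_INDEX:
        raise KeyError(f"unknown model: {name}")
    version, sha = MODEL_INDEX[name]
    return {"name": name, "version": version, "pinned_sha": sha}
```

The user calls `load_model("distilbert")` and never sees the SHA unless they ask for it; that is the complexity the git-based approach pushes onto every user instead.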
With transformers, I expect knowledge distillation and other transfer learning mechanisms will be easier to use with this approach.
PyTorch Lightning got a lot of traction, especially toward the end of the year, with Google TPU integration and Facebook adopting it for content understanding.
Catalyst is also building a number of different libraries to make PyTorch easier to adopt.
One thing that surprised me is that Kubeflow did not get enough traction this year, and it is not clear to me that it can dominate the data science platform space the way Kubernetes dominated containers. I think there is a large opportunity here for a company to build something native to Kubernetes and enable companies already on Kubernetes to build their platforms on it. Kubeflow was very well suited to solve this, but in 2020 it did not happen. Maybe in 2021.
Machine Learning Organization
How do you think about building and supporting ML teams? This blog post shows how a matrix org structure can build and support an applied machine learning team. I particularly liked the approach of building multiple teams around the core ML modeling team. I think this is the right approach: one team focuses on machine learning models, while the other teams provide different types of infrastructure/operations support to enable it.
Responsible AI
There were a number of interesting applications of research ideas like model cards this year, and a number of committees were formed. I expect to see more and more research in this area; we are in the very early days of responsible AI applications.
In the medium term, I expect standardization and unification of different metrics, similar to what we have in classification (precision, recall, F-beta score) and regression; metrics for responsible AI should be treated as first-class citizens. If you are working in this space, I would love to talk with you; I am looking for early-stage companies to invest in.
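For reference, the classification metrics mentioned above are simple enough to compute from scratch, which is exactly why they became a shared standard; a minimal sketch (the example labels are made up for illustration):

```python
def precision_recall_fbeta(y_true, y_pred, beta=1.0):
    """Binary-classification precision, recall, and F-beta from scratch.

    beta > 1 weights recall more heavily; beta < 1 weights precision.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    fbeta = ((1 + b2) * precision * recall / (b2 * precision + recall)
             if precision + recall else 0.0)
    return precision, recall, fbeta
```

Responsible AI has no equivalent of this yet: no small set of agreed-upon formulas that every library implements identically, which is the gap I expect to close over the medium term.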
In the short term, I expect more and more companies to publish how they think about these concepts from a model, data, or systems perspective (e.g., PerspectiveAI), which would help standard/uniform metrics converge over the medium term.
Classes/Seminars
How do you keep up to date with the field? I subscribe to a number of classes/seminars. One that just started recently is Stanford MLSys Seminars. I really like this seminar format, as it is not a single lecture starting from the beginning (some domain knowledge is assumed). It also gives a good glimpse into areas you may not know about while still conveying a good amount of information. The video lectures so far have had a high signal-to-noise ratio, and all of them have industry use in mind.
At NeurIPS 2020, Charles Isbell argued why machine learning should be considered a software engineering enterprise and why we need systematic approaches rather than thinking of machine learning as only the model itself. I highly recommend this talk, as it outlines why machine learning systems should be considered as a whole rather than considering the model in isolation.
Also, most conferences were online this year, which was great for following along and watching videos. ICML 2020 was one of them, and I quite enjoyed that every video was made available online.
Data Architecture
How do you build a data architecture that can support different use cases in a company? This blog post shows how the Financial Times built different data architectures over time. What was very interesting to me is that the use cases and the technologies supporting them were enabled in tandem; e.g., real-time analytics and a Spark Streaming solution.
A16Z published a good state-of-the-union overview blog post. I highly recommend it to understand where things are and which companies are trying to solve which problems. It also gives good exposure to underappreciated areas.
Matt Turck published a similar blog post as well. Of course, it is no coincidence that these posts overlap with Snowflake's successful IPO in terms of timeline. Infrastructure, especially data infrastructure for different use cases, is still far from a solved problem, especially in the context of machine learning.
Data Discovery and Metadata Engines
Every company is building its own data discovery tools, and in 2020 we saw an explosion of these:
LinkedIn: DataHub
WeWork: Marquez
Lyft: Amundsen
Airbnb: DataPortal
Spotify: Lexikon
Uber: Databook
Netflix: Metacat
All of these tools/solutions are trying to make data more accessible for internal use cases and easily discoverable/searchable.
Industry Applications
Lyft wrote about how they are using a multi-head attention mechanism to predict your next destination.
Airbnb wrote about how they are using Query/Listing Tower Architecture for search.
Pinterest wrote about how they are using PinSage embeddings to organize pins.
Salesforce wrote about how they are building a data science platform on Kubernetes.
Netflix wrote about how they are using embeddings to find out which movie to support.