If you read one article this week:
Articles
Instacart wrote about their MLOps platform Griffin in the following blog post.
Griffin was developed to help MLEs (Machine Learning Engineers) quickly iterate on machine learning models, effortlessly manage product releases, and closely track their production applications. Based on these goals, we built the system with a few major considerations:
Scalability: The platform should support up to thousands of machine learning applications at Instacart.
Extensibility: The platform should be flexible to extend and integrate with various data management backends and machine learning tools.
Generality: The platform should provide a unified workflow and consistent user experience despite its broad integration with third-party solutions.
They go into the various components in much more detail in the blog post.
Lessons Learned
Buy vs. Build. Leveraging existing commercial and non-commercial third-party solutions enabled us to support a quickly growing feature set and avoid reinventing the wheel. It was critical to carefully integrate these solutions into the platform so that we could switch between solutions with minimal migration overhead.
Make it flexible. Supporting custom ML applications increased adoption among diverse teams at Instacart. We generated code from the standardized template providing the ability to override, and allowed MLEs to integrate legacy systems until they had the bandwidth for migration. To ensure a consistent running environment for custom applications, we adopted Docker runtimes which simplify reproducing users’ experiences and allow us to troubleshoot issues more quickly.
Make incremental progress. Regular onboarding sessions streamlined feedback and kept the platform design simple. We scheduled regular hands-on codelabs to onboard MLEs onto new features and get early feedback for further improvement. This encouraged collaboration and kept engineers from going down the rabbit hole of spending years building the ‘perfect’ platform.
Extensibility enables rapid growth. Extensible and reusable foundational components helped us build a self-service infrastructure for accommodating feature requests from a growing number of MLEs, a modular codebase for adapting to the fast-changing landscape, a simple interface for smooth onboarding, and a production-ready system for scaling to millions of Instacart users.
Airbnb wrote about how they use graph machine learning in the following blog post.
This is an interesting article from Gradient that explores graph theory, how it relates to brain function, and how it connects to dementia and Alzheimer’s Disease (AD).
It links impairments in cognitive ability to impairments of brain function, and draws on graph networks and network theory in general to connect the various forms of cognitive decline that culminate in AD. If you are interested in neuroscience, it is definitely worth checking out.
Netflix wrote about their data ingestion and processing platform in the following blog post.
Real-time processing technology (a.k.a. stream processing) is one of the key factors that enable Netflix to maintain its leading position in the competition to entertain our users. Our previous-generation streaming pipeline solution, Keystone, has a proven track record of serving many of our key business needs. However, as we expand our offerings and try out new ideas, there is a growing need to unlock other emerging use cases not yet covered by Keystone. After evaluating the options, the team decided to create Data Mesh as our next-generation data pipeline solution.
We make schema a first-class citizen in Data Mesh for the following reasons:
Better data quality: Only events that comply with the schema can be encoded, which gives consumers more confidence.
Finer granularity of data lineage: The platform is able to track how fields are consumed by different consumers and surface it on the UI.
Data discovery: Schema describes data sets and enables the users to browse different data sets and find the dataset of interest.
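The "only compliant events can be encoded" guarantee is easy to picture with a toy sketch. This is not Netflix's implementation (Data Mesh's actual schema machinery is far richer); the schema, field names, and `encode_event` helper below are all hypothetical, purely to illustrate validating against a declared schema before encoding:

```python
import json

# Hypothetical event schema: field name -> expected type.
SCHEMA = {
    "user_id": int,
    "event_type": str,
    "timestamp_ms": int,
}

def encode_event(event: dict) -> bytes:
    """Validate an event against SCHEMA, then encode it; reject otherwise."""
    for field, expected_type in SCHEMA.items():
        if field not in event:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(event[field], expected_type):
            raise ValueError(f"field {field} must be {expected_type.__name__}")
    # Anything that made it past validation is safe for consumers to trust.
    return json.dumps(event).encode("utf-8")

ok = encode_event({"user_id": 1, "event_type": "click", "timestamp_ms": 1})
```

Because nothing non-compliant ever reaches the stream, downstream consumers never need defensive parsing, which is the "more confidence" point above.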
Libraries
TorchGeo is a PyTorch domain library, similar to torchvision, providing datasets, samplers, transforms, and pre-trained models specific to geospatial data.
The goal of this library is to make it simple:
for machine learning experts to work with geospatial data, and
for remote sensing experts to explore machine learning solutions.
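To give a flavor of what a geospatial sampler does, here is a plain-Python sketch of the concept only; TorchGeo's real samplers work on georeferenced rasters through its own dataset API, and the `random_geo_samples` function below is a made-up stand-in, not part of TorchGeo:

```python
import random

def random_geo_samples(extent, patch, n, seed=0):
    """Draw n fixed-size patch bounding boxes uniformly from a scene extent.

    extent: (xmin, ymin, xmax, ymax) in map coordinates
    patch:  (width, height) of each sampled patch
    """
    rng = random.Random(seed)
    xmin, ymin, xmax, ymax = extent
    w, h = patch
    boxes = []
    for _ in range(n):
        # Keep the whole patch inside the scene.
        x = rng.uniform(xmin, xmax - w)
        y = rng.uniform(ymin, ymax - h)
        boxes.append((x, y, x + w, y + h))
    return boxes

boxes = random_geo_samples((0, 0, 100, 100), (10, 10), n=5)
```

Sampling in map coordinates rather than pixel indices is what lets a library like TorchGeo combine imagery and labels from different sources in one training loop.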
PADL is a pipeline builder for PyTorch.
It can be used with all of the great PyTorch functionality you’re used to for writing layers.
It allows users to build pre-processing, forward passes, loss functions and post-processing into the pipeline.
PADL models may have arbitrary topologies and make use of arbitrary packages from the Python ecosystem.
It allows for converting standard functions to PADL components using a single keyword, transform.
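The "one keyword turns functions into pipeline components" idea is easy to sketch. This is not PADL's actual API (PADL's `transform` and composition operators differ in detail); it is a minimal plain-Python illustration of the pattern, with hypothetical `Component`, `tokenize`, and `count` names:

```python
class Component:
    """Wraps a function so it can be chained into a pipeline with >>."""
    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def __rshift__(self, other):
        # (a >> b)(x) == b(a(x)): feed this component's output into the next.
        return Component(lambda x: other(self(x)))

def transform(fn):
    """Convert a standard function into a pipeline component."""
    return Component(fn)

@transform
def tokenize(text):   # stand-in for a pre-processing step
    return text.split()

@transform
def count(tokens):    # stand-in for a forward pass
    return len(tokens)

pipeline = tokenize >> count
result = pipeline("a b c")
```

The appeal of the pattern is that pre-processing, the model, and post-processing live in one composable object that can be saved, inspected, and served as a unit.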
PIXEL is a language model that operates on text rendered as images, fully removing the need for a fixed vocabulary. This effectively allows for transfer to any language and script that can be typeset on your computer screen.
The paper:
Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches, instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
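PIXEL's pretraining objective (render text as an image, mask patches, reconstruct pixels) can be sketched in a few lines. The real model rasterizes text with a font engine and uses a ViT-style encoder; the toy `render_text` and `mask_patches` functions below are hypothetical stand-ins that only show the data flow:

```python
import random

def render_text(text, patch_size=4):
    # Stand-in "renderer": map each character to a flat patch of pixel
    # intensities derived from its code point. A real system rasterizes
    # the string with an actual font.
    return [[(ord(c) + i) % 256 for i in range(patch_size * patch_size)]
            for c in text]

def mask_patches(patches, mask_ratio=0.25, seed=0):
    """Hide a random subset of patches; the hidden pixels become targets."""
    rng = random.Random(seed)
    n_mask = max(1, int(len(patches) * mask_ratio))
    masked_idx = set(rng.sample(range(len(patches)), n_mask))
    visible = [p for i, p in enumerate(patches) if i not in masked_idx]
    targets = {i: patches[i] for i in masked_idx}  # reconstruction targets
    return visible, targets

patches = render_text("hello world")
visible, targets = mask_patches(patches)
# A model would be trained to predict `targets` from `visible`,
# instead of predicting a distribution over vocabulary tokens.
```

Because the input is pixels rather than token IDs, any script that can be rendered is representable, which is the vocabulary-bottleneck point in the abstract.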
OmniBenchmark is a new benchmark for evaluating pre-trained models, accompanied by a supervised contrastive learning framework.
FastDup is a tool for gaining insights from a large image collection. It can find anomalies, duplicate and near-duplicate images, and clusters of similar images, and it can learn normal behavior and temporal interactions between images. It can be used for smart subsampling toward a higher-quality dataset, outlier removal, and novelty detection of new information to be sent for tagging. FastDup scales to millions of images running on CPU only.
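For intuition, near-duplicate detection can be sketched with a simple average hash; FastDup's actual method is different (and far more scalable), so treat the `average_hash` and `near_duplicates` helpers below as an illustrative toy, not its API:

```python
def average_hash(pixels):
    """pixels: flat list of grayscale values; returns a bit tuple."""
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def near_duplicates(images, max_bits=2):
    """Return index pairs whose hashes differ in at most max_bits bits."""
    hashes = [average_hash(img) for img in images]
    pairs = []
    for i in range(len(hashes)):
        for j in range(i + 1, len(hashes)):
            if hamming(hashes[i], hashes[j]) <= max_bits:
                pairs.append((i, j))
    return pairs

imgs = [
    [10, 200, 15, 190],   # image A
    [12, 198, 14, 191],   # slightly perturbed copy of A
    [200, 10, 190, 15],   # very different image
]
dups = near_duplicates(imgs)
```

The point of hashing first is that comparing tiny fingerprints is vastly cheaper than comparing raw pixels, which is what makes scanning millions of images on CPU plausible.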
NVIDIA wrote a blog post about improving large language models (LLMs). They propose two new approaches:
Two new techniques included in the updates that optimize and scale the training of LLMs are sequence parallelism (SP) and selective activation recomputation (SAR).
Sequence parallelism expands tensor-level model parallelism by noticing that the regions of a transformer layer that haven’t previously been parallelized are independent along the sequence dimension.
Splitting these layers along the sequence dimension enables distribution of the compute and, most importantly, the activation memory for these regions across the tensor parallel devices. Since the activations are distributed, more activations can be saved for the backward pass instead of recomputing them.
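A toy illustration of the splitting idea (not NVIDIA's implementation, which operates on GPU tensors inside Megatron-LM): per-token operations such as LayerNorm depend only on their own token, so the sequence can be sharded across devices and each device holds only its slice of the activations. The helper names below are hypothetical:

```python
def split_along_sequence(activations, n_devices):
    """activations: list of per-token vectors; returns one shard per device."""
    seq_len = len(activations)
    chunk = (seq_len + n_devices - 1) // n_devices
    return [activations[i:i + chunk] for i in range(0, seq_len, chunk)]

def layer_norm_like(token_vec):
    # Per-token op: depends only on its own token, so shards can be
    # processed independently and the results concatenated afterwards.
    mean = sum(token_vec) / len(token_vec)
    return [x - mean for x in token_vec]

activations = [[1.0, 3.0], [2.0, 4.0], [0.0, 2.0], [5.0, 7.0]]
shards = split_along_sequence(activations, n_devices=2)

# Each "device" processes its own shard; activation memory per device
# shrinks by the shard count, and the concatenated result is unchanged.
out = [layer_norm_like(t) for shard in shards for t in shard]
full = [layer_norm_like(t) for t in activations]
```

Because the sharded result equals the unsharded one, the split is free in terms of correctness; the win is that each device now stores a fraction of these activations for the backward pass.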
Selective activation recomputation improves cases where memory constraints force the recomputation of some, but not all, of the activations, by noticing that different activations require different numbers of operations to recompute.
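The selection idea can be sketched as a budgeted trade-off; this is a hypothetical greedy heuristic for illustration, not NVIDIA's actual policy: under a memory budget, keep the activations that are expensive to recompute per byte stored, and recompute the cheap ones in the backward pass.

```python
def select_to_keep(activations, memory_budget):
    """activations: list of (name, memory_cost, recompute_flops) tuples.

    Greedily keep the activations with the highest recompute cost per
    unit of memory; everything not kept is recomputed on the backward pass.
    """
    ranked = sorted(activations, key=lambda a: a[2] / a[1], reverse=True)
    kept, used = [], 0
    for name, mem, flops in ranked:
        if used + mem <= memory_budget:
            kept.append(name)
            used += mem
    return kept

# Illustrative numbers only: attention scores are small but costly to
# rebuild, while the GeLU output is large and cheap to recompute.
acts = [
    ("attention_scores", 4, 100),
    ("gelu_output", 8, 10),
    ("layernorm_out", 2, 40),
]
keep = select_to_keep(acts, memory_budget=6)
```

The point of the heuristic is exactly the observation in the blog post: since activations differ in how many operations they take to recompute, spending the memory budget on the expensive ones minimizes the recomputation added to the backward pass.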