NVIDIA announces TensorRT-LLM to make LLM inference easy (on the H100!)
Google presents a method for training AI agents to behave like humans by inferring a reward function (Inverse Reinforcement Learning!)
Articles & Papers
NVIDIA TensorRT-LLM Supercharges Large Language Model Inference on H100
NVIDIA TensorRT-LLM is a new library that optimizes inference of large language models (LLMs) on NVIDIA H100 GPUs. TensorRT-LLM can accelerate LLM inference by up to 10x, making it practical to deploy LLMs in real-time applications such as chatbots, translation, and code generation.
TensorRT-LLM works by compiling LLMs into highly optimized kernels for execution on NVIDIA GPUs. The compilation takes the specific architecture of the H100 into account, which is how TensorRT-LLM achieves much of its performance gain.
In addition to accelerating inference, TensorRT-LLM also reduces the memory footprint of LLMs: it can fuse multiple layers of a model into a single kernel, cutting the number of memory accesses required per inference pass.
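TensorRT-LLM's fused kernels are hand-optimized CUDA, but the memory-access argument can be sketched in plain Python (the function names below are illustrative, not TensorRT-LLM APIs): the unfused version materializes an intermediate buffer between layers, while the fused version makes a single pass.

```python
# Toy sketch of layer fusion (illustrative only, not TensorRT-LLM code).
def unfused(xs):
    tmp = [x + 1.0 for x in xs]       # layer 1: writes an intermediate buffer
    return [t * 2.0 for t in tmp]     # layer 2: reads that buffer back

def fused(xs):
    # Both layers in one pass: one read and one write per element,
    # and no intermediate buffer in between.
    return [(x + 1.0) * 2.0 for x in xs]

assert unfused([0.0, 1.0, 2.0]) == fused([0.0, 1.0, 2.0]) == [2.0, 4.0, 6.0]
```

On a GPU the same idea removes round-trips to device memory between kernel launches, which is where much of the speedup and memory savings come from.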
Overall, NVIDIA TensorRT-LLM is a powerful new tool for making LLMs more efficient and accessible, and it could spur a new wave of innovation in natural language processing.
New AI Models and Methods for Learning Human-Like Behavior and Generating Images
Researchers at Google AI have developed a method for training AI agents to learn human-like behavior. The approach builds on Inverse Reinforcement Learning (IRL): rather than hand-specifying a reward, it infers the reward function that the human demonstrator appears to be optimizing, then uses that reward to train an agent to behave in a similar way.
The researchers tested their IRL method on a variety of tasks, including driving, navigating a maze, and playing a video game. In all cases, the AI agents trained with IRL were able to learn human-like behavior and outperform agents trained with other methods.
This research has the potential to revolutionize the way that AI agents are developed and used. For example, IRL could be used to train AI agents to assist people with disabilities or to provide customer service.
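The article gives no implementation details, so the following is a drastically simplified sketch of the core IRL idea only, not the researchers' method: recover reward weights w such that the demonstrator's chosen action scores highest under r(s, a) = w · phi(s, a), using a perceptron-style update. The states, actions, and feature values are all made up for illustration.

```python
# Minimal IRL-flavored sketch: learn w so the expert's action maximizes
# the linear reward w . phi(s, a). Toy features: [progress, safety].
def phi(state, action):
    features = {
        ("start", "forward"): [1.0, 1.0],
        ("start", "reverse"): [-1.0, 1.0],
        ("near_wall", "forward"): [1.0, -1.0],
        ("near_wall", "stop"): [0.0, 1.0],
    }
    return features[(state, action)]

demos = [("start", "forward"), ("near_wall", "stop")]   # expert demonstrations
actions = {"start": ["forward", "reverse"], "near_wall": ["forward", "stop"]}

def score(w, state, action):
    return sum(wi * fi for wi, fi in zip(w, phi(state, action)))

w = [0.0, 0.0]
for _ in range(10):
    for state, expert_a in demos:
        # Action the current reward estimate prefers
        best = max(actions[state], key=lambda a: score(w, state, a))
        if best != expert_a:
            # Shift w toward the expert's features, away from the mistaken pick
            for i in range(2):
                w[i] += phi(state, expert_a)[i] - phi(state, best)[i]

# The learned reward now explains the demonstrations
for state, expert_a in demos:
    assert max(actions[state], key=lambda a: score(w, state, a)) == expert_a
```

An agent trained to maximize the recovered reward then reproduces the demonstrated behavior, which is the step the Google work scales to tasks like driving and game playing.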
In addition to the IRL work, researchers at Zhejiang University have developed InstructDiffusion, a diffusion-modeling method in which users specify the desired output image with text instructions. Conditioning generation on the instruction makes it more accurate and efficient, since the model does not spend effort producing images that fail to match the desired output.
Last-Mile Data Processing with Ray at Pinterest
Ray is a distributed computing framework that can accelerate a variety of workloads, including data processing. Pinterest engineers use Ray to speed up the last-mile processing stage of their data pipeline, improving its performance and delivering insights to users more quickly.
Ray works by breaking down large tasks into smaller subtasks that can be executed in parallel on multiple machines. This can significantly speed up the processing of large datasets.
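Ray expresses this by decorating functions with `@ray.remote` and scheduling them across a cluster; the split/apply/merge shape it generalizes can be sketched with the standard library on a single machine (the shard-processing function below is a made-up stand-in for a real last-mile step such as a per-record feature transformation):

```python
# Split a large job into shards, process them in parallel, merge the results.
# Ray does this across machines via @ray.remote; here the stdlib executor
# shows the same pattern locally.
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard):
    return [x * 2 for x in shard]   # placeholder transformation

def split(data, n_shards):
    size = max(1, len(data) // n_shards)
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(100))
with ThreadPoolExecutor(max_workers=4) as pool:
    shards = pool.map(process_shard, split(data, 4))
merged = [x for shard in shards for x in shard]
assert merged == [x * 2 for x in data]
```

With Ray the subtasks additionally run on different machines and the framework handles scheduling, data movement, and fault tolerance.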
Persimmon-8B: A New 8 Billion Parameter AI Model from Adept
Adept AI has announced the release of Persimmon-8B, an open-source language model with 8 billion parameters. Persimmon-8B is trained on a large dataset of text and code and can generate and summarize text, write code, and answer questions in an informative way.
Released under a permissive license, Persimmon-8B is intended as a capable base model that developers can fine-tune for their own applications.
Libraries
SeamlessM4T, from Meta AI, is a multilingual and multimodal machine translation model designed to provide high-quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text.
SeamlessM4T covers:
📥 101 languages for speech input.
⌨️ 96 languages for text input/output.
🗣️ 35 languages for speech output.
This unified model enables multiple tasks without relying on multiple separate models:
Speech-to-speech translation (S2ST)
Speech-to-text translation (S2TT)
Text-to-speech translation (T2ST)
Text-to-text translation (T2TT)
Automatic speech recognition (ASR)
FMMAX is an implementation of the Fourier modal method (FMM) in JAX.
The FMM -- also known as rigorous coupled wave analysis (RCWA) -- is a semianalytical method that solves Maxwell's equations in periodic stratified media, where in-plane directions are treated with a truncated Fourier basis and the normal direction is handled by a scattering matrix approach [1999 Whittaker, 2012 Liu, 2020 Jin]. This allows certain classes of structures to be modeled with relatively low computational cost.
Our use of JAX enables GPU acceleration and automatic differentiation of FMM simulations. Besides these features, FMMAX is differentiated from other codes by its support for Brillouin zone integration, advanced vector FMM formulations which improve convergence, and anisotropic and magnetic materials.
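FMMAX's own API is not shown here; as a minimal illustration of the scattering/transfer-matrix idea the FMM builds on, the sketch below solves the 1-D special case — normal-incidence plane waves in a stratified stack — in pure Python. The FMM generalizes this picture with a truncated Fourier basis in the in-plane directions.

```python
# Normal-incidence transfer-matrix method for a layered stack.
# Fields are (forward, backward) plane-wave amplitudes in each medium.
import cmath
import math

def interface(n1, n2):
    # Fresnel coefficients at a lossless interface, medium 1 -> medium 2
    r, t = (n1 - n2) / (n1 + n2), 2 * n1 / (n1 + n2)
    return [[1 / t, r / t], [r / t, 1 / t]]

def propagate(n, d, wavelength):
    # Phase accumulated crossing a layer of index n and thickness d
    delta = 2 * math.pi * n * d / wavelength
    return [[cmath.exp(-1j * delta), 0], [0, cmath.exp(1j * delta)]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def reflectance(ns, ds, wavelength):
    # ns: indices [ambient, layers..., substrate]; ds: layer thicknesses
    m = interface(ns[0], ns[1])
    for i, d in enumerate(ds, start=1):
        m = matmul(matmul(m, propagate(ns[i], d, wavelength)), interface(ns[i], ns[i + 1]))
    r = m[1][0] / m[0][0]   # no backward wave in the substrate
    return abs(r) ** 2

# Bare air/glass interface: R = ((1 - 1.5) / (1 + 1.5))**2 = 0.04
assert abs(reflectance([1.0, 1.5], [], 0.55) - 0.04) < 1e-9
```

A quarter-wave layer with index sqrt(n_ambient * n_substrate) drives this reflectance to zero, the classic anti-reflection coating result, which makes a handy sanity check for any such solver.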
Kats is a lightweight, easy-to-use, and generalizable toolkit for analyzing time series data. Time series analysis is an essential component of data science and engineering work in industry, from understanding key statistics and characteristics, to detecting regressions and anomalies, to forecasting future trends. Kats aims to be a one-stop shop for time series analysis, including detection, forecasting, feature extraction/embedding, and multivariate analysis.
Kats is released by Facebook's Infrastructure Data Science team. It is available for download on PyPI.
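As a toy illustration of one task in that list, a bare-bones z-score anomaly detector can be written with the standard library alone; Kats wraps far more robust detectors and forecasters behind its `TimeSeriesData` interface.

```python
# Flag points more than `threshold` standard deviations from the mean.
# A crude stand-in for the tuned detectors Kats provides.
import statistics

def zscore_anomalies(series, threshold=3.0):
    mu = statistics.mean(series)
    sigma = statistics.pstdev(series)
    return [i for i, x in enumerate(series) if abs(x - mu) > threshold * sigma]

# A flat series with one spike: only the spike is flagged.
series = [10.0] * 30 + [100.0] + [10.0] * 30
assert zscore_anomalies(series) == [30]
```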
stopes: A library for preparing data for machine translation research
As part of the FAIR No Language Left Behind (NLLB) (Paper, Website, Blog) project to drive inclusion through machine translation, a large amount of data was processed to create training data. We provide the libraries and tools we used to:
create clean monolingual data from web data
mine bitext
easily write scalable pipelines for processing data for machine translation
Full documentation at https://facebookresearch.github.io/stopes
SONAR is a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LaBSE on the xsim and xsim++ multilingual similarity search tasks.
Speech segments can be embedded in the same SONAR embedding space using language-specific speech encoders trained in a teacher-student setting on speech transcription data. We also provide a single text decoder, which allows us to perform text-to-text and speech-to-text machine translation, including for zero-shot language and modality combinations.
SONAR stands for Sentence-level multimOdal and laNguage-Agnostic Representations
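What a fixed-size embedding space buys you is that similarity search reduces to plain vector comparison, regardless of the sentences' language or modality. The vectors below are made-up 3-dimensional stand-ins, not real SONAR embeddings:

```python
# Cross-lingual similarity search as cosine similarity over a shared
# embedding space. Embedding values here are illustrative toys.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

emb = {
    "The cat sleeps.": [0.9, 0.1, 0.0],      # hypothetical English embedding
    "Le chat dort.": [0.88, 0.12, 0.02],     # hypothetical French embedding
    "Stocks fell today.": [0.0, 0.2, 0.95],  # unrelated sentence
}

query = emb["The cat sleeps."]
best = max((s for s in emb if s != "The cat sleeps."),
           key=lambda s: cosine(query, emb[s]))
assert best == "Le chat dort."
```

In the real system, SONAR's language- and modality-specific encoders map text and speech into one such space, so the same nearest-neighbor search works across all of them.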