LLM Stack, Controllable Generative Models
Gorilla: a fine-tuned LLaMA-based model that surpasses the performance of GPT-4 at writing API calls
Articles
Famous venture capital firm Sequoia published an overview piece on various LLM companies and startups, as well as some of their use cases. If you want to understand the landscape and are interested in learning more about the various companies building services and solutions on top of LLMs, I highly recommend checking it out.
Stanford published a post on a new transformer architecture that can generate music in a controlled manner. They call the model the Anticipatory Music Transformer: a controllable generative model that facilitates co-composition of music with AI.
The interesting element is the “Human in the Loop” process, where an external audience can actually influence and steer the music through their inputs. This is a new area that they are spearheading, as ChatGPT and other foundation models do not provide a way for outside influence during generation.
Stanford published a new model architecture called HyenaDNA, a long-range genomic foundation model with context lengths of up to 1 million tokens at single-nucleotide resolution. HyenaDNA is pre-trained on the human reference genome, and sets a new SOTA on 23 downstream tasks, including predicting regulatory elements, chromatin profiles, and species classification. They also explore what new capabilities open up with long context in genomics, including in-context learning with soft prompt-tunable tokens and instruction fine-tuning. This follows up on their earlier post on the Hyena architecture.
(Figure: the embedding space for the high-dimensional vectors.)
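For a sense of what single-nucleotide modeling looks like in practice, here is a hypothetical sketch of embedding a DNA sequence with a HyenaDNA checkpoint. The checkpoint id is an assumption on my part; consult the HyenaDNA release for the actual published models.

```python
# Hypothetical sketch: embedding a DNA sequence at single-nucleotide
# resolution with a HyenaDNA checkpoint. The checkpoint id below is an
# assumption; consult the HyenaDNA release for the published models.
from transformers import AutoModel, AutoTokenizer

model_id = "LongSafari/hyenadna-tiny-1k-seqlen-hf"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

sequence = "ACTG" * 200  # each nucleotide is its own token (no k-mer chunking)
inputs = tokenizer(sequence, return_tensors="pt")
embeddings = model(**inputs).last_hidden_state  # one vector per nucleotide
print(embeddings.shape)
```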
Large Language Models (LLMs) have seen an impressive wave of advances recently, with models now excelling in a variety of tasks, such as mathematical reasoning and program synthesis. However, their potential to effectively use tools via API calls remains unfulfilled. This is a challenging task even for today's state-of-the-art LLMs such as GPT-4, largely due to their inability to generate accurate input arguments and their tendency to hallucinate the wrong usage of an API call. Berkeley released Gorilla, a finetuned LLaMA-based model that surpasses the performance of GPT-4 on writing API calls. When combined with a document retriever, Gorilla demonstrates a strong capability to adapt to test-time document changes, enabling flexible API updates and version changes. Gorilla also substantially mitigates the issue of hallucination, commonly encountered when prompting LLMs directly. To evaluate the model's ability, they introduce APIBench, a comprehensive dataset consisting of HuggingFace, TorchHub, and TensorHub APIs. The successful integration of the retrieval system with Gorilla demonstrates the potential for LLMs to use tools more accurately, keep up with frequently updated documentation, and consequently increase the reliability and applicability of their outputs. Gorilla models and code are available at https://github.com/ShishirPatil/gorilla.
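As a rough sketch of what prompting Gorilla looks like, here is a minimal example using Hugging Face transformers. The checkpoint id and prompt format are assumptions; see the repo linked above for the released weights and the exact prompt conventions.

```python
# Hypothetical sketch: prompting a Gorilla checkpoint to emit an API call.
# The checkpoint id and prompt format are assumptions; see
# https://github.com/ShishirPatil/gorilla for released models and usage.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gorilla-llm/gorilla-7b-hf-v1"  # assumed HuggingFace checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "I would like to translate English text to French."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # emitted API call
```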
Libraries
Pax is a JAX-based machine learning framework for training large-scale models. Pax allows for advanced, fully configurable experimentation and parallelization, and has demonstrated industry-leading model FLOPs utilization rates.
Zeno Build is a tool for developers who want to quickly build, compare, and iterate on applications using large language models.
It provides:
Simple examples of code to build LLM-based apps. The examples are architecture-agnostic; we don't care if you are using OpenAI, LangChain, or Hugging Face.
Experiment management and hyperparameter optimization code, so you can quickly kick off experiments using a bunch of different settings and compare the results.
Evaluation of LLM outputs, so you can check if your outputs are correct, fluent, factual, interesting, or "good" by whatever definition of good you prefer! Use these insights to compare models and iteratively improve your application with model, data, or prompt engineering.
Lit-GPT provides optimized implementations of LLMs based on nanoGPT, covering the Falcon, StableLM, Pythia, and INCITE language models. It supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, and pre-training.
Simple: Single-file implementation without boilerplate.
Correct: Numerically equivalent to the original model.
Optimized: Runs on consumer hardware or at scale.
Open-source: No strings attached.
LLM-Foundry contains code for training, finetuning, evaluating, and deploying LLMs for inference with Composer and the MosaicML platform. Designed to be easy to use, efficient, and flexible, this codebase enables rapid experimentation with the latest techniques.
LlamaIndex is a “data framework” to help you build LLM apps. It provides the following tools, tied together in the sketch after this list:
Offers data connectors to ingest your existing data sources and data formats (APIs, PDFs, docs, SQL, etc.)
Provides ways to structure your data (indices, graphs) so that this data can be easily used with LLMs.
Provides an advanced retrieval/query interface over your data: Feed in any LLM input prompt, get back retrieved context and knowledge-augmented output.
Allows easy integrations with your outer application framework (e.g. with LangChain, Flask, Docker, ChatGPT, anything else).
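A minimal sketch of the ingest, index, and query flow, assuming the v0.6-era Python API; the "data" directory and question are placeholders, and an OPENAI_API_KEY is assumed in the environment for the default LLM.

```python
# Minimal sketch of LlamaIndex's ingest -> index -> query flow.
# Assumes the v0.6-era API and an OPENAI_API_KEY in the environment
# for the default LLM; "data/" is a placeholder folder of documents.
from llama_index import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()  # data connectors
index = VectorStoreIndex.from_documents(documents)     # structure the data
query_engine = index.as_query_engine()                 # retrieval/query interface
response = query_engine.query("Summarize the key points of these documents.")
print(response)
```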
vLLM is a fast and easy-to-use library for LLM inference and serving.
vLLM is fast with:
State-of-the-art serving throughput
Efficient management of attention key and value memory with PagedAttention
Continuous batching of incoming requests
Optimized CUDA kernels
vLLM is flexible and easy to use with:
Seamless integration with popular HuggingFace models
High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
Tensor parallelism support for distributed inference
Streaming outputs
OpenAI-compatible API server
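As a quick illustration, here is a minimal offline batched-inference sketch with vLLM's Python API; the model name is just an example, and any supported HuggingFace causal LM works.

```python
# Minimal sketch of offline batched inference with vLLM.
# The model name is an example; any supported HuggingFace causal LM works.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = ["The capital of France is", "An LLM serving engine should"]
for output in llm.generate(prompts, sampling):
    print(output.prompt, "->", output.outputs[0].text)
```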
EasyLM makes large language models (LLMs) easy: it is a one-stop solution for pre-training, fine-tuning, evaluating, and serving LLMs in JAX/Flax. EasyLM can scale up LLM training to hundreds of TPU/GPU accelerators by leveraging JAX's pjit functionality.
Building on top of Hugging Face's transformers and datasets, this repo provides an easy-to-use and easy-to-customize codebase for training large language models without the complexity found in many other frameworks.
EasyLM is built with JAX/Flax. By leveraging JAX's pjit utility, EasyLM can train large models that don't fit on a single accelerator by sharding the model weights and training data across multiple accelerators. Currently, EasyLM supports multi-TPU/GPU training on a single host as well as multi-host training on Google Cloud TPU Pods.
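To make the sharding idea concrete, here is a generic JAX pjit sketch, not EasyLM's own API: it splits a weight matrix across devices along a named mesh axis while replicating the inputs. The in_shardings/out_shardings argument names assume a recent JAX version.

```python
# Generic JAX pjit sketch (not EasyLM's own API): shard a weight matrix
# across devices along a named mesh axis, replicating the activations.
# Argument names (in_shardings/out_shardings) assume a recent JAX version.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.pjit import pjit

mesh = Mesh(np.array(jax.devices()), axis_names=("model",))

def forward(w, x):
    return jnp.dot(x, w)

# Split w's output dimension across the "model" axis; replicate x.
sharded_forward = pjit(
    forward,
    in_shardings=(P(None, "model"), P(None, None)),
    out_shardings=P(None, "model"),
)

with mesh:
    w = jnp.ones((512, 2048))
    x = jnp.ones((4, 512))
    y = sharded_forward(w, x)  # each device holds a slice of w and y
    print(y.shape)  # (4, 2048), sharded across devices
```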