Representation Engineering for Control Vectors
6 Free LLM Courses, Croissant for ML Metadata for Datasets
Articles
Vgel wrote a blog post on representation engineering, focusing on control vectors in LLMs. If you are interested in AI safety and in customizing an already trained LLM, the post goes over a couple of different ways of doing so.
A control vector is a vector (technically a list of vectors, one per layer) that you apply to the model's activations during inference to control its behavior without additional prompting. This is conceptually very powerful, as it customizes the model's behavior at inference time without changing the training much.
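Conceptually, the intervention is just a per-layer vector addition to the activations. A minimal numpy sketch on a toy stand-in model (the layer structure, names, and shapes here are illustrative, not any real model's internals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: each "layer" is a linear map plus tanh.
n_layers, d_model = 4, 8
weights = [rng.standard_normal((d_model, d_model)) * 0.3 for _ in range(n_layers)]

# A control vector is one direction per layer, added to that layer's output.
control = [rng.standard_normal(d_model) for _ in range(n_layers)]

def forward(x, strength=0.0):
    """Run the toy model, nudging each layer's output along the control direction."""
    h = x
    for W, v in zip(weights, control):
        h = np.tanh(h @ W)
        h = h + strength * v  # the control-vector intervention
    return h

x = rng.standard_normal(d_model)
baseline = forward(x, strength=0.0)   # no steering
steered = forward(x, strength=2.0)    # steered along the control directions
```

Negative strengths push the model in the opposite direction, which is why the examples below sweep from -2.2 to 2.2.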
The post shows how to create a control vector through PCA (Principal Component Analysis). The code is also available on GitHub.
```python
trippy_dataset = make_dataset(
    "Act as if you're extremely {persona}.",
    ["high on psychedelic drugs"],
    ["sober from psychedelic drugs"],
    truncated_output_suffixes,
)

# train the vector—takes less than a minute!
trippy_vector = ControlVector.train(model, tokenizer, trippy_dataset)

# set the control strength and let inference rip!
for strength in (-2.2, 1, 2.2):
    print(f"strength={strength}")
    model.set_control(trippy_vector, strength)
    out = model.generate(
        **tokenizer(
            "[INST] Give me a one-sentence pitch for a TV show. [/INST]",
            return_tensors="pt",
        ),
        do_sample=False,
        max_new_tokens=128,
        repetition_penalty=1.1,
    )
    print(tokenizer.decode(out.squeeze()).strip())
    print()
```
Training a control vector this way is very easy and requires no changes to the model's training.
You can change the model's responses in the following way, shown here with a "lazy" control vector:
==baseline
You can reverse a list in Python using the built-in `reverse()` method or slicing. Here's an example of how to do it using slicing: [...]
++lazy
You can use the `reverse` method to reverse a list in Python. Here's how you can do it: [...]
--lazy
You can reverse a list in Python by using the `reverse` method of the list, or by using slicing to create a new list with the elements in reverse order. Here is an example of both methods: [...]
`--lazy` gives both options, `++lazy` gives only one option, and the baseline mentions there are two options but only gives one.
Of course, everything a control vector does can also be achieved through prompt engineering; one can consider a control vector an “addition” to the prompt that is provided by the user.
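Under the hood, the PCA step works on activation differences between contrasting prompts. A rough numpy sketch with planted toy data (this illustrates the idea only, not the repeng library's exact procedure):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_pairs = 8, 64

# Toy stand-ins for hidden states collected from contrasting prompts
# ("act extremely X" vs. "act extremely not-X"); in practice these come
# from the model's activations at one layer.
true_direction = rng.standard_normal(d_model)
pos = rng.standard_normal((n_pairs, d_model)) + true_direction
neg = rng.standard_normal((n_pairs, d_model)) - true_direction

# Pairwise differences isolate the behavioral direction from everything
# the two prompt sets have in common.
diffs = pos - neg

# First right-singular vector of the (uncentered) differences gives the
# control vector for this layer.
_, _, vt = np.linalg.svd(diffs, full_matrices=False)
control_vector = vt[0]

# Up to sign, it recovers the planted direction.
cosine = abs(control_vector @ true_direction) / np.linalg.norm(true_direction)
```

Repeating this per layer yields the "list of vectors, one per layer" described above.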
Machine learning (ML) models are like powerful tools, and datasets are the raw materials they use to learn and make predictions. Just like a carpenter needs high-quality wood to build a strong house, an ML model needs high-quality data to produce accurate results. However, unlike wood, which can often be visually inspected for quality, the quality of a machine learning dataset is not always readily apparent. This is where metadata comes in.
Metadata is essentially data about data. In the context of machine learning, dataset metadata provides information about the data itself, such as its format, size, and what it represents. It can also include information about how the data was collected and any transformations that have been applied to it. Having high-quality metadata is essential for several reasons.
First, metadata can help users understand the data and determine if it is suitable for their needs. For instance, a researcher looking for a dataset to train an image classification model would need a dataset that contains images and labels. Without metadata, it would be difficult to determine if a particular dataset meets these criteria.
Second, metadata can help improve the quality and reproducibility of machine learning research. By documenting how the data was collected and processed, researchers can ensure that their results can be replicated by others. This is essential for scientific progress.
Third, metadata can facilitate the development of tools for working with machine learning datasets. Tools that can clean, preprocess, and analyze data can significantly improve the efficiency of the machine learning workflow. However, these tools often rely on metadata to function correctly.
Google wrote a blog post about Croissant which addresses this problem by proposing a standard way to describe and organize ML datasets. This metadata format includes information about the data itself, such as its schema, data types, and statistics. It also includes information about how the data can be used for machine learning, such as the target variable and the intended task.
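For a feel of what such metadata looks like, here is a hand-written sketch in the general shape of Croissant's JSON-LD (the property names and namespaces here are illustrative, from memory; consult the Croissant specification for the authoritative schema):

```python
import json

# An illustrative sketch of a Croissant-style dataset description:
# schema.org vocabulary plus a croissant namespace, describing the files
# (distribution) and the record structure (recordSet with typed fields).
metadata = {
    "@context": {"@vocab": "https://schema.org/", "cr": "http://mlcommons.org/croissant/"},
    "@type": "Dataset",
    "name": "toy-image-classification",
    "description": "A small labeled image dataset (illustrative example).",
    "distribution": [
        {"@type": "cr:FileObject", "name": "images.zip", "encodingFormat": "application/zip"}
    ],
    "recordSet": [
        {
            "@type": "cr:RecordSet",
            "name": "examples",
            "field": [
                {"name": "image", "dataType": "ImageObject"},
                {"name": "label", "dataType": "Text"},
            ],
        }
    ],
}

croissant_json = json.dumps(metadata, indent=2)
```

A search engine or ML framework can read such a description to answer questions like "does this dataset have image inputs and text labels?" without downloading the data itself.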
Main Advantages of Croissant
Croissant offers several advantages over other approaches for describing and organizing ML datasets:
Easier to find relevant datasets: Search engines and dataset repositories can leverage the rich metadata in Croissant to make it easier for users to find datasets that meet their specific needs. For instance, a researcher looking for a dataset to train an image classification model can search for datasets that include the Croissant metadata property indicating the target variable is an image label.
Easier to develop tools for working with datasets: The data resources and organization information within Croissant can be used by developers to create tools that can more easily clean, refine, and analyze data. For example, a tool designed to preprocess text data could look for the Croissant metadata property specifying the data format is text and automatically apply the appropriate cleaning steps.
Easier to use datasets in ML frameworks: Croissant allows ML frameworks to interpret the metadata and use the data to train and test models with minimal code required from the user. This can significantly reduce the time and effort required to develop machine learning models.
Reduces the data development burden: By making it easier to find, use, and develop tools for working with datasets, Croissant reduces the time and effort required to develop ML models. This can free up valuable resources for researchers and data scientists to focus on the core tasks of model development and experimentation.
The code is available on GitHub as part of MLCommons, and there is also a working group if you want to be part of the Croissant community.
Libraries
Infinity ♾️
Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of sentence-transformers models and frameworks. Infinity is developed under the MIT license and supported by Gradient.ai.
Why Infinity? It provides the following features:
Deploy virtually any SentenceTransformer - deploy the model you know from SentenceTransformers
Fast inference backends: The inference server is built on top of torch, fastembed (onnx-cpu), and CTranslate2, using FlashAttention to get the most out of your CUDA, CPU, or MPS hardware.
Dynamic batching: New embedding requests are queued while the GPU is busy with the previous ones, then squeezed into your GPU/CPU as soon as it is ready, giving similar maximum throughput on GPU to text-embeddings-inference.
Correct and tested implementation: Unit and end-to-end tested. Embeddings via infinity are identical to SentenceTransformers (up to numerical precision). Lets API users create embeddings till infinity and beyond.
Easy to use: The API is built on top of FastAPI, and Swagger makes it fully documented. The API is aligned to OpenAI's Embedding specs. See below on how to get started.
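The dynamic batching idea above can be sketched in a few lines. This toy class only illustrates the concept (queue while busy, drain as one batch), not Infinity's actual implementation:

```python
from collections import deque

class DynamicBatcher:
    """Toy dynamic batcher: queue requests, then embed them in one batch.

    Real servers do this asynchronously with size and latency limits;
    this just shows the core idea of amortizing forward passes.
    """

    def __init__(self, max_batch_size=32):
        self.max_batch_size = max_batch_size
        self.queue = deque()

    def submit(self, text):
        # Requests arriving while the device is busy simply accumulate here.
        self.queue.append(text)

    def step(self, embed_fn):
        """Drain up to max_batch_size queued requests through one forward pass."""
        if not self.queue:
            return []
        n = min(self.max_batch_size, len(self.queue))
        batch = [self.queue.popleft() for _ in range(n)]
        return embed_fn(batch)  # one call for the whole batch

batcher = DynamicBatcher(max_batch_size=4)
for i in range(6):
    batcher.submit(f"sentence {i}")

fake_embed = lambda batch: [[float(len(t))] for t in batch]  # stand-in for the model
first = batcher.step(fake_embed)   # processes 4 queued requests together
second = batcher.step(fake_embed)  # processes the remaining 2
```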
ROScribe
Using a natural language interface to describe robotic projects, ROScribe eliminates the skill barrier of using ROS for beginners, and saves time and hassle for skilled engineers. ROScribe combines the sheer power and flexibility of large language models (LLMs) with prompt tuning techniques to capture the details of your robotic design and to automatically create an entire ROS package for your project.
Inspired by GPT Synthesizer, ROScribe builds an entire ROS package through a series of specification steps that identify the package elements in a top-down approach. In particular, ROScribe helps you with the following steps:
Creating a list of ROS nodes and topics, based on your application and deployment (e.g. simulation vs. real-world)
Visualizing your project in an RQT-style graph
Generating code for each ROS node
Writing launch file and installation scripts
Streamdal
Streamdal is an open-source 'Code Native Data Pipeline' solution for running data tasks directly in your application code.
Think of it as a "workflow engine" or a "pre/post data processor" that is executed client-side via WebAssembly in your application code.
According to the project, it is at least 10x faster, 10x cheaper, and 10x easier to operate than traditional data pipelines.
Benefits
There are major benefits to running pipelines directly within your app:
Eliminates the need for a separate data pipeline infrastructure
Pipelines execute from within your app, using existing compute that your app is already using
Eliminates the need for a separate data pipeline team
No more waiting for the data pipeline team to make pipeline changes
Is ridiculously fast
Streamdal uses Wasm to execute pipelines at near-native speeds
Is actually real-time
Not "near real-time" or "max-30-seconds-real-time" - but actually real-time - data is processed as soon as your app reads or writes data
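The "code-native pipeline" idea, transform steps running in-process at write time, can be sketched like this (a pure-Python illustration of the concept; Streamdal itself executes such steps client-side as Wasm modules):

```python
# Toy "code-native" pipeline: transform steps run inline in the app,
# right where data is written, instead of in separate pipeline infrastructure.

def mask_email(record):
    """Redact PII before the record ever leaves the application."""
    if "email" in record:
        record["email"] = "***"
    return record

def drop_empty(record):
    """Filter out records with no payload; returning None drops the record."""
    return record if record.get("payload") else None

PIPELINE = [mask_email, drop_empty]

def write(record, sink):
    """Apply every pipeline step inline, then persist the surviving record."""
    for step in PIPELINE:
        record = step(record)
        if record is None:
            return  # record filtered out mid-pipeline
    sink.append(record)

sink = []
write({"email": "a@b.com", "payload": "ok"}, sink)  # masked, then kept
write({"email": "c@d.com", "payload": ""}, sink)    # dropped by drop_empty
```

Because the steps execute in the same process as the read/write call, the data is transformed the moment it moves, which is what makes the "actually real-time" claim possible.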
PHATGOOSE
PHATGOOSE, which stands for Post-Hoc Adaptive Gating Over an Ocean of Specialized Experts, enables zero-shot generalization from specialized experts (e.g. PEFT modules) trained on diverse datasets by adaptively routing among them. It requires only an additional, inexpensive training step: a gate in front of each frozen PEFT module for its corresponding task.
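The routing idea, a cheap learned gate in front of each frozen expert deciding which expert handles an input, can be sketched as follows (a toy with random weights, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, n_experts = 16, 3

# Frozen "experts": stand-ins for PEFT modules trained on different tasks.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]

# One inexpensive gate vector per expert, trained post hoc with its expert frozen.
gates = [rng.standard_normal(d_model) for _ in range(n_experts)]

def route(x):
    """Score the input against every gate and dispatch to the top-scoring expert."""
    scores = np.array([g @ x for g in gates])
    best = int(np.argmax(scores))
    return best, x @ experts[best]

x = rng.standard_normal(d_model)
chosen, y = route(x)  # expert index and that expert's output
```

The point of the post-hoc setup is that each gate is trained independently per task, so new experts can be added to the pool without retraining the others.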
OpenLLM
OpenLLM is an open-source platform designed to facilitate the deployment and operation of large language models (LLMs) in real-world applications. With OpenLLM, you can run inference on any open-source LLM, deploy models on the cloud or on-premises, and build powerful AI applications.
Key features include:
🚂 State-of-the-art LLMs: Integrated support for a wide range of open-source LLMs and model runtimes, including but not limited to Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder.
🔥 Flexible APIs: Serve LLMs over a RESTful API or gRPC with a single command. You can interact with the model using a Web UI, CLI, Python/JavaScript clients, or any HTTP client of your choice.
⛓️ Freedom to build: First-class support for LangChain, BentoML, LlamaIndex, OpenAI endpoints, and Hugging Face, allowing you to easily create your own AI applications by composing LLMs with other models and services.
🎯 Streamline deployment: Automatically generate your LLM server Docker images or deploy as serverless endpoints via ☁️ BentoCloud, which effortlessly manages GPU resources, scales according to traffic, and ensures cost-effectiveness.
🤖️ Bring your own LLM: Fine-tune any LLM to suit your needs. You can load LoRA layers to fine-tune models for higher accuracy and performance for specific tasks. A unified fine-tuning API for models (`LLM.tuning()`) is coming soon.
⚡ Quantization: Run inference with less computational and memory costs with quantization techniques such as LLM.int8, SpQR (int4), AWQ, GPTQ, and SqueezeLLM.
📡 Streaming: Support token streaming through server-sent events (SSE). You can use the `/v1/generate_stream` endpoint for streaming responses from LLMs.
🔄 Continuous batching: Support continuous batching via vLLM for increased total throughput.
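As a taste of the streaming interface, here is a minimal SSE parser of the kind a client for an endpoint like `/v1/generate_stream` would use. The payload field names below are made up for illustration; the real wire format may differ:

```python
import json

def iter_sse_tokens(lines):
    """Parse server-sent events ('data: ...' lines) into token payloads.

    A sketch of the client side of SSE streaming; a real client would read
    these lines from the chunked HTTP response of a streaming endpoint.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # conventional end-of-stream sentinel
        yield json.loads(payload)["token"]

# Simulated wire format with a hypothetical "token" field.
stream = [
    'data: {"token": "Hello"}',
    "",
    'data: {"token": ", world"}',
    "data: [DONE]",
]
tokens = list(iter_sse_tokens(stream))  # -> ["Hello", ", world"]
```

Because tokens arrive as soon as they are generated, the client can render partial output immediately instead of waiting for the full completion.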
LLM Courses
Google has an introductory class that is about an hour long.
Cohere has an LLM University with a good amount of material covering a variety of LLM topics, like text generation, text representation, and LLM training.
HuggingFace has a course as well, for NLP. It does not cover LLMs per se, but it covers many fundamental NLP techniques that are useful for LLMs and LLM model building.
Databricks has a course through edX that covers LLMs and their training at an introductory level.
Full Stack Deep Learning has also made its course materials from last spring available. Andrej Karpathy endorsed this course material as very high quality.
LangChain also has a course, targeted more at embedding databases and retrieval techniques.