How to build an infrastructure from scratch to train a 70B Model?
How to Monitor Generative AI Applications? Red-teaming with DSPy
Articles
The Ada Lovelace Institute wrote a very long blog post on Generative AI (GenAI) applications and their deployment and monitoring. It walks through various options and needs across deployment and monitoring themes for GenAI applications. Some of the solutions and vendors they recommend can be used for non-GenAI applications as well, though the focus of the article is GenAI and derivatives of GenAI applications such as RAG (Retrieval-Augmented Generation).
As GenAI applications become increasingly prevalent across industries, the need for robust monitoring and observability has grown exponentially. These applications, powered by complex data models and algorithms, face unique challenges that set them apart from traditional software systems. GenAI apps process vast amounts of data, produce diverse outputs, and often operate under strict performance requirements. The quality, performance, and efficiency of these applications directly impact user experience and operational costs, making a structured approach to monitoring and telemetry not just beneficial, but essential.
Observability provides real-time insights into an application's health, performance, and functionality. For GenAI, this translates to monitoring model accuracy, understanding user interactions, optimizing costs, and more. Telemetry serves as the foundation for this monitoring, supplying the raw data necessary for comprehensive analysis. This data encompasses everything from logs and traces to specific metrics tailored to AI applications.
Monitoring GenAI applications is crucial for several reasons:
Performance Optimization: GenAI models can be computationally intensive. Monitoring helps identify bottlenecks, allowing for targeted optimizations that improve response times and resource utilization.
Cost Management: AI operations, especially those involving large language models, can be expensive. Monitoring usage patterns and resource consumption helps in optimizing costs without compromising on quality.
Quality Assurance: Continuous monitoring ensures that the AI's outputs maintain a high standard of quality and relevance, which is crucial for user satisfaction and trust.
Error Detection and Debugging: Prompt identification of errors or unexpected behaviors allows for quick resolution, minimizing downtime and maintaining system reliability.
User Experience Enhancement: By monitoring user interactions and AI responses, teams can gain insights into user behavior and preferences, leading to improvements in the AI's functionality and user interface.
Telemetry forms the backbone of effective monitoring for GenAI applications. It involves collecting and transmitting data from remote sources to receiving stations for analysis. In the context of AI applications, telemetry captures key operational data to monitor and improve system performance and user experience. The three pillars of observability in telemetry are:
Logs: These are timestamped records of discrete events within the system. For GenAI, logs can capture information such as user input, model responses, and any errors or exceptions that arise.
Traces: Traces provide a detailed view of a request's journey through the system. They are particularly useful for understanding the flow of data and identifying performance bottlenecks in complex AI pipelines.
Metrics: These are quantitative measurements collected at regular intervals. For AI applications, metrics might include response times, token usage, or model confidence scores.
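To make the three pillars concrete, here is a minimal sketch using the OpenTelemetry Python API together with standard logging. The span name, metric names, and the call_model helper are my own illustration, not something prescribed by the post.

```python
import logging
from opentelemetry import trace, metrics

# Standard logging covers the "logs" pillar; OpenTelemetry covers traces and metrics.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("genai-app")

tracer = trace.get_tracer("genai-app")
meter = metrics.get_meter("genai-app")

# Metric instruments: names and units are illustrative.
response_time = meter.create_histogram("genai.response_time", unit="ms")
token_usage = meter.create_counter("genai.tokens_used")

def call_model(prompt: str):
    """Hypothetical stand-in for the real model call: returns (answer, tokens, latency_ms)."""
    return "stub answer", 0, 0.0

def handle_query(user_query: str) -> str:
    # Trace the full request so bottlenecks in the pipeline are visible.
    with tracer.start_as_current_span("handle_query") as span:
        span.set_attribute("query.length", len(user_query))
        answer, tokens, latency_ms = call_model(user_query)
        response_time.record(latency_ms, {"model": "my-model"})
        token_usage.add(tokens, {"model": "my-model"})
        logger.info("query answered", extra={"tokens": tokens, "latency_ms": latency_ms})
        return answer
```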
Logging also plays a crucial role in understanding interactions, system behavior, and the overall health of GenAI applications. Here are some of the logs the post recommends collecting:
User Interactions: Log user queries, AI responses, and any feedback or corrections provided by users. This data is invaluable for improving the AI's performance and understanding user needs.
Model Performance: Record metrics such as response times, token usage, and confidence scores for each interaction. This helps in identifying performance issues and optimizing the model.
Error Logs: Capture all errors or anomalies for diagnostic purposes. This includes both system-level errors and AI-specific issues like hallucinations or off-topic responses.
Context Switches: Monitor how often the AI system switches contexts within a session. High context-switching might indicate issues in the AI's understanding of user intent or its ability to maintain a coherent conversation.
User Session Data: Log session durations, abandonment rates, and reasons for abandonment. Analyzing this data can provide actionable insights for improvements in user experience. To manage log volume, it's recommended to use sampling rates or adjustable log levels for informational logs, while ensuring all critical errors are always captured.
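As a rough illustration of the above, interaction logging with sampling could look something like this; the field names, JSON-lines format, and sampling rate are assumptions on my part, not recommendations from the post.

```python
import json
import logging
import random
import time

logger = logging.getLogger("genai.interactions")
INFO_SAMPLE_RATE = 0.1  # sample informational logs; always keep errors

def log_interaction(session_id: str, user_query: str, ai_response: str,
                    latency_ms: float, tokens: int, error: str | None = None) -> None:
    record = {
        "ts": time.time(),
        "session_id": session_id,
        "user_query": user_query,
        "ai_response": ai_response,
        "latency_ms": latency_ms,
        "tokens": tokens,
    }
    if error:
        record["error"] = error
        logger.error(json.dumps(record))   # critical errors are always captured
    elif random.random() < INFO_SAMPLE_RATE:
        logger.info(json.dumps(record))    # informational logs are sampled
```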
Retrieval-Augmented Generation (RAG) and embedding systems are crucial components of some of the most commonly used GenAI applications. Monitoring these systems requires telemetry tailored to the specific product or app:
Embedding Generation Time: Monitor the time taken to generate embeddings for both user queries and knowledge base documents. This metric helps in optimizing the retrieval process.
Embedding Similarity Scores: Track the similarity scores between query embeddings and retrieved document embeddings. This provides insights into the relevance of retrieved information.
Retrieval Latency: Measure the time taken to retrieve relevant documents or information from the knowledge base. This is crucial for maintaining low response times.
Cache Hit Rates: If caching is implemented, monitor cache hit rates to optimize the balance between fresh retrievals and cached results.
Document Relevance Feedback: Collect feedback on the relevance of retrieved documents, either through explicit user feedback or implicit measures like user engagement with responses.
Embedding Drift: Periodically analyze embeddings to detect any drift in the semantic space, which could indicate a need for retraining or updating the embedding model.
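A minimal sketch of capturing some of these RAG metrics around a retrieval call; the embedder/store interfaces and metric names are hypothetical stand-ins, not from the post.

```python
import time
import numpy as np

def retrieve_with_telemetry(query: str, embedder, store, cache: dict, metrics: dict):
    """Wrap a RAG retrieval step with the telemetry described above.
    `embedder`, `store`, and the metric names are hypothetical stand-ins."""
    t0 = time.perf_counter()
    q_emb = embedder.embed(query)
    metrics["embedding_generation_ms"] = (time.perf_counter() - t0) * 1000

    if query in cache:                      # track cache hit rate
        metrics["cache_hit"] = True
        return cache[query]
    metrics["cache_hit"] = False

    t1 = time.perf_counter()
    docs, doc_embs = store.search(q_emb, k=5)
    metrics["retrieval_latency_ms"] = (time.perf_counter() - t1) * 1000

    # Cosine similarity between the query and each retrieved document embedding.
    sims = doc_embs @ q_emb / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q_emb))
    metrics["similarity_scores"] = sims.tolist()

    cache[query] = docs
    return docs
```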
The post also points to concrete tooling for collecting and analyzing this telemetry:
OpenTelemetry: An open-source observability framework that provides a unified way to generate, collect, and export telemetry data. It supports multiple languages, integrates with various backends and monitoring tools, and offers automatic instrumentation for many popular frameworks and libraries.
Azure Monitor: A comprehensive monitoring solution for collecting, analyzing, and acting on telemetry from cloud and on-premises environments. It includes features like Application Insights for application performance management and Log Analytics for log data analysis, and provides powerful querying capabilities and visualization tools for telemetry data.
Data Analysis and Insights
Collecting telemetry data is just the first step. The real value comes from analyzing this data to extract actionable insights:
Performance: Identify bottlenecks in the AI pipeline by analyzing response times and resource utilization, and use tracing data to understand the flow of requests and pinpoint areas for optimization.
User behavior: Analyze user interaction logs to understand common query patterns and user preferences, and use this information to improve the AI model and enhance the user interface.
Errors: Investigate error logs to identify recurring issues or patterns, and use this information to improve error handling and enhance system reliability.
Cost: Analyze resource usage metrics to identify opportunities for cost reduction; consider strategies like caching frequently requested information or optimizing model size.
Quality: Monitor metrics related to output quality, such as relevance scores or user feedback, and use this data to continuously improve the AI model and ensure high-quality responses.
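As a small illustration of turning collected telemetry into insights, here is a hedged sketch that summarizes latency, errors, and token usage from a JSON-lines telemetry export; the file name and column names are assumptions that match the logging sketch earlier, not anything from the post.

```python
import pandas as pd

# Hypothetical export of the interaction logs collected above.
df = pd.read_json("telemetry.jsonl", lines=True)

summary = {
    "p50_latency_ms": df["latency_ms"].quantile(0.50),
    "p95_latency_ms": df["latency_ms"].quantile(0.95),  # bottleneck indicator
    "error_rate": df["error"].notna().mean(),           # reliability
    "avg_tokens": df["tokens"].mean(),                   # cost proxy
}
print(summary)
```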
As GenAI applications become more sophisticated, advanced monitoring techniques are necessary to manage rollouts and ensure that new features provide net business value; one of the most common approaches in the industry is A/B testing:
Implement A/B testing to compare different versions of the AI model or different retrieval strategies.
Use telemetry data to measure the impact of changes on key performance indicators.
Implement machine learning-based anomaly detection algorithms to identify unusual patterns in telemetry data. This can help in early detection of issues before they impact users (a simple statistical sketch follows this list).
Use historical telemetry data to predict potential issues or performance degradation.
Implement proactive measures to prevent problems before they occur.
Apply NLP techniques to analyze unstructured log data and extract meaningful insights.
This can help in identifying trends or issues that might not be apparent through traditional analysis methods.
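For the anomaly-detection point above, a simple statistical baseline (a rolling z-score over a latency series) is sketched below; the post suggests ML-based detectors, so treat this only as a starting point.

```python
import pandas as pd

def flag_anomalies(latencies: pd.Series, window: int = 100, z_threshold: float = 3.0) -> pd.Series:
    """Flag points whose rolling z-score exceeds the threshold.
    A deliberately simple baseline, not the ML-based approach the post describes."""
    mean = latencies.rolling(window, min_periods=window).mean()
    std = latencies.rolling(window, min_periods=window).std()
    z = (latencies - mean) / std
    return z.abs() > z_threshold
```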
Here are some of the ways you can feed these insights back into the system:
Establish a process for regularly reviewing telemetry data and derived insights.
Use these reviews to prioritize improvements and optimizations.
Use insights from telemetry data to inform the model retraining process.
This might include focusing on areas where the model consistently underperforms or incorporating new patterns identified in user queries.
Implement mechanisms to collect and analyze user feedback.
Use this feedback, in conjunction with telemetry data, to drive improvements in both the AI model and the overall application.
Maintain comprehensive documentation of monitoring practices and insights.
Foster a culture of knowledge sharing within the team to ensure all members can effectively use and contribute to the monitoring process.
Haize Labs wrote an article on how to do red-teaming through DSPy. The article goes into detail on how to implement this with the DSPy framework, which is a library for structuring and optimizing LLM systems. It separates program flow into modules and uses optimizers to tune prompts and weights based on specific metrics. To define a successful attack, they use an LLM judge to evaluate the target model's response. The judge determines to what extent the response matches the original harmful intent, providing a continuous value between 0 and 1 for optimization purposes. The article describes a language program for red-teaming, consisting of alternating Attack and Refine modules, all implemented in the DSPy framework:
Attack Module: Creates an adversarial prompt to elicit the harmful intent from the target model.
Refine Module: Critiques and improves the attack prompt based on the target model's response.
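Roughly, these two modules could be expressed in DSPy as follows. This is a sketch based on the description above; the signatures, field names, and the target_model callable are illustrative, not Haize Labs' actual implementation.

```python
import dspy

class Attack(dspy.Signature):
    """Create an adversarial attack_prompt that elicits the harmful_intent from the target model."""
    harmful_intent = dspy.InputField()
    critique = dspy.InputField()
    attack_prompt = dspy.OutputField()

class Refine(dspy.Signature):
    """Critique the attack_prompt given the target model's response, and suggest improvements."""
    harmful_intent = dspy.InputField()
    attack_prompt = dspy.InputField()
    target_response = dspy.InputField()
    critique = dspy.OutputField()

class AttackProgram(dspy.Module):
    def __init__(self, target_model, layers: int = 5):
        super().__init__()
        self.target_model = target_model  # callable: attack prompt -> model response (hypothetical)
        self.attacks = [dspy.Predict(Attack) for _ in range(layers)]
        self.refines = [dspy.Predict(Refine) for _ in range(layers)]

    def forward(self, harmful_intent: str) -> str:
        critique = ""
        for attack, refine in zip(self.attacks, self.refines):
            attempt = attack(harmful_intent=harmful_intent, critique=critique)
            response = self.target_model(attempt.attack_prompt)
            critique = refine(
                harmful_intent=harmful_intent,
                attack_prompt=attempt.attack_prompt,
                target_response=response,
            ).critique
        return attempt.attack_prompt
```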
The authors use MIPRO (Multi-prompt Instruction Proposal Optimizer) from DSPy to optimize their attack. MIPRO uses a Bayesian Optimizer to search for optimal instructions and few-shot examples for each module.
In their experiment, they red-team the Vicuna-7b-v1.5 model using harmful behaviors from the AdvBench subset, with GPT-4 as the judge. Their program contains 5 layers of Attack-Refine pairs, and they use greedy decoding to avoid overestimating attack effectiveness. The results show the power of DSPy and of building these attacks in an automated way:
Raw input (no optimization): 10% Attack Success Rate (ASR)
5-layer architecture (unoptimized): 26% ASR
5-layer architecture with DSPy optimization: 44% ASR
The article emphasizes that this 44% ASR, a 4x improvement over the baseline, was achieved without hand-crafted prompt engineering or extensive hyperparameter tuning, purely through DSPy's automated way of building these attacks. They consider this result exciting given the minimal effort required. The article concludes by explaining that deeper language programs in DSPy, like deeper neural networks, can be more effective. Information propagates through the program via the attack_prompt and critique outputs from the Attack and Refine modules, respectively. The specific function of each layer is learned through the DSPy optimizer rather than being explicitly specified.
Imbue wrote up a rather interesting article on how to build GPU infrastructure to train a 70B model from scratch. I will outline the configuration and setup they went with, along with some of the learnings they mention in the post.
Hardware Configuration
The cluster consisted of 4,092 H100 GPUs across 511 computers, with 8 GPUs per machine. So, close to 2^12 GPUs, short by only 2^2!
They used a fully non-blocking InfiniBand network topology with a three-tier architecture for high-speed GPU communication.
Each GPU was connected to a ConnectX-7 card capable of 400 Gbps transmission and reception.
No surprises in the HW configuration; I thought the number of machines and total number of GPUs could have been higher for such a post, but I digress.
Networking Setup
InfiniBand was used for training communication, while Ethernet was used for data transfer, datasets, and checkpoints.
They considered but didn't implement RDMA over Converged Ethernet (RoCE) due to additional complexity.
A local file system was created to mirror cloud storage, reducing bottlenecks from shared Ethernet connections.
The networking was relatively simple, and they went with InfiniBand to handle the communication between GPUs.
Operations Software
They adopted Kraken, Uber's open-source tool for peer-to-peer transfer of Docker images; more details are in the Libraries section below.
Performance monitoring tools like Torch profiler and NVIDIA's Nsight Systems were set up.
They developed custom scripts for health checks, stress tests, and network diagnostics.
Nothing out of the ordinary here either.
Learnings
InfiniBand Setup
Initial wiring issues: Discovered and corrected a misdesigned top-level fabric that created eight disjointed networks.
Temperature alerts: Resolved by addressing hot air recirculation in networking racks.
Port errors and flapping: Cleaned and reseated alerting ports, disabled faulty transceivers.
Burn-in testing: Developed a specialized workload to stress-test the entire InfiniBand fabric.
Mostly mechanical issues, but the learnings are good to have, especially on the temperature alerts and wiring issues: these problems are hard to detect, and discovering them after the fact makes them much more painful to correct.
Machine Management
Implemented a strategy to grow a set of reliable "golden" machines.
Developed tools to partition and test subsets of machines to identify faulty hosts.
Error Handling and Debugging
Created scripts to parse UFM event logs, disable problematic network components, and file maintenance tickets.
Improved NCCL library logging to better identify problematic hosts during crashes.
Developed tools to detect slow training batches and dump stack traces for diagnosis.
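The post describes improving NCCL's own logging; as a lighter-weight, related knob, NCCL's standard debug logging can be turned on via environment variables before initializing the process group. This is a generic sketch, not Imbue's tooling.

```python
import os

# Verbose NCCL logs, written to per-host/per-process files so a crash or hang
# can be attributed to a specific machine (%h = hostname, %p = pid).
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_FILE", "/var/log/nccl/debug.%h.%p.log")

import torch.distributed as dist

# Assumes a torchrun-style launch that sets RANK, WORLD_SIZE, and MASTER_ADDR.
dist.init_process_group(backend="nccl")
```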
Best Practices
They also turn some of these learnings into best practices. For example, they found that machine failures are common, especially for large training jobs, so they recommend keeping 10-20% more machines than necessary to recover jobs after some machines fail.
Infrastructure Management
Maintain extra capacity: Keep 10-20% more machines than necessary for easy relaunching after failures.
Automate health checks: Develop comprehensive scripts to ensure host health.
Implement robust monitoring: Set up thorough metrics collection for network and node health.
They have some debugging strategies as well, but these are general debugging best practices, not specific to the large GPU cluster.
Debugging Strategies
Isolate variables: Change only one thing at a time when troubleshooting.
Verify claims: Double-check results, especially from external tools or new team members.
Reproducibility: Ensure consistent environments, configurations, and parameters for debugging.
Performance Optimization
Address garbage collection: Synchronous distributed training can be slowed by single-worker GC.
Monitor GPU metrics: Track "clock throttle reasons" to identify heat or power supply issues.
Optimize InfiniBand: Be aware of topology-aware routing and potential asymmetric link speeds.
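For the "clock throttle reasons" point above, here is a hedged sketch of polling them via nvidia-smi from Python; the query fields are standard nvidia-smi fields, but alerting thresholds and wiring into a monitoring system are left out.

```python
import subprocess

# Query per-GPU temperature and clock throttle reasons; any "Active" thermal or
# power-cap entry points at cooling or power-supply problems.
fields = ",".join([
    "index",
    "temperature.gpu",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_power_cap",
])
out = subprocess.check_output(
    ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader"], text=True
)
for line in out.strip().splitlines():
    index, temp, thermal, power_cap = [x.strip() for x in line.split(",")]
    if thermal != "Not Active" or power_cap != "Not Active":
        print(f"GPU {index}: temp={temp}C thermal={thermal} power_cap={power_cap}")
```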
Technical Insights
A lot of GPU-related issues surfaced as software errors, but under the hood they were hardware related. They needed to investigate many issues that fall into this bucket and mitigate them through hardware repairs and by swapping out the failed nodes.
Error Types and Debugging
GPU-specific errors (Xid, SXid, ECC): Often hardware-related, requiring machine disabling and repair.
Hanging without stacktrace: Difficult to debug, often related to NCCL operations or hardware issues.
Performance degradation: Could be caused by various factors including GC, heat issues, or network problems.
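Xid events land in the kernel log; below is a best-effort sketch of scanning dmesg for them. The exact message format varies by driver version, so the regex is an assumption.

```python
import re
import subprocess

# NVIDIA Xid events are reported by the kernel driver, e.g.
#   "NVRM: Xid (PCI:0000:3b:00): 79, ... GPU has fallen off the bus."
# Best-effort pattern; exact formatting varies by driver version.
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-f:.]+)\): (\d+)", re.IGNORECASE)

# Reading dmesg may require elevated privileges on some systems.
dmesg = subprocess.check_output(["dmesg", "--ctime"], text=True, errors="replace")
for match in XID_RE.finditer(dmesg):
    pci_addr, xid_code = match.groups()
    print(f"GPU at {pci_addr} reported Xid {xid_code}")
```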
Performance Monitoring
Utilized tools like Torch profiler and NVIDIA's Nsight Systems for detailed performance analysis.
Developed custom tools to detect slow training batches and identify causes.
Network Optimization
Implemented a local distributed Docker registry using Kraken for efficient image distribution.
Created a specialized InfiniBand burn-in workload to identify and resolve network issues.
Libraries
OmniParse is a platform that ingests and parses any unstructured data into structured, actionable data optimized for GenAI (LLM) applications. Whether you are working with documents, tables, images, videos, audio files, or web pages, OmniParse prepares your data to be clean, structured, and ready for AI applications such as RAG, fine-tuning, and more.
Kraken is a P2P-powered Docker registry that focuses on scalability and availability. It is designed for Docker image management, replication, and distribution in a hybrid cloud environment. With pluggable backend support, Kraken can easily integrate into existing Docker registry setups as the distribution layer.
Kraken has been in production at Uber since early 2018. In their busiest cluster, Kraken distributes more than 1 million blobs per day, including 100k 1G+ blobs. At its peak production load, Kraken distributes 20K 100MB-1G blobs in under 30 sec.
Open-Sora is an initiative dedicated to efficiently producing high-quality video, making the model, tools, and all details accessible to all. By embracing open-source principles, Open-Sora not only democratizes access to advanced video generation techniques, but also offers a streamlined and user-friendly platform that simplifies the complexities of video generation. With Open-Sora, the goal is to foster innovation, creativity, and inclusivity within the field of content creation.
4M is a framework for training "any-to-any" foundation models, using tokenization and masking to scale to many diverse modalities. Models trained using 4M can perform a wide range of vision tasks, transfer well to unseen tasks and modalities, and are flexible and steerable multimodal generative models.
OneTwo is a Python library designed to simplify interactions with large (language and multimodal) foundation models, primarily aimed at researchers in prompting and prompting strategies.
Foundation Models are increasingly being used in complex scenarios with multiple back-and-forth interactions between the model and some traditional code (possibly generated by the model itself). This leads to the emergence of programs that combine ML models with traditional code. The goal of the OneTwo library is to enable the creation and execution of such programs. It is designed for researchers (and developers) who want to explore how to get the best out of foundational models in a situation where it is not necessarily possible to change their weights (i.e. perform fine-tuning).
Some properties of OneTwo that are particularly impactful for researcher productivity include the following:
Model-agnostic: Provides a uniform API to access different models that can easily be swapped and compared.
Flexible: Supports implementation of arbitrarily complex computation graphs involving combinations of sequential and parallel operations, including interleaving of calls to foundation models and to other tools.
Efficient: Automatically optimizes request batching and other details of model server interactions under-the-hood for maximizing throughput, while allowing prompting strategies to be implemented straightforwardly, as if they were dealing with just single requests.
Reproducible: Automatically caches requests/replies for easy stop-and-go or replay of experiments.
Rigging is a lightweight LLM framework built on Pydantic XML. The goal is to make leveraging language models in production code as simple and effective as possible. Here are the highlights:
Structured Pydantic models can be used interchangeably with unstructured text output.
LiteLLM as the default generator giving you instant access to a huge array of models.
Define prompts as python functions with type hints and docstrings.
Simple tool calling abilities for models which don't natively support it.
Store different models and configs as simple connection strings just like databases.
Chat templating, forking, continuations, generation parameter overloads, stripping segments, etc.
Async batching and fast iterations for large scale generation.
Metadata, callbacks, and data format conversions.
Modern python with type hints, async support, pydantic validation, serialization, etc.
Awesome-local-ai is, self-explanatorily, an awesome repository of local AI tools. It lists a number of models and tools that you can run on your local machine, and most of them come with a Docker container as well!