Why Streaming Pipelines are Essential for RAG in LLM Applications
The Hands-On LLMs Series
Lesson 2: RAG, Streaming Pipelines, Vector DBs, Text Processing
This is the second of eight lessons in the free **Hands-On LLMs** course.
Completing the Hands-On LLMs course will equip you with the skills to use a three-pipeline architecture and LLMOps best practices to design, build, and deploy a real-time financial advisor that leverages LLMs and vector databases.
Our focus will be on the engineering and MLOps elements. By the end of this series, you will have the knowledge to construct and deploy a functional ML system, rather than merely running isolated code in notebooks (which we won't be using).
Specifically, you will learn to construct the following three components:
- A real-time streaming pipeline (hosted on AWS) that monitors financial news, processes and embeds documents, and stores them in a vector database.
- A fine-tuning pipeline (operating as a serverless continuous training service) that refines an LLM using financial data through QLoRA, tracks experiments with an experiment tracker, and archives the best model in a model registry.
- An inference pipeline built with LangChain (deployed as a serverless RESTful API) that retrieves the fine-tuned LLM from the model registry and responds to financial inquiries using RAG, drawing from the vector database populated with real-time financial news.
We will also demonstrate how to integrate various serverless tools, including:
- Comet ML as your ML platform;
- Qdrant as your vector database;
- Beam as your infrastructure framework.
Curious? Check out the video below for a clearer picture of what you'll learn!
Who is this for?
This series is aimed at Machine Learning Engineers, Data Engineers, Data Scientists, or Software Engineers interested in mastering LLM systems using LLMOps principles. This course is intermediate level, so some familiarity with Python, ML, and cloud technologies is expected. However, you don't need to be an expert to grasp the course content; our goal is to help you become proficient.
How will you learn?
The series consists of eight interactive lessons presented in both video and text formats, with accessible open-source code available on GitHub. To maximize your learning experience, we encourage you to clone the repository and work through the lessons.
Contributors
- Pau Labarta Bajo | Senior ML & MLOps Engineer (Main instructor, featured in video lessons)
- Paul Iusztin | Senior ML & MLOps Engineer (Main contributor, occasional video presence)
- Alexandru Razvant | Senior ML Engineer (Supporting engineer)
Are you ready? Let’s dive in!
Lessons Overview
- Lesson 1 | System Design: The LLMs Kit: Create a Production-Ready Real-Time Financial Advisor System with Streaming Pipelines, RAG, and LLMOps
- Lesson 2 | Feature Pipeline: Why Streaming Pipelines are Preferable to Batch Pipelines in RAG for LLM Applications
- Lesson 3 | Feature Pipeline: Implementing and Deploying a Streaming Pipeline to Populate a Vector DB for Real-Time RAG
- Lesson 4 | Training Pipeline: Five Key Concepts for Your LLM Fine-Tuning Toolkit
- Lesson 5 | Training Pipeline: A Hands-On Guide to Fine-Tuning Any LLM with QLoRA
- Lesson 6 | Training Pipeline: Transitioning from Development to Continuous Training Pipelines Using LLMOps
- Lesson 7 | Inference Pipeline: Designing a RAG LangChain Application Leveraging the Three-Pipeline Architecture
- Lesson 8 | Inference Pipeline: Preparing Your RAG LangChain Application for Production
Check out the code on GitHub and support us!
Lesson 2
In Lesson 1, we introduced the three-pipeline architecture and its application in creating a financial assistant powered by LLMs and vector databases. In this lesson, we will begin by defining RAG.
Next, we will cover what Qdrant, Bytewax, and Unstructured are, and explain why we selected them. Finally, we will elaborate on the necessity of using a streaming pipeline over a batch pipeline to feed your vector database for real-time RAG and outline the components involved in the feature pipeline.
In Lesson 3, we will go through the code step-by-step.
Note: For a comprehensive understanding of this lesson, refer back to Lesson 1, where we discussed the three-pipeline architecture and the overall design of the LLM system.
Table of Contents
- RAG: What Issues Does It Address, and How Is It Integrated into LLM-Powered Applications?
- What is Bytewax?
- Why Are Vector Databases So Popular? What Role Do They Play in Most ML Applications?
- What Is Unstructured?
- Why Choose a Streaming Pipeline Over a Batch Pipeline for RAG in LLM Applications?
- Requirements for Implementing a Streaming Pipeline for a Financial Assistant
Check out the code on GitHub and support us!
#1. RAG: What Issues Does It Address, and How Is It Integrated into LLM-Powered Applications?
Let’s explore!
RAG (Retrieval-Augmented Generation) is a widely adopted approach in LLM development that incorporates external data into your prompts.
Challenges with LLMs:
1. Rapid data changes: LLMs rely on a static knowledge base frozen at training time, which quickly becomes outdated due to the constant influx of new information online.
2. Hallucinations: LLMs are often overconfident, producing outputs that sound plausible but are not always reliable.
3. Lack of source references: without visible sources, it is hard to trust an LLM's response, particularly in critical areas like health and finance.
How RAG addresses these issues:
1. Avoid frequent fine-tuning: with RAG, the LLM acts as the reasoning engine while an external database serves as its main memory, which can be updated instantly.
2. Minimize hallucinations: by constraining the LLM to answer solely from the provided context, it either uses the relevant external data or states "I don't know" when necessary.
3. Enhance reference tracking: RAG makes it easy to attach the source of each retrieved piece of data, so users can verify the information.
Example application: building a financial assistant. You need:
- A data source containing both historical and real-time financial news (e.g., Alpaca)
- A stream processing engine (e.g., Bytewax)
- An encoder-only model for document embeddings (e.g., sentence-transformers)
- A vector database (e.g., Qdrant)
Feature pipeline steps:
1. Use Bytewax to ingest and clean the financial news.
2. Chunk the news documents and embed the chunks.
3. Store the embeddings together with their metadata (e.g., the original text, the source URL) in Qdrant.
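For intuition, here is a minimal, self-contained sketch of steps 2 and 3 (Lesson 3 wires them into a real Bytewax pipeline). The model name, collection name, naive paragraph chunking, and the in-memory Qdrant instance are illustrative assumptions, not the course's exact setup:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams
from sentence_transformers import SentenceTransformer

# Illustrative choices: model, collection name, and in-memory Qdrant are placeholders.
model = SentenceTransformer("all-MiniLM-L6-v2")
client = QdrantClient(":memory:")  # swap for your Qdrant Cloud URL + API key

client.create_collection(
    collection_name="financial_news",
    vectors_config=VectorParams(
        size=model.get_sentence_embedding_dimension(), distance=Distance.COSINE
    ),
)

# A cleaned news document, naively chunked by paragraph for illustration.
document = {"text": "Fed raises rates.\n\nMarkets react sharply.", "url": "https://example.com/article"}
chunks = [chunk for chunk in document["text"].split("\n\n") if chunk.strip()]
embeddings = model.encode(chunks)

# Store each chunk's embedding together with its metadata (text + source URL).
client.upsert(
    collection_name="financial_news",
    points=[
        PointStruct(id=i, vector=vector.tolist(), payload={"text": chunk, "url": document["url"]})
        for i, (chunk, vector) in enumerate(zip(chunks, embeddings))
    ],
)
```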
Inference pipeline steps:
4. Embed the user's question.
5. Retrieve the top K most relevant news documents from Qdrant based on embedding similarity.
6. Inject the necessary metadata from the retrieved documents into the prompt template, alongside the user's question.
7. Pass the complete prompt to the LLM for the final answer.
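To make steps 4 to 7 concrete, here is a hedged sketch of the retrieval side, assuming the same illustrative model and collection as above; the actual inference pipeline is built with LangChain in Lesson 7:

```python
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # must match the model used at indexing time
client = QdrantClient(":memory:")  # point this at the collection populated by the feature pipeline

# Step 4: embed the user's question.
question = "How do rising interest rates affect tech stocks?"
question_embedding = model.encode(question).tolist()

# Step 5: retrieve the top K most similar news chunks.
hits = client.search(collection_name="financial_news", query_vector=question_embedding, limit=3)

# Steps 6-7: inject the retrieved context into the prompt template and call the LLM.
context = "\n".join(hit.payload["text"] for hit in hits)
prompt = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say \"I don't know.\"\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)
# answer = llm(prompt)  # e.g., the fine-tuned LLM loaded from the model registry
```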
#2. What is Bytewax?
Hesitant to write streaming applications, or do they seem too complex? I felt the same until I discovered Bytewax!
Bytewax is a stream processing framework built in Rust for performance, with a Python interface for ease of use, so Python enthusiasts can skip the usual Java headaches.
Here’s why Bytewax is a powerful choice:
- Local setup is quick and straightforward.
- Can be integrated effortlessly into any Python project (including Notebooks).
- Compatible with other Python libraries (NumPy, PyTorch, HuggingFace, etc.).
- Provides connectors for Kafka, local files, or allows you to create your own.
- Includes a CLI tool for deployment on platforms like K8s, AWS, or GCP.
With Bytewax, you can:
1. Define a streaming app in a few lines of code.
2. Run the app with a single command.
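As a taste of point 1, here is a toy dataflow written against Bytewax's operator-style API (0.18+); exact module paths and operator names may differ in your version, and the input source here is a test list rather than the course's real-time news source:

```python
import bytewax.operators as op
from bytewax.connectors.stdio import StdOutSink
from bytewax.dataflow import Dataflow
from bytewax.testing import TestingSource

# Define the streaming app: ingest toy headlines, clean them, print them.
flow = Dataflow("news_cleaning")
headlines = op.input("input", flow, TestingSource(["  Fed raises rates ", " Stocks rally  "]))
cleaned = op.map("clean", headlines, lambda text: text.strip().lower())
op.output("out", cleaned, StdOutSink())
```

For point 2, a flow like this is typically run with a single CLI command, e.g. `python -m bytewax.run your_module:flow`.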
Having previously worked with Kafka Streams in Kotlin, I appreciated the advantages of streaming applications but found the Java ecosystem cumbersome. Bytewax resolves this issue.
We used Bytewax to develop the streaming pipeline for the Hands-On LLMs course and had a fantastic experience.
#3. Why Are Vector Databases So Popular? What Role Do They Play in Most ML Applications?
In machine learning, any data can be represented as an embedding. A vector database offers an intelligent way to index data embeddings and enables rapid, scalable searches among unstructured data points.
In essence, vector databases allow you to find matches between various data types (e.g., using an image to discover similar text or other images).
By employing different deep learning techniques, data points (images, text, audio, etc.) can be represented in the same vector space (embeddings).
You can load embeddings along with payload data (e.g., URLs, creation dates) into the vector database, which indexes the data by:
- the vector
- the payload
- the text within the payload
With this indexing, you can query the vector database using the embedding of any data point. For instance, you can query it with the embedding of a cat image and then use the payload to filter the results to a specific color (e.g., only black cats).
To achieve this, you must use the same embedding model that was applied to the data in the vector database. You can then utilize a distance measure (e.g., cosine distance) to locate similar embeddings.
Each similar embedding carries its payload, containing valuable information like URLs or additional metadata.
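As a small illustration of "similarity search plus payload filtering" (the black-cats idea), here is a hedged sketch against the qdrant-client API; the collection, the `source` payload field, and the model name are assumptions made for the example:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # the same model used when indexing the data
client = QdrantClient(":memory:")

# Similarity search restricted by a (hypothetical) payload field, e.g., only news from one source.
hits = client.search(
    collection_name="financial_news",
    query_vector=model.encode("impact of interest rate hikes").tolist(),
    query_filter=Filter(must=[FieldCondition(key="source", match=MatchValue(value="alpaca"))]),
    limit=5,
)
for hit in hits:
    print(hit.score, hit.payload.get("url"))
```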
This technique was used with Qdrant to implement RAG for a financial assistant powered by LLMs. However, the applications of vector databases extend beyond LLMs and RAG to:
- similar image searches
- semantic text searches
- recommender systems
- RAG for chatbots
- anomaly detection
#4. What is Unstructured?
Any text preprocessing pipeline requires cleaning, segmenting, extracting, or chunking text data before feeding it to your LLMs.
Unstructured provides a comprehensive and user-friendly toolkit that allows you to:
- Segment data from various sources (HTML, CSV, PDFs, images, etc.)
- Clean text of anomalies (e.g., incorrect characters) and irrelevant information (e.g., whitespace)
- Extract specific information from text segments (e.g., dates, addresses)
- Chunk text segments to fit an embedding model's input
- Wrap data for various embedding tools (e.g., OpenAIEmbeddingEncoder)
- Prepare data for use in other applications (e.g., Label Studio)
This functionality is essential for:
- Feeding data into LLMs
- Embedding data and storing it in vector databases
- Implementing RAG
- Labeling tasks
- Recommender systems
Implementing these steps manually can be time-consuming. While some Python packages exist for these tasks, they often have fragmented functionality. Unstructured consolidates everything into a clean API.
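Here is a rough sketch of that API for an HTML news article; the URL, the tokenizer, and the specific cleaning functions are illustrative choices, and Unstructured's module layout can vary between releases:

```python
from transformers import AutoTokenizer
from unstructured.cleaners.core import clean, clean_non_ascii_chars
from unstructured.partition.html import partition_html
from unstructured.staging.huggingface import chunk_by_attention_window

# Partition an HTML page into elements (titles, paragraphs, lists, ...).
elements = partition_html(url="https://example.com/financial-news-article")  # placeholder URL

# Clean each segment of extra whitespace, dashes, and non-ASCII characters.
cleaned_text = " ".join(
    clean_non_ascii_chars(clean(str(element), extra_whitespace=True, dashes=True))
    for element in elements
)

# Chunk the cleaned text so each piece fits the embedding model's context window.
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
chunks = chunk_by_attention_window(cleaned_text, tokenizer)
```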
Explore it!
#5. Why Opt for a Streaming Pipeline Over a Batch Pipeline for RAG in LLM Applications?
The effectiveness of your RAG implementation heavily relies on the quality and timeliness of your data. Thus, you should consider: “How current must my vector database data be for accurate responses?”
For optimal user experience, data should be as fresh as possible—ideally in real-time. For instance, in a financial assistant application, staying updated with the latest financial news is essential, as new information can significantly influence strategy.
Therefore, ensuring that your vector database is continuously synchronized with external data sources is crucial for RAG implementation. While a batch pipeline may suffice if your use case allows for delays (e.g., one hour, one day), tools like Bytewax make it easier to build streaming applications, so why settle for less?
Check out the code on GitHub and support us!
#6. Requirements for Implementing a Streaming Pipeline for a Financial Assistant
- A financial news data source accessible via web socket (e.g., Alpaca)
- A Python stream processing framework such as Bytewax, which provides a Rust-based backend for performance and a user-friendly Python interface.
- A Python package for processing, cleaning, and chunking documents. Unstructured offers extensive features for parsing HTML documents.
- An encoder-only language model for converting chunked documents into embeddings. sentence-transformers integrates seamlessly with HuggingFace and offers a wide range of model options.
- A vector database to store embeddings and their associated metadata (e.g., embedded text, source URL, creation date). Qdrant is an excellent choice, providing a robust feature set.
- A deployment solution for your streaming pipeline. Docker combined with AWS is a reliable option.
- A CI/CD pipeline for continuous testing and deployment. GitHub Actions serves as a strong serverless choice with a rich ecosystem.
This is what you need to build and deploy a streaming pipeline entirely in Python!
Conclusion
Congratulations! You’ve learned about RAG and the fundamentals of developing a streaming pipeline using Bytewax.
In this lesson, we defined RAG and outlined the tools required to create a real-time feature pipeline. Finally, we discussed the importance of opting for a streaming pipeline over a batch pipeline in your LLM applications.
Thank you for reading! I hope you are enjoying the Hands-on LLMs series.
Look out for Lesson 3, where we will delve into the Bytewax streaming pipeline code and how to deploy it to AWS using Docker and a GitHub Actions CI/CD pipeline.
Explore the Hands-on LLMs Course GitHub Repository [2]
If you’re interested in learning MLOps basics and how to design, build, deploy, and monitor an end-to-end ML batch system, consider checking out “The Full Stack 7-Steps MLOps Framework” FREE course, which includes source code and 2.5 hours of reading and video content.
I would like to extend my gratitude to Pau Labarta Bajo and Alexandru Razvant for their contributions to this course.
If you enjoy reading articles like this and wish to support my writing, consider becoming a Medium member. Use my referral link to support me without any additional cost while gaining unlimited access to Medium’s collection of stories.
Thank you!
References
[1] Paul Iusztin, DML: How to implement a streaming pipeline to populate a vector DB for real-time RAG? (2023), Decoding ML Newsletter
[2] Hands-on LLMs Course GitHub Repository (2023), GitHub