Creating a Web-Scraping RAG Agent With Python and Supabase
This guide shows you how to create a retrieval-augmented generation (RAG) agent using Python. The application ingests documents from URLs and queries the database for relevant context before responding to the user.
We’ll build the project using the following:
- Tune Studio and Llama 3
- OpenAI to generate embeddings
- Supabase for data storage
- FastAPI to serve the application
You can find the complete application here.
By the end of the tutorial, you will be able to add documents to the database and query them.
Getting started
You need to install a few Python dependencies and manage them using a virtual environment.
In your project folder, run the following commands to create and then activate a virtual environment:
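On macOS or Linux, these two steps might look like the following (on Windows, the activation script is `venv\Scripts\activate` instead):

```shell
python3 -m venv venv
source venv/bin/activate
```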
In your project folder, create a `requirements.txt` file and paste the following dependencies into it:
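The original dependency list isn't reproduced here; based on the libraries this guide uses, it likely includes at least:

```
fastapi
uvicorn
supabase
openai
python-dotenv
requests
beautifulsoup4
```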
Install the dependencies by running the following command:
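With the virtual environment active, that's:

```shell
pip install -r requirements.txt
```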
Setting up a Supabase project
Create a Supabase project to use as the database for the app.
Create a free Supabase account if you don’t already have one, and follow the instructions to create a new Supabase project using the default settings.
When the project is created, expand the sidebar and navigate to Project Settings. In the Settings menu on the left, select API from the Configuration section. Copy and save the project URL and API key.
Back in your project folder, create a `.env` file and add the following to it:
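The variable names below are assumptions — use whatever names your application code reads:

```
SUPABASE_URL=<your-project-url>
SUPABASE_KEY=<your-api-key>
TUNE_API_KEY=<your-tune-studio-api-key>
OPENAI_API_KEY=<your-openai-api-key>
```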
You can get your Tune Studio API key from here.
Setting up a vector database
Now we’ll create a new table and set up a similarity search function in the Supabase database.
In your Supabase project, navigate to the SQL Editor from the sidebar on the left. Run the following commands in the SQL Editor:
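The original SQL isn't reproduced here. A typical pgvector setup for this pattern — the table name `documents`, the 1536-dimensional embeddings (matching OpenAI's `text-embedding-ada-002` model), and the `match_documents` function are all assumptions — looks like this:

```sql
-- enable the pgvector extension for similarity search
create extension if not exists vector;

-- table to hold scraped text and its embedding
create table documents (
  id bigserial primary key,
  content text,
  url text,
  embedding vector(1536)
);

-- similarity search over stored embeddings, called via RPC from the app
create or replace function match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (id bigint, content text, url text, similarity float)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.url,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;
```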
This script will create a table in the database and initialize the similarity search function. We will use this function when the application queries the database.
Generating OpenAI embeddings
With the environment and database set up, we can write the helper functions we’ll use in the app.
In your project folder, create a new `utils.py` file and add the following imports to it:
We’ll use OpenAI to generate the embeddings. With Tune Studio, you can pair your preferred chat model with OpenAI’s embeddings. In this example, we’ll use a Llama 3 model for chat.
Add the following code below your imports:
This function calls the OpenAI API to generate an embedding for the supplied text.
Next, add the following function to the `utils.py` file:
Here, we call the previous function to generate an embedding and store the result in the Supabase database.
Searching the database for documents
We now define a function that searches the database for documents similar to a given query. Add the following code to the `utils.py` file:
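A sketch of the search helper — the function name, threshold, and match count are assumptions; it calls the `match_documents` function created in the SQL step via Supabase RPC:

```python
def search_documents(query: str, match_threshold: float = 0.7, match_count: int = 5):
    """Return rows whose stored embeddings are similar to the query."""
    # generate_embedding and the supabase client are defined earlier in utils.py
    query_embedding = generate_embedding(query)
    response = supabase.rpc(
        "match_documents",
        {
            "query_embedding": query_embedding,
            "match_threshold": match_threshold,
            "match_count": match_count,
        },
    ).execute()
    return response.data
```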
This function receives a text query and generates the embeddings for it. Using the function we previously defined in Supabase, it searches the database for similar embeddings and returns any matches.
Cleaning text data
Let’s define a function to remove extra newlines and spaces from text data. This function will clean data scraped from websites before adding it to the database.
Add the following code to the `utils.py` file:
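A minimal sketch of such a cleaning function (the exact patterns in the original may differ):

```python
import re

def clean_text(text: str) -> str:
    """Remove extra newlines and whitespace from scraped text."""
    text = re.sub(r"[\r\t\f\v]", " ", text)  # normalize other whitespace characters
    text = re.sub(r"\n+", " ", text)         # collapse runs of newlines
    text = re.sub(r" {2,}", " ", text)       # collapse runs of spaces
    return text.strip()
```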
Here, we use several regular expressions to clean data in various formats.
Extracting data from websites
Now define a function to extract data from a supplied URL.
Add the following code to the `utils.py` file:
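A sketch of the recursive extractor, assuming the function name `extract_website_data` and the parameter names shown; `clean_text` and `add_document` are the helpers defined earlier in `utils.py`, stubbed here so the sketch stands alone:

```python
import re
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def clean_text(text: str) -> str:
    # stub: the real helper defined earlier in utils.py
    return re.sub(r"\s+", " ", text).strip()

def add_document(content: str, url: str) -> None:
    # stub: the real helper embeds the text and stores it in Supabase
    pass

def extract_website_data(url, depth=0, max_depth=2, start_time=None,
                         timeout=60, visited=None):
    """Recursively scrape a page, store embeddings, and follow internal links."""
    if start_time is None:
        start_time = time.time()
    if visited is None:
        visited = set()
    # stop when the recursion limit or the time budget is reached
    if depth > max_depth or time.time() - start_time > timeout or url in visited:
        return visited
    visited.add(url)
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return visited
    soup = BeautifulSoup(response.text, "html.parser")
    # clean the extracted text, then create and store its embedding
    add_document(clean_text(soup.get_text(separator=" ")), url)
    # follow internal links only, up to the specified depth
    base_host = urlparse(url).netloc
    for link in soup.find_all("a", href=True):
        next_url = urljoin(url, link["href"])
        if urlparse(next_url).netloc == base_host:
            extract_website_data(next_url, depth + 1, max_depth,
                                 start_time, timeout, visited)
    return visited
```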
Here’s what this function does:
- First, it runs a series of checks to ensure the maximum recursion level or timeout has not been reached.
- Then, it uses the `requests` library to fetch the webpage and parses the HTML content with `BeautifulSoup`.
- It calls the `clean_text` helper function we defined previously to clean the extracted text.
- It then calls the `generate_embedding` helper function to create and store embeddings for the cleaned text.
- Finally, it follows all the internal links on the page to the specified depth.
Querying Tune Studio
Next we’ll define a function to query the Llama model through Tune Studio and get a response.
Add the following code to the `utils.py` file:
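A sketch of the streaming query helper. The endpoint URL and model identifier below are assumptions based on Tune Studio's OpenAI-compatible chat API, as is the function name `query_llm`:

```python
import json
import os

import requests

TUNE_API_URL = "https://proxy.tune.app/chat/completions"  # assumed endpoint

def query_llm(prompt: str, matches: list):
    """Stream a response from the Llama model, grounded in the matched documents."""
    context = "\n\n".join(match["content"] for match in matches)
    payload = {
        "model": "rohan/Meta-Llama-3-8B-Instruct",  # assumed model identifier
        "stream": True,
        "messages": [
            {
                "role": "system",
                "content": "Answer the question using only the context below.\n\n"
                           f"Context:\n{context}",
            },
            {"role": "user", "content": prompt},
        ],
    }
    headers = {
        "Authorization": os.environ.get("TUNE_API_KEY", ""),
        "Content-Type": "application/json",
    }
    with requests.post(TUNE_API_URL, headers=headers, json=payload,
                       stream=True) as response:
        # server-sent events: each chunk arrives as a "data: {...}" line
        for line in response.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            delta = json.loads(chunk)["choices"][0]["delta"]
            if "content" in delta:
                yield delta["content"]
```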
This function receives the user prompt and a list of matches from the database. It constructs the system message for the LLM, formats the query and context, and passes them to the Tune API. The streamed response is parsed and returned in chunks.
Building a FastAPI server
Let’s set up a FastAPI server to serve the application.
In your project folder, create a new `main.py` file. Add the following code to it:
Here, we import all the necessary modules and the functions we defined in the `utils.py` file. We use `load_dotenv()` to load the environment variables specified in the `.env` file.
Now we’ll define some classes to model the request data, then implement the API endpoints.
Add the following classes to the `main.py` file:
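A sketch of the request models — the class and field names are assumptions:

```python
from pydantic import BaseModel

class DocumentRequest(BaseModel):
    """Body for the document-ingestion endpoint."""
    url: str
    max_depth: int = 2  # optional crawl-depth limit
    timeout: int = 60   # optional scraping time budget, in seconds

class QueryRequest(BaseModel):
    """Body for the query endpoint."""
    query: str
```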
Add the following POST endpoint to the `main.py` file:
This POST endpoint receives a URL and uses the `extract_website_data` function we wrote earlier to extract all data from the URL and generate embeddings. It stores the data in Supabase and returns the number of URLs processed and the time taken to process them.
Let’s add another POST method:
This method receives a search query and uses the functions we wrote earlier to search the database for similar documents. It then queries the LLM and returns the response.
Running the server
We will use `uvicorn` to launch the server. Add the following code to the `main.py` file:
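A typical launch block (host and port are conventional choices, not taken from the original):

```python
import uvicorn

if __name__ == "__main__":
    # serve the app on all interfaces, port 8000
    uvicorn.run(app, host="0.0.0.0", port=8000)
```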
The application is now complete. Run the server with the following command:
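Since `main.py` launches `uvicorn` when run directly, this is simply:

```shell
python main.py
```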
Interacting with the server
We can now add documents and query data from the database.
To add documents, execute the following command:
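The exact route and payload aren't shown in this guide; assuming an ingestion endpoint at `/add-document/` that accepts a JSON body with a `url` field (both assumptions), the request might look like:

```shell
curl -X POST http://localhost:8000/add-document/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```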
Let’s add some more data to the database (you might get a maximum content length warning, but you can ignore that for now):
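For example, with a larger page (the URL here is only an illustration):

```shell
curl -X POST http://localhost:8000/add-document/ \
  -H "Content-Type: application/json" \
  -d '{"url": "https://en.wikipedia.org/wiki/Retrieval-augmented_generation"}'
```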
Now query the documents in the database:
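Assuming a query endpoint at `/query/` that accepts a JSON body with a `query` field (both assumptions):

```shell
curl -X POST http://localhost:8000/query/ \
  -H "Content-Type: application/json" \
  -d '{"query": "What is retrieval-augmented generation?"}'
```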