Creating a Web Scraping RAG Agent With Python and Supabase
This guide will show you how to create a retrieval-augmented generation (RAG) agent using Python. The application ingests documents from URLs, and when a user asks a question, it queries the database for relevant context before responding.
We’ll build the project using the following:
- Tune Studio and Llama 3
- OpenAI to generate embeddings
- Supabase for data storage
- FastAPI to serve the application
You can find the complete application here.
By the end of the tutorial, you will be able to add documents to the database and query them.
Getting started
You need to install a few Python dependencies and manage them using a virtual environment.
In your project folder, run the following command to create a virtual environment:
python -m venv venv
Activate the virtual environment with the following command:
source venv/bin/activate
In your project folder, create a requirements.txt file and paste the following dependencies into it:
fastapi
uvicorn
beautifulsoup4
requests
python-dotenv
supabase
openai
gpt3-tokenizer
Install the dependencies by running the following command:
pip install -r requirements.txt
Setting up a Supabase project
Create a Supabase project to use as the database for the app.
Create a free Supabase account if you don’t already have one, and follow the instructions to create a new Supabase project using the default settings.
When the project is created, expand the sidebar and navigate to Project Settings. In the Settings menu on the left, select API from the Configuration section. Copy and save the project URL and API key.
Back in your project folder, create a .env file and add the following to it:
TUNEAI_API_KEY=<your-tune-api-key>
SUPABASE_URL=<your-supabase-project-url>
SUPABASE_KEY=<your-supabase-project-key>
OPENAI_API_KEY=<your-openai-api-key>
You can get your Tune Studio API key from here.
Setting up a vector database
Now we’ll create a new table and set up a similarity search function in the Supabase database.
In your Supabase project, navigate to the SQL Editor from the sidebar on the left. Run the following commands in the SQL Editor:
create extension vector;

create table documents (
  id bigserial primary key,
  content text,
  embedding vector(1536),
  url text,
  title text
);

create or replace function match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  url text,
  title text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.url,
    documents.title,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where documents.embedding <=> query_embedding < 1 - match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;
This script enables the pgvector extension, creates a documents table, and defines the match_documents similarity search function that the application will call when querying the database.
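For intuition, the pgvector <=> operator computes cosine distance, and the function converts it to a similarity score. A rough NumPy equivalent of that score is sketched below (illustrative only; NumPy is not one of the app's dependencies):

import numpy as np

def cosine_similarity(a, b):
    # pgvector's <=> operator returns cosine distance; similarity = 1 - distance
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# match_documents keeps rows whose similarity exceeds match_threshold,
# ordered from most to least similar.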
Generating OpenAI embeddings
With the environment and database set up, we can write the helper functions we’ll use in the app.
In your project folder, create a new utils.py file and add the following imports to it:
import json
import os
import re
import gpt3_tokenizer
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import time
from openai import OpenAI
from supabase import create_client, Client
from typing import List
We’ll use OpenAI to generate the embeddings. With Tune Studio, you can pair OpenAI embeddings with your preferred chat model; in this example, we’ll use a Llama 3 model.
Add the following code below your imports:
def get_embedding(query, model="text-embedding-ada-002"):
    client = OpenAI()
    query = query.replace("\n", " ")
    embedding = client.embeddings.create(input=[query], model=model).data[0].embedding
    return embedding
This function calls the OpenAI API to generate an embedding for supplied text.
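As an optional sanity check (assuming OPENAI_API_KEY is available in your environment or loaded from .env), you can confirm the embedding size matches the vector(1536) column we created earlier:

# Hypothetical sanity check, not part of the final app
vector = get_embedding("What is Tune Studio?")
print(len(vector))  # text-embedding-ada-002 produces 1536-dimensional vectors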
Next, add the following function to the utils.py file:
def generate_embedding(text, target_url, title):
    url: str = os.environ.get("SUPABASE_URL")
    key: str = os.environ.get("SUPABASE_KEY")
    supabase: Client = create_client(url, key)
    text = text.replace("\n", " ")
    embedding = get_embedding(text)
    supabase.table('documents').insert({
        "content": text,
        "embedding": embedding,
        "url": target_url,
        "title": title
    }).execute()
Here, we call the previous function to generate an embedding and store the result in the Supabase database.
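As a rough illustration, the function can be called directly to insert a single document (the values below are placeholders):

# Illustrative call with placeholder values; requires the Supabase and OpenAI keys to be set
generate_embedding(
    "Example documentation text goes here.",
    "https://example.com/docs",
    "Example Docs Page",
)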
Searching the database for documents
We now define a function to search the database for documents similar to a given query. Add the following code to the utils.py file:
def search_documents(query, model="text-embedding-ada-002"):
    url: str = os.environ.get("SUPABASE_URL")
    key: str = os.environ.get("SUPABASE_KEY")
    supabase: Client = create_client(url, key)
    embedding = get_embedding(query, model)
    matches = supabase.rpc('match_documents', {
        "query_embedding": embedding,
        "match_threshold": 0.7,
        "match_count": 6
    }).execute()
    return matches
This function receives a text query and generates the embeddings for it. Using the function we previously defined in Supabase, it searches the database for similar embeddings and returns any matches.
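The returned rows mirror the columns defined by the match_documents SQL function, so (with the environment variables loaded) you could inspect the results like this illustrative sketch:

# Illustrative: print the similarity score and source of each match
results = search_documents("Pricing of Tune Studio")
for row in results.data:
    print(round(row["similarity"], 3), row["title"], row["url"])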
Cleaning text data
Let’s define a function to remove extra newlines and spaces from text data. This function will clean data scraped from websites before adding it to the database.
Add the following code to the utils.py file:
def clean_text(text):
    # Remove extra newlines and spaces
    cleaned_text = re.sub(r'\n+', '\n', text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    # Remove wiki-specific text such as headings, links, categories, and special characters
    cleaned_text = re.sub(r'\[.*?\]', ' ', cleaned_text)  # Remove text within square brackets
    cleaned_text = re.sub(r'\{.*?\}', ' ', cleaned_text)  # Remove text within curly braces
    cleaned_text = re.sub(r'\(.*?\)', ' ', cleaned_text)  # Remove text within parentheses
    cleaned_text = re.sub(r'==.*?==', ' ', cleaned_text)  # Remove text within double equals
    # Remove special characters
    cleaned_text = re.sub(r'[\|•\t]', ' ', cleaned_text)
    return cleaned_text.strip()
Here, we use a series of regular expressions to strip wiki-style markup, special characters, and extra whitespace from the scraped text.
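For example, headings, citation references, and parentheticals are removed (a small illustrative test with a made-up sample string):

# Illustrative input containing wiki-style markup
sample = "== History ==\nTune Studio [1] (an example product)\tlaunched."
print(clean_text(sample))
# Prints the sentence with the heading, citation reference, and parenthetical removed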
Extracting data from websites
Now define a function to extract data from a supplied URL.
Add the following code to the utils.py file:
def extract_website_data(url, start_time=0, level=0, max_level=3, visited_urls=None, host=None):
    if visited_urls is None:
        visited_urls = set()
    if time.time() - start_time > 90:
        return []
    if host is None:
        host = urlparse(url).netloc
    if level > max_level or urlparse(url).netloc != host:
        return []
    if url in visited_urls:
        return []
    else:
        visited_urls.add(url)
    try:
        response = requests.get(url, timeout=20, headers={"User-Agent": "Mozilla/5.0"})
        data = []
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            base_url = urlparse(url)._replace(path='', query='', fragment='').geturl()
            cleaned_text_data = clean_text(soup.get_text().strip())
            current_data = {"url": url, "text": cleaned_text_data}
            page_title = soup.title.string
            generate_embedding(cleaned_text_data, url, page_title)
            data.append(current_data)
            all_links = soup.find_all("a", href=True)
            for link in all_links:
                href = link.get("href")
                if href:
                    full_url = urljoin(base_url, href)
                    cleaned_url = urlparse(full_url)._replace(fragment='').geturl()
                    if cleaned_url not in visited_urls:
                        data.extend(extract_website_data(cleaned_url, start_time, level + 1, max_level, visited_urls, host))
            return data
        else:
            return []
    except requests.RequestException as e:
        print("Request to", url, "failed:", str(e))
        return []
    except Exception as e:
        print("An error occurred while processing", url, ":", str(e))
        return []
Here’s what this function does:
- First, it runs a series of checks to make sure the scrape hasn’t timed out, the maximum recursion depth hasn’t been exceeded, the URL belongs to the same host, and the URL hasn’t already been visited.
- Then, it uses the requests library to fetch the webpage and parses the HTML content with BeautifulSoup.
- It calls the clean_text helper function we defined previously to clean the extracted text.
- It then calls the generate_embedding helper function to create and store embeddings for the cleaned text.
- Finally, it follows all the internal links on the page to the specified depth (see the usage sketch after this list).
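Here’s a minimal sketch of calling the scraper directly; the /add_documents endpoint we build later does essentially the same thing:

# Illustrative standalone call; requires the environment variables from .env to be loaded
import time
pages = extract_website_data("https://tunehq.ai", start_time=time.time())
print(len(pages), "pages scraped and embedded")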
Querying Tune Studio
Next, we’ll define a function that queries the Llama model through Tune Studio and streams back the response.
Add the following code to the utils.py file:
async def get_response_tunestudio(prompt: str, matches: List[dict]):
    max_context_tokens = 1600
    context = ""
    for match in matches:
        if gpt3_tokenizer.count_tokens(match['content'] + context) < max_context_tokens:
            context = context + match['url'] + ":\n" + match['content'] + "\n"
    system = "You are a very enthusiastic TuneAi representative, your goal is to assist people effectively! Using the provided sections from the documentation, craft your answers in markdown format. If the documentation doesn't clearly state the answer, or you are uncertain, please respond with 'Apologies, but I'm unable to provide assistance with that.', do not mention documentation keywords in the response.\n\n"
    url = "https://proxy.tune.app/chat/completions"
    headers = {
        "Authorization": os.environ.get("TUNEAI_API_KEY"),
        "Content-Type": "application/json",
    }
    data = {
        "temperature": 0.2,
        "messages": [{
            "role": "system",
            "content": system + context
        }, {
            "role": "user",
            "content": prompt
        }],
        "model": "rohan/Meta-Llama-3-8B-Instruct",
        "stream": True,
        "max_tokens": 300
    }
    with requests.post(url, headers=headers, json=data, stream=True) as response:
        for line in response.iter_lines():
            decoded_chunk = line.decode().replace("data: ", "")
            if decoded_chunk and decoded_chunk != "[DONE]":
                json_chunk = json.loads(decoded_chunk)
                yield json_chunk["choices"][0]["delta"].get("content", "")
This function receives the user prompt and a list of matches from the database. It builds the system message from the matched documents while keeping the context under a token budget, sends the request to the Tune Studio API, and yields the streamed response chunk by chunk.
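Because it’s an async generator, FastAPI’s StreamingResponse can consume it directly. A standalone test, using a hypothetical hard-coded match, might look like this:

# Hypothetical standalone test of the streaming generator; requires TUNEAI_API_KEY to be set
import asyncio

async def demo():
    fake_matches = [{"url": "https://example.com/docs", "content": "Example documentation text."}]
    async for chunk in get_response_tunestudio("What does the documentation say?", fake_matches):
        print(chunk, end="", flush=True)

asyncio.run(demo())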
Building a FastAPI server
Let’s set up a FastAPI server to serve the application.
In your project folder, create a new main.py file. Add the following code to it:
import time
from fastapi import FastAPI, Request
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from fastapi.responses import StreamingResponse
# file we wrote
from utils import extract_website_data, get_response_tunestudio, search_documents
load_dotenv() # Loads environment variables from .env file
app = FastAPI()
Here, we import the necessary modules and the functions we defined in the utils.py file. We call load_dotenv() to load the environment variables specified in the .env file.
Now we’ll define Pydantic models for the request bodies and implement the API endpoints.
Add the following classes to the main.py file:
class DocumentBody(BaseModel):
    url: str = Field(..., title="URL of the website to extract data from")

class SearchBody(BaseModel):
    query: str
Add the following POST endpoint to the main.py file:
@app.post("/add_documents")
def read_root(document: DocumentBody):
url = document.url
print("URL:", url)
start_time = time.time()
urls = extract_website_data(url, start_time)
return {"urls_processed": len(urls), "time_taken": time.time() - start_time}
This POST endpoint receives a URL and uses the extract_website_data function we wrote earlier to extract all data from the URL and generate embeddings. It stores the data in Supabase and returns the number of URLs processed and the time taken to process them.
Let’s add another POST method:
@app.post("/prompt")
def resolve_prompt(prompt: SearchBody):
prompt = prompt.query
search = search_documents(prompt)
return StreamingResponse(get_response_tunestudio(prompt, search.data), media_type="text/event-stream")
This method receives a search query and uses the functions we wrote earlier to search the database for similar documents. It then queries the LLM and returns the response.
Running the server
We will use uvicorn to launch the server. Add the following code to the main.py file:
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
The application is now complete. Run the server with the following command:
python main.py
Interacting with the server
We can now add documents and query data from the database.
To add documents, execute the following command:
curl -X POST http://localhost:8000/add_documents \
-H "Content-Type: application/json" \
-d '{"url":"https://tunehq.ai"}'
Let’s add some more data to the database (you might get a maximum content length warning, but you can ignore that for now):
for link in "https://news.ycombinator.com/" "https://example.com/"; do
curl -X POST http://localhost:8000/add_documents \
-H "Content-Type: application/json" \
-d "{\"url\":\"$link\"}"
done
Now query the documents in the database:
curl -X POST http://localhost:8000/prompt \
-H "Content-Type: application/json" \
-d '{"query":"Pricing of Tune Studio"}'