Creating a Web Scraping RAG Agent With Python and Supabase
This guide will show you how to create a retrieval-augmented generation (RAG) agent using Python. The application ingests documents from URLs, and when a user asks a question, it queries the database for relevant context before responding.
We’ll build the project using the following:
- Tune Studio and Llama 3
- OpenAI to generate embeddings
- Supabase for data storage
- FastAPI to serve the application
You can find the complete application here.
By the end of the tutorial, you will be able to add documents to the database and query them.
Getting started
You need to install a few Python dependencies and manage them using a virtual environment.
In your project folder, run the following command to create a virtual environment:
python -m venv venv
Activate the virtual environment with the following command:
source venv/bin/activate
In your project folder, create a requirements.txt file and paste the following dependencies into it:
fastapi
uvicorn
beautifulsoup4
requests
python-dotenv
supabase
openai
gpt3-tokenizer
Install the dependencies by running the following command:
pip install -r requirements.txt
Setting up a Supabase project
Create a Supabase project to use as the database for the app.
Create a free Supabase account if you don’t already have one, and follow the instructions to create a new Supabase project using the default settings.
When the project is created, expand the sidebar and navigate to Project Settings. In the Settings menu on the left, select API from the Configuration section. Copy and save the project URL and API key.
Back in your project folder, create a .env file and add the following to it:
TUNEAI_API_KEY=<your-tune-api-key>
SUPABASE_URL=<your-supabase-project-url>
SUPABASE_KEY=<your-supabase-project-key>
OPENAI_API_KEY=<your-openai-api-key>
You can get your Tune Studio API key from here.
Setting up a vector database
Now we’ll create a new table and set up a similarity search function in the Supabase database.
In your Supabase project, navigate to the SQL Editor from the sidebar on the left. Run the following commands in the SQL Editor:
create extension vector;

create table documents (
  id bigserial primary key,
  content text,
  embedding vector(1536),
  url text,
  title text
);

create or replace function match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  url text,
  title text,
  similarity float
)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.url,
    documents.title,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where documents.embedding <=> query_embedding < 1 - match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;
This script enables the pgvector extension, creates a documents table, and defines the match_documents similarity search function that the application will call when querying the database.
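For intuition, the pgvector <=> operator computes cosine distance, and the function converts it to a similarity score. A rough NumPy equivalent of that score is sketched below (illustrative only; NumPy is not one of the app's dependencies):

import numpy as np

def cosine_similarity(a, b):
    # pgvector's <=> operator returns cosine distance; similarity = 1 - distance
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# match_documents keeps rows whose similarity exceeds match_threshold,
# ordered from most to least similar.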
Generating OpenAI embeddings
With the environment and database set up, we can write the helper functions we’ll use in the app.
In your project folder, create a new utils.py file and add the following imports to it:
import json
import os
import re
import gpt3_tokenizer
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import time
from openai import OpenAI
from supabase import create_client, Client
from typing import List
We’ll use OpenAI to generate the embeddings. With Tune Studio, you can pair OpenAI embeddings with your preferred chat model; in this example, we’ll use a Llama 3 model.
Add the following code below your imports:
def get_embedding(query, model="text-embedding-ada-002"):
    client = OpenAI()
    query = query.replace("\n", " ")
    embedding = client.embeddings.create(input=[query], model=model).data[0].embedding
    return embedding
This function calls the OpenAI API to generate an embedding for supplied text.
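As an optional sanity check (assuming OPENAI_API_KEY is available in your environment or loaded from .env), you can confirm the embedding size matches the vector(1536) column we created earlier:

# Hypothetical sanity check, not part of the final app
vector = get_embedding("What is Tune Studio?")
print(len(vector))  # text-embedding-ada-002 produces 1536-dimensional vectors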
Next, add the following function to the utils.py file:
def generate_embedding(text, target_url, title):
    url: str = os.environ.get("SUPABASE_URL")
    key: str = os.environ.get("SUPABASE_KEY")
    supabase: Client = create_client(url, key)
    text = text.replace("\n", " ")
    embedding = get_embedding(text)
    supabase.table('documents').insert({
        "content": text,
        "embedding": embedding,
        "url": target_url,
        "title": title
    }).execute()
Here, we call the previous function to generate an embedding and store the result in the Supabase database.
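As a rough illustration, the function can be called directly to insert a single document (the values below are placeholders):

# Illustrative call with placeholder values; requires the Supabase and OpenAI keys to be set
generate_embedding(
    "Example documentation text goes here.",
    "https://example.com/docs",
    "Example Docs Page",
)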
Searching the database for documents
We now define a function to search the database for documents similar to a given query. Add the following code to the utils.py file:
def search_documents(query, model="text-embedding-ada-002"):
    url: str = os.environ.get("SUPABASE_URL")
    key: str = os.environ.get("SUPABASE_KEY")
    supabase: Client = create_client(url, key)
    embedding = get_embedding(query, model)
    matches = supabase.rpc('match_documents', {
        "query_embedding": embedding,
        "match_threshold": 0.7,
        "match_count": 6
    }).execute()
    return matches
This function receives a text query and generates the embeddings for it. Using the function we previously defined in Supabase, it searches the database for similar embeddings and returns any matches.
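The returned rows mirror the columns defined by the match_documents SQL function, so (with the environment variables loaded) you could inspect the results like this illustrative sketch:

# Illustrative: print the similarity score and source of each match
results = search_documents("Pricing of Tune Studio")
for row in results.data:
    print(round(row["similarity"], 3), row["title"], row["url"])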
Cleaning text data
Let’s define a function to remove extra newlines and spaces from text data. This function will clean data scraped from websites before adding it to the database.
Add the following code to the utils.py file:
def clean_text(text):
    # Remove extra newlines and spaces
    cleaned_text = re.sub(r'\n+', '\n', text)
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    # Remove wiki-specific text such as headings, links, categories, and special characters
    cleaned_text = re.sub(r'\[.*?\]', ' ', cleaned_text)  # Remove text within square brackets
    cleaned_text = re.sub(r'\{.*?\}', ' ', cleaned_text)  # Remove text within curly braces
    cleaned_text = re.sub(r'\(.*?\)', ' ', cleaned_text)  # Remove text within parentheses
    cleaned_text = re.sub(r'==.*?==', ' ', cleaned_text)  # Remove text within double equals
    # Remove special characters
    cleaned_text = re.sub(r'[\|•\t]', ' ', cleaned_text)
    return cleaned_text.strip()
Here, we use a series of regular expressions to strip wiki-style markup, special characters, and extra whitespace from the scraped text.
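For example, headings, citation references, and parentheticals are removed (a small illustrative test with a made-up sample string):

# Illustrative input containing wiki-style markup
sample = "== History ==\nTune Studio [1] (an example product)\tlaunched."
print(clean_text(sample))
# Prints the sentence with the heading, citation reference, and parenthetical removed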
Extracting data from websites
Now define a function to extract data from a supplied URL.
Add the following code to the utils.py file:
def extract_website_data(url, start_time=0, level=0, max_level=3, visited_urls=None, host=None):
    if visited_urls is None:
        visited_urls = set()
    if time.time() - start_time > 90:
        return []
    if host is None:
        host = urlparse(url).netloc
    if level > max_level or urlparse(url).netloc != host:
        return []
    if url in visited_urls:
        return []
    else:
        visited_urls.add(url)
    try:
        response = requests.get(url, timeout=20, headers={"User-Agent": "Mozilla/5.0"})
        data = []
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            base_url = urlparse(url)._replace(path='', query='', fragment='').geturl()
            cleaned_text_data = clean_text(soup.get_text().strip())
            current_data = {"url": url, "text": cleaned_text_data}
            page_title = soup.title.string
            generate_embedding(cleaned_text_data, url, page_title)
            data.append(current_data)
            all_links = soup.find_all("a", href=True)
            for link in all_links:
                href = link.get("href")
                if href:
                    full_url = urljoin(base_url, href)
                    cleaned_url = urlparse(full_url)._replace(fragment='').geturl()
                    if cleaned_url not in visited_urls:
                        data.extend(extract_website_data(cleaned_url, start_time, level + 1, max_level, visited_urls, host))
            return data
        else:
            return []
    except requests.RequestException as e:
        print("Request to", url, "failed:", str(e))
        return []
    except Exception as e:
        print("An error occurred while processing", url, ":", str(e))
        return []
Here’s what this function does:
- First, it runs a series of checks to make sure the scrape hasn’t timed out, the maximum recursion depth hasn’t been exceeded, the URL belongs to the same host, and the URL hasn’t already been visited.
- Then, it uses the requests library to fetch the webpage and parses the HTML content with BeautifulSoup.
- It calls the clean_text helper function we defined previously to clean the extracted text.
- It then calls the generate_embedding helper function to create and store embeddings for the cleaned text.
- Finally, it follows all the internal links on the page to the specified depth (see the usage sketch after this list).
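Here’s a minimal sketch of calling the scraper directly; the /add_documents endpoint we build later does essentially the same thing:

# Illustrative standalone call; requires the environment variables from .env to be loaded
import time
pages = extract_website_data("https://tunehq.ai", start_time=time.time())
print(len(pages), "pages scraped and embedded")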
Querying Tune Studio
Next, we’ll define a function that queries the Llama model through Tune Studio and streams back the response.
Add the following code to the utils.py file:
async def get_response_tunestudio(prompt: str, matches: List[dict]):
    max_context_tokens = 1600
    context = ""
    for match in matches:
        if gpt3_tokenizer.count_tokens(match['content'] + context) < max_context_tokens:
            context = context + match['url'] + ":\n" + match['content'] + "\n"
    system = "You are a very enthusiastic TuneAi representative, your goal is to assist people effectively! Using the provided sections from the documentation, craft your answers in markdown format. If the documentation doesn't clearly state the answer, or you are uncertain, please respond with 'Apologies, but I'm unable to provide assistance with that.', do not mention documentation keywords in the response.\n\n"
    url = "https://proxy.tune.app/chat/completions"
    headers = {
        "Authorization": os.environ.get("TUNEAI_API_KEY"),
        "Content-Type": "application/json",
    }
    data = {
        "temperature": 0.2,
        "messages": [{
            "role": "system",
            "content": system + context
        }, {
            "role": "user",
            "content": prompt
        }],
        "model": "rohan/Meta-Llama-3-8B-Instruct",
        "stream": True,
        "max_tokens": 300
    }
    with requests.post(url, headers=headers, json=data, stream=True) as response:
        for line in response.iter_lines():
            decoded_chunk = line.decode().replace("data: ", "")
            if decoded_chunk and decoded_chunk != "[DONE]":
                json_chunk = json.loads(decoded_chunk)
                yield json_chunk["choices"][0]["delta"].get("content", "")
This function receives the user prompt and a list of matches from the database. It builds the system message from the matched documents while keeping the context under a token budget, sends the request to the Tune Studio API, and yields the streamed response chunk by chunk.
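Because it’s an async generator, FastAPI’s StreamingResponse can consume it directly. A standalone test, using a hypothetical hard-coded match, might look like this:

# Hypothetical standalone test of the streaming generator; requires TUNEAI_API_KEY to be set
import asyncio

async def demo():
    fake_matches = [{"url": "https://example.com/docs", "content": "Example documentation text."}]
    async for chunk in get_response_tunestudio("What does the documentation say?", fake_matches):
        print(chunk, end="", flush=True)

asyncio.run(demo())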
Building a FastAPI server
Let’s set up a FastAPI server to serve the application.
In your project folder, create a new main.py file. Add the following code to it:
import time
from fastapi import FastAPI, Request
from dotenv import load_dotenv
from pydantic import BaseModel, Field
from fastapi.responses import StreamingResponse
# file we wrote
from utils import extract_website_data, get_response_tunestudio, search_documents
load_dotenv() # Loads environment variables from .env file
app = FastAPI()
Here, we import the necessary modules and the functions we defined in the utils.py file. We call load_dotenv() to load the environment variables specified in the .env file.
Now we’ll define Pydantic models for the request bodies and implement the API endpoints.
Add the following classes to the main.py file:
class DocumentBody(BaseModel):
    url: str = Field(..., title="URL of the website to extract data from")

class SearchBody(BaseModel):
    query: str
Add the following POST endpoint to the main.py file:
@app.post("/add_documents")
def read_root(document: DocumentBody):
url = document.url
print("URL:", url)
start_time = time.time()
urls = extract_website_data(url, start_time)
return {"urls_processed": len(urls), "time_taken": time.time() - start_time}
This POST endpoint receives a URL and uses the extract_website_data function we wrote earlier to extract all data from the URL and generate embeddings. It stores the data in Supabase and returns the number of URLs processed and the time taken to process them.
Let’s add another POST method:
@app.post("/prompt")
def resolve_prompt(prompt: SearchBody):
prompt = prompt.query
search = search_documents(prompt)
return StreamingResponse(get_response_tunestudio(prompt, search.data), media_type="text/event-stream")
This method receives a search query and uses the functions we wrote earlier to search the database for similar documents. It then queries the LLM and returns the response.
Running the server
We will use uvicorn to launch the server. Add the following code to the main.py file:
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
The application is now complete. Run the server with the following command:
python main.py
Interacting with the server
We can now add documents and query data from the database.
To add documents, execute the following command:
curl -X POST http://localhost:8000/add_documents \
-H "Content-Type: application/json" \
-d '{"url":"https://tunehq.ai"}'
Let’s add some more data to the database (you might get a maximum content length warning, but you can ignore that for now):
for link in "https://news.ycombinator.com/" "https://example.com/"; do
curl -X POST http://localhost:8000/add_documents \
-H "Content-Type: application/json" \
-d "{\"url\":\"$link\"}"
done
Now query the documents in the database:
curl -X POST http://localhost:8000/prompt \
-H "Content-Type: application/json" \
-d '{"query":"Pricing of Tune Studio"}'