What is RAG?

Retrieval Augmented Generation (RAG) is a technique that combines the power of retrieval and generation to improve the performance of LLMs.

RAG tackles the following problems:

  • An LLM's knowledge can be outdated.
  • An LLM cannot cite a source for its answers, and it may never have been trained on your data at all.
  • RAG lets the LLM say "I don't know" instead of hallucinating.

How Does RAG Work?

The core idea is to give the LLM access to a pile of documents (a database) so it can retrieve the most relevant information before answering a question.

At a high level, the pipeline does the following two steps (sketched in code below the list):

  1. Retrieval: query a vector database for the information most relevant to the question.
  2. Generation: generate the answer based on the retrieved information.
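For concreteness, here is a minimal sketch of that retrieve-then-generate loop. It assumes a LangChain-style vector_store like the one built later in this post and uses litellm for the generation call; the function name, prompt, and model name are illustrative, not the exact ones used in the project.

from litellm import completion

def answer_question(vector_store, question: str, k: int = 5) -> str:
    # 1. Retrieval: pull the k most relevant chunks from the vector database.
    docs = vector_store.similarity_search(question, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Generation: answer using only the retrieved context.
    response = completion(
        model="gpt-4o-mini",  # any litellm-supported model
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. "
                        "If the context is not enough, say you don't know."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content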

The Vector Database

The vector database stores the embeddings of our documents: each entry is a vector representation of a piece of structured or unstructured data, which the retrieval step can query.
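As a rough mental model (not the implementation used later in this post), the store keeps one embedding per chunk and answers a query by nearest-neighbour search over those vectors:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings (real ones have hundreds of dimensions and come from a model).
store = {
    "How to install the library": np.array([0.9, 0.1, 0.0]),
    "API reference for the text splitter": np.array([0.2, 0.8, 0.1]),
}
query_vec = np.array([0.85, 0.15, 0.05])  # embedding of "installation instructions"

best_chunk = max(store, key=lambda text: cosine_similarity(store[text], query_vec))
print(best_chunk)  # -> "How to install the library"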

We first need to convert the documents into embeddings.

  1. Download the documentation from the code library and convert it into markdown format.
  2. Chunk the documents into smaller pieces.
  3. Embed each chunk into a feature vector.

Text Preprocessing and Chunking

First we preprocess the text. Documentation is usually written in reStructuredText (rst), so we convert it into markdown format first.
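One way to do that conversion, assuming pandoc is installed and the pypandoc wrapper is available (the paths and helper name here are illustrative):

import os
import pypandoc

def rst_folder_to_markdown(src_dir: str, dst_dir: str) -> None:
    """Convert every .rst file under src_dir into a .md file under dst_dir."""
    for root, _, files in os.walk(src_dir):
        for name in files:
            if not name.endswith(".rst"):
                continue
            src_path = os.path.join(root, name)
            dst_path = os.path.join(dst_dir, name[:-len(".rst")] + ".md")
            os.makedirs(dst_dir, exist_ok=True)
            markdown = pypandoc.convert_file(src_path, "markdown", format="rst")
            with open(dst_path, "w", encoding="utf-8") as f:
                f.write(markdown)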

Chunking is the process of splitting large documents into smaller, manageable segments (chunks) before storing them in a retrieval system. This ensures that relevant information can be efficiently retrieved and used by the language model.

Chunking is needed in my experiment because each document in the code library may be too large.

  • Handling Large Documents: if entire documents were retrieved, they might exceed the model's token limit.
  • Improving Retrieval Efficiency: smaller chunks make it easier to retrieve highly relevant sections.
  • Boosting Retrieval Accuracy: with smaller chunks, the retriever finds more precise and useful information.

We used the LangChain library to do the chunking.

import os
from typing import List

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document

def _process_documentation_folder(self, folder_path: str) -> List[Document]:
    """Process documentation files from a folder."""
    all_docs = []

    # Load every markdown and python file under the folder
    for root, _, files in os.walk(folder_path):
        for file in files:
            if file.endswith(('.md', '.py')):
                file_path = os.path.join(root, file)
                try:
                    loader = TextLoader(file_path)
                    documents = loader.load()
                    for doc in documents:
                        doc.metadata['source'] = file_path
                    all_docs.extend(documents)
                except Exception as e:
                    print(f"Error loading file {file_path}: {e}")

    if not all_docs:
        print(f"No markdown or python files found in {folder_path}")
        return []

    # Split documents using language-aware splitters
    split_docs = []
    markdown_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.MARKDOWN
    )
    python_splitter = RecursiveCharacterTextSplitter.from_language(
        language=Language.PYTHON
    )

    for doc in all_docs:
        if doc.metadata['source'].endswith('.md'):
            temp_docs = markdown_splitter.split_documents([doc])
            # Prepend the source path so retrieved chunks can be attributed
            for temp_doc in temp_docs:
                temp_doc.page_content = f"Source: {doc.metadata['source']}\n\n{temp_doc.page_content}"
            split_docs.extend(temp_docs)
        elif doc.metadata['source'].endswith('.py'):
            temp_docs = python_splitter.split_documents([doc])
            for temp_doc in temp_docs:
                temp_doc.page_content = f"Source: {doc.metadata['source']}\n\n{temp_doc.page_content}"
            split_docs.extend(temp_docs)

    return split_docs
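For reference, the same splitters can be tuned with an explicit chunk size and overlap; a quick standalone usage example (the values here are illustrative, not the ones used in my experiment):

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=500,    # max characters per chunk
    chunk_overlap=50,  # characters shared between neighbouring chunks
)
chunks = splitter.split_text("# Title\n\nA long markdown document would go here ...")
print(f"{len(chunks)} chunks, first chunk starts with: {chunks[0][:80]!r}")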

Feature Embedding

We used ChromaDB as the vector store.

import os

from langchain_community.vectorstores import Chroma
from langchain_core.embeddings import Embeddings
from litellm import embedding

def _get_embedding_function(self) -> Embeddings:
    """Returns an Embeddings object that uses litellm under the hood."""
    class LiteLLMEmbeddings(Embeddings):
        def __init__(self, embedding_model):
            self.embedding_model = embedding_model

        def embed_documents(self, texts: list[str]) -> list[list[float]]:
            # ... existing code ...
            response = embedding(
                model=self.embedding_model,
                input=texts,
                task_type="CODE_RETRIEVAL_QUERY" if self.embedding_model == "vertex_ai/text-embedding-005" else None
            )
            # ... existing code ...
            return [r["embedding"] for r in response["data"]]

        def embed_query(self, text: str) -> list[float]:
            # ... existing code ...
            response = embedding(
                model=self.embedding_model,
                input=[text],
                task_type="CODE_RETRIEVAL_QUERY" if self.embedding_model == "vertex_ai/text-embedding-005" else None
            )
            # ... existing code ...
            return response["data"][0]["embedding"]

    return LiteLLMEmbeddings(self.embedding_model)

def _create_core_store(self):
    """Creates the main ChromaDB vector store."""
    core_vector_store = Chroma(
        collection_name="collection_name",
        embedding_function=self._get_embedding_function(),
        persist_directory=os.path.join(self.chroma_db_path, "collection_name")
    )

    # Process docs and add them to the store
    core_docs = self._process_documentation_folder(os.path.join(self.docs_path, "collection_name"))
    if core_docs:
        self._add_documents_to_store(core_vector_store, core_docs, "collection_name")

    return core_vector_store
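The _add_documents_to_store helper is not shown above. Assuming it simply wraps LangChain's standard add_documents call on the Chroma store, a minimal version could look like this:

def _add_documents_to_store(self, vector_store, docs, collection_name: str) -> None:
    """Embed the chunks and persist them in the given Chroma collection (sketch)."""
    # Chroma inherits add_documents from LangChain's VectorStore base class:
    # each chunk is embedded with the configured embedding function and stored.
    vector_store.add_documents(docs)
    print(f"Added {len(docs)} chunks to {collection_name}")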

Then you can run a similarity search to query the vector database:

def find_relevant_docs(self, queries: List[Dict], k: int = 5) -> List[str]:
    """Find relevant documents based on the queries."""
    # ... existing code ...
    for query in manim_core_queries:
        query_text = query["query"]
        self.core_vector_store._embedding_function.parent_observation_id = span.id
        _results = self.core_vector_store.similarity_search_with_relevance_scores(
            query=query_text,
            k=k,
            score_threshold=0.5
        )
        for result in _results:
            _formatted_results.append({
                "query": query_text,
                "source": result[0].metadata['source'],
                "content": result[0].page_content,
                "score": result[1]
            })
    # ... existing code ...
    seen = set()
    for item in _formatted_results:
        key = item['content']
        if key not in seen:
            _unique_results.append(item)
            seen.add(key)
    # ... existing code ...
    return _unique_results

Query Agent

The query agent is an agent that queries the vector database to retrieve the most relevant information.

Basically, we can ask an LLM to generate the search queries that are sent to the vector database.
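A minimal sketch of that idea, again using litellm for the LLM call (the prompt, model name, and helper name are illustrative, not the exact ones used in the project); the output format matches the queries argument expected by find_relevant_docs above:

import json
from typing import Dict, List

from litellm import completion

def generate_queries(question: str, model: str = "gpt-4o-mini", n: int = 3) -> List[Dict]:
    """Ask an LLM to propose search queries for the vector database."""
    prompt = (
        "You are helping retrieve library documentation. "
        f"For the question below, write {n} short search queries "
        "and return them as a JSON list of strings.\n\n"
        f"Question: {question}"
    )
    response = completion(model=model, messages=[{"role": "user", "content": prompt}])
    # Assumes the model returns bare JSON; real code should validate or strip code fences.
    queries = json.loads(response.choices[0].message.content)
    return [{"query": q} for q in queries]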

Summary

  1. Download the documentation from the code library and convert it into markdown format.
  2. Chunk the documents into smaller pieces.
  3. Embed each chunk into a feature vector.
  4. Use an LLM to generate the search queries (optional).
  5. Ask the LLM questions, with the model connected to the RAG vector database.