Commit cd3be0e

Laure-di and remyleone authored
Apply suggestions from code review
Co-authored-by: Rémy Léone <[email protected]>
1 parent 022f7e5 commit cd3be0e

File tree

1 file changed: +9 -11 lines changed


tutorials/how-to-implement-rag/index.mdx

+9 -11
@@ -14,11 +14,11 @@ Retrieval-Augmented Generation (RAG) supercharges language models by enabling re
 In this comprehensive guide, you'll learn how to implement RAG using LangChain, one of the leading frameworks for developing robust language model applications. We'll combine LangChain with ***Scaleway’s Managed Inference***, ***Scaleway’s PostgreSQL Managed Database*** (featuring pgvector for vector storage), and ***Scaleway’s Object Storage*** for seamless integration and efficient data management.

 ## Why LangChain?
-LangChain simplifies the process of enhancing language models with retrieval capabilities, allowing developers to build scalable, intelligent applications that access external datasets effortlessly. By leveraging LangChain’s modular design and Scaleway’s cloud services, you can unlock the full potential of Retrieval-Augmented Generation.
+[LangChain](https://github.com/langchain-ai/langchain) simplifies the process of enhancing language models with retrieval capabilities, allowing developers to build scalable, intelligent applications that access external datasets effortlessly. By leveraging LangChain’s modular design and Scaleway’s cloud services, you can unlock the full potential of Retrieval-Augmented Generation.

 ## What You’ll Learn
 - How to embed text with a sentence transformer using ***Scaleway Managed Inference***
-- How to store and query embeddings using ***Scaleway’s Managed PostgreSQL Database*** with pgvector
+- How to store and query embeddings using ***Scaleway’s Managed PostgreSQL Database*** with [pgvector](https://github.com/pgvector/pgvector)
 - How to manage large datasets efficiently with ***Scaleway Object Storage***

 <Macro id="requirements" />
@@ -39,8 +39,6 @@ Run the following command to install the required packages:

 ```sh
 pip install langchain psycopg2 python-dotenv
-```
-### Step 2: Create a .env File

 Create a .env file and add the following variables. These will store your API keys, database connection details, and other configuration values.

@@ -49,7 +47,7 @@ Create a .env file and add the following variables. These will store your API ke

 # Scaleway API credentials
 SCW_ACCESS_KEY=your_scaleway_access_key
-SCW_API_KEY=your_scaleway_secret_ke
+SCW_API_KEY=your_scaleway_secret_key

 # Scaleway managed database (PostgreSQL) credentials
 SCW_DB_NAME=your_scaleway_managed_db_name
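
For context, these values are consumed via python-dotenv (installed in Step 1). A minimal sketch of how a script would read them; the variable names match the .env entries above, and nothing else is taken from the tutorial:

```python
import os

from dotenv import load_dotenv

# Pull the variables defined in .env into the process environment.
load_dotenv()

scw_api_key = os.getenv("SCW_API_KEY")
scw_db_name = os.getenv("SCW_DB_NAME")
```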
@@ -151,13 +149,13 @@ embeddings = OpenAIEmbeddings(

 #### What is tiktoken_enabled?

-tiktoken is a tokenization library developed by OpenAI, which is optimized for working with GPT-based models (like GPT-3.5 or GPT-4). It transforms text into smaller token units that the model can process.
+[`tiktoken`](https://github.com/openai/tiktoken) is a tokenization library developed by OpenAI, which is optimized for working with GPT-based models (like GPT-3.5 or GPT-4). It transforms text into smaller token units that the model can process.

 #### Why set tiktoken_enabled=False?

 In the context of using Scaleway’s Managed Inference and the `sentence-t5-xxl` model, tiktoken tokenization is not necessary because the model you are using (sentence-transformers) works with raw text and handles its own tokenization internally.
-Moreover, leaving tiktoken_enabled as True causes issues when sending data to Scaleway’s API because it results in tokenized vectors being sent instead of raw text. Since Scaleway's endpoint expects text and not pre-tokenized data, this mismatch can lead to errors or incorrect behavior.
-By setting tiktoken_enabled=False, you ensure that raw text is sent to Scaleway's Managed Inference endpoint, which is what the sentence-transformers model expects to process. This guarantees that the embedding generation process works smoothly with Scaleway's infrastructure.
+Moreover, leaving `tiktoken_enabled` as `True` causes issues when sending data to Scaleway’s API because it results in tokenized vectors being sent instead of raw text. Since Scaleway's endpoint expects text and not pre-tokenized data, this mismatch can lead to errors or incorrect behavior.
+By setting `tiktoken_enabled=False`, you ensure that raw text is sent to Scaleway's Managed Inference endpoint, which is what the sentence-transformers model expects to process. This guarantees that the embedding generation process works smoothly with Scaleway's infrastructure.

 ### Step 3: Create a PGVector Store

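
The hunk header above shows the call being configured, `embeddings = OpenAIEmbeddings(`. Filled out, it would look roughly like this; a sketch only, where the endpoint URL is a placeholder and the `langchain_openai` import path is an assumption rather than the tutorial's exact code:

```python
import os

from langchain_openai import OpenAIEmbeddings  # assumed import path

embeddings = OpenAIEmbeddings(
    openai_api_key=os.getenv("SCW_API_KEY"),
    openai_api_base="https://<your-inference-endpoint>/v1",  # placeholder: your Managed Inference URL
    model="sentence-t5-xxl",
    tiktoken_enabled=False,  # send raw text; the model tokenizes internally
)
```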
@@ -174,7 +172,7 @@ PGVector: This creates the vector store in your PostgreSQL database to store the

 ## Load and Process Documents

-Use the S3FileLoader to load documents and split them into chunks. Then, embed and store them in your PostgreSQL database.
+Use the [`S3FileLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.s3_file.S3FileLoader.html) to load documents and split them into chunks. Then, embed and store them in your PostgreSQL database.

 ### Step 1: Import Required Modules

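
The PGVector store referenced in the hunk header would be constructed along these lines. A sketch, assuming the `langchain_community` import path and .env variable names for host, port, user, and password (only `SCW_DB_NAME` appears in this diff):

```python
import os

from langchain_community.vectorstores import PGVector  # assumed import path

# Connection string assembled from the .env values; the host/port/user/password
# variable names are assumptions for illustration.
connection_string = (
    f"postgresql+psycopg2://{os.getenv('SCW_DB_USER')}:{os.getenv('SCW_DB_PASSWORD')}"
    f"@{os.getenv('SCW_DB_HOST')}:{os.getenv('SCW_DB_PORT')}/{os.getenv('SCW_DB_NAME')}"
)

vector_store = PGVector(
    connection_string=connection_string,
    embedding_function=embeddings,    # the OpenAIEmbeddings instance from above
    collection_name="rag_documents",  # arbitrary example name
)
```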
@@ -245,8 +243,8 @@ conn.commit()

 - S3FileLoader: The S3FileLoader loads each file individually from your ***Scaleway Object Storage bucket*** using the file's object_key (extracted from the file's metadata). It ensures that only the specific file is loaded from the bucket, minimizing the amount of data being retrieved at any given time.
 - RecursiveCharacterTextSplitter: The RecursiveCharacterTextSplitter breaks each document into smaller chunks of text. This is crucial because embedding models, like those used in Retrieval-Augmented Generation (RAG), typically have a limited context window (the number of tokens they can process at once).
-- Chunk Size: Here, the chunk size is set to 480 characters, with an overlap of 20 characters. The choice of 480 characters is based on the context size supported by the embeddings model. Models have a maximum number of tokens they can process in a single pass, often around 512 tokens or fewer, depending on the specific model you are using. To ensure that each chunk fits within this limit, 380 characters provide a buffer, as different models tokenize characters into variable-length tokens.
-- Chunk Overlap: The 20-character overlap ensures continuity between chunks, which helps prevent loss of meaning or context between segments.
+- `Chunk Size`: Here, the chunk size is set to 480 characters, with an overlap of 20 characters. The choice of 480 characters is based on the context size supported by the embedding model. Models have a maximum number of tokens they can process in a single pass, often around 512 tokens or fewer, depending on the specific model you are using. Keeping chunks at 480 characters leaves a buffer below this limit, as different models tokenize characters into variable-length tokens.
+- `Chunk Overlap`: The 20-character overlap ensures continuity between chunks, which helps prevent loss of meaning or context between segments.
 - Embedding the Chunks: For each document, the text is split into smaller chunks using the text splitter, and an embedding is generated for each chunk using the embeddings.embed_query(chunk) function. This function transforms each chunk into a vector representation that can later be used for similarity search.
 - Embedding Storage: After generating the embeddings for each chunk, they are stored in a vector database (e.g., PostgreSQL with pgvector) using the vector_store.add_embeddings(embedding, chunk) method. Each embedding is stored alongside its corresponding text chunk, enabling retrieval during a query.
 - Avoiding Redundant Processing: The script checks the object_loaded table in PostgreSQL to see if a document has already been processed (i.e., the object_key exists in the table). If it has, the file is skipped, avoiding redundant downloads, vectorization, and database inserts. This ensures that only new or modified documents are processed, reducing the system's computational load and saving both time and resources.
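
Taken together, these bullets describe a loop roughly like the following sketch. It assumes `object_keys`, `bucket_name`, a psycopg2 cursor `cur`, and a one-column `object_loaded` table, and it follows the `langchain_community` APIs (PGVector's `add_embeddings` takes parallel lists of texts and embeddings) rather than the tutorial's exact code:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import S3FileLoader  # assumed import path

text_splitter = RecursiveCharacterTextSplitter(chunk_size=480, chunk_overlap=20)

for object_key in object_keys:  # object keys listed from the bucket beforehand
    # Skip files already recorded in the object_loaded table.
    cur.execute("SELECT 1 FROM object_loaded WHERE object_key = %s", (object_key,))
    if cur.fetchone():
        continue

    # Load only this file from the Scaleway Object Storage bucket.
    loader = S3FileLoader(bucket_name, object_key)  # endpoint/credentials omitted here
    for document in loader.load():
        for chunk in text_splitter.split_text(document.page_content):
            embedding = embeddings.embed_query(chunk)
            vector_store.add_embeddings(texts=[chunk], embeddings=[embedding])

    # Record the object_key so the file is not reprocessed on the next run.
    cur.execute("INSERT INTO object_loaded (object_key) VALUES (%s)", (object_key,))
    conn.commit()
```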
