RAG with Google Gemini on Arabic Docs ~ Late Night Coding

In the dynamic landscape of natural language processing, Google Gemini has emerged as a revolutionary tool, pushing the boundaries of language comprehension. In this blog, we explore the capabilities of Gemini models, with a particular focus on their prowess in understanding foreign languages like Arabic.

Build with Gemini: Developer API Key

One of the exciting aspects of Google Gemini is its accessibility through the developer API key. Google generously provides developers with the opportunity to tap into the potential of Gemini models for free, allowing innovation and experimentation without financial barriers.

Get your API key in Google AI Studio.

Meet the stars of the show:

Gemini-pro: Optimized for text-only prompts, this model masters the art of linguistic finesse.

Gemini-pro-vision: For text-and-images prompts, this model integrates visual context seamlessly.

Let's Start:

In this blog post, I will guide you step by step through the implementation of a RAG model using the Gemini model. Each step of the process will be meticulously explained, providing you with a clear roadmap for incorporating this advanced language understanding into your projects. What's more, to make this journey even more accessible, the Python code for the entire implementation will be included in a user-friendly Python notebook.

We initiated the evaluation by conducting a swift test to assess the model's prowess in generating Arabic content from Arabic queries. Additionally, we examined its ability to answer questions based on a set of information using a miniature version of the RAG (Retrieval-Augmented Generation) approach.

The results shed light on the model's effectiveness in handling Arabic language intricacies and its capacity to provide contextually relevant responses within the defined information scope.

Step 1: Data Import with Langchain:

Our project commences by importing data from external sources, encompassing PDFs, CSVs, and websites.

To facilitate this process, we leverage both the Langchain and html2text libraries. For our assessment of the model's capabilities, we opt to scrape information from the Wikipedia page on gravity, considering both Arabic and English versions. This dual-language approach ensures a diverse dataset, allowing us to thoroughly evaluate the model's proficiency in handling multilingual content and extracting meaningful insights.

Step 2: Data Splitting & chunks creation with Langchain:

To streamline the handling of website data from the Wikipedia page, we employed Langchain's RecursiveCharacterTextSplitter.

This powerful tool enabled us to efficiently split the retrieved content into smaller, manageable chunks. This step is pivotal as it prepares the data for embedding and storage in a vector store. By breaking down the information into more digestible units, we enhance the model's ability to comprehend and generate nuanced responses based on the intricacies of the input.

Step 3: Gemini Embedding Mastery:

For the embedding phase, we harnessed the power of the Google Gemini embedding model, specifically utilizing the embedding-001 variant. This model played a pivotal role in embedding all the previously processed data chunks, ensuring a rich representation of the information.

Step 4: Vector Store with Langchain DocArrayInMemorySearch:

To efficiently store and organize these embeddings, we employed Langchain's vector store functionality, leveraging the DocArrayInMemorySearch from the Langchain vectorstores.

This strategic combination not only facilitates seamless storage of the embedded data but also sets the stage for streamlined querying and retrieval.Now, with our chunks embedded and securely stored, they are poised for efficient retrieval as the project progresses.

Step 5: Prompt Injection & Results Harvest from Gemini Model:

In the pursuit of generating precise and contextually rich answers, our approach involves leveraging the vector store retriever to extract the top chunks deemed most relevant to address user queries. This crucial step ensures that the context necessary for a comprehensive response is readily available.

Subsequently, employing the versatile capabilities of Langchain, we construct a seamless workflow. The user's question and the retrieved context are seamlessly passed through a Langchain chain, which incorporates a meticulously designed prompt template. This template plays a crucial role in structuring the input for the Google Gemini model.

This integrated process sets the stage for the Google Gemini model to perform prompt injection, effectively generating answers that draw upon the contextual information stored in the vectorized chunks. Through this methodical approach, we aim to provide users with accurate and insightful responses tailored to their inquiries.

My Personal Opinion:

In our evaluation, the model showcases impressive capabilities and yields outstanding results when it comes to English.

However, the performance takes a hit when dealing with Arabic content. This discrepancy can be attributed to the limitations of the embedding model and the retriever, which struggle to retrieve the relevant context needed to answer Arabic user queries effectively.

It's worth considering the adoption of a more advanced embedding model, possibly a multilingual one, to enhance results in Arabic. This adjustment could potentially address the current limitations and improve the overall performance for a more robust user experience.

A task for you!

For a hands-on exploration, consider experimenting with alternative tools to enhance the performance of the model.

Try integrating a different embedding model, perhaps a multilingual one from the HuggingFace library. Additionally, explore the use of an alternative vector store, like Chroma DB, to store and retrieve embedded data. After making these adjustments, compare the results with our current setup. Your findings could provide valuable insights into optimizing the system for improved performance and responsiveness.

Finally

To explore the Python code used in this project, visit my GitHub .

Additionally, don't miss our YouTube video for a visual walkthrough of our journey.

I'm always eager to connect, so feel free to reach out to me on LinkedIn.

Thank you, and stay tuned for more captivating projects and insights!

Late Night Coding

Welcome to my blog!🤗 My name is Ahmed Boulahia, I'm a data scientist with a passion for sharing my knowledge and expertise. You will find some of my projects on this blog. I hope you find them both interesting and informative.

Sunday, January 21

RAG with Google Gemini on Arabic Docs