Saturday, August 10

Finetune & Run Llama3.1 Locally

 


Released on July 23, 2024, Llama 3.1 marks a significant leap in the world of AI, introducing the first open-source model that can compete with the top AI systems. The Llama 3.1 8B model, part of this groundbreaking release, is designed with enhanced multilingual capabilities, extended context length, and improved reasoning skills. It’s built to handle advanced tasks like long-form text summarization, multilingual conversations, and coding assistance.

In this article, I'll guide you through downloading the Llama 3.1 8B model and running it locally on your machine, which enables offline model inference. We'll also dive into fine-tuning the model for a specific task, tailoring its capabilities to meet your unique needs. Finally, we'll compare the performance of the base model with the fine-tuned version to see whether these adjustments enhance its effectiveness.

At the end of this blog, we’ll provide all the necessary resources, including code, used dataset links, and access to the fine-tuned model, to support your own experimentation and implementation.

1- Download and Run Llama 3.1 locally

LM Studio Interface

To start working with the Llama 3.1 8B model locally, we'll use LM Studio, a powerful tool designed for handling large language models (LLMs) with ease. LM Studio provides a user-friendly interface that simplifies downloading and running LLMs, including GGUF-format models, directly from the Hugging Face hub. 

Click here to download LM Studio


Quantization  & GGUF models

When choosing a model, it's crucial to consider your local machine's computational power. LM Studio allows you to download various GGUF models in different sizes and configurations. Quantization is a key technique to help with this, as it reduces the model's size and computational requirements, making it more suitable for machines with limited resources. 
For this tutorial, we used the 4-bit quantized version of the Llama 3.1 8B model. This version is specifically optimized to run efficiently on machines with limited resources. On my setup, which includes a GPU with 4GB of VRAM, this quantized model performs exceptionally well, providing a good balance between performance and resource usage. By opting for this version, you can ensure smooth operation and effective utilization of your local machine's capabilities.


Local server

Additionally, LM Studio offers the option to create a local server that mimics the OpenAI library code. This server setup allows you to deploy any model that your machine can handle and integrate it seamlessly with any code that uses the OpenAI library. This feature not only supports offline usage but also provides greater flexibility and control over model performance and integration.
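As an illustration, here is a minimal sketch of calling such a local server from Python with the official openai client. The address http://localhost:1234/v1 is LM Studio's default and the API key can be any placeholder string; adjust both to your setup.

from openai import OpenAI

# Point the OpenAI client at the local LM Studio server instead of the OpenAI API.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model is currently loaded
    messages=[{"role": "user", "content": "Summarize the Llama 3.1 release in one sentence."}],
    temperature=0.7,
)
print(response.choices[0].message.content)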


2- Fine-Tuning Llama 3.1

Now, in this section, we'll focus on fine-tuning the Llama 3.1 8B model to enhance its capabilities in understanding Arabic for an instruction-based task. 

This fine-tuning process is designed to make the model more proficient at handling Arabic instructions, improving its overall performance in this language. To achieve this, we'll use the Unsloth Python library, which provides a comprehensive set of tools for training and optimizing models. 

For this task, we’ve constructed a specialized dataset tailored specifically to enhance the model’s Arabic language understanding. This dataset is carefully designed to address the nuances and complexities of Arabic instruction, ensuring that the fine-tuning process is both effective and precise.

Finetuning Dataset

The dataset was created to support the fine-tuning of language models on Arabic instructions. 
It consists of 11,000 rows, with 10,000 examples for training and 1,000 examples for evaluation. This dataset combines both English and Arabic instructions, providing a comprehensive resource for improving multilingual understanding. It follows the Alpaca prompt style, including fields for instruction, input, and output, which helps in fine-tuning models to handle and generate responses based on various instructional prompts effectively.
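For illustration, a single row in this Alpaca-style layout could look like the following (a made-up example; the real records are in the dataset linked below):

example_row = {
    "instruction": "ترجم الجملة التالية إلى اللغة العربية.",  # "Translate the following sentence into Arabic."
    "input": "Machine learning is changing how we build software.",
    "output": "التعلم الآلي يغيّر الطريقة التي نبني بها البرمجيات.",
}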

Dataset link on HuggingFace


Finetuning Task

Supervised Fine-Tuning (SFT) is a technique used to improve and customize pre-trained language models. It involves retraining a base model on a smaller, specialized dataset that includes instructions and their corresponding answers. This process helps transform a general model into one that can follow specific instructions and provide accurate responses. SFT can boost the model’s performance, add new knowledge, or adjust it for particular tasks or fields. Additionally, after fine-tuning, the model can be further refined to better align with specific preferences.

However, SFT has its limitations. It works best when building on existing knowledge in the base model. Learning entirely new information, such as a new language, can be challenging and may lead to hallucinations.

There are three main SFT techniques: full fine-tuning, Low-Rank Adaptation (LoRA), and Quantization-aware Low-Rank Adaptation (QLoRA). Full fine-tuning involves retraining all the parameters of a model and, while effective, is resource-heavy and can cause the model to lose some of its previous knowledge. LoRA is a more efficient method that adds small adapters to the model, reducing memory usage and training time without altering the original parameters.



QLoRA builds on LoRA by adding quantization to save even more memory, making it particularly useful when GPU memory is limited. Although QLoRA requires more time to train, its memory savings make it a good option for scenarios with restricted resources. In this blog, we will use QLoRA to fine-tune the Llama 3.1 8B model, taking advantage of its efficiency to make effective adjustments while working within the limits of available GPU memory.

To fine-tune the Llama 3.1 8B model efficiently, we'll use the Unsloth library developed by Daniel and Michael Han. Unsloth stands out for its custom kernels, which allow for up to 2x faster training and 60% less memory usage compared to other methods. This efficiency is especially valuable in constrained environments like Google Colab. However, it's worth noting that Unsloth currently supports only single-GPU setups. For multi-GPU configurations, alternatives like TRL and Axolotl, which also use Unsloth as a backend, are recommended.

First, we install the library:

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

Then we choose the base model we want to fine-tune (Meta-Llama-3.1-8B):

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!d
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Now, in this step, we use our dataset and format each row following the Alpaca prompt to create our training set:
alpaca_prompt = """Below is an instruction that describes a task,
paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("AhmedBou/Arabic_instruction_dataset_for_llm_ft", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Finally, we train the model. In this step I set max_steps = 1250, which corresponds to one training epoch.

To understand why 1,250 steps correspond to 1 epoch, let's consider the training setup:

  • Batch Size: The per_device_train_batch_size is set to 2. This means that each training step processes 2 examples from the dataset.
  • Gradient Accumulation: The gradient_accumulation_steps is set to 4. This means gradients are accumulated over 4 steps before applying an update. Essentially, each step updates the model based on 8 examples (2 examples per batch * 4 accumulation steps).
  • Dataset Size: Assume our dataset has 10,000 examples.

To complete one epoch, where the model sees every example in the dataset once, the number of training steps needed is calculated as follows:

$$\text{Steps per Epoch} = \frac{\text{Dataset Size}}{\text{Effective Batch Size}}$$

Where the effective batch size is:

$$\text{Effective Batch Size} = \text{Per Device Batch Size} \times \text{Gradient Accumulation Steps}$$

Plugging in the numbers:

$$\text{Effective Batch Size} = 2 \times 4 = 8 \qquad \text{Steps per Epoch} = \frac{10{,}000}{8} = 1{,}250$$


from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 1250,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
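With the trainer configured, launching the fine-tuning run is a single call (and since logging_steps = 1, the training loss is printed at every step):

trainer_stats = trainer.train()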


Finally, after completing the fine-tuning, I saved the LoRA adapters and the GGUF version of the model to Hugging Face. This allows us to seamlessly integrate and use them with LM Studio. 
You can easily import the LoRA adapters and perform inference directly within the same Colab notebook.

inputs = tokenizer(
[
    alpaca_prompt.format(
        "قم بصياغة الجملة الإنجليزية التالية باللغة العربية.", # instruction
        "We hope that the last cases will soon be resolved through the mechanisms established for this purpose.", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
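For the export step mentioned above, the sketch below shows one way to push the LoRA adapters and a 4-bit GGUF build to Hugging Face using Unsloth's helpers. The repository names and token are placeholders, and the exact export methods may differ between Unsloth versions, so treat this as a rough guide rather than the exact code used here.

# Push the LoRA adapters (placeholder repository name and token).
model.push_to_hub("your-username/llama31-arabic-lora", token="hf_...")
tokenizer.push_to_hub("your-username/llama31-arabic-lora", token="hf_...")

# Export and push a 4-bit GGUF build that LM Studio can load.
model.push_to_hub_gguf("your-username/llama31-arabic-gguf", tokenizer,
                       quantization_method="q4_k_m", token="hf_...")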



3- Comparing Base Model vs Fine-tuned Model Results


To evaluate the performance of the base model versus the fine-tuned model, we used LM Studio to run inference on the base model locally as a server. For the fine-tuned model, we performed inference using a Colab notebook by importing the LoRA adapters that we had trained.

We used Gemini-1.5 as a judge to assess which model's outputs were better aligned with the ground truth. This evaluation was based on 100 samples out of the 1,000 evaluation examples in the dataset. 
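The judging step can be approximated with a prompt along the following lines. This is a hedged sketch using the google-generativeai package; the model name, prompt wording, and sample values are illustrative assumptions, not the exact setup used for the numbers below.

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-flash")  # any Gemini 1.5 variant can act as judge

# Placeholder values for a single evaluation sample.
instruction = "ترجم الجملة التالية إلى العربية: Good morning."
reference = "صباح الخير."
base_output = "صباح الخير."
finetuned_output = "صباح النور."

judge_prompt = f"""You are an impartial judge. Given an instruction, a ground-truth answer,
and two candidate answers (A = base model, B = fine-tuned model), reply with only "A" or "B"
to indicate which candidate is closer to the ground truth.

Instruction: {instruction}
Ground truth: {reference}
Candidate A: {base_output}
Candidate B: {finetuned_output}"""

verdict = judge.generate_content(judge_prompt).text.strip()
print(verdict)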



The results indicated that the base model outperformed the fine-tuned model, generating better responses in 54 out of 100 examples.

Verdict:  The base model demonstrated strong capabilities in understanding Arabic instructions, such as translation and generation. 
The results suggest that, in this case, fine-tuning did not provide a significant improvement over the base model. Therefore, for tasks involving Arabic instruction, the base model itself is quite effective and might not require additional fine-tuning.

Resources:

Dataset used:  https://huggingface.co/datasets/AhmedBou/Arabic_instruction_dataset_for_llm_ft
Github code: https://github.com/BoulahiaAhmed/Finetune-Run-Llama3.1-Locally
HuggingFace repo: https://huggingface.co/AhmedBou


Sunday, June 2

Summarize and Translate using LLMs

 


Introduction:

Hello! In this blog, we will explore the process of fine-tuning large language models (LLMs) for a specific task: summarizing English news articles in Arabic and generating an Arabic title. 

We will focus on two powerful models, Gemma-7b and Llama3-8b, and walk through each step required to achieve this task. 

  • Dataset Creation: How to gather and prepare the data necessary for fine-tuning. 
  • Prompt Creation: Crafting effective prompts to guide the models in performing the desired tasks.
  • Model Fine-Tuning: Using Unsloth AI to fine-tune our models specifically for summarization and title generation. 

By the end of this blog, you will have a clear understanding of how to adapt these LLMs to perform task-oriented applications, leveraging their capabilities to produce meaningful outputs in a different language. Let’s get started!

1- Data Preparation:

For this task, we utilized a sample of the XLSum dataset, which includes a diverse collection of news articles, their summaries, and titles, all in English. To tailor this dataset for our specific needs, we followed these steps:

  1. Sample Selection: We selected a representative sample from the XLSum dataset.
  2. Translation: While keeping the news articles in English, we translated the summaries and titles into Arabic.
  3. JSON Representation: We created a new column in our dataset that contains the JSON representation of both the translated summary and title.

This structured approach ensures that our data is well-organized and ready for the fine-tuning process. The resulting dataset looks as follows:



Our Dataset for this task: AhmedBou/EngText-ArabicSummary 


2- Effective Prompt:

In our fine-tuning process, crafting effective prompts is essential to guide the model on the specific task of summarizing English news articles in Arabic and generating an Arabic title.

Prompt Structure

Our prompt consists of the following elements:

  1. System Message: A message to set the context for the task.
  2. Fixed Instruction: A consistent instruction since we are fine-tuning for a specific task.
  3. Input: The news article in English.
  4. Response: The JSON representation of the translated summary and title.


This structured approach ensures that our dataset is complete and well-prepared, facilitating effective task-oriented fine-tuning. 
By maintaining consistency in our prompts, we enhance the model's ability to understand and perform the task accurately.
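As an illustration, the sketch below assembles one training prompt in this structure. The exact wording of the system message and instruction is an assumption for illustration, not the verbatim prompt used for the released model.

prompt_template = """### System:
You are a helpful assistant that summarizes English news articles in Arabic.

### Instruction:
Summarize the following English news article in Arabic and generate an Arabic title.
Return the result as JSON with the keys "summary" and "title".

### Input:
{article}

### Response:
{response}"""

example_prompt = prompt_template.format(
    article="The central bank announced a new interest rate policy on Monday...",
    response='{"summary": "أعلن البنك المركزي عن سياسة جديدة لأسعار الفائدة يوم الاثنين...", "title": "سياسة جديدة لأسعار الفائدة"}',
)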

3- Task-Oriented Fine-Tuning of LLMs:

To fine-tune the Gemma-7b and Llama3-8b models for our specific task, we leveraged the power of Unsloth AI, which makes the fine-tuning process 2.2x faster and reduces VRAM usage by 80%. This efficiency allowed us to perform the fine-tuning on a free Colab notebook, making the process accessible and cost-effective.

Finetuned Model's link: AhmedBou/Llama-3-EngText-ArabicSummary


Tools and Frameworks

We utilized Hugging Face's TRL (Transformer Reinforcement Learning) library, specifically the SFTTrainer class, to facilitate the fine-tuning. This tool simplifies the training process and integrates seamlessly with our workflow.

Conclusion:

After fine-tuning the Gemma-7b and Llama3-8b models, we observed that the Llama3-8b model performed better in several key aspects.

It consistently respected the output format as JSON and provided more meaningful summaries and titles that adhered to Arabic grammar. This highlights the effectiveness of the Llama3-8b model for our specific task of summarizing English news articles in Arabic and generating Arabic titles.

Challenge for Readers

We invite you to take on a challenge to further explore and validate our findings. Using the test set we provided, calculate the approximate accuracy score between the two models. You can use evaluation metrics like BLEU, ROUGE scores, Jaccard Index, or RapidFuzz to determine the performance of each model. This will give you a quantitative measure to see which model performs best.

Steps to Follow

  1. Prepare the Test Set: Load the provided test set.
  2. Generate Outputs: Use both Gemma-7b and Llama3-8b models to generate summaries and titles.
  3. Evaluate Outputs: Calculate the evaluation metrics (BLEU, ROUGE, Jaccard Index, or RapidFuzz) to compare the models.
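As a starting point for the evaluation step, here is a minimal sketch of the RapidFuzz option; the placeholder lists stand in for the reference summaries and the two models' generations on the test set.

from rapidfuzz import fuzz

def average_similarity(generated, references):
    """Average character-level similarity (0-100) between generated and reference texts."""
    scores = [fuzz.ratio(gen, ref) for gen, ref in zip(generated, references)]
    return sum(scores) / len(scores)

# Placeholder outputs; replace with the real test-set references and model generations.
references = ["ملخص مرجعي أول", "ملخص مرجعي ثانٍ"]
gemma_outputs = ["ملخص من Gemma", "ملخص آخر من Gemma"]
llama_outputs = ["ملخص من Llama", "ملخص آخر من Llama"]

print("Gemma-7b:", average_similarity(gemma_outputs, references))
print("Llama3-8b:", average_similarity(llama_outputs, references))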

Finally

To explore the Python code used in this project, visit my GitHub 
Additionally, don't miss our YouTube video for a visual walkthrough of our journey. 
I'm always eager to connect, so feel free to reach out to me on LinkedIn

Thank you, and stay tuned for more captivating projects and insights!





Sunday, February 25

MedZa Assistant: Optimized RAG Chatbot based on Gemini, and Browser Data

 



Welcome to our blog! In today's digital landscape, chatbots like ChatGPT often face the challenge of providing accurate and up-to-date responses due to knowledge cutoff. 

To address this, we've developed a local chatbot solution leveraging RAG (Retrieval-Augmented Generation) with Gemini and real-time browser data integration using the DuckDuckGo API. 

In the following example, we can see how ChatGPT couldn't answer a simple question about Sora, an OpenAI model!

Our solution, which we call MedZa Assistant, not only delivered the correct answer but also substantiated it with references, ensuring its credibility and guarding against model hallucination.




Our innovative approach ensures precise and timely answers, surpassing limitations imposed by knowledge cutoff. Let's dive in!

Getting Fresh Data from the Internet with DuckDuckGo API

In our first part, we explored how we used the DuckDuckGo API to gather the latest info from the internet. When a user asked a question, our chatbot didn't just rely on what it already knew. Instead, it went online and checked out what was new. By tapping into the DuckDuckGo API, we found the top information on the topic, along with where it was coming from. This helped us stay up-to-date and provide the most recent data to our users, ensuring they got the freshest answers possible.

But why did we go through all this trouble? Well, it wasn't just about being current. By constantly updating our knowledge base with fresh info from the web, we were also helping our chatbot stay sharp. You see, sometimes our model might have gotten a bit confused or mixed things up – we called that "hallucination." But with the help of DuckDuckGo API, we could give it real-world examples to learn from, making sure it was always on the right track. So not only did our users get the most recent answers, but our chatbot also got a little boost in its smarts along the way.
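As an illustration, here is a minimal sketch of this retrieval step with the duckduckgo_search Python package; the package and its DDGS.text interface are an assumption about the setup, not necessarily the exact code behind MedZa.

from duckduckgo_search import DDGS

def fetch_fresh_results(query, max_results=2):
    """Return the top web results (title, snippet, source URL) for a query."""
    with DDGS() as ddgs:
        return [
            {"title": r["title"], "snippet": r["body"], "source": r["href"]}
            for r in ddgs.text(query, max_results=max_results)
        ]

print(fetch_fresh_results("What is Python?"))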

The following are the top 2 results returned by the DuckDuckGo API when we asked "What is Python?"



Making Chatbot Answers Better with Google Gemini

In our second part, we explored how we used the Google Gemini Pro model alongside prompts to handle each returned result from the web separately. After gathering information from different sources using the DuckDuckGo API, we fed each piece of data into the Google Gemini Pro model with a specific prompt designed to summarize it effectively. By breaking down the content into smaller parts and summarizing each one individually, we made sure that our summaries were clear and accurate.

But we didn't stop there. After generating summaries for each piece of information, we compiled them into a complete data digest. This digest provided a thorough overview of the topic, capturing the main points from multiple sources in a brief and easy-to-understand format. Each summary was linked back to its original source, giving users the option to explore further if they wanted to.

So, when users engaged with our chatbot, they could trust that the answers provided were not only accurate and recent but also carefully selected from reliable sources on the internet. Thanks to the integration of Google Gemini Pro and our meticulous curation process, we aimed to offer users a seamless and informative experience, giving them access to knowledge while maintaining transparency about our data sources.
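A minimal sketch of this summarize-then-compile step with the google-generativeai package is shown below; the prompt wording is an assumption, and fetch_fresh_results comes from the previous sketch.

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")
model = genai.GenerativeModel("gemini-pro")

def summarize_result(snippet, source):
    # Summarize one retrieved result, then attach its source for transparency.
    prompt = f"Summarize the following search result in two or three sentences:\n\n{snippet}"
    summary = model.generate_content(prompt).text
    return f"{summary}\n(Source: {source})"

# Compile the per-source summaries into a single digest.
results = fetch_fresh_results("What is LLMOps?")
digest = "\n\n".join(summarize_result(r["snippet"], r["source"]) for r in results)
print(digest)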

The following is the result we get from Google Gemini Pro when asked about LLMOps; as you can see, it is pure hallucination.



And here is MedZa Assistant's answer:



Bringing MedZa Assistant to the Web with Streamlit

In an effort to make the MedZa Assistant more accessible, we leveraged the Streamlit library to develop a user-friendly web application.

This application allows users to interact with the chatbot directly through their web browser, eliminating the need for any downloads or installations.

Users can now visit the gallery and access the MedZa Assistant with just a few clicks, whether they're seeking information, assistance, or just a friendly chat.
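A minimal sketch of such a Streamlit front end is shown below; medza_answer is a hypothetical stand-in for the DuckDuckGo-plus-Gemini pipeline described above.

import streamlit as st

def medza_answer(question):
    # Placeholder: in the real app this would call the retrieval + summarization pipeline.
    return "This is where the retrieved and summarized answer would appear."

st.title("MedZa Assistant")

question = st.chat_input("Ask me anything...")
if question:
    with st.chat_message("user"):
        st.write(question)
    with st.chat_message("assistant"):
        st.write(medza_answer(question))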

The little backstory on the name "MedZa"! The secret behind this unique moniker is that it's a combination of the names of the creators: me, Ahmed, and Hamza. 🤫 We took the last part of each of our names and merged them together to create "MedZa," symbolizing our collaboration and dedication to building a helpful and innovative assistant for everyone to enjoy.

Future Directions and Improvements

Looking ahead, our focus shifts to enhancing the chatbot's capabilities beyond just question-answering.

While the current version excels in providing accurate responses, we aim to expand its functionality to include conversational interactions and logic-based queries. Additionally, we plan to refine its ability to handle straightforward questions like arithmetic calculations independently, without relying on additional context.

By addressing these areas, we aim to create a more versatile and intuitive chatbot experience for users.


Finally

To explore the Python code used in this project, visit my GitHub

I'm always eager to connect, so feel free to reach out to me on LinkedIn

Connect with Hamza Boulahia LinkedIn

Take a minute and visit Hamza Boulahia Amazing Blog



Sunday, January 21

RAG with Google Gemini on Arabic Docs

 

In the dynamic landscape of natural language processing, Google Gemini has emerged as a revolutionary tool, pushing the boundaries of language comprehension. In this blog, we explore the capabilities of Gemini models, with a particular focus on their prowess in understanding foreign languages like Arabic.

Build with Gemini: Developer API Key

One of the exciting aspects of Google Gemini is its accessibility through the developer API key. Google generously provides developers with the opportunity to tap into the potential of Gemini models for free, allowing innovation and experimentation without financial barriers.

Get your API key in Google AI Studio.



Meet the stars of the show:
  • Gemini-pro: Optimized for text-only prompts, this model masters the art of linguistic finesse.
  • Gemini-pro-vision: For text-and-image prompts, this model integrates visual context seamlessly.

Let's Start:

In this blog post, I will guide you step by step through the implementation of a RAG model using the Gemini model. Each step of the process will be meticulously explained, providing you with a clear roadmap for incorporating this advanced language understanding into your projects. What's more, to make this journey even more accessible, the Python code for the entire implementation will be included in a user-friendly Python notebook.

We initiated the evaluation by conducting a swift test to assess the model's prowess in generating Arabic content from Arabic queries. Additionally, we examined its ability to answer questions based on a set of information using a miniature version of the RAG (Retrieval-Augmented Generation) approach.

The results shed light on the model's effectiveness in handling Arabic language intricacies and its capacity to provide contextually relevant responses within the defined information scope.


Step 1: Data Import with Langchain:


Our project commences by importing data from external sources, encompassing PDFs, CSVs, and websites.

To facilitate this process, we leverage both the Langchain and html2text libraries. For our assessment of the model's capabilities, we opt to scrape information from the Wikipedia page on gravity, considering both Arabic and English versions. This dual-language approach ensures a diverse dataset, allowing us to thoroughly evaluate the model's proficiency in handling multilingual content and extracting meaningful insights.
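As an illustration, here is a minimal sketch of fetching and cleaning the two Wikipedia pages with requests and html2text; the URLs are illustrative, and the notebook may use a different Langchain loader for the same purpose.

import requests
import html2text

urls = [
    "https://en.wikipedia.org/wiki/Gravity",
    "https://ar.wikipedia.org/wiki/جاذبية",
]

converter = html2text.HTML2Text()
converter.ignore_links = True  # keep only the readable text, drop hyperlinks

raw_text = "\n\n".join(converter.handle(requests.get(url).text) for url in urls)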


Step 2: Data Splitting & chunks creation with Langchain:

To streamline the handling of website data from the Wikipedia page, we employed Langchain's RecursiveCharacterTextSplitter.

This powerful tool enabled us to efficiently split the retrieved content into smaller, manageable chunks. This step is pivotal as it prepares the data for embedding and storage in a vector store. By breaking down the information into more digestible units, we enhance the model's ability to comprehend and generate nuanced responses based on the intricacies of the input.
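A minimal sketch of this splitting step, assuming raw_text holds the page text from the previous step (the chunk sizes are illustrative, not the notebook's exact values):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=100,  # overlap to preserve context across chunk boundaries
)
chunks = splitter.split_text(raw_text)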

Step 3: Gemini Embedding Mastery:


For the embedding phase, we harnessed the power of the Google Gemini embedding model, specifically utilizing the embedding-001 variant. This model played a pivotal role in embedding all the previously processed data chunks, ensuring a rich representation of the information.
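A minimal sketch of creating this embedding model with the langchain-google-genai integration (the API key is a placeholder):

from langchain_google_genai import GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key="YOUR_GEMINI_API_KEY",
)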

Step 4: Vector Store with Langchain DocArrayInMemorySearch:

To efficiently store and organize these embeddings, we employed Langchain's vector store functionality, leveraging the DocArrayInMemorySearch from the Langchain vectorstores.

This strategic combination not only facilitates seamless storage of the embedded data but also sets the stage for streamlined querying and retrieval. Now, with our chunks embedded and securely stored, they are poised for efficient retrieval as the project progresses.
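A minimal sketch of this storage step, assuming chunks and embeddings from the previous sketches (DocArrayInMemorySearch also requires the docarray package):

from langchain.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_texts(chunks, embedding=embeddings)
retriever = vectorstore.as_retriever()  # used later to fetch the most relevant chunks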


Step 5: Prompt Injection & Results Harvest from Gemini Model:


In the pursuit of generating precise and contextually rich answers, our approach involves leveraging the vector store retriever to extract the top chunks deemed most relevant to address user queries. This crucial step ensures that the context necessary for a comprehensive response is readily available.

Subsequently, employing the versatile capabilities of Langchain, we construct a seamless workflow. The user's question and the retrieved context are seamlessly passed through a Langchain chain, which incorporates a meticulously designed prompt template. This template plays a crucial role in structuring the input for the Google Gemini model.

This integrated process sets the stage for the Google Gemini model to perform prompt injection, effectively generating answers that draw upon the contextual information stored in the vectorized chunks. Through this methodical approach, we aim to provide users with accurate and insightful responses tailored to their inquiries.
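The sketch below wires these pieces together with a simple Langchain chain, reusing the retriever from the previous step; the prompt template wording is an assumption, not the exact template used in the notebook.

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

llm = ChatGoogleGenerativeAI(model="gemini-pro", google_api_key="YOUR_GEMINI_API_KEY")

prompt = PromptTemplate.from_template(
    """Answer the question using only the context below.

Context:
{context}

Question: {question}"""
)

# Retrieve relevant chunks, inject them into the prompt, and generate the answer.
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("ما هي الجاذبية؟"))  # "What is gravity?"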

My Personal Opinion:

In our evaluation, the model showcases impressive capabilities and yields outstanding results when it comes to English.
However, the performance takes a hit when dealing with Arabic content. This discrepancy can be attributed to the limitations of the embedding model and the retriever, which struggle to retrieve the relevant context needed to answer Arabic user queries effectively.
It's worth considering the adoption of a more advanced embedding model, possibly a multilingual one, to enhance results in Arabic. This adjustment could potentially address the current limitations and improve the overall performance for a more robust user experience.

A task for you!

For a hands-on exploration, consider experimenting with alternative tools to enhance the performance of the model.
Try integrating a different embedding model, perhaps a multilingual one from the HuggingFace library. Additionally, explore the use of an alternative vector store, like Chroma DB, to store and retrieve embedded data. After making these adjustments, compare the results with our current setup. Your findings could provide valuable insights into optimizing the system for improved performance and responsiveness.

Finally

To explore the Python code used in this project, visit my GitHub 
Additionally, don't miss our YouTube video for a visual walkthrough of our journey. 
I'm always eager to connect, so feel free to reach out to me on LinkedIn

Thank you, and stay tuned for more captivating projects and insights!



Tuesday, September 5

Fine-Tuning Large Language Models for Specialized Arabic Task


I. Introduction: Large Language Models for Arabic Tags Generation

In this blog post, our primary focus will be on the process of fine-tuning four different large language models (LLMs) using an Arabic dataset. We'll delve into the intricacies of adapting these models to perform specialized tasks in Arabic natural language processing. 

The good news is that you won't need any complex setup; a Google Colab notebook will suffice for this entire workflow, making it accessible and efficient for anyone interested in exploring the world of LLM fine-tuning.

1. Task Overview: Tags Generation

In this task, we explore the remarkable capabilities of different open-source large language models (LLMs) in understanding and generating Arabic words. 

Our objective is straightforward: to use LLMs to automatically generate descriptive tags for Arabic quotes. 
This task not only demonstrates the linguistic prowess of LLMs but also showcases their potential in Arabic language applications.

2. Large Language Models for the Challenge

In this section, we're gearing up to put four remarkable language models to the test, and the best part is that they're all readily available on the HuggingFace library. 



Here's a quick introduction to each one:

1. RedPajama ([Link]): RedPajama is developed by Togethercomputer.


2. Dolly V2 ([Link]): Dolly V2 is developed by Databricks.


3. OPT ([Link]): OPT was developed by Facebook (Meta).


  4. GPT Neo 2.7B ([Link]): GPT Neo 2.7B is developed by EleutherAI.


An important point to note is that all of these language models weren't initially tailored for Arabic language tasks. Their exposure to Arabic data might be limited in comparison. This presents an exciting challenge for us as we explore their adaptability and potential in the context of Arabic tags generation. 

II. Fine-tuning strategy and the Used Dataset

In our pursuit of optimizing language model fine-tuning for specialized Arabic tasks, we employ a cutting-edge technique known as 4-bit quantization. This method, applied here through Quantized Low-Rank Adaptation (QLoRA), offers a game-changing advantage. 

1. Fine-Tuning on low resources: 4bit-Quantization

The 4-bit quantization technique allows us to fine-tune large language models (LLMs) using just a single GPU while preserving the high performance typically associated with full 16-bit models. To put it into perspective, this groundbreaking approach signifies a pivotal shift in the AI landscape, as it empowers us to achieve remarkable results efficiently and with reduced computational demands.
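For context, the sketch below shows how 4-bit loading is typically configured with bitsandbytes in the transformers library; the settings mirror the QLoRA recipe, and the model name is one of the four candidates introduced above.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # use torch.float16 on older GPUs
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "togethercomputer/RedPajama-INCITE-Instruct-3B-v1",
    quantization_config=bnb_config,
    device_map="auto",
)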

If you're eager to delve deeper into the intricacies of this remarkable technique, we invite you to explore it further. For a comprehensive understanding of 4-bit quantization and the QLoRA method, we encourage you to visit the following links:

Link 1: PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware

Link 2: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA


2. Fueling the Models: The Arabic Dataset and Where to Find It

Fine-tuning quantized LLMs is a powerful technique for adapting pre-trained language models to specific tasks or datasets. 

Fine-tuning the quantized model on the target task or dataset allows us to adapt the model to the new domain, improving its performance. With the right training procedure and hyperparameters, we can create highly performant quantized LLMs that are tailored to our specific needs.

To achieve this, I've curated a substantial dataset containing Arabic quotes along with their corresponding tags. It's open source and readily accessible on the HuggingFace library. This dataset serves as a valuable resource for training and fine-tuning language models for Arabic tags generation.



III. Comparative Study of Results and Model Hosting

1. Crafting the Metric: Evaluating LLM Performance

To assess the performance of each language model, we employed a tailored metric that we designed specifically for our evaluation. This custom metric serves as a vital yardstick in gauging the effectiveness of the models in generating Arabic tags for quotes. 

This metric takes two lists of Arabic strings (the generated tags and the original validation tags), preprocesses them, calculates their Jaccard similarity, and returns a normalized score ranging from 0 to 1, where 1 indicates a perfect match and 0 indicates no similarity.

By creating this evaluation criterion, we ensure that the assessment aligns perfectly with our unique task, enabling a more precise and informative evaluation of each LLM's performance.
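A minimal sketch of such a metric, assuming the tags inside each string are comma-separated:

import re

def jaccard_similarity(generated, reference):
    """Jaccard similarity between two comma-separated Arabic tag strings."""
    def to_tag_set(text):
        # Split on Arabic and Latin commas, then strip whitespace.
        return {t.strip() for t in re.split(r"[،,]", text) if t.strip()}

    gen_tags, ref_tags = to_tag_set(generated), to_tag_set(reference)
    union = gen_tags | ref_tags
    return len(gen_tags & ref_tags) / len(union) if union else 1.0

def average_jaccard(generated_list, reference_list):
    """Average the per-example scores over two lists of tag strings."""
    scores = [jaccard_similarity(g, r) for g, r in zip(generated_list, reference_list)]
    return sum(scores) / len(scores)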


2. Unveiling Performance: Results of Each LLM

Now, it's time to unveil the results, and we have a clear winner! RedPajama-INCITE-Instruct-3B-v1 achieved the highest score. However, it's worth noting that the competition was extremely close. 



This closeness can be attributed to a couple of factors. First, the models we used are relatively small in size (all 4 models are under 3 billion parameters). Second, they haven't had extensive exposure to Arabic data during their pretraining phase. 

These two factors combined make the Arabic language challenge even more remarkable, as it underscores the models' adaptability and their ability to perform well despite limited exposure to Arabic data.


3. From Training to Deployment: Hosting the Winning Model

Hosting your model on the HuggingFace library is surprisingly straightforward and can be achieved with just a few lines of code. All you'll need is your HuggingFace token. 
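A minimal sketch of this step (the repository name and token are placeholders, and model / tokenizer are the fine-tuned objects from training):

from huggingface_hub import login

login(token="hf_...")  # your Hugging Face write token

model.push_to_hub("your-username/redpajama-arabic-tags")
tokenizer.push_to_hub("your-username/redpajama-arabic-tags")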


Once your model is deployed, you can immediately start using it and even share it with your friends and colleagues for testing purposes. 



Detailed instructions for this process are provided in the Python notebook accompanying this blog post. If you'd like to explore more about HuggingFace model hosting, you can find additional information in this link: Deploy LLMs with Hugging Face Inference Endpoints


Conclusion

In summary, our exploration of large language models (LLMs) for Arabic tags generation has yielded impressive results. Despite model size constraints and limited Arabic data exposure, our top-performing model, RedPajama-INCITE-Instruct-3B-v1, showcased remarkable adaptability. The use of 4-bit quantization with QLoRA added efficiency to our process.

To explore the Python code used in this project, visit my GitHub repository
Additionally, don't miss our YouTube video for a visual walkthrough of our journey. 
I'm always eager to connect, so feel free to reach out to me on LinkedIn

Thank you, and stay tuned for more captivating projects and insights!