Tuesday, September 5

Fine-Tuning Large Language Models for a Specialized Arabic Task


I. Introduction: Large Language Models for Arabic Tags Generation

In this blog post, our primary focus will be on the process of fine-tuning four different large language models (LLMs) using an Arabic dataset. We'll delve into the intricacies of adapting these models to perform specialized tasks in Arabic natural language processing. 

The good news is that you won't need any complex setup: a Google Colab notebook will suffice for this entire workflow, making it accessible and efficient for anyone interested in exploring the world of LLM fine-tuning.
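As a rough sketch of that setup, the notebook only needs the usual Hugging Face stack installed. The package list below is an assumption of what this workflow relies on, with versions left unpinned:

```python
# Run once at the top of the Colab notebook (versions intentionally unpinned;
# pin them if your runtime hits compatibility issues).
!pip install -q transformers datasets accelerate peft bitsandbytes
```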

1. Task Overview: Tags Generation

In this task, we explore the remarkable capabilities of different open-source large language models (LLMs) in understanding and generating Arabic words. 

Our objective is straightforward: to use LLMs to automatically generate descriptive tags for Arabic quotes. 
This task not only demonstrates the linguistic prowess of LLMs but also showcases their potential in Arabic language applications.
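To make the task concrete, here is a tiny illustrative input/output pair. The prompt wording and the example quote are assumptions chosen for illustration, not the exact template from the accompanying notebook:

```python
# Illustrative prompt template for tags generation (wording is an assumption).
PROMPT_TEMPLATE = "### Quote:\n{quote}\n\n### Tags:\n{tags}"

example = PROMPT_TEMPLATE.format(
    quote="العلم نور والجهل ظلام",   # "Knowledge is light and ignorance is darkness"
    tags="العلم، الحكمة، التعلم",     # "knowledge, wisdom, learning"
)
print(example)
```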

2. Large Language Models for the Challenge

In this section, we're gearing up to put four remarkable language models to the test, and the best part is that they're all readily available on the HuggingFace Hub. 



Here's a quick introduction to each one:

1. RedPajama ([Link]): RedPajama is developed by Togethercomputer.


2. Dolly V2 ([Link]): Dolly V2 is developed by Databricks.


3. OPT ([Link]): OPT was developed by Facebook (Meta).


4. GPT Neo 2.7B ([Link]): GPT Neo is an impressive model from EleutherAI.


An important point to note is that none of these language models was originally tailored for Arabic language tasks, and their exposure to Arabic data during pretraining was likely limited. This presents an exciting challenge for us as we explore their adaptability and potential in the context of Arabic tags generation. 
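All four models share the same causal-language-model interface in the transformers library. The sketch below lists the Hub checkpoints I assume correspond to the models above (the exact OPT and Dolly variants are assumptions) and runs a quick half-precision generation as a sanity check:

```python
import torch
from transformers import pipeline

# Hub checkpoint IDs assumed for the four candidates; swap in the exact
# checkpoints you want to compare (the OPT and Dolly variants are assumptions).
CANDIDATE_MODELS = [
    "togethercomputer/RedPajama-INCITE-Instruct-3B-v1",
    "databricks/dolly-v2-3b",
    "facebook/opt-2.7b",
    "EleutherAI/gpt-neo-2.7B",
]

# Quick sanity check: load one candidate in half precision and generate a few tokens.
generator = pipeline(
    "text-generation",
    model=CANDIDATE_MODELS[0],
    torch_dtype=torch.float16,
    device_map="auto",
)
prompt = "### Quote:\nالعلم نور والجهل ظلام\n\n### Tags:\n"
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```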

II. Fine-Tuning Strategy and the Dataset Used

In our pursuit of optimizing language model fine-tuning for specialized Arabic tasks, we employ a technique known as 4-bit quantization, applied here through Quantized Low-Rank Adaptation (QLoRA). This method offers a decisive advantage when resources are limited. 

1. Fine-Tuning on Low Resources: 4-bit Quantization

The 4-bit quantization technique allows us to fine-tune large language models (LLMs) on a single GPU while preserving performance close to that of full 16-bit fine-tuning. In practice, this dramatically reduces memory and compute requirements, which is exactly what makes this workflow feasible in a Google Colab notebook.
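To make this concrete, here is a minimal sketch of what the QLoRA setup looks like with the transformers, bitsandbytes, and peft libraries. The target checkpoint and the LoRA hyperparameters are illustrative assumptions, not the exact values used in the accompanying notebook:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "togethercomputer/RedPajama-INCITE-Instruct-3B-v1"

# 4-bit NF4 quantization config (the QLoRA recipe from bitsandbytes).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed later when padding batches

# Load the base model in 4-bit on the single available GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Freeze the quantized weights and attach small trainable LoRA adapters.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,              # adapter rank (illustrative value)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```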

If you're eager to delve deeper into 4-bit quantization and the QLoRA method, we encourage you to visit the following links:

Link 1: PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware

Link 2: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA


2. Fueling the Models: The Arabic Dataset and Where to Find It

Fine-tuning quantized LLMs is a powerful way to adapt pre-trained language models to a specific task or dataset. Training the quantized model on the target task adapts it to the new domain and improves its performance there; with the right training procedure and hyperparameters, we can build highly performant quantized LLMs tailored to our specific needs.

To achieve this, I've curated a substantial dataset containing Arabic quotes along with their corresponding tags. It's open source and readily accessible on the HuggingFace Hub. This dataset serves as a valuable resource for training and fine-tuning language models for Arabic tags generation.
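Below is a rough sketch of how such a dataset can be pulled from the Hub and fed to a standard causal-language-modeling training loop on top of the quantized, LoRA-wrapped model from the previous section. The dataset repository name, the column names ("quote" and "tags"), and the hyperparameters are placeholders and assumptions; substitute the actual dataset and the values from the accompanying notebook:

```python
from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Placeholder dataset ID and column names; replace with the actual Arabic
# quotes-and-tags dataset referenced in this post.
dataset = load_dataset("your-username/arabic-quotes-and-tags", split="train")

def tokenize_example(example):
    # Join the quote and its tags into one training string (format is an assumption).
    text = f"### Quote:\n{example['quote']}\n\n### Tags:\n{example['tags']}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize_example, remove_columns=dataset.column_names)

training_args = TrainingArguments(
    output_dir="redpajama-3b-arabic-tags",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=3,
    fp16=True,
    logging_steps=20,
)

trainer = Trainer(
    model=model,  # the 4-bit model with LoRA adapters from the sketch above
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```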



III. Comparative Study of Results and Model Hosting

1. Crafting the Metric: Evaluating LLM Performance

To assess the performance of each language model, we designed a custom metric tailored specifically to this task. It serves as the yardstick for gauging how effectively each model generates Arabic tags for quotes.

The metric takes two lists of Arabic strings (one containing the generated tags for each quote, the other containing the original reference tags), preprocesses them, and computes their Jaccard similarity. It returns a normalized score between 0 and 1, where 1 indicates a perfect match and 0 indicates no overlap.

By crafting this evaluation criterion ourselves, we ensure that the assessment aligns with our specific task, enabling a more precise and informative comparison of each LLM's performance.
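Here is a minimal sketch of such a metric. The exact preprocessing (how tag strings are split and normalized) is an assumption, but the core Jaccard computation follows the description above:

```python
import re

def average_jaccard(generated: list[str], reference: list[str]) -> float:
    """Average Jaccard similarity between generated and reference tag strings.

    Each element is one string of tags for a quote. The result lies in [0, 1]:
    1 means the tag sets match exactly, 0 means they share no tags.
    """
    def to_tag_set(text: str) -> set[str]:
        # Split on Arabic or Latin commas and strip whitespace around each tag.
        return {t.strip() for t in re.split(r"[،,]", text) if t.strip()}

    scores = []
    for gen, ref in zip(generated, reference):
        g, r = to_tag_set(gen), to_tag_set(ref)
        union = g | r
        scores.append(len(g & r) / len(union) if union else 1.0)
    return sum(scores) / len(scores) if scores else 0.0

# Example: partial overlap on the first quote, exact match on the second.
print(average_jaccard(["الحكمة، العلم", "الصبر"], ["العلم، المعرفة", "الصبر"]))  # ≈ 0.67
```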


2. Unveiling Performance: Results of Each LLM

Now, it's time to unveil the results, and we have a clear winner! RedPajama-INCITE-Instruct-3B-v1 achieved the highest score. However, it's worth noting that the competition was extremely close. 



This closeness can be attributed to a couple of factors. First, the models we used are relatively small in size (all 4 models are under 3 billion parameters). Second, they haven't had extensive exposure to Arabic data during their pretraining phase. 

Taken together, these two factors make the results all the more remarkable, as they underscore the models' adaptability and their ability to perform well despite limited exposure to Arabic data.


3. From Training to Deployment: Hosting the Winning Model

Hosting your model on the HuggingFace Hub is surprisingly straightforward and can be achieved with just a few lines of code. All you'll need is your HuggingFace token. 
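For example, pushing the fine-tuned model and tokenizer to the Hub can look like the sketch below. The repository name is a placeholder, and you would paste your own token (or use the notebook login widget) where indicated:

```python
from huggingface_hub import login

# Authenticate with your personal Hugging Face access token (write access).
login(token="hf_xxx")  # placeholder; never commit a real token to a public notebook

# Push the fine-tuned model (with its LoRA adapters) and the tokenizer to the Hub.
repo_id = "your-username/redpajama-3b-arabic-tags"  # placeholder repository name
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```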


Once your model is deployed, you can immediately start using it and even share it with your friends and colleagues for testing purposes. 



Detailed instructions for this process are provided in the Python notebook accompanying this blog post. If you'd like to explore more about HuggingFace model hosting, you can find additional information in this link: Deploy LLMs with Hugging Face Inference Endpoints


IV. Conclusion

In summary, our exploration of large language models (LLMs) for Arabic tags generation has yielded impressive results. Despite model size constraints and limited Arabic data exposure, our top-performing model, RedPajama-INCITE-Instruct-3B-v1, showcased remarkable adaptability. The use of 4-bit quantization with QLoRA added efficiency to our process.

To explore the Python code used in this project, visit my GitHub repository.
Additionally, don't miss our YouTube video for a visual walkthrough of our journey.
I'm always eager to connect, so feel free to reach out to me on LinkedIn.

Thank you, and stay tuned for more captivating projects and insights!