Released on July 23, 2024, Llama 3.1 marks a significant leap in the world of AI, introducing the first open-source model that can compete with the top AI systems. The Llama 3.1 8B model, part of this groundbreaking release, is designed with enhanced multilingual capabilities, extended context length, and improved reasoning skills. It’s built to handle advanced tasks like long-form text summarization, multilingual conversations, and coding assistance.
In this article, I'll guide you through downloading the Llama 3.1 8B model and running it locally on your machine, enabling offline model inference. We'll also dive into fine-tuning the model for a specific task, tailoring its capabilities to meet your unique needs. Finally, we'll compare the performance of the base model with the fine-tuned version to see how these adjustments enhance its effectiveness.
At the end of this blog, we'll provide all the necessary resources, including the code, links to the dataset used, and access to the fine-tuned model, to support your own experimentation and implementation.
1- Download and Run Llama 3.1 locally
LM Studio Interface
To start working with the Llama 3.1 8B model locally, we'll use LM Studio, a powerful tool designed for handling large language models (LLMs) with ease. LM Studio provides a user-friendly interface that simplifies downloading and running LLMs in the GGUF format directly from the Hugging Face hub.
Click here to download LM Studio
Quantization & GGUF models
When choosing a model, it's crucial to consider your local machine's computational power. LM Studio allows you to download various GGUF models in different sizes and configurations. Quantization is a key technique to help with this, as it reduces the model's size and computational requirements, making it more suitable for machines with limited resources.
For this tutorial, we used the 4-bit quantized version of the Llama 3.1 8B model. This version is specifically optimized to run efficiently on machines with limited resources. On my setup, which includes a GPU with 4GB of VRAM, this quantized model performs exceptionally well, providing a good balance between performance and resource usage. By opting for this version, you can ensure smooth operation and effective utilization of your local machine's capabilities.
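As a rough back-of-the-envelope check (ignoring runtime overhead such as the KV cache and activations), the weight memory of a 4-bit model is about 8 × 10⁹ parameters × 0.5 bytes/parameter ≈ 4 GB, versus roughly 16 GB for the full 16-bit weights. Since LM Studio can offload part of the model to CPU RAM, even a 4GB VRAM card can run the 4-bit build, while the unquantized model would be far out of reach.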
Local server
Additionally, LM Studio offers the option to run a local server that exposes an OpenAI-compatible API. This setup allows you to deploy any model that your machine can handle and integrate it seamlessly with any code written against the OpenAI client library. This feature not only supports offline usage but also provides greater flexibility and control over model performance and integration.
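As a quick illustration, here is a minimal sketch of such a client, assuming LM Studio's server is running on its default port (1234); the model name and prompts are placeholders, and the API key can be any string since the local server does not check it:

```python
# Minimal sketch: calling LM Studio's OpenAI-compatible local server.
from openai import OpenAI

# The base_url points at the local server; the api_key is required by the
# client library but ignored by LM Studio.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whichever model you loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain quantization in one paragraph."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```

Because the interface matches OpenAI's, switching existing code to the local model is usually just a matter of changing the `base_url`.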
2- Fine-Tuning Llama 3.1
Now, in this section, we'll focus on fine-tuning the Llama 3.1 8B model to enhance its capabilities in understanding Arabic for an instruction-based task.
This fine-tuning process is designed to make the model more proficient in handling Arabic instructions, improving its overall performance in this language. To achieve this, we'll use the Unsloth AI Python library, which provides a comprehensive set of tools for training and optimizing models.
For this task, we’ve constructed a specialized dataset tailored specifically to enhance the model’s Arabic language understanding. This dataset is carefully designed to address the nuances and complexities of Arabic instruction, ensuring that the fine-tuning process is both effective and precise.
Fine-Tuning Dataset
The dataset was created to support the fine-tuning of language models on Arabic instructions.
It consists of 11,000 rows, with 10,000 examples for training and 1,000 examples for evaluation. This dataset combines both English and Arabic instructions, providing a comprehensive resource for improving multilingual understanding. It follows the Alpaca prompt style, including fields for instruction, input, and output, which helps in fine-tuning models to handle and generate responses based on various instructional prompts effectively.
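For reference, the standard Alpaca-style template looks like the sketch below (the exact header wording may differ slightly in our dataset; the example row is illustrative):

```python
# Standard Alpaca-style prompt template; field names match the dataset columns.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""

# Formatting one (hypothetical) training row:
example = {
    "instruction": "Translate the following sentence to Arabic.",
    "input": "Good morning, how are you?",
    "output": "صباح الخير، كيف حالك؟",
}
print(alpaca_prompt.format(**example))
```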
Fine-Tuning Task
Supervised Fine-Tuning (SFT) is a technique used to improve and customize pre-trained language models. It involves retraining a base model on a smaller, specialized dataset that includes instructions and their corresponding answers. This process helps transform a general model into one that can follow specific instructions and provide accurate responses. SFT can boost the model’s performance, add new knowledge, or adjust it for particular tasks or fields. Additionally, after fine-tuning, the model can be further refined to better align with specific preferences.
However, SFT has its limitations. It works best when building on existing knowledge in the base model. Learning entirely new information, such as a new language, can be challenging and may lead to hallucinations.
There are three main SFT techniques: full fine-tuning, Low-Rank Adaptation (LoRA), and Quantization-aware Low-Rank Adaptation (QLoRA). Full fine-tuning involves retraining all the parameters of a model and, while effective, is resource-heavy and can cause the model to lose some of its previous knowledge. LoRA is a more efficient method that adds small adapters to the model, reducing memory usage and training time without altering the original parameters.
QLoRA builds on LoRA by adding quantization to save even more memory, making it particularly useful when GPU memory is limited. Although QLoRA requires more time to train, its memory savings make it a good option for scenarios with restricted resources. In this blog, we will use QLoRA to fine-tune the Llama 3.1 8B model, taking advantage of its efficiency to make effective adjustments while working within the limits of available GPU memory.
To fine-tune the Llama 3.1 8B model efficiently, we'll use the Unsloth library developed by Daniel and Michael Han. Unsloth stands out for its custom kernels, which allow for up to 2x faster training and 60% less memory usage compared to other methods. This efficiency is especially valuable in constrained environments like Google Colab. However, it's worth noting that Unsloth currently supports only single-GPU setups. For multi-GPU configurations, alternatives like TRL and Axolotl, which also use Unsloth as a backend, are recommended.
First, we install the library:
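The exact command depends on your environment (Colab vs. local, CUDA version), so check the Unsloth README for the current recommendation; a typical installation looks like this:

```bash
pip install unsloth
```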
Then we choose the base model we want to fine-tune (Meta-Llama-3.1-8B):
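A minimal loading sketch with Unsloth is shown below. Passing `load_in_4bit=True` is what enables the QLoRA path discussed above; `max_seq_length` and the LoRA hyperparameters (`r`, `lora_alpha`, the target modules) are illustrative defaults, not values prescribed by this tutorial:

```python
from unsloth import FastLanguageModel

# Load the base model with 4-bit quantized weights (the QLoRA setup).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B",
    max_seq_length=2048,   # illustrative; raise it for long-form data
    load_in_4bit=True,     # 4-bit weights keep GPU memory usage low
)

# Attach small trainable LoRA adapters on top of the frozen base weights.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # adapter rank (illustrative default)
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)
```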
To understand why 1,250 steps correspond to 1 epoch, let's consider the training setup:
- Batch Size: `per_device_train_batch_size` is set to 2, meaning each training step processes 2 examples from the dataset per device.
- Gradient Accumulation: `gradient_accumulation_steps` is set to 4, meaning gradients are accumulated over 4 steps before an update is applied. Essentially, each update is based on 8 examples (2 examples per batch × 4 accumulation steps).
- Dataset Size: our training set has 10,000 examples.
To complete one epoch, where the model sees every example in the dataset once, the number of training steps needed is calculated as follows:

Steps per Epoch = Dataset Size / Effective Batch Size

Where the effective batch size is:

Effective Batch Size = `per_device_train_batch_size` × `gradient_accumulation_steps` = 2 × 4 = 8

Plugging in the numbers:

Steps per Epoch = 10,000 / 8 = 1,250
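Putting these numbers together, a training configuration might look like the sketch below, using `trl`'s `SFTTrainer` with the API available at the time of writing; the learning rate and other values not mentioned above are illustrative, not the tutorial's exact settings:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,     # 10,000 formatted Alpaca-style examples
    dataset_text_field="text",       # column containing the full prompt text
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,   # effective batch size = 2 * 4 = 8
        max_steps=1250,                  # 10,000 / 8 = one full epoch
        learning_rate=2e-4,              # illustrative; a common QLoRA choice
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```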