
Finetune & Run Llama3.1 Locally

 


Released on July 23, 2024, Llama 3.1 marks a significant leap in the world of AI, introducing the first open-source model that can compete with the top AI systems. The Llama 3.1 8B model, part of this groundbreaking release, is designed with enhanced multilingual capabilities, extended context length, and improved reasoning skills. It’s built to handle advanced tasks like long-form text summarization, multilingual conversations, and coding assistance.

In this article, I'll guide you through the process of downloading the Llama 3.1 8B model and running it locally on your machine, which enables offline model inference. We'll also dive into fine-tuning the model for a specific task, tailoring its capabilities to meet your unique needs. Finally, we'll compare the performance of the base model with the fine-tuned version to see whether these adjustments enhance its effectiveness.

At the end of this blog, we'll provide all the necessary resources, including the code, a link to the dataset used, and access to the fine-tuned model, to support your own experimentation and implementation.

1- Download and Run Llama 3.1 Locally

LM Studio Interface

To start working with the Llama 3.1 8B model locally, we'll use LM Studio, a desktop tool designed for handling large language models (LLMs) with ease. LM Studio provides a user-friendly interface that simplifies downloading and running LLMs, including models in the GGUF format, directly from the Hugging Face Hub.

Click here to download LM Studio


Quantization & GGUF Models

When choosing a model, it's crucial to consider your local machine's computational power. LM Studio allows you to download various GGUF models in different sizes and configurations. Quantization is a key technique to help with this, as it reduces the model's size and computational requirements, making it more suitable for machines with limited resources. 
For this tutorial, we used the 4-bit quantized version of the Llama 3.1 8B model. This version is specifically optimized to run efficiently on machines with limited resources. On my setup, which includes a GPU with 4GB of VRAM, this quantized model performs exceptionally well, providing a good balance between performance and resource usage. By opting for this version, you can ensure smooth operation and effective utilization of your local machine's capabilities.
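As a rough back-of-the-envelope check: an 8B-parameter model stored at 4 bits per weight needs about 8 × 10⁹ × 0.5 bytes ≈ 4 GB just for the weights (plus some overhead for the KV cache and activations), whereas the same model in FP16 would need around 16 GB. That is why a 4-bit GGUF is roughly the point where an 8B model becomes practical on a modest consumer GPU, possibly with part of the model offloaded to CPU RAM.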


Local server

Additionally, LM Studio offers the option to create a local server that mimics the OpenAI library code. This server setup allows you to deploy any model that your machine can handle and integrate it seamlessly with any code that uses the OpenAI library. This feature not only supports offline usage but also provides greater flexibility and control over model performance and integration.
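As a quick illustration, here is what pointing existing OpenAI-client code at the local server could look like. This is a minimal sketch that assumes LM Studio's default address of http://localhost:1234/v1; the model name below is a placeholder for whatever model you have loaded in LM Studio.

from openai import OpenAI

# Point the standard OpenAI client at the LM Studio local server.
# http://localhost:1234/v1 is the default address; adjust it if you changed the port.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # any non-empty key works locally

response = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct-GGUF",  # placeholder: use the model identifier shown in LM Studio
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the main ideas of quantization in two sentences."},
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)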


2- Fine-Tuning Llama 3.1

Now, in this section, we'll focus on fine-tuning the Llama 3.1 8B model to enhance its capabilities in understanding Arabic for an instruction-based task. 

This fine-tuning process is designed to make the model more proficient in handling Arabic instructions, improving its overall performance in this language. To achieve this, we'll use the Unsloth Python library, which provides a comprehensive set of tools for training and optimizing models.

For this task, we’ve constructed a specialized dataset tailored specifically to enhance the model’s Arabic language understanding. This dataset is carefully designed to address the nuances and complexities of Arabic instruction, ensuring that the fine-tuning process is both effective and precise.

Finetuning Dataset

The dataset was created to support the fine-tuning of language models on Arabic instructions. 
It consists of 11,000 rows, with 10,000 examples for training and 1,000 examples for evaluation. This dataset combines both English and Arabic instructions, providing a comprehensive resource for improving multilingual understanding. It follows the Alpaca prompt style, including fields for instruction, input, and output, which helps in fine-tuning models to handle and generate responses based on various instructional prompts effectively.
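To make that structure concrete, a row in this Alpaca style looks roughly like the following (a hypothetical example for illustration, not an actual record from the dataset):

# Hypothetical example row (illustrative only, not taken from the dataset)
example_row = {
    "instruction": "ترجم الجملة التالية إلى اللغة الإنجليزية.",   # "Translate the following sentence into English."
    "input": "التعلم المستمر هو مفتاح التقدم.",                   # "Continuous learning is the key to progress."
    "output": "Continuous learning is the key to progress.",
}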

Dataset link on HuggingFace


Finetuning Task

Supervised Fine-Tuning (SFT) is a technique used to improve and customize pre-trained language models. It involves retraining a base model on a smaller, specialized dataset that includes instructions and their corresponding answers. This process helps transform a general model into one that can follow specific instructions and provide accurate responses. SFT can boost the model’s performance, add new knowledge, or adjust it for particular tasks or fields. Additionally, after fine-tuning, the model can be further refined to better align with specific preferences.

However, SFT has its limitations. It works best when building on existing knowledge in the base model. Learning entirely new information, such as a new language, can be challenging and may lead to hallucinations.

There are three main SFT techniques: full fine-tuning, Low-Rank Adaptation (LoRA), and Quantization-aware Low-Rank Adaptation (QLoRA). Full fine-tuning involves retraining all the parameters of a model and, while effective, is resource-heavy and can cause the model to lose some of its previous knowledge. LoRA is a more efficient method that adds small adapters to the model, reducing memory usage and training time without altering the original parameters.
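To give a feel for why LoRA is so much cheaper, here is a minimal sketch in plain PyTorch (independent of Unsloth or PEFT) that freezes a linear layer and adds a trainable low-rank update W·x + (alpha/r)·B·A·x. With r much smaller than the layer dimensions, only a tiny fraction of the parameters is trained.

import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (conceptual sketch)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # original weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # d_in -> r
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # r -> d_out
        nn.init.zeros_(self.lora_B.weight)     # zero-init B so the update starts as a no-op
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

layer = LoRALinear(nn.Linear(4096, 4096), r=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")  # ~131k trainable vs ~16.9M total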



QLoRA builds on LoRA by adding quantization to save even more memory, making it particularly useful when GPU memory is limited. Although QLoRA requires more time to train, its memory savings make it a good option for scenarios with restricted resources. In this blog, we will use QLoRA to fine-tune the Llama 3.1 8B model, taking advantage of its efficiency to make effective adjustments while working within the limits of available GPU memory.

To fine-tune the Llama 3.1 8B model efficiently, we'll use the Unsloth library developed by Daniel and Michael Han. Unsloth stands out for its custom kernels, which allow for up to 2x faster training and 60% less memory usage compared to other methods. This efficiency is especially valuable in constrained environments like Google Colab. However, it's worth noting that Unsloth currently supports only single-GPU setups. For multi-GPU configurations, alternatives like TRL and Axolotl, which also use Unsloth as a backend, are recommended.

First, we install the library:

%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.27" "trl<0.9.0" peft accelerate bitsandbytes

Then we choose the base model we want to fine-tune (Meta-Llama-3.1-8B):

from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3-mini-4k-instruct",          # Phi-3 2x faster!d
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = None,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
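Because the adapters are attached through the PEFT machinery, it's worth checking how few weights will actually be trained. Assuming the returned model exposes PEFT's usual helper, something like this prints the trainable fraction:

# Print how many parameters are trainable after attaching the LoRA adapters.
model.print_trainable_parameters()

# Or compute it manually:
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"{trainable:,} trainable / {total:,} total ({100 * trainable / total:.2f}%)")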

Now, in this step, we format each row of our dataset with the Alpaca prompt template to create the training set:

alpaca_prompt = """Below is an instruction that describes a task,
paired with an input that provides further context.
Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("AhmedBou/Arabic_instruction_dataset_for_llm_ft", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)
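Before training, it's worth printing one formatted example to confirm that the Alpaca template and the EOS token were applied as expected:

# Quick sanity check: inspect the first formatted training example.
print(dataset[0]["text"])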

Finally, we train the model. In this step, I set max_steps = 1250, which corresponds to one full training epoch.

To understand why 1,250 steps correspond to 1 epoch, let's consider the training setup:

  • Batch Size: The per_device_train_batch_size is set to 2. This means that each training step processes 2 examples from the dataset.
  • Gradient Accumulation: The gradient_accumulation_steps is set to 4. This means gradients are accumulated over 4 steps before applying an update. Essentially, each step updates the model based on 8 examples (2 examples per batch * 4 accumulation steps).
  • Dataset Size: Assume our dataset has 10,000 examples.

To complete one epoch, where the model sees every example in the dataset once, the number of training steps needed is calculated as follows:

Steps per Epoch = Dataset Size / Effective Batch Size

Where the effective batch size is:

Effective Batch Size = Per-Device Batch Size × Gradient Accumulation Steps

Plugging in the numbers:

Effective Batch Size = 2 × 4 = 8
Steps per Epoch = 10,000 / 8 = 1,250


from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 1250,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

trainer_stats = trainer.train()  # start the fine-tuning run (1,250 steps)


Finally, after completing the fine-tuning, I saved the LoRA adapters and the GGUF version of the model to Hugging Face. This allows us to seamlessly integrate and use them with LM Studio. 
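For reference, the export step might look roughly like this (a sketch based on Unsloth's saving helpers; the repository names below are placeholders, and the GGUF conversion merges the adapters into the base weights before quantizing, so it can take a while):

# Save the LoRA adapters locally and push them to the Hugging Face Hub.
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
model.push_to_hub("your-username/llama-3.1-8b-arabic-lora", token="hf_...")      # placeholder repo name
tokenizer.push_to_hub("your-username/llama-3.1-8b-arabic-lora", token="hf_...")

# Export a quantized GGUF file (usable in LM Studio) and push it as well.
model.push_to_hub_gguf(
    "your-username/llama-3.1-8b-arabic-gguf",  # placeholder repo name
    tokenizer,
    quantization_method="q4_k_m",
    token="hf_...",
)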
You can easily import the LoRA adapters and perform inference directly within the same Colab notebook.

inputs = tokenizer(
[
    alpaca_prompt.format(
        "قم بصياغة الجملة الإنجليزية التالية باللغة العربية.", # instruction
        "We hope that the last cases will soon be resolved through the mechanisms established for this purpose.", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
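If generation feels slow in the notebook, Unsloth also provides an inference helper that switches the model into a faster generation mode; calling it once before model.generate should be enough (FastLanguageModel is already imported above):

FastLanguageModel.for_inference(model)  # enable Unsloth's faster inference mode before generating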



3- Comparing Base Model vs. Fine-Tuned Model Results


To evaluate the performance of the base model versus the fine-tuned model, we used LM Studio to run inference on the base model locally as a server. For the fine-tuned model, we performed inference using a Colab notebook by importing the LoRA adapters that we had trained.

We used Gemini-1.5 as a judge to assess which model's outputs were better aligned with the ground truth. This evaluation was based on 100 samples drawn from the dataset's 1,000 evaluation examples.
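The judging step itself can be scripted; the sketch below shows the general idea using the google-generativeai client (the prompt wording and helper name here are my own illustration, not the exact script used for the evaluation):

import google.generativeai as genai

genai.configure(api_key="YOUR_GEMINI_API_KEY")      # placeholder key
judge = genai.GenerativeModel("gemini-1.5-flash")   # any Gemini 1.5 variant can serve as the judge

def judge_pair(instruction, ground_truth, answer_a, answer_b):
    """Ask the judge which answer is closer to the ground truth; expects 'A' or 'B' back."""
    prompt = (
        "You are grading two model answers against a reference answer.\n"
        f"Instruction: {instruction}\n"
        f"Reference answer: {ground_truth}\n"
        f"Answer A: {answer_a}\n"
        f"Answer B: {answer_b}\n"
        "Reply with a single letter, A or B, for the answer closest to the reference."
    )
    return judge.generate_content(prompt).text.strip()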



The results indicated that the base model outperformed the fine-tuned model, generating better responses in 54 out of 100 examples.

Verdict: The base model demonstrated strong capabilities in handling Arabic instructions, such as translation and generation tasks.
The results suggest that, in this case, fine-tuning did not provide a significant improvement over the base model. Therefore, for tasks involving Arabic instructions, the base model itself is quite effective and may not require additional fine-tuning.

Resources:

Dataset used:  https://huggingface.co/datasets/AhmedBou/Arabic_instruction_dataset_for_llm_ft
Github code: https://github.com/BoulahiaAhmed/Finetune-Run-Llama3.1-Locally
HuggingFace repo: https://huggingface.co/AhmedBou
