Code for the Medium story "Fine-tune Microsoft’s Phi-2 with QLoRA and synthetic data":
- Create a synthetic dataset from a seed of instructions
- Fine-tune Phi-2 using QLoRA

- nb_dataset.ipynb: Create a synthetic conversational dataset using a seed of riddles
- nb_qlora.ipynb: Fine-tune Phi-2 using QLoRA
nb_qlora.ipynb walks through the following steps; hedged code sketches for these steps follow the list.

- Setup and Initialization: Import the necessary libraries, set up Weights and Biases (wandb) for tracking, and initialize a unique run identifier.
- Configuration and Seeds: Set the seed for reproducibility and configure the model and dataset paths, learning rate, batch sizes, epochs, and maximum token length (the values used are listed further below).
- LoRA Configuration: Define the Low-Rank Adaptation (LoRA) configuration for parameter-efficient fine-tuning.
- Model Preparation: Load the Phi-2 model with quantization settings for 4-bit training, and resize the token embeddings to accommodate the new special tokens.
- Tokenizer Preparation: Load and configure the tokenizer, adding the special tokens needed for ChatML formatting.
- Dataset Loading and Preparation: Load the dataset from Hugging Face, split it into training and test sets, and apply ChatML formatting and tokenization.
- Data Collation: Define a collation function that turns individual samples into padded batches suitable for training.
- Training Configuration: Set up the training arguments with the chosen hyperparameters, such as batch sizes, learning rate, and gradient accumulation steps.
- Trainer Initialization: Initialize the Trainer with the model, tokenizer, training arguments, and data collator.
- Training Execution: Launch the training process, optionally with Weights and Biases tracking on the main process only.
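
The sketches below are minimal, hedged illustrations of the steps above, not the exact notebook code. Setup and seeding first; the wandb project name and the seed value are illustrative choices:

```python
from uuid import uuid4

import wandb
from accelerate import Accelerator
from transformers import set_seed

set_seed(42)                              # seed is an arbitrary, illustrative value

run_id = f"phi2-qlora-{uuid4().hex[:8]}"  # unique identifier for this run
if Accelerator().is_main_process:         # track only from the main process
    wandb.init(project="phi2-qlora", name=run_id)  # project name is an assumption
```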
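
A possible LoRA configuration. Rank, alpha, dropout, and the Phi-2 module names in target_modules are assumptions and may need adjusting for the model revision you use:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                                    # rank of the low-rank update matrices
    lora_alpha=32,                           # scaling factor for the updates
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed Phi-2 projection names
    modules_to_save=["lm_head", "embed_tokens"],             # trained fully since new tokens are added
    bias="none",
    task_type="CAUSAL_LM",
)
```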
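
A sketch of tokenizer and model preparation, assuming modelpath from the configuration block below and lora_config from the previous sketch. The tokenizer comes first so the embedding resize can use its final vocabulary size; the exact special tokens and the pad_to_multiple_of value are illustrative:

```python
import torch
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Tokenizer with ChatML special tokens: <|im_start|> opens a turn, <|im_end|> closes it.
tokenizer = AutoTokenizer.from_pretrained(modelpath)
tokenizer.add_tokens(["<|im_start|>", "<PAD>"])
tokenizer.pad_token = "<PAD>"
tokenizer.add_special_tokens({"eos_token": "<|im_end|>"})

# Load Phi-2 quantized to 4-bit (NF4) for QLoRA training.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    modelpath,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Grow the embedding matrix to cover the newly added tokens and keep the EOS id in sync.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
model.config.eos_token_id = tokenizer.eos_token_id

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
```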
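
Dataset loading and preparation. The column name messages, the role/content keys, and the 90/10 split are assumptions about the dataset layout; tokenizer comes from the sketch above, dataset_name and max_length from the configuration block below:

```python
from datasets import load_dataset

dataset = load_dataset(dataset_name)["train"].train_test_split(test_size=0.1)

def format_chatml(messages):
    # One ChatML block per turn: <|im_start|>role\ncontent<|im_end|>
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    )

def tokenize(sample):
    text = format_chatml(sample["messages"])
    tokens = tokenizer(text, truncation=True, max_length=max_length)
    tokens["labels"] = tokens["input_ids"].copy()  # causal LM: labels mirror the inputs
    return tokens

dataset_tokenized = dataset.map(tokenize, remove_columns=dataset["train"].column_names)
```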
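
A minimal collation function: it pads input_ids, labels, and attention_mask to the longest sample in the batch and masks padded label positions with -100 so they do not contribute to the loss (tokenizer from the sketch above):

```python
import torch

def collate(batch):
    longest = max(len(sample["input_ids"]) for sample in batch)
    input_ids, labels, attention_mask = [], [], []
    for sample in batch:
        pad_len = longest - len(sample["input_ids"])
        input_ids.append(sample["input_ids"] + [tokenizer.pad_token_id] * pad_len)
        labels.append(sample["labels"] + [-100] * pad_len)       # -100 is ignored by the loss
        attention_mask.append([1] * len(sample["input_ids"]) + [0] * pad_len)
    return {
        "input_ids": torch.tensor(input_ids),
        "labels": torch.tensor(labels),
        "attention_mask": torch.tensor(attention_mask),
    }
```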
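
Training arguments wired to the hyperparameters listed further below. The optimizer, scheduler, precision, and evaluation/save cadence are illustrative choices rather than the article's exact settings:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs_eval,
    gradient_accumulation_steps=ga_steps,
    num_train_epochs=epochs,
    learning_rate=lr,
    lr_scheduler_type="constant",
    optim="paged_adamw_32bit",        # memory-friendly optimizer commonly paired with QLoRA
    bf16=True,
    evaluation_strategy="epoch",      # eval_strategy= in newer transformers releases
    save_strategy="epoch",
    logging_steps=10,
    report_to="wandb",                # stream metrics to the wandb run initialized earlier
    ddp_find_unused_parameters=False,
)
```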
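
Finally, a sketch of Trainer initialization and the training run, reusing the objects defined above. With report_to="wandb" and the main-process guard from the setup sketch, metrics are tracked only once when launching with accelerate:

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,              # processing_class= in newer transformers releases
    args=training_args,
    data_collator=collate,
    train_dataset=dataset_tokenized["train"],
    eval_dataset=dataset_tokenized["test"],
)

trainer.train()
```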
Hyperparameters used in nb_qlora.ipynb:

modelpath="microsoft/phi-2"
dataset_name="g-ronimo/riddles_evolved"
lr=0.00002        # low, but works for this dataset
bs=1              # batch size for training
bs_eval=16        # batch size for evals
ga_steps=16       # gradient accumulation steps
epochs=20         # dataset is small, many epochs needed
max_length=1024   # samples are truncated beyond this number of tokens
output_dir="out"
Launch the training script with accelerate:

accelerate launch qlora.py