Unleashing the Power of Yi: Fine-tune Bilingual LLMs Locally
Chapter 1: Introduction to Yi Models
The Yi models made their debut in December 2023 and have gained notable improvements and new variants since their initial release. The current lineup includes models ranging from 6 billion to 34 billion parameters, along with chat models and variants that can handle contexts of up to 200,000 tokens.
Yi's large language models (LLMs) are not only open-source but also excel at a variety of tasks. A standout feature of the Yi models is their bilingual capability, allowing them to perform tasks in both English and Chinese.
In this article, I will provide an overview of the Yi models and delve into the technical documentation that outlines their training methodologies. Following that, I will guide you through the process of running, quantizing, fine-tuning, and benchmarking these models on consumer-grade hardware. Remarkably, even the 34B model can function on a single consumer GPU when quantized.
To facilitate this process, I've created a notebook for the Yi LLMs, which includes implementations for:
- Inference with Transformers and vLLM
- Quantization with bitsandbytes, AWQ, and GPTQ
- Fine-tuning with QLoRA
- Benchmarking performance and accuracy using the Evaluation Harness and Optimum Benchmark
You can access the notebook (#54) here: Yi: The Llama Architecture for Bilingual Purposes.
For a deeper dive, the technical report detailing the development of the Yi models is available on arXiv: Yi: Open Foundation Models by 01.AI.
Chapter 2: Data Preprocessing for Yi Models
The technical report provides an overview of the preprocessing pipeline utilized for training the Yi models. This comprehensive pipeline is designed to create high-quality bilingual pre-training data, commencing with the collection of web documents via Common Crawl. It subsequently employs the CCNet pipeline for language identification and perplexity assessment, alongside various filtering and deduplication processes.
Heuristic filters have been implemented to exclude low-quality text based on several criteria, including URLs, domain and word blocklists, garbled text, document length, the frequency of special symbols, and the presence of short or incomplete lines.
To further refine the dataset, a set of scoring methods is employed to eliminate unsuitable documents, including:
- A perplexity scorer (utilizing the KenLM library)
- A quality scorer (assessing document similarity to Wikipedia)
- A document coherence scorer (removing or segmenting incoherent documents)
- A safety scorer (eliminating toxic content)
Deduplication is another crucial aspect, ensuring that duplicates are removed both within individual documents and across the dataset. This includes document-level MinHash deduplication and sub-document exact-match deduplication.
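To make the deduplication step more concrete, here is a minimal sketch of document-level MinHash deduplication using the datasketch library. This is an illustration of the general technique, not the code used by 01.AI: the toy documents, the similarity threshold, the number of permutations, and the use of single words as shingles are all arbitrary choices for the example.
import time
from datasketch import MinHash, MinHashLSH

# Toy corpus; in practice these would be web documents from Common Crawl.
documents = {
    "doc1": "the best recipe for pasta is simple and fast",
    "doc2": "the best recipe for pasta is simple and quick",
    "doc3": "yi models are bilingual large language models",
}

def minhash(text, num_perm=128):
    # Build a MinHash signature from the document's word tokens.
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

# LSH index: documents whose estimated Jaccard similarity exceeds the
# threshold are treated as near-duplicates.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
kept = []
for doc_id, text in documents.items():
    sig = minhash(text)
    if lsh.query(sig):   # a near-duplicate is already indexed
        continue         # drop this document
    lsh.insert(doc_id, sig)
    kept.append(doc_id)

print(kept)  # doc2 will typically be flagged as a near-duplicate of doc1 and dropped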
The final dataset utilized for pre-training comprises a total of 3.1 trillion tokens, placing the Yi models between Llama 2, which was trained on 2 trillion tokens, and Gemma models, trained on 6 trillion tokens.
For tokenization, the byte-pair encoding (BPE) technique was employed within the SentencePiece framework, setting the vocabulary size at 64,000. This is notably larger than the vocabulary sizes for Mistral 7B and Llama 2, yet smaller than Qwen1.5 and Gemma. This vocabulary size was selected to balance computational efficiency with the ability to accurately represent words in both Chinese and English.
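For illustration, training a BPE tokenizer with a 64,000-token vocabulary in the SentencePiece framework looks roughly like the sketch below. The corpus file name and the character coverage value are placeholders, not settings reported by 01.AI.
import sentencepiece as spm

# Train a BPE tokenizer; 'bilingual_corpus.txt' stands in for a mixed
# English/Chinese text file, one sentence per line.
spm.SentencePieceTrainer.train(
    input="bilingual_corpus.txt",
    model_prefix="yi_bpe",
    vocab_size=64000,
    model_type="bpe",
    character_coverage=0.9995,  # keep rare CJK characters in the vocabulary
)

sp = spm.SentencePieceProcessor(model_file="yi_bpe.model")
print(sp.encode("Yi is a bilingual model. 易是双语模型。", out_type=str))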
Chapter 3: Yi's Neural Architecture
The Yi models feature an architecture that resembles Llama 2, but with several key modifications. Notably, Yi employs Grouped-Query Attention (GQA) across both its 6B and 34B models, while Llama 2 only utilizes GQA in its 70B model for efficiency.
Grouped-Query Attention (GQA) organizes query heads into groups, allowing each group to share a single key and value head. This design significantly reduces training and inference costs without compromising performance, even for the smaller 6B model.
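The following toy PyTorch snippet sketches the core idea: the key/value projections produce fewer heads than the queries, and each key/value head is repeated so that a whole group of query heads shares it. The head counts are illustrative, not Yi's actual configuration.
import torch

batch, seq_len, head_dim = 1, 8, 64
n_q_heads, n_kv_heads = 32, 4             # 8 query heads share each KV head
group = n_q_heads // n_kv_heads

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)

# Expand each KV head so it covers its group of query heads.
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
attn = torch.softmax(scores, dim=-1) @ v   # shape: (1, 32, 8, 64)
print(attn.shape)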
The activation function utilized by Yi is the increasingly popular SwiGLU. For positional embedding and managing extended contexts, Yi employs Rotary Position Embedding (RoPE), adjusting the base frequency to accommodate long context windows of up to 200k tokens. Initially, the base model was trained on contexts of up to 4k tokens before further pre-training on a dataset containing longer sequences, particularly sourced from books.
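As a rough illustration of what adjusting the base frequency means, the snippet below computes RoPE inverse frequencies for two different bases. A larger base makes the lowest-frequency dimensions rotate far more slowly, which stretches the positional signal over longer sequences; the value of 5,000,000 is only an example, not necessarily the base used by the 200K variants.
import torch

def rope_inv_freq(head_dim, base):
    # Inverse rotation frequencies for each pair of dimensions, as in RoPE.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 128
short_ctx = rope_inv_freq(head_dim, base=10_000)     # standard base
long_ctx = rope_inv_freq(head_dim, base=5_000_000)   # illustrative larger base

# The slowest frequency is orders of magnitude smaller with the larger base.
print(short_ctx[-1].item(), long_ctx[-1].item())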
If you're interested in enhancing your understanding of RoPE and extending LLM context sizes, I previously authored an article titled LongRoPE.
Chapter 4: Fine-tuning and Performance
The fine-tuning dataset used to adapt the base models into chat models is surprisingly small, comprising fewer than 10,000 dialogues that include multi-turn instructions and responses, meticulously filtered to ensure quality. This choice aligns with prior research indicating that quality outweighs quantity when developing effective instruct models.
To ensure a comprehensive representation of skills and domains, the dataset incorporates a variety of open-source prompts spanning numerous domains, including question answering and creative writing.
The ChatML format is utilized for training, which distinctly separates different types of information such as system settings, user queries, and assistant replies. The training process employs next-word prediction loss, focusing solely on responses while disregarding prompts in the loss calculation, utilizing the AdamW optimizer with specific hyperparameters.
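As an illustration, a ChatML-formatted training example looks roughly like this (the conversation content is made up):
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>
Only the tokens of the assistant turn contribute to the next-word prediction loss; the system and user turns are masked out.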
Despite the efficient infrastructure supporting both RLHF and DPO training, the paper does not explicitly outline the method used for aligning the models with human preferences.
Chapter 5: Yi's Performance on Public Benchmarks
The Yi models have shown impressive performance on various public benchmarks. For instance, even the 6B model surpasses much larger models like Falcon-180B on certain tasks. The strengths of the Yi models lie particularly in commonsense reasoning, language comprehension, reading comprehension, and mathematical problem-solving. However, coding tasks remain a challenge for these models.
Chapter 6: Utilizing Yi on Consumer Hardware
Note: Yi is not entirely "open." If you plan to use the Yi models for commercial purposes, make sure to submit a request here; applications are processed promptly.
The following sections will detail how to run (using vLLM and Transformers), fine-tune (using QLoRA), and quantize (via bnb, GPTQ, and AWQ) the Yi models. I will provide key observations from my experiments, with the complete code available in the notebook.
Inference with vLLM
For optimal memory usage during inference, I recommend using vLLM. The notebook also includes instructions for utilizing Transformers.
The original Yi 6B model requires 12 GB of GPU RAM, while the quantized version using AWQ reduces this requirement to just 8.3 GB. Here’s a code snippet to illustrate how to use the quantized model:
import time
from vllm import LLM, SamplingParams

prompts = [
    "The best recipe for pasta is"
]

# Sampling settings for generation.
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, max_tokens=150)

# Load the 4-bit AWQ version of Yi 6B with vLLM.
loading_start = time.time()
llm = LLM(model="kaitchup/Yi-6B-awq-4bit", quantization="awq")
print("--- Loading time: %s seconds ---" % (time.time() - loading_start))

# Generate completions for all prompts in one batch.
generation_time = time.time()
outputs = llm.generate(prompts, sampling_params)
print("--- Generation time: %s seconds ---" % (time.time() - generation_time))

for output in outputs:
    generated_text = output.outputs[0].text
    print(generated_text)
    print('------')
Fine-tuning Yi with QLoRA
I found that fine-tuning Yi with QLoRA was quite straightforward. I utilized the same code I had used for fine-tuning Mistral 7B, and everything proceeded smoothly. In the notebook, I demonstrate how to create an instruct version of Yi 6B by fine-tuning it on the dataset timdettmers/openassistant-guanaco for three epochs. The resulting adapter is available on the Hugging Face Hub: kaitchup/Yi-6B-openguanaco-3e-QLoRA.
During the fine-tuning process, the validation loss did not significantly decrease, suggesting that my chosen hyperparameters might not have been optimal.
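For reference, here is a minimal QLoRA setup with Transformers, bitsandbytes, PEFT, and TRL, in the spirit of what the notebook does. The hyperparameters, target modules, and batch size are illustrative defaults rather than the values I used, and argument names vary a bit across TRL versions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

model_id = "01-ai/Yi-6B"

# Load the base model in 4-bit NF4 (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapter on the attention projections (illustrative choice of modules).
peft_config = LoraConfig(
    r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="./yi-6b-qlora",
        dataset_text_field="text",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
)
trainer.train()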
Quantizing Yi with GPTQ, AWQ, and Bitsandbytes
Quantizing the Yi models is a straightforward process. The notebook provides code for quantizing and serializing the model using bitsandbytes NF4, AWQ, and GPTQ; a minimal AWQ sketch follows the model list below. The models have been released on the Hugging Face hub:
- kaitchup/Yi-6B-bnb-4bit
- kaitchup/Yi-6B-gptq-4bit
- kaitchup/Yi-6B-awq-4bit
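As an example of one of these methods, the sketch below quantizes Yi 6B to 4-bit with the AutoAWQ library. The group size and other settings are AutoAWQ's common defaults, not necessarily the configuration I used for the released checkpoints.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "01-ai/Yi-6B"
quant_path = "Yi-6B-awq-4bit"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run AWQ calibration and quantize the weights to 4-bit.
model.quantize(tokenizer, quant_config=quant_config)

# Serialize the quantized model and its tokenizer.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)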
I also benchmarked the models across three different tasks using the Evaluation Harness, assessing their decoding throughput and peak memory consumption with optimum-benchmark. Due to recent changes in optimum-benchmark that I am still working to understand, I was unable to obtain results for the bnb version regarding throughput and peak memory usage.
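For accuracy, the Evaluation Harness can be driven from Python roughly as follows; the tasks listed here are examples, not necessarily the three tasks I reported in the notebook.
import lm_eval

# Evaluate the AWQ model on a few benchmark tasks with the Evaluation Harness.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=kaitchup/Yi-6B-awq-4bit",
    tasks=["arc_challenge", "hellaswag", "winogrande"],
    batch_size=8,
)
print(results["results"])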
AWQ has emerged as the most effective quantization method for Yi 6B, providing a balance of speed and memory efficiency. However, if you're looking for a faster model, the GPTQ version is a strong contender, offering slightly lower accuracy than AWQ while being as fast as the original 16-bit model.
Chapter 7: Conclusion
The Yi models are easy to run on consumer hardware: once quantized, even the 34B model requires no more than 24 GB of GPU RAM. Furthermore, their architecture closely follows that of Llama 2, making them compatible with most deep learning frameworks.
According to public benchmarks, they rank among the top LLMs available today. The team responsible for the Yi models continues to enhance them by releasing updated versions. Recently, they introduced Yi models capable of managing very long contexts of up to 200k tokens. You can find them on the Hugging Face hub:
- 01-ai/Yi-6B-200K
- 01-ai/Yi-9B-200K
- 01-ai/Yi-34B-200K
For those needing to quantize the Yi models, AWQ appears to provide the best accuracy for downstream tasks while also optimizing decoding throughput and memory use.
To support my work, consider subscribing to my newsletter.
Chapter 8: Fine-tuning Large Language Models (LLMs)
This video discusses the intricacies of fine-tuning large language models, complete with practical examples and code snippets.
Chapter 9: Installing Yi-1.5 Model Locally
In this video, you'll learn how to install the Yi-1.5 model on your local machine, showcasing its performance against Llama 3 in various benchmarks.