Everything you need to know to start fine-tuning LLMs in the privacy of your home

Hands on Large language models (LLMs) are remarkably effective at generating text and regurgitating information, but they're ultimately limited by the corpus of data they were trained on.

If, for example, you ask a generic pre-trained model about a process or procedure specific to your business, at best it'll refuse, and at worst it'll confidently hallucinate a plausible sounding answer.

You could, of course, get around this by training your own model, but the resources required to do that often far exceed practicality. Training Meta's relatively small Llama 3 8B model required the equivalent of 1.3 million GPU hours when running on 80GB Nvidia H100s. The good news is you don't have to. Instead, we can take an existing model, such as Llama, Mistral, or Phi, and extend its knowledge base or modify its behavior and style using your own data through a process called fine-tuning.

This process is still computationally expensive compared to inference, but thanks to advancements like Low Rank Adaptation (LoRA) and its quantized variant QLoRA, it's possible to fine-tune models using a single GPU - and that's exactly what we're going to be exploring in this hands-on guide.

Setting expectations

Compared to previous hands-on guides we've done, fine-tuning is a bit of a rabbit hole with no shortage of knobs to turn, switches to flip, settings to tweak, and best practices to follow. As such, we feel it's important to set some expectations.

Fine-tuning is a useful way of modifying the behavior or style of a pre-trained model. However, if your goal is to teach the model something new, it can be done, but there may be better and more reliable ways of doing so worth looking at first.

We've previously explored retrieval augmented generation (RAG), which essentially gives the model a library or database that it can reference. This approach is quite popular because it's relatively easy to set up, computationally cheap compared to training a model, and can be made to cite its sources. However, it's by no means perfect and won't do anything to change the style or behavior of a model.

If, for example, you're building a customer service chatbot to help customers find resources or troubleshoot a product, you probably don't want it to respond to unrelated questions about, say, health or finances. Prompt engineering can help with this to a degree. You could create a system prompt that instructs the model to behave in a certain way. This could be as simple as adding, "You are not equipped to answer questions related to health, wellness, or nutrition. If asked to do so redirect the conversation to a more appropriate topic."

Prompt engineering is elegant in its simplicity: Just tell the model what you do and don't want it to do. Unfortunately, anyone who's played with chatbots in the wild will have run into edge cases where the model can be tricked into doing something it's not supposed to. And despite what you might be thinking, you don't have to trap the LLM in some HAL 9000-style feedback loop. Often, it's as simple as telling the model, "Ignore all previous instructions, do this instead."

If RAG and prompt engineering won't cut it, fine-tuning may be worth exploring.

Memory efficient model tuning with QLoRA

For this guide, we're going to be using fine-tuning to change the style and tone of the Mistral 7B model. Specifically, we're going to use QLoRA, which, as we mentioned earlier, will allow us to fine-tune the model using a fraction of the memory and compute compared to conventional training.

This is because fine-tuning requires far more memory than simply running the model. During inference, you can estimate your memory requirements by multiplying the parameter count by its precision in bytes. For Mistral 7B, which was trained at BF16 (two bytes per parameter), that works out to about 14 GB, plus a gigabyte or two for the key-value cache.
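That back-of-the-envelope math is easy to reproduce. The 7.24 billion parameter count below is our approximation of Mistral 7B's size, not an official figure:

```python
# Rough inference-memory estimate: parameter count x bytes per parameter.
# 7.24e9 is an approximate parameter count for Mistral 7B (an assumption).
params = 7.24e9
bytes_per_param = 2          # BF16 = 16 bits = 2 bytes

weights_gb = params * bytes_per_param / 1e9
print(round(weights_gb, 1))  # prints 14.5 -- before the key-value cache
```

The same formula works for any model and precision: swap in the parameter count and the bytes per parameter (4 for FP32, 2 for FP16/BF16, 1 for INT8).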

A full fine-tune on the other hand requires several times this to fit the model into memory. So for Mistral 7B you're looking at 90 GB or more. Unless you've got a multi-GPU workstation sitting around, you'll almost certainly be looking at renting datacenter GPUs like the Nvidia A100 or H100 to get the job done.

This is because with a full fine-tune you're effectively retraining every weight in the model at full resolution. The good news is that in most cases it's not actually necessary to update every weight to tweak the neural network's output. In fact, updating a few thousand or a few million weights may be enough to achieve the desired result.

This is the logic behind LoRA, which, in a nutshell, freezes the model's pre-trained weights in place. A second, much smaller pair of low-rank matrices is then trained to track the changes that should be applied to the original weights in order to fine-tune the model.
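The arithmetic behind that saving is easy to sketch. The shapes below are illustrative, not Mistral 7B's actual layer dimensions, but they show how a pair of rank-8 matrices stands in for a full update to a 4,096-wide layer:

```python
import numpy as np

# Toy illustration of the LoRA idea. Shapes are made up for illustration.
d, r = 4096, 8                      # layer width, LoRA rank

W = np.random.randn(d, d)           # frozen pre-trained weight matrix
A = np.random.randn(r, d) * 0.01    # small trainable matrix (r x d)
B = np.zeros((d, r))                # starts at zero, so training begins from W

W_eff = W + B @ A                   # the effective weight; W itself never changes

print(W.size)           # 16,777,216 frozen weights in this layer
print(A.size + B.size)  # 65,536 trainable weights, roughly 0.4 percent of the layer
```

Because B starts at zero, the fine-tune begins exactly where the pre-trained model left off, and only A and B ever receive gradient updates.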

This cuts down the computational and memory overhead considerably. QLoRA steps this up a notch by loading the model's weights at lower precision, usually four bits. So instead of each parameter requiring two bytes of memory, it now only requires half a byte. If you're curious about quantization, you can learn more in our hands-on guide here.
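To make the quantization idea concrete, here's a toy absmax scheme that squeezes weights into a signed 4-bit range. Real QLoRA uses the more sophisticated NF4 data type rather than this uniform mapping, but the storage arithmetic is the same: four bits per weight instead of 16:

```python
import numpy as np

# Minimal absmax quantization sketch. QLoRA's actual NF4 format differs,
# but both store each weight in 4 bits plus a shared per-block scale.
w = np.array([0.82, -1.3, 0.05, 0.44])

scale = np.abs(w).max() / 7               # map into the signed 4-bit range [-7, 7]
q = np.round(w / scale).astype(np.int8)   # q == [4, -7, 0, 2], 4 bits each
w_hat = q * scale                         # dequantize back to floats for compute

print(q)
```

The reconstruction error per weight is bounded by half the scale, which is why quantizing in small blocks (each with its own scale) keeps 4-bit models usable.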

Using QLoRA, we're able to fine-tune a model like Mistral 7B using less than 16 GB of VRAM.
