LoRA & QLoRA
In one line: LoRA freezes the giant model and trains two tiny matrices "alongside" each weight matrix, so you update <1% of the parameters and still change the behaviour; QLoRA shrinks the frozen model to 4-bit so the whole thing fits on a single GPU.
Imagine a finished oil painting (the trained model). Full fine-tuning means repainting the whole canvas — expensive and risky. LoRA is painting your changes on a thin sheet of transparent overlay you lay on top: the original is untouched, your overlay is tiny and cheap, and you can peel it off or swap a different one in. QLoRA goes further: it photographs the original at lower resolution (4-bit) so it takes up far less space on your desk, then you paint your overlay on top of that. The magic is that a surprisingly thin overlay is enough to teach the model a new behaviour.
The core idea: low-rank adapters
A model is full of big weight matrices. Full fine-tuning changes every number in them. LoRA (Low-Rank Adaptation) makes a bet that pays off: the change you need to make is "low-rank" — it can be approximated by multiplying two skinny matrices together.
Instead of updating a weight matrix W (size d × k), you freeze W and learn a small update ΔW = B · A, where:
- A is r × k (skinny)
- B is d × r (skinny)
- r (the rank) is tiny, often 8, 16, or 32.
At inference, the model uses W + ΔW. During training, only A and B get gradients; the billions of numbers in W never move.
Why this works: adapting a model to a narrow task doesn't require rich, full-rank changes everywhere — a low-rank nudge captures most of the needed adjustment. Empirically, LoRA matches full fine-tuning quality on most tasks while training a fraction of a percent of the weights.
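To make the shapes concrete, here is a minimal pure-Python sketch of the idea above, not how any library implements it. The toy sizes, the random values, and the helper names `matmul` and `add` are all illustrative assumptions; starting B at zero is the usual LoRA initialization, so the adapter begins as a no-op.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

random.seed(0)
d, k, r = 6, 5, 2  # toy sizes; real layers are thousands wide

W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen, d x k
A = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                                 # trainable, d x r

# With B at zero the update is zero, so the model starts out unchanged.
delta_W = matmul(B, A)
starts_unchanged = add(W, delta_W) == W

# Stand-in for training: nudge B away from zero (no real gradients here).
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
x = [[random.gauss(0, 1)] for _ in range(k)]  # one input as a k x 1 column

merged = matmul(add(W, matmul(B, A)), x)               # (W + B·A) x
two_path = add(matmul(W, x), matmul(B, matmul(A, x)))  # W x + B (A x)
max_diff = max(abs(m[0] - t[0]) for m, t in zip(merged, two_path))

print(starts_unchanged, max_diff < 1e-9)
print(f"adapter params: {r * (d + k)}, frozen params: {d * k}")
```

The two paths agree up to floating-point noise, which is why an adapter can either be kept separate (cheap to swap) or folded into W once training is done. At these toy sizes the adapter is not much smaller than W; the saving only becomes dramatic when d and k are in the thousands.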
The parameter math (why it's so cheap)
For one matrix of size d × k:
- Full fine-tuning trains d × k parameters.
- LoRA trains r × (d + k) parameters (matrix A has r·k, matrix B has d·r).
Take a typical 4096 × 4096 projection with rank r = 16:
Full: 4096 × 4096 = 16,777,216 params
LoRA: 16 × (4096 + 4096) = 131,072 params -> ~0.8% of full
That ratio is why LoRA adapters are megabytes, not gigabytes, and why you can train on one GPU.
Write loraParams(d, k, r) — return the number of TRAINABLE parameters a single LoRA adapter adds for a weight matrix of size d×k at rank r. Remember: matrix A is r×k and matrix B is d×r, and you train both. Return a single number.
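One possible solution, sketched in Python to match the other snippets on this page. The snake_case name `lora_params` is an adaptation of `loraParams`, and the check at the bottom reuses the 4096 × 4096, r = 16 example from above.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters one LoRA adapter adds to a d x k weight matrix."""
    a_params = r * k  # matrix A is r x k
    b_params = d * r  # matrix B is d x r
    return a_params + b_params

full = 4096 * 4096
lora = lora_params(4096, 4096, 16)
print(lora)                         # 131072
print(f"{100 * lora / full:.2f}%")  # 0.78%
```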
Rank and alpha — the two knobs
Rank r = the capacity of the adapter (how much it can change).
- Small r (4–8): cheapest, enough for simple style/format tweaks.
- Medium r (16–32): the common sweet spot for most tasks.
- Large r (64+): more capacity, more memory, more overfitting risk; use it only if evals say you need it.
Alpha α = a scaling factor on the adapter's contribution. The adapter output is scaled by α / r before being added to the frozen weights. The practical takeaway:
- A common convention is α = 2 × r (e.g. r=16, α=32).
- Raising α makes the adapter's effect stronger (like a louder learning rate for the adapter); lowering it makes it gentler.
- Don't overthink it: start at r=16, α=32 and tune only if evals disappoint.
from peft import LoraConfig

config = LoraConfig(
    r=16,                         # rank
    lora_alpha=32,                # alpha (= 2r convention)
    lora_dropout=0.05,            # regularization against overfitting
    target_modules="all-linear",  # which layers get an adapter (broad = better quality)
    task_type="CAUSAL_LM",
)
target_modules chooses which matrices get adapters. Targeting all linear layers ("all-linear") usually gives the best quality; targeting only the attention projections is cheaper. Default to all-linear unless memory-bound.
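To see why rank and alpha are coupled, here is a small sketch of the α / r multiplier described in this section. The helper name `lora_scale` is made up for illustration, and it assumes the plain α / r convention stated above.

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to the adapter output before it is added to the frozen path."""
    return alpha / r

# Keeping alpha = 2r holds the multiplier at 2.0 whatever the rank.
for r in (8, 16, 64):
    print(r, lora_scale(2 * r, r))

# Changing r while leaving alpha at 32 quietly changes the adapter's strength.
for r in (8, 16, 64, 256):
    print(r, lora_scale(32, r))  # 4.0, 2.0, 0.5, 0.125
```

So if you raise the rank without touching alpha, the adapter gets more capacity but each unit of it is turned down, which is easy to misread as "higher rank didn't help".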
Quantization and QLoRA
Quantization = storing each weight in fewer bits. A model's weights are normally 16-bit floats (bf16). Quantizing to 8-bit halves the memory; to 4-bit quarters it — with a small, usually acceptable, quality cost.
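As a rough illustration of "fewer bits, small error", here is a deliberately simplified uniform absmax quantizer. This is not the NF4 scheme that QLoRA uses (NF4 has non-uniform levels and per-block scales); the function names and sample weights are assumptions for the sketch.

```python
def quantize_absmax(weights, bits=4):
    """Map floats to small signed integers using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                           # 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0   # avoid a zero scale
    codes = [round(w / scale) for w in weights]          # ints in [-qmax, qmax]
    return codes, scale

def dequantize(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]

weights = [0.42, -0.17, 0.06, -0.70, 0.31, 0.00, -0.49, 0.66]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - v) for w, v in zip(weights, restored))

print(codes)                    # every value fits in 4 bits
print(max_error <= scale / 2 + 1e-12)
```

Each weight lands at most half a step away from its original value. That rounding error is the "small, usually acceptable, quality cost"; smarter schemes such as NF4 place the levels where weights actually cluster to shrink it further.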
QLoRA = Quantized model + LoRA:
- Load the big base model in 4-bit (it's frozen, so low precision is fine).
- Train LoRA adapters in 16-bit on top of it.
This is the combination that put fine-tuning large models within reach of a single GPU. The frozen base shrinks 4×; the trainable part is already tiny. A 70B model that needs ~140 GB in bf16 drops to ~35 GB in 4-bit — suddenly trainable on one high-memory GPU.
import torch
from transformers import BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat, well suited to weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
    bnb_4bit_use_double_quant=True,         # quantize the quant constants too
)
# Pass quantization_config=bnb to from_pretrained when loading the base model
# (before handing it to SFTTrainer), then attach LoRA as above.
The memory math: will it fit on my GPU?
A rough rule for full fine-tuning memory: you need the model plus optimizer states (Adam keeps ~2 extra copies) plus gradients plus activations — call it ~16 bytes per parameter. LoRA/QLoRA only pays the optimizer/gradient cost on the tiny adapter, so the dominant term becomes just storing the frozen base.
| Setup | ~Memory for a 7–8B model | Fits on… |
|---|---|---|
| Full fine-tuning (bf16) | ~120+ GB | Multi-GPU cluster |
| LoRA (bf16 base) | ~18 GB | One 24 GB GPU (tight) |
| QLoRA (4-bit base) | ~6–10 GB | One consumer GPU (e.g. 12–16 GB) |
def gpu_gb_estimate(params_billions: float, bits: int) -> float:
    """Very rough memory to hold the frozen base plus ~20% runtime overhead, in GB."""
    bytes_per_param = bits / 8
    base_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return round(base_gb * 1.2, 1)  # +20% overhead for activations/runtime

print(gpu_gb_estimate(8, 16))  # 19.2 GB: 16 GB of bf16 weights for an 8B model, plus overhead
print(gpu_gb_estimate(8, 4))   # 4.8 GB in 4-bit -> QLoRA headroom for training
print(gpu_gb_estimate(70, 4))  # 42.0 GB -> a 70B QLoRA on one big GPU
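The ~16 bytes per parameter rule can be put next to the adapter-only cost in the same back-of-the-envelope style. This is a sketch under stated assumptions: the adapter is taken to be 1% of the base size, the same 16 bytes per trainable parameter is charged to it, and activations are left out entirely. The function name `training_memory_gb` is invented for this example.

```python
TRAIN_BYTES_PER_PARAM = 16  # weights + gradients + Adam states, per the rough rule above

def training_memory_gb(params_billions, base_bits=None, adapter_fraction=0.01):
    """Rough GB before activations. base_bits=None means full fine-tuning."""
    if base_bits is None:
        # Every parameter is trainable, so all of them pay the full cost.
        return params_billions * TRAIN_BYTES_PER_PARAM
    frozen_base = params_billions * base_bits / 8  # storage only
    adapter = params_billions * adapter_fraction * TRAIN_BYTES_PER_PARAM
    return frozen_base + adapter

print(f"{training_memory_gb(8):.1f}")                # 128.0  full fine-tuning
print(f"{training_memory_gb(8, base_bits=16):.1f}")  # 17.3   LoRA on a bf16 base
print(f"{training_memory_gb(8, base_bits=4):.1f}")   # 5.3    QLoRA on a 4-bit base
```

The numbers line up roughly with the table: once the base is frozen, the adapter's training cost is a rounding error next to the cost of simply holding the base in memory.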
When to use which
- QLoRA — your default when memory is the constraint (one GPU, large model). Tiny quality cost, huge memory win.
- LoRA (non-quantized) — when you have memory headroom and want to avoid even the small quantization penalty. Slightly faster training, slightly better quality.
- Full fine-tuning — only when you have the compute and your evals prove LoRA isn't reaching the quality bar. Rare.
Tooling note for 2026: Unsloth (drop-in; advertises roughly 2× faster LoRA/QLoRA with lower memory) and Axolotl (config-file-driven training) are two popular wrappers around this exact stack; both produce standard LoRA adapters you serve like any other. A bonus you'll meet again on the serving page: because adapters are small and the base is shared, you can hot-swap many LoRA adapters on one served base model.
Common pitfalls
- Rank cargo-culting. Cranking r to 256 "to be safe" wastes memory and invites overfitting. Start at 16; raise only if evals demand it.
- Forgetting the α/r scaling. Alpha isn't independent of rank: if you change r, the effective strength changes unless you keep the α = 2r ratio in mind.
- Targeting too few modules. Adapting only attention layers can underperform; "all-linear" is usually the better quality/cost trade.
- Assuming 4-bit is free. QLoRA has a small quality cost. For most tasks it's invisible; on the hardest tasks, compare against bf16 LoRA before committing.
- Losing the base model. A LoRA adapter is useless without the exact base weights it was trained against. Version and pin both. (See serving.)
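One lightweight way to "pin both" is to store a fingerprint of the base weights next to the adapter and check it before loading. The sketch below is only an illustration: the manifest fields, the function names, and the stand-in byte strings are all hypothetical, and in practice you would hash the real weight files.

```python
import hashlib
import json

def fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()

def make_manifest(adapter_name: str, base_name: str, base_bytes: bytes) -> str:
    """Record which base an adapter was trained against (hypothetical format)."""
    return json.dumps({
        "adapter": adapter_name,
        "base": base_name,
        "base_sha256": fingerprint(base_bytes),
    })

def base_matches(manifest_json: str, base_bytes: bytes) -> bool:
    """True only if the base on hand is byte-identical to the one recorded."""
    return json.loads(manifest_json)["base_sha256"] == fingerprint(base_bytes)

base_v1 = b"stand-in for the base weights the adapter was trained on"
base_v2 = b"stand-in for a later, silently updated base"

manifest = make_manifest("my-adapter", "my-base-model", base_v1)
print(base_matches(manifest, base_v1))  # True
print(base_matches(manifest, base_v2))  # False
```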
→ Next: Preference tuning: RLHF & DPO