LoRA & QLoRA

In one line: LoRA freezes the giant model and trains two tiny matrices "alongside" each weight so you update <1% of the parameters and still change the behaviour — and QLoRA shrinks the frozen model to 4-bit so the whole thing fits on a single GPU.

In plain English

Imagine a finished oil painting (the trained model). Full fine-tuning means repainting the whole canvas — expensive and risky. LoRA is painting your changes on a thin sheet of transparent overlay you lay on top: the original is untouched, your overlay is tiny and cheap, and you can peel it off or swap a different one in. QLoRA goes further: it photographs the original at lower resolution (4-bit) so it takes up far less space on your desk, then you paint your overlay on top of that. The magic is that a surprisingly thin overlay is enough to teach the model a new behaviour.

The core idea: low-rank adapters

A model is full of big weight matrices. Full fine-tuning changes every number in them. LoRA (Low-Rank Adaptation) makes a bet that pays off: the change you need to make is "low-rank" — it can be approximated by multiplying two skinny matrices together.

Instead of updating a weight matrix W (size d × k), you freeze W and learn a small update ΔW = B · A, where:

A is r × k (skinny)
B is d × r (skinny)
r (the rank) is tiny — often 8, 16, or 32.

At inference, the model uses W + ΔW. During training, only A and B get gradients; the billions of numbers in W never move.

Why this works: adapting a model to a narrow task doesn't require rich, full-rank changes everywhere — a low-rank nudge captures most of the needed adjustment. Empirically, LoRA matches full fine-tuning quality on most tasks while training a fraction of a percent of the weights.

The parameter math (why it's so cheap)

For one matrix of size d × k:

Full fine-tuning trains d × k parameters.
LoRA trains r × (d + k) parameters (matrix A has r·k, matrix B has d·r).

Take a typical 4096 × 4096 projection with rank r = 16:

Full:  4096 × 4096                = 16,777,216  params
LoRA:  16 × (4096 + 4096)         =    131,072  params   ->  ~0.8% of full

That ratio is why LoRA adapters are megabytes, not gigabytes, and why you can train on one GPU.

⌨️ Challenge — practice (not graded)

Write loraParams(d, k, r) — return the number of TRAINABLE parameters a single LoRA adapter adds for a weight matrix of size d×k at rank r. Remember: matrix A is r×k and matrix B is d×r, and you train both. Return a single number.

Rank and alpha — the two knobs

Rank r = the capacity of the adapter (how much it can change).

Small r (4–8): cheapest, enough for simple style/format tweaks.
Medium r (16–32): the common sweet spot for most tasks.
Large r (64+): more capacity, more memory, more overfitting risk — only if evals say you need it.

Alpha α = a scaling factor on the adapter's contribution. The adapter output is scaled by α / r before being added to the frozen weights. The practical takeaway:

A common convention is α = 2 × r (e.g. r=16, α=32).
Raising α makes the adapter's effect stronger (like a louder learning rate for the adapter); lowering it makes it gentler.
Don't overthink it: start at r=16, α=32 and tune only if evals disappoint.

from peft import LoraConfig

config = LoraConfig(
    r=16,                       # rank
    lora_alpha=32,              # alpha (= 2r convention)
    lora_dropout=0.05,          # regularization against overfitting
    target_modules="all-linear",# which layers get an adapter (broad = better quality)
    task_type="CAUSAL_LM",
)

target_modules chooses which matrices get adapters. Targeting all linear layers ("all-linear") usually gives the best quality; targeting only the attention projections is cheaper. Default to all-linear unless memory-bound.

Quantization and QLoRA

Quantization = storing each weight in fewer bits. A model's weights are normally 16-bit floats (bf16). Quantizing to 8-bit halves the memory; to 4-bit quarters it — with a small, usually acceptable, quality cost.

QLoRA = Quantized model + LoRA:

Load the big base model in 4-bit (it's frozen, so low precision is fine).
Train LoRA adapters in 16-bit on top of it.

This is the combination that put fine-tuning large models within reach of a single GPU. The frozen base shrinks 4×; the trainable part is already tiny. A 70B model that needs ~140 GB in bf16 drops to ~35 GB in 4-bit — suddenly trainable on one high-memory GPU.

from transformers import BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                          # the "Q" in QLoRA
    bnb_4bit_quant_type="nf4",                  # 4-bit NormalFloat — best for weights
    bnb_4bit_compute_dtype=torch.bfloat16,      # compute in bf16
    bnb_4bit_use_double_quant=True,             # quantize the quant constants too
)
# Pass quantization_config=bnb to from_pretrained / SFTTrainer, then attach LoRA as above.

The memory math: will it fit on my GPU?

A rough rule for full fine-tuning memory: you need the model plus optimizer states (Adam keeps ~2 extra copies) plus gradients plus activations — call it ~16 bytes per parameter. LoRA/QLoRA only pays the optimizer/gradient cost on the tiny adapter, so the dominant term becomes just storing the frozen base.

Setup	~Memory for a 7–8B model	Fits on…
Full fine-tuning (bf16)	~120+ GB	Multi-GPU cluster
LoRA (bf16 base)	~18 GB	One 24 GB GPU (tight)
QLoRA (4-bit base)	~6–10 GB	One consumer GPU (e.g. 12–16 GB)

def gpu_gb_estimate(params_billions: float, bits: int) -> float:
    """Very rough memory to STORE the frozen base, in GB."""
    bytes_per_param = bits / 8
    base_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return round(base_gb * 1.2, 1)   # +20% overhead for activations/runtime

print(gpu_gb_estimate(8, 16))   # ~9.6 GB just to store an 8B model in bf16
print(gpu_gb_estimate(8, 4))    # ~3.8 GB in 4-bit -> QLoRA headroom for training
print(gpu_gb_estimate(70, 4))   # ~33.6 GB -> a 70B QLoRA on one big GPU

When to use which

QLoRA — your default when memory is the constraint (one GPU, large model). Tiny quality cost, huge memory win.
LoRA (non-quantized) — when you have memory headroom and want to avoid even the small quantization penalty. Slightly faster training, slightly better quality.
Full fine-tuning — only when you have the compute and your evals prove LoRA isn't reaching the quality bar. Rare.

Tooling note for 2026: Unsloth (drop-in, 2× faster LoRA/QLoRA, lower memory) and Axolotl (config-file-driven training) are the two most popular wrappers around this exact stack — both produce standard LoRA adapters you serve like any other. A bonus you'll meet again on the serving page: because adapters are small and the base is shared, you can hot-swap many LoRA adapters on one served base model.

Common pitfalls

Where people trip up

Rank cargo-culting. Cranking r to 256 "to be safe" wastes memory and invites overfitting. Start at 16; raise only if evals demand it.
Forgetting the α/r scaling. Alpha isn't independent of rank — if you change r, the effective strength changes unless you keep the α = 2r ratio in mind.
Targeting too few modules. Adapting only attention layers can underperform; "all-linear" is usually the better quality/cost trade.
Assuming 4-bit is free. QLoRA has a small quality cost. For most tasks it's invisible; on the hardest tasks, compare against bf16 LoRA before committing.
Losing the base model. A LoRA adapter is useless without the exact base weights it was trained against. Version and pin both. (See serving.)

🤔 Quick checkQuick check

→ Next: Preference tuning: RLHF & DPO

The core idea: low-rank adapters​

The parameter math (why it's so cheap)​

Rank and alpha — the two knobs​

Quantization and QLoRA​

The memory math: will it fit on my GPU?​

When to use which​

Common pitfalls​