LoRA Adapters
LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of customized models.
What is LoRA?
LoRA adds small trainable low-rank matrices to transformer layers while keeping the base model weights frozen (see the sketch after the list below). This provides:
- Efficient fine-tuning: Train with far less GPU memory than full fine-tuning
- Small adapter files: Typically 10-100 MB, versus gigabytes for full model weights
- Easy deployment: Switch adapters without reloading the base model
- Multi-adapter serving: Serve different adapters from one base model
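To see why adapter files stay small, note that for a d × k weight matrix, LoRA trains a d × r and an r × k factor with rank r much smaller than d and k, rather than the full matrix. A rough back-of-the-envelope sketch (the dimensions and rank are illustrative, loosely modeled on a Llama-2-7B projection layer):
# Trainable parameters for one 4096 x 4096 projection
d = 4096
k = 4096
r = 16                # LoRA rank
full = d * k          # ~16.8M parameters to fully fine-tune the layer
lora = r * (d + k)    # ~131K parameters for the two LoRA factors
IO.puts("LoRA trains #{Float.round(100 * lora / full, 2)}% of the layer")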
Enabling LoRA
VLLM.run(fn ->
  llm = VLLM.llm!("meta-llama/Llama-2-7b-hf",
    enable_lora: true,
    max_lora_rank: 64,
    max_loras: 4
  )
end)
Configuration Options
| Option | Description | Default |
|---|---|---|
| enable_lora | Enable LoRA adapter support | false |
| max_lora_rank | Maximum adapter rank | 16 |
| max_loras | Maximum concurrent adapters | 1 |
| lora_extra_vocab_size | Extra vocabulary size for adapters | 256 |
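For instance, extending the example above to set all four options (the values are illustrative):
VLLM.run(fn ->
  llm = VLLM.llm!("meta-llama/Llama-2-7b-hf",
    enable_lora: true,
    max_lora_rank: 64,           # accept adapters trained at up to rank 64
    max_loras: 4,                # up to 4 adapters active concurrently
    lora_extra_vocab_size: 512   # room for adapter-specific added tokens
  )
end)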
Creating LoRA Requests
# Create a LoRA request
lora = VLLM.lora_request!(
  "my-adapter",        # Unique name
  1,                   # Integer ID
  "/path/to/adapter"   # Path to adapter weights
)
Using LoRA in Generation
VLLM.run(fn ->
  llm = VLLM.llm!("meta-llama/Llama-2-7b-hf",
    enable_lora: true,
    max_lora_rank: 64
  )

  # Create adapter request
  sql_lora = VLLM.lora_request!("sql-expert", 1, "/path/to/sql-adapter")
  params = VLLM.sampling_params!(temperature: 0.7, max_tokens: 200)

  # Generate with adapter
  outputs = VLLM.generate!(llm, "Write a SQL query to find all users",
    sampling_params: params,
    lora_request: sql_lora
  )
end)
Multi-LoRA Serving
Serve different adapters for different requests:
VLLM.run(fn ->
  llm = VLLM.llm!("meta-llama/Llama-2-7b-hf",
    enable_lora: true,
    max_loras: 4
  )

  # Create multiple adapters
  sql_adapter = VLLM.lora_request!("sql", 1, "/adapters/sql")
  code_adapter = VLLM.lora_request!("code", 2, "/adapters/code")
  medical_adapter = VLLM.lora_request!("medical", 3, "/adapters/medical")

  params = VLLM.sampling_params!(max_tokens: 200)

  # Use different adapters per request
  VLLM.generate!(llm, "SQL query...", sampling_params: params, lora_request: sql_adapter)
  VLLM.generate!(llm, "Python function...", sampling_params: params, lora_request: code_adapter)
  VLLM.generate!(llm, "Medical diagnosis...", sampling_params: params, lora_request: medical_adapter)

  # Generate without adapter (base model)
  VLLM.generate!(llm, "General question...", sampling_params: params)
end)
LoRA Adapter Format
vLLM expects LoRA adapters in the HuggingFace PEFT format:
adapter_directory/
├── adapter_config.json
├── adapter_model.bin (or .safetensors)
└── (optional) special_tokens_map.json
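Before loading, it can help to sanity-check that a directory matches this layout. A minimal sketch (AdapterCheck is a hypothetical helper, not part of the VLLM wrapper):
# Hypothetical helper: verify a directory looks like a PEFT adapter
defmodule AdapterCheck do
  def check(dir) do
    has_config = File.exists?(Path.join(dir, "adapter_config.json"))

    has_weights =
      Enum.any?(["adapter_model.bin", "adapter_model.safetensors"], fn file ->
        File.exists?(Path.join(dir, file))
      end)

    cond do
      not File.dir?(dir) -> {:error, :no_such_directory}
      not has_config -> {:error, :missing_adapter_config}
      not has_weights -> {:error, :missing_weights}
      true -> :ok
    end
  end
end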
Training LoRA Adapters
Popular tools for training LoRA adapters:
HuggingFace PEFT
# base_model: a model loaded earlier, e.g. with transformers' AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1
)
model = get_peft_model(base_model, config)
# Train...
model.save_pretrained("/path/to/adapter")
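Once training finishes, the saved directory can be served directly by reusing the path from the example above (the adapter name and ID here are arbitrary):
lora = VLLM.lora_request!("my-trained-adapter", 1, "/path/to/adapter")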
LLaMA-Factory
python train.py --model_name meta-llama/Llama-2-7b-hf \
  --lora_rank 64 \
  --output_dir /path/to/adapter
Performance Tips
- Adapter hot-swapping: vLLM switches between adapters without reloading the base model
- Batch different adapters: Requests that use different adapters can run in the same batch
- Memory overhead: Each loaded adapter adds memory roughly proportional to its rank
- Rank trade-off: Higher rank means more adapter capacity but more memory (see the sketch below)
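As a rough illustration of the rank trade-off, adapter memory grows linearly with rank. A sketch assuming fp16 weights and hypothetical shapes (32 layers, two 4096 x 4096 target projections per layer):
# Rough adapter size estimate at 2 bytes per fp16 parameter
layers = 32
d = 4096
params_per_layer = fn r -> 2 * r * (d + d) end  # two target projections

for r <- [16, 64] do
  bytes = layers * params_per_layer.(r) * 2
  IO.puts("rank #{r}: ~#{div(bytes, 1_048_576)} MiB")
end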
Common Issues
Adapter Not Loading
- Check that the adapter path exists
- Check that adapter_config.json is valid
- Ensure max_lora_rank >= the adapter's rank (see the sketch below)
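The trained rank is recorded in adapter_config.json under the "r" key, so the last check can be scripted. A sketch, assuming the Jason JSON library is available:
# Read the adapter's rank from its PEFT config
config =
  "/path/to/adapter/adapter_config.json"
  |> File.read!()
  |> Jason.decode!()

adapter_rank = config["r"]

# Compare against the max_lora_rank passed to VLLM.llm! (64 in the examples above)
if adapter_rank > 64 do
  IO.puts("Adapter rank #{adapter_rank} exceeds max_lora_rank")
end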
Memory Issues
- Reduce max_loras
- Use a smaller max_lora_rank
- Consider a quantized base model
Performance Issues
- Batch requests that share an adapter when possible
- Pre-load frequently used adapters
Example: Task-Specific Adapters
VLLM.run(fn ->
  llm = VLLM.llm!("meta-llama/Llama-2-7b-hf",
    enable_lora: true,
    max_loras: 3
  )

  # Different adapters for different tasks
  adapters = %{
    summarization: VLLM.lora_request!("sum", 1, "/adapters/summarizer"),
    translation: VLLM.lora_request!("trans", 2, "/adapters/translator"),
    qa: VLLM.lora_request!("qa", 3, "/adapters/qa-expert")
  }

  # Route based on task
  task = :summarization
  adapter = Map.get(adapters, task)

  outputs = VLLM.generate!(llm, "Summarize this article...",
    lora_request: adapter
  )
end)