Supported Models
vLLM supports a wide range of open-source models from Hugging Face. This guide lists the major supported model families.
Transformer LLMs
Llama Family
# Llama 2
VLLM.llm!("meta-llama/Llama-2-7b-hf")
VLLM.llm!("meta-llama/Llama-2-13b-hf")
VLLM.llm!("meta-llama/Llama-2-70b-hf")
VLLM.llm!("meta-llama/Llama-2-7b-chat-hf")
# Llama 3
VLLM.llm!("meta-llama/Meta-Llama-3-8B")
VLLM.llm!("meta-llama/Meta-Llama-3-70B")
VLLM.llm!("meta-llama/Meta-Llama-3-8B-Instruct")
# Llama 3.1
VLLM.llm!("meta-llama/Llama-3.1-8B")
VLLM.llm!("meta-llama/Llama-3.1-70B")
VLLM.llm!("meta-llama/Llama-3.1-405B")
# Llama 3.2
VLLM.llm!("meta-llama/Llama-3.2-1B")
VLLM.llm!("meta-llama/Llama-3.2-3B")
VLLM.llm!("meta-llama/Llama-3.2-1B-Instruct")Note: Some Llama models are gated. Accept the license and request access on Hugging Face.
Mistral Family
VLLM.llm!("mistralai/Mistral-7B-v0.1")
VLLM.llm!("mistralai/Mistral-7B-Instruct-v0.2")
VLLM.llm!("mistralai/Mistral-Nemo-Base-2407")Qwen Family
VLLM.llm!("Qwen/Qwen2-7B")
VLLM.llm!("Qwen/Qwen2-72B")
VLLM.llm!("Qwen/Qwen2.5-7B-Instruct")
VLLM.llm!("Qwen/Qwen2.5-72B-Instruct")Gemma Family
VLLM.llm!("google/gemma-2b")
VLLM.llm!("google/gemma-7b")
VLLM.llm!("google/gemma-2-9b")
VLLM.llm!("google/gemma-2-27b")Phi Family
VLLM.llm!("microsoft/phi-2")
VLLM.llm!("microsoft/Phi-3-mini-4k-instruct")
VLLM.llm!("microsoft/Phi-3.5-mini-instruct")Mixture-of-Experts (MoE)
# Mixtral
VLLM.llm!("mistralai/Mixtral-8x7B-v0.1")
VLLM.llm!("mistralai/Mixtral-8x7B-Instruct-v0.1")
VLLM.llm!("mistralai/Mixtral-8x22B-v0.1")
# DeepSeek
VLLM.llm!("deepseek-ai/deepseek-moe-16b-base")
VLLM.llm!("deepseek-ai/DeepSeek-V2")
VLLM.llm!("deepseek-ai/DeepSeek-V3")
# Qwen MoE
VLLM.llm!("Qwen/Qwen1.5-MoE-A2.7B")Code Models
# CodeLlama
VLLM.llm!("codellama/CodeLlama-7b-hf")
VLLM.llm!("codellama/CodeLlama-34b-Instruct-hf")
# StarCoder
VLLM.llm!("bigcode/starcoder2-15b")
# DeepSeek Coder
VLLM.llm!("deepseek-ai/deepseek-coder-6.7b-base")Embedding Models
# E5 Mistral
VLLM.llm!("intfloat/e5-mistral-7b-instruct", runner: "pooling")
# BGE
VLLM.llm!("BAAI/bge-large-en-v1.5", runner: "pooling")Multimodal Models
Multimodal Models
# LLaVA
VLLM.llm!("llava-hf/llava-1.5-7b-hf")
VLLM.llm!("llava-hf/llava-v1.6-mistral-7b-hf")
# Qwen-VL
VLLM.llm!("Qwen/Qwen2-VL-7B-Instruct")Quantized Models
Many models are published pre-quantized, which substantially reduces GPU memory usage, typically at a small cost in accuracy:
# AWQ Quantized
VLLM.llm!("TheBloke/Llama-2-7B-AWQ", quantization: "awq")
VLLM.llm!("TheBloke/Llama-2-13B-AWQ", quantization: "awq")
# GPTQ Quantized
VLLM.llm!("TheBloke/Llama-2-7B-GPTQ", quantization: "gptq")Small Models (for Testing)
# OPT (small, fast)
VLLM.llm!("facebook/opt-125m")
VLLM.llm!("facebook/opt-350m")
VLLM.llm!("facebook/opt-1.3b")
# GPT-2
VLLM.llm!("gpt2")
VLLM.llm!("gpt2-medium")Model Selection Tips
- For chat/instruction following: Use "-Instruct" or "-chat" variants
- For memory constraints: Use quantized versions (AWQ, GPTQ)
- For multi-GPU: Use larger models with tensor parallelism (see the sketch after this list)
- For testing: Use small models like OPT-125m
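A hedged sketch for the multi-GPU tip above; it assumes the wrapper forwards vLLM's tensor_parallel_size engine argument as a keyword option, which this guide does not otherwise show:
# Assumption: tensor_parallel_size is passed through to the vLLM engine,
# sharding the 70B weights across 4 GPUs.
VLLM.llm!("meta-llama/Llama-3.1-70B", tensor_parallel_size: 4)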
Gated Models
Some models require accepting terms on Hugging Face:
- Visit the model page on Hugging Face
- Accept the license agreement
- Set the HF_TOKEN environment variable:
export HF_TOKEN="your-huggingface-token"
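The token can also be set from Elixir before the model is loaded, assuming the underlying loader reads HF_TOKEN from the process environment, as the Hugging Face tooling conventionally does:
# Set the token for the current OS process, then load a gated model.
System.put_env("HF_TOKEN", "your-huggingface-token")
VLLM.llm!("meta-llama/Llama-3.1-8B")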
Adding New Models
vLLM supports most transformer-based architectures. If a model isn't working:
- Check if it's on the vLLM supported models list
- Try trust_remote_code: true for custom architectures (see the sketch below)
- Check vLLM GitHub issues for known compatibility problems
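For example, a sketch of loading a checkpoint whose architecture ships as custom modeling code on the Hub; the repository name is illustrative, not a real model:
# trust_remote_code: true allows the loader to execute the modeling code
# bundled with the checkpoint. Enable it only for repositories you trust.
VLLM.llm!("some-org/custom-architecture-model", trust_remote_code: true)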