# VLLM v0.3.0 - Table of Contents

vLLM for Elixir via SnakeBridge - Easy, fast, and cheap LLM serving for everyone. High-throughput LLM inference with PagedAttention, continuous batching, and OpenAI-compatible API.

## Pages

- [LICENSE](license.md)
- Guides
  - [README](readme.md)
  - [Quickstart Guide](quickstart.md)
- Features
  - [Offline Inference](offline_inference.md)
  - [Online Serving](online_serving.md)
  - [Sampling Parameters](sampling_params.md)
  - [Configuration Guide](configuration.md)
  - [Multimodal Models](multimodal.md)
  - [LoRA Adapters](lora.md)
  - [Structured Outputs](structured_outputs.md)
- Reference
  - [Supported Models](supported_models.md)
  - [Quantization](quantization.md)
- Examples
  - [VLLM Examples](examples.md)
- Release Notes
  - [Changelog](changelog.md)

## Modules

- [GracefulSerialization.Helpers](GracefulSerialization.Helpers.md): Helper wrappers for `graceful_serialization`.
- [VLLM.ConfigHelper](VLLM.ConfigHelper.md): Runtime configuration helper for using vLLM safely via SnakeBridge/Snakepit.
- [Vllm](Vllm.md): vLLM: a high-throughput and memory-efficient inference engine for LLMs
- [Vllm.Assets](Vllm.Assets.md): Submodule bindings for `vllm.assets`.
- [Vllm.AsyncLLMEngine](Vllm.AsyncLLMEngine.md): Protocol class for Clients to Engine
- [Vllm.Attention](Vllm.Attention.md): Submodule bindings for `vllm.attention`.
- [Vllm.BeamSearch](Vllm.BeamSearch.md): Submodule bindings for `vllm.beam_search`.
- [Vllm.BeamSearch.BeamSearchInstance](Vllm.BeamSearch.BeamSearchInstance.md): Wrapper for Python class BeamSearchInstance.
- [Vllm.BeamSearch.BeamSearchOutput](Vllm.BeamSearch.BeamSearchOutput.md): The output of beam search.
- [Vllm.BeamSearch.BeamSearchSequence](Vllm.BeamSearch.BeamSearchSequence.md): A sequence for beam search.
- [Vllm.BeamSearch.LoRARequest](Vllm.BeamSearch.LoRARequest.md): Request for a LoRA adapter.
- [Vllm.BeamSearch.Logprob](Vllm.BeamSearch.Logprob.md): Infos for supporting OpenAI compatible logprobs and token ranks.
- [Vllm.Benchmarks](Vllm.Benchmarks.md): Submodule bindings for `vllm.benchmarks`.
- [Vllm.CollectEnv](Vllm.CollectEnv.md): Submodule bindings for `vllm.collect_env`.
- [Vllm.CollectEnv.SystemEnv](Vllm.CollectEnv.SystemEnv.md): `SystemEnv(torch_version, is_debug_build, cuda_compiled_version, gcc_version, clang_version, cmake_version, os, libc_version, python_version, python_platform, is_cuda_available, cuda_runtime_version, cuda_module_loading, nvidia_driver_version, nvidia_gpu_models, cudnn_version, pip_version, pip_packages, conda_packages, hip_compiled_version, hip_runtime_version, miopen_runtime_version, caching_allocator_config, is_xnnpack_available, cpu_info, rocm_version, vllm_version, vllm_build_flags, gpu_topo, env_vars)`
- [Vllm.Compilation](Vllm.Compilation.md): Submodule bindings for `vllm.compilation`.
- [Vllm.Config](Vllm.Config.md): Submodule bindings for `vllm.config`.
- [Vllm.Config.AttentionConfig](Vllm.Config.AttentionConfig.md): Configuration for attention mechanisms in vLLM.
- [Vllm.Config.CUDAGraphMode](Vllm.Config.CUDAGraphMode.md): Constants for the cudagraph mode in CompilationConfig.
- [Vllm.Config.CacheConfig](Vllm.Config.CacheConfig.md): Configuration for the KV cache.
- [Vllm.Config.CompilationConfig](Vllm.Config.CompilationConfig.md): Configuration for compilation.
- [Vllm.Config.CompilationMode](Vllm.Config.CompilationMode.md): The compilation approach used for torch.compile-based compilation of the
- [Vllm.Config.DeviceConfig](Vllm.Config.DeviceConfig.md): Configuration for the device to use for vLLM execution.
- [Vllm.Config.ECTransferConfig](Vllm.Config.ECTransferConfig.md): Configuration for distributed EC cache transfer.
- [Vllm.Config.EPLBConfig](Vllm.Config.EPLBConfig.md): Configuration for Expert Parallel Load Balancing (EP).
- [Vllm.Config.KVEventsConfig](Vllm.Config.KVEventsConfig.md): Configuration for KV event publishing.
- [Vllm.Config.KVTransferConfig](Vllm.Config.KVTransferConfig.md): Configuration for distributed KV cache transfer.
- [Vllm.Config.LoRAConfig](Vllm.Config.LoRAConfig.md): Configuration for LoRA.
- [Vllm.Config.LoadConfig](Vllm.Config.LoadConfig.md): Configuration for loading the model weights.
- [Vllm.Config.ModelConfig](Vllm.Config.ModelConfig.md): Configuration for the model.
- [Vllm.Config.MultiModalConfig](Vllm.Config.MultiModalConfig.md): Controls the behavior of multimodal models.
- [Vllm.Config.ObservabilityConfig](Vllm.Config.ObservabilityConfig.md): Configuration for observability - metrics and tracing.
- [Vllm.Config.ParallelConfig](Vllm.Config.ParallelConfig.md): Configuration for the distributed execution.
- [Vllm.Config.PassConfig](Vllm.Config.PassConfig.md): Configuration for custom Inductor passes.
- [Vllm.Config.PoolerConfig](Vllm.Config.PoolerConfig.md): Controls the behavior of output pooling in pooling models.
- [Vllm.Config.ProfilerConfig](Vllm.Config.ProfilerConfig.md): Dataclass which contains profiler config for the engine.
- [Vllm.Config.SchedulerConfig](Vllm.Config.SchedulerConfig.md): Scheduler configuration.
- [Vllm.Config.SpeculativeConfig](Vllm.Config.SpeculativeConfig.md): Configuration for speculative decoding.
- [Vllm.Config.SpeechToTextConfig](Vllm.Config.SpeechToTextConfig.md): Configuration for speech-to-text models.
- [Vllm.Config.StructuredOutputsConfig](Vllm.Config.StructuredOutputsConfig.md): Dataclass which contains structured outputs config for the engine.
- [Vllm.Config.SupportsMetricsInfo](Vllm.Config.SupportsMetricsInfo.md): Wrapper for Python class SupportsMetricsInfo.
- [Vllm.Config.VllmConfig](Vllm.Config.VllmConfig.md): Dataclass which contains all vllm-related configuration. This
- [Vllm.Connections](Vllm.Connections.md): Submodule bindings for `vllm.connections`.
- [Vllm.Connections.HTTPConnection](Vllm.Connections.HTTPConnection.md): Helper class to send HTTP requests.
- [Vllm.DeviceAllocator](Vllm.DeviceAllocator.md): Submodule bindings for `vllm.device_allocator`.
- [Vllm.Distributed](Vllm.Distributed.md): Submodule bindings for `vllm.distributed`.
- [Vllm.Distributed.DeviceCommunicatorBase](Vllm.Distributed.DeviceCommunicatorBase.md): Base class for device-specific communicator.
- [Vllm.Distributed.GraphCaptureContext](Vllm.Distributed.GraphCaptureContext.md): `GraphCaptureContext(stream: torch.cuda.streams.Stream)`
- [Vllm.Distributed.GroupCoordinator](Vllm.Distributed.GroupCoordinator.md): PyTorch ProcessGroup wrapper for a group of processes.
- [Vllm.Distributed.StatelessProcessGroup](Vllm.Distributed.StatelessProcessGroup.md): A dataclass to hold a metadata store, and the rank, world_size of the
- [Vllm.Distributed.TensorMetadata](Vllm.Distributed.TensorMetadata.md): `TensorMetadata(device, dtype, size)`
- [Vllm.Engine](Vllm.Engine.md): Submodule bindings for `vllm.engine`.
- [Vllm.Entrypoints](Vllm.Entrypoints.md): Submodule bindings for `vllm.entrypoints`.
- [Vllm.EnvOverride](Vllm.EnvOverride.md): Submodule bindings for `vllm.env_override`.
- [Vllm.Envs](Vllm.Envs.md): Submodule bindings for `vllm.envs`.
- [Vllm.Exceptions](Vllm.Exceptions.md): Custom exceptions for vLLM.
- [Vllm.Exceptions.VLLMValidationError](Vllm.Exceptions.VLLMValidationError.md): vLLM-specific validation error for request validation failures.
- [Vllm.ForwardContext](Vllm.ForwardContext.md): `ForwardContext(no_compile_layers: dict[str, typing.Any], attn_metadata: dict[str, vllm.v1.attention.backend.AttentionMetadata] | list[dict[str, vllm.v1.attention.backend.AttentionMetadata]], virtual_engine: int, dp_metadata: vllm.forward_context.DPMetadata | None = None, cudagraph_runtime_mode: vllm.config.compilation.CUDAGraphMode = , batch_descriptor: vllm.forward_context.BatchDescriptor | None = None, ubatch_slices: list[vllm.v1.worker.ubatch_utils.UBatchSlice] | None = None, additional_kwargs: dict[str, typing.Any] = )`
- [Vllm.ForwardContext.AttentionMetadata](Vllm.ForwardContext.AttentionMetadata.md): Wrapper for Python class AttentionMetadata.
- [Vllm.ForwardContext.BatchDescriptor](Vllm.ForwardContext.BatchDescriptor.md): Batch descriptor for cudagraph dispatching. We should keep the num of
- [Vllm.ForwardContext.DPMetadata](Vllm.ForwardContext.DPMetadata.md): `DPMetadata(max_tokens_across_dp_cpu: torch.Tensor, num_tokens_across_dp_cpu: torch.Tensor, local_sizes: list[int] | None = None)`
- [Vllm.ForwardContext.Module](Vllm.ForwardContext.Module.md): Submodule bindings for `vllm.forward_context`.
- [Vllm.Grpc](Vllm.Grpc.md): vLLM gRPC protocol definitions.
- [Vllm.Inputs](Vllm.Inputs.md): Submodule bindings for `vllm.inputs`.
- [Vllm.Inputs.DataPrompt](Vllm.Inputs.DataPrompt.md): Represents generic inputs handled by IO processor plugins.
- [Vllm.Inputs.EmbedsInputs](Vllm.Inputs.EmbedsInputs.md): Represents embeddings-based inputs.
- [Vllm.Inputs.EmbedsPrompt](Vllm.Inputs.EmbedsPrompt.md): Schema for a prompt provided via token embeddings.
- [Vllm.Inputs.EncoderDecoderInputs](Vllm.Inputs.EncoderDecoderInputs.md): The inputs in [`LLMEngine`][vllm.engine.llm_engine.LLMEngine] before they
- [Vllm.Inputs.ExplicitEncoderDecoderPrompt](Vllm.Inputs.ExplicitEncoderDecoderPrompt.md): Represents an encoder/decoder model input prompt,
- [Vllm.Inputs.TextPrompt](Vllm.Inputs.TextPrompt.md): Schema for a text prompt.
- [Vllm.Inputs.TokenInputs](Vllm.Inputs.TokenInputs.md): Represents token-based inputs.
- [Vllm.Inputs.TokensPrompt](Vllm.Inputs.TokensPrompt.md): Schema for a tokenized prompt.
- [Vllm.LLM](Vllm.LLM.md): An LLM for generating texts from given prompts and sampling parameters.
- [Vllm.LLMEngine](Vllm.LLMEngine.md): Legacy LLMEngine for backwards compatibility.
- [Vllm.Logger](Vllm.Logger.md): Logging configuration for vLLM.
- [Vllm.Logger.ColoredFormatter](Vllm.Logger.ColoredFormatter.md): Wrapper for Python class ColoredFormatter.
- [Vllm.Logger.NewLineFormatter](Vllm.Logger.NewLineFormatter.md): Wrapper for Python class NewLineFormatter.
- [Vllm.Logger.VllmLogger](Vllm.Logger.VllmLogger.md): Note
- [Vllm.LoggingUtils](Vllm.LoggingUtils.md): Submodule bindings for `vllm.logging_utils`.
- [Vllm.LoggingUtils.ColoredFormatter](Vllm.LoggingUtils.ColoredFormatter.md): Adds ANSI color codes to log levels for terminal output.
- [Vllm.LoggingUtils.NewLineFormatter](Vllm.LoggingUtils.NewLineFormatter.md): Adds logging prefix to newlines to align multi-line messages.
- [Vllm.LogitsProcess](Vllm.LogitsProcess.md): Submodule bindings for `vllm.logits_process`.
- [Vllm.LogitsProcess.NoBadWordsLogitsProcessor](Vllm.LogitsProcess.NoBadWordsLogitsProcessor.md): Wrapper for Python class NoBadWordsLogitsProcessor.
- [Vllm.LogitsProcess.TokenizerLike](Vllm.LogitsProcess.TokenizerLike.md): Wrapper for Python class TokenizerLike.
- [Vllm.Logprobs](Vllm.Logprobs.md): Submodule bindings for `vllm.logprobs`.
- [Vllm.Logprobs.FlatLogprobs](Vllm.Logprobs.FlatLogprobs.md): Flat logprobs of a request into multiple primitive type lists.
- [Vllm.Logprobs.Logprob](Vllm.Logprobs.Logprob.md): Infos for supporting OpenAI compatible logprobs and token ranks.
- [Vllm.Lora](Vllm.Lora.md): Submodule bindings for `vllm.lora`.
- [Vllm.ModelExecutor](Vllm.ModelExecutor.md): Submodule bindings for `vllm.model_executor`.
- [Vllm.ModelExecutor.BasevLLMParameter](Vllm.ModelExecutor.BasevLLMParameter.md): Base parameter for vLLM linear layers. Extends the torch.nn.parameter
- [Vllm.ModelExecutor.Models.Adapters](Vllm.ModelExecutor.Models.Adapters.md): Submodule bindings for `vllm.model_executor.models.adapters`.
- [Vllm.ModelExecutor.Models.Interfaces](Vllm.ModelExecutor.Models.Interfaces.md): Submodule bindings for `vllm.model_executor.models.interfaces`.
- [Vllm.ModelExecutor.Models.InterfacesBase](Vllm.ModelExecutor.Models.InterfacesBase.md): Submodule bindings for `vllm.model_executor.models.interfaces_base`.
- [Vllm.ModelExecutor.PackedvLLMParameter](Vllm.ModelExecutor.PackedvLLMParameter.md): Parameter for model weights which are packed on disk.
- [Vllm.ModelInspection](Vllm.ModelInspection.md): Model inspection utilities for vLLM.
- [Vllm.Multimodal](Vllm.Multimodal.md): Submodule bindings for `vllm.multimodal`.
- [Vllm.Multimodal.Inputs](Vllm.Multimodal.Inputs.md): Submodule bindings for `vllm.multimodal.inputs`.
- [Vllm.Multimodal.Inputs.MultiModalFieldConfig](Vllm.Multimodal.Inputs.MultiModalFieldConfig.md): `MultiModalFieldConfig(field: vllm.multimodal.inputs.BaseMultiModalField, modality: str)`
- [Vllm.Multimodal.Inputs.MultiModalFieldElem](Vllm.Multimodal.Inputs.MultiModalFieldElem.md): Represents a keyword argument inside a
- [Vllm.Multimodal.Inputs.MultiModalInputs](Vllm.Multimodal.Inputs.MultiModalInputs.md): Represents the outputs of
- [Vllm.Multimodal.Inputs.MultiModalKwargsItem](Vllm.Multimodal.Inputs.MultiModalKwargsItem.md): A collection of
- [Vllm.Multimodal.Inputs.MultiModalKwargsItems](Vllm.Multimodal.Inputs.MultiModalKwargsItems.md): A dictionary of
- [Vllm.Multimodal.Inputs.PlaceholderRange](Vllm.Multimodal.Inputs.PlaceholderRange.md): Placeholder location information for multi-modal data.
- [Vllm.Multimodal.MultiModalDataBuiltins](Vllm.Multimodal.MultiModalDataBuiltins.md): Type annotations for modality types predefined by vLLM.
- [Vllm.Multimodal.MultiModalHasher](Vllm.Multimodal.MultiModalHasher.md): Wrapper for Python class MultiModalHasher.
- [Vllm.Multimodal.MultiModalKwargsItems](Vllm.Multimodal.MultiModalKwargsItems.md): A dictionary of
- [Vllm.Multimodal.MultiModalRegistry](Vllm.Multimodal.MultiModalRegistry.md): A registry that dispatches data processing according to the model.
- [Vllm.Multimodal.Parse](Vllm.Multimodal.Parse.md): Submodule bindings for `vllm.multimodal.parse`.
- [Vllm.Multimodal.Processing](Vllm.Multimodal.Processing.md): Submodule bindings for `vllm.multimodal.processing`.
- [Vllm.Multimodal.Registry](Vllm.Multimodal.Registry.md): Submodule bindings for `vllm.multimodal.registry`.
- [Vllm.Outputs](Vllm.Outputs.md): Submodule bindings for `vllm.outputs`.
- [Vllm.Outputs.ClassificationOutput](Vllm.Outputs.ClassificationOutput.md): The output data of one classification output of a request.
- [Vllm.Outputs.ClassificationRequestOutput](Vllm.Outputs.ClassificationRequestOutput.md): The output data of a pooling request to the LLM.
- [Vllm.Outputs.CompletionOutput](Vllm.Outputs.CompletionOutput.md): The output data of one completion output of a request.
- [Vllm.Outputs.EmbeddingOutput](Vllm.Outputs.EmbeddingOutput.md): The output data of one embedding output of a request.
- [Vllm.Outputs.EmbeddingRequestOutput](Vllm.Outputs.EmbeddingRequestOutput.md): The output data of a pooling request to the LLM.
- [Vllm.Outputs.PoolingOutput](Vllm.Outputs.PoolingOutput.md): The output data of one pooling output of a request.
- [Vllm.Outputs.PoolingRequestOutput](Vllm.Outputs.PoolingRequestOutput.md): The output data of a pooling request to the LLM.
- [Vllm.Outputs.RequestOutput](Vllm.Outputs.RequestOutput.md): The output data of a completion request to the LLM.
- [Vllm.Outputs.RequestStateStats](Vllm.Outputs.RequestStateStats.md): Stats that need to be tracked across delta updates.
- [Vllm.Outputs.ScoringOutput](Vllm.Outputs.ScoringOutput.md): The output data of one scoring output of a request.
- [Vllm.Outputs.ScoringRequestOutput](Vllm.Outputs.ScoringRequestOutput.md): The output data of a pooling request to the LLM.
- [Vllm.Platforms](Vllm.Platforms.md): Submodule bindings for `vllm.platforms`.
- [Vllm.Platforms.CpuArchEnum](Vllm.Platforms.CpuArchEnum.md): Enum members for `CpuArchEnum`.
- [Vllm.Platforms.Platform](Vllm.Platforms.Platform.md): Wrapper for Python class Platform.
- [Vllm.Platforms.PlatformEnum](Vllm.Platforms.PlatformEnum.md): Enum members for `PlatformEnum`.
- [Vllm.Plugins](Vllm.Plugins.md): Submodule bindings for `vllm.plugins`.
- [Vllm.PoolingParams](Vllm.PoolingParams.md): API parameters for pooling models.
- [Vllm.PoolingParams.Module](Vllm.PoolingParams.Module.md): Submodule bindings for `vllm.pooling_params`.
- [Vllm.PoolingParams.RequestOutputKind](Vllm.PoolingParams.RequestOutputKind.md): Enum members for `RequestOutputKind`.
- [Vllm.PoolingParamsClass](Vllm.PoolingParamsClass.md): API parameters for pooling models.
- [Vllm.Profiler](Vllm.Profiler.md): Submodule bindings for `vllm.profiler`.
- [Vllm.Ray](Vllm.Ray.md): Submodule bindings for `vllm.ray`.
- [Vllm.Reasoning](Vllm.Reasoning.md): Submodule bindings for `vllm.reasoning`.
- [Vllm.Reasoning.ReasoningParser](Vllm.Reasoning.ReasoningParser.md): Abstract reasoning parser class that should not be used directly.
- [Vllm.Reasoning.ReasoningParserManager](Vllm.Reasoning.ReasoningParserManager.md): Central registry for ReasoningParser implementations.
- [Vllm.SamplingParams](Vllm.SamplingParams.md): Sampling parameters for text generation.
- [Vllm.SamplingParams.BeamSearchParams](Vllm.SamplingParams.BeamSearchParams.md): Beam search parameters for text generation.
- [Vllm.SamplingParams.Module](Vllm.SamplingParams.Module.md): Sampling parameters for text generation.
- [Vllm.SamplingParams.PydanticMsgspecMixin](Vllm.SamplingParams.PydanticMsgspecMixin.md): Sampling parameters for text generation.
- [Vllm.SamplingParams.RequestOutputKind](Vllm.SamplingParams.RequestOutputKind.md): Enum members for `RequestOutputKind`.
- [Vllm.SamplingParams.SamplingType](Vllm.SamplingParams.SamplingType.md): Enum members for `SamplingType`.
- [Vllm.SamplingParams.StructuredOutputsParams](Vllm.SamplingParams.StructuredOutputsParams.md): Sampling parameters for text generation.
- [Vllm.SamplingParams.TokenizerLike](Vllm.SamplingParams.TokenizerLike.md): Sampling parameters for text generation.
- [Vllm.SamplingParamsClass](Vllm.SamplingParamsClass.md): Sampling parameters for text generation.
- [Vllm.ScalarType](Vllm.ScalarType.md): ScalarType can represent a wide range of floating point and integer
- [Vllm.ScalarType.Module](Vllm.ScalarType.Module.md): Submodule bindings for `vllm.scalar_type`.
- [Vllm.ScalarType.NanRepr](Vllm.ScalarType.NanRepr.md): Enum members for `NanRepr`.
- [Vllm.ScalarType.ScalarTypes](Vllm.ScalarType.ScalarTypes.md): Wrapper for Python class scalar_types.
- [Vllm.Scripts](Vllm.Scripts.md): Submodule bindings for `vllm.scripts`.
- [Vllm.Sequence](Vllm.Sequence.md): Sequence and its related classes.
- [Vllm.Sequence.IntermediateTensors](Vllm.Sequence.IntermediateTensors.md): For all pipeline stages except the last, we need to return the hidden
- [Vllm.Sequence.KVConnectorOutput](Vllm.Sequence.KVConnectorOutput.md): Special type indicating an unconstrained type.
- [Vllm.Tasks](Vllm.Tasks.md): Submodule bindings for `vllm.tasks`.
- [Vllm.Tokenizers](Vllm.Tokenizers.md): Submodule bindings for `vllm.tokenizers`.
- [Vllm.Tokenizers.TokenizerLike](Vllm.Tokenizers.TokenizerLike.md): Wrapper for Python class TokenizerLike.
- [Vllm.ToolParsers](Vllm.ToolParsers.md): Submodule bindings for `vllm.tool_parsers`.
- [Vllm.ToolParsers.ToolParser](Vllm.ToolParsers.ToolParser.md): Abstract ToolParser class that should not be used directly.
- [Vllm.ToolParsers.ToolParserManager](Vllm.ToolParsers.ToolParserManager.md): Central registry for ToolParser implementations.
- [Vllm.Tracing](Vllm.Tracing.md): Submodule bindings for `vllm.tracing`.
- [Vllm.Tracing.BaseSpanAttributes](Vllm.Tracing.BaseSpanAttributes.md): Wrapper for Python class BaseSpanAttributes.
- [Vllm.Tracing.SpanAttributes](Vllm.Tracing.SpanAttributes.md): Wrapper for Python class SpanAttributes.
- [Vllm.TransformersUtils](Vllm.TransformersUtils.md): Submodule bindings for `vllm.transformers_utils`.
- [Vllm.TritonUtils](Vllm.TritonUtils.md): Submodule bindings for `vllm.triton_utils`.
- [Vllm.TritonUtils.TritonLanguagePlaceholder](Vllm.TritonUtils.TritonLanguagePlaceholder.md): Wrapper for Python class TritonLanguagePlaceholder.
- [Vllm.TritonUtils.TritonPlaceholder](Vllm.TritonUtils.TritonPlaceholder.md): Wrapper for Python class TritonPlaceholder.
- [Vllm.Usage](Vllm.Usage.md): Submodule bindings for `vllm.usage`.
- [Vllm.Utils](Vllm.Utils.md): Submodule bindings for `vllm.utils`.
- [Vllm.V1](Vllm.V1.md): Submodule bindings for `vllm.v1`.
- [Vllm.Version](Vllm.Version.md): Submodule bindings for `vllm.version`.
- Core API
  - [VLLM](VLLM.md): VLLM - vLLM for Elixir via SnakeBridge.