Modules
Helper wrappers for graceful_serialization.
VLLM - vLLM for Elixir via SnakeBridge.
Runtime configuration helper for using vLLM safely via SnakeBridge/Snakepit.
vLLM: a high-throughput and memory-efficient inference engine for LLMs
Submodule bindings for vllm.assets.
Protocol class for clients of the engine.
Submodule bindings for vllm.attention.
Submodule bindings for vllm.beam_search.
Wrapper for Python class BeamSearchInstance.
The output of beam search.
A sequence for beam search.
Request for a LoRA adapter.
Information for supporting OpenAI-compatible logprobs and token ranks.
Submodule bindings for vllm.benchmarks.
Submodule bindings for vllm.collect_env.
SystemEnv(torch_version, is_debug_build, cuda_compiled_version, gcc_version, clang_version, cmake_version, os, libc_version, python_version, python_platform, is_cuda_available, cuda_runtime_version, cuda_module_loading, nvidia_driver_version, nvidia_gpu_models, cudnn_version, pip_version, pip_packages, conda_packages, hip_compiled_version, hip_runtime_version, miopen_runtime_version, caching_allocator_config, is_xnnpack_available, cpu_info, rocm_version, vllm_version, vllm_build_flags, gpu_topo, env_vars)
Submodule bindings for vllm.compilation.
Submodule bindings for vllm.config.
Configuration for attention mechanisms in vLLM.
Constants for the cudagraph mode in CompilationConfig.
Configuration for the KV cache.
Configuration for compilation.
The compilation approach used for torch.compile-based compilation of the
Configuration for the device to use for vLLM execution.
Configuration for distributed EC cache transfer.
Configuration for Expert Parallel Load Balancing (EPLB).
Configuration for KV event publishing.
Configuration for distributed KV cache transfer.
Configuration for LoRA.
Configuration for loading the model weights.
Configuration for the model.
Controls the behavior of multimodal models.
Configuration for observability: metrics and tracing.
Configuration for the distributed execution.
Configuration for custom Inductor passes.
Controls the behavior of output pooling in pooling models.
Dataclass which contains profiler config for the engine.
Scheduler configuration.
Configuration for speculative decoding.
Configuration for speech-to-text models.
Dataclass which contains structured outputs config for the engine.
Wrapper for Python class SupportsMetricsInfo.
Dataclass which contains all vllm-related configuration. This
Submodule bindings for vllm.connections.
Helper class to send HTTP requests.
Submodule bindings for vllm.device_allocator.
Submodule bindings for vllm.distributed.
Base class for device-specific communicator.
GraphCaptureContext(stream: torch.cuda.streams.Stream)
PyTorch ProcessGroup wrapper for a group of processes.
A dataclass to hold a metadata store, and the rank, world_size of the
TensorMetadata(device, dtype, size)
Submodule bindings for vllm.engine.
Submodule bindings for vllm.entrypoints.
Submodule bindings for vllm.env_override.
Submodule bindings for vllm.envs.
Custom exceptions for vLLM.
vLLM-specific validation error for request validation failures.
ForwardContext(no_compile_layers: dict[str, typing.Any], attn_metadata: dict[str, vllm.v1.attention.backend.AttentionMetadata] | list[dict[str, vllm.v1.attention.backend.AttentionMetadata]], virtual_engine: int, dp_metadata: vllm.forward_context.DPMetadata | None = None, cudagraph_runtime_mode: vllm.config.compilation.CUDAGraphMode = <CUDAGraphMode.NONE: 0>, batch_descriptor: vllm.forward_context.BatchDescriptor | None = None, ubatch_slices: list[vllm.v1.worker.ubatch_utils.UBatchSlice] | None = None, additional_kwargs: dict[str, typing.Any] = <factory>)
Wrapper for Python class AttentionMetadata.
Batch descriptor for cudagraph dispatching. We should keep the num of
DPMetadata(max_tokens_across_dp_cpu: torch.Tensor, num_tokens_across_dp_cpu: torch.Tensor, local_sizes: list[int] | None = None)
Submodule bindings for vllm.forward_context.
vLLM gRPC protocol definitions.
Submodule bindings for vllm.inputs.
Represents generic inputs handled by IO processor plugins.
Represents embeddings-based inputs.
Schema for a prompt provided via token embeddings.
The inputs in [LLMEngine][vllm.engine.llm_engine.LLMEngine] before they
Represents an encoder/decoder model input prompt,
Schema for a text prompt.
Represents token-based inputs.
Schema for a tokenized prompt.
An LLM for generating text from given prompts and sampling parameters.
Legacy LLMEngine for backwards compatibility.
Logging configuration for vLLM.
Wrapper for Python class ColoredFormatter.
Wrapper for Python class NewLineFormatter.
Submodule bindings for vllm.logging_utils.
Adds ANSI color codes to log levels for terminal output.
Adds logging prefix to newlines to align multi-line messages.
Submodule bindings for vllm.logits_process.
Wrapper for Python class NoBadWordsLogitsProcessor.
Wrapper for Python class TokenizerLike.
Submodule bindings for vllm.logprobs.
Flattens the logprobs of a request into multiple primitive-type lists.
Information for supporting OpenAI-compatible logprobs and token ranks.
Submodule bindings for vllm.lora.
Submodule bindings for vllm.model_executor.
Base parameter for vLLM linear layers. Extends the torch.nn.parameter
Submodule bindings for vllm.model_executor.models.adapters.
Submodule bindings for vllm.model_executor.models.interfaces.
Submodule bindings for vllm.model_executor.models.interfaces_base.
Parameter for model weights which are packed on disk.
Model inspection utilities for vLLM.
Submodule bindings for vllm.multimodal.
Submodule bindings for vllm.multimodal.inputs.
MultiModalFieldConfig(field: vllm.multimodal.inputs.BaseMultiModalField, modality: str)
Represents a keyword argument inside a
Represents the outputs of
A collection of
A dictionary of
Placeholder location information for multi-modal data.
Type annotations for modality types predefined by vLLM.
Wrapper for Python class MultiModalHasher.
A dictionary of
A registry that dispatches data processing according to the model.
Submodule bindings for vllm.multimodal.parse.
Submodule bindings for vllm.multimodal.processing.
Submodule bindings for vllm.multimodal.registry.
Submodule bindings for vllm.outputs.
The output data of one classification output of a request.
The output data of a pooling request to the LLM.
The output data of one completion output of a request.
The output data of one embedding output of a request.
The output data of a pooling request to the LLM.
The output data of one pooling output of a request.
The output data of a pooling request to the LLM.
The output data of a completion request to the LLM.
Stats that need to be tracked across delta updates.
The output data of one scoring output of a request.
The output data of a pooling request to the LLM.
Submodule bindings for vllm.platforms.
Enum members for CpuArchEnum.
Wrapper for Python class Platform.
Enum members for PlatformEnum.
Submodule bindings for vllm.plugins.
API parameters for pooling models.
Submodule bindings for vllm.pooling_params.
Enum members for RequestOutputKind.
API parameters for pooling models.
Submodule bindings for vllm.profiler.
Submodule bindings for vllm.ray.
Submodule bindings for vllm.reasoning.
Abstract reasoning parser class that should not be used directly.
Central registry for ReasoningParser implementations.
Sampling parameters for text generation.
Beam search parameters for text generation.
Sampling parameters for text generation.
Sampling parameters for text generation.
Enum members for RequestOutputKind.
Enum members for SamplingType.
Sampling parameters for text generation.
Sampling parameters for text generation.
Sampling parameters for text generation.
ScalarType can represent a wide range of floating point and integer
Submodule bindings for vllm.scalar_type.
Enum members for NanRepr.
Wrapper for Python class scalar_types.
Submodule bindings for vllm.scripts.
Sequence and its related classes.
For all pipeline stages except the last, we need to return the hidden
Special type indicating an unconstrained type.
Submodule bindings for vllm.tasks.
Submodule bindings for vllm.tokenizers.
Wrapper for Python class TokenizerLike.
Submodule bindings for vllm.tool_parsers.
Abstract ToolParser class that should not be used directly. Provided
Central registry for ToolParser implementations.
Submodule bindings for vllm.tracing.
Wrapper for Python class BaseSpanAttributes.
Wrapper for Python class SpanAttributes.
Submodule bindings for vllm.transformers_utils.
Submodule bindings for vllm.triton_utils.
Wrapper for Python class TritonLanguagePlaceholder.
Wrapper for Python class TritonPlaceholder.
Submodule bindings for vllm.usage.
Submodule bindings for vllm.utils.
Submodule bindings for vllm.v1.
Submodule bindings for vllm.version.