Configuration for compilation.
You must pass CompilationConfig to the VLLMConfig constructor. VLLMConfig's post_init performs further initialization. If CompilationConfig is used outside of VLLMConfig, some fields will be left in an improper state.
It has three parts:
- Top-level Compilation control:
    - [mode][vllm.config.CompilationConfig.mode]
    - [debug_dump_path][vllm.config.CompilationConfig.debug_dump_path]
    - [cache_dir][vllm.config.CompilationConfig.cache_dir]
    - [backend][vllm.config.CompilationConfig.backend]
    - [custom_ops][vllm.config.CompilationConfig.custom_ops]
    - [splitting_ops][vllm.config.CompilationConfig.splitting_ops]
    - [compile_mm_encoder][vllm.config.CompilationConfig.compile_mm_encoder]
- CudaGraph capture:
    - [cudagraph_mode][vllm.config.CompilationConfig.cudagraph_mode]
    - [cudagraph_capture_sizes][vllm.config.CompilationConfig.cudagraph_capture_sizes]
    - [max_cudagraph_capture_size][vllm.config.CompilationConfig.max_cudagraph_capture_size]
    - [cudagraph_num_of_warmups][vllm.config.CompilationConfig.cudagraph_num_of_warmups]
    - [cudagraph_copy_inputs][vllm.config.CompilationConfig.cudagraph_copy_inputs]
- Inductor compilation:
    - [compile_sizes][vllm.config.CompilationConfig.compile_sizes]
    - [compile_ranges_split_points][vllm.config.CompilationConfig.compile_ranges_split_points]
    - [inductor_compile_config][vllm.config.CompilationConfig.inductor_compile_config]
    - [inductor_passes][vllm.config.CompilationConfig.inductor_passes]
    - custom inductor passes
Why we have different sizes for cudagraph and inductor:
- cudagraph: a cudagraph captured for a specific size can only be used for the same size. We need to capture all the sizes we want to use.
- inductor: a graph compiled by inductor for a general shape can be used for different sizes. Inductor can also compile for specific sizes, where it has more information to optimize the graph with fully static shapes. However, we find that general-shape compilation is sufficient for most cases. It might be beneficial to compile for certain small batch sizes, where inductor is good at optimizing.
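The difference between the two lookups can be sketched in plain Python. This is only an illustration of the reasoning above, not vLLM's implementation: `pick_cudagraph` and `pick_compiled_graph` are hypothetical helper names, and the string values stand in for captured or compiled artifacts.

```python
from typing import Optional


def pick_cudagraph(captured: dict[int, str], batch_size: int) -> Optional[str]:
    """A captured cudagraph is only usable at the exact size it was captured for."""
    return captured.get(batch_size)


def pick_compiled_graph(
    specialized: dict[int, str], general: str, batch_size: int
) -> str:
    """Prefer a size-specialized graph if one exists, else use the general-shape graph."""
    return specialized.get(batch_size, general)


# Hypothetical artifacts, keyed by batch size.
captured = {1: "cudagraph_bs1", 2: "cudagraph_bs2", 4: "cudagraph_bs4"}
specialized = {1: "inductor_static_bs1"}
general = "inductor_general_shape"

exact_hit = pick_cudagraph(captured, 2)        # a graph exists for size 2
exact_miss = pick_cudagraph(captured, 3)       # no graph was captured for size 3
static_hit = pick_compiled_graph(specialized, general, 1)
general_fallback = pick_compiled_graph(specialized, general, 3)
```

In this sketch, size 3 has no cudagraph at all, while the compiled-graph lookup still succeeds through the general-shape fallback. That asymmetry is why the two size lists are configured separately.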
Summary
Functions
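- _skip_none_validation: Skip validation if the value is None when initialisation is delayed.
- adjust_cudagraph_sizes_for_spec_decode: Python method CompilationConfig.adjust_cudagraph_sizes_for_spec_decode.
- compute_bs_to_padded_graph_size: Python method CompilationConfig.compute_bs_to_padded_graph_size.
- compute_hash: Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph.
- custom_op_log_check: Log the enabled/disabled custom ops and check that the passed custom_ops field only contains relevant ops.
- get_compile_ranges: Get the compile ranges for the compilation config.
- init_backend: Initialize the backend for the compilation config from a vllm config.
- is_attention_compiled_piecewise: Python method CompilationConfig.is_attention_compiled_piecewise.
- is_custom_op_enabled: Python method CompilationConfig.is_custom_op_enabled.
- new: Constructs CompilationConfig.
- post_init_cudagraph_sizes: Complete the initialization after the cudagraph-related configs are set.
- set_splitting_ops_for_attn_fusion: Python method CompilationConfig.set_splitting_ops_for_attn_fusion.
- set_splitting_ops_for_v1: Python method CompilationConfig.set_splitting_ops_for_v1.
- splitting_ops_contain_attention: Python method CompilationConfig.splitting_ops_contain_attention.
- validate_compile_cache_save_format: Python method CompilationConfig.validate_compile_cache_save_format.
- validate_cudagraph_mode_before: Enable parsing of the cudagraph_mode enum type from string.
- validate_mode_before: Enable parsing the mode field from string mode names.
- validate_pass_config_before: Enable parsing of the pass_config field from a dictionary.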
Types
Functions
@spec _skip_none_validation(SnakeBridge.Ref.t(), term(), term(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Skip validation if the value is None when initialisation is delayed.
Parameters
- value (term())
- handler (term())
Returns
term()
@spec adjust_cudagraph_sizes_for_spec_decode(SnakeBridge.Ref.t(), integer(), integer(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.adjust_cudagraph_sizes_for_spec_decode.
Parameters
- uniform_decode_query_len (integer())
- tensor_parallel_size (integer())
Returns
term()
@spec backend(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec bs_to_padded_graph_size(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec cache_dir(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec compilation_time(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec compile_mm_encoder(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec compile_ranges_split_points(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec compile_sizes(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec compute_bs_to_padded_graph_size(SnakeBridge.Ref.t(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.compute_bs_to_padded_graph_size.
Returns
term()
@spec compute_hash(SnakeBridge.Ref.t(), keyword()) :: {:ok, String.t()} | {:error, Snakepit.Error.t()}
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
Returns
String.t()
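One way to picture such a hash is to serialize only the graph-affecting fields and digest them. The sketch below is an assumption-laden illustration, not the actual hashing code: `compute_structure_hash` is a hypothetical helper, the choice of SHA-256 is arbitrary, and which fields count as structural is picked here purely for the example.

```python
import hashlib
import json


def compute_structure_hash(config: dict, structural_keys: list[str]) -> str:
    """Hash only the fields assumed to change the computation graph."""
    relevant = {key: config.get(key) for key in sorted(structural_keys)}
    # sort_keys gives a stable serialization; default=str covers non-JSON values.
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical selection of structural fields, for illustration only.
structural_keys = ["mode", "backend", "splitting_ops"]

base = {"mode": 3, "backend": "inductor", "splitting_ops": [], "debug_dump_path": "/tmp/a"}
same_graph = dict(base, debug_dump_path="/tmp/b")  # differs only in a non-structural field
other_graph = dict(base, mode=0)                   # differs in a structural field

hash_base = compute_structure_hash(base, structural_keys)
hash_same = compute_structure_hash(same_graph, structural_keys)
hash_other = compute_structure_hash(other_graph, structural_keys)
```

Two configs that differ only in fields outside the structural set hash identically, so a cache keyed on the hash can be shared between them.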
@spec cudagraph_capture_sizes(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec cudagraph_copy_inputs(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec cudagraph_mode(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec cudagraph_num_of_warmups(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec cudagraph_specialize_lora(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec custom_op_log_check(SnakeBridge.Ref.t(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
This method logs the enabled/disabled custom ops and checks that the passed custom_ops field only contains relevant ops. It is called at the end of set_current_vllm_config, after the custom ops have been instantiated.
Returns
term()
@spec debug_dump_path(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec get_compile_ranges(SnakeBridge.Ref.t(), keyword()) :: {:ok, [term()]} | {:error, Snakepit.Error.t()}
Get the compile ranges for the compilation config.
Returns
list(term())
@spec init_backend(SnakeBridge.Ref.t(), term(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Initialize the backend for the compilation config from a vllm config.
Parameters
- vllm_config - The vllm config to initialize the backend from.
Returns
term()
@spec is_attention_compiled_piecewise(SnakeBridge.Ref.t(), keyword()) :: {:ok, boolean()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.is_attention_compiled_piecewise.
Returns
boolean()
@spec is_custom_op_enabled(SnakeBridge.Ref.t(), String.t(), keyword()) :: {:ok, boolean()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.is_custom_op_enabled.
Parameters
- op (String.t())
Returns
boolean()
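This page does not describe how the check is decided. The sketch below assumes the convention used for vLLM's custom_ops field: a default of `all` or `none`, with individual ops switched on by a `+name` entry or off by a `-name` entry. Treat that convention, the helper name `custom_op_enabled_sketch`, and the op names as assumptions rather than the library's code.

```python
def custom_op_enabled_sketch(custom_ops: list[str], op: str) -> bool:
    """Decide whether `op` is enabled under the assumed '+name' / '-name' convention."""
    if "all" in custom_ops:
        # Everything is on unless explicitly disabled.
        return f"-{op}" not in custom_ops
    # Otherwise everything is off unless explicitly enabled.
    return f"+{op}" in custom_ops


# Op names here are illustrative.
disabled_by_override = custom_op_enabled_sketch(["all", "-rms_norm"], "rms_norm")
enabled_by_override = custom_op_enabled_sketch(["none", "+rms_norm"], "rms_norm")
disabled_by_default = custom_op_enabled_sketch(["none"], "silu_and_mul")
enabled_by_default = custom_op_enabled_sketch(["all"], "silu_and_mul")
```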
@spec level(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec local_cache_dir(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec max_cudagraph_capture_size(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec mode(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec new(term(), term(), term(), keyword()) :: {:ok, SnakeBridge.Ref.t()} | {:error, Snakepit.Error.t()}
Constructs CompilationConfig.
Parameters
- dataclass_self__ (term())
- args (term())
- kwargs (term())
@spec post_init_cudagraph_sizes(SnakeBridge.Ref.t(), keyword()) :: {:ok, nil} | {:error, Snakepit.Error.t()}
Completes the initialization after the cudagraph-related configs are set. This includes:
- initializing compile_sizes
- pre-computing the mapping bs_to_padded_graph_size
Returns
nil
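A mapping like bs_to_padded_graph_size can be pictured as a lookup table from each batch size to the smallest captured size that can hold it. The following is a minimal sketch under that assumption: `compute_padded_sizes` is a hypothetical helper, and the real table layout (for example, how index 0 is treated) may differ.

```python
import bisect


def compute_padded_sizes(capture_sizes: list[int]) -> list[int]:
    """Build a table where table[bs] is the smallest capture size >= bs."""
    if not capture_sizes:
        raise ValueError("capture_sizes must not be empty")
    sizes = sorted(capture_sizes)
    max_size = sizes[-1]
    table = [0] * (max_size + 1)  # index 0 is left as 0 in this sketch
    for bs in range(1, max_size + 1):
        # bisect_left finds the first capture size that is >= bs.
        table[bs] = sizes[bisect.bisect_left(sizes, bs)]
    return table


padded = compute_padded_sizes([1, 2, 4, 8])
# padded == [0, 1, 2, 4, 4, 8, 8, 8, 8]
```

Pre-computing the table once turns the per-step lookup into a single list index, which is presumably why it is built during initialization instead of on each call.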
@spec set_splitting_ops_for_attn_fusion(SnakeBridge.Ref.t(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.set_splitting_ops_for_attn_fusion.
Returns
term()
@spec set_splitting_ops_for_v1(SnakeBridge.Ref.t(), String.t(), [term()], keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.set_splitting_ops_for_v1.
Parameters
- all2all_backend (String.t())
- data_parallel_size (integer() default: 1)
Returns
term()
@spec splitting_ops(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec splitting_ops_contain_attention(SnakeBridge.Ref.t(), keyword()) :: {:ok, boolean()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.splitting_ops_contain_attention.
Returns
boolean()
@spec use_inductor_graph_partition(SnakeBridge.Ref.t()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
@spec validate_compile_cache_save_format(SnakeBridge.Ref.t(), String.t(), keyword()) :: {:ok, String.t()} | {:error, Snakepit.Error.t()}
Python method CompilationConfig.validate_compile_cache_save_format.
Parameters
- value (String.t())
Returns
String.t()
@spec validate_cudagraph_mode_before(SnakeBridge.Ref.t(), term(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Enable parsing of the cudagraph_mode enum type from string.
Parameters
- value (term())
Returns
term()
@spec validate_mode_before(SnakeBridge.Ref.t(), term(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Enable parsing the mode field from string mode names.
Accepts both integers (0-3) and string names, like NONE, STOCK_TORCH_COMPILE, DYNAMO_TRACE_ONCE, VLLM_COMPILE.
Parameters
- value (term())
Returns
term()
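The parsing behaviour described for the mode field can be approximated with an IntEnum. This is a sketch, not the validator itself: `CompilationModeSketch` and `parse_mode` are hypothetical names, the name-to-integer mapping assumes the names are listed in order 0 to 3, and accepting lowercase names or digit strings is an extra assumption of this example.

```python
from enum import IntEnum
from typing import Union


class CompilationModeSketch(IntEnum):
    # Assumed mapping: names taken in the order they are listed, 0 through 3.
    NONE = 0
    STOCK_TORCH_COMPILE = 1
    DYNAMO_TRACE_ONCE = 2
    VLLM_COMPILE = 3


def parse_mode(value: Union[int, str]) -> CompilationModeSketch:
    """Accept an integer (0-3) or a mode name and return the enum member."""
    if isinstance(value, str):
        name = value.strip().upper()
        if name.isdigit():
            return CompilationModeSketch(int(name))
        return CompilationModeSketch[name]  # raises KeyError for unknown names
    return CompilationModeSketch(value)     # raises ValueError outside 0-3


from_name = parse_mode("VLLM_COMPILE")
from_int = parse_mode(3)
from_lower = parse_mode("none")
```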
@spec validate_pass_config_before(SnakeBridge.Ref.t(), term(), keyword()) :: {:ok, term()} | {:error, Snakepit.Error.t()}
Enable parsing of the pass_config field from a dictionary.
Parameters
- value (term())
Returns
term()