Llama

Mix.install([
  {:bumblebee, "~> 0.4.2"},
  {:nx, "~> 0.6.1"},
  {:exla, "~> 0.6.1"},
  {:kino, "~> 0.10.0"}
])

Nx.global_default_backend({EXLA.Backend, client: :host})

Introduction

In this notebook we look at running Meta's Llama model, specifically Llama 2, one of the most powerful open source Large Language Models (LLMs).

Note: this is a very involved model, so generation can take a long time if you run it on a CPU. Also, running on a GPU currently requires a lot of VRAM; 24 GB has been verified to work.
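If you do want to run on a GPU, EXLA needs to be built with GPU support. One way to do that in Livebook is to pass the XLA_TARGET environment variable to Mix.install via the :system_env option before the dependencies are fetched. The snippet below is only a sketch assuming a CUDA 12 setup; check the XLA project docs for the target matching your drivers:

Mix.install(
  [
    {:bumblebee, "~> 0.4.2"},
    {:nx, "~> 0.6.1"},
    {:exla, "~> 0.6.1"},
    {:kino, "~> 0.10.0"}
  ],
  # "cuda120" is an assumption for CUDA 12; pick the XLA_TARGET matching your setup
  system_env: %{"XLA_TARGET" => "cuda120"}
)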

Text generation

In order to load Llama 2, you need to request access to the meta-llama/Llama-2-7b-chat-hf repository. Once you are granted access, generate a Hugging Face auth token and store it in an HF_TOKEN Livebook secret.

Let's load the model and create a serving for text generation:

hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}

{:ok, model_info} = Bumblebee.load_model(repo, backend: {EXLA.Backend, client: :host})
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA, lazy_transfers: :always]
  )

# The serving should run under a supervisor; Kino.start_child/1 starts it supervised in the notebook
Kino.start_child({Nx.Serving, name: Llama, serving: serving})

We adjust the generation config to use a non-deterministic generation strategy. The most interesting part, though, is the combination of serving options.
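As an aside, if you prefer reproducible output, you could switch the strategy back to greedy search, reusing the generation_config loaded above (a minimal sketch):

generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    # Greedy search always picks the most likely token, so the output is deterministic
    strategy: %{type: :greedy_search}
  )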

First, note that we specify {EXLA.Backend, client: :host} as the backend for the model parameters, which ensures they are initially loaded onto the CPU. This is important because Bumblebee may need to apply certain operations to the parameters as they are loaded, and we don't want to involve the GPU at that point and risk an out-of-memory error.

With that, there are a few combinations of options related to parameters, trading off memory usage against speed:

  1. defn_options: [compiler: EXLA], preallocate_params: true - move all parameters to the GPU upfront and keep them there. This requires the most memory, but should provide the fastest inference time (see the sketch below).

  2. defn_options: [compiler: EXLA] - copy all parameters to the GPU before each computation and discard them afterwards. This requires less memory, but the copying increases the inference time.

  3. defn_options: [compiler: EXLA, lazy_transfers: :always] - lazily copy parameters to the GPU during the computation, as needed. This requires the least memory, at the cost of slower inference.

As for the other options, we specify :compile with fixed shapes, so that the model is compiled only once and inputs are always padded to match these shapes. We also enable :stream to receive text chunks as the generation progresses.
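For comparison, here is what the serving would look like with the first combination from the list above, keeping all parameters on the GPU. This is a sketch assuming your GPU has enough VRAM to hold the whole model:

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    # Move the parameters to the GPU once and keep them there (assumes enough VRAM)
    preallocate_params: true,
    defn_options: [compiler: EXLA]
  )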

user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)

prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""

Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
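
Since the serving streams text chunks, you can also accumulate them into a single string instead of printing them as they arrive (a small sketch):

# Collect the streamed chunks into one string
text = Nx.Serving.batched_run(Llama, prompt) |> Enum.join()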