LLMs

Mix.install([
  {:bumblebee, "~> 0.6.0"},
  {:nx, "~> 0.9.0"},
  {:exla, "~> 0.9.0"},
  {:kino, "~> 0.14.0"}
])

Nx.global_default_backend({EXLA.Backend, client: :host})

Introduction

In this notebook we outline the general setup for running a Large Language Model (LLM).

Llama 2

In this section we look at running Meta's Llama model, specifically Llama 2, one of the most powerful open-source LLMs.

Note: this is a large model, so generation can take a long time if you run it on a CPU. Also, running it on a GPU currently requires at least 16 GiB of VRAM.

To load Llama 2, you need to request access to the model on meta-llama/Llama-2-7b-chat-hf. Once you are granted access, generate a HuggingFace auth token and put it in a HF_TOKEN Livebook secret.

Let's load the model and create a serving for text generation:

hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}

{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

:ok
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA]
  )

# Should be supervised
Kino.start_child({Nx.Serving, name: Llama, serving: serving})
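
As the comment notes, in an actual application you would start the serving under a supervision tree rather than with Kino. A minimal sketch of the corresponding child spec (the :batch_timeout value is only an example):

# Sketch: in an application, pass a child spec like this to
# Supervisor.start_link/2 (or add it to your application's children list)
# instead of starting the serving with Kino.
_child_spec = {Nx.Serving, serving: serving, name: Llama, batch_timeout: 100}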

Note that we load the parameters directly onto the GPU with Bumblebee.load_model(..., backend: EXLA.Backend) and with defn_options: [compiler: EXLA] we tell the serving to compile and run computations on the GPU as well.
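
If you have several devices available, you can also point both the parameters and the compiled computations at a specific EXLA client. The sketch below is commented out and assumes a CUDA GPU; the :cuda client name depends on your setup:

# Sketch (assumes a CUDA device is available):
#
#   {:ok, model_info} =
#     Bumblebee.load_model(repo, type: :bf16, backend: {EXLA.Backend, client: :cuda})
#
#   serving =
#     Bumblebee.Text.generation(model_info, tokenizer, generation_config,
#       compile: [batch_size: 1, sequence_length: 1028],
#       stream: true,
#       defn_options: [compiler: EXLA, client: :cuda]
#     )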

We adjust the generation config to use a non-deterministic generation strategy, so that the model is able to produce a slightly different output every time.
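
If you prefer reproducible output instead, you can build a deterministic variant of the config using Bumblebee's greedy search strategy. This config is only a sketch and is not used below:

# A deterministic variant of the config: greedy search always picks the most
# likely token, so the same prompt yields the same output.
deterministic_config =
  Bumblebee.configure(generation_config, strategy: %{type: :greedy_search})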

As for the other options, we specify :compile with fixed shapes, so that the model is compiled only once and inputs are always padded to match these shapes. We also enable :stream to receive text chunks as the generation is progressing.
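
For comparison, dropping :stream gives a serving that returns the whole result at once. The result shape noted below is an assumption based on Bumblebee's text generation output and may include extra token metadata:

# A non-streaming variant of the serving (not started here). Running it
# returns the full generation in one map, roughly:
#
#     %{results: [%{text: "..."}]}
#
blocking_serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    defn_options: [compiler: EXLA]
  )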

user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)

prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""

Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
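
The result is a stream of text chunks, so instead of printing them as they arrive you can also collect them into a single string (note that this runs the generation again):

# Collect the streamed chunks into one string, e.g. for post-processing.
full_text = Nx.Serving.batched_run(Llama, prompt) |> Enum.join()
IO.puts(full_text)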

Mistral

We can easily test other LLMs; we just need to change the repository and possibly adjust the prompt template. In this example we run the Mistral model.

Just like Llama, Mistral now also requires users to request access to its models. Make sure you are granted access, then generate a HuggingFace auth token and put it in a HF_TOKEN Livebook secret.

hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.2", auth_token: hf_token}

{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)

:ok
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )

serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 512],
    stream: true,
    defn_options: [compiler: EXLA]
  )

# Should be supervised
Kino.start_child({Nx.Serving, name: Mistral, serving: serving})
prompt = """
<s>[INST] What is your favourite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>
[INST] Do you have mayonnaise recipes? [/INST]\
"""

Nx.Serving.batched_run(Mistral, prompt) |> Enum.each(&IO.write/1)
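
As with Llama, you could wrap this chat format in a helper that builds prompts from a list of conversation turns. The helper below is hypothetical, not part of Bumblebee:

# Hypothetical helper that renders {:user, text} / {:assistant, text} turns
# into the Mistral instruction format shown above.
mistral_prompt = fn messages ->
  body =
    Enum.map_join(messages, "\n", fn
      {:user, text} -> "[INST] #{text} [/INST]"
      {:assistant, text} -> "#{text}</s>"
    end)

  "<s>" <> body
end

# For example:
# mistral_prompt.([
#   {:user, "What is your favourite condiment?"},
#   {:assistant, "Well, I'm quite partial to a good squeeze of fresh lemon juice."},
#   {:user, "Do you have mayonnaise recipes?"}
# ])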