LLMs
Mix.install([
  {:bumblebee, "~> 0.5.0"},
  {:nx, "~> 0.7.0"},
  {:exla, "~> 0.7.0"},
  {:kino, "~> 0.12.0"}
])
Nx.global_default_backend({EXLA.Backend, client: :host})
Introduction
In this notebook we outline the general setup for running a Large Language Model (LLM).
Llama 2
In this section we look at running Meta's Llama model, specifically Llama 2, one of the most powerful open source Large Language Models (LLMs).
Note: this is a very involved model, so the generation can take a long time if you run it on a CPU. Also, running on the GPU currently requires at least 16GiB of VRAM.
In order to load Llama 2, you need to request access to the meta-llama/Llama-2-7b-chat-hf repository. Once you are granted access, generate a HuggingFace auth token and put it in an HF_TOKEN Livebook secret.
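Before loading anything, you can optionally make sure the secret is in place (Livebook exposes it to the runtime as the LB_HF_TOKEN environment variable); the check below is just a convenience and the error message is our own:
# Optional sanity check: fail early with a readable message if the
# HF_TOKEN Livebook secret has not been added to this session yet.
case System.fetch_env("LB_HF_TOKEN") do
  {:ok, _token} -> :ok
  :error -> raise "please add an HF_TOKEN secret to this Livebook session"
end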
Let's load the model and create a serving for text generation:
hf_token = System.fetch_env!("LB_HF_TOKEN")
repo = {:hf, "meta-llama/Llama-2-7b-chat-hf", auth_token: hf_token}
{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
:ok
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA]
  )
# Should be supervised
Kino.start_child({Nx.Serving, name: Llama, serving: serving})
Note that we load the parameters directly onto the GPU with Bumblebee.load_model(..., backend: EXLA.Backend)
and with defn_options: [compiler: EXLA]
we tell the serving to compile and run computations on the GPU as well.
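If you want to double-check where the parameters ended up, one optional way (sketched below, without assuming any model-specific parameter names) is to peek at a single loaded tensor and see which backend holds its data:
# Grab an arbitrary parameter tensor and report its backend. With the options
# above we expect EXLA.Backend rather than the default Nx.BinaryBackend.
model_info.params
|> Enum.flat_map(fn {_layer, params} -> Map.values(params) end)
|> List.first()
|> then(fn %Nx.Tensor{data: %backend{}} -> backend end)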
We adjust the generation config to use a non-deterministic generation strategy, so that the model is able to produce a slightly different output every time.
As for the other options, we specify :compile
with fixed shapes, so that the model is compiled only once and inputs are always padded to match these shapes. We also enable :stream
to receive text chunks as the generation is progressing.
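For comparison, here is a rough sketch (not meant to replace the serving above) of a deterministic, non-streaming variant: greedy search always picks the most likely token, so repeated runs give the same answer, and without stream: true the serving returns the complete result at once rather than a stream of chunks.
# Sketch of a deterministic, non-streaming serving. Building it triggers
# another compilation, so only run this cell if you want to compare the two.
greedy_config =
  Bumblebee.configure(generation_config, strategy: %{type: :greedy_search})

greedy_serving =
  Bumblebee.Text.generation(model_info, tokenizer, greedy_config,
    compile: [batch_size: 1, sequence_length: 1028],
    defn_options: [compiler: EXLA]
  )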
user_input = Kino.Input.textarea("User prompt", default: "What is love?")
user = Kino.Input.read(user_input)
prompt = """
[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
#{user} [/INST] \
"""
Nx.Serving.batched_run(Llama, prompt) |> Enum.each(&IO.write/1)
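Because the serving streams chunks, we are free to consume them however we like. As an alternative to IO.write/1 (just a sketch using standard Kino building blocks), the chunks can be accumulated and rendered into a frame so the answer updates live in the notebook:
# Render the streamed answer into a Kino frame instead of standard output.
frame = Kino.Frame.new() |> Kino.render()

Nx.Serving.batched_run(Llama, prompt)
|> Enum.reduce("", fn chunk, text ->
  text = text <> chunk
  Kino.Frame.render(frame, Kino.Markdown.new(text))
  text
end)

:ok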
Mistral
We can easily test other LLMs; we just need to change the repository and possibly adjust the prompt template. In this example we run the Mistral model.
repo = {:hf, "mistralai/Mistral-7B-Instruct-v0.2"}
{:ok, model_info} = Bumblebee.load_model(repo, type: :bf16, backend: EXLA.Backend)
{:ok, tokenizer} = Bumblebee.load_tokenizer(repo)
{:ok, generation_config} = Bumblebee.load_generation_config(repo)
:ok
generation_config =
  Bumblebee.configure(generation_config,
    max_new_tokens: 256,
    strategy: %{type: :multinomial_sampling, top_p: 0.6}
  )
serving =
  Bumblebee.Text.generation(model_info, tokenizer, generation_config,
    compile: [batch_size: 1, sequence_length: 1028],
    stream: true,
    defn_options: [compiler: EXLA]
  )
# Should be supervised
Kino.start_child({Nx.Serving, name: Mistral, serving: serving})
prompt = """
<s>[INST] What is your favourite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s>
[INST] Do you have mayonnaise recipes? [/INST]\
"""
Nx.Serving.batched_run(Mistral, prompt) |> Enum.each(&IO.write/1)
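The template above hardcodes a single past exchange. If you want to experiment with longer conversations, a small helper along these lines (purely illustrative, not part of Bumblebee, with tags that simply mirror the template written out above) can assemble the prompt from a list of previous turns:
# Illustrative helper: builds a Mistral-instruct prompt from prior
# {user, assistant} turns plus a new user message.
build_prompt = fn history, user_message ->
  previous =
    Enum.map_join(history, "\n", fn {user, assistant} ->
      "<s>[INST] #{user} [/INST]\n#{assistant}</s>"
    end)

  previous <> "\n[INST] #{user_message} [/INST]"
end

build_prompt.(
  [{"What is your favourite condiment?",
    "Well, I'm quite partial to a good squeeze of fresh lemon juice."}],
  "Do you have mayonnaise recipes?"
)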