Model Fine-Tuning Guide

This guide covers fine-tuning Gemini models using supervised learning on Vertex AI.

Overview

Fine-tuning allows you to adapt Gemini models to your specific use case by training them on your custom datasets. This improves model performance for domain-specific tasks like customer support, code generation, content moderation, or specialized Q&A.

Key Benefits:

  • Improved accuracy for domain-specific tasks
  • Consistent output formatting and style
  • Better understanding of domain terminology
  • Reduced need for extensive prompting

Prerequisites

Required Setup

  1. Vertex AI Authentication - Tuning is only available on Vertex AI
  2. Google Cloud Project - Active GCP project with billing enabled
  3. Vertex AI API - Enable the Vertex AI API in your project
  4. Cloud Storage - GCS bucket for training data
  5. Permissions - Vertex AI User or Vertex AI Admin role

Supported Models

The following Gemini models support fine-tuning on Vertex AI:

  • gemini-2.5-pro-001 - Best quality, higher cost
  • gemini-2.5-flash-001 - Balanced quality and speed
  • gemini-2.5-flash-lite-001 - Fastest, most cost-effective

Cost Considerations

Fine-tuning incurs costs based on:

  • Training time (typically 1-4 hours)
  • Base model size
  • Number of training examples
  • Number of epochs

Estimate costs using the Google Cloud Pricing Calculator.
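As a rough illustration of how those factors combine, billed training volume scales with dataset tokens multiplied by epoch count. The sketch below is a back-of-the-envelope estimate only; `avg_tokens_per_example` and the per-1k-token rate are made-up placeholders, not real prices — use the Pricing Calculator for actual figures.

```elixir
# Back-of-the-envelope training-volume estimate (hypothetical numbers).
# Billed tokens roughly equal total dataset tokens times epoch count.
defmodule TuningCost do
  # avg_tokens_per_example and rate_per_1k_tokens are illustrative placeholders.
  def estimate(example_count, avg_tokens_per_example, epochs, rate_per_1k_tokens) do
    total_tokens = example_count * avg_tokens_per_example * epochs
    %{total_tokens: total_tokens, estimated_cost: total_tokens / 1000 * rate_per_1k_tokens}
  end
end

# 1,000 examples x 200 tokens x 10 epochs = 2,000,000 trained tokens
TuningCost.estimate(1_000, 200, 10, 0.005)
```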

Quick Start

1. Prepare Training Data

Create a JSONL file with your training examples:

{"contents": [{"role": "user", "parts": [{"text": "What is your refund policy?"}]}, {"role": "model", "parts": [{"text": "We offer full refunds within 30 days of purchase with proof of receipt."}]}]}
{"contents": [{"role": "user", "parts": [{"text": "How do I track my order?"}]}, {"role": "model", "parts": [{"text": "Visit our tracking page at example.com/track and enter your order number."}]}]}
{"contents": [{"role": "user", "parts": [{"text": "Do you ship internationally?"}]}, {"role": "model", "parts": [{"text": "Yes, we ship to over 50 countries. Shipping times vary by destination."}]}]}

Best Practices:

  • Minimum 100 examples recommended (more is better)
  • Maximum 10,000 examples per job
  • Balance your dataset across different topics
  • Include diverse input phrasings
  • Ensure consistent output quality
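A quick local sanity check catches most format errors before upload. This is a sketch, not part of the library, and assumes Elixir 1.18+ for the built-in `JSON` module; on earlier versions, substitute a library such as Jason.

```elixir
# Validate one JSONL line before uploading (requires Elixir 1.18+ for JSON).
defmodule TrainingDataCheck do
  def validate_line(line) do
    with {:ok, %{"contents" => contents}} when is_list(contents) <- JSON.decode(line),
         true <- Enum.all?(contents, &valid_turn?/1) do
      :ok
    else
      _ -> {:error, line}
    end
  end

  # Each turn needs a user/model role and at least one text part.
  defp valid_turn?(%{"role" => role, "parts" => [%{"text" => text} | _]})
       when role in ["user", "model"] and is_binary(text),
       do: true

  defp valid_turn?(_), do: false
end
```

Run it over the whole file with `File.stream!("training-data.jsonl") |> Enum.map(&TrainingDataCheck.validate_line/1)` and inspect any `{:error, line}` results.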

2. Upload to Cloud Storage

Upload your training data to GCS:

gsutil cp training-data.jsonl gs://my-bucket/tuning/training-data.jsonl

Optionally, create validation data:

gsutil cp validation-data.jsonl gs://my-bucket/tuning/validation-data.jsonl

3. Configure Authentication

Set up Vertex AI credentials:

# Using environment variables
System.put_env("VERTEX_PROJECT_ID", "my-project-id")
System.put_env("VERTEX_LOCATION", "us-central1")
System.put_env("VERTEX_ACCESS_TOKEN", "ya29....")

# Or using application config
config :gemini, :vertex_ai,
  project_id: "my-project-id",
  location: "us-central1",
  access_token: "ya29...."

4. Create a Tuning Job

alias Gemini.Types.Tuning.CreateTuningJobConfig
alias Gemini.APIs.Tunings

# Create job configuration
config = %CreateTuningJobConfig{
  base_model: "gemini-2.5-flash-001",
  tuned_model_display_name: "customer-support-model",
  training_dataset_uri: "gs://my-bucket/tuning/training-data.jsonl",
  validation_dataset_uri: "gs://my-bucket/tuning/validation-data.jsonl",
  epoch_count: 10,
  learning_rate_multiplier: 1.0
}

# Start tuning
{:ok, job} = Tunings.tune(config, auth: :vertex_ai)

IO.puts("Job created: #{job.name}")
IO.puts("State: #{job.state}")

5. Monitor Progress

Poll the job status periodically:

# Manual polling
{:ok, job} = Tunings.get(job_name, auth: :vertex_ai)

case job.state do
  :job_state_succeeded ->
    IO.puts("Training complete!")
    IO.puts("Tuned model: #{job.tuned_model}")

  :job_state_running ->
    IO.puts("Still training...")

  :job_state_failed ->
    IO.puts("Training failed: #{job.error.message}")

  _ ->
    IO.puts("Current state: #{job.state}")
end

# Or use automatic waiting
{:ok, completed_job} = Tunings.wait_for_completion(
  job.name,
  poll_interval: 60_000,    # Check every minute
  timeout: 7_200_000,       # Wait up to 2 hours
  on_status: fn j ->
    IO.puts("State: #{j.state}")
  end,
  auth: :vertex_ai
)

6. Use the Tuned Model

Once training succeeds, use your tuned model:

{:ok, response} = Gemini.generate(
  "What is your shipping policy?",
  model: completed_job.tuned_model,
  auth: :vertex_ai
)

IO.puts(response.text)

Training Data Format

Required Structure

Each line in your JSONL file must be a complete conversation:

{
  "contents": [
    {
      "role": "user",
      "parts": [{"text": "input text"}]
    },
    {
      "role": "model",
      "parts": [{"text": "expected output"}]
    }
  ]
}

Multi-Turn Conversations

For multi-turn examples:

{
  "contents": [
    {"role": "user", "parts": [{"text": "Hello"}]},
    {"role": "model", "parts": [{"text": "Hi! How can I help you?"}]},
    {"role": "user", "parts": [{"text": "I need help with my order"}]},
    {"role": "model", "parts": [{"text": "I'd be happy to help. What's your order number?"}]}
  ]
}
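If you generate examples programmatically, a small helper keeps the structure consistent. This builder is a sketch (not a library API); encode its output with `JSON.encode!/1` on Elixir 1.18+, or a JSON library on earlier versions.

```elixir
# Build one training example from {role, text} tuples (helper sketch).
defmodule ExampleBuilder do
  def build(turns) do
    %{
      "contents" =>
        Enum.map(turns, fn {role, text} ->
          %{"role" => to_string(role), "parts" => [%{"text" => text}]}
        end)
    }
  end
end

ExampleBuilder.build([
  {:user, "Hello"},
  {:model, "Hi! How can I help you?"}
])
```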

Validation Data

Create a separate validation set (10-20% of total data):

config = %CreateTuningJobConfig{
  base_model: "gemini-2.5-flash-001",
  tuned_model_display_name: "my-model",
  training_dataset_uri: "gs://bucket/training.jsonl",
  validation_dataset_uri: "gs://bucket/validation.jsonl"  # Optional but recommended
}
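One way to carve out that hold-out set locally is to shuffle the lines and split off a fraction. A minimal sketch:

```elixir
# Split a JSONL dataset into training and validation sets (default ~90/10).
defmodule DatasetSplit do
  def split(lines, validation_fraction \\ 0.1) do
    shuffled = Enum.shuffle(lines)
    validation_count = round(length(lines) * validation_fraction)
    {validation, training} = Enum.split(shuffled, validation_count)
    {training, validation}
  end
end
```

Read the source file with `File.read!/1` plus `String.split(contents, "\n", trim: true)`, then write each half back out with `File.write!/2` before uploading both to GCS.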

Hyperparameter Tuning

Epoch Count

Number of times the model trains on the full dataset:

config = %CreateTuningJobConfig{
  # ... other fields
  epoch_count: 15  # Default: 10, Range: 1-100
}

Guidelines:

  • More epochs = better learning but risk overfitting
  • Start with default (10) and adjust based on validation metrics
  • Use validation data to detect overfitting

Learning Rate Multiplier

Controls how quickly the model adapts:

config = %CreateTuningJobConfig{
  # ... other fields
  learning_rate_multiplier: 0.5  # Default: 1.0, Range: 0.1-2.0
}

Guidelines:

  • Lower (0.3-0.7) = more stable, slower convergence
  • Higher (1.5-2.0) = faster convergence, risk of instability
  • Start with 1.0 and adjust if needed

Adapter Size

Model capacity for fine-tuning:

config = %CreateTuningJobConfig{
  # ... other fields
  adapter_size: "ADAPTER_SIZE_FOUR"
}

Options:

  • "ADAPTER_SIZE_ONE" - Smallest, fastest, least capacity
  • "ADAPTER_SIZE_FOUR" - Balanced (default)
  • "ADAPTER_SIZE_EIGHT" - Larger capacity
  • "ADAPTER_SIZE_SIXTEEN" - Maximum capacity

Guidelines:

  • Use larger adapters for complex tasks
  • Start with default and increase if underfitting

Managing Tuning Jobs

List All Jobs

# List recent jobs
{:ok, response} = Tunings.list(auth: :vertex_ai)

Enum.each(response.tuning_jobs, fn job ->
  IO.puts("#{job.tuned_model_display_name}: #{job.state}")
end)

# With pagination
{:ok, response} = Tunings.list(
  page_size: 50,
  page_token: response.next_page_token,
  auth: :vertex_ai
)

# Get all jobs automatically
{:ok, all_jobs} = Tunings.list_all(auth: :vertex_ai)

Filter Jobs

# Filter by state
{:ok, succeeded} = Tunings.list(
  filter: "state=JOB_STATE_SUCCEEDED",
  auth: :vertex_ai
)

# Filter by label
{:ok, production} = Tunings.list(
  filter: "labels.environment=production",
  auth: :vertex_ai
)

Cancel Running Jobs

{:ok, job} = Tunings.cancel(job_name, auth: :vertex_ai)

# Verify cancellation (the job passes through :job_state_cancelling first)
{:ok, updated} = Tunings.get(job_name, auth: :vertex_ai)
true = updated.state in [:job_state_cancelling, :job_state_cancelled]

Best Practices

Data Quality

  1. Curate High-Quality Examples

    • Review and validate each example
    • Remove duplicates and errors
    • Ensure consistent formatting
  2. Balance Your Dataset

    • Equal representation of different topics
    • Diverse input phrasings
    • Consistent output style
  3. Use Validation Data

    • Hold out 10-20% for validation
    • Helps detect overfitting
    • Provides performance metrics
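For the duplicate removal in point 1, exact-duplicate JSONL lines can be dropped with a one-liner (a sketch; comparing trimmed lines so trailing whitespace doesn't hide duplicates):

```elixir
# Remove exact-duplicate training examples, keeping the first occurrence.
defmodule Dedupe do
  def run(lines), do: Enum.uniq_by(lines, &String.trim/1)
end
```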

Training Strategy

  1. Start Simple

    # Initial training
    config = %CreateTuningJobConfig{
      base_model: "gemini-2.5-flash-001",
      tuned_model_display_name: "model-v1",
      training_dataset_uri: "gs://bucket/data.jsonl",
      epoch_count: 10,
      learning_rate_multiplier: 1.0
    }
  2. Iterate and Improve

    • Test the tuned model
    • Collect failure cases
    • Add to training data
    • Retrain with updated data
  3. Monitor Metrics

    {:ok, job} = Tunings.get(job_name, auth: :vertex_ai)
    
    if job.tuning_data_stats do
      IO.inspect(job.tuning_data_stats, label: "Training Statistics")
    end

Production Deployment

  1. Version Your Models

    tuned_model_display_name: "support-model-v2-#{Date.utc_today()}"
  2. Label Your Jobs

    config = %CreateTuningJobConfig{
      # ... other fields
      labels: %{
        "environment" => "production",
        "version" => "v2",
        "team" => "ml-ops"
      }
    }
  3. Test Before Deployment

    • Validate on held-out test set
    • Compare with base model
    • A/B test in production

Troubleshooting

Common Issues

"Training data not found"

  • Verify GCS URI is correct
  • Check bucket permissions
  • Ensure file is in JSONL format

"Invalid training data format"

  • Validate each line is valid JSON
  • Check contents structure
  • Ensure proper role and parts fields

"Insufficient training data"

  • Minimum 100 examples recommended
  • Add more diverse examples
  • Check for duplicates

"Job failed during training"

  • Check error message in job.error
  • Verify data quality
  • Try reducing learning rate

Getting Help

# Check job error details
{:ok, job} = Tunings.get(job_name, auth: :vertex_ai)

if job.state == :job_state_failed do
  IO.puts("Error: #{job.error.message}")
  IO.puts("Code: #{job.error.code}")
  IO.inspect(job.error.details, label: "Details")
end

Complete Example

defmodule MyApp.ModelTuning do
  alias Gemini.Types.Tuning.CreateTuningJobConfig
  alias Gemini.APIs.Tunings

  def train_customer_support_model do
    # 1. Create configuration
    config = %CreateTuningJobConfig{
      base_model: "gemini-2.5-flash-001",
      tuned_model_display_name: "support-v1-#{Date.utc_today()}",
      training_dataset_uri: "gs://my-bucket/support-training.jsonl",
      validation_dataset_uri: "gs://my-bucket/support-validation.jsonl",
      epoch_count: 15,
      learning_rate_multiplier: 0.8,
      labels: %{"team" => "support", "version" => "v1"}
    }

    # 2. Start tuning
    {:ok, job} = Tunings.tune(config, auth: :vertex_ai)
    IO.puts("Started job: #{job.name}")

    # 3. Wait for completion
    {:ok, completed} = Tunings.wait_for_completion(
      job.name,
      poll_interval: 120_000,  # 2 minutes
      on_status: &log_progress/1,
      auth: :vertex_ai
    )

    # 4. Handle result
    case completed.state do
      :job_state_succeeded ->
        IO.puts("Success! Model: #{completed.tuned_model}")
        test_model(completed.tuned_model)

      :job_state_failed ->
        IO.puts("Failed: #{completed.error.message}")

      _ ->
        IO.puts("Unexpected state: #{completed.state}")
    end
  end

  defp log_progress(job) do
    IO.puts("[#{DateTime.utc_now()}] State: #{job.state}")

    if job.tuning_data_stats do
      IO.inspect(job.tuning_data_stats, label: "Stats")
    end
  end

  defp test_model(model_name) do
    test_prompts = [
      "What is your refund policy?",
      "How do I track my order?",
      "Do you ship internationally?"
    ]

    Enum.each(test_prompts, fn prompt ->
      {:ok, response} = Gemini.generate(prompt,
        model: model_name,
        auth: :vertex_ai
      )

      IO.puts("Q: #{prompt}")
      IO.puts("A: #{response.text}\n")
    end)
  end
end
