Zero-shot Classification Guide

Complete guide to zero-shot text classification in Nasty using Natural Language Inference models.

Overview

Zero-shot classification allows you to classify text into arbitrary categories without any training data. It works by framing classification as a Natural Language Inference (NLI) problem.

Key Benefits:

  • No training data required
  • Works with any label set you define
  • Add new categories instantly
  • Multi-label classification support
  • 70-85% accuracy on many tasks

How It Works

The model treats classification as textual entailment:

  1. Premise: Your input text
  2. Hypothesis: "This text is about {label}"
  3. Prediction: Probability that the premise entails the hypothesis

For each candidate label, the model predicts the entailment probability. The label with the highest probability wins.
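
Concretely, one hypothesis is built per candidate label by filling the template. A minimal sketch of that step (illustrative only, not the Nasty internals):

template = "This text is about {}"
labels = ["positive", "negative", "neutral"]

hypotheses = Enum.map(labels, &String.replace(template, "{}", &1))
# => ["This text is about positive", "This text is about negative",
#     "This text is about neutral"]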

Example

Text: "I love this product!"

Labels: positive, negative, neutral

Process:

  • "I love this product!" entails "This text is about positive" → 95%
  • "I love this product!" entails "This text is about negative" → 2%
  • "I love this product!" entails "This text is about neutral" → 3%

Result: positive (95% confidence)

Quick Start

CLI Usage

# Single text classification
mix nasty.zero_shot \
  --text "I love this product!" \
  --labels positive,negative,neutral

# Output:
# Text: I love this product!
#   Predicted: positive
#   Confidence: 95.3%
#
#   All scores:
#     positive: 95.3% ████████████████████
#     neutral:   3.2% █
#     negative:  1.5%

Programmatic Usage

alias Nasty.Statistics.Neural.Transformers.ZeroShot

{:ok, result} = ZeroShot.classify("I love this product!",
  candidate_labels: ["positive", "negative", "neutral"]
)

# result = %{
#   label: "positive",
#   scores: %{
#     "positive" => 0.953,
#     "neutral" => 0.032,
#     "negative" => 0.015
#   },
#   sequence: "I love this product!"
# }

Common Use Cases

1. Sentiment Analysis

mix nasty.zero_shot \
  --text "The movie was boring and predictable" \
  --labels positive,negative,neutral

Why it works: Clear emotional content maps well to sentiment labels.

2. Topic Classification

mix nasty.zero_shot \
  --text "Bitcoin reaches new all-time high" \
  --labels technology,finance,sports,politics,business

Why it works: Topics have distinct semantic spaces.

3. Intent Detection

mix nasty.zero_shot \
  --text "Can you help me reset my password?" \
  --labels question,request,complaint,praise

Why it works: Intents have characteristic linguistic patterns.

4. Content Moderation

mix nasty.zero_shot \
  --text "This is the worst service ever!!!" \
  --labels spam,offensive,normal,promotional

Why it works: Moderation categories have clear signals.

5. Email Routing

mix nasty.zero_shot \
  --text "Urgent: Server down in production" \
  --labels urgent,normal,low_priority,informational

Why it works: Urgency and importance have lexical markers.

Multi-label Classification

Assign multiple labels when appropriate:

mix nasty.zero_shot \
  --text "Urgent: Please review the attached technical document" \
  --labels urgent,action_required,informational,technical \
  --multi-label \
  --threshold 0.5

Output:

Predicted labels: urgent, action_required, technical

All scores:
  [✓] urgent:          0.89
  [✓] action_required: 0.76
  [✓] technical:       0.68
  [ ] informational:   0.34

Only labels above threshold (0.5) are selected.
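
Programmatically, the same selection is a filter over the scores map (a sketch assuming the scores shape shown in the single-label example, with the illustrative 0.5 threshold from above):

scores = %{
  "urgent" => 0.89,
  "action_required" => 0.76,
  "technical" => 0.68,
  "informational" => 0.34
}

selected =
  scores
  |> Enum.filter(fn {_label, score} -> score >= 0.5 end)
  |> Enum.map(fn {label, _score} -> label end)
# => ["action_required", "technical", "urgent"] (map order, not score order)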

Multi-label Use Cases

  • Document tagging: Tag with multiple topics
  • Email categorization: Both "urgent" AND "technical"
  • Content flags: Multiple moderation issues
  • Skill extraction: Multiple skills from job description

Batch Classification

Process multiple texts efficiently:

# Create input file
cat > texts.txt << EOF
I love this product!
The service was terrible
It's okay, nothing special
EOF

# Classify batch
mix nasty.zero_shot \
  --input texts.txt \
  --labels positive,negative,neutral \
  --output results.json

Result saved to results.json:

[
  {
    "text": "I love this product!",
    "result": {
      "label": "positive",
      "scores": {"positive": 0.95, "neutral": 0.03, "negative": 0.02}
    },
    "success": true
  },
  ...
]
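
The JSON output can be loaded back for downstream processing. A sketch, assuming the Jason library is available:

results = "results.json" |> File.read!() |> Jason.decode!()

results
|> Enum.filter(& &1["success"])
|> Enum.frequencies_by(& &1["result"]["label"])
# => %{"negative" => 1, "neutral" => 1, "positive" => 1}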

Supported Models

RoBERTa-MNLI (Default)

Best for: English text, highest accuracy

--model roberta_large_mnli

Specs:

  • Parameters: 355M
  • Languages: English only
  • Accuracy: 85-90% on many tasks
  • Speed: Medium

BART-MNLI

Best for: an alternative to RoBERTa with slightly different strengths

--model bart_large_mnli

Specs:

  • Parameters: 400M
  • Languages: English only
  • Accuracy: 83-88%
  • Speed: Slower than RoBERTa

XLM-RoBERTa

Best for: Multilingual (Spanish, Catalan, etc.)

--model xlm_roberta_base

Specs:

  • Parameters: 270M
  • Languages: 100
  • Accuracy: 75-85% (varies by language)
  • Speed: Fast

Custom Hypothesis Templates

Change how classification is framed:

# Default template
--hypothesis-template "This text is about {}"

# Custom templates
--hypothesis-template "This message is {}"
--hypothesis-template "The sentiment is {}"
--hypothesis-template "The topic of this text is {}"
--hypothesis-template "This document contains {}"

Example:

mix nasty.zero_shot \
  --text "Please call me back ASAP" \
  --labels urgent,normal,low_priority \
  --hypothesis-template "This message is {}"

Generates hypotheses:

  • "This message is urgent"
  • "This message is normal"
  • "This message is low_priority"

Best Practices

1. Choose Clear, Distinct Labels

Good:

--labels positive,negative,neutral
--labels urgent,normal,low_priority
--labels technical,business,personal

Bad (too similar):

--labels happy,joyful,cheerful  # Too similar!
--labels important,critical,essential  # Overlapping!

2. Use Descriptive Label Names

Good:

--labels positive_sentiment,negative_sentiment,neutral_sentiment

Better:

--labels positive,negative,neutral  # Simpler, but clear

Bad:

--labels pos,neg,neu  # Too cryptic
--labels 1,2,3  # Meaningless

3. Provide 2-6 Labels

  • Too few (1 label): Not classification
  • Sweet spot (2-6 labels): Best accuracy
  • Too many (10+ labels): Accuracy degrades

4. Use Multi-label for Overlapping Concepts

Single-label (mutually exclusive):

--labels positive,negative,neutral

Multi-label (can overlap):

--labels urgent,technical,action_required,informational \
--multi-label

5. Adjust Threshold for Multi-label

# Conservative (fewer labels)
--threshold 0.7

# Balanced (default)
--threshold 0.5

# Liberal (more labels)
--threshold 0.3

Performance Tips

When Zero-shot Works Best

✓ Clear semantic categories
✓ 2-6 distinct labels
✓ Labels have characteristic language patterns
✓ English text (for RoBERTa-MNLI)
✓ Medium-length text (10-200 words)

When to Use Fine-tuning Instead

✗ Need >90% accuracy
✗ Domain-specific jargon
✗ Subtle distinctions between labels
✗ Have 1000+ labeled examples
✗ Production-critical systems

Zero-shot is great for prototyping and low-stakes classification. For production, consider fine-tuning.

Limitations

1. Language Dependence

RoBERTa-MNLI only works well for English. For other languages:

# Spanish/Catalan
--model xlm_roberta_base

Expect 10-15% lower accuracy than for English.

2. Accuracy Ceiling

Zero-shot typically achieves 70-85% accuracy. Fine-tuning can reach 95-99%.

3. Context Window

Models have a maximum input length (~512 tokens). Long documents need truncation:

# Truncate to first 512 tokens automatically
--max-length 512

4. Label Sensitivity

Results can vary with label phrasing:

# These may give different results:
--labels positive,negative
--labels good,bad
--labels happy,sad

Test different phrasings to find what works best.
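
A quick way to compare phrasings empirically (an illustrative sketch):

alias Nasty.Statistics.Neural.Transformers.ZeroShot

text = "Battery died after two days"
label_sets = [["positive", "negative"], ["good", "bad"], ["happy", "sad"]]

for labels <- label_sets do
  {:ok, result} = ZeroShot.classify(text, candidate_labels: labels)
  IO.puts("#{inspect(labels)} -> #{result.label} (#{result.scores[result.label]})")
end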

Troubleshooting

All Scores Are Similar

Problem: Scores like 0.33, 0.34, 0.33 (no clear winner)

Causes:

  • Labels are too similar
  • Text is ambiguous
  • Poor hypothesis template

Solutions:

  1. Use more distinct labels
  2. Try different hypothesis template
  3. Add more context to text
  4. Consider whether the text is truly ambiguous (a quick check is sketched below)
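
One way to detect this failure mode is to compare the best score against a uniform distribution over the labels. A sketch, assuming the result shape returned by ZeroShot.classify/2; the 0.15 margin is an arbitrary starting point:

{:ok, result} = ZeroShot.classify(text, candidate_labels: labels)

max_score = result.scores |> Map.values() |> Enum.max()
uniform = 1.0 / map_size(result.scores)

# Flag results whose best score barely beats chance
if max_score - uniform < 0.15 do
  {:ambiguous, result}
else
  {:confident, result}
end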

Wrong Label Predicted

Problem: Clearly wrong prediction

Causes:

  • Label phrasing doesn't match text semantics
  • Need different hypothesis template
  • Text is out-of-domain for model

Solutions:

  1. Rephrase labels
  2. Change hypothesis template
  3. Try different model
  4. Consider fine-tuning for your domain

Slow Performance

Problem: Classification takes too long

Solutions:

  1. Use a smaller model (e.g., xlm_roberta_base instead of roberta_large_mnli)
  2. Enable GPU (set XLA_TARGET=cuda)
  3. Reduce number of labels
  4. Use batch processing for multiple texts

Advanced Usage

Programmatic Batch Processing

alias Nasty.Statistics.Neural.Transformers.ZeroShot

texts = [
  "I love this!",
  "Terrible service",
  "It's okay"
]

{:ok, results} = ZeroShot.classify_batch(texts,
  candidate_labels: ["positive", "negative", "neutral"]
)

# results = [
#   %{label: "positive", scores: %{...}, sequence: "I love this!"},
#   %{label: "negative", scores: %{...}, sequence: "Terrible service"},
#   %{label: "neutral", scores: %{...}, sequence: "It's okay"}
# ]

Confidence Thresholding

Reject low-confidence predictions:

{:ok, result} = ZeroShot.classify(text,
  candidate_labels: ["positive", "negative", "neutral"]
)

max_score = result.scores[result.label]

if max_score < 0.6 do
  # Too uncertain, flag for human review
  {:uncertain, result}
else
  {:confident, result}
end

Hierarchical Classification

First classify broadly, then refine:

# Step 1: Broad category
{:ok, broad} = ZeroShot.classify(text,
  candidate_labels: ["product", "service", "support"]
)

# Step 2: Specific subcategory
specific_labels = case broad.label do
  "product" -> ["quality", "price", "features"]
  "service" -> ["delivery", "installation", "maintenance"]
  "support" -> ["technical", "billing", "general"]
end

{:ok, specific} = ZeroShot.classify(text,
  candidate_labels: specific_labels
)

Comparison with Other Methods

Method        Training Data     Accuracy   Setup Time   Flexibility
Zero-shot     0 examples        70-85%     Instant      Very high
Few-shot      10-100 examples   80-90%     Minutes      High
Fine-tuning   1000+ examples    95-99%     Hours        Medium
Rule-based    N/A               60-80%     Days         Low

Recommendation: Start with zero-shot, move to fine-tuning if accuracy is insufficient.

Production Deployment

Caching Results

defmodule ClassificationCache do
  alias Nasty.Statistics.Neural.Transformers.ZeroShot

  @table :classification_cache

  # Backed by a public ETS table rather than a GenServer, so cached reads
  # don't serialize through a single process. Call start/0 once at boot.
  def start do
    :ets.new(@table, [:set, :public, :named_table])
  end

  def classify_cached(text, labels) do
    # Join with separators so ["ab"] and ["a", "b"] can't collide
    cache_key =
      :crypto.hash(:md5, text <> "|" <> Enum.join(labels, ","))
      |> Base.encode16()

    case :ets.lookup(@table, cache_key) do
      [{^cache_key, cached}] ->
        cached

      [] ->
        {:ok, result} = ZeroShot.classify(text, candidate_labels: labels)
        :ets.insert(@table, {cache_key, result})
        result
    end
  end
end

Rate Limiting

defmodule RateLimiter do
  alias Nasty.Statistics.Neural.Transformers.ZeroShot

  # Fixed-window counter kept in the process dictionary; swap in a shared
  # store (ETS, or a library such as Hammer) when requests span processes.
  @max_per_minute 60

  def classify_with_limit(text, labels) do
    window = div(System.system_time(:second), 60)

    case Process.get(:rate_limit, {window, 0}) do
      {^window, n} when n >= @max_per_minute ->
        {:error, "Too many requests, please retry later"}
      {^window, n} ->
        Process.put(:rate_limit, {window, n + 1})
        ZeroShot.classify(text, candidate_labels: labels)
      _stale_window ->
        Process.put(:rate_limit, {window, 1})
        ZeroShot.classify(text, candidate_labels: labels)
    end
  end
end

Fallback Strategies

# `naive_bayes_classify/2` and `rule_based_classify/2` are placeholders for
# app-specific fallbacks; each should return {:ok, result} or an error.
def classify_robust(text, labels) do
  case ZeroShot.classify(text, candidate_labels: labels) do
    {:ok, result} ->
      if result.scores[result.label] > 0.6 do
        {:ok, result}
      else
        # Low confidence: fall back to a simpler method
        naive_bayes_classify(text, labels)
      end

    {:error, _} ->
      # Model unavailable: use a rule-based fallback
      rule_based_classify(text, labels)
  end
end

See Also