Zero-shot Classification Guide
Complete guide to zero-shot text classification in Nasty using Natural Language Inference (NLI) models.
Overview
Zero-shot classification allows you to classify text into arbitrary categories without any training data. It works by framing classification as a Natural Language Inference (NLI) problem.
Key Benefits:
- No training data required
- Works with any label set you define
- Add new categories instantly
- Multi-label classification support
- Typically 70-85% accuracy on common tasks
How It Works
The model treats classification as textual entailment:
- Premise: Your input text
- Hypothesis: "This text is about {label}"
- Prediction: The probability that the premise entails the hypothesis
For each candidate label, the model predicts an entailment probability. The label with the highest probability wins.
Example
Text: "I love this product!"
Labels: positive, negative, neutral
Process:
- "I love this product!" entails "This text is about positive" → 95%
- "I love this product!" entails "This text is about negative" → 2%
- "I love this product!" entails "This text is about neutral" → 3%
Result: positive (95% confidence)
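Conceptually, the scoring loop looks like the sketch below. This is an illustration only, not Nasty's implementation; score_label/2 is a hypothetical stand-in for the real NLI model call:
defmodule ZeroShotSketch do
  # Illustration of the zero-shot loop, not Nasty's actual implementation.
  def classify(text, labels) do
    scores =
      Map.new(labels, fn label ->
        hypothesis = "This text is about #{label}"
        {label, score_label(text, hypothesis)}
      end)

    {best, _score} = Enum.max_by(scores, fn {_label, score} -> score end)
    %{label: best, scores: scores, sequence: text}
  end

  # Hypothetical placeholder: a real implementation returns P(entailment)
  # from an NLI model for the (premise, hypothesis) pair.
  defp score_label(_premise, _hypothesis), do: :rand.uniform()
end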
Quick Start
CLI Usage
# Single text classification
mix nasty.zero_shot \
--text "I love this product!" \
--labels positive,negative,neutral
# Output:
# Text: I love this product!
# Predicted: positive
# Confidence: 95.3%
#
# All scores:
# positive: 95.3% ████████████████████
# neutral: 3.2% █
# negative: 1.5%
Programmatic Usage
alias Nasty.Statistics.Neural.Transformers.ZeroShot
{:ok, result} = ZeroShot.classify("I love this product!",
candidate_labels: ["positive", "negative", "neutral"]
)
# result = %{
# label: "positive",
# scores: %{
# "positive" => 0.953,
# "neutral" => 0.032,
# "negative" => 0.015
# },
# sequence: "I love this product!"
# }
Common Use Cases
1. Sentiment Analysis
mix nasty.zero_shot \
--text "The movie was boring and predictable" \
--labels positive,negative,neutral
Why it works: Clear emotional content maps well to sentiment labels.
2. Topic Classification
mix nasty.zero_shot \
--text "Bitcoin reaches new all-time high" \
--labels technology,finance,sports,politics,business
Why it works: Topics have distinct semantic spaces.
3. Intent Detection
mix nasty.zero_shot \
--text "Can you help me reset my password?" \
--labels question,request,complaint,praise
Why it works: Intents have characteristic linguistic patterns.
4. Content Moderation
mix nasty.zero_shot \
--text "This is the worst service ever!!!" \
--labels spam,offensive,normal,promotional
Why it works: Moderation categories have clear signals.
5. Email Routing
mix nasty.zero_shot \
--text "Urgent: Server down in production" \
--labels urgent,normal,low_priority,informational
Why it works: Urgency and importance have lexical markers.
Multi-label Classification
Assign multiple labels when appropriate:
mix nasty.zero_shot \
--text "Urgent: Please review the attached technical document" \
--labels urgent,action_required,informational,technical \
--multi-label \
--threshold 0.5
Output:
Predicted labels: urgent, action_required, technical
All scores:
[✓] urgent: 0.89
[✓] action_required: 0.76
[✓] technical: 0.68
[ ] informational: 0.34
Only labels above the threshold (0.5) are selected.
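To apply the same filtering programmatically, a minimal sketch (assuming multi-label mode returns independent per-label scores, as in the output above):
threshold = 0.5

scores = %{
  "urgent" => 0.89,
  "action_required" => 0.76,
  "technical" => 0.68,
  "informational" => 0.34
}

selected =
  scores
  |> Enum.filter(fn {_label, score} -> score >= threshold end)
  |> Enum.map(fn {label, _score} -> label end)
  |> Enum.sort()
# => ["action_required", "technical", "urgent"]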
Multi-label Use Cases
- Document tagging: Tag with multiple topics
- Email categorization: Both "urgent" AND "technical"
- Content flags: Multiple moderation issues
- Skill extraction: Multiple skills from job description
Batch Classification
Process multiple texts efficiently:
# Create input file
cat > texts.txt << EOF
I love this product!
The service was terrible
It's okay, nothing special
EOF
# Classify batch
mix nasty.zero_shot \
--input texts.txt \
--labels positive,negative,neutral \
--output results.json
Result saved to results.json:
[
{
"text": "I love this product!",
"result": {
"label": "positive",
"scores": {"positive": 0.95, "neutral": 0.03, "negative": 0.02}
},
"success": true
},
...
]
Supported Models
RoBERTa-MNLI (Default)
Best for: English text, highest accuracy
--model roberta_large_mnli
Specs:
- Parameters: 355M
- Languages: English only
- Accuracy: 85-90% on many tasks
- Speed: Medium
BART-MNLI
Best for: Alternative to RoBERTa, slightly different strengths
--model bart_large_mnli
Specs:
- Parameters: 400M
- Languages: English only
- Accuracy: 83-88%
- Speed: Slower than RoBERTa
XLM-RoBERTa
Best for: Multilingual (Spanish, Catalan, etc.)
--model xlm_roberta_base
Specs:
- Parameters: 270M
- Languages: 100 languages
- Accuracy: 75-85% (varies by language)
- Speed: Fast
Custom Hypothesis Templates
Change how classification is framed:
# Default template
--hypothesis-template "This text is about {}"
# Custom templates
--hypothesis-template "This message is {}"
--hypothesis-template "The sentiment is {}"
--hypothesis-template "The topic of this text is {}"
--hypothesis-template "This document contains {}"
Example:
mix nasty.zero_shot \
--text "Please call me back ASAP" \
--labels urgent,normal,low_priority \
--hypothesis-template "This message is {}"
Generates hypotheses:
- "This message is urgent"
- "This message is normal"
- "This message is low_priority"
Best Practices
1. Choose Clear, Distinct Labels
Good:
--labels positive,negative,neutral
--labels urgent,normal,low_priority
--labels technical,business,personal
Bad (too similar):
--labels happy,joyful,cheerful # Too similar!
--labels important,critical,essential # Overlapping!
2. Use Descriptive Label Names
Good:
--labels positive_sentiment,negative_sentiment,neutral_sentiment
Better:
--labels positive,negative,neutral # Simpler, but clear
Bad:
--labels pos,neg,neu # Too cryptic
--labels 1,2,3 # Meaningless
3. Provide 2-6 Labels
- Too few (1 label): Not classification
- Sweet spot (2-6 labels): Best accuracy
- Too many (10+ labels): Accuracy degrades
4. Use Multi-label for Overlapping Concepts
Single-label (mutually exclusive):
--labels positive,negative,neutral
Multi-label (can overlap):
--labels urgent,technical,action_required,informational \
--multi-label
5. Adjust Threshold for Multi-label
# Conservative (fewer labels)
--threshold 0.7
# Balanced (default)
--threshold 0.5
# Liberal (more labels)
--threshold 0.3
Performance Tips
When Zero-shot Works Best
✓ Clear semantic categories
✓ 2-6 distinct labels
✓ Labels have characteristic language patterns
✓ English text (for RoBERTa-MNLI)
✓ Medium-length text (10-200 words)
When to Use Fine-tuning Instead
✗ Need >90% accuracy
✗ Domain-specific jargon
✗ Subtle distinctions between labels
✗ Have 1000+ labeled examples
✗ Production-critical systems
Zero-shot is great for prototyping and low-stakes classification. For production, consider fine-tuning.
Limitations
1. Language Dependence
RoBERTa-MNLI only works well for English. For other languages:
# Spanish/Catalan
--model xlm_roberta_base
Expect accuracy roughly 10-15 points lower than for English.
2. Accuracy Ceiling
Zero-shot typically achieves 70-85% accuracy. Fine-tuning can reach 95-99%.
3. Context Window
Models have a maximum input length (typically ~512 tokens), so long documents need truncation:
# Truncate to first 512 tokens automatically
--max-length 512
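To pre-truncate in code instead, a rough word-based sketch (word counts only approximate token counts, so leave headroom below 512):
long_document = String.duplicate("word ", 2_000)

truncate = fn text, max_words ->
  text |> String.split() |> Enum.take(max_words) |> Enum.join(" ")
end

short_text = truncate.(long_document, 300)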
4. Label Sensitivity
Results can vary with label phrasing:
# These may give different results:
--labels positive,negative
--labels good,bad
--labels happy,sad
Test different phrasings to find what works best.
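One quick way to do that is to run the same text against each candidate phrasing and compare results (a sketch using the classify/2 call shown earlier):
alias Nasty.Statistics.Neural.Transformers.ZeroShot

text = "I love this product!"

label_sets = [
  ["positive", "negative"],
  ["good", "bad"],
  ["happy", "sad"]
]

for labels <- label_sets do
  {:ok, result} = ZeroShot.classify(text, candidate_labels: labels)
  IO.puts("#{inspect(labels)} -> #{result.label}")
end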
Troubleshooting
All Scores Are Similar
Problem: Scores like 0.33, 0.34, 0.33 (no clear winner)
Causes:
- Labels are too similar
- Text is ambiguous
- Poor hypothesis template
Solutions:
- Use more distinct labels
- Try different hypothesis template
- Add more context to text
- Consider whether the text is truly ambiguous (see the sketch below)
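A simple programmatic check is the margin between the top two scores; the 0.1 cutoff below is an arbitrary choice, not a Nasty default:
ambiguous? = fn scores, margin ->
  [top, second | _] = scores |> Map.values() |> Enum.sort(:desc)
  top - second < margin
end

ambiguous?.(%{"a" => 0.34, "b" => 0.33, "c" => 0.33}, 0.1)
# => true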
Wrong Label Predicted
Problem: Clearly wrong prediction
Causes:
- Label phrasing doesn't match text semantics
- Need different hypothesis template
- Text is out-of-domain for model
Solutions:
- Rephrase labels
- Change hypothesis template
- Try different model
- Consider fine-tuning for your domain
Slow Performance
Problem: Classification takes too long
Solutions:
- Use a smaller model (xlm_roberta_base instead of roberta_large_mnli)
- Enable GPU (set XLA_TARGET=cuda)
- Reduce number of labels
- Use batch processing for multiple texts
Advanced Usage
Programmatic Batch Processing
alias Nasty.Statistics.Neural.Transformers.ZeroShot
texts = [
"I love this!",
"Terrible service",
"It's okay"
]
{:ok, results} = ZeroShot.classify_batch(texts,
candidate_labels: ["positive", "negative", "neutral"]
)
# results = [
# %{label: "positive", scores: %{...}, sequence: "I love this!"},
# %{label: "negative", scores: %{...}, sequence: "Terrible service"},
# %{label: "neutral", scores: %{...}, sequence: "It's okay"}
# ]
Confidence Thresholding
Reject low-confidence predictions:
{:ok, result} = ZeroShot.classify(text,
candidate_labels: ["positive", "negative", "neutral"]
)
max_score = result.scores[result.label]
if max_score < 0.6 do
# Too uncertain, flag for human review
{:uncertain, result}
else
{:confident, result}
end
Hierarchical Classification
First classify broadly, then refine:
# Step 1: Broad category
{:ok, broad} = ZeroShot.classify(text,
candidate_labels: ["product", "service", "support"]
)
# Step 2: Specific subcategory
specific_labels = case broad.label do
"product" -> ["quality", "price", "features"]
"service" -> ["delivery", "installation", "maintenance"]
"support" -> ["technical", "billing", "general"]
end
{:ok, specific} = ZeroShot.classify(text,
candidate_labels: specific_labels
)
Comparison with Other Methods
| Method | Training Data | Accuracy | Setup Time | Flexibility |
|---|---|---|---|---|
| Zero-shot | 0 examples | 70-85% | Instant | Very high |
| Few-shot | 10-100 examples | 80-90% | Minutes | High |
| Fine-tuning | 1000+ examples | 95-99% | Hours | Medium |
| Rule-based | N/A | 60-80% | Days | Low |
Recommendation: Start with zero-shot, move to fine-tuning if accuracy is insufficient.
Production Deployment
Caching Results
defmodule ClassificationCache do
  alias Nasty.Statistics.Neural.Transformers.ZeroShot

  @table :classification_cache

  # Create the ETS table once, e.g. during application startup.
  def init_cache do
    :ets.new(@table, [:named_table, :public, read_concurrency: true])
  end

  def classify_cached(text, labels) do
    # Join with separators so different label lists can't collide.
    key = :crypto.hash(:md5, text <> "|" <> Enum.join(labels, ",")) |> Base.encode16()

    case :ets.lookup(@table, key) do
      [{^key, cached}] ->
        cached

      [] ->
        {:ok, result} = ZeroShot.classify(text, candidate_labels: labels)
        :ets.insert(@table, {key, result})
        result
    end
  end
end
Rate Limiting
defmodule RateLimiter do
  alias Nasty.Statistics.Neural.Transformers.ZeroShot

  def classify_with_limit(text, labels) do
    case check_rate_limit() do
      :ok ->
        ZeroShot.classify(text, candidate_labels: labels)

      {:error, :rate_limited} ->
        {:error, "Too many requests, please retry later"}
    end
  end

  # Placeholder: plug in your own limiter (e.g. an ETS counter).
  defp check_rate_limit, do: :ok
end
Fallback Strategies
# naive_bayes_classify/2 and rule_based_classify/2 are your own fallbacks.
def classify_robust(text, labels) do
  case ZeroShot.classify(text, candidate_labels: labels) do
    {:ok, result} ->
      if result.scores[result.label] > 0.6 do
        {:ok, result}
      else
        # Confidence too low: fall back to a simpler method
        naive_bayes_classify(text, labels)
      end

    {:error, _} ->
      # Model unavailable: use rule-based classification
      rule_based_classify(text, labels)
  end
end
See Also
- FINE_TUNING.md - Train models for higher accuracy
- CROSS_LINGUAL.md - Multilingual classification
- PRETRAINED_MODELS.md - Available transformer models