LlmGuard.Detector behaviour (LlmGuard v0.3.1)

Behaviour for implementing security detectors in LlmGuard.

All detectors must implement this behaviour to be compatible with the LlmGuard security pipeline. Detectors analyze input/output text for security threats and return structured results indicating whether threats were detected.

Detection Layers

LlmGuard uses a multi-layer detection strategy:

  1. Pattern Matching (~1ms) - Fast regex-based detection using known patterns
  2. Heuristic Analysis (~10ms) - Statistical and structural analysis
  3. ML Classification (~50ms) - Transformer-based detection for sophisticated attacks
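To make the ordering concrete, a pipeline can run the cheap layers first and stop at the first detection. The sketch below assumes three hypothetical detector modules (PatternDetector, HeuristicDetector, MlDetector), each implementing this behaviour; they are illustrations, not part of LlmGuard's public API.

# Hypothetical layer modules, ordered from cheapest to most expensive.
layers = [PatternDetector, HeuristicDetector, MlDetector]

Enum.reduce_while(layers, {:safe, %{}}, fn detector, acc ->
  case detector.detect(input, []) do
    # Stop at the first detection; later (slower) layers are skipped.
    {:detected, _details} = hit -> {:halt, hit}
    # No threat found by this layer; continue with the next one.
    {:safe, _metadata} -> {:cont, acc}
  end
end)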

Result Format

Detectors must return one of:

  • {:safe, metadata} - No threats detected
  • {:detected, result} - Threat detected with details

The detection result map must include:

  • :confidence - Float between 0.0 and 1.0 indicating detection confidence
  • :category - Atom categorizing the type of threat detected
  • :patterns_matched - List of pattern identifiers that matched
  • :metadata - Map with additional context about the detection

Examples

defmodule MyDetector do
  @behaviour LlmGuard.Detector

  @impl true
  def detect(input, _opts) do
    # Naive keyword check; real detectors typically use compiled patterns or heuristics.
    if String.contains?(input, "threat") do
      {:detected, %{
        confidence: 0.95,
        category: :custom_threat,
        patterns_matched: ["threat_keyword"],
        metadata: %{reason: "Contains threat keyword"}
      }}
    else
      {:safe, %{checked: true}}
    end
  end

  @impl true
  def name, do: "my_detector"

  @impl true
  def description, do: "Detects custom threats"
end
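With that module compiled, calling it directly shows both return shapes:

MyDetector.detect("please ignore this threat", [])
#=> {:detected, %{confidence: 0.95, category: :custom_threat,
#     patterns_matched: ["threat_keyword"], metadata: %{reason: "Contains threat keyword"}}}

MyDetector.detect("hello world", [])
#=> {:safe, %{checked: true}}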

Performance Considerations

Detectors should be designed with performance in mind:

  • Pattern matching should complete in <2ms (P95)
  • Heuristic analysis should complete in <10ms (P95)
  • ML-based detection should complete in <100ms (P95)

Return early once a threat is detected, and precompile regex patterns (for example, in module attributes) rather than rebuilding them on every call; a sketch follows below.
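One way to stay within these budgets is to compile patterns once at compile time and exit on the first match. The module below is a hedged sketch, not LlmGuard's internal implementation; the pattern list and confidence value are illustrative.

defmodule FastPatternDetector do
  @behaviour LlmGuard.Detector

  # Compile patterns once, at compile time, instead of on every call.
  @patterns [
    {"ignore_previous", ~r/ignore\s+(all\s+)?previous\s+instructions/i},
    {"system_prompt_probe", ~r/reveal\s+your\s+system\s+prompt/i}
  ]

  @impl true
  def detect(input, _opts) do
    # Enum.find_value/2 stops at the first matching pattern, giving an early exit.
    case Enum.find_value(@patterns, fn {id, regex} ->
           if Regex.match?(regex, input), do: id
         end) do
      nil ->
        {:safe, %{patterns_checked: length(@patterns)}}

      pattern_id ->
        {:detected, %{
          confidence: 0.9,
          category: :prompt_injection,
          patterns_matched: [pattern_id],
          metadata: %{}
        }}
    end
  end

  @impl true
  def name, do: "fast_pattern_detector"

  @impl true
  def description, do: "Example pattern detector with precompiled regexes"
end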

Summary

Callbacks

description()

Returns a human-readable description of what this detector does.

detect(input, opts)

Analyzes input text for security threats.

name()

Returns the detector's unique identifier name.

Types

detected_result()

@type detected_result() :: %{
  confidence: float(),
  category: atom(),
  patterns_matched: [String.t()],
  metadata: map()
}

detection_result()

@type detection_result() :: {:safe, safe_result()} | {:detected, detected_result()}

input()

@type input() :: String.t()

opts()

@type opts() :: keyword()

safe_result()

@type safe_result() :: %{optional(atom()) => any()}

Callbacks

description()

@callback description() :: String.t()

Returns a human-readable description of what this detector does.

Examples

def description, do: "Detects prompt injection attacks using pattern matching"

detect(input, opts)

@callback detect(input(), opts()) :: detection_result()

Analyzes input text for security threats.

Parameters

  • input - The text to analyze (user input, LLM output, etc.)
  • opts - Keyword list of options to customize detection behavior

Common options:

  • :threshold - Minimum confidence threshold (default: 0.7)
  • :enabled - Whether this detector is enabled (default: true)
  • :max_patterns - Maximum number of patterns to check (for performance)

Returns

  • {:safe, metadata} - No threats detected
  • {:detected, result} - Threat detected with confidence and details
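Inside a detector, these options are typically read with Keyword.get/3 and used to gate the result. A minimal sketch of honoring :enabled and :threshold follows; score/1 is a hypothetical helper returning a confidence between 0.0 and 1.0.

def detect(input, opts) do
  threshold = Keyword.get(opts, :threshold, 0.7)

  if Keyword.get(opts, :enabled, true) do
    # score/1 is a placeholder for whatever scoring this detector performs.
    confidence = score(input)

    if confidence >= threshold do
      {:detected, %{
        confidence: confidence,
        category: :prompt_injection,
        patterns_matched: [],
        metadata: %{threshold: threshold}
      }}
    else
      {:safe, %{confidence: confidence}}
    end
  else
    {:safe, %{skipped: true}}
  end
end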

name()

@callback name() :: String.t()

Returns the detector's unique identifier name.

This should be a short, snake_case string identifying the detector.

Examples

def name, do: "prompt_injection"