OORL.PolicyLearningFramework (object v0.1.2)

Individual policy learning with social awareness based on AAOS interaction dyads

Summary

Functions

interaction_dyad_learning(object_id, dyad_experiences)
Learns from interaction dyad experiences.

social_imitation_learning(object_id, peer_policies, performance_rankings)
Performs selective imitation learning from high-performing peers.

update_policy(object_id, experiences, social_context)
Updates an object's policy based on experiences and social context.

Functions

interaction_dyad_learning(object_id, dyad_experiences)

Learns from interaction dyad experiences.

Processes learning specifically from dyadic interactions, which often provide higher-quality learning signals due to sustained cooperation.

Parameters

  • object_id - ID of the learning object
  • dyad_experiences - Experiences from interaction dyads

Returns

  • Aggregated learning updates from all active dyads
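
Examples

A minimal illustrative call, assuming dyad_experiences is a map of dyad IDs to experience lists shaped like the experiences passed to update_policy/3 below; both the input shape and the return value are assumptions and are not asserted here.

# Illustrative call only (input shape assumed; result not asserted)
iex> dyad_experiences = %{
...>   "dyad_1" => [
...>     %{state: %{x: 0}, action: :right, reward: 1.0, next_state: %{x: 1}}
...>   ]
...> }
iex> OORL.PolicyLearning.interaction_dyad_learning("agent_1", dyad_experiences)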

social_imitation_learning(object_id, peer_policies, performance_rankings)

@spec social_imitation_learning(
  Object.object_id(),
  %{required(Object.object_id()) => OORL.policy_spec()},
  [{Object.object_id(), float()}]
) :: %{required(Object.object_id()) => float()}

Performs selective imitation learning from high-performing peers.

Analyzes peer performance and compatibility to selectively imitate successful behaviors while maintaining object individuality. This guards against naive copying and keeps social learning beneficial.

Parameters

  • object_id - ID of the learning object
  • peer_policies - Map of peer object IDs to their policy specifications
  • performance_rankings - List of {peer_id, performance_score} tuples sorted by performance (highest first)

Returns

Map of peer IDs to imitation weights (0.0-1.0) where:

  • Higher weights indicate stronger imitation influence
  • Weights are based on both performance and compatibility
  • Zero weights mean no imitation from that peer

Selection Criteria

Peers are selected for imitation based on:

Performance Threshold

  • Only the top 3 performers are considered
  • Performance must exceed a minimum threshold
  • Recent performance is weighted more heavily

Compatibility Assessment

  • Policy similarity and behavioral alignment
  • Successful interaction history
  • Complementary vs competing objectives

Interaction Dyad Strength

  • Stronger dyads indicate successful collaboration
  • Trust and reliability from past interactions
  • Communication effectiveness

Examples

# Imitation learning with performance rankings
iex> peer_policies = %{
...>   "agent_2" => %{type: :neural, performance: 0.85},
...>   "agent_3" => %{type: :tabular, performance: 0.92},
...>   "agent_4" => %{type: :neural, performance: 0.78}
...> }
iex> performance_rankings = [
...>   {"agent_3", 0.92},
...>   {"agent_2", 0.85},
...>   {"agent_4", 0.78}
...> ]
iex> weights = OORL.PolicyLearning.social_imitation_learning(
...>   "agent_1", peer_policies, performance_rankings
...> )
iex> weights
%{"agent_3" => 0.75, "agent_2" => 0.45}

Imitation Weight Calculation

The weight for each peer is computed as:

weight = compatibility * performance * dyad_strength

Where:

  • compatibility ∈ [0.0, 1.0] based on behavioral similarity
  • performance ∈ [0.0, 1.0] normalized performance score
  • dyad_strength ∈ [0.0, 1.0] interaction dyad effectiveness
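
A minimal sketch of this combination in Elixir, assuming all three factors are already normalized to [0.0, 1.0]; the helper and the sample values are illustrative and not part of the library API.

# Hypothetical helper combining the three normalized factors
imitation_weight = fn compatibility, performance, dyad_strength ->
  compatibility * performance * dyad_strength
end

imitation_weight.(0.9, 0.92, 0.9)
# => ≈ 0.745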

Compatibility Factors

Compatibility assessment includes:

  • Policy Architecture: Similar neural networks vs tabular policies
  • Goal Alignment: Compatible vs conflicting objectives
  • Behavioral Patterns: Similar action preferences and strategies
  • Environmental Niche: Operating in similar state spaces
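
One way to picture how these factors might be combined is a simple weighted sum; the factor names, weights, and scores below are assumptions for illustration, not the library's internals.

# Hypothetical weighted-sum compatibility score (weights and factor names are assumptions)
factor_weights = %{architecture: 0.25, goal_alignment: 0.35, behavior: 0.25, niche: 0.15}
factors = %{architecture: 1.0, goal_alignment: 0.8, behavior: 0.7, niche: 0.9}

compatibility =
  Enum.reduce(factor_weights, 0.0, fn {name, w}, acc ->
    acc + w * Map.get(factors, name, 0.0)
  end)
# => 0.25 + 0.28 + 0.175 + 0.135 = 0.84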

Benefits of Selective Imitation

  • Accelerated Learning: Learn successful strategies faster
  • Exploration Guidance: Discover effective action sequences
  • Robustness: Multiple perspectives improve policy robustness
  • Specialization: Maintain individual strengths while learning

Safeguards

  • Individuality Preservation: Imitation weights bounded to preserve autonomy
  • Performance Validation: Verify imitated behaviors improve performance
  • Compatibility Filtering: Reject incompatible behavioral patterns
  • Gradual Integration: Slowly integrate imitated behaviors
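
As a sketch of individuality preservation, imitation weights could be capped before they are applied; the 0.75 cap and the clamping step below are assumptions, not documented behavior.

# Cap imitation weights so no peer dominates the object's own policy (cap value is an assumption)
weights = %{"agent_3" => 0.92, "agent_2" => 0.45}
max_imitation_weight = 0.75

bounded = Map.new(weights, fn {peer_id, w} -> {peer_id, min(w, max_imitation_weight)} end)
# => %{"agent_2" => 0.45, "agent_3" => 0.75}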

update_policy(object_id, experiences, social_context)

@spec update_policy(Object.object_id(), [OORL.experience()], OORL.social_context()) ::
  {:ok,
   %{
     parameter_deltas: map(),
     learning_rate_adjustment: float(),
     exploration_modification: atom()
   }}
  | {:error, atom()}

Updates an object's policy based on experiences and social context.

Performs multi-objective policy gradient updates with social regularization and interaction dyad awareness. This function integrates individual learning with social learning signals to improve policy performance.

Parameters

  • object_id - ID of the object updating its policy
  • experiences - List of recent experiences to learn from:
    • Each experience contains state, action, reward, next_state
    • Experiences are weighted by interaction dyad strength
    • Recent experiences have higher learning weight
  • social_context - Social learning context with peer information:
    • Peer rewards for imitation learning
    • Observed actions for behavioral copying
    • Interaction dyad information for weighting

Returns

  • {:ok, policy_updates} - Successful policy updates containing:
    • :parameter_deltas - Changes to policy parameters
    • :learning_rate_adjustment - Adaptive learning rate modification
    • :exploration_modification - Exploration strategy updates
  • {:error, reason} - Update failed due to:
    • :insufficient_data - Not enough experiences for reliable update
    • :invalid_experiences - Malformed experience data
    • :ai_reasoning_failed - AI enhancement failed, using fallback

Learning Algorithm

The policy update process:

  1. Experience Weighting: Weight experiences by dyad strength
  2. AI Enhancement: Use AI reasoning for optimization (if available)
  3. Fallback Learning: Traditional gradient methods if AI fails
  4. Social Regularization: Incorporate peer behavior signals
  5. Parameter Updates: Apply computed parameter changes
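
Step 1 can be pictured as follows; the :dyad_strength and :age fields and the decay constant are assumptions for illustration, not documented experience fields.

# Hypothetical experience weighting by dyad strength and recency
recency_decay = 0.95

weight_experience = fn %{dyad_strength: strength, age: age} ->
  strength * :math.pow(recency_decay, age)
end

weight_experience.(%{dyad_strength: 0.8, age: 3})
# => 0.8 * 0.95^3 ≈ 0.686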

AI-Enhanced Learning

When AI reasoning is available, the system:

  • Analyzes experience patterns for optimal learning
  • Considers social compatibility and interaction dynamics
  • Optimizes for multiple objectives simultaneously
  • Provides interpretable learning recommendations

Examples

# Update policy with experiences and social context
iex> experiences = [
...>   %{state: %{x: 0}, action: :right, reward: 1.0, next_state: %{x: 1}},
...>   %{state: %{x: 1}, action: :up, reward: 0.5, next_state: %{x: 1, y: 1}}
...> ]
iex> social_context = %{
...>   peer_rewards: [{"agent_2", 0.8}],
...>   interaction_dyads: ["dyad_1"]
...> }
iex> {:ok, updates} = OORL.PolicyLearning.update_policy(
...>   "agent_1", experiences, social_context
...> )
iex> updates.learning_rate_adjustment
1.05

Social Learning Integration

Social context enhances learning through:

  • Peer Imitation: Higher-performing peers influence policy updates
  • Dyad Weighting: Stronger dyads provide more learning signal
  • Behavioral Alignment: Policy updates consider social coordination
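
A rough sketch of how a dyad-weighted peer signal might be blended into an individual update; the social coefficient, tuple shape, and all values are illustrative assumptions, not the library's update rule.

# Blend an individual gradient estimate with a dyad-weighted peer signal (all values assumed)
individual_delta = 0.10
peer_signals = [{"agent_2", 0.04, 0.8}]   # {peer_id, suggested_delta, dyad_strength}

social_term =
  Enum.reduce(peer_signals, 0.0, fn {_peer, delta, strength}, acc -> acc + strength * delta end)

blended_delta = individual_delta + 0.2 * social_term
# => 0.10 + 0.2 * 0.032 = 0.1064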

Performance Characteristics

  • Update time: 2-15ms depending on experience count and AI usage
  • Convergence: Typically 20-50% faster with social learning
  • Stability: Social regularization improves learning stability
  • Scalability: Linear in the number of experiences and peers