Binary Serialization Guide

Overview

The Snakepit bridge automatically uses binary serialization for large tensor and embedding data to improve performance. This guide explains how it works and how to use it effectively.

Automatic Binary Encoding

When variable data exceeds 10KB (10,240 bytes), the system automatically switches from JSON to binary encoding:

  • Small data (≤10KB): Uses JSON for readability and debugging
  • Large data (>10KB): Uses binary for performance
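The decision above can be sketched as a simple size check against the 10,240-byte threshold. Note that `choose_encoding` is a hypothetical helper for illustration, not part of the bridge API, and it measures the actual JSON size for simplicity, whereas the bridge estimates size from element count (see Threshold Calculation below):

```python
import json

BINARY_THRESHOLD = 10 * 1024  # 10,240 bytes

def choose_encoding(payload) -> str:
    """Illustrative only: pick an encoding by serialized size."""
    size = len(json.dumps(payload).encode("utf-8"))
    return "binary" if size > BINARY_THRESHOLD else "json"

print(choose_encoding(list(range(100))))  # json   (a few hundred bytes)
print(choose_encoding([0.0] * 10_000))    # binary (tens of KB)
```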

Supported Types

Binary serialization is currently available for:

  • tensor: Multi-dimensional arrays with shape information
  • embedding: Vector representations (1D arrays)

Python Usage

Basic Example

from snakepit_bridge import SessionContext
import numpy as np

class MLAdapter:
    def process_large_data(self, ctx: SessionContext):
        # Create a large tensor (>10KB)
        large_tensor = np.random.randn(100, 100).tolist()  # ~80KB
        
        # This automatically uses binary serialization
        ctx.register_variable("my_tensor", "tensor", {
            "shape": [100, 100],
            "data": large_tensor
        })
        
        # Retrieval is transparent - handles binary automatically
        retrieved = ctx["my_tensor"]
        print(f"Shape: {retrieved['shape']}")

Performance Example

import time

from snakepit_bridge import SessionContext

def benchmark_serialization(ctx: SessionContext):
    # Small data - uses JSON
    small_data = list(range(100))  # ~800 bytes
    start = time.time()
    ctx.register_variable("small", "embedding", small_data)
    print(f"Small data: {time.time() - start:.3f}s")
    
    # Large data - uses binary
    large_data = list(range(10000))  # ~80KB
    start = time.time()
    ctx.register_variable("large", "embedding", large_data)
    print(f"Large data: {time.time() - start:.3f}s")
    # Typically 5-10x faster!

Technical Details

Serialization Format

  • Python side: Uses pickle with highest protocol
  • Elixir side: Uses Erlang Term Format (ETF)
  • Wire format: Protocol Buffers with separate binary field
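On the Python side, a minimal round trip with the highest pickle protocol looks like this (the payload shape mirrors the tensor structure used elsewhere in this guide):

```python
import pickle

payload = {"shape": [2, 2], "data": [[1.0, 2.0], [3.0, 4.0]]}

# Serialize with the highest protocol available, as described above
blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)

# Deserialization recovers the original structure exactly
restored = pickle.loads(blob)
print(restored == payload)        # True
print(isinstance(blob, bytes))    # True
```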

Threshold Calculation

The 10KB threshold is based on the estimated serialized size:

  • For tensors: len(data) * 8 bytes (assuming float64)
  • For embeddings: len(array) * 8 bytes
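The estimate can be expressed directly in code. Here `estimated_size` is an illustrative helper, not a bridge function, and it treats `len(data)` as the total element count by flattening nested tensor data (an assumption; the guide's formula does not spell out how multi-dimensional data is counted):

```python
def estimated_size(data) -> int:
    """Estimate serialized size as element count * 8 bytes (float64 assumption)."""
    if isinstance(data, dict):  # tensor: {"shape": [...], "data": [...]}
        flat = data["data"]
        # Flatten nested lists so every scalar element is counted
        while flat and isinstance(flat[0], list):
            flat = [x for row in flat for x in row]
        return len(flat) * 8
    return len(data) * 8  # embedding: flat 1D array

tensor = {"shape": [100, 100], "data": [[0.0] * 100 for _ in range(100)]}
print(estimated_size(tensor))            # 80000 -> exceeds 10,240, so binary
print(estimated_size(list(range(100))))  # 800   -> under threshold, so JSON
```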

Binary Message Structure

When binary serialization is used:

  1. Metadata (in value field):

    {
      "shape": [100, 100],
      "dtype": "float32",
      "binary_format": "pickle",
      "type": "tensor"
    }
  2. Binary Data (in binary_value field):

    • Pickled Python object containing the actual data
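Putting the two pieces together, a binary-encoded message can be sketched as metadata plus a pickled payload. The field names follow the structure above, but the dict-based framing here is illustrative; on the wire the fields live in a Protocol Buffers message:

```python
import pickle

data = [[0.0] * 100 for _ in range(100)]  # 100x100 tensor, ~80KB

message = {
    # Metadata travels in the JSON-readable value field
    "value": {
        "shape": [100, 100],
        "dtype": "float32",
        "binary_format": "pickle",
        "type": "tensor",
    },
    # The actual numbers travel as pickled bytes in binary_value
    "binary_value": pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL),
}

# A receiver inspects the metadata first, then unpickles the payload
meta = message["value"]
restored = pickle.loads(message["binary_value"])
print(meta["shape"], len(restored))  # [100, 100] 100
```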

Best Practices

  1. Use Appropriate Types: Always use tensor or embedding types for numerical data
  2. Batch Large Operations: Group multiple large variables in batch updates
  3. Monitor Memory: Binary data is held in memory during transfer
  4. Profile Your Data: Check if your typical data sizes benefit from binary encoding

Limitations

  1. Type Support: Only tensor and embedding types use binary serialization
  2. Debugging: Binary data is not human-readable in logs
  3. Compatibility: Binary format is internal - use JSON for external APIs

Configuration

Currently, the 10KB threshold is not configurable. It was chosen as a sensible default for most workloads:

  • Keeps small data human-readable
  • Maximizes performance for large data
  • Minimizes overhead of format detection