Serialization Design


EctoFDB stores each entity as a single FDB value using Erlang's External Term Format (:erlang.term_to_binary/1). The input is a keyword list where each key is an atom corresponding to a column name and each value is the column's Elixir term. For example:

[id: "abc-123", name: "Alice", notes: nil, inserted_at: ~N[...], updated_at: ~N[...]]

This document describes why this encoding was chosen and the tradeoffs involved.
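As a minimal sketch of the encoding described above, the round trip below serializes a row and reads it back. The timestamp values are placeholders chosen for illustration, and the sketch covers only the value encoding, not how EctoFDB builds keys or writes to FoundationDB.

```elixir
# A row as described above: a keyword list of column name => value.
# The timestamps are hypothetical sample values.
row = [
  id: "abc-123",
  name: "Alice",
  notes: nil,
  inserted_at: ~N[2024-01-01 00:00:00],
  updated_at: ~N[2024-01-01 00:00:00]
]

# Encode the whole row in one call; the result is an ordinary binary.
value = :erlang.term_to_binary(row)

# Decoding restores the keyword list, column names included.
decoded = :erlang.binary_to_term(value)

true = decoded == row
"Alice" = Keyword.fetch!(decoded, :name)
```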

Ecto.Schema as the source of truth

Because every row carries its own column names, EctoFDB does not need to store a schema definition as a separate metadata entry in FoundationDB, and does not require a create table migration statement. Instead, the Ecto.Schema module in the application code is the sole source of truth for the shape of the data.

This is a natural fit for FoundationDB's client-driven Layer architecture: the client is always in control of how data is interpreted. Avoiding a stored schema eliminates an entire class of coordination problems:

  • No schema metadata key to read on every transaction.
  • No cache invalidation surface for schema changes.
  • No versioned schema mappings to maintain.

The Metadata system (see metadata.md) already handles index metadata and cache invalidation. Keeping the value encoding free of metadata dependencies keeps the common read/write path simple.

Column Evolution

Adding a column requires only a code change to the Ecto.Schema. Existing rows that predate the new column will decode without it, and the missing key resolves to nil at the Ecto layer. No backfill migration is needed.
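The behavior for older rows can be illustrated with a plain struct standing in for an Ecto schema. The module names and the :nickname column are hypothetical, and struct/2 is used here only to show the effect; it is not necessarily how EctoFDB loads rows.

```elixir
defmodule ColumnEvolutionSketch do
  # Hypothetical stand-in for an Ecto schema struct; :nickname is the
  # newly added column.
  defmodule User do
    defstruct [:id, :name, :nickname]
  end

  # Build the struct from a decoded row. struct/2 leaves fields that are
  # absent from the row at their default, which is nil here.
  def load(binary) do
    struct(User, :erlang.binary_to_term(binary))
  end
end

# A value written before :nickname existed.
old_value = :erlang.term_to_binary([id: "abc-123", name: "Alice"])

user = ColumnEvolutionSketch.load(old_value)
# user.name is "Alice" and user.nickname is nil
```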

Renaming a column requires rewriting all existing rows, since the old atom name is embedded in each value. This is consistent with the general expectation that column renames are rare and expensive operations.
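A rough sketch of the per-row work a rename implies is shown below. RenameSketch and rename_column/3 are hypothetical names; a real migration would also need to read and write every affected row inside FoundationDB transactions, which is omitted here.

```elixir
defmodule RenameSketch do
  # Re-encode one stored value with column `old` renamed to `new`.
  # Rows that do not contain `old` keep the same columns.
  def rename_column(binary, old, new) when is_atom(old) and is_atom(new) do
    binary
    |> :erlang.binary_to_term()
    |> Enum.map(fn
      {^old, value} -> {new, value}
      pair -> pair
    end)
    |> :erlang.term_to_binary()
  end
end

old_value = :erlang.term_to_binary([id: "abc-123", name: "Alice"])
new_value = RenameSketch.rename_column(old_value, :name, :full_name)

:erlang.binary_to_term(new_value)
# [id: "abc-123", full_name: "Alice"]
```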

In practice, the tenant model eases this during development: rather than migrating a tenant's data through a rename, it is often simpler to delete the tenant's data and let it repopulate from scratch. This works well when tenants are cheap to recreate and the authoritative data lives elsewhere or can be reseeded.

Why term_to_binary?

:erlang.term_to_binary/1 is implemented as a single C-level BIF in the BEAM. It handles the entire keyword list (atoms, strings, integers, floats, timestamps, nested structures) in one call with no Elixir-level dispatch per field. The External Term Format is also what distributed Erlang uses on the wire, so the encoder is heavily exercised, well optimized, and portable across nodes and OTP releases.

We benchmarked an alternative encoding that strips atom keys out of the binary and stores them as a NUL-delimited string header, with a single :erlang.term_to_binary call for the values list:

<<atom1, 0x00, atom2, 0x00, ..., atomN, 0x00, 0x00, term_to_binary(values_list)>>

Results across representative row shapes (narrow 2-field rows, typical 5-field rows, wide 12-field rows, and content-heavy 9-field rows):

Metric          Custom vs. Baseline
Size savings    1-7%
Encode speed    1.2x slower
Decode speed    1.6-2.7x slower

The size savings are modest because Erlang's External Term Format already encodes atoms compactly (a one-byte tag and a short length prefix followed by the atom's bytes). The performance cost comes from the Elixir-level header parsing and per-field String.to_existing_atom calls on decode, which cannot compete with the BEAM's native binary_to_term.
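For reference, the header layout described above could be implemented along the following lines. This is a reconstruction from the layout alone, not the code that was benchmarked; HeaderCodecSketch is a hypothetical name, and the sketch assumes column names are non-empty and contain no NUL bytes.

```elixir
defmodule HeaderCodecSketch do
  # Layout: each column name followed by a NUL byte, one extra NUL to end
  # the header, then term_to_binary of the values list.
  def encode(row) do
    {keys, values} = Enum.unzip(row)
    header = for key <- keys, into: <<>>, do: <<Atom.to_string(key)::binary, 0>>
    <<header::binary, 0, :erlang.term_to_binary(values)::binary>>
  end

  def decode(binary), do: decode_header(binary, [])

  # A NUL where a name would start marks the end of the header.
  defp decode_header(<<0, rest::binary>>, keys) do
    Enum.zip(Enum.reverse(keys), :erlang.binary_to_term(rest))
  end

  # Otherwise split off one name and convert it back to an atom. This
  # per-field, Elixir-level work is what the native decoder avoids.
  defp decode_header(binary, keys) do
    [name, rest] = :binary.split(binary, <<0>>)
    decode_header(rest, [String.to_existing_atom(name) | keys])
  end
end

row = [id: "abc-123", name: "Alice", notes: nil]
encoded = HeaderCodecSketch.encode(row)
true = HeaderCodecSketch.decode(encoded) == row
```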

Storage Overhead

Each atom key costs roughly 10-15 bytes per row in the serialized form. For a typical schema with 5-8 fields, this is 50-120 bytes of overhead per entity. The overhead as a percentage of total row size depends on the data:

  • Narrow rows (2-3 small fields): ~40-60% overhead
  • Typical rows (5-8 fields, moderate values): ~15-30%
  • Content-heavy rows (large string fields): <5%

This overhead is not free. Larger values reduce the number of rows that fit in a single FDB transaction (10 MB limit), which matters for bulk data loading scenarios. For narrow rows with high atom overhead, the effective throughput per transaction can be noticeably lower than it would be with a more compact encoding.
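One way to gauge this overhead for a particular schema is to measure it directly. The sketch below compares the encoded keyword list against the encoded bare values, then derives a rough upper bound on rows per transaction. OverheadSketch and its functions are hypothetical helpers; the bound ignores key bytes and other per-transaction costs, and exact byte counts vary with the OTP release.

```elixir
defmodule OverheadSketch do
  # Approximate bytes attributable to column names: the encoded keyword
  # list minus the encoded bare values list.
  # Returns {overhead_bytes, total_bytes}.
  def key_overhead(row) do
    with_keys = byte_size(:erlang.term_to_binary(row))
    without_keys = byte_size(:erlang.term_to_binary(Keyword.values(row)))
    {with_keys - without_keys, with_keys}
  end

  # Upper bound on rows per transaction from value bytes alone, using
  # the 10 MB limit mentioned above as the default.
  def max_rows_per_transaction(row_bytes, limit_bytes \\ 10_000_000) do
    div(limit_bytes, row_bytes)
  end
end

{overhead, total} = OverheadSketch.key_overhead(id: "abc-123", name: "Alice")
IO.puts("#{overhead} of #{total} bytes are column names")

# A 200-byte row allows at most 50_000 rows per transaction.
50_000 = OverheadSketch.max_rows_per_transaction(200)
```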

Comparison with Other Systems

The schema-per-row approach resembles DynamoDB's:

  • DynamoDB stores attribute names with every item, for the same reasons: schemaless flexibility and no coordination on schema changes.

Some alternative approaches include:

  • FoundationDB Record Layer (Java) uses Protocol Buffers, which replace field names with small integer tags. This is more space-efficient but requires a compiled schema definition and careful field-number evolution rules.
  • CockroachDB uses small integer column-ID deltas in a compact TUPLE encoding within each column family's value. Column names are not stored per row; the schema maps IDs to names.
  • Spanner uses an internal columnar/PAX storage format (Ressi) where column identity is part of the schema metadata, not repeated per row.

EctoFDB's approach trades a moderate, predictable storage overhead for a significantly simpler architecture with no stored schema coordination.

Future Direction: Integer-Keyed Encoding

This strategy is not yet implemented.

A potential optimization is to replace atom keys with user-assigned integers, similar to Protocol Buffer field numbers. Instead of:

:erlang.term_to_binary([id: "abc-123", name: "Alice"])

The encoding would be:

:erlang.term_to_binary([{0, "abc-123"}, {1, "Alice"}])

Benchmarks show this yields 8-15% size savings on typical rows, with encode and decode performance roughly on par with the current approach.
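The size difference is easy to observe on the two-field example above. This is only an illustration: a two-field row is narrower than the typical rows the benchmark figures refer to, and the exact byte counts depend on the OTP release.

```elixir
atom_keyed = [id: "abc-123", name: "Alice"]
int_keyed = [{0, "abc-123"}, {1, "Alice"}]

atom_size = byte_size(:erlang.term_to_binary(atom_keyed))
int_size = byte_size(:erlang.term_to_binary(int_keyed))

# Small integers encode in two bytes each, which is shorter than any
# encoded atom name.
IO.puts("atom-keyed: #{atom_size} bytes, integer-keyed: #{int_size} bytes")
```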

This would be opt-in via a schema annotation, for example:

schema "users" do
  field :name, :string, fdb_key: 1
  field :notes, :string, fdb_key: 2
  timestamps()
end

If no fdb_key annotations are present, the current keyword-list encoding would be used. The decoder can disambiguate the two formats by inspecting the deserialized term: a keyword list (atom-keyed tuples) vs integer-keyed tuples.
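Since this strategy is not implemented, the following is only a sketch of how such a decoder might tell the formats apart. FormatSketch and the `fields` map (integer key to column name, as it might be derived from fdb_key annotations) are hypothetical.

```elixir
defmodule FormatSketch do
  # Classify a decoded row by the type of its first key.
  def format([{key, _} | _]) when is_atom(key), do: :atom_keyed
  def format([{key, _} | _]) when is_integer(key), do: :integer_keyed
  def format([]), do: :empty

  # Decode either format into a keyword list.
  def decode(binary, fields) do
    row = :erlang.binary_to_term(binary)

    case format(row) do
      :integer_keyed ->
        Enum.map(row, fn {int, value} -> {Map.fetch!(fields, int), value} end)

      _ ->
        row
    end
  end
end

fields = %{0 => :id, 1 => :name}
int_value = :erlang.term_to_binary([{0, "abc-123"}, {1, "Alice"}])
atom_value = :erlang.term_to_binary([id: "abc-123", name: "Alice"])

# Both stored formats decode to the same keyword list.
true = FormatSketch.decode(int_value, fields) == FormatSketch.decode(atom_value, fields)
```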

The user would own the stability of the integer mapping, the same contract as protobuf field numbers: once assigned, an integer must never be reassigned to a different field. This is a reasonable burden because it enables two benefits beyond size savings:

  • Column renames become free. Renaming :name to :full_name requires only a schema change; the integer key 1 stays the same, so existing data needs no migration, assuming all records were stored with this scheme.
  • No stored metadata. The Ecto.Schema module remains the sole source of truth for the field-to-integer mapping, preserving the architecture described in this document.
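To illustrate the rename case, the same stored value can be decoded under the mapping before and after a rename. The two maps below are hypothetical stand-ins for what the schema's fdb_key annotations would provide.

```elixir
# Stored once, never rewritten.
value = :erlang.term_to_binary([{0, "abc-123"}, {1, "Alice"}])

# Hypothetical integer => column name mappings before and after the rename.
fields_v1 = %{0 => :id, 1 => :name}
fields_v2 = %{0 => :id, 1 => :full_name}

decode = fn binary, fields ->
  for {int, val} <- :erlang.binary_to_term(binary), do: {Map.fetch!(fields, int), val}
end

decode.(value, fields_v1)
# [id: "abc-123", name: "Alice"]

decode.(value, fields_v2)
# [id: "abc-123", full_name: "Alice"]
```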