Every benchmark in this directory is reproducible. Each script is self-contained — just run `mix run bench/<name>.exs`.

The single-command experience: `mix bench` runs `summary.exs`, which produces one consolidated table covering 22 scenarios. `mix bench.all` runs the entire suite end-to-end.
## Files

| File | What it measures | When to run |
|---|---|---|
| `summary.exs` | Headline table: every op × {medium, large}, geometric-mean speedup. The bench you'd cite. | Quick "is FastDecimal still ahead?" check after a change |
| `arithmetic.exs` | Per-op times vs `decimal` and the Compat shim at 3 value sizes (full benchee output) | Detailed view of add/sub/mult/div/etc. with deviation |
| `division.exs` | div / div_int / div_rem / rem × sizes × precisions (5, 10, 28, 50) | Verifying division performance, picking a precision |
| `rounding.exs` | `round/3` with all 7 modes × 5 input shapes | Comparing rounding-mode overhead |
| `sqrt.exs` | `sqrt/2` at precisions 5 / 10 / 28 / 50 / 100 / 200 | Tuning sqrt precision for your use case |
| `conversion.exs` | `to_string` (all formats) + `to_integer` + `to_float` + `cast` from various input types | Format-conversion path tuning |
| `special_values.exs` | NaN/Inf op overhead vs finite arithmetic; predicate cost | Confirming the `is_integer` guard isn't hurting the fast path |
| `batch.exs` | `sum/1` / `product/1` over 10/100/1k/10k elements: pure Elixir vs `decimal` reduce | Verifying batch wins |
| `realistic.exs` | 5 fintech-style workloads: invoice total, tax + discount, FX conversion, aggregate stats, CSV ingestion | The "would this actually help my app?" answer |
| `parse.exs` | Two parse strategies (char-walk vs split + `binary_to_integer`) across 9 inputs | Choosing the public `parse/1` implementation |
| `representation.exs` | `%FastDecimal{}` struct vs raw `{coef, exp}` tuple vs raw integers | Quantifying the struct-wrapper cost |
| `profile.exs` | Time + bytes allocated + reductions per op (richer than benchee alone) | Investigating why an op is fast or slow |
| `disasm.exs` | Dumps BEAM bytecode for hot-path functions; counts `put_map_assoc` vs `put_map_exact` module-wide | Inspecting what BEAMAsm actually JITs from |
Run individual ones with `mix run bench/<name>.exs`. Pipe to `tee` if you want a record.
## Methodology

The headline `mix bench` (= `bench/summary.exs`) uses a custom harness in `bench/support.exs`:

- 7 independent samples per scenario per implementation. Each sample runs in a fresh process (resetting BEAM state — register cache, ETS access patterns, scheduler binding) and includes 1,000 warmup iterations to let the JIT specialize the call site before measurement starts.
- 200,000 iterations per sample, timed with `:erlang.monotonic_time(:nanosecond)` (true nanosecond resolution; `:timer.tc` is microsecond).
- A forced `:erlang.garbage_collect()` before each sample to start from a clean heap.
- Process priority set to `:high` to bias the macOS scheduler toward performance cores when the system isn't loaded.
- We report the median (with p25–p75 IQR) per scenario. The median is the trustworthy headline number — it survives outliers from GC pauses, scheduler steals, and core swaps.
- A row is marked stable when even FastDecimal's p75 vs Decimal's p25 (the pessimistic IQR-edge ratio) clears 2×.
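The sampling scheme above can be sketched in a few lines. This is an illustrative reconstruction, not the actual `bench/support.exs` API — the module and function names here are made up:

```elixir
# Sketch of the harness: each sample runs in a fresh process at :high
# priority, warms up, forces a GC, then times a fixed iteration count.
defmodule SampleSketch do
  @warmup 1_000
  @iters 200_000

  def sample(fun) do
    parent = self()

    spawn(fn ->
      Process.flag(:priority, :high)
      for _ <- 1..@warmup, do: fun.()
      :erlang.garbage_collect()
      t0 = :erlang.monotonic_time(:nanosecond)
      for _ <- 1..@iters, do: fun.()
      t1 = :erlang.monotonic_time(:nanosecond)
      send(parent, {:ns_per_op, (t1 - t0) / @iters})
    end)

    receive do
      {:ns_per_op, ns} -> ns
    end
  end

  # Median across n independent samples (7 in the real harness).
  def median_of_samples(fun, n \\ 7) do
    samples = for _ <- 1..n, do: sample(fun)
    samples |> Enum.sort() |> Enum.at(div(n, 2))
  end
end
```

Spawning a fresh process per sample is what resets the per-process heap and scheduler state between measurements.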
## What we report vs what we don't

- The Speedup column shows three numbers: `median (pessimistic_ratio – optimistic_ratio)`. The middle number is what you'd see most days; the outer numbers bound what you might see on a noisy/quiet run.
- No mean is reported. Means on sub-µs ops are dominated by 99th-percentile outliers from GC; medians are robust.
- `profile.exs` shows allocation estimates approximated from `heap_size_delta + (gc_count × heap_size_before_gc)`. This is a heuristic, not a precise measurement. Use it as a relative signal between implementations, not an absolute byte count. The "fastdec uses less garbage" claim holds across implementations, but the exact byte numbers should be taken with a grain of salt.
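A heavily hedged sketch of how such a heuristic can be computed from process GC stats — `AllocSketch` is a name invented here, and the approximation of per-GC freed volume is deliberately crude, matching the "relative signal only" caveat above:

```elixir
defmodule AllocSketch do
  # BEAM heap sizes are in words; convert via the system word size (8 on 64-bit).
  @wordsize :erlang.system_info(:wordsize)

  def bytes_per_op(fun, iters \\ 1_000) do
    :erlang.garbage_collect()
    {:total_heap_size, h0} = :erlang.process_info(self(), :total_heap_size)
    g0 = minor_gcs()
    for _ <- 1..iters, do: fun.()
    {:total_heap_size, h1} = :erlang.process_info(self(), :total_heap_size)
    g1 = minor_gcs()
    heap_delta_bytes = (h1 - h0) * @wordsize
    # Approximate the garbage each minor GC freed with the post-run heap
    # size — a rough stand-in for heap_size_before_gc.
    gc_bytes = (g1 - g0) * h1 * @wordsize
    (heap_delta_bytes + gc_bytes) / iters
  end

  defp minor_gcs do
    {:garbage_collection, info} = :erlang.process_info(self(), :garbage_collection)
    Keyword.fetch!(info, :minor_gcs)
  end
end
```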
## Reproducibility check
Across multiple consecutive runs on macOS arm64 / OTP 26 / BEAMAsm JIT, the geometric-mean speedup has landed in the range 9.67× – 10.01×. Each scenario's specific median value drifts 5–10% per run because of macOS scheduler noise and GC interactions; the ratios are stable. The headline number is best read as "approximately 10×, varies by ±3% across runs."
## JIT vs non-JIT comparison

Same hardware, same Elixir; only the Erlang build differs (asdf 26.0.2 with `emu_flavor=jit` vs 26.2.4 with `emu_flavor=emu`):
| Emulator | Geomean | Faster scenarios | Stable ≥2× |
|---|---|---|---|
| BEAMAsm JIT (26.0.2) | 9.87× | 21/22 | 19/22 |
| Threaded-code interpreter (26.2.4) | 7.71× | 21/22 | 18/22 |
JIT helps FastDecimal proportionally more than decimal (tighter hot paths = more inlining benefit), so the speedup ratio is bigger on JIT. But the lib still wins by ~8× on non-JIT BEAM.
## Noise sources we acknowledge

- macOS efficiency cores. Apple Silicon schedules across heterogeneous cores (P + E). Long enough measurements average over both; short ones don't.
- GC pauses. Some scenarios trigger a GC inside the measurement window. With 7 samples × 200k iterations this washes out except for the smallest workloads.
- JIT detection. `:erlang.system_info(:emu_flavor)` is the right field for this (`:jit` vs `:emu`). `:emu_type` is `:opt` for both JIT and non-JIT builds — I confused these initially and the early bench output incorrectly labeled the machine as "no JIT".
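The correct check, runnable in `iex`:

```elixir
# Which emulator flavor is active? :jit means BEAMAsm.
case :erlang.system_info(:emu_flavor) do
  :jit -> IO.puts("BEAMAsm JIT")
  :emu -> IO.puts("threaded-code interpreter")
end

# Note: :erlang.system_info(:emu_type) reports :opt for both builds,
# so it cannot distinguish JIT from non-JIT.
```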
## Reading benchee output (other bench files)

Benchee reports five numbers per scenario:

- ips — iterations per second (higher = faster)
- average — mean ns/op
- deviation — relative std-dev as a %
- median — middle value, the most trustworthy single number on noisy hardware
- 99th % — 99th-percentile latency
For sub-µs operations on macOS, trust the median, not the mean. Deviation in the thousands of percent is the BEAM/Mac scheduler interaction, not the code under test.
`profile.exs` adds three more numbers per op:
- bytes/op — heap bytes allocated (heuristic — see Methodology above)
- reductions/op — BEAM's virtual instruction counter. A useful "how much work" proxy; lower = less work
- GCs — minor garbage collections triggered in the measurement loop
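Reductions per op can be read straight off process-info deltas. A small sketch (`RedSketch` is an illustrative name, not something in this repo):

```elixir
defmodule RedSketch do
  # Reductions consumed per call of `fun`, via :erlang.process_info deltas.
  # Includes the loop's own overhead, so treat it as a relative signal.
  def reductions_per_op(fun, n \\ 10_000) do
    {:reductions, r0} = :erlang.process_info(self(), :reductions)
    run(fun, n)
    {:reductions, r1} = :erlang.process_info(self(), :reductions)
    (r1 - r0) / n
  end

  defp run(_fun, 0), do: :ok

  defp run(fun, n) do
    fun.()
    run(fun, n - 1)
  end
end
```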
## Design decisions per op (with evidence)
Every choice below was made because we measured. Where the obvious "fast" technique lost, we kept the slower-looking code that actually won.
### add / sub / mult

- Public API: struct-based (`%FastDecimal{coef, exp}`).
- Measured: a tuple `{coef, exp}` is 5–9% faster on mult, at parity on add (`representation.exs`). The cost is small enough that we pay it for the Inspect protocol, pattern-matching ergonomics, and the guarantee that hot paths still stay heap-light.
- No explicit small-int guards. BEAM's JIT already specializes integer math based on whether values fit in the 60-bit immediate range. Tested an explicit `when c in -2^60..2^60` guard — it added ~3% overhead with no measurable upside.
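The struct-vs-tuple comparison looks roughly like this. A sketch only — FastDecimal's real field layout and clause order may differ, and `ReprSketch` / `Dec` are names invented here:

```elixir
defmodule ReprSketch do
  defmodule Dec do
    defstruct coef: 0, exp: 0
  end

  # Struct form: align exponents, add coefficients, build a new struct.
  def add(%Dec{coef: c1, exp: e}, %Dec{coef: c2, exp: e}),
    do: %Dec{coef: c1 + c2, exp: e}

  def add(%Dec{coef: c1, exp: e1}, %Dec{coef: c2, exp: e2}) when e1 < e2,
    do: %Dec{coef: c1 + c2 * pow10(e2 - e1), exp: e1}

  def add(a, b), do: add(b, a)

  # Tuple form: identical logic, no map allocation.
  def add_t({c1, e}, {c2, e}), do: {c1 + c2, e}
  def add_t({c1, e1}, {c2, e2}) when e1 < e2, do: {c1 + c2 * pow10(e2 - e1), e1}
  def add_t(a, b), do: add_t(b, a)

  defp pow10(n), do: Integer.pow(10, n)
end
```

The only runtime difference between the two forms is the struct (map) allocation per result, which is where the measured 5–9% gap on mult comes from.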
### div

- Pure Elixir, integer shift-and-divide with explicit rounding modes.
- Allocates more than `decimal` (~91 B vs ~8 B) because we materialize the shifted dividend as an intermediate bignum. We accept that — wall time is 5–9× faster than `decimal` (`profile.exs`).
- A Rust NIF wrapping a native decimal library is faster on div at precision 28 (~2.5× our pure-Elixir path, measured during prototyping). Not adopted because:
  - It would force a binary dependency on every user
  - It only wins on this one op
  - Pure Elixir is still 5× faster than `decimal` here
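The shift-and-divide idea, reduced to one rounding mode. This is an illustrative sketch (precision handling and the full set of rounding modes omitted; `DivSketch.divide/3` is not FastDecimal's API):

```elixir
defmodule DivSketch do
  # Divide {c1, e1} / {c2, e2} keeping `prec` extra digits: scale the
  # dividend up, do one big integer division, round half-away-from-zero.
  def divide({c1, e1}, {c2, e2}, prec) do
    shift = Integer.pow(10, prec)
    scaled = c1 * shift              # the materialized intermediate bignum
    q = div(scaled, c2)
    r = rem(scaled, c2)
    q = if abs(r) * 2 >= abs(c2), do: q + sign(q), else: q
    {q, e1 - e2 - prec}
  end

  defp sign(n) when n < 0, do: -1
  defp sign(_), do: 1
end
```

The `scaled` bignum is the extra allocation the bullet above refers to: it exists only long enough to feed one `div`/`rem` pair.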
### compare

- At the BEAM floor: 4 reductions, 0 bytes allocated (`profile.exs`).
- No optimization possible — it's a pattern match + one integer compare + returning an atom.
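The shape of such a compare, sketched on the tuple representation (illustrative; the zero-allocation fast path in the bullet above is the equal-exponent clause — the mixed-exponent clauses shown here would allocate for the `pow10`):

```elixir
defmodule CmpSketch do
  # Equal exponents: one integer comparison, returns an atom.
  def compare({c1, e}, {c2, e}), do: cmp_int(c1, c2)

  # Mixed exponents: normalize to the smaller exponent first.
  def compare({c1, e1}, {c2, e2}) when e1 < e2,
    do: cmp_int(c1, c2 * Integer.pow(10, e2 - e1))

  def compare({c1, e1}, {c2, e2}),
    do: cmp_int(c1 * Integer.pow(10, e1 - e2), c2)

  defp cmp_int(a, b) when a < b, do: :lt
  defp cmp_int(a, b) when a > b, do: :gt
  defp cmp_int(_, _), do: :eq
end
```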
### parse

- Public API uses the char-by-char walker (`Parser.parse_walk/1`).
- Measured: the walker is 1.4–3× faster than the split-and-`binary_to_integer` approach on every input shorter than ~25 digits (`parse.exs`).
- Split wins ONLY on 30+ digit pure-integer strings, where `:erlang.binary_to_integer/1`'s native C loop overtakes BEAM's pattern dispatch.
- Allocates ~186 B per call (sub-binary refs from the recursive head/tail pattern). We accept that — wall time is 3–5× faster than `decimal`, and `decimal` allocates 3× more anyway.
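A minimal sketch of the char-by-char walking strategy (sign, scientific notation, and error detail omitted — this is not the real `Parser.parse_walk/1`). The recursive head/tail binary pattern here is exactly what produces the sub-binary refs mentioned above:

```elixir
defmodule ParseSketch do
  # "12.5" -> {:ok, {125, -1}}
  def parse(bin), do: walk(bin, 0, nil)

  # Accumulate digits; once past the dot, also count fractional digits.
  defp walk(<<d, rest::binary>>, acc, frac) when d in ?0..?9,
    do: walk(rest, acc * 10 + (d - ?0), frac && frac + 1)

  # First dot switches on fractional-digit counting.
  defp walk(<<?., rest::binary>>, acc, nil), do: walk(rest, acc, 0)

  defp walk(<<>>, acc, nil), do: {:ok, {acc, 0}}
  defp walk(<<>>, acc, frac), do: {:ok, {acc, -frac}}
  defp walk(_, _, _), do: :error
end
```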
### to_string

- Public API uses `IO.iodata_to_binary([sign, int_part, ?., frac_part])`.
- Measured: a bit-syntax `<<sign::binary, int_part::binary, ?., frac_part::binary>>` alternative was 20% slower despite allocating less. `iodata_to_binary` is a NIF that pre-computes the total size and allocates once; that beats stepwise binary appends.
- Allocates ~266 B (7× more than `decimal`'s ~36 B) for parity wall time. Tried direct bit-syntax to reduce allocation; that path lost on time. Picking time > alloc here.
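Why iodata wins, in sketch form: build the pieces as a list and let the `IO.iodata_to_binary/1` NIF size and allocate the result once. The digit-splitting below is simplified (negative exponents only) and is not FastDecimal's implementation:

```elixir
defmodule ToStringSketch do
  import Kernel, except: [to_string: 1]

  def to_string({coef, exp}) when exp < 0 do
    sign = if coef < 0, do: "-", else: ""
    # Pad so there is at least one digit left of the decimal point.
    digits = coef |> abs() |> Integer.to_string() |> String.pad_leading(-exp + 1, "0")
    {int_part, frac_part} = String.split_at(digits, exp)
    IO.iodata_to_binary([sign, int_part, ?., frac_part])
  end

  def to_string({coef, _exp}), do: Integer.to_string(coef)
end
```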
### Map construction op: put_map_assoc vs put_map_exact

- Kept the `%__MODULE__{coef: ..., exp: ...}` literal-struct form for all hot paths.
- Tried: rewriting to `%{a | coef: c1 + c2}` (Elixir's "strict update") so the BEAM compiler emits `put_map_exact` (updates an existing map's keys) instead of `put_map_assoc` (allocates a fresh map). The bytecode change happened cleanly — module-wide counts flipped from 20:1 `put_map_assoc`:`put_map_exact` to 7:14, verifiable in `bench/disasm.exs`.
- Measured: wall time unchanged (medians 42 ns both before and after), and the head-to-head benchee comparison actually showed `put_map_exact` 1% slower, within noise.
- Conclusion: BEAMAsm has equally tight implementations of both ops for tiny structs. The literal-struct form is also more readable — every field value is right there at the call site — so the code stays.
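The two styles side by side, on a sketch struct (`MapOpSketch.Dec` is invented for illustration; which BEAM op each form compiles to can be confirmed with `bench/disasm.exs`):

```elixir
defmodule MapOpSketch do
  defmodule Dec do
    defstruct coef: 0, exp: 0
  end

  # Literal struct: compiles to put_map_assoc (builds a fresh map).
  def add_literal(%Dec{coef: c1, exp: e}, %Dec{coef: c2, exp: e}),
    do: %Dec{coef: c1 + c2, exp: e}

  # Strict update of an existing struct: compiles to put_map_exact.
  def add_update(%Dec{coef: c1, exp: e} = a, %Dec{coef: c2, exp: e}),
    do: %{a | coef: c1 + c2}
end
```

Both return identical structs; per the measurements above, only readability separates them.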
### sum/1 / product/1

- Pure-Elixir reduce via tail-recursive helpers.
- Measured during prototyping: a NIF that received the entire list across the FFI boundary was 6–7× SLOWER at every list size tested (10 / 100 / 1k / 10k elements).
- Why: Rustler's `Vec<Tuple>` decode marshals each list element across the BEAM/Rust boundary (~50–100 ns each). At BEAM's ~11 ns per pure-Elixir add, the marshalling cost can't be recovered.
- Pure-Elixir batch is 26–30× faster than `decimal` reduce — and crucially, no Rust toolchain is needed at install time. The NIF prototype was removed before v1.0; this lesson is preserved.
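The shape of the tail-recursive reduction, on the tuple representation and assuming all elements share one exponent (the real `sum/1` also aligns mixed exponents; `BatchSketch` is a name invented here):

```elixir
defmodule BatchSketch do
  # One integer accumulator, no intermediate structs per element —
  # this is the ~11 ns/add loop the NIF marshalling couldn't beat.
  def sum([{_, e} | _] = list), do: {do_sum(list, e, 0), e}

  defp do_sum([{c, e} | rest], e, acc), do: do_sum(rest, e, acc + c)
  defp do_sum([], _e, acc), do: acc
end
```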
## Reproducibility notes

- Hardware matters. All numbers in the project README are from an M-series Mac on Erlang/OTP 26 with BEAMAsm enabled. Linux on x86 will differ — typically lower absolute numbers but similar ratios.
- GC noise. If you see a single scenario whose 99th percentile is 100× the median, that's a GC cycle landing inside the measurement window. Re-run to confirm the median.
- Warmup matters. Each Benchee scenario uses `warmup: 0.5` to let the JIT specialize and load caches, and `time: 2` for the actual measurement window. Don't shorten these — short runs amplify noise.
- Benchee vs profile.exs. `profile.exs` uses a tight `:timer.tc` loop without Benchee's per-iteration overhead, so its ns/op numbers are typically lower than `arithmetic.exs`'s. Both are correct; they measure slightly different things.
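The tight-loop style in sketch form — one `:timer.tc` around the whole loop, divided by N, no per-iteration bookkeeping (illustrative; `TightLoop` is not the actual `profile.exs` module):

```elixir
defmodule TightLoop do
  def ns_per_op(fun, n \\ 100_000) do
    # :timer.tc returns {elapsed_microseconds, result}.
    {micros, :ok} = :timer.tc(fn -> loop(fun, n) end)
    micros * 1_000 / n
  end

  defp loop(_fun, 0), do: :ok

  defp loop(fun, n) do
    fun.()
    loop(fun, n - 1)
  end
end
```

Benchee, by contrast, times and records each iteration individually, which is what buys its percentile statistics at the cost of some fixed per-iteration overhead.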
## When you find a regression
If a change makes any op slower, the contract is: include the bench output in the PR description showing it. If a change makes an op faster, same — show the before/after numbers. The point of having all the benches in-tree is so this conversation is short.