Performance & Scaling
MqttX is architected to scale from tens of thousands to hundreds of thousands of concurrent device connections on a single BEAM node, depending on hardware and workload. This guide explains the architectural decisions and optimizations that make this possible.
Architecture Overview
Each MQTT connection is a lightweight Erlang process (~2KB initial heap, ~20KB total with connection state and socket overhead). The BEAM VM's preemptive scheduler distributes these processes across all available CPU cores. At 100k connections, total memory overhead is roughly 2GB — well within reach of a modest server.
The key bottlenecks at scale are not the number of connections, but the hot paths that execute on every message:
| Hot Path | Frequency | Optimization |
|---|---|---|
| Topic matching | Every PUBLISH | Trie-based router: O(L+K) vs O(N) |
| Packet encoding | Every outgoing packet | iodata output, zero binary copy |
| Buffer handling | Every TCP chunk | Empty-buffer fast path |
| Callback dispatch | Every incoming packet | Cached function_exported? |
| Flow control check | Every QoS 1/2 publish | Direct counter vs O(N) scan |
| Retained delivery | Every SUBSCRIBE | ETS lookup for exact topics |
Topic Router
The router uses a trie (prefix tree) keyed by topic segments. Given a subscription to sensors/+/temperature, the trie looks like:
root
└── "sensors"
    └── :single_level (+)
        └── "temperature"
            └── subscribers: %{client1 => %{qos: 1}}
Matching walks the trie, branching into up to 3 children at each level: exact segment match, single-level wildcard (+), and multi-level wildcard (#). This is O(L + K) where L is the topic depth and K is the total matching subscribers — independent of total subscription count.
Impact at scale:
| Subscriptions | Linear scan (old) | Trie (new) | Speedup |
|---|---|---|---|
| 1,000 | 1,000 comparisons | ~3-5 lookups | ~200x |
| 10,000 | 10,000 comparisons | ~3-5 lookups | ~2,000x |
| 100,000 | 100,000 comparisons | ~3-5 lookups | ~20,000x |
The trie also stores a by_client index mapping each client to its subscriptions, making unsubscribe_all (client disconnect cleanup) efficient without scanning the entire subscription list.
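The matching walk can be sketched as a small standalone module. This is an illustration of the branching rule described above, not MqttX's actual router (module and field names here are invented for the sketch):

```elixir
defmodule TrieSketch do
  # A node is %{children: %{segment => node}, subs: %{client => opts}}
  def new, do: %{children: %{}, subs: %{}}

  # Insert a subscription given a pre-split topic filter.
  def insert(node, [], client), do: %{node | subs: Map.put(node.subs, client, %{qos: 0})}

  def insert(node, [seg | rest], client) do
    child = Map.get(node.children, seg, new())
    %{node | children: Map.put(node.children, seg, insert(child, rest, client))}
  end

  # At each node, a "#" child matches everything below this level;
  # then we branch into at most the exact segment and the "+" child.
  def match(node, segments) do
    hash_subs =
      case node.children["#"] do
        nil -> []
        child -> Map.keys(child.subs)
      end

    hash_subs ++ do_match(node, segments)
  end

  defp do_match(node, []), do: Map.keys(node.subs)

  defp do_match(node, [seg | rest]) do
    for key <- [seg, "+"],
        child = node.children[key],
        client <- match(child, rest),
        do: client
  end
end

trie =
  TrieSketch.new()
  |> TrieSketch.insert(String.split("sensors/+/temperature", "/"), :client1)

TrieSketch.match(trie, String.split("sensors/room1/temperature", "/"))
# => [:client1]
```

Each level inspects at most three children regardless of how many subscriptions exist elsewhere in the trie, which is where the O(L + K) bound comes from.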
Packet Encoding
All socket sends use Codec.encode_iodata/2 which returns an iolist — a nested list of binaries that :gen_tcp.send/2 and :ssl.send/2 accept natively. This avoids a final IO.iodata_to_binary/1 copy.
For a typical 50-byte PUBLISH packet, this saves one 50-byte allocation and copy per send. At 100k messages/second, that's 5MB/s of avoided garbage collection pressure.
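The shape of an iodata-encoded packet can be illustrated as follows (this is a hand-rolled sketch of a minimal QoS 0 PUBLISH, not MqttX's actual Codec output; real remaining-length encoding is a varint, which collapses to one byte for packets under 128 bytes):

```elixir
topic = "sensors/1"
payload = "22.5"
remaining_len = byte_size(topic) + 2 + byte_size(payload)

# The packet is a nested list of binaries; topic and payload are
# referenced as-is, never copied into one flat binary.
packet = [
  <<0x30, remaining_len>>,         # fixed header: PUBLISH, QoS 0 + remaining length
  <<byte_size(topic)::16>>, topic, # topic length prefix + topic binary
  payload                          # payload binary, referenced directly
]

# :gen_tcp.send/2 and :ssl.send/2 accept this iolist natively (writev);
# IO.iodata_to_binary/1 is only used here to inspect the wire bytes.
IO.iodata_to_binary(packet)
```

The saving is exactly the flattening step: the kernel gathers the list segments at send time, so no intermediate flat binary is ever allocated.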
Codec benchmarks (Apple M4 Pro):
| Operation | Throughput | Notes |
|---|---|---|
| PUBLISH encode | 5.05M ops/s | 2.9x faster than mqtt_packet_map |
| SUBSCRIBE encode | 3.42M ops/s | 4.2x faster than mqtt_packet_map |
| PUBLISH decode | 2.36M ops/s | Zero-copy sub-binary references |
Buffer Handling
TCP delivers data in arbitrarily-sized chunks. In the common case, a complete MQTT packet arrives in a single TCP frame and the receive buffer is empty. The optimized path:
buffer = case state.buffer do
  <<>> -> data        # Common case: no copy, just use the new data
  buf -> buf <> data  # Partial packet pending: concat
end
The <<>> match is a constant-time check. When the buffer is empty (the majority case with typical MQTT packet sizes < TCP MSS), we skip binary concatenation entirely. The rest returned by Codec.decode is already a zero-copy sub-binary reference into the original data.
Callback Dispatch
Elixir's function_exported?/3 performs a module lookup on each call. For optional callbacks like handle_info/2, handle_puback/3, and handle_mqtt_event/3, this check runs on every incoming packet. MqttX computes these once at connection init:
# Computed once in handle_connection/init:
handler_has_handle_info: function_exported?(handler, :handle_info, 2),
handler_has_handle_puback: function_exported?(handler, :handle_puback, 3)
# Then used as a simple boolean check per packet:
if state.handler_has_handle_puback do
  # ...
end
Flow Control
MQTT 5.0's receive_maximum limits how many unacknowledged QoS 1 and QoS 2 messages can be in flight simultaneously. Both client and server enforce this for the QoS 2 handshake with a direct counter:
# Server-side: check before accepting incoming QoS 2 PUBLISH
if state.inflight_count >= state.server_receive_maximum do
  # Send PUBREC with reason code 0x93 (Receive Maximum exceeded)
end
The counter is incremented when sending PUBREC (QoS 2 received) and decremented when sending PUBCOMP (QoS 2 complete) or when entries are dropped after max retries.
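The counter discipline can be sketched with plain functions over a state map (field names follow the snippet above; the function names and return shapes are illustrative, not MqttX's API):

```elixir
# Receive Maximum of 2 for the sketch.
state = %{inflight_count: 0, server_receive_maximum: 2}

# Called when an incoming QoS 2 PUBLISH arrives: reject past the limit,
# otherwise count it in flight (i.e. we are about to send PUBREC).
accept = fn s ->
  if s.inflight_count >= s.server_receive_maximum do
    {:error, 0x93, s}  # Receive Maximum exceeded
  else
    {:ok, %{s | inflight_count: s.inflight_count + 1}}
  end
end

# Called when the QoS 2 flow completes (we send PUBCOMP) or an entry is
# dropped after max retries.
complete = fn s -> %{s | inflight_count: s.inflight_count - 1} end
```

Because the check is a single integer comparison against a maintained counter, it costs the same at 10 in-flight messages as at 10,000, unlike scanning an in-flight table.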
Retained Message Delivery
When a client subscribes, the server delivers matching retained messages from ETS. The optimized approach:
- Exact topic subscriptions (no wildcards): Direct :ets.lookup/2 — O(1) per subscription.
- Wildcard subscriptions: Table scan with pre-normalized topic lists. Topic filters are normalized once before the scan, and retained messages store a pre-computed normalized segment list alongside the string key, avoiding String.split/2 in the inner loop.
For a server with 10,000 retained messages and a client subscribing to 5 exact topics, this reduces from 50,000 comparisons (5 filters x 10,000 messages) to 5 hash lookups.
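Both paths can be sketched against a demo ETS table (the table layout and module below are assumptions for illustration, not MqttX's actual retained-message schema):

```elixir
defmodule RetainedSketch do
  # Segment-list matching: the filter and the topic are both
  # pre-split, so no String.split/2 runs inside the scan.
  def match?(["#"], _), do: true
  def match?(["+" | fr], [_ | tr]), do: match?(fr, tr)
  def match?([s | fr], [s | tr]), do: match?(fr, tr)
  def match?([], []), do: true
  def match?(_, _), do: false
end

table = :ets.new(:retained_demo, [:set, :public])
# Row stores the string key AND its pre-computed segment list.
:ets.insert(table, {"sensors/1/temp", ["sensors", "1", "temp"], "22.5"})

# Exact subscription: one hash lookup, no scan.
[{_, _, payload}] = :ets.lookup(table, "sensors/1/temp")

# Wildcard subscription: normalize the filter once, then scan,
# comparing stored segment lists.
filter = String.split("sensors/+/temp", "/")

matches =
  :ets.foldl(
    fn {_topic, segs, pl}, acc ->
      if RetainedSketch.match?(filter, segs), do: [pl | acc], else: acc
    end,
    [],
    table
  )
```

The wildcard path still scans the table, but the per-row cost drops to a list walk over already-split segments.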
Capacity Planning
The primary bottleneck depends on your device activity pattern. For most IoT deployments, RAM is the limiting factor, not CPU.
Per-device resource usage
Each connected device consumes approximately 20KB of RAM (process heap + connection state + socket). This breaks down as:
- Process heap: ~2KB (BEAM base allocation)
- State map (client_id, protocol flags, will message, timers): ~1KB
- Socket + TCP buffers: ~2–5KB
- Handler state (application-defined): ~0.5–5KB
- Session data, pending acks, optional features: ~1–5KB
CPU usage depends entirely on message frequency.
Note: These are theoretical estimates based on architectural analysis and codec benchmarks. The project does not yet include end-to-end load tests validating these numbers under production conditions. Actual performance will vary with hardware, OS tuning, message sizes, subscription patterns, and application logic in your handler callbacks.
Device counts by workload
| Device activity | Per vCPU | Bottleneck |
|---|---|---|
| Sleepy sensors (1 msg/min) | ~50K–100K | RAM |
| Normal IoT (1 msg/30s) | ~30K–80K | RAM |
| Chatty devices (1 msg/sec) | ~10K–15K | CPU |
| Real-time streaming (10 msg/sec) | ~1K–2K | CPU |
These per-vCPU numbers are not meant to be multiplied linearly — scaling is sub-linear due to ETS contention, scheduler rebalancing, per-process GC pauses, and OS-level limits (file descriptors, kernel socket buffer memory).
Instance sizing
For typical IoT workloads (temperature sensors, ping/pong, periodic telemetry at ~1 msg/min):
| Instance | RAM | Devices | CPU usage |
|---|---|---|---|
| 1 vCPU / 2GB | 2GB | ~80,000 | <5% |
| 2 vCPU / 4GB | 4GB | ~180,000 | <10% |
| 2 vCPU / 8GB | 8GB | ~350,000 | <10% |
| 4 vCPU / 16GB | 16GB | ~700,000 | <15% |
For active workloads (1 msg/sec per device), CPU becomes the constraint:
| Instance | Devices @ 1 msg/sec | Devices @ 10 msg/sec |
|---|---|---|
| 1 vCPU | ~15,000 | ~1,500 |
| 2 vCPU | ~30,000 | ~3,000 |
| 4 vCPU | ~60,000 | ~6,000 |
| 8 vCPU | ~100,000 | ~10,000 |
System-level constraints
At high connection counts, OS and kernel limits often become the bottleneck before BEAM limits:
- File descriptors: Each connection consumes one fd. Set ulimit -n accordingly (see OS Tuning).
- Ephemeral ports: A single IP address supports ~64K outbound ports. For more connections, bind multiple IPs.
- Kernel socket buffer memory: Each TCP socket reserves kernel buffer space (~4–8KB default). At 500K connections this alone can consume several GB of kernel memory.
- BEAM process/port limits: Default limits are 262,144 processes and 65,536 ports. Increase with +P and +Q flags (see VM Tuning).
Beyond a single node
Past ~500K connections, consider clustering multiple BEAM nodes behind a load balancer. The constraints at this scale are fault isolation (a single node crash affects all connected devices) and system-level limits described above. A multi-node setup with 3–5 nodes provides both capacity and redundancy.
Deployment Guidelines
Single Node
A single BEAM node with MqttX can handle:
| Metric | Conservative | Optimistic |
|---|---|---|
| Concurrent connections | 50,000 | 200,000 |
| Messages/second (QoS 0) | 100,000 | 500,000+ |
| Messages/second (QoS 1) | 50,000 | 200,000 |
| Memory per connection | ~20 KB | ~20 KB |
| Total memory (100k conns) | ~2 GB | ~2 GB |
These are theoretical estimates based on codec throughput benchmarks and architectural analysis — not measured under end-to-end load. Actual numbers depend on hardware, OS tuning, message sizes, subscription patterns, and handler callback complexity. QoS 2 has higher overhead due to the 4-step handshake.
VM Tuning
For high connection counts, tune the BEAM scheduler:
# Use all available cores
elixir --erl "+S $(nproc)" -S mix run
# Increase process limit (default 262144)
elixir --erl "+P 1000000" -S mix run
# Increase port limit for socket handles
elixir --erl "+Q 200000" -S mix run
Or in rel/vm.args:
+S 8:8
+P 1000000
+Q 200000
+stbt db
+sbwt very_long
OS Tuning
# Increase file descriptor limit (each connection = 1 fd)
ulimit -n 200000
# Linux: increase socket buffer sizes
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
sysctl -w net.ipv4.tcp_rmem="4096 87380 16777216"
sysctl -w net.ipv4.tcp_wmem="4096 87380 16777216"
# Increase ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
Rate Limiting
For production deployments, enable rate limiting to protect against misbehaving clients and connection storms:
MqttX.Server.start_link(MyApp.MqttHandler, [],
transport: MqttX.Transport.ThousandIsland,
port: 1883,
rate_limit: [
max_connections: 100, # per second
max_messages: 1000 # per client per second
]
)
The rate limiter uses ETS with atomic update_counter for lock-free concurrent access. Counters reset automatically each interval window.
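The update_counter pattern can be sketched in a few lines (table name and reset strategy are illustrative, not MqttX's internals):

```elixir
table = :ets.new(:rate_limit_demo, [:set, :public, write_concurrency: true])

# Atomically increment the counter at tuple position 2, inserting
# {client_id, 0} first if the key is absent. No locks, no GenServer
# serialization: the increment happens inside ETS.
allow? = fn client_id, limit ->
  count = :ets.update_counter(table, client_id, {2, 1}, {client_id, 0})
  count <= limit
end

# A window reset clears (or zeroes) the counters on a timer:
reset = fn -> :ets.delete_all_objects(table) end
```

Because every connection process increments its own key concurrently, the limiter adds a single atomic ETS operation to the hot path rather than funneling all traffic through one process.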
Transport Selection
Both ThousandIsland and Ranch are battle-tested for high connection counts:
| Transport | Strengths | Notes |
|---|---|---|
| ThousandIsland | Pure Elixir, simpler supervision | Recommended for new projects |
| Ranch | Mature C-based acceptor, proven at scale | Used by Cowboy, RabbitMQ |
Monitoring
Use the telemetry events (see Telemetry guide) to track:
- Connection rate: [:mqttx, :server, :client_connect, :stop] counter
- Message throughput: [:mqttx, :server, :publish] counter
- Publish latency: [:mqttx, :client, :publish, :stop] duration histogram
- Payload sizes: [:mqttx, :server, :publish] payload_size distribution
- Connection errors: [:mqttx, :client, :connect, :exception] counter by reason
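Wiring one of these events up is a one-call affair with the :telemetry library. A minimal handler for the connection-rate counter might look like this (the handler id and what you do with the measurements are up to you; forwarding to StatsD/Prometheus is typical):

```elixir
:ok =
  :telemetry.attach(
    "mqttx-connect-counter",
    [:mqttx, :server, :client_connect, :stop],
    fn _event_name, measurements, _metadata, _config ->
      # Bump your metrics counter here; measurements typically carry
      # durations in native time units.
      IO.inspect(measurements, label: "client_connect")
    end,
    nil
  )
```

Keep handlers cheap: they run synchronously in the connection process, so anything slow here lands directly on the hot path this guide is about.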