Explorer.Remote (Explorer v0.10.0)

A module responsible for placing remote dataframes and garbage collecting them.

The functions in Explorer.DataFrame and Explorer.Series will automatically move operations on remote dataframes to the nodes they belong to. Explorer also integrates with FLAME and automatically tracks remote dataframes and series returned from FLAME calls when the :track_resources option is enabled.

This module provides additional conveniences for manual placement.

Implementation details

In order to understand what this module does, we need to understand the challenges in working with remote series and dataframes.

Series and dataframes are actually NIF resources: pointers to blobs of memory managed by low-level libraries. They are represented in Erlang/Elixir as references (the same kind of term returned by make_ref/0). Once a reference is garbage collected (based on reference counting), the NIF resource behind it is garbage collected too and its memory is reclaimed.

When using Distributed Erlang, you may write this code:

remote_series = :erpc.call(node, Explorer.Series, :from_list, [[1, 2, 3]])

However, the code above will not work: the series is allocated on the remote node, but nothing on the remote node holds a reference to it. The series is therefore garbage collected, and if we later attempt to read it from the caller node, it no longer exists. We must explicitly place these resources on remote nodes by spawning processes that hold the references. That is what the place/2 function in this module does.

We also need to guarantee that remote nodes do not keep these resources forever. To that end, place/2 creates a local NIF resource which, once it is itself garbage collected, notifies the remote side that the resources are no longer needed, effectively implementing a remote garbage collector.
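As a rough illustration of the holder idea described above, the sketch below spawns a process that builds a term and keeps it referenced until it is told to let go. HolderSketch, hold/2, and the :release message are hypothetical names used only for illustration; they are not part of Explorer. The real module triggers the release when a local NIF resource is garbage collected, which plain Elixir cannot express, so an explicit message stands in for that notification here. Passing an anonymous function to Node.spawn/2 also assumes the same compiled code is available on the target node.

```elixir
defmodule HolderSketch do
  # Hypothetical helper, not part of Explorer: spawns a process on `node`
  # that builds a term with `fun` and keeps it alive until released.
  def hold(node, fun) when is_function(fun, 0) do
    caller = self()
    ref = make_ref()

    pid =
      Node.spawn(node, fn ->
        term = fun.()
        send(caller, {ref, term})
        # Keep `term` referenced by this process while it waits.
        wait(term)
      end)

    receive do
      {^ref, term} -> {term, pid}
    after
      5_000 -> exit(:timeout)
    end
  end

  # Ignores unrelated messages; returns (ending the process) on :release.
  defp wait(term) do
    receive do
      :release -> :ok
      _other -> wait(term)
    end
  end
end

# Usage on the local node; a remote node name would be passed the same way.
{term, pid} = HolderSketch.hold(node(), fn -> [1, 2, 3] end)

# Stand-in for the notification sent when the local side is collected.
send(pid, :release)
```

With a real series, the function passed to hold/2 would create the resource on the remote node, and the holder process would be what keeps it from being collected there.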

Functions

Receives a data structure and traverses it looking for remote dataframes and series.

If any are found, it spawns a process on the remote node and sets up a distributed garbage collector. This function only traverses maps, lists, and tuples; it does not support arbitrary structs (such as map sets).

It returns the updated term and a list of remote PIDs spawned.
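The traversal rule above (maps, lists, and tuples are walked; structs are left alone) can be sketched in plain Elixir. This is not Explorer's implementation: TraverseSketch, walk/4, and the predicate and replacement functions are hypothetical, and integers stand in for remote dataframes and series so the example runs without any dependency. Only map values are visited in this sketch, not map keys.

```elixir
defmodule TraverseSketch do
  # Hypothetical helper: returns {updated_term, acc}, where every leaf
  # matching `pred` is replaced by `fun.(leaf)` and prepended to `acc`.
  def walk(term, acc, pred, fun) do
    cond do
      pred.(term) ->
        {fun.(term), [term | acc]}

      is_list(term) ->
        Enum.map_reduce(term, acc, &walk(&1, &2, pred, fun))

      is_tuple(term) ->
        {items, acc} =
          term
          |> Tuple.to_list()
          |> Enum.map_reduce(acc, &walk(&1, &2, pred, fun))

        {List.to_tuple(items), acc}

      # Plain maps only: structs such as MapSet are returned untouched.
      is_map(term) and not is_struct(term) ->
        {pairs, acc} =
          Enum.map_reduce(term, acc, fn {key, value}, acc ->
            {value, acc} = walk(value, acc, pred, fun)
            {{key, value}, acc}
          end)

        {Map.new(pairs), acc}

      true ->
        {term, acc}
    end
  end
end

data = %{a: [1, {:ok, 2}], b: MapSet.new([3])}

# Integers play the role of remote resources; the 3 inside the MapSet
# is never visited because structs are not traversed.
{updated, found} = TraverseSketch.walk(data, [], &is_integer/1, &(&1 * 10))
```

The accumulator mirrors the shape of the documented return value: an updated term alongside a list collected during the walk.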