Columnar data transforms
Mix.install(
[{:lastfm_archive, path: "elixir/lastfm_archive"}, {:kino_explorer, "~> 0.1.8"}],
config: [
lastfm_archive: [
data_dir: "./lastfm_data/",
user: ""
]
]
)
:ok
Introduction
This guide uses lastfm_archive to create various columnar data archives, enabling an entire dataset of scrobbles for a Lastfm user to be read into a data frame for analytics purposes.
Prerequisite
- Setup, installation
- Creating a file archive containing scrobbles in raw JSON format fetched from Lastfm API
Transform to columnar formats
The default file archive stores data downloaded from Lastfm in a per-day raw format (one JSON file per day). It is not optimised for analytics or computation: even a simple metric, such as the total number of albums scrobbled, requires every raw data file to be read, parsed, analysed and consolidated. With 16 years of scrobbles, for example, that is a lot of files to read just to get a simple count!
Columnar storage is better suited to analytics, OLAP workloads and historical archiving. lastfm_archive can transform the raw JSON archive into the following storage formats:
- Apache Arrow columnar format
- Apache Parquet columnar format
- also CSV (tab-delimited)
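The difference between the two layouts can be sketched in plain Elixir. This is only an illustration with invented artist, album and track values; it does not reflect how lastfm_archive, Arrow or Parquet store data internally.

```elixir
# Row-oriented: one map per scrobble, similar in spirit to per-day JSON records.
# All names and values here are invented for illustration.
rows = [
  %{artist: "Artist A", album: "Album X", track: "Track 1"},
  %{artist: "Artist A", album: "Album X", track: "Track 2"},
  %{artist: "Artist B", album: "Album Y", track: "Track 3"}
]

# Column-oriented: one list per field, the general idea behind columnar formats.
columns = %{
  artist: Enum.map(rows, & &1.artist),
  album: Enum.map(rows, & &1.album),
  track: Enum.map(rows, & &1.track)
}

# Counting distinct albums in the row layout walks over every whole record...
albums_from_rows = rows |> Enum.map(& &1.album) |> Enum.uniq() |> length()

# ...whereas the column layout only needs the single :album list.
albums_from_columns = columns.album |> Enum.uniq() |> length()

IO.inspect({albums_from_rows, albums_from_columns})
```

Both counts agree; the point is that the columnar version never touches the artist or track values.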
Apache Parquet archive
Run the following code to transform the file archive into an Apache Parquet archive.
LastfmArchive.default_user() |> LastfmArchive.transform(format: :parquet)
To transform or regenerate a single year, use the overwrite (replace old data) and year options. The example below assumes the file archive contains scrobbles from 2023 (otherwise, try another year):
LastfmArchive.default_user()
|> LastfmArchive.transform(format: :parquet, overwrite: true, year: 2023)
To transform or regenerate the entire archive, overwriting all previous data:
LastfmArchive.default_user() |> LastfmArchive.transform(format: :parquet, overwrite: true)
Apache Arrow archive
Apache Arrow is an in-memory columnar format that is interoperable among data applications written in different languages. Arrow data is serialised according to an interprocess communication (IPC) protocol.
Run the following code to create an Apache Arrow archive according to its IPC streaming format:
LastfmArchive.default_user() |> LastfmArchive.transform(format: :ipc_stream)
The same overwrite and year options apply (see Apache Parquet archive) for regenerating or transforming all data or a single year.
Read columnar data for analytics
Columnar data can be read into an Explorer data frame for analysis. To read a single year of scrobbles from the Arrow IPC archive into a data frame, run (again assuming 2023 scrobbles exist, otherwise try another year):
user = LastfmArchive.default_user()
{:ok, df} = LastfmArchive.read(user, format: :ipc_stream, year: 2023)
The data frame can now be used for various analytics workloads. For example, compute all unique albums scrobbled in 2023 and list them in descending order of scrobble count (most scrobbled albums first):
df |> Explorer.DataFrame.collect() |> Explorer.DataFrame.frequencies([:album])
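To make the result of a frequency count concrete, here is a rough plain-Elixir equivalent on a hand-made list. The album names are invented, and this is a sketch of the idea (group, count, sort descending), not Explorer's implementation.

```elixir
# Hypothetical album column; the values are invented for illustration.
albums = ["Album X", "Album Y", "Album X", "Album Z", "Album X", "Album Y"]

album_counts =
  albums
  # Build a map of value => number of occurrences.
  |> Enum.frequencies()
  # Order the {album, count} pairs by count, highest first.
  |> Enum.sort_by(fn {_album, count} -> count end, :desc)

IO.inspect(album_counts)
```

Each distinct album appears once, paired with the number of times it occurs in the list.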
To read the entire dataset into a data frame, run:
{:ok, df_all} = LastfmArchive.read(user, format: :ipc_stream)
The data frame can then be used for various analytics. For example, to compute all unique artists, run:
df_all |> Explorer.DataFrame.collect() |> Explorer.DataFrame.frequencies([:artist])
To compute all unique tracks by artist:
df_all |> Explorer.DataFrame.collect() |> Explorer.DataFrame.frequencies([:track, :artist])
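Counting over two columns matters when different artists share a track title. The plain-Elixir sketch below, with invented values, contrasts counting by track alone with counting by the track and artist pair; it illustrates the grouping idea only and is not Explorer code.

```elixir
# Hypothetical scrobbles; the values are invented for illustration.
scrobbles = [
  %{track: "Intro", artist: "Artist A"},
  %{track: "Intro", artist: "Artist B"},
  %{track: "Intro", artist: "Artist A"}
]

# Grouping by title alone merges both artists' tracks into one entry.
by_track = Enum.frequencies_by(scrobbles, & &1.track)

# Grouping by the {track, artist} pair keeps them apart.
by_track_and_artist = Enum.frequencies_by(scrobbles, &{&1.track, &1.artist})

IO.inspect(by_track)
IO.inspect(by_track_and_artist)
```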