Getting Started

View Source

This guide is an introduction to Blink, a fast bulk data insertion library for Ecto and PostgreSQL.

In this guide, we will:

  • Create a seeder module for inserting users and posts
  • Learn how to reference data from previously declared tables
  • Use streams for memory-efficient seeding
  • Store auxiliary data in context without inserting it into the database

Add Blink to your dependencies in mix.exs:

defp deps do
  [
    {:blink, "~> 0.6.1"}
  ]
end

Install the dependencies:

mix deps.get

Configuring the repository

Blink works with any Ecto repository. If you don't have Ecto set up yet, follow the Ecto Getting Started guide to configure your repository and create your database tables.

For this guide, we'll assume you have:

  • An Ecto repository (e.g., Blog.Repo) configured
  • A users table with columns: id, name, email, inserted_at, updated_at
  • A posts table with columns: id, title, body, user_id, inserted_at, updated_at

Creating a seeder

Now that we have our database set up, let's create a seeder module to insert data:

defmodule Blog.Seeder do
  use Blink

  def call do
    new()
    |> with_table("users")
    |> with_table("posts")
    |> run(Blog.Repo)
  end

  def table(_seeder, "users") do
    [
      %{id: 1, name: "Alice", email: "alice@example.com"},
      %{id: 2, name: "Bob", email: "bob@example.com"}
    ]
  end

  def table(seeder, "posts") do
    IO.inspect(seeder)
    # %Blink.Seeder{
    #   tables: %{"users" => [%{id: 1, name: "Alice", ...}, ...]},
    #   table_order: ["users"],
    #   table_opts: %{"users" => []},
    #   context: %{}
    # }

    users = seeder.tables["users"]

    Enum.flat_map(users, fn user ->
      for i <- 1..5 do
        %{
          id: (user.id - 1) * 5 + i,
          title: "Post #{i} by #{user.name}",
          body: "This is the content of post #{i}.",
          user_id: user.id,
          inserted_at: ~U[2024-01-01 00:00:00Z],
          updated_at: ~U[2024-01-01 00:00:00Z]
        }
      end
    end)
  end
end

The seeder above does the following:

  1. use Blink - Injects Blink's functions and defines required callbacks
  2. new() - Creates an empty Seeder struct
  3. with_table/2 - Declares the tables to insert rows into
  4. table/2 - Defines what rows to insert into each table
  5. run/2 - Executes the bulk insertion

Each table/2 callback receives a Seeder struct. The tables field stores data from previously declared tables, allowing the "posts" callback to reference seeder.tables["users"].

Once run/2 is called, data is inserted in the order tables were declared. The context field is covered below.

Let's run it from IEx:

iex -S mix
iex> Blog.Seeder.call()
# => Inserts 2 users and 10 posts

Streams

In the example above, the table/2 clauses returned lists. Since Blink stores the entire Seeder struct in memory, large lists can be problematic.

To avoid this, table/2 can return a stream instead:

def table(_seeder, "users") do
  Stream.map(1..1_000_000, fn i ->
    %{
      id: i,
      name: "User #{i}",
      email: "user#{i}@example.com",
      inserted_at: ~U[2024-01-01 00:00:00Z],
      updated_at: ~U[2024-01-01 00:00:00Z]
    }
  end)
end

def table(seeder, "posts") do
  Stream.flat_map(seeder.tables["users"], fn user ->
    for i <- 1..20 do
      %{
        id: (user.id - 1) * 20 + i,
        title: "Post #{i} by #{user.name}",
        body: "This is the content of post #{i}",
        user_id: user.id,
        inserted_at: ~U[2024-01-01 00:00:00Z],
        updated_at: ~U[2024-01-01 00:00:00Z]
      }
    end
  end)
end

Streams are processed lazily by run/2 without extra configuration needed.

Using context

Sometimes you need to compute data once and share it across multiple tables. Context data is not inserted into the database but is available when building your table data.

In this example, we generate timestamps once and reuse them across tables, ensuring posts are created after their author are.

def call do
  new()
  |> with_context("timestamps")
  |> with_table("users")
  |> with_table("posts")
  |> run(Blog.Repo)
end

def context(_seeder, "timestamps") do
  base = ~U[2024-01-01 00:00:00Z]
  for day <- 0..29, do: DateTime.add(base, day, :day)
end

def table(seeder, "users") do
  timestamps = seeder.context["timestamps"]
  random_timestamp = Enum.random(timestamps)

  for i <- 1..100 do
    %{
      id: i,
      name: "User #{i}",
      email: "user#{i}@example.com",
      inserted_at: random_timestamp,
      updated_at: random_timestamp
    }
  end
end

def table(seeder, "posts") do
  users = seeder.tables["users"]
  timestamps = seeder.context["timestamps"]

  Enum.flat_map(users, fn user ->
    # Only use timestamps after the user was created
    valid_timestamps =
      Enum.filter(timestamps, fn ts ->
        DateTime.compare(ts, user.inserted_at) == :gt
      end)

    random_valid_timestamp = Enum.random(valid_timestamps)

    for i <- 1..5 do
      %{
        id: (user.id - 1) * 5 + i,
        title: "Post #{i}",
        body: "Content here",
        user_id: user.id,
        inserted_at: random_valid_timestamp,
        updated_at: random_valid_timestamp
      }
    end
  end)
end

Summary

In this guide, we learned how to:

  • Create a seeder module with use Blink
  • Reference data from previously declared tables via seeder.tables
  • Use streams for memory-efficient seeding of large datasets
  • Store auxiliary data in context without inserting it into the database

Next steps

You might also find these guides useful: