ExVision walkthrough

Mix.install(
  [
    :ex_vision,
    :kino,
    :kino_bumblebee,
    :stb_image,
    :exla,
    :image
  ],
  config: [
    nx: [default_backend: EXLA.Backend]
  ]
)

ExVision introduction

This Livebook will only work when the repository is cloned locally.

ExVision is a collection of models with an easy-to-use API and descriptive output formats. It uses Ortex under the hood to run its predefined models.

The main objective of ExVision is ease of use. This sacrifices some control over the model, but allows you to get started with predefined models in seconds. That approach should let an average Elixir developer quickly introduce some AI into their app, just like that.

alias ExVision.Classification.MobileNetV3Small, as: Classifier
alias ExVision.ObjectDetection.FasterRCNN_ResNet50_FPN, as: ObjectDetector
alias ExVision.SemanticSegmentation.DeepLabV3_MobileNetV3, as: SemanticSegmentation
alias ExVision.InstanceSegmentation.MaskRCNN_ResNet50_FPN_V2, as: InstanceSegmentation
alias ExVision.KeypointDetection.KeypointRCNN_ResNet50_FPN, as: KeypointDetector

{:ok, classifier} = Classifier.load()
{:ok, object_detector} = ObjectDetector.load()
{:ok, semantic_segmentation} = SemanticSegmentation.load()
{:ok, instance_segmentation} = InstanceSegmentation.load()
{:ok, keypoint_detector} = KeypointDetector.load()

Kino.nothing()

At this point the models are loaded and ready for inference.

ExVision handles multiple types of input:

  • file path
  • pre-loaded Nx tensors, in both interleaved and planar formats
  • Evision matrices

Under the hood, all of these formats are converted to Nx tensors and normalized for inference by the given model.
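For example, all of these calls should be equivalent ways of running the same classifier (a sketch; the file path is a placeholder and the Evision call assumes you have the :evision package installed, which is not part of this notebook's dependencies):

# 1. a path to an image file on disk
Classifier.run(classifier, "path/to/cat.jpg")

# 2. a pre-loaded Nx tensor (interleaved HWC shown here; planar CHW works too)
tensor = "path/to/cat.jpg" |> Image.open!() |> Image.to_nx!()
Classifier.run(classifier, tensor)

# 3. an Evision matrix (requires the :evision dependency)
mat = Evision.imread("path/to/cat.jpg")
Classifier.run(classifier, mat)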

Output formats

A big advantage of ExVision over using the models directly is its documentation and intuitive outputs. Hence, each task returns a descriptive type: classifiers return a map of labels to probabilities, object detectors return a list of bounding boxes, and the segmentation and keypoint models return per-pixel masks and named keypoints, each described in its own section below.

Example inference

Let's put it into practice and run some predictions on a sample image of a cat. This code intentionally uses some calls to the dbg/1 macro in order to aid with the understanding of these formats.

Let's start with loading our test suspect. For this purpose, we have defined a helper function that will automatically load a default image if you don't specify one.

defmodule ImageHandler do
  def get(input, default_image) do
    img_path =
      case Kino.Input.read(input) do
        # no file was uploaded - fall back to the default image from the ExVision cache
        nil ->
          {:ok, file} = ExVision.Cache.lazy_get(ExVision.Cache, default_image)
          file

        # a file was uploaded - resolve its on-disk path
        %{file_ref: image} ->
          Kino.Input.file_path(image)
      end

    Image.open!(img_path)
  end
end

In the next cell, you can provide your own image that will be used as an example in this notebook. If you don't have anything handy, we're also providing a default image of a cat.

input = Kino.Input.image("Image to evaluate", format: :jpeg)
image = ImageHandler.get(input, "cat.jpg")

Image classification

Image classification is the process of assigning the image a category that best describes its contents. For example, when given an image of a cat, an image classifier should predict that the image belongs to the :cat class.

The output format of a classifier is a map from each category the model knows to its probability. In most cases, that means you will get a lot of categories with near-zero probability, and that's on purpose. Where possible, we don't want to make ExVision feel too much like magic. You're still doing AI, we're just handling the input and output format conversions.

Usually, the class with the highest probability is the category you should assign. However, if there are multiple classes with comparably high probabilities, this may indicate that the model has no idea and it's actually not a prediction at all.
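To make that output format concrete, a classification result looks roughly like the map below (the class names and probabilities are purely illustrative, not real model output):

%{
  cat: 0.92,
  tabby: 0.04,
  dog: 0.01
  # ...and many more classes, each with a near-zero probability
}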

Code example

In this example, we will try to find out the most likely class that the provided image could belong to. In order to do this, we will:

  1. Use the image classifier to gather predictions
  2. Sort the predictions
  3. Take 10 of the most likely ones
  4. Plot the results
predictions =
  image
  # run inference
  |> then(&Classifier.run(classifier, &1))
  # sort the dictionary by the probability of the prediction
  |> Enum.sort_by(fn {_label, score} -> score end, :desc)
  # Only include a few of the most likely predictions in the output
  |> Enum.take(10)
  |> dbg()

[{top_prediction, _score} | _rest] = predictions

# Kino rendering stuff, not important
scored_list = Kino.Bumblebee.ScoredList.new(predictions)

Kino.Layout.grid(
  [
    image,
    Kino.Layout.grid([Kino.Text.new("Class probabilities"), scored_list])
  ],
  columns: 2,
  gap: 25
)

Object detection

In object detection, we're trying to locate the objects in the image. The output format in this case should provide a lot of clarification: it's a list of bounding boxes, each indicating the area of the image where an object of the specified class is located. Each bounding box is also assigned a score, which can be interpreted as the certainty of the detection.

By default, ExVision will discard extremely low probability bounding boxes (with scores lower than 0.1), as they are just noise.
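Based on how the struct is used in the code below, a single detection looks roughly like this (the values are illustrative; consult the ExVision.Types.BBox documentation for the authoritative definition):

%ExVision.Types.BBox{
  x1: 120,     # left edge of the box, in pixels
  y1: 45,      # top edge, in pixels
  x2: 380,     # right edge, in pixels
  y2: 410,     # bottom edge, in pixels
  label: :cat, # predicted class
  score: 0.97  # certainty of the detection
}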

Code example

In this example, we will draw a rectangle around the biggest object in the image. In order to do this, we will perform the following operations:

  1. Use the object detector to get the bounding boxes
  2. Find the bounding box with the biggest total area
  3. Draw a rectangle around the region indicated by that bounding box
alias ExVision.Types.BBox

# apply the model
prediction =
  image
  |> then(&ObjectDetector.run(object_detector, &1))
  # Find the biggest object by area
  |> Enum.max_by(&(BBox.width(&1) * BBox.height(&1)))
  |> dbg()

# Render an image
Image.Draw.rect!(
  image,
  prediction.x1,
  prediction.y1,
  BBox.width(prediction),
  BBox.height(prediction),
  fill: false,
  color: :red,
  stroke_width: 5
)
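If you would rather visualise every reasonably confident detection instead of just the largest one, a minimal sketch could reduce over the filtered predictions (the 0.8 threshold is an arbitrary choice):

image
|> then(&ObjectDetector.run(object_detector, &1))
|> Enum.filter(&(&1.score > 0.8))
|> Enum.reduce(image, fn bbox, acc ->
  # draw one rectangle per remaining detection
  Image.Draw.rect!(
    acc,
    bbox.x1,
    bbox.y1,
    BBox.width(bbox),
    BBox.height(bbox),
    fill: false,
    color: :red,
    stroke_width: 5
  )
end)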

Semantic segmentation

The goal of semantic segmentation is to generate per-pixel masks stating whether an object of the given class is present at each pixel.

In ExVision, the output of semantic segmentation models is a mapping of category to a binary per-pixel mask. In contrast to the previous models, we're not getting scores. Each pixel is always assigned the most probable class.
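For instance, a quick way to see how much of the image each class occupies is to average each mask; this mirrors the filtering trick used in the code example below:

image
|> then(&SemanticSegmentation.run(semantic_segmentation, &1))
|> Map.new(fn {label, mask} ->
  # the mean of a binary mask is the fraction of pixels it covers
  {label, mask |> Nx.mean() |> Nx.to_number()}
end)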

Code example

In this example, we will feed the image to the semantic segmentation model and inspect some of the masks provided by the model.

nx_image = Image.to_nx!(image)
uniform_black = 0 |> Nx.broadcast(Nx.shape(nx_image)) |> Nx.as_type(Nx.type(nx_image))

predictions =
  image
  |> then(&SemanticSegmentation.run(semantic_segmentation, &1))
  # Filter out masks covering less than 5% of the total image area
  |> Enum.filter(fn {_label, mask} ->
    mask |> Nx.mean() |> Nx.to_number() > 0.05
  end)
  |> dbg()

predictions
|> Enum.map(fn {label, mask} ->
  # expand the mask to cover all channels
  mask = Nx.broadcast(mask, Nx.shape(nx_image), axes: [0, 1])

  # Cut out the mask from the original image
  image = Nx.select(mask, nx_image, uniform_black)
  image = Nx.as_type(image, :u8)

  Kino.Layout.grid([
    label |> Atom.to_string() |> Kino.Text.new(),
    Kino.Image.new(image)
  ])
end)
|> Kino.Layout.grid(columns: 2)

Instance segmentation

The objective of instance segmentation is to not only identify objects within an image on a per-pixel basis but also differentiate each specific object of the same class.

In ExVision, the output of instance segmentation models includes a bounding box with a label and a score (similar to object detection), and a binary mask for every instance detected in the image.

Extremely low probability detections (with scores lower than 0.1) will be discarded by ExVision, as they are just noise.
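Based on the fields used throughout this notebook, a single instance prediction looks roughly like this (the values are illustrative; see ExVision.Types.BBoxWithMask for the authoritative struct):

%ExVision.Types.BBoxWithMask{
  x1: 120,                           # left edge of the bounding box, in pixels
  y1: 45,                            # top edge, in pixels
  x2: 380,                           # right edge, in pixels
  y2: 410,                           # bottom edge, in pixels
  label: :cat,                       # predicted class
  score: 0.97,                       # certainty of the detection
  mask: Nx.broadcast(0, {400, 600})  # binary per-pixel mask for this instance (placeholder shape)
}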

Code example

In the following example, we will pass an image through the instance segmentation model and examine the individual instance masks recognized by the model.

alias ExVision.Types.BBoxWithMask

nx_image = Image.to_nx!(image)
uniform_black = 0 |> Nx.broadcast(Nx.shape(nx_image)) |> Nx.as_type(Nx.type(nx_image))

predictions =
  image
  |> then(&InstanceSegmentation.run(instance_segmentation, &1))
  # Get most likely predictions from the output
  |> Enum.filter(fn %BBoxWithMask{score: score} -> score > 0.8 end)
  |> dbg()

predictions
|> Enum.map(fn %BBoxWithMask{label: label, mask: mask} ->
  # expand the mask to cover all channels
  mask = Nx.broadcast(mask, Nx.shape(nx_image), axes: [0, 1])

  # Cut out the mask from the original image
  image = Nx.select(mask, nx_image, uniform_black)
  image = Nx.as_type(image, :u8)

  Kino.Layout.grid([
    label |> Atom.to_string() |> Kino.Text.new(),
    Kino.Image.new(image)
  ])
end)
|> Kino.Layout.grid(columns: 2)

Keypoint detection

In keypoint detection, we're trying to locate specific keypoints in the image. ExVision returns the output as a list of bounding boxes (similar to object detection) with named keypoints. Each keypoint consists of x and y coordinates and a score, which is the model's certainty of that keypoint.

ExVision will discard extremely low probability detections (with scores lower than 0.1), as they are just noise.
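Putting that together, a single keypoint detection looks roughly like this (the values are illustrative and the :person label is an assumption based on this model's typical use; see ExVision.Types.BBoxWithKeypoints for the authoritative struct):

%ExVision.Types.BBoxWithKeypoints{
  x1: 110,         # left edge of the bounding box, in pixels
  y1: 30,          # top edge, in pixels
  x2: 260,         # right edge, in pixels
  y2: 420,         # bottom edge, in pixels
  label: :person,  # predicted class
  score: 0.95,     # certainty of the detection
  keypoints: %{
    nose: %{x: 185, y: 60, score: 0.99},
    left_eye: %{x: 178, y: 52, score: 0.98}
    # ...one entry per named body keypoint
  }
}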

The KeypointRCNN_ResNet50_FPN model is commonly used for detecting human body parts in images. To illustrate this, let's begin by importing an image that features people.

image = ImageHandler.get(input, "people.jpg")

Code example

In this example, we will draw keypoints for every detection with a high enough score returned by the model. Additionally, we will draw a bounding box around them.

alias ExVision.Types.BBoxWithKeypoints

# define skeleton pose
connections = [
  # face
  {:nose, :left_eye},
  {:nose, :right_eye},
  {:left_eye, :right_eye},
  {:left_eye, :left_ear},
  {:right_eye, :right_ear},

  # left arm
  {:left_wrist, :left_elbow},
  {:left_elbow, :left_shoulder},

  # right arm
  {:right_wrist, :right_elbow},
  {:right_elbow, :right_shoulder},

  # torso
  {:left_shoulder, :right_shoulder},
  {:left_shoulder, :left_hip},
  {:right_shoulder, :right_hip},
  {:left_hip, :right_hip},
  {:left_shoulder, :left_ear},
  {:right_shoulder, :right_ear},

  # left leg
  {:left_ankle, :left_knee},
  {:left_knee, :left_hip},

  # right leg
  {:right_ankle, :right_knee},
  {:right_knee, :right_hip}
]

# apply the model
predictions =
  image
  |> then(&KeypointDetector.run(keypoint_detector, &1))
  # Get most likely predictions from the output
  |> Enum.filter(fn %BBoxWithKeypoints{score: score} -> score > 0.8 end)
  |> dbg()

predictions
|> Enum.reduce(image, fn prediction, image_acc ->
  # draw keypoints
  image_acc =
    prediction.keypoints
    |> Enum.reduce(image_acc, fn {_key, %{x: x, y: y}}, acc ->
      Image.Draw.circle!(acc, x, y, 2, color: :red)
    end)

  # draw skeleton pose
  image_acc =
    connections
    |> Enum.reduce(image_acc, fn {from, to}, acc ->
      %{x: x1, y: y1} = prediction.keypoints[from]
      %{x: x2, y: y2} = prediction.keypoints[to]

      Image.Draw.line!(acc, x1, y1, x2, y2, color: :red)
    end)

  # draw bounding box
  Image.Draw.rect!(
    image_acc,
    prediction.x1,
    prediction.y1,
    BBoxWithKeypoints.width(prediction),
    BBoxWithKeypoints.height(prediction),
    fill: false,
    color: :red,
    stroke_width: 2
  )
end)

Next steps

After completing this tutorial, you can also check out our next tutorial, which focuses on using models in production in a process workflow.