ExVision walkthrough
Mix.install(
[
:ex_vision,
:kino,
:kino_bumblebee,
:stb_image,
:exla,
:image
],
config: [
nx: [default_backend: EXLA.Backend]
]
)
ExVision introduction
This Livebook will only work when the repository is cloned locally
ExVision is a collection of models with an easy-to-use API and descriptive output formats. It uses Ortex under the hood to run its predefined models.
The main objective of ExVision is ease of use. This sacrifices some control over the model, but it allows you to get started with predefined models in seconds. That approach should allow an average Elixir developer to quickly introduce some AI into their app, just like that.
alias ExVision.Classification.MobileNetV3Small, as: Classifier
alias ExVision.ObjectDetection.FasterRCNN_ResNet50_FPN, as: ObjectDetector
alias ExVision.SemanticSegmentation.DeepLabV3_MobileNetV3, as: SemanticSegmentation
alias ExVision.InstanceSegmentation.MaskRCNN_ResNet50_FPN_V2, as: InstanceSegmentation
alias ExVision.KeypointDetection.KeypointRCNN_ResNet50_FPN, as: KeypointDetector
{:ok, classifier} = Classifier.load()
{:ok, object_detector} = ObjectDetector.load()
{:ok, semantic_segmentation} = SemanticSegmentation.load()
{:ok, instance_segmentation} = InstanceSegmentation.load()
{:ok, keypoint_detector} = KeypointDetector.load()
Kino.nothing()
At this point the models are loaded and ready for inference.
ExVision handles multiple types of input:
- file path
- pre-loaded Nx tensors, in both interleaved and planar formats
- Evision matrices.
Under the hood, all of these formats will be converted to Nx tensors and normalized for inference by the given model.
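For illustration, here's a minimal sketch of passing each of these input formats to the classifier loaded above. The "cat.jpg" path is hypothetical (any local image will do), and the Evision call requires the optional :evision dependency, which is not installed in this notebook:
# 1. A file path ("cat.jpg" here stands in for any local image file)
_ = Classifier.run(classifier, "cat.jpg")

# 2. A pre-loaded Nx tensor (interleaved format, as produced by Image.to_nx!/1)
tensor = "cat.jpg" |> Image.open!() |> Image.to_nx!()
_ = Classifier.run(classifier, tensor)

# 3. An Evision matrix - uncomment if you have the optional :evision dependency installed
# _ = Classifier.run(classifier, Evision.imread("cat.jpg"))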
Output formats
A big advantage of ExVision over using the models directly is its documentation and intuitive output formats. Hence, models return the following types:
- Classifier - a mapping from category to probability:
%{category_t() => number()}
- Object Detector - a list of bounding boxes:
list(BBox.t())
- Semantic Segmentation - a mapping from category to a boolean tensor indicating whether each pixel belongs to the mask for that class:
%{category_t() => Nx.Tensor.t()}
- Instance Segmentation - a list of bounding boxes with mask:
list(BBoxWithMask.t())
- Keypoint Detector - a list of bounding boxes with keypoints:
list(BBoxWithKeypoints.t())
Example inference
Let's put it into practice and run some predictions on a sample image of a cat.
This code intentionally uses some calls to the dbg/1
macro in order to aid with understanding these formats.
Let's start by loading our test subject. For this purpose, we have defined a helper function that will automatically load a default image if you don't specify any.
defmodule ImageHandler do
  def get(input, default_image) do
    img_path =
      case Kino.Input.read(input) do
        # Nothing uploaded - fetch the default image through ExVision's cache
        nil ->
          {:ok, file} = ExVision.Cache.lazy_get(ExVision.Cache, default_image)
          file

        # An image was uploaded - resolve the path of the uploaded file
        %{file_ref: image} ->
          Kino.Input.file_path(image)
      end

    Image.open!(img_path)
  end
end
In the next cell, you can provide your own image that will be used as an example in this notebook. If you don't have anything handy, we're also providing a default image of a cat.
input = Kino.Input.image("Image to evaluate", format: :jpeg)
image = ImageHandler.get(input, "cat.jpg")
Image classification
Image classification is the process of assigning the image a category that best describes its contents. For example, when given an image of a cat, an image classifier will predict that the image should be assigned to the :cat
class.
The output format of a classifier is a dictionary that maps every category the model knows into a probability. In most cases, that means you will get a lot of categories with near-zero probability, and that's on purpose. Where possible, we don't want to make ExVision feel too much like magic. You're still doing AI; we're just handling the input and output format conversions.
Usually, the class with the highest probability is the category you should assign. However, if there are multiple classes with comparatively high probabilities, this may indicate that the model has no idea and it's not really a prediction at all.
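If all you need is that single most likely class, a one-line sketch (using the image loaded above) could look like this:
# Pick the {category, probability} pair with the highest probability
Classifier.run(classifier, image) |> Enum.max_by(fn {_label, score} -> score end)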
Code example
In this example, we will try to find out the most likely class that the provided image could belong to. In order to do this, we will:
- Use the image classifier to gather predictions
- Sort the predictions
- Take 10 of the most likely ones
- Plot the results
predictions =
image
# run inference
|> then(&Classifier.run(classifier, &1))
# sort the dictionary by the probability of the prediction
|> Enum.sort_by(fn {_label, score} -> score end, :desc)
# Only include a few of the most likely predictions in the output
|> Enum.take(10)
|> dbg()
[{top_prediction, _score} | _rest] = predictions
# Kino rendering stuff, not important
scored_list = Kino.Bumblebee.ScoredList.new(predictions)
Kino.Layout.grid(
[
image,
Kino.Layout.grid([Kino.Text.new("Class probabilities"), scored_list])
],
columns: 2,
gap: 25
)
Object detection
In object detection, we're trying to locate objects in the image. The output format in this case should provide a lot of clarification: it's a list of bounding boxes, each of which indicates the area of the image in which an object of the specified class is located. Each bounding box is also assigned a score, which can be interpreted as the certainty of the detection.
By default, ExVision will discard extremely low probability bounding boxes (with scores lower than 0.1), as they are just noise.
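If the default 0.1 cutoff is too permissive for your use case, you can filter the returned list yourself. A quick sketch with a hypothetical 0.5 threshold (assuming the BBox struct exposes the score described above):
image
|> then(&ObjectDetector.run(object_detector, &1))
# Keep only detections the model is reasonably confident about
|> Enum.filter(fn %ExVision.Types.BBox{score: score} -> score > 0.5 end)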
Code example
In this example, we will draw a rectangle around the biggest object in the image. In order to do this, we will perform the following operations:
- Use the object detector to get the bounding boxes
- Find the bounding box with the biggest total area
- Draw a rectangle around the region indicated by that bounding box
alias ExVision.Types.BBox
# apply the model
prediction =
image
|> then(&ObjectDetector.run(object_detector, &1))
# Find the biggest object by area
|> Enum.max_by(&(BBox.width(&1) * BBox.height(&1)))
|> dbg()
# Render an image
Image.Draw.rect!(
image,
prediction.x1,
prediction.y1,
BBox.width(prediction),
BBox.height(prediction),
fill: false,
color: :red,
stroke_width: 5
)
Semantic segmentation
The goal of semantic segmentation is to generate per-pixel masks stating whether an object of the given class is present at the corresponding pixel.
In ExVision, the output of semantic segmentation models is a mapping from category to a binary per-pixel mask. In contrast to the previous models, we're not getting scores. Each pixel is always assigned the most probable class.
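As a quick sketch of working with this format, you could count how many pixels each mask covers; the masks are 0/1 tensors, so summing a mask gives its pixel count:
image
|> then(&SemanticSegmentation.run(semantic_segmentation, &1))
# Map each category to the number of pixels assigned to it
|> Map.new(fn {label, mask} -> {label, mask |> Nx.sum() |> Nx.to_number()} end)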
Code example
In this example, we will feed the image to the semantic segmentation model and inspect some of the masks provided by the model.
nx_image = Image.to_nx!(image)
uniform_black = 0 |> Nx.broadcast(Nx.shape(nx_image)) |> Nx.as_type(Nx.type(nx_image))
predictions =
image
|> then(&SemanticSegmentation.run(semantic_segmentation, &1))
# Filter out masks covering less than 5% of the total image area
|> Enum.filter(fn {_label, mask} ->
mask |> Nx.mean() |> Nx.to_number() > 0.05
end)
|> dbg()
predictions
|> Enum.map(fn {label, mask} ->
# expand the mask to cover all channels
mask = Nx.broadcast(mask, Nx.shape(nx_image), axes: [0, 1])
# Cut out the mask from the original image
image = Nx.select(mask, nx_image, uniform_black)
image = Nx.as_type(image, :u8)
Kino.Layout.grid([
label |> Atom.to_string() |> Kino.Text.new(),
Kino.Image.new(image)
])
end)
|> Kino.Layout.grid(columns: 2)
Instance segmentation
The objective of instance segmentation is to not only identify objects within an image on a per-pixel basis but also differentiate each specific object of the same class.
In ExVision, the output of instance segmentation models includes a bounding box with a label and a score (similar to object detection), and a binary mask for every instance detected in the image.
Extremely low probability detections (with scores lower than 0.1) will be discarded by ExVision, as they are just noise.
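Before the full example, here's a small sketch that counts how many high-confidence instances of each class were detected (using a hypothetical 0.8 threshold):
image
|> then(&InstanceSegmentation.run(instance_segmentation, &1))
# Keep only confident detections, then count instances per label
|> Enum.filter(&(&1.score > 0.8))
|> Enum.frequencies_by(& &1.label)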
Code example
In the following example, we will pass an image through the instance segmentation model and examine the individual instance masks recognized by the model.
alias ExVision.Types.BBoxWithMask
nx_image = Image.to_nx!(image)
uniform_black = 0 |> Nx.broadcast(Nx.shape(nx_image)) |> Nx.as_type(Nx.type(nx_image))
predictions =
image
|> then(&InstanceSegmentation.run(instance_segmentation, &1))
# Get most likely predictions from the output
|> Enum.filter(fn %BBoxWithMask{score: score} -> score > 0.8 end)
|> dbg()
predictions
|> Enum.map(fn %BBoxWithMask{label: label, mask: mask} ->
# expand the mask to cover all channels
mask = Nx.broadcast(mask, Nx.shape(nx_image), axes: [0, 1])
# Cut out the mask from the original image
image = Nx.select(mask, nx_image, uniform_black)
image = Nx.as_type(image, :u8)
Kino.Layout.grid([
label |> Atom.to_string() |> Kino.Text.new(),
Kino.Image.new(image)
])
end)
|> Kino.Layout.grid(columns: 2)
Keypoint detection
In keypoint detection, we're trying to locate specific keypoints in the image. ExVision returns the output as a list of bounding boxes (similar to object detection) with named keypoints. Each keypoint consists of x and y coordinates and a score, which is the model's certainty of that keypoint.
ExVision will discard extremely low probability detections (with scores lower than 0.1), as they are just noise.
The KeypointRCNN_ResNet50_FPN model is commonly used for detecting human body parts in images. To illustrate this, let's begin by importing an image that features people.
image = ImageHandler.get(input, "people.jpg")
Code example
In this example, we will draw keypoints for every detection with a high enough score returned by the model; additionally, we will draw a bounding box around them.
alias ExVision.Types.BBoxWithKeypoints
# define skeleton pose
connections = [
# face
{:nose, :left_eye},
{:nose, :right_eye},
{:left_eye, :right_eye},
{:left_eye, :left_ear},
{:right_eye, :right_ear},
# left arm
{:left_wrist, :left_elbow},
{:left_elbow, :left_shoulder},
# right arm
{:right_wrist, :right_elbow},
{:right_elbow, :right_shoulder},
# torso
{:left_shoulder, :right_shoulder},
{:left_shoulder, :left_hip},
{:right_shoulder, :right_hip},
{:left_hip, :right_hip},
{:left_shoulder, :left_ear},
{:right_shoulder, :right_ear},
# left leg
{:left_ankle, :left_knee},
{:left_knee, :left_hip},
# right leg
{:right_ankle, :right_knee},
{:right_knee, :right_hip}
]
# apply the model
predictions =
image
|> then(&KeypointDetector.run(keypoint_detector, &1))
# Get most likely predictions from the output
|> Enum.filter(fn %BBoxWithKeypoints{score: score} -> score > 0.8 end)
|> dbg()
predictions
|> Enum.reduce(image, fn prediction, image_acc ->
# draw keypoints
image_acc =
prediction.keypoints
|> Enum.reduce(image_acc, fn {_key, %{x: x, y: y}}, acc ->
Image.Draw.circle!(acc, x, y, 2, color: :red)
end)
# draw skeleton pose
image_acc =
connections
|> Enum.reduce(image_acc, fn {from, to}, acc ->
%{x: x1, y: y1} = prediction.keypoints[from]
%{x: x2, y: y2} = prediction.keypoints[to]
Image.Draw.line!(acc, x1, y1, x2, y2, color: :red)
end)
# draw bounding box
Image.Draw.rect!(
image_acc,
prediction.x1,
prediction.y1,
BBoxWithKeypoints.width(prediction),
BBoxWithKeypoints.height(prediction),
fill: false,
color: :red,
stroke_width: 2
)
end)
Next steps
After completing this tutorial, you can also check out our next tutorial, which focuses on using models in production in a process workflow, here