Under the Hood
Let's see how YOLO.detect/3 works.
Load Yolo11n Model
Loads the Yolo11n model using the model_path and classes_path options. Optionally, specify model_impl, which defaults to YOLO.Models.Ultralytics.
model = YOLO.load([
  model_path: "models/yolo11n.onnx",
  classes_path: "models/yolo11n_classes.json"
])
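For instance, the default implementation can be spelled out explicitly (a minimal sketch based on the options above; the extra key only matters when you want a non-default implementation):

model = YOLO.load([
  model_path: "models/yolo11n.onnx",
  classes_path: "models/yolo11n_classes.json",
  # default implementation, shown here only for illustration
  model_impl: YOLO.Models.Ultralytics
])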
Preprocessing
mat = Evision.imread(image_path)
{input_tensor, scaling_config} = YOLO.Models.Ultralytics.preprocess(model, mat, [frame_scaler: YOLO.FrameScalers.EvisionScaler])
Before running object detection, the input image needs to be preprocessed to match the model's expected input format. The preprocessing steps are:
Resize and Pad Image to 640x640
- The image is resized while preserving aspect ratio to fit within 640x640 pixels
- Any remaining space is padded with gray color (value 114) to reach exactly 640x640
- This is handled by the FrameScaler behaviour and its implementations
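As an illustration, here is the letterbox arithmetic for a hypothetical 1920x1080 frame (a sketch of the idea only; the FrameScaler implementations do this for you):

{width, height} = {1920, 1080}            # hypothetical source size
scale = min(640 / width, 640 / height)    # preserve aspect ratio
{new_w, new_h} = {round(width * scale), round(height * scale)}
pad_x = 640 - new_w                       # leftover space, filled with gray (114)
pad_y = 640 - new_h
# => scale ≈ 0.333, resized to 640x360, 280 rows of padding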
Convert to Normalized Tensor
- The image is converted to an Nx tensor with shape {1, 3, 640, 640}
- Pixel values are normalized from 0-255 to the 0.0-1.0 range
- The channels are reordered from RGB to the model's expected format (BGR in this case)
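In Nx terms, the conversion looks roughly like this (an illustrative sketch assuming an already resized 640x640 image tensor image_nx; not the library's exact code):

input_tensor =
  image_nx                          # {640, 640, 3}, u8 values 0..255
  |> Nx.as_type(:f32)
  |> Nx.divide(255.0)               # normalize to 0.0..1.0
  |> Nx.transpose(axes: [2, 0, 1])  # HWC -> CHW
  |> Nx.reverse(axes: [0])          # RGB -> BGR channel order
  |> Nx.new_axis(0)                 # add the batch dimension -> {1, 3, 640, 640}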
The FrameScaler behaviour provides a consistent interface for handling different image formats:
- EvisionScaler - For OpenCV Mat images from Evision
- ImageScaler - For images using the Image library
- NxIdentityScaler - For ready-to-use Nx tensors
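For example, if your frame is already an Nx tensor, preprocessing could be called with a different scaler (module name assumed to live under YOLO.FrameScalers, like EvisionScaler above):

{input_tensor, scaling_config} =
  YOLO.Models.Ultralytics.preprocess(model, image_nx, [frame_scaler: YOLO.FrameScalers.NxIdentityScaler])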
Run Object Detection
Then run the detection by passing the model and the image tensor input_tensor.
# input_tensor {1, 3, 640, 640}
output_tensor = YOLO.Models.run(model, input_tensor)
# output_tensor {1, 84, 8400}
You can also adjust the detection thresholds (iou_threshold and prob_threshold, which default to 0.45 and 0.25 respectively) using the third argument.
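For example, the defaults mentioned above could be made explicit in an opts keyword list (this is the opts value consumed by the postprocessing call below):

opts = [iou_threshold: 0.45, prob_threshold: 0.25]  # the defaults, shown for illustration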
Postprocessing
result_rows = YOLO.Models.Ultralytics.postprocess(model, output_tensor, scaling_config, opts)
where result_rows is a list of lists; each inner list represents a detected object with 6 elements:
[
  [cx, cy, w, h, prob, class_idx],
  ...
]
The model's raw output needs to be post-processed to extract meaningful detections. For Yolo11n, the output_tensor has shape {1, 84, 8400} where:
- 84 represents 4 bbox coordinates + 80 class probabilities
- 8400 represents the number of candidate detections
The postprocessing steps are:
Filter Low Probability Detections
- Each detection has probabilities for all classes
- Only keep detections where the max class probability exceeds prob_threshold (default 0.25)
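Conceptually, treating a single hypothetical candidate as a plain list of 4 coordinates followed by 80 class scores, the filter looks like this (an illustrative sketch; the real implementation works on tensors):

[_cx, _cy, _w, _h | class_probs] = candidate

{prob, class_idx} =
  class_probs
  |> Enum.with_index()
  |> Enum.max_by(fn {p, _idx} -> p end)

keep? = prob > 0.25  # prob_threshold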
Non-Maximum Suppression (NMS)
- Remove overlapping boxes for the same object
- For each class, compare boxes using Intersection over Union (IoU)
- If IoU > iou_threshold (default 0.45), keep only the highest probability box
- This prevents multiple detections of the same object
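IoU itself is just the overlap area divided by the combined area. A small illustrative helper for boxes in {cx, cy, w, h} form (not part of the library's API):

defmodule IoUExample do
  def iou({cx1, cy1, w1, h1}, {cx2, cy2, w2, h2}) do
    # convert center/size to left/top/right/bottom corners
    {l1, t1, r1, b1} = {cx1 - w1 / 2, cy1 - h1 / 2, cx1 + w1 / 2, cy1 + h1 / 2}
    {l2, t2, r2, b2} = {cx2 - w2 / 2, cy2 - h2 / 2, cx2 + w2 / 2, cy2 + h2 / 2}

    inter_w = max(0.0, min(r1, r2) - max(l1, l2))
    inter_h = max(0.0, min(b1, b2) - max(t1, t2))
    intersection = inter_w * inter_h
    union = w1 * h1 + w2 * h2 - intersection

    if union == 0, do: 0.0, else: intersection / union
  end
end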
Scale Coordinates
- The detected coordinates are based on the model's 640x640 input
- Use the scaling_config from preprocessing to map back to the original image size
- This accounts for any resizing/padding done during preprocessing
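Continuing the hypothetical 1920x1080 example from the preprocessing section, the inverse mapping for a detected box's cx, cy, w, h roughly amounts to removing the padding offset and dividing by the scale factor (a sketch that assumes centered padding; scaling_config encapsulates these details for you):

scale = 640 / 1920                          # same scale used during preprocessing
{pad_left, pad_top} = {0, (640 - 360) / 2}  # assumed centered padding
original_cx = (cx - pad_left) / scale
original_cy = (cy - pad_top) / scale
{original_w, original_h} = {w / scale, h / scale}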
Convert Detections to Structured Maps
Finally, convert the raw detection results into structured maps containing bounding box coordinates, class labels, and probabilities:
iex> YOLO.to_detected_objects(result_rows, model.classes)
[
  %{
    class: "person",
    prob: 0.57,
    bbox: %{h: 126, w: 70, cx: 700, cy: 570},
    class_idx: 0
  },
  ...
]
Render results on the image
To visualize the detection results on the image, we can use the KinoYOLO.Draw.draw_detected_objects/2 function. This function takes an Image and a list of detected objects, and returns a new image with bounding boxes and labels drawn on it.
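A minimal usage sketch, assuming the original image is loaded with the Image library (Image.open!/1) and detected_objects is the list returned by YOLO.to_detected_objects/2 above:

detected_objects = YOLO.to_detected_objects(result_rows, model.classes)

image_path
|> Image.open!()
|> KinoYOLO.Draw.draw_detected_objects(detected_objects)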