How I Created My First Ever Object Detection Application

Sohum Padhye
8 min read · May 24, 2021

Object detection will eventually be ingrained into our lives.

In a couple of years, object detection will be around every corner: inside security cameras, in your phones and laptops, and in cars.

Wondering what that future might look like, I tried object detection on my own, with YOLO.

My Journey

When I joined TKS, I was exposed to a lot of different technologies. However, there was only one that caught my eye and made me think to myself, “Hey, I want to learn more about this,” and that was artificial intelligence.

Being someone living in the 21st century, I looked it up on the internet. As I searched around, I found a branch of artificial intelligence called machine learning. Machine learning is essentially an algorithm learning from large amounts of data.

Machine learning models create their own rules: they build rules that help them generalize from the data they're given, and they keep adjusting those rules to fit whatever new data comes in.

I want to go more into machine learning, but I will probably get too carried away. I’m thinking that maybe I should explain things somewhere else. Tell me if you want me to make an article specifically on machine learning!

Anyways, let's go back to what we were talking about. Among the various applications of machine learning, computer vision was the one that interested me the most. Computer vision includes tasks like image classification and image localization, which come together in object detection.

Image classification is figuring out what each image is. If you have an image of a dog, an image classification algorithm will tell you that that’s a dog. Next, we have image localization. This essentially just tells you where an object is, but it can’t necessarily tell you what that object is.

Looking at what those two can do, you'll realize they're not very useful without working together. That's where object detection comes in. Object detection is a combination of the two: it can point out different objects and figure out what they are. And with that, we have our subset of computer vision.

What Is YOLO?

That was a mouthful! But it still wasn't the last part. Don't fear, though; this is the last one. After object detection, we have our final piece: YOLO.

Oh, wait. You probably thought that was You Only Live Once. It’s okay, though, because you were pretty close. It’s You Only Look Once.

“Alright, great. I know all of these things. What now?”

That might’ve seemed extremely boring and pointless, but it’s really good to know the entire system when you’re looking for anything relating to artificial intelligence.

Now, let’s look at how YOLO is different from other object detection models.

For example, an R-CNN object detection algorithm uses regions to localize objects within the image. The network doesn't look at the complete image, but rather at the regions with a higher probability of containing an object.

YOLO is different from R-CNN. YOLO uses a single CNN to predict the bounding boxes and the class probabilities for those boxes, and it passes over the image only once. That lets YOLO run at impressive speeds, higher than 45 frames per second.

The astounding speed of the algorithm makes up for its loss in precision. More and more YOLO versions are being built, each getting faster and better.

Here is a graph the creators of YOLO made comparing their program to others. As you can see, YOLOv4 took a massive step from YOLOv3.

Now, I’ll show you step-by-step how this awesome technology works.

How YOLO Works

This process can be split up into three techniques:

  • Residual Blocks
  • Bounding Box Regression
  • Intersection Over Union

Residual Blocks

This step splits the image into an S × S grid. Each grid cell is then responsible for detecting an object whose center falls inside it.
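As a tiny sketch of that idea, here is some hypothetical Python that assigns an object's normalized center to the grid cell responsible for it (the helper name and the example numbers are my own illustration, not from YOLO's code):

```python
S = 7  # grid size; the original YOLO paper used a 7 x 7 grid

def responsible_cell(cx, cy, S=7):
    """Return the (row, col) of the grid cell that contains a
    normalized object center (cx, cy), each in [0, 1)."""
    return int(cy * S), int(cx * S)

# An object centered at (0.5, 0.32) lands in row 2, column 3.
print(responsible_cell(0.5, 0.32))  # (2, 3)
```

Only the cell that "owns" the object's center is responsible for predicting it, which is what keeps the grid trick cheap.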

Bounding Box Regression

A bounding box is a rectangle that surrounds an object in the image, highlighting it. YOLO uses a single bounding box regression to predict each box's width, height, and center, along with the class of the object inside it.

Every bounding box has the following features (note that here b with a letter following it refers to properties of the box):

  • Width (bw)
  • Height (bh)
  • Class (the object that it represents; for example, a bird, car, phone, etc.) (c)
  • Bounding Box Center (bx, by)
  • Probability of Class (the probability that the predicted class is true) (pc)

Here, YOLO represents each bounding box as a vector of the attributes above:

y = (pc, bx, by, bh, bw, c)

The model predicts this vector to figure out how the box should be placed. In this case, y is the bounding box.
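To make that concrete, here is a toy Python version of such a vector (all of the numbers are made up purely for illustration):

```python
pc = 0.9           # probability that the predicted class is correct
bx, by = 0.5, 0.4  # center of the box
bh, bw = 0.3, 0.2  # height and width of the box
c = 2              # class index (say, 2 stands for "car")

# The bounding box, in the same order as the formula above.
y = (pc, bx, by, bh, bw, c)
print(y)  # (0.9, 0.5, 0.4, 0.3, 0.2, 2)
```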

Intersection Over Union

To understand intersection over union, we have to understand what intersection and union mean.

Take these two boxes, for example.

This image shows the real bounding box and the predicted bounding box

Here, the intersection would be the part where the two overlap, or in other words, it would be the common parts.

The intersection section is shaded red

Then, there is the union. This is all the area or space they take up entirely.

The union section is shaded green

Now, let's visualize a bit. As the two boxes get closer, the intersection grows and the union shrinks, until the two areas are almost the same.

This is where the formula comes in:

IoU = Area of Intersection / Area of Union

The goal of the YOLO object detection model is to get the IOU as close to 1 as possible. If it is 1, that means the predicted bounding box perfectly aligns with the true bounding box of the object.

This image shows different IOU values

This simple yet extremely useful technique allows us to test our model on different objects and observe its performance while also helping the model get better at these kinds of predictions.
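The whole idea fits in a few lines of Python. Here is a minimal sketch of an IoU function for boxes given as (x1, y1, x2, y2) corner coordinates (my own helper, not taken from any YOLO implementation):

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping region.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give no intersection.
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # 1.0 — a perfect match
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```

You can see the two extremes right away: identical boxes score 1, and boxes that don't touch score 0.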

How I Learned To Create An Object Detection Model

Finally, the moment you’ve all been waiting for is here. How I actually pulled this off.

I wasn't able to complete this alone, though, and my journey to finding a really good tutorial wasn't easy. It took a long time until I found someone named Adrian Rosebrock at PyImageSearch University.

Adrian was a professor who mostly focused on actually implementing code rather than trying to learn all the complex mathematics behind it. What I’ve gone through in this article barely scratched the surface. But anyway, here is the link I used for the object detection tutorial in case you want to try this yourself (which you definitely will).

The tutorial I linked works on images, but remember: this is YOLO we’re working with. What does that mean? This program works with videos, too!

To do this, you should have a decent understanding of Python and how it works. I’ve been learning Python for about 2 years now, and when I first looked at it, I realized I had so much more to learn! For this, I recommend going to any coding site or online tutorial you wish.

Next, I suggest you look at Adrian’s guide to using argparse.

After that, it’s your choice if you want to stay with me or look at Adrian’s article! I will try to explain how some of the key points work, but he will explain it in more detail so I advise you to check out his article first then come back here if you want to.

(There are two programs: one for images and one for videos.)

import numpy as np
import argparse
import time
import cv2
import os

Here, we import different libraries. NumPy is very good at handling the arrays that represent images. argparse lets us pass different arguments on the command line. time will show us how long the program took to run. cv2 is OpenCV, which gives us access to many different image commands. os talks to our operating system, so we can build paths to different files.

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-y", "--yolo", required=True,
	help="base path to YOLO directory")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")
ap.add_argument("-t", "--threshold", type=float, default=0.3,
	help="threshold when applying non-maxima suppression")
args = vars(ap.parse_args())

Here, we create the different command-line arguments used to run our program. In each ap.add_argument() call, the first argument is the short form of the flag and the second is the long form; the required keyword says whether the flag must be provided, and help is the message displayed when a user asks for help. Note that this is from the image program, so the video version may differ slightly.
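If you want to see how these flags behave, here is a stripped-down sketch (only two of the four flags, and we pass an explicit list to parse_args() instead of reading the real command line, so it runs anywhere):

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-c", "--confidence", type=float, default=0.5,
	help="minimum probability to filter weak detections")

# Simulate running: python detect.py --image dog.jpg
args = vars(ap.parse_args(["--image", "dog.jpg"]))
print(args["image"])       # dog.jpg
print(args["confidence"])  # 0.5 (the default kicks in)
```

Because --confidence wasn't given, argparse quietly fills in the default of 0.5, which is exactly what happens when you run the real program without that flag.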

color = [int(c) for c in COLORS[classIDs[i]]]
cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
text = "{}: {:.4f}".format(LABELS[classIDs[i]], confidences[i])
cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX,
	0.5, color, 2)

This is creating the bounding box around each object in the image program. This is different from the video program since the video program has to do it for each frame.

Thanks for reading through the whole article!

Conclusion

Dictionary

  • CNN = Convolutional Neural Network
  • R-CNN = Region-based Convolutional Neural Network
  • Bounding box = an imaginary rectangle that surrounds an object (class) in the image
  • Class probability = the model's confidence that a detected object belongs to a given class

Takeaways

I learned a lot about object detection as a whole while digging into YOLO. And YOLO is just the beginning when we talk about real-world applications for object detection. Companies like Tesla and OpenAI are already harnessing this powerful technology to do good in the world. While looking at different methods for object detection, I learned how ordinary convolutional neural networks work, and that has given me a base understanding to pursue other kinds.

Sources

Here are some of the sources I used to make this article and learn about object detection.

I thank you for taking the time to read this article and I hope you enjoyed it!

If you have any questions or would like to talk to me, my email is sohum.padhye@gmail.com. Happy coding!
