A calibrated distance prediction method from image sensor to human face

I. Introduction

The distance from a human face to the camera is a value with many applications in computer vision. It can be used, for example, to build a more user-friendly UI. In a system operated through human interaction, commands are difficult to issue unless the person is captured by the camera fully and clearly, at a distance that allows the system to detect the person in the image. To estimate the distance from the subject to the camera, the problem of detecting and tracking the object must be solved first. In this study, the tracked subject is the face of a person standing or moving back and forth in front of the camera lens.

There are many face detection methods. Haar-like Adaboost (HA) finds faces by combining four components: Haar-like features, an integral image for computing those features, Adaptive Boosting, and cascade filters that speed up classification [1]. This method gives fast results but is easily affected by ambient light and is only suitable for frontal faces [2,3]. The Histogram of Oriented Gradients (HOG) method is less affected by ambient light but performs poorly when part of the face is covered [4]. The Deformable Part Models (DPM) approach, a form of hidden Markov model, has also obtained excellent performance, but it has a high computational cost, especially in the training phase [5]. Convolutional Neural Networks (CNNs) are an effective deep learning model used in many problems, such as face detection and recognition, video analysis, and MRI image analysis, and most basic CNNs solve problems of this type well [6-8]. The MTCNN network was developed from the CNN [9]. In this study, we apply the stacked (cascaded) convolutional neural network MTCNN to detect the region of the image containing the face. MTCNN is one of the most popular and accurate face detection tools available today. It consists of three neural networks connected in a cascade and is a modern deep learning model that can identify faces at many different angles, even in low light or when part of the face is blurred or hidden.

Distance estimation by image processing through face detection has also been studied before [12,13]. These studies calculate the focal length of the camera from an initial face image taken at a known distance. However, accuracy drops when a different person with a different face appears. Therefore, in this paper we propose an additional calibration method to increase accuracy when the person in the image changes.
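The focal-length approach mentioned above can be sketched with the similar-triangles (pinhole camera) relation. This is only a minimal illustration, not the algorithm of this paper or of [12,13]: the function names, the reference distance, and the assumed real face height of 20 cm are all hypothetical values chosen for the example.

```python
def focal_length_px(face_height_px, known_distance_cm, real_face_height_cm):
    """Calibrate the focal length (in pixels) from one reference image."""
    return face_height_px * known_distance_cm / real_face_height_cm


def distance_cm(face_height_px, focal_px, real_face_height_cm):
    """Estimate the face-to-camera distance from the face height in pixels."""
    return real_face_height_cm * focal_px / face_height_px


# Reference shot (assumed values): the face is 200 px tall at 50 cm,
# and the real face height is taken to be 20 cm.
f = focal_length_px(200, 50.0, 20.0)

# Later frame: the same face now appears 100 px tall.
d = distance_cm(100, f, 20.0)
print(f, d)  # 500.0 100.0
```

In this sketch the estimate is proportional to the assumed real face height, so a person whose face is larger or smaller than the calibration face shifts every estimate by the same ratio. This is one way to see why accuracy drops when the person changes.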

The article is presented in 5 parts. After the introduction, part 2 describes how the face detected by the MTCNN is enclosed in a square bounding box and how the center point of the top edge of that box is found. This point and the length of the side edge, which is taken as the height of the face in the image, are the two values fed into the algorithm that predicts the distance from the person's face to the camera. Part 3 presents the distance prediction algorithm along with a calibration method that increases accuracy whenever the person in the image changes. Part 4 gives experimental results that evaluate the accuracy of the program, and part 5 gives the conclusions.

II. Camera image sensor

An image sensor is a device that converts the light it receives from an object into an electrical signal. An image sensor system includes the following components:

- Dedicated light source: Provides light for the sensor, ensuring that the device has enough light to record the clearest possible images, which makes the sensor's image analysis easier.

- Lens: Brings the image to the image processing chip.

- Image processing chip, CCD (short for Charge Coupled Device) or CMOS (short for Complementary Metal Oxide Semiconductor): Converts optical signals into analog signals.

- Analog/digital converter: Converts the analog signal into a digital signal for subsequent processing by the software.

- Microprocessors: Analyze and process the digital signals of the image, then rely on preset parameters to make decisions.

- Input-Output: Provides communication channels to communicate with other devices.

- Other connected control and peripheral devices.


The operation of the image sensor is divided into 3 stages: collecting information, analyzing information, and outputting the results. For the sensor to work smoothly, each component must also meet its requirements. The light source needs to ensure contrast and highlight the details of the object to be analyzed. The lens must focus the light onto the image processing chip. The shorter the focal length of the lens, the wider the observed area, and vice versa. The lens is therefore an important part, because it affects the image of the target. Finally, the image sensor performs the collection and analysis steps.
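The relationship between focal length and the observed area can be illustrated with the standard pinhole field-of-view formula, FOV = 2·atan(sensor width / (2·focal length)). The snippet below is a small sketch; the sensor width of 4.8 mm and the two focal lengths are illustrative assumptions, not values taken from this article.

```python
import math


def horizontal_fov_deg(sensor_width_mm, focal_length_mm):
    """Horizontal field of view of an ideal pinhole camera, in degrees."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))


# Assumed example values: the same 4.8 mm wide sensor behind two lenses.
wide = horizontal_fov_deg(4.8, 2.8)    # short focal length
narrow = horizontal_fov_deg(4.8, 8.0)  # long focal length
print(wide > narrow)  # True: the shorter lens sees a wider area
```

For a fixed sensor, a wider field of view means that the same object occupies fewer pixels in the image.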

Today, most CCTV cameras use one of two types of sensors: CCD or CMOS. Both perform the same task of converting light into an electrical signal. CCD, the older technology, is still in use today and offers good dynamic range and noise control, but it is more difficult to install and consumes more power, which has made CMOS the most popular choice today. CMOS arrived after CCD and was once considered a sensor with lower image quality. With the advent of new technology, however, CMOS has gradually closed the gap and has even surpassed CCD.

A CCD consists of a checkerboard-like grid of light-sensitive points (pixels). These points are covered with color filter layers, usually one of the 3 primary colors red, green, and blue (RGB), so that each point captures only one color.

CMOS is different: next to each light-sensitive point on the chip there is an auxiliary circuit, so image processing functions such as analog/digital conversion and white balance can be integrated into this circuit. This makes image processing very fast, because it is performed at each pixel.

III. Detecting faces in images based on Stacked Convolutional Neural Network

In a real environment, many people will often appear in front of the camera lens, which leads to many faces in the image. In addition, the regions containing faces will be of different sizes. Therefore, a method is required to find all the image regions containing faces together with their dimensions. MTCNN resizes the original image to create a series of copies of different sizes, from large to small, forming the structure depicted in Figure 2, called an image pyramid [14].
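A minimal sketch of how the scale factors of such an image pyramid could be generated is shown below. The function name is hypothetical, and the minimum face size of 20 pixels and the shrink factor of 0.709 are defaults commonly seen in public MTCNN implementations; they are assumptions here, not values stated in this article.

```python
def pyramid_scales(height, width, min_face_size=20, factor=0.709, window=12):
    """Return scale factors, from large to small, for an image pyramid.

    Scaling stops when the shorter image side would drop below the
    12x12 window, since no window would fit into a smaller copy.
    """
    scales = []
    scale = window / min_face_size   # maps the smallest face onto the window
    min_side = min(height, width) * scale
    while min_side >= window:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


scales = pyramid_scales(480, 640)
sizes = [(round(480 * s), round(640 * s)) for s in scales]
print(len(scales), sizes[0], sizes[-1])
```

Each entry in `sizes` is the height and width of one resized copy of a 480x640 image. The actual resizing would be done with an image library, for example OpenCV.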

For each copy of the original image, a 12x12-pixel kernel with stride = 2 scans the entire image to find faces. Thanks to the different image sizes in the pyramid, the network can recognize faces of different sizes even though the kernel has a fixed size. Next, the kernels cut out in this way are passed to the P-Net (Proposal Network). The result, as shown in Figure 3, is a set of bounding boxes located in each kernel. Each bounding box has four corner coordinates that locate it within the kernel containing it (normalized to the range (0,1)) and a corresponding confidence score.
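The scan with a 12x12 kernel and stride 2 can be sketched by listing the top-left corner of every window position. The helper name is hypothetical, and this explicit enumeration is only for illustration: implementations typically obtain the same effect by running P-Net as a fully convolutional network over the whole resized image.

```python
def window_origins(height, width, window=12, stride=2):
    """Top-left corners (x, y) of every window that fits inside the image."""
    return [(x, y)
            for y in range(0, height - window + 1, stride)
            for x in range(0, width - window + 1, stride)]


origins = window_origins(24, 24)
print(len(origins), origins[0], origins[-1])  # 49 (0, 0) (12, 12)
```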

To remove redundant kernels and bounding boxes, we use two main methods: NMS (Non-Maximum Suppression), which deletes boxes whose overlap ratio (Intersection over Union) exceeds a certain threshold, and a confidence threshold, which removes low-confidence boxes. Figure 4 illustrates NMS: among similar boxes, only the one with the highest confidence level is kept and the others are discarded.
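The two filtering steps can be sketched as a confidence cut followed by greedy NMS. The function names are hypothetical, the thresholds of 0.6 and 0.5 are illustrative assumptions, and boxes are assumed to be given as (x1, y1, x2, y2) corner coordinates.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_boxes(boxes, scores, score_threshold=0.6, iou_threshold=0.5):
    """Drop low-confidence boxes, then apply greedy NMS.

    Returns the indices of the kept boxes, highest confidence first.
    """
    order = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                   key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(10, 10, 50, 50), (12, 12, 52, 52),
         (100, 100, 140, 140), (200, 200, 240, 240)]
scores = [0.90, 0.80, 0.95, 0.30]
kept = filter_boxes(boxes, scores)
print(kept)  # [2, 0]
```

In this example, box 1 is suppressed because it overlaps heavily with the more confident box 0, and box 3 is removed by the confidence threshold.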

After the unsuitable boxes have been found and deleted, the next step is to convert the coordinates of the remaining boxes to the coordinates of the original image. Because the coordinates of each box are normalized to the interval (0,1) relative to its kernel, we calculate the length and width of the kernel in the original image, multiply the normalized coordinates of the box by the size of the kernel, and then add the coordinates of the corresponding kernel corner. The output is the coordinates of the corresponding box in the original image. Finally, we resize the boxes to a square shape, take their new coordinates, and feed them into the next network, the R network.
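The coordinate conversion and the squaring step can be sketched as follows. The function names and argument layout are hypothetical, and it is assumed that each kernel is identified by its top-left corner in the resized copy together with the scale factor of that copy.

```python
def to_original(norm_box, window_xy, scale, window=12):
    """Map a box normalized to its kernel into original-image pixels."""
    size = window / scale                # kernel side length in the original image
    ox = window_xy[0] / scale            # kernel corner in the original image
    oy = window_xy[1] / scale
    nx1, ny1, nx2, ny2 = norm_box
    return (ox + nx1 * size, oy + ny1 * size,
            ox + nx2 * size, oy + ny2 * size)


def to_square(box):
    """Grow the shorter side so the box becomes a square with the same center."""
    x1, y1, x2, y2 = box
    side = max(x2 - x1, y2 - y1)
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    return (cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2)


# Assumed example: a kernel at (6, 4) in a copy resized by a factor of 0.5.
box = to_original((0.25, 0.0, 0.75, 1.0), window_xy=(6, 4), scale=0.5)
square = to_square(box)
print(box)     # (18.0, 8.0, 30.0, 32.0)
print(square)  # (12.0, 8.0, 36.0, 32.0)
```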

Next, in the R network (Refine Network), shown in Figure 5, the steps are the same as in the P network. However, the R network also uses padding: when a bounding box extends beyond the boundary of the image, the missing parts of the box are filled with empty pixels (zero-pixels). All the bounding boxes are then resized to 24x24, treated as kernels, and fed to the R network. The result is again the new coordinates of the remaining boxes, which are fed to the next network, the O network.
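The padding step can be sketched on a tiny grayscale image stored as a list of rows. The helper name is hypothetical, and the subsequent resize to 24x24 is left out because it would normally be done with an image library.

```python
def crop_with_zero_padding(image, box):
    """Crop box (x1, y1, x2, y2) from a 2-D image.

    Pixels of the box that fall outside the image are filled with 0.
    """
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    return [[image[y][x] if 0 <= y < height and 0 <= x < width else 0
             for x in range(x1, x2)]
            for y in range(y1, y2)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

# The box extends one pixel past the right and bottom borders.
patch = crop_with_zero_padding(image, (1, 1, 4, 4))
print(patch)  # [[5, 6, 0], [8, 9, 0], [0, 0, 0]]
```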

Finally, in the O network (Output Network), shown in Figure 6, the bounding boxes are resized to 48x48. The output now consists of 3 values: the 4 coordinates of the bounding box (out[0]); the coordinates of 5 landmark points on the face, shown in Figure 7, namely the nose, the 2 eyes, and the 2 corners of the mouth (out[1]); and the confidence score of each box (out[2]). All of these are saved into a dictionary with the 3 keys mentioned above.

After determining the coordinates of the final bounding box enclosing the face, a necessary step is to determine the size of the bounding box, namely its width and height, which is used to predict the distance in the next part.

As shown in Figure 8, the bounding box enclosing the face has coordinates (x,y) and dimensions h*w, where h is the height and w is the width (in pixels) of the image region. We use the Python language and the OpenCV programming library for algorithm testing, so these values can be extracted from the detector object of the MTCNN library package [15].
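Extracting these values could look like the sketch below. It assumes that [15] is the widely used `mtcnn` Python package, whose `MTCNN().detect_faces(image)` call returns one dictionary per face with the keys `box`, `confidence`, and `keypoints`. The detection shown is hand-written with made-up numbers so that the example runs without a camera or the library, and `face_geometry` is a hypothetical helper.

```python
# Hypothetical detection, written in the format assumed above.
detection = {
    "box": [120, 80, 90, 110],   # x, y, width, height in pixels
    "confidence": 0.99,
    "keypoints": {
        "left_eye": (145, 120), "right_eye": (185, 121), "nose": (165, 145),
        "mouth_left": (148, 168), "mouth_right": (182, 169),
    },
}


def face_geometry(detection):
    """Return the box position, its size, and the center of its top edge."""
    x, y, w, h = detection["box"]
    top_center = (x + w / 2, y)
    return {"x": x, "y": y, "w": w, "h": h, "top_center": top_center}


geometry = face_geometry(detection)
print(geometry["h"], geometry["top_center"])  # 110 (165.0, 80)
```

The height `h` and the top-edge center point are the two values that the introduction names as inputs to the distance prediction algorithm.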

IV. Conclusions


A method for predicting the distance from the image sensor to a human face using the similarity of two triangles has been described in this paper. The algorithm requires calibration and has proven quite effective in determining distances. During testing and measurement, we also found that the accuracy of the algorithm increases with calibration. This remains a challenge and an issue that needs further research.

M.Sc. Trần Lê Thăng Đồng


