
### Stereo Vision: Dense closed-loop depth estimation

This work is based on the theory of feedback control applied to machine vision.

Fig. 1. The camera pose (i.e. robot pose) and scene reconstruction problem.

One of the most common approaches to passive 3D perception, or depth sensing, is stereo vision. Stereo vision exploits the geometry between two perspective cameras imaging the same scene: by analyzing the perspective views in the acquired image pair, 3D visual information can be extracted. Traditionally, stereo vision is implemented with a pair of calibrated cameras separated by a known baseline between their optical centers. Since the geometric relation between the two cameras is known, the 3D world coordinates of object points can be reconstructed, up to a certain accuracy, from their perspective projections in both images. The problem of stereo camera position and orientation (pose) estimation is illustrated in Fig. 1. As a camera $C$ moves through Euclidean space, the distances to the imaged objects, as well as the translation $t_i$ and rotation $R_i$ of $C_i$ with respect to its previous poses, have to be determined from a set of observed feature points $P_j$. The pose of the robotic system follows directly once the pose of the camera has been calculated.
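For a calibrated, rectified stereo pair, the reconstruction above reduces to the classical triangulation $Z = fB/d$, where $f$ is the focal length in pixels, $B$ the baseline, and $d$ the disparity. A minimal sketch of this relation (the focal length, baseline, and disparity values below are illustrative assumptions, not values from the original system):

```python
import numpy as np

def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth Z = f * B / d for a rectified stereo pair.

    disparity_px : disparity in pixels (scalar or array), >= 0
    focal_px     : focal length in pixels
    baseline_m   : distance between the two optical centers, in meters
    """
    d = np.asarray(disparity_px, dtype=float)
    # Zero disparity corresponds to a point at infinity; map it to inf
    # instead of dividing by zero.
    return np.where(d > 0, focal_px * baseline_m / np.maximum(d, 1e-9), np.inf)

# Example: a 10 px disparity with f = 500 px and B = 0.12 m gives Z = 6 m.
z = float(depth_from_disparity(10.0, focal_px=500.0, baseline_m=0.12))
```

Note that depth resolution degrades quadratically with distance, which is one reason the quality of the disparity map is controlled in closed loop later in this work.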

##### Camera Pose Estimation

Fig. 2. Camera pose estimation through correspondence calculation.

For camera pose estimation, the 2D correspondence points $p$ between the left and right stereo images are first calculated. The 2D feature points are extracted with the Harris corner detector, followed by correspondence matching using a traditional cross-correlation similarity measure. Secondly, the 2D feature points are matched between consecutive stereo images, that is, between images acquired at camera poses $C(k)$ and $C(k+1)$. By convention, these matches are calculated for the left camera only. Knowing the 3D positions of the 2D points matched between adjacent images, the pose of the camera can be calculated through a Perspective-n-Point (PnP) algorithm, as illustrated in Fig. 2. By solving the PnP problem, the rotation matrix $R$ and translation vector $t$ that relate the camera poses are obtained:

$C(k+1) = R \cdot C(k) + t$

In order to solve the PnP problem, a minimum of 7 correspondence points between $C(k)$ and $C(k+1)$ has to be calculated.
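The pose update above is a plain rigid-body transform; in practice $R$ and $t$ come from a PnP solver given the 2D-3D matches. A minimal numpy sketch of the update equation itself (the rotation and translation values below are made-up test data, not results from the original system):

```python
import numpy as np

def update_pose(C_k, R, t):
    """Propagate the camera position one motion step: C(k+1) = R @ C(k) + t."""
    return R @ C_k + t

# Example: rotate 90 degrees about the z-axis, then translate along x.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([1.0, 0.0, 0.0])

# The point (1, 0, 0) rotates to (0, 1, 0), then shifts by t to (1, 1, 0).
C_next = update_pose(np.array([1.0, 0.0, 0.0]), R, t)
```

Chaining such transforms over consecutive frames accumulates the camera trajectory, which is why robust correspondence matching matters: pose errors compound from step to step.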

##### Dense closed-loop depth estimation

Once the pose of the camera has been determined, the 3D structure of the imaged scene can be reconstructed with respect to the position and orientation of the camera, i.e. with respect to the robot's pose.

Using a Block Matching (BM) disparity map computation method, followed by a depth segmentation procedure, the problem of controlling the quality of the disparity map $I_d$ is converted into the problem of controlling the quality of the segmented image $I_{th}$. A region-segmented image is said to be of good quality if it contains all pixels of the objects of interest, forming a "full" (unbroken) and well-shaped segmented object region. Bearing in mind this qualitative definition, the following quantitative measure of segmentation quality has been used:

$i_m = -\log_2 p_8, \quad i_m(0) = 0$

where $p_8$ is the relative frequency, that is, the estimate of the probability of a segmented pixel to be surrounded with 8 segmented pixels in its 8-pixel neighborhood:

$p_8 = \frac{\text{no. of seg. pixels surrounded by 8 seg. pixels}}{\text{total no. of seg. pixels in the image}}$

Keeping in mind that a well segmented image contains a "full" (without holes) segmented object region, it is evident from the above equation that a small probability $p_8$ corresponds to a large disorder in a binary segmented image. In this case, a large uncertainty $i_m$ is assigned to the segmented image. Therefore, the goal is to achieve a binary image having an uncertainty measure $i_m$ as small as possible in order to get a reliable depth segmentation result.

The depth estimation system was modeled as a nonlinear system with the following variables:

$\mathbf{x} = \begin{bmatrix} i_m & q_r \end{bmatrix}^\text{T}, \quad \mathbf{y} = I_d(x, y), \quad \mathbf{u} = q_r(i_m, T_q)$

where $q_r$ is the BM's uniqueness ratio function and $T_q$ is the predefined uniqueness threshold value.

Fig. 3. The uncertainty measure $i_m$ of segmented pixels vs. uniqueness threshold $T_q$.

In Fig. 3, the input-output (I/O) relation between the state variable $i_m$ and the actuator parameter $T_q$ is displayed for an example case. The goal of the proposed extremum seeking control system is to determine the optimal value $T_q^*$ which corresponds to the minimum of the curve in Fig. 3. $T_q^*$ represents the desired value of the uniqueness threshold. The block diagram of the proposed depth sensing system is illustrated in Fig. 4.
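The search for $T_q^*$ can be illustrated with a simple one-dimensional minimization over candidate thresholds. The sketch below is a plain grid search over a synthetic convex $i_m(T_q)$ curve standing in for Fig. 3; the actual system uses a closed-loop extremum seeking scheme, and the curve and its minimum at $T_q = 12$ are made-up illustration values:

```python
import numpy as np

def seek_optimal_threshold(im_of, tq_grid):
    """Pick the T_q minimizing the measured uncertainty i_m over a grid.

    im_of   : callable T_q -> i_m (in the real system, i_m is computed
              from the BM disparity map segmented at that threshold)
    tq_grid : candidate uniqueness-threshold values
    """
    scores = [im_of(tq) for tq in tq_grid]
    return tq_grid[int(np.argmin(scores))]

# Synthetic stand-in for the measured i_m(T_q) curve: a convex bowl
# with its minimum at T_q = 12 (purely illustrative).
synthetic_im = lambda tq: 0.05 * (tq - 12.0) ** 2 + 1.0
tq_star = seek_optimal_threshold(synthetic_im, np.arange(0, 25))
```

In the closed-loop system, this evaluation is not done offline: the controller perturbs $T_q$, measures the resulting $i_m$ on each frame, and drives $T_q$ toward the minimum as the scene changes.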

Fig. 4. Block diagram of the proposed feedback control system for robust depth estimation.

##### References

S.M. Grigorescu, G. Macesanu, T.T. Cocias, D. Puiu and F. Moldoveanu, "Robust Camera Pose and Scene Structure Analysis for Service Robotics", Robotics and Autonomous Systems, Elsevier, vol. 59, no. 11, pp. 889-909, 2011.

S.M. Grigorescu and F. Moldoveanu, "Controlling Depth Estimation for Robust Robotic Perception", Proceedings of the 18th IFAC World Congress, Milano, Italy, August 28 - September 2, 2011.