**Abstract**

Fixational eye movements are essential for seeing with the retina, which responds only to changes in incident light. Moreover, since the rotation of the eyeball causes a translational movement of the crystalline lens, it is possible in principle to recover the depth of an object from the image motion obtained in this way. We have proposed two types of depth restoration methods based on fixational tremor: a differential-type method and an integral-type method. The former is based on the change in image brightness between frames, and the latter is based on image blurring due to motion. In this chapter, we introduce both methods and explain the simulations and experiments performed to verify their operation.

**Keywords:** motion stereo, fixational eye movements, differential-type method, integral-type method, optical flow, gradient equation, image blur

### **1. Introduction**

When humans stare at a target, irregular involuntary movements called fixational eye movements occur [1]. The human retina maintains its reception sensitivity through the fine vibration of the target's image on the retina, so fixational movement is a prerequisite for seeing anything at all. It has been reported that these vibrations may serve not only this intrinsic function of preserving photosensitivity but also as an aid to image analysis, a mechanism that can be interpreted as an instance of stochastic resonance (SR) [1]. SR is inspired by biology, more specifically by neuron dynamics [2], and based on it, the Dynamic Retina (DR) [3] and the Resonant Retina (RR) [4], new vision devices that take advantage of random camera vibrations, were proposed for contrast enhancement and edge detection, respectively. It has also been reported that the movement of the retinal image due to fixational eye movements can be an unconscious cue for depth perception, and an actual vision system based on fixational eye movements has been proposed [5].

#### *Applications of Pattern Recognition*

On the other hand, binocular stereopsis is robust and plays an essential role in the depth perception of the human vision system [6]. In general, binocular stereopsis detects relatively large disparities and can therefore recover depth with high accuracy. However, large disparities aggravate the occlusion problem, for which many solutions have been proposed. Wang et al. proposed a local detector for occlusion based on deep learning [7]. In [8], a robust depth restoration method was proposed that integrates light-field imaging technology, which simultaneously observes views from multiple angles, with stereo vision. We therefore expect that the primitive depth information detected by fixational eye movements can be used to resolve occlusion in binocular stereopsis. There is a concern that the accuracy of depth restoration from a small camera motion is lower than that of stereo vision. Even so, it is expected that erroneous correspondences caused by occlusion can be reduced by using the depth information from fixational eye movements in the correspondence problem of stereo vision.

*Stereoscopic Calculation Model Based on Fixational Eye Movements*
*DOI: http://dx.doi.org/10.5772/intechopen.97404*

In monocular stereoscopic vision, "structure from motion" (SFM) has been studied most widely, and many remarkable results have been reported. SFM builds on various calculation principles. To achieve spatially dense depth recovery with high computational efficiency, a method based on the gradient equation, which expresses the constraint between the spatiotemporal derivatives of image intensity and the motion on the image, is effective [9–11]. It should be noted that for such a gradient method there is an appropriate size of movement for recovering the correct depth. The gradient equation holds exactly only for small motions, so the equation error cannot be ignored for very large motions. Conversely, for very small movements, the motion information is buried in the observation error of the spatiotemporal intensity derivatives.

Adaptation of the frame rate is required to make the motion size suitable for the gradient method. We have proposed a method that avoids a variable frame rate by using a multi-resolution decomposition of images, but it incurs a high computational cost [12]. Therefore, we focus on small movements, with an emphasis on avoiding equation errors in the gradient method. To solve the signal-to-noise ratio (S/N) problem that arises with small movements, many observations are collected and used all at once [13, 14]. In such a strategy, it is desirable that the direction and size of the motion take various values. Based on this discussion, we examined a depth perception model that combines fixational eye movements with the gradient method. Fixational eye movements are divided into three types: microsaccades, drifts, and tremors. As the first report of our attempt, we focus on tremor, the smallest of the three. In the next step, we plan to use drift and microsaccade analogies for further progress. Using many images captured with small random camera motions, consisting of three-dimensional (3-D) rotations that imitate fixational eyeball motions [1], many observations are available at each pixel; that is, many gradient equations can be used to recover the depth value corresponding to each pixel. Since the offset between the center of the 3-D rotation and the lens center generates a translational motion of the lens center, depth information can be obtained from these images. Simulations with artificial images confirm that the proposed method works effectively when the observed noise is an actual sample of a theoretically defined noise model.

However, if the wavelength of the dominant luminance pattern is small compared with the size of the motion in the image, aliasing occurs and the gradient equation becomes useless; in other words, the methods of [13, 14] cannot be applied. To avoid this problem, we proposed a new scheme based on an integral form that also uses the analogy of fixational eye movements [15, 16]. Adding up the many images generated by the above motions yields one blurred image. The degree of blur is a function of pixel position and also depends on the depth value at each pixel; that is, the difference in the degree of image blur carries the depth information. Based on the proposed scheme, the spatial distribution of the image blur is estimated effectively using the blurred image and the original image without blur. By modeling the small 3-D rotation of the camera as a Gaussian random variable, the depth map can be calculated analytically from this blur distribution.

Several depth recovery methods using motion blur have already been proposed, but they use the blur caused by definite and simple camera motions. For example, blur caused by a translational camera motion is used in [17], and blur caused by an unconstrained camera motion composed of translation and rotation is assumed in [18]. The depth recovery performance of these methods depends on the orientation of the texture in the image: if the texture has a stripe pattern parallel to the direction of motion in the image, there is little blurring, and accurate depth recovery is difficult. Unlike such specific camera motions, the random camera rotation used in this study works well for any texture. A deterministic camera motion could also be designed to solve this problem, but random camera rotation does not require precise control of the camera and is therefore easy to implement in a real system.

We proposed two algorithms based on the integral scheme. The first detects a point spread function (PSF) that represents the image blur and then analyzes it to restore depth [15]. The second calculates the depth directly, without detecting the PSF [16]. These algorithms use a motion-blurred image and an un-blurred reference image. The performance of the proposed scheme is expected to depend on the degree of motion blur. For the same PSF, i.e. a fixed deviation of the random camera rotations, fine texture is advantageous for observing the blur accurately. This characteristic is the opposite of that of the differential scheme based on the gradient equation.

The features of our methods described above can be summarized as follows.

1. In a camera motion model that simulates eyeball rotation, the translation of the lens center is generated secondarily by the eye rotation, which reduces the number of unknown parameters. This camera motion model also makes it possible to recover depth as an absolute quantity instead of a relative quantity.

2. The movement of the camera is subtle because it simulates the tremor component of the eyeball. Therefore, by using a large number of image pairs, it is possible to improve the accuracy of depth recovery while avoiding occlusion.

3. In order to use multiple image pairs at the same time, we have adopted a direct framework that explicitly uses the inverse depth, which is a common parameter for them.
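The forward effect exploited by the integral scheme can be illustrated with a small numerical sketch. The following Python fragment is ours, not the authors' implementation, and every value in it (*Z*_0, *σ_r*, the texture, the inverse depths) is an illustrative assumption: it averages many randomly shifted copies of a 1-D texture, where the shift is the tremor-induced translation *Z*_0 *r* *d*, and shows that a nearer surface (larger inverse depth *d*) loses more contrast.

```python
import numpy as np

def blur_by_tremor(f, d, Z0=10.0, sigma_r=0.5, M=500, rng=None):
    """Average M copies of texture f, each shifted by the tremor-induced
    translation Z0 * r * d with r ~ N(0, sigma_r^2). Larger inverse depth
    d produces larger shifts and hence stronger blur."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.arange(f.size, dtype=float)
    acc = np.zeros_like(f)
    for _ in range(M):
        r = rng.normal(0.0, sigma_r)            # tremor-like random rotation
        acc += np.interp(x, x + Z0 * r * d, f)  # resample the shifted image
    return acc / M

f = np.sin(np.linspace(0, 32 * np.pi, 256))  # sharp fine-textured reference
near = blur_by_tremor(f, d=0.5)    # large inverse depth -> bigger shifts
far = blur_by_tremor(f, d=0.05)    # small inverse depth -> smaller shifts
print(f.var(), far.var(), near.var())  # contrast drops as blur increases
```

In the chapter's method, this depth-dependent blur is inverted analytically using the Gaussian rotation model; the sketch only demonstrates the forward effect that makes the inversion possible.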


In the following, we first explain the camera model and the imaging system, and then explain the differential scheme and the integral scheme and the algorithms based on each. Owing to page limitations, only the direct method is explained for the integral type. The function and characteristics of each algorithm are shown as results of computer simulations, with an emphasis on quantitative comparison with the ground truth. Finally, one of the algorithms of the differential scheme is applied to a real image, and the result is explained.

### **2. Camera motions imitating tremor and projection model**

In this research, we use a perspective projection system as the camera imaging model. The camera is fixed in the (*X*, *Y*, *Z*) coordinate system. The lens center corresponding to the viewpoint is at the origin *O*, and the optical axis is along the *Z* axis. By taking the focal length as the unit of geometric representation, we can assume the image plane *Z* = 1 without loss of generality. The point (*X*, *Y*, *Z*)^T on an object in 3-D space is projected onto the image point *x* ≡ (*x*, *y*, 1)^T = (*X*/*Z*, *Y*/*Z*, 1)^T.
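This projection can be stated as a one-line function. The sketch below is our illustration, not part of the chapter, with the focal length normalized to 1 as in the text:

```python
import numpy as np

def project(points):
    """Perspective projection with focal length 1: (X, Y, Z) -> (X/Z, Y/Z, 1)."""
    points = np.asarray(points, dtype=float)
    return points / points[:, 2:3]  # divide each row by its Z component

# A point at depth Z = 4 projects to x = X/Z = 0.5, y = Y/Z = 0.25.
pts = np.array([[2.0, 1.0, 4.0],
                [0.5, -0.5, 1.0]])
print(project(pts))
```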

We briefly describe the camera motion model that mimics the tremor component of fixational eye movements, proposed in our previous study [12]. Following the analogy of the human eyeball, the center of camera rotation is set behind the lens center by *Z*_0 along the optical axis, and there is no explicit translational motion of the camera. The rotation vector *r* = (*r_x*, *r_y*, *r_z*)^T also expresses, with the same components, a rotation vector centered on the coordinate origin, which is the lens center. On the other hand, the difference between the coordinate origin and the center of rotation results in a translation vector *u* = (*u_x*, *u_y*, *u_z*)^T, which is formulated as follows:

$$
\begin{bmatrix} u_x \\ u_y \\ u_z \end{bmatrix} = \begin{bmatrix} r_x \\ r_y \\ r_z \end{bmatrix} \times \begin{bmatrix} 0 \\ 0 \\ Z_0 \end{bmatrix} = Z_0 \begin{bmatrix} r_y \\ -r_x \\ 0 \end{bmatrix}. \tag{1}
$$
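Eq. (1) can be checked numerically as a cross product. In this sketch of ours, the rotation vector and *Z*_0 are arbitrary illustrative values:

```python
import numpy as np

# Hypothetical values: rotation vector r and rotation-center offset Z0.
r = np.array([0.01, -0.02, 0.005])   # (r_x, r_y, r_z), radians
Z0 = 10.0                            # offset of rotation center behind the lens

u = np.cross(r, np.array([0.0, 0.0, Z0]))          # middle term of Eq. (1)
u_closed_form = Z0 * np.array([r[1], -r[0], 0.0])  # right-hand side of Eq. (1)

assert np.allclose(u, u_closed_form)
print(u)  # r_z contributes nothing: u = Z0 * (r_y, -r_x, 0)
```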

In general, the translational motion of the camera lens is essential to recover depth, and our camera motion model achieves that translation implicitly, simply by rotating the camera. This simplifies camera control. In addition, the system can recover absolute depth by pre-calibrating *Z*_0. The coordinate system and camera motion model used in this study are shown in **Figure 1**.

**Figure 1.** *Coordinate system and camera motion model used in this study.*

From Eq. (1), it can be seen that *r_z* causes no translation. Therefore, we can set *r_z* = 0 and redefine *r* = (*r_x*, *r_y*, 0)^T as a rotation vector like that of an eyeball. Using Eq. (1) and the inverse depth *d*(*x*, *y*) = 1/*Z*(*x*, *y*), the image motion called "optical flow," **v** = (*v_x*, *v_y*)^T, is given as follows:

$$
v_x = xy\,r_x - (1 + x^2)\,r_y - Z_0 r_y d, \tag{2}
$$

$$
v_y = (1 + y^2)\,r_x - xy\,r_y + Z_0 r_x d. \tag{3}
$$

In the above equations, *d* is an unknown variable at each pixel, whereas *u* and *r* are unknown parameters common to the whole image.

In this study, we treat *r*(*t*) as a white stochastic process to simplify the motion model, where *t* indicates time. *r*(*t*) is defined as the rotation speed with respect to the camera orientation at *t* = 0, not the derivative between consecutive frames. The temporal correlation of tremor that forms the drift component in actual fixational eye movement is thus ignored. We assume that each fluctuation of *r*(*t*) follows a two-dimensional (2-D) Gaussian distribution with a mean of **0** and a variance–covariance matrix of *σ_r*^2 *I* with an identity matrix *I*:

$$
p\left(r(t) \mid \sigma_r^2\right) = \frac{1}{\left(\sqrt{2\pi}\,\sigma_r\right)^2} \exp\left\{-\frac{r(t)^T r(t)}{2\sigma_r^2}\right\}. \tag{4}
$$

In this study, *r* is defined as the rotational speed for ease of theoretical analysis. In a real system, we have no choice but to use finite rotations, but if the rotation angle is small, the formulations remain almost valid.

### **3. Differential-type method**

**3.1 Gradient equation for rigid motion**

The general gradient equation is the first-order approximation of the assumption that image brightness is invariant before and after the relative 3-D motion between a camera and an object. Assuming that the brightness values before and after the 3-D motion are equal, the image brightness after the motion is expanded in a Taylor series, and terms of degree 2 and above are ignored. As a result, at each pixel (*x*, *y*), the gradient equation is formulated with the partial derivatives *f_x*, *f_y* and *f_t*, where *t* denotes time, of the image brightness *f*(*x*, *y*, *t*) and the optical flow, as follows:

$$
f_t = -f_x v_x - f_y v_y. \tag{5}
$$

By substituting Eqs. (2) and (3) into Eq. (5), the gradient equation representing a rigid motion constraint can be derived explicitly:

$$
f_t = -\left(f_x v^r_x + f_y v^r_y\right) - \left(-f_x r_y + f_y r_x\right) Z_0 d, \tag{6}
$$

where *v^r_x* and *v^r_y* denote the depth-independent rotational components of the optical flow in Eqs. (2) and (3).

**3.2 Probabilistic model for differential-type method**

Let *M* be the number of pairs of two frames and *N* be the number of pixels. In our study, {*f_t*^(*i*,*j*)} (*i* = 1, ⋯, *N*; *j* = 1, ⋯, *M*) and {*r*^(*j*)} (*j* = 1, ⋯, *M*) are treated as random variables, and {*d*^(*i*)} (*i* = 1, ⋯, *N*), corresponding to the inverse depth of each pixel, is treated as a stochastic variable and recovered individually for each pixel. Multiple frames {*r*^(*j*)} that vibrate due to irregular rotation are used for processing, but no pixel tracking is done. Therefore, the recovered *d*^(*i*) at each pixel does not correspond exactly to the value of this pixel, but takes the mean of the adjacent region defined by the vibration width of the image. As a result, the recovered *d*^(*i*) correlates with the values in the adjacent regions, and hence, from the beginning, *d*^(*i*) should be treated as a variable with such a correlation. Based on tremor, *d*^(*i*), which is estimated to correlate with its neighborhood, is planned to be improved to *d* for each pixel when dealing with drift and microsaccade in future research.

In this study, we assume that the optical flow is very small, and hence the observation errors of *f_t*, *f_x* and *f_y*, which are calculated by finite differences, are small. Additionally, the equation error is also small, and therefore we can assume that error having
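The joint use of many gradient equations at a single pixel can be sketched numerically. The Python fragment below is our illustration, not the authors' implementation: for simplicity it assumes the rotations *r*^(*j*) are known and solves Eq. (6) for the inverse depth *d* by least squares, whereas the chapter's probabilistic model treats the rotations as random variables; all numerical values (position, gradients, *Z*_0, noise level) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-pixel setup: image position, spatial gradients, geometry.
x, y = 0.2, -0.1          # normalized image coordinates (focal length = 1)
fx, fy = 1.3, -0.7        # spatial derivatives of brightness at this pixel
Z0, d_true = 10.0, 0.25   # rotation-center offset and true inverse depth
M, sigma_r = 200, 0.01    # number of frame pairs, tremor standard deviation

def rot_flow(rx, ry):
    """Rotational (depth-independent) flow components from Eqs. (2) and (3)."""
    vxr = x * y * rx - (1 + x**2) * ry
    vyr = (1 + y**2) * rx - x * y * ry
    return vxr, vyr

r = rng.normal(0.0, sigma_r, size=(M, 2))       # tremor rotations (r_x, r_y)
vxr, vyr = rot_flow(r[:, 0], r[:, 1])
a = -(-fx * r[:, 1] + fy * r[:, 0]) * Z0        # coefficient of d in Eq. (6)
ft = -(fx * vxr + fy * vyr) + a * d_true        # noiseless f_t from Eq. (6)
ft += rng.normal(0.0, 1e-4, size=M)             # small observation noise

# Each frame pair yields one gradient equation; solve them jointly for d.
b = ft + (fx * vxr + fy * vyr)
d_est = np.sum(a * b) / np.sum(a * a)
print(d_true, round(d_est, 3))  # the estimate should be close to d_true
```

Because each individual equation is dominated by noise, the accuracy comes from pooling many observations with rotations of varying direction and size, which is exactly the strategy motivated in the Introduction.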
