# Tightly-Coupled Stereo Visual-Inertial Navigation Using Point and Line Features

^{*}

Previous Article in Journal

College of Mechatronics and Automation, National University of Defense Technology, Changsha 410073, China

Author to whom correspondence should be addressed.

Academic Editor: Vittorio M.N. Passaro

Received: 13 April 2015
/
Accepted: 27 May 2015
/
Published: 1 June 2015

(This article belongs to the Section Physical Sensors)

This paper presents a novel approach for estimating the ego-motion of a vehicle in dynamic and unknown environments using tightly-coupled inertial and visual sensors. To improve the accuracy and robustness, we exploit the combination of point and line features to aid navigation. The mathematical framework is based on trifocal geometry among image triplets, which is simple and unified for point and line features. For the fusion algorithm design, we employ the Extended Kalman Filter (EKF) for error state prediction and covariance propagation, and the Sigma Point Kalman Filter (SPKF) for robust measurement updating in the presence of high nonlinearities. The outdoor and indoor experiments show that the combination of point and line features improves the estimation accuracy and robustness compared to the algorithm using point features alone.

Reliable navigation in dynamic and unknown environments is a key requirement for many applications, particularly for autonomous ground, underwater and air vehicles. The most common sensor modality used to tackle this problem is the Inertial Measurement Unit (IMU). However, inertial navigation systems (INS) are proved to drift over time due to error accumulation [1]. In the last decades, the topic of vision-aided inertial navigation has received considerable attention in the research community, thanks to some important advantages [2,3,4,5,6,7,8,9]. Firstly, the integrated system can operate in environments where GPS is unreliable or unavailable. Secondly, the complementary frequency responses and noise characteristics of vision and inertial sensors address the respective limitations and deficiencies [10]. In particular, fast and highly dynamic motions can be precisely tracked by an IMU in a short time, and thus the problem of scale ambiguity and large latency in vision can be settled to a certain extent. On the other hand, the low-frequency drift in the inertial measurements can be significantly controlled by visual observations. Furthermore, both cameras and IMUs are low cost, light-weight and low power-consumption devices, which make them ideal for many payload-constrained platforms. Corke [10] has presented a comprehensive introduction of these two sensory modalities from a biological and an engineering perspective.

The simplest fusion scheme for a vision-aided inertial navigation system (VINS) uses separate INS and visual blocks, and fuses information in a loosely-coupled approach [10]. For instance, some methods fuse the inertial navigation solution with the relative pose estimation between consecutive image measurements [11,12,13,14]. Tightly-coupled methods in contrast process the raw information of both sensors in a single estimator, thus all the correlations between them are considered, leading to higher accuracy [15,16]. The most common tightly-coupled scheme augments the 3D feature positions in the filter state, and concurrently estimates the motion and structure [2]. However, this method suffers from high computational complexity, as the dimension of the state vector increases with the number of the observed features. To address this problem, Mourikis [15] proposed an EKF-based algorithm which maintains a sliding window of poses in the filter state, and make use of the tracked features to impose constraints on these poses. The shortcomings of this approach are twofold: (1) the space complexity is high, because it needs to store all the tracked features; (2) it requires a reconstruction of the 3D position of the tracked feature points, which are not necessary in navigation tasks. To overcome these shortcomings, Hu [9] developed a sliding window odometry using the monocular camera geometry constraints among three images as measurements, resulting in a tradeoff between accuracy and computational cost.

While the vision-aided inertial navigation has been extensively studied, and a considerable amount of work has also been dedicated to processing visual observations of point features [2,4,5,7], on the contrary, much less work has been aimed at exploring line features. In fact, line primitives and point primitives provide complementary information about the image [17]. There are many scenes (e.g., wall corners, stairwell edges, etc.) where the point primitive matches are unreliable while the line primitives are well matched, due to multi-pixel support [6].

On the other hand, points are crucial as they give more information than lines. For instance, there are no pose constraints imposed by line correspondences from two views, while there are well-known epipolar geometry constraints for point correspondences from two views [18].

In this paper, we propose a method that combines point and line features for navigation aiding in a simple and unified framework. Our algorithm can deal with any mixed combination of point and line correspondences utilizing trifocal geometry across two stereo views. In the implementation, the inertial sensors are tightly-coupled within feature tracking to improve the robustness and tracking speed. Meanwhile, the drifts of inertial sensors are greatly reduced by using the constraints imposed in the tracked features. Leveraging both of the complementary characteristics of the inertial and visual sensors and the complementary characteristics between point and line features, the proposed algorithm demonstrates improved performance and robustness.

We denote scalars in italic lower case letters (e.g., $a$), denote vectors in lower case letters with boldface non-italic (e.g., $p$), and denote matrices in upper case letters with bold font (e.g., $R$). If a vector or matrix describes the relative pose of one reference frame with respect to another, we combine subscript letters to designate the frames, e.g., ${p}_{WI}$ represents the translation vector from the origin of the frame $\left\{W\right\}$ to the origin of the frame $\left\{I\right\}$, and ${R}_{WI}$ represents the direction cosine matrix of frame $\left\{I\right\}$ in the reference frame $\left\{W\right\}$. The six degrees of freedom transform between two reference frames can be represented as a translation followed by a rotation:

$${t}^{W}={p}_{WI}^{W}+{R}_{WI}{t}^{I}$$

In the remaining Sections, unit quaternions are also used to describe the relative orientation of two reference frames, e.g., ${\overline{q}}_{WI}$ represents the orientation of frame $\left\{I\right\}$ in frame $\left\{W\right\}$.

Finally, to represent projective geometry, it is simpler and more symmetric to introduce homogeneous coordinates, which provides a scale invariant representation for point and line in the Euclidean plane. In this paper, vectors in homogeneous coordinate form are expressed by an underline, e.g., $\underset{\_}{m}={\left(\begin{array}{ccc}u& v& w\end{array}\right)}^{T}$ represents the point $m={\left(\begin{array}{cc}u\prime & v\prime \end{array}\right)}^{T}$ in the Euclidean plane, with $u\prime =u/w,v\prime =v/w,w\ne 0.$

The evolving IMU state is described by the vector:
where ${p}_{WI}^{W}\left(t\right)$ denotes the position of IMU in the world frame $\left\{W\right\}$; ${\overline{q}}_{WI}\left(t\right)$ is the unit quaternion of the IMU frame $\left\{I\right\}$ in the world frame; ${v}_{WI}^{W}$ is the linear velocity of the IMU in the world frame; ${b}_{g}\left(t\right)$ and ${b}_{a}\left(t\right)$ are the IMU gyroscope and accelerometer biases, respectively.

$${x}_{IMU}\left(t\right)={\left[\begin{array}{ccccc}{\left({p}_{WI}^{W}\right)}^{T}& {\left({\overline{q}}_{WI}\right)}^{T}& {\left({v}_{WI}^{W}\right)}^{T}& {\left({b}_{g}\left(t\right)\right)}^{T}& {\left({b}_{a}\left(t\right)\right)}^{T}\end{array}\right]}^{T}$$

In this work, we model the biases ${b}_{g}\left(t\right)$ and ${b}_{a}\left(t\right)$ as a Gaussian random walk process, driven by the white, zero-mean noise vectors ${n}_{gw}$ and ${n}_{aw}$, with covariance matrices ${Q}_{gw}$ and ${Q}_{aw}$ respectively. The time evolution of the IMU state is given by the following equation [2]:
where $\mathrm{\Omega}\left({\mathrm{\omega}}_{WI}^{I}\right)$ is the quaternion multiplication matrix:
which relates the time rate of change of the unit quaternion to the angular velocity; ${\mathrm{\omega}}_{WI}^{I}$ is the angular velocity of the IMU with respect to the world frame, and ${a}_{WI}^{W}$ is the acceleration of the IMU with respected to the world frame expressed in the world frame. The measured angular velocity and linear acceleration from are:
where $R\left({\overline{q}}_{WI}\right)$ is the direction cosine matrix corresponding to the unit quaternion ${\overline{q}}_{WI}$, ${n}_{g}$ and ${n}_{a}$ are measurements noises of gyroscope and accelerometer, which are assumed to be zero-mean Gaussian noise with covariance matrices ${Q}_{g}$ and ${Q}_{a}$, respectively. Note that we do not consider the Earth’s rotation rate in the gyroscope measurement, because it is small enough relative to the noise and bias of the low-cost gyroscope.

$${\dot{p}}_{WI}^{W}={v}_{WI}^{W},\text{\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}}{\dot{\overline{q}}}_{WI}=\frac{1}{2}\mathrm{\Omega}\left({\mathrm{\omega}}_{WI}^{I}\right){\overline{q}}_{WI},\text{\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}}{\dot{v}}_{WI}^{W}={a}_{WI}^{W},\text{\hspace{0.17em}\hspace{0.17em}}{\dot{b}}_{g}={n}_{gw},\text{\hspace{0.17em}\hspace{0.17em}\hspace{0.17em}}{\dot{b}}_{a}={n}_{aw}$$

$$\mathrm{\Omega}\left({\mathrm{\omega}}_{WI}^{I}\right)=\left[\begin{array}{cc}0& -{\left({\mathrm{\omega}}_{WI}^{I}\right)}^{T}\\ {\mathrm{\omega}}_{WI}^{I}& -\left[{\mathrm{\omega}}_{WI}^{I}\times \right]\end{array}\right]$$

$${\mathrm{\omega}}_{m}={\mathrm{\omega}}_{WI}^{I}+{b}_{g}+{n}_{g}$$

$${a}_{m}={R}^{T}\left({\overline{q}}_{WI}\right)\left({a}_{WI}^{W}-{g}^{W}\right)+{b}_{a}+{n}_{a}$$

In this Section, we consider the standard perspective camera model, which is commonly used in the computer vision applications. Let $K$ denote the intrinsic camera parameters matrix which can be obtained by calibrating. A mapping between the 3D homogeneous point $\underset{\_}{M}={\left[\begin{array}{cccc}{M}_{1}& {M}_{2}& {M}_{3}& {M}_{4}\end{array}\right]}^{T}$ in space and the homogeneous image pixel coordinates $\underset{\_}{m}={\left[\begin{array}{ccc}u& v& 1\end{array}\right]}^{T}$ can be given by:
where $\propto $ means equality up to scale, and $P=K\cdot \left[R|t\right]$ is 3 × 4 camera matrix, with **R** and $t$ representing pose of the camera with respect to the world reference frame. Similarly, a mapping between a 3-space line represented as a Plücker matrix $L$ and the homogenous image line $l$ is given by [18]:

$$\underset{\_}{m}\propto K\cdot \left[R|t\right]\underset{\_}{M}=P\underset{\_}{M}$$

$$\left[l\times \right]=PL{P}^{T}$$

A trifocal tensor is a 3 × 3 × 3 array of numbers that describes the geometric relations among three views. It depends only on the relative motion between the different views and is independent of scene structures. Assuming that the camera matrices of three views are ${P}_{1}=\left[I|0\right],$ ${P}_{2}=\left[A|{a}_{4}\right],$ ${P}_{3}=\left[B|{b}_{4}\right]$, the entries of the trifocal tensor can be derived accordingly using the standard matrix-vector notation [18]:
where ${a}_{i}$ and ${b}_{i}$ denote the $i$-th column of the camera matrices ${P}_{2}$ and ${P}_{3}$, respectively.

$${T}_{i}={a}_{i}{b}_{4}^{T}-{a}_{4}{b}_{i}^{T}$$

Once the trifocal tensor is computed, we can use of it to map a pair of matched points ${\underset{\_}{m}}_{1}\leftrightarrow {\underset{\_}{m}}_{2}$ in the first and second views into the third view, using the homography between the first view and the third view induced by a line in the second image [18]. As shown in Figure 1a, a line in second view defines a plane in space, and this plane induces a homography between the first view and third view. As recommended by Hartley [18], the line ${\widehat{l}}_{2}$ is chosen as the line perpendicular to the epipolar line. The transfer procedure is summarized as follows [18]:

- (1)
- Compute the epipolar line ${l}_{e}={F}_{21}{\underset{\_}{m}}_{1}$, where ${F}_{21}$ is the fundamental matrix between the first and second views.
- (2)
- Compute the line ${\widehat{l}}_{2}$ which passes through ${\underset{\_}{m}}_{2}$ and is perpendicular to ${l}_{e}$. If ${l}_{e}={\left[\begin{array}{ccc}{l}_{e1}& {l}_{e2}& {l}_{e3}\end{array}\right]}^{T}$ and ${\underset{\_}{m}}_{2}={\left[\begin{array}{ccc}{m}_{21}& {m}_{22}& 1\end{array}\right]}^{T}$, then ${\widehat{l}}_{2}={\left[\begin{array}{ccc}{l}_{e2}& -{l}_{e1}& -{m}_{21}{l}_{e2}+{m}_{22}{l}_{e1}\end{array}\right]}^{T}$.
- (3)
- The transferred point is ${\widehat{\underset{\_}{m}}}_{3}=\left({\displaystyle \sum _{i}{m}_{1i}{T}_{i}^{T}}\right){\widehat{l}}_{2}.$

Similarly, it is possible to transfer a pair of matched lines ${l}_{2}\leftrightarrow {l}_{3}$ in the second and third views into the first view according to the line transfer equation [18]:

$${\widehat{l}}_{1i}={l}_{2}^{T}{T}_{i}{l}_{3}$$

In this Section, we exploit the trifocal geometry of stereo vision to deduce the measurement model. We depict the stereo camera configuration of two consecutive frames in Figure 1b. For the sake of clarity, we only provide the geometrical relations of lines. The camera matrices of the stereo image pair at the previous time step can be represented in canonical form as:
where ${R}_{21}={R}_{0}$ and ${t}_{21}^{2}={t}_{0}$ encode the rigid transform of the rig which are known after calibration. The camera matrices of the successive stereo image pairs are defined as:

$${P}_{1}=\left[I|0\right],{P}_{2}=\left[{R}_{21}|{t}_{21}^{2}\right]$$

$${P}_{3}=\left[{R}_{31}|{t}_{31}^{3}\right],{P}_{4}=\left[{R}_{41}|{t}_{41}^{4}\right]$$

For simplicity, we assume that the IMU frame of reference coincides with the camera frame of reference. Thus, the terms in Equation (12) can be expressed as follows:
where ${R}_{W{I}_{1}}$ and ${p}_{W{I}_{1}}^{W}$ are the pose of IMU corresponding to the last time the image pair captured.

$${R}_{31}={R}_{WI}^{T}{R}_{W{I}_{1}}$$

$${t}_{31}^{3}={R}_{WI}^{T}\left({p}_{W{I}_{1}}^{W}-{p}_{WI}^{W}\right)$$

$${R}_{41}={R}_{43}{R}_{31}={R}_{0}{R}_{WI}^{T}{R}_{W{I}_{1}}$$

$${t}_{41}^{4}={t}_{43}^{4}+{R}_{43}{t}_{31}^{3}={t}_{0}+{R}_{0}{R}_{WI}^{T}\left({p}_{W{I}_{1}}^{W}-{p}_{WI}^{W}\right)$$

Two trifocal tensors, ${T}_{L}=\left\{{T}_{i}^{L}\right\}$ relating the previous image pair to the current left image and ${T}_{R}=\left\{{T}_{i}^{R}\right\}$ relating to the current right image can be determined according to Equation (9) using the camera matrices Equations (11) and (12):

$${T}_{L}=T\left({P}_{1},{P}_{2},{P}_{3}\right)=T\left({R}_{0},{t}_{0},{R}_{WI},{p}_{WI}^{W},{R}_{W{I}_{1}},{p}_{W{I}_{1}}^{W}\right)$$

$${T}_{L}=T\left({P}_{1},{P}_{2},{P}_{4}\right)=T\left({R}_{0},{t}_{0},{R}_{WI},{p}_{WI}^{W},{R}_{W{I}_{1}},{p}_{W{I}_{1}}^{W}\right)$$

From the corresponding point set $\left\{{m}_{1}\leftrightarrow {m}_{2}\leftrightarrow {m}_{3}\leftrightarrow {m}_{4}\right\}$ and the point transfer relations among the triplets, the following non-linear functions can be defined:
where ${h}_{1}(\cdot )$ denotes the pixel differences between the transferred point and the measured point.

$${h}_{1}\left({T}_{L}\left({R}_{0},{t}_{0},{R}_{WI},{p}_{WI}^{W},{R}_{W{I}_{1}},{p}_{W{I}_{1}}^{W}\right),{m}_{1},{m}_{2},{m}_{3}\right)={0}_{2\times 1}$$

$${h}_{1}\left({T}_{R}\left({R}_{0},{t}_{0},{R}_{WI},{p}_{WI}^{W},{R}_{W{I}_{1}},{p}_{W{I}_{1}}^{W}\right),{m}_{1},{m}_{2},{m}_{4}\right)={0}_{2\times 1}$$

For line measurements, we also need a formulation to compare the transferred lines with the measured lines. Because of the aperture problem [19], only the measurement components which are orthogonal to the transferred line can be used for correction. In [3,17], the line-point is chosen as observation, which is defined as the closest point on the line segment to the image origin. Accordingly, the error function is defined as the differences between the measured and transferred line-points, which is similar to the error function of point features. However, when the lines pass through the origin, the orientation error of the lines cannot be revealed by this error function. Thus, we choose the signed distances between the endpoints of the measured line segment to the transferred line as observation. Suppose that ${\underset{\_}{s}}^{a}$ and ${\underset{\_}{s}}^{b}$ are the end points of the line segment measured in the first view. We denote the line transferred from the second and third views by $\widehat{l}={\left({\widehat{l}}_{1},{\widehat{l}}_{2},{\widehat{l}}_{3}\right)}^{T}$. The signed distances between the end points of the measured line segment and the transferred line make up the line observation function:

$${d}_{1}=\left[\begin{array}{c}{\underset{\_}{s}}^{a}\cdot \widehat{l}/\sqrt{{\left({\widehat{l}}_{1}\right)}^{2}+{\left({\widehat{l}}_{2}\right)}^{2}}\\ {\underset{\_}{s}}^{b}\cdot \widehat{l}/\sqrt{{\left({\widehat{l}}_{1}\right)}^{2}+{\left({\widehat{l}}_{2}\right)}^{2}}\end{array}\right]={0}_{2\times 1}$$

Similarly, the line observation function concerning the first, second, and fourth views is defined as:
where $\widehat{l}\prime ={\left({\widehat{l}}_{1}\prime ,{\widehat{l}}_{2}\prime ,{\widehat{l}}_{3}\prime \right)}^{T}$ is the line transferred from the second and fourth views.

$${d}_{2}=\left[\begin{array}{c}{\underset{\_}{s}}^{a}\cdot \widehat{l}\prime /\sqrt{{\left({\widehat{l}}_{1}\prime \right)}^{2}+{\left({\widehat{l}}_{2}\prime \right)}^{2}}\\ {\underset{\_}{s}}^{b}\cdot \widehat{l}\prime /\sqrt{{\left({\widehat{l}}_{1}\prime \right)}^{2}+{\left({\widehat{l}}_{2}\prime \right)}^{2}}\end{array}\right]={0}_{2\times 1}$$

As we process the point and line measurements in a unified manner after defining the corresponding error, we define the observation model in a single function:
where $\left\{{f}_{1},{f}_{2},{f}_{3},{f}_{4}\right\}$ denotes the general feature correspondences among the four views, ${T}_{L}$ and ${T}_{R}$ encode the motion information between the successive stereo image pairs, and the function $h(\cdot )$ defines the observations based on the feature type.

$$z=h\left({T}_{L}\left({R}_{0},{t}_{0},{R}_{WI},{p}_{WI}^{W},{R}_{W{I}_{1}},{p}_{W{I}_{1}}^{W}\right),{T}_{R}\left({R}_{0},{t}_{0},{R}_{WI},{p}_{WI}^{W},{R}_{W{I}_{1}},{p}_{W{I}_{1}}^{W}\right),\left\{{f}_{1},{f}_{2},{f}_{3},{f}_{4}\right\}\right)={0}_{4\times 1}$$

As can be seen in the previous Section, the measurement models are implicit relative-pose measurements, which relate the system state at two different time instants (i.e., the current time and the previous time when image pair is captured). However, the “standard” Kalman filter formulation requires that the measurements employed for the state update be independent of any previous filter states. The problem can be addressed by augment the state vector to include a history of IMU pose when last image pair is recorded. With these state augmentations, the measurements are only related to the current state, and thus, a Kalman filter framework can be applied. The augmented nominal state is given by:
where ${\widehat{x}}_{IMU}^{T}$ is the nominal state of IMU, which can be obtained by integrating Equation (3) without considering the noise term; ${\left({\widehat{p}}_{W{I}_{1}}^{W}\right)}^{T}$ and ${\left({\widehat{\overline{q}}}_{{I}_{1}}^{W}\right)}^{T}$ denotes the nominal-state pose of the IMU at time when the last image pair is recorded. The augmented error state is defined accordingly:
where $\delta {x}_{IMU}$ is the IMU error-state defined as:

$$\widehat{x}={\left[\begin{array}{ccc}{\widehat{x}}_{IMU}^{T}& {\left({\widehat{p}}_{W{I}_{1}}^{W}\right)}^{T}& {\left({\widehat{\overline{q}}}_{{I}_{1}}^{W}\right)}^{T}\end{array}\right]}^{T}$$

$$\delta x={\left[\begin{array}{ccc}\delta {x}_{IMU}& {\left(\delta {p}_{W{I}_{1}}^{W}\right)}^{T}& {\left(\delta {\theta}^{{I}_{1}}\right)}^{T}\end{array}\right]}^{T}$$

$$\delta {x}_{IMU}={\left[\begin{array}{ccccc}{\left(\delta {p}_{WI}^{W}\right)}^{T}& {\left(\delta {\theta}^{I}\right)}^{T}& {\left(\delta {v}_{WI}^{W}\right)}^{T}& \delta {b}_{g}^{T}& \delta {b}_{a}^{T}\end{array}\right]}^{T}$$

The standard additive error definition is used for the position, velocity and biases, while for the orientation error $\delta {\theta}^{I}$, the multiplicative error definition is applied:
where the symbol $\otimes $ denotes quaternion multiplication. With the above error definition, the true-state may be expressed as a suitable composition of the nominal and the error-states:
where $\oplus $ means a generic composition.

$${\overline{q}}_{I}^{W}={\widehat{\overline{q}}}_{I}^{W}\otimes {\left[\begin{array}{cc}1& \frac{1}{2}{\left(\delta {\theta}^{I}\right)}^{T}\end{array}\right]}^{T}$$

$$x=\widehat{x}\oplus \delta x$$

The continuous-time IMU error-state model may be given as a single matrix error equation:
where:

$$\delta {\dot{x}}_{IMU}={F}_{IMU}\delta {x}_{IMU}+{G}_{IMU}{n}_{IMU}$$

$${F}_{IMU}=\left[\begin{array}{ccccc}{0}_{3\times 3}& {I}_{3}& {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}\\ {0}_{3\times 3}& -\lfloor \left({\mathrm{\omega}}_{WI}^{I}\left(t\right)-{b}_{g}\left(t\right)\right)\times \rfloor & {0}_{3\times 3}& -{I}_{3}& {0}_{3\times 3}\\ {0}_{3\times 3}& -R\left({\overline{q}}_{WI}\left(t\right)\right)\lfloor \left({a}_{m}\left(t\right)-{b}_{a}\left(t\right)\right)\times \rfloor & {0}_{3\times 3}& {0}_{3\times 3}& -R\left({\overline{q}}_{WI}\left(t\right)\right)\\ {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}\\ {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}\end{array}\right]$$

$${G}_{IMU}=\left[\begin{array}{cccc}{0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}\\ -{I}_{3}& {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}\\ {0}_{3\times 3}& -R\left({\overline{q}}_{WI}\left(t\right)\right)& {0}_{3\times 3}& {0}_{3\times 3}\\ {0}_{3\times 3}& {0}_{3\times 3}& {I}_{3}& {0}_{3\times 3}\\ {0}_{3\times 3}& {0}_{3\times 3}& {0}_{3\times 3}& {I}_{3}\end{array}\right]$$

$${n}_{IMU}={\left[\begin{array}{cccc}{\left({n}_{g}\right)}^{T}& {\left({n}_{a}\right)}^{T}& {\left({n}_{gw}\right)}^{T}& {\left({n}_{aw}\right)}^{T}\end{array}\right]}^{T}$$

Since the past pose is unchanged during the filter prediction step, its corresponding derivatives are zero:

$${\dot{\widehat{p}}}_{W{I}_{1}}^{W}=0,{\dot{\widehat{\overline{q}}}}_{{I}_{1}}^{W}=0$$

$$\delta {\dot{p}}_{W{I}_{1}}^{W}=0,\delta {\dot{\mathrm{\theta}}}^{{I}_{1}}=0$$

Combining Equations (29) and (34), the continuous-time augmented error state equation is given by:
where:
where ${F}_{IMU}$ and ${G}_{IMU}$ are defined in Equations (30) and (31).

$$\delta \dot{x}={F}_{c}\delta x+{G}_{c}{n}_{IMU}$$

$${F}_{c}=\left[\begin{array}{cc}{F}_{IMU}& {0}_{15\times 6}\\ {0}_{6\times 15}& {0}_{6\times 6}\end{array}\right]$$

$${G}_{c}=\left[\begin{array}{c}{G}_{IMU}\\ {0}_{6\times 6}\end{array}\right]$$

Each time a new IMU measurement is received, the nominal state prediction is performed by numerical integration of the kinematic Equations (3) and (33). In order to obtain the error covariance, we compute the discrete-time state transition matrix:

$${\Phi}_{k}=\Phi \left({t}_{k+1},{t}_{k}\right)=\mathrm{exp}\left({\displaystyle {\int}_{{t}_{k}}^{{t}_{k+1}}{F}_{c}\left(\tau \right)d\tau}\right)$$

The elements of ${\Phi}_{k}$ can be computed analytically following similar derivation as [20]. The state transition matrix is slightly different from [21], because the rates of filter prediction and filter update are different in our case.

The noise covariance matrix ${Q}_{d}$ of the discrete-time system is evaluated by:

$${Q}_{d}={\displaystyle {\int}_{{t}_{k}}^{{t}_{k+1}}\Phi \left({t}_{k+1},\tau \right){G}_{c}{Q}_{c}{G}_{c}^{T}{\Phi}^{T}\left({t}_{k+1},\tau \right)}d\tau $$

The predicted covariance is then obtained as:

$${P}_{k+1|k}={\Phi}_{k}{P}_{k|k}{\Phi}_{k}^{T}+{Q}_{d}$$

Since the measurement model is highly nonlinear, we employ statistical linearization for measurement updating, which is generally more accurate than the first order Taylor series expansion [22]. Specifically, the Sigma Point approach is applied. First, the following sets of sigma points are selected:
where $n=21$ is the dimension of the state, the parameter $\lambda ={\alpha}^{2}\left(n+\kappa \right)-n$ with tuning parameters $\alpha ,\kappa $, ${\left(\sqrt{P}\right)}_{i}$ indicates the $i$th column of the matrix square-root of the covariance matrix $P$. We define the following weights for the sigma points:
where β is related to the higher order moments of the distribution [23] (a good starting guess is $\beta =2$ for Gaussian distribution).

$$\begin{array}{l}{X}^{\left(0\right)}={0}_{27\times 1},\text{\hspace{0.17em}}\\ {X}^{\left(i\right)}={\left(\sqrt{\left(n+\mathrm{\lambda}\right){P}_{k+1|k}}\right)}_{i},\text{\hspace{0.17em}\hspace{0.17em}}i=1,\mathrm{...},n,\\ {X}^{\left(i\right)}=-{\left(\sqrt{\left(n+\mathrm{\lambda}\right){P}_{k+1|k}}\right)}_{i},\text{\hspace{0.17em}\hspace{0.17em}}i\text{\hspace{0.17em}}=\text{\hspace{0.17em}}n+1,\mathrm{...},2n\end{array}$$

$$\begin{array}{l}{W}_{m}^{\left(0\right)}=\frac{\lambda}{\lambda +n},\\ {W}_{c}^{\left(0\right)}=\frac{\lambda}{\lambda +n}+\left(1-{\alpha}^{2}+\beta \right),\\ {W}_{m}^{\left(i\right)}={W}_{c}^{\left(i\right)}=\frac{1}{2\left(\lambda +n\right)},\text{\hspace{0.17em}\hspace{0.17em}}i=1,2,\mathrm{...},2n\end{array}$$

The predicted measurement vector is determined by propagating individual sigma point through the nonlinear observation function $h(\cdot )$ defined in Equation (23):

$${Z}_{j}^{\left(i\right)}=h\left({T}_{L}\left({\widehat{x}}_{k+1}\oplus {X}^{\left(i\right)}\right),{T}_{R}\left({\widehat{x}}_{k+1}\oplus {X}^{\left(i\right)}\right),{\left\{{f}_{1},{f}_{2},{f}_{3},{f}_{4}\right\}}_{j}\right)$$

The mean and covariance are computed as:
where ${P}^{{z}_{j}{z}_{j}}$ and ${P}^{x{z}_{j}}$ are the predicted measurement covariance matrix and the state-measurement cross-covariance matrix, respectively.

$${\widehat{Z}}_{j}={\displaystyle \sum _{i=0}^{i=2L}{W}_{m}^{\left(i\right)}{Z}_{j}^{\left(i\right)}}$$

$${P}^{{z}_{j}{z}_{j}}={\displaystyle \sum _{i=0}^{i=2L}{W}_{c}^{\left(i\right)}\left[{Z}_{j}^{(i)}-{\widehat{Z}}_{j}\right]}{\left[{Z}_{j}^{(i)}-{\widehat{Z}}_{j}\right]}^{T}$$

$${P}^{x{z}_{j}}={\displaystyle \sum _{i=0}^{i=2L}{W}_{c}^{\left(i\right)}\left[{X}^{\left(i\right)}-{0}_{27\times 1}\right]}{\left[{Z}_{j}^{(i)}-{\widehat{Z}}_{j}\right]}^{T}$$

The filter gain is given as follows:
where ${R}_{j}$ is the measurement noise covariance matrix.

$${K}_{j}={P}^{x{z}_{j}}{\left({P}^{{z}_{j}{z}_{j}}+{R}_{j}\right)}^{-1}$$

Then, the error state and error covariance are updated using the normal Kalman filter equation:

$$\delta {x}_{k+1|k+1}=\delta {x}_{k+1|k}+{K}_{j}\left(0-{\widehat{Z}}_{j}\right)$$

$${P}_{k+1|k+1}={P}_{k+1|k}-{K}_{j}{P}^{{z}_{j}{z}_{j}}{K}_{j}^{T}$$

After measurement update, the estimated state $\mathrm{\delta}{x}_{k+1|k+1}$ is then used to correct nominal state ${\widehat{x}}_{k+1}$.

Finally, replace old state by current state and revise the corresponding error covariance:
with:

$${\widehat{x}}_{k+1}={T}_{n}{\widehat{x}}_{k+1},\text{\hspace{0.17em}}\delta {x}_{k+1}={T}_{e}\delta {x}_{k+1},{P}_{k+1|k+1}={T}_{e}{P}_{k+1|k+1}{T}_{e}^{T}$$

$${T}_{n}=\left[\begin{array}{ccc}{I}_{7}& {0}_{7\times 9}& {0}_{7\times 7}\\ {0}_{9\times 7}& {I}_{9}& {0}_{9\times 7}\\ {I}_{7}& {0}_{7\times 9}& {0}_{7\times 7}\end{array}\right],{T}_{e}=\left[\begin{array}{ccc}{I}_{6}& {0}_{6\times 9}& {0}_{6\times 6}\\ {0}_{9\times 6}& {I}_{9}& {0}_{9\times 6}\\ {I}_{6}& {0}_{6\times 9}& {0}_{6\times 6}\end{array}\right]$$

We evaluate the proposed method using the publicly available KITTI Vision Benchmark Suite [24], which provides several multi-sensor datasets with ground truth. The selected dataset was captured in a residential area from the experimental vehicle, equipped with a GPS/IMU localization unit with RTK correction signals (OXTS RT3003), and a stereo rig with two grayscale cameras (PointGrey Flea2). The duration is about 440 s, with a traveling distance of about 3600 m, and the average speed is about 29 km/h. All the sensors are rigidly mounted on top of the vehicle. The intrinsic parameters of the cameras and the transformation between the cameras and GPS/IMU were well calibrated. Moreover, the cameras and GPS/IMU are manually synchronized, with sampling rates of 10 Hz and 100 Hz, respectively.

The announced gyroscope and accelerometer bias specifications are 36 deg/h (1 $\text{\sigma}$) and 1 mg (1 $\text{\sigma}$), respectively. The resolution of stereo images is 1226 × 370 pixels, with 90° field view. For the position ground truth, we use the trajectory of the GPS/IMU output, with open sky localization errors less than 5 cm.

For point features, the fast corner detection (FAST) algorithm [25] was used for feature extraction, and matching was carried out by normalized cross-correlation. The main advantage of the FAST detector compared to others is the better trade-off between accuracy and efficiency. In order to reduce the computational complexity and to guarantee the well distribution of the image features, we choose a subset of the matched point features by means of bucketing [26]: Divide the image into several non-overlapping rectangles, and maintain a maximal number of feature points in each rectangle.

We extract lines using EDlines detector [27] in the scale space, which can give accurate results in real-time. Then we employ the method described in [28] for line matching. The lines are described local appearances by the so-called Line Band Descriptor (LBD) similar to SIFT [29] for point features, and are matched by exploiting the local appearance similarities and geometric consistencies [28]. The average execution time of line matching between views is about 56 ms with Intel Core i5 2.6 GHz processors running the non-optimized C++ code. Figure 2 shows a sample image from the dataset with extracted points and lines. As can be seen, both point and line features are rich in the selected sequence.

In order to reject mismatched features and features located on independently moving objects (e.g., the running car), we employ a chi-square test [30] for the measurement residuals. We compute the Mahalanobis distance:
where $\left(0-{\widehat{Z}}_{j}\right)$ is the measurement residual, and ${P}^{{z}_{j}{z}_{j}}+{R}_{j}$ is the covariance of the measurement residual. The rejection threshold is usually chosen by an empirical evaluation of reliability of feature matching. We set the threshold to 12 in the experiment. The feature measurements whose residuals exceed the threshold are discarded.

$${\upsilon}_{j}={\left(0-{\widehat{Z}}_{j}\right)}^{T}{\left({P}^{{z}_{j}{z}_{j}}+{R}_{j}\right)}^{-1}\left(0-{\widehat{Z}}_{j}\right)$$

In this Section, we compare the performance of our algorithm with the following methods: (1) GPS/IMU localization unit, with open sky localization errors less than 5 cm; (2) VINS using only point feature; (3) pure inertial-only navigation solution; (4) pure stereo visual odometry [31].

The trajectory estimation results of different algorithms with the ground truth data are shown in Figure 3. The corresponding 3D position errors are depicted in Figure 4. The overall root-mean-square errors (RMSE) are shown in Table 1. It can be found that the inertial-only navigation suffers from error accumulation and is not reliable for long-term operation; Secondly, the result of pure stereo visual odometry is inferior, specially where the vehicle turns, and the error grows super-linearly owing to the inherent bias in stereo visual odometry; Thirdly, the combining of inertial navigation and stereo vision with point feature alone can reduce the drift rate effectively, and the additional information from line measurements results in better performance. Note that the jumps from 80 s to 100 s are caused by ground truth errors. It also shows the advantage of the proposed method in cluttered urban environments where the GPS information is less reliable.

Methods | Position RMSE (m) | Orientation RMSE (deg) |
---|---|---|

VINS (points and lines) | 10.6338 | 0.8313 |

VINS (points only) | 16.4150 | 0.9126 |

Pure INS | 2149.9 | 2.0034 |

Pure stereo odometry | 72.6399 | 8.1809 |

We demonstrate the velocity and attitude deviations of the proposed method with the corresponding 3 $\text{\sigma}$ bounds in Figure 5 and Figure 6, which verify that the velocity and attitude estimates are consistent. Note that the standard deviations of the roll and pitch angle errors are bounded, while the standard deviation of the yaw angle error grows over time. This is consistent with the observable property of the VINS system, which indicates that the yaw angle is unobservable [8]. The yaw angle error is bounded under 5° due to the accuracy of the gyroscopes in the experiment.

Finally, the estimates of the gyroscope and accelerometer biases are depicted in Figure 7. All the estimated biases converge quickly to some reasonable ranges, meaning a practical estimation and allowing the compensation of the INS.

To demonstrate the robustness of our algorithm in a textureless structured environment, we perform indoor experiments in a corridor scenario with textureless walls which lead to very few points being tracked in some frames. The test rig consists a PointGrey Bumblebee2 stereo pair, a Xsens MTi unit, and a laptop for data recording (Figure 8a). The accuracy specifications and sampling rates of the sensors are listed in Table 2. The relative pose of the IMU and the camera are well calibrated prior to the experiment using the method proposed in [32], and keep unchanged during the experiment. The actual motion of the pushcart is a move along with the corridor, and then return to the initial point. The full length of the path is about 82 m.

Sensors | Accuracies | Sampling Rates |
---|---|---|

IMU | Gyro bias stability (1 $\sigma $): 1°/s Accelerometer bias stability: 0.02 m/s^{2} | 100 Hz |

Stereo Camera | Resolution: 640 × 480 pixels Focus length: 3.8 mm Field of view: 70° Base line: 12 cm | 12 Hz |

In Figure 8b, we show the bird’s eye view of the estimated trajectories. It is obvious that the combination of point and line features leads to much better performance than the use of point features alone in this scenario. The reason is that the point features are few or not well distributed over the image in some frames, leading to a bad orientation estimation. In Figure 8c, a plot of the number of inlier point and line features per frame is shown, which clearly demonstrates the superiority of combining both feature types under such circumstances.

This paper presents a tightly-coupled vision-aided inertial navigation algorithm, which exploits point and line features to aid navigation in a simple and unified framework. The measurement models of the point and line features are derived, and incorporated into a single estimator. The outdoor experimental results show that the proposed algorithm performs well in cluttered urban environments. The overall RMSE of position and orientation is about 10.6 m and 0.83°, respectively, over a path of up to about 4 km in length. The indoor experiment demonstrates the better performance and robustness of combining both point and line features in textureless structured environments. The proposed approach which combines both feature types can deal with different types of environments with a slight increase in computational cost.

As part of future work, we aim to improve the proposed approach, by taking advantage of the structural regularity of man-made environments, such as Manhattan-world scenes, i.e., scenes that lines should be orthogonal or parallel to each other [33]. Unlike ordinary lines, the Manhattan-world lines encode the global orientation information, which can be used to eliminate the accumulated orientation errors, and further suppress the position drifts.

This work was supported by Research Fund for the Doctoral Program of Higher Education of China (Grant No. 2012.4307.110006) and the New Century Excellent Talents in University of China (Grant No. NCET-07-0225).

Xianglong Kong, Wenqi Wu and Lilian Zhang conceived the idea and designed the experiments; Xianglong Kong and Yujie Wang performed the experiments; Xianglong Kong analyzed the data and wrote the paper.

The authors declare no conflict of interest.

- Titterton, D.H.; Weston, J.L. Srapdown Inertial Navigation Technology, 2nd ed.; The Institution of Electrical Engineers: London, UK, 2004. [Google Scholar]
- Kelly, J.; Sukhatme, G.S. Visual-inertial sensor fusion: Localization, mapping and sensor-to-sensor self-calibration. Int. J. Robot. Res.
**2011**, 30, 56–79. [Google Scholar] [CrossRef] - Feng, G.H.; Wu, W.Q.; Wang, J.L. Observability analysis of a matrix kalman filter-based navigation system using visual/inertial/magnetic sensors. Sensors
**2012**, 12, 8877–8894. [Google Scholar] [CrossRef] [PubMed] - Indelman, V.; Gurfil, P.; Rivlin, E.; Rotstein, H. Real-time vision-aided localization and navigation based on three-view geometry. IEEE Trans. Aerosp. Electron. Syst.
**2012**, 48, 2239–2259. [Google Scholar] [CrossRef] - Weiss, S.; Achtelik, M.W.; Lynen, S.; Chli, M.; Siegwart, R. Real-time onboard visual-inertial state estimation and self-calibration of mavs in unknown environments. In Proceeding of the IEEE International Conference on Robotics and Automation, St. Paul MN, USA, 14–18 May 2012; pp. 957–964.
- Kottas, D.G.; Roumeliotis, S.I. Efficient and consistent vision-aided inertial navigation using line observations. In Proceeding of the IEEE International Conference on Robotics and Automation, Karlsruhe, Germany, 6–10 May 2013; pp. 1540–1547.
- Li, M.Y.; Mourikis, A.I. High-precision, consistent EKF-based visual-inertial odometry. Int. J. Robot. Res.
**2013**, 32, 690–711. [Google Scholar] [CrossRef] - Hesch, J.A.; Kottas, D.G.; Bowman, S.L.; Roumeliotis, S.I. Camera-imu-based localization: Observability analysis and consistency improvement. Int. J. Robot. Res.
**2014**, 33, 182–201. [Google Scholar] [CrossRef] - Hu, J.-S.; Chen, M.-Y. A sliding-window visual-imu odometer based on tri-focal tensor geometry. In Proceeding of the IEEE International Conference on Robotics and Automation, Hong Kong, China, 31 May–7 June 2014; pp. 3963–3968.
- Corke, P.; Lobo, J.; Dias, J. An introduction to inertial and visual sensing. Int. J. Robot. Res.
**2007**, 26, 519–535. [Google Scholar] [CrossRef] - Roumeliotis, S.I.; Johnson, A.E.; Montgomery, J.F. Augmenting inertial navigation with image-based motion estimation. In Proceeding of the IEEE International Conference on Robotics and Automation, Washington, DC, USA, 12–18 May 2002; pp. 4326–4333.
- Diel, D.D.; DeBitetto, P.; Teller, S. Epipolar constraints for vision-aided inertial navigation. In Proceeding of the Seventh IEEE Workshops on Application of Computer Vision, Breckenridge, CO, USA, 5–7 Janury 2005; pp. 221–228.
- Tardif, J.-P.; George, M.; Laverne, M. A new approach to vision-aided inertial navigation. In Proceeding of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, 18–22 October 2010; pp. 4146–4168.
- Sirtkaya, S.; Seymen, B.; Alatan, A.A. Loosely coupled kalman filtering for fusion of visual odometry and inertial navigation. In Proceeding of the 16th International Conference on Information Fusion (FUSION), Istanbul, Turkey, 9–12 July 2013; pp. 219–226.
- Mourikis, A.; Roumeliotis, S.I. A multi-state constraint kalman filter for vision-aided inertial navigation. In Proceeding of the IEEE Inernational Conference in Robotics and Automation, Roma, Italy, 10–14 April 2007; pp. 3565–3572.
- Leutenegger, S.; Furgale, P.T.; Rabaud, V.; Chli, M.; Konolige, K.; Siegwart, R. Keyframe-based visual-inertial slam using nonlinear optimization. In Proceeding of the Robotics: Science and Systems, Berlin, Germany, 24–28 June 2013.
- Zhang, L. Line Primitives and Their Applications in Geometric Computer Vision. Ph.D. Thesis, Kiel University, Kiel, Germany, 15 August 2013. [Google Scholar]
- Hartley, R.; Zisserman, A. Multiple View Geometry in Computer Vision, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Sola, J.; Vidal-Calleja, T.; Civera, J.; Montiel, J.M.M. Impact of landmark parametrization on monocular ekf-slam with points and lines. Int. J. Comput. Vision.
**2012**, 97, 339–368. [Google Scholar] [CrossRef] - Weiss, S.; Siegwart, R. Real-time metric state estimation for modular vision-inertial systems. In Proceeding of the IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, 9–13 May 2011; pp. 4531–4537.
- Ford, T.J.; Hamilton, J. A new positioning filter: Phase smoothing in the position domain. Navigation
**2003**, 50, 65–78. [Google Scholar] [CrossRef] - Van Der Merwe, R. Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic State-Space Models. Ph.D. Thesis, Oregon Health & Science University, Portland, OR, USA, 9 April 2004. [Google Scholar]
- Julier, S.J. The scaled unscented transformation. In Proceeding of the American Control Conference, Anchorage, AK, USA, 8–10 May 2002; pp. 4555–4559.
- Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The kitti vision benchmark suite. In Proceeding of the IEEE Conference on Computer Vision and Pattern Recognition, Rhode Island, Greece, 16–21 June 2012; pp. 3354–3361.
- Rosten, E.; Drummond, T. Machine learning for high-speed corner detection. In Computer Vision–Eccv 2006; Leonardis, A., Bischof, H., Pinz, A., Eds.; Springer-Verlag: Berlin/Heidelberg, Germany, 2006; Volume 3951, pp. 430–443. [Google Scholar]
- Zhang, Z.; Deriche, R.; Faugeras, O.; Luong, Q.-T. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artif. Intell.
**1995**, 78, 87–119. [Google Scholar] [CrossRef] - Akinlar, C.; Topal, C. Edlines: A real-time line segment detector with a false detection control. Pattern Recognit. Lett.
**2011**, 32, 1633–1642. [Google Scholar] [CrossRef] - Zhang, L.; Koch, R. Line matching using appearance similarities and geometric constraints. In Pattern Recognition; Pinz, A., Pock, T., Bischof, H., Leberl, F., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7476, pp. 236–245. [Google Scholar]
- Lowe, D.G. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis.
**2004**, 60, 91–110. [Google Scholar] [CrossRef] - Bar-Shalom, Y.; Li, X.R.; Kirubarajan, T. Estimation with Applications to Tracking and Navigation: Theory Algorithms and Software; John Wiley & Sons: Hoboken, NJ, USA, 2004. [Google Scholar]
- Geiger, A.; Ziegler, J.; Stiller, C. Stereoscan: Dense 3D reconstruction in real-time. In Proceeding of the IEEE Intelligent Vehicles Symposium, Baden-Baden, Germany, 5–9 Junuary 2011; pp. 963–968.
- Furgale, P.; Rehder, J.; Siegwart, R. Unified temporal and spatial calibration for multi-sensor systems. In Proceeding of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–8 November 2013; pp. 1280–1286.
- Coughlan, J.M.; Yuille, A.L. Manhattan world: Compass direction from a single image by bayesian inference. In Proceeding of the IEEE International Conference on Computer Vision, Kerkyra, Greece, 20–27 September 1999; pp. 941–947.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).