**1. Introduction**

Video coding and its related applications have advanced quite substantially in recent years. Major coding standards such as MPEG [1] and H.26x [2] are well developed and widely deployed. These standards are developed mainly for applications such as DVDs where the compressed video is played over many times by the consumer. Since compression only needs to be performed once while decompression (playback) is performed many times, it is desirable that the decoding/decompression process can be done as simply and quickly as possible. Therefore, essentially all current video compression schemes, such as the various MPEG standards as well as H.264 [1, 2] involve a complex encoder and a simple decoder. The exploitation of spatial and temporal redundancies for data compression at the encoder causes the encoding process to be typically 5 to 10 times more complex computationally than the decoder [3]. In order that video encoding can be performed in real time at frame rates of 30 frames per second or more, the encoding process has to be performed by specially designed hardware, thus increasing the cost of cameras.

In the past ten years, we have seen substantial research and development of large sensor networks where a large number of sensors are deployed. For some applications such as video surveillance and sports broadcasting, these sensors are in fact video cameras. For such systems, there is a need to re-evaluate conventional strategies for video coding. If the encoders are made simpler, then the cost of a system involving tens or hundreds of cameras can be substantially reduced in comparison with deploying current camera systems. Typically, data from these cameras can be sent to a single decoder and aggregated. Since some of the scenes captured may be correlated, computational gain can potentially be achieved by decoding these scenes together rather than separately. . Decoding can be simple reconstruction of the video frames or it can be combined with detection algorithms specific to the application at hand. Thus there are benefits in combing reduced complexity cameras with flexible decoding processes to deliver modern applications which are not anticipated when the various video coding standards are developed.

Recently, a new theory called Compressed Sensing (CS) [4, 5, 6] has been developed which provides a completely new approach to data acquisition. In essence, CS tells us that for signals which possess some "sparsity" properties, the sampling rate required to reconstruct these signals with good fidelity can be much lower than the lower bound specified by Shannon's sampling theorem. Since video signals contain substantial amounts of redundancy, they are sparse signals and CS can potentially be applied. The simplicity of the encoding process is traded off against a more complex, iterative decoding process. The reconstruction process of CS is usually formulated as an optimization problem, which potentially allows one to tailor the objective function and constraints to the specific application. Even though practical cameras that make use of CS are still in their very early days, the concept can be applied to video coding. A lower sampling rate implies less energy required for data processing, leading to lower power requirements for the camera. Furthermore, the complexity of the encoder can be reduced further by making use of distributed source coding [21, 22]. The distributed approach provides ways to encode video frames without exploiting any redundancy or correlation between the frames captured by the camera. The combined use of CS and distributed source coding can therefore serve as the basis for the development of camera systems where the encoder is less complex than the decoder.

We shall first provide a brief introduction to Compressed Sensing in the next section. This is followed by a review of current research in video coding using CS.

#### **2. Compressed sensing**

Shannon's uniform sampling theorem [7, 8] provides a lower bound on the rate at which an analog signal needs to be sampled in order that the samples fully represent the original. If a signal $f(t)$ contains no frequencies higher than $\omega_{max}$ radians per second, then it is completely determined by samples spaced $T \le \pi/\omega_{max}$ seconds apart, and $f(t)$ can be reconstructed perfectly from these samples $f(kT)$ by

$$f(t) = \sum\_{k \in \mathbb{Z}} f(kT)\, \mathrm{sinc}\left(t/T - k\right) \tag{1.1}$$
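Equation (1.1) can be checked numerically. A small sketch with numpy (the test signal and window length are illustrative; the infinite sum is truncated, so the approximation only holds away from the window edges):

```python
import numpy as np

# Band-limited test signal: a 3 Hz cosine, sampled at 10 Hz (above Nyquist).
T = 0.1                                # sampling interval, T < pi / omega_max
k = np.arange(0, 2001)                 # finite window of sample indices
f = lambda t: np.cos(2 * np.pi * 3 * t)

def reconstruct(t):
    # Truncated form of (1.1); np.sinc(x) = sin(pi*x) / (pi*x).
    return np.sum(f(k * T) * np.sinc(t / T - k))

# Evaluate far from the truncation edges, where the tail error is small.
t0 = 100.05 * T
print(abs(reconstruct(t0) - f(t0)))    # small truncation error
```

The slow $1/k$ decay of the sinc tails is why the sum must be truncated far from the evaluation point.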

The uniform samples $f(kT)$ of $f(t)$ may be interpreted as coefficients of basis functions obtained by shifting and scaling of the sinc function. For high-bandwidth signals such as video, the amount of data generated by sampling at least twice the bandwidth is very high. Fortunately, most of the raw data can be thrown away with almost no perceptual loss. This is the result of lossy compression techniques based on orthogonal transforms. In image and video compression, the discrete cosine transform (DCT) and the wavelet transform have been found to be the most useful. The standard procedure goes as follows. The orthogonal transform is applied to the raw image data, giving a set of transform coefficients. Those coefficients whose magnitudes are smaller than a certain threshold are discarded. Only the remaining significant coefficients, typically a small subset of the original, are encoded, reducing the amount of data that represents the image. This means that if there were a way to acquire only the significant transform coefficients directly by sampling, then the sampling rate could be much lower than that required by Shannon's theorem.
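The threshold-and-keep procedure just described can be sketched as follows (the test signal and the 10% keep ratio are illustrative choices, not from the chapter):

```python
import numpy as np
from scipy.fft import dct, idct

t = np.linspace(0, 1, 256)
signal = np.cos(2 * np.pi * 2 * t) + 0.5 * np.cos(2 * np.pi * 5 * t)

# Forward orthonormal DCT: smooth signals concentrate energy in few coefficients.
coeffs = dct(signal, norm='ortho')

# Discard (zero out) all but the 10% largest-magnitude coefficients.
keep = int(0.1 * len(coeffs))
threshold = np.sort(np.abs(coeffs))[-keep]
compressed = np.where(np.abs(coeffs) >= threshold, coeffs, 0.0)

# Inverse transform: reconstruction from only the significant coefficients.
approx = idct(compressed, norm='ortho')
rel_error = np.linalg.norm(approx - signal) / np.linalg.norm(signal)
print(rel_error)   # small: most of the energy lives in a few DCT coefficients
```

Note that this is transform *coding*: the full signal must still be acquired before the transform is applied. CS, described next, removes that requirement.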

Emmanuel Candès, together with Justin Romberg and Terence Tao, developed the theory of Compressed Sensing (CS) [9], which can be applied to signals, such as audio, images and video, that are *sparse* in some domain. This theory provides a way, at least theoretically, to acquire signals at a rate potentially much lower than the Nyquist rate given by Shannon's sampling theorem. CS has already inspired more than a thousand papers from 2006 to 2010 [9].

#### **2.1 Key Elements of compressed sensing**

Compressed Sensing [4-6, 10] is applicable to signals that are sparse in some domain. Sparsity is a general concept expressing the idea that the information rate, or the significant content of a signal, may be much smaller than its bandwidth suggests. Most natural signals are redundant and therefore compressible in some suitable domain. We shall first define the two principles, sparsity and incoherence, on which the theory of CS depends.

#### **2.1.1 Sparsity**



Sparsity is important in Compressed Sensing as it determines how efficiently one can acquire signals non-adaptively. The most common definition of sparsity used in compressed sensing is as follows. Let $f \in \mathbb{R}^n$ be a vector representing a signal which can be expanded in an orthonormal basis $\Psi = [\psi_1 \ \psi_2 \ \cdots \ \psi_n]$ as

$$f = \sum\_{i=1}^{n} x\_i \psi\_i \tag{1.2}$$

Here, the coefficients $x_i = \langle f, \psi_i \rangle$. In matrix form, (1.2) becomes

$$f = \Psi \mathbf{x} \tag{1.3}$$

When all but a few of the coefficients $x_i$ are zero, we say that $f$ is sparse in a strict sense. If $S$ denotes the number of non-zero coefficients, with $S \ll n$, then $f$ is said to be *S*-sparse. In practice, most compressible signals have only a few significant coefficients while the rest have relatively small magnitudes. If we set these small coefficients to zero, as is done in lossy compression, then we have a sparse signal.

#### **2.1.2 Incoherence**

We start by considering two different orthonormal bases, Φ and Ψ, of $\mathbb{R}^n$. The coherence between these two bases is defined in [10] by

$$\mu(\Phi, \Psi) = \sqrt{n} \cdot \max\_{1 \le k, j \le n} |\langle \phi\_k, \psi\_j \rangle| \tag{1.4}$$

which gives us the largest correlation between any two elements of the two bases. It can be shown that

$$
\mu(\Phi, \Psi) \in [1, \sqrt{n}] \tag{1.5}
$$

Sparsity and incoherence together quantify the compressibility of a signal. A signal is more compressible if it has higher sparsity in some representation domain Ψ that has low coherence with the sensing (or sampling) domain Φ. Interestingly, random matrices are largely incoherent with any fixed basis [18].
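The quantities in (1.4) and (1.5) are easy to compute numerically. The sketch below (bases and sizes chosen purely for illustration) compares the spike (identity) basis against the orthonormal DCT basis and a random orthonormal basis:

```python
import numpy as np
from scipy.fft import dct

def coherence(Phi, Psi):
    # mu(Phi, Psi) = sqrt(n) * max_{k,j} |<phi_k, psi_j>|, as in (1.4)
    n = Phi.shape[0]
    return np.sqrt(n) * np.max(np.abs(Phi.T @ Psi))

n = 64
spike = np.eye(n)                                 # spike basis: direct sampling
dct_basis = dct(np.eye(n), axis=0, norm='ortho')  # orthonormal DCT basis
rand_basis, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(n, n)))

print(coherence(spike, dct_basis))   # about sqrt(2): a highly incoherent pair
print(coherence(spike, rand_basis))  # still small: random bases stay incoherent
print(coherence(spike, spike))       # sqrt(n) = 8.0: maximally coherent
```

All three values fall in the interval $[1, \sqrt{n}]$ of (1.5), with the random basis illustrating the incoherence-of-random-matrices remark above.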

In the CS framework, the signal of interest, now denoted $\mathbf{x} \in \mathbb{R}^n$, is expanded in the sparsifying basis $\Psi$ with coefficient vector $\mathbf{s}$:

$$\mathbf{x} = \Psi\mathbf{s} = \sum\_{i=1}^{n} s\_i \boldsymbol{\psi}\_i \tag{1.6}$$

Rather than sampling $\mathbf{x}$ directly, we acquire $M \ll n$ linear measurements $\mathbf{y}$ against the rows $\phi_k$ of a sensing matrix $\Phi$:

$$
\mathbf{y} = \Phi\mathbf{x} = \Phi\Psi\mathbf{s} = \Theta\mathbf{s} \tag{1.7}
$$

If $\mathbf{x}$ is $S$-sparse in $\Psi$, then it can be recovered with overwhelming probability provided that the number of measurements satisfies

$$M \ge C \cdot \mu^2(\Phi, \Psi) \cdot S \log n \tag{1.8}$$

for some positive constant $C$ [10], by solving the convex program

$$\min\_{\tilde{\mathbf{x}}} \|\tilde{\mathbf{x}}\|\_{\ell\_1} \quad \text{subject to } y\_k = \langle \phi\_k, \Psi \tilde{\mathbf{x}} \rangle, \ k = 1, \ldots, M \tag{1.9}$$

Note that (1.7) is an underdetermined system, since $M < n$. The classical minimum-energy solution

$$
\hat{\mathbf{s}} = \arg\min \|\mathbf{s'}\|\_2 \text{ such that } \Theta\mathbf{s'} = \mathbf{y} \tag{1.10}
$$

is easy to compute but almost never sparse. The sparsest solution is given by $\ell_0$ minimization,

$$\hat{\mathbf{s}} = \arg\min \|\mathbf{s'}\|\_0 \text{ such that } \Theta\mathbf{s'} = \mathbf{y} \tag{1.11}$$

but this combinatorial problem is computationally intractable. Remarkably, under the conditions above, the tractable $\ell_1$ relaxation

$$\hat{\mathbf{s}} = \arg\min \|\mathbf{s'}\|\_1 \text{ such that } \Theta\mathbf{s'} = \mathbf{y} \tag{1.12}$$

recovers the same sparse solution.
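As a toy end-to-end illustration of the acquisition and recovery pipeline, the sketch below takes $\Psi = I$ (so $\mathbf{x} = \mathbf{s}$), senses with a random Gaussian $\Phi$, and recovers $\mathbf{s}$ with a simple greedy Orthogonal Matching Pursuit (OMP) solver, a common practical stand-in for the $\ell_1$ program; all sizes are illustrative:

```python
import numpy as np

def omp(Theta, y, S):
    """Greedy Orthogonal Matching Pursuit: recover an S-sparse s from y = Theta s."""
    n = Theta.shape[1]
    support, residual = [], y.copy()
    for _ in range(S):
        # Select the column of Theta most correlated with the current residual.
        support.append(int(np.argmax(np.abs(Theta.T @ residual))))
        # Least-squares fit on the selected columns, then update the residual.
        coef, *_ = np.linalg.lstsq(Theta[:, support], y, rcond=None)
        residual = y - Theta[:, support] @ coef
    s_hat = np.zeros(n)
    s_hat[support] = coef
    return s_hat

rng = np.random.default_rng(0)
n, S = 256, 8
M = 160                                      # comfortably above the bound in (1.8)
s = np.zeros(n)                              # S-sparse coefficients (Psi = I, x = s)
s[rng.choice(n, S, replace=False)] = rng.normal(size=S)
Phi = rng.normal(size=(M, n)) / np.sqrt(M)   # random Gaussian sensing matrix
y = Phi @ s                                  # M = 160 measurements of an n = 256 signal
s_hat = omp(Phi, y, S)
print(np.max(np.abs(s_hat - s)))             # tiny: the exact support was identified
```

With far fewer measurements than samples, the sparse vector is recovered essentially exactly; shrinking $M$ toward the bound in (1.8) makes recovery progressively less reliable.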

Signal recovery is performed by the OMP algorithm [17]. In reconstructing compressively sampled blocks, all sampled coefficients with an absolute value below a certain threshold are set to zero. Theoretically, if all but $S$ of the DCT coefficients are non-significant, then on the order of $S \log n$ samples are needed for signal reconstruction [10], and the threshold is chosen accordingly. The choice of parameter values depends on the video sequence and the size of the blocks. They showed experimentally that up to 50% savings in video acquisition is possible with good reconstruction quality.

Fig. 1. System Block Diagram of Video Coding Scheme Proposed in [16]

Another technique which uses motion compensation and estimation at the decoder is presented in [18]. At the encoder, only random CS measurements are taken independently from each frame, with no additional compression. A multi-scale framework is proposed for reconstruction which iterates between motion estimation and sparsity-based reconstruction of the frames. It is built around the LIMAT method for standard video compression [19].

LIMAT [19] uses second-generation wavelets to build a fully invertible transform. To incorporate temporal redundancy, LIMAT adaptively applies motion-compensated lifting steps. Let the $k$-th frame of an $N$-frame video sequence be $x_k$, where $k \in \{1, \ldots, N\}$. The lifting transform partitions the video into even frames $\{x_{2k}\}$ and odd frames $\{x_{2k+1}\}$ and attempts to predict the odd frames from the even ones using a forward motion-compensation operator $P$. If $\{x_{2k}\}$ and $\{x_{2k+1}\}$ differ by, say, a 3-pixel shift that is captured precisely by a motion vector $v_k$, then $x_{2k+1} = P(x_{2k}, v_k)$ exactly.

The proposed algorithm in [18] uses block matching (BM) to estimate motion between a pair of frames. The BM algorithm divides the reference frame into non-overlapping blocks. For each block in the reference frame, the most similar block of equal size in the destination frame is found, and the relative location is stored as a motion vector. This approach improves on previous ones such as [13], where the reconstruction of a frame depends only on the individual frame's sparsity without taking any temporal motion into account. It is also better than using the inter-frame difference [20], which is insufficient for removing temporal redundancies.

#### **3.2 Distributed Compressed Video Sensing (DCVS)**

Another video coding approach that makes use of CS is based on the distributed source coding theory of Slepian and Wolf [21], and Wyner and Ziv [22]. Source statistics are exploited, partially or totally, only at the decoder, not at the encoder as is done conventionally. Two or more statistically dependent sources are encoded by independent encoders. Each encoder sends a separate bit-stream to a common decoder, which decodes all incoming bit-streams jointly, exploiting the statistical dependencies between them.

In [23], a framework called Distributed Compressed Video Sensing (DISCOS) is introduced. Video frames are divided into key frames and non-key frames at the encoder. A video sequence consists of several GOPs (groups of pictures), where a GOP consists of a key frame followed by some non-key frames. Key frames are coded using conventional MPEG intra-coding. Every frame is both block-wise and frame-wise compressively sampled using structurally random matrices [25]. In this way, the more efficient frame-based measurements are supplemented by block measurements that take advantage of temporal block motion.

At the decoder, key frames are decoded using a conventional MPEG decoder. For the decoding of non-key frames, the block-based measurements of a CS frame, along with the two neighboring key frames, are used to generate a sparsity-constrained block prediction. The temporal correlation between frames is efficiently exploited through the inter-frame sparsity model, which assumes that a block can be sparsely represented by a linear combination of a few temporally neighboring blocks. This prediction scheme is more powerful than conventional block matching, as it enables a block to be adaptively predicted from an optimal number of neighboring blocks, given its compressed measurements. The block-based prediction frame is then used as the side information (SI) to recover the input frame from its measurements. The measurement vector of the prediction frame is subtracted from that of the input frame to form a new measurement vector of the prediction error, which is sparse if the prediction is sufficiently accurate. Thus, the prediction error can be faithfully recovered. The reconstructed frame is then simply the sum of the prediction error and the prediction frame.

Another DCVS scheme is proposed in [24]. The main difference from [23] is that both key and non-key frames are compressively sampled, and no conventional MPEG/H.26x codec is required. However, key frames have a higher measurement rate than non-key frames.

The measurement matrix Φ is the scrambled block Hadamard ensemble (SBHE) matrix [28]. SBHE is essentially a partial block Hadamard transform followed by a random permutation of its columns. It provides near-optimal performance, fast computation, and memory efficiency, and it outperforms several existing measurement matrices, including the i.i.d. Gaussian matrix and the binary sparse matrix [28]. The sparsifying matrix used is derived from the discrete wavelet transform (DWT) basis.
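The full-search block-matching step used in the schemes above can be sketched as follows (frame contents, block size, and search range are illustrative assumptions, not parameters from the cited papers):

```python
import numpy as np

def block_match(ref, dst, block=8, search=4):
    """Full-search block matching: one motion vector (dy, dx) per reference block."""
    H, W = ref.shape
    vectors = {}
    for by in range(0, H - block + 1, block):
        for bx in range(0, W - block + 1, block):
            patch = ref[by:by + block, bx:bx + block]
            best, best_mv = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    y, x = by + dy, bx + dx
                    if 0 <= y <= H - block and 0 <= x <= W - block:
                        cand = dst[y:y + block, x:x + block]
                        sad = np.abs(patch - cand).sum()  # sum of absolute differences
                        if sad < best:
                            best, best_mv = sad, (dy, dx)
            vectors[(by, bx)] = best_mv
    return vectors

# Synthetic test: the destination frame is the reference shifted down by 2 pixels.
rng = np.random.default_rng(0)
ref = rng.random((32, 32))
dst = np.roll(ref, 2, axis=0)
mv = block_match(ref, dst)
print(mv[(8, 8)])   # → (2, 0): interior blocks report the vertical shift
```

Exhaustive search is simple but costly; practical encoders use pruned searches, while the CS decoders discussed above replace or augment this step with sparsity-driven prediction from compressed measurements.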