**3.2 Distributed Compressed Video Sensing (DCVS)**

Another video coding approach that makes use of CS is based on the distributed source coding theory of Slepian and Wolf [21] and Wyner and Ziv [22]. Source statistics are exploited, partially or totally, only at the decoder, not at the encoder as is done conventionally. Two or more statistically dependent sources are encoded by independent encoders. Each encoder sends a separate bit-stream to a common decoder, which decodes all incoming bit-streams jointly, exploiting the statistical dependencies between them.

In [23], a framework called Distributed Compressed Video Sensing (DISCOS) is introduced. At the encoder, video frames are divided into key frames and non-key frames. A video sequence consists of several groups of pictures (GOPs), where each GOP is a key frame followed by several non-key frames. Key frames are coded using conventional MPEG intra-coding. Every frame is compressively sampled both block-wise and frame-wise using structurally random matrices [25]. In this way, the more efficient frame-based measurements are supplemented by block-based measurements that can capture temporal block motion.
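The dual sampling can be sketched as follows: a toy frame is measured both as a whole and block by block. The Bernoulli ±1 sensing matrix, the frame size, and the measurement counts are illustrative stand-ins, not the structurally random matrices of [25].

```python
import numpy as np

rng = np.random.default_rng(0)

def cs_measure(x, m, rng):
    # Random +/-1 (Bernoulli) sensing matrix: a simple stand-in for the
    # structurally random matrices of [25].
    phi = rng.choice([-1.0, 1.0], size=(m, x.size)) / np.sqrt(m)
    return phi @ x

# Toy 16x16 "frame", sampled both frame-wise and block-wise (4x4 blocks).
frame = rng.standard_normal((16, 16))

y_frame = cs_measure(frame.ravel(), 64, rng)        # frame-wise measurements

y_blocks = []
for i in range(0, 16, 4):
    for j in range(0, 16, 4):
        block = frame[i:i + 4, j:j + 4].ravel()
        y_blocks.append(cs_measure(block, 8, rng))  # block-wise measurements

print(y_frame.shape, len(y_blocks), y_blocks[0].shape)  # (64,) 16 (8,)
```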

At the decoder, key frames are decoded using a conventional MPEG decoder. To decode a non-key frame, its block-based measurements, along with the two neighboring key frames, are used to generate a sparsity-constrained block prediction. The temporal correlation between frames is exploited through an inter-frame sparsity model, which assumes that a block can be sparsely represented as a linear combination of a few temporally neighboring blocks. This prediction scheme is more powerful than conventional block matching, as it allows a block to be adaptively predicted from an optimal number of neighboring blocks, given its compressed measurements. The block-based prediction frame is then used as side information (SI) to recover the input frame from its measurements: the measurement vector of the prediction frame is subtracted from that of the input frame to form the measurement vector of the prediction error, which is sparse if the prediction is sufficiently accurate. The prediction error can therefore be faithfully recovered, and the reconstructed frame is simply the sum of the prediction error and the prediction frame.
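The measurement-domain subtraction works because the sensing operator is linear: Φx − Φx_pred = Φ(x − x_pred). The sketch below illustrates this on a toy signal and recovers the sparse prediction error with plain ISTA; the sizes, the solver, and the regularization weight are illustrative choices, not those of [23].

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 64, 32
phi = rng.standard_normal((m, n)) / np.sqrt(m)

# True frame and a good block-based prediction: they differ in only 3 pixels,
# so the prediction error is sparse in the pixel domain.
x_pred = rng.standard_normal(n)
e_true = np.zeros(n)
e_true[[5, 20, 41]] = [2.0, -1.5, 1.0]
x_true = x_pred + e_true

y = phi @ x_true                           # measurements of the input frame
y_err = y - phi @ x_pred                   # measurements of the prediction error
assert np.allclose(y_err, phi @ e_true)    # linearity makes this exact

def ista(phi, y, lam=0.02, steps=1000):
    # Plain ISTA for the l1 recovery of the sparse error (a generic
    # stand-in for the sparse solver used in [23]).
    step = 1.0 / np.linalg.norm(phi, 2) ** 2
    e = np.zeros(phi.shape[1])
    for _ in range(steps):
        g = e - step * phi.T @ (phi @ e - y)
        e = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
    return e

e_hat = ista(phi, y_err)
x_hat = x_pred + e_hat                     # reconstruction = prediction + error
print(np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true))
```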

Another DCVS scheme is proposed in [24]. The main difference from [23] is that both key and non-key frames are compressively sampled and no conventional MPEG/H.26x codec is required. However, key frames have a higher measurement rate than non-key frames.

The measurement matrix Φ is the scrambled block Hadamard ensemble (SBHE) matrix [28]. SBHE is essentially a partial block Hadamard transform followed by a random permutation of its columns. It provides near-optimal performance, fast computation, and memory efficiency, and it outperforms several existing measurement matrices, including the i.i.d. Gaussian matrix and the sparse binary matrix [28]. The sparsifying matrix is derived from the discrete wavelet transform (DWT) basis.
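A rough numpy sketch of an SBHE-style operator, following the description above (block Hadamard transform, scrambled columns, random row subset); the block size and normalization here are illustrative choices, not necessarily those of [28]:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an n x n Hadamard matrix (n a power of two).
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def sbhe(n, m, block=8, rng=None):
    # Scrambled block Hadamard ensemble, sketched after [28]:
    # a block-diagonal Hadamard transform, a random permutation of its
    # columns, and a random selection of m rows.
    rng = rng or np.random.default_rng(0)
    w = np.kron(np.eye(n // block), hadamard(block))  # block Hadamard transform
    w = w[:, rng.permutation(n)]                      # scramble the columns
    rows = rng.choice(n, size=m, replace=False)       # keep m random rows
    return w[rows] / np.sqrt(block)                   # normalised rows

phi = sbhe(n=64, m=24)
print(phi.shape)  # (24, 64)
```

Because the rows come from an orthogonal transform, the selected rows stay orthonormal, which is one reason such partial ensembles behave well as measurement matrices.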

Compressive Video Coding: A Review of the State-Of-The-Art 11


At the decoder, the key frames are reconstructed using the standard Gradient Projection for Sparse Reconstruction (GPSR) algorithm. For the non-key frames, which have lower measurement rates, side information is first generated to aid reconstruction; it can be obtained by motion-compensated interpolation from the neighboring key frames. To incorporate the side information, GPSR is modified with a special initialization procedure and stopping criteria (see Figure 3). The modified GPSR has been shown to converge faster and to yield better reconstructed video quality than the original GPSR, two-step iterative shrinkage/thresholding (TwIST) [29], and orthogonal matching pursuit (OMP) [30].
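The exact modifications to GPSR are not spelled out above, but the idea of warm-starting an iterative solver with side information can be sketched with plain ISTA; the stopping rule, step size, and regularization weight below are assumptions for the sketch, not the authors' method.

```python
import numpy as np

def ista_warm(phi, y, x0, lam=0.01, tol=1e-6, max_steps=5000):
    # ISTA warm-started from side information x0 and stopped when the
    # iterate barely changes -- a rough analogue of initialising GPSR
    # with motion-compensated side information.
    step = 1.0 / np.linalg.norm(phi, 2) ** 2
    x = x0.copy()
    for k in range(1, max_steps + 1):
        g = x - step * phi.T @ (phi @ x - y)
        x_new = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
        if np.linalg.norm(x_new - x) < tol:   # stopping criterion
            return x_new, k
        x = x_new
    return x, max_steps

rng = np.random.default_rng(2)
n, m = 64, 24
phi = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[[3, 17, 50]] = [1.5, -2.0, 1.0]
y = phi @ x_true

side_info = x_true + 0.05 * rng.standard_normal(n)  # a good MCI-style prediction
x_warm, it_warm = ista_warm(phi, y, side_info)
x_cold, it_cold = ista_warm(phi, y, np.zeros(n))
print(it_warm, it_cold)
```

Starting closer to the solution typically cuts the iteration count, which mirrors the convergence-speed advantage reported for the modified GPSR.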

Fig. 2. Architecture of DISCOS [23]

Fig. 3. Distributed CS Decoder [24]

**3.3 Dictionary based compressed video sensing**

In dictionary based techniques, a dictionary (basis) is created at the decoder from neighbouring frames for successful reconstruction of CS frames.

A dictionary based distributed approach to CVS is reported in [32]. Video frames are divided into key frames and non-key frames. Key frames are encoded and decoded using conventional MPEG/H.264 techniques. Non-key frames are divided into non-overlapping blocks of n pixels. Each block is then compressively sampled and quantized. At the decoder, key frames are MPEG/H.264 decoded, while the non-key frames are dequantized and recovered using a CS reconstruction algorithm with the aid of a dictionary. The dictionary is constructed from the decoded key frame. The architecture of this system is shown in Figure 4.
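A minimal sketch of the encoder side for one non-key-frame block, pairing random block-wise sampling with uniform quantization of the measurements; the block size, measurement rate, and quantization step are made-up values, not those of [32]:

```python
import numpy as np

rng = np.random.default_rng(3)
B = 8                                    # block is B x B, so n = 64 pixels
m = 16                                   # measurements per block
phi = rng.standard_normal((m, B * B)) / np.sqrt(m)

def encode_block(block, phi, q_step=0.05):
    # Compressively sample one non-key-frame block and uniformly
    # quantise the measurements.
    y = phi @ block.ravel()
    return np.round(y / q_step).astype(int)  # quantised measurement indices

def dequantise(q, q_step=0.05):
    return q * q_step

block = rng.standard_normal((B, B))
q = encode_block(block, phi)
y_hat = dequantise(q)
print(np.max(np.abs(y_hat - phi @ block.ravel())))  # at most q_step / 2
```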


Fig. 4. A Dictionary-based CVS System [32]

Two coding modes are defined. The first is the SKIP mode, used when a block in the current non-key frame changes little from the co-located block of the decoded key frame. Such a block is skipped during decoding. This comes at the cost of increased encoder complexity, since the encoder has to estimate the mean squared error (MSE) between the decoded key-frame block and the current CS-frame block. If the MSE is below a threshold, the decoded block is simply copied into the current frame, so the decoding complexity is minimal. The second mode is the SINGLE mode: the CS measurements of a block are compared, using the MSE criterion, with the CS measurements stored in a dictionary. If the smallest MSE is below a pre-determined threshold, the block is marked as decoded. The dictionary is created from a set of spatially neighboring blocks of previously decoded neighboring key frames. A feedback channel is used to inform the encoder that the block has been decoded and that no further measurements are required. For blocks not handled by either the SKIP or the SINGLE mode, normal CS reconstruction is performed.
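The mode decision above can be sketched as follows; the thresholds, sizes, and function names are invented for the illustration (the pixel-domain SKIP test runs at the encoder, the measurement-domain SINGLE test at the decoder):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def choose_mode(cur_block, key_block, y_cur, dict_measurements,
                t_skip=1e-3, t_single=1e-3):
    # SKIP:   pixel-domain test against the co-located key-frame block.
    # SINGLE: measurement-domain test against dictionary entries.
    # CS:     otherwise, fall back to normal CS reconstruction.
    if mse(cur_block, key_block) < t_skip:
        return "SKIP"
    if min(mse(y_cur, y_d) for y_d in dict_measurements) < t_single:
        return "SINGLE"
    return "CS"

rng = np.random.default_rng(4)
phi = rng.standard_normal((8, 16)) / np.sqrt(8)

key = rng.standard_normal(16)
static = key + 1e-4 * rng.standard_normal(16)  # essentially unchanged block
moved = rng.standard_normal(16)                # a genuinely different block
dict_meas = [phi @ moved]                      # its measurements sit in the dictionary

print(choose_mode(static, key, phi @ static, dict_meas))  # SKIP
print(choose_mode(moved, key, phi @ moved, dict_meas))    # SINGLE
fresh = rng.standard_normal(16)
print(choose_mode(fresh, key, phi @ fresh, dict_meas))    # CS
```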

Another dictionary based approach is presented in [33], where an adaptive dictionary is used. The dictionary is learned from a set of blocks extracted globally from the previously reconstructed neighboring frames, together with the side information generated from them, and is used as the basis for each block in a frame. At the encoder, frames are divided into key frames and CS frames: frame-based CS measurements are taken for key frames, and block-based CS measurements for CS frames. At the decoder, the reconstruction of a frame or a block is formulated as an *l1*-minimization problem and solved using the sparse reconstruction by separable approximation (SpaRSA) algorithm [34]. A block diagram of this system is shown in Figure 5.
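To illustrate sparse representation over a dictionary, the sketch below codes a block with a simple orthogonal matching pursuit over random unit-norm atoms; [33] actually solves an *l1* problem with SpaRSA [34], and its dictionary is learned rather than random, so this is only a stand-in:

```python
import numpy as np

def omp(D, y, k):
    # Orthogonal matching pursuit: greedily pick k atoms of D, then
    # least-squares fit y on the chosen atoms.
    resid, support = y.copy(), []
    for _ in range(k):
        support.append(int(np.argmax(np.abs(D.T @ resid))))
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        resid = y - D[:, support] @ coef
    alpha = np.zeros(D.shape[1])
    alpha[support] = coef
    return alpha

rng = np.random.default_rng(5)
n_b, P = 64, 128                      # block length and number of atoms (P > n_b)
D = rng.standard_normal((n_b, P))
D /= np.linalg.norm(D, axis=0)        # unit-norm atoms, as a learned dictionary has

a_true = np.zeros(P)
a_true[[4, 30]] = [1.0, -0.7]
b = D @ a_true                        # a block that is 2-sparse in this dictionary
a_hat = omp(D, b, k=2)
print(np.linalg.norm(b - D @ a_hat))  # residual after coding with 2 atoms
```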

Adjacent frames in the same scene of a video are similar; a frame can therefore be predicted from side information generated by interpolating its neighboring reconstructed frames. At the decoder in [33], the side information for a CS frame is generated by motion-compensated interpolation (MCI) of the previous and next reconstructed key frames. To learn the dictionary, training patches are extracted from these three frames: for each block, nine patches are taken, namely the block itself and the eight nearest blocks overlapping it. The K-SVD algorithm [35] is then applied to the training patches to learn an overcomplete dictionary, with which each block of the CS frame can be represented by a sparse coefficient vector. This learned dictionary provides a sparser representation of the frame than a fixed-basis dictionary. The same authors extended their work in [36] to dynamic measurement-rate allocation by incorporating a feedback channel into their dictionary-based distributed video codec.

Fig. 5. Distributed Compressed Video Sensing with Dictionary Learning

process should not be adopted. Otherwise, there is no point in using CS as an extra overhead. We believe that the distributed approach, in which every key frame and non-key frame is encoded by CS, is able to utilise CS more effectively. While spatial-domain compression is performed by CS, temporal-domain compression is not fully exploited, since no motion estimation or compensation is performed. Therefore, a simple but effective inter-frame compression scheme will need to be devised; in the distributed approach, this is equivalent to generating effective side information for the non-key frames.

**5. References**
