Digital Video Coding

The major objective of video coding is to represent media with as few bits as possible while preserving the level of quality required for a given application.

  • Reduction of storage requirement for data storage
  • Reduction of channel bandwidth required for data transmission

The Layered Architecture for Digital TV/OTT (Over-The-Top) Video

img

  • Production and Display Layer
  • Video/Audio Compression Layer
  • Data Transport Layer
  • Transmission Layer

Display Layer

Picture formats

  • Standard definition (720p)
    • Y: 1280×720
    • UV: 640×360 (4:2:0)
    • Application: Original digitization format used in SDTV
  • High definition 4:2:0 (1080p) or 4:2:2
    • Y: 1920×1080
    • UV: 960×540 (4:2:0)
    • Application: Original digitization format used in HDTV
  • Ultra 4K (2160p)
    • Y: 3840×2160
    • UV: 1920×1080 (4:2:0)
    • Application: in the fields of digital television and digital cinematography
  • Ultra 8K (4320p)
    • Y: 7680×4320
    • UV: 3840×2160 (4:2:0)
    • Application: NHK launched a three-year roadmap entailing 8K test broadcasting in 2016, with plans to roll out full 8K services by 2018, in time for the 2020 Summer Olympics.

High-Definition Television (HDTV) Resolutions

img

Video/Audio Compression Layer

Video Compression Technologies

Digital video contains a tremendous amount of information.

  • It is impractical to transmit over the air or the Internet, or to store cost-effectively, without compression.
  • Every digital video coding system uses compression to reduce the file size when transporting video over a network for storage and viewing, while having as little impact on the image and video quality as possible.
  • Compression saves money.
  • Compression removes redundancy from the data prior to transmission or storage.
img

Often, the compression formats are referred to as “codecs.” Codec is the abbreviated form of “compressor-decompressor” or “coding-decoding.”

$$\text{Compression Ratio } (CR)=\frac{\text{size of source data}}{\text{size of compressed bitstream}}$$

We should consider not only the compression ratio but also the quality.

  • Lossless compression (Lower compression ratio)
    • The aim is to reduce the number of bits used to represent the source information in such a way that there is no loss of information. (Recover the original representation perfectly)
    • Reversible
    • Desirable in Medical Imagery
  • Lossy compression (Higher compression ratio)
    • The aim is to reduce the number of bits used to represent the source information by introducing acceptable loss of information. (Recover the representation to be similar to the original one)
    • Irreversible
    • Desirable in TV Broadcasting

img

Lossless Compression - Statistical redundancy

Based on probability of occurrence

  • Any compression algorithm associated with text must be lossless
  • since the loss of just a single character could modify the meaning of a complete string.

Entropy encoding is lossless and independent of the type of information that is being compressed.

Make use of entropy.

The entropy $H$ of $N$ random variables $s_i$, where $i = 1, 2, \ldots, N$, with corresponding probabilities $p_i$, is given by

$$H=\sum_{i=1}^{N} p_{i} \log_{2} \frac{1}{p_{i}}=-\sum_{i=1}^{N} p_{i} \log_{2} p_{i}$$

From a video coding perspective, entropy is a measure of

  • the amount of information associated with the set of coder input values, and
  • it gives a lower bound on the average number of bits required to code these inputs.
  • Entropy of a source: the theoretical minimum average number of bits required to transmit a particular source stream (see the sketch below).
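
As a quick illustration (a minimal sketch; the four symbol probabilities are arbitrary values chosen for the example), the entropy of a source can be computed directly from its probability distribution:

```python
import math

def entropy(probabilities):
    """Entropy H = -sum(p_i * log2(p_i)), in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Four symbols with assumed probabilities
probs = [0.5, 0.25, 0.125, 0.125]
print(entropy(probs))  # 1.75 bits/symbol: the lower bound on the average codeword length
```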

Run-length encoding

  • Encode consecutive 0s and consecutive 1s

Example:

000001111111000011 => 5,7,4,2 => 101 111 100 010

The compression ratio in this example is 18/12 = 1.5 (each run length is coded with a fixed 3-bit codeword).
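
A minimal sketch of this run-length scheme (assuming, as in the example above, that the input starts with a run of 0s and that each run length fits in a fixed 3-bit codeword):

```python
def run_lengths(bits):
    """Lengths of consecutive runs of identical bits."""
    runs, count = [], 1
    for prev, cur in zip(bits, bits[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

def encode_runs(runs, width=3):
    """Encode each run length as a fixed-width binary codeword."""
    return " ".join(format(r, f"0{width}b") for r in runs)

source = "000001111111000011"
runs = run_lengths(source)       # [5, 7, 4, 2]
print(runs, encode_runs(runs))   # 101 111 100 010  (12 bits vs. 18 source bits)
```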

Variable-length Codeword Assignment

Statistical encoding

  • Uses a set of variable-length codeword (VLC) with
    • Less bits assigned to more frequently occurring symbols
    • More bits assigned to symbols that appear rarely.

Why not use uniform-length codeword assignment?

img

Although uniform-length codeword assignment is the simplest and easiest to implement, it is not optimal.

  • In practice, some symbols are more likely to occur than others.
  • A codeword of a different length is assigned to each symbol according to its probability of occurrence.

Major Criteria of VLC assignment

  • The assignment must be uniquely decodable
  • The representation should be optimal in terms of the required average bit rate

Huffman coding

Huffman coding is a systematic technique for assigning variable-length codewords.

  • A simple approach that yields an average codeword length close to the source entropy

The first step is to create a series of source reductions:

  • Order the symbols by their probabilities.
  • Combine the two lowest-probability symbols into a single symbol. This process is then repeated until a reduced source with only two symbols remains.

Huffman encoding

Suppose we have symbols $a_1$ to $a_6$. We first sort them by probability (from highest to lowest).

In each step, we combine the symbol with the lowest probability and the symbol with the second-lowest probability, then sort again.

We then assign codewords, from the shortest to the longest.

The assignment procedure is as follows:

  • First we have the fully reduced group $a_{643512}$.
    • Break it into two groups; assign 1 to the higher-probability group and 0 to the lower-probability group.
      • If the two groups have the same probability (e.g. $a_4$ and $a_{35}$ both have 0.1), assign 1 to the group with more elements (i.e. $a_{35}$ gets 1, $a_4$ gets 0).
  • If a group contains more than one symbol, keep breaking it down until every symbol stands alone.
    • The process ends when every symbol has a codeword assignment.
  • Stage 4 -> 3 -> 2 -> 1

In the figure, the symbols A–F are denoted $a_1$ to $a_6$.

Note that the resulting average codeword length is close to the entropy; Huffman coding gets as close to it as any symbol-by-symbol code can.
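
A minimal sketch of the source-reduction procedure using a priority queue. The probabilities below are assumed textbook-style values for illustration, and ties may be broken differently from the worked example above, so individual codewords can differ while the average length stays the same:

```python
import heapq

def huffman_codes(probabilities):
    """Build Huffman codewords from a {symbol: probability} mapping."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword})
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, low = heapq.heappop(heap)     # lowest-probability group
        p2, _, high = heapq.heappop(heap)    # second-lowest group
        merged = {s: "0" + c for s, c in low.items()}          # 0 -> lower probability
        merged.update({s: "1" + c for s, c in high.items()})   # 1 -> higher probability
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

probs = {"a1": 0.4, "a2": 0.3, "a3": 0.1, "a4": 0.1, "a5": 0.06, "a6": 0.04}  # assumed
codes = huffman_codes(probs)
avg_len = sum(probs[s] * len(c) for s, c in codes.items())
print(codes, avg_len)   # average 2.2 bits/symbol, close to the entropy of about 2.14 bits
```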

Huffman Decoding

  • Prefix Property: A shorter codeword will never form the start of a longer codeword.

  • The received bitstream can be decoded simply by carrying out a recursive search bit by bit until each valid codeword is found.

  • Use a State machine

    • Move 1 step forward when receiving a 0/1.
    • Output symbol X and reset the machine when state X is reached.

E.g. 101101010

Start from the first bit:

1 -> a2, then reset state

0 -> 01 -> 011 -> a1, then reset state

0 -> 01 -> 010 -> 0101 -> 01010 -> a3, then reset state.

Another illustration would be like this:

  • Carrying out a recursive search bit by bit until each valid codeword is found.

img
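
A minimal sketch of this bit-by-bit prefix decoding; the codebook is inferred from the decoding example above (a2 = 1, a1 = 011, a3 = 01010) and any prefix-free codebook would work the same way:

```python
def decode(bitstream, codebook):
    """Extend a bit buffer until it matches a valid codeword, then reset (state-machine style)."""
    inverse = {code: symbol for symbol, code in codebook.items()}
    symbols, buffer = [], ""
    for bit in bitstream:
        buffer += bit
        if buffer in inverse:            # a valid codeword (an "output state") is reached
            symbols.append(inverse[buffer])
            buffer = ""                  # reset the state machine
    return symbols

codebook = {"a1": "011", "a2": "1", "a3": "01010"}   # inferred from the example above
print(decode("101101010", codebook))                 # ['a2', 'a1', 'a3']
```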

Advantages and Disadvantages of Huffman Coding

Huffman coding is one of the most efficient codeword assignment schemes.

  • It provides an average number of bits per symbol that is close to the entropy.

However:

  • It is difficult to construct the codebook when the number of inputs is large
  • It needs to know the probability distribution of the input symbols in advance
  • It cannot adjust the codewords to handle non-stationary inputs

Lossy Compression - Spatial redundancy

Based on correlation of neighboring pixels

Spatial redundancy is exploited by image compression algorithms.

Image compression algorithms typically consist of:

  • Quantization,
  • DCT-based coding,
  • Huffman coding and
  • run-length coding

The following example uses the JPEG compression standard.

Lossy Image Encoder (JPEG)

img

Transformation

Transformation - Block Subdivision

Transformation: Block Subdivision itself is lossless.

There is strong correlation between the intensities of pixels that are spatially close together within a frame.

  • In transform image coding, an image is divided into a number of 2D blocks.
  • Each block is then coded separately.
  • The block size depends on the application; a typical block size is 8×8.
img

Transformation - Discrete Cosine Transform

Transformation: DCT itself is lossless.

A DCT-based image coding system is described as lossy because it includes quantization; the DCT itself is lossless.

The 2-D Discrete Cosine Transform converts each subdivided block from the pixel domain to a frequency domain in which it can be encoded more efficiently (good energy compaction property).

img

Major objectives of transformation:

  • To make as many transform coefficients as possible small enough so that they are insignificant and need not be coded for transmission.

  • To minimize statistical dependence, i.e., correlation between the transform coefficients.

The relatively good capability for bit-rate reduction mainly comes from two mechanisms:

  • Not all transform coefficients need to be transmitted.
  • Coefficients need not be coded with full accuracy.

In present-day transform image coding, the Discrete Cosine Transform (DCT) is usually used.

Properties of transform coefficients:

  • Most of them have insignificant magnitude.
    • We only want to handle those values with significant magnitude.
  • They are highly uncorrelated.

2D-DCT Computation Features

  • DC coefficient
    • For u=v=0, the two cosine terms are both 1
    • F(0,0) is simply a function of the summation of all the values in the input matrix.
    • It is proportional to the mean of all 64 values in the block.
  • Others are known as AC coefficients
    • Most AC coefficients have zero or near-zero values
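
A minimal sketch of the 8×8 forward and inverse 2-D DCT, assuming SciPy is available (`scipy.fft.dctn` computes the orthonormal DCT-II); the block values are an arbitrary ramp chosen for illustration:

```python
import numpy as np
from scipy.fft import dctn, idctn

# Arbitrary 8x8 block of pixel values (a horizontal ramp), level-shifted by -128 as in JPEG
block = np.tile(np.arange(8) * 16, (8, 1)).astype(float)
shifted = block - 128.0

coeffs = dctn(shifted, norm="ortho")       # 2-D DCT-II (orthonormal)
print(coeffs[0, 0])                        # DC coefficient, proportional to the block mean
print(np.round(coeffs[0, :4], 1))          # energy concentrates in the low-frequency coefficients

reconstructed = idctn(coeffs, norm="ortho") + 128.0
print(np.allclose(reconstructed, block))   # True: the DCT itself is lossless
```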

Quantization

Quantization itself is lossy.

Discarding information which is not visually significant

img

Consider an example of quantization:

img

  • If we increase the quantization step size,
    • the compression ratio (CR) will increase, and
    • the quality will decrease.

This is why a transformation is applied before quantization: it allows the compression ratio to be increased further.

  • Lower frequency coefficients (small quantization step size)
    • finer quantization
  • High frequency coefficients (large quantization step size)
    • coarser quantization
    • many become zero after quantization

img
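
A minimal sketch of the quantization step for an 8×8 block of DCT coefficients. The coefficient values and the quantization matrix below are assumed for illustration (not the actual JPEG tables); dividing by the step size and rounding is what discards information:

```python
import numpy as np

# Illustrative quantization matrix (assumed): the step size grows with frequency,
# so low-frequency coefficients are quantized finely and high-frequency ones coarsely.
u, v = np.meshgrid(np.arange(8), np.arange(8), indexing="ij")
Q = 8 + 4 * (u + v)                # step size 8 at DC, up to 64 at the highest frequency

def quantize(coeffs, Q):
    return np.round(coeffs / Q).astype(int)    # lossy: rounding discards detail

def dequantize(levels, Q):
    return levels * Q                          # the decoder can only approximate the original

coeffs = np.random.default_rng(0).normal(scale=50, size=(8, 8))   # stand-in DCT coefficients
levels = quantize(coeffs, Q)
print(np.count_nonzero(levels == 0))           # zeros appear mostly at the high frequencies
```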

Codeword Assignment (Entropy Coding)

Entropy Coding itself is lossless.

Comprise four steps

  • Vectoring (zig-zag scanning)
  • Differential coding
  • Run-length coding
  • Huffman coding

DC coefficients use Differential coding + Huffman coding

AC coefficients use Run-length coding + Huffman coding

Vectoring - Zig-Zag scanning

img

img

Placing the lower-frequency coefficients before the higher-frequency ones

  • generates more consecutive zeros, and
  • permits efficient use of run-length coding.
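
A minimal sketch of the zig-zag scan (shown on a 4×4 block for brevity; the same anti-diagonal, direction-alternating pattern is used on 8×8 blocks):

```python
def zigzag_scan(block):
    """Scan an NxN block along anti-diagonals, alternating direction."""
    n = len(block)
    order = sorted(
        ((i, j) for i in range(n) for j in range(n)),
        key=lambda ij: (ij[0] + ij[1],                                   # anti-diagonal index
                        ij[1] if (ij[0] + ij[1]) % 2 == 0 else -ij[1]),  # alternate direction
    )
    return [block[i][j] for i, j in order]

block = [[6, 4, 1, 0],
         [3, 2, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0]]
print(zigzag_scan(block))   # [6, 4, 3, 1, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```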

Coding DC coefficients - Differential Predictive Coding + Huffman coding

Differential coding is used when the amplitudes of the values that make up the source information cover a large range, but the difference between successive values is relatively small.

There is usually a strong correlation between the DC coefficients of adjacent N×N blocks.

  • The DC coefficients are encoded using the predictive coding techniques.

img

img

  • Difference values are further encoded using Huffman coding

img

  • SIZE => the number of extra bits needed
  • Huffman codeword + Extra bits => transmission

The extra bits are encoded in a SIZE-bit plain binary format (1’s complement if negative).

That is, they are the binary code of the difference value.

Example:

  • Difference value = 14 -> 1110
  • Difference value = -14 -> 1’s complement of 1110 -> 0001
  • Difference value = 0 -> no extra bits (“-”)

img

$$\operatorname{SIZE}=\lceil\log_{2}(|x|+1)\rceil$$

where $x$ is the DC difference value.

img
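
A minimal sketch of forming the SIZE category and the SIZE-bit extra bits for a DC difference value (1's complement for negative values, as described above); the Huffman codeword for each SIZE category would then be looked up in the standard table, which is not reproduced here:

```python
import math

def dc_size(diff):
    """SIZE = ceil(log2(|x| + 1)); 0 when the difference is zero."""
    return math.ceil(math.log2(abs(diff) + 1))

def dc_extra_bits(diff):
    """SIZE-bit binary representation of the difference (1's complement if negative)."""
    size = dc_size(diff)
    if size == 0:
        return ""                                   # no extra bits for a zero difference
    if diff > 0:
        return format(diff, f"0{size}b")
    return format(~abs(diff) & ((1 << size) - 1), f"0{size}b")   # 1's complement

for d in (14, -14, 0):
    print(d, dc_size(d), dc_extra_bits(d))
# 14 -> SIZE 4, bits '1110';  -14 -> SIZE 4, bits '0001';  0 -> SIZE 0, no extra bits
```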

Why is differential coding, or Differential Pulse Code Modulation (DPCM), used for the DC coefficients in JPEG coding, rather than coding each DC value separately?

Hint: consider the properties of Huffman coding.

Coding AC coefficients - Run-length coding + Huffman coding

For AC coefficients:

  • Convert zig-zag scanned ac coefficients into an intermediate symbol sequence.
  • Encode the intermediate symbol sequence using Huffman (or arithmetic) coding.

Intermediate symbol conversion of AC Coefficients:

Each non-zero AC coefficient is represented by a pair of symbols:

  • symbol-1 (RUNLENGTH, SIZE) i.e. assign a Codeword
    • RUNLENGTH = the number of consecutive zero-valued ac coefficients preceding the non-zero ac coefficient \in [0,15]
    • SIZE = the minimum number of bits used to encode AMPLITUDE \in [0 to 10 bits]
      • the minimum number of bits required to represent the absolute value of the AC coefficient in binary format.
  • symbol-2 (AMPLITUDE)
    • AMPLITUDE = in the range of [-1023, 1024]

Example: if the zig-zag scanned sequence of ac coefficients is {0, 0, 0, 0, 0, 0, 476}

=> (6, 9)(476)

where 6 is the number of consecutive zeros preceding 476, and $\lceil\log_2(476+1)\rceil = 9$.

To determine the SIZE value for an AC:

$$\operatorname{SIZE}=\lceil\log_{2}(|x|+1)\rceil$$

where $x$ is the non-zero AC coefficient.

SIZE = 0 is reserved for the DC case (the AC coefficients coded here are never zero); apart from that, the DC and AC coefficients basically share the same SIZE table.

If RUNLENGTH is > 15:

  • use symbol-1 (15,0) to represent RUNLENGTH = 16

Example: if the sequence of AC coefficients is {0, 0, …, 0, 12}, where there are 39 zeros,

=> (15,0)(15,0)(7,4)(12)

  • After coding the last non-zero AC coefficient in a block, symbol-1 (0,0) is sent => End Of Block (EOB)
    • EOB usually occurs well before the mid-point of the array, so it makes effective use of the EOB symbol code.
    • EOB appears frequently, hence an efficient (4-bit) codeword is assigned to it (i.e. it is treated as very probable).
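
A minimal sketch of converting one block's zig-zag scanned AC coefficients into the (RUNLENGTH, SIZE)(AMPLITUDE) intermediate symbols, including the (15,0) extension and the (0,0) EOB marker:

```python
import math

def ac_symbols(ac):
    """Intermediate symbols for one block's zig-zag scanned AC coefficients."""
    symbols, run = [], 0
    for coeff in ac:
        if coeff == 0:
            run += 1
            continue
        while run > 15:                           # (15, 0) stands for a run of 16 zeros
            symbols.append(((15, 0), None))
            run -= 16
        size = math.ceil(math.log2(abs(coeff) + 1))
        symbols.append(((run, size), coeff))      # symbol-1 (RUNLENGTH, SIZE), symbol-2 AMPLITUDE
        run = 0
    symbols.append(((0, 0), None))                # End Of Block
    return symbols

print(ac_symbols([0, 0, 0, 0, 0, 0, 476]))   # [((6, 9), 476), ((0, 0), None)]
print(ac_symbols([0] * 39 + [12]))           # [((15, 0), None), ((15, 0), None), ((7, 4), 12), ((0, 0), None)]
```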

Performance of Codec (CR)

Compression Ratio, $CR$:

$$CR=\frac{\text{Size of original data}}{\text{Size of compressed data}}$$

Number of bits per pixel, $N_b$:

$$N_{b}=\frac{\text{encoded number of bits}}{\text{number of pixels}}$$

There is a trade-off between the compression ratio and the picture quality

Performance of Codec (Quality)

Peak Signal-to-Noise ratio (PSNR)

  • PSNR approximates the human perception of reconstruction quality.

$$PSNR=10 \log_{10}\left[\frac{\sum_{y} \sum_{x} 255^{2}}{\sum_{y} \sum_{x}\left\{f(x, y)-\hat{f}(x, y)\right\}^{2}}\right]$$

The denominator, $\sum_{y}\sum_{x}\{f(x,y)-\hat{f}(x,y)\}^2$, divided by the number of pixels is the mean squared error (MSE); the ratio then equals $255^2/\mathrm{MSE}$.
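
A minimal sketch of computing the PSNR between an original and a reconstructed 8-bit image (the image and the noise below are assumed data for illustration):

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB for 8-bit images."""
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10 * np.log10(peak ** 2 / mse)

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(64, 64))
noisy = np.clip(original + rng.normal(scale=5, size=original.shape), 0, 255)
print(psnr(original, noisy))           # roughly 34 dB for noise of this magnitude
```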

Lossy Compression - Temporal redundancy

Motion JPEG (M-JPEG)

Motion JPEG (M-JPEG) is a digital video sequence represented as a series of JPEG images.

  • The higher the compression level, the lower the file size and image quality.
  • Advantages:
    • flexibility both in terms of quality and compression ratio.
    • There is no dependency between frames, so it is robust: if one frame is dropped during transmission, the rest of the video is unaffected.
    • M-JPEG is popular in applications where individual frames in a video sequence are required (e.g., for video analytics)
  • Disadvantages:
    • It makes no use of inter-frame video compression techniques => lower compression ratio (10:1 to 24:1) for video sequences.

Temporal redundancy is based on similarity of neighboring frames.

Consider:

  • An object moving across the screen in a movie
  • lasts for at least 3 s.
  • Assume a frame rate of 60 frames per second -> 180 frames.
  • By sending only the information relating to those segments of each frame that have movement associated with them,
  • considerable savings in bandwidth are possible.

img

Frame Difference Coding

The difference frame is encoded as an image using the JPEG encoding method; at the decoder it is decoded again and added back to frame t-1 to reconstruct frame t.

img

Disadvantage of Frame Difference Coding

  • If there is large motion between two frames, the frame difference signal becomes larger.
    • The compression ratio decreases.

Motion-Compensated Prediction

Motion estimation and compensation

Encoder side

  • Perform motion estimation to produce a motion-compensated prediction frame.
  • Subtract the motion-compensated frame from the current frame t to form the residual.
  • The encoder also transmits the motion information (motion vectors) to the decoder.

Motion estimation is an expensive process, so the encoder is more expensive while the decoder is cheap; only a decoder is needed to watch the video.

Decoder side

  • Use the received motion information to create a motion-compensated frame.
  • Add the motion-compensated frame to the decoded residual to reconstruct frame t.

If the motion cannot be estimated perfectly, a prediction error (residual) remains and must be coded.

img

Advantage of Motion-Compensated Prediction

  • The compression ratio increases substantially.

The Video Encoder and Decoder

img

img

Motion compensation is performed in both the encoder and the decoder, but motion estimation is needed only in encoder => asymmetric property

Motion Estimation: Block Matching Algorithm

The displacement of the 16×16 macroblock in the reference frame that gives the best match under a given matching criterion is chosen as the motion vector (mv).

img

Matching Criterion: Sum of Absolute Difference

img

Motion vector (mv) = (u,v)

A smaller SAD value means the two blocks are closer to each other.

A SAD value of 0 means the two blocks are identical.

  • Full searching algorithm for SAD
    • most accurate
    • heavy computation
  • Fast searching algorithm for SAD
    • reduced computation complexity
    • deterioration in picture quality
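
A minimal sketch of full-search block matching with the SAD criterion (the frames, block position, and search range below are assumed toy values):

```python
import numpy as np

def full_search(current, reference, top, left, N=16, D=8):
    """Return the motion vector (u, v) that minimizes SAD for one NxN macroblock."""
    block = current[top:top + N, left:left + N].astype(int)
    best_sad, best_mv = None, (0, 0)
    for u in range(-D, D + 1):                        # vertical displacement
        for v in range(-D, D + 1):                    # horizontal displacement
            i, j = top + u, left + v
            if i < 0 or j < 0 or i + N > reference.shape[0] or j + N > reference.shape[1]:
                continue                              # candidate block falls outside the frame
            candidate = reference[i:i + N, j:j + N].astype(int)
            sad = np.abs(block - candidate).sum()     # 256 |.|, 256 -, 255 + operations per SAD
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (u, v)
    return best_mv, best_sad

# Toy frames (assumed): the current frame is the reference shifted down by 2 and right by 3
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64))
current = np.roll(np.roll(reference, 2, axis=0), 3, axis=1)
print(full_search(current, reference, top=16, left=16))   # ((-2, -3), 0): identical blocks, SAD = 0
```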

Video compression Picture Types

There are 3 types of picture used in digital video coding.

img
  • Intra (I) picture
    • coded without reference to any other pictures
    • provides access points to the coded sequence where decoding can begin
    • coded with only moderate compression (10:1 to 24:1)
  • Predicted (P) picture
    • coded with reference to a previous I-picture or P-picture
    • generally used as a reference for further prediction
    • compression ratio is about 20:1 to 50:1

If there is an error in a P-picture, the error will propagate until the end of the sequence.

e.g. IPPPPPPPPP

Alternatively, if we start watching in the middle, a P-picture has no reference I-frame and cannot be reconstructed.

e.g. PPPPPPPP

This is undesirable. One solution is to set up periodic access points (I-frames).

e.g. IPPPPIPPPPIPPPP

Each group of pictures starting at an I-frame (e.g. IPPPP) is called a GOP (Group of Pictures).

The GOP length is the number of pictures in the GOP (the length is 5 in the example above).

  • Bidirectional (B) picture
    • uses motion-compensated prediction from a past and a future I- or P-picture
    • never used as a reference for further prediction
    • highest compression ratio (50:1 to 100:1)
    • B-frames need to perform motion estimation twice (forward and backward)
img

Performance comparing the use without B picture and with B picture

img img

The bitrate consumed by a Bidirectional (B) picture should be the smallest.

Coding of I-Picture

img

  • Intraframe coding
  • 8x8 DCT
  • Any weighting matrix for DCT coefficients possible
  • Differential coding of DC coefficient
  • Uniform quantization
  • Run length coding of zeros with zig-zag scan
  • Entropy coding

Coding of P-Pictures

The IBBBPBBBI example is used here; the P-picture being coded is frame 5.

img

  • Motion compensated prediction from an I- or P-pictures
  • One motion vector per macroblock
  • Coding of prediction error with 8x8 DCT, uniform quantization, zig-zag scan (like I-picture)

Coding of Motion Vectors

  • Highly correlated among neighboring macroblocks
  • In many standards, macroblock is the unit for motion estimation. It will be fine tuned to the block level if necessary
  • Code the difference using variable length code (VLC)

Coding of B-Pictures

The IBBBPBBBI example is used here again.

img

Transmission Order and Display Order

img

Before we encode a B-picture, note that it needs both a past and a future I- or P-picture.

  • The past and future I- or P-pictures must therefore always be encoded first, which changes the transmission order.
  • Therefore, the display order is different from the encoding (transmission) order.

Another example:

A video surveillance system uses the frame sequence shown below. Derive a suitable reordered sequence that ensures, firstly, that only two frames must be stored in the decoder and, secondly, that the required I- and/or P-frames are available to decode each P- and B-frame as they are received.

img

Answer: IPBBPBBBPBBIBBPBB ----

Compression Latency

When compression takes place, one or several mathematical algorithms are used to remove image data.

  • This process requires a certain amount of time, and the resulting delay is known as compression latency.
  • The more advanced the compression algorithm, the higher the latency, given the same processing power.
  • Latency of B-frame > P-frame > I-frame
    • CR of B-frame > P-frame > I-frame
    • Complexity of B-frame > P-frame > I-frame
    • B-frame need to do the motion estimation twice
  • The latency of a network, however, must be considered when designing a video broadcast system.
  • For some applications, such as the compression of studio movies, compression latency is irrelevant because the video is not viewed live.
  • In live broadcast applications, low latency is essential.

The H.264 Video Coding Standard

  • Multi-mode, multi-reference Motion Estimation
  • 4X4 block integer transform

img

  • Intra-frame Prediction

Variable Block Size Motion Estimation and Motion Compensation

The size of moving/stationary objects is variable, hence there is a trade-off in terms of the block size.

In H.264, the arrangement is flexible, each 16×16 macroblock may be:

  • Whole block
  • Two sub-blocks (16×8 or 8×16) vertically
  • 4 sub-blocks (8×8 each)
  • In the last case, each 8×8 sub-block may be divided once more, so the smallest size is 4×4.
img
  • Using smaller sub-blocks makes the motion prediction more precise.

The encoder selects the “best” partition size for each part of the frame, to minimize the coded residual and motion vectors.

The macroblock partitions chosen for each area are shown superimposed on the residual frame.

  • little change between the frames (residual appears grey) → a 16×16 partition is chosen
  • detailed motion (residual appears black or white) → smaller partitions are more efficient.
img

The 4×4 Integer Transform

img

Summary of H.264

  • Video coding is based on hybrid video coding and similar in spirit to other standards but with important differences
  • New key features are:
    • Enhanced motion compensation
    • Small blocks for transform coding
    • Enhanced entropy coding
  • Substantial bit-rate savings (up to 50%) relative to other standards for the same quality
  • Enhancement on perceptive quality seems better than that on PSNR
  • The complexity of the encoder triples that of the prior ones
  • The complexity of the decoder doubles that of the prior ones

Rate-Distortion (RD) Theory

Rate-distortion theory addresses the problem of determining the minimal number of bits per symbol, as measured by the rate R, that should be communicated over a channel so that the source (input signal) can be approximately reconstructed at the receiver (output signal) without exceeding a given distortion D.

RD curve

  • Only 4 points are used, each plotting the PSNR and bitrate at a different quantization step size.
  • The curve lets us compare the performance of two encoders.
img img
  • A higher quantization step size gives a smaller PSNR and a smaller bitrate.

The choice of encoder and operating point depends on the application.

Fast Motion Estimation

Revisit the Block Matching Motion Estimation

  • Motion Estimation is the critical step to remove temporal redundancy

Operations per SAD (Full Search Algorithm)

  • absolute-value operations |·| = 256 (16×16 pixels)
  • subtractions (-) = 256
  • additions (+) = 255
  • Total operations = 256 + 256 + 255 = 767
img

Operations per second (Full Search Algorithm)

img
  • Both the horizontal and the vertical direction need to search $2D+1$ positions
  • Number of SADs to be computed per macroblock = $(2D+1)^2$
  • Number of operations per macroblock = $(2D+1)^2 \times 767$
  • Assume a frame rate of 30 fps; number of operations per second = $30 \times \frac{W}{16} \times \frac{H}{16} \times 767(2D+1)^2$

The value of D depends on the application (see the worked example below).
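
A small worked example of the operation count above; the frame size, frame rate, and search range are assumed values chosen for illustration:

```python
def full_search_ops_per_second(width, height, D, fps=30, ops_per_sad=767):
    """Operations per second for full-search block matching with 16x16 macroblocks."""
    macroblocks = (width // 16) * (height // 16)
    sads_per_macroblock = (2 * D + 1) ** 2       # (2D+1) candidate positions in each direction
    return fps * macroblocks * sads_per_macroblock * ops_per_sad

# Assumed example: 1920x1080 at 30 fps with a search range of D = 15
print(f"{full_search_ops_per_second(1920, 1080, 15):.3e}")   # roughly 1.8e11 operations per second
```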

Fast Searching Algorithm: n-Step Hierarchical Search (n-SHS)

  • Instead of searching all locations, only a limited set of locations is searched
  • the location with the minimum SAD gives the motion vector

First, we have a search window size:

$$\text{Search Window Size } S = \left\lceil\frac{D}{2}\right\rceil$$

So we will search the 9 points:

$$(0,0),\ (0,S),\ (0,-S),\ (S,S),\ (S,0),\ (S,-S),\ (-S,S),\ (-S,0),\ (-S,-S)$$

img

After finding the point with minimum SAD,

$$\text{Second Search Window Size } S_2 = \left\lceil\frac{S}{2}\right\rceil$$

Then we use that point with minimum SAD as center to find another 9 points. (including itself)

Suppose the top-right point has the minimum SAD.

img

After finding the point with minimum SAD,

$$\text{Third Search Window Size } S_3 = \left\lceil\frac{S_2}{2}\right\rceil$$

Then we use that point with minimum SAD as center to find another 9 points. (including itself)

Suppose the top point has the minimum SAD.

img

If the point with the minimum SAD stays the same and the search size is already 1, the search has converged and we need not proceed further.

Then that is the motion vector.

Note that this algorithm saves a large number of operations (see the sketch below).
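
A minimal sketch of the n-step hierarchical search for one macroblock (SAD and the toy frames as in the full-search sketch; the 9 candidate points and the halving step size follow the description above, with a simplified stopping rule of halting once the step size reaches 1):

```python
import math
import numpy as np

def sad(current, reference, top, left, u, v, N=16):
    block = current[top:top + N, left:left + N].astype(int)
    cand = reference[top + u:top + u + N, left + v:left + v + N].astype(int)
    return np.abs(block - cand).sum()

def n_step_search(current, reference, top, left, D=8, N=16):
    """9 candidate points per step; the step size halves until it reaches 1."""
    step = math.ceil(D / 2)
    u, v = 0, 0                                      # current best displacement
    while step >= 1:
        candidates = [(u + du, v + dv) for du in (-step, 0, step) for dv in (-step, 0, step)]
        candidates = [(a, b) for a, b in candidates          # keep blocks inside the frame
                      if 0 <= top + a <= reference.shape[0] - N
                      and 0 <= left + b <= reference.shape[1] - N]
        u, v = min(candidates, key=lambda mv: sad(current, reference, top, left, *mv))
        if step == 1:
            break
        step = math.ceil(step / 2)
    return (u, v)

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64))
current = np.roll(np.roll(reference, 2, axis=0), 3, axis=1)
print(n_step_search(current, reference, top=24, left=24))
# Estimated motion vector; unlike the full search it may settle in a local minimum (see below).
```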

Global and Local Minima

The 3-step search can be trapped in a local minimum.

  • A local minimum of the distortion function gives a predicted image with higher error than the full search.
  • So, many research works have been carried out to find alternative approaches for block matching motion estimation.

img

Example of FSA and n-SHS

img

img

img

img

D = 6; the number of search positions per dimension for FSA is 2D + 1 = 13.

img

D = 6

img

H.264/AVC

Revisit H.264

  • H.264/AVC is a standard for video compression, and is equivalent to MPEG-4 Part 10, or MPEG-4 AVC (for Advanced Video Coding)

Selected new features in H.264/AVC

  • Enhanced motion estimation and compensation (ME and MC) with variable block size
  • Multiple reference frames
  • Multi-mode intra-prediction (Intra Prediction)
  • Integer block transform
  • De-blocking filter

Multiple Reference Frames

img

Multi-picture motion compensation using previously-encoded pictures as references allows up to 32 reference pictures to be used in some cases

  • Very significant bit rate reduction for scenes with
    • rapid repetitive flashing
    • back-and-forth scene cuts
    • uncovered background areas
  • Every MacroBlock partition can have a different reference picture that is more appropriate for that particular block.

Drawbacks of Multiple Reference frames

  • Both the encoder and decoder have to store the reference frames used for Inter-frame prediction in a multi-frame buffer.
  • Dramatically increases the computational complexity of the encoders because the Motion Estimation (ME) process needs to be performed for each of the reference frames

Spatial model: Intra prediction

The coding of an I-frame in MPEG-1 and MPEG-2 is very similar to that of JPEG; the coding is mainly based on the DCT.

  • Motivation: Intra-frames are natural images, so they exhibit strong spatial correlation.
img

Many different approaches to intra prediction have been proposed.

img

H.264 has two intra prediction types: I4MB and I16MB.

I4MB (Intra_4×4 type): 9 modes

Intra prediction in H.264/AVC is always conducted in the spatial domain.

  • The current macroblock/block is predicted by the adjacent pixels in the upper and the left macroblocks/blocks that were decoded earlier.

img

img

I16MB (Intra_16×16 type): 4 modes

img

  • Mode 0: vertical
  • Mode 1: horizontal
  • Mode 2: DC
  • Mode 3: plane (a linear “plane” function is fitted to the upper and left-hand samples → useful in areas of smoothly-varying luminance)
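
A minimal sketch of the three simplest I16MB prediction modes (vertical, horizontal and DC), assuming the row of pixels above and the column of pixels to the left of the current macroblock have already been decoded; the plane mode and the 4×4 directional modes are omitted for brevity:

```python
import numpy as np

def intra_16x16_prediction(top_row, left_col, mode):
    """Predict a 16x16 macroblock from its decoded upper and left neighbours."""
    if mode == 0:                                    # vertical: copy the row above downwards
        return np.tile(top_row, (16, 1))
    if mode == 1:                                    # horizontal: copy the left column across
        return np.tile(left_col.reshape(16, 1), (1, 16))
    if mode == 2:                                    # DC: mean of the 32 neighbouring samples
        dc = int(round((top_row.sum() + left_col.sum()) / 32))
        return np.full((16, 16), dc)
    raise ValueError("mode 3 (plane) is not implemented in this sketch")

# Assumed neighbouring samples for illustration
top_row = np.arange(16) * 10
left_col = np.full(16, 80)
pred = intra_16x16_prediction(top_row, left_col, mode=2)
print(pred[0, 0])    # the DC prediction: (sum of the 32 neighbours) / 32
```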

Rate-Distortion Optimization (RDO)

The SAD (Sum of Absolute Differences) is one of the simplest metrics for selecting the most similar block from the reference frame.

  • It is always good to find a block in the reference frame with the minimum distortion.

  • However, when only the distortion is optimized, the bit-rate (R) of coding the block might be high:

    • Encode more motion vectors for small block size in H.264
    • Index of multiple reference frames
  • This increases the memory required for storage or the channel capacity required for transmission.

  • We can combine these two factors by optimizing an RDO (Rate-Distortion Optimization) cost function.

  • However, the distortion (D) term and the rate (R) term are two totally different quantities and they cannot be added directly.

  • One solution is to introduce a weighting factor λ which brings these two terms into the same entity for the sake of comparison:

$$J = D + \lambda R$$

where J is the RDO cost and λ is normally called the Lagrangian multiplier. In the literature, J is also referred to as the Lagrangian rate-distortion function.
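
A minimal sketch of rate-distortion optimized mode selection: for each candidate coding mode, compute its distortion D and rate R (the mode names and numbers below are assumed, illustrative values), then pick the mode that minimizes J = D + λR:

```python
def rd_cost(distortion, rate, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R."""
    return distortion + lam * rate

# Assumed candidate modes: (name, distortion D in SAD units, rate R in bits)
candidates = [
    ("16x16 partition, 1 motion vector",   900,  30),
    ("8x8 partitions, 4 motion vectors",   520,  95),
    ("4x4 partitions, 16 motion vectors",  310, 260),
]

# A small lambda favours low distortion (fine partitions); a large lambda favours low rate.
for lam in (0.5, 5.0):
    best = min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))
    print(lam, "->", best[0], "J =", rd_cost(best[1], best[2], lam))
```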