Monday, June 28, 2010

Video Demystified (summary on MPEG-2/MPEG-4/H.264)

  • MPEG-2
MPEG-2 uses the YCbCr color space, supporting 4:2:0, 4:2:2 and 4:4:4 sampling. The 4:2:2 and 4:4:4 sampling options increase the chroma resolution over 4:2:0, resulting in better picture quality.

There are three types of coded pictures.
I (intra) pictures are fields or frames coded as a stand-alone still image.
P (predicted) pictures are fields or frames coded relative to the nearest previous I or P picture, resulting in forward prediction processing.
B (bidirectional) pictures are fields or frames that use the closest past and future I or P pictures as references, resulting in bidirectional prediction.

A group of pictures (GOP) is a series of one or more coded pictures intended to assist random access and editing. The GOP size is configurable during the encoding process. The smaller the GOP, the better the response to movement (since the I pictures are closer together), but the lower the compression. In the coded bitstream, a GOP must start with an I picture and may be followed by any number of I, P, or B pictures in any order. In display order, a GOP must start with an I or B picture and end with an I or P picture. For example, the display order I B B P B B P is transmitted as I P B B P B B, since each B picture needs its future reference decoded first; see the sketch below.
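A minimal C sketch of that reordering (the GOP is given as a string of picture types; open-GOP behavior at GOP boundaries is simplified away):

    /* Reorder a GOP from display order to coded order: each B picture
       is transmitted after the future anchor (I or P) it references,
       so anchors are emitted before the Bs that precede them. */
    #include <stdio.h>
    #include <string.h>

    void display_to_coded(const char *display, char *coded)
    {
        size_t n = strlen(display), out = 0, pending = 0;
        char bs[64];
        for (size_t i = 0; i < n; i++) {
            if (display[i] == 'B') {
                bs[pending++] = 'B';        /* hold Bs until the next anchor */
            } else {
                coded[out++] = display[i];  /* emit the anchor (I or P) first */
                memcpy(coded + out, bs, pending);
                out += pending;
                pending = 0;
            }
        }
        memcpy(coded + out, bs, pending);   /* trailing Bs (simplification) */
        coded[out + pending] = '\0';
    }

    int main(void)
    {
        char coded[64];
        display_to_coded("IBBPBBP", coded);
        printf("%s\n", coded);              /* prints IPBBPBB */
        return 0;
    }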

An open GOP, identified by the broken_link flag, indicates that the first B pictures (if any) immediately following the first I picture after the GOP header may not be decoded correctly (and thus not be displayed) since the reference picture used for prediction is not available due to editing.



Macroblocks
Three types of macroblocks are available in MPEG-2.
The 4:2:0 macroblock consists of four Y blocks, one Cb block, and one Cr block.
The 4:2:2 macroblock consists of four Y blocks, two Cb blocks, and two Cr blocks.
The 4:4:4 macroblock consists of four Y blocks, four Cb blocks, and four Cr blocks.
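As a rough sketch of these three layouts in C (type names are illustrative; each block is an 8x8 array of samples, per "Block size" below):

    /* Illustrative MPEG-2 macroblock layouts; a block is 8x8 samples. */
    typedef short Block[8][8];

    typedef struct {        /* 4:2:0: chroma subsampled 2:1 both directions */
        Block y[4];         /* four luma blocks covering 16x16 pixels */
        Block cb, cr;
    } Macroblock420;

    typedef struct {        /* 4:2:2: chroma subsampled 2:1 horizontally */
        Block y[4];
        Block cb[2], cr[2];
    } Macroblock422;

    typedef struct {        /* 4:4:4: full chroma resolution */
        Block y[4];
        Block cb[4], cr[4];
    } Macroblock444;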

Macroblocks in P pictures are coded using the closest previous I or P picture as a reference, resulting in two possible codings:
 - intra coding: no motion compensation
 - forward prediction: the closest previous I or P picture is the reference
Macroblocks in B pictures are coded using the closest previous and/or future I or P picture as a reference, resulting in four possible codings:
 - intra coding: no motion compensation
 - forward prediction: the closest previous I or P picture is the reference
 - backward prediction: the closest future I or P picture is the reference
 - bi-directional prediction: both the closest previous and the closest future I or P pictures are used as references

Block size: 8x8 for MPEG-2

Video Bitstream
The video bitstream has a hierarchical structure with seven layers. From top to bottom, the layers are:
 - Video Sequence
 - Sequence Header
 - Group of Pictures (GOP)
 - Picture
 - Slice
 - Macroblock (MB)
 - Block
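In the bitstream, the sequence, GOP, picture and slice layers each begin with a 32-bit start code (00 00 01 xx). A minimal C sketch that scans a buffer and labels the start codes it finds (code values are from the MPEG-2 video spec):

    #include <stdio.h>
    #include <stddef.h>

    static const char *layer_name(unsigned char code)
    {
        if (code == 0x00) return "picture header";
        if (code >= 0x01 && code <= 0xAF) return "slice";
        if (code == 0xB2) return "user data";
        if (code == 0xB3) return "sequence header";
        if (code == 0xB5) return "extension";
        if (code == 0xB8) return "GOP header";
        return "other";
    }

    void scan_start_codes(const unsigned char *buf, size_t len)
    {
        for (size_t i = 0; i + 3 < len; i++)
            if (buf[i] == 0 && buf[i+1] == 0 && buf[i+2] == 1)
                printf("offset %zu: %s (0x%02X)\n",
                       i, layer_name(buf[i+3]), buf[i+3]);
    }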

Sequence Header
A sequence header should occur about every half second. Its fields are:
 - Sequence_header_code
 - Horizontal_size_value
 - Vertical_size_value
 - Aspect_ratio_information
 - Frame_rate_code
 - Bit_rate_value
 - Vbv_buffer_size_value
 - Constrained_parameters_flag
 - Load_intra_quantizer_matrix
 - Intra_quantizer_matrix
 - Load_non_intra_quantizer_matrix
 - Non_intra_quantizer_matrix
The sequence header may be followed by:
 - Sequence Extension
   --- Extension_start_code
 - User Data
   --- User_data_start_code
   --- User Data
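A minimal C sketch of parsing the fixed-length fields listed above, assuming buf points just past the 00 00 01 B3 sequence_header_code (field widths are from the MPEG-2 video spec):

    #include <stdint.h>
    #include <stdio.h>

    typedef struct { const uint8_t *p; int bit; } BitReader;

    static uint32_t get_bits(BitReader *br, int n)
    {
        uint32_t v = 0;
        while (n--) {   /* MSB-first bit reader */
            v = (v << 1) | ((br->p[0] >> (7 - br->bit)) & 1);
            if (++br->bit == 8) { br->bit = 0; br->p++; }
        }
        return v;
    }

    void parse_sequence_header(const uint8_t *buf)
    {
        BitReader br = { buf, 0 };
        printf("horizontal_size: %u\n", get_bits(&br, 12));
        printf("vertical_size:   %u\n", get_bits(&br, 12));
        printf("aspect_ratio:    %u\n", get_bits(&br, 4));
        printf("frame_rate_code: %u\n", get_bits(&br, 4));
        printf("bit_rate:        %u x 400 bps\n", get_bits(&br, 18));
        get_bits(&br, 1);  /* marker_bit */
        printf("vbv_buffer_size: %u x 16 kbit\n", get_bits(&br, 10));
        printf("constrained_parameters_flag: %u\n", get_bits(&br, 1));
        /* load_intra/non_intra_quantizer_matrix flags and matrices follow */
    }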

Data for each group of pictures consists of a GOP header followed by picture data. A GOP header should occur about every two seconds.
Data for each picture consists of a picture header followed by slice data. If a sequence extension is present, each picture header is followed by a picture coding extension.
Data for each slice layer consists of a slice header followed by macroblock data.
Data for each macroblock layer consists of a macroblock header followed by motion vector and block data.
Data for each block layer consists of coefficient data.


The program stream, used by the DVD and SVCD standards, is designed for use in relatively error-free environments. It consists of one or more PES packets multiplexed together and coded with data that allows them to be decoded in synchronization. Program stream packets may be variable in length, and relatively long.

Data for each pack consists of a pack header followed by an optional system header and one or more PES packets.
The program stream map (PSM) provides a description of the bitstreams in the program stream, and their relationship to one another. It is present as PES packet data if stream_ID = program stream map.

A transport stream combines one or more programs, each with its own independent time base, into a single stream; the time bases of different programs within the stream may differ.
The transport stream consists of one or more 188-byte packets. The data for each packet is from PES packets, PSI (Program Specific Information) sections, stuffing bytes, or private data.
Each packet header contains a 13-bit Packet IDentifier (PID) that enables the decoder to determine what to do with the packet.
Data for each packet consists of a 4-byte packet header followed by an optional adaptation field and/or payload data.
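A minimal C sketch of pulling apart the 4-byte transport packet header (field layout from the MPEG-2 systems spec):

    #include <stdint.h>
    #include <stdio.h>

    void parse_ts_header(const uint8_t pkt[188])
    {
        if (pkt[0] != 0x47) { printf("lost sync\n"); return; }  /* sync byte */
        int payload_unit_start = (pkt[1] >> 6) & 1;
        int pid = ((pkt[1] & 0x1F) << 8) | pkt[2];              /* 13-bit PID */
        int afc = (pkt[3] >> 4) & 3;  /* 01=payload, 10=adaptation, 11=both */
        int continuity_counter = pkt[3] & 0x0F;
        printf("PID 0x%04X, start=%d, afc=%d, cc=%d\n",
               pid, payload_unit_start, afc, continuity_counter);
    }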


  • MPEG-4
MPEG-4 visual is divided into two sections.
MPEG-4 Part 2 includes the original MPEG-4 video codecs discussed in this section. MPEG-4 Part 10 specifies the "advanced video codec," also known as H.264, and is discussed in its own section below.
Like H.263 and MPEG-2, the MPEG-4 Part 2 video codecs are also macroblock, block and DCT-based.

Instead of the video "frames" or "pictures" used in earlier MPEG specifications, MPEG-4 uses natural and synthetic visual objects.
Instances of video objects at a given time are called video object planes (VOPs).

MPEG-4 Part 2 supports many visual profiles and levels; currently, only the natural visual profiles are of much interest in the marketplace.

Visual layers (from top to bottom):
VS (Visual Object Sequence) -> VO (Video Object) -> VOL (Video Object Layer) -> GOV (Group of VOPs) -> VOP (Video Object Plane)
An MPEG-4 visual scene consists of one or more video objects.
Each video object may have one or more layers to support temporal or spatial scalable coding.

Each video object can be encoded in scalable (multi-layer) or non-scalable (single-layer) form, depending on the application; each layer is represented by a video object layer (VOL).

Video object planes can be grouped together to form a group of video object planes (GOV).
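One rough way to picture this hierarchy as nested data structures (a sketch only; every field name here is illustrative):

    /* Illustrative nesting of the MPEG-4 visual layers. */
    typedef struct { int time_instant; /* plus coded texture/shape data */ } VOP;
    typedef struct { VOP *vops;    int num_vops;    } GOV;  /* group of VOPs  */
    typedef struct { GOV *govs;    int num_govs;    } VOL;  /* one scalability layer */
    typedef struct { VOL *layers;  int num_layers;  } VO;   /* one video object */
    typedef struct { VO  *objects; int num_objects; } VS;   /* the visual scene */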

  • H.264
Rather than a single major advancement, H.264 employs many new tools designed to improve performance. These include:
- Support for 8-, 10- and 12-bit 4:2:2 and 4:4:4 YCbCr
- Integer transform
- UVLC, CAVLC and CABAC entropy coding
- Multiple reference frames
- Intra prediction
- In-loop de-blocking filter
- SP and SI slices
- Many new error resilience tools

H.264 originally defined three profiles. The Baseline profile is designed for progressive video. Its tools include:
- I and P slice types
- 1/4-pixel motion compensation
- UVLC and CAVLC entropy coding
- Arbitrary slice ordering
- Flexible macroblock ordering
- Redundant slices
- 4:2:0 YCbCr format
Main profile is designed for a wide range of broadcast applications. Changes relative to the Baseline profile include:
- Interlaced pictures
- B slice type
- CABAC entropy coding
- Weighted prediction
- 4:2:2 and 4:4:4 YCbCr, 10- and 12-bit formats
- Arbitrary slice ordering not supported
- Flexible macroblock ordering not supported
- Redundant slices not supported
Extended profile is designed for mobile and Internet streaming applications. Additional tools over baseline profile include:
- B, SP and SI slice types
- Slice data partitioning
- Weighted prediction

H.264 uses the YCbCr color space, supporting 4:2:0, 4:2:2 and 4:4:4 sampling.
With H.264, the partitioning of the 16x16 macroblocks has been extended: a macroblock may be split into 16x8, 8x16 or 8x8 partitions, and each 8x8 partition may be further split into 8x4, 4x8 or 4x4 sub-partitions. Such fine granularity leads to a potentially large number of motion vectors per macroblock (up to 32) and blocks that must be interpolated (up to 96).

H.264 adds an in-loop de-blocking filter. It removes artifacts resulting from adjacent macroblocks having different estimation types and/or different quantizer scales.

The slice has greater importance in H.264 since it is now the basic independent spatial element. This prevents an error in one slice from affecting other slices.

When motion estimation is not efficient, intra prediction can be used to eliminate spatial redundancies. This technique attempts to predict the current block based on adjacent blocks. The difference between the predicted block and the actual block is then coded. This tool is very useful in flat backgrounds where spatial redundancies often exist.
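A minimal C sketch of the idea, using the DC mode of H.264's 4x4 intra prediction (predict the block as the rounded mean of its reconstructed neighbors, then code only the residual):

    #include <stdint.h>

    /* Predict a 4x4 block from the 4 samples above and the 4 to the left,
       then form the residual that is sent on to the transform stage. */
    void intra4x4_dc(const uint8_t top[4], const uint8_t left[4],
                     const uint8_t block[4][4], int16_t residual[4][4])
    {
        int sum = 0;
        for (int i = 0; i < 4; i++) sum += top[i] + left[i];
        int dc = (sum + 4) >> 3;            /* rounded mean of 8 neighbors */
        for (int y = 0; y < 4; y++)
            for (int x = 0; x < 4; x++)
                residual[y][x] = block[y][x] - dc;
    }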

H.264 adds support for multiple reference frames. This increases compression by improving the prediction process, and increases error resilience by allowing another reference frame to be used in the event that one is lost.

H.264 uses a simple 4x4 integer transform. An additional 2x2 transform is applied to the four CbCr DC coefficients. Intra-16x16 macroblocks have an additional 4x4 transform performed on the sixteen Y DC coefficients.
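A minimal C sketch of the forward 4x4 transform, W = Cf * X * Cf^T, using the standard core matrix; the scaling that H.264 folds into quantization is omitted:

    /* H.264 forward 4x4 integer transform core (no scaling/quantization). */
    static const int Cf[4][4] = {
        { 1,  1,  1,  1 },
        { 2,  1, -1, -2 },
        { 1, -1, -1,  1 },
        { 1, -2,  2, -1 },
    };

    void forward4x4(const int X[4][4], int W[4][4])
    {
        int T[4][4];
        for (int i = 0; i < 4; i++)         /* T = Cf * X */
            for (int j = 0; j < 4; j++) {
                T[i][j] = 0;
                for (int k = 0; k < 4; k++)
                    T[i][j] += Cf[i][k] * X[k][j];
            }
        for (int i = 0; i < 4; i++)         /* W = T * Cf^T */
            for (int j = 0; j < 4; j++) {
                W[i][j] = 0;
                for (int k = 0; k < 4; k++)
                    W[i][j] += T[i][k] * Cf[j][k];
            }
    }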

For everything but the transform coefficients, H.264 uses a single Universal VLC (UVLC) table based on an infinite-extent codeword set (Exponential Golomb codes).
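A minimal C sketch of decoding one unsigned Exp-Golomb codeword (n leading zero bits, a one, then n info bits; the value is 2^n - 1 + info):

    #include <stdint.h>

    typedef struct { const uint8_t *p; int bit; } BitReader;

    static int read_bit(BitReader *br)
    {
        int b = (br->p[0] >> (7 - br->bit)) & 1;
        if (++br->bit == 8) { br->bit = 0; br->p++; }
        return b;
    }

    /* e.g. bits "1" -> 0, "010" -> 1, "011" -> 2, "00100" -> 3 */
    uint32_t read_ue(BitReader *br)
    {
        int zeros = 0;
        while (read_bit(br) == 0)           /* count leading zeros */
            zeros++;
        uint32_t info = 0;
        for (int i = 0; i < zeros; i++)     /* read the info bits */
            info = (info << 1) | read_bit(br);
        return ((1u << zeros) - 1) + info;
    }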

For transform coefficients, which consume most of the bandwidth, H.264 uses Context Adaptive Variable Length Coding (CAVLC).  Based upon previously processed data, the best VLC table is selected.

Additional efficiency (5-10%) may be achieved by using Context Adaptive Binary Arithmetic Coding (CABAC). CABAC continually updates the statistics of the incoming data and adaptively adjusts the algorithm in real time, a process called context modeling.


NAL
The NAL facilitates mapping H.264 data to a variety of transport layers including:
- RTP/IP for wired and wireless Internet services
- File formats such as MP4
- H.32X for conferencing
- MPEG-2 systems
The data is organized into NAL units, packets that contain an integer number of bytes.
The first byte of each NAL unit indicates the payload data type (a forbidden_zero_bit, two bits of nal_ref_idc and five bits of nal_unit_type); the remaining bytes contain the payload data. The payload may be interleaved with emulation prevention bytes that keep a start code prefix from being accidentally generated, as sketched below.
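A minimal C sketch of both steps: splitting the first NAL byte into its fields, and stripping the emulation prevention bytes (the 0x03 inserted after each 00 00 pair):

    #include <stdint.h>
    #include <stddef.h>

    void parse_nal_byte(uint8_t b, int *nal_ref_idc, int *nal_unit_type)
    {
        /* bit 7 is forbidden_zero_bit and must be 0 */
        *nal_ref_idc   = (b >> 5) & 0x03;  /* importance as a reference */
        *nal_unit_type =  b       & 0x1F;  /* payload type, e.g. 5 = IDR slice */
    }

    /* Remove 00 00 03 -> 00 00 in place; returns the new length. */
    size_t strip_emulation(uint8_t *d, size_t len)
    {
        size_t out = 0, zeros = 0;
        for (size_t i = 0; i < len; i++) {
            if (zeros >= 2 && d[i] == 0x03) { zeros = 0; continue; }
            zeros = (d[i] == 0) ? zeros + 1 : 0;
            d[out++] = d[i];
        }
        return out;
    }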
