MPEG video compression technique

a brief discussion

The MPEG compression technique is described here as long it is necessary to understand the Java implementation problems.

A MPEG "film" is a sequence of three kinds of frames:

frametype.gif

The I-frames are intra coded, i.e. they can be reconstructed without any reference to other frames. The P-frames are forward predicted from the last I-frame or P-frame, i.e. it is impossible to reconstruct them without the data of another frame (I or P). The B-frames are both, forward predicted and backward predicted from the last/next I-frame or P-frame, i.e. there are two other frames necessary to reconstruct them. P-frames and B-frames are referred to as inter coded frames.

That means a Java program must buffer at least three frames: One for forward prediction and one for backward prediction. The third buffer contains the frame coming into being. As the figure shows the frame for backward prediction follows the predicted frame. That would require to suspend the decoding of B-frames till the next P- or B- frame appears. But fortunately the display order is not the coding order. The frames appear on MPEG data stream in such an order that the referred frames precede the referring frames.

As an example the frame sequence above is transfered in the following order: I P B B B P B B B. The only task of the decoder is to reorder the reconstructed frames. To support this an ascending frame number comes with each frame (modulo 1024).

What does "prediction" mean?

motion.gif

Imagine an I-frame showing a triangle on white background! A following P-frame shows the same triangle but at another position. Prediction means to supply a motion vector which declares how to move the triangle on I-frame to obtain the triangle in P-frame. This motion vector is part of the MPEG stream and it is divided in a horizontal and a vertical part. These parts can be positive or negative. A positive value means motion to the right or motion downwards, respectively. A negative value means motion to the left or motion upwards, respectively.

The parts of the motion vector are in an range of -64 ... +63. So the referred area can be up to 64x64 pixels away.


But this model assumes that every change between frames can be expressed as a simple displacement of pixels. But the figure to the right shows this isn't true. The red rectangle is shifted and rotated by 5° to the right. So a simple displacement of the red rectangle will cause a prediction error. Therefore the MPEG stream contains a matrix for compensating this prediction error. predict.gif

Thus, the reconstruction of inter coded frames goes ahead in two steps:

  1. Application of the motion vector to the referred frame;
  2. Adding the prediction error compensation to the result;
reconstruct.gif

Note that the prediction error compensation requires less bytes than the whole frame because the white parts are zero and can be discarded from MPEG stream. Furthermore the DCT compression (see later in this chapter) is applied to the prediction error which decreases its memory size.

Note also the different meanings of the two + - signs. The first means adding the motion vector to the x-, y- coordinates of each pixel. The second means adding an error value to the color value of the appropriate pixel.

But what if some parts move to the left and others to the right ?

The motion vector isn't valid for the whole frame. Instead of this the frame is divided into macro blocks of 16x16 pixels. Every macro block has its own motion vector. Of course, this does not avoid contradictory motion but it minimizes its probability.

And if contradictory motion occurs? One of the greatest misunderstandings of the MPEG compression technique is to assume that all macro blocks of P-frames are predicted. If the prediction error is to big the coder can decide to intra-code a macro block. Similarly the macro blocks in B-frames can be forward predicted or backward predicted or forward and backward predicted or intra-coded.

macroblocks.gif
blocks.gif

Every macro block contains 4 luminance blocks and 2 chrominance blocks. Every block has a dimension of 8x8 values. The luminance blocks contain information of the brightness of every pixel in macro block. The chrominance blocks contain color information. Because of some properties of the human eye it isn't necessary to give color information for every pixel. Instead 4 pixels are related to one color value. This color value is divided into two parts. The first is in Cb color block the second is in Cr color block. The color information is to be applied as shown in the picture to the left.

Depending on the kind of macro block the blocks contain pixel information or prediction error information as mentioned above. In any case the information is compressed using the discrete cosine transform (DCT).


The discrete cosine transform(DCT)

As mentioned above the 8x8 block values are coded by means of the discrete cosine transform. To explain this regard the figure to the right. This shall be an enlargement of an 8x8 area of a black-white frame. p_anf.gif
120108907569738289
127115978175798895
13412210589838796103
13712510792869099106
13111910186808393100
117105877265697885
10088705549536269
8977594438425158
The normal way is to determine the brightness of each of the 64 pixels and to scale them to some limits, say from 0 to 255*, whereby "0" means "black" and "255" means "white". We can also represent the values in form of an 8x8 bar diagram. Normally the values are processed line by line. That requires 64 byte of storage. anf.gif

*(In MPEG a range from -256 to 255 is used.)

But you can define all the 64 values by only 5 integers if you apply the following formula called discrete cosine transform (DCT)

dct.png

Where f(x,y) is the brightness of the pixel at position [x,y]. The result is F an 8x8 array, too:
7009010000000
900000000
-890000000
00000000
00000000
00000000
00000000
00000000
But as you can see: almost all values are equal to zero. Because the non-zero values are concentrated at the upper left corner the matrix is transfered to the receiver in zigzag scan order. That would result in:
700 90 90 -89 0 100 0 0 0 .... 0
Of course, the zeros are not transferred. An End-Of-Block sign is coded instead.
zigzag.gif

The decoder can reconstruct the pixel values by the following formula called inverse discrete cosine transform (IDCT):

idct.png

Where F(u,v) is the transform matrix value at position [u,v]. The results are exactly the original pixel values. Therefore the MPEG compression could be regarded as loss-less. But that isn't true, because the transformed values are quantized. That means they are (integer) divided by a certain value greater or equal 8 because the DCT supplies values up to 2047. To reduce them under the byte length at least the quantization value 8 is applied. The decoder multiplies the result by the same value. Of course the result differs from the original value. But again because of some properties of the human eye the error isn't visible. In MPEG there is a quantization matrix which defines a different quantization value for every transform value depending on its position.

Was it a well chosen example ?

No, it wasn't! The DCT always tends to compute zeros. This effect is assisted by the quantization which zeros small values. To understand this one must recognize the essence of the DCT. To do this let's begin from the opposite side.
Let us apply the IDCT to a matrix only containing one value of 700 at the upper left corner: The result of the IDCT is And the bar diagram looks like
7000000000
00000000
00000000
00000000
00000000
00000000
00000000
00000000
8787878787878787
8787878787878787
8787878787878787
8787878787878787
8787878787878787
8787878787878787
8787878787878787
8787878787878787
const.gif

Of course, the picture is an grey colored square. The value at the upper left corner is called the DC value. This is the abbreviation for direct current and refers to a similar phenomenon in the theory of alternating current where an alternating current can have a direct component. In DCT the DC value determines the average brightness in block. All other values describe the variation around this DC value. Therefore they are sometimes referred as to AC values (from "alternating current").

Now let's add an
AC value of 100
The result of
the IDCT is
And the bar diagram looks like
700100000000
00000000
00000000
00000000
00000000
00000000
00000000
00000000
105102979184787370
105102979184787370
105102979184787370
105102979184787370
105102979184787370
105102979184787370
105102979184787370
105102979184787370
first.gif

The resulting picture looks like p_first.gif As you can see the values vary around the DC value of 87. Furthermore if you regard the shape of the bar diagram you'll see a curve like a half cosine line. It is said the picture has a frequency of 1 in X-direction. Imagine a car that drives with constant speed from left to right along the "bar diagram street" parallel to X-Axis. In contrast to the DC example the car is shaked with a certain frequency but only if it follows the X-Axis. In Y direction it moves along the same height during the whole way. This behaviour is documented in the transform matrix because the only AC value of 100 appears in X direction.

Now let's consider what happens
if we place the AC value of 100
at the next position
The result of
the IDCT is
And the bar diagram looks like
700010000000
00000000
00000000
00000000
00000000
00000000
00000000
00000000
104948171718194104
104948171718194104
104948171718194104
104948171718194104
104948171718194104
104948171718194104
104948171718194104
104948171718194104
second.gif

The resulting picture looks like p_second.gif. The shape of the bar diagram shows a cosine line, too. But now we see a full period, i.e. the frequency is as twice as high as in the first example. This behaviour would continue if we replace the AC-value step by step to the right. Every step increases the frequency of the cosine wave.

But what happens if we
place both AC values
The result of
the IDCT is
And the bar diagram looks like
70010010000000
00000000
00000000
00000000
00000000
00000000
00000000
00000000
121109917568718086
121109917568718086
121109917568718086
121109917568718086
121109917568718086
121109917568718086
121109917568718086
121109917568718086
f_s.gif

The resulting picture looks like p_f_s.gif.Regarding the shape of the bar diagram you can see a mix of both, the first and the second cosine wave. Indeed, the resulting AC value is simply an addition of the cosine lines.

Now let's add a AC
value at the other direction
The result of
the IDCT is
And the bar diagram looks like
70010010000000
2000000000
00000000
00000000
00000000
00000000
00000000
00000000
156144125109102106114121
15113812010497100109116
14112911094879199106
128116978275788693
114102846861647380
10289715548516067
9280614538425057
8674564033364552
fi_fi.gif

The resulting picture looks like p_fi_fi.gif Now the values vary in Y direction, too. The principle is: The higher the index of the AC value the greater is the frequency.

Now as a last example let's place an AC value at the opposite side of the DC value. We already know what it means: The highest possible frequency of 8 is applied in both, the X- and the Y- direction.
What is to be expected? The result of
the IDCT is
And the bar diagram looks like
9500000000
00000000
00000000
00000000
00000000
00000000
00000000
0000000500
1241051399514398132114
105157611875117680132
13961205172213217698
9518717239022151143
1435122102391718795
98176322211720561139
132801765118761157105
1141329814395139105124
heigh.gif

Because of the high frequency the neighbouring values differ numerously. The picture shows a checker-like appearance p_heigh.gif. Note that this shall be a 8x8 pixel enlargement in a real picture! How often does it happen? We can hope that such a case is very seldom. And that's why the DCT computes in almost every case zeros for the higher frequencies.


upBack to Java-inline-MPEG-Player