Why Is Inverse Quantization and Inverse Transform Taken Before Motion Estimation in H.264?

Submitted by: @import:stackexchange-cs·Mar 10, 2026·

Problem

Here is the block diagram of the pipeline for the H.264 encoder (Fundamentals of Multimedia by Ze-Nian Li, Mark S. Drew and Jiangchuan Liu).

The feedback loop is used for motion estimation, but why is it taken after the transform and quantization steps? Why introduce the overhead of the inverse functions?

I know there's likely a very good reason for it, and this might sound like a stupid question, but I'd appreciate any answer that nudges me in the right direction.

Thanks!

Solution

Inverse quantization and inverse transform are how the encoder keeps track of which pixel values the decoder will have in its buffer at any given point in time. If you skip that step, you make the encoder a tiny bit faster, but the resulting compressed video will be of much lower quality.

Let $S_i$ be the sequence of source/input frames. Let $\Delta_i$ be the lossy-compressed difference between each frame and the previous. Let $O_i$ be the decoded/output frames. Let $Q_i$ be the quantization noise. And for simplicity assume that the motion vectors are all zero.

In the normal encoder workflow, the encoder tries to set $\Delta_i = S_i - O_{i-1}$; but it can't do so exactly due to quantization; so it actually ends up with $\Delta_i = S_i - O_{i-1} + Q_i$.

The decoder adds up the deltas, and gets $O_i = O_{i-1} + \Delta_i = S_i + Q_i$. In particular, the difference between an output frame and the corresponding input frame is at most 1 quantizer unit, as it should be.

Now consider your proposed change to the encoder workflow. The encoder tries to set $\Delta_i = S_i - S_{i-1}$; but it can't do so exactly due to quantization; so it actually ends up with $\Delta_i = S_i - S_{i-1} + Q_i$. The encoder doesn't try to measure the value of $Q$ or adjust any future decisions to compensate, but quantization noise is still affecting the output frame just as much as before.

Assuming the first frame is reproduced exactly ($O_0 = S_0$), the decoder gets $O_1 = O_0 + \Delta_1 = S_1 + Q_1$; $O_2 = O_1 + \Delta_2 = S_2 + Q_1 + Q_2$; and in general $O_i = S_i + \sum_{j=1}^i Q_j$. Each output frame is farther and farther from the corresponding input frame as time goes on. (Until it's reset by an intra-frame, but those can be hundreds of frames apart, so that's still a lot of time to accumulate error.)
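The two identities above can be checked numerically. The following is a minimal sketch, not a real H.264 pipeline: it treats each frame as a single integer sample, assumes zero motion vectors and a lossless first frame, and uses a hypothetical nearest-multiple quantizer. The names `quantize` and `simulate` and the test signal are made up for illustration.

```python
from itertools import accumulate

def quantize(residual, step):
    # Hypothetical uniform quantizer: nearest multiple of `step`.
    return step * round(residual / step)

def simulate(source, step, closed_loop):
    """Return (errors, noise) with errors[i] = O_i - S_i and noise[i] = Q_i."""
    output = source[0]            # assumption: frame 0 is lossless, so O_0 = S_0
    errors, noise = [0], [0]
    for i in range(1, len(source)):
        # Closed loop predicts from the decoded frame O_{i-1};
        # open loop (the proposed change) predicts from the source frame S_{i-1}.
        reference = output if closed_loop else source[i - 1]
        exact = source[i] - reference
        delta = quantize(exact, step)
        noise.append(delta - exact)   # Q_i
        output += delta               # what the decoder reconstructs: O_i
        errors.append(output - source[i])
    return errors, noise

source = [(37 * i) % 101 for i in range(50)]   # arbitrary integer test signal
step = 8
closed_errors, closed_noise = simulate(source, step, closed_loop=True)
open_errors, open_noise = simulate(source, step, closed_loop=False)

# Closed loop: O_i - S_i = Q_i, so the error never exceeds one rounding error.
print(closed_errors == closed_noise)
# Open loop: O_i - S_i is the running sum of all Q_j so far.
print(open_errors == list(accumulate(open_noise)))
```

Both comparisons hold by construction, which is the point: in the closed loop each frame's error is just that frame's own rounding error, while in the open loop every past rounding error stays in the picture.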

If we assume that the quantization noise is zero-mean and uncorrelated, then the error grows like $\sqrt i$, as in a random walk, because the variances of the individual terms add. Under more realistic assumptions, where the noise can be correlated, it's even worse and might grow as fast as linearly.
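To spell out the growth rates, here is a short derivation under simplifying assumptions that are not in the original argument: each $Q_j$ has zero mean and the same variance $\sigma^2$. Write $E_i = O_i - S_i = \sum_{j=1}^i Q_j$ for the accumulated error of the open-loop encoder. If the $Q_j$ are uncorrelated, the cross terms vanish:

$$\operatorname{Var}(E_i) = \sum_{j=1}^i \operatorname{Var}(Q_j) = i\,\sigma^2, \qquad \sqrt{\operatorname{Var}(E_i)} = \sigma\sqrt{i}.$$

At the other extreme, if the noise is perfectly correlated ($Q_j = Q$ for all $j$), then

$$E_i = i\,Q, \qquad \sqrt{\operatorname{Var}(E_i)} = \sigma\, i,$$

which is the linear growth mentioned above.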

To be more concrete and worst-case, suppose we have a video that starts with an all-black frame (i.e. luma=0), and slowly fades to white (luma=255) over the course of the next 255 frames, at a rate of +1 luma per frame. And suppose the quantizer is 2.

The standard encoder records a delta whenever the next input frame differs from the previous encoded frame by 1 quantizer (i.e. +2 luma in this case). The result is a fade-in of +2 luma every 2 frames. Not quite as smooth as the input video, but basically matches the intended scene.

Your proposed encoder would look at each input frame; see that it's only 1 luma of difference from the previous frame, which is smaller than the quantizer and thus rounds to 0; so you'd fail to record any change at all. The end of the scene would still be showing black when it should be white.
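The fade example can be reproduced with a few lines of Python. This is a sketch under the same simplifications as the example: one luma value per frame, zero motion, a lossless first frame, and a hypothetical quantizer that truncates the residual toward zero to a multiple of the step (so a difference of 1 with a step of 2 becomes 0). The function names are illustrative only.

```python
def quantize(residual, step):
    # Hypothetical quantizer: truncate toward zero to a multiple of `step`.
    return int(residual / step) * step

def encode_decode(source, step, closed_loop):
    """Return the frames the decoder would reconstruct."""
    decoded = [source[0]]  # assumption: first frame is sent losslessly
    for i in range(1, len(source)):
        # Closed loop: predict from the previous *decoded* frame.
        # Open loop: predict from the previous *source* frame.
        reference = decoded[i - 1] if closed_loop else source[i - 1]
        delta = quantize(source[i] - reference, step)
        decoded.append(decoded[i - 1] + delta)
    return decoded

source = list(range(256))  # fade from luma 0 to 255, +1 per frame
closed = encode_decode(source, step=2, closed_loop=True)
open_loop = encode_decode(source, step=2, closed_loop=False)

print(closed[:6], closed[-1])        # [0, 0, 2, 2, 4, 4] 254
print(open_loop[:6], open_loop[-1])  # [0, 0, 0, 0, 0, 0] 0
```

The closed-loop output climbs in steps of 2 and never strays more than 1 luma from the source, while the open-loop output stays black for the whole scene.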

Context

StackExchange Computer Science Q#163275, answer score: 2
