Let's Write a Video Codec - Part 3: Still Images

In the previous parts of this series, we've learned about the Discrete Cosine Transform, from its mathematical definition to its implementation in actual examples. Additionally, we've seen how DCT's ability to represent data in the frequency domain can enable lossy data compression.

Now, we're taking the next step. We'll finally draw the line connecting these abstract arrays of numbers we've been discussing and the actual images they can represent.

By the end of this article, we will have rediscovered approximately 80% of the actual JPEG image compression process.

The Matrix

In the previous article, we introduced the 2D-DCT. Now, let's look at a slightly adjusted version of that formula:

The variables u and v represent the horizontal and vertical offset of the value in the DCT-encoded 2D array. The function α(x) equals 1/sqrt(2) if x == 0 and 1 otherwise. Lastly, x and y represent the pixel's position in the source image.

This might not seem obvious, but the same calculation can be represented as a matrix multiplication in the following form:

In this formulation, DCT is the DCT matrix, T denotes matrix transposition, and IMG is the source image.

Bear in mind that the complexity of this operation is proportional to the size of the source image. To manage images of any size and to reduce the work, we'll limit our calculations to 8-by-8-pixel image blocks.

While real images tend to have a higher resolution, we can think of them as being composed of many independent 8x8 blocks.

Here's how we calculate the DCT matrix that produces the desired effect:

var dctMatrix: [Float] = Array(repeating: 0.0, count: 8 * 8)
for i in 0 ..< 8 {
    for j in 0 ..< 8 {
        let a: Float
        if i == 0 {
            a = sqrt(1.0 / 8.0)
        } else {
            a = sqrt(2.0 / 8.0)
        }
        dctMatrix[j * 8 + i] = a * cos((2.0 * Float(j) + 1) * Float(i) * Float.pi / (2.0 * 8))
    }
}

The value of dctMatrix is:

 0.35  0.49  0.46  0.42  0.35  0.28  0.19  0.10
 0.35  0.42  0.19 -0.10 -0.35 -0.49 -0.46 -0.28
 0.35  0.28 -0.19 -0.49 -0.35  0.10  0.46  0.42
 0.35  0.10 -0.46 -0.28  0.35  0.42 -0.19 -0.49
 0.35 -0.10 -0.46  0.28  0.35 -0.42 -0.19  0.49
 0.35 -0.28 -0.19  0.49 -0.35 -0.10  0.46 -0.42
 0.35 -0.42  0.19  0.10 -0.35  0.49 -0.46  0.28
 0.35 -0.49  0.46 -0.42  0.35 -0.28  0.19 -0.10

As you see, it's just an array of numbers. There's no need to recalculate the matrix values every time we apply the transform, so we can keep them in a constant array in our code.

This code performs 8x8 matrix multiplication:

func matrixMul8x8(m1: [Float], m2: [Float], result: inout [Float]) {
    for i in 0 ..< 8 {
        for j in 0 ..< 8 {
            var acc: Float = 0.0
            for k in 0 ..< 8 {
                acc += m1[i * 8 + k] * m2[k * 8 + j]
            }
            result[i * 8 + j] = acc
        }
    }
}

We compute the 8x8 DCT as follows:

func dct(_ block: [Float]) -> [Float] {
    var tmpBlock: [Float] = Array(repeating: 0.0, count: 8 * 8)
    var resultBlock: [Float] = Array(repeating: 0.0, count: 8 * 8)
    matrixMul8x8(m1: dctMatrixT, m2: block, result: &tmpBlock)
    matrixMul8x8(m1: tmpBlock, m2: dctMatrix, result: &resultBlock)
    return resultBlock
}

The Inverse DCT can be obtained simply by swapping the regular and transposed DCT matrices:

func idct(_ block: [Float]) -> [Float] {
    var tmpBlock: [Float] = Array(repeating: 0.0, count: 8 * 8)
    var resultBlock: [Float] = Array(repeating: 0.0, count: 8 * 8)
    matrixMul8x8(m1: dctMatrix, m2: block, result: &tmpBlock)
    matrixMul8x8(m1: tmpBlock, m2: dctMatrixT, result: &resultBlock)
    return resultBlock
}

The Big Picture

Now, we're ready to apply our knowledge to real images. Let’s work with this one:

Imagine we have the code that loads this image as an array of numbers where each represents the brightness of a pixel at the corresponding location:

func loadImage(width: Int, height: Int) -> [Float] {
   ...
}

Assuming the image is 256x256 pixels, the following values would correspond to the top-left corner:

  -2.00    3.00   -7.00   -6.00   -3.00   -5.00  -13.00   -9.00
  -2.00    4.00   -7.00   -6.00   -3.00   -5.00  -13.00   -9.00
  -3.00   -1.00   -8.00   -7.00   -6.00   -8.00  -14.00  -12.00
  -7.00  -13.00   -9.00  -15.00  -15.00  -12.00  -23.00  -22.00
 -18.00  -17.00  -11.00  -15.00  -11.00  -14.00  -20.00  -20.00
 -21.00  -19.00  -20.00  -20.00  -18.00  -17.00  -19.00  -19.00
 -16.00  -17.00  -20.00  -19.00  -17.00  -21.00  -24.00  -21.00
 -19.00  -18.00  -20.00  -22.00  -19.00  -25.00  -20.00  -22.00

For mathematical convenience, instead of expressing pixel brightness as a number from 0.0 to 255.0, we subtract 128.0. In other words, we center the value around 0.0 rather than 128.0.

Then, we apply the dct function to every 8x8 block of the image:

let values: [Float] = loadImage() // convert RGB to -128.0 ... 128.0

var destinationValues: [Float] = Array(repeating: 0.0, count: values.count)
for j in 0 ..< height / 8 {
    for i in 0 ..< width / 8 {
        let block = extractBlock(values: values, width: width, x: i * 8, y: j * 8)
        let resultBlock: [Float] = dct(block)
        
        storeBlock(values: &destinationValues, width: width, block: resultBlock, x: i * 8, y: j * 8)
    }
}

storeImage(destinationValues) // convert back to RGB

The result is this image:

Even though it's barely recognizable as a photo, some patterns (like the hat) are still visible. Interestingly, there are many black pixels. In fact, most pixels are black. Let’s zoom into one of the 8x8 blocks:

This 8x8 block illustrates what frequency-domain representation is all about. Non-zero values tend to cluster in the top-left corner of the block. The likelihood of a large value significantly decreases as we move toward the bottom-right corner.

This indicates that the top-left values carry more importance to the image.

We can demonstrate this by writing a "compression" function that eliminates some of the bottom-right values by setting them to 0.0:

func compress(_ block: [Float], level: Int) -> [Float] {
    var resultBlock: [Float] = block
    for y in 0 ..< 8 {
        for x in 0 ..< 8 {
            if x >= 8 - level || y >= 8 - level {
                resultBlock[y * 8 + x] = 0.0
            }
        }
    }
    return resultBlock
}

After adjusting our processing loop:

for j in 0 ..< height / 8 {
    for i in 0 ..< width / 8 {
        var block = extractBlock(values: values, width: width, x: i * 8, y: j * 8)
        block = dct(block)
        block = compress(block, level: 3)
        block = idct(block)
        
        storeBlock(values: &destinationValues, width: width, block: block, x: i * 8, y: j * 8)
    }
}

We get this image:

Take a moment to consider what's happened. We've eliminated 39 of the 64 values for each image block and still produced an image that closely resembles the original. This is a 60% compression ratio achieved just by multiplying matrices and deterministically discarding data!

When we discard 86% of the DCT coefficients, this happens:

Quantization

It would be incredibly wasteful to store the DCT coefficients as floating-point numbers, which usually require 4 bytes of storage. Instead, we should round them to the nearest integer. For 8x8 blocks, each value can be represented by a 2-byte integer number.

Let's revisit the block we examined earlier. Its (rounded) numerical representation is:

 209 -296  -49   43  -38   22   -6    1
  39   24  -37   11   -4   -3    2    6
 -15   16  -17    0   13   -4    0    5
  16    4    2    4   -6    4   -3   -5
 -11    4   -1    3    1   -3    6    3
   6   -2    2    4   -2   -2   -4   -1
  -6    1    0    1   -1    0    3   -1
   0    0    0    0   -1   -1   -2    1

As we confirmed visually before, larger values are more likely to occur closer to the top-left corner. However, not every block follows this pattern. Here's another block from our image:

This block doesn't adhere to the common distribution of DCT values. If we continue to blindly discard bottom-right values, some parts of the image may display more compression artifacts. To avoid that, we need to decide which blocks need more data than others.

A simple yet elegant solution capitalizes on the clustering property we discovered earlier. It's called "quantization." We divide our DCT values by different coefficients. We’ll use a table that contains a value for each DCT coefficient for convenience:

  16.50   11.50   10.50   16.50   24.50   40.50   51.50   61.50
  12.50   12.50   14.50   19.50   26.50   58.50   60.50   55.50
  14.50   13.50   16.50   24.50   40.50   57.50   69.50   56.50
  14.50   17.50   22.50   29.50   51.50   87.50   80.50   62.50
  18.50   22.50   37.50   56.50   68.50  109.50  103.50   77.50
  24.50   35.50   55.50   64.50   81.50  104.50  113.50   92.50
  49.50   64.50   78.50   87.50  103.50  121.50  120.50  101.50
  72.50   92.50   95.50   98.50  112.50  100.50  103.50   99.50

The more important positions correspond to lower divisors, and vice versa. We perform the dequantization step by multiplying the quantized data by the coefficients from the table.

These functions perform quantization and dequantization steps respectively:

func quantize(_ block: [Int], table: [Float]) -> [Int] {
    var result: [Int] = Array(repeating: 0, count: block.count)
    for i in 0 ..< block.count {
        result[i] = Int(round(Float(block[i]) / table[i]))
    }
    return result
}

func dequantize(_ block: [Int], table: [Float]) -> [Float] {
    var result: [Float] = Array(repeating: 0, count: block.count)
    for i in 0 ..< block.count {
        result[i] = Float(block[i]) * table[i]
    }
    return result
}

Let’s incorporate quantization and dequantization into our code:

block = dct(block)
var iblock = rounded(block)
iblock = quantize(iblock, table: testTable)
block = dequantize(iblock, table: testTable)
block = idct(block)

This is how the first block looks when quantized:

  13  -26   -5    3   -2    1    0    0
   3    2   -3    1    0    0    0    0
  -1    1   -1    0    0    0    0    0
   1    0    0    0    0    0    0    0
  -1    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0

And this is the second block:

  -5   18  -10    0    1    1    1    0
   0    3    1   -2    0    1    0    0
  -4    0    1    0    1    1    0    0
  -1    0    0   -2    0    0    0    0
   0    1   -1    0    0    0    0    0
   0    0    1    0    0    0    0    0
   0    0    0    0    0    0    0    0
   0    0    0    0    0    0    0    0

You'll notice that there are more non-zero values in the second block. Thus, the quantization step allows us to store more useful data and discard the rest.

Conclusion and What’s Next?

From basic math, we've made our way to encoding and decoding real images. We visually saw how the DCT converts arbitrary data into a set of frequencies and how some frequencies are more important than others.

After we got comfortable with basic compression, we introduced the concept of quantization, which further reduces storage requirements.

In the next part, we'll move to RGB images and explore ways to compress the resulting data even further.

The code for this article can be found at https://github.com/petertechstories/image-dct