Let's Write a Video Codec, Part 2: the DCT in Two Dimensions

Continuing our media encoding exploration series, we'll transition from one-dimensional scenarios to those involving two dimensions. In the previous article, we unpacked how the Discrete Cosine Transform (DCT) functions in one dimension. Now, let's broaden our view to include its application in two-dimensional contexts, as commonly seen in image and video compression.

We're going to keep the tough math stuff as simple as we can. By the end of this, you'll know more about how DCT works in two dimensions and how to use it in different ways.

Two Dimensions

The procedure for extending the 1D-DCT to 2D is straightforward. It involves applying the 1D-DCT twice. The first application is performed on the rows of the image. Here, we treat each row as a separate signal and apply the 1D-DCT. The result is an intermediate image where each pixel's value reflects the frequency content of the row it's part of.

Next, we apply the 1D-DCT to the columns of the intermediate image. This process is similar to the first one, only now we treat each column as a separate signal. The result is a two-dimensional frequency domain representation of the original image.

To make this clearer, let's imagine that our image is a grid of numbers, where each number is a pixel value. When we apply the 2D-DCT, we replace each number (or pixel) in the grid with a new number that represents the amount and type of variation in the row and column of the original pixel.

Ultimately, the 2D-DCT gives us a new way to look at images and videos - not as a collection of individual pixels, but as a sum of different frequency components. This view is very useful for image and video compression because it allows us to identify and discard those components that don't contribute much to the image as perceived by the human eye.

Getting back to the code

Let's now dive into incorporating the mathematical definition of the 2D-DCT that we've just learned into the code from our previous article.

To kick off, we will extend the initial value into a 2D array and write a helper function that prints 2D arrays:

import Foundation

let input: [[Float]] = [
    [1.0, 0.5, 0.5, 1.0],
    [0.5, 0.5, 1.0, 0.5],
    [1.0, 0.5, 0.5, 1.0],
    [1.0, 1.0, 0.5, 1.0],
]

func toString(_ values: [[Float]]) -> String {
    return values.reduce(into: "") { current, row in
        if !current.isEmpty {
            current += "\n"
        }
        current += row.map { String(format: "%1.02f", $0) }.joined(separator: " ")
    }
}

Next up, we'll borrow the DCT and Inverse DCT functions that we had defined in the previous chapter:

func dct(_ x: [Float]) -> [Float] {
    let N = x.count
    return (0 ..< N).map { k in
        let sum = (0 ..< N).reduce(0.0) { accum, n in
            accum + x[n] * cos(Float.pi / Float(N) * (Float(n) + 0.5) * Float(k))
        }
        return sum
    }
}

func idct(_ X: [Float]) -> [Float] {
    let N = X.count
    let norm = 2.0 / Float(N)
    let firstNorm = sqrt(1.0 / Float(N))
    return (0 ..< N).map { n in
        let sum = (1 ..< N).reduce(0.0) { accum, k in
            accum + X[k] * cos(Float.pi / Float(N) * (Float(n) + 0.5) * Float(k))
        }
        return firstNorm * X[0] / 2.0 + norm * sum
    }
}

Following this, we need to define two helper functions that apply another (arbitrary) function to either rows or columns of a 2D array:

func applyByRow(_ value: [[Float]], _ f: ([Float]) -> [Float]) -> [[Float]] {
    var result: [[Float]] = Array(repeating: [], count: value[0].count)
    for i in 0 ..< value.count {
        result[i] = f(value[i])
    }
    return result
}

func applyByColumn(_ value: [[Float]], _ f: ([Float]) -> [Float]) -> [[Float]] {
    var result: [[Float]] = Array(repeating: Array(repeating: 0.0, count: value[0].count), count: value.count)
    for i in 0 ..< 4 {
        let column = value.map { $0[i] }
        let transformedColumn = f(column)
        for j in 0 ..< 4 {
            result[j][i] = transformedColumn[j]
        }
    }
    return result
}

At first glance, their function might not be immediately apparent, so let's dissect this. Assume we have a 2D array:

var exampleArray: [[Float]] = [
    [1, 2],
    [3, 4]
]

We then introduce a function that calculates the sum of all elements within a 1D array and replaces each element with this sum:

// e.g. sumItems([1.0, 2.0]) returns [3.0, 3.0]
func sumItems(_ items: [Float]) -> [Float] {
    var sum: Float = 0.0
    for item in items {
        sum += item
    }
    return Array<Float>(repeating: sum, count: items.count)
}

If we apply this function to each row and then each column respectively, we observe different results:

let sumByRow = applyByRow(exampleArray, { row in
    return sumItems(row)
})
// sumByRow = [
//  3.00, 3.00,
//  7.00, 7.00
// ]

let sumByColumn = applyByColumn(exampleArray, { row in
    return sumItems(row)
})
// sumByColumn = [
//  4.00, 6.00,
//  4.00, 6.00
// ]

Taking the 2D-DCT to the Field

Now, with all our preparations complete, we're ready to see how our code performs on real data:

print("Input:\n" + toString(input))

var value = input
value = applyByRow(value, dct)
print("Forward DCT by row:\n" + toString(value))

value = applyByColumn(value, dct)
print("Forward DCT by column:\n" + toString(value))

value = applyByColumn(value, idct)
print("Inverse DCT by column:\n" + toString(value))

value = applyByRow(value, idct)
print("Inverse DCT by row:\n" + toString(value))

This prints:

Input:
1.00 0.50 0.50 1.00
0.50 0.50 1.00 0.50
1.00 0.50 0.50 1.00
1.00 1.00 0.50 1.00

Forward DCT by row:
3.00 0.00 0.71 0.00
2.50 -0.19 -0.35 0.46
3.00 0.00 0.71 0.00
3.50 0.19 0.35 -0.46

Forward DCT by column:
12.00 0.00 1.41 0.00
-0.65 -0.25 -0.08 0.60
0.71 0.27 0.50 -0.65
0.27 0.10 1.12 -0.25

Inverse DCT by column:
3.00 0.00 0.71 0.00
2.50 -0.19 -0.35 0.46
3.00 0.00 0.71 0.00
3.50 0.19 0.35 -0.46

Inverse DCT by row:
1.00 0.50 0.50 1.00
0.50 0.50 1.00 0.50
1.00 0.50 0.50 1.00
1.00 1.00 0.50 1.00

Firstly, we apply the 1D-DCT to each row of the input, followed by applying the same 1D-DCT to each column of the resultant data. The array labeled “Forward DCT by column” is the result of applying the 2D-DCT transform to the input. This is typically the representation we would store on a disk or transmit over a network. The next two steps display the application of the inverse 1D-DCT transform, firstly to each column, then to each row of the intermediate representation. As anticipated, the final output is identical to the original input data.

If we examine the mathematical formula of the 2D-DCT, it becomes clear that the order in which we perform the encoding and decoding steps is irrelevant.

That is, we can reverse the order of the IDCT steps, and still achieve the same result:

...
value = applyByRow(value, idct)
print("Inverse DCT by row:\n" + toString(value))

value = applyByColumn(value, idct)
print("Inverse DCT by column:\n" + toString(value))

// This prints
Inverse DCT by row:
3.50 2.50 2.50 3.50
-0.19 -0.46 0.19 -0.19
0.35 0.35 -0.35 0.35
0.46 -0.19 -0.46 0.46
Inverse DCT by column:
1.00 0.50 0.50 1.00
0.50 0.50 1.00 0.50
1.00 0.50 0.50 1.00
1.00 1.00 0.50 1.00

This is an important property that we'll revisit during the optimization stage.

Embracing 2D Compression

Just like in the 1-dimensional scenario, it's trivial to compress DCT-2D-encoded data by eliminating some of the values from the intermediate representation.

Suppose we have the following DCT-encoded data:

Uncompressed DCT:
12.00 0.00 1.41 0.00
-0.65 -0.25 -0.08 0.60
0.71 0.27 0.50 -0.65
0.27 0.10 1.12 -0.25

A few of the values here are relatively small. Let’s neutralize them:

for i in 0 ..< 4 {
    for j in 0 ..< 4 {
        if abs(value[i][j]) < 0.26 {
            value[i][j] = 0.0
        }
    }
}
print("Compressed DCT:\n" + toString(value))

Compressed DCT:
12.00 0.00 1.41 0.00
-0.65 0.00 0.00 0.60
0.71 0.27 0.50 -0.65
0.27 0.00 1.12 0.00

And there you have it. Although we've eliminated 25% of the values, the result of the Inverse 2D-DCT applied to the compressed values is:

1.07 0.48 0.49 0.96
0.53 0.57 0.92 0.48
0.97 0.43 0.58 1.02
0.93 1.02 0.51 1.04

This result bears a close resemblance to our original data, with only a minor deviation:

1.00 0.50 0.50 1.00
0.50 0.50 1.00 0.50
1.00 0.50 0.50 1.00
1.00 1.00 0.50 1.00

These slight differences are a trade-off of the compression process, and in many applications, this minor loss in data fidelity is often deemed acceptable given the significant savings in data storage and transmission costs.

In Conclusion

It might seem abstract now, but these arrays of numbers are just simplified representations of what we'll be dealing with when we apply these techniques to actual images and videos. Remember, at their core, digital images are nothing more than two-dimensional arrays of numbers, where each number represents a pixel's color.

So, keep your focus sharp for the next part, where theory meets practice, and where numbers finally develop into the 'big picture'.