Let's Write a Video Codec, Part 1: the DCT

Just like the Mona Lisa's smile or the catchy notes of a Beatles song, there are some things in life that never seem to grow old. They have a special kind of magic that doesn't fade, no matter how much time passes. You could call them "classic" because they have a lasting value and remain popular even today.

In the world of technology, there's something that shares this "classic" label - Discrete Cosine Transform, or DCT for short. Just like timeless art and music, DCT may seem simple and has been around for quite a while, but it's still a superstar in the field of digital image encoding.

First introduced way back in the mid-1970s, DCT has stuck around for good reason. It's a key part of many digital technologies we use every day, like JPEG images and MPEG videos. Even though tech evolves really fast, DCT has managed to stay useful and important because of its strength and versatility.

The DCT

The Discrete Cosine Transform, or DCT, is a math-based method that comes in handy for image and signal processing. Its main job is to change a signal from the spatial domain (think of this like the original, raw data) into the frequency domain (where the data is sorted by how often different components occur). In simpler terms, DCT takes an input signal and breaks it down into a bunch of cosine functions, each with its own frequency and size.

In the world of image compression, DCT has a special role to play. It examines the frequency content of an image and spots the parts that can be thrown away without making a big difference to the image's overall quality. This usually means getting rid of high-frequency information. The great thing about this process is that it reduces the amount of data needed to represent the image, but without ruining the image's quality as perceived by our eyes.

This is the mathematical definition of the DCT:

You might be wondering, does any of this have to do with images? It might not seem so right now. But don't worry - hopefully, everything will start to make sense in the next section.

Let’s write some code

You may have heard the saying that the best way to learn something is by teaching a computer to do it. In this article, we'll use Swift Playgrounds to do just that. However, don't worry if you're more comfortable with a different programming language - the code we'll be working with can be easily transferred to other languages.

import Foundation

// Define a 4x1 input array
let input: [Float] = [-1.0, -0.5, 0.5, 1.0]
// Define output arrays for DCT and IDCT
var dctOutput: [Float] = [0, 0, 0, 0]
var idctOutput: [Float] = [0, 0, 0, 0]

// Define a function to compute the DCT of a single row or column
func dct1D(_ input: [Float], _ output: inout [Float]) {
    let N = input.count
    let pi = Float.pi
    
    for k in 0 ..< N {
        var sum: Float = 0
        
        for n in 0 ..< N {
            sum += input[n] * cos(pi * Float(k) * (Float(n) + 0.5) / Float(N))
        }
        
        output[k] = sqrt(2 / Float(N)) * (k == 0 ? 1 / sqrt(2) : 1) * sum
    }
}
// Define a function to compute the IDCT of a single row or column
func idct1D(_ input: [Float], _ output: inout [Float]) {
    let N = input.count
    let pi = Float.pi
    
    for n in 0 ..< N {
        var sum: Float = 0
        
        for k in 0 ..< N {
            sum += input[k] * cos(pi * Float(k) * (Float(n) + 0.5) / Float(N))
        }
        
        output[n] = sqrt(2 / Float(N)) * sum
    }
}
// Perform DCT on the input array
dct1D(input, &dctOutput)
// Print DCT output
print("DCT Output: \(dctOutput)")
// Perform IDCT on the DCT output
idct1D(dctOutput, &idctOutput)
// Print IDCT output
print("IDCT Output: \(idctOutput)")

This prints:

DCT Output: [0.0, -1.577161, -2.1073424e-07, 0.11208505]
IDCT Output: [-1.0000002, -0.4999996, 0.4999997, 0.9999999]

"IDCT" stands for Inverse Discrete Cosine Transform. As its name suggests, this function converts the DCT output back into the spatial domain. In simpler terms, it provides us with an approximation of the original data.

There are several points of interest to note here:

The IDCT is a lossless transform, but in this example, it could not restore the initial data ([-1.0, -0.5, 0.5, 1.0]). This occurs due to numerical errors that accumulate during function computation.
After applying the DCT, the data has the same shape as the input. The function simply converts it into the frequency domain.
The first component of the DCT output corresponds to an average value of the input data (in our case, 0.0). This will be useful later.
The amplitudes of the DCT output significantly differ from the input data. The components on the right are less important for reconstructing the initial data later. Let's explore this in more detail.

After invoking dct1D, add the following line:

dctOutput = [0.0, -1.577161, 0.0, 0.0]

When you run the code again, it prints:

IDCT Output: [-1.0303301, -0.42677668, 0.4267765, 1.0303301]

As you can see, by removing two of the four original frequency domain components, we still managed to get a faithful representation of the original data, with some additional precision loss. If the resulting precision is acceptable for our use case, we have just performed a lossy compression of 50% by simply dropping half of the frequency domain data.

What’s Next?

To sum up, we've explored the Discrete Cosine Transform (DCT), understood its math, and used it for some real data compression. In the next part, we'll go a step further. We'll adapt and extend our code to work with real 2D images. Stay tuned for more!