WTF is Transcoding?

If you know someone who works in software development, live streaming, or anything around video and multimedia, you have probably heard them mention transcoding when they are among people with similar backgrounds.

To understand transcoding, just humor me and read what is written below.

Assume you are a person hailing from the stunning city of Paris. That being said, it is fair to assume that you are fluent in French.

Also assume your friend is a person hailing from Egypt and that he is fluent in Egyptian Arabic.

Now, for you and your friend to talk to each other over your phones, you both need to understand each other’s language. But that does not appear to be the case. While you are fluent in French, your friend is fluent in Egyptian Arabic.

Both you and your friend are being lazy and are not willing to put in the hard work and effort (CPU processing, in technical terms) to learn each other’s language, so the two of you are at an impasse.

The only solution is to put a translator (a third person) in between who understands both languages. If you say something in French, he will convert it into Egyptian Arabic and pass it to your friend, and vice versa.

Put in technical terms: just as there are many human languages across geographies and ethnicities, there are many phone languages too. They differ in complexity, bandwidth, coverage, quality, and processing cost, as specified by their designers and creators. These language formats that phones understand are called codecs, and they are built into the devices. A codec is chosen based on its purpose, the amount of CPU available to the end user, network conditions, and so on.

For instance, in the case of voice-only calls, the fixed-line phones that most of us saw in the 90s used the PCMA codec, conferencing in browsers like Chrome and Firefox uses the Opus codec, the smartphones (LTE/VoLTE or 5G) that we use rely mainly on AMR (narrowband), AMR-WB (wideband), or EVS (super wideband), and WhatsApp uses the Opus/SILK codec.

Just like the work of a translator with human languages, the conversion of data from one codec to another is done by software running on a server. This process is known as transcoding, and the server which performs it is the transcoding server. It is installed between the two parties who wish to communicate, and it performs the decoding and encoding of media that these parties need in order to understand each other.

Having said the above, I will also state a fairly technical definition to ensure that I do not miss anything. We can begin with the definition from this link:

Transcoding is the process of converting an audio or video file from one encoding format to another to increase the number of compatible target devices a media file can be played on.

Transcoding facilitates multimedia (audio, video, and text, among other things) communication between two different devices that could not otherwise communicate, thus increasing interoperability across our world and technologies.

Well, it is fairly simple to understand, but it is fairly complicated to integrate into a network. Unlike a human translator, who can simply be called in and introduced, a transcoding server takes a fair amount of work to implement, since most of the time a generic solution is expected. With a couple of dozen codecs or more in use, all their encoders and decoders need to be implemented. To add to this, most codecs have unique capabilities which need to be identified at runtime to get better performance.

A list of codecs available worldwide can be seen here.

Note - Simply put, codecs are ways to send voice or video packets in a compressed format, so that they need less bandwidth and do not congest the network.
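To put a number on that note, here is a tiny sketch of the arithmetic. It assumes telephone-quality audio sampled at 8 kHz, and compares 16-bit linear samples with the 8 bits per sample that G.711 (PCMA/PCMU) sends. The function name is mine, purely for illustration.

```python
def bitrate_kbps(sample_rate_hz: int, bits_per_sample: int) -> float:
    """Bitrate of a stream with a fixed number of bits per sample, in kbit/s."""
    return sample_rate_hz * bits_per_sample / 1000


# 16-bit linear PCM at 8 kHz (uncompressed telephone-band audio).
raw_pcm = bitrate_kbps(8000, 16)

# G.711 (PCMA/PCMU) sends 8 bits for every sample.
g711 = bitrate_kbps(8000, 8)

print(raw_pcm, g711, raw_pcm / g711)  # 128.0 64.0 2.0
```

So even one of the oldest codecs halves the bandwidth of the raw audio. Newer codecs squeeze much harder, at the cost of more CPU.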

Need for Transcoding

Let’s understand the need for transcoding using a hypothetical situation.

Assume Naruto (a knuckle-headed ninja) is on a mission and wants to verify some intel with Sakura, who is on a different mission. Since the mission is A-ranked, he cannot just leave the site and go to Sakura to verify his intel. If he attempts to do so, he might compromise his position, and the village may declare the mission a failure. However, Naruto does have his smartphone, and so does Sakura.

Naruto picks up his phone, excitedly dials Sakura’s number, and waits. Naruto’s mobile (or user equipment, in technical jargon) generates a request which says, ‘This message needs to be sent to Sakura, and I can communicate using the formats (codecs) A and B only.’

Sakura’s mobile (which is, unfortunately, a pretty old model) receives the request and rejects it, saying that it can communicate using format (codec) C only. This rejection is relayed back to Naruto.

Without the transcoding server, Naruto’s call fails. He is stuck, curses the phones and the networks, and the mission is declared a failure.
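The failed call above boils down to two codec lists with nothing in common. Here is a minimal sketch of that check. The function and variable names are made up for the story, and real phones negotiate with a proper signalling protocol rather than a bare list comparison.

```python
from typing import Optional, Sequence


def pick_codec(offered: Sequence[str], supported: Sequence[str]) -> Optional[str]:
    """Return the first offered codec that the other side also supports, else None."""
    for codec in offered:
        if codec in supported:
            return codec
    return None


naruto_offer = ["A", "B"]   # what Naruto's mobile can speak
sakura_supported = ["C"]    # what Sakura's old mobile can speak

chosen = pick_codec(naruto_offer, sakura_supported)
print("call fails" if chosen is None else f"call uses codec {chosen}")  # call fails
```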

Now, let’s see what happens in our hypothetical situation after the introduction of the transcoding server.

Assuming the same mission with the same circumstances, Naruto’s mobile (user equipment) initiates the call by generating a request stating, ‘This message needs to be delivered to Sakura, and I can communicate using the format (codec) A only.’

This message, instead of going straight to Sakura, first goes to the transcoding server. The server identifies Sakura’s address and sends her a modified message which states, ‘Naruto wishes to communicate with you through me,’ and lists all the codecs that the transcoding server understands, say codecs A, B, C, and D.

Sakura’s mobile accepts the message and responds that it can communicate through codec D only. The transcoding server accepts this response and relays it to Naruto, modified to say that the communication will take place through codec A.

Thus, Naruto’s mobile believes the call is established through codec A, while Sakura’s mobile believes it is established through codec D. In reality, neither has the whole picture: each phone is talking to the transcoding server, which is what makes the call possible.
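Here is a rough sketch of that two-sided negotiation, with the server agreeing on a codec with each phone separately. Everything in it (the names, the "first match wins" rule) is my simplification of the story, not how any particular server is implemented.

```python
from typing import Optional, Sequence, Tuple


def pick_codec(offered: Sequence[str], supported: Sequence[str]) -> Optional[str]:
    """Return the first offered codec that the other side also supports, else None."""
    for codec in offered:
        if codec in supported:
            return codec
    return None


def bridge_call(
    caller_offer: Sequence[str],
    callee_supported: Sequence[str],
    server_supported: Sequence[str],
) -> Optional[Tuple[str, str, bool]]:
    """Negotiate each leg on its own; report whether transcoding is needed."""
    # Leg 1: caller <-> server. The server accepts a codec the caller offered.
    caller_leg = pick_codec(caller_offer, server_supported)
    # Leg 2: server <-> callee. The server offers all it knows, the callee picks.
    callee_leg = pick_codec(server_supported, callee_supported)
    if caller_leg is None or callee_leg is None:
        return None  # even the server cannot help
    return caller_leg, callee_leg, caller_leg != callee_leg


print(bridge_call(["A"], ["D"], ["A", "B", "C", "D"]))  # ('A', 'D', True)
```

The last value in the result says whether the server actually has to transcode. If both legs end up on the same codec, it can simply pass the media along.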

So, Naruto’s mobile starts sending his voice packets encoded with the algorithm of codec A. The transcoding server receives each packet, decodes it using the algorithm of codec A, encodes it again using the algorithm of codec D, and forwards it to Sakura’s mobile. Her mobile decodes it using the algorithm of codec D, and Sakura hears Naruto’s voice.

Similarly, Sakura’s mobile starts sending her voice packets encoded with the algorithm of codec D. The transcoding server receives each packet, decodes it using the algorithm of codec D, encodes it again using the algorithm of codec A, and forwards it to Naruto’s mobile. His mobile decodes it using the algorithm of codec A, and Naruto hears Sakura’s voice.

With the transcoding server, Naruto’s call is successful. He sneaks up on his enemy and, with the appropriate ninja gear, defeats an entire clan planning revenge on the Leaf Village. Thanks to proper infrastructure and communication, Naruto once again comes out as the hero of the Village Hidden in the Leaf and continues with his life happily. With this mission, he is one step closer to becoming the Hokage.

Needless to say, in reality the list of codecs is huge, as already shared above. The algorithms for encoding and decoding packets are thoroughly tested and carefully implemented, and are available under patent licenses or open-source licenses.

Certain advanced tasks also come along once the media (RTP packets) starts flowing, like NAT learning, interception, and IPv4 to IPv6 transition, which I will explain in an upcoming blog post.

The media (RTP packets) that we hear can be inspected with the Wireshark tool, and looks like this. It shows that our media has been encoded with the PCMU codec, which is the μ-law variant of G.711. The rest of the detail in the image is not relevant to this blog post.
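If you want to poke at such a packet without Wireshark, here is a minimal sketch that reads the fixed 12-byte RTP header and names the codec from the payload type. Static payload types 0 (PCMU) and 8 (PCMA) are the standard assignments. The function name and the sample packet are mine, and the sketch ignores header extensions and padding.

```python
import struct

# Two well-known static payload types for audio.
STATIC_PAYLOAD_TYPES = {0: "PCMU", 8: "PCMA"}


def parse_rtp(packet: bytes) -> dict:
    """Parse the fixed RTP header and return its fields plus the payload."""
    if len(packet) < 12:
        raise ValueError("an RTP header is at least 12 bytes")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = b0 & 0x0F                # each CSRC entry adds 4 bytes
    header_len = 12 + 4 * csrc_count      # extensions/padding not handled here
    payload_type = b1 & 0x7F              # low 7 bits; top bit is the marker
    return {
        "version": b0 >> 6,
        "payload_type": payload_type,
        "codec": STATIC_PAYLOAD_TYPES.get(payload_type, "unknown here"),
        "sequence": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[header_len:],
    }


# A made-up packet: version 2, payload type 0, 160 payload bytes (20 ms at 8 kHz).
sample = struct.pack("!BBHII", 0x80, 0, 1, 160, 0x1234) + b"\xff" * 160
info = parse_rtp(sample)
print(info["codec"], len(info["payload"]))  # PCMU 160
```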

For a packet to pass through the transcoding server, it goes through several stages:

Any media packet is received at the transcoding server through the RTP receiver, which opens the port at which the media is supposed to arrive and listens on it. Once the packet is received, the media content, or payload, is extracted (de-payloaded) from the RTP packet. The payload can also be seen in the above image as the last field.

Once we have the extracted payload, we decode it with the decoder of the codec negotiated with the source, and after decoding we have the raw audio. Once the payload is in raw format, we can resample it if necessary, and then we encode it into the codec required at the destination. Once encoding is completed using the codec libraries, we build the complete RTP packet by passing the encoded payload through the payloader. It is then sent out to the destination through the RTP sender.
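The stages above can be sketched end to end in a few lines. To keep it runnable without any codec library, I use two toy codecs that I made up for this post: 'toy16' (big-endian 16-bit samples) stands in for the source codec and 'toy8' (8 bits per sample) for the destination codec. The resampler is the crudest one possible, and the timestamp handling is a simplification. A real server would call actual codec libraries here.

```python
import struct


def depayload(packet: bytes) -> bytes:
    """Strip the fixed 12-byte RTP header (no CSRCs or extensions assumed)."""
    return packet[12:]


def decode_toy16(payload: bytes) -> list[int]:
    """Toy source codec: big-endian signed 16-bit samples -> raw samples."""
    count = len(payload) // 2
    return list(struct.unpack(f"!{count}h", payload[: count * 2]))


def resample_half(samples: list[int]) -> list[int]:
    """Naive 2:1 downsampling: keep every other sample, no filtering."""
    return samples[::2]


def encode_toy8(samples: list[int]) -> bytes:
    """Toy destination codec: keep the top 8 bits, shifted into 0..255."""
    return bytes((s >> 8) + 128 for s in samples)


def payload_rtp(payload: bytes, payload_type: int, seq: int, ts: int, ssrc: int) -> bytes:
    """Wrap an encoded payload in a fresh fixed RTP header (version 2)."""
    header = struct.pack(
        "!BBHII", 0x80, payload_type & 0x7F, seq & 0xFFFF, ts & 0xFFFFFFFF, ssrc
    )
    return header + payload


def transcode(packet: bytes, out_payload_type: int) -> bytes:
    """Receive -> de-payload -> decode -> resample -> encode -> payload."""
    _, _, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    raw = decode_toy16(depayload(packet))
    raw = resample_half(raw)
    # Simplification: the timestamp counts samples, so halve it with the rate.
    return payload_rtp(encode_toy8(raw), out_payload_type, seq, ts // 2, ssrc)


# Payload types 96 and 97 are picked from the dynamic range for the toy codecs.
incoming = struct.pack("!BBHII", 0x80, 96, 7, 320, 0xABCD) + struct.pack(
    "!4h", 0, 256, -256, 32767
)
outgoing = transcode(incoming, 97)
print(list(outgoing[12:]))  # [128, 127]
```

The socket part (the RTP receiver and sender) is left out on purpose. The point is only the order of the stages between them.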

Transcoding is a very CPU-intensive task, and hence dedicated hardware cards are built to perform it, particularly for video.

Note - Encoding and decoding are done through ready-made libraries that are available on the internet.

My primary focus remains the same: to introduce and appreciate the beautiful things that are continuously at work in the background without us noticing them. Think of the amount of work done in something as simple as a conversation between a person with a fixed-line phone and a person with a mobile phone.

Understanding things like these makes me feel that everything beautiful comes with a ton of hard work, which is also one of my motivators. I hope it does the same for you too.

I will publish a more technically detailed blog on these topics soon for the nerds and telecom undernoobs.