Encoding  Base128 Varints, Explained

Written by nazarhussain | Published 2020/07/01
Tech Story Tags: encoding | development | varint | leb128 | protobuf | binary-code | learning | tutorial | web-monetization

TLDR Encoding Base128 Varints, Explained by Nazarhussain Nazar Hussain. We try to understand how integer values can be represented in binary. We can store larger integer values say in two bytes. But what about numbers greater than 256 or larger than the capacity of storage in a single byte? We will be discussing a particular encoding for integer values. The solution is to overcome the problem of representing varying integer values with any number associated with an associated number with an integer value.via the TL;DR App

A process of converting data from one format to another is called encoding. In this process the data and format both can be variable, means for any kind of data to any kind of format, if we do any conversion, the process will be called encoding. Based on the nature of data there are quite a few terms that are popular like media encoding, character encoding. Media Encoding refers to converting audio, video files to different formats e.g. mp3, avi, wav, etc.
Similarly, the character encoding is also very popular which refers to representing character data to machine-readable binary format e.g. ASCII, US-ASCII, UTF-8. Although term encoding refers to a verb or process, sometimes used as a noun meant encoding itself. People tend to say mp3 encoding, or mp4 encoding, where they meant that data or media file is encoding in that particular format.

Integer Encoding

In this article, we will be discussing a particular encoding for integer values. We try to understand how integer values can be represented in binary and specifically how effectively it can be done.
Everyone must have heard and read terms bytes and bits. A bit is an atomic unit a computer can measure or understand. Computers read a bit either as value 1 or value 0. The collection of 8 bits together is called a byte where each bit can be represented as either 0 or 1. So there are total 
2 ^ 8 = 256
 combinations of those bits. Computers and programming languages are designed to read and write the collection of those eight bits a.k.a Byte together instead of reading/writing individual bits.
Each combination of those bits represents a decimal value. e.g.
00000000 
represents zero, and 
00000001
 represents one. Was not that complex 😄. But it gets complicated with other discrete values as bit combination
00000101
 represents number 
5
 and 
100000001 
represents
129
. Understanding this representation is not that complex and can be demystified with the following table. Where each bit represents a particular decimal value and we add up all values for which bits are turned 
1
 .

Encoding Large Integers — The problem

With the help of the above diagram, it’s easy to convert any decimal number to binary representation and vice versa. But what about numbers greater than 256 or larger than the capacity of storage in a single byte. You may have thought obviously more bytes and you are not wrong.
We can store larger integer values say in two bytes which will be 
2^16 = 65536
. So the binary representation of 
7896
 will be 
0001111011011000
 which will be stored in two bytes and can be stored as 
00011110
 and
11011000
. And decimal value larger than 
65536
 can also be stored in three bytes and so on. There is no limitation.
But there is a problem since bytes are read and write in a sequence to the memory while reading if found two bytes 
00011110
 and 
11011000
 we can’t be sure if these are 
30
 and 
216
 or two bytes combined as
7896
.
There can be different solutions to the problem, one is to pre-define the size of the value you wrote and then read that number of bytes back. This concept is called primitive types in any programming language. Programming languages define different types and have associated fixed number of bytes, e.g. in C 
short
 type take 2 bytes and 
long
 take 8 bytes. When we store and read those values, compiler already aware of their type and read that much byte sequence.
Another solution can be defined as a delimiter or an identifier when a particular byte sequence defines the boundary. For data types with variable length, this solution is mostly used. For example in C and other languages string values are delimited by a null byte. So while reading a
string
, the compiler keeps reading the bytes until found a null byte.
This solution can also be used for larger integers or varying size integer encoding but will cause wastage storage because most of the computer operations are involved in a huge number of computations on numbers.

Base 128 Varints — The solution

To overcome the problem of representing varying length integer values Base 128 Varints technique is used. This does not require a delimiter byte, instead use a single bit for this purpose. With this technique we can save a lot of storage as well we store any number of bytes associated with an integer value.
This encoding technique is one from many of variable-length encoding and another common name associated with Base 128 Varints is LEB128 — Little Endian 128. This encoding is used DWARF file format, WebAssembly and Google Protobuf to represent integer values in smaller bytes possible.
The process of encoding any integer to
LEB128
is not that complex and consists of the following steps.
  1. Convert the number to binary presentation
  2. Arrange bits in the group of 7 bits keeping 8th bit empty
  3. Fill up every 8th bit of each byte with 1 for least significant bytes (LSB)
  4. Fill most significant byte (MSB) remaining bits with 0
  5. Reverse the order of bytes
  6. You have LEB128 encoded binary representation
While reading the byte sequence:
  1. Read one byte from the sequence of bytes
  2. Check if its 8th bit of that byte is set to 1 keep it in a buffer and then repeat step 1.
  3. When reading a byte of which 8th bit is set to 0 we know that all bytes associated with one integer value are completed.
  4. Reverse the order of bytes in the buffer
  5. Now we have to remove 8th bit of each byte in the buffer
  6. Join remaining 7 bits of each byte together
  7. And here we have our original integer value
Was not that so easy, Take the example of a number 
963412
 which have a binary representation 
111010110011 01010100
. Let’s encode it with
LEB128
with the help of this diagram and the above steps.

What about LEB128 for signed integers

Signed numbers are the ones where you have to store the information if numbers were positive or negative. This information is stored in the most significant bit. While the whole binary representation is done as two’s complimentscheme. So normal binary representation of 
12
 in one byte is
00001100
 but for 
-12
 we will do two’s complement which will be
11110100
. That’s how it is calculated:
12        Decimal 
00001100  Binary 
11110011  1's compliment (invert the bits)
11110100  2's compliment (add 1)
The rest of the calculations for LEB128 remains the same on top of this resulted binary (2’s complement of same positive number).
For decoding it back to an integer you must already aware if it was a signed integer while encoding or not. This is the reason all programming languages treat signed and unsigned data types separately. And then you just decode normally and take two’s complement of the output. That will give absolute value that you can negate to get a signed number.
In case you want to be generic signed encoding algorithm for positive or negative values you can do by taking two’s complement of any value.

Is that all

Data encoding is a vast field and the topic of integer encoding itself has a lot of dimensions. Base 128 Varint is not the only integer encoding technique. There are a lot more, few popular ones are Fibonacci EncodingElias gamma codingHPACK.
If you know the distribution of input values and use of the output that you can decide upon which encoding is best for you. You can use any, develop new for yourself, or even can optimize existing for your need.

Implementation

Here is one implementation of LEB128 algorithm. Please bear in mind there can be many different approaches to solve this problem or to implement this algorithm, so feel free to develop your own.
# unsigned_leb128_encode_decode.py
# Python 3.7.4

def encode_unsigned_leb128(number):
    result = bytearray()

    # While number is greater than a byte
    while number.bit_length() > 7:
        # Get first 7 bits and append 1 on 8th bit
        single_byte = (number & 0b01111111) | 0b10000000
        # Append the byte to result
        result.append(single_byte)
        # Truncate right 7 bits
        number = number >> 7

    # Append remaining byte to result
    result.append(number)

    # As we appended earlier no need to reverse the bytes
    return result


def decode_unsigned_leb128(byte_array, offset=0):
    needle = offset
    pair_count = 0
    result = 0

    while True:
        single_byte = byte_array[needle]

        # If first bit is 1
        if single_byte & 0b10000000 == 0:
            break

        # Remove first bit
        single_byte = single_byte & 0b01111111

        # Push number of bits we already have calculated
        single_byte = single_byte << (pair_count * 7)

        # Merge byte with result
        result = result | single_byte

        needle = needle + 1
        pair_count = pair_count + 1

    # Merge last byte with result
    single_byte = single_byte << (pair_count * 7)
    result = result | single_byte

    return result
  
  
# signed_leb128_encode_decode.py
# Python 3.7.4

from unsigned_leb128_encode_decode import decode_unsigned_leb128, encode_unsigned_leb128


def encode_signed_leb128(number):
    # Higher multiple of 7
    bits_multiple_of_7 = ((number.bit_length() + 7) // 7) * 7
    twos_complement = (1 << bits_multiple_of_7) - number

    return encode_unsigned_leb128(twos_complement)


def decode_signed_leb128(byte_array, offset=0):
    number = decode_unsigned_leb128(byte_array, offset)
    # Lower multiple of 7
    bits_multiple_of_7 = (number.bit_length() // 7) * 7
    twos_complement = (1 << bits_multiple_of_7) - number

    return twos_complement
You can download the gist form this link. Share your thoughts and specially if your feedback if you found any issue with above code.

Written by nazarhussain | Programmer and technical designer. Currently work on Lisk, a blockhain application platform.
Published by HackerNoon on 2020/07/01