Encoding Base128 Varints, Explained

A process of converting data from one format to another is called encoding. In this process the data and format both can be variable, means for any kind of data to any kind of format, if we do any conversion, the process will be called encoding. Based on the nature of data there are quite a few terms that are popular like media encoding, character encoding. Media Encoding refers to converting audio, video files to different formats e.g. mp3, avi, wav, etc.

Similarly, the character encoding is also very popular which refers to representing character data to machine-readable binary format e.g. ASCII, US-ASCII, UTF-8. Although term encoding refers to a verb or process, sometimes used as a noun meant encoding itself. People tend to say mp3 encoding, or mp4 encoding, where they meant that data or media file is encoding in that particular format.

Integer Encoding

In this article, we will be discussing a particular encoding for integer values. We try to understand how integer values can be represented in binary and specifically how effectively it can be done.

Everyone must have heard and read terms bytes and bits. A bit is an atomic unit a computer can measure or understand. Computers read a bit either as value 1 or value 0. The collection of 8 bits together is called a byte where each bit can be represented as either 0 or 1. So there are total

2 ^ 8 = 256

combinations of those bits. Computers and programming languages are designed to read and write the collection of those eight bits a.k.a Byte together instead of reading/writing individual bits.

Each combination of those bits represents a decimal value. e.g.

00000000

represents zero, and

00000001

represents one. Was not that complex 😄. But it gets complicated with other discrete values as bit combination

00000101

represents number

and

100000001

represents

. Understanding this representation is not that complex and can be demystified with the following table. Where each bit represents a particular decimal value and we add up all values for which bits are turned

Encoding Large Integers — The problem

With the help of the above diagram, it’s easy to convert any decimal number to binary representation and vice versa. But what about numbers greater than 256 or larger than the capacity of storage in a single byte. You may have thought obviously more bytes and you are not wrong.

We can store larger integer values say in two bytes which will be

2^16 = 65536

. So the binary representation of

will be

0001111011011000

which will be stored in two bytes and can be stored as

00011110

and

11011000

. And decimal value larger than

can also be stored in three bytes and so on. There is no limitation.

But there is a problem since bytes are read and write in a sequence to the memory while reading if found two bytes

00011110

and

11011000

we can’t be sure if these are

and

or two bytes combined as

There can be different solutions to the problem, one is to pre-define the size of the value you wrote and then read that number of bytes back. This concept is called primitive types in any programming language. Programming languages define different types and have associated fixed number of bytes, e.g. in C

short

type take 2 bytes and

long

take 8 bytes. When we store and read those values, compiler already aware of their type and read that much byte sequence.

Another solution can be defined as a delimiter or an identifier when a particular byte sequence defines the boundary. For data types with variable length, this solution is mostly used. For example in C and other languages string values are delimited by a null byte. So while reading a

string

, the compiler keeps reading the bytes until found a null byte.

This solution can also be used for larger integers or varying size integer encoding but will cause wastage storage because most of the computer operations are involved in a huge number of computations on numbers.

Base 128 Varints — The solution

To overcome the problem of representing varying length integer values Base 128 Varints technique is used. This does not require a delimiter byte, instead use a single bit for this purpose. With this technique we can save a lot of storage as well we store any number of bytes associated with an integer value.

This encoding technique is one from many of variable-length encoding and another common name associated with Base 128 Varints is LEB128 — Little Endian 128. This encoding is used DWARF file format, WebAssembly and Google Protobuf to represent integer values in smaller bytes possible.

The process of encoding any integer to

LEB128

is not that complex and consists of the following steps.

Convert the number to binary presentation
Arrange bits in the group of 7 bits keeping 8th bit empty
Fill up every 8th bit of each byte with 1 for least significant bytes (LSB)
Fill most significant byte (MSB) remaining bits with 0
Reverse the order of bytes
You have LEB128 encoded binary representation

While reading the byte sequence:

Read one byte from the sequence of bytes
Check if its 8th bit of that byte is set to 1 keep it in a buffer and then repeat step 1.
When reading a byte of which 8th bit is set to 0 we know that all bytes associated with one integer value are completed.
Reverse the order of bytes in the buffer
Now we have to remove 8th bit of each byte in the buffer
Join remaining 7 bits of each byte together
And here we have our original integer value

Was not that so easy, Take the example of a number

which have a binary representation

111010110011 01010100

. Let’s encode it with

LEB128

with the help of this diagram and the above steps.

What about LEB128 for signed integers

Signed numbers are the ones where you have to store the information if numbers were positive or negative. This information is stored in the most significant bit. While the whole binary representation is done as two’s complimentscheme. So normal binary representation of

in one byte is

00001100

but for

-12

we will do two’s complement which will be

11110100

. That’s how it is calculated:

12        Decimal 
00001100  Binary 
11110011  1's compliment (invert the bits)
11110100  2's compliment (add 1)

The rest of the calculations for LEB128 remains the same on top of this resulted binary (2’s complement of same positive number).

For decoding it back to an integer you must already aware if it was a signed integer while encoding or not. This is the reason all programming languages treat signed and unsigned data types separately. And then you just decode normally and take two’s complement of the output. That will give absolute value that you can negate to get a signed number.

In case you want to be generic signed encoding algorithm for positive or negative values you can do by taking two’s complement of any value.

Is that all

Data encoding is a vast field and the topic of integer encoding itself has a lot of dimensions. Base 128 Varint is not the only integer encoding technique. There are a lot more, few popular ones are Fibonacci Encoding, Elias gamma coding, HPACK.

If you know the distribution of input values and use of the output that you can decide upon which encoding is best for you. You can use any, develop new for yourself, or even can optimize existing for your need.

Implementation

Here is one implementation of LEB128 algorithm. Please bear in mind there can be many different approaches to solve this problem or to implement this algorithm, so feel free to develop your own.

# unsigned_leb128_encode_decode.py
# Python 3.7.4

def encode_unsigned_leb128(number):
    result = bytearray()

    # While number is greater than a byte
    while number.bit_length() > 7:
        # Get first 7 bits and append 1 on 8th bit
        single_byte = (number & 0b01111111) | 0b10000000
        # Append the byte to result
        result.append(single_byte)
        # Truncate right 7 bits
        number = number >> 7

    # Append remaining byte to result
    result.append(number)

    # As we appended earlier no need to reverse the bytes
    return result


def decode_unsigned_leb128(byte_array, offset=0):
    needle = offset
    pair_count = 0
    result = 0

    while True:
        single_byte = byte_array[needle]

        # If first bit is 1
        if single_byte & 0b10000000 == 0:
            break

        # Remove first bit
        single_byte = single_byte & 0b01111111

        # Push number of bits we already have calculated
        single_byte = single_byte << (pair_count * 7)

        # Merge byte with result
        result = result | single_byte

        needle = needle + 1
        pair_count = pair_count + 1

    # Merge last byte with result
    single_byte = single_byte << (pair_count * 7)
    result = result | single_byte

    return result

# signed_leb128_encode_decode.py
# Python 3.7.4

from unsigned_leb128_encode_decode import decode_unsigned_leb128, encode_unsigned_leb128


def encode_signed_leb128(number):
    # Higher multiple of 7
    bits_multiple_of_7 = ((number.bit_length() + 7) // 7) * 7
    twos_complement = (1 << bits_multiple_of_7) - number

    return encode_unsigned_leb128(twos_complement)


def decode_signed_leb128(byte_array, offset=0):
    number = decode_unsigned_leb128(byte_array, offset)
    # Lower multiple of 7
    bits_multiple_of_7 = (number.bit_length() // 7) * 7
    twos_complement = (1 << bits_multiple_of_7) - number

    return twos_complement

You can download the gist form this link. Share your thoughts and specially if your feedback if you found any issue with above code.

Previously published at https://basicdrift.com/explore-encoding-base-128-varints-41665a0dca36