Determining file format using Python

Background

Hello everyone! Recently I ran into a problem: for unexplained reasons, my memory card began moving all files to the LOST.DIR folder and stripping their extensions. Over time, more than 500 files of different types accumulated there: pictures, video, audio, documents. It was impossible to tell the format of each file by hand, so I started looking for a way to solve the problem programmatically.

Looking for a solution

I did not want to use ready-made solutions in the form of web services or programs, so I decided to write a console utility that would go through all the files and assign the extensions automatically. I chose Python for the utility. The search for suitable modules and libraries did not bring results, for several reasons:

Lack of support from the developer
Excessive functionality
Lack of support for new versions of Python
Excessive code complexity

Of the many libraries, python-magic is very popular (almost 1000 stars on GitHub). It’s a wrapper for the libmagic library. However, python-magic cannot be used on Windows without a DLL build of that Unix library, so this option wasn’t good enough.

Solving the problem

Given the above, I decided not to use third-party libraries and modules and to solve the problem without them. After a short search for information on how to implement this, the most reliable approach I found was to determine the format from the file’s signature, also called a “magic number”.

A file signature is a sequence of bytes, usually near the beginning of the file, that identifies the file format. In hexadecimal notation a signature looks like this:

50 4D 4F 43 43 4D 4F 43

Fortunately, there are two good sites on the Internet with a lot of signatures for different formats. I focused on the most common formats. As it turned out, some signatures are shared by several file formats, such as the signature of Microsoft Office files. Because of this, in some cases it is necessary to return a list of suitable file extensions.

print(get("D:\\some_ms_office_document")) # prints ['doc', 'ppt', 'xls']

Also, signatures are often offset from the beginning of the file, as in the 3GP multimedia container.

1. Compiling a list of data

For the data list, I decided to use a JSON file with a ‘data’ key whose value is an array of objects of the following form:

{"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1", "FF D8 FF E2", "FF D8 FF E8"]}

Where:

format — file format;
offset — offset of the signature from the beginning of the file;
signature — an array of suitable signatures for the specified file format.
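As an illustration of this layout, here is a minimal sketch that parses a small data list from a string instead of a file. The 3GP entry (the bytes for the ASCII text "ftyp3g" at offset 4) is an example added here to show a non-zero offset; treat it as an assumption about what such an entry might look like, not as a copy of the real data file.

```python
import json

# Hypothetical sample of the data file; the 3GP entry is illustrative only.
SAMPLE = """
{"data": [
  {"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1"]},
  {"format": "3gp", "offset": 4, "signature": ["66 74 79 70 33 67"]}
]}
"""

entries = json.loads(SAMPLE)["data"]

# Each entry carries a format name, a byte offset and one or more signatures.
for entry in entries:
    print(entry["format"], entry["offset"], entry["signature"])
```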

2. Writing a utility

Import the necessary modules:

import os
import json

Read a list of data:

abspath = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(abspath, "data.json"), "r", encoding="utf-8") as data_file:
    data = json.load(data_file)["data"]

Great, the data list is loaded. Now we read the file as a sequence of bytes. We will only read the first 32 bytes, since detecting common formats doesn’t require more, and reading a large file in full would take a long time.

file = open("path_to_the_file", "rb").read(32)

If you print the ‘file’ variable, you will see something similar to this:

\x90\x00\x03\x00\x00\x00\x04

Now the bytes must be converted to a hexadecimal string:

hex_bytes = " ".join(['{:02X}'.format(byte) for byte in file])
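To make the result of this conversion concrete, here is a small sketch that applies the same expression to a few hand-written bytes instead of a real file. The sample bytes and the names `sample` and `hex_string` are chosen just for this illustration.

```python
# A few bytes written by hand; they start with one of the JPEG signatures
# from the data list above.
sample = bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10])

# Same conversion as in the utility: two uppercase hex digits per byte,
# with the bytes separated by single spaces.
hex_string = " ".join("{:02X}".format(byte) for byte in sample)

print(hex_string)  # FF D8 FF E0 00 10
```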

Next, we create a list to which the matching formats will be added:

out = []

And now we write a loop that checks every signature against the file:

for element in data:
    for signature in element["signature"]:
        offset = element["offset"]*2+element["offset"]
        if signature == hex_bytes[offset:len(signature)+offset].upper():
            out.append(element["format"])

About this line:

offset = element["offset"]*2+element["offset"]

Since our bytes are represented as a string in which two characters stand for one byte and each byte is followed by a space, we multiply the byte offset by 2 and add one more character per skipped byte for the spaces between the “bytes”.
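The arithmetic can be checked on a small example. The sketch below uses a made-up 8-byte header whose last four bytes spell "ftyp" in ASCII; the helper name `string_index` is hypothetical and exists only for this illustration.

```python
def string_index(byte_offset):
    """Position in the hex string where the byte at byte_offset starts."""
    # Two hex digits plus one separating space for every skipped byte.
    return byte_offset * 2 + byte_offset

# Made-up header: four arbitrary bytes followed by ASCII "ftyp".
header = bytes([0x00, 0x00, 0x00, 0x18, 0x66, 0x74, 0x79, 0x70])
hex_string = " ".join("{:02X}".format(byte) for byte in header)

start = string_index(4)
print(start)                         # 12
print(hex_string[start:start + 11])  # 66 74 79 70
```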

The only thing left is to output the list of suitable formats, which is stored in the ‘out’ variable.

print(out) # prints something like ['extension_1', 'extension_2']
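Putting the steps together, the whole check can be sketched as one function. This is a variation rather than the exact code above: it compares raw bytes using `bytes.fromhex` instead of building a hex string, and it stops at the first matching signature of each format. The function name `detect_formats`, its parameters, and the temporary test file are assumptions made for this example.

```python
import os
import tempfile

def detect_formats(path, data, header_size=32):
    """Return the formats from data whose signature matches the file header."""
    with open(path, "rb") as handle:
        header = handle.read(header_size)

    matches = []
    for element in data:
        offset = element["offset"]
        for signature in element["signature"]:
            # bytes.fromhex accepts the space-separated notation used above.
            expected = bytes.fromhex(signature)
            if header[offset:offset + len(expected)] == expected:
                matches.append(element["format"])
                break  # one matching signature per format is enough
    return matches

# Hypothetical data list with a single entry, in the layout described earlier.
data = [{"format": "jpg", "offset": 0, "signature": ["FF D8 FF E0", "FF D8 FF E1"]}]

# Write a tiny fake file that starts with a JPEG signature, then test it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes([0xFF, 0xD8, 0xFF, 0xE0, 0x00, 0x10]))
result = detect_formats(tmp.name, data)
os.remove(tmp.name)

print(result)  # ['jpg']
```

Comparing bytes directly avoids the offset arithmetic on the string, because the offset from the data list can be used as a slice index without any conversion.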

Conclusion

As it turned out, many projects need to recognize file formats, so I decided to release my solution as an open-source Python module called fleep, which is available on GitHub. You can install the module using the standard Python utility ‘pip’:

pip install fleep

There are also usage examples and a complete list of supported file formats on the GitHub project page. I improve fleep every day, adding new features and formats. You can use it in your project :)

Thank you for your attention!

P.S. I would be glad to hear your opinion about my module.

P.P.S. English is not my native language, so please excuse any mistakes :)