Crunching Large Datasets Made Fast and Easy: the Polars Library

Written by zapalote | Published 2022/01/10

Photo by Margaret Weir on Unsplash

“We spend a lot of time waiting for some data preparation task to finish, the destiny of data scientists, you would say.” So started an article I published two years ago. It went on to show how we could speed things up with techniques such as parallelization and memory-mapped files, all because doing the same task with pandas would have taken hours.

TL;DR: I implemented the same task with the new Polars library for Python. The result? Well, I am blown away. Polars can crunch the input (23 GB) in less than 80 seconds. My old script needed 15 minutes. Chakka!

But let’s take it one thing at a time, shall we?

The data

As often happens with large datasets, we first have to take care of input quality. I wanted to filter out OCR defects from the Google Books 1-Ngram dataset. The data files contain the frequency with which a term appears in the scanned books, on a per-year basis. Summing over the years then gives us a good handle on suspect words (those that appear rarely): “apokalypsc” appears only 41 times and “sngineering” 381 times. In contrast, the word “air” appears 93,503,520 times and “population” 71,863,291 times.

In total, there are 27 files (one per letter), containing 1,201,784,959 records (yes, over a billion records to crunch through, 23 GB uncompressed).

DataFrames in Polars

Polars is a data processing and analysis library written entirely in Rust, with APIs in Python and Node.js. It is the new kid on the block, competing with established top dogs such as pandas. It comes fully equipped with support for numerical calculations, string manipulation, and data frame operations like filtering, joining, intersection, and aggregations such as groupby.
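
To give a flavor, here is a toy example of mine (not from the article) showing joins, filters, and grouped aggregations in that expression style. It assumes a recent Polars version, where some names differ from the 2022 API (for instance, groupby became group_by):

import polars as pl

# toy data: per-term counts plus a made-up lookup table of suspect terms
counts = pl.DataFrame({"word": ["air", "air", "apokalypsc"], "count": [5, 7, 1]})
flags  = pl.DataFrame({"word": ["air", "apokalypsc"], "suspect": [False, True]})

out = (
    counts.join(flags, on="word")         # joining
          .filter(~pl.col("suspect"))     # filtering
          .group_by("word")               # aggregation: sum counts per word
          .agg(pl.col("count").sum())
)
print(out)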

Polars has achieved honors in benchmarks as shown here and here.

Back to our task: this is the script that implements the logic described above for processing one file.

import polars as pl

def process_file(file):
    global basepath, stopwords
    not_word = r'(_|[^\w])'

    # define what we are reading (only cols 0 and 2 and name them)
    df = pl.read_csv(basepath+file, sep="\t", has_header=False,
      columns=[0,2], new_columns=['word','count'])

    # filter out terms with non alphabetical characters ...
    df = df.filter(pl.col("word").str.contains(not_word).is_not())

    #  ... and eliminate terms shorter than 3 chars
    df = df.filter(pl.col("word").str.lengths() > 2)
    
    #  ... and also stop words
    df["word"] = df["word"].str.to_lowercase()
    df = df.filter(pl.col("word").is_in(stopwords).is_not())
    
    # sum the counts for each unique word and sort by the sum, descending
    df = df.groupby('word')['count'].sum().sort(by='count_sum', reverse=True)
    
    #  select only terms that appear more than 20,000 times in the books
    good = df.filter(pl.col("count_sum") > 20000)
    
    #  output a csv file
    print(f"out_{file}, {len(good)} terms")
    good.to_csv(f'out_{file}.csv', sep='\t', has_header=False)

The input format for each file is

ngram TAB year TAB count TAB volume_count NEWLINE

In a nutshell, we apply some heuristic filters and sum the count for each term over all records. Finally, we output only the terms that appear more often than a given threshold (a number found by testing).

The syntax for working with data frames in Polars bears some similarity to the syntax in pandas, but only to a certain extent. Polars has a chained expression syntax that makes it very … well, expressive. I liked that a lot. I must admit, though, that without Stack Overflow I would never have come up with pl.col("colname") to address the Series data structure storing each column in the data frame 😉
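
For illustration, here is a sketch of mine (not the author's script) of how the same per-file logic could be written as one chained expression. It assumes a recent Polars version, where several names differ from the 2022 API (separator, group_by, str.len_chars, descending):

import polars as pl

def process_file_chained(path, stopwords, min_count=20000):
    not_word = r'(_|[^\w])'
    return (
        pl.read_csv(path, separator="\t", has_header=False,
                    columns=[0, 2], new_columns=["word", "count"])
        .with_columns(pl.col("word").str.to_lowercase())
        .filter(~pl.col("word").str.contains(not_word))  # drop non-alphabetical terms
        .filter(pl.col("word").str.len_chars() > 2)      # drop very short terms
        .filter(~pl.col("word").is_in(stopwords))        # drop stop words
        .group_by("word")
        .agg(pl.col("count").sum())                      # total count per term
        .sort("count", descending=True)
        .filter(pl.col("count") > min_count)             # keep frequent terms only
    )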

What makes Polars so fast?

In this interview, Ritchie Vink, the creator of Polars, gives some insight into what happens behind the scenes. Parallelization happens in the underlying Rust layers. Lots of thought went into optimizing for CPU caches and multi-core designs. The use of the Arrow2 framework for columnar data also helped to speed things up. But now we see something new:

The most inspiration came from database systems, mostly how they define a query plan and optimize that. This gave way for Polars expressions, and those are the key selling point of our API. They are declarative, composable and fast.

This quote caught my attention. See, dealing with a large data frame resembles accessing rows/cols in a database.
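
That query-plan thinking is most visible in Polars' lazy API: you describe the whole pipeline first, Polars optimizes the plan (pushing filters and column selection down towards the reader), and only then does any work happen. A minimal sketch, with a placeholder file name and assuming a recent Polars version:

import polars as pl

lazy = (
    pl.scan_csv("1gram-a.tsv", separator="\t", has_header=False,
                new_columns=["word", "year", "count", "volume_count"])
    .filter(pl.col("word").str.len_chars() > 2)
    .group_by("word")
    .agg(pl.col("count").sum())
)
print(lazy.explain())   # the optimized query plan, before anything is read
df = lazy.collect()     # only now is the file actually scanned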

Behind the scenes we have copy-on-write so generally copies, which are expensive in RAM and speed, don’t have to happen unless you modify the data - the data itself is immutable. All of this happens in the Rust layer, using Rust threads (which you don’t see from the Python frontend), so running low on RAM is much less of an issue compared to Pandas.

Voilà.

Parallelizing the Input Pipe

Processing the 27 input files doesn’t have to happen sequentially 😃

I use the Python multiprocessing library to have four processes running the script above at any given time (my Mac mini has four cores and 32 GB of memory). The script is available here.
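
As a rough sketch of that fan-out (the file names are placeholders, and process_file is the function shown earlier), the standard library's multiprocessing.Pool keeps four worker processes busy:

from multiprocessing import Pool

# Placeholder names; in reality there are 27 input files.
files = ["1gram-a", "1gram-b", "1gram-c"]  # ... and so on

if __name__ == "__main__":
    with Pool(processes=4) as pool:   # at most four files are processed at once
        pool.map(process_file, files)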

Thank you for reading, hope you found it interesting. Comments and suggestions are always welcome!

