Introduction to Polars for Data Processing

Polars is a blazingly fast DataFrame library implemented in Rust with Python bindings. Created by Ritchie Vink in 2020, it's designed from the ground up for performance, leveraging Apache Arrow, parallel execution, and query optimization.

Architecture: Why Polars is Fast

  • Apache Arrow Memory Format: Columnar, cache-friendly layout with zero-copy reads
  • Written in Rust: Memory-safe systems language without GIL (Global Interpreter Lock)
  • SIMD Vectorization: Single Instruction Multiple Data operations on modern CPUs
  • Parallel Execution: Automatically multi-threaded using Rayon (work-stealing scheduler)
  • Query Optimization: Lazy API builds execution plan, applies predicate pushdown and projection pruning

To provide an example, the following code runs in parallel without any specific configuration:

import polars as pl
import numpy as np

# Polars excels at large-scale operations
df = pl.DataFrame({
    'id': range(10_000_000),
    'value': np.random.randn(10_000_000)
})

# Automatic parallelization across all CPU cores
result = df.filter(pl.col('value') > 0).group_by('id').agg(pl.col('value').sum())
OUTPUT:
shape: (5_000_372, 2)
id       value
i64      f64
3        0.529892
4        0.967859
7        0.087366
18       0.307324
19       0.809991
…        …
9999991  1.353322
9999993  1.742853
9999994  2.152039
9999996  1.50429
9999997  1.049186

Creating DataFrames

The first and easiest way to create a DataFrame is from a dictionary mapping column names to values. Simply pass the dictionary to the pl.DataFrame constructor.

data = {
    "name": ["Alice", "Bob", "Charlie", "Jane"],
    "age": [25, 30, 35, 45],
    "city": ["NY", "LA", "SF", "TX"]
}

df = pl.DataFrame(data)
df
OUTPUT:
shape: (4, 3)
name       age  city
str        i64  str
"Alice"    25   "NY"
"Bob"      30   "LA"
"Charlie"  35   "SF"
"Jane"     45   "TX"

One useful way to ensure type safety and predictable behavior is to set the schema explicitly, specifying the datatype of each column. Below is an example of how to achieve this.

# Explicit schema (type safety)
df = pl.DataFrame(
    data,
    schema={
        'name': pl.Utf8,
        'age': pl.Int32,  # More memory-efficient than Int64
        'city': pl.Categorical  # Automatic string interning
    }
)

df
OUTPUT:
shape: (4, 3)
name       age  city
str        i32  cat
"Alice"    25   "NY"
"Bob"      30   "LA"
"Charlie"  35   "SF"
"Jane"     45   "TX"

Creating a DataFrame from Pandas

Since pandas came first, most people will be familiar with it. Polars offers a simple API to convert a pandas DataFrame to a Polars DataFrame and back.

import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6], 
                    'b': [3, 1, 2, 4, 5, 3]})

# Convert to Polars
df = pl.from_pandas(pdf)
df
OUTPUT:
shape: (6, 2)
a    b
i64  i64
1    3
2    1
3    2
4    4
5    5
6    3

Reading Data: Polars' Strengths

There are two main ways of reading data: eager and lazy. Let's begin with the eager one, read_csv().

# reading employees.csv
df = pl.read_csv("data/employees.csv")
df.head()
OUTPUT:
shape: (5, 6)
id   name      age   department     salary  join_date
i64  str       f64   str            i64     str
1    "User_1"  null  null           123355  "2020-01-01"
2    "User_2"  28.0  "Sales"        118399  "2020-01-02"
3    "User_3"  29.0  "Sales"        88727   "2020-01-03"
4    "User_4"  48.0  "Engineering"  71572   "2020-01-04"
5    "User_5"  22.0  "Engineering"  81849   "2020-01-05"

Lazy Reading

Alternatively, we can load data lazily, which is particularly useful for large datasets. A lazy scan does not read the file immediately; instead it builds a query plan that is only executed, with optimizations applied, when the results are collected.


# Lazy scan (recommended for large files)
df = pl.scan_csv("data/employees.csv")
df
OUTPUT:
naive plan: (run LazyFrame.explain(optimized=True) to see the optimized plan)
Csv SCAN [data/employees.csv]
PROJECT */6 COLUMNS
ESTIMATED ROWS: 100
# Lazy benefits:
result = (
    df
    .filter(pl.col('salary') > 50000)  # Pushed down to file scan
    .select(['name', 'salary'])         # Only these columns read
    .collect()                          # Execute optimized plan
)
result
OUTPUT:
shape: (100, 2)
name        salary
str         i64
"User_1"    123355
"User_2"    118399
"User_3"    88727
"User_4"    71572
"User_5"    81849
…           …
"User_96"   62415
"User_97"   89151
"User_98"   122927
"User_99"   106714
"User_100"  137793

Reading JSON Datasets

Often we work with web data in JSON format, and Polars offers a JSON API allowing us to read JSON files. Below is an example of reading a JSON file.

# Flat JSON
df_json = pl.read_json("../data/employees.json")
df_json.head()
OUTPUT:
shape: (5, 6)
id   name      age   department     salary  join_date
i64  str       f64   str            i64     i64
1    "User_1"  null  null           123355  1577836800000
2    "User_2"  28.0  "Sales"        118399  1577923200000
3    "User_3"  29.0  "Sales"        88727   1578009600000
4    "User_4"  48.0  "Engineering"  71572   1578096000000
5    "User_5"  22.0  "Engineering"  81849   1578182400000

In the next section, we will go over expressions, which form the power tools for processing data.