Introduction to Polars for Data Processing
Polars is a blazingly fast DataFrame library implemented in Rust with Python bindings. Created by Ritchie Vink in 2020, it's designed from the ground up for performance, leveraging Apache Arrow, parallel execution, and query optimization.
Architecture: Why Polars is Fast
- Apache Arrow Memory Format: Columnar, cache-friendly layout with zero-copy reads
- Written in Rust: a memory-safe systems language; the heavy lifting runs outside Python's GIL (Global Interpreter Lock)
- SIMD Vectorization: Single Instruction Multiple Data operations on modern CPUs
- Parallel Execution: Automatically multi-threaded using Rayon (work-stealing scheduler)
- Query Optimization: Lazy API builds execution plan, applies predicate pushdown and projection pruning
As an example, the following code runs in parallel without any special configuration:
import polars as pl
import numpy as np
# Polars excels at large-scale operations
df = pl.DataFrame({
    'id': range(10_000_000),
    'value': np.random.randn(10_000_000)
})
# Automatic parallelization across all CPU cores
result = df.filter(pl.col('value') > 0).group_by('id').agg(pl.col('value').sum())
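The degree of parallelism comes from Polars' Rayon thread pool, mentioned in the list above. As a minimal sketch (the 4-thread cap is purely illustrative), you can cap and inspect the pool; note that POLARS_MAX_THREADS must be set before Polars is first imported, or it has no effect:
import os
# Cap the Rayon thread pool before Polars is imported (illustrative value)
os.environ["POLARS_MAX_THREADS"] = "4"
import polars as pl
# Report how many threads the pool will use
# (older Polars versions expose this as pl.threadpool_size())
print(pl.thread_pool_size())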
Creating DataFrames
The easiest way to create a DataFrame is from a dictionary: pass it to the pl.DataFrame constructor, and Polars builds a column from each key, inferring the data types from the values.
data = {
    "name": ["Alice", "Bob", "Charlie", "Jane"],
    "age": [25, 30, 35, 45],
    "city": ["NY", "LA", "SF", "TX"]
}
df = pl.DataFrame(data)
df
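Since no schema was given, Polars inferred the dtypes from the Python values. A quick way to check what it chose:
# Inspect the inferred schema: Python ints map to Int64,
# Python strings to Polars' string type
print(df.schema)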
One useful way to ensure type safety, and to make downstream operations behave predictably, is to set the schema explicitly, identifying the data type of each column. Below is an example of how to achieve this.
# Explicit schema (type safety)
df = pl.DataFrame(
    data,
    schema={
        'name': pl.Utf8,
        'age': pl.Int32,        # More memory-efficient than Int64
        'city': pl.Categorical  # Automatic string interning
    }
)
df
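To see what the narrower types actually buy you, a rough sketch is to compare the estimated in-memory size of the typed frame against the default-inferred one. On a four-row toy frame the difference is negligible (categorical encoding only pays off once strings repeat many times), but the same comparison is informative on real data:
# estimated_size reports the heap memory used by the data, in bytes here
default_df = pl.DataFrame(data)    # inferred Int64 / string dtypes
print(default_df.estimated_size("b"))
print(df.estimated_size("b"))      # explicit Int32 / Categorical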
Creating a DataFrame from Pandas
Since pandas came first, most people will already be familiar with it. Polars offers a simple API to convert a pandas DataFrame to a Polars DataFrame and back.
import pandas as pd
pdf = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                    'b': [3, 1, 2, 4, 5, 3]})
# Convert to Polars
df = pl.from_pandas(pdf)
df
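The reverse direction is a single call:
# Convert back from Polars to pandas; passing use_pyarrow_extension_array=True
# keeps the data Arrow-backed and can avoid a copy
pdf_back = df.to_pandas()
pdf_back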
Reading Data: Polars' Strengths
There are two main ways of reading data: eager reading, which loads the file into memory immediately, and lazy scanning. Let's begin with the first.
# Eager read: employees.csv is parsed into memory immediately
df = pl.read_csv("data/employees.csv")
df.head()
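read_csv also takes options to limit the work done up front. As a sketch (the column names here are assumptions about employees.csv), you can read only a subset of columns and rows:
# Read only two columns and the first 1,000 rows
df_subset = pl.read_csv(
    "data/employees.csv",
    columns=["name", "salary"],
    n_rows=1_000,
)
df_subset.head()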
Lazy Reading
Alternatively, we can load data lazily, which is particularly useful for large datasets. Instead of reading the file, scan_csv returns a LazyFrame: a description of the data plus a plan for how to read and process it, which Polars optimizes before executing.
# Lazy scan (recommended for large files)
df = pl.scan_csv("data/employees.csv")
df
# The optimizer pushes filters and projections into the scan:
result = (
df
.filter(pl.col('salary') > 50000) # Pushed down to file scan
.select(['name', 'salary']) # Only these columns read
.collect() # Execute optimized plan
)
result
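You don't have to take the optimization on faith: LazyFrame.explain() prints the optimized plan, where the salary predicate and the two-column projection appear inside the CSV scan node itself. A minimal sketch:
# Show the optimized query plan; the SELECTION and PROJECT entries in the
# scan node indicate predicate pushdown and projection pruning
print(
    pl.scan_csv("data/employees.csv")
    .filter(pl.col('salary') > 50000)
    .select(['name', 'salary'])
    .explain()
)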
Reading JSON Datasets
Web data often comes in JSON format, and Polars offers a JSON API that lets us read JSON files directly. Below is an example of reading a flat JSON file.
# Flat JSON
df_json = pl.read_json("data/employees.json")
df_json.head()
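Web APIs and log pipelines often emit newline-delimited JSON (one object per line) rather than a single JSON document. Polars handles that variant with read_ndjson; the file path below is a hypothetical example:
# One JSON object per line (NDJSON); a lazy scan_ndjson also exists
df_ndjson = pl.read_ndjson("data/events.ndjson")
df_ndjson.head()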
In the next section, we will go over expressions, which are Polars' power tools for processing data.