Data Aggregation
In this short notebook, we look at data aggregation implementation for Polars. We maintain the same dataset of
employees.
import polars as pl
data = pl.read_csv("data/employees.csv")
data.head()
count()
Let's begin with the simplest version of aggregation. In this example, we return the count for each column, if
the value is not NULL. To do this we use the count() method.
print(data.count())
max() and min()
We can also fetch the max() and min() value for each column,
and of course, we can fetch individual max and min values.
print("\nMax Values")
print(data.max())
print("\nMin Values")
print(data.min())
data.select("salary").max().item(), data.select("salary").min().item()
mean() and std()
Ofcourse, we can also fetch the mean and the standard deviation for each column as well.
data.mean(), data.std()
Compound Expressions
The above examples are interesting but don't offer the type of compound aggregation that makes analysis
interesting. For this, we have to think about expressions. Here, it is important to think of these almost as SQL
statements.
Let's begin by fetching the salary and age columns. The we can compose different expression. For example, fetch
the median, mean and other functions.
# compound expressions
summary_expr = [
pl.col("salary").mean().alias("avg_salary"),
pl.col("salary").std().alias("std_salary"),
pl.col("salary").min().alias("min_salary"),
pl.col("salary").max().alias("max_salary")
]
print(data.select(summary_expr))
An alternatively, we can express this summarization differently by defining expressions as first class objects.
salary = pl.col("salary")
age = pl.col("age")
result = data.select(
salary.mean().alias("avg_salary"),
salary.min().alias("min_salary"),
salary.max().alias("max_salary"),
(salary.max() - salary.min()).alias("salary_range"),
age.median().alias("median_age"),
age.mean().alias("avg_age")
)
print(result)
Reusing Expressions
The wonderful thing about expressions is the fact that they are reusable. For example, we can use the previous
summary_expr variable as part of a groupby.
dept_summary = data.filter(~pl.col("department").is_null() ).group_by("department").agg(summary_expr)
print(dept_summary)