Data Aggregation

In this short notebook, we look at data aggregation implementation for Polars. We maintain the same dataset of employees.

import polars as pl 

data = pl.read_csv("data/employees.csv")
data.head()

OUTPUTshape: (5, 6) ┌─────┬────────┬──────┬─────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪════════╪══════╪═════════════╪════════╪════════════╡ │ 1 ┆ User_1 ┆ null ┆ null ┆ 123355 ┆ 2020-01-01 │ │ 2 ┆ User_2 ┆ 28.0 ┆ Sales ┆ 118399 ┆ 2020-01-02 │ │ 3 ┆ User_3 ┆ 29.0 ┆ Sales ┆ 88727 ┆ 2020-01-03 │ │ 4 ┆ User_4 ┆ 48.0 ┆ Engineering ┆ 71572 ┆ 2020-01-04 │ │ 5 ┆ User_5 ┆ 22.0 ┆ Engineering ┆ 81849 ┆ 2020-01-05 │ └─────┴────────┴──────┴─────────────┴────────┴────────────┘

count()

Let's begin with the simplest version of aggregation. In this example, we return the count for each column, if the value is not NULL. To do this we use the count() method.

print(data.count())

OUTPUTshape: (1, 6) ┌─────┬──────┬─────┬────────────┬────────┬───────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │ ╞═════╪══════╪═════╪════════════╪════════╪═══════════╡ │ 100 ┆ 100 ┆ 90 ┆ 93 ┆ 100 ┆ 100 │ └─────┴──────┴─────┴────────────┴────────┴───────────┘

max() and min()

We can also fetch the max() and min() value for each column, and of course, we can fetch individual max and min values.

print("\nMax Values")
print(data.max())

print("\nMin Values")
print(data.min())

OUTPUTMax Values shape: (1, 6) ┌─────┬─────────┬──────┬────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪═════════╪══════╪════════════╪════════╪════════════╡ │ 100 ┆ User_99 ┆ 59.0 ┆ Sales ┆ 148840 ┆ 2020-04-09 │ └─────┴─────────┴──────┴────────────┴────────┴────────────┘ Min Values shape: (1, 6) ┌─────┬────────┬──────┬─────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪════════╪══════╪═════════════╪════════╪════════════╡ │ 1 ┆ User_1 ┆ 20.0 ┆ Engineering ┆ 52928 ┆ 2020-01-01 │ └─────┴────────┴──────┴─────────────┴────────┴────────────┘

data.select("salary").max().item(), data.select("salary").min().item()

OUTPUT(148840, 52928)

mean() and std()

Ofcourse, we can also fetch the mean and the standard deviation for each column as well.

data.mean(), data.std()

OUTPUT(shape: (1, 6) ┌──────┬──────┬───────────┬────────────┬──────────┬───────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ f64 ┆ str ┆ f64 ┆ str │ ╞══════╪══════╪═══════════╪════════════╪══════════╪═══════════╡ │ 50.5 ┆ null ┆ 39.477778 ┆ null ┆ 97405.04 ┆ null │ └──────┴──────┴───────────┴────────────┴──────────┴───────────┘, shape: (1, 6) ┌───────────┬──────┬───────────┬────────────┬──────────────┬───────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ str ┆ f64 ┆ str ┆ f64 ┆ str │ ╞═══════════╪══════╪═══════════╪════════════╪══════════════╪═══════════╡ │ 29.011492 ┆ null ┆ 12.273318 ┆ null ┆ 27189.164567 ┆ null │ └───────────┴──────┴───────────┴────────────┴──────────────┴───────────┘)

Compound Expressions

The above examples are interesting but don't offer the type of compound aggregation that makes analysis interesting. For this, we have to think about expressions. Here, it is important to think of these almost as SQL statements.

Let's begin by fetching the salary and age columns. The we can compose different expression. For example, fetch the median, mean and other functions.

# compound expressions
summary_expr = [
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").std().alias("std_salary"),
    pl.col("salary").min().alias("min_salary"),
    pl.col("salary").max().alias("max_salary")
]

print(data.select(summary_expr))

OUTPUTshape: (1, 4) ┌────────────┬──────────────┬────────────┬────────────┐ │ avg_salary ┆ std_salary ┆ min_salary ┆ max_salary │ │ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ i64 ┆ i64 │ ╞════════════╪══════════════╪════════════╪════════════╡ │ 97405.04 ┆ 27189.164567 ┆ 52928 ┆ 148840 │ └────────────┴──────────────┴────────────┴────────────┘

An alternatively, we can express this summarization differently by defining expressions as first class objects.

salary = pl.col("salary")
age = pl.col("age")


result = data.select(
            salary.mean().alias("avg_salary"),
            salary.min().alias("min_salary"),
            salary.max().alias("max_salary"),
            (salary.max() - salary.min()).alias("salary_range"),
            age.median().alias("median_age"),
            age.mean().alias("avg_age")
        )

print(result)

OUTPUTshape: (1, 6) ┌────────────┬────────────┬────────────┬──────────────┬────────────┬───────────┐ │ avg_salary ┆ min_salary ┆ max_salary ┆ salary_range ┆ median_age ┆ avg_age │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ f64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 │ ╞════════════╪════════════╪════════════╪══════════════╪════════════╪═══════════╡ │ 97405.04 ┆ 52928 ┆ 148840 ┆ 95912 ┆ 39.0 ┆ 39.477778 │ └────────────┴────────────┴────────────┴──────────────┴────────────┴───────────┘

Reusing Expressions

The wonderful thing about expressions is the fact that they are reusable. For example, we can use the previous summary_expr variable as part of a groupby.

dept_summary = data.filter(~pl.col("department").is_null() ).group_by("department").agg(summary_expr)
print(dept_summary)

OUTPUTshape: (4, 5) ┌─────────────┬───────────────┬──────────────┬────────────┬────────────┐ │ department ┆ avg_salary ┆ std_salary ┆ min_salary ┆ max_salary │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ f64 ┆ f64 ┆ i64 ┆ i64 │ ╞═════════════╪═══════════════╪══════════════╪════════════╪════════════╡ │ Engineering ┆ 100800.96 ┆ 24746.970967 ┆ 61653 ┆ 145505 │ │ Sales ┆ 89187.166667 ┆ 23739.912497 ┆ 55931 ┆ 137793 │ │ Marketing ┆ 97095.956522 ┆ 27680.241155 ┆ 57156 ┆ 142489 │ │ HR ┆ 107561.619048 ┆ 30100.811834 ┆ 52928 ┆ 148840 │ └─────────────┴───────────────┴──────────────┴────────────┴────────────┘