GroupBy Aggregation

This notebook will explore the use of groupby in aggregations.

import polars as pl 

data = pl.read_csv("data/employees.csv")
print(data.head())

OUTPUTshape: (5, 6) ┌─────┬────────┬──────┬─────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪════════╪══════╪═════════════╪════════╪════════════╡ │ 1 ┆ User_1 ┆ null ┆ null ┆ 123355 ┆ 2020-01-01 │ │ 2 ┆ User_2 ┆ 28.0 ┆ Sales ┆ 118399 ┆ 2020-01-02 │ │ 3 ┆ User_3 ┆ 29.0 ┆ Sales ┆ 88727 ┆ 2020-01-03 │ │ 4 ┆ User_4 ┆ 48.0 ┆ Engineering ┆ 71572 ┆ 2020-01-04 │ │ 5 ┆ User_5 ┆ 22.0 ┆ Engineering ┆ 81849 ┆ 2020-01-05 │ └─────┴────────┴──────┴─────────────┴────────┴────────────┘

Simple Groupby

The following example, will implement a simple groupby of department by salary to return the mean for each department

dept_salary_avgs = data.group_by("department").agg( pl.col("salary").mean() )
print(dept_salary_avgs)

OUTPUTshape: (5, 2) ┌─────────────┬───────────────┐ │ department ┆ salary │ │ --- ┆ --- │ │ str ┆ f64 │ ╞═════════════╪═══════════════╡ │ null ┆ 83998.142857 │ │ Engineering ┆ 100800.96 │ │ HR ┆ 107561.619048 │ │ Marketing ┆ 97095.956522 │ │ Sales ┆ 89187.166667 │ └─────────────┴───────────────┘

Multiple Aggregation

One way to perform multiple aggregations is to define a set of expressions that can be passed on to a groupby function. The example below demonstrates this implementation.

summary_expr = [
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").median().alias("median_salary"),
    pl.col("salary").std().alias("std_salary"),
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("employee_count")
]

summary_data = data.group_by("department").agg(summary_expr)
print(summary_data)

OUTPUTshape: (5, 6) ┌─────────────┬─────────────┬────────────┬────────────┬───────────┬────────────┐ │ department ┆ avg_salary ┆ median_sal ┆ std_salary ┆ avg_age ┆ employee_c │ │ --- ┆ --- ┆ ary ┆ --- ┆ --- ┆ ount │ │ str ┆ f64 ┆ --- ┆ f64 ┆ f64 ┆ --- │ │ ┆ ┆ f64 ┆ ┆ ┆ u32 │ ╞═════════════╪═════════════╪════════════╪════════════╪═══════════╪════════════╡ │ Sales ┆ 89187.16666 ┆ 89576.0 ┆ 23739.9124 ┆ 37.391304 ┆ 24 │ │ ┆ 7 ┆ ┆ 97 ┆ ┆ │ │ null ┆ 83998.14285 ┆ 77334.0 ┆ 29473.7521 ┆ 33.333333 ┆ 7 │ │ ┆ 7 ┆ ┆ 73 ┆ ┆ │ │ Marketing ┆ 97095.95652 ┆ 96023.0 ┆ 27680.2411 ┆ 37.454545 ┆ 23 │ │ ┆ 2 ┆ ┆ 55 ┆ ┆ │ │ Engineering ┆ 100800.96 ┆ 95432.0 ┆ 24746.9709 ┆ 41.76 ┆ 25 │ │ ┆ ┆ ┆ 67 ┆ ┆ │ │ HR ┆ 107561.6190 ┆ 106714.0 ┆ 30100.8118 ┆ 42.647059 ┆ 21 │ │ ┆ 48 ┆ ┆ 34 ┆ ┆ │ └─────────────┴─────────────┴────────────┴────────────┴───────────┴────────────┘

Groupby Multiple Keys

An addditional feature that is useful is the ability to group data by multiple keys. Let's first create a new column based on age. Simply evaluate the ages, everything above media is old and below is younger.

data = data.with_columns( 
        pl.when(pl.col("age") >= pl.col("age").median())
        .then(pl.lit("old"))
        .otherwise(pl.lit("young")).alias("age_group"))

print(data.head())

OUTPUTshape: (5, 7) ┌─────┬────────┬──────┬─────────────┬────────┬────────────┬───────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date ┆ age_group │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str ┆ str │ ╞═════╪════════╪══════╪═════════════╪════════╪════════════╪═══════════╡ │ 1 ┆ User_1 ┆ null ┆ null ┆ 123355 ┆ 2020-01-01 ┆ young │ │ 2 ┆ User_2 ┆ 28.0 ┆ Sales ┆ 118399 ┆ 2020-01-02 ┆ young │ │ 3 ┆ User_3 ┆ 29.0 ┆ Sales ┆ 88727 ┆ 2020-01-03 ┆ young │ │ 4 ┆ User_4 ┆ 48.0 ┆ Engineering ┆ 71572 ┆ 2020-01-04 ┆ old │ │ 5 ┆ User_5 ┆ 22.0 ┆ Engineering ┆ 81849 ┆ 2020-01-05 ┆ young │ └─────┴────────┴──────┴─────────────┴────────┴────────────┴───────────┘

summary_data = data.filter(~pl.col("department").is_null()).group_by("department", "age_group", ).agg(summary_expr)
print(summary_data)

OUTPUTshape: (8, 7) ┌───────────┬───────────┬──────────┬──────────┬──────────┬──────────┬──────────┐ │ departmen ┆ age_group ┆ avg_sala ┆ median_s ┆ std_sala ┆ avg_age ┆ employee │ │ t ┆ --- ┆ ry ┆ alary ┆ ry ┆ --- ┆ _count │ │ --- ┆ str ┆ --- ┆ --- ┆ --- ┆ f64 ┆ --- │ │ str ┆ ┆ f64 ┆ f64 ┆ f64 ┆ ┆ u32 │ ╞═══════════╪═══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡ │ HR ┆ young ┆ 105220.9 ┆ 103962.0 ┆ 30998.13 ┆ 31.5 ┆ 10 │ │ ┆ ┆ ┆ ┆ 7831 ┆ ┆ │ │ Marketing ┆ young ┆ 101336.6 ┆ 96023.0 ┆ 29783.92 ┆ 25.5 ┆ 13 │ │ ┆ ┆ 15385 ┆ ┆ 4309 ┆ ┆ │ │ Marketing ┆ old ┆ 91583.1 ┆ 98821.5 ┆ 25107.81 ┆ 51.8 ┆ 10 │ │ ┆ ┆ ┆ ┆ 9572 ┆ ┆ │ │ Sales ┆ young ┆ 92751.23 ┆ 88727.0 ┆ 22545.29 ┆ 27.0 ┆ 13 │ │ ┆ ┆ 0769 ┆ ┆ 8683 ┆ ┆ │ │ Sales ┆ old ┆ 84975.09 ┆ 91618.0 ┆ 25500.27 ┆ 48.72727 ┆ 11 │ │ ┆ ┆ 0909 ┆ ┆ 9879 ┆ 3 ┆ │ │ Engineeri ┆ old ┆ 95780.14 ┆ 89367.0 ┆ 23926.09 ┆ 49.85714 ┆ 14 │ │ ng ┆ ┆ 2857 ┆ ┆ 9039 ┆ 3 ┆ │ │ HR ┆ old ┆ 109689.5 ┆ 122833.0 ┆ 30608.19 ┆ 48.72727 ┆ 11 │ │ ┆ ┆ 45455 ┆ ┆ 8387 ┆ 3 ┆ │ │ Engineeri ┆ young ┆ 107191.0 ┆ 106355.0 ┆ 25404.43 ┆ 31.45454 ┆ 11 │ │ ng ┆ ┆ 90909 ┆ ┆ 4575 ┆ 5 ┆ │ └───────────┴───────────┴──────────┴──────────┴──────────┴──────────┴──────────┘