GroupBy Aggregation
This notebook explores group-by aggregations in Polars.
import polars as pl
data = pl.read_csv("data/employees.csv")
print(data.head())
Simple Groupby
The following example groups the data by department and returns the mean salary for each
department.
dept_salary_avgs = data.group_by("department").agg(pl.col("salary").mean())
print(dept_salary_avgs)
Multiple Aggregation
One way to perform multiple aggregations is to define a list of expressions and pass it to agg after
group_by. The example below demonstrates this approach.
summary_expr = [
    pl.col("salary").mean().alias("avg_salary"),
    pl.col("salary").median().alias("median_salary"),
    pl.col("salary").std().alias("std_salary"),
    pl.col("age").mean().alias("avg_age"),
    pl.len().alias("employee_count"),
]
summary_data = data.group_by("department").agg(summary_expr)
print(summary_data)
Groupby Multiple Keys
Another useful feature is the ability to group data by multiple keys. Let's first create a new
column based on age: employees whose age is at or above the median are labeled "old", and everyone else is labeled "young".
data = data.with_columns(
    pl.when(pl.col("age") >= pl.col("age").median())
    .then(pl.lit("old"))
    .otherwise(pl.lit("young"))
    .alias("age_group")
)
print(data.head())
# Drop rows with a missing department, then group by both keys.
valid = data.filter(pl.col("department").is_not_null())
summary_data = valid.group_by("department", "age_group").agg(summary_expr)
print(summary_data)