Filtering Data

This section will look at filtering data. Filtering data involves row operations, comparing each value of the row for a given column against a specific value.

import numpy as np
import polars as pl

data = pl.read_csv('data/employees.csv')

print(data.head())
OUTPUTshape: (5, 6) ┌─────┬────────┬──────┬─────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪════════╪══════╪═════════════╪════════╪════════════╡ │ 1 ┆ User_1 ┆ null ┆ null ┆ 123355 ┆ 2020-01-01 │ │ 2 ┆ User_2 ┆ 28.0 ┆ Sales ┆ 118399 ┆ 2020-01-02 │ │ 3 ┆ User_3 ┆ 29.0 ┆ Sales ┆ 88727 ┆ 2020-01-03 │ │ 4 ┆ User_4 ┆ 48.0 ┆ Engineering ┆ 71572 ┆ 2020-01-04 │ │ 5 ┆ User_5 ┆ 22.0 ┆ Engineering ┆ 81849 ┆ 2020-01-05 │ └─────┴────────┴──────┴─────────────┴────────┴────────────┘

Single Column Filter

In the following example, we filter for the Engineering department. We use pl.col expression with base comparision to a value as demonstrated below.

engineers = data.filter( pl.col("department") == "Engineering" )
print(engineers)
OUTPUTshape: (25, 6) ┌─────┬─────────┬──────┬─────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪═════════╪══════╪═════════════╪════════╪════════════╡ │ 4 ┆ User_4 ┆ 48.0 ┆ Engineering ┆ 71572 ┆ 2020-01-04 │ │ 5 ┆ User_5 ┆ 22.0 ┆ Engineering ┆ 81849 ┆ 2020-01-05 │ │ 12 ┆ User_12 ┆ 51.0 ┆ Engineering ┆ 92048 ┆ 2020-01-12 │ │ 13 ┆ User_13 ┆ 38.0 ┆ Engineering ┆ 125023 ┆ 2020-01-13 │ │ 17 ┆ User_17 ┆ 58.0 ┆ Engineering ┆ 105267 ┆ 2020-01-17 │ │ … ┆ … ┆ … ┆ … ┆ … ┆ … │ │ 78 ┆ User_78 ┆ 57.0 ┆ Engineering ┆ 130266 ┆ 2020-03-18 │ │ 86 ┆ User_86 ┆ 28.0 ┆ Engineering ┆ 86603 ┆ 2020-03-26 │ │ 90 ┆ User_90 ┆ 59.0 ┆ Engineering ┆ 120491 ┆ 2020-03-30 │ │ 97 ┆ User_97 ┆ 35.0 ┆ Engineering ┆ 89151 ┆ 2020-04-06 │ │ 98 ┆ User_98 ┆ 38.0 ┆ Engineering ┆ 122927 ┆ 2020-04-07 │ └─────┴─────────┴──────┴─────────────┴────────┴────────────┘

Multiple Column Filter

We can also filter on more than one column. We simply combine the more than one expression into the filter function as demonstrated below.

senior_engineers = data.filter( (pl.col("department") == "Engineering") & 
                                (pl.col("age") > 30) &
                                (pl.col("salary") > 100000 ) )
print(senior_engineers)
OUTPUTshape: (10, 6) ┌─────┬─────────┬──────┬─────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪═════════╪══════╪═════════════╪════════╪════════════╡ │ 13 ┆ User_13 ┆ 38.0 ┆ Engineering ┆ 125023 ┆ 2020-01-13 │ │ 17 ┆ User_17 ┆ 58.0 ┆ Engineering ┆ 105267 ┆ 2020-01-17 │ │ 22 ┆ User_22 ┆ 43.0 ┆ Engineering ┆ 145505 ┆ 2020-01-22 │ │ 30 ┆ User_30 ┆ 38.0 ┆ Engineering ┆ 106355 ┆ 2020-01-30 │ │ 44 ┆ User_44 ┆ 31.0 ┆ Engineering ┆ 100475 ┆ 2020-02-13 │ │ 47 ┆ User_47 ┆ 46.0 ┆ Engineering ┆ 112288 ┆ 2020-02-16 │ │ 53 ┆ User_53 ┆ 36.0 ┆ Engineering ┆ 140253 ┆ 2020-02-22 │ │ 78 ┆ User_78 ┆ 57.0 ┆ Engineering ┆ 130266 ┆ 2020-03-18 │ │ 90 ┆ User_90 ┆ 59.0 ┆ Engineering ┆ 120491 ┆ 2020-03-30 │ │ 98 ┆ User_98 ┆ 38.0 ┆ Engineering ┆ 122927 ┆ 2020-04-07 │ └─────┴─────────┴──────┴─────────────┴────────┴────────────┘

Negative Selection

Another way to implement selection is a negative selection. That is select every item that is not this value. Below is an example of achieving this.

# NOT condition
not_hr = data.filter(pl.col("department") != "HR")

print(not_hr['department'].unique())
OUTPUTshape: (3,) Series: 'department' [str] [ "Marketing" "Sales" "Engineering" ]

is_in() filtering

Another available method for filtering is the is_in(). This allows for filtering and selection based on a defined list of items.

tech_depts = data.filter(pl.col("department").is_in(["IT", "Engineering"]))
print(tech_depts.head())
OUTPUTshape: (5, 6) ┌─────┬─────────┬──────┬─────────────┬────────┬────────────┐ │ id ┆ name ┆ age ┆ department ┆ salary ┆ join_date │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ f64 ┆ str ┆ i64 ┆ str │ ╞═════╪═════════╪══════╪═════════════╪════════╪════════════╡ │ 4 ┆ User_4 ┆ 48.0 ┆ Engineering ┆ 71572 ┆ 2020-01-04 │ │ 5 ┆ User_5 ┆ 22.0 ┆ Engineering ┆ 81849 ┆ 2020-01-05 │ │ 12 ┆ User_12 ┆ 51.0 ┆ Engineering ┆ 92048 ┆ 2020-01-12 │ │ 13 ┆ User_13 ┆ 38.0 ┆ Engineering ┆ 125023 ┆ 2020-01-13 │ │ 17 ┆ User_17 ┆ 58.0 ┆ Engineering ┆ 105267 ┆ 2020-01-17 │ └─────┴─────────┴──────┴─────────────┴────────┴────────────┘