Column Selection

The ability to single out a column or a set of columns from a table is a useful way to explore and process a specific part of the dataset independently of the rest. It is also helpful in analysis and visualization. Let's begin by reading our original employees dataset.

import polars as pl

data = pl.read_csv('data/employees.csv')
data.head()

OUTPUT
shape: (5, 6)
┌─────┬────────┬──────┬─────────────┬────────┬────────────┐
│ id  ┆ name   ┆ age  ┆ department  ┆ salary ┆ join_date  │
│ --- ┆ ---    ┆ ---  ┆ ---         ┆ ---    ┆ ---        │
│ i64 ┆ str    ┆ f64  ┆ str         ┆ i64    ┆ str        │
╞═════╪════════╪══════╪═════════════╪════════╪════════════╡
│ 1   ┆ User_1 ┆ null ┆ null        ┆ 123355 ┆ 2020-01-01 │
│ 2   ┆ User_2 ┆ 28.0 ┆ Sales       ┆ 118399 ┆ 2020-01-02 │
│ 3   ┆ User_3 ┆ 29.0 ┆ Sales       ┆ 88727  ┆ 2020-01-03 │
│ 4   ┆ User_4 ┆ 48.0 ┆ Engineering ┆ 71572  ┆ 2020-01-04 │
│ 5   ┆ User_5 ┆ 22.0 ┆ Engineering ┆ 81849  ┆ 2020-01-05 │
└─────┴────────┴──────┴─────────────┴────────┴────────────┘

select()

The select() method is the main interface for working with specific columns. In the example below, I select the age column and store the result in its own variable.

age = data.select("age")
print(age.head())

OUTPUT
shape: (5, 1)
┌──────┐
│ age  │
│ ---  │
│ f64  │
╞══════╡
│ null │
│ 28.0 │
│ 29.0 │
│ 48.0 │
│ 22.0 │
└──────┘

Selecting multiple columns

The select() method can also be used to select more than one column. Simply pass the names of the columns of interest directly to select().

age_salary = data.select("age", "salary")
print(age_salary.head())

OUTPUT
shape: (5, 2)
┌──────┬────────┐
│ age  ┆ salary │
│ ---  ┆ ---    │
│ f64  ┆ i64    │
╞══════╪════════╡
│ null ┆ 123355 │
│ 28.0 ┆ 118399 │
│ 29.0 ┆ 88727  │
│ 48.0 ┆ 71572  │
│ 22.0 ┆ 81849  │
└──────┴────────┘

pl.col()

Within the select() method, we can also refer to a column with pl.col(). This is part of the broader concept of expressions in Polars. For example, pl.col("age") creates a lazy expression: it only describes which column to use, and it is evaluated when it is passed to a method such as select().

type(pl.col("age"))

OUTPUT
polars.expr.expr.Expr

age_salary = data.select(pl.col("age"), pl.col("salary"))
print(age_salary.head())

OUTPUT
shape: (5, 2)
┌──────┬────────┐
│ age  ┆ salary │
│ ---  ┆ ---    │
│ f64  ┆ i64    │
╞══════╪════════╡
│ null ┆ 123355 │
│ 28.0 ┆ 118399 │
│ 29.0 ┆ 88727  │
│ 48.0 ┆ 71572  │
│ 22.0 ┆ 81849  │
└──────┴────────┘
