Basic Expressions

This notebook looks at basic Polars expressions for data exploration and analysis. Many of these functions have counterparts in other tools such as pandas and should feel straightforward; this short note focuses on how the same ideas are implemented in Polars.

Reading a CSV File

Let's begin by reading a CSV file.

import numpy as np
import polars as pl

# read the employee dataset into a Polars DataFrame
data = pl.read_csv('data/employees.csv')

head()

The head() method returns the first rows of the DataFrame. You can pass an n argument to specify how many rows to return; in the example below, I set $n = 10$.

data.head(n = 10)
OUTPUT
shape: (10, 6)
┌─────┬─────────┬──────┬─────────────┬────────┬────────────┐
│ id  ┆ name    ┆ age  ┆ department  ┆ salary ┆ join_date  │
│ --- ┆ ---     ┆ ---  ┆ ---         ┆ ---    ┆ ---        │
│ i64 ┆ str     ┆ f64  ┆ str         ┆ i64    ┆ str        │
╞═════╪═════════╪══════╪═════════════╪════════╪════════════╡
│ 1   ┆ User_1  ┆ null ┆ null        ┆ 123355 ┆ 2020-01-01 │
│ 2   ┆ User_2  ┆ 28.0 ┆ Sales       ┆ 118399 ┆ 2020-01-02 │
│ 3   ┆ User_3  ┆ 29.0 ┆ Sales       ┆ 88727  ┆ 2020-01-03 │
│ 4   ┆ User_4  ┆ 48.0 ┆ Engineering ┆ 71572  ┆ 2020-01-04 │
│ 5   ┆ User_5  ┆ 22.0 ┆ Engineering ┆ 81849  ┆ 2020-01-05 │
│ 6   ┆ User_6  ┆ 24.0 ┆ Marketing   ┆ 90840  ┆ 2020-01-06 │
│ 7   ┆ User_7  ┆ 59.0 ┆ Marketing   ┆ 93847  ┆ 2020-01-07 │
│ 8   ┆ User_8  ┆ 46.0 ┆ Marketing   ┆ 57513  ┆ 2020-01-08 │
│ 9   ┆ User_9  ┆ 30.0 ┆ HR          ┆ 101219 ┆ 2020-01-09 │
│ 10  ┆ User_10 ┆ 28.0 ┆ Sales       ┆ 88030  ┆ 2020-01-10 │
└─────┴─────────┴──────┴─────────────┴────────┴────────────┘

tail()

Much like the head() method, the tail() method returns the last $n$ rows of the DataFrame.

data.tail(n = 10)
OUTPUT
shape: (10, 6)
┌─────┬──────────┬──────┬─────────────┬────────┬────────────┐
│ id  ┆ name     ┆ age  ┆ department  ┆ salary ┆ join_date  │
│ --- ┆ ---      ┆ ---  ┆ ---         ┆ ---    ┆ ---        │
│ i64 ┆ str      ┆ f64  ┆ str         ┆ i64    ┆ str        │
╞═════╪══════════╪══════╪═════════════╪════════╪════════════╡
│ 91  ┆ User_91  ┆ null ┆ null        ┆ 58468  ┆ 2020-03-31 │
│ 92  ┆ User_92  ┆ 42.0 ┆ Marketing   ┆ 69150  ┆ 2020-04-01 │
│ 93  ┆ User_93  ┆ 20.0 ┆ Marketing   ┆ 108655 ┆ 2020-04-02 │
│ 94  ┆ User_94  ┆ 57.0 ┆ HR          ┆ 142814 ┆ 2020-04-03 │
│ 95  ┆ User_95  ┆ 26.0 ┆ HR          ┆ 64033  ┆ 2020-04-04 │
│ 96  ┆ User_96  ┆ 56.0 ┆ Sales       ┆ 62415  ┆ 2020-04-05 │
│ 97  ┆ User_97  ┆ 35.0 ┆ Engineering ┆ 89151  ┆ 2020-04-06 │
│ 98  ┆ User_98  ┆ 38.0 ┆ Engineering ┆ 122927 ┆ 2020-04-07 │
│ 99  ┆ User_99  ┆ 51.0 ┆ HR          ┆ 106714 ┆ 2020-04-08 │
│ 100 ┆ User_100 ┆ 44.0 ┆ Sales       ┆ 137793 ┆ 2020-04-09 │
└─────┴──────────┴──────┴─────────────┴────────┴────────────┘

shape

The shape attribute returns a tuple containing the number of rows and columns in the DataFrame. The shape is also printed at the top of the head() and tail() output. The same values are available individually through the height and width attributes.

data.shape
OUTPUT
(100, 6)
data.height, data.width
OUTPUT
(100, 6)

schema

schema is an attribute that returns the data type assigned to each column of the DataFrame. A schema can be supplied explicitly when reading data, for type safety, or inferred from the data itself; in fact, Polars lets us control how many rows are scanned when inferring these types.

data.schema
OUTPUT
Schema([('id', Int64), ('name', String), ('age', Float64), ('department', String), ('salary', Int64), ('join_date', String)])
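As a sketch of those options, you can control how many rows are scanned for type inference and pin individual columns to an explicit dtype when reading the CSV. The parameter names below are from recent Polars releases; older versions call schema_overrides dtypes.

# control type inference while reading the CSV:
# - infer_schema_length sets how many rows Polars scans to infer dtypes
# - schema_overrides pins selected columns to an explicit dtype
data = pl.read_csv(
    'data/employees.csv',
    infer_schema_length=500,
    schema_overrides={'age': pl.Float64, 'salary': pl.Int64},
)
data.schema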

glimpse()

If you have used R and its suite of tidyverse tools, you may know the glimpse() function, which returns a compact summary of each column: the column name, its data type, and the first few observations.

data.glimpse()
OUTPUT
Rows: 100
Columns: 6
$ id         <i64> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
$ name       <str> 'User_1', 'User_2', 'User_3', 'User_4', 'User_5', 'User_6', 'User_7', 'User_8', 'User_9', 'User_10'
$ age        <f64> null, 28.0, 29.0, 48.0, 22.0, 24.0, 59.0, 46.0, 30.0, 28.0
$ department <str> null, 'Sales', 'Sales', 'Engineering', 'Engineering', 'Marketing', 'Marketing', 'Marketing', 'HR', 'Sales'
$ salary     <i64> 123355, 118399, 88727, 71572, 81849, 90840, 93847, 57513, 101219, 88030
$ join_date  <str> '2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-06', '2020-01-07',

null_count()

To get a sense of how much data is missing, the null_count() method returns the number of null values in each column.

data.null_count()
OUTPUT
shape: (1, 6)
┌─────┬──────┬─────┬────────────┬────────┬───────────┐
│ id  ┆ name ┆ age ┆ department ┆ salary ┆ join_date │
│ --- ┆ ---  ┆ --- ┆ ---        ┆ ---    ┆ ---       │
│ u32 ┆ u32  ┆ u32 ┆ u32        ┆ u32    ┆ u32       │
╞═════╪══════╪═════╪════════════╪════════╪═══════════╡
│ 0   ┆ 0    ┆ 10  ┆ 7          ┆ 0      ┆ 0         │
└─────┴──────┴─────┴────────────┴────────┴───────────┘
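If a share of missing values is easier to reason about than a raw count, a small follow-up sketch is to divide each column's null count by the row count.

# fraction of missing values per column (null count divided by the row count)
data.select(pl.all().null_count() / data.height)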

n_unique()

Similarly, we can get the number of unique rows in the DataFrame using the n_unique() method.

data.n_unique()
OUTPUT
100
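Note that on a DataFrame, n_unique() counts unique rows. As a small sketch, if you want the number of distinct values per column instead, you can apply the same expression column-wise:

# number of distinct values in each column rather than distinct rows
data.select(pl.all().n_unique())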

columns

The columns attribute returns the names of the columns of the DataFrame.

data.columns
OUTPUT
['id', 'name', 'age', 'department', 'salary', 'join_date']