Descriptive Statistics

This notebook covers the descriptive statistics and summary methods available in pandas. It uses the Titanic dataset for demonstration.

import pandas as pd

titanic = pd.read_csv('./titanic.csv')
titanic.head()
OUTPUT
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

describe - summary statistics

The describe method returns summary statistics for every numeric column in the DataFrame: the count, mean and standard deviation, plus the five-number summary (minimum, 25th, 50th and 75th percentiles, and maximum). Non-numeric columns are skipped by default. To include them, pass the keyword argument _include="all"_, although the combined output is not meaningful for many applications.

titanic.describe()
OUTPUT
       PassengerId  Survived    Pclass      Age         SibSp       Parch       Fare
count  891.000000   891.000000  891.00000   714.000000  891.000000  891.000000  891.000000
mean   446.000000   0.383838    2.308642    29.699118   0.523008    0.381594    32.204208
std    257.353842   0.486592    0.836071    14.526497   1.102743    0.806057    49.693429
min    1.000000     0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%    223.500000   0.000000    2.000000    20.125000   0.000000    0.000000    7.910400
50%    446.000000   0.000000    3.000000    28.000000   0.000000    0.000000    14.454200
75%    668.500000   1.000000    3.000000    38.000000   1.000000    0.000000    31.000000
max    891.000000   1.000000    3.000000    80.000000   8.000000    6.000000    512.329200
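To compare the default behaviour with include="all" without needing the Titanic file, here is a minimal sketch on a small made-up frame. The `toy` DataFrame and its values are illustrative assumptions, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one string column.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, None],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", "female"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # numeric and non-numeric columns

print(numeric_summary)
print(full_summary)
```

In the combined summary, rows such as mean are NaN for the string column and rows such as unique are NaN for the numeric columns, which is why the mixed output is often hard to read.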

describe - include objects

To describe only the non-numeric data, we can tell the describe method to include only object columns. The summary then reports the count of non-null values, the number of unique values, the most frequent value (top) and how often it occurs (freq) for every non-numeric column.

titanic.describe(include='object')
OUTPUT
        Name                    Sex   Ticket    Cabin  Embarked
count   891                     891   891       204    889
unique  891                     2     681       147    3
top     Klasen, Mr. Klas Albin  male  CA. 2343  G6     S
freq    1                       577   7         4      644
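The four rows of the object summary can be checked by hand on a tiny example. This is a sketch under stated assumptions: the `toy` frame is made up, and the string column is given dtype=object explicitly so the example does not depend on how a particular pandas version stores strings.

```python
import pandas as pd

# Hypothetical data: one string column with a missing value, one numeric column.
toy = pd.DataFrame({
    "embarked": pd.Series(["S", "C", "S", None, "S"], dtype=object),
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
})

text_summary = toy.describe(include="object")
print(text_summary)
# count  -> 4 non-null values
# unique -> 2 distinct values ("S" and "C")
# top    -> "S", the most frequent value
# freq   -> 3, the number of times "S" occurs
```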

General Statistical Methods

Beyond the describe method, pandas offers many statistical methods that can be called directly on a DataFrame or Series. Below are a few examples.

var() - variance

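The var method returns the sample variance, $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2$, for each numeric column in the dataframe. pandas divides by $n - 1$ by default (ddof=1); pass ddof=0 to divide by $n$ and get the population variance instead.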

titanic.var(numeric_only=True)  # numeric_only=True skips string columns such as Name
OUTPUT
PassengerId  66231.000000
Survived     0.236772
Pclass       0.699015
Age          211.019125
SibSp        1.216043
Parch        0.649728
Fare         2469.436846
dtype: float64
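The variance can be reproduced step by step to see which divisor pandas uses. A minimal sketch, assuming a short made-up Series named `fares`:

```python
import pandas as pd

fares = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05])  # hypothetical values

n = len(fares)
mean = fares.sum() / n
squared_dev = ((fares - mean) ** 2).sum()  # sum of squared deviations from the mean

sample_var = squared_dev / (n - 1)   # matches var(), which defaults to ddof=1
population_var = squared_dev / n     # matches var(ddof=0)

print(sample_var, fares.var())
print(population_var, fares.var(ddof=0))
```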

std() - standard deviation

Similar to the variance method, the std method returns the standard deviation of each numeric column. Mathematically, $s = \sqrt{s^2}$, where $s^2$ is the variance returned by var().

titanic.std(numeric_only=True)
OUTPUT
PassengerId  257.353842
Survived     0.486592
Pclass       0.836071
Age          14.526497
SibSp        1.102743
Parch        0.806057
Fare         49.693429
dtype: float64
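The relationship between std() and var() is easy to confirm numerically. A small sketch, assuming a made-up Series named `ages`:

```python
import math

import pandas as pd

ages = pd.Series([22.0, 38.0, 26.0, 35.0, 35.0])  # hypothetical values

std_from_var = math.sqrt(ages.var())  # square root of the sample variance
print(ages.std(), std_from_var)

# Both methods accept the same ddof argument, so the identity also holds for ddof=0.
population_std_from_var = math.sqrt(ages.var(ddof=0))
print(ages.std(ddof=0), population_std_from_var)
```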

median() - median

The median method returns the middle value of the sorted observations (the 50th percentile) for each numeric column. Missing values are skipped.

titanic.median(numeric_only=True)
OUTPUT
PassengerId  446.0000
Survived     0.0000
Pclass       3.0000
Age          28.0000
SibSp        0.0000
Parch        0.0000
Fare         14.4542
dtype: float64
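What counts as the "middle" depends on whether the number of observations is odd or even. The sketch below uses three made-up Series to illustrate both cases, plus the default handling of a missing value.

```python
import pandas as pd

odd = pd.Series([3, 1, 2])                    # sorted: 1, 2, 3
even = pd.Series([4, 1, 3, 2])                # sorted: 1, 2, 3, 4
with_gap = pd.Series([4.0, None, 1.0, 3.0])   # one missing value

print(odd.median())       # 2.0, the single middle value
print(even.median())      # 2.5, the average of the two middle values
print(with_gap.median())  # 3.0, the NaN is skipped, leaving 1, 3, 4
```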

sum() - sum

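The sum method returns the sum of each column in the dataframe. As the output below shows, it is not limited to numeric columns: string columns such as Name, Sex and Ticket are concatenated, which is rarely useful. Pass numeric_only=True to restrict the result to numeric columns.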

titanic.sum()
OUTPUT
PassengerId  397386
Survived     342
Pclass       2057
Name         Braund, Mr. Owen HarrisCumings, Mrs. John Brad...
Sex          malefemalefemalefemalemalemalemalemalefemalefe...
Age          21205.2
SibSp        466
Parch        340
Ticket       A/5 21171PC 17599STON/O2. 31012821138033734503...
Fare         28693.9
dtype: object
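The concatenated strings in the output above can be avoided by restricting the sum to numeric columns. A minimal sketch on a made-up frame (the `toy` DataFrame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data: two numeric columns and one string column.
toy = pd.DataFrame({
    "survived": [0, 1, 1],
    "fare": [7.25, 71.28, 7.93],
    "sex": ["male", "female", "female"],
})

numeric_totals = toy.sum(numeric_only=True)  # the "sex" column is left out
print(numeric_totals)
```

Summing a 0/1 column such as `survived` counts the rows equal to 1, which is how the Survived total of 342 in the Titanic output can be read.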

mean()

Similar to the sum method, the mean method returns one value per column: the arithmetic average of each numeric column, computed over its non-missing values.

titanic.mean(numeric_only=True)
OUTPUT
PassengerId  446.000000
Survived     0.383838
Pclass       2.308642
Age          29.699118
SibSp        0.523008
Parch        0.381594
Fare         32.204208
dtype: float64
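Because missing values are skipped, the mean is the sum divided by the number of non-null values, not by the number of rows. This is why the Age mean of about 29.70 corresponds to 714 observations rather than 891. A small sketch with a made-up Series named `ages`:

```python
import math

import pandas as pd

ages = pd.Series([22.0, 38.0, None, 35.0])  # hypothetical values, one missing

non_null = ages.count()                 # 3, the NaN is not counted
mean_by_hand = ages.sum() / non_null    # 95.0 / 3
mean_keep_nan = ages.mean(skipna=False) # NaN, because the missing value is not skipped

print(non_null, ages.mean(), mean_by_hand, mean_keep_nan)
```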

cumsum()

The cumsum method computes the cumulative (running) sum of the selected column. Unlike the methods above, it does not reduce the column to a single value: it returns an object of the same length, holding the running total at each observation. Below, I print the first 5 cumulative sums.

titanic[['Fare']].cumsum(skipna=True, axis=0).head()
OUTPUT
   Fare
0  7.2500
1  78.5333
2  86.4583
3  139.5583
4  147.6083
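The running totals can be reproduced without the CSV file by typing in the first five Fare values from the head() output. The second half of this sketch uses a made-up Series to show how the skipna argument treats a missing value.

```python
import pandas as pd

# First five Fare values, copied from the head() output above.
fares = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05])
running = fares.cumsum()
print(running.round(4).tolist())

# Hypothetical data with a gap: the NaN stays NaN but does not reset the total.
with_gap = pd.Series([1.0, None, 2.0])
skipped = with_gap.cumsum()                 # 1.0, NaN, 3.0
propagated = with_gap.cumsum(skipna=False)  # 1.0, NaN, NaN
print(skipped.tolist(), propagated.tolist())
```

The last running total always equals the plain sum of the column, which is a quick sanity check.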

Other Statistical/Mathematical Functions

There are a few more statistical functions that we have not looked at, but they work in much the same way as the ones above. The most commonly used functions are:

count()   --> Number of non-null observations
sum()     --> Sum of values
mean()    --> Mean of values
median()  --> Median of values
mode()    --> Mode of values
std()     --> Standard deviation of the values
min()     --> Minimum value
max()     --> Maximum value
abs()     --> Absolute value
prod()    --> Product of values
cumsum()  --> Cumulative sum
cumprod() --> Cumulative product
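The functions in this list do not all return the same kind of result: some reduce a column to one value, while abs(), cumsum() and cumprod() return one value per observation, and mode() returns a Series because several values can tie. A short sketch on a made-up Series `s`:

```python
import pandas as pd

s = pd.Series([-2, 3, 3, 4])  # hypothetical values

modes = s.mode().tolist()            # [3], returned as a Series since ties are possible
absolute = s.abs().tolist()          # [2, 3, 3, 4], one value per observation
product = s.prod()                   # -72, a single value
running_product = s.cumprod().tolist()  # [-2, -6, -18, -72]

# agg() applies several reducing functions in one call.
summary = s.agg(["count", "min", "max"])

print(modes, absolute, product, running_product)
print(summary)
```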