The tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. Its packages share a common design philosophy, grammar, and data structures, so they work well together and encourage consistent practices. The tidyverse simplifies many common data tasks, promotes clear and readable code, and makes it easier to transform, visualize, and model data.

Let's begin by installing and loading up a few packages.

# if you don't have it installed
#install.packages(c('tidyverse', 'gt'))

# loading tidyverse
# (the quiet option only takes effect if it is set before the package is loaded)
options(tidyverse.quiet = TRUE)
library(tidyverse)
library(gt)

Loading Data

In most cases, the first step in analyzing data is reading or loading a dataset. For this example, we will use the datasets package, which ships with R and contains built-in data. We can use the data() function to load a dataset from it.

data(mtcars)

From here on, mtcars is available as a variable that holds the dataset. In the code below, we use a built-in function to check whether mtcars is a data.frame object.

is.data.frame(mtcars)
[1] TRUE

Dataframe vs. Tibble

There are some important differences between data.frame and tibble objects in R. You can read more about them elsewhere, but a summary is enough for this notebook:

1. Printing: Tibbles have a more user-friendly printing method that shows only the first 10 rows and the columns that fit on the screen, avoiding overwhelming output.

2. Column Types: Tibbles are stricter about column types and never convert strings to factors. Base data.frame() did so by default before R 4.0.0.

3. Subsetting: Tibbles do not allow partial matching of column names, reducing potential errors.

4. Performance: Tibbles are a more modern take on the data frame. They are not inherently faster, but their compact printing and stricter behavior make large datasets easier to work with than traditional data frames.
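The subsetting difference in the list above can be checked directly. The following is a minimal sketch; the two-row data and the column names displacement and cylinders are made up for illustration only.

```r
library(tibble)

# Hypothetical two-row data, used only for this illustration
df <- data.frame(displacement = c(160, 108), cylinders = c(6, 4))
tb <- as_tibble(df)

# Partial matching: a data.frame silently completes `disp` to `displacement`
df_partial <- df$disp

# A tibble refuses to guess: it returns NULL and warns about the unknown column
# (the warning is suppressed here to keep the output short)
tb_partial <- suppressWarnings(tb$disp)

# Single-column subsetting: a data.frame drops to a vector, a tibble stays a tibble
df_col <- df[, "cylinders"]
tb_col <- tb[, "cylinders"]

print(df_partial)
print(is.null(tb_partial))
print(class(df_col))
print(class(tb_col))
```

Because a tibble never completes a column name on its own, a typo tends to surface immediately instead of quietly returning the wrong column.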

With that in mind, we are going to use tibbles. We can convert a data.frame object into a tibble object using the as_tibble() function.

mtcars <- as_tibble(mtcars)

# printing tibble object
print(mtcars)
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
   21     6   160   110   3.9  2.62  16.5     0     1     4     4
   21     6   160   110   3.9  2.88  17.0     0     1     4     4
 22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
 21.4     6   258   110  3.08  3.22  19.4     1     0     3     1
 18.7     8   360   175  3.15  3.44  17.0     0     0     3     2
 18.1     6   225   105  2.76  3.46  20.2     1     0     3     1
 14.3     8   360   245  3.21  3.57  15.8     0     0     3     4
 24.4     4   147    62  3.69  3.19    20     1     0     4     2
 22.8     4   141    95  3.92  3.15  22.9     1     0     4     2
 19.2     6   168   123  3.92  3.44  18.3     1     0     4     4
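One side effect of the conversion is worth knowing: as_tibble() drops row names by default, which is why the car models no longer appear in the printed tibble. The sketch below shows one way to keep them, using the rownames argument of as_tibble(). The column name "model" and the variable names are our own choices, not part of the original notebook.

```r
library(tibble)

# Start from the untouched built-in data so the row names are still present
cars_df <- datasets::mtcars

# Default conversion: the row names (car models) are dropped
cars_plain <- as_tibble(cars_df)

# `rownames = "model"` stores them in a new first column instead
cars_named <- as_tibble(cars_df, rownames = "model")

print(is_tibble(cars_plain))
print(ncol(cars_plain))
print(ncol(cars_named))
print(head(cars_named$model, 3))
```

The rest of this notebook continues with the 11-column version created above.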

The mtcars object can be inspected in several ways. Since our emphasis is on tidyverse functions, we will look at a few of them.

1.1. slice_head()

The slice_head function is used to select the first few rows of a data frame or tibble, making it useful for quickly viewing the beginning of a dataset or extracting a subset of rows for analysis.

slice_head(mtcars, n = 5)
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
 21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
 22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
 21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
 18.7     8   360   175  3.15 3.440 17.02     0     0     3     2

1.2. slice_tail()

Similarly, the slice_tail() function returns the last n rows of the dataset.

slice_tail(mtcars, n = 4)
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 15.8     8   351   264  4.22  3.17 14.50     0     1     5     4
 19.7     6   145   175  3.62  2.77 15.50     0     1     5     6
 15.0     8   301   335  3.54  3.57 14.60     0     1     5     8
 21.4     4   121   109  4.11  2.78 18.60     1     1     4     2
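Both slice_head() and slice_tail() also accept a prop argument for selecting a fraction of the rows, and both work with the pipe. The following is a small sketch; the variable names are ours, and the data is rebuilt from the datasets package so the block runs on its own.

```r
library(dplyr)

cars <- as_tibble(datasets::mtcars)

# `prop` selects a fraction of the rows: 25% of 32 rows is the first 8
first_quarter <- slice_head(cars, prop = 0.25)

# The same functions can be used in a pipeline
last_four <- cars %>% slice_tail(n = 4)

print(nrow(first_quarter))
print(last_four$mpg)
```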

1.3. slice_sample()

The slice_sample() function randomly selects a specified number or proportion of rows from a data frame or tibble. Because the selection is random, the rows you get will differ from the ones shown here.

slice_sample(mtcars, n = 5)
  mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 15.2     8 275.8   180  3.07 3.780 18.00     0     0     3     3
 17.8     6 167.6   123  3.92 3.440 18.90     1     0     4     4
 10.4     8 460.0   215  3.00 5.424 17.82     0     0     3     4
 21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
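If you need the same random rows on every run, one common approach is to fix the random seed first. The sketch below also shows the prop argument for sampling a proportion instead of a fixed count. The seed value 42 and the variable names are arbitrary choices for illustration.

```r
library(dplyr)

cars <- as_tibble(datasets::mtcars)

# Fixing the seed makes the random draw repeatable
set.seed(42)
first_draw <- slice_sample(cars, n = 5)

set.seed(42)
second_draw <- slice_sample(cars, n = 5)

# Same seed, same rows
print(identical(first_draw, second_draw))

# `prop` samples a fraction of the rows: 25% of 32 rows is 8
quarter <- slice_sample(cars, prop = 0.25)
print(nrow(quarter))
```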

Data Dimensions

Another common task is understanding the structure and dimensions of your dataset. This includes the number of rows and columns, the data types, and the column names. The following functions help with that.

dim()

The dim() function returns the dimensions of the data: the number of rows followed by the number of columns. Our dataset has 32 rows and 11 columns, as seen in the output below.

dim(mtcars)
[1] 32 11
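When only one of the two dimensions is needed, the base R functions nrow() and ncol() return them individually. A minimal sketch, using a fresh copy of the data under our own variable name:

```r
cars <- datasets::mtcars

# Number of rows and number of columns, one at a time
n_rows <- nrow(cars)
n_cols <- ncol(cars)

print(n_rows)
print(n_cols)

# Together they match what dim() reports
print(identical(dim(cars), c(n_rows, n_cols)))
```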

str()

The str() function displays the structure of the dataset: each column's name and data type, the total number of observations, and a sample of the values.

str(mtcars)
tibble [32 × 11] (S3: tbl_df/tbl/data.frame)
 $ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num [1:32] 160 160 108 258 360 ...
 $ hp  : num [1:32] 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num [1:32] 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num [1:32] 16.5 17 18.6 19.4 17 ...
 $ vs  : num [1:32] 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num [1:32] 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num [1:32] 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num [1:32] 4 4 1 1 2 1 4 2 2 4 ...
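Since this notebook favors tidyverse functions, it is worth noting that dplyr offers glimpse() as an alternative to str(). It prints one line per column and fits as many values as the screen width allows. A short sketch, with the data rebuilt under our own variable name:

```r
library(dplyr)

cars <- as_tibble(datasets::mtcars)

# Prints the row and column counts, then one line per column
glimpse(cars)
```

glimpse() returns its input invisibly, so it can also be dropped into the middle of a pipeline to inspect intermediate results.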

summary()

The summary() function in R provides a quick overview of the key statistics for each variable in a data frame or vector. For numeric data, it returns the minimum, 1st quartile, median, mean, 3rd quartile, and maximum values. For factors, it provides the frequency of each level. This function is useful for understanding the distribution and central tendencies of your data at a glance.

summary(mtcars)
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
       am              gear            carb
 Min.   :0.0000   Min.   :3.000   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
 Median :0.0000   Median :4.000   Median :2.000
 Mean   :0.4062   Mean   :3.688   Mean   :2.812
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :1.0000   Max.   :5.000   Max.   :8.000
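All columns in mtcars are numeric, so the output above only shows the numeric form of summary(). To see the factor behavior described earlier, a column can be converted first. The sketch below treats cyl as a factor purely for illustration; the variable names are ours.

```r
cars <- datasets::mtcars

# Treat the cylinder count as a categorical variable
cyl_factor <- factor(cars$cyl)

# For a factor, summary() returns the frequency of each level
cyl_counts <- summary(cyl_factor)
print(cyl_counts)

# For a numeric vector, it returns the six-number summary
mpg_summary <- summary(cars$mpg)
print(mpg_summary)
```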

names()

The names() function returns the names of the variables (columns) in your dataset.

names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
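The result of names() is an ordinary character vector, so it can be used to check whether a column exists before working with it. A minimal sketch in base R, using our own variable names:

```r
cars <- datasets::mtcars

col_names <- names(cars)

# For data frames and tibbles, colnames() gives the same result
print(identical(col_names, colnames(cars)))

# Check for a column before using it
print("mpg" %in% col_names)
print(length(col_names))
```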