Data Wrangling with tidyverse

We will now delve deeper into the tidyverse to explore data wrangling functions. We will examine several functions provided by the tidyverse library and how they work together. To do this, we will use the Coffee_Sales.csv dataset, which is available on Kaggle and included in the course materials.

library(tidyverse)
        
coffee_sales <- read.csv('../Datasets/Coffee_Sales.csv')
coffee_sales <- as_tibble(coffee_sales)

# peeking through the data
slice_head(coffee_sales, n = 10)
date
<chr>
datetime
<chr>
cash_type
<chr>
card
<chr>
money
<dbl>
coffee_name
<chr>
2024-03-01 2024-03-01 10:15:50.520 card ANON-0000-0000-0001 38.7 Latte
2024-03-01 2024-03-01 12:19:22.539 card ANON-0000-0000-0002 38.7 Hot Chocolate
2024-03-01 2024-03-01 12:20:18.089 card ANON-0000-0000-0002 38.7 Hot Chocolate
2024-03-01 2024-03-01 13:46:33.006 card ANON-0000-0000-0003 28.9 Americano
2024-03-01 2024-03-01 13:48:14.626 card ANON-0000-0000-0004 38.7 Latte
2024-03-01 2024-03-01 15:39:47.726 card ANON-0000-0000-0005 33.8 Americano with Milk
2024-03-01 2024-03-01 16:19:02.756 card ANON-0000-0000-0006 38.7 Hot Chocolate
2024-03-01 2024-03-01 18:39:03.580 card ANON-0000-0000-0007 33.8 Americano with Milk
2024-03-01 2024-03-01 19:22:01.762 card ANON-0000-0000-0008 38.7 Cocoa
2024-03-01 2024-03-01 19:23:15.887 card ANON-0000-0000-0008 33.8 Americano with Milk
dim(coffee_sales)
976 • 6
names(coffee_sales)
'date'•'datetime'•'cash_type'•'card'•'money'•'coffee_name'

The select function

We now introduce the select function, which returns specified columns from the original dataset. In this example, we wish to retain the columns: date, cash_type, money , and coffee_name.

coffee_sales %>% 
        select(date, cash_type, money, coffee_name) %>%
        slice_sample(n = 10)
date
<chr>
cash_type
<chr>
money
<dbl>
coffee_name
<chr>
2024-05-29 card 37.72 Cappuccino
2024-05-21 cash 39.00 Latte
2024-03-02 cash 40.00 Latte
2024-07-06 card 32.82 Latte
2024-06-23 card 32.82 Americano with Milk
2024-03-15 cash 40.00 Latte
2024-04-14 cash 40.00 Latte
2024-05-30 card 37.72 Latte
2024-05-02 card 27.92 Americano
2024-04-22 cash 39.00 Latte

As noted above, slice_sample returns a sample of 10 observations from the subset of columns specified using the select function. An important and interesting idea here is the use of %>%, also known as the pipe operator.

2. Pipe Operator %>%

The pipe operator (`%>%`) allows for the chaining of multiple functions in a readable and concise manner. It passes the output of one function directly into the next function as an argument, enabling a smooth flow of data manipulation steps without the need for nested function calls, as in our example above.

Sometimes when using the select method, we wish to perform a negative selection. That is remove columns instead of picking the columns to stay. We can achieve this with the following method.

coffee_sales %>% select( -c(datetime, card)) %>% slice_sample(n = 10)
date
<chr>
cash_type
<chr>
money
<dbl>
coffee_name
<chr>
2024-06-15 card 37.72 Cocoa
2024-05-14 card 37.72 Cappuccino
2024-03-01 card 33.80 Americano with Milk
2024-06-09 card 32.82 Americano with Milk
2024-06-13 card 32.82 Americano with Milk
2024-03-16 card 28.90 Americano
2024-05-19 card 32.82 Americano with Milk
2024-03-03 card 28.90 Americano
2024-03-21 card 38.70 Latte
2024-03-05 card 38.70 Hot Chocolate

Notice that we get similar results.

3. The renaming function

Renaming columns/variables can make your data analysis easier and smoother as variables reflect their true meaning. In our example, we can now rename the columns in the following order:

cash_type --> payment_type

money --> price

coffee_name --> drink_name

You can choose whichever names work for you.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # changing column names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # see the top 10 values 
        slice_head(n = 10)
date
<chr>
payment_type
<chr>
price
<dbl>
drink_name
<chr>
2024-03-01 card 38.7 Latte
2024-03-01 card 38.7 Hot Chocolate
2024-03-01 card 38.7 Hot Chocolate
2024-03-01 card 28.9 Americano
2024-03-01 card 38.7 Latte
2024-03-01 card 33.8 Americano with Milk
2024-03-01 card 38.7 Hot Chocolate
2024-03-01 card 33.8 Americano with Milk
2024-03-01 card 38.7 Cocoa
2024-03-01 card 33.8 Americano with Milk

Before we move on to the filter function, a useful operation to know is how to return a unique set of values from a column.

4. The distinct function.

The distinct function is used to remove duplicate rows from a data frame or tibble. It helps in ensuring that the data contains only unique combinations of the specified columns. In our example, we want to know that distinct values exist for a specific column or set of columns.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        distinct(drink_name)
drink_name
<chr>
Latte
Hot Chocolate
Americano
Americano with Milk
Cocoa
Cortado
Espresso
Cappuccino

5. The filter Function

The filter function sub-selects the data based on logical and mathematical comparison. This means that we can filter for observations of a specific name or value.

Below, we sub-select observations whose purchases are above 30 with cash for a specific set of drinks.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # using multiple filter selection
        filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano'))
date
<chr>
payment_type
<chr>
price
<dbl>
drink_name
<chr>
2024-03-10 cash 30 Americano
2024-03-11 cash 30 Americano
2024-03-21 cash 30 Americano
2024-03-31 cash 30 Americano
2024-04-05 cash 30 Americano
2024-04-09 cash 30 Americano

Now we see that we have a subset of cash payments for a specific coffee drink.

6. The mutate function

Often times, you may need to create a new variable based on transformations or calculations involving existing columns, enabling more complex data analysis and manipulation within your dataset. The mutate functions allows us to create new columns/variables.

In the example below, we will create a new variable that assigns value cocoa to a chocolate related drink or coffee otherwise. Specifically, it will check if a drink name has $chocolate$ of cocoa in it and assign it to cocoa and coffee otherwise

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # using multiple filter selection
        # filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano')) %>%
        # creating a new column 
        mutate( drink_type =  ifelse(str_detect(drink_name, regex("chocolate|cocoa", ignore_case = TRUE)), "cocoa",  "coffee") ) %>%
        # show the first ten values
        slice_head( n = 10 )
date payment_type price drink_name drink_type
2024-03-01card38.7Lattecoffee
2024-03-01card38.7Hot Chocolatecocoa
2024-03-01card38.7Hot Chocolatecocoa
2024-03-01card28.9Americanocoffee
2024-03-01card38.7Lattecoffee
2024-03-01card33.8Americano with Milkcoffee
2024-03-01card38.7Hot Chocolatecocoa
2024-03-01card33.8Americano with Milkcoffee
2024-03-01card38.7Cocoacocoa
2024-03-01card33.8Americano with Milkcoffee

The above mutate function is a slightly more complicated than we have seen, however, it simply checks whether the column value has the words chocolate and/or cocoa or not and assigns the outcome respectively.

The group_by function

The next useful function is group_by, which allows us to summarize observational groups. The example below groups the data by the newly created column drink_type.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # using multiple filter selection
        # filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano')) %>%
        # creating a new column 
        mutate( drink_type =  ifelse(str_detect(drink_name, regex("chocolate|cocoa", ignore_case = TRUE)), "cocoa",  "coffee") ) %>%
        # implementing a groupby and showing the sample data.
        group_by( drink_type ) %>% glimpse()
Rows: 976 Columns: 5 Groups: drink_type [2] date "2024-03-01", "2024-03-01", "2024-03-01", "2024-03-01", "… payment_type "card", "card", "card", "card", "card", "card", "card", "… price 38.7, 38.7, 38.7, 28.9, 38.7, 33.8, 38.7, 33.8, 38.7, 33.… drink_name "Latte", "Hot Chocolate", "Hot Chocolate", "Americano", "… drink_type "coffee", "cocoa", "cocoa", "coffee", "coffee", "coffee",…

9. The arrange function

Finally, we look at the arrange function, which is used to sort the tibble object based on columns. This function allows you to order your data in ascending or descending order, making it easier to analyze and interpret specific trends or patterns.Finally, we look at the arrange function which is used to sort the tibble object based on columns.

In the example, below we sort the results of the previous group_by and summarise function by sorting the outcome based on avg_price and drink_name . Try and determine what this function does: arrange(avg_price, desc(drink_name)

coffee_sales %>% 
        # renaming the columns
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # groupby data by drink_name
        group_by( drink_name ) %>%
        # summarise data by computing metrics
        summarise( avg_price = mean(price), min_price = mean(price), max_price = max(price) ) %>%
        # sort variables by avg_price ascending, drink_name descending
        arrange( avg_price, desc(drink_name) )
drink_name avg_price min_price max_price
Espresso23.5748723.5748725
Americano28.2211028.2211030
Cortado28.4397728.4397730
Americano with Milk32.8511932.8511935
Hot Chocolate37.7467637.7467640
Cappuccino37.9071237.9071240
Latte37.9130037.9130040
Cocoa38.0785738.0785740

We have explored several functions within the tidyverse library, though there are many more to cover. As the course progresses, we will introduce additional functions to enhance your data manipulation skills. You are also encouraged to explore and experiment with these functions on your own. In the next section, we will delve into data visualization.