Data Wrangling with tidyverse

We will now delve deeper into the tidyverse to explore data wrangling functions. We will examine several functions provided by the tidyverse library and how they work together. To do this, we will use the Coffee_Sales.csv dataset, which is available on Kaggle and included in the course materials.

library(tidyverse)
        
coffee_sales <- read.csv('../Datasets/Coffee_Sales.csv')
coffee_sales <- as_tibble(coffee_sales)

# peeking through the data
slice_head(coffee_sales, n = 10)

date <chr>	datetime <chr>	cash_type <chr>	card <chr>	money <dbl>	coffee_name <chr>
2024-03-01	2024-03-01 10:15:50.520	card	ANON-0000-0000-0001	38.7	Latte
2024-03-01	2024-03-01 12:19:22.539	card	ANON-0000-0000-0002	38.7	Hot Chocolate
2024-03-01	2024-03-01 12:20:18.089	card	ANON-0000-0000-0002	38.7	Hot Chocolate
2024-03-01	2024-03-01 13:46:33.006	card	ANON-0000-0000-0003	28.9	Americano
2024-03-01	2024-03-01 13:48:14.626	card	ANON-0000-0000-0004	38.7	Latte
2024-03-01	2024-03-01 15:39:47.726	card	ANON-0000-0000-0005	33.8	Americano with Milk
2024-03-01	2024-03-01 16:19:02.756	card	ANON-0000-0000-0006	38.7	Hot Chocolate
2024-03-01	2024-03-01 18:39:03.580	card	ANON-0000-0000-0007	33.8	Americano with Milk
2024-03-01	2024-03-01 19:22:01.762	card	ANON-0000-0000-0008	38.7	Cocoa
2024-03-01	2024-03-01 19:23:15.887	card	ANON-0000-0000-0008	33.8	Americano with Milk

dim(coffee_sales)

OUTPUT976 • 6

names(coffee_sales)

OUTPUT'date'•'datetime'•'cash_type'•'card'•'money'•'coffee_name'

The select function

We now introduce the select function, which returns specified columns from the original dataset. In this example, we wish to retain the columns: date, cash_type, money , and coffee_name.

coffee_sales %>% 
        select(date, cash_type, money, coffee_name) %>%
        slice_sample(n = 10)

date <chr>	cash_type <chr>	money <dbl>	coffee_name <chr>
2024-05-29	card	37.72	Cappuccino
2024-05-21	cash	39.00	Latte
2024-03-02	cash	40.00	Latte
2024-07-06	card	32.82	Latte
2024-06-23	card	32.82	Americano with Milk
2024-03-15	cash	40.00	Latte
2024-04-14	cash	40.00	Latte
2024-05-30	card	37.72	Latte
2024-05-02	card	27.92	Americano
2024-04-22	cash	39.00	Latte

As noted above, slice_sample returns a sample of 10 observations from the subset of columns specified using the select function. An important and interesting idea here is the use of %>%, also known as the pipe operator.

2. Pipe Operator %>%

The pipe operator (`%>%`) allows for the chaining of multiple functions in a readable and concise manner. It passes the output of one function directly into the next function as an argument, enabling a smooth flow of data manipulation steps without the need for nested function calls, as in our example above.

Sometimes when using the select method, we wish to perform a negative selection. That is remove columns instead of picking the columns to stay. We can achieve this with the following method.

coffee_sales %>% select( -c(datetime, card)) %>% slice_sample(n = 10)

date <chr>	cash_type <chr>	money <dbl>	coffee_name <chr>
2024-06-15	card	37.72	Cocoa
2024-05-14	card	37.72	Cappuccino
2024-03-01	card	33.80	Americano with Milk
2024-06-09	card	32.82	Americano with Milk
2024-06-13	card	32.82	Americano with Milk
2024-03-16	card	28.90	Americano
2024-05-19	card	32.82	Americano with Milk
2024-03-03	card	28.90	Americano
2024-03-21	card	38.70	Latte
2024-03-05	card	38.70	Hot Chocolate

Notice that we get similar results.

3. The renaming function

Renaming columns/variables can make your data analysis easier and smoother as variables reflect their true meaning. In our example, we can now rename the columns in the following order:

cash_type --> payment_type

money --> price

coffee_name --> drink_name

You can choose whichever names work for you.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # changing column names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # see the top 10 values 
        slice_head(n = 10)

date <chr>	payment_type <chr>	price <dbl>	drink_name <chr>
2024-03-01	card	38.7	Latte
2024-03-01	card	38.7	Hot Chocolate
2024-03-01	card	38.7	Hot Chocolate
2024-03-01	card	28.9	Americano
2024-03-01	card	38.7	Latte
2024-03-01	card	33.8	Americano with Milk
2024-03-01	card	38.7	Hot Chocolate
2024-03-01	card	33.8	Americano with Milk
2024-03-01	card	38.7	Cocoa
2024-03-01	card	33.8	Americano with Milk

Before we move on to the filter function, a useful operation to know is how to return a unique set of values from a column.

4. The distinct function.

The distinct function is used to remove duplicate rows from a data frame or tibble. It helps in ensuring that the data contains only unique combinations of the specified columns. In our example, we want to know that distinct values exist for a specific column or set of columns.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        distinct(drink_name)

drink_name <chr>
Latte
Hot Chocolate
Americano
Americano with Milk
Cocoa
Cortado
Espresso
Cappuccino

5. The filter Function

The filter function sub-selects the data based on logical and mathematical comparison. This means that we can filter for observations of a specific name or value.

Below, we sub-select observations whose purchases are above 30 with cash for a specific set of drinks.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # using multiple filter selection
        filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano'))

date <chr>	payment_type <chr>	price <dbl>	drink_name <chr>
2024-03-10	cash	30	Americano
2024-03-11	cash	30	Americano
2024-03-21	cash	30	Americano
2024-03-31	cash	30	Americano
2024-04-05	cash	30	Americano
2024-04-09	cash	30	Americano

Now we see that we have a subset of cash payments for a specific coffee drink.

6. The mutate function

Often times, you may need to create a new variable based on transformations or calculations involving existing columns, enabling more complex data analysis and manipulation within your dataset. The mutate functions allows us to create new columns/variables.

In the example below, we will create a new variable that assigns value cocoa to a chocolate related drink or coffee otherwise. Specifically, it will check if a drink name has $chocolate$ of cocoa in it and assign it to cocoa and coffee otherwise

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # using multiple filter selection
        # filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano')) %>%
        # creating a new column 
        mutate( drink_type =  ifelse(str_detect(drink_name, regex("chocolate|cocoa", ignore_case = TRUE)), "cocoa",  "coffee") ) %>%
        # show the first ten values
        slice_head( n = 10 )

date	payment_type	price	drink_name	drink_type
2024-03-01	card	38.7	Latte	coffee
2024-03-01	card	38.7	Hot Chocolate	cocoa
2024-03-01	card	38.7	Hot Chocolate	cocoa
2024-03-01	card	28.9	Americano	coffee
2024-03-01	card	38.7	Latte	coffee
2024-03-01	card	33.8	Americano with Milk	coffee
2024-03-01	card	38.7	Hot Chocolate	cocoa
2024-03-01	card	33.8	Americano with Milk	coffee
2024-03-01	card	38.7	Cocoa	cocoa
2024-03-01	card	33.8	Americano with Milk	coffee

The above mutate function is a slightly more complicated than we have seen, however, it simply checks whether the column value has the words chocolate and/or cocoa or not and assigns the outcome respectively.

8. The group_by function

The next useful function is group_by, which allows us to summarize observational groups. The example below groups the data by the newly created column drink_type.

coffee_sales %>% 
        # negative selection of columns - eliminate
        select(-c(datetime, card)) %>%
        # renaming column/variable names
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # using multiple filter selection
        # filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano')) %>%
        # creating a new column 
        mutate( drink_type =  ifelse(str_detect(drink_name, regex("chocolate|cocoa", ignore_case = TRUE)), "cocoa",  "coffee") ) %>%
        # implementing a groupby and showing the sample data.
        group_by( drink_type ) %>% glimpse()

OUTPUT Rows: 976 Columns: 5 Groups: drink_type [2] date "2024-03-01", "2024-03-01", "2024-03-01", "2024-03-01", "… payment_type "card", "card", "card", "card", "card", "card", "card", "… price 38.7, 38.7, 38.7, 28.9, 38.7, 33.8, 38.7, 33.8, 38.7, 33.… drink_name "Latte", "Hot Chocolate", "Hot Chocolate", "Americano", "… drink_type "coffee", "cocoa", "cocoa", "coffee", "coffee", "coffee",…

9. The arrange function

Finally, we look at the arrange function, which is used to sort the tibble object based on columns. This function allows you to order your data in ascending or descending order, making it easier to analyze and interpret specific trends or patterns.Finally, we look at the arrange function which is used to sort the tibble object based on columns.

In the example, below we sort the results of the previous group_by and summarise function by sorting the outcome based on avg_price and drink_name . Try and determine what this function does: arrange(avg_price, desc(drink_name)

coffee_sales %>% 
        # renaming the columns
        rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
        # groupby data by drink_name
        group_by( drink_name ) %>%
        # summarise data by computing metrics
        summarise( avg_price = mean(price), min_price = mean(price), max_price = max(price) ) %>%
        # sort variables by avg_price ascending, drink_name descending
        arrange( avg_price, desc(drink_name) )

drink_name	avg_price	min_price	max_price
Espresso	23.57487	23.57487	25
Americano	28.22110	28.22110	30
Cortado	28.43977	28.43977	30
Americano with Milk	32.85119	32.85119	35
Hot Chocolate	37.74676	37.74676	40
Cappuccino	37.90712	37.90712	40
Latte	37.91300	37.91300	40
Cocoa	38.07857	38.07857	40

We have explored several functions within the tidyverse library, though there are many more to cover. As the course progresses, we will introduce additional functions to enhance your data manipulation skills. You are also encouraged to explore and experiment with these functions on your own. In the next section, we will delve into data visualization.