Data Wrangling with tidyverse
We will now delve deeper into the
library(tidyverse)
coffee_sales <- read.csv('../Datasets/Coffee_Sales.csv')
coffee_sales <- as_tibble(coffee_sales)
# peeking through the data
slice_head(coffee_sales, n = 10)
date <chr> |
datetime <chr> |
cash_type <chr> |
card <chr> |
money <dbl> |
coffee_name <chr> |
---|---|---|---|---|---|
2024-03-01 | 2024-03-01 10:15:50.520 | card | ANON-0000-0000-0001 | 38.7 | Latte |
2024-03-01 | 2024-03-01 12:19:22.539 | card | ANON-0000-0000-0002 | 38.7 | Hot Chocolate |
2024-03-01 | 2024-03-01 12:20:18.089 | card | ANON-0000-0000-0002 | 38.7 | Hot Chocolate |
2024-03-01 | 2024-03-01 13:46:33.006 | card | ANON-0000-0000-0003 | 28.9 | Americano |
2024-03-01 | 2024-03-01 13:48:14.626 | card | ANON-0000-0000-0004 | 38.7 | Latte |
2024-03-01 | 2024-03-01 15:39:47.726 | card | ANON-0000-0000-0005 | 33.8 | Americano with Milk |
2024-03-01 | 2024-03-01 16:19:02.756 | card | ANON-0000-0000-0006 | 38.7 | Hot Chocolate |
2024-03-01 | 2024-03-01 18:39:03.580 | card | ANON-0000-0000-0007 | 33.8 | Americano with Milk |
2024-03-01 | 2024-03-01 19:22:01.762 | card | ANON-0000-0000-0008 | 38.7 | Cocoa |
2024-03-01 | 2024-03-01 19:23:15.887 | card | ANON-0000-0000-0008 | 33.8 | Americano with Milk |
dim(coffee_sales)
names(coffee_sales)
The select function
We now introduce the
coffee_sales %>%
select(date, cash_type, money, coffee_name) %>%
slice_sample(n = 10)
date <chr> |
cash_type <chr> |
money <dbl> |
coffee_name <chr> |
---|---|---|---|
2024-05-29 | card | 37.72 | Cappuccino |
2024-05-21 | cash | 39.00 | Latte |
2024-03-02 | cash | 40.00 | Latte |
2024-07-06 | card | 32.82 | Latte |
2024-06-23 | card | 32.82 | Americano with Milk |
2024-03-15 | cash | 40.00 | Latte |
2024-04-14 | cash | 40.00 | Latte |
2024-05-30 | card | 37.72 | Latte |
2024-05-02 | card | 27.92 | Americano |
2024-04-22 | cash | 39.00 | Latte |
As noted above,
2. Pipe Operator %>%
The pipe operator (`%>%`) allows for the chaining of multiple functions in a readable and concise manner. It passes the output of one function directly into the next function as an argument, enabling a smooth flow of data manipulation steps without the need for nested function calls, as in our example above.
Sometimes when using the select method, we wish to perform a negative selection. That is remove columns instead of picking the columns to stay. We can achieve this with the following method.
coffee_sales %>% select( -c(datetime, card)) %>% slice_sample(n = 10)
date <chr> |
cash_type <chr> |
money <dbl> |
coffee_name <chr> |
---|---|---|---|
2024-06-15 | card | 37.72 | Cocoa |
2024-05-14 | card | 37.72 | Cappuccino |
2024-03-01 | card | 33.80 | Americano with Milk |
2024-06-09 | card | 32.82 | Americano with Milk |
2024-06-13 | card | 32.82 | Americano with Milk |
2024-03-16 | card | 28.90 | Americano |
2024-05-19 | card | 32.82 | Americano with Milk |
2024-03-03 | card | 28.90 | Americano |
2024-03-21 | card | 38.70 | Latte |
2024-03-05 | card | 38.70 | Hot Chocolate |
Notice that we get similar results.
3. The renaming function
Renaming columns/variables can make your data analysis easier and smoother as variables reflect their true meaning. In our example, we can now rename the columns in the following order:
You can choose whichever names work for you.
coffee_sales %>%
# negative selection of columns - eliminate
select(-c(datetime, card)) %>%
# changing column names
rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
# see the top 10 values
slice_head(n = 10)
date <chr> |
payment_type <chr> |
price <dbl> |
drink_name <chr> |
---|---|---|---|
2024-03-01 | card | 38.7 | Latte |
2024-03-01 | card | 38.7 | Hot Chocolate |
2024-03-01 | card | 38.7 | Hot Chocolate |
2024-03-01 | card | 28.9 | Americano |
2024-03-01 | card | 38.7 | Latte |
2024-03-01 | card | 33.8 | Americano with Milk |
2024-03-01 | card | 38.7 | Hot Chocolate |
2024-03-01 | card | 33.8 | Americano with Milk |
2024-03-01 | card | 38.7 | Cocoa |
2024-03-01 | card | 33.8 | Americano with Milk |
Before we move on to the filter function, a useful operation to know is how to return a unique set of values from a column.
4. The distinct function.
The distinct function is used to remove duplicate rows from a data frame or tibble. It helps in ensuring that the data contains only unique combinations of the specified columns. In our example, we want to know that distinct values exist for a specific column or set of columns.
coffee_sales %>%
# negative selection of columns - eliminate
select(-c(datetime, card)) %>%
# renaming column/variable names
rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
distinct(drink_name)
drink_name <chr> |
---|
Latte |
Hot Chocolate |
Americano |
Americano with Milk |
Cocoa |
Cortado |
Espresso |
Cappuccino |
5. The filter Function
The
Below, we sub-select observations whose purchases are above 30 with cash for a specific set of drinks.
coffee_sales %>%
# negative selection of columns - eliminate
select(-c(datetime, card)) %>%
# renaming column/variable names
rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
# using multiple filter selection
filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano'))
date <chr> |
payment_type <chr> |
price <dbl> |
drink_name <chr> |
---|---|---|---|
2024-03-10 | cash | 30 | Americano |
2024-03-11 | cash | 30 | Americano |
2024-03-21 | cash | 30 | Americano |
2024-03-31 | cash | 30 | Americano |
2024-04-05 | cash | 30 | Americano |
2024-04-09 | cash | 30 | Americano |
Now we see that we have a subset of cash payments for a specific coffee drink.
6. The mutate function
Often times, you may need to create a new variable based on transformations or calculations involving existing
columns, enabling more complex data analysis and manipulation within your dataset. The
In the example below, we will create a new variable that assigns value
coffee_sales %>%
# negative selection of columns - eliminate
select(-c(datetime, card)) %>%
# renaming column/variable names
rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
# using multiple filter selection
# filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano')) %>%
# creating a new column
mutate( drink_type = ifelse(str_detect(drink_name, regex("chocolate|cocoa", ignore_case = TRUE)), "cocoa", "coffee") ) %>%
# show the first ten values
slice_head( n = 10 )
date | payment_type | price | drink_name | drink_type |
---|---|---|---|---|
2024-03-01 | card | 38.7 | Latte | coffee |
2024-03-01 | card | 38.7 | Hot Chocolate | cocoa |
2024-03-01 | card | 38.7 | Hot Chocolate | cocoa |
2024-03-01 | card | 28.9 | Americano | coffee |
2024-03-01 | card | 38.7 | Latte | coffee |
2024-03-01 | card | 33.8 | Americano with Milk | coffee |
2024-03-01 | card | 38.7 | Hot Chocolate | cocoa |
2024-03-01 | card | 33.8 | Americano with Milk | coffee |
2024-03-01 | card | 38.7 | Cocoa | cocoa |
2024-03-01 | card | 33.8 | Americano with Milk | coffee |
The above mutate function is a slightly more complicated than we have seen, however, it simply checks whether the
column value has the words
The group_by function
The next useful function is
coffee_sales %>%
# negative selection of columns - eliminate
select(-c(datetime, card)) %>%
# renaming column/variable names
rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
# using multiple filter selection
# filter( price >= 30, payment_type == 'cash', drink_name %in% c('Americano')) %>%
# creating a new column
mutate( drink_type = ifelse(str_detect(drink_name, regex("chocolate|cocoa", ignore_case = TRUE)), "cocoa", "coffee") ) %>%
# implementing a groupby and showing the sample data.
group_by( drink_type ) %>% glimpse()
9. The arrange function
Finally, we look at the
In the example, below we sort the results of the previous
coffee_sales %>%
# renaming the columns
rename( payment_type = cash_type, price = money, drink_name = coffee_name) %>%
# groupby data by drink_name
group_by( drink_name ) %>%
# summarise data by computing metrics
summarise( avg_price = mean(price), min_price = mean(price), max_price = max(price) ) %>%
# sort variables by avg_price ascending, drink_name descending
arrange( avg_price, desc(drink_name) )
drink_name | avg_price | min_price | max_price |
---|---|---|---|
Espresso | 23.57487 | 23.57487 | 25 |
Americano | 28.22110 | 28.22110 | 30 |
Cortado | 28.43977 | 28.43977 | 30 |
Americano with Milk | 32.85119 | 32.85119 | 35 |
Hot Chocolate | 37.74676 | 37.74676 | 40 |
Cappuccino | 37.90712 | 37.90712 | 40 |
Latte | 37.91300 | 37.91300 | 40 |
Cocoa | 38.07857 | 38.07857 | 40 |
We have explored several functions within the