Importing Data from Multiple Sources

In real-world data science projects, data often comes from multiple sources and formats. Being able to efficiently import and combine data from these various sources is a crucial skill. In this section, we will cover how to import data from common sources such as CSV files, Excel files, databases, and web APIs using the tidyverse and related packages.

1. Reading Data from Weblink

Accessing and importing data directly from a web link is a common task in data science. It allows you to easily access datasets hosted online without having to manually download and save the files. The example below shows how to read a datafile posted from an online link:

url = 'https://www.stats.ox.ac.uk/pub/datasets/csb/ch11a.dat'
data = read.csv(url, sep = ' ')

head(data)
X01 X346 X0840 X36.33 X0
1 2 346 850 36.34 0
2 3 346 900 36.35 0
3 4 346 910 36.42 0
4 5 346 920 36.55 0
5 6 346 930 36.69 0
6 7 346 940 36.71 0

2. Importing Data from html

To import data from an HTML table, you can use the rvest package, which is part of the tidyverse. The rvest package is designed for web scraping and allows you to extract data from HTML web pages. Here’s how to do it using the example URL provided:

library(rvest)
library(tidyverse)

# URL of the HTML page
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"

# Read the HTML page
webpage <- read_html(url)

# Extract the table from the HTML page
table <- webpage %>%
    html_node("table") %>%
    html_table(fill = TRUE)

# Print the extracted data
head(table)
Year of birth Male Female Total
1880 118,399 97,605 216,004
1881 108,276 98,855 207,131
1882 122,031 115,694 237,725
1883 112,475 120,060 232,535
1884 122,738 137,585 260,323
1885 115,945 141,947 257,892

3. Importing Data from Excel Sheets

Excel sheets are commonly used for storing and analyzing data, making it essential to know how to import Excel files into R for further analysis and modeling. In this tutorial, we will demonstrate how to read Excel files using the readxl package, part of the tidyverse ecosystem.

Installing and Loading the readxl Package

# installing the readxl package
#install.packages("readxl")

library(readxl)

sample_xl <- readxl_example('datasets.xlsx')
data <- read_excel(sample_xl)
head(data, n = 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa

Reading Specific Excel Sheets

If your Excel file contains multiple sheets, you can specify which sheet to read by using the sheet parameter:

# reading the first sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 1)

head(excel_data)
YEAR P S
1950 12.9 181.9
1951 11.9 245.0
1952 10.7 250.2
1953 11.3 265.9
1954 11.2 248.5
1955 15.1 278.4
# reading the second sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 2)

head(excel_data)
x y z initial poprate inv school
356.7 374.3 0.8 3.091 1.4 0.1819 1.06
11.5 6.1 3.1 5.182 1.5 0.2447 0.98
7.3 3.6 2.3 3.908 0.3 0.2563 1.04
18.0 8.6 4.2 0.444 2.1 0.1199 0.61
6.3 1.4 2.7 0.595 3.0 0.1814 0.40
207.1 187.1 1.1 0.882 2.1 0.1708 0.68

Importing Data from a SQL Database

Accessing data stored in SQL databases is a common requirement for data analysis and modeling. In this tutorial, we will demonstrate how to connect to a SQL database and import data into R using the DBI and RSQLite packages, which are part of the tidyverse ecosystem.

Installing and Loading the Required Packages

#install.packages("DBI")
#install.packages("RSQLite")

library(DBI)
library(RSQLite)

# Establish a connection to the SQLite database
con <- dbConnect(RSQLite::SQLite(), dbname = '../Datasets/chinook.db')

# Print the connection object to verify
print(con)

Listing Tables in the Database

You can list all the tables available in the database using the dbListTables function:

# List all tables in the database
tables <- dbListTables(con)

# Print the list of tables
tables
'albums'•'artists'•'customers'•'employees'•'genres'•'invoice_items'•'invoices'•'media_types'•'playlist_track'•'playlists'•'sqlite_sequence'•'sqlite_stat1'•'tracks'

Importing Data from a Specific Table

To import data from a specific table, use the dbReadTable function. Note that in the example below, we use the customers table as an example.

# retrieve data from a table
data_table <- dbReadTable(con, 'customers')

# Print the imported data
head(data_table)
CustomerId FirstName LastName Company Address City State Country PostalCode Phone Fax Email SupportRepId
1 Lúís Gonçalves Embraer - Empresa Brasileira de Aeronáutica S.A. Av. Brigadeiro Faria Lima, 2170 São José dos Campos SP Brazil 12227-000 +55 (12) 3923‑5555 +55 (12) 3923‑5566 [email protected] 3
2 Leonie Köhler NA Theodor‑Heuss‑Straße 34 Stuttgart NA Germany 70174 +49 0711 2842222 NA [email protected] 5
3 François Tremblay NA 1498 rue Bélanger Montréal QC Canada H2G 1A7 +1 (514) 721‑4711 NA [email protected] 3
4 Bjørn Hansen NA Ullevålsveien 14 Oslo NA Norway 0171 +47 22 44 22 22 NA [email protected] 4
5 František Wichterlóva JetBrains s.r.o. Klanova 9/506 Prague NA Czech Republic 14700 +420 2 4172 5555 +420 2 4172 5555 [email protected] 4
6 Helena Holý NA Rilská 3174/6 Prague NA Czech Republic 14300 +420 2 4177 0449 NA [email protected] 5

Executing SQL Queries

Finally, you can also execute custom SQL queries using the dbGetQuery function to retrieve specific subsets of data. This gives you the power you'd normaly have on a DBMS and can be run through directly from R. Below is an example that returns customers from Germany

# Write your SQL query
query <- "SELECT * FROM customers WHERE Country = 'Germany' "

# Execute the SQL query and import the data
data_query <- dbGetQuery(con, query)

# Print the imported data from the query
data_query