Importing Data from Multiple Sources
In real-world data science projects, data often comes from multiple sources and formats. Being able to efficiently import and combine data from these various sources is a crucial skill. In this section, we will cover how to import data from common sources such as CSV files, Excel files, databases, and web APIs using the tidyverse and related packages.
1. Reading Data from Weblink
Accessing and importing data directly from a web link is a common task in data science. It allows you to easily
access datasets hosted online without having to manually download and save the files. The example below shows
how to read a datafile posted from an online link:
url = 'https://www.stats.ox.ac.uk/pub/datasets/csb/ch11a.dat'
data = read.csv(url, sep = ' ')
head(data)
X01 | X346 | X0840 | X36.33 | X0 | |
---|---|---|---|---|---|
1 | 2 | 346 | 850 | 36.34 | 0 |
2 | 3 | 346 | 900 | 36.35 | 0 |
3 | 4 | 346 | 910 | 36.42 | 0 |
4 | 5 | 346 | 920 | 36.55 | 0 |
5 | 6 | 346 | 930 | 36.69 | 0 |
6 | 7 | 346 | 940 | 36.71 | 0 |
2. Importing Data from html
To import data from an HTML table, you can use the rvest package, which is part of the tidyverse. The rvest
package is designed for web scraping and allows you to extract data from HTML web pages. Here’s how to do it
using the example URL provided:
library(rvest)
library(tidyverse)
# URL of the HTML page
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"
# Read the HTML page
webpage <- read_html(url)
# Extract the table from the HTML page
table <- webpage %>%
html_node("table") %>%
html_table(fill = TRUE)
# Print the extracted data
head(table)
Year of birth | Male | Female | Total |
---|---|---|---|
1880 | 118,399 | 97,605 | 216,004 |
1881 | 108,276 | 98,855 | 207,131 |
1882 | 122,031 | 115,694 | 237,725 |
1883 | 112,475 | 120,060 | 232,535 |
1884 | 122,738 | 137,585 | 260,323 |
1885 | 115,945 | 141,947 | 257,892 |
3. Importing Data from Excel Sheets
Excel sheets are commonly used for storing and analyzing data, making it essential to know how to import Excel files into R for further analysis and modeling. In this tutorial, we will demonstrate how to read Excel files using the readxl package, part of the tidyverse ecosystem.
Installing and Loading the readxl Package
# installing the readxl package
#install.packages("readxl")
library(readxl)
sample_xl <- readxl_example('datasets.xlsx')
data <- read_excel(sample_xl)
head(data, n = 10)
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
Reading Specific Excel Sheets
If your Excel file contains multiple sheets, you can specify which sheet to read by using the sheet parameter:
# reading the first sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 1)
head(excel_data)
YEAR | P | S |
---|---|---|
1950 | 12.9 | 181.9 |
1951 | 11.9 | 245.0 |
1952 | 10.7 | 250.2 |
1953 | 11.3 | 265.9 |
1954 | 11.2 | 248.5 |
1955 | 15.1 | 278.4 |
# reading the second sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 2)
head(excel_data)
x | y | z | initial | poprate | inv | school |
---|---|---|---|---|---|---|
356.7 | 374.3 | 0.8 | 3.091 | 1.4 | 0.1819 | 1.06 |
11.5 | 6.1 | 3.1 | 5.182 | 1.5 | 0.2447 | 0.98 |
7.3 | 3.6 | 2.3 | 3.908 | 0.3 | 0.2563 | 1.04 |
18.0 | 8.6 | 4.2 | 0.444 | 2.1 | 0.1199 | 0.61 |
6.3 | 1.4 | 2.7 | 0.595 | 3.0 | 0.1814 | 0.40 |
207.1 | 187.1 | 1.1 | 0.882 | 2.1 | 0.1708 | 0.68 |
Importing Data from a SQL Database
Accessing data stored in SQL databases is a common requirement for data analysis and modeling. In this tutorial, we will demonstrate how to connect to a SQL database and import data into R using the DBI and RSQLite packages, which are part of the tidyverse ecosystem.
Installing and Loading the Required Packages
#install.packages("DBI")
#install.packages("RSQLite")
library(DBI)
library(RSQLite)
# Establish a connection to the SQLite database
con <- dbConnect(RSQLite::SQLite(), dbname = '../Datasets/chinook.db')
# Print the connection object to verify
print(con)
Listing Tables in the Database
You can list all the tables available in the database using the dbListTables function:
# List all tables in the database
tables <- dbListTables(con)
# Print the list of tables
tables
Importing Data from a Specific Table
To import data from a specific table, use the
# retrieve data from a table
data_table <- dbReadTable(con, 'customers')
# Print the imported data
head(data_table)
CustomerId | FirstName | LastName | Company | Address | City | State | Country | PostalCode | Phone | Fax | SupportRepId | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Lúís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923‑5555 | +55 (12) 3923‑5566 | [email protected] | 3 |
2 | Leonie | Köhler | NA | Theodor‑Heuss‑Straße 34 | Stuttgart | NA | Germany | 70174 | +49 0711 2842222 | NA | [email protected] | 5 |
3 | François | Tremblay | NA | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721‑4711 | NA | [email protected] | 3 |
4 | Bjørn | Hansen | NA | Ullevålsveien 14 | Oslo | NA | Norway | 0171 | +47 22 44 22 22 | NA | [email protected] | 4 |
5 | František | Wichterlóva | JetBrains s.r.o. | Klanova 9/506 | Prague | NA | Czech Republic | 14700 | +420 2 4172 5555 | +420 2 4172 5555 | [email protected] | 4 |
6 | Helena | Holý | NA | Rilská 3174/6 | Prague | NA | Czech Republic | 14300 | +420 2 4177 0449 | NA | [email protected] | 5 |
Executing SQL Queries
Finally, you can also execute custom SQL queries using the
# Write your SQL query
query <- "SELECT * FROM customers WHERE Country = 'Germany' "
# Execute the SQL query and import the data
data_query <- dbGetQuery(con, query)
# Print the imported data from the query
data_query