Importing Data from Multiple Sources - Files

In real-world data science projects, data often comes from multiple sources and formats. Being able to efficiently import and combine data from these various sources is a crucial skill. In this section, we will cover how to import data from common sources such as CSV files, Excel files, databases, and web APIs using the tidyverse and related packages.

1. Reading Data from Weblink

Accessing and importing data directly from a web link is a common task in data science. It allows you to easily access datasets hosted online without having to manually download and save the files. The example below shows how to read a datafile posted from an online link:

url = 'https://www.stats.ox.ac.uk/pub/datasets/csb/ch11a.dat'
data = read.csv(url, sep = ' ')

head(data)
X01 X346 X0840 X36.33 X0
1 2 346 850 36.34 0
2 3 346 900 36.35 0
3 4 346 910 36.42 0
4 5 346 920 36.55 0
5 6 346 930 36.69 0
6 7 346 940 36.71 0

2. Importing Data from html

To import data from an HTML table, you can use the rvest package, which is part of the tidyverse. The rvest package is designed for web scraping and allows you to extract data from HTML web pages. Here’s how to do it using the example URL provided:

library(rvest)
library(tidyverse)

# URL of the HTML page
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"

# Read the HTML page
webpage <- read_html(url)

# Extract the table from the HTML page
table <- webpage %>%
    html_node("table") %>%
    html_table(fill = TRUE)

# Print the extracted data
head(table)
Year of birth Male Female Total
1880 118,399 97,605 216,004
1881 108,276 98,855 207,131
1882 122,031 115,694 237,725
1883 112,475 120,060 232,535
1884 122,738 137,585 260,323
1885 115,945 141,947 257,892

3. Importing Data from Excel Sheets

Excel sheets are commonly used for storing and analyzing data, making it essential to know how to import Excel files into R for further analysis and modeling. In this tutorial, we will demonstrate how to read Excel files using the readxl package, part of the tidyverse ecosystem.

Installing and Loading the readxl Package

# installing the readxl package
#install.packages("readxl")

library(readxl)

sample_xl <- readxl_example('datasets.xlsx')
data <- read_excel(sample_xl)
head(data, n = 10)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa

Reading Specific Excel Sheets

If your Excel file contains multiple sheets, you can specify which sheet to read by using the sheet parameter:

# reading the first sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 1)

head(excel_data)
YEAR P S
1950 12.9 181.9
1951 11.9 245.0
1952 10.7 250.2
1953 11.3 265.9
1954 11.2 248.5
1955 15.1 278.4
# reading the second sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 2)

head(excel_data)
x y z initial poprate inv school
356.7 374.3 0.8 3.091 1.4 0.1819 1.06
11.5 6.1 3.1 5.182 1.5 0.2447 0.98
7.3 3.6 2.3 3.908 0.3 0.2563 1.04
18.0 8.6 4.2 0.444 2.1 0.1199 0.61
6.3 1.4 2.7 0.595 3.0 0.1814 0.40
207.1 187.1 1.1 0.882 2.1 0.1708 0.68