Importing Data from Multiple Sources - Files

In real-world data science projects, data often comes from multiple sources and formats. Being able to efficiently import and combine data from these various sources is a crucial skill. In this section, we will cover how to import data from common sources such as CSV files, Excel files, databases, and web APIs using the tidyverse and related packages.

1. Reading Data from Weblink

Accessing and importing data directly from a web link is a common task in data science. It allows you to easily access datasets hosted online without having to manually download and save the files. The example below shows how to read a datafile posted from an online link:

url = 'https://www.stats.ox.ac.uk/pub/datasets/csb/ch11a.dat'
data = read.csv(url, sep = ' ')

head(data)

X01	X346	X0840	X36.33	X0
1	2	346	850	36.34
2	3	346	900	36.35
3	4	346	910	36.42
4	5	346	920	36.55
5	6	346	930	36.69
6	7	346	940	36.71

2. Importing Data from html

To import data from an HTML table, you can use the rvest package, which is part of the tidyverse. The rvest package is designed for web scraping and allows you to extract data from HTML web pages. Here’s how to do it using the example URL provided:

library(rvest)
library(tidyverse)

# URL of the HTML page
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"

# Read the HTML page
webpage <- read_html(url)

# Extract the table from the HTML page
table <- webpage %>%
    html_node("table") %>%
    html_table(fill = TRUE)

# Print the extracted data
head(table)

Year of birth	Male	Female	Total
1880	118,399	97,605	216,004
1881	108,276	98,855	207,131
1882	122,031	115,694	237,725
1883	112,475	120,060	232,535
1884	122,738	137,585	260,323
1885	115,945	141,947	257,892

3. Importing Data from Excel Sheets

Excel sheets are commonly used for storing and analyzing data, making it essential to know how to import Excel files into R for further analysis and modeling. In this tutorial, we will demonstrate how to read Excel files using the readxl package, part of the tidyverse ecosystem.

Installing and Loading the readxl Package

# installing the readxl package
#install.packages("readxl")

library(readxl)

sample_xl <- readxl_example('datasets.xlsx')
data <- read_excel(sample_xl)
head(data, n = 10)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5.0	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa

Reading Specific Excel Sheets

If your Excel file contains multiple sheets, you can specify which sheet to read by using the sheet parameter:

# reading the first sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 1)

head(excel_data)

YEAR	P	S
1950	12.9	181.9
1951	11.9	245.0
1952	10.7	250.2
1953	11.3	265.9
1954	11.2	248.5
1955	15.1	278.4

# reading the second sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 2)

head(excel_data)

x	y	z	initial	poprate	inv	school
356.7	374.3	0.8	3.091	1.4	0.1819	1.06
11.5	6.1	3.1	5.182	1.5	0.2447	0.98
7.3	3.6	2.3	3.908	0.3	0.2563	1.04
18.0	8.6	4.2	0.444	2.1	0.1199	0.61
6.3	1.4	2.7	0.595	3.0	0.1814	0.40
207.1	187.1	1.1	0.882	2.1	0.1708	0.68