Importing Data from Multiple Sources - Files
In real-world data science projects, data often comes from multiple sources and formats. Being able to efficiently import and combine data from these various sources is a crucial skill. In this section, we will cover how to import data from common sources such as CSV files, Excel files, databases, and web APIs using the tidyverse and related packages.
1. Reading Data from Weblink
Accessing and importing data directly from a web link is a common task in data science. It allows you to easily
access datasets hosted online without having to manually download and save the files. The example below shows
how to read a datafile posted from an online link:
url = 'https://www.stats.ox.ac.uk/pub/datasets/csb/ch11a.dat'
data = read.csv(url, sep = ' ')
head(data)
| X01 | X346 | X0840 | X36.33 | X0 | |
|---|---|---|---|---|---|
| 1 | 2 | 346 | 850 | 36.34 | 0 |
| 2 | 3 | 346 | 900 | 36.35 | 0 |
| 3 | 4 | 346 | 910 | 36.42 | 0 |
| 4 | 5 | 346 | 920 | 36.55 | 0 |
| 5 | 6 | 346 | 930 | 36.69 | 0 |
| 6 | 7 | 346 | 940 | 36.71 | 0 |
2. Importing Data from html
To import data from an HTML table, you can use the rvest package, which is part of the tidyverse. The rvest
package is designed for web scraping and allows you to extract data from HTML web pages. Here’s how to do it
using the example URL provided:
library(rvest)
library(tidyverse)
# URL of the HTML page
url <- "https://www.ssa.gov/oact/babynames/numberUSbirths.html"
# Read the HTML page
webpage <- read_html(url)
# Extract the table from the HTML page
table <- webpage %>%
html_node("table") %>%
html_table(fill = TRUE)
# Print the extracted data
head(table)
| Year of birth | Male | Female | Total |
|---|---|---|---|
| 1880 | 118,399 | 97,605 | 216,004 |
| 1881 | 108,276 | 98,855 | 207,131 |
| 1882 | 122,031 | 115,694 | 237,725 |
| 1883 | 112,475 | 120,060 | 232,535 |
| 1884 | 122,738 | 137,585 | 260,323 |
| 1885 | 115,945 | 141,947 | 257,892 |
3. Importing Data from Excel Sheets
Excel sheets are commonly used for storing and analyzing data, making it essential to know how to import Excel files into R for further analysis and modeling. In this tutorial, we will demonstrate how to read Excel files using the readxl package, part of the tidyverse ecosystem.
Installing and Loading the readxl Package
# installing the readxl package
#install.packages("readxl")
library(readxl)
sample_xl <- readxl_example('datasets.xlsx')
data <- read_excel(sample_xl)
head(data, n = 10)
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 4.9 | 3.1 | 1.5 | 0.1 | setosa |
Reading Specific Excel Sheets
If your Excel file contains multiple sheets, you can specify which sheet to read by using the sheet parameter:
# reading the first sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 1)
head(excel_data)
| YEAR | P | S |
|---|---|---|
| 1950 | 12.9 | 181.9 |
| 1951 | 11.9 | 245.0 |
| 1952 | 10.7 | 250.2 |
| 1953 | 11.3 | 265.9 |
| 1954 | 11.2 | 248.5 |
| 1955 | 15.1 | 278.4 |
# reading the second sheet
excel_data = read_xls('../Datasets/sales.xls', sheet = 2)
head(excel_data)
| x | y | z | initial | poprate | inv | school |
|---|---|---|---|---|---|---|
| 356.7 | 374.3 | 0.8 | 3.091 | 1.4 | 0.1819 | 1.06 |
| 11.5 | 6.1 | 3.1 | 5.182 | 1.5 | 0.2447 | 0.98 |
| 7.3 | 3.6 | 2.3 | 3.908 | 0.3 | 0.2563 | 1.04 |
| 18.0 | 8.6 | 4.2 | 0.444 | 2.1 | 0.1199 | 0.61 |
| 6.3 | 1.4 | 2.7 | 0.595 | 3.0 | 0.1814 | 0.40 |
| 207.1 | 187.1 | 1.1 | 0.882 | 2.1 | 0.1708 | 0.68 |