11 Data Import - R for Data Science
This post covers the content and exercises for Ch 11: Data Import from R for Data Science. The chapter teaches how to read in plain text files of data.
library(tidyverse)
11.2 Getting started
Using {readr} to load text files.
11.2.2 Exercises
- What function would you use to read a file where fields were separated with “|”?
read_delim(file, delim = "|")
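A minimal sketch of that answer, using a made-up inline string (`pipe_text` is hypothetical, not from the book) in place of a real file:

```r
library(readr)

# Hypothetical inline data with "|" as the field separator.
# A string containing a newline is treated as literal data, not a file path.
pipe_text <- "x|y\n1|a\n2|b"

# delim sets the separator; col_types = "dc" declares x as double, y as character.
pipe_df <- read_delim(pipe_text, delim = "|", col_types = "dc")
pipe_df
```

Spelling out `col_types` is optional here; it just keeps the column-guessing message out of the way.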
- Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?
- col_names
- col_types
- locale
- na
- quoted_na
- quote
- trim_ws
- n_max
- guess_max
- progress
- What are the most important arguments to read_fwf()?
- file
- col_positions
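To make those two arguments concrete, here is a small sketch with invented fixed-width data (`fwf_text` and its column names are assumptions for illustration). `fwf_widths()` is one of the readr helpers for building `col_positions`:

```r
library(readr)

# Hypothetical fixed-width data: name takes 5 characters, age takes 3.
fwf_text <- "Alice 31\nBob   27\n"

fwf_df <- read_fwf(
  fwf_text,
  # Column widths and names; whitespace padding is trimmed by default.
  col_positions = fwf_widths(c(5, 3), col_names = c("name", "age")),
  col_types = "ci"  # name as character, age as integer
)
fwf_df
```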
- Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?
"x,y\n1,'a,b'"
(text <- read_csv("x,y\n1,'a,b'", quote = "'"))
## # A tibble: 1 x 2
## x y
## <dbl> <chr>
## 1 1 a,b
- Identify what is wrong with each of the following inline CSV files. What happens when you run the code?
# only 2 column names, but each row has 3 values
# the third value in each row is dropped
read_csv("a,b\n1,2,3\n4,5,6")
## Warning: 2 parsing failures.
## row col expected actual file
## 1 -- 2 columns 3 columns literal data
## 2 -- 2 columns 3 columns literal data
## # A tibble: 2 x 2
## a b
## <dbl> <dbl>
## 1 1 2
## 2 4 5
# 3 col names, 1st row has two obs, 2nd row has 4 obs
# add NA to 1st row, drop 4th col from 2nd row
read_csv("a,b,c\n1,2\n1,2,3,4")
## Warning: 2 parsing failures.
## row col expected actual file
## 1 -- 3 columns 2 columns literal data
## 2 -- 3 columns 4 columns literal data
## # A tibble: 2 x 3
## a b c
## <dbl> <dbl> <dbl>
## 1 1 2 NA
## 2 1 2 3
# 2 col names, but the only row has one value and an unclosed quote
# the quote is closed at end of file and b is filled with NA
read_csv("a,b\n\"1")
## Warning: 2 parsing failures.
## row col expected actual file
## 1 a closing quote at end of file literal data
## 1 -- 2 columns 1 columns literal data
## # A tibble: 1 x 2
## a b
## <dbl> <chr>
## 1 1 <NA>
# a and b are coerced to character since they contain text
read_csv("a,b\n1,2\na,b")
## # A tibble: 2 x 2
## a b
## <chr> <chr>
## 1 1 2
## 2 a b
# fields are separated by ";", so use read_delim() with delim = ";"
read_delim("a;b\n1;3", delim = ";")
## # A tibble: 1 x 2
## a b
## <dbl> <dbl>
## 1 1 3
11.3 Parsing a vector
11.3.5 Exercises
- What are the most important arguments to locale()?
#?locale()
- What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?
locale(decimal_mark = ",", grouping_mark = ",")
## Error: `decimal_mark` and `grouping_mark` must be different
# error occurs
locale(decimal_mark = ",")
## <locale>
## Numbers: 123.456,78
## Formats: %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
## (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
## June (Jun), July (Jul), August (Aug), September (Sep), October
## (Oct), November (Nov), December (Dec)
## AM/PM: AM/PM
# default grouping mark changed to "."
locale(grouping_mark = ".")
## <locale>
## Numbers: 123.456,78
## Formats: %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days: Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
## (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
## June (Jun), July (Jul), August (Aug), September (Sep), October
## (Oct), November (Nov), December (Dec)
## AM/PM: AM/PM
# default decimal mark changed to ","
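As a quick check of why these defaults matter, the sketch below (the input strings are made up) parses the same number written in European and US style with `parse_number()`:

```r
library(readr)

# With decimal_mark = ",", the grouping mark defaults to ".",
# so "1.234,56" is read as one thousand two hundred thirty-four point five six.
eu <- parse_number("1.234,56", locale = locale(decimal_mark = ","))

# The default locale uses "." as decimal mark and "," as grouping mark.
us <- parse_number("1,234.56")

c(eu, us)
```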
- I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.
- From the readr vignette, time_format currently isn’t used, but date_format is used to specify the format of the dates in the data.
parse_guess("01/12/2017", locale = locale(date_format = "%d/%m/%Y"))
## [1] "2017-12-01"
- If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.
- Luckily I’m in the US
- What’s the difference between read_csv() and read_csv2()?
- The difference is in the default delimiter: read_csv() uses a comma and read_csv2() uses a semicolon.
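A small sketch of the difference, using invented inline data (`semi_text` is hypothetical). Note that `read_csv2()` also treats the comma as the decimal mark, which is why it suits files from countries that write numbers that way:

```r
library(readr)

# Hypothetical semicolon-separated data with decimal commas.
semi_text <- "a;b\n1,5;2\n3,25;4"

# read_csv2() splits on ";" and parses "1,5" as 1.5.
csv2_df <- read_csv2(semi_text, col_types = "dd")
csv2_df
```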
- What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.
- UTF-8 has been the dominant character encoding for the World Wide Web since 2009, and as of November 2017 accounts for 90.3% of all Web pages.
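Older files may still use a legacy encoding such as Latin1 (ISO-8859-1), which is common for Western European languages. As a rough sketch, the string below (adapted from the chapter's own encoding example) is decoded by passing the encoding through `locale()`:

```r
library(readr)

# A string whose bytes are Latin1-encoded: \xf1 is "ñ" in Latin1.
x1 <- "El Ni\xf1o was particularly bad this year"

# Tell readr how the bytes are encoded; the result is returned as UTF-8.
decoded <- parse_character(x1, locale = locale(encoding = "Latin1"))
decoded
```

If the encoding is unknown, `guess_encoding(charToRaw(x1))` can suggest candidates, though its guesses are not always reliable on short inputs.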
- Generate the correct format string to parse each of the following dates and times:
d1 <- "January 1, 2010"
parse_date(d1, format = "%B %d, %Y")
## [1] "2010-01-01"
d2 <- "2015-Mar-07"
parse_date(d2, format = "%Y-%b-%d")
## [1] "2015-03-07"
d3 <- "06-Jun-2017"
parse_date(d3, format = "%d-%b-%Y")
## [1] "2017-06-06"
d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, format = "%B %d (%Y)")
## [1] "2015-08-19" "2015-07-01"
d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, format = "%m/%d/%y")
## [1] "2014-12-30"
t1 <- "1705"
parse_time(t1, format = "%H%M")
## 17:05:00
t2 <- "11:15:10.12 PM"
parse_time(t2, format = "%I:%M:%OS %p")
## 23:15:10.12