This post covers the content and exercises for Ch 11: Data Import from R for Data Science. The chapter teaches how to read in plain text files of data.

library(tidyverse)

11.2 Getting started

Using {readr} to load text files.

11.2.2 Exercises

What function would you use to read a file where fields were separated with “|”?

read_delim(), with delim = "|"
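A minimal sketch of that answer follows. The sample text is made up for illustration, and it is wrapped in I() so that readr treats it as literal data rather than a file path.

```r
library(readr)

# Hypothetical pipe-separated data, supplied inline instead of from a file
pipe_text <- "x|y\n1|a\n2|b\n"

# delim tells read_delim() which single character separates the fields
pipe_df <- read_delim(I(pipe_text), delim = "|")
pipe_df
```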

Apart from file, skip, and comment, what other arguments do read_csv() and read_tsv() have in common?

col_names
col_types
locale
na
quoted_na
quote
trim_ws
n_max
guess_max
progress

What are the most important arguments to read_fwf()?

file
col_positions
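As a rough illustration of how file and col_positions work together, the sketch below reads two fixed-width records. The sample text, the column widths, and the column names are invented for the example; fwf_widths() is one of the readr helpers for building col_positions.

```r
library(readr)

# Hypothetical fixed-width data: name takes 6 characters, age takes 2
fwf_text <- "John  25\nMaria 31\n"

people <- read_fwf(
  I(fwf_text),
  col_positions = fwf_widths(c(6, 2), col_names = c("name", "age"))
)
people
```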

Sometimes strings in a CSV file contain commas. To prevent them from causing problems they need to be surrounded by a quoting character, like " or '. By convention, read_csv() assumes that the quoting character will be ", and if you want to change it you’ll need to use read_delim() instead. What arguments do you need to specify to read the following text into a data frame?

"x,y\n1,'a,b'"

(text <- read_csv("x,y\n1,'a,b'", quote = "'"))

## # A tibble: 1 x 2
##       x y    
##   <dbl> <chr>
## 1     1 a,b

Identify what is wrong with each of the following inline CSV files. What happens when you run the code?

# the header has only 2 column names, but each row has 3 values
# the extra third value in each row is dropped
read_csv("a,b\n1,2,3\n4,5,6")

## Warning: 2 parsing failures.
## row col  expected    actual         file
##   1  -- 2 columns 3 columns literal data
##   2  -- 2 columns 3 columns literal data

## # A tibble: 2 x 2
##       a     b
##   <dbl> <dbl>
## 1     1     2
## 2     4     5

# 3 column names, but the 1st row has 2 values and the 2nd row has 4
# the 1st row is padded with NA; the 4th value in the 2nd row is dropped
read_csv("a,b,c\n1,2\n1,2,3,4")

## Warning: 2 parsing failures.
## row col  expected    actual         file
##   1  -- 3 columns 2 columns literal data
##   2  -- 3 columns 4 columns literal data

## # A tibble: 2 x 3
##       a     b     c
##   <dbl> <dbl> <dbl>
## 1     1     2    NA
## 2     1     2     3

# the opening quote is never closed, and the row has 1 value for 2 column names
# the missing value for b is filled with NA
read_csv("a,b\n\"1")

## Warning: 2 parsing failures.
## row col                     expected    actual         file
##   1  a  closing quote at end of file           literal data
##   1  -- 2 columns                    1 columns literal data

## # A tibble: 1 x 2
##       a b    
##   <dbl> <chr>
## 1     1 <NA>

# a and b are coerced to character since they contain text
read_csv("a,b\n1,2\na,b")

## # A tibble: 2 x 2
##   a     b    
##   <chr> <chr>
## 1 1     2    
## 2 a     b

# the fields are separated by ";", so use read_delim() (or read_csv2()) instead of read_csv()
read_delim("a;b\n1;3", delim = ";")

## # A tibble: 1 x 2
##       a     b
##   <dbl> <dbl>
## 1     1     3

11.3 Parsing a vector

11.3.5 Exercises

What are the most important arguments to locale()?

#?locale()
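To make the help page more concrete, here is a small sketch that lists the arguments of locale() and exercises three of them. The input strings are made-up examples; which arguments count as "most important" depends on the data at hand.

```r
library(readr)

# List every argument that locale() accepts
names(formals(locale))

# decimal_mark and grouping_mark control how numbers are read
euro_number <- parse_number(
  "1.234,56",
  locale = locale(decimal_mark = ",", grouping_mark = ".")
)

# date_names (the first argument) sets the language of month and day names
french_date <- parse_date("1 janvier 2015", "%d %B %Y", locale = locale("fr"))

euro_number
french_date
```

The remaining arguments (date_format, time_format, tz, encoding) follow the same pattern: set them once in a locale object and pass it to any parse_*() or read_*() function.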

What happens if you try and set decimal_mark and grouping_mark to the same character? What happens to the default value of grouping_mark when you set decimal_mark to “,”? What happens to the default value of decimal_mark when you set the grouping_mark to “.”?

locale(decimal_mark = ",", grouping_mark = ",")

## Error: `decimal_mark` and `grouping_mark` must be different

# error occurs

locale(decimal_mark = ",")

## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
##         (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
##         June (Jun), July (Jul), August (Aug), September (Sep), October
##         (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM

# default grouping mark changed to "."

locale(grouping_mark = ".")

## <locale>
## Numbers:  123.456,78
## Formats:  %AD / %AT
## Timezone: UTC
## Encoding: UTF-8
## <date_names>
## Days:   Sunday (Sun), Monday (Mon), Tuesday (Tue), Wednesday (Wed), Thursday
##         (Thu), Friday (Fri), Saturday (Sat)
## Months: January (Jan), February (Feb), March (Mar), April (Apr), May (May),
##         June (Jun), July (Jul), August (Aug), September (Sep), October
##         (Oct), November (Nov), December (Dec)
## AM/PM:  AM/PM

# default decimal mark changed to ","

I didn’t discuss the date_format and time_format options to locale(). What do they do? Construct an example that shows when they might be useful.

According to the readr vignette, time_format currently isn’t used, but date_format sets the default date format, which readr uses when guessing whether a value is a date.

parse_guess("01/12/2017", locale = locale(date_format = "%d/%m/%Y"))

## [1] "2017-12-01"
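The effect on type guessing can be checked directly. The sketch below stores the locale in a variable (day_first, a name chosen here for illustration) and asks readr which parser it would pick for day-first dates.

```r
library(readr)

# A reusable locale for day/month/year dates
day_first <- locale(date_format = "%d/%m/%Y")

# guess_parser() reports the column type readr would choose for the input
guessed_type <- guess_parser("01/12/2017", locale = day_first)

# parse_guess() applies that guess and returns the parsed vector
parsed <- parse_guess(c("01/12/2017", "15/12/2017"), locale = day_first)

guessed_type
parsed
```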

If you live outside the US, create a new locale object that encapsulates the settings for the types of file you read most commonly.

Luckily I’m in the US

What’s the difference between read_csv() and read_csv2()?

The difference is in the defaults.
read_csv() uses a comma as the delimiter and a period as the decimal mark
read_csv2() uses a semicolon as the delimiter and a comma as the decimal mark
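A short sketch of the two readers on equivalent inline data (the values are invented for the example). Both calls should produce the same numbers, even though the text is written with different separators and decimal marks.

```r
library(readr)

# Comma-delimited, "." as the decimal mark
us_style <- read_csv(I("x,y\n1.5,2\n"))

# Semicolon-delimited, "," as the decimal mark
eu_style <- read_csv2(I("x;y\n1,5;2\n"))

us_style
eu_style
```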

What are the most common encodings used in Europe? What are the most common encodings used in Asia? Do some googling to find out.

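UTF-8 has been the dominant character encoding for the World Wide Web since 2009, and as of November 2017 accounts for 90.3% of all Web pages. Older files may still use legacy encodings: ISO-8859-1 (Latin-1) and Windows-1252 are common for Western European languages, while Shift JIS, GBK/GB 2312, Big5, and EUC-KR are common in East Asia.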
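When a file is not UTF-8, the encoding argument of locale() tells readr how to interpret the bytes. The sketch below follows the example from the same R4DS chapter: a string whose bytes are Latin-1 is converted to UTF-8 on parsing.

```r
library(readr)

# "\xf1" is the Latin-1 byte for the letter n with a tilde
latin1_text <- "El Ni\xf1o was particularly bad this year"

# Declare the source encoding so readr can convert the text to UTF-8
fixed_text <- parse_character(latin1_text, locale = locale(encoding = "Latin1"))
fixed_text

# If the encoding is unknown, guess_encoding() offers candidates with confidence scores
guess_encoding(charToRaw(latin1_text))
```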

Generate the correct format string to parse each of the following dates and times:

d1 <- "January 1, 2010"
parse_date(d1, format = "%B %d, %Y")

## [1] "2010-01-01"

d2 <- "2015-Mar-07"
parse_date(d2, format = "%Y-%b-%d")

## [1] "2015-03-07"

d3 <- "06-Jun-2017"
parse_date(d3, format = "%d-%b-%Y")

## [1] "2017-06-06"

d4 <- c("August 19 (2015)", "July 1 (2015)")
parse_date(d4, format = "%B %d (%Y)")

## [1] "2015-08-19" "2015-07-01"

d5 <- "12/30/14" # Dec 30, 2014
parse_date(d5, format = "%m/%d/%y")

## [1] "2014-12-30"

t1 <- "1705"
parse_time(t1, format = "%H%M")

## 17:05:00

t2 <- "11:15:10.12 PM"
parse_time(t2, format = "%I:%M:%OS %p")

## 23:15:10.12

11 Data Import - R for Data Science

Patrick O'Malley