This post covers the content and exercises for Ch 3: Data Visualization from R for Data Science. The chapter teaches the use of the grammar of graphics with ggplot2.

3.2 First Steps

We begin by loading the tidyverse which contains ggplot2 and numerous other useful libraries.

library(tidyverse)

The first plot addresses the question: Do cars with big engines use more fuel than cars with small engines?

head(mpg)

## # A tibble: 6 x 11
##   manufacturer model displ  year   cyl trans      drv     cty   hwy fl    class 
##   <chr>        <chr> <dbl> <int> <int> <chr>      <chr> <int> <int> <chr> <chr> 
## 1 audi         a4      1.8  1999     4 auto(l5)   f        18    29 p     compa…
## 2 audi         a4      1.8  1999     4 manual(m5) f        21    29 p     compa…
## 3 audi         a4      2    2008     4 manual(m6) f        20    31 p     compa…
## 4 audi         a4      2    2008     4 auto(av)   f        21    30 p     compa…
## 5 audi         a4      2.8  1999     6 auto(l5)   f        16    26 p     compa…
## 6 audi         a4      2.8  1999     6 manual(m5) f        18    26 p     compa…

ggplot(data = mpg) +
        geom_point(mapping = aes(x = displ, y = hwy))

3.2.4 Exercises

Run ggplot(data = mpg). What do you see?

ggplot(data = mpg)

It’s an empty plot

How many rows are in mpg? How many columns?

dim(mpg)

## [1] 234  11

234 rows and 11 columns

What does the drv variable describe? Read the help for ?mpg to find out.

drv indicates whether its front / rear / 4 wheel drive

Make a scatterplot of hwy vs cyl.

ggplot(data = mpg) +
        geom_point(aes(x = cyl, y = hwy))

What happens if you make a scatterplot of class vs drv? Why is the plot not useful?

ggplot(data = mpg) + 
        geom_point(mapping = aes(x = drv, y = class))

Scatterplots are not useful when looking at two categorical variables

3.3 Aesthetic Mappings

They are used to set things like color and shapes based on a variable value

ggplot(data = mpg) +
        geom_point(aes(x = displ, y = hwy, color = class))

The outliers with high displacements and high mileage are 2 seaters. Since sports cars are lighter than the trucks and suvs in this displacement range it makes sense that they have higher mpg

3.3.1 Exercises

What’s gone wrong with this code? Why are the points not blue?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

The color should have been set outside of the aes

Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?

summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

cyt and hwy are the only continuous variables

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

ggplot(data = mpg) + 
        geom_point(mapping = aes(x = displ, y = hwy, color = cty))

ggplot(data = mpg) + 
        geom_point(mapping = aes(x = displ, y = hwy, size = cty))

ggplot(data = mpg) + 
        geom_point(mapping = aes(x = displ, y = hwy, shape = cty))

## Error: A continuous variable can not be mapped to shape

Size and color work with continuous variables, they just display a gradient. Shape does not work with continuous variables however.

What happens if you map the same variable to multiple aesthetics?

ggplot(data = mpg) + 
        geom_point(mapping = aes(x = displ, y = hwy, size = cty, color = cty, alpha = cty))

It’s not very helpful to use multiple aes on a single variable

What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)

ggplot(data = mpg) + 
        geom_point(mapping = aes(x = displ, y = hwy), shape = 21, stroke = 4)

stroke modifies the width of the shape border

What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)?

ggplot(data = mpg) + 
        geom_point(mapping = aes(x = displ, y = hwy, color = displ < 5))

It segments the aes based on the criteria specified

3.5 Facets

Facets can be used to split plots on categorical variable levels

facet_wrap() is used for splitting on a single variable

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

facet_grid() is used for splitting on two variables

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ cyl)

facet_grid(.~cyl) can be used to only facet by 1 variable

3.5.1 Exercises

What happens if you facet on a continuous variable?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ cty, nrow = 3)

A facet is created for each value

What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = drv, y = cyl))

The empty cells indicate there are no data for that combination of variables

What plots does the following code make? What does . do?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

drv is displayed in a single column and cyl is displayed in a single row

Take the first faceted plot in this section: What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

Faceting allows you to quickly distinguish the pattern for each level, however using color allows easier comparison between the levels. When working with larger data it may become difficult to clearly distinguish points based on color.

Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol argument?

nrow and ncol specify the number of rows and columns for the facets to occupy respectively. Other options include:
* switch - change location of labels * drop - removes unused factor levels * dir - direction * strip.position - change location of labels facet_grid() doesn’t use nrow and ncol since they are determined by the number of levels of the variables

When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_grid(drv ~ class)

Plots are typically wider than they are tall

3.6 Geometric Objects

These are the objects that are used to plot the data

# left
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy))

# right
ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Not every aes is available for every geom type

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

It helps to use aes when splitting by groups since grouping alone doesn’t provide any legend

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg) +
  geom_smooth(
    mapping = aes(x = displ, y = hwy, color = drv),
    show.legend = FALSE
  )

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

To display multiple geoms in a plot

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# and more compactly
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

To use different aesthetics in different layers

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Also able to specifiy a different data within geoms

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(mapping = aes(color = class)) + 
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

3.6.1 Exercises

What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

# line
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_line()

# boxplot
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

# histogram
ggplot(data = mpg, mapping = aes(x = hwy)) + 
  geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# area chart
ggplot(data = mpg) + 
  geom_ribbon(aes(displ, ymin = 0, ymax = hwy, fill = drv, color = drv))

Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point(show.legend = FALSE) + 
  geom_smooth(se = FALSE, show.legend = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

What does the se argument to geom_smooth() do?

Shows / hides the confidence interval

Will these two graphs look different? Why/why not?

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

They are both graphing the same data

Recreate the R code necessary to generate the following graphs.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth(aes(group = drv), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(aes(color = drv)) +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(aes(color = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point(aes(fill = drv), shape = 21, stroke = 2, color = "white")

3.7 Statistical Transformations

Several graphs plot the raw output of values in the dataset, however some graphs calculate new values for the plot, and the algorithm to produce these calculations is called the stat

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))

Different stats can be specified in the geom

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

Additionally you can override default mapping, ex display a bar chart as proportions

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

Can also focus on the summary for the statistical transformation with stat_summary()

ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

3.7.1 Exercises

What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

pointrange

ggplot(data = diamonds) + 
  geom_pointrange(
    mapping = aes(x = cut, y = depth),
    stat = "summary",
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

What does geom_col() do? How is it different to geom_bar()?

geom_col() uses the stat identity by default so if the values that are to be used for column heights are already in the data then it should be used

Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

geom	stat
area	identity
ribbon	identity
bar	count
col	identity
bin2d	bind2d
blank	identity
boxplot	boxplot
contour	contour
count	sum
crossbar	identity
errorbar	identity
errorbarh	identity
linerange	identity
pointrange	identity
curve	identity
segment	identity
density	density
density_2d	density2d
freqpoly	bin
histogram	bin
hex	binhex
jitter	identity
label	identity
path	identity
line	identity
map	identity
point	identity
polygon	identity
quantile	quantile
raster	identity
rect	identity
rug	identity
smooth	smooth
spoke	spoke
step	identity
text	identity
tile	identity
violin	ydensity

The majority of geoms use the identity stat so they require the data to contain the mapping values, while the remaining geoms are paired with the the stat that creates the specific shape of the plot from the data

What variables does stat_smooth() compute? What parameters control its behaviour?

it computes the formula y ~ x, and is controlled by method which determines the type of line to fit, and se which shows / hides the confidence interval

In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

Each cut category proportion is calculated separately rather than as a proportion of the overall number of diamonds. Adding group = 1 ensures that the proportions are calculated from the total count of diamonds. In the second plot the fill = color is getting the color proportion for the overall data set, while it should be pulling from each color category

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..count../sum(..count..), fill = color)) + 
        ylab("prop")

3.8 Position Adjustments

There are several position options for a bar chart, by default they are stacked

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity))

Another option is identity which is not as useful for bars since they wil overlap

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + 
  geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + 
  geom_bar(fill = NA, position = "identity")

fill is similar to stacking but makes all bars the same height

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

dodge places the bars next to eachother so they are easy to compare

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

jitter is not useful for bars but helpul to avoid overplotting with scatterplots

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")

3.8.1 Exercises

What is the problem with this plot? How could you improve it?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_point()

The plot likely suffers from overplotting and could be improved by using jitter

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter()

What parameters to geom_jitter() control the amount of jittering?

width and height control the amount of jittering

Compare and contrast geom_jitter() with geom_count()

jitter adds random noise to separate values they may be plotted on top of eachother
count tallies the amount of points in an area and then adjusts the size of the point to represent that amount

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_count()

What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

default position is "dodge"

ggplot(data = mpg, mapping = aes(x = drv, y = hwy)) + 
  geom_boxplot()

3.9 Coordinate Systems

Lots of other options beyond regular x, y Cartesian that can occasionally be useful

coord_flip() swaps x and y axis

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) + 
  geom_boxplot() +
  coord_flip()

coord_quickmap() sets the aspect ratio correctly for maps

nz <- map_data("nz")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")

ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()

coord_polar() uses polar coordinates

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = cut, fill = cut), 
    show.legend = FALSE,
    width = 1
  ) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar + coord_flip()

bar + coord_polar()

3.9.1 Exercises

Turn a stacked bar chart into a pie chart using coord_polar()

bar <- ggplot(data = diamonds) + 
  geom_bar(
    mapping = aes(x = 1, fill = color),
    width = 1) + 
  theme(aspect.ratio = 1) +
  labs(x = NULL, y = NULL)

bar

bar + coord_polar(theta = "y")

What does labs() do? Read the documentation.

Good labels are critical for making your plots accessible to a wider audience. Ensure the axis and legend labels display the full variable name. Use the plot title and subtitle to explain the main findings. It’s common to use the caption to provide information about the data source.

What’s the difference between coord_quickmap() and coord_map()?

coord_map projects a portion of the earth, which is approximately spherical, onto a flat 2D plane using any projection defined by the mapproj package. Map projections do not, in general, preserve straight lines, so this requires considerable computation. coord_quickmap is a quick approximation that does preserve straight lines. It works best for smaller areas closer to the equator.

What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()

cty and hwy have a positive linear relationship
coord_fixed() sets the aspect ratio, so that the x and y axes have the same scale
geom_abline adds a line with intercept = 0 and slope = 1

3.10 The Layered Grammar of Graphics

Now have a template for any ggplot

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

3 Data Visualization - R for Data Science