The aim of the grammar of graphics is to provide an efficient language for describing visualizations. Using this grammar, however, is no guarantee that a visual will prove to be compelling. The next question is to determine how to apply the grammar to a real problem - what data, mappings, and geoms have the highest potential of revealing the stories locked inside of data frames?
To ensure both the exploratory and persuasive powers of visualizing data, a four-stage workflow is recommended. The four stages in the visualization workflow are:
By way of example, we work through three of the four stages of data visualization. The first stage, purpose, is defined for you to set the context for our example. The content, structure, and formatting will be decided as we go through the example. Read the following background about baseball in Colorado’s Coors Field:
The purpose of our data visualization is 1) to confirm the Coors field reputation that it is the easiest baseball stadium for teams to score runs at, and 2) if true, create a visual that informs its audience that Coors Field is the easiest baseball stadium to score runs at - note, the purpose of the visual should be obvious to all who encounter it.
Coors Field (Figure 9.1) has a reputation for being friendly to batters - particularly for teams to score runs easily. Our purpose is to form a data-driven opinion about the validity of this statement and create a persuasive visualization to support our opinion.
To match our purpose, we need information about average runs scored per game at all baseball stadiums. And probably not just one game at each stadium, rather, we want a bunch of data from each stadium - say at least 100 games from each stadium. We can then compare Coors Field (three-letter code:
COL) to all others. Luckily, the
causact package has some data that seems relevant:
# the following line loads data from the 2010 - 2014 # baseball seasons. The following data gets loaded: # Date: The date the baseball game was played # Home: A three letter code for the "home" team # Visitor:A three letter code for the "visiting" team # HomeScore: # of runs scored by the home team # VisitorScore: # of runs scored by the visiting team library(causact) ## get dataset library(tidyverse) ## loads dplyr and ggplot2 baseballDF = baseballData %>% as_tibble() baseballDF
## # A tibble: 12,145 × 5 ## Date Home Visitor HomeScore VisitorScore ## <int> <fct> <fct> <int> <int> ## 1 20100405 ANA MIN 6 3 ## 2 20100405 CHA CLE 6 0 ## 3 20100405 KCA DET 4 8 ## 4 20100405 OAK SEA 3 5 ## 5 20100405 TEX TOR 5 4 ## 6 20100405 ARI SDN 6 3 ## 7 20100405 ATL CHN 16 5 ## 8 20100405 CIN SLN 6 11 ## 9 20100405 HOU SFN 2 5 ## 10 20100405 MIL COL 3 5 ## # ℹ 12,135 more rows
We will now switch to using
library(dplyr) - it
loads both of them with one command.
WOW!! There are 12,145 observations of baseball games spread over 30 teams. This should be about 400 observations per stadium - that will work. We just need to get average total runs by stadium:
## # A tibble: 31 × 2 ## Home avgRuns ## <fct> <dbl> ## 1 COL 11.1 ## 2 BOS 9.76 ## 3 TEX 9.58 ## 4 TOR 9.42 ## 5 DET 9.19 ## 6 ARI 9.18 ## 7 NYA 9.13 ## 8 MIN 9.09 ## 9 BAL 9.03 ## 10 MIL 8.98 ## # ℹ 21 more rows
Viewing the data, it seems obvious that we have the content we need. Let’s now structure the visualization using the most appropriate grammatical mappings.
WARNING: No matter how good a visualization might look aesthetically, if its content does not match the purpose of your analytic investigation, then it will be useless. For all visualizations, think of what matters.
Structure is our choice in how to display the content we have collected. This choice should reveal the most important data characteristics and the relationships we are analyzing. By choosing a meaningful geom and mapping of aesthetics to data, we can give our brains the best chance at perceiving meaningful patterns and insights.
To limit the scope of our options, let’s restrict ourselves to two geoms:
geom_col() - often referred to as a scatterplot and a bar chart, respectively. These two seem to be the workhorses of my visualization efforts. In terms of aesthetic mappings, two main characteristics of the data determine the likely suitability of any aesthetic. The two are :
To see how to choose which aesthetics to map to different types of data, read through the below table specifying the ordinality and cardinality appropriate for each aesthetic:
||Handles Ordinal Data||Max. Cardinality||Notes|
||\(\checkmark\)||Infinite||Most powerful aesthetic. Use it for your most important data. Handles infinite data and small differences in data are easily detected by the human eye.|
Similar to above excepts mappings of values to colors will change the hue, not the actual color. So numerical data might have small numbers mapped to a dark blue color and large numbers mapped to a lighter shade of blue.
Given that we have only two columns of data in our example baseball dataset, we can assess their ordinality and cardinality to see which aesthetics are even appropriate:
baseballData2 %>% select(Home) %>% distinct() and
baseballData2 %>% select(avgRuns) %>% distinct() will
help you determine the cardinality of
avgRuns. Simply look at the number of rows in the output
and that is the cardinality of the selected data column.
From the above table, we see that only the x- or y-axis is good for stadium name. Let’s suggest the x-axis and see how it works. For ‘avgRuns’ we saw more relevant aesthetics, but since
y is not used and x/y-position is the most powerful aesthetic in visualization, we will map
avgRuns to the y-axis. In terms of geoms, I am somewhat indifferent between points and bars, let’s try both. Points are shown in Figure 9.2.
And bars can be seen in Figure 9.3.
As is often the case with using
geom_col() I will recommend flipping the axes when the x-axis labels look crowded. Figure 9.4 is a preferred result to my eye:
If you were only trying to convince yourself that Coors Field is truly the most run-friendly baseball stadium, I recommend stopping here. However, if you need to convince others, then the plot’s purpose and result should be made easier for an audience to digest. That is the goal of formatting.
The last stage of data visualization is formatting. In exploratory data analysis, the work on this step should be minimal. However, once you go beyond exploration and your visualization is to be shared, then you should spend considerable time formatting your work to be persuasive. There is always a little productive struggle required to get things perfect, so bring your patience and perseverance to this step.
The easiest fix to ensure your audience gets the message of your visualization is to add a title. The title should be the message, not just a description of the plot:
Notice, the code to create Figure 9.5 added a labels layer (i.e. using the
+ sign and
labs function). In addition to
title, the arguments of the
labs function include
x for the x-axis label,
y for the y-axis label,
caption. Let’s take the time now to at least make our axis labels more meaningful to an external audience and use the caption capability to communicate our data source:
For the next iteration of formatting, ask yourself how can I accelerate the rate at which my external audience can confirm the plot’s title/purpose from the visual. To do this, I need the audience to be able to 1) find Coors Field on the plot and 2) quickly assess that Coors Field is truly the most run-friendly.
Finding Coors Field can be made easier by highlighting it. Of all the remaining aesthetics we can choose from to use,
fill is going to be the most powerful. A quick attempt to use
fill to distinguish among stadiums might look like this:
While your instincts might be to say how do I change the
fill aesthetic for the bars to highlight Coors Field, you
must resist those instincts and ask “do I have a column of data that I
can map to the fill aesthetic which distinguishes Coors Field from all
Figure 9.7 is unsatisfying and shows the danger of mapping a higher cardinality data column to color. We really just want Coors Field to be highlighted. This means we need another column of data. In fact, we need an indicator function (recall from
dplyr chapters) that supplies the data which answers this question, and then, we can map that data to the
The indicator can be created with
mutate and we will call the new dataframe
plotDF for lack of creative naming capabilities and to communicate this data column is for plotting purposes:
and now we can change the
data argument and the
fill aesthetic of the previous plot to map to our new column:
Accelerating an external audience’s assessment of Coors Field as the most run-friendly stadium is the last bit of formatting we will tackle here. Looking at the data mapped to the vertical axis of Figure 9.8, we see the order/positioning of the stadiums is arbirtrary - it seems sort of alphabetical. Let’s try to use position order of the stadiums to reflect run-friendiness.
Just how is position order determined? Well notice that stadium is particular type of object in
R called a factor:
##  "factor"
Factors are useful whenever data has a fixed and known set of possibilities, such as the fixed set of baseball stadiums. We can see the order of a factor using the
##  "ANA" "ARI" "ATL" "BAL" "BOS" "CHA" "CHN" "CIN" "CLE" "COL" "DET" "FLO" ##  "HOU" "KCA" "LAN" "MIL" "MIN" "NYA" "NYN" "OAK" "PHI" "PIT" "SDN" "SEA" ##  "SFN" "SLN" "TBA" "TEX" "TOR" "WAS" "MIA"
See https://forcats.tidyverse.org/ for more information about working with factors using the forcasts package that is part of the tidyverse package group.
Notice the order matches that of the vertical axis (from bottom to top) in Figure 9.8. To reorder a factor by another variable, we can use the
fct_reorder function which takes two arguments: 1)
.f - the original factor and 2)
.x - another variable to reorder the levels by. By default, the
.x variable is sorted in ascending order. Using
fct_reorder within a
mutate function, we can overwrite the levels of
plotDF$Home and then plot the data as shown in Figure 9.9.
library(tidyverse) # to get fct_reorder function formattedPlot = plotDF %>% # save plot to object mutate(Home = fct_reorder(Home,avgRuns)) %>% ##NEW LINE ggplot(aes(y = Home, x = avgRuns)) + geom_col(aes(fill = Stadium)) + labs(title = "Colorado's Coors Field (COL) Is The Most Run-Friendly Baseball Stadium", y = "Three-Letter Stadium Code", x = "Average Runs Per Game", caption = "Data from all games played between 2010-2014") formattedPlot # view ggplot object
Often, in the final formatting, you will be dissatisfied with some of the default choices made by ggplot. For example, you may want to get rid of the grey background of your plots so that your plots look more like the ones in this book. Use a theme to do this. You can reproduce the book’s plots exactly by adding a theme layer to your ggplot and passing it a base font size of 14 (Figure 9.10):
EXPERT TIP: Everytime I load
ggplot2, I run this command
theme_set(theme_minimal(14)) to avoid having to add the
theme layer to each plot I make. It makes this theme the new
For controlling how a plot maps data values to a visual aesthetic, we use scales. The syntax for scales is
* is replaced by an aesthetic such as fill, color, alpha, etc. For Figure 9.10, the ugly blue and orange colors really need replacing. I often like to use a color that people associate with the data to make mental connections that much easier. For example, the Colorado Rockies baseball team that plays at Coors Field uses black, white, silver, and purple as their team colors. Since purple is the most distinct of those colors, let’s use purple for Coors Field. For all other stadiums, I will diminish their impact be selecting a light gray. Here is the code you would add to get this effect:
You can find a list of all the named colors in R at http://www.stat.columbia.edu/~tzheng/files/Rcolor.pdf.
And while part of me wants to continue with some more tweaking, I will restrain myself and declare Figure 9.11 good enough - the purpose of the plot is both clear and easily digested by an external audience.
There are too many decisions made in visualization to comprehensively cover every scenario. I strongly encourage you to take a look at Claus Wilke’s work “Fundamentals of Data Visualization” (http://serialmentor.com/dataviz/) for more tips and tricks on aligning purpose, content, structure, and formatting decisions.
Additionally (https://github.com/rstudio/cheatsheets/raw/main/data-visualization-2.1.pdf) is a link to the data visualization cheatsheet produced by Posit. Gain additional proficiency with
ggplot by experimenting with the various layers, functions, scales, etc. that are described within it.
penguins dataset is a fun dataset made available in R by Horst, Hill, and Gorman (2020Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://doi.org/10.5281/zenodo.3960218.). Use the below script to get
penguinsDF which is needed for these exercises.
Exercise 9.1 Figure 9.12 is used to illustrate a statistical concept known as Simpson’s paradox where the overall relationship in a population misleads your judgment to make wrong conclusions by omitting an important confounding variable (in this case species). Correct this plot that falsely implies a negative relationship between bill length and bill depth by mapping both color and shape to the species column of the dataframe. Figure out how to get three linear regression lines instead of just one to highlight the positive correlation within each species of bill length and bill depth. Also, use scale_color_manual so that Adelie penguin points are colored blue, chinstrap penguins are colored dark green, and gentoo penguins are colored dark red.
penguinsDF %>% ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) + geom_point(aes(shape = species), color = "cadetblue", size = 4) + labs(title = "Penguin bill dimensions (omit species)", subtitle = "Palmer Station LTER", x = "Bill length (mm)", y = "Bill depth (mm)") + theme_minimal(16) + geom_smooth(method = "lm", se = FALSE, color = "gray50")
Exercise 9.2 Use the penguin data to create the plot in Figure 9.13 (or something like it) with the same facets, colors, plot titles, and labels.
library(palmerpenguins) library(tidyverse) penguinsDF = penguins ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) + geom_point(aes(color = sex)) + scale_color_manual(values = c("darkorange","cyan4"), na.translate = FALSE) + labs(title = "Penguin flipper and body mass", subtitle = "Palmer Station Penguin Dimensions", x = "Flipper length (mm)", y = "Body mass (g)", color = "Penguin sex") + theme_minimal(15) + theme(legend.position = "bottom") + facet_wrap(~species)