Last compiled: 2023-09-05 15:26:25.305239
I write and update the solutions every day. Please let me know of any typos.
For Week 3: read WG chapters 1-3.
Run ggplot(data = mpg). What do you see?
library(tidyverse)
library(dplyr)
library(ggplot2)
ggplot(data = mpg)

The plot is empty because no layers were added with a geom function.
How many rows are in mtcars? How many columns?
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
There are 32 rows and 11 columns in mtcars.
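As a small alternative sketch, both numbers can be read in one call with base R's dim(). The variable name mtcars_dim is only illustrative.

```r
# dim() returns c(number of rows, number of columns) in one call.
mtcars_dim <- dim(mtcars)
mtcars_dim
```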
What does the drv variable describe? Read the help
for ?mpg to find out.
# ?mpg
drv is the type of drive train, where f = front-wheel
drive, r = rear wheel drive, 4 = 4wd.
Make a scatterplot of hwy versus cyl.
ggplot(mpg, aes(x = cyl, y = hwy)) +
geom_point()

What happens if you make a scatterplot of class
versus drv? Why is the plot not useful?
ggplot(mpg, aes(x = drv, y = class)) +
geom_point()

A scatterplot is not useful here because both drv and
class are categorical variables. Scatterplots are commonly
used to visualize two continuous variables using visual marks mapped to
a two-dimensional Cartesian space. With two categorical variables, the
points can only fall on a small grid of value combinations, so many
observations are drawn on top of one another. Within the mpg dataset,
we can see that class holds 7 types of cars, and drv
holds 3 types of drive train. Therefore, at most 21 distinct positions
can appear on the scatterplot, even though the dataset has 234 rows.
mpg %>% distinct(class) # or unique(mpg$class)
## # A tibble: 7 × 1
## class
## <chr>
## 1 compact
## 2 midsize
## 3 suv
## 4 2seater
## 5 minivan
## 6 pickup
## 7 subcompact
mpg %>% distinct(drv) # or unique(mpg$drv)
## # A tibble: 3 × 1
## drv
## <chr>
## 1 f
## 2 4
## 3 r
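To see how many drv and class combinations actually occur, and to make the overplotting visible, one option is to count the distinct pairs and switch to geom_count(). This is only a sketch. The names combos, n_combos, and p_count are illustrative.

```r
library(ggplot2)  # provides the mpg dataset

# Distinct (drv, class) pairs that occur in the data.
combos <- unique(mpg[, c("drv", "class")])
n_combos <- nrow(combos)
n_combos

# geom_count() sizes each point by the number of observations at that position.
p_count <- ggplot(mpg, aes(x = drv, y = class)) +
  geom_count()
p_count
```

The point sizes then carry the information that a plain scatterplot hides.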
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

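The color argument is defined as part of the
mapping, inside aes(). Therefore, ggplot2 treats "blue" as a data value
to map to a color rather than as the color itself, so the points get the
default color for a one-level variable, along with a legend.
This can be corrected as below: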
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Or,
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(colour = "blue")
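The difference between mapping and setting can also be checked without looking at the plot. The sketch below uses layer_data() from ggplot2 to inspect the colours that are finally drawn. The object names are illustrative.

```r
library(ggplot2)

# Mapped: "blue" is treated as a one-level data value inside aes().
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, color = "blue"))

# Set: "blue" is passed outside aes(), so it is used as a literal color.
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the computed data for a layer, including the final colour.
mapped_colours <- unique(layer_data(p_mapped)$colour)
set_colours <- unique(layer_data(p_set)$colour)
mapped_colours
set_colours
```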
Which variables in mpg are categorical? Which
variables are continuous? (Hint: type ?mpg to read the
documentation for the dataset). How can you see this information when
you run mpg?
head(mpg, 10) # or glimpse(mpg)
## # A tibble: 10 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
My initial screening is that variables with <chr>
are categorical, and those with <dbl> and
<int> are continuous. If you take a closer look at
the dataset, cyl and displ can be treated as
categorical since the cars fall into a limited set of values.
year, cty, and hwy are discrete variables, but we can
decide to treat them as continuous variables.
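One way to back up this screening is to list each column's storage type next to its number of distinct values. This is a rough sketch, and the names col_types and n_values are illustrative.

```r
library(ggplot2)

# Storage type of each column: character columns are categorical candidates.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))

# Number of distinct values per column: few values suggests a discrete variable.
n_values <- vapply(mpg, function(col) length(unique(col)), integer(1))

data.frame(type = col_types, n_distinct = n_values)
```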
Map a continuous variable to color,
size, and shape. How do these aesthetics
behave differently for categorical vs. continuous variables?
ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "City miles per gallon",
y = "Highway miles per gallon",
color = "Type of drive train")

It’s intuitive that cty and hwy will show a
positive linear relationship. The color mapping is set to
drv, which is a categorical variable.
ggplot(mpg, aes(x = cty, y = hwy, color = cyl)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "City miles per gallon",
y = "Highway miles per gallon",
color = "Number of cylinders")
Now we set our color mapping to cyl, a
continuous variable. The difference between mapping color
to a categorical versus a continuous variable shows in the legend. For
categorical variables, each category gets its own distinct color in the
legend, while a gradient scale is created for continuous variables.
Now, let’s take a look at the difference of categorical and
continuous variables being mapped to size:
ggplot(mpg, aes(x = cty, y = hwy, size = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "City miles per gallon",
y = "Highway miles per gallon",
size = "Type of drive train")

ggplot(mpg, aes(x = cty, y = hwy, size = cyl)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "City miles per gallon",
y = "Highway miles per gallon",
size = "Number of cylinders")

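When a continuous variable is mapped to size, the sizes of the points vary continuously with its value. Mapping a categorical variable such as drv to size also runs, but ggplot2 warns that using size for a discrete variable is not advised.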
Now, let’s assign the different variables types to
shape.
ggplot(mpg, aes(x = cty, y = hwy, shape = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "City miles per gallon",
y = "Highway miles per gallon",
shape = "Type of drive train")
When the categorical variable is mapped to shape, the three
types of drive train are assigned different point shapes and are
drawn on the graph accordingly. A continuous variable cannot be mapped
to shape, because shapes have no natural ordering. If you try, you will
get the following error message from R:
Error: A continuous variable can not be mapped to shape
Run rlang::last_error() to see where the error occurred.
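If you do want shapes for a numeric variable with few values, one workaround is to convert it to a factor first. The sketch below does this for cyl. The names p_shape and n_shapes are illustrative.

```r
library(ggplot2)

# factor() turns the integer cyl into a discrete variable, which shape accepts.
p_shape <- ggplot(mpg, aes(x = cty, y = hwy, shape = factor(cyl))) +
  geom_point() +
  labs(shape = "Number of cylinders")

# Number of distinct shapes actually used in the built plot.
n_shapes <- length(unique(layer_data(p_shape)$shape))
n_shapes
```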
ggplot(mpg, aes(x = cty, y = hwy, size = cty, color = hwy, shape = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "City miles per gallon",
y = "Highway miles per gallon",
size = "city miles per gallon",
color = "Highway miles per gallon",
shape = "Type of drive train")
In the plot above, I mapped cty to both the x-axis and size,
and hwy to both the y-axis and color. R runs the code, but the
graph is messy because the extra aesthetics carry redundant information.
This is not advised: mapping a single variable to multiple aesthetics is
bad practice and should be avoided.
What does the stroke aesthetic do? What shapes does
it work with? (Hint: use ?geom_point)
# ?geom_point
ggplot(mpg, aes(x = displ, y = cty)) +
geom_point(shape = 23, color = "#CC79A7", fill = "#F0E442", size = 3, stroke = 2)

The stroke aesthetic modifies the width of the
border for shapes. It works with shapes that have a border, such as
shapes 21 through 25, which also take the fill argument with a specific
color.
What happens if you map an aesthetic to something other than a variable
name, like aes(colour = displ < 5)? Note,
you’ll also need to specify x and y.
ggplot(mpg, aes(x = cty, y = hwy, color = displ < 5)) +
geom_point()

Aesthetics can be mapped to expressions as well as to variable
names. In this case, displ < 5 evaluates to either TRUE or FALSE
for each row, and the points are colored by that logical value.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(. ~ hwy)

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(. ~ hwy)

The primary use of facets is to add another categorical variable to
your plot. The variable that you pass to facet_wrap()
should be discrete. If you pass a continuous variable, it is
converted to a categorical variable, and the plot will contain a facet
for each distinct value.
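If you do need to facet on a continuous variable, one option is to bin it first. The sketch below uses ggplot2's cut_width() with an arbitrary bin width of 10. The names hwy_bins and p_binned are illustrative.

```r
library(ggplot2)

# cut_width() bins a continuous vector into intervals of the given width.
hwy_bins <- cut_width(mpg$hwy, width = 10)
table(hwy_bins)

# One facet per bin instead of one facet per distinct hwy value.
p_binned <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ cut_width(hwy, 10))
p_binned
```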
What do the empty cells in the plot with
facet_grid(drv ~ cyl) mean? How do they relate to this
plot?
ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))

ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy)) +
facet_grid(drv ~ cyl)

The empty cells in the faceted plot are combinations of drv and
cyl that have no observations. They match the positions in the
scatterplot of drv versus cyl where no point is drawn.
What plots does the following code make? What does .
do?
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)

ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)

Everything on the left of the ~ will split according to
rows, and everything on the right will split to columns. The
. is a placeholder meaning no faceting in that dimension:
drv ~ . facets by rows only, and . ~ cyl facets by
columns only. If you want to put cyl in the rows instead, this can be
done as cyl ~ ..
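The same layouts can also be written without the formula and its . placeholder, using the rows and cols arguments of facet_grid() with vars(). The sketch below shows both forms. The object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

# Same layout as facet_grid(drv ~ .): one row of panels per drv value.
p_rows <- base + facet_grid(rows = vars(drv))

# Same layout as facet_grid(. ~ cyl): one column of panels per cyl value.
p_cols <- base + facet_grid(cols = vars(cyl))

n_row_panels <- length(unique(layer_data(p_rows)$PANEL))
n_col_panels <- length(unique(layer_data(p_cols)$PANEL))
c(rows = n_row_panels, cols = n_col_panels)
```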
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the color
aesthetic? What are the disadvantages? How might the balance change if
you had a larger dataset?
Let’s take the provided code, but add the color aesthetic instead of using faceting:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = trans))
The advantage of adding color mapping is that we are able to add more
information to the graph. For instance, we are now able to distinguish
between the different types of transmission in the plot of engine
displacement against highway miles per gallon. Note that
using the color aesthetic can work against you if there are many categories.
In this case, readers may have a hard time differentiating the colors
between manual(m5) and manual(m6).
The disadvantage of using faceting instead of the color aesthetic is the difficulty of comparing the observations between categories plotted across different plots. It is much easier to interpret the graph if the observations are plotted on the same x- and y-scales.
Read ?facet_wrap. What does nrow do?
What does ncol do? What other options control the layout of
the individual panels? Why doesn’t facet_grid() have
nrow and ncol arguments?
# ?facet_wrap
# ?facet_grid
nrow and ncol refer to the number of rows
and columns used when laying out the facets. These arguments are needed
because facet_wrap() lays out a single sequence of panels and has to be
told how to wrap it. The arguments do not exist for facet_grid()
because the function determines the number of rows and columns from the
unique values of the specified variables.
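As a small sketch of how nrow and ncol reshape the same set of panels, the code below wraps the 7 class panels two different ways. The object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

# class has 7 values, so 7 panels are wrapped into the requested shape.
p_two_rows <- base + facet_wrap(~ class, nrow = 2)    # 2 rows, 4 columns
p_three_cols <- base + facet_wrap(~ class, ncol = 3)  # 3 columns, 3 rows

# The number of panels is the same either way; only the arrangement changes.
n_panels <- length(unique(layer_data(p_two_rows)$PANEL))
n_panels
```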
When using facet_grid() you should usually put the
variable with more unique levels in the columns. Why?
The simple reason is that there is more space for columns if the plots are laid out horizontally.
- geom_line(): line chart
- geom_boxplot(): boxplot
- geom_histogram(): histogram
- geom_area(): area chart
geom_point() will create a scatterplot with
displ on the x-axis, hwy on the y-axis,
drv grouped and colored. geom_smooth() will
create a smooth line, and se = FALSE will suppress standard
errors.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
What does show.legend = FALSE do? What happens if
you remove it? Why do you think I used it earlier in the chapter?
The code below is from this chapter where the authors have
show.legend = FALSE.
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, colour = drv),
show.legend = FALSE
)

Here is the same code with show.legend = TRUE:
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, colour = drv),
show.legend = TRUE
)

Adding the legend would be beneficial if the authors decided to use a
single graph. However, with three graphs adjacent to each other, adding
a legend to only the last graph may confuse readers, as well as offset
the plot sizes. Moreover, the authors' intention in showing the three
graphs together was to show the difference made by adding extra
group or color mappings to aesthetics.
Therefore, a legend was not necessary.
What does the se argument to
geom_smooth() do?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = TRUE)

As can be seen from the two graphs, the se argument
controls whether the confidence band around the smooth line is drawn.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()

ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

The first set of code has the mapping:
aes(x = displ, y = hwy), which is passed down to
geom_point() and geom_smooth(). For the second
set of code, it still uses the same mapping for both
geom_point() and geom_smooth() through
specification. Therefore, the two graphs will look identical.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(aes(group=drv), se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(aes(linetype = drv), se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(size = 5, color = "white") +
geom_point(aes(color = drv))

What is the default geom associated with
stat_summary()? How could you rewrite the previous plot to
use that geom function instead of the stat function?
I assume this is what the authors mean by previous plot:
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)

The documentation available through ?stat_summary shows that
the default is geom = "pointrange". The documentation
for ?geom_pointrange describes it as one of various ways of
representing a vertical interval defined by x,
ymin, and ymax. So here is what we can write
using geom_pointrange() instead of the
stat_summary() function:
ggplot(data = diamonds) +
geom_pointrange(
mapping = aes(x = cut, y = depth),
stat = "summary",
fun = median,
fun.min = min,
fun.max = max
)

What does geom_col() do? How is it different to
geom_bar()?
There are two types of bar charts: geom_bar() and
geom_col(). geom_bar() makes the height of the
bar proportional to the number of cases in each group (or if the weight
aesthetic is supplied, the sum of the weights). If you want the heights
of the bars to represent values in the data, use geom_col()
instead. geom_bar() uses stat_count() by
default: it counts the number of cases at each x position.
geom_col() uses stat_identity(): it leaves the
data as is.
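The relationship between the two geoms can be sketched by computing the counts by hand and passing them to geom_col(), which should give the same bar heights as geom_bar() on the raw data. The names cut_counts, p_bar, and p_col are illustrative.

```r
library(ggplot2)

# Pre-compute the counts that geom_bar() would compute internally.
cut_counts <- as.data.frame(table(cut = diamonds$cut))  # columns: cut, Freq

# geom_bar() counts rows per cut; geom_col() uses the Freq values as given.
p_bar <- ggplot(diamonds, aes(x = cut)) + geom_bar()
p_col <- ggplot(cut_counts, aes(x = cut, y = Freq)) + geom_col()

bar_heights <- layer_data(p_bar)$y
col_heights <- layer_data(p_col)$y
rbind(bar_heights, col_heights)
```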
The following table is a list of all the pairs I compiled. There was
a lesson from DataCamp that
covered this question. Also, this site contains a list
of all geom_ and stat_ functions available in
ggplot2.
| Geom_ | Stat_ | Common Description |
|---|---|---|
| geom_bar(), geom_col() | stat_count() | Bar charts |
| geom_bin2d() | stat_bin2d() | Heatmap of 2d bin counts |
| geom_boxplot() | stat_boxplot() | A box and whiskers plot (in the style of Tukey) |
| geom_contour(), geom_contour_filled() | stat_contour(), stat_contour_filled() | 2D contours of a 3D surface |
| geom_count() | stat_sum() | Count overlapping points |
| geom_density() | stat_density() | Smoothed density estimates |
| geom_density_2d() | stat_density_2d() | Contours of a 2D density estimate |
| geom_dotplot() | stat_bindot() | Dot plot |
| geom_function() | stat_function() | Draw a function as a continuous curve |
| geom_hex() | stat_bin_hex() | Hexagonal heatmap of 2d bin counts |
| geom_freqpoly(), geom_histogram() | stat_bin() | Histograms and frequency polygons |
| geom_qq(), geom_qq_line() | stat_qq(), stat_qq_line() | A quantile-quantile plot |
| geom_quantile() | stat_quantile() | Quantile regression |
| geom_sf() | stat_sf() | Visualize sf objects |
| geom_smooth() | stat_smooth() | Smoothed conditional means |
| geom_violin() | stat_violin() | Violin plot |
What variables does stat_smooth() compute? What
parameters control its behaviour?
Try running ?stat_smooth() on your R console.
stat_smooth() computes the following variables:
- y or x: Predicted value
- ymin or xmin: Lower pointwise confidence interval around the mean
- ymax or xmax: Upper pointwise confidence interval around the mean
- se: Standard error
The parameters controlling the behavior of stat_smooth():
- method: Smoothing method (function) to use, accepts either NULL or a character vector.
- formula: Formula to use in smoothing function.
- se: Display confidence interval around smooth (TRUE by default).
- na.rm: If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.
- orientation: The orientation of the layer.
- show.legend: NA, the default, includes if any aesthetics are mapped.
- method.args: List of additional arguments passed on to the modelling function defined by method.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))

ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))

As you can see from the output graphs, without group = 1,
all the bars end up with the same height. This is because
geom_bar() treats each x value as its own group by default, and
the stat computes prop within each group, so every bar's
proportion is 1.
Note: Read about the ?after_stat() function.
prop is the groupwise proportion, so setting group = 1 makes it
the share of all observations that fall in each bar.
Here is the intended bar graph:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
Next section (3.8) explains how you can fill the bars with colors.
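To check what group = 1 does to prop, the sketch below compares the proportions computed by the stat with proportions computed directly from the data. The names p_prop, stat_props, and manual_props are illustrative.

```r
library(ggplot2)

p_prop <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = after_stat(prop), group = 1))

# Proportions computed by the stat, one per cut level.
stat_props <- layer_data(p_prop)$prop

# The same proportions computed directly from the data.
manual_props <- as.numeric(prop.table(table(diamonds$cut)))

rbind(stat_props, manual_props)
```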
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()

Try running dim(mpg). There should be 234 points on this
graph, so we know the problem is that points are being overplotted. We
would need to adjust the position of the points.
Setting position = "jitter" fixes this problem.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position = "jitter")

What parameters to geom_jitter() control the amount
of jittering?
?geom_jitter() shows two parameters that control the
amount of jittering. The width controls the amount of
horizontal displacement, and height controls the amount of
vertical displacement. You can play around by assigning different values
to the two parameters.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width = 0)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(height = 0)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width = 20)

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(height = 20)

Note that if you assign 0s to both height and
width, the resulting graph will be identical to the one
made using geom_point().
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(height = 0, width = 0)
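Because jittering is random, the plot changes on every run. A minimal sketch of a reproducible version uses position_jitter(), which takes the same width and height controls plus a seed argument. The values 0.3 and seed = 1 are arbitrary choices for illustration.

```r
library(ggplot2)

# position_jitter() takes the same width/height controls plus a seed.
p_jit <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = position_jitter(width = 0.3, height = 0.3, seed = 1))

# Building the plot twice gives the same jittered positions because of the seed.
x_first <- layer_data(p_jit)$x
x_second <- layer_data(p_jit)$x

# Largest horizontal displacement; it stays within the requested width.
max_shift <- max(abs(x_first - mpg$cty))
max_shift
```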

Compare and contrast geom_jitter() with
geom_count().
geom_jitter() adds a small amount of random variation to
the location of each point. It is a useful way of handling overplotting
caused by discreteness in smaller datasets.
geom_count() counts the number of observations at each
location, then maps the count to point area. It is also useful when you
have discrete data and overplotting issue. Unlike
geom_jitter(), geom_count() is able to create
different size of the points relative to the number of observations.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()

With the mpg dataset, we see that
geom_count() is starting to cause the problem of
overplotting again.
We can play around with the color.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_jitter()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_count()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_count(position = "jitter")

While the graphs did add the additional information of class,
none of them seems to solve the overplotting issue.
What’s the default position adjustment for
geom_boxplot()? Create a visualisation of the
mpg dataset that demonstrates it.
?geom_boxplot() shows that the default position
adjustment is dodge2. Here is the R code from the previous
problem, but using geom_boxplot() instead:
ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) +
geom_boxplot()
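The dodging is easier to see with a discrete x variable. The sketch below is an alternative to the plot above, not the author's: it puts drv on the x-axis so that the boxes for each class sit side by side, and contrasts that with position = "identity". The object names are illustrative.

```r
library(ggplot2)

# Discrete x (drv) with a second grouping (class): boxes sharing an x value
# are dodged side by side by the default position = "dodge2".
p_dodged <- ggplot(mpg, aes(x = drv, y = hwy, color = class)) +
  geom_boxplot()

# Forcing position = "identity" draws the boxes on top of each other instead.
p_overlap <- ggplot(mpg, aes(x = drv, y = hwy, color = class)) +
  geom_boxplot(position = "identity")

# One box is drawn per drv and class combination present in the data.
n_boxes <- nrow(layer_data(p_dodged))
n_boxes
```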

Turn a stacked bar chart into a pie chart using coord_polar().
Start with a stacked bar chart:
ggplot(mpg, aes(x = factor(1), fill = drv)) +
geom_bar()

Add coord_polar(theta="y") to create pie chart:
ggplot(mpg, aes(x = factor(1), fill = drv)) +
geom_bar(width = 1) +
coord_polar(theta = "y")

What does labs() do? Read the documentation.
labs() allows you to add axis titles, plot titles,
and a caption to the plot. I have been using it since the very beginning
of this chapter.
Note that labs is not the only way to add titles. Other
functions such as xlab(), ylab(),
ggtitle() can also be used.
What’s the difference between coord_quickmap() and
coord_map()?
Why is coord_fixed() important? What
does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()
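As a sketch of what the two functions contribute: geom_abline() with no arguments draws the line with slope 1 and intercept 0, and coord_fixed() with its default ratio of 1 draws one unit on the x-axis at the same length as one unit on the y-axis, so that line sits at 45 degrees. The code below spells out those defaults and computes the share of cars above the line. The name share_above is illustrative.

```r
library(ggplot2)

# geom_abline() with no arguments draws y = x (slope = 1, intercept = 0);
# spelling the defaults out makes the reference line explicit.
p_ref <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_fixed(ratio = 1)  # one unit on x is drawn the same length as one unit on y
p_ref

# Share of cars whose highway mileage exceeds their city mileage.
share_above <- mean(mpg$hwy > mpg$cty)
share_above
```

Points above the reference line are cars that get better mileage on the highway than in the city.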
