Last compiled: 2023-09-05 15:26:25.305239
I write and update the solutions everyday. Please let me know of any typos.
For Week 3: read WG chapters 1-3.
ggplot(data = mpg)
. What do you see?library(tidyverse)
library(dplyr)
library(ggplot2)
ggplot(data = mpg)
This is an empty plot since no layers were added with geom function.
mtcars
? How many columns?nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 11
There are 32 rows and 11 columns in mtcars
.
drv
variable describe? Read the help
for ?mpg
tofind out.
# ?mpg
drv
is the type of drive train, where f = front-wheel
drive, r = rear wheel drive, 4 = 4wd.
hwy
versus
cyl
.ggplot(mpg, aes(x = cyl, y = hwy)) +
geom_point()
class
versus drv
?Why is the plot not useful?
ggplot(mpg, aes(x = drv, y = class)) +
geom_point()
A scatter plot is not a useful here because both drv
and
class
are categorical variables. Scatterplots are commonly
used to visualize two continuous variables using visual marks mapped to
a two-dimensional Cartesian space. Hence, the scatterplot here shows the
limitation of managing two categorical variables. This is largely
because categorical variables are typically stored as a small number of
values. Within the mpg
dataset, we can see that
class
variables holds 7 types of cars, and cyl
holds 4 types of cylinders. Therefore, there will only be 21 points
mapped on the scatterplot.
mpg %>% distinct(class) # or unique(mpg$class)
## # A tibble: 7 × 1
## class
## <chr>
## 1 compact
## 2 midsize
## 3 suv
## 4 2seater
## 5 minivan
## 6 pickup
## 7 subcompact
mpg %>% distinct(cyl) # or unique(mpg$cyl)
## # A tibble: 4 × 1
## cyl
## <int>
## 1 4
## 2 6
## 3 8
## 4 5
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
The color
argument is defined as part of the
mapping
. Therefore, it is being argued as an aesthetic.
This can be corrected as below:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
Or,
ggplot(data = mpg, aes(x = displ, y = hwy)) +
geom_point(colour = "blue")
mpg
are categorical? Which
variables are continuous? (Hint: type ?mpg
to read the
documentation for the dataset). How can you see this information when
you run mpg
?head(mpg, 10) # or glimpse(mpg)
## # A tibble: 10 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
My initial screening is that Variables with <chr>
are categorical, those with <dbl>
and
<int>
are continuous. If you take a closer look at
the dataset, cyl
, disp
can be treated as
categorical since there are categories of values the cars are stored
under. year
, city
, and
hwy
discrete variables, but we can decide to treat it as
continuous variables.
color
,
size
, and shape
. How do these aesthetics
behave differently for categorical vs. continuous variables?ggplot(mpg, aes(x = cty, y = hwy, color = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "Highway miles per gallon",
y = "city miles per gallon",
color = "Type of drive train")
It’s intuitive that cty
and hwy
will show a
positive linear relationship. The color
mapping is set to
drv
, which is a categorical variable.
ggplot(mpg, aes(x = cty, y = hwy, color = cyl)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "Highway miles per gallon",
y = "city miles per gallon",
color = "Number of cylinders")
Now we set our color
mapping to cyl
, a
continuous variable. The difference between mapping color
to a categorical vs continuous variables is the legend. For categorical
variables, distinct categories are created as the legend, while gradient
scale is created for continuous variables.
Now, let’s take a look at the difference of categorical and
continuous variables being mapped to size
:
ggplot(mpg, aes(x = cty, y = hwy, size = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "Highway miles per gallon",
y = "city miles per gallon",
size = "Type of drive train")
ggplot(mpg, aes(x = cty, y = hwy, size = cyl)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "Highway miles per gallon",
y = "city miles per gallon",
size = "Number of cylinders")
When mapped to size, the sizes of the points vary continuously as a function of their size.
Now, let’s assign the different variables types to
shape
.
ggplot(mpg, aes(x = cty, y = hwy, shape = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "Highway miles per gallon",
y = "city miles per gallon",
shape = "Type of drive train")
When the categorical variable is mapped to shape
, three
types of drive train are assigned to different pointer shapes, and are
mapped on the graph accordingly. Intuitively, we will not be able to map
a continuous variable to shape. If you do, you will get the following
error message from R:
Error: A continuous variable can not be mapped to shape Run
rlang::last_error()to see where the error occurred.
ggplot(mpg, aes(x = cty, y = hwy, size = cty, color = hwy, shape = drv)) +
geom_point() +
labs(title = "mpg dataset: hwy vs. cty",
caption = "mpg dataset",
x = "Highway miles per gallon",
y = "city miles per gallon",
size = "city miles per gallon",
color = "Highway miles per gallon",
shape = "Type of drive train")
In the plot above, I mapped cty
to both x-axis and size,
and hwy
to y-axis and color. The R runs and produces the
messy graph due to redundant information. This is not advised - mapping
a single varialbe to multiple aesthetics is a bad practice and should be
avoided.
stroke
aesthetic do? What shapes does
it work with? (Hint: use ?geom_point
)# ?geom_point
ggplot(mpg, aes(x = displ, y = cty)) +
geom_point(shape = 23, color = "#CC79A7", fill = "#F0E442", size = 3, stroke = 2)
The stroke
aesthetically modifies the width of the
border for shapes. It works with shapes numbering 21 through 25, which
are shapes that take the fill
command with a specific
color.
aes(colour = displ < 5)
? Note,
you’ll also need to specify x and y.ggplot(mpg, aes(x = cty, y = hwy, color = displ < 5)) +
geom_point()
You are able to specify equations as well as expressions for
aesthetic mappings. In this case, displ < 5
will return
either TRUE
or FALSE
.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_wrap(. ~ hwy)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
facet_grid(. ~ hwy)
The primary use of facets is to add another categorical variable to
your plot. The variable that you pass to facet_wrap()
should be discrete. If you pass continuous variable, then it is
converted to a categorical variable, and the plot will contain a facet
for each distinct value.
facet_grid(drv ~ cyl)
mean? How do they relate to this
plot?ggplot(data = mpg) +
geom_point(mapping = aes(x = drv, y = cyl))
ggplot(data = mpg) +
geom_point(mapping = aes(x = cty, y = hwy)) +
facet_grid(drv ~ cyl)
The scatterplot of drv
and cyl
shows empty
plots, and the empty cells matches where combinations of the two
variables have no observations.
.
do?ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(drv ~ .)
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_grid(. ~ cyl)
Everything on the left of the ~
will split according to
rows, and everything on the right will split to columns. The
.
on the right hand side of the formula
(. ~ cyl
) fixes the cyl
variable on the
x-axis. If you want to flip the facets, this can be done as
cyl ~ .
.
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy)) +
facet_wrap(~ class, nrow = 2)
What are the advantages to using faceting instead of the color
aesthetic? What are the disadvantages? How might the balance change if
you had a larger dataset?
Let’s take the provided code, but add the color aesthetic instead of using faceting:
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = trans))
The advantage of adding color mapping is that we are able to add more information to the graph. For instance, we are now able to distinguish between the different types of transmission between engine displacement and highway miles per hour variables across type of cars. Note that using the color aesthetic can work against if there are many categories. In this case, readers may have a hard time differentiating the colors between manual(m5) and manual(m6).
The disadvantage of using faceting instead of the color aesthetic is the difficulty of comparing the observations between categories plotted across different plots. It is much easier to interpret the graph if the observations are plotted on the same x- and y-scales.
?facet_wrap
. What does nrow
do?
What does ncol
do? What other options control the layout of
the individual panels? Why doesn’t facet_grid()
have
nrow
and ncol
arguments?# ?facet_wrap
# ?facet_grid
nrow
and ncol
refers to the number of rows
and columns used when laying out the facets. These arguments are needed
given that facet_wrap()
only takes one variable to create
facets. The arguments does not exist for facet_grid()
because the function determines the number of rows and columns from the
unique values of the specified variables.
facet_grid()
you should usually put the
variable with more unique levels in the columns. Why?The simple reason is that there is more space for columns if the plots are laid out horizontally.
geom_line()
: line chartgeom_boxplot()
: boxplotgeom_histogram()
: histogramgeom_area()
: area chart
geom_point()
will create a scatterplot with
displ
on the x-axis, hwy
on the y-axis,
drv
grouped and colored. geom_smooth()
will
create a smooth line, and se = FALSE
will suppress standard
errors.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
show.legend = FALSE
do? What happens if
you remove it? Why do you think I used it earlier in the chapter?The code below is from this chapter where the authors have
show.legend = FALSE
.
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, colour = drv),
show.legend = FALSE
)
Here is the same code with show.legend = TRUE
:
ggplot(data = mpg) +
geom_smooth(
mapping = aes(x = displ, y = hwy, colour = drv),
show.legend = TRUE
)
Adding the legend would be beneficial if the authors decided to use a
single graph. However, with three graphs adjacent to each other, adding
a legend to only the last graph may confuse readers, as well as offset
the plot sizes. Moreover, the authors intention of using the three
graphs altogether was to show the difference adding extra
group
or color
mappings to aesthetics.
Therefore, legend was not necessary.
se
argument to
geom_smooth()
do?ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point(mapping = aes(color = class)) +
geom_smooth(data = filter(mpg, class == "subcompact"), se = TRUE)
As it can be seen from the two graphs, the se
argument
adds the standard error bands to the lines.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
The first set of code has the mapping:
aes(x = displ, y = hwy)
, which is passed down to
geom_point()
and geom_smooth()
. For the second
set of code, it still uses the same mapping for both
geom_point()
and geom_smooth()
through
specification. Therefore, the two graphs will look identical.
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(se = FALSE)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth(aes(group=drv), se = FALSE)
ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
geom_point() +
geom_smooth(se = FALSE)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(se = FALSE)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(aes(linetype = drv), se = FALSE)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(size = 5, color = "white") +
geom_point(aes(color = drv))
stat_summary()
? How could you rewrite the previous plot to
use that geom function instead of the stat function?I assume this is what the authors mean by previous plot:
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.min = min,
fun.max = max,
fun = median
)
The documentation available by ?stat_summary
shows that
geom = "pointrage"
. The documentation
for?geom_pointrange
reads that it is various ways of
representing vertical interval defined by x
,
ymin
, and ymax
. So here is what we can write
using the geom_pointrange()
instead of the
stat_summary()
function:
ggplot(data = diamonds) +
geom_pointrange(
mapping = aes(x = cut, y = depth),
stat = "summary",
fun = median,
fun.min = min,
fun.max = max
)
geom_col()
do? How is it different to
geom_bar()
?There are two types of bar charts: geom_bar()
and
geom_col()
. geom_bar()
makes the height of the
bar proportional to the number of cases in each group (or if the weight
aesthetic is supplied, the sum of the weights). If you want the heights
of the bars to represent values in the data, use geom_col()
instead. geom_bar()
uses stat_count()
by
default: it counts the number of cases at each x position.
geom_col()
uses stat_identity()
: it leaves the
data as is.
The following table is a list of all the pairs I compiled. There was
a lesson from DataCamp that
covered this question. Also, this site contains list
of all geom_
and stat_
functions available in
ggplot2
.
Geom_ | Stat_ | Common Description |
---|---|---|
geom_bar() , geom_col() |
stat_count() |
Bar charts |
geom_bin2d() |
stat_bin2d() |
Heatmap of 2d bin counts |
geom_boxplot() |
stat_boxplot() |
A box and whiskers plot (in the style of Tukey) |
geom_contour() ,
geom_contour_filled() |
stat_contour() ,
stat_contour_filled() |
2D contours of a 3D surface |
geom_count() |
stat_sum() |
Count overlapping points |
geom_density() |
stat_density() , |
Smoothed density estimates |
geom_density_2d() |
stat_density_2d() |
Contours of a 2D density estimate |
geom_dotplot() |
stat_bindot() |
Dot plot |
geom_function() |
stat_function() |
Draw a function as a continuous curve |
geom_hex() |
stat_bin_hex() |
Hexagonal heatmap of 2d bin counts |
geom_freqpoly() ,
geom_histogram() |
stat_bin() |
Histograms and frequency polygons |
geom_qq() ,
geom_qq_line() |
stat_qq() ,
stat_qq_line() |
A quantile-quantile plot |
geom_quantile() |
stat_quantile() |
Quantile regression |
geom_sf() |
stat_sf() |
Visualize sf objects |
geom_smooth() |
stat_smooth() |
Smoothed conditional means |
geom_violin() |
stat_violin() |
Violin plot |
stat_smooth()
compute? What
parameters control its behaviour?Try running ?stat_smooth()
on your R console.
stat_smooth()
computes the following variables: -
y
or x
: Predicted value - ymin
or
xmin
: Lower pointwise confidence interval around the mean -
ymax' or
xmax: Upper pointwise confidence interval around the mean -
se`:
Standard error
The parameters controlling the behavior of
stat_smooth()
: - method
: Smoothing method
(function) to use, accepts either NULL
or a character
vector. - formula
: Formula to use in smoothing function. -
se
: display confidence interval around smooth
(TRUE
by default) - na.rm
: If
FALSE
, the default, missing values are removed with a
warning. If TRUE
, missing values are silently removed. -
orientation
: The orientation of the layer. -
show.legend
: NA
, the default, includes if any
aesthetics are mapped. - method.args
: List of additional
arguments passed on to the modelling function defined by
method
.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop)))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = after_stat(prop)))
As you can see from the outpu graphs, without group = 1
,
all the bars end up with the same height. This is because the
geom_bar()
function assumes that the groups are equal to
the x
values since the stat computes the counts within the
group.
Note: Read about ?after_stat()
function.
prop
refers to percent of points in that panel in the
position.
Here is the intended bar graph:
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
Next section (3.8) explains how you can fill the bars with colors.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point()
Try running dim(mpg)
. There should be 234 points on this
graph, so we know the problem is that points over being over-plotted. We
would need to adjust the position of the points.
Setting position = "jitter"
fixes this problem.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point(position = "jitter")
geom_jitter()
control the amount
of jittering??geom_jitter()
shows two parameters that control the
amount of jittering. The width
controls the amount of
horizontal displacement, and height
controls the amount of
vertical displacement. You can play around by assigning different values
to the two parameters.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width = 0)
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(height = 0)
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(width = 20)
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(height = 20)
Note that if you assign 0s to both height
and
width
, the resulting graph will be identical to the one
made using geom_point()
.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(height = 0, width = 0)
geom_jitter()
with
geom_count()
.geom_jitter()
adds a small amount of random variation to
th location of each point. It is a useful way of handling overplotting
caused by discreteness in smaller datasets.
geom_count()
counts the number of observations at each
location, then maps the count to point area. It is also useful when you
have discrete data and overplotting issue. Unlike
geom_jitter()
, geom_count()
is able to create
different size of the points relative to the number of observations.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()
With the mpg
dataset, we see that
geom_count()
is starting to cause the problem of
overplotting again.
We can play around with the color
.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_jitter()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_count()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) +
geom_count(position = "jitter")
While the graphs did add an additional info of class
,
none of the graphs seems to solve the overplotting issue.
geom_boxplot()
? Create a visualisation of the
mpg
dataset that demonstrates it.?geom_boxplot()
shows that the default position
adjustment is dodge2
. Here is the r code from the previous
problem, but using geom_boxplot()
instead:
ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) +
geom_boxplot()
coord_polar()
.Start with a stacked bar chart:
ggplot(mpg, aes(x = factor(1), fill = drv)) +
geom_bar()
Add coord_polar(theta="y")
to create pie chart:
ggplot(mpg, aes(x = factor(1), fill = drv)) +
geom_bar(width = 1) +
coord_polar(theta = "y")
labs()
do? Read the documentation.The labs()
allows you to add axis titles, plot titles,
and a caption to the plot. I have been using it since the very beginning
of this chapter.
Note that labs
is not the only way to add titles. Other
functions such as xlab()
, ylab()
,
ggtitle()
can also be used.
coord_quickmap()
and
coord_map()
?
coord_fixed()
important? What
does geom_abline()
do?ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()