Basics of Data Visualization and DAGs in R

# Load packages. Install them first, in case you don't have them yet.

library(palmerpenguins) # To get our example's dataset
library(tidyverse) # To use dplyr functions and the pipe operator when needed
library(ggplot2) # To visualize data (this package is also loaded by library(tidyverse))
library(ggdag) # To create our DAGs
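
If any of the library() calls above fails because a package is missing, a minimal sketch along these lines installs only the packages that are not there yet. The object names `needed` and `to_install` are our own choices, and the sketch assumes you have access to a CRAN mirror.

```r
# Packages used in this tutorial (tidyverse already includes ggplot2)
needed <- c("palmerpenguins", "tidyverse", "ggdag")

# Keep only the ones that are not installed yet
to_install <- needed[!needed %in% rownames(installed.packages())]

# Install the missing packages (does nothing if all are present)
if (length(to_install) > 0) {
  install.packages(to_install)
}
```

After this, re-run the library() calls to load the packages.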

Welcome

This week's tutorial is divided into two parts.

  • First, we will learn some basics of data visualization with ggplot.
  • Second, we will start our exploration of directed acyclic graphs (DAGs) for causal inference.

Introduction to ggplot2

ggplot2 is by far the most popular visualization package in R. It implements the grammar of graphics, which gives it a versatile syntax for creating visuals. The underlying logic of the package relies on deconstructing graphs into their component parts (if you are interested in this, you can read this article).

For the purposes of this introduction to visualization with ggplot, we care about the layered nature of visualizing with ggplot2.

*This tutorial is based largely on chapters 7 to 10 from the QPOLR book


Our building blocks

During this week, we will learn about the following building blocks:

  • Data: the data frame, or data frames, we will use to plot
  • Aesthetics: the variables we will be working with
  • Geometric objects: the type of visualization
  • Theme adjustments: size, text, colors etc

Data

The first building block for our plots is the data we intend to map. In ggplot2, we always have to specify the object where our data lives. In other words, you will always have to specify a data frame, like this:

ggplot(name_of_your_df)

In the future, we will see how to combine multiple data sources to build a single plot. For now, we will work under the assumption that all your data live in the same object.
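
As a quick check, calling ggplot() with only a data frame draws an empty canvas, because no aesthetics or geometric objects have been added yet. The sketch below uses the penguins data frame from palmerpenguins; the object name `p_empty` is our own.

```r
library(ggplot2)
library(palmerpenguins)

# Only the data is specified: this renders a blank panel
p_empty <- ggplot(penguins)
p_empty

# The plot object has no layers yet
length(p_empty$layers)
```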


Aesthetics

The second building block for our plots is the aesthetics. We need to specify the variables in the data frame we will be using and what role they play.

To do this we will use the function aes() within the ggplot() function after the data frame (remember to add a comma after the data frame).

ggplot(name_of_your_df, aes(x = your_x_axis_variable, y = your_y_axis_variable))

Beyond your axes, you can add more aesthetics that represent further dimensions of the data in the two-dimensional graphic plane, such as size, color, and fill, to name but a few.


Geometric objects

The third layer to render our graph is a geometric object. To add one, we need to add a plus (+) at the end of the initial line and state the type of geometric object we want to add, for example, geom_point() for a scatter plot or geom_bar() for bar plots.

ggplot(name_of_your_df, aes(x = your_x_axis_variable, y = your_y_axis_variable)) +
  geom_point()
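
To see the layering at work, the sketch below stores a plot in an object and then adds a geometric object to it with +. It uses the bill measurements from the penguins data; the object names `p_base` and `p_points` are our own.

```r
library(ggplot2)
library(palmerpenguins)

# Data and aesthetics only: the axes are drawn, but there are no points
p_base <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm))

# Adding a geometric object with + returns a new plot with one layer
p_points <- p_base + geom_point()
p_points

# Compare the number of layers before and after
length(p_base$layers)
length(p_points$layers)
```

Because each + returns a new plot object, you can keep `p_base` around and try different geometric objects on top of it.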

Theme

At this point our plot may just need some final touches. We may want to fix the axis names or get rid of the default gray background. To do so, we need to add an additional layer preceded by a plus sign (+).

If we want to change the names in our axes, we can utilize the labs() function.

We can also employ some of the pre-loaded themes, for example, theme_minimal().

ggplot(name_of_your_df, aes(x = your_x_axis_variable, y = your_y_axis_variable)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Name you want displayed",
       y = "Name you want displayed")
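
Besides the axis names, labs() also accepts a title, subtitle, and caption. The sketch below is one possible filled-in version of the template above, using the bill measurements from the penguins data; the label text and the object name `p_labeled` are our own.

```r
library(ggplot2)
library(palmerpenguins)

p_labeled <- ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  theme_minimal() +                        # replaces the default gray background
  labs(title = "Bill dimensions of penguins",
       x = "Bill length (mm)",
       y = "Bill depth (mm)")
p_labeled
```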

Our first plot

For our very first plot using ggplot2, we will use the penguins data from last week.

We would like to create a scatterplot that illustrates the relationship between the length of a penguin's flipper and their weight.

To do so, we need three of our building blocks: a) data, b) aesthetics, and c) a geometric object (geom_point()).

ggplot(penguins, aes(x = flipper_length_mm, y=body_mass_g)) +
  geom_point()


EXERCISE:

Once we have our scatterplot, can you think of a way to adapt the code to:

    1. convey another dimension through color, the species of penguin
    2. change the axis names
    3. render the graph with theme_minimal()
Answer

ggplot(penguins, aes(x = flipper_length_mm, y=body_mass_g, color=species)) +
  geom_point() +
  theme_minimal() +
  labs(x = "Flipper Length (mm)",
       y = "Body mass (g)",
       color = "Species")
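
A related point worth trying out is the difference between mapping an aesthetic to a variable and setting it to a fixed value. In the sketch below, color inside aes() varies with the data, whereas color outside aes() applies one value to every point. The color "steelblue" and the object names are our own choices.

```r
library(ggplot2)
library(palmerpenguins)

# Mapped: color varies with the species variable and a legend is drawn
p_mapped <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species))
p_mapped

# Set: every point gets the same fixed color and no legend is drawn
p_fixed <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(color = "steelblue")
p_fixed
```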


Visualizing effectively

Plotting distributions

If we are interested in plotting distributions of our data, we can leverage geometric objects, such as:

  • geom_histogram(): visualizes the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin (the default is 30 bins).
  • geom_density(): computes and draws a kernel density estimate, which is a smoothed version of the histogram.
  • geom_bar(): renders bar plots; when plotting distributions it behaves much like geom_histogram(), counting the observations at each value of x (it can also be used with two dimensions).

This is a histogram presenting the weight distribution of penguins in our sample.

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram()
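
To make the binning concrete, the sketch below pulls out the counts that the histogram draws. It relies on layer_data(), a ggplot2 helper that returns the data computed for a layer; the object names `p_hist` and `bin_data` are our own. R will warn that rows with a missing body mass were dropped.

```r
library(ggplot2)
library(palmerpenguins)

p_hist <- ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 30)

# One row per bin, with its boundaries and the number of penguins in it
bin_data <- layer_data(p_hist)
head(bin_data[, c("xmin", "xmax", "count")])

# The counts add up to the number of penguins with a recorded body mass
sum(bin_data$count)
sum(!is.na(penguins$body_mass_g))
```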


EXERCISE:

Let's adapt the code of our histogram:

    1. add the bins = 15 argument (try different numbers)
    2. add fill = "#FF6666" (try "red" or "blue" instead of "#FF6666")
    3. change the geom to _density and _bar
Answer

    1. Histogram with bins argument
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 15)

    2. Histogram with bins and fill arguments
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(bins = 25, fill = "#FF6666")

    3. geom_density() and geom_bar()
ggplot(penguins, aes(x = body_mass_g)) +
  geom_density(alpha = 0.5, fill = "#FF6666")

ggplot(penguins, aes(x = body_mass_g)) +
  geom_bar(fill = "#FF6666")
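
Distributions can also be compared across groups. As a sketch, mapping fill to species inside aes() draws one density curve per species, and a semi-transparent alpha keeps overlapping curves visible. The object name `p_density` is our own.

```r
library(ggplot2)
library(palmerpenguins)

# One density curve per species, drawn on top of each other
p_density <- ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5) +
  theme_minimal() +
  labs(x = "Body mass (g)",
       fill = "Species")
p_density
```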


Plotting relationships

We can utilize graphs to explore how different variables are related. In fact, we did so before in our scatterplot. We can also use box plots and lines to show some of these relationships.

For example, this boxplot showcasing the distribution of weight by species:

ggplot(penguins, aes(x = species, y = body_mass_g)) +
  geom_boxplot() +
  theme_minimal() +
  labs(x = "Species",
       y = "Body mass (g)")
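
The box in each boxplot spans the first to the third quartile, with a line at the median. A short dplyr sketch can compute the same summaries so you can compare them against the plot; the column names `q1`, `med`, and `q3` are our own.

```r
library(dplyr)
library(palmerpenguins)

mass_summary <- penguins %>%
  filter(!is.na(body_mass_g)) %>%   # drop penguins without a recorded weight
  group_by(species) %>%
  summarise(
    q1  = quantile(body_mass_g, 0.25),  # lower edge of the box
    med = median(body_mass_g),          # line inside the box
    q3  = quantile(body_mass_g, 0.75)   # upper edge of the box
  )
mass_summary
```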

Or this adaptation of our initial plot with a line of best fit for the observed data by each species:

ggplot(penguins, aes(x= flipper_length_mm, y = body_mass_g, color = species)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(x = "Length of the flipper",
       y = "Body mass (g)",
       color = "Species")
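
Because color is mapped to species, geom_smooth(method = "lm") fits a separate linear regression for each species. The base R sketch below fits the same kind of model by hand so the slopes can be inspected; the object names `fits` and `slopes` are our own.

```r
library(palmerpenguins)

# Fit one linear model per species
fits <- lapply(split(penguins, penguins$species), function(df) {
  lm(body_mass_g ~ flipper_length_mm, data = df)
})

# Slope: estimated change in body mass (g) per extra mm of flipper length
slopes <- sapply(fits, function(fit) coef(fit)[["flipper_length_mm"]])
slopes
```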


Next steps

Now that you have been introduced to some of the basics of ggplot2, the best way to move forward is to experiment. As we have discussed before, the R community is very open. You could draw inspiration from Tidy Tuesday, a social data project in R in which users explore a new dataset each week and share their visualizations and code on Twitter under #TidyTuesday. You can explore some of the previous visualizations here and try to replicate their code.

Here is a curated list of awesome ggplot2 resources.


Directed Acyclic Graphs (DAGs)

This week we learned that directed acyclic graphs (DAGs) are very useful to express our beliefs about relationships among variables.

DAGs are compatible with the potential outcomes framework. They give us a more convenient and intuitive way of laying out causal models. Next week we will learn how they can help us develop a modeling strategy.

Today, we will focus on their structure and some DAG basics with the ggdag package.


Creating DAGs in R

To create our DAGs in R we will use the ggdag package.

The first thing we need to do is create a dagified object, that is, an object in which we state our variables and how they relate to each other. Once we have our DAG object, we just need to plot it with the ggdag() function.

Let's say we want to re-create this DAG:

We would like to express the following links:

  • P -> D
  • D -> M
  • D -> Y
  • M -> Y

To do so in R with ggdag, we would use the following syntax:

dag_object <- ggdag::dagify(variable_being_pointed_at ~ variable_pointing,
                            variable_being_pointed_at ~ variable_pointing,
                            variable_being_pointed_at ~ variable_pointing)

After this we would just:

ggdag::ggdag(dag_object)

Let's plot this DAG

our_dag <- ggdag::dagify(d ~ p,
                         m ~ d,
                         y ~ d,
                         y ~ m)

ggdag::ggdag(our_dag)
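
Before plotting, it can help to check that the object encodes the links we intended. The sketch below assumes the dagitty package is available (ggdag depends on it) and uses its edges() and isAcyclic() functions; the object name `dag_edges` is our own.

```r
library(ggdag)

our_dag <- dagify(d ~ p,
                  m ~ d,
                  y ~ d,
                  y ~ m)

# One row per arrow: v is the starting node, w is the node being pointed at
dag_edges <- dagitty::edges(our_dag)
dag_edges[, c("v", "e", "w")]

# A DAG must not contain cycles
dagitty::isAcyclic(our_dag)
```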


EXERCISE:

See what happens when you add + theme_minimal(), + theme_void(), or + theme_dag() to the DAG. What package do you think lies behind the mappings of ggdag?

Answer

our_dag <- ggdag::dagify(d ~ p,
                         m ~ d,
                         y ~ d,
                         y ~ m)

ggdag::ggdag(our_dag) +
  theme_minimal()

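ggdag::ggdag(our_dag) +
  theme_void()

ggdag::ggdag(our_dag) +
  theme_dag()

# Themes can be added with + because ggdag builds its plots with ggplot2.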


Polishing our DAGs in R

As you may have seen, the DAG is not rendered with the nodes in the positions we want.

If you ever want to tell ggdag explicitly where to position each node, you can supply the coordinates of each node on a Cartesian plane.

Let's take P as the point (0,0):

coord_dag <- list(
  x = c(p = 0, d = 1, m = 2, y = 3),
  y = c(p = 0, d = 0, m = 1, y = 0)
)

our_dag <- ggdag::dagify(d ~ p,
                         m ~ d,
                         y ~ d,
                         y ~ m,
                         coords = coord_dag)

ggdag::ggdag(our_dag) + theme_void()


More complex example:

Let's say we're looking at the relationship between smoking and cardiac arrest. We might assume that smoking causes changes in cholesterol, which causes cardiac arrest:

smoking_ca_dag <- ggdag::dagify(cardiacarrest ~ cholesterol,
                                cholesterol ~ smoking + weight,
                                smoking ~ unhealthy,
                                weight ~ unhealthy,
                                labels = c("cardiacarrest" = "Cardiac\n Arrest", 
                                           "smoking" = "Smoking",
                                           "cholesterol" = "Cholesterol",
                                           "unhealthy" = "Unhealthy\n Lifestyle",
                                           "weight" = "Weight")
                                )

ggdag::ggdag(smoking_ca_dag, # the dag object we created
             text = FALSE, # this means the original names won't be shown
             use_labels = "label") + # instead use the new names
  theme_void()

In this example, we:

    1. Used more meaningful variable names
    2. Created a variable (cholesterol) that is the result of two variables rather than just one
    3. Used the "labels" argument to rename our variables (this is useful if your desired final variable name is more than one word)
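
To confirm the second point, we can ask which variables point directly into a node. This sketch assumes the dagitty package is available (ggdag depends on it) and uses its parents() function; the labels are left out because they do not affect the structure.

```r
library(ggdag)

smoking_ca_dag <- dagify(cardiacarrest ~ cholesterol,
                         cholesterol ~ smoking + weight,
                         smoking ~ unhealthy,
                         weight ~ unhealthy)

# cholesterol has two direct causes in this DAG
dagitty::parents(smoking_ca_dag, "cholesterol")

# cardiacarrest has a single direct cause
dagitty::parents(smoking_ca_dag, "cardiacarrest")
```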

Common DAG path structures

In the DAG below, X is a mediator: D affects Y both directly and through X.


coord_dag <- list(
  x = c(d = 0, x = 1, y = 2),
  y = c(d = 0, x = 1, y = 0)
)

our_dag <- ggdag::dagify(x ~ d,
                         y ~ d,
                         y ~ x,
                         coords = coord_dag)

ggdag::ggdag(our_dag) + theme_void()
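
To read the paths off the object rather than the picture, the sketch below lists every path between D and Y. It assumes the dagitty package is available (ggdag depends on it) and uses its paths() function; the object names `path_dag` and `d_to_y` are our own.

```r
library(ggdag)

path_dag <- dagify(x ~ d,
                   y ~ d,
                   y ~ x)

# All paths connecting d and y, written out with their arrows
d_to_y <- dagitty::paths(path_dag, from = "d", to = "y")
d_to_y$paths
```

Re-running this after changing the direction of the arrows shows how the paths through X change.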

EXERCISE:

Let's adapt the code to make X a confounder and a collider.

Answer

    1. X as a confounder
coord_dag <- list(
  x = c(d = 0, x = 1, y = 2),
  y = c(d = 0, x = 1, y = 0)
)

our_dag <- ggdag::dagify(d ~ x, # line from x to d
                         y ~ d, # line from d to y
                         y ~ x, # line from x to y
                         coords = coord_dag)

ggdag::ggdag(our_dag) + theme_void()

    2. X as a collider
coord_dag <- list(
  x = c(d = 0, x = 1, y = 2),
  y = c(d = 0, x = 1, y = 0)
)

our_dag <- ggdag::dagify(x ~ d, # line from d to x
                         y ~ d, # line from d to y
                         x ~ y, # line from y to x
                         coords = coord_dag)

ggdag::ggdag(our_dag) + theme_void()
