a couple recipes for exploratory data visualization in R
Everyone knows that visualization is key to understanding a dataset. But all too often, I feel like I get too involved in analyzing the dataset before actually seeing what the raw data look like.
In Python, pandas
makes this pretty easy:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv("/path/to/some/file.csv")
df.plot()
plt.show()
In R, ggpairs, as well as graphics::pairs
, perform similar functions. The problem with ggpairs
is that if you have a data frame with more than a few variables the result is criminally complex, because for n
variables n^2 graphs are produced. Often, I want to focus on the distribution of each variable by itself instead of looking at each pairwise relationship between each of the variables, which can get exhausting quickly. This also means only producing n
plots, which has a positive effect on my sanity.
Below I show a couple recipes and functions that others might find useful for quickly getting a sense of your data after reading it in. First I show how to plot the distribution of all numeric variables in a dataset. Second I show how to explore the relationship between some categorical variable and each of those numeric variables.
plotting histograms for all numeric variables
First let’s grab some data and load the tidyverse
:
library(tidyverse)
# data from DOI: 10.1098/rsbl.2016.0467
df <- read_csv("http://datadryad.org/bitstream/handle/10255/dryad.121150/Bolnick_data.csv?sequence=2")
# view the variables and their types
map_chr(df, typeof)
Lake Fish
"character" "character"
Model_color depth
"character" "integer"
N_bites N_inspects
"integer" "integer"
Aggressive_interactions Date
"integer" "character"
Mean.interval.between.bites Median.interval.between.bites
"double" "double"
SD.interval
"double"
These data describe how male sticklebacks respond to simulated intruder attacks in the wild in two lakes.
To look for outliers and get a sense of how each variable might be skewed, I define a simple function, plot_histograms()
:
plot_histograms <- function(data){
data %>%
select_if(is.numeric) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_histogram() +
facet_wrap(~variable)
}
plot_histograms(df)
This selects only the numeric columns from the data frame, puts the data into tidy format, then uses facet_wrap(~variable)
to produce a different plot for each variable. (If the variables are on very different scales, it might make sense to use facet_wrap(~variable, scales = "free_x")
). You could obviously make this function a lot fancier and more flexible, but it produces a plot that’s informative without being overwhelming as is.
We could use the same logic to just look at counts of the categorical variables:
plot_counts <- function(data){
data %>%
select_if(is.character) %>%
gather(variable, value) %>%
ggplot(aes(x = value)) +
geom_bar() +
facet_wrap(~variable, scale = "free_x")
}
plot_counts(df)
plot the relationship between a categorical variable and each numeric variable
To take it one step further, we’re probably interested in getting a first-pass look at the relationship between the numeric variables and each of the categorical variables. Again, ggpairs
does this already, but depending on the number of variables, it throws in a lot of extra information that can be difficult for a human brain to process.
Below I define plot_cat_relationship()
, which is similar to plot_histograms()
; instead of producing a histogram for each numeric variable, however, it produces a boxplot of the relationship between each numeric variable and one categorical variable. This keeps things simple enough for my head to process things.
The implementation gave me a chance to learn more about tidyeval and quosures, because the function relies on tidyverse
packages like dplyr
that use tidyeval. Interested readers should read the article linked above, which does a really nice job of explaining tidyeval.
ggplot2
does not use tidyeval yet, however, which leads to some ugly workarounds in the code below.
plot_cat_relationship <- function(data, categorical_variable) {
numeric_cols <- names(data)[map_lgl(data, is.numeric)]
enquo_cat <- enquo(categorical_variable)
data %>%
select(!!enquo_cat, numeric_cols) %>%
gather(variable, value, -!!enquo_cat) %>%
ggplot(aes_string(x = quo_name(enquo_cat), y = "value", color = quo_name(enquo_cat))) +
geom_boxplot() +
facet_wrap(~variable) +
coord_flip() +
scale_color_brewer(type = "qual", palette = "Dark2", guide = F)
}
# note that `Model_color` is not in quotes
plot_cat_relationship(df, Model_color)
This makes it easy to see how a single categorical variable, Model_color
, differs among each of the variables.
We also might be interested in quickly assessing whether there are big differences between lakes:
plot_cat_relationship(df, Lake)
summary
- Writing functions to automate tasks you should perform more regularly makes those tasks easier to do, and thus increases the likelihood that you’ll actually do the task
- Creating plots with
ggplot2
works pretty seamlessly in functions - Using functions from other
tidyverse
packages that use tidyeval can be a little tricky, but with a little practice they can be naturally incorporated into functions and used for programming