To complete this lab, you’ll need to install and load the following packages:
tidymodels
patchwork
GGally
This week, the labs are designed around fitting a model to a specific dataset and visualizing the results appropriately.
Today, you’ll first fit a simple linear regression model, then work through another online tutorial about multiple regression. In the first part, you’ll work with COVID survey data collected through Facebook.
The data we’ll use was collected by the Delphi Group, which is headquartered at Carnegie Mellon University. Delphi collects survey data through Facebook and makes the results publicly available via their EpiData API. They make it super easy to access through their R package covidcast – I even tweeted about how easy it was to use!
Just downloaded a bunch of @CmuDelphi data yesterday to write some new labs/homeworks/projects for my students - the data is 💯, the documentation is 💯, and the covidcast R 📦 makes it SO EASY. Such a great resource for researchers + teachers! https://t.co/GnvKScdrbs
— Amanda Luby, PhD (@amandaluby) December 4, 2020
If you’re interested in diving into more of this data for any of your projects, let me know and I’m happy to share the code to download different date ranges or variables.
I’d highly recommend skimming the variables listed on Delphi’s Symptom Surveys Page for more information about the specific variables that are available. Here’s an excerpt:
Influenza-like illness or ILI is a standard indicator, and is defined by the CDC as: fever along with sore throat or cough. From the list of symptoms from Q1 on our survey, this means a and (b or c).
COVID-like illness or CLI is not a standard indicator. Through our discussions with the CDC, we chose to define it as: fever along with cough or shortness of breath or difficulty breathing.
Symptoms alone are not sufficient to diagnose influenza or coronavirus infections, and so these ILI and CLI indicators are not expected to be unbiased estimates of the true rate of influenza or coronavirus infections. These symptoms can be caused by many other conditions, and many true infections can be asymptomatic. Instead, we expect these indicators to be useful for comparison across the United States and across time, to determine where symptoms appear to be increasing.
The signals beginning with smoothed_ estimate the same quantities as their raw partners, but are smoothed in time to reduce day-to-day sampling noise; see details below. Crucially, because the smoothed signals combine information across multiple days, they have larger sample sizes and hence are available for more counties and MSAs than the raw signals.
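As an aside, here is roughly what downloading one of these signals with the covidcast package looks like. This is a sketch, not the exact code I ran to build the lab data; argument names follow the covidcast package documentation, and the call requires an internet connection (and possibly API registration).

```r
library(covidcast)

# Sketch: pull the smoothed COVID-like-illness signal from the Facebook
# survey at the state level, for the same date range used in this lab.
cli = covidcast_signal(
  data_source = "fb-survey",
  signal      = "smoothed_cli",
  start_day   = "2020-10-01",
  end_day     = "2020-12-31",
  geo_type    = "state"
)
```

Swapping in a different signal name or date range is all it takes to pull other variables.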
The following chunk of code loads the data file that I’ve downloaded from Delphi and pre-processed. It contains responses from October 1st to December 31st, 2020 and the following variables:
covid = read_csv("data/covid_data.csv")
covid
## # A tibble: 65,688 x 6
## signal geo_value time_value value stderr sample_size
## <chr> <chr> <date> <dbl> <dbl> <dbl>
## 1 smoothed_cli ak 2020-10-01 1.59 0.429 760
## 2 smoothed_cli al 2020-10-01 0.983 0.141 3742.
## 3 smoothed_cli ar 2020-10-01 1.06 0.174 2587.
## 4 smoothed_cli az 2020-10-01 0.597 0.0905 5682.
## 5 smoothed_cli ca 2020-10-01 0.450 0.0399 21930.
## 6 smoothed_cli co 2020-10-01 0.561 0.0917 5137.
## 7 smoothed_cli ct 2020-10-01 0.444 0.0978 3866
## 8 smoothed_cli dc 2020-10-01 0.268 0.224 533
## 9 smoothed_cli de 2020-10-01 0.203 0.134 1129.
## 10 smoothed_cli fl 2020-10-01 0.516 0.0472 17767.
## # … with 65,678 more rows
First, we’ll make a simplified version of the data by averaging each variable within each state over the associated time period.
state_avg = covid %>%
  group_by(geo_value, signal) %>%
  summarize(
    avg = mean(value, na.rm = TRUE)
  ) %>%
  pivot_wider(id_cols = geo_value, names_from = signal, values_from = avg) %>%
  ungroup()
Note the pivot_wider command – this is a super neat data wrangling trick that lets you reshape your data to create new columns or collapse columns into rows. Take a look at what the data looked like before that step:
covid %>%
  group_by(geo_value, signal) %>%
  summarize(
    avg = mean(value, na.rm = TRUE)
  )
## # A tibble: 714 x 3
## # Groups: geo_value [51]
## geo_value signal avg
## <chr> <chr> <dbl>
## 1 ak smoothed_anxious_5d 18.6
## 2 ak smoothed_cli 1.22
## 3 ak smoothed_depressed_5d 13.7
## 4 ak smoothed_felt_isolated_5d 23.0
## 5 ak smoothed_ili 1.26
## 6 ak smoothed_large_event_1d 8.89
## 7 ak smoothed_restaurant_1d 12.8
## 8 ak smoothed_shop_1d 53.7
## 9 ak smoothed_spent_time_1d 33.8
## 10 ak smoothed_tested_14d 21.9
## # … with 704 more rows
and after:
state_avg
## # A tibble: 51 x 15
## geo_value smoothed_anxiou… smoothed_cli smoothed_depres… smoothed_felt_i…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 ak 18.6 1.22 13.7 23.0
## 2 al 16.6 1.41 12.7 16.7
## 3 ar 19.7 1.20 14.8 19.1
## 4 az 16.6 1.05 12.0 18.3
## 5 ca 17.7 0.774 12.3 20.4
## 6 co 20.0 1.00 13.7 21.6
## 7 ct 18.2 0.723 11.4 17.8
## 8 dc 23.2 0.434 11.8 22.5
## 9 de 15.4 0.618 10.3 16.2
## 10 fl 15.3 0.715 10.5 15.9
## # … with 41 more rows, and 10 more variables: smoothed_ili <dbl>,
## # smoothed_large_event_1d <dbl>, smoothed_restaurant_1d <dbl>,
## # smoothed_shop_1d <dbl>, smoothed_spent_time_1d <dbl>,
## # smoothed_tested_14d <dbl>, smoothed_travel_outside_state_5d <dbl>,
## # smoothed_wearing_mask <dbl>, smoothed_work_outside_home_1d <dbl>,
## # smoothed_worried_finances <dbl>
So, what pivot_wider did was make the data “wider” (more columns) by mapping signal to the new column names and avg to the value of that column for each state. We end up with a new dataset that has a single row for each state.
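To see pivot_wider in isolation, here is a tiny self-contained example. The numbers are made up, but the column names mirror our data:

```r
library(tidyr)

# Long format: one row per (state, signal) pair -- hypothetical values
long = tibble::tibble(
  geo_value = c("ak", "ak", "al", "al"),
  signal    = c("smoothed_cli", "smoothed_ili", "smoothed_cli", "smoothed_ili"),
  avg       = c(1.22, 1.26, 1.41, 1.30)
)

# Wide format: one row per state, one column per signal
wide = pivot_wider(long, id_cols = geo_value,
                   names_from = signal, values_from = avg)
wide  # 2 rows (ak, al); columns geo_value, smoothed_cli, smoothed_ili
```

Each distinct value of signal becomes its own column, filled in with the matching avg.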
Let’s start with some exploratory analysis. The following code chunk creates a pairs plot, from the GGally package, for all of the quantitative variables. (I also remove smoothed_ from each of the variable names before plotting so they’re easier to read.)
state_avg %>%
  select(-geo_value) %>%
  rename_with(~ stringr::str_replace(.x, "smoothed_", "")) %>%
  GGally::ggpairs()
We’re going to focus on the relationship between smoothed_wearing_mask and smoothed_cli.
Each point here corresponds to a state. It would be great if we could tell which state each point corresponds to. Luckily for us, geom_text does just that. Replace geom_point() with geom_text() and add the right label argument to aes(). Change the theme, axis labels, and title to something more appropriate.
state_avg %>%
mutate(state = str_to_upper(geo_value)) %>%
ggplot(aes(x = smoothed_wearing_mask, y = smoothed_cli)) +
geom_point()
This graph is screaming for a linear regression model. We can easily add the least-squares line to the graph using stat_smooth (method = "lm" tells R to fit a linear model, and se = FALSE tells R not to include the standard error of the line on the graph):
state_avg %>%
ggplot(aes(x = smoothed_wearing_mask, y = smoothed_cli)) +
geom_point() +
stat_smooth(method = "lm", se = FALSE)
This is great for adding a line to the graph, but doesn’t tell us much about the model. For instance, what is \(\hat{\beta_1}\)? Is the estimate statistically significant? We have no idea.
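As a quick reminder of what we’re estimating, the least-squares slope and intercept have the familiar closed forms:

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

so \(\hat{\beta}_1\) is the sample covariance of \(x\) and \(y\) divided by the sample variance of \(x\), and the fitted line always passes through \((\bar{x}, \bar{y})\).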
Let’s go ahead and fit our model:
lm_mod = lm(smoothed_cli ~ smoothed_wearing_mask, data = state_avg)
(And, since we’re good statisticians, we’ll check our residual plots to make sure we don’t have any egregious violations.) In your report, state the LINE assumptions, how we check each of them, and whether you have any concerns about this model. Note that we use the patchwork library here – this package lets us combine ggplots into a single figure with really simple notation (the p1 + p2 + p3 at the end). In your report, clean these figures up a bit.
library(patchwork)
lm_res = augment(lm_mod, interval = "prediction")
p1 = ggplot(lm_res, aes(x = smoothed_wearing_mask, y = .resid)) +
geom_point()
p2 = ggplot(lm_res, aes(x = .resid)) +
geom_histogram(bins = 15)
p3 = ggplot(lm_res, aes(x = .fitted, y = .resid)) +
geom_point()
p1 + p2 + p3
The tidy part of tidymodels comes in handy when we want to look at our coefficients:
tidy(lm_mod, conf.int = TRUE)
## # A tibble: 2 x 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 6.69 0.452 14.8 1.05e-19 5.78 7.60
## 2 smoothed_wearing_mask -0.0637 0.00507 -12.6 6.30e-17 -0.0738 -0.0535
which makes it super easy to make tables or figures based on the results. In your lab report, replicate the following graph using the results from tidy(lm_mod) – you can use a different theme or labels if you prefer. (Hint: check out the help page for geom_pointrange().)
OpenIntro Online Tutorial on Multiple Regression
Include screenshots of your two favorite plots. You can include a screenshot in an rmarkdown document by saving the screenshot in the same folder as your rmarkdown file and adding ![](SCREENSHOT_FILENAME), where you fill in SCREENSHOT_FILENAME with the name of your screenshot (make sure to include the .jpg or .png extension).