class: center, middle, inverse, title-slide

# Lec06: Non-linear models
## Stat41: Data Viz
### Prof Amanda Luby
### Swarthmore College

---
class: center, middle

(1) Survey Results

(2) IJALM: Machine Learning Strikes Back

(3) Marginal Effects Plot

(4) Regression Discontinuity

(5) Density Plots

(6) Second Milestone

---
### Survey Results

<!-- -->

---
### Survey Results

<!-- -->

---
# From Lec 01:

> **A quick note on time commitment:** My expectation is that you will spend 3-4 hours on class days, plus 6-10 hours on projects per week, so somewhere in the 20-25 hours per week range. If you're spending much more or less time than that, please let me know.

--

One reason the projects are pretty open-ended is that I want *you* to have some control over how much time you can commit to this class.

--

An important skill is being able to recognize when you need to scale back or ramp up your effort.

---

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">IJALM is one of my favorite games. <a href="https://t.co/7y1hOcjW5F">https://t.co/7y1hOcjW5F</a></p>— Dr. Daniela Witten (@daniela_witten) <a href="https://twitter.com/daniela_witten/status/1286427260317179906?ref_src=twsrc%5Etfw">July 23, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

---
### Multiple Regression via Sliders and Switches

via Andrew Heiss

---
### Multiple Regression via Sliders and Switches

via Andrew Heiss

---
### Let's do an example on `penguins`

```r
library(palmerpenguins)
flipper_mod = lm(flipper_length_mm ~ body_mass_g + sex + species,
                 data = penguins)
broom::tidy(flipper_mod, conf.int = TRUE)
```

```
## # A tibble: 5 x 7
##   term             estimate std.error statistic   p.value conf.low conf.high
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)     165.       3.18         51.7  1.01e-159 158.      171.
## 2 body_mass_g        0.00655  0.000931      7.04 1.15e- 11  0.00472    0.00838
## 3 sexmale            2.48     0.854         2.90 3.97e-  3  0.798      4.16
## 4 speciesChinstrap   5.54     0.785         7.06 9.92e- 12  4.00       7.09
## 5 speciesGentoo     18.0      1.44         12.5  1.46e- 29 15.2       20.9
```

---
### We can use a coefficient plot:

```r
broom::tidy(flipper_mod, conf.int = TRUE) %>%
  ggplot(aes(y = term, x = estimate,
             xmin = conf.low, xmax = conf.high)) +
  geom_pointrange() +
  theme_xaringan()
```

<!-- -->

---
### Or a Marginal Effects Plot

--

**Idea**: "plug in" values of the predictor variables (changing one and holding the others constant) to see the effect of a variable on the outcome

--

First, create a new data frame with "new" observations:

--

```r
penguins_new_data = tibble(body_mass_g = seq(2700, 6300, by = 1),
                           sex = "male",
                           species = "Chinstrap")
penguins_new_data
```

```
## # A tibble: 3,601 x 3
##    body_mass_g sex   species  
##          <dbl> <chr> <chr>    
##  1        2700 male  Chinstrap
##  2        2701 male  Chinstrap
##  3        2702 male  Chinstrap
##  4        2703 male  Chinstrap
##  5        2704 male  Chinstrap
##  6        2705 male  Chinstrap
##  7        2706 male  Chinstrap
##  8        2707 male  Chinstrap
##  9        2708 male  Chinstrap
## 10        2709 male  Chinstrap
## # … with 3,591 more rows
```

---
### Then, predict values from the model using `augment`:

```r
predicted_flipper <- broom::augment(flipper_mod,
                                    newdata = penguins_new_data,
                                    interval = "prediction")
predicted_flipper
```

```
## # A tibble: 3,601 x 6
##    body_mass_g sex   species   .fitted .lower .upper
##          <dbl> <chr> <chr>       <dbl>  <dbl>  <dbl>
##  1        2700 male  Chinstrap    190.   179.   201.
##  2        2701 male  Chinstrap    190.   179.   201.
##  3        2702 male  Chinstrap    190.   179.   201.
##  4        2703 male  Chinstrap    190.   179.   201.
##  5        2704 male  Chinstrap    190.   179.   201.
##  6        2705 male  Chinstrap    190.   179.   201.
##  7        2706 male  Chinstrap    190.   179.   201.
##  8        2707 male  Chinstrap    190.   179.   201.
##  9        2708 male  Chinstrap    190.   179.   201.
## 10        2709 male  Chinstrap    190.   179.   201.
## # … with 3,591 more rows
```

---
### Finally, plot predicted values for each row:

Shows the marginal effect of **body mass** on **flipper length** for **male Chinstrap** penguins.

.pull-left[
```r
ggplot(predicted_flipper, aes(x = body_mass_g, y = .fitted)) +
  geom_ribbon(aes(ymin = .lower, ymax = .upper),
              fill = "#5601A4", alpha = 0.5) +
  geom_line(size = 1, color = "#5601A4") +
  theme_xaringan()
```
]

.pull-right[

]

---
class: center, middle, inverse

# But wait, there's more

---
### Why stop at changing 1 variable?

Use the `expand_grid` tidyverse function:

```r
penguins_new_data_fancy <- expand_grid(body_mass_g = seq(2700, 6300, by = 1),
                                       sex = "male",
                                       species = c("Adelie", "Chinstrap", "Gentoo"))
penguins_new_data_fancy
```

```
## # A tibble: 10,803 x 3
##    body_mass_g sex   species  
##          <dbl> <chr> <chr>    
##  1        2700 male  Adelie   
##  2        2700 male  Chinstrap
##  3        2700 male  Gentoo   
##  4        2701 male  Adelie   
##  5        2701 male  Chinstrap
##  6        2701 male  Gentoo   
##  7        2702 male  Adelie   
##  8        2702 male  Chinstrap
##  9        2702 male  Gentoo   
## 10        2703 male  Adelie   
## # … with 10,793 more rows
```

---
### Use `augment` again:

```r
predicted_flipper_fancy <- broom::augment(flipper_mod,
                                          newdata = penguins_new_data_fancy,
                                          interval = "confidence")
predicted_flipper_fancy
```

```
## # A tibble: 10,803 x 6
##    body_mass_g sex   species   .fitted .lower .upper
##          <dbl> <chr> <chr>       <dbl>  <dbl>  <dbl>
##  1        2700 male  Adelie       185.   182.   187.
##  2        2700 male  Chinstrap    190.   187.   193.
##  3        2700 male  Gentoo       203.   198.   208.
##  4        2701 male  Adelie       185.   182.   187.
##  5        2701 male  Chinstrap    190.   187.   193.
##  6        2701 male  Gentoo       203.   198.   208.
##  7        2702 male  Adelie       185.   182.   187.
##  8        2702 male  Chinstrap    190.   187.   193.
##  9        2702 male  Gentoo       203.   198.   208.
## 10        2703 male  Adelie       185.   182.   187.
## # … with 10,793 more rows
```

---
### And plot:

Shows the marginal effect of **body mass** and **species** on **flipper length** for **male** penguins.
.pull-left[
```r
ggplot(predicted_flipper_fancy, aes(x = body_mass_g, y = .fitted)) +
  geom_ribbon(aes(ymin = .lower, ymax = .upper, fill = species),
              alpha = 0.5) +
  geom_line(aes(color = species), size = 1) +
  guides(fill = "none", color = "none") +
  facet_wrap(vars(species)) +
  theme_xaringan()
```
]

.pull-right[

]

--

Why did the intervals change?

---
# Why marginal effects plots?

--

+ See the impact directly on the outcome variable

--

+ Vary multiple predictor variables at once

--

+ Works with models besides `lm()`!

--

+ You don't need to fully understand the inner workings of a complicated model to see the impact of predictors on the outcome

---
# In your groups:

[PNAS Paper](https://www.pnas.org/content/117/26/14857): Read the abstract and Figure 3.

[Blog Post](https://roadtolarissa.com/regression-discontinuity/) on Regression Discontinuity

---
# Density Plots: Basic Idea

Source: [Irizarry's Intro to Data Science](https://rafalab.github.io/dsbook/distributions.html)

---
# Complicated part is the *y-axis*

--

Brief interlude of probability theory

--

In order to be a *probability density function*, the area under the curve has to equal 1.

--

So what you see on the y-axis is scaled roughly to frequency, but in a way that makes the total area under the curve equal 1

--

High-probability regions are the places where the **area** under the curve is large

--

This usually corresponds to places where the density function is high, but we have to be careful!

---
# Example:

Source: FiveThirtyEight

---
class: middle

Source: Cameron Davidson-Pilon

---
class: middle

---
class: middle

Source: Cameron Davidson-Pilon

---
class: middle

Source: Cameron Davidson-Pilon

---
class: middle

Source: Cameron Davidson-Pilon

---
class: center, middle, inverse

### We've seen it before: humans are bad at perceiving differences in area

---
# Takeaways:

--

- Models are complicated!

--

- Drawing smooth lines is complicated!
--

- It's easy to mislead your audience

--

- Don't hide the raw data behind the model

--

- Use multiple representations of results

--

- A good explanation/interpretation goes a long way

---
# Final Project

### Milestone 2 will be posted later today
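
---
class: middle

### Appendix: marginal effects beyond `lm()`

The marginal effects recipe isn't limited to `lm()`. Here is a minimal sketch (an added illustration, not run in these slides) using logistic regression: fit a `glm()`, build a grid of new observations, and let `broom::augment()` fill in predictions on the probability scale. The model here (predicting `sex` from body mass and species) is a hypothetical example chosen only to show the workflow.

```r
library(palmerpenguins)
library(tidyverse)

# Hypothetical example: model the probability that a penguin is male
# from body mass and species (logistic regression)
sex_mod <- glm(sex ~ body_mass_g + species,
               data = penguins, family = binomial)

# Same idea as before: vary body mass, hold species at each level
new_data <- expand_grid(body_mass_g = seq(2700, 6300, by = 10),
                        species = c("Adelie", "Chinstrap", "Gentoo"))

# type.predict = "response" puts .fitted on the probability scale
predicted_sex <- broom::augment(sex_mod, newdata = new_data,
                                type.predict = "response")

ggplot(predicted_sex, aes(x = body_mass_g, y = .fitted, color = species)) +
  geom_line(size = 1)
```

One caveat: `augment()` for a `glm()` doesn't take the `interval` argument that it does for `lm()`, so drawing uncertainty bands takes extra work (e.g. requesting standard errors with `se_fit = TRUE` and building the bands on the link scale).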