class: center, middle, inverse, title-slide

# Lec05: Linear Models
## Stat41: Data Viz
### Prof Amanda Luby
### Swarthmore College

---
class: center, middle

# Today:

(1) Announcements

(2) Survey

(3) Regression Refresher

(4) IJALM

(5) Project 2

---
# Announcements

Project01/Labs/Milestones: if you don't think you'll be able to get them in by **Tuesday**, please get in touch with me so we can figure out a plan to catch up.

--

If you're confident about catching up, I trust you.

--

If you DO start to worry, please let me know. I'm here to help!

--

This week: I've included a few options for integrating work from the labs or your final project into the mini-project, so there's an option to scale back the workload a bit.

--

Milestone 02: we will talk about it tomorrow.

--

ggplot2 [cheat sheet](https://rstudio.com/resources/cheatsheets/)

---
# Survey

Please fill out this ["first week survey"](https://forms.gle/6Fi74BhDFvXMeyMi9)

--

*Some* groups will be switching around tomorrow, so I need this information!
---
# Regression Refresher

.pull-left[
Idea: draw a line through a scatterplot, but with math

--

Assumptions (LINE):

* **L**inear Relationship
* **I**ndependent Observations
* **N**ormal Residuals
* **E**qual variance in Residuals
]

.pull-right[
<!-- -->
]

---
# In groups:

(1) How do you check each of these assumptions?

--

(2) What do you do if they're violated?

--

(3) What are the necessary visualizations in an OLS regression analysis?

--

(4) Did any questions come up from the reading?

---
class: center, middle, inverse

# Quick Check in

---
**L**inear Relationship:

.pull-left[
```r
library(tidyverse)       # %>%, filter(), ggplot()
library(palmerpenguins)  # penguins data
library(xaringanthemer)  # theme_xaringan()

penguins %>%
  filter(species == "Adelie") %>%
  ggplot(., aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  geom_smooth(se = FALSE, method = "lm") +  # fitted least-squares line
  theme_xaringan() +
  labs(x = "Bill Length (mm)",
       y = "Bill Depth (mm)",
       title = "Adelie Penguins Bill Length/Depth",
       caption = "Source: palmerpenguins")
```
]

.pull-right[
]

---
**I**ndependent Observations

First, consider how the data were collected: does order matter?

Next, look at the *residuals* plotted against the *predictor* variable. We shouldn't be able to see any relationship.

.pull-left[
```r
library(tidymodels)  # loads broom, which provides augment() and tidy()

adelie = penguins %>%
  filter(species == "Adelie")

lm_mod = lm(bill_depth_mm ~ bill_length_mm, data = adelie)
lm_res = augment(lm_mod)  # adds .fitted and .resid columns

ggplot(lm_res, aes(x = bill_length_mm, y = .resid)) +
  geom_point() +
  theme_xaringan()
```
]

.pull-right[
]

---
**N**ormal Residuals

If we look at the distribution of the residuals, it should be *symmetric*, *unimodal*, and roughly *bell-shaped*.

.pull-left[
```r
ggplot(lm_res, aes(x = .resid)) +
  geom_histogram(bins = 20, col = "white") +
  theme_xaringan()
```
]

.pull-right[
]

---
**E**qual variance in the residuals

In the scatterplot of residuals against fitted values, there should be no pattern: the vertical spread of the points should be roughly constant across the fitted values.

.pull-left[
```r
ggplot(lm_res, aes(x = .fitted, y = .resid)) +
  geom_point() +
  theme_xaringan()
```
]

.pull-right[
]

---
# IJALM (**I**t's **J**ust **A** **L**inear **M**odel)

--

If you can do a linear regression, you can do *all of the other* Stat011 methods.

--

We'll do quick examples of the following tests using the `penguins` data, showing that they're all equivalent to a linear model:

--

* One-sample t-test

--

* Two-sample t-test

--

* ANOVA

--

If you want to read more, [this link](https://lindeloev.github.io/tests-as-linear/linear_tests_cheat_sheet.pdf) has further details and discusses additional methods.
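---
# IJALM: the models behind the tests

A rough sketch of how each of the three tests can be written as a linear model. The notation is an assumption made for illustration: y is the response, each x is a 0/1 indicator for group membership, and the errors are assumed independent and normal with equal variance (the LINE assumptions).

One-sample t-test of whether the mean equals 40 (intercept-only model on the shifted response):

`$$y_i - 40 = \beta_0 + \varepsilon_i, \qquad H_0: \beta_0 = 0$$`

Two-sample t-test with equal variances (x is 1 for the second group, 0 otherwise):

`$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad H_0: \beta_1 = 0$$`

One-way ANOVA with three groups (one indicator for each non-baseline group):

`$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i, \qquad H_0: \beta_1 = \beta_2 = 0$$`

In the last two models, the intercept is the baseline group's mean and each slope is a difference from that baseline, so "all means are equal" is the same statement as "all slopes are zero."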
---
# One-sample t-test

.pull-left[
`$$H_0: \mu = 40$$`
`$$H_A: \mu \ne 40$$`
]

.pull-right[
<!-- -->
]

---
```r
t_res = t.test(adelie$bill_length_mm, mu = 40)
t_res
```

```
## 
##  One Sample t-test
## 
## data:  adelie$bill_length_mm
## t = -5.5762, df = 150, p-value = 1.114e-07
## alternative hypothesis: true mean is not equal to 40
## 95 percent confidence interval:
##  38.36312 39.21966
## sample estimates:
## mean of x 
##  38.79139
```

```r
# Intercept-only model on the shifted response: tests whether mean - 40 = 0
lm_res = lm((bill_length_mm - 40) ~ 1, data = adelie)
tidy(lm_res)
```

```
## # A tibble: 1 x 5
##   term        estimate std.error statistic     p.value
##   <chr>          <dbl>     <dbl>     <dbl>       <dbl>
## 1 (Intercept)    -1.21     0.217     -5.58 0.000000111
```

---
# Two-sample t-test

Test whether the mean `bill_length_mm` of *Chinstrap* penguins differs from that of *Gentoo* penguins

.pull-left[
`$$H_0: \mu_1 = \mu_2$$`
`$$H_A: \mu_1 \ne \mu_2$$`
]

.pull-right[
<!-- -->
]

---
```r
# Assumed definition (not shown on the original slide): keep only the two species
chinstrap_gentoo = penguins %>% filter(species != "Adelie")
t_res = t.test(bill_length_mm ~ species, data = chinstrap_gentoo, var.equal = TRUE)
t_res
```

```
## 
##  Two Sample t-test
## 
## data:  bill_length_mm by species
## t = 2.7694, df = 189, p-value = 0.006176
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3823625 2.2755285
## sample estimates:
## mean in group Chinstrap    mean in group Gentoo 
##                48.83382                47.50488
```

```r
# The speciesGentoo slope is the Gentoo mean minus the Chinstrap mean
lm_res = lm(bill_length_mm ~ species + 1, data = chinstrap_gentoo)
tidy(lm_res)
```

```
## # A tibble: 2 x 5
##   term          estimate std.error statistic   p.value
##   <chr>            <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)      48.8      0.385    127.   8.20e-185
## 2 speciesGentoo    -1.33     0.480     -2.77 6.18e-  3
```

---
# ANOVA

Test whether *all 3* species of penguins have the same mean *flipper length*

.pull-left[
`$$H_0: \mu_1 = \mu_2 = \mu_3$$`
`$$H_A: \text{at least one } \mu_j \text{ differs}$$`
]

.pull-right[
<!-- -->
]

---
```r
aov_res = aov(flipper_length_mm ~ species, data = penguins)
summary(aov_res)
```

```
##              Df Sum Sq Mean Sq F value Pr(>F)    
## species       2  52473   26237   594.8 <2e-16 ***
## Residuals   339  14953      44                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
```

```r
# Same F-statistic (594.8 on 2 and 339 DF) as the aov() table above
lm_res = lm(flipper_length_mm ~ species, data = penguins)
summary(lm_res)
```

```
## 
## Call:
## lm(formula = flipper_length_mm ~ species, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.9536  -4.8235   0.0464   4.8130  20.0464 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      189.9536     0.5405 351.454  < 2e-16 ***
## speciesChinstrap   5.8699     0.9699   6.052 3.79e-09 ***
## speciesGentoo     27.2333     0.8067  33.760  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.642 on 339 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7782, Adjusted R-squared:  0.7769 
## F-statistic: 594.8 on 2 and 339 DF,  p-value: < 2.2e-16
```

---
class: center, middle, inverse

# Questions?

---
# Project 2

The [project 2 prompt](https://aluby.domains.swarthmore.edu/stat041/Projects/proj-2.html) is posted.