---
title: "Duncan Data Analysis"
author: "J. Fox and S. Weisberg"
date: "2018-04-24"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

This R Markdown file reproduces the data analysis described in Section 1.5 of _An R Companion to Applied Linear Regression, 3rd Edition_. See the text for further explanation.

# Preliminaries

As a first step, we load the **car** package and examine the first few rows of the `Duncan` data:

```{r}
library("car") # load car and carData packages
head(Duncan, n=10)
dim(Duncan)
```

We obtain summary statistics for the variables in `Duncan`:

```{r}
summary(Duncan)
```

As a first graph, we view a histogram of the variable `prestige`:

```{r, fig.width=5, fig.height=5}
with(Duncan, hist(prestige))
```

## 1.5.1 Examining the Data

The `scatterplotMatrix()` function in the **car** package produces scatterplots for all pairs of variables. A few relatively remote points are labeled with their case names, in this instance occupations.

```{r fig.height=8, fig.width=8}
scatterplotMatrix(~ prestige + education + income, id=list(n=3), data=Duncan)
```

## 1.5.2 Regression Analysis

We use the `lm()` function to fit a linear regression model to the data:

```{r}
(duncan.model <- lm(prestige ~ education + income, data=Duncan))
```

```{r}
summary(duncan.model)
```

The `brief()` and `S()` functions, both in the **car** package, provide alternative summaries of a regression fit; a short illustration appears at the end of this document.

## 1.5.3 Regression Diagnostics

The `rstudent()` function returns studentized residuals, and the `densityPlot()` function fits an adaptive kernel density estimator to the distribution of the studentized residuals:

```{r fig.height=5, fig.width=5}
densityPlot(rstudent(duncan.model))
```

The `qqPlot()` function can be used as a check for non-normal errors, comparing the studentized residuals to a t-distribution:

```{r fig.height=5, fig.width=5}
qqPlot(duncan.model, id=list(n=3))
```

The `outlierTest()` function tests for outliers in the regression:

```{r}
outlierTest(duncan.model)
```

The `influenceIndexPlot()` function displays influence measures in index plots:

```{r fig.height=6, fig.width=6}
influenceIndexPlot(duncan.model, vars=c("Cook", "hat"), id=list(n=3))
```

Added-variable plots for Duncan's regression, looking for influential cases:

```{r fig.height=4, fig.width=8}
avPlots(duncan.model, id=list(cex=0.75, n=3, method="mahal"))
```

Component-plus-residual plots for the regression, checking for nonlinearity:

```{r fig.height=4, fig.width=8}
crPlots(duncan.model, smooth=list(span=0.7))
```

Tests for non-constant error variance:

```{r}
ncvTest(duncan.model)
ncvTest(duncan.model, var.formula = ~ income + education)
```

Removing the cases `"minister"` and `"conductor"`:

```{r}
whichNames(c("minister", "conductor"), Duncan)
duncan.model.2 <- update(duncan.model, subset=-c(6, 16))
summary(duncan.model.2)
```

Comparing the regressions with and without these two cases:

```{r}
compareCoefs(duncan.model, duncan.model.2)
```
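
As mentioned in Section 1.5.2, the `brief()` and `S()` functions in the **car** package provide alternative summaries of a regression fit. A minimal illustration applied to the original model, assuming the chunks above have been run (this demonstration is not part of the text's Section 1.5 analysis):

```{r}
brief(duncan.model) # abbreviated summary of the fitted model
S(duncan.model)     # modified alternative to summary() from the car package
```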