Preface

John Fox, Applied Regression Analysis, Linear Models, and Related Methods (Sage, 1997)

Contents of the preface:

Synopsis
Computing
To Readers, Students, and Instructors
Acknowledgments

Linear models, their variants, and extensions are among the most useful and widely used statistical tools for social research. This book aims to provide an accessible, in-depth, modern treatment of regression analysis, linear models, and closely related methods.

The book should be of interest to students and researchers in the social sciences. Although the specific choice of methods and examples reflects this readership, I expect that the book will prove useful in other disciplines that employ linear models for data analysis, and in courses on applied regression and linear models where the subject-matter of applications is not of special concern.

This book is a revision of my 1984 text Linear Statistical Models and Related Methods. In revising the book, I have freely incorporated material from my 1991 monograph on Regression Diagnostics.

The new title of the text reflects a change in organization and emphasis: I have thoroughly reworked the book, removing some topics and adding a variety of new material. But even more fundamentally, the book has been extensively rewritten. It is a new and different book.

I have endeavored in particular to make the text as accessible as possible. With the exception of three chapters, several sections, and a few shorter passages, the prerequisite for reading the book is a course in basic applied statistics that covers the elements of statistical data analysis and inference.

Many topics (e.g., logistic regression in Chapter 15) are introduced with an example that motivates the statistics, or (as in the case of bootstrapping, in Chapter 16) by appealing to familiar material. The treatment of regression analysis starts (in Chapter 2) with an elementary discussion of nonparametric regression, developing the notion of regression as a conditional average --- in the absence of restrictive assumptions about the nature of the relationship between the dependent and independent variables. This approach begins closer to the data than the traditional starting point of linear least-squares regression, and should make readers sceptical about glib assumptions of normality, constant variance, and so on.
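To make the notion of regression as a conditional average concrete, here is a minimal sketch in Python. It is not taken from the book: the function name conditional_means, the bin width, and the small education/income dataset are all invented for illustration. The sketch simply averages the dependent variable within bins of the independent variable, without assuming any functional form for the relationship.

```python
from collections import defaultdict
from statistics import mean


def conditional_means(x, y, bin_width):
    """Average y within bins of x: a crude nonparametric regression."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        # Assign each observation to the bin whose lower edge is a
        # multiple of bin_width.
        groups[(xi // bin_width) * bin_width].append(yi)
    # Return (lower edge of bin, mean of y in that bin), ordered by x.
    return [(edge, mean(ys)) for edge, ys in sorted(groups.items())]


# Made-up illustrative data (hypothetical, not one of the book's datasets):
# years of education and income in thousands of dollars.
education = [8, 9, 10, 11, 12, 12, 13, 14, 15, 16]
income = [20, 22, 21, 25, 30, 28, 33, 35, 40, 44]

for edge, avg in conditional_means(education, income, bin_width=4):
    print(f"education in [{edge}, {edge + 4}): mean income = {avg:.1f}")
```

Narrower bins follow the data more closely but average fewer observations each, so the choice of bin width trades bias against variability.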

More difficult chapters and sections are marked with asterisks. These parts of the text can be omitted without loss of continuity, but they provide greater understanding and depth, along with coverage of some topics that depend upon more extensive mathematical or statistical background. I do not, however, wish to exaggerate the background that is required for this 'more difficult' material: All that is necessary is some exposure to matrices, elementary linear algebra, and elementary differential calculus. Appendices to the text provide additional background material.

All chapters end with summaries, and most include recommendations for additional reading. Summary points are also presented in boxes interspersed with the text.

Synopsis

Part I

The first part of the book consists of preliminary material:

Chapter 1 discusses the role of statistical data analysis in social science, expressing the point of view that statistical models are essentially descriptive, not direct (if abstract) representations of social processes. This perspective provides the foundation for the data-analytic focus of the text.
Chapter 2 introduces the notion of regression analysis as tracing the conditional distribution of a 'dependent' variable as a function of one or several 'independent' variables. This idea is initially explored 'nonparametrically,' in the absence of a restrictive statistical model for the data.
Chapter 3 describes a variety of graphical tools for examining data. These methods are useful both as a preliminary to statistical modeling and to assist in the diagnostic checking of a model that has been fit to data.
Chapter 4 discusses variable transformation as a solution to several sorts of problems commonly encountered in data analysis, including skewness, nonlinearity, and non-constant spread.

Part II

The second part, on linear models fit by the method of least squares, comprises the heart of the book:

Chapter 5 discusses linear least-squares regression. Linear regression is the prototypical linear model, and its direct extension is the subject of Chapters 7--10.
Chapter 6, on statistical inference in regression, develops tools for testing hypotheses and constructing confidence intervals that apply generally to linear models. This chapter also introduces the basic methodological distinction between empirical and structural relationships --- a distinction central to understanding causal inference in non-experimental research.
Chapter 7 shows how 'dummy variables' can be employed to extend the regression model to qualitative independent variables. Interactions among independent variables are introduced in this context.
Chapter 8, on analysis-of-variance models, deals with linear models in which all of the independent variables are qualitative.
Chapter 9* develops the statistical theory of linear models, providing the foundation for much of the material in Chapters 5--8 along with some additional results.
Chapter 10* applies vector geometry to linear models, allowing us literally to visualize the structure and properties of these models. Many topics are revisited from the geometric perspective, and central concepts --- such as 'degrees of freedom' --- are given a natural and compelling interpretation.

Part III

The third part of the book describes 'diagnostic' methods for discovering whether a linear model fit to data adequately represents the data. Methods are also presented for correcting problems that are revealed:

Chapter 11 deals with the detection of unusual and influential data in linear models.
Chapter 12 describes methods for diagnosing a variety of problems, including non-normally distributed errors, non-constant error variance, and nonlinearity. Some more advanced material in this chapter discusses how the method of maximum likelihood can be employed for selecting transformations.
Chapter 13 takes up the problem of collinearity --- the difficulties for estimation that ensue when the independent variables are highly correlated.

Part IV

The fourth part of the book discusses important extensions of linear least squares. In selecting topics, I was guided by the proximity of the methods to the general linear model and by the promise that these methods hold for data analysis in the social sciences. The methods described in this part of the text are (with the exception of logistic regression in Chapter 15) given introductory --- rather than extensive --- treatments. My aim in introducing these relatively advanced topics is (i) to provide enough information so that readers can begin to use these methods in their research; and (ii) to provide sufficient background to support further work in these areas should readers choose to pursue them:

Chapter 14* describes several important direct extensions of linear least squares: Time-series regression (and generalized least squares), where the observations are ordered in time, and the errors in the linear model are consequently not assumed to be independent; nonlinear models fit by least squares; robust estimation of linear models (i.e., using methods of estimation more resistant than least squares to unusual data); and nonparametric regression, in which the functional form of the relationship between the dependent and independent variables is not specified in advance.
Chapter 15 takes up the centrally important topic of linear-like models for qualitative and ordinal dependent variables, most notably, logit models (logistic regression). The chapter concludes with a brief introduction to generalized linear models, a grand synthesis encompassing linear least-squares regression, logistic regression, and a variety of related statistical models.
Chapter 16 discusses two broadly applicable techniques for assessing sampling variation: The 'bootstrap' and cross-validation. The bootstrap is a computationally intensive simulation method for constructing confidence intervals and hypothesis tests. The bootstrap does not make strong distributional assumptions about the data, and can be made to reflect the manner in which the data were collected (e.g., in complex survey-sampling designs). Cross-validation is a simple method for drawing honest statistical inferences when --- as is commonly the case --- the data are employed both to select a statistical model and to estimate its parameters.
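The bootstrap idea mentioned for Chapter 16 can be sketched in a few lines of Python. This is one simple variant, the percentile interval, and not necessarily the procedure developed in the book. The function name bootstrap_percentile_ci, its parameters, and the data are hypothetical. The sketch also assumes independent observations from a simple random sample; a complex survey design would require resampling that mirrors the design.

```python
import random
from statistics import mean


def bootstrap_percentile_ci(data, statistic, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for statistic(data)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Recompute the statistic on many resamples of size n, each drawn
    # from the observed data with replacement, then sort the results.
    replicates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1 - level
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the replicates.
    lower = replicates[int((alpha / 2) * n_resamples)]
    upper = replicates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up illustrative sample (hypothetical, not one of the book's datasets).
sample = [12, 15, 9, 20, 14, 11, 18, 16, 13, 22, 10, 17]

low, high = bootstrap_percentile_ci(sample, mean)
print(f"sample mean = {mean(sample):.2f}; 95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

Any function of the data can be passed as the statistic (for example, statistics.median), because the method relies on resampling rather than on a formula for the standard error.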

Appendices

Several appendices provide background, principally --- but not exclusively --- for the starred portions of the text:

Appendix A describes the notational conventions employed in the text.
Appendix B* shows how vector geometry can be used to visualize key concepts of linear algebra. The material in this appendix is required for Chapter 10 and presupposes a basic knowledge of matrices and linear algebra.
Appendix C* assumes an acquaintance with elementary differential calculus of one independent variable and shows how, employing matrices, differential calculus can be extended to several independent variables. This material is required for some starred portions of the text, for example, the derivation of least-squares and maximum-likelihood estimators.
Appendix D provides an introduction to the elements of probability theory and to basic concepts of statistical estimation and inference. A few, more demanding, parts of the appendix are starred. The background developed in this appendix is required for some of the material on statistical inference in the text.

Computing

Nearly all of the examples in this text employ real data from the social sciences, many of them previously analyzed and published. The exercises that involve data analysis also almost all use real data drawn from various areas of application. Most of the datasets are relatively small. I encourage readers to analyze their own data as well.

The datasets can be downloaded free of charge via the World Wide Web; point your web browser at http://davinci.socsci.mcmaster.ca/applied-regression.html. If you do not have access to the internet, then you can write to me at the Department of Sociology, McMaster University, Hamilton, Ontario, Canada, L8S 4L8, for information about obtaining the datasets on disk. Each dataset is associated with two files: The file with extension .cbk (e.g., duncan.cbk) contains a 'codebook' describing the data; the file with extension .dat (e.g., duncan.dat) contains the data themselves. Smaller datasets are also presented in tabular form in the text.

I occasionally comment in passing on computational matters, but the book generally ignores the finer points of statistical computing in favor of methods that are computationally simple. I feel that this approach facilitates learning. Once basic techniques are absorbed, an experienced data analyst has recourse to carefully designed programs for statistical computations.

I think that it is a mistake to tie a general discussion of linear and related statistical models too closely to particular software. Although the marvelous proliferation of statistical software has routinized the computations for most of the methods described in this book, the workings of standard computer programs are not sufficiently accessible to promote learning. I consequently find it desirable, where time permits, to teach the use of a statistical computing environment as part of a course on applied regression and linear models.

For nearly 20 years, I used the interactive programming language APL in this role. More recently, I have used Lisp-Stat. I particularly recommend the R-code program, written in Lisp-Stat, which accompanies Cook and Weisberg's (1994) fine book on regression graphics. Other programmable computing environments that are used for statistical data analysis include S, Gauss, Stata, Mathematica, and the interactive matrix language (IML) in SAS. Descriptions of these environments appear in a book edited by Stine and Fox (1996).

To Readers, Students, and Instructors

I have used the material in this book for two types of courses (along with a variety of short courses and lectures):

I cover the unstarred sections of Chapters 1--8, 11--13, 15, and 16 in a one-semester (13-week) course for social-science graduate students (at McMaster University in Hamilton, Ontario) who have had (at least) a one-semester introduction to statistics at the level of Moore (1995). The outline of this course is as follows:

Week: topic; reading.

1: Introduction to the course and to MS/Windows; Ch. 1.
2: Introduction to regression, Lisp-Stat, and R-code; Ch. 2, App. A, D.
3: Examining and transforming data; Ch. 3, 4.
4: Linear least-squares regression; Ch. 5.
5: Statistical inference for regression; Ch. 6.
6: Dummy-variable regression; Ch. 7.
7: Analysis of variance; Ch. 8.
8: Diagnostics I: Unusual and influential data; Ch. 11.
9: Diagnostics II: Nonlinearity and other ills; Ch. 12.
10: Diagnostics III: Collinearity and variable selection; Ch. 13.
11: Logit and probit models I; Ch. 15.
12: Logit and probit models II; Ch. 15 (cont.).
13: Assessing sampling variation and review; Ch. 16.

The readings from this text are supplemented with parts of Cook and Weisberg's (1994) book on regression graphics, a paper by Tierney (1995) on Lisp-Stat, and several handouts on computing. Students complete required weekly homework assignments, which are mostly focused on data analysis. Homework is collected and corrected, but not graded. I distribute answers when the homework is collected, and take it up in class after it is corrected and returned. There are mid-term and final take-home exams, also focused on data analysis.

I used the material in Chapters 1--13 and 15, along with the appendices and basic introductions to matrices, linear algebra, and calculus, for a two-semester course for social-science graduate students (at York University in Toronto) with similar statistical preparation. For this second, more intensive, course, background topics (such as linear algebra) were introduced as required, and constituted about one-fifth of the course. The organization of the course was similar to the first one.

Both courses include some treatment of statistical computing, with more information on programming in the second course. For students with the requisite mathematical and statistical background, it should be possible to cover the whole text in a reasonably paced two-semester course.

In learning statistics, it is important for the reader to participate actively, both by working through the arguments presented in the book, and --- even more importantly --- by applying methods to data. Statistical data analysis is a craft and, like any craft, it requires practice. Reworking examples is a good place to start, and I have presented illustrations in such a manner as to make re-analysis and further analysis possible. Where possible, I have relegated formal 'proofs' and derivations to exercises, which nevertheless typically provide some guidance to the reader. I believe that this type of material is best learned constructively.

As well, including too much algebraic detail in the body of the text invites readers to lose the statistical forest for the mathematical trees. You can decide for yourself (or your students) whether or not to work the theoretical exercises. It is my experience that some people feel that the process of working through derivations cements their understanding of the statistical material, while others find this activity tedious and pointless. Some of the theoretical exercises, marked with asterisks, are comparatively difficult. (Difficulty is assessed relative to the material in the text, so the threshold is higher in starred sections and chapters.)

In preparing the data-analytic exercises, I have tried to find datasets of some intrinsic interest that embody a variety of characteristics. You can safely assume, for example, that datasets for exercises in Chapter 11 include unusual data. In many instances, I try to supply some direction in the data-analytic exercises, but --- like all real data analysis --- these exercises are fundamentally open-ended. It is therefore important for instructors to set aside time to discuss data-analytic exercises in class, both before and after students tackle them. Although students often miss important features of the data in their initial analyses, this experience --- properly approached and integrated --- is an unavoidable part of learning the craft of data analysis.

A few exercises, marked with pound-signs (#), are meant for 'hand' computation. Hand computation (i.e., with a calculator) is tedious, and is practical only for unrealistically small problems, but it sometimes serves to make statistical procedures more concrete. Similarly, despite the emphasis in the text on analyzing real data, a small number of exercises generate simulated data to clarify certain properties of statistical methods.

Finally, a word about style: I try to use the first person singular --- 'I' --- when I express opinions. 'We' is reserved for you --- the reader --- and me.

Acknowledgments

Many individuals have helped me in the preparation of this book.

I am grateful to Georges Monette of York University, to Bob Stine of the University of Pennsylvania, and to two anonymous reviewers, for insightful comments and suggestions.

Mike Friendly, of York University, provided detailed comments, corrections, and suggestions on almost all of the text.

A number of friends and colleagues donated their data for illustrations and exercises --- implicitly subjecting their research to scrutiny and criticism.

Several individuals contributed to this book indirectly by helpful comments on its predecessor (Fox, 1984), both before and after publication: Ken Bollen, Gene Denzel, Shirley Dowdy, Paul Herzberg, and Doug Rivers.

C. Deborah Laughton, my editor at Sage Publications, has been patient and supportive throughout the several years that I have worked on this project.

I am also in debt to the students at York University and McMaster University, and to participants at the Inter-University Consortium for Political and Social Research Summer Program, all of whom were exposed to various versions and portions of this text and who have improved it through their criticism, suggestions, and --- occasionally --- informative incomprehension.

Finally, I am grateful to York University for providing me with a sabbatical-leave research grant during the 1994-95 academic year, when much of the text was drafted.

If, after all of this help, deficiencies remain, then I alone am at fault.

John Fox

Toronto, Canada

Last Modified: 22 January 1997 by John Fox, jfox@mcmaster.ca.