Linear models, their variants, and extensions are among the most useful
and widely used statistical tools for social research. This book aims to
provide an accessible, in-depth, modern treatment of regression analysis,
linear models, and closely related methods.
The book should be of interest to students and researchers in the social
sciences. Although the specific choice of methods and examples reflects
this readership, I expect that the book will prove useful in other disciplines
that employ linear models for data analysis, and in courses on applied
regression and linear models where the subject matter of applications is
not of special concern.
This book is a revision of my 1984 text Linear Statistical Models and
Related Methods. In revising the book, I have freely incorporated material
from my 1991 monograph on Regression Diagnostics.
The new title of the text reflects a change in organization and emphasis:
I have thoroughly reworked the book, removing some topics and adding a
variety of new material. But even more fundamentally, the book has been
extensively rewritten. It is a new and different book.
I have endeavored in particular to make the text as accessible as possible.
With the exception of three chapters, several sections, and a few shorter
passages, the prerequisite for reading the book is a course in basic applied
statistics that covers the elements of statistical data analysis and inference.
Many topics (e.g., logistic regression in Chapter 15) are introduced either with an example that motivates the statistics or (as in the case of bootstrapping, in Chapter 16) by appealing to familiar material. The treatment of regression analysis starts (in Chapter 2) with an elementary discussion of nonparametric regression, developing the notion of regression as a conditional average --- in the absence of restrictive assumptions about the nature of the relationship between the dependent and independent variables. This approach begins closer to the data than the traditional starting point of linear least-squares regression, and should make readers skeptical about glib assumptions of normality, constant variance, and so on.
More difficult chapters and sections are marked with asterisks. These
parts of the text can be omitted without loss of continuity, but they provide
greater understanding and depth, along with coverage of some topics that
depend upon more extensive mathematical or statistical background. I do
not, however, wish to exaggerate the background that is required for this
'more difficult' material: All that is necessary is some exposure to matrices,
elementary linear algebra, and elementary differential calculus. Appendices
to the text provide additional background material.
All chapters end with summaries, and most include recommendations for
additional reading. Summary points are also presented in boxes interspersed
with the text.
The first part of the book consists of preliminary material:
The second part, on linear models fit by the method of least squares,
comprises the heart of the book:
The third part of the book describes 'diagnostic' methods for discovering whether a linear model fit to data adequately represents the data. Methods are also presented for correcting problems that are revealed:
The fourth part of the book discusses important extensions of linear
least squares. In selecting topics, I was guided by the proximity of the
methods to the general linear model and by the promise that these methods
hold for data analysis in the social sciences. The methods described in
this part of the text are (with the exception of logistic regression in
Chapter 15) given introductory --- rather than extensive --- treatments.
My aim in introducing these relatively advanced topics is (i) to provide
enough information so that readers can begin to use these methods in their
research; and (ii) to provide sufficient background to support further
work in these areas should readers choose to pursue them:
Several appendices provide background, principally --- but not exclusively
--- for the starred portions of the text:
Nearly all of the examples in this text employ real data from the social
sciences, many of them previously analyzed and published. The exercises
that involve data analysis also almost all use real data drawn from various
areas of application. Most of the datasets are relatively small. I encourage
readers to analyze their own data as well.
The datasets can be downloaded free of charge via the World Wide Web; point your web browser at http://davinci.socsci.mcmaster.ca/applied-regression.html. If you do not have access to the internet, then you can write to me at the Department of Sociology, McMaster University, Hamilton, Ontario, Canada, L8S 4L8, for information about obtaining the datasets on disk. Each dataset is associated with two files: The file with extension .cbk (e.g., duncan.cbk) contains a 'codebook' describing the data; the file with extension .dat (e.g., duncan.dat) contains the data themselves. Smaller datasets are also presented in tabular form in the text.
I occasionally comment in passing on computational matters, but the
book generally ignores the finer points of statistical computing in favor
of methods that are computationally simple. I feel that this approach facilitates
learning. Once basic techniques are absorbed, an experienced data analyst
has recourse to carefully designed programs for statistical computations.
I think that it is a mistake to tie a general discussion of linear and
related statistical models too closely to particular software. Although
the marvelous proliferation of statistical software has routinized the
computations for most of the methods described in this book, the workings
of standard computer programs are not sufficiently accessible to promote
learning. I consequently find it desirable, where time permits, to teach
the use of a statistical computing environment as part of a course on applied
regression and linear models.
For nearly 20 years, I used the interactive programming language APL in this role. More recently, I have used Lisp-Stat. I particularly recommend the R-code program, written in Lisp-Stat, which accompanies Cook and Weisberg's (1994) fine book on regression graphics. Other programmable computing environments that are used for statistical data analysis include S, Gauss, Stata, Mathematica, and the interactive matrix language (IML) in SAS. Descriptions of these environments appear in a book edited by Stine and Fox (1996).
I have used the material in this book for two types of courses (along
with a variety of short courses and lectures):
I cover the unstarred sections of Chapters 1--8, 11--13, 15, and 16
in a one-semester (13-week) course for social-science graduate students
(at McMaster University in Hamilton, Ontario) who have had (at least) a
one-semester introduction to statistics at the level of Moore (1995). The
outline of this course is as follows:
The readings from this text are supplemented with parts of Cook and
Weisberg's (1994) book on regression graphics, a paper by Tierney (1995)
on Lisp-Stat, and several handouts on computing. Students complete required
weekly homework assignments, which are mostly focused on data analysis.
Homework is collected and corrected, but not graded. I distribute answers
when the homework is collected, and take it up in class after it is corrected
and returned. There are mid-term and final take-home exams, also focused
on data analysis.
I used the material in Chapters 1--13 and 15, along with the appendices and basic introductions to matrices, linear algebra, and calculus, for a two-semester course for social-science graduate students (at York University in Toronto) with similar statistical preparation. For this second, more intensive, course, background topics (such as linear algebra) were introduced as required, and constituted about one-fifth of the course. The organization of the course was similar to the first one.
Both courses include some treatment of statistical computing, with more information on programming in the second course. For students with the requisite mathematical and statistical background, it should be possible to cover the whole text in a reasonably paced two-semester course.
In learning statistics, it is important for the reader to participate actively, both by working through the arguments presented in the book, and --- even more importantly --- by applying methods to data. Statistical data analysis is a craft and, like any craft, it requires practice. Reworking examples is a good place to start, and I have presented illustrations in such a manner as to make re-analysis and further analysis possible. Where possible, I have relegated formal 'proofs' and derivations to exercises, which nevertheless typically provide some guidance to the reader. I believe that this type of material is best learned constructively.
Moreover, including too much algebraic detail in the body of the text invites readers to lose the statistical forest for the mathematical trees. You can decide for yourself (or for your students) whether or not to work the theoretical exercises. It is my experience that some people feel that the process of working through derivations cements their understanding of the statistical material, while others find this activity tedious and pointless. Some of the theoretical exercises, marked with asterisks, are comparatively difficult. (Difficulty is assessed relative to the material in the text, so the threshold is higher in starred sections and chapters.)
In preparing the data-analytic exercises, I have tried to find datasets
of some intrinsic interest that embody a variety of characteristics. You
can safely assume, for example, that datasets for exercises in Chapter
11 include unusual data. In many instances, I try to supply some direction
in the data-analytic exercises, but --- like all real data analysis ---
these exercises are fundamentally open-ended. It is therefore important
for instructors to set aside time to discuss data-analytic exercises in
class, both before and after students tackle them. Although students often
miss important features of the data in their initial analyses, this experience
--- properly approached and integrated --- is an unavoidable part of learning
the craft of data analysis.
A few exercises, marked with pound signs (#), are meant for 'hand' computation. Hand computation (i.e., with a calculator) is tedious, and is practical only for unrealistically small problems, but it sometimes serves to make statistical procedures more concrete. Similarly, despite the emphasis in the text on analyzing real data, a small number of exercises generate simulated data to clarify certain properties of statistical methods.
Finally, a word about style: I try to use the first person singular
--- 'I' --- when I express opinions. 'We' is reserved for you --- the
reader --- and me.
Many individuals have helped me in the preparation of this book.
I am grateful to Georges Monette of York University, to Bob Stine of
the University of Pennsylvania, and to two anonymous reviewers, for insightful
comments and suggestions.
Mike Friendly, of York University, provided detailed comments, corrections,
and suggestions on almost all of the text.
A number of friends and colleagues donated their data for illustrations and exercises --- implicitly subjecting their research to scrutiny and criticism.
Several individuals contributed to this book indirectly by helpful comments on its predecessor (Fox, 1984), both before and after publication: Ken Bollen, Gene Denzel, Shirley Dowdy, Paul Herzberg, and Doug Rivers.
C. Deborah Laughton, my editor at Sage Publications, has been patient
and supportive throughout the several years that I have worked on this
project.
I am also indebted to the students at York University and McMaster University,
and to participants at the Inter-University Consortium for Political and
Social Research Summer Program, all of whom were exposed to various versions
and portions of this text and who have improved it through their criticism,
suggestions, and --- occasionally --- informative incomprehension.
Finally, I am grateful to York University for providing me with a sabbatical-leave
research grant during the 1994-95 academic year, when much of the text
was drafted.
If, after all of this help, deficiencies remain, then I alone am at
fault.
John Fox
Toronto, Canada