Introduction to the R Statistical Computing Environment

John Fox

(Department of Sociology, McMaster University)

ICPSR Summer Program, Ann Arbor, Michigan

Summer 2010


The R statistical programming language and computing environment has become the de facto standard for writing statistical software among statisticians and has made substantial inroads in the social and behavioural sciences. R is a free, open-source implementation of the S language, and is available for Windows, Mac OS X, and Unix/Linux systems. There is also a commercial implementation of S called S-PLUS, but it has been eclipsed by R.

A statistical package, such as SPSS or SAS, is primarily oriented toward combining instructions with rectangular case-by-variable datasets to produce (often voluminous) printouts. Such packages make routine data analysis relatively easy, but they make it relatively difficult to do things that are innovative or non-standard, or to add to the built-in capabilities of the package. In contrast, a good statistical computing environment such as R also makes routine data analysis easy, but it additionally supports convenient programming; this means that users can extend the already impressive facilities of R. Statisticians and others have taken advantage of the extensibility of R to contribute more than 2000 freely available “packages” of documented R programs and data to CRAN (the Comprehensive R Archive Network) and many others to the Bioconductor package archive. As well, R is especially capable in the area of statistical graphics, reflecting the origin of S at Bell Labs, a centre of graphical innovation.

The first two (lecture) sessions provide a basic overview of and introduction to R, including statistical modeling in R; in effect, they treat R as a statistical package. The following four to five workshop sessions pick up where the basic lectures leave off, and combine lecture material with hands-on experience. The workshop sessions are intended to provide the background required to use R seriously for data analysis and presentation, including an introduction to R programming and to the design of custom statistical graphs, unlocking the power in the R statistical programming environment. The topic for session 8 is flexible, depending upon participants’ interests: the topic given here is a suggestion. If the size of the group is sufficiently small, the workshops will be conducted in a computer lab. Otherwise, participants are encouraged to bring their laptops to the workshop sessions. Some slides for the workshop are available on-line.



Topics

Lecture/Workshop | Reading (in Fox and Weisberg's R Companion to Applied Regression, Second Edition) | Materials
1. Getting started with R | Ch. 1 | script, exercises, Tom Short's R reference card, Duncan.txt
2. Statistical models in R | Ch. 4, 5 & appendices to the first edition | script, exercises, Prestige.txt, Powers.txt, Long.txt, Winer.txt
3. Data in R | Ch. 2 | script, exercises, nations.por, Datasets.xls
4-6. R Programming | Ch. 8 | script, exercises, bugged functions
7. R Graphics | Ch. 7 | script, exercises, graphs-solutions.R, symbols-colours.R, 3dplots.R
8. Building R packages or another topic | Writing R Extensions manual; for Windows, Appendix D of R Installation and Administration manual | script, matrixDemos.R, matrixDemos_1.0.zip, matrix.Demos_1.0.tar.gz



Acquiring R

Windows Users

You can download the R Windows installer from CRAN; then double-click on the installer to install R as you would any Windows software. You can subsequently download and install only those packages that you want over the Internet from CRAN, via the Packages > Install packages from CRAN menu in the RGui console.

Mac Users

A universal binary for Mac OS X 10.5 and higher is available from CRAN. Double-click on the downloaded file to install R. You can then download and install packages over the Internet via the Packages & Data > Package Installer menu in the R.app or R64.app console.

Linux/Unix Users

Precompiled binaries for popular Linux systems are available from CRAN, or users can compile R from source. See CRAN for details.

Installing the car Package

For this course, you'll probably want to install the development version of the car package rather than the version on CRAN (which is associated with the first edition of the Companion to Applied Regression). First install the leaps package, on which car depends, from CRAN, either via the menus or with the command install.packages("leaps"); then install car itself with the command install.packages("car", repos="http://R-Forge.R-project.org").
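The two installation steps can also be wrapped in a small function, so that leaps is always handled before car. This is only a sketch: install_car_dev is a hypothetical helper name, not part of R or of the car package, and the install argument is there just so that the steps can be tried out without touching the network. The sketch also assumes a version of R recent enough to provide requireNamespace().

```r
# Hypothetical helper (not part of R or car): installs leaps from CRAN
# if it is missing, then the development version of car from R-Forge.
# `install` defaults to install.packages; it is an argument only so
# that the steps can be dry-run with a stand-in function.
install_car_dev <- function(install = utils::install.packages) {
  if (!requireNamespace("leaps", quietly = TRUE)) {
    install("leaps")  # car depends on leaps, so it goes first
  }
  install("car", repos = "http://R-Forge.R-project.org")
  invisible(NULL)
}
```

Calling install_car_dev() with no arguments performs the actual installation; library(car) then loads the package for the session.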


SELECTED BIBLIOGRAPHY

Publishers of statistical texts have recently been producing a steady stream of books on R. Of particular note is Springer's Use R! series of brief paperbacks on various R-related topics, several titles of which I've listed below.

Basic Texts

The principal source for this lecture series/workshop is J. Fox and S. Weisberg, An R Companion to Applied Regression, Second Edition, Sage (manuscript), which will be made available to participants. Additional materials are available on the web site for the first edition of the book, including several appendices (on structural-equation models, mixed models, survival analysis, etc.). The book is associated with the car package for R. Alternatively (or additionally), more advanced students may wish to use W. N. Venables and B. D. Ripley, Modern Applied Statistics with S as a principal source.


Manuals

R is distributed with a set of manuals, which are also available at the CRAN web site.

A manual for S-PLUS Trellis Graphics (also useful for the lattice package in R) is also available on the web.


Programming in S

R. A. Becker, J. M. Chambers, and A. R. Wilks, The New S Language: A Programming Environment for Data Analysis and Statistics. Pacific Grove, CA: Wadsworth, 1988. Defines S Version 2, which forms the basis of the currently used S Versions 3 and 4, as well as R. (Sometimes called the “Blue Book.”)

J. M. Chambers, Programming with Data: A Guide to the S Language. New York: Springer, 1998. Describes the then-new features in S Version 4, including the newer formal object-oriented programming system (also incorporated in R), by the principal designer of the S language. Not an easy read. (The “Green Book.”)

J. M. Chambers, Software for Data Analysis: Programming with R. New York: Springer, 2008. Chambers’s newest book ranges quite widely, and emphasizes a deep understanding of the R language, along with object-oriented programming, and links between R and other software. Some topics are unusual, such as processing text data in R.

J. M. Chambers and T. J. Hastie, eds., Statistical Models in S. Pacific Grove, CA: Wadsworth, 1992. An edited volume describing the statistical modeling capabilities in S, Versions 3 and 4, and R, and the object-oriented programming system used in S Version 3 and R (and available, for “backwards compatibility,” in S Version 4). In addition, the text covers S software for particular kinds of statistical models, including linear models, nonlinear models, generalized linear models, local-polynomial regression models, and generalized additive models. (The “White Book.”)

R. Gentleman, R Programming for Bioinformatics. Boca Raton: Chapman and Hall, 2009. A thorough, though at points relatively difficult, treatment of programming in R, by one of the original co-developers of R and a founder of the related Bioconductor Project (which develops computing tools for the analysis of genomic data). Don’t let the title fool you: Most of the book is of general interest to R programmers.

R. Ihaka and R. Gentleman, “R: A language for data analysis and graphics.” Journal of Computational and Graphical Statistics, 5:299-314, 1996. The original published description of the R project, now quite out of date but still worth looking at.

W. N. Venables and B. D. Ripley, S Programming. New York: Springer, 2000. A companion volume to Modern Applied Statistics with S, and at the time of its publication the definitive treatment of writing software in the various versions of S-PLUS and R; now somewhat dated, particularly with respect to R.

Statistical Computing in R

The following three books treat traditional topics in statistical computing, such as optimization, simulation, probability calculations, and computational linear algebra, using R (although the coverage of particular topics in the books differs). All offer introductions to R programming. Of these books, Braun and Murdoch is the briefest and most accessible.

W. J. Braun and D. J. Murdoch, A First Course in Statistical Programming with R. Cambridge: Cambridge University Press, 2007.

O. Jones, R. Maillardet, and A. Robinson, Introduction to Scientific Programming and Simulation Using R. Boca Raton: Chapman and Hall, 2009.

M. L. Rizzo, Statistical Computing with R. Boca Raton: Chapman and Hall, 2008.

Graphics in R

P. Murrell, R Graphics. New York: Chapman and Hall, 2006. A tour de force: the definitive reference on traditional R graphics and on the grid graphics system on which lattice graphics (the R implementation of William Cleveland’s Trellis graphics) is built. The figures from the book and R code to produce them are on Murrell’s web site.

P. Murrell and R. Ihaka, “An approach to providing mathematical annotation in plots.” Journal of Computational and Graphical Statistics, 9:582-599, 2000. One of the unusual and very useful features of R graphics is the ability to include mathematical notation. This article explains how.

D. Sarkar, Lattice: Multivariate Data Visualization with R. New York: Springer, 2008. Deepayan Sarkar is the developer of the powerful lattice package in R, which implements Trellis graphics. This book provides a fine introduction to and overview of lattice graphics. Figures from the book and the R code to produce them are available on the web.

H. Wickham, ggplot2: Elegant Graphics for Data Analysis. New York: Springer, 2009. A guide to Hadley Wickham's ggplot2 package, which provides an alternative graphics system for R based on an extension of Wilkinson's The Grammar of Graphics (Second Edition, Springer, 2005), which, in turn, provides a systematic basis for constructing statistical graphs.

Data Management

P. Spector, Data Manipulation with R. New York: Springer, 2008. Data management is a dry subject, but the ability to carry it out is vital to the effective day-to-day use of R (or of any statistical software). Spector provides a reasonably broad and clear introduction to the subject.

(Highly) Selected Statistical Methods Programmed in R

Also see the package listing on CRAN and the various CRAN "task views."

A. W. Bowman and A. Azzalini, Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford: Oxford University Press, 1997. A good introduction to nonparametric density estimation and nonparametric regression, associated with the sm package (for both S-PLUS and R).

A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997. A comprehensive introduction to bootstrap resampling, associated with the boot package (written by A. J. Canty). Somewhat more difficult than Efron and Tibshirani (immediately below).

B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. London: Chapman and Hall, 1993. Another extensive treatment of bootstrapping by its originator (Efron), also accompanied by an R package, bootstrap (but somewhat less usable than boot).

A. Gelman and J. Hill, Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press, 2007. A wide-ranging yet deep treatment of hierarchical models and various related topics, predominantly but not exclusively from a Bayesian perspective, using both R and BUGS software.

F. E. Harrell, Jr., Regression Modeling Strategies, With Applications to Linear Models, Logistic Regression, and Survival Analysis. New York: Springer, 2001. Describes an interesting approach to statistical modeling, with frequent references to Harrell's Hmisc and Design packages.

T. J. Hastie and R. J. Tibshirani, Generalized Additive Models. London: Chapman and Hall, 1990. An accessible treatment of generalized additive models, as implemented in the gam package, and of nonparametric regression analysis in general. [The gam function in the mgcv package in R takes a somewhat different approach; see Wood (2006), below.]

R. Koenker, Quantile Regression. Cambridge: Cambridge University Press, 2005. Describes a variety of methods for quantile regression by the leading figure in the area. The methods are implemented in Koenker's quantreg package for R.

C. Loader, Local Likelihood and Regression. New York: Springer, 1999. Another text on nonparametric regression and density estimation, using the locfit package. Although the text is less readable than Bowman and Azzalini, the locfit software is very capable.

T. Lumley, Complex Surveys: A Guide to Analysis Using R. Hoboken, NJ: Wiley, 2010. A lucid introduction to the analysis of data from complex survey samples and to Lumley's highly capable survey package.

J. C. Pinheiro and D. M. Bates, Mixed-Effects Models in S and S-PLUS. New York: Springer, 2000. An extensive treatment of linear and nonlinear mixed-effects models in S, focused on the authors' nlme package. Mixed models are appropriate for various kinds of non-independent (clustered) data, including hierarchical and longitudinal data. Does not cover Bates's newer lme4 package.

T. M. Therneau and P. M. Grambsch, Modeling Survival Data: Extending the Cox Model. New York: Springer, 2000. An overview of both basic and advanced methods of survival analysis (event-history analysis), with reference to S and SAS software, the former implemented in Therneau's state-of-the-art survival package.

W. N. Venables and B. D. Ripley, Modern Applied Statistics with S, Fourth Edition. New York: Springer, 2002. An influential and wide-ranging treatment of data analysis using S. Many of the facilities described in the book are programmed in the associated (and indispensable) MASS, nnet, and spatial packages, which are included in the standard R distribution. This text is more advanced and has a broader focus than the R Companion.

S. N. Wood, Generalized Additive Models: An Introduction with R. New York: Chapman and Hall, 2006. Describes the mgcv package in R, which contains a gam function for fitting generalized additive models based on smoothing splines. The initials "mgcv" stand for multiple generalized cross validation, the method by which Wood selects GAM smoothing parameters. 

Other Sources (Some Free)

See the publications list on the R web site. The R Journal, the journal of the R Project for Statistical Computing, and its predecessor R News, are also good sources of information, as is the Journal of Statistical Software, an on-line American Statistical Association journal dominated by coverage of R packages.


Last Modified: 1 July 2010 by J. Fox <jfox AT mcmaster.ca>