An Intro to R (and S-Plus)

I began using R as a Matlab alternative since Matlab's stats toolbox doesn't offer routines for repeated measures data. I gradually realized that R is far superior to Matlab for fitting statistical models, so I now use it regularly.

R is a statistical programming language designed to make common statistical tasks easy. In this it succeeds. However, it is not a general-purpose language, and if you come from a Matlab/Perl background it can be rather confusing. This is made harder by the fact that some simple tasks are not documented in an obvious way (e.g. how do you change directory?). This page contains a few notes which I hope might help others trying to make the transition. This page is in no way intended to be complete, comprehensible, accurate, or inoffensive. Also included are notes on S-Plus, the non-free alternative.

What is R and how does it relate to S-Plus?

The official description of R is here. Briefly, R is a free statistical programming language based on S, which was developed at Bell Labs. You can get R on virtually any computing platform. R is written by professional statisticians so it's the place to be for cutting-edge stuff.

S-Plus is the non-free equivalent. You pay (lots. yearly.) for a nicer GUI, prettier plotting interface, and some different libraries. The stats functions in R and S-Plus are similar. R is more powerful for some tests whereas S-Plus is more powerful for others. The exact state of play is always changing. I used S-Plus for a while but kept finding myself going back to R. Linux S-Plus has a horrible Java GUI and, disgracefully, no search history at the command line. Finally, the copy-protection system was so oppressive that I couldn't start it up half the time. I now use R exclusively.

File loading and saving, etc in R

R doesn't make it easy for you to change directory and I believe that S-Plus doesn't let you do it at all. Why is this? The reason is that both save their workspaces to the current directory so that when you quit and restart the program, your data are still there. R gives you the option to save on quitting whereas S-Plus saves your variables as you go. The advantage of this is that you end up neatly arranging your analyses such that you have one directory per analysis. This helps you de-bug your thinking on large projects and keep things orderly. See next section.
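For what it's worth, base R does provide getwd() and setwd() for moving around; they are just not easy to stumble upon. Here is a minimal sketch, assuming a throwaway directory under tempdir() (the directory name "my_analysis" is made up for illustration):

```r
# Hypothetical project directory, created under tempdir() so the sketch is harmless
project_dir <- file.path(tempdir(), "my_analysis")
dir.create(project_dir, showWarnings = FALSE)

old_dir <- setwd(project_dir)   # setwd() returns the previous directory (invisibly)
getwd()                         # we are now inside the project directory
list.files()                    # files in the current directory

setwd(old_dir)                  # go back to where we started
```

Keeping the old directory in a variable makes it easy to return to where you came from.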

What the hell is a chapter?

If you're used to programs like Matlab then you'll think S-Plus stores data in a slightly weird way. As noted above, you can't change directories from within S-Plus. Instead you make a directory for each project from outside of S-Plus, initialise it as a "Chapter" from within that directory, then start S-Plus. For example, let's say that we want to analyse whether cow farts smell worse after the cows eat fresh grass rather than hay. We need to do the following from a *nix command prompt:


   $ mkdir smelly_cow_fart
   $ cd smelly_cow_fart
   $ Splus CHAPTER
   Creating data directory for chapter .
   S-PLUS chapter smelly_cow_fart initialized.
   $ Splus

The "smelly_cow_fart" directory now contains a new directory called ".Data". Bring up that directory in a window. Go to S-Plus and assign a value to a variable, such as t <- 1. Your variable, t, has magically appeared in .Data. Exit S-Plus (q()) and restart it. Type t and it still remembers your variable.

R is similar although it doesn't require the initialisation with "CHAPTER" and saves your workspace data in a single file. Saving is done optionally on exit or using the save.image() command.
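A minimal sketch of the R side of this, run at the top level of an R session. The file name and the variable cow_score are made up for illustration; called with no arguments, save.image() writes to a file named .RData in the current directory.

```r
# Hypothetical file name, placed in tempdir() so nothing important is overwritten
workspace_file <- file.path(tempdir(), "cow_workspace.RData")

cow_score <- 1                        # a toy variable to remember
save.image(file = workspace_file)     # save every object in the workspace

rm(cow_score)                         # pretend we quit R and lost everything
exists("cow_score")                   # FALSE: the variable is gone

load(workspace_file)                  # restore the saved objects
cow_score                             # the variable is back
```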

Customising S-Plus

Functions and Chapters

Ok, so we've got our chapter directory and have a vague idea how to customise it. But let's say that we have some functions in text format which we want to edit and execute. How is this done? If your function text file is in the directory smelly_cow_fart and is called "methaneComposition.ssc" then you need to be in the directory and type source("methaneComposition.ssc"). This makes an S-Plus function in .Data which you can edit with fix(methaneComposition) and execute from within S-Plus.
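The same source() workflow applies in R. The sketch below is an R illustration, not S-Plus code: it writes a made-up function to a text file (standing in for a file you would normally write in an editor) and then sources it. The function body and its arguments are invented for the example.

```r
# Write a hypothetical function definition to a text file
fun_file <- file.path(tempdir(), "methaneComposition.R")
writeLines(c(
  "methaneComposition <- function(total_gas, methane_fraction = 0.5) {",
  "  total_gas * methane_fraction   # made-up calculation, for illustration only",
  "}"
), fun_file)

source(fun_file)          # defines methaneComposition in the workspace
methaneComposition(10)    # call it like any other function
```

After editing the text file, run source() again to pick up the changes.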

I want to use an R library in S-Plus!

Let's say that you want to use Ripley's stepAIC function instead of the built-in S-Plus function, step. stepAIC is part of the MASS package in R. So in R you would type library(MASS) and now you have access to stepAIC. You can check that the function is there by checking its help page: ?stepAIC. S-Plus 6.2 has the MASS library included as standard (type library(MASS) to get access to its functions). If you don't have 6.2 and want the library you can download it from here.
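A minimal sketch of stepAIC in R, assuming MASS is installed (it ships with standard R distributions). The model and the built-in mtcars data set are chosen purely for illustration.

```r
library(MASS)   # provides stepAIC

# Illustrative starting model using a built-in data set
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Stepwise selection by AIC; trace = FALSE suppresses the step-by-step log
reduced_model <- stepAIC(full_model, direction = "both", trace = FALSE)

formula(reduced_model)   # the terms that survived
reduced_model$anova      # record of the steps taken
```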

Wilkinson-Rogers Notation

This is the notation used in model formulae in R and S. In general, models have this structure: response variable ~ explanatory variables [that's a tilde, not a hyphen]. Note that the models below all include an intercept term by default, which you can remove by appending -1 to the model formula.

Symbol Meaning
~ The tilde is read as: `model as a function of'. y on the left, x's on the right. e.g. This code models y as a function of x: y ~ x
+ Inclusion of terms, not summation. To model y as a function of x and z: y ~ x + z
: indicates a specific interaction. e.g. y ~ a + b + x:z
* Crossing: includes the main effects and all of their interactions, with repeated terms dropped. So this code: y ~ x*z is the same as y ~ x+z+x:z.
- A specific exclusion. Here we allow all interactions but the 4-way:
y ~ x * z * a * b - x:z:a:b
^ Inclusion of higher order terms up to the specified order. For example:
(A+B+C)^3 is the same as A*B*C
(A+B+C)^2 is the same as A*B*C - A:B:C
/ The forward slash is used to nest explanatory variables. So if x is a multi-level factor we could use the slash to fit a separate slope and intercept of z for each x: y ~ x/z
Other examples are:
A/B/C is the same as A+B %in% A+C %in% B %in% A
A/(B + C*D) is the same as A + A:B + A:C + A:D + A:C:D
| This indicates conditioning, as understood by functions such as coplot() and the lattice plotting functions. So the following code means `y as a function of x given z':
y ~ x|z
. By itself means include every main effect. With variables on the left it means every main effect except those on the left. So if you have a data frame with variables x,y,z, this code: y ~ . means model y as a function of x and z. You can apply ^, +, and - to .
I() Evaluation: treat contents of the parentheses as a `normal' equation. e.g. y ~ I(a+b) means y as a function of a single explanatory variable obtained by adding a to b.
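One way to check what a formula expands to is to ask R directly with terms(). This is a small sketch; the helper name expand_terms and the toy data frame are made up for illustration.

```r
# Hypothetical helper: list the terms R expands a formula into
expand_terms <- function(f, ...) attr(terms(f, ...), "term.labels")

expand_terms(y ~ x * z)            # "x" "z" "x:z"
expand_terms(y ~ (A + B + C)^2)    # "A" "B" "C" "A:B" "A:C" "B:C"
expand_terms(y ~ A / B / C)        # "A" "A:B" "A:B:C"
expand_terms(y ~ x * z - x:z)      # "x" "z"

# The dot needs a data frame so that R knows which variables exist
toy <- data.frame(y = rnorm(10), x = rnorm(10), z = rnorm(10))
expand_terms(y ~ ., data = toy)    # "x" "z"

# Appending -1 removes the intercept
attr(terms(y ~ x - 1), "intercept")   # 0
```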

Errors

What do those annoying errors mean?

Getting subplots out of ggplot2

I have found two ways of doing this buried at the end of the ggplot2 pdf. Here is a modified version of that code.


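# Load ggplot2 and grid (grid supplies the viewport functions used below)
library(ggplot2)
library(grid)

# Make some data
RAND <- data.frame(x=rnorm(100), y=rnorm(100))
P <- ggplot(RAND, aes(x=x, y=y)) + geom_point()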

# First approach: push a viewport, then print the plot into it
# (the pretty argument from old ggplot2 versions is omitted; current versions no longer have it)
grid.newpage()
pushViewport(viewport(height=0.4, width=0.4, x=0.4, y=0.8))
print(P, newpage=FALSE)
upViewport()
pushViewport(viewport(height=0.4, width=0.4, angle=25, x=0.4, y=0.3))
print(P, newpage=FALSE)


#Second approach
vplayout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)
grid.newpage()
pushViewport(viewport(layout=grid.layout(3,3)))
print(P, vp=vplayout(1,1))
print(P, vp=vplayout(2:3,2:3))
print(P, vp=vplayout(1, 2:3))
print(P, vp=vplayout(2:3, 1))

Other odds and sods

S-Plus Resources