An Intro to R (and S-Plus)
I began using R as a Matlab alternative because Matlab's stats toolbox doesn't offer routines for repeated measures data. I gradually realized that R is far superior to Matlab for fitting statistical models, so I now use it regularly.
R is a statistical programming language designed to make common statistical tasks easy. In this it succeeds. However, it is not a general-purpose language, and coming from a Matlab/Perl background it can be rather confusing. This is made harder by the fact that some simple tasks are not documented in an obvious way. For example, how do you change directory? This page contains a few notes which I hope might help others trying to make the transition. This page is in no way intended to be complete, comprehensible, accurate, or inoffensive. Also included are notes on S-Plus, the non-free alternative.
The official description of R is here. Briefly, R is a free statistical programming language based on S, which was developed at Bell Labs. You can get R on virtually any computing platform. R is written by professional statisticians so it's the place to be for cutting-edge stuff.
S-Plus is the non-free equivalent. You pay (lots. yearly.) for a nicer GUI, prettier plotting interface, and some different libraries. The stats functions in R and S-Plus are similar. R is more powerful for some tests whereas S-Plus is more powerful for others. The exact state of play is always changing. I used S-Plus for a while but kept finding myself going back to R. Linux S-Plus has a horrible Java GUI and, disgracefully, no search history at the command line. Finally, the copy-protection system was so oppressive that I couldn't start it up half the time. I now use R exclusively.
File loading and saving, etc in R
- Display working directory: getwd()
- Change directory: setwd("/my/directory/")
- Reading data from the working directory: t <- read.table("data_array.tab"). You can enter the full path to read from a different directory.
R doesn't make it easy for you to change directory and I believe that S-Plus doesn't let you do it at all. Why is this? The reason is that both save their workspaces to the current directory so that when you quit and restart the program, your data are still there. R gives you the option to save on quitting whereas S-Plus saves your variables as you go. The advantage of this is that you end up neatly arranging your analyses such that you have one directory per analysis. This helps you de-bug your thinking on large projects and keeps things orderly. See next section.
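As a minimal sketch of the directory commands listed above, the following writes a small tab-delimited file to a temporary directory, changes into it, and reads the file back. read.table() is base R's function for reading delimited text files. The file name data_array.tab is the one used above; the temporary directory and the column names are made up for illustration.

```r
# Hypothetical example data, written to a temporary directory
data_dir <- tempdir()
write.table(data.frame(cow = c("a", "b"), methane = c(1.5, 2.5)),
            file = file.path(data_dir, "data_array.tab"),
            sep = "\t", row.names = FALSE, quote = FALSE)

old_dir <- getwd()   # remember where we started
setwd(data_dir)      # change directory
dat <- read.table("data_array.tab", header = TRUE, sep = "\t")  # relative path
setwd(old_dir)       # go back to where we were

# A full path works from any working directory
dat2 <- read.table(file.path(data_dir, "data_array.tab"),
                   header = TRUE, sep = "\t")
```

Restoring the old working directory afterwards is just a habit that keeps the session where you expect it to be.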
What the hell is a chapter?
If you're used to programs like Matlab then you'll think S-Plus stores data in a slightly weird way. As noted above, you can't change directories from within S-Plus. Instead you make a directory for each project from outside of S-Plus, initialise it as a "Chapter" from within that directory, then load Splus. For example, let's say that we want to analyse whether cow farts smell worse after the cows eat fresh grass rather than hay. We need to do the following from a *nix command prompt:
$ mkdir smelly_cow_fart
$ cd smelly_cow_fart
$ Splus CHAPTER
Creating data directory for chapter .
S-PLUS chapter smelly_cow_fart initialized.
$ Splus
The "smelly_cow_fart" directory now contains a new directory called ".Data". Bring up that directory in a window. Go to S-Plus and assign a value to a variable, such as t <- 1. Your variable, t, has magically appeared in .Data. Exit S-Plus (q()) and restart it. Type t and it still remembers your variable.
R is similar although it doesn't require the initialisation with "CHAPTER" and saves your workspace data in a single file. Saving is done optionally on exit or using the save.image() command.
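A small sketch of the R side of this, assuming the code is run at the top level of an R session so that the variable lives in the global workspace. The variable and file names are made up; with no file argument, save.image() writes to ".RData" in the current working directory.

```r
# Assumption: this runs at the top level, so cow_score is in the
# global workspace that save.image() writes out.
cow_score <- 42

# Hypothetical file name in a temporary directory
ws_file <- file.path(tempdir(), "smelly_cow_fart.RData")
save.image(file = ws_file)

# Simulate quitting and restarting: remove the variable, then reload it.
rm(cow_score)
load(ws_file)
cow_score
```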
Customising S-Plus
- To enable emacs or vi line editing in an S-Plus console, edit the .profile file in your home directory to include the following:
S_CLEDITOR=emacs # Use emacs-style command line editing
# S_CLEDITOR=vi # Use vi-style command line editing
export S_CLEDITOR
Then start Splus with the "-e" flag.
- You can create a new function with fix(awesomeNewFunction). S-Plus will bring up a vi window for you to edit awesomeNewFunction. You can change this to any other editor, e.g. Emacs, by typing this command at an S-Plus prompt: options(editor="/opt/local/bin/emacs")
- Don't want to type all that in each time? Then you need to make a function called ".First" in your ".Data" directory. You can type this function into the command line like this:
> .First <-
+ function()
+ {
+ options(editor="/usr/local/bin/emacs")
+ }
>
Each time you load S-Plus in a chapter directory it looks for the .First function and executes everything in it. If you restart S-Plus and type fix(.First), you should be presented with an Emacs window containing the function you just typed in. Note two things. Firstly, the file name shown in Emacs is a temporary one. Secondly, and as a consequence, when you save the file it is put in .Data, where it will look garbled. That's normal.
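R has a similar hook. As a sketch, assuming you are in R rather than S-Plus: R also looks for a function called .First at start-up, and one common place to define it is a .Rprofile file in your home or working directory. The editor path below is the one used above and is an assumption about your system.

```r
# Would normally live in a .Rprofile file; defined here directly so it
# can be tried out in a running session.
.First <- function() {
  options(editor = "/usr/local/bin/emacs")  # assumed path; adjust to your system
}

.First()              # R calls this itself at start-up; here we call it by hand
getOption("editor")
```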
Functions and Chapters
Ok, so we've got our chapter directory and have a vague idea how to customise it. But let's say that we have some functions in text format which we want to edit and execute. How is this done? If your function text file is in the directory smelly_cow_fart and is called "methaneComposition.ssc" then you need to be in the directory and type source("methaneComposition.ssc"). This makes an S-Plus function in .Data which you can edit with fix(methaneComposition) and execute from within S-Plus.
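The same source() step works in R. A minimal sketch: the file name follows the example above, but the function body is invented for illustration and the file is written to a temporary directory rather than a chapter directory.

```r
# Write a hypothetical function definition to a script file.
script <- file.path(tempdir(), "methaneComposition.ssc")
writeLines(c(
  "methaneComposition <- function(ppm) {",
  "  ppm / 1e6   # convert parts per million to a fraction",
  "}"
), script)

source(script)            # defines methaneComposition in the workspace
methaneComposition(500)
```

After source() has run, fix(methaneComposition) opens the function in your editor, as described above.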
I want to use an R library in S-Plus!
Let's say that you want to use Ripley's stepAIC function instead of the built-in S-Plus function, step. stepAIC is part of the MASS package in R. So in R you would type library(MASS) and now you have access to stepAIC. You can check that the function is there by checking its help page: ?stepAIC. S-Plus 6.2 has the library included as standard (type library(MASS) to get access to its functions). If you don't have 6.2 and want the library you can download it from here.
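For the R side, here is a short sketch of what using stepAIC looks like. The choice of the built-in mtcars data set and of the predictors is arbitrary and only for illustration.

```r
library(MASS)   # a recommended package, included with standard R installations

# Start from a model with several candidate predictors.
full_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Stepwise selection by AIC; trace = FALSE suppresses the step-by-step log.
best_model <- stepAIC(full_model, trace = FALSE)
formula(best_model)
```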
Wilkinson-Rogers Notation
This is the notation used in model formulae in R and S. In general, models have this structure: response variable ~ explanatory variables [that's a tilde, not a hyphen]. Note that the models below all include an intercept term by default, which you can remove by appending -1 to the model formula.
| Symbol | Meaning |
| ~ | The tilde is read as "model as a function of". y on the left, x's on the right. e.g. This code models y as a function of x: y ~ x |
| + | Inclusion of terms, not summation. To model y as a function of x and z: y ~ x + z |
| : | indicates a specific interaction. e.g. y ~ a + b + x:z |
| * | The asterisk includes the named terms and all of their interactions, with repeated terms counted only once. So this code: y ~ x*z is the same as y ~ x+z+x:z. |
| - | A specific exclusion. Here we allow all interactions but the 4-way: y ~ x * z * a * b - x:z:a:b |
| ^ | Inclusion of higher order terms up to the specified order. For example: (A+B+C)^3 is the same as A*B*C, and (A+B+C)^2 is the same as A*B*C - A:B:C |
| / | The forward slash is used to nest explanatory variables. So if x is a multi-level factor we could use the slash to fit a separate slope and intercept of z for each level of x: y ~ x/z. Other examples: A/B/C is the same as A + B %in% A + C %in% B %in% A; A/(B + C*D) is the same as A + A:B + A:C + A:D + A:C:D |
| | | This indicates conditioning. So the following code means "y as a function of x given z": y ~ x|z |
| . | By itself means include every main effect. Combined with variables on the left-hand side, it means every variable in the data frame except those on the left. So if you have a data frame with variables x, y, z, this code: y ~ . means model y as a function of x and z. You can apply ^, +, and - to . |
| I() | Evaluation: treat the contents of the parentheses as a "normal" arithmetic expression. e.g. y ~ I(a+b) means y as a function of a single explanatory variable obtained by adding a to b. |
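One way to check how R expands a formula is to ask terms() for its term labels. The sketch below does this for a few rows of the table; expand is a made-up helper name, not part of R.

```r
# Helper (not part of R): list the terms a formula expands to.
expand <- function(f) attr(terms(f), "term.labels")

expand(y ~ x * z)               # "x" "z" "x:z"
expand(y ~ (A + B + C)^2)       # main effects plus all two-way interactions
expand(y ~ A * B * C - A:B:C)   # same terms as the line above
expand(y ~ x / z)               # "x" "x:z"

# Appending -1 removes the intercept: 1 means present, 0 means absent.
attr(terms(y ~ x), "intercept")
attr(terms(y ~ x - 1), "intercept")
```

No data are needed for this, because terms() only looks at the structure of the formula (the . shorthand is the exception, since it has to be expanded against a data frame).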
Errors
What do those annoying errors mean?
- Problem in methaneComposition(): Couldn't find a function definition for "[INSERT OFFENDING ITEM HERE]"
Use traceback() to see the call stack
This means that S-Plus is looking for a function which it can't find. Say it's missing the function stepAIC: the most likely reason is that you haven't loaded the library which contains it. So for stepAIC you'd type library(MASS), or if it's not liking java.graph() you'd type library(winjava, T).
- Can't extract cow vapours.
Use traceback() to see the call stack
This means that you have forgotten to pipe a cow's arse into your PC so S-Plus can't determine the methane composition of the animal's flatulence. You should use some 1 inch silicone tubing to connect your bovine's anus to a free PCI slot. Tubing should be no longer than 10m to avoid signal attenuation.
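The first error above has an R counterpart. A small sketch, assuming R rather than S-Plus (the wording of the message differs between the two): calling a function that has not been defined or loaded raises an error, which can be caught and inspected. The name noSuchFunction is made up and deliberately left undefined.

```r
# Catch the error and keep its message instead of stopping.
msg <- tryCatch(
  noSuchFunction(1),                        # hypothetical, deliberately undefined
  error = function(e) conditionMessage(e)
)
msg

# Check whether a function is visible before calling it.
exists("noSuchFunction", mode = "function")   # FALSE
exists("lm", mode = "function")               # TRUE, the stats package is attached by default
```

If exists() says a function you expected is missing, loading the package that provides it with library() is the usual fix.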
Getting subplots out of ggplot2
I have found two ways of doing this buried at the end of the ggplot2 pdf. Here is a modified version of that code.
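# Load ggplot2 for the plot and grid for the viewport functions
library(ggplot2)
library(grid)
# Make some data
RAND <- data.frame(x=rnorm(100), y=rnorm(100))
P <- ggplot(RAND, aes(x=x, y=y)) + geom_point()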
# First approach
grid.newpage()
pushViewport(viewport(height=0.4, width=0.4, x=0.4, y=0.8))
print(P, newpage=FALSE, pretty=FALSE)
upViewport()
pushViewport(viewport(height=0.4, width=0.4, angle=25, x=0.4, y=0.3))
print(P, newpage=FALSE, pretty=FALSE)
# Second approach
vplayout <- function(x, y) viewport(layout.pos.row=x, layout.pos.col=y)
grid.newpage()
pushViewport(viewport(layout=grid.layout(3,3)))
print(P, vp=vplayout(1,1))
print(P, vp=vplayout(2:3,2:3))
print(P, vp=vplayout(1, 2:3))
print(P, vp=vplayout(2:3, 1))
Other odds and sods
- Removing un-used factor levels from a data frame:
TMP[] <- lapply(TMP, function(x) x[drop=TRUE])
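A quick sketch of that idiom on a made-up data frame, next to droplevels(), which recent versions of R provide for the same job. The column names and values are invented for illustration.

```r
# Hypothetical data frame with a factor column
TMP <- data.frame(feed = factor(c("grass", "hay", "silage")),
                  smell = c(9, 4, 6))
TMP <- TMP[TMP$feed != "silage", ]   # "silage" is now an unused level
levels(TMP$feed)                      # still lists all three levels

A <- TMP
A[] <- lapply(A, function(x) x[drop = TRUE])   # the idiom from above
B <- droplevels(TMP)                            # built-in alternative

levels(A$feed)
levels(B$feed)
```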
S-Plus Resources
- If you use Emacs then you might want to run S-Plus or R through Emacs or use Emacs to edit scripts. If this is the case then you'll want the ESS mode (Emacs Speaks Statistics). Here is one useful link. If ESS bitches about needing "Splus6" when you execute "S" from within Emacs, you could do the inelegant fix of making a symbolic link with this name and have it point to the Splus startup script. Anyone who knows what you're really supposed to edit in the Lisp files could let me in on the trick.
- Ripley's introductory guide to S-Plus (pdf, 391k).
- Jonathan Baron's reference card for R (pdf 59k).