A through Z Index  | Search  | Campus Directory  | Calendars Log In
About Holy Cross Admissions Academics Student Life Administration Athletics

Rweb

December 1st, 2011 by Richard Lent

The R statistical system, in addition to running on a computer desktop, can also be run from a server and accessed via a Web browser. One such installation runs from a server at the Pôle Bio-Informatique Lyonnais in France:

http://pbil.univ-lyon1.fr/Rweb/

The R Web interface allows you to enter and execute R commands to produce statistical analyses and graphics. You can also upload your own data for analysis. Rweb will also work from a mobile browser, opening up vast potential for sophisticated data analysis from any place on Earth that has access to an Internet connection.

Why use R?

October 6th, 2011 by Richard Lent

SPSS, SAS, Stata, Systat, etc. users, please read this: Why use R? A grad student’s 2 cents.

Multivariate Analysis with R

January 13th, 2011 by Richard Lent

This post is based on a Holy Cross faculty workshop, “Multivariate Analysis with R Statistical Software,” given on 19 January 2011 in the Dinand Educational Technology Center.

1. R is a free, open-source, cross-platform statistical and graphical environment that can be obtained from http://www.r-project.org/ . For a general introduction to R, see this post. Go here for a handy list of the most useful R commands.

2. Multivariate analysis is that branch of statistics concerned with examination of several variables simultaneously. One of the best introductory books on this topic is Bryan Manly’s Multivariate Statistical Methods: A Primer. An excellent Web-based R reference is Quick-R.

3. Let’s get some multivariate data into R and look at it. Click this link to display the comma-delimited data file sites.txt, which should open in a new browser window or tab. (If you right-click the link you can choose how to display the file.) Because the file is not large, we can also list it here:

locality,code,grass,moss,sedge,forb,fern,bare,woody,shrub,
lon,lat,fwlength,thorax,spotting,rfray,rhray,n
Bellows Falls,BF,46,3.5,0,34,.5,4,10,0,703877,4784632,15.8,3.6,1.6,1.6,1.4,62
Springfield,SP,49,0,0,24,1,.5,14.4,0,707151,4793906,16.0,3.7,1.1,2.0,1.5,51
Mount Ascutney,MA,49,0,4,17,5,0,0,0,708310,4806931,16.2,3.8,1.2,1.9,1.6,58
Bernardston,BE,49,1,0,29,0,0,11,7,696647,4725212,16.7,3.7,1.4,1.8,1.6,27
Lyme,LY,50,0,0,14,4,0,0,0,729474,4856926,15.6,3.6,1.1,1.8,1.3,52
Fairlee,FA,50,5,0,12,0,2,0,0,729045,4868779,15.9,3.8,1.0,1.8,1.5,25
Newport,NW,49,0,0,35,0,0,0,0,724169,4977885,16.7,3.7,1.1,1.8,1.3,21
Orleans,OR,49,0,0,41,0,0,19,0,723193,4960641,16.1,3.8,1.4,1.9,1.7,41
Lyndonville,LN,50,0,0,9,0,0,3,0,734359,4935471,16.0,3.7,1.4,1.9,1.7,52
St. Johnsbury,SJ,49,0,0,42,0,0,0,0,734215,4925655,15.7,3.6,1.2,1.9,1.7,56
Hanover,HA,49,0,0,38,5,0,5,0,715984,4843575,16.0,3.6,1.2,1.8,1.8,65

The first line (broken into two lines here for easier reading) lists the variable names, with subsequent lines containing data for 11 grassland sites in Massachusetts, New Hampshire, and Vermont in which vegetation variables were measured along a 50-meter transect. For each of the 8 vegetation variables (grass through shrub), values represent the percentage of 50 transect points at which each type of vegetation was present. The spatial coordinates of each site are in variables lon (longitude) and lat (latitude) in meters using the UTM coordinate system. Additional variables represent means of morphological measurements of male Common Ringlet (Coenonympha tullia) butterflies. Morphological variables are length of right forewing (fwlength, mm), length of thorax (thorax, mm), number of eyespots or ocelli (spotting, count), length of right forewing ray (rfray, ordinal 0 through 3), and length of right hindwing ray (rhray, ordinal 0 through 3). The final variable, n, is the number of male Ringlets measured at each site.

We can read this data file into an R dataframe with the following command:

sites = read.table("http://college.holycross.edu/faculty/rlent/data/sites.txt",
sep = ",",header=TRUE)

After issuing this command, type attach(sites) so that we can refer to the variables directly by name instead of having to also reference the dataframe like sites$locality.

For a quick-and-dirty map of the sites, type plot(lat,lon). To add labels to the data points, we need to use an R function called textxy. However, that function resides in an R package, called calibrate, which we first need to install. After installing calibrate, type textxy(lon,lat,locality), and the site names will be added to the existing scatterplot. Alternatively you could use the 2-letter code in place of locality. This is essentially how R works to produce various statistical analyses and graphics–passing parameters to functions, and occasionally loading additional packages (collections of more functions) when needed.

4. As a preliminary to visualizing our multivariate data we will construct several distance matrices. A distance matrix expresses the dissimilarities between all pairs of objects by computing a distance measure based on several variables. A common distance measure is Euclidean distance, the simple “as the crow flies” distance between two points in a plane or Cartesian coordinate system.

In R, the following command

sites.veg = dist(cbind(moss,sedge,forb,fern,bare,woody,shrub),method = "euclidean")

computes Euclidean distances between all pairs of sites based on the vegetation variables and stores the distances in the lower-diagonal matrix sites.veg. The cbind function packages or binds several vectors into a matrix, which we can then hand off to the dist function. [Euclidean is the default distance metric, so we don't actually have to specify it. Type help(dist) for more information.] Note that we are not using the grass variable as it does not vary much among sites, which you can see by typing the variable name grass to display the vector of grass values. Typing sites.veg displays the lower-diagonal matrix; to display the entire, symmetrical matrix type as.matrix(sites.veg). Go here for more examples of R data type conversions.

Here are the first 5 rows and columns of the vegetation distance matrix, produced by the command as.matrix(sites.veg)[1:5,1:5].

          1         2        3         4        5
1  0.000000 12.004582 21.29554  9.874209 23.24866
2 12.004582  0.000000 16.98853  9.370699 17.79354
3 21.295539 16.988526  0.00000 18.867962  5.09902
4  9.874209  9.370699 18.86796  0.000000 20.29778
5 23.248656 17.793538  5.09902 20.297783  0.00000

Pairs of sites that have similar values of all vegetation variables have smaller values of Euclidean distance, and are thus more similar to each other in a multivariate sense, as if the sites were plotted in a n-dimensional space where n is the number of variables. Euclidean distance ranges from zero for identical sites (i.e., along the diagonal, the distance of each site to itself) to an arbitrary upper bound. Euclidean distance is sensitive to the scale of measurement. For this reason, data measured in different units are often standardized before computing Euclidean distances, which we can do in R with a command like:

sites.veg.std = dist(scale(cbind(moss,sedge,forb,fern,bare,woody,shrub)),
method = "euclidean")

This uses the scale function to standardize each variable to a mean of 0 and standard deviation of 1 before passing the data to the dist function. However, for these examples we won’t worry about it.

We will compute two additional Euclidean distance matrices. One is based on the butterfly variables, which are a multivariate measure of the average male phenotype at each site:

sites.pheno = dist(cbind(fwlength,thorax,spotting,rfray,rhray),method = "euclidean")

The third distance matrix is based on the latitude and longitude of each site:

sites.geo = dist(cbind(lat,lon),method = "euclidean")

Note that when Euclidean distance is calculated using spatial coordinates it yields the true geographic distance between sites.

5. One way to visualize multivariate distances is through cluster analysis, a technique for finding groups in data. Cluster analysis produces a tree diagram, or dendrogram, showing the distance relationships among a set of objects and grouping similar objects together. Using our vegetation distance matrix, we can produce a hierarchical cluster analysis with the following R command:

veg.clust = hclust(sites.veg, method="average")

and we can view the dendrogram by typing plot(veg.clust).

By default, the dendrogram uses record numbers as branch labels. Let’s add a custom title and labeling and remove the annoying subtitle at the bottom of the plot:

plot(veg.clust,labels=code,main="Clustering of Sites by Vegetation",xlab="Sites",
ylab="Euclidean Distance",sub="")

Next let’s cluster the sites by butterfly phenotype:

pheno.clust = hclust(sites.pheno, method="average")

Before producing the dendrogram, type dev.new(), which will produce a second graphics window and keep the previous one open. Then, plot the second dendrogram:

plot(pheno.clust,labels=code,main="Clustering of Sites by Phenotype",xlab="Sites",
ylab="Euclidean Distance",sub="")

Finally, we cluster the sites by geographic distance and plot the dendrogram in a third graphics window:

geo.clust = hclust(sites.geo, method="average")
dev.new()
plot(geo.clust,labels=code,main="Clustering of Sites by Geography",xlab="Sites",
ylab="Distance in Meters",sub="")

We can compare the geographic clustering of sites to an actual map of the sites. There is observable structure in each dendrogram, but what does it mean?

A simple way to compare clusters in a dendrogram is to create a grouping variable that indicates the cluster membership of each object, and then compare variables by cluster membership. To do this with the vegetation clustering, type:

cluster = cutree(veg.clust,h=16)

The cutree function creates a vector of cluster memberships at the height in the dendrogram indicated by the h parameter. We next need to merge this cluster membership vector back into the original data:

veg.groups = cbind(cluster,sites)

Then, we can compute means and other descriptive statistics of all the variables by cluster using the describe.by function from the psych package:

describe.by(veg.groups,cluster)

To make this output a little easier to read, first subset out the seven vegetation variables into their own dataframe, and then run the descriptive statistics:

veg = data.frame(cbind(moss,sedge,forb,fern,bare,woody,shrub))
describe.by(veg,cluster)

We can also produce exploratory graphics, such as the box plot:

boxplot(forb ~ cluster,notch=TRUE)

6. The Mantel test can be used to examine correlations between distance matrices. This lets us address questions related to things like the isolation-by-distance model of evolution: Are butterfly populations from geographically close sites more phenotypically similar than those from sites that are farther apart? This type of question can be answered by computing the Pearson correlation coefficient between the elements of two distance matrices. But because of the statistical dependence among the cells of a distance matrix, ordinary tests of significance cannot be used. Instead, the Mantel test uses random permutations of the rows and columns of one of the distance matrices. This is done a large number of times, and the correlation coefficient calculated each time, forming a sampling distribution. The observed correlation is then compared with the sampling distribution to assess the significance of the observed correlation under the assumption of randomness.

The Mantel test in R uses package ade4, which contains a variety of routines for ecological data. We will first run a Mantel test on our site vegetation distances versus phenotypic distances. The R command for this test is:

mantel.rtest(sites.veg, sites.pheno, nrepet = 1000)

And the test for phenotypic versus geographic distances is:

mantel.rtest(sites.pheno, sites.geo, nrepet = 1000)

This gives us 1000 random matrix permutations. Sadly, neither of the correlations is significant. We can get a nice graphical display of this non-significance by enclosing the Mantel call in a plot function:

plot(mantel.rtest(sites.pheno, sites.geo, nrepet = 1000))

which shows us a histogram of the simulated distribution of the test statistic along with the position of our observed correlation.

7. A different approach to analysis of multivariate distances is multidimensional scaling (MDS). Whereas cluster analysis uses a distance matrix to group similar objects together, MDS transforms a distance matrix into a set of coordinates in two or three dimensions, thereby reducing the dimensionality of the data. The coordinates are chosen such that the distance between objects in the coordinate space matches the dissimilarities contained in the original distance matrix as closely as possible. This is accomplished through an interative procedure.

We will perform a nonmetric multidimensional scaling of the site vegetation data. This form of MDS uses only the rank order of Euclidean distances in the data matrix, not the actual dissimilarities. The algorithm finds a configuration of points whose distances reflect as closely as possible the rank order of the data. This helps us to deal with the fact that Euclidean distances are sensitive to data that are measured in different units.

We must first install the MASS package, which supports the textbook “Modern Applied Statistics with S” (Venables and Ripley, 4th edition, 2002). [R is the open-source version of S.] Then, to perform the MDS, type:

mds = isoMDS(sites.veg, k=2)

The parameter k is the number of dimensions desired, usually 2 or 3. Then, type mds to display the coordinates. Because the notation for the column labels of the MDS coordinates is somewhat cumbersome, store them in a couple of vectors as follows:

mds1 = mds$points[,1]
mds2 = mds$points[,2]

Then, merge the coordinates with the original data with the following command:

sites.mds = cbind(sites,mds1,mds2)

We can then plot our MDS configuration and label the points with our two-letter site codes as follows:

plot(mds1,mds2)
textxy(mds1,mds2,code)

This plot shows some interesting patterning in the data, especially the 4 sites grouped at the extreme right end of the mds1 axis. These axes can be viewed as synthetic variables–our MDS analysis has reduced the dimensionality of the data from the original 7 variables to 2 synthetic axes. One way to interpret what these axes actually mean is to compute Pearson correlations between the axes and the original variables:

cor(cbind(mds1,mds2,moss,sedge,forb,fern,bare,woody,shrub))

Here we see that as values of mds1 increase, values of forbs and woody vegetation decrease, whereas mds2 appears to be primarily an axis of increasing woody vegetation. Unfortunately the cor function does not produce tests of significance. For that, we need to install the Hmisc package and use rcorr:

library(Hmisc)
rcorr(cbind(mds1,mds2,moss,sedge,forb,fern,bare,woody,shrub),
  type="pearson")

This gives us a matrix of correlation coefficients along with a second matrix of P-values. Of course with this type of shotgun approach we have to worry about spurious significances, but we’ll leave that for another day.

8. Multidimensional scaling is one form of ordination, often used as a complementary method to clustering, in which objects are arrayed continuously in a set of coordinate axes. Principal component analysis (PCA) is another ordination technique that can also be used to reduce the dimensionality of a multivariate dataset. PCA reduces the original variables, which are often correlated with one another, to a smaller set of uncorrelated, synthetic variables (the principal components). This is done by producing linear combinations of the original variables that maximize the variation or dispersion of the original objects along the principal components.

For this analysis and the next we will use a new dataset, Hermon Bumpus’ House Sparrow data:

Bumpus sparrow data: bumpus.txt
Bumpus sparrow metadata: bumpus.met
More information on Hermon Bumpus and House Sparrows

Read the Bumpus data into an R dataframe and attach it:

bumpus = read.table("http://college.holycross.edu/faculty/rlent/data/bumpus.txt",
sep = "",header=TRUE)
attach(bumpus)

The data consist of 9 morphological variables measured on male House Sparrows, plus the variable survive that is coded 1 for survived and 0 for died. We will ignore the survival variable for now and concentrate on the morphological measurements. First, compute the correlation matrix:

birdcor = cor(cbind(length,alar,weight,lbh,lhum,lfem,ltibio,wskull,lkeel))

As is often the case with morphological data, the variables are intercorrelated, suggesting that at least some of them are measuring the same underlying thing, such as “size” or “shape.” PCA can help to extract these underlying components of variation. First we’ll drop the survival variable for this analysis:

pcadata = bumpus[c(-1)]

This trick works because the survive variable is the first one in the bumpus dataframe. Then, compute the PCA from the correlation matrix:

pca = princomp(pcadata, cor=TRUE)

Then, typing summary(pca) will show the proportion of variation in the original data accounted for by the principal components. We get the same number of components as we have original variables, but hope that the first few account for most of the variation. We see that the first 3 components account for over 65% of the original variation:

Importance of components:
                         Comp.1    Comp.2    Comp.3
Standard deviation     1.9156970 1.0714045 1.0531993
Proportion of Variance 0.4077661 0.1275453 0.1232476
Cumulative Proportion  0.4077661 0.5353114 0.6585590

This is good, because we have reduced our original nine variables to three that account for much of the variation in the data. But as with our multidimensional scaling analysis, we now have to interpret these synthetic variables. As before we do this by computing correlations between the principal components and the original variables. The principal components are “scores” of each individual object (birds) computed as linear combinations of the original variables to produce component 1, component 2, and so on. To see these scores, type pca$scores. We’ll store the first three components into vectors and then merge them back into the original data:

pca1 = pca$scores[,1]
pca2 = pca$scores[,2]
pca3 = pca$scores[,3]
bumpuspca = cbind(pca1,pca2,pca3,pcadata)

Then, the correlations:

cor(bumpuspca)

[To get rid of annoying scientific notation, increase the threshhold number of digits that triggers scientific notation by typing options(scipen=12).]

We see that pca1 is highly negatively correlated with all variables except lbh, with which it has a negligible positive correlation. We can thus interpret pca1 as a “size” axis. Pca2 and pca3 each have a mixture of positive and negative correlations with the original data, usually interpreted as different aspects of an underlying “shape” factor.

A handy feature of principal components is that they are mutually orthogonal in the n-dimensional hyperspace. In English, this means that they are uncorrelated with each other:

options(scipen=20) #Discourage scientific notation.
cor(pca$scores)

The correlations among our principal components are zero for all intents and purposes. For this reason PCA can be used to remove inter-variable correlation in preparation for things like multiple regression analysis, in which multicollinearity among the predictors is a bad thing.

Now, plot the first two components and label the points with the survival variable:

plot(pca1,pca2)
textxy(pca1,pca2,survive)

It appears that we may be seeing an effect of morphology on survival. Sounds like natural selection. We can explore this aspect of the data with our final technique of this session, discriminant analysis.

9. Whereas cluster analysis finds groups in data, discriminant function analysis (DFA) produces a linear combination of variables that best separate two or more groups that are already known. This is done essentially by performing a multivariate analysis of variance (MANOVA) in reverse, computing the coefficients of the discriminant function to maximize the multivariate F-ratio. Thus instead of maximizing the total variance explained as in PCA, discriminant analysis maximizes the total variance between groups. (Go here and here for more details.)

To compute a linear discriminant analysis using the Bumpus data, first load the MASS package:

library(MASS)

Then issue the following command to compute the discriminant functions:

dfa = lda(survive ~ length+alar+weight+lbh+lhum+lfem+ltibio+wskull+lkeel)

Typing dfa will display the basic results. Prior probabilities are the likelihood of group membership based solely on the sample size of each group, with no prior knowledge of the morphological variables measured on each bird. Group means are just the univariate means between groups. The coefficients comprise the discriminant function used to compute a score for each observation along the discriminant axis.

Typing predict(dfa)$x will list the discriminant scores. With two groups (survived or not) we get one discriminant axis; in general, n-1 discriminant functions are produced, where n is the number of groups. We can view the distribution of survivors and non-survivors along the discriminant axis by typing plot(dfa). Survivors tended to have positive scores along the discriminant axis, non-survivors negative. As with our previous methods we now want to try to interpret the discriminant function, which we do by merging the discriminant scores with the original data and computing correlations:

cor(cbind(predict(dfa)$x,length,alar,weight,lbh,lhum,lfem,ltibio,wskull,lkeel))

As values of the discriminant function increase, weight and length of birds decrease. As survivors have predominantly positive values along the discriminant axis, one interpretation is that survivors tended to be shorter and lighter than non-survivors and potentially better able to nutritionally survive the severe winter storm of February 1898.

Introduction to R

October 7th, 2010 by Richard Lent

This post is based on a Holy Cross faculty workshop, “Introduction to R,” given on 14 October 2010.  It is mostly in outline form, and is certainly not a replacement for the many excellent R tutorials that exist on the Web, including my personal favorite, Quick-R.

1.   R is a free, open-source, cross-platform statistical and graphical environment that can be obtained from http://www.r-project.org/ .

2.   Rather than being  a “software package” in the usual sense of SPSS, SAS, Systat, etc., where the user fills in the blanks in a series of dialog boxes and then has to sift through reams of canned output, R is actually a programming language customized for statistical analysis and graphics.  R does have a graphical user interface (GUI) that makes the language easier to use, but the true power of R lies in its programmability.

3.   Getting a quick feel for R. The command prompt is the > symbol.

R can evaluate expressions and manipulate variables, e.g.,

> x = 2

Print the contents of the variable x.

> x

Much of R involves function calls and parameters:

Generate 1000 random values with default mean = 0 and standard deviation = 1.

>y = rnorm(1000)

Display the vector y.

>y

Generate 1000 random values with specified mean and standard deviation.

>y = rnorm(1000,mean=42,sd=2.5)

Summary statistics for vector y.

>summary(y)

Generate a scatterplot of the random values.

>plot(y)

Boxplot of y.

>boxplot(y)

Histogram.

>hist(y)

R graphics can be easily exported to a variety of formats that can be brought into Word documents, PowerPoint presentations, etc.

[Hint: R maintains a history of all of your commands, which you can access by scrolling the command window.  However, eventually your older commands disappear from the window.  To retrieve all of your past commands, type:

history(max.show=Inf)

For more help on managing your R workspace, go here.]

4.   Getting data into R.

R uses an internal structure called a dataframe to store data in a traditional row-column format, where the rows are the objects of interest and the columns are variables measured on each object.  In R, a dataframe is created by reading in raw data from an external file.  Conversely, an R dataframe is saved by exporting it to an external file or to R’s internal data format.

Bumpus sparrow data: bumpus.txt
Bumpus sparrow metadata: bumpus.met
More information on Hermon Bumpus and House Sparrows

To create an R dataframe of the sparrow data:

a. Download the plain text data file bumpus.txt to a readily-accessible location on your computer.

b. Issue the following command in R:

bumpus = read.table("d:/empty/bumpus.txt",sep = "",header=TRUE)

replacing the “d:/empty/bumpus.txt” with the path to the data file on your particular computer.

Once in a dataframe, you can then begin to manipulate, analyze, and visualize the data. Typing names(bumpus) lists the names of the variables.  Typing the name of the dataframe (bumpus) will scroll all of the data to the console window.  To display individual variables, use syntax like bumpus$survive.  Better still, enter attach(bumpus)–once a dataframe is “attached” to the console window you can just type the name of a variable by itself to display the variable’s contents. Typing edit(bumpus) will place the dataframe into a crude editing window (crude in the sense that it doesn’t do very much).  Anything you type into the editing window WILL NOT be saved to the dataframe (I said it was crude); use fix(bumpus) instead.

If we had an SPSS file, we could read the data into an R dataframe like this:

>cars = read.spss("d:/empty/cars.sav", to.data.frame = TRUE)

Go here to see how to import other popular data formats.

The following sequence of commands creates a new dataframe from scratch:

age = c(25, 30, 56)
gender = c("male", "female", "male")
weight = c(160, 110, 220)
mydata = data.frame(age,gender,weight)

The c function collects values into a vector, and the data.frame function packages the vectors as columns into a dataframe.

5.   Some graphics.

The R language makes it possible to create impressive graphics with only a few lines of code. For example, using the bumpus dataframe, simply typing

plot(bumpus)

will create a scatterplot matrix, displaying all pairwise plots of all variables in the dataframe.  This is a useful visual technique to screen your data for things like outliers and nonlinearities.

Scatterplot of humerus versus femur length.

>plot(lhum,lfem)

Same scatterplot with smoothed fit.

>scatter.smooth(lhum,lfem)

Another way.  Needs prior call to plot(lhum,lfem).

>lines(loess.smooth(lhum,lfem))

Linear instead of lowess/loess smooth. Note that variables are reversed in call to abline.

>abline(lm(lfem~lhum))

[lines() connects dots to produce a curve; abline() plots a single straight line]

Grouped boxplot.

>boxplot(length~survive)

Notched boxplot with simultaneous confidence interval on the medians. (Try THAT in SPSS–can’t do it!!)

>boxplot(length~survive, notch=TRUE)

Add a title and some axis labels.

>boxplot(length~survive, notch=TRUE,main="Bumpus' Sparrows",
 xlab="Survival",ylab="Body length")

Something a little fancier:

Invoke the rgl 3D graphics package.

>library(rgl)

Rotate your data in 3D (drag your mouse cursor over the plot to rotate).

>plot3d(length,alar,lfem, col="red", size=6)

Create a new variable for labeling points.

>bumpus$survcat = ifelse(bumpus$survive == 0, c('N'), c('Y'))

Re-plot with smaller plotting symbols.

>plot3d(length,alar,lfem,size=2)

Add text labels to data points and ponder the effects of body size on survival.

>text3d(length,alar,lfem,texts=bumpus$survcat)

You can even skip creating a separate labeling variable by inserting the recoding statement inside of the call to text3d():

>text3d(length,alar,lfem,texts=ifelse(bumpus$survive == 0, c('N'), c('Y')))

(My thanks to Holy Cross Professor John Little for showing me this trick.)

6.   Some statistics.

R’s approach to traditional statistical methods like analysis of variance has been called “less coherent and user-friendly” than that of traditional stats packages. Despite the nonzero slope of the learning curve, the R way is very powerful, once you get the hang of it.

Common statistical routines are found in R’s Stats Package. (It’s big.)

Summary statistics for all variables in dataframe bumpus.

>summary(bumpus)

Scatterplot of data for simple linear regression.

>plot(lhum,lfem)

Simple regression, get very basic results, just the coefficients.

>lm(lhum~lfem)

This gives us more traditional regression output.

>summary(lm(lhum~lfem))

Multiple regression, with some diagnostics.

summary(lm(weight~length+alar+lbh+lhum+lfem+ltibio+wskull))

Calculation of residuals (observed weight – predicted weight).

>residuals=weight-predict(lm(weight~length+alar+lbh+lhum+lfem+ltibio+wskull))

View the residuals.

>residuals

Looks decently random.

>plot(residuals)

Looks decently normal.

>hist(residuals)

Test for normality of residuals. (First download and install the nortest package.)  The Lilliefors test for normality (Kolmogorov-Smirnov test against a normal distribution with same mean and variance.)

>lillie.test(residuals)

Traditional t-test.

>t.test(length ~ survive)

Analysis of Variance.

>summary(aov(length ~ survive))

Contingency tables.

>data(farms)
>table(Manag,Manure)
>chisq.test(Manag,Manure)

7.   A couple examples of multivariate analysis using R

Cluster Analysis

Cluster analysis is a multivariate technique for finding groups in data.  We will perform a cluster analysis of Sir Ronald Fisher’s Iris data, consisting of morphological measurements of 150 individuals of 3 species of Iris (a plant). This is a famous dataset that has often been used to illustrate multivariate techniques.

The Iris dataset is prepackaged inside of R.  To get a list of all available R datasets, type “data()”.

The dataset we want is called “Edgar Anderson’s Iris Data.” (Though often described as “Fisher’s Iris Data,” the data actually originated with Edgar Anderson, who worked in Fisher’s lab.) To access and view the Iris dataset, type:

>data(iris)
>attach(iris)
>edit(iris)

Cluster analysis uses a distance matrix computed from the original data.  We will compute a Euclidean distance matrix using the 4 morphological measurements:

>iris.dist = dist(iris, method = "euclidean")

You will see the warning “NAs introduced by coercion.” This warning is caused by R trying to convert, or “coerce,” the string Species variable into a number, and can be ignored.  If we had first subsetted out the numeric variables and recalculated the distance matrix, we would not get the warning. But then we would not have the string Species variable for use as a label.

[Hint: To select a subset of variables for analysis, you need to create a new dataframe containing the subset.  For example:

>irisnew = iris[c("Sepal.Length", "Sepal.Width", "Petal.Length")]

Other methods for subsetting and selecting data in R can be found here.]

Then, compute the cluster analysis and plot the dendrogram:

>iris.hclust = hclust(iris.dist, method = "complete")
>plot(iris.hclust)

By default, the dendrogram uses record numbers as branch labels. To label the branches with the species name, type:

>plot(iris.hclust,labels=Species)

Let's redo the dendrogram again, adding a custom title and labeling, and to remove the annoying subtitle at the bottom of the plot:

>plot(iris.hclust,labels=Species,main="Fisher's/Anderson's Iris Data",
>xlab="Species",ylab="Euclidean Distance",sub="")

Discriminant Analysis

Whereas cluster analysis finds groups in data, discriminant analysis produces a linear combination of variables that best separate two or more groups that are already known.  (Ronald Fisher first introduced the Iris dataset as an illustration of discriminant analysis.)  To compute a linear discriminant analysis using the Iris data, first load the MASS package:

>library(MASS)

Then issue the following command to compute the discriminant functions:

>dfa = lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length +
          Petal.Width, na.action="na.omit")

Display tabular results.

>dfa

List the discriminant scores.

>predict(dfa)

Plot the Iris species in the discriminant axes.

>plot(dfa)

8.  An R menu system: R Commander

Enter library(Rcmdr) at the R prompt and you will see R Commander, a GUI developed to help make R easier to use.

We will construct a boxplot using the Iris data and R Commander.  (Note that the GUI does not have an option to produce a notched boxplot.  You need to know the language!)

We will repeat our cluster analysis of the Iris data using R Commander.

Brief exploration of data managment functions in R Commander. For example, let's recode a variable in the Iris dataset.

Data|Manage variables in active dataset|Recode variables...
Choose the Petal.Length variable to recode.
Rename the new, recoded variable Petal.Size.
Enter the following as the recode directives:

1:3 = "small"
else = "big"

Click OK, then click "View data set" to see the results.

Do a contingency table analysis of Species by Petal.Size.

Save the modified dataframe as an R data file.  The saved data can subsequently be reloaded into R Commander or into the R GUI using a command like:

>load("D:\\test\\iris.rda")    #Note the double slashes!

For more information on using R Commander, go here.

length~survive

Required reading for every scientist

March 22nd, 2010 by Richard Lent

Odds Are, It’s Wrong:  Science fails to face the shortcomings of statistics

Science News, March 27th, 2010; Vol.177 #7


Data Analysis and Visualization is proudly powered by WordPress MU running on Holy Cross Blogs. Create a new blog and join in the fun!
Entries (RSS) and Comments (RSS).