This post is based on a Holy Cross faculty workshop, “Multivariate Analysis with R Statistical Software,” given on 19 January 2011 in the Dinand Educational Technology Center.
1. R is a free, open-source, cross-platform statistical and graphical environment that can be obtained from http://www.r-project.org/ . For a general introduction to R, see this post. Go here for a handy list of the most useful R commands.
2. Multivariate analysis is that branch of statistics concerned with examination of several variables simultaneously. One of the best introductory books on this topic is Bryan Manly’s Multivariate Statistical Methods: A Primer. An excellent Web-based R reference is Quick-R.
3. Let’s get some multivariate data into R and look at it. Click this link to display the comma-delimited data file sites.txt, which should open in a new browser window or tab. (If you right-click the link you can choose how to display the file.) Because the file is not large, we can also list it here:
locality,code,grass,moss,sedge,forb,fern,bare,woody,shrub,
lon,lat,fwlength,thorax,spotting,rfray,rhray,n
Bellows Falls,BF,46,3.5,0,34,.5,4,10,0,703877,4784632,15.8,3.6,1.6,1.6,1.4,62
Springfield,SP,49,0,0,24,1,.5,14.4,0,707151,4793906,16.0,3.7,1.1,2.0,1.5,51
Mount Ascutney,MA,49,0,4,17,5,0,0,0,708310,4806931,16.2,3.8,1.2,1.9,1.6,58
Bernardston,BE,49,1,0,29,0,0,11,7,696647,4725212,16.7,3.7,1.4,1.8,1.6,27
Lyme,LY,50,0,0,14,4,0,0,0,729474,4856926,15.6,3.6,1.1,1.8,1.3,52
Fairlee,FA,50,5,0,12,0,2,0,0,729045,4868779,15.9,3.8,1.0,1.8,1.5,25
Newport,NW,49,0,0,35,0,0,0,0,724169,4977885,16.7,3.7,1.1,1.8,1.3,21
Orleans,OR,49,0,0,41,0,0,19,0,723193,4960641,16.1,3.8,1.4,1.9,1.7,41
Lyndonville,LN,50,0,0,9,0,0,3,0,734359,4935471,16.0,3.7,1.4,1.9,1.7,52
St. Johnsbury,SJ,49,0,0,42,0,0,0,0,734215,4925655,15.7,3.6,1.2,1.9,1.7,56
Hanover,HA,49,0,0,38,5,0,5,0,715984,4843575,16.0,3.6,1.2,1.8,1.8,65
The first line (broken into two lines here for easier reading) lists the variable names, with subsequent lines containing data for 11 grassland sites in Massachusetts, New Hampshire, and Vermont in which vegetation variables were measured along a 50-meter transect. For each of the 8 vegetation variables (grass through shrub), values represent the percentage of 50 transect points at which each type of vegetation was present. The spatial coordinates of each site are in variables lon (longitude) and lat (latitude) in meters using the UTM coordinate system. Additional variables represent means of morphological measurements of male Common Ringlet (Coenonympha tullia) butterflies. Morphological variables are length of right forewing (fwlength, mm), length of thorax (thorax, mm), number of eyespots or ocelli (spotting, count), length of right forewing ray (rfray, ordinal 0 through 3), and length of right hindwing ray (rhray, ordinal 0 through 3). The final variable, n, is the number of male Ringlets measured at each site.
We can read this data file into an R dataframe with the following command:
sites = read.table("http://college.holycross.edu/faculty/rlent/data/sites.txt",
sep = ",",header=TRUE)
After issuing this command, type attach(sites) so that we can refer to the variables directly by name instead of having to also reference the dataframe like sites$locality.
For a quick-and-dirty map of the sites, type plot(lat,lon). To add labels to the data points, we need to use an R function called textxy. However, that function resides in an R package, called calibrate, which we first need to install. After installing calibrate, type textxy(lon,lat,locality), and the site names will be added to the existing scatterplot. Alternatively you could use the 2-letter code in place of locality. This is essentially how R works to produce various statistical analyses and graphics–passing parameters to functions, and occasionally loading additional packages (collections of more functions) when needed.
4. As a preliminary to visualizing our multivariate data we will construct several distance matrices. A distance matrix expresses the dissimilarities between all pairs of objects by computing a distance measure based on several variables. A common distance measure is Euclidean distance, the simple “as the crow flies” distance between two points in a plane or Cartesian coordinate system.
In R, the following command
sites.veg = dist(cbind(moss,sedge,forb,fern,bare,woody,shrub),method = "euclidean")
computes Euclidean distances between all pairs of sites based on the vegetation variables and stores the distances in the lower-diagonal matrix sites.veg. The cbind function packages or binds several vectors into a matrix, which we can then hand off to the dist function. [Euclidean is the default distance metric, so we don't actually have to specify it. Type help(dist) for more information.] Note that we are not using the grass variable as it does not vary much among sites, which you can see by typing the variable name grass to display the vector of grass values. Typing sites.veg displays the lower-diagonal matrix; to display the entire, symmetrical matrix type as.matrix(sites.veg). Go here for more examples of R data type conversions.
Here are the first 5 rows and columns of the vegetation distance matrix, produced by the command as.matrix(sites.veg)[1:5,1:5].
1 2 3 4 5
1 0.000000 12.004582 21.29554 9.874209 23.24866
2 12.004582 0.000000 16.98853 9.370699 17.79354
3 21.295539 16.988526 0.00000 18.867962 5.09902
4 9.874209 9.370699 18.86796 0.000000 20.29778
5 23.248656 17.793538 5.09902 20.297783 0.00000
Pairs of sites that have similar values of all vegetation variables have smaller values of Euclidean distance, and are thus more similar to each other in a multivariate sense, as if the sites were plotted in a n-dimensional space where n is the number of variables. Euclidean distance ranges from zero for identical sites (i.e., along the diagonal, the distance of each site to itself) to an arbitrary upper bound. Euclidean distance is sensitive to the scale of measurement. For this reason, data measured in different units are often standardized before computing Euclidean distances, which we can do in R with a command like:
sites.veg.std = dist(scale(cbind(moss,sedge,forb,fern,bare,woody,shrub)),
method = "euclidean")
This uses the scale function to standardize each variable to a mean of 0 and standard deviation of 1 before passing the data to the dist function. However, for these examples we won’t worry about it.
We will compute two additional Euclidean distance matrices. One is based on the butterfly variables, which are a multivariate measure of the average male phenotype at each site:
sites.pheno = dist(cbind(fwlength,thorax,spotting,rfray,rhray),method = "euclidean")
The third distance matrix is based on the latitude and longitude of each site:
sites.geo = dist(cbind(lat,lon),method = "euclidean")
Note that when Euclidean distance is calculated using spatial coordinates it yields the true geographic distance between sites.
5. One way to visualize multivariate distances is through cluster analysis, a technique for finding groups in data. Cluster analysis produces a tree diagram, or dendrogram, showing the distance relationships among a set of objects and grouping similar objects together. Using our vegetation distance matrix, we can produce a hierarchical cluster analysis with the following R command:
veg.clust = hclust(sites.veg, method="average")
and we can view the dendrogram by typing plot(veg.clust).
By default, the dendrogram uses record numbers as branch labels. Let’s add a custom title and labeling and remove the annoying subtitle at the bottom of the plot:
plot(veg.clust,labels=code,main="Clustering of Sites by Vegetation",xlab="Sites",
ylab="Euclidean Distance",sub="")
Next let’s cluster the sites by butterfly phenotype:
pheno.clust = hclust(sites.pheno, method="average")
Before producing the dendrogram, type dev.new(), which will produce a second graphics window and keep the previous one open. Then, plot the second dendrogram:
plot(pheno.clust,labels=code,main="Clustering of Sites by Phenotype",xlab="Sites",
ylab="Euclidean Distance",sub="")
Finally, we cluster the sites by geographic distance and plot the dendrogram in a third graphics window:
geo.clust = hclust(sites.geo, method="average")
dev.new()
plot(geo.clust,labels=code,main="Clustering of Sites by Geography",xlab="Sites",
ylab="Distance in Meters",sub="")
We can compare the geographic clustering of sites to an actual map of the sites. There is observable structure in each dendrogram, but what does it mean?
A simple way to compare clusters in a dendrogram is to create a grouping variable that indicates the cluster membership of each object, and then compare variables by cluster membership. To do this with the vegetation clustering, type:
cluster = cutree(veg.clust,h=16)
The cutree function creates a vector of cluster memberships at the height in the dendrogram indicated by the h parameter. We next need to merge this cluster membership vector back into the original data:
veg.groups = cbind(cluster,sites)
Then, we can compute means and other descriptive statistics of all the variables by cluster using the describe.by function from the psych package:
describe.by(veg.groups,cluster)
To make this output a little easier to read, first subset out the seven vegetation variables into their own dataframe, and then run the descriptive statistics:
veg = data.frame(cbind(moss,sedge,forb,fern,bare,woody,shrub))
describe.by(veg,cluster)
We can also produce exploratory graphics, such as the box plot:
boxplot(forb ~ cluster,notch=TRUE)
6. The Mantel test can be used to examine correlations between distance matrices. This lets us address questions related to things like the isolation-by-distance model of evolution: Are butterfly populations from geographically close sites more phenotypically similar than those from sites that are farther apart? This type of question can be answered by computing the Pearson correlation coefficient between the elements of two distance matrices. But because of the statistical dependence among the cells of a distance matrix, ordinary tests of significance cannot be used. Instead, the Mantel test uses random permutations of the rows and columns of one of the distance matrices. This is done a large number of times, and the correlation coefficient calculated each time, forming a sampling distribution. The observed correlation is then compared with the sampling distribution to assess the significance of the observed correlation under the assumption of randomness.
The Mantel test in R uses package ade4, which contains a variety of routines for ecological data. We will first run a Mantel test on our site vegetation distances versus phenotypic distances. The R command for this test is:
mantel.rtest(sites.veg, sites.pheno, nrepet = 1000)
And the test for phenotypic versus geographic distances is:
mantel.rtest(sites.pheno, sites.geo, nrepet = 1000)
This gives us 1000 random matrix permutations. Sadly, neither of the correlations is significant. We can get a nice graphical display of this non-significance by enclosing the Mantel call in a plot function:
plot(mantel.rtest(sites.pheno, sites.geo, nrepet = 1000))
which shows us a histogram of the simulated distribution of the test statistic along with the position of our observed correlation.
7. A different approach to analysis of multivariate distances is multidimensional scaling (MDS). Whereas cluster analysis uses a distance matrix to group similar objects together, MDS transforms a distance matrix into a set of coordinates in two or three dimensions, thereby reducing the dimensionality of the data. The coordinates are chosen such that the distance between objects in the coordinate space matches the dissimilarities contained in the original distance matrix as closely as possible. This is accomplished through an interative procedure.
We will perform a nonmetric multidimensional scaling of the site vegetation data. This form of MDS uses only the rank order of Euclidean distances in the data matrix, not the actual dissimilarities. The algorithm finds a configuration of points whose distances reflect as closely as possible the rank order of the data. This helps us to deal with the fact that Euclidean distances are sensitive to data that are measured in different units.
We must first install the MASS package, which supports the textbook “Modern Applied Statistics with S” (Venables and Ripley, 4th edition, 2002). [R is the open-source version of S.] Then, to perform the MDS, type:
mds = isoMDS(sites.veg, k=2)
The parameter k is the number of dimensions desired, usually 2 or 3. Then, type mds to display the coordinates. Because the notation for the column labels of the MDS coordinates is somewhat cumbersome, store them in a couple of vectors as follows:
mds1 = mds$points[,1]
mds2 = mds$points[,2]
Then, merge the coordinates with the original data with the following command:
sites.mds = cbind(sites,mds1,mds2)
We can then plot our MDS configuration and label the points with our two-letter site codes as follows:
plot(mds1,mds2)
textxy(mds1,mds2,code)
This plot shows some interesting patterning in the data, especially the 4 sites grouped at the extreme right end of the mds1 axis. These axes can be viewed as synthetic variables–our MDS analysis has reduced the dimensionality of the data from the original 7 variables to 2 synthetic axes. One way to interpret what these axes actually mean is to compute Pearson correlations between the axes and the original variables:
cor(cbind(mds1,mds2,moss,sedge,forb,fern,bare,woody,shrub))
Here we see that as values of mds1 increase, values of forbs and woody vegetation decrease, whereas mds2 appears to be primarily an axis of increasing woody vegetation. Unfortunately the cor function does not produce tests of significance. For that, we need to install the Hmisc package and use rcorr:
library(Hmisc)
rcorr(cbind(mds1,mds2,moss,sedge,forb,fern,bare,woody,shrub),
type="pearson")
This gives us a matrix of correlation coefficients along with a second matrix of P-values. Of course with this type of shotgun approach we have to worry about spurious significances, but we’ll leave that for another day.
8. Multidimensional scaling is one form of ordination, often used as a complementary method to clustering, in which objects are arrayed continuously in a set of coordinate axes. Principal component analysis (PCA) is another ordination technique that can also be used to reduce the dimensionality of a multivariate dataset. PCA reduces the original variables, which are often correlated with one another, to a smaller set of uncorrelated, synthetic variables (the principal components). This is done by producing linear combinations of the original variables that maximize the variation or dispersion of the original objects along the principal components.
For this analysis and the next we will use a new dataset, Hermon Bumpus’ House Sparrow data:
Bumpus sparrow data: bumpus.txt
Bumpus sparrow metadata: bumpus.met
More information on Hermon Bumpus and House Sparrows
Read the Bumpus data into an R dataframe and attach it:
bumpus = read.table("http://college.holycross.edu/faculty/rlent/data/bumpus.txt",
sep = "",header=TRUE)
attach(bumpus)
The data consist of 9 morphological variables measured on male House Sparrows, plus the variable survive that is coded 1 for survived and 0 for died. We will ignore the survival variable for now and concentrate on the morphological measurements. First, compute the correlation matrix:
birdcor = cor(cbind(length,alar,weight,lbh,lhum,lfem,ltibio,wskull,lkeel))
As is often the case with morphological data, the variables are intercorrelated, suggesting that at least some of them are measuring the same underlying thing, such as “size” or “shape.” PCA can help to extract these underlying components of variation. First we’ll drop the survival variable for this analysis:
pcadata = bumpus[c(-1)]
This trick works because the survive variable is the first one in the bumpus dataframe. Then, compute the PCA from the correlation matrix:
pca = princomp(pcadata, cor=TRUE)
Then, typing summary(pca) will show the proportion of variation in the original data accounted for by the principal components. We get the same number of components as we have original variables, but hope that the first few account for most of the variation. We see that the first 3 components account for over 65% of the original variation:
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 1.9156970 1.0714045 1.0531993
Proportion of Variance 0.4077661 0.1275453 0.1232476
Cumulative Proportion 0.4077661 0.5353114 0.6585590
This is good, because we have reduced our original nine variables to three that account for much of the variation in the data. But as with our multidimensional scaling analysis, we now have to interpret these synthetic variables. As before we do this by computing correlations between the principal components and the original variables. The principal components are “scores” of each individual object (birds) computed as linear combinations of the original variables to produce component 1, component 2, and so on. To see these scores, type pca$scores. We’ll store the first three components into vectors and then merge them back into the original data:
pca1 = pca$scores[,1]
pca2 = pca$scores[,2]
pca3 = pca$scores[,3]
bumpuspca = cbind(pca1,pca2,pca3,pcadata)
Then, the correlations:
cor(bumpuspca)
[To get rid of annoying scientific notation, increase the threshhold number of digits that triggers scientific notation by typing options(scipen=12).]
We see that pca1 is highly negatively correlated with all variables except lbh, with which it has a negligible positive correlation. We can thus interpret pca1 as a “size” axis. Pca2 and pca3 each have a mixture of positive and negative correlations with the original data, usually interpreted as different aspects of an underlying “shape” factor.
A handy feature of principal components is that they are mutually orthogonal in the n-dimensional hyperspace. In English, this means that they are uncorrelated with each other:
options(scipen=20) #Discourage scientific notation.
cor(pca$scores)
The correlations among our principal components are zero for all intents and purposes. For this reason PCA can be used to remove inter-variable correlation in preparation for things like multiple regression analysis, in which multicollinearity among the predictors is a bad thing.
Now, plot the first two components and label the points with the survival variable:
plot(pca1,pca2)
textxy(pca1,pca2,survive)
It appears that we may be seeing an effect of morphology on survival. Sounds like natural selection. We can explore this aspect of the data with our final technique of this session, discriminant analysis.
9. Whereas cluster analysis finds groups in data, discriminant function analysis (DFA) produces a linear combination of variables that best separate two or more groups that are already known. This is done essentially by performing a multivariate analysis of variance (MANOVA) in reverse, computing the coefficients of the discriminant function to maximize the multivariate F-ratio. Thus instead of maximizing the total variance explained as in PCA, discriminant analysis maximizes the total variance between groups. (Go here and here for more details.)
To compute a linear discriminant analysis using the Bumpus data, first load the MASS package:
library(MASS)
Then issue the following command to compute the discriminant functions:
dfa = lda(survive ~ length+alar+weight+lbh+lhum+lfem+ltibio+wskull+lkeel)
Typing dfa will display the basic results. Prior probabilities are the likelihood of group membership based solely on the sample size of each group, with no prior knowledge of the morphological variables measured on each bird. Group means are just the univariate means between groups. The coefficients comprise the discriminant function used to compute a score for each observation along the discriminant axis.
Typing predict(dfa)$x will list the discriminant scores. With two groups (survived or not) we get one discriminant axis; in general, n-1 discriminant functions are produced, where n is the number of groups. We can view the distribution of survivors and non-survivors along the discriminant axis by typing plot(dfa). Survivors tended to have positive scores along the discriminant axis, non-survivors negative. As with our previous methods we now want to try to interpret the discriminant function, which we do by merging the discriminant scores with the original data and computing correlations:
cor(cbind(predict(dfa)$x,length,alar,weight,lbh,lhum,lfem,ltibio,wskull,lkeel))
As values of the discriminant function increase, weight and length of birds decrease. As survivors have predominantly positive values along the discriminant axis, one interpretation is that survivors tended to be shorter and lighter than non-survivors and potentially better able to nutritionally survive the severe winter storm of February 1898.