---
title: "Machine Learning Introduction"
output: 
  BiocStyle::html_document:
    toc: true
    highlight: tango
    css: Andrzej.css
---

```{r style, echo=FALSE, results="asis", cache=FALSE}
library(knitr)
BiocStyle::markdown()
opts_chunk$set(fig.width=14, fig.height=10, cache=TRUE, error=FALSE, message = FALSE)
options(digits = 4, scipen = 10, stringsAsFactors = FALSE, width=100)
```

Bernd Klaus[1], based on material from S. Arora ([Bioc Seattle, Oct 14](http://bioconductor.org/help/course-materials/2014/SeattleOct2014/)) and VJ Carey ([Brixen 2011](http://bioconductor.org/packages/release/bioc/vignettes/MLInterfaces/inst/doc/MLprac2_2.pdf)) as well as the `r Biocpkg("CMA")` package vignette.

[1] European Molecular Biology Laboratory (EMBL), Heidelberg, Germany.

**LAST UPDATE AT** `r date()`

Preparations
--------------

We first set global chunk options and load the necessary packages.

```{r setup, cache = FALSE}
library(BiocStyle)
library(rmarkdown)
library(geneplotter)
library(ggplot2)
library(plyr)
library(dplyr)
library(LSD)
library(DESeq2)
library(gplots)
library(RColorBrewer)
library(stringr)
library(topGO)
library(genefilter)
library(biomaRt)
library(classifly)
library(MASS)
library(EDASeq)
library(fdrtool)
library(e1071)
library(LearnBioconductor)
## get it here:
#http://bioconductor.org/help/course-materials/2014/SeattleOct2014/LearnBioconductor_0.1.6.tar.gz
library(GenomicRanges)
library(CMA)
library(plsgenomics)
library(crossval)
library(MLInterfaces)
library(GGally)
library(MLSeq)
library(caret)
library(dendextend)
library(class)
library(rpart)
library(ALL)
#library(rggobi)
#library(xlsx)
```

In this document we introduce some ML examples.

# Introduction to Machine Learning

The term machine learning refers to a family of computational methods for analyzing multivariate datasets. Each data point has a vector of features in a shared feature space, and may have a class label from some fixed finite set.
Supervised learning refers to processes that help articulate rules that map feature vectors to class labels. The class labels are known and function as supervisory information to guide rule construction. Unsupervised learning refers to processes that discover structure in collections of feature vectors. Typically the structure consists of a grouping of objects into clusters.

This practical introduction to machine learning will begin with a survey of a low-dimensional dataset to fix concepts, and will then address problems coming from genomic data analysis, using RNA expression and chromatin state data.

Statistical learning plays an important role in genomics. A few examples of learning problems would be:

* Identify the risk factors (genes) for prostate cancer based on gene expression data
* Predict the chances of breast cancer survival in a patient
* Identify patterns of gene expression among different subtypes of breast cancer

Typically, we have an outcome measurement, usually quantitative (such as gene expression) or categorical (such as breast cancer / no breast cancer), that we wish to predict based on a set of features (such as diet and clinical measurements). We have a training set of data, in which we observe the outcome and feature measurements for a set of objects (such as people). With this data we build a prediction model, or a learner, which will enable us to predict the outcome for new, unseen objects. A good learner is one that accurately predicts such an outcome. This is called _supervised_ learning because the outcome variable guides the learning process. In the _unsupervised_ learning problem, we have no measurements of the outcome and observe only the features. Our task is then to describe how the data are organized or clustered.

# Getting acquainted with machine learning via the crabs data

The crabs data will be used to introduce you to machine learning methods.
The crabs data frame has 200 rows and 8 columns, describing 5 morphological measurements on 50 crabs each of two color forms and both sexes, of the species Leptograpsus variegatus collected at Fremantle, W. Australia.
```{r load_crabs_data, echo = TRUE}
data("crabs")
dim(crabs)
sample_n(crabs, 10)
table(crabs$sex)
```
We can plot the rear width (RW, in mm) per sex and species to get an idea of whether it differs between the groups.

```{r plot_crabs_per_age, echo = TRUE}
qplot(sp, RW, data = crabs, facets = ~ sex, color = sp, geom = "boxplot")
```

We will regard these data as providing five quantitative features (FL, RW, CL, CW, BD) and a pair of class labels (sex, sp = species). We may regard this as a four-class problem, or as two two-class problems.

Our first problem does not involve any computations. If you want to write R code to solve the problem, do so, but use prose first.

* **Question** On the basis of the boxplots in the Figure, comment on the prospects for predicting species on the basis of RW. State a rule for computing the predictions. Describe how to assess the performance of your rule.

## Species prediction via logistic regression

A simple approach to prediction involves logistic regression, which models the log-odds of class membership as a linear function of the predictors.
```{r log_reg, dependson="load_crabs_data"}
m1 = glm(sp ~ RW, data = crabs, family = binomial)
summary(m1)
```
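For reference, the model fitted by `glm` with `family = binomial` can be written as follows (our notation; the question below asks you to spell this out yourself). Writing $\pi(RW) = P(\mathrm{sp} = \text{O} \mid RW)$,

$$
\log \frac{\pi(RW)}{1 - \pi(RW)} = \beta_0 + \beta_1 \, RW,
\qquad \text{i.e.} \qquad
\pi(RW) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \, RW)}} .
$$

A classifier is obtained by thresholding the fitted probability, e.g. predicting "O" whenever $\hat\pi(RW) > 0.5$.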
The logistic regression fit will tell us, for each sample, the probability of it belonging to class O. Below, we classify a sample into the O class if this probability is greater than 0.5.

* **Question** Write down the statistical model corresponding to the R expression above. How can we derive a classifier from this model?

* **Question** Perform the following computations. Discuss their interpretation. What are the estimated error rates of the two models? Is the second model, on the subset, better?

```{r log_reg_mod, dependson="log_reg"}
qplot(crabs$sp, predict(m1, type = "response"), geom = "boxplot", color = crabs$sp)
table(predict(m1, type = "response") > .5, crabs$sp)
crabs.F <- subset(crabs, sex == "F")
m2 = glm(sp ~ RW, data = crabs.F, family = binomial)
table(predict(m2, type = "response") > .5, crabs.F$sp)
qplot(crabs.F$sp, predict(m2, type = "response"), geom = "boxplot", color = crabs.F$sp)
```

What we computed above is commonly called a "confusion matrix": for a two--class scenario, it gives the number of true positives / negatives and the errors represented by false positives and false negatives. This terminology stems from medical disease testing; obviously, in our example there is no natural "positive" or "negative" category. Thus, we set (arbitrarily) "B" as negative.

```{r confMatrix, dependson="log_reg_mod"}
cMatrix = crossval::confusionMatrix(actual = crabs.F$sp,
                                    predicted = ifelse(predict(m2, type = "response") > .5, "O", "B"),
                                    negative = "B")
cMatrix
diagnosticErrors(cMatrix)
```

The function `diagnosticErrors` computes various diagnostic errors useful for evaluating the performance of a diagnostic test or a classifier: accuracy (acc), sensitivity (sens), specificity (spec), positive predictive value (ppv), negative predictive value (npv), and log-odds ratio (lor).
The definitions are:

* acc = (TP+TN)/(FP+TN+TP+FN)
* sens = TP/(TP+FN)
* spec = TN/(FP+TN)
* ppv = TP/(FP+TP)
* npv = TN/(TN+FN)
* lor = $\log(\mathrm{TP} \cdot \mathrm{TN} / (\mathrm{FN} \cdot \mathrm{FP}))$

The first model is very bad: there is hardly any predictive power. If we condition on the sex and use only the females, the predictions become better.

## The cross-validation concept

Cross--validation is a technique that is widely used for reducing bias in the estimation of predictive accuracy. If no precautions are taken, bias can be caused by **overfitting** a classification algorithm to a particular dataset; the algorithm learns the classification "by heart", but performs poorly when asked to generalize it to new, unseen examples. Briefly, in cross--validation the dataset is deterministically partitioned into a series of training and test sets. The model is built for each training set and evaluated on the test set. The accuracy measures are averaged over this series of fits. Leave--one--out cross--validation consists of N fits, with N training sets of size N-1 and N test sets of size 1.

First let us use MLearn from the `r Biocpkg("MLInterfaces")` package to fit a single logistic model. MLearn requires you to specify an index set for training. We use `c(1:30, 51:80)` to choose a training set of size 60, balanced between the two species (because we know the ordering of records). This procedure also requires you to specify a probability threshold for classification. We use a typical default of 0.5: if the predicted probability of being "O" exceeds 0.5, we classify to "O", otherwise to "B".

```{r trainTest, dependson="log_reg_mod"}
ml1 = MLearn(sp ~ RW, crabs.F, glmI.logistic(thresh = .5), c(1:30, 51:80),
             family = binomial)
ml1
cMatrix.TT = crossval::confusionMatrix(actual = ml1@testOutcomes,
                                       predicted = ml1@testPredictions,
                                       negative = "B")
cMatrix.TT
diagnosticErrors(cMatrix.TT)
```

* **Question** What does the report on ml1 tell you about predictions with this model?
Can you reconcile this with the results for model m2? (Hint --- the non--randomness of the selection of the training set is a problem.)

Now we will illustrate cross-validation. We invoke MLearn again, this time specifying a five--fold cross-validation with folds that are approximately balanced between the two classes.

```{r crossval5, dependson="trainTest"}
smx1 = MLearn(sp ~ RW, crabs.F, glmI.logistic(thresh = .5),
              xvalSpec(type = "LOG", niter = 5, balKfold.xvspec(5)),
              family = binomial)
cMatrix.cv = crossval::confusionMatrix(actual = smx1@testOutcomes,
                                       predicted = smx1@testPredictions,
                                       negative = "B")
cMatrix.cv
diagnosticErrors(cMatrix.cv)
```

We can now compare the cross-validation results to the ones obtained using the training set also for testing, and to using only a single training--test split.

```{r compareTTandXVal}
rbind(noSplit  = diagnosticErrors(cMatrix),
      oneSplit = diagnosticErrors(cMatrix.TT),
      CV5      = diagnosticErrors(cMatrix.cv))
```

* **Question** Define clearly the difference between models ml1 and smx1.

# Applying classification to Microarray data

Classification methods have been very popular for microarray data, e.g. in order to identify signatures of cancer subtypes. Traditional methods often yield unsatisfactory results or are even inapplicable in the $p \gg n$ setting. Hence, microarray studies have stimulated the development of new approaches and motivated the adaptation of known traditional methods to high-dimensional data. One of the most commonly used algorithms for this purpose is called NSC or PAM. [See the paper](http://statweb.stanford.edu/~tibs/ftp/STS040.pdf).

Moreover, model selection and evaluation of prediction rules prove to be highly difficult in this situation, for several reasons. Firstly, the hazard of overfitting, which is common to all prediction problems, is increased by high dimensionality.
Secondly, the usual evaluation scheme based on splitting into learning and test data sets often applies only partially in the case of small samples. Lastly, modern classification techniques rely on the proper choice of hyperparameters, whose optimization is highly computer-intensive, especially in the case of high-dimensional data.

## Learning with expression arrays

Here we will concentrate on ALL: acute lymphocytic leukemia, B-cell type.

### Phenotype reduction

We will identify expression patterns that discriminate individuals with BCR/ABL fusion in B-cell leukemia.

```{r setupALL}
data("ALL")
bALL = ALL[, substr(ALL$BT, 1, 1) == "B"]
fus = bALL[, bALL$mol.biol %in% c("BCR/ABL", "NEG")]
fus$mol.biol = factor(fus$mol.biol)
fus
sample_n(pData(fus), 10)
```

### Nonspecific filtering

We can nonspecifically filter to the 300 genes (to save computing time) with the largest measures of robust variation across all samples:

```{r selectGenes, dependson="setupALL"}
mads = apply(exprs(fus), 1, mad)
fusk = fus[mads > sort(mads, decr = TRUE)[300], ]
fcol = ifelse(fusk$mol.biol == "NEG", "green", "red")
```

## Constructing a classification rule

The second main issue is the construction of an appropriate prediction rule. In microarray data analysis, we have to deal with the $n \ll p$ situation, i.e. the number of predictors exceeds by far the number of observations. Some class prediction methods do not work in the $n \ll p$ case, e.g. linear or quadratic discriminant analysis, which are based on the inversion of a matrix of size $p \times p$ that here has rank at most $n-1$. In the $n \ll p$ setting, the hazard of overfitting is especially acute: perfect separation of the classes for a given sample based on a high number of predictors is always possible. However, the resulting classification rule may generalize poorly on independent test data.
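To make this hazard concrete, the following small simulation (our own illustration, not part of the original course material) selects the "best" out of 100 pure-noise features and fits a logistic regression on it; the apparent training accuracy is optimistically high even though there is no signal at all, so on truly new data the rule could do no better than chance.

```r
## p >> n overfitting sketch: every feature is pure noise, yet naive
## feature selection plus refitting yields an optimistic training accuracy.
set.seed(1)
n <- 20
p <- 100
X <- matrix(rnorm(n * p), nrow = n)            # no true signal at all
y <- factor(rep(c("A", "B"), each = n / 2))

## naive variable selection: pick the feature with the largest |t|-statistic
tstats <- apply(X, 2, function(x) abs(t.test(x ~ y)$statistic))
best <- which.max(tstats)

fit <- glm(y ~ X[, best], family = binomial)
trainAcc <- mean(ifelse(fitted(fit) > 0.5, "B", "A") == y)
trainAcc   # typically well above 0.5, despite the absence of any signal
```

The crucial mistake mirrored here is that the selection step already used the class labels; an honest estimate would repeat the selection inside each cross-validation fold.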
There are basically three approaches to cope with the $n \ll p$ setting:

* variable selection using, e.g., univariate statistical tests
* regularization or shrinkage methods, such as the support vector machine, $\ell_2$- or $\ell_1$-penalized logistic regression, or boosting, some of which also perform variable selection
* dimension reduction or feature extraction, most prominently the partial least squares method

Most classification methods depend on a vector of hyperparameters ${\lambda}$ that have to be correctly chosen. Together with variable selection, this is part of the model selection and thus has to be performed separately for each learning sample ${L}_b$ to avoid bias. An optimal vector of hyperparameters ${\lambda}^{\text{opt}}$ is determined by defining a discrete set of candidate values whose performance is then measured by cross-validation. This involves a further splitting step, which is sometimes called 'inner loop' or 'nested' cross-validation. More precisely, each learning set ${L}_b$ is divided into $l=1,\ldots,k$ parts such that the $l$-th part forms the test sample for the $l$-th (inner) iteration. This procedure can be used to derive an error estimator for each value on the grid of candidate hyperparameter values. The optimal vector ${\lambda}^{\text{opt}}$ is chosen to minimize this error criterion. Note that there are often several such minimizers. Furthermore, the minimizer found by this procedure can be relatively far away from the true minimizer, depending on how finely the discrete grid has been chosen.

The choice of the inner cross-validation scheme is difficult. With a high $k$, computation times soon become prohibitively high. With a low $k$, the size of ${L}_{b_l}, \; l=1,\ldots,k$ is strongly reduced compared to the complete sample $S$, which may have an impact on the derived optimal parameter values. Nevertheless, nested cross-validation is commonly used in this context.
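The nested scheme described above can be sketched in a few lines of schematic R (our own sketch; `fitModel` and `errorRate` are hypothetical placeholder functions, and `CMA::tune` implements this machinery for real):

```r
## Schematic nested cross-validation: the inner loop tunes lambda on each
## learning set L_b only; the outer test set is never touched during tuning.
nestedCV <- function(X, y, lambdaGrid, outerFolds, innerFolds,
                     fitModel, errorRate) {
    outerErr <- numeric(length(outerFolds))
    for (b in seq_along(outerFolds)) {
        testIdx  <- outerFolds[[b]]
        learnIdx <- setdiff(seq_len(nrow(X)), testIdx)
        ## inner loop: CV error of each candidate lambda within L_b
        innerErr <- sapply(lambdaGrid, function(lambda) {
            mean(sapply(innerFolds(learnIdx), function(heldOut) {
                fit <- fitModel(X[setdiff(learnIdx, heldOut), ],
                                y[setdiff(learnIdx, heldOut)], lambda)
                errorRate(fit, X[heldOut, ], y[heldOut])
            }))
        })
        lambdaOpt <- lambdaGrid[which.min(innerErr)]
        ## refit on the full learning set with the tuned lambda
        fit <- fitModel(X[learnIdx, ], y[learnIdx], lambdaOpt)
        outerErr[b] <- errorRate(fit, X[testIdx, ], y[testIdx])
    }
    mean(outerErr)
}
```

The returned value is the outer cross-validated error of the *whole* procedure, tuning included, which is the quantity one should report.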
Considering the computational effort required for hyperparameter optimization and the small sample sizes, one may prefer class prediction methods that do not depend on many hyperparameters and/or behave robustly against changes of the hyperparameter values.

## Using logistic regression

We first generate the learning sets for CV and then tune the parameters of the method. We use the method of [Zhu and Hastie, 2004](http://dx.doi.org/10.1093/biostatistics/kxg046).

```{r logisticReg, dependson="setupALL"}
labs <- pData(fus)$mol.biol
fiveCVdat <- GenerateLearningsets(y = labs, method = "CV", fold = 5, strat = TRUE)
fiveCVdat
#?plrCMA

## tune the parameters
tuningstep <- CMA::tune(X = t(exprs(fus)), y = labs,
                        learningsets = fiveCVdat,
                        classifier = "plrCMA")
tuningstep
plot(tuningstep, iter = 3)
unlist(CMA::best(tuningstep))

## run the classifier
class_fiveCV <- classification(X = t(exprs(fus)), y = labs,
                               classifier = "plrCMA",
                               learningsets = fiveCVdat,
                               lambda = 0.0625)

## evaluate
av_logR <- evaluation(class_fiveCV, measure = "average probability",
                      scheme = "classwise", y = labs)
show(av_logR)
#boxplot(av_logR)
#summary(av_logR)
```

## Using PAM

```{r PAM, dependson="setupALL"}
library(pamr)
pamTrain <- pamr.train(list(x = exprs(fus), y = labs))
pamTrain
#pam.results <- pamr.cv(pamTrain, list(x = exprs(fus), y = labs))

# predict
labs_new = pamr.predict(pamTrain, exprs(fus), threshold = TRUE)
pred.err = sum(labs_new != labs) / length(labs)
pred.err

# class specific prediction error
#pred.err = sum(labs_new != labs & )/length(labs)
```

# Exploratory multivariate analysis

## Scatterplots

* **Question 7** Interpret the following code, whose result is shown in Figure 2. Modify it to depict the pairwise configurations with different colors for crab genders.

```{r pairs}
pairs(crabs[, -c(1:3)], col = ifelse(crabs$sp == "B", "blue", "orange"))
```

## Principal components analysis (PCA)

Principal components analysis transforms the multivariate data $X$ into a new coordinate system.
If the original variables are $X_1, \ldots, X_p$, then the variables in the new representation are denoted $PC_1, \ldots, PC_p$. These new variables have the properties that $PC_1$ is the linear combination of $X_1, \ldots, X_p$ having maximal variance, $PC_2$ is the variance-maximizing linear combination of the residuals of $X$ after projecting onto the hyperplane normal to $PC_1$, and so on. If most of the variation in $X_{n \times p}$ can be captured in a low-dimensional linear subspace of the space spanned by the columns of $X$, then the scatterplots of the first few principal components give a good representation of the structure in the data. One can show that the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out optimally.

Formally, we can compute the PCs using the singular value decomposition of the column-centered $X$, in which $X = UDV^t$, where $U_{n \times p}$ and $V_{p \times p}$ are orthonormal, and $D$ is a diagonal matrix of $p$ nonnegative singular values. The principal components transformation is $XV = UD$, and if $D$ is structured so that $D_{ii} \geq D_{jj}$ whenever $i < j$, then column $i$ of $XV$ is $PC_i$. Note also that $D_{ii} = \sqrt{n-1}\;\mbox{sd}(PC_i)$.

We use the function `prcomp` to compute the principal components. The resulting principal component scores, i.e. the coordinates of the samples in the space of the principal components, can then be plotted. For this we use the function `qplot` from `r CRANpkg("ggplot2")`. `r CRANpkg("ggplot2")` is a comprehensive plotting package for R; a very good tutorial is [here](http://jofrhwld.github.io/avml2012/). The function `qplot` is supposed to mimic the standard plotting function `plot` as closely as possible, but additional changes can be made, e.g. lattice--type graphics (splitting the plot by a factor of interest) can easily be generated. Additional elements are changed by using `+` and the corresponding commands.
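The SVD relations above are easy to check numerically. The following short verification (our own illustration, using the built-in `iris` measurements rather than the crabs data) confirms that the `prcomp` scores equal $UD$ up to column signs, and that $D_{ii} = \sqrt{n-1}\,\mathrm{sd}(PC_i)$:

```r
## Verify PC scores = U %*% D for a centered data matrix. The signs of the
## singular vectors are arbitrary, so we compare absolute values.
X  <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
pc <- prcomp(X)
sv <- svd(X)
max(abs(abs(pc$x) - abs(sv$u %*% diag(sv$d))))      # numerically zero

## ... and D_ii = sqrt(n - 1) * sd(PC_i):
all.equal(sv$d, sqrt(nrow(X) - 1) * apply(pc$x, 2, sd))
```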
In our PCA plot, we change the plotting colors used and the axis labels.

```{r PCA_computations, dependson="log_reg_mod"}
PCA <- prcomp(dplyr::select(crabs, FL:BD))
percentVar <- round(100 * PCA$sdev^2 / sum(PCA$sdev^2), 1)
```
```{r PCA_plots, fig.cap="PCA plot", fig.width=7.5, fig.height=6, dependson="PCA_computations"}
dataGG = data.frame(PC1 = PCA$x[,1], PC2 = PCA$x[,2],
                    PC3 = PCA$x[,3], PC4 = PCA$x[,4],
                    species = crabs$sp, sex = crabs$sex)

(qplot(PC1, PC2, data = dataGG, color = species, shape = sex,
       main = "PC1 vs PC2", size = I(3))
 + labs(x = paste0("PC1, VarExp:", round(percentVar[1], 4)),
        y = paste0("PC2, VarExp:", round(percentVar[2], 4)))
 + scale_colour_brewer(type = "qual", palette = 2))

(qplot(PC2, PC3, data = dataGG, color = species, shape = sex,
       main = "PC2 vs PC3", size = I(3))
 + labs(x = paste0("PC2, VarExp:", round(percentVar[2], 4)),
        y = paste0("PC3, VarExp:", round(percentVar[3], 4)))
 + scale_colour_brewer(type = "qual", palette = 2))

# (qplot(PC3, PC4, data = dataGG, color = species, shape = sex,
#        main = "PC3 vs PC4", size = I(3))
#  + labs(x = paste0("PC3, VarExp:", round(percentVar[3], 4)),
#         y = paste0("PC4, VarExp:", round(percentVar[4], 4)))
#  + scale_colour_brewer(type = "qual", palette = 2))
```

## Clustering

A familiar technique for displaying multivariate data in high--throughput biology is the heatmap. In this display, samples are clustered as columns, and features as rows. The clustering technique used by default is R's `hclust`. This procedure builds a clustering tree for the data as follows. Distances are computed between each pair of feature vectors for all $N$ observations. The closest pair is joined and regarded as a new object, so there are $N-1$ objects (clusters) at this point. This process is repeated until one cluster is formed; the clustering tree shows the process by which clusters are created via this agglomeration process.

The most crucial choice when applying this method is the initial choice of the distance metric between the features. Once clusters are being formed, there are several ways to measure distances between them, based on the initial between-feature distances.
Single-linkage clustering takes the distance between two clusters to be the shortest distance between any two members of the different clusters; average linkage averages all the distances between members; complete linkage uses the maximum distance between any two members of the different clusters. Other methods are also available in `hclust`.

We visualize cluster trees for samples and features in a heatmap, using the function `heatmap.2` from the `r CRANpkg("gplots")` package.

```{r heatmap, fig.cap="clustering heatmap", fig.height=20, fig.width=20, eval=TRUE}
#distfun <- function(x){dist(x, method = "euclidean")}
hmcol <- colorRampPalette(brewer.pal(n = 3, name = "PuOr"))(255)
heatmap.2(as.matrix(dplyr::select(crabs, FL:BD)), trace = "none",
          col = rev(hmcol), margin = c(13, 13),
          main = "Euclidean distance heatmap")
```

The heatmap shows that the features CL and CW seem to discriminate best between the two main clusters, while RW does not show obvious differences between the samples.

# Application to high--dimensional biological data

We will now use two data sets --- one for supervised learning and one for unsupervised learning.

For **unsupervised learning**, we will use the NCI60 cancer cell microarray data, which contains 6830 gene expression measurements on 64 cancer cell lines. We dropped the subtypes which contain only 1 sample (there were about 6 of them) and created a SummarizedExperiment object. The *SummarizedExperiment* class is a matrix-like container where rows represent ranges of interest (as a GRanges or GRangesList-class) and columns represent samples (with sample data summarized as a DataFrame-class). A SummarizedExperiment contains one or more assays, each represented by a matrix-like object of numeric or other mode.
![Summarized Experiment](SummarizedExperiment.png)

This object has been made available to you; you can simply read it in using

```{r message=FALSE}
sefile <- system.file("extdata", "NCI60.Rda", package = "LearnBioconductor")
load(sefile)
nci60data <- t(assay(NCI60))
ncilabels <- colData(NCI60)
```

The gene expression data is stored in the "assay", whereas the labels are stored in the colData.

For **supervised learning**, we will use cervical count data from the Bioconductor package `r Biocpkg("MLSeq")`. This data set contains expressions of 714 miRNAs of human samples. There are 29 tumor and 29 non-tumor cervical samples. For learning purposes, we can treat these as two separate groups and run various classification algorithms.

```{r message=FALSE}
filepath = system.file("extdata/cervical.txt", package = "MLSeq")
cervical = read.table(filepath, header = TRUE)
```

## Unsupervised Learning

Let us use PCA to visualize the NCI60 data set. We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, and plot the first few principal components in order to visualize the data, just as we did for the crabs data set. Note that we recompute the fractions of explained variance for this data set.

```{r PCA_NCI}
pcaNCI <- prcomp(nci60data, scale = TRUE)
percentVarNCI <- round(100 * pcaNCI$sdev^2 / sum(pcaNCI$sdev^2), 1)

dataGG = data.frame(PC1 = pcaNCI$x[,1], PC2 = pcaNCI$x[,2],
                    PC3 = pcaNCI$x[,3], PC4 = pcaNCI$x[,4],
                    cell.line = colData(NCI60)$info)

(qplot(PC1, PC2, data = dataGG, color = cell.line,
       main = "PC1 vs PC2", size = I(3))
 + labs(x = paste0("PC1, VarExp:", percentVarNCI[1]),
        y = paste0("PC2, VarExp:", percentVarNCI[2]))
 + scale_colour_brewer(type = "qual", palette = 6))

(qplot(PC2, PC3, data = dataGG, color = cell.line,
       main = "PC2 vs PC3", size = I(3))
 + labs(x = paste0("PC2, VarExp:", percentVarNCI[2]),
        y = paste0("PC3, VarExp:", percentVarNCI[3]))
 + scale_colour_brewer(type = "qual", palette = 6))
```

Overall, we can conclude that the cell lines corresponding to a single cancer type tend to have similar values on the first few principal component score vectors.
This indicates that cell lines from the same cancer type tend to have similar gene expression levels.

## Clustering observations

In hierarchical clustering, we start by defining some dissimilarity measure between each pair of observations (like the Euclidean distance). At the bottom of the dendrogram, each of the $n$ observations is considered as its own cluster; the two clusters that are most similar to each other are fused together, so that we then have $n-1$ clusters. Next, the two most similar clusters are fused, so that there are $n-2$ clusters. This algorithm proceeds iteratively until all samples belong to one single cluster, and the dendrogram is complete.

We define the dissimilarity between two clusters, or two groups of observations, using the linkage. There are four common types of linkage --- "average", "complete", "single", and "centroid". Assuming we have two clusters A and B, then:

1. _Complete_ refers to recording the *largest* of all pairwise dissimilarities between observations in cluster A and observations in cluster B.
2. _Single_ refers to recording the *smallest* of all pairwise dissimilarities between observations in cluster A and observations in cluster B.
3. _Average_ refers to recording the *average* of all pairwise dissimilarities between observations in cluster A and observations in cluster B.
4. _Centroid_ refers to the dissimilarity between the centroid for cluster A and the centroid for cluster B.

Usually, the dissimilarity measure is the Euclidean distance. Let's cluster the observations using complete linkage with the Euclidean distance as the dissimilarity measure.
```{r NCIclusterAnalysis, fig.width=12, message=FALSE}
labs <- as.character(colData(NCI60)$info)

cellColor <- function(vec) {
    uvec <- unique(vec)
    cols = rainbow(length(uvec))
    colvec <- cols[as.numeric(as.factor(vec))]
    list(colvec = colvec, cols = cols, labels = uvec)
}

colres <- cellColor(labs)
sdata <- scale(nci60data)
d <- dist(sdata)
comp_clust <- hclust(d)
dend <- as.dendrogram(comp_clust)
leaves <- labs[order.dendrogram(dend)]
labels_colors(dend, labels = TRUE) <- cellColor(leaves)$colvec
labels(dend) <- leaves
plot(dend, main = "Clustering using Complete Linkage")
```

### Exercise: Different linkage methods

1. Perform hierarchical clustering using average and single linkage.
2. Interpret the difference in the dendrograms.
3. Can you observe some patterns from these dendrograms? (hint: use cutree)

**Solutions:**

1. The plots can be made with the following code:

```{r fig.width=12, fig.height=6}
plot(hclust(d, method = "average"), labels = labs,
     main = "Clustering using Average Linkage", xlab = "", ylab = "")
plot(hclust(d, method = "single"), labels = labs,
     main = "Clustering using Single Linkage", xlab = "", ylab = "")
```

2. Briefly, complete linkage provides maximal inter-cluster dissimilarity, single linkage results in minimal inter-cluster dissimilarity, and average linkage results in mean inter-cluster dissimilarity. We see that the results are affected by the choice of the linkage. Single linkage tends to yield trailing clusters, while complete and average linkage lead to more balanced and attractive clusters.

3. For our example, we see that the cell lines within a single cancer cell type do not cluster together. But if we consider complete linkage and cut the tree into four clusters (we tried different cuts to observe patterns), we observe some clear patterns: e.g., the leukemia cell lines fall in cluster 2, while the breast cancer cell lines spread out.
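The pattern hunting in solution 3 can be made explicit with `cutree`. The following self-contained sketch (our own toy data standing in for the NCI60 matrix) shows the call chain; for the real data one would use `cutree(comp_clust, k = 4)` with the objects from the clustering chunk above:

```r
## Toy example: cut a complete-linkage tree into clusters and cross-tabulate
## cluster memberships against the known labels (as hinted in the exercise).
set.seed(2)
toy     <- rbind(matrix(rnorm(40, mean = 0), nrow = 10),
                 matrix(rnorm(40, mean = 5), nrow = 10))   # two clear groups
toyLabs <- rep(c("typeA", "typeB"), each = 10)

hc <- hclust(dist(scale(toy)))   # same call chain as for the NCI60 data
cl <- cutree(hc, k = 2)          # for NCI60: cutree(comp_clust, k = 4)
table(cl, toyLabs)               # which labels fall into which cluster
toyLabs[cl == 2]                 # inspect the members of one cluster
```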
## Supervised Learning

In supervised learning, along with the features $X_1, X_2, \ldots, X_p$, we also have a response $Y$ measured on the same $n$ observations. The goal is then to predict $Y$ using $X_1, X_2, \ldots, X_p$ for new observations.

### A simple example using knn

For the cervical data, we know that the first 29 samples are non-tumor samples, whereas the last 29 are tumor samples. We will code these as 0 and 1, respectively.

```{r}
class = data.frame(condition = factor(rep(c(0, 1), c(29, 29))))
```

Let's look at one of the most basic supervised learning techniques, the **k-Nearest Neighbors** classifier, and see what goes into building a simple model with it. For the sake of simplicity, we will use only 2 predictors (so that we can represent the data in two-dimensional space).

```{r}
data <- t(cervical)
data <- data[, 1:2]
df <- cbind(data, class)
colnames(df) <- c("x1", "x2", "y")
rownames(df) <- NULL
head(df)
```

This is how the data looks:

```{r fig.width=12}
plot(df[,"x1"], df[,"x2"], xlab = "x1", ylab = "x2",
     main = "data representation for knn",
     col = ifelse(as.character(df[,"y"]) == 1, "red", "blue"))
```

Given an observation $x_0$ and a positive integer $K$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$. It estimates the conditional probability of class $j$ as the fraction of points in $N_0$ belonging to class $j$, and applies the Bayes rule to classify the test observation to the class with the largest probability. More concretely, if $K=3$ and there are 2 observations belonging to class 1 and 1 observation belonging to class 2, then the test observation is assigned to class 1.

![knn_figure](our_figures/knn.png)

For all supervised experiments it's a good idea to hold out some data as _test data_, build a model on the remaining _training data_, and then evaluate the built model on the held-out data to gain confidence in our model. We will randomly sample 20% of our data and use that as the test set.
The remaining 80% of the data will be used as the training data.

```{r}
set.seed(9)
nTest = ceiling(ncol(cervical) * 0.2)
ind = sample(ncol(cervical), nTest, FALSE)

cervical.train = cervical[, -ind]
cervical.train = as.matrix(cervical.train + 1)
classtr = data.frame(condition = class[-ind, ])

cervical.test = cervical[, ind]
cervical.test = as.matrix(cervical.test + 1)
classts = data.frame(condition = class[ind, ])
```

The training set error is the proportion of mistakes made if we apply our model to the training data, and the test set error is the proportion of mistakes made when we apply our model to the test data. For different numbers of neighbors, let us calculate the training error and the test error using KNN.

```{r message=FALSE}
newknn <- function(testset, trainset, testclass, trainclass, k) {
    pred.train <- knn.cv(trainset, trainclass, k = k)
    pred.test <- knn(trainset, testset, trainclass, k = k)

    test_fit <- length(which(mapply(identical, as.character(pred.test),
                                    testclass) == FALSE)) / length(testclass)
    train_fit <- length(which(mapply(identical, as.character(pred.train),
                                     trainclass) == FALSE)) / length(trainclass)

    c(train_fit = train_fit, test_fit = test_fit)
}

trainset <- t(cervical.train)
testset <- t(cervical.test)
testclass <- t(classts)
trainclass <- t(classtr)

klist <- 1:15
ans <- lapply(klist, function(x)
    newknn(testset, trainset, testclass, trainclass, k = x))
resdf <- t(as.data.frame(ans))
rownames(resdf) <- NULL

plot(klist, resdf[,"train_fit"], col = "blue", type = "b",
     ylim = c(range(resdf)),
     main = "k Nearest Neighbors for Cervical Data",
     xlab = "No of neighbors", ylab = "Training and Test Error")
points(klist, resdf[,"test_fit"], col = "red", type = "b")
legend("bottomright", legend = c("Training error", "Test error"),
       text.col = c("blue", "red"), bty = "n")
```

Another important concept in machine learning is **cross validation**. Since samples are often scarce, it is often not possible to set aside a validation set and then use it to assess the performance of our prediction model.
So we use cross validation to train a better model. We start by dividing the training data randomly into $n$ equal parts. The learning method is fit to $n-1$ parts of the data, and the prediction error is computed on the remaining part. This is done in turn for each of the $n$ parts, and finally the $n$ prediction error estimates are averaged.

For example, K-fold cross validation where K=5:

![cv_figure](our_figures/cross_validation.png)

As you can see, computation for even this very simple learner can become quite involved.

### Fast classification using Bioconductor

`r Biocpkg("MLSeq")` aims to make computation less complicated for the user and allows one to learn a model using various classifiers with one single function. The main function of this package is `classify`, which requires data in the form of a DESeqDataSet instance. The DESeqDataSet is a subclass of SummarizedExperiment, used to store the input values, intermediate calculations and results of an analysis of differential expression. So let's create a DESeqDataSet object for both the training and the test set, and run DESeq on them.

```{r}
cervical.trainS4 = DESeqDataSetFromMatrix(countData = cervical.train,
                                          colData = classtr, formula(~condition))
cervical.trainS4 = DESeq(cervical.trainS4, fitType = "local")

cervical.testS4 = DESeqDataSetFromMatrix(countData = cervical.test,
                                         colData = classts, formula(~condition))
cervical.testS4 = DESeq(cervical.testS4, fitType = "local")
```

We classify using support vector machines:

```{r}
svm = classify(data = cervical.trainS4, method = "svm",
               normalize = "deseq", deseqTransform = "vst",
               cv = 5, rpt = 3, ref = "1")
svm
```

This returns an object of class 'MLSeq', and we observe that it successfully fitted a model with 97.8% accuracy. We can access the slots of this S4 object by:

```{r}
getSlots("MLSeq")
```

And also ask about the model trained:
```{r}
trained(svm)
```

We can predict the class labels of our test data using "predict":

```{r}
pred.svm = predictClassify(svm, cervical.testS4)
table(pred.svm, relevel(cervical.testS4$condition, 2))
```

The other classification methods available are 'randomforest', 'cart' and 'bagsvm'.

### Exercise: Train the same training data and test data using randomForest.

**Solutions:**

```{r}
rf = classify(data = cervical.trainS4, method = "randomforest",
              normalize = "deseq", deseqTransform = "vst",
              cv = 5, rpt = 3, ref = "1")
trained(rf)

pred.rf = predictClassify(rf, cervical.testS4)
table(pred.rf, relevel(cervical.testS4$condition, 2))
```

# Session information

As the last part of this document, we call the function *sessionInfo*, which reports the version numbers of R and all the packages used in this session. It is good practice to always keep such a record, as it will help to trace down what has happened in case an R script ceases to work because the functions have been changed in a newer version of a package. The session information should also **always** be included in any emails to the [Bioconductor support site](https://support.bioconductor.org) along with all code used in the analysis.
```{r} sessionInfo() ```