
Single Molecule sequencing

One day Helicos would make the $1000 genome possible

Clustering and visualization using R and Dendroscope

Hierarchical clustering is a technique for grouping samples/data points into categories and subcategories based on a similarity measure. Being the powerful statistical package it is, R has several routines for doing hierarchical clustering.
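As a minimal sketch of what "categories and subcategories" means in practice, the following uses hclust from the stats package on a small made-up matrix (the data and the names toy, hc, groups2 and groups3 are only for illustration). Cutting the same tree at two levels shows that the finer groups nest inside the coarser ones.

```r
# Toy data: six samples (rows) measured on two variables; values are made up.
toy <- matrix(c(1.0, 1.1,
                1.2, 0.9,
                5.0, 5.2,
                5.1, 4.9,
                9.0, 9.1,
                9.2, 8.8),
              ncol = 2, byrow = TRUE,
              dimnames = list(paste0("s", 1:6), c("x", "y")))

# Pairwise Euclidean distances between samples, then agglomerative clustering.
hc <- hclust(dist(toy), method = "average")

# Cut the same tree into 2 groups (categories) and 3 groups (subcategories).
groups2 <- cutree(hc, k = 2)
groups3 <- cutree(hc, k = 3)

# Cross-tabulating shows each of the 3 groups sits inside one of the 2 groups.
print(table(groups2, groups3))
```

The linkage method ("average" here) is just one choice; other methods can produce a different tree from the same distances.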

Different libraries have different clustering functions

Package   Function
ctc       xcluster
amap      hcluster
amap      hclusterpar
stats     hclust
cluster   agnes

Of these, xcluster is reported to be the fastest.
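Speed depends a lot on the size of the data and on the machine, so it is worth timing things yourself. Here is a rough sketch using system.time around hclust, chosen only because it ships with R. The matrix X is random made-up data, and the same wrapper could go around any other function from the table if its package is installed.

```r
set.seed(1)                               # reproducible made-up data
X <- matrix(rnorm(300 * 20), nrow = 300)  # 300 samples, 20 variables

# Time the distance computation and the clustering together.
timing <- system.time({
  hc_timed <- hclust(dist(X), method = "complete")
})

# Elapsed wall-clock time in seconds; the value will differ between machines.
print(timing["elapsed"])
```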

I was working with a breast cancer cell line microarray data set in .csv format. First I read the CSV file into R as a matrix using

A <- as.matrix(read.csv("breast_cancer.csv", header=FALSE))
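To see what this step produces without the real file, here is a small sketch that writes a two-line headerless CSV to a temporary file and reads it back the same way. The file contents and the names csv_path and A_demo are made up for illustration.

```r
# Write a tiny headerless CSV to a temporary file (stand-in for breast_cancer.csv).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("1.5,2.0,3.1",
             "0.7,1.9,2.8"), csv_path)

# header = FALSE treats the first line as data rather than as column names.
A_demo <- as.matrix(read.csv(csv_path, header = FALSE))

print(dim(A_demo))         # number of rows and columns read
print(is.numeric(A_demo))  # FALSE would mean some cells were not numbers
```

Checking is.numeric right after reading is a cheap way to catch stray text in the file before passing the matrix to dist.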

Make sure there are no missing values in the matrix. I used an algorithm to replace the missing values. R has a few built-in ways to replace missing values with zeros, the mean, the median, or linear interpolation, but they are not recommended for microarray data.
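Before replacing anything, it helps to know whether and where values are missing. The sketch below only locates NA entries and does not impute them, since the replacement algorithm is not named here. The matrix M is made up for illustration.

```r
# Made-up matrix with two missing values, standing in for the expression data.
M <- matrix(c(1.2, NA,  3.4,
              2.1, 2.2, NA,
              0.5, 0.6, 0.7),
            nrow = 3, byrow = TRUE)

has_missing     <- anyNA(M)                         # TRUE if any cell is NA
missing_at      <- which(is.na(M), arr.ind = TRUE)  # row and column of each NA
missing_per_row <- rowSums(is.na(M))                # number of NAs in each row

print(has_missing)
print(missing_at)
print(missing_per_row)
```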

Then I clustered it using Xcluster

C <- xcluster(dist(A))

The ctc library must be installed and loaded (with library(ctc)) before any of this, or you will get an error:

Error: could not find function "xcluster"
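One way to avoid that error is to check for the package up front. This is a minimal sketch, assuming ctc is obtained through Bioconductor's BiocManager installer. The variable names pkg and ctc_ready are just for illustration.

```r
pkg <- "ctc"  # package that provides xcluster and hc2Newick

if (requireNamespace(pkg, quietly = TRUE)) {
  # Installed: attach it so its functions can be called by name.
  library(pkg, character.only = TRUE)
  ctc_ready <- TRUE
} else {
  # Not installed: print a hint instead of failing later with
  # "could not find function".
  message("Install with: BiocManager::install(\"ctc\")")
  ctc_ready <- FALSE
}

print(ctc_ready)
```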

Once the clustering is done, we can plot the results of our cluster analysis using this command:

plot(C)

If your dataset is small, this might work well for you, but for most genomics applications, you’ll get an unreadable, tree-shaped fuzzball.
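The effect is easy to reproduce with random data. This sketch, using made-up matrices and hclust from the stats package rather than xcluster, draws a 10-sample tree and a 500-sample tree into a temporary PDF so the two can be compared side by side.

```r
set.seed(42)                                 # reproducible made-up data
small <- matrix(rnorm(10 * 5),  nrow = 10)   # 10 samples
large <- matrix(rnorm(500 * 5), nrow = 500)  # 500 samples

hc_small <- hclust(dist(small))
hc_large <- hclust(dist(large))

# Draw both trees into a temporary PDF, one page each.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path, width = 10, height = 6)
plot(hc_small, main = "10 samples")                   # leaf labels are readable
plot(hc_large, labels = FALSE, main = "500 samples")  # too dense to label
dev.off()

print(pdf_path)  # open this file to compare the two pages
```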

The solution is to use the ctc package from Bioconductor, which can export the cluster object in the Newick file format. The file can then be imported into more powerful tree-drawing programs. This is done like so:

write.table(hc2Newick(C), file="C:/breast_cancer.txt", row.names=FALSE, col.names=FALSE)

You now have a file in Newick format, but R puts quotes around the output because write.table quotes character strings by default. Either add quote=FALSE to the write.table call, or open the file in Notepad and remove the quotes, and it should be ready to use.
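The quoting behaviour can be checked without ctc installed. In this sketch the string newick is a made-up stand-in for what hc2Newick returns, and the two temporary files show the output with and without the quote=FALSE argument of write.table.

```r
# Stand-in for the string returned by hc2Newick(); the tree itself is made up.
newick <- "((a:1,b:1):4,c:5);"

quoted_path   <- tempfile(fileext = ".txt")
unquoted_path <- tempfile(fileext = ".txt")

# Default behaviour: character values are written inside double quotes.
write.table(newick, file = quoted_path,
            row.names = FALSE, col.names = FALSE)

# quote = FALSE writes the string as-is, so no manual cleanup is needed.
write.table(newick, file = unquoted_path,
            row.names = FALSE, col.names = FALSE, quote = FALSE)

print(readLines(quoted_path))
print(readLines(unquoted_path))
```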

To get a better, more readable plot, download “Dendroscope” from the University of Tübingen. Dendroscope will let you import the Newick file you created and gives you extensive plotting options, including a wicked circular cladogram.

There are lots of options for computing the clustering, and they may give very different results, so proceed with caution. In general, though, hierarchical clustering can be a useful tool in lots of data analysis situations.

I did this experiment as part of my project, and parts of this tutorial were adapted from the Getting Genetics Done blog.