Project

 

Alternative 1 - Ensembl

The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online. The Ensembl Perl API makes it easy to write Perl programs that access these data.

You can find installation guidelines here. Note that instead of setting system paths, you can add the module directory in your program (e.g. "use lib 'c:\Ensemble\ensembl\modules';").

You can find a tutorial here. Go through it and try out the code!

This file contains a number of genome coordinates in the chicken genome (chr, start position, stop position and strand). Write a program that uses the Ensembl Perl API to download these sequences.
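As a starting point, the task above could be sketched as follows. This is a minimal sketch, not a reference solution. It assumes the coordinate file is whitespace-separated with the columns chr, start, stop and strand, that the strand is written as +/- or 1/-1, and that the public Ensembl database server (ensembldb.ensembl.org, user anonymous) is used. The subroutine names parse_coordinate_line and fetch_sequences are made up for illustration; check the real file layout and the tutorial before relying on any of this.

```perl
use strict;
use warnings;

# Parse one line of the coordinate file into a hash reference.
# ASSUMPTION: whitespace-separated columns chr, start, stop, strand,
# with strand written as +/- or 1/-1. Adjust to the real file layout.
sub parse_coordinate_line {
    my ($line) = @_;
    $line =~ s/^\s+|\s+$//g;
    return if $line eq '' or $line =~ /^#/;          # skip blanks and comments
    my ($chr, $start, $stop, $strand) = split /\s+/, $line;
    return unless defined $strand;                   # incomplete line
    return unless $start =~ /^\d+$/ && $stop =~ /^\d+$/;   # e.g. a header line
    $chr =~ s/^chr//i;                               # ASSUMPTION: Ensembl names lack "chr"
    $strand = ($strand eq '-' or $strand eq '-1') ? -1 : 1;
    return { chr => $chr, start => $start, stop => $stop, strand => $strand };
}

# Fetch every region listed in $file and print it in FASTA format.
sub fetch_sequences {
    my ($file) = @_;

    # Loaded here so that the parser above also works without the API installed.
    require Bio::EnsEMBL::Registry;
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
        -host => 'ensembldb.ensembl.org',
        -user => 'anonymous',
    );
    my $slice_adaptor = $registry->get_adaptor('gallus_gallus', 'Core', 'Slice');

    open my $in, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in>) {
        my $c = parse_coordinate_line($line) or next;
        my $slice = $slice_adaptor->fetch_by_region(
            'chromosome', $c->{chr}, $c->{start}, $c->{stop}, $c->{strand});
        unless ($slice) {
            warn "No slice found for line: $line";
            next;
        }
        printf ">%s:%d-%d:%d\n%s\n",
            @{$c}{qw(chr start stop strand)}, $slice->seq;
    }
    close $in;
}

# Only contact the database when a file name is given on the command line.
fetch_sequences($ARGV[0]) if @ARGV;
```

Run it as "perl fetch.pl coordinates.txt > sequences.fa", where both file names are placeholders.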

Try out the other tutorials and explore Ensembl!

 

Alternative 2 - Differentially expressed genes

One of the most common bioinformatics problems is that of finding statistically significant differentially expressed genes. In this assignment you are given cDNA microarray data from Populus trees. The expression profiles of a large number of genes have been measured in three wild-type trees and in six trees in which a gene important for growth has been knocked down. The data is available here.

The first column in the data contains gene probes from the microarray. The same gene can have many probes on the array, and thus your first task is to merge the expression data from probes coming from the same gene (e.g. by averaging the expression profiles). Note that the data contains missing values. A mapping between probes and genes can be found in this file.
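The averaging step could look something like the sketch below. It assumes you have already read the data into a hash from probe id to an array of expression values, with undef marking a missing value, and the mapping into a hash from probe id to gene id. The subroutine name merge_probes_by_gene and both data structures are illustrative choices, not part of the assignment. Each column is averaged over the probes that actually have a value there.

```perl
use strict;
use warnings;

# Average the expression profiles of all probes that map to the same gene.
#   $probe_expr : { probe_id => [v1, v2, ...] }, undef marks a missing value
#   $probe2gene : { probe_id => gene_id }
# Returns { gene_id => [mean1, mean2, ...] }; a column stays undef when
# every probe of the gene is missing in that column.
sub merge_probes_by_gene {
    my ($probe_expr, $probe2gene) = @_;
    my (%sum, %count, %ncol);

    for my $probe (keys %$probe_expr) {
        my $gene = $probe2gene->{$probe};
        next unless defined $gene;                 # probe without a gene mapping
        my $values = $probe_expr->{$probe};
        $ncol{$gene} = @$values
            if !defined $ncol{$gene} or @$values > $ncol{$gene};
        for my $i (0 .. $#$values) {
            next unless defined $values->[$i];     # skip missing values
            $sum{$gene}[$i]   += $values->[$i];
            $count{$gene}[$i] += 1;
        }
    }

    my %gene_expr;
    for my $gene (keys %ncol) {
        my @profile;
        for my $i (0 .. $ncol{$gene} - 1) {
            my $n = $count{$gene}[$i];
            push @profile, $n ? $sum{$gene}[$i] / $n : undef;
        }
        $gene_expr{$gene} = \@profile;
    }
    return \%gene_expr;
}
```

How missing values are written in the data file (empty field, NA, or something else) is not stated here, so converting them to undef is left to the code that reads the file.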

Use the t-test to identify genes that are significantly differentially expressed between wild-type and knock-down trees. CPAN (www.cpan.org) hosts a large number of Perl modules, e.g. TTest. NB: you don't really need to install the module. Just unzip the file and put the TTest.pm file in the same folder as your Perl program. Provide two lists of genes: genes up-regulated in knock-down compared to wild type, and genes down-regulated in knock-down compared to wild type.
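To see what the module computes, and how the sign of the statistic separates the two lists, here is a hand-written sketch of the two-sample t statistic. It assumes the classical Student's t-test with pooled variance; whether the CPAN module uses this or an unequal-variance variant is something to check in its documentation. The subroutine names are illustrative. Turning t into a p-value requires the t distribution, which is the part the module provides.

```perl
use strict;
use warnings;

sub mean {
    my @x = @_;
    my $sum = 0;
    $sum += $_ for @x;
    return $sum / @x;
}

# Sample variance (divides by n - 1).
sub variance {
    my @x = @_;
    my $m = mean(@x);
    my $ss = 0;
    $ss += ($_ - $m) ** 2 for @x;
    return $ss / (@x - 1);
}

# Two-sample t statistic with pooled variance (ASSUMPTION: equal variances).
# Takes two array references; undef entries are treated as missing values.
# Returns ($t, $df), or an empty list when the test cannot be computed.
# $t > 0 means higher expression in the knock-down group.
sub t_statistic {
    my ($wild_type, $knock_down) = @_;
    my @wt = grep { defined } @$wild_type;
    my @kd = grep { defined } @$knock_down;
    return if @wt < 2 or @kd < 2;

    my ($n_wt, $n_kd) = (scalar @wt, scalar @kd);
    my $df = $n_wt + $n_kd - 2;
    my $pooled = (($n_wt - 1) * variance(@wt) + ($n_kd - 1) * variance(@kd)) / $df;
    return if $pooled == 0;                        # no variation at all

    my $t = (mean(@kd) - mean(@wt)) / sqrt($pooled * (1 / $n_wt + 1 / $n_kd));
    return ($t, $df);
}
```

A significant gene with t > 0 would go in the up-regulated list, and one with t < 0 in the down-regulated list.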

A problem with conducting many statistical tests is that a significance threshold of 0.05 is too loose and will result in many genes being significant just by chance. Thus we need to correct for the fact that we do so many tests: multiple hypothesis testing. Implement two correction methods and compare them: Bonferroni and FDR.
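The two corrections can be sketched as adjusted p-values, as below. "FDR" is taken here to mean the Benjamini-Hochberg procedure, which is an assumption since the method is not named in this text. Both subroutines take a list of raw p-values and return the adjusted values in the same order; a gene is then called significant when its adjusted value is below the chosen threshold.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
sub bonferroni {
    my @p = @_;
    my $m = @p;
    my @adjusted;
    for my $p (@p) {
        push @adjusted, min(1, $p * $m);
    }
    return @adjusted;
}

# Benjamini-Hochberg (ASSUMPTION: this is the intended FDR method).
# The p-value with rank r (1 = smallest) is scaled by m / r, and a running
# minimum taken from the largest p-value downwards keeps the adjusted
# values monotone.
sub benjamini_hochberg {
    my @p = @_;
    my $m = @p;
    my @order = sort { $p[$b] <=> $p[$a] } 0 .. $#p;   # indices, largest p first
    my @adjusted;
    my $running_min = 1;
    my $rank = $m;
    for my $i (@order) {
        $running_min = min($running_min, $p[$i] * $m / $rank);
        $adjusted[$i] = $running_min;
        $rank--;
    }
    return @adjusted;
}
```

Comparing how many genes pass 0.05 under each correction shows the expected pattern: Bonferroni never yields a smaller adjusted p-value than Benjamini-Hochberg for the same gene.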

Find out whether there are any gene functions that are overrepresented among the differentially expressed genes. Use Gene Ontology function annotations from this file and the hypergeometric distribution, or use the GO-TermFinder module in CPAN.
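If you go the hypergeometric route, the over-representation p-value for one GO term could be computed roughly as follows. The subroutine and parameter names are illustrative: N is the number of genes on the array with any annotation, K the number annotated with the term, n the number of differentially expressed genes, and k the number of those annotated with the term. The sketch works with logarithms of factorials to avoid overflow for large N.

```perl
use strict;
use warnings;

# Cache of log(n!) values; $log_fact[n] = log(n!).
my @log_fact = (0);

sub log_factorial {
    my ($n) = @_;
    push @log_fact, $log_fact[-1] + log(scalar @log_fact) while @log_fact <= $n;
    return $log_fact[$n];
}

sub log_choose {
    my ($n, $k) = @_;
    return log_factorial($n) - log_factorial($k) - log_factorial($n - $k);
}

# P(X >= k) when drawing n genes without replacement from N genes,
# K of which carry the annotation.
sub hypergeom_upper_tail {
    my ($N, $K, $n, $k) = @_;
    my $max = $K < $n ? $K : $n;            # largest possible overlap
    my $min = $n - ($N - $K);               # smallest possible overlap
    $min = 0 if $min < 0;
    $k = $min if $k < $min;

    my $log_total = log_choose($N, $n);
    my $p = 0;
    for my $i ($k .. $max) {
        $p += exp(log_choose($K, $i) + log_choose($N - $K, $n - $i) - $log_total);
    }
    return $p > 1 ? 1 : $p;                 # guard against rounding above 1
}
```

Since one test is run per GO term, these p-values need the same multiple-testing correction as the t-tests.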

 

Alternative 3 - Motif finding

Implement Gibbs sampling for the motif finding problem (see pseudo-code in Lecture 4) and try your implementation on this data set of yeast promoters. According to the SCPD database these genes are regulated by the cell cycle related transcription factor MCM1 through a 10mer with consensus sequence CCNNNWWRGG.
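To check whether the motif your sampler finds agrees with the known consensus, it helps to be able to scan a promoter for CCNNNWWRGG directly. The sketch below translates an IUPAC consensus into a regular expression (N is any base, W is A or T, R is A or G) and reports all matches, including overlapping ones. The subroutine names are illustrative, and only the given strand is scanned; the reverse complement is not handled here.

```perl
use strict;
use warnings;

# Standard IUPAC nucleotide codes and the bases they stand for.
my %IUPAC = (
    A => 'A', C => 'C', G => 'G', T => 'T',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Turn a consensus such as CCNNNWWRGG into a compiled regular expression.
sub consensus_to_regex {
    my ($consensus) = @_;
    my $pattern = '';
    for my $code (split //, uc $consensus) {
        die "Unknown IUPAC code: $code" unless exists $IUPAC{$code};
        $pattern .= $IUPAC{$code};
    }
    return qr/$pattern/i;
}

# Return a list of [position, matched_sequence] pairs (0-based positions).
# The lookahead makes overlapping occurrences visible as well.
sub find_sites {
    my ($consensus, $seq) = @_;
    my $re = consensus_to_regex($consensus);
    my @sites;
    while ($seq =~ /(?=($re))/g) {
        push @sites, [ $-[0], $1 ];
    }
    return @sites;
}
```

For example, find_sites('CCNNNWWRGG', $promoter) returns an empty list for a promoter without a site on the scanned strand.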

Use a microarray data set to find support for your motif, i.e. do genes with the motif in their promoter exhibit more similar expression than random sets of genes? Use e.g. the CPAN Spearman correlation module to measure expression similarity.
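For reference, the Spearman correlation is the Pearson correlation of the ranks, with tied values sharing their average rank. A hand-written sketch is given below in case you want to see what the CPAN module does or cross-check its output. The subroutine names are illustrative, and the two profiles are assumed to be equally long with missing values already removed.

```perl
use strict;
use warnings;

# Ranks starting at 1; tied values all receive the average of their ranks.
sub ranks {
    my @x = @_;
    my @order = sort { $x[$a] <=> $x[$b] } 0 .. $#x;
    my @rank;
    my $i = 0;
    while ($i < @order) {
        my $j = $i;
        $j++ while $j < $#order && $x[ $order[$j + 1] ] == $x[ $order[$i] ];
        my $average = ($i + $j) / 2 + 1;
        $rank[ $order[$_] ] = $average for $i .. $j;
        $i = $j + 1;
    }
    return @rank;
}

# Spearman correlation of two profiles given as array references.
# Returns undef (empty list in list context) if either profile is constant.
sub spearman {
    my ($x, $y) = @_;
    die "Profiles must have the same length" unless @$x == @$y;
    my @rx = ranks(@$x);
    my @ry = ranks(@$y);
    my $n = @rx;

    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ / $n for @rx;
    $mean_y += $_ / $n for @ry;

    my ($sxy, $sxx, $syy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $dx = $rx[$i] - $mean_x;
        my $dy = $ry[$i] - $mean_y;
        $sxy += $dx * $dy;
        $sxx += $dx ** 2;
        $syy += $dy ** 2;
    }
    return if $sxx == 0 or $syy == 0;
    return $sxy / sqrt($sxx * $syy);
}
```

One way to use it is to average the pairwise correlations within the motif gene set and compare that number to the averages obtained from many random gene sets of the same size.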

Of course, the microarray validation says more about the gene set you were initially given than the motif you found. How would you fix this? Try out your idea!

 
To get the project approved, send your code to: torgeir.hvidsten(at)plantphys.umu.se