Predicting Gene Ontology Biological Process from Temporal Gene Expression Patterns
Astrid Lægreid§, Torgeir Hvidsten1, Herman Midelfart1, Jan Komorowski1,3,#, Arne K. Sandvik
1Department of Information and Computer Science, Norwegian University of Science and Technology, N-7491 Trondheim, NORWAY
2Department of Physiology and Biomedical Engineering, Norwegian University of Science and Technology, N-7489 Trondheim, NORWAY
3The Linnaeus Centre for Bioinformatics, Box 598, Uppsala University, SE-751 24 Uppsala, SWEDEN
§corresponding author: biomedicine: astrid.lagreid@medisin.ntnu.no
#corresponding author: bioinformatics: janko@idi.ntnu.no


Resources
Pointers
The fibroblast serum response data: http://genome-www.stanford.edu/serum/
Gene Ontology (GO): http://genome-www.stanford.edu/GO/
Swissprot: http://www.expasy.ch/sprot/
Rough set theory: Introduction: Ohrn and Rowland, Rough sets: A knowledge discovery technique for multifactorial medical outcomes, American Journal of Physical Medicine & Rehabilitation, 79(1), 2000.
The Rosetta system: http://www.idi.ntnu.no/~aleks/rosetta
The methodology implemented in Rosetta: DEMO
Introduction

The aim of the present study was to generate hypotheses on the involvement of uncharacterized genes in biological processes. To this end, supervised learning was used to analyze microarray-derived time-series gene expression data. Our method was objectively evaluated on known genes using cross-validation and provided high-precision Gene Ontology biological process classifications for 211 of the 213 uncharacterized genes in the data set. In addition, new biological process roles were hypothesized for known genes. Our method uses biological knowledge expressed in Gene Ontology and generates a rule model that associates this knowledge with minimal characteristic features of temporal gene expression profiles. This model allows learning and classification of multiple biological process roles for each gene, and it can predict the participation of genes in a biological process even when the genes of that class exhibit a wide variety of expression profiles, including inverse co-regulation. A considerable number of the hypothesized new roles for known genes were confirmed by literature search. In addition, many biological process roles hypothesized for uncharacterized genes agreed with assumptions based on homology information. To our knowledge, a gene classifier of similar scope and functionality has not been reported earlier.

All annotations, re-classifications of known genes and classifications of uncharacterized genes are available from our website: http://www.idi.ntnu.no/~torgeihv/fibroblast.

The concept of our method was originally introduced in our PSB2001 paper:
Predicting Gene Function from Gene Expressions and Ontologies 
T.R. Hvidsten, J. Komorowski, A.K. Sandvik, and A. Lægreid; Pacific Symposium on Biocomputing 6:299-310 (2001).

Data material
The fibroblast serum response data include gene expression level measurements for 493 unique genes (517 gene probes) at 12 time points from 0 to 24 hours (Iyer et al., 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283:83-87): http://genome-www.stanford.edu/serum/
Methodology
Step 1: Selecting functional classes (cellular processes) from the annotations obtained from Gene Ontology (GO)
Annotations of known genes:
Table1.html
Expression profiles for genes in the different functional classes (also see Figure 3A):
Figure1.pdf
Annotations compared to the 10 major gene expression profile clusters found by agglomerative hierarchical clustering by Iyer et al.:
Figure2.pdf
Step 2: Feature synthesis

A table is constructed in which all possible sub-intervals of the time series are columns and the genes involved in one or more of the processes selected in Step 1 are rows. Each row is labelled with one of the processes from Step 1. Since a gene can be involved in more than one process, it can be represented by more than one row. Each entry in the table is the template that matches the given gene in the given sub-interval. Rows with only empty entries are discarded.
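
The table layout described above can be sketched in a few lines of Python. This is only an illustration, not the Rosetta implementation: the function name, the dictionary-based row format, and the gene and process names are all hypothetical, and the per-gene features are assumed to have been computed already as a mapping from sub-interval to template name.

```python
def build_training_rows(features_by_gene, processes_by_gene):
    """Return one labelled row per (gene, process) pair.

    features_by_gene:  gene -> {(start, end): template name}
    processes_by_gene: gene -> list of process labels from Step 1
    """
    rows = []
    for gene, features in features_by_gene.items():
        if not features:
            continue  # rows with only empty entries are discarded
        for process in processes_by_gene.get(gene, []):
            # a gene with several processes contributes several rows
            rows.append({"gene": gene, "process": process, "features": features})
    return rows


# Hypothetical example: gene and process names are made up for illustration.
features_by_gene = {
    "geneA": {(0, 2): "increasing"},
    "geneB": {},
    "geneC": {(2, 5): "constant"},
}
processes_by_gene = {
    "geneA": ["stress response", "cell proliferation"],
    "geneB": ["transport"],
    "geneC": ["transport"],
}
rows = build_training_rows(features_by_gene, processes_by_gene)
```

In this example geneA yields two rows (one per process), geneB is dropped because no template matched it, and geneC yields one row.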

Three templates are used: 
Constant: Maximum variation from the mean expression value should not be larger than 0.2 over at least four consecutive time points. 
Increasing: An increase of at least 0.6 over at least 3 consecutive time points, starting and ending with an increase. 
Decreasing: A decrease of at least 0.6 over at least 3 consecutive time points, starting and ending with a decrease.
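
One possible reading of the three templates is sketched below in Python. Several points are assumptions rather than statements from the source: the mean for the Constant template is taken over the sub-interval itself, "starting and ending with an increase" is read as the first and last step of the sub-interval both being positive (mirrored for Decreasing), and sub-intervals are identified by zero-based (start, end) index pairs. The function names are hypothetical.

```python
def match_template(values, const_tol=0.2, change_min=0.6):
    """Return the template matching one sub-interval, or None."""
    n = len(values)
    if n >= 3:
        change = values[-1] - values[0]        # net change over the sub-interval
        first_step = values[1] - values[0]
        last_step = values[-1] - values[-2]
        if change >= change_min and first_step > 0 and last_step > 0:
            return "increasing"
        if -change >= change_min and first_step < 0 and last_step < 0:
            return "decreasing"
    if n >= 4:
        mean = sum(values) / n
        if max(abs(v - mean) for v in values) <= const_tol:
            return "constant"
    return None


def synthesize_features(profile, min_len=3):
    """Map every matching sub-interval (start, end), end inclusive, to its template."""
    features = {}
    for start in range(len(profile)):
        for end in range(start + min_len, len(profile) + 1):
            template = match_template(profile[start:end])
            if template is not None:
                features[(start, end - 1)] = template
    return features
```

Under this reading a sub-interval can match at most one template: a net change of 0.6 or more forces some point to lie at least 0.3 from the sub-interval mean, which rules out Constant.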

Data transformed using the language of templates:

Table11.html

Step 3: Inducing/learning a predictive rule model

The rule model is evaluated by 10-fold cross-validation, predicting only one process at a time. Genes in the test sets are represented by all their matching templates, so the result is unbiased and gives a good indication of the quality of the classifications of the unknown genes.

The sensitivity, specificity and accuracy obtained when classifying the test sets all depend on a threshold value. The threshold is chosen by balancing specificity against sensitivity.
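
A minimal sketch of such a balancing step for a single process is given below. It assumes that each test gene receives a score for the process (for example its share of votes), that a gene is predicted positive when its score is strictly above the threshold, and that "balancing" means picking the candidate threshold where sensitivity and specificity are closest. The source does not spell out the exact procedure, so the function names and this criterion are illustrative assumptions.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when predicting positive for score > threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y and s > threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s <= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if not y and s <= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s > threshold)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def balanced_threshold(scores, labels):
    """Pick the observed score at which |sensitivity - specificity| is smallest."""
    def imbalance(threshold):
        sens, spec = sensitivity_specificity(scores, labels, threshold)
        return abs(sens - spec)
    return min(sorted(set(scores)), key=imbalance)


# Hypothetical scores and true labels for one process.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, False]
threshold = balanced_threshold(scores, labels)
```

Only observed scores are tried as candidate thresholds, since sensitivity and specificity can change only at those values.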

The induced rules: 
Table3.pdf
Table3.txt (extended)
Classification quality (10-fold CV): 
Table4.html
Classifications in numbers:
Table5.html
Rule model summary:
Table2.html
Step 4: Classifying the genes
A classifier is induced from the whole set of known genes selected in Step 1. Any process receiving a higher proportion of votes than the threshold is considered a possible classification.
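
The voting step can be illustrated with the following sketch. It is not the Rosetta voting scheme: the rule representation (a set of sub-interval/template conditions, a process, and a vote weight), the weights themselves, the per-process thresholds, and the process names are all hypothetical and chosen only to show how a gene can receive several classifications at once.

```python
from collections import Counter


def classify(features, rules, thresholds, default_threshold=0.5):
    """Return {process: vote share} for processes whose share exceeds the threshold.

    features:   {(start, end): template name} for the gene to classify
    rules:      list of (conditions, process, weight) triples
    thresholds: {process: threshold on the vote share}
    """
    votes = Counter()
    for conditions, process, weight in rules:
        # a rule fires only if every one of its conditions matches the gene
        if all(features.get(interval) == template
               for interval, template in conditions.items()):
            votes[process] += weight
    total = sum(votes.values())
    if total == 0:
        return {}  # no rule fired, so no classification is made
    return {
        process: count / total
        for process, count in votes.items()
        if count / total > thresholds.get(process, default_threshold)
    }


# Hypothetical rule model and gene.
rules = [
    ({(0, 2): "increasing"}, "stress response", 3),
    ({(2, 5): "constant"}, "transport", 1),
    ({(0, 2): "decreasing"}, "cell proliferation", 5),
]
gene_features = {(0, 2): "increasing", (2, 5): "constant"}
result = classify(gene_features, rules, {"stress response": 0.5, "transport": 0.2})
```

Here two rules fire, so the gene is assigned both "stress response" (vote share 0.75) and "transport" (vote share 0.25); raising the "transport" threshold to 0.3 would leave only the first.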

The classifier was used both to classify unknown genes and to (re-)classify known genes.

Re-classifications for known genes: 
Table6.html
Figure3.pdf
Classifications for unknown genes: 
Table9.html
Classifications for unknown genes coinciding with homology info:
Table10.html
 
Step 5: Evaluating (re-)classifications
(Re-)classifying known genes produced functional hypotheses that were not among the existing annotations of these genes. We therefore re-examined the literature with these genes in mind and found that a large part of the hypotheses reflected established knowledge.
Co-regulated biological processes:
Table8.html
Missing annotations discovered by our classifier:
Table7.html