Computational Molecular Biology

How many new genes are there (Science, Mar 2006)

Toronto claim

In a recent (Sep, 2005) Science report, the FANTOM Consortium claim to have found 5,154 new proteins in the mouse genome not encoded by previously known mRNA sequences. This claim contrasts dramatically with the view of the International Human Genome Sequencing Consortium and with our previous study using exon microarrays. By downloading FANTOM's protein sequence data and performing a careful, independent computational analysis, we concluded that the number of new protein-coding genes discovered by the FANTOM Consortium is at most in the hundreds, with the remainder being either splice isoforms of known proteins or false positives arising randomly from noncoding transcripts.

FANTOM claim

This has fueled a heated but friendly debate featured in Science ...

Project website:

http://www.psi.toronto.edu/TransLand/

References:

  • Leo J. Lee, Timothy R. Hughes, and Brendan J. Frey. How Many New Genes Are There? Science, vol. 311, no. 5768, pp. 1709-1711, Mar 26, 2006. [link to the paper]

  • The FANTOM Consortium: P. Carninci et al. The Transcriptional Landscape of the Mammalian Genome, Science, vol. 309, no. 5740, pp. 1559-1563, Sep 2, 2005. [link to the paper]

  • BJ Frey, N Mohammad, QD Morris, W Zhang, MD Robinson, S Mnaimneh, R Chang, Q Pan, E Sat, J Rossant, BG Bruneau, JE Aubin, BJ Blencowe, TR Hughes. Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs, Nature Genetics, Aug 28, 2005. [PDF]

A revised view of the mammalian library of genes (Nature Genetics, Aug 2005)

GenRate

Recent mammalian microarray experiments have detected widespread transcription and raised the possibility that there may be a large number of undiscovered multi-exon protein-coding genes. To explore this possibility, we hybridized unamplified, polyadenylation-selected samples from 37 mouse tissues to microarrays encompassing 1.14 million exon probes. We analyzed these data using GenRate, a Bayesian algorithm that uses a genome-wide scoring function in a factor graph to infer genes. At a stringent exon false detection rate of 2.7%, GenRate detects 12,145 gene-length transcripts and confirms 81% of the 10,000 most highly-expressed known genes. Surprisingly, our analysis shows that most of the 155,839 exons detected by GenRate are associated with known genes, providing for the first time microarray-based evidence that the vast majority of multi-exon genes have already been discovered. GenRate also detects tens of thousands of potential new exons and reconciles discrepancies in current cDNA databases, by stitching novel transcribed regions into previously-annotated genes.

Left: A visualization of the steps in our analysis. The genome (DNA) was scanned to identify 1,140,000 regions that were most likely to contain exons. Five exon detectors were used and the blue-colored flags near the center of the helix indicate which of the five methods identified each region. The six red-colored flags indicate whether the region is annotated in a cDNA/EST database. We designed one probe for each region detected as described above, and the probes were hybridized to twelve samples. The bright-colored bands near the outer edge of the helix show the expression levels of the probes for the twelve samples, while the black/purple band indicates absolute expression level. Our analysis algorithm, GenRate, detects patterns of coregulation across the twelve samples and identifies putative genes, as shown by pull-outs at the outer-most edge of the helix.

GenRate detected approximately 30,000 novel putative exons (not appearing in the following databases: human ensembl, human ensembl novel, mouse refseq, mouse ensembl, mouse fantom2, mouse unigene). The following figure shows how the 10% most highly-expressed putative exons compare to well-annotated RefSeq genes.

GenRate Novel Exons

Project website (with data matrix):

http://www.psi.toronto.edu/genrate

References:

  • BJ Frey, N Mohammad, QD Morris, W Zhang, MD Robinson, S Mnaimneh, R Chang, Q Pan, E Sat, J Rossant, BG Bruneau, JE Aubin, BJ Blencowe, TR Hughes. Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs, Nature Genetics 37:9, Aug 28, 2005. [PDF]

  • BJ Frey, TR Hughes and QD Morris. GenRate: A generative model that reveals novel transcripts in genome-tiling microarray data. Journal of Computational Biology 13:2, 200-214, March 2006.

  • B. J. Frey, Q. D. Morris, M. D. Robinson and T. R. Hughes 2005 Finding novel transcripts in high-resolution genome-wide microarray data using the GenRate model, RECOMB 2005, MIT, June 2005.

Bayesian learning of microRNA targets from sequence and expression data

MicroRNAs (miRNAs) have recently been discovered as an important class of non-coding RNA genes that play a major role in regulating gene expression, providing a means to control the relative amounts of mRNA transcripts and their protein products. Although much work has been done on the genome-wide computational prediction of miRNA genes and their target mRNAs, an open question is how to efficiently detect bona fide miRNA targets among the large number of candidate targets predicted by existing computational algorithms. We propose a novel probabilistic model that accounts for gene expression using miRNA expression data and a set of candidate miRNA targets. A set of underlying miRNA targets is learned from the data using our algorithm, GenMiR (Generative model for miRNA regulation). Our high-confidence miRNA targets represent a dramatic increase in the number of known miRNA targets and provide a wide range of novel testable hypotheses, offering a starting point for understanding miRNA regulation on a global scale.


GenMiR_BN
Bayesian network used for detecting miRNA targets: given a set of candidate miRNA-target interactions generated using a target-finding program, the GenMiR probability model uses expression data for mRNAs and miRNAs to find a subset of the candidates which are well-supported by the data.
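
The filtering idea in this caption can be illustrated with a much cruder stand-in than the GenMiR probability model. The sketch below assumes (this is not stated above) that a supported target's mRNA expression falls as the miRNA's expression rises across tissues, and keeps only candidate pairs whose Pearson correlation is below a cutoff. The function names, the cutoff and the toy data are all hypothetical, and no Bayesian inference is performed.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat profile carries no evidence either way
    return cov / sqrt(var_x * var_y)

def supported_targets(candidates, mirna_expr, mrna_expr, cutoff=-0.5):
    """Keep candidate (miRNA, mRNA) pairs whose profiles are anti-correlated.

    candidates: pairs proposed by a sequence-based target-finding program.
    cutoff: hypothetical threshold; an assumption, not a GenMiR parameter.
    """
    kept = []
    for mirna, mrna in candidates:
        r = pearson(mirna_expr[mirna], mrna_expr[mrna])
        if r <= cutoff:
            kept.append((mirna, mrna, r))
    return kept

# Toy data: one miRNA and two candidate targets measured in four tissues.
mirna_expr = {"miR-x": [1.0, 2.0, 3.0, 4.0]}
mrna_expr = {"geneA": [8.0, 6.0, 4.0, 2.0], "geneB": [1.0, 2.0, 3.0, 4.0]}
kept = supported_targets([("miR-x", "geneA"), ("miR-x", "geneB")],
                         mirna_expr, mrna_expr)
```

In this toy example only the pair involving `geneA` survives, because its profile moves opposite to the miRNA's. A full probabilistic model would instead weigh all candidate miRNAs for a gene jointly and report a posterior probability per interaction.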

Project website:

http://www.psi.toronto.edu/genmir/

References:

  • J.C. Huang, T. Babak, T.W. Corson, G. Chua, S. Khan, B.L. Gallie, T.R. Hughes, B.J. Blencowe, B.J. Frey and Q.D. Morris. (2007) Using expression profiling data to identify human microRNA targets. Nature Methods 4(12), 1045-1049. [Click here to access this paper.]

  • J.C. Huang, Q.D. Morris and B.J. Frey. (2007) Bayesian inference of microRNA targets from sequence and expression data. J. Comp. Bio. 14(5): 550-563. [Click here to access this paper.]

  • J.C. Huang, Q.D. Morris and B.J. Frey. (2006) Detecting microRNA targets by linking sequence, microRNA and gene expression data, Proceedings of the Tenth Annual International Conference on Research in Computational Molecular Biology (RECOMB), Venice, Italy, April 2-5, 2006.

GenXHC - Generative model for cross-hybridization compensation in high-density microarray data

Microarray designs containing millions to hundreds of millions of probes that tile entire genomes are currently being released. Probes in such high-density arrays have been shown to measure signal from both their target transcript and many other non-specific transcripts due to short oligonucleotide lengths. This problem of cross-hybridization noise will become increasingly significant in the upcoming era of genome-wide exon-tiling microarray experiments, and algorithms that can accurately compensate for cross-hybridization will play an important role in future large-scale microarray assays. We have developed GenXHC (Generative model for X-Hybridization Compensation), a probabilistic model for cross-hybridization compensation which estimates transcript abundances from probe intensities by accounting for potential cross-hybridization in high-density genome-wide microarray data.

GenXHC_bn
GenXHC_example
The generative process for cross-hybridization using measured expression profiles. Expression profiles consist of measurements across 12 tissue pools, where high intensity indicates high expression. Each probe can hybridize to multiple transcripts and thus measures components of expression from transcripts other than its target. The measured expression levels can be used in tandem with knowledge of probe-transcript hybridization constraints to infer expression levels for the transcripts.


We perform a pairwise alignment between microarray probes and known transcripts to identify likely cross-hybridization interactions for each microarray probe. Using this information, we fit a parameterized generative model of the observed probe intensities as a noisy weighted sum of unobserved transcript abundances. The unobserved transcript abundances and cross-hybridization weights are estimated through variational learning. The algorithm was applied to a subset of a genome-wide M. musculus exon-tiling microarray dataset: GO-BP clustering using the denoised profiles produced clusters that were statistically enriched for many categories compared to clustering using noisy data.
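
To make the weighted-sum idea concrete, here is a minimal sketch, not the GenXHC algorithm itself. It assumes the cross-hybridization weights are already known (GenXHC estimates them together with the abundances) and replaces variational learning with projected gradient descent on the squared error. The function name `compensate`, the step size and the toy numbers are hypothetical.

```python
def compensate(intensities, weights, n_transcripts, steps=2000, lr=0.1):
    """Estimate non-negative transcript abundances from probe intensities.

    intensities: one measured intensity per probe.
    weights: one dict per probe, {transcript_index: hybridization_weight};
             only transcripts the probe can align to appear in the dict.
    """
    abundance = [0.0] * n_transcripts
    for _ in range(steps):
        grad = [0.0] * n_transcripts
        for x, row in zip(intensities, weights):
            # Residual between the predicted weighted sum and the measurement.
            residual = sum(w * abundance[t] for t, w in row.items()) - x
            for t, w in row.items():
                grad[t] += w * residual
        # Gradient step, then clip at zero so abundances stay non-negative.
        abundance = [max(a - lr * g, 0.0) for a, g in zip(abundance, grad)]
    return abundance

# Probe 0 targets transcript 0 but also picks up 30% of transcript 1;
# probe 1 measures transcript 1 only.
probe_weights = [{0: 1.0, 1: 0.3}, {1: 1.0}]
estimate = compensate([3.5, 5.0], probe_weights, n_transcripts=2)
```

With these toy inputs the estimate approaches abundances of 2 and 5, so the raw reading of 3.5 on probe 0 is corrected for the signal contributed by transcript 1.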

References:

  • J.C. Huang, Q.D. Morris, T.R. Hughes and B.J. Frey 2005. GenXHC: A probabilistic generative model for cross-hybridization compensation in high-density, genome-wide microarray data. Proceedings of the Thirteenth Annual Conference on Intelligent Systems for Molecular Biology, Detroit, MI, June 25-29, 2005. Bioinformatics 21 Suppl 1:i222-i231. [Bioinformatics]

Prediction of mouse gene function using gene expression data (J Biol 2004)

The existence of both vast amounts of high-throughput biological data and large, curated gene annotation databases has made automatic gene function discovery a feasible goal. While this approach has been widely applied in yeast and other model organisms, its utility has not yet been established in mammalian systems.

In collaboration with Timothy Hughes' lab, the PSI lab is analyzing a comprehensive database of the gene expression of over 40,000 known and predicted genes across 55 different mouse tissues. The mouse tissues used in this study were collected through a large multi-lab collaboration at the University of Toronto. The mRNA was extracted and purified, and the gene expression arrays were hybridized and quantitated in Hughes' lab. We used our microarray denoising pipeline, developed in collaboration with Hughes' lab, to normalize the data.

We analyzed the extent to which these data are useful for predicting gene function using a set of ~1,000 Gene Ontology Biological Process (GO-BP) categories containing annotations in at least one category for ~8,000 of our detected transcripts. We found that a majority of our detected genes were co-regulated with sets of genes significantly enriched for one or more GO-BP categories.
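
The text above does not say which statistical test was used to call a gene set enriched. A common choice for this kind of question is the hypergeometric tail probability, sketched below as an assumption rather than as the method used in the study. The function name and the toy numbers are hypothetical.

```python
from math import comb

def enrichment_pvalue(population, annotated, cluster_size, hits):
    """P(at least `hits` annotated genes in a random cluster).

    population: number of genes with any GO-BP annotation.
    annotated: number of those genes annotated to the category of interest.
    cluster_size: number of annotated genes in the co-regulated set.
    hits: number of genes in the set that belong to the category.
    """
    total = comb(population, cluster_size)
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total

# Toy example: 10 genes, 5 in the category, and a cluster of 5 genes
# that all fall in the category.
p_all_hits = enrichment_pvalue(10, 5, 5, 5)
```

Here `p_all_hits` equals 1/252, the chance of drawing exactly the five category members at random. With ~1,000 categories tested per gene set, a correction for multiple testing would also be needed before calling a set enriched.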

To establish a baseline for the accuracy and extent of the functional predictions these data allow, we trained Support Vector Machines (SVM) to predict GO-BP category membership (i.e., gene function) using the gene's 55 tissue expression profile. We assigned annotations to 1,163 of the ~14,000 uncharacterized genes in one of more than 395 GO-BP categories, each with a predicted accuracy of more than 50%.

functional predictions

Our work has demonstrated the utility and power of functional prediction across gene co-regulation data in mammals. Our current work involves improving the number and accuracy of our predictions using probabilistic generative models which model the dependencies between GO-BP categories and the uncertainty in the gene expression measurements.

References:

  • W. Zhang1, Q. D. Morris1, R. Chang, O. Shai, M. A. Bakowski, N. Mitsakakis, N. Mohammad, M. Robinson, R. Zirnglibl, E. Somogyi, N. Laurin, W. T. Peng, N. Krogan, E. Eftekharpour, E. Sat, J. Grigull, Q. Pan, J. Greenblatt, M. Fehlings, D. van der Kooy, J. Aubin, B. G. Bruneau, J. Rossant, B. J. Blencowe, B. J. Frey, and T. R. Hughes 2004 The functional landscape of mouse gene expression, Journal of Biology 3:5, 21 [PDF | PubMed]
    1 Joint-first authors

  • Q. D. Morris, W. Zhang, O. Shai, B. J. Frey, M. A. Bakowski, R. Chang, N. Mitsakakis, B. J. Blencowe, and T. R. Hughes 2004 The functional landscape of mouse gene expression, invited talk in Snowbird Learning Workshop, Snowbird, Utah, 2004

GenASAP - Generative model for alternative splicing predictions from microarray data (Mol Cell 2004)

We present GenASAP (Generative model for Alternative Splicing Array Platform), a new algorithm coupled with a microarray platform for the study of alternative splicing (AS). A new microarray, targeted towards studying single cassette exon inclusion/exclusion (see figure below), was designed.

AS events
AS probes
On the left, the most common alternative splicing events are shown: (a) single cassette exon inclusion/exclusion, (b) alternative 3'/5' splice site, (c) mutually excluded exons, and (d) intron inclusion/exclusion. On the right, the six microarray probes designed to study each AS event are shown. The microarray has three body probes (C1, C2, and A) and three junction probes (C1:A, A:C2, and C1:C2) for each AS event.

The model explains the observed values, consisting of measured transcription for exon body and junction probes, as a weighted linear combination of the abundances of the alternative isoforms, with scale-dependent noise and an outlier process. Learning in the generative model is carried out using a variational approximation of the Expectation-Maximization (EM) algorithm.
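
The linear-combination part of this model can be sketched in a few lines. This sketch makes idealized assumptions: every probe responds with unit weight to each isoform that contains its sequence, there is no scale-dependent noise or outlier process, and ordinary least squares stands in for variational EM. The design matrix and function names are hypothetical and are not taken from GenASAP.

```python
# Rows: probes C1, C2, A, C1:A, A:C2, C1:C2.
# Columns: (inclusion isoform C1-A-C2, exclusion isoform C1-C2).
DESIGN = [(1, 1), (1, 1), (1, 0), (1, 0), (1, 0), (0, 1)]

def estimate_isoforms(probes, design=DESIGN):
    """Least-squares isoform abundances from the six probe intensities."""
    # Normal equations for two unknowns, solved in closed form.
    a11 = sum(r[0] * r[0] for r in design)
    a12 = sum(r[0] * r[1] for r in design)
    a22 = sum(r[1] * r[1] for r in design)
    b1 = sum(r[0] * x for r, x in zip(design, probes))
    b2 = sum(r[1] * x for r, x in zip(design, probes))
    det = a11 * a22 - a12 * a12
    inclusion = (a22 * b1 - a12 * b2) / det
    exclusion = (a11 * b2 - a12 * b1) / det
    # Crude clipping: abundances cannot be negative.
    return max(inclusion, 0.0), max(exclusion, 0.0)

def percent_inclusion(probes):
    """Fraction of transcripts that include the cassette exon A."""
    inclusion, exclusion = estimate_isoforms(probes)
    total = inclusion + exclusion
    return inclusion / total if total > 0 else float("nan")

# Noise-free toy readings for inclusion = 3 and exclusion = 1.
ratio = percent_inclusion([4.0, 4.0, 3.0, 3.0, 3.0, 1.0])
```

For the noise-free toy readings the recovered inclusion fraction is 0.75. Real probes have different affinities and noise levels, which is why a model that learns the weights and handles outliers is needed in practice.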

We carried out the learning on a new data set, consisting of 3126 "cassette" AS events across 10 tissues, where 3 exon body probes and 3 exon junction probes were used to study each event. A small subset (~200) of the events were closely examined using semi-quantitative RT-PCR, the results of which were used as the ground truth for evaluating the performance of GenASAP. The probabilistic nature of the algorithm suggests an approach to evaluating the confidence in the inferred values, which proved an important factor in evaluating the algorithm. The relative abundances of isoforms obtained from GenASAP's unsupervised learning were found to closely match the RT-PCR measurements, and to outperform supervised methods, such as KNN, logistic regression, and linear regression.

AS Bayes Net

References:

  • O. Shai, B. J. Frey, Q. D. Morris, Q. Pan, C. Misquitta, and B. J. Blencowe 2004, Probabilistic inference of alternative splicing events in microarray data. accepted to Neural Information Processing Systems 17 (NIPS 04)

  • Q. Pan1, O. Shai1, C. Misquitta, W. Zhang, N. Mohammad, T. Babak, H. Siu, T. R. Hughes, Q. D. Morris2, B. J. Frey2, and B. J. Blencowe2 2004 Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform, Molecular Cell 16:6, 929-941 [PubMed]
    1 Joint-first authors
    2 Joint-senior authors

Spatial trend removal (STR) - denoising microarray data

Microarrays allow biologists to measure the expression levels of thousands of mRNA transcripts simultaneously. To improve data quality, we attempt to remove spatial systematic noise, arising from slide imperfection, scanner artifacts, uneven washing, etc.

STR relies on the assumption that probe placement on the array is random, so there should be no correlation between nearby probes. This assumption leads to a view of the data as high-frequency data added to a low-frequency spatial trend. We have formulated an algorithm, STR, to remove spatial trends in the data, while preserving high data fidelity, accounting for outliers, and efficiently optimizing for slide-specific trends.

filter steps

STR first removes outlying measurements (arising from differentially expressed genes, faulty probes, etc.) from the original data (shown in false colors in (a)). The "non-outlier" data (b) is then filtered with a low-pass filter (d) to obtain an estimate of the spatial trend (c). The final, detrended data is obtained by subtracting the spatial trend from the original data (e). STR uses gradient descent to optimize the parameters of the low-pass filter.
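
The three steps above can be sketched as follows. This is an illustration only, not the Matlab STR code: outliers are flagged with a median-absolute-deviation rule and the low-pass filter is a fixed box average, both of which are assumptions, whereas STR optimizes its filter parameters by gradient descent. The function name and default parameters are hypothetical.

```python
from statistics import median

def spatial_trend_removal(grid, radius=1, mad_cutoff=3.0):
    """Subtract a local-average trend, estimated from non-outliers only."""
    values = [v for row in grid for v in row]
    center = median(values)
    # Median absolute deviation; the tiny floor avoids division by zero.
    mad = median([abs(v - center) for v in values]) or 1e-12
    # Step 1: mark non-outlying measurements.
    keep = [[abs(v - center) / mad <= mad_cutoff for v in row] for row in grid]
    rows, cols = len(grid), len(grid[0])
    detrended = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            # Step 2: box-average low-pass filter over non-outlier neighbours.
            neighbours = [
                grid[a][b]
                for a in range(max(0, i - radius), min(rows, i + radius + 1))
                for b in range(max(0, j - radius), min(cols, j + radius + 1))
                if keep[a][b]
            ]
            trend = sum(neighbours) / len(neighbours) if neighbours else center
            # Step 3: subtract the trend from the ORIGINAL value.
            out_row.append(grid[i][j] - trend)
        detrended.append(out_row)
    return detrended

# A flat slide with one strongly expressed spot at position (1, 1).
slide = [[1.0] * 3 for _ in range(3)]
slide[1][1] = 100.0
result = spatial_trend_removal(slide)
```

Because the bright spot is excluded when the trend is estimated, it is preserved in the output instead of being smeared into its neighbours, which is the reason for removing outliers before filtering.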

Matlab code for STR is available by contacting one of the following:

  • Ofer Shai - ofer[at]psi. utoronto. ca

  • Quaid Morris - quaid[at]psi. utoronto. ca

  • Brendan Frey - frey[at]psi. utoronto. ca

References:

  • Q. D. Morris1, O. Shai1, B. J. Frey, W. Zhang, and T. R. Hughes 2004 Spatial trend removal - reducing systematic noise in microarrays, in preparation
    1 Joint-first authors

Untangling biological networks (NIPS 2003)

Networks describing biological processes (like metabolism, transcriptional regulation, and protein-protein interactions) are invaluable for understanding and manipulating those processes. However, experimental methods that can generate these networks on a large scale are prone to error, producing many false-positive and false-negative measurements. We have developed a method to identify measurement errors using knowledge about the connectivity structure of the noise-free network and the dependency structure of the measurement noise.

Many types of biological networks share a very similar connectivity structure: most nodes have a small degree (i.e., are connected to very few other nodes) and a few nodes, "hubs", have a very large degree. This structure is well-described by a degree distribution. Similarly, the dependency structure of the noise can also be well-represented by a degree distribution. These regularities are important because they can be used to untangle the real network (the signal) from the network of false positive measurements (the noise).
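
As a small illustration of what a degree distribution is, the sketch below computes the empirical distribution from an undirected edge list. It only describes a measured network; it does not perform the untangling itself. The function name and the toy network are hypothetical.

```python
from collections import Counter

def degree_distribution(edges):
    """Fraction of nodes having each degree, from an undirected edge list.

    Nodes that appear in no edge are not counted.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n_nodes = len(degree)
    return {k: c / n_nodes for k, c in sorted(counts.items())}

# A toy "hub" network: one protein interacting with four partners.
star = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
distribution = degree_distribution(star)
```

For the toy network, 80% of nodes have degree 1 and 20% (the hub) have degree 4. Fitting one such distribution to trusted interactions and another to the noise is what would let a model tell the two networks apart.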

We developed a probabilistic generative model of tangled, experimentally measured networks. Probabilistic inference in our model untangles the signal and noise networks; however, exact inference in the model is intractable. To address this, we have developed an accurate, linear-time sum-product algorithm for approximate inference.

Untangling 1  Untangling 2

We applied our algorithm to the problem of denoising yeast protein-protein interaction networks. After using a small set of trusted interactions to fit noise and signal degree distributions, our method detects 40% to 80% more true interactions, for the same false-positive rate, as a method which ignores graph structure.

Untangling 3

Our current work involves applying our method to untangle other types of biological networks and extending our generative model to incorporate other descriptors of connectivity structure, like degree correlations and clustering coefficients.

References:

  • Q. D. Morris, B. J. Frey, and C. J. Paige 2003 Denoising and untangling graphs using degree priors, in Proceedings of Neural Information Processing Systems 16 (NIPS 03) [PDF]

  • Q. D. Morris and B. J. Frey 2003 Denoising and untangling graphs using degree priors, invited talk in International Conference on Machine Learning, Bioinformatics Workshop (ICML 03), Washington DC