Supplementary Material for

How many new protein-coding genes are there in Fantom 3?

 

Leo J. Lee, Timothy R. Hughes and Brendan J. Frey

Department of Electrical and Computer Engineering

Banting and Best Department of Medical Research

University of Toronto, Toronto, Ontario, Canada

 


 

Database Preparation

 

The customized mouse RefSeq mRNA database is created from RefSeq release 13 by using the Entrez tool on the NCBI website, and includes only those transcripts that are linked to a reference published no later than May 1, 2005.  This database is named RefSeq_May and is available for download here.  We have further removed all the predicted XM transcripts from RefSeq_May, in other words keeping only the NM transcripts, and call the resulting database RefSeq_May_NM, which is available here.
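As an illustration of the NM-only filtering step, the following is a minimal sketch and not the procedure actually used to build RefSeq_May_NM. It assumes that each FASTA header contains a RefSeq accession (for example in the form `ref|NM_...|`); the function name `keep_nm_records` and the accessions in the usage example are hypothetical.

```python
import re

# Assumption: every FASTA header carries a RefSeq accession such as
# "NM_000001.1" (curated mRNA) or "XM_000002.1" (predicted mRNA).
ACCESSION = re.compile(r"\b([A-Z]{2}_\d+)(?:\.\d+)?")


def keep_nm_records(fasta_lines):
    """Yield the lines of every record whose first accession starts with NM_."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # A new record starts here: decide once whether to keep it.
            match = ACCESSION.search(line)
            keep = bool(match) and match.group(1).startswith("NM_")
        if keep:
            yield line


# Usage with two made-up records (hypothetical accessions).
example = [
    ">gi|1|ref|NM_000001.1| example curated transcript",
    "ATGGCC",
    ">gi|2|ref|XM_000002.1| example predicted transcript",
    "ATGTTT",
]
nm_only = list(keep_nm_records(example))
```

In this sketch a record is kept or dropped as a whole, so sequence lines always stay attached to their header.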

 

A customized mouse GenBank mRNA database, including only transcripts linked to publications prior to May 1, 2005, is similarly created and can be downloaded here.  For reference, the 5,154 new FANTOM3 protein sequences can also be downloaded here.

 


 

Mapping Results

 

The mappings from the new FANTOM3 proteins to known mRNA databases were performed with BLAT version 3.20.  The mapping results for the two customized RefSeq databases are summarized in the following table.

 

Mapping Database     RefSeq_May     RefSeq_May_NM
Identical                   159               124
Exon-skipping             3,248             2,716
Exon-inclusion               29                 6
Other isoforms              181                90
Total                     3,568             2,917

    Table 1: Mapping results of the 5,154 new FANTOM3 proteins to mouse RefSeq cDNAs.

The mapping criteria for the different categories of splice isoforms are:

Since RefSeq already contains a small number of splice isoforms, a FANTOM3 protein can belong to more than one category; hence the total number is slightly less than the sum of the above four categories.
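The gap between the category sums and the totals can be checked directly from the counts in Table 1. The short sketch below only redoes that arithmetic; the dictionary layout and the function name `excess_over_total` are illustrative choices, not part of the original analysis.

```python
# Counts copied from Table 1.
table1 = {
    "RefSeq_May": {
        "Identical": 159,
        "Exon-skipping": 3248,
        "Exon-inclusion": 29,
        "Other isoforms": 181,
        "Total": 3568,
    },
    "RefSeq_May_NM": {
        "Identical": 124,
        "Exon-skipping": 2716,
        "Exon-inclusion": 6,
        "Other isoforms": 90,
        "Total": 2917,
    },
}


def excess_over_total(counts):
    """Return (sum of the four categories) minus the reported total."""
    category_sum = sum(n for name, n in counts.items() if name != "Total")
    return category_sum - counts["Total"]


for database, counts in table1.items():
    print(database, excess_over_total(counts))
```

The difference is 49 for RefSeq_May and 19 for RefSeq_May_NM, which is the number of extra category memberships contributed by proteins that fall into more than one category.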

 

The list of the 3,568 FANTOM3 proteins and the RefSeq transcripts they map to is available here as a tab-delimited text file; lists of the proteins in each of the four sub-categories are also available.

 

The additional 303 splice isoforms found by including GenBank mRNAs are listed here with the GenBank accession numbers they map to.  For these, we require that more than 50% of a FANTOM3 protein overlaps with a GenBank transcript and that the sequence similarity of the overlapping region is at least 95%.
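The two thresholds above can be expressed as a filter over BLAT alignments in the standard tab-separated PSL format. This is a hedged sketch only: how overlap and similarity were actually computed is not stated here, so the definitions below (overlap as the aligned query span divided by the query length, similarity as matching positions divided by aligned positions) are assumptions, as are the names `PslHit`, `parse_psl_line` and `passes_genbank_criterion`.

```python
from typing import NamedTuple


class PslHit(NamedTuple):
    """The subset of PSL columns needed for the two thresholds."""
    matches: int
    mismatches: int
    rep_matches: int
    q_name: str
    q_size: int
    q_start: int
    q_end: int
    t_name: str


def parse_psl_line(line):
    """Parse one data line of a PSL file (21 tab-separated columns)."""
    f = line.rstrip("\n").split("\t")
    return PslHit(int(f[0]), int(f[1]), int(f[2]),
                  f[9], int(f[10]), int(f[11]), int(f[12]), f[13])


def passes_genbank_criterion(hit, min_overlap=0.50, min_similarity=0.95):
    """Return True if overlap > 50% and similarity >= 95% (assumed definitions)."""
    aligned = hit.matches + hit.mismatches + hit.rep_matches
    if aligned == 0 or hit.q_size == 0:
        return False
    # Assumption: overlap is the query span covered by the alignment.
    overlap = (hit.q_end - hit.q_start) / hit.q_size
    # Assumption: similarity is the fraction of aligned positions that match.
    similarity = (hit.matches + hit.rep_matches) / aligned
    return overlap > min_overlap and similarity >= min_similarity
```

Under these assumptions, a hit covering 300 of 500 query residues with 290 matches and 10 mismatches passes (60% overlap, about 96.7% similarity), while the same hit covering exactly 250 residues does not, because the overlap must exceed 50%.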

 

Finally, the list of the additional 144 proteins detected by our own mouse microarray analysis is available here along with the indexes of the probes they map to, and details of this analysis are available on its own project page.