Supplementary Material for
How many new protein-coding genes are there in Fantom 3?
Leo J. Lee, Timothy R. Hughes and Brendan J. Frey
Department of Electrical and Computer Engineering
Banting and Best Department of Medical Research
University of Toronto, Toronto, Ontario, Canada
Database Preparation
The customized mouse RefSeq mRNA database was created from RefSeq release 13 using the Entrez tool on the NCBI website, and includes only those transcripts that are linked to a reference published no later than May 1, 2005. This database is named RefSeq_May and is available for download here. We further removed all predicted XM transcripts from RefSeq_May, in other words keeping only the NM transcripts, and call the resulting database RefSeq_May_NM, which is available here.
A customized mouse GenBank mRNA database, including transcripts linked to publications prior to May 1, 2005, was created in the same way and can be downloaded here. For reference, the 5,154 new FANTOM3 protein sequences can also be downloaded here.
Mapping Results
The mappings from the new FANTOM3 proteins to the known mRNA databases were done with BLAT version 3.20. The mapping results for the two customized RefSeq databases are summarized in the following table.
| Mapping Database | RefSeq_May | RefSeq_May_NM |
|---|---|---|
| Identical | 159 | 124 |
| Exon-skipping | 3,248 | 2,716 |
| Exon-inclusion | 29 | 6 |
| Other isoforms | 181 | 90 |
| Total | 3,568 | 2,917 |
Table 1: Mapping results of the 5,154 new FANTOM3 proteins to mouse RefSeq cDNAs.
The mapping criteria for the different categories of splice isoforms are:
Identical: when more than 95% of a FANTOM3 protein overlaps with more than 95% of a RefSeq transcript and the sequence similarity of the overlapping region is also at least 95%.
Exon-skipping: not identical but when more than 95% of a FANTOM3 protein overlaps with a RefSeq transcript and the sequence similarity of the overlapping region is at least 95%.
Exon-inclusion: not identical but when more than 95% of a RefSeq transcript overlaps with a FANTOM3 protein and the sequence similarity of the overlapping region is at least 95%.
Other isoforms: not identical but when more than 50% of a FANTOM3 protein overlaps with more than 50% of a RefSeq transcript and the sequence similarity of the overlapping region is at least 95%.
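The criteria above can be sketched as a small classifier. This is only an illustrative sketch, not the code used for the analysis: the function names and the input tuple layout are hypothetical, the overlap fractions and similarity are assumed to be precomputed from the BLAT alignments, and the order in which the categories are tested is our assumption, since the criteria as stated do not give a precedence when a pair satisfies more than one rule.

```python
from collections import defaultdict

def classify_pair(frac_protein, frac_transcript, identity):
    """Classify one FANTOM3 protein / RefSeq transcript pair.

    frac_protein:    fraction of the protein covered by the overlap (0..1)
    frac_transcript: fraction of the transcript covered by the overlap (0..1)
    identity:        sequence similarity of the overlapping region (0..1)

    Assumption: categories are tested in the order they are listed in the
    text, so a pair receives at most one label.
    """
    if identity < 0.95:          # every category requires at least 95% similarity
        return None
    if frac_protein > 0.95 and frac_transcript > 0.95:
        return "Identical"
    if frac_protein > 0.95:
        return "Exon-skipping"
    if frac_transcript > 0.95:
        return "Exon-inclusion"
    if frac_protein > 0.50 and frac_transcript > 0.50:
        return "Other isoforms"
    return None

def categories_per_protein(hits):
    """Collect the set of categories each protein falls into.

    hits: iterable of (protein_id, transcript_id, frac_protein,
          frac_transcript, identity) tuples (hypothetical layout).
    """
    result = defaultdict(set)
    for protein_id, _transcript_id, fp, ft, ident in hits:
        category = classify_pair(fp, ft, ident)
        if category is not None:
            result[protein_id].add(category)
    return dict(result)
```

Because the categories are collected per protein across all of its transcript hits, a single protein can end up with more than one label when it maps to several RefSeq isoforms of the same gene.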
Since RefSeq already contains a small number of splice isoforms, a FANTOM3 protein can belong to more than one category; hence the total is slightly less than the sum of the four categories above.
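The size of this effect can be read off Table 1 by subtracting each total from the sum of its four categories. The short check below uses only the numbers in the table; the variable and function names are ours.

```python
# Counts copied from Table 1.
counts = {
    "RefSeq_May": {
        "Identical": 159, "Exon-skipping": 3248,
        "Exon-inclusion": 29, "Other isoforms": 181, "Total": 3568,
    },
    "RefSeq_May_NM": {
        "Identical": 124, "Exon-skipping": 2716,
        "Exon-inclusion": 6, "Other isoforms": 90, "Total": 2917,
    },
}

def excess_assignments(column):
    """Sum of the four category counts minus the reported total."""
    category_sum = sum(v for k, v in column.items() if k != "Total")
    return category_sum - column["Total"]

for name, column in counts.items():
    print(name, excess_assignments(column))
# RefSeq_May 49
# RefSeq_May_NM 19
```

These differences count extra category assignments, not proteins: a protein that falls into three categories contributes two to the excess.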
The list of the 3,568 FANTOM3 proteins and the RefSeq transcripts they map to is available here as a tab-delimited text file; the proteins in each of the four sub-categories are also available.
The additional 303 splice isoforms found by including GenBank mRNAs are listed here with the GenBank accession numbers they map to. For these we require that more than 50% of a FANTOM3 protein overlaps with a GenBank transcript and that the sequence similarity of the overlapping region is at least 95%.
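As a rough illustration of how such a threshold could be applied to BLAT output, the sketch below parses one line of BLAT's tab-separated PSL format and tests the GenBank criterion. It rests on several assumptions: that the alignments were written in PSL format, that the overlap is approximated by the aligned query span divided by the query size (which also counts gaps inside the span), and that similarity is approximated by matches over aligned positions. The names `PslHit`, `parse_psl_line` and `passes_genbank_criterion` are hypothetical, and the actual analysis may have computed these quantities differently.

```python
from typing import NamedTuple

class PslHit(NamedTuple):
    query: str             # FANTOM3 protein identifier (PSL qName)
    target: str            # mRNA accession (PSL tName)
    query_coverage: float  # aligned query span / query size
    identity: float        # matching positions / aligned positions

def parse_psl_line(line: str) -> PslHit:
    """Parse one data line of a PSL file (21 tab-separated columns)."""
    f = line.rstrip("\n").split("\t")
    matches, mismatches, rep_matches = int(f[0]), int(f[1]), int(f[2])
    q_name, q_size = f[9], int(f[10])
    q_start, q_end = int(f[11]), int(f[12])
    t_name = f[13]
    aligned = matches + mismatches + rep_matches
    identity = (matches + rep_matches) / aligned if aligned else 0.0
    coverage = (q_end - q_start) / q_size if q_size else 0.0
    return PslHit(q_name, t_name, coverage, identity)

def passes_genbank_criterion(hit: PslHit) -> bool:
    """More than 50% of the protein overlaps, with at least 95% similarity."""
    return hit.query_coverage > 0.50 and hit.identity >= 0.95

# Made-up PSL line for illustration only (not real data).
example = "\t".join([
    "90", "2", "0", "0", "0", "0", "0", "0", "++",
    "P1", "100", "0", "92", "T1", "900", "0", "276",
    "1", "92,", "0,", "0,",
])
hit = parse_psl_line(example)
print(hit.query, hit.target, passes_genbank_criterion(hit))
```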
Finally, the list of the additional 144 proteins detected by our own mouse microarray analysis is available here, along with the probe indexes they map to; details of this analysis are available on its own project page.