December 2007


LICR Transcriptomes in UCSC Genome Browser

In recent years, the Lausanne Branch’s Computational Genomics Group (formerly the Office of Information Technology) has done considerable work in producing detailed maps of gene transcripts, so-called transcriptomes, on the human and mouse genomes to enable high-quality annotation of gene expression data. Recently, these data were introduced into the University of California Santa Cruz (UCSC) Genome Browser and will contribute to a widely used public platform for genome analysis.

In 1999, shortly before the first version of the human genome sequence was released, LICR initiated the Human Cancer Genome Project (HCGP) in collaboration with the Brazilian research organization FAPESP (the State of São Paulo Research Foundation), with the aim of identifying all human genes and their splice variants and determining their expression in cancer. By 2003, more than 800,000 ESTs (Expressed Sequence Tags; sub-sequences of mRNA transcripts) had been generated from tumors and normal tissues. The LICR Office of Information Technology was originally asked to help with quality control and analysis of the resulting data. It then set out to determine which genes the ESTs were derived from. Though seemingly straightforward, this task led the group to reconstruct the entire map of transcripts in the human genome.

Dr. Victor Jongeneel heads the group and has recently also become Vice President for Research of the Cyprus Institute in Nicosia. “We were particularly interested to know how many new genes we would discover using the sequence data produced in the HCGP. To know that, we had to estimate how many genes there are in the human genome. But at that time (1998-2000) the ‘genes’ annotated in public databases were not very reliable, and there was still considerable controversy about the actual number of human genes. For instance, we found many mis-annotated gene clusters, with some clusters containing two or more genes and some genes being split between different clusters. So we set out to draw our own version of the human transcriptome.”

The prediction of genes from ESTs, full-length mRNAs, or other evidence of transcription can be rather elaborate due to the complexity of eukaryotic genomes. Adjacent genes often overlap so that unrelated transcripts are produced from the same genomic region, and genes that have been amplified during evolution may be obscured in clusters of homologous sequences. Moreover, most genes encode several transcript variants due to alternative splicing, alternative transcription start sites or multiple polyadenylation sites. Therefore, comprehensive models are needed to deduce the entire structure of a gene including its transcript variants, regulatory elements, and local transcript environment.

To obtain a detailed map of transcripts in the human genome, Dr. Christian Iseli, Assistant Investigator of the Computational Genomics Group, developed a computational process for gene modeling together with colleagues from LICR and from the Swiss Institute for Experimental Cancer Research (ISREC). The pipeline was built first on LICR computers, and then on Vital-IT, a platform for high-performance computing developed by the Swiss Institute of Bioinformatics (SIB) and involving LICR team members, investigators from ISREC and four other research centers, in conjunction with leading computer system manufacturers. Dr. Iseli and his colleagues combined ESTs, mRNAs and reference sequences (RefSeqs) obtained from both public and in-house databases, and aligned these sequences with the fully assembled human genome that had by then been released. The best sequence alignments were chosen to define transcribed regions of the genome, which the team defined as ‘genes.’
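The step from alignments to transcribed regions can be illustrated with a minimal sketch. It is not the group's pipeline: the `Alignment` record, its fields, and the use of a single numeric score to pick the "best" alignment are all assumptions made for illustration. Each sequence keeps only its highest-scoring alignment, and overlapping alignments on the same chromosome and strand are merged into one region.

```python
from collections import defaultdict
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical record for one sequence-to-genome alignment."""
    seq_id: str    # EST, mRNA, or RefSeq identifier
    chrom: str
    strand: str    # "+" or "-"
    start: int     # 0-based, inclusive
    end: int       # exclusive
    score: float   # assumed quality measure; higher is better


def best_alignments(alignments):
    """Keep only the highest-scoring alignment for each sequence."""
    best = {}
    for aln in alignments:
        current = best.get(aln.seq_id)
        if current is None or aln.score > current.score:
            best[aln.seq_id] = aln
    return list(best.values())


def transcribed_regions(alignments):
    """Merge overlapping best alignments on the same chromosome and strand.

    Returns a list of (chrom, strand, start, end, member_seq_ids) tuples.
    """
    by_locus = defaultdict(list)
    for aln in best_alignments(alignments):
        by_locus[(aln.chrom, aln.strand)].append(aln)

    regions = []
    for (chrom, strand), alns in sorted(by_locus.items()):
        alns.sort(key=lambda a: a.start)
        start, end, members = alns[0].start, alns[0].end, [alns[0].seq_id]
        for aln in alns[1:]:
            if aln.start < end:          # overlaps the region being built
                end = max(end, aln.end)
                members.append(aln.seq_id)
            else:                        # gap: close the region, open a new one
                regions.append((chrom, strand, start, end, members))
                start, end, members = aln.start, aln.end, [aln.seq_id]
        regions.append((chrom, strand, start, end, members))
    return regions
```

Plain interval merging of this kind would fuse overlapping genes on the same strand and could not separate closely related duplicated genes, which are exactly the complications described above, so a production pipeline needs considerably more than this.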

After consulting with Dr. Philipp Bucher from ISREC, Dr. Iseli formulated an algorithm based on the mathematical concept of a digraph to describe the structure of each identified gene. “This type of model essentially generates a splicing diagram (see figure below). Transcription start sites, splice sites, and polyadenylation sites are nodes in the diagram; introns and exons are the arcs connecting the nodes. The graph can then be traversed to generate all the possible transcript combinations, but we let the actual transcripts guide the traversal. This way, we avoid creating an explosion of alternative transcripts, and instead favor the transcripts for which there is experimental evidence.”

A splicing diagram of the human Melan-A/MART-1 gene (Swiss-Prot accession: Q16655). The green nodes represent transcription start sites; blue nodes are splice donor and acceptor sites; red nodes are polyadenylation sites. The arcs connecting the nodes represent introns and exons. The thickness of each arc reflects the number of ESTs supporting it.
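The digraph idea can be sketched in a few lines. This is an illustrative toy, not Dr. Iseli's algorithm: the function names are hypothetical, transcripts are assumed to be full-length exon lists on the plus strand, and "letting the transcripts guide the traversal" is simplified to keeping only those paths that some observed transcript follows from start to end. Real ESTs are fragments, so the actual method must handle partial evidence.

```python
from collections import Counter, defaultdict


def transcript_path(exons):
    """Turn an ordered exon list [(start, end), ...] into a node sequence.

    Nodes are (kind, position) pairs, with kind one of
    "start", "donor", "acceptor", "polyA".
    """
    nodes = []
    last = len(exons) - 1
    for i, (start, end) in enumerate(exons):
        nodes.append(("start" if i == 0 else "acceptor", start))
        nodes.append(("polyA" if i == last else "donor", end))
    return tuple(nodes)


def build_splice_graph(transcripts):
    """Return a Counter mapping (from_node, to_node, kind) to its support."""
    arcs = Counter()
    for exons in transcripts:
        nodes = transcript_path(exons)
        for j in range(len(nodes) - 1):
            kind = "exon" if j % 2 == 0 else "intron"  # arcs alternate
            arcs[(nodes[j], nodes[j + 1], kind)] += 1
    return arcs


def all_paths(arcs):
    """Enumerate every start-to-polyA path, with or without evidence."""
    successors = defaultdict(list)
    for src, dst, _kind in arcs:
        successors[src].append(dst)
    paths = []

    def walk(node, path):
        if node[0] == "polyA":
            paths.append(tuple(path))
            return
        for nxt in sorted(successors[node]):
            walk(nxt, path + [nxt])

    for start in sorted({src for src, _, _ in arcs if src[0] == "start"}):
        walk(start, [start])
    return paths


def supported_paths(arcs, transcripts):
    """Keep only the paths that an observed transcript actually follows."""
    observed = {transcript_path(exons) for exons in transcripts}
    return [path for path in all_paths(arcs) if path in observed]


# Three made-up transcripts: a reference form, an exon-skipping form,
# and a form with an alternative start and polyadenylation site.
example = [
    [(100, 200), (300, 400), (500, 600)],
    [(100, 200), (500, 600)],
    [(150, 200), (300, 400), (500, 650)],
]
graph = build_splice_graph(example)
print(len(all_paths(graph)), "possible paths,",
      len(supported_paths(graph, example)), "with transcript evidence")
```

Even in this small example, exhaustive traversal yields more combinations than there are observed transcripts, which is the explosion that evidence-guided traversal is meant to avoid. The per-arc counts play the role of the arc thickness in the figure.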

With their refined gene prediction model, Dr. Iseli and his colleagues started to build an in-house transcript database that encompassed the precise genomic mapping of genes with their splice and polyadenylation variants. Using this database, they analyzed the expression of numerous genes in cancer and normal tissues through high-quality annotation of gene expression data obtained by SAGE (Serial Analysis of Gene Expression) and MPSS (Massively Parallel Signature Sequencing). The database also provided a common ground of gene nomenclature and enabled the discovery of new genes. For example, several families of cancer/testis (CT) genes were discovered by analysis of gene clusters on the human X chromosome.
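To see why a map of polyadenylation variants matters for annotating tag-based expression data, consider a minimal sketch of SAGE tag prediction. In standard SAGE, a tag is the 10 bases immediately downstream of the 3'-most CATG site (the NlaIII recognition site) in a transcript. The function names and the toy sequences below are made up for illustration, and the sequences are assumed to exclude the poly(A) tail; this is not the group's annotation code.

```python
from collections import defaultdict


def virtual_sage_tag(transcript, anchor="CATG", tag_len=10):
    """Return the tag following the 3'-most anchor site, or None.

    None is returned when the transcript has no anchor site or when
    fewer than tag_len bases follow the last site.
    """
    seq = transcript.upper()
    pos = seq.rfind(anchor)
    if pos == -1:
        return None
    tag = seq[pos + len(anchor): pos + len(anchor) + tag_len]
    return tag if len(tag) == tag_len else None


def tag_index(transcripts):
    """Map each predicted tag to the transcript IDs that would produce it."""
    index = defaultdict(list)
    for name, seq in transcripts.items():
        tag = virtual_sage_tag(seq)
        if tag is not None:
            index[tag].append(name)
    return dict(index)


# Two made-up variants of one gene that differ only in their 3' end.
short_form = "GGGCATGAAACCCGGGTTTACGT"
long_form = short_form + "TTCATGCCGGAATTCCGGAAT"
print(tag_index({"geneX_short": short_form, "geneX_long": long_form}))
```

Because the longer variant contains an additional CATG site further downstream, the two variants yield different tags. An annotation that knows only one 3' end would leave the other tag unassigned or assign it to the wrong gene.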

The new database soon became a resource used by LICR investigators to analyze gene expression data, discover new genes, find regulatory elements of known genes, or study the evolution of gene families. Computer scientists at ISREC also use the database to extend existing platforms for genome analysis. In recent years, the Computational Genomics Group has developed its gene prediction program further to meet the needs of its users. A detailed transcript map can now be generated automatically for any gene of interest, and visual tools have been developed by Drs. Brian Stevenson and Dmitry Kuznetsov to display the data graphically. In addition to the human transcriptome, other transcriptomes have been created based on the mouse, zebrafish, and Drosophila genomes, and transcriptomes of other model species can be built at individual investigators’ request.

In July 2007, Dr. Iseli showed the transcript database to Dr. Jim Kent at the University of California Santa Cruz (UCSC), who created the UCSC Genome Browser, a renowned public web tool for genome-wide annotation analysis. After comparing the Lausanne group’s data to UCSC’s own transcriptome, Dr. Kent decided to include these data in the UCSC Genome Browser, where they are named SIB genes after the SIB, a network organization that includes the groups involved in developing the Vital-IT platform. The UCSC Genome Browser displays transcripts in an interactive graphic interface, where the user can zoom and scroll along the chromosomes to obtain detailed maps of any region of interest. A number of different tracks can be activated to analyze the predicted transcripts in each region. The newly included data are found as two separate tracks: SIB genes and SIB alternative splicing.

“The Genome Browser predicts genes in a similar way to the program we developed, but their predictions are more stringent and, in some cases, miss certain genes,” says Dr. Iseli. “Our version nicely complements the Genome Browser, and also seems to deal better with duplicated genes. Our users can now display gene digraphs and transcript predictions in the Genome Browser’s graphic interface, and our data are available to a broader community of users.”

The mouse data presently shown in the Genome Browser are from the February 2006 assembly of the mouse genome, and will shortly be updated with data from the July 2007 assembly.

Access the UCSC Genome Browser to view the SIB/LICR gene predictions: http://genome.ucsc.edu

Dr. Christian Iseli, Computational Genomics Group
(Lausanne Branch)

Selected Reading

1. Chen YT, Venditti CA, Theiler G, Stevenson BJ, Iseli C, Gure AO, Jongeneel CV, Old LJ, Simpson AJ. Identification of CT46/HORMAD1, an immunogenic cancer/testis antigen encoding a putative meiosis-related protein. Cancer Immun. 2005 Jul 7;5:9.

2. Jongeneel CV, Delorenzi M, Iseli C, Zhou D, Haudenschild CD, Khrebtukova I, Kuznetsov D, Stevenson BJ, Strausberg RL, Simpson AJ, Vasicek TJ. An atlas of human gene expression from massively parallel signature sequencing (MPSS). Genome Res. 2005 Jul;15(7):1007-14.

3. Iseli C, Stevenson BJ, de Souza SJ, Samaia HB, Camargo AA, Buetow KH, Strausberg RL, Simpson AJ, Bucher P, Jongeneel CV. Long-range heterogeneity at the 3' ends of human mRNAs. Genome Res. 2002 Jul;12(7):1068-74.

4. Brentani H, Caballero OL, Camargo AA, da Silva AM, da Silva WA Jr, Dias Neto E, Grivet M, Gruber A, Guimaraes PE, Hide W, Iseli C, Jongeneel CV, Kelso J, Nagai MA, Ojopi EP, Osorio EC, Reis EM, Riggins GJ, Simpson AJ, de Souza S, et al. The generation and utilization of a cancer-oriented representation of the human transcriptome by using expressed sequence tags. Proc Natl Acad Sci U S A. 2003 Nov 11;100(23):13418-23.

5. Dermitzakis ET, Reymond A, Lyle R, Scamuffa N, Ucla C, Deutsch S, Stevenson BJ, Flegel V, Bucher P, Jongeneel CV, Antonarakis SE. Numerous potentially functional but non-genic conserved sequences on human chromosome 21. Nature. 2002 Dec 5;420(6915):578-82.

