Genome Browsers
Over the last 10 years, the publication of an increasing number of genomes, starting with S. cerevisiae in 1996, has made it necessary to present the results of mapping, sequencing and annotation efforts in an interactive and graphical way. The sequence itself is of little help to the experimental biologist; it becomes valuable only in the context of its annotation.
After an initial flurry of software products for genome exploration, including Celera’s excellent Discovery System, three Web-based genome browsers have emerged as favorites of the community, each with its own strengths and weaknesses: the UCSC Genome Browser, produced by the Genome Bioinformatics Group at UC Santa Cruz; the Map Viewer, produced by the National Center for Biotechnology Information (NCBI); and the Ensembl Browser, produced by the Ensembl group at the Sanger Institute and the European Bioinformatics Institute (EBI). I will try to give a few pointers to help you choose the one that suits you best, rather than settling for the one you happened to stumble upon.
Similarities and differences
The international and collaborative nature of most genome sequencing projects has resulted in consensus assemblies for most genomes, and thus in consensus coordinate systems. This means that no matter which genome browser you use, the positions of features on the genome will be the same, provided of course that you are viewing the same version of the assembly. It is a common misconception that different browsers present different versions of the genome; for example, an early assembly of the human genome performed at UC Santa Cruz was dubbed the Golden Path, and this name has become indelibly associated with the UCSC genome browser. In fact, the current human assembly used in all browsers is known as NCBI Build 35 and was produced in November 2003.
The three genome browsers are also similar in that they allow the user to select annotation “tracks”, each documenting a specific class of features (e.g. mRNAs, promoter elements, etc.) mapped to the genome. Some of the tracks are produced by the group maintaining the genome browser, while others are contributed by external collaborators. Some tracks are available in more than one browser, but most are specific to a single browser. Even shared tracks can differ: while all browsers show RefSeq mRNA sequences aligned to the genome, the actual alignments may differ slightly from one browser to another. Similarly, the sources of the single nucleotide polymorphisms (SNPs) available in the different browsers may not be the same. All the browsers allow the user to zoom in and out to different levels of detail, from the nucleotide sequence to entire chromosomes.
The differences between the browsers, however, are very significant. Not only do they display different tracks, but the look and functionality are unique to each browser. The information that is linked to individual features on the genome is also entirely different, both in type and in origin. I will try to highlight the distinctive features of each browser.
The UCSC Genome Browser (http://genome.ucsc.edu)
This is the oldest of the three, and it is beginning to show, as its look and feel has not been updated in five years. However, it is also the easiest to use of the three browsers. There is only one window, which can be zoomed in or out, as well as moved along the sequence. The position of the current window on the chromosome is displayed on top. Searching for genes or jumping to specific chromosome positions is very simple, using a search box at the top of the page.
A list of all available tracks is displayed below the main window. There are over 80 tracks to choose from, in the classes “Mapping and Sequencing”, “Genes and Gene Predictions”, “mRNA and EST”, “Expression and regulation”, “Comparative Genomics”, “ENCODE” (regions selected for detailed annotation), and “Variation and Repeats”. Individual tracks can be switched between viewing modes of different densities by clicking on them. Clicking on individual features (genes, mRNAs, SNPs, etc.) brings up a text window with a wealth of links to other information resources. It is also relatively straightforward to create one’s own tracks and to integrate them into the viewing window, by creating a text file and uploading it to the UCSC Web site. These custom tracks can be either private or shared with other users.
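To give an idea of what such a text file can look like, here is a minimal Python sketch that writes a custom track in the simple BED layout (a "track" header line followed by one tab-separated line per feature: chromosome, 0-based start, end, name). The track name, description, coordinates and clone names are invented for illustration, and the function make_custom_track is a hypothetical helper, not part of any UCSC software.

```python
from typing import Iterable, NamedTuple


class Feature(NamedTuple):
    chrom: str  # chromosome name as used by the browser, e.g. "chr7"
    start: int  # 0-based start position (BED convention)
    end: int    # end position, not included in the feature
    name: str   # label shown next to the feature


def make_custom_track(track_name: str, description: str,
                      features: Iterable[Feature]) -> str:
    """Return the text of a simple BED custom track (hypothetical helper)."""
    lines = [f'track name="{track_name}" description="{description}"']
    for feat in features:
        if feat.start < 0 or feat.end <= feat.start:
            raise ValueError(f"invalid interval for feature {feat.name}")
        lines.append(f"{feat.chrom}\t{feat.start}\t{feat.end}\t{feat.name}")
    return "\n".join(lines) + "\n"


# Invented example data: two adjacent features on chromosome 7.
example_track = make_custom_track(
    "myClones",
    "Hypothetical clone positions",
    [
        Feature("chr7", 127471196, 127472363, "cloneA"),
        Feature("chr7", 127472363, 127473530, "cloneB"),
    ],
)
print(example_track, end="")
```

The resulting text can be saved to a file and uploaded through the browser's custom track page. Remember that the coordinates only make sense for the assembly version they were computed on.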
Another nice feature of the UCSC Browser is the integration of the BLAT database search software. BLAT is similar to BLAST, but tuned to find almost exact matches to a query sequence very rapidly. BLAT also tries to produce correct exon boundaries when aligning a cDNA to the corresponding genome sequence. BLAT is therefore the ideal tool for finding the genomic location(s) of a piece of cDNA, as it is faster and gives more accurate cDNA-to-genome alignments than BLAST.
The UCSC Genome Browser was my own favorite in the early days of the Human Genome Project, and continues to have a strong and enthusiastic user base.
Ensembl (http://www.ensembl.org)
The Ensembl Project encompasses much more than a genome browser. In fact, the browser is only the tip of a very large iceberg. Most of Ensembl’s resources are devoted to an automatic genome annotation pipeline, which is kept synchronized with the progress of genome finishing and assembly. The annotation produced by Ensembl for selected vertebrate genomes is itself the source material for a manually curated high-quality annotation known as Vega (Vertebrate Genome Annotation).
The browser itself is undoubtedly the most sophisticated of the three, and takes a while to fully master. There are four clickable viewing windows, from top to bottom: Chromosome, Overview, Detailed view, and Basepair view. As their names indicate, these represent four levels of zoom, and each has its own set of associated tracks (except for Chromosome). The Overview displays the positions of genes, markers, contigs, etc. in the region of interest. The Detailed view is the controlling window, and displays tracks very similar to those in the UCSC browser. It also includes a sophisticated set of controls for moving, zooming, and displaying various tracks. The Basepair view is useful mostly to show nucleotide-level information such as SNPs, individual codons, restriction sites, etc. Each of these windows can be hidden or displayed, as needed.
Custom tracks can be added to Ensembl from a so-called DAS (Distributed Annotation System) server. This is not as easy as adding tracks to the UCSC browser, but it is much more flexible, in that DAS servers are maintained completely independently of the main Ensembl server and can be updated by the groups generating the annotation. For example, our group maintains a database of gene models derived from cDNA-to-genome alignments, which is exported through a DAS server. It can be viewed in Ensembl just by adding a DAS source.
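For readers curious about what happens behind the scenes, a DAS client asks the annotation server for the features located in a given region through a plain Web request. The Python sketch below builds such a request, assuming the DAS/1 convention of a "features" command with a "segment" parameter written as reference:start,stop. The server address and source name are placeholders, and das_features_url is a hypothetical helper; a real server may organize its sources differently.

```python
def das_features_url(server: str, source: str, chrom: str,
                     start: int, end: int) -> str:
    """Build a DAS/1-style 'features' request for one region.

    Hypothetical helper: 'server' and 'source' must be replaced by the
    address and data source name of an actual DAS server.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1-based coordinates with start <= end")
    base = server.rstrip("/")
    return f"{base}/das/{source}/features?segment={chrom}:{start},{end}"


# Placeholder server and source names, invented region on chromosome 7.
url = das_features_url("http://das.example.org", "my_gene_models",
                       "7", 127471196, 127495720)
print(url)
```

Because the request is just a URL, the group running the DAS server can update its annotation at any time, and every browser pointing at that source will display the new data on the next request.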
The world of Ensembl is a very rich one, and new features are added almost every day. The BioMart data mining tool (formerly EnsMart) is a recently developed environment that allows the user to extract features from a genome based on criteria of position, annotation, expression, polymorphism, GO category, etc. The internal data available to Ensembl are much richer than for the UCSC browser. For example, the Geneview Web pages give a wealth of information about gene sequence, structure and function. The Ensembl Web site has recently undergone a major update, with many improvements to the interface.
In summary, Ensembl is the most complete source of information about genes and genomes currently available on the Web. However, it takes a while to find all of its features, so if a lab or Branch is thinking about using Ensembl extensively, it is worth enquiring about the availability of trainers from the Ensembl group, who give excellent courses.
The NCBI Map Viewer (http://www.ncbi.nlm.nih.gov/Genomes/)
For many biologists, this is the only genome browser they know about or have ever used, simply because it is part of the NCBI one-stop bioinformatics resource philosophy. The NCBI browser is radically different from the two others, if only because it displays the current sequence vertically rather than horizontally. Instead of tracks mapped to the sequence, it has maps that are displayed next to each other, one of which is used as the reference map. Typically the maps are based on cytogenetic bands, contigs, genes, or genetic markers. The number of features displayed on the map depends on the current zoom factor and on the density of the map. There is a rather primitive zoom function, as well as a search over the current view of the map. The display is customizable through a popup window. The number of maps available for viewing is much more limited than the number of tracks in the Ensembl or UCSC browsers.
The main selling point of the NCBI viewer is that it is tightly integrated with other well-known NCBI resources, such as UniGene, LocusLink, RefSeq and BLAST (although BLAST is also available in Ensembl). To my knowledge, it is also the only one that integrates the sequences of the widely used genome contig assemblies produced by NCBI. But it does not include any information generated outside NCBI, and is thus rather limited in both functionality and information content.
Which browser should I choose?
There is of course no easy answer to this question. The first criterion may be whether your favorite genome is available in a particular browser. Many are available in all three. The second one may be whether a particular track or source of information is available. Many commonly used data sources are at least searchable in all three, but as outlined above there are significant differences. Finally, it will most likely be the look and feel, and whether you can figure out how to carry out the analyses that you want to perform.
My personal favorite, as you may have guessed, is Ensembl. I believe that the little extra investment needed to get started will be amply paid back by the wealth of information that can be retrieved. And if you get stuck, you can always contact the OIT for help…
C. Victor Jongeneel
Director, Office of Information Technology