Bioinformatics on the LICR Intranet


The LICR Intranet is a place where information can be exchanged between the Institute’s Scientific Directorate, its Branches and, first and foremost, all of its staff. Partly in response to requests from users, but also to showcase its own research efforts in bioinformatics, the Office of Information Technology (OIT) in Lausanne has maintained a series of Web pages within the Intranet that are designed to help LICR staff analyze their own data, LICR-generated proprietary data, or publicly available datasets. These pages are grouped under the heading “Sequence Analysis” on the Intranet home page and are organized as follows:

Map Generator
This is probably the most innovative of our resources. It reflects the OIT’s efforts to produce a comprehensive map of transcripts (including ESTs) on the human genome. The input is a gene identifier (GenBank or RefSeq accession number), a Unigene cluster ID, or a genomic contig named according to the NCBI “NT” nomenclature. The output is a database in ACEDB format covering a human genome region of 1 Mb or less that contains the gene of interest. This database shows every piece of RNA that maps to the genome in the region of interest, together with our reconstruction of the alternatively spliced transcripts that can be deduced from the RNA-to-genome alignments. It also contains all experimentally documented polyadenylation sites that map to the genomic region. The result is a very detailed transcriptional map, which can be invaluable in understanding the fine structure of any gene of interest, as well as its relationships with neighboring genes. It is not uncommon, for example, to find genes whose 3’ UTRs overlap, or that occasionally produce chimeric RNAs by borrowing exons from each other. To visualize and explore the maps, users of the Map Generator need a copy of the ACEDB software installed on their local machines. There are versions of ACEDB for every Unix flavor, as well as for MS Windows. Unfortunately, development of the Mac version was stopped several years ago.
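As an illustration of the three kinds of input described above, the following minimal sketch sorts an identifier into one of those categories using the publicly documented naming conventions. It is not the Map Generator’s own code: the function name `classify_identifier` and the category labels are hypothetical, and whether the Map Generator checks its input in this way is an assumption.

```python
import re

# Hypothetical category labels paired with patterns based on public
# accession conventions. The Map Generator's real validation is unknown.
_PATTERNS = [
    # NCBI genomic contigs, e.g. NT_011512 (optional ".version" suffix)
    ("nt_contig", re.compile(r"^NT_\d+(\.\d+)?$")),
    # Other RefSeq records: two-letter prefix, underscore, digits (e.g. NM_000546)
    ("refseq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    # Human Unigene clusters, e.g. Hs.12345
    ("unigene", re.compile(r"^Hs\.\d+$")),
    # Classic GenBank nucleotide accessions: 1 letter + 5 digits or 2 letters + 6 digits
    ("genbank", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})(\.\d+)?$")),
]


def classify_identifier(identifier):
    """Return a label for the kind of identifier, or "unknown"."""
    text = identifier.strip()
    # Order matters: NT contigs are tested before the general RefSeq pattern.
    for kind, pattern in _PATTERNS:
        if pattern.match(text):
            return kind
    return "unknown"


if __name__ == "__main__":
    for example in ["NM_000546", "NT_011512", "Hs.12345", "U12345", "not-an-id"]:
        print(example, "->", classify_identifier(example))
```

A check of this kind only recognizes the form of an identifier; it cannot tell whether the record actually exists or maps to the human genome.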

Micro Array Information
On these pages, we have tried to give users of the cDNA microarrays (produced at the Sanger Institute) information that was difficult to obtain from the Sanger Institute’s own microarray Web pages, or from the materials that they distributed. The usefulness of these pages is diminishing as the quality of the documentation provided by the Sanger team improves. Nevertheless, some of them may still be of value to LICR staff. The “What’s on the Chip” page allows you to find out what gene a particular probe was derived from, or whether your sequence of interest is represented on the chip. The GeneSpring Definition File reflects our own effort to map probes to genes in a format compatible with the GeneSpring software. The LICR/ICRF Gene Lists reflect the “wish lists” of scientists at the two institutions as to which genes they would like to see represented on the chips. These lists are being used by the Sanger staff to prioritize the inclusion of probes on new versions of the chips. The current version has been frozen (i.e. it is no longer possible to add new genes to the list), but you can still contact Brian Stevenson at the OIT should you discover that your favorite gene is missing.

Gene Discovery
This Web page was designed to allow LICR staff to find new genes in the emerging drafts of the human genome and transcriptome, using known sequences or Prosite patterns as probes. With the rapidly improving annotation of the genome, this page is becoming less likely to yield new discoveries, and it will probably be replaced with new tools for exploring the human transcriptome.
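For readers unfamiliar with Prosite patterns, the sketch below shows one common way to use such a pattern as a probe: translate it into a regular expression and scan a protein sequence with it. This is not the code behind the Gene Discovery page; the function names `prosite_to_regex` and `find_motif` are hypothetical, the example sequence is made up, and only the basic pattern syntax is handled (anchors written inside square brackets are not supported).

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic Prosite pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")      # drop the terminating period
    regex = regex.replace("-", "")           # "-" only separates positions
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 repeats
    regex = regex.replace("x", ".")          # x = any residue
    regex = regex.replace("<", "^").replace(">", "$")   # N- and C-terminal anchors
    return regex


def find_motif(pattern, sequence):
    """Return (start, matched_text) pairs for non-overlapping matches."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group(0)) for m in compiled.finditer(sequence.upper())]


if __name__ == "__main__":
    # N-glycosylation site pattern (Prosite PS00001)
    glyco = "N-{P}-[ST]-{P}."
    print(prosite_to_regex(glyco))           # N[^P][ST][^P]
    print(find_motif(glyco, "MKNGTAANPSG"))  # [(2, 'NGTA')]
```

Start positions are zero-based. In the example sequence, the second asparagine is followed by a proline, so it is correctly excluded.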

Database Search
These pages are provided as alternatives to those offered by the Swiss EMBnet node, often with less intimidating interfaces. As they have been used only very rarely, they will probably be phased out soon.

Non-GCG Utilities
Like the Database Search pages, these are alternatives to the Web pages of the Swiss EMBnet node, and are likely to be phased out. The AA Code Converter will remain: it was requested by the London Office for preparing patent applications, which still require three-letter amino acid codes.
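The kind of conversion performed by the AA Code Converter can be sketched in a few lines. This is not the tool’s actual code: the function name `to_three_letter` is hypothetical, and the choice to write unrecognized residues as “Xaa” is our assumption for the example.

```python
# Standard one-letter to three-letter codes for the 20 common amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence, separator=" "):
    """Convert a one-letter protein sequence to three-letter codes."""
    codes = []
    for residue in sequence.upper():
        if residue.isspace():
            continue  # ignore line breaks and spaces in pasted sequences
        # Assumption: anything outside the 20 standard residues becomes "Xaa".
        codes.append(ONE_TO_THREE.get(residue, "Xaa"))
    return separator.join(codes)


if __name__ == "__main__":
    print(to_three_letter("MKV"))  # Met Lys Val
```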

SeqWeb
This is the Web-based interface to the GCG software suite. We stopped supporting it about two years ago, again for lack of interest. We recommend that users requiring a full-featured sequence analysis software suite acquire Vector NTI or a similar product.

Downloadable Databases
We have a publicly accessible database repository at the address ftp://ftp.licr.org/pub/databases. Here you will find up-to-date versions of many commonly used databases in one convenient location. Of particular interest is a non-redundant set of protein sequences (in the nrprot subdirectory) that proteomics experts have found rather useful. On the more speculative side, we produce predicted protein databases from EST assemblies and draft genome sequences, called TrEST and TrGen, respectively. These are available for an increasing number of species for which sufficient ESTs and/or genome sequences are available.

The Future
Most of our work now focuses on three areas, which we hope to make available on the LICR Intranet as soon as they become usable:

1. The Cancer Immunome Database (CID). This database, of which an unfinished version is already available (http://www2.licr.org/CancerImmunomeDB), aims to document the immune response to cancer comprehensively. Currently, it contains only the data produced by the SEREX initiative, but we plan to expand it significantly in the future. As an adjunct to the CID, we are also compiling a list of human endogenous retroviruses (HERVs) with their pattern of expression across tissues.

2. The tromer program for reconstructing human transcripts, including splice and polyadenylation variants, from a thorough comparison of transcriptome and genome data. The first tangible outputs of this program are an alternative EST cluster database that should gradually replace Unigene as the database of reference on the NCI’s CGAP site, and a mapping of SAGE or MPSS tags to genes, which is being used in the Breast Cancer Program as well as for the discovery of novel CT antigens.

3. The Transcriptome Database derived from tromer, which combines in a single data structure the genomic mapping of genes, the boundaries and connectivity of their exons, and their pattern of expression as deduced from EST, SAGE and MPSS data. While we are already using this database for our own gene discovery efforts, there is as yet no public Web interface available. However, we would be pleased to help you should you have an inquiry that we could answer using the Transcriptome Database.




Ludwig Institute for Cancer Research ©2003