Bioinformatics on the LICR Intranet
The LICR Intranet is a place where information can be exchanged among
the Institute’s Scientific Directorate, its Branches and, first
and foremost, all of its staff. Partly in response to requests from users,
but also to showcase its own research efforts in the field of bioinformatics,
the Office of Information Technology (OIT) in Lausanne has maintained
a series of Web pages within the Intranet that are designed to help LICR
staff perform some analyses on their own data, on LICR-generated proprietary
data, or on publicly available datasets. They are grouped under the heading
“Sequence Analysis” on the Intranet
home page. The Web pages are organized as follows:
Map Generator
This is probably the most innovative of our resources. It reflects the
OIT’s efforts in producing a comprehensive map of transcripts (including
ESTs) on the human genome. The output of the Map Generator is a database
in ACEDB format covering a human genome region of 1 Mb or less that contains
a gene of interest. The input is a gene identifier (GenBank or RefSeq
accession number), a UniGene cluster ID, or a genome segment named according to
the NCBI “NT” nomenclature. The ACEDB database created using
the Map Generator will show every piece of RNA that maps to the genome
in the region of interest, and our reconstruction of the alternatively
spliced transcripts that can be deduced from the RNA to genome alignments.
Additionally, the database contains all experimentally documented polyadenylation
sites that map to the genomic region. The result is a very detailed transcriptional
map, which can be invaluable in understanding the fine structure of any
gene of interest, as well as its relationships with neighboring genes.
It is not uncommon, for example, to find genes whose 3’ UTRs overlap,
or that occasionally produce chimeric RNAs by borrowing exons from each
other. Users of the Map Generator need a copy of the ACEDB software
installed on their local machines to visualize and
explore the maps. There are versions of ACEDB for every Unix flavor, as
well as for MS Windows. Unfortunately, development of the Mac version
was stopped several years ago.
Micro Array Information
On these pages, we have tried to make available to users of the cDNA microarrays
(produced at the Sanger Institute) information that was difficult
to obtain from the Sanger Institute’s own microarray Web pages, or from the
materials that it distributed. The usefulness of these pages is diminishing
as the quality of the documentation provided by the Sanger team improves. Nevertheless,
some of them may still be of value to LICR staff. The “What’s
on the Chip” page allows you to find out what gene a particular
probe was derived from, or whether your sequence of interest is represented
on the chip. The GeneSpring Definition File reflects our own effort to
map probes to genes in a format compatible with the GeneSpring software.
The LICR/ICRF Gene Lists reflect the “wish lists” of scientists
at the two institutions as to which genes they would like to see represented
on the chips. These lists are being used by the Sanger staff to prioritize
the inclusion of probes on new versions of the chips. The current version
has been frozen (i.e. it is no longer possible to add new genes to the
list), but you can still contact Brian Stevenson at the OIT should you
discover that your favorite gene is missing.
Gene Discovery
This Web page was designed to allow LICR staff to find new genes in the
emerging drafts of the human genome and transcriptome, using known sequences
or Prosite patterns as probes. With the rapidly improving annotation of
the genome, this page is becoming less likely to yield new discoveries,
and it will probably be replaced with new tools for exploring the human transcriptome.
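For readers unfamiliar with Prosite patterns, the sketch below shows how such a pattern can be translated into a regular expression and scanned against a protein sequence. It is only an illustration of the general idea, not the code behind the Gene Discovery page; the function names prosite_to_regex and find_matches are hypothetical, and the sketch handles only the basic syntax (x, [..], {..}, repeat counts and the < and > anchors).

```python
import re

# One pattern element: a residue, "x", an allowed set [..] or a forbidden
# set {..}, optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(
    r"(?P<core>[A-Za-z]|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a basic Prosite pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # a trailing period ends the pattern
    pieces = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)
        core = match.group("core")
        if core in ("x", "X"):  # any residue
            core = "."
        elif core.startswith("{"):  # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        else:  # single residue or [..] set
            core = core.upper()
        lo, hi = match.group("lo"), match.group("hi")
        if lo and hi:
            core += "{%s,%s}" % (lo, hi)
        elif lo:
            core += "{%s}" % lo
        pieces.append(prefix + core + suffix)
    return "".join(pieces)


def find_matches(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."))
    print(find_matches("R-G-D", "AARGDKKRGD"))
```

Start positions are 0-based, and overlapping matches are not reported; a production tool would also need to handle the less common parts of the pattern syntax.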
Database Search
These pages are provided as alternatives to those offered by the Swiss
EMBnet node, often with less intimidating interfaces. As they have been
used only very rarely, they will probably be phased out soon.
Non-GCG Utilities
Similar to the Database Search pages, these are alternatives to the Web
pages of the Swiss EMBnet node, and are likely to be phased out. The AA
Code Converter, which was requested by the London Office for the purpose
of generating patent applications (they still require three-letter amino
acid codes), will remain.
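As an illustration of what a one-letter to three-letter conversion involves, here is a minimal sketch. It is not the AA Code Converter itself, whose implementation is not described here; the function names one_to_three and three_to_one are hypothetical, and the table covers only the 20 standard residues plus the ambiguity codes B, Z and X.

```python
# One-letter to three-letter amino acid codes (standard residues plus
# the ambiguity codes B, Z and X).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx", "Z": "Glx", "X": "Xaa",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def one_to_three(sequence, separator=" "):
    """Convert a one-letter protein sequence to three-letter codes."""
    try:
        return separator.join(ONE_TO_THREE[aa] for aa in sequence.upper())
    except KeyError as err:
        raise ValueError("unknown one-letter code: %s" % err) from None


def three_to_one(sequence):
    """Convert whitespace-separated three-letter codes to one-letter codes."""
    try:
        return "".join(THREE_TO_ONE[code.capitalize()] for code in sequence.split())
    except KeyError as err:
        raise ValueError("unknown three-letter code: %s" % err) from None


if __name__ == "__main__":
    print(one_to_three("MKV"))
    print(three_to_one("Met Lys Val"))
```

Unknown symbols raise an error rather than being silently dropped, which seems the safer choice when the output is destined for a formal document such as a patent application.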
SeqWeb
This is the Web-based interface to the GCG software suite. We stopped
supporting it about two years ago, again for lack of interest. We recommend
that users requiring a full-featured sequence analysis software suite
acquire Vector NTI or a similar product.
Downloadable Databases
We have a publicly accessible database repository at the address ftp://ftp.licr.org/pub/databases.
Here you will find up-to-date versions of many commonly used databases,
in one convenient location. Of particular interest is a non-redundant
set of protein sequences (in the nrprot subdirectory) that proteomics
experts have found rather useful. On the more speculative side, we produce
predicted protein databases from EST assemblies and draft genome sequences,
called TrEST and TrGen, respectively. These are available for an increasing
number of species for which sufficient EST and/or genome sequence data are available.
The Future
Most of our work now focuses on three areas, which we hope to make available
on the LICR Intranet as soon as they become usable:
1. The Cancer Immunome Database (CID). This database, for which an unfinished
version is already available (http://www2.licr.org/CancerImmunomeDB),
aims to provide a comprehensive documentation of the immune response to
cancer. Currently, it contains only data produced
by the SEREX initiative, but we plan to expand it significantly
in the future. As an adjunct to the CID, we are also compiling
a list of human endogenous retroviruses (HERVs) with their pattern of
expression across tissues.
2. The tromer program for reconstructing human transcripts,
including splice and polyadenylation variants, from a thorough comparison
of transcriptome and genome data. The first tangible outputs of this program
are an alternative EST cluster database that should gradually replace
UniGene as the database of reference on the NCI’s CGAP site, and
a mapping of SAGE or MPSS tags to genes, which is being used in the Breast
Cancer Program as well as for the discovery of novel CT antigens.
3. The Transcriptome Database derived from tromer, which combines,
in a single data structure, data about the genomic mapping of genes, the
boundaries and connectivity of their exons, and their pattern of expression
as deduced from EST, SAGE and MPSS data. While we are already using this
database for our own gene discovery efforts, there is as yet no public
Web interface available. However, we would be pleased to help you should
you have an inquiry that we could answer using the Transcriptome Database.