
Technology & Resources

The Network established an IT team with staff at the Victorian Bioinformatics Consortium (Monash University) and in the Computational Research Support Unit (Faculty of Science, the University of Technology, Sydney) - a member of the Australian Partnership for Advanced Computing (APAC) Grid Program.

VBC Presentation

Download Ross Coppel's presentation to our 2007 conference, in which he describes the work that he and his colleagues have undertaken with funding assistance from the Network.

vbc_presentation.pdf (2.9 MB)

The 2005 activities and progress of the IT team are summarised below:

EST Database Development

As a proof-of-capabilities exercise, the Network IT team undertook a project to develop a Sarcoptes scabiei EST database. An NHMRC Medical Genomics Grant had provided initial funding to sequence an EST library generated from mRNA obtained from scabies mites. However, the grant did not adequately fund bioinformatics activities, so the Network took over the analysis, the construction of the database and the public release of the information. This required a great deal of work because the project had to start from scratch: processing the raw sequencer reads for quality and assembling them into contigs.

The scientific leader of the project is Deborah Holt, who provided the Network IT team with two lots of sequences, each covering three different size fractions of the original cDNA library. Three of the fractions were cloned without normalisation; the other three were made from the same cDNA pool using a long PCR procedure and normalisation.

After checking data integrity, the IT team called the bases and assembled the ESTs using the Phred and Phrap programs. This produced 6962 contigs (assembled ESTs) and 3720 singlets (single sequences).

To store and process the ESTs, a database called EST-PAC (short for EST package) was developed. EST-PAC was designed as a sequence-management database into which either nucleic acid or protein sequences in FASTA format can be entered. Users upload groups of sequences and then apply jobs to these groups. For now, jobs are restricted to the BLASTALL programs, PFAM searches and ESTScan2 predictions.
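As an illustration of the upload step, the following minimal Python sketch reads FASTA-formatted text into a named group of sequences. It is not EST-PAC code: the function name parse_fasta, the group name and the dictionary layout are assumptions made for this example.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {sequence_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without an identifier")
            current_id = header[0]  # identifier = first word after '>'
            if current_id in records:
                raise ValueError(f"duplicate sequence id: {current_id}")
            records[current_id] = ""
        elif current_id is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            records[current_id] += line.upper()
    return records


# Made-up sequences, for illustration only.
example = """>contig1 hypothetical example
ACGTACGT
ACGT
>singlet1
ttgaca
""".splitlines()

# A "group" here is just a name plus its sequences (hypothetical layout).
group = {"name": "demo_group", "sequences": parse_fasta(example)}
print(len(group["sequences"]), "sequences in", group["name"])
```

Keeping the sequences of one upload together in a group is what lets a job, such as a BLAST search, be applied to all of them at once.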

The contigs and singlets were first searched with BLAST against the non-redundant database from NCBI, using the default BLAST parameters. Of the 6962 contigs, 4006 have a hit and 2956 show no similarity to any sequence in the database. Of the 3720 singlets, 1040 have a hit and 2680 do not. We also ran BLAST searches against a database of DNA and protein drug targets: 1281 translated contigs show similarity to protein sequences and 32 to DNA sequences.
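The hit and no-hit tallies above can be derived from a BLAST report. The sketch below assumes the 12-column tab-separated report that blastall writes with the -m 8 option, in which the first column is the query identifier. The function name split_by_hit and the sample identifiers are made up for illustration; only the totals at the end come from the text.

```python
from typing import Iterable, Set, Tuple


def split_by_hit(query_ids: Iterable[str],
                 tabular_lines: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Partition query ids into (with_hit, without_hit)."""
    queries = set(query_ids)
    seen = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        seen.add(line.split("\t")[0])  # column 1 = query id
    with_hit = queries & seen
    return with_hit, queries - with_hit


# Made-up ids and hits, for illustration only.
queries = ["Contig1", "Contig2", "Contig3"]
report = [
    "Contig1\tsubjA\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-100\t380",
    "Contig3\tsubjB\t75.0\t120\t30\t0\t5\t124\t50\t169\t2e-20\t95.1",
]
hits, no_hits = split_by_hit(queries, report)
print(sorted(hits), sorted(no_hits))

# Counts reported in the text: hits and non-hits add up to the totals.
assert 4006 + 2956 == 6962  # contigs
assert 1040 + 2680 == 3720  # singlets
print(f"contigs with a hit:  {4006 / 6962:.1%}")
print(f"singlets with a hit: {1040 / 3720:.1%}")
```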

To make the quality of the contigs easy to assess, a schematic drawing of the assembly was developed. After uploading the assembly file (.ace file), users can browse all the contigs or choose to see the assembly with or without BLAST hits. Each sequence is represented by a bar, which makes the assembly easier to interpret.
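The idea of drawing each read as a bar can be sketched in a few lines of Python. This is not the EST-PAC viewer. It is a text-only approximation that assumes the usual .ace record layout, in which a CO line starts a contig, an AF line gives a read's padded start position in the consensus and an RD line gives the read's padded length. The function names and the tiny assembly are invented for the example.

```python
from typing import Dict, Iterable, List

Reads = Dict[str, Dict[str, int]]


def read_ace_layout(lines: Iterable[str]) -> Dict[str, Reads]:
    """Collect, per contig, each read's start (AF) and length (RD)."""
    layout: Dict[str, Reads] = {}
    current = None  # reads of the contig currently being parsed
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # sequence lines and short records are ignored
        tag, name = fields[0], fields[1]
        if tag == "CO":      # CO <contig> <bases> <reads> <segments> <U|C>
            current = layout.setdefault(name, {})
        elif current is None:
            continue
        elif tag == "AF":    # AF <read> <C|U> <padded start>
            current[name] = {"start": int(fields[3]), "length": 0}
        elif tag == "RD" and name in current:  # RD <read> <padded bases> ...
            current[name]["length"] = int(fields[2])
    return layout


def draw_contig(reads: Reads, width: int = 40) -> List[str]:
    """Return one text bar per read, scaled to at most `width` columns."""
    if not reads:
        return []
    left = min(r["start"] for r in reads.values())
    right = max(r["start"] + r["length"] for r in reads.values())
    scale = max(1.0, (right - left) / width)
    rows = []
    for name, r in sorted(reads.items(), key=lambda kv: kv[1]["start"]):
        offset = int((r["start"] - left) / scale)
        bar = max(1, int(r["length"] / scale))
        rows.append(" " * offset + "=" * bar + " " + name)
    return rows


# A tiny invented assembly: one contig built from two overlapping reads.
ACE_TEXT = """AS 1 2

CO Contig1 12 2 1 U
ACGTACGTACGT

AF readA U 1
AF readB U 5
BS 1 12 readA

RD readA 8 0 0
ACGTACGT

RD readB 8 0 0
ACGTACGT
"""

layout = read_ace_layout(ACE_TEXT.splitlines())
rows = draw_contig(layout["Contig1"])
print("\n".join(rows))
```

In a picture like this, a contig supported by many overlapping bars is easier to trust than one resting on a single read.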

A database demo is publicly accessible at:

http://vbc.med.monash.edu.au/~yvan/est-pac-demo/login.php

Username: guest Password: guest DB: EST-1

Results can be queried through the Query link. This page allows searches based on any term inside the database. Users do not need to know the relations between the tables; it is sufficient to select the table holding the results of interest and choose which fields should be displayed.

The current set-up of the database allows any user to run jobs or even to delete data. To avoid loss of data, the IT team plans to create a user account that can only search the results, with no right to run jobs or manipulate data. To restrict access further, a simpler version of the database offering only the query feature is envisaged.

For gene discovery groups who want to install EST-PAC locally, we provide scripts and instructions for downloading and installing the database at the following address:

http://vbc.med.monash.edu.au/~yvan/download.html

The database runs under Linux, Mac OS X and Windows XP. Further descriptions will be submitted for publication and will acknowledge the Network.

Currently, the IT team is developing procedures to maintain and clean the database by writing a 'cleaning' program, which will be run each time the database is used. The team also plans to recalibrate the ESTScan matrix using the programs provided by the ESTScan developers. Finally, assemblies and BLAST hits will be displayed graphically, and we will develop a method to store and retrieve good-quality annotations.

Network Bioinformatics Services

Advanced genomics and functional genomics platforms have been made available to Network scientists by arrangement with the Victorian Bioinformatics Consortium:

The Wasabi genome annotation system

Wasabi was designed to facilitate the rapid annotation of prokaryotic or eukaryotic genomes, and to allow browsing and searching of the annotated genomes. The main features of interest are protein-coding regions, so Wasabi performs various analyses on the proteins beforehand. These analyses are used to provide an initial automatic annotation. They are also presented in a summarised form for use in manual curation, so that the annotator can easily verify or modify the automatic annotation. Multiple annotators can work on a genome simultaneously, and the annotations can be exported to standard file formats such as GenBank/EMBL, GFF and FASTA.

Wasabi models a "genome" as one or more pieces of DNA. These are called "chromosomes", but each could be any DNA sequence, such as a plasmid or contig. Each chromosome has many "features" (e.g. CDS, tRNA, rRNA, repeat_unit), which are wholly defined by their coordinates (start, stop) on a chromosome. These features may be annotated using standard labels such as "product", "function", "subcellular_location" and so on.
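The genome, chromosome and feature hierarchy described above can be pictured as a small data model. The Python sketch below is only an illustration of that description, not Wasabi's internal schema. The class and method names are assumptions, as is the choice of 1-based inclusive coordinates.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Feature:
    kind: str    # e.g. "CDS", "tRNA", "rRNA", "repeat_unit"
    start: int   # assumed 1-based, inclusive
    stop: int    # assumed 1-based, inclusive
    labels: Dict[str, str] = field(default_factory=dict)

    def length(self) -> int:
        return abs(self.stop - self.start) + 1


@dataclass
class Chromosome:
    name: str
    sequence: str  # any DNA sequence: chromosome, plasmid or contig
    features: List[Feature] = field(default_factory=list)

    def add_feature(self, feature: Feature) -> None:
        lo, hi = sorted((feature.start, feature.stop))
        if lo < 1 or hi > len(self.sequence):
            raise ValueError("feature coordinates fall outside the sequence")
        self.features.append(feature)


@dataclass
class Genome:
    name: str
    chromosomes: List[Chromosome] = field(default_factory=list)


# Invented example: a 30-base "plasmid" carrying one annotated CDS.
plasmid = Chromosome("plasmid1", "ATG" + "GCT" * 8 + "TAA")
cds = Feature("CDS", 1, 30, {"product": "hypothetical protein"})
plasmid.add_feature(cds)
genome = Genome("demo_genome", [plasmid])
print(genome.name, len(genome.chromosomes), cds.length())
```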

Usually a large set of features is imported into Wasabi when commencing a new genome annotation. Common sources of CDS features are gene prediction programs such as GeneMarkS and Glimmer2, and tRNAscan-SE is often used to obtain a list of candidate tRNAs. Individual features can be added later through the web interface. Annotations can also be imported from an existing (possibly primitive) annotation in an EMBL or GenBank file.

There are two types of annotators: normal annotators and head annotators. Head annotators can add and delete features and assign features to annotators (i.e. dole out the work). Each feature can be annotated independently by each annotator, which is useful when two- or three-fold coverage of a genome is desired. The individual annotations are then "merged" into a primary annotation at the end for publication.
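The text does not say how Wasabi merges independent annotations. Purely as an illustration of what a merge step could look like, the Python sketch below keeps a label's value when a strict majority of the annotators who filled it in agree, and otherwise leaves it unresolved for a person to decide. The majority rule, the function name and the sample data are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional


def merge_annotations(
    per_annotator: List[Dict[str, str]]
) -> Dict[str, Optional[str]]:
    """Merge label->value dicts by strict majority; None = unresolved."""
    merged: Dict[str, Optional[str]] = {}
    labels = {label for ann in per_annotator for label in ann}
    for label in sorted(labels):
        votes = Counter(ann[label] for ann in per_annotator if label in ann)
        value, count = votes.most_common(1)[0]
        total = sum(votes.values())
        merged[label] = value if count * 2 > total else None
    return merged


# Invented annotations of one feature by three annotators.
annotations = [
    {"product": "heat shock protein", "function": "protein folding"},
    {"product": "heat shock protein", "function": "stress response"},
    {"product": "chaperone", "subcellular_location": "cytoplasm"},
]
primary = merge_annotations(annotations)
print(primary)
```

Here the product is settled by two votes to one, while the function is left unresolved because the two annotators who supplied it disagree.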

The main characteristic that separates Wasabi from other feature annotation software is the large number of preliminary searches it performs for the user. When a user goes to annotate an ORF, say, they are presented with a summarised set of evidence to help them decide what the feature does. The full search reports are only a click away. The evidence currently provided for CDS features is:

  • The amino acid sequence
  • The DNA bases immediately upstream of the start codon
  • Various biochemical measures of the sequence such as weight and pI
  • rpsblast search results
  • blastp against GenBank "nr" protein database
  • blastp searches against other related peptide sequences
  • tblastn searches against other related nucleotide sequences
  • PSORT, PSORT-B and CELLO for the prediction of protein localization sites
  • LipoP prediction of lipoproteins and signal peptides in Gram-negative bacteria
  • SignalP predicts the presence and location of signal peptide cleavage sites
  • TMHMM for prediction of transmembrane helices in proteins
  • TMpred makes a prediction of membrane-spanning regions and their orientation
  • InterProScan identifies protein domains

This set of analyses can be extended by a plug-in-type architecture. It is also possible to bootstrap the annotation process by using these analyses to automatically perform an initial annotation. The human annotator then only needs to verify and possibly correct the automatic annotation. This saves much typing and expedites the annotation process.
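A plug-in style of extension can be sketched as a registry to which new analyses add themselves. The Python example below is a generic illustration of that pattern, not Wasabi's actual plug-in interface. The names ANALYSES, analysis and run_all are assumptions, and the toy analyses are deliberately trivial.

```python
from typing import Callable, Dict

# Registry mapping an analysis name to the function that performs it.
ANALYSES: Dict[str, Callable[[str], object]] = {}


def analysis(name: str):
    """Decorator that registers a function as a named analysis."""
    def register(func: Callable[[str], object]):
        ANALYSES[name] = func
        return func
    return register


@analysis("length")
def protein_length(protein: str) -> int:
    return len(protein)


@analysis("composition")
def composition(protein: str) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for residue in protein:
        counts[residue] = counts.get(residue, 0) + 1
    return counts


def run_all(protein: str) -> Dict[str, object]:
    """Run every registered analysis and collect the evidence."""
    return {name: func(protein) for name, func in ANALYSES.items()}


# Adding a new analysis later needs no change to run_all.
@analysis("starts_with_met")
def starts_with_met(protein: str) -> bool:
    return protein.startswith("M")


evidence = run_all("MKKLLA")  # invented peptide
print(evidence)
```

The same collected evidence could feed an automatic first-pass annotation that a person then verifies.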

Parasitology Network scientists using the VBC installation of Wasabi need only a modern, standards-compliant web browser (e.g. Mozilla, Firefox, IE); from a user's point of view the system is platform independent. The Victorian Bioinformatics Consortium is able to host Network participants’ genome data on its Wasabi server. Hosted data are password protected and can be viewed and annotated only by specified users. To obtain access to the system, contact torsten.seeman@infotech.monash.edu.au.

Microarray Tools

The VBC provides participants in the ARC/NHMRC Research Network for Parasitology with computing infrastructure to support microarray experiments, and with statistical expertise, particularly for the analysis of microarray data.

The VBC maintains computer servers that allow researchers to store microarray data securely and share the data with collaborators anywhere in the world. Researchers can store their microarray experiment results and perform analyses, all via a standard web interface. Currently, the VBC microarray server contains the results of hundreds of experiments.

The VBC has provided, and continues to provide, expertise in the rigorous statistical analysis of microarray data. This is generally performed using open source software. Microarray analysis includes: appropriate normalisation of the data to remove as much bias as possible; identification of differentially expressed genes using appropriate statistical tests; cluster analysis; and visualisations such as Principal Component Analysis or Multi-Dimensional Scaling.
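To give a flavour of the first two steps, the Python sketch below median-centres the log ratios of each array and then computes a one-sample t statistic per gene across replicate arrays. It is a deliberately simplified illustration and not the VBC's pipeline. The function names and the data are invented, and a real analysis would use dedicated statistical packages and correct for multiple testing.

```python
import math
import statistics
from typing import List


def median_centre(log_ratios: List[float]) -> List[float]:
    """Shift one array's log ratios so that their median is zero."""
    med = statistics.median(log_ratios)
    return [x - med for x in log_ratios]


def one_sample_t(values: List[float]) -> float:
    """t statistic testing whether the mean log ratio differs from zero."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # needs at least two replicates
    if sd == 0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / (sd / math.sqrt(n))


# Invented log2 ratios: three replicate arrays, three genes each.
genes = ["geneA", "geneB", "geneC"]
arrays = [
    [2.1, 0.4, 0.5],
    [1.9, 0.1, 0.3],
    [2.4, 0.6, 0.2],
]
normalised = [median_centre(a) for a in arrays]
for i, gene in enumerate(genes):
    t = one_sample_t([a[i] for a in normalised])
    print(gene, round(t, 2))
```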

To access this service, contact david.powell@med.monash.edu.au.