Genomic island datasets derived using comparative genomics
Introduction
Genomic islands (GIs) are
clusters of genes in prokaryotic genomes of probable horizontal origin. GIs are
disproportionately associated with microbial adaptations of medical or
environmental interest.
Recently, multiple programs for automated detection of
GIs have been developed that use sequence composition characteristics (e.g.,
oligonucleotide bias). To robustly evaluate the accuracy of such methods,
we propose that a dataset of GIs be constructed that is not based on
artificially generated sequences and is constructed using criteria that are
independent of sequence composition-based analysis approaches.
We have developed a comparative genomics approach,
named IslandPick, that identifies both very probable islands and non-island
regions and permits an independent assessment of the accuracy of sequence
composition-based GI prediction tools. The approach involves:
1) The flexible, automated selection of comparative genomes for each query genome,
using a distance function that picks appropriate genomes for identification of
GIs
2) The identification of regions unique to the query genome in comparison with the
previously chosen genomes (positive dataset)
3) The identification of conserved genomic regions that are common to all genomes
being compared (negative dataset)
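The three steps above can be sketched with simple interval arithmetic. This is only an illustrative sketch, not the IslandPick implementation: the distance window in `pick_comparison_genomes`, the `aligned` input (regions of the query genome that align to each comparison genome, as half-open coordinates), and the `min_size` cutoff are all assumptions introduced for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the query genome


def pick_comparison_genomes(distances: Dict[str, float],
                            min_dist: float, max_dist: float) -> List[str]:
    """Step 1 (hypothetical rule): keep genomes whose distance to the
    query genome falls inside a chosen window."""
    return sorted(g for g, d in distances.items() if min_dist <= d <= max_dist)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals: List[Interval], length: int) -> List[Interval]:
    """Regions of [0, length) that are not covered by the intervals."""
    gaps: List[Interval] = []
    pos = 0
    for start, end in merge(intervals):
        if start > pos:
            gaps.append((pos, start))
        pos = max(pos, end)
    if pos < length:
        gaps.append((pos, length))
    return gaps


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersection of two sorted, merged interval lists."""
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def unique_and_conserved(aligned: Dict[str, List[Interval]], length: int,
                         min_size: int) -> Tuple[List[Interval], List[Interval]]:
    """Steps 2 and 3: regions aligned to no comparison genome (candidate
    positives) and regions aligned to every comparison genome (candidate
    negatives). Regions shorter than min_size are dropped."""
    if not aligned:
        raise ValueError("at least one comparison genome is required")
    per_genome = [merge(regions) for regions in aligned.values()]
    covered_by_any = merge([iv for regions in per_genome for iv in regions])
    unique = complement(covered_by_any, length)
    conserved: List[Interval] = [(0, length)]
    for regions in per_genome:
        conserved = intersect(conserved, regions)

    def keep(regions: List[Interval]) -> List[Interval]:
        return [(s, e) for s, e in regions if e - s >= min_size]

    return keep(unique), keep(conserved)
```

For example, with two comparison genomes aligned to a 300-base query at `{"A": [(0, 100), (150, 300)], "B": [(0, 120), (200, 300)]}`, only positions 120 to 150 align to neither genome, while positions 0 to 100 and 200 to 300 align to both.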
Using our constructed datasets, we investigated the
accuracy of several sequence composition-based GI prediction tools.
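One way such an evaluation could be scored is at the nucleotide level, counting predicted bases that fall in the positive dataset as true positives and predicted bases that fall in the negative dataset as false positives. The sketch below assumes that scheme; the function names are hypothetical, each input list is assumed to contain non-overlapping half-open intervals, and bases outside both datasets are simply ignored.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end)


def overlap_bases(a: List[Interval], b: List[Interval]) -> int:
    """Total bases shared by two lists of non-overlapping intervals."""
    return sum(max(0, min(e1, e2) - max(s1, s2))
               for s1, e1 in a for s2, e2 in b)


def total_bases(regions: List[Interval]) -> int:
    """Total length of a list of non-overlapping intervals."""
    return sum(end - start for start, end in regions)


def evaluate(predicted: List[Interval], positive: List[Interval],
             negative: List[Interval]) -> Dict[str, float]:
    """Score GI predictions against positive (GI) and negative (non-GI)
    reference regions, counting individual bases."""
    tp = overlap_bases(predicted, positive)   # predicted, and a known GI
    fp = overlap_bases(predicted, negative)   # predicted, but a known non-GI
    fn = total_bases(positive) - tp           # known GI bases that were missed
    tn = total_bases(negative) - fp           # non-GI bases correctly left out
    scored = tp + tn + fp + fn
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / scored if scored else 0.0,
    }
```

Because predictions that fall outside both reference datasets are not counted, a tool is neither rewarded nor penalized for regions whose status is unknown.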
A manuscript describing the generation of the positive and
negative datasets, and the accuracy results of other sequence composition-based
GI prediction tools, is currently under review.
An overview of our approach to produce these datasets
is available here.
Datasets and source code
The positive dataset of GIs and the negative datasets (non-GIs) for 118 bacterial
chromosomes, which can be used to evaluate the accuracy of other GI predictors,
as well as the source code for our comparative genomics approach, IslandPick,
are available under a GPL license. An additional positive dataset of GIs that uses
more divergent genomes to predict more ancient GIs is also freely available.
Questions or problems can be emailed to islandpick-mail@sfu.ca (contact name
Morgan Langille).