Conceptual fingerprinting as both a literature discovery tool and a means
of semantic interlinkage of bibliographic, sequence and image databases
course leader: Les Grivell
Large-scale data generation is becoming increasingly important in biological
research, and biologists are increasingly aware that they risk being swamped by
a tsunami of information. Alongside the difficulties of extracting information
from the literature, the limitations of the tools and systems used to store,
sort, interlink and order the more than 500 data resources, ranging from
sequence to pathology and from gene expression to metabolism data, are becoming
increasingly apparent. The problem is exacerbated by the heterogeneity of the
underlying data and storage formats, which makes it difficult for scientists to
retrieve, integrate and analyse the information they need.
E-BioSci makes use of a technology developed by Collexis b.v. Instead of
relying solely on keyword- or citation-based index searching, the system
employs the Collexis abstraction engine to “fingerprint” text. That is, it
produces concept profiles that can subsequently be used for semantic comparison
of documents either with each other, or with other text in database records.
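The idea of comparing concept profiles can be illustrated with a minimal
sketch. It is not the Collexis engine, whose internals are not described here.
It assumes that a fingerprint is simply a count of thesaurus concepts found in
a text and that two fingerprints are compared by cosine similarity. The toy
thesaurus, the concept identifiers and the function names are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy thesaurus: surface term -> concept identifier.
# A real thesaurus (e.g. one derived from MeSH) would be far larger.
TOY_THESAURUS = {
    "mitochondria": "C001",
    "mitochondrion": "C001",
    "mitochondrial": "C001",
    "yeast": "C002",
    "saccharomyces": "C002",
    "genome": "C003",
    "genomes": "C003",
}


def fingerprint(text):
    """Return a concept profile: concept identifier -> occurrence count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(TOY_THESAURUS[t] for t in tokens if t in TOY_THESAURUS)


def cosine_similarity(a, b):
    """Compare two concept profiles; 0.0 means no shared concepts."""
    # A Counter returns 0 for concepts it does not contain.
    dot = sum(weight * b[concept] for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = fingerprint("Mitochondrial genome of yeast")
    doc_b = fingerprint("The Saccharomyces mitochondrion")
    print(cosine_similarity(doc_a, doc_b))
```

Because the comparison works on concepts and not on words, the two example
texts are judged similar even though they share no keyword.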
E-BioSci’s current prototype uses a modified MeSH metathesaurus. This contains
a number of vocabularies that have been translated into various European
languages, thus making it possible to match documents written in different
languages. However, other thesauri better suited to knowledge domains outside
the strictly biomedical can be used, and indeed already are.
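Why a translated thesaurus allows matching across languages can be shown with
a small sketch. It assumes that terms in each language map to one shared
concept identifier; the table, the identifiers and the function names are
hypothetical and are not taken from the E-BioSci metathesaurus.

```python
import re

# Hypothetical multilingual term table: concept -> language -> terms.
CONCEPT_TERMS = {
    "C_YEAST": {"en": ["yeast"], "de": ["hefe"], "fr": ["levure"]},
    "C_MITO": {
        "en": ["mitochondria"],
        "de": ["mitochondrien"],
        "fr": ["mitochondries"],
    },
}

# Invert the table so that any term, in any language, leads to its concept.
TERM_TO_CONCEPT = {
    term: concept
    for concept, by_language in CONCEPT_TERMS.items()
    for terms in by_language.values()
    for term in terms
}


def concepts_in(text):
    """Return the set of concepts mentioned in a text of any listed language."""
    tokens = re.findall(r"\w+", text.lower())
    return {TERM_TO_CONCEPT[t] for t in tokens if t in TERM_TO_CONCEPT}


def shared_concepts(text_a, text_b):
    """Return the concepts that two texts have in common."""
    return concepts_in(text_a) & concepts_in(text_b)


if __name__ == "__main__":
    print(shared_concepts("Mitochondria in yeast", "Mitochondrien der Hefe"))
```

An English and a German title are matched on both concepts here, although
they have no word in common.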
Besides its conventional web-page interface, E-BioSci also offers an XML/SOAP
web service for literature search. Just as HTML was crucial to the development
of the world-wide web, XML is now proving important well beyond the
representation of web pages in an attractive form. The language allows the
encapsulation, transfer and sharing of information that can subsequently be
seamlessly integrated into a user’s environment.
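How a search request might be encapsulated in XML can be sketched as follows.
The envelope namespace is the standard SOAP 1.1 one, but the service
namespace, the operation name searchLiterature and its parameters are
hypothetical placeholders, since the actual E-BioSci service interface is not
described here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:literature-search"  # hypothetical service namespace
NAMESPACES = {"soap": SOAP_NS, "svc": SERVICE_NS}


def build_search_request(query, max_results=10):
    """Wrap a literature query in a SOAP envelope and return it as text."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # Hypothetical operation and parameter names, for illustration only.
    operation = ET.SubElement(body, f"{{{SERVICE_NS}}}searchLiterature")
    ET.SubElement(operation, f"{{{SERVICE_NS}}}query").text = query
    ET.SubElement(operation, f"{{{SERVICE_NS}}}maxResults").text = str(max_results)
    return ET.tostring(envelope, encoding="unicode")


def read_query(xml_text):
    """Extract the query string again, as a receiving program would."""
    root = ET.fromstring(xml_text)
    node = root.find("soap:Body/svc:searchLiterature/svc:query", NAMESPACES)
    return node.text


if __name__ == "__main__":
    message = build_search_request("yeast mitochondria", max_results=5)
    print(message)
    print(read_query(message))
```

The same message can be produced and read by any program that understands
XML, which is what allows results to be integrated into a user’s own
environment.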