Conceptual fingerprinting as both a literature discovery tool and a means of semantic interlinkage of bibliographic, sequence and image databases
course leader: Les Grivell

Large-scale data generation is becoming increasingly important in biological research, and biologists are increasingly aware that they risk being swamped by a tsunami of information. Alongside the difficulty of extracting information from the literature, the limitations of the tools and systems used to store, sort, interlink and order the more than 500 data resources, ranging from sequence to pathology and from gene expression to metabolism data, are becoming increasingly apparent. The problem is exacerbated by the great heterogeneity of the underlying data and storage formats, which makes it difficult for scientists to retrieve, integrate and analyse the information they need.

E-BioSci makes use of a technology developed by Collexis b.v. Instead of relying solely on keyword- or citation-based index searching, the system employs the Collexis abstraction engine to “fingerprint” text. That is, it produces concept profiles that can subsequently be used to compare documents semantically, either with each other or with other text in database records. E-BioSci’s current prototype uses a modified MeSH metathesaurus. This contains a number of vocabularies that have been translated into various European languages, making it possible to match documents written in different languages. However, other thesauri, better suited to knowledge domains outside the strictly biomedical, can be used, and indeed are being used.
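The idea of a concept profile can be illustrated with a minimal sketch. This is not the Collexis abstraction engine, whose internals are not described here. It assumes a toy thesaurus (the `THESAURUS` entries and concept IDs are invented for illustration and are not taken from MeSH), counts the concepts found in a text, and compares two profiles by cosine similarity, which is one common choice and an assumption on our part.

```python
import math
import re
from collections import Counter

# Hypothetical toy thesaurus: surface terms in several languages map to
# language-independent concept IDs. Entries are illustrative, not from MeSH.
THESAURUS = {
    "mitochondria": "C001",
    "mitochondrion": "C001",
    "mitochondrien": "C001",  # German
    "gene": "C002",
    "gen": "C002",            # German
    "expression": "C003",
    "yeast": "C004",
    "hefe": "C004",           # German
}


def fingerprint(text):
    """Return a concept profile: concept ID -> number of occurrences."""
    tokens = re.findall(r"[a-zäöüß]+", text.lower())
    return Counter(THESAURUS[t] for t in tokens if t in THESAURUS)


def similarity(fp_a, fp_b):
    """Cosine similarity between two concept profiles (0.0 if either is empty)."""
    dot = sum(weight * fp_b[concept] for concept, weight in fp_a.items())
    norm_a = math.sqrt(sum(w * w for w in fp_a.values()))
    norm_b = math.sqrt(sum(w * w for w in fp_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


english = fingerprint("Gene expression in yeast mitochondria")
german = fingerprint("Gen Expression in Hefe Mitochondrien")
unrelated = fingerprint("Protein folding kinetics")
print(similarity(english, german))     # same concepts, different languages
print(similarity(english, unrelated))  # no shared concepts
```

Because both texts are reduced to the same concept IDs before comparison, the English and German titles match even though they share almost no surface words, which is the property a multilingual thesaurus is meant to provide.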

Besides its conventional web-page interface, E-BioSci also offers an XML/SOAP web service for literature search. HTML was crucial to the development of the World Wide Web; the use of XML now goes far beyond presenting web pages in an attractive form. The language allows information to be encapsulated, transferred and shared, and then integrated seamlessly into a user’s environment.
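As a rough illustration of how XML-encapsulated results might be pulled into a user's own environment, the sketch below parses a SOAP 1.1 envelope with Python's standard library. The payload is entirely hypothetical: the element names `searchResponse`, `record`, `title` and `score`, and the sample records, are assumptions made for this example and do not describe the actual E-BioSci service.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (standard); everything inside the Body is hypothetical.
SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"

SAMPLE_RESPONSE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <searchResponse>
      <record id="1">
        <title>Mitochondrial gene expression in yeast</title>
        <score>0.92</score>
      </record>
      <record id="2">
        <title>Nuclear control of organelle biogenesis</title>
        <score>0.57</score>
      </record>
    </searchResponse>
  </soap:Body>
</soap:Envelope>"""


def parse_results(xml_text):
    """Extract the (hypothetical) result records from a SOAP response."""
    root = ET.fromstring(xml_text)
    body = root.find(f"{{{SOAP_NS}}}Body")
    if body is None:
        return []
    return [
        {
            "id": rec.get("id"),
            "title": rec.findtext("title"),
            "score": float(rec.findtext("score")),
        }
        for rec in body.iter("record")
    ]


for hit in parse_results(SAMPLE_RESPONSE):
    print(hit["id"], hit["score"], hit["title"])
```

Once the records are plain dictionaries, they can be sorted, filtered or merged with data from other resources using the user's own tools, without any reference to how a web page would display them.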