Digital Library
  Nomenclature-Based Data Retrieval without Prior Annotation: Facilitating Biomedical Data Integration with Fast Doublet Matching
Title: Nomenclature-Based Data Retrieval without Prior Annotation: Facilitating Biomedical Data Integration with Fast Doublet Matching
Author: Jules J. Berman
Appeared in: In silico biology
Paging: Volume 5 (2005) nr. 3 pages 313-322
Year: 2005-08-04
Contents: Assigning nomenclature codes to biomedical data is an arduous, expensive and error-prone task. Data records are coded to provide a common representation of contained concepts, allowing facile retrieval of records via a standard terminology. In the medical field, cancer registrars, nurses, pathologists, and private clinicians all understand the importance of annotating medical records with vocabularies that codify the names of diseases, procedures, billing categories, etc. Molecular biologists need codified medical records so that they can discover or validate relationships between experimental data and clinical data. This paper introduces a new approach to retrieving data records without prior coding. The approach achieves the same result as a search over pre-coded records. It retrieves all records that contain any terms that are synonymous with a user's query term. A recently described fast algorithm (the doublet method) permits quick iterative searches over every synonym for any term from any nomenclature occurring in a dataset of any size. As a demonstration, a 105+ megabyte corpus of PubMed abstracts was searched for medical terms. Query terms were matched against either of two vocabularies and expanded as an array of equivalent search items. A single search term may have over one hundred nomenclature synonyms, all of which were searched against the full database. Iterative searches of a list of concept-equivalent terms involve many more operations than a single search over pre-annotated concept codes. Nonetheless, the doublet method achieved fast query response times (0.05 seconds using Snomed and 5 seconds using the Developmental Lineage Classification of Neoplasms, on a computer with a 2.89 GHz processor). Pre-annotated datasets lose their value when the chosen vocabulary is replaced by a different vocabulary or by a different version of the same vocabulary. The doublet method can employ any version of any vocabulary with no pre-annotation.
In many instances, the enormous effort and expense associated with data annotation can be eliminated by on-the-fly doublet matching. The algorithm for nomenclature-based database searches using the doublet method is described. Perl scripts for implementing the algorithm and testing execution speed are provided as open source documents available from the Association for Pathology Informatics (
Publisher: IOS Press
Source file: Elektronische Wetenschappelijke Tijdschriften
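
Illustration: the abstract describes expanding a query term into every synonym that shares its concept in a nomenclature, then searching the unannotated records for each synonym. The following is a minimal Python sketch of that expansion-and-search step only. It is not the author's Perl implementation and it does not reproduce the doublet method itself, whose details are not given in this record. The vocabulary, the concept codes (C0001, C0002), the sample records, and all function names are hypothetical and exist only for this example.

```python
import re
from collections import defaultdict

# Toy nomenclature as (concept code, term) pairs.
# Codes and terms are invented for illustration, not taken from any real vocabulary.
NOMENCLATURE = [
    ("C0001", "renal cell carcinoma"),
    ("C0001", "hypernephroma"),
    ("C0001", "kidney adenocarcinoma"),
    ("C0002", "hepatocellular carcinoma"),
    ("C0002", "hepatoma"),
]

# Toy stand-ins for unannotated text records.
RECORDS = [
    "A case of hypernephroma with lung metastases.",
    "Hepatoma arising in a cirrhotic liver.",
    "Renal cell carcinoma: a review of 40 cases.",
    "Benign nevus of the skin.",
]


def build_indexes(nomenclature):
    """Map each term to its concept code, and each code to all of its terms."""
    code_of = {}
    terms_of = defaultdict(list)
    for code, term in nomenclature:
        term = term.lower()
        code_of[term] = code
        terms_of[code].append(term)
    return code_of, terms_of


def expand_query(query, code_of, terms_of):
    """Return every term that shares a concept code with the query term.

    A query that is not in the nomenclature is searched as-is.
    """
    query = query.lower()
    code = code_of.get(query)
    if code is None:
        return [query]
    return list(terms_of[code])


def search(records, query, code_of, terms_of):
    """Return the indexes of records that contain any synonym of the query."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in expand_query(query, code_of, terms_of)
    ]
    return [
        i for i, text in enumerate(records)
        if any(p.search(text) for p in patterns)
    ]


CODE_OF, TERMS_OF = build_indexes(NOMENCLATURE)
print(search(RECORDS, "kidney adenocarcinoma", CODE_OF, TERMS_OF))
```

In this sketch the query "kidney adenocarcinoma" retrieves the records that mention "hypernephroma" and "renal cell carcinoma" even though neither record contains the query string, and swapping in a different vocabulary only requires rebuilding the two lookup tables, not re-annotating the records.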

 Koninklijke Bibliotheek - National Library of the Netherlands