JCSE

JCSE, vol. 6, no. 2, pp.141-142, June, 2012

Preface for the Special Issue on Languages in Biology and Medicine

Jong C. Park, Limsoon Wong, Goran Nenadic, Jung-jae Kim
KAIST, South Korea/ National University of Singapore, Singapore/ University of Manchester, UK/ Nanyang Technological University, Singapore

Abstract: Language is the universal means to represent, convey, and question knowledge. The 4th International Symposium on Languages in Biology and Medicine (LBM 2011), held at the Nanyang Technological University (NTU), Singapore, on December 14 and 15, 2011, hosted presentations that dealt with many forms of language, including natural languages in scientific citations and clinical records, ontologies and taxonomies, query strings, and images. In this special issue, we are pleased to present four contributions which, despite the diversity of language forms, share many common themes, including 1) association of data expressed in languages (e.g., documents, queries) with pre-defined categories (or terms), known as text classification, by using machine learning techniques and 2) enhancing specific aspects of information retrieval, including indexing, querying, and semantic labeling, for biomedical search engines (e.g., PubMed).

The key idea of text classification by machine learning is to represent text with its characteristic features, depending on the nature of the task. Not all information is useful for every text classification application, and learning the best combination of features for a given application remains an open issue. Read et al. [1] test various types of features, including bag-of-words, n-grams, syntactic dependencies, and WordNet synonym sets, for classifying suicide notes into fine-grained emotion categories (e.g., love, hopelessness), where the classification task and data originate from the 2011 i2b2 Medical NLP Challenge [2]. The challenge they faced in the task is the extreme imbalance among categories in the training data: for example, two out of 15 categories take up 30% of the training data. To address this, they separately learn a binary classifier with the best feature combination for each category, significantly improving micro-average F-measure over the best feature combination for all categories.

Jimeno-Yepes et al. [3] present an application of text classification to MeSH (Medical Subject Headings). The MeSH vocabulary is designed for indexing and searching the MEDLINE citations. The indexing is done by human experts with the help of an automatic tool named Medical Text Indexer (MTI) [4]. The task presented in the current paper is to improve automatic indexing performance by selecting, for each MeSH heading, an available algorithm that performs better than MTI. The challenge here is that there are numerous MeSH headings (more than 26 thousand), so it is not possible to manually pick the best algorithm for every heading. The authors apply all the available algorithms to each heading and select the one with the best F-measure against a test dataset. They found that, for 2,712 of the more than 26 thousand headings, the best algorithm significantly outperforms MTI and will replace MTI for labeling those headings.

Luu and Kim [5] apply text classification techniques to query strings. Query reformulation is a common practice among biologists, and automatic query suggestion is thus a valuable functionality of biomedical search engines like PubMed. The automation requires an understanding of how the user would revise a query, for example by adding a new word or by removing or replacing a word, which could be considered a classification task. The authors address the task by introducing various machine learning features, such as query string/word length and the hit rate of a query. They then suggest methods to deal with each type of query reformulation (e.g., addition, removal, replacement), improving the accuracy on the addition type more than tenfold over the corresponding functionality of PubMed (‘Related searches’).

Textual language processing is important for many aspects of information retrieval, as demonstrated by [3, 5].
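As an illustration of the per-heading selection described for MeSH indexing above, the following is a minimal sketch and not the authors' implementation. It assumes each algorithm's output for one heading is a set of citation identifiers and that a gold-standard set is available. The function names, the algorithm names `algo_a` and `algo_b`, and the toy data are all hypothetical, and the significance testing mentioned in the text is omitted.

```python
from typing import Dict, Set


def f_measure(predicted: Set[str], gold: Set[str]) -> float:
    """Balanced F-measure (F1) between predicted and gold citation sets."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def select_best_algorithm(
    predictions: Dict[str, Set[str]], gold: Set[str], baseline: str
) -> str:
    """Pick the algorithm with the highest F-measure for one heading.

    The baseline is kept unless another algorithm scores strictly higher.
    """
    best_name = baseline
    best_score = f_measure(predictions[baseline], gold)
    for name, predicted in predictions.items():
        score = f_measure(predicted, gold)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical toy data for a single heading: citations that truly carry
# the heading, and the citations each algorithm assigned it to.
gold = {"c1", "c2", "c3", "c4"}
predictions = {
    "MTI": {"c1", "c2", "c5"},
    "algo_a": {"c1", "c2", "c3"},
    "algo_b": {"c1"},
}

best = select_best_algorithm(predictions, gold, baseline="MTI")
print(best)
```

Running the same selection once per heading, rather than choosing a single algorithm for the whole vocabulary, is what makes the approach feasible for tens of thousands of headings without manual tuning.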
Another form of language, illustrations (or images), also has a significant role in biomedical information retrieval, since “having access to the illustrations prior to obtaining the whole publication will greatly enhance user search experience [6].” Demner-Fushman et al. [7] present a multimodal biomedical information retrieval system called OpenI. Their challenge is to “move beyond conventional text-based searching … by combining text and visual features in search queries and document representation.” They create an index for MEDLINE citations that are enriched, in addition to the usual textual features, with information mined from illustrations in full-text articles by using Content-Based Image Retrieval (CBIR) techniques. Using OpenI, users can thus retrieve citations with illustrations through text and/or visual queries.

We would like to acknowledge the help of the following programme committee members and additional reviewers for this special issue and LBM 2011: Judith Blake (Jackson Lab, USA), Olivier Bodenreider (National Library of Medicine, USA), Wendy Chapman (UC San Diego, USA), Kevin Cohen (University of Colorado, USA), Dina Demner-Fushman (National Library of Medicine, USA), Juliane Fluck (SCAI, Germany), Jorg Hakenberg (Hoffmann-La Roche Inc., USA), Jin-Dong Kim (Database Center for Life Science, Japan), Hyunju Lee (GIST, South Korea), Dietrich Rebholz-Schuhmann (EMBL-EBI, UK), Stefan Schulz (Medical University Graz, Austria),

