JCSE, vol. 6, no. 2, pp. 141-142, 2012
Preface for the Special Issue on Languages in Biology and Medicine
Jong C. Park, Limsoon Wong, Goran Nenadic, Jung-jae Kim
KAIST, South Korea / National University of Singapore, Singapore / University of Manchester, UK / Nanyang Technological University, Singapore
Abstract: Language is the universal means to represent, convey, and question knowledge. The 4th International Symposium on Languages in Biology and Medicine (LBM 2011), held at the Nanyang Technological University (NTU), Singapore, on December 14 and 15, 2011, hosted presentations dealing with many forms of language, including natural language in scientific citations and clinical records, ontologies and taxonomies, query strings, and images.
In this special issue, we are pleased to present four contributions which, despite the diversity of their language forms, share several common themes, including 1) the association of data expressed in language (e.g., documents, queries) with pre-defined categories (or terms) using machine learning techniques, known as text classification, and 2) the enhancement of specific aspects of information retrieval, including indexing, querying, and semantic labeling, for biomedical search engines (e.g., PubMed).
The key idea of text classification by machine learning is to represent text with characteristic features chosen according to the nature of the task. Not all information is useful for every text classification application, and learning the best combination of features for a given application remains an open issue. Read et al. [1] test various types of features, including bag-of-words, n-grams, syntactic dependencies, and WordNet synonym sets, for classifying suicide notes into fine-grained emotion categories (e.g., love, hopelessness), where the classification task and data originate from the 2011 i2b2 Medical NLP Challenge [2]. The challenge they faced in this task is the extreme imbalance among categories in the training data: for example, two out of 15 categories take up 30% of the training data. To address this, they learn a separate binary classifier with the best feature combination for each category, significantly improving the micro-averaged F-measure over using a single best feature combination for all categories.
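As a rough illustration of this per-category strategy, the sketch below trains one binary classifier per emotion category on toy data; the notes, category names, and feature choice are placeholders rather than the actual data or pipeline of Read et al. [1].

```python
# A minimal sketch of per-category binary classification for an imbalanced,
# fine-grained labeling task, in the spirit of the approach described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = ["i love you all", "there is no hope left for me", "thank you for everything"]
# One binary label vector per category (1 = the category applies to the note).
labels = {
    "love":         [1, 0, 0],
    "hopelessness": [0, 1, 0],
    "thankfulness": [0, 0, 1],
}

classifiers = {}
for category, y in labels.items():
    # In the paper, the best feature combination (bag-of-words, n-grams,
    # dependencies, WordNet synsets) is selected per category; here a plain
    # bag-of-words model with class weighting stands in for that step.
    clf = make_pipeline(
        CountVectorizer(),
        LogisticRegression(class_weight="balanced"),
    )
    clf.fit(notes, y)
    classifiers[category] = clf

# A note receives every category whose binary classifier predicts 1.
test = ["no hope is left"]
print({c: int(clf.predict(test)[0]) for c, clf in classifiers.items()})
```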
Jimeno-Yepes et al. [3] present an application of text classification to MeSH (Medical Subject Headings). The MeSH vocabulary is designed for indexing and searching MEDLINE citations. The indexing is done by human experts with the help of an automatic tool, the Medical Text Indexer (MTI) [4]. The task presented in their paper is to improve automatic indexing performance by selecting, for each MeSH heading, an available algorithm that performs better than MTI. The challenge here is that there are numerous MeSH headings (more than 26 thousand), so it is not feasible to manually pick the best algorithm for every heading. The authors apply all the available algorithms to each heading and select the one with the best F-measure on a test dataset. They find that, for 2,712 of the roughly 26,000 headings, the selected algorithms significantly outperform MTI and will replace it for labeling those headings.
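The selection step amounts to scoring every candidate algorithm per heading and keeping the winner. The sketch below illustrates that idea on toy data; the candidate names and relevance judgments are hypothetical and not the actual MTI components or evaluation data of Jimeno-Yepes et al. [3].

```python
# A minimal sketch of per-heading algorithm selection by F-measure.
from sklearn.metrics import f1_score

def select_per_heading(candidates, gold_by_heading, predictions_by_algorithm):
    """For each heading, keep the candidate with the best F-measure on the
    test data (ties go to the first-listed candidate, here MTI)."""
    chosen = {}
    for heading, gold in gold_by_heading.items():
        scores = {
            name: f1_score(gold, predictions_by_algorithm[name][heading])
            for name in candidates
        }
        chosen[heading] = max(candidates, key=scores.get)
    return chosen

# Toy binary relevance judgments over five test citations for one heading.
gold = {"Neoplasms": [1, 0, 1, 1, 0]}
preds = {
    "MTI":        {"Neoplasms": [1, 0, 0, 1, 0]},
    "algorithmA": {"Neoplasms": [1, 0, 1, 1, 1]},
}
print(select_per_heading(["MTI", "algorithmA"], gold, preds))
# -> {'Neoplasms': 'algorithmA'}
```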
Luu and Kim [5] apply text classification techniques to query strings. Query reformulation is a common practice among biologists, and automatic query suggestion is thus a valuable feature of biomedical search engines such as PubMed. Automating it requires understanding how a user would revise a query, for example by adding a new word or removing or replacing an existing one, which can be cast as a classification task. The authors address the task by introducing various machine learning features, such as query string/word length and the hit rate of a query. They then suggest methods to deal with each type of query reformulation (e.g., addition, removal, replacement), improving the accuracy for the addition type more than tenfold over the corresponding functionality of PubMed (‘Related searches’).
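To make the feature side of such a classifier concrete, the sketch below extracts query-level features of the kind mentioned above; the exact feature set is illustrative rather than that of Luu and Kim [5], and get_hit_count is a hypothetical stand-in for a call to a search engine such as PubMed.

```python
# A minimal sketch of query features for predicting the next reformulation
# step (addition, removal, or replacement of a term).
def get_hit_count(query: str) -> int:
    # Hypothetical: would issue the query to a search engine and return
    # the number of matching citations.
    return 0

def reformulation_features(query: str) -> dict:
    words = query.split()
    return {
        "char_length": len(query),          # query string length
        "word_length": len(words),          # number of query terms
        "hit_count": get_hit_count(query),  # how many results the query returns
    }

# Features like these can feed a classifier that predicts which kind of
# reformulation a user is likely to apply next.
print(reformulation_features("breast cancer BRCA1 mutation"))
```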
Textual language processing is important for many aspects of information retrieval, as demonstrated by [3, 5]. Another form of language, illustrations (or images), also plays a significant role in biomedical information retrieval, since “having access to the illustrations prior to obtaining the whole publication will greatly enhance user search experience” [6]. Demner-Fushman et al. [7] present a multimodal biomedical information retrieval system called OpenI. Their challenge is to “move beyond conventional text-based searching … by combining text and visual features in search queries and document representation.” They create an index of MEDLINE citations that are enriched, in addition to the usual textual features, with information mined from illustrations in full-text articles using Content-Based Image Retrieval (CBIR) techniques. With OpenI, users can thus issue text and/or visual queries to retrieve citations together with their illustrations.
We would like to acknowledge the help of the following programme committee members and additional reviewers for
this special issue and LBM 2011: Judith Blake (Jackson Lab, USA), Olivier Bodenreider (National Library of Medicine,
USA), Wendy Chapman (UC San Diego, USA), Kevin Cohen (University of Colorado, USA), Dina Demner-Fushman
(National Library of Medicine, USA), Juliane Fluck (SCAI, Germany), Jorg Hakenberg (Hoffmann-La Roche Inc.,
USA), Jin-Dong Kim (Database Center for Life Science, Japan), Hyunju Lee (GIST, South Korea), Dietrich Rebholz-
Schuhmann (EMBL-EBI, UK), Stefan Schulz (Medical University Graz, Austria),