Active Mining
H. Motoda (Ed.)
IOS Press, 2002
Toward Active Mining from On-line Scientific Text
Abstracts Using Pre-existing Sources
TuanNam Tran and Masayuki Numao
[email protected],
[email protected]
Department of Computer Science,
Tokyo Institute of Technology
2-12-1 O-okayama, Meguro-ku, Tokyo 152-8552, JAPAN
Abstract. As biomedical research enters the post-genome era and most
new information relevant to biology research is still recorded as free
text, there is a rapidly growing need to extract information
from biological literature databases such as MEDLINE. Unlike
previous work, this paper presents a framework for mining
MEDLINE that makes use of a pre-existing biological database on the
yeast S. cerevisiae. Our framework is based on an active
mining perspective and consists of two tasks: an information retrieval task
that actively selects articles in accordance with users' interests, and a
text data mining task that uses association rule mining and term extraction
techniques. The preliminary results indicate that the proposed method
may be useful for consistency checking and error detection in the annotation
of MeSH terms in MEDLINE records. We believe that the proposed
approach, which combines information retrieval based on pre-existing
databases with text data mining, could be extended to other fields such
as Web mining.
1 Introduction
Because of the rapid progress of computer hardware and network technologies, a vast
amount of information can be accessed through a variety of databases and sources.
Biology research inevitably plays an essential role in this century, producing a large
number of papers and on-line databases in this field. However, even though the number
and the size of sequence databases are growing rapidly, most new information relevant
to biology research is still recorded as free text. As biomedical research enters the
post-genome era, new kinds of databases that contain information beyond simple sequences
are needed, for example, information on protein-protein interactions, gene regulation,
etc. Most early work on literature data mining for biology has concentrated on
analytical tasks such as identifying protein names [5], on simple techniques such as word
co-occurrence [12] and pattern matching [8], or on more general natural language
parsers that can handle considerably more complex sentences [9], [15].
In this paper, a different approach is proposed for dealing with literature data mining
from MEDLINE, a biomedical literature database which contains a vast amount of
useful information on medicine and bioinformatics. Our approach is based on active
mining, which focuses on active information gathering and data mining in accordance
with the purposes and interests of the users. In detail, our current system contains two
subtasks: the first task exploits existing databases and machine learning techniques
for selecting useful articles, and the second one uses association rule mining and term