PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC

CONSTRUCTION OF EVALUATION DATASETS

Kirill Lassounski

, Sahudy Montenegro Gonz

alez

, Annabell del Real Tamariz

and Gabriel Lima de Oliveira

State University of Norte Fluminense, Campos dos Goytacazes, Brazil

Federal University of S

ao Carlos, Sorocaba, Brazil

Keywords:

Information Retrieval, Text Mining, PubMed, Evaluation Dataset, Java.

Abstract:

The NCBI (National Center for Biotechnology Information) provides information about genes, proteins, scien-

tiﬁc literature, molecular structures among other resources related to bio-medicine. The NCBI has a database

called PubMed that stores about 21 millions of scientiﬁc articles. There are many researches in the information

retrieval ﬁeld that need to automatically obtain useful data from PubMed to perform evaluation and testing.

This work describes a Java library to construct datasets, so that numerous scientiﬁc researches could evaluate

their results easily and quickly. Users must set input and output parameters such as article’s attributes (title,

abstract, keywords, etc.) to conform the dataset constructed as a serializable ﬁle. The creation of PubMed

Dataset came from the fact that the authors needed to build their own datasets to evaluate their system results.

In this article it is also presented the BioSearch Reﬁnement system as a case study. The system utilizes the

library to construct the datasets used to evaluate its algorithm for automatic extraction of keyphrases. We also

discuss the beneﬁts obtained from the usage of the PubMed Dataset.

1 INTRODUCTION

The NCBI website has in its PubMed database

around 21 million scientiﬁc articles, protein se-

quences, genomes and other. The database receives

more than 70 million queries every month. In a search

using the term ”brain disease”, 129.977 complete ar-

ticles were retrieved, of which approximately 25%

were published in the last three years. These articles

are examined daily by researchers from all over the

world looking for information that will help them in

their studies. As a result, there are a large number of

research projects in information retrieval that help to

categorize, cluster and describe these articles, provid-

ing quality tools for researchers.

This paper aims to describe the PubMed Dataset,

a Java library for the automatic creation of scientiﬁc

datasets on PubMed database. It offers to bioinfor-

matic researchers a valuable tool for the creation of

datasets to test their proposals. Users must set input

and output parameters such as article’s attributes (ti-

tle, abstract, keywords, etc.) to conform the dataset

constructed as a Java serializable ﬁle. It is impor-

tant to highlight that this proposal came from the fact

that the authors needed to build their own datasets to

evaluate two researches associated with the NCBI and

PubMed. A relevant factor for its development was

the considerable time spent on manual or semiauto-

matic construction of datasets to evaluate the research

results.

To augment the relevance of the PubMed Dataset

library in the world of bioinformatics, one can cite

(Feldman and Sanger, 2007), who point out that much

researches and applications of text mining in this area

exploit the data centered on the NCBI website, as it

represents the largest online repository of scientiﬁc

papers in the area of biomedicine published in En-

glish and other languages. In a complementary way,

(Krallinger et al., 2008) reviewed the literature until

2008, describing more than ﬁfty articles with focus

on efﬁcient retrieval and text mining techniques on

biological databases.

In this work it is also presented a case study of

the PubMed Dataset library. The case study is based

on a Web system called BioSearch Reﬁnement which

proposes an algorithm for automatic extraction and

summarization of keyphrases. The system utilizes the

library to construct the datasets used to evaluate the

keyphrase extraction algorithm.

This paper is organized as follows. Section 2

343

Lassounski K., Montenegro González S., del Real Tamariz A. and Lima de Oliveira G..

PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC CONSTRUCTION OF EVALUATION DATASETS.

DOI: 10.5220/0003797203430346

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2012), pages 343-346

ISBN: 978-989-8425-90-4

 2012 SCITEPRESS (Science and Technology Publications, Lda.)

presents a short review of related work. Section 3 de-

scribes the PubMed Dataset library and its features. In

Section 4, it is introduced the BioSearch Reﬁnement

system, the construction process of two datasets used

for evaluation purposes and the beneﬁts obtained from

the usage of PubMed Dataset are discussed. Section

5 presents ﬁnal considerations of the work.

2 RELATED WORK

(Krallinger et al., 2008) point out that developers of

text mining applications need to accomplish mean-

ingful system evaluations and comparative studies.

They also call attention to the difﬁculty in construct-

ing proper evaluation datasets.

Many works built their evaluation datasets man-

ual or semi-automatically, collecting documents from

a variety of areas, such as scientiﬁc literature, news

articles, magazine articles. Some efforts to collect

documents are described in (Nguyen and Kan, 2007),

(Wan and Xiao, 2008) and (Medelyan, 2009). The

ﬁrst constructed a dataset of 250 PDF documents us-

ing the Google SOAP API to ﬁnd them. These docu-

ments were manually restricted to articles published

in scientiﬁc conferences, later converted to plain

text. The second manually annotated the DUC2001

dataset, consisting of 309 news articles collected from

TREC-9. The annotation process lasted two weeks.

One of the research contributions in the latter was

an automatically extracted corpus of 180 science re-

search papers from collaboratively tagged data on the

bookmarking website CiteULike.org.

Systems described in (Zaremba et al., 2009),

(Yang et al., 2010) and (Uddin et al., 2010) need to

test results based on data extracted from PubMed and

are evidences of the importance of having a tool for

building datasets.

3 PUBMED DATASET

DESCRIPTION

Each article in PubMed has a PMID identiﬁer and is

described by several attributes such as title, abstract,

publication date, among others in a semi-structured

XML format. There were created several sets of tags,

but it is noteworthy that not all articles have all the

tags. The PubMed’s XML ﬁles follow the standard

of the NLM Journal Archiving and Interchange DTD

available at the National Library of Medicine.

3.1 Design Principles

The design principles that governed the development

of the library are: (1) compatibility/portability which

means that the library can be used on any operating

system through any Java code to build the necessary

datasets; (2) ﬂexibility to easily adapt and conﬁgure

the input and output parameters by the users; and (3)

to be open source to allow the reuse and improvement

of the code by the bioinformatics community mem-

bers.

3.2 Goal and Features

The tool aims to provide bioinformatics researchers

with a resource to create test sets from data available

on the PubMed database. The datasets can be used on

qualitative and quantitative assessment of their pro-

posals.

Figure 1 shows the overall process of the PubMed

Dataset library. The class DownloadConfiguration

stores the input and output settings of the user. Once

the parameters are speciﬁed, the tool is ready to con-

nect to PubMed via Esearch. The PMIDs are ob-

tained from the server and the XML documents of the

articles are retrieved and placed into a DOM (Doc-

ument Object Model) using EFetch. The data re-

quested by the user is extracted from this structure.

ArticleDownloader is the class responsible for re-

trieving articles from PubMed.

Figure 1: Overview of the PubMed Dataset process.

The required input parameters are:

• the search term used to retrieve a set of articles

from the PubMed database;

• the maximum number of items in the dataset;

• the list of article’s attributes to conform the

dataset;

• a boolean ﬂag mandatoryAttributes to specify

when an article will be consider or not to conform

the dataset. If it is set to true, an article with a

missing value for any attribute in the list will be

discarded.

The dataset is built according to the parameters in

the initial conﬁguration. The output ﬁle containing

the dataset can be unserialized at any moment and the

dataset will be promptly available.

BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms

344

4 BIOSEARCH REFINEMENT

SYSTEM

The BioSearch Reﬁnement system is a Web applica-

tion developed in Java. Figure 2 shows the main func-

tionalities of the system.

Figure 2: BioSearch Reﬁnement execution ﬂowchart.

The keyphrases are extracted from the abstracts of

scientiﬁc papers returned from an initial query sub-

mitted on PubMed and are used to summarize the

main issues addressed by the retrieved papers. The

results can be ﬁltered by keyphrases, helping the user

in their search for relevant material. The algorithm

to automatically obtain the keyphrases is called Ex-

traction Engine. It is based on concept matching and

reﬁnement. To summarize the main subjects of the ar-

ticles, local keywords are obtained ﬁrst and then dis-

tributed into concepts. After this, the most relevant

concepts from each article go to a global phase where

the ﬁnal keyphrases are deﬁned.

To validate the Extraction Engine proposal, it was

necessary to construct several datasets to evaluate

the quality of the extracted keyphrases in compar-

ison with manually tagged keyphrases. There are

two types of keywords on the PubMed database: the

MeSH index terms and the keywords provided by au-

thors. The experiments were conducted using the

MeSH terms because the author’s keywords are rarely

available in the database.

4.1 Experimental Evaluation

The experiment was accomplished to test the quality

of the datasets and measure execution times. To this

end, two datasets were created for the BioSearch Re-

ﬁnement system. The tests were performed on a 2.13

GHz Intel Core 2 Duo processor with 3GB of RAM.

First, the query terms were deﬁned. Each dataset must

contain a set of abstracts that will be used to automati-

cally generate the keyphrases and the MeSH terms so

we can evaluate the quality of predicted keyphrases

using precision and recall metrics. The general pro-

cess of the experiment is shown in Figure 3.

Figure 3: Experiment dataset ﬂowchart.

The code fragments below show a typical usage

of the library. The fragments specify the input/output

parameter conﬁguration.

import static com.uenf.pubmeddataset.internet.ParameterName.*;

...

DownloadConfiguration config =

new DownloadConfiguration(ABSTRACT, PMID, MESH_TERMS);

ArticleDownloader downloader = new ArticleDownloader(config);

downloader.setMandatoryAttributes(true);

...

//building DATASET 1

List<DynaArticles> articles = downloader.getDynaArticles

("mycobacterium tuberculosis", 10000);

// the search term is treated as stopword

ConceptDataSet cds = new ConceptDataSet(articles,

"mycobacterium tuberculosis");

DataSetSerializer.serializeDataSet(cds, searchTerm);

...

// after unserialized, iterate over the dataset articles

Collection<DynaArticle> articles = cds.getArticles();

for(DynaArticle a:articles{

String ab = (String) a.getParameter(ABSTRACT).getValue();

System.out.println(ab);

}

...

//ConceptDataSet class extends DataSet abstract class and

// implements generateKeyWords abstract method

The results of the dataset construction are dis-

played on Table 1 below:

Table 1: Dataset conﬁguration.

Dataset 1

search term: mycobacterium tuberculosis

max: 10.000 articles

Output ﬁle: data (abstract, MeSH terms)

from 8.045 PubMed articles

Dataset 2

search term: h1n1 inﬂuenza virus

max: 5.000 articles

Output ﬁle: data from 2.858 PubMed articles

Some reasons that caused the reduction of the

number of articles in the datasets were: (1) the

boolean ﬂag was set to true; (2) articles without

MeSH terms were disregarded; (3) if the initial search

term appeared as a MeSH term, it was not included on

the dataset; and (4) if an article ran out of terms it was

also disregarded.

PUBMED DATASET: A JAVA LIBRARY FOR AUTOMATIC CONSTRUCTION OF EVALUATION DATASETS

345

4.2 Discussion about PubMed Dataset

Usage

Using the library for the creation of datasets was

quick and easy, simply by specifying the initial pa-

rameters. The construction of a dataset becomes very

ﬂexible because its data and parameters can be eas-

ily modiﬁed and run again. The number of retrieved

articles will vary according to the attributes that were

speciﬁed by the user. The greater the number of at-

tributes the fewer items will be retrieved as the chance

of an article having all the attributes is smaller. The

quality of the dataset depends on what is available on

PubMed, because the datasets are an expression of the

data contained in the source. The performance results

are illustrated on Table 2.

Table 2: Performance times of PubMed Dataset.

Dataset Steps Runtime

1 Downloading 7,45 minutes

Serialization 2 seconds

Unserialization 3 seconds

Total 9,45 minutes

2 Downloading 3,11 minutes

Serialization 859 milliseconds

Unserialization 1 second

Total 3,13 minutes

The download phase is executed once and always

will be dependent on network trafﬁc. In the case of

Dataset 1, the time for downloading 10.000 articles

was considerable high. Once the dataset is written to

disk, the time taken to load the dataset is very short,

making its use very convenient for testing.

5 CONCLUSIONS

By the team’s experience the time spent to build and

manage evaluation datasets is signiﬁcant when com-

pared to the research itself, giving a positive bal-

ance to the use of this library. The PubMed Dataset

shows itself very useful on the fast, easy and efﬁ-

cient construction of datasets. The three design prin-

ciples: portability, ﬂexibility and to be open source

were achieved with success. Flexibility is a very im-

portant point that needs to be highlighted since any

article attribute can be included in the dataset.

It was presented a case study to create two datasets

using the library. The initial conﬁguration was sim-

ply speciﬁed via Java code. The dataset is created

as a serialized ﬁle and the performance times of the

complete execution are not an issue. Yet, we believe

that future use of the library by other users will bring

new suggestions on how to improve the library to be

even more ﬂexible and adaptable. The API’s source

code and documentation is available for download at

https://github.com/lassounski/PubMed-Dataset.

On an evaluation environment the tests are grad-

ually created and are often added more speciﬁc test

cases. Given this scenario, it is obvious the fact that

the tests will run several times. Without the use of this

library, the time spent obtaining test data and running

the tests would be remarkable. Note that the dataset

integrity and consistency are not guaranteed since the

data retrieved from PubMed may have missing values

and may vary over time.

ACKNOWLEDGEMENTS

This research was supported by CNPq (National

Counsel of Technological and Scientiﬁc Develop-

ment)/Brazil.

REFERENCES

Feldman, R. and Sanger, J. (2007). The Text Mining Hand-

book: Advanced Approaches in Analyzing Unstruc-

tured Data. Cambridge University Press.

Krallinger, M., Valencia, A., and Hirschman, L. (2008).

Linking genes to literature: text mining, informa-

tion extraction, and retrieval applications for biology.

Genome Biology, 9(2).

Medelyan, O. (2009). Human-competitive automatic topic

indexing. PhD thesis, University of Waikato.

Nguyen, T. and Kan, M. (2007). Keyphrase extraction

in scientiﬁc publications. In Proceedings of Interna-

tional Conference on Asian Digital Libraries (ICADL

’07), pages 317–326.

Uddin, J., Abulaish, M., and Dey, L. (2010). A concept-

driven biomedical knowledge extraction and visual-

ization framework for conceptualization of text cor-

pora. Journal of Biomedical Informatics, 43(6):1020–

1035.

Wan, Y. and Xiao, J. (2008). Collabrank: towards a collabo-

rative approach to single-document keyphrase extrac-

tion. In Proceedings of COLING.

Yang, Z., Lin, H., and Li, Y. (2010). Bioppisvmextractor:

A protein-protein interaction extractor for biomedical

literature using svm and rich feature sets. Journal of

Biomedical Informatics, 43:88–96.

Zaremba, S., Ramos-Santacruz, M., Hampton, T., Shetty,

P., Fedorko, J., Whitmore, J., Greene, J., Perna, N.,

Glasner, J., Plunkett, G., Shaker, M., and Pot, D.

(2009). Text-mining of pubmed abstracts by nat-

ural language processing to create a public knowl-

edge base on molecular mechanisms of bacterial en-

teropathogens. BMC Bioinformatics, 10(1).

BIOINFORMATICS 2012 - International Conference on Bioinformatics Models, Methods and Algorithms

346