A Proof-of-Concept System
Diana Maclean
and Margo Seltzer
Stanford University, Palo Alto, CA U.S.A.
Harvard School of Engineering and Applied Sciences, Cambridge, MA U.S.A.
Data mining, Knowledge discovery, Vioxx, Myocardial infarction, UMLS, MetaMap.
As the prevalence of blogs, discussion forums, and online news services continues to grow, so too does the
portion of this Web content that relates to health and medicine. We propose that everyday, medically-oriented
Web content is a valuable and viable data source for medical hypothesis generation and testing, despite its
being noisy. In this paper, we present a proof-of-concept system supporting this notion. We construct a corpus
comprising news articles relating to the drugs Vioxx, Naproxen and Ibuprofen that were published between
1998 and 2002. Using this corpus, we show that there was a significant link between Vioxx and the concept
“Myocardial Infarction” well before the drug was withdrawn from the market in 2004. Indeed, within the
Vioxx-related content, the concept ranks amongst the top 3.3% in terms of importance. When compared with
the Naproxen and Ibuprofen control literatures, the term occurs significantly more frequently in the Vioxx-
related content.
There exists a wealth of data, publicly available and easily accessible, that reports individuals’ experience with a diversity of medical conditions and treatments. This corpus, which we shall refer to as the “Popular Medical Literature” (PML), is the web, or rather, all medically related web content. The PML
ranges in depth and focus. For example, a blog dedicated to a patient’s experience with experimental cancer treatments, an online discussion forum for expectant mothers, and a news article about Aspirin are all PML constituents. Although noisy, the PML is inherently valuable: it almost always contains some information about a given medical topic. Recent research has noted the efficacy of using the web to bring together communities of patients, families and medical practitioners on sites such as CureTogether and Patients Like Me, and to detect disease outbreaks, as demonstrated on Google Flu and Biosurveillance.
We propose that the PML contains medical facts and connections that may be unknown, or non-obvious, to patients and medical professionals alike. Moreover, knowledge of such connections
might assist in patient diagnosis, identification of best
practice, or in highlighting fruitful problems for fu-
ture research. For example, considering that the PML
contains data from more individuals than a clinician
might treat in a lifetime, a patient’s web search for
symptoms matching her own may be a viable path to
diagnosis. A recent analysis of self-reported symptom data by CureTogether, for instance, found a statistically significant correlation between Asthma and Infertility, a hypothesis that has been previously tested in clinical trials (Carmichael, 2009). At the same time, conventional web search can lead to “cyberchondria”, which occurs when users search for innocuous symptoms, but become drawn to rare conditions (White and Horvitz, 2008). We believe that
in addition to uncovering realistic hypotheses, a well
implemented medical data mining system may help to
alleviate such problems.
Despite its potential, distilling medical knowledge
from the PML poses considerable challenges. First, the
data is unstandardized and unstructured, as contrib-
utors tend to have no professional medical training.
Second, web pages are not always time stamped,
making time series analysis difficult. Finally, the
PML is not scientific data: irrelevant or incorrect “fads” may obfuscate true signals.
MacLean D. and Seltzer M..
DOI: 10.5220/0003166403030308
In Proceedings of the International Conference on Health Informatics (HEALTHINF-2011), pages 303-308
ISBN: 978-989-8425-34-8
2011 SCITEPRESS (Science and Technology Publications, Lda.)
We present a proof-of-concept prototype, demon-
strating that it is both possible and productive to mine
the PML for medical hypotheses. We pose the follow-
ing scenario: Imagine that it is the year 2002 and that
your doctor has suggested prescribing Vioxx. The
drug is relatively new and there is little in the pub-
lished literature discussing possible adverse reactions.
However, several thousand people are already using
Vioxx. What if you could find out what they are ex-
periencing? Can we develop general purpose tech-
niques utilizing the PML that may have suggested, as
early as 2002, a link between Vioxx and heart attacks?
We answer this question by testing the following two hypotheses:
1. The concept “myocardial infarction” (the official term for “heart attack”) was more relevant in the Vioxx-related PML than in comparative control literatures.
2. The concept “myocardial infarction” was significant within the Vioxx PML, compared with other medical concepts.
Strong support for both hypotheses will suggest
that we may have been able to predict that Vioxx
was far more dangerous than initially proposed, well
before its withdrawal from the market.
The contributions of this paper include a technique
for distilling medical concepts from web articles; a
technique for using the TF-IDF metric to rank de-
scriptive concepts within an entire corpus; and a pro-
totype system providing a proof of concept for the vi-
ability of medical knowledge and hypothesis mining
from the PML.
The rest of this paper proceeds as follows: Section 2 discusses related work. Section 3 gives an overview of the story of Vioxx, highlighting what was
and was not known about it at various points before
it was withdrawn from the market. In Section 4, we
describe the data set that we built for our analyses.
Section 5 presents our experiments as well as results
in testing the two hypotheses stated above. Section 6
Vioxx was released in 1999 and removed from circulation in 2004 due to a large number of adverse cardiovascular effects, in particular, heart attacks.
We are not the first to suggest mining the web for previously unknown relationships between medical concepts. Our work lies in the intersection of three literatures: medical knowledge discovery, contagious disease outbreak monitoring, and web-based collaborative hypothesis testing. We discuss these in turn.
2.1 Medical Knowledge Discovery
Medical Knowledge Discovery (MKD) focuses on
discovering conceptual paths (or B-concepts) be-
tween a source concept (A) and target concept (C).
For example: Fish Oil (A) → Vascular Health (B) → Raynaud’s Syndrome (C) is a famous MKD result (Swanson, 1986). The field was pioneered by Don Swanson in the 1980s, in his search for what he called “undiscovered public knowledge” in the medical literature (Swanson, 1986). “Complementary but disjoint noninteractive structures in the literature of science do exist,” he wrote, “and can lead to novel scientific hypotheses that are worth testing” (Swanson, 2001). His work resulted in a number of useful
findings, including the fish oil result mentioned above
and the effectiveness of magnesium as a treatment for
migraines (Swanson, 1988). Most importantly, Swanson’s discoveries highlighted the efficacy of an automated solution to hypothesis discovery and verification.
Swanson’s approach was: given medical terms A
and C, generate a B-list of linking concepts, and then
filter that B-list until only the most relevant links re-
main (Smalheiser and Swanson, 1998). Most state of
the art MKD systems are based on Swanson’s original
method (Hu et al., 2005; Gordon and Lindsay, 1996;
Pratt and Yetisgen-Yildiz, 2003; Weeber et al., 2001),
some with several improvements (Hu, 2005). A lim-
itation of MKD, however, is that it requires both the
source and the target terms as input.
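As a rough illustration of the A-B-C procedure described above, the following Python sketch generates and filters a B-list from a toy corpus. The function name `b_list`, the co-occurrence scoring rule, the stop-concept filter, and the example documents are all assumptions made for illustration; they are not taken from Swanson's systems or from any of the cited MKD tools.

```python
from collections import Counter


def b_list(docs, a, c, stop_concepts=frozenset(), top_k=10):
    """Rank candidate B-concepts linking source concept `a` to target `c`.

    `docs` is an iterable of sets of concepts, one set per document.
    """
    with_a = Counter()  # concept -> number of documents shared with `a`
    with_c = Counter()  # concept -> number of documents shared with `c`
    for concepts in docs:
        others = concepts - {a, c}
        if a in concepts:
            with_a.update(others)
        if c in concepts:
            with_c.update(others)
    # A linking concept must co-occur with both endpoints.
    candidates = (set(with_a) & set(with_c)) - set(stop_concepts)
    # Illustrative score (an assumption): the weaker of the two links.
    scored = {b: min(with_a[b], with_c[b]) for b in candidates}
    ranked = sorted(scored.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Toy documents, invented for this example.
docs = [
    {"fish oil", "blood viscosity", "study"},
    {"fish oil", "platelet aggregation", "study"},
    {"raynaud's syndrome", "blood viscosity", "study"},
    {"raynaud's syndrome", "platelet aggregation", "blood viscosity"},
]
links = b_list(docs, "fish oil", "raynaud's syndrome", stop_concepts={"study"})
print(links)
```

Note that both `a` and `c` must be supplied by the caller, which is exactly the limitation of MKD noted above.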
While our work follows the spirit of MKD, two
notable differences are that we utilize the PML (rather
than academic literature) and that we do not re-
quire a priori identification of source and target con-
cepts. Closest to our work in the MKD literature
is the LitLinker project, with which we share several methodological techniques (Pratt and Yetisgen-Yildiz, 2003). Given a source concept, A, LitLinker retrieves related articles. It then extracts medical concepts from the result set titles using MetaMap, and filters these concepts to remove (1) extremely common concepts, (2) concepts highly similar to A, and (3) concepts highly dissimilar to A. The latter are filtered by restricting the semantic type of the result set, a technique also used by Hu et al. (Hu et al., 2005). This process yields a
set of B-concepts and is repeated on these set ele-
ments to acquire a set of C-concepts. LitLinker suc-
cessfully replicated Swanson’s migraine/magnesium
discovery with a small result set of relevantly-ranked
concepts (Pratt and Yetisgen-Yildiz, 2003).
2.2 Contagious Disease Outbreak
While the previous section detailed work in extract-
ing linkages in the medical literature, recent work
has focused on utilizing the PML to predict disease
outbreaks. Work in this area illustrates the richness
and efficacy of the PML as a data source for medical
anomaly detection. Google Flu, for example, oper-
ates on the premise that individuals experiencing in-
fluenza symptoms will engage in health-seeking be-
havior on the internet. By aggregating this data over
geographic regions, Google Flu Trends claims to improve the Centers for Disease Control’s influenza outbreak predictions by 2 weeks (Ginsberg et al., 2008).
Taking a similar approach, HealthMap (Brown-
stein et al., 2008) uses international news articles
to predict the outbreak of any disease, anywhere.
HealthMap has been highly successful in predict-
ing epidemics in real time. Although the domain
of HealthMap is strongly restricted to outbreak de-
tection, the research underlying the project considers
text-mining methods designed for PML, article rele-
vance scoring, semantic disambiguation, and several
other topics of relevance to our work.
2.3 Online Medical Knowledge Sharing
Some prior work on online health communities
(OHCs) supports the proposal that PML is a viable
data source for medical hypothesis generation. Sites
such as CureTogether, Patients Like Me, Med-, and others, cater to OHC participants primarily by providing online tools for recording and analyzing personal health data. The most common form
of interaction is the discussion forum, through which
OHCs accumulate data about personal illness, symp-
toms and treatments. Prior work indicates that this
data is a viable source for medical hypothesis discov-
ery. We point to the example mentioned in Section
1, in which CureTogether discovered a correlation be-
tween Asthma and Infertility that had been previously
studied in the academic medical literature.
In 1998, Merck filed for FDA approval of Vioxx as
a treatment for arthritis. The application included
data from a drug study conducted on approximately
5400 osteoarthritis patients (Solomon et al., 2002;
Prakash and Valentine, 2007), showing no difference
in adverse cardiovascular effects between patients
treated with Vioxx, placebo, and comparative Non-Steroidal Anti-Inflammatory Drugs (NSAIDs), such as Ibuprofen (Gilmartin, 2004). In May 1999, the FDA approved Vioxx as a treatment for osteoarthritis and acute pain; Merck released Vioxx onto the market.
Earlier that same year (January), Merck initiated the Vioxx Gastrointestinal Outcomes Research (VIGOR) study, comprising approximately 8000 patients. The goal of the study was to show that Vioxx was safer on the gastrointestinal tract than a competing arthritis treatment, Naproxen (Berenson et al.,
2004; Gilmartin, 2004; Prakash and Valentine, 2007;
Reuters, 2005). This was an important feature of
Vioxx: several similar painkillers (such as aspirin)
cause ulcers and other gastrointestinal side effects that
result in the deaths of thousands of Americans every
year (Berenson et al., 2004).
Eleven months into the VIGOR study, 79 of the 4000 patients on Vioxx had suffered heart attacks, compared with 41 of the 4000 patients taking Naproxen. Although the VIGOR results were published in the New England Journal of Medicine in
November 2000, significant data detailing the adverse
cardiovascular effects were excluded (Prakash and
Valentine, 2007). Not until April 2002 did a warning
appear on Vioxx packaging about adverse cardiovas-
cular effects (Gilmartin, 2004; Prakash and Valen-
tine, 2007; Reuters, 2005). Finally, two and a half
years later (September 2004), Merck voluntarily with-
drew Vioxx from the market.
We use this historical timeline to select a date
range in which to conduct our study. We want our
end date to be well in advance of the media hype sur-
rounding Vioxx’s dangerous side effects. We pick
1998-2002, leaving a window of 1.5 years before
Vioxx is withdrawn from the market.
In building a data corpus, we limit ourselves to PML
news articles related to Vioxx and two control drugs:
Naproxen and Ibuprofen. All three are painkillers,
and all belong to the same “drug family” known
as NSAIDs (Non-Steroidal Anti-Inflammatory Drugs).
We chose Naproxen as a control drug, because it was
used as the control drug against Vioxx in the VIGOR
study (Berenson et al., 2004; Reuters, 2005). We
chose Ibuprofen because it is one of the most com-
mon painkillers on the market. Both control drugs
lack significant cardiac side effects.
We constructed our corpus by issuing a Google search on the three drugs in question. We rely on Google News’ time categorization of the search hits to return articles published between 1998 and 2002. To retrieve the Vioxx-related PML, we searched for articles that contained the terms “Vioxx”, “Ceoxx” or “Rofecoxib” (two generic names for Vioxx), but that did not contain “Ibuprofen” or “Naproxen”. We proceeded similarly for the control articles. After scraping the search results and discarding the “bad” articles, we had 603 Vioxx articles, 141 Naproxen articles, and 500 Ibuprofen articles.
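The inclusion and exclusion rule described above can be sketched as a small text filter. This is a minimal illustration, assuming articles are available as plain strings; the names `DRUG_NAMES`, `mentioned_drugs` and `assign_segment` are hypothetical, and the assumption that each control drug is matched only by its own name is ours rather than something stated in the text.

```python
import re

# Search terms per corpus segment. The Vioxx terms are those given in the
# text; using a single name for each control drug is an assumption.
DRUG_NAMES = {
    "vioxx": ("vioxx", "ceoxx", "rofecoxib"),
    "naproxen": ("naproxen",),
    "ibuprofen": ("ibuprofen",),
}


def mentioned_drugs(text):
    """Return the drug keys whose names appear as whole words in `text`."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return {drug for drug, names in DRUG_NAMES.items() if words & set(names)}


def assign_segment(text):
    """Return the single drug an article mentions, or None if it mentions
    no drug or more than one (such articles are excluded)."""
    drugs = mentioned_drugs(text)
    return next(iter(drugs)) if len(drugs) == 1 else None


print(assign_segment("FDA approves Vioxx (rofecoxib) for acute pain"))
print(assign_segment("Trial compares Vioxx with naproxen"))
```

Discarding every article that mentions more than one drug keeps the three segments disjoint, at the cost of losing comparative articles.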
Confronted with the challenge of extracting the
relevant, medical content from the HTML source
code, we used the NIH’s MetaMap system (Aron-
son, 2006), which maps text to medical concepts
from the Unified Medical Language System (UMLS). MetaMap was developed by the National Library of Medicine for the express purpose
of extracting biomedical concepts from text. The
Metathesaurus component of UMLS comprises a gi-
ant database of medical concepts drawn from several
source vocabularies. Each concept is tagged with at
least one semantic type from the Semantic Network,
yielding a broad but consistent concept categorization. MetaMap is capable of sophisticated semantic parsing and term disambiguation, including the refining of semantically-equivalent terms. In our case, note that “heart attack”, “myocardial infarction” and other equivalent terms will all be mapped to the concept “myocardial infarction”.
Finally, we find that restricting the semantic map-
ping types available to MetaMap not only makes
results more meaningful, but also eliminates some
“junk” from the source HTML. After translating each article from text to “medicalese”, we use Apache, an open-source search engine library, to index our corpus.
While misclassifications occur, manual inspection suggests
these are rare. As web data is difficult to date accurately, we de-
cided that this small error was acceptable.
Page errors, empty articles etc.
A corner case is presented when text from article advertise-
ments is parsed by MetaMap as relevant, medical information.
While such text could comprise only noise, targeted advertisement
text might actually increase the page information.
To present a viable proof-of-concept for medical hy-
pothesis generation and testing on the PML, we test
two hypotheses against our corpus. We want to
know both whether the concept “Myocardial Infarc-
tion” (MI) is more significant in the Vioxx-related
PML than in the control-drug related PML, as well
as whether the concept MI is significant within the
Vioxx-related PML itself. Support for the first hypothesis indicates that Vioxx and MI have a different relationship than one might expect. Support for the second suggests that this relationship is meaningful.
Note that we consider only the simplest PML sub-
sets: those that mention only Vioxx, only Naproxen
or only Ibuprofen. In future work we plan to analyze
articles containing combinations thereof. While sub-
scribing to this level of simplicity does lose informa-
tion, it also allows us to make stronger assumptions
of independence between the article subsets.
5.1 H1: The Concept “Myocardial
Infarction” is more Significant than
Expected in Vioxx-related Articles
We use frequency of MI term occurrence as a proxy
for significance between corpora. If the concept MI
holds no particular significance in the Vioxx-related
PML, then we expect that the MI term will have sim-
ilar frequency distributions in the Vioxx-related and
the control PML. We take this as the null hypothesis,
with the alternative being that the MI term occurs sig-
nificantly more frequently in the Vioxx-related PML.
Table 1 summarizes the mean and variance of the MI
frequency in each corpus segment.
Table 1: Summary of MI term frequencies in the drug-segmented PML.
Drug      # Articles   Mean Freq.   Variance Freq.
Vioxx     603          0.14         0.59
Control   641          0.08         0.34
To test our null hypothesis we use a one-sided
Mann-Whitney test of the MI frequency counts for
each document in the Vioxx-related PML against the
MI frequency counts for each document in the control drug-related PML. We assume that, by restricting our corpus segments to single drug mentions (as discussed above) in the given timeline, the samples are independent. The p-value for the test is 0.04453,
and thus we reject the null hypothesis at the 0.05 sig-
nificance level.
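For readers who wish to run this kind of comparison, the sketch below implements a one-sided Mann-Whitney U test in plain Python. It uses the normal approximation with tie and continuity corrections, which is an assumption on our part: the text does not say which variant or software produced the reported p-value, and the toy counts below are invented, so this code is not expected to reproduce 0.04453.

```python
import math


def mann_whitney_greater(x, y):
    """One-sided Mann-Whitney U test of H1: values in x tend to exceed those in y.

    Returns (U statistic for x, approximate p-value from a normal
    approximation with tie and continuity corrections).
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    n = n1 + n2
    pooled = sorted((v, i) for i, v in enumerate(list(x) + list(y)))
    ranks = [0.0] * n
    tie_term = 0.0
    start = 0
    while start < n:
        end = start
        while end + 1 < n and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # average 1-based rank of the tie group
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = avg_rank
        t = end - start + 1
        tie_term += t ** 3 - t
        start = end + 1
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:  # every observation is tied: no evidence either way
        return u, 1.0
    z = (u - mean - 0.5) / math.sqrt(var)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u, p


# Invented per-document MI counts, mostly zero as in a sparse corpus.
vioxx_counts = [0, 0, 1, 0, 3, 0, 2, 0]
control_counts = [0, 0, 0, 0, 1, 0, 0, 0]
u_stat, p_value = mann_whitney_greater(vioxx_counts, control_counts)
print(u_stat, p_value)
```

A rank-based test is a natural fit here because per-document counts are mostly zero and heavily tied, so a normality assumption on the raw counts would be hard to justify.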
5.2 H2: MI is a Significant Term within
the Vioxx-related PML
There are a total of 4696 unique terms and 603 doc-
uments in our Vioxx-specific PML. The MI term oc-
curs a total of 82 times, ranking in the top 250 (5%)
most frequent terms in the corpus. Usually we might
consider these top ranking terms irrelevant because of
their high frequency; however, many common stopwords have already been removed from the text by the MetaMap UMLS mapping. Even so, frequency is a coarse measure of importance in text. We therefore also construct a ranking of important terms within the
Vioxx-related PML. Our ranking is based on the term-frequency inverse-document-frequency (tf-idf) metric:
\[ \mathrm{tf\text{-}idf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d \in D : t \in d\}|} \]
where $\mathrm{tf}(t, d)$ is $t$'s normalized frequency in a document $d \in D$, and $D$ is the set of all corpus documents. Terms that occur frequently in a document, but
infrequently in the rest of the corpus, will get a high
tf-idf score, while common terms will get a low score.
Within a document d, tf-idf is intended to rank highly
those terms that “best describe that document”.
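The following sketch computes tf-idf scores for a toy corpus. Normalizing term frequency by document length and using the natural logarithm are assumptions; the text specifies only that the frequency is normalized. The function name `tf_idf` and the example documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    `docs` is a list of token lists. tf is the term count divided by the
    document length; idf is log(|D| / number of documents containing the term).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Invented concept lists standing in for MetaMap output.
docs = [
    ["pain", "pain", "myocardial infarction"],
    ["pain", "arthritis"],
]
for doc_scores in tf_idf(docs):
    print(doc_scores)
```

In this example “pain” appears in every document, so its idf is log(1) = 0 and it scores zero everywhere, while “myocardial infarction” and “arthritis” receive positive scores in the documents that contain them.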
To obtain a set, S, of terms that are highly rele-
vant to the Vioxx-related corpus, we add the top 10%
of terms for each document by tf-idf ranking to S. For
example, if a document contained 50 terms, we would
add the top 5 tf-idf scored terms to S. Each element of
S is scored according to the number of documents for
which it was a top-10% tf-idf term. Although simple,
the technique is intuitive. While rare words will likely
score within the top 10% tf-idf ranked terms of some
document, only words truly descriptive of corpus segments should score within this range for several documents. Conversely, terms that occur so frequently as to be meaningless should score within that range very rarely.
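The construction of S can be sketched as follows, taking per-document tf-idf scores as input. Rounding the top-10% cutoff down, keeping at least one term per non-empty document, and breaking score ties alphabetically are assumptions that the text does not specify; the function name `corpus_relevant_terms` is hypothetical.

```python
from collections import Counter


def corpus_relevant_terms(doc_scores, top_percent=10):
    """Count, for each term, the documents in which it ranks in the top
    `top_percent` percent of terms by tf-idf score.

    `doc_scores` is a list of {term: tf-idf score} dicts, one per document.
    """
    s = Counter()
    for scores in doc_scores:
        if not scores:
            continue
        # Integer arithmetic avoids floating-point surprises at the cutoff;
        # a document with 50 terms contributes its top 5.
        k = max(1, len(scores) * top_percent // 100)
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        s.update(term for term, _ in ranked[:k])
    return s


# Three invented documents of ten terms each, so k = 1 per document.
filler = {f"t{i}": float(i) for i in range(9)}
doc_scores = [
    {**filler, "myocardial infarction": 100.0},
    {**filler, "myocardial infarction": 100.0},
    {**filler, "pain": 100.0},
]
s = corpus_relevant_terms(doc_scores)
print(s.most_common())
```

The score of a term is a document count rather than a sum of tf-idf values, so a single document in which a rare word scores very highly cannot dominate the ranking.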
Among the 3066 unique terms contained in S, MI is ranked within the top 110 (3.3%) of highly relevant terms, and within the top 2.3% of all terms in the corpus. In the same company are terms that we would expect to rank highly, such as “arthritis”, “pain”, and “NSAID”. There are also several other suggestive terms. “Diethylstilbestrol” is
a non-steroidal drug that was withdrawn from the
market; “duodenal ulcer” relates to the problem that
Vioxx was supposed to solve (gastrointestinal com-
plications). Finally, a quick search for “vioxx” and
“deet” uncovers several articles comparing the danger
of the two drugs. However, the results also contained
irrelevant terms. “Text”, “document” and “stock”, for
example, are likely artifacts of HTML junk that the
MetaMap parser did not discard. Other terms, such
as “activity”, “wanted” and “include”, could likely be
filtered semantically.
The results presented in Section 5.1 support our first
hypothesis: “Myocardial Infarction” is more significant than expected in Vioxx-related articles. The MI term occurs on average almost twice as often in the Vioxx-related articles as in the control articles, as shown in Table 1. Moreover, a non-parametric statistical test indicates, at the 0.05 significance level, that the frequency distribution of the MI term in the Vioxx-related PML is shifted toward higher values when compared to that of the Naproxen- and Ibuprofen-related PML.
In light of the inter-drug PML significance of the
MI term, the results presented in Section 5.2 for-
tify both our first and second hypotheses. A sim-
ple method that, in essence, counted the number of
documents for which a word was highly descriptive,
ranked the MI term in the top 3.3% of the most rele-
vant words in the corpus. Cast in this light, our first
result, the fact that the difference in MI term distri-
bution between the corpora segments is statistically
significant, becomes even more relevant. Given these
results, we can claim with reasonable confidence that
the concept “Myocardial Infarction” was a distinctive
term in the Vioxx-related PML both within that litera-
ture itself, as well as across control PMLs, well before
Vioxx was withdrawn from the market. That is, not only could MI be tied to “Vioxx” as an important, descriptive concept, but it could also be labeled
as an anomaly in that general class of PML. These
results are heartening for a proof-of-concept system.
We do note, however, that while 3.3% is an impres-
sive margin, 110 terms is still too many for a user to
browse through. Implementing an effective search in-
terface for potentially relevant links is one of the most
important components of future work.
Further improvements include expanding our
analyses to incorporate data sets containing overlap-
ping drug terms. Incorporating additional drugs into
the control corpus would provide more comparison
points. Finally, an important goal is to incorporate Web content from alternative sources, such as Twitter feeds and blog posts, into the corpus. Developing methods for obtaining, cleaning and analyzing these data will
prove challenging, but we believe that results will be
We conclude by noting that well-implemented medi-
cal knowledge discovery systems based on the PML
have enormous potential. Early predictions of unan-
ticipated drug side effects and early warnings of dis-
ease outbreaks could improve health care quality and
intervention response times. Hypothesis generation
with predicted return values from hypothesis confir-
mation could streamline medical research. But most
importantly, we live in a data-driven age in which
digitization of medicine is inevitable: the future of
medicine will depend not only on our ability to re-
trieve and synthesize information from a wide array
of sources, but more importantly on our ability to ex-
tract significant patterns from that information.
Aronson, A. (2006). MetaMap: Mapping text to the UMLS
Metathesaurus. Bethesda, MD: NLM, NIH, DHHS.
Berenson, A., Harris, G., Meier, B., and Pollack,
A. (2004). Despite Warnings, Drug Giant
Took Long Path to Vioxx Recall. Retrieved
14merck.html?pagewanted=2& r
Brownstein, J., Freifeld, C., Reis, B., and Mandl, K. (2008).
Surveillance Sans Frontières: Internet-Based Emerging Infectious Disease Intelligence and the HealthMap
Project. PLoS Med, 5(7):e151.
Carmichael, A. (2009). Crowdsourced Health Con-
firms Infertility-Asthma Finding. Retrieved from:
Gilmartin, R. (2004). Vioxx Timeline: Key Dates for
VIGOR and Long-term, Placebo-controlled Studies
Implemented to Provide Cardiovascular Safety Data.
Retrieved from:
Ginsberg, J., Mohebbi, M., Patel, R., Brammer, L., Smolin-
ski, M., and Brilliant, L. (2008). Detecting influenza
epidemics using search engine query data. Nature,
Gordon, M. and Lindsay, R. (1996). Toward discovery sup-
port systems: A replication, re-examination, and ex-
tension of Swanson’s work on literature-based discov-
ery of a connection between Raynaud’s and fish oil.
Journal of the American Society for Information Sci-
ence, 47(2):116–128.
Hu, X. (2005). Mining novel connections from large online
digital library using biomedical ontologies. Library
Management, 26(4/5):261–270.
Hu, X., Yoo, I., Song, M., Zhang, Y., and Song, I. (2005).
Mining undiscovered public knowledge from com-
plementary and non-interactive biomedical literature
through semantic pruning. In ACM CICM, pages 249–
Prakash, S. and Valentine, V. (2007). Time-
line: The Rise and Fall of Vioxx. Retrieved
Pratt, W. and Yetisgen-Yildiz, M. (2003). LitLinker: cap-
turing connections across the biomedical literature. In
ACM K-CAP, pages 105–112. ACM Press New York,
Reuters (2005). A Timeline of Vioxx. Retrieved
Smalheiser, N. and Swanson, D. (1998). Using ARROW-
SMITH: a computer-assisted approach to formulating
and assessing scientific hypotheses. Computer Meth-
ods and Programs in Biomedicine, 57(3):149–153.
Solomon, D., Glynn, R., Levin, R., and Avorn,
J. (2002). Nonsteroidal Anti-inflammatory
Drug Use and Acute Myocardial Infarction.
10/ 1099?view=abstract.
Swanson, D. (1986). Fish oil, Raynaud’s syndrome, and
undiscovered public knowledge. Perspectives in Biol-
ogy and Medicine, 30(1):7–18.
Swanson, D. (1988). Migraine and magnesium: eleven
neglected connections. Perspectives in Biology and
Medicine, 31(4):526–57.
Swanson, D. (2001). On the fragmentation of knowledge,
the connection explosion, and assembling other peo-
ple’s ideas. Bulletin of the American Society for Infor-
mation Science and Technology, 27(3):12–14.
Weeber, M., Klein, H., de Jong-van den Berg, L., and
Vos, R. (2001). Using concepts in literature-based
discovery: Simulating Swanson’s Raynaud–fish oil
and migraine–magnesium discoveries. Journal of the
American Society for Information Science and Tech-
nology, 52(7):548–557.
White, R. and Horvitz, E. (2008). Cyberchondria: Studies
of the Escalation of Medical Concerns in Web Search.
ACM TOIS, 27(4).