tasks,  in  this  study,  we  selected  L1-regularized 
logistic regression in order to be able to examine EHR 
feature weights alongside classification performance. 
A few features positively weighted by the classifier 
are  not  clearly  related  to  CDI  risk  or  likely  to  be 
related  to  evolving  symptomatology  –  for  example, 
service  or  admission  location.  In  practice, 
unexpectedly weighted characteristics also have the 
potential  to  reflect  phenomena  of  institutional  or 
clinical  epidemiological  interest,  such  as 
unrecognized  infection  transmission  routes  or 
previously undetected groups of patients at elevated 
risk  (Cohen  et  al.,  2010;  Shaughnessy,  Micielli, 
DePestel, et al., 2011).  Thus, in a machine learning 
classification  system,  it  is  desirable  to  be  able  to 
examine  what  features  are  being  identified  by  the 
system  as  predictive,  even when  such  features  may 
not  be  validated  as  risk  factors  by  previous 
epidemiological studies.   
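As a minimal illustration of this kind of weight inspection, the sketch below fits an L1-regularized logistic regression by proximal gradient descent in plain Python and lists the learned feature weights. It is not the CREST implementation: the toy records, the feature names (antibiotic_exposure, admission_location_b), and the hyperparameters (lam, lr, n_iter) are all hypothetical and chosen only to show how the L1 penalty can drive a weakly associated feature to exactly zero while leaving a strongly associated one with a nonzero weight.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(v, t):
    # Proximal step for the L1 penalty: shrink v toward zero by t.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.1, lr=0.5, n_iter=2000):
    # Proximal gradient descent on mean log-loss + lam * ||w||_1.
    # The intercept b is left unpenalized.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = sigmoid(z) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        # Gradient step on the log-loss, then shrinkage from the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * lam)
             for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy records (not study data). Columns are
# [antibiotic_exposure, admission_location_b]; label 1 = CDI case.
feature_names = ["antibiotic_exposure", "admission_location_b"]
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 0], [0, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = fit_l1_logistic(X, y)
weights = dict(zip(feature_names, w))

# List features by absolute weight so that unexpected ones stand out.
for name, value in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>22s}  {value:+.3f}")
```

In a listing of this kind, a feature that keeps a nonzero weight despite having no established link to CDI risk would be a candidate for the epidemiological follow-up described above.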
A limitation of the current study is that we include 
data from only one set of archived electronic patient 
records for an intensive care unit patient population, 
limiting  the  generalizability  of  our  results.  Further 
investigations  are  needed  to  cross-validate  this 
system and compare the clinical performance of 
CREST  in  different  healthcare  facilities  and  for 
different patient groups. In addition, further
performance improvements may be achievable through
the use of alternative core machine learning methods
and optimized cross-validation approaches. It also
remains to be studied
whether changes in the risk score itself may be useful 
as inputs to the system.   
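One simple way such a change signal could be derived is as the day-to-day difference of a patient's risk score. The sketch below assumes one score per patient-day; the helper name risk_score_deltas and the example values are hypothetical, and whether such a feature helps remains an open question.

```python
def risk_score_deltas(daily_scores):
    # First difference of the score series. The first day has no
    # predecessor, so its change is defined as 0.0 here (an assumption).
    if not daily_scores:
        return []
    deltas = [0.0]
    for prev, curr in zip(daily_scores, daily_scores[1:]):
        deltas.append(curr - prev)
    return deltas

# Hypothetical daily risk scores for one patient (not study data).
scores = [0.25, 0.25, 0.5, 0.375]
deltas = risk_score_deltas(scores)
print(deltas)
```

Each delta could then be appended to that day's feature vector alongside the other inputs.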
Given  the  overall  relatively  low  prevalence  of 
CDI  in  the  patient  population,  the  sensitivity  and 
specificity  of  CREST  would  require  improvement 
before the system could be used as a diagnostic tool. 
However, the ability of CREST to flag evolving high-risk
patients based on real-time clinical data makes the
system well suited to preventive interventions and
infection control epidemiology applications.
Facility-level  prevention  activities  that  present 
minimal  or  no  risk  to  individual  patients,  such  as 
precautionary  patient  isolation  or  increased 
observation  with  a  lowered  threshold  for  ordering 
diagnostic  testing,  might  be  considered  for  patients 
whom the system identifies as potential CDI cases.
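The effect of low prevalence on a diagnostic use case can be made concrete with Bayes' rule: at a fixed sensitivity and specificity, the positive predictive value (PPV) falls sharply as the condition becomes rarer. The operating point and prevalence values below are hypothetical and are not results from this study.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    # Bayes' rule: probability that a flagged patient is a true case.
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point (not measured CREST performance).
sensitivity, specificity = 0.80, 0.90

ppv_by_prevalence = {}
for prevalence in (0.01, 0.05, 0.20):
    ppv = positive_predictive_value(sensitivity, specificity, prevalence)
    ppv_by_prevalence[prevalence] = ppv
    print(f"prevalence={prevalence:.2f}  PPV={ppv:.3f}")
```

Under these assumed numbers, most flagged patients at 1% prevalence would not be true cases, which is tolerable for low-risk preventive measures but not for a diagnostic decision.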
  
5  CONCLUSIONS 
We  conclude  from  this  study  that  machine learning 
strategies can be productively applied to EHR data for 
early  identification  of  hospital-acquired  CDI  cases 
and  that  dynamic  feature  variability  provides 
particularly strong predictive signals, beyond patient 
information  used  for  traditional  clinical  risk 
assessments.  Further  investigations  are  needed  to 
cross-validate  this  system,  to  compare  the 
performance  of  this  approach  for  different  facilities 
and  patient  groups,  and  to  explore  its  ability  to 
discriminate among diagnoses. 
ACKNOWLEDGEMENTS 
Thomas  Hartvigsen  thanks  the  US  Department  of 
Education for supporting his PhD studies via the grant 
P200A150306 on “GAANN Fellowships to Support 
Data-Driven  Computing  Research”.  Cansu  Sen 
thanks  WPI  for  granting  her  the  Arvid  Anderson 
Fellowship  (2015-2016)  to  pursue  her  PhD  studies. 
We  also  thank  the  DSRG  and  Data  Science 
Community at WPI for their support and feedback. 
REFERENCES 
‘Antibiotic  Resistance  Threats  in  the  United  States,’ 
Centers  for  Disease  Control  and  Prevention,  2019. 
https://www.cdc.gov/drugresistance/pdf/threats-report/2019-ar-threats-report-508.pdf
Lessa,  F.C.,  Mu,  Y.,  Bamberg,  W.M.,  Beldavs,  Z.G., 
Dumyati, G.K., Dunn, J.R., and others, 2015. Burden of 
Clostridium difficile infection in the  United States. N 
Engl J Med, 372 (9): 825-834.  
Cohen, S.H., Gerding, D.N., Johnson, S., Kelly, C.P., Loo, 
V.G.,  McDonald,  L.C.,  and  others,  2010.  Clinical 
practice guidelines  for Clostridium difficile infection: 
2010 update by the society for healthcare epidemiology 
of America (SHEA) and the infectious diseases society 
of America (IDSA). Infect Control Hosp Epidemiol, 31 
(5): 431-455.  
Evans, C.T., Safdar, N., 2015. Current Trends in the 
Epidemiology  and  Outcomes  of  Clostridium  difficile 
Infection. Clin Infect Dis, 60 (Suppl 2): S66-71.  
Burnham,  C.A.,  Carroll,  K.C.,  2013.  Diagnosis  of 
Clostridium difficile infection: an ongoing conundrum 
for  clinicians  and  for  clinical  laboratories.  Clin 
Microbiol Rev, 26(3): 604-630.  
Dubberke, E.R., Olsen, M.A., 2012. Burden of Clostridium 
difficile on the healthcare system. Clin Infect Dis, 55 
(Suppl 2): S88-92.  
Dubberke,  E.R.,  Carling,  P.,  Carrico,  R.,  Donskey,  C.J., 
Loo,  V.G.,  McDonald,  L.C.,  and  others,  2014. 
Strategies to prevent Clostridium difficile infections in 
acute care hospitals: 2014 update. Infect Control Hosp 
Epidemiol, 35(6): 628-645. 
Balsells, E., Filipescu, T., Kyaw, M.H., Wiuff, C.,
Campbell, H., Nair, H., 2016. Infection prevention and 
control  of  Clostridium  difficile:  a  global  review  of