Air Quality and Cause-specific Mortality in the United States:

Association Analysis by Regression and CCA for 1980-2014

Erin Teeple

, Caitlin Kuhlman

, Brandon Werner

, Randy Paffenroth

1,2,3

and Elke Rundensteiner

1,2

Data Science Program, Worcester Polytechnic Institute, Worcester, MA, U.S.A.

Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA, U.S.A.

Department of Mathematics, Worcester Polytechnic Institute, Worcester, MA, U.S.A.

Keywords: Air Quality, Canonical Correlation Analysis, CCA, Epidemiology, Environmental Health.

Abstract: Quantifying health effects resulting from environmental exposures is a complex task. Underestimation of

exposure-outcome associations may occur due to factors such as data quality, jointly distributed spectra of

possible effects, and uncertainty about exposure levels. Parametric methods are commonly used in population

health research because parameter estimates, rather than predictive accuracy, are useful for informing

regulatory policies. This project considers complementary approaches for capturing population-level

exposure-outcome associations: multiple linear regression and canonical correlation analysis (CCA). We

apply these methods for the task of characterizing relationships between air quality and cause-specific

mortality. We first create a national air pollution exposures-mortality outcomes data set by integrating United

States Environmental Protection Agency (EPA) annual summary county-level air quality measurements for

the period 1980-2014 with age-adjusted gender- and cause-specific county mortality rates from the same time

period published by the Institute for Health Metrics and Evaluation (IHME). Code for data integration is made

publicly available. We examine our model parameter estimates together with air quality-mortality rate

associations, revealing statistically significant correlations between air quality variations and variations in

cause-specific mortality which are particularly apparent when CCA is applied to our population health data

set.

1 INTRODUCTION

A significant challenge in assessing the impact of

environmental factors on health outcomes is that

many health outcomes are not deterministic and have

multiple contributing risk factors related not only to

the exposures being studied, but also to other

unrelated and unmeasured factors (Vineis & Kriebel,

2006). In this work, we investigate this phenomenon

with regards to the potential impact of air pollution

exposures on mortality rates. The task of estimating

the potential contributions of air quality to varying

mortality causes is challenging because over time and

space, not all individuals will have the same

underlying risk for different causes of death or the

same susceptibility to the effects of an exposure such

as air pollution.

For example, risk of death from respiratory or

cardiovascular disease may be influenced not only by

the quality of the air in a particular location, but also

by an individual’s long-term health, which is

impacted by factors such as activity level, blood

pressure, and diet, as well as by short-term health

events such as respiratory infections, among other

considerations (Vineis et al., 2006; Cromar, Gladson,

& Ewart, 2019).

Additionally, we may also find that multiple

interrelated outcomes can occur from a similar

mechanism. An example of this would be the

occurrence of a heart attack or stroke as a

manifestation of vascular disease, or in the

occurrence of a fatal cardiac event due to respiratory

stress. As we will demonstrate in the analyses

presented in this paper, these complex, non-

deterministic, and overlapping relationships between

our predictor and outcome variable sets have the

potential to present as more strongly correlated latent

relationships with air pollution variability through the

covariance of different mortality rates, rather than in

the form of parametric effect estimates, as can be

obtained using linear regression models which predict

single mortality rates from air quality measures.

228

Teeple, E., Kuhlman, C., Werner, B., Paffenroth, R. and Rundensteiner, E.

Air Quality and Cause-speciﬁc Mortality in the United States: Association Analysis by Regression and CCA for 1980-2014.

DOI: 10.5220/0009156702280236

In Proceedings of the 13th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2020) - Volume 5: HEALTHINF, pages 228-236

ISBN: 978-989-758-398-8; ISSN: 2184-4305

1.1 Motivation

The United States Clean Air Act §7401 et seq. (1970)

is a federal law that was first passed in 1970 and

amended in 1977 and 1990. The Clean Air Act

requires the United States Environmental Protection

Agency (EPA) to set National Ambient Air Quality

Standards (NAAQS) for six air pollutants termed

“criteria air pollutants”: ground-level ozone,

particulate matter, carbon monoxide, lead, sulfur

dioxide, and nitrogen dioxide. Air pollution exposure

has been linked in numerous studies with increased

risk for adverse health events, including cardiac events

and strokes (Di, Wang, Zanobetti, et al., 2017; Shah,

Lee, McAllister, et al. 2015; Han, Lim, Yorifuji, &

Hong, 2018; Peng, Xiao, Gao, et al., 2019; India State-

Level Disease Burden Initiative Air Pollution

Collaborators, 2019; Wang, Zhao, Liou, et al., 2019).

In addition, negative health impacts have been found

to be associated with air pollution exposure even at

levels below current United States federal regulatory

limits. For example, in one study of a Medicare

beneficiary population, increased all-cause mortality

was found to be associated with higher levels of small-

diameter particulate and ozone air pollution exposure

that were within federal exposure limits (Di et al.,

2017).

Since air pollution exposure may impact morbidity

and mortality risk across multiple organ systems, it

becomes challenging to evaluate and quantify the

effects of air pollution exposure on population health,

since each possible outcome has other unique risk

factors and rates of occurrence apart from the effects

of air pollution exposure. Nonetheless, such dose-

response and predictive models are necessary for

evaluating and informing policies that regulate and

update air pollution exposure limits. One approach for

modelling cause and effect relationships is the use of

multiple linear regression, which estimates a response

quantity from a set of predictor variables (James,

Witten, Hastie & Tibshirani, 2014). When assessing

the impacts of different environmental exposures on

multiple health target outcomes with the use of

regression models, however, we may underestimate

associations between environmental factors and

multiple interrelated outcomes by examining each

individually, rather than considering the total variation

in the outcomes of interest relative to exposure

variables.

1.2 Proposed Approach

Illustrating this point, in this paper we first examine

relationships among interrelated air pollution

exposure measures and cause-specific mortality rates

as single rates, using multiple linear regression. We

then explore the relationship between variations in air

quality and variations in health outcomes by applying

Canonical Correlation Analysis (CCA), which finds

combinations of predictor and outcome set elements

which are maximally correlated with each other,

thereby permitting quantification and hypothesis

testing about the presence of latent intercorrelations

accounting for covariations across and between

variable sets (Hotelling, 1936; Gonzalez, Dejean,

Martin & Baccini, 2008). CCA is performed in this

study by taking year and air quality measures as

elements of one intercorrelated variable set and

mortality rates for male and female all-cause,

cardiovascular, respiratory, and infectious disease

mortality as elements of a second intercorrelated

variable set.

Application of CCA to these matched data sets

then produces independent, linear combinations of set

variables which are maximally correlated in sequential

independent projection spaces. These independent

correlated projections of the data are termed canonical

dimensions, and the existence of a statistically

significant correlation within a given canonical

dimension may be interpreted as there being a latent,

unmeasured (canonical) factor accounting for the

observed covariation relationship between the two sets

of variables. Using the approaches of linear regression

and CCA together gives us complementary

perspectives on our phenomena of interest: from linear

regression, the proportion of variation in each

mortality rate explained by a multiple linear regression

model using year and air quality measures as

predictors, and from CCA, the degree to which

county-specific variations in cause-specific mortality

may be associated with air quality variations.

1.3 Key Contributions

 We create and publicly release a novel national

county-level air pollution exposure - mortality

outcome data set which integrates EPA air quality

measurements with county-level mortality data from

the Institute for Health Metrics and Evaluation

(IHME). Our approach uses federal county identifiers,

permitting easy integration with other geographically

coded data sets.

 We quantify nationwide associations between

cause-specific mortality rates and air quality measures

in the United States over a 34-year time period.

 We compare the performance of CCA and

regression for characterizing statistical relationships

among our data attributes, showing that variations in

Air Quality and Cause-speciﬁc Mortality in the United States: Association Analysis by Regression and CCA for 1980-2014

229

air quality have a strong and statistically significant

correlation with mortality rate variations.

 Our findings have important public health

implications: we find associations between lower air

quality and increased rates of specific mortality causes

even within United States regulatory limits for air

quality.

 We highlight further applications of our approach

for other questions in environmental epidemiology

and public health research.

2 RELATED WORK

Regression models are commonly used to examine

associations between air pollution of different types

and specific health outcomes (Di et al., 2017; Shah et

al. 2015; Han et al., 2018; Peng et al., 2019; India

State-Level Disease Burden Initiative Air Pollution

Collaborators, 2019; Wang et al., 2019). Findings

reported in such studies commonly include estimates

of excess mortality resulting from air pollution

exposures of different types and levels, as well as

parametric estimates of the contributions of specific

types of air pollution to different outcomes. To date,

numerous studies have reported impacts of air quality

on multiple health outcomes. A challenge for these

studies, however, is that since there exists evidence

that air pollution exposure adversely effects multiple

interdependent organ systems, any single health

outcome will be insufficient for fully quantifying the

impact of air pollution exposure on the overall health

of a study population.

CCA thus has the potential to add to the insights

provided by these previous studies as a result of its

being specifically suited for the situation where we

have multiple intercorrelated exposure measures and

multiple interrelated health effects (Hotelling, 1936;

Gonzalez et al., 2008). A further strength of applying

CCA for questions in population-level epidemiology

is that rather than assuming independence of

predictors or applying domain knowledge to engineer

interaction terms (as is commonly done in regression

analysis), unmeasured phenomena which impact

multiple variables and create intercorrelations may be

uncovered in the model as statistically significant

high-magnitude cross-set correlations between

projections of the data sets in the canonical

dimensions (Hotelling, 1936; Gonzalez et al., 2008).

Canonical correlation analysis (CCA) was first

described by Hotelling in 1936. CCA is used to

examine latent (canonical) relationships between

multi-dimensional vectors X = (x

, x

, …x

) and Y =

, …y

) which have non-zero Pearson

correlations (ρ) among variables such that ρ(x

, x

ρ(y

, y

), ρ(x

, y

) are non-zero for some variables.

Existence of such non-zero intercorrelations implies

that linear combinations of variables in the two sets

may be predictable by or predictive of the others. CCA

seeks to find linear combinations of X and Y with

maximal correlations with each other. In effect, these

linear combinations may be used to examine and

characterize possible latent relationships between

multidimensional X and Y domains, with correlations

between X and Y sets in the canonical dimensions

taken to represent latent factors accounting for

correlated set covariations (Hotelling, 1936; Gonzalez

et al., 2008).

In recent years, CCA has found further extensions

in kernel (Rudzicz, 2010) and deep (Andrew, Arora,

Bilmes & Livescu, 2013) CCA methods. In kernel

CCA, data sets are projected into high-dimensional

kernel space before CCA is performed, with the use of

a kernel permitting non-linear representations of the

data sets being correlated. Challenges in kernel and

deep CCA, however, include appropriate kernel

selection, difficulty when trying to interpret projection

relationships, and avoiding overfitting. As kernel and

deep CCA methods are further developed, future work

focusing on interpretability of canonical projections

may find use in epidemiology applications, as

demonstrated here for linear CCA, which relies upon

the interpretability of the canonical coefficients to

validate our interpretations of the results of an

analysis.

3 METHODS AND PROCEDURES

3.1 Data Sources

Data sources for this study were selected to provide

county-level information on air pollution exposure

and cause-specific mortality linked by a shared key,

which in this case are the United States county

identifier and the year of data collection.

Air Quality Data: AirData (United States

Environmental Protection Agency [EPA]) is a website

maintained by the EPA that provides public access to

air quality measurements collected at more than 4,000

outdoor monitors across the United States, Puerto

Rico, and the United States Virgin Islands. AirData

has available for download annual and daily summary

data tables containing measurements of overall

summary measures of ambient air quality, regulated

pollutants, particulates, meteorological conditions

(wind, temperature, pressure, barometric pressure, and

RH/dewpoint), toxics, ozone precursors, and lead

HEALTHINF 2020 - 13th International Conference on Health Informatics

230

measurements. For this analysis, we downloaded the

‘Annual Summary’ tables.

Mortality Data: United States county-level age-

standardized respiratory mortality rates for the years

1980-2014 are available through the IHME (Institute

for Health Metrics and Evaluation [IHME]). The

IHME produced estimates for United States county-

level mortality rates for 21 causes of death including

chronic respiratory diseases for the period 1980-2014

(IHME). This aggregated data set is available through

the Global Health Data Exchange. Age-standardized

mortality rates for male, female, and combined

genders are reported as the number of deaths per

100,000 people in the population. These estimates

were generated using death records from the National

Center for Health Statistics (NCHS); population

counts from the U.S. Census Bureau, NCHS, and the

Human Mortality Database; and the

cause list from the

Global Burden of Disease Study (GBD).

3.2 Analysis and Methodology

Data Preprocessing and Integration: Data pre-

processing and table joins were implemented in

Python, version 3.6, yielding a single .csv file

containing 31,019 data rows uniquely identified by

county location and year and for which mortality rate

and air quality measurement information was

available. From the EPA data, we extract the

following measures for each county and year: median

annual Air Quality Index (AQI) (AQI is summary

measurement of air quality, with scores ranging from

0-500); maximum annual AQI; the proportion of

recorded days on which AQI fell into each of the

following categories: Good (0-50), Moderate (51-

100), Unhealthy for Sensitive Groups (101-150),

Unhealthy (151-200), Very Unhealthy (201-300), and

Hazardous (301-500); and the proportion of days on

which the AQI was attributed to one of the following

pollutants: Carbon Monoxide (CO), Nitrogen Dioxide

(NO2), Ozone, Particulate Matter (PM10), and Sulfur

Dioxide (SO2).

In our analyses, we use this set of air quality

features to understand and predict eight annual,

county-specific mortality targets extracted from the

IHME data set: male and female age-adjusted

mortality rates for the following causes: All (ALL),

Respiratory disorders (RESP), Cardiovascular

diseases (CVD), and Lower respiratory and other

common infectious diseases (INF). These causes of

mortality were selected for inclusion based on

previous studies linking air pollution exposure with

systemic inflammation and adverse effects on the

cardiovascular and respiratory systems (Di, Wang,

Zanobetti, et al., 2017; Shah, Lee, McAllister, et al.

2015; Han, Lim, Yorifuji, & Hong 2018; Peng, Xiao,

Gao et al., 2019; India State-Level Disease Burden

Initiative Air Pollution Collaborators, 2019; Wang,

Zhao, Liou et al., 2019). We included all data rows for

which both air quality and mortality values were

available. Our complete data analysis and code can be

found here (https://github.com/erinteeple/CCA_air).

Exploratory Data Analysis: To explore simple

pairwise linear relationships between the attributes in

our dataset, we present Pearson correlations. Figure 1

demonstrates the existence of linear correlations

among and between air quality measures and mortality

rates. Given these multiple intercorrelations between

and within sets (Figure 1), we see that CCA is

appropriate for our analysis. We have intercorrelated

measures of air pollution exposures, intercorrelated

mortality rate measures, and cross-correlations

between elements of the two sets. Initial data

exploration also included characterizing temporal

trends in the mortality and air quality variables. Figure

2 shows generally linear trends for mean mortality by

cause and year and substantial variations in rates

within years for different locations. Mean mortality by

cause was observed to differ for males and females,

thus we chose to keep these rates separated in our

analyses (Figure 2). With respect to our air pollution

exposure measurements, several important

considerations need to be taken into account. First,

some subgroups of variables could not be used

together in regression models due to frank violations

of assumptions of predictor independence. To address

these issues, first, the proportions of days in each of

the AQI rating categories were combined into a single

measure, which is the proportion of days on which the

AQI was in the good or moderate air quality

categories.

Intuitively and as can be seen in Figure 1,

Figure 1: Pearson correlations for mortality rates and air

quality exposure measures.

Air Quality and Cause-speciﬁc Mortality in the United States: Association Analysis by Regression and CCA for 1980-2014

231

Figure 2: Box plots showing variations in age-standardized

mortality rates per 100,000 persons for time period 1980-

2014 by gender and cause: M: male; F: female; ALL: all

causes; RESP: respiratory; CVD: cardiovascular; INF:

infectious disease.

median AQI, maximum AQI, and proportion of days

on which AQI was good or moderate are interrelated

and therefore cannot be used together, thus, separate

linear regression models were generated for each of

these AQI summary measures and compared. An

additional consideration regarding the formatting of

the air quality summary data is that the proportions of

specific pollutants reflect only the proportion of the

days on which the maximal AQI is attributed to a

maximal type of air pollution – this means that (1) air

pollutants present at other levels are not captured by

this measure and so we have no measure of co-

exposures and (2) the magnitude of exposure to a

given pollutant is not captured by this proportion, only

that the pollutant was at a level accounting for the

recorded AQI. We therefore chose to include

interaction terms between AQI summary measures

and pollutant proportion terms in our multiple linear

regression models in order to assess the scaled

contributions of different pollutants. Multiple linear

regression and CCA were performed in R and Python

(Gonzalez et al., 2008; Pedregosa et al., 2011).

Multiple Linear Regression Analysis: Multiple

linear regression is a multivariate statistical method in

which we examine linear correlations between a

dependent variable and one or more independent

variables (James et al., 2014). Multiple linear

regression analysis produces statistical outputs

including coefficient estimates for each predictor,

which estimate the magnitude and direction of that

predictor’s contribution to the dependent variable

value in the model and confidence intervals for each

coefficient, which indicate a probability-based range

of values for these coefficients. The ability of a

multiple linear regression model to explain variation

in the dependent variable using the independent

variables may be also be quantified using the adjusted

R-squared value for the regression. The adjusted R-

squared value may be interpreted as quantifying the

proportion of variation in the dependent variable

explained by the independent variables in the linear

regression model.

Canonical Correlation Analysis: As an analysis

method, the formulation of CCA is as follows for a

data matrix M comprised of attribute sets X and Y, for

which  and  measurements are available,

respectively, for each of N observations.







X∶N∗

∶N∗

(1)

CCA then seeks independent, linear combinations

of the X and Y set variables 



and 



which

maximize ,:





















(2)





















(3)

, 

,















(4)

Given that environmental exposures rarely occur

in isolation and may have effects on multiple organ

systems at varying rates (Vineis et al., 2006), CCA

then is suited for such applications in the study of

relationships between environmental factors and

population health outcomes, particularly, as is

commonly the case, where baseline rates for these

outcomes are expected to vary with time and spatial

location and where the true rates are not themselves

known. In this study, we compare linear regression

with CCA, where mortality rates stratified by cause

and gender serve as a multidimensional intercorrelated

HEALTHINF 2020 - 13th International Conference on Health Informatics

232

response vector set to be cross correlated with

intercorrelated pollution exposure measures to

determine if this approach uncovers correlated

covariance between these variable sets. We compare

the results of our analysis with the proportion of

variance in the individual mortality rates explained

using multiple linear regression. In addition, in CCA

we also examine the weight vectors assigned for the

attributes in the two variable sets relative to their

contributions to each sequential canonical projection.

The interpretation of the variable weights in CCA

differs notably from parametric estimates in multiple

linear regression in that for a multiple linear regression

model, a coefficient would be taken to represent the

linear contribution of a variable to a specific

dependent/target value in the regression model. In

contrast, the variable weights assigned in CCA apply

in sequential canonical dimensions. These weights are

taken to reflect the importance and relative direction of

each set element’s contribution to each dimension-

specific correlation relationship. Examination of these

canonical weightings provides insight into the relative

contribution of each variable in the sets relative to the

latent relationship captured by the correlated

projection, and these weightings are most informative

when examined relative to weightings of the variables

in the other set. These correlated projections produced

by the set variable weightings may then be interpreted

as reflecting the influence of potential latent factors

captured in each of the canonical projections of the

paired data sets.

4 RESULTS

4.1 Multiple Linear Regression

Table 1 presents adjusted R-squared values for

regression models predicting annual mortality i) from

year only and ii) from year plus air quality measures,

including interaction terms between proportions of

days on which the leading pollutant was of a specific

type. Using the general linear test (Kutner, Nachstein,

Neter & Li, 2004) to compare these models, we

observe that significantly greater proportions of

variation for all mortality rates are explained by linear

regression models including year along with air quality

measures, consistent with research reporting diverse

multi-system health impacts (Table 1). Confidence

intervals for the air quality measure coefficients are

presented in Table 2. Of note, the coefficient intervals

for some air pollution variables include 0 (no effect),

and some coefficient estimates are negative. For

example, surprisingly, ozone has a negative coefficient

Table 1: Mortality rate prediction.

Table 2: Linear model coefficient estimates.

interval for female respiratory mortality, but this

should be interpreted in the context of overall

population-level trends in female respiratory mortality

(Figure 2, F-RESP), as well as consideration of the

other variables in the regression.

It is also worth noting that proportion of variation

explained by the multiple linear regression models is

more suitable in this case for comparing models than

is mean-squared error, as we conduct these analyses

without knowing what variability exists in our targets

due to causes other than air quality and our target rates

are expected to differ in their distributions.

Air Quality and Cause-speciﬁc Mortality in the United States: Association Analysis by Regression and CCA for 1980-2014

233

4.2 Canonical Correlation Analysis

The results of canonical correlation analysis applied to

our data set are presented in Table 3. By CCA, in the

first canonical dimension, we observe a correlation of

0.91 between the set of mortality rates and the set of

air quality variables which is found to be statistically

significantly different from 0 (null hypothesis for

testing). As can be seen in Table 3, in this first

canonical dimension, we observe positive weights

assigned to multiple adverse air quality measures,

positive weights assigned to mortality rates of multiple

causes, and a negative weight (protective effect)

assigned to the proportion of days on which the AQI

was rated as good or moderate.

These CCA results quantify a strong linear

correlation between air quality variations and

mortality rate covariations due to a latent factor, which

in this case we propose may be the biological

relationship between air pollution exposure and its

effects on human body system health and functions.

While in the linear regression model, we aimed to

estimate parametric contributions of measures of air

quality on the mortality rates of interest, by utilizing

CCA, we expand our concept of linear association to

capture the effects of the environmental exposure on

paired set covariation.

Table 3: CCA coefficient estimates.

5 DISCUSSION

We find in this analysis a strong and statistically

significant first-dimension canonical correlation

between variations in air pollution exposure and

variations in cause-specific mortality in the United

States during the years 1980-2014. These results

Figure 3: Air Quality and Mortality Variable Set

Correlations in the first three canonical dimensions. Note

the high linear correlation (0.91) in the first canonical

dimension, which also stratifies data by year (color).

complement our findings in linear regression analysis,

where we observe statistically significantly better

model fit in our models which include air quality

measures compared to a model which predicts

mortality rates from year alone.

Interestingly, We Observe These Relationships

between Air Pollution Exposures and Mortality of

Different Causes even at Air Pollution Levels that

are Subject to United States Federal Regulatory

Limits. In the case of air pollution, there exists other

research linking airborne exposures with health

outcomes of different types (Di et al., 2017; Shah et al.

2015; Han et al., 2018; Peng et al., 2019; India State-

Level Disease Burden Initiative Air Pollution

Collaborators, 2019; Wang et al., 2019). The results

of this analysis demonstrate the utility of CCA

HEALTHINF 2020 - 13th International Conference on Health Informatics

234

alongside regression for examination of the possible

effects of environmental exposures on health outcome

distributions, which in the absence of knowledge of

the intrinsic rates of these effects allows for the

quantification of a stronger association and

identification of possible harm to human health.

Implications for Future Population-level Outcome-

Exposure Analysis: This success of CCA for

specifically capturing the relationships between

exposure and outcome covariations has further

applications for approaching problems where a link

between environmental factors and possible health

effects is only hypothesized. CCA has other useful

potential applications in such investigations which

seek to determine the relative contributions of

environmental factors or other proposed risk factors to

shifts in distributions of multi-class outcomes, for

example variations in rates of cancers of different

kinds relative to different complex background

exposure levels.

An advantage of this approach is that we do not

require a priori knowledge of outcome distributions or

background risk levels. As evidenced particularly by

the strong and significant correlation in the first

canonical dimension in our CCA analysis, we see that

CCA quantifies a link between covariations in the data

sets, and the interpretation of this link can be

considered against the relative weights assigned to

each of the set elements in the canonical projections

(Hotelling, 1936; Gonzalez et al., 2008). Given that

environmental exposures rarely occur in isolation and

may have effects on multiple organ systems, CCA is

therefore uniquely suited for applications where we

aim to explore whether covariation relationships

between multi-dimensional environmental factors and

interrelated population health outcomes are present.

Future work advancing CCA applications in

environmental epidemiology may take into

consideration not only the formulation of maximally

correlated projections beyond those produced through

linear CCA methods but also preservation of

interpretability of the latent weightings, in order to

permit assessment and characterization of latent factor

relationships in kernel and deep CCA formulations or

the identification of locations which map to similar

positions within the latent projections as regions of

interest for further study.

6 CONCLUSIONS

In this work, we explore the potential of CCA for

population-level environmental epidemiology by

demonstrating its use for understanding the impact of

air pollution on mortality. Our analysis demonstrates

the complementarity of CCA for use alongside

traditional multiple linear regression approaches and

the promise of this method for extension to

investigating other hypothesized exposure outcome

data set relationships.

ACKNOWLEDGEMENTS

We thank the DSRG and Data Science Community

at WPI for their support and feedback.

REFERENCES

Vineis, P., Kriebel, D., 2006. Causal models in

epidemiology: past inheritance and genetic future.

Environmental Health: A Global Access Science

Source, 5:21.

Cromar K.R., Gladson, L.A., Ewart, G., 2019. Trends in

Excess Morbidity and Mortality Associated with Air

Pollution above American Thoracic Society-

Recommended Standards, 2008-2017. Annals ATS, Vol

16 (7): 836-845.

United States Clean Air Act: 42 United States Code §7401

et seq. (1970).

Di, Q., Wang, Y., Zanobetti, A., et al., 2017. Air pollution

and mortality in the Medicare population. NEJM, 26:

376.

Shah, A., Lee, K., McAllister, D., et al., 2015. Short term

exposure to air pollution and stroke: systematic review

and meta-analysis. BMJ, 24: 350.

Han, C., Lim, Y.H., Yorifuji, T., Hong, Y.C., 2018. Air

quality management policy and reduced mortality rates

in Seoul Metropolitan Area: A quasi-experimental

study. Environ Int. 121(Pt 1): 600-609.

Peng, L., Xiao, S., Gao, W., Zhou, Y., Zhou, J. Yang D.,

Ye, X., 2019. Short-term associations between size-

fractionated particulate air pollution and COPD

mortality in Shanghai, China. Environ Pollut. Epub.

India State-Level Disease Burden Initiative Air Pollution

Collaborators, 2019. The impact of air pollution on

deaths, disease burden, and life expectancy across the

states of India: the global burden of disease study 2017.

Lancet Planet Health 3(1): e26-e39.

Wang, T., Zhao, B., Liou, K.N., Gu, Y., Jiang, Z., Song, K.,

Su, H., Jerrett, M., Zhu Y., 2019. Mortality burdens in

California due to air pollution attributable to local and

nonlocal emissions. Environ Int, 133(Pt B):105232.

James, G., Witten, D., Hastie, T., Tibshirani, R.. An

Introduction to Statistical Learning with Applications

in R. Springer 2014. ISBN:1461471370

9781461471370.

Hotelling, H., 1936. Relations between two sets of variates.

Biometrika, 28 (3-4):321-377.

Air Quality and Cause-speciﬁc Mortality in the United States: Association Analysis by Regression and CCA for 1980-2014

235

Gonzalez, I., Dejean, S., Martin, P., Baccini, A., 2008.

CCA: An R package to extend canonical correlation

analysis. J Stat Software, 23(12).

Rudzicz, F., 2010. Adaptive Kernel Canonical Correlation

Analysis for Estimation of Task Dynamics from

Acoustics. Proceedings of the IEEE International

Conference on Acoustics, Speech, and Signal

Processing, ICASSP. https://www.researchgate.net/pub

lication/220736292_Adaptive_Kernel_Canonical_Cor

relation_Analysis_for_Estimation_of_Task_Dynamics

_from_Acoustics

Andrew, G., Arora, R., Bilmes, J., Livescu, K., 2013. Deep

Canonical Correlation Analysis. Proceedings of the

30th International Conference on Machine Learning.

https://ttic.uchicago.edu/~klivescu/papers/andrew_icm

l2013.pdf

United States Environmental Protection Agency, County

Monitor Annual Summary Files, https://aqs.epa.gov/

aqsweb/airdata/download_files.html

United States Combined and Gender-Specific Age-Adjusted

Mortality Rates by United States County, Institute for

Health Metrics and Evaluation, http://ghdx.healthdata.

org/record/united-states-mortality-rates-county-1980-

2014

Pedregosa, F., et al., 2011. Scikit-learn: Machine Learning

in Python, JMLR, 12:2825-2830.

Kutner, M., Nachstein, C., Neter, J., Li, W., 2004. Applied

Linear Statistical Models 5

Edition. McGraw-

Hill/Irwin. ISBN: 007310874X 978-0073108742.

HEALTHINF 2020 - 13th International Conference on Health Informatics

236