Evaluating the Effect of Justification and Confidence Information on
User Perception of a Privacy Policy Summarization Tool
Vanessa Bracamonte¹, Seira Hidano¹, Welderufael B. Tesfay² and Shinsaku Kiyomoto¹
¹KDDI Research, Inc., Saitama, Japan
²Goethe University Frankfurt, Frankfurt, Germany
Keywords: Privacy Policy, Automatic Summarization, Explanation, User Perception, Trust.
Abstract:
Privacy policies are long and cumbersome for users to read. To support understanding of the information
contained in privacy policies, automated analysis of textual data can be used to obtain a summary of their
content, which can then be presented in a shorter, more usable format. However, these tools are not perfect
and users indicate concern about the trustworthiness of their results. Although some of these tools provide
information about their performance, the effect of this information has not been investigated. To address
this, we conducted an experimental study to evaluate whether providing explanatory information such as result
confidence and justification influences users' understanding of the privacy policy content and perception of
the tool. The results suggest that presenting a justification of the results, in the form of a policy fragment,
can increase intention to use the tool and improve perception of trustworthiness and usefulness. On the other
hand, showing only a result confidence percentage did not improve perception of the tool, nor did it help to
communicate the possibility of incorrect results. We discuss these results and their implications for the design
of privacy policy summarization tools.
1 INTRODUCTION
The introduction of regulations such as the
GDPR (European Parliament, 2016) has encour-
aged recent efforts to make privacy policies more
understandable to users. However, despite serving as
a de facto contractual document between the user and
service provider, privacy policies remain too long and
difficult for users to read and comprehend.
Alternatives to these lengthy pieces of text have
been proposed, such as shorter notices in graphical
and standardized formats (Gluck et al., 2016; Kelley
et al., 2010) that can work to communicate informa-
tion about the privacy policy to users. Since these
formats are not employed by all companies, there are
projects such as ToS;DR (ToSDR, 2019) that provide
summaries of existing privacy policies. ToS;DR
relies on a community of users to manually analyze
and categorize the content of these privacy policies,
which makes it very difficult to scale the work to
cover every existing privacy policy.
There are also projects that propose to automate
the analysis of privacy policies using machine learn-
ing techniques. Privacy policy summarization tools
are automated applications, implemented using differ-
ent machine learning and natural language processing
techniques, that analyze the content of a privacy pol-
icy text and provide a summary of the results of that
analysis (Figure 1).
Figure 1: Automated privacy policy summarization.
Examples of these projects are Privee (Zimmeck and
Bellovin, 2014), PrivacyCheck (Zaeem et al., 2018),
Polisis (Harkous et al., 2018) and PrivacyGuide (Tes-
fay et al., 2018b). These tools can provide a solution
to the problem of scale in the analysis of privacy poli-
cies, but they introduce a different challenge: users
express concern regarding the trustworthiness and ac-
curacy of a privacy policy summarization tool when
they know that the process is automated (Bracamonte
et al., 2019).
The design of the result summary of automated
privacy policy summarization tools often follows
guidelines for usable formats aimed at presenting
privacy policy information. Research on alterna-
tive ways of presenting privacy policies indicates that
shortened versions of these texts, supplemented with
icons, can provide users with the necessary informa-
tion for them to understand their content (Gluck et al.,
2016). Not only the length of the privacy policy is
important, but also how the information is presented:
standardized graphical formats can better provide in-
formation than text (Kelley et al., 2009; Kelley et al.,
2010). In addition to these design considerations,
some privacy policy summarization tools also provide
information related to the reliability and performance
of the machine learning techniques used. However,
there are no studies that evaluate how this informa-
tion affects user perception of these types of tools. Al-
though studies have evaluated perception and under-
standing of usable privacy policy formats produced
by humans, aspects related to the reliability of the privacy policy information itself have naturally not been considered in that research.
The purpose of this study is to address this gap. To
achieve this, we conducted an experiment to evaluate
understanding and perception of the results of an au-
tomated privacy policy summarization tool. We cre-
ated different conditions based on whether the results
of the tool showed information about justification and
confidence of the results, and asked participants about
the content of the privacy policy and their perception
of the tool in each of these conditions. The results
show that justification information increased behavioral intention and improved perception of trustworthiness and usefulness. Justification information also helped users
qualify the answers provided by the tool, although the
effect was not present for every aspect of the policy.
Confidence information, on the other hand, did not
have a positive effect on perception of the tool or on
understanding of the results of the tool. We discuss
these findings in the context of providing usable auto-
mated tools for privacy policy summarization and the
challenges for the design of these tools.
2 RELATED WORK
2.1 Information Provided by Automated
Privacy Policy Summaries
Automated privacy policy summarization tools
mainly provide information about the privacy policy
to the user: Privee (Zimmeck and Bellovin, 2014)
and PrivacyCheck (Zaeem et al., 2018), for exam-
ple, show a summary of a privacy policy based on
pre-established categories and risk levels, and the vi-
sual design of the summary includes icons and stan-
dard descriptions. PrivacyGuide (Tesfay et al., 2018a)
shows an icon-based result summary as well, and in
addition provides the fragment of the original privacy
policy. These tools are similar in the sense that they
provide a standardized category-based summary of
the privacy policy, although they use different crite-
ria for that categorization and for assigning risk lev-
els. Polisis (Harkous et al., 2018) takes a different and
more complex approach for the privacy policy analy-
sis and classification, but also provides standard cat-
egories and fragments of the original privacy policy
text. A related chatbot tool, PriBot, returns fragments
of the privacy policy in response to freely composed
questions from users.
The automated tools mentioned sometimes pro-
vide explanatory information about the results. Two
of the tools mentioned above, PrivacyGuide and PriBot, include information that may be considered an explanation of the tool's performance. PrivacyGuide provides a fragment of the original privacy policy that the tool identifies as related to
a privacy aspect and uses to assign a risk level. Pri-
Bot, on the other hand, shows a confidence percentage
that works as a proxy for how accurately the fragment
it returns answers a user’s question (Harkous et al.,
2018).
2.2 Explanations of Automated Systems
When results are provided by automated tools, users
have questions about the reliability of those results.
One way of influencing this perception is to provide
some explanation about the system. For ex-
ample, offering some justification for outcomes can
positively influence perception of accuracy (Biran and
McKeown, 2017) and information about accuracy can
improve trust (Lai and Tan, 2019). Although there
is no predefined way of communicating to the user
about the performance of automated privacy policy
summarization tools, existing tools provide some infor-
mation. PrivacyGuide shows a fragment of the pri-
vacy policy, which can be classified as a justification
or support explanation (Gregor and Benbasat, 1999).
Similarly, information about the confidence of results
such as the one provided by PriBot can also be consid-
ered a dimension of explanation (Wang et al., 2016).
This serves as a measure of uncertainty (Diakopoulos,
2016) of the results and therefore as an indication of
performance.
Explanation information, including confidence,
has been found to positively influence trust when
users interact with automation (Wang et al., 2016).
However, the effect of explanations is not always pos-
itive. Research indicates that certain types of explana-
tory information about the performance of a system,
for example an F-score accuracy measure, may not
be useful in applications intended for a general audi-
ence (Kay et al., 2015). In addition, it has also been
found that too much explanation could have a nega-
tive effect on aspects such as trust (Kizilcec, 2016).
These studies show that simply providing more infor-
mation may not result in a positive effect; therefore,
it is important to evaluate the effect of explanatory in-
formation, such as justification and confidence, pro-
vided by privacy policy summarization tools.
One limitation in this area is that there are few user
evaluations of automated privacy policy summariza-
tion tools. For Polisis, a user study was conducted
that evaluated the perception of the accuracy of re-
sults; however, it was conducted independently of the
interface (Harkous et al., 2018). The study found that
users considered the results were relevant to the ques-
tions, although this perception differed from the mea-
sure of accuracy of the predictive model. A user study
was also conducted to evaluate PrivacyGuide, and it
found that the tool partially achieved
the goal of informing users about the risk of a privacy
policy and increasing interest in its content (Braca-
monte et al., 2019). The study also found that users
indicated concern about the trustworthiness of the tool
and the accuracy of its results. However, these studies
have not considered the effect of justification or confi-
dence information shown by these tools, and how this
explanatory information might affect trust.
3 METHODOLOGY
3.1 Experiment Design
The experiment consisted of a task to view the results
of the analysis of the privacy policy of a fictional on-
line shop, and answer questions about the content of
the privacy policy and about the perception of the tool
in general. We used a between-subjects design, with
a total of six experimental conditions. We defined the
experimental conditions as follows. A Control con-
dition that included only information about the result
of the privacy policy summarization. A Confidence
condition that included all the information from the
Control condition and added a confidence percentage
for the results. A Justification condition that included
all the information from the Control condition and
added justification in the form of short fragments
from the original privacy policy. A Highlight condi-
tion, which was a second form of justification where
relevant words were emphasized in the privacy policy
fragments. Finally, two conditions that showed both
confidence percentage and justification (Justification
+ Confidence and Highlight + Confidence).
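Seen as a factorial structure, the six conditions combine three binary attributes: whether a justification fragment is shown, whether its key words are highlighted, and whether a confidence percentage is shown. The sketch below makes this structure explicit (purely illustrative; the study materials were survey screens, not code, and the names are ours).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Condition:
    """One cell of the between-subjects design."""
    name: str
    justification: bool  # show a privacy policy fragment per aspect
    highlight: bool      # emphasize relevant words in the fragment
    confidence: bool     # show a confidence percentage per aspect

CONDITIONS = [
    Condition("Control", False, False, False),
    Condition("Confidence", False, False, True),
    Condition("Justification", True, False, False),
    Condition("Highlight", True, True, False),
    Condition("Justification+Confidence", True, False, True),
    Condition("Highlight+Confidence", True, True, True),
]
```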
3.2 Privacy Policy Summary Result
We based the design of the privacy policy summary on
PrivacyGuide and specified that the result would correspond to a low-risk privacy policy, as defined by Bracamonte et al. (2019). In PrivacyGuide, an icon in a
color representing one of three levels of risk (Green,
Yellow and Red) is assigned to each result category
(privacy aspect) depending on the content of the pri-
vacy policy corresponding to that aspect (Tesfay et al.,
2018b).
The Control condition result interface included
icons and descriptions of the risk levels for each pri-
vacy aspect. The Justification condition result in-
terface was based on the Control condition, and in-
cluded in addition a text fragment for each privacy
aspect. We selected the fragments from real privacy
policies, by running PrivacyGuide on the English lan-
guage privacy policies of well-known international
websites that also provided an equivalent Japanese
language privacy policy. We chose those fragment
results that matched the privacy aspect risk level we
had defined, but took the fragment from the matching
Japanese language privacy policy. This procedure re-
sulted in fragments that were obtained from different
privacy policies; therefore, we reviewed and modified
the texts so that they would be congruent with each
other in style and content. We also anonymized any
reference to the original company. The Highlight con-
dition result interface was based on the Justification
result, and in addition emphasized the justification by
highlighting words relevant to the corresponding pri-
vacy aspect.
The Confidence condition was based on the Con-
trol condition, and showed in addition a confidence
percentage for each privacy aspect result. The con-
fidence percentages were set manually. The Justi-
fication+Confidence and Highlight+Confidence conditions' result interfaces showed all the information described previously for the Justification/Highlight and Confidence conditions. Regarding the values of the confidence
percentages, since confidence and justification infor-
mation would be shown together for these last two
conditions, we set the confidence values by manu-
ally evaluating how accurately the fragments repre-
sented the privacy aspect risk level. Confidence per-
centages for the privacy aspect results ranged from
70% to 95%, with the exception of Privacy Settings.
For Privacy Settings, we set the confidence percentage
to 45% and chose a fragment that did not accurately
represent the corresponding privacy aspect. This was
done to evaluate the influence of incorrect informa-
tion and low confidence on users’ response to ques-
tions about the content of the privacy policy. Figure 2 shows the details of the interface.

Figure 2: Experiment result screens. (a) Full results for the Confidence condition. (b) Fragment corresponding to one privacy aspect in the Justification + Confidence condition. (c) Fragment corresponding to one privacy aspect in the Highlight + Confidence condition. The result screens for the Control, Justification and Highlight conditions are similar to (a), (b) and (c), respectively, with the exception that the confidence percentage is not included.
The interfaces also included a help section at the
top of the result interface screen, which described
every element of the results from the privacy aspect
name to the confidence percentage (where applica-
ble). Because users need time to familiarize them-
selves with elements in a privacy notice (Schaub et al.,
2015), we included the help section to compensate for
the lack of time, although such a section would not
normally be prominently displayed.
3.3 Questionnaire
We included questions about the privacy policy, to
evaluate participants’ understanding of the privacy
practices of the fictional company based on what was
presented on the result interface. The questions were
adapted from Kelley et al. (2010) and addressed
each of the privacy aspects (Table 1). We created
the questions so that the correct option would be a
positive answer (Definitely yes or Possibly yes) for all
questions except for the question corresponding to the
Protection of Children privacy aspect, where the cor-
rect option was left ambiguous. The only other ex-
ception was the Privacy Settings aspect, which was
assigned a low confidence percentage and an incorrect
justification fragment (as defined in the previous sec-
tion). Therefore, in the experimental conditions that
included these pieces of information, we expected the correct answer not to be a positive one.
Table 1: Privacy policy content questions. Response options: Definitely yes, Possibly yes, Possibly no, Definitely no, It
doesn’t say in the result, It’s unclear from the result.
Privacy aspect Question
Data Collection Does the online store (company) collect your personal information?
Privacy settings Does the online store give you options to manage your privacy preferences?
Account deletion Does the online store allow you to delete your account?
Protection of children Does the online store knowingly collect information from children?
Data security Does the online store have security measures to protect your personal information?
Third-party sharing Does the online store share your personal information with third parties?
Data retention Does the online store indicate how long they retain your data?
Data aggregation Does the online store aggregate your personal information?
Control of data Does the online store allow you to edit your information?
Policy changes Does the online store inform you if they change their privacy policy?
Table 2: Questionnaire items. Response scale: Completely agree, Agree, Somewhat agree, Somewhat disagree, Disagree,
Completely disagree.
Construct Item
Useful The application answers my questions about the privacy policy of the online store
The application addresses my concerns about the privacy policy of the online store
The application is useful to understand the privacy policy of the online store
The application does not answer what I want to know about the privacy policy of the online store (Reverse worded)
Trustworthy The results of the application are trustworthy
The results of the application are reliable
The results of the application are accurate
Understandable The results of the application are understandable
The reason for the results is understandable
Intention I would use this application to analyze the privacy policy of various online stores
I would use this application to decide whether or not to use various online stores
AI use The use of AI is appropriate for this kind of application
We included items for measuring behavioral in-
tention and perception of usefulness, understandabil-
ity and trustworthiness of the tool (Table 2). The
items were rated on a six-point Likert scale, ranging
from Completely disagree to Completely agree. We
also included a question addressing the perceived ap-
propriateness of using AI for this use case.
3.3.1 Translation and Review
The questionnaire was developed in English; since we
conducted the survey in Japan with Japanese partic-
ipants, we translated the questionnaire with the fol-
lowing procedure. First, two native Japanese speakers
independently translated the whole questionnaire, in-
cluding the statements explaining the survey and pri-
vacy policy summarization tool. The translators and
a person fluent in Japanese and English reviewed the
translated statements one by one, verifying that both
translations were equivalent to each other and had the
same meaning as the original English statement.
The reviewers found no contradictions in mean-
ing in this first step. The reviewers then chose the
translated statements that more clearly communicated
the meaning of the questions, instructions or expla-
nations. Finally, the translators reviewed the whole
questionnaire to standardize the language, since they
had originally used different levels of formality.
3.4 Data Collection
We conducted the survey using an online survey com-
pany, which distributed an invitation to participate in
the survey to their registered users. We targeted the
recruitment process to obtain a sample with sex and
age demographics similar to those of the Japanese
population according to the 2010 census (Statistics
Bureau, Ministry of Internal Affairs and Communi-
cations, 2010), but limited participation to users who
were 18 years old or older. Participants were com-
pensated by the online survey company.
Participants were randomly assigned to one of the
six experimental conditions and completed the survey on-
line. We received the pseudonymized data from par-
ticipants from the online survey company, which also
included demographic data. The survey also recorded the total time taken to complete it.
The survey was conducted from December 12-14,
2018.
3.5 Limitations
The study had the following limitations. First, we did not manipulate the risk level of the privacy policy; we only considered a privacy policy defined as low
risk. The number of privacy aspects and correspond-
ing risk levels result in a large number of possible
combinations, making it impractical to test them all.
Consequently, it may be that the results of the study
are not generalizable to other risk levels besides the
one chosen for the study.
In addition, we did not include a process to
validate that the participants had indeed compre-
hended every aspect of the result interface, beyond
the straightforward questions about the content of the
privacy policy. We considered that if we included
more detailed questions, the behavior of the partici-
pant would deviate further from a normal interaction
with this type of tool. Nevertheless, this means that
the results of this study reflect an evaluation of per-
ception rather than objective measures of comprehen-
sion.
Lastly, in the study we used PrivacyGuide’s pri-
vacy aspect categorization, which is based on the Eu-
ropean Union’s GDPR, and showed it to Japanese par-
ticipants. However, we consider that the GDPR-based
categories are relevant for our Japanese participants.
For one, there is a degree of compatibility between
the GDPR and Japanese privacy regulation (European
Commission, 2019). In addition, the Japanese language privacy policies used in the experiment come from international websites aimed at Japanese audiences, and are direct translations of English privacy policies created to comply with the GDPR.
4 ANALYSIS AND RESULTS
4.1 Data Cleanup
The online survey returned a total of 1054 responses.
We first identified suspicious responses, defined as cases with no response variability at the extremes (all questions answered with 1 or 6; the reverse-worded items were included to help detect this pattern) and cases where the total survey completion time was under 125 seconds. We calculated this time threshold from a fast reading speed and the number of characters in the online survey plus the result screen for the Control condition.
Table 3: Sample Characteristics.
n %
Total 944 100%
Gender Male 458 49%
Female 486 51%
Age 19-20s 168 18%
30s 175 19%
40s 200 21%
50s 185 20%
60s 216 23%
Job Government employee 35 4%
Company Employee 373 40%
Own business 59 6%
Freelance 17 2%
Full-time homemaker 174 18%
Part time 112 12%
Student 42 4%
Other 27 3%
Unemployed 105 11%
With these criteria, we identified 110 cases, which
were manually reviewed and removed from further
analysis.
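As a rough sketch of these exclusion criteria (assuming a pandas data frame with hypothetical column names such as total_time_sec; the 125-second threshold is the one described above, and flagged cases would still undergo the manual review described in the text):

```python
import pandas as pd

MIN_TIME_SECONDS = 125  # threshold derived from a fast reading speed
                        # and the survey's character count (see text)

def flag_suspicious(df: pd.DataFrame, item_cols: list) -> pd.Series:
    """Flag straight-lined extreme responses and too-fast completions.

    item_cols are the 6-point Likert items (1..6); because they include
    the reverse-worded items, answering every item with 1 or with 6
    signals inattentive responding.
    """
    items = df[item_cols]
    straight_lined = items.eq(1).all(axis=1) | items.eq(6).all(axis=1)
    too_fast = df["total_time_sec"] < MIN_TIME_SECONDS
    return straight_lined | too_fast

# Hypothetical usage:
# df = pd.read_csv("responses.csv")
# mask = flag_suspicious(df, [c for c in df.columns if c.startswith("q_")])
# df_clean = df[~mask]  # 1054 -> 944 cases after manual review
```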
4.2 Sample Characteristics
The sample after data cleanup consisted of 944 cases
(Table 3). 51% of the participants were female, and
the age range was 19-69 years. The distribution of these
demographic characteristics is similar to that of the
Japanese population (Statistics Bureau, Ministry of
Internal Affairs and Communications, 2010), as es-
tablished in the data collection process.
4.3 Effect on Responses to Questions
about the Content of the Privacy
Policy
We analyzed the categorical responses to the ques-
tions about the privacy policy using chi-square tests.
We were interested in the differences between the
responses to each privacy aspect question, so we
used contingency tables to represent the relationship
between questions and answers in each experimen-
tal condition. As indicated previously, we wanted
to test whether participants had understood the re-
sults of the tool and whether differences in the in-
formation provided in each condition were reflected
in their answers. The results of the chi-square test
of independence are shown in Figure 3.

Figure 3: Association plot of the relationship between privacy policy questions and their responses for each experimental condition. Blue and red areas indicate significantly higher and lower response proportions than expected (i.e., if all questions had the same response proportion), respectively.

Association plots (Meyer et al., 2006) were used to visualize areas with significantly higher or lower response proportions, compared between privacy aspect questions.
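The test itself is standard; the sketch below uses scipy with invented counts. The paper's association plots come from the vcd package in R; the standardized Pearson residuals computed at the end convey the same information, with positive values marking cells whose response proportion is higher than expected.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table for one experimental condition:
# rows = privacy aspect questions, columns = response options
# (Definitely yes, Possibly yes, Possibly no, Definitely no,
#  It doesn't say, It's unclear).
observed = np.array([
    [80, 50, 10, 5, 5, 7],    # Data Collection
    [60, 55, 15, 8, 10, 9],   # Privacy Settings
    [70, 52, 12, 6, 9, 8],    # Account Deletion
    # ... one row per remaining privacy aspect
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")

# Standardized Pearson residuals: the quantity an association plot
# visualizes; |residual| > ~2 marks cells that deviate notably
# from independence.
residuals = (observed - expected) / np.sqrt(expected)
print(residuals.round(2))
```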
The responses in the Control condition provide a
base for how participants understood the content of
the privacy policy from the results of the tool. The
majority of participants chose a positive answer for
all of the questions except the one corresponding to
Protection of Children, indicating that the result of
the tool communicated the expected information in
the Control condition. However, the results show
that there were no differences in the proportion of re-
sponses corresponding to the question regarding Pri-
vacy Settings compared to other aspects in any of the
experimental conditions. As can be observed in Fig-
ure 3, there are no significant differences in the pro-
portions of responses to the Privacy Settings question
compared to the responses to the Control of Data or
Data Aggregation questions, for example. This lack
of significant differences indicates that participants’
responses were not influenced by either the low confidence percentage or the incorrect justification in the
result for the Privacy Settings aspect. However, there
is some evidence that at least some participants con-
sidered the justification information in their response.
For the conditions that include justification, the pro-
portion of Does not say responses in the Data Reten-
tion question is higher. A review of the fragment cor-
responding to this privacy aspect indicates that there
is no mention of a specific time for the retention of
the data. This lack of detail may have resulted in at
least some participants considering that the question
was not answered in the privacy policy. We do not
see this difference in the Control condition nor in the
Confidence condition, which do not include the justi-
fication fragment.
We also tested for differences in the time taken to
answer the full survey between conditions. The dis-
tribution of time was similar for all conditions and
highly skewed, so we used non-parametric Kruskal-
Wallis tests for the difference in median. We found no
significant differences, indicating that the additional
information in the justification and confidence conditions
did not have an influence on the time taken to finish
the survey.
Figure 4: Box plots for all variables (Trustworthy, Useful, Intention, Understandable) across the Control, Justification, Highlight, Confidence, Justification+Confidence and Highlight+Confidence conditions. The blue line indicates the median for the Control condition. Details of the significant differences according to Dunn's test are given in Table 5.
4.4 Effect on Perception of the Tool
We created composite variables by summing the items
corresponding to attitudes and perception of useful-
ness, understandability and trustworthiness. We cal-
culated a Cronbach’s alpha measure for the items cor-
responding to each composite variable. In the case of
usefulness, the recoded reverse-worded item was neg-
atively correlated and we removed it from the analy-
sis. After removal, Cronbach’s alpha values indicated
good internal consistency (all values above 0.9). Fig-
ure 4 shows the median in each experimental condi-
tion for all variables.
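Cronbach's alpha can be computed directly from each construct's respondent-by-item matrix; a minimal sketch (column names are hypothetical):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of the
    summed scale), where k is the number of items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical usage with three trustworthiness items rated 1..6; a
# reverse-worded item would be recoded (7 - x) before this step, as
# was done for the usefulness item that was later dropped.
# alpha = cronbach_alpha(df[["trust_1", "trust_2", "trust_3"]].to_numpy())
```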
All composite variables had a similar non-normal
distribution shape; therefore, we used non-parametric
Kruskal-Wallis tests for the difference between their
medians, and Dunn’s test for multiple comparisons.
To control for false positives, p-values were adjusted
using the Benjamini–Hochberg procedure (Benjamini
and Hochberg, 1995). We found significant differ-
ences between groups for all variables, according to
the results of Kruskal-Wallis tests (Table 4). We con-
ducted post-hoc comparisons using Dunn’s tests to
evaluate which of the groups were significantly dif-
ferent. Table 5 shows the detailed results.
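A sketch of this pipeline on synthetic data (labels and score ranges are stand-ins; the scikit-posthocs package, noted in a comment, is one available implementation of Dunn's test with Benjamini-Hochberg adjustment, not necessarily the one used in the study):

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Synthetic long-format data standing in for the composite scores:
# one row per participant, a condition label and a summed score
# (three 6-point items -> range 3..18).
df = pd.DataFrame({
    "condition": rng.choice(
        ["Control", "Justification", "Highlight", "Confidence",
         "Justification+Conf", "Highlight+Conf"], size=944),
    "trustworthy": rng.integers(3, 19, size=944),
})

groups = [g["trustworthy"].to_numpy() for _, g in df.groupby("condition")]
h, p = kruskal(*groups)
print(f"Kruskal-Wallis: H = {h:.3f} (df = {len(groups) - 1}), p = {p:.3f}")

# Post-hoc pairwise Dunn tests with Benjamini-Hochberg adjustment,
# e.g. via scikit-posthocs:
# import scikit_posthocs as sp
# sp.posthoc_dunn(df, val_col="trustworthy", group_col="condition",
#                 p_adjust="fdr_bh")
```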
The Highlight and Highlight+Confidence condi-
tions were more positively perceived in terms of use-
fulness and trustworthiness than the Control condi-
tion. For behavioral intention, we found a significant
difference only between the Control and Justification
conditions.
Table 4: Results of the Kruskal-Wallis test for the difference
in median between conditions.
Chi-squared (df=5) p
Intention 11.694 0.033
Useful 17.217 0.004
Understandable 11.715 0.039
Trustworthy 17.028 0.004
Table 5: Results of Dunn's test for multiple comparisons (adjusted p-values). P-values for comparisons that were non-significant (p > 0.05) are not shown. No significant differences were found for perceived understandability.

Control - Justification: Useful 0.049, Intention 0.017
Control - Highlight: Useful 0.010, Trustworthy 0.009
Control - Highlight+Conf.: Useful 0.048, Trustworthy 0.026
Highlight - Confidence: Trustworthy 0.035
In addition, we found no differences in the perception of any of the variables of interest be-
tween the Control and Confidence conditions. We
also did not find significant differences between the
Justification and Highlight conditions; we consider
that this may be due to the relatively subtle effect of
bolding the words. Nor were there significant differ-
ences between similar experimental conditions with
or without confidence information. In addition, al-
though the Kruskal-Wallis test indicated a significant difference for understandability, the post-hoc Dunn test did not find significant differences between any pair of conditions, based on the adjusted p-values.
Finally, with regard to the appropriateness of us-
ing AI for summarizing privacy policies, the results of
a Kruskal-Wallis test showed that there were no sig-
nificant differences between conditions. The median
value was 4 ("Somewhat agree") for all conditions except the Control condition, which had a median value of 3 ("Somewhat disagree").
5 DISCUSSION
The results show that providing privacy policy infor-
mation fragments as justification, with and without
highlighted words, improved perception of the tool
compared to not showing that information, albeit on
different dimensions. On the other hand, the results
show that confidence information did not have any
influence. Based on previous research on confidence explanations (Wang et al., 2016), we had ex-
pected that showing a confidence percentage would
improve perception of trustworthiness in particular,
but there was no effect on any of the measured per-
ception variables.
The results show that the short summary format
(Control condition) can inform users of the overall
privacy policy contents as categorized by the tool.
This result is in line with research on shorter privacy
policy formats (Gluck et al., 2016). On the other
hand, the additional explanatory information of justi-
fication and result confidence included in other exper-
imental conditions did not greatly alter the responses
to the questions about the privacy policy content, even
when the fragment did not accurately justify the result
or the confidence percentage was low. One possibil-
ity is that participants may have relied mainly on the
privacy aspect’s icons and descriptions to answer the
questions about the privacy policy, even when there
was additional information that contradicted the
overall result. This may have been due to the
fact that we did not explicitly bring attention to the
justification and confidence percentage, beyond in-
cluding their descriptions in the help section of the inter-
face. In addition, the questions we asked participants
were straightforward and for the most part targeted in-
formation that was already available in the elements
of the Control condition interface.
However, the results indicate that participants did
consider justification information, at least to some ex-
tent, as evidenced by the answers to the Data Reten-
tion question in conditions where the privacy policy
fragment was shown. As for the incorrect fragment
corresponding to the Privacy Settings aspect, it may
be that participants were not sufficiently familiar with
this type of setting, and therefore could not judge
whether the fragment was incorrect or not. In the case
of confidence information, another possibility is that
the users did not think to question the results of the
tool and therefore ignored the contradictory informa-
tion of low confidence. This can happen when the
user considers that the system is reliable (Wang et al.,
2016). Figure 4 shows that the median of trustwor-
thiness perception is higher than the midpoint for all
conditions, which lends support to this hypothesis.
In general, the results of this study suggest that
adding explanatory information in the form of justi-
fication can be beneficial to automated privacy pol-
icy summarization tools. However, post-hoc power
analysis indicates that the sample is large enough to
detect small differences, meaning that it is possible
that statistically significant improvements in percep-
tion of usefulness and trustworthiness may not be rel-
evant in practice. Conversely, the finding that not all explanatory information had a significant positive effect suggests that it is important to evaluate whether
this additional information can truly benefit users, be-
fore considering adding it to the result interface of
an automated privacy policy summarization tool. Al-
though our results do not indicate that there would be
negative effects if explanatory information is shown,
future research should consider evaluating any pos-
sible tradeoffs in terms of usability. More research
is needed to identify what information to present to
users in order to improve efficacy as well as percep-
tion of these automated privacy tools.
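The post-hoc power analysis is not detailed in the text; as a rough stand-in, an ANOVA-based power solver gives the smallest standardized effect detectable with this sample size and design (illustrative only, since the reported tests were non-parametric):

```python
from statsmodels.stats.power import FTestAnovaPower

# Smallest effect size (Cohen's f) detectable with ~944 participants
# across 6 groups at alpha = .05 and power = .80. This ANOVA
# approximation is our assumption, not the paper's exact procedure.
f = FTestAnovaPower().solve_power(effect_size=None, nobs=944,
                                  alpha=0.05, power=0.80, k_groups=6)
print(f"Detectable Cohen's f: {f:.3f}")  # roughly 0.12, a small effect
```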
6 CONCLUSIONS
In this paper, we conducted an experimental study to
evaluate whether showing justification and confidence
of the results of an automated privacy policy summa-
rization tool influences user perception of the tool and
whether this information can help users correctly in-
terpret those results. The findings suggest that show-
ing a privacy policy fragment as justification for the
result can improve perception of usefulness and trust-
worthiness of the tool, and can increase intention to
use the tool. On the other hand, information about
the confidence of results did not appear to have much
influence on user perception. Moreover, the findings
also indicate that even when the confidence informa-
tion indicated higher uncertainty of the results, the
users did not rely on this information to interpret the
results of the tool.
Altogether, the findings indicate that it may be
worth considering adding explanatory information to
help improve the perception of an automated privacy
policy summarization tool, but that the type of expla-
nation should be carefully chosen and evaluated, since
explanatory information by itself may not be enough
to help users understand the limitations of the results
of the tool. Future research should investigate the type
of explanatory information that these automated pri-
vacy tools should provide to users, as well as how to
present that information in a usable way.
REFERENCES
Benjamini, Y. and Hochberg, Y. (1995). Controlling
the False Discovery Rate: A Practical and Power-
ful Approach to Multiple Testing. Journal of the
Royal Statistical Society. Series B (Methodological),
57(1):289–300.
Biran, O. and McKeown, K. (2017). Human-centric Jus-
tification of Machine Learning Predictions. In Pro-
ceedings of the 26th International Joint Conference on
Artificial Intelligence, IJCAI’17, pages 1461–1467.
AAAI Press.
Bracamonte, V., Hidano, S., Tesfay, W. B., and Kiyomoto,
S. (2019). Evaluating Privacy Policy Summarization:
An Experimental Study among Japanese Users. In
Proceedings of the 5th International Conference on
Information Systems Security and Privacy - Volume 1:
ICISSP, pages 370–377. INSTICC, SciTePress.
Diakopoulos, N. (2016). Accountability in Algorithmic De-
cision Making. Commun. ACM, 59(2):56–62.
European Commission (2019). European Commission adopts adequacy decision on Japan, creating the world's largest area of safe data flows. Press release. http://europa.eu/rapid/press-release_IP-19-421_en.htm.
European Parliament (2016). Regulation (EU) 2016/679
of the European Parliament and of the Council of 27
April 2016 on the protection of natural persons with
regard to the processing of personal data and on the
free movement of such data, and repealing Directive
95/46.
Gluck, J., Schaub, F., Friedman, A., Habib, H., Sadeh, N.,
Cranor, L. F., and Agarwal, Y. (2016). How Short
Is Too Short? Implications of Length and Framing
on the Effectiveness of Privacy Notices. In Twelfth
Symposium on Usable Privacy and Security (SOUPS
2016), pages 321–340. USENIX Association.
Gregor, S. and Benbasat, I. (1999). Explanations from In-
telligent Systems: Theoretical Foundations and Impli-
cations for Practice. MIS Quarterly, 23(4):497–530.
Harkous, H., Fawaz, K., Lebret, R., Schaub, F., Shin,
K. G., and Aberer, K. (2018). Polisis: Automated
Analysis and Presentation of Privacy Policies Using
Deep Learning. In 27th USENIX Security Symposium
(USENIX Security 18), pages 531–548. USENIX As-
sociation.
Kay, M., Patel, S. N., and Kientz, J. A. (2015). How Good
is 85%?: A Survey Tool to Connect Classifier Eval-
uation to Acceptability of Accuracy. In Proceedings
of the 33rd Annual ACM Conference on Human Fac-
tors in Computing Systems, CHI ’15, pages 347–356.
ACM.
Kelley, P. G., Bresee, J., Cranor, L. F., and Reeder, R. W.
(2009). A ”Nutrition Label” for Privacy. In Proceed-
ings of the 5th Symposium on Usable Privacy and Se-
curity, SOUPS ’09, pages 4:1–4:12. ACM.
Kelley, P. G., Cesca, L., Bresee, J., and Cranor, L. F. (2010).
Standardizing Privacy Notices: An Online Study of
the Nutrition Label Approach. In Proceedings of the
SIGCHI Conference on Human Factors in Computing
Systems, CHI ’10, pages 1573–1582. ACM.
Kizilcec, R. F. (2016). How Much Information?: Effects of
Transparency on Trust in an Algorithmic Interface. In
Proceedings of the 2016 CHI Conference on Human
Factors in Computing Systems, CHI ’16, pages 2390–
2395. ACM.
Lai, V. and Tan, C. (2019). On Human Predictions with Ex-
planations and Predictions of Machine Learning Mod-
els: A Case Study on Deception Detection. In Pro-
ceedings of the Conference on Fairness, Accountabil-
ity, and Transparency, FAT* ’19, pages 29–38. ACM.
Meyer, D., Zeileis, A., and Hornik, K. (2006). The Strucplot
Framework: Visualizing Multi-way Contingency Ta-
bles with vcd. Journal of Statistical Software, 17(3).
Schaub, F., Balebako, R., Durity, A. L., and Cranor, L. F.
(2015). A design space for effective privacy notices.
In Eleventh Symposium On Usable Privacy and Secu-
rity (SOUPS 2015), pages 1–17.
Statistics Bureau, Ministry of Internal Affairs and Commu-
nications (2010). Population and Households of Japan
2010.
Tesfay, W. B., Hofmann, P., Nakamura, T., Kiyomoto, S.,
and Serna, J. (2018a). I Read but Don't Agree:
Privacy Policy Benchmarking Using Machine Learn-
ing and the EU GDPR. In Companion Proceedings
of The Web Conference 2018, WWW '18, pages
163–166. International World Wide Web Conferences
Steering Committee.
Tesfay, W. B., Hofmann, P., Nakamura, T., Kiyomoto, S.,
and Serna, J. (2018b). PrivacyGuide: Towards an
Implementation of the EU GDPR on Internet Privacy
Policy Evaluation. In Proceedings of the Fourth ACM
International Workshop on Security and Privacy Ana-
lytics, IWSPA ’18, pages 15–21. ACM.
ToSDR (2019). Terms of Service; Didn’t Read.
https://tosdr.org/.
Wang, N., Pynadath, D. V., and Hill, S. G. (2016). Trust
Calibration Within a Human-Robot Team: Compar-
ing Automatically Generated Explanations. In The
Eleventh ACM/IEEE International Conference on Hu-
man Robot Interaction, HRI ’16, pages 109–116.
IEEE Press.
Zaeem, R. N., German, R. L., and Barber, K. S. (2018).
PrivacyCheck: Automatic Summarization of Privacy
Policies Using Data Mining. ACM Trans. Internet
Technol., 18(4):53:1–53:18.
Zimmeck, S. and Bellovin, S. M. (2014). Privee: An
Architecture for Automatically Analyzing Web Pri-
vacy Policies. In 23rd USENIX Security Symposium
(USENIX Security 2014), pages 1–16. USENIX Asso-
ciation.