Dynamic Visualization System for Gaze and Dialogue Data
Jonathan Kvist¹, Philip Ekholm¹, Preethi Vaidyanathan², Reynold Bailey³ and Cecilia Ovesdotter Alm³
¹Department of Computer Science and Media Technology, Malmö University, Nordenskiöldsgatan 1, Malmö, Sweden
²Eyegaze Inc., 10363 Democracy Lane, Fairfax, VA 22030, U.S.A.
³Rochester Institute of Technology, 1 Lomb Memorial Drive, Rochester, NY 14623, U.S.A.
Keywords:
Eye-tracking, Dialogue, Multimodal Visualization.
Abstract:
We report and review a visualization system capable of displaying gaze and speech data elicited from pairs of
subjects interacting in a discussion. We elicit such conversation data in our first experiment, where two par-
ticipants are given the task of reaching a consensus about questions involving images. We validate the system
in a second experiment where the purpose is to see if a person could determine which question had elicited
a certain visualization. The visualization system allows users to explore reasoning behavior and participation
during multimodal dialogue interactions.
1 INTRODUCTION
AI systems collaborating with humans should under-
stand their users. Having access to more than just
one modality improves this interaction and can make it less error-prone (Tsai et al., 2015; Kontogiorgos et al.,
2018). However, since human signals are ambiguous,
their interpretation is challenging (Carter and Bailey,
2012). By visualizing gaze and speech behaviors of
people discussing the visual world we can better un-
derstand human reasoning. We present a visualization
system incorporating gaze and speech from two inter-
locutors engaging in dialogue about images.
Visualizing data is helpful in many scenarios; for example, eye movement data reveals how users spread their attention (Blascheck et al., 2017). Popular visualizations such as heat maps and word clouds only cover
one modality and are often presented as static im-
ages. In contrast, we present a multimodal dialogue
visualization system that incorporates both gaze and
speech data, enhancing the amount of information re-
garding how interlocutors reasoned. Multimodal dia-
logues are complex since they have temporal progres-
sion. Therefore, our system utilizes dynamic render-
ing to capture the evolution of gaze and dialogue.
To elicit multimodal data we conducted Experi-
ment I where participants were paired and given the
task to discuss and reach a verbal consensus on ques-
tions about images which were chosen based on com-
plexity and animacy (see Figure 1). The participants' eye movements and dialogue were recorded.

Figure 1: Images used in Experiment I (Top-left: simple inanimate; top-right: complex inanimate; bottom-left: simple animate; bottom-right: complex animate).

This was
followed by building a prototype visualization system
and then evaluating and validating it in Experiment
II by having participants view multimodal visualiza-
tions generated from Experiment I and attempting to
identify the underlying questions that generated the
visualization. The research questions studied were:
RQ1: Does access to more modalities improve
identifying the underlying question that elicited
the dialogue shown (i.e. question recognition)?
RQ2: Do the properties of stimuli, namely com-
plexity, animacy, and question-type impact ques-
tion recognition?
Figure 2: First, Experiment I collected speech and gaze data. Pairs sat across from each other in front of eye trackers, wearing microphones to record their speech. Next, we developed a prototype of a multimodal visualization system using the data. Then, Experiment II evaluated the prototype by letting users interact with it and respond to questions.
We use question recognition as an objective mea-
surement of visualization success.
2 PRIOR WORK
Eye-tracking is becoming better, cheaper, and more accessible, and could therefore let us understand more about user behavior and human reasoning. It produces large amounts of data, making analysis a challenge. Analysis tends to involve quantitative statistics or qualitative visualizations (Blascheck et al., 2014). The statistical tech-
niques can give us information such as fixation count
and average number of saccades, while visualizations
can promote more holistic meaningful information by
highlighting how user attention is distributed over the
stimuli. Blascheck et al. (2017) explored how ex-
perts in the field analyzed eye tracking data, and they
agreed that statistical analysis and exploratory visual-
izations both are important. Almost all experts used
visualizations in their work, especially heat maps and
scan paths. The authors suggested that more ad-
vanced techniques would be adopted if they were
more widespread and straightforward to use.
Eye-tracking visualizations are categorized as ei-
ther point-based or area-of-interest based (Blascheck
et al., 2014). Regardless of the approach used, effec-
tive visualization frameworks should ideally allow for
easy and customizable user interaction, provide meth-
ods to incorporate additional data modalities and be
able to handle dynamic content, not just static 2D im-
ages (Blascheck et al., 2014; Blascheck et al., 2017;
Stellmach et al., 2010; Blascheck et al., 2016; Ram-
loll et al., 2004). Interestingly, one of the experts in
the Blascheck et al. (2017) study mentioned that a vi-
sualization should be analyzed “not just as a means
for the researcher to understand the data, but as a
specific independent variable in which participants
are shown the eye movement patterns of either them-
selves, or from other observers, and how they can (or
cannot) understand and exploit this information.” We
leverage this idea as a way to assess the effectiveness
of our visualization system.
Online collaborative learning tools have the po-
tential to change the way people learn. Visualization
of the information exchange can aid in the learning
process (Bull and Kay, 2016). To improve such sys-
tems, it is important to analyze how users collaborate
in them. Traditionally this might be done with a static
2D image after the session, but Koné et al. (2018) emphasize that it will be too late by then. They also point out that the temporal information from the collaboration is lost in such a visualization (Koné et al.,
2018). The authors therefore argue for the need of
a dynamic, real time visualization that includes the
temporal context, e.g. by using animations. Another
study reports on an approach that makes scan paths
more easily comparable by mapping them as a func-
tion of time (Räihä et al., 2005). This is considered
advantageous when the exact locations of fixations are
less important than certain areas of interest.
Visualizations can be used to find relationships be-
tween modalities. A strong relationship has been ob-
served between the dialogue and gaze patterns of two
interlocutors (Sharma and Jermann, 2018). Wang et
al. (2019) also found that the gaze of two interlocutors is linked in conversation, a relationship that our system can highlight. They presented the users
with both a 3D and 2D scene, where the results did
not differ much (Wang et al., 2019). Our data collec-
tion is based on their work’s 2D scene setup. Dur-
ing their study, they asked the participants questions
which influenced conversation and gaze patterns.
3 METHOD
This study included three steps as seen in Figure 2:
1. Experiment I - elicit multi-modal discussions
2. Prototype the multi-modal visualization system
3. Experiment II - evaluate the prototype
3.1 Experiment I - Eliciting
Multi-modal Discussions
We conducted an IRB-approved data collection ex-
periment with 20 participants, recruited at a univer-
sity in the USA. The participants were paired up,
and their remuneration for participation was 10 USD
each. Half of the participants were female and the
other half male, and in 8 of the 10 pairs a female par-
ticipant was teamed with a male participant. Eigh-
teen were 19-23 years old. The other two were 26-
30. They first filled out a pre-survey, and were then
seated in front of each other at a small table as shown
in Figure 3. Each person was equipped with a head-
worn microphone to record their speech, and we used
a SensoMotoric Instruments RED 250Hz eye-tracker
mounted on a laptop to track their gaze. To elicit dif-
ferent conversations and gaze patterns, their task was
to discuss aloud and reach verbal consensus on a se-
ries of questions (shown in Figure 4) about four im-
ages (shown in Figure 1). Questions 1-3 were affec-
tive questions regarding the mood, feelings and sub-
jective attitudes, while questions 4-6 were material
questions about the characteristics of the
objects in the images. We included different types of
questions to understand how visualization recognition
may be impacted by more or less emotional answers.
Experiment I was performed 10 times with a to-
tal of 20 participants, and each experiment lasted for
about 60 minutes. Data from two pairs had to be ex-
cluded due to misinterpretation of the task. Each pair
elicited 24 conversations (4 images * 6 questions),
meaning that in total we obtained 192 conversations (8 pairs * 24 conversations).
3.2 Visualization System
The visualization prototype was built using Process-
ing, a Java-based sketchbook for drawing graphics,
text, images, and more (Fry and Reas, 2014).
3.2.1 Constraints
To enable our system to be fully autonomous, we use
Microsoft Azure Automatic Speech to Text (ASR) to
transcribe the conversations (Microsoft, 2019).

Figure 3: Experiment set-up: two interlocutors seated across from each other with their respective microphones, eye trackers, and laptops.

Experiment I questions
Q1. What activities does the owner of these objects or the main person in the picture enjoy?
Q2. How would you change the environment to make it more welcoming?
Q3. Describe an artwork that this image inspires you to create.
Q4. Pick an object or person in the picture and explain why it does not belong there.
Q5. Which two items belong together in the picture?
Q6. Which non-living object is the oldest and which is the newest?

Figure 4: Experiment I asked three affective questions (Q1-Q3) and three material questions (Q4-Q6) to elicit discussions.

We computed the Word Error Rate (WER) for four conversations; the values ranged between 18% and 29%. We
believe that the WER is high due to two reasons. First,
both speakers had wearable microphones to record
their speech, but they were seated at the same table
which introduced some crosstalk. Second, we cannot
verify whether the training data included dialogues as
opposed to speech from a single person. The model’s
predictions may not be as accurate for dialogue data.
To counteract this, we included the confidence level
that the ASR system assigned to a particular utterance
when presenting the transcription to the user. This
gives the user information to decide for themselves
whether to trust the transcription or not.
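For reference, WER here is the word-level edit distance between a manual reference transcript and the ASR hypothesis, normalized by the reference length. The sketch below is a minimal illustration of that computation under simple whitespace tokenization; it is not the evaluation script used in this study, and the example utterances are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over word tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Invented example: manual transcript vs. a plausible ASR output.
print(word_error_rate("which two items belong together in the picture",
                      "which two items belong to gather in the picture"))
```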
Figure 5: Overview of the prototype. 1: Fixations and words spoken by a participant while looking in the surrounding area.
2: Fixations and saccades in different colors corresponding to a speaker. 3: Various settings. 4: Machine-transcribed subtitles
of the dialogue. 5: Words that are mentioned frequently, displayed along with their frequency counts.
3.2.2 Co-reference Resolution
To extract more meaning from the utterances in the
discussion, we applied co-reference resolution. Co-
reference resolution is the task of determining if some
expressions in the discussion refer back to the same
entity, e.g. if he and Daniel refer to the same per-
son (Soon et al., 2001). We use a pre-trained model,
included in the AllenNLP processing platform (Gard-
ner et al., 2018), which performs end-to-end neural
co-reference resolution (Lee et al., 2017). This is ap-
plied to the text transcribed by Microsoft ASR and the
resulting information is used by the most frequently
mentioned words table in our system, so that an entity
is updated with the correct number of mentions.
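A minimal sketch of this step is shown below. It assumes a publicly hosted AllenNLP coreference archive; the model URL, the example transcript, and the variable names are illustrative placeholders rather than our prototype's exact configuration. The sketch aggregates a per-entity mention count of the kind consumed by the most-frequently-mentioned-words table.

```python
from collections import Counter

from allennlp.predictors.predictor import Predictor

# Placeholder archive URL; depending on the AllenNLP version,
# `pip install allennlp-models` may also be required.
MODEL_URL = ("https://storage.googleapis.com/allennlp-public-models/"
             "coref-spanbert-large-2020.02.27.tar.gz")

predictor = Predictor.from_path(MODEL_URL)

# Invented transcript fragment standing in for the Microsoft ASR output.
transcript = "Daniel likes the lamp. He says it does not belong in the picture."
result = predictor.predict(document=transcript)

tokens = result["document"]    # tokenized transcript
clusters = result["clusters"]  # each cluster is a list of [start, end] token spans

mention_counts = Counter()
for cluster in clusters:
    start, end = cluster[0]  # use the first mention as the entity's display label
    label = " ".join(tokens[start:end + 1]).lower()
    mention_counts[label] += len(cluster)  # count every mention of the entity

print(mention_counts.most_common(5))
```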
3.2.3 Creation of the Visualization
The prototype system is made up of four major com-
ponents, which correspond to the different levels of
dialogue modalities in the visualization system seen
in Table 1: gaze + word tokens (M1), gaze + word to-
kens + subtitles (M2), gaze + word tokens + subtitles
+ conversation playback (M3), gaze + word tokens +
subtitles + conversation playback + access to all set-
tings (M4). Previous research identified the impor-
tance of handling the temporal progression of conver-
sations, meaning that our system had to be dynamic.
Different users would potentially also use the system
for different tasks. Therefore, we wanted to give the
user access to as many settings as possible from the
interface. Figure 5 shows a screenshot of the dynamic
visualization prototype. The key parts of the system
are numbered and explained below:
1. Fixations - represented by either a green or pink
circle, depending on which interlocutor generated
it. The size of the circle is proportional to the fix-
ation duration.
2. Saccades - represented by a line between two fixations. As new saccades are displayed, the oldest one present on the screen is removed to make the visualization less cluttered (see the sketch after this list).
3. Settings - the user can access the settings of the
visualization in the panel on the left side of the
system. A legend in the upper-left corner shows
the different color codes used.

Table 1: Modalities active for each iteration during Experiment II. The evaluators had access to overlaid gaze and words at all times, but subtitles, audio, and the settings were introduced as the experiment progressed. Stopwords were not shown by default.

Modalities  Gaze  Words  Subtitles  Audio  Settings
M1          Yes   Yes    -          -      -
M2          Yes   Yes    Yes        -      -
M3          Yes   Yes    Yes        Yes    -
M4          Yes   Yes    Yes        Yes    Yes

Figure 6: Number of times the different questions appeared in Experiment II.

Additionally, the following settings can be accessed:
Minimum fixation length (sliding bar)
Word display time (sliding bar)
Fixation scaling (sliding bar)
Playback speed (sliding bar)
Saccades displayed (number input)
Subtitles (toggle)
Conversation sound (toggle)
Show stopwords (toggle) - display and count
words that are recognized as stopwords. The
Natural Language Toolkit (Bird et al., 2009)
was used for filtering them out.
Update word count by co-reference (toggle) -
update word count for entities discovered dur-
ing co-reference resolution.
4. Subtitles - subtitles from the conversation in real
time as generated by the Microsoft ASR.
5. Most mentions - the words that have been men-
tioned the most in the conversation.
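The prototype itself is written in Processing; the Python sketch below only illustrates the bookkeeping behind items 1, 2, and 5 above: a fixation circle whose size is proportional to its duration, a fixed-length buffer that drops the oldest saccade when a new one is drawn, and a word-frequency table with optional NLTK stopword filtering. Class and parameter names such as max_saccades and fixation_scale are illustrative, not the prototype's actual identifiers.

```python
from collections import Counter, deque

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOPWORDS = set(stopwords.words("english"))

class DialogueGazeState:
    """Illustrative per-frame state for the kind of rendering described above."""

    def __init__(self, max_saccades: int = 10, fixation_scale: float = 0.1):
        self.saccades = deque(maxlen=max_saccades)  # oldest saccade dropped automatically
        self.fixation_scale = fixation_scale
        self.word_counts = Counter()

    def add_fixation(self, x: float, y: float, duration_ms: float, speaker: str):
        radius = duration_ms * self.fixation_scale  # circle size proportional to duration
        return {"x": x, "y": y, "radius": radius, "speaker": speaker}

    def add_saccade(self, start, end):
        self.saccades.append((start, end))  # keeps only the most recent saccades on screen

    def add_words(self, utterance: str, show_stopwords: bool = False):
        for token in utterance.lower().split():
            if show_stopwords or token not in STOPWORDS:
                self.word_counts[token] += 1

state = DialogueGazeState()
state.add_words("I think the lamp and the desk belong together")
print(state.word_counts.most_common(3))
```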
3.3 Experiment II - Evaluation of the
Multimodal Visualization System
In order to evaluate the effectiveness of our visual-
ization system, we conducted a second experiment
which utilized the data elicited from Experiment I.
For this experiment, we included ten participants re-
ferred to as evaluators to distinguish them from the
participants of Experiment I. All evaluators were re-
cruited from the same university as in Experiment I.
Six were female and four were male, and their remu-
neration was 10 USD. Nine evaluators were between
19 and 29 years old and one was above 40.
Each evaluator first answered the same demo-
graphic pre-survey as in Experiment I. They were then
seated in front of a screen that displayed the visualiza-
tion prototype. They were presented with eight visu-
alizations in total, all with data elicited in Experiment I.

Figure 7: The conversation length decreased as the experiment progressed, indicating growing scene familiarity.

We chose the eight visualizations based on three
factors: (1) underlying question, (2) image, and (3)
pair from Experiment I. Each evaluator was presented
with each question once and two additional questions
selected at random. Each image appeared twice, and
the conversations were chosen from at least seven dif-
ferent pairs from Experiment I. A total of 80 multi-
modal visualizations were presented to evaluators (10
evaluators * 8 visualized conversations). Every se-
lected visualization was presented to an evaluator four
times. Each time, the access to modalities or function-
alities in the interface increased, as can be seen in Ta-
ble 1. First, the evaluator only had access to the word
tokens and the gaze points in the system (M1). Then
the subtitles were added (M2), as transcribed by Mi-
crosoft ASR. In the third run the evaluator wore ear-
phones and the actual conversation was played back
along with the visualization (M3). Lastly, the evalu-
ator got full access to all the settings in the interface
(M4). After each introduction of modality or func-
tionality, the evaluator was asked three questions:
1. Which question generated the visualized gaze and
spoken language? (Menu with questions)
2. How confident are you about this answer? (Menu
with 1-10, where 1 means not confident and 10
means extremely confident)
3. Why did you choose that question? (Long answer
text)
For the drop-down menu in question 1, the evalua-
tor could choose from the questions in Figure 4. After
M1 to M4 had been covered for one visualization, the
evaluator was also asked to give a free text answer to
the question: "What is your perception about this visualization?"
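As an illustration of the selection constraints described in this section, the sketch below draws one evaluator's eight visualizations: each question once plus two extra questions at random, each image exactly twice, and conversations drawn from distinct Experiment I pairs (the study required at least seven distinct pairs). It is a simplified sketch consistent with these constraints, not the actual assignment procedure or data used in the study.

```python
import random

QUESTIONS = ["Q1", "Q2", "Q3", "Q4", "Q5", "Q6"]
IMAGES = ["simple-inanimate", "complex-inanimate", "simple-animate", "complex-animate"]
PAIRS = [f"pair{i}" for i in range(1, 9)]  # the 8 retained pairs from Experiment I

def select_visualizations(rng: random.Random):
    """Each question once plus two extras at random; each image exactly twice."""
    questions = QUESTIONS + rng.sample(QUESTIONS, 2)
    images = IMAGES * 2
    rng.shuffle(questions)
    rng.shuffle(images)
    pairs = rng.sample(PAIRS, 8)  # without replacement, so all pairs are distinct
    return list(zip(questions, images, pairs))

for question, image, pair in select_visualizations(random.Random(0)):
    print(question, image, pair)
```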
Figure 8: The question types (affective (Q1-Q3) and material (Q4-Q6)) elicited similar conversation lengths.
4 RESULTS AND DISCUSSION
Conversation Duration: We analyzed the duration
of conversations in Experiment I. Figure 7 shows that
average conversation length decreased as the experi-
ment progressed. This could be due to growing scene
familiarity. In addition, since images and questions were repeated multiple times, fatigue and boredom could have played a part. Interestingly, the duration of
a conversation did not affect the ability of the evalua-
tors to accurately identify the underlying question,
i.e. question recognition. Figure 8 shows that the con-
versation duration is not affected substantially by the
type of question asked.
Effectiveness of the Visualization: We received
mixed responses from evaluators in Experiment II.
One evaluator described the visualization as "very confusing", whereas another evaluator said "think its clear where circles with lines showing which objects take out". Many evaluators mentioned that the extra
modalities (subtitles, sound) made it easier to figure
out the underlying question. However, they high-
lighted that one of the more helpful tools was the list
of most frequently mentioned words (Figure 5). This
list was available throughout M1 to M4. Certain ques-
tions have a tendency to generate certain keywords,
which likely aided the evaluators in correctly identi-
fying the underlying question. We further note that a
majority of the evaluators did not change any settings
in M4 even though they could, indicating that the de-
fault settings were well calibrated.
Modalities and Question Recognition (RQ1): Our
results show that as we increase the modality level, i.e. provide more information to the evaluator in Experiment II, their ability to recognize the underlying question increases. This is evident from Figure 9.
With only word tokens and gaze data (M1), 68% of
the time evaluators were able to correctly recognize
the underlying question, which is far above a random guess among the six options.

Figure 9: Accuracy, measured as how often the evaluators in Experiment II managed to correctly identify which question elicited the multimodal visualization, increased when providing more forms of data. M1 = gaze data + word tokens, M2 = M1 + machine-transcribed subtitles, M3 = M2 + dialogue sound, M4 = M3 + ability to customize settings. A one-way ANOVA resulted in a p-value < 0.001.

Figure 10: Average self-reported confidence increases as more modalities/functionalities are introduced over M1 through M4. This follows the same pattern as the response accuracy, as can be seen in Figure 9. A one-way ANOVA was used as the hypothesis test, with a p-value of 0.04.

When provided with
added Microsoft ASR subtitles (M2), a significantly
higher share (89%) of the evaluators successfully rec-
ognized the question. This could be due to that the
subtitles encapsulate syntactic structure and context,
which helps the evaluator identify what the conver-
sation was about. Instead of only having access to
word tokens and no syntactic structure, the evalua-
tor can now follow the unfolding of the conversa-
tion. With added access to the audio of the conver-
sation (M3), 94% of the evaluators successfully rec-
ognized the question. Listening to the conversation
gave access to prosody and voice inflection, provid-
ing the evaluator more information about the conver-
sation that subtitles cannot straightforwardly capture,
for instance sarcasm. Also, the conversation audio
playback is not reliant on the ASR system to provide
the correct output, which is likely important due to
its higher WER. The last modality (M4), i.e. access to the various settings, did not impact the share of evalua-
tors that correctly recognized the underlying question.
Figure 11: No significant difference in question recognition
depending on the animacy of the visual content. A one-way
ANOVA resulted in a p-value of 0.88.
Figure 12: No significant difference in question recognition
depending on the visual complexity of the image. A one-
way ANOVA resulted in a p-value of 0.46.
This can also be seen in Figure 10, which shows that
the overall self-reported confidence level corresponds
well with the actual overall question recognition accu-
racy. We can see that during M3 (gaze, word tokens, subtitles, and audio), evaluators were almost certain they had
the correct answers. This gave them little to no incen-
tive to change any of the settings and improve upon
the prior answers. We observed that many evaluators
did not change any settings at all.
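The p-values reported in Figures 9-13 come from one-way ANOVAs. As a hedged illustration of how such a test can be run, the sketch below applies SciPy's one-way ANOVA to per-evaluator recognition accuracies grouped by modality level; the accuracy values are invented and only echo the overall trend, not our actual measurements.

```python
from scipy.stats import f_oneway

# Invented per-evaluator accuracies (fraction of questions recognized) per modality level.
m1 = [0.625, 0.750, 0.625, 0.750, 0.625, 0.750, 0.625, 0.750, 0.625, 0.750]
m2 = [0.875, 0.875, 1.000, 0.875, 0.750, 0.875, 1.000, 0.875, 0.875, 0.875]
m3 = [1.000, 0.875, 1.000, 1.000, 0.875, 0.875, 1.000, 1.000, 0.875, 1.000]
m4 = [1.000, 0.875, 1.000, 1.000, 0.875, 0.875, 1.000, 1.000, 0.875, 1.000]

statistic, p_value = f_oneway(m1, m2, m3, m4)
print(f"F = {statistic:.2f}, p = {p_value:.4f}")
```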
In summary, our results show that gaze and word
tokens were displayed in a manner that was helpful
for recognizing the underlying question. However,
context around the word tokens, whether provided via
subtitles or audio, appears to be a key piece of informa-
tion in understanding the focal point of a given con-
versation. This means that a useful visualization ben-
efits from multimodality, but also from structurally
sound content. The user is helped by more than just
extracted word tokens. It would help to investigate the
usefulness of phrases as compared to word tokens and
to measure how much of the context around a word is
required to understand the conversation.
Stimulus, Questions, and Question Recognition
(RQ2): We observe that animacy of the visual con-
tent might impact the ability to accurately recognize
the questions to some degree (Figure 11); however,
this is not statistically significant.

Figure 13: No significant difference in question recognition depending on the type of question. A one-way ANOVA resulted in a p-value of 0.73.

Similarly, neither the complexity of the image (Figure 12) nor the type of question, affective or material (Figure 13), seemed to affect
the ability of the evaluator to correctly recognize the
underlying question. This is advantageous, as a single visualization framework can be used across images and question types.
5 CONCLUSIONS
This work provides insight into how data from mul-
tiple modalities can be jointly displayed in a visual-
ization system for qualitative analysis and research
purposes. Our visualization system is dynamic and
allows a user to visualize both gaze and speech data
over time. We go one step further by providing multi-
party gaze and dialogue data visualization which will
be highly useful in studies aimed at understanding
human-human interaction and behavior. Such a visu-
alization system can also be used for remote mentor-
ing by an instructor. In this work we exemplify this
feature by showing the data from a pair of subjects
but this can be extended to more than two speakers
e.g. a group of people looking at a visual environment
and discussing it. We used 2D eye trackers, but based on Wang et al.'s (2019) findings of few differences in multimodal 2D and 3D scene understanding, we do not necessarily anticipate major changes if moving to actual 3D visual environments.
Our results show that access only to gaze and word
tokens is helpful in understanding the focal point of a
conversation (via question recognition) to some de-
gree. However, access to the context of the language
structure significantly increases the user's ability to
identify the focal point. This means that a multimodal
visualization system that simply displays word tokens is not sufficient on its own, and that transcriptions in the form of subtitles or spoken conversation are useful. The fact that
the evaluators found our visualization system (with all
modalities) useful, based on their response accuracy,
regardless of image type, image complexity, or question type, indicates that the visualization system can be used more broadly for understanding dialogues.
This is important for face-to-face collaborative appli-
cations such as a group of students collaborating on
a task in a classroom. In the future we would like to
analyze the benefits of providing only gaze informa-
tion without the most frequent words and compare it
to other modalities, isolated and combined.
Our visualization system is flexible and can be ex-
panded to include more features. For example, quan-
titative metrics such as mean and standard deviation
for fixation duration and the type-token ratio can also be added in the future. In addition, if more human-generated data is elicited, it could be added into the system
as new user features. Currently, the system displays
the static image that was used in Experiment I but it
can be extended to display a dynamic stimulus or 3D
real-world scenes. This is challenging and will be par-
ticularly helpful for researchers who want to analyze
data from wearable eye trackers.
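As a sketch of how the quantitative metrics mentioned above could be computed from the recorded data, the snippet below derives the mean and standard deviation of fixation durations and a simple type-token ratio; the values are invented and the helper names are illustrative, not part of the current system.

```python
from statistics import mean, stdev

def fixation_summary(durations_ms):
    """Mean and standard deviation of fixation durations in milliseconds."""
    return mean(durations_ms), stdev(durations_ms)

def type_token_ratio(transcript: str) -> float:
    """Unique word types divided by total word tokens."""
    tokens = transcript.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented values for illustration only.
print(fixation_summary([180, 220, 310, 150, 270]))
print(type_token_ratio("the lamp and the desk belong together because the desk is old"))
```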
Finally, we have shown that our discussion-based
multimodal data elicitation method can capture multi-
party reasoning behavior in visual environments. Our
framework is an important step toward meaningfully
visualizing and interpreting such multiparty multi-
modal data.
ACKNOWLEDGEMENTS
This material is based upon work supported by the
National Science Foundation under Award No. IIS-
1851591. Any opinions, findings, and conclusions
or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect
the views of the National Science Foundation.
REFERENCES
Bird, S., Klein, E., and Loper, E. (2009). Natural Language
Processing with Python. O’Reilly Media, Inc., 1st edi-
tion.
Blascheck, T., John, M., Kurzhals, K., Koch, S., and Ertl, T.
(2016). Va2: A visual analytics approach for evaluat-
ing visual analytics applications. IEEE Transactions
on Visualization and Computer Graphics, 22(1):61–
70.
Blascheck, T., Kurzhals, K., Raschke, M., Burch, M.,
Weiskopf, D., and Ertl, T. (2014). State-of-the-art of
visualization for eye tracking data. In EuroVis.
Blascheck, T., Kurzhals, K., Raschke, M., Burch, M.,
Weiskopf, D., and Ertl, T. (2017). Visualization of
eye tracking data: A taxonomy and survey. Computer
Graphics Forum, 36(8):260–284.
Bull, S. and Kay, J. (2016). Smili: a framework for inter-
faces to learning data in open learner models, learning
analytics and related fields. International Journal of
Artificial Intelligence in Education, 26(1):293–331.
Carter, S. and Bailey, V. (2012). Facial Expressions: Dynamic Patterns, Impairments and Social Perceptions. Nova Science Publishers, Hauppauge, N.Y.
Fry, B. and Reas, C. (2014). Processing: A Programming
Handbook for Visual Designers and Artists. The MIT
Press.
Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P.,
Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer,
L. (2018). AllenNLP: A deep semantic natural language processing platform. In Proceedings of the Workshop for NLP Open Source Software (NLP-OSS), pages 1–6.
Koné, M., May, M., and Iksal, S. (2018). Towards a
dynamic visualization of online collaborative learn-
ing. In Proceedings of the 10th International Con-
ference on Computer Supported Education - Volume
1: CSEDU, pages 205–212. INSTICC, SciTePress.
Kontogiorgos, D., Avramova, V., Alexanderson, S., Jonell,
P., Oertel, C., Beskow, J., Skantze, G., and Gustafson,
J. (2018). A multimodal corpus for mutual gaze
and joint attention in multiparty situated interaction.
In Proceedings of the Eleventh International Confer-
ence on Language Resources and Evaluation (LREC-
2018), Miyazaki, Japan. European Languages Re-
sources Association (ELRA).
Lee, K., He, L., Lewis, M., and Zettlemoyer, L. (2017).
End-to-end neural coreference resolution. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 188–197.
Microsoft (2019). Microsoft Azure Speech to Text. Ac-
cessed: 2019-12-17.
Räihä, K.-J., Aula, A., Majaranta, P., Rantala, H., and
Koivunen, K. (2005). Static visualization of tempo-
ral eye-tracking data. In IFIP Conference on Human-
Computer Interaction, pages 946–949. Springer.
Ramloll, R., Trepagnier, C., Sebrechts, M., and Beedasy, J.
(2004). Gaze data visualization tools: Opportunities
and challenges. 2010 14th International Conference
Information Visualisation, 0:173–180.
Sharma, K. and Jermann, P. (2018). Gaze as a proxy for
cognition and communication. In 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT).
Soon, W. M., Ng, H. T., and Lim, D. C. Y. (2001). A ma-
chine learning approach to coreference resolution of
noun phrases. Comput. Linguist., 27(4):521–544.
Stellmach, S., Nacke, L. E., Dachselt, R., and Lindley, C. A.
(2010). Trends and techniques in visual gaze analysis.
CoRR, abs/1004.0258.
Tsai, T. J., Stolcke, A., and Slaney, M. (2015). Multimodal
addressee detection in multiparty dialogue systems. In
Proc. IEEE ICASSP, pages 2314–2318. IEEE - Insti-
tute of Electrical and Electronics Engineers.
Wang, R., Olson, B., Vaidyanathan, P., Bailey, R., and Alm,
C. (2019). Fusing dialogue and gaze from discussions of 2D and 3D scenes. In Adjunct of the 2019 International Conference on Multimodal Interaction (ICMI '19 Adjunct), October 14–18, 2019, Suzhou, China. ACM, New York, NY, USA, 6 pages.