The Stor-e-Motion Visualization

for Topic Evolution Tracking in Text Data Streams

Andreas Weiler, Michael Grossniklaus and Marc H. Scholl

Department of Computer and Information Science, University of Konstanz

P.O. Box 188, 78457 Konstanz, Germany

Keywords:

Story Visualization, Text Data Streams, Twitter.

Abstract:

Nowadays, there are plenty of sources generating massive amounts of text data streams in a continuous way.

For example, the increasing popularity and the active use of social networks result in voluminous and fast-

ﬂowing text data streams containing a large amount of user-generated data about almost any topic around the

world. However, the observation and tracking of the ongoing evolution of topics in these unevenly distributed

text data streams is a challenging task for analysts, news reporters, or other users. This paper presents “Stor-e-Motion”, a shape-based visualization to track the ongoing evolution of topics’ frequency (i.e., importance), sentiment (i.e., emotion), and context (i.e., story) in user-defined topic channels over continuously flowing text

data streams. The visualization supports the user in keeping the overview over vast amounts of streaming

data and guides the perception of the user to unexpected and interesting points or periods in the text data

stream. In this work, we mainly focus on the visualization of text streams from the social microblogging

service Twitter, for which we present a series of case studies (e.g., the observation of cities, movies, or natural

disasters) applied on real-world data streams collected from the public timeline. However, to further evaluate

our visualization, we also present a baseline case study applied on the text stream of a fantasy book series.

1 INTRODUCTION

In recent years, there has been a continuous increase

of social media services on the web. Unprecedented

success and active usage of these services result in

massive amounts of user-generated data. The amount

of information in the generated data increases as well.

For example, a large proportion of user-generated

content is automatically enriched by the geographical

location of the user’s device. As social media services

changed the way we use the Internet and play an in-

creasing role in our daily life, it was only a question

of time until social media became a source for infor-

mation gathering. Unfortunately, the vast amount and

the high variability in the quality of user-generated

data is obstructive to analysis tasks. But, at the same

time, user-generated data enables us to extract inter-

esting insights into a variety of different topics.

A popular example is the microblogging service

Twitter. Initially introduced in 2006 as a simple plat-

form for exchanging short messages (“tweets”) on the

Internet, Twitter rapidly gained worldwide popularity

and has evolved into an extremely inﬂuential channel

for broadcasting news and the means of real-time in-

formation exchange. Apart from its attractiveness as a

means of communication (with over 140 million active users as of 2012 generating over 340 million tweets daily), Twitter has revolutionized the ways of

exchanging information on the Internet and opened

new ways for knowledge acquisition from social in-

teraction streamed in real-time.

Due to the diversity of the provided information,

Twitter even plays an increasingly important role as

a source for news agencies. In fact, news agencies

use Twitter for two important functionalities in their

daily work. On the one hand, agencies use Twitter

as a publication and distribution platform for current

news articles with a high throughput rate. For exam-

ple, any reproduction of a tweet (“retweet”) reaches

an average of about 1,000 users (Kwak et al., 2010).

On the other hand, news agencies, such as BBC, are

constantly increasing the usage of Twitter as a refer-

ence in their daily news reports (Tonkin et al., 2012).

A further characteristic of Twitter is its vibrant user

community with a wide range of different personali-

ties from all over the world. The whole spectrum of

use cases can be subdivided into a few categories of

Twitter usage patterns, such as daily chatter, informa-

tion and URL sharing, or news reporting (Java et al.,

2007). Further research undertaken has discovered

Weiler A., Grossniklaus M. and Scholl M..

The Stor-e-Motion Visualization for Topic Evolution Tracking in Text Data Streams.

DOI: 10.5220/0005292900290039

In Proceedings of the 6th International Conference on Information Visualization Theory and Applications (IVAPP-2015), pages 29-39

ISBN: 978-989-758-088-8

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

that the majority of users publish messages focus-

ing on their personal concerns and matters, whereas

a smaller set of users publish for information sharing

(Naaman et al., 2010).

In this paper, we present an application for visu-

ally tracing and monitoring the importance, emotion,

and story of user-deﬁned topic channels in the contin-

uous data stream of Twitter. Our work presents a com-

pact visualization for time series event data, which

supports users in identifying interesting data points in

the large volume of tweets. Additionally, it is possi-

ble to overview whole sets of topics and to compare

the evolution of different topics with each other over

time. Furthermore, the application automatically dis-

plays the most inﬂuencing episode terms in a tag list

over time. The observation of the evolution of im-

portance, emotion, and story of topics is a task that

needs to be done in various ﬁelds of analytics. For

example, an analyst who wants to keep track of nat-

ural disasters appearing in the Twitter stream or any

other news stream needs an appropriate application,

which displays a compact overview of all appearances

of the topic in the data. In the case studies section, we

mainly focus on the visualization of text data streams

of the social microblogging service Twitter. However,

to further evaluate our visualization, we also present a

case study applied on the text data stream of the fan-

tasy book series “Harry Potter”.

The remainder of this paper is structured as fol-

lows. We begin in Section 2 with a presentation of the

state of the art and the background in which our work

is situated. We then present the system design, in-

cluding the implementation of the processing pipeline

with Niagarino, and the design goals of the visualiza-

tion in Section 3. In Section 4, we discuss a series

of case studies that give qualitative evidence as to the

validity of our approach. Finally, concluding remarks

are given in Section 5.

2 BACKGROUND

A lot of research is being done on the analysis and

knowledge discovery from social media data. As

a good overview, Bontcheva and Rout (Bontcheva and Rout, 2012) present a survey of sense making of

social media data, which lists state-of-the-art ap-

proaches for mining semantics from social media

streams. Because of the fast propagation speed of

information in social media networks, a high num-

ber of research works focus on event or topic detec-

tion and tracking for various domains. For example,

Sakaki et al. (Sakaki et al., 2010) presented a sys-

tem for earthquake detection and Weng et al. (Weng

et al., 2011) a system to detect events during elec-

tions in the Twitter data stream. In addition to do-

main speciﬁc systems, open domain event detection

systems, like “TwitInfo” (Marcus et al., 2011), “en-

Bloque” (Alvanaki et al., 2012), and “TwiCal” (Ritter

et al., 2012) tackle the challenge to detect events of

all different kinds and present the results with various

visualizations.

Further research is undertaken in the area of epi-

demics tracking (Culotta, 2010), situational aware-

ness (MacEachren et al., 2011), and disaster man-

agement (Lee et al., 2012). However, none of these

systems combine the dimensions importance, emo-

tions, and story to visually guide the perception of the

user to unexpected and interesting points or periods

in time. Nevertheless, there are a number of works

that emerged in the area of visual analytics for Twitter

streams. For example, “SensePlace2” (MacEachren

et al., 2011), supports overview and detail maps of

tweets, place-time-attribute ﬁltering of tweets, and

analysis of changing issues and perspectives over time

and across space as reﬂected in tweets. However, in

contrast to our work, they use a crawler to systemati-

cally query the Twitter API for tweets containing any

topics deemed to be of interest, instead of using the

data stream directly.

“ScatterBlogs2” (Bosch et al., 2013) is another

approach that lets analysts build task-tailored mes-

sage ﬁlters in an interactive and visual manner based

on recorded messages of well-understood previous

events. In contrast to our work, it is possible to re-

deﬁne ﬁlters and also to create more powerful ﬁlters.

However, they do not provide an overview visualiza-

tion to follow the evolution of topics over time and

also do not include any information about emotions.

Another work is presented by Dörk et al. (Dörk et al., 2010), who call it a visual backchannel for large-scale events. They present a novel way of following and exploring online conversations about large-scale events

using interactive visualizations in a timeline fashion.

Furthermore, a body of research addresses the visualization of news messages. Havre et al. described “ThemeRiver” (Havre et al., 2002), an approach that uses a stacked graph to help users

identify time-related patterns, trends, and relation-

ships across a large collection of documents. Most

similar to our approach is the work proposed by

Krstajić et al. (Krstajić et al., 2011), which presents a technique called “CloudLines” that shows both the current and the historic amount of news for pre-defined topics and addresses the problem of high density and over-plotting via an importance function. Another application in the area of news streams can be

found in (Wanner et al., 2009). Further work is done

IVAPP2015-InternationalConferenceonInformationVisualizationTheoryandApplications

with location reports (Overby et al., 2009) and fo-

rum posts (Wanner et al., 2011). In contrast to our

proposed idea, which uses a fast, uneven, and noisy

stream of short messages, almost all of the systems

mentioned above are applied to well-structured text

and lack the ﬂexibility to add new topics on-the-ﬂy

and monitor topics from a continuous stream of data.

3 SYSTEM DESIGN

The main contribution of our approach is a visualiza-

tion for tracking the evolution of self-deﬁned topics

by analysts, information seekers, or default users in

the massive stream of Twitter data. The high volume

and propagation rate of tweets makes it difﬁcult for

users to follow the evolution of topics inside the con-

tinuous data ﬂow. Furthermore, it is a big challenge

to discriminate between normal behavior of the topic evolution and unusual or abnormal behavior, which

usually is an indicator for an interesting event in the

context of a topic. Therefore, the visualization is tai-

lored to support the characteristic of fast distribution

and spreading of information of social media services.

However, it can also be applied to other types of tex-

tual data like a book series (see Section 4.2).

In the following, we introduce the design of the

Stor-e-Motion visualization and motivate the three

major design goals of visualizing the evolution of the

Importance, Emotion, and Story around a topic. Fur-

thermore, we describe the different options for deﬁn-

ing Topic Channels to follow individual deﬁned topics

in the data stream.

3.1 Processing Pipeline

In order to support the exchange and extension of

components in the processing pipeline, we use Niagarino (http://www.informatik.uni-konstanz.de/grossniklaus/software/niagarino/), a data stream management system that is developed and maintained by our research group. The main purpose of Niagarino is to serve as an easy-to-use and extensible research platform for streaming applications such as the one presented in this paper. The concepts embodied by Niagarino can be traced back to a series of pioneering data stream management systems, such as Aurora (Abadi et al., 2003), Borealis (Abadi et al., 2005), and STREAM/CQL (Arasu et al., 2006). In particular, Niagarino is an offshoot of NiagaraST (Li et al., 2008), with which it shares the most common ground. In this section, we briefly summarize the parts of Niagarino that are relevant for this paper. Niagarino is implemented in Java 8 and relies

heavily on its new language features. In particular,

anonymous functions (λ-expressions) are used in sev-

eral operators in order to support lightweight exten-

sibility with user-deﬁned functionality. The current

implementation runs every operator in its own thread.

Operator threads are scheduled implicitly using ﬁxed-

size input/output buffers and explicitly through back-

wards messages.

In Niagarino, a query is represented as a directed

acyclic graph Q = (O, S), where O is the set of opera-

tors used in the query and S is the set of streams used

to connect the operators. The query plan of a sin-

gle topic channel in the Stor-e-Motion visualization is

shown in Figure 1. Each query plan emits the results

to the visualization node, which continuously updates

and visualizes the results. Niagarino implements a se-

ries of operators. The selection (σ) and projection (π)

operator work exactly the same as their counterparts

in relational databases. In our case, we use them to

select tuples corresponding to the topic channel def-

inition. For the selection operator, we use different

predicates (keyword selection and geographical based

selection) to express the channel deﬁnition.

As shown in Figure 1 these predicates can be com-

bined with “or” and “and” by using a logical pred-

icate operator. Other tuple-based operators include

the derive ( f ) and the unnest (µ) operator. The de-

rive operator applies a function to a single tuple and

appends the result value to the tuple. In our case, we

use derivation functions to add the terms included in the content attribute and the sentiment of the content attribute to the tuple. The

unnest operator splits a “nested” attribute value and

emits a tuple for each new value. A typical use case

for the unnest operator is to split a string and to pro-

duce a tuple for each term it contains. Apart from

these general operators, Niagarino provides a num-

ber of stream-speciﬁc operators that can be used to

segment the unbounded stream for processing. Apart

from the well-known time and tuple-based window

operators (ω) that can be tumbling or sliding (Li et al.,

2005), Niagarino also implements data-driven win-

dows, so-called frames (Maier et al., 2012). For the

Twitter case study (see Section 4.1), we use time-

based and for the Harry Potter case study (see Section

4.2) chapter-based tumbling windows.
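The operator semantics described above can be illustrated with a small sketch. This is not Niagarino code (Niagarino is implemented in Java 8); the function names `derive`, `unnest`, and `tumbling_windows`, the dictionary-based tuple layout, and the sample data are assumptions made purely for illustration. For simplicity, the window function buffers a finite input, whereas a streaming operator would emit each window as soon as it closes.

```python
from collections import defaultdict


def derive(tuples, name, fn):
    """Apply fn to each tuple and append the result as a new attribute."""
    for t in tuples:
        yield {**t, name: fn(t)}


def unnest(tuples, name):
    """Emit one tuple per element of the nested attribute `name`."""
    for t in tuples:
        for value in t[name]:
            yield {**t, name: value}


def tumbling_windows(tuples, size, key="time"):
    """Group tuples into consecutive, non-overlapping windows of `size` time units."""
    windows = defaultdict(list)
    for t in tuples:
        windows[t[key] // size].append(t)
    return [windows[k] for k in sorted(windows)]


# Hypothetical input: time in seconds, content as raw text.
stream = [
    {"time": 5, "content": "boston marathon"},
    {"time": 42, "content": "marathon winner"},
    {"time": 61, "content": "explosion"},
]
with_terms = derive(stream, "terms", lambda t: t["content"].split())
per_term = list(unnest(with_terms, "terms"))
windows = tumbling_windows(per_term, size=60)  # one-minute tumbling windows
```

In this sketch, the derive step adds a list of terms to every tuple, the unnest step produces one tuple per term, and the window step segments the result into one-minute groups that an aggregation could then summarize.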

Stream segments form the input for join (⋈) and

aggregation (Σ) operators. As with derive opera-

tors, Niagarino also supports user-deﬁned aggregation

functions. Niagarino operators can be partitioned into

three groups. The operators described above are gen-

eral operators, whereas source operators read input

streams and sink operators output results. Each query can have multiple source and sink operators. Source and sink operators used in the processing pipeline are shown as rectangles in Figure 1.

Figure 1: Query plan of a single topic channel.

3.2 Visualization

To visualize the evolution of topics over time and to

retain the original sequence of the text stream, we use

a shape based visualization in which all three major

design goals are incorporated. The visualization is

tailored to point the user to important and interesting

patterns in the streaming data and also to support the

user in serendipitous ﬁndings in the topics. The major

design goals are described in the following.

• Real-time visualization of the topic’s evolution of

frequency and emotion

• Presentation of the topic’s story in a compact but

signiﬁcant way

• Detail view of the story’s content

Figure 2 shows the shape, which consists of a

rounded rectangle, which reﬂects the frequency and

the percentages of the sentiment values of the time

window. These shapes are continuously added to the

next position at the right side in the panel and there-

fore form different patterns by visual aggregation in

the ongoing time series of the topic.

The examples in Figure 2 show a time series with unchanged frequency but with increasing positive sentiment (a), decreasing negative sentiment (b), and decreasing positive and increasing negative sentiment (c).

Figure 2: Left: Shape for a single data window with 40% positive and 40% negative sentiment; Right: Examples for (a) increasing % positive, (b) decreasing % negative, (c) decreasing % positive and increasing % negative sentiment.

Importance

The visualization of the ongoing topic evolution needs

to reﬂect the continuous and dynamic change of the

importance of a topic over time. Therefore, the impor-

tance of the topic in the time window is visualized by

using the size of the shape. Since the length of a time

window is pre-deﬁned and static, we use this value as

the width of the rectangle. For the height of the rect-

angle, we calculate for each shape a value against a

pre-deﬁned static max height. To calculate this value,

we use the values n (total number of tuples inside

the window) and m (total number of tuples inside the

topic channel and window) and calculate the Inverse

Document Frequency (Sparck Jones, 1988) for both

values. Then we use the two frequency values in the

following formula to get the height value.

$$\text{height} = \left(\log\frac{n}{1} - \log\frac{n}{m}\right) \cdot \frac{\text{max height}}{\log\frac{n}{1}}$$
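The height computation can be sketched as follows. The sketch rests on an assumption about the formula: the two inverse document frequencies are taken to be log(n/1), the largest possible value within a window of n tuples, and log(n/m), the value for the topic channel, so that the height equals max height · log(m) / log(n). The function name and the handling of empty windows are illustrative choices and are not taken from the paper.

```python
import math


def shape_height(n, m, max_height=120.0):
    """Height of a window shape, assuming IDF-style normalization.

    n: total number of tuples in the window
    m: number of those tuples that belong to the topic channel
    """
    if n <= 1 or m <= 0:
        return 0.0  # no topic tuples, or log(n) would be zero or undefined
    idf_max = math.log(n / 1)    # IDF of something occurring in a single tuple
    idf_topic = math.log(n / m)  # IDF of the topic channel in this window
    return (idf_max - idf_topic) * max_height / idf_max
```

Under this reading, a topic that matches every tuple of the window reaches the maximum height, a topic that matches a single tuple has height zero, and the logarithm dampens the difference between moderately and extremely frequent topics.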

Emotion

The emotion of a topic is visualized by the coloring of the shapes. The fill color of the shape signifies the average sentiment (red = negative, green = positive, yellow = neutral; see Figure 2) of the text in the data window. The sentiment value of a text segment is calculated by using an external library by Thelwall et al. (Thelwall et al., 2010). The library analyzes the text of the message and returns sentiment values from -4 to -1 (extremely negative to negative), 0 (neutral), and 1 to 4 (positive to extremely

positive). To reﬂect the different levels of the senti-

ment, we calculate a stepwise (0.1 steps) color value

corresponding to the average value of all positive as

well as all negative values. The darker the color the

IVAPP2015-InternationalConferenceonInformationVisualizationTheoryandApplications

higher is the average value. The color mapping can

be seen at the left side of each topic channel.

To visualize the ratio of positive and negative sen-

timent in the data windows, we use a linear gradient

with the colors according to the color map and the per-

centage of positive or negative text segments in the

overall text in the window. For example, Figure 2

shows a shape for a single window, which has 40%

of extremely negative and 40% of extremely positive

sentiment.
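The aggregation behind the coloring can be sketched as follows. The sketch assumes that each text segment already carries an integer sentiment score between -4 and 4 and that rounding the averages to one decimal corresponds to the 0.1 color steps; the function name and the returned dictionary keys are hypothetical, and the mapping from averages to concrete colors is left out.

```python
def window_sentiment(scores):
    """Summarize the per-text sentiment scores (-4..4) of one data window."""
    positive = [s for s in scores if s > 0]
    negative = [s for s in scores if s < 0]
    total = len(scores)

    def mean(values):
        # Rounded to one decimal to mirror a 0.1-step color scale (assumption).
        return round(sum(values) / len(values), 1) if values else 0.0

    return {
        "pos_share": len(positive) / total if total else 0.0,  # gradient share
        "neg_share": len(negative) / total if total else 0.0,
        "pos_avg": mean(positive),  # selects the shade of the positive color
        "neg_avg": mean(negative),  # selects the shade of the negative color
    }


# Two extremely positive, two extremely negative, and one neutral text.
summary = window_sentiment([4, 4, -4, -4, 0])
```

The example window corresponds to the shape described above, with 40% extremely negative and 40% extremely positive sentiment.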

Story

Since the story around a topic also evolves over time,

it is a challenging task to keep an overview of ongo-

ing context changes in the topic channels. Therefore,

we visualize the story by using a tag list for a pre-

deﬁned number of data windows in a continuous way.

We call this collection of windows an episode and the

included content episode terms.

The tag list is created by tokenizing the contents

of the text segments and removing terms which are

included in a standard English stopword list or are

classiﬁed as noise terms (terms that are too short or

too long, terms with a repetition of the same character, or terms without a vowel). Furthermore, we filter out terms that are included in the topic channel definition, because these terms would always be very frequent in the resulting term set. To reflect the ongoing evolution of terms around a topic, we use two rules to increase the importance of terms that occur for the first time or whose frequency is increasing. The frequency of a newly occurring term is multiplied by the average frequency of all terms from the previous window, and the frequency of a term already seen in previous windows is multiplied by its factor of increase. By showing only the top five terms of this computation for the corresponding time window, we ensure that only terms with a certain influence on the story of the topic are displayed.
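The term selection can be sketched as follows. Several details are assumptions, because the text does not specify them: the length thresholds, the reading of "repetition of the same character" as three identical characters in a row, the tiny placeholder stopword list, and the choice not to penalize terms whose frequency decreases (their factor is clamped to 1). All function names are hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "at"}  # placeholder list


def is_noise(term, min_len=3, max_len=15):
    """Noise rules from the text; the concrete thresholds are assumptions."""
    return (
        not min_len <= len(term) <= max_len
        or re.search(r"(.)\1\1", term) is not None  # same character three times in a row
        or re.search(r"[aeiou]", term) is None      # no vowel
    )


def term_counts(texts, channel_terms):
    """Count the terms of a window after stopword, noise, and channel-term filtering."""
    counts = Counter()
    for text in texts:
        for term in re.findall(r"[a-z]+", text.lower()):
            if term in STOPWORDS or term in channel_terms or is_noise(term):
                continue
            counts[term] += 1
    return counts


def top_terms(current, previous, k=5):
    """Rank terms, boosting new terms and terms whose frequency increased."""
    prev_avg = sum(previous.values()) / len(previous) if previous else 1.0
    scores = {}
    for term, freq in current.items():
        if term not in previous:
            scores[term] = freq * prev_avg  # rule 1: newly occurring term
        else:
            factor = freq / previous[term]
            scores[term] = freq * max(factor, 1.0)  # rule 2: increasing term
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


previous = term_counts(["running in boston", "team running today"], {"boston"})
current = term_counts(
    ["explosion at the marathon", "explosion near finish line", "running away"],
    {"boston"},
)
top = top_terms(current, previous)
```

In this made-up example, the new and repeated term "explosion" is ranked first, while "running", which was already frequent in the previous window and is now declining, drops out of the top five.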

The resulting episode terms are added to the space

of the corresponding episode. Since single terms,

even if they are in a group of episode terms, are some-

times not self-explanatory and it is therefore very

helpful to obtain additional context information of

single terms, we added an overview of the respective

texts for selected terms. For example, in the city ob-

servation case study (see Section 4.1) of Boston, we

are able to get an insight into the ﬁrst on-site reports

about the explosion (including a hyperlink to the very

first image about the event). Additionally, each single text in the overview is colored with its corresponding sentiment color. This approach gives the user a better insight into how much each text has contributed to the overall sentiment value.

3.3 Topic Channels

A topic can be deﬁned in several ways. In text

streams, in which the only source is the text, the topic

deﬁnition mostly consists of one or several keywords.

However, by using social media data, which contain

a large amount of additional meta-data, topics can be

deﬁned in a great variety. For example, a topic chan-

nel could be deﬁned by the rules to follow speciﬁc

Twitter users in terms of their geographic location or

timezone. However, since we are interested in the

evolution of important and speciﬁc topics, we focus

on the textual and the geographical dimension of the

data stream. The three possible topic channel deﬁni-

tions are described in the following.

Keywords. Hereby, we can create a topic channel,

which follows a topic about one or several key-

words. There exist two alternatives for this chan-

nel deﬁnition. First, the text segment needs to

contain all of the deﬁned keywords (“and” clause)

or second it needs to contain at least one (“or”

clause) of the keywords.

Geographic. We can also create a topic channel,

which follows topics in one or several geograph-

ical areas. For example, it is possible to deﬁne a

geographical location as the center and radius of

the surrounding area to observe the importance,

emotions, and story inside a certain city or country. Note that in this case only the “or” clause is reasonable.

Mixed. To observe topics in a certain geographical area more precisely, it is also possible to combine the textual and the geographical filtering. For example, if a user is interested in an earthquake in a certain country and not all over the world, it is possible to query for the keyword “earthquake” and a country definition of Indonesia.

Since the ongoing evolution of the topic’s story,

which eventually deals with important subtopics and

surfaces serendipitous ﬁndings, can trigger the inter-

est in new topics, it is always possible to add new

topic channels with a totally new deﬁnition or ex-

tend/restrict topic deﬁnitions of existing topics. By

using the Niagarino framework we can easily change

or substitute the deﬁnitions of channels and also com-

bine different types of channel deﬁnitions by using a

logical predicate operator.
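The three kinds of channel definitions can be sketched as composable predicates. This is an illustration only and not the Niagarino predicate API: the function names, the dictionary-based tuple layout, the substring-based keyword matching, and the use of the haversine distance for the radius check are all assumptions. The coordinates in the usage example are approximate values chosen for illustration.

```python
import math


def keyword_predicate(keywords, mode="or"):
    """Match tuples whose content contains all ("and") or any ("or") keywords."""
    combine = all if mode == "and" else any
    return lambda t: combine(k in t["content"].lower() for k in keywords)


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def geo_predicate(center, radius_km):
    """Match tuples whose coordinates lie within radius_km of center (lat, lon)."""
    def matches(t):
        if t.get("coords") is None:
            return False  # tuples without geographic information never match
        return haversine_km(center, t["coords"]) <= radius_km
    return matches


def both(p, q):
    """Logical "and" of two predicates, as used for a mixed channel."""
    return lambda t: p(t) and q(t)


# Approximate city center of Boston with a 25-mile (about 40.2 km) radius.
boston_geo = geo_predicate((42.3601, -71.0589), 40.2)
mixed = both(keyword_predicate(["explosion"]), boston_geo)
tweet = {"content": "Explosion near the finish line", "coords": (42.3496, -71.0770)}
```

Representing each channel definition as a function from tuple to Boolean keeps keyword, geographic, and mixed channels interchangeable, which mirrors the idea of combining selection predicates with a logical predicate operator.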

4 CASE STUDIES

In this section, we present case studies for two differ-

ent data sets. In the ﬁrst three case studies, we use the

TheStor-e-MotionVisualizationforTopicEvolutionTrackinginTextDataStreams

public live text data stream of Twitter. The fourth and

ﬁnal case study uses the entire text of all volumes of

the fantasy book series “Harry Potter” as a text data

stream.

4.1 Twitter Stream

The Twitter platform provides direct access to the

public live stream of Twitter. By using the Twit-

ter Streaming API

with the so-called “Gardenhose”

access level, we receive a randomly sampled 10%

stream of the public live stream. An exemplary evalu-

ation of a representative sample of days shows that the

10% stream contains an average of over 1.5 million

tweets per hour with an average of 25,000 tweets per

minute. We can also conclude that there is an increas-

ing availability of tweets with geographic informa-

tion with currently about 2,000 to 4,000 of incoming

tweets per minute. The geographic information either

consists of the latitude and longitude value, which is

automatically set by the used mobile device or a loca-

tion manually added to the tweet by the author of the

message. The live creation of a single topic channel

for a 24-hour period of data takes about 15 minutes (3.2 GHz Intel Core i3 processor and 8 GB of main memory) to process all tuples. The parallel creation

of the four topic channels of the ﬁrst case study took

about 23 minutes. Therefore, we can conclude that,

while it is still possible to follow the live stream of

tweets, adding further topic channels slows down the

processing.

For the following case studies, we collected the tweets as they are streamed out by the API for the specific dates in the Central European Time zone (CET). Because the sentiment derivation function only works for English texts, we pre-filtered the data sets for tweets whose content is in English by using a language detection library. After the pre-processing, we get a tuple stream $\{T_1, \ldots, T_n\}$ in which each tuple has the attributes $\{a_{id}, a_{creationdate}, a_{content}, a_{coords}\}$. Note that the coordinates attribute ($a_{coords}$) is only set if the tweet contains geographic information. The visualizations are created in the following manner. For each tuple flowing through our processing pipeline, the topic definitions are checked and the terms as well as the sentiment value are derived from the $a_{content}$ attribute. After this pre-processing, each tuple has the attributes $\{a_{id}, a_{creationdate}, a_{content}, a_{coords}, a_{terms}, a_{sentiment}\}$. The $w_{size}$ value of the tumbling window operator is set to one minute and therefore a shape reflects the aggregated values of one minute consisting of the importance and emotion for the topics. The width of the shape is set to two pixels and the max height to 120 pixels. For each episode of 60 minutes, a tag list with the most influential episode terms is displayed. By selecting an episode term, the content and sentiment of all tweets that belong to this term and episode are shown in the detail view (see Figure 4 for an example).

Twitter Streaming API: https://dev.twitter.com/streaming/overview

Language detection library: http://code.google.com/p/language-detection/

City Observation

The ﬁrst case study describes the observation of the

city of Boston on April 15, 2013. The data

set for that day contains a total of 20,046,861 tuples,

which is an average of about 830,000 tuples per hour

and about 14,000 tuples per minute. Figure 3 shows

the visualization of four topics for the whole day. The

ﬁrst topic channel Boston (Geo) is deﬁned by using

the city center of Boston and the surrounding area

of 25 miles. The second topic channel Boston is de-

ﬁned by using the name of the city as keyword. These

are the two starting topic channels a news reporter is

likely to choose in order to follow the events around a

speciﬁc city. These deﬁnitions make it possible to get

an overview of the tweets using the name of the city

Boston as well as the tweets that are sent from within

the city (and probably contain on-site reports), but that

do not include the name of the city in the content of

the message.

By observing the ongoing evolution of the topic

channel Boston (Geo) and Boston, we can see that

the frequency decreases after a couple of hours. By

that time it was night in Boston and people send-

ing tweets from Boston or about Boston are less ac-

tive. The story after these low frequency episodes in

both channels mostly consists of episode terms re-

lated to a sports event (e.g., “running”, “team”, and

“marathon”). Attracted by the term “marathon”, we

are also interested in following this topic. Therefore

we choose to follow this term (see the blue rectan-

gle in Figure 3 inside the Boston (Geo) topic channel)

and a new topic channel with the title Marathon ap-

pears. In this topic channel, we can now see more de-

tails about the marathon event. In the Marathon topic

channel, we can identify the episode terms “desisa” and “ethiopia”. In the Boston (Geo) channel, the episode term “jeptoo” is also mentioned. By getting more insights into these terms by using the detail view of the

tweets, we can derive that these are the winners of the

marathon. The most interesting pattern appears a few

hours after these ﬁrst runners ﬁnished the marathon.

The negative emotion of all three topic channels increases and drifts into the extremely negative. Also, the overall importance of all three topic channels increases significantly and therefore reflects the occurrence of an interesting event. In the Marathon topic channel, we are now attracted by the term “explosion” and again add a new topic channel for this term. We can derive that, at this point in time, an explosion took place at the finish line (episode term “line” appears in the tag list) of the marathon course.

Figure 3: Visualization of the topic channels (Twitter stream from April 15, 2013) of Boston (Geo), Boston, Marathon, and Explosion from top to bottom. Blue rectangles are the activators to start the corresponding topic channels.

Figure 4: Detail view of the episode term “coply” from the topic channel Boston (Geo) in episode 21.

Shortly afterwards the terms “prayers”, “cops”,

and “injured” appear in the tag list of the different

channels. These terms reﬂect the current situation in

the ongoing topic. By having a look at the episode

terms of the Boston (Geo) channel, we can see that the term “coply” (Copley Square is a public square in the city of Boston) appears in the episode before the event. The detail view of this term (see Figure 4) shows the first two on-site reports about the event, which were written very shortly after it (with hyperlinks to the first on-site pictures). It also indicates that the earlier tweets about “coply” are positive or neutral, whereas the later ones are negative. We can derive that our

series of topic channels effectively reﬂects the parallel

evolution of the ongoing events in the city Boston.

Election Observation

The second case study describes the observation of

the papal election on March 13, 2013. The

dataset for that day contains a total of 19,470,337 tu-

ples, which is an average of about 810,000 tuples per

hour and about 13,500 tuples per minute. Figure 5

shows the visualization of two topics for the whole

day. The source topic channel Pope is deﬁned by the

rule that a tweet needs to contain at least one of the

keywords “pope” or “pontifex”.

By observing the ongoing evolution of the topic

channel Pope, we can clearly see that the frequency

stays almost constant until the point of time when

the term “whitesmoke” (term is used as hashtag by

the Twitter community to signal that the papal elec-

tion is ﬁnished and a new pope is announced to the

world) appears in the tag list in Episode 19 and the

frequency increases signiﬁcantly. The co-occurring

episode terms “habemus” and “papam” (Latin phrase

announcing a successful papal election) imply the same.

In the next episode, the tag list of the topic channel shows the term “argentina”, and we therefore also track the evolution of this topic. We can recognize in the topic channel Argentina that the election is met with a positive response in Argentina. Furthermore, we can see that the burst of tweets about the election event takes place in Argentina almost one hour later than the event is reported in the source topic channel. In addition, two more indicators for that event, “pope” and “elected”, appear in Episode 20 of the Argentina topic channel.

Movie Premiere Observation

The third case study describes the observation of a

movie premiere on July 20, 2012. The

dataset for that day contains a total of 12,760,600

tuples, which is an average of about 530,000 tuples

per hour and about 9,000 tuples per minute. Figure 6

TheStor-e-MotionVisualizationforTopicEvolutionTrackinginTextDataStreams

Figure 5: Visualization of the topic channels (Twitter stream from March 13, 2013) Pope and Argentina.

Figure 6: Visualization of the topic channels (Twitter stream from July 20, 2012) Dark Knight and Denver.

shows the visualization of two topics for the whole

day. The source topic channel Dark Knight is deﬁned

by using the keywords “dark knight” and “batman”.

The evolution of the topic channel Dark Knight reflects

the premiere of the movie “The

Dark Knight Rises”, which took place on that day. We

can identify that there is a slight increase (in Episode

10) in the frequency within the time frame when the

premiere starts in the cinema. The emotion per minute

in the time frame at the beginning tends to be more

positive than negative because of the anticipation of

the movie. However, after a short period of time

(starting with Episode 11) the sentiment drifts to the

negative and the episode term “denver” appears in the

tag list. This prompts us to start the new topic channel

Denver. The sudden increase in negative sentiment

is a clear sign that something unexpected happened.

In the Denver channel, episode terms like “killed”, “victims”, and “security” appear in Episodes 11 and 12. An interesting observation is that both the frequency and the negative sentiment increase significantly for the Denver topic channel, while for the Dark Knight topic channel only the negative sentiment increases. We also note the term “james”, which appears in Episode 15 of the first topic channel and which we are able to identify (by inspecting the detail view of the tweets) as the name of the suspect.

4.2 Text Stream

Another promising application of a streaming text visualization is the analysis of single books or complete book series. Therefore, we use this kind of data to further evaluate our visualization and perform a final case study, which uses the complete text of the “Harry Potter” fantasy book series as a text data stream.

We pre-processed the complete book series and extracted a total of 195 book chapters with a total of 33,939 sentences that contain more than ten characters. After the pre-processing step, we get a tuple stream $\{T_1, \ldots, T_n\}$ in which each tuple has the attributes $\{a_{chapter}, a_{book}, a_{content}\}$. For each tuple, the topic definitions are checked and the terms as well as the sentiment value are derived. The derivation processes add the new attributes $a_{terms}$ and $a_{senti}$ to each tuple. The $w_{size}$ value of the tumbling window operator is set to one so that each shape reflects the aggregated sentiment values of one chapter, consisting of the importance and emotion for the topics. For

each episode of five chapters, a tag list with the most

influential episode terms is displayed. By selecting

an episode term, all sentences that belong to that term

and episode are shown and colored according to their

sentiment value. The creation of the entire

visualization takes about 30 seconds. Figure 7 shows

the visualization for all topic channels that we used in

our exploration. Note that for this case study there

is more horizontal space available per data window

and therefore we set the width of a shape to 12 pixels

(max height is still 120 pixels).
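The per-chapter aggregation described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used for the paper: tuples are modeled as dictionaries with the hypothetical keys `chapter` (a running chapter number), `content`, and `senti`, the emotion of a window is assumed to be the mean sentiment of the matching sentences, and the shape height is assumed to be the share of matching sentences scaled to the maximum height of 120 pixels.

```python
from collections import defaultdict
from statistics import mean

MAX_HEIGHT_PX = 120  # maximum shape height stated in the text
W_SIZE = 1           # tumbling window size: one chapter per shape


def tumbling_windows(tuples, w_size=W_SIZE):
    """Group tuples into non-overlapping windows of w_size chapters."""
    windows = defaultdict(list)
    for t in tuples:
        windows[(t["chapter"] - 1) // w_size].append(t)
    return [windows[key] for key in sorted(windows)]


def shape_for_window(window, keywords=()):
    """Return (height in pixels, mean sentiment) for one window.

    An empty keyword tuple stands for the overall channel, in which
    every sentence counts and the height is therefore constant.
    """
    hits = [t for t in window
            if not keywords
            or any(k in t["content"].lower() for k in keywords)]
    height = MAX_HEIGHT_PX * len(hits) / len(window)
    sentiment = mean([t["senti"] for t in hits]) if hits else 0.0
    return height, sentiment


# Hypothetical sentences, not taken from the books.
stream = [
    {"chapter": 1, "content": "Harry smiled at Hagrid.", "senti": 1},
    {"chapter": 1, "content": "The street was quiet.", "senti": 0},
    {"chapter": 2, "content": "Harry feared the dark corridor.", "senti": -1},
    {"chapter": 2, "content": "Harry ran.", "senti": -1},
]

shapes = [shape_for_window(w, ("harry",)) for w in tumbling_windows(stream)]
```

In this sketch, the Harry channel receives a half-height shape for the first chapter and a full-height shape for the second, whereas the overall channel (no keywords) always yields the maximum height.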

The first topic channel shows the overall evolution of the importance, emotion, and story of the book series. Since the height of the shape is normalized to the total number of sentences in the chapter, there is no change in the shape's height for this channel. However, the overall channel supports users in reviewing the full evolution of emotion and story as well as in finding potential terms that they might be interested in following and exploring in further topic channels.

IVAPP 2015 - International Conference on Information Visualization Theory and Applications

Figure 7: Visualization of the text data stream from the complete fantasy book series Harry Potter. It shows the overall evolution of the story without any defined topic (first line) and the evolution in the topic channels of Harry, Hagrid, Hermione, Kill, Dobby, Hogwarts, and Horcrux from top to bottom. Blue rectangles are the activators to start the topic channels.

By tracking the episode terms of the overall chan-

nel, we see that the term “harry” is inﬂuential in

the ﬁrst episode and add this term as a second topic

channel to the visualization. In this channel, we di-

rectly recognize two additional terms in the ﬁrst two

episodes. First, the term “hagrid” and second the term

“hermione” are used to build new topic channels. The

topic channels Hermione and Harry clearly reﬂect the

interplay between these characters in the story. By

further tracking the episode terms of the overall channel, we can identify the term “kill” in Episode 11. In order to explore the term “kill”, we have a look at the corresponding sentences and can recognize that the “death” of characters is a consistently important topic in the story. Therefore, we add another topic channel, which we call Kill and which is filtered by the keywords “kill”, “death”, “died”, “killed”, or “dead”. Since these terms all have negative connotations, the topic channel is mainly colored in red. However, we can still recognize the uneven appearance of the topic in the ongoing book series and detect some extremely negative patterns. The next topic channel (Dobby) is also added as a consequence of the occurrence of this term in the overall channel.

By following the Dobby channel, we see the

episode term “hogwarts” as one of the influential

terms in the story of this channel. As we assume

this to be an interesting term, we also track this term

within a new topic channel. In this channel, we

can see the episode term “horcrux” in the second-last

episode, which also seems to be important and there-

fore we add this term as the last topic channel to the

visualization. This leads to the observation that there

are very unevenly distributed topics in the full story.

Topic channel eight directly reveals that Horcruxes (magical objects) are only introduced in the last books of the fantasy series. Also, we can derive that the character Dobby only occurs occasionally and leaves the story with a shape in negative color. By looking at the corresponding sentences, we can derive that this shape signifies the death of the character.

5 CONCLUSION

In this paper, we presented “Stor-e-Motion”, a shape-

based visualization to track the ongoing evolution of

topics’ frequency (i.e., importance), sentiment (i.e.,

emotion), and context (i.e., story) in user-deﬁned

topic channels over continuously ﬂowing text data

streams. Our case studies show that the visualization

supports users in keeping an overview while following

and tracking topics over time and also guides them

to interesting points or periods in time. Furthermore,

the visualization contributes to a common, timely, and

TheStor-e-MotionVisualizationforTopicEvolutionTrackinginTextDataStreams

relevant situational awareness of topics and allows

for serendipitous findings. These findings can easily

be added to existing topic channels or used to form new

ones and therefore support the refinement of topics.

Future work includes an evaluation and a user

study of the visualization as well as further extensions of the

analytical functionality. One possible idea would be to

additionally include an event detection mechanism and

automatically feed the resulting terms into the visual-

ization to create a large landscape of events and top-

ics.

A further improvement would be to extend the

system with a zooming feature to provide a more detailed

view to the users. This would allow data to be displayed

at different levels of granularity in order to get

deeper insights into the interesting points or periods in

time. For the topic channel deﬁnition, further options,

such as the source of the tweet (e.g., mobile phone or

web), the geographic region of the tweet or the type

of the tweet (e.g., retweet or direct message), could

be derived from the meta-data of tweets.

For a more powerful search and to improve the re-

sults, more full-text options, such as fuzzy search or

the exclusion of negative terms, could be added. The

importance of the exclusion of negative terms can be

derived from the city observation case study, in which

the results shifted from tweets about the original topic

(reactions to the marathon sports event) to a different

topic (reactions to the explosions) and therefore

it would be helpful to separate the topics from each

other.
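One way the proposed exclusion of negative terms could look is sketched below. It is an illustration of the idea only, not part of the presented system: the function name `matches_with_exclusion`, the case-insensitive substring matching, and the sample tweets are assumptions.

```python
def matches_with_exclusion(text, include, exclude=()):
    """Return True if the text contains an include term and no exclude term.

    Assumption: matching is a case-insensitive substring test.
    """
    lowered = text.lower()
    if any(term in lowered for term in exclude):
        return False
    return any(term in lowered for term in include)


# Hypothetical sample tweets, not taken from the dataset.
tweets = [
    "Great atmosphere at the marathon today",
    "Explosion near the marathon finish line",
]

# Keep reactions to the sports event, drop reactions to the explosions.
marathon_only = [
    t for t in tweets
    if matches_with_exclusion(t, include=("marathon",), exclude=("explosion",))
]
```

With such a rule, the original topic and the emerging topic could be followed in two separate channels.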
