Hawkes Processes on Social and Mass Media: A Causal Study of the #BlackLivesMatter Movement in the Summer of 2020

Alfred Lindström 1,2 a, Simon Lindgren 3 b and Raazesh Sainudiin 1,2 c

1 Department of Mathematics, Uppsala University, Uppsala, Sweden
2 Combient Competence Centre for Data Engineering Sciences, Uppsala University, Uppsala, Sweden
3 Department of Sociology, Umeå University, Umeå, Sweden

Keywords:

Hawkes Process, Community Detection, Granger Causality, Hypothesis Test, Social & Mass Media Modelling.

Abstract:

In this work we study interactions in social media and the reports in mass media during the Black Lives Matter (BLM) protests following the death of George Floyd. We implement open-source pipelines to process the data at scale and employ the self-exciting counting process known as the Hawkes process to address our main question: is there a causal relation between interactions in social media and reports of street protests in mass media? Specifically, we use distributed label propagation to identify interactions on Twitter that supported the BLM movement, and compare the timing of these interactions to that of news reports of street protests mentioning George Floyd, obtained via the Global Database of Events, Language, and Tone (GDELT) Project. The comparison is made through a bivariate Hawkes process model for a formal hypothesis test of Granger causality. We show that interactions in social media that supported the BLM movement, at the beginning of nationwide protests, Granger-caused the global mass media reports of street protests in solidarity with the movement. This suggests that BLM activists have harnessed social media to mobilise street protests across the planet.

1 INTRODUCTION

On 25 May 2020, George Floyd, a 46-year-old

African-American man, is arrested in Minneapolis,

Minnesota for allegedly using a counterfeit $20 bill

to buy cigarettes. The arrest is caught on ﬁlm by

passersby, showing how police ofﬁcer Derek Chauvin

pins the handcuffed Floyd to the ground with his knee

on Floyd’s neck, while his three colleagues prevent

anyone from intervening. Floyd repeatedly utters the

words “I can’t breathe” before he goes unconscious.

He later dies at the hospital, and the video of the arrest

goes viral on Facebook (Deliso, 2021). The next day

protests in support of the Black Lives Matter (BLM)

movement, and against police brutality, start in Min-

neapolis, which during the following days will spread

both nationally and internationally to over 60 coun-

tries, and become what may be the largest protests in

U.S. history to date, with polls estimating attendances

in the range of 15-26 million people (Buchanan et al.,

2020).

BLM is a decentralised grassroots movement that began on social media, using the hashtag #BlackLivesMatter, in July 2013, following the acquittal of George Zimmerman in the shooting death of Trayvon Martin. The movement has since then gained attention for demonstrations following the deaths of Michael Brown and Eric Garner in 2014, and George Floyd in 2020, with its main issues being advocacy against police brutality toward African-Americans, and policy issues related to racial injustices (Jackson et al., 2020).

a https://orcid.org/0009-0009-2300-4366
b https://orcid.org/0000-0001-6289-9427
c https://orcid.org/0000-0003-3265-5565

As reactions and critiques of the BLM movement,

the phrase “All lives matter” was coined, as well as the

phrase “Blue lives matter”, after the shooting of two

police ofﬁcers during protests in Ferguson, Missouri

in 2015. Both of these slogans are associated with conservative views, and reject the BLM movement’s idea of a need to focus on the racial injustice towards African Americans.

The decentralised nature of all three of these

movements, and the way social media has played a

key part in their development, leading to real life

events such as mass protests, motivates our choice to

analyse data from social media and from mass media to better understand how mobilisation in social media translates into real-world action.

In this work we study the landscape in mass and social media during the first month of protests that followed the murder of George Floyd. Our primary question is whether there is a statistically significant interaction between communications in socially networked communities and street protests as measured by published reports in mass media. We attempt to answer this question by devising a data processing framework to mathematically model the interactions between social and mass media via the family of point processes known as Hawkes processes, and by conducting statistical hypothesis tests of Granger causality, subsequent to identifying influential social media communities using network models.

Lindström, A., Lindgren, S. and Sainudiin, R.
Hawkes Processes on Social and Mass Media: A Causal Study of the BlackLivesMatter Movement in the Summer of 2020.
DOI: 10.5220/0012089500003541
In Proceedings of the 12th International Conference on Data Science, Technology and Applications (DATA 2023), pages 77-88
ISBN: 978-989-758-664-4; ISSN: 2184-285X
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

The paper’s outline is as follows. We describe

Models in Section 2, Data Handling in Section 3,

Analysis of Twitter Data in Section 4, Joint Media

Modeling in Section 5 and conclude in Section 6.

2 MODELS

2.1 Hawkes Processes

We will now introduce a family of point processes

known as Hawkes processes, assuming the reader is

familiar with point processes. These processes were

introduced by Hawkes (Hawkes, 1971), and due to

their self-exciting nature they are used in ﬁelds such

as epidemiology, seismology, and ﬁnance (Daley and

Vere-Jones, 2003; Bacry et al., 2015).

Suppose we observe events in continuous time, i.e., points on the positive real line as timestamps, where for each $i$, $t_i$ is the exact time at which some sort of event occurs for the $i$-th time. Define the history of a point process up to time $t$ as the set $H_t$ containing all timestamps $\{t_i\}$ up to time $t$. A Hawkes process allows us to model the occurrence of future events after time $t$ based on the entire history $H_t$ up to time $t$ as follows:

Definition 2.1. Let $N(t)$ be a point process that counts the number of events up to time $t$ with history $H_t$. If the intensity $\lambda(t)$ of $N(t)$ is of the form

$$\lambda(t) = \mu + \sum_{t_i \in H_t} \phi(t - t_i), \tag{1}$$

we define $N(t)$ as a Hawkes process, where $\mu$ is the baseline intensity and $\phi(t)$ is the kernel.

We will now introduce a particular choice of ker-

nel.

Definition 2.2. We define

$$\phi(t) = \alpha \beta e^{-\beta t}, \tag{2}$$

as an exponential kernel, where the parameter $\alpha \geq 0$ is the self-excitation parameter and the parameter $\beta > 0$ is the decay rate.

The parameter $\alpha$ thus decides how much an occurred event will influence the rate of new events, while $\beta$ decides how long into the future this influence lasts, since $\phi(t) \to 0$ as $t \to \infty$.
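To make the roles of the baseline, excitation and decay parameters concrete, the following is a minimal sketch (not the implementation used in this study) of the intensity in equation (1) with the exponential kernel of equation (2), together with a simulation by Ogata's thinning algorithm. The function names and parameter values are illustrative assumptions. Under this parameterisation the kernel integrates to $\alpha$, so $\alpha$ can be read as the expected number of events directly triggered by one event.

```python
import math
import random


def intensity(t, history, mu, alpha, beta):
    """Intensity (1) with the exponential kernel (2), using events strictly before t."""
    return mu + sum(alpha * beta * math.exp(-beta * (t - s)) for s in history if s < t)


def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate event times on (0, horizon] by Ogata's thinning algorithm (requires mu > 0)."""
    rng = random.Random(seed)
    events, t = [], 0.0
    while True:
        # Right-limit of the intensity at t (an event exactly at t is counted).
        # The intensity only decays until the next event, so this is an upper bound.
        upper = mu + sum(alpha * beta * math.exp(-beta * (t - s)) for s in events)
        t += rng.expovariate(upper)  # candidate time from a dominating Poisson process
        if t > horizon:
            return events
        # Accept the candidate with probability lambda(t) / upper.
        if rng.random() * upper <= intensity(t, events, mu, alpha, beta):
            events.append(t)


# Illustrative values only: each event triggers on average alpha = 0.5 further events.
times = simulate_hawkes(mu=1.0, alpha=0.5, beta=2.0, horizon=20.0, seed=1)
print(len(times), "events simulated")
```

A larger `alpha` produces more pronounced bursts of events, while a larger `beta` makes each burst shorter.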

A natural extension of the Hawkes process is the

multivariate Hawkes process.

Definition 2.3. Let $d \in \mathbb{N}$ be the number of dimensions, and $H_{t,i}$ for $i = 1,\dots,d$ be the history of events in dimension $i$. The multivariate point process induced by the intensities

$$\lambda_i(t) = \mu_i + \sum_{j=1}^{d} \sum_{t_k \in H_{t,j}} \phi_{ij}(t - t_k), \quad i = 1,\dots,d, \tag{3}$$

is then defined as a multivariate Hawkes process. If the kernel $\phi_{ij}(t)$ takes the form of the following multivariate exponential kernel,

$$\phi_{ij}(t) = \alpha_{ij} \beta_{ij} e^{-\beta_{ij} t}, \quad i, j = 1,\dots,d, \tag{4}$$

where $\alpha_{ij} \geq 0$ is the excitation parameter and $\beta_{ij} > 0$ is the decay rate, then we have the multivariate Hawkes process with exponential kernel.

The excitation parameter $\alpha_{ij}$ can be interpreted similarly to $\alpha$ in the one-dimensional case with the exponential kernel, with the exception that this influence on new events in dimension $i$ may now come from previous events in any dimension $j \in \{1,\dots,d\}$. Analogously, $\beta_{ij}$ is interpreted as the rate of decay that specifies how long past events in dimension $j$ can influence the arrival of new events in dimension $i$. In Section 5 we use a multivariate Hawkes process to model Twitter events in dimension 1 and mass media reports of protests in dimension 2.
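As an illustration of how the cross-excitation terms enter, the sketch below evaluates the intensities of a multivariate Hawkes process with exponential kernels for given event histories. It is a toy example under assumed parameter values, with zero-based indices (index 0 standing for Twitter and index 1 for mass media); the function name is hypothetical.

```python
import math


def multivariate_intensity(t, histories, mu, alpha, beta):
    """Return [lambda_1(t), ..., lambda_d(t)] for exponential kernels.

    histories[j] holds the event times of dimension j; alpha[i][j] and
    beta[i][j] describe the influence of dimension j on dimension i.
    """
    d = len(mu)
    lam = []
    for i in range(d):
        total = mu[i]
        for j in range(d):
            for s in histories[j]:
                if s < t:  # only events strictly in the past contribute
                    total += alpha[i][j] * beta[i][j] * math.exp(-beta[i][j] * (t - s))
        lam.append(total)
    return lam


# Assumed toy values: one tweet at time 1.0 and no media reports so far.
histories = [[1.0], []]
mu = [0.5, 0.2]
alpha = [[0.3, 0.0], [0.4, 0.1]]  # alpha[1][0]: effect of Twitter on mass media
beta = [[1.0, 1.0], [2.0, 1.0]]
print(multivariate_intensity(2.0, histories, mu, alpha, beta))
```

In this toy setting the media intensity is raised above its baseline only through `alpha[1][0]`, the entry that carries the influence of Twitter events on mass media reports.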

2.2 Granger Causality

How to rigorously define causality has been a topic of discussion in Western philosophy for over 2000 years, starting with Plato and Aristotle (Falcon, 2019), and continuing with the disagreement between Hume and Kant, one of the fundamental discussions in modern philosophy. The problem is still open (Pierris and Friedman, 2018).

In light of this, and in some sense to get around the metaphysical complications of proper causality, Clive Granger introduced the concept of Granger causality relating to stochastic processes. The basic idea is that if a variable $X_t$ Granger-causes a variable $Y_t$, then the past values of $X_t$ contain information that helps predict the future value $Y_{t+1}$ better than a prediction based only on past values of $Y_t$ (Granger, 1980).

Using the following theorem from Eichler (Eichler et al., 2012), we will test the null hypothesis of the non-existence of Granger causality between events in social and mass media, and vice versa, in the sequel.


Theorem 2.1. Let $N(t)$ be a multivariate Hawkes process in $d$ dimensions, with kernels $\phi_{ij}(t)$, $i, j \in \{1,\dots,d\}$. Then the $j$-th component $N_j$ does not Granger-cause the $i$-th component $N_i$ if and only if $\phi_{ij}(t) = 0$, $\forall t \in \mathbb{R}$.

Thus, when $N(t)$ is a multivariate Hawkes process with exponential kernel, by Theorem 2.1 the $j$-th component $N_j$ does not Granger-cause the $i$-th component $N_i$ if and only if $\alpha_{ij} = 0$.
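As a short worked step, assuming the exponential kernel of equation (4): because $\beta_{ij} > 0$ and $e^{-\beta_{ij} t} > 0$, the kernel vanishes identically exactly when its excitation parameter does. For the bivariate model with Twitter in dimension 1 and mass media in dimension 2, the two null hypotheses of no Granger causality can then be written as below. The labels $H_0^{(1 \to 2)}$ and $H_0^{(2 \to 1)}$ are notation introduced here for illustration only.

$$\phi_{ij}(t) = \alpha_{ij} \beta_{ij} e^{-\beta_{ij} t} = 0 \quad \forall t \geq 0 \iff \alpha_{ij} = 0$$

$$H_0^{(1 \to 2)}: \ \alpha_{21} = 0 \quad \text{(Twitter does not Granger-cause mass media)}$$

$$H_0^{(2 \to 1)}: \ \alpha_{12} = 0 \quad \text{(mass media does not Granger-cause Twitter)}$$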

3 DATA HANDLING

3.1 Apache SPARK

The data was handled using Apache Spark, which is an open-source engine designed for data engineering, data science, and machine learning on clusters of multiple computers, with implicit data parallelism. Spark is multi-language and supports Scala, Python, R, SQL, Java, C# and F#. While most of the code for this article was written in Scala, the ease of switching between languages in the same environment proved quite useful, as we used libraries written in both R and Python.

On top of Spark core, Spark SQL (Armbrust et al., 2015) introduces the data abstraction of DataFrames, which allows manipulation in Scala, Python, and R using the standard SQL language, while the graph-processing framework GraphX (Gonzalez et al., 2014) allows for network analysis. To run Spark, the cloud data platform Databricks was used, which provided cloud storage, computing clusters, and a notebook environment to write and run the code after loading the two main libraries developed for this study, MEP and SPARK-GDELT.

3.2 Twitter

Twitter is a micro-blog and social media service, founded in 2006, where users post and interact via tweets: short messages restricted to 280 characters, which may also contain pictures, short videos and URLs. Tweets can be original posts, replies to other tweets, or retweets, i.e., the sharing of another user’s tweet. As long as a user does not actively choose to be private, anyone is able to read the tweets of the user. To help a tweet gain traction, and make it easier for other users to find tweets on a specific topic, the user can tag their posts by including keywords prefaced with ‘#’, the hash symbol. These tagged keywords are called hashtags and they have been used by activists in global social movements such as #BlackLivesMatter and #MeToo (Jackson et al., 2020).

Apache Spark: https://github.com/apache/spark
MEP: https://github.com/lamastex/mep
SPARK-GDELT: https://github.com/lamastex/spark-gdelt

Users may also follow other users on Twitter. The relationship of following is asymmetrical, meaning that if user A follows user B, user B does not have to follow user A. Compare this to Facebook, where users mutually have to accept each other as friends to be able to interact. To simplify things, if Facebook is about keeping in touch and networking with your friends, Twitter is about sharing and receiving information the user finds interesting; according to a study done in 2014, 44% of Twitter’s users have never tweeted, which seems to suggest that a large part of the user base only uses Twitter for receiving information (Muphy, 2014). This asymmetrical following relationship, which encourages a more open discourse between users, together with the sheer number of users, makes Twitter the natural choice of social media platform to analyse. Furthermore, unlike Twitter, other prominent social media platforms including Facebook and Instagram do not allow researchers open access to their data. We developed MEP to be able to design experiments, and to collect and analyse data from different Twitter APIs at scale in public cloud infrastructure.

3.2.1 Application Programming Interface

To work with and be able to analyse Twitter data efficiently on an arbitrarily large scale, access to Twitter’s Application Programming Interface (API) is needed. This requires Twitter developer credentials, which anyone can apply for. With access to the credentials, one may request and download tweets, which are represented as JSON files. At the time of writing, two versions of the Twitter API exist. This work was done in the older version 1.

To get a sense of how the data was handled, a brief overview of the relevant fields from the JSON schema of a tweet is presented here. For full details, we refer to Twitter’s data dictionary for the Tweet object (https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet) and for the User object (https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user). The two most basic objects for a tweet are the User object and the Tweet object shown in Tables 1 and 2, respectively.

From the User object, as the name suggests, we get access to the metadata of a user. However, note that no direct information about which users follow the user, or which users the user follows, beyond the counts, is accessible from the User object.


Table 1: Some attributes, with their types and descriptions, for the User object.

Attribute        Type    Description
id               Int64   The unique integer representation of the user.
screen_name      String  The screen name, also known as the handle, of the user.
followers_count  Int     The number of followers the user has.
friends_count    Int     The number of users the user follows.

From the Tweet object, we get access to the metadata of a tweet. Via the field “user”, we also get the information of the user behind the tweet, since this is a User object. Moreover, since the fields “quoted_status” and “retweeted_status” are Tweet objects, we get the full information of the original post that has been retweeted or quoted.

Note that the Tweet object in “retweeted_status” points to the original tweet that has been retweeted, if the post is a retweet. It is possible for a user to retweet another user’s retweet, but information on this chain of events is not accessible. For example, let user A write a tweet T that gets retweeted by user B. Later, user C sees this retweet on user B’s timeline and then retweets T. Twitter’s API will then only tell us that users B and C have retweeted user A, but not the fact that user C accessed this tweet via user B. This limitation also motivates the use of the retweet network in Section 4.2.

Along with these two objects, there is another object named entities, which contains all the metadata of a tweet’s content, including any URLs, hashtags, Twitter handles of users mentioned, and media content (pictures and short video clips).

3.2.2 Data Set

The data set that was used (Giorgi et al., 2020)

has 41.8 million collected tweets from 10.1 million

unique users regarding the Black Lives Matter move-

ment, along with the smaller counter movements of

Blue Lives Matter (pro-police movement) and All

Lives Matter. These tweets were collected by ﬁltering

on the keywords: BlackLivesMatter, BlueLivesMatter

and AllLivesMatter. The data contains tweets from

the beginning of the movement in 2013 to 30 June

2020. In this work, we focus on the events occurring

during the aftermath of the death of George Floyd on

25 May 2020, and discard all tweets before this date.

3.2.3 Collecting Data

Due to Twitter’s policy, publicly sharing collected tweets is not allowed. Instead, to share a set of tweets one shares the ID of each tweet, and to get the full metadata of the tweets, access to Twitter’s API is needed. There is also a limit on how many tweets one may collect per hour, which initially was a problem. To get around this, the Python library twarc was used. twarc allowed us to collect tweets from the IDs (a process known as hydrating) in an optimised way with respect to the hourly collection limit.

To be able to work with the data in Databricks and Spark, a Docker container with Python and twarc was set up on a remote machine. It ran the hydration script on small batches of the IDs, collected the results as ‘.json’ files, and then compressed and stored them in our Databricks cloud storage. This procedure took roughly five days.

A consequence of retroactively collecting tweets

from their IDs is that all tweets that have been re-

moved due to various reasons (such as the users of

these tweets getting banned, removing their accounts,

or going private) at the time of hydrating, are not ac-

cessible and were therefore not collected.
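The batch-wise hydration described above can be sketched as follows. This is a schematic illustration only: `hydrate_batch` is a hypothetical stand-in for whatever call returns the tweets for a list of IDs (for instance a wrapper around twarc), and the batch size is an arbitrary assumption. The stand-in used in the example silently drops one ID, mimicking a tweet that is no longer accessible at hydration time.

```python
import json


def batches(ids, size):
    """Yield consecutive chunks of at most `size` tweet IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def hydrate_in_batches(ids, hydrate_batch, size=1000):
    """Hydrate IDs batch by batch; return one JSON-lines string per batch.

    `hydrate_batch` is a hypothetical callable mapping a list of IDs to a
    list of tweet dicts. IDs of removed tweets are simply absent from its result.
    """
    outputs = []
    for chunk in batches(ids, size):
        tweets = hydrate_batch(chunk)
        outputs.append("\n".join(json.dumps(tweet) for tweet in tweets))
    return outputs


def fake_hydrate(chunk):
    # Stand-in for a real API call: tweet 3 has been deleted and is not returned.
    return [{"id": tweet_id} for tweet_id in chunk if tweet_id != 3]


files = hydrate_in_batches([1, 2, 3, 4, 5], fake_hydrate, size=2)
print(len(files), "batch files")
```

Each returned string corresponds to one batch file that could then be compressed and uploaded; the number of hydrated tweets can be smaller than the number of requested IDs.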

After hydrating the IDs from the data set, and discarding tweets posted earlier than 24 May 2020, 23.3 million tweets from 7.1 million unique users were left. These were cleaned to be easier to work with using Spark’s DataFrames. We also categorised each tweet as an original tweet, retweet, quoted tweet, etc., and then stored them in the column-based data storage format Parquet on a Delta Lake (Armbrust et al., 2020). See MEP for details of the collector, pre-processor and categoriser behind the Delta Lake.
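One plausible way to categorise a tweet from the fields of the Tweet object is sketched below. The rules and category names are assumptions made for illustration; the categoriser in MEP may use different rules and finer categories.

```python
def tweet_type(tweet):
    """Classify a v1-style tweet dict using fields of the Tweet object."""
    retweeted = tweet.get("retweeted_status")
    if retweeted is not None:
        # A retweet whose original tweet is itself a quote tweet.
        if retweeted.get("quoted_status") is not None:
            return "Retweet of Quoted Tweet"
        return "Retweet"
    if tweet.get("quoted_status") is not None:
        return "Quoted Tweet"
    if tweet.get("in_reply_to_status_id") is not None:
        return "Reply Tweet"
    return "Original Tweet"


# Minimal hand-made examples (not real tweets).
print(tweet_type({"id": 2, "retweeted_status": {"id": 1}}))
print(tweet_type({"id": 3, "in_reply_to_status_id": 1}))
```

The order of the checks matters in this sketch: a retweet is recognised before a quote or a reply, so a retweeted reply is counted as a retweet.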

Table 2: Some attributes, with their types and descriptions, for the Tweet object.

Attribute              Type          Description
created_at             String        UTC time when the tweet was created.
id                     Int64         The unique integer representation of the tweet.
text                   String        The textual content of the tweet.
in_reply_to_status_id  Int64         If the tweet is a reply to another tweet, the field will contain the tweet ID of that tweet. Otherwise null.
in_reply_to_user_id    Int64         If the tweet is a reply to another tweet, the field will contain the user ID of the author of that tweet. Otherwise null.
user                   User object   All information of the user of the tweet.
quoted_status          Tweet object  If the tweet is a quote tweet, all information of the original tweet will be contained in this field. Otherwise null.
retweeted_status       Tweet object  If the tweet is a retweet, all information of the original tweet will be contained in this field. Otherwise null.

twarc: https://twarc-project.readthedocs.io/en/latest/

3.3 GDELT

The Global Database of Events, Language, and Tone (GDELT) project, founded in 2013, is an open database supported by Google Jigsaw that monitors news media in print, broadcast, and web formats from all over the world in over 100 languages. It is updated every fifteen minutes and stretches back to 1 January 1979, containing metadata such as the people and organisations being mentioned, events and their locations, and counts of keywords, along with the tone and emotions of the parsed news sources (https://www.gdeltproject.org/). We used the GDELT database to get a high-level understanding of the mass media landscape during the given time span, by reducing the records of reported events of protests to data points in time. We accomplish this by building an analytics-ready Delta Lake (Armbrust et al., 2020). A brief overview of GDELT follows, to show how we handled the data for this work. For a more thorough overview, we refer to the documentation and SPARK-GDELT, our open-source library developed for this study.

3.3.1 Coding

The idea behind GDELT is that of coding, which is fundamentally fairly simple. Given a record (for example, a written news article), go through the text and identify the real-world events that are being reported in the record, and identify the actors who are involved in each event. During the Cold War, two coding frameworks dominated: the World Event/Interaction Survey (WEIS) and the Conflict and Peace Data Bank (COPDAB). Both of these frameworks, being developed and used in a 20th-century post-World War II context, were focused on codifying how sovereign states (the actors) interacted through official diplomacy and military threats (Schrodt, 2012). For example, in the following sentence:

For example, in the following sentence:

“President Reagan has threatened further action

against the Soviet Union in an international televi-

sion program beamed by satellite to more than 50

countries”,

one would identify the act of threatening as the event,

and assign it some integer (decided by the code frame-

work), with the actors being President Reagan (or

the United States if the coder is only interested in

sovereign states), and the Soviet Union.

This process of coding would historically be done by hand. However, given psychological studies showing that the kind of sustained decision-making involved in coding leads to fatigue, inattention, and heuristic shortcuts, together with technological advances in computing software and hardware, coding is nowadays automated. The frameworks for codifying have also developed since the Cold War, with GDELT using the framework of Conflict and Mediation Event Observations (CAMEO) (Leetaru and Schrodt, 2013). One notable change is that actors are no longer limited to sovereign states, and include persons, organisations, and companies.

GDELT documentation: http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf

In practice, GDELT is essentially two separate but interlinked databases: the Global Knowledge Graph (GKG), which consists of records, and the Event Database, which as the name suggests stores the events that are being reported.

3.3.2 GKG

The Global Knowledge Graph (GKG) consists of all

records from multiple news sources in the world. As

of version 2 of GDELT, new records get added ev-

ery ﬁfteen minutes. Whenever a record is added, the

source text is parsed via natural language processing

to identify the events (using coding), locations, per-

sons and organisations, as well as themes mentioned

in the text. Moreover, keywords such as “protest” that are mentioned multiple times get counted. Sentiment analysis is also incorporated to get a value of the tone of the source text (whether the text is positive, neutral or negative). Each GKG record also contains many other extracted metadata fields.

3.3.3 Event Database

The Event database attempts to record all unique events that are identified in the parsing process of the GKG database. Each data point is given a unique ID for the event, and contains the date and the actors, along with the code of the type of event being identified. The coded event also gets mapped to the Goldstein scale (Goldstein, 1992), which seeks to measure the potential impact the event could have on the stability of the country. Moreover, the Event database has metadata on how often the event has been mentioned by records in GKG and the average tone of these records.

3.3.4 Handling of the GDELT Data

Due to the sheer magnitude of data contained in the GDELT database, working with the data proved quite a challenge. Our goal was to extract the events about the protests relating to the Black Lives Matter movement and the counter-movements between 25 May 2020 and 30 June 2020. Although the parsing of news records into the GKG database identifies organisations, it did not identify the Black Lives Matter movement as one, probably due to its lack of centralisation. What we did instead was to select all data relating to protests happening in the world. This naturally led to noisy data, since we got reports of protests unrelated to the BLM movement, but we justify this by the fact that no other major protests were happening in the world at the same time. To check this, we filtered the Event database for events with CAMEO root code 14, i.e., those events coded as protests, over a three-month timeline.
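The daily counts behind such a check can be sketched in a few lines. The example below works on a small in-memory list of dictionaries rather than on the Spark tables used in this study; the keys `SQLDATE` and `EventRootCode` follow the column names of the GDELT Event database, while the rows themselves are made up for illustration.

```python
from collections import Counter

PROTEST_ROOT_CODE = "14"  # CAMEO root code for events coded as protests


def protest_events_per_day(events):
    """Count events with CAMEO root code 14 per day (SQLDATE, formatted YYYYMMDD)."""
    return Counter(
        event["SQLDATE"]
        for event in events
        if str(event["EventRootCode"]) == PROTEST_ROOT_CODE
    )


# Made-up rows for illustration only.
rows = [
    {"SQLDATE": 20200601, "EventRootCode": "14"},
    {"SQLDATE": 20200601, "EventRootCode": "14"},
    {"SQLDATE": 20200601, "EventRootCode": "19"},  # not a protest event
    {"SQLDATE": 20200602, "EventRootCode": "14"},
]
print(sorted(protest_events_per_day(rows).items()))
```

Plotting such counts over a three-month window gives a time series of the kind shown in Figure 1.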

Figure 1: Events coded as protests in the GDELT Event

database.

As we see in Figure 1, there is a baseline of roughly 5,000 events per day coded as protests before 25 May. This number then explodes, and there is nothing that suggests that the sudden increase in the magnitude of protests is not related to the BLM protests. It is worth pointing out that there is no bijection between real-world protests and the protest data from the Event database. For example, if in one city during one day large protests are taking place, and one group of people is protesting peacefully while another group is rioting, then the coding framework should identify the acts of the peaceful and the rioting protesters as two different events (Schrodt, 2012), although they are near each other in time and space. Thus, saying that more than 8,000 protests happened on 1 June 2020 would be incorrect.

In Section 5 we look at news reports in mass media, and therefore use data from the GKG database. We obtained this data by filtering on the themes of the records: all records in the GKG database with theme “PROTEST” were selected.
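A theme filter of this kind can be sketched as below, assuming that the themes of a record are available as a single semicolon-delimited string. The dictionary keys `date` and `themes` and the sample records are hypothetical and chosen for illustration; they are not the column names of the actual pipeline.

```python
from collections import Counter


def has_theme(themes_field, theme="PROTEST"):
    """True if `theme` is one of the semicolon-delimited theme tokens."""
    return theme in {token.strip() for token in themes_field.split(";") if token.strip()}


def records_per_day(records, theme="PROTEST"):
    """Count records carrying `theme`, keyed by the YYYYMMDD prefix of their date."""
    return Counter(r["date"][:8] for r in records if has_theme(r["themes"], theme))


# Made-up records for illustration only.
records = [
    {"date": "20200601120000", "themes": "PROTEST;TAX_FNCACT_POLICE"},
    {"date": "20200601121500", "themes": "TAX_FNCACT_PROTESTERS;"},
    {"date": "20200602000000", "themes": "CRIME;PROTEST;"},
]
print(sorted(records_per_day(records).items()))
```

Matching whole tokens rather than substrings is a deliberate choice in this sketch: a record tagged only with a longer theme name that happens to contain the word PROTEST is not counted.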

Figure 2: Comparison of records from the GKG database with theme “PROTEST”, and events coded as protests from the Event database.

Ignoring the periodic dips in the GKG plot in Fig-

ure 2 (which are due to less reporting being done

on weekends), the two plots follow a similar pattern.

Naturally, there are more records than events, since

multiple news sources may report the same event.

4 ANALYSIS OF TWITTER DATA

In this Section, we explore the Twitter data, ﬁrst via

simple querying on the data set, and then by doing

network analysis on the induced retweet network. The

results from this exploratory data analysis then moti-

vated the choice of using Hawkes processes to model

and perform hypothesis tests to shed light on the phe-

nomena of interest in this study – occurrence of tweets

in support of the BLM movement and that of mass

media reports of street protests.

4.1 Data Observations

4.1.1 Timeline

We started by examining the data over the relevant

time-span from 24 May 2020 to 30 June 2020. During

this period, 23,346,745 tweets by 7,111,140 unique

users were collected using twarc on the BLM data

set (Giorgi et al., 2020).

From Figures 3 and 4, we can see that activity first starts on Twitter, and the reports of protests start to drastically increase on 27 May. We also see a dip in Twitter activity between 31 May and 2 June, while the GDELT data on the number of reports of protests spikes during these days. The explanation for this is simply that the data set lacks tweets on these days. This was found while exploring the data and noticing that the data set contained retweets of a tweet from this time period, but not the original tweet. Whether these missing tweets disappeared during the collection of data, or are missing in the original data set (Giorgi et al., 2020) of tweet IDs, remains unclear. To deal with this, we refrained from doing any modelling with tweets from this time period.

Figure 3: Number of tweets per day.

Figure 4: Log-scaled plot of the number of tweets, records and events.

4.1.2 Type & Media Content of Tweets

Next, we examined TweetTypes, i.e., the types of sta-

tus update or interactions in our Twitter data. The

most to least frequent TweetTypes (% of data) were

Retweets (55%), Retweets of Quoted Tweets (27%),

Original Tweets (7%), Quoted Tweets (7%), Reply

Tweets (3%), Original Tweets (1%). Thus, only 18%

of the tweets in the BLM-data set were original tweets

(either original, or replies to other tweets), with the

remaining 82% being some sort of retweeted content.

This suggests that the re-sharing of other users’ origi-

nal content is fundamental for how users interact with

each other on Twitter, and motivated our choice of ex-

amining the retweet network.

One initial idea was to focus on URLs to news

articles shared by Twitter users, and then link them

to the GDELT database. However, we soon discov-

ered that users in general did not share news sources

from mass media. Instead highly retweeted tweets

often contained original media (i.e., videos and pic-

tures), which were often taken from the protests. For

instance, 53% of tweets with over 1000 retweets, as

opposed to only 17% of all tweets, shared original

media.

4.2 Network Analysis

Section 4.1.2 showed the importance of retweets in

the Twitterverse. In this Section we will formalise

this by introducing a retweet network structure on our

data set.

4.2.1 Retweet Network

Definition 4.1. Let $G_I = (V, E)$ be a directed weighted graph in time interval $I \subset \mathbb{R}_+$, where every vertex $v \in V$ is a unique Twitter user, and every edge $e = (u,v) \in E \subseteq V^2$ is interpreted as user $v$ having retweeted user $u$ during time interval $I$. The weight $W(e) = W((u,v)) \in \mathbb{N}$ is the number of times user $v$ has retweeted user $u$. We then define $G_I$ as a retweet network.

Furthermore, we define $G'_I$ as an undirected retweet network if $(u,v) \in E \Leftrightarrow (v,u) \in E$. Thus $G'_I$ ignores whether $u$ retweeted $v$ or vice versa but preserves the information that there is a retweet relation between the two users.

We chose to look at retweets since a retweet

by user u of an original tweet by user v is highly

likely to mean that user u agrees with user v. Di-

rect retweets are generally recognized to indicate trust

in the communicator and endorsement (Jansen et al.,

2009; Metaxas et al., 2015; Boyd et al., 2010). The

number of times a user has been retweeted also gives

a probabilistic interpretation, using the random geo-

metric graph interpretation in (Sainudiin et al., 2019),

that measures how inﬂuential a user is on another in

terms of the lengths of their most retweeted paths.

By looking at our retweet network we can already get some information from the Twitter data set; simply by summing the weights of the outgoing edges for every user, we get the most retweeted users in our time interval between 24 May 2020 and 30 June 2020.
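The construction of the retweet network and the ranking by outgoing edge weight can be sketched on a single machine as follows. This is a small illustration with made-up user names, not the distributed implementation used in this study; an edge (u, v) with weight w means that user v retweeted user u a total of w times, as in Definition 4.1.

```python
from collections import Counter


def build_retweet_network(retweets):
    """Return edge weights W[(u, v)] = number of times v retweeted u.

    `retweets` is an iterable of (retweeter, retweeted_user) pairs.
    """
    weights = Counter()
    for retweeter, author in retweets:
        weights[(author, retweeter)] += 1
    return weights


def undirected_edges(weights):
    """Edge set of the undirected retweet network (self-retweets are dropped)."""
    return {frozenset(edge) for edge in weights if edge[0] != edge[1]}


def most_retweeted(weights, k=10):
    """Rank users by the total weight of their outgoing edges."""
    totals = Counter()
    for (author, _retweeter), weight in weights.items():
        totals[author] += weight
    return totals.most_common(k)


# Made-up example: B retweets A twice, C retweets A once, A retweets C once.
W = build_retweet_network([("B", "A"), ("C", "A"), ("B", "A"), ("A", "C")])
print(most_retweeted(W, k=2))
```

In this toy example user A has total outgoing weight 3 and is therefore ranked first.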

Table 3: Ten most retweeted users, sorted by number of retweets. Usernames for non-public users have been anonymized. The communities were identified using the label propagation algorithm.

Username          Followers    Retweets  Community
@JoshuaPotash       142,833     759,572  Pro-BLM
@YourAnonCentral  5,862,927     529,431  Pro-BLM
-                     1,584     187,065  Pro-BLM
@elijahdaniel       760,935     161,337  Pro-BLM
-                    22,983     135,698  Pro-BLM
@MrAndyNgo          799,291     125,898  Anti-BLM
-                     1,232     125,826  Pro-BLM
@BTS_twt         34,107,446     125,534  K-pop
@shawnwasabi        140,788     106,731  Pro-BLM
@Drebae             141,613     103,594  Pro-BLM

One noteworthy user is the sixth most retweeted user, @MrAndyNgo. Andy Ngo is an American conservative journalist and a prominent opponent of the Black Lives Matter movement, who in February 2021 published Unmasked: Inside Antifa’s Radical Plan to Destroy Democracy (Ngo, 2021), where he among other things writes about his experiences from the BLM protests of 2020. His presence amongst the most retweeted users serves as a gateway into the counter-movements of All Lives Matter and Blue Lives Matter. To study these, we need to detect different communities within the observed retweet network, such that each community has more edges, or retweets, within it than between it and another community.

4.2.2 Connected Components

The motivation behind the deﬁnition of an undirected

retweet network follows in the next step, when we

look at the connected components of our graph.

Definition 4.2. Let G be a graph. A sequence of edges $(e_1, \ldots, e_{n-1})$ is called a path if it corresponds to a sequence of distinct vertices $(v_1, \ldots, v_n)$ such that $e_i = (v_i, v_{i+1})$. Two vertices u, v are connected if there exists a path between them. If G is undirected, we call a sub-graph H of G, whose vertices are a subset of the vertices of G, a connected component if and only if there exists a path between every pair of vertices in H and no vertex of H is connected to a vertex outside H.

The reasoning behind invoking connected components of the undirected retweet network is, at a high level, to make sure that a meaningful discourse between users, in terms of influencing and being influenced by others, exists within the connected component. In practice, we could have a highly disconnected network with many small components, which would mean that most users only interact with and retweet a few selected users. Another interesting case would be a network with a few significantly large components; this would suggest the existence of a set of discourses whose users do not interact across components, perhaps because of political differences reflected in large “echo chambers”. To find all connected components in the retweet network, the GraphFrames framework in Spark was used. The result showed that 6,083,687, i.e., 85.6% of the 7,111,140 users, were in the same connected component. The remaining users were scattered across smaller connected components, the largest of which had 74 users. These remaining users were therefore discarded from further analysis.
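To make the connected-component step concrete, here is a minimal single-machine sketch using a union-find structure. It is not the GraphFrames implementation used in the paper; the function name and the toy edge list are hypothetical, and the approach only illustrates how the largest component can be kept and the rest discarded.

```python
from collections import Counter

def connected_components(edges):
    """Map each vertex of an undirected graph to a component representative."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components
    return {x: find(x) for x in list(parent)}

# Toy undirected retweet relations (hypothetical users).
toy = [("a", "b"), ("b", "c"), ("d", "e")]
labels = connected_components(toy)
sizes = Counter(labels.values())
largest_root, largest_size = sizes.most_common(1)[0]
# Keep only the users in the largest component, as done in the text.
giant = {x for x, root in labels.items() if root == largest_root}
print(sorted(giant), largest_size)
```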

4.2.3 Community Detection

While the data set contains tweets using the hashtags of the counter-movements #AllLivesMatter and #BlueLivesMatter, in practice, users associated with these movements did not necessarily use these hashtags, but often used the hashtag #BlackLivesMatter either ironically or to get more attention. Thus, simply querying on the hashtags in the data set did not suffice to get a sample of users from these movements. To get a better sense of the relationship between users, we instead used the community detection algorithm known as the label propagation algorithm (LPA). LPA is a semi-supervised machine learning algorithm, which seeks to assign labels to nodes in a network, where each label maps to a specific community inside the network (Raghavan et al., 2007). In Spark’s GraphX framework, the algorithm is implemented using the Pregel API (Malewicz et al., 2010), which allows for parallel computation when processing graphs. At a high level, Pregel computations are a sequence of iterations, called supersteps, where in every superstep each vertex in the graph runs a user-defined function. This local vertex-centric approach, where each vertex is processed independently in parallel, in contrast to more classical iterative graph algorithms where vertices are visited one by one, naturally induces distributed implementations that can computationally scale to arbitrarily large networks. In distributed LPA, implemented as a Pregel program, each vertex in the graph is initially assigned its own distinct label to represent its initial community. At every superstep, vertices send their community label to all out-neighbours and update their label to be the mode of the community labels in the incoming messages from their in-neighbours. Although the algorithm can have trivial or oscillating solutions without guarantees on convergence, it works well in practice on real data, as we found by running LPA on the largest connected component with 10 supersteps and manually inspecting at least the most influential users within each community.
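The superstep logic described above can be illustrated with a small synchronous label propagation sketch. It runs on one machine rather than through Pregel, treats the toy graph as undirected (messages flow both ways along each edge), and breaks ties by taking the smallest label; the tie-breaking rule and the toy graph are assumptions of this sketch, not details taken from the paper or from GraphX.

```python
from collections import Counter

def label_propagation(neighbours, supersteps=10):
    """Synchronous LPA: every vertex adopts the most frequent neighbour label."""
    labels = {v: v for v in neighbours}  # each vertex starts in its own community
    for _ in range(supersteps):
        new_labels = {}
        for v, nbrs in neighbours.items():
            if not nbrs:
                new_labels[v] = labels[v]  # isolated vertex keeps its label
                continue
            counts = Counter(labels[u] for u in nbrs)
            best = max(counts.values())
            # Assumed tie-break: the smallest label among the most frequent ones.
            new_labels[v] = min(l for l, c in counts.items() if c == best)
        labels = new_labels  # all vertices update at once, as in a superstep
    return labels

# Two triangles joined by a single edge (2-3): a toy graph with two communities.
toy_graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
    3: [2, 4, 5], 4: [3, 5], 5: [3, 4],
}
communities = label_propagation(toy_graph, supersteps=10)
print(communities)
```

On this toy graph the labels settle after a few supersteps; on other graphs synchronous updates may oscillate, which is why the number of supersteps is capped.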

4.2.4 Exploring Ideological Diversity

By looking at the twenty most retweeted users, we see that eighteen of these fall into the same pro-BLM community, with 155,229 users. Andy Ngo is in a community with 26,624 users. This is interesting when we recall from Table 3 that he is the sixth most retweeted user: if we assume that most of his retweets come from his relatively small community, it suggests that he has a very loyal set of core followers. The questions that then arise are whether we can identify this core set of followers, and moreover whether we can also identify similar core followings in the pro-BLM community. In the same community where we find Andy Ngo, we also have prominent conservative commentators such as Candace Owens, Glenn Beck, Steven Crowder, Paul Joseph Watson, and Dave Rubin, as well as Republican senator Ted Cruz and Raheem Kassam from the Reform UK party (formerly known as the Brexit Party), along with others. It is worth mentioning that all of the twenty most retweeted users in this community have large followings (over 25,000 followers). Thus, the phenomenon of users with small followings reaching a larger audience does not exist to the same extent in this community as in the pro-BLM community.

The last of the twenty most retweeted users is the official account of the South Korean pop (K-pop) group BTS, which has its own community. The communities for the ten most retweeted users are presented in Table 3, and a sample of tweets from the pro-BLM and anti-BLM communities is presented in Table 4.

Note how the textual content of the tweets from the two communities differs. By applying the label propagation algorithm we seem to have identified the two different political camps. Moreover, we note that usage of the hashtag #BlackLivesMatter is prominent in the anti-BLM community. Thus, we can conclude that just filtering by the anti-BLM hashtags #AllLivesMatter and #BlueLivesMatter would not have sufficed to identify these communities.

Thus, through the use of (1) a retweet network, which encodes retweets, one of the clearest signals of directional ideological concurrence of the retweeter with the tweeter, (2) distributed label propagation on such a retweet network to detect communities of users who are in ideological concurrence within each community, and finally (3) a listing of the top K most retweeted tweets within each such community, we have a simple yet effective mechanism to explore the ideological diversity that is representative of the communities, independent of their sizes and activity levels, i.e., the number of users and the intensity of interactions in Twitter. We found this three-step process to be an effective approach to identifying the pro/anti-BLM tweets before further analysis.

5 JOINT MEDIA MODELING

In this section we examined the interplay between the Twitter and GDELT data sets by looking at the Granger causality between them. For this we proposed simple two-dimensional Hawkes processes with an exponential kernel. The time window for this joint modeling was the 24-hour period between midnight of 28 May and midnight of 29 May, three days after the death of George Floyd, which is when the protests had just started to spread nationwide across the US and had also become violent.

5.1 Model and Data

In dimension one we had the Twitter data. To control the magnitude of the data we only considered original tweets (i.e., all retweets were filtered out) that had at least one retweet, in order to filter out tweets made by users with a negligible following. Moreover, we examined the 20 largest communities and identified one anti-BLM community (the same community identified in the previous section), and filtered out all tweets made by users from that community, so that we only considered pro-BLM tweets. This left us with 10,774 tweets.

In the second dimension we had records from the GKG database from GDELT. The records were first filtered on mentioned themes, and only those reporting events of protests were selected. This naturally led to some noise in the data, since we could not precisely filter out only the events mentioning protests relating to the Black Lives Matter movement. To reduce this noise, we also filtered on records that mentioned George Floyd. While in theory a record could report a BLM-related protest without mentioning George Floyd, we reasoned that since our time window of interest was three days after his passing, most records should mention George Floyd to give the reader some context for the reported protest. Because the GKG database is updated every 15 minutes, every record was given a randomised timestamp in the fifteen-minute interval prior to its addition to the database, to place the records in continuous time. With this query in the selected time interval, 3,341 records were found.
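The timestamp randomisation can be sketched as follows. This is an illustration under stated assumptions: the draw is taken to be uniform over the fifteen minutes before the batch time, and the function name, seed, and example batch time are hypothetical.

```python
import random
from datetime import datetime, timedelta

def jitter_timestamp(batch_time, rng, interval_minutes=15):
    """Draw a time uniformly from the interval_minutes preceding batch_time."""
    offset_seconds = rng.uniform(0.0, interval_minutes * 60.0)
    return batch_time - timedelta(seconds=offset_seconds)

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
batch = datetime(2020, 5, 28, 12, 15)  # hypothetical GKG update time
# Five records that arrived in the same batch get distinct continuous times.
stamps = sorted(jitter_timestamp(batch, rng) for _ in range(5))
for s in stamps:
    print(s.isoformat())
```

Spreading the records this way avoids many events sharing one identical timestamp, which a continuous-time point process model does not allow for.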

Given this data, we jointly model events in social and mass media by fitting the multivariate Hawkes process in Definition 2.3. We want to test whether or not Granger causation exists between dimensions 1 and 2, representing events in Twitter and events in mass media from the GDELT project, respectively. As per Theorem 2.1, parameter $\alpha_{12} = 0$ if and only if mass media events do not Granger-cause Twitter events, and vice versa for $\alpha_{21} = 0$.
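To see why a zero cross-excitation parameter corresponds to the absence of Granger causality, the following sketch evaluates the conditional intensity of a bivariate Hawkes process with exponential kernels. The kernel form alpha * beta * exp(-beta * s) follows the convention of the tick library and is an assumption here, since Definition 2.3 is not restated in this section; indices are 0-based (0 for Twitter, 1 for mass media), and all numerical values are toy values rather than fitted estimates.

```python
import math

def intensity(i, t, events, mu, alpha, beta):
    """Conditional intensity of dimension i at time t.

    events[j] is the sorted list of event times in dimension j.
    Assumed kernel: alpha[i][j] * beta[i][j] * exp(-beta[i][j] * (t - s)).
    """
    lam = mu[i]
    for j, times in enumerate(events):
        for s in times:
            if s >= t:
                break  # only past events excite the process
            lam += alpha[i][j] * beta[i][j] * math.exp(-beta[i][j] * (t - s))
    return lam

events = [[0.5, 1.0], [0.8]]             # toy event times: Twitter, mass media
mu = [1.0, 1.0]
beta = [[6.17, 6.17], [6.17, 6.17]]
alpha = [[0.9, 0.0], [0.02, 0.9]]        # alpha[0][1] = 0: media does not excite Twitter

lam_twitter = intensity(0, 1.5, events, mu, alpha, beta)
lam_twitter_no_media = intensity(0, 1.5, [events[0], []], mu, alpha, beta)
lam_media = intensity(1, 1.5, events, mu, alpha, beta)
lam_media_no_twitter = intensity(1, 1.5, [[], events[1]], mu, alpha, beta)
print(lam_twitter, lam_twitter_no_media, lam_media, lam_media_no_twitter)
```

With alpha[0][1] set to zero, removing the mass media history leaves the Twitter intensity unchanged, whereas the positive alpha[1][0] makes the mass media intensity depend on past tweets.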


Table 4: Sample tweets from the pro-BLM and anti-BLM communities.

Pro-BLM community

i can’t stand by and continue to live in a world where the color of your skin is an automatic target on my family, friends, and neighbors backs.

tri-city we must come together to support our communities. THIS. IS. AMERICA. BE THE CHANGE YOU WANT TO SEE. #blacklivesmatter

https://t.co/XIDSNqgx6Q

Thread of people who took it upon themselves to trivialise the current situation going on and #BlackLivesMatter

#BlackLivesMatter Houston is hosting a protest march this FRIDAY at 2PM starting at Discovery Green demanding justice for #GeorgeFloyd White

allies, y’all gotta do better and this is a place to start. Everyone who’s able should be there. https://t.co/EbWeBrZneP

Aiyana Jones a 7 YEAR OLD CHILD who was shot in the head by an ofﬁcer, when the ofﬁcer raided the wrong house. A 7 year old girl didn’t deserve

to be killed because of disgusting reckless ofﬁcers. Acab and BLM, never forget this girls name! #BlackLivesMatter https://t.co/HCWzabkFv4

So protest in Huntsville, TX was small, but that was no surprise. We’re a small town and most things just caught up to the present on the outside...at the

end of the protest on my way home, I saw something I never noticed. This is why we do what we do. #BlackLivesMatter https://t.co/gTuCilB7mi

Anti-BLM community

Black people are 80 times more likely to kill white people in England/Wales than the reverse! And yet, #BlackLivesMatter more than others? EXPLAIN...

Check the stats: https://t.co/DmPDVVGbSo https://t.co/qxXmuNIh2X

#BlackLivesMatter should now be classiﬁed as an extreme political hate group.. Simple.. https://t.co/mFh56qCpo9

#DontTakeTheKnee #DontTakeTheKnee please get this trending Sick & tired of the #ScumMedia telling us what we should do! Well I say #DontTakeTheKnee #BLM is a terrorist organisation. Do your homework! #AllLivesMatter #WhiteLivesMatter #ISTANDwithDominic Raab @SkyNews

Then someone gets stabbed and they want the police back after running them out of town. Ha you couldn’t make it up #BlackLivesMatter #blm #thugs

#brixton https://t.co/1uVXQ63UT2

Just saw a video of #BlackLivesMatter protest in #Reading - looks like 3 white people have been stabbed and in a bad way! Now if this turns out to be a

race attack, I’m going to blame the #Media. They’ve been stoking up tensions between blacks and whites for weeks now!

5.2 Results

The data was fitted using the Python library tick. tick requires that the decay parameters $\beta_{ij}$ are given as constants beforehand, which then allows highly efficient fitting of the remaining parameters $\mu_i$ and $\alpha_{ij}$ using accelerated gradient descent (Bacry and Muzy, 2016). The problem of fitting the decay parameter β in the exponential kernel is well known (Santos et al., 2021), and is due to the fact that while the baseline parameter µ and excitation parameter α can be efficiently computed via convex optimisation, this is not always true for β. With this in mind, we proposed three different models where the decay parameters $\beta_{ij}$ were handled differently:

• $M_1$: $\beta_{ij} = 1$, $\forall (i,j) \in \{1,2\} \times \{1,2\} =: \{1,2\}^2$

• $M_2$: $\beta_{ij} = \beta \in (0,\infty)$, $\forall (i,j) \in \{1,2\}^2$

• $M_3$: $\beta_{ij} \in (0,\infty)$, $\forall (i,j) \in \{1,2\}^2$

To compare the different models, we looked at (i) the Akaike information criterion $\mathrm{AIC} = 2k - 2\ln(\hat{L})$, where k is the number of estimated parameters and $\hat{L}$ is the maximum likelihood of the model, (ii) the relative likelihood $\exp((\mathrm{AIC}_p - \mathrm{AIC}_q)/2)$, where the AIC values for models p and q satisfy $\mathrm{AIC}_p \le \mathrm{AIC}_q$, and (iii) the likelihood-ratio test statistic $\lambda = -2\ln(\hat{L}_0/\hat{L}_1)$, where $\hat{L}_0$ and $\hat{L}_1$ are the maximum likelihoods of the nested model and of the more general model, respectively.
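The three comparison criteria are simple enough to check by hand. The sketch below recomputes them from the log-likelihood values and parameter counts reported in this section for the models with all decays fixed to 1 and with a single shared decay; the function names are hypothetical, and the results agree with the reported figures only up to the rounding of the published log-likelihoods.

```python
import math

def aic(k, loglik):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * loglik

def relative_likelihood(aic_p, aic_q):
    """exp((AIC_p - AIC_q) / 2), intended for AIC_p <= AIC_q."""
    return math.exp((aic_p - aic_q) / 2)

def lr_statistic(loglik_nested, loglik_general):
    """Likelihood-ratio statistic: -2 ln(L_nested / L_general)."""
    return -2 * (loglik_nested - loglik_general)

# Log-likelihoods and parameter counts as reported in the text.
ll_fixed, k_fixed = 372.981, 6      # all decays fixed to 1
ll_shared, k_shared = 384.771, 7    # one shared decay, estimated

aic_fixed = aic(k_fixed, ll_fixed)
aic_shared = aic(k_shared, ll_shared)
rel_lik = relative_likelihood(aic_shared, aic_fixed)
lr_stat = lr_statistic(ll_fixed, ll_shared)
print(aic_fixed, aic_shared, rel_lik, lr_stat)
```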

5.2.1 Comparison Between $M_1$ and $M_2$

Setting $\beta_{ij} = 1$ for all i, j in model $M_1$ gave us the log-likelihood value of 372.981 and AIC = −733.963 (where k = 6 for the two estimated baseline parameters $\mu_i$ and the four excitation parameters $\alpha_{ij}$). For model $M_2$, we did a sequential grid search over β, using the convex optimiser in tick (https://x-datainitiative.github.io/tick/) to quickly obtain the most likely $\mu_i$ and $\alpha_{ij}$ for each fixed $\beta_{ij} = \beta$, and found the most likely parameter $\hat{\beta} = 6.17$, with the maximum log-likelihood value of 384.771 and AIC = −755.542 (where k = 7 since we now also estimate β).
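A grid search of this kind can be sketched generically. The code below assumes that "sequential" means repeatedly refining the grid around the current best value, which is one plausible reading and not confirmed by the text. The callable profile_loglik is a hypothetical stand-in: in the paper's setting it would fit the baseline and excitation parameters with tick for a fixed decay (for example via tick's HawkesExpKern learner with gofit set to likelihood) and return the resulting log-likelihood. Here a toy concave function is used so that the block runs on its own.

```python
def grid_search(candidates, profile_loglik):
    """Return the candidate decay with the highest profile log-likelihood."""
    best_beta, best_ll = None, float("-inf")
    for beta in candidates:
        ll = profile_loglik(beta)
        if ll > best_ll:
            best_beta, best_ll = beta, ll
    return best_beta, best_ll

def refine(lo, hi, profile_loglik, points=11, rounds=3):
    """Sequentially shrink an evenly spaced grid around the current best value."""
    best, best_ll = None, float("-inf")
    for _ in range(rounds):
        step = (hi - lo) / (points - 1)
        grid = [lo + i * step for i in range(points)]
        best, best_ll = grid_search(grid, profile_loglik)
        lo, hi = max(best - step, 1e-9), best + step  # decays must stay positive
    return best, best_ll

def toy_profile(beta):
    """Toy stand-in for the profile log-likelihood, peaked at 6.17 by design."""
    return -(beta - 6.17) ** 2

best_beta, best_ll = refine(0.1, 20.0, toy_profile)
print(best_beta, best_ll)
```

Because the inner fit is convex for a fixed decay, each grid point is cheap to evaluate, which is what makes this outer search practical.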

The relative likelihood of the models was $2.0624 \times 10^{-5}$, i.e., model $M_1$ was $2.0624 \times 10^{-5}$ times as probable as model $M_2$ to minimize the information loss. Since $M_1$ is nested in $M_2$, i.e., the parameter space of $M_1$ is a proper subset of that of $M_2$, we do a likelihood-ratio test and reject $M_1$ in favour of $M_2$ ($\lambda = 23.5781$, p-value $< 10^{-7}$).

5.2.2 Comparison Between $M_2$ and $M_3$

Models $M_1$ and $M_2$ assume that the decay parameters $\beta_{ij}$ are identically β ∈ (0,∞), i.e., the decay parameter within each dimension and between every pair of dimensions is given by the same value. The real-world interpretation of this is that tweets and mass media reports stay relevant for the same amount of time into the future, which seems like a major assumption, as mass media dissemination and social media communication are fundamentally different in nature. To account for this, we introduced model $M_3$, where each $\beta_{ij}$ can vary freely in (0,∞).

We did a sequential grid search over the 4-simplex, similar to the one-dimensional case of $M_2$. We found the most likely values to be […] 16.170, […] = 3.702, and […] = 8.638, at the maximum log-likelihood value of 384.772, with k = 10 and AIC = −749.544. Note that despite having three additional parameters, the maximum log-likelihood of $M_3$ is close to that of $M_2$, with the relative likelihood of the models, likelihood-ratio test statistic, and p-value being 0.04984, 0.002121, and 0.9971, respectively. We therefore do not reject $M_2$ in favour of $M_3$ and choose $M_2$ for further analysis.

5.2.3 Fitting the Data Using $M_2$

To find whether Granger causality between the two dimensions exists, we were interested in whether the parameters $\alpha_{12}$ and $\alpha_{21}$ are equal to 0 or not. Fitting the data using model $M_2$ with estimated decay parameter $\hat{\beta} = 6.1700$ gave us the following estimated parameters: $\hat{\mu}_1 = 1.000$, $\hat{\mu}_2 = 0.998$, $\hat{\alpha}_{11} = 0.986$, $\hat{\alpha}_{12} = 0.0327$, $\hat{\alpha}_{21} = 0.0216$, $\hat{\alpha}_{22} = 0.921$. Note that the point estimates satisfying $\hat{\alpha}_{12}, \hat{\alpha}_{21} > 0$ imply that there exists Granger causality between reported protests and tweets regarding the BLM movement, provided we account for the errors in their estimation, i.e., their confidence intervals. We address this next using non-parametric bootstraps.

5.2.4 Hypothesis Testing

The following null hypotheses were proposed:

• $H_{0,12}$: $\alpha_{12} = 0$, i.e., reports of protests in mass media do not Granger-cause communication events in Twitter related to the BLM movement.

• $H_{0,21}$: $\alpha_{21} = 0$, i.e., communication events in Twitter related to the BLM movement do not Granger-cause reports of protests in mass media.

• $H_0$: $\alpha_{12} = \alpha_{21} = 0$.

To get the confidence intervals for $\alpha_{12}$ and $\alpha_{21}$ we did a non-parametric bootstrap by sampling the observed data with replacement, and then estimating the parameters on the bootstrapped data under model $M_2$. This was repeated 1000 times.

For $\alpha_{12}$, i.e., the influence of mass media on Twitter, the 99% percentile bootstrap confidence interval is (0.000, 0.09405), and therefore we cannot reject the null hypothesis $H_{0,12}$ that $\alpha_{12} = 0$ by the Wald test. Thus, we find no evidence that the reports of street protests in mass media Granger-cause the pro-BLM interactions in Twitter.

On the other hand, the 99% percentile bootstrap confidence interval for the parameter $\alpha_{21}$ that models Twitter’s influence on mass media is (0.01479, 0.02949), and therefore we reject the null hypothesis $H_{0,21}$ that $\alpha_{21} = 0$ by the Wald test. Thus, the pro-BLM interactions in Twitter Granger-cause the reports of street protests in mass media. We therefore also reject the common null hypothesis that there is no Granger causality whatsoever between social and mass media events around the BLM movement, i.e., $H_0$: $\alpha_{12} = \alpha_{21} = 0$.
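The decision rule used above can be sketched as a percentile bootstrap interval followed by a check of whether zero lies inside it. The helper names are hypothetical, the estimator in the toy run is a plain mean rather than a Hawkes fit, and the index rule for the percentiles is one simple convention among several; only the two intervals quoted from the text are the paper's own numbers.

```python
import random

def bootstrap(data, estimator, n_boot=1000, seed=0):
    """Re-estimate on n_boot resamples drawn with replacement from data."""
    rng = random.Random(seed)
    n = len(data)
    return [estimator([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_boot)]

def percentile_interval(estimates, level=0.99):
    """Two-sided percentile interval from a list of bootstrap estimates."""
    xs = sorted(estimates)
    tail = (1.0 - level) / 2.0
    lo = xs[int(round(tail * (len(xs) - 1)))]
    hi = xs[int(round((1.0 - tail) * (len(xs) - 1)))]
    return lo, hi

def rejects_zero(interval):
    """Reject the null 'parameter = 0' when 0 lies outside the interval."""
    lo, hi = interval
    return not (lo <= 0.0 <= hi)

# Toy run: bootstrap the mean of a small sample.
sample = [0.8, 1.1, 0.9, 1.3, 1.0, 0.7, 1.2]
toy_interval = percentile_interval(bootstrap(sample, lambda xs: sum(xs) / len(xs)))

# Intervals reported in the text for the two cross-excitation parameters.
media_to_twitter = (0.000, 0.09405)
twitter_to_media = (0.01479, 0.02949)
print(toy_interval, rejects_zero(media_to_twitter), rejects_zero(twitter_to_media))
```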

To estimate the type I error, i.e., the probability of rejecting the null hypothesis $H_0$ when it is true, we simulated data from the null hypothesis $H_0$, i.e., from the most likely parameters in $M_2$, while restricting $\alpha_{12} = \alpha_{21} = 0$. For each such simulated data set, we then performed the Wald test using non-parametric bootstraps by sampling the data with replacement 1,000 times. Only one out of 100 such simulations from $H_0$ was rejected, giving 0.01 as the Monte Carlo estimate of the type I error.
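Simulating from the null model requires a Hawkes simulator. The text does not say which one was used, so the sketch below swaps in Ogata-style thinning for a bivariate process with exponential kernels, under the assumed kernel convention alpha * beta * exp(-beta * s). The parameter values are toy values with the cross-excitation terms set to zero, not the fitted estimates, and the final lines only show how a Monte Carlo rejection rate such as 1 out of 100 is formed.

```python
import math
import random

def simulate_hawkes(mu, alpha, beta, horizon, seed=0):
    """Simulate a multivariate exponential-kernel Hawkes process by thinning."""
    rng = random.Random(seed)
    dims = len(mu)
    events = [[] for _ in range(dims)]

    def intensities(t):
        lams = []
        for i in range(dims):
            lam = mu[i]
            for j in range(dims):
                for s in events[j]:
                    lam += alpha[i][j] * beta[i][j] * math.exp(-beta[i][j] * (t - s))
            lams.append(lam)
        return lams

    t = 0.0
    while True:
        # Kernels only decay between events, so the current total intensity
        # is a valid upper bound until the next accepted event.
        bound = sum(intensities(t))
        t += rng.expovariate(bound)
        if t > horizon:
            break
        lams = intensities(t)
        u = rng.uniform(0.0, bound)
        if u <= sum(lams):  # accept the candidate and pick its dimension
            acc = 0.0
            for i, lam in enumerate(lams):
                acc += lam
                if u <= acc:
                    events[i].append(t)
                    break
    return events

# Toy null model: no cross-excitation between the two dimensions.
null_alpha = [[0.5, 0.0], [0.0, 0.5]]
null_beta = [[6.17, 6.17], [6.17, 6.17]]
simulated = simulate_hawkes([1.0, 1.0], null_alpha, null_beta, horizon=50.0)

# Monte Carlo estimate of the type I error from rejection flags.
rejection_flags = [True] + [False] * 99  # one rejection out of 100, as in the text
type_one_error = sum(rejection_flags) / len(rejection_flags)
print(len(simulated[0]), len(simulated[1]), type_one_error)
```

In a full study each simulated data set would be refitted and tested with the bootstrap procedure, and the flags would record whether the null was rejected.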

6 CONCLUSION

We jointly model and test hypotheses about causal relationships between interactions in social media and the reports in mass media during the Black Lives Matter (BLM) protests following the death of George Floyd, by implementing open-source pipelines through MEP and SPARK-GDELT to process the data (i.e., extract, load, transform, and explore it) from scratch and at scale on cloud infrastructure, and by employing self-exciting Hawkes processes and their Granger causal inference machinery.

We reject the null hypothesis that there is no causal relationship, and show that communication events in Twitter, surrounding tweets that supported the BLM movement, Granger-caused the reports of street protests in mass media from the GDELT project. However, we cannot show that the reporting of street protests in mass media Granger-caused the corresponding communication events in Twitter. We identified such pro-BLM tweets through a network analysis of the Twitter data to identify communities of users who have a shared ideology among an ideologically diverse set of communities.

We thus establish a veriﬁable causal relationship

between social media interactions in Twitter that are

supportive of the global BLM social movement on

one hand, and global mass media reports of street

protests in solidarity with the movement on the other.

This suggests that activists have harnessed social me-

dia to raise awareness and mobilise street protests.

ACKNOWLEDGEMENTS

We thank three anonymous reviewers for their insightful comments. AL was supported by a summer internship at Combient Competence Centre for Data Engineering Sciences. SL and RS were partially supported by Swedish Research Council project no. 2019-03351 and RS was partially supported by the Wallenberg AI, Autonomous Systems and Software Program funded by Knut and Alice Wallenberg Foundation. Computing infrastructure was supported by Databricks University Alliance and AWS, and this publication’s cost was sponsored by VakeWorks AB.

REFERENCES

Armbrust, M., Das, T., Sun, L., Yavuz, B., Zhu, S., Murthy, M., Torres, J., van Hovell, H., Ionescu, A., Łuszczak, A., Świtakowski, M., Szafrański, M., Li, X., Ueshin, T., Mokhtar, M., Boncz, P., Ghodsi, A., Paranjpye, S., Senster, P., Xin, R., and Zaharia, M. (2020). Delta lake: High-performance acid table storage over cloud object stores. Proc. VLDB Endow., 13(12):3411–3424.

Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., Meng, X., Kaftan, T., Franklin, M. J., Ghodsi, A., and Zaharia, M. (2015). Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, page 1383–1394, New York, NY, USA. Association for Computing Machinery.

Bacry, E., Mastromatteo, I., and Muzy, J.-F. (2015). Hawkes processes in finance. arXiv.org.

Bacry, E. and Muzy, J.-F. (2016). First- and Second-Order Statistics Characterization of Hawkes Processes and Non-Parametric Estimation. IEEE Transactions on Information Theory, vol. 62.

Boyd, D., Golder, S., and Lotan, G. (2010). Tweet, tweet, retweet: Conversational aspects of retweeting on twitter. In Proceedings of the 2010 43rd Hawaii International Conference on System Sciences, HICSS ’10, pages 1–10, Washington, DC, USA. IEEE Computer Society.

Buchanan, L., Bui, Q., and Patel, J. K. (3 July 2020). Black Lives Matter May Be the Largest Movement in U.S. History. New York Times.

Daley, D. and Vere-Jones, D. (2003). An Introduction to the Theory of Point Processes: Volume I: Elementary Theory and Methods, Second Edition. Springer-Verlag.

Deliso, M. (21 April 2021). ABC News - Timeline: The impact of George Floyd’s death in Minneapolis and beyond. ABC News.

Eichler, M., Dahlhaus, R., and Dueck, J. (2012). Graphical Modeling for Multivariate Hawkes Processes with Nonparametric Link Functions. Probability Theory and Related Fields.

Falcon, A. (2019). Aristotle on Causality. The Stanford Encyclopedia of Philosophy (Spring 2019 Edition).

Giorgi, S., Guntuku, S. C., Rahman, M., Himelein-Wachowiak, M., Kwarteng, A., and Curtis, B. (2020). Twitter Corpus of the #BlackLivesMatter Movement And Counter Protests: 2013 to 2020.

Goldstein, J. S. (1992). A conflict-cooperation scale for weis events data. Journal of Conflict Resolution, 36(2):369–385.

Gonzalez, J. E., Xin, R. S., Dave, A., Crankshaw, D., Franklin, M. J., and Stoica, I. (2014). Graphx: Graph processing in a distributed dataflow framework. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, page 599–613, USA. USENIX Association.

Granger, C. (1980). Testing for Causality - A Personal Viewpoint. Journal of Economic Dynamics and Control 2.

Hawkes, A. G. (1971). Spectra of Some Self-Exciting and Mutually Exciting Point Processes. Biometrika, 58:83–90.

Jackson, S. J., Bailey, M., and Foucault Welles, B. (2020). #HashtagActivism: Networks of Race and Gender Justice. The MIT Press.

Jansen, B. J., Zhang, M., Sobel, K., and Chowdury, A. (2009). Twitter power: Tweets as electronic word of mouth. J Amer Soc Info Science Tech, 60(11):2169–2188.

Leetaru, K. and Schrodt, P. A. (2013). Gdelt: Global data on events, location, and tone. ISA Annual Convention.

Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (2010). Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 135–146.

Metaxas, P., Mustafaraj, E., Wong, K., Zeng, L., O’Keefe, M., and Finn, S. (2015). What do retweets indicate? results from user survey and meta-review of research. In International AAAI Conference on Web and Social Media. ACM.

Murphy, D. (13 April 2014). 44 Percent of Twitter Accounts Have Never Tweeted. PCMag UK.

Ngo, A. (2021). Unmasked: Inside Antifa’s Radical Plan to Destroy Democracy. Center Street.

Pierris, G. D. and Friedman, M. (2018). Kant and Hume on Causality. The Stanford Encyclopedia of Philosophy (Winter 2018 Edition).

Raghavan, U. N., Albert, R., and Kumara, S. (2007). Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76.

Sainudiin, R., Yogeeswaran, K., Nash, K., and Sahioun, R. (2019). Characterizing the twitter networks of prominent politicians and splc-defined hate groups in the 2016 us presidential election. Social Network Analysis and Mining, 9(34).

Santos, T., Lemmerich, F., and Helic, D. (2021). Surfacing Estimation Uncertainty in the Decay Parameters of Hawkes Processes with Exponential Kernels.

Schrodt, P. A. (2012). CAMEO - Conflict and Mediation Event Observations Event and Actor Codebook.

DATA 2023 - 12th International Conference on Data Science, Technology and Applications