name, regardless of the tweets' sentiment. As a result 
of these errors, other researchers have even 
questioned the validity of social media content for 
forecasting events and movements (Metaxas, 2010; 
Mustafaraj, 2011; Metaxas, 2011).  
In this paper, we develop an accurate method for 
mining election-relevant data to produce a statistically 
sound prediction of the outcome. We have gathered 
a reliable large-scale dataset from Twitter and Google 
Trends search interest, which is highly correlated 
with the real trends of the 2016 US election. We have 
applied Gaussian process regression to produce weekly 
predictions. Unlike other papers, this model predicts 
the candidates' vote shares instead of an absolute 
winner. This paper proceeds as follows. Section 2 
describes our method for predicting a large-scale 
election. Section 3 applies the method to the data 
from the 2016 US election, and Section 4 gives 
concluding remarks. 
2 THE METHOD 
Four main steps are followed in this method. First, a 
uniformly sampled large dataset of tweets is gathered. 
This data is then processed and augmented by adding 
sentiment information to each tweet, collecting 
relevant keywords data from Google Trends, and 
arranging various online poll results. The authenticity 
of this data is then checked with a correlation test. In 
the end, a feature matrix is created and the Gaussian 
process regression model is trained.
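
The correlation check in the third step can be illustrated with a short sketch. It assumes a Pearson correlation between a weekly online signal and the weekly poll shares; the text above does not name the specific test, and the numbers below are hypothetical, not data from this study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical weekly values, for illustration only.
weekly_positive_tweets = [120, 135, 150, 160, 158]
weekly_poll_share = [44.0, 44.5, 45.5, 46.0, 45.8]
r = pearson_r(weekly_positive_tweets, weekly_poll_share)
```

A value of r close to 1 would suggest that the online signal tracks the polls closely enough to be used as a feature.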
 
2.1 Data Collection 
Social and political events often have a short time 
span and great complexity. As noted by DiGarzia 
(2013), large datasets of online social content must be 
used to achieve accurate results. The online data 
sources used in this paper are Twitter and Google 
Trends, as well as online election polls conducted by 
polling firms and news outlets, such as HuffPost 
Pollster. These surveys are scattered over time, so 
they are arranged chronologically and a final poll 
result is calculated for each week as the weighted sum 
of the surveys held in that week. The weekly poll 
results are later used as labels when training the 
statistical model. 
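
A minimal sketch of the weekly poll aggregation follows. The weighting scheme is not specified above, so the sketch assumes each poll carries a positive weight (for example its sample size) and that the weights are normalized within each week, which makes the result a weighted average. The function name and the sample polls are hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_poll_labels(polls, start):
    """Combine polls into one weighted-average vote share per week.

    polls: iterable of (poll_date, vote_share_percent, weight) tuples.
    start: date on which week 1 begins; earlier polls are ignored.
    Returns {week_number: weighted average share}, weeks numbered from 1.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for poll_date, share, weight in polls:
        if weight <= 0:
            raise ValueError("poll weights must be positive")
        if poll_date < start:
            continue
        week = (poll_date - start).days // 7 + 1
        weighted_sum[week] += share * weight
        weight_total[week] += weight
    return {w: weighted_sum[w] / weight_total[w] for w in sorted(weighted_sum)}


# Hypothetical polls: (date, share in percent, assumed sample-size weight).
polls = [
    (date(2016, 6, 1), 44.0, 1000),
    (date(2016, 6, 3), 46.0, 3000),
    (date(2016, 6, 9), 45.0, 500),
]
labels = weekly_poll_labels(polls, start=date(2016, 6, 1))
```

Here the two polls of the first week combine to 45.5, and the single poll of the second week is kept as is.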
The data has been gathered from public tweets 
containing the candidates’ names at a high 
sampling rate of 1000 tweets per day per candidate 
during the active election months (about 6 months for 
the US election). The method was also applied to a 
dataset of 100 tweets per day per candidate, which 
gave unsatisfactory results. Around 370,000 tweets 
were gathered; about 70,000 of them are repeated 
tweets that contain both candidates’ names and were 
removed, resulting in a final dataset of 300,000 
tweets. Contrary to Sang (2012), the number of tweets 
containing a candidate’s name does not necessarily 
reflect the users’ voting intentions. Thus, the tweets’ 
sentiment needs to be taken into account. Table 1 
illustrates this with an example in which the first user 
is unlikely to vote for Clinton. 
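
The removal of tweets that mention both candidates can be sketched as follows. This is one possible reading of the filtering step, not the exact procedure used here: a tweet is kept only if it names exactly one candidate, and exact repeats are dropped. The alias lists are hypothetical, and plain substring matching is a simplification.

```python
def mentioned_candidates(text, aliases):
    """Return the set of candidates whose aliases occur in the tweet text."""
    lowered = text.lower()
    return {
        candidate
        for candidate, names in aliases.items()
        if any(name in lowered for name in names)
    }


def keep_single_candidate_tweets(tweets, aliases):
    """Drop exact repeats and tweets naming zero or several candidates."""
    seen = set()
    kept = []
    for text in tweets:
        key = text.strip().lower()
        if key in seen:
            continue  # exact repeat of an earlier tweet
        seen.add(key)
        candidates = mentioned_candidates(text, aliases)
        if len(candidates) == 1:
            kept.append((next(iter(candidates)), text))
    return kept


# Hypothetical alias lists and sample tweets, for illustration only.
ALIASES = {"clinton": ("clinton", "hillary"), "trump": ("trump",)}
sample = [
    "I thought Hillary did well",
    "Trump rally tonight",
    "Clinton vs Trump debate",
    "Trump rally tonight",
]
kept = keep_single_candidate_tweets(sample, ALIASES)
```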
The sentiment of a sentence can be analyzed using 
the grammatical structure and the choice of words. 
The RNTN algorithm (Socher, 2013) can determine 
the sentiment of a phrase as positive or negative with 
an accuracy rate of 80.7%. Due to processing 
limitations, a simpler algorithm is used in our 
experiment (Bose, 2017; Rinker, 2017). 
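
To make the idea of a simpler, word-based approach concrete, the sketch below scores a tweet with a tiny polarity lexicon and a one-word negation rule. It is an illustration only and is not claimed to be the algorithm of the cited works; the word lists are hypothetical and far smaller than any real lexicon.

```python
import re

# Hypothetical miniature lexicons, for illustration only.
POSITIVE = {"well", "calm", "reasonable", "good", "great"}
NEGATIVE = {"crooked", "bad", "corrupt", "liar"}
NEGATORS = {"not", "no", "never"}


def polarity(text):
    """Sum word polarities, flipping a word that directly follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = (token in POSITIVE) - (token in NEGATIVE)
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score


def sentiment_label(text):
    """Map the polarity score to Positive, Negative, or Neutral."""
    score = polarity(text)
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


first = sentiment_label(
    "Crooked Hillary: Not In The Pocket Of Anyone After Receiving "
    "$6 Million From Soros #WakeUpAmerica"
)
second = sentiment_label(
    "I thought Hillary did well on #60Minutes. So calm and reasonable. "
    "Such a change from the Republican'ts."
)
```

With these word lists, the two example tweets of Table 1 are labelled Negative and Positive respectively. A lexicon this small would misclassify most real tweets, so the sketch only shows the mechanism.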
After eliminating common terms, frequent 
hashtags and words are extracted from the Twitter 
data and manually grouped into meaningful word 
sets, 26 sets in our case. Each set contains 
election-relevant terms that are used frequently in 
tweets. The word representing each set is called a 
‘keyword’. This grouping is done using general 
knowledge of election events. Table 2 explains this 
process with an example. 
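
The extraction and grouping step can be sketched as below. The stop-word list and the two keyword sets are hypothetical placeholders; in the procedure described above the sets are assembled manually, and only the counting is automated.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "from", "that", "this", "with"}


def frequent_terms(tweets, top_n=10):
    """Count hashtags and words, skipping stop words and very short tokens."""
    counts = Counter()
    for text in tweets:
        for token in re.findall(r"#?\w+", text.lower()):
            if token not in STOPWORDS and len(token.lstrip("#")) > 2:
                counts[token] += 1
    return counts.most_common(top_n)


def keyword_mentions(tweets, keyword_sets):
    """Count, per keyword, the tweets containing any term of its word set."""
    totals = {keyword: 0 for keyword in keyword_sets}
    for text in tweets:
        tokens = set(re.findall(r"#?\w+", text.lower()))
        for keyword, terms in keyword_sets.items():
            if tokens & terms:
                totals[keyword] += 1
    return totals


# Hypothetical, manually assembled word sets keyed by their keyword.
KEYWORD_SETS = {
    "email": {"email", "emails", "#emails", "server"},
    "debate": {"debate", "#debate", "#debatenight"},
}
sample = [
    "The #emails story and the server",
    "More emails from the server",
    "Watching the #debate tonight",
]
top = frequent_terms(sample, top_n=1)
mentions = keyword_mentions(sample, KEYWORD_SETS)
```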
The keywords are later used as search queries for 
collecting the Google Trends (2017) data. For each 
query, Google Trends returns a vector of the ‘search 
interest factor’, which represents the popularity of the 
search query over time. 
Assuming $k$ to be a keyword, we define: 

$$G_k(w) \overset{\mathrm{def}}{=} \text{Google Trends search interest for keyword } k \text{ in week } w, \quad w \in [1, W], \tag{1}$$

where $W$ is the total number of weeks in the dataset. 
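
As an illustration of how the weekly search-interest vectors might be arranged for later use, the sketch below stacks one vector per keyword into a table with one row per week and one column per keyword. The layout and the function name are assumptions, since the exact feature matrix is not specified in this section, and the values are hypothetical.

```python
def trends_matrix(series_by_keyword):
    """Stack per-keyword weekly vectors into rows of one week each.

    series_by_keyword: {keyword: [interest in week 1, ..., interest in week W]}.
    Returns (keywords, rows), where rows[w][j] is the search interest of
    keywords[j] in week w + 1.
    """
    keywords = sorted(series_by_keyword)
    if not keywords:
        raise ValueError("at least one keyword series is required")
    lengths = {len(series_by_keyword[k]) for k in keywords}
    if len(lengths) != 1:
        raise ValueError("all keyword series must cover the same number of weeks")
    n_weeks = lengths.pop()
    rows = [[series_by_keyword[k][w] for k in keywords] for w in range(n_weeks)]
    return keywords, rows


# Hypothetical search-interest values on the 0-100 scale.
series = {"debate": [10, 80, 40], "email": [55, 60, 100]}
keywords, rows = trends_matrix(series)
```

Checking that every vector spans the same number of weeks guards against silently misaligned features.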
Table 1: An example of why not all tweets containing a 
candidate’s name are posted by that candidate’s supporters. 

Sentiment   Tweet 
Negative    Crooked Hillary: Not In The Pocket Of Anyone After Receiving $6 Million From Soros #WakeUpAmerica 
Positive    I thought Hillary did well on #60Minutes. So calm and reasonable. Such a change from the Republican'ts.