Exploring the Connection Between Emoji Usage, User Identity and Context Using Statistical and Machine Learning Approaches

Shaojie Wu

School of Mathematical Sciences, Fudan University, Shanghai, 200433, China

Keywords: Emojis, Statistical Analysis, Machine Learning.

Abstract: Owing to the growing popularity of emojis on social media platforms, comprehensive research on the relationship between emoji usage and factors such as user identity, platform and context is of great importance. Based on a dataset of typical emoji usage records, this research applies statistical analysis and machine learning techniques, in particular the chi-squared test, K-means and t-Distributed Stochastic Neighbour Embedding (t-SNE). In the statistical analysis phase, the research partitions the dataset by different factors and compares the emoji distributions of the resulting subsets using the p-values of chi-squared tests to determine how strongly each factor influences emoji usage. In the machine learning phase, the research uses K-means to cluster the users and the emoji usage records, in order to explore hidden user groupings and emoji usage types. The research yields multiple results. In the analysis of individual factors, context and user gender are the more important factors, while user age and platform are less important. However, the grouping that combines user gender and age has the greatest impact on users’ emoji usage, showing different emoji usage distributions under the same context. The research finds that clustering the users into 4 groups best distinguishes the users’ tendencies in using emojis. Finally, the research categorizes emoji usage behaviours into 3 classes, with 1 major usage and 2 exceptional or sarcastic usages.

1 INTRODUCTION

Nowadays, emojis have become an indispensable part of online communication, both delivering precise messages that plain text fails to express and conveying strong emotions that plain text may lack the strength to carry (Boutet, LeBlanc, Chamberland and Collin, 2021). The precursor to emojis originated in Japan, where the first set of emojis, each only 12x12 pixels, was created in the late 1990s. In 2007, Unicode, the international standard for text encoding, included emojis in its character set. This gave emojis the opportunity to appear on any online platform and operating system. By the early 2010s, emojis had become a mainstream tool for online communication, used on almost all online social platforms (Stark and Crawford, 2015). Beyond their unique semantic and emotional features, emojis are also closely related to marketing, law, health care and many other areas.

https://orcid.org/0009-0006-3403-8383

Research on emojis has become a hot topic in academia, and an increasing number of scholars from fields such as data science and communication are studying them. In data analysis and computer science, research mainly focuses on the following aspects (Bai, Dan, Mu and Yang, 2019): 1) Analyzing the emotional and semantic meaning of emojis using big data. 2) Switching between emojis and other expression modalities. 3) Using emojis for emotional analysis of online data. 4) Using emojis for optimizing computer systems. These studies mainly focus on the expression and emotional meaning of emojis, using deep learning and system optimization methods to explore the usage of emojis.

In the field of communication modality, the visual features and Unicode basis of emojis make them an independent expressive language that is different from text and pictures. Much research focuses on the connection between emojis and other modalities such as text, pictures and video. In-depth studies of the interconnection between emojis and text often focus on emoji prediction models, which predict the emojis used in texts such as tweets and comments. For example, a project using the BERT model succeeded in predicting most of the emojis in the related text (Ma, Liu, Wang and Vosoughi, 2020).

Wu, S.

Exploring the Connection Between Emoji Usage, User Identity and Context Using Statistical and Machine Learning Approaches.

DOI: 10.5220/0013680200004670

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 2nd International Conference on Data Science and Engineering (ICDSE 2025), pages 148-153

ISBN: 978-989-758-765-8

Beyond academia, social network applications have been using deep learning models to recommend emojis to users based on the data they produce, including their reading history and published text. For example, the CAPER model recommends emojis based on context using a recommender system (Zhao, Liu, Chao and Qian, 2021).

Despite these previous studies, research gaps still exist, as most work focuses on the direct relationship between emojis and text and on emoji recommendation for users. However, the diversity of users may affect emoji preference. Furthermore, the inter-platform comparison of emoji usage is often neglected as well. Previous research has found that users of certain groups may use emojis more frequently (Benkhedda, Xiao and Magdy, 2024), indicating that users’ identities affect their choice of emojis. Therefore, a comprehensive analysis of the association between users’ information, such as gender, age and the platform on which they post, and their emoji choices is a research gap that needs to be filled.

In this regard, this research aims to fill the gap by applying data analysis, visualization and machine learning methods such as clustering and dimension reduction to a dataset containing records of emojis posted and the users’ basic information. Specifically, the dataset is obtained from the platform Kaggle (Kaggle, 2024). This research applies K-means clustering and t-Distributed Stochastic Neighbour Embedding (t-SNE) to study the user groups’ emoji preferences and uses one-hot encoding to vectorize the user-emoji information for further analysis.

2 METHOD

2.1 Dataset Preparation

This research uses the open emoji trends dataset for its analysis (Kaggle, 2024). The dataset features over 4,000 typical records of emoji usage on social media platforms including Twitter, Snapchat and Facebook. Each record includes the gender and age of the user, the platform, and the context in which the emoji is used. The contexts include happiness, sadness, support and others. The feature names are: User Gender, User Age, Context and Platform. The dataset contains 30 types of emojis, 10 types of contexts, and 6 types of platforms. The categorical information is stored as strings, and the emojis are stored as Unicode characters. Figure 1 shows examples of the emojis in the dataset.

Figure 1: The visualization of various Emojis (Kaggle,

2024).

2.2 Statistical Analysis

A method used in the statistical analysis of the research is the chi-squared test and the corresponding p-value analysis. To determine whether the association between two or more variables is statistically significant, a test of significance called the chi-squared test (Mindrila, Balentyne and Tables, 2013) is often conducted. The test compares the observed values with the expected values through a quantity called the chi-squared statistic. The p-value calculated from this statistic is then compared with the alpha level to determine the reliability of the association. In this research, the chi-squared test is used to determine the difference and association between the emoji usage distributions of different groups.
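
As an illustration of the test described above, the following minimal Python sketch computes the chi-squared statistic, the degrees of freedom and the p-value for a table of counts. The counts and the function names (`chi_square_test`, `chi2_sf`) are hypothetical and are not taken from the paper's code; the p-value uses a closed-form expression for the chi-squared tail probability rather than a statistics library. In the standard formulation of the test, a p-value below the chosen alpha level leads to rejecting the hypothesis that the rows share the same distribution.

```python
import math


def chi2_sf(x, dof):
    """Upper-tail probability P(X >= x) of a chi-squared variable."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    if dof % 2 == 0:
        # Closed form for even degrees of freedom.
        terms = (y ** i / math.factorial(i) for i in range(dof // 2))
        return math.exp(-y) * sum(terms)
    # Odd degrees of freedom: erfc term plus a finite series.
    series = sum(y ** (i + 0.5) / math.gamma(i + 1.5) for i in range(dof // 2))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * series


def chi_square_test(table):
    """Chi-squared test on a contingency table (list of rows of counts).

    Assumes every row total and column total is non-zero.
    Returns (statistic, degrees_of_freedom, p_value).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, chi2_sf(statistic, dof)


# Hypothetical counts: rows are two user groups, columns are three emojis.
counts = [[30, 20, 10],
          [20, 20, 20]]
stat, dof, p = chi_square_test(counts)
print(f"statistic={stat:.3f}, dof={dof}, p-value={p:.4f}")
```

In practice a library routine such as `scipy.stats.chi2_contingency` would normally be used; the hand-written version is shown only to make the comparison of observed and expected counts explicit.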

The research uses a comprehensive analytical method to study the pattern of emoji usage. First, the research studies the influence of each individual feature on the usage of emojis. The relationship between user gender and emoji usage is studied by counting the usage of each emoji by male and female users separately and drawing bar plots to visualize the usage counts and the difference by gender. The relationship between user age and emoji usage is studied by calculating the average user age for each emoji. The relationship between platform and emoji usage is studied by two means: analyzing the top emojis and counting the overall pattern of emoji usage.

The research analyses and compares the top 5 emojis used on each platform and counts the pattern of emoji usage, applying the chi-squared test and calculating the p-value to compare the difference in distribution (Mindrila et al., 2013). Under the null hypothesis that the distribution pattern of emoji usage is the same on different social media platforms, the chi-squared test is applied to the set of per-platform distributions and the p-value is calculated accordingly. As with platforms, the research finds the top 5 emojis used in different contexts and counts the distribution patterns, and the chi-squared tests are applied accordingly.
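
The top-5 counting step described above can be sketched as follows. This is a minimal illustration, assuming each record is a dictionary with an `emoji` field and a category field such as `platform` or `context`; these field names and the sample records are hypothetical and do not reflect the dataset's actual column names.

```python
from collections import Counter, defaultdict


def top_emojis_by_category(records, category_key, k=5):
    """Return {category: [(emoji, count), ...]} with the k most used emojis."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[category_key]][record["emoji"]] += 1
    return {cat: counter.most_common(k) for cat, counter in counts.items()}


# Hypothetical records; emoji names stand in for the Unicode characters.
records = [
    {"emoji": "face_with_tears_of_joy", "platform": "Twitter"},
    {"emoji": "face_with_tears_of_joy", "platform": "Twitter"},
    {"emoji": "red_heart", "platform": "Twitter"},
    {"emoji": "red_heart", "platform": "Facebook"},
]
top = top_emojis_by_category(records, "platform", k=1)
print(top)
```

The same function can be called with `"context"` as the category key to obtain the most used emojis per context.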

In addition to single-factor analysis, the research studies the interconnection of different factors and their impact on emoji usage. The research first classifies all the users by age and gender into 6 user groups: Male, Young (age 0-30); Male, Mid (age 30-60); Male, Old (age more than 60); Female, Young; Female, Mid; Female, Old. The research then counts the emoji usage pattern of the 6 user groups and uses the chi-squared test to generate the corresponding p-value to study the difference between the distributions. The research also considers the usage of emojis among different user groups under different contexts, studying users’ emoji preferences when expressing the same emotion. To keep the study focused, the research analyzes the counts of the top 5 emojis under each context. A bar plot is drawn for each context to visualize how usage of the top 5 emojis differs by user group. Then, the chi-squared test is applied to the distributions of the 6 user groups under each context and the p-values are gathered. By analyzing the p-values, the research can find out under which contexts the different groups of users diverge in their choice of emojis.
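
The grouping step can be sketched as below. This is an illustrative sketch only: the field names (`gender`, `age`, `emoji`) are hypothetical, and because the stated age ranges share the boundary values 30 and 60, the code assumes that a boundary age falls into the younger band. The resulting table of counts has one row per user group and one column per emoji, which is the form a chi-squared test expects.

```python
from collections import Counter


def user_group(gender, age):
    """Map a user to one of six gender/age groups.

    Assumption: the boundary ages 30 and 60 fall in the younger band.
    """
    if age <= 30:
        band = "Young"
    elif age <= 60:
        band = "Mid"
    else:
        band = "Old"
    return f"{gender}, {band}"


def group_emoji_table(records, emojis):
    """Build a groups x emojis table of counts for a chi-squared test."""
    counts = Counter(
        (user_group(r["gender"], r["age"]), r["emoji"]) for r in records
    )
    groups = sorted({group for group, _ in counts})
    table = [[counts[(group, emoji)] for emoji in emojis] for group in groups]
    return groups, table


# Hypothetical records for illustration.
records = [
    {"gender": "Male", "age": 25, "emoji": "joy"},
    {"gender": "Male", "age": 30, "emoji": "heart"},
    {"gender": "Female", "age": 61, "emoji": "joy"},
]
groups, table = group_emoji_table(records, ["joy", "heart"])
print(groups, table)
```

Restricting `records` to a single context and `emojis` to that context's top 5 gives the per-context tables described above.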

With all the statistical analysis, the research

intends to figure out the direct single-factor influence

on emoji usage as well as the combined influence of

multiple factors. Specifically, the research aims to

find and analyze the pattern of emoji usage among

different users.

2.3 Dimensionality Reduction and Clustering Analysis

To further analyze the pattern of emoji usage, the research uses the K-means algorithm to perform clustering and applies t-SNE after one-hot encoding to reduce the dimension and visualize the data. K-means is an unsupervised machine learning algorithm used for clustering tasks (Ahmed, Seraj and Islam, 2020). It partitions a dataset into K distinct clusters by minimizing the sum of squared Euclidean distances between the points and the centroid of their cluster. The algorithm works by initializing K cluster centroids randomly, assigning each data point to the nearest centroid, and then updating each centroid to the mean of the points in its cluster. This process repeats until convergence, generating the final result. t-SNE is a nonlinear dimension reduction technique (Van der Maaten and Hinton, 2008). It works by converting the high-dimensional Euclidean distances between data points into probabilities that represent the extent to which pairs of points are similar. t-SNE then minimizes the divergence between these probabilities and the corresponding probabilities in the lower-dimensional space, thereby preserving the local structure of the data.
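
The K-means loop described above (random initialization, assignment, centroid update, repeat until convergence) can be sketched in plain Python as follows. This is a minimal illustration under simple assumptions, not the implementation used in the research: the function names, the seed and the toy points are hypothetical, and a library implementation would normally be preferred for real data.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialization: pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # Converged: no point changed cluster.
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # Keep the old centroid if a cluster is empty.
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, labels


# Toy data: two well-separated groups of points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Because the starting centroids are random, different seeds can lead to different local optima on less clearly separated data, which is why library implementations usually run several initializations.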

In this research, the K-means algorithm is mainly used for clustering the emoji usage records and the users. Clustering the users with K-means can establish categories of emoji users for further study. Clustering the emoji usage records with K-means can reveal different usage patterns, for example normal usage or ironic usage, and spot special and rare usages of emojis. The t-SNE technique is used to reduce the dimension of the data and to visualize and validate the result of K-means. The research first uses the K-means clustering result of the users to regroup the users and performs the chi-squared test to generate a p-value. The research then analyzes the K-means clustering result of the emoji usage records to distinguish different ways of using emojis to express feelings.
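
Since K-means and t-SNE operate on numeric vectors, the categorical features are one-hot encoded first. The sketch below shows one simple way to do this; the field names and records are hypothetical, and how the numeric age feature is handled (for example by binning or scaling) is not specified in the text, so it is left out here.

```python
def one_hot_encode(records, fields):
    """One-hot encode categorical fields; returns (vectors, columns)."""
    # One column per (field, value) pair seen in the data, in sorted order.
    columns = [
        (field, value)
        for field in fields
        for value in sorted({r[field] for r in records})
    ]
    vectors = [
        [1 if r[field] == value else 0 for field, value in columns]
        for r in records
    ]
    return vectors, columns


# Hypothetical records for illustration.
records = [
    {"gender": "Male", "context": "Happy"},
    {"gender": "Female", "context": "Sad"},
]
vectors, columns = one_hot_encode(records, ["gender", "context"])
print(columns)
print(vectors)
```

Each record becomes a vector with exactly one 1 per encoded field, so the Euclidean distance between two records grows with the number of fields on which they differ.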

3 RESULTS AND DISCUSSION

3.1 Statistical Findings

3.1.1 The Relationship Between User Gender

and Emoji Usage

As shown in Figure 2, user gender leads to notable differences in the usage of emojis. Figure 3 visualizes the difference and shows that some emojis have more pronounced gender-related usage differences. For example, the emojis “face with tears of joy” and “red heart” show the biggest differences in usage counts for male and female users respectively. The chi-squared test on the emoji usage distributions of male and female users generates a p-value of 0.941, suggesting that gender is an important factor in the emoji usage pattern.


Figure 2: Emoji usage by gender, panels (a) and (b) (Picture credit: Original).

Figure 3: Emoji usage by gender difference (Picture

credit : Original).

3.1.2 The Relationship Between Average

User Age and Emoji Usage

The research finds that the average user age of each emoji does not differ notably, as shown in Figure 4. The average user age of each emoji lies between 35 and 40, a spread of less than 5 years, suggesting that the average age of emoji users is around 35. The research concludes that individual emojis do not have particular user age preferences, yet the distributions of user age differ between emojis and are examined in the following sections.

Figure 4: Emoji average user age (Picture credit :

Original).

3.1.3 The Relationship Between Platform

and Emoji Usage

The research counts the emoji records on each platform and performs the chi-squared test on the counts. The corresponding p-value is 0.060, which is smaller than the common alpha level, suggesting the relationship between platform and emoji usage is unreliable. The platform on which an emoji is posted has little impact on which emoji is used, given that other factors are kept the same.

3.1.4 The Relationship Between Context and

Emoji Usage

Building on the conclusion that context is related to the usage of emojis, the research finds the 5 most used emojis under each context. The result is shown in Figure 5. Some emojis appear in several contexts, suggesting a certain flexibility in the usage of emojis, even across opposite emotions. For example, the emoji “Rolling on Floor Laughing” is used under both the confusion and the celebration contexts.


Figure 5: Most used emojis under different contexts

(Picture credit : Original).

3.1.5 The Relationship Between User Group

and Emoji Usage

The research classifies the users into 6 groups, based on their age (divided into the young, the middle-aged and the old) and gender. The chi-squared test on the overall emoji usage distributions of the 6 groups gives a p-value of 0.640, indicating the significance of the user group for emoji usage. The research finds the 5 most used emojis under each context and within each user group. The results are mostly the same as the most used emojis of all users. The research performs chi-squared tests on the counts of the top 5 most used emojis under each context across the user groups. The corresponding p-values are shown in Table 1. All the p-values are above 0.1, indicating that the emoji usage distributions of the user groups differ under all the contexts. Certain contexts, including love, sadness and happiness, strengthen the difference and make the emoji choices of users from different groups diverge more significantly. The p-values also show that the combined effect of gender and age exceeds the effect of each factor considered alone, suggesting that related research and businesses should take both age and gender into consideration.

Table 1: The p-value of user-group related emoji

distributions under contexts.

Context        p-value
Angry          0.366
Love           0.896
Confusion      0.294
Celebration    0.354
Funny          0.359
Support        0.650
Surprise       0.725
Happy          0.962
Cool           0.527
Sad            0.777

3.2 Machine Learning-based Analysis

3.2.1 The Clustering of Users

The research applies the K-means method to the features concerning user information. To identify the most suitable hyperparameter K, the elbow graph is drawn, as shown in Figure 6. The elbow graph indicates that the best K is 4. The research performs one-hot encoding and t-SNE on the data, and the resulting clusters are visualized in Figure 7.
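
The elbow graph is usually read by eye, but the idea can be approximated numerically. The sketch below is a simple heuristic, assuming the within-cluster sum of squared distances (inertia) has already been computed for consecutive values of K: it picks the K at which the decrease in inertia slows down the most. The function name and the inertia values are hypothetical and are not results from this research.

```python
def elbow_k(inertia_by_k):
    """Pick K where the inertia curve bends most (largest second difference).

    Assumes inertia values are given for at least three consecutive K.
    """
    ks = sorted(inertia_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Drop in inertia before K minus the drop after K.
        drop_before = inertia_by_k[prev_k] - inertia_by_k[k]
        drop_after = inertia_by_k[k] - inertia_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Hypothetical inertia values for K = 1..6, for illustration only.
inertias = {1: 1000.0, 2: 600.0, 3: 350.0, 4: 120.0, 5: 100.0, 6: 90.0}
print(elbow_k(inertias))
```

On real data the curve is often smoother than in this toy example, so the numeric choice is best treated as a starting point to be checked against the plot.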

Figure 6: Optimal K determined by the Elbow method

(Picture credit : Original).

Figure 7: The visualization of K-means results (Picture

credit : Original).

3.2.2 The Statistical Analysis of the User

Clusters

Statistical analysis methods similar to those used for the user group patterns are applied to the K-means generated clusters. The p-value of the chi-squared test on the emoji usage distributions of the 4 clusters reaches 0.957, considerably higher than that of the grouping by user gender and age. The result shows that the K-means generated user clusters have a higher impact on users’ emoji usage.

3.2.3 The Clustering of Emoji Usage

Records

The result of the K-means clustering of the emoji usage records is shown in Table 2. The platform factor is not included, as previous results show that the relevance of platform to emoji usage is weak. The result suggests that emoji usages can be classified into 3 clusters, with 1 cluster containing most of the emoji usage records. Therefore, the assumption is that the 3 clusters represent 1 regular usage of emojis and 2 sarcastic usages. To verify the assumption, the research calculates the p-values of chi-squared tests between pairs of clusters of emoji usage records under each context. Because the counts of Cluster 1 and Cluster 3 are scarce, the p-value of the test between Clusters 1 and 3 is highly unreliable. The results of the other two tests are shown in Table 2. From the table, the research concludes that the emoji usages in Clusters 1 and 3 are significantly different from those in Cluster 2. Cluster 2 can be deemed the normal usage of emojis, and Clusters 1 and 3 sarcastic or exceptional usages.

Table 2: The p-value of emoji-usage-cluster-related emoji

distributions under contexts

Context        Cluster 1 & 2    Cluster 2 & 3
Angry          0.308            0.021
Love           0.023            0.711
Confusion      0.742            0.566
Celebration    0.677            0.513
Funny          0.558            0.567
Support        0.923            0.870
Surprise       0.451            0.150
Happy          0.562            0.613
Cool           0.425            0.525
Sad            0.823            0.155

4 CONCLUSIONS

In this work, a comprehensive analysis involving multiple features of emoji users and contexts has been carried out on a dataset of emoji usage to discover the pattern of emoji usage in a more systematic manner. Statistical analysis, including single-feature and multi-feature analysis, and machine learning methods are used, and the results are analyzed. The research concludes that user gender has a substantial influence on emoji usage, while user age and platform alone have only a slight influence. The research finds that the user group, with age and gender considered together, has the greatest impact on the choice of emojis under the same context. The research also regroups the users using K-means, and the resulting clusters are more significant than the original age and gender groups. The research categorizes emoji usages and identifies normal usage and sarcastic or exceptional usage within the records. However, the results of the machine learning methods remain to be explained more fully. The preprocessing for the machine learning methods involves only one-hot encoding, which also remains to be extended.

REFERENCES

Ahmed, M., Seraj, R., & Islam, S. M. S. 2020. The k-means

algorithm: A comprehensive survey and performance

evaluation. Electronics, 9(8), 1295.

Bai, Q., Dan, Q., Mu, Z., & Yang, M. 2019. A systematic

review of emoji: Current research and future

perspectives. Frontiers in Psychology, 10.

Benkhedda, Y., Xiao, P., & Magdy, W. 2024. Emoji are

effective predictors of user’s demographics. In

Proceedings of the 2023 IEEE/ACM International

Conference on Advances in Social Networks Analysis

and Mining (ASONAM '23), 784–792.

Boutet, I., LeBlanc, M., Chamberland, J. A., & Collin, C.

A. 2021. Emojis influence emotional communication,

social attributions, and information

processing. Computers in Human Behavior, 119,

106722.

Kaggle. 2024. Emoji trends dataset. Retrieved from https://www.kaggle.com/datasets/waqi786/emoji-trends-dataset

Ma, W., Liu, R., Wang, L., & Vosoughi, S. 2020. Emoji

prediction: Extensions and benchmarking. arXiv

preprint arXiv:2007.07389.

Mindrila, D., Balentyne, P., & Tables, T. W. 2013. The Chi-

square test. The Basic Practice of Statistics, 6th ed.; WH

Freeman and Company: New York, NY, USA.

Stark, L., & Crawford, K. 2015. The conservatism of emoji:

Work, affect, and communication. Social Media +

Society, 1(2).

Van der Maaten, L., & Hinton, G. 2008. Visualizing data

using t-SNE. Journal of machine learning

research, 9(11).

Zhao, G., Liu, Z., Chao, Y., & Qian, X. 2021. CAPER:

Context-aware personalized emoji recommendation.

IEEE Transactions on Knowledge and Data

Engineering, 33(9), 3160-3172.
