Analyzing Emoji in Text
Research Scientist, Holler.io, San Mateo, CA.
sanjaya@holler.io | http://sanjw.org/ | @sanjrockz
SANJAYA WIJERATNE
BAX-423 Big Data Analytics
GUEST LECTURE AT THE GRADUATE SCHOOL OF MANAGEMENT OF THE UNIVERSITY OF CALIFORNIA, DAVIS, 24TH/25TH APRIL, 2020.
Meet Your Instructor
► Research Scientist at Holler.io
► Work on NLP
► Academic Background
► Education - Ph.D. in Computer Science and Engineering
► Research Interest - Emoji/Text Processing, NLU
► My Journey So Far
► I’m from Sri Lanka -> B.Sc. in IT (University of Moratuwa,
Sri Lanka) -> ~2 years as a Software Engineer, 7.5 years
as a GRA/TA at Wright State University
4/19/2020 BAX-423 Big Data Analytics, UC Davis
Emoji Chain | Gang Usage | Non-Gang Usage
[emoji image] | 32.25% | 1.14%
[emoji image] | 53% | 1.71%
How I Started Working with Emoji
Anthropology 189:001, UC Berkeley
Image Source – https://arxiv.org/pdf/1610.09516.pdf
Why Study Emoji?
Emoji = Picture Character
► Introduced by Shigetaka Kurita in 1999
► Unicode started supporting the emoji
character set in 2010
► Emoji are not emoticons such as :-) and :-(
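The difference between an emoji and an emoticon can be seen directly in code. The following is a minimal sketch using only the Python standard library; the sample text and the emoticon pattern are illustrative assumptions, not material from the lecture.

```python
import re
import unicodedata

# An emoji is a single Unicode character with an official name;
# an emoticon is a sequence of ordinary punctuation characters.
emoji = "\U0001F602"  # the character named FACE WITH TEARS OF JOY
emoticon = ":-)"

emoji_name = unicodedata.name(emoji)
emoticon_names = [unicodedata.name(ch) for ch in emoticon]

# Illustrative pattern (an assumption): matches only the two emoticons
# shown on the slide, ":-)" and ":-(".
EMOTICON_RE = re.compile(r":-[()]")

text = "Great talk :-) but the demo crashed :-( " + emoji
found_emoticons = EMOTICON_RE.findall(text)

print(emoji_name)
print(emoticon_names)
print(found_emoticons)
```

Because an emoji is one code point, it can be looked up by name or code point, while emoticons have to be found by pattern matching over plain punctuation.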
Why Has Emoji Usage Increased?
Emoji Usage Statistics
A Few Open Emoji Research Problems Related to Text Processing
► Challenges in interpreting the meaning of an
emoji in a message context
► Emoji similarity
► Emoji sense disambiguation
► Emoji prediction
► Emoji-based retrieval and search
How Do Emoji Get Their Meanings?
Emoji Semantics
► Emoji are inherently designed with no rigid
semantics
► Emoji do not have a grammar; thus, emoji cannot
be used as a language on their own
► How are emoji meanings assigned?
► Initially, by the emoji creators
► Later, by the users
How Do Emoji Get Their Meanings?
► Emoji creators submit possible emoji meanings in
their proposals
► Once accepted, these become available in the
Unicode Common Locale Data Repository (CLDR) at
https://www.unicode.org/cldr/charts/latest/annotations/other.html
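As a rough illustration of how such creator-assigned meanings could be read programmatically, the sketch below parses a small inline sample. It assumes the XML shape used by CLDR annotation files as far as I know it (an `annotation` element with a `cp` attribute, and `type="tts"` marking the short name); the sample keywords are made up, and the element names should be checked against the current CLDR release before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a CLDR annotation file; the real data
# is published by the CLDR project linked above.
SAMPLE = """<ldml>
  <annotations>
    <annotation cp="\U0001F602">face | face with tears of joy | joy | laugh | tear</annotation>
    <annotation cp="\U0001F602" type="tts">face with tears of joy</annotation>
  </annotations>
</ldml>"""


def parse_annotations(xml_text):
    """Return {emoji: {"name": str or None, "keywords": [str, ...]}}."""
    result = {}
    for node in ET.fromstring(xml_text).iter("annotation"):
        entry = result.setdefault(node.get("cp"), {"name": None, "keywords": []})
        if node.get("type") == "tts":
            # The "tts" annotation carries the short name of the emoji.
            entry["name"] = node.text.strip()
        else:
            # The untyped annotation carries "|"-separated keywords.
            entry["keywords"] = [k.strip() for k in node.text.split("|")]
    return result


meanings = parse_annotations(SAMPLE)
print(meanings["\U0001F602"])
```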
How Do Emoji Get Their Meanings?
► When people replace words using emoji (logographic)
► Homonymy relations in languages (e.g., the eye emoji read as “eye” and as “I”)
Image Source – https://goo.gl/rjS1hX
*Actual social media content (image label: “I”)
Getting the Emoji Meanings
Image Source – http://emojinet.knoesis.org
EmojiNet
Image Source – https://arxiv.org/pdf/1707.04652.pdf
Emoji Similarity Problem
Emoji Similarity Problem
► “Measuring the semantic similarity of emoji such
that the measure reflects the likeness of their
meaning, interpretation or intended use.”
[Wijeratne et al., 2017]
Notion of Emoji Similarity
► The notion of emoji similarity is broad
► Pixel-based Emoji Similarity
► Meaning-based Emoji Similarity
Representing Emoji Meaning
Distributional Semantics
► Finds semantic properties of linguistic items (words)
based on their distribution in a large corpus
► Based on Distributional Hypothesis (Harris, 1954)
► Words that are used and occur in the same contexts tend to
purport similar meanings
► We use large text corpora with emoji to learn
distributional semantics of emoji, which reveals
relationships among emoji
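The distributional hypothesis can be illustrated with a toy example. The sketch below builds simple co-occurrence count vectors from a made-up corpus and compares emoji by cosine similarity; the corpus, the message-level context window, and the function names are assumptions for illustration, and real systems learn dense embeddings from far larger corpora.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy corpus (made up for illustration); each message is a token list.
corpus = [
    ["happy", "birthday", "🎂", "🎉"],
    ["happy", "birthday", "🎉", "🎁"],
    ["birthday", "party", "🎂", "🎁"],
    ["so", "sad", "today", "😢"],
    ["sad", "news", "today", "😢"],
]


def cooccurrence_vectors(messages):
    """Count, for each token, the other tokens sharing a message with it."""
    vectors = defaultdict(Counter)
    for tokens in messages:
        for i, target in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:
                    vectors[target][context] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = cooccurrence_vectors(corpus)
sim_party = cosine(vectors["🎂"], vectors["🎉"])  # emoji sharing contexts
sim_mixed = cosine(vectors["🎂"], vectors["😢"])  # emoji with disjoint contexts
print(sim_party, sim_mixed)
```

Emoji that appear alongside the same words end up with similar vectors, which is the relationship the slide refers to.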
Learning Emoji Embeddings
► Learn distributional semantics of words as word
embeddings using two corpora (Tweets and
Google News)
► Convert the words in emoji meanings to vectors
using word embeddings (emoji embeddings)
► Evaluate the similarity (distance) of emoji in the
embedding space using EmoSim508, a new
dataset with 508 emoji pairs
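The steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide does not say how the word vectors of an emoji meaning are combined, so averaging is an assumption here, and the 3-dimensional vectors and sense words are hypothetical.

```python
from math import sqrt

# Hypothetical 3-dimensional word vectors; a real model would be trained
# on tweets or Google News and have hundreds of dimensions.
word_vectors = {
    "laugh": [0.9, 0.1, 0.0],
    "joy":   [0.8, 0.2, 0.1],
    "funny": [0.7, 0.3, 0.0],
    "cry":   [0.1, 0.9, 0.2],
    "sad":   [0.0, 0.8, 0.3],
}

# Hypothetical sense words for three emoji (not actual EmojiNet entries).
emoji_meanings = {
    "😂": ["laugh", "joy", "funny"],
    "😆": ["laugh", "funny"],
    "😢": ["cry", "sad"],
}


def emoji_embedding(words, vectors):
    """Average the vectors of the known words in an emoji's meaning."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        raise ValueError("no known words in the emoji meaning")
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


embeddings = {e: emoji_embedding(ws, word_vectors) for e, ws in emoji_meanings.items()}
sim_close = cosine(embeddings["😂"], embeddings["😆"])
sim_far = cosine(embeddings["😂"], embeddings["😢"])
print(round(sim_close, 3), round(sim_far, 3))
```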
Representing Emoji Meaning
Ground Truth Data Creation
► Most frequently occurring emoji pairs from a 110M
Twitter dataset with emoji
► Each emoji pair was rated for similarity and
relatedness by 10 human users
Intrinsic Evaluation
► Using four different emoji definitions
(Sense_Desc., Sense_Label, Sense_Def.,
Sense_All) and two corpora (Twitter and Google
News), we trained eight emoji embedding
models for each emoji
► We calculated emoji similarity of the 508 emoji
pairs using each embedding model
Intrinsic Evaluation Cont.
► Using Spearman’s Rank Correlation Coefficient
(Spearman’s ρ), we compared the similarity
rankings of each model with ground truth data
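For reference, Spearman's ρ is the Pearson correlation computed on ranks. The sketch below implements it in plain Python with average ranks for ties; the five rating values are hypothetical and are not taken from EmoSim508.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho is the Pearson correlation of the two rank lists."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five emoji pairs (not EmoSim508 values).
human_ratings = [3.8, 0.4, 2.9, 1.5, 3.1]
model_scores = [0.91, 0.12, 0.78, 0.30, 0.55]
rho = spearman_rho(human_ratings, model_scores)
print(rho)
```

Only the ordering of the scores matters, so a model's cosine similarities and the human ratings can be compared even though they are on different scales.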
Extrinsic Evaluation
► We tested our emoji embedding models using a
sentiment analysis baseline
► Our baseline had 12,920 English tweets, and 2,295 of
them had emoji
► All words in the tweets were replaced with their
corresponding word embeddings, and emoji were
replaced with the learned emoji embeddings
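The replacement step can be sketched as a lookup over two tables. The vectors below are hypothetical 2-dimensional values, and averaging the token vectors into a single feature vector is an assumption made for illustration; the slide does not describe how the classifier consumed the embeddings.

```python
# Hypothetical 2-dimensional lookup tables; real models would be learned
# from large corpora as described earlier.
word_embeddings = {"love": [0.9, 0.1], "this": [0.5, 0.5], "hate": [0.1, 0.9]}
emoji_embeddings = {"😍": [0.95, 0.05], "😡": [0.05, 0.95]}


def embed_tweet(tokens):
    """Map each token to its vector, using the emoji table for emoji."""
    vectors = []
    for token in tokens:
        if token in emoji_embeddings:
            vectors.append(emoji_embeddings[token])
        elif token in word_embeddings:
            vectors.append(word_embeddings[token])
        # Tokens missing from both tables are skipped (an assumption).
    return vectors


def tweet_features(tokens):
    """Average the token vectors into one fixed-length feature vector."""
    vectors = embed_tweet(tokens)
    if not vectors:
        return [0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


positive = tweet_features(["love", "this", "😍"])
negative = tweet_features(["hate", "this", "😡"])
print(positive, negative)
```

The resulting feature vectors could then be passed to any standard classifier for sentiment prediction.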
Extrinsic Evaluation Cont.
Key Takeaways
► Combining emoji sense knowledge with
distributional semantics could improve the emoji
embedding models
► Longer sense definitions are not suitable for emoji
similarity experiments
Emoji Sense Disambiguation
Emoji Sense Disambiguation Problem
Image Source – https://goo.gl/rjS1hX
*Actual social media content (image labels: “I”, “Look”)
► “The ability to identify the meaning of an emoji in the context of a
message in a computational manner” [Wijeratne et al., 2017].
Emoji Sense Disambiguation
► Currently, no labeled datasets are available for solving
emoji sense disambiguation in a supervised setting
Emoji Sense Disambiguation Cont.
► We selected the 25 most commonly misunderstood
emoji and 50 tweets for each emoji
► Used the Simplified Lesk algorithm for disambiguation
► Context words were learned for each emoji sense
definition using Twitter and Google News-based word
embedding models
► Twitter-based embeddings outperform others
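The core of Simplified Lesk is an overlap count between the words around the target and the words associated with each candidate sense. The sketch below shows that idea for a single emoji; the senses and their context words are hypothetical (in the lecture's setup they are learned from word embedding models), and the tie-breaking rule is an assumption.

```python
def simplified_lesk(context_tokens, sense_context_words):
    """Pick the sense whose context words overlap most with the message.

    Ties go to the sense listed first.
    """
    context = {t.lower() for t in context_tokens}
    best_sense, best_overlap = None, -1
    for sense, words in sense_context_words.items():
        overlap = len(context & {w.lower() for w in words})
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


# Hypothetical senses and context words for one emoji (not EmojiNet data).
senses = {
    "laugh": ["funny", "joke", "hilarious", "lol"],
    "cry": ["sad", "tears", "miss", "hurt"],
}

tweet = "that joke was so funny lol 😂".split()
sense, overlap = simplified_lesk(tweet, senses)
print(sense, overlap)
```

With more context words per sense, more messages produce a non-zero overlap, which is consistent with the observation that accuracy improves as the number of context words grows.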
Results and Takeaways
► Tools designed for processing well-formed text will not
work well when used on ill-formed text
► Sense disambiguation accuracy increases as the
number of context words used increases
What Did We Learn?
Recap
► We looked at
► Why it is important to do emoji analysis
► How emoji get their meanings
► How to calculate emoji similarity
► How to disambiguate the meaning of an emoji
Acknowledgements
Collaborators
Prof. Amit Sheth
University of South Carolina
Prof. Derek Doran
Wright State University
Lakshika Balasuriya
(Gracenote Inc.)
Funding
References
► Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran. A Semantics-Based Measure of
Emoji Similarity. In 2017 IEEE/WIC/ACM International Conference on Web Intelligence (Web
Intelligence 2017). Leipzig, Germany; 2017.
► Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran. EmojiNet: An Open Service and
API for Emoji Sense Discovery. In 11th International AAAI Conference on Web and Social Media
(ICWSM 2017). Montreal, Canada; 2017.
► Sanjaya Wijeratne, Lakshika Balasuriya, Amit Sheth, Derek Doran. EmojiNet: Building a Machine
Readable Sense Inventory for Emoji. In 8th International Conference on Social Informatics (SocInfo
2016). Bellevue, WA, USA; 2016.
► Lakshika Balasuriya, Sanjaya Wijeratne, Derek Doran, Amit Sheth. Finding Street Gang Members on
Twitter. In 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis
and Mining (ASONAM 2016). San Francisco, CA, USA; 2016.
Thank You!
SANJAYA@HOLLER.IO | HTTP://SANJW.ORG/ | @SANJROCKZ
