MOTIVATIONS BEHIND
SENDING MESSAGES IN MDT.
(METHODS, DERIVATION & RESULTS)
Anup Sawant
SONIC @ Northwestern University
Acknowledgement
¨  This analysis wouldn’t have been possible without the
support of SONIC @ Northwestern University and DELTA
Lab @ Georgia Tech in developing the project ‘My Dream
Team’.
¨  Other project members:
¤  Dr. Noshir Contractor, Northwestern University.
¤  Dr. Leslie DeChurch, Georgia Tech.
¤  Dr. Edward Palazzolo, Northwestern University.
¤  Harshad Gado, Northwestern University.
¤  Amy Wax, Georgia Tech.
¤  Samuel Posnock, Georgia Tech.
¤  Peter Seely, Georgia Tech.
Overview
¨  Description
¤  Objective
¤  Corpus details
¨  Methodology
¤  Different approaches
¤  Text Parsing
¤  Forming vectors
¤  Measuring similarity
¤  K-Means
¤  Choosing the optimum K
¤  Distribution & Density
¤  Topic Indicators
¨  Results
¤  What can we infer ?
¤  Supporting inference
¤  Gratitude & Request
¤  Need, Liking, Invite & Praise
¤  Collaboration & Grouping
¨  Conclusion
Project Description
¨  Objective
¤  To analyze message text and find the motivations behind message
interaction in the process of forming teams.
¤  To provide supporting evidence for some of the known reasons behind
team formation ties, through mathematically derived topical
patterns hidden in unstructured text data.
¨  Corpus details
¤  ‘My Dream Team’ (MDT) run : 19th – 23rd Feb, 2014.
¤  # Students participated : 214
¤  # Text messages exchanged : 353
¤  # Unique words/terms in entire corpus : 619
Methodology/Different Approaches
¨  Problem statement :
“Given a text corpus X = {x1, x2, x3, …}, where xi is a document/message, find the topics/ideas
(in our context, primary motivations) that represent individual clusters within X.”
¨  Possible Approaches :
¤  Latent Semantic Analysis (mostly used in IR for indexing)
¤  Latent Dirichlet Allocation (probabilistic topic modeling)
¤  Document Clustering (we go by this)
¨  Problems :
¤  Real-world textual data is usually “dirty”, which complicates text parsing.
¤  Performance and accuracy can depend on having a rich vocabulary for
grammatical parsing.
Methodology/Text Parsing
¨  Original message :
“I enjoy learning and growing while also getting out a little of my competitive spirit.”
¨  Lowercase :
“i enjoy learning and growing while also getting out a little of my competitive spirit.”
¨  Lemmatize (remove punctuation, split into words & convert each word to its root) :
[i, enjoy, learn, and, grow, while, also, get, out, a, little, of, my, competitive, spirit]
¨  Remove stopwords :
[i, enjoy, learn, grow, out, little, my, competitive, spirit]
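The three parsing steps above can be sketched in a few lines of Python. This is only an illustration: the deck does not say which lemmatizer or stopword list was used, so the tiny `LEMMAS` map and `STOPWORDS` set below are hypothetical stand-ins, chosen so that the slide's example comes out the same way.

```python
import re

# Hypothetical stand-ins: the deck does not name its lemmatizer or stopword list.
LEMMAS = {"learning": "learn", "growing": "grow", "getting": "get"}
STOPWORDS = {"and", "while", "also", "get", "a", "of"}


def parse_message(message):
    """Lowercase, tokenize, lemmatize and remove stopwords from one message."""
    lowered = message.lower()
    # Keeping only runs of letters drops punctuation and splits into words.
    tokens = re.findall(r"[a-z]+", lowered)
    # Map each token to its root form when the toy lemma table knows it.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # Drop the words treated as stopwords in this sketch.
    return [lemma for lemma in lemmas if lemma not in STOPWORDS]


example = ("I enjoy learning and growing while also getting out "
           "a little of my competitive spirit.")
parsed = parse_message(example)
```

In practice a library lemmatizer and a standard stopword list would replace the two hand-written tables.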
Methodology/Forming Vectors
¨  Bag of words : Collect all unique words in the corpus to form the
vocabulary.
¨  Document-Term index : For each word in the vocabulary, count its
frequency in each document/message of the corpus. (example below)

Term/Word   Message-1   Message-2   Message-3   Message-4
hey         1           1           1           0
team        1           1           0           1
join        1           0           1           1
work        1           0           1           0
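A minimal sketch of how such a document-term index could be built from already parsed messages. The four toy messages are an assumption, reverse-engineered so that they reproduce the example table; the function names are illustrative and not taken from the project.

```python
def build_index(messages):
    """Return (vocabulary, index), where index[term][j] is the count of term in message j."""
    vocabulary = []
    for tokens in messages:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)  # keep first-seen order
    index = {term: [tokens.count(term) for tokens in messages]
             for term in vocabulary}
    return vocabulary, index


def message_vector(vocabulary, index, j):
    """Read one column of the index: the term-count vector of message j."""
    return [index[term][j] for term in vocabulary]


# Toy messages (assumed) that reproduce the example table.
toy_messages = [
    ["hey", "team", "join", "work"],  # Message-1
    ["hey", "team"],                  # Message-2
    ["hey", "join", "work"],          # Message-3
    ["team", "join"],                 # Message-4
]
vocabulary, index = build_index(toy_messages)
```

Each row of the table is `index[term]`, and each message vector used for clustering is a column, for example `message_vector(vocabulary, index, 1)` for Message-2.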
Methodology/Measuring Similarity
¨  Cosine Similarity : Performs better than the Euclidean measure when
message lengths differ. Cosine similarity depends only on the angle between
two vectors, so it is largely unaffected by differences in vector length.
Intuitively, documents/messages dealing with the same topic/domain remain
close in vector space irrespective of their message length.
[Figure: two message vectors, Message 1 and Message 2, drawn from the origin (0,0) in the x-y plane. The angle Θ between them is labelled as the cosine distance and the gap between their endpoints as the Euclidean distance.]
Example : The figure on the right shows 2 message vectors. Although the
Euclidean measure shows quite a bit of distance in vector space, the cosine
measure indicates that the vectors point in nearly the same direction.
cos 0° = 1 indicates similar vectors; cos 90° = 0 indicates dissimilar
vectors.
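The contrast between the two measures can be checked numerically. The sketch below uses two made-up count vectors that point in the same direction but differ in length, as a short and a long message on the same topic might; the vectors are illustrative and not drawn from the MDT corpus.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


short_message = [1, 1, 0, 0]  # hypothetical short message
long_message = [3, 3, 0, 0]   # same words, each repeated three times
unrelated = [0, 0, 2, 1]      # shares no words with the other two

distance = euclidean_distance(short_message, long_message)
similarity = cosine_similarity(short_message, long_message)
```

Here the Euclidean distance is clearly non-zero while the cosine similarity is 1 (up to floating-point rounding), and the cosine similarity with `unrelated` is 0.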
Methodology/K-Means
¨  The K-means algorithm is a method to automatically cluster similar data
examples together. It is an iterative procedure that starts by guessing the
initial centroids, and then refines this guess by repeatedly assigning
examples to their closest centroids and recomputing the centroids based on
the assignments. (Image source: Wikipedia)
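The assign-then-recompute loop described above can be sketched as follows. This is a plain textbook K-means using squared Euclidean distance and a naive initialization (the first k points), both assumptions made for brevity; the deck does not show the project's actual implementation. Normalizing the message vectors to unit length beforehand would make the Euclidean ordering agree with cosine similarity.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Cluster points into k groups; returns (assignments, centroids)."""
    # Naive initial guess (assumption): use the first k points as centroids.
    centroids = [list(p) for p in points[:k]]
    assignments = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its closest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    [sum(column) / len(members) for column in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return assignments, centroids


toy_points = [[0, 0], [10, 10], [0, 1], [10, 11]]
toy_assignments, toy_centroids = kmeans(toy_points, k=2)
```

Real runs are usually repeated with several random initializations, since K-means can settle in a poor local optimum.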
Methodology/Choosing the optimum K
In finding the optimum number of clusters/means in the message corpus, we
use the ‘Elbow Curve’ technique. The figure on the right plots the cost
function J_cost-min against the number of means tried, where

$J_{\text{cost-min}} = \frac{1}{m} \sum_{i=1}^{m} \lVert x_i - \mu_{c_i} \rVert^2$

m = number of messages.
x_i = message vector.
c_i = index of the centroid that vector x_i is assigned to.
\mu_{c_i} = centroid vector of the cluster to which x_i belongs.
Thus, the optimum number of clusters for the MDT message corpus is taken to
be 3 (the point where the graph bends like an elbow).
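As a sketch, the cost plotted on the elbow curve can be computed directly from its definition: the mean squared distance between each message vector and the centroid it is assigned to. The function name and the toy inputs below are illustrative assumptions.

```python
def clustering_cost(points, assignments, centroids):
    """Mean squared distance from each point to its assigned centroid."""
    m = len(points)
    total = 0.0
    for x, c in zip(points, assignments):
        total += sum((xi - mu) ** 2 for xi, mu in zip(x, centroids[c]))
    return total / m


toy_points = [[0, 0], [0, 2]]

# One cluster: both points share the centroid (0, 1).
cost_k1 = clustering_cost(toy_points, [0, 0], [[0, 1]])

# Two clusters: every point sits exactly on its own centroid.
cost_k2 = clustering_cost(toy_points, [0, 1], [[0, 0], [0, 2]])
```

The cost can only stay level or fall as more means are added, which is why the elbow (the point after which further means give little improvement) is used instead of the minimum.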
Methodology/Distribution & Density
[Figure: pie chart, ‘Message distribution among clusters’. Cluster 1: 122 messages, Cluster 2: 124 messages, Cluster 3: 107 messages.]
[Figure: ‘Cluster density’. # words per cluster (y-axis, 0 to 1800) against # messages per cluster (x-axis, 105 to 125), with one point each for Cluster 1, Cluster 2 and Cluster 3.]
The pie chart on the right gives the number of
messages distributed among the 3 clusters in the text
message corpus of MDT.
The graph on the right gives an idea of how
dense each cluster is with words or terms.
Note: Cluster-1 is bigger in size (has more
messages) than Cluster-3, but the number of
words that make up Cluster-1 is far smaller than
that of Cluster-3. This gives an important clue
that Cluster-1 is most likely made up of
‘short messages’.
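The ‘short messages’ clue can be made concrete by dividing each cluster's word count by its message count. The counts below are the per-cluster figures reported in this deck; treating ‘# words’ as the total number of words in the cluster is an assumption.

```python
# Per-cluster counts as reported in the deck.
clusters = {
    "Cluster-1": {"messages": 122, "words": 387},
    "Cluster-2": {"messages": 124, "words": 1013},
    "Cluster-3": {"messages": 107, "words": 1601},
}

# Average number of words per message in each cluster.
words_per_message = {
    name: counts["words"] / counts["messages"]
    for name, counts in clusters.items()
}

total_messages = sum(counts["messages"] for counts in clusters.values())
```

Under that assumption, Cluster-1 averages roughly 3 words per message against roughly 8 for Cluster-2 and 15 for Cluster-3, and the message counts add up to the 353 messages in the corpus.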
Methodology/Topic indicators
On a broader scale, we already know that all the
messages deal with the ‘Team Formation’ topic. We are
looking for the hidden motivations at a sub-topical
scale.
The segmented pyramid on the right shows some of
the top-ranking words by frequency in each cluster,
with ‘Team Formation’ as the core topic. The
words that are most common to all clusters, and that
reflect the ‘Team Formation’ topic on a broader scale,
are at the core of the pyramid.
We consider top words as strong indicators of
hidden motivations.
Cluster-1 : 122 messages, 387 words
Cluster-2 : 124 messages, 1013 words
Cluster-3 : 107 messages, 1601 words
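Ranking words by frequency within a cluster could be sketched as below. The toy cluster is invented for illustration and is not MDT data; `top_words` is a hypothetical helper name.

```python
from collections import Counter


def top_words(cluster_messages, n=3):
    """Return the n most frequent words in a cluster as (word, count) pairs."""
    counts = Counter(word for message in cluster_messages for word in message)
    return counts.most_common(n)


# Invented toy cluster of already parsed messages.
toy_cluster = [
    ["thanks", "team"],
    ["thanks", "accept", "team"],
    ["thanks"],
]
ranking = top_words(toy_cluster, n=2)
```

Words that rank high in every cluster (such as ‘team’ in this toy data) would sit at the core of the pyramid, while words that rank high in only one cluster are the candidate indicators for that cluster.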
Methodology/What can we infer ?
Cluster-1: Has short messages indicating gratitude or a
request to be a member, through words like ‘thanks’
and ‘accept’. Low-frequency words in this cluster
are mostly slang or non-dictionary words.
Cluster-2: Has messages that mostly refer to the recipient's
qualities, hence words like ‘cool’ & ‘like’ stand
out among the top words. These messages probably
also talk about the sender's ‘need’ to add one
or more members to the team.
Cluster-3: Has messages that indicate topics such as
working together, grouping and, mostly, collaboration,
with words like ‘group’, ‘work’ and ‘together’ occurring
with high frequency.
Results/Supporting inference
¨  Though top words provide a strong indication of
probable topics in a cluster, the high frequency of a
word alone isn’t enough to support our assumption of
topics.
¨  Good support for our inference would come from
a mathematical analysis of the co-occurrence of the top
words from each cluster with the words ‘team’ and
‘you’, which make up the core topic of ‘Team
Formation’.
Results/Probability for coherence
¨  Example :
“Ana (a word) is in the mall (topic) given her best friends Harry and Brian (‘team’ & ‘you’)
are in the mall (topic).”
¨  In other words, a word defines a topic only if it
co-occurs with other supporting words, reflecting the
coherence necessary to define that topic.
¨  P(topic) = P(w | X) where,
w = word, X = core words of the topic ‘Team Formation’.
¨  We calculate P(topic) across all clusters for a given
word.
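One possible way to estimate P(w | X) within a cluster is sketched below. The deck does not spell out the estimator, so this is an assumption: the fraction of the cluster's messages containing all core words that also contain w. The toy messages and the function name are hypothetical.

```python
def topic_probability(word, core_words, messages):
    """Estimate P(word | core words) as a fraction of messages in one cluster.

    Assumed estimator: among messages containing every core word, the share
    that also contains `word`. Returns 0.0 if no message has the core words.
    """
    core = set(core_words)
    with_core = [set(tokens) for tokens in messages if core <= set(tokens)]
    if not with_core:
        return 0.0
    hits = sum(1 for tokens in with_core if word in tokens)
    return hits / len(with_core)


# Invented toy cluster of parsed messages.
toy_cluster = [
    ["thanks", "you", "team"],
    ["you", "team", "join"],
    ["thanks"],
    ["team", "you", "thanks", "accept"],
]
p_thanks = topic_probability("thanks", ["team", "you"], toy_cluster)
p_join = topic_probability("join", ["team", "you"], toy_cluster)
```

Repeating the calculation for the same word in each cluster gives the per-cluster bars compared in the result charts.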
Results/Gratitude & Request by Cluster 1
[Figure: bar chart, ‘Topic probability through conditional probability of words given core Team Formation words ‘team’ and ‘you’’. x-axis: words (thanks, accept); y-axis: Probability (0 to 0.045); one bar each for Cluster 1, Cluster 2 and Cluster 3.]
Results/Need, Liking, Invite & Praise by Cluster 2
[Figure: bar chart, ‘Topic probability through conditional probability of words given core Team Formation words ‘team’ and ‘you’’. x-axis: words (need, one, like, join, cool); y-axis: Probability (0 to 0.7); one bar each for Cluster 1, Cluster 2 and Cluster 3.]
Results/Collaboration & Grouping by Cluster 3
[Figure: bar chart, ‘Topic probability through conditional probability of words given core Team Formation words ‘team’ and ‘you’’. x-axis: words (work, group, together); y-axis: Probability (0 to 0.3); one bar each for Cluster 1, Cluster 2 and Cluster 3.]
Conclusion
¨  Document clustering, with probabilistic support for topical
inference, over the message corpus of MDT has helped us
expose the following motivations behind sending messages.
Users interacted when:
¤  There was a need for one or more people to complete the team.
¤  They liked someone’s profile.
¤  They wanted to invite someone to join their team.
¤  They wanted to praise someone for a good profile or looks.
¤  They wanted to group/merge their incomplete teams.
¤  They wanted to collaborate/work with someone.
¤  They wanted to express gratitude.
¤  They wanted to earnestly request someone to join.
