With the popularity of Twitter, the digital footprint of real-life events on the Internet has grown enormously. References to different types of events on Twitter can provide valuable information to researchers and organizations, which can be mined and analyzed to support major decisions. Applications span real-life event analysis, opinion mining, reference tracking, online advertising, recommendation engines, cyber security, event management, and enterprise data integration, among others. There is therefore a need for a generic framework that can collect event references, extract the identity information of the events from them, and maintain that information persistently in order to resolve new references to the events and provide updated analytics. The presented research establishes the design and implementation of such a framework from the perspective of Event Identity Information Management (EIIM) in the domain of Twitter. The paper introduces the problem of EIIM in Twitter, discusses the prevalent challenges, and proposes the design of a framework capable of managing persistent identity information for a pre-specified set of events. We explore the applications of the research, validate the different components of the framework, and conclude with comments on various criteria that show the efficacy and practical utility of the proposed framework.
A Framework for Collecting, Extracting and Managing Event Identity Information from Twitter
1. A Framework for Collecting, Extracting and Managing Event Identity Information from Twitter
Debanjan Mahata, John R. Talburt
dxmahata@ualr.edu, jrtalburt@ualr.edu
Department of Information Science
University of Arkansas at Little Rock
Vivek Kumar Singh
vivek@cs.sau.ac.in
Department of Computer Science
South Asian University, New Delhi, India
2. Social Media
A daily average of 58 million tweets is posted on Twitter. Source: http://goo.gl/Oz5sIZ
An average of 60 million photos are shared on Instagram daily. Source: http://instagram.com/press
Facebook stores 300 petabytes of data related to its users from all over the world. Source: http://goo.gl/XxEfeX
72% of all Internet users are now active on social media. Source: http://goo.gl/qAuIoe
46% of adult Internet users post original photos or videos online that they themselves have created. Source: http://goo.gl/iQ06Ix
4. EIIM in MDM
Zhou, Yinle, and John Talburt. "Entity identity information management (EIIM)." International Conference on Information Quality (ICIQ-11), Adelaide, Australia. 2011.
6. Challenges
• Volume and Velocity
• Veracity
• Informal Text
• Variety
• Searching the Long Tail
• Sampling Bias
• Sparse Link Structure Between Content in Social Media
• Lack of Evaluation Datasets
Example tweet: "New post: Sochi Was For Suckers - Laugh Studios/ http://t.co/cWQJCBp3Ow #lol #funny #rofl #funnypic #fail #wtf"
7. EIIM Life Cycle in Twitter
Mahata, Debanjan, and John Talburt. "A Framework for Collecting and Managing Entity Identity Information from Social Media." 19th International Conference on Information Quality, Xi'an, China.
8. Identity Integrity
Assigns a unique identifier to each real-life event being tracked by the framework and maintains the same identifier for newly collected event references.
Identity Integrity Requires
• Each real-world event in the domain has one and only one representation in the information system.
• Distinct real-world events have distinct representations in the information system.
Allocates an individual EIIS to each real-life event being tracked by the framework.
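The two identity integrity requirements can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class names `EventIdentityStructure` and `EventRegistry`, the use of seed keywords to describe an event, and the naive keyword-overlap matching are hypothetical and are not the framework's actual resolution logic.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class EventIdentityStructure:
    """Hypothetical stand-in for an EIIS: one per tracked real-life event."""
    event_id: str
    keywords: frozenset
    references: list = field(default_factory=list)


class EventRegistry:
    """Keeps exactly one structure per tracked event (illustrative only)."""

    def __init__(self):
        self._by_keywords = {}  # normalized keyword set -> structure

    def register_event(self, keywords):
        # Normalizing the keywords means the same event is never registered twice.
        key = frozenset(k.lower() for k in keywords)
        if key not in self._by_keywords:
            self._by_keywords[key] = EventIdentityStructure(
                event_id=uuid.uuid4().hex, keywords=key
            )
        return self._by_keywords[key]

    def resolve(self, tweet_text):
        # Assumption: a reference belongs to the first event sharing a keyword.
        tokens = set(tweet_text.lower().split())
        for structure in self._by_keywords.values():
            if structure.keywords & tokens:
                structure.references.append(tweet_text)
                return structure.event_id
        return None  # the reference matches no tracked event
```

Registering the same keyword set twice returns the same structure, and a newly collected reference that matches resolves to the identifier that was already assigned, which is the behavior the slide describes.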
11. Event Reference Preparation
• Parts-of-Speech Tagging
• Special Character Detection
• Data Cleansing
• Duplicate Detection
• Stop Word Detection and Elimination
• Slang Word Extraction
• Feeling Word Extraction
• Tokenization
• Stemming
• Tweet Meta-Data
• Expanded URLs
• User Information
• Verification
• Favorite Count
• Retweet Count
• User Mentions
• Entity Extraction
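A few of the preparation steps listed above (data cleansing, tokenization, stop word detection, duplicate detection, and pulling out hashtags, mentions and URLs) can be sketched as follows. This is only a rough illustration: the tiny `STOP_WORDS` set, the regular expressions, and the function names are assumptions, not the framework's actual resources or code.

```python
import re

# Assumption: a tiny illustrative stop word list, not the one used by the authors.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to", "and", "for"}
URL_RE = re.compile(r"https?://\S+")
TOKEN_RE = re.compile(r"[#@]?\w+")


def prepare_reference(text):
    """Clean one tweet and split it into the pieces later steps would use."""
    urls = URL_RE.findall(text)
    cleaned = URL_RE.sub(" ", text).lower()  # remove URLs, then normalize case
    tokens = TOKEN_RE.findall(cleaned)       # keeps a leading '#' or '@' on a token
    return {
        "tokens": [t for t in tokens if t not in STOP_WORDS],
        "stop_words": [t for t in tokens if t in STOP_WORDS],
        "hashtags": [t for t in tokens if t.startswith("#")],
        "mentions": [t for t in tokens if t.startswith("@")],
        "urls": urls,
    }


def drop_duplicates(texts):
    """Keep the first tweet among those that are equal after normalization."""
    seen, unique = set(), []
    for text in texts:
        key = " ".join(URL_RE.sub(" ", text).lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique
```

Ignoring URLs during duplicate detection is a design choice made here because shortened links often differ between otherwise identical tweets; a stricter definition of a duplicate would compare them too.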
21. Potential Applications
• Event Monitoring and Analysis
• Event Information Retrieval
• Opinion and Review Mining
• Recommender Systems
• Event Management and Marketing
• Social Media Data Integration
• Many More
22. Future Directions
• Summarizing Event Content
• Identification of Insightful Opinionated Content
• Event Topic Modeling
• Event-specific Recommendations
• Distributed Processing of TwitterEventInfoGraph
• Ontology for Event Content in Social Media
• Many More
26. Tweet Features
No. of Unigram Tokens, No. of Stop Words, No. of Slang Words, No. of Feeling Words, No. of Hashtags, Has URL, Is Verified, No. of User Mentions, Length of Post, No. of Unique Characters, No. of Special Characters, Favorite Count, Retweet Count, Formality, No. of Nouns, No. of Adjectives, No. of Verbs, No. of Adverbs.

Logistic Regression Model Performance

Class                 Precision  Recall  F-1 Score
Non-informative (0)   0.70       0.49    0.57
Informative (1)       0.78       0.90    0.84
Avg/Total             0.76       0.77    0.75

Accuracy = 76.64%
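Some of the surface features in the list above can be computed directly from the tweet text and its metadata. The sketch below covers only a subset; the function name `tweet_features`, the dictionary keys, and the whitespace tokenization are assumptions. Features that need external resources (slang and feeling word lists, a part-of-speech tagger, a formality score) are left out.

```python
import re


def tweet_features(text, favorite_count=0, retweet_count=0, is_verified=False):
    """Build a partial feature dictionary for one tweet (illustrative only)."""
    tokens = text.split()  # assumption: simple whitespace tokenization
    return {
        "n_unigram_tokens": len(tokens),
        "n_hashtags": sum(t.startswith("#") for t in tokens),
        "n_user_mentions": sum(t.startswith("@") for t in tokens),
        "has_url": int(bool(re.search(r"https?://\S+", text))),
        "length_of_post": len(text),
        "n_unique_characters": len(set(text)),
        # Special characters here: anything that is neither alphanumeric nor whitespace.
        "n_special_characters": sum(
            not c.isalnum() and not c.isspace() for c in text
        ),
        "favorite_count": favorite_count,
        "retweet_count": retweet_count,
        "is_verified": int(is_verified),
    }
```

A dictionary like this could be turned into a numeric vector and passed to any logistic regression implementation; which implementation and settings the authors used is not stated on the slide.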
Olteanu, Alexandra, et al. "CrisisLex: A lexicon for collecting and filtering microblogged communications in crises." In Proceedings of the 8th International AAAI Conference on Weblogs and Social Media (ICWSM '14). No. EPFL-CONF-203561. 2014.
Event Information Quality
28000 annotated tweets
26 Events
Related and Informative – "#Media Large wildfire in N. Colorado prompts evacuation: Crews are battling a fast-moving wildfire http://t.co/ju1BGTKH #Politics #News"
Related but not Informative – "RT @LarimerSheriff: #HighParkFire update http://t.co/hBy5shen"
Not Related – "#Intern #US #TATTOO #Wisconsin #Ohio #NC #PA #Florida #Colorado #Iowa #Nevada #Virginia #NV #mlb Travel Destinations; http://t.co/TIHBJKF2"
36. Baselines
• SeenRank (http://seen.co/about)
• TextRank (Mihalcea, Rada, and Paul Tarau. "TextRank: Bringing order into texts." Association for Computational Linguistics, 2004.)
• LexRank (Erkan, Günes, and Dragomir R. Radev. "LexRank: graph-based lexical centrality as salience in text summarization." Journal of Artificial Intelligence Research (2004): 457-479.)
• RTRank
• Centroid (Becker, Hila, Mor Naaman, and Luis Gravano. "Selecting Quality Twitter Content for Events." ICWSM 11 (2011).)
• Logistic Regression
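To give a feel for what a centroid-style baseline does, here is a generic sketch that ranks tweets by cosine similarity to the centroid of all tweets about an event. It is not the method of Becker et al. as published: the raw term-frequency weighting, the whitespace tokenization, and the function names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def tf_vector(text):
    """Raw term-frequency vector (assumption: lowercase whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_centroid(tweets):
    """Order tweets from most to least similar to the collection centroid."""
    vectors = [tf_vector(t) for t in tweets]
    centroid = Counter()
    for vec in vectors:
        centroid.update(vec)  # summed counts; cosine is unaffected by scaling
    scored = [(cosine(vec, centroid), tweet) for vec, tweet in zip(vectors, tweets)]
    return [tweet for _, tweet in sorted(scored, key=lambda p: p[0], reverse=True)]
```

Tweets that share vocabulary with most of the collection move to the top, while off-topic tweets fall to the bottom.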
37. Evaluation Metrics
\[ DCG_p = \sum_{i=1}^{p} \frac{2^{rel_i} - 1}{\log_2(i + 1)} \]
\[ nDCG_p = \frac{DCG_p}{IDCG_p} \]
\[ \text{Precision at } n = \frac{\text{Number of relevant references at } n}{n} \]
Baeza-Yates, Ricardo, and Berthier Ribeiro-Neto. Modern information retrieval. Vol. 463. New York: ACM press, 1999.
Järvelin, Kalervo, and Jaana Kekäläinen. "Cumulated gain-based evaluation of IR techniques." ACM Transactions on Information Systems (TOIS) 20.4 (2002): 422-446.
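The evaluation metrics on this slide (DCG, nDCG and precision at n) can be sketched in a few lines. The sketch makes three assumptions: the logarithm is base 2, the ideal ranking for IDCG is obtained by sorting the same list of relevance grades in descending order, and a reference counts as relevant when its grade is greater than zero. The function names are illustrative.

```python
import math


def dcg(rels, p):
    """Discounted cumulative gain over the first p relevance grades."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(rels[:p], start=1)
    )


def ndcg(rels, p):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True), p)
    return dcg(rels, p) / ideal if ideal > 0 else 0.0


def precision_at_n(rels, n):
    """Fraction of the top n references that are relevant (grade > 0)."""
    return sum(1 for r in rels[:n] if r > 0) / n
```

For a ranking whose grades are already in descending order, `ndcg` returns 1.0; any other ordering of the same grades scores lower.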