1. Integrated analysis of
News, views & reviews
Presented By :
Naman Gupta
IIT Bombay
M.Tech - CSE
Guided By :
Dr. Lipika Dey
Principal Scientist, TCS
Innovation Labs - Delhi
2. Problem Statement
• Integrating open-source data such as news articles
with social-media content from Twitter and
dedicated discussion forums such as customer
complaint/review websites
• Retrieval of relevant information
• Linking related information
• Visualization
• Domain : Automobile (Car)
3. Objective
• Support integrated analysis of structured and unstructured
data.
• Twitter captures people's reactions to news items.
• Review websites give early signals about problems faced by
customers.
• Results to be used for predictive analysis in the future.
4. Summary of the Work
• Joint Analysis of News & Tweets
• Linking & Retrieval
• Analysis of Customer Comments (Edmunds.com)
• Visualization
5. Module 1 :
• Analyzing Tweets with respect to News Article
to capture user reaction to an event reported
in the news
• Grouping of Tweets
• Ranking of Tweets
• Tag Cloud
• Tweet Distribution
• Tweet Space.
6. Grouping of Duplicate Tweets
• Initial scheme : only retweets were grouped.
• Used the BLEU (Bilingual Evaluation Understudy) score
to group tweets that are syntactically
the same.
• BLEU score : originally a measure of machine-translation quality, based on n-gram overlap with a reference text.
• Algorithm (to follow)
7. Algorithm
• Input : N tweets, Output : Tweet groups.
• Clean each tweet by removing special characters, URLs, #tags.
• For every tweet t_i :
  • If no group is present :
    • Make a new group with tweet t_i in it.
  • Else
    • For tweet t_j in every existing group :
      • If t_i is a substring of t_j or t_j is a substring of t_i
        • Add t_i to the group of t_j.
      • Else
        • Score = BLEUScore(t_i, t_j)
        • If Score >= 0.7
          • Add t_i to the group of t_j.
    • If t_i was not added to any group
      • Make a new group with tweet t_i in it.
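The grouping steps above can be sketched in Python as follows. This is a minimal illustration, not the original implementation: `bleu_score` is a simplified sentence-level BLEU with add-one smoothing, and comparing each tweet only against the first tweet of every group is an assumption, since the slides do not say which member of a group is used. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def clean_tweet(text):
    """Lowercase a tweet and strip URLs, #tags, and special characters."""
    text = re.sub(r"https?://\S+", " ", text)     # URLs
    text = re.sub(r"#\w+", " ", text)             # hashtags
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)   # special characters
    return " ".join(text.lower().split())


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu_score(candidate, reference, max_n=4):
    """Simplified BLEU: smoothed n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = sum(cand_ngrams.values())
        # Add-one smoothing (an assumption) keeps short tweets from scoring 0.
        log_sum += math.log((overlap + 1) / (total + 1))
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_sum / max_n)


def is_duplicate(a, b, threshold):
    """Substring test first, then the BLEU threshold from the slide (0.7)."""
    if not a or not b:
        return False
    return a in b or b in a or bleu_score(a, b) >= threshold


def group_tweets(tweets, threshold=0.7):
    """Return a list of groups; each group is a list of original tweets."""
    groups = []
    for tweet in tweets:
        cleaned = clean_tweet(tweet)
        for group in groups:
            # Assumption: the first tweet of a group represents the group.
            if is_duplicate(cleaned, clean_tweet(group[0]), threshold):
                group.append(tweet)
                break
        else:
            groups.append([tweet])  # no existing group matched
    return groups
```

For example, a tweet and its "RT ..." copy land in one group through the substring test, while an unrelated tweet starts a new group.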
8. Ranking of Tweets
• Initial scheme :
• Tweets were ordered by the number of tweets in their group.
• A higher number of retweets does not guarantee that a tweet is the most relevant one for a news item.
• Modified scheme :
• Used the news text to rank tweets.
• News text repeats keywords related to the main event, such as recall, faulty, steering, etc.,
multiple times.
• Algorithm :
• N = the most frequent words in the news text (after removing stop words).
• For every tweet t_i
• Num_Key = number of words from N present in t_i
• Rank tweets by Num_Key
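A minimal Python sketch of the modified ranking scheme follows. The stop-word list, the cut-off `k` for "top frequent words", and the function names are illustrative assumptions; the slides do not specify them.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; a real list would be longer).
STOP_WORDS = {
    "a", "an", "the", "of", "to", "in", "on", "for", "and", "or", "is", "are",
    "was", "were", "be", "will", "over", "with", "by", "at", "my", "it", "this",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(news_text, k=10):
    """N: the k most frequent non-stop words of the news text."""
    counts = Counter(w for w in tokenize(news_text) if w not in STOP_WORDS)
    return {word for word, _ in counts.most_common(k)}


def rank_tweets(tweets, news_text, k=10):
    """Order tweets by Num_Key, the number of keywords from N they contain."""
    keywords = top_keywords(news_text, k)

    def num_key(tweet):
        return len(keywords & set(tokenize(tweet)))

    # sorted() is stable, so tweets with equal Num_Key keep their input order.
    return sorted(tweets, key=num_key, reverse=True)
```

Counting distinct keywords rather than total occurrences is one reading of "number of words from N present"; the other reading would only change `num_key`.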
9. Visualization : Tweets & News
• Objective :
• To show the main problem / event reported by the news.
• Number of tweets : 8 lakh+ (over 800,000).
• Method
• Tweets were cleaned by removing special characters, #tags, URLs.
• Tweets and news descriptions were fed into OPTRA.
• Processed to extract noun phrases (NP).
• For every news item, the most frequent NPs were displayed as a tag cloud.
• Used the D3 Tag Cloud API.
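The step from noun phrases to tag-cloud input can be sketched as below. This assumes the noun phrases for one news item have already been extracted (OPTRA itself is not reproduced here) and that the cloud is fed a list of text/size records; the font-size range and the function name are hypothetical choices.

```python
from collections import Counter


def tag_cloud_entries(noun_phrases, top_n=20, min_size=12, max_size=48):
    """Turn the noun phrases of one news item into text/size records."""
    counts = Counter(p.strip().lower() for p in noun_phrases if p.strip())
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = highest - lowest
    entries = []
    for phrase, freq in top:
        # Linear scaling of frequency to font size (an assumption).
        scale = (freq - lowest) / span if span else 1.0
        size = round(min_size + scale * (max_size - min_size))
        entries.append({"text": phrase, "size": size, "count": freq})
    return entries
```

The resulting list can be serialized to JSON and passed to the front-end layout that draws the cloud.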
10. Modules 3 & 4
• Extracting user reviews/complaints
• Extraction and Processing of Data.
• Crawler for Edmunds.com
• Text Processing done in OPTRA
• Content visualization using output of
OPTRA
• Report generation for relevant content
retrieved
11. Extracting Data from Edmunds.com
• Reviews for 10 car models were extracted from Edmunds.com.
• Crawler built using the jsoup API.
• Information Extracted :
• Review date,
• Review,
• Suggested Improvement,
• Favorite features,
• Review Rating,
• Up Rating for a review and
• Down rating for a review.
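The extracted fields above map naturally onto one record per review. The sketch below shows one possible shape in Python; the field names, the raw dictionary keys, and the ISO date format are assumptions for illustration, not the crawler's actual output.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Review:
    """One customer review with the fields listed on the slide."""
    review_date: date
    review: str
    suggested_improvement: str
    favorite_features: str
    rating: float
    up_votes: int
    down_votes: int


def review_from_raw(raw):
    """Build a Review from a dict of scraped strings (hypothetical keys)."""
    return Review(
        review_date=date.fromisoformat(raw["date"]),  # assumes YYYY-MM-DD
        review=raw["review"].strip(),
        suggested_improvement=raw.get("suggested_improvement", "").strip(),
        favorite_features=raw.get("favorite_features", "").strip(),
        rating=float(raw["rating"]),
        up_votes=int(raw.get("up", 0)),
        down_votes=int(raw.get("down", 0)),
    )
```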
12. Content retrieval – linking problems across
sources
• Objective:
• Capture common problems and features discussed for a chosen entity
• Retrieve customer reviews that deal with issues reported in a news article
• Challenge – the language used in the two sources is not identical
• An approximate matching technique based on proximity was used
• Method:
• Noun Phrases (NP) and Enhanced Phrases (EP) from OPTRA are used.
• Phrases are obtained together with their frequencies.
• Algorithm :
• Fetch the Enhanced Phrases.
• Clean each phrase (remove numbers and stop words).
• If the cleaned phrase has length >= 2 :
• Preserve the phrase.
• Fetch the NPs and clean them using the same method.
• If an NP is also present as an EP :
• Boost the frequency of that Enhanced Phrase.
• Output the phrases with the highest frequency.
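The phrase-selection steps above can be sketched as follows. Several details are assumptions, because the slides leave them open: phrase length is counted in words, the boost adds the NP frequency to the EP frequency, and the stop-word list is a small illustrative one. Inputs are plain phrase-to-frequency dictionaries standing in for OPTRA output.

```python
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "on", "for", "and", "is", "my"}


def clean_phrase(phrase):
    """Drop numbers and stop words; keep the phrase only if >= 2 words remain."""
    words = [
        w for w in phrase.lower().split()
        if not w.isdigit() and w not in STOP_WORDS
    ]
    return " ".join(words) if len(words) >= 2 else None


def top_phrases(enhanced_phrases, noun_phrases, top_n=10):
    """Rank cleaned Enhanced Phrases, boosted when they also occur as NPs."""
    scores = Counter()
    for phrase, freq in enhanced_phrases.items():
        cleaned = clean_phrase(phrase)
        if cleaned:
            scores[cleaned] += freq
    for phrase, freq in noun_phrases.items():
        cleaned = clean_phrase(phrase)
        if cleaned and cleaned in scores:
            # Boost: add the NP frequency (an assumed boosting rule).
            scores[cleaned] += freq
    return scores.most_common(top_n)
```

Running the same function on phrases from news text and from customer reviews gives two ranked lists whose overlap points to problems discussed in both sources.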
20. Future Work
• Adding more sources to work together within the same
framework.
• Adding automated analysis for detecting early signals
and predicting effects.