Integrated analysis of News, views & reviews
Presented By : Naman Gupta, IIT Bombay, M.Tech - CSE
Guided By : Dr. Lipika Dey, Principal Scientist, TCS Innovation Labs - Delhi
Problem Statement
• Integrating open-source data such as news articles with social-media content from Twitter and dedicated discussion forums such as customer complaint/review websites
• Retrieval of relevant information
• Linking related information
• Visualization
• Domain : Automobile (Car)
Objective
• Support integrated analysis of structured and unstructured data.
• Twitter captures people's reactions to a news item.
• Review websites give early signals about problems faced by customers.
• Intended for use in predictive analysis in the future.
Summary of the Work
• Joint Analysis of News & Tweets
• Linking & Retrieval
• Analysis of Customer Comments (Edmunds.com)
• Visualization
Module 1 :
• Analyzing tweets with respect to a news article to capture user reaction to an event reported in the news
    • Grouping of Tweets
    • Ranking of Tweets
    • Tag Cloud
    • Tweet Distribution
    • Tweet Space
Grouping of Duplicate Tweets
• Initial Scheme : Retweets were grouped.
• Used the BLEU (Bilingual Evaluation Understudy) score to group tweets that are syntactically the same.
• BLEU Score : Measures the quality of a machine translation by its n-gram overlap with a reference translation.
• Algorithm (To Follow)
Algorithm
• Input : N tweets, Output : Tweet Groups.
• Clean each tweet by removing special characters, URLs, #tags.
• For every tweet t_i :
    • If no group is present :
        • Make a new group with tweet t_i in it.
    • Else, for tweet t_j in every existing group :
        • If t_i is a substring of t_j or t_j is a substring of t_i :
            • Add t_i to the group of t_j.
        • Else :
            • Score = BLEUScore(t_i, t_j)
            • If Score >= 0.7 :
                • Add t_i to the group of t_j.
    • If t_i was not added to any group :
        • Make a new group with tweet t_i in it.
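The grouping steps can be sketched in Python as follows. This is a minimal illustration, not the code used in the project: the `bleu` function is a simplified sentence-level BLEU (clipped unigram and bigram precision with a brevity penalty), each new tweet is compared only against the first tweet of every existing group, and the names `clean`, `bleu`, and `group_tweets` are hypothetical. Only the 0.7 threshold and the substring test come from the slide.

```python
import math
import re
from collections import Counter


def clean(tweet):
    """Lowercase a tweet and strip URLs, #tags, and special characters."""
    tweet = re.sub(r"https?://\S+", " ", tweet.lower())
    tweet = re.sub(r"#\w+", " ", tweet)
    tweet = re.sub(r"[^a-z0-9\s]", " ", tweet)
    return " ".join(tweet.split())


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped 1..max_n-gram precisions
    multiplied by the brevity penalty (an assumption, not the original scorer)."""
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        total = sum(c.values())
        overlap = sum(min(count, r[gram]) for gram, count in c.items())
        if total == 0 or overlap == 0:
            return 0.0
        log_sum += math.log(overlap / total)
    penalty = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return penalty * math.exp(log_sum / max_n)


def group_tweets(tweets, threshold=0.7):
    """Group near-duplicate tweets; groups[k][0] is the group's representative."""
    groups = []
    for raw in tweets:
        t_i = clean(raw)
        if not t_i:
            continue  # nothing left after cleaning
        for group in groups:
            t_j = group[0]
            if t_i in t_j or t_j in t_i or bleu(t_i, t_j) >= threshold:
                group.append(t_i)
                break
        else:
            # No existing group matched, so t_i starts a new group.
            groups.append([t_i])
    return groups
```

Comparing against one representative per group keeps the cost proportional to the number of groups rather than the number of tweets; comparing against every member of a group would be a stricter reading of the slide.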
Ranking of Tweets
• Initial Scheme :
    • Tweet groups were ordered by the number of tweets in each group.
    • A higher number of retweets does not guarantee the most relevant tweet for a news item.
• Modified Scheme :
    • Used the news text to rank tweets.
    • News text repeats keywords related to the main event (e.g. recall, faulty, steering) multiple times.
• Algorithm :
    • N = the most frequent words in the news text (after removing stop words).
    • For every tweet t_i :
        • Num_Key = number of words from N present in t_i
    • Rank tweets by Num_Key (highest first).
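A minimal Python sketch of this keyword-based ranking is given below. The stop-word list, the cutoff `k` for "top frequent words", and the function names are assumptions made for illustration; the slide does not specify them.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; the original list is not given).
STOP_WORDS = {
    "a", "an", "the", "of", "to", "in", "for", "and", "is", "are", "its", "on",
    "over", "after", "with", "that", "this", "it", "has", "have", "was", "were",
    "be", "by", "as", "at",
}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_keywords(news_text, k=10):
    """Return the set N of the k most frequent non-stop words in the news text."""
    counts = Counter(w for w in tokenize(news_text) if w not in STOP_WORDS)
    return {word for word, _ in counts.most_common(k)}


def rank_tweets(tweets, news_text, k=10):
    """Return (Num_Key, tweet) pairs sorted by Num_Key, highest first."""
    keywords = top_keywords(news_text, k)
    scored = [(len(keywords & set(tokenize(t))), t) for t in tweets]
    # sorted() is stable, so tweets with equal Num_Key keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Here Num_Key counts distinct keywords in a tweet, so a tweet that repeats one keyword many times is not favoured over one that covers several keywords.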
Visualization : Tweets & News
• Objective :
    • To show the main problem / event reported by the news.
• Number of Tweets : more than 8 lakh (800,000).
• Method :
    • Tweets were cleaned by removing special characters, #tags, URLs.
    • Tweets and news descriptions were fed into OPTRA.
    • Both were processed to extract Noun Phrases (NP).
    • For every news item, the most frequent NPs were displayed as a Tag Cloud.
    • Used the D3 Tag Cloud API.
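The step from noun phrases to tag-cloud input can be sketched as below. This assumes the noun phrases for one news item are already available as a list of strings (in the project they came from OPTRA, whose output format is not shown here); the function name, the `top_n` cutoff, and the linear font-size scaling are illustrative assumptions.

```python
from collections import Counter


def tag_cloud_entries(noun_phrases, top_n=20, min_size=12, max_size=48):
    """Turn a list of noun phrases into text/count/size records for a tag cloud."""
    counts = Counter(p.strip().lower() for p in noun_phrases if p.strip())
    top = counts.most_common(top_n)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = (highest - lowest) or 1  # avoid division by zero when all counts match
    return [
        {
            "text": phrase,
            "count": count,
            # Linear scaling of frequency to font size (an assumed design choice).
            "size": round(min_size + (count - lowest) * (max_size - min_size) / span),
        }
        for phrase, count in top
    ]
```

A list of such records can be serialized to JSON and handed to a browser-side tag-cloud layout, which then decides placement.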
Modules 3 & 4
• Extracting user reviews/complaints
    • Extraction and processing of data.
    • Crawler for Edmunds.com
    • Text processing done in OPTRA
• Content visualization using the output of OPTRA
• Report generation for the relevant content retrieved
Extracting Data from Edmunds.com
• Reviews for 10 car models were extracted from Edmunds.
• Crawler built using the Jsoup API.
• Information Extracted :
    • Review date
    • Review
    • Suggested improvement
    • Favorite features
    • Review rating
    • Up rating for a review
    • Down rating for a review
Content retrieval – linking problems across sources
• Objective:
    • Capture common problems and features discussed for a chosen entity
    • Retrieve customer reviews that deal with issues reported in a news article
• Challenge – the language used in the two sources is not identical
    • An approximate matching technique using proximity was used
• Method:
    • Noun Phrases (NP) and Enhanced Phrases (EP) from OPTRA are used.
    • Phrases are obtained along with their frequencies.
• Algorithm :
    • Fetch Enhanced Phrases.
    • Clean each phrase (remove numbers and stop words).
    • If the phrase has length >= 2 after cleaning :
        • Preserve the phrase.
    • Fetch NPs and clean them using the same method.
    • If an NP is also present as an EP :
        • Boost the frequency of that Enhanced Phrase.
    • Output the phrases having the highest frequency.
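The phrase-selection algorithm can be sketched in Python as follows. Several details are assumptions, since the slide leaves them open: phrases arrive as dictionaries mapping phrase to frequency, "length >= 2" is read as at least two words, and the boost adds the noun phrase's frequency to the matching enhanced phrase. The stop-word list and function names are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "an", "the", "of", "to", "in", "for", "and", "is", "my", "on", "with"}


def clean_phrase(phrase):
    """Remove pure-number tokens and stop words from a phrase."""
    tokens = re.findall(r"[a-z0-9]+", phrase.lower())
    kept = [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]
    return " ".join(kept)


def top_phrases(enhanced, noun, k=10):
    """Select the k highest-frequency phrases.

    enhanced, noun: dicts mapping phrase -> frequency (assumed input format).
    """
    scores = Counter()
    for phrase, freq in enhanced.items():
        cleaned = clean_phrase(phrase)
        if len(cleaned.split()) >= 2:  # preserve only multi-word phrases
            scores[cleaned] += freq
    for phrase, freq in noun.items():
        cleaned = clean_phrase(phrase)
        if cleaned in scores:
            scores[cleaned] += freq  # assumed boost: add the NP frequency
    return scores.most_common(k)
```

For example, with enhanced phrases `{"the power steering": 5, "2 seat": 4, "road noise": 3}` and noun phrases `{"road noise": 4, "engine": 9}`, "2 seat" is dropped after cleaning and "road noise" is boosted above "power steering".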
Screenshots
Evaluation – matching technique
Phrase                   # Sentences Retrieved   # Relevant   Accuracy (%)
Steer Wheel              37                      35           94.59
Power Steer              27                      25           92.59
Heat seat                25                      21           84
Manual Transmission      12                      8            66.67
Automatic Transmission   13                      12           92.31
Air Sensor               3                       1            33.33
Low Speed                16                      14           87.5
Evaluation – contd.
Phrase                   # Sentences Retrieved   # Relevant   Accuracy (%)
Steer bolt               1                       1            100
Lose bolt                0                       0            0
Back seat                10                      9            90
Power window             4                       4            100
Car problem              8                       4            50
Road noise               8                       7            87.5
Trunk Lid                6                       6            100
Engine Fire              5                       3            60
Evaluation
Total Sentences Retrieved   Total Relevant Sentences   Accuracy (%)
175                         150                        85.71
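The overall figure can be re-derived from the per-phrase counts. The short Python check below copies the (retrieved, relevant) pairs from the evaluation slides and computes accuracy as relevant / retrieved × 100; treating a phrase with zero retrieved sentences as 0 follows the "Lose bolt" row and is otherwise an assumption.

```python
# (retrieved, relevant) sentence counts per phrase, copied from the evaluation slides.
RESULTS = {
    "Steer Wheel": (37, 35),
    "Power Steer": (27, 25),
    "Heat seat": (25, 21),
    "Manual Transmission": (12, 8),
    "Automatic Transmission": (13, 12),
    "Air Sensor": (3, 1),
    "Low Speed": (16, 14),
    "Steer bolt": (1, 1),
    "Lose bolt": (0, 0),
    "Back seat": (10, 9),
    "Power window": (4, 4),
    "Car problem": (8, 4),
    "Road noise": (8, 7),
    "Trunk Lid": (6, 6),
    "Engine Fire": (5, 3),
}


def accuracy(retrieved, relevant):
    """Percentage of retrieved sentences judged relevant; 0.0 if none were retrieved."""
    return 100.0 * relevant / retrieved if retrieved else 0.0


total_retrieved = sum(retrieved for retrieved, _ in RESULTS.values())
total_relevant = sum(relevant for _, relevant in RESULTS.values())
overall = accuracy(total_retrieved, total_relevant)

print(total_retrieved, total_relevant, round(overall, 2))  # 175 150 85.71
```

Because the measure divides relevant sentences by retrieved sentences, it corresponds to precision; sentences that were relevant but not retrieved are not counted.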
Future Work
• Adding more sources to work together within the same framework.
• Adding automated analysis for detecting early signals and predicting effects.
Thanks