Anuj sharma i labscamp
1. ANUJ SHARMA, DATA SCIENCE PRACTICES, IMPETUS
Text Mining on Social Network Data for Business Applications
2. Content
Data
Unstructured data as business opportunity
Text mining
Sentiment Analysis and opinion mining
Feature based opinion mining
Topic Modeling
Social media based sentiment analysis
Tools for text mining
Use Cases
Hotel review demo
Advertising campaign analysis
Data Science Practices at Impetus
Wednesday, April 6, 2022 DATA SCIENCE PRACTICES, IMPETUS - INDIA 2
3. Data
Structured data: Tables, Records
Semi-structured data: XML, JSON
Unstructured data: Text, Audio, Video, conversations, Web, Wikis, Documents, Web logs…
Social Media data: Tweets, Blogs, Facebook, other social platforms
5. Unstructured Data as Business Opportunity
“Unstructured” data, such as natural language text, is distinguished from the “structured” information found in conventional spreadsheets and databases.
Unstructured data constitutes 80% of all enterprise data (Gartner Research)
Unstructured text can contain business-critical information, untapped opportunities and latent risks
Example:
Consumers’ thoughts and opinions, found in communications such as emails, web pages, surveys, contracts, blogs, wikis, and reports.
Whether it is customer complaints, employee feedback, analyst opinions, or competitors' intentions, this valuable and actionable information lies hidden in unstructured text corpora
6. Text Mining
Text mining integrates text analytics approaches, tools and solutions to extract value from unstructured data
Typical text mining tasks include:
Text categorization
Text clustering
Concept/entity extraction
Production of granular taxonomies
Sentiment analysis
Document summarization
Entity relation modeling
For a company, the successful management of unstructured information may lead to more profitable decisions and business opportunities
7. Text Mining Process
Set of algorithms for converting unstructured raw text into structured objects
The quantitative methods that analyze these objects to discover knowledge
Process steps: Text → Text processing → Feature generation → Feature selection → Data mining/Pattern discovery → Evaluation & Interpretation
Text processing and feature generation: Bag of words, Stemming, Stop words, Part of Speech tagging, Parsing, Word sense disambiguation
Feature selection: Reduce dimensionality, Delete irrelevant features
Data mining/Pattern discovery: Classification, Clustering
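The feature generation and feature selection steps can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the deck: the function names `bag_of_words` and `select_features`, the `min_df` threshold, and the sample documents are all assumptions made for the example. It builds bag-of-words count vectors and then reduces dimensionality by deleting terms that appear in too few documents.

```python
from collections import Counter

def bag_of_words(docs):
    """Turn tokenized documents into a sorted vocabulary and per-document count vectors."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors

def select_features(vocab, vectors, min_df=2):
    """Keep only terms that occur in at least min_df documents (hypothetical threshold)."""
    keep = [j for j in range(len(vocab))
            if sum(1 for vec in vectors if vec[j] > 0) >= min_df]
    return [vocab[j] for j in keep], [[vec[j] for j in keep] for vec in vectors]

# Made-up, already tokenized sample documents.
docs = [
    ["room", "clean", "staff", "friendly"],
    ["room", "small", "staff", "rude"],
    ["food", "great", "room", "clean"],
]
vocab, vectors = bag_of_words(docs)
small_vocab, small_vectors = select_features(vocab, vectors, min_df=2)
print(small_vocab)
print(small_vectors)
```

The reduced vectors could then be passed to any classification or clustering method in the data mining step.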
8. Learning from Text Mining
Classification: Spam detection, Document organization
Clustering: Trend analysis, Topic identification
Web mining: Trend analysis, Ontology creation, Opinion mining
Natural Language Processing: Text summarization, Question answering, Information extraction
9. Sentiment Analysis and Opinion Mining
Opinion mining, sentiment analysis, and subjectivity analysis refer to the computational analysis of opinion, sentiment, and subjectivity in online text
Subjectivity analysis, or subjectivity classification, automatically discriminates opinion-bearing text from objective text that presents factual information
Sentiment analysis originated from machine learning (ML), information retrieval (IR) and natural language processing (NLP)
Opinion mining originated from the Web search and IR community and involved processing search results for a given product, retrieving attributes and aggregating users’ opinions
10. Sentiment Analysis Workflow
+ve Words, -ve Words
Corpus from a large number of documents
Latent Semantic Analysis (LSA)
Example review (raw text):
Chennai Express attempts to marry the puppy-dog sentimentality of a typical Shah Rukh Khan romance with
the broad humor and the crash-bang-boom thrills of a Rohit Shetty action comedy. But the film does little justice
to either genre. A big reason for that is the lethargic pacing. Shetty has pulled off cornier stories in the past,
delivering gags and stunts at breakneck speed. This film, however, is a tough slog because the jokes aren't
funny, and the set pieces entirely rehashed. In place of a real performance, Shah Rukh resorts to the sort of
facial gymnastics that could shame an Olympian. To endure this indulgence, you have to be a die-hard fan.
Example review (after text preprocessing):
chennai express attempt marri puppi dog sentiment typic shah rukh khan romanc broad humor crash
bang boom thrill rohit shetti action comedi film littl justic either genr big reason letharg
pace shetti pull cornier stori past deliv gag stunt breakneck speed film tough slog joke funni
set piec entir rehash place real perform shah rukh resort sort facial gymnast shame olympian
endur indulg die hard fan
Text Preprocessing: Lowercase, Remove stop words, Remove punctuation, Stemming, Remove extra whitespace, Remove numbers
Sentiment Analysis and Document Polarity Strength
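The preprocessing steps listed on this slide can be sketched as a single Python function. This is a rough illustration under stated assumptions: the name `preprocess`, the tiny `STOP_WORDS` set, and the pluggable `stem` argument are made up for the example, and the deck does not say which stop-word list or stemmer was used. Stemming defaults to a no-op here; a real stemmer such as NLTK's `PorterStemmer().stem` could be passed in to produce stems like those in the example review.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "be", "but", "does", "for", "have",
              "in", "is", "of", "that", "the", "this", "to", "with", "you"}

def preprocess(text, stop_words=STOP_WORDS, stem=lambda word: word):
    """Lowercase, strip punctuation and numbers, drop stop words, then stem."""
    text = text.lower()
    # Replace ASCII punctuation with spaces so "crash-bang-boom" becomes three tokens.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Remove numbers.
    text = re.sub(r"\d+", " ", text)
    # split() with no arguments also collapses extra whitespace.
    tokens = text.split()
    return [stem(tok) for tok in tokens if tok not in stop_words]

tokens = preprocess("The film does little justice to either genre in 2013!!")
print(tokens)
```

Keeping the stemmer as a parameter avoids tying the sketch to one library while leaving the other five steps fully runnable.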
11. Feature-level Opinion Mining (FLOM)
Product features are attributes that provide additional functionality to products and play a crucial role in distinguishing similar products of different brands
FLOM provides deep analysis of online reviews by identifying the product attributes that consumers care about
By mining product features and their associated opinions, feature-level buzz monitoring and feature-level opinion summarization can be done
Buzz: a term used in word-of-mouth marketing, defined as a vague but usually positive (occasionally negative) association with or anticipation of a product or service
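One simple way to approximate a feature-level opinion summary is to count opinion words that occur near each feature mention. The sketch below assumes hand-written lexicons (`FEATURES`, `POSITIVE`, `NEGATIVE`), a hypothetical function name `feature_opinions`, a fixed token window, and made-up reviews; it is not the technique used in the deck, which does not describe its implementation.

```python
import re
from collections import defaultdict

# Hypothetical lexicons for illustration only.
FEATURES = {"battery", "screen", "camera"}
POSITIVE = {"good", "great", "excellent", "sharp"}
NEGATIVE = {"bad", "poor", "weak", "blurry"}

def feature_opinions(reviews, window=2):
    """Count positive/negative opinion words within `window` tokens of each feature mention."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0})
    for review in reviews:
        tokens = re.findall(r"[a-z]+", review.lower())
        for i, tok in enumerate(tokens):
            if tok not in FEATURES:
                continue
            nearby = tokens[max(0, i - window): i + window + 1]
            summary[tok]["positive"] += sum(w in POSITIVE for w in nearby)
            summary[tok]["negative"] += sum(w in NEGATIVE for w in nearby)
    return dict(summary)

# Made-up sample reviews.
reviews = [
    "The battery is great but the camera is blurry.",
    "Poor battery life. The screen is excellent.",
]
result = feature_opinions(reviews)
for feature, counts in sorted(result.items()):
    print(feature, counts)
```

A fixed window is crude: it can attach an opinion word to the wrong feature when two features sit close together, which is why part-of-speech tagging or parsing is often used to link opinions to features more reliably.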
12. Topic Modeling
A topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections
Identifying emerging topics in communities, trending topics in social media, and hot topics in online discussions may be critical for businesses
LDA (Latent Dirichlet Allocation) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. It was developed by David Blei, Andrew Ng, and Michael Jordan in 2003.
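For reference, the generative process that LDA assumes for each document can be written compactly. The notation below is the commonly used one and is an assumption here, since the slides define no symbols: \alpha is the Dirichlet prior, \theta_d the topic proportions of document d, z_{d,n} the topic assigned to the n-th word, w_{d,n} the word itself, and \beta_k the word distribution of topic k.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha)
  && \text{topic proportions of document } d \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d)
  && \text{topic of the } n\text{-th word} \\
w_{d,n} \mid z_{d,n} &\sim \mathrm{Multinomial}(\beta_{z_{d,n}})
  && \text{word drawn from that topic's word distribution}
\end{align*}
```

Fitting the model means inferring the unobserved \theta_d and z_{d,n} (the "unobserved groups") from the observed words.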
13. Topic Modeling Contd…
“I listen to Motorhead, Pink Floyd and Metallica whenever I’m travelling in my car.”
Topic modeling might predict this text as 75% about music and 25% about cars
"dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear more often in documents about cats, and "the" and "is" will appear about equally in both.
What types of analysis LDA can perform:
◦ Topic identification
◦ Which topics are similar?
◦ Which documents are similar based on topic allocations?
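The 75% / 25% split for the example sentence on this slide can be reproduced with a toy calculation. This is not LDA inference: the keyword lists in `TOPIC_KEYWORDS` are hand-written stand-ins for topics that a real model would learn from a corpus, and the function name `topic_proportions` is hypothetical. The sketch only shows what a per-document topic mixture looks like.

```python
import re

# Hand-written keyword lists standing in for learned topics (an assumption for illustration).
TOPIC_KEYWORDS = {
    "music": ["motorhead", "pink floyd", "metallica"],
    "cars": ["car"],
}

def topic_proportions(text, topic_keywords=TOPIC_KEYWORDS):
    """Share of keyword mentions per topic; whole-word matches only."""
    text = text.lower()
    counts = {
        topic: sum(len(re.findall(r"\b" + re.escape(kw) + r"\b", text)) for kw in kws)
        for topic, kws in topic_keywords.items()
    }
    total = sum(counts.values())
    return {topic: (n / total if total else 0.0) for topic, n in counts.items()}

sentence = "I listen to Motorhead, Pink Floyd and Metallica whenever I'm travelling in my car."
proportions = topic_proportions(sentence)
print(proportions)
```

Three band names and one car mention give a mixture of 0.75 music and 0.25 cars. In an actual topic model both the topics and the mixture are estimated from data instead of being specified by hand.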
14. Social Media based Sentiment Analysis
Data from Internet and Web 2.0
• Buzz Monitoring
• Sentiment Analysis
• Content Categorization
Trends Detection and Recommendation
• Brand Image Monitoring
• Sentiment trends in customer comments
• Discovering undercurrents & recommending adjustments
• Overall vs. Service Attributes based Sentiment Extraction
• Real-time monitoring of consumer perceptions
• Identification of data sources (Twitter/Facebook/Discussion boards)
• Collection of consumer-expressed text
15. Social media SA solution framework
Data from Internet and Web 2.0
Overall/Attribute level Sentiment Analysis
Trends Detection and Recommendation
Service Attributes Identification
Buzz Monitoring and Summary Report
Content Categorization
Sentiment Trends Summary
Classified Content/Topics*
* Not included in this work
16. Tools for Text mining
17. Use Case
- HOTEL REVIEW
- ADVERTISING CAMPAIGN
18. Hotel Review demo
Objective:
Analyze hotel review text data
Calculate each hotel’s rating based on sentiment analysis of its reviews
Visualize data on maps in a web-based platform with features like zooming, clicking and hover
Design a web-based user interface for larger data
Data:
254,574 reviews of 2,560 hotels
6 Countries: UAE, Canada, China, India, UK, USA
10 Cities: Beijing, Dubai, England, Illinois, Montreal, Nevada, New Delhi, New York City, Quebec, San Francisco, Shanghai
19. Hotel Review demo Contd…
Data Science :
LSI (Latent Semantic Indexing)
Sentiment Analysis
Part of speech tagging
Feature extraction
Feature based opinion mining
Open search : Apache Solr
Database : Apache Cassandra
Maps : Google Maps API
Sentiment score for each hotel, based on sentiment analysis of its reviews: the polarity of each review is calculated from its positive and negative words
Hotel feature based opinion mining for the following features: Food, Room, Location, Service, Price, etc.
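A hotel-level sentiment score of this kind can be sketched as follows. The word lists, the polarity formula (positive minus negative, divided by their sum), the linear mapping onto a 1 to 5 rating, and the names `review_polarity` and `hotel_ratings` are all assumptions chosen for illustration; the deck does not specify how polarity was turned into a rating.

```python
import re
from collections import defaultdict

# Hypothetical opinion lexicons for illustration only.
POSITIVE = {"clean", "friendly", "great", "good", "comfortable"}
NEGATIVE = {"dirty", "rude", "noisy", "bad", "poor"}

def review_polarity(text):
    """Polarity in [-1, 1]: (pos - neg) / (pos + neg); 0.0 when no opinion words occur."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(tok in POSITIVE for tok in tokens)
    neg = sum(tok in NEGATIVE for tok in tokens)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0

def hotel_ratings(reviews):
    """Average review polarity per hotel, rescaled linearly from [-1, 1] to a 1-5 rating."""
    by_hotel = defaultdict(list)
    for hotel, text in reviews:
        by_hotel[hotel].append(review_polarity(text))
    return {hotel: round(3 + 2 * sum(p) / len(p), 2) for hotel, p in by_hotel.items()}

# Made-up (hotel, review text) pairs.
reviews = [
    ("Hotel A", "Clean room, friendly staff, great location."),
    ("Hotel A", "Good food but noisy room."),
    ("Hotel B", "Dirty room and rude service."),
]
ratings = hotel_ratings(reviews)
print(ratings)
```

Restricting the token scan to sentences that mention Food, Room, Location, Service or Price would give per-feature scores in the same way.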
21. Advertising Campaign Buzz Monitoring
Social Media Monitoring and Analysis for VISA Advertising Campaign
Feature-level Buzz Summary for Company name, Campaign and other hidden features
Blogs Analysis for campaign name
Data collected from:
Tweets for 1 month
4000+ Tweets (including 1440 re-tweets)
43 Blogs and comments were crawled and analyzed
Features
22. Results for blogs
Pie charts (legend: Negative, Positive, Neutral), values as listed on the slide:
Blogs: 83%, 17%, 0%
Comments: 33%, 43%, 24%
Blogs & Comments: 39%, 40%, 21%
Bar chart: Feature-level Buzz Summary for 3 features (y-axis: Number of Blogs, 0 to 18; series: Neutral, Positive, Negative)
23. Insights
The ad campaign has an overall negative sentiment associated with it
People are using hashtags such as #bad and #worstadsever to express negative sentiment
There is a spike in associated tweets on a particular day due to frequent advertisements
This analysis is based on only 30 days of tweets
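Hashtag usage and re-tweet counts like those mentioned in this use case can be tallied with a few lines of Python. The sketch below is illustrative only: the sample tweets are made up, the function names `hashtag_counts` and `is_retweet` are hypothetical, and treating any tweet that starts with "RT @" as a re-tweet is a simplifying heuristic, not the method used for the campaign analysis.

```python
import re
from collections import Counter

def hashtag_counts(tweets):
    """Count hashtags across tweets, ignoring case."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", tweet))
    return counts

def is_retweet(tweet):
    """Heuristic: classic re-tweets start with 'RT @'."""
    return tweet.startswith("RT @")

# Made-up sample tweets.
tweets = [
    "This campaign is everywhere #bad #worstadsever",
    "RT @someone: This campaign is everywhere #bad #worstadsever",
    "Saw the ad again today #Bad",
]
counts = hashtag_counts(tweets)
retweets = sum(is_retweet(t) for t in tweets)
print(counts.most_common())
print(retweets)
```

Grouping the same counts by posting date would expose day-level spikes in tweet volume.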
24. Data Science Practices at Impetus
Team
Nitin Agarwal
Joydeb Mukherjee
Ashutosh Gupta
Bibudh Lahiri
Anuj Sharma
Syed Bilal
Rohit Gupta
Ankit Sharma
Work
Statistical model development
Text mining
Financial data analysis
Healthcare data analysis
Manufacturing data analysis
Web analytics
Funnel Analysis