WEB CONTENT MINING
12/3/2018
Web Content Mining
• Mining Knowledge from web pages
• Involves extraction of
– structured data
• List pages and detailed pages
• Lists of products, services, jobs, music etc.
– Unstructured data
• Text data
12/3/2018 Professor V. Nagadevara
Web Data Sources
• Web server log data
• Page tags
• Cookies
• Web content
• Web documents
• Hyperlinks
Web Content Mining
• Structured data
• Unstructured data
List Page
List
Detailed page
Unstructured Data
• Opinions
• Reviews
• News items
• Uses many concepts from “Text Mining”
Text content on the web
• Large amount of information
– Document databases
– Research papers
– News articles
– Books
– Email messages
– Blogs
– Etc.
Data Mining Vs. Web Content Mining
• Data Mining extracts knowledge from
structured data
– Credit card, Insurance records, Call records etc.
• Web Content Mining works with unstructured
text documents (in addition to structured
data)
– Written language is not structured
Pre-processing
• Requires considerable pre-processing
• Web Page Pre-processing
• Text Pre-processing
Web page pre-processing
• Web pages are different from text pages
(documents)
• HTML contains different text “fields”
– Title, metadata, body etc.
– Titles and emphasized text are given higher weight (header tags <h1>,
<h2>; bold tag <b> etc.)
– Spammers can exploit this weighting to “spam” web pages!
Web page pre-processing
• Anchor text
– Anchor text associated with a hyperlink is treated
specially
– It is a concise and better description of the
content contained in the page pointed to by the
link
– More important if it points to a page in a different
site
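The field-based treatment described on the last two slides can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the method used in the course: the class name `FieldExtractor`, the helper `extract_fields`, the choice of `<h1>`/`<h2>` as heading tags and the sample page are all assumptions made for the example.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2"}  # assumed set of "heading" tags for this sketch


class FieldExtractor(HTMLParser):
    """Sort the text of a page into title, heading, anchor and body fields."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "heading": [], "anchor": [], "body": []}
        self._open_tags = []  # tags that are currently open

    def handle_starttag(self, tag, attrs):
        self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop everything up to and including the matching opening tag.
        if tag in self._open_tags:
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if "title" in self._open_tags:
            field = "title"
        elif "a" in self._open_tags:
            field = "anchor"
        elif HEADING_TAGS.intersection(self._open_tags):
            field = "heading"
        else:
            field = "body"
        self.fields[field].append(text)


def extract_fields(html):
    """Return a dict mapping field name to the list of text chunks found."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


# Hypothetical page used only to exercise the sketch.
SAMPLE_PAGE = (
    "<html><head><title>Fruit shop</title></head>"
    "<body><h1>Fresh fruit</h1><p>We sell <b>mangoes</b> and "
    '<a href="http://www.example.com/durian">durian growers guide</a>.</p>'
    "</body></html>"
)

fields = extract_fields(SAMPLE_PAGE)
print(fields["title"], fields["heading"], fields["anchor"])
```

Keeping anchor text in its own field makes it possible to weight it separately, or to drop it when cleaning the main text of a page.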
Web page pre-processing
• Removing HTML Tags
– Information is presented in a number of blocks
– Removing HTML tags combines these blocks
– Many text phrases are pointers!
– Some are buttons!
http://en.wikipedia.org/wiki/Main_Page
Web page pre-processing
• Identifying main content blocks
• Web pages contain large amount of information
which is not part of main text
• E.g. banner ads, navigation bars, copyright
notices, anchor texts
• Clean the main text by removing anchor texts,
navigation links etc.
Web page pre-processing
• Duplicate detection
• Web sites maintain “mirror sites” and
“duplicate pages”
• The objective is to provide fast downloading of files
and pages
• Identifying these is important
• Similarities can be identified by using the “Jaccard
Coefficient” (discussed later)
Text Content Pre-processing
• Word Level
• Sentence Level
• Document Level
• Document-Collection Level
• Linked-Document-Collection Level
• Application Level
• Stop word removal
– Typically: a, an, the, of, with, about etc.
– “form” may be a stop word in chemical industry,
but not in other domains
• Stemming
– Words which are slightly different are treated as
one.
– Use the linguistic “stem” of the group
• Learn, learner, learning, learns, learned
Stemming
• Different forms of the same word are usually
problematic for text data analysis, because
they have different spellings but similar
meanings (e.g. learns, learned, learning, …)
• Stemming is a process of transforming a word
into its stem (normalized form)
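Stop word removal and stemming can be illustrated with a deliberately naive sketch. The stop word list is the one given on the earlier slide; the suffix list and the function names (`tokenize`, `naive_stem`, `preprocess`) are assumptions for this example. Real stemmers such as Porter's apply far more careful rules, so treat this only as a picture of the idea.

```python
import re

# Stop words listed on the slide; a real list would be domain specific.
STOP_WORDS = {"a", "an", "the", "of", "with", "about"}

# Illustrative suffixes only (an assumption, not a linguistic rule set).
SUFFIXES = ("ing", "ed", "er", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(w) for w in tokenize(text) if w not in STOP_WORDS]


stems = [naive_stem(w) for w in ["learn", "learner", "learning", "learns", "learned"]]
print(stems)
print(preprocess("The learner is learning about the web"))
```

All five forms from the slide map to the single stem "learn", which is the effect stemming is meant to achieve.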
N-Grams
• A simple way of generating phrases is to use frequent n-grams:
– An n-gram is a sequence of n consecutive words (e.g. “machine learning”
is a 2-gram)
– “Frequent n-grams” are the ones which appear “MinFreq” or more times
across all observed documents
• N-grams are attractive because a simple and efficient dynamic
programming algorithm finds them:
• Given:
– Set of documents (each document is a sequence of words),
– MinFreq (minimal n-gram frequency),
– MaxNGramSize (maximal n-gram length)
• for Len = 1 to MaxNGramSize do
– Generate candidate n-grams as sequences of words of size Len using
frequent n-grams of length Len-1
– Delete candidate n-grams with frequency less than MinFreq
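The level-wise procedure above can be sketched in Python as follows. The function name `frequent_ngrams`, the counting of occurrences across all documents together, and the toy documents are assumptions made for illustration; the slide does not fix these details.

```python
from collections import Counter


def frequent_ngrams(documents, min_freq, max_ngram_size):
    """Return {n-gram tuple: count} for n-grams seen at least min_freq times.

    documents is a list of word lists. A candidate of length Len is counted
    only if both of its (Len-1)-word sub-sequences were frequent.
    """
    frequent = {}
    previous = set()  # frequent n-grams found in the previous pass
    for length in range(1, max_ngram_size + 1):
        counts = Counter()
        for words in documents:
            for i in range(len(words) - length + 1):
                gram = tuple(words[i:i + length])
                if length > 1 and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue  # cannot be frequent, so skip counting it
                counts[gram] += 1
        # Delete candidates whose frequency is below min_freq.
        previous = {g for g, c in counts.items() if c >= min_freq}
        if not previous:
            break
        frequent.update({g: counts[g] for g in previous})
    return frequent


# Hypothetical toy collection.
DOCS = [
    ["web", "content", "mining"],
    ["web", "content", "mining", "tools"],
    ["text", "mining"],
]

result = frequent_ngrams(DOCS, min_freq=2, max_ngram_size=3)
for gram, count in sorted(result.items(), key=lambda item: (len(item[0]), item[0])):
    print(" ".join(gram), count)
```

Pruning by the shorter frequent n-grams is what keeps the candidate set small: "mining tools" is never counted because "tools" on its own is already infrequent.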
Jaccard Coefficient
• Example: “John Went to School With His Brother”
• Translated into five 3-gram phrases
• “John Went to”; “Went to School”; “to School
With”; “School With His”; “With His Brother”
• $S_n(d_i)$ is the set of distinct n-grams in
document $d_i$
• $\mathrm{Sim}(d_1, d_2) = \dfrac{|S_n(d_1) \cap S_n(d_2)|}{|S_n(d_1) \cup S_n(d_2)|}$
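A short sketch of the Jaccard coefficient over word 3-grams, using the sentence from the slide. The second sentence, the function names `ngram_set` and `jaccard`, the lower-casing of words and the convention of returning 0.0 when both sets are empty are assumptions made for this example.

```python
def ngram_set(text, n=3):
    """Return the set of distinct word n-grams in text (case-insensitive)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(d1, d2, n=3):
    """Jaccard coefficient: |intersection| / |union| of the n-gram sets."""
    s1, s2 = ngram_set(d1, n), ngram_set(d2, n)
    union = s1 | s2
    if not union:
        return 0.0  # convention chosen for this sketch
    return len(s1 & s2) / len(union)


DOC_1 = "John Went to School With His Brother"
DOC_2 = "John Went to School With His Sister"  # hypothetical near-duplicate

print(len(ngram_set(DOC_1)))
print(jaccard(DOC_1, DOC_2))
```

The two sentences share four of their six distinct 3-grams, so the similarity is 4/6. Near-duplicate pages and mirror pages would score close to 1.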
Web Spamming
• Web Search has become an integral part
• Increased exposure leads to higher financial gains
• Rank positions of pages become important
• Spamming vs. page optimization
• Spamming refers to actions that do not increase
the information value of the page, but increase
the rank position dramatically by misleading the
search algorithm to rank it high
Web Spamming
• Content Spamming – Leverage TF-IDF
– Make some pages more relevant by tailoring the
contents of the text fields
– Title tags
– Meta tags (author, abstract, key words etc.)
– Body: use spam terms in the body
– Anchor Text: appropriate words in links
– URL: include spam terms in URL
Web Spamming
• Spam Techniques
– Repeating some important terms: increases the TF
scores. Spam terms are woven into some
sentences: “the picture mining quality of this
camera mining is amazing”
– Dumping many unrelated terms: this will make the
page relevant to many queries. Spammers simply
copy many sentences from related pages
– Use frequently searched terms: holiday packages
with cruise liners put “Tom Cruise” in ad pages!
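Why repeating a term helps a spammer can be seen from a small TF-IDF calculation. The sketch below uses one common variant (term frequency divided by document length, natural-log inverse document frequency); the slides do not say which variant is meant, so that choice, the function name `tf_idf` and the three toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(term, doc, corpus):
    """TF-IDF of term in doc, where doc and corpus entries are word lists."""
    tf = Counter(doc)[term] / len(doc)
    df = sum(1 for d in corpus if term in d)  # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


honest = "the picture quality of this camera is amazing".split()
# Spam sentence from the slide: "mining" is woven in twice.
spam = "the picture mining quality of this camera mining is amazing".split()
other = "data mining extracts knowledge".split()  # hypothetical third page
corpus = [honest, spam, other]

honest_score = tf_idf("mining", honest, corpus)
spam_score = tf_idf("mining", spam, corpus)
print(honest_score, spam_score)
```

The honest page scores zero for the query term "mining", while the spam page gets a positive score without gaining any information value.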
Link Spamming
• We will discuss this next class when we take
up link analysis
• Case – Analyzing Customer Reviews
Agenda
• Business Problem
• Data Set
• Classification
• Text Mining
Amazon.com
1. Business Problem
[Figure labels: Reviews; Rating; Feedbacks; Summary; Sales Rank]
1. Business Problem
Business Problem
– Predict sales rank of audio
CD using classification
techniques
– Sentiment analysis of audio
CD reviews using text
analytics
– Obtain categorical insights
on audio CD reviews using
text analytics
1. Business Problem
Data Set
– amazon-memberinfo-locations.txt
– amazon-member-shortSummary.txt
– productInfoXML-reviewed-mProducts.txt
– productInfoXML-reviewed-AudioCDs.txt
– reviewsNew.rar
– productinfo.rar
– Booksinfo.txt
2. Data Set
CLASSIFICATION – C5.0
Data Preparation
• Extraction
– Perl Scripts
• Audio CD
– 35,305 Tuples
– 50 variables
• Binning
Actual / Classified as   > 10000   <= 10000
> 10000                  TN        FP
<= 10000                 FN        TP
TV=T: if the sales rank <= 10000
TV=F: if the sales rank > 10000
3. Classification
Model Construction
• C5.0
• Entire Data Set
Actual / Classified As > 10000 <= 10000
> 10000 34195 5
<= 10000 181 924
Model TN FP FN TP
Entire Data set 99.99% 0.01% 16.38% 83.62%
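The percentages in the table follow from the counts in the confusion matrix above: each rate is computed within its actual class. The sketch below reproduces them; the function name `confusion_rates` is an assumption, and the counts are the ones shown on the slide.

```python
def confusion_rates(tn, fp, fn, tp):
    """Return per-class rates in percent for a 2x2 confusion matrix."""
    negatives = tn + fp  # actual sales rank > 10000
    positives = fn + tp  # actual sales rank <= 10000
    return {
        "TN": 100 * tn / negatives,
        "FP": 100 * fp / negatives,
        "FN": 100 * fn / positives,
        "TP": 100 * tp / positives,
    }


rates = confusion_rates(tn=34195, fp=5, fn=181, tp=924)
print({name: round(value, 2) for name, value in rates.items()})
```

The four counts add up to the 35,305 tuples of the audio CD data set. Only 1,105 of them are in the top 10000, which is why the later slides turn to over sampling and differential error weights.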
3. Classification
Model Construction
• Over Sampling
• Maj : Min – 2 : 1
Model TN FP FN TP
Training 99.57% 0.43% 1.16% 98.84%
Testing 98.74% 1.26% 12.15% 87.85%
3. Classification
Model Construction
• Boosting
– 10 trials
– 5 trials
10 Trials TN FP FN TP
Training 99.83% 0.17% 0.46% 99.54%
Testing 99.17% 0.83% 12.15% 87.85%
5 Trials TN FP FN TP
Training 99.98% 0.02% 0.23% 99.77%
Testing 99.41% 0.59% 12.15% 87.85%
3. Classification
Model Construction
• Differential Error Weights
• 5, 4 and 4.5
Cost 5 TN FP FN TP
Training 89.69% 10.31% 0.00% 100.00%
Testing 89.31% 10.69% 5.90% 94.10%
Cost 4 TN FP FN TP
Training 97.63% 2.37% 0.00% 100.00%
Testing 94.52% 5.48% 10.42% 89.58%
Cost 4.5 TN FP FN TP
Training 92.50% 7.50% 0.12% 99.88%
Testing 90.87% 9.13% 9.03% 90.97%
3. Classification
Conclusion
• Predicting top 10000 sales rank for audio CDs with 91% accuracy.
3. Classification
Conclusions - Classification
– # of reviews
– # of feedbacks
– Helpful feedbacks,
ratios
– Min member rank and
mean member rank
3. Classification
TEXT ANALYTICS
Text Mining
4. Text Analytics
Business Problem
– Sentiment analysis of
audio CD reviews using
text analytics
– Obtain categorical insights
on audio CD reviews using
text analytics
4. Text Analytics
Overall sentiment analysis
4. Text Analytics
Overall sentiment analysis
4. Text Analytics
“Customized” Package
4. Text Analytics
Results
Type of Comment Average Rating
Only Positive 4.540107
Mixed 3.122253
Neutral 4.423077
4. Text Analytics
Results
4. Text Analytics
Category Analysis
Category Analysis
4. Text Analytics
Clustering - Results
[Figure: cluster diagrams. Recoverable cluster labels: Fans, Generics, Musicians, Experts, Outliers, NULL. Recoverable terms: Tracks, Songs, Musicians, Album, Artist, Band, Movie, Vocals, Music, memory device]
Associations - Results
4. Text Analytics
Application
• Project done by Earlier Batch Students
• Business Objectives
– Measurement of the behavior of visitors on a web site.
– Which sources bring the most customers, orders and revenue?
What is my conversion funnel? – Campaign Effectiveness
– What navigational paths lead to most conversions? – Product
View Associations
– Does my online channel have usability issues slowing down the
adoption?
Mining Objectives
• Which products are viewed and bought
together to form recommendations
• Which web campaigns are most effective in
bringing the visitors to the website
• Which web campaigns are most effective in
improving the conversion rate
Bobsfruitsite with index.php
Sample Records
• 2003-09-03 00:00:00 76.192.65.6 GET /index.php - 200 Mozilla/4.0+(IE5.5;Win2k) - http://www.google.com/search?chr=UFT&p=buy+fruit+loom&lang=eng
• 2003-09-03 00:00:13 76.192.65.6 GET /fruit.php id=7&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) SESSIONID=COOKIE15 http://www.bobsfruitsite.com/index.php
• 2003-09-03 00:00:28 76.192.65.6 GET /fruit.php id=4&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) SESSIONID=COOKIE15 http://www.bobsfruitsite.com/fruit.php?id=7&topic=Description
Challenges
• Voluminous data – 64000 records.
• Cleaning (98% perspiration, 2% inspiration): effort spent
on filtering junk records and extracting derived variables.
• Deriving Variables
– Transaction Id
– Product Viewed
– Product Bought
– Source (referrer)
– Campaign
– Creative
Variables
• Transaction – Id : Derived variable
– This was derived by combining the IP address, Cookie and the date
fields. This gave us a unique identifying number that was used as the
transaction-id.
• Page Visited : Direct variable
– The unique values of this variable were:
• a) cart.php: the page which indicates that the
user has bought the product
• b) index.php: the first page of the website, which the user
reaches either directly or from a referring source (Google, Yahoo,
AOL etc.)
• c) fruit.php: the page that is displayed when the user clicks on
any of the fruits
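Deriving the transaction id from a log record can be sketched as below. The field names in `FIELDS` are assumptions inferred from the sample records (ten whitespace-separated values per record), and joining the IP address, cookie and date into a string key is only one possible reading of "combining" them; the project's actual scheme is not given on the slides.

```python
# Assumed field order, inferred from the sample records.
FIELDS = (
    "date", "time", "ip", "method", "uri_stem",
    "uri_query", "status", "user_agent", "cookie", "referrer",
)


def parse_record(line):
    """Split one log line into a dict keyed by the assumed field names."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def transaction_id(record):
    """Combine IP address, cookie and date into one identifying key."""
    return "|".join((record["ip"], record["cookie"], record["date"]))


# Second sample record from the slide.
SAMPLE = (
    "2003-09-03 00:00:13 76.192.65.6 GET /fruit.php "
    "id=7&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) "
    "SESSIONID=COOKIE15 http://www.bobsfruitsite.com/index.php"
)

record = parse_record(SAMPLE)
print(record["uri_stem"], transaction_id(record))
```

Records that do not split into exactly ten values raise an error, which gives a simple hook for filtering junk records during cleaning.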
Variables
• Referrer: Derived Variable
– This variable indicates the source from which the
user came. The value for the referrer
was buried inside the dataset field called cs-uri-query.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
• Campaign : Derived Variable
– This variable indicates the kind of campaign that
was carried out on the referrer's website and
brought the user to the “bobsfruit” website. Each
campaign was represented by a number, starting
from 1.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
• Creative: Derived Variable
– This variable indicates the subset of the
campaign that was carried out on the referrer's
website and brought the user to the “bobsfruit”
website. Like campaign, creative was
represented by a number, starting from 1.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
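The three derived variables can be pulled out of the query string with the standard `urllib.parse.parse_qs` function. The sketch assumes that the `source` parameter carries the referrer, as the example string suggests; the function name `derive_campaign_fields` is hypothetical.

```python
from urllib.parse import parse_qs

# Query string shown on the slides (value of the cs-uri-query field).
QUERY = (
    "source=google&group=3&campaign=3&id=3&creative=1"
    "&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google"
)


def derive_campaign_fields(query):
    """Extract referrer, campaign and creative; missing keys become None."""
    params = parse_qs(query)

    def first(key):
        return params.get(key, [None])[0]

    return {
        "referrer": first("source"),  # assumption: source holds the referrer
        "campaign": first("campaign"),
        "creative": first("creative"),
    }


derived = derive_campaign_fields(QUERY)
print(derived)
```

A record whose query field is just "-" yields None for all three variables, so visits that did not come through a campaign stay distinguishable.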
Mining Techniques
• Association Rules
– Trx ID
– Product id Viewed
– Product id Bought
• Classification Trees
– Referrer
– Campaign
– Creative
– Product Viewed/Bought
Mining Results
• Association Rules
– [Mango] ==> [Durian]
• Support:45.2344%
• Confidence:75.5900%
• Lift:1.3022
– [Orange]+[Durian]+[Apple] ==> [Mango]
• Support:35.0781%
• Confidence:87.1800%
• Lift:1.4568
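Support, confidence and lift for a rule A ==> B can be computed from a list of transactions as sketched below. The function name `rule_metrics` and the five toy transactions are assumptions for illustration and are not the project's data.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent ==> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_cons = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n == 0 or n_ante == 0 or n_cons == 0:
        raise ValueError("rule is undefined on these transactions")
    support = n_both / n              # share of transactions with A and B
    confidence = n_both / n_ante      # share of A-transactions that have B
    lift = confidence / (n_cons / n)  # confidence relative to B's base rate
    return support, confidence, lift


# Hypothetical sessions: each set holds the products viewed in one session.
SESSIONS = [
    {"Mango", "Durian"},
    {"Mango", "Durian", "Apple"},
    {"Mango", "Apple"},
    {"Durian"},
    {"Orange"},
]

support, confidence, lift = rule_metrics(SESSIONS, {"Mango"}, {"Durian"})
print(round(support, 4), round(confidence, 4), round(lift, 4))
```

Because lift is confidence divided by the base rate of the consequent, the first rule on the slide implies that Durian was viewed in roughly 75.59% / 1.3022 ≈ 58% of transactions. A lift below 1, as for some of the "products bought" rules, means the antecedent makes the consequent less likely than its base rate.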
Association Rules (Products Viewed)
Rule                    Support   Confidence   Lift
[Mango] ==> [Durian]    45.23%    75.59%       1.30
[Durian] ==> [Mango]    45.23%    77.93%       1.30
[Mango] ==> [Apple]     51.87%    86.68%       1.27
Association Rules (Products Viewed)
Rule                                         Support   Confidence   Lift
[Orange]+[Durian] ==> [Mango]                37.03%    85.87%       1.43
[Orange]+[Durian]+[Apple] ==> [Mango]        35.07%    87.18%       1.45
[Orange]+[Star Fruit]+[Mango] ==> [Apple]    34.21%    94.81%       1.39
Association Rules (Products Bought)
Rule                           Support   Confidence   Lift
[Kiwi Fruit] ==> [Orange]      5.46%     38.10%       1.38
[Star Fruit] ==> [Apple]       5.46%     25.53%       0.80
[Passion Fruit] ==> [Orange]   5.46%     30.38%       1.10
[Banana] ==> [Apple]           5.58%     29.52%       0.93
[Star Fruit] ==> [Orange]      5.80%     27.13%       0.98
[Passion Fruit] ==> [Apple]    6.15%     34.18%       1.08
[Orange] ==> [Apple]           7.17%     26.14%       0.82
Mining Results (Classification)
• Classification Tree Rules
– IF campaign = 1 THEN product bought = n
(No product was bought)
• Support:27.3%
– IF campaign = 2 AND creative = 0 THEN product
viewed = n (No product was viewed)
• Support:2.6%
– IF campaign = 2 AND creative = 1 THEN
product bought = y (product was bought)
• Support:8.4%
Questions?


Web content mining

  • 2. Web Content Mining • Mining Knowledge from web pages • Involves extraction of – structured data • List pages and detailed pages • Lists of products, services, jobs, music etc. – Unstructured data • Text data 12/3/2018 Professor V. Nagadevara
  • 3. Web Data Sources • Web server log data • Page tags • Cookies • Web content • Web documents • Hyperlinks 12/3/2018 Professor V. Nagadevara
  • 4. Web Content Mining • Structured data • Unstructured data 12/3/2018 Professor V. Nagadevara
  • 7.
  • 8. Unstructured Data • Opinions • Reviews • News items • Uses many concepts from “Text Mining”
  • 9. Text content on the web • Large amount of information – Document databases – Research papers – News articles – Books – Email messages – Blogs – Etc.
  • 10. Data Mining Vs. Web Content Mining • Data Mining extracts knowledge from structured data – Credit card, Insurance records, Call records etc. • Web Content Mining works with unstructured, text documents (in addition to structured data) – Written language is not structured
  • 11. Pre-processing • Requires considerable pre-processing • Web Page Pre-processing • Text Pre-processing
  • 12. Web page pre-processing • Web pages are different from text pages (documents) • HTML Contains different text “fields” – Title, meta data, body etc. – Titles given higher weightage (header tags <h1>, <h2>; bold tag <b> etc.) – One can leverage this for “Spamming” web pages!
  • 13. Web page pre-processing • Anchor text – Anchor text associated with a hyperlink is treated specially – It is a concise and better description of the content contained in the page pointed to by the link – More important if it points to a page in a different site
  • 14. Web page pre-processing • Removing HTML Tags – Information is presented in a number of blocks – Removing HTML tags combines these blocks – Many text phrases are pointers! – Some are buttons! http://en.wikipedia.org/wiki/Main_Page
  • 17. Web page pre-processing • Identifying main content blocks • Web pages contain a large amount of information that is not part of the main text • E.g. banner ads, navigation bars, copyright notices, anchor texts • Clean the main text by removing anchor texts, navigation links etc.
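The tag-removal and anchor-text points above can be sketched with Python's standard `html.parser` module. This is only an illustrative sketch, not the method used in the lecture: the class name `TextAndLinks`, the helper `split_page`, and the sample page are assumptions made for the example. It keeps anchor text apart from the remaining text blocks, so that links can be weighted or discarded separately.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible text blocks and anchor (link) text separately."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # text found outside <a> elements
        self.anchors = []   # (href, anchor text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._skip = 0      # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            # A missing or empty href is stored as an empty string.
            self._href = dict(attrs).get("href") or ""

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "a":
            self._href = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        if self._href is not None:
            self.anchors.append((self._href, text))
        else:
            self.blocks.append(text)


def split_page(html):
    """Return (text blocks, anchor pairs) for one HTML document."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return parser.blocks, parser.anchors


# Hypothetical page used only for illustration.
sample = (
    "<html><head><title>Fruit shop</title></head><body>"
    "<h1>Mangoes</h1><p>Fresh <b>mangoes</b> daily.</p>"
    '<a href="/cart.php">Buy now</a></body></html>'
)
blocks, anchors = split_page(sample)
```

Note that the text inside `<b>` ends up as its own block, which illustrates the slide's remark that removing tags changes how blocks of information are combined.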
  • 18. Web page pre-processing • Duplicate detection • Web sites maintain “mirror sites” and “duplicate pages” • The objective is to provide fast downloading of files and pages • Identifying these is important • Similarities can be identified by using the “Jaccard Coefficient” (discussed later)
  • 19. Text Content Pre-processing • Word Level • Sentence Level • Document Level • Document-Collection Level • Linked-Document-Collection Level • Application Level
  • 20. • Stop word removal – Typically: a, an, the, of, with, about etc. – “form” may be a stop word in the chemical industry, but not in other domains • Stemming – Words that differ only slightly are treated as one – Use the linguistic “stem” of the group • Learn, learner, learning, learns, learned
  • 21. Stemming • Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …) • Stemming is the process of transforming a word into its stem (normalized form)
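A minimal sketch of stop word removal followed by stemming is given here. The suffix list and the minimum stem length are assumptions chosen so that the slide's example (learn, learner, learning, learns, learned) collapses to one stem. This is a toy suffix stripper, not the Porter stemmer or any other published algorithm.

```python
# Stop words taken from the slide; a real list would be domain-specific.
STOP_WORDS = {"a", "an", "the", "of", "with", "about"}

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stop_words=STOP_WORDS):
    """Lower-case, split on whitespace, drop stop words, stem the rest."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in stop_words]


stems = preprocess("The learner learns about learning and learned a lot")
```

Such a crude rule over-stems many English words (for example "sing" is left alone only because of the length check), which is why practical systems use a tested stemmer or lemmatizer.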
  • 22. N-Grams • A simple way of generating phrases is to use frequent n-grams: – An n-gram is a sequence of n consecutive words (e.g. “machine learning” is a 2-gram) – “Frequent n-grams” are the ones that appear “MinFreq” or more times across all observed documents • N-grams are interesting because of the simple and efficient dynamic programming algorithm: • Given: – Set of documents (each document is a sequence of words), – MinFreq (minimal n-gram frequency), – MaxNGramSize (maximal n-gram length) • for Len = 1 to MaxNGramSize do – Generate candidate n-grams as sequences of words of size Len using frequent n-grams of length Len-1 – Delete candidate n-grams with a frequency less than MinFreq
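The loop on the slide can be sketched as follows. Two points are assumptions, since the slide does not state them: frequency is counted as total occurrences across all documents, and a candidate of length Len is kept only when both of its sub-sequences of length Len-1 were frequent in the previous pass. The function name and the sample documents are illustrative.

```python
from collections import Counter


def frequent_ngrams(documents, min_freq, max_ngram_size):
    """Return {n-gram tuple: count} for all frequent n-grams up to max size."""
    frequent = {}
    previous = None  # frequent n-grams of length n-1, once n > 1
    for n in range(1, max_ngram_size + 1):
        counts = Counter()
        for words in documents:
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Candidate generation: skip n-grams whose shorter
                # sub-sequences were not frequent in the previous pass.
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue
                counts[gram] += 1
        # Pruning: delete candidates that occur fewer than min_freq times.
        current = {g: c for g, c in counts.items() if c >= min_freq}
        if not current:
            break
        frequent.update(current)
        previous = current
    return frequent


docs = [
    "machine learning is fun".split(),
    "machine learning is hard".split(),
    "deep learning is fun".split(),
]
result = frequent_ngrams(docs, min_freq=2, max_ngram_size=3)
```

With these three documents, "hard" and "deep" are dropped at length 1, so no longer n-gram containing them is ever counted.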
  • 23. Jaccard Coefficient • “John Went to School With His Brother” • Translated into five 3-gram phrases • “John Went to”; “Went to School”; “to School With”; “School With His”; “With His Brother” • $S_n(d_i)$ is the set of distinct n-grams in document $d_i$ • $Sim(d_1, d_2) = \frac{|S_n(d_1) \cap S_n(d_2)|}{|S_n(d_1) \cup S_n(d_2)|}$
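As a worked illustration of the coefficient, the sketch below builds the 3-gram sets for the slide's sentence and for a second sentence that differs only in the last word. The second sentence, the function names, and the decision to lower-case and split on whitespace are assumptions made for the example.

```python
def word_ngrams(text, n=3):
    """Return the set of distinct word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(doc1, doc2, n=3):
    """Jaccard coefficient of the two documents' n-gram sets."""
    s1, s2 = word_ngrams(doc1, n), word_ngrams(doc2, n)
    if not s1 and not s2:
        return 1.0  # convention chosen here for two empty sets
    return len(s1 & s2) / len(s1 | s2)


d1 = "John went to school with his brother"
d2 = "John went to school with his sister"  # hypothetical near-duplicate
similarity = jaccard(d1, d2)
```

Each sentence yields five 3-grams, four of which are shared, so the intersection has 4 elements and the union has 6, giving a similarity of 4/6.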
  • 24. Web Spamming • Web Search has become an integral part • Increased exposure leads to higher financial gains • Rank positions of pages become important • Spamming vs. page optimization • Spamming refers to actions that do not increase the information value of the page, but increase the rank position dramatically by misleading the search algorithm to rank it high
  • 25. Web Spamming • Content Spamming – Leverage TF-IDF – Make some pages more relevant by tailoring the contents of the text fields – Title tags – Meta tags (author, abstract, key words etc.) – Body: use spam terms in the body – Anchor Text: appropriate words in links – URL: include spam terms in URL
  • 26. Web Spamming • Spam Techniques – Repeating some important terms: increases the TF scores. Spam terms are woven into some sentences: “the picture mining quality of this camera mining is amazing” – Dumping many unrelated terms: this will make the page relevant to many queries. Spammers simply copy many sentences from related pages – Use frequently searched terms: holiday packages with cruise liners put “Tom Cruise” in ad pages!
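The effect of repeating a term on its TF score can be illustrated with the slide's camera sentence. The sketch assumes TF is the raw count of the term divided by the number of words in the text, which is one common variant; the slides do not say which weighting they have in mind. The "clean" sentence is the slide's sentence with the spam term removed.

```python
from collections import Counter


def term_frequency(term, text):
    """Raw count of `term` divided by the number of words in `text`."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)


clean = "the picture quality of this camera is amazing"
spam = "the picture mining quality of this camera mining is amazing"

tf_clean = term_frequency("mining", clean)  # term absent
tf_spam = term_frequency("mining", spam)    # 2 occurrences in 10 words
```

Two inserted words are enough to move the term from a TF of 0 to a TF of 0.2, without adding any information to the sentence.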
  • 27. Link Spamming • We will discuss this next class when we take up link analysis
  • 28. Case – Analyzing Customer Reviews
  • 29. Agenda • Business Problem • Data Set • Classification • Text Mining
  • 33. Business Problem – Predict sales rank of audio CD using classification techniques – Sentiment analysis of audio CD reviews using text analytics – Obtain categorical insights on audio CD reviews using text analytics 1. Business Problem
  • 34. Data Set – amazon-memberinfo-locations.txt – amazon-member-shortSummary.txt – productInfoXML-reviewed-mProducts.txt – productInfoXML-reviewed-AudioCDs.txt – reviewsNew.rar – productinfo.rar – Booksinfo.txt 2. Data Set
  • 36. Data Preparation • Extraction – Perl Scripts • Audio CD – 35,305 Tuples – 50 variables • Binning – TV=T: If the sales rank <= 10000 – TV=F: If the sales rank > 10000 3. Classification
      Actual / Classified as | > 10000 | <= 10000
      > 10000                | TN      | FP
      <= 10000               | FN      | TP
  • 37. Model Construction • C5.0 • Entire Data Set
      Actual / Classified as | > 10000 | <= 10000
      > 10000                | 34195   | 5
      <= 10000               | 181     | 924

      Model           | TN     | FP    | FN     | TP
      Entire Data Set | 99.99% | 0.01% | 16.38% | 83.62%
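The percentages in the second table follow from the counts in the first when each rate is taken relative to its actual class (row). The sketch below reproduces them; the function name is an assumption, and the overall accuracy it also returns is not a figure reported on the slide.

```python
def class_rates(tn, fp, fn, tp):
    """Per-class rates from confusion-matrix counts, plus overall accuracy."""
    negatives = tn + fp   # actual sales rank > 10000
    positives = fn + tp   # actual sales rank <= 10000
    return {
        "TN": tn / negatives,
        "FP": fp / negatives,
        "FN": fn / positives,
        "TP": tp / positives,
        "accuracy": (tn + tp) / (negatives + positives),
    }


# Counts from the C5.0 model built on the entire data set.
rates = class_rates(tn=34195, fp=5, fn=181, tp=924)
percentages = {name: round(value * 100, 2) for name, value in rates.items()}
```

The four counts add up to the 35,305 tuples in the data set, and only 1,105 of them belong to the top-10000 class, which is the imbalance that the later slides address with over-sampling, boosting and differential error weights.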
  • 38. Model Construction • Over Sampling • Maj : Min – 2 : 1
      Model    | TN     | FP    | FN     | TP
      Training | 99.57% | 0.43% | 1.16%  | 98.84%
      Testing  | 98.74% | 1.26% | 12.15% | 87.85%
  • 39. Model Construction • Boosting – 10 trials – 5 trials
      10 Trials | TN     | FP    | FN     | TP
      Training  | 99.83% | 0.17% | 0.46%  | 99.54%
      Testing   | 99.17% | 0.83% | 12.15% | 87.85%

      5 Trials  | TN     | FP    | FN     | TP
      Training  | 99.98% | 0.02% | 0.23%  | 99.77%
      Testing   | 99.41% | 0.59% | 12.15% | 87.85%
  • 40. Model Construction • Differential Error Weights • 5, 4 and 4.5
      Cost 5   | TN     | FP     | FN     | TP
      Training | 89.69% | 10.31% | 0.00%  | 100.00%
      Testing  | 89.31% | 10.69% | 5.90%  | 94.10%

      Cost 4   | TN     | FP     | FN     | TP
      Training | 97.63% | 2.37%  | 0.00%  | 100.00%
      Testing  | 94.52% | 5.48%  | 10.42% | 89.58%

      Cost 4.5 | TN     | FP     | FN     | TP
      Training | 92.50% | 7.50%  | 0.12%  | 99.88%
      Testing  | 90.87% | 9.13%  | 9.03%  | 90.97%
  • 41. Conclusion • Predicting Top 10000 sales rank for AudioCDs with 91% accuracy. 3. Classification
  • 42. Conclusions - Classification – # of reviews – # of feedbacks – Helpful feedbacks, ratios – Min member rank and mean member rank 3. Classification
  • 44. Text Mining 4. Text Analytics
  • 45. Business Problem – Sentiment analysis of audio CD reviews using text analytics – Obtain categorical insights on audio CD reviews using text analytics 4. Text Analytics
  • 49. Results
      Type of Comment | Average Rating
      Only Positive   | 4.540107
      Mixed           | 3.122253
      Neutral         | 4.423077
  • 54. Clustering - Results [Figure: cluster diagrams. Recoverable cluster labels: Fans; Generics; Musicians; Experts; Outliers; NULL; “Tracks, Songs, Musicians, Album”; “Album, Artist, Band, Movie”; Songs; Vocals; Tracks; Music]
  • 55. Associations - Results 4. Text Analytics
  • 58. Application • Project done by Earlier Batch Students • Business Objectives – Measurement of the behavior of visitors on a web site. – Which sources bring the most customers, orders and revenue? What is my conversion funnel? – Campaign Effectiveness – What navigational paths lead to most conversions? – Product View Associations – Does my online channel have usability issues slowing down the adoption?
  • 59. Mining Objectives • Which products are viewed and bought together to form recommendations • Which web campaigns are most effective in bringing the visitors to the website • Which web campaigns are most effective in improving the conversion rate
  • 61.
  • 62. Sample Records
      2003-09-03 00:00:00 76.192.65.6 GET /index.php - 200 Mozilla/4.0+(IE5.5;Win2k) - http://www.google.com/search?chr=UFT&p=buy+fruit+loom&lang=eng
      2003-09-03 00:00:13 76.192.65.6 GET /fruit.php id=7&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) SESSIONID=COOKIE15 http://www.bobsfruitsite.com/index.php
      2003-09-03 00:00:28 76.192.65.6 GET /fruit.php id=4&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) SESSIONID=COOKIE15 http://www.bobsfruitsite.com/fruit.php?id=7&topic=Description
  • 63. Challenges • Voluminous data – 64000 records. • Cleaning (98% Perspiration, 2% Inspiration) – Effort spent on filtering junk records and extracting derived variables. • Deriving Variables – Transaction Id – Product Viewed – Product Bought – Source (referrer) – Campaign – Creative
  • 64. Variables • Transaction-Id: Derived variable – This was derived by combining the IP address, Cookie and the date fields. This gave us a unique identifying number that was used as the transaction-id. • Page Visited: Direct variable – The unique values for this variable were • a) cart.php: This is the name of the page which indicates that the user has bought the product • b) index.php: This is the first page of the website which the user either reaches directly or from a referred source (Google, Yahoo, AOL etc.) • c) fruit.php: If the user has clicked on any of the fruits, this is the page that gets displayed.
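Deriving the transaction-id can be sketched as below. The field order (date, time, IP, method, page, query, status, user agent, cookie, referrer) is an assumption inferred from the sample records, and the `|` separator and the names `LogRecord`, `parse_record` and `transaction_id` are hypothetical. The slides do not show how the students actually combined the three fields.

```python
from collections import namedtuple

# Assumed field layout of one whitespace-separated log record.
LogRecord = namedtuple(
    "LogRecord",
    "date time ip method page query status agent cookie referer",
)


def parse_record(line):
    """Split one log line into its ten assumed fields."""
    return LogRecord(*line.split())


def transaction_id(record):
    """Combine IP address, cookie and date into one identifier."""
    return f"{record.ip}|{record.cookie}|{record.date}"


line = (
    "2003-09-03 00:00:13 76.192.65.6 GET /fruit.php "
    "id=7&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) "
    "SESSIONID=COOKIE15 http://www.bobsfruitsite.com/index.php"
)
record = parse_record(line)
trx = transaction_id(record)
```

All page views by the same visitor on the same day then share one transaction-id, which is what the later association-rule step groups on.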
  • 65. Variables • Referrer: Derived Variable – This variable indicates the source from which the user has come. The value for the referrer was buried inside the dataset field called cs-uri-query. – source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
  • 66. • Campaign: Derived Variable – This variable indicates the kind of campaign that was carried out on the referrer's website that brought the user to the “bobsfruit” website. Each campaign was represented by a number starting from 1 onwards. – source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
  • 67. • Creative: Derived Variable – This variable indicates the subset of the campaign that was carried out on the referrer's website that brought the user to the “bobs fruit” website. Like campaign, creative too was represented by a number starting from 1 onwards. – source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
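The three derived variables can be pulled out of the query string with the standard library's `urllib.parse.parse_qs`. This is a sketch of one possible extraction, not the students' own script (the slides mention Perl only for the customer-review case); the helper name `campaign_fields` is an assumption.

```python
from urllib.parse import parse_qs


def campaign_fields(query):
    """Extract source, campaign and creative from a cs-uri-query value."""
    params = parse_qs(query)  # maps each key to a list of values
    wanted = ("source", "campaign", "creative")
    return {key: params[key][0] for key in wanted if key in params}


query = (
    "source=google&group=3&campaign=3&id=3&creative=1"
    "&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google"
)
fields = campaign_fields(query)
```

Keys that are missing from a record are simply left out of the result, so records that did not arrive through a campaign yield an empty dictionary.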
  • 68. Mining Techniques • Association Rules – Trx ID – Product id Viewed – Product id Bought • Classification Trees – Referrer – Campaign – Creative – Product Viewed/Bought
  • 69. Mining Results • Association Rules – [Mango] ==> [Durian] • Support:45.2344% • Confidence:75.5900% • Lift:1.3022 – [Orange]+[Durian]+[Apple] ==> [Mango] • Support:35.0781% • Confidence:87.1800% • Lift:1.4568
  • 70. Association Rules (Products Viewed)
      Rule                 | Support | Confidence | Lift
      [Mango] ==> [Durian] | 45.23%  | 75.59%     | 1.30
      [Durian] ==> [Mango] | 45.23%  | 77.93%     | 1.30
      [Mango] ==> [Apple]  | 51.87%  | 86.68%     | 1.27
  • 71. Association Rules (Products Viewed)
      Rule                                      | Support | Confidence | Lift
      [Orange]+[Durian] ==> [Mango]             | 37.03%  | 85.87%     | 1.43
      [Orange]+[Durian]+[Apple] ==> [Mango]     | 35.07%  | 87.18%     | 1.45
      [Orange]+[Star Fruit]+[Mango] ==> [Apple] | 34.21%  | 94.81%     | 1.39
  • 72. Association Rules (Products Bought)
      Rule                         | Support | Confidence | Lift
      [Kiwi Fruit] ==> [Orange]    | 5.46%   | 38.10%     | 1.38
      [Star Fruit] ==> [Apple]     | 5.46%   | 25.53%     | 0.80
      [Passion Fruit] ==> [Orange] | 5.46%   | 30.38%     | 1.10
      [Banana] ==> [Apple]         | 5.58%   | 29.52%     | 0.93
      [Star Fruit] ==> [Orange]    | 5.80%   | 27.13%     | 0.98
      [Passion Fruit] ==> [Apple]  | 6.15%   | 34.18%     | 1.08
      [Orange] ==> [Apple]         | 7.17%   | 26.14%     | 0.82
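For readers who want to see how the three columns are defined, the sketch below computes support, confidence and lift for one rule using the standard definitions. The five transactions are made-up toy data, not the project's data set, and the function name is an assumption.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent ==> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent | consequent <= set(t))
    ante = sum(1 for t in transactions if antecedent <= set(t))
    cons = sum(1 for t in transactions if consequent <= set(t))
    if n == 0 or ante == 0 or cons == 0:
        raise ValueError("rule is undefined for these transactions")
    support = both / n               # share of transactions with both sides
    confidence = both / ante         # share of antecedent transactions
    lift = confidence / (cons / n)   # confidence relative to the base rate
    return support, confidence, lift


# Hypothetical product-view transactions, one set per transaction-id.
transactions = [
    {"Mango", "Durian"},
    {"Mango", "Durian", "Apple"},
    {"Mango", "Apple"},
    {"Durian"},
    {"Apple", "Orange"},
]
support, confidence, lift = rule_metrics(transactions, ["Mango"], ["Durian"])
```

A lift above 1 means the consequent is seen more often together with the antecedent than its overall frequency would suggest, and a lift below 1 means the opposite, which is how the rows with lift 0.80 to 0.98 in the products-bought table can be read.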
  • 73. Mining Results (Classification) • Classification Tree Rules – IF campaign = 1 THEN product bought = n (No product was bought) • Support:27.3% – IF campaign = 2 AND creative = 0 THEN product viewed = n (No product was viewed) • Support:2.6% – IF campaign = 2 AND creative = 1 THEN product bought = y (product was bought) • Support:8.4%