Web content mining

Web Content Mining
• Mining Knowledge from web pages
• Involves extraction of
– structured data
• List pages and detailed pages
• Lists of products, services, jobs, music etc.
– Unstructured data
• Text data
12/3/2018 Professor V. Nagadevara

Web Data Sources
• Web server log data
• Page tags
• Cookies
• Web content
• Web documents
• Hyperlinks

Web Content Mining
• Structured data
• Unstructured data

Unstructured Data
• Opinions
• Reviews
• News items
• Uses many concepts from “Text Mining”

Text content on the web
• Large amount of information
– Document databases
– Research papers
– News articles
– Books
– Email messages
– Blogs
– Etc.

Data Mining Vs. Web Content Mining
• Data Mining extracts knowledge from structured data
– Credit card, insurance records, call records, etc.
• Web Content Mining works with unstructured text documents (in addition to structured data)
– Written language is not structured

Pre-processing
• Requires considerable pre-processing
• Web Page Pre-processing
• Text Pre-processing

Web page pre-processing
• Web pages are different from plain text pages (documents)
• HTML contains different text “fields”
– Title, metadata, body, etc.
– Text in some fields is given higher weight (title; header tags <h1>, <h2>; bold tag <b>, etc.)
– Spammers can exploit this weighting to “spam” web pages!

• Anchor text
– Anchor text associated with a hyperlink is treated
specially
– It is a concise and better description of the
content contained in the page pointed to by the
link
– More important if it points to a page in a different
site

• Removing HTML Tags
– Information is presented in a number of blocks
– Removing HTML tags combines these blocks
– Many text phrases are pointers!
– Some are buttons!
http://en.wikipedia.org/wiki/Main_Page
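The field-aware handling described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the course's own tooling: the class name `FieldExtractor`, the set of tracked tags, and the sample page are all assumptions made for the example. It keeps title, header, bold, and anchor text in separate fields instead of merging every block into one string.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect text per HTML "field" instead of flattening the whole page."""

    TRACKED = {"title", "h1", "h2", "b", "a"}  # assumed set of weighted fields

    def __init__(self):
        super().__init__()
        self.open_tags = []  # tracked tags that are currently open
        self.fields = {}     # tag name -> list of text fragments inside it
        self.all_text = []   # every text fragment, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recently opened tag of this name, if any.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.all_text.append(text)
        for tag in set(self.open_tags):
            self.fields.setdefault(tag, []).append(text)


# Hypothetical page used only for illustration.
page = (
    "<html><head><title>Fruit shop</title></head><body>"
    "<h1>Fresh mangoes</h1><p>Order <b>today</b>.</p>"
    '<a href="/cart.php">View cart</a></body></html>'
)
parser = FieldExtractor()
parser.feed(page)
parser.close()
fields = parser.fields
all_text = " ".join(parser.all_text)
```

Here `fields["a"]` holds the anchor text, so it can be weighted separately or dropped when cleaning the main text. A real pipeline would also need to skip script and style content, which this sketch does not do.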

• Identifying main content blocks
– Web pages contain a large amount of information that is not part of the main text
– E.g., banner ads, navigation bars, copyright notices, anchor texts
– Clean the main text by removing anchor texts, navigation links, etc.

• Duplicate detection
– Web sites maintain “mirror sites” and “duplicate pages”
– The objective is to provide fast downloading of files and pages
– Identifying these is important
– Similarities can be identified using the “Jaccard Coefficient” (discussed later)

Text Content Pre-processing
• Word Level
• Sentence Level
• Document Level
• Document-Collection Level
• Linked-Document-Collection Level
• Application Level

• Stop word removal
– Typically: a, an, the, of, with, about, etc.
– “form” may be a stop word in the chemical industry, but not in other domains
• Stemming
– Words that differ only slightly are treated as one
– Use the linguistic “stem” of the group
• Learn, learner, learning, learns, learned

Stemming
• Different forms of the same word are usually problematic for text data analysis, because they have different spellings but similar meanings (e.g. learns, learned, learning, …)
• Stemming is the process of transforming a word into its stem (normalized form)
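A toy sketch of stop word removal followed by stemming. The suffix list and the `naive_stem` function are assumptions made for illustration only; this is a crude suffix stripper, not the Porter stemmer or any other standard algorithm, and it will mis-stem many real words.

```python
# Stop words taken from the slide's example list.
STOP_WORDS = {"a", "an", "the", "of", "with", "about"}

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]


stems = [naive_stem(w) for w in ["learn", "learner", "learning", "learns", "learned"]]
tokens = preprocess("The learner learns about learning.")
```

All five word forms from the slide map to the single stem `learn`, which is the effect stemming is meant to have.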

N-Grams
• A simple way to generate phrases is to find frequent n-grams:
– An n-gram is a sequence of n consecutive words (e.g. “machine learning” is a 2-gram)
– “Frequent n-grams” are those that appear “MinFreq” or more times across all observed documents
• N-grams are interesting because they can be found with a simple and efficient dynamic programming algorithm:
• Given:
– Set of documents (each document is a sequence of words),
– MinFreq (minimal n-gram frequency),
– MaxNGramSize (maximal n-gram length)
• for Len = 1 to MaxNGramSize do
– Generate candidate n-grams as sequences of words of size Len using frequent n-grams of length Len-1
– Delete candidate n-grams with frequency less than MinFreq
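The level-wise procedure above can be sketched as follows. This is one possible reading of the pseudocode, under the assumption that "frequency" means the total count over all documents; the function name and the sample documents are made up for the example.

```python
from collections import Counter


def frequent_ngrams(documents, min_freq, max_ngram_size):
    """Return {n-gram tuple: count} for all n-grams seen at least min_freq times."""
    frequent = {}
    previous = None  # frequent n-grams of length n-1
    for n in range(1, max_ngram_size + 1):
        counts = Counter()
        for words in documents:
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # A candidate is counted only if both of its (n-1)-word
                # sub-sequences were frequent at the previous level.
                if previous is not None and (
                    gram[:-1] not in previous or gram[1:] not in previous
                ):
                    continue
                counts[gram] += 1
        current = {g: c for g, c in counts.items() if c >= min_freq}
        if not current:
            break  # no longer n-gram can be frequent
        frequent.update(current)
        previous = current
    return frequent


# Hypothetical documents, each already split into words.
docs = [
    "machine learning is fun".split(),
    "machine learning is hard".split(),
    "deep learning is fun".split(),
]
result = frequent_ngrams(docs, min_freq=2, max_ngram_size=3)
```

With these inputs, `("machine", "learning")` and `("learning", "is", "fun")` both survive with a count of 2, while anything containing `hard` or `deep` is pruned at length 1 and never extended.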

Jaccard Coefficient
• Example document: “John Went to School With His Brother”
• Translated into five 3-gram phrases
• “John Went to”; “Went to School”; “to School With”; “School With His”; “With His Brother”
• S_n(d_i) is the set of distinct n-grams in document d_i
• $Sim(d_1, d_2) = \frac{|S_n(d_1) \cap S_n(d_2)|}{|S_n(d_1) \cup S_n(d_2)|}$
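A small sketch of the Jaccard coefficient over word 3-grams, using the slide's sentence as the first document. The second document is a hypothetical near-duplicate added for the comparison, and the function names are assumptions.

```python
def ngram_set(text, n=3):
    """Set of distinct word n-grams in a text."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(text_a, text_b, n=3):
    """Size of the intersection divided by size of the union of the n-gram sets."""
    set_a, set_b = ngram_set(text_a, n), ngram_set(text_b, n)
    union = set_a | set_b
    if not union:
        return 0.0  # both texts are shorter than n words
    return len(set_a & set_b) / len(union)


d1 = "John Went to School With His Brother"
d2 = "John Went to School With His Sister"  # hypothetical near-duplicate
similarity = jaccard(d1, d2)
```

The two documents share four of their 3-grams, and the union holds six, so the similarity is 4/6. A value close to 1 would flag the pages as likely duplicates or mirrors.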

Web Spamming
• Web Search has become an integral part
• Increased exposure leads to higher financial gains
• Rank positions of pages become important
• Spamming vs. page optimization
• Spamming refers to actions that do not increase
the information value of the page, but increase
the rank position dramatically by misleading the
search algorithm to rank it high

Web Spamming
• Content Spamming – Leverage TF-IDF
– Make some pages more relevant by tailoring the
contents of the text fields
– Title tags
– Meta tags (author, abstract, key words etc.)
– Body: use spam terms in the body
– Anchor Text: appropriate words in links
– URL: include spam terms in URL

Web Spamming
• Spam Techniques
– Repeating some important terms: increases the TF scores. Spam terms are woven into some sentences: “the picture mining quality of this camera mining is amazing”
– Dumping many unrelated terms: this makes the page relevant to many queries. Spammers simply copy many sentences from related pages
– Using frequently searched terms: e.g., ad pages for holiday packages with cruise liners include “Tom Cruise”!
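To see why repetition works, here is a minimal sketch of how a term's TF-IDF score changes when a spam term is woven into a sentence. It assumes one common variant of the formula (relative term frequency times the natural log of N/df); search engines use their own, more elaborate scoring. The "honest" sentence is the slide's example with the spam term removed.

```python
import math
from collections import Counter


def term_frequency(term, text):
    """Share of the words in text that are equal to term."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[term] / len(words)


def tf_idf(term, text, corpus):
    """TF times log(N / df); an assumed, unsmoothed variant."""
    df = sum(1 for doc in corpus if term in doc.lower().split())
    if df == 0:
        return 0.0
    return term_frequency(term, text) * math.log(len(corpus) / df)


honest = "the picture quality of this camera is amazing"
spammed = "the picture mining quality of this camera mining is amazing"
corpus = [honest, spammed]

tf_spammed = term_frequency("mining", spammed)
score_honest = tf_idf("mining", honest, corpus)
score_spammed = tf_idf("mining", spammed, corpus)
```

Two insertions of "mining" in a ten-word sentence give it a term frequency of 0.2, and each further repetition pushes the score higher without adding any information to the page.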

Link Spamming
• We will discuss this next class when we take
up link analysis

• Case – Analyzing Customer Reviews

Agenda
• Business Problem
• Data Set
• Classification
• Text Mining

Amazon.com
1. Business Problem

Reviews
Rating
Feedbacks
Summary

Business Problem
– Predict sales rank of audio CDs using classification techniques
– Sentiment analysis of audio CD reviews using text analytics
– Obtain categorical insights on audio CD reviews using text analytics

Data Set
– amazon-memberinfo-locations.txt
– amazon-member-shortSummary.txt
– productInfoXML-reviewed-mProducts.txt
– productInfoXML-reviewed-AudioCDs.txt
– reviewsNew.rar
– productinfo.rar
– Booksinfo.txt
2. Data Set

Data Preparation
• Extraction
– Perl Scripts
• Audio CD
– 35,305 Tuples
– 50 variables
• Binning
– TV=T: if the sales rank <= 10000
– TV=F: if the sales rank > 10000
Actual / Classified as    > 10000    <= 10000
> 10000                   TN         FP
<= 10000                  FN         TP
3. Classification

Model Construction
• C5.0
• Entire Data Set
Actual / Classified as    > 10000    <= 10000
> 10000                   34195      5
<= 10000                  181        924
Model              TN        FP       FN        TP
Entire data set    99.99%    0.01%    16.38%    83.62%
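The percentages in the table appear to be row-normalized: TN and FP are shares of the actual "> 10000" class, FN and TP are shares of the actual "<= 10000" class. A short sketch that recomputes them from the counts (the function name is an assumption, and overall accuracy is added here for reference, it is not on the slide):

```python
def class_rates(tn, fp, fn, tp):
    """Row-normalized rates from a 2x2 confusion matrix."""
    negatives = tn + fp  # actual sales rank > 10000
    positives = fn + tp  # actual sales rank <= 10000
    return {
        "TN": tn / negatives,
        "FP": fp / negatives,
        "FN": fn / positives,
        "TP": tp / positives,
        "accuracy": (tn + tp) / (negatives + positives),
    }


# Counts from the C5.0 model built on the entire data set.
rates = class_rates(tn=34195, fp=5, fn=181, tp=924)
percent = {name: round(value * 100, 2) for name, value in rates.items()}
```

The four counts add up to 35,305, the number of audio CD tuples. The minority class (rank <= 10000) has only 1,105 records, which is why its 16.38% miss rate barely affects overall accuracy and why oversampling and cost weights are tried next.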

Model Construction
• Oversampling
• Majority : minority ratio of 2 : 1
Model       TN        FP       FN        TP
Training    99.57%    0.43%    1.16%     98.84%
Testing     98.74%    1.26%    12.15%    87.85%

Model Construction
• Boosting
– 10 trials
– 5 trials
10 trials   TN        FP       FN        TP
Training    99.83%    0.17%    0.46%     99.54%
Testing     99.17%    0.83%    12.15%    87.85%
5 trials    TN        FP       FN        TP
Training    99.98%    0.02%    0.23%     99.77%
Testing     99.41%    0.59%    12.15%    87.85%

Model Construction
• Differential error weights
• Weights tried: 5, 4 and 4.5
Cost 5      TN        FP        FN        TP
Training    89.69%    10.31%    0.00%     100.00%
Testing     89.31%    10.69%    5.90%     94.10%
Cost 4      TN        FP        FN        TP
Training    97.63%    2.37%     0.00%     100.00%
Testing     94.52%    5.48%     10.42%    89.58%
Cost 4.5    TN        FP        FN        TP
Training    92.50%    7.50%     0.12%     99.88%
Testing     90.87%    9.13%     9.03%     90.97%

Conclusion
• Predicting Top 10000 sales rank for audio CDs with 91% accuracy.

Conclusions - Classification
– # of reviews
– # of feedbacks
– Helpful feedbacks, ratios
– Min member rank and mean member rank

Business Problem
– Sentiment analysis of audio CD reviews using text analytics
– Obtain categorical insights on audio CD reviews using text analytics
4. Text Analytics

Overall sentiment analysis

“Customized” Package

Results
Type of Comment    Average Rating
Only Positive      4.540107
Mixed              3.122253
Neutral            4.423077

Category Analysis

Clustering - Results
[Figure: clustering results for the review categories. Cluster labels shown: Fans, Generics, Musicians, Outliers, Experts, NULL. Terms shown: Tracks, Songs, Musicians, Album, Artist, Band, Movie, Vocals, Music, memory device.]

Associations - Results

Application
• Project done by Earlier Batch Students
• Business Objectives
– Measurement of the behavior of visitors on a web site.
– Which sources bring the most customers, orders and revenue? What is my conversion funnel? – Campaign Effectiveness
– What navigational paths lead to most conversions? – Product View Associations
– Does my online channel have usability issues slowing down the adoption?

Mining Objectives
• Which products are viewed and bought
together to form recommendations
• Which web campaigns are most effective in
bringing the visitors to the website
• Which web campaigns are most effective in
improving the conversion rate

Sample Records
• 2003-09-03 00:00:00 76.192.65.6 GET /index.php - 200 Mozilla/4.0+(IE5.5;Win2k) - http://www.google.com/search?chr=UFT&p=buy+fruit+loom&lang=eng
• 2003-09-03 00:00:13 76.192.65.6 GET /fruit.php id=7&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) SESSIONID=COOKIE15 http://www.bobsfruitsite.com/index.php
• 2003-09-03 00:00:28 76.192.65.6 GET /fruit.php id=4&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) SESSIONID=COOKIE15 http://www.bobsfruitsite.com/fruit.php?id=7&topic=Description
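A minimal sketch of splitting one such record into named fields. The field order below is inferred from the sample records (date, time, client IP, method, page, query string, status, user agent, cookie, referrer); the field names are assumptions modelled on W3C extended log naming, not taken from the project itself.

```python
# Assumed field order, inferred from the sample records.
FIELDS = (
    "date", "time", "c_ip", "method", "uri_stem",
    "uri_query", "status", "user_agent", "cookie", "referer",
)


def parse_record(line):
    """Split a whitespace-separated log line into a dict; "-" means empty."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return {
        name: (None if value == "-" else value)
        for name, value in zip(FIELDS, parts)
    }


line = (
    "2003-09-03 00:00:13 76.192.65.6 GET /fruit.php "
    "id=7&topic=Description& 200 Mozilla/4.0+(IE5.5;Win2k) "
    "SESSIONID=COOKIE15 http://www.bobsfruitsite.com/index.php"
)
record = parse_record(line)
```

Lines that do not have exactly ten fields raise an error here; in a cleaning pass they would more likely be counted and set aside as junk records.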

Challenges
• Voluminous data – 64,000 records.
• Cleaning (98% perspiration, 2% inspiration): effort spent on filtering junk records and extracting derived variables.
• Deriving Variables
– Transaction Id
– Product Viewed
– Product Bought
– Source (referrer)
– Campaign
– Creative

Variables
• Transaction-Id : Derived variable
– Derived by combining the IP address, cookie and date fields. This gave a unique identifying number that was used as the transaction-id.
• Page Visited : Direct variable
– The unique values of this variable were:
• a) cart.php: the page that indicates that the user has bought the product
• b) index.php: the first page of the website, which the user reaches either directly or from a referring source (Google, Yahoo, AOL, etc.)
• c) fruit.php: the page displayed when the user clicks on any of the fruits
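One way the transaction-id derivation could look in code. The slides do not say how the three fields were combined, so numbering each distinct (IP, cookie, date) combination is an assumption, as are the record layout, the `bought` flag, and the second and third sample records.

```python
def assign_transaction_ids(records):
    """Give every distinct (IP, cookie, date) combination its own number."""
    ids = {}
    tagged = []
    for record in records:
        key = (record["c_ip"], record["cookie"], record["date"])
        if key not in ids:
            ids[key] = len(ids) + 1
        tagged.append({
            **record,
            "transaction_id": ids[key],
            # cart.php is the page that signals a purchase
            "bought": record["page"] == "cart.php",
        })
    return tagged


records = [
    {"date": "2003-09-03", "c_ip": "76.192.65.6",
     "cookie": "SESSIONID=COOKIE15", "page": "fruit.php"},
    # The next two records are hypothetical.
    {"date": "2003-09-03", "c_ip": "76.192.65.6",
     "cookie": "SESSIONID=COOKIE15", "page": "cart.php"},
    {"date": "2003-09-03", "c_ip": "10.0.0.1",
     "cookie": "SESSIONID=COOKIE16", "page": "index.php"},
]
tagged = assign_transaction_ids(records)
```

Records that share the same IP address, cookie and date fall into one transaction, which is the grouping the association rules later operate on.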

Variables
• Referrer: Derived Variable
– This variable indicates the source from which the user has come. The value for the referrer was buried inside the dataset field called cs-uri-query.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google

• Campaign : Derived Variable
– This variable indicates the kind of campaign that was carried out on the referrer's website that brought the user to the “bobsfruit” website. Each campaign was represented by a number starting from 1.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google

• Creative: Derived Variable
– This variable indicates the subset of the campaign that was carried out on the referrer's website that brought the user to the “bobsfruit” website. Like campaign, creative too was represented by a number starting from 1.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
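Extracting the three derived variables from the query string can be sketched with `urllib.parse.parse_qs` from the Python standard library. Mapping the referrer to the `source` parameter is an assumption based on the example string; the function name is also made up for illustration.

```python
from urllib.parse import parse_qs


def derive_campaign_fields(uri_query):
    """Pull referrer, campaign and creative out of a cs-uri-query value."""
    params = parse_qs(uri_query)

    def first(name):
        # parse_qs returns a list per key; take the first value if present
        return params.get(name, [None])[0]

    return {
        "referrer": first("source"),  # assumed to be the "source" parameter
        "campaign": first("campaign"),
        "creative": first("creative"),
    }


query = (
    "source=google&group=3&campaign=3&id=3&creative=1"
    "&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google"
)
derived = derive_campaign_fields(query)
```

For the example string this yields referrer `google`, campaign `3` and creative `1`, all as strings. Queries without these parameters, such as `id=7&topic=Description&`, give `None` for each field.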

Mining Techniques
• Association Rules
– Trx ID
– Product id Viewed
– Product id Bought
• Classification Trees
– Referrer
– Campaign
– Creative
– Product Viewed/Bought

Mining Results
• Association Rules
– [Mango] ==> [Durian]
• Support:45.2344%
• Confidence:75.5900%
• Lift:1.3022
– [Orange]+[Durian]+[Apple] ==> [Mango]
• Support:35.0781%
• Confidence:87.1800%
• Lift:1.4568
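For reference, a minimal sketch of how support, confidence and lift are computed for a rule. The transactions below are hypothetical toy data, not the project's data, and the function name is an assumption.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for antecedent ==> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n == 0 or n_antecedent == 0 or n_consequent == 0:
        raise ValueError("rule is undefined for these transactions")
    support = n_both / n                    # share containing both sides
    confidence = n_both / n_antecedent      # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)  # confidence relative to baseline
    return support, confidence, lift


# Hypothetical sets of products viewed per transaction.
transactions = [
    {"Mango", "Durian"},
    {"Mango", "Durian", "Apple"},
    {"Mango", "Apple"},
    {"Durian"},
    {"Apple", "Orange"},
]
support, confidence, lift = rule_metrics(transactions, ["Mango"], ["Durian"])
```

A lift above 1 means the consequent is seen more often with the antecedent than on its own. The same definitions tie the reported numbers together: dividing the 75.59% confidence of the Mango-to-Durian rule by its lift of 1.3022 gives a Durian support of about 58%, and 45.23% / 58% is about 77.9%, which matches the confidence reported for the Durian-to-Mango rule in the Products Viewed table.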

Association Rules (Products Viewed)
Rule                     Support    Confidence    Lift
[Mango] ==> [Durian]     45.23%     75.59%        1.30
[Durian] ==> [Mango]     45.23%     77.93%        1.30
[Mango] ==> [Apple]      51.87%     86.68%        1.27

Association Rules (Products Viewed)
Rule                                          Support    Confidence    Lift
[Orange]+[Durian] ==> [Mango]                 37.03%     85.87%        1.43
[Orange]+[Durian]+[Apple] ==> [Mango]         35.07%     87.18%        1.45
[Orange]+[Star Fruit]+[Mango] ==> [Apple]     34.21%     94.81%        1.39

Association Rules (Products Bought)
Rule                            Support    Confidence    Lift
[Kiwi Fruit] ==> [Orange]       5.46%      38.10%        1.38
[Star Fruit] ==> [Apple]        5.46%      25.53%        0.80
[Passion Fruit] ==> [Orange]    5.46%      30.38%        1.10
[Banana] ==> [Apple]            5.58%      29.52%        0.93
[Star Fruit] ==> [Orange]       5.80%      27.13%        0.98
[Passion Fruit] ==> [Apple]     6.15%      34.18%        1.08
[Orange] ==> [Apple]            7.17%      26.14%        0.82

Mining Results (Classification)
• Classification Tree Rules
– IF campaign = 1 THEN product bought = n (no product was bought)
• Support: 27.3%
– IF campaign = 2 AND creative = 0 THEN product viewed = n (no product was viewed)
• Support: 2.6%
– IF campaign = 2 AND creative = 1 THEN product bought = y (product was bought)
• Support: 8.4%
