This document discusses web content mining and summarizes key concepts from a lecture on the topic. It covers extracting both structured and unstructured data from web pages, including lists, details pages, text, opinions and reviews. Pre-processing steps for web content mining are outlined, including removing HTML tags, identifying main content blocks, and detecting duplicate pages. Text preprocessing techniques like stop word removal and stemming are also summarized. The document concludes by discussing web spamming techniques used to improperly influence search engine rankings.
2. Web Content Mining
• Mining Knowledge from web pages
• Involves extraction of
– structured data
• List pages and detailed pages
• Lists of products, services, jobs, music etc.
– Unstructured data
• Text data
12/3/2018 Professor V. Nagadevara
3. Web Data Sources
• Web server log data
• Page tags
• Cookies
• Web content
• Web documents
• Hyperlinks
4. Web Content Mining
• Structured data
• Unstructured data
9. Text content on the web
• Large amount of information
– Document databases
– Research papers
– News articles
– Books
– Email messages
– Blogs
– Etc.
10. Data Mining Vs. Web Content Mining
• Data Mining extracts knowledge from
structured data
– Credit card, Insurance records, Call records etc.
• Web Content Mining works with unstructured,
text documents (in addition to structured
data)
– Written language is not structured
12. Web page pre-processing
• Web pages are different from text pages
(documents)
• HTML contains different text “fields”
– Title, metadata, body, etc.
– Titles and headings are given higher weight (header tags <h1>,
<h2>; bold tag <b>, etc.)
– One can leverage this for “Spamming” web pages!
13. Web page pre-processing
• Anchor text
– Anchor text associated with a hyperlink is treated
specially
– It is a concise and better description of the
content contained in the page pointed to by the
link
– More important if it points to a page in a different
site
14. Web page pre-processing
• Removing HTML Tags
– Information is presented in a number of blocks
– Removing HTML tags combines these blocks
– Many text phrases are pointers!
– Some are buttons!
http://en.wikipedia.org/wiki/Main_Page
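The point about tag removal merging blocks can be illustrated with a minimal sketch. It uses Python's standard `html.parser` module; the class name `BlockTextExtractor`, the helper `html_to_blocks`, and the choice of which tags count as block boundaries are assumptions made for illustration, not part of the lecture.

```python
from html.parser import HTMLParser


class BlockTextExtractor(HTMLParser):
    """Collects visible text and starts a new block at block-level tags."""

    # Assumed set of tags that separate visual blocks (illustrative only).
    BLOCK_TAGS = {"p", "div", "li", "td", "h1", "h2", "h3", "title", "br"}
    # Text inside these tags is never shown to the reader.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = [[]]
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.blocks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.blocks.append([])

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.blocks[-1].append(data.strip())


def html_to_blocks(html):
    """Return the page text as a list of blocks instead of one merged string."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    return [" ".join(block) for block in parser.blocks if block]


page = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><h1>News</h1><p>Hello <b>world</b></p>"
    "<div><a href='#'>Home</a></div></body></html>"
)
blocks = html_to_blocks(page)
```

Keeping the block boundaries, rather than stripping tags with a single regular expression, preserves the separation between headings, body text, and link-only blocks for the later cleaning steps.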
17. Web page pre-processing
• Identifying main content blocks
• Web pages contain large amount of information
which is not part of main text
• E.g., banner ads, navigation bars, copyright
notices, anchor texts
• Clean the main text by removing anchor texts,
navigation links etc.
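One simple way to separate main text from navigation is a link-density heuristic: a block whose text is mostly anchor text is treated as navigation. The sketch below assumes each block is already available as a dictionary with hypothetical `text` and `anchor_text` keys; the heuristic and the 0.5 threshold are illustrative assumptions, not the method prescribed in the lecture.

```python
def link_density(block_text, anchor_text):
    """Fraction of a block's characters that belong to anchor text."""
    total = len(block_text.strip())
    if total == 0:
        return 1.0  # an empty block carries no main content
    return len(anchor_text.strip()) / total


def main_content_blocks(blocks, max_density=0.5):
    """Keep blocks whose link density is at or below the assumed threshold."""
    return [
        block["text"]
        for block in blocks
        if link_density(block["text"], block["anchor_text"]) <= max_density
    ]


sample_blocks = [
    {"text": "Home | About | Contact", "anchor_text": "Home About Contact"},
    {
        "text": "Web content mining extracts knowledge from the text of web pages.",
        "anchor_text": "",
    },
]
kept = main_content_blocks(sample_blocks)
```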
18. Web page pre-processing
• Duplicate detection
• Web sites maintain “mirror sites” and
“duplicate pages”
• The objective is to provide fast downloading of files
and pages
• Identifying these is important
• Similarities can be identified using the “Jaccard
Coefficient” (discussed later)
19. Text Content Pre-processing
• Word Level
• Sentence Level
• Document Level
• Document-Collection Level
• Linked-Document-Collection Level
• Application Level
20. • Stop word removal
– Typically: a, an, the, of, with, about, etc.
– “form” may be a stop word in chemical industry,
but not in other domains
• Stemming
– Words which are slightly different are treated as
one.
– Use the linguistic “stem” of the group
• Learn, learner, learning, learns, learned
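A minimal sketch of both steps follows. The stop word list is the one given on the slide; the suffix list and the function names are assumptions for illustration. The stemmer is a deliberately naive suffix stripper, not a full algorithm such as Porter's, and it only handles simple cases like the "learn" family.

```python
import re

# Stop words listed on the slide; real lists are longer and domain-dependent.
STOP_WORDS = {"a", "an", "the", "of", "with", "about"}

# Assumed suffix list for a naive stemmer (longest suffixes first).
SUFFIXES = ("ing", "ed", "er", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stop_words=STOP_WORDS):
    """Remove stop words, then reduce the remaining tokens to their stems."""
    return [naive_stem(token) for token in tokenize(text) if token not in stop_words]


stems = [naive_stem(w) for w in ["learn", "learner", "learning", "learns", "learned"]]
tokens = preprocess("The learner learns about the form of learning")
# Domain-specific stop list: "form" treated as a stop word, as in the slide's example.
chem_tokens = preprocess(
    "The learner learns about the form of learning", STOP_WORDS | {"form"}
)
```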
21. Stemming
• Different forms of the same word are usually
problematic for text data analysis, because
they have different spellings but similar
meanings (e.g. learns, learned, learning, …)
• Stemming is a process of transforming a word
into its stem (normalized form)
22. N-Grams
• A simple way to generate phrases is to use frequent n-grams:
– An n-gram is a sequence of n consecutive words (e.g. “machine learning”
is a 2-gram)
– “Frequent n-grams” are those that appear “MinFreq” or more times
across all observed documents
• N-grams are interesting because of the simple and efficient dynamic
programming algorithm:
• Given:
– Set of documents (each document is a sequence of words),
– MinFreq(minimal n-gram frequency),
– MaxNGramSize(maximal n-gram length)
• for Len = 1 to MaxNGramSize do
– Generate candidate n-grams as sequences of words of size Len using
frequent n-grams of length Len-1
– Delete candidate n-grams with a frequency less than MinFreq
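The level-wise procedure above can be sketched as follows. The function name `frequent_ngrams` and the toy documents are assumptions; frequency is counted as total occurrences across all documents, which is one reading of the MinFreq definition. A candidate of length Len is counted only if both its leading and trailing (Len-1)-grams were frequent.

```python
from collections import Counter


def frequent_ngrams(documents, min_freq, max_ngram_size):
    """Return {n-gram tuple: frequency} for all frequent n-grams up to the size limit."""
    frequent = {}
    previous = set()  # frequent n-grams of length n - 1
    for n in range(1, max_ngram_size + 1):
        counts = Counter()
        for words in documents:
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Candidate generation: both (n-1)-gram parts must be frequent.
                if n > 1 and (gram[:-1] not in previous or gram[1:] not in previous):
                    continue
                counts[gram] += 1
        # Delete candidates with frequency below min_freq.
        previous = {gram for gram, count in counts.items() if count >= min_freq}
        if not previous:
            break
        frequent.update({gram: counts[gram] for gram in previous})
    return frequent


docs = [
    "machine learning is fun".split(),
    "machine learning is hard".split(),
    "deep learning is fun".split(),
]
result = frequent_ngrams(docs, min_freq=2, max_ngram_size=3)
```

Pruning candidates with the previous level keeps the number of counted n-grams small, since any n-gram containing an infrequent shorter n-gram cannot itself be frequent.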
23. Jaccard Coefficient
• John Went to School With His Brother
• Translated into five 3-gram phrases
• “John Went to”; “Went to School”; “to School
With”; “School With His”; “With His Brother”
• $S_n(d_i)$ is the set of distinct n-grams in
document $d_i$
• $\mathrm{Sim}(d_1, d_2) = \dfrac{|S_n(d_1) \cap S_n(d_2)|}{|S_n(d_1) \cup S_n(d_2)|}$
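As a worked illustration of the coefficient, the sketch below builds the 3-gram sets and divides the size of their intersection by the size of their union. The function names and the second sentence (ending in "Sister") are hypothetical additions for the example; lowercasing the words is also an assumption.

```python
def ngram_set(text, n=3):
    """Return the set of distinct word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(doc1, doc2, n=3):
    """Size of the n-gram intersection divided by the size of the n-gram union."""
    s1, s2 = ngram_set(doc1, n), ngram_set(doc2, n)
    union = s1 | s2
    if not union:
        return 0.0  # both documents are shorter than n words
    return len(s1 & s2) / len(union)


d1 = "John Went to School With His Brother"
d2 = "John Went to School With His Sister"  # hypothetical near-duplicate
similarity = jaccard_similarity(d1, d2)
```

Here the two sentences share four of their six distinct 3-grams, so the similarity is 4/6. A value close to 1 suggests duplicate or mirrored pages.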
24. Web Spamming
• Web Search has become an integral part
• Increased exposure leads to higher financial gains
• Rank positions of pages become important
• Spamming vs. page optimization
• Spamming refers to actions that do not increase
the information value of the page, but increase
the rank position dramatically by misleading the
search algorithm to rank it high
25. Web Spamming
• Content Spamming – Leverage TF-IDF
– Make some pages more relevant by tailoring the
contents of the text fields
– Title tags
– Meta tags (author, abstract, key words etc.)
– Body: use spam terms in the body
– Anchor Text: appropriate words in links
– URL: include spam terms in URL
26. Web Spamming
• Spam Techniques
– Repeating some important terms: increases the TF
scores. Spam terms are woven into some
sentences: “the picture mining quality of this
camera mining is amazing”
– Dumping many unrelated terms: this will make the
page relevant to many queries. Spammers simply
copy many sentences from related pages
– Use frequently searched terms: holiday packages
with cruise liners put “Tom Cruise” in ad pages!
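The effect of term repetition on the TF score can be checked on the slide's own example sentence. The sketch assumes one common definition of term frequency (occurrences of the term divided by the total number of words); the function name and the "clean" comparison sentence are illustrative.

```python
from collections import Counter


def term_frequency(text, term):
    """Occurrences of the term divided by the total number of words (assumed TF variant)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return Counter(words)[term.lower()] / len(words)


spam_sentence = "the picture mining quality of this camera mining is amazing"
clean_sentence = "the picture quality of this camera is amazing"

tf_spam = term_frequency(spam_sentence, "mining")
tf_clean = term_frequency(clean_sentence, "mining")
```

Weaving the spam term in twice raises its TF from 0 to 2/10 without adding any information to the sentence.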
27. Link Spamming
• We will discuss this next class when we take
up link analysis
33. Business Problem
– Predict sales rank of audio
CD using classification
techniques
– Sentiment analysis of audio
CD reviews using text
analytics
– Obtain categorical insights
on audio CD reviews using
text analytics
34. Data Set
– amazon-memberinfo-locations.txt
– amazon-member-shortSummary.txt
– productInfoXML-reviewed-mProducts.txt
– productInfoXML-reviewed-AudioCDs.txt
– reviewsNew.rar
– productinfo.rar
– Booksinfo.txt
36. Data Preparation
• Extraction
– Perl Scripts
• Audio CD
– 35,305 Tuples
– 50 variables
• Binning
Actual / Classified as   > 10000   <= 10000
> 10000                  TN        FP
<= 10000                 FN        TP
TV=T: If the sales rank <= 10000
TV=F: If the sales rank > 10000
3. Classification
37. Model Construction
• C5.0
• Entire Data Set
Actual / Classified as   > 10000   <= 10000
> 10000                  34195     5
<= 10000                 181       924
Model             TN       FP      FN       TP
Entire data set   99.99%   0.01%   16.38%   83.62%
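The percentages in the table follow from the confusion matrix when each cell is divided by its row total (the actual class count). A short check, with the function name `confusion_rates` chosen for illustration:

```python
def confusion_rates(tn, fp, fn, tp):
    """Express each confusion-matrix cell as a share of its actual class."""
    negatives = tn + fp  # actual sales rank > 10000
    positives = fn + tp  # actual sales rank <= 10000
    return {
        "TN": tn / negatives,
        "FP": fp / negatives,
        "FN": fn / positives,
        "TP": tp / positives,
    }


rates = confusion_rates(tn=34195, fp=5, fn=181, tp=924)
percentages = {name: round(value * 100, 2) for name, value in rates.items()}
total_records = 34195 + 5 + 181 + 924
```

The four cells sum to 35,305, matching the number of Audio CD tuples. The class imbalance (1,105 positives against 34,200 negatives) explains why the FN rate is much higher than the FP rate.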
38. Model Construction
• Over Sampling
• Maj : Min – 2 : 1
Model      TN       FP      FN       TP
Training   99.57%   0.43%   1.16%    98.84%
Testing    98.74%   1.26%   12.15%   87.85%
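The slides do not say how the over sampling was carried out. One common approach, shown here purely as an assumption, is to replicate randomly chosen minority records until the majority-to-minority ratio reaches 2 : 1. The function name, the seed, and the toy record lists are hypothetical.

```python
import random


def oversample_minority(majority, minority, ratio=2.0, seed=0):
    """Replicate minority records at random until majority : minority is ratio : 1."""
    if not minority:
        raise ValueError("minority class is empty")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    target = int(len(majority) / ratio)
    extra = max(0, target - len(minority))
    # Original minority records are kept; only the shortfall is sampled with replacement.
    resampled = list(minority) + rng.choices(minority, k=extra)
    return list(majority), resampled


majority_records = list(range(100))      # stand-ins for "sales rank > 10000" rows
minority_records = list(range(100, 110)) # stand-ins for "sales rank <= 10000" rows
maj, mino = oversample_minority(majority_records, minority_records)
```

Over sampling is normally applied to the training partition only, so that the test results reflect the original class distribution; whether the project did this is not stated in the slides.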
45. Business Problem
– Sentiment analysis of
audio CD reviews using
text analytics
– Obtain categorical insights
on audio CD reviews using
text analytics
4. Text Analytics
54. Clustering - Results
[Table: cluster labels from several clustering runs; the cell layout was lost in extraction.
Recoverable cluster names: Fans, Generics, Musicians, Outliers, Experts, NULL.
Recoverable key terms: Tracks, Songs, Musicians, Album, Artist, Band, Movie, Vocals, Music, memory device.]
58. Application
• Project done by Earlier Batch Students
• Business Objectives
– Measurement of the behavior of visitors on a web site.
– Which sources bring the most customers, orders and revenue?
What is my conversion funnel? – Campaign Effectiveness
– What navigational paths lead to most conversions? – Product
View Associations
– Does my online channel have usability issues slowing down the
adoption?
59. Mining Objectives
• Which products are viewed and bought
together to form recommendations
• Which web campaigns are most effective in
bringing the visitors to the website
• Which web campaigns are most effective in
improving the conversion rate
64. Variables
• Transaction-Id: Derived variable
– This was derived by combining the IP address, Cookie and the date
fields. This gave us a unique identifying number that was used as the
transaction-id.
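The derivation can be sketched as a mapping from each distinct (IP address, cookie, date) combination to a running number. The exact scheme used in the project is not given in the slides, so the function name, the sequential numbering, and the sample log rows are assumptions.

```python
def assign_transaction_ids(records):
    """Give every distinct (ip, cookie, date) combination its own integer id."""
    ids = {}
    transaction_ids = []
    for ip, cookie, date in records:
        key = (ip, cookie, date)
        if key not in ids:
            ids[key] = len(ids) + 1  # next unused number, starting from 1
        transaction_ids.append(ids[key])
    return transaction_ids


# Hypothetical log rows: (IP address, cookie, date)
log_rows = [
    ("10.0.0.1", "abc", "2018-12-03"),
    ("10.0.0.1", "abc", "2018-12-03"),
    ("10.0.0.2", "xyz", "2018-12-03"),
    ("10.0.0.1", "abc", "2018-12-04"),
]
trx_ids = assign_transaction_ids(log_rows)
```

Rows from the same visitor on the same day share an id, so the pages they viewed and bought can later be grouped into one transaction for association rules.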
• Page Visited : Direct variable
– The unique values for this variable were:
• a) cart.php: This is the name of the page which indicates that the
user has bought the product
• b) index.php: This is the first page of the website which the user
either reaches directly or from a referred source (Google, yahoo,
AOL etc)
• c) fruit.php : If the user has clicked on any of the fruits this is
the page that gets displayed.
65. Variables
• Referrer: Derived Variable
– This variable indicates the source from which the
user has come. The value for the referrer
was buried inside the dataset field called cs-uri-query.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
66. • Campaign : Derived Variable
– This variable indicates the kind of campaign that
was carried out on the referrer's website that
brought the user to the “bobsfruit” website. Each
campaign was represented by a number starting
from 1 onwards.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
67. • Creative: Derived Variable
– This variable indicates the subset of the
campaign that was carried out on the referrer's
website that brought the user to the “bobsfruit”
website. Like campaign, creative too was
represented by a number starting from 1
onwards.
– source=google&group=3&campaign=3&id=3&creative=1&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google
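The three derived variables (referrer, campaign, creative) can all be read from the cs-uri-query string with Python's standard `urllib.parse.parse_qs`. The function name `derive_variables` and the output keys are illustrative; mapping `source` to the referrer follows the description in the slides.

```python
from urllib.parse import parse_qs

# Example value of the cs-uri-query field, as shown on the slides.
QUERY = (
    "source=google&group=3&campaign=3&id=3&creative=1"
    "&bhcmp=7441164&bhadg=229120974&bhcrt=322008024&bhsrc=google"
)


def derive_variables(cs_uri_query):
    """Extract referrer, campaign and creative from a query string."""
    params = parse_qs(cs_uri_query)

    def first(name):
        # parse_qs returns a list per key; take the first value or None if absent.
        values = params.get(name)
        return values[0] if values else None

    return {
        "referrer": first("source"),
        "campaign": first("campaign"),
        "creative": first("creative"),
    }


derived = derive_variables(QUERY)
```

Visitors who reach the site directly have no such parameters, in which case all three values come back as `None`.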
68. Mining Techniques
• Association Rules
– Trx ID
– Product id Viewed
– Product id Bought
• Classification Trees
– Referrer
– Campaign
– Creative
– Product Viewed/Bought
73. Mining Results (Classification)
• Classification Tree Rules
– IF campaign = 1 THEN product bought = n
(No product was bought)
• Support:27.3%
– IF campaign = 2 AND creative = 0 THEN product
viewed = n (No product was viewed)
• Support:2.6%
– IF campaign = 2 AND creative = 1 THEN
product bought = y (product was bought)
• Support:8.4%