Web Data Mining
7/2/2019 Compiled by: Kamal Acharya 1
Introduction
• Web: A huge, widely distributed, highly heterogeneous,
semi-structured, interconnected information repository
• Web is a huge collection of documents plus
– Hyper-link information
– Access and usage information
Contd..
• What is Web Mining?
– Web mining is the application of data mining techniques to
find interesting and potentially useful knowledge from web
data.
– Web data:
• Web content data: Text, images, records, etc.
• Web structure data: Hyperlinks
• Web usage data: Server logs
Contd..
– Web mining is usually divided into the following three categories.
• Web content mining
• Web usage mining and
• Web structure mining
Fig: Types of web mining
Web usage mining
• Automatic discovery of patterns in clickstreams (usage data) and associated
data collected or generated as a result of user interactions with one or
more Web sites.
• Goal: analyze the behavioral patterns and profiles of users interacting
with a Web site.
• The discovered patterns are usually represented as collections of pages,
objects, or resources that are frequently accessed by groups of users
with common interests.
Contd..
• Applications: Analyzing clickstream data can help a site owner:
– determine the lifetime value of clients,
– design cross-marketing strategies across products and services,
– evaluate the effectiveness of promotional campaigns,
– optimize the functionality of Web-based applications,
– provide more personalized content to visitors, and
– find the most effective logical structure for their Web space.
Contd..
• Phases of Web Usage Mining:
– There are generally three distinct phases in web usage mining:
• Data collection and preprocessing,
• Pattern (knowledge) discovery, and
• Pattern analysis
Contd..
• Data Collection and Pre-processing Phase:
– It deals with generating and cleaning web data and
transforming it into a set of user transactions that represent
the activities of each user during a visit to the website.
– This step influences the quality and results of pattern
discovery and analysis, so it needs to be done very
carefully.
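The preprocessing step above can be sketched roughly as follows. This is only an illustration, not a method given in the slides: the log records, the list of ignored file suffixes, and the 30-minute inactivity timeout are all assumptions chosen for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (user id, timestamp, requested URL).
LOG = [
    ("u1", "2019-07-02 10:00:00", "/home"),
    ("u1", "2019-07-02 10:05:00", "/products"),
    ("u1", "2019-07-02 10:06:00", "/static/logo.png"),
    ("u2", "2019-07-02 10:07:00", "/home"),
    ("u1", "2019-07-02 11:30:00", "/home"),
    ("u1", "2019-07-02 11:32:00", "/cart"),
]

IGNORED_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")  # assumed cleaning rule
SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout


def sessionize(records):
    """Clean the records and group them into per-user sessions (transactions)."""
    by_user = defaultdict(list)
    for user, stamp, url in records:
        if url.endswith(IGNORED_SUFFIXES):
            continue  # cleaning: drop requests for embedded resources
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        by_user[user].append((when, url))

    sessions = []
    for user, visits in by_user.items():
        visits.sort()  # order each user's requests by time
        current = [visits[0][1]]
        for (prev_time, _), (this_time, url) in zip(visits, visits[1:]):
            if this_time - prev_time > SESSION_GAP:
                sessions.append((user, current))  # long gap: close the session
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


SESSIONS = sessionize(LOG)
for user, pages in SESSIONS:
    print(user, pages)
```

Each resulting `(user, pages)` pair is one user transaction that the later phases can mine.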
Contd..
• Pattern Discovery Phase
– Knowledge or pattern discovery is the key component of Web mining;
it uses algorithms and techniques from data mining.
– The data mining methods most commonly used at present are clustering,
classification, and association rule mining.
– Each method has its own strengths and shortcomings, but classification
and clustering are currently considered the most effective.
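As one illustration of the methods listed above, association rule mining over sessions can be sketched as below. The sessions and the two thresholds are made up for the example, and only rules between pairs of pages are considered; a real miner (for example Apriori) would handle larger itemsets.

```python
from itertools import permutations

# Hypothetical sessions: each is the set of pages seen in one visit.
SESSIONS = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/about"},
    {"/products", "/cart"},
]


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (a, b, support, confidence) read as 'sessions with a also contain b'."""
    total = len(sessions)
    pages = sorted(set().union(*sessions))
    rules = []
    for a, b in permutations(pages, 2):
        count_a = sum(1 for s in sessions if a in s)
        count_ab = sum(1 for s in sessions if a in s and b in s)
        support = count_ab / total        # fraction of sessions with both pages
        confidence = count_ab / count_a   # fraction of a-sessions that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


RULES = pair_rules(SESSIONS)
for a, b, support, confidence in RULES:
    print(f"{a} -> {b}  support={support:.2f}  confidence={confidence:.2f}")
```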
Contd..
• Pattern Analysis Phase:
– Pattern analysis is the final stage of Web usage mining.
– The challenges of pattern analysis are to filter out uninteresting
information and to visualize and interpret the interesting
patterns for the user.
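One possible way to filter out uninteresting patterns is to rank rules by lift, a standard interestingness measure that the slides do not name; it is used here purely as an assumption. The rules and page supports below are hypothetical values.

```python
# Hypothetical discovered rules: (antecedent, consequent, confidence).
RULES = [
    ("/cart", "/products", 1.0),
    ("/home", "/products", 0.67),
    ("/products", "/home", 0.67),
]
# Hypothetical fraction of all sessions that contain each page.
PAGE_SUPPORT = {"/home": 0.75, "/products": 0.75, "/cart": 0.5}


def interesting(rules, page_support, min_lift=1.0):
    """Keep rules whose lift exceeds min_lift, strongest first."""
    kept = []
    for a, b, confidence in rules:
        # lift > 1 means b is more likely in sessions with a than in sessions overall
        lift = confidence / page_support[b]
        if lift > min_lift:
            kept.append((a, b, round(lift, 2)))
    return sorted(kept, key=lambda rule: rule[2], reverse=True)


INTERESTING = interesting(RULES, PAGE_SUPPORT)
print(INTERESTING)
```

In this toy data only the rule from `/cart` to `/products` survives, because `/products` and `/home` are already so common that the other two rules carry no extra information.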
Web Content mining
• Web Content Mining is the process of extracting useful
information from the contents of Web documents.
• Content data corresponds to the collection of facts a Web page
was designed to convey to the users.
• It may consist of text, images, audio, video, or structured records
such as lists and tables, as shown in the figure below.
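As a small illustration of extracting structured records from a page, the sketch below pulls table rows out of HTML with Python's standard `html.parser` module. The page content and the `TableExtractor` class are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical page containing one table of records.
PAGE = """
<html><body>
<h1>Laptops</h1>
<table>
  <tr><th>Model</th><th>Price</th></tr>
  <tr><td>A100</td><td>500</td></tr>
  <tr><td>B200</td><td>750</td></tr>
</table>
</body></html>
"""

extractor = TableExtractor()
extractor.feed(PAGE)
ROWS = extractor.rows
print(ROWS)
```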
Web structure mining
• Web structure mining, one of the three categories of web mining,
identifies relationships between Web pages that are connected
by shared information or by direct links.
• It is used to study the topology of hyperlinks with or without
the description of the links.
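One well-known way to study hyperlink topology is PageRank. The slides do not name it, so the sketch below is only an illustrative choice. The link graph is hypothetical, and the code assumes every link target also appears as a key of `LINKS`.

```python
# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}


def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict-of-lists link graph."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # every page receives a small baseline share
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # a page without out-links spreads its rank over all pages
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


RANK = pagerank(LINKS)
for page, score in sorted(RANK.items(), key=lambda item: item[1], reverse=True):
    print(page, round(score, 3))
```

Page C, which receives the most in-links, ends up with the highest score, while D, which nothing links to, keeps only the baseline share.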
Contd..
• The main purpose for structure mining is to extract previously
unknown relationships between Web pages.
• A business can use structure mining to link the information on
its own Web site, which supports navigation and allows
information to be clustered into site maps.
• This lets users reach the desired information
through keyword association and content mining.
Contd..
• According to the type of web structural data, web structure
mining can be divided into two kinds: mining of hyperlinks
and mining of document structure, as shown in the figure below:
Issues and Challenges in Web Mining
• There are various issues and challenges with the web. Some
challenges include:
– Web pages are dynamic; that is, their information changes constantly.
Coping with these changes and monitoring them is an important issue for many
applications.
– Noise elimination on the web is another issue. Users experience a noisy
environment when searching for content if the information comes from
different sources. A typical Web page contains many pieces of information,
for instance navigation links, the main content of the page, copyright
notices, advertisements, and privacy policies. Only part of this information
is useful for a particular application; the rest is considered noise.
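A very rough sketch of noise elimination is shown below: it keeps the text of a page while skipping elements that usually hold navigation, advertisements, or legal notices. The set of "noise" tags and the sample page are assumptions for illustration; real pages often need content-based heuristics rather than tag names alone.

```python
from html.parser import HTMLParser

# Assumed containers of noise; real sites may mark noise up differently.
NOISE_TAGS = {"nav", "footer", "aside", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any noise element."""

    def __init__(self):
        super().__init__()
        self._noise_depth = 0  # how many noise elements we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._noise_depth == 0:
            self.chunks.append(text)


# Hypothetical page mixing main content with typical noise blocks.
PAGE = """
<html><body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<p>Web mining applies data mining techniques to web data.</p>
<aside>Advertisement: buy now!</aside>
<footer>Copyright 2019. Privacy policy.</footer>
</body></html>
"""


def main_text(html):
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


MAIN_TEXT = main_text(PAGE)
print(MAIN_TEXT)
```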
Contd..
• Because Web pages have diverse authors, multiple pages often present
similar information in different words or formats, which makes
integrating information from multiple pages a challenging
problem.
• Handling Big Data on the web is among the most important challenges:
solutions must scale in terms of volume, variety, variability, and complexity.
Contd..
• Maintaining the security and privacy of web data is not an easy task.
Advanced cryptographic algorithms are required for optimal service
on the web.
• Discovering and managing advanced hyperlink topology is
another mining issue on the web.
Web Mining Application Areas
• Web mining is an important tool for gathering knowledge about the
behavior of Website visitors, and thereby for making appropriate
adjustments and decisions with respect to a Website's actual users
and traffic patterns.
• Website Design, Web Traffic Handling, e-Business and Web
Personalization are commonly cited as four major application
areas for Web mining. These are briefly described in the
following sections.
Contd..
• Website Design:
– The content and structure of a Website are important to the user's
experience and impression of the site and to the site's usability. The problem is
that different types of users have different preferences, backgrounds,
knowledge, etc., making it difficult (if not impossible) to find a design that
is optimal for all users.
– Web usage mining can be used to detect which types of users are
accessing the website and how they behave. This knowledge can then be
used to manually design or redesign the website, or to automatically change
its structure and content based on the profile of the user visiting it.
Contd..
• Web Traffic Handling:
– The performance and service of Websites can be improved by using
knowledge of the Web traffic to predict the navigation path of the
current user. The prediction may be used for caching, load balancing, or data
distribution to improve performance. Path prediction can also be
used to detect fraud, break-ins, intrusion, etc.
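Navigation path prediction can be sketched with a first-order Markov model, which predicts the next page from the current page only. This model is one simple choice made here for illustration, not one prescribed by the slides, and the sessions are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical past sessions: ordered lists of visited pages.
SESSIONS = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/cart"],
    ["/home", "/about"],
    ["/home", "/products", "/reviews"],
]


def build_transitions(sessions):
    """Count how often each page is followed directly by each other page."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def predict_next(transitions, page):
    """Return the most frequent successor of page, or None if none was observed."""
    if page not in transitions:
        return None
    return transitions[page].most_common(1)[0][0]


TRANSITIONS = build_transitions(SESSIONS)
print(predict_next(TRANSITIONS, "/home"))      # most common page after /home
print(predict_next(TRANSITIONS, "/products"))  # most common page after /products
```

A server could, for example, pre-load the predicted page into its cache while the user is still reading the current one.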
Contd..
• Web Personalization:
– Using Web mining techniques, a website's look and feel and its
contents can be personalized to the needs of an individual
end user.
– Web personalization, or customization, is an attractive application area for
Web-based companies. It allows recommendations, marketing
campaigns, etc. to be customized for different categories of
users and, more importantly, to be customized automatically in real time as the
user accesses the Website.
Contd..
• e-Business:
– For Web-based companies, Web mining is a powerful tool for collecting
business intelligence from electronic business and gaining a competitive
advantage.
– Patterns in customers' activities on the Website can be used as
important knowledge in the decision-making process, e.g. for predicting
customers' future behavior, recruiting new customers, and developing new
products.
Contd..
• E-Learning and Digital Library:
– Web mining can be used to improve the performance of electronic
learning. Applications of web mining to e-learning are usually based on web
usage. Machine learning and web usage mining can improve web-based
learning.
Contd..
• Security and Crime Investigation:
– With the rapid growth in the popularity of the Internet, crime-related
information on the web is becoming increasingly widespread,
and most of it is in the form of text.
– Because much of the crime information in documents is described
through events, event-based semantic technology can be used
to study the patterns and trends of web-oriented crimes.
Time series data mining
• Sequential data (or time series) refers to data that appear in a specific order.
– The order defines a time axis, which differentiates this data from the other cases
we have seen so far.
• Examples
– The price of a stock (or of many stocks) over time
– Environmental data (pressure, temperature, precipitation, etc.) over time
– The sequence of queries in a search engine, or the frequency of a query
over time
– The words in a document as they appear in order
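To make one of these examples concrete, the sketch below turns a search query log into a daily frequency series for a single query. The log entries and the helper name `query_frequency` are made up for illustration.

```python
from collections import Counter

# Hypothetical search log: (day, query).
QUERY_LOG = [
    ("2019-07-01", "web mining"),
    ("2019-07-01", "stock price"),
    ("2019-07-02", "web mining"),
    ("2019-07-02", "web mining"),
    ("2019-07-03", "web mining"),
]


def query_frequency(log, query):
    """Return [(day, count)] for the query, ordered along the time axis."""
    counts = Counter(day for day, q in log if q == query)
    days = sorted({day for day, _ in log})  # ISO dates sort chronologically
    return [(day, counts[day]) for day in days]


SERIES = query_frequency(QUERY_LOG, "web mining")
print(SERIES)
```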
Contd..
• Why deal with sequential data?
– Because all data is sequential
• All data items arrive in the data store in some order
– In some (many) cases the order does not matter
• E.g., we can assume a bag of words model for a document
– In many cases the order is of interest
• E.g., stock prices do not make sense without the time
information.
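The contrast between the two cases can be illustrated with a short sketch; the sentences and prices are invented for the example. A bag-of-words model cannot tell two differently ordered documents apart, whereas reordering a price series changes what it says.

```python
from collections import Counter

# Order ignored: both documents contain the same words the same number of times.
doc_a = "the dog bit the man".split()
doc_b = "the man bit the dog".split()
same_bag = Counter(doc_a) == Counter(doc_b)
same_sequence = doc_a == doc_b

# Order matters: day-to-day changes depend on the order of the prices.
prices = [100, 102, 101, 105]


def daily_changes(series):
    return [later - earlier for earlier, later in zip(series, series[1:])]


changes_in_time_order = daily_changes(prices)
changes_after_sorting = daily_changes(sorted(prices))

print(same_bag, same_sequence)
print(changes_in_time_order, changes_after_sorting)
```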
Contd..
Fig: General time series data mining framework
Homework
• What is web data mining? In what situations can web data
mining techniques be useful?
• What are the aims of web data mining?
• Explain the difference between the three types of web data
mining.
• What data mining techniques can be used for log data analysis?
• What are time series data? Explain time series data mining.
Thank You !
