Web mining

WEB MINING
Submitted by:
Dheeraj Kashnyal
dheerajkashnyal55@gmail.com
ETL Design &
Report Specifications

Introduction
• Web mining is the use of data mining techniques to automatically discover and
extract information from Web documents and services.
• Various kinds of information extracted via Web Mining:
• Web activity, from server logs and Web browser activity tracking.
• Web graph, from links between pages, people and other data.
• Web content, for the data found on Web pages and inside of documents.
• The project is based on extracting values from
web pages and other documents found on the
web.
• This presentation covers the ETL design and
Report Specification portion.

Challenges
• The Web is noisy. A Web page typically contains a mixture of many kinds of
information, e.g., main contents, advertisements, navigation panels,
copyright notices, etc.
• The Web is dynamic. Information on the Web changes constantly. Keeping
up with these changes and monitoring them are important issues.
• Much of the Web information is redundant. The same piece of information
or its variants may appear in many pages.
• Data of almost every type exists on the Web, e.g., structured tables,
texts, multimedia data, etc.
• Much of the Web information is semi-structured due to the nested structure
of HTML code.

Data Flow of the System

Developing Data Model
• WM_FACT: Datekey, TODkey, Visitorkey, Referrerkey, Statuskey, Objectkey, Browserkey, OSKey, Timestamp of Request, GMT_Diff, TimeViewed, BytesTransferred
• DATE_DIM_TB: Datekey, Date, DayOfWeek, DayOfWeekNumber, WeekNumber, Week, MonthDay, MonthNumber, Month, Quarter, Year
• BROWSER_DIM_TB: Browser key, Browser Type, Browser Name
• OS_DIM_TB: OS_key, OS Name, OS Type
• STATUS_DIM_TB: Status_key, Status Code, StatusDescription, StatusType
• REFERRER_DIM_TB: Referrer_key, ReferringURL, ReferringSite, Keyword
• OBJECT_DIM_TB: Object_key, URL, FileName, FileType, ObjectType, Object_size, Content Page, PageName, PageType
• VISITOR_DIM_TB: Visitor_key, VisitorFlag, IPAddress, DomainName, CountryCode, Country, User_Name
• TOD_DIM_TB: TODkey, TOD Lower, TOD Higher, Period of Day
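The data model above is a star schema: one fact table whose keys reference the dimension tables. A minimal sketch of part of it follows, using Python's built-in sqlite3 module. The column types, the choice of SQLite, the underscores in multi-word column names, and the subset of tables and columns shown are all assumptions; the slides list column names only.

```python
import sqlite3

# Column types are assumptions; the slides give names only.
# Multi-word names ("TOD Lower") are written with underscores here.
DDL = """
CREATE TABLE DATE_DIM_TB (
    Datekey INTEGER PRIMARY KEY, Date TEXT, DayOfWeek TEXT,
    Month TEXT, Quarter INTEGER, Year INTEGER
);
CREATE TABLE TOD_DIM_TB (
    TODkey INTEGER PRIMARY KEY, TOD_Lower TEXT, TOD_Higher TEXT,
    Period_of_Day TEXT
);
CREATE TABLE OBJECT_DIM_TB (
    Object_key INTEGER PRIMARY KEY, URL TEXT, FileType TEXT,
    Content_Page INTEGER, PageName TEXT
);
CREATE TABLE WM_FACT (
    Datekey   INTEGER REFERENCES DATE_DIM_TB(Datekey),
    TODkey    INTEGER REFERENCES TOD_DIM_TB(TODkey),
    Objectkey INTEGER REFERENCES OBJECT_DIM_TB(Object_key),
    TimeViewed INTEGER,        -- measure: seconds the page was viewed
    BytesTransferred INTEGER   -- measure: bytes sent to the visitor
);
"""

def build_schema(conn):
    """Create the tables and return their names in alphabetical order."""
    conn.executescript(DDL)
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
tables = build_schema(conn)
print(tables)
```

The remaining dimensions (visitor, referrer, status, browser, operating system) would follow the same pattern.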

ETL DESIGN
• Given Source-to-Target (Dimension) Data Mapping
• Given Data Sources are files
• Mapping of Dimension Table
• DIMENSION TABLES
• Date Dimension
• TOD Dimension
• Visitor Dimension
• Object Dimension
• Referrer Dimension
• Status Dimension
• Browser Dimension
• Operating System Dimension
• FACT TABLE
• Click Stream Fact
• The time in seconds that the visitor viewed a page on a
particular date and time is stored in the click stream fact as a
measure.
• The bytes transferred from the web server to the user's
machine are stored as a measure.
• The referrer key points to the referrer dimension, which
provides information about the referrer of the page.
• The remaining keys are foreign keys to their respective dimensions.
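The load of a click stream fact row can be sketched as follows: each source record is translated into surrogate keys by looking up (or creating) the matching dimension entry, and the two measures are copied across. This is only an illustration under stated assumptions. The record layout, the `Dimension` class, the `period_of_day` buckets, and the sample records are hypothetical; the slides do not specify the source file format or the time-of-day ranges.

```python
from datetime import datetime

class Dimension:
    """In-memory stand-in for a dimension table: natural value -> surrogate key."""

    def __init__(self):
        self.keys = {}

    def key_for(self, natural):
        # Assign the next surrogate key the first time a value is seen.
        if natural not in self.keys:
            self.keys[natural] = len(self.keys) + 1
        return self.keys[natural]

def period_of_day(hour):
    # Hypothetical time-of-day buckets; the real TOD ranges are not given.
    if hour < 6:
        return "Night"
    if hour < 12:
        return "Morning"
    if hour < 18:
        return "Afternoon"
    return "Evening"

def to_fact_row(record, dims):
    """Map one source record to a fact row of foreign keys plus measures."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    return {
        "Datekey": dims["date"].key_for(ts.date().isoformat()),
        "TODkey": dims["tod"].key_for(period_of_day(ts.hour)),
        "Visitorkey": dims["visitor"].key_for(record["ip"]),
        "Referrerkey": dims["referrer"].key_for(record["referrer"]),
        "Objectkey": dims["object"].key_for(record["url"]),
        "TimeViewed": record["seconds_viewed"],
        "BytesTransferred": record["bytes"],
    }

dims = {name: Dimension() for name in ("date", "tod", "visitor", "referrer", "object")}

# Made-up sample records, for illustration only.
records = [
    {"timestamp": "2024-01-05 09:15:00", "ip": "10.0.0.1",
     "referrer": "example.org", "url": "/index.html",
     "seconds_viewed": 30, "bytes": 2048},
    {"timestamp": "2024-01-05 20:40:00", "ip": "10.0.0.1",
     "referrer": "example.org", "url": "/about.html",
     "seconds_viewed": 12, "bytes": 512},
]
facts = [to_fact_row(r, dims) for r in records]
```

Both sample records come from the same visitor on the same date, so they share a Visitorkey and a Datekey but receive different TODkey and Objectkey values.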

• Statistics of Visits
• The measures reported are:
• No. of visitors during the day
• No. of content pages accessed by all visitors
• No. of objects accessed by all visitors
• Total size of the data that is being delivered
• Most popular (most accessed) pages
• The report should show the following measures
• No. of visitors during the day accessing this content page
• No. of hits during the day for this content page
• The report should show only the top 5 pages, ranked by
no. of hits
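The two reports above reduce to simple aggregations over one day's fact rows. A minimal sketch follows, assuming the rows have already been joined to the visitor and object dimensions. The tuple layout and sample data are hypothetical, and "objects accessed" is interpreted here as the total number of requests, which the slides do not confirm.

```python
from collections import Counter, defaultdict

# Hypothetical rows for one day: (visitor, page, is_content_page, bytes)
hits = [
    ("v1", "/home", True, 100),
    ("v2", "/home", True, 100),
    ("v1", "/logo.png", False, 50),
    ("v1", "/news", True, 200),
    ("v3", "/home", True, 100),
]

def daily_statistics(rows):
    """Statistics of Visits: the four measures for a single day."""
    return {
        "visitors": len({visitor for visitor, _, _, _ in rows}),
        "content_pages_accessed": sum(1 for _, _, is_content, _ in rows if is_content),
        "objects_accessed": len(rows),  # assumption: every request counts as one object
        "bytes_delivered": sum(size for _, _, _, size in rows),
    }

def top_pages(rows, n=5):
    """Most popular content pages as (page, hits, distinct visitors)."""
    hit_count = Counter()
    visitors = defaultdict(set)
    for visitor, page, is_content, _ in rows:
        if is_content:
            hit_count[page] += 1
            visitors[page].add(visitor)
    return [(page, count, len(visitors[page]))
            for page, count in hit_count.most_common(n)]

stats = daily_statistics(hits)
popular = top_pages(hits)
```

`Counter.most_common(n)` returns at most `n` entries ordered by descending count, which matches the "top 5 by no. of hits" requirement.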

• Least popular (least accessed) pages
• No. of visitors during the day accessing this content page
• No. of hits during the day for this content page
• The report should show only the 5 pages with the lowest
no. of hits
• Location of the visitors
• The report should show the Date of the visit
• The report should list the location of the Visitor
• Country Code
• Country Name
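The least popular pages and visitor location reports can be sketched in the same way. The row layout and sample data below are hypothetical, and ties between pages are broken alphabetically, which is an assumption the slides do not address.

```python
from collections import Counter

# Hypothetical rows: (date, visitor_ip, country_code, country, content_page)
rows = [
    ("2024-01-05", "10.0.0.1", "IN", "India", "/home"),
    ("2024-01-05", "10.0.0.2", "US", "United States", "/home"),
    ("2024-01-05", "10.0.0.1", "IN", "India", "/news"),
    ("2024-01-05", "10.0.0.2", "US", "United States", "/faq"),
    ("2024-01-05", "10.0.0.2", "US", "United States", "/faq"),
]

def least_popular(rows, n=5):
    """Pages with the fewest hits, as (page, hits), lowest first."""
    hit_count = Counter(page for *_, page in rows)
    # Sort ascending by hit count, then by page name so ties are stable.
    return sorted(hit_count.items(), key=lambda item: (item[1], item[0]))[:n]

def visitor_locations(rows):
    """One entry per distinct (date, country code, country name)."""
    return sorted({(day, code, country) for day, _, code, country, _ in rows})

bottom = least_popular(rows, n=2)
locations = visitor_locations(rows)
```

Pages that received no hits at all never appear in the fact rows, so a report built this way lists only pages accessed at least once.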

• Most frequent visitors
• The report should list the Visitor’s details
• IP Address
• Domain name of the visitor’s IP Address
• No. of hits during the date range for this content page
• Total size of content delivered
• The report should show only the top 5 visitors, ranked by
no. of hits.
• Top Referrers and Keywords
• The report should show the Referrer Domain and the
keyword used
• The report should show the no. of hits during the date range
and the time period

• Most used Browsers and Operating systems
• The report should retrieve data for the date range between
the from and to dates.
• The report should show the no. of hits during the date range
and the time period.
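Against the star schema, this report is a join from the fact table to the date, browser, and operating system dimensions, filtered on the date range. The sketch below uses Python's sqlite3 module with a cut-down set of columns and made-up rows; the column types, the underscore spellings of multi-word column names, and the ISO date strings are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BROWSER_DIM_TB (Browser_key INTEGER PRIMARY KEY, Browser_Name TEXT);
CREATE TABLE OS_DIM_TB (OS_key INTEGER PRIMARY KEY, OS_Name TEXT);
CREATE TABLE DATE_DIM_TB (Datekey INTEGER PRIMARY KEY, Date TEXT);
CREATE TABLE WM_FACT (Datekey INTEGER, Browserkey INTEGER, OSKey INTEGER);
-- Made-up sample data, for illustration only.
INSERT INTO BROWSER_DIM_TB VALUES (1, 'Firefox'), (2, 'Chrome');
INSERT INTO OS_DIM_TB VALUES (1, 'Linux'), (2, 'Windows');
INSERT INTO DATE_DIM_TB VALUES (1, '2024-01-05'), (2, '2024-01-06'), (3, '2024-02-01');
INSERT INTO WM_FACT VALUES (1, 1, 1), (1, 2, 2), (2, 2, 2), (3, 1, 1);
""")

# ISO-formatted date strings compare correctly as text, so BETWEEN works.
QUERY = """
SELECT b.Browser_Name, o.OS_Name, COUNT(*) AS hits
FROM WM_FACT f
JOIN DATE_DIM_TB d ON d.Datekey = f.Datekey
JOIN BROWSER_DIM_TB b ON b.Browser_key = f.Browserkey
JOIN OS_DIM_TB o ON o.OS_key = f.OSKey
WHERE d.Date BETWEEN ? AND ?
GROUP BY b.Browser_Name, o.OS_Name
ORDER BY hits DESC, b.Browser_Name
"""

def browser_os_report(conn, from_date, to_date):
    """Hits per browser and operating system within the inclusive date range."""
    return conn.execute(QUERY, (from_date, to_date)).fetchall()

report = browser_os_report(conn, "2024-01-01", "2024-01-31")
```

Passing the from and to dates as query parameters keeps the same statement reusable for any range the report user selects.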
