The Data Warehousing and Mining subject is offered at Gujarat Technological University in the Information Technology branch.
This topic is from Chapter 8, Advanced Topics.
Introduction to Web Mining and Spatial Data Mining
1. GUJARAT TECHNOLOGICAL UNIVERSITY
Introduction to Web Mining and Spatial Data Mining
Active Learning Assignment of Data Warehousing and Mining (3161610)
PREPARED BY: AARSH DHOKAI, DHARMAM SAVANI
GUIDED BY: PROF. RAVI PATEL
A. D. Patel Institute of Technology
2. • What is Data Mining?
  • Data mining is the process of extracting and discovering patterns in large data sets, using methods at the intersection of machine learning, statistics, and database systems.
• What is Web Mining?
  • Web mining is the application of data mining techniques to automatically discover and extract information from Web documents and services.
  • The main purpose of web mining is to discover useful information from the World Wide Web and its usage patterns.
3. DATA MINING V/S WEB MINING
| Points | Data Mining | Web Mining |
|---|---|---|
| Definition | Data mining is the process that attempts to discover patterns and hidden knowledge in large data sets in any system. | Web mining is the application of data mining techniques to automatically discover and extract information from web documents. |
| Application | Data mining is very useful for finding patterns in large batches of data. | Web mining is very useful for a particular website and e-service. |
| Performed By | Data scientists and data engineers. | Data scientists along with data analysts. |
| Access | Data mining accesses data privately. | Web mining accesses data publicly. |
| Structure | Data mining gets its information from an explicit structure. | Web mining gets its information from structured, unstructured and semi-structured web pages. |
| Problem Type | Clustering, classification, regression, prediction, optimization and control. | Web content mining, web structure mining, web usage mining. |
| Tools | It includes tools like machine learning algorithms. | Special tools for web mining are Scrapy, PageRank and Apache logs. |
| Skills | It includes approaches for data cleansing, machine learning algorithms, statistics and probability. | It includes application-level knowledge and data engineering, with mathematical modules like statistics and probability. |
4. WHY WEB MINING?
• Web mining is the application of data mining techniques to discover patterns, structures, and knowledge from the Web.
• The World Wide Web is a fertile source for data mining.
• The World Wide Web serves as a huge, widely distributed, global information center for news, advertisements, consumer information, financial management, education, government, and e-commerce.
5. TYPES OF WEB MINING
Web Mining
• Content Mining
• Structure Mining
• Usage Mining
6. WEB CONTENT MINING
• Web content mining is the process of extracting useful information from the content of web documents.
• Web content consists of several types of data: text, image, audio, video, or structured records such as lists and tables.
• Web content mining has been studied extensively by researchers, search engines, and other web service companies.
• Web content mining can build links across multiple web pages for individuals; therefore, it has the potential to inappropriately disclose personal information.
7. WEB CONTENT MINING
Web content mining is done to:
• understand the content of web pages.
• provide scalable and informative keyword-based page indexing.
• perform entity/concept resolution.
• determine web page relevance and ranking.
• produce web page content summaries.
• extract other valuable information related to web search and analysis.
8. WEB STRUCTURE MINING
• Web structure mining uses graph theory to analyze the node and connection structure of a web site.
• According to the type of web structural data, web structure mining can be divided into two kinds:
  • Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects the web page to a different location.
  • Mining the document structure: analysis of the tree-like structure of pages to describe HTML or XML tag usage.
• Web structure mining terminology:
  • Web graph: directed graph representing the web.
  • Node: a web page in the graph.
  • Edge: a hyperlink.
  • In-degree: number of links pointing to a particular node.
  • Out-degree: number of links going out from a particular node.
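The web graph terminology above can be illustrated with a short Python sketch. The page names and links are made up for illustration, and the helper `degrees` is a hypothetical name, not part of any library.

```python
from collections import defaultdict

def degrees(edges):
    """Return (in_degree, out_degree) dicts for a directed web graph.

    `edges` is an iterable of (source_page, target_page) hyperlinks.
    """
    in_degree = defaultdict(int)
    out_degree = defaultdict(int)
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
        # Touch both maps so every page appears in each, even with degree 0.
        in_degree[source] += 0
        out_degree[target] += 0
    return dict(in_degree), dict(out_degree)

# Hypothetical three-page site: nodes are pages, edges are hyperlinks.
links = [("home", "about"), ("home", "blog"), ("blog", "home"), ("about", "home")]
in_deg, out_deg = degrees(links)
print("in-degree:", in_deg)
print("out-degree:", out_deg)
```

Here "home" is linked to by two pages and links out to two pages, so both its in-degree and out-degree are 2.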
9. WEB STRUCTURE MINING
Web structure mining is done to:
• Evaluate the quality of a web page or rank web pages
• Give the authority of a page on a topic
• Decide which pages to crawl
• Find related pages
• Detect duplicated pages
Example: Google PageRank algorithm
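As a rough illustration of link-based ranking, the sketch below runs a simplified PageRank by power iteration on a tiny made-up graph. It is a teaching sketch, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the page names are assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `graph` maps each page to the list of pages it links to.
    Every link target must also appear as a key of `graph`.
    """
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not graph[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page in pages:
            targets = graph[page]
            for target in targets:
                # Each page shares its rank equally among the pages it links to.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank

# Hypothetical four-page web: C receives links from A, B and D.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

Page C collects the most incoming links and ends up with the highest score, while D, which no page links to, ends up with the lowest.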
10. WEB USAGE MINING
• It is the process of extracting useful information from the server logs of users.
• It is classified into three kinds of usage data:
  • Web Server Data: user logs collected by the web server, including IP address, page reference and access time.
  • Application Server Data: the ability to track various kinds of business events and log them in application server logs.
  • Application Level Data: new kinds of events defined in an application and logged by generating histories of the events.
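To make the web server data concrete, here is a minimal Python sketch that extracts the IP address, access time and page reference from log lines. It assumes the logs follow Apache's Common Log Format; the sample lines, IP addresses and paths are made up, and `parse_log_line` is a hypothetical helper.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout (Common Log Format):
# ip identity user [time] "method page protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record

sample_lines = [
    '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.11 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '192.0.2.10 - - [10/Oct/2023:13:57:12 +0000] "GET /products.html HTTP/1.1" 200 5120',
]
records = [r for r in (parse_log_line(line) for line in sample_lines) if r]

# A first usage pattern: how often each page was requested.
page_hits = Counter(r["page"] for r in records)
print(page_hits.most_common())
```

Counting requests per page is only the simplest usage pattern; the same parsed records could be grouped by IP address or by time window.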
11. WEB USAGE MINING
Web usage mining is done to:
• find patterns related to general or particular groups of users.
• understand users' search patterns, trends, and associations.
• predict what users are looking for on the Internet.
• help improve search efficiency and effectiveness.
• promote products or related information to different groups of users at the right time.
Web search companies routinely conduct web usage mining to improve their quality of service.
12. TOOLS FOR WEB MINING
• Web Usage Mining: R, Oracle Data Mining, Tableau
• Web Content Mining: Scrapy (Python)
• Web Structure Mining: HITS algorithm, PageRank algorithm
14. IN BUSINESS
• Web mining enables e-commerce companies to do personalized marketing, which eventually results in higher trade volumes.
• Companies can establish better customer relationships by understanding the needs of the customer better and reacting to customer needs faster.
• Companies can find, attract and retain customers; they can save on production costs by using the acquired insight into customer requirements.
15. SECURITY AND CRIME INVESTIGATION
• Government agencies are using this technology to classify threats and fight against terrorism. The predictive capability of mining applications can benefit society by identifying criminal activities.
• Terrorist groups use the Web as their infrastructure for various purposes.
• Web usage mining aims to track down online access to abnormal content, which may include terrorist-generated sites, by analyzing the content of information accessed by Web users.
16. SEARCH ENGINES
• Web mining helps to improve the power of web search engines by classifying web documents and identifying web pages.
• It is used for web searching, e.g., Google, Yahoo, etc.
• The use of data mining in web search engines helps in analyzing the content and at the same time delivering results that are relevant for the users. As a result, digital marketers who are focused on creating valuable content for users are sure to benefit from the impact of data mining on SEO.
17. ADVANTAGES OF WEB MINING
• The amount of information on the Web is huge and easily accessible.
• The coverage of Web information is very wide and diverse. One can find information about almost anything.
• Data of almost all types exist on the Web, e.g., structured tables, texts, multimedia data, etc.
• Much of the Web information is linked. There are hyperlinks among pages within a site, and across different sites.
18. CHALLENGES IN WEB MINING
• Much of the Web information is redundant. The same piece of information or its variants may appear in many pages.
• Much of the Web information is semi-structured due to the nested structure of HTML code.
• The Web is noisy. A Web page typically contains a mixture of many kinds of information, e.g., main contents, advertisements, navigation panels, copyright notices, etc.
• The Web is dynamic. Information on the Web changes constantly. Keeping up with the changes and monitoring the changes are important issues.
19. CHALLENGES IN WEB MINING
• URLs can be tracked to access the data.
• Since data can be updated, it may not be trustworthy.
• Multiplicity of events and URLs.
• Large amounts of data remain unused.
• Data may be inaccurate.
• Data may be incomplete and unavailable.
21. WHAT IS SPATIAL DATA?
• Spatial data is any data with a direct or indirect reference to a specific location or geographical area.
• Spatial data is often referred to as geospatial data or geographic information.
22. INTRODUCTION TO SPATIAL DATA MINING
• Spatial data mining is the process of discovering interesting, useful, non-trivial patterns from large spatial datasets.
• E.g. determining hotspots and unusual locations.
• The spatial data mining tasks are covered in the following slides.
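As a toy illustration of hotspot detection, the sketch below bins point locations into a regular grid and reports cells with many points. The source does not prescribe a method; this simple density count, the coordinates, the cell size and the threshold are all assumptions, and `find_hotspots` is a hypothetical helper.

```python
from collections import Counter

def find_hotspots(points, cell_size, min_count):
    """Return {grid_cell: count} for cells holding at least `min_count` points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_count}

# Made-up incident locations in planar map units.
incidents = [(0.2, 0.3), (0.8, 0.1), (0.5, 0.9), (3.1, 4.2), (7.5, 7.7), (0.4, 0.6)]
hotspots = find_hotspots(incidents, cell_size=1.0, min_count=3)
print(hotspots)
```

Four of the six incidents fall in grid cell (0, 0), so it is the only cell reported. Real hotspot analysis would usually add a statistical significance test on top of raw counts.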
23. SPATIAL DATA MINING TASKS
• Classification:
  • Finds a set of rules which determine the class of the classified object according to its attributes.
  • e.g. classify remotely sensed images based on spectrum and GIS data.
• Association Rules:
  • Find (spatially related) rules from the database. Association rules describe patterns which occur often in the database.
  • An association rule has the following form: A => B (s%, c%), where s is the support of the rule (the probability that A and B hold together in all the possible cases) and c is the confidence (the conditional probability that B is true under the condition of A).
  • e.g. Rain (x, pour) => landslide (x, happen), support is 76%, and confidence is 51%.
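The definitions of support and confidence can be sketched in Python as follows. The observations are made up for illustration and are unrelated to the percentages quoted on the slide; `support_confidence` is a hypothetical helper.

```python
def support_confidence(records, antecedent, consequent):
    """Compute support and confidence of the rule antecedent => consequent.

    support    = fraction of all records where both A and B hold
    confidence = fraction of records with A where B also holds
    """
    total = len(records)
    with_a = sum(1 for r in records if antecedent(r))
    with_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = with_both / total if total else 0.0
    confidence = with_both / with_a if with_a else 0.0
    return support, confidence

# Made-up observations for five regions.
regions = [
    {"rain": "pour", "landslide": True},
    {"rain": "pour", "landslide": True},
    {"rain": "pour", "landslide": False},
    {"rain": "light", "landslide": False},
    {"rain": "light", "landslide": True},
]
s, c = support_confidence(
    regions,
    antecedent=lambda r: r["rain"] == "pour",
    consequent=lambda r: r["landslide"],
)
print(f"support = {s:.0%}, confidence = {c:.0%}")
```

In this toy data, two of five regions have both heavy rain and a landslide (support 40%), and two of the three heavy-rain regions have a landslide (confidence about 67%).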
24. SPATIAL DATA MINING TASKS
• Clustering:
  • Groups the objects from the database into clusters in such a way that objects in one cluster are similar and objects from different clusters are dissimilar.
  • e.g. we can find clusters of cities with a similar level of unemployment, or we can cluster pixels into similarity classes based on spectral characteristics.
• Trend Detection:
  • Finds trends in the database. A trend is a temporal pattern in some time series data. A spatial trend is defined as a pattern of change of a non-spatial attribute in the neighborhood of a spatial object.
  • e.g. Google Maps traffic detection.
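The clustering task can be illustrated with a basic k-means sketch that groups cities by unemployment level. k-means is only one possible clustering method and is chosen here as an assumption; the unemployment figures and starting centroids are made up, and `kmeans` is a hypothetical helper.

```python
def kmeans(points, centroids, iterations=20):
    """Basic k-means. `centroids` holds the initial guesses, one per cluster."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(
                range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centroids

# Made-up unemployment rates (%) for five hypothetical cities.
cities = ["P", "Q", "R", "S", "T"]
rates = [(3.1,), (3.4,), (9.8,), (10.2,), (2.9,)]
labels, centers = kmeans(rates, centroids=[(0.0,), (12.0,)])
for city, label in zip(cities, labels):
    print(city, "-> cluster", label)
```

Cities P, Q and T end up in the low-unemployment cluster and R and S in the high-unemployment cluster. The same function works for multi-dimensional points such as pixel spectra.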
25. SPATIAL DATA MINING TASKS
• Characteristic Rules:
  • Describe a common character of one kind of spatial entity, or of several kinds of spatial entities. They are a kind of tested knowledge for summarizing similar features of objects in a target class.
  • e.g. characterize similar ground objects in a large set of remote sensing images.
• Discriminant Rules:
  • Describe differences between two parts of the database.
  • e.g. compare land prices at the urban boundary with land prices in the urban center.
26. SPATIAL DATABASE
• A spatial database is similar to a plain relational database, but in addition to storing data on qualitative and quantitative attributes, it stores data about physical location and feature geometry type.
• Every record in a spatial database is stored with numeric coordinates that represent where that record occurs on a map, and each feature is represented by only one of these three geometry types:
  • Point
  • Line
  • Polygon
• It stores a large amount of space-related data.
• Examples: maps, remote sensing, medical imaging, VLSI chip layout.
27. SPATIAL DATABASE
• Whether you want to calculate the distance between two places on a map or determine the area of a particular piece of land, you can use spatial database querying to quickly and easily make automated spatial calculations on entire sets of records at one time.
• You can use spatial databases to perform almost all the same types of calculations on, and manipulations of, attribute data that you can in a plain relational database system.
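The two calculations mentioned above, distance between two places and area of a piece of land, can be sketched in plain Python. A spatial database would provide these as built-in query functions; the version below is only an illustration. It assumes a spherical Earth of radius 6371 km for the haversine distance and planar coordinates for the shoelace area formula, and both function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def polygon_area(vertices):
    """Area of a simple polygon given as (x, y) vertices in planar units (shoelace formula)."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# A made-up rectangular plot, 4 units by 3 units.
plot = [(0, 0), (4, 0), (4, 3), (0, 3)]
print("plot area:", polygon_area(plot))
print("quarter of the equator (km):", round(haversine_km(0, 0, 0, 90), 1))
```

Applying such functions to every record in a table is what spatial database querying automates over entire sets of records.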
28. SPATIAL CLASSIFICATION
• Analyze spatial objects to derive classification schemes, such as decision trees, in relevance to certain spatial properties (district, highway, river).
• Classifying medium-size families according to income, region, and infant mortality rates.
• Mining data for volcanoes on Venus.
• Employ methods such as:
  • Decision-tree classification, naïve Bayesian classifier + boosting, neural network, etc.
29. SPATIAL TREND ANALYSIS
Function
• Detect changes and trends along a spatial dimension.
• Study the trend of non-spatial or spatial data changing with space.
Application examples
• Observe the trend of changes in the climate.
• Crime rate or unemployment rate changes with regard to the geographic distribution of cities.
• Traffic flows on highways and in cities.
30. APPLICATIONS OF SPATIAL DATA MINING
| Domain | Spatial Data Mining Application |
|---|---|
| Public Safety | Discovery of hotspot patterns from crime event maps |
| Epidemiology | Detection of disease outbreaks |
| Neuroscience | Discovering patterns of human brain activity from neuroimages |
| Climate Science | Finding positive or negative correlations between temperatures of distant places |
| Business | Market allocation to maximize stores' profits |
31. OTHER APPLICATIONS
• Spatial data mining is used in:
  • Space technology: ISRO GPS system
  • Security: the National Crime Records Bureau uses spatial data to track down criminals
  • GIS, geo-marketing, remote sensing, image database exploration, medical imaging, navigation
32. CHALLENGES IN SPATIAL DATA MINING
• Complexity of spatial data types and access methods.
• Large amounts of data require huge data storage facilities.