The document discusses various topics related to web mining and data mining including:
- Web mining techniques like web content mining, web usage mining, and web structure mining.
- Common data mining techniques like classification, clustering, association rule mining etc. and how they are applied in web content mining.
- How web usage mining analyzes server log files to understand user browsing behavior and patterns.
- Classification and clustering as two popular techniques in web usage mining, with decision trees and k-means clustering as examples.
2. OUTLINE
◼Web mining
◼Data mining/Data mining techniques/ Data mining Algorithms
◼Social media mining
◼Text mining
◼Categories of web mining
Web content mining
Web Usage Mining
Web Structure Mining
https://orange.biolab.si/
3. WHAT IS WEB MINING?
Web mining is the use of data mining techniques to automatically discover and
extract information from the web.
Web mining can find interesting and potentially useful knowledge in web data.
4. WHAT IS DATA MINING?
Data mining, or knowledge discovery from data, is the process of analyzing data from
different perspectives and summarizing it into useful information.
Knowledge Discovery in Databases
Raw data → knowledge
5. DATA MINING TECHNIQUES
The most popular data mining techniques are:
Clustering
Classification
Association Rules
Correlation
Naive Bayesian
Neural Networks
Outlier detection/ Anomaly detection
Regression
Logistic Regression
7. WHAT IS WEB DATA?
Web content – text, images, records, etc.
Web structure – hyperlinks, tags, etc.
Web usage – HTTP logs, app server logs, etc.
Intra-page structures – document level
Inter-page structures – hyperlink level
Supplemental data
Profiles
Registration information
Cookies
8. DATA MINING VS. WEB MINING
Data Mining
Data is structured and relational
Well-defined tables, columns, rows, keys, and constraints.
Web Mining
Data is semi-structured (HTML) and unstructured
9. EXAMPLE: ASSESSING CREDIT RISK
Situation: Person applies for a loan
Task: Should a bank approve the loan?
Note: People who have the best credit don’t need
the loans, and people with the worst credit are not likely to repay.
The bank’s best customers are in the middle.
10. EXAMPLE: INSURANCE FRAUD
Insurance Fraud is the filing of a false claim to life, health, automobile, property or
other types of insurance benefits.
Insurance companies lose millions of dollars each year through fraudulent claims,
largely because they do not have a way to easily determine which claims are legitimate
and which may be fraudulent.
11. EXAMPLE: INSURANCE FRAUD
Data mining enables insurance companies to predict which insurance claims are likely
to be fraudulent.
http://www.hugin.com/solutions/fraud-detection-management/online-demonstration
12. OPPORTUNITIES & CHALLENGES
The amount of information on the Web is huge.
The coverage of Web information is very wide and diverse. One can find information
about almost anything. Information of almost all types exists on the Web, for
example structured tables, text, and stream data.
Much of the Web information is semi-structured due to the nested structure of HTML
code.
Much of the Web information is linked. There are hyperlinks among pages within a
site, and across different sites.
Much of the Web information is redundant. The same piece of information or its
variants may appear in many pages.
13. OPPORTUNITIES & CHALLENGES
The Web is noisy. A Web page generally contains a mixture of many kinds of
information. For example: main contents, advertisements, navigation panels, copyright
notices, etc.
The Web is dynamic. New pages are constantly being generated. Keeping up with the
changes and monitoring the changes are important issues.
Above all, the Web is a virtual society. It is not only about data, information and
services, but also about interactions among people, organizations, automatic
systems, and communities.
14. APPLICATION OF WEB MINING IN E-COMMERCE
Customer analysis
Mined data helps merchants acquire new customers, retain existing ones, and improve services and
profit by predicting customers' online purchase behavior.
◼What do the customers do?
◼What do the customers want?
◼How can web data be used effectively to market products and to serve the customer?
◼Are customers buying purposefully or just browsing?
◼Are they buying something they are familiar with or something they know little about?
◼Are they shopping from home, from work or from a hotel?
15. Web personalization
Based on information about user behavior, a website can be designed and restructured to
make it more advanced and user-friendly. In addition, the image and product value of the
company are very important in satisfying customer needs through website quality.
Personalizing a website involves tailoring content based on the characteristics of each
individual user’s online behaviors.
Personalized content is often determined by user behaviors such as pages viewed, buttons
clicked and forms submitted.
16. Product search & Recommendation
When a user searches for a product, how do we find the best results for that user?
Typically, a user query of a few keywords can match many products.
Through large-scale data analysis of query logs, we can create graphs between queries and products, and
between different products.
For example, the user who searches for “Verizon cell phones” might click on the Samsung SCH U940 Glyde
product, and the LG VX10000 Voyager. We now know the query is related to those two products, and the two
products have a relationship to each other since a user viewed (and perhaps considered buying) both.
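The query-to-product and product-to-product graphs described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: the click log, the `build_graphs` helper and the third log entry are made up, and products are linked when they were clicked for the same query (a simplification of "viewed by the same user").

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical click log: (query, clicked product) pairs.
click_log = [
    ("verizon cell phones", "Samsung SCH U940 Glyde"),
    ("verizon cell phones", "LG VX10000 Voyager"),
    ("touchscreen phone", "LG VX10000 Voyager"),
]

def build_graphs(log):
    """Return query->products edges and product<->product co-click counts."""
    query_products = defaultdict(set)
    for query, product in log:
        query_products[query].add(product)

    # Two products are related if they were clicked for the same query.
    co_clicks = defaultdict(int)
    for products in query_products.values():
        for a, b in combinations(sorted(products), 2):
            co_clicks[(a, b)] += 1
    return query_products, co_clicks

query_products, co_clicks = build_graphs(click_log)
for (a, b), weight in co_clicks.items():
    print(f"{a} <-> {b}: {weight}")
```

The edge weights could then be used to rank products for a query or to suggest related products.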
17. CATEGORIES OF WEB MINING
Web mining is divided into three categories:
1.Web Content Mining
2. Web Usage Mining
3. Web Structure Mining
18. WEB CONTENT MINING
The goal is to gather, categorize, organize and provide the best possible information available on the web to the user
requesting the information.
The data may be unstructured, structured (data from a database) or semi-structured (HTML).
Content mining is the scanning and mining of text, pictures, video, audio and graphs of a Web page to
determine the relevance of the content to the search query.
Content mining provides the results lists to search engines in order of highest relevance to the keywords in
the query.
Web content mining is related to data mining and text mining: it discovers useful information
from the contents of Web pages.
19. TEXT MINING
Text mining is the analysis of data contained in natural language text
Text mining attempts to derive meaning from the words and sentences in order to
classify documents, route messages appropriately, as well as create summaries of
content
Unstructured Data Examples: Email, Insurance Claim,
Web Pages, Technical Documents, Contracts
https://www.nytimes.com/2016/09/24/us/politics/presidential-debate-hillary-clinton-donald-trump.html?_r=0
https://www.youtube.com/watch?v=Ozo2QuCKml0
https://voyant-tools.org/
20. DATA MINING TECHNIQUES USED IN WEB CONTENT MINING
The most basic and popular data mining techniques in web content mining are:
Classification: placing documents into a predefined set of groups, such as science articles, political
articles, etc.
Clustering: grouping similar documents without predefined groups. As a result, useful documents
will not be omitted from the search results. Clustering helps the
user to easily select the topic of interest.
Summarization: reducing the length of a document while maintaining its main points. An
example of text summarization is Microsoft Word’s AutoSummarize.
Visualization: using feature extraction and key term indexing to build a graphical representation.
Through visualization, similar documents can be identified, which is useful for finding related topics in a
very large number of documents. Examples: word cloud, scatter plot, streamgraph, treemap, heat map,
Gantt chart, etc.
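Key term indexing, the input to a word cloud, can be illustrated with a simple term count. This is only a sketch: the stop-word list, the sample sentence and the `key_terms` helper are assumptions made for the example, and real systems use richer tokenization and weighting.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "of", "and", "to", "is", "in", "on"}

def key_terms(text, top_n=3):
    """Return the top_n most frequent non-stop-word terms with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

doc = "Web mining is the mining of web data. Web usage mining analyzes logs."
print(key_terms(doc, 2))
```

In a word cloud, each term would be drawn at a size proportional to its count.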
21. WEB USAGE MINING
Web usage mining
Is used to understand customer behavior.
Focuses on discovering potential knowledge from the browsing patterns of users.
Can discover the knowledge hidden in browsing patterns and analyse the visiting characteristics of
users.
The primary data source used in web usage mining is the server log files (web logs).
Browsing web pages leaves a lot of information in the log file.
Analyzing log file information helps us understand the behavior of the user.
Techniques used for discovering potential knowledge from browsing patterns are:
Clustering
Classification
Association rule
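Before any of these techniques can be applied, the server log has to be turned into per-visitor page sequences. The sketch below assumes the log uses the Common Log Format; the sample lines, the `parse` and `page_paths` helpers, and the choice to identify a visitor by host address are all illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

# Made-up sample lines in Common Log Format.
LOG_LINES = [
    '10.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /home HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:56:01 +0000] "GET /cart HTTP/1.1" 200 512',
    '10.0.0.2 - - [10/Oct/2023:13:57:12 +0000] "GET /home HTTP/1.1" 200 2326',
    '10.0.0.1 - - [10/Oct/2023:13:58:40 +0000] "GET /checkout HTTP/1.1" 200 734',
]

LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)$'
)

def parse(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    host, timestamp, method, path, status, _size = m.groups()
    return {"host": host, "time": timestamp, "method": method,
            "path": path, "status": int(status)}

def page_paths(lines):
    """Group successfully served paths by host, in log order."""
    paths = defaultdict(list)
    for line in lines:
        rec = parse(line)
        if rec and rec["status"] == 200:
            paths[rec["host"]].append(rec["path"])
    return dict(paths)

visits = page_paths(LOG_LINES)
page_counts = Counter(p for pages in visits.values() for p in pages)
print(visits)
print(page_counts.most_common(1))
```

The resulting page sequences are the kind of input that clustering, classification or association rule mining would then work on.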
40% of online shoppers don't complete
their purchases.
23. CLASSIFICATION
Classification is the most familiar and most popular data mining technique for web usage
mining.
Data classification is the process of organizing data into categories for its most effective and
efficient use.
Classification techniques are used to segment and classify observations.
Examples:
People younger than 40 with a salary above 40000 trade
online (demographic segmentation).
Blackberry was launched for business users, Samsung for
users who like Android and a variety of free applications, and Apple for
premium customers who want to be part of a unique and popular niche (behavioral
segmentation).
24. CLASSIFICATION
Classification consists of assigning a class label to a set of unclassified cases.
The goal of classification is to build a model that can be used to predict the class of records whose class
label is not known.
25. CLASSIFICATION ALGORITHMS
The most popular classification algorithms are:
Decision trees
Logistic regression
Neural networks
k-nearest neighbors
26. DECISION TREES
◼A decision tree is a graph that uses a branching method to illustrate every possible outcome of a decision.
EXAMPLE
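A minimal sketch of how a decision tree could be learned, assuming a tiny made-up dataset that follows the earlier demographic rule (age under 40 and salary over 40000 means the person trades online). The data, the Gini-impurity split criterion, and the function names gini, best_split, build_tree, and predict are illustrative assumptions; this is not the algorithm Orange uses internally.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (score, feature_index, threshold) with the lowest weighted Gini."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between two observed values
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively split until a node is pure or max_depth is reached."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    if split is None:  # no feature has two distinct values
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] >= t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Hypothetical records: (age, salary) -> trades online?
rows = [(25, 50000), (35, 60000), (30, 45000), (38, 80000), (28, 30000),
        (33, 20000), (45, 70000), (55, 90000), (50, 30000), (60, 25000)]
labels = ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"]

tree = build_tree(rows, labels)
print(tree)
print(predict(tree, (30, 55000)))
```

Each internal node tests one attribute against a threshold and each leaf holds a class label, which matches the branching picture of a decision tree.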
28. Decision Tree using Orange Data Mining
Analysing data in Orange using a decision tree.
Select file: Decision tree from the Dataset folder (on Fronter)
Exercise:
Explain the output of the decision tree
29. CLUSTERING
◼Clustering is the process of dividing a dataset into groups such that the members of
each group are as similar as possible to one another and different groups are as
dissimilar as possible from one another
◼The most popular distance-based clustering algorithm is ‘k-means’.
31. K MEANS FOR CLUSTERING
K-Means Algorithm for Clustering
The number of car accidents is
grouped by population
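The k-means idea can be sketched in a few lines of Python. This is a simplified version under stated assumptions: the 2-D points are invented, the initial centroids are simply the first k points (so the run is deterministic), and Euclidean distance is used. Real tools such as Orange typically use smarter initialization.

```python
import math

def kmeans(points, k, iterations=100):
    """Simplified k-means: returns (centroids, clusters)."""
    # Assumption: use the first k points as initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]

centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The algorithm alternates between assigning points to the nearest centroid and recomputing centroids until the centroids stop moving.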
32. CLUSTERING USING ORANGE
Select file: Clustering from the Dataset folder (on Fronter)
Select K-Means from the Unsupervised widget set.
Select MDS (Multidimensional Scaling) from the Unsupervised widget set.
Exercise:
Explain the output of the clustering, used here
to create a segmentation based only on buying behavior
https://archive.ics.uci.edu/ml/datasets/Wholesale+customers
33. ASSOCIATION RULE
Association rule mining finds interesting associations and correlation
relationships among large sets of data items.
Association rules show attribute value conditions that occur frequently
together in a given data set.
A typical example of association rule mining is Market Basket Analysis.
What items are frequently
bought together by customers?
34. EXAMPLE OF MARKET BASKET
Items that are frequently
bought together by customers should be
placed together in the store to maximize
sales.
35. PRODUCT OFFER & RECOMMENDATIONS
IF {milk, flour, sugar, eggs, candles} THEN {party hats, paper plates, magician}
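Rules of this IF/THEN form are usually scored by support (how often the items occur together) and confidence (how often the THEN part holds when the IF part does). The sketch below computes both on a small invented set of shopping baskets; the transactions and the helper names count, support, confidence, and frequent_pairs are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical market baskets, one set of items per customer visit.
transactions = [
    {"milk", "flour", "eggs"},
    {"milk", "eggs", "sugar"},
    {"milk", "bread"},
    {"flour", "eggs", "sugar"},
    {"milk", "flour", "eggs", "sugar"},
]

def count(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)

def support(transactions, itemset):
    """Fraction of transactions that contain the itemset."""
    return count(transactions, itemset) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions with the antecedent, the fraction that also have the consequent."""
    both = count(transactions, set(antecedent) | set(consequent))
    return both / count(transactions, antecedent)

def frequent_pairs(transactions, min_support):
    """All item pairs whose support is at least min_support."""
    items = sorted(set().union(*transactions))
    return [pair for pair in combinations(items, 2)
            if support(transactions, pair) >= min_support]

print(support(transactions, {"milk", "eggs"}))       # 3 of 5 baskets
print(confidence(transactions, {"milk"}, {"eggs"}))  # 3 of the 4 milk baskets
print(frequent_pairs(transactions, min_support=0.5))
```

A rule such as IF {milk} THEN {eggs} would be kept only if both its support and its confidence exceed thresholds chosen by the analyst.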
36. Association analysis in Orange
Select file: Association Rule from the Dataset folder (on Fronter)
Select Data Table from Data in the widget set.
Select Frequent Itemset from Associate
Select Association Rules from Associ
Exercise:
Explain the output of the Association
https://www.lynda.com/Business-Intelligence-tutorials/Association-analysis-Orange/475936/529739-4.html
37. WEB STRUCTURE MINING
The structure of the Web consists of Web pages as nodes and hyperlinks as edges
connecting related pages.
Research at the hyperlink level is also called HYPERLINK
ANALYSIS.
Web structure mining studies the relationships between linked pages to find useful
patterns and improves search quality by analyzing the links between pages.
Web structure mining focuses on:
Reducing irrelevant search results
Helping to index information on the web
38. Web Structure Terminology
Web-Graph: A directed graph that represents the web.
Node: Each Web page is a node of the Web-graph.
Link: Each hyperlink on the Web is a directed edge of the Web-graph.
In-degree: The in-degree of a node p is the number of distinct links that
point to p.
Out-degree: The out-degree of a node p is the number of distinct links
originating at p that point to other nodes.
39. Web Structure Terminology
Directed Path: A sequence of links, starting from p, that can be followed to reach q.
Shortest Path: Of all the paths between nodes p and q, the one with the shortest length, i.e. the
fewest links on it.
Diameter: The maximum of the shortest-path lengths over all pairs of
nodes p and q in the Web-graph (the length of the longest shortest path).
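The terminology above can be made concrete on a toy Web-graph. In this sketch the four-page graph and the function names are hypothetical, shortest paths are found with breadth-first search, and the diameter is taken only over pairs where q is actually reachable from p (an assumption, since the definition does not say how unreachable pairs are handled).

```python
from collections import deque

# Hypothetical Web-graph: page -> pages it links to.
web_graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def out_degree(graph, p):
    """Number of distinct links originating at p."""
    return len(set(graph[p]))

def in_degree(graph, p):
    """Number of distinct pages that link to p."""
    return sum(1 for targets in graph.values() if p in targets)

def shortest_path_lengths(graph, p):
    """Breadth-first search: fewest links from p to every reachable node."""
    dist = {p: 0}
    queue = deque([p])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def diameter(graph):
    """Longest shortest path over all reachable ordered pairs (p, q)."""
    return max(d for p in graph
               for d in shortest_path_lengths(graph, p).values())

print(in_degree(web_graph, "C"), out_degree(web_graph, "A"))
print(shortest_path_lengths(web_graph, "D"))
print(diameter(web_graph))
```

In this toy graph the longest shortest path runs from D to B (D, C, A, B), so the diameter is three links.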
40. Hubs and authorities are ‘fans’ and ‘centers’ of a web graph
A good hub page is one that points to many good authority pages
A good authority page is one that is pointed to by many good hub pages
Hubs and Authorities
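The mutual definition of hubs and authorities can be computed iteratively. The sketch below follows the usual hubs-and-authorities (HITS-style) update, assuming a tiny invented graph with two portal pages linking to three sites; the graph, the iteration count, and the function names normalize and hits are illustrative assumptions.

```python
import math

def normalize(scores):
    """Scale scores so that their squared values sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return scores
    return {k: v / norm for k, v in scores.items()}

def hits(graph, iterations=20):
    """Return (hub, authority) scores for a graph given as page -> linked pages."""
    nodes = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages pointing to it.
        auth = normalize({
            n: sum(hub[src] for src, targets in graph.items() if n in targets)
            for n in nodes
        })
        # A page's hub score is the sum of the authority scores of pages it points to.
        hub = normalize({
            n: sum(auth[t] for t in graph.get(n, []))
            for n in nodes
        })
    return hub, auth

# Hypothetical graph: two portals (hubs) linking to content sites (authorities).
link_graph = {
    "portal_1": ["site_a", "site_b", "site_c"],
    "portal_2": ["site_a", "site_b"],
    "site_a": [],
    "site_b": [],
    "site_c": [],
}

hub, auth = hits(link_graph)
print(max(hub, key=hub.get))
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

Here portal_1 gets the highest hub score because it points to the most authoritative sites, and site_a and site_b outrank site_c because both portals point to them.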
42. Google’s PageRank
The rank of a web page depends on the ranks of the web pages
pointing to it
The hyperlink analysis algorithm assigns a numerical weight to each
web page
PageRank increases the effectiveness of search engines
To Climb to The Top of Google Search
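The idea that a page's rank depends on the ranks of the pages pointing to it can be sketched as a simple iterative computation. This is a simplified version of the commonly described PageRank iteration, not Google's production algorithm; the damping factor of 0.85, the iteration count, and the four-page graph are assumptions chosen for illustration.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in graph (page -> linked pages).

    Assumes every link target also appears as a key in graph.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives a small base share regardless of links.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

# Hypothetical Web-graph: page -> pages it links to.
pages = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

ranks = pagerank(pages)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

Page C ends up with the highest rank because three pages point to it, while D, which no page links to, keeps only the base share.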
43. SOCIAL MEDIA MINING
Social media mining is the process of representing, analyzing, and extracting actionable patterns and trends
from raw social media data.
Social media mining uses a range of basic concepts from computer science, data mining, machine learning,
and statistics.
Social media mining is based on theory from social network analysis (SNA).
Data mining techniques in social media mining are:
Graph Mining
Text Mining
44. SOCIAL NETWORK ANALYSIS
Social network analysis [SNA] is the mapping and measuring of relationships and flows between
people, groups, organizations, computers, and other connected information/knowledge entities.
The nodes in the network are the people and groups while the links show relationships or flows
between the nodes.
SNA provides both a visual and a mathematical analysis of human relationships.
EXAMPLE:
Who knows whom and who shares what information
and knowledge with whom through what media.
45. GRAPH MINING
Extracting useful knowledge (patterns, outliers, etc.) from structured data that can be represented as a graph.
https://neo4j.com/download/
A Graph is a set of nodes and the
relationships that connect those nodes
Nodes and Relationships contain
properties to represent data.
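A graph of nodes and relationships that both carry properties can be modelled with a few plain Python classes. This sketch is only an in-memory illustration and does not use or imitate the Neo4j API; the class names, the people, and the relationship types KNOWS and SHARES_WITH are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    properties: dict = field(default_factory=dict)

@dataclass
class Relationship:
    start: str       # id of the node the relationship starts at
    end: str         # id of the node the relationship points to
    rel_type: str
    properties: dict = field(default_factory=dict)

class PropertyGraph:
    """Minimal in-memory graph: nodes and relationships both hold properties."""

    def __init__(self):
        self.nodes = {}
        self.relationships = []

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = Node(node_id, properties)

    def add_relationship(self, start, end, rel_type, **properties):
        self.relationships.append(Relationship(start, end, rel_type, properties))

    def neighbors(self, node_id, rel_type=None):
        """Ids of nodes reached from node_id, optionally filtered by relationship type."""
        return [r.end for r in self.relationships
                if r.start == node_id and (rel_type is None or r.rel_type == rel_type)]

    def out_degree(self, node_id):
        return len(self.neighbors(node_id))

# Hypothetical social network: who knows whom, and who shares information through what media.
graph = PropertyGraph()
graph.add_node("alice", role="analyst")
graph.add_node("bob", role="manager")
graph.add_node("carol", role="developer")
graph.add_relationship("alice", "bob", "KNOWS", since=2019)
graph.add_relationship("alice", "carol", "SHARES_WITH", media="email")
graph.add_relationship("bob", "carol", "KNOWS", since=2021)

print(graph.neighbors("alice", "KNOWS"))
print(graph.out_degree("alice"))
```

Simple measures such as the out-degree computed here are a starting point for spotting well-connected people in a social network.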
46. TEXT MINING
◼A social network contains a lot of data of various forms in its nodes. For example, a
social network may contain blogs, articles, messages, etc.
◼ A common application of text mining is to aid in the automatic classification of texts.
For example, it is possible to automatically "filter" out most undesirable "junk email"
based on certain terms or words that are not likely to appear in legitimate messages.
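A very small term-based junk filter might look like the sketch below. The list of junk terms, the 0.2 threshold, and the function names tokenize, junk_score, and is_junk are assumptions made up for illustration; practical filters learn such terms and weights from labelled messages.

```python
import re

# Hypothetical terms assumed to be rare in legitimate messages.
JUNK_TERMS = {"winner", "lottery", "prize", "free"}

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

def junk_score(text, junk_terms=JUNK_TERMS):
    """Fraction of words in the message that are known junk terms."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in junk_terms) / len(tokens)

def is_junk(text, threshold=0.2):
    """Classify a message as junk when its junk score reaches the threshold."""
    return junk_score(text) >= threshold

messages = [
    "You are a winner! Claim your free lottery prize now",
    "Meeting moved to Tuesday, agenda attached",
]
for message in messages:
    print(is_junk(message), "-", message)
```

The first message is flagged because four of its ten words are junk terms, while the second contains none.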
48. SUMMARY
◼ Web mining
◼ Data mining
◼ Data mining techniques
◼ Web Data
◼ Applications of web mining in E-commerce
◼ Categories of web mining
Web content mining
Text mining
Data mining
o Classification
o Clustering
o Summarization
o Visualization
Web Usage Mining
Clustering – K-means algorithm
Classification – Decision tree
Association rule – Market basket analysis
Web Structure Mining
◼ Social Media Mining
Graph Mining
Text Mining