Web mining involves applying data mining techniques to automatically discover and extract information from web documents and services. It has three main types: web content mining, which extracts useful information from web document contents; web structure mining, which analyzes the hyperlink structure of websites; and web usage mining, which involves discovering patterns from user interactions on websites. Popular algorithms for web mining include PageRank for web structure mining and HITS for determining both hub and authority pages.
2. What is Web Mining?
The Web is a collection of inter-linked files/documents on the Internet.
The World Wide Web contains a large amount of data that provides a rich source for data mining.
Web Mining is the application of data mining techniques to automatically discover and extract information from Web documents and services.
The objective of Web mining is to look for patterns in Web data by collecting and examining data in order to gain insights.
Web mining helps improve the power of web search engines by identifying web pages and classifying web documents.
Web mining is very useful to e-commerce websites and e-services.
3. What is Web Mining?
Web mining is used to predict user behavior.
It is used in Web search (e.g., Google, Yahoo) and in vertical search.
Web mining has the distinctive property of dealing with a variety of data types.
The web has multiple aspects that yield different approaches for the mining process:
Web pages consist of text,
Web pages are linked via hyperlinks, and
User activity can be monitored via web server logs.
4. Challenges in Web Mining
The amount of information on the Web is huge.
The coverage of Web information is very wide and diverse.
Web information is semi-structured.
Web information is linked.
Web information is redundant.
The Web consists of the Surface web and the Deep web.
Web information is dynamic.
Web information is noisy.
5. Web Mining Taxonomy
Web mining is divided into three branches:
Web Content Mining: extracting useful knowledge from the contents of Web documents.
Web Usage Mining: extracting interesting patterns from user interactions on multiple websites.
Web Structure Mining: extracting the hyperlink structure connecting different web sites/applications.
6. Web Structure Mining
Web structure mining is the application of discovering structure information from the web.
Structure mining basically shows the structured summary of a particular website.
It identifies relationships between web pages linked by information or by direct link connections.
Web structure mining can be very useful for determining the connection between two commercial websites.
7. Web Structure Mining
The web graph consists of web pages as nodes and hyperlinks as edges connecting related pages.
Types of mining:
Document level mining (intra-page)
Hyperlink level mining (inter-page)
Research at the hyperlink level is known as Hyperlink Analysis.
Web pages can be accessed using URLs, hyperlinks, or navigational tools.
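The node-and-edge view above can be sketched with a plain adjacency list. The page names and links below are invented purely for illustration:

```python
# A tiny web graph as an adjacency list: each page maps to the
# pages it links out to. Page names are hypothetical.
web_graph = {
    "A": ["B", "C"],  # page A links to B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def in_links(graph):
    """Derive in-links (who points at each page) from the out-links."""
    incoming = {page: [] for page in graph}
    for src, targets in graph.items():
        for dst in targets:
            incoming.setdefault(dst, []).append(src)
    return incoming

print(in_links(web_graph))  # C is pointed to by A, B and D
```

Both the in-link and out-link views of the same graph are used by the ranking algorithms on the following slides.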
8. Web Structure Mining
Web page ranking algorithms: PageRank (PR)
The Google search engine uses the PageRank algorithm to rank websites in its search engine results.
It is a way of measuring the importance of website pages.
The PageRank algorithm outputs a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page.
PageRank can be calculated for collections of documents of any size.
“It counts the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.”
https://www.youtube.com/watch?v=meonLcN7LD4
10. Web Structure Mining
Web page ranking algorithms: PageRank (PR)
Google interprets a link from page A to page B as a vote by page A for page B.
All incoming links can be interpreted as votes.
11. Web Structure Mining
Web page ranking algorithms: PageRank (PR)
PageRank also takes into consideration the “importance” of the page that is “giving” out the vote.
If the page casting a vote is more important, then its links are worth more and will help rank up the other pages.
A page’s importance is equal to the sum of the votes of its incoming links.
A web page is important if it is pointed to by other important web pages.
Types of links:
Inbound links (in-links)
Outbound links (out-links)
12. Web Structure Mining
Web page ranking algorithms: PageRank (PR)
The page rank of a page P_i, denoted PR(P_i), is the sum of the page ranks of all pages pointing into P_i, each divided by its number of out-links:

PR(P_i) = Σ_{P_j → P_i} PR(P_j) / |P_j|

where P_j → P_i denotes the set of pages pointing to P_i, and |P_j| is the number of out-links from P_j.
PR is the page rank, which is initially unknown.
All pages are initially given an equal page rank of 1/n, where n is the number of pages in Google's index of the web.
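The summation above can be applied iteratively until the ranks settle. The following is a minimal sketch; the three-page graph and the fixed iteration count are illustrative assumptions:

```python
def simple_pagerank(graph, iterations=20):
    """Iterate PR(P_i) = sum over P_j -> P_i of PR(P_j) / |P_j|.

    graph maps each page to the list of pages it links out to.
    All pages start with equal rank 1/n, as on the slide.
    """
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in graph}
        for page, out_links in graph.items():
            if not out_links:
                continue  # dangling pages distribute nothing in this sketch
            share = ranks[page] / len(out_links)  # PR(P_j) / |P_j|
            for target in out_links:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks

# Hypothetical three-page web forming a cycle: A -> B -> C -> A.
graph = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(simple_pagerank(graph))  # in a cycle, every page keeps rank 1/3
```

Each round simply redistributes every page's current rank evenly over its out-links, which is exactly the summation on this slide.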
17. Web Structure Mining
Web page ranking algorithms: PageRank (PR)
Simple PageRank algorithm:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where:
PR(A) is the page rank of page A,
PR(Ti) is the page rank of a page Ti which links to page A,
C(Ti) is the number of outbound links on page Ti, and
d is the damping factor, which can be set between 0 and 1 (usually 0.85): the probability that a random user continues to click on links as they browse the web.
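The damped formula can be sketched the same way. The three-page graph and iteration count below are illustrative assumptions, and the code follows the un-normalised form on this slide, in which the ranks sum to n rather than to 1:

```python
def pagerank(graph, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T_i) / C(T_i))
    over the pages T_i that link to A."""
    ranks = {page: 1.0 for page in graph}
    for _ in range(iterations):
        new_ranks = {}
        for page in graph:
            # Pages T_i linking into this page.
            incoming = (src for src, outs in graph.items() if page in outs)
            total = sum(ranks[src] / len(graph[src]) for src in incoming)
            new_ranks[page] = (1 - d) + d * total
        ranks = new_ranks
    return ranks

# Invented example: A links to B and C, B links to C, C links back to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
# C collects votes from both A and B, so it ends up ranked highest.
```

With no dangling pages, each iteration preserves the total rank, so the scores converge to a fixed point of the slide's equation.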
19. Web Structure Mining
Hyperlink Induced Topic Search (HITS) Algorithm
A Link Analysis Algorithm that rates webpages, developed by Jon
Kleinberg.
This algorithm is used to discover and rank the webpages relevant
for a particular search.
HITS uses hubs and authorities to define a recursive relationship
between webpages.
Given a query to a search engine, the set of highly relevant web pages
is called the root set. These pages are potential authorities.
Pages that are not themselves very relevant but point to pages in the
root set are called hubs.
An authority is a page that many hubs link to, whereas a hub is a page
that links to many authorities.
21. Web Structure Mining
Hyperlink Induced Topic Search (HITS) Algorithm
Hub and authority scores:
Compute for each page d in the base set a hub score ℎ(𝑑) and an
authority score 𝑎(𝑑).
Initialize for all d: h(d) = 1; a(d) = 1.
In each iteration, update a(d) to the sum of h(e) over all pages e
that link to d, and update h(d) to the sum of a(e) over all pages e
that d links to; then normalize the scores.
Output pages with the highest h scores as top hubs.
Output pages with the highest a scores as top authorities.
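The hub/authority iteration can be sketched as follows. The three-page base set is invented: two hub-like pages both pointing at one authority-like page.

```python
import math

# HITS sketch: authority score = sum of hub scores of pages linking in;
# hub score = sum of authority scores of pages linked to; normalize each step.
def hits(out_links, iterations=50):
    pages = list(out_links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(d) = sum of h(e) for pages e that link to d
        auth = {p: sum(hub[q] for q in pages if p in out_links[q]) for p in pages}
        # h(d) = sum of a(e) for pages e that d links to
        hub = {p: sum(auth[t] for t in out_links[p]) for p in pages}
        # normalize so scores do not grow without bound
        a_norm = math.sqrt(sum(v * v for v in auth.values()))
        h_norm = math.sqrt(sum(v * v for v in hub.values()))
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical base set: H1 and H2 are hub-like, X is authority-like
graph = {"H1": ["X"], "H2": ["X"], "X": []}
hub, auth = hits(graph)
```

On this toy graph X ends up as the top authority and H1/H2 as equal top hubs, matching the recursive definition above.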
25. Web Structure Mining
PageRank vs HITS
PageRank:
Ranks all nodes of the entire web graph in advance; a search is then
applied over the precomputed ranks.
Based on the 'random surfer' idea.
Example: Google
HITS:
Applied at query time, on a subgraph of pages selected for the query.
Defines hubs and authorities recursively.
Examples: Ask, Yahoo, Clever
26. Web Content Mining
Web content mining is the process of extracting useful information
from the content of web documents.
Web content consists of several types of data: text, images, audio,
video, etc.
Content data is the collection of facts that a web page is designed
to convey.
It can provide effective and interesting patterns about user needs.
Mining text documents draws on text mining, machine learning and
natural language processing; for this reason web content mining is
also known as text mining.
This type of mining scans and mines the text, images and groups of
web pages according to the content of the input.
27. Web Content Mining
With web mining, one can
Cluster web documents
Classify web pages
Extract patterns
Research in this field draws on techniques from other disciplines
such as Information Retrieval (IR) and Natural Language Processing
(NLP).
Applications
Document clustering / categorization
Topic identification / tracking
Concept discovery
28. Web Content Mining
Web Scraping
Web Scraping is an automatic method to obtain large amounts of data
from websites.
Unstructured data in an HTML format is converted into structured data
in a spreadsheet or a database so that it can be used in various
applications.
There are many different ways to perform web scraping to obtain data
from websites:
using online services,
using dedicated APIs, or
writing your own scraping code from scratch.
Websites like Google, Twitter, Facebook, StackOverflow, etc. have APIs
that allow you to access their data in a structured format.
For websites that do not expose their data in a structured form, web
scraping is used to extract it directly from the pages.
29. Web Content Mining
Web Scraping
Web scraping requires two parts:
the crawler and the scraper.
The crawler is an artificial-intelligence algorithm that browses the
web, following links across the internet to search for the particular
data required.
The scraper is a tool built specifically to extract that data from
the website.
30. Web Content Mining
Web Crawler
A web crawler, also known as a web spider, is a bot that searches and
indexes content on the internet.
Web crawlers are responsible for understanding the content of a web
page so they can retrieve it when a query is made.
Web crawlers are operated by search engines with their own
algorithms.
A web spider searches (crawls), categorizes and indexes all the web
pages on the internet that it can find.
The most well known web crawler is the Googlebot.
31. Web Content Mining
How does a Web Crawler work?
Finds information by crawling
Crawlers look at webpages and follow links on those
pages, much like you would if you were browsing content
on the web.
They go from link to link and bring data about those
webpages back to the servers.
Organizes information by indexing
When crawlers find a webpage, systems render the
content of the page, just as a browser does.
It takes note of key signals — from keywords to website
freshness — and keeps track of it all in the Search index.
https://www.youtube.com/watch?v=jkuRWcH1-kk
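The crawl-and-follow-links loop described above can be sketched with the standard library. Real crawlers fetch pages over HTTP; here the "web" is a hypothetical in-memory dict mapping URLs to HTML, so the example runs offline.

```python
from html.parser import HTMLParser
from collections import deque

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

def crawl(web, seed):
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)                  # "index" the page
        parser = LinkExtractor()
        parser.feed(web.get(url, ""))      # render/parse the page content
        for link in parser.links:          # follow links not yet visited
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

web = {
    "/home":  '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/home">Home</a>',
    "/blog":  '<a href="/post1">Post</a>',
    "/post1": "No links here.",
}
visited = crawl(web, "/home")
```

The `seen` set is what keeps the crawler from revisiting pages and looping forever on cyclic links.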
32. Web Content Mining
Applications of Web Crawler
Crawlers are the basis for how search engines work.
Web crawlers are also used for other purposes:
Price comparison portals search for information on specific products on the
Web, so that prices or data can be compared accurately.
In the area of data mining, a crawler may collect publicly available e-mail or
postal addresses of companies.
Web analysis tools use crawlers or spiders to collect data on page
views and incoming or outbound links.
Crawlers serve to provide information hubs with data, for example, news
sites.
33. Web Content Mining
Web Scraping
First the web scraper is provided the URLs of the required sites.
Then it loads all the HTML code for those sites.
Then the scraper extracts the required data from this HTML code and
outputs it in the format specified by the user.
The output is usually an Excel spreadsheet or a CSV file, but the
data can also be saved in other formats such as a JSON file.
https://www.youtube.com/watch?v=US-HP1hiRHY
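The load-HTML, extract-data, write-CSV pipeline above can be sketched with the standard library alone. The HTML snippet and the `price` field are invented for illustration; `html.parser` stands in for a full-featured parser, and `io.StringIO` stands in for a CSV file on disk.

```python
import csv, io
from html.parser import HTMLParser

class PriceScraper(HTMLParser):
    """Collects the text of every <span class="price"> element."""
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []
    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True
    def handle_data(self, data):
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

# Step 1: the loaded HTML (here a hard-coded snippet)
html_page = '<div><span class="price">19.99</span><span class="price">4.50</span></div>'

# Step 2: extract the required data
scraper = PriceScraper()
scraper.feed(html_page)

# Step 3: output in the user-specified format (CSV)
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["price"])
for p in scraper.prices:
    writer.writerow([p])
csv_text = out.getvalue()
```

Swapping `io.StringIO` for `open("prices.csv", "w", newline="")` would write the same rows to an actual file.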
34. Web Content Mining
Web Scraping Applications
Price Intelligence
Market Research
Alternative data for finance
Real estate
News and content monitoring etc.
35. Web Content Mining
Web Scraping
Python is a popular programming language for web scraping.
Scrapy:
Open-source web crawling framework.
Ideal for web scraping as well as extracting data using APIs.
Beautiful Soup:
A library highly suitable for web scraping.
It creates a parse tree that can be used to extract data from HTML on
a website.
36. Web Usage Mining
Focuses on predicting the behavior of users while they are interacting
with the WWW.
Focus:
How are web sites used by users?
Understanding the behavior of different user segments.
Predicting future user behavior.
Extracting information meaningful to an individual user or group.
What pages are accessed frequently?
From which search engines are visitors coming?
Which browsers and operating systems are most commonly used by visitors?
What are the most recent visits per page?
Who is visiting which page?
37. Web Usage Mining
Analyzes user activities on different web pages and tracks them over
a period of time.
Three types of Web Data
Web content data
Web structure data
Web usage data
The main sources of data here are the web server and the application
server.
Web usage mining works on the log data collected by these sources.
Usage data can be categorized into three types based on where it is
collected:
Server side
Client side
Proxy side
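Server-side usage data typically starts life as web server access logs. The sketch below parses lines in the Common Log Format; the format itself is real, but the sample log lines are invented.

```python
import re

# Common Log Format: host ident authuser [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) \S+'
)

def parse_log(lines):
    """Extract IP, timestamp, method, path and status from each log line."""
    records = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m:
            records.append(m.groupdict())
    return records

sample = [
    '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.4 - - [10/Oct/2024:13:55:40 +0000] "GET /about.html HTTP/1.1" 404 512',
]
records = parse_log(sample)
```

Records like these (IP, page, time, status) are the raw material that the techniques below mine for sessions and patterns.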
38. Web Usage Mining
Types of Web Usage mining based on the usage data
Web server data
includes IP addresses, browser logs, proxy server logs, user profiles, etc.
Application server data
consists mainly of business events tracked and logged in the application
server logs.
Application-level data
covers the various new kinds of events that can be defined within an
application.
39. Web Usage Mining
Techniques used in Web Usage Mining
40
Techniques
used in Web
Usage Mining
Association
Rules
Classification Clustering
40. Web Usage Mining
Techniques used in Web Usage Mining
Association Rules
This technique focuses on relations among web pages that frequently
appear together in users' sessions.
Pages accessed together are grouped into a single server session.
Association rules help in the reconstruction of websites using the
access logs.
When many rules are produced together, some of them may turn out to
be completely inconsequential.
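A minimal sketch of the association idea: count how often pairs of pages appear together in the same session, and keep pairs whose support (fraction of sessions containing both) clears a threshold. The sessions and page paths are hypothetical; a real system would use an algorithm such as Apriori over much larger logs.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(sessions, min_support=0.5):
    """Return page pairs whose co-occurrence support >= min_support."""
    pair_counts = Counter()
    for session in sessions:
        # count each unordered pair of distinct pages once per session
        for pair in combinations(sorted(set(session)), 2):
            pair_counts[pair] += 1
    n = len(sessions)
    return {pair: count / n for pair, count in pair_counts.items()
            if count / n >= min_support}

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/blog"],
    ["/home", "/products", "/cart", "/checkout"],
]
pairs = frequent_pairs(sessions)
```

A pair like ("/home", "/products") appearing in 3 of 4 sessions has support 0.75 and survives the threshold; rare pairs are pruned, which is exactly how inconsequential rules get filtered out.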
41. Web Usage Mining
Techniques used in Web Usage Mining
Classification
Develops profiles of users/customers that are associated with a
particular class or category.
This requires extracting the features that best characterize the
associated class.
Classification can be implemented by various algorithms, including
Support Vector Machines, K-Nearest Neighbors, Logistic Regression,
Decision Trees, etc.
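To make the idea concrete, here is a tiny k-nearest-neighbors classifier that assigns a visitor to a segment. The features (pages per visit, average seconds on page) and the "casual"/"engaged" labels are made up; a real system would use a library such as scikit-learn.

```python
import math
from collections import Counter

def knn_predict(train, features, k=3):
    """Classify a feature vector by majority vote of its k nearest neighbors."""
    # train: list of (feature_vector, label) pairs
    nearest = sorted(train, key=lambda item: math.dist(item[0], features))
    top_labels = [label for _, label in nearest[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical user profiles: [pages per visit, avg seconds on page]
train = [
    ([2, 30], "casual"), ([3, 25], "casual"), ([1, 10], "casual"),
    ([12, 200], "engaged"), ([9, 180], "engaged"), ([15, 240], "engaged"),
]
label = knn_predict(train, [10, 190])
```

A new visitor with 10 pages per visit and 190 seconds per page lands among the "engaged" examples and is classified accordingly.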
42. Web Usage Mining
Techniques used in Web Usage Mining
Clustering
There are mainly two types of clusters: usage clusters and page
clusters.
Clustering of pages can be readily performed based on the usage data.
In usage-based clustering, items that are commonly accessed or
purchased together can be automatically organized into groups.
43. Virtual Web View
A view on top of the World Wide Web.
An approach to handling unstructured data.
Uses a Multiple Layered Database (MLDB) built on top of the web.