Understanding websites
Trainings by Vidya Bhagwat
Websites:
A website is hosted on at least one web server and is accessible
via a network such as the Internet or a private local area network
through an Internet address known as a Uniform Resource
Locator (URL). All publicly accessible websites collectively constitute
the World Wide Web.
Importance of websites:
• Internet marketing comes of age
• Internet marketing is now a major, multi-billion dollar industry.
• Despite some concerns, many consumers now have the skills and the
confidence to transact purchases using the web.
• Local business is affected as well
• Many small business operators have been disappointed with
the results achieved by their websites.
• Sites have been created, but little if any business has resulted.
• There are a number of reasons:
• unrealistic expectations;
• poor website construction (not search engine friendly);
• poor targeting.
• Local search is growing in importance. Local search is the
ability to search for and find businesses and organizations in
the local area, that is, in close geographic proximity.
• Its importance will vary from business to business.
Website structure understanding:
• Website Structure Understanding and its Applications
• Website structure understanding can be treated as a form of reverse
engineering: automatically discovering the
layout templates and URL patterns of a website, and
understanding how these templates and patterns are
combined to organize the website. Work on this problem
has had a great impact on many applications that can
use such site-level knowledge to help web search and
data mining.
• What’s Website Structure?
• In this project, the website structure consists of three
components: layout templates, URL patterns, and linkage
structure.
• Layout Template:
• Most web pages consist of HTML elements such as tables, menus,
buttons, images, and input boxes. The layout of a web page
describes which HTML elements the page includes and
how these elements are arranged visually when the page is
rendered. Essentially, a page layout is represented by a
DOM (Document Object Model) tree. In this project, a
layout template is a group of pages that have
very similar layouts (DOM trees).
• Link Structure
• Based on the layout templates and URL patterns, we can
construct a directed graph to represent how the website is
organized. Each layout template is
a node in the graph, and two nodes are linked if
there are hyperlinks between the pages belonging to the two
nodes. The direction of a link is the same as that of the underlying
hyperlinks, and each link is characterized by the URL pattern
of the corresponding hyperlink URLs. Note
that there can be multiple links from one node to
another if the corresponding hyperlinks have more than one
URL pattern.
• Fig. 2 gives an illustrative example of the sub-graph
constructed from the layout templates and URL patterns
above.
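The graph described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the template names, the URL patterns, and the build_site_graph helper are hypothetical, and each hyperlink is assumed to be already labeled with its source template, target template, and URL pattern.

```python
from collections import defaultdict

def build_site_graph(hyperlinks):
    """Build a directed multigraph of layout templates.

    hyperlinks: iterable of (source_template, target_template, url_pattern).
    Returns {(source, target): set of URL patterns}. Each pattern in the set
    stands for one link, so two nodes can be joined by several links.
    """
    graph = defaultdict(set)
    for source, target, pattern in hyperlinks:
        graph[(source, target)].add(pattern)
    return dict(graph)

# Hypothetical labeled hyperlinks from a sampled site.
links = [
    ("list_page", "profile_page", "/user/*"),
    ("list_page", "profile_page", "/profile.php?id=*"),
    ("list_page", "profile_page", "/user/*"),   # same pattern, same link
    ("profile_page", "album_page", "/album/*"),
]
site_graph = build_site_graph(links)
```

Keying the dictionary by the ordered pair (source, target) preserves link direction, and storing a set of patterns per pair allows multiple links between the same two nodes.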
• Random Sampling
• The goal of random sampling is to provide a snapshot of a
website by downloading only a relatively small number of
pages. The sampling quality is the foundation of the whole
mining process. To keep the downloaded pages as diverse as
possible, in practice the sampling process adopts a
strategy that combines breadth-first and depth-first traversal, so it can
reach pages at deep levels within a few steps.
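One way to combine breadth-first and depth-first moves is sketched below. The slides do not give the exact strategy, so this is an assumption: at each step the sampler randomly takes either the newest or the oldest entry of the frontier. The sample_pages function, the depth_bias parameter, and the toy_site link table (which stands in for downloading and parsing real pages) are all hypothetical.

```python
import random
from collections import deque

def sample_pages(link_graph, start, max_pages, depth_bias=0.5, seed=0):
    """Sample up to max_pages pages reachable from start.

    link_graph maps a URL to the URLs it links to. With probability
    depth_bias the newest frontier entry is taken (a depth-first move),
    otherwise the oldest one (a breadth-first move).
    """
    rng = random.Random(seed)
    frontier = deque([start])
    seen = {start}
    sampled = []
    while frontier and len(sampled) < max_pages:
        url = frontier.pop() if rng.random() < depth_bias else frontier.popleft()
        sampled.append(url)
        for nxt in link_graph.get(url, []):
            if nxt not in seen:        # never queue the same page twice
                seen.add(nxt)
                frontier.append(nxt)
    return sampled

# Hypothetical site: each key links to the pages in its list.
toy_site = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a/1", "/a/2"],
    "/a/1": ["/a/1/x"],
    "/b": ["/b/1"],
}
pages = sample_pages(toy_site, "/", max_pages=4)
```

Raising depth_bias toward 1.0 makes the sampler dive deeper sooner; lowering it toward 0.0 makes it cover each level more evenly.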
• Inspired by this observation, this project uses DOM paths
to characterize the layout of a web page. As shown in
Fig. 5, a DOM path is a path from a leaf node to the root of
the DOM tree. The leaf node indicates the component type,
and the path to the root approximately describes the visual
location of that component when the page is rendered.
• Given a set of HTML pages, all unique DOM paths are
extracted to form a feature space. Each page is represented
by a point in the feature space, so the layout similarity of
two pages can be estimated. A bottom-up strategy is then
used to group similar pages, and each resulting cluster is treated
as a layout template.
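The two steps above can be sketched with Python's standard html.parser module. Several choices here are assumptions rather than the project's method: similarity is measured as the Jaccard overlap of DOM-path sets, clusters are merged greedily until no pair reaches a threshold, and the names DomPathCollector, dom_paths, cluster_pages, and the sample pages are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # no closing tag

class DomPathCollector(HTMLParser):
    """Collects leaf-to-root tag paths such as 'img/div/body/html'."""

    def __init__(self):
        super().__init__()
        self.stack = []     # [tag, has_child_element] per open element
        self.paths = set()

    def _record_leaf(self, tag):
        ancestors = [t for t, _ in reversed(self.stack)]
        self.paths.add("/".join([tag] + ancestors))

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][1] = True          # parent is not a leaf
        if tag in VOID_TAGS:
            self._record_leaf(tag)            # void elements are always leaves
        else:
            self.stack.append([tag, False])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not any(t == tag for t, _ in self.stack):
            return
        while self.stack:                     # also closes unclosed children
            open_tag, has_child = self.stack.pop()
            if not has_child:
                self._record_leaf(open_tag)
            if open_tag == tag:
                break

def dom_paths(html):
    collector = DomPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def cluster_pages(pages, threshold=0.6):
    """Bottom-up grouping: start with one cluster per page and keep merging
    the most similar pair until no pair reaches the threshold."""
    clusters = [([name], dom_paths(html)) for name, html in pages.items()]
    while len(clusters) > 1:
        score, i, j = max(
            (jaccard(clusters[i][1], clusters[j][1]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if score < threshold:
            break
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(names) for names, _ in clusters]

# Hypothetical pages: two item pages share a layout, the login page does not.
pages = {
    "item1": "<html><body><div><h1>A</h1><img src='a.png'></div></body></html>",
    "item2": "<html><body><div><h1>B</h1><img src='b.png'></div></body></html>",
    "login": "<html><body><form><input name='u'><input name='p'></form></body></html>",
}
templates = cluster_pages(pages)
```

Because only tag paths are compared, the two item pages fall into one template even though their text and image sources differ.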
• URL Pattern Discovery
• A URL is not an ordinary string: its syntax is
strictly defined by Internet standards (RFC 3986). Based on this syntax,
a URL string can be represented by a group of
key-value pairs. Fig. 6 gives an example URL, its syntax structure,
and the corresponding key-value pairs.
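A rough sketch of such a decomposition, using Python's standard urllib.parse module, is shown below. The key names (scheme, host, path_0, query_user, and so on), the url_to_pairs helper, and the example URL are assumptions made for illustration; the figure in the original material may use a different naming scheme.

```python
from urllib.parse import urlsplit, parse_qsl

def url_to_pairs(url):
    """Turn a URL into a flat dictionary of key-value pairs."""
    parts = urlsplit(url)
    pairs = {"scheme": parts.scheme, "host": parts.netloc}
    # One key per path segment, numbered from the left.
    segments = [s for s in parts.path.split("/") if s]
    for i, segment in enumerate(segments):
        pairs[f"path_{i}"] = segment
    # One key per query parameter, prefixed to avoid clashing with other keys.
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        pairs[f"query_{key}"] = value
    return pairs

pairs = url_to_pairs("http://www.example.net/album/show.php?user=alice&id=42")
```

In this example the keys path_0 and path_1 hold the directory and the script name, while query_user and query_id hold the parameters.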
• Different URL components (or keys) usually
have different functions and play different roles in a website.
In general, keys denoting directories, functions, and
document types take only a few values, which should be
recorded explicitly in a URL pattern. By contrast, keys
denoting parameters such as user names take quite
diverse values, which should be generalized in the
pattern. Based on this observation, this project proposes a top-down
recursive split process that constructs a pattern tree
to characterize a set of URLs. Fig. 7 gives an example pattern
tree based on URLs from www.wretch.cc. Algorithm details
please refer to.
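The idea of a top-down recursive split can be sketched as follows. This is not the project's algorithm, whose details are not given here; it is a simplified stand-in that assumes each URL is already a dictionary of key-value pairs, splits on the key with the fewest distinct values, and generalizes every key that has more than max_values distinct values to "*". The function names, the threshold, and the sample data are hypothetical.

```python
def build_pattern_tree(urls, fixed=None, max_values=2):
    """urls: list of key-value dicts, one per URL. Returns a nested dict."""
    fixed = dict(fixed or {})
    keys = {k for u in urls for k in u} - set(fixed)
    # Distinct values per remaining key (a missing key counts as None).
    spread = {k: {u.get(k) for u in urls} for k in keys}
    splittable = [k for k in keys if len(spread[k]) <= max_values]
    if not splittable:
        # Every remaining key is too diverse: generalize it with a wildcard.
        pattern = dict(fixed, **{k: "*" for k in keys})
        return {"pattern": pattern, "count": len(urls), "children": []}
    # Split on the key with the fewest distinct values (ties broken by name).
    key = min(splittable, key=lambda k: (len(spread[k]), k))
    children = []
    for value in sorted(spread[key], key=str):
        group = [u for u in urls if u.get(key) == value]
        children.append(build_pattern_tree(group, {**fixed, key: value}, max_values))
    return {"pattern": fixed, "count": len(urls), "children": children}

def leaf_patterns(node):
    """Collect the URL patterns stored at the leaves of the tree."""
    if not node["children"]:
        return [node["pattern"]]
    return [p for child in node["children"] for p in leaf_patterns(child)]

# Hypothetical URLs: two directories, many different user names.
urls = [
    {"path_0": "album", "user": "alice"},
    {"path_0": "album", "user": "bob"},
    {"path_0": "album", "user": "carol"},
    {"path_0": "blog", "user": "alice"},
    {"path_0": "blog", "user": "dave"},
    {"path_0": "blog", "user": "erin"},
]
tree = build_pattern_tree(urls)
patterns = leaf_patterns(tree)
```

Here the directory key is kept explicitly because it has only two values, while the user key is generalized because it varies widely, which matches the observation in the text.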
• Website Designing India has assisted hundreds of businesses
to build or update a website customized to their requirements.
You get more than just a website with our Website Designing
Services. You can update your website content easily, take
credit card payments online, and use tools such as poll
managers, news managers, photo galleries, and form builders.
Whether you're looking for an ecommerce web design
company or a web development company that showcases
your business, our website designing & development services
give you control over your site with no technical skills needed.
Domain name:
• A domain name is a unique name that identifies a website. It
is an identification string that defines a realm of
administrative autonomy, authority, or control on the Internet.
Domain names are formed by the rules and procedures of the
Domain Name System (DNS). Any name registered in the DNS is a
domain name.
• Domain names are used in various networking contexts and
for application-specific naming and addressing purposes. In
general, a domain name represents an Internet Protocol (IP)
resource, such as a personal computer used to access the
Internet, a server computer hosting a web site, the web
site itself, or any other service communicated via the Internet.
In 2010, the number of active domains reached 196 million.
Use in web site hosting
• The domain name is a component of a Uniform Resource
Locator (URL) used to access web sites, for example:
• URL: http://www.example.net/index.html
• Top-level domain name: net
• Second-level domain name: example.net
• Host name: www.example.net
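The breakdown of the example URL can be reproduced with a short sketch based on Python's standard urllib.parse module. The domain_parts helper is hypothetical, and the split is deliberately naive: it assumes the top-level domain is the last label of the host name, which does not hold for multi-label suffixes such as co.uk.

```python
from urllib.parse import urlsplit

def domain_parts(url):
    """Naively split the host name of a URL into its domain levels."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return {
        "host_name": host,
        "top_level_domain": labels[-1],
        "second_level_domain": ".".join(labels[-2:]),
    }

parts = domain_parts("http://www.example.net/index.html")
```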
• A domain name may point to multiple IP addresses in order to
provide server redundancy for the services
offered; such multi-address capability is used to manage the
traffic of large, popular web sites. More commonly, however,
one server computer, at a given IP address, hosts web
sites in several different domains. Such address overloading enables
virtual web hosting, commonly used by large web hosting
services to conserve IP address space. IP-address overloading
is possible because the HTTP version 1.1 protocol requires that each
request identify the host name being addressed (the Host header);
the HTTP version 1.0 protocol has no such requirement.
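How a server can use the host name in a request to choose among several sites at one IP address is sketched below. This is a simplified illustration, not how any particular web server is implemented: the pick_site function, the sites table, and the directory paths are hypothetical, and the parsing ignores details such as IPv6 literals in the Host header.

```python
def pick_site(raw_request, sites, default="default-site"):
    """Return the site configured for the Host header of a raw HTTP request."""
    head = raw_request.split("\r\n\r\n", 1)[0]
    for line in head.split("\r\n")[1:]:            # skip the request line
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip().lower().split(":")[0]   # drop an optional port
            return sites.get(host, default)
    return default    # no Host header, as in a plain HTTP/1.0 request

# Hypothetical virtual hosts sharing one server and one IP address.
sites = {
    "www.example.net": "/srv/example-net",
    "www.example.org": "/srv/example-org",
}
request = "GET /index.html HTTP/1.1\r\nHost: www.example.org\r\n\r\n"
root = pick_site(request, sites)
```

A request without a Host header falls back to the default site, which is why name-based virtual hosting depends on the HTTP/1.1 requirement.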
Contact Information
• To obtain further information about any of our databases,
services, or programs, contact NCBI:
PubMed Customer Service:
• Send an email for help with technical issues, searching, or
content assistance
• Call 1-888-FIND-NLM (1-888-346-3656) for help with
searching or content assistance only
• General Information: info@ncbi.nlm.nih.gov
• Questions about and technical support for NCBI and its
programs and services
• BLAST: blast-help@ncbi.nlm.nih.gov
• Technical questions on running or interpreting BLAST
sequence comparison searches
