Web Browser
A web browser is a software application that enables a user to display...

developing standards, specifically with XHTML and CSS (cascading style sheets, used
for page layout).
Some of the more ...

Microsoft proposed Cascading Style Sheets over Netscape's JavaScript Style Sheets
(JSSS) by W3C, the Netscape browser s...

    •   HTML, XML and XHTML
    •   Graphics file formats including GIF, PNG, JPEG, and SVG
    •   Cascading Style Sh...

Markup is the process of taking ordinary text and adding extra symbols. Each of the
symbols used for markup in HTML is ...

In November 2006, the HTML Working Group published a new charter indicating its
intent to resume development of HTML in...

XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and
republished August 1, 2002. It offers ...

Sometimes web services or browser manufacturers remedy these shortcomings. For
instance, members of the modern social s...

"boldface" in bold text, but has no clear semantics for aural devices that read the text
aloud for the sight-impaired. ...

<span     id='anId'    class='aClass'     style='color:red;'    title='HyperText       Markup

The goal of semantic HTML requires two things of authors:
1) to avoid the use of presentational markup (elements, attr...

Some aspects of authoring documents make separating semantics from style (in other
words, meaning from presentation) d...

HTML e-mails. Use of HTML in e-mail is controversial due to compatibility issues,
because it can be used in phishing/p...

4.01 and its associated protocols to begin with, or erroneously implements the HTML
To understand the...

followed by a div, is it because the document is not well-formed (the closing paragraph
label is missing) or is the do...

align attribute on div, form, paragraph (p), and heading (h1...h6) elements

align, noshade, size, and width attribute...

Summary of flavors
As this list demonstrates, the loose flavors of the specification are maintained for legacy

As of the release of Windows Vista, NetMeeting is no longer included and has been
replaced by Windows Meeting Space.

   •   Online discourse environment
   •   Talker
   •   Internet Relay Chat
   •   Instant messen...

independent of the plugins, making it possible for plugins to be added and updated
dynamically without changes to the ...

This article concerns communication between pairs of electronic devices. For the
specific topic of computing protocols...

An even higher protocol may perform network functions. One very common protocol is
the Internet protocol (IP), which i...

In general, the performance of TCP is severely degraded in conditions of high packet loss
(more than 0.1%), due to the...

aggregated networks like backbones, where the motto "bandwidth is cheap" fails to
deliver on its promise. Experience h...

                                 1 Typical properties
                                 2 Importance

    •   DHCP (Dynamic Host Configuration Protocol).
    •   IMAP (Internet Message Access Protocol).


High-level architecture of a standard Web crawler
A crawler must not only have a good crawling strategy, as noted i...

The goal of storing an index is to optimize the speed and performance of finding relevant
documents for a search query...

Ngram indices - for storing sequences of n length of data to support other types of
retrieval or text mining. Term doc...

index is similar to the term document matrices employed by latent semantic analysis. The
inverted index can be conside...

A fictitious estimate of 250 words per webpage on average, based on the assumption of
being similar to the pages of a ...

Language Ambiguity - to assist with properly ranking matching documents, many search
engines collect additional inform...

and language tagging. Automated language recognition is the subject of ongoing research
in natural language processing...

CAB - Microsoft Windows Cabinet File
Gzip - Gzip file
BZIP - Bzip file
TAR, GZ, and TAR.GZ - Unix Gzip'ped Archives
Internet application unit2
Published in: Education

  1. 1. 1 Web Browser A web browser is a software application that enables a user to display and interact with text, images, and other information typically located on a web page at a website on the World Wide Web or a local area network. Text and images on a web page can contain hyperlinks to other web pages at the same or different website. Web browsers allow a user to quickly and easily access information provided on many web pages at many websites by traversing these links. Web browsers format HTML information for display, so the appearance of a web page may differ between browsers. Some of the web browsers available for personal computers include Internet Explorer, Mozilla Firefox, Safari, Netscape, and Opera in order of descending popularity (as of August 2006).[1] Web browsers are the most commonly used type of HTTP user agent. Although browsers are typically used to access the World Wide Web, they can also be used to access information provided by web servers in private networks or content in file systems. Protocols and standards Web browsers communicate with web servers primarily using HTTP (hypertext transfer protocol) to fetch webpages. HTTP allows web browsers to submit information to web servers as well as fetch web pages from them. The most commonly used HTTP is HTTP/ 1.1, which is fully defined in RFC 2616. HTTP/1.1 has its own required standards that Internet Explorer does not fully support, but most other current-generation web browsers do. Pages are located by means of a URL (uniform resource locator), which is treated as an address, beginning with http: for HTTP access. Many browsers also support a variety of other URL types and their corresponding protocols, such as ftp: for FTP (file transfer protocol), rtsp: for RTSP (real-time streaming protocol), and https: for HTTPS (an SSL encrypted version of HTTP). The file format for a web page is usually HTML (hyper-text markup language) and is identified in the HTTP protocol using a MIME content type. 
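The URL schemes and MIME content types described above can be illustrated with a minimal Python sketch. It assumes only the standard library; the `example.org` addresses and the helper name `describe` are hypothetical, and guessing a type from the file extension is only a stand-in for the Content-Type header that a real server would send.

```python
from urllib.parse import urlsplit
import mimetypes


def describe(url):
    """Return (scheme, host, guessed MIME type) for a URL."""
    parts = urlsplit(url)
    # guess_type looks only at the file extension; a real web server
    # announces the type in the Content-Type response header instead.
    mime, _encoding = mimetypes.guess_type(parts.path)
    return parts.scheme, parts.hostname, mime


# Hypothetical addresses, one per URL scheme mentioned in the text.
for url in ("http://example.org/index.html",
            "https://example.org/logo.png",
            "ftp://example.org/pub/readme.txt"):
    print(describe(url))
```

The scheme tells the browser which protocol to speak, while the MIME type tells it how to interpret the bytes it receives.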
Most browsers natively support a variety of formats in addition to HTML, such as the JPEG, PNG and GIF image formats, and can be extended to support more through the use of plugins. The combination of HTTP content type and URL protocol specification allows web page designers to embed images, animations, video, sound, and streaming media into a web page, or to make them accessible through the web page. Early web browsers supported only a very simple version of HTML. The rapid development of proprietary web browsers led to the development of non-standard dialects of HTML, leading to problems with Web interoperability. Modern web browsers support a combination of standards- and defacto-based HTML and XHTML, which should display in the same way across all browsers. No browser fully supports HTML 4.01, XHTML 1.x or CSS 2.1 yet. Currently many sites are designed using WYSIWYG HTML generation programs such as Macromedia Dreamweaver or Microsoft Frontpage. These often generate non-standard HTML by default, hindering the work of the W3C in
  2. 2. 2 developing standards, specifically with XHTML and CSS (cascading style sheets, used for page layout). Some of the more popular browsers include additional components to support Usenet news, IRC (Internet relay chat), and e-mail. Protocols supported may include NNTP (network news transfer protocol), SMTP (simple mail transfer protocol), IMAP (Internet message access protocol), and POP (post office protocol). These browsers are often referred to as Internet suites or application suites rather than merely web browsers. Brief history A NeXTcube was used by Tim Berners-Lee (who pioneered the use of hypertext for sharing information) as the world's first web server, and also to write the first web browser, WorldWideWeb in 1990. Berners-Lee introduced it to colleagues at CERN in March 1991. Since then the development of web browsers has been inseparably intertwined with the development of the web itself. The first browser, Silversmith, was created by John Bottoms in 1987.[2] The browser, based on SGML tags, used a tag set from the Electronic Document Project of the AAP with minor modifications and was sold to a number of early adopters. At the time SGML was used exclusively for the formatting of printed documents. The use of SGML for electronically displayed documents signaled a shift in electronic publishing and was met with considerable resistance. Silversmith included an integrated indexer, full text searches, hypertext links between images text and sound using SGML tags and a return stack for use with hypertext links. It included features that are still not available in today's browsers. These include capabilities such as the ability to restrict searches within document structures, searches on indexed documents using wild cards and the ability to search on tag attribute values and attribute names. SGML-FAQ US Patent In 1992, Tony Johnson releases the MidasWWW browser. 
Based on Motif/X, MidasWWW allows viewing of PostScript files on the Web from Unix and VMS, and even handles compressed PostScript. Another early popular web browser was ViolaWWW, which was modeled after HyperCard. However, the explosion in popularity of the web was triggered by NCSA Mosaic which was a graphical browser running originally on Unix but soon ported to the Apple Macintosh and Microsoft Windows platforms. Version 1.0 was released in September 1993, and was dubbed the killer application of the Internet. Marc Andreessen, who was the leader of the Mosaic team at NCSA, quit to form a company that would later be known as Netscape Communications Corporation. Netscape released its flagship Navigator product in October 1994, and it took off the next year. Microsoft, which had thus far not marketed a browser, now entered the fray with its Internet Explorer product, purchased from Spyglass Inc. This began what is known as the browser wars, the fight for the web browser market between Microsoft and Netscape. The wars put the web in the hands of millions of ordinary PC users, but showed how commercialization of the web could stymie standards efforts. Both Microsoft and Netscape liberally incorporated proprietary extensions to HTML in their products, and tried to gain an edge by product differentiation. Starting with the acceptance of the
  3. 3. 3 Microsoft proposed Cascading Style Sheets over Netscape's JavaScript Style Sheets (JSSS) by W3C, the Netscape browser started being generally considered inferior to Microsoft's browser version after version, from feature considerations to application robustness to standard compliance. The wars effectively ended in 1998 when it became clear that Netscape's declining market share trend was irreversible. This trend may have been due in part to Microsoft's integrating its browser with its operating system and bundling deals with OEMs; Microsoft faced antitrust litigation on these charges. Netscape responded by open sourcing its product, creating Mozilla. This did nothing to slow Netscape's declining market share. The company was purchased by America Online in late 1998. At first, the Mozilla project struggled to attract developers, but by 2002 it had evolved into a relatively stable and powerful internet suite. Mozilla 1.0 was released to mark this milestone. Also in 2002, a spin off project that would eventually become the popular Mozilla Firefox was released. In 2004, Firefox 1.0 was released; Firefox 1.5 was released in November 2005. Firefox 2, a major update, was released in October 2006 and work has already begun on Firefox 3 which is scheduled for release in 2007. As of 2006, Mozilla and its derivatives account for approximately 12% of web traffic. Opera, an innovative, speedy browser popular in handheld devices, particularly mobile phones, as well as on PCs in some countries was released in 1996 and remains a niche player in the PC web browser market. It is available on Nintendo's DS, DS Lite and Wii consoles[2]. The Opera Mini browser uses the Presto (layout engine) like all versions of Opera, but runs on most phones supporting Java Midlets. The Lynx browser remains popular for Unix shell users and with vision impaired users due to its entirely text-based nature. 
There are also several text-mode browsers with advanced features, such as w3m, Links (which can operate both in text and graphical mode), and the Links forks such as ELinks. The Macintosh scene too has traditionally been dominated by Internet Explorer and Netscape. However, Apple's Safari, the default browser on Mac OS X since version 10.3, has slowly grown to dominate this market. In 2003, Microsoft announced that Internet Explorer would no longer be made available as a separate product but would be part of the evolution of its Windows platform, and that no more releases for the Macintosh would be made. However, in early 2005, Microsoft changed its plans, releasing version 7 of Internet Explorer for Windows XP, Windows Server 2003, and Windows Vista in October 2006. Features Different browsers can be distinguished from each other by the features they support. Modern browsers and web pages tend to utilize many features and techniques that did not exist in the early days of the web. As noted earlier, with the browser wars there was a rapid and chaotic expansion of browser and World Wide Web feature sets. The following is a list of some of the most notable features: • Standards support • HTTP and HTTPS
    •   HTML, XML and XHTML
    •   Graphics file formats including GIF, PNG, JPEG, and SVG
    •   Cascading Style Sheets (CSS)
    •   JavaScript (Dynamic HTML) and XMLHttpRequest
    •   Cookie
    •   Digital certificates
    •   Favicons
    •   RSS, Atom

Fundamental features
    •   Bookmark manager
    •   Caching of web contents
    •   Support of media types via plugins such as Macromedia Flash and QuickTime

Usability and accessibility features
    •   Autocompletion of URLs and form data
    •   Tabbed browsing
    •   Spatial navigation
    •   Caret navigation
    •   Screen reader or full speech support

HTML
HTML, short for HyperText Markup Language, is the predominant markup language for the creation of web pages. It provides a means to describe the structure of text-based information in a document (by denoting certain text as headings, paragraphs, lists, and so on) and to supplement that text with interactive forms, embedded images, and other objects. HTML is written in the form of labels (known as tags), delimited by less-than signs (<) and greater-than signs (>). HTML can also describe, to some degree, the appearance and semantics of a document, and can include embedded scripting language code which can affect the behavior of web browsers and other HTML processors.

HTML is also often used to refer to content of the MIME type text/html or, even more broadly, as a generic term for HTML whether in its XML-descended form (such as XHTML 1.0 and later) or its form descended directly from SGML (such as HTML 4.01 and earlier).

What is HTML?
HTML stands for Hypertext Markup Language. Hypertext is ordinary text that has been dressed up with extra features, such as formatting, images, multimedia, and links to other documents.
  5. 5. 5 Markup is the process of taking ordinary text and adding extra symbols. Each of the symbols used for markup in HTML is a command that tells a browser how to display the text. History of HTML Tim Berners-Lee created the original HTML (and many associated protocols such as HTTP) on a NeXTcube workstation using the NeXTSTEP development environment. At the time, HTML was not a specification, but a collection of tools to solve an immediate problem: the communication and dissemination of ongoing research among Berners-Lee and a group of his colleagues. His solution later combined with the emerging international and public internet to garner worldwide attention. Early versions of HTML were defined with loose syntactic rules, which helped its adoption by those unfamiliar with web publishing. Web browsers commonly made assumptions about intent and proceeded with rendering of the page. Over time, as the use of authoring tools increased, the trend in the official standards has been to create an increasingly strict language syntax. However, browsers still continue to render pages that are far from valid HTML. HTML is defined in formal specifications that were developed and published throughout the 1990s, inspired by Tim Berners-Lee's prior proposals to graft hypertext capability onto a homegrown SGML-like markup language for the Internet. The first published specification for a language called HTML was drafted by Berners-Lee with Dan Connolly, and was published in 1993 by the IETF as a formal "application" of SGML (with an SGML Document Type Definition defining the grammar). The IETF created an HTML Working Group in 1994 and published HTML 2.0 in 1995, but further development under the auspices of the IETF was stalled by competing interests. Since 1996, the HTML specifications have been maintained, with input from commercial software vendors, by the World Wide Web Consortium (W3C).[1] However, in 2000, HTML also became an international standard (ISO/IEC 15445:2000). 
The last HTML specification published by the W3C is the HTML 4.01 Recommendation, published in late 1999 and its issues and errors were last acknowledged by errata published in 2001. Since the publication of HTML 4.0 in late 1997, the W3C's HTML Working Group has increasingly — and from 2002 through 2006, exclusively — focused on the development of XHTML, an XML-based counterpart to HTML that is described on one W3C web page as HTML's "successor".[2][3][4] XHTML applies the more rigorous, less ambiguous syntax requirements of XML to HTML to make it easier to process and extend, and as support for XHTML has increased in browsers and tools, it has been embraced by many web standards advocates in preference to HTML. XHTML is routinely characterized by mass-media publications for both general and technical audiences as the newest "version" of HTML, but W3C publications, as of 2006, do not make such a claim; neither HTML 3.2 nor HTML 4.01 have been explicitly rescinded, deprecated, or superseded by any W3C publications, and, as of 2006, they continue to be listed alongside XHTML as current Recommendations in the W3C's primary publication indices.[5][6][7]
In November 2006, the HTML Working Group published a new charter indicating its intent to resume development of HTML in a manner that unifies HTML 4 and XHTML 1, allowing for this hybrid language to manifest in both an XML format and a "classic HTML" format that is SGML-compatible but not strictly SGML-based. Among other things, it is planned that the new specification, to be released and refined throughout 2007 and 2008, will include conformance and parsing requirements, DOM APIs, and new widgets and APIs. The group also intends to publish test suites and validation tools.[8]

Version history of the standard
    •   Hypertext Markup Language (First Version), published June 1993 as an Internet Engineering Task Force (IETF) working draft (not a standard).
    •   HTML 2.0, published November 1995 as IETF RFC 1866, supplemented by RFC 1867 (form-based file upload) that same month, RFC 1942 (tables) in May 1996, RFC 1980 (client-side image maps) in August 1996, and RFC 2070 (internationalization) in January 1997; ultimately all were declared obsolete/historic by RFC 2854 in June 2000.
    •   HTML 3.2, published January 14, 1997 as a W3C Recommendation.
    •   HTML 4.0, published December 18, 1997 as a W3C Recommendation. It offers three "flavors": Strict, in which deprecated elements are forbidden; Transitional, in which deprecated elements are allowed; and Frameset, in which mostly only frame-related elements are allowed.
    •   HTML 4.01, published December 24, 1999 as a W3C Recommendation. It offers the same three flavors as HTML 4.0, and its last errata were published May 12, 2001.
    •   ISO/IEC 15445:2000 ("ISO HTML", based on HTML 4.01 Strict), published May 15, 2000 as an ISO/IEC international standard.

HTML 4.01 and ISO/IEC 15445:2000 are the most recent and final versions of HTML.
XHTML is a separate language that began as a reformulation of HTML 4.01 using XML 1.0. It continues to be developed:
  7. 7. 7 XHTML 1.0, published January 26, 2000 as a W3C Recommendation, later revised and republished August 1, 2002. It offers the same three flavors as HTML 4.0 and 4.01, reformulated in XML, with minor restrictions. XHTML 1.1, published May 31, 2001 as a W3C Recommendation. It is based on XHTML 1.0 Strict, but includes minor changes and is reformulated using modules from Modularization of XHTML, which was published April 10, 2001 as a W3C Recommendation. XHTML 2.0 is still a W3C Working Draft. There is no official standard HTML 1.0 specification because there were multiple informal HTML standards at the time. Berners-Lee's original version did not include an IMG element type. Work on a successor for HTML, then called "HTML+", began in late 1993, designed originally to be "A superset of HTML…which will allow a gradual rollover from the previous format of HTML". The first formal specification was therefore given the version number 2.0 in order to distinguish it from these unofficial "standards". Work on HTML+ continued, but it never became a standard. The HTML 3.0 standard was proposed by the newly formed W3C in March 1995, and provided many new capabilities such as support for tables, text flow around figures, and the display of complex math elements. Even though it was designed to be compatible with HTML 2.0, it was too complex at the time to be implemented, and when the draft expired in September 1995, work in this direction was discontinued due to lack of browser support. HTML 3.1 was never officially proposed, and the next standard proposal was HTML 3.2 (code-named "Wilbur"), which dropped the majority of the new features in HTML 3.0 and instead adopted many browser-specific element types and attributes which had been created for the Netscape and Mosaic web browsers. Math support as proposed by HTML 3.0 finally came about years later with a different standard, MathML. 
HTML 4.0 likewise adopted many browser-specific element types and attributes, but at the same time began to try to "clean up" the standard by marking some of them as deprecated, and suggesting they not be used. Minor editorial revisions to the HTML 4.0 specification were published as HTML 4.01. The most common filename extension for files containing HTML is .html. However, older operating systems and filesystems, such as the DOS versions from the 80's and early 90's and FAT, limit file extensions to three letters, so a .htm extension is also used. Although perhaps less common now, the shorter form is still widely supported by current software. HTML as a hypertext format HTML is the basis of a comparatively weak hypertext implementation. Earlier hypertext systems had features such as typed links, transclusion and source tracking. Another feature lacking today is fat links.[9] Even some hypertext features that were in early versions of HTML have been ignored by most popular web browsers until recently, such as the link element and editable web pages.
Sometimes web services or browser manufacturers remedy these shortcomings. For instance, members of the modern social software landscape such as wikis and content management systems allow surfers to edit the web pages they visit.

HTML markup
HTML markup consists of several types of entities, including elements, attributes, data types and character references.

The Document Type Definition
In order to enable Document Type Definition (DTD)-based validation with SGML tools and to avoid the Quirks mode in browsers, all HTML documents should start with a Document Type Declaration (informally, a "DOCTYPE"). The DTD contains a machine-readable grammar specifying the permitted and prohibited content for a document conforming to that DTD. Browsers do not read the DTD, however; they only look at the doctype in order to decide the layout mode. Not all doctypes trigger the Standards layout mode. For example:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

This declaration references the Strict DTD of HTML 4.01, which does not have presentational elements like <font>, leaving formatting to Cascading Style Sheets. SGML-based validators read the DTD in order to properly parse the document and to perform validation. In modern browsers, the HTML 4.01 Strict doctype activates the Standards layout mode for CSS as opposed to the Quirks mode.

In addition, HTML 4.01 provides Transitional and Frameset DTDs. The Transitional DTD was intended to gradually phase in the changes made in the Strict DTD, while the Frameset DTD was intended for those documents which contained frames.

Elements
Elements are the basic structure for HTML markup. Elements have two basic properties: attributes and content. Each attribute and each element's content has certain restrictions that must be followed for an HTML document to be considered valid.
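The idea that a browser inspects only the doctype, not the DTD itself, to pick a layout mode can be sketched roughly in Python. This is a deliberately simplified assumption rather than any browser's actual algorithm: the function name `layout_mode`, the regular expression, and the three return values are all hypothetical, and real browsers consult a much longer table of known doctypes.

```python
import re

# Public identifier of the HTML 4.01 Strict DTD quoted in the text.
STRICT_401 = "-//W3C//DTD HTML 4.01//EN"

# Matches a leading doctype and captures its public identifier.
DOCTYPE_RE = re.compile(
    r'^\s*<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def layout_mode(document):
    """Very rough stand-in for a browser's doctype sniffing."""
    match = DOCTYPE_RE.match(document)
    if match is None:
        return "quirks"            # no recognisable doctype at all
    if match.group(1) == STRICT_401:
        return "standards"         # the Strict doctype from the text
    return "browser-dependent"     # outcome varies between browsers
```

Note that the sketch never fetches or parses the DTD at the given URL, which mirrors the point made above: validation is the job of SGML-based tools, not of the browser.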
An element usually has a start label (e.g. <label>) and an end label (e.g. </label>). The element's attributes are contained in the start label and content is located between the labels (e.g. <label>Content</label>). Some elements, such as <br>, will never have any content and do not need closing labels. Listed below are several types of markup elements used in HTML.

Structural markup describes the purpose of text. For example, <h2>Golf</h2> establishes "Golf" as a second-level heading, which a browser would typically render as a prominent section title followed by a blank line. Structural markup does not denote any specific rendering, but most web browsers have standardized on how elements should be formatted. Further styling should be done with Cascading Style Sheets (CSS).

Presentational markup describes the appearance of the text, regardless of its function. For example, <b>boldface</b> indicates that visual output devices should render
"boldface" in bold text, but has no clear semantics for aural devices that read the text aloud for the sight-impaired. In the case of both <b>bold</b> and <i>italic</i> there are elements which usually have an equivalent visual rendering but are more semantic in nature, namely <strong>strong emphasis</strong> and <em>emphasis</em> respectively. It is easier to see how an aural user agent should interpret the latter two elements. However, they are not equivalent to their presentational counterparts: it would be undesirable for a screen reader to emphasize the name of a book, for instance, but on a screen such a name would be italicized. Most presentational markup elements were deprecated in the HTML 4.0 specification in favor of CSS-based styling.

Hypertext markup links parts of the document to other documents. HTML up through XHTML 1.1 requires the use of an anchor element to create a hyperlink in the flow of text: <a>Wikipedia</a>. However, the href attribute must also be set to a valid URL; for example, the HTML code <a href="http://en.wikipedia.org/">Wikipedia</a> will render the word "Wikipedia" as a hyperlink. To view the HTML code of a web page, choose View, then Source, from the browser menu.

Attributes
The attributes of an element are name-value pairs, separated by "=", and written within the start label of an element, after the element's name. The value should be enclosed in single or double quotes, although values consisting of certain characters can be left unquoted in HTML (but not XHTML).[10][11] Leaving attribute values unquoted is considered unsafe.[12]

Most elements take any of several common attributes: id, class, style and title. Most also take language-related attributes: lang and dir. The id attribute provides a document-wide unique identifier for an element.
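The role of the href attribute, and of attributes as name-value pairs in a start label, can be seen in a short Python sketch built on the standard library's `html.parser`. The class name `LinkCollector` and the helper `collect_links` are hypothetical names chosen for illustration; the sketch simply assumes that an anchor counts as a hyperlink only when it carries a non-empty href.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:                      # <a> without href is not a link
            self.links.append(href)


def collect_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


sample = '<a>Wikipedia</a> <a href="http://en.wikipedia.org/">Wikipedia</a>'
print(collect_links(sample))
```

Only the second anchor in the sample contributes a link, which matches the distinction drawn in the text.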
This can be used by stylesheets to provide presentational properties, by browsers to focus attention on the specific element, or by scripts to alter the contents or presentation of an element.

The class attribute provides a way of classifying similar elements for presentation purposes. For example, an HTML document (or a set of documents) may use the designation class="notation" to indicate that all elements with this class value are subordinate to the main text of the document (or documents). Such notation classes of elements might be gathered together and presented as footnotes on a page, rather than appearing in the place where they appear in the source HTML.

An author may use the style attribute to assign presentational properties to a particular element. It is considered better practice to use an element's id or class attribute to select the element with a stylesheet, though sometimes this can be too cumbersome for a simple ad hoc application of styled properties.

The title attribute is used to attach a subtextual explanation to an element. In most browsers this title attribute is displayed as what is often referred to as a tooltip.

The generic inline span element can be used to demonstrate these various attributes.
<span id='anId' class='aClass' style='color:red;' title='HyperText Markup Language'>HTML</span>

which displays as HTML (pointing the cursor at the abbreviation should display the title text in most browsers).

Other markup
As of version 4.0, HTML defines a set of 252 character entity references and a set of 1,114,050 numeric character references, both of which allow individual characters to be written via simple markup, rather than literally. A literal character and its markup equivalent are considered equivalent and are rendered identically.

The ability to "escape" characters in this way allows for the characters "<" and "&" (when written as &lt; and &amp;, respectively) to be interpreted as character data, rather than markup. For example, a literal "<" normally indicates the start of a label, and "&" normally indicates the start of a character entity reference or numeric character reference; writing it as "&amp;" or "&#38;" allows "&" to be included in the content of elements or the values of attributes. The double-quote character, when used to quote an attribute value, must also be escaped as "&quot;" or "&#34;" when it appears within the attribute value itself. However, since document authors often overlook the need to escape these characters, browsers tend to be very forgiving, treating them as markup only when subsequent text appears to confirm that intent.

Escaping also allows for characters that are not easily typed or that aren't even available in the document's character encoding to be represented within the element and attribute content. For example, "é", a character typically found only on Western European keyboards, can be written in any HTML document as the entity reference &eacute; or as the numeric references &#xE9; or &#233;. The characters comprising those references (that is, the "&", the ";", the letters in "eacute", and so on) are available on all keyboards and are supported in all character encodings, whereas the literal "é" is not.
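Escaping and character references can be tried out with Python's standard `html` module, as in the short sketch below. The sample strings are made up for illustration; `html.escape` and `html.unescape` are used here only as a convenient way to observe the behaviour described above, not as a model of how any particular browser is implemented.

```python
import html

# html.escape rewrites &, < and > and, by default, both quote characters,
# so the result is safe inside element content or a quoted attribute value.
escaped = html.escape('AT&T <b> "quoted"')
print(escaped)

# A named reference, a hexadecimal reference and a decimal reference
# all decode to the same literal character.
for ref in ("&eacute;", "&#xE9;", "&#233;"):
    print(ref, "->", html.unescape(ref))
```

Because the references consist only of ASCII characters, the escaped form survives in any character encoding, which is the point made in the paragraph above.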
HTML also defines several data types for element content, such as script data and stylesheet data, and a plethora of types for attribute values, including IDs, names, URIs, numbers, units of length, languages, media descriptors, colors, character encodings, dates and times, and so on. All of these data types are specializations of character data.

Semantic HTML
There is no official specification called "Semantic HTML", though the strict flavors of HTML discussed below are a push in that direction. Rather, semantic HTML refers to an objective and a practice: to create documents with HTML that contain only the author's intended meaning, without any reference to how this meaning is presented or conveyed. A classic example is the distinction between the emphasis element (<em>) and the italics element (<i>). Often the emphasis element is displayed in italics, so the presentation is typically the same. However, emphasizing something is different from listing the title of a book, for example, which may also be displayed in italics. In purely semantic HTML, a book title would use a different element from the one used for emphasized text (for example a <span>), because they are each meaningfully different things.
The goal of semantic HTML requires two things of authors: 1) avoiding presentational markup (elements, attributes and other entities); 2) using the available markup to differentiate the meanings of phrases and structure in the document. So, for example, the book title from above would need its own element and class, such as <cite class="booktitle">The Grapes of Wrath</cite>. Here the <cite> element is used because it most closely matches the meaning of this phrase in the text. However, the <cite> element alone is not specific enough for this task, because we mean to cite specifically a book title as opposed to a newspaper article or a particular academic journal; the class attribute supplies that distinction.

Semantic HTML also requires complementary specifications and software compliance with these specifications. Primarily, the development and proliferation of CSS has led to increasing support for semantic HTML, because CSS provides designers with a rich language to alter the presentation of semantic-only documents. With the development of CSS, the need to include presentational properties in a document has virtually disappeared. With the advent and refinement of CSS and the increasing support for it in web browsers, subsequent editions of HTML increasingly stress using only markup that suggests the semantic structure and phrasing of the document, like headings, paragraphs, quotes, and lists, instead of markup written for visual purposes only, like <font>, <b> (bold), and <i> (italics). Some of these elements are not permitted in certain varieties of HTML, like HTML 4.01 Strict. CSS provides a way to separate document semantics from the content's presentation, by keeping everything relevant to presentation defined in a CSS file. See separation of style and content.

Semantic HTML offers many advantages. First, it ensures consistency in style across elements that have the same meaning.
Every heading, every quotation mark, every similar element receives the same presentation properties. Second, semantic HTML frees authors from the need to concern themselves with presentation details. When writing the number two, for example, should it be written out in words ("two"), or should it be written as a numeral (2)? A semantic markup might enter something like <number>2</number> and leave presentation details to the stylesheet designers. Similarly, an author might wonder where to break out quotations into separate indented blocks of text - with purely semantic HTML, such details would be left up to stylesheet designers. Authors would simply indicate quotations when they occur in the text, and not concern themselves with presentation. A third advantage is device independence and repurposing of documents. A semantic HTML document can be paired with any number of stylesheets to provide output to computer screens (through web browsers), high-resolution printers, handheld devices, aural browsers or braille devices for those with visual impairments, and so on. To accomplish this nothing needs to be changed in a well coded semantic HTML document. Readily available stylesheets make this a simple matter of pairing a semantic HTML document with the appropriate stylesheets (of course, the stylesheet's selectors need to match the appropriate properties in the HTML document).
Some aspects of authoring documents make separating semantics from style (in other words, meaning from presentation) difficult. Some elements are hybrids, using presentation in their very meaning. For example, a table displays content in a tabular form. Often this content only conveys its meaning when presented in this way. Repurposing a table for an aural device typically involves somehow presenting an inherently visual element in an audible form. Conversely, song lyrics, which are inherently meant for audible presentation, are frequently presented in textual form on a web page. For these types of elements, the meaning is not so easily separated from the presentation. However, for a great many of the elements used and meanings conveyed in HTML the translation is relatively smooth.

Delivery of HTML
HTML documents can be delivered by the same means as any other computer file; however, HTML documents are most often delivered in one of two ways: over HTTP by web servers, or through e-mail.

Publishing HTML with HTTP
The World Wide Web is primarily composed of HTML documents transmitted from a web server to a web browser using the HyperText Transfer Protocol (HTTP). However, HTTP can be used to serve images, sound and other content in addition to HTML. To allow the web browser to know how to handle the document it received, an indication of the file format of the document must be transmitted along with the document. This vital metadata includes the MIME type (text/html for HTML 4.01 and earlier, application/xhtml+xml for XHTML 1.0 and later) and the character encoding (see Character encodings in HTML). In modern browsers, the MIME type that is sent with the HTML document affects how the document is interpreted.
A document sent with the XHTML MIME type (that is, served as application/xhtml+xml) is expected to be well-formed XML, and a syntax error may cause the browser to fail to render the document. The same document sent with the HTML MIME type (served as text/html) might still be displayed, since web browsers are more lenient with HTML. However, XHTML parsed this way is considered neither proper XHTML nor proper HTML, but so-called tag soup. If the MIME type is not recognized as HTML, the web browser should not attempt to render the document as HTML, even if the document is prefaced with a correct Document Type Declaration. Nevertheless, some web browsers do examine the contents or URL of the document and attempt to infer the file type, despite this being forbidden by the HTTP 1.1 specification.

HTML e-mail
Most graphical e-mail clients allow the use of a subset of HTML (often ill-defined) to provide formatting and semantic markup capabilities not available with plain text, like emphasized text, block quotations for replies, and diagrams or mathematical formulas that couldn't easily be described otherwise. Many of these clients include both a GUI editor for composing HTML e-mails and a rendering engine for displaying received
HTML e-mails. Use of HTML in e-mail is controversial because of compatibility issues, because it can be used in phishing and privacy attacks, because it can confuse spam filters, and because the message size is larger than plain text.

Current flavors of HTML
HTML and its associated protocols gained acceptance relatively quickly after their inception. However, no clear standards existed in the early years of the language. Though its creators originally conceived of HTML as a semantic language devoid of presentation details, practical uses, driven largely by the various browser vendors, pushed many presentational elements and attributes into the language. The latest standards surrounding HTML reflect efforts to overcome the sometimes chaotic development of the language and to create a rational foundation on which to build both meaningful and well-presented documents. To return HTML to its role as a semantic language, the W3C has developed style languages such as CSS and XSL to shoulder the burden of presentation. In conjunction, the HTML specification has slowly reined in the presentational elements within the specification.

There are two axes differentiating the various flavors of HTML as currently specified: SGML-based HTML versus XML-based HTML (referred to as XHTML) on the one axis, and strict versus transitional (loose) versus frameset on the other.

Traditional versus XML-based HTML
One difference in the latest HTML specifications lies in the distinction between the SGML-based specification and the XML-based specification. The XML-based specification is often called XHTML to clearly distinguish it from the more traditional definition; however, the root element name continues to be HTML even in the XHTML-specified HTML. The W3C intends XHTML 1.0 to be identical with HTML 4.01 except in the often stricter requirements of XML over traditional HTML. XHTML 1.0 likewise has three sub-specifications: strict, loose and frameset.
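The "often stricter requirements of XML" can be made concrete with a small sketch. It assumes Python's standard xml.etree.ElementTree as a stand-in for an XML parser and html.parser as a stand-in for a lenient HTML parser; neither is what a browser actually uses, and the helper names and sample fragments are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records the start tags it encounters."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def html_start_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


# Traditional HTML style: unquoted attribute value, <br> never closed.
html_style = "<p class=note>first line<br>second line</p>"
# XHTML style: quoted attribute value, explicitly closed empty element.
xhtml_style = '<p class="note">first line<br />second line</p>'

print(is_well_formed_xml(html_style))   # rejected by the XML parser
print(is_well_formed_xml(xhtml_style))  # accepted by the XML parser
print(html_start_tags(html_style))      # the lenient parser copes with both
print(html_start_tags(xhtml_style))
```

The lenient parser reports the same two elements for both fragments, while the XML parser accepts only the second.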
The strictness of XHTML in terms of its syntax is often confused with the strictness of the strict versus the loose definitions in terms of the content rules of the specifications. The strictness of XML lies in the need to always explicitly close elements (an opening <h1> must be matched by a closing </h1>) and to always use quotation marks (double " or single ') to enclose attribute values. The use of implied closing tags in HTML led to confusion for both editors and parsers.

Aside from the different opening declarations for a document, the differences between HTML 4.01 and XHTML 1.0, in each of the corresponding DTDs, are largely syntactic. Adhering to valid and well-formed XHTML 1.0 will result in a well-formed HTML 4.01 document in every way except one. XHTML introduces new markup, the self-closing element, as shorthand for handling empty elements. The shorthand adds a slash (/) at the end of an opening tag, like this: <br/>. The introduction of this shorthand, undefined in any HTML 4.01 DTD, may confuse earlier software unfamiliar with the new convention. To help with the transition, the W3C recommends also including a space character before the slash, like this: <br />. As validators and browsers adapt to this evolution in the standard, the migration from traditional to XML-based HTML should be relatively simple. The major problems occur when software does not conform to HTML
4.01 and its associated protocols to begin with, or erroneously implements the HTML recommendations.

To understand the subtle differences between HTML and XHTML, consider the transformation of a valid and well-formed XHTML 1.0 document into a valid and well-formed HTML 4.01 document. This translation requires the following steps:

    •   Specify the language for an element with a lang attribute rather than the XHTML xml:lang attribute. (HTML 4.01 defines its own attribute for language, whereas XHTML uses the XML-defined attribute.)
    •   Remove the XML namespace (xmlns=URI). HTML does not require and has no facilities for namespaces.
    •   Change the DTD declaration from XHTML 1.0 to HTML 4.01 (see the DTD section for further explanation).
    •   If present, remove the XML declaration (typically <?xml version="1.0" encoding="utf-8"?>).
    •   Change the document’s MIME type to text/html. This may come from a meta element, from the HTTP header of the server or possibly from a filename extension (for example, change .xhtml to .html).
    •   Change the XML empty-tag shortcut to a standard opening tag (<br/> to <br>).

Those are the only changes necessary to translate a document from XHTML 1.0 to HTML 4.01. The reverse operation can be much more complicated. HTML 4.01 allows the omission of many tags, in a complex pattern derived by determining which tags are (in some sense) redundant for a valid document. In other words, if the document is authored precisely to the associated HTML 4.01 content model, some tags need not be expressed. For example, since a paragraph cannot contain another paragraph, an opening paragraph tag followed by another opening paragraph tag implies that the previous paragraph element is now closed. Similarly, elements such as br have no allowed content, so HTML does not require an explicit closing tag for this element.
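The XHTML 1.0 to HTML 4.01 translation steps listed above can be sketched as a handful of text substitutions. This is a minimal illustration under stated assumptions, not a real converter: the function name xhtml_to_html401 and the sample document are invented here, the regular expressions only handle simple, cleanly written markup, the document is assumed to carry xml:lang without a separate lang attribute, and the MIME type change is left out because it belongs to the server configuration rather than the document text.

```python
import re

# HTML 4.01 Strict document type declaration.
HTML401_STRICT = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">'
)


def xhtml_to_html401(source):
    """Apply the textual XHTML 1.0 -> HTML 4.01 steps to a simple document."""
    # Remove the XML declaration, if present.
    out = re.sub(r"^\s*<\?xml[^>]*\?>\s*", "", source)
    # Swap the XHTML DOCTYPE for an HTML 4.01 one.
    out = re.sub(r"<!DOCTYPE[^>]*>", HTML401_STRICT, out, count=1)
    # Drop the XML namespace declaration.
    out = re.sub(r'\s+xmlns="[^"]*"', "", out)
    # Use lang instead of xml:lang.
    out = re.sub(r"\bxml:lang=", "lang=", out)
    # Turn the empty-element shortcut (<br/> or <br />) into a plain start tag.
    out = re.sub(r"\s*/>", ">", out)
    return out


sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">\n'
    "<head><title>Example</title></head>\n"
    "<body><p>line one<br />line two</p></body>\n"
    "</html>\n"
)

converted = xhtml_to_html401(sample)
print(converted)
```

A production tool would work on a parsed document tree rather than on raw text, since substitutions like these can also touch matching character sequences inside text content or attribute values.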
Also, since HTML was the only specification targeted by user agents (browsers and other HTML-consuming software), the specification even allows the omission of opening and closing tags for html, head, and body, if the document's head has no content. To translate from HTML to XHTML would first require the addition of any omitted closing tags (or the use of the closing-tag shortcut for empty elements like <br/>).

Notice how XHTML’s requirement to always include explicit closing tags allows the separation between the concepts of valid and well-formed. A well-formed XHTML document adheres to all the syntax requirements of XML. A valid document adheres to the content specification for XHTML. In other words, a valid document only includes content, attributes and attribute values within each element in accord with the specification. If a closing tag is omitted, an XHTML parser can first determine that the document is not well-formed. Once the elements are all explicitly closed, the parser can address the question of whether the document is also valid. For an HTML parser these separate aspects of a document are not discernible. If a paragraph opening tag (p) is
followed by a div, is it because the document is not well-formed (the closing paragraph tag is missing) or is the document invalid (a div does not belong in a paragraph)? Whether coding in HTML or XHTML, it may be best simply to include the optional tags in every document rather than to remember which tags can be omitted.

The W3C recommends several conventions to ensure an easy migration between HTML and XHTML (see HTML Compatibility Guidelines). Basically the W3C recommends:

    •   Including both xml:lang and lang attributes on any elements assigning language.
    •   Using the self-closing tag only for elements specified as empty.
    •   Making all tag names and attribute names lower-case.
    •   Ensuring all attribute values are quoted with either single quotes (') or double quotes (").
    •   Including an extra space in self-closing tags: for example <br /> instead of <br/>.
    •   Including explicit closing tags for elements that permit content but are left empty (for example, "<div></div>", not "<div />").

Note that when the W3C’s compatibility guidelines are followed carefully, the only difference between the resulting HTML 4.01 document and the XHTML 1.0 document is the DOCTYPE declaration and the XML declaration preceding the document’s contents. The W3C allows the resulting XHTML 1.0 (or any XHTML 1.0) document to be delivered as either HTML or XHTML. For delivery as HTML, the document’s MIME type should be set to 'text/html', while for XHTML it should be set to 'application/xhtml+xml'. When a document is delivered as XHTML, browsers and other user agents are expected to adhere strictly to the XML specifications in parsing, interpreting, and displaying the document’s contents.

Transitional versus Strict
The latest SGML-based specification, HTML 4.01, and the earliest XHTML version include three sub-specifications: strict, transitional (also called loose), and frameset.
The difference between strict on the one hand and loose and frameset on the other is that the strict definition tries to adhere more tightly to a presentation-free or style-free concept of semantic HTML. The loose standard maintains many of the various presentational elements and attributes absent in the strict definition.

The primary differences making the transitional specification loose versus the strict specification (whether XHTML 1.0 or HTML 4.01) are:

    •   A looser content model
            •   Inline elements and character strings (#PCDATA) are allowed in: body, blockquote, form, noscript, noframes
    •   Presentation related elements
            •   underline (u)
            •   strike-through (s and strike)
            •   center
            •   font
            •   basefont
    •   Presentation related attributes
            •   background and bgcolor attributes for body element
            •   align attribute on div, form, paragraph (p), and heading (h1...h6) elements
            •   align, noshade, size, and width attributes on hr element
            •   align, border, vspace, and hspace attributes on img and object elements
            •   align attribute on legend and caption elements
            •   align and bgcolor on table element
            •   nowrap, bgcolor, width, height on td and th elements
            •   bgcolor attribute on tr element
            •   clear attribute on br element
            •   compact attribute on dl, dir and menu elements
            •   type, compact, and start attributes on ol and ul elements
            •   type and value attributes on li element
            •   width attribute on pre element
    •   Additional elements in loose (transitional) specification
            •   menu list (no substitute, though unordered list is recommended; may return in XHTML 2.0 specification)
            •   dir list (no substitute, though unordered list is recommended)
            •   isindex (element requires server-side support and is typically added to documents server-side)
            •   applet (deprecated in favor of object element)
    •   The pre element does not allow: applet, font, and basefont (elements not defined in strict DTD)
    •   The language attribute on script element (presumably redundant with type attribute, though this is maintained for legacy reasons)
    •   Frame related entities
            •   frameset element (used in place of body for frameset DTD)
            •   frame element
            •   iframe
            •   noframes
            •   target attribute on anchor, client-side image-map (imagemap), link, form, and base elements

Frameset versus transitional
In addition to the above transitional differences, the frameset specifications (whether XHTML 1.0 or HTML 4.01) specify a different content model:

    <html>
      <head>
        Any of the various head related elements.
      </head>
      <frameset>
        At least one of either: another frameset or a frame and an optional noframes element.
      </frameset>
    </html>
Summary of flavors
As this list demonstrates, the loose flavors of the specification are maintained for legacy support. However, contrary to popular misconceptions, the move to XHTML does not imply a removal of this legacy support. Rather, the X in XML stands for extensible, and the W3C is modularizing the entire specification and opening it up to independent extensions. The primary achievement in the move from XHTML 1.0 to XHTML 1.1 is the modularization of the entire specification. The strict version of HTML is deployed in XHTML 1.1 through a set of modular extensions to the base XHTML 1.1 specification. Likewise, someone looking for the loose (transitional) or frameset specifications will find similar extended XHTML 1.1 support (much of it contained in the legacy or frame modules). The modularization also allows separate features to develop on their own timetable. So, for example, XHTML 1.1 will allow quicker migration to emerging XML standards such as MathML (a presentational and semantic math language based on XML) and XForms, a new, highly advanced web-form technology intended to replace the existing HTML forms.

In summary, the HTML 4.01 specification primarily reined in all the various HTML implementations into a single, clearly written specification based on SGML. XHTML 1.0 ported this specification, as is, to the new XML-defined specification. Next, XHTML 1.1 takes advantage of the extensible nature of XML and modularizes the whole specification. XHTML 2.0 will be the first step in adding new features to the specification in a standards-body-based approach.

NetMeeting
Microsoft NetMeeting is a VoIP and multi-point videoconferencing client included in many versions of Microsoft Windows (from Windows 95 OSR2 to Windows XP). It uses the H.323 protocol for video and audio conferencing, and is interoperable with OpenH323-based clients such as Ekiga, and with Internet Locator Service (ILS) as a mirror server.
It also uses a slightly modified version of the ITU T.120 protocol for whiteboarding, application sharing, desktop sharing, remote desktop sharing (RDS) and file transfers. The secondary Whiteboard in NetMeeting 2.1 and later utilizes the H.324 protocol.

Before video service became common on free IM clients, such as Yahoo Messenger and MSN Messenger, NetMeeting was a popular way to perform video conferences and chat over the Internet (with the help of public ILS servers).

Since the release of Windows XP, Microsoft has deprecated it in favour of Windows Messenger, although it is still installed by default (Start > Run... > conf.exe). Note that Windows Messenger, MSN Messenger and Windows Live Messenger hook directly into NetMeeting for the application sharing, desktop sharing, and Whiteboard features exposed by each application.
As of the release of Windows Vista, NetMeeting is no longer included and has been replaced by Windows Meeting Space.

Online chat
Online chat can refer to any kind of communication over the Internet, but is primarily meant to refer to direct one-on-one chat or text-based group chat (formally known as synchronous conferencing), using tools such as instant messaging applications, Internet Relay Chat, talkers and possibly MUDs, MUCKs, MUSHes and MOOs.

While many of the web's well-known custodians offer online chat and messaging services for free, an increasing number of providers are beginning to show strong revenue streams from paid-for services. Adult service providers, profiting from the advent of reliable, high-speed broadband (notably across Eastern Europe), are at the forefront of the paid-for online chat revolution.

For every business traveller engaging in a video call or conference call rather than braving the check-in queue, there are countless web users replacing traditional conversational means with online chat and messaging. Like e-mail, which has reduced the need for and use of letters, faxes and memos, online chat is steadily replacing telephony as the means of office and home communication. The early adopters in these areas are undoubtedly teenage users of instant messaging. It might not be long before SMS text messaging usage declines as mobile handsets provide the technology for online chat.

Other forms of online chat that are not usually referred to as online chat

MUDs
A MUD, or multi-user dungeon, is a multi-user version of Dungeons and Dragons for the Internet, and is an early use of the Internet. In a MUD, as well as playing the game, people can chat to each other. Talkers were originally based on MUDs, and the earliest versions of talkers were primarily MUDs without the gaming element.
Other derivations of MUDs were used that combined gaming with talking, and these include MUSHes, MOOs and MUCKs.

Discussion boards
Besides real-time chat, another type of online community includes Internet forums and bulletin board systems (BBSes), where users write posts (blocks of text) to which later visitors may respond. Unlike the transient nature of chats, these systems generally archive posts and save them for weeks or years. They can be used for technical troubleshooting, advice, general conversation and more.

See also

General terms
    •   Chat room
    •   Web chat site
    •   Voice chat
    •   VoIP Voice over IP
    •   Live support software
    •   Online discussion
    •   Online discourse environment

Protocols/Programs
    •   Talker
    •   Internet Relay Chat
    •   Instant messenger
    •   PalTalk
    •   Talk (Unix)
    •   MUD
    •   MUSH
    •   MOO
    •   Google Talk
    •   Yahoo! Messenger
    •   Skype
    •   SILC
    •   Windows Live Messenger
    •   Campfire

Chat programs supporting multiple protocols
    •   Adium
    •   Gaim
    •   Miranda IM
    •   Trillian

Retrieved from "http://en.wikipedia.org/wiki/Online_chat"

Plugins
A plugin (or plug-in) is a computer program that interacts with a main (or host) application (a web browser or an email program, for example) to provide a certain, usually very specific, function on-demand.

Typical examples are
    •   plugins that read or edit specific types of files (for instance, decode multimedia files)
    •   encrypt or decrypt email (for instance, PGP)
    •   filter images in graphic programs in ways that the host application could not normally do
    •   play and watch Flash presentations in a web browser

The host application provides services which the plugins can use, including a way for plugins to register themselves with the host application and a protocol by which data is exchanged with plugins. Plugins are dependent on these services provided by the main application and do not usually work by themselves. Conversely, the main application is
independent of the plugins, making it possible for plugins to be added and updated dynamically without changes to the main application.

Plugins are slightly different from extensions, which modify or add to existing functionality. The main difference is that plugins generally rely on the main application's user interface and have a well-defined boundary to their possible set of actions. Extensions generally have fewer restrictions on their actions, and may provide their own user interfaces. They sometimes are used to decrease the size of the main application and offer optional functions. Mozilla Firefox uses a well-developed extension system to reduce the feature creep that plagued the Mozilla Application Suite.

Perhaps the first software applications to include a plugin function were HyperCard and QuarkXPress on the Macintosh, both released in 1987. In 1988, Silicon Beach Software included plugin functionality in Digital Darkroom and SuperPaint, and the term plug-in was coined by Ed Bomke.

Currently, plugins are typically implemented as shared libraries that must be installed in a place prescribed by the main application. HyperCard supported a similar facility, but it was more common for the plugin code to be included in the HyperCard documents (called stacks) themselves. This way, the HyperCard stack became a self-contained application in its own right, which could be distributed as a single entity that could be run by the user without the need for additional installation steps.

Open application programming interfaces (APIs) provide a standard interface, allowing third parties to create plugins that interact with the main application. A stable API allows third-party plugins to function as the original version changes and to extend the lifecycle of obsolete applications. The Adobe Photoshop and After Effects plugin APIs have become a standard and been adopted to some extent by competing applications.
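The two host services described earlier (a way for plugins to register themselves, and a protocol for exchanging data) can be sketched in a few lines. Everything here is hypothetical and invented for illustration: the Host class, its method names, and the convention that a plugin is simply a function taking and returning a string do not correspond to any real application's plugin API.

```python
from typing import Callable, Dict, List

# Assumed data-exchange "protocol" for this sketch: text in, text out.
PluginFn = Callable[[str], str]


class Host:
    """Minimal host application that knows nothing about specific plugins."""

    def __init__(self) -> None:
        self._plugins: Dict[str, PluginFn] = {}

    def register(self, name: str, plugin: PluginFn) -> None:
        # Service 1: plugins announce themselves to the host under a name.
        self._plugins[name] = plugin

    def run(self, name: str, data: str) -> str:
        # Service 2: the host passes data to a plugin and returns its result.
        if name not in self._plugins:
            raise KeyError(f"no plugin registered under {name!r}")
        return self._plugins[name](data)

    def available(self) -> List[str]:
        return sorted(self._plugins)


def shout(text: str) -> str:
    """Example plugin: upper-case the text."""
    return text.upper()


host = Host()
host.register("shout", shout)
host.register("reverse", lambda text: text[::-1])

print(host.available())
print(host.run("shout", "hello"))
```

Because the host only depends on the registration call and the function signature, plugins can be added or replaced without changing the Host class, which mirrors the independence described in the text.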
Other examples of such APIs include Audio Units and VST.

Examples
Many professional software packages offer plugin APIs to developers, in order to increase the utility of the base product. Examples of these include:
    •   Eclipse
    •   GStreamer multimedia pipe handler
    •   jEdit Program Editor
    •   Quintessential Media Player, Winamp, foobar2000 and XMMS
    •   Notepad++
    •   OmniPeek packet analysis platform
    •   VST Audio Plugin Format

Communications protocol
From Wikipedia, the free encyclopedia
This article concerns communication between pairs of electronic devices.

In the field of telecommunications, a communications protocol is the set of standard rules for data representation, signalling, authentication and error detection required to send information over a communications channel. An example of a simple communications protocol adapted to voice communication is the case of a radio dispatcher talking to mobile stations. The communication protocols for digital computer network communication have many features intended to ensure reliable interchange of data over an imperfect communication channel. In essence, a communication protocol is a set of rules that all parties follow so that the system works properly.

Network protocol design principles
Systems engineering principles have been applied to create a set of common network protocol design principles. These principles include effectiveness, reliability, and resiliency.

Effectiveness
A protocol needs to be specified in such a way that engineers, designers, and in some cases software developers can implement and/or use it. In human-machine systems, its design needs to facilitate routine usage by humans. Protocol layering accomplishes these objectives by dividing the protocol design into a number of smaller parts, each of which performs closely related sub-tasks and interacts with other layers of the protocol only in a small number of well-defined ways.

Protocol layering allows the parts of a protocol to be designed and tested without a combinatorial explosion of cases, keeping each design relatively simple. The implementation of a sub-task on one layer can make assumptions about the behavior and services offered by the layers beneath it.
Thus, layering enables a "mix-and-match" of protocols that permits familiar protocols to be adapted to unusual circumstances.

For an example that involves computing, consider an email protocol like the Simple Mail Transfer Protocol (SMTP). An SMTP client can send messages to any server that conforms to SMTP's specification. An actual application might be, for example, an aircraft with an SMTP server receiving messages from a ground controller over a radio-based internet link. Any SMTP client can correctly interact with any SMTP server, because they both conform to the same protocol specification, RFC 2821.

This paragraph informally provides some examples of layers, some required functionalities, and some protocols that implement them, all from the realm of computing protocols.

At the lowest level, bits are encoded in electrical, light or radio signals by the Physical layer. Some examples include RS-232, SONET, and WiFi.

A somewhat higher Data link layer such as the point-to-point protocol (PPP) may detect errors and configure the transmission system.
An even higher protocol may perform network functions. One very common protocol is the Internet protocol (IP), which implements addressing for a large set of protocols. A common associated protocol is the Transmission control protocol (TCP), which implements error detection and correction (by retransmission). TCP and IP are often paired, giving rise to the familiar acronym TCP/IP.

A layer in charge of presentation might describe how to encode text (for example, ASCII or Unicode).

An application protocol like SMTP may (among other things) describe how to inquire about electronic mail messages.

These different tasks show why there is a need for a software architecture or reference model that systematically places each task into context.

The reference model usually used for protocol layering is the OSI seven-layer model, which can be applied to any protocol, not just the OSI protocols of the International Organization for Standardization (ISO). In particular, the Internet Protocol can be analysed using the OSI model.

Reliability
Assuring reliability of data transmission involves error detection and correction, or some means of requesting retransmission. It is a truism that communication media are always faulty. The conventional measure of quality is the number of failed bits per bit transmitted (the bit error rate, BER). This has the useful feature of being a dimensionless figure of merit that can be compared across any speed or type of communication medium.

In telephony, links with bit error rates of 10^-4 or more are regarded as faulty (they interfere with telephone conversations), while links with a BER of 10^-5 or more should be dealt with by routine maintenance (they can be heard).

Data transmission often requires bit error rates below 10^-12. Computer data transmissions are so frequent that larger error rates would affect operations of customers like banks and stock exchanges.
Since most transmissions use networks with telephonic error rates, the errors caused by these networks must be detected and then corrected.

Communications systems detect errors by transmitting a summary of the data with the data. In TCP (the Internet's Transmission Control Protocol), a checksum of the packet (a 16-bit ones'-complement sum of its contents) is sent in each packet's header. Simple arithmetic sums do not detect out-of-order data or cancelling errors. A cyclic redundancy check, computed by bit-wise binary polynomial division, can detect these errors and more, but is slightly more expensive to calculate.

Communication systems correct errors by selectively resending bad parts of a message. For example, in TCP when a checksum is bad, the packet is discarded. When a packet is lost, the receiver acknowledges all of the packets up to, but not including, the failed packet. Eventually, the sender sees that too much time has elapsed without an acknowledgement, so it resends all of the packets that have not been acknowledged. At the same time, the sender backs off its rate of sending, in case the packet loss was caused by saturation of the path between sender and receiver. (Note: this is an over-simplification: see TCP and congestion collapse for more detail.)
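The difference between an arithmetic sum and a cyclic redundancy check can be seen in a short sketch. It assumes the RFC 1071 style checksum that TCP uses (a 16-bit ones'-complement sum over 16-bit words) and, for comparison, the CRC-32 from Python's zlib module; the function name internet_checksum and the sample payloads are invented for this example.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement sum of 16-bit words, in the style of RFC 1071.

    Returns an integer in the range 0..0xFFFF.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


original = b"ABCDEFGH"
reordered = b"CDABEFGH"  # the first two 16-bit words have swapped places

# Addition is commutative, so the arithmetic checksum cannot see the swap.
same_sum = internet_checksum(original) == internet_checksum(reordered)
# The CRC depends on the position of every bit, so it changes.
same_crc = zlib.crc32(original) == zlib.crc32(reordered)

print(same_sum)  # True: reordering goes undetected
print(same_crc)  # False: reordering is detected
```

This is the out-of-order case mentioned in the text: both payloads contain the same 16-bit words, so their sums agree, while the CRC values differ.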
In general, the performance of TCP is severely degraded in conditions of high packet loss (more than 0.1%), due to the need to resend packets repeatedly. For this reason, TCP/IP connections are typically run either on highly reliable fiber networks, or over a lower-level protocol with added error detection and correction features (such as modem links with ARQ). These connections typically have uncorrected bit error rates of 10^-9 to 10^-12, ensuring high TCP/IP performance.

Resiliency

Resiliency addresses a form of network failure known as topological failure, in which a communications link is cut or degrades below usable quality. Most modern communication protocols periodically send messages to test a link. In phones, a framing bit is sent with every frame of 24 eight-bit channels (that is, every 193 bits) on T1 lines. In phone systems, when "sync is lost", fail-safe mechanisms reroute the signals around the failing equipment. In packet-switched networks, the equivalent functions are performed using router update messages to detect loss of connectivity.

Standards organizations

Most recent protocols are specified by the IETF for Internet communications, and by the IEEE or the ISO for other types. The ITU-T handles telecommunications protocols and formats for the public switched telephone network (PSTN). The ITU-R handles protocols and formats for radio communications. As the PSTN, radio systems, and the Internet converge, the different sets of standards are also being driven towards technological convergence.

Protocol families

A number of major protocol stacks or families exist, including the following open standards:

• Internet protocol suite
• Open Systems Interconnection (OSI)

A connection-oriented networking protocol is one which identifies traffic flows by some connection identifier rather than by explicitly listing source and destination addresses. Typically, this connection identifier is a small integer (10 bits for Frame Relay, 24 for ATM, for example).
This makes network switches substantially faster (as routing tables are just simple look-up tables, and are trivial to implement in hardware). The impact is so great, in fact, that even characteristically connectionless protocols, such as IP traffic, are being tagged with connection-oriented header prefixes (e.g., as with MPLS, or IPv6's built-in Flow Label field).

Note that connection-oriented protocols are not necessarily reliable protocols. ATM and Frame Relay, for example, are both connection-oriented but unreliable. There are also reliable connectionless protocols, such as AX.25 when it passes data in I-frames, but this combination is uncommon in commercial and academic networks.

Note that connection-oriented protocols handle real-time traffic substantially more efficiently than connectionless protocols, which is why ATM has yet to be replaced by Ethernet for carrying real-time, isochronous traffic streams, especially in heavily
aggregated networks like backbones, where the motto "bandwidth is cheap" fails to deliver on its promise. Experience has also shown that overprovisioning bandwidth does not resolve all quality of service issues. Hence, (10-)gigabit Ethernet is not expected to replace ATM at this time.

List of Connection-oriented protocols

• TCP
• Phone Call - user must dial the telephone and get an answer before transmitting data
• ATM
• Frame Relay

Connectionless protocol

In telecommunications, connectionless describes communication between two network end points in which a message can be sent from one end point to another without prior arrangement. The device at one end of the communication transmits data to the other without first ensuring that the recipient is available and ready to receive the data. The device sending a message simply sends it addressed to the intended recipient. As such, there are more frequent problems with transmission than with connection-oriented protocols, and it may be necessary to resend the data several times. Connectionless protocols are often disfavoured by network administrators because it is much harder to filter malicious packets from a connectionless protocol using a firewall.

The Internet Protocol (IP) and User Datagram Protocol (UDP) are connectionless protocols, but TCP, the transport protocol most commonly run over IP, is connection-oriented.

Connectionless protocols are usually described as stateless because the endpoints have no protocol-defined way to remember where they are in a "conversation" of message exchanges. The alternative to the connectionless approach uses connection-oriented protocols, which are sometimes described as stateful because they can keep track of a conversation.
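As a rough illustration of sending "without prior arrangement", the following Python sketch passes a single UDP datagram over the local loopback interface using the standard socket module. The function name and the choice of 127.0.0.1 with an operating-system-assigned port are assumptions made for this example; note that the sender never connects to, or negotiates with, the receiver.

```python
import socket

def send_datagram_without_handshake(message: bytes) -> bytes:
    """Send one UDP datagram over loopback and return what the receiver got."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)          # UDP gives no delivery guarantee
        address = receiver.getsockname()

        # No connect(), no handshake: the datagram is simply addressed and sent.
        sender.sendto(message, address)

        data, _source = receiver.recvfrom(4096)
        return data
    finally:
        sender.close()
        receiver.close()

print(send_datagram_without_handshake(b"hello"))
```

The timeout on the receiving socket reflects the stateless nature of the exchange: if the datagram were lost, nothing in the protocol itself would report it or resend it.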
List of Connectionless protocols

• IP
• UDP
• ICMP
• IPX

In computing, a protocol is a convention or standard that controls or enables the connection, communication, and data transfer between two computing endpoints. In its simplest form, a protocol can be defined as the rules governing the syntax, semantics, and synchronization of communication. Protocols may be implemented by hardware, software, or a combination of the two. At the lowest level, a protocol defines the behavior of a hardware connection.
Typical properties

It is difficult to generalize about protocols because they vary so greatly in purpose and sophistication. Most protocols specify one or more of the following properties:

• Detection of the underlying physical connection (wired or wireless), or the existence of the other endpoint or node
• Handshaking
• Negotiation of various connection characteristics
• How to start and end a message
• How to format a message
• What to do with corrupted or improperly formatted messages (error correction)
• How to detect unexpected loss of the connection, and what to do next
• Termination of the session or connection

Importance

The widespread use and expansion of communications protocols is both a prerequisite to the Internet and a major contributor to its power and success. The pair of Internet Protocol (IP) and Transmission Control Protocol (TCP) are the most important of these, and the term TCP/IP refers to a collection (or protocol suite) of its most used protocols. Most of the Internet's communication protocols are described in the RFC documents of the Internet Engineering Task Force (IETF).

Object-oriented programming has extended the use of the term to include the programming protocols available for connections and communication between objects.

Generally, only the simplest protocols are used alone. Most protocols, especially in the context of communications or networking, are layered together into protocol stacks, where the various tasks listed above are divided among different protocols in the stack.

Whereas the protocol stack denotes a specific combination of protocols that work together, the Reference Model is a software architecture that lists each layer and the services each should offer. The classic seven-layer reference model is the OSI model, which is used for conceptualizing protocol stacks and peer entities.
This reference model also provides an opportunity to teach more general software engineering concepts like information hiding, modularity, and delegation of tasks. This model has endured in spite of the demise of many of the protocols (and protocol stacks) originally sanctioned by the ISO. The OSI model is not the only reference model, however.

Common Protocols

• HTTP (Hyper Text Transfer Protocol)
• POP3 (Post Office Protocol 3)
• SMTP (Simple Mail Transfer Protocol)
• FTP (File Transfer Protocol)
• IP (Internet Protocol)
• DHCP (Dynamic Host Configuration Protocol)
• IMAP (Internet Message Access Protocol)

Search Engine

A search engine is an information retrieval system designed to help find information stored on a computer system, such as on the World Wide Web, inside a corporate or proprietary network, or in a personal computer. The search engine allows one to ask for content meeting specific criteria (typically those containing a given word or phrase) and retrieves a list of items that match those criteria. This list is often sorted with respect to some measure of relevance of the results. Search engines use regularly updated indexes to operate quickly and efficiently.

Without further qualification, search engine usually refers to a Web search engine, which searches for information on the public Web. Other kinds of search engine are enterprise search engines, which search on intranets, personal search engines, and mobile search engines. Different selection and relevance criteria may apply in different environments, or for different uses.

Some search engines also mine data available in newsgroups, databases, or open directories. Unlike Web directories, which are maintained by human editors, search engines operate algorithmically or by a mixture of algorithmic and human input.

How search engines work

A search engine operates in the following order:

1) Web crawling
2) Indexing
3) Searching

A web crawler (also known as a Web spider or Web robot) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for Web crawlers are ants, automatic indexers, bots, and worms (Kobayashi and Takeda, 2000).

This process is called Web crawling or spidering. Many legitimate sites, in particular search engines, use spidering as a means of providing up-to-date data.
Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine, which will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.

Web crawler architectures
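The seed-and-frontier loop described earlier can be sketched in a few lines of Python. This is a minimal illustration rather than a real crawler: TOY_WEB is a made-up in-memory link graph standing in for the Web, the crawl function and its parameters are hypothetical names, and breadth-first order is simply one assumed visiting policy.

```python
from collections import deque

# Hypothetical link graph standing in for the Web: URL -> links on that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://a.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Visit URLs breadth-first, starting from the seeds.

    fetch_links(url) must return the hyperlinks found on that page.
    """
    frontier = deque(seeds)   # the crawl frontier: URLs still to visit
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["http://a.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

A production crawler would replace the lambda with a function that downloads and parses each page, and would add policies for politeness, revisiting, and prioritization.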
[Figure: High-level architecture of a standard Web crawler]

A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture.

Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted that: "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."

Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.

Indexing entails how data is collected, parsed, and stored to facilitate fast and accurate retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process is Web indexing, within the context of search engines designed to find web pages on the Internet.

Popular engines focus on the full-text indexing of online, natural language documents, yet there are other searchable media types such as video, audio, and graphics. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined interval due to the required time and processing costs, whereas agent-based search engines index in real time.

Indexing
The goal of storing an index is to optimize the speed and performance of finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would take a considerable amount of time and computing power. For example, an index of 1000 documents can be queried within milliseconds, whereas a raw scan of 1000 documents could take hours. No search engine user would be comfortable waiting several hours to get search results. The trade-off for the time saved during retrieval is that additional storage is required to store the index and that it takes a considerable amount of time to update.

Index Design Factors

Major factors in designing a search engine's architecture include:

• Merge factors - how data enters the index, or how words or subject features are added to the index during corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.
• Storage techniques - how to store the index data, and whether information should be compressed or filtered
• Index size - how much computer storage is required to support the index
• Lookup speed - how quickly a word can be found in the inverted index.
How quickly an entry in a data structure can be found, versus how quickly it can be updated or removed, is a central focus of computer science.
• Maintenance - maintaining the index over time
• Fault tolerance - how important it is for the service to be reliable, how to deal with index corruption, whether bad data can be treated in isolation, dealing with bad hardware, partitioning schemes such as hash-based or composite partitioning, and data replication

Index Data Structures

Search engine architectures vary in how indexing is performed and in index storage to meet the various design factors. Types of indices include:

• Suffix trees - figuratively structured like a tree, they support linear-time lookup. Built by storing the suffixes of words. Used for searching for patterns in DNA sequences and for clustering. A major drawback is that the storage of a word in the tree may require more storage than storing the word itself. An alternate representation is a suffix array, which is considered to require less memory and supports compression like BWT.
• Tries - an ordered tree data structure that is used to store an associative array where the keys are strings. Regarded as faster than a hash table, but less space efficient. The suffix tree is a type of trie. Tries support extendible hashing, which is important for search engine indexing.
• Inverted indices - store a list of occurrences of each atomic search criterion, typically in the form of a hash table or binary tree.
• Citation indices - store the existence of citations or hyperlinks between documents to support citation analysis, a subject of bibliometrics.
• Ngram indices - store sequences of n length of data to support other types of retrieval or text mining.
• Term document matrices - used in latent semantic analysis, they store the occurrences of words in documents in a two-dimensional sparse matrix.

Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel processes. There are many opportunities for race conditions and coherence faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks.

Consider that authors are producers of information, and a crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information, and users are the consumers that need to search.

The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.

Inverted indices

Many search engines incorporate an inverted index when evaluating a search query, to quickly locate the documents which contain the words in a query and rank these documents by relevance. The inverted index stores a list of the documents for each word. The search engine can retrieve the matching documents quickly using direct access to find the documents for a word.
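A minimal Python sketch of this idea follows. The function names, the whitespace-based word splitting, and the three sample documents are all assumptions made for illustration; a hash-table-backed dict maps each word to the set of documents that contain it, and a multi-word query is answered by intersecting those sets.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the IDs of documents containing every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Made-up sample corpus.
docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}
index = build_inverted_index(docs)
print(search(index, "the"))       # documents containing "the"
print(search(index, "the cow"))   # documents containing both words
```

Each lookup is a direct dictionary access per query word, so the cost depends on the number of query words and the size of their document lists rather than on the size of the whole corpus.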
The following is a simplified illustration of the inverted index:

Inverted Index
Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7

The above figure is a simplified form of a Boolean index. Such an index would only serve to determine whether a document matches a query, but would not contribute to ranking matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of the word in each document. With position, the search algorithm can identify word proximity to support searching for phrases. Frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix, given that words are not present in each document. It is stored differently than a two-dimensional array to reduce memory requirements. The
index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically distributed.

Inverted indices can be programmed in several computer programming languages.

Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge involves identifying the document or documents to add into or update in the index and parsing each document into words. More precisely, a merge combines newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives.

After parsing, the indexer adds the containing document to the document list for the appropriate words. The process of finding each word in the inverted index in order to denote that it occurred within a document may be too time consuming when designing a larger search engine, and so this process is commonly split up into the development of a forward index and the process of sorting the contents of the forward index for entry into the inverted index. The inverted index is so named because it is an inversion of the forward index.

The Forward Index

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document      Words
Document 1    the,cow,says,moo
Document 2    the,cat,and,the,hat
Document 3    the,dish,ran,away,with,the,spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document.
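The inversion step, sorting the contents of the forward index for entry into the inverted index, can be sketched in Python using the three sample documents from the table above. The function name invert and the use of a dict of lists are choices made for this illustration only.

```python
from itertools import groupby

# The simplified forward index from the table above: document -> words.
forward_index = {
    "Document 1": ["the", "cow", "says", "moo"],
    "Document 2": ["the", "cat", "and", "the", "hat"],
    "Document 3": ["the", "dish", "ran", "away", "with", "the", "spoon"],
}

def invert(forward):
    """Turn document -> words into word -> documents by sorting pairs."""
    # Flatten into unique (word, document) pairs, then sort them by word.
    pairs = sorted({(word, doc)
                    for doc, words in forward.items()
                    for word in words})
    # Consecutive pairs that share a word form that word's document list.
    return {word: [doc for _word, doc in group]
            for word, group in groupby(pairs, key=lambda pair: pair[0])}

inverted = invert(forward_index)
print(inverted["the"])
print(inverted["cow"])
```

Because the parsing stage only appends to the forward index, the comparatively expensive sort can run separately and in bulk, which is the main appeal of splitting the work this way.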
Separating the forward index from the inverted index enables asynchronous processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it into an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index.

Compression

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text Internet search engine:

• An estimated 2,000,000,000 different web pages exist as of the year 2000
• A fictitious estimate of 250 words per webpage on average, based on the assumption of being similar to the pages of a novel
• It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character
• The average number of characters in any given word on a page can be estimated at 5 (Wikipedia:Size comparisons)
• The average personal computer comes with about 20 gigabytes of usable space

Given these estimates, generating an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries (2 billion pages times 250 words). At 1 byte per character, or 5 bytes per word, this would require 2,500 gigabytes of storage space alone, far more than the free disk space of the average personal computer. This space is further increased in the case of a distributed storage architecture that is fault-tolerant. Using compression, the index size can be reduced to a portion of its size, depending on which compression techniques are chosen. The trade-off is the time and processing power required to perform compression.

Notably, large-scale search engine designs incorporate the cost of storage and the cost of electricity to power the storage. Compression, in this regard, is a measure of cost as well.

Document Parsing

Document parsing involves breaking apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. For example, if the full contents of a document consisted of the sentence "Hello World", there would typically be two words found, the token "Hello" and the token "World". In the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization, and sometimes word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis.
The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing, as of 2006, is the subject of continuous research and technological improvement. There are a host of challenges in tokenization, that is, in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementations of which are commonly kept as corporate secrets.

Challenges in Natural Language Processing

• Word Boundary Ambiguity - native English speakers may at first consider tokenization to be a straightforward task, but this is not the case when designing a multilingual indexer. In digital form, the text of other languages such as Chinese, Japanese or Arabic represents a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).
• Language Ambiguity - to assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify the language of the document or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.
• Diverse File Formats - in order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.
• Faulty Storage - the quality of the natural language data cannot always be assumed to be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Tokenization

Unlike literate human adults, computers are not inherently aware of the structure of a natural language document and do not instantly recognize words and sentences. To a computer, a document is only a big sequence of bytes. Computers do not know that a space character between two sequences of characters means that there are two separate words in the document. Instead, a computer program is developed by humans which instructs the computer how to identify what constitutes an individual or distinct word, referred to as a token. This program is commonly referred to as a tokenizer, parser, or lexer.
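A very small tokenizer of this kind can be sketched with Python's re module. The token categories, the Token record, and the decision to discard whitespace are assumptions chosen for this illustration; real indexers use far more elaborate, language-specific rules.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    text: str
    kind: str       # "word", "number", or "other"
    position: int   # character offset within the document

# One named group per token category; the first alternative that matches wins.
TOKEN_PATTERN = re.compile(
    r"(?P<word>[^\W\d_]+)"      # runs of letters
    r"|(?P<number>\d+)"         # runs of digits
    r"|(?P<whitespace>\s+)"     # spaces, tabs, line breaks
    r"|(?P<other>.)"            # any other single character
)

def tokenize(text: str) -> list[Token]:
    """Split text into tokens, dropping whitespace."""
    return [Token(match.group(), match.lastgroup, match.start())
            for match in TOKEN_PATTERN.finditer(text)
            if match.lastgroup != "whitespace"]

for token in tokenize("Hello World"):
    print(token)
```

Recording the position of each token alongside its text is what later allows an index to support proximity and phrase searches.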
Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters, which typically represent words. Commonly recognized tokens include punctuation, sequences of numerical characters, alphabetical characters, alphanumerical characters, binary characters (backspace, null, print, and other antiquated print commands), whitespace (space, tab, carriage return, line feed), and entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.

Language Recognition

If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, given that many of the later steps are language dependent (such as stemming and part of speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Other names for language recognition include language classification, language analysis, language identification,
and language tagging. Automated language recognition is the subject of ongoing research in natural language processing. Finding which language the words belong to may involve the use of a language recognition chart.

Format Analysis

If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain, in addition to textual content, formatting information. For example, HTML documents contain HTML tags, which specify formatting information, like whether to start a new line, display a word in bold, or change the font size or family. If the search engine were to ignore the difference between content and markup, the markup segments would also be included in the index, leading to poor search results. Format analysis involves the identification and handling of formatting content embedded within documents, which controls how the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, or text preparation. The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary and very little information is disclosed, while others are well documented.
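For HTML, the tag stripping described above can be sketched with the html.parser module from the Python standard library. The class name TextExtractor, the helper extract_text, and the decision to skip script and style content are assumptions made for this example rather than a description of how any particular search engine works.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the textual content of an HTML document, ignoring markup."""

    SKIPPED = {"script", "style"}   # content that is never shown as text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><p>The <b>cow</b> says moo</p></body></html>")
print(extract_text(sample))
```

Only the extracted text would then be passed on to the tokenizer, so that tag names and style rules never reach the index.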
Common, well-documented file formats that many search engines support include:

• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• IBM Lotus Notes
• HTML
• ASCII text files (a text document without any formatting)
• Adobe's Portable Document Format (PDF)
• PostScript (PS)
• LaTeX
• The UseNet archive (NNTP) and other deprecated bulletin board formats
• XML and derivatives like RSS
• SGML (this is more of a general protocol)
• Multimedia meta data formats like ID3

Techniques for dealing with various formats include:

• Using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format
• Writing a custom parser

Some search engines support inspection of files that are stored in a compressed or encrypted file format. If working with a compressed format, the indexer first decompresses the document, which may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

• ZIP - Zip File
• RAR - Archive File
• CAB - Microsoft Windows Cabinet File
• Gzip - Gzip file
• BZIP - Bzip file
• TAR, GZ, and TAR.GZ - Unix Gzip'ped Archives

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:

• Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
• Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.

Section Recognition

Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections which do not contain primary material (that which the document is about). For example, this article may display a side menu with words inside links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear in the raw source content sequentially are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, a dilemma ensues where the quality of the index is degraded and search quality is degraded due to the mixed content and improper word proximity.
Two primary problems are noted:

• Content in different sections is treated as related in the index, when in reality it is not.
• Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, so the index is filled with a poor representation of its documents (assuming the goal is to capture the meaning of each document, a sub-goal of providing quality search results).

Section analysis may require the search engine to implement the rendering logic of each document, essentially building an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. Viewers of web pages in web browsers see this content. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way, and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the noscript tag to ensure that the web page is indexed