Class 39: ...and the World Wide Web

The World Wide Web
Dynamic Web Applications
Search Engines
MapReduce
Course Summary

Published in: Technology, Design

Class 39: ...and the World Wide Web

  1. Lecture 39: …and the World Wide Web. cs1120, Fall 2011. David Evans, http://www.cs.virginia.edu/evans
  2. Announcements: Exam 2 due 61 seconds ago! Friday: we will return graded Exam 2, along with guidance about the Final. Must be present (or email me in advance) to win! If you want to present your PS8 in class Monday, remember to email me!
  3. Plan: The World Wide Web. Building Web Applications. How Google Works (or, going back to pre-PS5 to make things really fast again!). cs1120 recap in one (heavily animated) slide!
  4. The World Wide Web
  5. The “Desk Wide Web”: Memex Machine. Vannevar Bush, As We May Think, LIFE, 1945
  6. WorldWideWeb: Sir Tim Berners-Lee. First web server and client, 1990, CERN (Switzerland). (This picture, 1993.) MIT
  7. Overview: Many of the discussions of the future at CERN and the LHC era end with the question – “Yes, but how will we ever keep track of such a large project?” This proposal provides an answer to such questions. Firstly, it discusses the problem of information access at CERN. Then, it introduces the idea of linked information systems, and compares them with less flexible ways of finding information. http://www.w3.org/History/1989/proposal-msw.html
  8. A Practical Project
  9. [no transcript text]
  10. WorldWideWeb: Established a common language for sharing information on computers. Lots of previous attempts (Gopher, WAIS, Archie, Xanadu, etc.) failed.
  11. Why the World Wide Web? World Wide Web succeeded because it was simple! Didn’t attempt to maintain links, just a common way to name things. Uniform Resource Locators (URL): http://www.cs.virginia.edu/cs1120/index.html
      Service: http (HyperText Transfer Protocol)
      Hostname: www.cs.virginia.edu
      File Path: /cs1120/index.html
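The three parts of the URL on this slide can be pulled apart with Python's standard library. This is a minimal sketch; the helper name `url_parts` is made up for illustration and is not from the lecture.

```python
from urllib.parse import urlsplit

def url_parts(url):
    """Split a URL into the slide's three parts: service, hostname, file path."""
    parts = urlsplit(url)
    return parts.scheme, parts.hostname, parts.path

service, hostname, file_path = url_parts(
    "http://www.cs.virginia.edu/cs1120/index.html")
print(service, hostname, file_path)
```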
  12. HyperText Transfer Protocol. The Client (Browser) sends the Server: GET /cs1120/index.html HTTP/1.0. The Server replies with the contents of the file (<html> <head> …), written in HTML (HyperText Markup Language).
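To make the exchange concrete, here is a rough sketch of the text a client sends and how a reply splits into a status line and the file contents. It does not touch the network; the function names and the canned response are assumptions made for illustration, not part of the lecture.

```python
def build_get_request(path):
    """Build the request shown on the slide: request line, then a blank line."""
    return "GET {} HTTP/1.0\r\n\r\n".format(path)

def split_response(raw):
    """Separate a raw response into its status line and its body."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line = head.split("\r\n")[0]
    return status_line, body

# Canned (made-up) server reply, standing in for a real network response.
canned = ("HTTP/1.0 200 OK\r\n"
          "Content-Type: text/html\r\n"
          "\r\n"
          "<html> <head> ... </head> </html>")

request = build_get_request("/cs1120/index.html")
status, html = split_response(canned)
```

The blank line (`\r\n\r\n`) is what tells each side where the headers end and the file contents begin.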
  13. HTML: HyperText Markup Language. Language for controlling display of web pages. Uses formatting tags: between < and >.
      Document ::= <html> Header Body </html>
      Header ::= <head> HeadElements </head>
      HeadElements ::= HeadElement HeadElements
      HeadElements ::= ε | <title> Element </title>
      Body ::= <body> Elements </body>
      Elements ::= ε | Element Elements
      Element ::= <p> Element </p>
      Element ::= <center> Element </center>
      …
  14. Popular Web Site: Strategy 1: Static, Authored Web Site. Content Producer: http://www.twinkiesproject.com/. Drawbacks: • Have to do all the work yourself • The world may already have enough Twinkie-experiment websites
  15. Popular Web Site: Strategy 2: Dynamic Web Applications. Web Programmer: Seed content and function. Attracts users. Produce more content. eBay in 1997: http://web.archive.org/web/19970614001443/http://www.ebay.com/
  16. Popular Web Site: Strategy 2: Dynamic Web Applications. Seed content and function. Attracts users. Produce more content. (reddit.com in 2005, reddit.com today.) Advantages: • Users do most of the work • If you’re lucky, they might even pay you for the privilege! Disadvantages: • Lose control over the content (you might get sued for things your users do) • Have to know how to program a web application
  17. Dynamic Web Sites
      Programs that run on the web server: Can be written in any language (often in Python or Java), just need a way to connect the web server to the program. Program generates HTML (often JavaScript also now). Every useful web site does this.
      Programs that run on the client’s machine: Java, JavaScript (aka, “Scheme for the Web”), Flash, etc.: language must be supported by the client’s browser. Responsive interface: limited round-trips to server.
  18. Searching the Web
  19. [no transcript text]
  20. Building a Web Search Engine
      Database of web pages: Crawling the web collecting pages and links. Indexing them efficiently.
      Responding to Searches: Spell checking – edit distance. How to find documents that match a query. How to rank the “best” documents.
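The slide names edit distance as the basis for spell checking without spelling it out. One common definition is Levenshtein distance (the fewest single-character insertions, deletions, and substitutions needed to turn one string into another); the sketch below assumes that definition, which the lecture does not confirm.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, via dynamic programming."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(edit_distance("googel", "google"))
```

A spell checker could then suggest the known words with the smallest distance to the query term.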
  21. Crawling: Crawler
      activeURLs = ["www.yahoo.com"]
      while (len(activeURLs) > 0):
          newURLs = []
          for URL in activeURLs:
              page = downloadPage(URL)
              newURLs += extractLinks(page)
          activeURLs = newURLs
      Problems: Will keep revisiting the same pages. Will take very long to get a good view of the web. Will annoy web server admins. downloadPage and extractLinks must be very robust.
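The first problem on this slide (revisiting the same pages) can be addressed by remembering which URLs have already been fetched. The sketch below runs against a tiny in-memory "web" instead of the real network; `FAKE_WEB`, its page contents, and the regular-expression link extractor are all stand-ins invented for illustration. The other problems (speed, politeness to servers, robustness) are not addressed here.

```python
import re

# A made-up three-page web: URL -> page contents.
FAKE_WEB = {
    "www.yahoo.com": '<a href="a.example">A</a> <a href="b.example">B</a>',
    "a.example": '<a href="www.yahoo.com">home</a> <a href="b.example">B</a>',
    "b.example": '<a href="a.example">A</a>',
}

def download_page(url):
    """Stand-in for a real download: look the page up in FAKE_WEB."""
    return FAKE_WEB.get(url, "")

def extract_links(page):
    """Very naive link extraction; real pages need a robust HTML parser."""
    return re.findall(r'href="([^"]+)"', page)

def crawl(seed_urls):
    visited = set()
    active_urls = list(seed_urls)
    while len(active_urls) > 0:
        new_urls = []
        for url in active_urls:
            if url in visited:
                continue  # skip pages we have already fetched
            visited.add(url)
            new_urls += extract_links(download_page(url))
        active_urls = [u for u in new_urls if u not in visited]
    return visited

print(sorted(crawl(["www.yahoo.com"])))
```

Without the `visited` set, the cycle between the pages would keep the slide's loop running forever.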
  22. Building a Web Search Engine
      Database of web pages: Crawling the web collecting pages and links. Indexing them efficiently.
      Responding to Searches: How to find documents that match a query. How to rank the “best” documents.
  23. Building an Index. What if we just stored all the pages? Answering a query would be linear in the size of the database (need to look at all characters in database). Google: about 40 Billion pages (1 Trillion URLs, but number actually indexed is a closely kept corporate secret) * 60 KB (average web page size) = ~2.4 Quadrillion bytes to search! Linear is not nearly good enough when n is Quadrillions.
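The arithmetic on this slide is easy to check. The page count and average page size come from the slide; the 100 MB/s scan rate is a hypothetical figure chosen only to give a feel for how long one linear pass would take on a single machine.

```python
pages = 40 * 10**9               # about 40 billion pages (from the slide)
bytes_per_page = 60 * 10**3      # 60 KB average page size (from the slide)
total_bytes = pages * bytes_per_page

# ASSUMPTION: one machine scanning at 100 MB/s (not a figure from the lecture).
scan_rate = 100 * 10**6
scan_seconds = total_bytes // scan_rate
scan_days = scan_seconds / 86400

print(total_bytes)   # 2.4 quadrillion bytes
print(scan_days)     # days for a single linear scan at the assumed rate
```

Under that assumed rate, one query would take most of a year, which is why the following slides turn to an index.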
  24. Hash Table: Index → Key-Value Pairs
      0: { <“Colleen”, ?>, <“virginia”, ?>, … }
      1: { <“Bob”, ?>, … }
      2
      3
      … [about a million bins?]
      def lookup(key, table): searchEntries(table[H(key, len(table))])
      Finding a good H is difficult. You can download Google’s from http://code.google.com/p/google-sparsehash/
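A runnable version of the slide's `lookup` might look like the sketch below. The hash function `H` here is a deliberately simple toy (sum of character codes) and is an assumption; as the slide says, finding a good H is difficult, and this one is not good. `make_table` and `insert` are helper names invented for the example.

```python
def H(key, nbins):
    """Toy hash function: sum of character codes, modulo the number of bins."""
    return sum(ord(c) for c in key) % nbins

def make_table(nbins):
    """A table is a list of bins; each bin is a list of (key, value) pairs."""
    return [[] for _ in range(nbins)]

def insert(key, value, table):
    entries = table[H(key, len(table))]
    for i, (k, _) in enumerate(entries):
        if k == key:
            entries[i] = (key, value)  # replace an existing entry
            return
    entries.append((key, value))

def lookup(key, table):
    """Search only the one bin that H selects, not the whole table."""
    for k, v in table[H(key, len(table))]:
        if k == key:
            return v
    return None

table = make_table(4)
insert("Colleen", 1, table)
insert("virginia", 2, table)
insert("Bob", 3, table)
print(lookup("virginia", table))
```

With enough bins and a hash that spreads keys evenly, each bin stays short, so a lookup examines only a few entries no matter how large the table grows.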
  25. Google’s Lexicon. 1998: 14 million words (billions today?). Lookup word in H(word, nbins): maps to WordID.
      Key: Words
      0: [<“aardvark”, 1024235>, ...]
      1: [<“aaa”, 224155>, ..., <“zzz”, 29543>]
      ...
      nbins – 1: [<“abba”, 25583>, ..., <“zeit”, 50395>]
  26. Google’s Reverse Index (based on 1998 paper… definitely changed some since then, but now they are secretive!)
      WordId      ndocs   pointer (into the “Inverted Barrels”)
      00000000    3
      00000001    15
      ...
      16777215    105
      “Inverted Barrels”: 41 GB (1998). Today: many TB?
      Lexicon: 293 MB (1998). Today: many GB?
  27. Inverted Barrels
      Entry layout: docid (27 bits), nhits (5 bits), hits (16 bits each)
      Example row: 7630486927 23 ...
      plain hit: capitalized: 1 bit, font size: 3 bits, position: 12 bits (first 4095 chars, everything else)
      extra info for anchors, titles (fewer position bits)
      Suggested experiment for winter break: is the position field still only 12 bits?
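The bit widths on this slide fit exactly into a 16-bit hit and a 32-bit docid/nhits header. The sketch below packs and unpacks those fields with shifts and masks. The field order within each word and the choice to clamp positions past 4095 to 4095 are assumptions made for the example; the slide only gives the widths.

```python
def pack_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 (capitalized) + 3 (font) + 12 (position)."""
    if not 0 <= font_size < 8:
        raise ValueError("font size must fit in 3 bits")
    position = min(position, 4095)  # ASSUMPTION: later positions share one value
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    """Recover (capitalized, font_size, position) from a 16-bit hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF

def pack_header(docid, nhits):
    """Pack docid (27 bits) and nhits (5 bits) into one 32-bit word."""
    if not (0 <= docid < 2**27 and 0 <= nhits < 2**5):
        raise ValueError("docid or nhits out of range")
    return (docid << 5) | nhits

hit = pack_hit(True, 3, 100)
print(hit, unpack_hit(hit))
```

Twelve position bits can only distinguish 4096 values, which is why the slide treats everything past the first 4095 characters as one lump.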
  28. Building a Web Search Engine
      Database of web pages: Crawling the web collecting pages and links. Indexing them efficiently.
      Responding to Searches: Spell checking – edit distance. How to find documents that match a query. How to rank the “best” documents.
  29. Finding the “Best” Documents
      Humans rate them: “Jerry and David’s Guide to the World Wide Web” (became Yahoo!)
      Machines rate them: Count number of occurrences of keyword. Easy for sites to rig this. Machine language understanding not good enough.
      Business Model: Whoever pays you the most is listed first.
  30. PageRank. If a site is important and interesting, other sites will link to it. Don’t ever take <a href=http://www.cs.virginia.edu/cs1120>cs1120</a>! But… not all links are equal: if a lot of highly-ranked sites link to this site, this site should be highly-ranked.
  31. PageRank
      def pageRank(u):
          rank = 0
          for b in linksToPage(u):
              rank = rank + pageRank(b) / Links(b)
          return rank
      Would this work?
  32. Converging PageRank. Ranks of all pages depend on ranks of all other pages. Keep recalculating ranks until they converge.
      def CalculatePageRanks(urls):
          initially, every rank is 1
          for as many times as necessary:
              calculate a new rank for each page (using old ranks)
              replace the old ranks with the new ranks
      How do initial ranks affect results? How many iterations are necessary?
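The pseudocode on this slide can be made runnable on a small example. The sketch below follows the slide's rule (each page passes its old rank, split evenly, to the pages it links to) and nothing more: it has no damping factor, and it assumes every page has at least one outgoing link and links only to pages in the dictionary. The three-page graph and the name `calculate_page_ranks` are made up for illustration.

```python
def calculate_page_ranks(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    ranks = {page: 1.0 for page in links}  # initially, every rank is 1
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in links}
        for b, targets in links.items():
            for u in targets:
                # b shares its OLD rank equally among the pages it links to
                new_ranks[u] += ranks[b] / len(targets)
        ranks = new_ranks  # replace the old ranks with the new ranks
    return ranks

# Hypothetical web: a links to b and c, b links to c, c links to a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = calculate_page_ranks(web)
print(ranks)
```

Unlike the recursive `pageRank` on the previous slide, which would call itself forever on a cycle such as a → c → a, this version always uses the previous round's ranks, so each round finishes. On this graph the ranks settle near a = 1.2, b = 0.6, c = 1.2.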
  33. PageRank: 1998. Crawlable web (1998): 150 million pages, 1.7 Billion links. Database of 322 million links: converges in about 50 iterations. Initialization matters:
      All pages = 1: very democratic, models browser equally likely to start on random page.
      www.yahoo.com = 1, ..., all others = 0: more like what Google probably uses.
  34. Do we have a search engine? Theoretician: Sure! Ali G: No way! It’ll blow up. (Google’s First Server)
  35. How do we make our service fast enough to index the whole web and serve billions of requests?
  36. Counting Word Occurrences
      “When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, …” → [<“When”, 1>, <“in”, 1>, <“the”, 2> …]
      “We the People of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the …” → [<“We”, 1>, <“in”, 1>, <“the”, 2> …]
      map(doc, countWords)
      If we have enough machines, can we do this fast for the whole web?
  37. Reduce tree:
      reduce([<“When”, 1>, <“in”, 1>, <“the”, 2> …], [<“We”, 1>, <“in”, 1>, <“the”, 2> …]) → [<“We”, 1>, <“in”, 2>, …]
      reduce([<“a”, 5>, <“in”, 3>, <“the”, 2> …], [<“apple”, 1>, <“in”, 1>, <“the”, 7> …]) → [<“a”, 5>, <“in”, 4>, …]
      reduce([<“We”, 1>, <“in”, 2>, …], [<“a”, 5>, <“in”, 4>, …]) → [<“a”, 5>, <“in”, 6>, …]
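The map and reduce steps from these two slides can be sketched on one machine with Python's built-in `map` and `functools.reduce`. The two short documents are abbreviated from the quotations on the earlier slide; `count_words` and `merge_counts` are illustrative names, and splitting on whitespace is a simplification.

```python
from collections import Counter
from functools import reduce

def count_words(doc):
    """Map step: count the words in ONE document, independent of all others."""
    return Counter(doc.split())

def merge_counts(c1, c2):
    """Reduce step: combine two partial counts into a new one (no mutation)."""
    return c1 + c2

docs = ["When in the Course of human events",
        "We the People of the United States"]

partial_counts = list(map(count_words, docs))
total = reduce(merge_counts, partial_counts, Counter())
print(total["the"], total["of"])
```

Because `count_words` looks at a single document and `merge_counts` builds a new result instead of changing its inputs, the map calls could run on different machines and the merges could be grouped in any tree shape, as in the slide's diagram.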
  38. MapReduce
  39. Key to Massive Parallel Execution: Get rid of state and mutation!
  40. cs1120 recap diagram (layers, top to bottom):
      Functional Programming (PS 1-4): (define (count-matches p b) (list-sum (map (lambda (v) (if (eq? v b) 1 0)) p)))
      Interpreters: def meval(expr, env): … return evalApplication(expr, env) ...
      Any Mechanical Computation: Turing Machine (states 1, 2, 3; tape ... # 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 # ...)
      Any Discrete Function: truth table with inputs A B C and outputs R1 R0 (rows 0 0 0 → 0 0, 0 0 1 → 0 1, …); (or a b) as (not (and (not a) (not b)))
      Mechanical Logic: AND, NOT
      “Magic” Transistors
  41. [animation step: same content as slide 40]
  42. cs1120 recap diagram, continued (layers, top to bottom):
      Objects: SimObject, PhysicalObject, Place, MobileObject
      State and Mutation: m1: 1 2 3
      Functional Programming (PS 1-4): (define (count-matches p b) (list-sum (map (lambda (v) (if (eq? v b) 1 0)) p)))
      Interpreters: def meval(expr, env): … return evalApplication(expr, env) ...
      Any Mechanical Computation: Turing Machine (states 1, 2, 3; tape ... # 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 # ...)
      Any Discrete Function: truth table with inputs A B C and outputs R1 R0; (or a b)
  43. [animation step: same content as slide 42]
  44. Charge: Now, you know almost everything you need to build the next reddit or google!
      Layers (top to bottom): Objects; State and Mutation; Functional Programming (PS 1-4); Interpreters; Any Mechanical Computation; Any Discrete Function; Mechanical Logic; “Magic” Transistors
      Also on the slide: Recursive Definitions, Universality, Abstraction
