Brandon

Published in: Business, Technology

  1. 1. Web-scale data architectures A survey of next generation data storage and retrieval
  2. 2. Web-scale architectures An argument for computing in the cloud Or how I learned to stop worrying and love *aaS
  3. 3. Software as a COMMUNITY SERVICE participation SIMPLE user interfaces DATA
  4. 4. Seller reputation Product recommendation
  5. 5. What mattered to these companies? • MVC architecture? • Java vs. Python vs. [insert your favorite language]? • Ajax vs. Flash?
  6. 6. The common theme?... MASSIVE AMOUNTS OF DATA
  7. 7. In the beginning… DATA
  8. 8. Forget SQL… YOU ONLY WANT TO STORE DATA
  9. 9. Web 2.0 is *pushing the envelope*… • Scale • CPU-intensive text analytics • Search outside the column • 7x24 operation
  10. 10. Web Application Heresies? • REST and Resource-oriented data • Cloud computing • Map/Reduce will be next decade’s MVC • Semi-structured data • Grassroots, flexible schemas…microformats • Distributed hash tables • Offline browser clients ( * Adapted from Sam Ruby)
  11. 11. Save. See. Share. Secure. FOUR PILLARS OF DATA MANAGEMENT* ( * According to Damien Katz )
  12. 12. Massively distributed and scalable. Standards driven. THE INTERNET AS A PLATFORM
  13. 13. The Three Levels of Platforms you will meet on the Internet * ACCESS. PLUG-IN. RUNTIME. * Marc Andreessen, http://blog.pmarca.com/2007/09/the-three-kinds.html
  14. 14. Common. Code lives outside the runtime. LEVEL 1: ACCESS API
  15. 15. Will become more common. More difficult for both platform and app developers. LEVEL 2: PLUG-IN API
  16. 16. Rare. Constrained assumptions. Platform hosts code. LEVEL 3: RUNTIME ENVIRONMENT
  17. 17. Lowers barriers to entry. Enables situational applications. Isolates concerns to app. LEVEL 3
  18. 18. Who is building Level 3 platforms? • Ning: Social Application Platform • Salesforce: Sforce/AppExchange • Google: Mashup Editor, et al. • Second Life: scriptable 3D world • Amazon: Elastic Compute Cloud • Akamai: EdgeComputing • Yahoo!: Pipes • IBM: Mashup Maker
  19. 19. “IN THE LONG RUN, ALL CREDIBLE LARGE-SCALE INTERNET COMPANIES WILL PROVIDE LEVEL 3 PLATFORMS” * Marc Andreessen, http://blog.pmarca.com/2007/09/the-three-kinds.html
  20. 20. WEB-SCALE DATA + PROCESSING
  21. 21. “…ORGANIZE THE WORLD’S INFORMATION…”
  22. 22. Google File System. MapReduce Algorithm. Chubby Lock Server. BIGTABLE
  23. 23. MapReduce Defined
  24. 24. Column-oriented databases
        Id | Last_name | First_name | Salary
        1  | Smith     | Joe        | 40000
        2  | Jones     | Mary       | 50000
        3  | Johnson   | Cathy      | 44000
        Row-oriented: 1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
        Column-oriented: 1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
  25. 25. Google Apps Based on BigTable • Google Reader • Google Docs • Google Maps • Google Calendar • Google Print • Google Page Creator • Google Earth • Google Notebook • Blogger.com • Google Mashup Editor • Google Code • Orkut • YouTube • Etc.
  26. 26. Distributed hash table. Versioning. Vector clocks. Quorum. CASE STUDY: AMAZON
  27. 27. S3 → Simple Storage Service • SQS → Simple Queue Service • EC2 → Elastic Compute Cloud • Dynamo
  28. 28. NING’S ARCHITECTURE
  29. 29. http://docs.ning.com/page/page/show?id=492524:Page:26
  30. 30. Common themes? • Flexible schema • Highly distributed • HTTP is the database driver • JSON , XML, HTML, and JavaScript • Full text search
  31. 31. One Size Fits All AN IDEA WHOSE TIME HAS COME AND GONE? ( * Michael Stonebraker )
  32. 32. The first is that there will be a dedicated core, those that are heavily invested, either monetarily or professionally, in the status quo, and they will resist any change. The second is that change doesn't care about your investment. TWO RULES FOR ANY CHANGE IN TECHNOLOGY * ( * Joe Gregorio, http://bitworking.org/news/217/Ch-ch-changes )
  33. 33. N>1
  34. 34. scale out not up
  35. 35. WEB-SCALE DATA + PROCESSING
  36. 36. GOOGLE + MYSQL
  37. 37. CouchDB • Green implementation, no legacy • Designed to: – implement the four pillars of data management – leverage recent paradigm shifts • Level 3 Data Platform
  38. 38. CouchDB: Feature Summary • Robust data storage • Replication • REST API • User authentication • Views • Built on Erlang/OTP • Append-only writes • MVCC with optimistic concurrency • ETags • Full text search • Map/Reduce • (Your feature here. It’s open source!)
  39. 39. CouchDB: REST API • Easy retrieval using our favorite, scalable architecture: HTTP • Exchange in industry-standard formats (XML/JSON) • Simple and intuitive interface
  40. 40. APACHE HADOOP
  41. 41. The Hadoop stack (from a DBMS perspective) • MapReduce: Java framework to write parallel scans and aggregations • HBase: simple database • HDFS: distributed file system. IBM Impliance • Muse Query Language: declarative query language • MapReduce+: enhancements to MapReduce • Muse Data Model: semi-structured data model • HBase Core: database storage, transactions • HDFS: distributed file system
  42. 42. “Luckily, there are only a handful of companies…in the world that need to operate at [this] scale.” DOES EVERYBODY NEED THIS DEGREE OF SCALE? ( * Dare Obasanjo, http://www.25hoursaday.com/weblog/2007/10/06/ThoughtsOnAmazonsInternalStorageSystemDynamo.aspx )
  43. 43. “BEWARE OF FOCUSING TOO MUCH ON THE APPS OF THE PAST WHEN LOOKING AT PLATFORMS OF THE FUTURE”
  44. 44. Linux, Xen virtualization, Apache Hadoop. IBM & GOOGLE UNIVERSITY SPONSORSHIP
  45. 45. • Understand and communicate HTTP resource vs. RDBMS differences • Research, explore, and push the limits of the MapReduce programming model • Discover where distributed hash tables may make sense over RDBMS. ACADEMIC PROPOSAL
  46. 46. View ourselves as a Level 3 Platform by… • Taking the runtime out of the developer’s control • Leveraging IBM’s Impliance project for massive data scaling. PROJECT ZERO PROPOSAL
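
Slide 23 names MapReduce, but the transcript carries no definition. As a rough illustration only, the sketch below runs the three phases (map, group by key, reduce) in a single Python process on a word count. The names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical and are not taken from Google's or Hadoop's APIs; a real framework would spread the same phases across many machines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(document: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse all values seen for one key into a total."""
    return word, sum(counts)


def run_mapreduce(
    inputs: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Single-process stand-in for a cluster: map, group by key, reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in inputs:
        for key, value in mapper(record):
            groups[key].append(value)  # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())


word_counts = run_mapreduce(
    ["scale out not up", "scale out"], map_words, reduce_counts
)
print(word_counts)
```

Because each mapper call sees one record and each reducer call sees one key, both phases can run in parallel, which is the property the "scale out not up" slide relies on.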
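
The two serialized strings on slide 24 can be reproduced from the same three records. This is a minimal sketch, assuming the slide's comma and semicolon delimiters; the function names are illustrative and do not come from any particular database.

```python
from typing import List, Tuple

Row = Tuple[int, str, str, int]

# The three records shown on slide 24: Id, Last_name, First_name, Salary.
rows: List[Row] = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_row_oriented(records: List[Row]) -> str:
    """Store each record contiguously: all fields of row 1, then row 2, ..."""
    return "".join(",".join(str(v) for v in record) + ";" for record in records)


def serialize_column_oriented(records: List[Row]) -> str:
    """Store each column contiguously: all Ids, then all last names, ..."""
    columns = zip(*records)  # transpose rows into columns
    return "".join(",".join(str(v) for v in column) + ";" for column in columns)


row_layout = serialize_row_oriented(rows)
column_layout = serialize_column_oriented(rows)
print(row_layout)
print(column_layout)
```

In the column layout, an aggregate such as total salary only needs the last segment of the string, whereas the row layout forces a scan over every field of every record.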
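
Slide 26 lists vector clocks and quorum among the techniques in the Amazon case study without showing how they work. The sketch below is a generic textbook version, not Amazon's implementation: a vector clock is modeled as a dictionary from node name to counter, and the helper names (`increment`, `descends`, `compare`, `quorums_overlap`) are assumptions made for illustration.

```python
from typing import Dict

VectorClock = Dict[str, int]


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if version a has seen every update that version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two versions as equal, ordered, or concurrent (a conflict)."""
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "after"
    if b_covers_a:
        return "before"
    return "concurrent"


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """With n replicas, reads of r and writes of w overlap when r + w > n."""
    return r + w > n


v1 = increment({}, "A")   # first write, handled by node A
v2 = increment(v1, "B")   # later write through node B
v3 = increment(v1, "C")   # independent write through node C

print(compare(v2, v1))    # v2 supersedes v1
print(compare(v2, v3))    # neither has seen the other: needs reconciliation
```

Concurrent versions are the case where the store cannot pick a winner on its own, which is why versioning appears next to vector clocks on the slide.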
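
Slide 38 pairs append-only writes with MVCC and optimistic concurrency. The toy in-memory store below illustrates that combination in general terms; it is not CouchDB's storage engine or API, and `ToyDocumentStore`, `ConflictError`, and the integer revision numbers are hypothetical. A writer must state which revision it last read, and a stale revision is rejected instead of being locked out in advance.

```python
from typing import Dict, List, Optional, Tuple


class ConflictError(Exception):
    """Raised when a writer's expected revision is not the latest one."""


class ToyDocumentStore:
    """Append-only log of document versions with optimistic concurrency."""

    def __init__(self) -> None:
        # Every write appends (doc_id, revision, body); nothing is overwritten.
        self._log: List[Tuple[str, int, dict]] = []
        # Index from doc_id to the position of its newest version in the log.
        self._latest: Dict[str, int] = {}

    def get(self, doc_id: str) -> Tuple[int, dict]:
        """Return (revision, body) of the newest version of a document."""
        _, revision, body = self._log[self._latest[doc_id]]
        return revision, dict(body)

    def put(self, doc_id: str, body: dict, expected_rev: Optional[int] = None) -> int:
        """Append a new version if expected_rev matches the current revision."""
        position = self._latest.get(doc_id)
        current_rev = self._log[position][1] if position is not None else None
        if expected_rev != current_rev:
            raise ConflictError(
                f"{doc_id}: expected rev {expected_rev}, current rev {current_rev}"
            )
        new_rev = 1 if current_rev is None else current_rev + 1
        self._log.append((doc_id, new_rev, dict(body)))
        self._latest[doc_id] = len(self._log) - 1
        return new_rev


store = ToyDocumentStore()
rev1 = store.put("slide-38", {"title": "draft"})
rev2 = store.put("slide-38", {"title": "final"}, expected_rev=rev1)

try:
    # A second writer still holding rev1 loses the race and must re-read.
    store.put("slide-38", {"title": "stale edit"}, expected_rev=rev1)
    stale_write_rejected = False
except ConflictError:
    stale_write_rejected = True
```

Readers never block in this scheme: older versions stay in the log, so a reader can keep using the version it started with while a writer appends a newer one.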
