Reconceiving the Web as a
Distributed (NoSQL) Data
System
Daniel Austin
PayPal, Inc.
NoSQL Now! Conference
August 22, 2013...

Reconceiving the Web as a Distributed (NoSQL) Data System


[Slides from NoSQL Now! 2013]
Nearly every Web request is a request for information from a database or a front-end caching system for one. Based on this concept, we can reconceive the Web as a large-scale distributed data system using NoSQL query languages across high-level protocols such as HTTP. Exploring this idea further leads us to a better understanding of the structure of the Web, and invites us to apply modern NoSQL thinking toward making it better. My goal is to re-orient people’s thinking toward the Web as a big NoSQL data system and then explore the implications.
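
To make the idea concrete, here is a minimal sketch of reading an HTTP request as a query against a data store. It is an illustration under stated assumptions, not something taken from the slides: the `WebQuery` type, the `as_query` helper, the verb-to-operation table, and the `example.com` URI are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping from HTTP verbs to data-store operations.
# This table is an assumption made for illustration only.
VERB_TO_OPERATION = {
    "GET": "read",
    "POST": "create",
    "PUT": "replace",
    "DELETE": "delete",
}


@dataclass
class WebQuery:
    operation: str  # what to do, derived from the HTTP verb
    source: str     # which data source to ask (the host)
    key: str        # which object is addressed (the path)
    filters: dict   # extra selection criteria (the query string)


def as_query(method: str, uri: str) -> WebQuery:
    """Interpret an HTTP verb plus URI as a simple key-value query."""
    parts = urlsplit(uri)
    return WebQuery(
        operation=VERB_TO_OPERATION[method.upper()],
        source=parts.netloc,
        key=parts.path,
        filters=parse_qs(parts.query),
    )


# Hypothetical request: a ranged search expressed in the query string.
q = as_query("GET", "http://example.com/products/42?min_price=10&max_price=20")
print(q)
```

In this reading, the verb selects the operation, the host and path select the object, and the query string carries any filter conditions.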

Published in: Technology, Education


  1. Reconceiving the Web as a Distributed (NoSQL) Data System
     Daniel Austin, PayPal, Inc.
     NoSQL Now! Conference, August 22, 2013, V1.2
  2. The Big Idea
     “The World-Wide Web is the World’s Largest NoSQL Distributed Data System”
  3. The Mind Map
  4. History
     • DNS (1983): the first large-scale DDS, using flat files
     • WWW (1989): “a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help” (Berners-Lee & Cailliau, 1989)
     But Why NoSQL?
  5. WWWDB: Anatomy
     WWW
     • HTML (Presentation)
     • URI (Addressing)
     • HTTP (Transport)
  6. Typology of Hyperlink Queries
     • Hypertext links come in two flavors: transitive and intransitive
     • Transitive queries are usually for inactive content: presentation material to supplement the user’s queried data
     • Intransitive queries are user-actuated and usually provide navigation and business logic for the query
  7. Data Clients Query Data Sources
  8. What Do HTTP URIs Identify?
     • Not a single resource
     • WWWDB query syntax is split between HTTP ‘verbs’ (POST, GET, PUT, DELETE) and their objects, addressed by URIs
     • A URI encapsulates a resource as the object identified by a query
     (Note that transitive and intransitive hyperlinks almost always go to different locations)
  9. CDN as a Caching Mechanism
     • CDNs such as Akamai and CloudFront provide local caching services for WWWDB, mostly for static, presentation-related objects
       - Frequency-based caching for transitive hyperlinks
       - Most secondary queries go to the CDN
       - 95%+ of all the bytes transported over the Web
       - ~90% of all WWWDB queries (HTTP requests/responses)
  10. APIs as Secondary Queries
     • Active subqueries
     • Usually dynamic
     • URIs function as a selection mechanism
     • Often user-actuated, intransitive events
     • Query results often modify the display
  11. REST as a Query Syntax Mechanism
     • Common semantics
       - REST provides a means of specifying the proper query for an object in a specific state
     • Demands NoSQL due to state constraints
     • Uses query strings for ranged searches
     (Image courtesy IBM)
  12. Indexing WWWDB
     • Google, Bing, Yahoo! and other ‘index searches’ on WWWDB
       - Inconsistent results are accepted
     • Query cache or a data cache?
     • Secondary query routing
     • Alternative query indices
       - Wolfram Alpha, Index Mundi, Twitter act as ‘almanacs’
  13. Does the CAP Theorem Apply?
     Yes, It Does, But Only Partially
     • Partition and Availability: 404s, DDoS
     • WWWDB relaxes the Consistency constraint
     • We accept inconsistent queries and broken links as a tradeoff for real-time availability and high-velocity updates
     But We Can Do Better!
  14. Drawbacks of the CAP Model
     • Caching: not all data is cached everywhere
       - Some sites are single-location/single-source
       - Hard (static) assets are far more widely cached
     • What does CAP mean when data is only partially distributed?
       - Very little: consistency only applies to part of the queries
  15. Improving WWWDB
     • Better Data Clients
       - HTML5 provides new query mechanisms via WebSockets, Web Storage, and other means
       - Still mostly presentation-level improvements
     • Better Caching, Distribution & Transport
       - Work currently being done at IETF on HTTP 2.0
     • Better Queries
       - Very little work being done; more on this
  16. RDF and the Semantic Web
     • Changes query patterns but not storage
       - Queries based on semantic ID of resource
     • Requires content to be semantically labeled
     • Work on SPARQL reduces query limitations
       - But may also make things slower (!)
     • Cloud computing and query distribution will prove a more powerful force for improving WWWDB than semantic
  17. Browsers as Data Clients
     • Presentation First!
       - Data is treated as secondary
     • Designed for Browsing, Not Querying
       - Query patterns are inefficient
       - Semi-stateful nature of Web sessions
     • Bedeviled with Legacy Issues
  18. Optimizing Web Queries
     • REST doesn’t imply FAST
       - Use a domain model to limit query endpoints
       - May require unnecessary requests
     • Query-string semantics allow for joins and arbitrary comparisons
     • Recognize that some queries require state and use it
     • Distribute intransitive queries more widely
  19. Reforming Hypertext for Querying WWWDB
     • Enlarge the number of link types
     • Distinguish transitive links
     • Add bidirectional linking
     • Enhance the semantics of the query string
     • Make hypertext more useful for mobile and devices
  20. IPv6 and Query Routing for WWWDB
     • The IPv6 space is large enough to allow for multiple query addressing schemes:
       - Semantic addressing of objects by type
       - Objects in the Internet of Things
       - Dynamic, context-driven addressing
  21. Scaling the WWWDB
     • This may require expanding our notions of URIs and links (queries)
     • Semantic mapping of resources requires additional complexity for queries
     • Explicit state management for efficiency
     Every system has a scaling limit
  22. Final Thoughts
     • The Web is the largest NoSQL Distributed Data System
       - URIs address the resultset of a NoSQL query
       - Transitive and intransitive hyperlinks
     • We can add power and simplicity to our queries by carefully reforming the URI syntax and the current implementations of hypertext
     • HTTP and HTML are undergoing significant evolution; now it’s time for URIs!
  23. Reconceiving the Web as a Distributed Data System
     Thank You!
     Daniel Austin, PayPal, Inc.
     NoSQL Now! Conference, August 22, 2013, V1.2
     @daniel_b_austin
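
The caching and CAP slides describe a tradeoff in which cached copies may disagree with their source in exchange for availability. A minimal sketch of that behavior follows, assuming a simple time-to-live (TTL) policy. The `Origin` and `EdgeCache` classes, the `/logo.png` key, and the 60-second TTL are hypothetical choices for illustration; they do not describe how any particular CDN works.

```python
class Origin:
    """Stand-in for a single-location data source (hypothetical)."""

    def __init__(self):
        self.data = {}

    def get(self, uri):
        return self.data[uri]


class EdgeCache:
    """Hypothetical edge node that caches origin responses for `ttl` seconds."""

    def __init__(self, origin, ttl):
        self.origin = origin
        self.ttl = ttl
        self.entries = {}  # uri -> (value, time the value was fetched)

    def get(self, uri, now):
        entry = self.entries.get(uri)
        if entry is not None and now - entry[1] < self.ttl:
            # Cache hit: answer locally, even if the origin has changed since.
            return entry[0]
        # Cache miss or expired entry: go back to the origin and remember it.
        value = self.origin.get(uri)
        self.entries[uri] = (value, now)
        return value


origin = Origin()
origin.data["/logo.png"] = "v1"
edge = EdgeCache(origin, ttl=60)

first = edge.get("/logo.png", now=0)    # miss: fetched from the origin
origin.data["/logo.png"] = "v2"         # the origin is updated
stale = edge.get("/logo.png", now=30)   # hit within the TTL: old value served
fresh = edge.get("/logo.png", now=61)   # TTL expired: refetched from the origin
print(first, stale, fresh)
```

Between the update and the expiry, the edge node stays available but answers with an outdated value, which is the relaxed consistency the slides refer to. The clock is passed in as `now` so the example is deterministic.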
