Thu 1400 cagle_kurt_color


Published in: Technology, Education

  1. Your Trusted Web Presence Partner
     BIG DATA
     Semantic Data
     © 2011 Avalon Consulting, LLC
  2. Your Speaker
     • Kurt Cagle is an Information Architect for Avalon Consulting.
     • Author of 18 books on XML, Web Development and the Semantic Web
     • Managing Editor of
     • Email:
  3. Perspectives
     • In 1945, the cost to acquire a byte of data was high: ~ $1/kB in 2011 USD.
     • By 1960, that same dollar could get 1000 times as much data. This is another expression of Moore's Law.
     • At 15 years per 1000x increase, the cost to acquire a kByte in 2011 is ~ $0.000000001/kB.
     • By 2050, this will be $0.00000000000000000000001/kB, or 10,000,000,000,000,000,000,000 kB/$.
  4. Record Data
     • The fundamental unit of data is the record.
     • A record has an identity, a unique code (within its context) that differentiates it from all other records.
     • A record has zero or more properties that describe specific characteristics of the record.
     • Some of those properties may be pointers to other records that have a given identity.
     • The combination of properties and identity also share a semantic cohesiveness.
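The record model on this slide can be sketched in a few lines of Python. This is only an illustration: the `Record` class, its `key` and `properties` fields, and the sample data are hypothetical names chosen here and do not come from the presentation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Record:
    """A record: an identity plus zero or more descriptive properties."""
    key: str                                             # identity, unique within its context
    properties: Dict[str, Any] = field(default_factory=dict)


# Hypothetical sample data: two records that share one context.
author = Record("person/1", {"name": "Ada"})
book = Record("book/7", {"title": "Notes", "author": author.key})  # pointer by identity

# Following the pointer means looking the identity up in the context.
context = {r.key: r for r in (author, book)}
resolved = context[book.properties["author"]]
```

Storing the pointer as a key rather than as an object reference keeps the record self-describing: it can be serialized and resolved later by whichever context holds that identity.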
  5. From Record to Resource
     • A resource is an abstract entity that is both unique and addressable.
     • A representation of that resource is a (potentially structured) bag of properties that describe characteristics of that resource.
     • If that resource is part of a collection, then the representation of that resource is a record in that collection.
     • Note that the record is NOT the resource – it is only a description of the state of the resource at the moment that it is queried.
  6. Addressability
     • A resource is addressable if, for a given collection, there is a key that can be used to retrieve a representation of the resource from the collection.
     • A collection, in turn, can be thought of as the context or namespace of all addresses within that collection.
     • Conversely, if there is no collection for which a key exists to retrieve a resource, then it is not a resource.
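One way to picture addressability is a collection that hands out representations by key. The sketch below is a minimal illustration under assumed names (`Collection`, `put`, `get`, `is_addressable`, and the invoice data are all hypothetical), not an API from the talk.

```python
from typing import Dict, Optional


class Collection:
    """A namespace of keys; each key retrieves one resource representation."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._items: Dict[str, dict] = {}

    def put(self, key: str, representation: dict) -> None:
        self._items[key] = representation

    def get(self, key: str) -> Optional[dict]:
        # Return a copy: the caller receives a snapshot of the state,
        # not the resource itself.
        item = self._items.get(key)
        return dict(item) if item is not None else None

    def is_addressable(self, key: str) -> bool:
        return key in self._items


invoices = Collection("invoices")
invoices.put("2011-001", {"total": 120.0})
```

Here `invoices.get("2011-001")` returns a representation, while a key with no entry in any collection retrieves nothing, which is the slide's sense of "not a resource".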
  7. Resources and Time
     • Most resources are state machines – they change their (internal) state over time.
     • Depending upon the requested representation format, a record may be:
       – the most recent representation of a resource,
       – the delta of changes from the last request,
       – or a log of all changes.
     • Through 2000 or so, most resources changed slowly. That is no longer necessarily true.
     • This means that all resources are services.
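The three kinds of record named on this slide (latest state, delta, full log) can all be derived from a single change log. The following is a sketch under assumptions: the `VersionedResource` class, its method names, and the sensor example are hypothetical.

```python
from typing import Any, Dict, List, Tuple


class VersionedResource:
    """A resource whose state changes over time, kept as an append-only change log."""

    def __init__(self) -> None:
        self._log: List[Tuple[str, Any]] = []   # (property, new value) in order of change

    def update(self, prop: str, value: Any) -> int:
        self._log.append((prop, value))
        return len(self._log)                   # version number after this change

    def latest(self) -> Dict[str, Any]:
        """Most recent state: replay every change in order."""
        state: Dict[str, Any] = {}
        for prop, value in self._log:
            state[prop] = value
        return state

    def delta(self, since_version: int) -> Dict[str, Any]:
        """Only the properties changed after `since_version`."""
        return dict(self._log[since_version:])

    def log(self) -> List[Tuple[str, Any]]:
        """Full history of changes."""
        return list(self._log)


sensor = VersionedResource()
sensor.update("status", "idle")
seen = sensor.update("temp", 20)      # a client last read the resource at this version
sensor.update("temp", 23)
```

A client that remembers `seen` can ask for `sensor.delta(seen)` and receive only what changed since its last request, instead of re-reading the whole state.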
  8. BIG DATA and REST
     • REST – Representational State Transfer – is becoming the primary services representation.
     • The period from 2005 to 2020 will be concerned with making previously non-addressable data RESTful.
     • REST imputes CRUD (Create, Read, Update, Delete) semantics to addressable networks.
     • REST also implies that collections are resources.
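As a rough illustration of CRUD semantics over addressable data, the sketch below dispatches HTTP-style requests against an in-memory store. The method-to-operation mapping (PUT for create or update, GET for read, DELETE for delete) is a common convention assumed here, not something the slides specify, and `handle`, `store`, and the `/books` paths are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple

# collection name -> {key: representation}
store: Dict[str, Dict[str, Any]] = {"books": {}}


def handle(method: str, path: str, body: Optional[dict] = None) -> Tuple[int, Any]:
    """Dispatch an HTTP-style request to a CRUD operation; returns (status, payload)."""
    parts = path.strip("/").split("/")
    collection = store.get(parts[0])
    if collection is None:
        return 404, None
    if len(parts) == 1:
        # The collection itself is a resource: reading it lists its member keys.
        if method == "GET":
            return 200, sorted(collection)
        return 405, None
    key = parts[1]
    if method == "PUT":                      # Create or Update
        created = key not in collection
        collection[key] = dict(body or {})
        return (201 if created else 200), collection[key]
    if key not in collection:
        return 404, None
    if method == "GET":                      # Read
        return 200, collection[key]
    if method == "DELETE":                   # Delete
        del collection[key]
        return 204, None
    return 405, None
```

For example, `handle("PUT", "/books/1", {"title": "A"})` creates a member, and `handle("GET", "/books")` then reads the collection as a resource in its own right.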
  9. Representations
     • A collection processor (aka a server) is an abstraction layer between internal and external data representations.
     • Any time a request is made, there is almost always a transformation that maps between an internal entity and the requested output.
     • Representations do not have to be (and most times are not) in the same format as the underlying resource – a resource stored internally as XML can be rendered as HTML, JSON, zipped files, graphics, PDF, etc.
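The mapping from one internal form to several requested formats might look like the sketch below, which uses only the Python standard library. The sample XML, the `as_dict` and `render` helpers, and the choice of a definition list for HTML are assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical internal representation of one resource.
INTERNAL = "<book id='7'><title>Data &amp; Meaning</title><year>2011</year></book>"


def as_dict(xml_text: str) -> dict:
    """Flatten the internal XML into a plain mapping of property names to text."""
    root = ET.fromstring(xml_text)
    data = {child.tag: child.text for child in root}
    data["id"] = root.get("id")
    return data


def render(xml_text: str, fmt: str) -> str:
    """Transform the internal XML into the requested external representation."""
    data = as_dict(xml_text)
    if fmt == "json":
        return json.dumps(data, sort_keys=True)
    if fmt == "html":
        rows = "".join(
            f"<dt>{escape(k)}</dt><dd>{escape(v)}</dd>" for k, v in sorted(data.items())
        )
        return f"<dl>{rows}</dl>"
    raise ValueError(f"unsupported format: {fmt}")
```

Both `render(INTERNAL, "json")` and `render(INTERNAL, "html")` describe the same resource; only the transformation applied at request time differs.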
  10. Collections
     • A collection establishes a constrained set of resources.
     • A collection can also be thought of as a category, with each resource key as a term in the category's taxonomy.
     • A resource may belong to more than one collection.
     • Collections determine available representations for resources.
  11. Collections and Search
     • A search is the invocation of a parametric function on a collection that returns a set of resource representations (typically with links to more extensive representations).
     • It is possible for a collection to have multiple search functions bound to that collection – in that regard a search function may itself be a resource in a different collection.
     • Search is the bridge between addressable RESTful retrieval and imperative web services.
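A search in this sense can be sketched as a function bound to one collection that takes parameters and returns short representations with links. Everything named below (`make_search`, the `href` field, the `/books` base URL, and the sample books) is hypothetical and exists only to make the idea concrete.

```python
from typing import Callable, Dict, List

Collection = Dict[str, dict]   # key -> full representation


def make_search(collection: Collection, base_url: str) -> Callable[..., List[dict]]:
    """Bind a parametric search function to one collection."""

    def search(**criteria) -> List[dict]:
        hits = []
        for key, rep in sorted(collection.items()):
            if all(rep.get(name) == value for name, value in criteria.items()):
                # Return a short representation plus a link to the full one.
                hits.append({"title": rep.get("title"), "href": f"{base_url}/{key}"})
        return hits

    return search


books = {
    "1": {"title": "XML Basics", "topic": "xml", "year": 2009},
    "2": {"title": "Linked Data", "topic": "rdf", "year": 2011},
    "3": {"title": "XQuery Notes", "topic": "xml", "year": 2011},
}
search_books = make_search(books, "/books")
```

Calling `search_books(topic="xml", year=2011)` returns summaries whose `href` values are ordinary addressable keys, so the result of an imperative call leads straight back to RESTful retrieval.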
  12. Resources and Data Models
     • A resource is an abstraction of a physical entity or process, and as such it is, itself, conformant to a data model.
     • A representation of a resource is a transformation of that resource within the context of its collection.
     • This means that relationships that are only inferred within the internal model may be made explicit within the representation.
  13. BIG DATA Stage 1: Hadoop
     • Hadoop can take context-poor or legacy structured data and create from it contextually richer data records, which can in turn inform the development of resources.
     • While Hadoop can be used for performing queries, these will be high-latency searches compared to most other systems.
     • Hadoop's real value comes from its ability to process data into more structurally queryable or manipulable forms.
  14. BIG DATA Stage 2: SQL
     • SQL relational databases work best with single-dimensional views based upon primary/foreign key relationships, ideal for many data models that have relatively rigid structure but somewhat richer semantics.
     • SQL is still ideal for many forms of real-world data processing, but generally has both non-standardized streaming mechanisms and constrained procedure semantics.
  15. BIG DATA Stage 3: Hash Databases
     • MongoDB and similar tools extend hash tables to represent more complex data models, as the combination of hashes and sequences can readily represent objects that have the same core namespace context.
     • These are becoming popular as producers and consumers of JSON, and typically employ a JavaScript stack for most transactions.
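The claim that hashes plus sequences can represent whole objects is easy to see with plain Python dicts and lists, which map directly onto JSON. The sketch below uses only the standard `json` module rather than any database driver, and the order document is invented for illustration.

```python
import json

# Hypothetical document: nested hashes (dicts) and sequences (lists).
order = {
    "_id": "order-42",
    "customer": {"name": "Ada", "city": "Plano"},
    "items": [
        {"sku": "A1", "qty": 2},
        {"sku": "B7", "qty": 1},
    ],
}

wire = json.dumps(order)          # serialize for a JSON consumer
restored = json.loads(wire)       # parse back into hashes and sequences
total_qty = sum(item["qty"] for item in restored["items"])
```

The document survives the round trip unchanged, which is why the same structure can serve as both the stored form and the wire format.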
  16. BIG DATA Stage 4: XML Databases
     • XML Databases store XML representations of objects, which best encode narrative structures, hierarchical entities that cross namespace boundaries, and provide very sophisticated tools for building RESTful web application services.
     • XML Databases are optimal for documents and hybrid document/data structures.
     • XML Databases are standardizing upon XQuery as the common stack language.
  17. BIG DATA Stage 5: RDF Triple Stores
     • Triple stores encode relationships between resources via RDF n-tuples for query by SPARQL.
     • Triple stores work best with distributed data where relationships between resources are as important as, or more important than, the actual contents of the resources themselves.
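To make the relationship-first view concrete, here is a toy in-memory triple store with wildcard pattern matching. It is a pure-Python stand-in for the idea, not SPARQL and not a real RDF library; the `match` function, the `ex:` identifiers, and the data are all hypothetical.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data: (subject, predicate, object) statements.
triples: Set[Triple] = {
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:wrote", "ex:book1"),
    ("ex:book1", "ex:topic", "ex:xml"),
    ("ex:acme", "ex:locatedIn", "ex:plano"),
}


def match(s: Optional[str] = None, p: Optional[str] = None,
          o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples matching the pattern; None acts as a wildcard, like a query variable."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


# Join two patterns: which topics has ex:alice written about?
topics = sorted(
    topic
    for _, _, work in match("ex:alice", "ex:wrote")
    for _, _, topic in match(work, "ex:topic")
)
```

The answer comes entirely from following relationships between identifiers; no resource's internal content is inspected, which is the case the slide says triple stores handle best.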
  18. Complementary Technologies
     • There is a tendency to see NoSQL technologies as competitive. This is both wrong and dangerous.
     • The technologies have evolved to handle different phases of data acquisition, processing and search, going from data with poor semantics and rigid structure to data with rich semantics and flexible structure.
  19. Generation 5 Data Stores
     • The next wave of databases will likely incorporate two or more generation 4 technologies – such as combining an XML or JSON database with an RDF triple store to be able to handle inferences about stored content, or by adding Hadoop processing support to a SQL database capable of providing RESTful representations in XML or JSON of internal views built from SQL tables.
  20. Query Unification
     • The central challenge of the next twenty years will be in finding the commonalities across these various platforms in order to build a twenty-first-century SQL.
     • This will require a deeper understanding of the characteristics of data algebras, which provide the underpinnings for the optimal expression of data structures.
  21. XQuery?
     • XQuery may be a good candidate to handle this unification, for several reasons:
       – Solid extensibility model
       – SQL-like syntax, but capable of handling RESTful services inbound and outbound
       – XQuery 3.0 supports maps, which are hash/sequence constructs ideal for encoding JSON or MongoDB structures.
       – Works well in a distributed context.
       – Can integrate SPARQL or SQL scripts readily.
       – Standards based, and written by SQL author
  22. Contact Us
     Avalon Consulting, LLC

     Dallas Office - HQ
     5600 Tennyson Parkway
     Suite 230
     Plano, TX 75024
     469-424-3449

     Washington, DC Office
     527 Maple Avenue East
     Suite 200
     Vienna, VA 22180
     703-635-3302