The CIO's Guide to NoSQL


Published in: Technology

  1. 1. The CIO's Guide to NoSQL<br />Dan McCreary<br />July 2011<br />Version 5<br />
  2. 2. Agenda<br />Historical Context<br />The Business Case for NoSQL<br />Terminology<br />How NoSQL is Different<br />Key NoSQL Products<br />Call to Action: The NoSQL Pilot Project<br />The Future of NoSQL<br />Copyright Kelly-McCreary & Associates, LLC<br />2<br />
  3. 3. Background for Dan McCreary<br />Bell Labs<br />NeXT Computer (Steve Jobs)<br />Owner of Custom Object-Oriented Software Consultancy<br />Federal data integration (National Information Exchange Model)<br />Native XML/XQuery – 2006<br />Advocate of NoSQL/XRX systems<br />
  4. 4. NoSQL Training Areas<br />Track: Managers<br />Courses: The CIO's Guide to NoSQL (You Are Here), Project Manager's Guide to NoSQL<br />Track: Architects/Project Managers<br />Courses: Transitioning to NoSQL, Architectural Tradeoff Modeling<br />Track: Developer<br />Courses: XQuery, MapReduce, Hadoop, Functional Programming<br />
  5. 5. Sample of NoSQL Jargon<br />Document orientation<br />Schema free<br />MapReduce<br />Horizontal scaling<br />Sharding and auto-sharding<br />Brewer's CAP Theorem<br />Consistency<br />Reliability<br />Partition tolerance<br />Single-point-of-failure<br />Object-Relational mapping<br />Key-value stores<br />Column stores<br />Document-stores<br />Memcached<br />Indexing<br />B-Tree<br />Configurable durability<br />Documents for archives<br />Functional programming<br />Document Transformation<br />Document Indexing and Search<br />Alternate Query Languages<br />Aggregates<br />OLAP<br />XQuery<br />MDX<br />RDF<br />SPARQL<br />Architecture Tradeoff Modeling<br />ATAM<br />Note that within the context of NoSQL many of these terms have different meanings!<br />
  6. 6. Selecting a Database…<br />"Selecting the right data storage solution is no longer a trivial task."<br />Start<br />Does it look like a document?<br />Yes: Use Microsoft Office<br />No: Use the RDBMS<br />Stop<br />
  7. 7. Pressures on SQL-Only Systems<br />Scalability<br />Large Data Sets<br />Reliability<br />SQL<br />Social Networks<br />OLAP/BI/DataWarehouse<br />Linked Data<br />Document-Data<br />Agile<br />Schema Free<br />
  8. 8. Simplicity is a Virtue<br />Many systems derive their strength by dramatically limiting the features in their system<br />Simplicity allows database designers to focus on the primary business driver<br />Examples:<br />Touch screen interfaces<br />Key/Value data stores<br />
  9. 9. Historical Context<br />Mainframe Era:<br />1 CPU<br />COBOL and FORTRAN<br />Punchcards and flat files<br />$10,000 per CPU hour<br />Commodity Processors:<br />10,000 CPUs<br />Functional programming<br />MapReduce "farms"<br />Pennies per CPU hour<br />
  10. 10. Two Approaches to Computation<br />Copyright 2010 Dan McCreary & Associates<br />1930s and 40s<br />John von Neumann: Manage state with a program counter.<br />Alonzo Church: Make computations act like math functions.<br />Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?<br />
  11. 11. Standard vs. MapReduce Prices<br />John's Way<br />Alonzo's Way<br />
  12. 12. MapReduce CPUs Cost Less!<br />82% Cost Reduction!<br />Cuts cost from 32 to 6 cents per CPU hour!<br />Perhaps Alonzo was right!<br />Why? (hint: how "shareable" is this process)<br />
  13. 13. Perspectives<br />Object Stores<br />OLAP<br />MDX<br />Native XML<br />NoSQL for Web 2.0 and BigData<br />Graph Stores<br />Perspective depends on your context<br />
  14. 14. Architectural Tradeoffs<br />"I want a fast car with good mileage."<br />"I want a scalable database with low cost that runs well on the 1,000 CPUs in our data center."<br />
  15. 15. Recent History<br />The term NoSQL was re-popularized around 2009<br />Used for conferences of advocates of non-relational databases<br />Became a contagious idea, or "meme"<br />First of many "NoSQL meetups" in San Francisco organized by Johan Oskarsson<br />Conversion from "No SQL" to "Not Only SQL" in recent years<br />
  16. 16. NoSQL on Google Trends<br />
  17. 17. NoSQL and Web 2.0 Startups<br />Many Web 2.0 startups did not use Oracle or MySQL<br />They built their own data stores, influenced by Amazon’s Dynamo and Google’s BigTable, in order to store and process huge amounts of data<br />Most of these data stores, built for social community or cloud computing applications, became open source software<br />
  18. 18. Google MapReduce<br />2004 paper that had a huge impact on the use of functional programming across the entire community<br />Copied by many organizations, including Yahoo<br />
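
To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases applied to word counting. It is an illustration only, not Google's or Hadoop's implementation; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word; no shared state is touched."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key; each key is independent."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in one process (a real cluster spreads them out)."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["NoSQL is not only SQL", "SQL is not going away"])
```

Because every `map_phase` call and every `reduce_phase` call depends only on its own input, the work could in principle be handed to many machines, which is the property the slides keep returning to.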
  19. 19. Google Bigtable Paper<br />2006 paper that gave focus to scalable databases<br />Designed to reliably scale to petabytes of data and thousands of machines<br />
  20. 20. Amazon's Dynamo Paper<br />Werner Vogels<br />CTO -<br />October 2, 2007<br />Used to power Amazon's S3 service<br />One of the most influential papers in the NoSQL movement<br />Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.<br />
  21. 21. NoSQL "Meetups"<br />“NoSQLers came to share how they had overthrown the tyranny of slow, expensive relational databases in favor of more efficient and cheaper ways of managing data.”<br />Computerworld magazine, July 1st, 2009<br />
  22. 22. Key Motivators<br />Licensing RDBMS on multiple CPUs<br />The Three "V"s<br />Velocity – lots of data arriving fast<br />Volume – web-scale BigData<br />Variability – many exceptions<br />Desire to escape rigid schema design<br />Avoidance of complex Object-Relational Mapping (the "Vietnam" of computer science)<br />
  23. 23. Many Processes Today Are Driven By…<br />The constraints of yesterday…<br />Copyright 2008 Dan McCreary & Associates<br />Challenge:<br />Ask ourselves the question…<br />Does our current method of solving problems with tabular data…<br />Reflect the storage of the 1950s…<br />Or our actual business requirements?<br />What structures best solve the actual business problem?<br />
  24. 24. No-Shredding!<br />My Data<br />Relational databases take a single hierarchical document and shred it into many pieces so it will fit in tabular structures<br />Document stores prevent this shredding<br />
  25. 25. Is Shredding Really Necessary?<br />Every time you take hierarchical data and put it into a traditional database, you have to put repeating groups in separate tables and use SQL “joins” to reassemble the data<br />
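
The shredding described in the two slides above can be sketched side by side with a document-style store. This is a minimal illustration using Python's built-in `sqlite3` and `json` modules; the `order` document, the table names and the helper functions are all hypothetical, and the "document store" here is just a dictionary of JSON strings standing in for a real product.

```python
import json
import sqlite3

# Hypothetical hierarchical document: one order with a repeating group of items.
order = {
    "order_id": 1,
    "customer": "Ann",
    "items": [
        {"sku": "A-100", "qty": 2},
        {"sku": "B-200", "qty": 1},
    ],
}


def store_shredded(conn, doc):
    """Shred the document: the repeating group goes into its own table."""
    conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
    conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")
    conn.execute("INSERT INTO orders VALUES (?, ?)", (doc["order_id"], doc["customer"]))
    conn.executemany(
        "INSERT INTO order_items VALUES (?, ?, ?)",
        [(doc["order_id"], item["sku"], item["qty"]) for item in doc["items"]],
    )


def load_shredded(conn, order_id):
    """Reassemble the document with a SQL join."""
    rows = conn.execute(
        "SELECT o.order_id, o.customer, i.sku, i.qty "
        "FROM orders o JOIN order_items i ON o.order_id = i.order_id "
        "WHERE o.order_id = ? ORDER BY i.rowid",
        (order_id,),
    ).fetchall()
    return {
        "order_id": rows[0][0],
        "customer": rows[0][1],
        "items": [{"sku": sku, "qty": qty} for _, _, sku, qty in rows],
    }


def store_document(store, doc):
    """Document style: keep the whole hierarchy together under one key."""
    store[doc["order_id"]] = json.dumps(doc)


def load_document(store, order_id):
    """No join needed: the document comes back in one piece."""
    return json.loads(store[order_id])


conn = sqlite3.connect(":memory:")
store_shredded(conn, order)
doc_store = {}
store_document(doc_store, order)
```

Both paths return the same document; the difference is how much mapping code sits between the application and the storage layer.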
  26. 26. Object Relational Mapping<br />Web Browser<br />Object Middle Tier<br />Relational Database<br />T1 – HTML into Objects<br />T2 – Objects into SQL Tables<br />T3 – Tables into Objects<br />T4 – Objects into HTML<br />
  27. 27. "The Vietnam of Applications"<br />Object-relational mapping has become one of the most complex components of building applications today<br />A "Quagmire" where many projects get lost<br />Many "heroic efforts" have been made to solve the problem:<br />Hibernate<br />Ruby on Rails<br />But sometimes the way to avoid complexity is to keep your architecture very simple<br />
  28. 28. Document Stores Need No Translation<br />Application Layer: documents in the application<br />Database: documents in the database<br />No object middle tier<br />No "shredding"<br />No reassembly<br />Simple!<br />
  29. 29. Zero Translation (XML)<br />XRX Web Application Architecture<br />XML lives in the web browser (XForms)<br />REST interfaces<br />XML in the database (Native XML, XQuery)<br />No translation!<br />
  30. 30. "Schema Free"<br />Systems that automatically determine how to index data as the data is loaded into the database<br />No a priori knowledge of data structure<br />No need for up-front logical data modeling<br />…but some modeling is still critical<br />Adding new data elements or changing data elements is not disruptive<br />Searching millions of records still has sub-second response time<br />
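
One way to picture "index as the data is loaded" is an index keyed by every path and value found in each incoming document. The sketch below is an assumption about how such a system might work in miniature, not how any particular product does it; `SchemaFreeIndex` and its methods are hypothetical names.

```python
from collections import defaultdict


class SchemaFreeIndex:
    """Toy store that indexes every (path, value) pair it sees at load time."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)  # (path, value) -> set of document ids

    def _walk(self, node, path=""):
        """Yield (path, leaf value) for every leaf in a nested dict/list."""
        if isinstance(node, dict):
            for key, value in node.items():
                yield from self._walk(value, f"{path}/{key}")
        elif isinstance(node, list):
            for item in node:
                yield from self._walk(item, path)
        else:
            yield path, node

    def load(self, doc_id, doc):
        """Store the document and index it; no schema is declared up front."""
        self.docs[doc_id] = doc
        for path, value in self._walk(doc):
            self.index[(path, value)].add(doc_id)

    def find(self, path, value):
        """Return ids of documents with this value at this path."""
        return sorted(self.index.get((path, value), set()))


idx = SchemaFreeIndex()
idx.load(1, {"name": "Ann", "tags": ["xml", "nosql"]})
# A later document adds a new element ("address") without any schema change.
idx.load(2, {"name": "Bob", "tags": ["nosql"], "address": {"city": "Oslo"}})
```

Here `idx.find("/tags", "nosql")` matches both documents and `idx.find("/address/city", "Oslo")` matches only the second, even though the `address` element was never declared anywhere.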
  31. 31. Monoculture and Mono-architecture<br />Image Source: Wikipedia<br />
  32. 32. Eric Evans<br />“The whole point of seeking alternatives [to RDBMS systems] is that you need to solve a problem that relational databases are a bad fit for.”<br />Eric Evans<br />Rackspace<br />
  33. 33. Evolution of Ideas in OpenSource<br />New Products<br />New Database Ideas<br />Proprietary Software<br />Product A<br />OpenSource<br />Schema-free<br />Product B<br />Product B<br />MapReduce<br />Auto-sharding<br />Cloud Computing<br />How quickly can new ideas be recombined into new database products?<br />OpenSource software has proved to be the most efficient way to quickly recombine new ideas into new products<br />
  34. 34. Storage Architectural Patterns<br />Tables<br />Trees<br />Stars<br />Triples<br />
  35. 35. Finding the Right Match<br />Schema-Free<br />Standards Compliant<br />Mature Query Language<br />Use CMU's Architecture Tradeoff Analysis Method (ATAM) process<br />
  36. 36. Brewer's CAP Theorem<br />Consistency<br />Availability<br />Partition Tolerance<br />You cannot have all three, so pick two!<br />
  37. 37. Avoidance of Unneeded Complexity<br />Relational databases provide a variety of features to ALWAYS support strict data consistency<br />The rich feature set and the ACID properties implemented by RDBMSs might be more than necessary for particular applications and use cases<br />
  38. 38. High Throughput<br />Some NoSQL databases provide significantly higher data throughput than traditional RDBMSs<br />Hypertable, which pursues Google’s Bigtable approach, allows the local search engine Zvents to store one billion data cells per day<br />Google is able to process 20 petabytes a day stored in BigTable via its MapReduce approach<br />
  39. 39. Complexity and Cost of Setting up Database Clusters<br />NoSQL databases are designed in a way that “PC clusters can be easily and cheaply expanded without the complexity and cost of ’sharding,’ which involves cutting up databases into multiple tables to run on large clusters or grids”.<br />Nati Shalom, CTO and founder of GigaSpaces<br />
  40. 40. Compromising Reliability for Better Performance<br />Shalom argues that there are “different scenarios where applications would be willing to compromise reliability for better performance.”<br />Performance over reliability<br />Example: HTTP session data<br />“needs to be shared between various web servers but since the data is transient in nature (it goes away when the user logs off) there is no need to store it in persistent storage.”<br />
  41. 41. "One Size Fits…"<br />"One Size Does Not Fit All"<br />James Hamilton Nov. 3rd, 2009<br />,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx<br />
  42. 42. Different Thinking<br />Sequential Processing:<br />The output of any step can be used in the next step<br />State must be carefully managed<br />Parallel Processing:<br />Each loop of an XQuery FLWOR statement is an independent thread (no side effects)<br />
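
The contrast between the two styles can be sketched in a few lines. The slide's example is XQuery FLWOR; the Python below is only an analogy, and the function names are hypothetical. Threads are used here to show that the iterations are independent, not to claim a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def running_total(values):
    """Sequential style: each step reads state left behind by the previous one."""
    total = 0
    totals = []
    for v in values:
        total += v  # depends on every earlier iteration
        totals.append(total)
    return totals


def square(v):
    """Pure function: the result depends only on the argument (no side effects)."""
    return v * v


def parallel_squares(values, workers=4):
    """Parallel style: iterations share no state, so any worker can take any item."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))  # map() keeps the input order


sequential = running_total([1, 2, 3, 4])
parallel = parallel_squares([1, 2, 3, 4])
```

`running_total` cannot be split across workers without coordinating the shared `total`, while `parallel_squares` can be split freely because `square` touches nothing outside its argument.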
  43. 43. Cloud Computing<br />High scalability<br />Especially in the horizontal direction (multiple CPUs)<br />Low administration overhead<br />Simple web page administration<br />
  44. 44. Databases work well in the cloud<br />Data-warehousing-specific databases for batch data processing and map/reduce operations<br />Simple, scalable and fast key/value stores<br />Databases with a richer feature set than key/value stores, filling the gap to traditional RDBMSs while offering good performance and scalability (such as document databases)<br />
  45. 45. Auto-Sharding<br />When one database gets almost full it tells a "coordinator" system and the data automatically gets migrated to other systems<br />Before: one system 90% full<br />After: two systems, each 45% full<br />
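
The before and after picture on this slide can be sketched as a toy coordinator that splits a shard once it reaches 90% of its capacity. This is an assumption about one possible scheme (range-based splitting at the median key); real products differ, and the `Coordinator` class, its parameters and its methods are hypothetical.

```python
import bisect


class Coordinator:
    """Toy range-sharded store that splits a shard when it is nearly full."""

    def __init__(self, shard_capacity=20, threshold_percent=90):
        self.shard_capacity = shard_capacity
        self.threshold_percent = threshold_percent
        self.lower_bounds = [""]  # smallest key each shard is responsible for
        self.shards = [{}]        # one dict per shard, parallel to lower_bounds

    def _index_for(self, key):
        """Find the shard whose key range contains this (string) key."""
        return bisect.bisect_right(self.lower_bounds, key) - 1

    def put(self, key, value):
        i = self._index_for(key)
        self.shards[i][key] = value
        # Integer arithmetic avoids floating-point surprises at the threshold.
        if len(self.shards[i]) * 100 >= self.threshold_percent * self.shard_capacity:
            self._split(i)

    def get(self, key):
        return self.shards[self._index_for(key)].get(key)

    def _split(self, i):
        """Migrate the upper half of a nearly full shard to a new shard."""
        keys = sorted(self.shards[i])
        middle = keys[len(keys) // 2]
        upper = {k: v for k, v in self.shards[i].items() if k >= middle}
        self.shards[i] = {k: v for k, v in self.shards[i].items() if k < middle}
        self.lower_bounds.insert(i + 1, middle)
        self.shards.insert(i + 1, upper)


coordinator = Coordinator(shard_capacity=20)
for n in range(18):  # the 18th key brings the only shard to 90% full
    coordinator.put(f"user{n:02d}", n)
```

After the loop the single shard has been split into two shards holding 9 keys each, which is 45% of the assumed capacity of 20, and lookups are routed to the right shard without the caller knowing a split happened.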
  46. 46. Scale Up vs. Scale Out<br />Scale Up:<br />Make a single CPU as fast as possible<br />Increase clock speed<br />Add RAM<br />Make disk I/O go faster<br />Scale Out:<br />Make many CPUs work together<br />Learn how to divide your problems into independent threads<br />
  47. 47. Functional Programming<br />What does it mean to your IT staff?<br />What experience do they have in functional programming?<br />Can they "unlearn" the habits of the procedural world?<br />
  48. 48. The NO-SQL Universe<br />Document Stores<br />Key-Value Stores<br />XML<br />Graph Stores<br />Object Stores<br />Column Stores<br />
  49. 49. Key Value Stores<br />A table with two columns (Key, Value) and a simple interface<br />Add a key-value pair<br />For this key, give me the value<br />Delete a key<br />Blazingly fast and easy to scale<br />
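
The three-operation interface listed on this slide is small enough to sketch in full. The class below is a hypothetical in-memory stand-in, not the API of any product named in this deck.

```python
class KeyValueStore:
    """Minimal in-memory key-value store with the three operations from the slide."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        """Add a key-value pair (or overwrite the value for an existing key)."""
        self._data[key] = value

    def get(self, key, default=None):
        """For this key, give me the value (or a default if the key is absent)."""
        return self._data.get(key, default)

    def delete(self, key):
        """Delete a key; deleting a missing key is not an error in this sketch."""
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", {"user": "ann", "cart": ["A-100"]})
value = store.get("session:42")
store.delete("session:42")
```

The store never inspects the value, which is one reason such systems are easy to partition: every operation names exactly one key.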
  50. 50. Types of Key-Value Stores<br />Eventually-Consistent Key-Value Stores<br />Hierarchical Key-Value Stores<br />Key-Value Stores in RAM<br />Key-Value Stores on Disk<br />Ordered Key-Value Stores<br />
  51. 51. Cassandra<br />Apache open source project<br />Originally developed by Facebook<br />Designed for highly distributed, highly reliable systems<br />No single point of failure<br />Column-family data model<br />
  52. 52. Voldemort<br />A distributed key-value system<br />Used at LinkedIn<br />10K-20K node operations/CPU<br />Auto-sharding<br />Graceful server failure handling<br />
  53. 53. MongoDB<br />Open Source License<br />Document/Collection centric<br />Sharding built-in, automatic<br />Stores data in JSON format<br />Query language is JSON<br />Can be 10x faster than MySQL<br />Many languages (C++, JavaScript, Java, Perl, Python etc.)<br />
  54. 54. Hadoop/HBase<br />Open source implementation of the MapReduce algorithm written in Java<br />Early development driven largely by Yahoo<br />300 person-years development<br />Column-oriented data store<br />Java interface<br />HBase designed specifically to work with Hadoop<br />
  55. 55. CouchDB<br />Apache Document Store<br />Written in Erlang<br />RESTful JSON API<br />Distributed, featuring robust, incremental replication with bi-directional conflict detection and management<br />
  56. 56. Memcached<br />Free & open source in-memory caching system<br />Designed to speed up dynamic web applications by alleviating database load<br />RAM-resident key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering<br />Simple interface<br />Designed for quick deployment, ease of development<br />APIs in many languages<br />
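
"Alleviating database load" usually means checking the cache first and going to the database only on a miss. The sketch below shows that pattern with a hypothetical in-process `TinyCache` class that expires entries after a time-to-live; it is a stand-in for illustration and does not use or imitate the real Memcached client API.

```python
import time


class TinyCache:
    """In-process stand-in for a RAM-resident cache with expiring entries."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be demonstrated
        self._items = {}     # key -> (value, expiry time)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # stale entry: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)


def cached_lookup(cache, key, load_from_database):
    """Check the cache first; fall back to the slow loader only on a miss."""
    value = cache.get(key)
    if value is None:
        value = load_from_database(key)
        cache.set(key, value)
    return value


database_calls = []


def slow_database_lookup(key):
    """Hypothetical expensive call; records how often it is actually hit."""
    database_calls.append(key)
    return key.upper()


cache = TinyCache()
first = cached_lookup(cache, "page:home", slow_database_lookup)
second = cached_lookup(cache, "page:home", slow_database_lookup)
```

The second lookup is served from memory, so `database_calls` contains a single entry. Losing the cache loses no data, only speed, which is why this kind of store can trade durability for performance.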
  57. 57. MarkLogic<br />Native XML database designed to be used for petabyte-scale data stores<br />ACID compliant<br />Heavy use by federal agencies and document publishers, and for "high-variability" data<br />Arguably the most successful NoSQL company<br />
  58. 58. eXist<br />OpenSource native XML database<br />Strong support for XQuery and XQuery extensions<br />Heavily used by the Text Encoding Initiative (TEI) community and XRX/XForms communities<br />Ideal for metadata management<br />Integrated Lucene search and structured search<br />
  59. 59. Riak<br />Community and Commercial licenses<br />A "Dynamo-inspired" database<br />Written in Erlang<br />Query with JSON or Erlang<br />
  60. 60. Hypertable<br />Open Source<br />Closely modeled after Google's Bigtable project<br />High performance distributed data storage system<br />Designed to support applications requiring maximum performance, scalability, and reliability<br />Hypertable Query Language (HQL) that is syntactically similar to SQL<br />
  61. 61. Selecting a NoSQL Pilot Project<br />The "Goldilocks Pilot Project Strategy"<br />Not too big, not too small, just the right size<br />Duration<br />Sponsorship<br />Importance<br />Skills<br />Mentorship<br />
  62. 62. The Future of the NoSQL Movement<br />Will data sets continue to grow at exponential rates?<br />Will new system options become more diverse?<br />Will new markets have different demands?<br />Will some ideas be "absorbed" into existing RDBMS vendors' products?<br />Will the NoSQL community continue to be the place where new database ideas and products are incubated?<br />Will the job of doing high-quality architectural tradeoff analysis become easier?<br />Growth<br />Diversity<br />
  63. 63. Using the Wrong Architecture<br />Start<br />Finish<br />Credit: Isaac Homelund – MN Office of the Revisor<br />
  64. 64. Using the Right Architecture<br />Finish<br />Start<br />Find ways to remove barriers to empowering the non-programmers on your team.<br />
  65. 65. Questions<br />Dan McCreary<br />President, Kelly-McCreary & Associates<br />