
The CIO's Guide to NoSQL

Transcript

  • 1. The CIO's Guide to NoSQL
    Dan McCreary
    July 2011
    Version 5
  • 2. Agenda
    Historical Context
    The Business Case for NoSQL
    Terminology
    How NoSQL is Different
    Key NoSQL Products
    Call to Action: The NoSQL Pilot Project
    The Future of NoSQL
    Copyright Kelly-McCreary & Associates, LLC
  • 3. Background for Dan McCreary
    Bell Labs
    NeXT Computer (Steve Jobs)
    Owner of Custom Object-Oriented Software Consultancy
    Federal data integration (National Information Exchange Model)
    Native XML/XQuery – 2006
    Advocate of NoSQL/XRX systems
  • 4. NoSQL Training Areas
    Track: Managers
    Course: The CIO's Guide to NoSQL (You Are Here)
    Track: Architects/Project Managers
    Courses: Project Manager's Guide to NoSQL; Transitioning to NoSQL; Architectural Tradeoff Modeling
    Track: Developer
    Courses: XQuery; MapReduce; Hadoop; Functional Programming
  • 5. Sample of NoSQL Jargon
    Document orientation
    Schema free
    MapReduce
    Horizontal scaling
    Sharding and auto-sharding
    Brewer's CAP Theorem
    Consistency
    Reliability
    Partition tolerance
    Single-point-of-failure
    Object-Relational mapping
    Key-value stores
    Column stores
    Document-stores
    Memcached
    Indexing
    B-Tree
    Configurable durability
    Documents for archives
    Functional programming
    Document Transformation
    Document Indexing and Search
    Alternate Query Languages
    Aggregates
    OLAP
    XQuery
    MDX
    RDF
    SPARQL
    Architecture Tradeoff Modeling
    ATAM
    Note that within the context of NoSQL many of these terms have different meanings!
  • 6. Selecting a Database…
    "Selecting the right data storage solution is no longer a trivial task."
    Start
    Does it look like a document?
    Yes: Use Microsoft Office
    No: Use the RDBMS
    Stop
  • 7. Pressures on SQL Only Systems
    Scalability
    Large Data Sets
    Reliability
    SQL
    Social Networks
    OLAP/BI/DataWarehouse
    Linked Data
    Document-Data
    Agile
    Schema Free
  • 8. Simplicity is a Virtue
    Many systems derive their strength by dramatically limiting the features in their system
    Simplicity allows database designers to focus on the primary business driver
    Examples:
    Touch screen interfaces
    Key/Value data stores
  • 9. Historical Context
    Mainframe Era
    1 CPU
    COBOL and FORTRAN
    Punchcards and flat files
    $10,000 per CPU hour
    Commodity Processors
    10,000 CPUs
    Functional programming
    MapReduce "farms"
    Pennies per CPU hour
  • 10. Two Approaches to Computation
    Copyright 2010 Dan McCreary & Associates
    1930s and 40s
    John von Neumann: Manage state with a program counter.
    Alonzo Church: Make computations act like math functions.
    Which is simpler? Which is cheaper? Which will scale to 10,000 CPUs?
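    A minimal Python sketch of the two approaches (an illustration added here, not code from the deck; the function names are hypothetical). The first version manages state step by step in the von Neumann style; the second treats the same computation as a composition of math-like functions in the Church style.

```python
from functools import reduce


def total_with_state(values):
    """Von Neumann style: a loop mutates a running total step by step."""
    total = 0
    for v in values:
        total += v * v
    return total


def total_as_functions(values):
    """Church style: compose pure functions; no variable is ever updated."""
    return reduce(lambda a, b: a + b, map(lambda v: v * v, values), 0)
```

    Because the squaring step in the second version touches no shared state, each element could in principle be handed to a different CPU.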
  • 11. Standard vs. MapReduce Prices
    John's Way
    Alonzo's Way
    http://aws.amazon.com/elasticmapreduce/#pricing
  • 12. MapReduce CPUs Cost Less!
    82% Cost Reduction!
    Cuts cost from 32 to 6 cents per CPU hour!
    Perhaps Alonzo was right!
    Why? (Hint: how "shareable" is this process?)
    http://aws.amazon.com/elasticmapreduce/#pricing
  • 13. Perspectives
    Object Stores
    OLAP
    MDX
    Native XML
    NoSQL for Web 2.0 and BigData
    Graph Stores
    Perspective depends on your context
  • 14. Architectural Tradeoffs
    "I want a fast car with good mileage."
    "I want a scalable database with low cost that runs well on the 1,000 CPUs in our data center."
  • 15. Recent History
    The term NoSQL was re-popularized around 2009
    Used for conferences of advocates of non-relational databases
    Became a contagious idea (a "meme")
    First of many "NoSQL meetups" in San Francisco organized by Johan Oskarsson
    Shift in meaning from "No SQL" to "Not Only SQL" in the past year
  • 16. NoSQL on Google Trends
  • 17. NoSQL and Web 2.0 Startups
    Many Web 2.0 startups did not use Oracle or MySQL
    They built their own data stores, influenced by Amazon’s Dynamo and Google’s BigTable, in order to store and process huge amounts of data, such as the data found in social community or cloud computing applications
    Most of these data stores later became open source software
  • 18. Google MapReduce
    2004 paper that had a huge impact on the use of functional programming across the entire community
    Copied by many organizations, including Yahoo
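    The MapReduce idea can be sketched in a few lines of Python. This is a single-process illustration under simplifying assumptions, not Google's implementation or the Hadoop API; `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical names chosen for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

    Each call to `map_phase` and each call to `reduce_phase` is independent of the others, which is what lets a real framework spread them across many machines.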
  • 19. Google Bigtable Paper
    2006 paper that gave focus to scalable databases
    Designed to reliably scale to petabytes of data and thousands of machines
  • 20. Amazon's Dynamo Paper
    Werner Vogels
    CTO - Amazon.com
    October 2, 2007
    Used to power Amazon's S3 service
    One of the most influential papers in the NoSQL movement
    Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall and Werner Vogels, “Dynamo: Amazon's Highly Available Key-Value Store”, in the Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, October 2007.
  • 21. NoSQL "Meetups"
    “NoSQLers came to share how they had overthrown the tyranny of slow, expensive relational databases in favor of more efficient and cheaper ways of managing data.”
    Computerworld magazine, July 1st, 2009
  • 22. Key Motivators
    Licensing RDBMS on multiple CPUs
    The Three "V"s
    Velocity – lots of data arriving fast
    Volume – web-scale BigData
    Variability – many exceptions
    Desire to escape rigid schema design
    Avoidance of complex Object-Relational Mapping (the "Vietnam" of computer science)
  • 23. Many Processes Today Are Driven By… The constraints of yesterday…
    Copyright 2008 Dan McCreary & Associates
    Challenge:
    Ask ourselves the question…
    Do our current methods of solving problems with tabular data…
    Reflect the storage of the 1950s…
    Or our actual business requirements?
    What structures best solve the actual business problem?
  • 24. No-Shredding!
    My Data
    Relational databases take a single hierarchical document and shred it into many pieces so it will fit in tabular structures
    Document stores prevent this shredding
  • 25. Is Shredding Really Necessary?
    Every time you take hierarchical data and put it into a traditional database, you have to put repeating groups in separate tables and use SQL “joins” to reassemble the data
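    A small Python sketch of what shredding and reassembly involve. The purchase order, its field names and the helper functions are hypothetical, and plain lists of dictionaries stand in for relational tables.

```python
# A hypothetical purchase order held as one hierarchical document.
order = {
    "order_id": 1001,
    "customer": "Acme",
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def shred(doc):
    """Split the document into two flat 'tables' (lists of rows)."""
    order_rows = [{"order_id": doc["order_id"], "customer": doc["customer"]}]
    item_rows = [
        {"order_id": doc["order_id"], "sku": item["sku"], "qty": item["qty"]}
        for item in doc["items"]
    ]
    return order_rows, item_rows


def reassemble(order_rows, item_rows, order_id):
    """Emulate the join needed to rebuild the original document."""
    header = next(r for r in order_rows if r["order_id"] == order_id)
    items = [
        {"sku": r["sku"], "qty": r["qty"]}
        for r in item_rows
        if r["order_id"] == order_id
    ]
    return {**header, "items": items}
```

    A document store would keep `order` as it is, so neither `shred` nor `reassemble` would be needed.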
  • 26. Object Relational Mapping
    Web Browser
    Object Middle Tier
    Relational Database
    T1 – HTML into Objects
    T2 – Objects into SQL Tables
    T3 – Tables into Objects
    T4 – Objects into HTML
  • 27. "The Vietnam of Applications"
    Object-relational mapping has become one of the most complex components of building applications today
    A "Quagmire" where many projects get lost
    Many "heroic efforts" have been made to solve the problem:
    Hibernate
    Ruby on Rails
    But sometimes the way to avoid complexity is to keep your architecture very simple
  • 28. Document Stores Need No Translation
    Document
    Document
    Application Layer
    Database
    Documents in the database
    Documents in the application
    No object middle tier
    No "shredding"
    No reassembly
    Simple!
  • 29. Zero Translation (XML)
    REST-Interfaces
    XForms
    XML database
    Web Browser
    XML lives in the web browser (XForms)
    REST interfaces
    XML in the database (Native XML, XQuery)
    XRX Web Application Architecture
    No translation!
  • 30. "Schema Free"
    Systems that automatically determine how to index data as the data is loaded into the database
    No a priori knowledge of data structure
    No need for up-front logical data modeling
    …but some modeling is still critical
    Adding new data elements or changing data elements is not disruptive
    Searching millions of records still has sub-second response time
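    One way to picture "schema free" is a store that indexes whatever fields arrive at load time. The sketch below is a toy under stated assumptions (flat documents with hashable values, a hypothetical `SchemaFreeStore` class); it does not describe how any particular product builds its indexes.

```python
from collections import defaultdict


class SchemaFreeStore:
    """Toy document store that indexes every field it sees at load time."""

    def __init__(self):
        self.documents = {}
        # Maps (field name, value) to the set of document ids containing it.
        self.index = defaultdict(set)

    def load(self, doc_id, document):
        """Store a flat document and index each of its fields automatically."""
        self.documents[doc_id] = document
        for field, value in document.items():
            self.index[(field, value)].add(doc_id)

    def find(self, field, value):
        """Look up documents by any field without scanning every document."""
        return sorted(self.index.get((field, value), set()))
```

    A document with a brand-new field (for example a `twitter` handle on only some records) loads and becomes searchable without any change to a schema.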
  • 31. Monoculture and Mono-architecture
    Image Source: Wikipedia
  • 32. Eric Evans
    “The whole point of seeking alternatives [to RDBMS systems] is that you need to solve a problem that relational databases are a bad fit for.”
    Eric Evans
    Rackspace
  • 33. Evolution of Ideas in OpenSource
    New Products
    New Database Ideas
    Proprietary Software
    Product A
    OpenSource
    Schema-free
    Product B
    Product B
    MapReduce
    Auto-sharding
    Cloud Computing
    How quickly can new ideas be recombined into new database products?
    OpenSource software has proved to be the most efficient way to quickly recombine new ideas into new products
  • 34. Storage Architectural Patterns
    Tables
    Trees
    Stars
    Triples
  • 35. Finding the Right Match
    Schema-Free
    Standards Compliant
    Mature Query Language
    Use CMU's Architecture Tradeoff Analysis Method (ATAM)
  • 36. Brewer's CAP Theorem
    Consistency
    Availability
    Partition Tolerance
    You cannot have all three, so pick two!
  • 37. Avoidance of Unneeded Complexity
    Relational databases provide a variety of features to ALWAYS support strict data consistency
    The rich feature set and the ACID properties implemented by RDBMSs might be more than is necessary for particular applications and use cases
  • 38. High Throughput
    Some NoSQL databases provide a significantly higher data throughput than traditional RDBMS
    Hypertable, which follows Google’s Bigtable approach, allows the local search engine Zvents to store one billion data cells per day
    Google is able to process 20 petabytes a day stored in Bigtable via its MapReduce approach
  • 39. Complexity and Cost of Setting Up Database Clusters
    NoSQL databases are designed in a way that “PC clusters can be easily and cheaply expanded without the complexity and cost of ‘sharding,’ which involves cutting up databases into multiple tables to run on large clusters or grids.”
    Nati Shalom, CTO and founder of GigaSpaces
  • 40. Compromising Reliability for Better Performance
    Shalom argues that there are “different scenarios where applications would be willing to compromise reliability for better performance.”
    Performance over reliability
    Example: HTTP session data
    “needs to be shared between various web servers but since the data is transient in nature (it goes away when the user logs off) there is no need to store it in persistent storage.”
  • 41. "One Size Fits…"
    "One Size Does Not Fit All"
    James Hamilton Nov. 3rd, 2009
    http://perspectives.mvdirona.com/CommentView,guid,afe46691-a293-4f9a-8900-5688a597726a.aspx
  • 42. Different Thinking
    Sequential Processing
    The output of any step can be used in the next step
    State must be carefully managed
    Parallel Processing
    Each iteration of an XQuery FLWOR expression is an independent thread (no side effects)
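    The same contrast can be sketched in Python rather than XQuery. The 7% tax rate and the function names are hypothetical, and a thread pool is used only to keep the example self-contained; the point is the structure, not measured speed.

```python
from concurrent.futures import ThreadPoolExecutor


def price_with_tax(price):
    """Pure function: the result depends only on the argument."""
    return round(price * 1.07, 2)


def sequential(prices):
    """Sequential style: one loop appends to one shared list, in order."""
    results = []
    for p in prices:
        results.append(price_with_tax(p))
    return results


def parallel(prices, workers=4):
    """Parallel style: each item is handled independently by a worker."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order even if work overlaps.
        return list(pool.map(price_with_tax, prices))
```

    Because `price_with_tax` has no side effects, the two functions return the same list; only the second can be spread across workers without extra coordination.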
  • 43. Cloud Computing
    High scalability
    Especially in the horizontal direction (multiple CPUs)
    Low administration overhead
    Simple web page administration
  • 44. Databases That Work Well in the Cloud
    Data-warehousing-specific databases for batch data processing and map/reduce operations
    Simple, scalable and fast key/value stores
    Databases with a richer feature set than key/value stores, filling the gap between them and traditional RDBMSs while offering good performance and scalability (such as document databases)
  • 45. Auto-Sharding
    When one database gets almost full, it tells a "coordinator" system, and the data is automatically migrated to other systems
    Before: 90% full
    After: 45% full, 45% full
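    A toy Python sketch of the auto-sharding idea. The `Coordinator` class, its key directory and the choice to move exactly half of the keys are assumptions made for illustration; real products route and rebalance data in their own ways.

```python
class Coordinator:
    """Toy coordinator that splits a shard once it passes a fill threshold."""

    def __init__(self, capacity=20, threshold=0.9):
        self.capacity = capacity      # maximum keys per shard
        self.threshold = threshold    # fill ratio that triggers migration
        self.shards = [{}]            # each shard is a plain dict
        self.directory = {}           # key -> index of the shard holding it

    def fill(self, index):
        """Return how full a shard is, as a fraction of its capacity."""
        return len(self.shards[index]) / self.capacity

    def put(self, key, value):
        # Existing keys stay where they are; new keys go to the emptiest shard.
        index = self.directory.get(key)
        if index is None:
            index = min(range(len(self.shards)), key=self.fill)
        self.shards[index][key] = value
        self.directory[key] = index
        if self.fill(index) >= self.threshold:
            self._migrate_half(index)

    def get(self, key):
        return self.shards[self.directory[key]][key]

    def _migrate_half(self, index):
        """Move half of a nearly full shard's keys onto a new, empty shard."""
        self.shards.append({})
        new_index = len(self.shards) - 1
        keys = sorted(self.shards[index])  # assumes keys are mutually comparable
        for key in keys[len(keys) // 2:]:
            self.shards[new_index][key] = self.shards[index].pop(key)
            self.directory[key] = new_index
```

    With the default capacity of 20, the eighteenth insert brings the first shard to 90% full, and the coordinator leaves two shards at 45% each, mirroring the before and after picture on the slide.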
  • 46. Scale Up vs. Scale Out
    Scale Up
    Make a single CPU as fast as possible
    Increase clock speed
    Add RAM
    Make disk I/O go faster
    Scale Out
    Make many CPUs work together
    Learn how to divide your problems into independent threads
  • 47. Functional Programming
    What does it mean to your IT staff?
    What experience do they have in functional programming?
    Can they "unlearn" the habits of the procedural world?
  • 48. The NO-SQL Universe
    Document Stores
    Key-Value Stores
    XML
    Graph Stores
    Object Stores
    Column Stores
  • 49. Key Value Stores
    A table with two columns and a simple interface
    Add a key-value
    For this key, give me the value
    Delete a key
    Blazingly fast and easy to scale
    Key
    Value
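    The three-operation interface described on this slide can be sketched as a small Python class. This is an in-memory stand-in with hypothetical method names, not the API of any specific key-value product.

```python
class KeyValueStore:
    """In-memory stand-in for a two-column table with a simple interface."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        """Add a key-value pair, overwriting the value if the key exists."""
        self._table[key] = value

    def get(self, key, default=None):
        """For this key, give me the value (or a default if it is absent)."""
        return self._table.get(key, default)

    def delete(self, key):
        """Delete a key; deleting a missing key is a no-op."""
        self._table.pop(key, None)
```

    The store never inspects the value, which is part of why such systems are simple to partition: a key alone decides where the data lives.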
  • 50. Types of Key-Value Stores
    Eventually Consistent Key-Value Stores
    Hierarchical Key-Value Stores
    Key-Value Stores in RAM
    Key-Value Stores on Disk
    Ordered Key-Value Stores
  • 51. Cassandra
    Apache open source project
    Originally developed by Facebook
    Designed for highly distributed, highly reliable systems
    No single point of failure
    Column-family data model
    http://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
  • 52. Voldemort
    A distributed key-value system
    Used at LinkedIn
    10K-20K node operations/CPU
    Auto-sharding
    Graceful server failure handling
  • 53. MongoDB
    Open Source License
    Document/Collection centric
    Sharding built-in, automatic
    Stores data in JSON format
    Query language is JSON
    Can be 10x faster than MySQL
    Many languages (C++, JavaScript, Java, Perl, Python etc.)
  • 54. Hadoop/HBase
    Open source implementation of MapReduce algorithm written in Java
    Initially created by Yahoo
    300 person-years development
    Column-oriented data store
    Java interface
    HBase is designed specifically to work with Hadoop
  • 55. CouchDB
    Apache Document Store
    Written in Erlang
    RESTful JSON API
    Distributed, featuring robust, incremental replication with bi-directional conflict detection and management
  • 56. Memcached
    Free & open source in-memory caching system
    Designed to speed up dynamic web applications by alleviating database load
    RAM resident key-value store for small chunks of arbitrary data (strings, objects) from results of database calls, API calls, or page rendering
    Simple interface
    Designed for quick deployment, ease of development
    APIs in many languages
  • 57. MarkLogic
    Native XML database designed to be used for petabyte-scale data stores
    ACID compliant
    Heavy use by federal agencies, document publishers and "high-variability" data
    Arguably the most successful NoSQL company
  • 58. eXist
    OpenSource native XML database
    Strong support for XQuery and XQuery extensions
    Heavily used by the Text Encoding Initiative (TEI) community and XRX/XForms communities
    Ideal for metadata management
    Integrated Lucene search and structured search
  • 59. Riak
    Community and Commercial licenses
    A "Dynamo-inspired" database
    Written in Erlang
    Query with JSON or Erlang
  • 60. Hypertable
    Open Source
    Closely modeled after Google's Bigtable project
    High performance distributed data storage system
    Designed to support applications requiring maximum performance, scalability, and reliability
    Hypertable Query Language (HQL) that is syntactically similar to SQL
  • 61. Selecting a NoSQL Pilot Project
    The "Goldilocks Pilot Project Strategy"
    Not too big, not too small, just the right size
    Duration
    Sponsorship
    Importance
    Skills
    Mentorship
  • 62. The Future of the NoSQL Movement
    Will data sets continue to grow at exponential rates?
    Will new system options become more diverse?
    Will new markets have different demands?
    Will some ideas be "absorbed" into existing RDBMS vendors' products?
    Will the NoSQL community continue to be the place where new database ideas and products are incubated?
    Will the job of doing high-quality architectural tradeoff analysis become easier?
    Growth
    Diversity
  • 63. Using the Wrong Architecture
    Start
    Finish
    Credit: Isaac Homelund – MN Office of the Revisor
  • 64. Using the Right Architecture
    Finish
    Start
    Find ways to remove barriers to empowering the non-programmers on your team.
  • 65. Questions
    Dan McCreary
    President, Kelly-McCreary & Associates
    dan@danmccreary.com