Big Data Platforms: An Overview

A high level overview of NoSQL Big Data platforms. This presentation is targeted more towards business people than technical people.

Transcript

  • 1. Big Data Platforms: An Overview C. Scyphers, Chief Technical Architect, Daemon Consulting, LLC
  • 2. What Is “Big Data”? Big Data is not simply a huge pile of information. A good starting place is the following thought: “Big Data describes datasets so large they become very difficult to manage with traditional database tools.”
  • 3. What Is A Big Data Platform? Putting it simply, it is any platform which supports those kinds of large datasets.
  • 4. It doesn’t have to be cutting edge technology.
  • 5. Lots of legacy technologies can address the problem.
  • 6. If only at a sizeable cost.
  • 7. Of the new technologies, the most promising are from the “NoSQL” family.
  • 8. What Is “NoSQL”? A family of non-relational data storage technologies.
  • 9. Horizontal scalability
  • 10. Distributed processing
  • 11. Faster throughput
  • 12. Usually with less cost than more traditional approaches.
  • 13. Some of these technologies are new and innovative.
  • 14. Others have been around for decades.
  • 15. NoSQL Does Not Mean “SQL Is Bad” When the trend was just starting, “NoSQL” was coined. It’s unfortunate, because it implies antagonism towards SQL.
  • 16. NoSQL Means “Not Only SQL” NoSQL is a complement to a traditional RDBMS, not necessarily a replacement for it.
  • 17. Why Won’t SQL Do?
  • 18. Scale is very hard without ridiculous expense
  • 19. SQL can get very complex, very quickly
  • 20. Changing a schema for a large production system is both risky and expensive
  • 21. Throughput can be a challenge
  • 22. How Does NoSQL Do It?
  • 23. Scale is achieved through a shared-nothing architecture, removing bottlenecks
  • 24. Schemaless design means change becomes much less risky and significantly cheaper
  • 25. Most solutions use simple RESTful interfaces
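
As a rough illustration (mine, not from the deck), a RESTful store is driven with ordinary HTTP verbs. The sketch below assumes a hypothetical store exposing keys at http://localhost:8080 and uses Python's requests library; real products (Riak, CouchDB, etc.) differ in URL layout and options:

      import requests  # third-party HTTP client

      # Hypothetical endpoint; the bucket/key URL scheme is illustrative only.
      url = "http://localhost:8080/buckets/customers/keys/john-smith"

      # Store a value under a key with an HTTP PUT...
      requests.put(url, json={"address": "100 Century Dr. Alexandria VA 22304"})

      # ...and fetch it back with an HTTP GET.
      print(requests.get(url).json())
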
  • 26. NoSQL is based upon a trade-off in distributed data storage, usually referred to as the “CAP Theorem”.
  • 27. The CAP Theorem Grossly simplified (with apologies to Brewer): a database can be • Consistent (all clients see the same data) • Available (all clients can find some available node) • Partition-Tolerant (the database will continue to function even if split into disconnected sets, e.g. a network disruption). Pick any two.
  • 28. CAP In Practice: Consistent & Available (no Partition Tolerance) • Either single machines or single-site clusters • Typically uses two-phase commits.
  • 29. CAP In Practice: Consistent & Partition Tolerant (no Availability) • Some data may be inaccessible, but the remainder is available and consistent • Sharding is an example of this implementation (e.g. Customers A-F, G-R and S-Z on separate shards).
  • 30. CAP In Practice: Available & Partition Tolerant (no Consistency) • Some data may be inaccurate; a conflict resolution strategy is required • DNS is an example of this, as well as standard master-slave replication.
  • 31. CAP From A Vendor POV • C-A (no P): this is generally how most RDBMS vendors operate • C-P (no A): this is how many RDBMSs attempt to address scale without incurring large costs • A-P (no C): this is how most NoSQL approaches solve the problem.
  • 32. ACID vs BASE: Traditional databases are ACID compliant; NoSQL databases tend to be BASE compliant (Basically Available, Scalable, Eventually consistent). ACID means Atomicity (either the entire transaction completes or none of it does), Consistency (any transaction will take the database from one consistent state to another, with no broken constraints), Isolation (changes do not affect other users until committed) and Durability (committed transactions can be recovered in case of system failure). Eventually consistent is the key phrase here.
  • 33. SQL Strengths: Very well known technology
  • 34. Very mature technology
  • 35. Interoperability across vendors
  • 36. Large talent pool from which to choose
  • 37. Ad hoc operations common, if not encouraged
  • 38. NoSQL Strengths Built to address massive scale
  • 39. Through horizontal scalability
  • 40. While remaining highly available
  • 41. And handling unstructured data
  • 42. NoSQL Pros/Cons. Pros: Schema evolution • Horizontal scalability • Simple protocols. Cons: Querying the data is much harder • Paradigm shift • Security is a big issue • May or may not support data types (BLOBs, spatial) • Generally, uniqueness cannot be enforced.
  • 43. A Disclaimer Before We Continue • I am not an expert on every possible Big Data Platform • There are hundreds of them; these are the ones I consider the leaders in the field and recommend • If you have a favorite, please let me know and I’ll update the deck for next time • The internal details on how these systems work are rather complex; I would prefer to take those questions offline.
  • 44. Flavors Of NoSQL The four major divisions of NoSQL are: • Key-Value • Document Store • Columnar • Other
  • 45. Key-Value • At a very high level, key-value works essentially by pairing an index token (a key) with a data element (a value). • Both the index token and the data value can be of any structure. • Such a pairing is arbitrary and up to the developer of the system to determine.
  • 46. A Key-Value Example: “John Smith”, “100 Century Dr. Alexandria VA 22304”; “John Doe”, “16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria”. In both examples, the key is a name and the value is an address. However, the structure of the address differs between the two.
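
To make the pairing concrete, here is a minimal sketch (my own, not part of the deck) using a plain Python dictionary as a stand-in for a key-value store. As on the slide, the two values deliberately have different shapes, and the store does not care:

      # A plain dict stands in for a key-value store: keys are opaque tokens,
      # values are whatever the application chooses to put there.
      store = {}

      store["John Smith"] = "100 Century Dr. Alexandria VA 22304"
      store["John Doe"] = {"street": "16 Kozyak Street", "district": "Lozenets, 1408",
                           "city": "Sofia", "country": "Bulgaria"}

      # Interpreting the (differently structured) values is the application's job.
      print(store["John Smith"])
      print(store["John Doe"])
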
  • 47. Document Store • Document stores extend the key-value paradigm into values with multiple attributes. • The document values tend to be semi-structured data (XML, JSON, et al) but can also be Word or PDF documents.
  • 48. A Document Store Example: “John Smith”, “<address><street>100 Century Dr.</street> <city>Alexandria</city> <state>VA</state> <postalCode>22304</postalCode> </address>”; “John Doe”, “{ "address": { "street": "16 Kozyak Street", "district": "Lozenets, 1408", "city": "Sofia", "country": "Bulgaria" } }”
  • 49. Columnar Family • Usually has “rows” and “columns” • Or, at least, their logical equivalents • Not a traditional, “pure” column store • More of a hybridized approach leveraging key-value pairs • A key with many values attached
  • 50. The Others • Hierarchical Databases (LDAP, Active Directory) • Graph Databases (Neo4j, FlockDB, InfiniteGraph) • XML (MarkLogic) • Object Oriented Databases (Versant) • Lotus Notes • HPCC (LexisNexis)
  • 51. Key-Value Recap: Pairing an index token (a key) with a data element (a value).
  • 52. Key-Value Pro/Con. Pros: Schema evolution • Horizontal scalability • Simple protocols • Works well for volatile data • High throughput, typically optimized for reads or writes • Keys become meaningful rather than arbitrary • Application logic defines object model. Cons: Packing & unpacking each key • Keys typically are not related to each other • The entire value must be returned, not just a part of it • Security tends to be an issue • Hard to support reporting, analytics, aggregation or ordered values • Generally does not support updates in place • Application logic defines object model.
  • 53. Where Did Key-Value Come From? The concept is quite old, but most people trace the lineage back to Amazon and the Dynamo paper.
  • 54. Dynamo Amazon devised the Dynamo engine as a way to address their scalability issues in a reliable way. • Communication between nodes is peer to peer (P2P) • Replication occurs with the end client addressing conflict resolution • Quorum Reads/Writes • Always writable (Hinted Handoff) • Eventually Consistent
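
The quorum point can be illustrated with a little arithmetic (a sketch of the idea, not Dynamo's actual code): with N replicas per key, a write acknowledged by W nodes and a read consulting R nodes, choosing R + W > N guarantees that every read overlaps at least one node holding the latest write.

      # Quorum arithmetic as used by Dynamo-style stores (illustrative only).
      N = 3  # replicas per key
      W = 2  # nodes that must acknowledge a write
      R = 2  # nodes consulted on a read

      # If R + W > N, any read set and any write set must share at least one node,
      # so a read is guaranteed to see at least one up-to-date copy.
      overlap_guaranteed = (R + W) > N
      print(f"N={N}, W={W}, R={R}: read/write sets overlap -> {overlap_guaranteed}")
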
  • 55. Eventually Consistent • Rather than expending the runtime resources to ensure that all nodes are aware of a change before continuing, Dynamo uses an eventually consistent model. • In this model, a subset of nodes is changed • Those nodes then inform their neighbors until all nodes are changed (grossly simplifying).
  • 56. Can I Use Dynamo? No. It’s an Amazon-only internal product. However, AWS S3 is largely based upon it. Amazon did announce a DynamoDB offering for their AWS customers. While it’s probably the same, I cannot guarantee that it is.
  • 57. Riak • Riak is a key-value database largely modeled after the Dynamo model. • Open source (free) with paid support from Basho. • Main claims to fame: • Extreme reliability • Performance speed
  • 58. Riak Pro/Con. Pros: All nodes are equal, no single point of failure • Horizontal scalability • Full text search • RESTful interface (and HTTP) • Consistency level tunable on each operation • Secondary indexes available • Map/Reduce (JavaScript & Erlang only). Cons: Not meant for small, discrete and numerous datapoints • Getting data in is great; getting it out, not so much • Security is non-existent: “Riak assumes the internal environment is trusted” • Conflict resolution can bubble up to the client if not careful • Erlang is fast, but it’s got a serious learning curve.
  • 59. Riak Users
  • 60. Redis • Redis is a key-value in-memory datastore. • Open source (free) with support from the community. • Main claims to fame: • Fast. So very, very fast. • Transactional support • Best for rapidly changing data
  • 61. Redis Pro/Con. Pros: Transactional support • Blob storage • Support for sets, lists and sorted sets • Support for Publish-Subscribe (Pub-Sub) messaging • Robust set of operators. Cons: Entirely in memory • Master-slave replication (instead of master-master) • Security is non-existent: designed to be used in trusted environments • Does not support encryption • Support can be hard to find.
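
A minimal sketch (mine, not from the deck) of the kinds of operations listed above, assuming a Redis server on localhost and the redis-py client; key and channel names are made up:

      import redis  # redis-py client

      r = redis.Redis(host="localhost", port=6379)

      # Simple key with an atomic counter: typical "rapidly changing data".
      r.set("page:views:home", 0)
      r.incr("page:views:home")

      # Sorted set: a leaderboard kept ordered by score.
      r.zadd("leaderboard", {"alice": 120, "bob": 95})
      print(r.zrange("leaderboard", 0, -1, withscores=True))

      # Publish-subscribe: push a message to a channel for any subscribers.
      r.publish("events", "cache refreshed")
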
  • 62. Redis Users
  • 63. Voldemort • Voldemort is a key-value in-memory database built by LinkedIn. • Open source (free) with support from the community • Main claims to fame: • Low latency • Highly available • Very fast reads
  • 64. Voldemort Pro/Con. Pros: Highly customizable, each layer of the stack can be replaced as needed • Data elements are versioned during changes • All nodes are independent, no single point of failure • Very, very fast reads. Cons: Versioning means lots of disk space being used • Does not support range queries • No complex query filters • All joins must be done in code • No foreign key constraints • No triggers • Support can be hard to find.
  • 65. Voldemort Users
  • 66. Key/Value “Big Vendors” • Microsoft Azure Table Storage • Oracle NoSQL • BerkeleyDB (Oracle)
  • 67. Document Store Recap: Document stores store an index token with a grouping of attributes in a semi-structured document.
  • 68. Document Store Pro/Con. Pros: Tends to support a more complex data model than key/value • Good at content management • Usually supports multiple indexes • Schemaless (can be nested) • Typically low latency reads • Application logic defines object model. Cons: The entire value must be returned, not just a part of it • Security tends to be an issue • Joins are not available within the database • No foreign keys • Application logic defines object model.
  • 69. CouchDB • CouchDB is a document store database. • Open source (free), part of the Apache foundation, with paid support available from several vendors. • Main claims to fame: • Simple and easy to use • Good read consistency • Master-master replication
  • 70. CouchDB Pro/Con. Pros: Very simple API for development • MVCC support for read consistency • Full Map/Reduce support • Data is versioned • Secondary indexes supported • Some security support • RESTful API, JSON support • Materialized views with incremental update support. Cons: The simple API for development is somewhat limited • No foreign keys • Conflict resolution devolves to the application • Versioning requires extensive disk space • Versioning places large load on I/O channels • Replication for performance, not availability.
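
Because CouchDB speaks plain HTTP and JSON, its API can be exercised with nothing more than an HTTP client. A minimal sketch (mine), assuming a local CouchDB on the default port 5984 with no authentication required; the database name, document id and fields are made up:

      import requests

      base = "http://localhost:5984/contacts"   # a database named "contacts"
      requests.put(base)                        # create the database if it does not exist

      # Create a document; CouchDB returns its revision token ("rev").
      doc = {"name": "John Doe", "city": "Sofia", "country": "Bulgaria"}
      rev = requests.put(f"{base}/john-doe", json=doc).json()["rev"]

      # Updates must cite the current revision (MVCC): a stale revision is
      # rejected as a conflict rather than silently overwritten.
      doc["city"] = "Alexandria"
      doc["_rev"] = rev
      requests.put(f"{base}/john-doe", json=doc)

      print(requests.get(f"{base}/john-doe").json())
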
  • 71. CouchDB Users
  • 72. MongoDB • MongoDB is a document store database. • Open source (free) with paid support available from 10Gen. • Main claims to fame: • Index anything • Ad hoc query support • SQL-like operations (not SQL syntax)
  • 73. MongoDB Pro/Con. Pros: Auto-sharding • Auto-failover • Update in place • Spatial index support • Ad hoc query support • Any field in Mongo can be indexed • Very, very popular (lots of production deployments) • Very easy transition from SQL. Cons: Does not support JSON (BSON instead) • Master-slave replication • Has had some growing pains (e.g. Foursquare outage) • Not RESTful by default • Failures require a manual database repair operation (similar to MySQL) • Replication for availability, not performance.
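
A minimal sketch (mine) of the ad hoc query and "index anything" points, assuming a local MongoDB instance and the pymongo driver; database, collection and field names are invented:

      from pymongo import MongoClient

      client = MongoClient("mongodb://localhost:27017")
      people = client["demo"]["people"]   # database "demo", collection "people"

      # No schema has to be declared before inserting a nested document.
      people.insert_one({"name": "John Doe",
                         "address": {"city": "Sofia", "country": "Bulgaria"}})

      # Ad hoc query on a nested field.
      for doc in people.find({"address.country": "Bulgaria"}):
          print(doc["name"])

      # Any field can be indexed after the fact.
      people.create_index("address.country")
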
  • 74. MongoDB Users
  • 75. Document Store “Big Vendors” • Lotus Domino
  • 76. Columnar Family Recap • A key with many values attached • Usually presenting as “rows” and “columns” • Or, at least, their logical equivalents
  • 77. Columnar Pro/Con. Pros: Tend to have some level of rudimentary security support • Usually include a degree of versioning • Can be more efficient than row databases when processing a limited number of columns over a large amount of rows. Cons: Much less efficient when processing many columns simultaneously • Joins tend to not be supported • Referential integrity not available.
  • 78. Where Did Columnar Come From? The concept has been around for a while, but most people trace the NoSQL lineage back to Google.
  • 79. BigTable Google devised the BigTable engine as a way to address their search-related scalability issues in a reliable way. • Data is organized through a set of keys: Row, Column, Timestamp • A hybrid row/column store with a single master • Versioning is handled through the time key • Tablets are a dynamic partition of a sequence of rows, supporting very efficient range scans • Columns can be grouped into column families • Column families can have access control
  • 80. Can I Use BigTable?No. It’s a Google only internal product. However, quite a few open source products are built upon the concepts.
  • 81. Cassandra • Cassandra is a hybrid of BigTable built on Dynamo infrastructure • Open source (free), built by Facebook, with paid support available from several vendors. • Main claims to fame: • An Apache project • Very, very fast writes • Spans multiple datacenters
  • 82. Cassandra Pro/Con. Pros: Designed to span multiple datacenters • Peer to peer communication between nodes • No single point of failure • Always writeable • Consistency level is tunable at run time • Supports secondary indexes • Supports Map/Reduce • Supports range queries. Cons: No joins • No referential integrity • Written in Java, quite complex to administer and configure • Last update wins.
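
A minimal sketch (mine) of run-time tunable consistency, assuming a local Cassandra node reachable from the DataStax Python driver and an existing keyspace and table; all names are invented:

      from cassandra.cluster import Cluster
      from cassandra import ConsistencyLevel
      from cassandra.query import SimpleStatement

      session = Cluster(["127.0.0.1"]).connect("demo")   # keyspace "demo" assumed to exist

      # Write with a weak consistency level: fast and always writeable.
      write = SimpleStatement(
          "INSERT INTO users (id, name) VALUES (%s, %s)",
          consistency_level=ConsistencyLevel.ONE)
      session.execute(write, (1, "John Doe"))

      # Read at QUORUM: same API, stronger guarantee, chosen per operation.
      read = SimpleStatement("SELECT name FROM users WHERE id = %s",
                             consistency_level=ConsistencyLevel.QUORUM)
      print(session.execute(read, (1,)).one())
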
  • 83. Cassandra Users
  • 84. HBase • HBase is a columnar database built on top of the Hadoop environment. • Open source (free) with paid support from numerous vendors • Main claims to fame: • Ad hoc type abilities • Easy integration with Map/Reduce
  • 85. HBase Pro/Con. Pros: Map/Reduce support • More of a CA approach than an AP one • Supports predicate push down for performance gains • Automatic partitioning and rebalancing of regions • Data is stored in a sorted order (not indexed) • RESTful API • Strong and vibrant ecosystem. Cons: Secondary indexes generally not supported • Security is non-existent • Requires a Hadoop infrastructure to function.
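
A minimal sketch (mine) of the column-family model and sorted-key scans, assuming an HBase Thrift gateway is running and the happybase Python client; the table and column family are invented:

      import happybase  # client for HBase's Thrift gateway

      connection = happybase.Connection("localhost")
      table = connection.table("customers")   # assumes a table with column family "addr"

      # One row key with several qualified columns under a single column family.
      table.put(b"john-doe", {b"addr:city": b"Sofia",
                              b"addr:country": b"Bulgaria"})

      # Fetch a whole row, or scan a sorted key range (no secondary index needed).
      print(table.row(b"john-doe"))
      for key, data in table.scan(row_start=b"a", row_stop=b"n"):
          print(key, data)
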
  • 86. HBase Users
  • 87. Hadoop • Hadoop is not a columnar store as such. • Rather, Hadoop is a massively parallel data processing engine • Main claims to fame: • Specializes in unstructured data • Very flexible and popular
  • 88. Hadoop Pro/Con. Pros: While written in Java, almost any language can leverage Hadoop • Runs on commodity servers • Horizontally scalable • Very fast and powerful • Where Map/Reduce originated • Ample support from vendors • “Helper” languages like Hive and Pig • Strong and vibrant ecosystem. Cons: Large amounts of disk space and bandwidth required • Paradigm shift for IT staff • Quality talent is highly in demand and expensive • Security is non-existent • Name node is a single point of failure • More or less only supporting batch processing • Not user friendly to anyone other than developers.
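
The "almost any language" point usually refers to Hadoop Streaming, which pipes records through stdin/stdout. Below is a minimal word-count sketch (mine, two separate scripts shown together); they would be submitted with the Hadoop Streaming jar, whose exact name and flags depend on the distribution:

      # mapper.py -- emit "word<TAB>1" for every word read from stdin.
      import sys
      for line in sys.stdin:
          for word in line.split():
              print(f"{word}\t1")

      # reducer.py -- sum the counts per word; Hadoop delivers the mapper output
      # sorted by key, so identical words arrive consecutively.
      import sys
      current, total = None, 0
      for line in sys.stdin:
          word, count = line.rstrip("\n").split("\t")
          if word != current:
              if current is not None:
                  print(f"{current}\t{total}")
              current, total = word, 0
          total += int(count)
      if current is not None:
          print(f"{current}\t{total}")
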
  • 89. Hadoop Users Plus lots, lots more
  • 90. Columnar “Big Vendors” • EMC Greenplum • Teradata Aster, insofar as both of these solutions are grafting Map/Reduce into a (more or less) SQL environment.
  • 91. Which One Do I Use Where? • Key-Value for (relatively) simple, volatile data • Document store for more complex data • Columnar for analytical processing • RDBMS for traditional processing, particularly where a lazy consistency is not acceptable (Point of Sale, for example)
  • 92. Questions?
  • 93. @scyphers Additional Information At http://www.daemonconsulting.net/BDC-FOSE-2012 Daemon Consulting, LLC http://www.daemonconsulting.net/ Specializing In The Hard Stuff