Big Data Overview

An overview of Big Data, including an attempt to describe it in a meaningful fashion, as well as the means for dealing with it.

Published in: Education, Technology

  1. Everything you've always wanted to know about Big Data (But were afraid to ask). Howie Rosenshine, howie.rosenshine@gmail.com, PhillyDB – 6/19/2012
  2. Administrivia: Why the title of the talk? When I use the unqualified term “database” I probably mean DBMS. Bad habit. When I use the unqualified term “database” I probably mean RDBMS (not NoSQL). Another habit... maybe not bad... it will be for you to decide. Howie Rosenshine - Ergo Analytics
  3. What is Big Data? Size is such that it cannot (easily/economically) be processed within a single node (or single “shared something” cluster). Or any smaller architecture that is capable of scaling to the above Big Data definition. And what exactly do we mean by “processed”?
  4. CRUD: Create Read Update Destroy (plus a potentially huge amount of actual computing). Big Data examples tend to come from the “machine generated” domain, e.g. web crawling or tracking, realtime sensor data, logfiles, etc. So for Big Data, CRUD ⇒ CRud. Or perhaps: CRAP ⇒ Create Read Analytical Processing.
  5. Why is all this CRAP a problem? Because tools and architectures that have grown to support not(B) do not scale well to B, where B = Big Data. Why is this so? Scalability? No such thing. Bottlenecks!
  6. Bottlenecks (Scalability): Consider the single-node RDB example (single node = shared everything). Bottlenecks can be hardware or software (probably software more often than not), e.g. kernel locks for I/O contention will probably bite before you run out of disks to attach, PCI bandwidth, etc. But one or the other will bite eventually.
  7. Scalability Solution(?): Distribute! Multiple machines ⇒ multiple kernels, multiple I/O backplanes... yay! Shared something... yay?
  8. Shared Something: Shared logical disk implemented with a pretty extensive inter-machine IPC locking mechanism. This will typically bottleneck long before any aggregated hardware limits. Nevertheless, it is good enough to have become a dominant force in the OLTP industry. Note: this is not to say that you can’t do serious analytical processing on such an architecture. But what happens when your “really big data” exceeds this limit?
  9. Big Data Strategies: Shared nothing parallel relational database; NoSQL (key/value stores); Map Reduce. Note: Embarrassingly parallel problems require none of these. Examples: static web pages; Wikipedia (at least w/o edits); Google maps/earth.
  10. Shared Nothing parallel RDB: Shared nothing, obviously; partitioning/sharding; columnar (typically).
  11. Shared Nothing: Well, “Nothing but Net”, that is. Network should be fast, certainly for bandwidth, preferably for latency as well, at least for some queries (see next section).
  12. Partitioning/Sharding: Ideally little/no inter-shard/inter-node communication (local/localized join). Data distribution/redistribution among shards. Redundancy also allows for orthogonal sharding.
  13. Columnar Store: Columnar store, for the most part at this point. “Some RDBMS are born columnar, and some have columnarness thrust upon them.” Strong advantage for aggregation. Also advantageous for compression.
  14. NoSQL (Key/Value store) Types: “Key value” store (simple key/value store) ⇒ Riak, Voldemort, etc. Document store (complex key/value store) ⇒ MongoDB, CouchDB, etc. Column oriented stores (tabular key/value store) ⇒ Bigtable, HBase, Cassandra, etc.
  15. NoSQL (Key/Value store) Characteristics: Relatively low latency, targeted at transaction oriented data (simple transactions). Typically not ACID. Typically no joins.
  16. Database vs Datastore? Is it ACID? Must a database be an instantiation of DBMS? “I shall not attempt to further define the characteristics of a database, but I know it when I see it, and this isn’t it.”
  17. Map Reduce (“And now for something completely different”): Practical general purpose (or as close as anyone has come) implicit parallel programming paradigm. Attributed to Google, who published the original Map Reduce white paper. Open Source Hadoop - Doug Cutting, Yahoo. Note: Hadoop is an “ecosystem”, not a “product”; however, the unqualified use of Hadoop is typically taken to mean the use of Hadoop map reduce.
  18. Hadoop characteristics: Hadoop addresses the crAP partition. Hadoop map reduce is composed primarily of HDFS and map reduce itself. Not just Java ⇒ streams interface: Python, Ruby..., Unix: utilities, pipes, filters, shell.
  19. Hadoop “Hello World”: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc
  20. General Purpose: Use your imagination: if you can make the shoe fit, Hadoop will wear it. HIVE ⇒ RDBMS. RDBMS X... new and improved, 100% fortified with Hadoop ⇒ ETL.
  21. Big Picture “Scalability”: Order of magnitude comparison 1/10/100/1000 ⇒ single node/shared something RDB/shared nothing RDB/map reduce. This is not necessarily a good inter-platform performance comparison, though it may be reasonable for intra-platform comparison.
  22. Further Reading: dbms2.com - Curt Monash; dbmsmusings.blogspot.com - Daniel Abadi.
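
The map reduce paradigm described on slide 17 can be sketched in a single Python process. This is only a toy model under stated assumptions, not Hadoop's API: `run_map_reduce`, `word_mapper`, and `sum_reducer` are hypothetical names, the input lines are made up, and the "shuffle" is an in-memory dictionary rather than a network transfer between nodes.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def run_map_reduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], int],
) -> Dict[str, int]:
    """Toy single-process model of map, shuffle, and reduce."""
    # Map phase: every record is turned into zero or more (key, value) pairs.
    grouped: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase (simulated): collect all values for the same key.
            grouped[key].append(value)
    # Reduce phase: each key is reduced independently of every other key,
    # which is what lets a real framework spread keys across machines.
    return {key: reducer(key, grouped[key]) for key in sorted(grouped)}


def word_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word."""
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key: str, values: List[int]) -> int:
    """Add up the counts collected for one key."""
    return sum(values)


# Hypothetical input, standing in for files under an input directory.
lines = ["big data is big", "data about data"]
counts = run_map_reduce(lines, word_mapper, sum_reducer)
print(counts)
```

The programmer writes only the mapper and the reducer; the parallelism is implicit because neither function depends on any record or key other than the one it is handed.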
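
The "local/localized join" from slide 12 can be illustrated with a rough sketch, assuming both tables are sharded on the join key with the same hash function. The shard count, table contents, and helper names (`shard_for`, `partition`, `local_join`) are all hypothetical; a real shared nothing RDB would place each shard on a separate node.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard; crc32 is stable across runs, unlike hash() on str."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def partition(rows, num_shards=NUM_SHARDS):
    """Distribute rows among shards using the first field as the sharding key."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row[0], num_shards)].append(row)
    return shards


def local_join(left_rows, right_rows):
    """Hash join of two row sets that live on the same shard."""
    index = defaultdict(list)
    for right in right_rows:
        index[right[0]].append(right)
    return [
        (left[0], left[1], right[1])
        for left in left_rows
        for right in index.get(left[0], [])
    ]


# Hypothetical tables, both keyed on a user id.
users = [("u1", "alice"), ("u2", "bob"), ("u3", "carol")]
clicks = [("u1", "/home"), ("u3", "/docs"), ("u1", "/faq")]

user_shards = partition(users)
click_shards = partition(clicks)

# Each shard joins only its own rows; no row crosses a shard boundary.
joined = []
for shard_id in range(NUM_SHARDS):
    joined.extend(local_join(user_shards[shard_id], click_shards[shard_id]))

print(sorted(joined))
```

Because matching keys always hash to the same shard, the per-shard joins together produce the full result. A join on any other column would require redistributing rows among shards first, which is where network speed starts to matter.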
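
The two columnar advantages named on slide 13, aggregation and compression, can be hinted at with a small sketch. The rows are invented for illustration, and run-length encoding is just one simple compression scheme chosen here as an assumption; the slides do not say which schemes any particular product uses.

```python
from itertools import groupby

# Hypothetical row-oriented data: (date, state, hits).
rows = [
    ("2012-06-19", "PA", 3),
    ("2012-06-19", "PA", 5),
    ("2012-06-19", "NJ", 2),
    ("2012-06-20", "NJ", 7),
]

# Columnar layout: one list per column instead of one tuple per row.
dates, states, hits = (list(column) for column in zip(*rows))

# Aggregation touches only the column it needs, not every field of every row.
total_hits = sum(hits)


def run_length_encode(column):
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


# Values within one column tend to repeat, so runs compress well.
encoded_dates = run_length_encode(dates)
encoded_states = run_length_encode(states)

print(total_hits)
print(encoded_dates)
print(encoded_states)
```

In a row store the same sum would have to read past the date and state fields of every row, and the repeated values would be interleaved with unrelated fields, leaving no runs to collapse.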
