Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"

Kansas City IT Professionals, a grassroots tech community of 9,000+ members, held an event on August 30th, 2012, entitled Big Data: The Future Of Insights (see: http://kcitp.me/M67S9M).

The event consisted of two keynotes and a panel with expert data scientists, engineers, and data analysts from companies like Adknowledge and Cerner.

This talk, entitled "Big Data Technologies and Tools" was delivered by Ryan Brush, Distinguished Engineer w/ Cerner

1. Big Data Technologies and Techniques. Ryan Brush, Distinguished Engineer, Cerner Corporation. @ryanbrush
2. Relational Databases are Awesome
3. Relational Databases are Awesome: atomic, transactional updates; guaranteed consistency; declarative queries; easy to reason about; a long track record of success
4. Relational Databases are Awesome …so use them!
5. Relational Databases are Awesome …so use them! But…
6. Those advantages have a cost: global, atomic state means global, atomic coordination, and coordination does not scale linearly
7. The costs of coordination. Remember the network effect?
8. The costs of coordination. channels = n(n − 1) / 2, so 2 nodes = 1 channel, 5 nodes = 10 channels, 12 nodes = 66 channels, 25 nodes = 300 channels
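In standard notation the slide is just counting node pairs, and its numbers check out:

```latex
\text{channels}(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
\frac{5 \cdot 4}{2} = 10, \quad
\frac{12 \cdot 11}{2} = 66, \quad
\frac{25 \cdot 24}{2} = 300.
```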
9. So we better be able to scale
10. The costs of coordination. Databases have optimized this in many clever ways, but a limit on scalability still exists
11. Let’s look at some ways to scale
12. Bulk processing billions of records
13. Bulk processing billions of records. Data aggregation and storage
14. Bulk processing billions of records. Data aggregation and storage. Real-time processing of updates
15. Bulk processing billions of records. Data aggregation and storage. Real-time processing of updates. Serving data for: online apps, analytics
16. Let’s start with scalability of bulk processing
17. Quiz: which one is scalable?
18. Quiz: which one is scalable? A 1000-node Hadoop cluster where jobs depend on a common process
19. Quiz: which one is scalable? A 1000-node Hadoop cluster where jobs depend on a common process, or 1000 Windows ME machines running independent Excel macros
20. Quiz: which one is scalable? A 1000-node Hadoop cluster where jobs depend on a common process, or 1000 Windows ME machines running independent Excel macros
21. Independence → Parallelizable
22. Independence → Parallelizable. Parallelizable → Scalable
23. “Shared Nothing” architectures are the most scalable…
24. “Shared Nothing” architectures are the most scalable… …but most real-world problems require us to share something…
25. “Shared Nothing” architectures are the most scalable… …but most real-world problems require us to share something… …so our designs usually have a parallel part and a serial part
26. The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
27. Amdahl’s Law: S(N) = 1 / ((1 − P) + P/N), where S is the speed improvement, P is the ratio of the problem that can be parallelized, and N is the number of processors
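In display form, with a worked example (the P = 0.95 figure is an illustrative assumption, not from the deck):

```latex
S(N) = \frac{1}{(1 - P) + \dfrac{P}{N}},
\qquad
S(1000)\Big|_{P = 0.95} = \frac{1}{0.05 + 0.00095} \approx 19.6,
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - P} = 20.
```

Even with a thousand processors, a 5% serial fraction caps the speedup at twenty, which is why the serial part of a design matters so much.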
28. MapReduce Primer [diagram: input data is divided into splits 1..N, each consumed by its own mapper; the shuffle phase then routes mapper output to reducers 1..N]
29. MapReduce Example: Word Count [diagram: mappers count words per book; the shuffle partitions words alphabetically (A-C, D-E, …, W-Z); reducers sum the counts for their range]
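As a concrete reference, here is a minimal sketch of this job against the standard Hadoop Java API — the classic word count, counting across all input rather than per book:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs next to the data; emits (word, 1) for every token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the shuffle brings every count for a word here; sum them.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // pre-sum on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```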
30. Notice there is still a serial part of the problem: the output of the reducers must be combined
31. Notice there is still a serial part of the problem: the output of the reducers must be combined …but this is much smaller, and can be handled by a single process
32. Also notice that the network is a shared resource when processing big data
33. Also notice that the network is a shared resource when processing big data. So rather than moving data to computation, we move computation to data.
34. MapReduce Data Locality [diagram: the same flow as the primer, with each split and its mapper marked as residing on the same physical machine]
35. Data locality is only guaranteed in the Map phase
36. Data locality is only guaranteed in the Map phase. So the most data-intensive work should be done in the map, with smaller sets sent to the reducer
37. Data locality is only guaranteed in the Map phase. So the most data-intensive work should be done in the map, with smaller sets sent to the reducer. Some Map/Reduce jobs have no reducer at all!
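Requesting a map-only job is a one-liner in the Hadoop API; the mapper class here is a hypothetical stand-in:

```java
// Map-only job: with zero reduce tasks there is no shuffle at all, and
// mapper output is written straight to the output path with full locality.
job.setMapperClass(CleanAndCodifyMapper.class);  // hypothetical mapper
job.setNumReduceTasks(0);
```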
38. MapReduce Gone Wrong [diagram: the word-count flow again, but the summing now calls out to a remote Word Addition Service]
39. Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it
40. So for data processing, prefer embedded libraries over remote services
41. So for data processing, prefer embedded libraries over remote services. Use remote services for configuration, to prime caches, etc. – just not for every data element!
42. Joining a billion records. Word counts are great, but many real-world problems mean bringing together multiple datasets. So how do we “join” with MapReduce?
43. Map-Side Joins. When joining one big input to a small one, simply copy the small data set to each mapper [diagram: each mapper receives one split of data set 1 plus a full copy of data set 2]
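A minimal sketch of that pattern in the Hadoop API, assuming a tab-separated small data set at a hypothetical HDFS path (in practice the distributed cache is often used to ship the file to each node):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: every mapper loads the small data set into memory once,
// then joins each record of the big input against it. No shuffle needed.
public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  private final Map<String, String> small = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // Hypothetical location of the small data set: key <TAB> value per line.
    Path path = new Path("/data/small-dataset.tsv");
    FileSystem fs = FileSystem.get(context.getConfiguration());
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = in.readLine()) != null) {
      String[] parts = line.split("\t", 2);
      if (parts.length == 2) {
        small.put(parts[0], parts[1]);
      }
    }
    in.close();
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = value.toString().split("\t", 2);
    String match = parts.length == 2 ? small.get(parts[0]) : null;
    if (match != null) {  // inner join on the first column
      context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
    }
  }
}
```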
44. Merge in Reducer. Route common items to the same reducer [diagram: splits of data sets 1 and 2 are each grouped by key in the map phase; the shuffle delivers all records for a key to a single reducer]
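A sketch of the reducer side of that pattern, assuming the mappers have emitted the join key and tagged each record with its source ("1:" or "2:"):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce-side join: the shuffle brings both sides of each key to one reducer,
// which separates the tagged records and emits their combinations.
public class JoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    List<String> left = new ArrayList<String>();
    List<String> right = new ArrayList<String>();
    for (Text value : values) {
      String tagged = value.toString();
      if (tagged.startsWith("1:")) {
        left.add(tagged.substring(2));
      } else {
        right.add(tagged.substring(2));
      }
    }
    // Emit the cross product of the two sides for this key.
    for (String l : left) {
      for (String r : right) {
        context.write(key, new Text(l + "\t" + r));
      }
    }
  }
}
```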
45. Higher-Level Constructs. MapReduce is a primitive operation for higher-level constructs. Hive, Pig, Cascading, and Crunch all compile into MapReduce. Use one! Crunch!
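For a flavor of why, here is word count again as a Crunch pipeline, closely following the library's canonical example; compare it to the raw MapReduce version above:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
  public static void main(String[] args) {
    // A pipeline that Crunch compiles down to one or more MapReduce jobs.
    Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split lines into words; Crunch plans the map phase for us.
    PCollection<String> words = lines.parallelDo(
        new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());

    // count() plans the shuffle and reduce.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```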
46. MapReduce and MPP Databases
47-52. MapReduce and MPP Databases, compared (built up one row per slide):

| MapReduce | MPP Databases |
| --- | --- |
| Data in a distributed filesystem | Data in sharded relational databases |
| Oriented towards unstructured or semi-structured data | Oriented towards structured data |
| Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL |
| Poor support for iterative operations | Good support for iterative operations |
| Arbitrarily complex programs running next to data | SQL and User-Defined Functions running next to data |
| Poor interactive query support | Good interactive query support |
53. MapReduce and MPP Databases …are complementary!
54. MapReduce and MPP Databases …are complementary! Map/Reduce to clean, normalize, reconcile, and codify data to load into an MPP system for interactive analysis
55. Bulk processing of billions of records. Data aggregation and storage
56. Hadoop Distributed Filesystem. Scales to many petabytes
57. Hadoop Distributed Filesystem. Scales to many petabytes. Splits all files into blocks and spreads them across data nodes
58. Hadoop Distributed Filesystem. Scales to many petabytes. Splits all files into blocks and spreads them across data nodes. The name node keeps track of what blocks belong to what file
59. Hadoop Distributed Filesystem. Scales to many petabytes. Splits all files into blocks and spreads them across data nodes. The name node keeps track of what blocks belong to what file. All blocks written in triplicate
60. Hadoop Distributed Filesystem. Scales to many petabytes. Splits all files into blocks and spreads them across data nodes. The name node keeps track of what blocks belong to what file. All blocks written in triplicate. Write and append only – no random updates!
61. HDFS Writes [diagram: the client asks the name node for a data node, writes each block to data node 1, which replicates it to data node 2 and onward to data node N]
62. HDFS Reads [diagram: the client asks the name node for block locations, then reads the blocks directly from the data nodes]
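Both flows are hidden behind a single client API; a minimal sketch with an illustrative path:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // The client asks the name node where blocks live, then talks to data
    // nodes directly; all of that is hidden behind the FileSystem API.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write once: HDFS files are write/append only, no random updates.
    Path path = new Path("/tmp/example.txt");  // hypothetical path
    FSDataOutputStream out = fs.create(path);
    out.writeBytes("hello, planet-size data\n");
    out.close();

    // Read it back; a large file would stream block by block.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(path)));
    System.out.println(in.readLine());
    in.close();
  }
}
```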
63. HDFS Shortcomings. No random reads. No random writes. Doesn’t deal with many small files
64. HDFS Shortcomings. No random reads. No random writes. Doesn’t deal with many small files. Enter HBase: “Random Access To Your Planet-Size Data”
65. HBase. Emulates random I/O with a Write Ahead Log (WAL). Periodically flushes the log to sorted files
66. HBase. Emulates random I/O with a Write Ahead Log (WAL). Periodically flushes the log to sorted files. Files accessible as tables, split across many regions, hosted by region servers
67. HBase. Emulates random I/O with a Write Ahead Log (WAL). Periodically flushes the log to sorted files. Files accessible as tables, split across many regions, hosted by region servers. Preserves the scalability, data locality, and Map/Reduce features of Hadoop
68. Use HBase when: you have noisy, semi-structured data
69. Use HBase when: you have noisy, semi-structured data; you want to apply massively parallel processing to your problem
70. Use HBase when: you have noisy, semi-structured data; you want to apply massively parallel processing to your problem; to handle huge write loads
71. Use HBase when: you have noisy, semi-structured data; you want to apply massively parallel processing to your problem; to handle huge write loads; as a scalable key/value store
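What that looks like against the 2012-era HBase client API; the table name, column family, and row key here are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Hypothetical table and column family; rows are key/value, schema-light.
    HTable table = new HTable(conf, "observations");

    // Random write: goes to the WAL first, then the in-memory store.
    Put put = new Put(Bytes.toBytes("patient-123"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("heart_rate"), Bytes.toBytes("72"));
    table.put(put);

    // Random read by row key, served by whichever region server hosts it.
    Get get = new Get(Bytes.toBytes("patient-123"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("d"), Bytes.toBytes("heart_rate"))));

    table.close();
  }
}
```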
72. But there are drawbacks: limited schema support, limited atomicity guarantees, no built-in secondary indexes. HBase is a great tool for many jobs, but not every job
73. The data store should align with the needs of the application
74. So a pattern is emerging: [diagram: Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (MapReduce jobs) → Storage (MPP relational database, document store, HBase)]
75. But we have a potential bottleneck [the same pipeline diagram]
76. Direct inserts are designed for online updates, not massively parallel data loads. So shift the work into MapReduce, and pre-build files for bulk import: Oracle Loader for Hadoop, HBase HFile import, bulk loads for MPP
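For the HBase case, a minimal sketch of the job setup (the table name is hypothetical, the mapper and input format are omitted, and newer HBase versions use HFileOutputFormat2):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

public class BulkLoadSetup {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(HBaseConfiguration.create(), "pre-build HFiles");
    job.setJarByClass(BulkLoadSetup.class);
    HTable table = new HTable(job.getConfiguration(), "observations");

    // Sets the reducer, partitioner, and output format so the job writes
    // HFiles pre-split to match the table's regions.
    HFileOutputFormat.configureIncrementalLoad(job, table);
    job.waitForCompletion(true);
    // Afterwards, hand the finished files to HBase region servers, e.g.
    // with the completebulkload tool (LoadIncrementalHFiles).
  }
}
```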
77. And we’re missing an important piece: [the same pipeline diagram]
78. And we’re missing an important piece: [the pipeline diagram, now with a Realtime Processing path added alongside the batch Map/Reduce jobs]
79. How do we make it fast? A Speed Layer over a Batch Layer (see http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems)
80. How do we make it fast? Speed Layer: hours of data, incremental updates, low latency (seconds to process), move data to computation. Batch Layer: years of data, bulk loads, high latency (minutes or hours to process), move computation to data
81. How do we make it fast? Speed Layer: Complex Event Processing with Storm. Batch Layer: Hadoop MapReduce
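A minimal sketch of a speed-layer topology in the Storm API of that era (backtype.storm); the spout and bolt classes are hypothetical stand-ins:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class SpeedLayerTopology {
  public static void main(String[] args) {
    // Hypothetical spout and bolts: a stream of updates flows through the
    // topology and increments low-latency views record by record.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("updates", new UpdateSpout(), 4);
    builder.setBolt("parse", new ParseBolt(), 8).shuffleGrouping("updates");
    builder.setBolt("count", new CountBolt(), 8)
           .fieldsGrouping("parse", new Fields("key"));

    // Run locally for development; StormSubmitter would deploy to a cluster.
    Config conf = new Config();
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("speed-layer", conf, builder.createTopology());
  }
}
```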
82. And now, the challenge…
83. Process all data overnight
84. Process all data overnight: quickly create new data models (fast iteration cycles mean fast innovation), simple correction of any bugs, much easier to understand and work with
85. Questions?
