A real time architecture using Hadoop and Storm @ FOSDEM 2013

  1. A real-time architecture using Hadoop and Storm.
  2. Speakers: Nathan Bijnens (@nathan_gs), Geert Van Landeghem (@gvanlandeghem).
  3. Our Vision: Big Data, Volume.
  4. Big Data: Velocity.
  5. Our Vision: Volume, Variety.
  6. Credits: Nathan Marz, engineer at BackType (now Twitter). Storm, Cascalog, ElephantDB. manning.com/marz
  7. A Data System
  8. Data is more than Information: Not all information is equal. Some information is derived from other pieces of information.
  9. Data is more than Information: Eventually you will reach the most ... This is the information you hold true, simply because it exists.
  10. Events: Everything we do generates events: pay with a credit card, commit to Git, click on a webpage, tweet.
  11. Events - Before: Events used to manipulate the master data.
  12. Events - After: Today, events are the master data.
  13. Data System: ... everything.
  14. Events: Data is immutable.
  15. Events: Data is time based.
  16. Capturing change traditionally: the existing row is overwritten.
      Before                   After
      Person  Location         Person  Location
      Nathan  Antwerp          Nathan  Ghent
      Geert   Dendermonde      Geert   Dendermonde
      John    Ghent            John    Ghent
  17. Capturing change: a new time-stamped row is appended.
      Before                               After
      Person  Location     Time            Person  Location     Time
      Nathan  Antwerp      2005-01-01      Nathan  Antwerp      2005-01-01
      Geert   Dendermonde  2011-10-08      Geert   Dendermonde  2011-10-08
      John    Ghent        2010-05-02      John    Ghent        2010-05-02
                                           Nathan  Ghent        2013-02-03
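A minimal Python sketch of the time-based, append-only idea from the "Capturing change" slide. The `LocationEvent` class and both helper functions are hypothetical names chosen for illustration; the slides do not prescribe any particular representation.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class LocationEvent:
    person: str
    location: str
    time: date


# The master data is a log of facts, taken from the slide's table.
events = [
    LocationEvent("Nathan", "Antwerp", date(2005, 1, 1)),
    LocationEvent("Geert", "Dendermonde", date(2011, 10, 8)),
    LocationEvent("John", "Ghent", date(2010, 5, 2)),
]

# A move is captured by appending a new fact; nothing is overwritten.
events.append(LocationEvent("Nathan", "Ghent", date(2013, 2, 3)))


def current_locations(log):
    """Derive the latest known location per person from the full log."""
    latest = {}
    for event in sorted(log, key=lambda e: e.time):
        latest[event.person] = event.location
    return latest


def location_on(log, person, day):
    """Answer a historical question that an overwritten row could not."""
    known = [e for e in log if e.person == person and e.time <= day]
    return max(known, key=lambda e: e.time).location if known else None
```

Because the older fact is still in the log, `location_on(events, "Nathan", date(2010, 1, 1))` can still recover Antwerp after Nathan has moved to Ghent.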
  18. Query: The data you query is often transformed, aggregated, ...
  19. Query: Query = function(data)
  20. Number of people living in each city.
      All data                             Result
      Person  Location     Time            Location     Count
      Nathan  Antwerp      2005-01-01      Ghent        2
      Geert   Dendermonde  2011-10-08      Dendermonde  1
      John    Ghent        2010-05-02
      Nathan  Ghent        2013-02-03
  21. Query: All Data -> Query
  22. Query: Precompute. All Data -> Precomputed View -> Query
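The "Query = function(data)" and "Precompute" slides can be sketched in a few lines of Python. This is an illustration only: the tuple layout, `people_per_city`, `precomputed_view` and `query` are assumed names, and a real system would compute the view with a batch job rather than in memory.

```python
from collections import Counter

# (person, location, time) facts from the slide; ISO dates sort correctly as strings.
ALL_DATA = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]


def people_per_city(data):
    """Query = function(all data): count people by their latest known city."""
    latest = {}
    for person, location, _time in sorted(data, key=lambda row: row[2]):
        latest[person] = location  # later facts win
    return dict(Counter(latest.values()))


# Precompute: run the expensive function once over all data...
precomputed_view = people_per_city(ALL_DATA)


def query(city):
    """...so that a query becomes a cheap lookup in the precomputed view."""
    return precomputed_view.get(city, 0)
```

With the slide's four rows, the view comes out as Ghent 2 and Dendermonde 1, matching the result table on the slide.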
  23. Layered Architecture: Batch Layer, Speed Layer, Serving Layer.
  24. Layered Architecture (diagram labels): Cassandra, Query, Incoming Data, Hadoop, ElephantDB.
  25. Batch Layer
  26. Batch Layer: Incoming Data -> Hadoop -> ElephantDB
  27. Batch Layer: Unrestrained computation.
  28. Batch Layer: Horizontally scalable.
  29. Batch Layer: High latency. ... matter.
  30. Batch Layer: Stores the master copy of the data set... append only.
  31. Batch Layer
  32. Batch: View generation. Master Dataset -> MapReduce -> View #1, View #2, View #3
  33. MapReduce
      1. Take a large problem and divide it into sub-problems.
      2. Perform the same function on all sub-problems (MAP): DoWork() DoWork() DoWork() ...
      3. Combine the output from all sub-problems (REDUCE): Output.
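The three MapReduce steps can be mimicked in a single Python process. This is a sketch of the idea only, not Hadoop's API: `split_into_chunks`, `do_work` and `combine` are hypothetical names, and on a real cluster the chunks would be processed in parallel on different machines.

```python
from collections import defaultdict


def split_into_chunks(records, size):
    """Step 1: divide the large problem into sub-problems."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def do_work(chunk):
    """Step 2 (MAP): the same function runs on every sub-problem."""
    return [(city, 1) for city in chunk]


def combine(mapped_chunks):
    """Step 3 (REDUCE): combine the output of all sub-problems by key."""
    totals = defaultdict(int)
    for pairs in mapped_chunks:
        for key, value in pairs:
            totals[key] += value
    return dict(totals)


cities = ["Ghent", "Antwerp", "Ghent", "Dendermonde", "Ghent"]
chunks = split_into_chunks(cities, 2)          # three sub-problems
mapped = [do_work(chunk) for chunk in chunks]  # independent, so parallelizable
output = combine(mapped)
```

Since each `do_work` call only sees its own chunk, adding machines lets more chunks run at once, which is the horizontal scalability the Batch Layer slides refer to.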
  34. Batch View Database: Read-only database. No random writes required.
  35. Batch View Database: ElephantDB, Splout.
  36. Batch Layer: Looking back from now, only the last few hours of data are not yet absorbed; everything older is absorbed into the Batch Views.
  37. Speed Layer
  38. Overview (diagram labels): Cassandra, Incoming Data, Hadoop, ElephantDB.
  39. Speed Layer: Stream processing.
  40. Speed Layer: Continuous computation.
  41. Speed Layer: Transactional.
  42. Speed Layer: Storing a limited window of data. Compensating for the last few hours of data.
  43. Speed Layer: All the complexity is isolated in the Speed Layer. ... auto-corrected.
  44. CAP: You have a choice between availability (queries are eventually consistent) and consistency (queries are consistent).
  45. Eventual accuracy: Some algorithms are hard to implement in real time. For those cases we could estimate the results.
  46. Speed Layer: Incoming Data -> Real Time View 1, Real Time View 2
  47. Storm: Message passing. Distributed processing. Horizontally scalable. Incremental algorithms. Fast. Data in motion.
  48. Storm: Message passing. Distributed processing. Horizontally scalable. Incremental algorithms. Fast. Data in motion.
  49. Storm (cluster diagram): Nimbus, ZooKeeper, and three Worker Nodes, each with a Supervisor and three Workers.
  50. Storm: Tuple, Stream.
  51. Storm: Spout, Bolt.
  52. Storm: Grouping.
  53. Speed Layer Views: The views are stored in a read & write database (Cassandra, HBase, MongoDB, MySQL, ElasticSearch). Much more complex than a read-only view.
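To make the spout, bolt and grouping vocabulary concrete, here is a plain-Python stand-in. It deliberately does not use Storm's API: `click_spout`, `CountBolt` and `fields_grouping` are hypothetical names, the "tuples" are ordinary Python tuples, and the real-time view is an in-memory dict rather than a read & write database.

```python
import zlib
from collections import defaultdict


def click_spout(clicks):
    """Stand-in for a spout: the source that emits a stream of tuples."""
    for page in clicks:
        yield (page,)


class CountBolt:
    """Stand-in for a bolt: consumes tuples and updates its view incrementally."""

    def __init__(self):
        self.view = defaultdict(int)  # assumed in-memory store, for illustration

    def execute(self, tup):
        (page,) = tup
        self.view[page] += 1  # incremental update, no recomputation


def fields_grouping(tup, n_tasks):
    """Route tuples with the same field value to the same bolt instance."""
    return zlib.crc32(tup[0].encode("utf-8")) % n_tasks


bolts = [CountBolt() for _ in range(2)]
for tup in click_spout(["/home", "/jobs", "/home", "/home"]):
    bolts[fields_grouping(tup, len(bolts))].execute(tup)

# Each key lives on exactly one bolt, so the partial views never overlap.
realtime_view = {}
for bolt in bolts:
    realtime_view.update(bolt.view)
```

Grouping by field is what keeps each counter on a single instance; with random routing, the per-page counts would be spread over several bolts and would need an extra aggregation step.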
  54. Serving Layer
  55. Overview (diagram labels): Cassandra, Query, Incoming Data, Hadoop, ElephantDB.
  56. Serving Layer: This layer queries the Batch & Real Time views and merges them.
  57. Serving Layer: Batch Views + Real Time Views -> Merge
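A small Python sketch of the merge step, under two assumptions that the slides do not state: the views hold additive counts, and the batch and real-time views cover disjoint time ranges, so adding them does not count anything twice. `merge_views` and the sample numbers are made up for illustration.

```python
def merge_views(batch_view, realtime_view):
    """Combine a batch view with the real-time view that covers newer data."""
    merged = dict(batch_view)  # copy, so the batch view itself stays untouched
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical page-view counts.
batch_view = {"/home": 1200, "/jobs": 85}    # everything already absorbed
realtime_view = {"/home": 7, "/contact": 2}  # only the last few hours

merged = merge_views(batch_view, realtime_view)
```

For views that are not simple sums (distinct counts, for example), the merge function would have to be designed per view; addition only works here because counts are additive.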
  58. Overview
  59. Overview (diagram labels): Cassandra, Query, Incoming Data, Hadoop, ElephantDB.
  60. Lambda Architecture: Can discard any view, batch and real time, and just recreate everything from the master data. Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view. Data storage is highly optimized.
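The "remove the data & recompute" recipe might look like this in Python. It is a toy sketch: `build_view`, `is_bad` and the misspelled sample record are invented for the example, and a real batch layer would rerun a MapReduce job over the master data set instead of a list comprehension.

```python
from collections import Counter


def build_view(master_data):
    """View generation is a pure function of the master data."""
    return dict(Counter(city for _person, city in master_data))


master_data = [
    ("Nathan", "Ghent"),
    ("Geert", "Dendermonde"),
    ("John", "Ghent"),
    ("Jane", "Gehnt"),  # hypothetical bad record: misspelled city
]

broken_view = build_view(master_data)


def is_bad(record):
    """Assumed rule for spotting the bad data in this example."""
    return record[1] == "Gehnt"


# Write bad data? Remove the data and recompute the view from scratch.
master_data = [record for record in master_data if not is_bad(record)]
fixed_view = build_view(master_data)
```

Nothing in the view had to be patched by hand: the old view is discarded and the new one is derived entirely from the corrected master data. A bug in `build_view` would be handled the same way, by fixing the function and rerunning it.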
  61. Recommendations
  62. Serialization & Schema: Catch errors as quickly as they happen. Validation on write vs on read.
  63. Serialization & Schema: CSV is actually a serialization language that is just poorly defined.
  64. Serialization & Schema: Use a format with a schema: Thrift, Avro, Protocol Buffers.
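Validation on write can be illustrated with a plain Python dataclass standing in for a schema. This is not how Thrift, Avro or Protocol Buffers are used; `LocationRecord` and `parse_csv_row` are hypothetical names that only show the principle of rejecting a bad record at the moment it is written.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LocationRecord:
    """Stand-in for a schema: field names, types and constraints in one place."""
    person: str
    location: str
    time: date

    def __post_init__(self):
        # Validation on write: a malformed record never enters the data set.
        if not isinstance(self.person, str) or not self.person:
            raise ValueError("person must be a non-empty string")
        if not isinstance(self.location, str) or not self.location:
            raise ValueError("location must be a non-empty string")
        if not isinstance(self.time, date):
            raise ValueError("time must be a datetime.date")


def parse_csv_row(row):
    """Turn a loosely defined CSV row into a validated record, or fail loudly."""
    person, location, day = row.split(",")
    return LocationRecord(person.strip(), location.strip(),
                          date.fromisoformat(day.strip()))


record = parse_csv_row("Nathan, Ghent, 2013-02-03")
```

A row such as `"Nathan, Ghent, yesterday"` raises `ValueError` when it is parsed, instead of surfacing much later as a confusing failure inside a batch job that reads it.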
  65. Questions? What are your needs? @nathan_gs & @gvanlandeghem
  66. DataCrunchers: We enable companies in envisioning, defining and implementing a data strategy. A one-stop-shop for all your Big Data needs. The first Big Data Consultancy agency in Belgium.
  67. Jobs: We are hiring. jobs@datacrunchers.eu
