A real time architecture using Hadoop and Storm @ FOSDEM 2013

30,674 views
30,662 views

Published on

Published in: Technology
3 Comments
73 Likes
Statistics
Notes
No Downloads
Views
Total views
30,674
On SlideShare
0
From Embeds
0
Number of Embeds
33
Actions
Shares
0
Downloads
837
Comments
3
Likes
73
Embeds 0
No embeds

No notes for slide

A real time architecture using Hadoop and Storm @ FOSDEM 2013

  1. 1. A real-time architecture using Hadoop and Storm.
  2. 2. Speakers Nathan Bijnens Geert Van Landeghem @nathan_gs @gvanlandeghem A real-time architecture using Hadoop & Storm. 2
  3. 3. Our Vision Volume Big Data test A real-time architecture using Hadoop & Storm. 3
  4. 4. Big Data Velocity test A real-time architecture using Hadoop & Storm. 4
  5. 5. Our Vision Volume test Variety A real-time architecture using Hadoop & Storm. 5
  6. 6. CreditsNathan Marz Engineer at Backtype (now Twitter). Storm Cascalog ElephantDB manning.com/marz A real-time architecture using Hadoop & Storm. 6
  7. 7. A Data System A real-time architecture using Hadoop & Storm. 7
  8. 8. Data is more than Information Not all information is equal. Some information is derived from other pieces of information. A real-time architecture using Hadoop & Storm. 8
  9. 9. Data is more than Information Eventually you will reach the most This is the information you hold true, simple because it exists. A real-time architecture using Hadoop & Storm. 9
  10. 10. EventsEverything we do generates events:- Pay with Credit Card- Commit to Git- Click on a webpage- Tweet A real-time architecture using Hadoop & Storm. 10
  11. 11. Events - Before Events used to manipulate the master data. A real-time architecture using Hadoop & Storm. 11
  12. 12. Events - After Today, events are the master data. A real-time architecture using Hadoop & Storm. 12
  13. 13. Data System everything. A real-time architecture using Hadoop & Storm. 13
  14. 14. Events Data is Immutable A real-time architecture using Hadoop & Storm. 14
  15. 15. Events Data is Time Based A real-time architecture using Hadoop & Storm. 15
  16. 16. Capturing change traditionallyPerson Location Person LocationNathan Antwerp Nathan GhentGeert Dendermonde Geert DendermondeJohn Ghent John Ghent A real-time architecture using Hadoop & Storm. 16
  17. 17. Capturing changePerson Location Time Person Location TimeNathan Antwerp 2005-01-01 Nathan Antwerp 2005-01-01Geert Dendermonde 2011-10-08 Geert Dendermonde 2011-10-08John Ghent 2010-05-02 John Ghent 2010-05-02 Nathan Ghent 2013-02-03 A real-time architecture using Hadoop & Storm. 17
  18. 18. Query The data you query is often transformed, aggregated, ... A real-time architecture using Hadoop & Storm. 18
  19. 19. Query Query = function ( data ) A real-time architecture using Hadoop & Storm. 19
  20. 20. Number of people living in each city. Person Location Time Location Count Nathan Antwerp 2005-01-01 Ghent 2 Geert Dendermonde 2011-10-08 Dendermonde 1 John Ghent 2010-05-02 Nathan Ghent 2013-02-03 A real-time architecture using Hadoop & Storm. 20
  21. 21. Query All Data Query A real-time architecture using Hadoop & Storm. 21
  22. 22. Query: Precompute All Data Precomputed View Query A real-time architecture using Hadoop & Storm. 22
  23. 23. Layered Architecture Batch Layer Speed Layer Serving Layer A real-time architecture using Hadoop & Storm. 23
  24. 24. Layered Architecture Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 24
  25. 25. Batch Layer A real-time architecture using Hadoop & Storm. 25
  26. 26. Batch LayerIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 26
  27. 27. Batch Layer Unrestrained computation. A real-time architecture using Hadoop & Storm. 27
  28. 28. Batch Layer Horizontal scalable. A real-time architecture using Hadoop & Storm. 28
  29. 29. Batch Layer High Latency. matter. A real-time architecture using Hadoop & Storm. 29
  30. 30. Batch Layer Stores master copy of data set... append only. A real-time architecture using Hadoop & Storm. 30
  31. 31. Batch Layer A real-time architecture using Hadoop & Storm. 31
  32. 32. Batch: View generation View #1 Master Dataset View #2 MapReduce View #3 A real-time architecture using Hadoop & Storm. 32
  33. 33. MapReduce 1. Take a large problem and divide it into sub-problems … MAP 2. Perform the same function on all sub-problems … DoWork() DoWork() DoWork() 3. Combine the output from all sub-problems REDUCE … Output A real-time architecture using Hadoop & Storm. 33
  34. 34. Batch View Database Read only database. No random writes required. A real-time architecture using Hadoop & Storm. 34
  35. 35. Batch View DatabaseElephantDBSplout A real-time architecture using Hadoop & Storm. 35
  36. 36. Batch Layer Just a few hours of data. Not yet Data absorbed into Batch Views absorbed. Time Now A real-time architecture using Hadoop & Storm. 36
  37. 37. Speed Layer A real-time architecture using Hadoop & Storm. 37
  38. 38. Overview CassandraIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 38
  39. 39. Speed Layer Stream processing. A real-time architecture using Hadoop & Storm. 39
  40. 40. Speed Layer Continuous computation. A real-time architecture using Hadoop & Storm. 40
  41. 41. Speed Layer Transactional. A real-time architecture using Hadoop & Storm. 41
  42. 42. Speed Layer Storing a limited window of data. Compensating for the last few hours of data. A real-time architecture using Hadoop & Storm. 42
  43. 43. Speed Layer All the complexity is isolated in the Speed layer auto- corrected. A real-time architecture using Hadoop & Storm. 43
  44. 44. CAPYou have a choice between: Availability - Queries are eventual consistent. Consistency - Queries are consistent. A real-time architecture using Hadoop & Storm. 44
  45. 45. Eventual accuracySome algorithms are hard to implement in real time. For those cases we could estimate the results. A real-time architecture using Hadoop & Storm. 45
  46. 46. Speed Layer Real Time View 1Incoming Data Real Time View 2 A real-time architecture using Hadoop & Storm. 46
  47. 47. StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion. A real-time architecture using Hadoop & Storm. 47
  48. 48. StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion. A real-time architecture using Hadoop & Storm. 48
  49. 49. Storm Nimbus Zookeeper Supervisor Supervisor Supervisor Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Node Worker Node Worker Node A real-time architecture using Hadoop & Storm. 49
  50. 50. StormTupleStream A real-time architecture using Hadoop & Storm. 50
  51. 51. StormSpoutBolt A real-time architecture using Hadoop & Storm. 51
  52. 52. StormGrouping A real-time architecture using Hadoop & Storm. 52
  53. 53. Speed Layer ViewsThe views are stored in Read & Write database.- Cassandra- Hbase- MongoDB- MySQL- ElasticSearch-Much more complex than a read only view. A real-time architecture using Hadoop & Storm. 53
  54. 54. Serving Layer A real-time architecture using Hadoop & Storm. 54
  55. 55. Overview Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 55
  56. 56. Serving Layer This layer queries the Batch & Real Time views and merges it. A real-time architecture using Hadoop & Storm. 56
  57. 57. Serving Layer Batch Views Merge Real Time Views A real-time architecture using Hadoop & Storm. 57
  58. 58. Overview A real-time architecture using Hadoop & Storm. 58
  59. 59. Overview Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 59
  60. 60. Lambda ArchitectureCan discard any view, batch and real time, and justrecreate everything from the master data.Mistakes are corrected via recomputation.- Write bad data? Remove the data & recompute.- Bug in view generation? Just recompute the view.Data storage is highly optimized. A real-time architecture using Hadoop & Storm. 60
  61. 61. Recommendations A real-time architecture using Hadoop & Storm. 61
  62. 62. Serialization & Schema Catch errors as quickly as they happen. Validation on write vs on read. A real-time architecture using Hadoop & Storm. 62
  63. 63. Serialization & Schema CSV is actually a serialization language that is just poorly defined. A real-time architecture using Hadoop & Storm. 63
  64. 64. Serialization & Schema Use a format with a schema.- Thrift- Avro- Protobuffers A real-time architecture using Hadoop & Storm. 64
  65. 65. Questions? What are your needs? @nathan_gs & @gvanlandeghem A real-time architecture using Hadoop & Storm. 65
  66. 66. DataCrunchersWe enable companies in envisioning, defining andimplementing a data strategy.A one-stop-shop for all your Big Data needs.The first Big Data Consultancy agency in Belgium. A real-time architecture using Hadoop & Storm. 66
  67. 67. Jobs We are hiring. jobs@datacrunchers.eu A real-time architecture using Hadoop & Storm. 67

×