A real-time architecture using     Hadoop and Storm.
Speakers Nathan Bijnens                     Geert Van Landeghem @nathan_gs                         @gvanlandeghem         ...
Our Vision Volume  Big Data             test   A real-time architecture using Hadoop & Storm.   3
Big Data   Velocity              test   A real-time architecture using Hadoop & Storm.   4
Our Vision Volume             test                                       Variety                    A real-time architectu...
CreditsNathan Marz Engineer at Backtype (now Twitter). Storm Cascalog ElephantDB                                    mannin...
A Data System     A real-time architecture using Hadoop & Storm.   7
Data is more than Information           Not all information is equal.       Some information is derived from other pieces ...
Data is more than Information    Eventually you will reach the most    This is the information you hold true, simple becau...
EventsEverything we do generates events:-   Pay with Credit Card-   Commit to Git-   Click on a webpage-   Tweet          ...
Events - Before      Events used to manipulate the               master data.                  A real-time architecture us...
Events - After       Today, events are the master                  data.                  A real-time architecture using H...
Data System                   everything.              A real-time architecture using Hadoop & Storm.   13
Events         Data is Immutable              A real-time architecture using Hadoop & Storm.   14
Events         Data is Time Based               A real-time architecture using Hadoop & Storm.   15
Capturing change traditionallyPerson   Location                     Person                 LocationNathan   Antwerp       ...
Capturing changePerson   Location      Time                         Person           Location        TimeNathan   Antwerp ...
Query The data you query is often transformed,             aggregated, ...                  A real-time architecture using...
Query Query = function ( data )           A real-time architecture using Hadoop & Storm.   19
Number of people living in each city. Person   Location      Time                         Location               Count Nat...
Query        All Data                                             Query                   A real-time architecture using H...
Query: Precompute     All Data                  Precomputed                                   View                   Query...
Layered Architecture                 Batch Layer                Speed Layer                Serving Layer                  ...
Layered Architecture                                     Cassandra                                                        ...
Batch Layer    A real-time architecture using Hadoop & Storm.   25
Batch LayerIncoming Data                  Hadoop                                          Elephant                        ...
Batch Layer        Unrestrained computation.                 A real-time architecture using Hadoop & Storm.   27
Batch Layer              Horizontal scalable.                    A real-time architecture using Hadoop & Storm.   28
Batch Layer              High Latency.                  matter.                 A real-time architecture using Hadoop & St...
Batch Layer     Stores master copy of data set...              append only.                  A real-time architecture usin...
Batch Layer              A real-time architecture using Hadoop & Storm.   31
Batch: View generation                                                                View #1    Master Dataset           ...
MapReduce           1. Take a large problem and divide it into sub-problems                                               ...
Batch View Database          Read only database.            No random writes required.                    A real-time arch...
Batch View DatabaseElephantDBSplout                 A real-time architecture using Hadoop & Storm.   35
Batch Layer                                                   Just a few hours of data.                                   ...
Speed Layer    A real-time architecture using Hadoop & Storm.   37
Overview                                   CassandraIncoming Data                  Hadoop                                 ...
Speed Layer              Stream processing.                    A real-time architecture using Hadoop & Storm.   39
Speed Layer        Continuous computation.                A real-time architecture using Hadoop & Storm.   40
Speed Layer              Transactional.                 A real-time architecture using Hadoop & Storm.   41
Speed Layer    Storing a limited window of data.       Compensating for the last few hours of data.                       ...
Speed Layer All the complexity is isolated in the Speed  layer                               auto-                correcte...
CAPYou have a choice between:  Availability - Queries are eventual consistent.  Consistency - Queries are consistent.     ...
Eventual accuracySome algorithms are hard to implement in   real time. For those cases we could           estimate the res...
Speed Layer                                                                  Real                                         ...
StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion.             ...
StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion.             ...
Storm                          Nimbus                                        Zookeeper        Supervisor                 S...
StormTupleStream         A real-time architecture using Hadoop & Storm.   50
StormSpoutBolt        A real-time architecture using Hadoop & Storm.   51
StormGrouping           A real-time architecture using Hadoop & Storm.   52
Speed Layer ViewsThe views are stored in Read & Write database.-   Cassandra-   Hbase-   MongoDB-   MySQL-   ElasticSearch...
Serving Layer     A real-time architecture using Hadoop & Storm.   54
Overview                                   Cassandra                                                                 Query...
Serving Layer  This layer queries the Batch & Real Time             views and merges it.                   A real-time arc...
Serving Layer           Batch           Views                                    Merge            Real           Time     ...
Overview  A real-time architecture using Hadoop & Storm.   58
Overview                                   Cassandra                                                                 Query...
Lambda ArchitectureCan discard any view, batch and real time, and justrecreate everything from the master data.Mistakes ar...
Recommendations      A real-time architecture using Hadoop & Storm.   61
Serialization & Schema      Catch errors as quickly as they happen.               Validation on write vs on read.         ...
Serialization & Schema  CSV is actually a serialization language that is just                    poorly defined.          ...
Serialization & Schema Use a format with a schema.- Thrift- Avro- Protobuffers                      A real-time architectu...
Questions?             What are your needs?             @nathan_gs & @gvanlandeghem                      A real-time archi...
DataCrunchersWe enable companies in envisioning, defining andimplementing a data strategy.A one-stop-shop for all your Big...
Jobs         We are hiring.       jobs@datacrunchers.eu             A real-time architecture using Hadoop & Storm.   67
Upcoming SlideShare
Loading in...5
×

A real time architecture using Hadoop and Storm @ FOSDEM 2013

26,138

Published on

Published in: Technology
3 Comments
67 Likes
Statistics
Notes
No Downloads
Views
Total Views
26,138
On Slideshare
0
From Embeds
0
Number of Embeds
31
Actions
Shares
0
Downloads
744
Comments
3
Likes
67
Embeds 0
No embeds

No notes for slide

A real time architecture using Hadoop and Storm @ FOSDEM 2013

  1. 1. A real-time architecture using Hadoop and Storm.
  2. 2. Speakers Nathan Bijnens Geert Van Landeghem @nathan_gs @gvanlandeghem A real-time architecture using Hadoop & Storm. 2
  3. 3. Our Vision Volume Big Data test A real-time architecture using Hadoop & Storm. 3
  4. 4. Big Data Velocity test A real-time architecture using Hadoop & Storm. 4
  5. 5. Our Vision Volume test Variety A real-time architecture using Hadoop & Storm. 5
  6. 6. CreditsNathan Marz Engineer at Backtype (now Twitter). Storm Cascalog ElephantDB manning.com/marz A real-time architecture using Hadoop & Storm. 6
  7. 7. A Data System A real-time architecture using Hadoop & Storm. 7
  8. 8. Data is more than Information Not all information is equal. Some information is derived from other pieces of information. A real-time architecture using Hadoop & Storm. 8
  9. 9. Data is more than Information Eventually you will reach the most This is the information you hold true, simple because it exists. A real-time architecture using Hadoop & Storm. 9
  10. 10. EventsEverything we do generates events:- Pay with Credit Card- Commit to Git- Click on a webpage- Tweet A real-time architecture using Hadoop & Storm. 10
  11. 11. Events - Before Events used to manipulate the master data. A real-time architecture using Hadoop & Storm. 11
  12. 12. Events - After Today, events are the master data. A real-time architecture using Hadoop & Storm. 12
  13. 13. Data System everything. A real-time architecture using Hadoop & Storm. 13
  14. 14. Events Data is Immutable A real-time architecture using Hadoop & Storm. 14
  15. 15. Events Data is Time Based A real-time architecture using Hadoop & Storm. 15
  16. 16. Capturing change traditionallyPerson Location Person LocationNathan Antwerp Nathan GhentGeert Dendermonde Geert DendermondeJohn Ghent John Ghent A real-time architecture using Hadoop & Storm. 16
  17. 17. Capturing changePerson Location Time Person Location TimeNathan Antwerp 2005-01-01 Nathan Antwerp 2005-01-01Geert Dendermonde 2011-10-08 Geert Dendermonde 2011-10-08John Ghent 2010-05-02 John Ghent 2010-05-02 Nathan Ghent 2013-02-03 A real-time architecture using Hadoop & Storm. 17
  18. 18. Query The data you query is often transformed, aggregated, ... A real-time architecture using Hadoop & Storm. 18
  19. 19. Query Query = function ( data ) A real-time architecture using Hadoop & Storm. 19
  20. 20. Number of people living in each city. Person Location Time Location Count Nathan Antwerp 2005-01-01 Ghent 2 Geert Dendermonde 2011-10-08 Dendermonde 1 John Ghent 2010-05-02 Nathan Ghent 2013-02-03 A real-time architecture using Hadoop & Storm. 20
  21. 21. Query All Data Query A real-time architecture using Hadoop & Storm. 21
  22. 22. Query: Precompute All Data Precomputed View Query A real-time architecture using Hadoop & Storm. 22
  23. 23. Layered Architecture Batch Layer Speed Layer Serving Layer A real-time architecture using Hadoop & Storm. 23
  24. 24. Layered Architecture Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 24
  25. 25. Batch Layer A real-time architecture using Hadoop & Storm. 25
  26. 26. Batch LayerIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 26
  27. 27. Batch Layer Unrestrained computation. A real-time architecture using Hadoop & Storm. 27
  28. 28. Batch Layer Horizontal scalable. A real-time architecture using Hadoop & Storm. 28
  29. 29. Batch Layer High Latency. matter. A real-time architecture using Hadoop & Storm. 29
  30. 30. Batch Layer Stores master copy of data set... append only. A real-time architecture using Hadoop & Storm. 30
  31. 31. Batch Layer A real-time architecture using Hadoop & Storm. 31
  32. 32. Batch: View generation View #1 Master Dataset View #2 MapReduce View #3 A real-time architecture using Hadoop & Storm. 32
  33. 33. MapReduce 1. Take a large problem and divide it into sub-problems … MAP 2. Perform the same function on all sub-problems … DoWork() DoWork() DoWork() 3. Combine the output from all sub-problems REDUCE … Output A real-time architecture using Hadoop & Storm. 33
  34. 34. Batch View Database Read only database. No random writes required. A real-time architecture using Hadoop & Storm. 34
  35. 35. Batch View DatabaseElephantDBSplout A real-time architecture using Hadoop & Storm. 35
  36. 36. Batch Layer Just a few hours of data. Not yet Data absorbed into Batch Views absorbed. Time Now A real-time architecture using Hadoop & Storm. 36
  37. 37. Speed Layer A real-time architecture using Hadoop & Storm. 37
  38. 38. Overview CassandraIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 38
  39. 39. Speed Layer Stream processing. A real-time architecture using Hadoop & Storm. 39
  40. 40. Speed Layer Continuous computation. A real-time architecture using Hadoop & Storm. 40
  41. 41. Speed Layer Transactional. A real-time architecture using Hadoop & Storm. 41
  42. 42. Speed Layer Storing a limited window of data. Compensating for the last few hours of data. A real-time architecture using Hadoop & Storm. 42
  43. 43. Speed Layer All the complexity is isolated in the Speed layer auto- corrected. A real-time architecture using Hadoop & Storm. 43
  44. 44. CAPYou have a choice between: Availability - Queries are eventual consistent. Consistency - Queries are consistent. A real-time architecture using Hadoop & Storm. 44
  45. 45. Eventual accuracySome algorithms are hard to implement in real time. For those cases we could estimate the results. A real-time architecture using Hadoop & Storm. 45
  46. 46. Speed Layer Real Time View 1Incoming Data Real Time View 2 A real-time architecture using Hadoop & Storm. 46
  47. 47. StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion. A real-time architecture using Hadoop & Storm. 47
  48. 48. StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion. A real-time architecture using Hadoop & Storm. 48
  49. 49. Storm Nimbus Zookeeper Supervisor Supervisor Supervisor Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Node Worker Node Worker Node A real-time architecture using Hadoop & Storm. 49
  50. 50. StormTupleStream A real-time architecture using Hadoop & Storm. 50
  51. 51. StormSpoutBolt A real-time architecture using Hadoop & Storm. 51
  52. 52. StormGrouping A real-time architecture using Hadoop & Storm. 52
  53. 53. Speed Layer ViewsThe views are stored in Read & Write database.- Cassandra- Hbase- MongoDB- MySQL- ElasticSearch-Much more complex than a read only view. A real-time architecture using Hadoop & Storm. 53
  54. 54. Serving Layer A real-time architecture using Hadoop & Storm. 54
  55. 55. Overview Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 55
  56. 56. Serving Layer This layer queries the Batch & Real Time views and merges it. A real-time architecture using Hadoop & Storm. 56
  57. 57. Serving Layer Batch Views Merge Real Time Views A real-time architecture using Hadoop & Storm. 57
  58. 58. Overview A real-time architecture using Hadoop & Storm. 58
  59. 59. Overview Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 59
  60. 60. Lambda ArchitectureCan discard any view, batch and real time, and justrecreate everything from the master data.Mistakes are corrected via recomputation.- Write bad data? Remove the data & recompute.- Bug in view generation? Just recompute the view.Data storage is highly optimized. A real-time architecture using Hadoop & Storm. 60
  61. 61. Recommendations A real-time architecture using Hadoop & Storm. 61
  62. 62. Serialization & Schema Catch errors as quickly as they happen. Validation on write vs on read. A real-time architecture using Hadoop & Storm. 62
  63. 63. Serialization & Schema CSV is actually a serialization language that is just poorly defined. A real-time architecture using Hadoop & Storm. 63
  64. 64. Serialization & Schema Use a format with a schema.- Thrift- Avro- Protobuffers A real-time architecture using Hadoop & Storm. 64
  65. 65. Questions? What are your needs? @nathan_gs & @gvanlandeghem A real-time architecture using Hadoop & Storm. 65
  66. 66. DataCrunchersWe enable companies in envisioning, defining andimplementing a data strategy.A one-stop-shop for all your Big Data needs.The first Big Data Consultancy agency in Belgium. A real-time architecture using Hadoop & Storm. 66
  67. 67. Jobs We are hiring. jobs@datacrunchers.eu A real-time architecture using Hadoop & Storm. 67
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×