A real-time architecture using Hadoop and Storm at Devoxx


  • Projects
  • How much data do you have? 44 times as much data in the next decade; 15 Zb in 2015. Data silos (ERP, CRM, …). Traditional systems cannot handle this volume. Turn 12 terabytes of tweets created each day into improved product sentiment analysis. Convert 350 billion annual meter readings to better predict power consumption.
  • Real time. Time-sensitive decision making: fraud detection, energy allocation, marketing campaigns, market transactions. Solution: real-time solutions in combination with batch (Hadoop); NoSQL systems.
  • Structured, unstructured: 80% is unstructured data. A key drawback of using traditional relational database systems is that they're not good at handling variable data. A flexible data model. Word, email, photos, text, video, APIs, …? What are your needs regarding variety? The end result: bringing structure into unstructured data. Monitor hundreds of live video feeds from surveillance cameras to target points of interest. Exploit the 80% data growth in images, video and documents to improve customer satisfaction.
  • We can afford to keep immutable copies of lots of data. We NEED immutability to coordinate with fewer challenges. Semaphores & locks are the things to avoid: instruction opportunities lost waiting for a semaphore increase with more cores…
  • The number of followers on Twitter = all follows & unfollows combined. Another example: an account balance.
  • Data = event = atomic. In an ever-changing world we found some stability. Everything we do generates events: pay with a credit card, commit to Git, click on a web page, tweet.
  • It is easier to store all data in a cost-effective way. Compare to the DWH world.
  • Immutability greatly restricts the range of errors that can cause data loss or data corruption. E.g. only CR, no more CRUD. Information might of course change. Fault tolerance: data loss (human error, hardware failure), data corruption. Parallel with functional programming.
  • Allows state regeneration. E.g. what was my bank balance on 1 May 2005?
  • Expressing queries as pure functions that take all data as input is the most general formulation. Different functions may look at different portions and aggregate information in different ways.
  • Too slow; might be petabyte scale. Impala/Drill: why not
  • The batch layer can calculate anything (given enough time).
  • The batch layer stores the data normalized, but in the views it generates, data is often, if not always, denormalized.
  • Not vertically
  • It’s OK to croak and restart
  • Is something really immutable when its name can change?
  • Doesn’t have to be Hadoop. What matters here is a distributed file system combined with a processing framework. Spark,
  • Source: PolybasePass2012.pptx, http://whyjava.wordpress.com/2011/08/04/how-i-explained-mapreduce-to-my-wife/
  • http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns Value of schemas: structural integrity; guarantees on what can and can’t be stored; prevents corruption. Otherwise you’ll detect corruption issues at read time.
  • But this can be solved, for example by generating your views in advance with ES.
  • In some circumstances.
  • All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
  • Consistency (all nodes see the same data at the same time). Availability (a guarantee that every request receives a response about whether it was successful or failed). Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system). http://codahale.com/you-cant-sacrifice-partition-tolerance/ HBase vs Cassandra.
  • E.g. unique counts, ML.
  • Nimbus: manages the cluster. Worker node: Supervisor: manages workers; restarts them if needed. Executor: physical JVM process; executes tasks (those are spread evenly across the workers). Tasks: each in its own thread; a task is the actual Bolt or Spout and processes the stream.
  • Tuple: named list of values; dynamically typed. Stream: sequence of tuples.
  • Spout: source of streams; sometimes replayable. Bolt: stream transformations; at least 1 input stream; 0 - * output streams.
  • The serving layer needs to be able to answer any query in a short amount of time.
  • AVG: store sum and count (average = sum / count); pre-aggregate, but not everything is possible.
  • Command Query Responsibility Segregation (CQRS) applies the CQS principle by using separate Query and Command objects to retrieve and modify data, respectively. Multiple representations of information: the change that CQRS introduces is to split that conceptual model into separate models for update and display, which it refers to as Command and Query respectively. CQS: a method should either change the state of an object or return a result, but not both. Presentation: Tom Michiels, Monday evening at one of the BOFs.
  • Lambda was first named by Alonzo Church; he needed a letter for functional abstraction in the theory of computation in the 1930s.
  • High tolerance for human & system errors.
  • Data storage layer optimized independently from query resolution layer
  • If you remember one thing about this presentation, it is: immutability.
  • a real-time architecture using Hadoop and Storm at Devoxx
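The notes above describe data as immutable, time-stamped events ("only CR, no more CRUD") and queries as pure functions over all data, which is what makes state regeneration possible. A minimal Python sketch of that idea follows. The `Event` record, the account name and the amounts are hypothetical and only serve as illustration; they do not come from the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: an event cannot be modified after creation
class Event:
    account: str
    amount: int  # positive = deposit, negative = withdrawal (illustrative)
    day: date


# Append-only master data: events are only ever created and read.
# All values below are made up for the example.
EVENTS = (
    Event("nathan", 100, date(2005, 1, 1)),
    Event("nathan", -30, date(2005, 4, 20)),
    Event("nathan", 50, date(2005, 6, 1)),
)


def balance_on(events, account, day):
    """Pure function of all data: replay every event up to and including `day`."""
    return sum(e.amount for e in events if e.account == account and e.day <= day)


# "What was my bank balance on 1 May 2005?"
print(balance_on(EVENTS, "nathan", date(2005, 5, 1)))  # prints 70
```

Because the balance is never stored, only derived, any past state can be regenerated by replaying the events up to the date of interest.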

    1. A real-time architecture using Hadoop and Storm. Nathan Bijnens #DV13-#rtbigdata @nathan_gs
    2. Speaker: Nathan Bijnens, DataCrunchers, @nathan_gs
    3. Our Vision: Volume (Big Data)
    4. Big Data: Velocity
    5. Our Vision: Volume, Variety
    6. Computing Trends. Past: computation (CPUs) expensive; disk storage expensive; DRAM expensive; coordination easy (latches don’t often hit). Current: computation cheap (many-core computers); disk storage cheap (cheap commodity disks); DRAM / SSD getting cheap; coordination hard (latches stall a lot, etc.). Source: Immutability Changes Everything - Pat Helland, RICON2012
    7. Credits: Nathan Marz. Ex-Backtype & Twitter; startup in stealth mode; Storm; Cascalog; ElephantDB; manning.com/marz
    8. A Data System
    9. Data is more than Information: Not all information is equal. Some information is derived from other pieces of information.
    10. Data is more than Information: Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists. Let’s call this ‘data’, very similar to ‘event’.
    11. Events - Before: Events used to manipulate the master data.
    12. Events - After: Today, events are the master data.
    13. Data System: Let’s store everything.
    14. Events: Data is immutable.
    15. Events: Data is time based.
    16. Capturing change traditionally
            Before                      After
            Person | Location           Person | Location
            Nathan | Antwerp            Nathan | Ghent
            Geert  | Dendermonde        Geert  | Dendermonde
            John   | Ghent              John   | Ghent
    17. Capturing change
            Before                                  After
            Person | Location    | Timestamp        Person | Location    | Time
            Nathan | Antwerp     | 2005-01-01       Nathan | Antwerp     | 2005-01-01
            Geert  | Dendermonde | 2011-10-08       Geert  | Dendermonde | 2011-10-08
            John   | Ghent       | 2010-05-02       John   | Ghent       | 2010-05-02
                                                    Nathan | Ghent       | 2013-02-03
    18. Query: The data you query is often transformed, aggregated, ... It is rarely used in its original form.
    19. Query: Query = function ( all data )
    20. Number of people living in each city.
            Person | Location    | Time             Location    | Count
            Nathan | Antwerp     | 2005-01-01       Ghent       | 2
            Geert  | Dendermonde | 2011-10-08       Dendermonde | 1
            John   | Ghent       | 2010-05-02
            Nathan | Ghent       | 2013-02-03
    21. Query (diagram): All Data → Query
    22. Query: Precompute (diagram): All Data → Precomputed View → Query
    23. Layered Architecture: Batch Layer, Speed Layer, Serving Layer
    24. Layered Architecture (diagram labels): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
    25. Batch Layer
    26. Batch Layer (diagram labels): Incoming Data, Hadoop, ElephantDB
    27. Batch Layer: Unrestrained computation.
    28. Batch Layer: No need to de-normalize.
    29. Batch Layer: Horizontally scalable.
    30. Batch Layer: High latency. Let’s pretend temporarily that update latency doesn’t matter.
    31. Batch Layer: Functional computation, based on immutable inputs, is idempotent.
    32. Batch Layer: Stores the master copy of the data set... append only.
    33. Batch Layer
    34. Batch: View generation (diagram): Master Dataset → MapReduce → View #1, View #2, View #3
    35. MapReduce. MAP: 1. Take a large data set and divide it into subsets. 2. Perform the same function, DoWork(), on all subsets. REDUCE: 3. Combine the output from all subsets into the output.
    36. MapReduce
    37. Serialization & Schema: Catch errors as quickly as they happen. Validation on write vs. on read.
    38. Serialization & Schema: CSV is actually a serialization language that is just poorly defined.
    39. Serialization & Schema: Use a format with a schema: Thrift, Avro, Protobuffers. Added bonus: it’s faster & uses less space.
    40. Batch View Database: Read-only database. No random writes required.
    41. Batch View Database: Every iteration produces the views from scratch.
    42. Batch View Database: ElephantDB, Splout, Voldemort, …
    43. Batch Layer: We are not done yet… Timeline (diagram): data absorbed into batch views, followed by just a few hours of data not yet absorbed, up to now.
    44. Speed Layer
    45. Overview (diagram labels): Incoming Data, Hadoop, ElephantDB, Cassandra
    46. Speed Layer: Stream processing.
    47. Speed Layer: Continuous computation.
    48. Speed Layer: Transactional.
    49. Speed Layer: Storing a limited window of data. Compensating for the last few hours of data.
    50. Speed Layer: All the complexity is isolated in the Speed layer. If anything goes wrong, it’s auto-corrected.
    51. CAP: You have a choice between Availability (queries are eventually consistent) and Consistency (queries are consistent). Diagram labels: Consistency, Availability, Partition Tolerance.
    52. Eventual accuracy: Some algorithms are hard to implement in real time. For those cases we could estimate the results.
    53. Speed Layer (diagram): Incoming Data → Real Time View 1, Real Time View 2
    54. Storm: Message passing. Distributed processing. Horizontally scalable. Incremental algorithms. Fast. Data in motion.
    55. Storm
    56. Storm: Tuple, Stream
    57. Storm: Spout, Bolt
    58. Storm: Grouping
    59. Data Ingestion: Kafka, Flume, Scribe, *MQ, …
    60. Speed Layer Views: The views are stored in a read & write database: Cassandra, HBase, Redis, MySQL, ElasticSearch, … Much more complex than a read-only view.
    61. Serving Layer
    62. Overview (diagram labels): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
    63. Serving Layer: Random reads.
    64. Serving Layer: This layer queries the Batch & Real Time views and merges them.
    65. Serving Layer (diagram): Batch Views + Real Time Views → Merge
    66. Serving Layer: How to query an Average?
    67. Overview
    68. Overview (diagram labels): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
    69. CQRS. Source: martinfowler.com/bliki/CQRS.html – Martin Fowler
    70. Lambda Architecture
    71. Lambda Architecture: Can discard any view, batch and real time, and just recreate everything from the master data.
    72. Lambda Architecture: Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view.
    73. Lambda Architecture: Data storage is highly optimized.
    74. Lambda Architecture: Immutability changes everything.
    75. Questions? @nathan_gs & #DV13, slideshare.net/nathan_gs, nathan@datacrunchers.eu
    76. DataCrunchers: We enable companies in envisioning, defining and implementing a data strategy. A one-stop-shop for all your Big Data needs. The first Big Data Consultancy agency in Belgium.
    77. Thank you. @nathan_gs, nathan@datacrunchers.eu
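The "Number of people living in each city" slide and the MapReduce slide can be tied together in a small sketch. The Python below is a single-process stand-in for MapReduce, not Hadoop code: `map_reduce` is a hypothetical helper written for this illustration, and the two-round formulation is one possible way to compute the view, not necessarily how the speaker implemented it. The input rows are the ones shown on the slide.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy MapReduce: map each record to (key, value) pairs, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# (person, location, time) events as listed on the slide.
EVENTS = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]

# Round 1: current location per person = location of the most recent event.
current = map_reduce(
    EVENTS,
    mapper=lambda e: [(e[0], (e[2], e[1]))],      # person -> (time, location)
    reducer=lambda person, moves: max(moves)[1],  # ISO dates sort correctly as strings
)

# Round 2: number of people per current location.
people_per_city = map_reduce(
    current.items(),
    mapper=lambda item: [(item[1], 1)],           # location -> 1
    reducer=lambda city, ones: sum(ones),
)

print(people_per_city)  # {'Ghent': 2, 'Dendermonde': 1}
```

Nothing is updated in place: the view is a function of all events, so rerunning the two rounds after new events arrive regenerates it from scratch, as the Batch View Database slides describe.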
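The Serving Layer slides ask how to query an average when the answer has to be merged from a batch view and a real-time view. One common approach, consistent with the speaker note about pre-aggregating sum and count, is to store mergeable partial aggregates and derive the average only at query time. The sketch below assumes in-memory dictionaries in place of ElephantDB and Cassandra; `Agg`, `batch_view`, `RealtimeView`, `query_average` and the `sensor-1` readings are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agg:
    """Mergeable partial aggregate: keep sum and count, never the average itself."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        return Agg(self.total + value, self.count + 1)

    def merge(self, other):
        return Agg(self.total + other.total, self.count + other.count)

    @property
    def average(self):
        return self.total / self.count if self.count else None


def batch_view(master_data):
    """Batch layer stand-in: recompute the whole view from scratch."""
    view = {}
    for key, value in master_data:
        view[key] = view.get(key, Agg()).add(value)
    return view


class RealtimeView:
    """Speed layer stand-in: incrementally updated view of not-yet-absorbed data."""

    def __init__(self):
        self.view = {}

    def on_event(self, key, value):
        self.view[key] = self.view.get(key, Agg()).add(value)


def query_average(key, batch, realtime):
    """Serving layer stand-in: merge both views, then derive the average."""
    merged = batch.get(key, Agg()).merge(realtime.view.get(key, Agg()))
    return merged.average


# Illustrative data: two readings already absorbed by batch, one still in the speed layer.
batch = batch_view([("sensor-1", 10.0), ("sensor-1", 20.0)])
rt = RealtimeView()
rt.on_event("sensor-1", 60.0)
print(query_average("sensor-1", batch, rt))  # prints 30.0
```

Averaging the two per-view averages (15 and 60) would give 37.5, which is wrong; merging sums and counts gives the correct 30.0. Not every query decomposes this neatly, which is the caveat the note raises.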