A real-time architecture using Hadoop & Storm - Nathan Bijnens & Geert Van Landeghem - DataCrunchers

Presented at JAX London 2013

With the proliferation of data sources and growing user bases, the amount of data being generated calls for new approaches to storage and processing. Hadoop opened up new possibilities, yet it falls short of instant delivery. Adding stream processing with Nathan Marz’s Storm can overcome this delay and bridge the gap to real-time aggregation and reporting. The batch layer keeps all master data, and that data is immutable. Once the base data is stored, a recurring process indexes it: the process reads all master data, parses it, and creates new views from it.
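
The recurring batch process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory list stands in for a master dataset on a distributed file system, the `LocationEvent` type and the `build_city_counts_view` function are hypothetical names, and the sample rows are the person/location events used later in the slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be modified once created
class LocationEvent:
    person: str
    location: str
    date: str  # ISO date, so string order equals chronological order


# Append-only master dataset (hypothetical in-memory stand-in):
# a move is recorded as a new event, existing events are never updated.
MASTER_DATA = [
    LocationEvent("Nathan", "Antwerp", "2005-01-01"),
    LocationEvent("Geert", "Dendermonde", "2011-10-08"),
    LocationEvent("John", "Ghent", "2010-05-02"),
    LocationEvent("Nathan", "Ghent", "2013-02-03"),
]


def build_city_counts_view(events):
    """Recompute the 'people per city' view from scratch from all events."""
    latest = {}
    for event in sorted(events, key=lambda e: e.date):
        latest[event.person] = event.location  # later events override earlier ones
    return dict(Counter(latest.values()))


if __name__ == "__main__":
    print(build_city_counts_view(MASTER_DATA))
```

Because the view is a pure function of the master data, it can be thrown away and rebuilt at any time; nothing in the master dataset is touched by the computation.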

Published in: Technology, Business
  • How much data do you have? 44 times as much data in the next decade, 15 ZB in 2015. Data silos (ERP, CRM, …). Customers: Trimble (3 TB in their database system), Truvo (changing an index takes 24 hours). Traditional systems cannot handle this volume. Examples: turn 12 terabytes of Tweets created each day into improved product sentiment analysis; convert 350 billion annual meter readings to better predict power consumption.
  • Real time: time-sensitive decision making, e.g. fraud detection, energy allocation, marketing campaigns, market transactions. Solution: real-time solutions in combination with batch (Hadoop); NoSQL systems.
  • Structured vs. unstructured: 80% is unstructured data. A key drawback of traditional relational database systems is that they are not good at handling variable data. A flexible data model. Word documents, email, photos, text, video, APIs, … What are your needs regarding variety? The end result: bringing structure into unstructured data. Examples: monitor hundreds of live video feeds from surveillance cameras to target points of interest; exploit the 80% data growth in images, video and documents to improve customer satisfaction.
  • We can afford to keep immutable copies of lots of data. We need immutability to coordinate with fewer challenges. Semaphores and locks are the things to avoid: the instruction opportunities lost waiting for a semaphore increase with more cores…
  • The number of followers on Twitter = all follows and unfollows combined. Another example: an account balance.
  • Data = event. In an ever-changing world we found a ‘safe haven’ for data. Everything we do generates events: paying with a credit card, committing to Git, clicking on a web page, tweeting.
  • It is easier to store all data in a cost-effective way. Compare to the DWH world.
  • Immutability greatly restricts the range of errors that can cause data loss or data corruption. E.g. only CR, no more CRUD. Information might of course change. Fault tolerance: data loss (human error, hardware failure), data corruption. Parallel with functional programming.
  • Allows state regeneration, e.g. what was my bank balance on 1 May 2005?
  • Queries as pure functions that take all data as input is the most general formulation. Different functions may look at different portions of the data and aggregate information in different ways.
  • Too slow; might be petabyte scale. Impala/Drill: why not?
  • The batch layer can calculate anything (given enough time).
  • The batch layer stores the data normalized, but in the views it generates, data is often, if not always, denormalized.
  • Not vertically
  • It’s OK to croak and restart
  • Is something really immutable when its name can change?
  • It doesn’t have to be Hadoop. What matters here is a distributed file system combined with a processing framework. Spark, …
  • Sources: PolybasePass2012.pptx; http://whyjava.wordpress.com/2011/08/04/how-i-explained-mapreduce-to-my-wife/
  • Value of schemas: structural integrity; guarantees on what can and can’t be stored; prevents corruption. Otherwise you’ll detect corruption issues at read time. Source: http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
  • But this can be solved, e.g. by generating your views in advance with ES.
  • In some circumstances.
  • All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
  • Consistency (all nodes see the same data at the same time). Availability (a guarantee that every request receives a response about whether it was successful or failed). Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system). HBase vs. Cassandra. Source: http://codahale.com/you-cant-sacrifice-partition-tolerance/
  • E.g. unique counts, ML.
  • Nimbus: manages the cluster. Worker node: runs a Supervisor, which manages workers and restarts them if needed. Executor: physical JVM process; executes tasks (those are spread evenly across the workers). Tasks: each runs in its own thread; a task is the actual bolt or spout and processes the stream.
  • Tuple: named list of values, dynamically typed. Stream: sequence of tuples.
  • Spout: source of streams; sometimes replayable. Bolt: stream transformations; at least 1 input stream; 0 - * output streams.
  • The serving layer needs to be able to answer any query in a short amount of time.
  • AVG: store the sum and the count (average = sum / count). Pre-aggregate, but not everything is possible.
  • The name lambda comes from Alonzo Church, who needed a letter for function abstraction in the theory of computation in the 1930s.
  • High tolerance for human & system errors.
  • Data storage layer optimized independently from query resolution layer
  • If you remember one thing from this presentation, let it be: immutability.
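
The notes above mention state regeneration ("what was my bank balance on 1 May 2005?"). A minimal Python sketch of that idea follows, assuming an append-only list of transaction events. The `TRANSACTIONS` data and the `balance_on` function are hypothetical and only serve as illustration; they do not come from the talk.

```python
from datetime import date

# Hypothetical append-only ledger: (booking date, signed amount).
# Deposits are positive, withdrawals negative; rows are never edited.
TRANSACTIONS = [
    (date(2005, 3, 1), 1000),
    (date(2005, 4, 15), -250),
    (date(2005, 5, 20), -100),
    (date(2006, 1, 10), 500),
]


def balance_on(transactions, as_of):
    """Regenerate the balance by replaying every event up to and including as_of."""
    return sum(amount for booked, amount in transactions if booked <= as_of)


if __name__ == "__main__":
    print(balance_on(TRANSACTIONS, date(2005, 5, 1)))
    print(balance_on(TRANSACTIONS, date.today()))
```

With a mutable "current balance" column the first question could not be answered at all; with immutable, time-based events it is just another query over the same data.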

    1. A real-time architecture using Hadoop and Storm.
    2. Speaker: Nathan Bijnens, @nathan_gs. A real-time architecture using Hadoop & Storm. #JaxLondon
    3. Our Vision. Big Data: Volume.
    4. Big Data: Velocity.
    5. Our Vision: Volume, Variety.
    6. Computing Trends (Source: Immutability Changes Everything - Pat Helland, RICON2012)
        Past                                          |  Current
        Computation (CPUs) Expensive                  |  Computation Cheap (Many Core Computers)
        Disk Storage Expensive                        |  Disk Storage Cheap (Cheap Commodity Disks)
        DRAM Expensive                                |  DRAM / SSD Getting Cheap
        Coordination Easy (Latches Don’t Often Hit)   |  Coordination Hard (Latches Stall a Lot, etc.)
    7. Credits: Nathan Marz. Ex-BackType & Twitter; startup in stealth mode. Storm, Cascalog, ElephantDB. manning.com/marz
    8. A Data System
    9. Data is more than Information: Not all information is equal. Some information is derived from other pieces of information.
    10. Data is more than Information: Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists. Let’s call this ‘data’, very similar to ‘event’.
    11. Events - Before: Events used to manipulate the master data.
    12. Events - After: Today, events are the master data.
    13. Data System: Let’s store everything.
    14. Events: Data is immutable.
    15. Events: Data is time based.
    16. Capturing change traditionally
        Person  Location      |  Person  Location
        Nathan  Antwerp       |  Nathan  Ghent
        Geert   Dendermonde   |  Geert   Dendermonde
        John    Ghent         |  John    Ghent
    17. Capturing change
        Person  Location     Timestamp   |  Person  Location     Time
        Nathan  Antwerp      2005-01-01  |  Nathan  Antwerp      2005-01-01
        Geert   Dendermonde  2011-10-08  |  Geert   Dendermonde  2011-10-08
        John    Ghent        2010-05-02  |  John    Ghent        2010-05-02
                                         |  Nathan  Ghent        2013-02-03
    18. Query: The data you query is often transformed, aggregated, ... It is rarely used in its original form.
    19. Query: Query = function ( all data )
    20. Number of people living in each city.
        Person  Location     Time        |  Location     Count
        Nathan  Antwerp      2005-01-01  |  Ghent        2
        Geert   Dendermonde  2011-10-08  |  Dendermonde  1
        John    Ghent        2010-05-02  |
        Nathan  Ghent        2013-02-03  |
    21. Query. Diagram labels: All Data, Query.
    22. Query: Precompute. Diagram labels: All Data, Precomputed View, Query.
    23. Layered Architecture: Batch Layer, Speed Layer, Serving Layer.
    24. Layered Architecture. Diagram labels: Query, Cassandra, Incoming Data, Hadoop, ElephantDB.
    25. Batch Layer
    26. Batch Layer. Diagram labels: Incoming Data, Hadoop, ElephantDB.
    27. Batch Layer: Unrestrained computation.
    28. Batch Layer: No need to de-normalize.
    29. Batch Layer: Horizontally scalable.
    30. Batch Layer: High latency. Let’s pretend temporarily that update latency doesn’t matter.
    31. Batch Layer: Functional computation, based on immutable inputs, is idempotent.
    32. Batch Layer: Stores the master copy of the data set, append only.
    33. Batch Layer
    34. Batch: View generation. Diagram labels: Master Dataset, MapReduce, View #1, View #2, View #3.
    35. MapReduce. MAP: 1. Take a large data set and divide it into subsets. 2. Perform the same function, DoWork(), on all subsets. REDUCE: 3. Combine the output from all subsets into the output.
    36. Serialization & Schema: Catch errors as quickly as they happen. Validation on write vs. on read.
    37. Serialization & Schema: CSV is actually a serialization language that is just poorly defined.
    38. Serialization & Schema: Use a format with a schema: Thrift, Avro, Protocol Buffers. Added bonus: it’s faster and uses less space.
    39. Batch View Database: Read-only database. No random writes required.
    40. Batch View Database: Every iteration produces the views from scratch.
    41. Batch View Database: ElephantDB, Splout, Voldemort, …
    42. Batch Layer: We are not done yet… Just a few hours of data. Timeline labels: Data absorbed into Batch Views, Not yet absorbed, Now, Time.
    43. Speed Layer
    44. Overview. Diagram labels: Cassandra, Incoming Data, Hadoop, ElephantDB.
    45. Speed Layer: Stream processing.
    46. Speed Layer: Continuous computation.
    47. Speed Layer: Transactional.
    48. Speed Layer: Storing a limited window of data. Compensating for the last few hours of data.
    49. Speed Layer: All the complexity is isolated in the speed layer. If anything goes wrong, it’s auto-corrected.
    50. CAP: You have a choice between: Availability - queries are eventually consistent. Consistency - queries are consistent.
    51. Eventual accuracy: Some algorithms are hard to implement in real time. For those cases we could estimate the results.
    52. Speed Layer. Diagram labels: Incoming Data, Real Time View 1, Real Time View 2.
    53. Storm: Message passing. Distributed processing. Horizontally scalable. Incremental algorithms. Fast. Data in motion.
    54. Storm. Diagram labels: Nimbus, Zookeeper, and three Worker Nodes, each with a Supervisor and several Executors.
    55. Storm: Tuple, Stream.
    56. Storm: Spout, Bolt.
    57. Storm: Grouping.
    58. Data Ingestion: Kafka, Flume, Scribe, *MQ, Kestrel.
    59. Speed Layer Views: The views are stored in a read & write database: Cassandra, HBase, Redis, MySQL, ElasticSearch, … Much more complex than a read-only view.
    60. Serving Layer
    61. Overview. Diagram labels: Query, Cassandra, Incoming Data, Hadoop, ElephantDB.
    62. Serving Layer: Random reads.
    63. Serving Layer: This layer queries the batch and real-time views and merges them.
    64. Serving Layer. Diagram labels: Batch Views, Merge, Real Time Views.
    65. Serving Layer: How to query an average?
    66. Overview
    67. Overview. Diagram labels: Query, Cassandra, Incoming Data, Hadoop, ElephantDB.
    68. Lambda Architecture
    69. Lambda Architecture: Can discard any view, batch and real time, and just recreate everything from the master data.
    70. Lambda Architecture: Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view.
    71. Lambda Architecture: Data storage is highly optimized.
    72. Lambda Architecture: Immutability changes everything.
    73. Questions? @nathan_gs & #BigDataCon13
    74. DataCrunchers: We enable companies in envisioning, defining and implementing a data strategy. A one-stop-shop for all your Big Data needs. The first Big Data consultancy agency in Belgium.
    75. Thank you. @nathan_gs
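
The serving-layer slides ask how to query an average when part of the data sits in a batch view and the rest in a real-time view. One possible answer, sketched here in Python under stated assumptions, is to have both views store a sum and a count per key and to merge those before dividing. The `AvgPartial` type, the `query_average` function, the dictionary-based views, and the sample numbers are all hypothetical; real views would live in stores such as ElephantDB and Cassandra.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgPartial:
    """Pre-aggregated pieces of an average: a running sum and a count."""
    total: float
    count: int

    def merge(self, other):
        # Sums and counts can be added safely; averages cannot.
        return AvgPartial(self.total + other.total, self.count + other.count)

    def average(self):
        if self.count == 0:
            raise ValueError("no data for this key")
        return self.total / self.count


def query_average(batch_view, realtime_view, key):
    """Merge the batch and real-time partials for a key, then divide once."""
    empty = AvgPartial(0.0, 0)
    merged = batch_view.get(key, empty).merge(realtime_view.get(key, empty))
    return merged.average()


if __name__ == "__main__":
    # Hypothetical views: the batch view has absorbed four readings,
    # the real-time view holds one reading that arrived since the last batch run.
    batch_view = {"sensor-1": AvgPartial(100.0, 4)}
    realtime_view = {"sensor-1": AvgPartial(50.0, 1)}
    print(query_average(batch_view, realtime_view, "sensor-1"))
```

Averaging the two per-view averages (25 and 50) would give 37.5, while the merged sum and count give 150 / 5 = 30. This is why the views store the parts of the aggregate rather than the final number.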