a real-time architecture using Hadoop and Storm at Devoxx
  • Projects
  • How much data do you have? 44 times as much data in the next decade; 15 ZB in 2015. Data silos (ERP, CRM, …). Traditional systems cannot handle this volume. Turn 12 terabytes of Tweets created each day into improved product sentiment analysis. Convert 350 billion annual meter readings to better predict power consumption.
  • Real time. Time-sensitive decision making: fraud detection, energy allocation, marketing campaigns, market transactions. Solution: real-time solutions in combination with batch (Hadoop); NoSQL systems.
  • Structured vs. unstructured: 80% is unstructured data. A key drawback of using traditional relational database systems is that they're not good at handling variable data. A flexible data model. Word, email, photos, text, video, APIs, …? What are your needs regarding variety? The end result: bringing structure into unstructured data. Monitor hundreds of live video feeds from surveillance cameras to target points of interest. Exploit the 80% data growth in images, video and documents to improve customer satisfaction.
  • We can afford to keep immutable copies of lots of data. We need immutability to coordinate with fewer challenges. Semaphores and locks are the things to avoid: instruction opportunities lost waiting for a semaphore increase with more cores…
  • The number of followers on Twitter = all follows and unfollows combined. Another example: an account balance.
  • Data = event = atomic. In an ever-changing world we found some stability. Everything we do generates events: paying with a credit card, committing to Git, clicking on a web page, tweeting.
  • It is easier to store all data in a cost-effective way. Compare to the data warehouse (DWH) world.
  • Immutability greatly restricts the range of errors that can cause data loss or data corruption. E.g. only CR (create, read), no more CRUD. Information might of course change. Fault tolerance: data loss (human error, hardware failure) and data corruption. Parallel with functional programming.
  • Allows state regeneration. E.g. what was my bank balance on 1 May 2005?
  • Queries as pure functions that take all data as input is the most general formulation. Different functions may look at different portions and aggregate information in different ways.
  • Too slow; might be petabyte scale. Impala/Drill: why not?
  • The batch layer can calculate anything (given enough time).
  • The batch layer stores the data normalized, but in the views it generates, data is often, if not always, denormalized.
  • Not vertically
  • It’s OK to croak and restart
  • Is something really immutable when its name can change?
  • Doesn’t have to be Hadoop. What matters here is a distributed file system combined with a processing framework. Spark, …
  • Source: PolybasePass2012.pptx
  • … of schemas: structural integrity; guarantees on what can and can’t be stored; prevents corruption. Otherwise you’ll detect corruption issues at read time.
  • But this can be solved by generating your views in advance, e.g. with ES.
  • In some circumstances.
  • All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
  • Consistency (all nodes see the same data at the same time). Availability (a guarantee that every request receives a response about whether it was successful or failed). Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system). Cassandra.
  • E.g. unique counts, ML.
  • Nimbus: manages the cluster. Worker node: Supervisor: manages workers; restarts them if needed. Executor: physical JVM process; executes tasks (those are spread evenly across the workers). Tasks: each in its own thread; is the actual Bolt or Spout; processes the stream.
  • Tuple: named list of values; dynamically typed. Stream: sequence of tuples.
  • Spout: source of streams; sometimes replayable. Bolt: stream transformations; at least 1 input stream; 0 - * output streams.
  • The serving layer needs to be able to answer any query in a short amount of time.
  • AVG = sum / count: pre-aggregate the sum and the count and divide at query time. Not everything can be pre-aggregated this way.
  • Command Query Responsibility Segregation (CQRS) applies the CQS principle by using separate Query and Command objects to retrieve and modify data respectively. Multiple representations of information. The change that CQRS introduces is to split that conceptual model into separate models for update and display, which it refers to as Command and Query respectively. CQS: a method should either change the state of an object or return a result, but not both. Presentation by Tom Michiels, Monday evening at one of the BOFs.
  • Lambda: first used by Alonzo Church, who needed a letter for functional abstraction in the theory of computation in the 1930s.
  • High tolerance for human & system errors.
  • Data storage layer optimized independently from query resolution layer
  • If you remember one thing from this presentation, it is: immutability.
  • Transcript

    • 1. A real-time architecture using Hadoop and Storm. Nathan Bijnens #DV13-#rtbigdata @nathan_gs
    • 2. Speaker: Nathan Bijnens, DataCrunchers, @nathan_gs
    • 3. Our Vision: Big Data. Volume.
    • 4. Big Data: Velocity
    • 5. Our Vision: Volume, Variety
    • 6. Computing Trends
        Past                                           Current
        Computation (CPUs) expensive                   Computation cheap (many-core computers)
        Disk storage expensive                         Disk storage cheap (cheap commodity disks)
        DRAM expensive                                 DRAM / SSD getting cheap
        Coordination easy (latches don’t often hit)    Coordination hard (latches stall a lot, etc.)
        Source: Immutability Changes Everything - Pat Helland, RICON2012
    • 7. Credits: Nathan Marz. Ex-Backtype & Twitter; startup in stealth mode; Storm; Cascalog; ElephantDB.
    • 8. A Data System
    • 9. Data is more than Information: Not all information is equal. Some information is derived from other pieces of information.
    • 10. Data is more than Information: Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists. Let’s call this ‘data’, very similar to ‘event’.
    • 11. Events - Before: Events used to manipulate the master data.
    • 12. Events - After: Today, events are the master data.
    • 13. Data System: Let’s store everything.
    • 14. Events: Data is immutable.
    • 15. Events: Data is time based.
    • 16. Capturing change traditionally
        Person   Location          Person   Location
        Nathan   Antwerp           Nathan   Ghent
        Geert    Dendermonde       Geert    Dendermonde
        John     Ghent             John     Ghent
    • 17. Capturing change
        Person   Location      Timestamp       Person   Location      Time
        Nathan   Antwerp       2005-01-01      Nathan   Antwerp       2005-01-01
        Geert    Dendermonde   2011-10-08      Geert    Dendermonde   2011-10-08
        John     Ghent         2010-05-02      John     Ghent         2010-05-02
                                               Nathan   Ghent         2013-02-03
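The append-only table on the "Capturing change" slide makes it possible to regenerate state for any point in time. A minimal Python sketch of that idea follows; the function name `location_as_of` and the tuple layout are assumptions made for illustration, not part of the talk.

```python
from datetime import date

# Append-only event log, copied from the "Capturing change" slide.
EVENTS = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]


def location_as_of(events, person, when):
    """Return the last known location of `person` on or before `when`, else None."""
    known = [(ts, loc) for name, loc, ts in events if name == person and ts <= when]
    # max() picks the entry with the latest timestamp.
    return max(known)[1] if known else None


print(location_as_of(EVENTS, "Nathan", date(2010, 1, 1)))    # Antwerp
print(location_as_of(EVENTS, "Nathan", date(2013, 12, 31)))  # Ghent
```

Because nothing is overwritten, the older answer ("Antwerp") stays available after Nathan moves to Ghent.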
    • 18. Query: The data you query is often transformed, aggregated, ... Rarely used in its original form.
    • 19. Query: Query = function ( all data )
    • 20. Number of people living in each city.
        Person   Location      Time            Location      Count
        Nathan   Antwerp       2005-01-01      Ghent         2
        Geert    Dendermonde   2011-10-08      Dendermonde   1
        John     Ghent         2010-05-02
        Nathan   Ghent         2013-02-03
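"Query = function ( all data )" can be illustrated with the city count from this slide. The sketch below is one possible reading, assuming that a person counts for the city of their most recent event; `people_per_city` is a hypothetical name.

```python
from collections import Counter
from datetime import date

# The full event log from the slide.
EVENTS = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]


def people_per_city(events):
    """Pure function over all data: count people by their most recent location."""
    latest = {}
    for name, location, _ts in sorted(events, key=lambda e: e[2]):
        latest[name] = location  # later events overwrite earlier ones
    return Counter(latest.values())


print(dict(people_per_city(EVENTS)))  # {'Ghent': 2, 'Dendermonde': 1}
```

The function has no side effects and reads only its input, so running it again on the same log always yields the same view.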
    • 21. Query: All Data → Query
    • 22. Query: Precompute. All Data → Precomputed View → Query
    • 23. Layered Architecture: Batch Layer, Speed Layer, Serving Layer
    • 24. Layered Architecture (diagram): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
    • 25. Batch Layer
    • 26. Batch Layer (diagram): Incoming Data, Hadoop, ElephantDB
    • 27. Batch Layer: Unrestrained computation.
    • 28. Batch Layer: No need to de-normalize.
    • 29. Batch Layer: Horizontally scalable.
    • 30. Batch Layer: High latency. Let’s pretend temporarily that update latency doesn’t matter.
    • 31. Batch Layer: Functional computation, based on immutable inputs, is idempotent.
    • 32. Batch Layer: Stores master copy of data set... append only.
    • 33. Batch Layer
    • 34. Batch: View generation. Master Dataset → MapReduce → View #1, View #2, View #3
    • 35. MapReduce. MAP: 1. Take a large data set and divide it into subsets … 2. Perform the same function on all subsets: DoWork(), DoWork(), DoWork(), … REDUCE: 3. Combine the output from all subsets … Output
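The three steps on the MapReduce slide can be sketched in plain Python. This is a single-process illustration of the pattern, not Hadoop's API; the word-count workload and the names `split`, `do_work`, `combine` and `map_reduce` are assumptions chosen for the example.

```python
from collections import Counter
from functools import reduce


def split(records, n):
    """Step 1: divide the data set into n subsets."""
    return [records[i::n] for i in range(n)]


def do_work(subset):
    """Step 2: the same function applied to every subset (here: count words)."""
    return Counter(word for line in subset for word in line.split())


def combine(left, right):
    """Step 3: merge two partial results."""
    return left + right


def map_reduce(records, n=3):
    partials = map(do_work, split(records, n))
    return reduce(combine, partials, Counter())


lines = ["storm hadoop", "hadoop", "storm storm"]
print(dict(map_reduce(lines)))
```

Since `do_work` only reads its own subset and `combine` is associative, the result does not depend on how many subsets the data is split into.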
    • 36. MapReduce
    • 37. Serialization & Schema: Catch errors as quickly as they happen. Validation on write vs on read.
    • 38. Serialization & Schema: CSV is actually a serialization language that is just poorly defined.
    • 39. Serialization & Schema: Use a format with a schema: Thrift, Avro, Protocol Buffers. Added bonus: it’s faster & uses less space.
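"Validation on write" can be shown without any of the listed serialization libraries. The sketch below uses a Python dataclass as a stand-in for a Thrift, Avro or Protocol Buffers schema; it does not use their APIs, and `LocationEvent` and `append_event` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LocationEvent:
    """Stand-in for a schema: fixed fields, checked when the record is created."""
    person: str
    location: str
    timestamp: date

    def __post_init__(self):
        if not isinstance(self.person, str) or not self.person:
            raise ValueError("person must be a non-empty string")
        if not isinstance(self.location, str) or not self.location:
            raise ValueError("location must be a non-empty string")
        if not isinstance(self.timestamp, date):
            raise ValueError("timestamp must be a date")


def append_event(log, raw):
    """Validate on write: a bad record is rejected before it reaches the log."""
    log.append(LocationEvent(**raw))


log = []
append_event(log, {"person": "Nathan", "location": "Ghent", "timestamp": date(2013, 2, 3)})
try:
    append_event(log, {"person": "Geert", "location": "", "timestamp": date(2011, 10, 8)})
except ValueError as err:
    print("rejected:", err)
print(len(log))  # 1
```

The record with the empty location never enters the log, so readers do not have to discover the problem later.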
    • 40. Batch View Database: Read-only database. No random writes required.
    • 41. Batch View Database: Every iteration produces the views from scratch.
    • 42. Batch View Database: ElephantDB, Splout, Voldemort, …
    • 43. Batch Layer: We are not done yet… Just a few hours of data. (Timeline: data absorbed into batch views, then data not yet absorbed, up to now.)
    • 44. Speed Layer
    • 45. Overview (diagram): Incoming Data, Hadoop, ElephantDB, Cassandra
    • 46. Speed Layer: Stream processing.
    • 47. Speed Layer: Continuous computation.
    • 48. Speed Layer: Transactional.
    • 49. Speed Layer: Storing a limited window of data. Compensating for the last few hours of data.
    • 50. Speed Layer: All the complexity is isolated in the Speed layer. If anything goes wrong, it’s auto-corrected.
    • 51. CAP: You have a choice between Availability (queries are eventually consistent) and Consistency (queries are consistent). (Diagram: Consistency, Availability, Partition Tolerance.)
    • 52. Eventual accuracy: Some algorithms are hard to implement in real time. For those cases we could estimate the results.
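Unique counts are a typical case where the speed layer can estimate and the batch layer later supplies the exact answer. The talk does not name an algorithm; the sketch below swaps in a k-minimum-values (KMV) estimator as one possible technique. `KMVCounter` and its parameters are assumptions made for illustration.

```python
import hashlib
import heapq


class KMVCounter:
    """Estimate a distinct count from the k smallest hash values seen so far."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept hash
        self._members = set()  # hashes currently kept

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # duplicate of a kept value
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct values: exact
        return int((self.k - 1) / -self._heap[0])


counter = KMVCounter(k=256)
for _ in range(2):  # every user is seen twice
    for i in range(10000):
        counter.add(f"user-{i}")
print(counter.estimate())  # an estimate near 10000, not an exact count
```

Memory stays bounded by `k` regardless of stream size, which is the trade the slide describes: a small error now, corrected once the batch views catch up.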
    • 53. Speed Layer (diagram): Incoming Data → Real Time View 1, Real Time View 2
    • 54. Storm: Message passing. Distributed processing. Horizontally scalable. Incremental algorithms. Fast. Data in motion.
    • 55. Storm
    • 56. Storm: Tuple, Stream
    • 57. Storm: Spout, Bolt
    • 58. Storm: Grouping
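The Storm vocabulary from the last three slides (tuple, stream, spout, bolt, grouping) can be mimicked with Python generators. This is a single-process toy that does not use Storm's API; all function names are hypothetical, and the grouping shown imitates a fields grouping by hashing one field of the tuple.

```python
import zlib
from collections import Counter, defaultdict


def sentence_spout():
    """Spout: a source of tuples (here a fixed list, so it is replayable)."""
    for sentence in ["storm and hadoop", "storm is fast"]:
        yield {"sentence": sentence}


def split_bolt(stream):
    """Bolt: consumes one input stream and emits a stream of word tuples."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def fields_grouping(stream, field, num_tasks):
    """Grouping: tuples with the same field value always go to the same task."""
    routed = defaultdict(list)
    for tup in stream:
        task = zlib.crc32(tup[field].encode("utf-8")) % num_tasks
        routed[task].append(tup)
    return routed


def count_bolt(tuples):
    """Bolt: incremental word count for the tuples routed to one task."""
    return Counter(tup["word"] for tup in tuples)


def run(num_tasks=2):
    routed = fields_grouping(split_bolt(sentence_spout()), "word", num_tasks)
    return {task: count_bolt(tuples) for task, tuples in routed.items()}


for task, counts in sorted(run().items()):
    print(task, dict(counts))
```

Routing by field value is what lets each counting task keep a correct local total: no word is ever split across two tasks.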
    • 59. Data Ingestion: Kafka, Flume, Scribe, *MQ, …
    • 60. Speed Layer Views: The views are stored in a read & write database: Cassandra, HBase, Redis, MySQL, ElasticSearch, … Much more complex than a read-only view.
    • 61. Serving Layer
    • 62. Overview (diagram): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
    • 63. Serving Layer: Random reads.
    • 64. Serving Layer: This layer queries the Batch & Real Time views and merges them.
    • 65. Serving Layer (diagram): Batch Views and Real Time Views feed into Merge.
    • 66. Serving Layer: How to query an average?
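One way to answer the average question is to keep a sum and a count in each view and divide only after merging. The sketch below assumes each view is a small dictionary with `sum` and `count` keys; that layout and the name `merge_average` are illustrative, not taken from the talk.

```python
def merge_average(batch_view, realtime_view):
    """Merge pre-aggregated (sum, count) pairs, then divide at query time."""
    total = batch_view["sum"] + realtime_view["sum"]
    count = batch_view["count"] + realtime_view["count"]
    return total / count if count else None


batch = {"sum": 900, "count": 90}     # average 10 over the absorbed data
realtime = {"sum": 200, "count": 10}  # average 20 over the last few hours

print(merge_average(batch, realtime))  # 11.0
# Averaging the two averages would give (10 + 20) / 2 = 15, which is wrong
# because the two views cover different numbers of events.
```

Sums and counts merge by simple addition, which is why they are stored instead of the average itself.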
    • 67. Overview
    • 68. Overview (diagram): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
    • 69. CQRS. Source: Martin Fowler
    • 70. Lambda Architecture
    • 71. Lambda Architecture: Can discard any view, batch and real time, and just recreate everything from the master data.
    • 72. Lambda Architecture: Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view.
    • 73. Lambda Architecture: Data storage is highly optimized.
    • 74. Lambda Architecture: Immutability changes everything.
    • 75. Questions? @nathan_gs & #DV13
    • 76. DataCrunchers: We enable companies in envisioning, defining and implementing a data strategy. A one-stop-shop for all your Big Data needs. The first Big Data Consultancy agency in Belgium.
    • 77. Thank you. @nathan_gs