A real-time (lambda) architecture using Hadoop & Storm (NoSQL Matters Cologne '14)

  1. NoSQL Matters 2014 - A real-time (Lambda) Architecture using Hadoop & Storm - #nosql14: A real-time Lambda Architecture using Hadoop & Storm, NoSQL Matters Cologne 2014, by Nathan Bijnens
  2. Speaker: Nathan Bijnens, Big Data Engineer @ Virdata, @nathan_gs
  3. Computing Trends
     Past: computation (CPUs) expensive; disk storage expensive; coordination easy (latches don’t often hit); DRAM expensive.
     Current: computation cheap (many-core computers); disk storage cheap (cheap commodity disks); coordination hard (latches stall a lot, etc.); DRAM / SSD getting cheap.
     Source: Immutability Changes Everything - Pat Helland, RICON2012
  4. Credits: Nathan Marz
     ● Ex-Backtype & Twitter
     ● Startup in stealth mode
     Creator of
     ● Storm
     ● Cascalog
     ● ElephantDB
     Coined the term Lambda Architecture. manning.com/marz
  5. A Data System
  6. Data is more than Information: Not all information is equal. Some information is derived from other pieces of information.
  7. Data is more than Information: Eventually you will reach the most ‘raw’ form of information. This is the information you hold true, simply because it exists. Let’s call this ‘data’, very similar to ‘event’.
  8. Events: Before. Events used to manipulate the master data.
  9. Events: After. Today, events are the master data.
  10. Data System: Let’s store everything.
  11. Data System: Data is immutable.
  12. Data System: Data is time based.
  13. Capturing change, traditionally:
      INSERT INTO contact (name, city) VALUES ('Nathan', 'Antwerp');
      UPDATE contact SET city = 'Cologne' WHERE name = 'Nathan';
  14. Capturing change, in a Data System:
      INSERT INTO contact (name, city, timestamp) VALUES ('Nathan', 'Antwerp', '2008-10-11 20:00Z');
      INSERT INTO contact (name, city, timestamp) VALUES ('Nathan', 'Cologne', '2014-04-29 10:00Z');
  15. Query: The data you query is often transformed, aggregated, ... It is rarely used in its original form.
  16. Query: Query = function ( all data )
  17. Query: Number of people living in each city

      Person   City      Timestamp
      Nathan   Antwerp   2008-10-11
      John     Cologne   2010-01-23
      Dirk     Antwerp   2012-09-12
      Nathan   Cologne   2014-04-29

      City      Count
      Antwerp   1
      Cologne   2
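
The "Query = function ( all data )" idea can be sketched against the example table above. This is only an illustrative sketch in plain Python: the `EVENTS` list and the `people_per_city` function are names made up here, and the rule that the most recent fact per person wins is an assumption read off the slide's result table.

```python
from collections import Counter
from datetime import date

# The four immutable, timestamped facts from the example table.
EVENTS = [
    ("Nathan", "Antwerp", date(2008, 10, 11)),
    ("John", "Cologne", date(2010, 1, 23)),
    ("Dirk", "Antwerp", date(2012, 9, 12)),
    ("Nathan", "Cologne", date(2014, 4, 29)),
]


def people_per_city(events):
    """Query = function(all data): count people by their most recent city."""
    latest_city = {}
    # Replay the facts in time order; a later fact supersedes an earlier one
    # for the same person, but nothing in the input is ever modified.
    for person, city, _timestamp in sorted(events, key=lambda e: e[2]):
        latest_city[person] = city
    return Counter(latest_city.values())


counts = people_per_city(EVENTS)
print(dict(counts))
```

Because the function only reads the full event list, it can be re-run at any time and yields the same view for the same input.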
  18. Query (diagram): All Data → Precomputed View → Query
  19. Layered Architecture: Batch Layer, Speed Layer, Serving Layer
  20. Layered Architecture (diagram labels): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
  21. Batch Layer
  22. Batch Layer (diagram labels): Incoming Data, Hadoop, ElephantDB
  23. Batch Layer: Unrestrained computation. The batch layer can calculate anything, given enough time...
  24. Batch Layer: No need to de-normalize. The batch layer stores the data normalized; the generated views are often, if not always, denormalized.
  25. Batch Layer: Horizontally scalable.
  26. Batch Layer: High latency. Let’s for now pretend the update latency doesn’t matter.
  27. Batch Layer: Functional computation, based on immutable inputs, is idempotent.
  28. Batch Layer: Stores a master copy of the data set … append only
  29. Batch Layer
  30. Batch: view generation. Master Dataset → MapReduce → View #1; Master Dataset → MapReduce → View #2; Master Dataset → MapReduce → View #3
  31. MapReduce
      1. Take a large data set and divide it into subsets
      2. Perform the same function on all subsets
      3. Combine the output from all subsets
      (Diagram: DoWork() runs on each subset in the MAP phase; the REDUCE phase combines the results into the output.)
  32. MapReduce
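
The three MapReduce steps listed on slide 31 can be sketched in a single process. This is a toy illustration in plain Python, not Hadoop code: `split`, `do_work`, `combine` and `map_reduce` are hypothetical names, and the `RECORDS` list is assumed sample input (each person's current city).

```python
from collections import Counter
from functools import reduce

# Assumed sample input: one (person, city) record per person.
RECORDS = [("John", "Cologne"), ("Dirk", "Antwerp"), ("Nathan", "Cologne")]


def split(records, n_subsets):
    """Step 1: divide the data set into subsets."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def do_work(subset):
    """Step 2 (map): the same function applied to every subset."""
    return Counter(city for _person, city in subset)


def combine(left, right):
    """Step 3 (reduce): merge two partial results."""
    return left + right


def map_reduce(records, n_subsets=2):
    partials = [do_work(subset) for subset in split(records, n_subsets)]
    return reduce(combine, partials, Counter())


view = map_reduce(RECORDS)
print(dict(view))
```

In a real cluster the subsets live on different machines and the map calls run in parallel; the sketch only shows that the result does not depend on how the data is split.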
  33. Serialization & Schema: Catch errors as quickly as they happen. Validate on write vs on read.
  34. Serialization & Schema: CSV is actually a serialization language that is just poorly defined.
  35. Serialization & Schema: Use a format with a schema
      ● Thrift
      ● Avro
      ● Protocol Buffers
      Could be combined with Parquet. Added bonus: it’s faster and uses less space.
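
A minimal sketch of "validate on write", assuming a plain Python dataclass as a stand-in for a schema. With Thrift, Avro or Protocol Buffers the record type would normally come from a schema definition file instead; the `ContactEvent` class and `append_event` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)  # frozen: a stored fact is never modified
class ContactEvent:
    name: str
    city: str
    timestamp: datetime

    def __post_init__(self):
        # Validate when the record is created, i.e. on write.
        if not isinstance(self.name, str) or not self.name:
            raise ValueError("name must be a non-empty string")
        if not isinstance(self.city, str) or not self.city:
            raise ValueError("city must be a non-empty string")
        if not isinstance(self.timestamp, datetime):
            raise ValueError("timestamp must be a datetime")


def append_event(master_dataset, name, city, timestamp):
    """Append-only write: a bad record fails here and is never stored."""
    event = ContactEvent(name, city, timestamp)
    master_dataset.append(event)
    return event


master_dataset = []
append_event(master_dataset, "Nathan", "Cologne", datetime(2014, 4, 29, 10, 0))
```

The point of validating on write is that a malformed record is rejected at the source, rather than surfacing hours later inside a batch job that reads it.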
  36. Batch View Database: Read-only database. No random writes required.
  37. Batch View Database: Every iteration produces the views from scratch.
  38. Batch View Databases
      Pure Lambda databases
      ● ElephantDB
      ● SploutSQL
      Databases with a batch load & read-only views
      ● Voldemort
      Other databases that could be used
      ● ElasticSearch/Solr: generate the Lucene indexes using MapReduce
      ● Cassandra: generate SSTables
      ● ...
  39. Batch Layer: Eventually consistent, without the associated complexities.
  40. Batch Layer: We are not done yet… (Timeline up to now: data absorbed into batch views, followed by data not yet absorbed, just a few hours of data.)
  41. Speed Layer
  42. Speed Layer (diagram labels): Incoming Data, Hadoop, ElephantDB, Cassandra
  43. Speed Layer: Stream processing.
  44. Speed Layer: Continuous computation.
  45. Speed Layer: Storing a limited window of data. Compensating for the last few hours of data.
  46. Speed Layer: All the complexity is isolated in the Speed Layer. If anything goes wrong, it’s auto-corrected.
  47. CAP: You have a choice between:
      ● Availability
        ○ Queries are eventually consistent
      ● Consistency
        ○ Queries are consistent
  48. Eventual accuracy: Some algorithms are hard to implement in real-time. For those cases we could estimate the results.
  49. Speed Layer: Storm
  50. Storm: Message passing
  51. Storm: Distributed processing
  52. Storm: Horizontally scalable.
  53. Storm: Incremental algorithms
  54. Storm: Fast.
  55. Storm
  56. Storm: Tuple, Stream
  57. Storm: Spout, Bolt
  58. Storm: Grouping
  59. Data Ingestion: Queues & Pub/Sub models are a natural fit.
  60. Data Ingestion
      ● Kafka
      ● Flume
      ● Scribe
      ● *MQ
      ● …
  61. Speed Layer Views: The views need to be stored in a random writable database.
  62. Speed Layer Views: The logic behind a R/W database is much more complex than a read-only view.
  63. Speed Layer Views: The views are stored in a Read & Write database.
      ● Cassandra
      ● HBase
      ● Redis
      ● SQL
      ● ElasticSearch
      ● ...
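
An incremental speed-layer view can be sketched as follows. This is an assumption-laden illustration, not Storm code: the `RealtimeView` class is hypothetical, an in-memory dict stands in for a random-write store such as Cassandra or Redis, and the sensor readings are invented sample data. In a Storm topology, an update like this would typically run inside a bolt.

```python
class RealtimeView:
    """In-memory stand-in for a read & write database (hypothetical)."""

    def __init__(self):
        self._rows = {}  # key -> (running sum, running count)

    def update(self, key, value):
        # Incremental algorithm: fold one new event into the stored state
        # instead of recomputing over all data, as the batch layer would.
        total, count = self._rows.get(key, (0.0, 0))
        self._rows[key] = (total + value, count + 1)

    def get(self, key):
        return self._rows.get(key, (0.0, 0))


view = RealtimeView()
# Invented sample stream of (key, value) events.
for key, value in [("sensor-a", 20.0), ("sensor-b", 5.0), ("sensor-a", 22.0)]:
    view.update(key, value)
```

Each update is a read-modify-write against mutable state, which is exactly the kind of random write the read-only batch views avoid.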
  64. Serving Layer
  65. Serving Layer (diagram labels): Incoming Data, Hadoop, ElephantDB, Cassandra, Query
  66. Serving Layer: Random reads.
  67. Serving Layer: This layer queries the batch & real-time views and merges them.
  68. Serving Layer: How to query an Average?
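
The slide leaves the average question open. One common answer, sketched here under assumptions, is to store a sum and a count in both views and divide only at query time, since two averages cannot be merged directly. The function name, the dict-based views and the numbers below are all hypothetical.

```python
def merged_average(batch_view, realtime_view, key):
    """Merge (sum, count) pairs from both views, then divide once."""
    batch_sum, batch_count = batch_view.get(key, (0.0, 0))
    rt_sum, rt_count = realtime_view.get(key, (0.0, 0))
    count = batch_count + rt_count
    if count == 0:
        return None  # no data for this key in either view
    return (batch_sum + rt_sum) / count


# Invented example: 10 readings absorbed by the batch layer (average 30.0),
# 2 recent readings only in the speed layer (average 15.0).
batch_view = {"sensor-a": (300.0, 10)}
realtime_view = {"sensor-a": (30.0, 2)}

average = merged_average(batch_view, realtime_view, "sensor-a")
print(average)
```

Averaging the two per-view averages would give 22.5 here, while the correct overall average is 330.0 / 12 = 27.5, which is why the views keep sums and counts.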
  69. Side note: CQRS
  70. CQRS. Source: martinfowler.com/bliki/CQRS.html - Martin Fowler
  71. CQRS & Event Sourcing
      Event Sourcing
      ● Every command is a new event.
      ● The event store keeps all events; new events are appended.
      ● Any query loops through all related events, even to produce an aggregate.
      Source: CQRS Journey - Microsoft Patterns & Practices
  72. Lambda Architecture
  73. Lambda Architecture: The Lambda Architecture can discard any view, batch and real-time, and just recreate everything from the master data.
  74. Lambda Architecture: Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view.
  75. Lambda Architecture: Data storage is highly optimized.
  76. Lambda Architecture: Immutability changes everything.
  77. Questions? @nathan_gs #nosql14 nathan@nathan.gs / slideshare.net/nathan_gs lambda-architecture.net / @LambdaArch / #LambdaArch
  78. Virdata is the cross-industry cloud service/platform for the Internet of Things. Designed to elastically scale to monitor and manage an unprecedented number of devices and applications using concurrent persistent connections, Virdata opens the door to numerous new business opportunities. Virdata combines Publish-Subscribe based Distributed Messaging, Complex Event Processing and state-of-the-art Big Data paradigms to enable both historical & real-time monitoring and near real-time analytics at the scale required for the Internet of Things.
  79. Acknowledgements: I would like to thank Nathan Marz for writing a very insightful book, where most of the ideas in this presentation come from. Parts of this presentation have been created while working for datacrunchers.eu; I thank them for the opportunities to speak about the Lambda Architecture both at clients and at conferences. DataCrunchers is the first Big Data agency in Belgium.
      Schemas & Pictures:
      ● Computing Trends: Immutability Changes Everything - Pat Helland, RICON2012
      ● MapReduce #1: PolybasePass2012.pptx - David J. DeWitt, Microsoft Gray Systems Lab
      ● MapReduce #2: Introduction to MapReduce and Hadoop - Shivnath Babu, Duke
      ● CQRS: martinfowler.com/bliki/CQRS.html - Martin Fowler
      ● CQRS & Event Sourcing: CQRS Journey - Adam Dymitruk, Josh Elster & Mark Seemann, Microsoft Patterns & Practices
  80. Thank you. @nathan_gs nathan@nathan.gs