A real-time architecture using Hadoop and Storm @ JAX London

  • How much data do you have?
    44 times as much data in the next decade; 15 ZB in 2015.
    Data silos (ERP, CRM, …)
    Customers:
    Trimble (3 TB in their database system)
    Truvo (changing an index takes 24 hours)
    Traditional systems cannot handle this volume.
    Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
    Convert 350 billion annual meter readings to better predict power consumption
    3
  • Real time
    Time-sensitive decision making
    Fraud detection
    Energy allocation
    Marketing campaigns
    Market transactions
    Solution:
    Real-time solutions in combination with batch (Hadoop)
    NoSQL systems
    4
  • Structured
    Unstructured
    80% is unstructured data.
    A key drawback of traditional relational database systems is that they are not good at handling variable data.
    A flexible data model
    Word, email, photo, text, video, APIs, …?
    What are your needs regarding variety?
    The end result: bringing structure into unstructured data
    Monitor hundreds of live video feeds from surveillance cameras to target points of interest
    Exploit the 80% data growth in images, video and documents to improve customer satisfaction
    5
  • We can afford to keep immutable copies of lots of data.
    We NEED immutability to coordinate with fewer challenges.
    Semaphores & locks are the things to avoid:
    Instruction opportunities lost waiting for a semaphore increase with more cores…
    6
  • The number of followers on Twitter = all follows & unfollows combined.
    Another example: an account balance.
    9
  • Data = event
    In an ever-changing world we found a ‘safe haven’ for data
    Everything we do generates events:
    Pay with Credit Card
    Commit to Git
    Click on a webpage
    Tweet
    10
  • It is easier to store all data in a cost-effective way.
    Compare to the DWH world.
    13
  • Immutability greatly restricts the range of errors that can cause data loss or data corruption.
    E.g.
    Only CR, no more CRUD.
    Information might of course change.
    Fault tolerance
    Data loss
    Human error, hardware failure
    Data corruption
    Parallel with functional programming.
    14
  • Allows state regeneration. E.g. what was my bank balance on 1 May 2005?
    15
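The state-regeneration note above can be sketched as a replay over an append-only event log. This is a minimal Python illustration, not code from the talk: the `Transaction` type, the `balance_on` helper and the sample amounts are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: an event is never modified once recorded
class Transaction:
    when: date
    amount: int  # in cents; positive = deposit, negative = withdrawal


def balance_on(events, day):
    """Regenerate the balance as of `day` by replaying every event up to that day."""
    return sum(e.amount for e in events if e.when <= day)


# Hypothetical append-only log: rows are only ever added, never updated.
log = [
    Transaction(date(2005, 1, 10), 100_000),
    Transaction(date(2005, 4, 2), -25_000),
    Transaction(date(2005, 6, 15), 40_000),
]

print(balance_on(log, date(2005, 5, 1)))    # 75000
print(balance_on(log, date(2005, 12, 31)))  # 115000
```

Because no event is ever overwritten, the same log answers the question for any past date; a mutable "current balance" column could not.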
  • Queries as pure functions that take all data as input is the most general formulation.
    Different functions may look at different portions and aggregate information in different ways.
    19
  • Too slow; might be petabyte scale.
    Impala/Drill: why not?
    23
  • The batch layer can calculate anything (given enough time).
    28
  • The batch layer stores the data normalized, but in the views it generates, data is often, if not always, denormalized.
    29
  • Not vertically
    30
  • It’s OK to croak and restart
    32
  • Is something really immutable when its name can change?
    33
  • Doesn’t have to be Hadoop. What matters here is a distributed file system combined with a processing framework.
    Spark,
    34
  • Source: PolybasePass2012.pptx
    http://whyjava.wordpress.com/2011/08/04/how-i-explained-mapreduce-to-my-wife/
    36
  • http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
    Value of schemas
    • Structural integrity
    • Guarantees on what can and can’t be stored
    • Prevents corruption
    Otherwise you’ll detect corruption issues at read-time
    37
  • http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
    38
  • But this can be solved by generating your views in advance, e.g. with ES.
    42
  • In some circumstances.
    49
  • All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the realtime layer.
    51
  • Consistency (all nodes see the same data at the same time)
    Availability (a guarantee that every request receives a response about whether it was successful or failed)
    Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)
    http://codahale.com/you-cant-sacrifice-partition-tolerance/
    HBase vs Cassandra
    52
  • E.g. unique counts
    ML
    53
  • Nimbus:
    Manages the cluster
    Worker Node:
    Supervisor:
    Manages workers; restarts them if needed
    Executor:
    Physical JVM process.
    Executes tasks (those are spread evenly across the workers)
    Tasks:
    Each in its own thread.
    Is the actual Bolt or Spout.
    Processes the stream.
    56
  • Tuple:
    Named list of values
    Dynamically typed
    Stream
    Sequence of Tuples
    57
  • Spout
    Source of Streams
    Sometimes replayable
    Bolt
    Stream transformations
    At least 1 input stream
    0 - * output streams
    58
  • The serving layer needs to be able to answer any query in a short amount of time.
    64
  • AVG = sum / count: pre-aggregate the sum and the count, but not everything can be pre-aggregated this way.
    67
  • Lambda was first named by Alonzo Church; he needed a letter for functional abstraction in the theory of computation in the 1930s.
    70
  • High tolerance for human & system errors.
    71
  • http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
    72
  • Data storage layer optimized independently from query resolution layer
    73
  • If you remember one thing about this presentation, it is: immutability.
    74

Transcript

  • 1. A real-time architecture using Hadoop and Storm.
  • 2. Speaker Nathan Bijnens @nathan_gs A real-time architecture using Hadoop & Storm. #JaxLondon 2
  • 3. Our Vision: Big Data. Volume
  • 4. Big Data: Velocity
  • 5. Our Vision: Volume, Variety
  • 6. Computing Trends
    Past: Computation (CPUs) expensive. Current: Computation cheap (many-core computers).
    Past: Disk storage expensive. Current: Disk storage cheap (cheap commodity disks).
    Past: DRAM expensive. Current: DRAM / SSD getting cheap.
    Past: Coordination easy (latches don't often hit). Current: Coordination hard (latches stall a lot, etc.).
    Source: Immutability Changes Everything - Pat Helland, RICON2012
  • 7. Credits: Nathan Marz. Ex-Backtype & Twitter; startup in stealth mode. Storm, Cascalog, ElephantDB. manning.com/marz
  • 8. A Data System
  • 9. Data is more than Information: Not all information is equal. Some information is derived from other pieces of information.
  • 10. Data is more than Information: Eventually you will reach the most … This is the information you hold true, simply because it exists.
  • 11. Events - Before: Events used to manipulate the master data.
  • 12. Events - After: Today, events are the master data.
  • 13. Data System: … everything.
  • 14. Events: Data is Immutable
  • 15. Events: Data is Time Based
  • 16. Capturing change traditionally
    Person | Location (before) | Location (after)
    Nathan | Antwerp | Ghent
    Geert | Dendermonde | Dendermonde
    John | Ghent | Ghent
  • 17. Capturing change
    Person | Location | Time
    Nathan | Antwerp | 2005-01-01
    Geert | Dendermonde | 2011-10-08
    John | Ghent | 2010-05-02
    Nathan | Ghent | 2013-02-03 (new row appended; earlier rows unchanged)
  • 18. Query: The data you query is often transformed, aggregated, ...
  • 19. Query: Query = function ( all data )
  • 20. Number of people living in each city.
    Input:
    Person | Location | Time
    Nathan | Antwerp | 2005-01-01
    Geert | Dendermonde | 2011-10-08
    John | Ghent | 2010-05-02
    Nathan | Ghent | 2013-02-03
    Result:
    Location | Count
    Ghent | 2
    Dendermonde | 1
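The "Query = function ( all data )" idea, applied to the people-per-city example, might look like the sketch below. The rows are the ones from the slide; the function name `people_per_city` and the tuple layout are assumptions made for illustration.

```python
from collections import Counter

# (person, location, time) facts from the slide; ISO dates compare correctly as strings.
FACTS = [
    ("Nathan", "Antwerp", "2005-01-01"),
    ("Geert", "Dendermonde", "2011-10-08"),
    ("John", "Ghent", "2010-05-02"),
    ("Nathan", "Ghent", "2013-02-03"),
]


def people_per_city(facts):
    """Pure function over the whole dataset: latest location per person, then count."""
    latest = {}
    for person, location, time in facts:
        if person not in latest or time > latest[person][0]:
            latest[person] = (time, location)
    return Counter(location for _, location in latest.values())


print(dict(people_per_city(FACTS)))  # {'Ghent': 2, 'Dendermonde': 1}
```

Running the same function on only the first three facts gives the answer as it stood before Nathan moved, which is what the time-based, append-only model makes possible.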
  • 21. Query: All Data → Query (slide 22)
  • 22. Query: Precompute. All Data → Precomputed View → Query (slide 23)
  • 23. Layered Architecture: Batch Layer, Speed Layer, Serving Layer (slide 24)
  • 24. Layered Architecture (diagram): Incoming Data, Hadoop, Elephant DB, Cassandra, Query (slide 25)
  • 25. Batch Layer (slide 26)
  • 26. Batch Layer (diagram): Incoming Data, Hadoop, Elephant DB (slide 27)
  • 27. Batch Layer: Unrestrained computation. (slide 28)
  • 28. Batch Layer: No need to de-normalize. (slide 29)
  • 29. Batch Layer: Horizontally scalable. (slide 30)
  • 30. Batch Layer: High latency. … matter. (slide 31)
  • 31. Batch Layer: Functional computation, based on immutable inputs, is idempotent. (slide 32)
  • 32. Batch Layer: Stores master copy of data set... append only. (slide 33)
  • 33. Batch Layer (slide 34)
  • 34. Batch: View generation. Master Dataset → MapReduce → View #1, View #2, View #3 (slide 35)
  • 35. MapReduce (slide 36)
    MAP
    1. Take a large data set and divide it into subsets …
    2. Perform the same function on all subsets: DoWork() DoWork() DoWork() …
    REDUCE
    3. Combine the output from all subsets … Output
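The three steps on the MapReduce slide can be imitated in a single process. The sketch below is plain Python, not the Hadoop API: `split`, `do_work` and `combine` are hypothetical names, and a real Hadoop job would also shuffle intermediate results by key across machines.

```python
from collections import Counter
from functools import reduce


def split(records, n_subsets):
    """Step 1: divide the data set into roughly equal subsets."""
    return [records[i::n_subsets] for i in range(n_subsets)]


def do_work(subset):
    """Step 2 (map): the same function runs on every subset independently."""
    return Counter(word for line in subset for word in line.split())


def combine(left, right):
    """Step 3 (reduce): merge two partial results into one."""
    return left + right


lines = ["hadoop storm", "storm kafka", "hadoop hadoop"]
partials = [do_work(s) for s in split(lines, 2)]  # each call could run on another node
output = reduce(combine, partials, Counter())
print(output["hadoop"], output["storm"], output["kafka"])  # 3 2 1
```

Since `do_work` only reads its own subset and `combine` is associative, the partial results can be computed and merged in any order, which is what lets the batch layer scale horizontally.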
  • 36. Serialization & Schema: Catch errors as quickly as they happen. Validation on write vs on read. (slide 37)
  • 37. Serialization & Schema: CSV is actually a serialization language that is just poorly defined. (slide 38)
  • 38. Serialization & Schema: Use a format with a schema: Thrift, Avro, Protocol Buffers. (slide 39)
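"Validation on write" can be illustrated without committing to any one serialization library. In the sketch below a frozen Python dataclass stands in for a Thrift, Avro or Protocol Buffers schema; the `LocationFact` type and the `append` helper are hypothetical and do not reproduce any of those libraries' APIs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LocationFact:
    """Hypothetical record schema, standing in for a real schema definition."""
    person: str
    location: str
    time: date

    def __post_init__(self):
        # Validate on write: reject malformed records before they reach the master dataset.
        if not isinstance(self.person, str) or not self.person:
            raise ValueError("person must be a non-empty string")
        if not isinstance(self.location, str) or not self.location:
            raise ValueError("location must be a non-empty string")
        if not isinstance(self.time, date):
            raise ValueError("time must be a date")


def append(master, person, location, time):
    """Append only records that satisfy the schema."""
    master.append(LocationFact(person, location, time))


master = []
append(master, "Nathan", "Ghent", date(2013, 2, 3))
try:
    append(master, "John", "Ghent", "yesterday")  # wrong type: caught at write time
except ValueError as err:
    print("rejected:", err)  # rejected: time must be a date
print(len(master))  # 1
```

With a schemaless format such as CSV, the bad row would be stored and the problem would only surface later, when some job tries to read it.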
  • 39. Batch View Database: Read-only database. No random writes required. (slide 40)
  • 40. Batch View Database: Every iteration produces the Views from scratch. (slide 41)
  • 41. Batch View Database: ElephantDB, Splout, Voldemort (slide 42)
  • 42. Batch Layer: Just a few hours of data. Timeline up to Now: data absorbed into Batch Views, then data not yet absorbed. (slide 44)
  • 43. Speed Layer (slide 45)
  • 44. Overview (diagram): Incoming Data, Hadoop, Elephant DB, Cassandra (slide 46)
  • 45. Speed Layer: Stream processing. (slide 47)
  • 46. Speed Layer: Continuous computation. (slide 48)
  • 47. Speed Layer: Transactional. (slide 49)
  • 48. Speed Layer: Storing a limited window of data. Compensating for the last few hours of data. (slide 50)
  • 49. Speed Layer: All the complexity is isolated in the Speed layer. …-corrected. (slide 51)
  • 50. CAP: You have a choice between: Availability - Queries are eventually consistent. Consistency - Queries are consistent. (slide 52)
  • 51. Eventual accuracy: Some algorithms are hard to implement in real time. For those cases we could estimate the results. (slide 53)
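One way to "estimate the results" for a query that is hard to maintain exactly in real time, such as a unique count, is a probabilistic sketch. The example below uses a k-minimum-values estimator, chosen here only for illustration; the talk does not say which estimator was used, and the class name `KMVCounter` is hypothetical.

```python
import hashlib
import heapq


class KMVCounter:
    """K-minimum-values sketch: estimates the number of distinct items in bounded memory."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []     # max-heap (values stored negated) of the k smallest hashes
        self._kept = set()  # the same hashes, used to skip duplicates

    @staticmethod
    def _hash(item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)

    def add(self, item):
        h = self._hash(item)
        if h in self._kept:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct items seen: count is exact
        return int((self.k - 1) / -self._heap[0])


counter = KMVCounter()
for i in range(50_000):
    counter.add(f"user-{i}")
    counter.add(f"user-{i}")  # duplicates do not change the estimate
print(counter.estimate())  # close to 50000, typically within a few percent
```

The speed layer can serve this approximate number immediately, while the batch layer later replaces it with an exact count computed over the master dataset.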
  • 52. Speed Layer (diagram): Incoming Data → Real Time View 1, Real Time View 2 (slide 54)
  • 53. Storm: Message passing. Distributed processing. Horizontally scalable. Incremental algorithms. Fast. Data in motion. (slide 55)
  • 54. Storm (cluster diagram): Nimbus, Zookeeper, and three Worker Nodes, each with a Supervisor and several Executors. (slide 56)
  • 55. Storm: Tuple, Stream (slide 57)
  • 56. Storm: Spout, Bolt (slide 58)
  • 57. Storm: Grouping (slide 59)
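The tuple, stream, spout and bolt concepts can be mimicked with Python generators to show how data flows through a topology. This is a conceptual sketch only and does not use the Storm API; the function names and sample sentences are hypothetical.

```python
from collections import Counter


def sentence_spout():
    """Spout: a source of tuples (here a finite, hypothetical stream of sentences)."""
    for sentence in ["storm is fast", "hadoop is batch", "storm is streaming"]:
        yield {"sentence": sentence}  # a tuple is a named list of values


def split_bolt(stream):
    """Bolt: transforms one input stream into an output stream of word tuples."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def count_bolt(stream):
    """Bolt: keeps a running count per word and emits the updated count."""
    counts = Counter()
    for tup in stream:
        counts[tup["word"]] += 1
        yield {"word": tup["word"], "count": counts[tup["word"]]}


# Wire the topology: spout -> split bolt -> count bolt.
final = {}
for tup in count_bolt(split_bolt(sentence_spout())):
    final[tup["word"]] = tup["count"]
print(final["storm"], final["is"])  # 2 3
```

In a real cluster each bolt runs as many parallel tasks, and the grouping decides which task receives a given tuple. For the counting step, a grouping on the `word` field would be needed so that every occurrence of a word reaches the same task.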
  • 58. Data Ingestion: Kafka, Flume, Scribe, *MQ, Kestrel (slide 60)
  • 59. Speed Layer Views: The views are stored in a read & write database: Cassandra, HBase, Redis, MySQL, ElasticSearch. Much more complex than a read-only view. (slide 61)
  • 60. Serving Layer (slide 62)
  • 61. Overview (diagram): Incoming Data, Hadoop, Elephant DB, Cassandra, Query (slide 63)
  • 62. Serving Layer: Random reads (slide 64)
  • 63. Serving Layer: This layer queries the Batch & Real Time views and merges them. (slide 65)
  • 64. Serving Layer: Batch Views + Real Time Views → Merge (slide 66)
  • 65. Serving Layer: How to query an Average? (slide 67)
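A possible answer to the average question is for both views to store a sum and a count per key, so that the serving layer can merge them before dividing. The sketch below assumes that layout; the view contents and the `merge_average` helper are hypothetical.

```python
def merge_average(batch_view, realtime_view, key):
    """Merge pre-aggregated (sum, count) pairs from both views, then divide."""
    b_sum, b_count = batch_view.get(key, (0, 0))
    r_sum, r_count = realtime_view.get(key, (0, 0))
    total, count = b_sum + r_sum, b_count + r_count
    return total / count if count else None


batch_view = {"page-a": (900, 30)}     # hypothetical: data already absorbed by the batch layer
realtime_view = {"page-a": (100, 10)}  # hypothetical: the last few hours only

print(merge_average(batch_view, realtime_view, "page-a"))  # 25.0
```

Storing the averages themselves would not work: averaging 30.0 (batch) and 10.0 (real time) gives 20.0, not the correct 25.0, because the two views cover different numbers of events.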
  • 66. Overview (slide 68)
  • 67. Overview (diagram): Incoming Data, Hadoop, Elephant DB, Cassandra, Query (slide 69)
  • 68. Lambda Architecture (slide 70)
  • 69. Lambda Architecture: Can discard any view, batch and real time, and just recreate everything from the master data. (slide 71)
  • 70. Lambda Architecture: Mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view. (slide 72)
  • 71. Lambda Architecture: Data storage is highly optimized. (slide 73)
  • 72. Lambda Architecture: Immutability changes everything. (slide 74)
  • 73. Questions? @nathan_gs & #BigDataCon13 (slide 75)
  • 74. DataCrunchers: We enable companies in envisioning, defining and implementing a data strategy. A one-stop shop for all your Big Data needs. The first Big Data consultancy agency in Belgium. (slide 76)
  • 75. Thank you @nathan_gs (slide 77)