A real-time architecture using Hadoop and Storm @ JAX London
 

Usage Rights

© All Rights Reserved

  • Slide 3: How much data do you have?
    44 times as much data in the next decade; 15 ZB in 2015.
    Data silos (ERP, CRM, …).
    Customers: Trimble (3 TB in their database system); Truvo (changing an index takes 24 hours).
    Traditional systems cannot handle this volume.
    Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
    Convert 350 billion annual meter readings to better predict power consumption.
  • Slide 4: Real time.
    Time-sensitive decision making: fraud detection, energy allocation, marketing campaigns, market transactions.
    Solution: real-time solutions in combination with batch (Hadoop); NoSQL systems.
  • Slide 5: Structured vs. unstructured: 80% is unstructured data.
    A key drawback of using traditional relational database systems is that they're not good at handling variable data.
    A flexible data model: Word, email, photo, text, video, APIs, …?
    What are your needs regarding variety?
    The end result: bringing structure into unstructured data.
    Monitor hundreds of live video feeds from surveillance cameras to target points of interest.
    Exploit the 80% data growth in images, video and documents to improve customer satisfaction.
  • Slide 6: We can afford to keep immutable copies of lots of data.
    We NEED immutability to coordinate with fewer challenges.
    Semaphores & locks are the things to avoid: instruction opportunities lost waiting for a semaphore increase with more cores…
  • Slide 9: The number of followers on Twitter = all follows & unfollows combined.
    Other example: account balance.
  • Slide 10: Data = event.
    In an ever-changing world we found a 'safe haven' for data.
    Everything we do generates events: pay with a credit card, commit to Git, click on a webpage, tweet.
  • Slide 13: It is easier to store all data in a cost-effective way.
    Compare to the DWH world.
  • Slide 14: Immutability greatly restricts the range of errors that can cause data loss or data corruption.
    Ex.: only CR, no more CRUD.
    Information might of course change.
    Fault tolerance: data loss (human error, hardware failure); data corruption.
    Parallel with functional programming.
  • Slide 15: Allows state regeneration, e.g. what was my bank balance on 1 May 2005?
  • Slide 19: Queries as pure functions that take all data as input is the most general formulation.
    Different functions may look at different portions of the data and aggregate information in different ways.
  • Slide 23: Too slow; might be petabyte scale.
    Impala/Drill: why not?
  • Slide 28: The batch layer can calculate anything (given enough time).
  • Slide 29: The batch layer stores the data normalized, but in the views it generates, data is often, if not always, denormalized.
  • Slide 30: Not vertically.
  • Slide 32: It's OK to croak and restart.
  • Slide 33: Is something really immutable when its name can change?
  • Slide 34: It doesn't have to be Hadoop. What matters here is a distributed FS combined with a processing framework.
    Spark, for example.
  • Slide 36: Source: PolybasePass2012.pptx
    http://whyjava.wordpress.com/2011/08/04/how-i-explained-mapreduce-to-my-wife/
  • Slide 37: http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
    Value of schemas: structural integrity; guarantees on what can and can't be stored; prevents corruption.
    Otherwise you'll detect corruption issues at read time.
  • Slide 38: http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
  • Slide 42: But this can be solved by generating your views in advance, e.g. with ES.
  • Slide 49: In some circumstances.
  • Slide 51: All the complexity of *dealing* with the CAP theorem (like read repair) is isolated in the real-time layer.
  • Slide 52: Consistency (all nodes see the same data at the same time).
    Availability (a guarantee that every request receives a response about whether it was successful or failed).
    Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system).
    http://codahale.com/you-cant-sacrifice-partition-tolerance/
    HBase vs Cassandra.
  • Slide 53: E.g. unique counts, ML.
  • Slide 56: Nimbus: manages the cluster.
    Worker node:
      Supervisor: manages workers; restarts them if needed.
      Executor: physical JVM process; executes tasks (those are spread evenly across the workers).
      Tasks: each in its own thread; is the actual Bolt or Spout; processes the stream.
  • Slide 57: Tuple: named list of values; dynamically typed.
    Stream: sequence of tuples.
  • Slide 58: Spout: source of streams; sometimes replayable.
    Bolt: stream transformations; at least 1 input stream; 0 - * output streams.
  • Slide 64: The serving layer needs to be able to answer any query in a short amount of time.
  • Slide 67: AVG: store sum + count and divide at query time; pre-aggregate, but not everything is possible.
  • Slide 70: Lambda was first named by Alonzo Church; he needed a letter for functional abstraction in the theory of computation in the 1930s.
  • Slide 71: High tolerance for human & system errors.
  • Slide 72: http://www.quora.com/Apache-Hadoop/What-is-the-advantage-of-writing-custom-input-format-and-writable-versus-the-TextInputFormat-and-Text-writable/answer/Eric-Sammer?srid=PU&st=ns
  • Slide 73: Data storage layer optimized independently from the query resolution layer.
  • Slide 74: If you remember one thing about this presentation, it is: immutability.

Presentation Transcript

  • A real-time architecture using Hadoop and Storm.
  • Speaker Nathan Bijnens @nathan_gs A real-time architecture using Hadoop & Storm. #JaxLondon 2
  • Slide 3: Our Vision. Big Data: Volume.
  • Slide 4: Big Data: Velocity.
  • Slide 5: Our Vision. Volume, Variety.
  • Slide 6: Computing Trends
      Past                                          Current
      Computation (CPUs) expensive                  Computation cheap (many-core computers)
      Disk storage expensive                        Disk storage cheap (cheap commodity disks)
      DRAM expensive                                DRAM / SSD getting cheap
      Coordination easy (latches don't often hit)   Coordination hard (latches stall a lot, etc.)
    Source: Immutability Changes Everything - Pat Helland, RICON2012
  • Slide 7: Credits: Nathan Marz. Ex-BackType & Twitter; startup in stealth mode. Storm, Cascalog, ElephantDB. manning.com/marz
  • Slide 8: A Data System
  • Slide 9: Data is more than Information. Not all information is equal. Some information is derived from other pieces of information.
  • Slide 10: Data is more than Information. Eventually you will reach the most […] This is the information you hold true, simply because it exists.
  • Slide 11: Events - Before: events used to manipulate the master data.
  • Slide 12: Events - After: today, events are the master data.
  • Slide 13: Data System: […] everything.
  • Slide 14: Events: data is immutable.
  • Slide 15: Events: data is time-based.
  • Slide 16: Capturing change traditionally
      Before                    After
      Person  Location          Person  Location
      Nathan  Antwerp           Nathan  Ghent
      Geert   Dendermonde       Geert   Dendermonde
      John    Ghent             John    Ghent
  • Slide 17: Capturing change
      Person  Location     Timestamp        Person  Location     Time
      Nathan  Antwerp      2005-01-01       Nathan  Antwerp      2005-01-01
      Geert   Dendermonde  2011-10-08       Geert   Dendermonde  2011-10-08
      John    Ghent        2010-05-02       John    Ghent        2010-05-02
                                            Nathan  Ghent        2013-02-03
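The time-based table can be read as an append-only log in which a move adds a row instead of overwriting one. The following is a minimal sketch of that idea in plain Python; the function name `location_at` and the tuple layout are assumptions made for illustration, not something taken from the talk.

```python
from datetime import date

# Append-only log of (person, location, timestamp) facts, as in the slide's table.
facts = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),  # a move is a new fact, not an update
]


def location_at(person, when, log=facts):
    """Return the person's most recent known location on or before `when`."""
    known = [(ts, loc) for (p, loc, ts) in log if p == person and ts <= when]
    return max(known)[1] if known else None


print(location_at("Nathan", date(2010, 1, 1)))  # state as of 2010
print(location_at("Nathan", date(2013, 6, 1)))  # state after the move
```

Because no row is ever overwritten, any past state can be regenerated from the log, which an in-place update would have destroyed.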
  • Slide 18: Query: the data you query is often transformed, aggregated, ...
  • Slide 19: Query: Query = function ( all data )
  • Slide 20: Number of people living in each city.
      Person  Location     Time             Location     Count
      Nathan  Antwerp      2005-01-01       Ghent        2
      Geert   Dendermonde  2011-10-08       Dendermonde  1
      John    Ghent        2010-05-02
      Nathan  Ghent        2013-02-03
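As a rough illustration of "Query = function ( all data )", the city count can be written as a pure function over the complete fact log. This is only a sketch; the name `people_per_city` is hypothetical and the data is copied from the slide's table.

```python
from collections import Counter
from datetime import date

# All data: every (person, location, time) fact ever recorded.
facts = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]


def people_per_city(log):
    """Pure query: take each person's latest location, then count per city."""
    latest = {}
    for person, location, _ts in sorted(log, key=lambda fact: fact[2]):
        latest[person] = location  # later facts win because the log is sorted by time
    return Counter(latest.values())


counts = people_per_city(facts)
print(dict(counts))
```

The function never modifies `facts`, so it can be rerun at any time and yields the same counts as the slide (Ghent 2, Dendermonde 1).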
  • Slide 22: Query (diagram): All Data → Query
  • Slide 23: Query: Precompute (diagram): All Data → Precomputed View → Query
  • Slide 24: Layered Architecture: Batch Layer, Speed Layer, Serving Layer
  • Slide 25: Layered Architecture (diagram labels): Query, Cassandra, Incoming Data, Hadoop, ElephantDB
  • Slide 26: Batch Layer
  • Slide 27: Batch Layer (diagram labels): Incoming Data, Hadoop, ElephantDB
  • Slide 28: Batch Layer: unrestrained computation.
  • Slide 29: Batch Layer: no need to de-normalize.
  • Slide 30: Batch Layer: horizontally scalable.
  • Slide 31: Batch Layer: high latency. […] matter.
  • Slide 32: Batch Layer: functional computation, based on immutable inputs, is idempotent.
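A small sketch of why this matters: a view built by a pure function over immutable input gives the same result however often the job is rerun, while an in-place incremental update does not. The names `build_view` and `apply_increment` and the sample data are made up for this illustration.

```python
def build_view(master):
    """Pure function: the view depends only on the immutable input."""
    view = {}
    for user, amount in master:
        view[user] = view.get(user, 0) + amount
    return view


def apply_increment(view, user, amount):
    """Mutating update: running it twice for the same event double counts."""
    view[user] = view.get(user, 0) + amount


master = (("alice", 10), ("bob", 5), ("alice", -3))  # a tuple, never mutated

first_run = build_view(master)
second_run = build_view(master)  # e.g. the job crashed and was restarted

mutable_view = {}
apply_increment(mutable_view, "bob", 5)
apply_increment(mutable_view, "bob", 5)  # accidental retry of the same event

print(first_run == second_run, mutable_view)
```

Rerunning `build_view` is always safe; retrying `apply_increment` is not, which is the kind of coordination problem the batch layer avoids.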
  • Slide 33: Batch Layer: stores the master copy of the data set... append only.
  • Slide 34: Batch Layer
  • Slide 35: Batch: View generation (diagram): Master Dataset → MapReduce → View #1, View #2, View #3
  • Slide 36: MapReduce
    MAP: 1. Take a large data set and divide it into subsets. 2. Perform the same function (DoWork()) on all subsets.
    REDUCE: 3. Combine the output from all subsets into the output.
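The three steps on the slide can be sketched in a single process with plain Python. This is a simplified word count, not Hadoop code: `split`, `do_work` and `combine` are hypothetical helper names, and a real MapReduce job would also shuffle intermediate results by key across machines.

```python
from functools import reduce


def split(data, n):
    """Step 1: divide the data set into n subsets."""
    return [data[i::n] for i in range(n)]


def do_work(subset):
    """Step 2 (map): the same function runs on every subset."""
    counts = {}
    for line in subset:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


def combine(left, right):
    """Step 3 (reduce): merge two partial outputs into one."""
    merged = dict(left)
    for word, count in right.items():
        merged[word] = merged.get(word, 0) + count
    return merged


def map_reduce(data, n=3):
    partials = [do_work(subset) for subset in split(data, n)]
    return reduce(combine, partials, {})


lines = ["storm hadoop", "hadoop hdfs", "storm storm"]
result = map_reduce(lines)
print(result)
```

Since `do_work` only sees its own subset and `combine` is associative, the subsets could be processed on different machines in any order.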
  • Slide 37: Serialization & Schema: catch errors as quickly as they happen. Validation on write vs on read.
  • Slide 38: Serialization & Schema: CSV is actually a serialization language that is just poorly defined.
  • Slide 39: Serialization & Schema: use a format with a schema: Thrift, Avro, Protocol Buffers.
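To make "validation on write" concrete, here is a minimal sketch that uses a frozen Python dataclass as a stand-in for a real schema format. It does not use Thrift, Avro or Protocol Buffers; the class `LocationFact` and its fields are assumptions chosen to match the earlier location example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a stored fact cannot be modified afterwards
class LocationFact:
    person: str
    location: str
    timestamp: date

    def __post_init__(self):
        # Validate on write: reject a malformed record before it is stored,
        # instead of discovering the corruption later at read time.
        for name, expected in (("person", str), ("location", str), ("timestamp", date)):
            value = getattr(self, name)
            if not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )


master = [LocationFact("Nathan", "Ghent", date(2013, 2, 3))]

try:
    # A CSV-style record with the date left as a string is rejected up front.
    master.append(LocationFact("John", "Ghent", "2010-05-02"))
except TypeError as error:
    print("rejected:", error)
```

A real schema format adds a compact binary encoding and schema evolution on top of this kind of check.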
  • Slide 40: Batch View Database: read-only database. No random writes required.
  • Slide 41: Batch View Database: every iteration produces the views from scratch.
  • Slide 42: Batch View Database: ElephantDB, Splout, Voldemort
  • Slide 44: Batch Layer (timeline diagram, Time axis up to Now): data absorbed into Batch Views; not yet absorbed: just a few hours of data.
  • Slide 45: Speed Layer
  • Slide 46: Overview (diagram labels): Cassandra, Incoming Data, Hadoop, ElephantDB
  • Slide 47: Speed Layer: stream processing.
  • Slide 48: Speed Layer: continuous computation.
  • Slide 49: Speed Layer: transactional.
  • Slide 50: Speed Layer: storing a limited window of data. Compensating for the last few hours of data.
  • Slide 51: Speed Layer: all the complexity is isolated in the Speed Layer. […]-corrected.
  • Slide 52: CAP: you have a choice between: Availability - queries are eventually consistent. Consistency - queries are consistent.
  • Slide 53: Eventual accuracy: some algorithms are hard to implement in real time. For those cases we could estimate the results.
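The speaker notes give unique counts as an example of a result that can be estimated. One possible technique, chosen here purely for illustration and not named in the talk, is linear counting: hash every item into a fixed bitmap and estimate the number of distinct items from the share of bits that stay unset. The function name and the bitmap size `m` are assumptions.

```python
import hashlib
import math


def estimate_distinct(items, m=4096):
    """Linear counting: estimate the number of distinct items with m bits of state."""
    bits = bytearray(m)
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % m] = 1
    empty = m - sum(bits)
    if empty == 0:
        raise ValueError("bitmap saturated; use a larger m")
    # With n distinct items the expected share of empty bits is about exp(-n / m).
    return -m * math.log(empty / m)


# 5,000 events from 1,000 distinct users; duplicates do not change the bitmap.
stream = [f"user{i % 1000}" for i in range(5000)]
estimate = estimate_distinct(stream)
print(round(estimate))
```

The memory use is fixed regardless of stream length, which suits a speed layer; the batch layer can later replace the estimate with an exact count.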
  • Slide 54: Speed Layer (diagram): Incoming Data → Real Time View 1, Real Time View 2
  • Slide 55: Storm: message passing. Distributed processing. Horizontally scalable. Incremental algorithms. Fast. Data in motion.
  • Slide 56: Storm (cluster diagram): Nimbus; Zookeeper; three Worker Nodes, each with a Supervisor and several Executors.
  • Slide 57: Storm: Tuple, Stream
  • Slide 58: Storm: Spout, Bolt
  • Slide 59: Storm: Grouping
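The Storm concepts above (tuple, stream, spout, bolt, grouping) can be mimicked in plain Python to show how they fit together. This sketch deliberately does not use the Storm API: all function names are hypothetical, tuples are modelled as dicts, and the grouping shown imitates a fields grouping, which is assumed to be among the groupings the slide refers to.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Spout: a source of tuples (here a fixed list, so it is replayable)."""
    for sentence in ["storm hadoop", "hadoop hdfs", "storm storm"]:
        yield {"sentence": sentence}


def split_bolt(stream):
    """Bolt: consumes one stream and emits a new stream of word tuples."""
    for tup in stream:
        for word in tup["sentence"].split():
            yield {"word": word}


def fields_grouping(tup, field, n_tasks):
    """Grouping: tuples with the same field value always go to the same task."""
    return zlib.crc32(tup[field].encode("utf-8")) % n_tasks


def run_topology(n_tasks=2):
    counters = [Counter() for _ in range(n_tasks)]  # one counting task each
    for tup in split_bolt(sentence_spout()):
        task = fields_grouping(tup, "word", n_tasks)
        counters[task][tup["word"]] += 1
    return counters


counters = run_topology()
print(counters)
```

Because a given word always lands on the same task, each task can keep its counts locally without coordinating with the others.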
  • Slide 60: Data Ingestion: Kafka, Flume, Scribe, *MQ, Kestrel
  • Slide 61: Speed Layer Views: the views are stored in a read & write database: Cassandra, HBase, Redis, MySQL, ElasticSearch. Much more complex than a read-only view.
  • Slide 62: Serving Layer
  • Slide 63: Overview (diagram labels): Query, Cassandra, Incoming Data, Hadoop, ElephantDB
  • Slide 64: Serving Layer: random reads.
  • Slide 65: Serving Layer: this layer queries the Batch & Real Time views and merges them.
  • Slide 66: Serving Layer (diagram): Batch Views + Real Time Views → Merge
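For additive metrics such as counts, the merge step can be as simple as adding the two views key by key. A minimal sketch, assuming page-view counters; the view contents and the name `merge_views` are invented for illustration.

```python
def merge_views(batch_view, realtime_view):
    """Combine a precomputed batch view with the speed layer's recent delta."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


batch_view = {"/home": 100, "/about": 20}   # everything absorbed by the last batch run
realtime_view = {"/home": 3, "/pricing": 1}  # only the last few hours

merged = merge_views(batch_view, realtime_view)
print(merged)
```

This only works if the real-time view covers exactly the data the batch views have not absorbed yet; otherwise events are counted twice or missed.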
  • Slide 67: Serving Layer: how to query an average?
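One common answer, sketched here under the assumption that each view stores a running (sum, count) pair rather than a finished average, is to merge the pairs and divide only at query time. The numbers and the name `merged_average` are made up.

```python
def merged_average(batch, realtime):
    """Each view holds (sum, count); the average is computed at query time."""
    total = batch[0] + realtime[0]
    count = batch[1] + realtime[1]
    return total / count if count else None


batch = (900.0, 100)    # batch view: sum and count, i.e. an average of 9.0
realtime = (200.0, 10)  # real-time view: an average of 20.0

correct = merged_average(batch, realtime)
naive = (batch[0] / batch[1] + realtime[0] / realtime[1]) / 2  # average of averages

print(correct, naive)
```

Averaging the two averages weights 10 recent events as heavily as 100 older ones, so the views have to keep the sum and the count separately.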
  • Slide 68: Overview
  • Slide 69: Overview (diagram labels): Query, Cassandra, Incoming Data, Hadoop, ElephantDB
  • Slide 70: Lambda Architecture
  • Slide 71: Lambda Architecture: can discard any view, batch and real time, and just recreate everything from the master data.
  • Slide 72: Lambda Architecture: mistakes are corrected via recomputation. Write bad data? Remove the data & recompute. Bug in view generation? Just recompute the view.
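The "remove the data & recompute" recipe can be sketched as follows. The event layout, the ids and the name `build_balances` are assumptions for illustration only.

```python
def build_balances(master):
    """View generation: derive each user's balance from the raw events."""
    balances = {}
    for event in master:
        balances[event["user"]] = balances.get(event["user"], 0) + event["amount"]
    return balances


master = [
    {"id": 1, "user": "alice", "amount": 10},
    {"id": 2, "user": "bob", "amount": 5},
    {"id": 3, "user": "alice", "amount": -999},  # bad data written by mistake
]

broken_view = build_balances(master)

# Correct the mistake: drop the bad event and regenerate the view from scratch.
cleaned = [event for event in master if event["id"] != 3]
fixed_view = build_balances(cleaned)

print(broken_view, fixed_view)
```

No repair logic is needed inside the view itself: the view is disposable and the master data is the only thing that has to be right.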
  • Slide 73: Lambda Architecture: data storage is highly optimized.
  • Slide 74: Lambda Architecture: immutability changes everything.
  • Slide 75: Questions? @nathan_gs & #BigDataCon13
  • Slide 76: DataCrunchers: we enable companies in envisioning, defining and implementing a data strategy. A one-stop shop for all your Big Data needs. The first Big Data consultancy agency in Belgium.
  • Slide 77: Thank you. @nathan_gs