A real time architecture using Hadoop and Storm @ FOSDEM 2013

  • 20,501 views
Uploaded on

 

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
No Downloads

Views

Total Views
20,501
On Slideshare
0
From Embeds
0
Number of Embeds
27

Actions

Shares
Downloads
547
Comments
3
Likes
57

Embeds 0

No embeds

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. A real-time architecture using Hadoop and Storm.
  • 2. Speakers Nathan Bijnens Geert Van Landeghem @nathan_gs @gvanlandeghem A real-time architecture using Hadoop & Storm. 2
  • 3. Our Vision Volume Big Data test A real-time architecture using Hadoop & Storm. 3
  • 4. Big Data Velocity test A real-time architecture using Hadoop & Storm. 4
  • 5. Our Vision Volume test Variety A real-time architecture using Hadoop & Storm. 5
  • 6. CreditsNathan Marz Engineer at Backtype (now Twitter). Storm Cascalog ElephantDB manning.com/marz A real-time architecture using Hadoop & Storm. 6
  • 7. A Data System A real-time architecture using Hadoop & Storm. 7
  • 8. Data is more than Information Not all information is equal. Some information is derived from other pieces of information. A real-time architecture using Hadoop & Storm. 8
  • 9. Data is more than Information Eventually you will reach the most This is the information you hold true, simple because it exists. A real-time architecture using Hadoop & Storm. 9
  • 10. EventsEverything we do generates events:- Pay with Credit Card- Commit to Git- Click on a webpage- Tweet A real-time architecture using Hadoop & Storm. 10
  • 11. Events - Before Events used to manipulate the master data. A real-time architecture using Hadoop & Storm. 11
  • 12. Events - After Today, events are the master data. A real-time architecture using Hadoop & Storm. 12
  • 13. Data System everything. A real-time architecture using Hadoop & Storm. 13
  • 14. Events Data is Immutable A real-time architecture using Hadoop & Storm. 14
  • 15. Events Data is Time Based A real-time architecture using Hadoop & Storm. 15
  • 16. Capturing change traditionallyPerson Location Person LocationNathan Antwerp Nathan GhentGeert Dendermonde Geert DendermondeJohn Ghent John Ghent A real-time architecture using Hadoop & Storm. 16
  • 17. Capturing changePerson Location Time Person Location TimeNathan Antwerp 2005-01-01 Nathan Antwerp 2005-01-01Geert Dendermonde 2011-10-08 Geert Dendermonde 2011-10-08John Ghent 2010-05-02 John Ghent 2010-05-02 Nathan Ghent 2013-02-03 A real-time architecture using Hadoop & Storm. 17
  • 18. Query The data you query is often transformed, aggregated, ... A real-time architecture using Hadoop & Storm. 18
  • 19. Query Query = function ( data ) A real-time architecture using Hadoop & Storm. 19
  • 20. Number of people living in each city. Person Location Time Location Count Nathan Antwerp 2005-01-01 Ghent 2 Geert Dendermonde 2011-10-08 Dendermonde 1 John Ghent 2010-05-02 Nathan Ghent 2013-02-03 A real-time architecture using Hadoop & Storm. 20
  • 21. Query All Data Query A real-time architecture using Hadoop & Storm. 21
  • 22. Query: Precompute All Data Precomputed View Query A real-time architecture using Hadoop & Storm. 22
  • 23. Layered Architecture Batch Layer Speed Layer Serving Layer A real-time architecture using Hadoop & Storm. 23
  • 24. Layered Architecture Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 24
  • 25. Batch Layer A real-time architecture using Hadoop & Storm. 25
  • 26. Batch LayerIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 26
  • 27. Batch Layer Unrestrained computation. A real-time architecture using Hadoop & Storm. 27
  • 28. Batch Layer Horizontal scalable. A real-time architecture using Hadoop & Storm. 28
  • 29. Batch Layer High Latency. matter. A real-time architecture using Hadoop & Storm. 29
  • 30. Batch Layer Stores master copy of data set... append only. A real-time architecture using Hadoop & Storm. 30
  • 31. Batch Layer A real-time architecture using Hadoop & Storm. 31
  • 32. Batch: View generation View #1 Master Dataset View #2 MapReduce View #3 A real-time architecture using Hadoop & Storm. 32
  • 33. MapReduce 1. Take a large problem and divide it into sub-problems … MAP 2. Perform the same function on all sub-problems … DoWork() DoWork() DoWork() 3. Combine the output from all sub-problems REDUCE … Output A real-time architecture using Hadoop & Storm. 33
  • 34. Batch View Database Read only database. No random writes required. A real-time architecture using Hadoop & Storm. 34
  • 35. Batch View DatabaseElephantDBSplout A real-time architecture using Hadoop & Storm. 35
  • 36. Batch Layer Just a few hours of data. Not yet Data absorbed into Batch Views absorbed. Time Now A real-time architecture using Hadoop & Storm. 36
  • 37. Speed Layer A real-time architecture using Hadoop & Storm. 37
  • 38. Overview CassandraIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 38
  • 39. Speed Layer Stream processing. A real-time architecture using Hadoop & Storm. 39
  • 40. Speed Layer Continuous computation. A real-time architecture using Hadoop & Storm. 40
  • 41. Speed Layer Transactional. A real-time architecture using Hadoop & Storm. 41
  • 42. Speed Layer Storing a limited window of data. Compensating for the last few hours of data. A real-time architecture using Hadoop & Storm. 42
  • 43. Speed Layer All the complexity is isolated in the Speed layer auto- corrected. A real-time architecture using Hadoop & Storm. 43
  • 44. CAPYou have a choice between: Availability - Queries are eventual consistent. Consistency - Queries are consistent. A real-time architecture using Hadoop & Storm. 44
  • 45. Eventual accuracySome algorithms are hard to implement in real time. For those cases we could estimate the results. A real-time architecture using Hadoop & Storm. 45
  • 46. Speed Layer Real Time View 1Incoming Data Real Time View 2 A real-time architecture using Hadoop & Storm. 46
  • 47. StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion. A real-time architecture using Hadoop & Storm. 47
  • 48. StormMessage passing.Distributed processing.Horizontally scalable.Incremental algorithms.Fast.Data in motion. A real-time architecture using Hadoop & Storm. 48
  • 49. Storm Nimbus Zookeeper Supervisor Supervisor Supervisor Worker Worker Worker Worker Worker Worker Worker Worker Worker Worker Node Worker Node Worker Node A real-time architecture using Hadoop & Storm. 49
  • 50. StormTupleStream A real-time architecture using Hadoop & Storm. 50
  • 51. StormSpoutBolt A real-time architecture using Hadoop & Storm. 51
  • 52. StormGrouping A real-time architecture using Hadoop & Storm. 52
  • 53. Speed Layer ViewsThe views are stored in Read & Write database.- Cassandra- Hbase- MongoDB- MySQL- ElasticSearch-Much more complex than a read only view. A real-time architecture using Hadoop & Storm. 53
  • 54. Serving Layer A real-time architecture using Hadoop & Storm. 54
  • 55. Overview Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 55
  • 56. Serving Layer This layer queries the Batch & Real Time views and merges it. A real-time architecture using Hadoop & Storm. 56
  • 57. Serving Layer Batch Views Merge Real Time Views A real-time architecture using Hadoop & Storm. 57
  • 58. Overview A real-time architecture using Hadoop & Storm. 58
  • 59. Overview Cassandra QueryIncoming Data Hadoop Elephant DB A real-time architecture using Hadoop & Storm. 59
  • 60. Lambda ArchitectureCan discard any view, batch and real time, and justrecreate everything from the master data.Mistakes are corrected via recomputation.- Write bad data? Remove the data & recompute.- Bug in view generation? Just recompute the view.Data storage is highly optimized. A real-time architecture using Hadoop & Storm. 60
  • 61. Recommendations A real-time architecture using Hadoop & Storm. 61
  • 62. Serialization & Schema Catch errors as quickly as they happen. Validation on write vs on read. A real-time architecture using Hadoop & Storm. 62
  • 63. Serialization & Schema CSV is actually a serialization language that is just poorly defined. A real-time architecture using Hadoop & Storm. 63
  • 64. Serialization & Schema Use a format with a schema.- Thrift- Avro- Protobuffers A real-time architecture using Hadoop & Storm. 64
  • 65. Questions? What are your needs? @nathan_gs & @gvanlandeghem A real-time architecture using Hadoop & Storm. 65
  • 66. DataCrunchersWe enable companies in envisioning, defining andimplementing a data strategy.A one-stop-shop for all your Big Data needs.The first Big Data Consultancy agency in Belgium. A real-time architecture using Hadoop & Storm. 66
  • 67. Jobs We are hiring. jobs@datacrunchers.eu A real-time architecture using Hadoop & Storm. 67