2. Speakers
Nathan Bijnens (@nathan_gs)
Geert Van Landeghem (@gvanlandeghem)
A real-time architecture using Hadoop & Storm. 2
3. Our Vision
Volume
Big Data
4. Big Data
Velocity
5. Our Vision
Volume
Variety
6. Credits
Nathan Marz
Engineer at BackType (now Twitter).
Storm
Cascalog
ElephantDB
manning.com/marz
7. A Data System
8. Data is more than Information
Not all information is equal.
Some information is derived from other pieces of
information.
9. Data is more than Information
Eventually you will reach the most raw form of information.
This is the information you hold true, simply because it exists.
10. Events
Everything we do generates events:
- Pay with Credit Card
- Commit to Git
- Click on a webpage
- Tweet
11. Events - Before
Events used to manipulate the
master data.
12. Events - After
Today, events are the master
data.
13. Data System
everything.
14. Events
Data is Immutable
15. Events
Data is Time Based
16. Capturing change traditionally
Before the update:
Person   Location
Nathan   Antwerp
Geert    Dendermonde
John     Ghent

After the update:
Person   Location
Nathan   Ghent
Geert    Dendermonde
John     Ghent
17. Capturing change
Before:
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02

After (a new row is appended, nothing is overwritten):
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03
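The idea of capturing change by appending time-stamped facts can be sketched in a few lines of Python. This is only an illustration: the `LocationFact` type and the `current_location` helper are names assumed for this example, not something from the talk.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LocationFact:
    """One immutable, time-stamped fact (hypothetical type for this sketch)."""
    person: str
    location: str
    since: date


# The master data is an append-only list; existing facts are never updated.
facts = [
    LocationFact("Nathan", "Antwerp", date(2005, 1, 1)),
    LocationFact("Geert", "Dendermonde", date(2011, 10, 8)),
    LocationFact("John", "Ghent", date(2010, 5, 2)),
]

# Nathan moves: append a new fact instead of overwriting the old one.
facts.append(LocationFact("Nathan", "Ghent", date(2013, 2, 3)))


def current_location(facts, person):
    """Derive the current location from the most recent fact for that person."""
    own = [fact for fact in facts if fact.person == person]
    return max(own, key=lambda fact: fact.since).location if own else None


print(current_location(facts, "Nathan"))  # Ghent
```

The old fact (Nathan, Antwerp, 2005-01-01) stays in the data set, so the history remains available for other questions.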
18. Query
The data you query is often transformed,
aggregated, ...
19. Query
Query = function ( data )
20. Number of people living in each city.
All data:
Person   Location      Time
Nathan   Antwerp       2005-01-01
Geert    Dendermonde   2011-10-08
John     Ghent         2010-05-02
Nathan   Ghent         2013-02-03

Result:
Location      Count
Ghent         2
Dendermonde   1
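As a rough sketch of "Query = function ( data )", the query on this slide could be written as a pure function over all facts. The function name `people_per_city` is an assumption made for this example.

```python
from collections import Counter
from datetime import date

# The (person, location, time) facts from the slide.
facts = [
    ("Nathan", "Antwerp", date(2005, 1, 1)),
    ("Geert", "Dendermonde", date(2011, 10, 8)),
    ("John", "Ghent", date(2010, 5, 2)),
    ("Nathan", "Ghent", date(2013, 2, 3)),
]


def people_per_city(facts):
    """Query as a function of all data: count people by their latest location."""
    latest = {}  # person -> (location, time) of the most recent fact
    for person, location, when in facts:
        if person not in latest or when > latest[person][1]:
            latest[person] = (location, when)
    return Counter(location for location, _when in latest.values())


result = people_per_city(facts)
print(result["Ghent"], result["Dendermonde"])  # 2 1
```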
21. Query
All Data -> Query
22. Query: Precompute
All Data -> Precomputed View -> Query
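A minimal sketch of the precompute step, assuming click events with a `url` field (the event shape, `build_view`, and `query` are all hypothetical): the expensive scan over all data happens once, and each query becomes a cheap lookup in the view.

```python
from collections import Counter


def build_view(all_data):
    """Precompute: scan all events once and count clicks per URL."""
    return dict(Counter(event["url"] for event in all_data))


def query(view, url):
    """Query: a lookup in the precomputed view, with no scan of all data."""
    return view.get(url, 0)


all_data = [{"url": "/home"}, {"url": "/about"}, {"url": "/home"}]
view = build_view(all_data)
print(query(view, "/home"))  # 2
```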
23. Layered Architecture
Batch Layer
Speed Layer
Serving Layer
24. Layered Architecture
[Architecture diagram. Labels: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]
25. Batch Layer
33. MapReduce
MAP
1. Take a large problem and divide it into sub-problems.
2. Perform the same function on all sub-problems: DoWork() DoWork() DoWork() ...
REDUCE
3. Combine the output from all sub-problems into the Output.
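The three steps can be illustrated with a small in-memory sketch. It runs in a single process and does not use the Hadoop API; `split`, `do_work`, and `combine` are names chosen for this example only.

```python
from collections import defaultdict


def split(records, n):
    """Step 1: divide the input into n sub-problems."""
    return [records[i::n] for i in range(n)]


def do_work(chunk):
    """Step 2 (map): emit (location, 1) for every record in one sub-problem."""
    return [(location, 1) for _person, location in chunk]


def combine(mapped_chunks):
    """Step 3 (reduce): group the emitted pairs by key and sum the values."""
    grouped = defaultdict(list)
    for chunk in mapped_chunks:
        for key, value in chunk:
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


records = [("Nathan", "Ghent"), ("Geert", "Dendermonde"), ("John", "Ghent")]
output = combine(do_work(chunk) for chunk in split(records, 2))
print(output)  # {'Ghent': 2, 'Dendermonde': 1}
```

In a real cluster the sub-problems would run on different machines; here they run one after another, which is enough to show the shape of the computation.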
34. Batch View Database
Read only database.
No random writes required.
36. Batch Layer
Timeline (Time -> Now):
- Data absorbed into Batch Views.
- Not yet absorbed: just a few hours of data.
37. Speed Layer
38. Overview
[Architecture diagram. Labels: Incoming Data, Hadoop, ElephantDB, Cassandra]
39. Speed Layer
Stream processing.
40. Speed Layer
Continuous computation.
41. Speed Layer
Transactional.
42. Speed Layer
Storing a limited window of data.
Compensating for the last few hours of data.
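One way to picture a view over a limited window is the sketch below. It assumes integer timestamps and an in-memory list; the `RealTimeView` class and its methods are hypothetical, and a real speed layer would keep this state in a database.

```python
class RealTimeView:
    """Counts per key for events the batch layer has not absorbed yet."""

    def __init__(self):
        self._events = []  # (timestamp, key) pairs inside the window

    def on_event(self, timestamp, key):
        """Called for every incoming event, as it arrives."""
        self._events.append((timestamp, key))

    def expire(self, batch_absorbed_until):
        """Drop the events that the batch views now cover."""
        self._events = [
            (t, k) for t, k in self._events if t > batch_absorbed_until
        ]

    def count(self, key):
        return sum(1 for _t, k in self._events if k == key)


view = RealTimeView()
view.on_event(100, "/home")
view.on_event(105, "/home")
view.on_event(110, "/about")
print(view.count("/home"))  # 2

# A batch run has absorbed everything up to timestamp 100.
view.expire(batch_absorbed_until=100)
print(view.count("/home"))  # 1
```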
43. Speed Layer
All the complexity is isolated in the Speed layer.
If anything goes wrong, it is auto-corrected.
44. CAP
You have a choice between:
Availability
- Queries are eventually consistent.
Consistency
- Queries are consistent.
45. Eventual accuracy
Some algorithms are hard to implement in
real time. For those cases we could
estimate the results.
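As one possible illustration of estimating a result, the sketch below approximates a distinct count with a k-minimum-values (KMV) estimator. The talk does not name a specific technique, so the choice of KMV, the `KMVCounter` class, and the parameter `k` are all assumptions of this example.

```python
import hashlib
import heapq


class KMVCounter:
    """Approximate distinct count that keeps only the k smallest hash values."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so the largest kept hash is on top
        self._members = set()  # hashes currently kept, to skip duplicates

    @staticmethod
    def _hash(item):
        """Map an item to a deterministic pseudo-random float in [0, 1)."""
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return len(self._heap)  # fewer than k distinct items: exact count
        kth_smallest = -self._heap[0]
        return int((self.k - 1) / kth_smallest)


counter = KMVCounter(k=256)
for i in range(10_000):
    counter.add(f"user-{i}")
    counter.add(f"user-{i}")  # duplicates do not change the estimate
print(counter.estimate())  # close to 10000; the exact value depends on the hashes
```

Memory stays bounded by `k` no matter how many events arrive, at the cost of an error of a few percent. The batch layer can later replace the estimate with an exact count.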
46. Speed Layer
Incoming Data -> Real Time View 1
Incoming Data -> Real Time View 2
51. Storm
Spout
Bolt
52. Storm
Grouping
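The roles of a spout, a bolt, and a grouping can be mimicked in plain Python. The sketch below does not use the Storm API and runs in one process; it only shows the data flow, and every function name in it is hypothetical. The grouping imitates what Storm calls a fields grouping: tuples with the same field value always go to the same task.

```python
import zlib
from collections import Counter


def sentence_spout():
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in ["storm and hadoop", "hadoop and storm", "storm"]:
        yield (sentence,)


def split_bolt(tup):
    """Bolt: consume one tuple and emit one tuple per word."""
    (sentence,) = tup
    for word in sentence.split():
        yield (word,)


def fields_grouping(tup, num_tasks):
    """Grouping: route tuples with the same word to the same bolt task."""
    (word,) = tup
    return zlib.crc32(word.encode("utf-8")) % num_tasks


NUM_TASKS = 2
count_tasks = [Counter() for _ in range(NUM_TASKS)]  # state of each counting task

for sentence_tuple in sentence_spout():
    for word_tuple in split_bolt(sentence_tuple):
        task = fields_grouping(word_tuple, NUM_TASKS)
        count_tasks[task][word_tuple[0]] += 1

totals = sum(count_tasks, Counter())
print(totals["storm"], totals["hadoop"], totals["and"])  # 3 2 2
```

Because each word is always routed to the same task, every task holds the complete count for its words and no cross-task merge is needed per word.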
53. Speed Layer Views
The views are stored in a read & write database.
- Cassandra
- HBase
- MongoDB
- MySQL
- ElasticSearch
- ...
Much more complex than a read-only view.
54. Serving Layer
55. Overview
[Architecture diagram. Labels: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]
56. Serving Layer
This layer queries the Batch & Real Time
views and merges them.
57. Serving Layer
Batch Views + Real Time Views -> Merge
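For count-style views, the merge could be as simple as the sketch below. The view contents and the `merge_views` function are hypothetical, and summing is only correct if the real time view covers exactly the data that the batch view has not absorbed yet.

```python
def merge_views(batch_view, real_time_view):
    """Merge two count views by summing the counts per key."""
    merged = dict(batch_view)
    for key, count in real_time_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


batch_view = {"Ghent": 2, "Dendermonde": 1}  # computed by the batch layer
real_time_view = {"Ghent": 1, "Antwerp": 1}  # the last few hours only
print(merge_views(batch_view, real_time_view))
# {'Ghent': 3, 'Dendermonde': 1, 'Antwerp': 1}
```

Other kinds of views need other merge functions, for example a union for sets or a maximum for "latest value" views.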
58. Overview
59. Overview
[Architecture diagram. Labels: Incoming Data, Hadoop, ElephantDB, Cassandra, Query]
60. Lambda Architecture
You can discard any view, batch or real time, and just
recreate everything from the master data.
Mistakes are corrected via recomputation.
- Write bad data? Remove the data & recompute.
- Bug in view generation? Just recompute the view.
Data storage is highly optimized.
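Correction by recomputation can be sketched as follows, assuming a small in-memory master data set with hypothetical `id` and `city` fields: the view is never patched in place, it is thrown away and rebuilt.

```python
from collections import Counter


def build_view(master_data):
    """Recreate the view from scratch: number of events per city."""
    return dict(Counter(event["city"] for event in master_data))


master_data = [
    {"id": 1, "city": "Ghent"},
    {"id": 2, "city": "Ghnet"},  # bad data: a misspelled city was written
    {"id": 3, "city": "Antwerp"},
]
view = build_view(master_data)
print(view)  # {'Ghent': 1, 'Ghnet': 1, 'Antwerp': 1}

# Write bad data? Remove the data and recompute the view.
master_data = [event for event in master_data if event["id"] != 2]
view = build_view(master_data)
print(view)  # {'Ghent': 1, 'Antwerp': 1}
```

A bug in `build_view` itself would be handled the same way: fix the function and run it again over the unchanged master data.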
61. Recommendations
62. Serialization & Schema
Catch errors as quickly as they happen.
Validation on write vs on read.
63. Serialization & Schema
CSV is actually a serialization language that is just
poorly defined.
64. Serialization & Schema
Use a format with a schema.
- Thrift
- Avro
- Protocol Buffers
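The benefit of validating on write can be shown without any of these libraries. The sketch below uses a plain Python dataclass as a stand-in for a real schema; `LocationEvent` and `write` are hypothetical names, and Thrift, Avro, or Protocol Buffers would generate comparable typed records from a schema file.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LocationEvent:
    """A record with a fixed schema: two non-empty strings and a date."""
    person: str
    location: str
    since: date

    def __post_init__(self):
        # Validation on write: a malformed record is rejected when it is
        # created, not months later when some job tries to read it.
        for name in ("person", "location"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value:
                raise ValueError(f"{name} must be a non-empty string")
        if not isinstance(self.since, date):
            raise ValueError("since must be a date")


def write(store, person, location, since):
    """Append a validated event to the store."""
    store.append(LocationEvent(person, location, since))


store = []
write(store, "Nathan", "Ghent", date(2013, 2, 3))
try:
    write(store, "Geert", "Dendermonde", "2011-10-08")  # a string, not a date
except ValueError as error:
    print("rejected on write:", error)

print(len(store))  # 1
```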
65. Questions?
What are your needs?
@nathan_gs & @gvanlandeghem
66. DataCrunchers
We enable companies in envisioning, defining and
implementing a data strategy.
A one-stop-shop for all your Big Data needs.
The first Big Data Consultancy agency in Belgium.
67. Jobs
We are hiring.
jobs@datacrunchers.eu