Big Data: Architectures and Approaches

ThoughtWorkers David Elliman and Ashok Subramanian present how the big data world is moving quickly with predictions of amazing industry growth. For more information on how the 'Internet of Things' is playing an increasingly larger role, read David's blog post or watch the video from the London-based event. http://www.thoughtworks.com/insights/blog/big-data-and-internet-things

Published in: Technology
  • Dave
    Reference: http://www.forbes.com/sites/gilpress/2013/05/09/a-very-short-history-of-big-data/
  • Ashok
    Big data analytics are driving rapid growth for public cloud computing vendors, with revenues for the top 50 public cloud providers shooting up 47% in the fourth quarter last year to $6.2 billion.
  • Dave
    http://nsa.gov1.info/utah-data-center/
  • Ashok
    Who is that handsome man!
  • Dave & Ashok
    Growth in retail, usage of iBeacons, Precision marketing, some sophistication with web analytics & CRM - greater penetration.
    Healthcare - remote monitoring, automated procedures
  • Ashok
  • Ashok
    Validation or Discovery
    picture of fork in the road?
  • Ashok & Dave
  • Dave
  • Ashok
    Exploring alternate models
  • Dave
  • Ashok
    Lambda Architecture - section heading
  • Ashok - high level description of components
  • Dave
    Batch Hadoop 2.0/MR2
    Goal: lets you share a large cluster of machines between different frameworks. Similar to Mesos; both are steps towards a distributed data OS.
  • Dave
    Data Lakes
  • Dave
  • Ashok
  • Ashok
  • Ashok
    Fast and Scalable Analytics depends on efficient data structures
    Matching the Algorithm to the data structure
    Morphing the Raw data into the data structure
    Raw data > Data Structure > Algorithm > Insight
  • Conclusion
  • Ashok
    Balance shifting from Commercial to Open-Source
    Innovations coming from the open source world
  • Ashok
    Quantum computing - this is one apparently!
  • closing statement before Q&A
  • Dave
  • Big Data: Architectures and Approaches

    1. 1. Welcome. BIG DATA: Architectures and Approaches. David Elliman & Ashok Subramanian
    2. 2. Luke Barrett 1971-2014
    3. 3. http://upload.wikimedia.org/wikipedia/commons/f/f0/DARPA_Big_Data.jpg BIG DATA
    4. 4. https://www.flickr.com/photos/katerha/8380451137/
    5. 5. 1944 (timeline 1940-2015) https://www.flickr.com/photos/timetrax/376152628/sizes/l
    6. 6. 1961 (timeline 1940-2015)
    7. 7. 1971 (timeline 1940-2015)
    8. 8. 1996 (timeline 1940-2015) https://www.flickr.com/photos/epsos/8336691931 ...ge becomes more cost effective for storing da...
    9. 9. 1996 (timeline 1940-2015)
    10. 10. 1998 (timeline 1940-2015)
    11. 11. 1998 (timeline 1940-2015) https://www.usenix.org/conference/1999-usenix-annual-technical-conference/big-data-and-next-wave-infrastress-problems
    12. 12. 2004 (timeline 1940-2015)
    13. 13. 2006 (timeline 1940-2015)
    14. 14. 2008 (timeline 1940-2015)
    15. 15. 2010 (timeline 1940-2015)
    16. 16. 2013 (timeline 1940-2015) "alottabytes"
    17. 17. 2015 (timeline 1940-2015) https://www.flickr.com/photos/will-lion/2595830716/
    18. 18. https://www.flickr.com/photos/taedc/6998468974
    19. 19. http://blogs.gartner.com/doug-laney/batman-on-big-data/
    20. 20. https://www.flickr.com/photos/10ch/3347658610/
    21. 21. THE OPPORTUNITY
    22. 22. DATA vs INSIGHT across three eras: before 1990; 1990s to 2000; 2000 onwards
    23. 23. Key Takeaways • This isn’t a new problem • The problem isn’t going away • Remember to focus on the VALUE https://www.flickr.com/photos/djwtwo/8331524425/
    24. 24. Where do we… https://www.flickr.com/photos/ekosystem/4334671818/
    25. 25. https://www.flickr.com/photos/libraryacu/7695938410/
    26. 26. Analytics - Goals (axes: Complexity and Value): Descriptive Analytics (What happened?); Diagnostic Analytics (Why did it happen?); Predictive Analytics (What will happen?); Prescriptive Analytics (How can we make it happen?)
    27. 27. https://www.flickr.com/photos/lopetz/3912416793/ REAL TIME BATCH
    28. 28. Volume Velocity REAL TIME BATCH
    29. 29. https://www.flickr.com/photos/ingythewingy/5510406450/
    30. 30. THINK BIG, ACT SMALL. Small is the New Big (Seth Godin)
    31. 31. https://www.flickr.com/photos/pauldineen/4529216647/
    32. 32. “80% of the work in any data project is in cleaning the data” – D J Patil https://www.flickr.com/photos/desideratum/8595251348/
    33. 33. https://www.flickr.com/photos/22280677@N07/2504310138/
    34. 34. https://www.flickr.com/photos/jm3/4814208649/
    35. 35. SQL
    36. 36. https://www.flickr.com/photos/marc_smith/6793088143/
    37. 37. Key Takeaways • Start small • Start with the ? • Iteratively follow the value • Using freely available tooling • Volume vs Velocity https://www.flickr.com/photos/djwtwo/8331524425/
    38. 38. Scaling the Solution https://www.flickr.com/photos/auntiep/4310240/
    39. 39. https://www.flickr.com/photos/111692634@N04/11407095913/
    40. 40. “Amdahl’s law is used to find the maximum expected improvement to an overall system when only part of the system is improved.” (attributed to Gene Amdahl, 1967)
    41. 41. https://twitter.com/PieCalculus/status/459485747842523136/photo/1
    42. 42. https://www.flickr.com/photos/rofi/2097239111/
    43. 43. Lambda Architecture: All Data; Batch; Speed; Serving; Query. query = function(all data)
    44. 44. Lambda Architecture: All Data; Scaled Data Store; Event Processing Network; Batch View; Realtime View; Batch Write; Random Write; Query
    45. 45. Lambda Architecture: All Data; Batch; Speed; Serving; Query. query = function(all data)
    46. 46. Batch - Hadoop (MR1): Client; Master Node (JobTracker, Name Node); metadata operations to get block info; job assignment to cluster; four Slave Nodes, each with a Task Tracker and a Data Node running Map and Reduce; data replication on multiple nodes; data write and data read
    47. 47. Batch - MapReduce Map Shuffle Reduce
    48. 48. Batch - Cascading
    49. 49. Batch - Spark
    50. 50. Batch - MPP database: Master Servers (query planning & dispatch); Network Interconnect; Segment Servers (query processing and data storage); External Sources (loading, streaming, etc.); SQL or MapReduce
    51. 51. Lambda Architecture: All Data; Batch; Speed; Serving; Query. query = function(all data)
    52. 52. Speed - Storm
    53. 53. CEP
    54. 54. Lambda Architecture: All Data; Batch; Speed; Serving; Query. query = function(all data)
    55. 55. Lambda Architecture - Serving
    56. 56. http://www.wallzhq.com/wp-content/uploads/2014/02/matrix_binary-wide.jpg
    57. 57. Conventional Architectures: pull-based batch loads; enterprise data models; complex ETL logic; poorly suited to non-relational data; emergent design is difficult
    58. 58. Pivotal Business Data Lake Architecture http://www.gopivotal.com/sites/default/files/Pivotal-Business-Data-Lake-Technical_Brochure_WEB.PDF
    59. 59. DATA CORE: RAW FACTUAL DATA; HISTORIZED EVENTS; RETAIN BUSINESS KEY; DATA LINEAGE
    60. 60. DATA INGESTION: EVENT DRIVEN; MESSAGE QUEUE; TRICKLE FEED; BATCH LOAD
    61. 61. INFORMATION PUBLISHING: TOPICAL QUEUES; POST PROCESSING
    62. 62. INFORMATION TIER: PURPOSE BUILT; DATA SUBSETS; TRANSFORMATION; DATA GOVERNANCE; MDM CONCERNS; POST PROCESSING
    63. 63. PRESENTATION TIER: BUSINESS VALUE; APPLICATIONS; DATA SERVICES; AD HOC QUERYING; WRITE BACK?
    64. 64. Emergent Design & Agile Delivery: Transformation Logic; Data; Post Processing; Near Real Time Feed
    65. 65. Apache Kafka Apache Storm
    66. 66. Micro-data-services
    67. 67. Drive Towards In Memory Processing
    68. 68. https://www.tele-task.de/archive/lecture/overview/5721/
    69. 69. Remember https://www.flickr.com/photos/anjin/695894443/
    70. 70. Data Structures, Algorithms https://www.flickr.com/photos/herrolsen/7645876896/
    71. 71. Raw Data > Data Structure > Algorithm > Insight
    72. 72. Key Takeaways • Embrace the cloud • Fit the Architecture to the problem • Remember Knuth https://www.flickr.com/photos/djwtwo/8331524425/
    73. 73. https://www.flickr.com/photos/tim_norris/2789759648/ SUMMARY
    74. 74. Hadoop Ecosystem (bar chart; bar values 48, 30, 26, 22, 18, 18, 16, 15, 15, 15, 13, 13, 13, 13, 12; bar labels not recoverable) http://www.datameer.com/blog/uncategorized/the-hadoop-ecosystem-visualized-in-datameer.html
    75. 75. https://www.flickr.com/photos/classblog/5136926303/ Commercial Open Source
    76. 76. https://blog.cloudera.com/blog/2011/10/the-community-effect/
    77. 77. https://www.flickr.com/photos/ctsi-global/6556284907/
    78. 78. https://www.flickr.com/photos/will-lion/2597608152/
    79. 79. https://www.flickr.com/photos/jurvetson/14105339228/
    80. 80. Open Questions http://talkmarketing.co.uk/wp-content/uploads/2013/07/Open-Ended-Questions.jpg
    81. 81. https://www.flickr.com/photos/typoatelier/5615759848/
    82. 82. https://www.flickr.com/photos/rembcc/3802038945/
    83. 83. https://www.flickr.com/photos/sidelong/246816211/
    84. 84. No matter how much you speed up the computers or the way you put computers together, the real issues are at the DATA LEVEL
    85. 85. https://www.flickr.com/photos/opensourceway/5556249000/
    86. 86. Enterprise Master Data Management
    87. 87. Localised Formats
    88. 88. Single System of Record
    89. 89. SoR is a process not a place
    90. 90. Database Integration (by another name)
    91. 91. http://www.bain.com/infographics/big-data/ Organisational Models
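
Worked example for slide 40 (Amdahl's law). The slide quotes the law in words only; the sketch below uses the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelised and n is the number of workers. The function names and the sample figures are illustrative assumptions, not taken from the talk.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when only `parallel_fraction` of the work is spread over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    # Hypothetical job: 90% of the run time can be parallelised.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
    print("limit", round(max_speedup(0.9), 2))
```

The point for scaling a solution is that the serial part sets a ceiling: adding machines helps less and less once the parallel part has been squeezed.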
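
Worked example for slides 43 to 45 (Lambda Architecture). The slides state "query = function(all data)" and name a batch view and a realtime view. The sketch below is one minimal, in-memory reading of that idea, assuming a page-view counting use case: the batch layer recomputes a view from the master dataset up to a cutoff, the speed layer counts only newer events, and the serving step merges the two. `PageView`, `SpeedLayer`, `serve` and the sample events are all hypothetical names and data.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class PageView(NamedTuple):
    timestamp: int
    url: str


def query_all_data(url: str, master: Iterable[PageView]) -> int:
    """Reference answer: the query computed directly over all data."""
    return sum(1 for event in master if event.url == url)


def compute_batch_view(master: Iterable[PageView], cutoff: int) -> Counter:
    """Batch layer: recompute counts from scratch for events up to the cutoff."""
    return Counter(event.url for event in master if event.timestamp <= cutoff)


class SpeedLayer:
    """Speed layer: incrementally count events newer than the batch cutoff."""

    def __init__(self, cutoff: int) -> None:
        self.cutoff = cutoff
        self.view: Counter = Counter()

    def on_event(self, event: PageView) -> None:
        if event.timestamp > self.cutoff:
            self.view[event.url] += 1


def serve(url: str, batch_view: Counter, realtime_view: Counter) -> int:
    """Serving step: merge the batch view and the realtime view."""
    return batch_view[url] + realtime_view[url]


master: List[PageView] = []
speed = SpeedLayer(cutoff=100)
for event in [PageView(10, "/home"), PageView(50, "/pricing"),
              PageView(120, "/home"), PageView(130, "/home")]:
    master.append(event)   # the master dataset is append-only
    speed.on_event(event)  # the speed layer sees every event as it arrives

batch_view = compute_batch_view(master, cutoff=100)
home_views = serve("/home", batch_view, speed.view)
print(home_views, query_all_data("/home", master))
```

The merged answer matches the direct computation over all data, which is the property the architecture is designed to preserve while keeping the slow recomputation off the query path.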
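
Worked example for slide 47 (Batch - MapReduce: Map, Shuffle, Reduce). The sketch below runs the three phases in a single Python process on a word count, which is an assumed example and not one given in the talk. It illustrates the programming model only; it does not use Hadoop or any cluster API, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.split():
        yield word.lower(), 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents: Iterable[str]) -> Dict[str, int]:
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(map_reduce(["big data", "Big insight"]))
```

In a real cluster the map and reduce calls run on different nodes and the shuffle moves data across the network; here the same steps are simply function calls.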
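
Worked example for slide 71 (Raw Data > Data Structure > Algorithm > Insight). The speaker notes say that fast analytics depends on efficient data structures and on matching the algorithm to the data structure. A minimal sketch of that chain, assuming the raw data is a list of unsorted event timestamps: morph it into a sorted list, then use binary search to count events in a time window. The function names and sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List


def build_index(raw_timestamps: Iterable[int]) -> List[int]:
    """Raw data to data structure: a sorted list that supports binary search."""
    return sorted(raw_timestamps)


def count_in_window(index: List[int], start: int, end: int) -> int:
    """Algorithm: count events with start <= timestamp <= end in O(log n)."""
    return bisect_right(index, end) - bisect_left(index, start)


if __name__ == "__main__":
    raw = [17, 3, 42, 8, 23, 15]         # raw data, in arrival order
    index = build_index(raw)             # data structure
    busy = count_in_window(index, 8, 23)  # algorithm
    print(f"{busy} events between t=8 and t=23")  # insight
```

Sorting once costs O(n log n), after which each window query is two binary searches instead of a full scan of the raw data.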