Future of Data Intensive
Applications
Milind Bhandarkar
Chief Scientist, Pivotal
@techmilind

Thursday, December 12, 2013

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar


NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar, Chief Scientist, Pivotal Labs

Published in: Technology, Business

Transcript of "NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar"

  1. Future of Data Intensive Applications
     Milind Bhandarkar, Chief Scientist, Pivotal (@techmilind)
     Thursday, December 12, 2013
  2. About Me
     • http://www.linkedin.com/in/milindb
     • Founding member of Hadoop team at Yahoo! [2005-2010]
     • Contributor to Apache Hadoop since v0.1
     • Built and led Grid Solutions Team at Yahoo! [2007-2010]
     • Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
     • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  4. Kryptonite: First Hadoop Cluster At Yahoo!
  5. M45
  6. OpenCirrus
  7. Analytics Workbench
  8. Analytics Workbench
  10. Average Enterprises: The Big Gap
     • 70% of data generated by customers
     • 80% of data being stored
     • 3% being prepared for analysis
     • 0.5% being analyzed
     • <0.5% being operationalized
  11. Example: Healthcare
     • In last 5 years
       • 3,573 studies on hospital readmissions
       • 9,745 papers on comparative effectiveness
       • 39,230 studies on drug interactions
       • 132,241 studies on hospital mortality
     • Yet, very few models operational
  12. "PowerPoint is where Models go to Die."
     - Hulya Farinas, Principal Data Scientist, Pivotal
  13. Modernization
  14. Building Blocks
  15. Modern Data Architecture
     Diagram labels: Applications; BI/Analytics Tools; Data Science; On-Demand, Self-Service Access; (Big) Data Staging Platform; Analytic Data Warehouse; In Memory Data Grid; High-Speed Integration; Meta-data driven access control; Data provisioning, shared security, coordinated transformation; Unstructured, Semi-structured, Structured
  16. Data Fabric Requirements
     • Store massive & diverse data sets economically
     • Integrate and Ingest from legacy & disparate sources
     • Ability to rapidly analyze massive data sets
     • Control, Auditing, Manageability
     • Self-Service
  17. Data Fabric Architecture
  18. Infrastructure-As-A-Service is the new “Hardware”
  19. IAAS: New Hardware
     • AWS, GCE, Azure
     • vSphere, OpenStack
     • Easy Provisioning
     • Scalable, Elastic, Ubiquitous
     • Needs bundling with Data & Analytics as Services
  20. App Fabric Requirements
     • IAAS Cloud-Agnostic
     • Rapid provisioning, Elasticity
     • Open, No-Lock-In, Data As-A-Service
     • Automation for Application Lifecycle Management
     • Developer Agility: Eliminate Infrastructure Wiring
  22. Ecosystem
  23. Broader Ecosystem
  24. Legacy App Deployment
  25. Modern App Deployment
     provision <my cloud>
       target <my cloud>
       push <my app>
       bind <my services>
       scale <my app> +100
     upgrade <my cloud>
  26. Application As Unit of Deployment
     Diagram labels: App; Dev Framework; Configurations; Manifests, Automations; JVM; App Server; VM; Container 1, Container 2; Infrastructure One, Infrastructure Two
  27. Hadoop’s Role in Data Clouds
  28. Trough of Disillusionment?
  29. Or, Hadoop Everywhere?
  35. Game Changing Hadoop Economics
     Chart: Big Data Platform Price/TB, 2008 to 2013 (y-axis up to $80,000); series: Big Data DB, Hadoop
  36. Storage Options
     • HDFS, MapR, Quantcast QFS
     • EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre
     • Amazon S3, EMC Atmos, OpenStack Swift
     • GlusterFS, Ceph
     • EMC ViPR
  37. SQL-on-Hadoop
     • Pivotal HAWQ
     • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger
     • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase
     • More to come...
  38. Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks)
     Diagram labels: Single App (five boxes); BATCH and HDFS (three each); INTERACTIVE; ONLINE
  39. MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
  40. Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)
     • HADOOP 1.0: Pig (data flow), Hive (sql), Others (cascading); MapReduce (cluster resource management & data processing); HDFS (redundant, reliable storage)
     • HADOOP 2.0: MR (batch), Pig (data flow), Hive (sql), Others (cascading); Tez (execution engine); RT, Stream, Graph, Services; YARN (cluster resource management); HDFS2 (redundant, reliable storage)
  41. YARN Platform (Image Courtesy Arun Murthy, Hortonworks)
     Applications Run Natively IN Hadoop
     • BATCH (MapReduce)
     • INTERACTIVE (Tez)
     • ONLINE (HBase)
     • STREAMING (Storm, S4,…)
     • GRAPH (Giraph)
     • IN-MEMORY (Spark)
     • HPC MPI (OpenMPI)
     • OTHER (Search) (Weave…)
     YARN (Cluster Resource Management)
     HDFS2 (Redundant, Reliable Storage)
  42. YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)
     Diagram labels: ResourceManager; Scheduler; Client2; NodeManager (repeated, one per node); AM 1, AM 2; Containers numbered 1.x and 2.x
  43. YARN
     • Yet Another Resource Negotiator
     • Resource Manager
     • Node Managers
     • Application Masters
       • Specific to paradigm, e.g. MR Application master (aka JobTracker)
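
The division of labor among the three YARN roles named on this slide can be sketched as a toy model. This is not the YARN API: the class and method names (`ResourceManager.allocate`, `NodeManager.launch`, `ApplicationMaster.run`) and the slot-count model are hypothetical, and real YARN adds queues, heartbeats, and container lifecycles that are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: tracks free container slots on one machine (hypothetical model)."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)

    def launch(self, task: str) -> str:
        # Start one container on this node and consume one slot.
        assert self.free_slots > 0, "no free slot on this node"
        self.free_slots -= 1
        container_id = f"{self.name}/c{len(self.running)}"
        self.running.append((container_id, task))
        return container_id


class ResourceManager:
    """Cluster-wide arbiter: decides which node hosts each requested container."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, count: int):
        # Grant up to `count` slots, picking the node with the most free slots each time.
        granted = []
        budget = {n.name: n.free_slots for n in self.nodes}
        for _ in range(count):
            node = max(self.nodes, key=lambda n: budget[n.name])
            if budget[node.name] == 0:
                break  # cluster is full; grant fewer than requested
            budget[node.name] -= 1
            granted.append(node)
        return granted


class ApplicationMaster:
    """Per-application coordinator: knows the paradigm (here, a plain list of tasks)."""

    def __init__(self, tasks):
        self.tasks = list(tasks)

    def run(self, rm: ResourceManager):
        nodes = rm.allocate(len(self.tasks))
        # Pair each granted node with one task and launch it there.
        return [node.launch(task) for node, task in zip(nodes, self.tasks)]


if __name__ == "__main__":
    cluster = [NodeManager("nm1", 2), NodeManager("nm2", 1)]
    rm = ResourceManager(cluster)
    am = ApplicationMaster(["map-0", "map-1", "map-2", "map-3"])
    print(am.run(rm))  # three containers start; the fourth task finds no slot
```

The point of the split is that only the `ApplicationMaster` knows what a "task" means, so a different paradigm can reuse the same resource manager and node managers unchanged.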
  44. Beyond MapReduce
     • Apache Giraph - BSP & Graph Processing
     • Storm on Yarn - Streaming Computation
     • HOYA - HBase on Yarn
     • Hamster - MPI on Hadoop
     • More to come ...
  45. Hamster
     • Hadoop and MPI on the same cluster
     • OpenMPI Runtime on Hadoop YARN
     • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System
     • Open MPI Provides: Process launching, Communication, I/O forwarding
  46. Hamster Components
     • Hamster Application Master
     • Gang Scheduler, YARN Application Preemption
     • Resource Isolation (lxc Containers)
     • ORTE: Hamster Runtime
     • Process launching, Wireup, Interconnect
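
An MPI job needs all of its ranks running at the same time, which is the usual reason for a gang scheduler. A minimal sketch of that all-or-nothing idea follows; the function name `gang_allocate` and the per-node slot counts are assumptions made for illustration, not Hamster's actual scheduler.

```python
from typing import Dict, Optional


def gang_allocate(free_slots: Dict[str, int], ranks: int) -> Optional[Dict[str, int]]:
    """Return a node -> slot-count plan covering all `ranks`, or None.

    All-or-nothing: a partial grant is useless to an MPI job, so nothing is
    reserved unless every rank can be placed.
    """
    if ranks <= 0:
        raise ValueError("ranks must be positive")
    if sum(free_slots.values()) < ranks:
        return None  # not enough capacity: wait rather than start a partial gang

    plan: Dict[str, int] = {}
    remaining = ranks
    # Fill the nodes with the most free slots first to keep ranks co-located.
    for node, free in sorted(free_slots.items(), key=lambda kv: -kv[1]):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            plan[node] = take
            remaining -= take
    return plan


if __name__ == "__main__":
    cluster = {"nm1": 4, "nm2": 2, "nm3": 1}
    print(gang_allocate(cluster, 6))  # fits: every rank gets a slot
    print(gang_allocate(cluster, 8))  # does not fit: None, nothing reserved
```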
  47. Hamster Architecture
     Diagram labels: Client; Client-RM; RM-NodeManager; Resource Manager (Scheduler, AMService); RM-AM; AM-NM; MPI AM; MPI Scheduler; HNP; Node Manager; Proc/Container; Aux Srvcs; Framework Daemon; NS
  48. Hamster Scalability
     • Sufficient for small to medium HPC workloads
     • Job launch time gated by YARN resource scheduler

                Launch     WireUp     Collectives  Monitor
     OpenMPI    O(logN)    O(logN)    O(logN)      O(logN)
     Hamster    O(N)       O(logN)    O(logN)      O(logN)
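
To give a feel for the gap between O(logN) and O(N) in this comparison, the sketch below counts steps for a tree-structured fan-out versus a one-at-a-time sequence. The fan-out model and the function names are assumptions for illustration; these are step counts, not measurements of OpenMPI or Hamster.

```python
def tree_rounds(n: int, fanout: int = 1) -> int:
    """Rounds needed to reach n processes when every already-reached process
    contacts `fanout` new ones per round. Grows as O(log n)."""
    if n < 1 or fanout < 1:
        raise ValueError("n and fanout must be at least 1")
    reached, rounds = 1, 0  # the root process counts as already reached
    while reached < n:
        reached += reached * fanout
        rounds += 1
    return rounds


def sequential_steps(n: int) -> int:
    """Steps needed when one coordinator handles processes one by one. Grows as O(n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(f"N={n}: tree rounds={tree_rounds(n)}, sequential steps={sequential_steps(n)}")
```

With the default fan-out the reached set doubles each round, so going from 256 to 4096 processes adds four rounds to the tree but 3840 steps to the sequence.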
  49. GraphLab + Hamster on Hadoop
  50. About GraphLab
     • Graph-based, High-Performance distributed computation framework
     • Started by Prof. Carlos Guestrin at CMU in 2009
     • Recently founded Graphlab Inc to commercialize Graphlab.org
  51. GraphLab Features
     • Topic Modeling (e.g. LDA)
     • Graph Analytics (Pagerank, Triangle counting)
     • Clustering (K-Means)
     • Collaborative Filtering
     • Linear Solvers
     • etc...
  52. Only Graphs are not Enough
     • Full Data processing workflow required: ETL/Postprocessing, Visualization, Data Wrangling, Serving
     • MapReduce excels at data wrangling
     • OLTP/NoSQL Row-Based stores excel at Serving
     • GraphLab should co-exist with other Hadoop frameworks
  53. Call To Action
  54. Prepare for Convergence
     • HPC: Cache Coherence, Prefetching, Zerocopy, Low-contention locks
     • “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency
     • Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization
  55. Convergence
     • Resource Allocation, Scheduling, Lifecycle Management
     • Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs
     • Auth & Auth, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering
  56. New Hardware Platforms
     • Mellanox - Hadoop Acceleration through Network-assisted Merge
     • RoCE - Brocade, Cisco, Extreme, Arista...
     • ARM - Low power Hadoop servers
     • SSD - Velobit, Violin, FusionIO, Samsung...
     • Niche - Compression, Encryption...
  57. Data Cloud of Future?
     Diagram labels: deploy; Public Cloud; On Premise; Private Cloud
  58. Questions?