Future of Data Intensive
Applications
Milind Bhandarkar
Chief Scientist, Pivotal
@techmilind
Thursday, December 12, 2013
About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team atYahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid SolutionsTeam atYahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center
for Supercomputing Applications (NCSA), Center for Simulation of Advanced
Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by
QLogic),Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
Thursday, December 12, 2013
Thursday, December 12, 2013
Kryptonite: First Hadoop Cluster AtYahoo!
Thursday, December 12, 2013
M45
Thursday, December 12, 2013
OpenCirrus
Thursday, December 12, 2013
Analytics Workbench
Thursday, December 12, 2013
Analytics Workbench
Thursday, December 12, 2013
Thursday, December 12, 2013
70% of data
generated by
customers
80% of data
being stored
3% being
prepared for
analysis
0.5% being
analyzed
<0.5% being
operationalized
Average Enterprises
The Big Gap
Thursday, December 12, 2013
Example: Healthcare
•In last 5 years
•3,573 studies on hospital readmissions
•9,745 papers on comparative effectiveness
•39,230 studies on drug interactions
•132,241 studies on hospital mortality
•Yet, very few models operational
Thursday, December 12, 2013
PowerPoint is where
Models go to Die.
- Hulya Farinas, Principal Data Scientist,
Pivotal
Thursday, December 12, 2013
Modernization
Thursday, December 12, 2013
Building Blocks
Thursday, December 12, 2013
Structured
BI/Analytics ToolsData Science Applications
Semi-structuredUnstructured
High-Speed Integration Data provisioning, shared security,
coordinated transformation
(Big) Data Staging
Platform
Analytic
Data Warehouse
In Memory
Data Grid
On-Demand, Self-Service Access
Meta-data driven access control
Modern Data Architecture
Thursday, December 12, 2013
Data Fabric
Requirements
•Store massive & diverse data sets
economically
•Integrate and Ingest from legacy & disparate
sources
•Ability to rapidly analyze massive data sets
•Control,Auditing, Manageability
•Self-Service
Thursday, December 12, 2013
Data Fabric Architecture
Thursday, December 12, 2013
Infrastructure-As-A-
Service is the new
“Hardware”
Thursday, December 12, 2013
IAAS: New Hardware
•AWS, GCE,Azure
•vSphere, OpenStack
•Easy Provisioning
•Scalable, Elastic, Ubiquitous
•Needs bundling with Data & Analytics as
Services
Thursday, December 12, 2013
App Fabric
Requirements
•IAAS Cloud-Agnostic
•Rapid provisioning, Elasticity
•Open, No-Lock-In, Data As-A-Service
•Automation for Application Lifecycle
Management
•Developer Agility : Eliminate Infrastructure
Wiring
Thursday, December 12, 2013
Thursday, December 12, 2013
Ecosystem
Thursday, December 12, 2013
Broader Ecosystem
Thursday, December 12, 2013
Legacy App Deployment
Thursday, December 12, 2013
!"#$%&%#'()*+(,-#./0(
((12"341()*+(,-#./0(
((!.&5()*+(2!!0(
((6%'/()*+(&4"$%,4&0(
((&,2-4()*+(2!!0(7899(
.!3"2/4()*+(,-#./0(
(
Modern App Deployment
Thursday, December 12, 2013
Infrastructure
One
JVM
VM
Infrastructure
One
Infrastructure
Two
App
Container 1
App Server
JVM
Container 2
App Server
JVM
Dev Framework Dev Framework
App Server
Configurations Manifests, Automations
Infrastructure
Two
JVM
VM
Dev Framework
App Server
Configurations
App App App
Application As Unit of Deployment
Thursday, December 12, 2013
Hadoop’s Role in Data
Clouds
Thursday, December 12, 2013
Trough of Disillusionment ?
Thursday, December 12, 2013
Or, Hadoop Everywhere?
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Thursday, December 12, 2013
Game Changing
Hadoop Economics
$-
$20,000
$40,000
$60,000
$80,000
2008 2009 2010 2011 2012 2013
Big Data Platform Price/TB
Big Data DB Hadoop
Thursday, December 12, 2013
Storage Options
•HDFS, MapR, Quantcast QFS
•EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS,
Lustre
•Amazon S3, EMC Atmos, OpenStack Swift
•GlusterFS, Ceph
•EMCViPR
Thursday, December 12, 2013
SQL-on-Hadoop
•Pivotal HAWQ
•Cloudera Impala, Facebook Presto,Apache
Drill, Cascading Lingual, Optiq, Hortonworks
Stinger
•Hadapt, Jethrodata, IBM BigSQL, Microsoft
PolyBase
•More to come...
Thursday, December 12, 2013
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
INTERACTIVE
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
BATCH
HDFS
!"#$%&'())'
ONLINE
Hadoop 1.0
(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
MapReduce 1.0
(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
Hadoop 2.0
(Image Courtesy Arun Murthy, Hortonworks)
HADOOP 1.0
!"#$%
!"#$%&$'&()*"#+,'-+#*.(/"'0#1*
&'()*+,-*%
!2+%.(#"*"#./%"2#*3'&'0#3#&(*
*4*$'('*5"/2#..,&01*
!"#$.%
!"#$%&$'&()*"#+,'-+#*.(/"'0#1*
/0)1%
!2+%.(#"*"#./%"2#*3'&'0#3#&(1*
2*3%
!#6#2%7/&*#&0,&#1*
HADOOP 2.0
456%
!$'('*8/91*
!57*%
!.:+1*
%
89:*;<%
!2'.2'$,&01*
*
456%
!$'('*8/91*
!57*%
!.:+1*
%
89:*;<%
!2'.2'$,&01*
%
&)%
!-'(2;1*
)2%%
$9;*'=>%
?;'(:%
!"#$%&''
()$*+,'
*
$*;75-*<%
-.*/0'
*
Thursday, December 12, 2013
!""#$%&'()*+,-)+.&'/0#1+2.+3&4(("+
35678+!"#$%&$'&()*"#+,'-+#*.(/0'1#2*
9!,.+!3+%4(#0*"#4/%05#*6'&'1#7#&(2***
:!;<3+
=>&",04-%0?+
2.;@,!<;2A@+
=;0B?+
7;,@!>2.C+
=7D(EFG+7HGI?+
C,!J3+
=C$E&"K?+
2.L>@>M,9+
=7"&EN?+
3J<+>J2+
=M"0)>J2?+
M.O2.@+
=3:&*0?+
M;3@,+
=70&E%K?+
=P0&/0I?+
YARN Platform
(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
+"',&-'$)*./.*
+"',&-'$)*0/1*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
!"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)*
+"',&-'$)*./0*
+"',&-'$)*./2*
3%*.*
+"',&-'$)*0/0*
+"',&-'$)*0/.*
+"',&-'$)*0/2*
3%0*
+4-$',0*
5$6"7)8$%&'&($)*
98:$#74$)*
YARN Architecture
(Image Courtesy Arun Murthy, Hortonworks)
Thursday, December 12, 2013
YARN
•Yet Another Resource Negotiator
•Resource Manager
•Node Managers
•Application Masters
•Specific to paradigm, e.g. MR Application
master (aka JobTracker)
Thursday, December 12, 2013
Beyond MapReduce
•Apache Giraph - BSP & Graph Processing
•Storm onYarn - Streaming Computation
•HOYA - HBase onYarn
•Hamster - MPI on Hadoop
•More to come ...
Thursday, December 12, 2013
Hamster
• Hadoop and MPI on the same
cluster
• OpenMPI Runtime on Hadoop
YARN
• Hadoop Provides: Resource
Scheduling, Process monitoring,
Distributed File System
• Open MPI Provides: Process
launching, Communication, I/O
forwarding
Thursday, December 12, 2013
Hamster Components
•Hamster Application Master
•Gang Scheduler,YARN Application
Preemption
•Resource Isolation (lxc Containers)
•ORTE: Hamster Runtime
•Process launching,Wireup, Interconnect
Thursday, December 12, 2013
Resource Manager
Scheduler
AMService
Node Manager Node Manager Node Manager
!
Proc/
Container
Framework
Daemon
NS
MPI
Scheduler
HNP
MPI AM
Proc/
Container
!RM-AM
AM-NM
RM-NodeManagerClient
Client-RM
Aux Srvcs
Proc/
Container
Framework
Daemon
NS
Proc/
Container
!
Aux Srvcs
RM-
NodeManager
Hamster Architecture
Thursday, December 12, 2013
Hamster Scalability
•Sufficient for small to medium HPC
workloads
•Job launch time gated byYARN resource
scheduler
Launch WireUp Collectives Monitor
OpenMPI O(logN) O(logN) O(logN) O(logN)
Hamster O(N) O(logN) O(logN) O(logN)
Thursday, December 12, 2013
GraphLab + Hamster
on Hadoop
!
Thursday, December 12, 2013
About GraphLab
•Graph-based, High-Performance distributed
computation framework
•Started by Prof. Carlos Guestrin in CMU in
2009
•Recently founded Graphlab Inc to
commercialize Graphlab.org
Thursday, December 12, 2013
GraphLab Features
•Topic Modeling (e.g. LDA)
•Graph Analytics (Pagerank,Triangle counting)
•Clustering (K-Means)
•Collaborative Filtering
•Linear Solvers
•etc...
Thursday, December 12, 2013
Only Graphs are not
Enough
• Full Data processing workflow required ETL/
Postprocessing,Visualization, Data Wrangling, Serving
• MapReduce excels at data wrangling
• OLTP/NoSQL Row-Based stores excel at Serving
• GraphLab should co-exist with other Hadoop
frameworks
Thursday, December 12, 2013
CallTo Action
Thursday, December 12, 2013
Prepare for
Convergence
•HPC: Cache Coherence, Prefetching, Zero-
copy, Low-contention locks
•“Big Data”: Caching, Mirroring, Sharding
(various flavors), relaxed consistency
•Databases: Indexing, MVCC, Columnar
storage/processing, Cost-based optimization
Thursday, December 12, 2013
Convergence
•Resource Allocation, Scheduling, Lifecycle
Management
•Compute, Storage, and Communication
isolation, Multi-tenancy, Performance SLAs
•Auth & Auth, Data/System Provisioning and
Management, Monitoring, Metadata
Management, Metering
Thursday, December 12, 2013
New Hardware
Platforms
•Mellanox - Hadoop Acceleration through
Network-assisted Merge
•RoCE - Brocade, Cisco, Extreme,Arista...
•ARM - Low power Hadoop servers
•SSD -Velobit,Violin, FusionIO, Samsung..
•Niche - Compression, Encryption...
Thursday, December 12, 2013
Data Cloud of Future?
deploy
Public Cloud
Private Cloud
On Premise
Thursday, December 12, 2013
Questions?
Thursday, December 12, 2013

Future of Data Intensive Applicaitons