• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar
 

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

on

  • 1,277 views

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar, Chief Scientist, Pivotal Labs

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar, Chief Scientist, Pivotal Labs

Statistics

Views

Total Views
1,277
Views on SlideShare
527
Embed Views
750

Actions

Likes
0
Downloads
36
Comments
0

4 Embeds 750

http://www.nasscom.in 746
http://webcache.googleusercontent.com 2
https://twitter.com 1
http://translate.googleusercontent.com 1

Accessibility

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar Presentation Transcript

    • Future of Data Intensive Applications Milind Bhandarkar Chief Scientist, Pivotal @techmilind Thursday, December 12, 2013
    • About Me • • • • • • Thursday, December 12, 2013 http://www.linkedin.com/in/milindb Founding member of Hadoop team at Yahoo! [2005-2010] Contributor to Apache Hadoop since v0.1 Built and led Grid Solutions Team at Yahoo! [2007-2010] Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
    • Thursday, December 12, 2013
    • Kryptonite: First Hadoop Cluster At Yahoo! Thursday, December 12, 2013
    • M45 Thursday, December 12, 2013
    • OpenCirrus Thursday, December 12, 2013
    • Analytics Workbench Thursday, December 12, 2013
    • Analytics Workbench Thursday, December 12, 2013
    • Thursday, December 12, 2013
    • 70% of data generated by customers 80% of data being stored 3% being prepared for analysis 0.5% being analyzed Average Enterprises The Big Gap Thursday, December 12, 2013 <0.5% being operationalized
    • Example: Healthcare • In last 5 years • 3,573 studies on hospital readmissions • 9,745 papers on comparative effectiveness • 39,230 studies on drug interactions • 132,241 studies on hospital mortality • Yet, very few models operational Thursday, December 12, 2013
    • PowerPoint is where Models go to Die. - Hulya Farinas, Principal Data Scientist, Pivotal Thursday, December 12, 2013
    • Modernization Thursday, December 12, 2013
    • Building Blocks Thursday, December 12, 2013
    • Applications BI/Analytics Tools Data Science On-Demand, Self-Service Access (Big) Data Staging Platform Meta-data driven access control Unstructured Analytic Data Warehouse High-Speed Integration Semi-structured In Memory Data Grid Data provisioning, shared security, coordinated transformation Structured Modern Data Architecture Thursday, December 12, 2013
    • Data Fabric Requirements • Store massive & diverse data sets economically • Integrate and Ingest from legacy & disparate sources • Ability to rapidly analyze massive data sets • Control, Auditing, Manageability • Self-Service Thursday, December 12, 2013
    • Data Fabric Architecture Thursday, December 12, 2013
    • Infrastructure-As-AService is the new “Hardware” Thursday, December 12, 2013
    • IAAS: New Hardware • AWS, GCE, Azure • vSphere, OpenStack • Easy Provisioning • Scalable, Elastic, Ubiquitous • Needs bundling with Data & Analytics as Services Thursday, December 12, 2013
    • App Fabric Requirements • IAAS Cloud-Agnostic • Rapid provisioning, Elasticity • Open, No-Lock-In, Data As-A-Service • Automation for Application Lifecycle Management • Developer Agility : Eliminate Infrastructure Wiring Thursday, December 12, 2013
    • Thursday, December 12, 2013
    • Ecosystem Thursday, December 12, 2013
    • Broader Ecosystem Thursday, December 12, 2013
    • Legacy App Deployment Thursday, December 12, 2013
    • !"#$%&%#'()*+(,-#./0( ((12"341()*+(,-#./0( ((!.&5()*+(2!!0( ((6%'/()*+(&4"$%,4&0( ((&,2-4()*+(2!!0(7899( .!3"2/4()*+(,-#./0( ( Modern App Deployment Thursday, December 12, 2013
    • App App App App Dev Framework Dev Framework Dev Framework Configurations Configurations Manifests, Automations JVM JVM App Server App Server VM VM Infrastructure One Infrastructure Two Container 1 Container 2 JVM JVM App Server App Server Infrastructure One Infrastructure Two Application As Unit of Deployment Thursday, December 12, 2013
    • Hadoop’s Role in Data Clouds Thursday, December 12, 2013
    • Trough of Disillusionment ? Thursday, December 12, 2013
    • Or, Hadoop Everywhere? Thursday, December 12, 2013
    • Thursday, December 12, 2013
    • Thursday, December 12, 2013
    • Thursday, December 12, 2013
    • Thursday, December 12, 2013
    • Thursday, December 12, 2013
    • Game Changing Hadoop Economics Big Data Platform Price/TB $80,000 $60,000 $40,000 $20,000 $2008 Thursday, December 12, 2013 2009 2010 Big Data DB 2011 Hadoop 2012 2013
    • Storage Options • HDFS, MapR, Quantcast QFS • EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre • Amazon S3, EMC Atmos, OpenStack Swift • GlusterFS, Ceph • EMC ViPR Thursday, December 12, 2013
    • SQL-on-Hadoop • Pivotal HAWQ • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase • More to come... Thursday, December 12, 2013
    • !"#$%&'())' !"#$%&'())' INTERACTIVE ONLINE !"#$%&'())' !"#$%&'())' !"#$%&'())' BATCH BATCH BATCH HDFS HDFS HDFS Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013
    • MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013
    • HADOOP 1.0 456% !$'('*8/91* % !57*% 89:*;<% !.:+1* !2'.2'$,&01* * &'()*+,-*% HADOOP 2.0 &)% !-'(2;1* 456% !$'('*8/91* % !57*% 89:*;<% !.:+1* !2'.2'$,&01* % 2*3% )2%% $9;*'=>%$*;75-*<% ?;'(:% -.*/0' !#6#2%7/&*#&0,&#1* /0)1% !2+%.(#"*"#./%"2#*3'&'0#3#&(* *4*$'('*5"/2#..,&01* !2+%.(#"*"#./%"2#*3'&'0#3#&(1* !"#$% !"#$.% !"#$%&$'&()*"#+,'-+#*.(/"'0#1* !"#$%&$'&()*"#+,'-+#*.(/"'0#1* Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013 !"#$%&'' ()$*+,' * *
    • !""#$%&'()*+,-)+.&'/0#1+2.+3&4(("+ :!;<3+ 2.;@,!<;2A@+ =>&",04-%0?+ =;0B?+ M.O2.@+ =3:&*0?+ 7;,@!>2.C+ =7D(EFG+7HGI?+ C,!J3+ =C$E&"K?+ 2.L>@>M,9+ =7"&EN?+ 3J<+>J2+ =M"0)>J2?+ 9!,.+!3+%4(#0*"#4/%05#*6'&'1#7#&(2*** 35678+!"#$%&$'&()*"#+,'-+#*.(/0'1#2* YARN Platform (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013 M;3@,+ =70&E%K?+ =P0&/0I?+
    • 5$6"7)8$%&'&($)* +4-$',0* 98:$#74$)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* +"',&-'$)*./.* +"',&-'$)*0/0* +"',&-'$)*0/1* !"#$%&'&($)* !"#$%&'&($)* 3%*.* !"#$%&'&($)* +"',&-'$)*./0* !"#$%&'&($)* !"#$%&'&($)* 3%0* !"#$%&'&($)* +"',&-'$)*./2* !"#$%&'&($)* +"',&-'$)*0/.* !"#$%&'&($)* +"',&-'$)*0/2* YARN Architecture (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013
    • YARN • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application master (aka JobTracker) Thursday, December 12, 2013
    • Beyond MapReduce • Apache Giraph - BSP & Graph Processing • Storm on Yarn - Streaming Computation • HOYA - HBase on Yarn • Hamster - MPI on Hadoop • More to come ... Thursday, December 12, 2013
    • Hamster • Hadoop and MPI on the same cluster • OpenMPI Runtime on Hadoop YARN • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System • Open MPI Provides: Process launching, Communication, I/O forwarding Thursday, December 12, 2013
    • Hamster Components • Hamster Application Master • Gang Scheduler,YARN Application Preemption • Resource Isolation (lxc Containers) • ORTE: Hamster Runtime • Process launching, Wireup, Interconnect Thursday, December 12, 2013
    • Client Client-RM RM-NodeManager Resource Manager Scheduler AMService Proc/ Container RM-AM RMNodeManager MPI AM MPI Scheduler ! Proc/ Container Proc/ Container Aux Srvcs HNP Framework Daemon ! Proc/ Container Aux Srvcs Framework Daemon NS NS AM-NM Node Manager Node Manager ! Node Manager Hamster Architecture Thursday, December 12, 2013
    • Hamster Scalability • Sufficient for small to medium HPC workloads • Job launch time gated by YARN resource scheduler Launch WireUp Collectives Monitor OpenMPI O(logN) O(logN) O(logN) O(logN) Hamster O(logN) O(logN) O(logN) Thursday, December 12, 2013 O(N)
    • ! GraphLab + Hamster on Hadoop Thursday, December 12, 2013
    • About GraphLab • Graph-based, High-Performance distributed computation framework • Started by Prof. Carlos Guestrin in CMU in 2009 • Recently founded Graphlab Inc to commercialize Graphlab.org Thursday, December 12, 2013
    • GraphLab Features • Topic Modeling (e.g. LDA) • Graph Analytics (Pagerank, Triangle counting) • Clustering (K-Means) • Collaborative Filtering • Linear Solvers • etc... Thursday, December 12, 2013
    • Only Graphs are not Enough • Full Data processing workflow required ETL/ Postprocessing, Visualization, Data Wrangling, Serving • MapReduce excels at data wrangling • OLTP/NoSQL Row-Based stores excel at Serving • GraphLab should co-exist with other Hadoop frameworks Thursday, December 12, 2013
    • Call To Action Thursday, December 12, 2013
    • Prepare for Convergence • HPC: Cache Coherence, Prefetching, Zerocopy, Low-contention locks • “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency • Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization Thursday, December 12, 2013
    • Convergence • Resource Allocation, Scheduling, Lifecycle Management • Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs • Auth & Auth, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering Thursday, December 12, 2013
    • New Hardware Platforms • Mellanox - Hadoop Acceleration through Network-assisted Merge • RoCE - Brocade, Cisco, Extreme, Arista... • ARM - Low power Hadoop servers • SSD - Velobit, Violin, FusionIO, Samsung.. • Niche - Compression, Encryption... Thursday, December 12, 2013
    • Data Cloud of Future? deploy Public Cloud On Premise Private Cloud Thursday, December 12, 2013
    • Questions? Thursday, December 12, 2013