NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar
Upcoming SlideShare
Loading in...5
×
 

Like this? Share it with your network

Share

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar

on

  • 1,558 views

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar, Chief Scientist, Pivotal Labs

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar, Chief Scientist, Pivotal Labs

Statistics

Views

Total Views
1,558
Views on SlideShare
667
Embed Views
891

Actions

Likes
1
Downloads
39
Comments
0

4 Embeds 891

http://www.nasscom.in 887
http://webcache.googleusercontent.com 2
https://twitter.com 1
http://translate.googleusercontent.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

NATC 2013 - Future of Data Intensive Applications by Milind Bhandarkar Presentation Transcript

  • 1. Future of Data Intensive Applications Milind Bhandarkar Chief Scientist, Pivotal @techmilind Thursday, December 12, 2013
  • 2. About Me • • • • • • Thursday, December 12, 2013 http://www.linkedin.com/in/milindb Founding member of Hadoop team at Yahoo! [2005-2010] Contributor to Apache Hadoop since v0.1 Built and led Grid Solutions Team at Yahoo! [2007-2010] Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  • 3. Thursday, December 12, 2013
  • 4. Kryptonite: First Hadoop Cluster At Yahoo! Thursday, December 12, 2013
  • 5. M45 Thursday, December 12, 2013
  • 6. OpenCirrus Thursday, December 12, 2013
  • 7. Analytics Workbench Thursday, December 12, 2013
  • 8. Analytics Workbench Thursday, December 12, 2013
  • 9. Thursday, December 12, 2013
  • 10. 70% of data generated by customers 80% of data being stored 3% being prepared for analysis 0.5% being analyzed Average Enterprises The Big Gap Thursday, December 12, 2013 <0.5% being operationalized
  • 11. Example: Healthcare • In last 5 years • 3,573 studies on hospital readmissions • 9,745 papers on comparative effectiveness • 39,230 studies on drug interactions • 132,241 studies on hospital mortality • Yet, very few models operational Thursday, December 12, 2013
  • 12. PowerPoint is where Models go to Die. - Hulya Farinas, Principal Data Scientist, Pivotal Thursday, December 12, 2013
  • 13. Modernization Thursday, December 12, 2013
  • 14. Building Blocks Thursday, December 12, 2013
  • 15. Applications BI/Analytics Tools Data Science On-Demand, Self-Service Access (Big) Data Staging Platform Meta-data driven access control Unstructured Analytic Data Warehouse High-Speed Integration Semi-structured In Memory Data Grid Data provisioning, shared security, coordinated transformation Structured Modern Data Architecture Thursday, December 12, 2013
  • 16. Data Fabric Requirements • Store massive & diverse data sets economically • Integrate and Ingest from legacy & disparate sources • Ability to rapidly analyze massive data sets • Control, Auditing, Manageability • Self-Service Thursday, December 12, 2013
  • 17. Data Fabric Architecture Thursday, December 12, 2013
  • 18. Infrastructure-As-AService is the new “Hardware” Thursday, December 12, 2013
  • 19. IAAS: New Hardware • AWS, GCE, Azure • vSphere, OpenStack • Easy Provisioning • Scalable, Elastic, Ubiquitous • Needs bundling with Data & Analytics as Services Thursday, December 12, 2013
  • 20. App Fabric Requirements • IAAS Cloud-Agnostic • Rapid provisioning, Elasticity • Open, No-Lock-In, Data As-A-Service • Automation for Application Lifecycle Management • Developer Agility : Eliminate Infrastructure Wiring Thursday, December 12, 2013
  • 21. Thursday, December 12, 2013
  • 22. Ecosystem Thursday, December 12, 2013
  • 23. Broader Ecosystem Thursday, December 12, 2013
  • 24. Legacy App Deployment Thursday, December 12, 2013
  • 25. !"#$%&%#'()*+(,-#./0( ((12"341()*+(,-#./0( ((!.&5()*+(2!!0( ((6%'/()*+(&4"$%,4&0( ((&,2-4()*+(2!!0(7899( .!3"2/4()*+(,-#./0( ( Modern App Deployment Thursday, December 12, 2013
  • 26. App App App App Dev Framework Dev Framework Dev Framework Configurations Configurations Manifests, Automations JVM JVM App Server App Server VM VM Infrastructure One Infrastructure Two Container 1 Container 2 JVM JVM App Server App Server Infrastructure One Infrastructure Two Application As Unit of Deployment Thursday, December 12, 2013
  • 27. Hadoop’s Role in Data Clouds Thursday, December 12, 2013
  • 28. Trough of Disillusionment ? Thursday, December 12, 2013
  • 29. Or, Hadoop Everywhere? Thursday, December 12, 2013
  • 30. Thursday, December 12, 2013
  • 31. Thursday, December 12, 2013
  • 32. Thursday, December 12, 2013
  • 33. Thursday, December 12, 2013
  • 34. Thursday, December 12, 2013
  • 35. Game Changing Hadoop Economics Big Data Platform Price/TB $80,000 $60,000 $40,000 $20,000 $2008 Thursday, December 12, 2013 2009 2010 Big Data DB 2011 Hadoop 2012 2013
  • 36. Storage Options • HDFS, MapR, Quantcast QFS • EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre • Amazon S3, EMC Atmos, OpenStack Swift • GlusterFS, Ceph • EMC ViPR Thursday, December 12, 2013
  • 37. SQL-on-Hadoop • Pivotal HAWQ • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase • More to come... Thursday, December 12, 2013
  • 38. !"#$%&'())' !"#$%&'())' INTERACTIVE ONLINE !"#$%&'())' !"#$%&'())' !"#$%&'())' BATCH BATCH BATCH HDFS HDFS HDFS Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013
  • 39. MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013
  • 40. HADOOP 1.0 456% !$'('*8/91* % !57*% 89:*;<% !.:+1* !2'.2'$,&01* * &'()*+,-*% HADOOP 2.0 &)% !-'(2;1* 456% !$'('*8/91* % !57*% 89:*;<% !.:+1* !2'.2'$,&01* % 2*3% )2%% $9;*'=>%$*;75-*<% ?;'(:% -.*/0' !#6#2%7/&*#&0,&#1* /0)1% !2+%.(#"*"#./%"2#*3'&'0#3#&(* *4*$'('*5"/2#..,&01* !2+%.(#"*"#./%"2#*3'&'0#3#&(1* !"#$% !"#$.% !"#$%&$'&()*"#+,'-+#*.(/"'0#1* !"#$%&$'&()*"#+,'-+#*.(/"'0#1* Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013 !"#$%&'' ()$*+,' * *
  • 41. !""#$%&'()*+,-)+.&'/0#1+2.+3&4(("+ :!;<3+ 2.;@,!<;2A@+ =>&",04-%0?+ =;0B?+ M.O2.@+ =3:&*0?+ 7;,@!>2.C+ =7D(EFG+7HGI?+ C,!J3+ =C$E&"K?+ 2.L>@>M,9+ =7"&EN?+ 3J<+>J2+ =M"0)>J2?+ 9!,.+!3+%4(#0*"#4/%05#*6'&'1#7#&(2*** 35678+!"#$%&$'&()*"#+,'-+#*.(/0'1#2* YARN Platform (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013 M;3@,+ =70&E%K?+ =P0&/0I?+
  • 42. 5$6"7)8$%&'&($)* +4-$',0* 98:$#74$)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* !"#$%&'&($)* +"',&-'$)*./.* +"',&-'$)*0/0* +"',&-'$)*0/1* !"#$%&'&($)* !"#$%&'&($)* 3%*.* !"#$%&'&($)* +"',&-'$)*./0* !"#$%&'&($)* !"#$%&'&($)* 3%0* !"#$%&'&($)* +"',&-'$)*./2* !"#$%&'&($)* +"',&-'$)*0/.* !"#$%&'&($)* +"',&-'$)*0/2* YARN Architecture (Image Courtesy Arun Murthy, Hortonworks) Thursday, December 12, 2013
  • 43. YARN • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application master (aka JobTracker) Thursday, December 12, 2013
  • 44. Beyond MapReduce • Apache Giraph - BSP & Graph Processing • Storm on Yarn - Streaming Computation • HOYA - HBase on Yarn • Hamster - MPI on Hadoop • More to come ... Thursday, December 12, 2013
  • 45. Hamster • Hadoop and MPI on the same cluster • OpenMPI Runtime on Hadoop YARN • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System • Open MPI Provides: Process launching, Communication, I/O forwarding Thursday, December 12, 2013
  • 46. Hamster Components • Hamster Application Master • Gang Scheduler,YARN Application Preemption • Resource Isolation (lxc Containers) • ORTE: Hamster Runtime • Process launching, Wireup, Interconnect Thursday, December 12, 2013
  • 47. Client Client-RM RM-NodeManager Resource Manager Scheduler AMService Proc/ Container RM-AM RMNodeManager MPI AM MPI Scheduler ! Proc/ Container Proc/ Container Aux Srvcs HNP Framework Daemon ! Proc/ Container Aux Srvcs Framework Daemon NS NS AM-NM Node Manager Node Manager ! Node Manager Hamster Architecture Thursday, December 12, 2013
  • 48. Hamster Scalability • Sufficient for small to medium HPC workloads • Job launch time gated by YARN resource scheduler Launch WireUp Collectives Monitor OpenMPI O(logN) O(logN) O(logN) O(logN) Hamster O(logN) O(logN) O(logN) Thursday, December 12, 2013 O(N)
  • 49. ! GraphLab + Hamster on Hadoop Thursday, December 12, 2013
  • 50. About GraphLab • Graph-based, High-Performance distributed computation framework • Started by Prof. Carlos Guestrin in CMU in 2009 • Recently founded Graphlab Inc to commercialize Graphlab.org Thursday, December 12, 2013
  • 51. GraphLab Features • Topic Modeling (e.g. LDA) • Graph Analytics (Pagerank, Triangle counting) • Clustering (K-Means) • Collaborative Filtering • Linear Solvers • etc... Thursday, December 12, 2013
  • 52. Only Graphs are not Enough • Full Data processing workflow required ETL/ Postprocessing, Visualization, Data Wrangling, Serving • MapReduce excels at data wrangling • OLTP/NoSQL Row-Based stores excel at Serving • GraphLab should co-exist with other Hadoop frameworks Thursday, December 12, 2013
  • 53. Call To Action Thursday, December 12, 2013
  • 54. Prepare for Convergence • HPC: Cache Coherence, Prefetching, Zerocopy, Low-contention locks • “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency • Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization Thursday, December 12, 2013
  • 55. Convergence • Resource Allocation, Scheduling, Lifecycle Management • Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs • Auth & Auth, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering Thursday, December 12, 2013
  • 56. New Hardware Platforms • Mellanox - Hadoop Acceleration through Network-assisted Merge • RoCE - Brocade, Cisco, Extreme, Arista... • ARM - Low power Hadoop servers • SSD - Velobit, Violin, FusionIO, Samsung.. • Niche - Compression, Encryption... Thursday, December 12, 2013
  • 57. Data Cloud of Future? deploy Public Cloud On Premise Private Cloud Thursday, December 12, 2013
  • 58. Questions? Thursday, December 12, 2013