Future of Data Intensive Applications

"Big Data" is a much-hyped term nowadays in Business Computing. However, the core concept of collaborative environments conducting experiments over large shared data repositories has existed for decades. In this talk, I will outline how recent advances in Cloud Computing, Big Data processing frameworks, and agile application development platforms enable Data Intensive Cloud Applications. I will provide a brief history of efforts in building scalable and adaptive runtime environments and the role these runtime systems will play in new Cloud Applications. I will present a vision for cloud platforms for science, where data-intensive frameworks such as Apache Hadoop will play a key role.

Published in: Data & Analytics

  1. Future of Data Intensive Applications • Milind Bhandarkar, Chief Scientist, Pivotal • @techmilind • Thursday, December 12, 2013
  2. About Me • http://www.linkedin.com/in/milindb • Founding member of Hadoop team at Yahoo! [2005-2010] • Contributor to Apache Hadoop since v0.1 • Built and led Grid Solutions Team at Yahoo! [2007-2010] • Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu) • Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
  4. Kryptonite: First Hadoop Cluster at Yahoo!
  5. M45
  6. OpenCirrus
  7. Analytics Workbench
  8. Analytics Workbench
  10. The Big Gap (Average Enterprises) • 70% of data generated by customers • 80% of data being stored • 3% being prepared for analysis • 0.5% being analyzed • <0.5% being operationalized
  11. Example: Healthcare • In the last 5 years: • 3,573 studies on hospital readmissions • 9,745 papers on comparative effectiveness • 39,230 studies on drug interactions • 132,241 studies on hospital mortality • Yet very few models are operational
  12. PowerPoint is where Models go to Die. - Hulya Farinas, Principal Data Scientist, Pivotal
  13. Modernization
  14. Building Blocks
  15. Modern Data Architecture [Diagram labels: Structured, Semi-structured, Unstructured; BI/Analytics Tools; Data Science Applications; High-Speed Integration; Data provisioning, shared security, coordinated transformation; (Big) Data Staging Platform; Analytic Data Warehouse; In Memory Data Grid; On-Demand, Self-Service Access; Meta-data driven access control]
  16. Data Fabric Requirements • Store massive & diverse data sets economically • Integrate and ingest from legacy & disparate sources • Ability to rapidly analyze massive data sets • Control, Auditing, Manageability • Self-Service
  17. Data Fabric Architecture
  18. Infrastructure-As-A-Service is the new “Hardware”
  19. IAAS: New Hardware • AWS, GCE, Azure • vSphere, OpenStack • Easy Provisioning • Scalable, Elastic, Ubiquitous • Needs bundling with Data & Analytics as Services
  20. App Fabric Requirements • IAAS Cloud-Agnostic • Rapid provisioning, Elasticity • Open, No-Lock-In, Data As-A-Service • Automation for Application Lifecycle Management • Developer Agility: Eliminate Infrastructure Wiring
  22. Ecosystem
  23. Broader Ecosystem
  24. Legacy App Deployment
  25. Modern App Deployment
  26. Application As Unit of Deployment [Diagram labels: Infrastructure One, Infrastructure Two, VM, JVM, App Container, App Server, Dev Framework, App Server Configurations, Manifests, Automations, App]
  27. Hadoop’s Role in Data Clouds
  28. Trough of Disillusionment?
  29. Or, Hadoop Everywhere?
  35. Game Changing Hadoop Economics [Chart: Big Data Platform Price/TB, 2008-2013; y-axis $0 to $80,000; series: Big Data DB, Hadoop]
  36. Storage Options • HDFS, MapR, Quantcast QFS • EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre • Amazon S3, EMC Atmos, OpenStack Swift • GlusterFS, Ceph • EMC ViPR
  37. SQL-on-Hadoop • Pivotal HAWQ • Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger • Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase • More to come...
  38. Hadoop 1.0 [Diagram labels: BATCH, INTERACTIVE, ONLINE, HDFS] (Image courtesy Arun Murthy, Hortonworks)
  39. MapReduce 1.0 (Image courtesy Arun Murthy, Hortonworks)
  40. Hadoop 2.0 [Diagram labels: HADOOP 1.0, HADOOP 2.0] (Image courtesy Arun Murthy, Hortonworks)
  41. YARN Platform (Image courtesy Arun Murthy, Hortonworks)
  42. YARN Architecture (Image courtesy Arun Murthy, Hortonworks)
  43. YARN • Yet Another Resource Negotiator • Resource Manager • Node Managers • Application Masters • Specific to paradigm, e.g. MR Application Master (aka JobTracker)
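The division of labor named on the YARN slide can be sketched as a toy model. This is not the YARN API: the class names mirror the slide's terms, but every method, field, and container-naming scheme below is a hypothetical simplification, assuming one resource type ("slots") and a first-fit placement policy.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: tracks free container slots on its host (hypothetical model)."""
    name: str
    free_slots: int
    running: list = field(default_factory=list)

    def launch(self, app_id: str) -> str:
        # Consume one slot and record the container that now runs here.
        self.free_slots -= 1
        container = f"{app_id}@{self.name}#{len(self.running)}"
        self.running.append(container)
        return container


class ResourceManager:
    """Cluster-wide arbiter: hands out containers, knows nothing about job logic."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id: str, count: int) -> list:
        # First-fit over nodes; may grant fewer containers than requested.
        granted = []
        for node in self.nodes:
            while node.free_slots > 0 and len(granted) < count:
                granted.append(node.launch(app_id))
        return granted


class ApplicationMaster:
    """Per-application, paradigm-specific coordinator (e.g. one per MapReduce job)."""

    def __init__(self, app_id: str, tasks: int):
        self.app_id = app_id
        self.tasks = tasks
        self.containers = []

    def run(self, rm: ResourceManager) -> int:
        # Ask the Resource Manager for one container per task.
        self.containers = rm.allocate(self.app_id, self.tasks)
        return len(self.containers)


rm = ResourceManager([NodeManager("node1", 2), NodeManager("node2", 2)])
mr_job = ApplicationMaster("mr-job", tasks=3)
mpi_job = ApplicationMaster("mpi-job", tasks=2)
print(mr_job.run(rm), mr_job.containers)
print(mpi_job.run(rm), mpi_job.containers)
```

In this sketch the second application receives only one of its two requested containers, because the cluster is full. The point of the separation is that the Resource Manager stays generic while each Application Master carries the paradigm-specific logic.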
  44. Beyond MapReduce • Apache Giraph - BSP & Graph Processing • Storm on YARN - Streaming Computation • HOYA - HBase on YARN • Hamster - MPI on Hadoop • More to come...
  45. Hamster • Hadoop and MPI on the same cluster • Open MPI Runtime on Hadoop YARN • Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System • Open MPI Provides: Process launching, Communication, I/O forwarding
  46. Hamster Components • Hamster Application Master • Gang Scheduler, YARN Application Preemption • Resource Isolation (lxc Containers) • ORTE: Hamster Runtime • Process launching, Wireup, Interconnect
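The gang scheduler listed among the Hamster components reflects a general property of MPI jobs: the ranks communicate with each other, so a job cannot make progress with only part of its processes running. A minimal sketch of all-or-nothing allocation follows. It is not Hamster code; the function name `gang_allocate` and the slot-per-node model are assumptions made for illustration.

```python
def gang_allocate(free_slots: dict, ranks: int):
    """Place all `ranks` at once, or place none.

    Returns a {node: count} placement, or None if the gang does not fit.
    `free_slots` is modified only when the whole gang is placed.
    """
    if sum(free_slots.values()) < ranks:
        return None  # partial placement would leave an MPI job unable to start
    placement = {}
    remaining = ranks
    for node, free in free_slots.items():
        take = min(free, remaining)
        if take:
            placement[node] = take
            remaining -= take
        if remaining == 0:
            break
    # Commit the placement only after it is known to be complete.
    for node, take in placement.items():
        free_slots[node] -= take
    return placement


cluster = {"node1": 4, "node2": 4}
first = gang_allocate(cluster, 6)   # fits: uses all of node1 and part of node2
second = gang_allocate(cluster, 4)  # does not fit: only 2 slots remain
print(first, second, cluster)
```

A scheduler that granted containers one at a time would instead hand the second job two slots and leave it waiting, holding resources it cannot use.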
  47. Hamster Architecture [Diagram labels: Client; Resource Manager (Scheduler, AMService); Node Manager; Aux Srvcs; Proc/Container; Framework Daemon; NS; MPI Scheduler; HNP; MPI AM; links: Client-RM, RM-AM, AM-NM, RM-NodeManager]
  48. Hamster Scalability • Sufficient for small to medium HPC workloads • Job launch time gated by YARN resource scheduler • Complexity by phase (Launch / WireUp / Collectives / Monitor): Open MPI: O(log N) / O(log N) / O(log N) / O(log N); Hamster: O(N) / O(log N) / O(log N) / O(log N)
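To make the difference in the Launch column concrete, the sketch below counts steps for a logarithmic launch versus a linear one. The slides give only the asymptotic orders, so the details here are assumptions: the logarithmic case is modeled as a binary fan-out in which every started process starts one more per round, the linear case as one process started per step, and each step is given unit cost.

```python
def tree_launch_rounds(n: int) -> int:
    """Rounds to start n processes when the started set doubles each round."""
    rounds, started = 0, 1
    while started < n:
        started *= 2
        rounds += 1
    return rounds


def sequential_launch_steps(n: int) -> int:
    """Steps to start n processes one at a time."""
    return n


for n in (16, 256, 4096):
    print(n, tree_launch_rounds(n), sequential_launch_steps(n))
```

Under these assumptions, 4096 processes take 12 rounds with fan-out and 4096 steps sequentially, which is consistent with the slide's remark that Hamster suits small to medium workloads.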
  49. GraphLab + Hamster on Hadoop!
  50. About GraphLab • Graph-based, High-Performance distributed computation framework • Started by Prof. Carlos Guestrin at CMU in 2009 • Recently founded GraphLab Inc. to commercialize GraphLab.org
  51. GraphLab Features • Topic Modeling (e.g. LDA) • Graph Analytics (PageRank, Triangle counting) • Clustering (K-Means) • Collaborative Filtering • Linear Solvers • etc.
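As an illustration of the graph analytics named on the features slide, here is a small single-machine PageRank by power iteration. It does not use GraphLab or its API; the function name, the damping factor of 0.85, the fixed iteration count, and the handling of dangling nodes are all choices made for this sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, destination) edges."""
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {v: 1.0 / len(nodes) for v in nodes}
    for _ in range(iterations):
        # Every node receives the teleport share, then shares from in-links.
        new_rank = {v: (1.0 - damping) / len(nodes) for v in nodes}
        for v in nodes:
            # A node with no out-links spreads its rank over all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for t in targets:
                new_rank[t] += share
        rank = new_rank
    return rank


ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
for node, score in sorted(ranks.items()):
    print(node, round(score, 4))
```

Each iteration only reads a node's rank and pushes shares along its out-edges, which is the kind of per-vertex computation that graph frameworks distribute across machines.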
  52. Only Graphs are not Enough • Full data processing workflow required: ETL/Postprocessing, Visualization, Data Wrangling, Serving • MapReduce excels at data wrangling • OLTP/NoSQL Row-Based stores excel at Serving • GraphLab should co-exist with other Hadoop frameworks
  53. Call To Action
  54. Prepare for Convergence • HPC: Cache Coherence, Prefetching, Zero-copy, Low-contention locks • “Big Data”: Caching, Mirroring, Sharding (various flavors), relaxed consistency • Databases: Indexing, MVCC, Columnar storage/processing, Cost-based optimization
  55. Convergence • Resource Allocation, Scheduling, Lifecycle Management • Compute, Storage, and Communication isolation, Multi-tenancy, Performance SLAs • Authentication & Authorization, Data/System Provisioning and Management, Monitoring, Metadata Management, Metering
  56. New Hardware Platforms • Mellanox - Hadoop Acceleration through Network-assisted Merge • RoCE - Brocade, Cisco, Extreme, Arista... • ARM - Low power Hadoop servers • SSD - Velobit, Violin, FusionIO, Samsung... • Niche - Compression, Encryption...
  57. Data Cloud of Future? [Diagram labels: deploy; Public Cloud, Private Cloud, On Premise]
  58. Questions?
