Benefits of Hadoop as Platform as a Service

Published in: Technology

  1. Benefits of Hadoop as Platform as a Service
     Aaron Call, Barcelona Supercomputing Center (www.bsc.es)
     Dublin, 14 April 2016
  2. Barcelona Supercomputing Center
  3. BSC – Barcelona Supercomputing Center
     23 years of research on computer architecture
       • European Center for Parallelism of Barcelona (CEPBA)
       • Based at the Polytechnical University of Catalonia (UPC)
     Led by Mateo Valero
       • Seymour Cray Award 2015, first European to win it
       • ACM Fellow, Eckert-Mauchly Award in 2007, Google award 2009
     Large research staff
       • 1000+ publications
  4. BSC – Barcelona Supercomputing Center
     Many life sciences computational projects
       • Computational Genomics
       • Molecular modeling and bioinformatics
       • Protein interactions and docking
       • In-place computational capabilities
       • Mare Nostrum supercomputer
     Research activity around Hadoop since 2008
       • Data-centric research group: http://www.bsc.es/computer-sciences/data-centric-computing
       • SLA-driven scheduling (adaptive scheduler)
       • Project ALOJA
  5. ALOJA
  6. About the project
     Automated characterization of the cost-effectiveness of Big Data deployments
     Seeks to provide knowledge and tools that help users reduce the TCO of their infrastructures
  7. About the project
     What is the most effective configuration for my needs?
  8. In ALOJA we have gathered extensive knowledge of the behavior of on-premise and IaaS Hadoop deployments
       • 60k+ runs
       • Public repository
  9. Lessons learnt from IaaS
     What is best for one workload is not best for all
     [Figure: disks and network impact; series: HDD-IB, SSD-ETH, HDD-ETH, SSD-IB]
     [Figure: local vs remote disks; configurations: local only; 1, 2 and 3 remotes; 1, 2 and 3 remotes with /tmp local]
  10. PaaS Advantages
  11. Platform as a Service
      Provides an automated setup of Big Data services (Hadoop, Spark, Hive...)
        • Optimized for the underlying hardware
        • Removes the cost of installation
      The service provider is in charge of maintenance
        • Reduces TCO
        • As with any cloud service, you pay as you go
  12. How much is spent on maintenance?
      An O'Reilly survey on data science salaries estimated an average salary of 140,000 US$ for a data engineer
      A cluster of 16 datanodes on HDInsight, using A3 machines, costs for a year:
        • (16 datanodes + 2 headnodes) * 0.2384 US$/h = 4.2912 US$/h
        • 4.2912 US$/h * 24 h * 365 days = 37,590.912 US$/year
      Hence, under ideal conditions, we can save up to 140,000 - 37,590.912 = 102,409.088 US$ per year
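The arithmetic on the maintenance slide can be reproduced with a short sketch. The node counts, the hourly price and the salary figure are taken from the slide; the function and variable names are illustrative assumptions, not part of the original talk.

```python
# Figures from the slide: 16 datanodes + 2 headnodes, A3 machines on HDInsight.
DATANODES = 16
HEADNODES = 2
PRICE_PER_NODE_HOUR = 0.2384   # US$/h per node, as quoted on the slide
ENGINEER_SALARY = 140_000.0    # US$/year, survey average quoted on the slide


def yearly_cluster_cost(datanodes, headnodes, price_per_node_hour,
                        hours_per_year=24 * 365):
    """Cost of keeping the whole cluster running for one year, in US$."""
    return (datanodes + headnodes) * price_per_node_hour * hours_per_year


cluster_cost = yearly_cluster_cost(DATANODES, HEADNODES, PRICE_PER_NODE_HOUR)
# "Ideal conditions": the PaaS fully replaces one data engineer's maintenance work.
ideal_savings = ENGINEER_SALARY - cluster_cost

print(f"Cluster cost:  {cluster_cost:,.3f} US$/year")
print(f"Ideal savings: {ideal_savings:,.3f} US$/year")
```

The estimate assumes the cluster runs 24 hours a day all year, so a cluster that is shut down when idle would cost less.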
  13. Platform as a Service
      Some current solutions
        • Azure HDInsight
        • Rackspace CBD
        • Amazon EMR
        • Google Cloud Platform
  14. Evaluation environment
      Linux-based clusters of 4, 8 and 16 datanodes
        • Azure HDInsight and Rackspace CBD
        • Azure IaaS and Rackspace IaaS clusters as well
      Clusters of up to 8 cores per node and 64 GB RAM
      HDInsight: Azure Storage as HDFS (remote disks)
      Rackspace CBD: nodes' local disks as HDFS
  15. Tested workloads
      Wordcount
        • CPU intensive: useful to analyze how the nodes scale across VM sizes
      [Figure: CPU utilization; series: %user, %system, %steal, %iowait, %nice]
  16. Tested workloads
      Terasort
        • Combined I/O and CPU load, a de facto benchmark in the community
      Data sizes of 1, 10, 100 and 1000 GB
      This is enough to stress the system and observe its overall behavior
      [Figure: CPU utilization; series: %user, %system, %steal, %iowait, %nice]
  17. Cloud variability (100GB runs)
      Runs repeated several times
      Benchmark    Provider         Standard deviation (%)
      Terasort     HDInsight        60%
      Terasort     Rackspace CBD    28%
      Wordcount    HDInsight        55%
      Wordcount    Rackspace CBD    47%
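A percentage standard deviation like the one reported on the variability slide can be computed from repeated runs as follows. This is a minimal sketch that assumes the figure is the sample standard deviation relative to the mean execution time; the slide does not state the exact formula, and the run times below are made-up values, not ALOJA measurements.

```python
from statistics import mean, stdev


def relative_stdev_percent(execution_times):
    """Sample standard deviation as a percentage of the mean execution time."""
    return stdev(execution_times) / mean(execution_times) * 100.0


# Hypothetical execution times (seconds) of the same benchmark repeated three times.
hypothetical_runs = [100.0, 120.0, 80.0]
variability = relative_stdev_percent(hypothetical_runs)
print(f"Variability: {variability:.1f}%")
```

Expressing the deviation relative to the mean makes runs of different lengths comparable, which is why it suits a comparison across benchmarks and providers.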
  18. Relevant factors tree
      ALOJA-ML is a set of machine learning techniques and tools to estimate the behavior of executions in the unexplored search space
      Relevant factors tree: a tool that identifies the parameters that most change an execution's behavior
  19. Relevant factors tree
      Resulting tree for PaaS executions
      [Figure: tree nodes: IOFileBuffer=131072, Datasize, Benchmark=Terasort, Replication, Benchmark=wordcount, Datanodes, IOFileBuffer=262144, Datasize]
  20. Relevant factors tree
      Provider is not a relevant factor
  21. Relevant factors tree
      But the data size changes which factor is the next most important
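The idea of ranking parameters by how much they change execution behavior can be illustrated with a small sketch. This is not the ALOJA-ML implementation: it assumes a simple variance-reduction criterion (the one used to pick the first split of a regression tree), and the runs, factor names and timings are hypothetical.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical runs (made-up data): configuration factors and execution time.
RUNS = [
    {"io_file_buffer": 65536, "provider": "A", "seconds": 100},
    {"io_file_buffer": 65536, "provider": "B", "seconds": 102},
    {"io_file_buffer": 131072, "provider": "A", "seconds": 80},
    {"io_file_buffer": 131072, "provider": "B", "seconds": 82},
]


def variance_reduction(runs, factor, target="seconds"):
    """Variance of the target explained by grouping the runs on one factor."""
    total = pvariance([run[target] for run in runs])
    groups = defaultdict(list)
    for run in runs:
        groups[run[factor]].append(run[target])
    # Size-weighted average of the variance left inside each group.
    within = sum(len(v) * pvariance(v) for v in groups.values()) / len(runs)
    return total - within


def rank_factors(runs, factors, target="seconds"):
    """Factors sorted from most to least relevant for the target."""
    scored = [(f, variance_reduction(runs, f, target)) for f in factors]
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_factors(RUNS, ["io_file_buffer", "provider"])
for factor, reduction in ranking:
    print(f"{factor}: variance reduction {reduction:.1f}")
```

In this toy data the buffer size explains almost all of the variance and the provider almost none, which mirrors the shape of the result on the slides without reproducing their data. Applying the same ranking recursively inside each group would yield the deeper levels of a tree.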
  22. IO File Buffer, 10GB
      Analysing IO File Buffer (the most relevant parameter in the tree)
  23. IO File Buffer, 100GB
  24. IO File Buffer, 1TB
      Which value to use depends on your application
  25. Replication factor, 100GB
  26. Replication factor, 1TB
      Important, but it does not make a significant difference
  27. Datasize scalability, Terasort
      [Figure series: 4cores,15GB]
  28. Datasize scalability, Terasort
      [Figure series: 4cores,7GB; 8cores,14GB; 4cores,15GB; 8cores,30GB]
  29. Datanodes impact, Wordcount
      [Figure series: 4cores,15GB; 8cores,30GB]
  30. Datanodes impact, Terasort
      [Figure series: 4cores,7GB; 8cores,14GB; 4cores,15GB; 8cores,30GB]
  31. Datanodes impact, Terasort
      Diminishing returns
      [Figure labels: $2.87, $2.88; series: 8cores,14GB; 4cores,15GB]
  32. Cost difference IaaS and PaaS
      Provider         VM size             IaaS US$/h    PaaS US$/h
      Azure/HDI        4 CPU, 7GB RAM      $0.176        $0.32
      Azure/HDI        8 CPU, 15GB RAM     $0.352        $0.64
      Rackspace/CBD    4vCPU, 15GB RAM     $0.555        $0.7925
      Rackspace/CBD    8vCPU, 30GB RAM     $1.11         $2.776
      Amazon/EMR       4vCPU, 16GB RAM     $0.239        $0.299
      Amazon/EMR       8vCPU, 32GB RAM     $0.479        $0.599
      IaaS is cheaper, but might increase TCO (maintenance on your own!)
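To make the IaaS versus PaaS gap easier to compare across providers, the hourly prices from the cost slide can be turned into a relative premium. The prices are copied from the slide; the dictionary layout and function name are illustrative assumptions.

```python
# Hourly prices in US$/h from the cost slide, stored as (IaaS, PaaS).
PRICES = {
    ("Azure/HDI", "4 CPU, 7GB RAM"): (0.176, 0.32),
    ("Azure/HDI", "8 CPU, 15GB RAM"): (0.352, 0.64),
    ("Rackspace/CBD", "4vCPU, 15GB RAM"): (0.555, 0.7925),
    ("Rackspace/CBD", "8vCPU, 30GB RAM"): (1.11, 2.776),
    ("Amazon/EMR", "4vCPU, 16GB RAM"): (0.239, 0.299),
    ("Amazon/EMR", "8vCPU, 32GB RAM"): (0.479, 0.599),
}


def paas_premium_percent(iaas_price, paas_price):
    """Extra cost of PaaS over IaaS, as a percentage of the IaaS price."""
    return (paas_price - iaas_price) / iaas_price * 100.0


premiums = {vm: paas_premium_percent(*prices) for vm, prices in PRICES.items()}

# Print from the smallest to the largest premium.
for (provider, vm_size), percent in sorted(premiums.items(), key=lambda kv: kv[1]):
    print(f"{provider:14s} {vm_size:16s} +{percent:.1f}%")
```

The premium is only the hourly price difference; it leaves out the maintenance effort that the slide flags as the part of TCO that IaaS pushes back onto the user.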
  33. Conclusions
      The choice of provider is not really significant
      In the public cloud, large data sizes or large clusters introduce problems
        • A larger cluster may improve performance but be more expensive in the end
      PaaS allows you to save on maintenance
        • But you still have to take care of some tuning
        • Not as much as on IaaS
        • Whether it is cheaper than IaaS depends on your business
  34. Thank you!
      For further information please contact aaron.call@bsc.es
      www.bsc.es
