Starfish: A Self-tuning System for Big Data Analytics

Slides from Shivnath Babu's talk at the Triangle Hadoop User Group's April 2011 meeting on Starfish. See also http://www.trihug.org


Transcript

  • 1. Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu (Duke University)
  • 2. Outline: Why Starfish? What should Starfish be able to do? What can Starfish do so far?
  • 3. We are in the Era of Big Data: Google processes 20 PB a day (2008); the Wayback Machine has 3 PB + 100 TB/month (3/2009); eBay has 6.5 PB of user data + 50 TB/day (5/2009); Facebook has 36 PB of user data + 80-90 TB/day (6/2010); CERN's LHC: 15 PB a year (any day now); LSST: 6-10 PB a year (~2015). From http://www.umiacs.umd.edu/~jimmylin/
  • 4. Who are the "Big Data" Practitioners? Data analysts (report generation, data mining, ad optimization, …); computational scientists (computational biology, economics, journalism, …); statisticians and machine-learning researchers; systems researchers, developers, and testers (distributed systems, networking, security, …); you!
  • 5. Practitioners want a MAD System. Magnetic: users want to get fresh new data into the system quickly; data may be of multiple formats, with missing fields, etc. Agile system and analytics: change (in data, workload, needs) is constant, so make it easy; complex data gathering & processing pipelines (real-time). Deep analytics: sophisticated aggregation/statistical analysis; users want to use interfaces they are familiar with or the best available: SQL, MapReduce, Java, Python, R, …
  • 6. Hadoop is as MAD as it gets! Magnetic: load data into HDFS as files; load first, ask questions later. Agile: Hadoop is extremely malleable (pluggable data formats, storage engines/filesystems, scheduler, instrumentation, …); not just a querying tool: supports the end-to-end data pipeline; built for elastic computing: fine-grained scheduler, highly fault tolerant, dynamic node addition and dropping. Deep: well integrated with programming languages; MapReduce is a powerful programming model, plus other interfaces (Pig Latin, HiveQL, JAQL) on top.
  • 7. MAD + Good Performance. Users want good performance without having to understand and tune system internals: performance is multidimensional (time, cost, scalability); learn from the troubled history of database tuning. Tuning a MAD system is highly challenging: data is opaque until it is accessed; data is loaded/accessed as files (vs. organized DB stores); MapReduce programs pose different challenges than SQL (simpler in some ways, more complex in others); heavy use of programming languages (e.g., Java/Python); elasticity is wonderful but hard to achieve (Hadoop has many useful mechanisms, but policies are lacking); terabyte-scale data cycles.
  • 8. The Starfish Philosophy. Goal: a high-performance MAD system. Build on Hadoop's strengths: Hadoop is MAD and has a rapidly growing user base. How can users get good performance automatically, without having to understand and tune system internals? Recall: performance is multidimensional (time, cost, scalability).
  • 9. Starfish: Self-Tuning System. Our goal: provide good performance automatically. NOT our goal: improve Hadoop's peak performance. [Architecture diagram: analytics clients (Java Client, Pig, Hive, Oozie, Elastic MR, …) sit on Starfish, which sits on the Hadoop MapReduce execution engine and distributed file system]
  • 10. Outline: Why Starfish? What should Starfish be able to do? What can Starfish do so far?
  • 11. Lifecycle of a MapReduce Job: [code slide showing a program's Map function and Reduce function, run as a MapReduce job]
  • 12. Lifecycle of a MapReduce Job: [the same code slide, repeated]
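Slides 11-12 walk through a program's Map and Reduce functions being run as a MapReduce job. As a minimal sketch of that contract (a local Python simulation for illustration, not the actual Hadoop API), WordCount might look like:

```python
from collections import defaultdict

# Minimal local simulation of the MapReduce contract, using WordCount.

def map_fn(line):
    # Map: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce: sum all counts collected for one word.
    return (word, sum(counts))

def run_job(lines):
    # "Shuffle": group map outputs by key, then reduce each group.
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in sorted(groups.items()))

print(run_job(["big data big analytics"]))
# {'analytics': 1, 'big': 2, 'data': 1}
```

On a real cluster the same logic would be expressed against Hadoop's Mapper/Reducer interfaces (or Hadoop Streaming), with the framework doing the grouping in parallel.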
  • 13. Lifecycle of a MapReduce Job: [diagram: input splits feed map tasks running in waves 1 and 2, followed by reduce tasks in waves 1 and 2, over time]. How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
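The slide's question about splits and waves can be made concrete with back-of-envelope arithmetic: one map task per input split, and tasks execute in waves bounded by the cluster's task slots. The split size and slot count below are illustrative assumptions, not defaults of any particular Hadoop version:

```python
import math

# Sketch: map tasks come from input splits; tasks run in "waves"
# bounded by the cluster's map slots. Numbers are illustrative.

def map_waves(input_bytes, split_bytes, map_slots):
    splits = math.ceil(input_bytes / split_bytes)  # one map task per split
    waves = math.ceil(splits / map_slots)          # concurrent batches of maps
    return splits, waves

splits, waves = map_waves(input_bytes=10 * 1024**3,  # 10 GB of input
                          split_bytes=64 * 1024**2,  # assumed 64 MB splits
                          map_slots=80)              # assumed 80 map slots
print(splits, waves)  # 160 2
```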
  • 14. Job Configuration Parameters: 190+ parameters in Hadoop; set manually, or defaults are used (rules-of-thumb).
  • 15. MapReduce Job Tuning in Hadoop: [plot: a 2-dim projection of a 13-dim surface]
  • 16. Challenges Faced by Practitioners: Joe Public can now provision a 100-node Hadoop cluster in minutes. Joe may need answers to: How many reduce tasks should MapReduce job J use for the best performance on my 8-node production cluster? My current cluster needs more than 6 hours to process 1 day's worth of data; I want to reduce that to under 3 hours. Are the MapReduce job workflows running optimally? How many, and what type of, Amazon EC2 nodes should I use?
  • 17. [Workflow diagram: tables Users (username, age, ipaddr), GeoInfo (ipaddr, region), and Clicks (username, url, value) are copied from Amazon S3 storage; jobs filter (value > 0, age < 20, age > 35, url is "Sports"), partition by age (<20, ≤25, ≤35, >35), join, and count users per age, users per region with age > 25, clicks per url type, clicks per region and age, and clicks per age; results are written back to S3 storage (workflows I-VI)]
  • 18. Performance vs. Price Tradeoff: [charts: execution time (sec) and cost ($) for 2-, 4-, and 6-node clusters across m1.small, m1.large, and m1.xlarge node types on Amazon EC2]
  • 19. Outline: Why Starfish? What should Starfish be able to do? What can Starfish do so far?
  • 20. Starfish Architecture: workload-level tuning (Workload Optimizer, Elastisizer); workflow-level tuning (Workflow-aware Optimizer, What-if Engine); job-level tuning (Just-in-Time Optimizer, Profiler, Sampler); Data Manager (Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.)
  • 21. Starfish Architecture: [the same architecture diagram, repeated]
  • 22. Job Configuration Parameters: over 190 parameters; many affect performance in complex ways; impact depends on job, data, and cluster properties.
  • 23. Current Approaches: rules of thumb, e.g., mapred.reduce.tasks = 0.9 * number_of_reduce_slots; io.sort.record.percent = 16 / (16 + average_record_size). Rules of thumb may not suffice.
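The two rules of thumb on this slide translate directly into arithmetic. As a worked example, take an assumed cluster with 40 reduce slots and an assumed 100-byte average record size:

```python
# The slide's two rules of thumb, written out as functions.
# The inputs (40 slots, 100-byte records) are illustrative assumptions.

def reduce_tasks(reduce_slots):
    # mapred.reduce.tasks = 0.9 * number_of_reduce_slots
    return int(0.9 * reduce_slots)

def sort_record_percent(average_record_size):
    # io.sort.record.percent = 16 / (16 + average_record_size)
    return 16 / (16 + average_record_size)

print(reduce_tasks(40))                    # 36
print(round(sort_record_percent(100), 3))  # 0.138
```

As the slide notes, such formulas ignore job, data, and cluster specifics, which is exactly why they may not suffice.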
  • 24. Just-in-Time Job Optimization: the Just-in-Time Optimizer searches through the space of parameter settings; the Profiler collects information about MapReduce job executions; the Sampler collects statistics about the input, intermediate, and output key-value spaces of MapReduce jobs; the What-if Engine uses a mix of simulation and model-based estimation. Code is ready for release! Demo after the talk.
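To give a flavor of what-if estimation, here is a deliberately tiny model: it takes a measured per-record profile and predicts runtime under a hypothetical reduce-task count. The linear model and every number below are illustrative assumptions, far simpler than Starfish's actual simulation and cost models:

```python
import math

# Toy what-if model: runtime = map work + (reduce waves) * (work per task).
# Profile values and slot counts are made-up illustrative numbers.

def estimate_runtime(profile, reduce_tasks, reduce_slots):
    map_time = profile["map_records"] * profile["sec_per_map_record"]
    per_reduce_task = (profile["reduce_records"] / reduce_tasks
                       * profile["sec_per_reduce_record"])
    reduce_waves = math.ceil(reduce_tasks / reduce_slots)
    return map_time + reduce_waves * per_reduce_task

profile = {"map_records": 1_000_000, "sec_per_map_record": 1e-4,
           "reduce_records": 1_000_000, "sec_per_reduce_record": 2e-4}

# Answering "what if I used r reduce tasks?" for a few candidate settings:
for r in (8, 16, 32):
    print(r, estimate_runtime(profile, reduce_tasks=r, reduce_slots=16))
```

Even this toy model shows the non-obvious tradeoff the slides motivate: more reduce tasks shrink per-task work but can add a second wave.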
  • 25. Job Profiler: dynamic instrumentation monitors the phases of MapReduce job execution. Benefits: zero overhead when turned off; works with unmodified MapReduce programs. Used to construct a job profile: a concise representation of job execution that allows in-depth analysis of job behavior.
  • 26. Insights from Job Profiles. WordCount A: few, large spills; Combiner gave high data reduction; Combiner made Mappers CPU bound. WordCount B: many, small spills; Combiner gave smaller data reduction; better resource utilization in Mappers.
  • 27. Estimates from the What-if Engine: [plots: the true surface vs. the surface estimated by the What-if Engine]
  • 28. Starfish Architecture: [the architecture diagram from slide 20, repeated]
  • 29. MapReduce Job Workflows: producer-consumer relationships among jobs. Data layout is crucial for later jobs in the workflow: avoid unbalanced data layouts; make effective use of parallelism.
  • 30. Workflow-Aware Optimizer. Goal: optimize the overall performance of the workflow by selecting the best data layout + job parameters. The overall approach is the same as the job-level optimizer, but with a larger space of options. We hope to support Amazon Elastic MapReduce in the near future (summer project for a Duke undergrad).
  • 31. Starfish Architecture: [the architecture diagram from slide 20, repeated]
  • 32. Elastisizer – Hadoop Provisioning. Goal: make provisioning decisions based on workload requirements (e.g., completion time, cost). [Charts: execution time (sec) and cost ($) for 2-, 4-, and 6-node clusters across m1.small, m1.large, and m1.xlarge node types on Amazon EC2]
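An Elastisizer-style decision can be sketched as: enumerate candidate (node type, node count) configurations, estimate cost for each, and keep the cheapest one that meets the deadline. All prices, counts, and estimated times below are made-up illustrative values, not real EC2 rates or Starfish estimates:

```python
# Toy sketch of deadline-constrained provisioning. All numbers are
# illustrative assumptions, not real EC2 prices or measured runtimes.

PRICE_PER_NODE_HOUR = {"m1.small": 0.08, "m1.large": 0.32, "m1.xlarge": 0.64}

# (node type, node count) -> assumed estimated completion time in hours.
EST_HOURS = {
    ("m1.small", 6): 3.0,
    ("m1.large", 4): 1.5,
    ("m1.xlarge", 2): 1.2,
}

def provision(deadline_hours):
    # Keep configurations that meet the deadline, priced as nodes * rate * hours,
    # and return the cheapest (cost, node type, node count) triple.
    feasible = [
        (count * PRICE_PER_NODE_HOUR[ntype] * hours, ntype, count)
        for (ntype, count), hours in EST_HOURS.items()
        if hours <= deadline_hours
    ]
    return min(feasible) if feasible else None

cost, ntype, count = provision(deadline_hours=2.0)
print(ntype, count, round(cost, 2))  # m1.xlarge 2 1.54
```

With a 2-hour deadline the 6-node m1.small cluster is infeasible, and the 2-node m1.xlarge cluster undercuts the 4-node m1.large one despite the pricier nodes, which is exactly the kind of tradeoff the charts on slides 18 and 32 illustrate.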
  • 33. Optimizing Hadoop Workloads: data-flow sharing, materialization, reorganization. [Architecture diagram: analytics clients (Java Client, Pig, Hive, Oozie, Elastic MR, …) on Starfish, on the Hadoop MapReduce execution engine and distributed file system]
  • 34. Starfish: Self-Tuning System. Focus simultaneously on different workload granularities (workloads, workflows, jobs (procedural & declarative)) and across various decision points (provisioning, optimization, scheduling, data layout). We welcome your collaboration!