STARFISH: A SELF-TUNING SYSTEM FOR BIG DATA ANALYTICS
SEMINAR BY 
Y.SAI PRAMODA 
10191A0511
CONTENTS 
• Introduction to Big data 
• Hadoop 
• Tuning problems 
• Starfish Architecture 
• Usage of Starfish 
• Conclusion
INTRODUCTION TO BIG DATA 
• Big data is the term for data sets so large and complex
that they become difficult to process using traditional data
management tools or processing applications.
• What are the tools of Big data?
• Features of Big data analytics
BIG DATA PRACTITIONERS 
• Data analysts: report generation, data mining, ad optimization
• Computational scientists: computational biology, economics, journalism
• Statisticians and machine-learning researchers
• Systems researchers, developers, and testers: distributed systems, networking, security, …
Practitioners want a MAD system: HADOOP
Hadoop is MAD as it is!
• Magnetism: “attracts” or welcomes all sources of data,
regardless of structure, values, etc.
• Agility: adaptive; remains in sync with rapid data
evolution and modification
• Depth: more than just typical analytics; supports
complex operations like statistical analysis
and machine learning
MADDER
• Data-lifecycle Awareness: do more than just queries;
optimize the movement, storage, and processing of big data
• Elasticity: dynamically adjust resource usage
to workload and user requirements
• Robustness: provide storage and querying
services even in the event of some failures
Tuning Challenges 
• Heavy use of programming languages for 
MapReduce programs 
• Data loaded/accessed as opaque files 
• Large space of tuning choices 
• Elasticity is wonderful, but hard to achieve 
• Terabyte-scale data cycles.
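To give a feel for the "large space of tuning choices", the following minimal sketch counts the configurations produced by just five job-level parameters. The parameter names are Hadoop 1.x configuration keys; the candidate values are illustrative assumptions, not recommendations, and the helper function names are hypothetical.

```python
from itertools import product
from math import prod

# Hadoop 1.x job-level parameters; the candidate values are illustrative only.
SEARCH_SPACE = {
    "io.sort.mb": [50, 100, 200, 400],
    "io.sort.factor": [10, 50, 100],
    "io.sort.spill.percent": [0.6, 0.8, 0.9],
    "mapred.reduce.tasks": [1, 8, 32, 64, 128],
    "mapred.compress.map.output": [False, True],
}


def count_configurations(space):
    """Number of distinct settings: the product of the per-parameter choices."""
    return prod(len(values) for values in space.values())


def enumerate_configurations(space):
    """Yield every combination as a {parameter: value} dictionary."""
    names = list(space)
    for combo in product(*(space[name] for name in names)):
        yield dict(zip(names, combo))


if __name__ == "__main__":
    print("configurations to consider:", count_configurations(SEARCH_SPACE))
```

Even this small grid multiplies out quickly, and each added parameter multiplies it again, which is why running every candidate on a real cluster is not practical.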
Tuning Problems
• Job-level MapReduce configuration
• Cluster sizing
• Workload management
• Data layout tuning
• Workflow optimization (a workflow of jobs J1, J2, J3, J4)
Starfish’s Core Approach to Tuning
• Profiler: collects concise summaries of execution
• What-if Engine: estimates impact of hypothetical changes
on execution
• Optimizers: search through space of tuning choices
(Job, Workflow, Workload, Data layout, Cluster)
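The profile, what-if, optimize loop can be sketched as follows. This is a toy illustration under stated assumptions, not Starfish's actual models or API: the `JobProfile` fields, the cost formula, and the constants (compression ratio 0.4, 10% CPU overhead, 0.5 s per reduce task) are all hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JobProfile:
    """Hypothetical summary a profiler might record for one job run."""
    input_mb: float
    map_output_mb: float
    map_mb_per_sec: float      # observed per-task map throughput
    reduce_mb_per_sec: float   # observed per-task reduce throughput
    map_slots: int
    reduce_slots: int


def what_if_runtime(profile, reduce_tasks, compress_map_output):
    """Toy estimate (seconds) of runtime under a hypothetical configuration."""
    map_time = profile.input_mb / (profile.map_mb_per_sec * profile.map_slots)
    # Assumed effect of compression: less data shuffled, some extra CPU time.
    shuffle_mb = profile.map_output_mb * (0.4 if compress_map_output else 1.0)
    cpu_overhead = 0.1 * map_time if compress_map_output else 0.0
    parallel_reducers = min(reduce_tasks, profile.reduce_slots)
    reduce_time = shuffle_mb / (profile.reduce_mb_per_sec * parallel_reducers)
    startup = 0.5 * reduce_tasks  # assumed per-task scheduling overhead
    return map_time + cpu_overhead + reduce_time + startup


def optimize(profile, reduce_choices, compress_choices):
    """Search the candidate settings and return the one with the lowest estimate."""
    candidates = product(reduce_choices, compress_choices)
    return min(candidates, key=lambda c: what_if_runtime(profile, *c))


if __name__ == "__main__":
    profile = JobProfile(10000, 5000, 10, 5, 20, 10)
    print(optimize(profile, [1, 5, 10, 20], [False, True]))
```

The point of the split is that the job runs once to produce a profile, and every later candidate is evaluated by the estimator rather than by another run on the cluster.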
THE STARFISH PHILOSOPHY 
• Goal: A high-performance MAD system 
• Build on Hadoop’s strengths 
• How can users get good performance 
automatically?
STARFISH ARCHITECTURE
VISUALIZE WITH STARFISH 
• See how MapReduce apps are working 
• Understand Bottlenecks in Hadoop 
• Find Misconfigured Hadoop Parameters 
• Learn to develop MapReduce apps
OPTIMIZE WITH STARFISH 
• Tune Hadoop easily 
• Find optimal parameter settings for
MapReduce applications
STRATEGIZE WITH STARFISH 
• Make intelligent resource allocation choices for 
Hadoop. 
• Find Instances for Workloads. 
• Meet time and cost budgets with ease.
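Choosing a cluster size under time and cost budgets can be sketched as a small search over candidate node counts. Everything here is an assumption for illustration: the scaling model (a fixed serial part plus perfectly parallel work), billing per started hour, and the function names are hypothetical and do not describe how Starfish estimates runtimes or costs.

```python
import math


def estimated_hours(work_node_hours, nodes, serial_hours=0.0):
    """Toy scaling model: a fixed serial part plus perfectly parallel work."""
    return serial_hours + work_node_hours / nodes


def cheapest_cluster(work_node_hours, price_per_node_hour, max_hours, max_cost,
                     node_choices, serial_hours=0.0):
    """Return (cost, hours, nodes) for the cheapest size meeting both budgets.

    Ties on cost are broken in favour of the shorter runtime.
    Returns None when no candidate size satisfies both budgets.
    """
    feasible = []
    for nodes in node_choices:
        hours = estimated_hours(work_node_hours, nodes, serial_hours)
        # Assumption: every node is billed for each started hour.
        cost = math.ceil(hours) * nodes * price_per_node_hour
        if hours <= max_hours and cost <= max_cost:
            feasible.append((cost, hours, nodes))
    return min(feasible) if feasible else None


if __name__ == "__main__":
    print(cheapest_cluster(40, 0.25, max_hours=6, max_cost=20,
                           node_choices=[2, 4, 8, 16, 32], serial_hours=0.5))
```

With per-hour billing, a larger cluster is not always more expensive, so it is worth comparing several sizes rather than assuming the smallest feasible one is cheapest.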
STEPS TO USE STARFISH
• First step: collect the profiling data from your
Hadoop cluster.
• Second step: import the profiling data into the profile
store.
• Third step: fire up the graphical or command-line
interface to invoke the visualize, optimize, and strategize
features.
CONCLUSION 
• Hadoop is now a viable competitor to existing
systems for big data analytics.
• Starfish fills a different void by enabling Hadoop
users and applications to get good performance
automatically throughout the data lifecycle in analytics.

Editor's Notes

  • #10
    Profiler: collects summaries of jobs; collects information on a per-task basis.
    What-if Engine: answers questions after the Profiler has run.
    Optimizers: enumerate and search through the decision space to satisfy the requirements.