The art of data analysis

I conduct workshops on The Art of Data Analysis for corporate clients and at conferences. I recently did the workshop at the Fifth Elephant, a conference on Data in Bangalore. These are the slides I used for that workshop.

For corporate clients, I develop custom case studies relevant to their company or industry. For more details, contact me at
karthik DOT shashidhar AT gmail DOT com

Published in: Technology, Education

  1. The Art Of Data Analysis
     Karthik Shashidhar, Quant Consultant
     karthik.shashidhar@gmail.com
     © Karthik Shashidhar
  2. Introduction | Six-step process | Case Study | Common Pitfalls
  3. Why do you need this workshop?
     • We are moving to an increasingly data-driven world
     • Ability to use data for day-to-day decision-making can prove to be a massive competitive advantage
     • This workshop equips managers with basic tools for dealing with data
  4. Who needs this workshop?
     • Sales Managers: What is the optimal level of sales commissions in order to maximize profitability?
     • Production Managers: How do we set daily production targets given probabilities of line shutdowns?
     • HR Managers: What are the factors that determine employee attrition?
     This workshop is suitable for personnel in middle to senior management roles across functions
  5. Introduction | Six-step process | Case Study | Common Pitfalls
  6. A structured, iterative approach to data-driven decision making
     1. Frame a clear and concise problem statement
     2. Break down your problem into smaller problems, and then use those to generate hypotheses
     3. Gather, clean and prepare data
     4. Test hypotheses. In the process, generate additional hypotheses
     5. Consolidate results to solve the main problem
     6. Make the data tell a story
  7. Introduction | Six-step process | Case Study | Common Pitfalls
  8. The Rs. 32 Poverty Line
     Based on data from the 66th NSSO Survey, the Planning Commission fixed the “Poverty Line” at Rs. 32 per person per day for people living in urban areas. This has led to much controversy and protests. The Prime Minister has asked for your inputs. What do you recommend?
  9. For your reference: the six-step process from slide 6
  10. How would you frame the problem statement for this one?
      • Your client may not have framed the question precisely. You need to do that job and frame a precise problem statement
      • “Solving this problem” should tell you everything you want to know from your analysis
      • Be concise, so that you remain focused on answering your question
      • Frame your question such that it has an objective answer. Yes/No questions or questions with numerical answers are preferred
  11. Has the poverty line been set too low at Rs. 32 per day?
      • This problem statement has an objective answer (yes/no)
      • The solution to this will be necessary and sufficient to answer the question our client (the PM) demands
      • The question directly addresses the situation (people complaining that the poverty line has been set too low)
      • This problem statement is to the point and doesn’t take on additional responsibilities (such as defining an alternate poverty line)
  12. What problems do we need to solve in order to solve the main problem?
      • The set of “level two problems” must be precise and complete, in that:
        • The combination of the solutions of all level two problems leads to the solution of the main problem
        • The solution of each level two problem directly impacts the main problem
      • Once again, it is key to frame problems concisely and with objective answers
      • We need not stop at two levels. Some level two problems might require the solution of deeper problems. Add them to the list of sub-problems
  13. What do we need to know to answer “Has the poverty line been set too low at Rs. 32 per day?”
      • How is “poverty line” defined?
      • What are the implications of the poverty line?
      • What is the distribution of income in India?
      • Does the distribution of income vary across states? If it varies significantly, does it make sense to have a state-wise poverty line?
      • What are the essential goods that most people need?
      • For a given income level, what essential goods can a person afford?
  14. Problems generate sub-problems, and some of these will lead to hypotheses.
      • Hypothesis 1: There is a significant difference in income level across states
      • Hypothesis 2: Essential goods are those that the poorest people consume. Also, their use flattens out as income goes up
  15. Some problems, however, are direct, and don’t need hypotheses. Some are qualitative while others need data
      • Question 1: How is “poverty line” defined?
        • Poverty line is the minimum income level that is deemed adequate
        • If a family is “below poverty line” it qualifies for additional state benefits
      • Question 2: What is the distribution of incomes in each state?
      • Question 3: Is there some kind of a threshold on the proportion of the population that can be below the poverty line?
  16. What data do you need here?
      • It is important to frame the problem and break it down into components before listing data requirements; otherwise the available data could bias you
      • Define data requirements in a general fashion, so that you can easily fall back on proxies
      • Remember to gather data that both answers your questions and allows you to test your hypotheses
  17. Once you’ve identified data requirements, identify sources and gather data
      • Here we need
        • Distribution of a measure of income for India
        • Distribution of a measure of income for each state
        • Spending patterns for different income levels
        • Data on household sizes in different states
  18. Once you’ve identified data requirements, identify sources and gather data
      • The National Sample Survey Organization (NSSO) conducts surveys every 5 years about income and expenditure, so we could perhaps use this
      • However, income data gathered from surveys are notoriously poor in quality
      • The poor have little savings, so their total consumption is a better indicator of income than the reported income data
  19. Data cleaning is an ugly but important step
      • It is important to make sure names in data procured from different sources match
        • For example, some government sites say “AndhraPradesh”, while others say “Andhra Pradesh”. A join on the name fails unless the two are reconciled
      • If the data set is small, go through it once to check numbers for consistency. For example, if you have data on percentages, make sure they add up to 100%
      • For larger data sets, try writing scripts to do basic cleaning
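The two checks on slide 19 (matching names before a join, and percentages adding up to 100%) could be scripted roughly as follows. This is only a sketch: the helper names `normalize_state` and `percentages_are_consistent`, the two small tables and all the numbers in them are hypothetical, made up for illustration, and real sources will need more cleaning rules than this.

```python
import re

def normalize_state(name: str) -> str:
    """Build a join key: split CamelCase, collapse whitespace, lowercase."""
    # Insert a space wherever a lowercase letter is followed by an uppercase one.
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name.strip())
    return re.sub(r"\s+", " ", spaced).lower()

# Hypothetical tables from two sources (illustrative values, not real data).
population = {"AndhraPradesh": 100, "Karnataka": 80}
households = {"Andhra Pradesh": 25, "karnataka ": 20}

pop_by_key = {normalize_state(k): v for k, v in population.items()}
hh_by_key = {normalize_state(k): v for k, v in households.items()}

# Inner join on the normalized key; the raw names would not have matched.
joined = {k: (pop_by_key[k], hh_by_key[k]) for k in pop_by_key if k in hh_by_key}

def percentages_are_consistent(shares, tolerance=0.5):
    """True if the shares add up to 100 within a rounding tolerance."""
    return abs(sum(shares) - 100.0) <= tolerance
```

The tolerance is an assumption too; it exists because published percentages are usually rounded and rarely sum to exactly 100.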
  20. Understand and prepare data before you dive into analysis
      • Get a general feel for the numbers before getting into the analysis
      • Simple visualization techniques such as scatter plots and density plots help
      • Use simple summary statistics (mean, median, SD, quartiles) to get a better feel for the data
      • Check out what different functional forms of your data look like
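The summary statistics named on slide 20 are all available in Python's standard `statistics` module. The sketch below assumes a plain list of numbers; the function name `summarize` and the sample values are invented for illustration and are not survey data.

```python
import statistics

def summarize(values):
    """Mean, median, sample SD and quartiles for a list of numbers."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "q1": q1,
        "q3": q3,
    }

# Made-up monthly spending figures (Rs.), for illustration only.
sample = [600, 700, 800, 900, 1000, 1100, 1200, 5000]
summary = summarize(sample)
```

A mean well above the median, as in this sample, is a quick hint that the data is skewed and worth plotting before any model is fitted.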
  21. While testing hypotheses, be on the lookout for anything interesting/unusual
      • It is impossible to generate all possible hypotheses before you begin the analysis
      • Usually, as you test out some hypotheses, something in the data will stand out which will lead to further hypotheses
      • It is ok to generate these hypotheses, which is what makes it an iterative process
      • However, one needs to be careful not to stray from the original objective: each new hypothesis should tie in directly to the original question
  22. Consolidate results
      • Build up your case in a bottom-up manner
      • Sometimes different pieces of analysis can throw up contradictory inferences. Check, and reconcile before you integrate
      • Make sure all components of the solution that you required are available
      • Don’t include a result in the final analysis unless it makes a definite contribution to the final solution
  23. Use graphics intelligently!
      • A picture is worth a thousand words, so use clear and easy-to-use visualizations to communicate your findings
      • Use visualizations that make the solution self-evident, rather than something that requires a lot of explanation
      • Use your graphics to communicate, not to confuse. If the intent of a graphic is to confuse, it is better to leave out that graphic
      • Sometimes all it takes to solve the problem is to visualize the data from a different perspective!
  24. This graphic shows the decile in which Rs. 32 per day (Rs. 960 per month) would fall in each state
      [Graphic not reproduced in this transcript]
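The calculation behind a graphic like the one on slide 24 could look roughly like this. It is a sketch under stated assumptions, not the method used for the slide: the function name `decile_of`, the state labels and every spending figure are hypothetical and are NOT NSSO data. Only the Rs. 32 per day (Rs. 960 per month) threshold comes from the slides.

```python
def decile_of(threshold, values):
    """Decile (1 to 10) of `values` in which `threshold` falls."""
    below = sum(1 for v in values if v < threshold)
    # Integer arithmetic avoids floating-point surprises at decile boundaries.
    return min(10, (below * 10) // len(values) + 1)

# Made-up monthly per-capita spending samples (Rs.); not survey figures.
spending_by_state = {
    "State A": [500, 650, 800, 900, 1000, 1200, 1500, 2000, 3000, 5000],
    "State B": [900, 1100, 1300, 1600, 2000, 2500, 3000, 4000, 6000, 9000],
}

POVERTY_LINE_MONTHLY = 32 * 30  # Rs. 32 per day is Rs. 960 per month

deciles = {
    state: decile_of(POVERTY_LINE_MONTHLY, values)
    for state, values in spending_by_state.items()
}
```

In this invented sample the same Rs. 960 falls in the fifth decile of one state and the second decile of the other, which is the kind of state-wise contrast Hypothesis 1 asks about.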
  25. Introduction | Six-step process | Case Study | Common Pitfalls
  26. Data-driven inference is fraught with pitfalls. Drawing the wrong conclusion out of data is easier than drawing the right conclusion.
      • Correlation does not imply causality
      • Beware of anecdotal evidence
      • Beware of outliers
      • Don’t overfit models
      • Contradictory inferences from the same data
      • Start with getting a feel for the data
      • Don’t simply throw everything into the mix
      • Don’t over-complicate graphics
      • Models can misbehave
      • Graphics can deceive
  27. Outliers can significantly distort inferences
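A small numerical illustration of the point on slide 27, using made-up values: a single extreme observation nearly doubles the mean while leaving the median untouched.

```python
import statistics

# Made-up figures: nine similar values and one extreme outlier at the end.
values = [48, 50, 51, 49, 52, 50, 47, 53, 50, 500]
without_outlier = values[:-1]

mean_with = statistics.mean(values)
mean_without = statistics.mean(without_outlier)
median_with = statistics.median(values)
median_without = statistics.median(without_outlier)
```

Here the mean moves from 50 to 95, while the median stays at 50 in both cases. Comparing the two is a cheap first check for outliers.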
  28. “Throwing everything into the mix” may not always produce an accurate model
  29. It could lead to multicollinearity, for example
      According to this regression, the tallest person should have an extremely large right foot and a tiny left foot! That makes no sense!
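One common way to spot the problem on slide 29 before fitting is to check how strongly the predictors are correlated with each other. The sketch below computes the Pearson correlation between left and right foot lengths and the resulting variance inflation factor (VIF), which for a model with exactly two predictors is 1 / (1 - r²). The foot measurements are invented for illustration, and the VIF check is a suggested diagnostic, not something taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(xs, ys):
    """Variance inflation factor when a model has exactly these two predictors."""
    r = pearson_r(xs, ys)
    return 1.0 / (1.0 - r * r)

# Made-up foot lengths in cm; left and right feet are almost identical.
left_foot = [24.0, 25.1, 26.0, 27.2, 28.0, 29.1]
right_foot = [24.1, 25.0, 26.1, 27.0, 28.1, 29.0]

r = pearson_r(left_foot, right_foot)
vif = vif_two_predictors(left_foot, right_foot)
```

A VIF far above the usual rule-of-thumb threshold of 10 means the regression cannot separate the two effects, so individual coefficients (a huge positive one and a huge negative one, say) stop being interpretable. Dropping one of the two predictors is the simplest remedy.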
  30. Over-fitting can lead to spurious models
      It helps to keep your models as simple as possible. A simple rule of thumb: a good model is one that can be easily explained in simple English
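To make the over-fitting point on slide 30 concrete, the sketch below compares a straight line with a polynomial that passes exactly through every training point (Lagrange interpolation, used here only as a stand-in for "a model with too many parameters"). The data is invented: y = x plus alternating noise of +1 and -1, with one held-out point.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a simple linear fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

def lagrange_predict(xs, ys, x):
    """Evaluate at x the unique degree n-1 polynomial through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Made-up data: y = x with alternating noise of +1 and -1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 0, 3, 2, 5, 4]
test_x, test_y = 6, 7  # held-out point following the same pattern

slope, intercept = fit_line(train_x, train_y)
line_error = abs(slope * test_x + intercept - test_y)
poly_error = abs(lagrange_predict(train_x, train_y, test_x) - test_y)
```

The polynomial has zero error on the training points but predicts about -57 at x = 6, an error of 64, while the simple line misses by 1.6. A perfect fit to the data in hand says little about how a model behaves on new data.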
  31. Diving into model fitting without first understanding the data can lead to suboptimal results
      People are prone to doing regressions without actually looking at the data. Here, a simple linear regression gives a reasonable fit (R^2 = 42%). However, a simple scatter plot would suggest a clear Y = 1/X kind of relationship which the regression completely misses out on
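The effect described on slide 31 can be reproduced with a few lines of Python. The sketch below uses invented data that follows y = 10/x exactly (not the data behind the slide, so the R² values differ from the 42% quoted there) and compares the fit of a straight line on x with a straight line on 1/x.

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Made-up data following y = 10 / x exactly.
xs = [1, 2, 3, 4, 5, 6, 8, 10]
ys = [10 / x for x in xs]

r2_raw = r_squared(xs, ys)                             # line fitted on x
r2_reciprocal = r_squared([1 / x for x in xs], ys)     # line fitted on 1/x
```

The line on raw x explains only part of the variation, while the same linear regression on the transformed variable 1/x fits essentially perfectly. A scatter plot would have suggested the transformation before any model was run.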
  32. Contradictory inferences can be derived from the same data
  33. Choice of axes and scales can have a significant impact on the message your graphic conveys
      [Two charts, not reproduced in this transcript: one with a y-axis running from 0 to 160, the other from 80 to 150; both with an x-axis from 0 to 16]
  34. Correlation does not imply causality
  35. Mistaking correlation for causality can lead to hilarious conclusions
  36. Readers get turned off by overly complicated graphics
  37. Anecdotal/insufficient data can lead to false conclusions
  38. A model is just that: a model. It is not a substitute for reality
  39. The Art of Data Analysis will be further illustrated by means of a detailed Case Study relevant to your company/industry
      For a half-day workshop on The Art of Data Analysis (including a case study), contact Karthik Shashidhar at karthik.shashidhar@gmail.com
