
Data science and good questions eric kostello


Big Data Camp LA 2014: "The Role of Data Science in Asking and Answering Good Questions" by Eric Kostello of Nielsen


  1. 1. THE ROLE OF DATA SCIENCE IN ASKING AND ANSWERING ‘GOOD QUESTIONS’ Eric Kostello, Big Data Camp, Data Science Track, Los Angeles, CA, 14 June 2014 © 2014 Eric Kostello
  2. 2. DATA ANALYSIS WITHOUT CONTEXT Get data → Prepare data → Analysis → Report / Summary
  3. 3. THE CONTEXT OF DATA ANALYSIS Problem/Issue → Form Objectives & Take actions → Outcome
  4. 4. THE CONTEXT OF DATA ANALYSIS Problem/Issue → Get data → Prepare data → Analysis → Report / Summary → Outcome
  5. 5. EVEN BIGGER CONTEXT • Which objectives are the right objectives? • What is the right way to achieve them? • Increased data set size and increased computational power only address a small piece of the puzzle
  6. 6. SOCIOLOGICAL VIEW • Question everything in the social world, and try to “get to the root of the matter” √ • How do social scientists explain social facts? • With other social facts! • Challenge: Everything in the social world is related to everything else • The whole world is the original “big data” to sociologists
  7. 7. VOCABULARY TEST a be dawned finger Michael on Symonds adjudged beckoned day Hussey Mike overnight the an before depended inevitable much Oz this and being dreaded inside mustered perished Thus Andrew But duo latter near Ponting to/too/two as capacity edge leg new pyrotechnics unfortunate Australia cheered either little ninth run wicket bar Clarke every lunch of shown with batsmen crowd final lustily off side wizard (Any objection to any of these words?)
  8. 8. VOCABULARY TEST dawned finger adjudged beckoned day overnight before depended inevitable much being dreaded inside mustered perished duo latter near capacity edge leg new pyrotechnics unfortunate cheered either little ninth run wicket bar every lunch shown batsmen crowd final lustily side wizard Stop words and proper nouns removed... Ready for a reading test?
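The filtering step behind the second vocabulary slide can be sketched roughly as below. This is only an illustration: the slides do not say which stop-word list or proper-noun detection was used, so `STOP_WORDS`, `PROPER_NOUNS`, and `filter_vocabulary` are hypothetical, hand-picked names and lists based on the words that disappear between the two slides.

```python
# Hypothetical, hand-picked lists for illustration; the talk does not
# specify how stop words or proper nouns were identified.
STOP_WORDS = {
    "a", "an", "and", "as", "be", "but", "of", "off", "on",
    "the", "this", "thus", "to", "too", "two", "with",
}
PROPER_NOUNS = {
    "Andrew", "Australia", "Clarke", "Hussey", "Michael",
    "Mike", "Oz", "Ponting", "Symonds",
}


def filter_vocabulary(words):
    """Return the words that are neither stop words nor proper nouns."""
    kept = []
    for word in words:
        # Stop words are matched case-insensitively; proper nouns exactly.
        if word.lower() in STOP_WORDS or word in PROPER_NOUNS:
            continue
        kept.append(word)
    return kept


sample = ["Thus", "as", "the", "final", "day", "dawned",
          "Ponting", "wizard", "of", "Oz"]
print(filter_vocabulary(sample))  # ['final', 'day', 'dawned', 'wizard']
```

Knowing every remaining word still does not guarantee that the passage on the next slide makes sense, which is the point of the exercise.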
  9. 9. READING TEST Thus, as the final day dawned and a near capacity crowd lustily cheered every run Australia mustered, much depended on Ponting and the new wizard of Oz, Mike Hussey, the two overnight batsmen. But this duo perished either side of lunch--the latter a little unfortunate to be adjudged leg-before--and with Andrew Symonds, too, being shown the dreaded finger off an inside edge, the inevitable beckoned, bar the pyrotechnics of Michael Clarke and the ninth wicket. What Your Kindergartner Needs to Know, E.D. Hirsch, Jr. and John Holdren, eds. (2013)
  10. 10. WHAT IS A SKILL? • Substantive knowledge required for reading • Reading is not a skill • Data analysis is not a skill • Both require substantive understanding • If there is such a thing as “data science” its practitioners must combine skills and subject matter knowledge
  11. 11. SCIENCE • Hard Science • Theories + Evidence = Laws • Experiments • Repeatability! • Is anything we do remotely like that?
  12. 12. SCIENCE??? • Essence of scientific approach is trying to find valid generalizations • “Theories” or “laws” relate causal conditions to outcomes • Making predictions about things that are actually observable is much more likely to result in valid generalizations. • Placing p-values next to regressions does not make an analysis scientific.
  13. 13. REQUIREMENTS FOR A SCIENTIFIC LAW? • Universal? • Plenty of laws are not universal. (e.g. F = ma) • Precise? • The more accurately we measure, the more we discover discrepancies. • No exceptions? • “Exceptions are just the least frequent alternative in a collection of facts.” --M. Bunge
  14. 14. IN/CONSISTENCY • No experiment or empirical result provides absolute consistency (between inputs/outputs or causes and effects) • The finer the measurement, the more inconsistencies you find • (But nobody cares about “laws” that have tiny, tiny scope) [Figure: consistency axis from high to low, with the labels “More general laws, more widely applicable” and “Laws with more limited scope”]
  15. 15. SCIENCE: YES • Lawfulness is not identical to universal applicability + 100% consistency • There can be all kinds of lawfulness, discoverable by proceeding scientifically. • Science advances in a community • Discovery of lawfulness is not only the primary result of scientific research, it is a fundamental presupposition of scientific endeavors.
  16. 16. PROCEEDING SCIENTIFICALLY • Counterfactual thinking • Things could have been a different way [i.e. counter to] actual results [i.e. fact] • Find the conditions that make the difference in outcomes • A pattern in data is not evidence if you search only for that pattern without letting other possibilities into the picture • Observational data complicates things • Have to find a way to meet “all else equal” condition • Reminder: predictions about actually observable phenomena >> p-values
  17. 17. READ A MUCH BETTER VERSION OF THIS ARGUMENT HERE
  18. 18. ORGANIZATIONAL IMPLICATIONS • Matching the level of precision to the scope of the generalizations needed • It helps (a lot) to develop agreement on • What you need to know about to meet objectives • Balance required • Too limited scope doesn’t do much for the next project • Too ambitious is too costly, takes too long, and still might not answer your questions
  19. 19. “SHOW YOUR WORK” • Good answers have credibility when the process that generates them is clear • Building valid generalizations is often incremental, not starting over from scratch each time (in an interactive environment) • Obstacles to modifying your analysis create intolerable friction (mental and organizational) • Spreadsheet jockeys and interactive statistics package users: you are on notice
  20. 20. REPRODUCIBLE COMPUTING ENHANCES COLLABORATION • Thanks to developments like knitr, anybody can reproduce your analysis. • Motivation for the programming steps becomes much clearer • Combining this with distributed version control enables collaborative, reproducible research. • People with complementary skills can collaborate • Most importantly, we are following Knuth’s dictum that we should concentrate on explaining to other humans what we want the computer to do
  21. 21. GOOD QUESTIONS • Started with “Problem/Issue” ... formulate that in form of question • “What is the relationship between this and that?” • Treats the relationship itself as a hypothesis • Good questions are posed at a useful level of abstraction • “So what?” Good questions provide answers for the inevitable “so what?” • Derived from and add to/extend/challenge current thinking
  22. 22. APPROPRIATE DATA SET SIZE • Adding/collecting lots of data just because you can is not a strategy • You might find something surprising/cool • You might waste your time looking at what you have instead of what you need to • The correct balance depends on the problem • Too small to resolve issues is a waste of money • Unnecessarily large is also a waste • Lots of variables won’t save you from endogeneity problems (when cause and effect are unclear)
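A small simulation can make the endogeneity point concrete. The sketch below uses an unmeasured confounder, which is only one form of endogeneity and is an assumption of this example rather than something stated in the talk. The data-generating process, the coefficients, and the names `ols_slope` and `naive_estimate` are all hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Slope of a one-variable least-squares fit of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def naive_estimate(n, seed=0):
    """Regress y on x while ignoring an unmeasured confounder z."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                      # unmeasured confounder (assumed)
        x = z + rng.gauss(0, 1)                  # x is partly driven by z
        y = 1.0 * x + 2.0 * z + rng.gauss(0, 1)  # true effect of x on y is 1.0
        xs.append(x)
        ys.append(y)
    return ols_slope(xs, ys)


# The true effect in this simulation is 1.0 at every sample size.
for n in (200, 20_000):
    print(n, round(naive_estimate(n), 2))
```

Under these assumptions the naive slope settles near 2.0 rather than the true 1.0, and collecting more rows only makes the wrong answer more precise.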
  23. 23. BIG [DATA] CONCLUSION • Big data is • Going to help me find a parking spot • Going to help you offer me up just the right ad at just the right time • Computational complexity of dealing with big data is dropping • Not going to change the logic of establishing lawfulness
  24. 24. SAMPLES VS. DATA SETS • True samples are random and representative. • Statistics from random samples have a clear relationship to the whole population • The rest (i.e. “what you have”) are just ad hoc collections of data of varying quality • Selection bias exists and is a problem when who (or what) gets into the data set is related to what is being measured by the data set. • Example: election polls calling only landline telephones (because cellphone-only voters tend to vote differently than landline users). • Important to think very carefully about the data-generation mechanism.
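The landline example can be sketched as a simulation. Every number here is made up for illustration (40% cellphone-only voters, 65% versus 45% support for a candidate), and `simulate_poll` is a hypothetical helper, not anything from the talk.

```python
import random


def simulate_poll(population_size=100_000, sample_size=2_000, seed=1):
    """Compare a random sample with a landline-only sample (assumed rates)."""
    rng = random.Random(seed)
    voters = []
    for _ in range(population_size):
        cell_only = rng.random() < 0.40              # assumed share of cell-only voters
        support_rate = 0.65 if cell_only else 0.45   # assumed support for the candidate
        supports = rng.random() < support_rate
        voters.append((cell_only, supports))

    def share(group):
        # Fraction of the group that supports the candidate.
        return sum(1 for _, supports in group if supports) / len(group)

    # Who gets into this "data set" depends on phone type, which in this
    # simulation is related to the vote being measured.
    landline_voters = [v for v in voters if not v[0]]
    return {
        "population": share(voters),
        "random_sample": share(rng.sample(voters, sample_size)),
        "landline_only": share(rng.sample(landline_voters, sample_size)),
    }


results = simulate_poll()
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

With these assumed rates the random sample lands close to the population share, while the landline-only sample understates support by several points however many calls are made.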
  25. 25. IS YOUR DATA SUBJECT TO SELECTION BIAS? • Yes • But you have to size the impact • (Paul Rosenbaum’s Observational Studies offers an accessible framework to quantify how vulnerable results are to unmeasured biases) • Thousands of variables won’t help you figure out what is going on if... • you are missing substantial chunks of the population of interest • you are not measuring the right thing(s) • Temptation to assume that lots of data (variables or observations) means lots of coverage and therefore implies representative data • Very poor assumption
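One very rough way to size the impact of a missing segment is a worst-case bound on a proportion. This is not Rosenbaum's sensitivity analysis, which the slide points to. It is a much simpler stand-in that assumes only that the outcome is a yes/no rate and that the missing share of the population is known. The function name `worst_case_bounds` and the example numbers are hypothetical.

```python
def worst_case_bounds(observed_rate, missing_fraction):
    """Bound a population rate when part of the population is unobserved.

    Assumes the unobserved group's rate could be anywhere from 0 to 1.
    """
    if not 0.0 <= observed_rate <= 1.0:
        raise ValueError("observed_rate must be between 0 and 1")
    if not 0.0 <= missing_fraction <= 1.0:
        raise ValueError("missing_fraction must be between 0 and 1")
    covered = 1.0 - missing_fraction
    lower = covered * observed_rate                     # missing group all "no"
    upper = covered * observed_rate + missing_fraction  # missing group all "yes"
    return lower, upper


# Example with made-up numbers: 45% observed support, 40% of voters unreachable.
low, high = worst_case_bounds(0.45, 0.40)
print(f"population rate is between {low:.2f} and {high:.2f}")
```

The width of the interval equals the missing fraction, so adding variables or observations for the people already covered does nothing to narrow it.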
  26. 26. BLACK BOX PREDICTIVE SYSTEMS TO THE RESCUE? • Can you build a machine-learning system that compensates for the missing segments of the population? • Assumes the relationship between the represented and unrepresented population is stable. • But we can’t measure everything, so what do we do?
  27. 27. THE NON-ROYAL ROAD • Measure the right things to discover valid laws about the relationships we are interested in • It could take anywhere from tens to trillions of measurements • Be cognizant of the valid scope of generalizations possible • Selection bias, unreliable humans, etc. • Knowing the weaknesses of a data set or a statistical method is up to us • Combine: Subject matter knowledge and statistical understanding and data manipulation
  28. 28. CAGE MATCH! • Measure the “right” things • Use the “right” data vs. • But this is the only data I have. I know it’s flawed, but it’s this or nothing. • Consider the limitations of who is in the data, the validity of claiming that a measurement is really the thing you want it to be, etc. • Consider if there is any other way to see how far off you are on critical issues because of these limits.
  29. 29. APPLICATION • During the talk I gave a too-hasty example that a lot of people didn’t get. • It showed the strength of the relationship between different items. (This sort of visualization makes sense for trade-offs or flows as well.) • I created the graph using the R library circlize, which implements the circos library in R. • circlize • Zuguang Gu (2014). circlize: Circular visualization in R. R package version 0.0.8. • http://CRAN.R-project.org/package=circlize • circos • http://circos.ca/ • Then I was trying to make a little joke about how I would show the “steps” to produce the plot, with the idea that the listener might think they are about to see R code. What they actually saw was...
  30. 30. STEPS TO PRODUCE THIS VISUALIZATION... [Figure: circular plot with item labels (f39869, f39324, and others)] • Join analyst group that uses data • Try to figure things out “their” way • (Take their problem seriously) • Ask questions • Of them • Of the data • See if there is a way to marshal the data to support better insights • Iterate
  31. 31. FINAL THOUGHTS • Situational lawfulness • Defining the situation in which we are trying to find lawfulness of sufficient generality contributes to organizational harmony and success • “Defining” is not a solo activity • Black boxes can be predictive, but understanding relationships matters • The right data is better than lots of data • The right data depends on the questions, which depend on substantive understanding • Science progresses in a community: Listen and engage actively and broadly • Communicate to contribute • There is no simple formula for “good questions,” only general guidelines about how to head in the right direction
