Just the basics_strata_2013


Transcript

  • 1. Just the Basics: Core Data Science Skills
    William Cukierski, PhD (will.cukierski@kaggle.com)
    Ben Hamner (ben.hamner@kaggle.com)
    Photo by mikebaird, www.flickr.com/photos/mikebaird
  • 2. JUST the basics
    We mean the basics!
    –  Ask dumb questions! (We’ll give dumb answers.)
    –  We can’t be comprehensive, but we can omit pretense and jargon
    –  Expect a little Python, R, Matlab, Excel, command line, hand-waving
  • 3. Pronounced Kah-gull (as in waggle), not Kegel (as in bagel)!
  • 4. Before we get started
    –  You’ll need a Kaggle account: www.kaggle.com/account/register
    –  Create a team for the competition: www.kaggle.com/c/just-the-basics-strata-2013
    –  Add (Strata) to the end of your team name, e.g. William Cukierski (Strata)
  • 5. Agenda:
    –  Preliminaries
    –  Identifying a Problem
    –  Performing the analysis
    –  Visualizing the Solution
    –  Contest!
  • 6. Will background
    Physics & Biomedical Engineering
    –  Studied machine learning for diagnosis of pathology images
    –  Constantly reinventing sophomore-level CS concepts
    Former “successful” machine learning competitor
    –  Successful?
       •  Finished near top?
       •  Got me a job?
       •  Fooled people into believing I understand stats (a.k.a. “data scientist”)
  • 7. Ben background
    Biomedical Engineering & Electrical Engineering
    –  Applied machine learning to improve brain-computer interfaces
    –  Software development in various languages / domains
    Machine learning competitions
    –  Top finishes in many (2010-2011)
    –  Teamed up with Will on several
    –  Switched to the dark side, spent much of the past year designing competitions at Kaggle
    Image caption: Driving a Brain-Controlled Wheelchair
  • 8. The unfortunate hype of modern analytics
    •  BIG DATA!
    •  Every second 6.2 trillion exabytes of data are being collected
    •  Need shared vocabulary, shared protocols
    •  Need to leverage
       –  weather reports
       –  surveys
       –  text documents
       –  human genomes
       –  regulatory information
       –  cell phone logs
       –  satellite surveillance
       –  etc.
       –  etc.
       –  etc.
  • 9. What do we do about it?
    •  Create committees, consortiums, taxonomies, platforms, frameworks, clouds
    •  Create acronyms for our committees, consortiums, taxonomies, platforms, frameworks, clouds
    •  Go to conferences to promote and learn about our acronym’d things
    •  And if time permits and the mood strikes? Work
  • 10. I’m ready to leave now!
  • 11. Big Data Barry
    Lives by the Shirky Principle: preserving the problem to which he is the solution
    Favorite talking points: data provenance, data warehousing, data privacy, data regulations, data silos, need for standards, need for standards on standards of standards, lack of data correctness, need for communication
    Source: http://mojette.deviantart.com/
  • 12. Listen, I’ve been in this field for 22 years. The Bayesian guys in the modeling group are never gonna talk to the IT guys because they don’t speak the same language. In my 22 years of experience, what we need are tighter standards around what the processes should be for requesting data, how that data should be stored, and who should have access to the data. Also privacy. Privacy is a thing about which I have no clue, but nonetheless I’m compelled to steamroll even the most benign use of our data for anything beyond occupying a database. Oh, and speaking of databases and my 22 years of experience, we need stricter governance about the schemas and policies that inform the ways the data gets federated, so the model guys will stop trying to implement things that’ll never work…
  • 13. Seriously, guys, let me out!
  • 14. The plight of the data scientist
    Job description:
    Data Scientist (n.) Person who is better at statistics than any software engineer and better at software engineering than any statistician.
    Job reality:
    Data Scientist (n.) Person who is worse at statistics than any statistician and worse at software engineering than any software engineer.
  • 15. Data science (noun): Statistics done wrong
    Speech bubbles:
    –  “This problem can only be solved by an 8th-order kernel projection onto an orthonormal space of homoscedastic eigentensors”
    –  “I’m making an Excel VBA script to access our Oracle database and find the mean of the revenue column!”
    –  “The boss is going to have my neck if I can’t get this Hadoop iPhone app ready in time for Strata”
  • 16. Data science
    The application of scientific experimentation (hypothesis testing, model generation, statistical analysis) in problem-agnostic ways.
    Not data science
    {infographics, apps, site architecture, sending JSON thingies around, Javascript frameworks, web analytics, plotting tweets on maps, cloud storage, domains that end in .io, any idea/thing/product that touches data}
  • 17. Agenda:
    –  Preliminaries
    –  Identifying a Problem
    –  Performing the analysis
    –  Visualizing the Solution
    –  Contest!
  • 18. Chart (vertical axis: Sophistication; horizontal axis: Gain), listed from most to least sophisticated:
    Analytics
    –  Optimization: What’s the best that can happen?
    –  Predictive Modeling: What will happen next?
    –  Forecasting/extrapolation: What if these trends continue?
    –  Statistical analysis: Why is this happening?
    Access and reporting
    –  Alerts: What actions are needed?
    –  Query/drill down: What exactly is the problem?
    –  Ad hoc reports: How many, how often, where?
    –  Standard reports: What happened?
    Source: Competing on Analytics, Davenport/Harris, 2007
  • 19. When to use data
    Asking specific questions is mostly harmless
    –  How many users bought shampoo X at store Y last quarter?
    Prediction is not a free lunch
    –  Being data-driven and wrong is easy and bad
    –  Fancy models should serve fancy questions
       •  Don’t forecast something that can be measured
    Human knowledge precedes machine knowledge
    –  Sometimes black boxes work
    –  Often, they don’t: earthquakes, finance models, etc.
  • 20. When to use data
    Human experts are good at generalization
    Human experts are bad at
    –  Accurate predictions
    –  Estimating the uncertainty of their predictions
    –  Making the same prediction under the same evidence
    –  Updating predictions in the face of new evidence
    –  Ignoring unrelated evidence
  • 21. http://www.nytimes.com/interactive/science/rock-paper-scissors.html
  • 22. We need to teach the computer to generalize
    laptop:~ wcuk$ RUN IT’S A BEAR
    -bash: BEAR: threat not found
  • 23. …without overfitting
    laptop:~ wcuk$ RUN IT’S A BEAR
    run: Must specify one of -black -grizzly -teddy
    laptop:~ wcuk$ RUN IT’S A BEAR -grizzly
    run: Are you sure you want to run? (y/n) y
    run: Enter the bear’s name: Rupert
    run: Is it Rupert with the scar on his ear? He’s cool. He’s more of a salmon kind of bear. (y/n): n
    run:...RUN!!!!!!!
  • 24. Storing data
    Binary
    Text
    Database
    “If you wish to make an apple pie from scratch, you must first invent the universe.” – Carl Sagan
  • 25. Reading data into a useful format
    We overcomplicate storage and formats
    –  Databases are quite often a bad choice
    –  Most data science is a batch process on tabular data
    –  Your debugging cycle should be fast
    Why text?
    –  Simple
    –  Universal
    –  Fast (to read/write/debug)
    –  Transparent
  • 26. Most data is not useful for scientific experimentation
    –  Too “macro” (lacking causal detail)
    –  Meant for human consumption
  • 27. Structured data is not always machine ready

    Game 1
        Seat 1: Solracca ($95.30 in chips)
        Seat 2: BrickT63 ($127.10 in chips)
        Seat 3: sven160482 ($184.30 in chips)
        Seat 4: Adelantez ($103 in chips)
        Seat 6: manfred zeal ($155.50 in chips)
        Solracca: posts small blind $0.50
        BrickT63: posts big blind $1
        *** HOLE CARDS ***
        sven160482: raises $1 to $2
        Adelantez: raises $5.50 to $7.50
        manfred zeal: folds
        Solracca: folds
        BrickT63: folds
        sven160482: folds
        Uncalled bet ($5.50) returned to Adelantez
        Adelantez collected $5.50 from pot
        *** SUMMARY ***
        Total pot $5.50 | Rake $0
        Seat 4: Adelantez collected ($5.50)

    Game 2
        Seat 1: Kingcovey ($108.65 in chips)
        Seat 3: VoronIN_exe ($119.80 in chips)
        Seat 4: ehle123 ($104 in chips)
        Seat 5: MercuriusAA ($107.60 in chips)
        Seat 6: budapestkin ($133.15 in chips)
        budapestkin: posts small blind $0.50
        Kingcovey: posts big blind $1
        *** HOLE CARDS ***
        VoronIN_exe: raises $2 to $3
        ehle123: folds
        MercuriusAA: folds
        budapestkin: calls $2.50
        Kingcovey: folds
        *** FLOP *** [7c Tc Ks]
        budapestkin: checks
        VoronIN_exe: bets $4.45
        budapestkin: calls $4.45
        *** TURN *** [7c Tc Ks] [8c]
        budapestkin: checks
        VoronIN_exe: checks
        *** RIVER *** [7c Tc Ks 8c] [Kc]
        budapestkin: bets $11
        VoronIN_exe: folds
        Uncalled bet ($11) returned to budapestkin
        budapestkin collected $15.15 from pot
        *** SUMMARY ***
        Total pot $15.90 | Rake $0.75
        Seat 6: budapestkin collected ($15.15)
  • 28. A word of caution on scraping
    •  Scraping is time intensive, unleveraged, brittle
    •  Before you code, research existing libraries!
       –  Will solve 95% of the problems you don’t even know you will have
       –  E.g. web scraping using Python’s BeautifulSoup:

        import re
        from urllib.request import urlopen  # urllib2.urlopen in Python 2, as on the slide
        from bs4 import BeautifulSoup

        # uniqify() and getStuff() are the speakers' own helpers; they are not shown on the slide.
        page = urlopen("http://www.kaggle.com/competitions")
        soup = BeautifulSoup(page.read(), "html.parser")
        allLinks = uniqify(soup.find_all("a"))
        for link in allLinks:
            href = link.get("href") or ""  # some <a> tags have no href
            if re.search(r"^/c/.*", href):
                # "/c/some-contest" becomes "some-contest.zip"
                fileName = (href.replace("/", "_") + ".zip")[3:]
                getStuff(fileName, "http://www.kaggle.com" + href + "/publicleaderboarddata.zip")
  • 29. Excel has a time and place
    –  Looking at data
    –  Pivot tables
    –  Quick plots to verify things
    Never:
    –  Pass spreadsheets around
    –  “Code” in Excel
    –  Create workflows that require copy/pasting data around
  • 30. Excel
  • 31. Agenda:
    –  Preliminaries
    –  Identifying a Problem
    –  Performing the analysis
    –  Visualizing the Solution
    –  Contest!
  • 32. Command line
  • 33. Glossary
    features = attributes = independent variables
    targets = gold standard = ground truth = dependent variable(s)
    training set = data & targets used to train a model
    validation set = data & targets used as feedback in model training
    test set = separate data & targets used only to evaluate the model
    cross validation = partitioning the training set to estimate how well a model will generalize
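The cross validation entry can be made concrete with a small sketch. This is only one way to partition a training set (k-fold), written in plain Python; the function name `k_fold_indices`, the fold count, and the seed are illustrative assumptions, not something from the slides.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, valid_idx) pairs that partition range(n_rows) into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        valid = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, valid


# Hypothetical usage: 600 training rows, 5 folds.
for train_idx, valid_idx in k_fold_indices(n_rows=600, k=5):
    print(len(train_idx), "rows to fit on,", len(valid_idx), "rows held out")
```

Each row is held out exactly once, so averaging the held-out scores gives an estimate of how the model might generalize without touching the test set.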
  • 34. Diagram labels: Read, Feature Extraction, Learn; Train, Generalize, Test
  • 35. Bayes theorem
    How to update beliefs in the face of evidence?
    For proposition A and evidence B:
    $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
    –  P(A) = prior (belief in A)
    –  P(B) = evidence
    –  P(A | B) = posterior (belief in A given B)
    –  P(B | A) = likelihood
    $$P(\text{female} \mid \text{long hair}) = \frac{P(\text{long hair} \mid \text{female})\,P(\text{female})}{P(\text{long hair})}$$
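A worked version of the long-hair example may help. The slide gives no numbers, so every probability below is a made-up assumption chosen only to show the arithmetic.

```python
def posterior(prior, likelihood, evidence):
    """Bayes theorem: P(A | B) = P(B | A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Hypothetical inputs (not from the slides).
p_female = 0.5                 # prior P(female)
p_long_given_female = 0.75     # likelihood P(long hair | female)
p_long_given_male = 0.15       # P(long hair | male)

# Evidence P(long hair), by the law of total probability.
p_long = p_long_given_female * p_female + p_long_given_male * (1 - p_female)

p_female_given_long = posterior(p_female, p_long_given_female, p_long)
print(round(p_female_given_long, 3))
```

Under these assumed numbers, seeing long hair moves the belief in "female" from the 0.5 prior to roughly 0.83.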
  • 36. R
  • 37. MATLAB
  • 38. Agenda:
    –  Preliminaries
    –  Identifying a Problem
    –  Performing the analysis
    –  Visualizing the Solution
    –  Contest!
  • 39. Visualization
    Speak the language of your audience
    –  Use simple plots
    –  Use units that matter (dollars, time, widgets)
    –  Include the units!
    –  Don’t use acronyms!
    Most visualization should be internal facing (am I doing this right?) and not external facing (hey check this out!)
  • 40. •  Plotting raw features
    •  Looking for outliers, anomalies, correlation
    •  Verifying feature selection or dimensionality reduction
    •  Looking at manifold density
    •  Looking at class separation
    •  Babysitting model performance
    •  Looking for optima
    •  Watching for sensitivity to initial conditions, perturbations
    •  Summarizing
    •  Checking the result is reasonable
    •  Comparisons to the alternative
  • 41. Your job is to solve a problem
    –  Sell the message, not the graphic
    Avoid chartjunk
    “The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” – Edward Tufte
  • 42. source: http://i.dailymail.co.uk/i/pix/2012/03/21/article-2118152-124602BE000005DC-0_964x528.jpg
  • 43. source: http://www.fivethirtyeight.com/2009/10/older-and-wealthier-people-are-more.html
  • 44. Election fraud: 2D histograms of the number of units for a given voter turnout (x axis) and the percentage of votes (y axis) for the winning party. source: http://www.pnas.org/content/early/2012/09/20/1210722109.abstract
  • 45. ggplot2
  • 46. Agenda:
    –  Preliminaries
    –  Identifying a Problem
    –  Performing the analysis
    –  Visualizing the Solution
    –  Contest!
  • 47. Make a spam detector
    The data represents a corpus of emails. Some are spam and some are normal.
    •  Due to time constraints, feature extraction is done for you:
       –  train.csv - contains 600 emails x 100 features
       –  train_labels.csv - contains the 600 training labels (1 = spam, 0 = normal)
       –  test.csv - contains 4000 emails x 100 features
    •  Submit a file with each of the 4000 predictions on a separate line (in the same order as test.csv).
       –  No header is necessary
       –  Predictions can be continuous numbers or 0/1 labels
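The submission format described above (one prediction per line, no header, same order as test.csv) could be written out with a few lines of Python. This is a minimal sketch: the helper name `write_submission`, the output file name, and the constant 0.5 baseline are assumptions for illustration, not part of the contest materials.

```python
import os
import tempfile


def write_submission(predictions, path):
    """Write one prediction per line, with no header, in the order given."""
    with open(path, "w", newline="") as f:
        for p in predictions:
            f.write(f"{p}\n")


# Hypothetical baseline: predict 0.5 for each of the 4000 test emails.
predictions = [0.5] * 4000
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_submission(predictions, out_path)
print("wrote", len(predictions), "predictions to", out_path)
```

A real entry would replace the constant list with model outputs computed from test.csv, keeping the row order unchanged.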
  • 48. How the leaderboard works

    Return%   ProductID   Dept        Price    MFR
    1.94      54323       Household   54.95    USA
    0.023     92356       Household   9.95     USA
    0.8       78023       Computer    4.5      China
    0.01      12340       Audio       109.99   China
    0.41      31240       Audio       29.99    Taiwan
    0.97      12351       Hardware    54.95    Mexico
    0.0115    90141       Hardware    4.99     USA
    0.4       81240       Hardware    6.55     Taiwan
    0.03      14896       Computer    211.99   Korea
    0.205     62132       Computer    1100     USA
    1.6878    54323       Audio       34.99    USA
    0.0345    92356       Audio       7.99     USA
    0.64      78023       Household   229.9    Brazil
    0.72      12340       Audio       19.95    Mexico
    0.41      31240       Computer    6.99     Taiwan
    1.94      54323       Hardware    11.99    Taiwan
    0.023     92356       Household   2.05     USA
    0.08      78023       Computer    99.99    USA
    2.09      12340       Computer    129.99   China
    1.1       31240       Audio       18.99    China

  • 49. How the leaderboard works
    Same table, now split: the first 13 rows are labeled Training and the last 7 rows are labeled Test. The Return% values of the Test rows are the Solution (“Ground Truth”).

  • 50. How the leaderboard works
    Same Training rows. The Return% values of the Test rows are hidden:

    Return%   ProductID   Dept        Price    MFR
    ?         12340       Audio       19.95    Mexico
    ?         31240       Computer    6.99     Taiwan
    ?         54323       Hardware    11.99    Taiwan
    ?         92356       Household   2.05     USA
    ?         78023       Computer    99.99    USA
    ?         12340       Computer    129.99   China
    ?         31240       Audio       18.99    China

  • 51. How the leaderboard works
    Same Training rows. The Return% values of the Test rows are replaced by the Submission:

    Return%   ProductID   Dept        Price    MFR
    0.03      12340       Audio       19.95    Mexico
    1.298     31240       Computer    6.99     Taiwan
    0.94      54323       Hardware    11.99    Taiwan
    0.04      92356       Household   2.05     USA
    0.36      78023       Computer    99.99    USA
    1.2       12340       Computer    129.99   China
    0.02      31240       Audio       18.99    China

  • 52. How the leaderboard works
    Same as slide 51, with the Test rows of the Submission divided into a Public Leaderboard portion and a Private Leaderboard portion.
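The public/private idea can be sketched by scoring the slide's submission against the slide's ground truth on two disjoint sets of test rows. The seven solution and submission values are copied from the slides; the choice of which rows are public and the use of mean absolute error as the metric are assumptions made only for this illustration.

```python
# Ground truth and submitted Return% for the 7 test rows (values from the slides).
solution = [0.72, 0.41, 1.94, 0.023, 0.08, 2.09, 1.1]
submission = [0.03, 1.298, 0.94, 0.04, 0.36, 1.2, 0.02]

# Assumption: the slide does not say which rows are public; this split is hypothetical.
public_rows = [0, 1, 2]
private_rows = [3, 4, 5, 6]


def mean_absolute_error(truth, predicted, rows):
    """Average absolute difference over the selected rows (an assumed metric)."""
    return sum(abs(truth[i] - predicted[i]) for i in rows) / len(rows)


public_score = mean_absolute_error(solution, submission, public_rows)
private_score = mean_absolute_error(solution, submission, private_rows)
print("public:", round(public_score, 4), "private:", round(private_score, 4))
```

Competitors see only the public score while the contest runs; the private score, computed on rows they never get feedback on, decides the final ranking.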
  • 53. Area under the receiver-operating characteristic curve
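One way to read the area under the ROC curve: it is the probability that a randomly chosen spam email gets a higher score than a randomly chosen normal one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name `auc` and the toy labels and scores are illustrative assumptions, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = spam, 0 = normal.
labels = [0, 0, 1, 1]
continuous = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, continuous))
```

Because only the ordering of the scores matters, continuous predictions and 0/1 labels are both valid inputs, though hard 0/1 predictions create many ties.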
  • 54. Example Model
  • 55. Think about
    •  Missing values
    •  Noise
    •  Combinations of features
    •  Transformations of features (e.g. log)
    •  Combinations of methods
    •  Overfitting
    •  Binary vs. continuous predictions
    •  How good is a good spam detector?
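Two of the items above, missing values and log transformations, can be sketched in a few lines. Mean imputation and `log1p` are just one reasonable choice each, and the column values are invented for illustration; none of this comes from the contest data.

```python
import math


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in column]


def log_transform(column):
    """Compress a skewed, non-negative feature with log(1 + x)."""
    return [math.log1p(x) for x in column]


# Hypothetical feature column with one missing value and one large outlier.
raw = [1.0, None, 3.0, 200.0]
filled = impute_mean(raw)
print(filled)
print([round(v, 3) for v in log_transform(filled)])
```

Whether either step helps is an empirical question; comparing cross-validated scores with and without it is the usual check.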