The R of War

Data Science. .NET/C# Monte Carlo modeling. The R programming language. See it all come together in one place in this talk. Presentation date 6/13 at Lake County .NET User Group.

Published in: Technology

  1. 1. The R of War
      • Adventures in Data Science
      • Kevin P. Davis, Ph.D.
  2. 2. Know Your Enemy
      • Who Was I?
      • Who Am I?
  3. 3. We’ve All Made Mistakes
  4. 4. Motivation
  5. 5. Divide and Conquer
      • Part 1 – What is Data Science?
      • Part 2 – The War Machine
      • Part 3 – Analyzing the Results
  6. 6. Part 1 – What is Data Science?
  7. 7. What is Data Science? (Round 1)
      • Stupid marketing terms: “Hot new gig in tech”, “The next sexy job”
      • We can’t start there. We need to talk about what created the problem to which Data Science is the solution.
  8. 8. Big Data
      • Consider two events: the Moon landing vs. the Red Bull Stratos jump
      • The amount of data collected, and how cheaply it was collected, is what makes all the difference.
      • We’re talking about a scale where petabytes are not unheard of, terabytes are a relatively manageable data set, and gigabytes are trivial.
      • Data is being collected at an alarming rate, and at extremely low cost.
      • And that’s just one of three big problems with Big Data
  9. 9. Big Data – 3 V’s
  10. 10. With Great Data...
      • Big Data is the raw material.
      • We need some way to sift through all this data: tools, techniques, people
      • Ultimately, produce data products
      • By studying and distilling the contents of big datasets, we can make quantifiably better decisions than we otherwise would
  11. 11. So What is Data Science, Then?
      • Statistics is a necessary component of Data Science. It’s what many stock analysts miss. Sure, patterns are everywhere, but if you don’t know what statistically significant means, then you aren’t predicting. You’re gambling.
      • Above all, creativity
  12. 12. Things You Might Say if You Flunked Intro Probability and Statistics
      • "Everything happens for a reason."
      • "I'd rather drive than fly. I feel safer."
      • "There's no such thing as coincidences."
      • "I have a lucky number."
  13. 13. As an Example
  14. 14. Mike Driscoll’s Three Sexy Skills
      • Statistics – raw analysis
      • Data Munging – what to do when data isn’t clean
      • Visualization – how to present information in convincing ways
  15. 15. Typical Process for a Data Scientist
      • Prepare the data set
      • Run the analysis
      • Communicate the results
  16. 16. So a Bit of a JOAT (Current Skill Set: What You Need to Know)
      • DBA: dealing with unstructured data
      • Statistician: data that doesn’t fit in memory
      • Software Engineer: statistical models, communication of results
      • Business Analyst: algorithms and tradeoffs at scale
      • ’90s Webmaster
  17. 17. Now What?
      • If you want to be a writer, first you have to write.
      • Getting clean data is hard
      • Finding interesting data is fun
      • http://www.kdnuggets.com/datasets/
      • Google Books (example)
  18. 18. Data Science Can Be Useful …
  19. 19. Data Science Can Be Useful …
  20. 20. Or…
  21. 21. Now That I Have a Hammer… … time to find a screw.
  22. 22. Part 2 – The War Machine
  23. 23. Expect the Unexpected
      • Like to play cards
      • Kids like to play
      • Good teaching game: War
  24. 24. The Rules of War
      • [Diagram: card pairs for Player 1 and Player 2 showing which card beats which. When nothing beats, it’s WAR! The winner of the war takes all cards (Player 1 in the example).]
      • And this is recursive…
  25. 25. The Problem with War
      • It can be a long game
      • How long, on average?
      • Can it take forever?
      • Unequal distribution of Aces
      • Is there a foregone conclusion?
  26. 26. We Want Answers
      • What would a Data Scientist do?
      • Look for a data set?
      • Find an analytical solution?
  27. 27. Write a Monte Carlo!
      • When the answer is not necessarily deterministic
      • Run simulations of random events to investigate patterns
      • Ability to generate very large datasets to minimize statistical uncertainty
  28. 28. Well, That’s Easy
      • We’re talking about writing some code here
      • Use any language you want
      • But since we’re .NET folks, I used C# to generate my dataset
      • Note: we run these to get statistics, but there are limitations…
      • Take flipping a coin, for instance – 50/50, right? Flips, spins, edgewise
      • http://econ.ucsb.edu/~doug/240a/Coin%20Flip.htm
      • Let’s have a look at some code, shall we?
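As a minimal sketch of the Monte Carlo idea (written in R, not the C# used to generate the talk's data set), the snippet below simulates coin flips and shows the standard error of the estimated heads probability shrinking as the number of flips grows. The function name `flip_coins`, the seed, and the sample sizes are illustrative assumptions. Note that the simulation can only return the probability that was fed into it (`p_heads`), which is the limitation the slide hints at: a real coin that is spun or stood on edge may not be 50/50.

```r
# Monte Carlo sketch: estimate P(heads) from simulated flips.
# Assumption: a fair coin (p_heads = 0.5) unless told otherwise; the model
# can only ever return the probability we build into it.
set.seed(42)  # arbitrary seed so the run is repeatable

flip_coins <- function(n, p_heads = 0.5) {
  flips <- rbinom(n, size = 1, prob = p_heads)  # 1 = heads, 0 = tails
  p_hat <- mean(flips)                          # estimated P(heads)
  c(n = n, p_hat = p_hat, std_error = sqrt(p_hat * (1 - p_hat) / n))
}

# More flips -> smaller statistical uncertainty (roughly 1 / sqrt(n)).
results <- t(sapply(c(100, 10000, 1000000), flip_coins))
print(results)
```

Each hundredfold increase in flips should cut the standard error by about a factor of ten, which is why large simulated data sets are attractive.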
  29. 29. Code… … Nope
  30. 30. Code… … Nope
  31. 31. Code… No, No, No, No, Really No
  32. 32. Some Data & other ridiculous concepts for keeping your rodents busy…
  33. 33. Part 3 – Analyzing the R(esults)
  34. 34. Why R?
      • Origin: stable (1993 – current); based, of course, on the S programming language (plus lexical scoping); University of Auckland, NZ
      • Statistical language: vector based, publication-ready graphics packages
      • Free, open source
      • Scripting language
      • Active community with diverse domain packages
  35. 35. Getting Set Up to Use R
      • What you need: a basic R install (http://www.r-project.org/), currently on version 3.0.1
      • Totally usable as is
      • But… a good IDE is worth its weight in gold
      • RStudio (http://www.rstudio.com/), currently on version 0.97.551
      • Seems widely adopted as the standard environment
      • Let’s look at the environment
  36. 36. Basic R Syntax – Concatenation and Assignment
      • Create a vector: c(1:5)
      • Add two vectors: a <- c(1:5), b <- c(6:10), c <- a + b
      • Even better: d <- a + 5
      • Let’s try out the command line. We’ll use it throughout.
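A small worked example of the vector arithmetic on this slide. The names `total` and `shifted` are illustrative choices that avoid reusing `c` as a variable name (the slide's `c <- a + b` works, but it shares a name with the `c()` function).

```r
a <- c(1:5)        # 1 2 3 4 5 (c() is optional here; 1:5 is already a vector)
b <- c(6:10)       # 6 7 8 9 10

total <- a + b     # element-wise addition
shifted <- a + 5   # the single value 5 is recycled across every element

print(total)       # 7 9 11 13 15
print(shifted)     # 6 7 8 9 10
```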
  37. 37. Accessing Data From Vectors
      • A vector is kinda like a C# array, so [] is easy to remember
      • a <- seq(from=10, to=100, by=10)
      • a[2]
      • But better than arrays: a[2:4]
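The indexing on this slide can be tried directly. The last two lines go slightly beyond the slide (negative and logical indexing) and are included only as an illustration of why R vectors are "better than arrays".

```r
a <- seq(from = 10, to = 100, by = 10)  # 10 20 30 ... 100

a[2]        # 20: indexing starts at 1, not 0 as in C#
a[2:4]      # 20 30 40: a range of positions
a[-1]       # everything except the first element
a[a > 50]   # 60 70 80 90 100: select by condition
```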
  38. 38. Matrices
      • A grid of numbers
      • Vector of vectors
      • Column, not row-based (can be counterintuitive)
      • mat <- matrix(1:12, ncol=4)
      • Selection similar to vectors: mat[2,4], mat[2:3,3:4]
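To see why column-based filling can be counterintuitive, here is the slide's matrix printed out, next to a row-filled variant. The name `by_row` and the use of `byrow = TRUE` are additions for comparison, not something the slide shows.

```r
mat <- matrix(1:12, ncol = 4)   # filled down the columns by default
print(mat)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

mat[2, 4]       # 11: row 2, column 4
mat[2:3, 3:4]   # rows 2-3 of columns 3-4: 8 11 / 9 12

# Filling row by row instead gives the layout most people expect.
by_row <- matrix(1:12, ncol = 4, byrow = TRUE)
by_row[2, 4]    # 8
```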
  39. 39. Data Frames
      • This is where things start getting good
      • Data frames are like tables in SQL Server
      • Each row is a data point
      • Each column has a specific type, but columns can have different types
      • You can address columns as frameName$columnName
      • R ships with a basic data frame you can explore… mtcars
      • Let’s explore mtcars a bit
      • We’ll start with the str() function (the “structure” function)
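A short exploration of `mtcars` along the lines the slide suggests. The variable name `efficient` and the 30 mpg cutoff are arbitrary choices for illustration.

```r
str(mtcars)              # structure: 32 observations of 11 numeric variables

head(mtcars$mpg)         # first few values of the mpg column
mean(mtcars$mpg)         # average miles per gallon across all cars

# Filter rows with a condition and keep a few columns, much like
# SELECT mpg, cyl, wt FROM mtcars WHERE mpg > 30 in SQL.
efficient <- mtcars[mtcars$mpg > 30, c("mpg", "cyl", "wt")]
print(efficient)
```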
  40. 40. Calling Functions
      • functionName(arguments)
      • Examples of built-in math functions: mean(c(1:5)), sum(mat[,3]), sum(mat[1,]), mean(mat[,3]), range(mat[1,]), length(mat[,3])
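The calls on this slide, evaluated against the 3 x 4 matrix from the Matrices slide. The helper `spread` at the end is a hypothetical user-defined function, added only to show that your own functions are called the same way.

```r
mat <- matrix(1:12, ncol = 4)  # same matrix as on the Matrices slide

mean(c(1:5))      # 3
sum(mat[, 3])     # 24: the third column is 7 8 9
sum(mat[1, ])     # 22: the first row is 1 4 7 10
mean(mat[, 3])    # 8
range(mat[1, ])   # 1 10: smallest and largest value
length(mat[, 3])  # 3

# A user-defined function follows the same functionName(arguments) pattern.
spread <- function(x) max(x) - min(x)
spread(mat[1, ])  # 9
```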
  41. 41. Basic Graphics
      • plot(mat[1,])
      • plot(mat[1,], mat[2,])
      • hist(mat)
      • hist(mtcars$mpg)
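A runnable version of these base-graphics calls, sketched under the assumption that you are running a script and not working interactively, so the plots are written to a PDF in the session's temp directory. The file name, plot title, and axis label are illustrative.

```r
# Write the plots to a PDF in the temp directory (an assumption for this
# sketch; in RStudio they would simply appear in the Plots pane).
out_file <- file.path(tempdir(), "basic-graphics.pdf")
pdf(out_file)

mat <- matrix(1:12, ncol = 4)
plot(mat[1, ])                        # values of row 1 against their index
plot(mat[1, ], mat[2, ])              # row 1 on the x axis, row 2 on the y axis
h <- hist(mtcars$mpg,
          main = "Fuel economy in mtcars",
          xlab = "Miles per gallon")  # hist() also returns the bin counts

invisible(dev.off())                  # close the device so the file is written
print(h$counts)
```

Because `hist()` returns its bins as an object, the same call can feed further analysis as well as a picture.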
  42. 42. Good Graphics
      • install.packages("ggplot2") – it’s a one-time dealie, like gems or NuGet
      • library("ggplot2") – like using statements, must be included at runtime
      • ggplot2 comes with qplot (quickplot), which displays better than out-of-the-box graphics
  43. 43. That is NOT What GG Stands For! Grammar of Graphics
      • Aesthetics: X position, Y position, size of elements, shape of elements, color of elements
      • Geometrics: points, lines, line segments, bars, text
  44. 44. What Kind of Little Bozo Companies Use R Anyway?
      • New York Times, Google, FDA, John Deere, National Weather Service, Zillow, Consumer Financial Protection Bureau, Twitter, FourSquare, Facebook
  45. 45. Armed to the Teeth
      • What can we do?
      • Let’s run that Monte Carlo
  46. 46. Let’s take a small sample and just look at it:
      • qplot(warData$NumTricks)
      • Uh oh…
  47. 47. Let’s Remove That Awful Spike
      • qplot(warData[warData$NumTricks < 100000,]$NumTricks)
      • nrow(warData[warData$NumTricks > 100000,]) / nrow(warData) * 100 = 38.36 %
      • mean(warData[warData$NumTricks < 100000,]$NumTricks) = 1645
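The filtering pattern on this slide, sketched on a tiny made-up data frame so it can be run without the talk's data. Every value below is hypothetical, as is the `Winner` column; only the `warData` and `NumTricks` names come from the slides.

```r
# Hypothetical stand-in for the talk's data set: one row per simulated game.
# Values above 100000 mark games that never finished. The Winner column is
# an assumed extra column; with a single column, warData[rows, ] would drop
# to a plain vector and $NumTricks would fail.
warData <- data.frame(
  NumTricks = c(250, 1200, 100001, 900, 100001, 15000, 100001, 400),
  Winner    = c(1, 2, 1, 2, 2, 1, 1, 2)
)

stuck    <- warData[warData$NumTricks > 100000, ]  # games caught in a loop
finished <- warData[warData$NumTricks < 100000, ]  # games that ended

pct_stuck     <- nrow(stuck) / nrow(warData) * 100
mean_finished <- mean(finished$NumTricks)

print(pct_stuck)      # 37.5
print(mean_finished)  # 3550
```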
  48. 48. So We Regenerate The Data Set
      • The maximum of the < 100000 set seemed to be around 15k
      • If we get to 25000 tricks without ending, assume we’re stuck in a loop – force one player to shuffle their cards
      • What happens when we regenerate the data and look at NumTricks again?
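The talk's simulation was written in C# and is not reproduced in the slides, so what follows is only a rough R sketch of the same idea, including the 25000-trick loop breaker. The house rules (three face-down cards per war, forfeit when a player cannot finish a war, the order in which won cards are returned to the hand) and the names `play_war` and `Winner` are assumptions, and results from this sketch should not be expected to match the talk's numbers.

```r
# One game of War, returning the trick count and the winner.
# Assumptions (not from the talk): suits are ignored, a war costs three
# face-down cards plus one face-up card per player, the winner's pot goes
# to the bottom of their hand in play order, and a player who cannot
# finish a war forfeits (equal hands are awarded to player 1).
play_war <- function(max_tricks = 25000) {
  deck <- sample(rep(2:14, times = 4))  # shuffled ranks, 14 = ace
  p1 <- deck[1:26]
  p2 <- deck[27:52]
  tricks <- 0

  while (length(p1) > 0 && length(p2) > 0) {
    tricks <- tricks + 1
    pot <- integer(0)
    winner <- 0

    while (winner == 0) {
      c1 <- p1[1]; c2 <- p2[1]          # each player flips the top card
      p1 <- p1[-1]; p2 <- p2[-1]
      pot <- c(pot, c1, c2)

      if (c1 > c2) {
        winner <- 1
      } else if (c2 > c1) {
        winner <- 2
      } else if (length(p1) < 4 || length(p2) < 4) {
        # Tie, but someone cannot finish the war: the shorter hand forfeits.
        winner <- if (length(p1) >= length(p2)) 1 else 2
        pot <- c(pot, p1, p2)
        p1 <- integer(0)
        p2 <- integer(0)
      } else {
        # WAR: three cards each go face down, then the loop compares again.
        pot <- c(pot, p1[1:3], p2[1:3])
        p1 <- p1[-(1:3)]
        p2 <- p2[-(1:3)]
      }
    }

    if (winner == 1) p1 <- c(p1, pot) else p2 <- c(p2, pot)

    # Loop breaker: every max_tricks tricks, assume the game is cycling
    # and have player 1 shuffle their hand.
    if (tricks %% max_tricks == 0) {
      p1 <- p1[sample.int(length(p1))]
    }
  }

  list(tricks = tricks, winner = if (length(p1) > 0) 1 else 2)
}

set.seed(2013)  # arbitrary seed for a repeatable run
games <- replicate(10, play_war(), simplify = FALSE)
warData <- data.frame(
  NumTricks = sapply(games, function(g) g$tricks),
  Winner    = sapply(games, function(g) g$winner)
)
print(summary(warData$NumTricks))
```

Ten games keeps the run short. A data set large enough to plot would need many more, and a compiled language such as the C# used in the talk will generate it much faster than this loop-heavy R code.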
  49. 49. Look Ma, No Spike
      • On a log scale
  50. 50. What About #Winning?
      • And what does that tell us?
  51. 51. What Have We Learned
      • Data Science is hot hot hot
      • But not without its hard work
      • We can write models in our language of choice
      • But they’re still models
      • R is a useful tool for interpreting/presenting data
      • It’s quite deep, but don’t be afraid to dip a toe
  52. 52. I Am Not Your Enemy – Know Me Anyway
      • Contact me: @kevinpdavis, kevin@kevinpdavis.com
      • http://www.kevinpdavis.com
      • http://www.github.com/daviskevinp
      • http://www.slideshare.net/kevinpdavis
      • http://SoftwareArchaeology.blogspot.com
      • That Conference - http://www.ThatConference.com
