The R of War


Data Science. .Net/C# Monte Carlo modeling. The R Programming language. See it all come together in one place in this talk. Presentation date 6/13 at Lake County .NET User Group.

Published in: Technology

The R of War

  2. Know Your Enemy
     • Who Was I?
     • Who Am I?
  3. We’ve All Made Mistakes
  4. Motivation
  5. Divide and Conquer
     • Part 1 – What is Data Science?
     • Part 2 – The War Machine
     • Part 3 – Analyzing the Results
  6. Part 1 – What is Data Science?
  7. What is Data Science? (Round 1)
     • Stupid Marketing Terms
     • “Hot New Gig in tech”
     • “The next sexy job”
     • We can’t start there. We need to talk about what created the problem to which Data Science is the solution.
  8. Big Data
     • Consider two events: the Moon landing vs. the Red Bull Stratos jump
     • The amount of data collected, and how cheaply, is what makes all the difference.
     • We’re talking about a scale where petabytes are not unheard of. Terabytes are a relatively manageable data set, and gigabytes are trivial.
     • Data is being collected at an alarming rate, and at extremely low cost.
     • And that’s just one of three big problems with Big Data
  9. Big Data – 3 V’s
  10. With Great Data...
     • Big Data is the raw material.
     • We need some way to sift through all this data: tools, techniques, people
     • Ultimately produce data products
     • By studying and distilling the contents of big datasets, we can make quantifiably better decisions than we would otherwise.
  11. So What is Data Science, Then?
     Statistics is a necessary component of Data Science. It’s what many stock analysts miss. Sure, patterns are everywhere, but if you don’t know what statistically significant is, then you aren’t predicting. You’re gambling.
     Above all, creativity
  12. Things You Might Say if You Flunked Intro Probability and Statistics
     • "Everything happens for a reason."
     • "I’d rather drive than fly. I feel safer."
     • "There’s no such thing as coincidences."
     • "I have a lucky number."
  13. As an Example
  14. Mike Driscoll’s Three Sexy Skills
     • Statistics – raw analysis
     • Data Munging – what to do when data isn’t clean
     • Visualization – how to present information in convincing ways
  15. Typical Process for a Data Scientist
     • Prepare the Data Set
     • Run the Analysis
     • Communicate the Results
  16. So a Bit of a JOAT
     Current Skill Set      What You Need to Know
     DBA                    Dealing with unstructured data
     Statistician           Data that doesn’t fit in memory
     Software Engineer      Statistical models – communication of results
     Business Analyst       Algorithms and tradeoffs at scale
     ’90s Webmaster
  17. Now What?
     • If you want to be a writer, first you have to write.
     • Getting Clean Data is Hard
     • Finding Interesting Data is Fun
     • Google Books (example)
  18. Data Science Can Be Useful …
  19. Data Science Can Be Useful …
  20. Or…
  21. Now That I Have a Hammer…
     … time to find a screw.
  22. Part 2 – The War Machine
  23. Expect the Unexpected
     • Like to play cards
     • Kids like to play
     • Good teaching game: War
  24. The Rules of War
     [Diagram: card pairs for Player 1 and Player 2, labeled "Beats", "Beats", "Nothing Beats: WAR!", "Beats"]
     • All cards to Player 1
     • And this is recursive…
  25. The Problem with War
     • It can be a long game
     • How long, on average?
     • Can it take forever?
     • Unequal distribution of Aces
     • Is there a foregone conclusion?
  26. We Want Answers
     • What would a Data Scientist do?
     • Look for a data set?
     • Find an analytical solution?
  27. Write a Monte Carlo!
     • When the answer is not necessarily deterministic
     • Run simulations of random events to investigate patterns
     • Ability to generate very large datasets to minimize statistical uncertainty
  28. Well, That’s Easy
     • We’re talking about writing some code here
     • Use any language you want
     • But since we’re .NET folks, I used C# to generate my dataset
     • Note: we run these to get statistics, but there are limitations…
     • Take flipping a coin, for instance – 50/50, right?
     • Flips
     • Spins
     • Edgewise
     Let’s have a look at some code, shall we?
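The coin example above can be sketched in a few lines of R. This is an illustrative sketch, not code from the talk: the function name `estimate_heads`, the seed, and the sample sizes are all made up for the example. Note that the 50/50 odds are an assumption baked into the model, which is the limitation the slide hints at: the simulation says nothing about how a real coin behaves when spun or when it lands edgewise.

```r
# Monte Carlo estimate of P(heads), ASSUMING an ideal fair coin.
set.seed(42)  # arbitrary seed so the run is repeatable

estimate_heads <- function(n_flips) {
  flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
  mean(flips == "H")  # fraction of heads in this run
}

sample_sizes <- c(10, 100, 1000, 10000, 100000)
estimates <- sapply(sample_sizes, estimate_heads)

# Standard error of a proportion near 0.5 shrinks like 1 / sqrt(n),
# which is why large simulated datasets reduce statistical uncertainty.
std_error <- sqrt(0.25 / sample_sizes)

print(data.frame(n = sample_sizes, estimate = estimates, std_error = std_error))
```

More flips tighten the estimate around 0.5, but only around the number the model was told to produce in the first place.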
  29. Code… … Nope
  30. Code… … Nope
  31. Code… No No No No Really No
  32. Some Data & other ridiculous concepts for keeping your rodents busy…
  33. Part 3 – Analyzing the R(esults)
  34. Why R?
     • Origin
     • Stable (1993 – current)
     • Based, of course, on the S programming language (plus lexical scoping)
     • University of Auckland, NZ
     • Statistical language
     • Vector based
     • Publication-ready graphics packages
     • Free, open source
     • Scripting language
     • Active community with diverse domain packages
  35. Getting Set Up to Use R
     • What you need
     • A basic R install, currently on version 3.0.1; totally usable as is
     • But… a good IDE is worth its weight in gold
     • RStudio, currently on version 0.97.551; seems widely adopted as the standard environment
     Let’s look at the environment
  36. Basic R Syntax – Concatenation and Assignment
     • Create a vector
         c(1:5)
     • Add two vectors
         a <- c(1:5)
         b <- c(6:10)
         c <- a + b
     • Even better
         d <- a + 5
     Let’s try out the command line. We’ll use it throughout.
  37. Accessing Data From Vectors
     • A vector is kinda like a C# array, so it’s easy to remember: []
         a <- seq(from=10, to=100, by=10)
         a[2]
     • But better than arrays:
         a[2:4]
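For readers following along without a console, here is what the vector snippets from the last two slides evaluate to. The variable `s` and the last two indexing lines are additions for illustration and are not on the slides.

```r
a <- c(1:5)
b <- c(6:10)
a + b        # element-wise sum: 7 9 11 13 15
a + 5        # the single value is recycled across the vector: 6 7 8 9 10

s <- seq(from = 10, to = 100, by = 10)
s[2]         # 20 (R indexes from 1, where a C# array indexes from 0)
s[2:4]       # a range of positions: 20 30 40
s[-1]        # negative index drops an element: everything but the first
s[s > 50]    # logical index keeps matching elements: 60 70 80 90 100
```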
  38. Matrices
     • A grid of numbers
     • Vector of vectors
     • Column, not row-based
     • Can be counterintuitive
         mat<-matrix( 1:12, ncol=4 )
     • Selection similar to vectors
         mat[2,4]
         mat[2:3,3:4]
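The "column, not row-based" point is easiest to see by printing the matrix. The sketch below uses the slide's `mat`; the `by_row` variant is an addition for contrast and is not on the slide.

```r
mat <- matrix(1:12, ncol = 4)   # 3 rows, 4 columns, filled down each column
print(mat)
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

mat[2, 4]       # row 2, column 4: 11
mat[2:3, 3:4]   # rows 2-3 of columns 3-4: 8 11 / 9 12

# byrow = TRUE fills across rows instead, which C# folks may expect
by_row <- matrix(1:12, ncol = 4, byrow = TRUE)
by_row[2, 4]    # 8
```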
  39. Data Frames
     • This is where things start getting good
     • Data Frames are like tables in SQL Server
     • Each row is a data point
     • Each column has a specific type, but can be different
     • You can address columns as frameName$columnName
     • R ships with a basic data frame you can explore… mtcars
     Let’s explore mtcars a bit
     We’ll start with the str() function – (the “structure” function)
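A short exploration of `mtcars` along the lines the slide suggests. Only `str()` and the `$` column syntax come from the slide; the other calls and the `four_cyl` name are picked here for illustration.

```r
str(mtcars)          # structure: 32 observations of 11 numeric variables
head(mtcars, 3)      # first three rows, like SELECT TOP 3 in SQL Server

mtcars$mpg           # one column, returned as a plain vector
mean(mtcars$mpg)     # so vector functions work on it directly

# Row filter plus column selection, roughly a WHERE clause and a column list
four_cyl <- mtcars[mtcars$cyl == 4, c("mpg", "hp")]
nrow(four_cyl)       # 11 four-cylinder cars
```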
  40. Calling Functions
     functionName(arguments)
     Examples of built-in math functions:
         mean(c(1:5))
         sum(mat[,3])
         sum(mat[1,])
         mean(mat[,3])
         range(mat[1,])
         length(mat[,3])
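Evaluated against the `mat` defined on the Matrices slide, the calls above give the results below. The final `round()` line is an addition showing that arguments can also be passed by name.

```r
mat <- matrix(1:12, ncol = 4)   # same matrix as on the Matrices slide

mean(c(1:5))       # 3
sum(mat[, 3])      # third column is 7 8 9, so 24
sum(mat[1, ])      # first row is 1 4 7 10, so 22
mean(mat[, 3])     # 8
range(mat[1, ])    # smallest and largest value: 1 10
length(mat[, 3])   # 3

round(3.14159, digits = 2)   # named argument: 3.14
```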
  41. Basic Graphics
         plot(mat[1,])
         plot(mat[1,], mat[2,])
         hist(mat)
         hist(mtcars$mpg)
  42. Good Graphics
     • install.packages("ggplot2") – it’s a one time dealie
     • Like gems or NuGet
     • library("ggplot2") – like using statements, must be included at runtime
     • ggplot2 comes with qplot (quickplot), which displays better than out-of-the-box graphics
  43. Aesthetics
     • X position
     • Y position
     • Size of Elements
     • Shape of Elements
     • Color of Elements
     That is NOT What GG Stands For! Grammar of Graphics
     Geometrics
     • Points
     • Lines
     • Line segments
     • Bars
     • Text
  44. What Kind of Little Bozo Companies Use R Anyway?
     • New York Times
     • Google
     • FDA
     • John Deere
     • National Weather Service
     • Zillow
     • Consumer Financial Protection Bureau
     • Twitter
     • FourSquare
     • Facebook
  45. Armed to the Teeth
     • What can we do?
     • Let’s run that Monte Carlo
  46. Let’s take a small sample and just look at it:
         qplot(warData$NumTricks)
     Uh oh…
  47. Let’s Remove That Awful Spike
         qplot( warData[warData$NumTricks < 100000,]$NumTricks )
         nrow(warData[warData$NumTricks > 100000,])/nrow(warData) * 100
             = 38.36 %
         mean(warData[warData$NumTricks < 100000,]$NumTricks)
             = 1645
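The filtering pattern on this slide can be tried on a toy data frame. The `warData` below is a hypothetical stand-in invented for illustration, not the talk's data set; only the column name `NumTricks` is taken from the slide, and the `Winner` column is an assumption.

```r
# HYPOTHETICAL stand-in for warData: made-up values, not the talk's results
warData <- data.frame(
  NumTricks = c(250, 1200, 100001, 640, 100001, 3100, 100001, 480),
  Winner    = c(1, 2, 1, 2, 2, 1, 1, 2)   # assumed column, for illustration only
)

stuck    <- warData[warData$NumTricks > 100000, ]   # games that hit the spike
finished <- warData[warData$NumTricks < 100000, ]   # games that ended normally

nrow(stuck) / nrow(warData) * 100       # percent of games in the spike: 37.5
mean(warData$NumTricks > 100000) * 100  # same number, using a logical vector
mean(finished$NumTricks)                # average length of finished games: 1134
```

One detail worth knowing: with a single-column data frame, `warData[rows, ]` collapses to a plain vector and the trailing `$NumTricks` fails, so the toy frame carries a second column.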
  48. So We Regenerate the Data Set
     • Since the maximum of the < 100000 set seemed around 15k
     • If we get to 25000 tricks without ending, assume we’re stuck in a loop – force one player to shuffle their cards
     • What happens when we regenerate the data and look at NumTricks again?
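The talk's simulation was written in C# and is not reproduced in this transcript. As a rough illustration of the loop-breaking idea, here is a sketch in R under stated assumptions: suits are ignored, a war costs three face-down cards plus one face-up card, a player who cannot finish a war forfeits, won cards go to the bottom of the winner's pile in a fixed order, and the forced shuffle repeats every `shuffle_after` tricks rather than happening once. The names `play_war` and `shuffle_after` are hypothetical.

```r
# Sketch of one game of War under the ASSUMED rules listed above.
play_war <- function(shuffle_after = 25000) {
  deck <- sample(rep(2:14, times = 4))   # 52 shuffled ranks, ace high
  p1 <- deck[1:26]
  p2 <- deck[27:52]
  tricks <- 0

  while (length(p1) > 0 && length(p2) > 0) {
    tricks <- tricks + 1

    # Loop breaker: assume we are stuck and shuffle one player's hand.
    # sample.int() avoids the sample(x) surprise when a hand has one card.
    if (tricks %% shuffle_after == 0) {
      p1 <- p1[sample.int(length(p1))]
    }

    pot1 <- integer(0)   # cards player 1 has on the table this trick
    pot2 <- integer(0)   # cards player 2 has on the table this trick
    repeat {
      up1 <- p1[1]; p1 <- p1[-1]
      up2 <- p2[1]; p2 <- p2[-1]
      pot1 <- c(pot1, up1)
      pot2 <- c(pot2, up2)
      if (up1 != up2) {
        winner <- if (up1 > up2) 1 else 2
        break
      }
      # A tie means WAR, and the war can tie again (the recursive part)
      if (length(p1) < 4 || length(p2) < 4) {
        # Whoever cannot finish the war forfeits their remaining cards
        winner <- if (length(p1) < length(p2)) 2 else 1
        if (winner == 1) {
          pot2 <- c(pot2, p2); p2 <- integer(0)
        } else {
          pot1 <- c(pot1, p1); p1 <- integer(0)
        }
        break
      }
      pot1 <- c(pot1, p1[1:3]); p1 <- p1[-(1:3)]   # three face-down cards
      pot2 <- c(pot2, p2[1:3]); p2 <- p2[-(1:3)]
    }

    # Fixed pickup order; a deterministic order is what permits endless loops
    if (winner == 1) {
      p1 <- c(p1, pot1, pot2)
    } else {
      p2 <- c(p2, pot1, pot2)
    }
  }

  c(tricks = tricks, winner = if (length(p1) > 0) 1 else 2)
}

set.seed(2013)                         # arbitrary seed
games <- replicate(20, play_war())     # 2 x 20 matrix: rows "tricks", "winner"
print(t(games))
```

Because the rules and the definition of a "trick" here are assumptions, the game lengths this produces should not be expected to match the numbers on the slides.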
  49. Look Ma, No Spike
     On a Log scale
  50. What About #Winning?
     And What Does That Tell Us?
  51. What Have We Learned
     • Data Science is hot hot hot
     • But not without its hard work
     • We can write models in our language of choice
     • But they’re still models
     • R is a useful tool for interpreting/presenting data
     • It’s quite deep, but don’t be afraid to dip a toe
  52. I Am Not Your Enemy – Know Me Anyway
     • Contact me @kevinpdavis
     • That Conference -