What is Data Science? (Round 1)
  Stupid Marketing Terms
    “Hot New Gig in tech”
    “The next sexy job”
  We can’t start there. We need to talk about what created the problem to which Data Science is the solution.
Big Data
  Consider two events
    Moon Landing vs. Red Bull Stratos jump
    The amount of data collected, and how cheaply, is what makes all the difference.
  We’re talking about a scale where petabytes are not unheard of. Terabytes are a relatively manageable data set, and gigabytes are trivial.
  Data is being collected at an alarming rate, and at extremely low cost.
  And that’s just one of three big problems with Big Data
With Great Data...
  Big Data is the raw material.
  We need some way to sift through all this data
    Tools
    Techniques
    People
  Ultimately produce data products
  By studying and distilling the contents of big datasets, we can make quantifiably better decisions than we would otherwise
So What is Data Science, Then?
  Statistics is a necessary component of Data Science. It’s what many stock analysts miss. Sure, patterns are everywhere, but if you don’t know what statistical significance is, then you aren’t predicting. You’re gambling.
  Above all, creativity
Things You Might Say if You Flunked Intro Probability and Statistics
  "Everything happens for a reason."
  "I’d rather drive than fly. I feel safer."
  "There’s no such thing as coincidences."
  "I have a lucky number."
Mike Driscoll’s Three Sexy Skills
  Statistics – raw analysis
  Data Munging – what to do when data isn’t clean
  Visualization – how to present information in convincing ways
Typical Process for a Data Scientist
  Prepare the Data Set
  Run the Analysis
  Communicate the Results
So a Bit of a JOAT
  Current Skill Set
    DBA
    Statistician
    Software Engineer
    Business Analyst
    ‘90’s Webmaster
  What You Need to Know
    Dealing with unstructured data
    Data that doesn’t fit in memory
    Statistical models – communication of results
    Algorithms and tradeoffs at scale
Now What?
  If you want to be a writer, first you have to write.
  Getting Clean Data is Hard
  Finding Interesting Data is Fun
    http://www.kdnuggets.com/datasets/
    Google Books (example)
Expect the Unexpected
  Like to play cards
  Kids like to play
  Good teaching game: War
The Rules of War
  [Slide graphic: pairs of cards for Player 1 and Player 2, labeled “Beats”, “Beats”, “Nothing Beats”, “WAR!”, “Beats”]
  All cards to Player 1
  And this is recursive…
The Problem with War
  It can be a long game
    How long, on average?
    Can it take forever?
  Unequal distribution of Aces
    Is there a foregone conclusion?
We Want Answers
  What would a Data Scientist do?
    Look for a data set?
    Find an analytical solution?
Write a Monte Carlo!
  When the answer is not necessarily deterministic
  Run simulations of random events to investigate patterns
  Ability to generate very large datasets to minimize statistical uncertainty
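To make the "large datasets minimize statistical uncertainty" point concrete, here is a minimal R sketch. It assumes an idealized fair coin, and the names estimate_heads and sample_sizes are made up for illustration; nothing here comes from the talk's own code.

```r
# Monte Carlo sketch: estimate P(heads) for an idealized fair coin
# (an assumption) at several sample sizes.
set.seed(123)  # fixed seed so repeated runs give the same numbers

# Hypothetical helper: simulate n_flips flips, return the share of heads
estimate_heads <- function(n_flips) {
  flips <- sample(c("H", "T"), size = n_flips, replace = TRUE)
  mean(flips == "H")
}

sample_sizes <- c(10, 100, 10000, 1000000)
estimates <- sapply(sample_sizes, estimate_heads)

# The standard error of a proportion shrinks like 1 / sqrt(n),
# which is why bigger simulated datasets give tighter answers.
std_errors <- sqrt(0.5 * 0.5 / sample_sizes)

print(data.frame(n = sample_sizes, estimate = estimates, std_error = std_errors))
```

Going from 100 flips to 10,000 flips only buys a tenfold reduction in uncertainty, so cheap simulated data matters.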
Well, That’s Easy
  We’re talking about writing some code here
    Use any language you want
    But since we’re .NET folks, I used C# to generate my dataset
  Note: we run these to get statistics, but there are limitations…
    Take flipping a coin, for instance – 50/50, right?
      Flips
      Spins
      Edgewise
    http://econ.ucsb.edu/~doug/240a/Coin%20Flip.htm
  Let’s have a look at some code, shall we?
Why R?
  Origin
    Stable (1993 – current)
    Based, of course, on the S programming language (plus lexical scoping)
    University of Auckland, NZ
  Statistical language
  Vector based
  Publication-ready graphics packages
  Free, open source
  Scripting language
  Active community with diverse domain packages
Getting Set Up to Use R
  What you need
    A basic R install (http://www.r-project.org/)
      Currently on version 3.0.1
      Totally usable as is
    But… a good IDE is worth its weight in gold
    RStudio (http://www.rstudio.com/)
      Currently on version 0.97.551
      Seems widely adopted as the standard environment
  Let’s look at the environment
Basic R Syntax – Concatenation and Assignment
  Create a vector
    c(1:5)
  Add two vectors
    a <- c(1:5)
    b <- c(6:10)
    c <- a + b
  Even better
    d <- a + 5
  Let’s try out the command line. We’ll use it throughout.
Accessing Data From Vectors
  Vector’s kinda like a C# array, so easy to remember
    a <- seq(from=10, to=100, by=10)
    a
  But better than arrays:
    a[2:4]
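A few more ways to pull data out of the same vector, as a short sketch. One thing worth flagging for C# folks: R indexes from 1, not 0. The extra selections beyond a[2:4] are illustrative additions, not from the slides.

```r
a <- seq(from = 10, to = 100, by = 10)   # 10 20 30 ... 100

a[2:4]        # a range of positions: 20 30 40 (R counts from 1)
a[c(1, 10)]   # arbitrary positions: 10 100
a[a > 70]     # a logical test as the index: 80 90 100
a[-1]         # negative index drops an element: everything but the first
a + 5         # arithmetic applies to every element: 15 25 ... 105
```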
Matrices
  A grid of numbers
  Vector of vectors
  Column, not row-based
    Can be counterintuitive
    mat <- matrix(1:12, ncol=4)
  Selection similar to vectors
    mat[2,4]
    mat[2:3,3:4]
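The "column, not row-based" point is easiest to see by printing the matrix. This sketch reuses mat from the slide; by_row is an extra name introduced here only to show the contrast.

```r
mat <- matrix(1:12, ncol = 4)   # filled down each column, not across rows
dim(mat)                        # 3 rows, 4 columns
mat
#      [,1] [,2] [,3] [,4]
# [1,]    1    4    7   10
# [2,]    2    5    8   11
# [3,]    3    6    9   12

mat[2, 4]       # row 2, column 4: 11
mat[2:3, 3:4]   # a 2 x 2 block from the lower-right corner

# For comparison: ask for row-wise filling explicitly
by_row <- matrix(1:12, ncol = 4, byrow = TRUE)
by_row[2, 4]    # now row 2 is 5 6 7 8, so this is 8
```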
Data Frames
  This is where things start getting good
  Data Frames are like tables in SQL Server
    Each row is a data point
    Each column has a specific type, but the types can differ from column to column
  You can address columns as frameName$columnName
  R ships with a basic data frame you can explore…
    mtcars
  Let’s explore mtcars a bit
  We’ll start with the str() function (the “structure” function)
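A first pass over mtcars might look something like the sketch below. The 25 mpg cutoff and the name efficient are arbitrary choices for illustration.

```r
str(mtcars)             # one line per column: its type and first few values
dim(mtcars)             # rows x columns

head(mtcars$mpg)        # frameName$columnName pulls out one column as a vector
mean(mtcars$mpg)        # ...which works with the usual vector functions

# Row selection works a lot like a SQL WHERE clause:
# keep cars over 25 mpg (arbitrary cutoff) and only three of the columns
efficient <- mtcars[mtcars$mpg > 25, c("mpg", "cyl", "wt")]
efficient
```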
Calling Functions
  functionName(arguments)
  Examples of built-in math functions:
    mean(c(1:5))
    sum(mat[,3])
    sum(mat[1,])
    mean(mat[,3])
    range(mat[1,])
    length(mat[,3])
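Here is what those calls return, assuming mat is the same 3 x 4 matrix built earlier with matrix(1:12, ncol = 4). The last line is an extra illustration of a named argument, not something from the slide.

```r
mat <- matrix(1:12, ncol = 4)   # column 3 is 7 8 9; row 1 is 1 4 7 10

mean(c(1:5))      # 3
sum(mat[, 3])     # 7 + 8 + 9 = 24
sum(mat[1, ])     # 1 + 4 + 7 + 10 = 22
mean(mat[, 3])    # 8
range(mat[1, ])   # smallest and largest: 1 10
length(mat[, 3])  # 3

# Arguments can also be passed by name
mean(c(1, NA, 3), na.rm = TRUE)   # ignores the missing value: 2
```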
Good Graphics
  install.packages("ggplot2") – it’s a one-time dealie
    Like gems or NuGet
  library("ggplot2") – like using statements, must be included at runtime
  ggplot2 comes with qplot (quickplot), which displays better than out-of-the-box graphics
That is NOT What GG Stands For!
  Grammar of Graphics
  Aesthetics
    X position
    Y position
    Size of Elements
    Shape of Elements
    Color of Elements
  Geometrics
    Points
    Lines
    Line segments
    Bars
    Text
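The grammar idea can be sketched with the built-in mtcars data: aesthetics map columns to visual properties, and a geometry decides what gets drawn. This assumes ggplot2 is already installed; the choice of columns is arbitrary and only for illustration.

```r
library(ggplot2)

# Aesthetics: which column drives x, y, color, and size
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl), size = hp)) +
  geom_point() +                          # geometry: draw each row as a point
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       color = "Cylinders", size = "Horsepower")

print(p)   # a plot is an object; printing it renders it
```

Swapping geom_point() for another geometry changes how the same mappings are drawn, without touching the aesthetics.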
What Kind of Little Bozo Companies Use R Anyway?
  New York Times
  Google
  FDA
  John Deere
  National Weather Service
  Zillow
  Consumer Financial Protection Bureau
  Twitter
  FourSquare
  Facebook
Armed to the Teeth
  What can we do?
  Let’s run that Monte Carlo
Let’s take a small sample and just look at it:
  qplot(warData$NumTricks)
Uh oh…
So We Regenerate The Data Set
  Since the maximum of the < 100000 set seemed around 15k
  If we get to 25000 tricks without ending, assume we’re stuck in a loop – force one player to shuffle their cards
  What happens when we regenerate the data and look at NumTricks again?
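The talk's dataset came from a C# simulation that is not reproduced here, so what follows is only a rough R sketch of the same idea under assumed house rules: ranks only, aces high, up to three face-down cards per war, and won cards going to the bottom of the winner's pile in a fixed order. The names play_war, drop_first, Winner, and P1Aces are made up for illustration. NumTricks and the 25000-trick reshuffle come from the slides, though reshuffling again at every further multiple of 25000 is an assumption of this sketch.

```r
# Hypothetical helper: drop the first n cards (safe when n is 0)
drop_first <- function(x, n) x[seq_along(x) > n]

play_war <- function(max_tricks = 25000) {
  deck <- sample(rep(2:14, times = 4))   # shuffled ranks, ace = 14
  p1 <- deck[1:26]
  p2 <- deck[27:52]
  p1_aces <- sum(p1 == 14)               # aces dealt to Player 1
  tricks <- 0

  while (length(p1) > 0 && length(p2) > 0) {
    # Probably stuck in a loop: force Player 1 to shuffle their cards
    if (tricks > 0 && tricks %% max_tricks == 0) {
      p1 <- p1[sample.int(length(p1))]
    }
    tricks <- tricks + 1
    pot <- integer(0)
    repeat {
      c1 <- p1[1]
      c2 <- p2[1]
      p1 <- drop_first(p1, 1)
      p2 <- drop_first(p2, 1)
      pot <- c(pot, c1, c2)
      if (c1 != c2 || length(p1) == 0 || length(p2) == 0) break
      # WAR: up to three face-down cards each, keeping one card to flip
      n <- min(3, length(p1) - 1, length(p2) - 1)
      pot <- c(pot, p1[seq_len(n)], p2[seq_len(n)])
      p1 <- drop_first(p1, n)
      p2 <- drop_first(p2, n)
    }
    # Higher card takes the pot; in an unfinished war, whoever still has cards
    p1_takes <- c1 > c2 || (c1 == c2 && length(p1) > 0)
    if (p1_takes) p1 <- c(p1, pot) else p2 <- c(p2, pot)
  }
  data.frame(NumTricks = tricks,
             Winner = if (length(p1) > 0) 1L else 2L,
             P1Aces = p1_aces)
}

set.seed(42)
warData <- do.call(rbind, replicate(200, play_war(), simplify = FALSE))
summary(warData$NumTricks)               # how long do games run?
table(warData$P1Aces, warData$Winner)    # aces dealt vs. who won
```

Two hundred games is a small sample chosen to keep the sketch quick; the larger the run, the steadier the summary numbers should get.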
What About #Winning?
And What Does That Tell Us?
What Have We Learned
  Data Science is hot hot hot
    But not without its hard work
  We can write models in our language of choice
    But they’re still models
  R is a useful tool for interpreting/presenting data
    It’s quite deep, but don’t be afraid to dip a toe
I Am Not Your Enemy – Know Me Anyway
  Contact me
    @kevinpdavis
    email@example.com
    http://www.kevinpdavis.com
    http://www.github.com/daviskevinp
    http://www.slideshare.net/kevinpdavis
    http://SoftwareArchaeology.blogspot.com
  That Conference - http://www.ThatConference.com