Data Science 101: Using R Language to get Big Insights
Satnam Singh, Senior Chief Engineer, Samsung Research India – Bangalore

India Software Developers Conference 2013, Bangalore

Notes
  • The R statistical programming language is a free, open-source package based on the S language developed by Bell Labs. The language is very powerful for writing programs. Many statistical functions are already built in, and contributed packages expand the functionality to cutting-edge research. Since it is a programming language, writing code to complete tasks is required. R implements many common statistical procedures and has a large collection of intermediate tools for data analysis. It has excellent graphical facilities for data analysis and display, either on-screen or on hardcopy. It is a well-developed, simple and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities. Versions of R exist for Windows, MacOS, Linux and various other Unix flavors. It has a vibrant worldwide community.
  • The c() function creates a vector, which is then assigned to an object (y on the slide).
  • Data frame: a table where columns can contain numeric and string values. Matrix: all columns must contain either numeric or string values, but these cannot be combined. Data frame d is converted into a matrix e. R: f <- as.data.frame(e) converts matrix e into a data frame f.
  • The smartphone has a tri-axial accelerometer that measures acceleration in all three spatial dimensions. Accuracy is about 75% for a general model and above 95% for a personalized model using 10 seconds of training data for each activity. The accelerometer is a low-power sensor, so it can be used for the whole day.
  • The 'randomForest' package provides the randomForest function. The 'party' package provides conditional random forests. randomForest can be used for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.

    1. Data Science 101: Using R Language to get Big Insights
       Satnam Singh, Senior Chief Engineer, Samsung Research India – Bangalore [Twitter: @satnam74s]
       India Software Developers Conference, Bangalore, March 16, 2013
    2. Motivation: Using Data to get Business Insights
       [Diagram: "Data Bases & Clusters" shown three times, each paired with the question "Insights?"]
    3. Data Science Programming Languages (Ref. [kaggle.com])
       Why R? (Ref. [http://www.r-project.org])
         • Popular, free
         • Open source
         • Multi-platform
         • Vectorization
         • Many statistical packages
         • Large support base
         • Object-oriented programming language
    4. R Language Basics
       Simple Operations, Vector Operations, Function Calls
         > y <- 21
         > y
         [1] 21
         > z = 233
         > z
         [1] 233
         > y <- c(1,2,3,4)
         > y
         [1] 1 2 3 4
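The slide names vector operations and function calls but only shows assignment. A minimal sketch of both follows. The helper `square` and the variable names are hypothetical and not from the talk.

```r
# Vector operations apply element-wise, with no explicit loop.
y <- c(1, 2, 3, 4)
doubled <- y * 2                   # each element multiplied by 2
shifted <- y + c(10, 20, 30, 40)   # element-wise addition of two vectors

# Function calls: built-in summaries and a user-defined function.
total <- sum(y)                    # sum of all elements
avg <- mean(y)                     # arithmetic mean
square <- function(x) x^2          # hypothetical helper, for illustration only
squares <- square(y)               # the function is applied to the whole vector

print(doubled)
print(shifted)
print(c(total = total, mean = avg))
print(squares)
```

Because arithmetic is vectorized, the same `square` function works on a single number or on a whole vector without modification.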
    5. R Language: Data Structures Examples
         • Data frame
         • Matrix
         • List
         > MyFamilyage <- c(5,6,40,38)
         > MyFamilyName <- c("Sat","Veera","Minu","Dummy")
         > MyFamilyweight <- c(72,70,12,40)
         > MyFamily <- data.frame(MyFamilyName, MyFamilyage, MyFamilyweight)
         > MyMatrix <- as.matrix(MyFamilyage)
         > Mydataframe <- as.data.frame(MyMatrix)
         > MyList <- as.list(Mydataframe)
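To make the difference between the three structures concrete, here is a small sketch that reuses the slide's values. The shorter column names (`name`, `age`, `weight`) are chosen for illustration. It shows that a data frame keeps mixed column types, while converting it to a matrix forces a single type.

```r
# A data frame can hold columns of different types.
family <- data.frame(
  name   = c("Sat", "Veera", "Minu", "Dummy"),
  age    = c(5, 6, 40, 38),
  weight = c(72, 70, 12, 40)
)

# A matrix holds one type only: mixing text and numbers yields a character matrix.
mixed_matrix <- as.matrix(family)

# Restricting to the numeric columns keeps the matrix numeric.
numeric_matrix <- as.matrix(family[, c("age", "weight")])

# A list keeps each column as a separate named element.
family_list <- as.list(family)

str(family)
print(is.character(mixed_matrix))
print(is.numeric(numeric_matrix))
print(names(family_list))
```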
    6. Case Study: Activity Recognition
         • Activity Recognition: Detect walking, driving, biking, climbing stairs, standing, etc.
       [Figures: Example of Accelerometer data; Smartphone's Accelerometer Sensor]
       [Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
       [Ref] Jordan Frank, McGill University
       [Ref] Commercial API Providers: Sensor Platforms, Movea, Alohar
    7. Data Analysis - Steps
       Time Series Data: 200 samples (10 sec)
       Feature Extraction: 43 Features
         • Mean for each acc. axis (3)
         • Std. dev. for each acc. axis (3)
         • Avg. abs. diff. from mean for each acc. axis (3)
         • Avg. resultant acc. (1)
         • Histogram (30)
       Classifiers
         • CART: Decision Tree
         • RF: Random Forest
       Classify the Activity
       [Ref] Gary M. Weiss and Jeffrey W. Lockhart, Fordham University, Bronx, NY
       [Ref] Jordan Frank, McGill University
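The feature extraction step can be sketched roughly as follows for one window of 200 tri-axial samples. This is an illustration under stated assumptions, not the authors' code. The function name `extract_features` is hypothetical, the histogram is assumed to be 10 equal-width bins per axis (30 values in total), and the input is synthetic random data. Only the feature groups listed on the slide are computed, which gives 40 values; the slide does not list the remaining 3 of the 43 features, so they are not reconstructed here.

```r
# Sketch: features for one window (rows = samples, columns = X, Y, Z).
# Assumption: 10 equal-width histogram bins per axis, giving 30 histogram values.
extract_features <- function(window, n_bins = 10) {
  stopifnot(is.matrix(window), ncol(window) == 3)
  avg <- colMeans(window)                            # mean per axis (3)
  std <- apply(window, 2, sd)                        # standard deviation per axis (3)
  abs_dev <- colMeans(abs(sweep(window, 2, avg)))    # avg. abs. difference from mean (3)
  resultant <- mean(sqrt(rowSums(window^2)))         # avg. resultant acceleration (1)
  hist_frac <- apply(window, 2, function(x) {        # fraction of samples per bin (30)
    breaks <- seq(min(x), max(x), length.out = n_bins + 1)
    as.numeric(table(cut(x, breaks = breaks, include.lowest = TRUE))) / length(x)
  })
  unname(c(avg, std, abs_dev, resultant, as.vector(hist_frac)))
}

# Synthetic stand-in for 10 seconds of accelerometer data (200 samples).
set.seed(1)
window <- matrix(rnorm(200 * 3), ncol = 3,
                 dimnames = list(NULL, c("X", "Y", "Z")))
features <- extract_features(window)
print(length(features))
```

Each window of raw samples becomes one row of features, and those rows are what the classifiers are trained on.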
    8. Data Visualization – Activity (Class Variable)
       [Ref] Rattle R Data Mining Tool
       [Figures: Bar Plot; Dot Plot]
         require(gplots, quietly=TRUE)      # provides barplot2()
         require(colorspace, quietly=TRUE)  # provides rainbow_hcl()
         ds <- rbind(summary(na.omit(crs$dataset[,]$class)),
                     summary(na.omit(crs$dataset[,][crs$dataset$class=="Downstairs",]$class)),
                     summary(na.omit(crs$dataset[,][crs$dataset$class=="Jogging",]$class)),
                     summary(na.omit(crs$dataset[,][crs$dataset$class=="Sitting",]$class)),
                     summary(na.omit(crs$dataset[,][crs$dataset$class=="Standing",]$class)),
                     summary(na.omit(crs$dataset[,][crs$dataset$class=="Upstairs",]$class)),
                     summary(na.omit(crs$dataset[,][crs$dataset$class=="Walking",]$class)))
         ord <- order(ds[1,], decreasing=TRUE)
         bp <- barplot2(ds[,ord], beside=TRUE, ylab="Frequency", xlab="class",
                        ylim=c(0, 2497), col=rainbow_hcl(7))
         dotchart(ds[nrow(ds):1,ord], col=rev(rainbow_hcl(7)), labels="",
                  xlab="Frequency", ylab="class", pch=c(1:6, 19))
    9. Data Visualization Example – Variable YAVG
       [Ref] Rattle R Data Mining Tool
         require(colorspace, quietly=TRUE)  # provides rainbow_hcl()
         ds <- rbind(data.frame(dat=crs$dataset[,][,"YAVG"], grp="All"),
                     data.frame(dat=crs$dataset[,][crs$dataset$class=="Downstairs","YAVG"], grp="Downstairs"),
                     data.frame(dat=crs$dataset[,][crs$dataset$class=="Jogging","YAVG"], grp="Jogging"),
                     data.frame(dat=crs$dataset[,][crs$dataset$class=="Sitting","YAVG"], grp="Sitting"),
                     data.frame(dat=crs$dataset[,][crs$dataset$class=="Standing","YAVG"], grp="Standing"),
                     data.frame(dat=crs$dataset[,][crs$dataset$class=="Upstairs","YAVG"], grp="Upstairs"),
                     data.frame(dat=crs$dataset[,][crs$dataset$class=="Walking","YAVG"], grp="Walking"))
         bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hcl(7),
                       xlab="class", ylab="YAVG", varwidth=TRUE, notch=TRUE)
         require(doBy, quietly=TRUE)
         points(1:7, summaryBy(dat ~ grp, data=ds, FUN=mean, na.rm=TRUE)$dat.mean, pch=8)
         hs <- hist(ds[ds$grp=="All",1], main="", xlab="YAVG", ylab="Frequency",
                    col="grey90", ylim=c(0, 2137.72617616154), breaks="fd", border=TRUE)
    10. Correlation Plot
        [Ref] Rattle R Data Mining Tool
          • Easy to interpret
          • Blue: Positive correlation
          • Red: Negative correlation
          require(ellipse, quietly=TRUE)
          crs$cor <- cor(crs$dataset[, crs$numeric], use="pairwise", method="pearson")
          crs$ord <- order(crs$cor[1,])
          crs$cor <- crs$cor[crs$ord, crs$ord]
          print(crs$cor)
          plotcorr(crs$cor, col=colorRampPalette(c("red", "white", "blue"))(11)[5*crs$cor + 6])
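The colour expression in the `plotcorr` call is compact, so a short worked example may help. An 11-step red-white-blue palette is indexed by `5 * r + 6`, which maps a correlation `r` in [-1, 1] onto positions 1 to 11. The sample correlations below are made up for illustration.

```r
# Build the same 11-colour palette used on the slide (grDevices is attached by default).
palette11 <- colorRampPalette(c("red", "white", "blue"))(11)

# Hypothetical correlation values covering the full range.
cors <- c(-1, -0.5, 0, 0.5, 1)

# Map [-1, 1] onto [1, 11]; R truncates fractional indices (3.5 becomes 3).
idx <- 5 * cors + 6
cols <- palette11[idx]

print(data.frame(correlation = cors, index = trunc(idx), colour = cols))
```

Strong negative correlations land at the red end of the palette, values near zero land on white, and strong positive correlations land at the blue end, which matches the legend on the slide.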
    11. Data Science R Packages

        Category     Function      Library       Description
        Cluster      hclust        stats         Hierarchical cluster analysis
                     kmeans        stats         K-means clustering
        Classifiers  glm           stats         Logistic regression
                     rpart         rpart         Recursive partitioning and regression trees
                     ksvm          kernlab       Support Vector Machine
                     apriori       arules        Rule-based classification
        Ensemble     ada           ada           Stochastic boosting
                     randomForest  randomForest  Random Forests classification and regression
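As a quick illustration of the two clustering functions in the table, the sketch below runs `kmeans` and `hclust` from the stats package. It uses R's built-in `iris` data as a stand-in, since the smartphone dataset is not included in the slides. The choice of three clusters is an assumption made for the example.

```r
# Stand-in data: the four numeric columns of the built-in iris dataset.
features <- iris[, 1:4]

# K-means clustering with 3 centres; nstart restarts reduce sensitivity to initialisation.
set.seed(42)
km <- kmeans(features, centers = 3, nstart = 10)

# Hierarchical clustering on Euclidean distances, then cut the tree into 3 groups.
hc <- hclust(dist(features), method = "complete")
hc_groups <- cutree(hc, k = 3)

# Compare cluster assignments against the known species labels.
print(table(kmeans = km$cluster, species = iris$Species))
print(table(hclust = hc_groups, species = iris$Species))
```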
    12. Decision Tree - Visualization
        [Ref] Rattle R Data Mining Tool
    13. Decision Tree Model Results
          n= 3792
          1) root 3792 2364 Walking (0.098 0.3 0.057 0.049 0.12 0.38)
            2) YABSOLDEV>=5.095 1097 85 Jogging (0.0055 0.92 0 0 0.031 0.041)
              4) ZAVG>=-4.125 1058 46 Jogging (0.0057 0.96 0 0 0.032 0.0057) *
              5) ZAVG< -4.125 39 0 Walking (0 0 0 0 0 1) *
            3) YABSOLDEV< 5.095 2695 1312 Walking (0.14 0.047 0.08 0.069 0.16 0.51)
              6) YSTANDDEV< 1.675 382 175 Sitting (0 0 0.54 0.44 0 0.016)
          Variables actually used in tree construction:
            RESULTANT YABSOLDEV YAVG YSTANDDEV ZABSOLDEV ZAVG
          Root node error: 2364/3792 = 0.62342
        Decision Tree
          rpart(formula = class ~ ., data = smartphone_data, method = "class",
                parms = list(split = "information"),
                control = rpart.control(usesurrogate = 0, maxsurrogate = 0))
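For readers who want to reproduce this kind of output, here is a hedged sketch of the same `rpart` call on R's built-in `iris` data, used as a stand-in because `smartphone_data` is not available here. In the printed tree each line reads: node number, split, number of observations, number misclassified (loss), predicted class, and class probabilities in parentheses. A trailing `*` marks a leaf.

```r
library(rpart)

# Same settings as on the slide, applied to a stand-in dataset.
fit <- rpart(Species ~ ., data = iris, method = "class",
             parms = list(split = "information"),
             control = rpart.control(usesurrogate = 0, maxsurrogate = 0))

# Printed format: node), split, n, loss, yval, (yprob); * denotes a terminal node.
print(fit)

# Predict classes on the training data and report the training accuracy.
pred <- predict(fit, newdata = iris, type = "class")
train_accuracy <- mean(pred == iris$Species)
print(train_accuracy)
```

Read this way, the root line on the slide says that predicting Walking for all 3792 windows would misclassify 2364 of them, which is the root node error of 0.62342.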
    14. Random Forest: Ensemble of Trees
        [Ref] Rattle R Data Mining Tool
        [Diagram: Tree 1, Tree 2, …, Tree n feed into Σ, which yields the Random Forest output]
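For classification, the combining step (Σ in the diagram) is a majority vote over the individual trees' predictions. The sketch below shows only that step in base R. The vote matrix is made up for illustration and the helper `majority_vote` is hypothetical; it is not how the randomForest package is implemented internally.

```r
# Hypothetical per-tree predictions: rows are observations, columns are trees.
tree_votes <- matrix(
  c("Walking", "Walking",  "Jogging",
    "Sitting", "Standing", "Sitting",
    "Jogging", "Jogging",  "Jogging"),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("obs", 1:3), paste0("tree", 1:3))
)

# Majority vote: the most frequent class among the trees wins.
majority_vote <- function(votes) names(which.max(table(votes)))
forest_pred <- apply(tree_votes, 1, majority_vote)

print(forest_pred)
```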
    15. Random Forest Model Results
          Number of observations used to build the model: 3792
          Type of random forest: classification
          OOB estimate of error rate: 11.05%
          Confusion matrix:
                       Downstairs  Jogging  Sitting  Standing  Upstairs  Walking  class.error
          Downstairs          204        7        0         1        64       97   0.45308311
          Jogging               6     1117        0         0         8        7   0.01845343
          Sitting               0        0      209         5         1        0   0.02790698
          Standing              4        0        0       177         4        0   0.04324324
          Upstairs             48       31        1         0       276       97   0.39072848
          Walking              20        1        1         1        15     1390   0.02661064
        Random Forest Package in R
          randomForest(formula = class ~ ., data = smartphone_data, ntree = 300, mtry = 6,
                       importance = TRUE, replace = FALSE, na.action = na.roughfix)
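The error figures on this slide follow directly from the confusion matrix. The sketch below re-enters the matrix by hand (rows are actual classes, columns are predicted classes) and recomputes the per-class error and the overall OOB error rate. The variable names are chosen for illustration.

```r
classes <- c("Downstairs", "Jogging", "Sitting", "Standing", "Upstairs", "Walking")

# Confusion matrix copied from the slide: rows = actual, columns = predicted.
conf <- matrix(
  c(204,    7,   0,   1,  64,   97,
      6, 1117,   0,   0,   8,    7,
      0,    0, 209,   5,   1,    0,
      4,    0,   0, 177,   4,    0,
     48,   31,   1,   0, 276,   97,
     20,    1,   1,   1,  15, 1390),
  nrow = 6, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Per-class error: share of each actual class that was predicted as something else.
class_error <- 1 - diag(conf) / rowSums(conf)

# Overall error: share of all observations that fall off the diagonal.
oob_error <- 1 - sum(diag(conf)) / sum(conf)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

The recomputed values show where the errors concentrate: Downstairs and Upstairs are confused with each other and with Walking, while the other four classes are each misclassified less than 5% of the time.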
    16. Summary
          • Fusion of data science and domain knowledge enables big insights from data
          • The R language provides a platform to rapidly build prototypes and test ideas
          • Getting data insights is the outcome of an intense team effort among various stakeholders
    17. References
          • R Project: http://www.r-project.org
          • Activity Recognition Dataset: "The Impact of Personalization on Smartphone-Based Activity Recognition", Gary M. Weiss and Jeffrey W. Lockhart, Activity Context Representation: Techniques and Languages, AAAI Technical Report WS-12-05
          • "Activity and Gait Recognition with Time-Delay Embeddings", Jordan Frank, AAAI Conference on Artificial Intelligence, 2010
          • R wiki: http://rwiki.sciviews.org/doku.php
          • R graph gallery: http://addictedtor.free.fr/graphiques/thumbs.php
          • Kickstarting R: http://cran.r-project.org/doc/contrib/Lemon-kickstart/
          • Rattle, R Data Mining Tool: http://rattle.togaware.com/
          • Sensor Platforms: http://www.sensorplatforms.com/context-aware/
          • Movea: http://www.movea.com/
          • Alohar: https://www.alohar.com
