2. Part I: Predictive Model in Detail
Part II: Portfolio
Part III: Energy Efficiency in Building Systems
3. Part I: Building a Predictive Model
Human Activity Recognition using ‘randomForest’
4. Conceptually…
Steps in building a predictive model
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
Predictive Model in Detail
5. Problem
Human Activity Prediction Using Smartphones Data Set
Samsung Galaxy S II
30 volunteers wearing the smartphone on the waist
Six activities
WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS,
SITTING, STANDING, LAYING
Sensors
Accelerometer and Gyroscope
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
6. Dataset
UCI Machine Learning Repository
561-feature vector of time- and frequency-domain variables, augmented with “subject” and “activity” => 563 columns
3-axial linear acceleration
3-axial angular velocity
7. Duplicate Column Names
R Language
load(".samsungData.rda")
is.data.frame(samsungData)
# [1] TRUE
table(duplicated(names(samsungData))) # check for duplicate headers
# FALSE  TRUE
#   479    84
8. Duplicate Column Names
samsDF <- data.frame(samsungData)
is.data.frame(samsungData)
# [1]TRUE
table(duplicated(names(samsDF))) # checking for
duplicate headers
# FALSE
# 563
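R's data.frame() deduplicates the 84 repeated headers by appending numeric suffixes (via make.names(unique=TRUE)). The same idea can be sketched in Python with a hypothetical make_unique helper:

```python
def make_unique(names):
    """Append .1, .2, ... to repeated names, mimicking R's make.names(unique=TRUE)."""
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append("%s.%d" % (n, seen[n]))
        else:
            seen[n] = 0
            out.append(n)
    return out

print(make_unique(["tBodyAcc.mean", "tBodyAcc.mean", "tBodyAcc.sd"]))
# ['tBodyAcc.mean', 'tBodyAcc.mean.1', 'tBodyAcc.sd']
```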
9. Column Types
table(sapply(samsDF, class))
# character integer numeric
# 1 1 561
which(sapply(samsDF, is.character))
# activity
# 563
which(sapply(samsDF, is.integer))
# subject
# 562
10. Missing Data & Finite Values
dim(samsDF)
# [1] 7352  563
table(complete.cases(samsDF))
# TRUE
# 7352
table(sapply(samsDF[,1:561], is.finite))
#    TRUE
# 4124472   # 7352*561 = 4124472
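The same completeness and finiteness checks can be sketched in NumPy; the small array here is only a stand-in for the 7352x561 feature matrix:

```python
import numpy as np

# Toy stand-in for the feature matrix; one entry deliberately missing
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

complete_rows = np.isfinite(X).all(axis=1)   # analogue of complete.cases()
print(int(complete_rows.sum()))              # 2 complete rows
print(int(np.isfinite(X).sum()))             # 5 finite entries out of 6
```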
11. Balanced Data
table(samsDF$activity)
# laying sitting standing walk walkdown walkup
# 1407 1286 1374 1226 986 1073
sum(table(samsDF$activity))
# [1] 7352
round( table(samsDF$activity)/nrow(samsDF), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
12. Splitting Data
library(caTools)
# Randomly split the data into training and testing sets
set.seed(1000)
split = sample.split(samsDF$activity, SplitRatio = 0.7)
# Split up the data using subset
train = subset(samsDF, split==TRUE)
dim(train)
# [1] 5146 563
round( table(train$activity)/nrow(train), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
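sample.split keeps the class proportions of activity identical in the training and test sets. A stdlib Python sketch of such a stratified split (hypothetical helper name, toy labels):

```python
import random
from collections import defaultdict

def stratified_split(labels, ratio=0.7, seed=1000):
    """Return a mask: True marks a training row; each class is split at `ratio`."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    mask = [False] * len(labels)
    for idx in by_label.values():
        rng.shuffle(idx)
        for i in idx[:round(len(idx) * ratio)]:
            mask[i] = True
    return mask

labels = ["walk"] * 10 + ["sit"] * 10
mask = stratified_split(labels)
print(sum(mask))                          # 14 training rows
print(sum(mask[:10]), sum(mask[10:]))     # 7 7 (proportions preserved)
```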
13. Test Data
test = subset(samsDF, split==FALSE)
dim(test)
# [1] 2206  563
round( table(test$activity)/nrow(test), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
15. Determining ntree
fit <- randomForest(as.factor(activity) ~ ., data=trainF,
                    importance=TRUE, ntree=500, do.trace=T)
ntree = 293 # the OOB error trace flattens out here
Initial Results:
Prediction <- predict(fit, test[1:561])
library(caret)
confusionMatrix(Prediction, test[,563])
# Accuracy : 0.9782
# 95% CI : (0.9713, 0.9839)
16. Determining mtry
# mtry: number of variables randomly sampled as split candidates at each node
mtry <- tuneRF(trainF[-562], as.factor(trainF$activity), ntreeTry=200,
               stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE)
bestm <- mtry[mtry[, 2] == min(mtry[, 2]), 1] # value with the lowest OOB error
bestm
# [1] 11
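scikit-learn has no tuneRF, but the same search can be sketched by comparing out-of-bag error across candidate max_features values (sklearn's analogue of mtry). Synthetic data stands in for the Samsung set; the numbers are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 3 classes, 20 features
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

best = None
for m in (2, 4, 6, 8):                     # candidate mtry values
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0).fit(X, y)
    oob_err = 1.0 - rf.oob_score_          # OOB error, tuneRF's selection criterion
    if best is None or oob_err < best[1]:
        best = (m, oob_err)

print("best max_features:", best[0])
```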
17. Building & Testing the Model
fitF <- randomForest(as.factor(activity) ~ ., data=trainF,
                     importance=TRUE, ntree=293, mtry=bestm, do.trace=T)
PredictionF <- predict(fitF, test[1:561])
library(caret)
confusionMatrix(PredictionF, test[,563])
# Accuracy : 0.9805
# 95% CI : (0.9738, 0.9859)
Reduction in Error = (0.9805 - 0.9782)/(1 - 0.9782) = 0.1055
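The relative error reduction quoted above is the drop in error rate divided by the baseline error; checking the arithmetic:

```python
base_acc, tuned_acc = 0.9782, 0.9805       # accuracies before and after tuning mtry
base_err = 1 - base_acc                    # 0.0218
reduction = (tuned_acc - base_acc) / base_err
print(round(reduction, 4))                 # 0.1055
```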
18. AUC
library(pROC)
ROC1 <- multiclass.roc( test$activity, as.numeric(PredictionF))
auc(ROC1)
# Multi-class area under the curve: 0.9953
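Note that multiclass.roc above is fed hard class labels coerced to numeric; AUC is more commonly computed from class probabilities. A hedged scikit-learn sketch on synthetic data (illustrative, not the Samsung results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
# One-vs-rest multi-class AUC from predicted probabilities
auc = roc_auc_score(yte, rf.predict_proba(Xte), multi_class="ovr")
print(round(auc, 3))
```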
Decision Tree by Hand: http://bit.ly/DTree123
21. Problem
Analyze an Apache weblog and provide:
EpochTime (date and time the request was processed by the server)
IP Address
Latitude, Longitude
URI
Referer
MapReduce: ApacheWeblog
http://bit.ly/oFraud123
22. Combined Weblog Format
"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
(%h) - IP address of the client (remote host)
(%l) - the "hyphen" indicates the information is missing
(%u) - the "userid" of the person requesting
(%t) - time of the request
…
…
Source: https://httpd.apache.org/docs/1.3/logs.html
23. Knowing your customers through Apache Logs
198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" 200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
Annotated fields: IP address, Date & Time, URI, Referer
24. Challenges
Weblog needs to be parsed to extract the required information
Time is not expressed in “EpochTime”
Latitude and Longitude are not readily available
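The parsing step can be sketched with a regular expression over the combined log format; this is a simplified pattern for illustration, and real logs have more edge cases:

```python
import re

# Simplified pattern for the combined log format
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

line = ('198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] '
        '"GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" '
        '200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0"')

m = LOG_RE.match(line)
print(m.group("host"))                    # 198.0.200.105
print(m.group("request").split()[1])      # the requested URI
print(m.group("time"))                    # 14/Jan/2014:09:36:51 -0800
```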
27. EpochTime
import calendar
import time

def convert_time(d, utc):
    # d = "14/Jan/2014:09:36:50"
    # utc = '-0800'
    fmt = '%d/%b/%Y:%H:%M:%S'
    utci = int(utc)
    # parse the string given the format, treated as UTC; converts to sec
    epot = calendar.timegm(time.strptime(d, fmt))
    # offset in hours: leading digits are hours, trailing two are minutes
    epod = (abs(utci) // 100) + (abs(utci) % 100) / 60.0
    if utci < 0:
        epf = epot + epod * 3600  # local time is behind UTC: add the offset
    else:
        epf = epot - epod * 3600  # local time is ahead of UTC: subtract it
    return int(epf)
28. Latitude and Longitude
Geolite2 from MaxMind
geolite2.lookup(<IP address>)
Reducer
http://bit.ly/ApaMapper
Source: https://www.maxmind.com/en/home
29. Mapper
#!/usr/bin/env python
import sys

# Identity mapper: pass every line from stdin through to stdout
for line in sys.stdin:
    print(line.strip())
http://bit.ly/ApaMapper
41. Language Learning over a Chat Session
Streaming Data: Speech
https://angel.co/freeaccent
42. Problem
Learn a foreign language from a native speaker
Student and Tutor are separated
Use a computing device and the internet
43. Challenges
Collect the speech data off the web
Record: start record, stop record
Upload
Preprocessing speech data
End point detection
Noise
Extracting Accent Score frame by frame
Populating on the web page on demand
44. Technology Stack
Collect the speech data off the web
Html5, JavaScript, PHP
Preprocessing speech data
Energy based algo, MFCC
Extracting Accent Score frame by frame
Proprietary algo
Populating on the web page on demand
AJAX
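The energy-based preprocessing step can be sketched as framewise short-time energy with a threshold; this is an illustrative toy with made-up parameters, not the production algorithm:

```python
import numpy as np

def energy_endpoints(signal, frame_len=256, threshold_ratio=0.1):
    """Return (start, end) sample indices of speech-like activity, or None.

    Frames whose short-time energy exceeds threshold_ratio * max energy
    count as speech; leading and trailing silence is trimmed away.
    """
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    active = np.where(energy > threshold_ratio * energy.max())[0]
    if active.size == 0:
        return None
    return int(active[0]) * frame_len, int(active[-1] + 1) * frame_len

# Silence, then a tone, then silence again
sig = np.concatenate([np.zeros(512), np.sin(np.linspace(0, 100, 1024)), np.zeros(512)])
print(energy_endpoints(sig))   # (512, 1536)
```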
53. Challenges
Saturation at Initialization
Known solutions:
Small initial weights
Hyperbolic Tangent function instead of Sigmoid
Other challenges relating to speech
processing
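Why small initial weights and tanh help can be seen from the activation slopes; a toy illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Large initial weights push pre-activations far from 0: the unit saturates
# and its gradient all but vanishes.
print(sigmoid_grad(10.0))          # ~4.5e-05
# Small initial weights keep pre-activations near 0, where the slope peaks.
print(sigmoid_grad(0.0))           # 0.25
# tanh also peaks at 0, but with slope 1.0 instead of 0.25.
print(1.0 - np.tanh(0.0) ** 2)     # 1.0
```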
ANN
http://bit.ly/my_pubs
59. Problem
ReplaceTurbidity meter
Piezo-electric transducer to detect water-
sludge interface
Measure water depth in a final stage settling
tank
Water-Sludge interface Detection
64. Data Visualization
Average of the reverberated signal by the pulse period
[Plot annotations: Leakage; Bottom of the Tank; Reverberation at 3.68 ms]
65. Computing the Water Depth
Speed of sound in water ~1.5x10^3 m/s
Round trip: 1.5x10^3 m/s x 3.68 ms = 5.52 m
Depth of water (half the round trip) = 2.76 m ≈ 9.06 ft
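The depth calculation in code (the speed of sound in water is an approximation; the pulse travels down to the bottom and back, hence the division by two):

```python
speed_of_sound = 1.5e3                    # m/s in water, approximate
round_trip_time = 3.68e-3                 # s, from the reverberation plot
path = speed_of_sound * round_trip_time   # 5.52 m, down and back
depth_m = path / 2                        # 2.76 m
depth_ft = depth_m / 0.3048               # ~9.06 ft
print(round(path, 2), round(depth_m, 2), round(depth_ft, 2))
```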
66. Impact
Proof of concept was successful
Won a contract to develop an instrument
71. Sun Shades
Energy Efficiency in Building Systems
UC Davis West Village is the largest planned “zero net energy” community
72. Net Zero Homes
New homes to be net-zero energy by 2020
California Public Utilities Commission (CPUC) and
California Energy Commission (CEC)
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020