2. Part I: Predictive Model in Detail
Part II: Portfolio
Part III: Energy Efficiency in Building Systems
3. Part I: Building a Predictive Model
Human Activity Recognition using ‘randomForest’
4. Conceptually…
Steps in building a predictive model
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
Predictive Model in Detail
5. Problem
Human Activity Prediction Using Smartphones Data Set
Samsung Galaxy S II
30 volunteers wearing the smartphone on the waist
Six activities
WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS,
SITTING, STANDING, LAYING
Sensors
Accelerometer and Gyroscope
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
6. Dataset
UCI Machine Learning Repository
561-feature vector of time- and frequency-domain variables, augmented with “subject” and “activity” => 563 columns
3-axial linear acceleration
3-axial angular velocity
7. Duplicate Column Names
R Language
load(".samsungData.rda")
is.data.frame(samsungData)
# [1] TRUE
table(duplicated(names(samsungData))) # check for duplicate headers
# FALSE  TRUE
#   479    84
8. Duplicate Column Names
samsDF <- data.frame(samsungData)
is.data.frame(samsungData)
# [1]TRUE
table(duplicated(names(samsDF))) # checking for
duplicate headers
# FALSE
# 563
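R's data.frame() deduplicates the 84 repeated headers by appending numeric suffixes (via make.names(unique=TRUE)). The same idea can be sketched in Python with a hypothetical make_unique helper:

```python
def make_unique(names):
    """Append .1, .2, ... to repeated names, mimicking R's make.names(unique=TRUE)."""
    seen = {}
    out = []
    for n in names:
        if n in seen:
            seen[n] += 1
            out.append("%s.%d" % (n, seen[n]))
        else:
            seen[n] = 0
            out.append(n)
    return out

print(make_unique(["tBodyAcc.mean", "tBodyAcc.mean", "tBodyAcc.sd"]))
# ['tBodyAcc.mean', 'tBodyAcc.mean.1', 'tBodyAcc.sd']
```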
9. Column Types
table(sapply(samsDF, class))
# character integer numeric
# 1 1 561
which(sapply(samsDF, is.character))
# activity
# 563
which(sapply(samsDF, is.integer))
# subject
# 562
10. Missing Data & Finite Values
dim(samsDF)
# [1] 7352  563
table(complete.cases(samsDF))
# TRUE
# 7352
table(sapply(samsDF[,1:561], is.finite))
#    TRUE
# 4124472   # 7352*561 = 4124472
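The same completeness and finiteness checks can be sketched in NumPy; the small array here is only a stand-in for the 7352x561 feature matrix:

```python
import numpy as np

# Toy stand-in for the feature matrix; one entry deliberately missing
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])

complete_rows = np.isfinite(X).all(axis=1)   # analogue of complete.cases()
print(int(complete_rows.sum()))              # 2 complete rows
print(int(np.isfinite(X).sum()))             # 5 finite entries out of 6
```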
11. Balanced Data
table(samsDF$activity)
# laying sitting standing walk walkdown walkup
# 1407 1286 1374 1226 986 1073
sum(table(samsDF$activity))
# [1] 7352
round( table(samsDF$activity)/nrow(samsDF), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
12. Splitting Data
library(caTools)
# Randomly split the data into training and testing sets
set.seed(1000)
split = sample.split(samsDF$activity, SplitRatio = 0.7)
# Split up the data using subset
train = subset(samsDF, split==TRUE)
dim(train)
# [1] 5146 563
round( table(train$activity)/nrow(train), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
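sample.split keeps the class proportions of activity identical in the training and test sets. A stdlib Python sketch of such a stratified split (hypothetical helper name, toy labels):

```python
import random
from collections import defaultdict

def stratified_split(labels, ratio=0.7, seed=1000):
    """Return a mask: True marks a training row; each class is split at `ratio`."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lab in enumerate(labels):
        by_label[lab].append(i)
    mask = [False] * len(labels)
    for idx in by_label.values():
        rng.shuffle(idx)
        for i in idx[:round(len(idx) * ratio)]:
            mask[i] = True
    return mask

labels = ["walk"] * 10 + ["sit"] * 10
mask = stratified_split(labels)
print(sum(mask))                          # 14 training rows
print(sum(mask[:10]), sum(mask[10:]))     # 7 7 (proportions preserved)
```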
13. Test Data
test = subset(samsDF, split==FALSE)
dim(test)
# [1] 2206  563
round( table(test$activity)/nrow(test), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
15. Determining ntree
fit <- randomForest(as.factor(activity) ~ ., data=trainF,
                    importance=TRUE, ntree=500, do.trace=T)
ntree = 293 # the OOB error trace flattens out here
Initial Results:
Prediction <- predict(fit, test[1:561])
library(caret)
confusionMatrix(Prediction, test[,563])
# Accuracy : 0.9782
# 95% CI : (0.9713, 0.9839)
16. Determining mtry
# mtry: number of variables randomly sampled as split candidates at each node
mtry <- tuneRF(trainF[-562], as.factor(trainF$activity), ntreeTry=200,
               stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE)
bestm <- mtry[mtry[, 2] == min(mtry[, 2]), 1] # value with the lowest OOB error
bestm
# [1] 11
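scikit-learn has no tuneRF, but the same search can be sketched by comparing out-of-bag error across candidate max_features values (sklearn's analogue of mtry). Synthetic data stands in for the Samsung set; the numbers are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 3 classes, 20 features
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

best = None
for m in (2, 4, 6, 8):                     # candidate mtry values
    rf = RandomForestClassifier(n_estimators=200, max_features=m,
                                oob_score=True, random_state=0).fit(X, y)
    oob_err = 1.0 - rf.oob_score_          # OOB error, tuneRF's selection criterion
    if best is None or oob_err < best[1]:
        best = (m, oob_err)

print("best max_features:", best[0])
```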
17. Building & Testing the Model
fitF <- randomForest(as.factor(activity) ~ ., data=trainF,
                     importance=TRUE, ntree=293, mtry=bestm, do.trace=T)
PredictionF <- predict(fitF, test[1:561])
library(caret)
confusionMatrix(PredictionF, test[,563])
# Accuracy : 0.9805
# 95% CI : (0.9738, 0.9859)
Reduction in Error = (0.9805 - 0.9782)/(1 - 0.9782) = 0.1055
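The relative error reduction quoted above is the drop in error rate divided by the baseline error; checking the arithmetic:

```python
base_acc, tuned_acc = 0.9782, 0.9805       # accuracies before and after tuning mtry
base_err = 1 - base_acc                    # 0.0218
reduction = (tuned_acc - base_acc) / base_err
print(round(reduction, 4))                 # 0.1055
```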
18. AUC
library(pROC)
ROC1 <- multiclass.roc( test$activity, as.numeric(PredictionF))
auc(ROC1)
# Multi-class area under the curve: 0.9953
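Note that multiclass.roc above is fed hard class labels coerced to numeric; AUC is more commonly computed from class probabilities. A hedged scikit-learn sketch on synthetic data (illustrative, not the Samsung results):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(Xtr, ytr)
# One-vs-rest multi-class AUC from predicted probabilities
auc = roc_auc_score(yte, rf.predict_proba(Xte), multi_class="ovr")
print(round(auc, 3))
```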
Decision Tree by Hand: http://bit.ly/DTree123
21. Problem
Analyze an Apache weblog and provide:
EpochTime (date and time the request was processed by the server)
IP Address
Latitude, Longitude
URI
Referer
MapReduce: ApacheWeblog
http://bit.ly/oFraud123
22. Combined Weblog Format
"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
(%h) - IP address of the client (remote host)
(%l) - the "hyphen" indicates the information is missing
(%u) - the "userid" of the person requesting
(%t) - time of the request
…
…
Source: https://httpd.apache.org/docs/1.3/logs.html
23. Knowing your customers through Apache Logs
198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" 200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"
Annotated fields: IP address, Date & Time, URI, Referer
24. Challenges
Weblog needs to be parsed to extract the required information
Time is not expressed in “EpochTime”
Latitude and Longitude are not readily available
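The parsing step can be sketched with a regular expression over the combined log format; this is a simplified pattern for illustration, and real logs have more edge cases:

```python
import re

# Simplified pattern for the combined log format
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"')

line = ('198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] '
        '"GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" '
        '200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0"')

m = LOG_RE.match(line)
print(m.group("host"))                    # 198.0.200.105
print(m.group("request").split()[1])      # the requested URI
print(m.group("time"))                    # 14/Jan/2014:09:36:51 -0800
```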
27. EpochTime
import calendar
import time

def convert_time(d, utc):
    # d = "14/Jan/2014:09:36:50"
    # utc = '-0800'
    fmt = '%d/%b/%Y:%H:%M:%S'
    utci = int(utc)
    # parse the string given the format, treated as UTC; converts to sec
    epot = calendar.timegm(time.strptime(d, fmt))
    # offset in hours: leading digits are hours, trailing two are minutes
    epod = (abs(utci) // 100) + (abs(utci) % 100) / 60.0
    if utci < 0:
        epf = epot + epod * 3600  # local time is behind UTC: add the offset
    else:
        epf = epot - epod * 3600  # local time is ahead of UTC: subtract it
    return int(epf)
28. Latitude and Longitude
Geolite2 from MaxMind
geolite2.lookup(<IP address>)
Reducer
http://bit.ly/ApaMapper
Source: https://www.maxmind.com/en/home
29. Mapper
#!/usr/bin/env python
import sys

# Identity mapper: pass every line from stdin through to stdout
for line in sys.stdin:
    print(line.strip())
http://bit.ly/ApaMapper
41. Language Learning over a Chat Session
Streaming Data: Speech
https://angel.co/freeaccent
42. Problem
Learn a foreign language from a native speaker
Student and Tutor are separated
Use a computing device and the internet
43. Challenges
Collect the speech data off the web
Record: start record, stop record
Upload
Preprocessing speech data
End point detection
Noise
Extracting Accent Score frame by frame
Populating on the web page on demand
44. Technology Stack
Collect the speech data off the web
Html5, JavaScript, PHP
Preprocessing speech data
Energy based algo, MFCC
Extracting Accent Score frame by frame
Proprietary algo
Populating on the web page on demand
AJAX
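The energy-based preprocessing step can be sketched as framewise short-time energy with a threshold; this is an illustrative toy with made-up parameters, not the production algorithm:

```python
import numpy as np

def energy_endpoints(signal, frame_len=256, threshold_ratio=0.1):
    """Return (start, end) sample indices of speech-like activity, or None.

    Frames whose short-time energy exceeds threshold_ratio * max energy
    count as speech; leading and trailing silence is trimmed away.
    """
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energy = (frames ** 2).sum(axis=1)
    active = np.where(energy > threshold_ratio * energy.max())[0]
    if active.size == 0:
        return None
    return int(active[0]) * frame_len, int(active[-1] + 1) * frame_len

# Silence, then a tone, then silence again
sig = np.concatenate([np.zeros(512), np.sin(np.linspace(0, 100, 1024)), np.zeros(512)])
print(energy_endpoints(sig))   # (512, 1536)
```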
53. Challenges
Saturation at Initialization
Known solutions:
Small initial weights
Hyperbolic Tangent function instead of Sigmoid
Other challenges relating to speech
processing
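Why small initial weights and tanh help can be seen from the activation slopes; a toy illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Large initial weights push pre-activations far from 0: the unit saturates
# and its gradient all but vanishes.
print(sigmoid_grad(10.0))          # ~4.5e-05
# Small initial weights keep pre-activations near 0, where the slope peaks.
print(sigmoid_grad(0.0))           # 0.25
# tanh also peaks at 0, but with slope 1.0 instead of 0.25.
print(1.0 - np.tanh(0.0) ** 2)     # 1.0
```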
ANN
http://bit.ly/my_pubs
59. Problem
ReplaceTurbidity meter
Piezo-electric transducer to detect water-
sludge interface
Measure water depth in a final stage settling
tank
Water-Sludge interface Detection
64. Data Visualization
Average of the reverberated signal by the pulse period
[Plot annotations: Leakage; Bottom of the Tank; Reverberation at 3.68 ms]
65. Computing the Water Depth
Speed of sound in water ~1.5x10^3 m/s
Round trip: 1.5x10^3 m/s x 3.68 ms = 5.52 m
Depth of water (half the round trip) = 2.76 m ≈ 9.06 ft
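The depth calculation in code (the speed of sound in water is an approximation; the pulse travels down to the bottom and back, hence the division by two):

```python
speed_of_sound = 1.5e3                    # m/s in water, approximate
round_trip_time = 3.68e-3                 # s, from the reverberation plot
path = speed_of_sound * round_trip_time   # 5.52 m, down and back
depth_m = path / 2                        # 2.76 m
depth_ft = depth_m / 0.3048               # ~9.06 ft
print(round(path, 2), round(depth_m, 2), round(depth_ft, 2))
```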
66. Impact
Proof of concept was successful
Won a contract to develop an instrument
71. Sun Shades
Energy Efficiency in Building Systems
UC Davis West Village is the largest planned “zero net energy” community
72. Net Zero Homes
New homes to be net-zero energy by 2020
California Public Utilities Commission (CPUC) and
California Energy Commission (CEC)
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020