PROJECTS RELATING TO
DATA SCIENCE
Email: sashs at gmx dot com
 Part I
 Predictive Model in Detail
 Part II
 Portfolio
 Part III
 Energy Efficiency in Building Systems
Part I: Building of a Predictive Model
 Human Activity Recognition using
‘RandomForest’
Conceptually…
 Steps in building a predictive model
1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
Predictive Model in Detail
Problem
 Human Activity Prediction Using
Smartphones Data Set
 Samsung Galaxy S II
 30 volunteers wearing the smartphone on their waist
 Six activities
 WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS,
SITTING, STANDING, LAYING
 Sensors
 Accelerometer and Gyroscope
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Dataset
 UCI Machine Learning Repository
 561-feature vector with time and frequency
domain variables, augmented with “subject” and
“activity” => 563
 3-axial linear acceleration
 3-axial angular velocity
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Duplicate Column Names
R Language
load(".samsungData.rda")
is.data.frame(samsungData)
# [1]TRUE
table(duplicated(names(samsungData))) # checking for duplicate headers
# FALSE TRUE
# 479 84
Duplicate Column Names
samsDF <- data.frame(samsungData) # check.names = TRUE (the default) makes column names unique
is.data.frame(samsDF)
# [1] TRUE
table(duplicated(names(samsDF))) # checking for duplicate headers
# FALSE
# 563
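R's data.frame() repairs column names by default, which is why the duplicate count drops to zero on this slide. The Python sketch below illustrates the general idea of making repeated names unique by appending a numeric suffix. It is a simplified illustration and not R's exact algorithm; the function name make_unique and the sample column names are hypothetical.

```python
def make_unique(names):
    """Return a copy of names where repeats get a '.1', '.2', ... suffix."""
    seen = set()      # names already emitted
    counts = {}       # how many suffixes each base name has consumed
    result = []
    for name in names:
        candidate = name
        # Keep bumping the suffix until the candidate is unused.
        while candidate in seen:
            counts[name] = counts.get(name, 0) + 1
            candidate = f"{name}.{counts[name]}"
        seen.add(candidate)
        result.append(candidate)
    return result


# Hypothetical header with a name that occurs three times.
columns = ["energy", "energy", "mean", "energy"]
unique_columns = make_unique(columns)
print(unique_columns)
```

After the repair, every header is distinct, so a check for duplicated names reports none.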
Column Types
table(sapply(samsDF, class))
# character integer numeric
# 1 1 561
which(sapply(samsDF, is.character))
# activity
# 563
which(sapply(samsDF, is.integer))
# subject
# 562
Missing Data & Finite Values
dim(samsDF)
# [1] 7352 563
table(complete.cases(samsDF))
#TRUE
# 7352
table(sapply(samsDF[,1:561], is.finite))
#TRUE
# 4124472 #7352*561 = 4124472
Balanced Data
table(samsDF$activity)
# laying sitting standing walk walkdown walkup
# 1407 1286 1374 1226 986 1073
sum(table(samsDF$activity))
# [1] 7352
round( table(samsDF$activity)/nrow(samsDF), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
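The class proportions on this slide can be re-derived from the reported counts. The Python sketch below repeats the arithmetic using only the counts shown above; the variable names are illustrative.

```python
# Activity counts as reported on the slide.
counts = {
    "laying": 1407,
    "sitting": 1286,
    "standing": 1374,
    "walk": 1226,
    "walkdown": 986,
    "walkup": 1073,
}

total = sum(counts.values())

# Share of each activity, rounded to two decimals as in the R output.
proportions = {name: round(n / total, 2) for name, n in counts.items()}

for name, share in proportions.items():
    print(f"{name:9s} {share:.2f}")
```

No class falls below 13% or rises above 19%, which is the sense in which the data set is treated as reasonably balanced.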
Splitting Data
library(caTools)
# Randomly split the data into training and testing sets
set.seed(1000)
split = sample.split(samsDF$activity, SplitRatio = 0.7)
# Split up the data using subset
train = subset(samsDF, split==TRUE)
dim(train)
# [1] 5146 563
round( table(train$activity)/nrow(train), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
Test Data
test = subset(samsDF, split==FALSE)
dim(test)
# [1] 2206 563
round( table(test$activity)/nrow(test), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
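The train and test sets show the same class proportions because the split is done per class. The Python sketch below illustrates a stratified 70/30 split on labels rebuilt from the counts reported earlier. It is not the caTools implementation, and the same seed will not reproduce the same rows; the function name stratified_split is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratio, seed):
    """Return (train_indices, test_indices), keeping `ratio` of each class in train."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx = []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within the class
        n_train = round(ratio * len(indices))     # per-class training size
        train_idx.extend(indices[:n_train])

    train_set = set(train_idx)
    test_idx = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train_idx), test_idx


# Labels rebuilt from the class counts reported in the deck.
class_counts = {"laying": 1407, "sitting": 1286, "standing": 1374,
                "walk": 1226, "walkdown": 986, "walkup": 1073}
labels = [name for name, n in class_counts.items() for _ in range(n)]

train_idx, test_idx = stratified_split(labels, ratio=0.7, seed=1000)
print(len(train_idx), len(test_idx))
```

With these counts, rounding 70% of each class gives set sizes that agree with the 5146 and 2206 rows reported above.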
Random Forest
library(randomForest)
set.seed(415)
trainF = train
trainF[562] = NULL
dim(trainF)
# [1] 5146 562
Determining ntree
fit <- randomForest(as.factor(activity) ~ ., data=trainF,
importance=TRUE, ntree=500, do.trace=T)
ntree = 293
Initial Results:
Prediction <- predict(fit, test[1:561])
library(caret)
confusionMatrix(Prediction, as.factor(test[,563])) # reference must be a factor
# Accuracy : 0.9782
# 95% CI : (0.9713, 0.9839)
Determining mtry
# mtry: number of variables randomly sampled as candidates at each split; tuneRF searches for the value with the lowest OOB error
mtry <- tuneRF(trainF[-562], as.factor(trainF$activity), ntreeTry=200,
stepFactor=1.5,improve=0.01, trace=TRUE, plot=TRUE)
bestm <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
bestm
# [1] 11
Building & testing the Model
fitF <- randomForest(as.factor(activity) ~ ., data=trainF,
importance=TRUE, ntree=293, mtry=bestm, do.trace=T)
PredictionF <- predict(fitF, test[1:561])
library(caret)
confusionMatrix(PredictionF, as.factor(test[,563])) # reference must be a factor
# Accuracy : 0.9805
# 95% CI : (0.9738, 0.9859)
Reduction in Error = (0.9805 - 0.9782)/(1 - 0.9782) = 0.1055
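The reduction in error compares the error rates (1 - accuracy) of the two models. The Python sketch below repeats that arithmetic with the two accuracies reported above; the function name relative_error_reduction is illustrative.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the original error removed by the tuned model."""
    error_before = 1.0 - acc_before
    error_after = 1.0 - acc_after
    return (error_before - error_after) / error_before


# Accuracies reported for the initial and the tuned random forest.
reduction = relative_error_reduction(0.9782, 0.9805)
print(round(reduction, 4))
```

Expressed this way, a gain of 0.23 accuracy points corresponds to removing roughly a tenth of the remaining misclassifications.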
AUC
library(pROC)
ROC1 <- multiclass.roc( test$activity, as.numeric(PredictionF))
auc(ROC1)
# Multi-class area under the curve: 0.9953
DecisionTree by Hand: http://bit.ly/DTree123
Part II: Portfolio
 MapReduce: Apache Weblog
 Visualization: LTV
 Streaming Data Analysis: Speech
 Artificial Neural Network (ANN)
 Water-Sludge interface Detection
MapReduce: Apache Weblog
Source: https://www.maxmind.com/en/home
Problem
 Analyze Apache weblog and provide:
 EpochTime (date and time the request was
processed by the server)
 IP Address
 Latitude, Longitude
 URI
 Referer
http://bit.ly/oFraud123
Combined Weblog Format
 "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
 (%h) - IP address of the client (remote host)
 (%l) - identity of the client; a hyphen ("-") indicates missing information
 (%u) - the userid of the person requesting the document
 (%t) - time the request was received
 …
 …
Source: https://httpd.apache.org/docs/1.3/logs.html
Knowing your customers
through Apache Logs
 198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET
/svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1"
200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0
(Macintosh; Intel MacOS X 10_9_1)AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/31.0.1650.63 Safari/537.36“
Fields of interest: IP address, Date & Time, URI, Referer
Challenges
 Weblog needs to be parsed to extract the
required information
 Time is not expressed in “EpochTime”
 Latitude and Longitude are not readily
available
Regular Expression & Testing
(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] "(\S+) /(.*?) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)
https://regex101.com/
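The regular expression can be exercised directly in Python against the sample log line shown earlier. The sketch below assumes the backslash escapes (\S, \d, \[, \]) that the slide extraction dropped; the names LOG_PATTERN and SAMPLE are illustrative.

```python
import re

# Pattern from the slide, with the escapes restored.
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] '
    r'"(\S+) /(.*?) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)'
)

# Sample entry from the deck, written as a single log line.
SAMPLE = (
    '198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] '
    '"GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" '
    '200 8768 "http://www.svds.com/rockandroll/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
)

match = LOG_PATTERN.match(SAMPLE)
(ip, ident, user, timestamp, utc_offset, method,
 uri, protocol, status, size, referer, agent) = match.groups()

print(ip, timestamp, utc_offset, uri, referer, sep="\n")
```

The twelve capture groups map one-to-one onto the fields of the combined log format, so the IP address, timestamp, UTC offset, URI, and referer can be picked out by position.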
RegEx Groups
https://regex101.com/
EpochTime
import calendar
import time

def convert_time(d, utc):
    # d = "14/Jan/2014:09:36:50"
    # utc = '-0800'
    fmt = '%d/%b/%Y:%H:%M:%S'
    utci = int(utc)
    # Parse the string and read it as UTC so the result does not depend on the server's local time zone
    epot = calendar.timegm(time.strptime(d, fmt))
    # Offset in hours: minutes converted to hours + whole hours
    epod = (abs(utci) % 100) / 60.0 + (abs(utci) // 100)
    if utci < 0:
        epf = epot + epod * 3600  # west of UTC: UTC is later than local time
    else:
        epf = epot - epod * 3600  # east of UTC: UTC is earlier than local time
    return int(epf)
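As an alternative sketch, Python 3's datetime can parse the UTC offset itself through the %z directive, which avoids the manual offset arithmetic. This assumes Python 3 and English month abbreviations in the active locale; the function name to_epoch is hypothetical.

```python
from datetime import datetime


def to_epoch(timestamp, utc_offset):
    """Convert an Apache timestamp plus its UTC offset to epoch seconds."""
    # %z parses offsets such as -0800 and yields a timezone-aware datetime.
    parsed = datetime.strptime(
        f"{timestamp} {utc_offset}", "%d/%b/%Y:%H:%M:%S %z"
    )
    return int(parsed.timestamp())


# Timestamp and offset taken from the sample log entry.
epoch = to_epoch("14/Jan/2014:09:36:51", "-0800")
print(epoch)
```

Because the parsed value is timezone-aware, the result does not depend on the time zone of the machine running the job.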
Latitude and Longitude
 Geolite2 from MaxMind
 geolite2.lookup(<IP address>)
 Reducer
 http://bit.ly/ApaMapper
Source: https://www.maxmind.com/en/home
Mapper
#!/usr/bin/env python
import sys

# Iterate through every line passed in on stdin
for line in sys.stdin:
    value = line.strip()
    print(value)
http://bit.ly/ApaMapper
Hadoop
hadoop jar path/to/hadoop-streaming-0.20.203.0.jar \
  -mapper path/to/mapper.py \
  -reducer path/to/reducer.py \
  -input path/to/input/* \
  -output path/to/output
Sample Output
http://bit.ly/oFraud123
Impact
Helps to detect online fraud and locate online visitors
Visualization: LTV
Background
 Gamers sign up each day and become part of
a cohort
 LTV is computed for up to 30 days
Problem
 Use Tableau to:
 Compute LTV
 Compute weighted LTV
Challenges
 Tableau is relatively new
 LTV computation was not readily available
 Given dataset is irregular:
Computed LTV
Weighted LTV
Impact
Customer LTV > Cost of Customer Acquisition (CAC)
 CAC
 $10 engagement -> 5 new users -> these users
acquire 15 more users at no cost
 CAC = $10/(5+15) = $0.50
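The CAC arithmetic and the LTV > CAC criterion can be written out as a short Python sketch. The spend and user counts come from the slide; the LTV value used in the comparison is a made-up placeholder, and the function names are hypothetical.

```python
def customer_acquisition_cost(spend, paid_users, referred_users):
    """Spend divided by all users gained, including those referred at no cost."""
    return spend / (paid_users + referred_users)


def is_viable(ltv, cac):
    """The criterion from the slide: lifetime value must exceed acquisition cost."""
    return ltv > cac


# $10 engagement -> 5 new users -> 15 more users at no cost.
cac = customer_acquisition_cost(10.0, paid_users=5, referred_users=15)
print(f"CAC = ${cac:.2f}")

# Placeholder LTV, for illustration only.
example_ltv = 1.25
print(is_viable(example_ltv, cac))
```

Counting referred users in the denominator is what pulls the cost per user down from $2.00 to $0.50 in this example.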
Streaming Data: Speech
https://angel.co/freeaccent
Language Learning over a
Chat session
Problem
 Learn a foreign language from a native
speaker
 Student and Tutor are separated
 Use computing device and internet
Challenges
 Collect the speech data off the web
 Record: start record, stop record
 Upload
 Preprocessing speech data
 End point detection
 Noise
 Extracting Accent Score frame by frame
 Populating on the web page on demand
Technology Stack
 Collect the speech data off the web
 HTML5, JavaScript, PHP
 Preprocessing speech data
 Energy-based algorithm, MFCC
 Extracting Accent Score frame by frame
 Proprietary algorithm
 Populating on the web page on demand
 AJAX
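The deck does not show the energy-based algorithm used for end point detection, so the Python sketch below only illustrates the general idea: split the signal into frames, compute each frame's energy, and take the first and last frames above a threshold as the speech boundaries. The frame size, threshold, function names, and synthetic signal are all assumptions made for illustration.

```python
def frame_energies(samples, frame_size):
    """Sum of squared samples for each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_size])
        for i in range(0, len(samples) - frame_size + 1, frame_size)
    ]


def detect_endpoints(samples, frame_size, threshold):
    """Return (start, end) sample indices of the active region, or None."""
    energies = frame_energies(samples, frame_size)
    active = [i for i, energy in enumerate(energies) if energy > threshold]
    if not active:
        return None
    start = active[0] * frame_size
    end = (active[-1] + 1) * frame_size
    return start, end


# Synthetic signal: silence, a burst of "speech", then silence again.
signal = [0.0] * 40 + [0.5, -0.5] * 30 + [0.0] * 40
endpoints = detect_endpoints(signal, frame_size=10, threshold=0.1)
print(endpoints)
```

A fixed threshold is the simplest choice; real recordings with background noise would typically need a threshold adapted to the noise level.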
User Interface
Impact
 Measurement tool
 Motivational: helps to set goal
 Customer retention
Artificial Neural Network (ANN)
McCulloch Pitts (MP) Neuron
Source: https://appliedgo.net/perceptron/
Diagram of the MP neuron
Equation of the MP neuron
Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
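The neuron's equation appears on the slide as an image and is not part of this transcript. The Python sketch below therefore assumes the standard textbook form of a threshold neuron: the output is 1 when the weighted sum of the inputs reaches a threshold and 0 otherwise. The weights and thresholds are illustrative choices.

```python
def mp_neuron(inputs, weights, threshold):
    """Fire (1) when the weighted input sum reaches the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0


# With unit weights, the threshold alone decides which logic gate results.
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_gate = [mp_neuron(p, weights=(1, 1), threshold=2) for p in pairs]
or_gate = [mp_neuron(p, weights=(1, 1), threshold=1) for p in pairs]

print("AND:", and_gate)
print("OR: ", or_gate)
```

The hard threshold makes the unit non-differentiable, which is one reason multi-layer networks trained by gradient descent use smooth activations such as the sigmoid or hyperbolic tangent.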
Multi-Layer Perceptron
 Fully interconnected
Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
Optimization function
 Rumelhart et al – Gradient Descent
(Generalized Delta Rule)
Challenges
 Saturation at Initialization
 Known solutions:
 Small initial weights
 Hyperbolic Tangent Function instead of Sigmoidal
 Other challenges relating to speech
processing
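Saturation means the activation's slope is close to zero, so gradient descent barely updates the weights. The Python sketch below compares the slopes of the sigmoid and the hyperbolic tangent at a few inputs; it illustrates why small initial weights help and is not taken from the cited work. The function names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# Slopes shrink quickly as the net input moves away from zero.
for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}")
```

Near zero the hyperbolic tangent has a slope of 1 against 0.25 for the sigmoid, and its output is centred on zero. Both functions flatten out for large inputs, which is why large initial weights slow training.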
http://bit.ly/my_pubs
Hyperbolic Tangent Function
Saturation at Initialization
http://bit.ly/modANN
Introduced (N)
where (N)  1
http://bit.ly/modANN
Impact
 Training time was significantly reduced
 3 layers – not needed
 (N) - empirical
Water-Sludge interface Detection
 Thames Water Authority – Deephams
Station, Enfield
Problem
 Replace Turbidity meter
 Piezo-electric transducer to detect water-
sludge interface
 Measure water depth in a final stage settling
tank
Piezo-electric Transducer
Receiver
Transmitter
Final Stage Settling Tank
Pulsed Sinusoidal Signal
 Period of pulse 27.5 ms
Collecting Data
 Envelope Detection and Amplification
Data Visualization
 Average of the reverberated signal by the pulse period
[Figure labels: Leakage, Bottom of the tank, Reverberation, 3.68 ms]
Computing the Water Depth
 Speed of sound in water ~1.5 x 10^3 m/s
 Round-trip distance = 1.5 x 10^3 m/s x 3.68 ms = 5.52 m
 Depth of water = 5.52 m / 2 = 2.76 m ≈ 9.06 ft
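The depth follows from the echo's round-trip time: the pulse travels to the reflecting surface and back, so the one-way distance is half of speed times time. The Python sketch below repeats the calculation with the figures from the slide; the constant and function names are illustrative.

```python
SPEED_OF_SOUND_WATER = 1.5e3   # m/s, approximate value used on the slide
METRES_PER_FOOT = 0.3048


def depth_from_echo(round_trip_seconds, speed=SPEED_OF_SOUND_WATER):
    """One-way distance to the reflector, given the round-trip echo time."""
    return speed * round_trip_seconds / 2.0


# Echo delay read from the averaged signal: 3.68 ms.
depth_m = depth_from_echo(3.68e-3)
depth_ft = depth_m / METRES_PER_FOOT
print(f"{depth_m:.2f} m = {depth_ft:.2f} ft")
```

The speed of sound in water varies with temperature, so an instrument would likely need calibration rather than a fixed constant.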
Impact
 Proof of concept was successful
 Won a contract to develop an instrument
Addition of Internet?
 IoT
 On a computer or a device
Part III
 Energy Efficiency in Building Systems
Powerwall by Tesla
Powerwall
Solar Tubes and Walls
Sun Shades
UC Davis West Village is the largest planned “zero net energy” community
Net Zero Homes
 New homes to be net-zero energy by 2020
 California Public Utilities Commission (CPUC) and
 California Energy Commission (CEC)
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020
Related Work
 http://bit.ly/EnergyEff123
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020
  • 1. PROJECTS RELATING TO DATA SCIENCE Email: sashs at gmx dot com
  • 2.  Part I  Predictive Model in Detail  Part II  Portfolio  Part III  Energy Efficiency in Building Systems
  • 3. Part I: Building of a Predictive Model  Human Activity Recognition using ‘RandomForest’
  • 4. Conceptually…  Steps in building a predictive model 1. Define the question 2. Define the ideal data set 3. Determine what data you can access 4. Obtain the data 5. Clean the data 6. Exploratory data analysis 7. Statistical prediction/modelling 8. Interpret results 9. Challenge results 10. Synthesize/write up results Predictive Model in Detail
  • 5. Problem  Human Activity Prediction Using Smartphones Data Set  Samsung Galaxy S II  30 volunteers wearing on their waist  Six activities  WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING  Sensors  Accelerometer and Gyroscope Predictive Model in Detail Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
  • 6. Dataset  UCI Machine Learning Repository  561-feature vector with time and frequency domain variables, augmented with “subject” and “activity” => 563  3-axial linear acceleration  3-axial angular velocity Predictive Model in Detail Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
  • 7. Duplicate Column Names R Language load(".samsungData.rda") is.data.frame(samsungData) # [1]TRUE table(duplicated(names(samsungData))) # checking for duplicate headers # FALSE TRUE # 479 84 Predictive Model in Detail
  • 8. Duplicate Column Names samsDF <- data.frame(samsungData) is.data.frame(samsungData) # [1]TRUE table(duplicated(names(samsDF))) # checking for duplicate headers # FALSE # 563 Predictive Model in Detail
  • 9. Column Types table(sapply(samsDF, class)) # character integer numeric # 1 1 561 which(sapply(samsDF, is.character)) # activity # 563 which(sapply(samsDF, is.integer)) # subject # 562 Predictive Model in Detail
  • 10. Missing Data & Finite Values dim(samsDF) # [1] 7352 563 table(complete.cases(samsDF)) #TRUE # 7352 table(sapply(samsDF[,1:561], is.finite)) #TRUE # 4124472 #7352*561 = 4124472 Predictive Model in Detail
  • 11. Balanced Data table(samsDF$activity) # laying sitting standing walk walkdown walkup # 1407 1286 1374 1226 986 1073 sum(table(samsDF$activity)) # [1] 7352 round( table(samsDF$activity)/nrow(samsDF), 2) # laying sitting standing walk walkdown walkup # 0.19 0.17 0.19 0.17 0.13 0.15 Predictive Model in Detail
  • 12. Splitting Data library(caTools) # Randomly split the data into training and testing sets set.seed(1000) split = sample.split(samsDF$activity, SplitRatio = 0.7) # Split up the data using subset train = subset(samsDF, split==TRUE) dim(train) # [1] 5146 563 round( table(train$activity)/nrow(train), 2) # laying sitting standing walk walkdown walkup # 0.19 0.17 0.19 0.17 0.13 0.15 Predictive Model in Detail
  • 13. Test Data test = subset(samsDF, split==FALSE) dim(test) # [1] 2206 56 round( table(test$activity)/nrow(test), 2) # laying sitting standing walk walkdown walkup # 0.19 0.17 0.19 0.17 0.13 0.15 Predictive Model in Detail
  • 14. Random Forest
        library(randomForest)
        set.seed(415)
        trainF = train
        trainF[562] = NULL  # drop the "subject" column
        dim(trainF)
        # [1] 5146  562
  • 15. Determining ntree
        fit <- randomForest(as.factor(activity) ~ ., data=trainF, importance=TRUE, ntree=500, do.trace=T)
      ntree = 293
      Initial results:
        Prediction <- predict(fit, test[1:561])
        library(caret)
        # recent caret versions require both arguments to be factors
        confusionMatrix(Prediction, as.factor(test[, 563]))
        # Accuracy : 0.9782
        # 95% CI : (0.9713, 0.9839)
  • 16. Determining mtry
        # mtry: number of variables tried at each split; tuneRF searches for the value with the lowest OOB error
        mtry <- tuneRF(trainF[-562], as.factor(trainF$activity), ntreeTry=200, stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE)
        bestm <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
        bestm
        # [1] 11
  • 17. Building & testing the Model
        fitF <- randomForest(as.factor(activity) ~ ., data=trainF, importance=TRUE, ntree=293, mtry=bestm, do.trace=T)
        PredictionF <- predict(fitF, test[1:561])
        library(caret)
        # recent caret versions require both arguments to be factors
        confusionMatrix(PredictionF, as.factor(test[, 563]))
        # Accuracy : 0.9805
        # 95% CI : (0.9738, 0.9859)
      Relative reduction in error = (0.9805 - 0.9782) / (1 - 0.9782) = 0.1055
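The reduction-in-error figure on this slide compares the error rates (1 - accuracy) of the initial and tuned models. A minimal Python sketch of that arithmetic follows; the function name is an assumption made for illustration and is not part of the deck.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the baseline error removed by the improved model."""
    err_before = 1.0 - acc_before  # error rate of the initial model
    err_after = 1.0 - acc_after    # error rate of the tuned model
    return (err_before - err_after) / err_before


if __name__ == "__main__":
    # Accuracies reported on slides 15 and 17
    print(round(relative_error_reduction(0.9782, 0.9805), 4))
```

Expressing the gain relative to the remaining error shows that a 0.23-point accuracy gain removes roughly a tenth of the errors the initial model made.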
  • 18. AUC
        library(pROC)
        ROC1 <- multiclass.roc(test$activity, as.numeric(PredictionF))
        auc(ROC1)
        # Multi-class area under the curve: 0.9953
      Decision tree by hand: http://bit.ly/DTree123
  • 19. Part II: Portfolio. MapReduce: Apache Weblog; Visualization: LTV; Streaming Data Analysis: Speech; Artificial Neural Network (ANN); Water-Sludge Interface Detection
  • 20. MapReduce: Apache Weblog Source: https://www.maxmind.com/en/home
  • 21. Problem: analyze an Apache weblog and provide the EpochTime (date and time the request was processed by the server), IP address, latitude and longitude, URI, and referer. http://bit.ly/oFraud123
  • 22. Combined Weblog Format: "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"", where %h is the IP address of the client (remote host); %l is the identity field, in which a hyphen indicates missing information; %u is the userid of the person requesting; %t is the time of the request; … Source: https://httpd.apache.org/docs/1.3/logs.html
  • 23. Knowing your customers through Apache Logs: 198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" 200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36" (callouts on the slide mark the IP address, date and time, URI, and referer)
  • 24. Challenges: the weblog needs to be parsed to extract the required information; time is not expressed in “EpochTime”; latitude and longitude are not readily available.
  • 25. Regular Expression & Testing: (\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] "(\S+) /(.*?) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)" (tested at https://regex101.com/)
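A sketch of applying a pattern of this shape with Python's `re` module to the sample log line from slide 23. The dictionary keys and the `parse_line` helper are illustrative assumptions; the deck's actual parser is behind the bit.ly link and is not reproduced here.

```python
import re

# Combined log format: host, identity, user, [time zone], "method /path protocol",
# status, bytes, "referer", "user agent"
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] '
    r'"(\S+) /(.*?) (\S+)" (\S+) (\S+) "([^"]*)" "([^"]*)"'
)


def parse_line(line):
    """Return the fields the deck asks for, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group(1),
        "time": match.group(4),
        "utc_offset": match.group(5),
        "uri": "/" + match.group(7),  # the pattern consumes the leading slash
        "referer": match.group(11),
    }


SAMPLE = (
    '198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] '
    '"GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" '
    '200 8768 "http://www.svds.com/rockandroll/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE))
```

Returning `None` for non-matching lines lets a mapper skip malformed records instead of failing the whole job.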
  • 27. EpochTime
        import calendar
        import time

        def convert_time(d, utc):
            # d = "14/Jan/2014:09:36:50"
            # utc = '-0800'
            fmt = '%d/%b/%Y:%H:%M:%S'
            utci = int(utc)
            # parse the string given the format; timegm reads the fields as UTC,
            # so the result does not depend on the machine's own time zone
            epot = calendar.timegm(time.strptime(d, fmt))
            # offset in hours: minutes converted to hrs + int division in hrs
            epod = (abs(utci) % 100) / 60.0 + (abs(utci) // 100)
            if utci >= 0:
                epf = epot - epod * 3600  # local time is ahead of UTC
            else:
                epf = epot + epod * 3600  # local time is behind UTC
            return int(epf)
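As a cross-check for this conversion, the same result can be obtained with the standard library's `datetime`, whose `%z` directive parses offsets such as `-0800` directly. This is an alternative sketch, not the deck's method; the name `to_epoch` is an assumption.

```python
from datetime import datetime

# Apache timestamp followed by the UTC offset, e.g. "14/Jan/2014:09:36:51 -0800"
LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def to_epoch(timestamp, utc_offset):
    """Convert an Apache log timestamp plus its UTC offset to epoch seconds."""
    aware = datetime.strptime(f"{timestamp} {utc_offset}", LOG_TIME_FORMAT)
    # timestamp() on a timezone-aware datetime ignores the machine's local zone
    return int(aware.timestamp())


if __name__ == "__main__":
    print(to_epoch("14/Jan/2014:09:36:51", "-0800"))
```

Because 09:36:51 at UTC-8 is 17:36:51 UTC, the epoch value is eight hours later than the same wall-clock time read as UTC.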
  • 28. Latitude and Longitude: GeoLite2 from MaxMind, queried with geolite2.lookup(<IP address>). Reducer: http://bit.ly/ApaMapper Source: https://www.maxmind.com/en/home
  • 29. Mapper
        #!/usr/bin/env python
        import sys

        # Iterate through every line passed in to stdin
        for line in sys.stdin:
            value = line.strip()
            print(value)
      http://bit.ly/ApaMapper
  • 30. Hadoop
        hadoop jar path/to/hadoop-streaming-0.20.203.0.jar \
            -mapper path/to/mapper.py \
            -reducer path/to/reducer.py \
            -input path/to/input/* \
            -output path/to/output
  • 32. Impact: helps to detect online fraud and locate online visitors.
  • 34. Background: gamers sign up each day and become part of a cohort; LTV (lifetime value) is computed for up to 30 days.
  • 35. Problem: use Tableau to compute LTV and weighted LTV.
  • 36. Challenges: Tableau is relatively new; LTV computation was not readily available; the given dataset is irregular.
  • 39. Impact: customer LTV > cost of customer acquisition (CAC). CAC example: a $10 engagement brings 5 new users, and these users acquire 15 more users at no cost, so CAC = $10/(5+15) = $0.50.
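The CAC arithmetic on this slide divides the spend by all users gained, paid and referred alike. A small Python sketch of that calculation follows; the function names and the LTV figure in the usage line are hypothetical, chosen only to illustrate the LTV > CAC comparison.

```python
def customer_acquisition_cost(spend, paid_users, referred_users):
    """Spend divided by every user gained, including free referrals."""
    return spend / (paid_users + referred_users)


def is_acquisition_worthwhile(ltv, cac):
    """The slide's criterion: customer LTV must exceed CAC."""
    return ltv > cac


if __name__ == "__main__":
    cac = customer_acquisition_cost(10.0, 5, 15)  # the slide's example
    print(cac)
    # 1.25 is a made-up LTV used only to exercise the comparison
    print(is_acquisition_worthwhile(1.25, cac))
```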
  • 41. Language Learning over a Chat Session. https://angel.co/freeaccent
  • 42. Problem: learn a foreign language from a native speaker when student and tutor are separated, using a computing device and the internet.
  • 43. Challenges: collecting the speech data off the web (record: start record, stop record; upload); preprocessing the speech data (end-point detection, noise); extracting the accent score frame by frame; populating the web page on demand.
  • 44. Technology Stack: collecting the speech data off the web with HTML5, JavaScript, PHP; preprocessing the speech data with an energy-based algorithm and MFCC; extracting the accent score frame by frame with a proprietary algorithm; populating the web page on demand with AJAX.
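The deck names an energy-based algorithm for preprocessing (end-point detection) but does not describe it. The following is a generic sketch of energy-based end-point detection, assuming non-overlapping frames and a threshold set as a fraction of the peak frame energy; the frame length, ratio, and function names are all assumptions, not the project's implementation.

```python
import math


def frame_energies(samples, frame_len):
    """Short-time energy (sum of squares) of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len])
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=160, ratio=0.1):
    """Return (start, end) sample indices spanning the frames whose energy
    exceeds ratio * peak energy, or None if the signal is silent."""
    energies = frame_energies(samples, frame_len)
    if not energies or max(energies) == 0:
        return None
    threshold = ratio * max(energies)
    active = [i for i, e in enumerate(energies) if e > threshold]
    return active[0] * frame_len, (active[-1] + 1) * frame_len


if __name__ == "__main__":
    # Synthetic test signal: silence, a 440 Hz tone sampled at 8 kHz, silence
    tone = [math.sin(2 * math.pi * 440 * n / 8000) for n in range(1600)]
    signal = [0.0] * 800 + tone + [0.0] * 800
    print(detect_endpoints(signal))
```

A fixed fraction of the peak energy is the simplest threshold; real recordings with background noise usually need a noise-floor estimate instead.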
  • 45. User Interface
  • 46. Impact: measurement tool; motivational (helps to set goals); customer retention.
  • 48. McCulloch-Pitts (MP) Neuron. Source: https://appliedgo.net/perceptron/
  • 49. Diagram of the MP neuron
  • 50. Equation of the MP neuron. Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
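The equation on this slide is an image and does not survive in the transcript. As an illustration, the sketch below assumes the textbook form of a threshold neuron: the output is 1 when the weighted sum of the inputs reaches a threshold and 0 otherwise. The function name and the gate examples are assumptions.

```python
def mp_neuron(inputs, weights, threshold):
    """Threshold unit: fire (1) when the weighted input sum reaches the threshold."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0


if __name__ == "__main__":
    # With unit weights, threshold 2 gives logical AND and threshold 1 gives OR
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, mp_neuron((a, b), (1, 1), 2), mp_neuron((a, b), (1, 1), 1))
```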
  • 51. Multi-Layer Perceptron: fully interconnected. Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
  • 52. Optimization function: gradient descent (generalized delta rule), Rumelhart et al.
  • 53. Challenges: saturation at initialization (known solutions: small initial weights; hyperbolic tangent function instead of sigmoidal); other challenges relating to speech processing. http://bit.ly/my_pubs
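Saturation can be made concrete with the derivative of the activation function, which scales every weight update in gradient descent. The sketch below is a generic illustration rather than the deck's experiment: it assumes a single unit with 100 inputs all equal to 1, and compares small and large initial weights.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; it vanishes as |z| grows (saturation)."""
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh; equals 1 at z = 0 versus 0.25 for the sigmoid."""
    return 1.0 - math.tanh(z) ** 2


if __name__ == "__main__":
    n_inputs = 100  # hypothetical layer width, all inputs set to 1
    for w in (0.01, 1.0):
        z = n_inputs * w  # pre-activation of the unit
        print(w, z, sigmoid_grad(z))
    print(sigmoid_grad(0.0), tanh_grad(0.0))
```

With weights of 1.0 the pre-activation is 100 and the sigmoid gradient is effectively zero, so learning stalls; with weights of 0.01 the unit stays in the responsive part of the curve.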
  • 56. Introduced (N) where (N)  1 ANN http://bit.ly/modANN
  • 57. Impact: training time was significantly reduced; 3 layers were not needed; (N) is empirical.
  • 58. Water-Sludge Interface Detection: Thames Water Authority, Deephams Station, Enfield
  • 59. Problem: replace the turbidity meter; use a piezo-electric transducer to detect the water-sludge interface; measure water depth in a final-stage settling tank.
  • 60. Piezo-electric Transducer (diagram labels: Receiver, Transmitter)
  • 61. Final Stage Settling Tank
  • 62. Pulsed Sinusoidal Signal: period of pulse 27.5 ms
  • 63. Collecting Data: envelope detection and amplification
  • 64. Data Visualization: average of the reverberated signal by the pulse period (plot labels: leakage, bottom of the tank, reverberation, 3.68 ms)
  • 65. Computing the Water Depth: speed of sound in water ≈ 1.5 × 10^3 m/s; round-trip distance = 1.5 × 10^3 m/s × 3.68 ms = 5.52 m; depth of water = 5.52 m / 2 = 2.76 m ≈ 9.06 ft
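The depth calculation is a time-of-flight estimate: the pulse travels to the reflecting surface and back, so the one-way depth is half the distance sound covers in the echo delay. A short Python sketch follows, assuming the slide's speed of sound of 1500 m/s; the function and constant names are illustrative.

```python
METRES_PER_FOOT = 0.3048


def water_depth_m(echo_delay_s, speed_of_sound_m_s=1500.0):
    """One-way depth from a round-trip echo delay."""
    round_trip_m = speed_of_sound_m_s * echo_delay_s
    return round_trip_m / 2.0


if __name__ == "__main__":
    depth = water_depth_m(3.68e-3)  # echo delay read from the averaged signal
    print(depth, depth / METRES_PER_FOOT)
```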
  • 66. Impact: proof of concept was successful; won a contract to develop an instrument.
  • 67. Addition of Internet? IoT; on a computer or a device
  • 68. Part III: Energy Efficiency in Building Systems
  • 69. Powerwall by Tesla
  • 70. Solar Tubes and Walls
  • 71. Sun Shades: UC Davis West Village is the largest planned “zero net energy” community
  • 72. Net Zero Homes: new homes to be net-zero energy by 2020; California Public Utilities Commission (CPUC) and California Energy Commission (CEC). Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020
  • 73. Related Work: http://bit.ly/EnergyEff123