SlideShare a Scribd company logo
1 of 22
BayanAlghuraybi
Krishna Marvaniya
Guojun Xia
Advisor : Prof JongwookWoo
24th Annual Student Symposium
on Research, Scholarship and Creative Activity
Analyze NYCTaxi Data using Hive
and Machine Learning
Friday, February 26, 2016
California State University, Los Angeles
Table of Content
1. Hadoop and Microsoft
2. Our Road Map
3. Demonstration
• Ingest data
• HDInsight Query console
• Visualize Data
• Predictive Analytics
1.Hadoop and Microsoft
• Big Data Not Only volume
• Improve analytic and Statics
• Extract BusinessValue
• Efficient Architecture
Hadoop
2.Road Map
Deploy ModelBuild ModelExplore sample DataLoad Data
Azure Machine LearningHive & Power BI
3. Demonstration : Ingest data
■ Data set source (URL): http://chriswhong.com/open-data/foil_nyc_taxi/
18.7 G
■ Specifiction of experimental equipment:
Number Of Nodes : 4 (Worker node:A7 & Head Node:A3)
Memory Size: 605GB(Worker node) & 285GB(Head node)
CPU/core speed: 56GB(Worker node) & 7GB(Head node)
■ Gitlab/Github information:
■ https://gitlab.com/kmarvan/Analysis.git
■ https://github.com/kmarvan/Analysis-on-NYC-taxi-Data-using-Hive-and-
Machine-Learning
HDInsight Query console
Visualize Data
MaximumTip
MaximumTotal amount in PM
Predictive Analytics
Algorithms for Microsoft Azure Machine
Learning
A.Two-Class Logistic Regression
Two-Class Logistic Regression module is to create a logistic regression model that can be used to
predict one of two states of the target variable. Logistic regression is a well-known statistical
technique that is used for modeling many kinds of outcomes.
B. Boosted DecisionTree Regression
Boosted DecisionTree Regression module is to create an ensemble of regression trees using
boosting. Boosting means that each tree is dependent on prior trees, and boosting in a decision
tree ensemble tends to improve accuracy with some small risk of less coverage.
Evaluating a Multiclass Classification Model:
Creating the Experiment Reader module
What Can we predict ?
Train, evaluate and compare
Payment_type
Tipped
Tip_class
References
■ MicrosoftAzure Machine Learning, https://studio.azureml.net
■ Process Data with HIVE, http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-
apache-hive/
■ How to use HDInsight to create cluster, https://azure.microsoft.com/en-
us/documentation/articles/hdinsight-hadoop-tutorial-get-started-windows/
Q & A
ThankYou

More Related Content

Similar to 528 presentation-26 feb

Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
NidhiAhuja30
 

Similar to 528 presentation-26 feb (20)

Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
 
DSBDA Miniproject Assignment - TE A (1).pdf
DSBDA Miniproject Assignment - TE A (1).pdfDSBDA Miniproject Assignment - TE A (1).pdf
DSBDA Miniproject Assignment - TE A (1).pdf
 
Proof of Concept for Learning Analytics Interoperability
Proof of Concept for Learning Analytics InteroperabilityProof of Concept for Learning Analytics Interoperability
Proof of Concept for Learning Analytics Interoperability
 
Introduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and PythonIntroduction to Machine Learning with H2O and Python
Introduction to Machine Learning with H2O and Python
 
Big_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic backgroundBig_data_1674238705.ppt is a basic background
Big_data_1674238705.ppt is a basic background
 
Lightweight Collection and Storage of Software Repository Data with DataRover
Lightweight Collection and Storage of  Software Repository Data with DataRoverLightweight Collection and Storage of  Software Repository Data with DataRover
Lightweight Collection and Storage of Software Repository Data with DataRover
 
Reproducibility and Dataverse
Reproducibility and DataverseReproducibility and Dataverse
Reproducibility and Dataverse
 
Analysing GitHub commits with R
Analysing GitHub commits with RAnalysing GitHub commits with R
Analysing GitHub commits with R
 
Building a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe CrobakBuilding a Data Pipeline from Scratch - Joe Crobak
Building a Data Pipeline from Scratch - Joe Crobak
 
Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014Hadoop on OpenStack - Sahara @DevNation 2014
Hadoop on OpenStack - Sahara @DevNation 2014
 
Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack Big Data Analysis : Deciphering the haystack
Big Data Analysis : Deciphering the haystack
 
ShruthiNayak
ShruthiNayakShruthiNayak
ShruthiNayak
 
Intro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWSIntro to Machine Learning with H2O and AWS
Intro to Machine Learning with H2O and AWS
 
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
 
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
The Impact of Columnar File Formats on SQL-on-Hadoop Engine Performance: A St...
 
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
Ml based detection of users anomaly activities (20th OWASP Night Tokyo, English)
 
Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)Enabling Secure Data Discoverability (SC21 Tutorial)
Enabling Secure Data Discoverability (SC21 Tutorial)
 
Skillshare - Introduction to Data Scraping
Skillshare - Introduction to Data ScrapingSkillshare - Introduction to Data Scraping
Skillshare - Introduction to Data Scraping
 
Dataverse: Helping Researchers Publish Their Data Through Automation
Dataverse: Helping Researchers Publish Their Data Through Automation�Dataverse: Helping Researchers Publish Their Data Through Automation�
Dataverse: Helping Researchers Publish Their Data Through Automation
 
A FAIR Approach to Publishing and Sharing Machine Learning Models
A FAIR Approach to Publishing and Sharing Machine Learning ModelsA FAIR Approach to Publishing and Sharing Machine Learning Models
A FAIR Approach to Publishing and Sharing Machine Learning Models
 

Recently uploaded

Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Valters Lauzums
 
如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一
如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一
如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一
hwhqz6r1y
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
pyhepag
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
fztigerwe
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
great91
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra MalangToko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
adet6151
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
DilipVasan
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
0uyfyq0q4
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
cyebo
 

Recently uploaded (20)

Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
Data Analytics for Digital Marketing Lecture for Advanced Digital & Social Me...
 
如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一
如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一
如何办理新加坡国立大学毕业证(NUS毕业证)学位证成绩单原版一比一
 
一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理一比一原版阿德莱德大学毕业证成绩单如何办理
一比一原版阿德莱德大学毕业证成绩单如何办理
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
How I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prisonHow I opened a fake bank account and didn't go to prison
How I opened a fake bank account and didn't go to prison
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
如何办理哥伦比亚大学毕业证(Columbia毕业证)成绩单原版一比一
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
edited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdfedited gordis ebook sixth edition david d.pdf
edited gordis ebook sixth edition david d.pdf
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7
ℂall Girls Kashmiri Gate ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
 
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra MalangToko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
Toko Jual Viagra Asli Di Malang 081229400522 COD Obat Kuat Viagra Malang
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
Exploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptxExploratory Data Analysis - Dilip S.pptx
Exploratory Data Analysis - Dilip S.pptx
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7
ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
 
一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理一比一原版纽卡斯尔大学毕业证成绩单如何办理
一比一原版纽卡斯尔大学毕业证成绩单如何办理
 

528 presentation-26 feb

  • 1. BayanAlghuraybi Krishna Marvaniya Guojun Xia Advisor : Prof JongwookWoo 24th Annual Student Symposium on Research, Scholarship and Creative Activity Analyze NYCTaxi Data using Hive and Machine Learning Friday, February 26, 2016 California State University, Los Angeles
  • 2. Table of Content 1. Hadoop and Microsoft 2. Our Road Map 3. Demonstration • Ingest data • HDInsight Query console • Visualize Data • Predictive Analytics
  • 3. 1.Hadoop and Microsoft • Big Data Not Only volume • Improve analytic and Statics • Extract BusinessValue • Efficient Architecture Hadoop
  • 4. 2.Road Map Deploy ModelBuild ModelExplore sample DataLoad Data Azure Machine LearningHive & Power BI
  • 5. 3. Demonstration : Ingest data ■ Data set source (URL): http://chriswhong.com/open-data/foil_nyc_taxi/ 18.7 G ■ Specifiction of experimental equipment: Number Of Nodes : 4 (Worker node:A7 & Head Node:A3) Memory Size: 605GB(Worker node) & 285GB(Head node) CPU/core speed: 56GB(Worker node) & 7GB(Head node) ■ Gitlab/Github information: ■ https://gitlab.com/kmarvan/Analysis.git ■ https://github.com/kmarvan/Analysis-on-NYC-taxi-Data-using-Hive-and- Machine-Learning
  • 7.
  • 10.
  • 13. Algorithms for Microsoft Azure Machine Learning A.Two-Class Logistic Regression Two-Class Logistic Regression module is to create a logistic regression model that can be used to predict one of two states of the target variable. Logistic regression is a well-known statistical technique that is used for modeling many kinds of outcomes. B. Boosted DecisionTree Regression Boosted DecisionTree Regression module is to create an ensemble of regression trees using boosting. Boosting means that each tree is dependent on prior trees, and boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage.
  • 14. Evaluating a Multiclass Classification Model: Creating the Experiment Reader module
  • 15. What Can we predict ?
  • 20. References ■ MicrosoftAzure Machine Learning, https://studio.azureml.net ■ Process Data with HIVE, http://hortonworks.com/hadoop-tutorial/how-to-process-data-with- apache-hive/ ■ How to use HDInsight to create cluster, https://azure.microsoft.com/en- us/documentation/articles/hdinsight-hadoop-tutorial-get-started-windows/
  • 21. Q & A