Predicting Amazon Rating
Using Spark ML and Azure ML
Aakanksha Tasgaonkar
Amogh Mahesh
Monika Mishra
Professor – Dr. Jongwook Woo
Introduction
• Ratings help customers obtain useful information about products more easily and efficiently.
• Predicting ratings is important for a business so that it can take corrective measures.
• Recommendation systems are an important component of today's e-commerce applications, such as targeted advertising, personalized marketing, and information retrieval.
ABOUT THE DATASET
• Dataset URL : https://s3.amazonaws.com/amazon-reviews-pds/tsv/amazon_reviews_multilingual_US_v1_00.tsv.gz
• Products reviewed between 2005 and 2015 in the US
• File Size : 3.63 GB
• Number of Rows : 6.93 million
• Number of Columns : 15
• Number of Files : 1
• File Format : TSV (Tab Separated Values)
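As a rough illustration of working with a file in this format, the sketch below parses tab-separated review rows using only Python's standard library. The sample rows and the `path` argument are hypothetical, and turning off quote handling is an assumption made here so that quotation marks inside review text are read literally; none of this is taken from the project's notebooks.

```python
import csv
import gzip
import io


def iter_reviews(lines):
    """Yield one dict per review from an iterable of tab-separated lines."""
    # QUOTE_NONE is an assumption: it treats quotation marks as ordinary text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def open_reviews(path):
    """Open a gzipped TSV file as text (path is a hypothetical local copy)."""
    return gzip.open(path, mode="rt", encoding="utf-8", newline="")


# Hypothetical in-memory sample with a subset of the dataset's columns.
sample = io.StringIO(
    "marketplace\tcustomer_id\tproduct_id\tstar_rating\treview_headline\n"
    "US\t111\tB00A\t5\tWorks \"great\"\n"
    "US\t222\tB00B\t2\tStopped working\n"
)
rows = list(iter_reviews(sample))
ratings = [int(row["star_rating"]) for row in rows]
```

With a downloaded copy of the file, `iter_reviews(open_reviews(path))` would stream rows one at a time instead of loading the full 3.63 GB into memory.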
FEATURES
Features:
• marketplace
• customer_id
• review_id
• product_id
• product_parent
• product_title
• helpful_votes
• total_votes
• vine
• verified_purchase
• review_headline
• review_body
Label:
• star_rating
TECHNICAL SPECIFICATIONS
• Free Workspace
• 10 GB storage
• Single node
• Databricks Subscription
• Cluster 5.2 (includes Apache Spark 2.4.0, Scala 2.11)
• 6 GB Memory, 0.88 Cores, 1 DBU
• Python Version 3
ALGORITHMS USED
Azure ML
• Matchbox Recommender
• Decision Forest Regression
• Boosted Decision Tree Regression
Spark ML
• Collaborative Filtering
• Text Analytics using Logistic Regression
Flowchart
Gathering Data
Data Preprocessing
Data Transformation
Data Split
Train and Test
Evaluate
AZURE ML
SAMPLING
Original Dataset : 3.63 GB
Sampled Dataset : 73 MB
Time Taken : 5.30 Minutes
Stratified Split ensures that the output dataset contains a representative sample of the values in the selected column.
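The idea behind a stratified split can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the Azure ML module itself: the function name, the `fraction` and `seed` parameters, and the toy rows are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group of rows sharing row[key]."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for value in sorted(groups):
        members = groups[value]
        # Keep at least one row per group so that no value disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical toy data: 600 five-star and 400 one-star reviews.
rows = [{"star_rating": 5} for _ in range(600)]
rows += [{"star_rating": 1} for _ in range(400)]
sample = stratified_sample(rows, "star_rating", 0.02)
five_star = sum(1 for r in sample if r["star_rating"] == 5)
one_star = sum(1 for r in sample if r["star_rating"] == 1)
```

Because each rating value is sampled separately, the 60:40 ratio of the toy data carries over to the sample.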
Matchbox Recommender
• 2% Sample
• Time Taken : 6 Mins
• 75:25 Split Train/Test
• Item Recommendation
• Related Users
• Rating Prediction
• Related Items
Matchbox Recommender Outcomes
[Figures: Item Recommendation (From Rated Items); Related Items (Categories); Rating Prediction; Related Users; Item Recommendation – From All Items]
Collaborative Filtering Recommender
• Selected the required columns from the cleaned data on the AzureML platform and uploaded them to the Databricks File System
• Alternating Least Squares (ALS) algorithm with explicit feedback
• ALS requires the user, item, and rating inputs in numeric form; user and item IDs must fit in the integer range
• StringIndexer is used to convert the product category (string) into a numeric index
• Time Taken : 30 Minutes
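The two steps named here, indexing strings and alternating least squares, can be sketched in pure Python. This is a simplified stand-in and not Spark's implementation: it uses a single latent factor (rank 1) so that each update has a closed form, and the function names, the `reg` value, and the toy ratings are hypothetical.

```python
import math
from collections import Counter


def string_index(values):
    """Map each distinct string to an integer, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that choice is an assumption.
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: i for i, v in enumerate(ordered)}


def als_rank1(ratings, n_users, n_items, reg=0.01, iters=20):
    """Rank-1 ALS on (user, item, rating) triples with explicit feedback."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        # Fix item factors; each user factor is a 1-D ridge regression.
        for i in range(n_users):
            num = sum(r * v[j] for (a, j, r) in ratings if a == i)
            den = reg + sum(v[j] ** 2 for (a, j, r) in ratings if a == i)
            u[i] = num / den
        # Fix user factors; solve for each item factor the same way.
        for j in range(n_items):
            num = sum(r * u[i] for (i, b, r) in ratings if b == j)
            den = reg + sum(u[i] ** 2 for (i, b, r) in ratings if b == j)
            v[j] = num / den
    return u, v


# Hypothetical toy data: two customers rating two product categories.
raw = [("c1", "Books", 4.0), ("c1", "Toys", 2.0),
       ("c2", "Books", 2.0), ("c2", "Toys", 1.0), ("c2", "Books", 2.0)]
user_ids = string_index([c for c, _, _ in raw])
item_ids = string_index([p for _, p, _ in raw])
triples = [(user_ids[c], item_ids[p], r) for c, p, r in raw]
u, v = als_rank1(triples, len(user_ids), len(item_ids))
rmse = math.sqrt(sum((r - u[i] * v[j]) ** 2 for i, j, r in triples) / len(triples))
```

A predicted rating for user `i` and item `j` is `u[i] * v[j]`; a higher-rank model replaces the scalars with vectors and each closed-form update with a small linear solve.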
Decision Forest Regression
• 2% Sample
• 70:30 Split Train/Test
• Cross-Validation
• Tune Model Hyperparameters
• Permutation Feature Importance
• Time Taken: 30 Mins
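Permutation feature importance can be illustrated with a short sketch: shuffle one feature column at a time and measure how much the error grows. The code below is a generic version of that idea under stated assumptions, not the Azure ML module; `toy_model` and the data are hypothetical.

```python
import math
import random


def rmse(model, X, y):
    """Root mean squared error of model predictions on rows X."""
    return math.sqrt(sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y))


def permutation_importance(model, X, y, seed=0):
    """Return the increase in RMSE after shuffling each feature column."""
    rng = random.Random(seed)
    baseline = rmse(model, X, y)
    scores = []
    for col in range(len(X[0])):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        # Rebuild the rows with only this one column permuted.
        X_perm = [row[:col] + [val] + row[col + 1:] for row, val in zip(X, shuffled)]
        scores.append(rmse(model, X_perm, y) - baseline)
    return scores


def toy_model(row):
    """Hypothetical model that depends only on the first feature."""
    return 3.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(10)]
y = [3.0 * row[0] for row in X]
scores = permutation_importance(toy_model, X, y)
```

A feature the model ignores gets a score of zero, which is the kind of signal used to decide which features to drop.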
Decision Forest Regression
[Figures: Permutation Feature; Evaluation Metrics]
Decision Forest Regression
• RMSE decreased, but the difference was not very significant
Features used:
• product_parent
• product_title
• review_headline
• product_id
• review_date
[Figure: Evaluation Metrics]
Boosted Decision Tree Regression
• 2% Sample
• 70:30 Split Train/Test
• Cross-Validation
• Tune Model Hyperparameters
• Permutation Feature Importance
• Time Taken: 1 Hour 30 Minutes
Boosted Decision Tree Regression
[Figures: Permutation Feature; Evaluation Metrics]
Boosted Decision Tree Regression
• Root Mean Squared Error did not change even after removing the lower-weighted features
Features removed:
• marketplace
• customer_id
• review_id
• review_body
[Figure: Evaluation Metrics]
Comparison
Metric                         Decision Forest Regression   Boosted Decision Tree Regression
Mean Absolute Error            0.882887                     0.649424
Root Mean Squared Error        1.143119                     0.911949
Relative Absolute Error        0.989171                     0.727603
Relative Squared Error         0.986502                     0.627851
Coefficient of Determination   0.013498                     0.372149
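For reference, the five metrics in this comparison can be computed as in the sketch below, which uses their standard definitions with the mean of the true values as the baseline predictor. The function name and the toy ratings are hypothetical. Under these definitions the coefficient of determination equals one minus the relative squared error, which agrees with both columns of the table.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE, RAE, RSE and R^2 against a mean-value baseline."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    abs_err = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    sq_err = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    abs_base = sum(abs(t - mean_y) for t in y_true)
    sq_base = sum((t - mean_y) ** 2 for t in y_true)
    rse = sq_err / sq_base
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae": abs_err / abs_base,
        "rse": rse,
        "r2": 1.0 - rse,
    }


# Hypothetical true and predicted star ratings.
metrics = regression_metrics([5, 4, 1, 3, 2], [4, 4, 2, 3, 3])

# Consistency check on the reported table values: R^2 = 1 - RSE.
forest_gap = abs((1.0 - 0.986502) - 0.013498)
boosted_gap = abs((1.0 - 0.627851) - 0.372149)
```

Relative errors near 1 mean the model is barely better than always predicting the mean rating, which is how the Decision Forest column reads.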
Text Analytics
• Analyzing text from the "review_headline" column to predict the sentiment.
• star_rating > 3 means positive sentiment; otherwise the sentiment is negative.
• 70:30 Split Train/Test
• Tokenizer splits the review into individual words.
• StopWordsRemover removes common words.
• HashingTF generates numeric vectors from the text values.
• LogisticRegression algorithm is used to train the model.
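The feature-preparation steps listed above can be sketched in plain Python as follows. This mirrors the stages conceptually and is not the Spark ML API: the stop-word list is a tiny illustrative subset, the CRC32 hash is a stand-in for the hash function Spark uses, and `num_features=16` is an arbitrary choice. Training the logistic regression model itself is left out.

```python
import zlib

# Tiny illustrative stop-word list (an assumption, far shorter than a real one).
STOP_WORDS = {"a", "an", "the", "is", "it", "this", "and", "of", "to"}


def label_from_rating(star_rating):
    """Ratings above 3 count as positive sentiment (1), the rest as negative (0)."""
    return 1 if star_rating > 3 else 0


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens):
    """Drop common words that carry little sentiment."""
    return [t for t in tokens if t not in STOP_WORDS]


def hashing_tf(tokens, num_features=16):
    """Count tokens into a fixed-length vector using the hashing trick."""
    vec = [0] * num_features
    for t in tokens:
        vec[zlib.crc32(t.encode("utf-8")) % num_features] += 1
    return vec


# Hypothetical review headline and rating.
headline = "This is the best purchase"
tokens = remove_stop_words(tokenize(headline))
features = hashing_tf(tokens)
label = label_from_rating(5)
```

Hashing keeps the vector length fixed regardless of vocabulary size, at the cost of occasional collisions between different words.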
Text Analytics
AUR (area under the ROC curve) = 0.7101944679648156
Time Taken : 4 Minutes
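Assuming AUR here denotes the area under the ROC curve, the metric can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the function name and the toy labels and scores are hypothetical.

```python
def area_under_roc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical sentiment labels and predicted probabilities.
auc = area_under_roc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0.71 sits between the two.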
CONCLUSION
Azure ML                                  RMSE       Time Taken
Matchbox Recommender                      1.222147   6 Minutes
Decision Forest Regression                1.143199   30 Minutes
Boosted Decision Tree Regression          0.911949   1 Hour 30 Mins

Spark ML                                  RMSE       Time Taken
Collaborative Filtering                   1.729060   30 Minutes

Spark ML                                  AUR        Time Taken
Text Analytics with Logistic Regression   0.710194   4 Minutes
SUMMARY
• A recommendation model was implemented for item recommendation and rating prediction.
• It can help match customers with their preferred items.
• Based on RMSE, for the recommendation model, AzureML performed better than Spark ML.
• Boosted Decision Tree performed better than Decision Forest in rating prediction.
• Text analysis helps to understand customer sentiment and satisfaction with a product.
CHALLENGES FACED
AZURE ML
Error 0138: Memory has been exhausted, unable to complete running of module. Process exited with error code -2
• Save the experiment under a new name.
• Delete the old experiment.
• This will release the memory.
CHALLENGES FACED
SPARK ML
• Keeping the file size below 2 GB resolves the issue.
• The resolution to the Spark driver being stopped unexpectedly is unknown.
GITHUB LINK
https://github.com/monika2403/mmishra2/tree/master/CIS%205560