Data cleaning on applications’ rating and review from Google Play Store
Towards Data Cleaning
Xiaotong Hu
Shang Li
Yi Chun Liu
Yu-Chen Su
Contents
1. Datasets and Use Cases
2. Data Cleaning Method and Process
3. Building Database
4. Text Data Cleaning
5. Workflow Model
6. Future Work
Datasets and Use Cases
Datasets
● Web crawler data
● Google Play Store Rating
● Google Play Store Users Review
Use Cases
● Product managers can review their applications through the rating scores and comments on Google Play Store
● Continuously optimize application products
Data Cleaning Method and Process
| Dataset | Column | Method |
|---|---|---|
| Google Play Store | Rating | 1. Replace “NaN” with “null” 2. Transform to number |
| Google Play Store | Reviews | 1. Transform to number |
| Google Play Store | Size | 1. Convert “M” (megabytes) to kilobytes 2. Remove “k” 3. Replace “Varies with device” with “00000” 4. Transform to number |
| Google Play Store | Installs | 1. Remove “+” and “,” 2. Transform to number |
| Google Play Store | Type | 1. Replace “NaN” with “null” 2. Create dummy variable column 3. Transform dummy variable column to number |
| Google Play Store | Price | 1. Remove “$” 2. Transform to number |
| Google Play Store | Genres | 1. Split into two columns by “;” |
| Google Play Store User Reviews | Sentiment Polarity | 1. Replace empty cells with “Null” 2. Transform to number |
| Google Play Store User Reviews | Sentiment Subjectivity | 1. Replace empty cells with “Null” 2. Transform to number |
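The column methods above were applied in OpenRefine. As a rough Python equivalent, the sketch below shows one way the same transformations could look. The function names are hypothetical, the factor of 1024 between "M" and "k" is an assumption (the slides do not state it), and the "Free"/"Paid" labels for Type are likewise assumed rather than taken from the slides.

```python
from typing import Optional, Tuple


def parse_rating(raw: str) -> Optional[float]:
    """Rating: replace "NaN" (or an empty cell) with null, else convert to a number."""
    raw = raw.strip()
    return None if raw in ("", "NaN") else float(raw)


def parse_size_kb(raw: str) -> float:
    """Size: convert "M" values to k, drop the "k" suffix, map "Varies with device" to 0."""
    raw = raw.strip()
    if raw == "Varies with device":
        return 0.0  # the slides use "00000" as the placeholder value
    if raw.endswith("M"):
        return float(raw[:-1]) * 1024  # assumed conversion factor
    if raw.endswith("k"):
        return float(raw[:-1])
    return float(raw)


def parse_installs(raw: str) -> int:
    """Installs: remove "+" and "," and convert to a number."""
    return int(raw.replace("+", "").replace(",", ""))


def parse_price(raw: str) -> float:
    """Price: remove "$" and convert to a number."""
    return float(raw.replace("$", ""))


def type_dummy(raw: str) -> Optional[int]:
    """Type: null for "NaN", otherwise a 0/1 dummy (assumed labels "Free" and "Paid")."""
    raw = raw.strip()
    if raw in ("", "NaN"):
        return None
    return 1 if raw == "Paid" else 0


def split_genres(raw: str) -> Tuple[str, Optional[str]]:
    """Genres: split into two columns on ";"; the second is null when absent."""
    first, _, second = raw.partition(";")
    return first, (second or None)
```

For example, `parse_installs("10,000+")` yields `10000`, and `split_genres("Tools")` yields `("Tools", None)`.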
Data Cleaning - Special Cases
● Remove rows whose values are mismatched with the columns.
● The Translated_Review column is cleaned with Python, since OpenRefine is not efficient at removing punctuation
# Example comment
I like eat delicious food. That's I'm cooking
food myself, case "10 Best Foods" helps lot,
also "Best Before (Shelf Life)"
Import Data to MySQL
After using OpenRefine to clean up the data, we are able to import the data into a MySQL database.
Schema & Datatype:
GooglePlayStore
| Column | Datatype |
|---|---|
| App | TEXT |
| Category | TEXT |
| Rating | DOUBLE |
| Reviews | INT |
| Size | INT |
| Installs | INT |
| Type | TEXT |
| Typedummy | INT |
| Price | INT |
| ContentRating | TEXT |
| Genres | TEXT |
| Genres1 | TEXT |
| Genres2 | TEXT |
| LastUpdated | DATETIME |
| CurrentVer | TEXT |
| AndroidVer | TEXT |
Reviews
| Column | Datatype |
|---|---|
| App | TEXT |
| Translated_Review | TEXT |
| Sentiment | TEXT |
| Sentiment_Polarity | DOUBLE |
| Sentiment_Subjectivity | DOUBLE |
Integrity Constraints Violation Check
Rules:
● if Sentiment_Polarity > 0 => Sentiment is POSITIVE
● if Sentiment_Polarity < 0 => Sentiment is NEGATIVE
● if Sentiment_Polarity = 0 => Sentiment is NEUTRAL
● if Sentiment_Polarity IS NULL => Sentiment IS NULL
No violation found
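The slides do not show how the check was run. A minimal sketch of the four rules in Python, assuming each review is a dict keyed by the column names from the schema, might look like the following. The helper names and the sample rows are hypothetical, and the comparison ignores letter case as an assumption about how the labels are stored.

```python
from typing import Dict, List, Optional


def expected_sentiment(polarity: Optional[float]) -> Optional[str]:
    """Derive the sentiment label that the rules require for a polarity value."""
    if polarity is None:
        return None
    if polarity > 0:
        return "POSITIVE"
    if polarity < 0:
        return "NEGATIVE"
    return "NEUTRAL"


def find_violations(rows: List[Dict]) -> List[Dict]:
    """Return every row whose Sentiment disagrees with its Sentiment_Polarity."""
    violations = []
    for row in rows:
        label = row["Sentiment"]
        actual = label.upper() if label is not None else None
        if actual != expected_sentiment(row["Sentiment_Polarity"]):
            violations.append(row)
    return violations


# Illustrative rows only, not taken from the dataset.
sample_rows = [
    {"App": "Example App A", "Sentiment": "Positive", "Sentiment_Polarity": 0.5},
    {"App": "Example App A", "Sentiment": "Neutral", "Sentiment_Polarity": 0.0},
    {"App": "Example App B", "Sentiment": None, "Sentiment_Polarity": None},
    {"App": "Example App B", "Sentiment": "Positive", "Sentiment_Polarity": -0.2},
]
violations = find_violations(sample_rows)
```

An empty result from `find_violations` corresponds to the "no violation found" outcome reported above; in the illustrative sample, only the last row is flagged.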
Join Two Tables into One
[Figure: SQL syntax]
Joined table:
● 70,471 observations
● 20 variables
● 17.7 MB in CSV
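The SQL used for the join is not preserved in the text. As a hedged sketch, the example below joins the two tables on the shared App column, which is an assumption (it is the only column both schemas have in common, and 16 + 5 columns minus one duplicate App gives the 20 variables reported). It uses Python's built-in `sqlite3` as a stand-in for MySQL, an inner join as an assumed join type, a subset of columns, and made-up sample rows.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL database in the slides.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE GooglePlayStore (App TEXT, Category TEXT, Rating DOUBLE);
    CREATE TABLE Reviews (App TEXT, Translated_Review TEXT, Sentiment TEXT);

    -- Illustrative rows only, not taken from the dataset.
    INSERT INTO GooglePlayStore VALUES ('Example App A', 'EXAMPLE', 4.5);
    INSERT INTO GooglePlayStore VALUES ('Example App B', 'EXAMPLE', 3.9);
    INSERT INTO Reviews VALUES ('Example App A', 'Great recipes', 'Positive');
    INSERT INTO Reviews VALUES ('Example App A', 'Too many ads', 'Negative');
    """
)

# Assumed join key: App. Each review row is paired with its app's store record.
joined = conn.execute(
    """
    SELECT g.App, g.Category, g.Rating, r.Translated_Review, r.Sentiment
    FROM GooglePlayStore AS g
    JOIN Reviews AS r ON r.App = g.App
    ORDER BY r.Translated_Review
    """
).fetchall()
conn.close()
```

With an inner join, apps that have no reviews (here "Example App B") drop out of the result; a `LEFT JOIN` would keep them with null review columns.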
Text Review Data Cleaning
● Compute the keyword frequency for each sentiment category
● Python - Natural Language Toolkit (NLTK)
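The keyword-frequency goal can be sketched with the standard library alone. The example assumes reviews have already been cleaned into token lists; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def keyword_frequency(
    reviews: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Counter]:
    """Count how often each token appears within each sentiment category."""
    freq: Dict[str, Counter] = defaultdict(Counter)
    for sentiment, tokens in reviews:
        freq[sentiment].update(tokens)
    return freq


# Illustrative (sentiment, tokens) pairs, not taken from the dataset.
sample_reviews = [
    ("Positive", ["delicious", "food", "best", "food"]),
    ("Positive", ["best", "recipes"]),
    ("Negative", ["many", "ads"]),
]
freq = keyword_frequency(sample_reviews)
top_positive = freq["Positive"].most_common(2)
```

`Counter.most_common(n)` then gives the top keywords for a category, which is the kind of summary a product manager could read directly.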
Step 1: Remove punctuation
Remove string punctuation, including
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Example comment, before:
[‘I like eat delicious food. That’s I’m cooking food myself, case “10 Best Foods” helps lot, also “Best Before (Shelf Life)”’]
After:
[‘I like eat delicious food Thats Im cooking food myself case 10 Best Foods helps lot also Best Before Shelf Life’]
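A minimal sketch of the punctuation step, assuming it relies on Python's `string.punctuation` (the character set listed above matches it). The function name is hypothetical, and the example uses the straight-quote version of the comment.

```python
import string


def remove_punctuation(text: str) -> str:
    """Delete every character that appears in string.punctuation."""
    # Note: string.punctuation covers ASCII punctuation only, so typographic
    # (curly) quotes would need to be added to the table separately.
    return text.translate(str.maketrans("", "", string.punctuation))


comment = (
    "I like eat delicious food. That's I'm cooking food myself, "
    'case "10 Best Foods" helps lot, also "Best Before (Shelf Life)"'
)
cleaned = remove_punctuation(comment)
```

`str.translate` with a deletion table makes a single pass over the string, which is why it is a common choice for this step.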
Step 2: Tokenizer
Splits a string into substrings using a regular expression (the tokens are also lowercased)
['i', 'like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'myself', 'case', '10', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'before', 'shelf', 'life']
Step 3: Remove stop words
Remove common words that carry too little meaning to be useful in search queries
['like', 'eat', 'delicious', 'food', 'thats', 'im', 'cooking', 'food', 'case', '10', 'best', 'foods', 'helps', 'lot', 'also', 'best', 'shelf', 'life']
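Steps 2 and 3 can be sketched without NLTK. The example below uses `re.findall` with the pattern `\w+` as an assumed stand-in for the regular-expression tokenizer, and a small hand-written stop-word set in place of NLTK's English stop-word list. Both the pattern and the set are assumptions for illustration.

```python
import re
from typing import List

# Small illustrative stop-word set; NLTK's English list is much longer.
STOP_WORDS = {"i", "me", "my", "myself", "the", "a", "an", "and", "before", "after"}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into runs of word characters."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens: List[str]) -> List[str]:
    """Drop every token that appears in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


cleaned = (
    "I like eat delicious food Thats Im cooking food myself case "
    "10 Best Foods helps lot also Best Before Shelf Life"
)
tokens = tokenize(cleaned)
keywords = remove_stop_words(tokens)
```

On the example comment this drops 'i', 'myself', and 'before', which are the same three tokens missing from the second list above.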
Step 4: Stemming & Lemmatization
Stemming
- Stemming reduces words to their stems, such as treating "cats" as "cat" and "effective" as "effect"
- A word may no longer express its complete meaning after stemming
vs.
Lemmatization
- Lemmatization transforms a word into its dictionary form (lemma), such as treating “drove” and “driving” as “drive”
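To make the contrast concrete, here is a deliberately simplified sketch: a toy suffix-stripping stemmer and a toy dictionary-lookup lemmatizer. Neither is the algorithm NLTK uses (NLTK provides `PorterStemmer` and `WordNetLemmatizer` for real work); the suffix list and the lookup table are assumptions chosen only to reproduce the examples above.

```python
# Toy suffix list, checked in order; a real stemmer applies many more rules.
SUFFIXES = ("ive", "ing", "s")

# Toy lemma dictionary; a real lemmatizer consults a full lexicon.
LEMMAS = {"drove": "drive", "driving": "drive", "cats": "cat"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the lemma dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)


stems = [toy_stem(w) for w in ("cats", "effective", "driving", "drove")]
lemmas = [toy_lemmatize(w) for w in ("cats", "driving", "drove")]
```

The stemmer turns "driving" into "driv", which is not a word, and leaves the irregular form "drove" unchanged, while the lemmatizer maps both to "drive". That is the trade-off described above: stemming is mechanical and can lose meaning, while lemmatization needs vocabulary knowledge.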
Workflow Model
[Figure: YesWorkflow model (OpenRefine), Google Play Store Rating]
[Figure: YesWorkflow model (OpenRefine), Google Play Store Users Review]
Future Work
● Each team member was responsible for a different part of the work, using different tools
● Because of the constraints of each tool, it was difficult to collaborate during the data cleaning process
● Study how to improve collaboration efficiency when team members use different tools
Thank you for listening
Any Questions?
