IDA MOOC
Coursera Data Science Capstone
Week 2 + 3 (Getting and Cleaning data + Exploratory Data)
name Kee Yuan Chuan
github kylase
email <redacted>
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning
Content
• Problem Statement
• What are we dealing with?
• Additional Content (not using R)
What is the product at the end of the project?
https://kylase-coursera.shinyapps.io/capstone-app
Data
That’s when I thought nothing is as it seems.No tomorrow can be predicted accurately .We can just
plan,hope,wish or pray for things but then there is a higher power in action who makes everyday different
for every person so that he/she can learn ,grow and live to be a better person.

The Atlantic Yards development is supported by a wide range of elected officials, unions, community
leaders, issue advocates, urban development experts, religious leaders and organizations, local businesses
and thousands of fans all across Brooklyn and the New York area.

In this gauntlet of openness to new ways of thinking, Mercury makes an action-demanding link to Pluto the
transformer. Caution goes out the window. Raw truths, even secrets, come to the surface. Game-changing
information pops out. Thinking changes, permanently. Certain concepts aren’t going along for the ride any
longer. The cumulative result is streamlining, liberating, motivating and energizing.

We’ll be selling soda at the county fair this Sunday afternoon as part of their fundraiser.

Jim was a poet and a musician. He wrote some beautiful pieces and I’ve always been drawn to him since I
heard the Doors for the first time at an old boyfriend’s house. I’ve always be

en inspired to express myself both with lyrics and stories or visually with photographs and
drawings.Friend, let’s stand before God often and ask Him to reveal things in our life that are not
pleasing to Him. As we do that, the world will not have any stones to throw. They will be drawn to Jesus
and living their life for Him just like we should.

In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.

We love you Mr. Brown.

If you have an alternative argument, let's hear it! :)

If I were a bear,

Other friends have similar stories, of how they were treated brusquely by Laurelwood staff, and as often
as not, the same names keep coming up. About a half-dozen friends of mine refuse to step foot in there
ever again because of it. How many others they’re telling - and keeping away - one can only guess.
Where is the table?
One of your jobs is to create that 😅
Raw Data → Cleaned Data → n-gram data → Your Predictive Model
Today’s coverage
Concepts in NLP
• Stopwords
• Stemming
• Part-of-speech tagging
• Edit distance
• N-grams
• Tokenisation
Differences between previous courses and capstone
Tabular data (previous courses)
• Cleaning tabular data typically means you
  • fill missing values with the mean/median/mode
  • remove columns with no variance
  • remove columns that are 100% unique (e.g. id)
  • select, combine or compute new columns as features
Text data (Capstone)
• Applying the concepts (previous slide) changes your data and the features available to the model
• Which concepts are relevant?
  • Stemming?
  • Part-of-speech tagging?
  • Stopwords?
  • Edit distance?
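As a small illustration of how one of these concepts changes the features, the sketch below removes stopwords from a sentence and compares the resulting bigrams. The stopword list is a tiny hypothetical one chosen for this example, not a list from any package.

```python
# Minimal sketch: how stopword removal changes the bigram features.
# STOPWORDS is a tiny hypothetical list used only for illustration.
STOPWORDS = {"is", "on", "the", "a"}


def bigrams(tokens):
    """Return consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in stopwords]


tokens = "there is something going on right now".split()

print(bigrams(tokens))                    # includes ("going", "on")
print(bigrams(remove_stopwords(tokens)))  # "going" is now followed by "right"
```

With stopwords removed, the model can no longer learn that "going" is often followed by "on", which matters if the goal is to predict the next word a user types.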
Are these two sentences the same?
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
Getting data and verifying data integrity
• Getting data seems trivial and easy: just click Download
• How do you ensure that the file you received is the file you expected?
• Check the file signature, i.e. a hash (MD5, SHA-1, etc.)
• MD5 and SHA-1 are hashing algorithms that convert content of any size into a fixed-length string of hexadecimal digits
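A minimal sketch of such a check in Python, using the standard hashlib module. The function names are made up for this example, and the file name and expected hash in the usage comment are placeholders; this is not the R demo linked from the slides.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_hex, algorithm="sha1"):
    """Compare a file's digest with a published value (case-insensitive)."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()


# Usage (placeholder file name and hash):
# verify("dataset.zip", "<hash published by the data provider>", algorithm="md5")
```

An MD5 digest is 32 hexadecimal digits long and a SHA-1 digest is 40, regardless of the size of the file.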
File Integrity and Content Check Demo
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning/blob/master/file_integrity_demo.R
Take home message from the demo
• It is good practice to check your file's signature
• Text that means the same to us may not be the same to a computer.
• Such differences will affect your work and its outcome
Understanding the data
• Blogs, news and tweets are modern web content
• The content is not just letters, numbers and punctuation but also
  • Emoticons 😬😀
  • Platform-specific terms: @mention, #hashtag, stock symbols
Dealing with content
• One of the tasks in the Capstone is to remove profanity
• Several approaches (they apply to other unwanted content as well)
1. Remove that word only
2. Remove the line containing the word
3. Replace the word with a placeholder
• The key question to ask is: How will my action affect the outcome?
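The three approaches can be sketched as follows. The blocklist holds two mild placeholder words, and the function names and the placeholder token are assumptions made for this example.

```python
import re

# Hypothetical blocklist; a real one would be loaded from a file.
BLOCKLIST = {"darn", "heck"}
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(BLOCKLIST))) + r")\b",
    re.IGNORECASE,
)


def remove_word(lines):
    """Approach 1: delete only the offending word (then tidy the whitespace)."""
    return [re.sub(r"\s+", " ", _PATTERN.sub("", line)).strip() for line in lines]


def remove_line(lines):
    """Approach 2: drop every line that contains an offending word."""
    return [line for line in lines if not _PATTERN.search(line)]


def replace_word(lines, placeholder="<profanity>"):
    """Approach 3: keep the line but swap the word for a placeholder token."""
    return [_PATTERN.sub(placeholder, line) for line in lines]


lines = ["what the heck is this", "a clean line"]
print(remove_word(lines))   # first line becomes "what the is this"
print(remove_line(lines))   # only "a clean line" survives
print(replace_word(lines))  # first line keeps its length, word is masked
```

Each choice affects the n-grams differently: approach 1 creates the bigram "the is", which never occurred in the text; approach 2 discards clean words along with the offending one; approach 3 keeps the neighbours apart but adds a token that has to be filtered out later.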
Understanding what your tokeniser does
• Tokenisers are available from the NLP and RWeka packages, among others
• RWeka uses Weka: http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html
• Different tokenisers behave differently and therefore produce different outcomes.
Tokenisation Demo
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning/blob/master/tokeniser_demo.R
Bigrams from NLP
something"	going 1
“something”	going 1
going	on 2
he's	there. 1
he’s	there. 1
is	"something" 1
is	“something” 1
like	right... 1
like	right… 1
now,	like 2
now.	he's 1
now.	he’s 1
on	right 2
right	now, 2
right...	now. 1
right…	now. 1
there	is 2

Bigrams from RWeka
going	on 1
he	s 1
is	something 1
like	right 1
now	he 1
now	like 1
on	right 1
right	now 2
s	there 1
something	going 1
there	is 1
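The contrast between the two bigram lists can be roughly reproduced in Python with two toy tokenisers: one that splits on whitespace only, and one that splits on a set of delimiter characters. Both are assumptions made for illustration and are not the actual NLP or RWeka implementations; the delimiter set is modelled loosely on a typical n-gram tokeniser default.

```python
import re
from collections import Counter


def bigrams(tokens):
    """Return consecutive token pairs."""
    return list(zip(tokens, tokens[1:]))


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to the words."""
    return text.lower().split()


# Assumed delimiter set: whitespace plus common ASCII punctuation.
_DELIMITERS = re.compile(r"[\s.,;:'\"()?!]+")


def delimiter_tokens(text):
    """Split on ASCII delimiters; curly quotes and the ellipsis are NOT delimiters."""
    return [t for t in _DELIMITERS.split(text.lower()) if t]


curly = "There is \u201csomething\u201d going on right now. Like right\u2026 now. He\u2019s there."
plain = "There is \"something\" going on right now. Like right... now. He's there."

for name, tokenise in [("whitespace", whitespace_tokens), ("delimiter", delimiter_tokens)]:
    for label, sentence in [("curly", curly), ("plain", plain)]:
        print(name, label, Counter(bigrams(tokenise(sentence))))
```

On the plain ASCII sentence the delimiter tokeniser splits "he's" into "he" and "s", while the curly apostrophe in the other sentence is not a delimiter and stays inside the token.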
What happened?
• These 2 sentences are the same to us, but not to the computer.
• Modern web content is stored in a character encoding, and visually similar characters can have different codes.
• The left and right double quotation marks, the apostrophe and the ellipsis in the first sentence are non-ASCII Unicode characters; the second sentence uses their plain ASCII look-alikes.
• Deal with them as previously mentioned: remove or replace.
• But consider whether you want to predict when your user types “something”, something, or both.
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
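A minimal sketch of the "replace" option in Python: map the typographic characters to their ASCII look-alikes before tokenising. The mapping covers only the characters seen in the two sentences above, and the function names are made up for this example.

```python
# Map typographic punctuation to ASCII look-alikes (only the cases seen above).
PUNCT_MAP = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u2026": "...",  # horizontal ellipsis
})


def normalise_punctuation(text):
    """Replace typographic quotes and ellipsis with ASCII equivalents."""
    return text.translate(PUNCT_MAP)


def non_ascii_chars(text):
    """List the distinct non-ASCII characters in a string, for inspection."""
    return sorted({ch for ch in text if ord(ch) > 127})


curly = "There is \u201csomething\u201d going on right now. Like right\u2026 now. He\u2019s there."
plain = "There is \"something\" going on right now. Like right... now. He's there."

print(curly == plain)                         # False
print([hex(ord(c)) for c in non_ascii_chars(curly)])
print(normalise_punctuation(curly) == plain)  # True
```

Listing the non-ASCII characters first is a cheap way to see what a corpus contains before deciding which characters to replace and which to remove.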
Post-processing
• Why is post-processing needed?
  • Tokenisers are limited in capability
  • To get cleaner data
something	going
going	on
he's	there
is	something
like	right
on	right
right	now
there	is
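One possible post-processing step, sketched in Python under stated assumptions: start from whitespace-split tokens, skip bigrams that span a sentence boundary, and strip punctuation from the edges of each token. These rules are a guess at what produces a list like the one above; they are not taken from the demo code.

```python
import string
from collections import Counter

SENTENCE_END = (".", "!", "?")


def clean_bigrams(tokens):
    """Count bigrams, skipping sentence boundaries and stripping edge punctuation."""
    counts = Counter()
    for first, second in zip(tokens, tokens[1:]):
        if first.endswith(SENTENCE_END):
            continue  # the pair spans a sentence boundary
        a = first.strip(string.punctuation)   # keeps inner apostrophes, e.g. he's
        b = second.strip(string.punctuation)
        if a and b:
            counts[(a, b)] += 1
    return counts


plain = "There is \"something\" going on right now. Like right... now. He's there."
for pair, n in sorted(clean_bigrams(plain.lower().split()).items()):
    print(pair, n)
```

Note that "right... now" is also skipped here because the first token ends with a full stop; whether an ellipsis should count as a sentence boundary is a design decision to make for your own model.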
Additional Content
Alternative Method
• Why?
  • R can be slow on text data of this size
  • There is often a faster way to do the same thing, which lets you fail fast, learn more and explore your data.
• For people with programming experience outside R, and anyone else who is interested
  • Unix commands (grep, wc, awk, etc.)
  • Python
    • NLTK (Natural Language Toolkit)
Benefits?
• Processed 100% of all 3 text files within 16 GB of RAM
  • Caveat: do you need all the data for a representative model?
• Processing time is about 10 to 30 minutes for each text file, depending on tokenisation length
• Failing fast allows you to discover your data
[Bar chart: time taken to do Quiz 1 (sec), *nix CLI vs R; axis from 0 to 40]
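One way to keep memory use low when exploring a large text file is to stream it line by line instead of loading it whole, similar in spirit to what wc does on the command line. The sketch below is a hypothetical helper, not code from the linked repository, and the file name in the usage comment is a placeholder.

```python
def text_file_stats(path, encoding="utf-8"):
    """Stream a text file once and return (line count, longest line length)."""
    line_count = 0
    longest = 0
    with open(path, encoding=encoding) as f:
        for line in f:  # the file object yields one line at a time
            line_count += 1
            longest = max(longest, len(line.rstrip("\r\n")))
    return line_count, longest


# Usage (placeholder file name):
# lines, longest = text_file_stats("blogs.txt")
```

Because only one line is held in memory at a time, the same loop works for files far larger than the available RAM.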
Python NLTK Bigram Tokeniser
• Uses cleaned data (Unicode characters handled)
• Capital letters, end-of-sentence and clause details are all preserved.
• Post-processing can then filter out unigrams, invalid bigrams, etc.
There	is 1
on	right 1
...	now 1
He	's 1
like	right 1
``	something 1
right	... 1
,	like 1
now	, 1
something	'' 1
now	. 1
there	. 1
'	going 1
going	on 1
right	now 1
s	there 1
is	`` 1
If you are interested in how Unix tools and Python are used, check out
https://github.com/kylase/data-science-capstone
Q&A
Thank you for your attention
