IDA MOOC
Coursera Data Science Capstone
Week 2 + 3 (Getting and Cleaning data + Exploratory Data)
name Kee Yuan Chuan
github kylase
email <redacted>
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning
Content
• Problem Statement
• What are we dealing with?
• Additional Content (not using R)
What is the product at the end of the project?
https://kylase-coursera.shinyapps.io/capstone-app
Data
That’s when I thought nothing is as it seems.No tomorrow can be predicted accurately .We can just
plan,hope,wish or pray for things but then there is a higher power in action who makes everyday different
for every person so that he/she can learn ,grow and live to be a better person.

The Atlantic Yards development is supported by a wide range of elected officials, unions, community
leaders, issue advocates, urban development experts, religious leaders and organizations, local businesses
and thousands of fans all across Brooklyn and the New York area.

In this gauntlet of openness to new ways of thinking, Mercury makes an action-demanding link to Pluto the
transformer. Caution goes out the window. Raw truths, even secrets, come to the surface. Game-changing
information pops out. Thinking changes, permanently. Certain concepts aren’t going along for the ride any
longer. The cumulative result is streamlining, liberating, motivating and energizing.

We’ll be selling soda at the county fair this Sunday afternoon as part of their fundraiser.

Jim was a poet and a musician. He wrote some beautiful pieces and I’ve always been drawn to him since I
heard the Doors for the first time at an old boyfriend’s house. I’ve always be

en inspired to express myself both with lyrics and stories or visually with photographs and
drawings.Friend, let’s stand before God often and ask Him to reveal things in our life that are not
pleasing to Him. As we do that, the world will not have any stones to throw. They will be drawn to Jesus
and living their life for Him just like we should.

In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.

We love you Mr. Brown.

If you have an alternative argument, let's hear it! :)

If I were a bear,

Other friends have similar stories, of how they were treated brusquely by Laurelwood staff, and as often
as not, the same names keep coming up. About a half-dozen friends of mine refuse to step foot in there
ever again because of it. How many others they’re telling - and keeping away - one can only guess.
Where is the table?
One of your jobs is to create that
😅
Raw Data -> Cleaned Data -> n-gram data -> Your Predictive Model
Today’s coverage
Concepts in NLP
• Stopwords
• Stemming
• Part-of-speech tagging
• Edit distance
• N-grams
• Tokenisation
Differences between previous courses and the Capstone
Tabular data (previous courses)
• Cleaning tabular data typically means you
• fill missing values with the mean/median/mode
• remove columns with no variance
• remove columns that are 100% unique (e.g. id)
• select, combine or compute new columns as features
Text data (Capstone)
• Applying the concepts from the previous slide affects your data and the features for your model
• Which concepts are relevant?
• Stemming?
• Part-of-speech tagging?
• Stopwords?
• Edit distance?
Are these two sentences the same?
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
Getting data and verifying data integrity
• Getting data seems trivial and easy: just click Download
• How do you ensure that you got the file you expected?
• Check the file signature, expressed as a hash (MD5, SHA-1, etc.)
• MD5 and SHA-1 are hashing algorithms that convert content into a fixed-length hexadecimal string
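The check described above can be sketched in Python with the standard-library hashlib module. This is only an illustration and not the R demo from the talk; the function name file_digest and the chunk size are assumptions made for this example.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hexadecimal digest of a file.

    The file is read in chunks so that large text files do not have to
    fit in memory. `algorithm` is any name accepted by hashlib.new,
    for example "md5" or "sha1".
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

To verify a download, compare the returned string with the checksum published alongside the file, if one is available. A single changed byte in the file produces a different digest.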
File Integrity and Content Check Demo
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning/blob/master/file_integrity_demo.R
Take-home message from the demo
• It is good practice to check your file signature
• Text that means the same to us may not be the same to a computer
• This will affect your work and its outcome
Understanding the data
• Blogs, news and tweets are modern web content
• Content is not just letters, numbers and punctuation but also
• Emoticons 😬😀
• Platform-specific terms: @mention, #hashtag, stock symbols
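One way to see how much of this content a line contains is to pull it out with regular expressions. The patterns below are rough assumptions for illustration (for instance, a stock symbol is taken to be "$" followed by one to six letters) and are not rules from the talk or from any platform.

```python
import re

# Rough, illustrative patterns; real platform rules are more involved.
MENTION = re.compile(r"(?<!\w)@\w+")
HASHTAG = re.compile(r"(?<!\w)#\w+")
STOCK_SYMBOL = re.compile(r"(?<!\w)\$[A-Za-z]{1,6}\b")


def platform_terms(line):
    """Collect mentions, hashtags, stock symbols and non-ASCII characters."""
    return {
        "mentions": MENTION.findall(line),
        "hashtags": HASHTAG.findall(line),
        "stock_symbols": STOCK_SYMBOL.findall(line),
        # Emoticons and typographic punctuation both show up here.
        "non_ascii": [ch for ch in line if ord(ch) > 127],
    }
```

Running this over a sample of lines gives a quick feel for how common each kind of term is before deciding whether to keep, remove or replace it.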
Dealing with content
• One of the tasks in the Capstone is to remove profanity
• Several approaches (these apply to other unwanted content as well)
1. Remove that word only
2. Remove the line containing the word
3. Replace the word with a placeholder
• The key question to ask is: How will my action affect the outcome?
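The three approaches can be compared side by side with a small sketch. The word list here is a harmless, hypothetical stand-in for a real profanity list, and the placeholder token is likewise an assumption.

```python
import re

# Hypothetical stand-in for a real profanity list.
BLOCKLIST = {"darn", "heck"}
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in sorted(BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)


def remove_word(lines):
    """Approach 1: drop only the offending word."""
    stripped = (PATTERN.sub("", line) for line in lines)
    # Collapse the double space left behind by the removal.
    return [re.sub(r"\s+", " ", line).strip() for line in stripped]


def remove_line(lines):
    """Approach 2: drop every line that contains an offending word."""
    return [line for line in lines if not PATTERN.search(line)]


def replace_word(lines, placeholder="<profanity>"):
    """Approach 3: keep the position of the word but mask it."""
    return [PATTERN.sub(placeholder, line) for line in lines]
```

On the line "well darn it", approach 1 yields "well it", which creates a bigram the writer never typed; approach 2 loses the whole line; approach 3 keeps the word order but adds a token that the model has to ignore at prediction time. Each choice changes the n-gram counts differently.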
Understanding what your tokeniser does
• A tokeniser can come from the NLP or RWeka packages, or from other packages
• RWeka uses Weka: http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html
• Different tokenisers behave differently and so produce different outcomes
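To make the point concrete, here is a minimal Python sketch with two toy tokenisers: one splits on whitespace only, the other splits on a set of delimiter characters. Neither is the NLP or RWeka implementation; the delimiter set is an assumption chosen for illustration.

```python
import re
from collections import Counter


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.lower().split()


def delimiter_tokens(text, delimiters=" \r\n\t.,;:'\"()?!"):
    """Split on any run of delimiter characters, discarding them."""
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, text.lower()) if token]


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))
```

For the input "He's there.", the first tokeniser yields the single bigram ("he's", "there.") while the second yields ("he", "s") and ("s", "there"). The text is the same, but the features a model would see are not.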
Tokenisation Demo
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning/blob/master/tokeniser_demo.R
Bigrams from NLP
"something"	going 1
“something”	going 1
going	on 2
he's	there. 1
he’s	there. 1
is	"something" 1
is	“something” 1
like	right... 1
like	right… 1
now,	like 2
now.	he's 1
now.	he’s 1
on	right 2
right	now, 2
right...	now. 1
right…	now. 1
there	is 2
Bigrams from RWeka
going	on 1
he	s 1
is	something 1
like	right 1
now	he 1
now	like 1
on	right 1
right	now 2
s	there 1
something	going 1
there	is 1
What happened?
• These two sentences are the same to us, but not to the computer.
• Modern web content is stored as encoded characters, and punctuation that looks the same can be encoded as different characters.
• The left and right double quotation marks, the apostrophe and the ellipsis on the first line are all non-ASCII Unicode characters.
• Deal with them as previously mentioned: remove or replace.
• But consider: do you want to predict when your user types “something”, something, or both?
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
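The "replace" option can be sketched as a small translation table that maps typographic punctuation to its ASCII counterpart. The table covers only the characters in the two example sentences; which characters to map, and to what, is an assumption to adapt to your own data.

```python
# Typographic punctuation mapped to ASCII look-alikes.
REPLACEMENTS = {
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (typographic apostrophe)
    "\u2026": "...",  # horizontal ellipsis
}
TABLE = str.maketrans(REPLACEMENTS)


def to_ascii_punctuation(text):
    """Replace typographic quotes and ellipses with ASCII equivalents."""
    return text.translate(TABLE)


curly = ("There is \u201csomething\u201d going on right now. "
         "Like right\u2026 now. He\u2019s there.")
plain = ('There is "something" going on right now. '
         "Like right... now. He's there.")

print(curly == plain)                        # the raw strings differ
print(to_ascii_punctuation(curly) == plain)  # equal after replacement
```

Replacing rather than removing keeps the apostrophe in "He's", so the word survives as one token if the tokeniser allows it.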
Post-processing
• Why is post-processing needed?
• Tokenisers are limited in capability
• It gives you cleaner data
something	going
going	on
he's	there
is	something
like	right
on	right
right	now
there	is
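A post-processing pass over raw bigram counts might look like the sketch below. The two rules used here (drop a bigram whose first token ends a sentence or clause, then strip punctuation from the edges of each token) are assumptions made for illustration, not necessarily the rules used to produce the list above.

```python
import re
from collections import Counter

# A first token ending in one of these marks ends a sentence or clause.
BOUNDARY = re.compile(r"[.,;:!?]$")
# Punctuation at the start or end of a token; inner apostrophes are kept.
EDGE_PUNCTUATION = re.compile(r"^[^\w]+|[^\w]+$")


def clean_bigrams(counts):
    """Drop boundary-crossing bigrams and strip edge punctuation."""
    cleaned = Counter()
    for (first, second), count in counts.items():
        if BOUNDARY.search(first):
            continue  # the pair spans two sentences or clauses
        first = EDGE_PUNCTUATION.sub("", first).lower()
        second = EDGE_PUNCTUATION.sub("", second).lower()
        if first and second:
            cleaned[(first, second)] += count
    return cleaned
```

Under these rules, ("now.", "he's") is dropped because it crosses a sentence boundary, while ("he's", "there.") becomes ("he's", "there").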
Additional Content
Alternative Method
• Why?
• R can be slow on text files of this size
• There is usually another, faster way to do the same thing, which lets you fail fast, learn more and explore your data
• For people with programming experience outside R, and anyone interested
• Unix commands (grep, wc, awk, etc.)
• Python
• NLTK (Natural Language Toolkit)
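As a rough idea of what the non-R route looks like, the sketch below streams a file line by line in Python, in the spirit of wc and grep. The function names are made up for this example, and the results can differ slightly from the Unix tools (for instance, wc -l counts newline characters, so a final line without a newline is counted here but not by wc).

```python
def count_lines(path, encoding="utf-8"):
    """Count lines without loading the file into memory (like wc -l)."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return sum(1 for _ in handle)


def count_matching(path, needle, encoding="utf-8"):
    """Count lines containing a fixed string (like grep -c -F)."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return sum(1 for line in handle if needle in line)


def longest_line_length(path, encoding="utf-8"):
    """Length in characters of the longest line, excluding the newline."""
    with open(path, encoding=encoding, errors="replace") as handle:
        return max((len(line.rstrip("\n")) for line in handle), default=0)
```

Because each function reads one line at a time, memory use stays small even for the largest of the text files.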
Benefits?
• Processed 100% of all 3 text files within 16 GB of RAM
• Caveat: do you need all the data for a representative model?
• Processing takes about 10 to 30 minutes per text file, depending on the tokenisation length
• Failing fast allows you to discover your data
[Bar chart: time taken to do Quiz 1 (sec), *nix CLI vs R; axis from 0 to 40]
Python NLTK Bigram Tokeniser
• Uses cleaned data (Unicode characters handled)
• Capital letters, end-of-sentence and clause details are all preserved
• Post-processing can then filter out unigrams, invalid bigrams, etc.
There	is 1
on	right 1
...	now 1
He	's 1
like	right 1
``	something 1
right	... 1
,	like 1
now	, 1
something	'' 1
now	. 1
there	. 1
'	going 1
going	on 1
right	now 1
s	there 1
is	`` 1
If you are interested in how Unix tools and Python are used, check out
https://github.com/kylase/data-science-capstone
Q&A
Thank you for your attention
