The document discusses various topics related to data exploration and cleaning for a Coursera Data Science Capstone project. It covers checking data integrity by comparing file signatures, understanding different data types like text which require preprocessing, and demonstrating various text preprocessing techniques in R and other languages like Python NLTK. The goal is to understand the data and prepare it for building a predictive model by applying natural language processing concepts. Alternative faster methods to the default R packages are also presented.
6. That’s when I thought nothing is as it seems.No tomorrow can be predicted accurately .We can just
plan,hope,wish or pray for things but then there is a higher power in action who makes everyday different
for every person so that he/she can learn ,grow and live to be a better person.
The Atlantic Yards development is supported by a wide range of elected officials, unions, community
leaders, issue advocates, urban development experts, religious leaders and organizations, local businesses
and thousands of fans all across Brooklyn and the New York area.
In this gauntlet of openness to new ways of thinking, Mercury makes an action-demanding link to Pluto the
transformer. Caution goes out the window. Raw truths, even secrets, come to the surface. Game-changing
information pops out. Thinking changes, permanently. Certain concepts aren’t going along for the ride any
longer. The cumulative result is streamlining, liberating, motivating and energizing.
We’ll be selling soda at the county fair this Sunday afternoon as part of their fundraiser.
Jim was a poet and a musician. He wrote some beautiful pieces and I’ve always been drawn to him since I
heard the Doors for the first time at an old boyfriend’s house. I’ve always be
en inspired to express myself both with lyrics and stories or visually with photographs and
drawings.Friend, let’s stand before God often and ask Him to reveal things in our life that are not
pleasing to Him. As we do that, the world will not have any stones to throw. They will be drawn to Jesus
and living their life for Him just like we should.
In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
We love you Mr. Brown.
If you have an alternative argument, let's hear it! :)
If I were a bear,
Other friends have similar stories, of how they were treated brusquely by Laurelwood staff, and as often
as not, the same names keep coming up. About a half-dozen friends of mine refuse to step foot in there
ever again because of it. How many others they’re telling - and keeping away - one can only guess.
12. Differences between previous courses and capstone
Tabular Data (previous courses)
• In data cleaning of tabular data, you typically
• fill missing values with mean/median/mode
• remove columns with no variance
• remove columns that are 100% unique (e.g. id)
• select or combine/compute new columns as features
Text data (Capstone)
• Applying the concepts (previous slides) affects your data and features for the model
• Which are the relevant concepts?
• Stemming?
• Part-of-speech tagging?
• Stopwords?
• Edit distance?
13. Are these two sentences the same?
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
14. Getting data and verifying data integrity
• Getting data seems trivial and easy: just click Download
• How do you ensure you are getting what you expect?
• Use a file signature in the form of a hash (MD5, SHA1, etc.)
• MD5 and SHA1 are hashing algorithms that convert content into a fixed-length string of hexadecimal digits
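The hash check described above can be sketched in a few lines. This is a minimal Python illustration using only the standard library `hashlib` module, not the R demo from the slides; the function names `file_digest` and `verify` are hypothetical, and the expected hash is assumed to come from whoever published the file.

```python
import hashlib


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Return the hexadecimal digest of a file, reading it in chunks."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path, expected_hex, algorithm="sha1"):
    """Compare a file's digest with a published one, ignoring letter case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

If `verify` returns `False`, the download is not byte-for-byte the file the publisher hashed, even if it looks the same when opened.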
15. File Integrity and Content Check Demo
https://github.com/kylase/IDA-MOOC-Data-Exploratory-Cleaning/blob/master/file_integrity_demo.R
16. Take home message from the demo
• It is good practice to check your file signature
• Something that means the same in our interpretation may not be the same to a computer.
• This will impact your work and outcome
17. Understanding the data
That’s when I thought nothing is as it seems.No tomorrow can be
predicted accurately .We can just plan,hope,wish or pray for things
but then there is a higher power in action who makes everyday
different for every person so that he/she can learn ,grow and live to
be a better person.
The Atlantic Yards development is supported by a wide range of
elected officials, unions, community leaders, issue advocates, urban
development experts, religious leaders and organizations, local
businesses and thousands of fans all across Brooklyn and the New York
area.
In this gauntlet of openness to new ways of thinking, Mercury makes
an action-demanding link to Pluto the transformer. Caution goes out
the window. Raw truths, even secrets, come to the surface. Game-
changing information pops out. Thinking changes, permanently. Certain
concepts aren’t going along for the ride any longer. The cumulative
result is streamlining, liberating, motivating and energizing.
We’ll be selling soda at the county fair this Sunday afternoon as
part of their fundraiser.
Jim was a poet and a musician. He wrote some beautiful pieces and
I’ve always been drawn to him since I heard the Doors for the first
time at an old boyfriend’s house. I’ve always be
en inspired to express myself both with lyrics and stories or
visually with photographs and drawings.Friend, let’s stand before God
often and ask Him to reveal things in our life that are not pleasing
to Him. As we do that, the world will not have any stones to throw.
They will be drawn to Jesus and living their life for Him just like
we should.
• Blogs, news and tweets are modern web content
• Content is not just letters, numbers and punctuation, but also
• Emoticons 😬😀
• Platform-specific terms: @mention, #hashtag, stock symbols
18. Dealing with content
• One of the tasks in the Capstone is to remove profanity
• Several approaches (applies to other things as well)
1. Remove that word only
2. Remove the line containing the word
3. Replace the word with a placeholder
• The key question to ask is: How will my action affect the outcome?
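The three approaches can be compared side by side. The sketch below is a Python illustration under stated assumptions: `BLOCKLIST` is a harmless stand-in word list, the `<PROFANITY>` placeholder is an arbitrary choice, and all function names are hypothetical.

```python
import re

# Hypothetical stand-in list; a real list would be loaded from a file.
BLOCKLIST = {"darn", "heck"}

# One case-insensitive pattern that matches any listed word as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(BLOCKLIST))) + r")\b",
    re.IGNORECASE,
)


def remove_word(lines):
    """Approach 1: delete only the offending word, then tidy the spacing."""
    return [re.sub(r"\s+", " ", _PATTERN.sub("", line)).strip() for line in lines]


def remove_line(lines):
    """Approach 2: drop every line that contains a listed word."""
    return [line for line in lines if not _PATTERN.search(line)]


def replace_word(lines, placeholder="<PROFANITY>"):
    """Approach 3: keep the line but swap the word for a placeholder."""
    return [_PATTERN.sub(placeholder, line) for line in lines]
```

Each choice changes the n-grams that follow. Removing only the word from "what the heck is this" leaves "what the is this", which creates the bigram "the is" that never occurred in the original text, while removing the whole line discards the clean words around it.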
19. Understanding what your tokeniser does
• A tokeniser can come from NLP, RWeka or other packages
• RWeka uses Weka: http://weka.sourceforge.net/doc.dev/weka/core/tokenizers/NGramTokenizer.html
• Different tokenisers behave differently and thus produce different outcomes.
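To see why the choice of tokeniser matters, compare two deliberately simple ones. This Python sketch does not reproduce the NLP or RWeka tokenisers; `whitespace_tokens` and `letter_tokens` are hypothetical helpers that only imitate the two broad behaviours: splitting on spaces versus splitting on every non-letter character.

```python
import re
from collections import Counter


def whitespace_tokens(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.lower().split()


def letter_tokens(text):
    """Keep runs of ASCII letters only; every other character is a delimiter."""
    return re.findall(r"[a-z]+", text.lower())


def bigram_counts(tokens):
    """Count adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))
```

With the whitespace tokeniser, "He's there." yields the bigram ("he's", "there."), while the letter-only tokeniser splits the apostrophe away and yields ("he", "s") and ("s", "there") instead.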
21. Bigrams from NLP
"something" going 1
“something” going 1
going on 2
he's there. 1
he’s there. 1
is "something" 1
is “something” 1
like right... 1
like right… 1
now, like 2
now. he's 1
now. he’s 1
on right 2
right now, 2
right... now. 1
right… now. 1
there is 2
Bigrams from RWeka
going on 1
he s 1
is something 1
like right 1
now he 1
now like 1
on right 1
right now 2
s there 1
something going 1
there is 1
22. What happened?
• These 2 sentences are the same to us, but not to the computer.
• In modern web content, character encoding matters.
• The double quotation marks (left and right), the apostrophe and the ellipsis (triple dots) on the first line are all non-ASCII Unicode characters.
• They can be dealt with as previously mentioned: remove or replace.
• But consider whether you want to predict when your user types “something”, something, or both.
There is “something” going on right now. Like right… now. He’s there.
There is "something" going on right now. Like right... now. He's there.
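The difference between the two sentences can be made visible, and removed, with a small mapping. This is a Python sketch of the "replace" option only; the mapping covers just the characters that appear in the example, and `PUNCT_MAP`, `to_ascii_punct` and `non_ascii_chars` are hypothetical names.

```python
# Typographic characters mapped to ASCII look-alikes.
# Only the characters from the example sentence are covered here.
PUNCT_MAP = str.maketrans({
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark / apostrophe
    "\u2026": "...",  # horizontal ellipsis
})


def to_ascii_punct(text):
    """Replace typographic punctuation with its ASCII equivalent."""
    return text.translate(PUNCT_MAP)


def non_ascii_chars(text):
    """List the distinct non-ASCII characters in a string, for inspection."""
    return sorted({ch for ch in text if ord(ch) > 127})
```

Calling `non_ascii_chars` on a sample of each file before cleaning is a quick way to see which characters a mapping like this would need to cover.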
23. Post-processing
• Why is post-processing needed?
• Tokenisers are limited in capability
• Get cleaner data
something going
going on
he's there
is something
like right
on right
right now
there is
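One possible set of post-processing rules is sketched below in Python. It is an assumption, not the method used for the slides: bigrams that cross a sentence boundary are dropped, and punctuation is stripped from both ends of each word. The name `clean_bigrams` is hypothetical.

```python
import string
from collections import Counter

SENTENCE_END = ".!?"


def clean_bigrams(text):
    """Count bigrams after dropping cross-sentence pairs and edge punctuation."""
    tokens = text.lower().split()
    counts = Counter()
    for first, second in zip(tokens, tokens[1:]):
        # A first word ending a sentence means the pair spans two sentences.
        if first[-1] in SENTENCE_END:
            continue
        # Strip punctuation from the ends only, so "he's" keeps its apostrophe.
        a = first.strip(string.punctuation)
        b = second.strip(string.punctuation)
        if a and b:
            counts[f"{a} {b}"] += 1
    return counts
```

For the straight-quote example sentence, these rules yield the same eight bigrams listed above. Curly quotes would not be stripped, because `string.punctuation` contains ASCII characters only, so Unicode handling has to happen first.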
25. Alternative Method
• Why?
• R is slow
• There is always another, faster way to do the same thing, which allows you to fail fast, learn more and explore your data.
• For people with programming experience outside R, and for anyone interested
• Unix commands (grep, wc, awk, etc.)
• Python
• NLTK (Natural Language Toolkit)
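As an illustration of the fail-fast idea, basic statistics can be gathered by streaming a file line by line, in the spirit of `wc`. This Python sketch assumes the file is UTF-8 encoded; the function name `stream_stats` is hypothetical.

```python
def stream_stats(path, encoding="utf-8"):
    """Count lines and words and find the longest line without loading the file."""
    lines = words = longest = 0
    with open(path, encoding=encoding) as f:
        for line in f:  # the file object yields one line at a time
            lines += 1
            words += len(line.split())
            longest = max(longest, len(line.rstrip("\n")))
    return {"lines": lines, "words": words, "longest_line": longest}
```

Because only one line is held in memory at a time, the same function works on a small sample and on the full file.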
26. Benefits?
• Processed 100% of all 3 text files within 16 GB RAM
• Caveat: do you need all the data for a representative model?
• Processing time is about 10 to 30 minutes for each text file depending on tokenisation length
• Failing-fast allows you to discover your data
[Bar chart: Time taken to do Quiz 1 (sec), *nix CLI vs. R; axis from 0 to 40]
27. Python NLTK Bigram Tokeniser
• Using cleaned data (Unicode characters handled)
• Capital letters, end-of-sentence and clause details are all preserved.
• Post-processing can be done to filter out unigrams and invalid bigrams, etc.
There is 1
on right 1
... now 1
He 's 1
like right 1
`` something 1
right ... 1
, like 1
now , 1
something '' 1
now . 1
there . 1
' going 1
going on 1
right now 1
s there 1
is `` 1