2. Structured vs Unstructured Data
Defined vs Undefined Data
Qualitative vs Quantitative Data
Storage in Data Warehouses vs Data Lakes
Easy vs Hard to Analyze
Predefined format vs a variety of formats
https://www.xplenty.com/blog/structured-vs-unstructured-data-key-differences/#:~:text=Structured%20data%20is%20clearly%20defined,stored%20in%20its%20native%20format.&text=Structured%20data%20exists%20in%20predefined,in%20a%20variety%20of%20formats.
4. Hurricane Sandy (2012)
• Hit New York and its suburbs, and Long Island.
• At least 53 people died in New York as a result of the storm.
• Thousands of homes and an estimated 250,000 vehicles were
destroyed during the storm.
• According to a study by a team of academics at the University of
Buffalo, USA, 86 to 91% of Twitter users spread misinformation
during these disasters, one of which was man-made and the other
natural.
• Worse, even after learning the truth, fewer than 20% of users
tried to post correct information, and fewer than 10% deleted
the wrong information.
• Another study, by MIT, concluded that wrong information is 70%
more likely to be retweeted.
5. Understanding People
Many companies maintain online presences
Managing public perception in the age of instant
communication is essential
Reacting to changing sentiment, identifying offensive
posts, determining topics of interest…
How can we use analytics to address this?
7. Challenges
Huge, e.g. WhatsApp has 69 MM messages every day
Unstructured
Even at a small scale, cost and time are challenges
How can computers help?
Computers need to understand text
This field is called Natural Language Processing
The goal is to understand and derive meaning from human language
8. It == Bag? It == Car?
“I put my bag in the car. It is large and blue.”
17. Irregularities – lower case or upper case
Raw words: University, university, uNIVerSITY
After normalizing case: University, University, University
Result: University (count: 3)
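A minimal sketch of case normalization, assuming the text has already been split into words. The function name `normalize_case` is hypothetical, and the sketch folds everything to lower case (a common convention), whereas the slide shows the capitalized form; either works as long as all variants map to the same string.

```python
from collections import Counter

def normalize_case(tokens):
    """Lowercase every token so that case variants collapse into one term."""
    return [t.lower() for t in tokens]

# The three variants from the slide
tokens = ["University", "university", "uNIVerSITY"]
counts = Counter(normalize_case(tokens))
print(counts)  # Counter({'university': 3})
```

After normalization the three spellings are counted as a single term with frequency 3.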
18. Irregularities – punctuation
Raw words: --University, #University, University!
After removing punctuation: University, University, University
Result: University (count: 3)
Punctuation also causes problems – the basic approach is to remove everything that isn’t a, b, …, z
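The basic approach can be sketched with a regular expression, assuming the words are lowercased first so that only a to z needs to be kept. The function name `strip_non_letters` is hypothetical. Note that this rule also discards digits, accented letters, and apostrophes, which may or may not be acceptable for a given data set.

```python
import re
from collections import Counter

def strip_non_letters(token):
    """Lowercase the token, then drop every character that is not a-z."""
    return re.sub(r"[^a-z]", "", token.lower())

# The three variants from the slide
tokens = ["--University", "#University", "University!"]
cleaned = [strip_non_letters(t) for t in tokens]
counts = Counter(cleaned)
print(counts)  # Counter({'university': 3})
```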
19. Removing Unhelpful Terms
Many words are frequently used but are only meaningful in a sentence: “stop words”
Examples: the, is, at, which…
Unlikely to improve machine learning prediction quality
Remove to reduce size of data
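A small sketch of stop-word removal, assuming a hand-written stop list that contains only the four examples above; real stop lists are much longer. The names `STOP_WORDS` and `remove_stop_words`, and the sample sentence, are made up for illustration.

```python
# Illustrative stop list: only the examples from the slide
STOP_WORDS = {"the", "is", "at", "which"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = "the bag which is at the door is blue".split()
print(remove_stop_words(tokens))  # ['bag', 'door', 'blue']
```

Nine words shrink to three, which is the size reduction the slide refers to.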
20. Stemming
Do we need to draw a distinction between the following words?
Words: Argue, Argued, Argues, Arguing
After stemming: Argu, Argu, Argu, Argu
Result: Argu (count: 4)
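The idea can be sketched with a toy suffix stripper. This is not the Porter stemmer or any library's algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are assumptions chosen only to reproduce the example above.

```python
from collections import Counter

# Illustrative suffixes, checked in this order (longer ones first)
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

words = ["Argue", "Argued", "Argues", "Arguing"]
counts = Counter(toy_stem(w) for w in words)
print(counts)  # Counter({'argu': 4})
```

All four forms reduce to the same stem, so they are counted as one term. A production stemmer handles many more cases than this sketch does.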