06/03/2019 1Demetris Trihinas
trihinas.d@unic.ac.cy
1Tutorial | MSc Research Seminars
Department of
Computer Science
The Data Science Process
From Mining Raw Data to Story
Visualization
Demetris Trihinas
Department of Computer Science
University of Nicosia
trihinas.d@unic.ac.cy
Full-Time Faculty Member
University of Nicosia
“Designing and developing scalable and self-adaptive tools for data
management, exploration and visualization”
@dtrihinas
http://dtrihinas.info
https://ailab.unic.ac.cy/
https://www.slideshare.net/DemetrisTrihinas
The quest for
knowledge used to
begin with grand
theories
(a hypothesis).
Now it begins with
massive amounts of
data.
Welcome to the
Petabyte Age.
Wired, Jun 2008
State | Unemployment
------------------------------
NY | 1.72
CA | 2.43
DC | 3.54
…
Raw bits n’ bytes
Structure
Knowledge
Story
Data
Information
Understanding
Wisdom
Population
(initial data)
Data
model
Algorithmic
model
Visual
model
Cause/Effect
(why?)
Today’s Talk
Data Collection
“Tapping” into data sources
Data Collection
• In the process of data democratization… the world’s data
has never been more open than today.
• The world’s data sources (e.g., social media, news outlets)
often permit –restricted– access to their data.
• Web Scraping: methodically scrape website content
• Application Programming Interfaces (APIs)
• “ASK for permission and GET access to resource(s)”
• So… turn the “tap” of a data source (coding task) and store the
data somewhere (data warehousing) for analysis.
Web Scraping Behind the Scenes
HTML markup models how data should be displayed.
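A minimal sketch of what a scraper does with that markup, using only Python's standard `html.parser`. The page fragment and the `CellCollector` class are made up for illustration; a real scraper would first download the HTML and would have to respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; a real scraper would download this HTML first.
PAGE = """
<table>
  <tr><th>State</th><th>Unemployment</th></tr>
  <tr><td>NY</td><td>1.72</td></tr>
  <tr><td>CA</td><td>2.43</td></tr>
</table>
"""


class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows
        self._row = None        # row currently being read
        self._in_cell = False   # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            if self._row:       # the header row has no <td> cells, so it is skipped
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row.append(data.strip())


parser = CellCollector()
parser.feed(PAGE)

# Turn display-oriented markup into structured records.
records = [(state, float(rate)) for state, rate in parser.rows]
print(records)
```

The markup only says how to display the table; the scraper has to recover the structure (rows, columns, types) itself.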
Data Collection via API
Data
Collection
GET access to tweets
You can have 1% for free
with this access token.
For > 1% pay up!
The tweet sink
Data
Warehouse
GET tweets with token
from @dtrihinas
or with #data_mining
Also, ask for
#cyprus and #cyprus
Twitter Search
Behind the “scenes” is the Twitter API
JSON format for
exchanging
hierarchically
modeled data
objects.
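As a rough illustration of hierarchically modeled data, the sketch below parses a tweet-like JSON object with Python's standard `json` module. The payload and its field names are simplified assumptions for this example, not the exact schema returned by the Twitter API.

```python
import json

# Hypothetical, simplified payload; a real API response carries many more fields.
payload = """
{
  "id": 1,
  "text": "Mining raw data #data_mining",
  "user": {"screen_name": "dtrihinas", "location": "Cyprus"},
  "entities": {"hashtags": [{"text": "data_mining"}]}
}
"""

tweet = json.loads(payload)

# Nested objects are reached by walking down the hierarchy.
author = tweet["user"]["screen_name"]
hashtags = [tag["text"] for tag in tweet["entities"]["hashtags"]]
print(author, hashtags)
```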
Dec 2018
20945 related articles
Data Overview
• Trawling through a couple of articles manually is easy.
• But… what about thousands of news articles from
multiple news outlets?
Humans are slow, Computers are fast!
• Get the data, store it and then mine it!
Premier League Standings After Matchday
Let machines do the work for you!
Data Models
• The representation chosen to store and extract data.
Y = f(X, parameters, random noise)
We understand
the world!
• For example, db schemas, spreadsheets, objects, etc.
Big Data refers to datasets that are too large or
complex for traditional data-processing application
software to adequately deal with.
Big Data is… a Volume Problem
The Internet’s Digital Footprint
Sensory Data
Boeing 787 generates
40TB of data per hour
in flight.
Google’s self-driving car
generates 1GB of data
per minute.
The Internet of Things
21 Billion devices by 2020 accounting for 12% of the digital universe.
Big Data is… a Velocity Problem
Batch Data
• Assumes that the data is available when and if we want it
(e.g., reading and parsing data from a file or database)
• The processing engine knows the dataset in advance and
controls the input rate of the data
Count events by color
fetch data
<red, 3>
<yellow, 1>
<blue, 2>
<green, 2>
Processing
Engine
Database
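The "count events by color" example can be sketched in a few lines. The `events` list below is a stand-in for rows fetched from a database and is an assumption of this sketch; the point is that the whole dataset is available before processing starts.

```python
from collections import Counter

# Stand-in for rows fetched from a database: the whole dataset is known up front.
events = ["red", "blue", "red", "green", "yellow", "blue", "red", "green"]

# Batch processing: one pass over a bounded, fully available dataset.
counts = Counter(events)
for color, n in counts.items():
    print(f"<{color}, {n}>")
```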
Data Streams
• Unbounded Data -> the volume of the data is overwhelming
• Conceptually infinite sequence of data items
• Push Model -> data arrives at high velocity and different rates
• Potentially multiple sources pushing data to the processing engine at
different rates (data distribution changes over time)
[Figure: sources src1, src2 and src3 push data to the Processing Engine; plot of input rate over time t]
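For contrast with batch processing, here is a minimal sketch of stream processing with Python generators. The simulated `event_stream` source and the `running_counts` helper are hypothetical names for this example; the engine updates its answer one event at a time and never holds the "whole" dataset.

```python
import itertools
import random
from collections import Counter


def event_stream(seed=0):
    """Simulated unbounded source: yields color events forever."""
    rng = random.Random(seed)
    while True:
        yield rng.choice(["red", "yellow", "blue", "green"])


def running_counts(stream):
    """Update the counts one event at a time and emit a snapshot after each."""
    counts = Counter()
    for event in stream:
        counts[event] += 1
        yield dict(counts)


# The stream never ends, so we only look at the first 100 snapshots here.
snapshots = list(itertools.islice(running_counts(event_stream()), 100))
latest = snapshots[-1]
print(latest)
```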
US Presidential Elections 2016
Happiness Anger
Clinton
Trump
Per minute Emotions During First Debate
200K
tweets/min
https://qz.com/810092
Big Data is… a Variety Problem
Data Variety
Almost 80% of today’s business data is unstructured (text) data
Text
Images
Sound
Video
…
.xls
.csv
.json
.xml
…
Big Data is… a Veracity Problem
Data Quality
Can we trust the
data regardless
of the source?
Can we make good decisions
with the available data?
Big Data is… a Value Problem
Data Mining
From bits n’ bytes to knowledge
Data Warehousing
• Data warehousing provides data storage and
management capabilities.
• Memory and storage have
never been cheaper.
1MB today is 10 times
cheaper than 5 years
ago!
Database people used to worry
how to get data in a database.
But, now… it’s all about how to
get data out!
Computing Power
• Cloud Computing - Abundance of computing power.
• Rent instead of buying expensive compute power (also
removes side costs, e.g., cooling, physical security, etc.)
• Pay-as-you-use cost model
Marketing Mantra
Collect whatever data you can, whenever and wherever
possible.
The expectation is that collected data
will have value either for the purpose
collected or for a purpose not yet
envisioned.
Economist, May, 2017
Data Mining
• Data is useless unless you can convert it to structured
information and ultimately into knowledge.
• So… data mining provides you with the intelligence to
convert data into knowledge.
Confluence of Multiple Disciplines
We are drowning in data
but starved for knowledge…
John Naisbitt, 1982
What is NOT Data Mining
• Any question you can ask and get an –immediate and
concrete– answer from a database.
• How many sofa models does IKEA currently have in stock?
• How many sofas did IKEA sell in Sweden last month?
• Which IKEA customers bought a sofa worth more than 500
euros this year?
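Questions like these are plain database queries. The sketch below answers two of them with Python's standard `sqlite3` module; the `sales` table, its columns, and its rows are invented for illustration.

```python
import sqlite3

# Hypothetical in-memory sales table; the schema and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, country TEXT, price REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Anna", "sofa", "Sweden", 650.0),
        ("John", "sofa", "Sweden", 420.0),
        ("Mat", "lamp", "Cyprus", 35.0),
    ],
)

# "How many sofas were sold in Sweden?" has an immediate, concrete answer.
sofas_in_sweden = conn.execute(
    "SELECT COUNT(*) FROM sales WHERE item = 'sofa' AND country = 'Sweden'"
).fetchone()[0]

# "Which customers bought a sofa worth more than 500 euros?" is also a lookup.
big_spenders = [
    row[0]
    for row in conn.execute(
        "SELECT customer FROM sales WHERE item = 'sofa' AND price > 500 ORDER BY customer"
    )
]
print(sofas_in_sweden, big_spenders)
```

No model is learned here: the answers are already stored in the data, which is why this is querying rather than mining.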
Algorithmic Models
• Attempt to understand and represent reality
through a particular lens (e.g., mathematical, biological).
• Artificial construction where all extraneous detail has
been removed or abstracted.
We don’t understand the world (but try to!)
Y <- Model (black box) <- X
State | Unemployment
------------------------------
NY | 1.72
CA | 2.43
DC | 3.54
…
Refinement
Data Mining Techniques
• Classification
• Clustering
• Pattern Discovery
• Associations
• Regression
• Outlier Detection
Classification
• Develop models (or functions) that assign each item,
described by a collection of attributes, to one of a set
of known classes.
• “Give a label to your data!”
• Should the IKEA sofa model S be added to this month’s
discount items (yes, no)?
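One very simple way to "give a label to your data" is a 1-nearest-neighbour classifier: a new item gets the label of the most similar labeled example. This is only one of many classification techniques, and the features and training rows below are hypothetical.

```python
import math

# Hypothetical labeled history: (weeks since last discount, units in stock) -> discounted?
training = [
    ((1, 20), "no"),
    ((2, 35), "no"),
    ((8, 90), "yes"),
    ((10, 120), "yes"),
]


def classify(item):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda pair: math.dist(pair[0], item))
    return label


print(classify((9, 100)))  # a sofa that resembles past discounted items
print(classify((2, 25)))   # a sofa that resembles past non-discounted items
```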
Predicting Person’s Credit Worthiness
Attribute
Values
Classes
{Yes, No}
Google News
Classify
by type
Classify
by country
Clustering
• Develop models to group data together based on their
similarity to each other and dissimilarity to data in other groups.
• Group IKEA customers based on how much disposable
income they have, or how often they tend to shop at a
particular IKEA branch.
• Similar to classification but with unknown classes.
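A minimal sketch of clustering with k-means on one-dimensional data, assuming two groups and hand-picked starting centers. The income figures and the `kmeans_1d` helper are invented for this example; note that no labels are given, the groups emerge from the data.

```python
def kmeans_1d(values, centers, rounds=10):
    """Plain k-means on one-dimensional data, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        # Assignment step: put each value in the cluster of its nearest center.
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical monthly disposable income of six customers, in euros.
incomes = [300, 350, 400, 1500, 1600, 1700]
centers, clusters = kmeans_1d(incomes, centers=[300, 1700])
print(centers, clusters)
```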
Customer Demographics
Customers of this group usually buy sofa S, so let’s send
customer X an email with a discount for S.
Google News
Similar
articles
clustered
together
Google News
Article
Clustering
based on
similarity
Cluster
Classification
automated
label
generation
Pattern Discovery
• One of the most basic techniques in data mining is learning
to recognize patterns in the data.
• This is usually a recognition of some aberration in your data
happening at regular intervals, or an ebb and flow of a
certain variable over time.
• Sales of a certain product seem to spike just before the
holidays, or warmer weather drives more people to your
website.
IKEA Sofa Sales Forecast
???
Association
• Association is related to tracking patterns, but is more
specific to dependently linked attributes.
• Model developed to look for specific events or
attributes that are highly correlated with another event
or attribute.
• When your customers buy a specific item, they also
often buy a second, related item.
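Association rules are commonly scored with support (how often items appear together) and confidence (how often the second item appears given the first). The sketch below computes both over a handful of made-up shopping baskets; the `support` and `confidence` helpers are illustrative names.

```python
# Hypothetical shopping baskets, one set of items per purchase.
baskets = [
    {"sofa", "cushion"},
    {"sofa", "cushion", "lamp"},
    {"sofa", "rug"},
    {"lamp"},
]


def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Of the baskets with `antecedent`, the fraction that also have `consequent`."""
    return support(antecedent | consequent) / support(antecedent)


print(support({"sofa", "cushion"}))       # share of baskets with both items
print(confidence({"sofa"}, {"cushion"}))  # how often a sofa buyer also buys a cushion
```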
People Also…
Beware…
Data mining is NOT about fitting the model to the answer
YOU want!
Correlation
• Correlation is a statistical technique that tells us how
strongly pairs of variables are related.
• But… correlation does not tell us the why and how
behind the relationship.
• So… correlation just says that a relationship exists.
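One common measure of how strongly two variables move together is the Pearson correlation coefficient, which ranges from -1 to 1. The sketch below computes it by hand; the sales figures are invented, and the coefficient says nothing about why the two series move together.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly sales figures for two products.
ice_cream = [10, 20, 30, 40, 50]
sunglasses = [12, 24, 33, 41, 55]

r = pearson(ice_cream, sunglasses)
print(f"r = {r:.3f}")  # close to 1: a strong positive relationship
```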
Ice-Cream and Sunglass Sales
As the sales of ice cream increase, so do
the sales of sunglasses.
Causation
• Causation denotes that a change in the value of one
variable will cause a change in the value of another
variable.
• This means that one variable makes the other happen.
Exercise and Calories
• When a person exercises, the number of
calories burned increases every minute.
• The former (exercise) is causing the latter (calories
burned) to happen.
Ice-Cream and Homicides in New York
• A study in the 90’s showed that ice-cream sales are the
cause of homicides in New York.
• As the sales of ice-cream rise and fall, so does the
number of homicides -> correlation.
• But… does the consumption of ice-cream actually
cause the death of people in NY?
https://www.nytimes.com/2009/06/19/nyregion/19murder.html
Correlation Does NOT Imply Causation
• The two things are, yes, correlated.
• But this does NOT mean one causes the other.
Correlation is what we fall
back on when we can’t see
under the covers.
So the less information we
have, the more we are forced
to rely on correlations.
Confidence Intervals
• How many football games do US citizens go to?
• To get an -exact- answer (100% correct), you must ask
everyone in the US (>300M people) -> Not practical!
• Use a random sample, meaning ask (much) fewer people
-> but we won’t be 100% correct.
Confidence Intervals
• What we try to achieve: Get an interval that we are
confident that the actual answer lies within.
“I am 95% confident that the average number of football games
a person in the U.S. goes to per year lies between 10 and 12”
• So basically, CIs describe the level of uncertainty
associated with a sample estimate.
Random Sample Selection
• Random… means random!
• You cannot just select 1000 people from one city, the
sample won’t represent the whole US.
• You cannot just send FB messages to 1000 random
people, you will get a representation of US FB users,
and of course not all of the US citizens use FB.
• So… constructing a random sample is actually hard!
Confidence Intervals
• Random sample: 1000 US citizens
• Avg is 11 games and SD is 5 games.
• Let’s say we want a 95% confidence interval.
95%
11
With some statistics
(1.96 × SD / √n ≈ 0.3)
we get a margin of about
+-0.3 games for a 95% CI.
We are 95% confident
that the average US
citizen watches between
10.7 and 11.3 games a year.
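The "some statistics" step can be sketched with the usual normal approximation, margin = z × SD / √n, where z ≈ 1.96 for a 95% interval. The slides do not say which formula was used, so this choice is an assumption; with the sample numbers above it gives a margin noticeably tighter than one game.

```python
import math

n = 1000      # sample size
mean = 11.0   # average games per year in the sample
sd = 5.0      # sample standard deviation
z = 1.96      # standard normal quantile for a two-sided 95% interval (assumed method)

# Margin of error under the normal approximation.
margin = z * sd / math.sqrt(n)
low, high = mean - margin, mean + margin
print(f"95% CI: {low:.2f} to {high:.2f} (margin {margin:.2f})")
```

Because the margin shrinks with √n, quadrupling the sample size only halves the width of the interval.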
Data Visualization
Visually communicate analysis results
A picture is worth a 1000 words...
Chinese proverb
Unemployment Data in the US
Seismic Activity in California
Why Visualize Your Results?
Easier to interpret large
volumes of data because
the human eye can
immediately focus on
the main information.
Interactiveness
Focus even more on the information we care about and
perform “real-time” queries on the data.
Big Data Challenges
The human eye can no longer find the information that we
care about…
Big Data Challenges
Data navigation through interactiveness either does not work
or is not “real-time” anymore…
Data Science Process
Data Collection -> Raw Data -> Data Warehousing -> Struct Info -> Data Mining -> Insights -> Data Visualization -> Story
Data Science Process
Data Collection -> Raw Data -> Data Warehousing -> Struct Info -> Data Preprocessing -> Preprocessed Info -> Data Mining -> Insights -> Data Visualization -> Story
Data Preprocessing
• Data mining, especially on big data, is a compute- and time-expensive
process.
• Data Preprocessing can significantly increase
performance if performed before mining.
• Data Cleaning
• Data Reduction
• Data Transformation
Preprocessing can take around
60% of your effort, but it is totally worth it!
That’s a lot of data, but…
how much is actually useful?
Data Cleaning
• You would assume that data stored in a database is
ready for analysis, but… it is often “dirty data”.
• Removing duplicate, erroneous or NA data.
• Statistically imputing missing data.
id | name | age | score
------------------------------
1000 | Anna | 42 | 84.7
1001 | John | fifty | 89.5
age MUST be a number
id | name | age | score
------------------------------
1000 | Anna | 42 | 84.7
1001 | John | 50 | 89.5
1002 | Mat | 29 | NA
Mat was sick on test day but is a C-average student, so let’s assume he
would have scored a 72.0
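The cleaning steps on this slide can be sketched in plain Python. The duplicate row, the `WORDS` lookup, and the `clean` helper are assumptions added for illustration; the imputed score of 72.0 is the slide's own guess for Mat.

```python
# Rows as they might come out of the database; None marks a missing score.
raw = [
    {"id": 1000, "name": "Anna", "age": "42", "score": 84.7},
    {"id": 1001, "name": "John", "age": "fifty", "score": 89.5},
    {"id": 1001, "name": "John", "age": "fifty", "score": 89.5},  # duplicate (assumed)
    {"id": 1002, "name": "Mat", "age": "29", "score": None},
]

WORDS = {"fifty": 50}  # tiny, hypothetical lookup for ages typed as words


def clean(rows, default_score=72.0):
    """Drop duplicate ids, force age to be a number, and impute missing scores."""
    seen, out = set(), []
    for row in rows:
        if row["id"] in seen:   # remove duplicates
            continue
        seen.add(row["id"])
        age = row["age"]
        # age MUST be a number; unknown words raise KeyError in this sketch.
        age = int(age) if age.isdigit() else WORDS[age]
        score = default_score if row["score"] is None else row["score"]
        out.append({**row, "age": age, "score": score})
    return out


cleaned = clean(raw)
for row in cleaned:
    print(row)
```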
Data Transformation
• Reshape, sort and combine data to suitable format(s)
for analysis.
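id | name | age | score
------------------------------
1000 | Anna | 42 | 84.7
1001 | John | 50 | 89.7
1002 | Mat | 29 | 72.0
id | name | Eats Breakfast
------------------------------
1000 | Anna | Yes
1001 | John | yes
1002 | Mat | no
Sort by score
id | name | age | score | Breakfast
------------------------------
1001 | John | 50 | 90 | 1
1000 | Anna | 42 | 85 | 1
1002 | Mat | 29 | 72 | 0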
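The transformation shown in the tables (combine the two tables on id, round the scores, encode Yes/yes/no as 1/0, sort by score) can be sketched as follows. The variable names are made up for this example; the values are taken from the slide.

```python
students = [
    {"id": 1000, "name": "Anna", "age": 42, "score": 84.7},
    {"id": 1001, "name": "John", "age": 50, "score": 89.7},
    {"id": 1002, "name": "Mat", "age": 29, "score": 72.0},
]
breakfast = {1000: "Yes", 1001: "yes", 1002: "no"}

# Combine both tables on id, round the score, and encode breakfast as 1/0
# (lower-casing first, so that "Yes" and "yes" are treated the same).
merged = [
    {
        **s,
        "score": round(s["score"]),
        "breakfast": 1 if breakfast[s["id"]].lower() == "yes" else 0,
    }
    for s in students
]

# Sort by score, highest first.
merged.sort(key=lambda row: row["score"], reverse=True)
for row in merged:
    print(row)
```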
Data Reduction
• Filter out data that is not needed for the analysis, so that it
consumes fewer resources and less time.
• Analysis will be performed on US citizens so remove others.
• Use only a sample of the data to get an approximate, but
quick, answer
• Create random sample of 1K rows instead of 1M rows.
• Reduce the dimensionality of the problem
• The field age is not relevant to analysis.
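The three reduction steps (filter, sample, drop a dimension) can be sketched with the standard library. The synthetic dataset, its field names, and the fixed random seed below are assumptions for illustration only.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical dataset of 10,000 rows.
rows = [
    {
        "id": i,
        "country": rng.choice(["US", "CY", "SE"]),
        "age": rng.randint(18, 80),
        "games": rng.randint(0, 30),
    }
    for i in range(10_000)
]

# 1. Filter: the analysis is about the US, so drop everything else.
us_only = [r for r in rows if r["country"] == "US"]

# 2. Sample: trade a little accuracy for a much quicker answer.
sample = rng.sample(us_only, k=min(1000, len(us_only)))

# 3. Reduce dimensionality: the age field is not relevant to the analysis.
reduced = [{k: v for k, v in r.items() if k != "age"} for r in sample]

print(len(rows), len(us_only), len(reduced))
```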