2. COURSE CONTENTS
•Core Data Analysis
• 1D analysis
• 2D analysis: both quantitative
• 2D analysis: both nominal
• Learning multivariate correlation
• Principal components (PCA) and SVD: Mathematical foundations
• Principal components (PCA) and SVD: Applications
• Clustering with k-means
3. INTRO: WHAT IS CORE DATA
ANALYSIS?
Four main parts
1. Data Mining and data patterns and their use
2. Core data analysis: two main goals for
Knowledge Enhancing
3. Visualization: How it works
4. Illustrative data cases
4. INTRO: DATA MINING AND DATA PATTERNS
AND THEIR USE
•Is it Data Mining?
• Well, what is Data Mining?
• Generically, Data Mining is looking for (i) patterns in data stored in (ii) Databases
as part of (iii) Knowledge Discovery
• Core data analysis does not care of (ii) Databases
• Core data analysis does care of (ia) specific patterns in data as part of
(iia) Knowledge Enhancing
5. INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 1
The History of Laws for planetary motion
Double success
Ptolemy (c. 150 a.d.):
• Sun and planets
• circle Earth
• Does not match data well
6. INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 2
The History of Laws for planetary motion
• Copernicus (c. 1540):
• Planets circle Sun
• Does not match data well
• either
7. INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 3
Laws for planetary motion:
Kepler (c. 1605):
• 1st Law: Planets revolve Sun in ellipses (ovals)
• 2d Law: Speed changes – the further away from Sun, the faster
• Does either
8. INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 4
Planet
Period
(year)
Distance (average,
relative to that of
Earth)
Mercury
Venus
Earth
Mars
Jupiter
Saturn
Uranus
Neptune
Pluto
0.241
0.615
1.00
1.88
11.8
29.5
84.0
165
248
0.39
0.72
1.00
1.52
5.20
9.54
19.18
30.06
39.44
3d Law:
Is there any relation
between
speed/period and
distance?
9. INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 5
3d Kepler’s Law:
Is there any relation
between speed/period
and distance?
Fit no line…
10. INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 6
3d Kepler’s Law (1619):
[J. Napier invented
logarithm (1614)]
Log(P)=
𝟑
𝟐
Log(D)
P2=D3
11. INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 7
Three Kepler’s Laws: What is so grand?
Substantiated theoretically by
R. Hooke (1635-1703) and I. Newton (1642-1727)
UNIVERSAL GRAVITATION LAW
Mathematical equation, cornerstone of modern science
12. INTRO: EXAMPLE OF PATTERN
FAILURE? 1
Imagine this:
Broad street, Soho, London,
Cholera outbreak September 1854
Dr. Snow report: “On proceeding to the spot, I found
that nearly all the deaths had taken place within a short
distance of the pump.”
Dr John Snow’s map:
Cases of death
labeled by ticks.
The handle of pump
removed 7/9/1854.
13. INTRO 1: EXAMPLE OF PATTERN
FAILURE? 2
Myth: Death stopped. Data analysis won.
Fact: Data analysis lost. The health commission rejected the water
pump theory, as contradicting the science of the day (cholera outbreak
caused by “concentrated noxious atmospheric influence, no doubt
emanating from putrefying organic matter”). The handle of the pump
was ordered back. Death stopped because all died already.
More death occurred at further cholera outbreaks till R. Koch discovered
and publicized the vibrio cholera in 1883.
Dr John Snow’s map:
A case of death
Is labeled by a tick
14. PATTERN FOUND
Success: if
Compatible with existing knowledge
Failure: if
Not compatible with existing knowledge
Advice
• Find a pattern
• Interpret using existing knowledge
• Care not whether interpretation is
compatible
15. INTRO: WHAT IS CORE DATA
ANALYSIS II 1
• Core data analysis does care of (ia) specific patterns in data as part of (iia) Knowledge
Enhancing
• What are these (ia), (iia) specifics?
• Have something to do with the notion of Knowledge
• Statements of fact (“I teach this class.”) – factual
• Statements of pattern, regularity (“Professors use to teach classes.”) - structural
16. INTRO: WHAT IS CORE DATA
ANALYSIS II 2
• Core data analysis does care of (ia) specific patterns in data as part of (iia) Knowledge
Enhancing
• (ia), (iia) specifics relate to elements of structural knowledge
• Elements of Structural knowledge:
• Concepts (“Professor”, “Teach”, “Class”)
• Statements of relation between concepts (“Professors use to teach classes.”) - structural
17. INTRO: WHAT IS CORE DATA
ANALYSIS II
•List elements of structural knowledge,
•concepts and
•statements of relation among them, for
•3d Kepler’s Law
•Dr Snow’s cholera outbreak map
18. INTRO: WHAT IS CORE DATA
ANALYSIS II 3
• Core data analysis does care of (ia) deriving concepts and statements of relation between them
from data
• (iia) Structural Knowledge Enhancing, generically, via either of the two pathways
• Two pathways for Structural Knowledge Enhancing
• Summarization: Developing Concepts
• Correlation: Deriving Statements of relation between concepts
19. W1. INTRO: WHAT IS CORE DATA
ANALYSIS II 4
• Two pathways for Structural Knowledge Enhancing
• Summarization: Developing Concepts
• Correlation: Deriving Statements of relation between concepts
Two major formats:
Quantitative (both concepts and statements)
3d Kepler’s Law
Period2 = Distance3
Categorical (both concepts and statements)
Dr Snow’s conclusion:
Cholera death is caused by pump water
21. INTRO II: THREE POSSIBLE LAYERS
OF STUDY
Pro Con
• Systems Usable now Short lived
Simple Too many
• Concepts Awareness Superficial
• Methods Workable Technical
Extendable Boring
Long-term
22. INTRO II: COURSE CONTENTS
REVIEW
•Summarization: PCA (Weeks 6 and 7), Cluster
analysis (Week 8)
•Correlation: Classifier (Week 5), (no Regression, sorry;
if needed, go to Statistics, Econometrics and Neuron Networks
courses)
•Prequel: 1D and 2D analyses to study basic
concepts and basic methods
•Pre-prequel: Intro – Data and problems
23. INTRO II: RELATION TO OTHER
APPROACHES
• Classical mathematical statistics: data is just a vehicle to fit and test
mathematical models in the applied domain (say, in data analysis, a feature is
a column in table, they model it as a random variable!)
• Machine Learning: Prediction rules to be built incrementally (say, here PCA is
a major method; for them, just a method to preprocess the data)
• Data Mining: adding new knowledge by finding
interesting patterns in databases, which is initial
stage of knowledge discovery (CDA is part of that,
up to databases)
OVERALL: METHODS are SAME, PERSPECTIVES DO DIFFER
24. INTRO III: VISUALIZATION
• Visualization of data is an important activity assisting data analysis by a human in many ways
including
A. Highlighting
B. Integrating different aspects
C. Manipulating (not shown)
A few examples follow.
25. INTRO III: VISUALIZATION
A. Highlighting 1
Figure 1. A fragment of London Tube
map made after H. Beck (1906); the
central part is highlighted by
disproportionate scaling. Being, for a
long while, totally rejected by the
authorities, a standard for metro maps
worldwide.
26. INTRO III: VISUALIZATION
A. Highlighting 2: Cheating by distortion
Figure 2. A decline in relative numbers of
general practitioner doctors in California in 70-
es is conveniently visualized using 1D size-, not
2D area-related, scaling of a picture of doctor.
27. INTRO III: VISUALIZATION
Highlighting 3: Cheating by
distortion
Figure 3. Another unintended
distortion: a newspaper’s self-
satisfaction report (July 2005) is
visualized with bars that grow
from mark 500,000 rather than 0.
A 25% advantage has visually
grown ten-fold!
28. INTRO III: VISUALIZATION
B. Integrating aspects 1
Figure 4. Con Edison company’s power grid screen over
Manhattan NY. Grid repair problems are dealt with on the fly
by sending operators upon seeing disorders on the screen.
29. INTRO III: VISUALIZATION
B. Integrating aspects 2
Figure 5. Minard’s (1869) depiction of a lost Napoleon
campaign 1812 integrating space, time and strength of
the French army.
30. INTRO III: VISUALIZATION
B. Integrating
aspects 3
Figure 6. The
structure of research
activities of CENTRIA
(UNL, Lisbon) in 2007
represented over ACM
Computer Subjects
Classification 1998.
31. INTRO IV: ILLUSTRATIVE DATA CASES
Company name Income, $mln MShare,% NSup EC Sector
Aversiona
Antyops
Astonite
19.0
29.4
23.9
43.7
36.0
38.0
2
3
3
No
No
No
Utility
Utility
Industrial
Bayermart
Breaktops
Bumchista
18.4
25.7
12.1
27.9
22.3
16.9
2
3
2
Yes
Yes
Yes
Utility
Industrial
Industrial
Civiok
Cyberdam
23.9
27.2
30.2
58.0
4
5
Yes
Yes
Retail
Retail
Case 1: Companies 1
Companies characterized by mixed scale features; the first three companies making product A, the next three
making product B, and the last two product C.
Metadata: A. Features and Domain knowledge
1) Income, $ Mln;
2) Mshare - Market share , per cent;
3) NSup - Number of principal suppliers;
4) ECommerce - Yes e-trade or No;
5) Sector - (a) Retail, (b) Utility, and (c) Industrial.
B. Main production (A,B,C)
C. Feature scale types (3 main types)
32. INTRO IV: ILLUSTRATIVE DATA CASES
Case 1: Companies 2
Metadata: A. Features and Domain knowledge
1) Income, $ Mln;
2) Mshare - Market share , per cent;
3) NSup - Number of principal suppliers;
4) ECommerce - Yes e-trade or No;
5) Sector - (a) Retail, (b) Utility, and (c) Industrial.
Feature: Maps entities to feature values (Synonyms: Variable,
Attribute, Character, Parameter)
Feature. Quantitative scale: Arithmetic averaging makes
sense
Examples: 1) Income, 2) Mshare, 3) NSup
33. INTRO IV: ILLUSTRATIVE DATA CASES
Case 1: Companies 3
Metadata: A. Features and Domain knowledge
1) Income, $ Mln;
2) Mshare - Market share , per cent;
3) NSup - Number of principal suppliers;
4) ECommerce - Yes e-trade or No;
5) Sector - (a) Retail, (b) Utility, and (c) Industrial.
Feature. Nominal scale: Disjunctive categories, Only comparison “equal or
not” making sense (Special case of categorical scales)
Example: 5) Sector (Retail, Utility, Industrial are values
Feature. Binary scale: Two disjunctive categories, “Yes” and “No”
Shares properties of nominal scale and quantitative scale if 1/0 coded
Example: 4) ECommerce
34. INTRO IV: QUANTATIVE CODING
Company name Income, $mln MShare,% NSup EC Sector
Aversiona
Antyops
Astonite
19.0
29.4
23.9
43.7
36.0
38.0
2
3
3
No
No
No
Utility
Utility
Industrial
Bayermart
Breaktops
Bumchista
18.4
25.7
12.1
27.9
22.3
16.9
2
3
2
Yes
Yes
Yes
Utility
Industrial
Industrial
Civiok
Cyberdam
23.9
27.2
30.2
58.0
4
5
Yes
Yes
Retail
Retail
Case 1: Companies 4
Quantitative coding: Each category is made into a 1/0 binary (dummy) feature “Does
it hold? 1 if Yes, 0 if No.”
Entity Income MSchar NSup EC? Util? Indu? Retail?
1
2
3
19.0
29.4
23.9
43.7
36.0
38.0
2
3
3
0
0
0
1
1
0
0
0
1
0
0
0
4
5
6
18.4
25.7
12.1
27.9
22.3
16.9
2
3
2
1
1
1
1
0
0
0
1
1
0
0
0
7
8
23.9
27.2
30.2
58.0
4
5
1
1
0
0
0
0
1
1
Company data 8x5 converted to the quantitative format 8x7
35. INTRO IV: ILLUSTRATIVE DATA CASES
Company name Income, $mln MShare,% NSup EC Sector
Aversiona
Antyops
Astonite
19.0
29.4
23.9
43.7
36.0
38.0
2
3
3
No
No
No
Utility
Utility
Industrial
Bayermart
Breaktops
Bumchista
18.4
25.7
12.1
27.9
22.3
16.9
2
3
2
Yes
Yes
Yes
Utility
Industrial
Industrial
Civiok
Cyberdam
23.9
27.2
30.2
58.0
4
5
Yes
Yes
Retail
Retail
Case 1: Companies 5
Data analysis:
• How to map companies to the screen with their similarity reflected in distances
between points? (Summarization/visualization)
• Would clustering of companies reflect the product? What features would be
involved then? (Summarization)
• Can rules be derived to predict the product for another company, coming outside
of the table? (Correlation)
• Is there any relation between the structural features (Nsup,EC,Sector) and
market related features (Income, MSchare)? (Correlation.)
36. INTRO IV: ILLUSTRATIVE DATA CASES
Case 2: Iris 1
Anderson–Fisher Iris 150x4 data of three taxa:
Specimen (1-150)Taxon
1-50 Iris setosa (diploid)
51-100 Iris versicolor (tetraploid)
101-150 Iris virginica (hexaploid)
Features
W1 Sepal length
W2 Sepal width
W3 Petal length
W4 Petal width
37. INTRO IV: DATA CASES
Case 2: Iris 2
#
I Iris setosa II Iris versicolor III Iris virginica
w1 w2 w3 w4 w1 w2 w3 w4 w1 w2 w3 w4
1
2
3
4
5
6
7
8
9
50
5.1 3.5 1.4 0.3
4.4 3.2 1.3 0.2
4.4 3.0 1.3 0.2
5.0 3.5 1.6 0.6
5.1 3.8 1.6 0.2
4.9 3.1 1.5 0.2
5.0 3.2 1.2 0.2
4.6 3.2 1.4 0.2
5.0 3.3 1.4 0.2
5.1 3.5 1.4 0.2
6.4 3.2 4.5 1.5
5.5 2.4 3.8 1.1
5.7 2.9 4.2 1.3
5.7 3.0 4.2 1.2
5.6 2.9 3.6 1.3
7.0 3.2 4.7 1.4
6.8 2.8 4.8 1.4
6.1 2.8 4.7 1.2
4.9 2.4 3.3 1.0
6.0 2.2 4.0 1.0
6.3 3.3 6.0 2.5
6.7 3.3 5.7 2.1
7.2 3.6 6.1 2.5
7.7 3.8 6.7 2.2
7.2 3.0 5.8 1.6
7.4 2.8 6.1 1.9
7.6 3.0 6.6 2.1
7.7 2.8 6.7 2.0
6.2 3.4 5.4 2.3
6.5 3.2 5.1 2.0
Data analysis
• Visualise the data so that similar specimen are mapped into
points that are near each other, and dissimilar to far away points
• Build a predictor of sepal sizes from the petal sizes (to lessen the
burden of measurement)
• Build a predictor of taxa (classifier) based on the petal/sepal
sizes
38. INTRO IV: DATA CASES
Case 3: Intrusion attack 1
Features
1) Pr, the protocol-type, which is either tcp or icmp or udp (a nominal feature),
2) BySD, the number of data bytes from source to destination,
3) SH, the number of connections to the same host as the current one in the past two seconds,
4) SS, the number of connections to the same service as the current one in the past two
seconds,
5) SE, the rate of connections (per cent in SHCo) that have SYN errors,
6) RE, the rate of connections (per cent in SHCo) that have REJ errors,
7) A, the type of attack (ap - apache, sa - saint, sm - smurf, and no attack) – a nominal
Pr BySD SH SS SE RE A Pr ByS SH SS Se RE A
Tcp
62344
16 16 0 0.94 Ap Tcp 287 14 14 0 0 no
Tcp 60884 17 17 0.06 0.88 Ap Tcp 308 1 1 0 0 no
Tcp 59424 18 18 0.06 0.89 Ap Tcp 284 5 5 0 0 no
Tcp 59424 19 19 0.05 0.89 Ap Udp 105 2 2 0 0 no
Tcp 59424 20 20 0.05 0.9 Ap Udp 105 2 2 0 0 no
Tcp 75484 21 21 0.05 0.9 Ap Udp 105 2 2 0 0 no
39. INTRO IV: DATA CASES
Case 3: Intrusion attack 2
Data analysis
• Build a classifier to judge whether the system functions normally or is it under
attack (Correlation);
• Is there any relation between the protocol and type of attack (Correlation);
• Visualize the data reflecting similarity of the patterns (Summarization).
Pr BySD SH SS SE RE A Pr ByS SH SS Se RE A
Tcp
62344
16 16 0 0.94 Ap Tcp 287 14 14 0 0 no
Tcp 60884 17 17 0.06 0.88 Ap Tcp 308 1 1 0 0 no
Tcp 59424 18 18 0.06 0.89 Ap Tcp 284 5 5 0 0 no
Tcp 59424 19 19 0.05 0.89 Ap Udp 105 2 2 0 0 no
Tcp 59424 20 20 0.05 0.9 Ap Udp 105 2 2 0 0 no
Tcp 75484 21 21 0.05 0.9 Ap Udp 105 2 2 0 0 no
40. TOPICS COVERED:
1. Data Mining and data patterns and their use: if
found a pattern, interpret it!
2. Knowledge Enhancing: summarize to concepts,
correlate to statements of relation.
3. Visualize: to highlight or integrate aspects.
4. Illustrative data cases: concept of feature,
feature scale, data table, data analysis
problem.