DATA ANALYSIS

CHARAK RAY
libra.charak@gmail.com

COURSE CONTENTS
•Core Data Analysis
• 1D analysis
• 2D analysis: both quantitative
• 2D analysis: both nominal
• Learning multivariate correlation
• Principal components (PCA) and SVD: Mathematical foundations
• Principal components (PCA) and SVD: Applications
• Clustering with k-means

INTRO: WHAT IS CORE DATA
ANALYSIS?
Four main parts
1. Data Mining and data patterns and their use
2. Core data analysis: two main goals for
Knowledge Enhancing
3. Visualization: How it works
4. Illustrative data cases

INTRO: DATA MINING AND DATA PATTERNS
AND THEIR USE
•Is it Data Mining?
• Well, what is Data Mining?
• Generically, Data Mining is looking for (i) patterns in data stored in (ii) Databases
as part of (iii) Knowledge Discovery
• Core data analysis is not concerned with (ii) Databases
• Core data analysis is concerned with (ia) specific patterns in data as part of
(iia) Knowledge Enhancing

INTRO: EXAMPLE OF PATTERN
DOUBLE SUCCESS 1
The History of Laws for planetary motion
Double success
Ptolemy (c. 150 AD):
• Sun and planets circle Earth
• Does not match data well

DOUBLE SUCCESS 2
The History of Laws for planetary motion
• Copernicus (c. 1540):
• Planets circle Sun
• Does not match data well either

DOUBLE SUCCESS 3
Laws for planetary motion:
Kepler (c. 1605):
• 1st Law: Planets revolve around the Sun in ellipses (ovals)
• 2nd Law: Speed changes: the further away from the Sun, the slower
• Does match data well

DOUBLE SUCCESS 4
Planet    Period (year)   Distance (average, relative to that of Earth)
Mercury   0.241           0.39
Venus     0.615           0.72
Earth     1.00            1.00
Mars      1.88            1.52
Jupiter   11.8            5.20
Saturn    29.5            9.54
Uranus    84.0            19.18
Neptune   165             30.06
Pluto     248             39.44
3rd Law: Is there any relation between speed/period and distance?

DOUBLE SUCCESS 5
Kepler’s 3rd Law:
Is there any relation between speed/period and distance?
No straight line fits the raw data…

DOUBLE SUCCESS 6
Kepler’s 3rd Law (1619):
[J. Napier invented logarithms (1614)]
$\log P = \tfrac{3}{2} \log D$
$P^2 = D^3$
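
The relation can be checked on the period and distance figures from the planet table. The sketch below is only an illustration: it fits a straight line to the logarithms by ordinary least squares, and the helper name `fit_line` is an assumption of this example, not part of the course material. If the law holds, the slope should come out close to 3/2 and the intercept close to 0.

```python
import math

# Period P (years) and average distance D (relative to Earth), from the planet table.
planets = {
    "Mercury": (0.241, 0.39),
    "Venus": (0.615, 0.72),
    "Earth": (1.00, 1.00),
    "Mars": (1.88, 1.52),
    "Jupiter": (11.8, 5.20),
    "Saturn": (29.5, 9.54),
    "Uranus": (84.0, 19.18),
    "Neptune": (165.0, 30.06),
    "Pluto": (248.0, 39.44),
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Take logarithms of both features, then fit log P against log D.
log_d = [math.log(d) for _, d in planets.values()]
log_p = [math.log(p) for p, _ in planets.values()]
slope, intercept = fit_line(log_d, log_p)
print(f"log P = {slope:.3f} * log D + {intercept:.3f}")
```

A slope near 1.5 on the log scale corresponds to P² = D³ on the original scale, which is why no straight line fits the untransformed data.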

DOUBLE SUCCESS 7
Kepler’s Three Laws: What is so grand?
Substantiated theoretically by
R. Hooke (1635-1703) and I. Newton (1642-1727)
UNIVERSAL GRAVITATION LAW
Mathematical equation, cornerstone of modern science

FAILURE? 1
Imagine this:
Broad Street, Soho, London,
cholera outbreak, September 1854
Dr. Snow’s report: “On proceeding to the spot, I found
that nearly all the deaths had taken place within a short
distance of the pump.”
Dr John Snow’s map:
Cases of death
labeled by ticks.
The handle of the pump
was removed 7/9/1854.

INTRO 1: EXAMPLE OF PATTERN
FAILURE? 2
Myth: Deaths stopped. Data analysis won.
Fact: Data analysis lost. The health commission rejected the water
pump theory as contradicting the science of the day (cholera outbreak
caused by “concentrated noxious atmospheric influence, no doubt
emanating from putrefying organic matter”). The handle of the pump
was ordered back. Deaths stopped because all had died already.
More deaths occurred in further cholera outbreaks until R. Koch discovered
and publicized Vibrio cholerae in 1883.
Dr John Snow’s map:
A case of death is labeled by a tick

PATTERN FOUND
Success: if
Compatible with existing knowledge
Failure: if
Not compatible with existing knowledge
Advice
• Find a pattern
• Interpret using existing knowledge
• Do not worry whether the interpretation is
compatible with it

ANALYSIS II 1
• Core data analysis is concerned with (ia) specific patterns in data as part of (iia) Knowledge
Enhancing
• What are these (ia), (iia) specifics?
• They have something to do with the notion of Knowledge
• Statements of fact (“I teach this class.”) – factual
• Statements of pattern, regularity (“Professors teach classes.”) – structural

ANALYSIS II 2
• Core data analysis is concerned with (ia) specific patterns in data as part of (iia) Knowledge
Enhancing
• (ia), (iia) specifics relate to elements of structural knowledge
• Elements of structural knowledge:
• Concepts (“Professor”, “Teach”, “Class”)
• Statements of relation between concepts (“Professors teach classes.”)

ANALYSIS II
•List elements of structural knowledge,
•concepts and
•statements of relation among them, for
•3d Kepler’s Law
•Dr Snow’s cholera outbreak map

ANALYSIS II 3
• Core data analysis is concerned with (ia) deriving concepts and statements of relation between them
from data
• (iia) Structural Knowledge Enhancing, generically, via either of two pathways
• Two pathways for Structural Knowledge Enhancing
• Summarization: Developing Concepts
• Correlation: Deriving Statements of relation between concepts

W1. INTRO: WHAT IS CORE DATA
ANALYSIS II 4
• Two pathways for Structural Knowledge Enhancing
• Summarization: Developing Concepts
• Correlation: Deriving Statements of relation between concepts
• Two major formats:
• Quantitative (both concepts and statements)
• Kepler’s 3rd Law: $\text{Period}^2 = \text{Distance}^3$
• Categorical (both concepts and statements)
• Dr Snow’s conclusion: Cholera death is caused by pump water

INTRO II: STRUCTURAL
KNOWLEDGE ENHANCING GENERIC
METHODS
•Two pathways & Two formats
• Summarization methods:
• Quantitative: Principal component analysis (PCA)
• Categorical: Cluster analysis
• Correlation methods:
• Quantitative: Regression
• Categorical: Classifier

INTRO II: THREE POSSIBLE LAYERS
OF STUDY
Layer      Pro                               Con
Systems    Usable now, Simple                Short lived, Too many
Concepts   Awareness                         Superficial
Methods    Workable, Extendable, Long-term   Technical, Boring

INTRO II: COURSE CONTENTS
REVIEW
•Summarization: PCA (Weeks 6 and 7), Cluster
analysis (Week 8)
•Correlation: Classifier (Week 5), (no Regression, sorry;
if needed, go to Statistics, Econometrics and Neural Networks
courses)
•Prequel: 1D and 2D analyses to study basic
concepts and basic methods
•Pre-prequel: Intro – Data and problems

INTRO II: RELATION TO OTHER
APPROACHES
• Classical mathematical statistics: data is just a vehicle to fit and test
mathematical models in the applied domain (say, in data analysis a feature is
a column in a table; in statistics it is modelled as a random variable)
• Machine Learning: prediction rules are to be built incrementally (say, here PCA is
a major method; in machine learning it is just a method to preprocess the data)
• Data Mining: adding new knowledge by finding
interesting patterns in databases, which is the initial
stage of knowledge discovery (CDA is part of that,
apart from the databases)
OVERALL: METHODS are SAME, PERSPECTIVES DO DIFFER

INTRO III: VISUALIZATION
• Visualization of data is an important activity assisting data analysis by a human in many ways
including
A. Highlighting
B. Integrating different aspects
C. Manipulating (not shown)
A few examples follow.

A. Highlighting 1
Figure 1. A fragment of the London Tube
map made after H. Beck (1931); the
central part is highlighted by
disproportionate scaling. Rejected by the
authorities for a long while, it is now
a standard for metro maps worldwide.

A. Highlighting 2: Cheating by distortion
Figure 2. A decline in the relative number of
general practitioner doctors in California in the
1970s is conveniently visualized by scaling a picture
of a doctor in 1D size (height) rather than in 2D area.

A. Highlighting 3: Cheating by distortion
Figure 3. Another, unintended, distortion: a newspaper’s
self-satisfaction report (July 2005) is
visualized with bars that grow
from the 500,000 mark rather than from 0.
A 25% advantage has visually
grown ten-fold!
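
The arithmetic behind this kind of distortion can be sketched in a few lines. The two circulation figures below are hypothetical, chosen only to give a 25% advantage; they are not read off the newspaper chart, and `visual_ratio` is a name made up for this illustration.

```python
def visual_ratio(a, b, baseline=0.0):
    """Ratio of the drawn bar lengths when bars start at `baseline` instead of 0."""
    return (a - baseline) / (b - baseline)


# Hypothetical figures (assumption): a 25% advantage of `ours` over `theirs`.
ours, theirs = 650_000, 520_000

true_ratio = visual_ratio(ours, theirs)                 # bars drawn from 0
drawn_ratio = visual_ratio(ours, theirs, 500_000)       # bars drawn from the 500,000 mark

print(f"true ratio:  {true_ratio:.2f}")
print(f"drawn ratio: {drawn_ratio:.2f}")
```

With these numbers the bars differ by a factor of 7.5 on the page although the values differ by a factor of 1.25; the closer the baseline is to the smaller value, the larger the exaggeration.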

B. Integrating aspects 1
Figure 4. Con Edison company’s power grid screen over
Manhattan NY. Grid repair problems are dealt with on the fly
by sending operators upon seeing disorders on the screen.

B. Integrating aspects 2
Figure 5. Minard’s (1869) depiction of Napoleon’s lost
1812 campaign, integrating space, time and strength of
the French army.

B. Integrating aspects 3
Figure 6. The structure of research activities of CENTRIA
(UNL, Lisbon) in 2007 represented over ACM Computer Subjects
Classification 1998.

INTRO IV: ILLUSTRATIVE DATA CASES
Company name   Income, $mln   MShare, %   NSup   EC    Sector
Aversiona      19.0           43.7        2      No    Utility
Antyops        29.4           36.0        3      No    Utility
Astonite       23.9           38.0        3      No    Industrial
Bayermart      18.4           27.9        2      Yes   Utility
Breaktops      25.7           22.3        3      Yes   Industrial
Bumchista      12.1           16.9        2      Yes   Industrial
Civiok         23.9           30.2        4      Yes   Retail
Cyberdam       27.2           58.0        5      Yes   Retail
Case 1: Companies 1
Companies characterized by mixed scale features; the first three companies making product A, the next three
making product B, and the last two product C.
Metadata: A. Features and Domain knowledge
1) Income, $ Mln;
2) MShare - Market share, per cent;
3) NSup - Number of principal suppliers;
4) EC (ECommerce) - e-trade used, Yes or No;
5) Sector - (a) Retail, (b) Utility, and (c) Industrial.
B. Main production (A,B,C)
C. Feature scale types (3 main types)

Case 1: Companies 2
1) Income, $ Mln;
Feature: Maps entities to feature values (Synonyms: Variable,
Attribute, Character, Parameter)
Feature. Quantitative scale: Arithmetic averaging makes
sense
Examples: 1) Income, 2) Mshare, 3) NSup

Case 1: Companies 3
1) Income, $ Mln;
Feature. Nominal scale: Disjunctive categories; only the comparison “equal or
not” makes sense (special case of categorical scales)
Example: 5) Sector (Retail, Utility, Industrial are values)
Feature. Binary scale: Two disjunctive categories, “Yes” and “No”
Shares properties of the nominal scale and, if 1/0 coded, of the quantitative scale
Example: 4) ECommerce

INTRO IV: QUANTITATIVE CODING
Case 1: Companies 4
Quantitative coding: Each category is made into a 1/0 binary (dummy) feature “Does
it hold? 1 if Yes, 0 if No.”
Entity   Income   MShare   NSup   EC?   Util?   Indu?   Retail?
1        19.0     43.7     2      0     1       0       0
2        29.4     36.0     3      0     1       0       0
3        23.9     38.0     3      0     0       1       0
4        18.4     27.9     2      1     1       0       0
5        25.7     22.3     3      1     0       1       0
6        12.1     16.9     2      1     0       1       0
7        23.9     30.2     4      1     0       0       1
8        27.2     58.0     5      1     0       0       1
Company data 8x5 converted to the quantitative format 8x7
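
The conversion from the 8x5 mixed-scale table to the 8x7 quantitative table can be sketched as follows. This is a minimal illustration in plain Python; the names `SECTORS` and `encode` are assumptions of the example, and the order of the dummy columns follows the coded table (Util?, Indu?, Retail?).

```python
# Company table from Case 1: (name, Income, MShare, NSup, EC, Sector).
companies = [
    ("Aversiona", 19.0, 43.7, 2, "No", "Utility"),
    ("Antyops", 29.4, 36.0, 3, "No", "Utility"),
    ("Astonite", 23.9, 38.0, 3, "No", "Industrial"),
    ("Bayermart", 18.4, 27.9, 2, "Yes", "Utility"),
    ("Breaktops", 25.7, 22.3, 3, "Yes", "Industrial"),
    ("Bumchista", 12.1, 16.9, 2, "Yes", "Industrial"),
    ("Civiok", 23.9, 30.2, 4, "Yes", "Retail"),
    ("Cyberdam", 27.2, 58.0, 5, "Yes", "Retail"),
]

# One dummy column per Sector category, in the order Util?, Indu?, Retail?.
SECTORS = ["Utility", "Industrial", "Retail"]


def encode(row):
    """Turn one mixed-scale row into seven numbers (hypothetical helper)."""
    _, income, mshare, nsup, ec, sector = row
    ec_flag = 1 if ec == "Yes" else 0                      # binary feature: 1/0 coding
    sector_flags = [1 if sector == s else 0 for s in SECTORS]  # nominal feature: dummies
    return [income, mshare, nsup, ec_flag] + sector_flags


coded = [encode(row) for row in companies]
for entity, values in enumerate(coded, start=1):
    print(entity, values)
```

The binary feature needs one column, while the three-category nominal feature needs three, which is how five features become seven.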

Case 1: Companies 5
Data analysis:
• How to map companies to the screen so that their similarity is reflected in distances
between points? (Summarization/visualization)
• Would clustering of companies reflect the product? What features would be
involved then? (Summarization)
• Can rules be derived to predict the product for another company, from outside
the table? (Correlation)
• Is there any relation between the structural features (NSup, EC, Sector) and
market-related features (Income, MShare)? (Correlation)

Case 2: Iris 1
Anderson–Fisher Iris 150x4 data of three taxa:
Specimen (1-150)   Taxon
1-50               Iris setosa (diploid)
51-100             Iris versicolor (tetraploid)
101-150            Iris virginica (hexaploid)
Features
w1 Sepal length
w2 Sepal width
w3 Petal length
w4 Petal width

INTRO IV: DATA CASES
Case 2: Iris 2
      I Iris setosa         II Iris versicolor    III Iris virginica
#     w1   w2   w3   w4     w1   w2   w3   w4     w1   w2   w3   w4
1     5.1  3.5  1.4  0.3    6.4  3.2  4.5  1.5    6.3  3.3  6.0  2.5
2     4.4  3.2  1.3  0.2    5.5  2.4  3.8  1.1    6.7  3.3  5.7  2.1
3     4.4  3.0  1.3  0.2    5.7  2.9  4.2  1.3    7.2  3.6  6.1  2.5
4     5.0  3.5  1.6  0.6    5.7  3.0  4.2  1.2    7.7  3.8  6.7  2.2
5     5.1  3.8  1.6  0.2    5.6  2.9  3.6  1.3    7.2  3.0  5.8  1.6
6     4.9  3.1  1.5  0.2    7.0  3.2  4.7  1.4    7.4  2.8  6.1  1.9
7     5.0  3.2  1.2  0.2    6.8  2.8  4.8  1.4    7.6  3.0  6.6  2.1
8     4.6  3.2  1.4  0.2    6.1  2.8  4.7  1.2    7.7  2.8  6.7  2.0
9     5.0  3.3  1.4  0.2    4.9  2.4  3.3  1.0    6.2  3.4  5.4  2.3
50    5.1  3.5  1.4  0.2    6.0  2.2  4.0  1.0    6.5  3.2  5.1  2.0
Data analysis
• Visualise the data so that similar specimens are mapped into
points that are near each other, and dissimilar ones into points that are far apart
• Build a predictor of sepal sizes from the petal sizes (to lessen the
burden of measurement)
• Build a predictor of taxa (classifier) based on the petal/sepal
sizes
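
As a taste of the third task, a very simple taxon predictor can be sketched from petal length (w3) alone. This is an illustration on the thirty specimens listed in the table, not the classifier taught later in the course: the two thresholds are just midpoints between neighbouring groups in this excerpt, and the names `midpoint_threshold` and `classify` are assumptions of the example. On the full 150 specimens the groups may not separate this cleanly.

```python
# Petal length (w3) of the ten listed specimens of each taxon.
petal_length = {
    "setosa": [1.4, 1.3, 1.3, 1.6, 1.6, 1.5, 1.2, 1.4, 1.4, 1.4],
    "versicolor": [4.5, 3.8, 4.2, 4.2, 3.6, 4.7, 4.8, 4.7, 3.3, 4.0],
    "virginica": [6.0, 5.7, 6.1, 6.7, 5.8, 6.1, 6.6, 6.7, 5.4, 5.1],
}


def midpoint_threshold(lower, upper):
    """Halfway between the largest value of one group and the smallest of the next."""
    return (max(lower) + min(upper)) / 2


t1 = midpoint_threshold(petal_length["setosa"], petal_length["versicolor"])
t2 = midpoint_threshold(petal_length["versicolor"], petal_length["virginica"])


def classify(w3):
    """Predict the taxon from petal length using the two thresholds."""
    if w3 < t1:
        return "setosa"
    if w3 < t2:
        return "versicolor"
    return "virginica"


# Count misclassified specimens among the listed ones.
errors = sum(
    classify(w3) != taxon
    for taxon, values in petal_length.items()
    for w3 in values
)
print(f"thresholds: {t1:.2f}, {t2:.2f}; errors on the listed specimens: {errors}")
```

The rule is a statement of relation between a quantitative concept (petal length) and a categorical one (taxon), which is what the course calls Correlation.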

Case 3: Intrusion attack 1
Features
1) Pr, the protocol-type, which is either tcp or icmp or udp (a nominal feature),
2) BySD, the number of data bytes from source to destination,
3) SH, the number of connections to the same host as the current one in the past two seconds,
4) SS, the number of connections to the same service as the current one in the past two
seconds,
5) SE, the rate of connections (per cent in SHCo) that have SYN errors,
6) RE, the rate of connections (per cent in SHCo) that have REJ errors,
7) A, the type of attack (ap - apache, sa - saint, sm - smurf, and no attack) (a nominal feature).
Pr    BySD    SH   SS   SE     RE     A     |  Pr    BySD   SH   SS   SE   RE   A
Tcp   62344   16   16   0      0.94   Ap    |  Tcp   287    14   14   0    0    no
Tcp   60884   17   17   0.06   0.88   Ap    |  Tcp   308    1    1    0    0    no
Tcp   59424   18   18   0.06   0.89   Ap    |  Tcp   284    5    5    0    0    no
Tcp   59424   19   19   0.05   0.89   Ap    |  Udp   105    2    2    0    0    no
Tcp   59424   20   20   0.05   0.9    Ap    |  Udp   105    2    2    0    0    no
Tcp   75484   21   21   0.05   0.9    Ap    |  Udp   105    2    2    0    0    no

Case 3: Intrusion attack 2
Data analysis
• Build a classifier to judge whether the system functions normally or is under
attack (Correlation);
• Is there any relation between the protocol and the type of attack? (Correlation);
• Visualize the data reflecting similarity of the patterns (Summarization).
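
The question about protocol and attack type concerns two nominal features, so a natural first step is a cross-classification (contingency) table. The sketch below counts the twelve connections listed in the Case 3 table; it is only an illustration on that excerpt, with lower-cased labels and variable names that are assumptions of the example.

```python
from collections import Counter

# (protocol, attack) pairs of the twelve listed connections:
# six tcp/apache records, three tcp/normal and three udp/normal.
records = [("tcp", "ap")] * 6 + [("tcp", "no")] * 3 + [("udp", "no")] * 3

# Co-occurrence counts of the two nominal features.
table = Counter(records)
protocols = sorted({protocol for protocol, _ in records})
attacks = sorted({attack for _, attack in records})

# Print the contingency table: rows are protocols, columns are attack types.
print("Pr".ljust(6) + "".join(attack.rjust(5) for attack in attacks))
for protocol in protocols:
    counts = "".join(str(table[(protocol, attack)]).rjust(5) for attack in attacks)
    print(protocol.ljust(6) + counts)
```

In this excerpt every apache attack arrives over tcp and no udp connection is an attack; whether that holds beyond these twelve records is exactly what the analysis would have to establish.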

TOPICS COVERED:
1. Data Mining and data patterns and their use: if
you find a pattern, interpret it!
2. Knowledge Enhancing: summarize to concepts,
correlate to statements of relation.
3. Visualize: to highlight or integrate aspects.
4. Illustrative data cases: concept of feature,
feature scale, data table, data analysis
problem.
