Master's Thesis Defense Presentation

TEXT MINING APPLIED TO SQL QUERIES: A CASE STUDY FOR SDSS SKYSERVER
Vitor Hirota Makiyama
Advised by Dr. Rafael D. C. dos Santos
Master in Applied Computing
National Institute for Space Research

Outline
1. Introduction
• SDSS & SkyServer
• SkyServer as a data mining tool
• Log Analysis
• Thesis motivation
2. Theory review
3. Methodology
4. Experimental results
5. Concluding remarks

SDSS & SkyServer
• The Sloan Digital Sky Survey
• 15 years of operation, now in its 4th iteration
• 3D map of over 1/3 of the sky
• 5+ million spectra
• SkyServer
• The Internet portal to SDSS, providing data access tools to the catalog

SkyServer as a data mining tool
Find all galaxies without saturated pixels within 1' of a given point:

DECLARE @saturated BIGINT;
SET @saturated = dbo.Fphotoflags('saturated');
SELECT G.objid, GN.distance
INTO ##results
FROM galaxy AS G
JOIN Fgetnearbyobjeq(185, -0.5, 1) AS GN ON G.objid = GN.objid
WHERE ( G.flags & @saturated ) = 0
ORDER BY distance

Find quasars with a line width > 2000 km/s and 2.5 < redshift < 2.7:

DECLARE @qso INT;
SET @qso = dbo.Fspecclass('QSO');
DECLARE @hiZ_qso INT;
SET @hiZ_qso = dbo.Fspecclass('HIZ-QSO');
SELECT s.specobjid,
       Max(l.sigma * 300000.0 / l.wave) AS veldisp,
       Avg(s.z) AS z
INTO ##results
FROM specobj s, specline l
WHERE s.specobjid = l.specobjid
AND ( ( s.specclass = @qso )
      OR ( s.specclass = @hiZ_qso ) )
AND l.sigma * 300000.0 / l.wave > 2000.0
AND s.zconf > 0.9
GROUP BY s.specobjid


Motivation
Apply text mining techniques to the SQL logs, defining a methodology
that parses, cleans and tokenizes statements into an intermediate
numerical representation suitable for data mining.

Outline
1. Introduction
2. Theory review
• Text Mining
• Information Retrieval
• Clustering
3. Methodology
5. Conclusion

Text Mining
Knowledge Discovery in Databases:
The nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.
Fayyad et al. (1996)

Text Mining
Text Mining:
The discovery by computer of new, previously unknown information,
by automatically extracting information from different written
sources.

Text Mining
Miner, Gary. Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Academic Press (2012)

Zipf’s Law
Singh et al. (2006)

Term Weighting
• Balances term significance within a document collection,
accounting for terms that are too common or too rare.
• TF*IDF assigns the largest weight to terms that occur with
high frequency in individual documents but are, at the same
time, relatively rare in the collection as a whole.
\[ \mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t} \]
\[ \text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t \]
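The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: it assumes raw counts for tf, the natural logarithm for idf (the slide does not state the base), and a hypothetical three-query collection whose feature names merely imitate the style used later in the deck.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # df_t: number of documents containing term t at least once
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf_{t,d}: raw count of term t in document d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical tokenized queries (illustrative only)
docs = [
    ["select_objid", "from_galaxy", "where_flags"],
    ["select_objid", "from_specobj", "where_class", "where_class"],
    ["select_objid", "from_specobj", "where_z"],
]
w = tf_idf(docs)
```

A term present in every document (here `select_objid`) gets weight zero, while a term repeated in one document and absent elsewhere (`where_class`) gets the largest weight.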

Clustering
The exploratory procedure that organizes a
collection of patterns into natural groupings
based on a given association measure.

Association Measures
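\[ \mathrm{SIM}_{EJ}(x, y) = \frac{x \cdot y}{\lVert x \rVert^2 + \lVert y \rVert^2 - x \cdot y} \]
\[ \mathrm{SIM}_{C}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} \]
\[ d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \]
\[ \mathrm{SIM}_{E}(x, y) = e^{-d^2} \]
\[ \mathrm{SIM}_{J}(X, Y) = \frac{\lvert X \cap Y \rvert}{\lvert X \cup Y \rvert} \]
Strehl et al. (2000)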
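As a rough illustration, the measures on this slide can be written directly in Python. The function names are hypothetical, vectors are plain lists of numbers, and the distance d is assumed to be the Euclidean distance. Zero vectors are not handled and would raise a division error.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def extended_jaccard_similarity(x, y):
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


def cosine_similarity(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def euclidean_similarity(x, y):
    # Maps a distance in [0, inf) to a similarity in (0, 1]
    return math.exp(-euclidean_distance(x, y) ** 2)


def jaccard_similarity(a, b):
    # Set-based version: shared items over all items
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


x, y = [1, 0, 1], [1, 1, 0]
```

For binary vectors such as `x` and `y`, the extended Jaccard value coincides with the set-based Jaccard coefficient of the corresponding term sets.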

K-Means
1. Choose k cluster centers.
2. Assign each pattern to the closest cluster center.
3. Recompute each cluster center using the current cluster
memberships.
4. If the convergence criterion is not met, go to step 2.
Manning et al. (2009)
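The four steps can be sketched as follows. Two details are assumptions, since the slide leaves them open: the initial centers are k patterns drawn at random, and convergence is taken to mean that no assignment changed between iterations. The toy data set is hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(patterns, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: choose k cluster centers (assumed: k distinct random patterns)
    centers = [list(p) for p in rng.sample(patterns, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in patterns
        ]
        # Step 4: stop when assignments no longer change (assumed criterion)
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center from its current members
        for j in range(k):
            members = [p for p, lab in zip(patterns, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = mean(members)
    return centers, labels


patterns = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = k_means(patterns, k=2)
```

On this well-separated toy set the two groups of three points end up in different clusters, whichever patterns are drawn as initial centers.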


Fuzzy C-Means
• Fuzzy extension to traditional K-Means, where every pattern
belongs to every cluster with varying degrees of membership.
• Cluster validity metrics:
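\[ v_{PC}(U) = \frac{1}{N} \sum_{k=1}^{c} \sum_{n=1}^{N} (u_{kn})^2 \]
\[ v_{PE}(U) = -\frac{1}{N} \sum_{k=1}^{c} \sum_{n=1}^{N} u_{kn} \log(u_{kn}) \]
\[ v_{FS}(U, V, X) = \sum_{k=1}^{c} \sum_{n=1}^{N} (u_{kn})^m \left( \lVert x_n - v_k \rVert^2 - \lVert v_k - \bar{v} \rVert^2 \right) \]
\[ v_{XB}(U, V, X) = \frac{\sum_{k=1}^{c} \sum_{n=1}^{N} u_{kn}^m \lVert x_n - v_k \rVert^2}{N \left( \min_{k \neq l} \lVert v_k - v_l \rVert^2 \right)} \]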
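Three of these validity metrics are easy to compute once a membership matrix is available. The sketch below assumes `U` is a list of c rows (one per cluster) with N memberships each, `V` holds the cluster centers and `X` the patterns. The function names and the toy values are hypothetical. Terms with zero membership are skipped in the entropy, following the usual convention that 0·log 0 = 0.

```python
import math


def partition_coefficient(U):
    """v_PC: 1.0 for a crisp partition, 1/c for fully uniform memberships."""
    n = len(U[0])
    return sum(u ** 2 for row in U for u in row) / n


def partition_entropy(U):
    """v_PE: 0.0 for a crisp partition, log(c) for uniform memberships."""
    n = len(U[0])
    return -sum(u * math.log(u) for row in U for u in row if u > 0) / n


def xie_beni(U, V, X, m=2.0):
    """v_XB: weighted within-cluster scatter over minimum center separation."""
    def sq(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    c, n = len(V), len(X)
    compactness = sum(
        U[k][i] ** m * sq(X[i], V[k]) for k in range(c) for i in range(n)
    )
    separation = min(
        sq(V[k], V[l]) for k in range(c) for l in range(c) if k != l
    )
    return compactness / (n * separation)


# Toy example: four 1-D patterns, two clusters, crisp memberships
X = [[0.0], [1.0], [10.0], [11.0]]
V = [[0.5], [10.5]]
U_crisp = [[1, 1, 0, 0], [0, 0, 1, 1]]
U_uniform = [[0.5] * 4, [0.5] * 4]
```

A higher partition coefficient and lower values of partition entropy and Xie-Beni indicate a better partition, which is how such indices are typically used to choose the number of clusters.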

Self-Organizing Maps
• An artificial neural network (ANN) that performs unsupervised,
competitive learning.
Yin (2008)

Self-Organizing Maps
• Maps high-dimensional data onto a regular low-dimensional grid
• Reduces the original data dimension while
preserving relationships in the data
• Particularly interesting for visualization with the
U-Matrix
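To make the mapping concrete, here is a minimal self-organizing map in plain Python. It is only a sketch under stated assumptions: a rectangular grid, a Gaussian neighborhood, linearly decaying learning rate and radius, and a U-Matrix defined as the mean distance from each node to its 4-connected grid neighbors. None of these choices is taken from the thesis, and the data set is hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    # Competitive step: the node whose weight vector is closest to x wins
    return min(weights, key=lambda node: math.dist(weights[node], x))


def train_som(data, rows, cols, epochs=50, lr0=0.5, seed=0):
    rng = random.Random(seed)
    sigma0 = max(rows, cols) / 2
    # One weight vector per grid node, initialized from random samples
    weights = {
        (r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)
    }
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = t / steps
            lr = lr0 * (1 - frac)                # decaying learning rate
            sigma = sigma0 * (1 - frac) + 1e-3   # decaying neighborhood radius
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2 * sigma ** 2))
                for i in range(len(w)):
                    w[i] += lr * h * (x[i] - w[i])  # pull node toward sample
            t += 1
    return weights


def u_matrix(weights, rows, cols):
    # Mean distance from each node to its 4-connected grid neighbors
    umat = [[0.0] * cols for _ in range(rows)]
    for (r, c), w in weights.items():
        nbrs = [
            weights[(r + dr, c + dc)]
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
            if (r + dr, c + dc) in weights
        ]
        umat[r][c] = sum(math.dist(w, n) for n in nbrs) / max(len(nbrs), 1)
    return umat


data = [[0, 0, 0], [0.1, 0, 0], [0, 0.1, 0], [1, 1, 1], [0.9, 1, 1], [1, 0.9, 1]]
weights = train_som(data, rows=3, cols=3)
umat = u_matrix(weights, 3, 3)
```

High values in `umat` mark grid regions where neighboring nodes are far apart in the input space, which is what makes the U-Matrix useful for spotting cluster boundaries.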

Outline
1. Introduction
2. Theory review
3. Methodology
• SQL queries to feature vectors
• Data mining

Methodology

SQL queries to feature vectors
Stages: Preprocessing (Parsing, Cleaning / Tokenization, Normalization) and Transformation

Preprocessing: Parsing and Normalization
Original query:
SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
platex.plate, s.fiberid, s.elodiefeh
FROM photoobj p, dbo.fgetnearbyobjeq(1.62, 27.64, 30) n,
specobj s, platex
WHERE p.objid = n.objid AND p.objid = s.bestobjid
AND s.plateid = platex.plateid AND class = 'star'
AND p.r >= 14 AND p.r <= 22.5
AND p.g >= 15 AND p.g <= 23
AND platex.plate = 2803

After parsing and normalization:
select objid ra dec u g r i z
plate fiberid elodiefeh
from photoobj fgetnearbyobjeq specobj platex
where objid objid logic objid bestobjid
logic plateid plateid logic class
logic r logic r logic g logic g
logic plate
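The normalization shown on this slide can be approximated with a handful of regular expressions. The sketch below is an assumption-laden illustration, not the parser used in the thesis: it only covers the constructs in this example (no subqueries, nested parentheses, `AS` aliases or `TOP`), and mapping `OR` to the `logic` token alongside `AND` is a guess, since the slide only shows `AND`.

```python
import re


def normalize(sql):
    s = sql.lower()
    s = re.sub(r"'[^']*'", " ", s)              # drop string literals
    s = re.sub(r"\([^()]*\)", " ", s)           # drop function argument lists
    s = re.sub(r"\b\d+(?:\.\d+)?\b", " ", s)    # drop numeric literals
    s = re.sub(r"\b(?:and|or)\b", " logic ", s)  # logical operators -> 'logic'
    s = re.sub(r"[=<>!]+", " ", s)              # drop comparison operators
    s = re.sub(r"\b\w+\.", "", s)               # drop qualifiers (p., dbo.)
    parts = re.split(r"\b(select|from|where)\b", s)
    tokens = []
    for clause, body in zip(parts[1::2], parts[2::2]):
        tokens.append(clause)
        if clause == "from":
            # Keep the table or function name, dropping any trailing alias
            tokens += [i.split()[0] for i in body.split(",") if i.split()]
        else:
            tokens += body.replace(",", " ").split()
    return " ".join(tokens)


query = """
SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
platex.plate, s.fiberid, s.elodiefeh
FROM photoobj p, dbo.fgetnearbyobjeq(1.62, 27.64, 30) n,
specobj s, platex
WHERE p.objid = n.objid AND p.objid = s.bestobjid
AND s.plateid = platex.plateid AND class = 'star'
AND p.r >= 14 AND p.r <= 22.5
AND p.g >= 15 AND p.g <= 23
AND platex.plate = 2803
"""
stream = normalize(query)
```

For this query, `stream` reproduces the token sequence on the slide. A production version would more likely build on a real SQL parser than on regular expressions.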

Feature counts:
feature               count
select_objid          1
select_ra             1
select_dec            1
select_u              1
select_g              1
select_r              1
select_i              1
select_z              1
select_plate          1
select_fiberid        1
select_elodiefeh      1
from_photoobj         1
from_fgetnearbyobjeq  1
from_specobj          1
from_platex           1
where_objid           3
where_logic           8
where_bestobjid       1
where_plateid         2
where_class           1
where_r               2
where_g               2
where_plate           1
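Turning the normalized token stream into these counts amounts to prefixing each token with the clause it appears in and tallying the result. A minimal sketch follows, assuming only the three clause keywords seen in this example. The function name is hypothetical.

```python
from collections import Counter

CLAUSES = {"select", "from", "where"}  # clause keywords seen in the example


def clause_features(token_stream):
    """Count tokens per clause, e.g. 'where ... objid' -> 'where_objid'."""
    counts = Counter()
    clause = None
    for token in token_stream.split():
        if token in CLAUSES:
            clause = token            # switch the current clause context
        elif clause is not None:
            counts[f"{clause}_{token}"] += 1
    return counts


stream = (
    "select objid ra dec u g r i z plate fiberid elodiefeh "
    "from photoobj fgetnearbyobjeq specobj platex "
    "where objid objid logic objid bestobjid logic plateid plateid "
    "logic class logic r logic r logic g logic g logic plate"
)
features = clause_features(stream)
```

The resulting `features` mapping is one row of a term-document matrix: each key is a column, and each query in the log contributes one such row.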

Preprocessing (Parsing, Normalization) → Transformation (Term Weighting, Scaling) → Term-document matrix → Data Mining (Fuzzy C-Means, Self-Organizing Maps)

Outline
1. Introduction
2. Theory review
3. Methodology

Number of clusters with FCM

Visualization with SOM
22: Finding spectra by classification (object type)
select top 100 specobjid
from specobj
where class = 'star'
and zwarning = 0
43: QSOs by spectroscopy
select top 100 specobjid, z
from specobj
where class = 'qso'
and zwarning = 0
Cosine distance:
Term-Frequency: 0.0205
SOM U-Matrix: 0.0
39: Classifications from Galaxy Zoo
select objid, nvote, p_el as elliptical,
p_cw as spiralclock, p_acw as
spiralanticlock, p_edge as edgeon, p_dk
as dontknow, p_mg as merger
from zoonospec
where objid = 1237656495650570395
39B: Classifications from Galaxy Zoo
select top 100 g.objid, zns.nvote,
zns.p_el as elliptical, zns.p_cw as
spiralclock, zns.p_acw as
spiralanticlock, zns.p_edge as edgeon,
zns.p_dk as dontknow, zns.p_mg as merger
from galaxy as g
join zoonospec as zns
on g.objid = zns.objid
where g.clean=1
and zns.nvote >= 10
and zns.p_cw > 0.8
Cosine distance:
Term-Frequency: 0.1610
SOM U-Matrix: 0.0

Outline
1. Introduction
2. Theory review
3. Methodology

Conclusions
• A methodology for parsing, cleaning and tokenizing SQL
statements into feature vectors was defined; the resulting
vectors can be used for KDD.
• Preprocessing and transformation can be tuned
according to the data mining goal.
• Foreseen applications include:
• Detailed SQL and database usage statistics
• Query recommendation systems
• Running time prediction

Publications
• Clustering SQL queries to analyse database
usage, IASC Satellite for the ISI WSC
Conference, 2015
• Text Mining Applied to SQL Queries: A Case
Study for the SDSS SkyServer, 2nd International
Symposium on Information Management and
Big Data, 2015
