TEXT MINING APPLIED TO SQL
QUERIES: A CASE STUDY FOR SDSS
SKYSERVER
Vitor Hirota Makiyama
Advised by Dr. Rafael D. C. dos Santos
Master in Applied Computing
National Institute for Space Research
Outline
1. Introduction
• SDSS & SkyServer
• SkyServer as a data mining tool
• Log Analysis
• Thesis motivation
2. Theory review
3. Methodology
4. Experimental results
5. Conclusion remarks
SDSS & SkyServer
• The Sloan Digital Sky Survey
• 15 years of operation, 4th iteration
• 3D map of over 1/3 of the sky, and
• 5+ million spectra
• SkyServer
• The Internet portal to SDSS
providing data access tools to the
catalog
SkyServer as a data mining tool
Find all galaxies without saturated
pixels within 1' of a given point:
DECLARE @saturated BIGINT;
SET @saturated = dbo.Fphotoflags('saturated');
SELECT G.objid, GN.distance
INTO ##results
FROM galaxy AS G
JOIN Fgetnearbyobjeq(185, -0.5, 1) AS GN ON G.objid = GN.objid
WHERE ( G.flags & @saturated ) = 0
ORDER BY distance

Find quasars with a line width >
2000 km/s and 2.5 < redshift < 2.7:
DECLARE @qso INT;
SET @qso = dbo.Fspecclass('QSO');
DECLARE @hiZ_qso INT;
SET @hiZ_qso = dbo.Fspecclass('HIZ-QSO');
SELECT s.specobjid,
       Max(l.sigma * 300000.0 / l.wave) AS veldisp,
       Avg(s.z) AS z
INTO ##results
FROM specobj s, specline l
WHERE s.specobjid = l.specobjid
  AND ( ( s.specclass = @qso )
        OR ( s.specclass = @hiZ_qso ) )
  AND l.sigma * 300000.0 / l.wave > 2000.0
  AND s.zconf > 0.9
GROUP BY s.specobjid
Log Analysis
[Figures: log analysis, 2006 and 2014]
Motivation
Apply text mining techniques over the
SQL logs to define a methodology to
parse, clean and tokenize statements
into an intermediate numerical
representation for data mining.
Outline
1. Introduction
2. Theory review
• Text Mining
• Information Retrieval
• Clustering
3. Methodology
4. Experimental results
5. Conclusion
Text Mining
Knowledge Discovery in Databases:
The nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data.
Fayyad et al. (1996)
Text Mining
Text Mining:
The discovery by computer of new, previously unknown
information, by automatically extracting information from
different written sources.
Fayyad et al. (1996)
Text Mining
Miner, Gary. Practical text mining and statistical analysis for
non-structured text data applications. Academic Press (2012)
Zipf’s Law
Singh et al. (2006)
Term Weighting
• To balance term significance within a document
collection, accounting for terms that are too common
or too rare.
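• TF*IDF assigns the largest weight to terms that arise
with high frequency in individual documents but are,
at the same time, relatively rare in the collection as a
whole.
$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$
$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$
where N is the number of documents in the collection and df_t is the number of documents that contain term t.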
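The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `tf_idf`, the toy token lists, and the use of the natural logarithm are all assumptions (the slide does not state the log base).

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # df_t: number of documents that contain term t at least once
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # tf_{t,d}: raw count of term t in document d
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical tokenized queries, used only to exercise the function.
docs = [
    ["select_objid", "from_galaxy", "where_objid"],
    ["select_objid", "from_specobj", "where_class", "where_class"],
    ["select_z", "from_specobj", "where_class"],
]
weights = tf_idf(docs)
for w in weights:
    print(w)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the vector, which is the "too common" case the slide mentions.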
Clustering
The exploratory procedure that organizes a
collection of patterns into natural groupings
based on a given association measure.
Association Measures
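Euclidean distance:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Euclidean similarity:
$$SIM_E(x, y) = e^{-d^2}$$
Cosine similarity:
$$SIM_C(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$$
Extended Jaccard similarity:
$$SIM_{EJ}(x, y) = \frac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$$
Jaccard similarity (sets):
$$SIM_J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$$
Strehl et al. (2000)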
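As a rough sketch, the association measures on this slide can be written directly in plain Python. The function names are illustrative choices, and the sample points x1 = (3, 1) and x2 = (1, 2) are the two points mentioned in the editor's notes for this slide.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def sim_euclidean(x, y):
    # Distance turned into a similarity in (0, 1]
    return math.exp(-euclidean(x, y) ** 2)

def sim_cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def sim_extended_jaccard(x, y):
    return dot(x, y) / (dot(x, x) + dot(y, y) - dot(x, y))

def sim_jaccard(set_x, set_y):
    # Set version: size of the intersection over size of the union
    return len(set_x & set_y) / len(set_x | set_y)

x1, x2 = (3, 1), (1, 2)
print(sim_euclidean(x1, x2), sim_cosine(x1, x2), sim_extended_jaccard(x1, x2))
print(sim_jaccard({"objid", "ra", "dec"}, {"objid", "z"}))
```

For these two points the dot product is 5 and the squared norms are 10 and 5, so the extended Jaccard similarity is 5 / (10 + 5 - 5) = 0.5.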
K-Means
1. Choose k cluster centers.
2. Assign each pattern to the closest cluster center.
3. Recompute each cluster center using the current cluster
memberships.
4. If the convergence criterion is not met, go to step 2.
Manning et al. (2009)
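The four steps can be sketched as follows. This is a minimal pure-Python version under stated assumptions: the initial centers are passed in explicitly (step 1), squared Euclidean distance is the association measure, and "convergence" means the centers stop moving. The function names and the toy points are hypothetical.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centers, max_iter=100):
    """Returns (centers, clusters) after running steps 2 to 4 until convergence."""
    centers = [tuple(c) for c in centers]  # step 1: centers chosen by the caller
    for _ in range(max_iter):
        # Step 2: assign each pattern to the closest cluster center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the mean of its current members
        # (an empty cluster keeps its previous center)
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[k]
            for k, members in enumerate(clusters)
        ]
        # Step 4: stop when the centers no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
print(centers)
print(clusters)
```

Other stopping criteria, such as a fixed number of iterations or patterns no longer changing cluster, fit the same loop.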
Fuzzy C-Means
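• Fuzzy extension of traditional K-Means, where every pattern
belongs to every cluster with varying degrees of membership.
• Cluster validity metrics:
Partition coefficient:
$$v_{PC}(U) = \frac{1}{N} \sum_{k=1}^{c} \sum_{n=1}^{N} (u_{kn})^2$$
Partition entropy:
$$v_{PE}(U) = -\frac{1}{N} \sum_{k=1}^{c} \sum_{n=1}^{N} u_{kn} \log(u_{kn})$$
Fukuyama-Sugeno:
$$v_{FS}(U, V, X) = \sum_{k=1}^{c} \sum_{n=1}^{N} (u_{kn})^m \left( \|x_n - v_k\|^2 - \|v_k - \bar{v}\|^2 \right)$$
Xie-Beni:
$$v_{XB}(U, V, X) = \frac{\sum_{k=1}^{c} \sum_{n=1}^{N} (u_{kn})^m \, \|x_n - v_k\|^2}{N \left( \min_{k \neq l} \|v_k - v_l\|^2 \right)}$$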
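To make the validity metrics concrete, the sketch below computes a membership matrix for fixed cluster centers and then evaluates the partition coefficient, partition entropy, and Xie-Beni index. The membership formula is the standard Fuzzy C-Means update, which the slide does not show, so treat it as an assumption; the function names, the fuzzifier m = 2, and the toy data are likewise illustrative.

```python
import math

def memberships(points, centers, m=2.0):
    """u[k][n]: degree to which point n belongs to cluster k.
    Assumes no point coincides exactly with a center (no zero distances)."""
    u = [[0.0] * len(points) for _ in centers]
    for n, x in enumerate(points):
        d = [math.dist(x, v) for v in centers]
        for k in range(len(centers)):
            u[k][n] = 1.0 / sum((d[k] / d[l]) ** (2.0 / (m - 1.0)) for l in range(len(centers)))
    return u

def partition_coefficient(u):
    n_points = len(u[0])
    return sum(v * v for row in u for v in row) / n_points

def partition_entropy(u):
    n_points = len(u[0])
    return -sum(v * math.log(v) for row in u for v in row if v > 0) / n_points

def xie_beni(u, points, centers, m=2.0):
    compactness = sum(
        (u[k][n] ** m) * math.dist(x, v) ** 2
        for k, v in enumerate(centers)
        for n, x in enumerate(points)
    )
    separation = min(
        math.dist(vk, vl) ** 2
        for k, vk in enumerate(centers)
        for l, vl in enumerate(centers)
        if k != l
    )
    return compactness / (len(points) * separation)

points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
centers = [(0.0, 0.5), (4.0, 0.5)]
u = memberships(points, centers)
pc = partition_coefficient(u)
pe = partition_entropy(u)
xb = xie_beni(u, points, centers)
print(pc, pe, xb)
```

The partition coefficient and partition entropy look only at the membership matrix, while Xie-Beni (like Fukuyama-Sugeno) also uses the data and the centers. A common use is to run the clustering for several values of c and compare the indices.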
Self-Organizing Maps
• ANN that performs unsupervised, competitive
learning.
Yin (2008)
• Maps high-dimensional data into a regular
low-dimensional grid
• Reduces the original data dimension while
preserving relationships of the data
• Particularly interesting for visualization with the
U-Matrix
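A small self-organizing map and its U-Matrix can be sketched in plain Python as below. Everything here is an illustrative assumption rather than the configuration used in the thesis: the grid size, the linearly decaying learning rate and neighbourhood radius, the Gaussian neighbourhood function, and the toy data.

```python
import math
import random

def train_som(data, rows, cols, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Returns {(row, col): weight_vector} for a rows x cols grid."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.3)   # shrinking neighbourhood radius
        for x in data:
            # Competitive step: find the best matching unit (BMU)
            bmu = min(weights, key=lambda node: math.dist(x, weights[node]))
            # Cooperative step: pull the BMU and its grid neighbours toward x
            for node, w in weights.items():
                grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-grid_d2 / (2.0 * sigma ** 2))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
    return weights

def u_matrix(weights, rows, cols):
    """Mean distance from each node's weights to those of its 4-connected grid neighbours."""
    um = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr, c + dc) for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if (r + dr, c + dc) in weights]
            um[r][c] = sum(math.dist(weights[(r, c)], weights[n]) for n in nbrs) / len(nbrs)
    return um

data = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (0.9, 1.0), (1.0, 0.9), (0.95, 0.95)]
som = train_som(data, rows=3, cols=3)
um = u_matrix(som, rows=3, cols=3)
for row in um:
    print([round(v, 3) for v in row])
```

High U-Matrix values mark grid regions where neighbouring nodes are far apart in the input space, which is why the matrix is usually rendered as a colour map to reveal cluster borders.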
Outline
1. Introduction
2. Theory review
3. Methodology
• SQL queries to feature vectors
• Data mining
4. Experimental results
5. Conclusion remarks
Methodology
Fayyad et al. (1996)
SQL queries to feature vectors
• Preprocessing
  • Parsing
  • Cleaning / Tokenization
  • Normalization
• Transformation
Original query:
SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
platex.plate, s.fiberid, s.elodiefeh
FROM photoobj p, dbo.fgetnearbyobjeq(1.62, 27.64, 30) n,
specobj s, platex
WHERE p.objid = n.objid AND p.objid = s.bestobjid
AND s.plateid = platex.plateid AND class = 'star'
AND p.r >= 14 AND p.r <= 22.5
AND p.g >= 15 AND p.g <= 23
AND platex.plate = 2803
After preprocessing:
select objid ra dec u g r i z
plate fiberid elodiefeh
from photoobj fgetnearbyobjeq specobj platex
where objid objid logic objid bestobjid
logic plateid plateid logic class
logic r logic r logic g logic g
logic plate
select_objid 1
select_ra 1
select_dec 1
select_u 1
select_g 1
select_r 1
select_i 1
select_z 1
select_plate 1
select_fiberid 1
select_elodiefeh 1
from_photoobj 1
from_fgetnearbyobjeq 1
from_specobj 1
from_platex 1
where_objid 3
where_logic 8
where_bestobjid 1
where_plateid 2
where_class 1
where_r 2
where_g 2
where_plate 1
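The transformation from the query to the term counts listed above can be approximated with a few regular expressions. This is a simplified sketch, not the parser used in the thesis: the function name `sql_to_features` and every rule in it are assumptions inferred from the example (drop literals and function arguments, strip qualifiers and aliases, map AND/OR to `logic`, prefix each token with its clause). It does not handle nested parentheses, JOIN syntax, or subqueries.

```python
import re
from collections import Counter

def sql_to_features(query):
    """Turn one SELECT statement into {clause_token: count} (simplified sketch)."""
    q = query.lower()
    q = re.sub(r"['‘’][^'‘’]*['‘’]", " ", q)      # drop string literals
    q = re.sub(r"\([^()]*\)", " ", q)             # drop function-call arguments
    q = re.sub(r"\b\d+(?:\.\d+)?\b", " ", q)      # drop numeric literals
    q = re.sub(r"\b\w+\.", "", q)                 # strip qualifiers: p.objid -> objid
    q = re.sub(r"\b(?:and|or)\b", "logic", q)     # logical operators -> 'logic'

    features = Counter()
    parts = re.split(r"\b(select|from|where)\b", q)
    for clause, body in zip(parts[1::2], parts[2::2]):
        if clause == "from":
            # Keep the first identifier of each comma-separated item (drops table aliases)
            found = [re.search(r"[a-z_]\w*", item) for item in body.split(",")]
            tokens = [match.group(0) for match in found if match]
        else:
            tokens = re.findall(r"[a-z_]\w*", body)
        features.update(f"{clause}_{token}" for token in tokens)
    return features

QUERY = """
SELECT p.objid, p.ra, p.dec, p.u, p.g, p.r, p.i, p.z,
       platex.plate, s.fiberid, s.elodiefeh
FROM photoobj p, dbo.fgetnearbyobjeq(1.62, 27.64, 30) n,
     specobj s, platex
WHERE p.objid = n.objid AND p.objid = s.bestobjid
  AND s.plateid = platex.plateid AND class = 'star'
  AND p.r >= 14 AND p.r <= 22.5
  AND p.g >= 15 AND p.g <= 23
  AND platex.plate = 2803
"""
features = sql_to_features(QUERY)
for term, count in features.items():
    print(term, count)
```

Prefixing each token with its clause keeps `select_objid` and `where_objid` as separate features, so the vector records where in the statement a column is used, not only that it appears.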
SQL queries to feature vectors
• Preprocessing
  • Parsing
  • Cleaning / Tokenization
  • Normalization
• Transformation
  • Term Weighting
  • Scaling
• Output: term-document matrix
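As a rough illustration of the last step, per-query term counts can be stacked into a dense term-document matrix and then scaled. The slides do not say which scaling was used, so scaling each row to unit Euclidean length is an assumption here, as are the function names and the two toy feature dictionaries.

```python
import math

def term_document_matrix(feature_dicts):
    """One row per query (document), one column per term in the sorted vocabulary."""
    vocabulary = sorted({term for features in feature_dicts for term in features})
    matrix = [[float(features.get(term, 0)) for term in vocabulary]
              for features in feature_dicts]
    return vocabulary, matrix

def scale_rows_to_unit_length(matrix):
    scaled = []
    for row in matrix:
        length = math.sqrt(sum(v * v for v in row))
        # Leave all-zero rows untouched to avoid dividing by zero
        scaled.append([v / length for v in row] if length > 0 else list(row))
    return scaled

# Hypothetical term counts for two queries
query_features = [
    {"select_specobjid": 1, "from_specobj": 1, "where_class": 1, "where_logic": 1},
    {"select_objid": 1, "from_galaxy": 1, "where_objid": 2},
]
vocabulary, matrix = term_document_matrix(query_features)
scaled = scale_rows_to_unit_length(matrix)
print(vocabulary)
for row in scaled:
    print([round(v, 3) for v in row])
```

Term weighting such as TF-IDF would be applied to the counts before scaling. With thousands of distinct terms, a sparse representation would be a more practical choice than the dense lists used here.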
Data mining
• Input: term-document matrix (output of preprocessing and transformation)
• Fuzzy C-Means
• Self-Organizing Maps
Outline
1. Introduction
2. Theory review
3. Methodology
4. Experimental results
5. Conclusion remarks
Number of clusters with FCM
[Figures: number of clusters with FCM]
Visualization with SOM
22: Finding spectra by classification (object type)
select top 100 specobjid
from specobj
where class = 'star'
and zwarning = 0
43: QSOs by spectroscopy
select top 100 specobjid, z
from specobj
where class = 'qso'
and zwarning = 0
Cosine distance:
Term-Frequency: 0.0205
SOM U-Matrix: 0.0
39: Classifications from Galaxy Zoo
select objid, nvote, p_el as elliptical,
p_cw as spiralclock, p_acw as
spiralanticlock, p_edge as edgeon, p_dk
as dontknow, p_mg as merger
from zoonospec
where objid = 1237656495650570395
39B: Classifications from Galaxy Zoo
select top 100 g.objid, zns.nvote,
zns.p_el as elliptical, zns.p_cw as
spiralclock, zns.p_acw as
spiralanticlock, zns.p_edge as edgeon,
zns.p_dk as dontknow, zns.p_mg as merger
from galaxy as g
join zoonospec as zns
on g.objid = zns.objid
where g.clean=1
and zns.nvote >= 10
and zns.p_cw > 0.8
Cosine distance:
Term-Frequency: 0.1610
SOM U-Matrix: 0.0
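The cosine distance reported for each pair of queries is one minus the cosine similarity of their feature vectors. The sketch below computes it for sparse term-count dictionaries. The two vectors are hand-built approximations of sample queries 22 and 43, assuming clause-prefixed tokens and ignoring `top 100`; because the exact tokenization and weighting behind the slide's figures are not shown, the result is not expected to reproduce the 0.0205 reported above.

```python
import math

def cosine_distance(a, b):
    """a, b: sparse vectors as {term: weight} dicts. Returns 1 - cosine similarity."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical term counts for sample queries 22 and 43
q22 = {"select_specobjid": 1, "from_specobj": 1,
       "where_class": 1, "where_logic": 1, "where_zwarning": 1}
q43 = dict(q22, select_z=1)  # same query plus the extra selected column z

distance = cosine_distance(q22, q43)
print(round(distance, 4))
```

The two queries share five of six terms, so the distance is small; a query compared with itself gives a distance of zero.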
Outline
1. Introduction
2. Theory review
3. Methodology
4. Experimental results
5. Conclusion remarks
Conclusions
• A methodology for proper parsing, cleaning and
tokenization of SQL statements into feature vectors
was defined, which can be used for KDD.
• Preprocessing and transformation can be tuned
according to the data mining goal.
• Foreseen applications include:
• Detailed SQL and database usage statistics
• Query recommendation systems
• Running time prediction
Publications
• Clustering SQL queries to analyse database
usage, IASC Satellite for the ISI WSC
Conference, 2015
• Text Mining Applied to SQL Queries: A Case
Study for the SDSS SkyServer, 2nd International
Symposium on Information Management and
Big Data, 2015
Thank you!

Masters Thesis Defense Presentation


Editor's Notes

  • #4 Accessed both by astronomers and for educational purposes. Processing of raw images for attribute extraction. Tools ranging from visualization to data extraction. Figures: famous places, interactive map. SkyServer: access via SQL.
  • #5 Data mining motivation. 20 common astronomy questions that could be expressed in SQL. Database design driven by the 20 questions; goal: efficiency. Astronomer's workflow (download, analysis). SQL = productivity gain.
  • #6 Online version: 10 min, 500k rows.
  • #7 Unlimited time versus the 10 min online timeout. Complex queries, hours of execution. CasJobs: keeps such queries from exhausting resources. Personal, customizable database: stored procedures, functions, temporary data. SQL: powerful and flexible.
  • #8 19+ million page hits and 2 million SQL queries per month. 1.5 billion hits and 300 million queries since 2001. 5th and 10th year of operation: usage and traffic analysis.
  • #13 TM: an extension of KDD. Focus on unstructured / semi-structured data.
  • #14 Multidisciplinary research area: information retrieval and extraction, NLP, ML, DB, visualization. Technologies: information extraction, topic tracking, summarization, categorization, information visualization, question answering, and clustering.
  • #15 The product of the frequency of use of words and the rank order is approximately constant.
  • #17 Technique of interest for our work. Patterns in the same cluster are more similar to each other than to any point outside the cluster. User exploration: selective drill-down into a relevant group.
  • #18 - Binary relation: similar/dissimilar or near/far - Transformation from dissimilarity to similarity - Points x1 = (3, 1) and x2 = (1, 2); iso-similarity lines for sim = 0.25, 0.5 and 0.75 - Euclidean: hyperspheres - Cosine: hypercones passing through the origin - Jaccard: non-concentric hyperspheres; for s close to 0 it behaves like Euclidean, for s close to 1 like cosine - Jaccard: probability that at least 1 item belongs to both sets
  • #19 Stopping criteria: number of iterations, patterns no longer change centroid, centroids no longer move. Figure 1: Choosing 2 centroids at random. Figure 2: Assigning each point to its nearest centroid. Figure 3: Recomputing each centroid's position as the mean of the points assigned to that cluster.
  • #20 Figure 1: Convergence and final result after 9 iterations. Figure 2: The path followed by the centroids during the execution of the algorithm.
  • #21 Partition coefficient, partition entropy, Fukuyama-Sugeno, Xie-Beni. Measuring fuzziness. vPC / vPE: consider only the fuzziness of the membership matrix. vFS / vXB: consider the data and the centroids.
  • #23 Relationship: topology; close in the dataset, close in the map. U-Matrix: unified distance matrix, color.
  • #26, #27, #28, #29 - Tokenization: the process of splitting the corpus into terms, e.g. splitting on whitespace - Cleaning of unnecessary characters (line breaks) - Normalization: lowercase, aliases, keywords