SlideShare a Scribd company logo
1 of 27
Download to read offline
Interactive exploration of complex relational data sets in a
web interface
Vincent Michel - Logilab
SemWeb.pro 2012 - 2 mai 2012
1/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 1 / 2
Table des matières
1 Introduction
2 How to store and query the data ?
3 How to visualize the data ?
4 Conclusion
2/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 2 / 2
Context - Number
Increasing number of databases
Wikipedia - Semantic Web
3/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 3 / 2
Context - Size
Increasing number of HUGE databases
Few examples
Data.bnf.fr : 6.106
rdf triples / 65 Mo (April 2012).
Geonames : 8.106
features / 150.106
rdf triples / 2Gb (March 2012).
Dbpedia : 3.106
things / 19
rdf triples / > 30 Gb (July 2011).
Related issues
Storing data : lots of space required.
Updating data : massive amount of operations.
Querying/Visualization : Scalable tools.
4/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 4 / 2
Context - Format
Increasing number of HUGE databases with DIFFERENT formats
Few examples
Geonames : CSV file.
Data.bnf.fr : separated nt/n3/rdf files.
Dbpedia : (numerous) separated nt/n3 files.
NosDeputes : SQL dumps.
Related issues
Storing data : lots of conversions are required to push data within the
same RDBMS.
Querying data : some links have to be reconstructed from the dumps.
5/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 5 / 2
Context - THE Question
What can we do with THAT ? ? !
How can we query the data ? How can we visualize the data ?
6/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 6 / 2
Table des matières
1 Introduction
2 How to store and query the data ?
3 How to visualize the data ?
4 Conclusion
7/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 7 / 2
Cubicweb framework
A semantic open-source web framework written in Python
An efficient knowledge management system
Entity-relationship data-model : query directly the concepts while
abstracting the physical structure of the underlying database.
RQL (Relational Query Language) : High-level query language
operating over a relation database (PostgresSQL, MySQL, ...)
Separate query and display : visualize the results in the standard web
framework, download them in different formats (JSON, RDF, CSV,...)
Structured database : reconstruct the relations that may exist between
different entities and entity classes, so that queries will be easier (and faster
than a triple-store).
8/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 8 / 2
Data Model
The schema defines different entity classes with their attributes, as well
as the relationships between the different entity classes
9/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 9 / 2
Querying data
RQL (Relational Query Language)
Similar to SPARQL.
Higher-level than SQL, abstracts the details of the tables and the joins.
Example :
Give me 10 locations that have a population greater than 1000000, and that
are in a country named "France"
↓
Any X LIMIT 10 WHERE X is Location,
X population > 1000000, X country C, C name "France"
→ A query returns a result set (a list of results), that can be displayed using
views.
10/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 10 /
Viewing data
Full process
Request
Give me 10 locations that have a population greater than 1000000, and
that are in a country named "France"
RQL
Any X LIMIT 10 WHERE X is Location,
X population > 1000000, X country C, C name "France"
Standard views : Web framework, JSON, RDF, CSV, ...
http://domain:8080/?rql=my rql&vid=json
11/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 11 /
Table des matières
1 Introduction
2 How to store and query the data ?
3 How to visualize the data ?
4 Conclusion
12/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 12 /
Goals
We have databases with complex structure
→ Building an interface to visualize queries/perform datamining over a
relational database
Why ?
Each visualization has a corresponding url → results are addressable,
linkable and shareable.
Using Javascript-based visualization : d3.js, protovis, mapstraction, ...
Provide more complex views, based on datamining procedures.
13/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 13 /
A two steps process
Pivot data structure : numerical arrays, JSON
Processor
Convert result set → pivot structure.
The processor defines some criteria for the mathematical processing.
Array view
For numerical array, we construct Array views, associated to some
visualizations or datamining processes.
Display the arrays as graphs, histograms, maps...
14/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 14 /
Visualization pathway
PostgreSQL
database
Result Set
HTML, RDF,
JSON, CSV
...
View
Query
Result Set
DATAMINING
RESULT:
HTML, RDF,
JSON, CSV, ...
Array-View
Processor
Query
Numpy array
Standard
CubicWeb
pathway
Datamining
cube
pathway
15/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 15 /
Overview of some possibilities
One can combine almost any Processor with any Array view
Processors
attr-asfloat turns all Int, BigInt and Float attributes in the result set to
floats, and returns the corresponding array.
undirected-rel creates a binary numpy array that represents the relations
(or corelations) between entities.
attr-astoken turns all String attributes in the result set in a numpy array,
based on a Word-n-gram analyze.
Array views
Histogram, scatterplot, matrix, force-directed graph, dendrogram, ... → each
view has a default processor.
16/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 16 /
Example of combination
Query : 20 couples (name, id) of parliamentarians, in alphabetical order
Create tokens from names (processor : attr-astoken), and view tokens in table
(array-view : array-table)
http://mondomain.vm:port/?rql= Any X, N ORDERBY N ASC
LIMIT 20 WHERE X is Parlementaire, X nom N
&arid=attr-astoken&vid=array-table
17/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 17 /
Changing the view...
Create tokens from names (processor : attr-astoken), and now, view the
number of occurence of tokens using a histogram (array-view : protovis-hist)
http://mondomain.vm:port/?rql= Any X, N ORDERBY N ASC
LIMIT 20 WHERE X is Parlementaire, X nom N
&arid=attr-astoken&vid=protovis-hist
18/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 18 /
Histogram visualization - NosDeputes
Number of comments by parliamentary
Processor : attr-asfloat, Array view : protovis-hist
19/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 19 /
Scatterplot visualization 1 - Geonames
All couples (latitude, longitude) of the locations in France, with an elevation not
null
Processor : attr-asfloat, Array view : protovis-scatterplot
20/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 20 /
Scatterplot visualization 2 - Geonames
All couples (latitude, longitude) of 50000 locations in France, with a population
higher than 100 (inhabitants)
Processor : attr-asfloat, Array view : protovis-scatterplot
21/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 21 /
Matrix visualization - Geonames
All neighbour countries in North America
Processor : undirected-rel, Array view : protovis-binarymap
22/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 22 /
Graph visualization - Geonames
All neighbour countries in the World, and their corresponding continent
Processor : undirected-rel, Array view : protovis-forcedirected
23/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 23 /
Applying datamining processes
Another kind of processor based on the Numpy Array
All couples (latitude, longitude) of the locations in France, with an elevation
higher than 1
Processor : attr-asfloat, Array view : protovis-scatterplot, Datamining
process : Kmeans
24/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 24 /
Table des matières
1 Introduction
2 How to store and query the data ?
3 How to visualize the data ?
4 Conclusion
25/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 25 /
Visualization in CubicWeb
Easy to interactively process and visualize datasets.
Visualizations directly applied on queries.
All visualizations are defined by an unique URL !
Features
quick interactive exploratory visualization/datamining processings.
different processors and views for more flexibility.
based on the pivto structure (JSON, numerical arrays).
Use jointly with facets to filter results.
26/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 26 /
Conclusion
Future developments
More datamining procedures (unsupervised learning, clustering, ...).
Including more Javascript libraries.
Use sparse arrays for better performance.
Questions ?
27/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 27 /

More Related Content

What's hot

R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo dbMongoDB
 
Intake 38 data access 1
Intake 38 data access 1Intake 38 data access 1
Intake 38 data access 1Mahmoud Ouf
 
Redis Project: Relational databases & Key-Value systems
Redis Project: Relational databases & Key-Value systemsRedis Project: Relational databases & Key-Value systems
Redis Project: Relational databases & Key-Value systemsStratos Gounidellis
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4jNeo4j
 
Apache Spark — Fundamentals and MLlib
Apache Spark — Fundamentals and MLlibApache Spark — Fundamentals and MLlib
Apache Spark — Fundamentals and MLlibJens Fisseler, Dr.
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...Till Blume
 
Dbms objective and subjective notes
Dbms objective and subjective notesDbms objective and subjective notes
Dbms objective and subjective notesHitesh Kumar Markam
 

What's hot (11)

Cdi implementation
Cdi implementationCdi implementation
Cdi implementation
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo db
 
Intake 38 data access 1
Intake 38 data access 1Intake 38 data access 1
Intake 38 data access 1
 
Datastructure bsc2
Datastructure bsc2 Datastructure bsc2
Datastructure bsc2
 
Redis Project: Relational databases & Key-Value systems
Redis Project: Relational databases & Key-Value systemsRedis Project: Relational databases & Key-Value systems
Redis Project: Relational databases & Key-Value systems
 
Data Modeling with Neo4j
Data Modeling with Neo4jData Modeling with Neo4j
Data Modeling with Neo4j
 
Apache Spark — Fundamentals and MLlib
Apache Spark — Fundamentals and MLlibApache Spark — Fundamentals and MLlib
Apache Spark — Fundamentals and MLlib
 
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...Towards an Incremental Schema-level Index  for Distributed Linked Open Data G...
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...
 
PO WER - Piotr Mariat - Sql
PO WER - Piotr Mariat - SqlPO WER - Piotr Mariat - Sql
PO WER - Piotr Mariat - Sql
 
3DRepo
3DRepo3DRepo
3DRepo
 
Dbms objective and subjective notes
Dbms objective and subjective notesDbms objective and subjective notes
Dbms objective and subjective notes
 

Viewers also liked

Preference Handling in Relational Query Languages
Preference Handling in Relational Query LanguagesPreference Handling in Relational Query Languages
Preference Handling in Relational Query Languagesradned
 
Resource description framework
Resource description frameworkResource description framework
Resource description frameworkhozifa1010
 

Viewers also liked (6)

Preference Handling in Relational Query Languages
Preference Handling in Relational Query LanguagesPreference Handling in Relational Query Languages
Preference Handling in Relational Query Languages
 
Introduction to RDF
Introduction to RDFIntroduction to RDF
Introduction to RDF
 
Resource description framework
Resource description frameworkResource description framework
Resource description framework
 
ATG Advanced RQL
ATG Advanced RQLATG Advanced RQL
ATG Advanced RQL
 
Database Technologies for Semantic Web
Database Technologies for Semantic WebDatabase Technologies for Semantic Web
Database Technologies for Semantic Web
 
Build Features, Not Apps
Build Features, Not AppsBuild Features, Not Apps
Build Features, Not Apps
 

Similar to Datamining at SemWebPro 2012

Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic WebTomek Pluskiewicz
 
Validating statistical Index Data represented in RDF using SPARQL Queries: Co...
Validating statistical Index Data represented in RDF using SPARQL Queries: Co...Validating statistical Index Data represented in RDF using SPARQL Queries: Co...
Validating statistical Index Data represented in RDF using SPARQL Queries: Co...Jose Emilio Labra Gayo
 
Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013Vincent Michel
 
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...Logilab
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Triplificating and linking XBRL financial data
Triplificating and linking XBRL financial dataTriplificating and linking XBRL financial data
Triplificating and linking XBRL financial dataRoberto García
 
NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)
NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)
NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)Beat Signer
 
Expressive Query Answering For Semantic Wikis
Expressive Query Answering For  Semantic WikisExpressive Query Answering For  Semantic Wikis
Expressive Query Answering For Semantic WikisJie Bao
 
RakutenTechConf2013] [D-3_1] LeoFS - Open the New Door
RakutenTechConf2013] [D-3_1] LeoFS - Open the New DoorRakutenTechConf2013] [D-3_1] LeoFS - Open the New Door
RakutenTechConf2013] [D-3_1] LeoFS - Open the New DoorRakuten Group, Inc.
 
ICWE2017 BigDataEurope
ICWE2017 BigDataEuropeICWE2017 BigDataEurope
ICWE2017 BigDataEuropeBigData_Europe
 
DCMI Keynote: Bridging the Semantic Gaps and Interoperability
DCMI Keynote: Bridging the Semantic Gaps and InteroperabilityDCMI Keynote: Bridging the Semantic Gaps and Interoperability
DCMI Keynote: Bridging the Semantic Gaps and InteroperabilityMike Bergman
 
Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open DataIvan Herman
 
Representing verifiable statistical index computations as linked data
Representing verifiable statistical index computations as linked dataRepresenting verifiable statistical index computations as linked data
Representing verifiable statistical index computations as linked dataJose Emilio Labra Gayo
 
Advanced functional programing in Swift
Advanced functional programing in SwiftAdvanced functional programing in Swift
Advanced functional programing in SwiftVincent Pradeilles
 

Similar to Datamining at SemWebPro 2012 (20)

Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
Validating statistical Index Data represented in RDF using SPARQL Queries: Co...
Validating statistical Index Data represented in RDF using SPARQL Queries: Co...Validating statistical Index Data represented in RDF using SPARQL Queries: Co...
Validating statistical Index Data represented in RDF using SPARQL Queries: Co...
 
Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013Brainomics - CrEDIBLE 2013
Brainomics - CrEDIBLE 2013
 
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...BRAINOMICS A management system for exploring and merging heterogeneous brain ...
BRAINOMICS A management system for exploring and merging heterogeneous brain ...
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Triplificating and linking XBRL financial data
Triplificating and linking XBRL financial dataTriplificating and linking XBRL financial data
Triplificating and linking XBRL financial data
 
BICOD-2017
BICOD-2017BICOD-2017
BICOD-2017
 
Bicod2017
Bicod2017Bicod2017
Bicod2017
 
NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)
NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)
NoSQL Databases - Lecture 12 - Introduction to Databases (1007156ANR)
 
Expressive Query Answering For Semantic Wikis
Expressive Query Answering For  Semantic WikisExpressive Query Answering For  Semantic Wikis
Expressive Query Answering For Semantic Wikis
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
LOD2 Webinar Series: CubeViz
LOD2 Webinar Series: CubeViz LOD2 Webinar Series: CubeViz
LOD2 Webinar Series: CubeViz
 
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge BasesLOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
LOD2: State of Play WP2 - Storing and Querying Very Large Knowledge Bases
 
Where is the World is my Open Government Data?
Where is the World is my Open Government Data?Where is the World is my Open Government Data?
Where is the World is my Open Government Data?
 
RakutenTechConf2013] [D-3_1] LeoFS - Open the New Door
RakutenTechConf2013] [D-3_1] LeoFS - Open the New DoorRakutenTechConf2013] [D-3_1] LeoFS - Open the New Door
RakutenTechConf2013] [D-3_1] LeoFS - Open the New Door
 
ICWE2017 BigDataEurope
ICWE2017 BigDataEuropeICWE2017 BigDataEurope
ICWE2017 BigDataEurope
 
DCMI Keynote: Bridging the Semantic Gaps and Interoperability
DCMI Keynote: Bridging the Semantic Gaps and InteroperabilityDCMI Keynote: Bridging the Semantic Gaps and Interoperability
DCMI Keynote: Bridging the Semantic Gaps and Interoperability
 
Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open Data
 
Representing verifiable statistical index computations as linked data
Representing verifiable statistical index computations as linked dataRepresenting verifiable statistical index computations as linked data
Representing verifiable statistical index computations as linked data
 
Advanced functional programing in Swift
Advanced functional programing in SwiftAdvanced functional programing in Swift
Advanced functional programing in Swift
 

Recently uploaded

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Boston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxTasha Penwell
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfSubhamKumar3239
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Milind Agarwal
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 

Recently uploaded (20)

Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Data Analysis Project: Stroke Prediction
Data Analysis Project: Stroke PredictionData Analysis Project: Stroke Prediction
Data Analysis Project: Stroke Prediction
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptxThe Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
The Power of Data-Driven Storytelling_ Unveiling the Layers of Insight.pptx
 
convolutional neural network and its applications.pdf
convolutional neural network and its applications.pdfconvolutional neural network and its applications.pdf
convolutional neural network and its applications.pdf
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
Unveiling the Role of Social Media Suspect Investigators in Preventing Online...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 

Datamining at SemWebPro 2012

  • 1. Interactive exploration of complex relational data sets in a web interface Vincent Michel - Logilab SemWeb.pro 2012 - 2 mai 2012 1/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 1 / 2
  • 2. Table des matières 1 Introduction 2 How to store and query the data ? 3 How to visualize the data ? 4 Conclusion 2/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 2 / 2
  • 3. Context - Number Increasing number of databases Wikipedia - Semantic Web 3/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 3 / 2
  • 4. Context - Size Increasing number of HUGE databases Few examples Data.bnf.fr : 6.106 rdf triples / 65 Mo (April 2012). Geonames : 8.106 features / 150.106 rdf triples / 2Gb (March 2012). Dbpedia : 3.106 things / 19 rdf triples / > 30 Gb (July 2011). Related issues Storing data : lots of space required. Updating data : massive amount of operations. Querying/Visualization : Scalable tools. 4/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 4 / 2
  • 5. Context - Format Increasing number of HUGE databases with DIFFERENT formats Few examples Geonames : CSV file. Data.bnf.fr : separated nt/n3/rdf files. Dbpedia : (numerous) separated nt/n3 files. NosDeputes : SQL dumps. Related issues Storing data : lots of conversions are required to push data within the same RDBMS. Querying data : some links have to be reconstructed from the dumps. 5/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 5 / 2
  • 6. Context - THE Question What can we do with THAT ? ? ! How can we query the data ? How can we visualize the data ? 6/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 6 / 2
  • 7. Table des matières 1 Introduction 2 How to store and query the data ? 3 How to visualize the data ? 4 Conclusion 7/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 7 / 2
  • 8. Cubicweb framework A semantic open-source web framework written in Python An efficient knowledge management system Entity-relationship data-model : query directly the concepts while abstracting the physical structure of the underlying database. RQL (Relational Query Language) : High-level query language operating over a relation database (PostgresSQL, MySQL, ...) Separate query and display : visualize the results in the standard web framework, download them in different formats (JSON, RDF, CSV,...) Structured database : reconstruct the relations that may exist between different entities and entity classes, so that queries will be easier (and faster than a triple-store). 8/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 8 / 2
  • 9. Data Model The schema defines different entity classes with their attributes, as well as the relationships between the different entity classes 9/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 9 / 2
  • 10. Querying data RQL (Relational Query Language) Similar to SPARQL. Higher-level than SQL, abstracts the details of the tables and the joins. Example : Give me 10 locations that have a population greater than 1000000, and that are in a country named "France" ↓ Any X LIMIT 10 WHERE X is Location, X population > 1000000, X country C, C name "France" → A query returns a result set (a list of results), that can be displayed using views. 10/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 10 /
  • 11. Viewing data Full process Request Give me 10 locations that have a population greater than 1000000, and that are in a country named "France" RQL Any X LIMIT 10 WHERE X is Location, X population > 1000000, X country C, C name "France" Standard views : Web framework, JSON, RDF, CSV, ... http://domain:8080/?rql=my rql&vid=json 11/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 11 /
  • 12. Table des matières 1 Introduction 2 How to store and query the data ? 3 How to visualize the data ? 4 Conclusion 12/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 12 /
  • 13. Goals We have databases with complex structure → Building an interface to visualize queries/perform datamining over a relational database Why ? Each visualization has a corresponding url → results are addressable, linkable and shareable. Using Javascript-based visualization : d3.js, protovis, mapstraction, ... Provide more complex views, based on datamining procedures. 13/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 13 /
  • 14. A two steps process Pivot data structure : numerical arrays, JSON Processor Convert result set → pivot structure. The processor defines some criteria for the mathematical processing. Array view For numerical array, we construct Array views, associated to some visualizations or datamining processes. Display the arrays as graphs, histograms, maps... 14/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 14 /
  • 15. Visualization pathway PostgreSQL database Result Set HTML, RDF, JSON, CSV ... View Query Result Set DATAMINING RESULT: HTML, RDF, JSON, CSV, ... Array-View Processor Query Numpy array Standard CubicWeb pathway Datamining cube pathway 15/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 15 /
  • 16. Overview of some possibilities One can combine almost any Processor with any Array view Processors attr-asfloat turns all Int, BigInt and Float attributes in the result set to floats, and returns the corresponding array. undirected-rel creates a binary numpy array that represents the relations (or corelations) between entities. attr-astoken turns all String attributes in the result set in a numpy array, based on a Word-n-gram analyze. Array views Histogram, scatterplot, matrix, force-directed graph, dendrogram, ... → each view has a default processor. 16/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 16 /
  • 17. Example of combination Query : 20 couples (name, id) of parliamentarians, in alphabetical order Create tokens from names (processor : attr-astoken), and view tokens in table (array-view : array-table) http://mondomain.vm:port/?rql= Any X, N ORDERBY N ASC LIMIT 20 WHERE X is Parlementaire, X nom N &arid=attr-astoken&vid=array-table 17/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 17 /
  • 18. Changing the view... Create tokens from names (processor : attr-astoken), and now, view the number of occurence of tokens using a histogram (array-view : protovis-hist) http://mondomain.vm:port/?rql= Any X, N ORDERBY N ASC LIMIT 20 WHERE X is Parlementaire, X nom N &arid=attr-astoken&vid=protovis-hist 18/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 18 /
  • 19. Histogram visualization - NosDeputes Number of comments by parliamentary Processor : attr-asfloat, Array view : protovis-hist 19/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 19 /
  • 20. Scatterplot visualization 1 - Geonames All couples (latitude, longitude) of the locations in France, with an elevation not null Processor : attr-asfloat, Array view : protovis-scatterplot 20/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 20 /
  • 21. Scatterplot visualization 2 - Geonames All couples (latitude, longitude) of 50000 locations in France, with a population higher than 100 (inhabitants) Processor : attr-asfloat, Array view : protovis-scatterplot 21/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 21 /
  • 22. Matrix visualization - Geonames All neighbour countries in North America Processor : undirected-rel, Array view : protovis-binarymap 22/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 22 /
  • 23. Graph visualization - Geonames All neighbour countries in the World, and their corresponding continent Processor : undirected-rel, Array view : protovis-forcedirected 23/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 23 /
  • 24. Applying datamining processes Another kind of processor based on the Numpy Array All couples (latitude, longitude) of the locations in France, with an elevation higher than 1 Processor : attr-asfloat, Array view : protovis-scatterplot, Datamining process : Kmeans 24/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 24 /
  • 25. Table des matières 1 Introduction 2 How to store and query the data ? 3 How to visualize the data ? 4 Conclusion 25/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 25 /
  • 26. Visualization in CubicWeb Easy to interactively process and visualize datasets. Visualizations directly applied on queries. All visualizations are defined by an unique URL ! Features quick interactive exploratory visualization/datamining processings. different processors and views for more flexibility. based on the pivto structure (JSON, numerical arrays). Use jointly with facets to filter results. 26/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 26 /
  • 27. Conclusion Future developments More datamining procedures (unsupervised learning, clustering, ...). Including more Javascript libraries. Use sparse arrays for better performance. Questions ? 27/27 Vincent Michel - Logilab () Exploration of relational data sets SemWeb.pro 2012 - 2 mai 2012 27 /