Large Graph Mining
Recent Developments, Challenges and Potential Solutions
EBISS,
20 of July 2012
Brussels
SABRI SKHIRI / RESEARCH DIRECTOR EURA NOVA
PASSIONATE ABOUT COMPUTER SCIENCE, TECHNOLOGY & RESEARCH
THE SPEAKER
Research director @ EURA NOVA
Make the link between Research & Customer challenges
Supervising 3 PhD theses and 6 Master's theses with 3 Belgian universities
Head of the EU R&D Architecture
for a Telco equipment provider
Guiding the transition from Telco to Service provider with new technologies
Committer on open source
projects launched @ EURA NOVA
RoQ-Messaging, NAIAD, Wazaabi
Warm-up test to wake up the room after lunch on a Friday afternoon …
Before starting
I will use people to illustrate the topics in this tutorial
Can you give me their names?
Leonard, Sheldon, Moss, Larry Page
Looks like we are ready to start learning about graph processing!
AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
Graph Mining needs another approach
EXECUTIVE SUMMARY
Data Mining
Mature, algorithmic, libraries & products
New Needs
Linked data & reasoning on relationships
What do we need?
Is traditional data mining still applicable?
Graph Data Warehouse
Is traditional data warehouse still applicable?
Flat data, relational data,
multi-dimensional data
No Linked data
Biology
Chemistry
Social Networks
Internet - Networks
Graph-based similarity
Algorithm re-design for graphs
Scalability for storage & processing
Conceptual modeling
Query
Processing Stack & materialization
Storage
LET’S START
WITH DATA MINING
Process of discovering patterns or models of data. Those
patterns often consist of previously unknown and implicit
information and knowledge embedded within a data set [1]
[1] M.-S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowl. Data Eng.,
8(6):866–883, 1996.
Techniques have been developed over the last 20 years
DATA MINING
Process of analyzing data from different perspectives and
summarizing it into useful information
Pattern recognition
We mine data to retrieve pre-determined patterns
Clustering
Data are grouped into partitions according to given criteria
Association
Links data items to each other
Classification
We place data in a pre-determined group
Feature extraction
We transform the input data into a set of features (data set reduction)
Summarization
Ranking, such as PageRank
Manages & processes data as a collection of independent instances
DATA MINING
Mining usually does not consider the global relations
between the objects
Almost all clustering algorithms compute the similarity between all pairs of
objects in the data set
Taking into account the relation between data in mining
Why does the relationship matter?
Imagine clustering people based on their profiles
Imagine clustering people not only based on their profiles but also
… on their social interactions
New industrial needs require dealing with this kind of structured data
More complete Data structure
Greater expressive power
Better model of real-life cases
New Industry requirements
Need to structure and mine structured & linked data
The metabolic pathways
1. Biochemical Networks
http://biocyc.org
Genetic regulation signal
1. Biochemical Networks
Taking a systemic approach, we end up with a huge interaction graph
A biochemical network definition
1. Biochemical Networks
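$G = (V, E)$
$V = V_G \cup V_{Re} \cup V_P \cup V_C$
$E = E_g \cup E_{Trans} \cup E_{act}$
$E_g \subseteq \{V_P \times V_{Re}\}$
$E_{Trans} \subseteq \{V_G \times V_{Re}\}$
$E_{act} \subseteq \{V_C \times V_{Re}\}$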
New emergent industrial needs
1. Biochemical Networks
What happens if I drop a compound in the system ?
Drug simulation in drug design
Predict a metabolic pathway given a metabolic network and seed reactions
Subgraph extraction
Find which genes are involved in the fat reduction pathway?
Genetic therapy
Predict a metabolic network from a genetic signature given a protein interaction
graph & a regulation network
New emergent industrial needs
2. Chemical Databases
Database specifically designed to store chemical
information.
Atoms
Bonds
Graphs are the natural representation for chemical compounds; most of the
mining algorithms in this field focus on chemical graphs
New emergent industrial needs
2. Chemical Databases
A typical request: Structural similarity search
$G_d = (V_d, E_d)$, with $V_d = (\theta_1, \ldots, \theta_n)$
$G_d$ is the query graph
The objective is to maximize the probability that
the $i$-th $\theta$ equals $\alpha$, knowing the measures $a$ and $b$:
$\max\big(P(\theta_i = \alpha_i \mid a, b)\big)$ with $\alpha_i \in \{V\}$
New emergent industrial needs
2. Chemical Databases
Structural indexing
Indexing the structural properties of the molecules
Structural similarity search
Similar molecules will have similar effects
Structure-Activity-Relationship
How to modify the Structure for changing its activity
3D molecule conformation
Based on similar molecule conformations
New emergent industrial needs
2. Chemical Databases
Structure-Activity-Relationship
Example of sucralose, where 3 hydroxyl groups have been replaced with
chlorine atoms (Cl)
Sugar C12H22O11
Diet Sugar C12H19Cl3O8
http://en.wikipedia.org/wiki/Sucralose
New emergent industrial needs
3. Social network analytics
The Social Graph models the (direct or indirect) Social
interactions between users
Example of Trust from a bipartite Graph
3. Social network analytics
The Goal is to infer trust connections between actors in
set A only connected through Item I
Daire O'Doherty, Salim Jouili, Peter Van Roy:
Towards trust inference from bipartite social
networks. DBSocial 2012: 13-18
A measure that compares similarity and diversity
Highly connected shared items will have higher
distance values
Example of Trust from a bipartite Graph
3. Social network analytics
Daire O'Doherty, Salim Jouili, and Peter Van Roy. Trust-
Based Recommendation: An Empirical Analysis, Sixth
ACM Workshop on Social Network Mining and Analysis
(SNA-KDD 2012), Beijing, China, Aug. 12, 2012.
New emergent industrial needs
3. Social network analytics
People you may know
Structural similarity based
Trust computation on structural properties
Used for accurate recommendation
Collaborative filtering
Tends to like what your friends like
Influence management
Used in marketing models
Marketing model to influence users
3. Social network analytics
SOCIAL KNOWLEDGE
TRADITIONAL
MARKETING MODELS
Bolton 1998
Bolton & Lemon 1999
SOCIAL MODELS
Nitzan & Libai 2011 / Singer 2012
INFLUENCE NETWORK Able to predict much more accurately
> How to influence influencer to reach objectives
Viral marketing maven
Accurate
churners
Product (content, services, etc.)
adoption
Loyal user to reward to optimize the subscriber base
Decrease
acquisition
costs
Building an interaction-based model for INFLUENCE
3. Social network analytics
Vertex similarity distance
Edge weight computing
Betweenness centrality computation
Temporal analysis and version at
vertex/edge
When all social interaction variables are
considered within the same model, we end up
with a very powerful Social Profile model
LET’S USE GRAPHS
Can I use the traditional data mining approaches ?
What changes with graphs?
Problem Statement
Similarity & Distances
Must be graph-based
Structural nature of the data model
Makes mining algorithms more challenging to implement
Scalability issue
Most graph mining problems involve very large graphs
Most existing graph mining algorithms keep the data in main
memory -> not possible anymore
Let’s position this tutorial
Problem Statement
BSP approach
Uses a fully distributed approach
Google Pregel, Apache Hama
In-memory/MPI/HPC
Uses multi-processor implementations
SNAP
Graph DB
Focus on storage & graph traversal
Neo4j, Dex, OrientDB
This tutorial focuses on the BSP approach
Uses a fully distributed approach
Google Pregel, Apache Hama
Given a set of data mining algorithms, how can we adapt
them to fully leverage the distributed processing approach?
The base data model is not the same anymore
Using the distributed way
(Distributed) Storage
Graph Model
(Distributed) graph processing
Mining algorithm
The algorithm implementation will depend on the underlying
distributed processing paradigm
Graph Mining algorithms
Let’s see what a graph mining algorithm looks like
A ranking algorithm
Page Rank
The web is a network of web pages
In addition to the page content, the page linkage represents a useful
source of knowledge and information
Compute a ranking on every web page based only on
the linkage structure
L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking:
Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November
1999. Previous number = SIDL-WP-1999-0120.
Basic concepts
Page Rank
Authority: approximated by the number & the importance of the pages
pointing to the considered page
Random surfer who browses the pages
Page Rank
Either,
1. The surfer chooses an outgoing link of the current vertex
uniformly at random, and follows that link to the destination
vertex, or
2. it “teleports” to a completely random Web page, independent of
the links out of the current vertex.
Intuitively, the random surfer frequently visits “important” vertices, i.e. those
that many other vertices point to
Random surfer who browses the pages
Page Rank
Let G = (V,E) be the web graph
The PageRank equation
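$PR(v) = \frac{1 - p}{|V|} + p \cdot \sum_{u \in d_{in}(v)} \frac{PR(u)}{d_{out}(u)}$
$d_{in}(v)$: the vertices with an edge pointing to vertex $v$ (its incoming edges)
$d_{out}(u)$: the number of outgoing edges from vertex $u$
$p$: the damping factor (0.85)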
We will see how to implement it in a distributed processing framework in the 2nd
part of this tutorial
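Before the distributed version, the PageRank equation can be tried on a single machine. This is only a minimal sketch: the toy graph, the function name `pagerank` and the fixed number of iterations are assumptions made for illustration, and it assumes every page has at least one outgoing link.

```python
def pagerank(out_links, p=0.85, iterations=50):
    """Iterate PR(v) = (1 - p)/|V| + p * sum(PR(u)/d_out(u)) over the pages u linking to v."""
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(iterations):
        new_rank = {v: (1.0 - p) / n for v in out_links}
        for u, targets in out_links.items():
            share = rank[u] / len(targets)  # PR(u) / d_out(u)
            for v in targets:
                new_rank[v] += p * share
        rank = new_rank
    return rank


# Hypothetical toy web graph: page -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_web)
```

In this toy graph, page "c" is pointed to by both other pages and ends up with the highest rank.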
Introduction
Graph clustering
Probably the most important topic studied in graph mining
In the graph field, it is referred to as community detection
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to
Cluster Analysis (Wiley Series in Probability and Statistics). Wiley-Interscience,
Mar. 2005.
Goal
Given a set of instances, group them into clusters that share
common characteristics, based on similarity
Example in targeted advertising
Graph clustering
Brands
Method to cluster a
new user
Display
Ads
Track
Behavior
Improve
model
Classified group
User grouped by brand affinity
Social Interactions
Usage Patterns
Social Graph
Let us see two kinds of clustering algorithms
(1) a generalization of K-Means & (2) a divisive algorithm that uses the structure
The original algorithm concept
K-Means based clustering
Goal: find clusters by minimizing the sum of the distances
between the data instances and the corresponding centroid
$k$: the number of groups
$D(o_i, o_j)$: a similarity measure
Steps
1. Select K instances as initial centroids
2. Each data instance is assigned to the nearest cluster
3. Each cluster center is recomputed as the average of the data instances in the cluster
4. Repeat steps 2-3 until the assignments no longer change
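The four steps above can be sketched in a few lines of Python. The sketch makes two assumptions that are not on the slide: the first k points are used as initial centroids, and D is the squared Euclidean distance. The sample points are made up.

```python
def kmeans(points, k, max_iter=100):
    """Plain K-Means on tuples of numbers; returns (assignment, centroids)."""
    centroids = list(points[:k])  # step 1 (assumption: first k points)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each instance to the nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:
            break  # step 4: stop when nothing changes anymore
        assignment = new_assignment
        # Step 3: recompute each center as the average of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignment, centroids


# Hypothetical 2-D data: two obvious groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0)]
assignment, centroids = kmeans(points, k=2)
```

Note that step 3 relies on averaging coordinates, which is exactly what is not available when the instances are vertices of a graph.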
What do we need to change?
Adapting K-Means to Graph model
Extending K-Means to take advantage of the linkage
information
A graph-aware similarity measure $D(o_i, o_j)$
The simplest is the geodesic distance: the number of edges (hops)
A graph-aware selection of the vertex center
Median vertex: minimizes the sum of distances to all other vertices of the cluster $C$
$v_m = \arg\min_{v \in C} \sum_{u \in C} D(u, v)$
Closeness Centrality
A node is more central the lower its total distance to all other nodes
$CC(v) = \frac{|V| - 1}{\sum_{u \in V, u \neq v} D(v, u)}$
We usually take the shortest path as distance
M. J. Rattigan, M. E. Maier, and D. Jensen. Graph clustering with network
structure indices. In Z. Ghahramani, editor, ICML, volume 227 of ACM
International Conference Proceeding Series, pages 783–790. ACM, 2007.
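Both graph-aware choices of a cluster center can be tried on a small graph. The sketch below assumes an undirected, connected, unweighted graph stored as an adjacency dict, uses the hop count as distance, and normalizes closeness by |V| - 1 (one common convention). The helper names and the path graph are illustrative only.

```python
from collections import deque


def hop_distances(adj, source):
    """Geodesic (hop count) distance from source to every reachable vertex, by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def median_vertex(adj, cluster):
    """Vertex of the cluster minimizing the sum of distances to the cluster members."""
    def total_distance(v):
        dist = hop_distances(adj, v)
        return sum(dist[u] for u in cluster)
    return min(cluster, key=total_distance)


def closeness(adj, v):
    """(|V| - 1) divided by the total distance from v to all other vertices."""
    dist = hop_distances(adj, v)
    return (len(adj) - 1) / sum(dist.values())


# Hypothetical path graph a - b - c - d - e.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
center = median_vertex(path, list(path))
```

On this path the middle vertex is both the median vertex and the vertex with the highest closeness.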
A divide method
Centrality-based clustering
From the graph, iteratively cut specific edges
Progressively cut into smaller communities
The cutting strategy should select the edges
that connect different communities
[1] proposed to use the edge betweenness
centrality to select the edges to be cut
[1] M. Girvan and M. E. J. Newman. Community structure in social and biological
networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.
Definition
Edge betweenness centrality
Structurally locates the “well-connected” edges
An edge is well connected if it lies on many shortest paths
S. Wasserman and K. Faust. Social Network Analysis: Methods and
Applications. Number 8 in Structural analysis in the social sciences. Cambridge
University Press, 1 edition, 1994.
$BC(e) = \sum_{v, w \in V} \frac{b_{vw}(e)}{b_{vw}}$
$b_{vw}(e)$ = the number of shortest paths from $v$ to $w$ that go through $e$
$b_{vw}$ = the total number of shortest paths from $v$ to $w$
Step by step description
Centrality-based clustering
Steps
1. Compute the betweenness of all existing edges
2. Remove the edge with the highest betweenness centrality
3. Repeat steps 1-2 until suitable communities are found
$BC(e) = \sum_{v, w \in V} \frac{b_{vw}(e)}{b_{vw}}$
Extremely useful for web & social graphs
Characterized by Small-World structure property
R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data
mining, KDD ’06, pages 611–617, New York, NY, USA, 2006. ACM.
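The divisive method can be sketched on a single machine as follows. This is an illustrative, non-distributed sketch: edge betweenness is computed with a BFS-based accumulation in the style of Brandes (a swapped-in technique, not taken from the slides), each unordered pair of vertices is counted once, and the graph is assumed undirected, unweighted and without self-loops. All names and the sample graph are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Betweenness of each undirected edge; adj maps a vertex to a set of neighbours."""
    score = {}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and recording predecessors.
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    preds[w] = []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions from the farthest vertices back to s.
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                edge = frozenset((v, w))
                score[edge] = score.get(edge, 0.0) + c
                delta[v] += c
    # Every unordered pair was seen from both of its endpoints.
    return {edge: value / 2.0 for edge, value in score.items()}


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, result = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        result.append(comp)
    return result


def split_once(adj):
    """Steps 1-3: cut highest-betweenness edges until one more community appears."""
    adj = {v: set(ns) for v, ns in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        betweenness = edge_betweenness(adj)  # step 1, recomputed after every cut
        if not betweenness:
            break
        v, w = max(betweenness, key=betweenness.get)  # step 2
        adj[v].discard(w)
        adj[w].discard(v)
    return components(adj)


# Hypothetical graph: two triangles joined by the bridge c - d.
two_triangles = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
communities = split_once(two_triangles)
```

The bridge carries all shortest paths between the two triangles, so it is cut first and the two triangles come out as communities.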
Why do we need a distributed approach?
Scalability issues
Graphs can reach a significant size: hundreds of millions of nodes and
billions of edges
Most graph mining frameworks & libraries work on in-memory
graph data => we need another paradigm
(really) Short introduction to
distributed computing
How to distribute a processing over a huge data set?
Parallel computing is the ability to run software simultaneously on different
processors in order to increase its performance, while the
distributed concept emphasizes the loose
coupling between those processors.
From the resource sharing & the paradigm viewpoint
Distributed architectures
Shared memory
Shared Disks
Shared Nothing
Explicit parallel programming Implicit parallel programming
Distributed architecture
Shared memory
Distributed systems that share a common memory space
In the case of distributed machines, it can be a distributed cache
Pros
High-speed transfer
Cons
The shared memory must manage data consistency &
concurrent access from different clients
Adding new memory nodes can be costly
Can be highly expensive
Distributed architecture
Shared disk
Distributed systems that share a common disk space
Typically through a LAN
Pros
Almost transparent for the applications
Less costly when adding new storage nodes
Cons
Access contention & data consistency issues
as the number of clients increases
Expensive
Distributed architecture
Shared Nothing
Distributed systems where each machine has its own memory
space
Pros
Can be implemented on cheap or expensive
servers
With a suitable distributed processing
framework, the application does not need to deal
with the distributed aspects
Highly elastic
Cons
Applications need to be re-designed
Distributed architecture
Shared Nothing
This kind of system needs to distribute the data
Partitioning policy
[Figure: data blocks 1 to 5 and their replicas (1’ to 4’) spread over the nodes]
This leads to the interesting concept of data locality
Executing a process where the data is located
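As an illustration of a partitioning policy and of data locality, here is a minimal sketch. The round-robin placement with one extra replica, the function names and the node names are assumptions made for the example; real systems use their own placement policies.

```python
def place_blocks(blocks, nodes, replicas=2):
    """Round-robin placement: copy i of a block goes to the i-th next node."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        for copy in range(replicas):
            node = nodes[(index + copy) % len(nodes)]
            placement[node].append(block)
    return placement


def nodes_holding(placement, block):
    """Data locality: the nodes where a task reading `block` can run without a transfer."""
    return [node for node, held in placement.items() if block in held]


# Hypothetical cluster of three nodes storing five blocks, each block stored twice.
placement = place_blocks([1, 2, 3, 4, 5], ["node-a", "node-b", "node-c"])
```

A scheduler that knows this placement can send the task working on block 3 to one of the two nodes that already store it.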
Distributed architecture: programming model viewpoint
Explicit parallel programming
The developer has to explicitly program the parallel
aspects
Create tasks, synchronization, managing threads & processes, thread-safe operations, etc.
Not the advised solution
Pros
Richer expressivity, gives very low-level control
over the distributed processing (a main pain point
in Hadoop MR)
Cons
Serious complexity
Error-prone
Distributed architecture: programming model viewpoint
Implicit parallel programming
The developer does NOT have to take care of those details
The compiler or the framework handles all aspects related to parallel execution
The code to run, the scheduling, the location of execution, etc.
Most of the examples we present here use implicit programming with
shared-nothing data resources
Pros
Much easier: the complexity is hidden
Highly scalable
Cons
Much less control over the execution, as it is
completely handled by the framework
Let’s talk about graph processing
How can I process a graph using implicit parallel
programming and a shared-nothing architecture?
The well-known framework from Google & Hadoop, its open source version
Map Reduce
Created by Google to index crawled web pages
The 3 main strengths of Hadoop [1]
Data Locality
Can schedule a process where the data is
[1] A. Bialecki, M. Cafarella, D. Cutting, and O. O’Malley. Hadoop: A
framework for running applications on large clusters built of commodity
hardware, http://lucene.apache.org/hadoop/, 2005
Fault Tolerant
Automatic re-scheduling of failing tasks
Parallel processing
On different chunks of data
Short introduction – 2 main phases Map & Reduce
Map Reduce
Main concepts
Map Phase
The problem is partitioned into a set of smaller sub-problems
Distributed over the workers in the cluster
& processed independently
Reduce Phase
All answers to all sub-problems are gathered from the worker nodes
and then merged
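The two phases can be mimicked in plain Python to see the data flow. This sketch does not use Hadoop: the function names, the in-memory "shuffle" and the edge chunks are assumptions for illustration. It counts the incoming edges of each vertex from an edge list split into chunks.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit (target vertex, 1) for every edge of one chunk."""
    return [(dst, 1) for _src, dst in chunk]


def reduce_phase(key, values):
    """Reduce: merge all partial answers emitted for one key."""
    return key, sum(values)


def run_job(chunks):
    grouped = defaultdict(list)
    for chunk in chunks:  # in a cluster, each chunk goes to a different worker
        for key, value in map_phase(chunk):
            grouped[key].append(value)  # shuffle: gather the values by key
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical edge list, split into two chunks.
chunks = [[("a", "b"), ("a", "c")], [("b", "c"), ("c", "a")]]
in_degree = run_job(chunks)
```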
Is it really suited for Graph Processing & mining?
Gives a simple way to deal with large data sets in a
completely distributed way
The developer only focuses on the algorithm, but …
However… not really suited for graph
processing
1. Does not manipulate a graph model, which makes
the algorithms more complex
2. Is not suited for iterative processing
1 iteration = 1 MR job
Requires a lot of I/O, data migration and unnecessary computation
Optimizing data transfer for iterative algorithms
Map Reduce Improvements
A few works have been done in this direction
R. Chen, X. Weng, B. He, and M. Yang. Large graph processing in the cloud. In Proceedings of the 2010
international conference on Management of data, SIGMOD ’10, pages 1123–1126, New York, NY, USA,
2010. ACM.
J.Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox.Twister: a runtime for iterative
mapreduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed
Computing, HPDC ’10, pages 810–818, New York, NY, USA, 2010. ACM.
U. Kang, C. Tsourakakis, A. Appel, C. Faloutsos, and J. Leskovec. Hadi: Fast diameter estimation and
mining in massive graphs with hadoop. CMU-ML-08-117, 2008.
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In W. Wang,
H. Kargupta, S. Ranka, P. S. Yu, and X. Wu, editors, ICDM, pages 229–238. IEEE Computer Society,
2009.
Despite the improvements, these solutions lack a graph-based model since they deal
with multi-dimensional data
Methods for dealing with linked structures using Map reduce concept
Then comes Google with Pregel
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser,
and G. Czajkowski. Pregel: a system for large-scale graph processing. In
A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages
135–146. ACM, 2010.
Providing a distributed computing framework
dedicated to graph processing
Bulk Synchronous Parallel (BSP) for graph processing
In a BSP model, an algorithm is executed as a
sequence of supersteps separated by a global
synchronization point, until termination.
In 1 Superstep a processor can:
1. Perform computation on local data
2. Send or receive messages
Learning a distributed graph processing framework
Concept of superstep @ Pregel
The vertices of the graph execute the same
user-defined function (compute) in parallel
Modification of the state of a vertex or its outgoing edges
Read the messages sent to the vertex in the previous superstep
Send messages to other vertices, which will be received in the next superstep
Modification of the graph topology
How do I stop the processing?
Use the “Vertex Voting”
Each vertex votes to halt -> it becomes inactive unless it receives a non-empty message
Inactive vertices are not involved in the processing
anymore.
The processing stops when all vertices are inactive.
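The superstep loop and the vote to halt can be simulated on one machine. This is only a sketch of the idea, not the Pregel or Giraph API: `run_bsp`, the signature of the compute function and the sample graph are hypothetical. The compute function propagates the maximum value through the graph.

```python
def run_bsp(adj, values, compute, max_supersteps=50):
    """Run compute() on every active vertex, superstep after superstep."""
    inbox = {v: [] for v in adj}
    halted = set()
    for superstep in range(max_supersteps):
        # A halted vertex becomes active again when a message arrives.
        active = [v for v in adj if v not in halted or inbox[v]]
        if not active:
            return superstep  # all vertices are inactive: stop
        outbox = {v: [] for v in adj}
        for v in active:
            halted.discard(v)
            if compute(v, values, inbox[v], adj[v], outbox, superstep):
                halted.add(v)  # the vertex voted to halt
        inbox = outbox  # barrier: messages become visible in the next superstep
    return max_supersteps


def propagate_max(v, values, messages, neighbours, outbox, superstep):
    """Keep the largest value seen so far and forward it only when it changed."""
    new_value = max([values[v]] + messages)
    if superstep == 0 or new_value > values[v]:
        values[v] = new_value
        for w in neighbours:
            outbox[w].append(new_value)
    return True  # always vote to halt; an incoming message wakes the vertex up


# Hypothetical path graph a - b - c - d with one initial value per vertex.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
values = {"a": 3, "b": 6, "c": 2, "d": 1}
supersteps_used = run_bsp(path, values, propagate_max)
```

Every vertex ends with the value 6, and the run stops by itself once no message is in flight.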
Methods for dealing with linked structures using Map reduce concept
Open source implementation of Pregel
Apache Giraph
From Google Pregel
BSP for distributed
graph processing
Distributed Graph Processing
Processing
HDFS
Let’s play with Giraph
Implementing a single source shortest path (SSSP)
Thinking in terms of supersteps & messages
Re-thinking the SSSP for Giraph Processing
1. Init the vertex value to the largest possible value for all vertices except the source
2. On each step
   1. The vertex reads the messages from its neighbors
   2. Each message contains the distance between the source & the current
      vertex through the sending neighbor
   3. We take the min value between the current value & the received
      values
   4. Send to all neighbors the min distance + the weight of the connecting edge
Definition of the vertex value
The distance to reach the current vertex
from the source
Definition of the messages
The vertex sends its current value + the edge
weight
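The superstep formulation of SSSP can be simulated in plain Python before looking at the Giraph code. This is a hedged sketch, not Giraph code: the function name, the graph encoding and the sample graph are made up, the source is seeded with a message carrying distance 0, and a vertex only sends messages when its value improved so that the computation stops.

```python
import math


def sssp(edges, source, max_supersteps=100):
    """edges: {u: {v: weight}}; returns the shortest distance from source to each vertex."""
    dist = {v: math.inf for v in edges}  # vertex value: largest possible value
    inbox = {v: [] for v in edges}
    inbox[source].append(0.0)  # assumption: the source is seeded with a 0 message
    for _ in range(max_supersteps):
        if not any(inbox.values()):
            break  # no message in flight: every vertex has halted
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            # Take the min between the current value and the received values.
            if messages and min(messages) < dist[v]:
                dist[v] = min(messages)
                # Message = current value + weight of the connecting edge.
                for w, weight in edges[v].items():
                    outbox[w].append(dist[v] + weight)
        inbox = outbox  # end of the superstep
    return dist


# Hypothetical weighted directed graph.
graph = {"s": {"a": 1, "b": 4}, "a": {"b": 2, "c": 6}, "b": {"c": 1}, "c": {}}
distances = sssp(graph, "s")
```

Here "b" is first reached with distance 4 and then improved to 3 through "a" in the next superstep.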
Let’s dive into the supersteps
SSSP for Giraph Processing
For a Geek like me, code is easier to get
SSSP for Giraph Processing
*Moss, IT Crowd
https://github.com/apache/giraph
Just for information & Fun
Launching the code in Giraph
Let’s play with Giraph II
Implementing Page Rank
Thinking in terms of supersteps & messages
Re-thinking PageRank for Giraph Processing
Remember the PageRank equation
$PR(v) = \frac{1 - p}{|V|} + p \cdot \sum_{u \in d_{in}(v)} \frac{PR(u)}{d_{out}(u)}$
Definition of the vertex value
?
Definition of the messages
?
3 minutes to think!
Definition of the vertex value
The tentative PageRank
Definition of the messages
The tentative PageRank divided by the number of outgoing edges
Dive into the algorithm
PageRank in Giraph
1. Init the vertex value with 1 / size of the graph
2. On each step
   1. The vertex reads the messages from its neighbors
   2. Each message contains the tentative PageRank sent by an incoming neighbor
   3. Compute the PageRank of the current vertex with p = 0.85
   4. Send the message along all outgoing edges
   5. After a fixed number of supersteps (iterations), the vertex votes to halt
One could find a suitable setup to run
until convergence of the values [1]
[1] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking:
Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November
1999. Previous number = SIDL-WP-1999-0120.
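The same algorithm can be simulated superstep by superstep in plain Python. This is a sketch of the message flow only, not the Giraph implementation: the function name, the toy graph and the number of supersteps are assumptions, and every vertex is assumed to have at least one outgoing edge.

```python
def pagerank_supersteps(out_links, p=0.85, supersteps=30):
    """Vertex value = tentative PageRank; message = value / number of outgoing edges."""
    n = len(out_links)
    value = {v: 1.0 / n for v in out_links}  # init with 1 / size of the graph
    inbox = {v: [] for v in out_links}
    for step in range(supersteps + 1):
        outbox = {v: [] for v in out_links}
        for v, targets in out_links.items():
            if step > 0:
                # Combine the messages received from the incoming neighbours.
                value[v] = (1.0 - p) / n + p * sum(inbox[v])
            if step < supersteps:
                for w in targets:
                    outbox[w].append(value[v] / len(targets))
            # Otherwise the vertex sends nothing and votes to halt.
        inbox = outbox  # end of the superstep
    return value


# Hypothetical toy web graph: page -> pages it links to.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_supersteps(toy_web)
```

A vertex never reads the value of another vertex directly: everything it needs arrives as messages, which is what lets the framework distribute the vertices.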
A deeper look at the algorithm
PageRank algorithm distilled
For a Geek like me, code is easier to get
PageRank for Giraph Processing
*Moss, IT Crowd
https://github.com/apache/giraph
For the geeks: what is the meaning of sendMsgToAllEdges?
PageRank for Giraph Processing
Up to you – Classification of customers by product
Test: write a classification example
1. Start from n root nodes, each having one color
2. Propagate the color to all neighbor nodes
3. A node takes the propagated color only if no other colored root node is nearer
4. Use the SSSP to define the distance
Definition of the vertex value
?
Definition of the messages
?
15 mins
public enum Color {
GREEN, RED, ORANGE
}
Definition of the vertex value
[Color label, distance to the root node of this color]
Definition of the messages
[Color, distance to the root node of this color]
1. Init the vertex distance to the largest possible value for all vertices except the
   colored root vertices
2. On each step
   1. The vertex reads the messages from its neighbors
   2. Each message contains the distance between the root & the current
      vertex through the sending neighbor, and the propagated color
   3. If the received distance is less than the current value, we update the value and
      set the color
   4. Send the message to all neighbors as min distance + edge weight
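One possible answer to the exercise, simulated in plain Python rather than written against the Giraph API. The function name, the graph encoding and the sample graph are hypothetical; the vertex value is the pair [color, distance] and a vertex only forwards a color when its own value improved.

```python
import math
from enum import Enum


class Color(Enum):
    GREEN = 1
    RED = 2
    ORANGE = 3


def classify(edges, roots, max_supersteps=100):
    """edges: {u: {v: weight}}; roots: {vertex: Color}. Returns {vertex: (Color, distance)}."""
    value = {v: (None, math.inf) for v in edges}
    inbox = {v: [] for v in edges}
    for root, color in roots.items():
        inbox[root].append((color, 0.0))  # each root starts with its own color
    for _ in range(max_supersteps):
        if not any(inbox.values()):
            break  # no message in flight: every vertex has halted
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            if not messages:
                continue
            color, distance = min(messages, key=lambda m: m[1])
            if distance < value[v][1]:  # a nearer colored root was found
                value[v] = (color, distance)
                for w, weight in edges[v].items():
                    outbox[w].append((color, distance + weight))
        inbox = outbox  # end of the superstep
    return value


# Hypothetical undirected path r1 - x - y - r2, stored in both directions.
graph = {
    "r1": {"x": 1}, "x": {"r1": 1, "y": 5},
    "y": {"x": 5, "r2": 1}, "r2": {"y": 1},
}
labels = classify(graph, {"r1": Color.GREEN, "r2": Color.RED})
```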
Intermediate Conclusion
Can I use graph mining algorithms on huge graphs
using distributed frameworks coming from the web?
Yes you can, but …
1. You need to choose an implicit distributed framework
2. This will constrain the programming model & the storage
3. You need to re-design the algorithm to fully exploit the framework
If I can mine the graph - does it mean that I have a data warehouse?
What do we miss to have a full graph data warehouse?
Links between Data Warehouse
& Data Mining
Is it the same?
Definition of interactions
Data warehouse & mining
Data Mining algorithms are involved in many
steps of the DW
1. Identifying key attributes
2. Finding related measures
3. Limiting the scope of queries
Mining space
Multi-dimensional cube space for mining
Generating features & target
By using OLAP queries
Multi-step OLAP process
Using data mining as building blocks
Speeding up model construction
Using data cube computation
OLAP frameworks are often integrated with
mining frameworks
-> OLAM (On-Line Analytical Mining) &
exploratory multi-dimensional mining [1]
[1] J. Han and M. Kamber. Data Mining: Concepts and Techniques.
Morgan Kaufmann, 2000.
Graphs are fine, but stop playing and
be an adult
Come back to a professional & business
environment, come back to relational DBs
We do not use graphs because it is fun, but because the relationship model brings value
The graph is a constraint
Let’s take the Social Network example
1. We can model a friend relationship as an m-n relation
2. On average ~100 friends per user
3. Friends-of-friends request: 100² = 10,000 join requests
Storing a SN in a relational DB is not a problem
Unless you need traversal queries for mining
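To see the difference between joining and traversing, here is a small sketch. It is deliberately naive: the join is a nested loop over the m-n friend table with no index, which a real relational engine would optimize, and the table content and function names are made up.

```python
def fof_by_join(friend_rows, person):
    """Self-join of the m-n friend table: the table is re-scanned for every friend found."""
    result = set()
    for p1, f1 in friend_rows:
        if p1 != person:
            continue
        for p2, f2 in friend_rows:  # inner scan = the join
            if p2 == f1 and f2 != person:
                result.add(f2)
    return result


def fof_by_traversal(adjacency, person):
    """Graph traversal: follow the adjacency list of each friend directly."""
    return {f2 for f1 in adjacency[person] for f2 in adjacency[f1] if f2 != person}


# Hypothetical friend table (person, friend), stored in both directions.
friend_rows = [
    ("ann", "bob"), ("bob", "ann"), ("bob", "cat"), ("cat", "bob"),
    ("ann", "dan"), ("dan", "ann"), ("dan", "eve"), ("eve", "dan"),
]
adjacency = {}
for person, friend in friend_rows:
    adjacency.setdefault(person, set()).add(friend)
```

Both functions return the same people; the traversal only touches the rows of the friends involved, whereas each additional hop in the join version multiplies the work.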
Two main issues
A Graph in a relational DB
1. Cost of joins when traversing
2. Almost the totality of the graph is transferred between the DB and the
client
We have seen that distributed graph processing frameworks use data locality
to minimize this cost
I got a distributed processing
framework & mining
algorithms
Now do I have a Graph Data warehouse?
…BTW what is exactly a Data warehouse?
Let’s take a look
Traditional Data warehouse
Aim at providing software, modeling approaches & tools to
analyze a set of data in a collection of DB
E. Malinowski and E. Zimanyi. Advanced data warehouse design: From conventional to
spatial and temporal applications. Springer-Verlag, 2008.
An important topic of research
Conceptual modeling
Research topics focus on
1. Improvement of the Snowflake & Star
models
2. Models enabling the definition of hierarchy
levels
3. The role played by a measure in different
dimensions
4. Properties such as additivity and derived measures
E. Malinowski and E. Zimanyi. Multidimensional conceptual modeling. In J. Wang, editor,
Encyclopedia of Data Warehousing and Mining, pages 293–300. IGI Global, second edition,
2008.
Measures
Fact
Dimensions
The multiDim model – a conceptual model for Data Warehouse & OLAP Applications
Conceptual modeling
Measures
Fact
Hierarchy of dimensions
Cardinality child parent
Conceptual modeling reached a certain level of maturity
Operations & queries on the model
OLAP queries
Extracting information by Queries
1. Rollup (increasing the level of aggregation)
2. Drill-down (decreasing the level of aggregation or increasing detail)
along one or more dimension hierarchies
3. Slice and dice (selection and projection)
4. Pivot (re-orienting the multidimensional view of data).
S. Chaudhuri and U. Dayal. An overview of data warehousing and olap
technology. SIGMOD Record, 26(1):65–74, 1997
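The OLAP operations can be illustrated on a tiny fact table in plain Python. The fact rows, the dimension names and the helper names are invented for the example; an OLAP engine would of course work on cubes rather than on lists of dicts.

```python
from collections import defaultdict

# Hypothetical fact table: one measure (sales) and three dimensions.
facts = [
    {"country": "BE", "city": "Brussels", "year": 2011, "sales": 10},
    {"country": "BE", "city": "Ghent", "year": 2011, "sales": 5},
    {"country": "BE", "city": "Brussels", "year": 2012, "sales": 7},
    {"country": "FR", "city": "Paris", "year": 2012, "sales": 20},
]


def aggregate(rows, dims, measure="sales"):
    """Sum the measure for every combination of values of the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[measure]
    return dict(totals)


def slice_rows(rows, dim, value):
    """Slice: keep only the facts where one dimension has a given value."""
    return [row for row in rows if row[dim] == value]


by_country = aggregate(facts, ["country"])                # roll-up to the country level
by_city = aggregate(facts, ["country", "city"])           # drill-down to the city level
sales_2012 = aggregate(slice_rows(facts, "year", 2012), ["country"])  # slice on 2012
```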
Functional layers for OLAP
Summary
QUERY LAYER
TRANSLATION LAYER
PROCESSING FRAMEWORK
STORAGE
OLAP Cube
Snowflake Models & SQL request
OLAP
OLAP Processing
framework
Relational
Traditional storage
Define what is missing if we have a graph model instead of a relational model
Let’s take the Data warehouse process
Global process overview
We need to be able to model intermediate structures that keep the
relationship at the center, while defining navigation paths, roles in
navigation, summarization properties, etc.
Central element in the traversal and then in graph mining
Why do navigation paths matter?
Define the way one could traverse the graph
Person
Friends of
Group
Belongs to
Members
Item
Bought by
Bought
Used in
1. Classification
2. Ranking
3. Collaborative filtering
Roles in paths
Hierarchies in paths
Additivity in paths
Dealing with distributed frameworks while keeping a high-level query layer
Processing layers
QUERY LAYER
TRANSLATION LAYER
DISTRIBUTED PROCESSING FRAMEWORK
GRAPH STORAGE
102
How to deal with the graph nature ?
If I have a graph DB how do I use Giraph ?
How to deal with the distributed aspects ?
Integration of the processing FWK ?
How to infer a physical execution plan ?
Data materialization issue is completely different from OLAP
What kind of query language to expose ?
SQL - PigLatin – SPARQL ?
Dealing with distributed frameworks while keeping a high-level query layer
Challenges @Processing layers
From Google & Microsoft Research
The most advanced research
Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data
Combining Social Interaction information with user profiles
Target ads, marketing, etc.
New Warehousing & OLAP multi-dimensional network model
A graph in which each vertex = a tuple in a table
Attributes of this table = multi-dimensional spaces
1. Showed that standard OLAP operations can be executed while leveraging the graph aspects
2. Defined the algorithm to obtain the aggregated networks from queries
3. Presented a materialization approach
Examples of operations on multi-dimensional networks
Showing structural behaviors
Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data
Summarizing the multi-dimensional network on the dimension “Gender”
Summarizing the multi-dimensional network on the dimensions “Gender” & “Location”
2 females in CA account for 55.6% of the total Male-Female connections
Drill-down operation: what is the network structure as grouped by both gender & location?
Queries on GraphCube
1. Cuboid query
Outputs the aggregate network corresponding to a specific aggregation of the multi-dimensional network
Example: what is the network structure between various location & profession combinations?
The answer = the aggregated network in the desired cuboid in the graph cube
Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data
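A cuboid query can be sketched as a group-by over vertex attributes, followed by a count of the edges between the resulting groups. The code below is a simplified illustration and not the algorithm from the Graph Cube paper: the five-vertex network, its attribute values and the cuboid helper are all made up for the example, and edges are treated as undirected.

```python
from collections import Counter

# Hypothetical multi-dimensional network: vertex id -> attribute values
VERTICES = {
    1: {"gender": "M", "location": "CA"},
    2: {"gender": "F", "location": "CA"},
    3: {"gender": "F", "location": "CA"},
    4: {"gender": "M", "location": "WA"},
    5: {"gender": "F", "location": "IL"},
}
# Undirected edges between vertex ids
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

def cuboid(dims):
    """Aggregate the network on the given dimensions.

    Returns (nodes, links): the number of vertices in each group and the
    number of edges between each unordered pair of groups.
    """
    def group(vertex):
        return tuple(VERTICES[vertex][d] for d in dims)

    nodes = Counter(group(v) for v in VERTICES)
    links = Counter(tuple(sorted((group(u), group(v)))) for u, v in EDGES)
    return nodes, links

# Summarize on "gender" only
gender_nodes, gender_links = cuboid(["gender"])
# Drill down: summarize on "gender" and "location"
detail_nodes, detail_links = cuboid(["gender", "location"])

print(dict(gender_nodes), dict(gender_links))
print(dict(detail_nodes), dict(detail_links))
```

Each choice of dimensions yields one aggregated network, so the set of all possible choices plays the role that the set of cuboids plays in a classical data cube.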
Queries on GraphCube
2. Crossboid query
A query that crosses multiple multi-dimensional spaces (cuboids) of the network
Example: what is the network structure between the user “3” and various locations?
Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data
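The example question can be sketched by keeping one side of each edge at the level of the individual user and aggregating the other side by location. This is a deliberately reduced illustration under assumed data; the paper's crossboid definition is more general, and the function name below is hypothetical.

```python
from collections import Counter

# Hypothetical data: location of each user and undirected friendship edges
LOCATION = {1: "CA", 2: "CA", 3: "CA", 4: "WA", 5: "IL"}
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]

def user_to_locations(user):
    """Count the edges between one user and each location group."""
    counts = Counter()
    for u, v in EDGES:
        if u == user:
            counts[LOCATION[v]] += 1
        elif v == user:
            counts[LOCATION[u]] += 1
    return dict(counts)

user3 = user_to_locations(3)
print(user3)
```

The two endpoints of an edge are summarized at different levels here (user id on one side, location on the other), which is what distinguishes this kind of query from a plain cuboid query.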
From Google & Microsoft Research
The most advanced research
Limitations:
Only considers vertices of the same type
Only centralized processing
The materialization policy is therefore inspired by legacy centralized data warehouses
AGENDA
1 / Introduction
2 / Focus on two graph mining algorithms
3 / Introduction of Distributed Processing Framework
4 / Graph Data warehouse – an emerging challenge
5 / Conclusion
Conclusion
Today, the building blocks exist to mine large graphs
It is up to you to assemble them for a dedicated purpose
DISTRIBUTED PROCESSING FRAMEWORK
GRAPH STORAGE
MINING LIBRARIES
NON-GRAPH BASED
Conclusion
Structuring linked data as a graph is an emerging & important requirement
Important challenges for mining algorithms
Adapting the logic to include the global relationship
Important challenges for the processing layer
Re-designing algorithms – integrating the storage layer – using emerging Big Data frameworks
However, implicit distributed graph processing frameworks are emerging
Still far from the concept of a Graph Data Warehouse
Lack of modeling – uniform stack – query language – re-design of the materialization
THANK YOU
EBISS,
20 of July 2012
Brussels
sabri.skhiri@euranova.eu / twitter @sskhiri / http://blog.euranova.eu
SABRI SKHIRI / RESEARCH DIRECTOR EURA NOVA
REFERENCES
Apache Mahout, Scalable machine learning
http://mahout.apache.org/
Apache Hadoop, Distributed computing framework
http://hadoop.apache.org/
Apache Giraph, open source implementation of Pregel
http://incubator.apache.org/giraph/
NAIAD, open source implementation of Scala
http://naiad-processing.org
Cassandra, NoSQL column oriented storage
http://cassandra.apache.org/
HBase, NoSQL column oriented storage
http://hbase.apache.org/
PigLatin, high level query framework
http://pig.apache.org/
Scribe, log aggregator framework
https://github.com/facebook/scribe

Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
一比一原版(RUG毕业证)格罗宁根大学毕业证成绩单
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 

Large Graph Mining

  • 1. Large Graph Mining: Recent Developments, Challenges and Potential Solutions EBISS, 20 July 2012, Brussels SABRI SKHIRI / RESEARCH DIRECTOR EURA NOVA
  • 2. PASSIONATE ABOUT COMPUTER SCIENCE, TECHNOLOGY & RESEARCH THE SPEAKER Research director @ EURA NOVA Make the link between Research & Customer challenges Supervising 3 PhD theses, 6 Master theses with 3 Belgian Universities Head of the EU R&D Architecture for a Telco equipment provider Guiding the transition from Telco to Service provider with new technologies Committer on open source projects launched @ EURA NOVA RoQ-Messaging, NAIAD, Wazaabi
  • 3. Ramp-up test to wake up the room after lunch on a Friday afternoon … Before starting I will use persons to illustrate the topic in this tutorial Can you give me their names? Leonard Sheldon Moss Larry Page Looks like you are ready to start learning about Graph Processing!
  • 4. AGENDA 1 / Introduction 2 / Focus on two graph mining algorithms 3 / Introduction of Distributed Processing Framework 4 / Graph Data warehouse – an emerging challenge 5 / Conclusion
  • 5. AGENDA 1 / Introduction 2 / Focus on two graph mining algorithms 3 / Introduction of Distributed Processing Framework 4 / Graph Data warehouse – an emerging challenge 5 / Conclusion
  • 6. Graph Mining needs another approach EXECUTIVE SUMMARY Data Mining Mature, algorithmic, libraries & products New Needs Linked data & reasoning on relationships What do we need? Is traditional data mining still applicable? Graph Data Warehouse Is traditional data warehouse still applicable? Flat data, relational data, multi-dimensional data No Linked data Biology Chemistry Social Networks Internet - Networks Graph-based similarity Algorithm re-design for graphs Scalability for storage & processing Conceptual modeling Query Processing Stack & materialization Storage
  • 7. LET’S START WITH DATA MINING Process of discovering patterns or models of data. Those patterns often consist in previously unknown and implicit information and knowledge embedded within a data set [1] [1] M.-S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowl. Data Eng., 8(6):866–883, 1996.
  • 8. Techniques have been developed over the last 20 years DATA MINING Process of analyzing data from different perspectives and summarizing it into useful information Pattern recognition We mine data to retrieve pre-determined patterns Clustering Data are grouped within partitions according to criteria Association Enables linking data items to each other Classification We position data in a pre-determined group Feature extraction We transform the input data into a set of features (data set reduction) Summarization Ranking such as PageRank
  • 9. Manages & processes data as a collection of independent instances DATA MINING The mining usually does not consider the global relations between the objects Almost all clustering algorithms compute the similarity between all pairs of objects in the data set
  • 10. Taking into account the relation between data in mining Why does the relationship matter? Imagine clustering people from their profiles
  • 11. Taking into account the relation between data in mining Why does the relationship matter? Imagine clustering people not only from their profiles but also … by their social interactions New emergent industrial needs lead us to deal with this kind of structured data More complete data structure Greater expressive power Better model of real-life cases
  • 12. New Industry requirements Need to structure and mine structured & linked data
  • 13. The metabolic pathways 1. Biochemical Networks http://biocyc.org
  • 14. Genetic regulation signal 1. Biochemical Networks Taking a systemic approach we end up with a huge interaction graph
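  • 15. A biochemical network definition 1. Biochemical Networks $G(V, E)$ with $V = V_G \cup V_R \cup V_P \cup V_C$ and $E = E_{Reg} \cup E_{Trans} \cup E_{act}$, where $E_{Reg} \subseteq \{V_P \times V_R\}$, $E_{Trans} \subseteq \{V_G \times V_R\}$, $E_{act} \subseteq \{V_C \times V_R\}$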
  • 16. New emergent industrial needs 1. Biochemical Networks What happens if I drop a compound in the system? Drug simulation in drug design Predict a metabolic pathway given a metabolic network and seed reactions Subgraph extraction Find which genes are involved in the fat reduction pathway Genetic therapy Predict a metabolic network from a genetic signature given a protein interaction graph & a regulation network
  • 17. New emergent industrial needs 2. Chemical Databases Database specifically designed to store chemical information. Atoms Bonds Graphs are the natural representation for chemical compounds; most of the mining algorithms focus on mining chemical graphs
  • 18. New emergent industrial needs 2. Chemical Databases A typical request: Structural similarity search $G_d = (V_d, E_d)$, $V_d = (d_1, \ldots, d_n)$ $G_d$ is the graph query The objective is to maximize the probability that the i-th $\theta$ equals $\alpha$ knowing the measures $a$, $b$: $\max P(\theta_i = \alpha \mid a, b)$
  • 19. New emergent industrial needs 2. Chemical Databases Structural indexing Indexing the structural properties of the molecules Structural similarity search Similar molecules will have similar effects Structure-Activity-Relationship How to modify the structure to change its activity 3D molecule conformation Based on similar molecule conformations
  • 20. New emergent industrial needs 2. Chemical Databases Structure-Activity-Relationship Example of sucralose, where 3 hydroxyl groups have been replaced with chlorine (Cl) atoms Sugar C12H22O11 Diet Sugar C12H19Cl3O8 http://en.wikipedia.org/wiki/Sucralose
  • 21. New emergent industrial needs 3. Social network analytics The Social Graph models the (direct or indirect) social interactions between users
  • 22. Example of Trust from a bipartite Graph 3. Social network analytics The goal is to infer trust connections between actors in set A that are only connected through item I Daire O'Doherty, Salim Jouili, Peter Van Roy: Towards trust inference from bipartite social networks. DBSocial 2012: 13-18
  • 23. Example of Trust from a bipartite Graph 3. Social network analytics The goal is to infer trust connections between actors in set A that are only connected through item I Measure to compare similarity and diversity Highly connected shared items will have higher distance values Daire O'Doherty, Salim Jouili, Peter Van Roy: Towards trust inference from bipartite social networks. DBSocial 2012: 13-18
  • 24. Example of Trust from a bipartite Graph 3. Social network analytics Daire O'Doherty, Salim Jouili, and Peter Van Roy. Trust-Based Recommendation: An Empirical Analysis, Sixth ACM Workshop on Social Network Mining and Analysis (SNA-KDD 2012), Beijing, China, Aug. 12, 2012.
  • 25. New emergent industrial needs 3. Social network analytics People you may know Structural similarity based Trust computation on structural properties Used for accurate recommendation Collaborative filtering You tend to like what your friends like Influence management Used in marketing models
  • 26. Marketing model to influence users 3. Social network analytics SOCIAL KNOWLEDGE TRADITIONAL MARKETING MODELS Bolton 1998 Bolton & Lemon 1999 SOCIAL MODELS Nitzan & Libai 2011 / Singer 2012 INFLUENCE NETWORK Able to predict much more accurately > How to influence influencers to reach objectives Viral marketing maven Accurate churners Product (content, services, etc.) adoption Loyal users to reward to optimize the subscriber base Decrease acquisition costs
  • 27. Building an interaction-based model for INFLUENCE 3. Social network analytics Vertex similarity distance Edge weight computing Betweenness centrality computation Temporal analysis and versioning at vertex/edge level When all social interaction variables are considered within the same model we end up with a very powerful Social Profile model
  • 28. LET’S USE GRAPHS Can I use the traditional data mining approaches ?
  • 29. What changes with graphs? Problem Statement Similarity & Distances Must be graph-based Structural nature of the data model Makes mining algorithms more challenging to implement Scalability issue Most graph mining problems involve graphs of significant size Most of the existing graph mining algorithms keep the data in main memory, which is not possible anymore
  • 30. Let’s position this tutorial Problem Statement BSP approach Using fully distributed approach Google Pregel, Apache HAMA In-memory/MPI/HPC Use multi-processor implementations SNAP Graph DB Focus on storage & graph traversal Neo4J, Dex, OrientDB
  • 31. Let’s position this tutorial Problem Statement BSP approach Using fully distributed approach Google Pregel, Apache HAMA Given a set of data mining algorithms, how can we adapt them to fully leverage the distributed processing approach?
  • 32. The base data model is not the same anymore Using the distributed way (Distributed) Storage Graph Model (Distributed) graph processing Mining algorithm The algorithm implementation will depend on the underlying distributed processing paradigm
  • 33. AGENDA 1 / Introduction 2 / Focus on two graph mining algorithms 3 / Introduction of Distributed Processing Framework 4 / Graph Data warehouse – an emerging challenge 5 / Conclusion
  • 34. Graph Mining algorithms Let’s see what a graph mining algorithm looks like
  • 35. A ranking algorithm Page Rank The web is a network of web pages In addition to the page content, the page linkage represents a useful source of knowledge and information Compute a ranking on every web page based only on the linkage structure L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.
  • 36. Basic concepts Page Rank Authority: approximated by the number & the importance of pages pointing to the considered page
  • 37. Random surfer who browses the pages Page Rank Either, 1. The surfer chooses an outgoing link of the current vertex uniformly at random, and follows that link to the destination vertex, or 2. it “teleports” to a completely random Web page, independent of the links out of the current vertex. Intuitively, the random surfer frequently traverses “important” vertices, which have many vertices pointing to them
  • 38. Random surfer who browses the pages Page Rank Let G = (V,E) be the web graph The PageRank equation $PR(v) = \frac{1 - p}{|V|} + p \cdot \sum_{u \in d_{in}(v)} \frac{PR(u)}{d_{out}(u)}$ where $d_{in}(v)$ refers to the incoming edges of vertex $v$ (the sum runs over their source vertices $u$), $d_{out}(u)$ is the number of outgoing edges from vertex $u$, and $p$ is the damping factor (0.85) We will see how to implement it in a distributed processing framework in the 2nd part of this tutorial
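As a point of comparison before the distributed version, the equation can be iterated on a single machine. The sketch below is only an illustration: the function name `pagerank`, the toy graph `toy_web` and the fixed iteration count are assumptions, not part of the tutorial, and dead-end pages are simply allowed to leak rank.

```python
def pagerank(out_links, p=0.85, iterations=50):
    """Iterate PR(v) = (1 - p)/|V| + p * sum(PR(u) / d_out(u)) over in-neighbours u.

    out_links maps every vertex to the list of vertices it links to.
    """
    vertices = list(out_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}      # uniform starting point
    for _ in range(iterations):
        new_rank = {v: (1.0 - p) / n for v in vertices}   # teleport term
        for u, targets in out_links.items():
            if not targets:
                continue   # assumption: a dead end contributes nothing
            share = p * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share           # u contributes PR(u)/d_out(u) to v
        rank = new_rank
    return rank


# Hypothetical toy web graph: page -> pages it links to.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

On this toy graph page C collects links from both A and B, so it ends up with the highest rank, followed by A and then B.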
  • 39. Introduction Graph clustering Probably the most important topic studied in graph mining Graph area: referred to as community detection L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis (Wiley Series in Probability and Statistics). Wiley-Interscience, Mar. 2005. Goal Given a set of instances, group them into clusters that share common characteristics based on similarity
  • 40. Example in targeted advertising Graph clustering Brands Method to cluster a new user Display Ads Track Behavior Improve model Classified group User grouped by brand affinity Social Interactions Usage Patterns Social Graph Let us see 2 kinds of clustering algorithms: (1) a generalization of K-Means & (2) a divisive algorithm that uses the structure
  • 41. The original algorithm concept K-Means based clustering Goal: finding clusters by minimizing the sum of the distances between the data instances and the corresponding centroid Inputs: the number of groups k and a similarity measure $D : (o_i, o_j)$ Steps 1. Select k instances as initial centroids 2. Assign each data instance to the nearest cluster 3. Recompute each cluster center as the average of the data instances in the cluster 4. Repeat steps 2-3
  • 42. What do we need to change? Adapting K-Means to Graph model Extending K-Means to take advantage of the linkage information A graph-aware selection of the vertex center A graph-aware similarity measure $D : (o_i, o_j)$ The simplest is the geodesic distance: the number of edges (hops) Median Vertex Minimizes the sum of distances to all other vertices: $v_m = \arg\min_{v \in C} \sum_{u \in C} D(u, v)$
  • 43. What do we need to change? Adapting K-Means to Graph model Extending K-Means to take advantage of the linkage information A graph-aware selection of the vertex center A graph-aware similarity measure $D : (o_i, o_j)$ The simplest is the geodesic distance: the number of edges (hops) Closeness Centrality A node is more central the lower its total distance to all other nodes: $CC(v) = \frac{|V| - 1}{\sum_{u \in V, u \neq v} D(v, u)}$ We usually take the shortest path as distance M. J. Rattigan, M. E. Maier, and D. Jensen. Graph clustering with network structure indices. In Z. Ghahramani, editor, ICML, volume 227 of ACM International Conference Proceeding Series, pages 783–790. ACM, 2007.
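The two graph-aware ingredients above (geodesic distance and a central vertex) can be sketched with a breadth-first search. This is a minimal illustration assuming an unweighted, connected, undirected graph stored as an adjacency dictionary; the helper names `hop_distances`, `median_vertex` and `closeness`, and the toy `path_graph`, are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Geodesic distance (number of hops) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def median_vertex(adj, cluster):
    """Vertex of the cluster minimizing the sum of distances to the cluster members."""
    def total_distance(v):
        dist = hop_distances(adj, v)
        return sum(dist[u] for u in cluster)
    return min(cluster, key=total_distance)


def closeness(adj, v):
    """CC(v) = (|V| - 1) / sum of distances from v to all other vertices."""
    dist = hop_distances(adj, v)
    return (len(adj) - 1) / sum(dist[u] for u in adj if u != v)


# Hypothetical toy graph: a path 1 - 2 - 3 - 4 - 5.
path_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
center = median_vertex(path_graph, [1, 2, 3, 4, 5])
```

In a graph-aware K-Means, `median_vertex` would replace the "average of the instances" step, and `hop_distances` would replace the Euclidean distance used to assign instances to the nearest center.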
  • 44. A divisive method Centrality-based clustering From the graph, iteratively cut specific edges Progressively cut into smaller communities The cutting strategy should select the edges that connect different communities Girvan and Newman proposed to use the edge betweenness centrality to select the edges to be cut M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002
  • 45. Definition Edge betweenness centrality Structurally locates the “well-connected” edges: an edge is well connected if it lies on many shortest paths $BC(e) = \sum_{v, w \in V} \frac{b_{vw}(e)}{b_{vw}}$ where $b_{vw}(e)$ = the number of shortest paths from $v$ to $w$ through $e$ and $b_{vw}$ = the total number of shortest paths from $v$ to $w$ S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Number 8 in Structural analysis in the social sciences. Cambridge University Press, 1 edition, 1994.
  • 46. Step by step description Centrality-based clustering Steps 1. Compute the betweenness of all existing edges 2. Remove the edge with the highest betweenness centrality 3. Repeat steps 1-2 until the communities are suitably found $BC(e) = \sum_{v, w \in V} \frac{b_{vw}(e)}{b_{vw}}$ Extremely useful for web & social graphs Characterized by the Small-World structure property R. Kumar, J. Novak, and A. Tomkins. Structure and evolution of online social networks. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’06, pages 611–617, New York, NY, USA, 2006. ACM.
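The three steps can be sketched on a small in-memory graph as follows. This is an illustration under stated assumptions, not the authors' implementation: the graph is undirected, unweighted and has no self-loops, each unordered vertex pair is counted once, the stopping rule ("until `target` components exist") is a stand-in for "suitably found", and all names (`edge_betweenness`, `components`, `girvan_newman`, `two_triangles`) are hypothetical. The betweenness is accumulated with a Brandes-style breadth-first pass from every vertex.

```python
from collections import deque


def edge_betweenness(adj):
    """BC(e) = sum over vertex pairs of b_vw(e) / b_vw, for an undirected graph."""
    score = {frozenset((u, w)): 0.0 for u in adj for w in adj[u]}
    for s in adj:
        dist, sigma, order = {s: 0}, {s: 1.0}, []
        preds = {v: [] for v in adj}
        queue = deque([s])
        while queue:                              # BFS counting shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):                 # push path fractions back to s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return {e: val / 2.0 for e, val in score.items()}   # each pair was seen twice


def components(adj):
    """Connected components as a list of vertex sets."""
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        part, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v])
        seen |= part
        parts.append(part)
    return parts


def girvan_newman(adj, target):
    """Remove the highest-betweenness edge until `target` communities exist."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while len(components(adj)) < target:
        scores = edge_betweenness(adj)            # step 1
        if not scores:
            break                                 # no edge left to cut
        u, w = max(scores, key=scores.get)        # step 2
        adj[u].discard(w)
        adj[w].discard(u)
    return components(adj)


# Hypothetical toy graph: two triangles joined by the bridge 3 - 4.
two_triangles = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4],
                 4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
communities = girvan_newman(two_triangles, target=2)
```

Every shortest path between the two triangles crosses the bridge 3 - 4, so it has the highest betweenness and is cut first, leaving the two triangles as communities. Recomputing all betweenness values after each removal is what makes the method expensive on large graphs.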
  • 47. AGENDA 1 / Introduction 2 / Focus on two graph mining algorithms 3 / Introduction of Distributed Processing Framework 4 / Graph Data warehouse – an emerging challenge 5 / Conclusion
  • 48. Why do we need a distributed approach? Scalability issues The graphs can reach a significant size: hundreds of millions of nodes, billions of edges Most of the graph mining frameworks & libraries use in-memory graph data => we need another paradigm
  • 49. (really) Short introduction to distributed computing How to distribute processing over a huge data set? The ability to run software simultaneously on different processors in order to increase its performance, while the distributed concept emphasizes loose coupling between those processors.
  • 50. From the resource sharing & the paradigm viewpoint Distributed architectures Shared memory Shared Disks Shared Nothing Explicit parallel programming Implicit parallel programming
  • 51. Distributed architecture Shared memory Distributed systems that share a common memory space In the case of distributed machines, it can be a distributed cache Pros High speed transfer Cons The shared memory must manage the data consistency & the access from different clients Can be costly when adding new memory nodes Can be highly expensive
  • 52. Distributed architecture Shared disk Distributed systems that share a common disk space Typically through a LAN Pros Almost transparent for the applications Less costly when adding a new storage node Cons Access contention & data consistency issues when the number of clients increases Expensive
  • 53. Distributed architecture Shared Nothing Distributed systems where each machine has its own memory space Pros Can be implemented on cheap or expensive servers With an adapted distributed processing framework the application does not need to deal with the distributed aspect Highly elastic Cons Applications need to be re-designed
  • 54. Distributed architecture Shared Nothing This kind of system needs to distribute the data Partitioning policy [Figure: partitions 1 to 5 and 1’ to 4’ placed on different nodes] This leads to the interesting concept of data locality: executing a process where the data is located
  • 55. Distributed architecture: programming model viewpoint Explicit parallel programming The developer will have to explicitly program the parallel aspects Create tasks, synchronization, managing threads & processes, thread-safe operations, etc. Not an advised solution Pros Richer expressivity, gives very low-level control over the distributed processing (main pain point in Hadoop MR) Cons Serious complexity Error-prone
  • 56. Distributed architecture: programming model viewpoint Implicit parallel programming The developer will NOT have to take care of those details The compiler or the framework handles all aspects related to parallel execution The code to run, the scheduling, the location of execution, etc. Most of the examples we present here are implicit programming with shared-nothing data resources Pros Much easier: the complexity is hidden Highly scalable Cons Much less control over the execution as it is completely handled by the framework
  • 57. Let’s talk about graph processing How can I process a graph using implicit parallel programming and shared-nothing processing?
  • 58. The well-known framework from Google & Hadoop, its open source version Map Reduce Created by Google to index crawled web pages The 3 main strengths of Hadoop [1] Data Locality Can schedule a process where the data is Fault Tolerant Automatic re-scheduling of failing tasks Parallel processing On different chunks of data [1] A. Bialecki, M. Cafarella, D. Cutting, and O. O’Malley. Hadoop: A framework for running applications on large clusters built of commodity hardware, http://lucene.apache.org/hadoop/, 2005
  • 59. Short introduction – 2 main phases Map & Reduce Map Reduce Main concepts Map Phase The problem is partitioned into a set of smaller sub-problems Distributed over the workers in the cluster & processed independently Reduce Phase All answers to all sub-problems are gathered from the worker nodes and then merged [1] A. Bialecki, M. Cafarella, D. Cutting, and O. O’Malley. Hadoop: A framework for running applications on large clusters built of commodity hardware, http://lucene.apache.org/hadoop/, 2005
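The two phases can be mimicked in a single process to make the data flow concrete. The sketch below is not Hadoop code: the functions `map_phase`, `shuffle` and `reduce_phase` are hypothetical stand-ins for what the framework does across workers, and the job (counting incoming edges per vertex) is an assumed example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record independently; collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group the values by key, as the framework does between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Merge the values gathered for each key into one answer."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical job: count the incoming edges of each vertex from an edge list.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
pairs = map_phase(edges, lambda edge: [(edge[1], 1)])    # emit (target, 1)
in_degree = reduce_phase(shuffle(pairs), lambda key, values: sum(values))
```

Each edge is mapped without looking at any other edge, which is what lets the map phase run on different chunks of data in parallel.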
  • 60. Is it really suited for Graph Processing & mining? The developer only focuses on the algorithm Gives a simple way to deal with large data sets in a completely distributed way However… not really suited for Graph processing 1. Does not manipulate a Graph model, which makes the algorithm complex 2. Is not suited for iterative processing: 1 iteration = 1 MR job, requiring a lot of I/O, data migration, unnecessary computation
  • 61. Optimizing data transfer for iterative algorithms Map Reduce Improvements Few works have been done in this direction R. Chen, X. Weng, B. He, and M. Yang. Large graph processing in the cloud. In Proceedings of the 2010 international conference on Management of data, SIGMOD ’10, pages 1123–1126, New York, NY, USA, 2010. ACM. J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: a runtime for iterative mapreduce. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC ’10, pages 810–818, New York, NY, USA, 2010. ACM. U. Kang, C. Tsourakakis, A. Appel, C. Faloutsos, and J. Leskovec. Hadi: Fast diameter estimation and mining in massive graphs with hadoop. CMU-ML-08-117, 2008. U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In W. Wang, H. Kargupta, S. Ranka, P. S. Yu, and X. Wu, editors, ICDM, pages 229–238. IEEE Computer Society, 2009. Despite the improvements, these solutions fall short for graph-based models since they deal with multi-dimensional data
  • 62. Methods for dealing with linked structures using the Map Reduce concept Then comes Google with Pregel G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 135–146. ACM, 2010. Providing a distributed computing framework dedicated to graph processing Bulk Synchronous Parallel (BSP) for graph processing In a BSP model an algorithm is executed as a sequence of supersteps separated by a global synchronization point until termination. In 1 superstep a processor can: 1. Perform computation on local data 2. Send or receive messages
  • 63. Learning a distributed graph processing framework Concept of superstep@Pregel G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 135–146. ACM, 2010. The vertices of the graph execute the same user-defined function (compute) in parallel Modification of the state of a vertex or its outgoing edges Read messages sent to the vertex in the previous superstep Send messages to other vertices that will be received in the next superstep Modification of the graph topology
  • 64. Learning a distributed graph processing framework Concept of superstep@Pregel How do I stop the processing? Use the “Vertex Voting” Each node votes to halt -> it becomes inactive unless it receives a non-empty message Inactive vertices are not involved in processing anymore. The processing stops when all vertices are inactive.
  • 65. Methods for dealing with linked structures using the Map Reduce concept Open source implementation of Pregel Apache Giraph From Google Pregel BSP for distributed graph processing Distributed Graph Processing Processing HDFS
  • 66. Let’s play with Giraph Implementing a single source shortest path (SSSP)
  • 67. Thinking in terms of supersteps & messages Re-thinking the SSSP for Giraph Processing 1. Init the vertex value to the largest possible value for all vertices except the source 2. On each step 1. The vertex reads the messages from its neighbors 2. Each message contains the distance between the source & the current vertex through the last vertex 3. We take the min value between the current value & the received values 4. Send a message to all neighbors with min distance + edge weight Definition of the vertex value The distance to reach the current vertex from the source Definition of the messages Vertex sends its current value + edge weight
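The superstep logic above can be simulated in one process to see how values, messages and halting interact. This sketch is not Giraph code: the function `sssp_supersteps`, the dictionary-based inbox/outbox and the toy graph are assumptions made for illustration. It also assumes non-negative edge weights, and it seeds the source with a message of distance 0 instead of giving it a special initial value.

```python
import math


def sssp_supersteps(edges, source):
    """Vertex-centric single source shortest path.

    edges maps every vertex to a list of (neighbor, weight) pairs.
    Returns the distance of every vertex from the source.
    """
    value = {v: math.inf for v in edges}     # vertex value: best distance so far
    inbox = {v: [] for v in edges}
    inbox[source] = [0.0]                    # assumption: seed message wakes the source
    while any(inbox.values()):               # stop when no vertex has a message
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            if not messages:
                continue                     # vertex stays inactive
            best = min(messages)
            if best < value[v]:
                value[v] = best
                for w, weight in edges[v]:
                    outbox[w].append(best + weight)   # message: value + edge weight
            # the vertex votes to halt until a new message arrives
        inbox = outbox                       # global synchronization barrier
    return value


# Hypothetical toy graph with weighted, directed edges.
toy_graph = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("b", 2.0), ("c", 6.0)],
    "b": [("c", 1.0)],
    "c": [],
}
distances = sssp_supersteps(toy_graph, "s")
```

A vertex only sends messages when its value improved, so the message traffic dies out by itself and the loop ends once every vertex is inactive.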
  • 68. Thinking in terms of supersteps & messages Re-thinking the SSSP for Giraph Processing
  • 69. Let’s dive into the supersteps SSSP for Giraph Processing
  • 70. For a Geek like me, code is easier to get SSSP for Giraph Processing *Moss, IT Crowd https://github.com/apache/giraph
  • 71. Just for information & Fun Launching the code in Giraph *Moss, IT Crowd
  • 72. Let’s play with Giraph II Implementing Page Rank
  • 73. Thinking in terms of supersteps & messages Re-thinking PageRank for Giraph Processing Remember the PageRank equation $PR(v) = \frac{1 - p}{|V|} + p \cdot \sum_{u \in d_{in}(v)} \frac{PR(u)}{d_{out}(u)}$ Definition of the vertex value? Definition of the messages? 3 mins to think!
  • 74. Thinking in terms of supersteps & messages Re-thinking PageRank for Giraph Processing Remember the PageRank equation $PR(v) = \frac{1 - p}{|V|} + p \cdot \sum_{u \in d_{in}(v)} \frac{PR(u)}{d_{out}(u)}$ Definition of the vertex value The tentative PageRank Definition of the messages The tentative PageRank divided by #out edges
  • 75. Dive into the algorithm PageRank in Giraph 1. Init the vertex value with 1/size of the graph 2. On each step 1. The vertex reads the messages from its neighbors 2. Each message contains the tentative PR of an incoming vertex 3. Compute the PageRank for the current vertex with p = 0.85 4. Send the message along all outgoing edges 5. After a fixed number of supersteps (iterations), the vertex votes to halt Definition of the vertex value The tentative PageRank Definition of the messages The tentative PageRank divided by #out edges $PR(v) = \frac{1 - p}{|V|} + p \cdot \sum_{u \in d_{in}(v)} \frac{PR(u)}{d_{out}(u)}$ One could find a suitable setup to run until convergence of values [1] [1] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number = SIDL-WP-1999-0120.
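The same vertex value and message definitions can be tried out in a single-process simulation of the supersteps. This is a sketch rather than the Giraph implementation: `pagerank_supersteps`, the dictionary-based message boxes, the default of 30 supersteps and the toy graph are all assumptions for illustration.

```python
def pagerank_supersteps(out_edges, p=0.85, max_supersteps=30):
    """Vertex-centric PageRank run for a fixed number of supersteps.

    out_edges maps every vertex to the list of vertices it points to.
    """
    n = len(out_edges)
    value = {v: 1.0 / n for v in out_edges}        # step 1: 1 / size of the graph
    inbox = {v: [] for v in out_edges}
    for superstep in range(max_supersteps):
        outbox = {v: [] for v in out_edges}
        for v, targets in out_edges.items():
            if superstep >= 1:
                # messages already carry PR(u) / d_out(u) of each in-neighbour u
                value[v] = (1.0 - p) / n + p * sum(inbox[v])
            if superstep < max_supersteps - 1:
                for w in targets:
                    outbox[w].append(value[v] / len(targets))
            # in the last superstep the vertex sends nothing and votes to halt
        inbox = outbox                             # synchronization barrier
    return value


# Hypothetical toy web graph: page -> pages it links to.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank_supersteps(toy_web)
```

Because the sender divides its value by its number of out edges before sending, the receiving vertex only has to sum its messages, which keeps the compute function local to one vertex.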
  • 76. A deeper look at the algorithm PageRank algorithm distilled
  • 77. For a Geek like me, code is easier to get PageRank for Giraph Processing *Moss, IT Crowd https://github.com/apache/giraph
  • 78. For the geeks: what is the meaning of sendMesgToAllEdges? PageRank for Giraph Processing *Moss, IT Crowd https://github.com/apache/giraph
  • 79. Up to you guys – Classification of customer by product Test: Write a classification Example 1. Starting from n root nodes, each having one color 2. Propagate the color to all neighbor nodes 3. The color is propagated only if there is no nearer root-colored node 4. Use the SSSP to define the distance Definition of the vertex value? Definition of the messages? 15 mins public enum Color { GREEN, RED, ORANGE }
  • 80. Up to you guys – Classification of customer by product Test: Write a classification Example 1. Starting from n root nodes, each having one color 2. Propagate the color to all neighbor nodes 3. The color is propagated only if there is no nearer root-colored node 4. Use the SSSP to define the distance Definition of the vertex value [Color Label, Distance to the root node of this color] Definition of the messages [Color, Distance to the root node of this color] 10 mins public enum Color { GREEN, RED, ORANGE }
  • 81. Up to you guys – Classification of customer by product Test: Write a classification Example Definition of the vertex value [Color Label, Distance to the root node of this color] Definition of the messages [Color, Distance to the root node of this color] 8 mins
  • 82. Up to you guys – Classification of customer by product Test: Write a classification Example 1. Init the vertex value to the largest possible value for all vertices except the colored source vertices 2. On each step 1. The vertex reads the messages from its neighbors 2. Each message contains the distance between the source & the current vertex through the last vertex, and the propagated color 3. If the received value is less than the current value, we update the value and set the color 4. Send a message to all neighbors with min distance + edge weight Definition of the vertex value [Color Label, Distance to the root node of this color] Definition of the messages [Color, Distance to the root node of this color]
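One possible answer to the exercise, again as a single-process simulation rather than Giraph code. The function `propagate_colors`, the use of plain strings instead of the `Color` enum, the seed messages for the roots and the toy graph are assumptions for illustration; edge weights are assumed non-negative, and ties between equally distant roots are broken by color name.

```python
import math


def propagate_colors(edges, roots):
    """Label each vertex with the color of its nearest root.

    edges maps every vertex to a list of (neighbor, weight) pairs,
    roots maps each root vertex to its color.
    Returns {vertex: (color, distance to the root of that color)}.
    """
    value = {v: (None, math.inf) for v in edges}
    inbox = {v: [] for v in edges}
    for root, color in roots.items():
        inbox[root] = [(0.0, color)]           # assumption: seed message per root
    while any(inbox.values()):
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            if not messages:
                continue                       # inactive vertex
            dist, color = min(messages)        # nearest candidate root
            if dist < value[v][1]:             # received value beats current value
                value[v] = (color, dist)
                for w, weight in edges[v]:
                    outbox[w].append((dist + weight, color))
        inbox = outbox                         # synchronization barrier
    return value


# Hypothetical toy graph: g - x - y - r with unit weights, roots g and r.
toy_graph = {
    "g": [("x", 1.0)],
    "x": [("g", 1.0), ("y", 1.0)],
    "y": [("x", 1.0), ("r", 1.0)],
    "r": [("y", 1.0)],
}
labels = propagate_colors(toy_graph, {"g": "GREEN", "r": "RED"})
```

The only change compared with the shortest path case is that the message and the vertex value carry the color along with the distance.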
  • 83. Up to you guys – Classification of customers by product. Test: write a classification example.
  • 84. Intermediate Conclusion. Can I use graph mining algorithms on huge graphs with the distributed frameworks coming from the web?
  • 85. Intermediate Conclusion. Can we do graph mining on large graphs using the distributed approach? Yes you can, but … 1. You need to choose an implicit distributed framework. 2. This will constrain the programming model & the storage. 3. You need to re-design the algorithm to fully exploit the framework. If I can mine the graph, does it mean that I have a data warehouse? What do we miss to have a full graph data warehouse?
  • 86. AGENDA 1 / Introduction 2 / Focus on two graph mining algorithms 3 / Introduction of Distributed Processing Framework 4 / Graph Data warehouse – an emerging challenge 5 / Conclusion
  • 87. Links between Data Warehouse & Data Mining Is it the same?
  • 88. Data warehouse & mining: definition of interactions. Data mining algorithms are involved in many steps of the DW: 1. Identifying key attributes 2. Finding related measures 3. Limiting the scope of queries. Mining space: multi-dimensional cube space for mining. Generating features & targets by using OLAP queries. Multi-step OLAP process: using data mining as building blocks. Speeding up model construction by using data cube computation. OLAP frameworks are often integrated with mining frameworks -> OLAM (On-Line Analytic Mining) & exploratory multi-dimensional mining [1]. [1] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
  • 89. Graphs are fine, but stop playing and be an adult. Come back to a professional & business environment, come back to relational DBs.
  • 90. The graph is a constraint. It is not because it is fun, it is because the relationship model brings value. Let’s take the social network example: 1. We can model a friend relationship as an m-n relation. 2. On average ~100 friends. 3. A friends-of-friends request means 100² join requests. Storing a SN in a relational DB is not a problem, unless you need traversal queries for mining.
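To make the friends-of-friends cost concrete, here is a small sketch over an in-memory adjacency map. The class `FriendsOfFriendsSketch`, its methods and the five-person network are hypothetical and only illustrate the counting argument: for each of a person's friends, the whole friend list of that friend has to be visited, which is where the roughly 100 × 100 lookups come from when everyone has about 100 friends.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FriendsOfFriendsSketch {
    // Hypothetical toy network (undirected friendship, stored in both directions).
    static final Map<Integer, Set<Integer>> FRIENDS = Map.of(
            1, Set.of(2, 3),
            2, Set.of(1, 4),
            3, Set.of(1, 4, 5),
            4, Set.of(2, 3),
            5, Set.of(3));

    // People at distance 2: friends of my friends who are neither me nor my direct friends.
    static Set<Integer> friendsOfFriends(Map<Integer, Set<Integer>> friends, int person) {
        Set<Integer> direct = friends.getOrDefault(person, Set.of());
        Set<Integer> result = new TreeSet<>();
        for (int friend : direct) {
            for (int candidate : friends.getOrDefault(friend, Set.of())) {
                if (candidate != person && !direct.contains(candidate)) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    // Number of adjacency entries visited: the sum of the friend-list sizes of all friends.
    static int lookups(Map<Integer, Set<Integer>> friends, int person) {
        int count = 0;
        for (int friend : friends.getOrDefault(person, Set.of())) {
            count += friends.getOrDefault(friend, Set.of()).size();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(friendsOfFriends(FRIENDS, 1));
        System.out.println(lookups(FRIENDS, 1));
    }
}
```

In a relational store each of these visits is a row produced by a self-join on the friendship table, whereas a graph store follows the adjacency lists directly.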
  • 91. A graph in a relational DB: two main issues. 1. Cost of joins when traversing. 2. Transferring almost the whole graph between the client and the DB. We have seen that distributed graph processing frameworks use data locality to minimize this cost. (Diagram labels: Data, Application Server)
  • 92. I got a distributed processing framework & mining algorithms Now do I have a Graph Data warehouse? …BTW what is exactly a Data warehouse?
  • 93. Let’s take a look Traditional Data warehouse Aim at providing software, modeling approaches & tools to analyze a set of data in a collection of DB E. Malinowski and E. Zimanyi. Advanced data warehouse design: From conventional to spatial and temporal applications. Springer-Verlag, 2008.
  • 94. Conceptual modeling: an important topic of research. Aims at providing software, modeling approaches & tools to analyze a set of data in a collection of DBs. Research topic focus: 1. Improvement of the snowflake & star models 2. Models enabling the definition of levels of hierarchies 3. Role played by a measure in different dimensions 4. Properties such as additive, derived. (Diagram labels: Measures, Fact, Dimensions) E. Malinowski and E. Zimanyi. Multidimensional conceptual modeling. In J. Wang, editor, Encyclopedia of Data Warehousing and Mining, pages 293–300. IGI Global, second edition, 2008.
  • 95. Conceptual modeling: the MultiDim model, a conceptual model for Data Warehouse & OLAP applications. (Diagram labels: Measures, Fact, Hierarchy of dimensions, Cardinality child-parent) Conceptual modeling reached a certain level of maturity. E. Malinowski and E. Zimanyi. Multidimensional conceptual modeling. In J. Wang, editor, Encyclopedia of Data Warehousing and Mining, pages 293–300. IGI Global, second edition, 2008.
  • 96. OLAP queries: operations & queries on the model. Extracting information by queries: 1. Roll-up (increasing the level of aggregation) 2. Drill-down (decreasing the level of aggregation or increasing detail) along one or more dimension hierarchies 3. Slice and dice (selection and projection) 4. Pivot (re-orienting the multidimensional view of data). S. Chaudhuri and U. Dayal. An overview of data warehousing and OLAP technology. SIGMOD Record, 26(1):65–74, 1997.
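The first three operations can be illustrated on a tiny in-memory fact table with Java streams. This is only a sketch under assumptions: the `Sale` record, the country > city hierarchy and the four fact rows are hypothetical, and a real OLAP engine would work on materialized cubes rather than on a list.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class OlapSketch {
    // One fact row: three dimension values and one additive measure (hypothetical schema).
    record Sale(String country, String city, String product, int amount) {}

    static final List<Sale> SALES = List.of(
            new Sale("BE", "Brussels", "phone", 10),
            new Sale("BE", "Brussels", "tablet", 5),
            new Sale("BE", "Ghent", "phone", 7),
            new Sale("FR", "Paris", "phone", 20));

    // Drill-down level: the measure summed per city (more detail).
    static Map<String, Integer> byCity(List<Sale> facts) {
        return facts.stream().collect(
                Collectors.groupingBy(Sale::city, TreeMap::new, Collectors.summingInt(Sale::amount)));
    }

    // Roll-up: climb the hierarchy city -> country (more aggregation).
    static Map<String, Integer> rollUpToCountry(List<Sale> facts) {
        return facts.stream().collect(
                Collectors.groupingBy(Sale::country, TreeMap::new, Collectors.summingInt(Sale::amount)));
    }

    // Slice: fix one value of the product dimension.
    static List<Sale> slice(List<Sale> facts, String product) {
        return facts.stream()
                .filter(s -> s.product().equals(product))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(byCity(SALES));
        System.out.println(rollUpToCountry(SALES));
        System.out.println(rollUpToCountry(slice(SALES, "phone")));
    }
}
```

Roll-up works here only because the measure is additive, which is one of the properties the conceptual models on the previous slides have to capture.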
  • 97. Summary: functional layers for OLAP. QUERY LAYER: OLAP cube. TRANSLATION LAYER: snowflake models & SQL requests. PROCESSING FRAMEWORK: OLAP processing framework. STORAGE: traditional relational storage.
  • 98. I got a distributed processing framework & mining algorithms Now do I have a Graph Data warehouse?!
  • 99. Let’s take the data warehouse process: define what is missing if we have a graph model instead of a relational model. Global process overview. We need to be able to model an intermediate structure that keeps the relationship in a central place while defining navigation paths, roles in navigation, summarization properties, etc.
  • 100. Why do navigation paths matter? They are the central element in traversal and therefore in graph mining. They define the way one can traverse the graph. (Diagram: Person, Friends of; Group, Belongs to / Members; Item, Bought by / Bought) Used in: 1. Classification 2. Ranking 3. Collaborative filtering. Roles in paths. Hierarchies in paths. Additivity in paths.
  • 101. Processing layers: dealing with distributed frameworks while keeping a high-level query layer. QUERY LAYER, TRANSLATION LAYER, DISTRIBUTED PROCESSING FRAMEWORK, GRAPH STORAGE
  • 102. Challenges @ processing layers: dealing with distributed frameworks while keeping a high-level query layer. How to deal with the graph nature? If I have a graph DB, how do I use Giraph? How to deal with the distributed aspects? Integration of the processing framework? How to infer a physical execution plan? The data materialization issue is completely different from OLAP. What kind of query language to expose? SQL, PigLatin, SPARQL?
  • 103. The most advanced research, from Google & Microsoft Research. Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data. Combining social interaction information with user profiles: targeted ads, marketing, etc. New warehousing & OLAP multi-dimensional network model: a graph in which a vertex = a tuple in a table, and the attributes of this table = multi-dimensional spaces.
  • 104. The most advanced research, from Google & Microsoft Research. New warehousing & OLAP multi-dimensional network model: a graph in which a vertex = a tuple in a table, and the attributes of this table = multi-dimensional spaces. 1. Showed that we can execute standard OLAP operations while leveraging the graph aspects 2. Defined the algorithm to obtain the aggregated networks from queries 3. Presented a materialization approach. Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data.
  • 105. Showing structural behaviors: examples of operations on multi-dimensional networks. Summarizing the multi-dimensional network on the dimension “Gender”. Drill-down operation: what is the network structure as grouped by both gender & location? Summarizing the multi-dimensional network on the dimensions “Gender” & “Location”: 2 females in CA take 55.6% of the total male-female connections. Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data.
  • 106. Queries on GraphCube: 1. The cuboid queries. The output is the aggregate network corresponding to a specific aggregation of the multi-dimensional network. Example: what is the network structure between various location & profession combinations? The answer = the aggregated network in the desired cuboid of the graph cube. Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data.
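A cuboid query can be read as a "group by" applied to the vertices, followed by a count of the edges between the resulting groups. The sketch below illustrates that reading only; it is not the algorithm of the Graph Cube paper, and the `User` and `Edge` records, the four users and the four edges are hypothetical data.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class GraphCubeSketch {
    // A vertex is a tuple whose attributes are the dimensions (hypothetical schema).
    record User(int id, String gender, String location) {}

    // Undirected relationship between two user ids.
    record Edge(int from, int to) {}

    static final Map<Integer, User> USERS = Map.of(
            1, new User(1, "male", "CA"),
            2, new User(2, "female", "CA"),
            3, new User(3, "female", "WA"),
            4, new User(4, "male", "WA"));

    static final List<Edge> EDGES = List.of(
            new Edge(1, 2), new Edge(1, 3), new Edge(2, 3), new Edge(3, 4));

    // Aggregated network: key = pair of groups, value = number of edges between them.
    static Map<String, Integer> aggregate(Map<Integer, User> users, List<Edge> edges,
                                          Function<User, String> groupBy) {
        Map<String, Integer> weights = new TreeMap<>();
        for (Edge e : edges) {
            String a = groupBy.apply(users.get(e.from()));
            String b = groupBy.apply(users.get(e.to()));
            // Order the two group names so that an undirected edge has a single key.
            String key = a.compareTo(b) <= 0 ? a + "-" + b : b + "-" + a;
            weights.merge(key, 1, Integer::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        // Cuboid on the dimension "gender".
        System.out.println(aggregate(USERS, EDGES, User::gender));
        // Drill-down: cuboid on the dimensions "gender" and "location".
        System.out.println(aggregate(USERS, EDGES, u -> u.gender() + "/" + u.location()));
    }
}
```

Passing a different grouping function selects a different cuboid, which is why materializing all of them quickly becomes expensive.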
  • 107. Queries on GraphCube: 2. Crossboid queries. Queries which cross multiple multi-dimensional spaces (cuboids) of the network. Example: what is the network structure between the user “3” and various locations? Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data.
  • 108. The most advanced research, from Google & Microsoft Research. New warehousing & OLAP multi-dimensional network model: a graph in which a vertex = a tuple in a table, and the attributes of this table = multi-dimensional spaces. 1. Showed that we can execute standard OLAP operations while leveraging the graph aspects 2. Defined the algorithm to obtain the aggregated networks from queries 3. Presented a materialization approach. Only considers vertices of the same type. Only centralized processing. The materialization policy is then inspired by legacy central DWs. Zhao et al., Graph cube: on warehousing and OLAP multidimensional networks, in Proceedings of the 2011 international conference on Management of data.
  • 109. AGENDA 1 / Introduction 2 / Focus on two graph mining algorithms 3 / Introduction of Distributed Processing Framework 4 / Graph Data warehouse – an emerging challenge 5 / Conclusion
  • 110. Conclusion. Today building blocks exist to mine large graphs. Up to you to assemble them for a dedicated purpose. DISTRIBUTED PROCESSING FRAMEWORK, GRAPH STORAGE, MINING LIBRARIES, NON-GRAPH BASED
  • 111. Conclusion. Structuring linked data as a graph is an emerging & important requirement. Important challenges for mining algorithms: adapting the logic to include the global relationship. Important challenges for the processing layer: re-designing algorithms, integrating the storage layer, using emerging Big Data frameworks. However, implicit distributed graph processing frameworks are emerging. Still far from the concept of Graph Data Warehouse: lack of modeling, uniform stack, query language; need to re-design the materialization.
  • 112. THANK YOU EBISS, 20 of July 2012 Brussels sabri.skhiri@euranova.eu / twitter@sskhiri /http://blog.euranova.eu SABRI SKHIRI / RESEARCH DIRECTOR EURA NOVA
  • 113. REFERENCES Apache Mahout, scalable machine learning http://mahout.apache.org/ ; Apache Hadoop, distributed computing framework http://hadoop.apache.org/ ; Apache Giraph, open source implementation of Pregel http://incubator.apache.org/giraph/ ; NAIAD, open source implementation of Scala http://naiad-processing.org ; Cassandra, NoSQL column oriented storage http://cassandra.apache.org/ ; HBase, NoSQL column oriented storage http://hbase.apache.org/ ; PigLatin, high level query framework http://pig.apache.org/ ; Scribe, log aggregator framework https://github.com/facebook/scribe

Editor's Notes

  1. Data mining has been developed for two decades; we have mature algorithms, libraries, and even products, mainly focused on relational and flat data. New requirements are coming from research and industry: biology, chemistry, social networks, the internet, etc. The question is then: are traditional data mining algorithms, and also the processing stack, still equally applicable to this new data model? What do we need as a processing framework, and what can we change from the algorithmic viewpoint?
  2. Those techniques have been heavily developed in recent years in business intelligence [1,2], especially for databases and flat data, in order to feed market analysis, business management, and assisted-decision tools [16]. It is worth saying that data mining stands at the intersection of different disciplines such as statistics, machine learning, information retrieval and pattern recognition. In fact, there is no question that data mining appropriately uses algorithms from these well-studied fields. Indeed, almost all mining algorithms can be divided into the following families: (1) classification, in which we position data in pre-determined groups; (2) clustering, in which data are grouped within partitions according to different criteria; (3) association, which links data to each other; (4) pattern recognition, in which we mine data to retrieve pre-determined patterns; (5) feature extraction; and (6) summarization (ranking, such as PageRank).
  3. The instances of data to be mined are considered independent, without relationships between them. For example, in the case of a clustering algorithm, in which the input data set is divided into groups of similar objects, it is assumed that there is no relation between the objects. Hence almost all clustering algorithms compute the similarity between all pairs of objects in the data set by means of a distance measure. Indeed, traditional data mining work is focused on multi-dimensional and text data.
  4. Taking the structural relationships into account gives us additional information about the objects, their links and their interactions. In social networks, we even start speaking about a social user profile instead of a user profile. We directly see an evolution of mining when another data model is considered.
  6. Description of the schema: catalysts, inhibitors, compounds, proteins, and the genes coding for those proteins.
  7. Each gene that codes for those proteins can be activated or blocked -> signal transduction network or gene regulation network. If we take a systemic approach, we end up with a huge graph.
  8. Gene, Regulator, Protein, Compound
  9. To calculate this measure we use the classic Jaccard index, a widely used measure to compare the similarity and diversity of sample sets. The second structural feature to analyse is our intuition about the popularity of shared items: the more highly connected a shared item is, the higher the distance value will be. alpha + beta + gamma = 1.
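The Jaccard index mentioned in this note is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows; the class name `JaccardSketch` and the convention of returning 0 for two empty sets are assumptions made here, and the alpha, beta, gamma weighting of the combined measure is not covered because its three terms are not defined in these notes.

```java
import java.util.HashSet;
import java.util.Set;

class JaccardSketch {
    // Jaccard index: |A ∩ B| / |A ∪ B|, between 0 (disjoint) and 1 (identical).
    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0; // assumed convention for two empty sets
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        // Two customers described by the items they bought (hypothetical data).
        System.out.println(jaccard(Set.of("a", "b", "c"), Set.of("b", "c", "d")));
    }
}
```

The two sets share 2 of 4 distinct items, so the index is 0.5.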
  12. I will show you how the new generation of distributed processing frameworks provides powerful tools for this kind of mining.
  13. Making the infrastructure evolve to support mobile apps. Re-designing service life-cycle management: to be competitive you need to deploy on the market in cycles of less than 3 weeks; how to re-design the complete chain; governance integration.
  14. P is the probability of teleportation of the surfer. We stop the algorithm after a pre-defined number of iterations.
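The two ingredients of this note, a teleportation probability and a fixed number of iterations, can be sketched as a plain power iteration. This is an illustrative single-machine version, not the Giraph implementation shown in the talk; the class `PageRankSketch`, the adjacency-array input and the choice to spread the rank of vertices without outgoing links over all vertices are assumptions.

```java
import java.util.Arrays;

class PageRankSketch {
    // outLinks[v] lists the vertices that v links to; teleport is the probability P.
    static double[] pageRank(int[][] outLinks, double teleport, int iterations) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        // Stop after a pre-defined number of iterations, as in the note.
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            // With probability P the surfer jumps to a random vertex.
            Arrays.fill(next, teleport / n);
            for (int v = 0; v < n; v++) {
                if (outLinks[v].length == 0) {
                    // Assumed handling of dangling vertices: share the rank with everyone.
                    for (int u = 0; u < n; u++) {
                        next[u] += (1 - teleport) * rank[v] / n;
                    }
                } else {
                    // Otherwise the surfer follows one outgoing link chosen uniformly.
                    for (int u : outLinks[v]) {
                        next[u] += (1 - teleport) * rank[v] / outLinks[v].length;
                    }
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical 3-page web: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
        int[][] links = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(links, 0.15, 30)));
    }
}
```

The ranks always sum to 1, and in the toy graph page 2 ranks above page 1 because it receives links from both other pages.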
  15. For example, a co-authorship graph is a bipartite graph; by clustering this kind of graph we can see papers and authors in different clusters and easily identify the papers relevant to a specific domain. This kind of thing is also used in targeted advertising.
  16. This kind of thing is also used in targeted advertising.
  17. K-means takes 2 parameters: the number of groups k and a similarity measure between two object instances. Convergence means that no object moves during the last 2 rounds.
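As a sketch of these two parameters, the version below clusters one-dimensional points, takes the distance as a function argument and derives k from the number of initial centroids. Several things are assumptions: the class `KMeansSketch`, the explicit initial centroids (used instead of random seeding to keep the run deterministic), the mean as centroid update, and stopping as soon as one full round moves no object.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

class KMeansSketch {
    // Returns, for each point, the index of the cluster it is assigned to.
    static int[] cluster(double[] points, double[] initialCentroids, DoubleBinaryOperator distance) {
        double[] centroids = initialCentroids.clone();
        int k = centroids.length;
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);
        boolean moved = true;
        while (moved) {
            moved = false;
            // Assignment step: each point joins its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (distance.applyAsDouble(points[i], centroids[c])
                            < distance.applyAsDouble(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (best != assignment[i]) {
                    assignment[i] = best;
                    moved = true;
                }
            }
            // Update step: each centroid becomes the mean of its members.
            for (int c = 0; c < k; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < points.length; i++) {
                    if (assignment[i] == c) {
                        sum += points[i];
                        count++;
                    }
                }
                if (count > 0) {
                    centroids[c] = sum / count;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        int[] result = cluster(points, new double[] {0, 5}, (a, b) -> Math.abs(a - b));
        System.out.println(Arrays.toString(result));
    }
}
```

A graph-aware variant would keep this loop and swap the distance function for one that uses the linkage information.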
  18. Graph-aware means that the graph and linkage information is taken into account in the computation.
  21. The size of the graph makes it impossible to work in memory, so we need another kind of solution. One of them is to spread the graph over distributed storage and to link the processing to this storage.
  22. Distributed architectures can be classified & described according to the resources that the machines or processors share with each other, but also according to the programming paradigm they offer.
  23. Distributed architectures can be classified & described according to the resources that the machines or processors share with each other.
  24. NEC SAN storage
  26. This policy defines the location of the data and then, the distributed computing framework can send dedicated tasks where the data is located. This represents the notion of data locality
  30. Iterative processing, as is the case in K-means, clustering, PageRank, etc.
  31. Within each superstep a processor (or a virtual processor) may perform the following operations: (1) perform computations on a set of local data (only) and (2) send or receive messages. Similarly, in Pregel, within a superstep the vertices of the graph execute the same user-defined function in parallel. This function can include: a modification of the state of a vertex or of its outgoing edges, reading the messages sent to the vertex in the previous superstep, sending messages to other vertices that will be received in the next superstep, or even a modification of the topology of the graph (deleting or adding vertices and/or edges) [49].
  32. Pregel uses a “vertex voting to halt” technique to determine algorithm termination. Each vertex has two possible states: active or inactive. An algorithm is considered terminated when all the vertices are in the inactive state. Practically, in the initial superstep (superstep 0), all vertices are in the active state; then in each subsequent superstep each vertex can explicitly vote to halt to deactivate itself. An inactive vertex does not participate in any superstep unless it receives a non-empty message.
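The superstep and vote-to-halt mechanics described in notes 31 and 32 can be simulated sequentially. The sketch below propagates the maximum value through a graph, a common illustration of the model; it is plain Java rather than Pregel or Giraph code, and the class `VoteToHaltSketch`, its helper `emptyInboxes` and the four-vertex chain are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class VoteToHaltSketch {
    // Every vertex ends up holding the largest value present in its connected component.
    static int[] propagateMax(int[] initialValues, int[][] neighbors) {
        int n = initialValues.length;
        int[] value = initialValues.clone();
        boolean[] active = new boolean[n];
        Arrays.fill(active, true); // superstep 0: all vertices are active
        List<List<Integer>> inbox = emptyInboxes(n);
        int superstep = 0;
        while (true) {
            List<List<Integer>> outbox = emptyInboxes(n);
            boolean anyActive = false;
            for (int v = 0; v < n; v++) {
                // A received message reactivates a halted vertex.
                if (!inbox.get(v).isEmpty()) {
                    active[v] = true;
                }
                if (!active[v]) {
                    continue;
                }
                anyActive = true;
                // The user-defined function: read the messages sent in the previous superstep.
                boolean changed = superstep == 0;
                for (int message : inbox.get(v)) {
                    if (message > value[v]) {
                        value[v] = message;
                        changed = true;
                    }
                }
                // Messages sent now are delivered in the next superstep.
                if (changed) {
                    for (int u : neighbors[v]) {
                        outbox.get(u).add(value[v]);
                    }
                }
                active[v] = false; // vote to halt
            }
            // Termination: every vertex is inactive and no message is in flight.
            if (!anyActive) {
                break;
            }
            inbox = outbox;
            superstep++;
        }
        return value;
    }

    static List<List<Integer>> emptyInboxes(int n) {
        List<List<Integer>> boxes = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            boxes.add(new ArrayList<>());
        }
        return boxes;
    }

    public static void main(String[] args) {
        // Hypothetical chain 0 - 1 - 2 - 3 with initial values 3, 6, 2, 1.
        int[][] chain = { {1}, {0, 2}, {1, 3}, {2} };
        System.out.println(Arrays.toString(propagateMax(new int[] {3, 6, 2, 1}, chain)));
    }
}
```

In a real framework the inner loop over vertices runs in parallel on the workers that hold each partition; the barrier between supersteps corresponds to the swap of inbox and outbox.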
  33. The initial step consists of setting the values associated with all the other vertices to infinity. In superstep 1, the vertices (2), (3) and (4) receive the messages sent by the vertex (1) in superstep 0, which contain their respective distances to (1). For instance, the vertex (2) receives a message that contains 6, which is the sum of the value of vertex (1) and the weight of the outgoing edge ((1),(2)). Moreover, in superstep 1, the source vertex is in the inactive state because it does not receive any message in this superstep. The next supersteps follow the same procedure until all the vertices are in the inactive state.
  36. Vertex value = tentative PageRank. Message = the tentative PageRank of the sending vertex divided by its number of outgoing edges, which gives the term to sum in the current vertex.
  38. https://github.com/apache/giraph
  40. Equally functional features of a data warehouse, but for the graph model? GoldenOrb does not use generics for messages, so you have to cast them yourself! @Override public void compute(Collection<IntMessage> messages) { int _maxValue = 0; for (IntMessage m : messages) { int msgValue = ((IntWritable) m.getMessageValue()).get(); _maxValue = Math.max(_maxValue, msgValue); } }
  42. The result is that they greatly minimize the amount of data to transfer and even optimise data locality. Today, in the relational model, this vision is emerging with the concept of SQL MR DBs.
  43. Let’s focus on the data warehouse and OLAP tier.
  44. Most of the research topics focus on the improvement of the snowflake and star schemas [51]. Some research adds a graphical representation [60] based on the ER model [61, 66] or on UML [2, 47]; other work focuses on models that enable the definition of different levels of hierarchies [60, 7, 31, 39], while other work provides models that take into account the role played by a measure in different dimensions [47, 1]. The model described in [51] summarizes the main limitations of the snowflake and star models and proposes a new model that includes most of the previous research in this area.
  45. I will not go into the details of this model; my only goal on this slide is to show you that we have reached a certain level of maturity in the conceptual modeling area for data warehouses.
  46. The processing layer takes the cube and the queries to generate an optimized physical execution plan that will materialize the queries.
  47. The DB can be an existing DB, such as in fraud detection with a buying log or in telecom with call data records, or a graph storage. Then, as soon as we have our consolidated graph, there is a gap in the conceptual modeling: what model will we have, and what kind of queries? There is no conceptual modeling approach as functional as what we find in data warehouses. The MultiDim model defined by Esteban could be applied here, but with some semantic modifications to take our constraints into account.
  48. I have 3 ways to navigate in the graph: Friends; Belongs to / Members; Bought / Bought by.
  49. You need to deal with many data types: non-structured, graph, semi-structured, structured. The size of the data will perhaps lead you to consider a distributed approach. Then a translation layer is required to transform your query into a physical execution plan.
  52. 1st graph: shows the aggregated network that results from the condensing aggregation operation “group by gender”. The vertices are the vertices condensed by the aggregation. The edges represent the relations between aggregated vertices. The weights on the edges are the result of the count operation on the group by.
  53. The graph cube is obtained by restructuring all possible aggregations of A.