Multiplatform Spark solution for Graph datasources by Javier Dominguez
17 NOV 2016 @ BIG DATA SPAIN
Javier Dominguez Montes
Studied computer engineering at the ULPGC. He is passionate about Scala, Python and all Big Data technologies, and is currently part of the Data Science team at Stratio Big Data, working on ML algorithms and profiling analysis based on Spark.
MULTIPLATFORM SOLUTION FOR GRAPH DATASOURCES
• Graph use cases
• Machine learning
• Main process explanation
• Notebooks show off
• Results
An example of how to exploit a massive database at different stages and through several graph technologies.
MACHINE LEARNING LIFE CYCLE WITH BIG DATA
Show how a data scientist is able to take advantage of a graph database through different datasources and technologies.
Use a massive dataset as an example.
Query the datasource from different technologies.
And finally apply Machine Learning over our information!
BIG DATA SPAIN USE CASE
Making use of a massive graph datasource implies making batch queries over it.
We will need to make them with our distributed technologies... the easier, the better.
Motifs filter example
val g: GraphFrame = GraphFrame(usersDf, relationshipsDf)
// Search for chains where a person is related to another person who has an ability in a technology
val motifs: DataFrame = g.find("(person_1)-[relation]->(person_2); (person_2)-[abilities]->(technology)")
// More complex queries can be expressed by applying filters.
motifs.filter("person_1.name = 'Javier' AND technology.name = 'Neo4j'")
Most of our clients or teammates will need fast and easy access to the information.
We will need a way to make easy queries and, of course, a graphical representation of our data!
We will also need microservices, such as REST operations, over our datastore.
Apache Spark is a fast and general engine for large-scale data processing.
GraphX is the Spark API for the management and distributed computation of graphs. It comes with a great variety of graph algorithms.
GraphFrames aims to provide both the functionality of GraphX and extended functionality taking advantage of Spark DataFrames. This extended functionality includes motif finding and highly expressive graph queries.
Neo4j is a highly scalable native graph database that leverages data relationships as first-class entities.
Big data alone used to be enough, but enterprise leaders need more than just volumes of information to
make bottom-line decisions. You need real-time insights into how data is related.
It's possible to quickly and automatically produce models that can analyze bigger, more complex data
and deliver faster, more accurate results – even on a very large scale. The result? High-value predictions
that can guide better decisions and smart actions in real time without human intervention.
It will relate all the existing objects in our dataset and infer possible relationships.
Integration of different Open Source libraries of distributed machine learning algorithms.
Development environment adapted to each data scientist.
Real-time decisions based on machine learning models
Integrated with all components of the Stratio Big Data Platform
Comprehensive knowledge lifecycle management
Freebase aimed to create a global resource that allowed people
(and machines) to access common information more effectively.
This model is based on the idea of converting statements about resources into subject-predicate-object expressions, which are called triples.
Subject: the resource; what we are describing.
Predicate: a property, or a relationship with the object value.
Object value: the property's value, or the related subject.
<'Cristiano Ronaldo'> <'Scores in 2014/2015'> 61 .
<'Cristiano Ronaldo'> <'Born in'> 'Portugal' .
Total triples: 1.9 billion
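As an illustrative sketch of the triple model above (the `Triple` case class and parser below are hypothetical helpers, not part of the talk's code), lines in that angle-bracket style can be modeled and parsed in Scala like this:

```scala
// Minimal sketch: model a Freebase-style triple and parse the angle-bracket lines above.
case class Triple(subject: String, predicate: String, obj: String)

object TripleParser {
  // Matches lines like: <'Cristiano Ronaldo'> <'Born in'> 'Portugal' .
  private val Pattern = """<'([^']*)'>\s+<'([^']*)'>\s+'?([^'.]+?)'?\s*\.""".r

  def parse(line: String): Option[Triple] = line.trim match {
    case Pattern(s, p, o) => Some(Triple(s, p, o.trim))
    case _                => None
  }
}
```

The object value may be either a quoted subject ('Portugal') or a plain literal (61), which is why the closing quotes are optional in the pattern.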
A k-core of a graph G is a maximal connected subgraph of G in which all vertices have degree at least k.
Equivalently, it is one of the connected components of the subgraph of G formed by repeatedly deleting all
vertices of degree less than k.
Remove all nodes with fewer than k connections.
At the end, we want only the most representative and connected elements in our graph.
In our use case we used k = 5.
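The repeated-deletion definition above can be sketched in plain Scala (an illustrative, single-machine version under assumed types, not the distributed Spark implementation used in the talk):

```scala
// Repeatedly delete vertices of degree < k until every remaining vertex has degree >= k.
// Returns the union of all k-cores; the connected components of the result are the k-cores.
def kCore(edges: Set[(Int, Int)], k: Int): Set[Int] = {
  var remaining = edges.flatMap { case (a, b) => Set(a, b) }
  var active = edges
  var changed = true
  while (changed) {
    // Each undirected edge contributes 1 to the degree of both endpoints.
    val degree = active.toSeq
      .flatMap { case (a, b) => Seq(a, b) }
      .groupBy(identity).map { case (v, occ) => v -> occ.size }
    val keep = remaining.filter(v => degree.getOrElse(v, 0) >= k)
    changed = keep != remaining
    remaining = keep
    active = active.filter { case (a, b) => remaining(a) && remaining(b) }
  }
  remaining
}
```

For example, in a triangle with one pendant node attached, the pendant is pruned for k = 2 and the triangle survives.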
Jaccard Graph Clustering
Node clusterization based on concrete relations, optimized for Big Data.
We've developed a straightforward functionality which is able to detect patterns and clusterize data in a graph database thanks to daily machine learning processes.
• HDFS / Parquet
• Spark / GraphX
• Jaccard distance calculation in a daily process
• Node graph clustering
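To make the Jaccard step concrete, here is a plain-Scala sketch of the Jaccard similarity between two nodes' neighbor sets (an illustrative helper; the talk's version runs distributed over GraphX):

```scala
// Jaccard similarity between the neighbor sets of two nodes:
// |A intersect B| / |A union B|. Jaccard distance = 1 - similarity.
def jaccardSimilarity[A](neighborsA: Set[A], neighborsB: Set[A]): Double = {
  val union = (neighborsA ++ neighborsB).size
  if (union == 0) 0.0
  else (neighborsA & neighborsB).size.toDouble / union
}
```

Nodes whose neighbor sets are similar enough (distance below a threshold) can then be grouped into the same cluster.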
BANK USE CASE
Semantic search engine
Include Elasticsearch as a search engine for making text searches.
Apply more Machine Learning algorithms
• Connected components: as we've already done, try to cluster information based on its relationships.
• PageRank: measure the importance of a subject.
• Triangle counting: check possible triangle relationships inside our dataset to avoid redundancy.
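As a sketch of the PageRank idea from the list above, here is a minimal single-machine power-iteration version (an illustration under assumed types; in practice GraphX provides a distributed implementation):

```scala
// Minimal PageRank power iteration over a directed graph given as (src, dst) edges.
// Note: for simplicity, rank mass from dangling nodes (no out-edges) is dropped.
def pageRank(edges: Seq[(Int, Int)], iterations: Int = 20, d: Double = 0.85): Map[Int, Double] = {
  val nodes = edges.flatMap { case (a, b) => Seq(a, b) }.distinct
  val outDegree = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
  var rank = nodes.map(_ -> 1.0 / nodes.size).toMap
  for (_ <- 1 to iterations) {
    // Each node spreads its current rank evenly across its outgoing edges.
    val contribs = edges
      .map { case (src, dst) => dst -> rank(src) / outDegree(src) }
      .groupBy(_._1).map { case (v, cs) => v -> cs.map(_._2).sum }
    rank = nodes.map(v => v -> ((1 - d) / nodes.size + d * contribs.getOrElse(v, 0.0))).toMap
  }
  rank
}
```

A node pointed to by many others (or by important ones) ends up with a higher rank, which is exactly the "importance of a subject" measure mentioned above.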
New Graph use cases
• Fraud detection
• Recommendation System
Tel: (+1) 408 5998830
Tel: (+34) 91 828 64 73