ANALYSING BILLION NODE GRAPHS
DATA SUMMER CONF 2018
ODESSA, UKRAINE
GIORGI JVARIDZE
SENIOR SOFTWARE ENGINEER - ZALANDO
21-07-2018
2
ABOUT ME
- Giorgi Jvaridze
https://www.linkedin.com/in/jvaridze/
- Senior Software Engineer @ Zalando
- Previously at Tripadvisor - Personalization
- Dublin Data Science meetup group
- Data Science Tbilisi meetup group
- DataFest Tbilisi - Annual conference about Data Science, Data Journalism and related topics.
3
AGENDA
- Intros
- Graphs
- Cross Device Graph
- Deterministic Approach
- Probabilistic Approach
- Challenges
- Conclusion
- Q&A
OUR VISION:
CONNECTING PEOPLE AND FASHION
6
ZALANDO AT A GLANCE
- ~4.5 billion EUR revenue (2017)
- > 200 million visits per month
- > 15,000 employees in Europe
- > 75% of visits via mobile devices
- > 23 million active customers
- > 300,000 product choices
- ~2,000 brands
- 17 countries
as at Jun 2018
7
WE BRING FASHION TO PEOPLE IN 17 COUNTRIES
[Map of the 17 countries; legend periods: 2008-2009, 2010, 2011, 2012-2013, 2018]
8
DATAFEST TBILISI 2018
https://datafest.ge/
9
DATAFEST TBILISI
[Figure: example graph with vertices a to h, grouped into three connected components C1, C2 and C3]
15
GRAPHS / NETWORKS
- Vertices or Nodes
- Edges
- Weighted vs Unweighted
- Directed vs Undirected
- Connected Components
- Vertex/Edge attributes
- Many more properties and algorithms
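The properties listed on this slide map onto a small data structure. The following is a minimal sketch, assuming an adjacency-list representation; the `Graph` class, its method names and the `kind` attribute are illustrative and do not come from the talk.

```python
from collections import defaultdict


class Graph:
    """Tiny adjacency-list graph; illustrative only."""

    def __init__(self, directed=False):
        self.directed = directed
        self.adj = defaultdict(dict)   # vertex -> {neighbour: edge weight}
        self.vertex_attrs = {}         # vertex -> dict of attributes

    def add_vertex(self, v, **attrs):
        self.adj.setdefault(v, {})     # registers isolated vertices too
        self.vertex_attrs.setdefault(v, {}).update(attrs)

    def add_edge(self, src, dst, weight=1.0):
        self.add_vertex(src)
        self.add_vertex(dst)
        self.adj[src][dst] = weight
        if not self.directed:          # undirected: store both directions
            self.adj[dst][src] = weight

    def neighbours(self, v):
        return set(self.adj[v])


g = Graph(directed=False)
g.add_vertex("a", kind="cookie")       # hypothetical vertex attribute
g.add_edge("a", "b", weight=0.5)       # weighted edge
g.add_edge("c", "d")                   # unweighted edge, default weight 1.0
```

An unweighted graph is the special case where every weight is left at its default.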
16
GRAPHS / NETWORKS
- Social Networks
- The Internet
- Biomedical Networks
- Network of Neurons
- Many Data Science Applications
17
DATA SCIENCE AT ZALANDO
18
CROSS-DEVICE GRAPH
19
USE CASES
22
Approach
Deterministic vs Probabilistic
23
GRAPH CREATION
some_id  other_id  another_id
f1       c1        -
f1       c2        -
f1       c2        u1
f2       -         u1
f3       c3        -
f4       c4        u2
f5       c5        u2
f6       -         u2
[Figure: graph built from the table, with vertices c1, f1, c2, f2, u1, f3, c3, f4, f5, f6, c4, c5, u2]
CONNECTED COMPONENTS
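The step from the ID table to connected components can be sketched in plain Python. This is a minimal, single-machine illustration, assuming that every pair of IDs sharing a row becomes an edge (the slides do not spell out the exact rule); the function names are made up for this example, and union-find stands in for the distributed algorithm used at scale.

```python
from itertools import combinations

# Rows of the ID table from the GRAPH CREATION slide; None stands for "-".
ROWS = [
    ("f1", "c1", None),
    ("f1", "c2", None),
    ("f1", "c2", "u1"),
    ("f2", None, "u1"),
    ("f3", "c3", None),
    ("f4", "c4", "u2"),
    ("f5", "c5", "u2"),
    ("f6", None, "u2"),
]


def build_edges(rows):
    """Assumption: every pair of IDs that share a row becomes an edge."""
    edges = set()
    for row in rows:
        ids = sorted(i for i in row if i is not None)
        edges.update(combinations(ids, 2))
    return edges


def connected_components(edges):
    """Union-find over the edge list; returns a list of vertex sets."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra                # merge the two trees

    groups = {}
    for v in list(parent):
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=lambda s: sorted(s))


EDGES = build_edges(ROWS)
COMPONENTS = connected_components(EDGES)
```

On this table the sketch yields three components: {c1, c2, f1, f2, u1}, {c3, f3} and {c4, c5, f4, f5, f6, u2}.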
29
ARCHITECTURE
30
TECHNOLOGIES
AWS S3, AWS EMR, AWS Data Pipeline
31
APACHE SPARK
32
SPARK GRAPHFRAMES
- Distributed graph processing framework built on top of DataFrames
- GraphX, by contrast, is based on RDDs
- A graph is represented by a GraphFrame object
- A GraphFrame object takes two DataFrames: vertices and edges
- Vertices must have an id column
- Edges must have src and dst columns
35
CONNECTED COMPONENTS - NAIVE IMPLEMENTATION (GRAPHX)
1. Assign each vertex a unique component ID
2. Iterate until convergence:
- For each vertex V, update:
Component ID of V ← smallest component ID in the neighbourhood of V
Pro: Easy to implement
Con: Slow convergence on large-diameter graphs
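The two steps above can be sketched on a single machine. This is an illustration of the idea only, not GraphX code; it assumes vertex IDs double as the initial component IDs, and the function name is invented for this example. The path graph at the end shows the drawback: the number of rounds grows with the diameter.

```python
from collections import defaultdict


def min_label_propagation(vertices, edges):
    """Return (labels, rounds): every vertex ends up labelled with the
    smallest vertex ID in its connected component."""
    vertices = list(vertices)
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    labels = {v: v for v in vertices}    # step 1: unique component IDs
    rounds = 0
    while True:                          # step 2: iterate until convergence
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in neighbours[v]])
            for v in vertices
        }
        if new_labels == labels:
            return labels, rounds
        labels = new_labels
        rounds += 1


# A path 0-1-...-9 has diameter 9, so label 0 needs 9 rounds to reach vertex 9.
PATH = [(i, i + 1) for i in range(9)]
LABELS, ROUNDS = min_label_propagation(range(10), PATH)
```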
36
SMALL-/LARGE-STAR ALGORITHM (GRAPHFRAMES)
1. Assign each vertex a unique component ID
2. Iterate until convergence:
- Small Star: for each vertex, connect all smaller neighbors and the vertex itself to the minimum neighbor.
- Large Star: for each vertex, connect all strictly larger neighbors to the minimum neighbor, where the minimum is taken over the neighborhood including the vertex itself.
Kiveris et al. "Connected Components in MapReduce and Beyond."
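The two operations can be sketched with plain Python sets. This is a single-machine illustration of the alternating scheme, assuming comparable vertex IDs; it is not the GraphFrames implementation, and the function names are chosen for this example. Edges are stored as (larger, smaller) pairs.

```python
from collections import defaultdict


def large_star(edges):
    """Connect every strictly larger neighbour of u to the minimum of u's
    neighbourhood, where the neighbourhood includes u itself."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    out = set()
    for u, ns in neighbours.items():
        m = min(ns | {u})
        out.update((v, m) for v in ns if v > u)
    return out


def small_star(edges):
    """Connect u and all of its smaller neighbours to the minimum neighbour."""
    smaller = defaultdict(set)
    for a, b in edges:
        smaller[max(a, b)].add(min(a, b))
    out = set()
    for u, ns in smaller.items():
        m = min(ns)
        out.update((v, m) for v in ns | {u} if v != m)
    return out


def star_components(vertices, edges):
    """Alternate the two operations until the edge set stops changing, then
    read the component label (the minimum vertex ID) off the resulting stars."""
    edges = {(max(a, b), min(a, b)) for a, b in edges if a != b}
    while True:
        new_edges = small_star(large_star(edges))
        if new_edges == edges:
            break
        edges = new_edges
    labels = {v: v for v in vertices}          # star centres and isolated vertices
    labels.update({v: m for v, m in edges})    # every leaf points at its centre
    return labels


EDGES = [("f1", "c1"), ("f1", "c2"), ("c2", "u1"), ("f2", "u1"), ("f3", "c3")]
VERTICES = {v for e in EDGES for v in e}
LABELS = star_components(VERTICES, EDGES)
```

At the fixed point every component is a star centred on its smallest ID, so the labels can be read directly from the edge list.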
40
Approach
Probabilistic Cross-Device Graph
41
43
CLASSIFICATION PROBLEM
[Figure: graph with vertices c1, f1, c2, f2, u1, f3, c3, f4, f5, f6, c4, c5, u2]
Training Data
Vertex 1 Features  Vertex 2 Features  Label
c1                 f1                 true
c4                 f4                 true
c1                 c4                 false
c1                 c5                 false
...                ...                ...
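One way to assemble such a training table is sketched below. It assumes that pairs inside a deterministic component are trusted as positives and that negatives are sampled from pairs straddling two components; the slides do not state the sampling scheme, so `training_pairs`, `n_negatives` and `seed` are illustrative names. Feature columns are left out, so each row is just (vertex 1, vertex 2, label).

```python
import random
from itertools import combinations


def training_pairs(components, n_negatives, seed=0):
    """Label same-component pairs True and sampled cross-component pairs False.

    Assumption: the (disjoint) deterministic components are the ground truth.
    """
    rng = random.Random(seed)
    positives = [
        (a, b, True)
        for comp in components
        for a, b in combinations(sorted(comp), 2)
    ]
    component_of = {v: i for i, comp in enumerate(components) for v in comp}
    vertices = sorted(component_of)
    n = len(vertices)
    max_negatives = n * (n - 1) // 2 - len(positives)
    if n_negatives > max_negatives:
        raise ValueError("not enough cross-component pairs to sample from")

    negatives = set()
    # Rejection sampling: keep only pairs that straddle two components.
    while len(negatives) < n_negatives:
        a, b = rng.sample(vertices, 2)
        if component_of[a] != component_of[b]:
            negatives.add((min(a, b), max(a, b), False))
    return positives + sorted(negatives)


COMPONENTS = [{"c1", "f1"}, {"c4", "f4"}]
PAIRS = training_pairs(COMPONENTS, n_negatives=2)
```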
44
FEATURE ENGINEERING
● ID Metadata - first seen, last seen, number of sessions, etc.
● Session Information - mobile/desktop, device, browser, geoip, activity patterns, etc.
● Product Information - category browse, product similarity
● Fashion Insights
● Features of Neighbours - immediate or within connected component…
● Graph Features - graph centrality, community detection scores, etc.
● Graph Embeddings
45
Graph Embeddings
DeepWalk and Node2vec
● How do we capture structure in relational data?
https://thegradient.pub/structure-learning/
● Fraud Detection Using Deep Learning on Graph Embeddings and Topology Metrics
https://www.experoinc.com/post/fraud-detection-using-deep-learning-on-graph-embeddings-and-topology-metrics
● Graph Convolutional Networks -
https://tkipf.github.io/graph-convolutional-networks/
46
WORD2VEC
47
NODE2VEC - TEXT GENERATION
[Figure: cross-device graph with vertices c1, f1, c2, f2, u1, f3, c3, f4, f5, f6, c4, c5, u2]
c1 f1 c2 u1 f1
u1 c2 u1 f2
f3 c3 f3 c3
c4 f4 u2 f5 c5 f5
...
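Walks like the ones above can be generated with a few lines of Python. This is a simplified sketch: node2vec biases each step with its return and in-out parameters p and q, whereas the code below takes uniform steps (DeepWalk-style). The function and parameter names are illustrative. Each walk is a "sentence" of vertex IDs that could then be fed to a word2vec-style model.

```python
import random
from collections import defaultdict


def random_walks(edges, walk_length, walks_per_vertex, seed=0):
    """Uniform random walks; assumes the edge list has no duplicates."""
    rng = random.Random(seed)
    neighbours = defaultdict(list)
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    walks = []
    for start in sorted(neighbours):
        for _ in range(walks_per_vertex):
            walk = [start]
            while len(walk) < walk_length:
                # Every vertex here has at least one neighbour by construction.
                walk.append(rng.choice(neighbours[walk[-1]]))
            walks.append(walk)
    return walks


EDGES = [
    ("c1", "f1"), ("f1", "c2"), ("c2", "u1"),
    ("u1", "f1"), ("u1", "f2"), ("f3", "c3"),
]
WALKS = random_walks(EDGES, walk_length=5, walks_per_vertex=2)
```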
48
Challenges
Visualization and Data Quality
49
Largest connected components
Why is component #187 so huge?
51
Betweenness Centrality in Florentine Families’ Marriage Network
But there are ~4 million nodes.
That’s ~8,000,000,000,000 pairs of IDs.
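The pair count follows from choosing 2 IDs out of n, which is n(n - 1)/2. A quick check, taking n as exactly 4,000,000 (the slide only gives an approximate figure):

```python
def unordered_pairs(n):
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2


PAIRS = unordered_pairs(4_000_000)
print(f"{PAIRS:,}")  # 7,999,998,000,000
```

Betweenness centrality is defined over shortest paths between all such pairs, which is why it does not scale to a component of this size without approximation.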
54
GRAPH PARTITIONING - METIS
“A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs”. George Karypis and Vipin Kumar. SIAM Journal on Scientific Computing, Vol. 20, No. 1, pp. 359-392, 1999.
55
GRAPH PARTITIONING - METIS
- Partition the graph multiple times
- Count how many times each node ends up on the border between partitions
- Investigate the most fragile nodes (those most often on a border)
- Potentially filter them out
- Iterate
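The border statistic can be sketched as follows. The partition assignments are taken as given input here; in the workflow above they would come from repeated METIS runs, which this sketch does not call. `border_counts` and the toy partitionings are hypothetical.

```python
from collections import Counter


def border_counts(edges, partitionings):
    """Count, per vertex, in how many partitionings it touches a cut edge."""
    counts = Counter()
    for part in partitionings:          # part: dict vertex -> partition ID
        border = set()
        for a, b in edges:
            if part[a] != part[b]:      # cut edge: both endpoints are on the border
                border.update((a, b))
        counts.update(border)           # each vertex counted once per partitioning
    return counts


EDGES = [("a", "b"), ("b", "c"), ("c", "d")]
PARTITIONINGS = [                       # three hypothetical 2-way partitionings
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 1, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
]
COUNTS = border_counts(EDGES, PARTITIONINGS)
```

Vertices with the highest counts (b and c in this toy input) are the candidates to inspect and possibly filter out before the next iteration.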
56
OTHER IDEAS
- Community Detection methods
- Spectral Clustering
- Exploratory Analysis
- Visualization
- Others?
59
VISUALIZATION
- Too big
- Most of the tools won’t work
- Slow Graph Layout
- Hairball Problem
https://github.com/anvaka/ngraph
- Tools for working with Graphs (mostly for JS)
- Library for visualizing large graphs on WebGL
- Offline Graph Layout (C++ / OpenMP)
60
GRAPH SUMMARIZATION
Graph Summarization Methods and Applications: A Survey by Yike Liu, Tara Safavi, Abhilash Dighe, Danai Koutra
https://arxiv.org/abs/1612.04883
61
Questions?

 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 

Data Summer Conf 2018, “Analysing Billion Node Graphs (ENG)” — Giorgi Jvaridze, Senior Software Engineer at Zalando

  • 1. ANALYSING BILLION NODE GRAPHS DATA SUMMER CONF 2018 ODESSA, UKRAINE GIORGI JVARIDZE SENIOR SOFTWARE ENGINEER - ZALANDO 21-07-2018
  • 2. ABOUT ME
      - Giorgi Jvaridze, https://www.linkedin.com/in/jvaridze/
      - Senior Software Engineer @ Zalando
      - Previously at Tripadvisor - Personalization
      - Dublin Data Science meetup group
      - Data Science Tbilisi meetup group
      - DataFest Tbilisi - Annual conference about Data Science, Data Journalism and related topics.
  • 3. AGENDA
      - Intros
      - Graphs
      - Cross Device Graph
      - Deterministic Approach
      - Probabilistic Approach
      - Challenges
      - Conclusion
      - Q&A
  • 6. ZALANDO AT A GLANCE (as at Jun 2018)
      - ~ 4.5 billion EUR revenue 2017
      - > 200 million visits per month
      - > 15,000 employees in Europe
      - > 75% of visits via mobile devices
      - > 23 million active customers
      - > 300,000 product choices
      - ~ 2,000 brands
      - 17 countries
  • 7. WE BRING FASHION TO PEOPLE IN 17 COUNTRIES
      [Figure labels: 2008-2009, 2010, 2011, 2012-2013, 2018]
  • 10-15. GRAPHS / NETWORKS
      - Vertices or Nodes
      - Edges
      - Weighted vs Unweighted
      - Directed vs Undirected
      - Connected Components
      - Vertex/Edge attributes
      - Many more properties and algorithms
      [Figure, slide 14: example graph with vertices a-h and connected components C1, C2, C3]
  • 16. GRAPHS / NETWORKS
      - Social Networks
      - The Internet
      - Biomedical Networks
      - Network of Neurons
      - Many Data Science Applications
  • 23-27. GRAPH CREATION
      some_id   other_id   another_id
      f1        c1         -
      f1        c2         -
      f1        c2         u1
      f2        -          u1
      f3        c3         -
      f4        c4         u2
      f5        c5         u2
      f6        -          u2
      [Figure: graph built from the table, with vertices c1, f1, c2, f2, u1, f3, c3, f4, f5, f6, c4, c5, u2]
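
The graph creation step can be sketched in plain Python. This is only an illustration under one assumption that the slides do not spell out: every pair of IDs that appears in the same table row is joined by an undirected edge. The names `ROWS` and `build_graph` are hypothetical.

```python
from itertools import combinations

# Rows of the ID table from the slide; None stands for a "-" (missing) cell.
ROWS = [
    ("f1", "c1", None),
    ("f1", "c2", None),
    ("f1", "c2", "u1"),
    ("f2", None, "u1"),
    ("f3", "c3", None),
    ("f4", "c4", "u2"),
    ("f5", "c5", "u2"),
    ("f6", None, "u2"),
]


def build_graph(rows):
    """Return (vertices, edges); edges are (smaller_id, larger_id) tuples.

    Assumption: IDs that co-occur in one row are linked pairwise.
    """
    vertices, edges = set(), set()
    for row in rows:
        ids = sorted(x for x in row if x is not None)
        vertices.update(ids)
        edges.update(combinations(ids, 2))
    return vertices, edges


vertices, edges = build_graph(ROWS)
print(len(vertices), "vertices,", len(edges), "edges")
```

Storing each edge with its endpoints sorted removes duplicates such as the `f1`/`c2` pair, which appears in two rows.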
  • 28. CONNECTED COMPONENTS
      [Same ID table and graph as on the GRAPH CREATION slides]
  • 30. TECHNOLOGIES
      - AWS S3
      - AWS EMR
      - AWS Data Pipeline
  • 32-33. SPARK GRAPHFRAMES
      - Distributed graph processing framework built on top of DataFrames
      - GraphX is based on RDDs
      - Represented by GraphFrame object
      - GraphFrame object takes two DataFrames: Vertices and Edges
      - Vertices must have id column
      - Edges must have src and dst column
  • 34-35. CONNECTED COMPONENTS - NAIVE IMPLEMENTATION (GRAPHX)
      1. Assign each vertex unique component ID
      2. Iterate until convergence:
         - For each vertex V, update: Component ID of V ← Smallest component ID in neighbourhood of V
      Pro: Easy to implement
      Con: Slow convergence on large-diameter graphs
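
The naive scheme above can be sketched as a single-machine Python function. This is not the GraphX code, only an illustration of the update rule; the function name and the path-graph example are hypothetical. Each round updates every vertex from the labels of the previous round.

```python
def label_propagation_cc(vertices, edges):
    """Minimum-label propagation; returns (labels, rounds that changed a label)."""
    nbrs = {v: set() for v in vertices}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)

    labels = {v: v for v in vertices}  # step 1: unique component ID per vertex
    rounds = 0
    while True:  # step 2: iterate until convergence
        new_labels = {
            v: min([labels[v]] + [labels[n] for n in nbrs[v]]) for v in vertices
        }
        if new_labels == labels:
            return labels, rounds
        labels = new_labels
        rounds += 1


# A path 0 - 1 - ... - 9 has diameter 9: the smallest ID moves one hop per round.
path_vertices = list(range(10))
path_edges = [(i, i + 1) for i in range(9)]
labels, rounds = label_propagation_cc(path_vertices, path_edges)
print(rounds)  # 9
```

The number of rounds grows with the graph diameter, which is the slow convergence the slide lists as the drawback.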
  • 36-38. SMALL-/LARGE-STAR ALGORITHM (GRAPHFRAMES)
      1. Assign each vertex unique component ID
      2. Iterate until convergence:
         - Small Star: For each vertex, connect all smaller neighbors and self to the min neighbor.
         - Large Star: For each vertex, connect all strictly larger neighbors to the min neighbor including self
      Kiveris et al. "Connected Components in MapReduce and Beyond."
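
A minimal single-machine sketch of the two star operations, following the description on the slide. It is not the GraphFrames implementation, and the function names are hypothetical. Vertex IDs are compared with Python's ordinary ordering, so `"c1" < "f1" < "u1"`.

```python
from collections import defaultdict


def _neighbours(edges):
    nbrs = defaultdict(set)
    for u, v in edges:
        if u != v:
            nbrs[u].add(v)
            nbrs[v].add(u)
    return nbrs


def large_star(edges):
    """Connect every strictly larger neighbour of u to the min of u's neighbourhood (incl. u)."""
    out = set()
    for u, ns in _neighbours(edges).items():
        m = min(ns | {u})
        out.update((v, m) for v in ns if v > u)
    return out


def small_star(edges):
    """Connect u and its smaller neighbours to the smallest of them."""
    out = set()
    for u, ns in _neighbours(edges).items():
        group = {v for v in ns if v < u} | {u}
        m = min(group)
        out.update((v, m) for v in group if v != m)
    return out


def connected_components(vertices, edges):
    """Alternate large-star and small-star until the edge set stops changing."""
    edges = set(edges)
    while True:
        updated = small_star(large_star(edges))
        if updated == edges:
            break
        edges = updated
    labels = {v: v for v in vertices}
    for v, m in edges:  # at the fixed point each edge is (vertex, component minimum)
        labels[v] = m
    return labels


EDGES = [
    ("c1", "f1"), ("c2", "f1"), ("c2", "u1"), ("f1", "u1"), ("f2", "u1"),
    ("c3", "f3"),
    ("c4", "f4"), ("c4", "u2"), ("f4", "u2"),
    ("c5", "f5"), ("c5", "u2"), ("f5", "u2"), ("f6", "u2"),
]
VERTICES = {x for e in EDGES for x in e}
labels = connected_components(VERTICES, EDGES)
print(sorted(set(labels.values())))
```

The edge list is the example graph from the GRAPH CREATION slides, with the assumption that IDs sharing a table row are linked pairwise. At the fixed point every component is a star centred on its smallest ID, so that ID serves as the component label.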
  • 39. CONNECTED COMPONENTS
      [Same ID table and graph as on the GRAPH CREATION slides]
  • 42-43. CLASSIFICATION PROBLEM
      [Figure: graph with vertices c1, f1, c2, f2, u1, f3, c3, f4, f5, f6, c4, c5, u2]
      Training Data:
      Vertex 1 Features   Vertex 2 Features   Label
      c1                  f1                  true
      c4                  f4                  true
      c1                  c4                  false
      c1                  c5                  false
      ...                 ...                 ...
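
One way such a training table could be assembled is sketched below. The labelling rule is inferred from the example rows on the slide and is an assumption: pairs joined by a deterministic edge are positives, and pairs drawn from different connected components are negatives. The function name and parameters are hypothetical, and the rows hold raw IDs where the real table would hold the feature vectors of each vertex.

```python
import random
from itertools import combinations


def training_pairs(edges, component_of, n_negative, seed=0):
    """Return (id_1, id_2, label) rows: edges as positives, cross-component pairs as negatives."""
    positives = [(a, b, True) for a, b in sorted(edges)]

    # Enumerating every cross-component pair is only feasible on a toy graph;
    # at billion-node scale the negatives would have to be sampled directly.
    nodes = sorted(component_of)
    cross = [
        (a, b) for a, b in combinations(nodes, 2)
        if component_of[a] != component_of[b]
    ]
    rng = random.Random(seed)
    negatives = [(a, b, False) for a, b in rng.sample(cross, min(n_negative, len(cross)))]
    return positives + negatives


toy_edges = [("c1", "f1"), ("c4", "f4")]
toy_components = {"c1": 0, "f1": 0, "c4": 1, "f4": 1}
pairs = training_pairs(toy_edges, toy_components, n_negative=2)
for row in pairs:
    print(row)
```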
  • 44. FEATURE ENGINEERING
      ● ID Metadata - first seen, last seen, number of sessions etc.
      ● Session Information - mobile/desktop, device, browser, geoip, activity patterns etc.
      ● Product Information - category browse, product similarity
      ● Fashion Insights
      ● Features of Neighbours - immediate or within connected component
      ● Graph Features - graph centrality, community detection scores etc.
      ● Graph Embeddings
  • 45. Graph Embeddings: DeepWalk and Node2vec
      ● How do we capture structure in relational data? https://thegradient.pub/structure-learning/
      ● Fraud Detection Using Deep Learning on Graph Embeddings and Topology Metrics https://www.experoinc.com/post/fraud-detection-using-deep-learning-on-graph-embeddings-and-topology-metrics
      ● Graph Convolutional Networks - https://tkipf.github.io/graph-convolutional-networks/
  • 47. NODE2VEC - TEXT GENERATION
      [Figure: graph with vertices c1, f1, c2, f2, u1, f3, c3, f4, f5, f6, c4, c5, u2]
      Generated text: c1 f1 c2 u1 f1 u1 c2 u1 f2 f3 c3 f3 c3 c4 f4 u2 f5 c5 f5 ...
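
The text generation step can be sketched as uniform random walks over the graph, where each walk becomes one "sentence" of vertex IDs. Uniform walks correspond to DeepWalk. Node2vec additionally biases the walk with its return and in-out parameters, which this sketch leaves out. The function name and parameters are hypothetical.

```python
import random
from collections import defaultdict


def random_walks(edges, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks; each walk is a list of vertex IDs."""
    nbrs = defaultdict(list)
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    rng = random.Random(seed)
    walks = []
    for start in sorted(nbrs):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(nbrs[walk[-1]]))  # step to a random neighbour
            walks.append(walk)
    return walks


WALK_EDGES = [
    ("c1", "f1"), ("c2", "f1"), ("c2", "u1"), ("f1", "u1"), ("f2", "u1"),
    ("c3", "f3"),
]
walks = random_walks(WALK_EDGES, walk_length=5, walks_per_node=2)
for walk in walks[:3]:
    print(" ".join(walk))
```

The resulting sentences can then be fed to a skip-gram model such as word2vec to obtain one embedding per vertex ID.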
  • 50. Why is the component #187 so huge?
      [Figure: Top largest connected components]
  • 51-53. Betweenness Centrality in Florentine Families’ Marriage Network
      But there are ~4 Million nodes. That’s ~8,000,000,000,000 pairs of IDs
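
The pair count on the slide follows from the number of unordered pairs among n vertices, n(n-1)/2. A quick check, taking n as exactly 4 million for illustration:

```python
import math

n = 4_000_000
pairs = math.comb(n, 2)  # n * (n - 1) / 2 unordered pairs of IDs
print(f"{pairs:,}")  # 7,999,998,000,000
```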
  • 54. GRAPH PARTITIONING - METIS
      “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs”. George Karypis and Vipin Kumar. SIAM Journal on Scientific Computing, Vol. 20, No. 1, pp. 359-392, 1999.
  • 55. GRAPH PARTITIONING - METIS
      - Partition Graph multiple times
      - See statistics on how many times each node was on the border between partitions
      - Investigate the most fragile nodes
      - Potentially filter them out
      - Iterate
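
The border statistics can be sketched as follows. The partitions are assumed to come from repeated METIS runs. Here they are written by hand as vertex-to-part dictionaries, and the function names are hypothetical. A vertex counts as a border vertex when at least one of its edges crosses into another part.

```python
from collections import Counter


def border_nodes(edges, part):
    """Vertices with at least one edge crossing into a different part."""
    border = set()
    for u, v in edges:
        if part[u] != part[v]:
            border.update((u, v))
    return border


def border_counts(edges, partitions):
    """Count, over all partitionings, how often each vertex sits on a border."""
    counts = Counter()
    for part in partitions:
        counts.update(border_nodes(edges, part))
    return counts


# Two triangles a-b-c and d-e-f joined by the bridge c-d.
BRIDGE_EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
                ("d", "e"), ("e", "f"), ("d", "f"),
                ("c", "d")]
PARTITIONS = [  # hand-written stand-ins for three partitioning runs
    {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1},
    {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1, "f": 1},
    {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1},
]
counts = border_counts(BRIDGE_EDGES, PARTITIONS)
print(counts.most_common(2))
```

In this toy graph the two bridge endpoints land on a border most often, which makes them the "fragile" candidates to investigate.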
  • 56. OTHER IDEAS
      - Community Detection methods
      - Spectral Clustering
      - Exploratory Analysis
      - Visualization
      - Others?
  • 58-59. VISUALIZATION
      - Too big
      - Most of the tools won’t work
      - Slow Graph Layout
      - Hairball Problem
      https://github.com/anvaka/ngraph
      - Tools for working with Graphs (mostly for JS)
      - Library for visualizing large graphs on WebGL
      - Offline Graph Layout (C++ / OpenMP)
  • 60. GRAPH SUMMARIZATION
      Graph Summarization Methods and Applications: A Survey by Yike Liu, Tara Safavi, Abhilash Dighe, Danai Koutra https://arxiv.org/abs/1612.04883
  • 62. DISCLAIMER
      This presentation and its contents are strictly confidential. It may not, in whole or in part, be reproduced, redistributed, published or passed on to any other person by the recipient. The information in this presentation has not been independently verified. No representation or warranty, express or implied, is made as to the accuracy or completeness of the presentation and the information contained herein and no reliance should be placed on such information. No responsibility is accepted for any liability for any loss howsoever arising, directly or indirectly, from this presentation or its contents.