Scalability and Graph Analytics with Neo4j
Stefan Kolmar
VP Field Engineering - Neo4j
I Remember...
The Evolution of Databases
TRADITIONAL OLTP/RELATIONAL BIG DATA TECHNOLOGY
The classic challenges for Telcos
Large Data Volumes
CDRs
Network Metrics
Customer Metrics
Dynamic Access
What Is Different in Neo4j?
Index-Free Adjacency
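Index-free adjacency means each node holds direct references to its neighbours, so a traversal follows those references instead of consulting a global index at every hop. The following is a minimal Python sketch of that idea only, not Neo4j's storage engine; the `Node` class, the `hops` helper and the network element names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node that keeps direct references to its neighbours."""
    name: str
    out: list = field(default_factory=list)  # adjacent Node objects

    def connect(self, other: "Node") -> None:
        self.out.append(other)


def hops(start: Node, max_hops: int) -> set:
    """Names reachable from `start` in 1..max_hops hops, following references only."""
    seen = {id(start)}
    frontier = [start]
    found = set()
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbour in node.out:  # follow a reference, no index lookup
                if id(neighbour) not in seen:
                    seen.add(id(neighbour))
                    found.add(neighbour.name)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return found


# Hypothetical network path: tower -> router -> core -> gateway
tower, router, core, gateway = (Node(n) for n in ("tower", "router", "core", "gateway"))
tower.connect(router)
router.connect(core)
core.connect(gateway)

two_hop_names = hops(tower, 2)
```

The cost of each hop depends on the number of neighbours of the nodes visited, not on the total size of the data set, which is the property the slide title refers to.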
Chart: Response time vs. connectedness and size of data set
Relational and other NoSQL databases: 0 to 2 hops, 0 to 3 degrees, thousands of connections
Neo4j: tens to hundreds of hops, thousands of degrees, billions of connections
1000x advantage: “Minutes to milliseconds”
The Largest Investment in Graph Databases
Multi-tenancy with Neo4j 4.0
Multi-Database: Use Cases
• B2B SaaS: Greatly simplified management of DB infrastructure for your customers.
• Multi-tenancy: A single Neo4j server or cluster can serve multiple customers or users within an organization.
• Rapid Testing/Development/Deployment: Manage separate databases for development, testing, staging, etc. in a single infrastructure.
• Scalability: Disjoint data is organized in physically separate structures, giving strong isolation.
• Cloud-Friendly: Databases can be associated with cloud storage, easily detached from one server, and attached to another.
Configure and Manage Neo4j Multi-Database
Administration commands:
● CREATE|DROP|START|STOP DATABASE name
Use commands:
● HTTP API: http://server:port/.../database
● Browser & Cypher Shell: :USE database
● Drivers: Session(database)
Example databases (diagram labels): Network Mgmt, Customer Relations
Unbounded Scalability in Neo4j 4.0
Causal Clustering with Neo4j
Fabric: Distributed Graph Query
• Scale-out model
• Two ways of using it:
  • Operate over a single large, decomposed graph
  • Query across disjoint graphs, per business domain
Data Scientists: Run analysis on large, distributed databases.
Developers: Develop large-scale applications on laptops/desktops and deploy in a network of Neo4j clusters.
Enterprises: Keep data in designated geographies. Analyse graphs without replicating or moving them.
Cypher Queries
SQL
Cypher in Neo4j
MATCH (boss)-[:MANAGES*0..3]->(sub),
(sub)-[:MANAGES*1..3]->(report)
RETURN boss.name AS Boss,
sub.name AS Subordinate,
count(report) AS Total
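To make the variable-length patterns concrete: `*0..3` matches paths of zero to three MANAGES relationships (so a boss also counts as their own `sub`), and `*1..3` matches one to three. The Python sketch below mimics the query on a small, hypothetical org chart; `MANAGES`, `within` and `report_totals` are illustrative names, and the chart is assumed to be a tree so that each path is unique.

```python
from collections import Counter, deque

# Hypothetical org chart: manager -> direct reports (a tree, so each path is unique)
MANAGES = {
    "Ann": ["Bob", "Cid"],
    "Bob": ["Dee"],
    "Cid": [],
    "Dee": [],
}


def within(start, min_hops, max_hops):
    """Yield every person reached from `start` by min_hops..max_hops MANAGES steps."""
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth >= min_hops:
            yield person
        if depth < max_hops:
            for report in MANAGES[person]:
                queue.append((report, depth + 1))


def report_totals():
    """Count reports per (boss, sub) pair, like count(report) grouped by Boss, Subordinate."""
    totals = Counter()
    for boss in MANAGES:
        for sub in within(boss, 0, 3):         # (boss)-[:MANAGES*0..3]->(sub)
            for _report in within(sub, 1, 3):  # (sub)-[:MANAGES*1..3]->(report)
                totals[(boss, sub)] += 1
    return dict(totals)


totals = report_totals()
```

Pairs whose `sub` has no reports produce no rows at all, just as the second pattern in the MATCH clause would fail to match.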
Multi-graph Cypher Queries
SQL
UNWIND corporate.graphIds() AS gid
CALL {
  USE corporate.graph( gid )
  MATCH (boss)-[:MANAGES*0..3]->(sub),
        (sub)-[:MANAGES*1..3]->(report)
  RETURN boss.name AS Boss,
         sub.name AS Subordinate,
         count(report) AS Total
}
RETURN Boss, Subordinate, Total ORDER BY Total
Cypher in Neo4j 4.0
• Executes queries in parallel on multiple databases, combining or aggregating results.
• Chains queries together from multiple databases for sophisticated real-time analyses.
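The fan-out-and-combine behaviour described above can be sketched in plain Python. This is only an illustration of the pattern, not how Fabric is implemented: the `SHARDS` dictionary stands in for the databases behind `corporate.graph(gid)`, the rows are made up, and a thread pool stands in for parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the databases behind corporate.graph(gid)
SHARDS = {
    0: [("Ann", "Bob", 4), ("Ann", "Ann", 9)],
    1: [("Raj", "Raj", 2)],
    2: [("Mia", "Lee", 6)],
}


def query_shard(gid):
    """Pretend to run the inner MATCH ... RETURN on one shard."""
    return SHARDS[gid]


def fabric_query():
    """Run the per-shard query in parallel, then merge and ORDER BY Total."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        per_shard = list(pool.map(query_shard, sorted(SHARDS)))
    rows = [row for shard_rows in per_shard for row in shard_rows]
    return sorted(rows, key=lambda row: row[2])


rows = fabric_query()
```

The final sort happens once, after the per-shard results have been merged, which mirrors the outer `RETURN ... ORDER BY Total` sitting outside the `CALL { ... }` block.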
How will this help a Telco to scale?
The foundation: Causal Cluster
The evolution: Fabric
Large Data Volumes: CDRs, Network Metrics, Customer Metrics
Scaling R/W access
Diagram: a user and the Neo4j DBMS; Cluster A (Core 1, Core 2, Core 3, Replica 1, Replica 2); Cluster B (Core 1, Core 2, Core 3); Network Metrics shards NM1, NM2, NM3.
Neo4j 4.0 Scalability in Action
Sharding the LDBC Social Network Benchmark
http://ldbcouncil.org/developer/snb and https://neo4j.com/fosdem20
Data Model
• 1 shard for the Persons graph
• N shards for the Forums graph
Up to 300x reduced latency
Up to 10x performance improvement
Scalability → Security?
Security and Data Privacy
• Based on Role-based Access Control for graphs
• Restrictions on what data can be seen by different users, applied to all database interactions
• Implicit security view of the data for each user through schema-based security definitions
• Grant/Deny permissions to traverse, read or write data based on node labels, relationship types or database and property names
• Security rules are replicated across the cluster via roles that are associated with the users
Diagram labels: Bob, Joe; Baseline_Personnel_Security_Standard, Security_Check, Counter_Terrorism_Check, Developed_Vetting
Security and Data Privacy in Practice
Constraints
• Call Centre Agent:
-> needs Doctor’s name
-> not allowed to read diagnosis
• Doctor:
-> ability to view patient records and
-> ability to view patient diagnoses
Security Config
// Doctors get wide-ranging access
GRANT ACCESS ON DATABASE healthcare TO doctor;
GRANT TRAVERSE {*} ON GRAPH healthcare TO doctor;
GRANT READ {*} ON GRAPH healthcare TO doctor;
GRANT WRITE ON GRAPH healthcare TO doctor;
// Agents get narrower access: they may traverse everything but read only names
// (property keys are case-sensitive, so the key matches the p.name / d.name used in queries)
GRANT ACCESS ON DATABASE healthcare TO agent;
GRANT TRAVERSE {*} ON GRAPH healthcare TO agent;
GRANT READ {name} ON GRAPH healthcare NODES Doctor TO agent;
GRANT READ {name} ON GRAPH healthcare NODES Patient TO agent;
Call Centre Agent
MATCH (:CallcenterAgent {name: 'Alice'})
<-[:CALLED]-(p:Patient)-[:HAS_DIAGNOSIS]-(dia)
<-[:ESTABLISHED]-(d:Doctor)
RETURN p.name, d.name, dia.name;
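The effect of these grants on the query can be sketched in Python. This is a toy model, not Neo4j's security implementation: it assumes that a property a role may not read comes back as null (`None` here), that the property key is written `name` as in the query, and that diagnosis nodes carry a `Diagnosis` label, which the slides do not state. `READ_GRANTS`, `read_property`, `run_query` and the sample values are hypothetical.

```python
# READ privileges per role as (label, property) pairs; "*" matches anything.
# The Diagnosis label is an assumption; the slides do not name it.
READ_GRANTS = {
    "doctor": {("*", "*")},
    "agent": {("Doctor", "name"), ("Patient", "name")},
}


def read_property(role, label, properties, key):
    """Return the property value if `role` may read it, otherwise None."""
    grants = READ_GRANTS.get(role, set())
    allowed = any(
        g_label in ("*", label) and g_key in ("*", key)
        for g_label, g_key in grants
    )
    return properties.get(key) if allowed else None


def run_query(role):
    """Mimic RETURN p.name, d.name, dia.name for one matched path."""
    patient = ("Patient", {"name": "Jack"})
    doctor = ("Doctor", {"name": "Dr. Lee"})
    diagnosis = ("Diagnosis", {"name": "Flu"})
    return tuple(
        read_property(role, label, props, "name")
        for label, props in (patient, doctor, diagnosis)
    )


agent_row = run_query("agent")
doctor_row = run_query("doctor")
```

Because the agent holds TRAVERSE on the whole graph, the path through the diagnosis node still matches; only the diagnosis value itself is withheld.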
Reactive Architecture Neo4j 4.0
• Flow control throughout the stack, allowing the client application to fully control the production and flow of records within a result
• Synchronous/Asynchronous execution
• Based on Reactive Streams with non-blocking backpressure
• Client applications can pull or discard the whole result or N elements
• Results can also be gracefully cancelled
• Exposed through a reactive API in Drivers v4.0
• Use Cases:
  • Long queries with large result sets
  • Paged results
  • Thin/small clients
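The pull, discard and cancel behaviour listed above comes down to demand-driven production: records are produced only when the client asks for them. The sketch below shows that idea with a plain Python generator. It is not the reactive API of the Neo4j drivers; `RecordStream`, `pull` and `cancel` are hypothetical names.

```python
from itertools import islice


class RecordStream:
    """Produces records lazily, only when the client asks for them."""

    def __init__(self, total):
        self.total = total
        self.produced = 0
        self._records = self._generate()

    def _generate(self):
        for i in range(self.total):
            self.produced += 1
            yield {"id": i}

    def pull(self, n):
        """Client requests the next n records."""
        return list(islice(self._records, n))

    def cancel(self):
        """Client stops the stream; nothing further is produced."""
        self._records.close()


stream = RecordStream(total=1_000_000)
first_page = stream.pull(3)
second_page = stream.pull(2)
stream.cancel()
```

Although the result could hold a million records, only the five that were requested are ever produced, which is what makes paged results and thin clients practical.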
Graph Data Science
Science-driven approach to gain knowledge from the relationships and structures in data, typically to power predictions. Uses multi-disciplinary workflows that may include queries, statistics, algorithms and machine learning.
Graph Recipes & Analytics
Answers specific questions to gain insights from connections in existing/historical data. Approaches typically include global queries and algorithms and direct use of results.
Graph Enhanced ML & AI
Training models (ML) with graph structured data to be used to emulate human, probabilistic decisions within a solution/application (AI system).
The Graph Data Science Library
Optimized for Analytics
- Leverage custom data structures optimized for global traversals and aggregation
- Flexibly decompose and reshape your graph for specific use cases
Algorithms for Insights
- Robust algorithms that are highly parallelized and scale to billions of nodes
- Early access to dozens of experimental implementations
Intuitive Interface
- Drastically simplified and standardized API that enables custom configurations
- Documentation, training, and examples so getting started is simple
Product Supported & Under Active Development
Graph Data Science
Analytics projections:
- Specialized data structure for algorithms, capable of supporting billions of nodes
- Cypher loaders for experimentation
- Quickly reshape, combine, aggregate, and deduplicate your transactional data
- Support for multiple node labels, relationship types, and properties
- Manage multiple in-memory analytics graphs for different workloads
- Memory footprint that allows large-scale use
Graph algorithms & more:
- 40+ algorithms in 5 categories: community, centrality, similarity, pathfinding, and link prediction
- Helper algorithms like graph generation, one-hot encoding, and random walk
- Early previews of new implementations in the alpha and beta namespaces
- Supported, scalable algorithms include seeding, determinism, and incremental calculations
- Estimate mode for memory requirements
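As a rough illustration of the two halves above, the following Python sketch first builds an in-memory projection that aggregates parallel relationships into one weighted edge, then runs a simple centrality measure on it. It does not use the Graph Data Science library; the call records and the `project` and `out_degree` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, seconds);
# repeated pairs play the role of parallel relationships.
CALLS = [
    ("alice", "bob", 60),
    ("alice", "bob", 30),
    ("bob", "carol", 10),
    ("alice", "carol", 5),
    ("dave", "alice", 45),
]


def project(calls):
    """Collapse parallel calls into one weighted edge per (caller, callee)."""
    graph = defaultdict(dict)
    for caller, callee, seconds in calls:
        graph[caller][callee] = graph[caller].get(callee, 0) + seconds
        graph.setdefault(callee, {})  # keep nodes that only receive calls
    return dict(graph)


def out_degree(graph):
    """Degree centrality: number of distinct people each node calls."""
    return {node: len(targets) for node, targets in graph.items()}


graph = project(CALLS)
degrees = out_degree(graph)
```

Running the algorithm on the aggregated projection rather than on the raw records is the design point: the two calls from alice to bob count as one connection with a combined weight.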
Graph Data Science Algorithms
A subset of data science algorithms that come from network science, Graph Algorithms enable reasoning about network structure.
Generally unsupervised.
Categories: Pathfinding and Search; Centrality (Importance); Community Detection; Heuristic Link Prediction; Similarity
Conclusions
• Neo4j provides
  • Scalability for Telcos
  • Carrier-grade high availability with Causal Cluster
  • Security features to fulfill privacy requirements
  • Graph Analytics to provide Data Science infrastructure for Telcos
