BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Ingesting streaming data into Graph
Database
Guido Schmutz
San Francisco, 22.3.2018
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Agenda
1. Introduction
2. DataStax (DSE) Graph
3. Use-Case 1 - Batch Data Ingestion of Social Graph
4. Use-Case 2 - Streaming Ingestion of Social Graph
5. Use-Case 3 - Streaming Ingestion into RDF Graph
6. Summary
Introduction
Customer Context – Big Data Lab

[Architecture overview] Sources (Location, Social, Clickstream, Sensor Data, Weather Data, Billing & Ordering, CRM / Profile, Marketing Campaigns, Call Center, Mobile Apps) feed the Big Data Cluster (Hadoop) via Event Hubs, Data Flows, Change Data Capture and File / SQL Import. The cluster runs Parallel Processing over Raw / Refined Storage and Results, serving Batch Analytics and Stream Analytics. Results are exposed through NoSQL, SQL, Search Service and Reference / Models to Dashboards, BI Tools, the Enterprise Data Warehouse (via SQL Export), Search / Explore and Online & Mobile Apps.
Map of a Twitter Status object
Know your domain and choose the right data store!
Connectedness of data: low → high
Key-Value Stores → Wide-Column Stores → Document Data Stores → Relational Databases → Graph Databases
DataStax (DSE) Graph
DataStax Enterprise Edition – DSE Graph
A scale-out property graph database
Store and find relationships in large graphs quickly and easily.
• Adopts Apache TinkerPop standards for data
and traversal
• Uses Apache Cassandra for scalable storage and
retrieval
• Leverages Apache Solr for full-text search and
indexing
• Integrates Apache Spark for fast analytic traversal
• Supports data security
Images: DataStax
Lab cluster setup
Two local data centers for workload
segregation
DC1
– 9 nodes
– OLTP workload incl. Search
DC2
– 3 nodes
– OLAP workload with Spark
Social Network Graph Model
Creating the Graph Schema (I)
Define types of properties to be used with Vertices and Edges
Define types of vertices to be used
Define types of edges to be used
schema.propertyKey("id").Long().single().create();
schema.propertyKey("name").Text().create();
schema.propertyKey("sources").Text().multiple().create();

schema.vertexLabel("twitterUser").
  properties("id","name","language","verified","createdAt","sources").create();
schema.vertexLabel("tweet").
  properties("id","text","targetIds","timestamp","language").create();

schema.edgeLabel("publishes").
  multiple().properties("timestamp").
  connection("twitterUser","tweet").create();
TwitterUser (id, name (idx), ..., sources*) -publishes (0..*) {timestamp (idx)}-> Tweet (id, text (idx), timestamp, language)
Creating the Graph Schema (II)
Define global materialized index on vertex (good for high selectivity)
Defining global full-text search index (SolR) on text of Tweet
Define vertex-centric index on edge
schema.vertexLabel("twitterUser").
  index("twitterUserNameIdx").
  materialized().by("name").ifNotExists().add()

schema.vertexLabel("tweet").
  index("search").
  search().by("text").asText().ifNotExists().add()

schema.vertexLabel("twitterUser").
  index("usesAtTimestamp").
  outE("publishes").by("timestamp").ifNotExists().add()
Creating Vertex Instances
Adding a new TwitterUser vertex
Adding a new Tweet vertex
Vertex u = graph.addVertex("twitterUser");
u.property("id","1323232");
u.property("name","realDonaldTrump");
u.property("language","en");
u.property("sources","tweet");
Twitter
User
userId: 1323232
name: realDonaldTrump
language: en
sources: tweet
Vertex m = graph.addVertex(label, "tweet",
"id", "1213232323",
"text", "I called President Putin of
Russia to congratulate him on his election ...",
"timestamp", "2018-03-21",
"language", "en");
Tweet
id: 1213232323
text: I called President …
timestamp: 2018-03-21
language: en
Creating Edges between Vertices
Adding an edge between the TwitterUser and the Tweet vertex
Vertex u = graph.addVertex("twitterUser", ...);
Vertex t = graph.addVertex("tweet", ...);
Edge e = u.addEdge("publishes", t);
e.property("timestamp", "2018-03-21");
Twitter
User
Tweet
publishes
timestamp:
2018-03-21
userId: 1323232
name: realDonaldTrump
language: en
sources: tweet
id: 1213232323
text: I called President …
timestamp: 2018-03-21
language: en
Use Case 1 – Batch Data Ingestion
of Social Graph
Batch Data Ingestion of Social Graph
Transformation DSE Graph
Avro on
HDFS
CSV on
HDFS
DSE
GraphLoader
CSV local
"712181117131034624","1458632184000","en","","","@crouchybro Hi, I
work for @NBCNews. Are you safe? If you feel comfortable, would you
DM me?","","","257242054","crouchybro","168379717"
Extracting and Preparing Data through Zeppelin
Batch Ingestion with DSE Loader
• Simplifies loading of data from various sources into DSE Graph efficiently
• Inspects incoming data for schema compliance
• Uses declarative data mappings and custom transformations to handle data
• Loads all vertices first and caches them, then loads edges and properties
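The vertices-first strategy can be sketched in plain Python (names and structure are illustrative only; the DSE Graph Loader's actual internals differ):

```python
# Sketch of a two-phase bulk load: upsert all vertices into a cache first,
# then resolve edges against that cache. Purely illustrative.
def bulk_load(records, edges):
    vertex_cache = {}                       # (label, key) -> vertex dict
    for label, key, props in records:
        v = vertex_cache.setdefault((label, key), {"label": label, "id": key})
        v.update(props)                     # idempotent: re-loading merges properties
    edge_list = []
    for out_key, edge_label, in_key in edges:
        # both endpoints are guaranteed to exist after phase 1
        edge_list.append((vertex_cache[out_key], edge_label, vertex_cache[in_key]))
    return vertex_cache, edge_list
```

Loading vertices before edges means an edge never references a missing endpoint, at the cost of keeping the vertex cache in memory.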
Graph
Loader
Data Mappings
Batch Loading
Stream Ingestion (planned)
RDBMS JSON
DSE Graph
CSV
Batch Ingestion with DSE Loader – Example
publishes
Place
Twitter
User
Tweet
uses
tweetInput = File.text("tweet.txt")
.delimiter('","')
.header('id','timestamp','language','placeId','geoPointId','text',
'retweetedId','replyToId','replyToUserId','replyToScreenName',
'publisherUserId')
tweetInput = tweetInput.transform { it['replyToId'] =
it['replyToId'].equals('-1') ? '' : it['replyToId'];
it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')};
it }
load(tweetInput).asVertices {
isNew()
label "tweet"
key "id"
ignore "placeId"
ignore "publisherUserId"
...
outV "placeId", "uses", {
label "place"
key "id"
}
...
inV "publisherUserId", "publishes", {
label "twitterUser"
key "id"
}}
Transform
Extract
Load
"712181117131034624","1458632184000","en","","","@crouchybro
Hi, I work for @NBCNews. Are you safe? If you feel comfortable,
would you DM me?","","","257242054","crouchybro","168379717"
...
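The transform step above strips the surrounding double quotes from every field and maps the '-1' sentinel in replyToId to an empty value. The same logic in Python, for illustration:

```python
import re

def strip_quotes(row):
    """Remove a leading and/or trailing double quote from every field,
    mirroring the Groovy replaceAll('^"|"$', '') in the loader script."""
    return {k: re.sub(r'^"|"$', '', v) for k, v in row.items()}

def normalize_reply_to(row):
    # the loader maps the sentinel '-1' to an empty value before loading
    if row.get('replyToId') == '-1':
        row['replyToId'] = ''
    return row
```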
Batch Ingestion with DSE Loader – Example
mentions
Twitter
User
Tweet
tweetMentionsInput = File.text("tweet_mentions_user.txt")
.delimiter('","')
.header('tweetId','userId','screenName','timestamp')
tweetMentionsInput = tweetMentionsInput.transform {
it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')};
it
}
load(tweetMentionsInput).asEdges {
label "mentions"
ignore "screenName"
outV "tweetId", {
label "tweet"
key "tweetId":"id"
}
inV "userId", {
label "twitterUser"
key "id"
}
}
graphloader load-sma-graph.groovy -graph sma_graph_prod_v10 -address
192.168.69.135 -dryrun false -preparation false -abort_on_num_failures
10000 -abort_on_prep_errors false -vertex_complete false
Performance Tests DSE Graph Loader
ID strategy  dataset   vertex threads  edge threads  number of tweets  time (sec/min)  tweets per sec
Gen          1-day-14   0               0             212'116           541 / 9.01        392
Gen          1-day-14   4              12             212'116           419 / 6.93        506
Gen          1-day-14  10              10             212'116           334 / 5.56        635
Gen          1-day-14  40              40             212'116           170 / 2.83       1247
Gen          1-day-14  60              80             212'116           118 / 1.96       1797
Cust         1-day-14  60              80             212'116           101 / 1.68       2100
Gen          1-mth-14  60              80           7'403'667          3888 / 64.8      1904
Gen          1-mth-14  60             120           7'403'667          3685 / 61.4      2009
Cust         1-mth-14  60             120           7'403'667          3315 / 55.25     2233
Gen          1-mth-14  80             120           7'403'667          4005 / 66.75     1848
Cust         1-mth-14  80             120           7'403'667          3598 / 59.69     2057
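The tweets-per-second column is simply the tweet count divided by the elapsed seconds (truncated), which is easy to sanity-check:

```python
def throughput(tweets, seconds):
    # tweets ingested per second, truncated to an integer as in the table above
    return tweets // seconds
```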
Use Case 2 – Streaming Ingestion
of Social Graph
Streaming Ingestion of Social Network Graph
Twitter
Tweet-to-Graph
DSE Graph
Tweet
Profile-to-Graph
Follower-to-
Graph
Elasticsearch
Tweet-to-ES
Profile
Follower
YouTube-to-
Graph
YouTube Video
YouTube
Apache Kafka – A Streaming Platform
High-Level Architecture
Distributed Log at the Core
Scale-Out Architecture
Logs do not (necessarily) forget
Graph Ingestion through DataStax Java Driver
Similar to programming with JDBC: you build your Gremlin statement in a String variable and execute it => just as error-prone as with JDBC ;-)
A fluent Java-based API is available as of DSE 5.1.
String stmt = ""
  + "Vertex v" + "\n"
  + "GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n"
  + "if (!gt.hasNext()) {" + "\n"
  + "  v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n"
  + "} else {" + "\n" ...

Map<String,Object> map = ImmutableMap.<String,Object>of("vertexLabel", vertexLabel,
    "propertyKey", propertyKey, "propertyKeyValue", propertyKeyValue);

vertex = session.executeGraph(new SimpleGraphStatement(stmt)
    .set("b",
        ImmutableMap.<String,Object>builder().putAll(map).putAll(propertyParams.getRight()).build()
    )).one().asVertex();
Tweet-to-Graph
Add information for new Event (tweet) to Graph – what it
really means ….
1. Create new Tweet vertex and add
retweets/replies edges
2. Create/Update TwitterUser vertex
3. Create new publishes edge
4. For each Mention, Create/Update
TwitterUser vertex
5. For each Mention, Create mentions edge
6. For each Link, Create/Update Url vertex
7. For each Link, Create uses edge
8. For each HashTag, Create/Update Term
vertex
9. For each HashTag, Create contains edge
10. ….
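The per-event fan-out listed above can be sketched as a single Python function that turns one tweet event into a list of upsert operations (the structure is illustrative only; the real code issues one Gremlin statement per operation):

```python
def tweet_to_operations(event):
    """Expand one tweet event into the vertex/edge upserts listed above.
    Field names (user, mentions, urls, hashtags) are illustrative."""
    ops = [("upsert_vertex", "tweet", event["id"]),
           ("upsert_vertex", "twitterUser", event["user"]["id"]),
           ("add_edge", "publishes", event["user"]["id"], event["id"])]
    for user_id in event.get("mentions", []):
        ops.append(("upsert_vertex", "twitterUser", user_id))
        ops.append(("add_edge", "mentions", event["id"], user_id))
    for url in event.get("urls", []):
        ops.append(("upsert_vertex", "url", url))
        ops.append(("add_edge", "uses", event["id"], url))
    for tag in event.get("hashtags", []):
        ops.append(("upsert_vertex", "term", tag))
        ops.append(("add_edge", "contains", event["id"], tag))
    return ops
```

A tweet with many mentions, links and hashtags fans out into a long list of round trips, which is exactly why the single-insert approach below turns out to be slow.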
Could we solve it by modeling differently?
Tweet (text: The point of Trump's now #Fake steel …) -contains-> Term (name (idx): #fake)

vs

Tweet (text: the point of Trump's now #Fake steel …, terms: [fake], ….)
Single Insert into Social Network Graph
DSE Graph
Tweet-to-Graph
Tweet
User
publishes
Term *
uses*
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("name",b.propertyParam0)
v.property("language",b.propertyParam1)
v.property("verified",b.propertyParam2)
Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if
(!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if
(!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("timestamp",b.propertyParam0)
v.property("language",b.propertyParam1)
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("type",b.propertyParam0)
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("type",b.propertyParam0)
Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if
(!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
2nd approach – "Fan-out ingestion driven by model"
1st approach => too slow
2nd approach => too difficult to implement and maintain
Tweet_to_
Graph
Tweet_to_
Graph
User_to_
Graph
URL_to_
Graph
URL_to_
Graph
Term_to_
Graph
Geo_to_
Graph
Tweet
Tweet
3rd approach - Batch Insert into Social Network Graph
user = g.V().has('twitterUser', 'id', bUser.id).tryNext().orElseGet { g.addV('twitterUser').property('id', bUser.id).next() }
user.property('name',bUser.name)
user.property('language',bUser.language)
user.property('verified',bUser.verified)
tweet = g.V().has('tweet', 'id', bTweet.id).tryNext().orElseGet { g.addV('tweet').property('id', bTweet.id).next() }
tweet.property('timestamp',bTweet.timestamp)
tweet.property('language',bTweet.language)
if (!g.V(user).out('publishes').hasId(tweet.id()).hasNext()) {
publishes = g.V(user).as('f').V(tweet).as('t').addE('publishes').from('f').next()
}
i = 0
Vertex[] term = new Vertex[10]
for (Object keyValue : bTerm.propertyKeyValues) {
term[i] = g.V().has('term', 'name', keyValue).tryNext().orElseGet {g.addV('term').property('name', keyValue).next() }
Map<String,Object> params = bTerm.params[i]
if (params != null)
for (String key : params.keySet()) {
term[i].property(key, params.get(key))
}
i++
}
Edge[] usesTerm = new Edge[10]
for (i = 0; i < bUsesTerm.count; i++) {
if (!g.V(tweet).out('uses').hasId(term[i].id()).hasNext()) {
usesTerm[i] = g.V(tweet).as('f').V(term[i]).as('t').addE('uses').from('f').next()
}
Map<String,Object> params = bUsesTerm.params[i]
if (params != null)
for (String key : params.keySet()) {
usesTerm[i].property(key, params.get(key))
}
}
return nof
DSE Graph
Tweet-to-Graph
…
...
Benchmark Single vs. Scripted Insert
Test Case                           Single Inserts     Scripted (Batch) Insert
10 tweets with 10 hashtags each     569ms – 690ms      132ms – 172ms
10 tweets with 30 hashtags each     1311ms – 1396ms    285ms – 364ms
20 tweets with 30 hashtags each     2734ms – 3385ms    640ms – 767ms
20 tweets with 40 hashtags each     3576ms – 4089ms    778ms – 992ms
100 tweets with 10 hashtags each    6340ms – 7171ms    1673ms – 2428ms
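Taking the midpoints of the measured ranges, the scripted (batch) insert comes out roughly 3x–4.5x faster than single inserts across the test cases; a quick sanity check:

```python
def speedup(single_ms, batch_ms):
    # ratio of midpoints of the measured (low, high) ranges
    mid = lambda lo_hi: sum(lo_hi) / 2
    return mid(single_ms) / mid(batch_ms)
```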
Process CDC (Change Data Capture) Events
CDC
Customer-to-
Cassandra
Wide-Row
/ Search
Customer
Event
Graph
Customer-to-
Graph
Customer CDC
Address CDC
Contact CDC
Aggregate-to-
Customer
DSE Cluster
• Partner: Id, FirstName, LastName, BirthDate, ProfileImageURL
• Contacts: Id, CustomerId, When, Where, Type, Description
• Address: Id, PartnerId, Street, StreetNr, ZipCode, City, Country
• Products: Id, Name, Price, Description, ImageURL
• Order: CustomerId, ProductId
• Customer Service: Id, CustomerId, When, Where, InteractionType, Description
• Product Reviews: Id, ProductId, Comment, UserId
• Customer: Id, PartnerId, TwitterId, FacebookId
Use Case 3: Streaming Ingestion
into RDF Graph
Semantic Layer 3 – streaming update of Triple store
Bulk Move /
Event
Bulk
Import
Integration
Bulk
Data Flow
Event Hub
Triplification
Event
Data Flow
SPARQL / SQL
Bulk Move /
Event
Semantic
Layer
Query
Federation / LOD
Consumer
Master Data /
Contr. Vocabulary
Reasoning
Meta Data
Instance Data
SPARQL
RDF CRUD
SQL SELECT
Internal Data Sources
CSV RDBMS XML JSON SPARQL
External Data Sources
Data Stream
CSV XML JSON SPARQL Data Stream
Data Lake Platform
Object Store Distributed FS NoSQL MPP RDBMS
Query Execution ETL
Big Data Storage
Big Data Compute
NLP Analytics / ML
Data
Visualization
Data Science
Lab
Dashboard
RDF
CRUD
SQL SELECT
Bulk
upload / sync
“Triplification” – 1st option
Using RML (http://rml.io/) and
Carml (https://github.com/carml/carml)
Semantic
Layer
Query
Federation / LOD
Master Data /
Contr. Vocabulary
Reasoning
Meta Data
Instance Data
Integration
Bulk
Data Flow
Event Hub
Triplification
Event
Data Flow
RDF CRUD
RML
Kafka
Kafka Connect
Carml
Stream Sets
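RML declares how source records map to RDF triples; the essence of such a "triplification" step can be imitated in a few lines of Python (a hand-rolled sketch, not the RML or Carml API; all names are illustrative):

```python
def triplify(record, subject_template, predicate_map):
    """Map one flat record to RDF-style (subject, predicate, object) triples.
    subject_template is an IRI template; predicate_map maps source fields
    to predicate IRIs. Empty/missing fields are skipped."""
    subject = subject_template.format(**record)
    return [(subject, predicate, record[field])
            for field, predicate in predicate_map.items()
            if record.get(field) not in (None, "")]
```

In the real pipeline this mapping is declared once in RML and executed by Carml on each event coming off Kafka.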
“Triplification” – 2nd option
Using Jolt Transformation (https://github.com/bazaarvoice/jolt) to
create JSON-LD
Demo: http://jolt-demo.appspot.com/#inputArrayToPrefix
Semantic
Layer
Query
Federation / LOD
Master Data /
Contr. Vocabulary
Reasoning
Meta Data
Instance Data
Integration
Bulk
Data Flow
Event Hub
Triplification
Event
Data Flow
RDF CRUD
Kafka
Kafka Connect
Jolt
Stream Sets
Jolt
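A Jolt "shift" specification remaps JSON keys; its effect on a flat record, plus adding a JSON-LD @context, can be imitated in Python (a minimal sketch only; real Jolt specs support wildcards and nested structures):

```python
def shift(record, spec, context):
    """Rename keys per a Jolt-style 'shift' mapping and prepend a
    JSON-LD @context. Minimal sketch: flat records, exact-match keys."""
    out = {"@context": context}
    for src_key, dst_key in spec.items():
        if src_key in record:
            out[dst_key] = record[src_key]
    return out
```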
Summary – Lessons Learnt
Graph Schema
• Edges cannot cross graph schemas
• Vertex or edge? Vertex or property?
• Custom Vertex IDs can be useful
• idempotent
• faster on insert (no id generation needed)
• can lead to hotspots
Data Ingestion
• Single vertex & edge inserts are slow =>
batch them in a Groovy script to speed things up
• DSE Graph Loader works well for batch
data loading
Technology on its own won't help you.
You need to know how to use it properly.
Structuring Teams and Portfolios for Success
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
FDO for Camera, Sensor and Networking Device – Commercial Solutions from VinC...
 
IESVE for Early Stage Design and Planning
IESVE for Early Stage Design and PlanningIESVE for Early Stage Design and Planning
IESVE for Early Stage Design and Planning
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptxWSO2CONMay2024OpenSourceConferenceDebrief.pptx
WSO2CONMay2024OpenSourceConferenceDebrief.pptx
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 

Ingesting streaming data into Graph Database

  • 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Ingesting streaming data into Graph Database Guido Schmutz San Francisco, 22.3.2018 @gschmutz guidoschmutz.wordpress.com
  • 2. Guido Schmutz Working at Trivadis for more than 21 years Oracle ACE Director for Fusion Middleware and SOA Consultant, Trainer, Software Architect for Java, Oracle, SOA, and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: guido.schmutz@trivadis.com Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz
  • 3. Agenda 1. Introduction 2. DataStax (DSE) Graph 3. Use-Case 1 - Batch Data Ingestion of Social Graph 4. Use-Case 2 - Streaming Ingestion of Social Graph 5. Use-Case 3 - Streaming Ingestion into RDF Graph 6. Summary
  • 5. Hadoop Cluster Hadoop Cluster Big Data Cluster Customer Context – Big Data Lab Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Batch Analytics Streaming Analytics Stream Analytics NoSQL Reference / Models SQL Search Service Dashboard BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data Event Hub Event Hub Event Hub Data Flow Data Flow Data Flow Change Data Capture Parallel Processing Storage Storage Raw Refined Results SQL Export
  • 6. Hadoop Cluster Hadoop Cluster Big Data Cluster Customer Context – Big Data Lab Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Batch Analytics Streaming Analytics Stream Analytics NoSQL Reference / Models SQL Search Service Dashboard BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data Event Hub Event Hub Event Hub Data Flow Data Flow Data Flow Change Data Capture Parallel Processing Storage Storage Raw Refined Results SQL Export
  • 7. Map of a Twitter Status object
  • 8. Map of a Twitter Status object
  • 9. Know your domain and choose the right data store! Connectedness of Data: low – high Document Data Store Key-Value Stores Wide-Column Store Graph Databases Relational Databases
  • 11. DataStax Enterprise Edition – DSE Graph A scale-out property graph database Store & find relationships in data fast and easily in large graphs. • Adopts Apache TinkerPop standards for data and traversal • Uses Apache Cassandra for scalable storage and retrieval • Leverages Apache Solr for full-text search and indexing • Integrates Apache Spark for fast analytic traversal • Supports data security Images: DataStax
  • 12. Lab cluster setup 2 logical data centers for workload segregation DC1 – 9 nodes – OLTP workload incl. Search DC2 – 3 nodes – OLAP workload with Spark
  • 14. Creating the Graph Schema (I) Define types of properties to be used with Vertices and Edges Define types of vertices to be used Define types of edges to be used schema.vertexLabel("twitterUser"). properties("id","name","language","verified","createdAt","sources").create(); schema.vertexLabel("tweet"). properties("id","text","targetIds","timestamp","language").create() schema.propertyKey("id").Long().single().create(); schema.propertyKey("name").Text().create(); schema.propertyKey("sources").Text().multiple().create(); schema.edgeLabel("publishes").multiple().properties("timestamp"). connection("twitterUser","tweet").create(); Twitter User Tweet id name (idx) ... sources* id text (idx) timestamp language publishes (0..*) timestamp (idx)
  • 15. Creating the Graph Schema (II) Define global materialized index on vertex (good for high selectivity) Define global full-text search index (Solr) on text of Tweet Define vertex-centric index on edge schema.vertexLabel("twitterUser"). index("usesAtTimestamp"). outE("publishes").by("timestamp").ifNotExists().add() Twitter User Tweet id name (idx) ... sources* id text (idx) timestamp language publishes (0..*) timestamp (idx) schema.vertexLabel("twitterUser"). index('twitterUserNameIdx'). materialized().by('name').ifNotExists().add() schema.vertexLabel("tweet"). index("search"). search().by("text").asText().ifNotExists().add()
  • 16. Creating Vertex Instances Adding a new TwitterUser vertex Adding a new Tweet vertex Vertex u = graph.addVertex("twitterUser"); u.property("id","1323232"); u.property("name","realDonaldTrump"); u.property("language","en"); u.property("sources","tweet"); Twitter User userId: 1323232 name: realDonaldTrump language: en sources: tweet Vertex m = graph.addVertex(label, "tweet", "id", "1213232323", "text", "I called President Putin of Russia to congratulate him on his election ...", "timestamp", "2018-03-21", "language", "en"); Tweet id: 1213232323 text: I called President … timestamp: 2018-03-21 language: en
  • 17. Creating Edges between Vertices Adding an edge between the TwitterUser and the Tweet vertex Vertex u = graph.addVertex("twitterUser", ...); Vertex t = graph.addVertex("tweet", ...); Edge e = u.addEdge("publishes", t); e.property("timestamp", "2018-03-21"); Twitter User Tweet publishes timestamp: 2018-03-21 userId: 1323232 name: realDonaldTrump language: en sources: tweet id: 1213232323 text: I called President … timestamp: 2018-03-21 language: en
  • 18. Use Case 1 – Batch Data Ingestion of Social Graph
  • 19. Batch Data Ingestion of Social Graph Transformation DSE Graph Avro on HDFS CSV on HDFS DSE GraphLoader CSV local "712181117131034624","1458632184000","en","","","@crouchybro Hi, I work for @NBCNews. Are you safe? If you feel comfortable, would you DM me?","","","257242054","crouchybro","168379717"
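The raw CSV export shown above separates fields with a '","' sequence and leaves stray leading/trailing quotes on the first and last field, which the loader's transform strips. A minimal Python sketch of that parsing (the helper name and this standalone reimplementation are hypothetical — the deck itself does this inside the DSE Graph Loader mapping script):

```python
import re

# Column order taken from the loader mapping shown later in the deck.
HEADER = ['id', 'timestamp', 'language', 'placeId', 'geoPointId', 'text',
          'retweetedId', 'replyToId', 'replyToUserId', 'replyToScreenName',
          'publisherUserId']

def parse_tweet_line(line):
    # Split on the '","' field separator, then strip leading/trailing quotes
    # (mirrors the replaceAll('^"|"$', '') transform in the loader script).
    fields = [re.sub(r'^"|"$', '', f) for f in line.split('","')]
    record = dict(zip(HEADER, fields))
    # The loader also maps a replyToId of '-1' to an empty value.
    if record['replyToId'] == '-1':
        record['replyToId'] = ''
    return record

line = ('"712181117131034624","1458632184000","en","","","@crouchybro Hi, '
        'I work for @NBCNews. Are you safe? If you feel comfortable, would '
        'you DM me?","","","257242054","crouchybro","168379717"')
tweet = parse_tweet_line(line)
```

Splitting on the full '","' sequence, rather than on bare commas, is what keeps commas inside the tweet text from breaking the record apart.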
  • 20. Extracting and Preparing Data through Zeppelin
  • 21. Batch Ingestion with DSE Loader • Simplifies loading of data from various sources into DSE Graph efficiently • Inspects incoming data for schema compliance • Uses declarative data mappings and custom transformations to handle data • Loads all vertices first and stores them in a cache, then edges and properties Graph Loader Data Mappings Batch Loading Stream Ingestion (planned) RDBMS JSON DSE Graph CSV
  • 22. Batch Ingestion with DSE Loader – Example publishes Place Twitter User Tweet uses tweetInput = File.text("tweet.txt") .delimiter('","') .header('id','timestamp','language','placeId','geoPointId','text', 'retweetedId','replyToId','replyToUserId','replyToScreenName', 'publisherUserId') tweetInput = tweetInput.transform { it['replyToId'] = it['replyToId'].equals('-1') ? '' : it['replyToId']; it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')}; it } load(tweetInput).asVertices { isNew() label "tweet" key "id" ignore "placeId" ignore "publisherUserId" ... outV "placeId", "uses", { label "place" key "id" } ... inV "publisherUserId", "publishes", { label "twitterUser" key "id" }} Transform Extract Load "712181117131034624","1458632184000","en","","","@crouchybro Hi, I work for @NBCNews. Are you safe? If you feel comfortable, would you DM me?","","","257242054","crouchybro","168379717" ...
  • 23. Batch Ingestion with DSE Loader – Example mentions Twitter User Tweet tweetMentionsInput = File.text("tweet_mentions_user.txt") .delimiter('","') .header('tweetId','userId','screenName','timestamp') tweetMentionsInput = tweetMentionsInput.transform { it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')}; it } load(tweetMentionsInput).asEdges { label "mentions" ignore "screenName" outV "tweetId", { label "tweet" key "tweetId":"id" } inV "userId", { label "twitterUser" key "id" } } graphloader load-sma-graph.groovy -graph sma_graph_prod_v10 -address 192.168.69.135 -dryrun false -preparation false -abort_on_num_failures 10000 -abort_on_prep_errors false -vertex_complete false
  • 24. Performance Tests DSE Graph Loader ID strategy dataset vertex threads edge threads number of tweets time sec/min tweets per sec Gen 1-day-14 0 0 212’116 541/9.01 392 Gen 1-day-14 4 12 212’116 419/6.93 506 Gen 1-day-14 10 10 212’116 334/5.56 635 Gen 1-day-14 40 40 212’116 170/2.83 1247 Gen 1-day-14 60 80 212’116 118/1.96 1797 Cust 1-day-14 60 80 212’116 101/1.68 2100 Gen 1-mth-14 60 80 7’403’667 3888/64.8 1904 Gen 1-mth-14 60 120 7’403’667 3685/61.4 2009 Cust 1-mth-14 60 120 7’403’667 3315/55.25 2233 Gen 1-mth-14 80 120 7’403’667 4005/66.75 1848 Cust 1-mth-14 80 120 7’403’667 3598/59.69 2057
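The "tweets per sec" column in the table above is simply the tweet count divided by the load time in seconds (truncated); a quick illustrative check for three of the rows:

```python
# Recompute "tweets per sec" for three rows of the performance table:
# (number_of_tweets, load_time_seconds) -> tweets per second (truncated).
rows = [
    (212_116, 541),     # Gen, 1-day-14, 0/0 threads
    (212_116, 118),     # Gen, 1-day-14, 60/80 threads
    (7_403_667, 3315),  # Cust, 1-mth-14, 60/120 threads
]
throughput = [int(n / t) for n, t in rows]
```

The spread from 392 to well over 2000 tweets/sec shows how much the vertex/edge thread counts and the custom-ID strategy matter.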
  • 25. Use Case 2 – Streaming Ingestion of Social Graph
  • 26. Streaming Ingestion of Social Network Graph Twitter Tweet-to-Graph DSE Graph Tweet Profile-to-Graph Follower-to-Graph Elasticsearch Tweet-to-ES Profile Follower YouTube-to-Graph YouTube Video YouTube
  • 27. Apache Kafka – A Streaming Platform High-Level Architecture Distributed Log at the Core Scale-Out Architecture Logs do not (necessarily) forget
  • 28. Graph Ingestion through DataStax Java Driver Similar to programming with JDBC: execute your gremlin statement by creating it in a String variable => also as error-prone as with JDBC ;-) a fluent Java-based API is available as of DSE 5.1 String stmt = "" + "Vertex v" + "\n" + "GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n" + "if (!gt.hasNext()) {" + "\n" + " v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n" + "} else {" + "\n" ... Map<String,Object> map = ImmutableMap.<String,Object>of("vertexLabel", vertexLabel, "propertyKey", propertyKey, "propertyKeyValue", propertyKeyValue); vertex = session.executeGraph(new SimpleGraphStatement(stmt) .set("b" , ImmutableMap.<String,Object>builder().putAll(map).putAll(propertyParams.getRight()).build() )).one().asVertex(); Tweet-to-Graph
  • 29. Add information for new Event (tweet) to Graph – what it really means …. 1. Create new Tweet vertex and add retweets/replies edges 2. Create/Update TwitterUser vertex 3. Create new publishes edge 4. For Each Mention, Create/Update TwitterUser vertex 5. For each Mention, Create mentions edge 6. For each Link, Create/Update Url vertex 7. For each Link, Create uses edge 8. For each HashTag, Create/Update Term vertex 9. For each HashTag, Create contains edge 10. ….
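Every one of these steps is the same "get-or-create, then set properties" (upsert) pattern, which is what makes replaying the same event safe. A minimal in-memory Python sketch of that idempotence (all names and the toy `Graph` class are hypothetical and purely illustrative — the deck does this with Gremlin against DSE Graph):

```python
# Toy in-memory stand-in for the get-or-create pattern used for every
# vertex and edge: replaying the same tweet event must not create duplicates.
class Graph:
    def __init__(self):
        self.vertices = {}   # (label, key) -> property dict
        self.edges = set()   # (from_vertex, edge_label, to_vertex)

    def upsert_vertex(self, label, key, **props):
        v = self.vertices.setdefault((label, key), {})
        v.update(props)          # create if absent, update properties either way
        return (label, key)

    def upsert_edge(self, frm, label, to):
        self.edges.add((frm, label, to))   # set semantics => idempotent

def ingest_tweet(g, tweet):
    user = g.upsert_vertex('twitterUser', tweet['userId'], name=tweet['userName'])
    t = g.upsert_vertex('tweet', tweet['id'], text=tweet['text'])
    g.upsert_edge(user, 'publishes', t)
    for tag in tweet['hashtags']:
        term = g.upsert_vertex('term', tag)
        g.upsert_edge(t, 'uses', term)

g = Graph()
event = {'id': '1', 'userId': 'u1', 'userName': 'realDonaldTrump',
         'text': 'hello #fake', 'hashtags': ['fake']}
ingest_tweet(g, event)
ingest_tweet(g, event)   # replay of the same event: no duplicates
```

The cost the deck measures comes from the fact that each of these upserts is a separate round trip when issued one by one, which motivates the batched Groovy-script approach that follows.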
  • 30. Could we solve it by modeling differently? Tweet Term contains name (idx): #fake text: The point of Trump’s now #Fake steel … Tweet text: the point of Trump’s now #Fake steel … terms: [fake] …. vs
  • 31. Single Insert into Social Network Graph DSE Graph Tweet-to-Graph Tweet User publishes Term * uses* GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("name",b.propertyParam0) v.property("language",b.propertyParam1) v.property("verified",b.propertyParam2) Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if (!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) } Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if (!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) } GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("timestamp",b.propertyParam0) v.property("language",b.propertyParam1) GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("type",b.propertyParam0) GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("type",b.propertyParam0) Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if (!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
  • 32. 2nd approach - ”Fan out ingestion driven by model” 1st approach => too slow 2nd approach => too difficult to implement and maintain Tweet_to_ Graph Tweet_to_ Graph User_to_ Graph URL_to_ Graph URL_to_ Graph Term_to_ Graph Geo_to_ Graph Tweet Tweet
  • 33. 3rd approach - Batch Insert into Social Network Graph user = g.V().has('twitterUser', 'id', bUser.id).tryNext().orElseGet { g.addV('twitterUser').property('id', bUser.id).next() } user.property('name',bUser.name) user.property('language',bUser.language) user.property('verified',bUser.verified) tweet = g.V().has('tweet', 'id', bTweet.id).tryNext().orElseGet { g.addV('tweet').property('id', bTweet.id).next() } tweet.property('timestamp',bTweet.timestamp) tweet.property('language',bTweet.language) if (!g.V(user).out('publishes').hasId(tweet.id()).hasNext()) { publishes = g.V(user).as('f').V(tweet).as('t').addE('publishes').from('f').next() } i = 0 Vertex[] term = new Vertex[10] for (Object keyValue : bTerm.propertyKeyValues) { term[i] = g.V().has('term', 'name', keyValue).tryNext().orElseGet {g.addV('term').property('name', keyValue).next() } Map<String,Object> params = bTerm.params[i] if (params != null) for (String key : params.keySet()) { term[i].property(key, params.get(key)) } i++ } Edge[] usesTerm = new Edge[10] for (i = 0; i < bUsesTerm.count; i++) { if (!g.V(tweet).out('uses').hasId(term[i].id()).hasNext()) { usesTerm[i] = g.V(tweet).as('f').V(term[i]).as('t').addE('uses').from('f').next() } Map<String,Object> params = bUsesTerm.params[i] if (params != null) for (String key : params.keySet()) { usesTerm[i].property(key, params.get(key)) } } return nof DSE Graph Tweet-to-Graph … ...
  • 34. Benchmark Single vs. Scripted Insert Test Case Single Inserts Scripted (Batch) Insert 10 tweets with 10 hashtags each 569ms – 690ms 132ms – 172ms 10 tweets with 30 hashtags each 1311ms – 1396ms 285ms – 364ms 20 tweets with 30 hashtags each 2734ms – 3385ms 640ms – 767ms 20 tweets with 40 hashtags each 3576ms – 4089ms 778ms – 992ms 100 tweets with 10 hashtags each 6340ms – 7171ms 1673ms – 2428ms
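Taking the best-case timing from each row of the benchmark, the scripted (batch) insert comes out roughly 3.5–4.5x faster than single inserts; an illustrative computation:

```python
# Best-case timings per test case from the benchmark table:
# (single_insert_ms, scripted_insert_ms) -> speedup factor.
cases = [(569, 132), (1311, 285), (2734, 640), (3576, 778), (6340, 1673)]
speedups = [single / scripted for single, scripted in cases]
```

The speedup comes from collapsing many per-vertex/per-edge round trips into one server-side Groovy script execution.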
  • 35. Process CDC (Change Data Capture) Events CDC Customer-to- Cassandra Wide-Row / Search Customer Event Graph Customer-to- Graph Customer CDC Address CDC Contact CDC Aggregate-to- Customer DSE Cluster Partner Id FirstName LastName BirthDate ProfileImageURL Id CustomerId When Where Type Description Contacts Address Id PartnerId Street StreetNr ZipCode City Country Id Name Price Description ImageURL Products CustomerId ProductId Order Id CustomerId When Where InteractionType Description Customer Service Id ProductId Comment UserId Product Reviews Id PartnerId TwitterId FacebookId Customer
  • 36. Use Case 3: Streaming Ingestion into RDF Graph
  • 37. Semantic Layer 3 – streaming update of Triple store Bulk Move / Event Bulk Import Integration Bulk Data Flow Event Hub Triplification Event Data Flow SPARQL / SQL Bulk Move / Event Semantic Layer Query Federation / LOD Consumer Master Data / Contr. Vocabulary Reasoning Meta Data Instance Data SPARQL RDF CRUD SQL SELECT Internal Data Sources CSV RDBMSXMLJSON SPARQL External Data Sources Data Stream CSV XMLJSON SPARQL Data Stream Data Lake Platform Object Store Distributed FS NoSQL MPP RDBMS Query ExecutionETL Big Data Storage Big Data Compute NLPAnalytics / ML Data Visualization Data Science Lab Dashboard RDF CRUD SQL SELECT Bulk upload / sync
  • 38. “Triplification” – 1st option Using RML (http://rml.io/) and Carml (https://github.com/carml/carml) Semantic Layer Query Federation / LOD Master Data / Contr. Vocabulary Reasoning Meta Data Instance Data Integration Bulk Data Flow Event Hub Triplification Event Data Flow RDF CRUD RML Kafka Kafka Connect Carml Stream Sets
  • 39. “Triplification” – 2nd option Using Jolt Transformation (https://github.com/bazaarvoice/jolt) to create JSON-LD Demo: http://jolt-demo.appspot.com/#inputArrayToPrefix Semantic Layer Query Federation / LOD Master Data / Contr. Vocabulary Reasoning Meta Data Instance Data Integration Bulk Data Flow Event Hub Triplification Event Data Flow RDF CRUD Kafka Kafka Connect Jolt Stream Sets Jolt
  • 41. Summary - Lessons Learnt Graph Schema • Edges cannot cross graph schemas • Vertex or edge? Vertex or property? • Custom Vertex IDs can be useful • idempotent • faster on insert (no id generation needed) • can lead to hotspots Data Ingestion • Single vertex & edge inserts are slow => batching via Groovy script to speed up • DSE Graph Loader works well for batch data loading
  • 42. Technology on its own won't help you. You need to know how to use it properly.