BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF
HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH
Ingesting streaming data into Graph Database
Guido Schmutz
San Francisco, 22.3.2018
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 21 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer and Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Agenda
1. Introduction
2. DataStax (DSE) Graph
3. Use-Case 1 - Batch Data Ingestion of Social Graph
4. Use-Case 2 - Streaming Ingestion of Social Graph
5. Use-Case 3 - Streaming Ingestion into RDF Graph
6. Summary
Introduction
Customer Context – Big Data Lab

(Architecture diagram) Data sources (Location, Social, Clickstream, Sensor Data, Billing & Ordering, CRM / Profile, Marketing Campaigns, Call Center, Mobile Apps, Weather Data) reach the Big Data Cluster via File Import / SQL Import, Change Data Capture, Event Hubs and Data Flows. The cluster runs Parallel Processing over Raw / Refined Storage and Results. On top sit Batch Analytics, Stream Analytics, NoSQL, Reference / Models, SQL and Search / Service layers, feeding Dashboards, BI Tools, the Enterprise Data Warehouse (via SQL Export), Search / Explore and Online & Mobile Apps.
Map of a Twitter Status object
Know your domain and choose the right data store!

(Diagram) Connectedness of data, low to high: Key-Value Stores, Wide-Column Stores, Document Data Stores, Relational Databases, Graph Databases
DataStax (DSE) Graph
DataStax Enterprise Edition – DSE Graph
A scale-out property graph database
Store and find relationships in data fast and easily,
even in large graphs.
• Adopts Apache TinkerPop standards for data
and traversal
• Uses Apache Cassandra for scalable storage and
retrieval
• Leverages Apache Solr for full-text search and
indexing
• Integrates Apache Spark for fast analytic traversal
• Supports data security
Images: DataStax
Lab cluster setup
2 local data centers for workload
segregation
DC1
– 9 nodes
– OLTP workload incl. Search
DC2
– 3 nodes
– OLAP workload with Spark
Social Network Graph Model
Creating the Graph Schema (I)
Define types of properties to be used with Vertices and Edges
Define types of vertices to be used
Define types of edges to be used
schema.propertyKey("id").Long().single().create();
schema.propertyKey("name").Text().create();
schema.propertyKey("sources").Text().multiple().create();

schema.vertexLabel("twitterUser").
properties("id","name","language","verified","createdAt","sources").create();
schema.vertexLabel("tweet").
properties("id","text","targetIds","timestamp","language").create();

schema.edgeLabel("publishes").
multiple().properties("timestamp").
connection("twitterUser","tweet").create();
(Model) TwitterUser [id, name (idx), ..., sources*] -publishes (0..*, timestamp (idx))-> Tweet [id, text (idx), timestamp, language]
Creating the Graph Schema (II)
Define global materialized index on vertex (good for high selectivity)

schema.vertexLabel("twitterUser").
index('twitterUserNameIdx').
materialized().by('name').ifNotExists().add()

Define global full-text search index (Solr) on text of Tweet

schema.vertexLabel("tweet").
index("search").
search().by("text").asText().ifNotExists().add()

Define vertex-centric index on edge

schema.vertexLabel("twitterUser").
index("usesAtTimestamp").
outE("publishes").by("timestamp").ifNotExists().add()
Creating Vertex Instances
Adding a new TwitterUser vertex
Adding a new Tweet vertex
Vertex u = graph.addVertex("twitterUser");
u.property("id","1323232");
u.property("name","realDonaldTrump");
u.property("language","en");
u.property("sources","tweet");
Twitter
User
userId: 1323232
name: realDonaldTrump
language: en
sources: tweet
Vertex m = graph.addVertex(label, "tweet",
"id", "1213232323",
"text", "I called President Putin of
Russia to congratulate him on his election ...",
"timestamp", "2018-03-21",
"language", "en");
Tweet
id: 1213232323
text: I called President …
timestamp: 2018-03-21
language: en
Creating Edges between Vertices
Adding an edge between the TwitterUser and the Tweet vertex
Vertex u = graph.addVertex("twitterUser", ...);
Vertex t = graph.addVertex("tweet", ...);
Edge e = u.addEdge("publishes", t);
e.property("timestamp", "2018-03-21");
Twitter
User
Tweet
publishes
timestamp:
2018-03-21
userId: 1323232
name: realDonaldTrump
language: en
sources: tweet
id: 1213232323
text: I called President …
timestamp: 2018-03-21
language: en
Use Case 1 – Batch Data Ingestion of Social Graph
Batch Data Ingestion of Social Graph
Transformation DSE Graph
Avro on
HDFS
CSV on
HDFS
DSE
GraphLoader
CSV local
"712181117131034624","1458632184000","en","","","@crouchybro Hi, I
work for @NBCNews. Are you safe? If you feel comfortable, would you
DM me?","","","257242054","crouchybro","168379717"
Extracting and Preparing Data through Zeppelin
Batch Ingestion with DSE Loader
• Simplifies loading of data from various sources into DSE Graph efficiently
• Inspects incoming data for schema compliance
• Uses declarative data mappings and custom transformations to handle data
• Loads all vertices first and caches them, then loads edges and properties
Graph
Loader
Data Mappings
Batch Loading
Stream Ingestion (planned)
JSON RDBMS
DSE Graph
CSV
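The loader's vertices-first strategy described above can be illustrated with a minimal, hypothetical Python sketch (dicts stand in for the graph; the names are illustrative and not the DSE Graph Loader API):

```python
# Minimal sketch of two-phase batch loading: all vertices are loaded
# and cached by (label, key) first, then edges resolve both endpoints
# from that cache. Illustrative only, not the DSE Graph Loader API.
def load_graph(vertex_records, edge_records):
    cache = {}                      # (label, key) -> vertex dict
    vertices, edges = [], []
    for label, key, props in vertex_records:      # phase 1: vertices
        v = cache.setdefault((label, key), {"label": label, "id": key, **props})
        vertices.append(v)
    for out_key, label, in_key in edge_records:   # phase 2: edges
        src, dst = cache[out_key], cache[in_key]  # resolved from cache
        edges.append((src["id"], label, dst["id"]))
    return vertices, edges

vs, es = load_graph(
    [("twitterUser", "u1", {"name": "gschmutz"}),
     ("tweet", "t1", {"text": "hello"})],
    [(("twitterUser", "u1"), "publishes", ("tweet", "t1"))])
```

Loading edges only after every vertex is cached is what lets the loader avoid repeated vertex lookups per edge.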
Batch Ingestion with DSE Loader – Example
publishes
Place
Twitter User
Tweet
uses
tweetInput = File.text("tweet.txt")
.delimiter('","')
.header('id','timestamp','language','placeId','geoPointId','text',
'retweetedId','replyToId','replyToUserId','replyToScreenName',
'publisherUserId')
tweetInput = tweetInput.transform { it['replyToId'] =
it['replyToId'].equals('-1') ? '' : it['replyToId'];
it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')};
it }
load(tweetInput).asVertices {
isNew()
label "tweet"
key "id"
ignore "placeId"
ignore "publisherUserId"
...
outV "placeId", "uses", {
label "place"
key "id"
}
...
inV "publisherUserId", "publishes", {
label "twitterUser"
key "id"
}}
Transform
Extract
Load
"712181117131034624","1458632184000","en","","","@crouchybro
Hi, I work for @NBCNews. Are you safe? If you feel comfortable,
would you DM me?","","","257242054","crouchybro","168379717"
...
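The transform above strips the wrapping quotes that remain on each field after splitting on the '","' delimiter (the Groovy `replaceAll('^"|"$', '')`). The same idea as a plain Python sketch, with a shortened header for illustration:

```python
import re

# A raw line is split on the '","' delimiter, which leaves a leading
# quote on the first field and a trailing quote on the last one;
# stripping '^"|"$' per field cleans them up (mirrors the Groovy transform).
def parse_line(line, header):
    fields = line.split('","')
    return {k: re.sub(r'^"|"$', '', v) for k, v in zip(header, fields)}

row = parse_line('"712181117131034624","1458632184000","en"',
                 ['id', 'timestamp', 'language'])
```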
Batch Ingestion with DSE Loader - Example
mentions
Twitter User
Tweet
tweetMentionsInput = File.text("tweet_mentions_user.txt")
.delimiter('","')
.header('tweetId','userId','screenName','timestamp')
tweetMentionsInput = tweetMentionsInput.transform {
it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')};
it
}
load(tweetMentionsInput).asEdges {
label "mentions"
ignore "screenName"
outV "tweetId", {
label "tweet"
key "tweetId":"id"
}
inV "userId", {
label "twitterUser"
key "id"
}
}
graphloader load-sma-graph.groovy -graph sma_graph_prod_v10 -address
192.168.69.135 -dryrun false -preparation false -abort_on_num_failures
10000 -abort_on_prep_errors false -vertex_complete false
Performance Tests DSE Graph Loader
ID strategy | dataset  | vertex threads | edge threads | number of tweets | time sec/min | tweets per sec
Gen         | 1-day-14 |  0 |   0 |   212'116 |  541/9.01  |  392
Gen         | 1-day-14 |  4 |  12 |   212'116 |  419/6.93  |  506
Gen         | 1-day-14 | 10 |  10 |   212'116 |  334/5.56  |  635
Gen         | 1-day-14 | 40 |  40 |   212'116 |  170/2.83  | 1247
Gen         | 1-day-14 | 60 |  80 |   212'116 |  118/1.96  | 1797
Cust        | 1-day-14 | 60 |  80 |   212'116 |  101/1.68  | 2100
Gen         | 1-mth-14 | 60 |  80 | 7'403'667 | 3888/64.8  | 1904
Gen         | 1-mth-14 | 60 | 120 | 7'403'667 | 3685/61.4  | 2009
Cust        | 1-mth-14 | 60 | 120 | 7'403'667 | 3315/55.25 | 2233
Gen         | 1-mth-14 | 80 | 120 | 7'403'667 | 4005/66.75 | 1848
Cust        | 1-mth-14 | 80 | 120 | 7'403'667 | 3598/59.69 | 2057
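The "tweets per sec" column is simply the tweet count divided by the elapsed seconds, truncated to a whole number, which is easy to verify:

```python
# Throughput = number of tweets / elapsed seconds, truncated
# (values taken from the table above).
def throughput(tweets, seconds):
    return tweets // seconds

r_first = throughput(212_116, 541)    # first row (0/0 threads)
r_last = throughput(7_403_667, 3598)  # last row (Cust, 80/120 threads)
```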
Use Case 2 – Streaming Ingestion of Social Graph
Streaming Ingestion of Social Network Graph
Twitter
Tweet-to-Graph
DSE Graph
Tweet
Profile-to-Graph
Follower-to-Graph
Tweet-to-ES
Elasticsearch
Profile
Follower
YouTube-to-Graph
YouTube Video
YouTube
Apache Kafka – A Streaming Platform
High-Level Architecture
Distributed Log at the Core
Scale-Out Architecture
Logs do not (necessarily) forget
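The "distributed log at the core" idea: producers append records to an offset-addressed log, consumers track their own read position, and records are retained after being read, so new consumers can replay from the start. A toy single-partition sketch (illustrative only, not the Kafka API):

```python
# Toy append-only log: each append gets the next offset; reading does
# not remove records, so any consumer can replay from any offset.
class Log:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1          # offset of the new record

    def read(self, offset, max_records=10):
        return self.records[offset:offset + max_records]

log = Log()
off = log.append("tweet-1")
log.append("tweet-2")
replay = log.read(0)   # a late-joining consumer replays from the beginning
```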
Graph Ingestion through DataStax Java Driver
Similar to programming with JDBC: build your Gremlin statement in a
String variable and execute it => also as error-prone as with JDBC ;-)
A fluent Java-based API is available as of DSE 5.1
String stmt = ""
  + "Vertex v" + "\n"
  + "GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n"
  + "if (!gt.hasNext()) {" + "\n"
  + "  v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n"
  + "} else {" + "\n" ...

Map<String,Object> map = ImmutableMap.<String,Object>of("vertexLabel", vertexLabel,
    "propertyKey", propertyKey, "propertyKeyValue", propertyKeyValue);

vertex = session.executeGraph(new SimpleGraphStatement(stmt)
    .set("b",
         ImmutableMap.<String,Object>builder().putAll(map).putAll(propertyParams.getRight()).build()
    )).one().asVertex();
Tweet-to-Graph
Add information for new Event (tweet) to Graph – what it
really means ….
1. Create new Tweet vertex and add
retweets/replies edges
2. Create/Update TwitterUser vertex
3. Create new publishes edge
4. For Each Mention, Create/Update
TwitterUser vertex
5. For each Mention, Create mentions edge
6. For each Link, Create/Update Url vertex
7. For each Link, Create uses edge
8. For each HashTag, Create/Update Term
vertex
9. For each HashTag, Create contains edge
10. ….
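The step list above boils down to idempotent upserts: every vertex is created only if it does not exist yet, every edge at most once. A minimal Python sketch of that fan-out (dicts stand in for DSE Graph; the names are illustrative):

```python
# Sketch of the per-tweet fan-out: every vertex is an idempotent
# upsert keyed by (label, id), every edge is created at most once.
def ingest_tweet(graph, tweet):
    def upsert(label, key, **props):
        v = graph["vertices"].setdefault((label, key), {})
        v.update(props)                       # create-or-update semantics
        return (label, key)

    def edge(frm, lbl, to):
        graph["edges"].add((frm, lbl, to))    # a set makes edges unique

    t = upsert("tweet", tweet["id"], text=tweet["text"])          # step 1
    u = upsert("twitterUser", tweet["user"]["id"],                # step 2
               name=tweet["user"]["name"])
    edge(u, "publishes", t)                                       # step 3
    for m in tweet.get("mentions", []):                           # steps 4-5
        edge(t, "mentions", upsert("twitterUser", m["id"], name=m["name"]))
    for h in tweet.get("hashtags", []):                           # steps 8-9
        edge(t, "contains", upsert("term", h))
    return graph

g = ingest_tweet({"vertices": {}, "edges": set()},
                 {"id": 1, "text": "hi #kafka",
                  "user": {"id": 7, "name": "gschmutz"},
                  "hashtags": ["kafka"]})
```

Each incoming tweet therefore triggers many round trips, which is exactly the cost the following slides try to reduce.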
Could we solve it by modeling differently?
Tweet
Term
contains
name (idx): #fake
text: The point of
Trump’s now #Fake
steel …
Tweet
text: the point of Trump’s now
#Fake steel …
terms: [fake]
….
vs
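The trade-off: terms kept as a plain property are cheap to write but cannot be traversed, while Term vertices make the term a direct entry point into the graph. A small illustrative Python sketch of the difference (not Gremlin):

```python
# Option A: terms embedded as a property -> finding tweets for a term
# means scanning every tweet. Option B: Term vertex + 'contains' edges
# -> the term is a direct, indexed entry point.
tweets = [{"id": 1, "terms": ["fake"]}, {"id": 2, "terms": ["steel"]}]
scan = [t["id"] for t in tweets if "fake" in t["terms"]]   # option A: full scan

contains = {("term", "fake"): [("tweet", 1)],
            ("term", "steel"): [("tweet", 2)]}
lookup = contains[("term", "fake")]                        # option B: direct lookup
```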
Single Insert into Social Network Graph
DSE Graph
Tweet-to-Graph
Tweet
User
publishes
Term *
uses*
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("name",b.propertyParam0)
v.property("language",b.propertyParam1)
v.property("verified",b.propertyParam2)
Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if
(!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if
(!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("timestamp",b.propertyParam0)
v.property("language",b.propertyParam1)
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("type",b.propertyParam0)
GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)
if (!gt.hasNext()) {
v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)
} else {
v = gt.next()
}
v.property("type",b.propertyParam0)
Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if
(!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
2nd approach - "Fan out ingestion driven by model"
1st approach => too slow
2nd approach => too difficult to implement and maintain
Tweet_to_
Graph
Tweet_to_
Graph
User_to_
Graph
URL_to_
Graph
URL_to_
Graph
Term_to_
Graph
Geo_to_
Graph
Tweet
Tweet
3rd approach - Batch Insert into Social Network Graph
user = g.V().has('twitterUser', 'id', bUser.id).tryNext().orElseGet { g.addV('twitterUser').property('id', bUser.id).next() }
user.property('name',bUser.name)
user.property('language',bUser.language)
user.property('verified',bUser.verified)
tweet = g.V().has('tweet', 'id', bTweet.id).tryNext().orElseGet { g.addV('tweet').property('id', bTweet.id).next() }
tweet.property('timestamp',bTweet.timestamp)
tweet.property('language',bTweet.language)
if (!g.V(user).out('publishes').hasId(tweet.id()).hasNext()) {
publishes = g.V(user).as('f').V(tweet).as('t').addE('publishes').from('f').next()
}
i = 0
Vertex[] term = new Vertex[10]
for (Object keyValue : bTerm.propertyKeyValues) {
term[i] = g.V().has('term', 'name', keyValue).tryNext().orElseGet {g.addV('term').property('name', keyValue).next() }
Map<String,Object> params = bTerm.params[i]
if (params != null)
for (String key : params.keySet()) {
term[i].property(key, params.get(key))
}
i++
}
Edge[] usesTerm = new Edge[10]
for (i = 0; i < bUsesTerm.count; i++) {
if (!g.V(tweet).out('uses').hasId(term[i].id()).hasNext()) {
usesTerm[i] = g.V(tweet).as('f').V(term[i]).as('t').addE('uses').from('f').next()
}
Map<String,Object> params = bUsesTerm.params[i]
if (params != null)
for (String key : params.keySet()) {
usesTerm[i].property(key, params.get(key))
}
}
return nof
DSE Graph
Tweet-to-Graph
…
...
Benchmark Single vs. Scripted Insert
Test Case                        | Single Inserts  | Scripted (Batch) Insert
10 tweets with 10 hashtags each  |  569ms – 690ms  |  132ms – 172ms
10 tweets with 30 hashtags each  | 1311ms – 1396ms |  285ms – 364ms
20 tweets with 30 hashtags each  | 2734ms – 3385ms |  640ms – 767ms
20 tweets with 40 hashtags each  | 3576ms – 4089ms |  778ms – 992ms
100 tweets with 10 hashtags each | 6340ms – 7171ms | 1673ms – 2428ms
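Across all test cases the scripted (batch) insert is roughly a factor of 4 faster, e.g. comparing the lower bounds of the reported ranges:

```python
# Speedup of scripted vs. single inserts, using the lower bounds of the
# ranges in the benchmark above (milliseconds).
def speedup(single_ms, scripted_ms):
    return round(single_ms / scripted_ms, 1)

s_small = speedup(569, 132)    # 10 tweets x 10 hashtags
s_large = speedup(2734, 640)   # 20 tweets x 30 hashtags
```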
Process CDC (Change Data Capture) Events
CDC
Customer-to-
Cassandra
Wide-Row
/ Search
Customer
Event
Graph
Customer-to-
Graph
Customer CDC
Address CDC
Contact CDC
Aggregate-to-
Customer
DSE Cluster
(Data model) Partner (Id, FirstName, LastName, BirthDate, ProfileImageURL); Contacts (Id, CustomerId, When, Where, Type, Description); Address (Id, PartnerId, Street, StreetNr, ZipCode, City, Country); Products (Id, Name, Price, Description, ImageURL); Order (CustomerId, ProductId); Customer Service (Id, CustomerId, When, Where, InteractionType, Description); Product Reviews (Id, ProductId, Comment, UserId); Customer (Id, PartnerId, TwitterId, FacebookId)
Use Case 3: Streaming Ingestion into RDF Graph
Semantic Layer 3 – streaming update of Triple store
Bulk Move /
Event
Bulk
Import
Integration
Bulk
Data Flow
Event Hub
Triplification
Event
Data Flow
SPARQL / SQL
Bulk Move /
Event
Semantic
Layer
Query
Federation / LOD
Consumer
Master Data /
Contr. Vocabulary
Reasoning
Meta Data
Instance Data
SPARQL
RDF CRUD
SQL SELECT
Internal Data Sources
CSV RDBMS XML JSON SPARQL
External Data Sources
Data Stream
CSV XML JSON SPARQL Data Stream
Data Lake Platform
Object Store Distributed FS NoSQL MPP RDBMS
Query Execution ETL
Big Data Storage
Big Data Compute
NLP Analytics / ML
Data
Visualization
Data Science
Lab
Dashboard
RDF
CRUD
SQL SELECT
Bulk
upload / sync
“Triplification” – 1st option
Using RML (http://rml.io/) and
Carml (https://github.com/carml/carml)
Semantic
Layer
Query
Federation / LOD
Master Data /
Contr. Vocabulary
Reasoning
Meta Data
Instance Data
Integration
Bulk
Data Flow
Event Hub
Triplification
Event
Data Flow
RDF CRUD
RML
Kafka
Kafka Connect
Carml
StreamSets
“Triplification” – 2nd option
Using Jolt Transformation (https://github.com/bazaarvoice/jolt) to
create JSON-LD
Demo: http://jolt-demo.appspot.com/#inputArrayToPrefix
Semantic
Layer
Query
Federation / LOD
Master Data /
Contr. Vocabulary
Reasoning
Meta Data
Instance Data
Integration
Bulk
Data Flow
Event Hub
Triplification
Event
Data Flow
RDF CRUD
Kafka
Kafka Connect
Jolt
StreamSets
Jolt
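A Jolt "shift" spec rearranges a JSON document; the goal here is JSON-LD, i.e. the event JSON with its keys renamed to vocabulary terms plus an `@context` mapping them to IRIs. A hand-rolled Python sketch of that idea (the context IRIs and key mapping are illustrative assumptions, not the project's actual Jolt spec):

```python
# Sketch: turn a flat tweet event into JSON-LD by renaming keys per a
# mapping (what a Jolt 'shift' does) and attaching an @context.
SHIFT = {"id": "@id", "text": "schema:text", "language": "schema:inLanguage"}
CONTEXT = {"schema": "http://schema.org/"}   # illustrative vocabulary

def to_jsonld(event):
    doc = {SHIFT[k]: v for k, v in event.items() if k in SHIFT}
    doc["@context"] = CONTEXT
    return doc

doc = to_jsonld({"id": "tweet:1213232323",
                 "text": "I called ...",
                 "language": "en"})
```

The resulting document can be handed to any RDF-CRUD endpoint that accepts JSON-LD, which is what makes this option attractive for streaming triplification.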
Summary - Lessons Learnt
Summary - Lessons Learnt
Graph Schema
• Edges cannot cross graph schemas
• Vertex or edge? Vertex or property?
• Custom Vertex IDs can be useful
• idempotent
• faster on insert (no id generation needed)
• can lead to hotspots
Data Ingestion
• Single vertex & edge inserts are slow =>
batching via Groovy script speeds things up
• DSE Graph Loader works well for batch
data loading
Technology on its own won't help you.
You need to know how to use it properly.
Kafka as your Data Lake - is it Feasible?
Guido Schmutz
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data Architecture
Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Guido Schmutz
 
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) ArchitectureEvent Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Guido Schmutz
 
Building Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaBuilding Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache Kafka
Guido Schmutz
 
Location Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache KafkaLocation Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache Kafka
Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache KafkaSolutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Guido Schmutz
 
What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?
Guido Schmutz
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Guido Schmutz
 
Location Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using KafkaLocation Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using Kafka
Guido Schmutz
 
Streaming Visualisation
Streaming VisualisationStreaming Visualisation
Streaming Visualisation
Guido Schmutz
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?
Guido Schmutz
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Guido Schmutz
 
Fundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureFundamentals Big Data and AI Architecture
Fundamentals Big Data and AI Architecture
Guido Schmutz
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka
Guido Schmutz
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
Guido Schmutz
 

More from Guido Schmutz (20)

30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as Code30 Minutes to the Analytics Platform with Infrastructure as Code
30 Minutes to the Analytics Platform with Infrastructure as Code
 
Event Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data ArchitectureEvent Broker (Kafka) in a Modern Data Architecture
Event Broker (Kafka) in a Modern Data Architecture
 
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-FormatsBig Data, Data Lake, Fast Data - Dataserialiation-Formats
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
 
ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!ksqlDB - Stream Processing simplified!
ksqlDB - Stream Processing simplified!
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Event Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data ArchitectureEvent Hub (i.e. Kafka) in Modern Data Architecture
Event Hub (i.e. Kafka) in Modern Data Architecture
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) ArchitectureEvent Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
 
Building Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache KafkaBuilding Event Driven (Micro)services with Apache Kafka
Building Event Driven (Micro)services with Apache Kafka
 
Location Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache KafkaLocation Analytics - Real-Time Geofencing using Apache Kafka
Location Analytics - Real-Time Geofencing using Apache Kafka
 
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache KafkaSolutions for bi-directional integration between Oracle RDBMS and Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
 
What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?What is Apache Kafka? Why is it so popular? Should I use it?
What is Apache Kafka? Why is it so popular? Should I use it?
 
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache KafkaSolutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
 
Location Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using KafkaLocation Analytics Real-Time Geofencing using Kafka
Location Analytics Real-Time Geofencing using Kafka
 
Streaming Visualisation
Streaming VisualisationStreaming Visualisation
Streaming Visualisation
 
Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?Kafka as an event store - is it good enough?
Kafka as an event store - is it good enough?
 
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache KafkaSolutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
 
Fundamentals Big Data and AI Architecture
Fundamentals Big Data and AI ArchitectureFundamentals Big Data and AI Architecture
Fundamentals Big Data and AI Architecture
 
Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka Location Analytics - Real-Time Geofencing using Kafka
Location Analytics - Real-Time Geofencing using Kafka
 
Streaming Visualization
Streaming VisualizationStreaming Visualization
Streaming Visualization
 

Recently uploaded

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
Tatiana Kojar
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
danishmna97
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
Hiroshi SHIBATA
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
akankshawande
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
panagenda
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
jpupo2018
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
Edge AI and Vision Alliance
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 

Recently uploaded (20)

Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 
Skybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoptionSkybuffer SAM4U tool for SAP license adoption
Skybuffer SAM4U tool for SAP license adoption
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6UiPath Test Automation using UiPath Test Suite series, part 6
UiPath Test Automation using UiPath Test Suite series, part 6
 
Introduction of Cybersecurity with OSS at Code Europe 2024
Introduction of Cybersecurity with OSS  at Code Europe 2024Introduction of Cybersecurity with OSS  at Code Europe 2024
Introduction of Cybersecurity with OSS at Code Europe 2024
 
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development ProvidersYour One-Stop Shop for Python Success: Top 10 US Python Development Providers
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers
 
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUHCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAU
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
 
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
“Building and Scaling AI Applications with the Nx AI Manager,” a Presentation...
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 

Ingesting streaming data into Graph Database

  • 1. BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENF HAMBURG KOPENHAGEN LAUSANNE MÜNCHEN STUTTGART WIEN ZÜRICH Ingesting streaming data into Graph Database Guido Schmutz San Francisco, 22.3.2018 @gschmutz guidoschmutz.wordpress.com
  • 2. Guido Schmutz Working at Trivadis for more than 21 years Oracle ACE Director for Fusion Middleware and SOA Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data Head of Trivadis Architecture Board Technology Manager @ Trivadis More than 30 years of software development experience Contact: guido.schmutz@trivadis.com Blog: http://guidoschmutz.wordpress.com Slideshare: http://www.slideshare.net/gschmutz Twitter: gschmutz
  • 3. Agenda 1. Introduction 2. DataStax (DSE) Graph 3. Use-Case 1 - Batch Data Ingestion of Social Graph 4. Use-Case 2 - Streaming Ingestion of Social Graph 5. Use-Case 3 - Streaming Ingestion into RDF Graph 6. Summary
  • 5. Hadoop Cluster Hadoop Cluster Big Data Cluster Customer Context – Big Data Lab Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Batch Analytics Streaming Analytics Stream Analytics NoSQL Reference / Models SQL Search Service Dashboard BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data Event Hub Event Hub Event Hub Data Flow Data Flow Data Flow Change Data Capture Parallel Processing Storage Storage Raw Refined Results SQL Export
  • 6. Hadoop Cluster Hadoop Cluster Big Data Cluster Customer Context – Big Data Lab Location Social Click stream Sensor Data Billing & Ordering CRM / Profile Marketing Campaigns Call Center Mobile Apps Batch Analytics Streaming Analytics Stream Analytics NoSQL Reference / Models SQL Search Service Dashboard BI Tools Enterprise Data Warehouse Search / Explore Online & Mobile Apps File Import / SQL Import Weather Data Event Hub Event Hub Event Hub Data Flow Data Flow Data Flow Change Data Capture Parallel Processing Storage Storage Raw Refined Results SQL Export
  • 7. Map of a Twitter Status object
  • 8. Map of a Twitter Status object
  • 9. Know your domain and choose the right data store! Connectedness of Data: low → high Document Data Store Key-Value Stores Wide-Column Store Graph Databases Relational Databases
  • 11. DataStax Enterprise Edition – DSE Graph A scale-out property graph database Store & find relationships in large graphs quickly and easily. • Adopts Apache TinkerPop standards for data and traversal • Uses Apache Cassandra for scalable storage and retrieval • Leverages Apache Solr for full-text search and indexing • Integrates Apache Spark for fast analytic traversal • Supports data security Images: DataStax
  • 12. Lab cluster setup 2 logical data centers for workload segregation DC1 – 9 nodes – OLTP workload incl. Search DC2 – 3 nodes – OLAP workload with Spark
  • 14. Creating the Graph Schema (I) Define types of properties to be used with Vertices and Edges Define types of vertices to be used Define types of edges to be used schema.vertexLabel("twitterUser"). properties("id","name","language","verified","createdAt","sources").create(); schema.vertexLabel("tweet"). properties("id","text","targetIds","timestamp","language").create(); schema.propertyKey("id").Long().single().create(); schema.propertyKey("name").Text().create(); schema.propertyKey("sources").Text().multiple().create(); schema.edgeLabel("publishes").multiple().properties("timestamp"). connection("twitterUser","tweet").create(); Twitter User Tweet id name (idx) ... sources* id text (idx) timestamp language publishes (0..*) timestamp (idx)
  • 15. Creating the Graph Schema (II) Define global materialized index on vertex (good for high selectivity) Define global full-text search index (Solr) on text of Tweet Define vertex-centric index on edge schema.vertexLabel("twitterUser"). index("usesAtTimestamp"). outE("publishes").by("timestamp").ifNotExists().add() Twitter User Tweet id name (idx) ... sources* id text (idx) timestamp language publishes (0..*) timestamp (idx) schema.vertexLabel("twitterUser"). index('twitterUserNameIdx'). materialized().by('name').ifNotExists().add() schema.vertexLabel("tweet"). index("search"). search().by("text").asText().ifNotExists().add()
  • 16. Creating Vertex Instances Adding a new TwitterUser vertex Adding a new Tweet vertex Vertex u = graph.addVertex("twitterUser"); u.property("id","1323232"); u.property("name","realDonaldTrump"); u.property("language","en"); u.property("sources","tweet"); Twitter User userId: 1323232 name: realDonaldTrump language: en sources: tweet Vertex m = graph.addVertex(label, "tweet", "id", "1213232323", "text", "I called President Putin of Russia to congratulate him on his election ...", "timestamp", "2018-03-21", "language", "en"); Tweet id: 1213232323 text: I called President … timestamp: 2018-03-21 language: en
  • 17. Creating Edges between Vertices Adding an edge between the TwitterUser and the Tweet vertex Vertex u = graph.addVertex("twitterUser", ...); Vertex t = graph.addVertex("tweet", ...); Edge e = u.addEdge("publishes", t); e.property("timestamp", "2018-03-21"); Twitter User Tweet publishes timestamp: 2018-03-21 userId: 1323232 name: realDonaldTrump language: en sources: tweet id: 1213232323 text: I called President … timestamp: 2018-03-21 language: en
  • 18. Use Case 1 – Batch Data Ingestion of Social Graph
  • 19. Batch Data Ingestion of Social Graph Transformation DSE Graph Avro on HDFS CSV on HDFS DSE GraphLoader CSV local "712181117131034624","1458632184000","en","","","@crouchybro Hi, I work for @NBCNews. Are you safe? If you feel comfortable, would you DM me?","","","257242054","crouchybro","168379717"
  • 20. Extracting and Preparing Data through Zeppelin
  • 21. Batch Ingestion with DSE Loader • Simplifies loading of data from various sources into DSE Graph efficiently • Inspects incoming data for schema compliance • Uses declarative data mappings and custom transformations to handle data • Loads all vertices first and stores them in a cache, then loads edges and properties Graph Loader Data Mappings Batch Loading Stream Ingestion (planned) RDBMS JSON DSE Graph CSV
  • 22. Batch Ingestion with DSE Loader – Example publishes Place Twitter User Tweet uses tweetInput = File.text("tweet.txt") .delimiter('","') .header('id','timestamp','language','placeId','geoPointId','text', 'retweetedId','replyToId','replyToUserId','replyToScreenName', 'publisherUserId') tweetInput = tweetInput.transform { it['replyToId'] = it['replyToId'].equals('-1') ? '' : it['replyToId']; it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')}; it } load(tweetInput).asVertices { isNew() label "tweet" key "id" ignore "placeId" ignore "publisherUserId" ... outV "placeId", "uses", { label "place" key "id" } ... inV "publisherUserId", "publishes", { label "twitterUser" key "id" }} Transform Extract Load "712181117131034624","1458632184000","en","","","@crouchybro Hi, I work for @NBCNews. Are you safe? If you feel comfortable, would you DM me?","","","257242054","crouchybro","168379717" ...
  • 23. Batch Ingestion with DSE Loader – Example mentions Twitter User Tweet tweetMentionsInput = File.text("tweet_mentions_user.txt") .delimiter('","') .header('tweetId','userId','screenName','timestamp') tweetMentionsInput = tweetMentionsInput.transform { it.each {key,value -> it[key] = value.replaceAll('^"|"$', '')}; it } load(tweetMentionsInput).asEdges { label "mentions" ignore "screenName" outV "tweetId", { label "tweet" key "tweetId":"id" } inV "userId", { label "twitterUser" key "id" } } graphloader load-sma-graph.groovy -graph sma_graph_prod_v10 -address 192.168.69.135 -dryrun false -preparation false -abort_on_num_failures 10000 -abort_on_prep_errors false -vertex_complete false
  • 24. Performance Tests DSE Graph Loader
      ID strategy  dataset    vertex threads  edge threads  number of tweets  time (sec/min)  tweets per sec
      Gen          1-day-14    0               0            212’116            541 / 9.01        392
      Gen          1-day-14    4              12            212’116            419 / 6.93        506
      Gen          1-day-14   10              10            212’116            334 / 5.56        635
      Gen          1-day-14   40              40            212’116            170 / 2.83       1247
      Gen          1-day-14   60              80            212’116            118 / 1.96       1797
      Cust         1-day-14   60              80            212’116            101 / 1.68       2100
      Gen          1-mth-14   60              80            7’403’667         3888 / 64.80      1904
      Gen          1-mth-14   60             120            7’403’667         3685 / 61.40      2009
      Cust         1-mth-14   60             120            7’403’667         3315 / 55.25      2233
      Gen          1-mth-14   80             120            7’403’667         4005 / 66.75      1848
      Cust         1-mth-14   80             120            7’403’667         3598 / 59.69      2057
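The last column of the loader table follows from the two before it: tweets per second is the tweet count divided by the elapsed seconds, truncated to a whole number. A one-method check (an illustrative helper, not part of DSE Graph Loader):

```java
/**
 * Reproduces the "tweets per sec" column of the loader benchmark:
 * tweet count divided by elapsed seconds, truncated as in the table.
 * Illustrative helper only, not part of the DSE Graph Loader API.
 */
public class Throughput {
    public static long perSec(long tweets, double seconds) {
        // Integer truncation matches how the slide's column was computed.
        return (long) (tweets / seconds);
    }
}
```

For example, 212'116 tweets in 541 s gives 392 tweets/sec, and 7'403'667 tweets in 3315 s gives 2233 tweets/sec, matching the table.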
  • 25. Use Case 2 – Streaming Ingestion of Social Graph
  • 26. Streaming Ingestion of Social Network Graph Twitter Tweet-to-Graph DSE Graph Tweet Profile-to-Graph Follower-to-Graph Elasticsearch Tweet-to-ES Profile Follower YouTube-to-Graph YouTube Video YouTube
  • 27. Apache Kafka – A Streaming Platform High-Level Architecture Distributed Log at the Core Scale-Out Architecture Logs do not (necessarily) forget
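The "distributed log at the core" idea can be sketched in a few lines: each Kafka topic partition behaves like an append-only list whose records keep their offsets, and reading does not delete anything. A toy illustration in plain Java (a hypothetical `ToyLog` class, not the Kafka client API):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy append-only log, illustrating how a Kafka topic partition behaves:
 * records get a monotonically increasing offset, reads are positional,
 * and reading never removes records (the log "does not forget").
 * Illustration only, not the Kafka client API.
 */
public class ToyLog {
    private final List<String> records = new ArrayList<>();

    /** Append a record and return the offset it was assigned. */
    public long append(String record) {
        records.add(record);
        return records.size() - 1;
    }

    /** Read every record from the given offset onwards; the log is unchanged. */
    public List<String> readFrom(long offset) {
        return new ArrayList<>(records.subList((int) offset, records.size()));
    }
}
```

Because reads are positional and non-destructive, several consumers (e.g. Tweet-to-Graph and Tweet-to-ES from the previous slide) can each track their own offset, which is what lets one topic feed both DSE Graph and Elasticsearch independently.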
  • 28. Graph Ingestion through DataStax Java Driver Similar to programming with JDBC: execute your Gremlin statement by creating it in a String variable => also as error-prone as with JDBC ;-) A fluent Java-based API is available as of DSE 5.1. String stmt = "" + "Vertex v" + "\n" + "GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n" + "if (!gt.hasNext()) {" + "\n" + " v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue)" + "\n" + "} else {" + "\n" ... Map<String,Object> map = ImmutableMap.<String,Object>of("vertexLabel", vertexLabel, "propertyKey", propertyKey, "propertyKeyValue", propertyKeyValue); vertex = session.executeGraph(new SimpleGraphStatement(stmt) .set("b" , ImmutableMap.<String,Object>builder().putAll(map).putAll(propertyParams.getRight()).build() )).one().asVertex(); Tweet-to-Graph
  • 29. Add information for new Event (tweet) to Graph – what it really means … 1. Create new Tweet vertex and add retweets/replies edges 2. Create/Update TwitterUser vertex 3. Create new publishes edge 4. For each Mention, Create/Update TwitterUser vertex 5. For each Mention, Create mentions edge 6. For each Link, Create/Update Url vertex 7. For each Link, Create uses edge 8. For each HashTag, Create/Update Term vertex 9. For each HashTag, Create contains edge 10. …
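Each of the steps above is a get-or-create (upsert), which is why replaying the same tweet event must not duplicate vertices or edges. A minimal in-memory sketch of that idempotency in plain Java (a hypothetical `TinyGraph` class, not the DSE Graph / TinkerPop API):

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Minimal in-memory stand-in for a graph, used only to illustrate why
 * every ingestion step must be an idempotent get-or-create: replaying
 * the same tweet event must not duplicate vertices or edges.
 * Hypothetical class, not the DSE Graph / TinkerPop API.
 */
public class TinyGraph {
    private final Set<String> vertices = new HashSet<>();
    private final Set<String> edges = new HashSet<>();

    /** Create the vertex only if it does not exist; true if it was created. */
    public boolean upsertVertex(String label, String id) {
        return vertices.add(label + ":" + id);
    }

    /** Create the edge only if it does not exist; true if it was created. */
    public boolean upsertEdge(String fromId, String label, String toId) {
        return edges.add(fromId + "-[" + label + "]->" + toId);
    }

    public int vertexCount() { return vertices.size(); }
    public int edgeCount()   { return edges.size(); }
}
```

In the real deck this get-or-create pattern shows up as the `g.V().has(...)` / `addVertex` check repeated on the next slide for every vertex and edge type.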
  • 30. Could we solve it by modeling differently? Tweet Term contains name (idx): #fake text: The point of Trump’s now #Fake steel … Tweet text: the point of Trump’s now #Fake steel … terms: [fake] …. vs
  • 31. Single Insert into Social Network Graph DSE Graph Tweet-to-Graph Tweet User publishes Term * uses* GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("name",b.propertyParam0) v.property("language",b.propertyParam1) v.property("verified",b.propertyParam2) Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if (!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) } Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if (!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) } GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("timestamp",b.propertyParam0) v.property("language",b.propertyParam1) GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("type",b.propertyParam0) GraphTraversal gt = g.V().has(b.vertexLabel, b.propertyKey, b.propertyKeyValue) if (!gt.hasNext()) { v = graph.addVertex(label, b.vertexLabel, b.propertyKey, b.propertyKeyValue) } else { v = gt.next() } v.property("type",b.propertyParam0) Vertex from = g.V(f).next(); Vertex to = g.V(t).next(); if (!g.V(f).out(b.edgeLabel).V(t).hasNext()) { from.addEdge(b.edgeLabel, to) }
  • 32. 2nd approach – ”fan-out ingestion driven by the model”: split each Tweet event into per-entity flows (Tweet_to_Graph, User_to_Graph, URL_to_Graph, Term_to_Graph, Geo_to_Graph) that each write their part of the graph.
    1st approach => too slow
    2nd approach => too difficult to implement and maintain
  • 33. 3rd approach – batch insert into Social Network Graph (one Groovy script per tweet, submitted to DSE Graph by Tweet-to-Graph):

    user = g.V().has('twitterUser', 'id', bUser.id).tryNext().orElseGet {
      g.addV('twitterUser').property('id', bUser.id).next()
    }
    user.property('name', bUser.name)
    user.property('language', bUser.language)
    user.property('verified', bUser.verified)

    tweet = g.V().has('tweet', 'id', bTweet.id).tryNext().orElseGet {
      g.addV('tweet').property('id', bTweet.id).next()
    }
    tweet.property('timestamp', bTweet.timestamp)
    tweet.property('language', bTweet.language)

    if (!g.V(user).out('publishes').hasId(tweet.id()).hasNext()) {
      publishes = g.V(user).as('f').V(tweet).as('t').addE('publishes').from('f').next()
    }

    i = 0
    Vertex[] term = new Vertex[10]
    for (Object keyValue : bTerm.propertyKeyValues) {
      term[i] = g.V().has('term', 'name', keyValue).tryNext().orElseGet {
        g.addV('term').property('name', keyValue).next()
      }
      Map<String,Object> params = bTerm.params[i]
      if (params != null)
        for (String key : params.keySet()) { term[i].property(key, params.get(key)) }
      i++
    }

    Edge[] usesTerm = new Edge[10]
    for (i = 0; i < bUsesTerm.count; i++) {
      if (!g.V(tweet).out('uses').hasId(term[i].id()).hasNext()) {
        usesTerm[i] = g.V(tweet).as('f').V(term[i]).as('t').addE('uses').from('f').next()
      }
      Map<String,Object> params = bUsesTerm.params[i]
      if (params != null)
        for (String key : params.keySet()) { usesTerm[i].property(key, params.get(key)) }
    }
    return nof
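The idea behind the scripted approach can be sketched as follows: instead of one request per vertex or edge, all upsert statements for a tweet are collected into a single parameterised Groovy script and submitted in one round trip. The `build_batch` helper and the event shape below are assumptions for illustration, not the actual driver API:

```python
# Hedged sketch: collect all upserts for one tweet into a single
# parameterised script; `build_batch` and the event shape are made up.
def build_batch(tweet):
    statements, params = [], {}
    params['tweetId'] = tweet['id']
    statements.append(
        "tweet = g.V().has('tweet','id',tweetId).tryNext()"
        ".orElseGet{ g.addV('tweet').property('id',tweetId).next() }")
    for i, tag in enumerate(tweet['hashtags']):
        params[f'tag{i}'] = tag
        statements.append(
            f"term{i} = g.V().has('term','name',tag{i}).tryNext()"
            f".orElseGet{{ g.addV('term').property('name',tag{i}).next() }}")
    # one script, one set of bound parameters, one round trip
    return "\n".join(statements), params

script, params = build_batch({'id': 7, 'hashtags': ['fake', 'steel']})
print(len(script.splitlines()), sorted(params))
```

Binding values as parameters rather than concatenating them into the script also lets the server cache the compiled script across tweets.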
  • 34. Benchmark: single vs. scripted insert

    Test Case                          Single Inserts     Scripted (Batch) Insert
    10 tweets with 10 hashtags each    569ms – 690ms      132ms – 172ms
    10 tweets with 30 hashtags each    1311ms – 1396ms    285ms – 364ms
    20 tweets with 30 hashtags each    2734ms – 3385ms    640ms – 767ms
    20 tweets with 40 hashtags each    3576ms – 4089ms    778ms – 992ms
    100 tweets with 10 hashtags each   6340ms – 7171ms    1673ms – 2428ms
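Taking the midpoints of the ranges in the table above, the scripted insert comes out roughly 3x to 4.4x faster across all five test cases:

```python
# Speed-up implied by the benchmark above: midpoints of the reported
# ranges (in ms), single inserts vs. scripted (batch) insert.
single = [(569, 690), (1311, 1396), (2734, 3385), (3576, 4089), (6340, 7171)]
batch = [(132, 172), (285, 364), (640, 767), (778, 992), (1673, 2428)]

speedups = []
for (s_lo, s_hi), (b_lo, b_hi) in zip(single, batch):
    speedups.append(((s_lo + s_hi) / 2) / ((b_lo + b_hi) / 2))

print(", ".join(f"{s:.1f}x" for s in speedups))
```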
  • 35. Process CDC (Change Data Capture) events: per-table CDC streams (Customer CDC, Address CDC, Contact CDC) feed an Aggregate-to-Customer step; the aggregated Customer event is then written both to Cassandra (wide-row / search, via Customer-to-Cassandra) and to the graph (via Customer-to-Graph), all inside the DSE cluster. [Diagram: source data model with Partner, Contacts, Address, Products, Order, Customer Service, Product Reviews and Customer entities]
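The Aggregate-to-Customer step can be sketched like this: per-table CDC events are merged per customer id into one aggregate before it is written on. The event shape and field names below are assumptions for illustration:

```python
# Illustrative sketch of aggregating per-table CDC events into one
# customer document; event shape and field names are assumptions.
from collections import defaultdict

def aggregate(cdc_events):
    customers = defaultdict(lambda: {'addresses': [], 'contacts': []})
    for ev in cdc_events:
        cust = customers[ev['customerId']]
        if ev['table'] == 'Customer':
            cust.update(ev['data'])                # scalar customer fields
        elif ev['table'] == 'Address':
            cust['addresses'].append(ev['data'])   # 1:n child rows
        elif ev['table'] == 'Contact':
            cust['contacts'].append(ev['data'])
    return dict(customers)

events = [
    {'table': 'Customer', 'customerId': 1, 'data': {'firstName': 'Ada'}},
    {'table': 'Address', 'customerId': 1, 'data': {'city': 'Bern'}},
]
agg = aggregate(events)
print(agg[1]['firstName'], len(agg[1]['addresses']))   # → Ada 1
```

Writing one aggregate per customer instead of one write per CDC row keeps the downstream Cassandra and graph sinks consistent with each other.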
  • 36. Use Case 3: Streaming Ingestion into RDF Graph
  • 37. Semantic Layer 3 – streaming update of the triple store. [Architecture diagram: internal data sources (CSV, RDBMS, XML, JSON, SPARQL) and external data sources / data streams feed the platform via bulk import / integration and an event hub with a “Triplification” event data flow; the Semantic Layer (master data / controlled vocabulary, reasoning, meta data, instance data) is accessed via SPARQL RDF CRUD and SQL SELECT with query federation / LOD for consumers; a Data Lake platform (object store, distributed FS, NoSQL, MPP RDBMS, ETL, query execution, big data storage and compute, analytics / ML, NLP) serves the data science lab, data visualization and dashboards, with bulk upload / sync between the layers]
  • 38. “Triplification” – 1st option: using RML (http://rml.io/) and Carml (https://github.com/carml/carml). Kafka with Kafka Connect / StreamSets moves events from the event hub through a Carml-based triplification step (driven by RML mappings) into the Semantic Layer via RDF CRUD.
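What an RML mapping produces can be illustrated with a hand-rolled stand-in that turns a JSON record into RDF triples. This is not the RML/Carml API, and the vocabulary URIs are made up for illustration:

```python
# Stand-in for an RML-style mapping: one JSON record in, a list of
# (subject, predicate, object) triples out. URIs are illustrative.
def triplify(record, base='http://example.org/'):
    subject = f"<{base}tweet/{record['id']}>"
    triples = [(subject, '<rdf:type>', f"<{base}Tweet>")]
    for key, value in record.items():
        if key != 'id':
            # every remaining field becomes one literal-valued triple
            triples.append((subject, f"<{base}{key}>", f'"{value}"'))
    return triples

triples = triplify({'id': 42, 'text': 'hello', 'lang': 'en'})
for s, p, o in triples:
    print(s, p, o, '.')
```

A real RML mapping declares exactly this subject/predicate/object shape in a mapping document instead of in code, which is what makes the triplification step configurable.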
  • 39. “Triplification” – 2nd option: using a Jolt transformation (https://github.com/bazaarvoice/jolt) to create JSON-LD. Demo: http://jolt-demo.appspot.com/#inputArrayToPrefix. Kafka with Kafka Connect / StreamSets runs the Jolt transform in the event data flow before the events reach the Semantic Layer via RDF CRUD.
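The result of such a Jolt transform can be illustrated as follows: a plain JSON event is reshaped into JSON-LD by adding `@context`, `@id` and `@type`. The keys and context URIs below are illustrative, not taken from the talk:

```python
# Sketch of the output a Jolt "shift" step would produce: a flat JSON
# event turned into a JSON-LD document. URIs are illustrative.
def to_jsonld(event):
    return {
        '@context': {'text': 'http://example.org/text',
                     'lang': 'http://example.org/lang'},
        '@id': f"http://example.org/tweet/{event['id']}",
        '@type': 'http://example.org/Tweet',
        'text': event['text'],
        'lang': event['lang'],
    }

doc = to_jsonld({'id': 42, 'text': 'hello', 'lang': 'en'})
print(doc['@id'])   # → http://example.org/tweet/42
```

Because JSON-LD is itself an RDF serialization, the triple store can ingest the transformed event directly, without a separate mapping engine.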
  • 41. Summary – lessons learnt
    Graph schema:
    • Edges cannot cross graph schemas
    • Vertex or edge? Vertex or property?
    • Custom vertex IDs can be useful: they are idempotent and faster on insert (no ID generation needed), but can lead to hotspots
    Data ingestion:
    • Single vertex & edge inserts are slow => batching via a Groovy script speeds ingestion up
    • The DSE Graph Loader works well for batch data loading
  • 42. Technology on its own won't help you. You need to know how to use it properly.