When a single graph isn’t enough
Chief Innovation Officer
FRANK SMIT
Co-founder and CEO
“The number one tool for social
media monitoring, webcare,
publishing & social analytics”
Founded in 2011
Located in Zaandam,
Netherlands
25 employees
Over 700 customers in 8
countries
Collect millions of messages
on a daily basis
Twitter, Facebook, Instagram,
Pinterest, LinkedIn, Youtube,
Google+, news sites, blogs and
fora
Data
https://dribbble.com/shots/1233464-24-Free-Flat-Social-Icons
“We develop AI and data
applications for organisations”
Founded in 2015
Located in Zaandam,
Netherlands
5 employees
12 customers
Different companies with different use cases
and therefore different graphs and challenges
But we want ONE solution!
How shareable is my
message?
Given a campaign, who are the
influencers?
Which of our followers ask
questions to our competitors?
Community detection
Social graph
http://www.scribblelive.com/blog/2013/10/30/movie-galaxies-uses-
social-graph-organization-to-visualize-movie-interconnectedness/
People have multiple social
media accounts
Querying persons instead of
accounts could be very
valuable
Social account graph
Customers look at products,
review products, buy products,
etc
By combing the customer
graph with social graph, better
segmentation is possible
Customer graph
https://cdn.graphgrid.com/content/uploads/2016/04/04125950/ConnectedCustomer.png
Graph can be stored in
different storage systems
Graph connectors (like data
connectors in spark)
Storages
Software as a Service (SaaS)
Keep company private data
safe
Make sure that customer X
cannot query data from
customer Y
Security
https://privacy.google.com/images/animations/your-security/last-frame-1.svg
High volume: billion
connections collected already
since the start
High velocity: about 100
messages a second
Two V’s
https://media.licdn.com/mpr/mpr/shrinknp_800_800/AAEAAQAAAAAAAAVQAAAAJDUwOGNmZjgxLTBjODQtNGUyMi05ZWUyLTVhY2RhMTU3OGFlYQ.jpg
https://www.extrasrl.it/hs-fs/hubfs/New_Website/New_Color_Background/31_percent.png?t=1489767435682&width=320&name=31_percent.png
Requirements
1. SaaS to allow for online graph analytics
2. Scalable architecture so that multipe customers could query the data at the
same time
3. Different kind of graphs in the graph space
4. Keep the private data secure and separated from the rest
MULTI NODE vs SINGLE NODE
Titan had trouble loading the
data into its graph format
MonetDB had trouble
performing the actual graph-
like queries
Virtuoso proved to be stable
even under high data load
Spark was not always the
fastest but scaled very well
Benchmark results
Our first prototype consists of
an API on top of Spark
Queries are processed by the
API and scala code is
generated to be performed on
Spark
Graphs can be stored in
ElasticSearch, Cassandra and
on disk
General architecture using
Spark
Namespaces to keep the data
model as general as possible
to cope with the different
graphs
Data definitions
Data model {
"_namespace": "com.obi4wan.social",
"_types": [
{
"_type": "message",
"_fields": {
"content": {
"_type": "generic.message"
},
"date": {
"_type": "generic.datetime"
},
"hashtags" : {
"_type": "com.obi4wan.social.hashtag",
"_structure": "list"
},
"author": {
"_type": "com.obi4wan.social.account"
}
}
}
]
}
JSON base query language for
defining query steps
search: search using
elasticsearch
enrich: join previous step on
subgraph
Query plan
{
"queryplan": [
{
"graph": {
"v": "com.obi4wan.social.message"
}
},
{
"search": {
"field": "com.obi4wan.social.message.content",
"query": "fire OR smoke"
}
},
{
"enrich": {
"type": "com.obi4wan.social.account",
"on": {
"old": "com.obi4wan.social.message.author",
"nw": "com.obi4wan.social.account.url"
}
}
},
{
"enrich": {
"type": "com.obilytics.people.account",
"on": {
"old": "com.obi4wan.social.account.url",
"nw": "com.obilytics.people.account.url"
}
}
}
]
}
Next steps and remaining challenges
1. Dataframes are immutable, how to update data in realtime (indexedRDDs)
2. Search is now done through ElasticSearch, would be nice to do that using a
Spark only solution
3. Query language is limited, use Cypher
Questions?

Use case: processing multiple graphs

  • 1.
    When a singlegraph isn’t enough
  • 2.
    Chief Innovation Officer FRANKSMIT Co-founder and CEO
  • 3.
    “The number onetool for social media monitoring, webcare, publishing & social analytics” Founded in 2011 Located in Zaandam, Netherlands 25 employees Over 700 customers in 8 countries
  • 4.
    Collect millions ofmessages on a daily basis Twitter, Facebook, Instagram, Pinterest, LinkedIn, Youtube, Google+, news sites, blogs and fora Data https://dribbble.com/shots/1233464-24-Free-Flat-Social-Icons
  • 5.
    “We develop AIand data applications for organisations” Founded in 2015 Located in Zaandam, Netherlands 5 employees 12 customers
  • 6.
    Different companies withdifferent use cases and therefore different graphs and challenges But we want ONE solution!
  • 7.
    How shareable ismy message? Given a campaign, who are the influencers? Which of our followers ask questions to our competitors? Community detection Social graph http://www.scribblelive.com/blog/2013/10/30/movie-galaxies-uses- social-graph-organization-to-visualize-movie-interconnectedness/
  • 8.
    People have multiplesocial media accounts Querying persons instead of accounts could be very valuable Social account graph
  • 9.
    Customers look atproducts, review products, buy products, etc By combing the customer graph with social graph, better segmentation is possible Customer graph https://cdn.graphgrid.com/content/uploads/2016/04/04125950/ConnectedCustomer.png
  • 10.
    Graph can bestored in different storage systems Graph connectors (like data connectors in spark) Storages
  • 11.
    Software as aService (SaaS) Keep company private data safe Make sure that customer X cannot query data from customer Y Security https://privacy.google.com/images/animations/your-security/last-frame-1.svg
  • 12.
    High volume: billion connectionscollected already since the start High velocity: about 100 messages a second Two V’s https://media.licdn.com/mpr/mpr/shrinknp_800_800/AAEAAQAAAAAAAAVQAAAAJDUwOGNmZjgxLTBjODQtNGUyMi05ZWUyLTVhY2RhMTU3OGFlYQ.jpg https://www.extrasrl.it/hs-fs/hubfs/New_Website/New_Color_Background/31_percent.png?t=1489767435682&width=320&name=31_percent.png
  • 13.
    Requirements 1. SaaS toallow for online graph analytics 2. Scalable architecture so that multipe customers could query the data at the same time 3. Different kind of graphs in the graph space 4. Keep the private data secure and separated from the rest
  • 14.
    MULTI NODE vsSINGLE NODE
  • 15.
    Titan had troubleloading the data into its graph format MonetDB had trouble performing the actual graph- like queries Virtuoso proved to be stable even under high data load Spark was not always the fastest but scaled very well Benchmark results
  • 16.
    Our first prototypeconsists of an API on top of Spark Queries are processed by the API and scala code is generated to be performed on Spark Graphs can be stored in ElasticSearch, Cassandra and on disk General architecture using Spark
  • 17.
    Namespaces to keepthe data model as general as possible to cope with the different graphs Data definitions Data model { "_namespace": "com.obi4wan.social", "_types": [ { "_type": "message", "_fields": { "content": { "_type": "generic.message" }, "date": { "_type": "generic.datetime" }, "hashtags" : { "_type": "com.obi4wan.social.hashtag", "_structure": "list" }, "author": { "_type": "com.obi4wan.social.account" } } } ] }
  • 18.
    JSON base querylanguage for defining query steps search: search using elasticsearch enrich: join previous step on subgraph Query plan { "queryplan": [ { "graph": { "v": "com.obi4wan.social.message" } }, { "search": { "field": "com.obi4wan.social.message.content", "query": "fire OR smoke" } }, { "enrich": { "type": "com.obi4wan.social.account", "on": { "old": "com.obi4wan.social.message.author", "nw": "com.obi4wan.social.account.url" } } }, { "enrich": { "type": "com.obilytics.people.account", "on": { "old": "com.obi4wan.social.account.url", "nw": "com.obilytics.people.account.url" } } } ] }
  • 19.
    Next steps andremaining challenges 1. Dataframes are immutable, how to update data in realtime (indexedRDDs) 2. Search is now done through ElasticSearch, would be nice to do that using a Spark only solution 3. Query language is limited, use Cypher
  • 20.