Presentation of the Gradoop Framework at the Flink & Neo4j Meetup in Berlin (http://www.meetup.com/graphdb-berlin/events/228576494/). The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink. The talk also includes some benchmark results on scalability and a demo involving Neo4j, Flink and Gradoop (see www.gradoop.com)
Presentation of the Gradoop Framework at the Graph Database Meetup in Munich (https://www.meetup.com/inovex-munich/events/231187528/). The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink. The talk also includes some benchmark results on scalability (see www.gradoop.com)
Presentation of the Gradoop Framework at the GraphDevroom @FOSDEM 2016. The talk is about the extended property graph model, its operators and how they are implemented on top of Apache Flink, a distributed dataflow framework. The talk also includes a social network analysis example and some benchmark results on scalability. (see www.gradoop.com)
Slides from Matt Dowle's presentation at H2O Open Tour: NYC
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Meetup Big Data User Group Dresden: Gradoop - Scalable Graph Analytics with Apache Flink (Martin Junghanns)
The slides contain an overview of Gradoop, our framework for end-to-end graph analytics. We present our extended property graph data model and give an introduction to Apache Flink and its DataSet API. We show how our data model is mapped to Flink DataSets and how we implement graph operators using DataSet transformations. Furthermore, the slides contain information about two useful tools we developed around Gradoop: the Graph Definition Language (GDL) and ldbc-flink-import.
Information-Rich Programming in F# with Semantic Data (Steffen Staab)
Programming with rich data frequently implies that one needs to search for, understand, integrate, and program with new data, with each of these steps constituting a major obstacle to successful data use.
In this talk we will explain and demonstrate how our approach, LITEQ (Language Integrated Types, Extensions and Queries for RDF Graphs), which is realized as part of the F# / Visual Studio environment, supports the software developer. Using the extended IDE, the developer may now:
a. explore new, previously unseen data sources, which are either natively in RDF or mapped into RDF;
b. use the exploration of schemata and data to construct types and objects in the F# environment;
c. automatically map between data and programming-language objects in order to make them persistent in the data source;
d. have extended typing functionality added to the F# environment, resulting from the exploration of the data source and its mapping into F#.
Core to this approach is the novel node path query language, NPQL, which allows for interactive, intuitive exploration of data schemata and the data proper, as well as for the mapping and definition of types, object collections, and individual objects. Beyond the existing type provider mechanism for F#, our approach also allows for property-based navigation and runtime querying for data objects.
Linked geospatial data has recently received attention, as researchers and practitioners have started tapping the wealth of geospatial information available on the Web. Incomplete geospatial information, although appearing often in the applications captured by such datasets, is not represented and queried properly due to the lack of appropriate data models and query languages. We discuss our recent work on the model RDFi, an extension of RDF with the ability to represent property values that exist, but are unknown or partially known, using constraints, and an extension of the query language SPARQL with qualitative and quantitative geospatial querying capabilities. We demonstrate the usefulness of RDFi in geospatial Semantic Web applications by giving examples and comparing the modeling capabilities of RDFi with the ones of related Semantic Web systems.
The world around us is full of connected information. Neo4j was originally developed to solve two complex "network" problems in a document management system, because it was too hard to manage rich connection information efficiently in traditional and newer "NoSQL" databases. During this meetup, we will talk about the technology and about the journey that a couple of technologists from Malmö took. You will learn:
* how Neo Technology grew from just the three founders into a global database company with use cases in every domain imaginable;
* how focusing on customer and community feedback allows us to provide a solution for managing connected data to everyone, not just the large internet companies.
Of course we will also introduce the graph model, its whiteboard friendliness, and how to get started with Neo4j and its easy and powerful query language Cypher. We'll also compare the graph and relational data models to see how they differ in shape and capabilities. Finally, we discuss the foundations that enable graph databases to provide higher join performance, faster development processes, and more inclusive software for all stakeholders. With use cases from gaming, dating, and finance, we'll see how to apply graph capabilities to these domains to realize new functionality or opportunities that were not possible before.
Finally, if there's a question you've always wanted to ask/discuss, we'll have plenty of time for that at the end of Michael's presentation.
Save queries as annotations: a method for the digital preservation of queries on a Hebrew text database enriched with linguistic information. These queries form the data for interpretations by biblical scholars. Sharing those queries as Open Annotation enables researchers to communicate their (intermediate) results.
Jeff will showcase sparklyr, the new R package to interface with Spark, and talk about the different extensions, including the rsparkling ML package.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
We will cover Apache Spark's Machine Learning Library (MLlib), focusing on using Spark for recommender systems.
MLlib is a library built on top of Spark's engine which allows us to train, test, validate and operationalize machine learning models while working with lots of data in a convenient way thanks to its robust abstractions over data sets.
Find out how you can use MLlib to build product recommendation systems by employing both traditional ML techniques such as collaborative filtering, as well as more novel, deep-learning approaches which make use of Neural Networks.
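The collaborative-filtering idea behind MLlib's recommenders can be sketched in plain Python. This is a toy matrix factorization trained by stochastic gradient descent, not MLlib's actual ALS implementation; the ratings matrix and hyperparameters are invented for the example:

```python
import random

# Toy ratings: (user, item) -> rating on a 1-5 scale.
ratings = {
    (0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
    (1, 2): 1.0, (2, 1): 4.0, (2, 2): 5.0,
}

def factorize(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
              epochs=2000, seed=0):
    """Learn user/item latent factors by SGD on squared error."""
    rnd = random.Random(seed)
    U = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_users)]
    V = [[rnd.uniform(0.1, 0.9) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = r - pred
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # regularized SGD step
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V

def predict(U, V, u, i):
    """Predicted rating is the dot product of the latent factors."""
    return sum(uf * vf for uf, vf in zip(U[u], V[i]))

U, V = factorize(ratings, n_users=3, n_items=3)
```

Unobserved (user, item) cells of the reconstructed matrix then serve as recommendation scores; real systems would hold out ratings to validate them.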
LDQL: A Query Language for the Web of Linked Data (Olaf Hartig)
I used this slideset to present our research paper at the 14th Int. Semantic Web Conference (ISWC 2015). Find a preprint of the paper here:
http://olafhartig.de/files/HartigPerez_ISWC2015_Preprint.pdf
Ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. Practitioners may prefer ensemble algorithms when model performance is valued above other factors such as model complexity and training time. The Super Learner algorithm, also called "stacking", learns the optimal combination of the base learner fits. The latest version of H2O now contains a "Stacked Ensemble" method, which allows the user to stack H2O models into a Super Learner. The Stacked Ensemble method is the native H2O version of stacking, previously only available in the h2oEnsemble R package, and now enables stacking from all the H2O APIs: Python, R, Scala, etc.
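The stacking recipe described above can be sketched in plain Python. This is a toy illustration, not H2O's Stacked Ensemble API; the two base learners and the data are invented for the example. The steps are: train base learners, collect their out-of-fold predictions, then fit a meta-learner on those predictions:

```python
# Toy data, roughly y = 2x + 1; two weak base models disagree.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.9, 5.1, 7.0, 9.1, 10.9]

def base_mean(train_x, train_y):
    """Base learner 1: always predicts the training mean."""
    m = sum(train_y) / len(train_y)
    return lambda x: m

def base_slope(train_x, train_y):
    """Base learner 2: line through the origin, least-squares slope."""
    s = sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)
    return lambda x: s * x

def stack(xs, ys, bases):
    # Leave-one-out (out-of-fold) predictions for each base learner.
    Z = []
    for i in range(len(xs)):
        tx, ty = xs[:i] + xs[i+1:], ys[:i] + ys[i+1:]
        Z.append([b(tx, ty)(xs[i]) for b in bases])
    # Meta-learner: two-weight least squares via the normal equations.
    a11 = sum(z[0] * z[0] for z in Z)
    a12 = sum(z[0] * z[1] for z in Z)
    a22 = sum(z[1] * z[1] for z in Z)
    b1 = sum(z[0] * y for z, y in zip(Z, ys))
    b2 = sum(z[1] * y for z, y in zip(Z, ys))
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - b2 * a12) / det
    w2 = (a11 * b2 - a12 * b1) / det
    models = [b(xs, ys) for b in bases]  # refit base learners on all data
    return lambda x: w1 * models[0](x) + w2 * models[1](x)

model = stack(xs, ys, [base_mean, base_slope])
```

Fitting the meta-learner on out-of-fold rather than in-sample predictions is the key design choice: it keeps the combination weights from simply rewarding the base model that overfits the most.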
Erin is a Statistician and Machine Learning Scientist at H2O.ai. Before joining H2O, she was the Principal Data Scientist at Wise.io (acquired by GE Digital) and Marvin Mobile Security (acquired by Veracode) and the founder of DataScientific, Inc. Erin received her Ph.D. from the University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation, and statistical computing.
This is my Deep Water talk for the TensorFlow Paris meetup.
Deep Water is H2O's integration with multiple open source deep learning libraries such as TensorFlow, MXNet, and Caffe. On top of the performance gains from GPU backends, Deep Water naturally inherits all H2O properties in scalability, ease of use, and deployment.
Medical Heritage Library (MHL) on ArchiveSpark (Helge Holzmann)
This presentation gives an introduction to ArchiveSpark and the recent extension to use it with any archival collection. The slides demonstrate how to set it up and use it for analyzing data from medical journals of the Medical Heritage Library (MHL).
Suneel Marthi - Deep Learning with Apache Flink and DL4J (Flink Forward)
http://flink-forward.org/kb_sessions/deep-learning-with-apache-flink-and-dl4j/
Deep learning has become very popular over the last few years in areas such as image recognition, fraud detection, and machine translation, and has proved to be very useful for handling unstructured data and extracting value from it. A big challenge in building deep learning models has been the high cost of training them. With the recent advent of distributed frameworks like Apache Flink and Apache Spark, it is now faster to train deep learning models in parallel on modern platform architectures. In this talk, we'll show how to use Apache Flink Streaming with the open source deep learning framework DeepLearning4j to perform large-scale deep learning model training. We will show a demo of a recurrent neural net that is trained for language modeling and have it generate text.
Discover what's new in the Neo4j community for the week of 14 October 2017, including projects around Ethereum, graph visualization, & recommender systems.
This week Will and I interviewed Ward Cunningham as part of the Neo4j online meetup and we launched the first version of the much awaited Kafka Connector. Neo4j 3.5 was also released and Jennifer kicked off an exciting series of posts on the Marvel Universe.
Discover what's new in the Neo4j community for the week of 7 October 2017, including projects around Data Science, Facebook, and Natural Language Processing.
HDF Augmentation: Interoperability in the Last Mile (Ted Habermann)
Science data files are generally written to serve well-defined purposes for a small science team. In many cases, the organization of the data and the metadata is designed for custom tools developed and maintained by and for the team. Using these data outside of this context often involves restructuring, re-documenting, or reformatting the data. This expensive and time-consuming process usually prevents data reuse and thus decreases the total life-cycle value of the data considerably. If the data are unique or critically important to solving a particular problem, they can be modified into a more generally usable form, or metadata can be added in order to enable reuse. This augmentation process can be done to enhance data for the intended purpose or for a new purpose, to make the data available to new tools and applications, to make the data more conventional or standard, or to simplify preservation of the data. The HDF Group has addressed augmentation needs in many ways: by adding extra information, by renaming objects or moving them around in the file, by reducing the complexity of the organization, and sometimes by hiding data objects that are not understood by specific applications. In some cases these approaches require re-writing the data into new files; in other cases augmentation can be done externally, without affecting the original file. We will describe and compare several examples of each approach.
Predicting Influence and Communities Using Graph Algorithms (Databricks)
Relationships are one of the most predictive indicators of behavior and preferences. Community detection based on relationships is a powerful tool for inferring similar preferences in peer groups, anticipating future behavior, estimating group resiliency, finding hierarchies, and preparing data for other analysis. Centrality measures based on relationships identify the most important items in a network and help us understand group dynamics such as influence, accessibility, the speed at which things spread, and bridges between groups. Data scientists use graph algorithms to identify groups and estimate important entities based on their interactions. In this session, we'll cover the common uses of community detection and centrality measures and how some of the iconic graph algorithms compute values. We'll show examples of how to run community detection and centrality algorithms in Apache Spark, including using the AggregateMessages function to add your own algorithms. You'll learn best practices and tips for tricky situations. For those that want to run graph algorithms in a graph platform, we'll also illustrate a few examples in Neo4j. Some of the algorithms included:
* Triangle Count and Clustering Coefficient to estimate network cohesiveness
* Strongly Connected Components and Connected Components to find clusters
* Label Propagation to quickly infer groups and clean data with semi-supervised learning
* Louvain Modularity to uncover group hierarchies
* Balanced Triad to identify unstable groups
* PageRank to reveal influencers
* Betweenness Centrality to predict bottlenecks and bridges
Authors: Amy Hodler, Sören Reichardt
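The Label Propagation step in the list above can be sketched in plain Python. This is a deterministic toy version on a hard-coded adjacency list, not Spark's or Neo4j's implementation; the graph, sweep order, and tie-breaking rule are illustrative:

```python
from collections import Counter

# Two 4-cliques joined by a single bridge edge (3-4).
graph = {
    0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2, 4],
    4: [3, 5, 6, 7], 5: [4, 6, 7], 6: [4, 5, 7], 7: [4, 5, 6],
}

def label_propagation(graph, rounds=20):
    """Each node repeatedly adopts the most frequent label among its
    neighbors. Toy determinism: nodes are swept in sorted order and
    ties go to the larger label (real implementations randomize both)."""
    labels = {n: n for n in graph}  # every node starts as its own community
    for _ in range(rounds):
        changed = False
        for n in sorted(graph):
            counts = Counter(labels[m] for m in graph[n])
            best = max(counts, key=lambda l: (counts[l], l))
            if labels[n] != best:
                labels[n] = best
                changed = True
        if not changed:  # converged: no node wants to switch
            break
    return labels

labels = label_propagation(graph)
```

On this graph the two cliques settle on distinct labels, with the bridge edge unable to pull either side over, which is the behavior the algorithm exploits at scale.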
Vancouver part 1: Intro to Elasticsearch and Kibana - Beginner's Crash Course ... (UllyCarolinneSampaio)
Elasticsearch is known as the heart of the Elastic Stack, which consists of Beats, Logstash, Elasticsearch, and Kibana. The Elastic Stack allows us to take data from any source, in any format, then search, analyze, and visualize it in real time.
If you are a developer who is looking to make data usable in real time and at scale, Elasticsearch and Kibana are great tools to have on your belt.
This week I had fun with the online meetup on similarity algorithms with Tomaz Bratanic. I came across a great post written by Adrien Sales showing how to analyse PostgreSQL metadata using Neo4j and learned a neat approach to ingesting data into Neo4j using Kafka Streams and GraphQL.
K-CAI NEURAL API is a Keras-based neural network API for machine learning that lets you prototype with many of TensorFlow's possibilities. Python, Free Pascal, and Delphi together in Google Colab, Git, or the Community Edition.
Extending the Stream/Table Duality into a Trinity, with Graphs (David Allen & ..., Confluent)
The stream/table duality in Kafka lets us look at our data in two different ways, whichever is more convenient for our use. But what about when the connections between the data points add much more value to our data? For this, we need to look at our data as a graph. Graphs help drive financial fraud investigations, social media analyses, network & IT management use cases, recommendation engines, and knowledge management. These are all cases where patterns of interaction in your data (for example, a pattern of structured financial transactions) matter more than the individual data points (a single transfer). We'll cover how to easily transform Kafka streams or tables into graphs and query them declaratively using Cypher or GraphQL. In graph shape, we can enrich our social network streams with powerful graph algorithms that tell us about user and event influence through graph centrality, then stream the results back to Kafka. Stream/table duality becomes the stream/table/graph trinity. We will demonstrate the trinity by:
- Getting started with regular Kafka streams
- Using Confluent Hub's Neo4j sink
- Exposing query-able graphs with Cypher & GraphQL
- Analyzing data with Neo4j's graph algorithms
- Transforming graphs back into streams
The trinity means not choosing between representations, but using the best one for your use case. We'll demonstrate how it can be used to tackle social network analysis problems and discuss how the approach can be extended to real-time financial fraud detection and more.
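The stream-to-graph idea can be sketched in plain Python: fold a stream of interaction events into a graph view, then run a graph query over it. This is a toy stand-in; Kafka, the Neo4j sink, and Cypher are replaced by illustrative data structures, and the event list is invented for the example:

```python
from collections import defaultdict

# A toy "stream" of interaction events (who -> whom), as a broker might deliver them.
events = [
    ("alice", "bob"), ("alice", "carol"), ("bob", "carol"),
    ("dave", "alice"), ("carol", "alice"),
]

def fold_stream_into_graph(events):
    """Materialize the event stream as a graph view (adjacency sets)."""
    graph = defaultdict(set)
    for src, dst in events:
        graph[src].add(dst)
        graph[dst].add(src)  # treat interactions as undirected edges
    return graph

def degree_centrality(graph):
    """Simplest 'influence' score: fraction of other nodes each node touches."""
    n = len(graph)
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}

graph = fold_stream_into_graph(events)
scores = degree_centrality(graph)
```

The centrality scores could then be published back to the stream, which is the round trip the trinity describes: stream in, graph view, enriched stream out.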
Implementing the FRBR Conceptual Model in the Variations Music Discovery System (Jenn Riley)
Riley, Jenn, Paul McElwain and Alex Berry. "Implementing the FRBR Conceptual Model in the Variations Music Discovery System." Digital Library Program Brown Bag Presentation, October 28, 2009.
This week we have hierarchical community detection using Louvain and a Case Law Network Graph. We also learn how to create a schema.org linked data endpoint on Neo4j, handle authorization in the GRANDstack, learn taxonomies from user-tagged data, and more!
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to keep growing and supply to keep evolving, facilitated by institutional investment rotating out of offices and into work-from-home ("WFH") arrangements, while the need for data storage expands ever further as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, as represented by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments; MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow to more than 3.6x their current value by 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
The first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
The effect of service quality and online reviews on customer loyalty in the E...
Gradoop: Scalable Graph Analytics with Apache Flink @ Flink & Neo4j Meetup Berlin
1. GRADOOP: Scalable Graph Analytics with Apache Flink
Martin Junghanns (@kc1s)
Apache Flink and Neo4j Meetup Berlin
2. About the speaker and the team
André (PhD Student), Martin (PhD Student), Kevin (M.Sc. Student), Niklas (M.Sc. Student), Prof. Dr. Erhard Rahm (Database Chair)
10. „Graphs can be analyzed“
Apache Flink and Neo4j Meetup Berlin 10
Assuming a social network (heterogeneous data):
1. Determine subgraph • Apply graph transformation
2. Find communities • Handle collections of graphs
3. Filter communities • Aggregation, Selection
4. Find common subgraph • Apply dedicated algorithm
24. „And let‘s not forget …“
Apache Flink and Neo4j Meetup Berlin 24
26. „A framework and research platform for efficient,
distributed and domain independent management
and analytics of heterogeneous graph data.“
Apache Flink and Neo4j Meetup Berlin 26
33. Apache Flink and Neo4j Meetup Berlin 29
Extended Property Graph Model (EPGM)
34. Extended Property Graph Model
Apache Flink and Neo4j Meetup Berlin 30
• Vertices and directed Edges
• Logical Graphs
• Identifiers
• Type Labels
• Properties
[Example diagram: two logical graphs, 1|Community|interest:Heavy Metal and 2|Community|interest:Hard Rock. Graph 1 contains Person "Alice" (born 1984), Band "Metallica" (founded 1981) and Person "Bob"; graph 2 contains Bob, Band "AC/DC" (founded 1973) and Person "Eve". Edges are typed likes (since 2013–2015) and knows.]
96. Flink DataSet API
Apache Flink and Neo4j Meetup Berlin 50
• DataSet := Distributed Collection of Data Objects
• Transformation := Operation on DataSets
• Flink Program := Composition of Transformations
[Diagram: DataSets flowing through Transformations into new DataSets, composed into a Flink Program]
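The map/filter-style composition of transformations described above can be imitated locally with the JDK Stream API (a local analogy only; real Flink DataSet programs are lazily compiled into a distributed dataflow):

```java
import java.util.List;
import java.util.stream.Collectors;

public class DataflowAnalogy {
    public static void main(String[] args) {
        // A "DataSet" of raw records ...
        List<String> names = List.of("Alice", "Bob", "Eve", "Metallica");

        // ... pushed through a composition of transformations,
        // analogous to dataSet.filter(...).map(...) in Flink.
        List<String> shouted = names.stream()
            .filter(n -> n.length() <= 3)   // transformation 1: filter
            .map(String::toUpperCase)       // transformation 2: map
            .collect(Collectors.toList());  // terminal step materializes the result

        System.out.println(shouted); // [BOB, EVE]
    }
}
```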
110. Graph Representation
Apache Flink and Neo4j Meetup Berlin 52
[Example diagram: the two Community graphs from the EPGM example, shown next to their DataSet representation]
DataSet<EPGMGraphHead>
Id Label Properties
1 Community {interest:Heavy Metal}
2 Community {interest:Hard Rock}
DataSet<EPGMVertex>
Id Label Properties Graphs
1 Person {name:Alice, born:1984} {1}
2 Band {name:Metallica,founded:1981} {1}
3 Person {name:Bob} {1,2}
4 Band {name:AC/DC,founded:1973} {2}
5 Person {name:Eve} {2}
DataSet<EPGMEdge>
Id Label Source Target Properties Graphs
1 likes 1 2 {since:2014} {1}
2 likes 3 2 {since:2013} {1}
3 likes 3 4 {since:2015} {2}
4 knows 3 5 {} {2}
5 likes 5 4 {since:2014} {2}
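The vertex table above can be modeled with a plain Java record standing in for EPGMVertex (a simplified local sketch, not the Gradoop type): because every vertex carries its set of graph ids, "which vertices belong to logical graph 2?" becomes a simple filter over the collection, just like a Flink filter transformation.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class GraphRepresentation {
    // Simplified stand-in for Gradoop's EPGMVertex: id, label, graph membership.
    record Vertex(long id, String label, Set<Long> graphs) {}

    public static void main(String[] args) {
        List<Vertex> vertices = List.of(
            new Vertex(1, "Person", Set.of(1L)),
            new Vertex(2, "Band",   Set.of(1L)),
            new Vertex(3, "Person", Set.of(1L, 2L)),
            new Vertex(4, "Band",   Set.of(2L)),
            new Vertex(5, "Person", Set.of(2L)));

        // Select the vertices of logical graph 2 via their Graphs column.
        Set<Long> inGraph2 = vertices.stream()
            .filter(v -> v.graphs().contains(2L))
            .map(Vertex::id)
            .collect(Collectors.toSet());

        System.out.println(inGraph2); // the ids 3, 4 and 5
    }
}
```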
117. Operator Implementation – Exclusion
Apache Flink and Neo4j Meetup Berlin 54
[Example diagram: the two Community graphs; excluding graph 2 from graph 1 keeps only the elements of graph 1 that are not contained in graph 2]
// input: firstGraph (G[1]), secondGraph (G[2])
DataSet<GradoopId> graphId = secondGraph.getGraphHead()
  .map(new Id<G>());

DataSet<V> newVertices = firstGraph.getVertices()
  .filter(new NotInGraphBroadCast<V>())
  .withBroadcastSet(graphId, GRAPH_ID);

DataSet<E> newEdges = firstGraph.getEdges()
  .filter(new NotInGraphBroadCast<E>())
  .withBroadcastSet(graphId, GRAPH_ID)
  .join(newVertices)
  .where(new SourceId<E>()).equalTo(new Id<V>())
  .with(new LeftSide<E, V>())
  .join(newVertices)
  .where(new TargetId<E>()).equalTo(new Id<V>())
  .with(new LeftSide<E, V>());
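The same exclusion logic can be checked on the toy data with plain collections (a single-machine sketch; the broadcast set and the two joins in the Flink code are what make it distributed). Keep the vertices of the first graph that are not in the second, then keep the edges of the first graph whose source and target both survived:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ExclusionSketch {
    record Vertex(long id, Set<Long> graphs) {}
    record Edge(long id, long source, long target, Set<Long> graphs) {}

    public static void main(String[] args) {
        long first = 1L, second = 2L; // exclude graph 2 from graph 1
        List<Vertex> vertices = List.of(
            new Vertex(1, Set.of(1L)), new Vertex(2, Set.of(1L)),
            new Vertex(3, Set.of(1L, 2L)), new Vertex(4, Set.of(2L)),
            new Vertex(5, Set.of(2L)));
        List<Edge> edges = List.of(
            new Edge(1, 1, 2, Set.of(1L)), new Edge(2, 3, 2, Set.of(1L)),
            new Edge(3, 3, 4, Set.of(2L)), new Edge(4, 3, 5, Set.of(2L)),
            new Edge(5, 5, 4, Set.of(2L)));

        // Vertices of the first graph NOT contained in the second graph
        // (the role of NotInGraphBroadCast + withBroadcastSet above).
        Set<Long> newVertexIds = vertices.stream()
            .filter(v -> v.graphs().contains(first) && !v.graphs().contains(second))
            .map(Vertex::id)
            .collect(Collectors.toSet());

        // Surviving edges: not in the second graph, and both endpoints kept
        // (the role of the two joins on SourceId/TargetId above).
        List<Long> newEdgeIds = edges.stream()
            .filter(e -> e.graphs().contains(first) && !e.graphs().contains(second))
            .filter(e -> newVertexIds.contains(e.source()) && newVertexIds.contains(e.target()))
            .map(Edge::id)
            .collect(Collectors.toList());

        System.out.println(newVertexIds); // vertices 1 and 2 (Alice, Metallica)
        System.out.println(newEdgeIds);   // edge 1 only: Bob's vertex was removed
    }
}
```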
121. Operator Implementation – Exclusion
Apache Flink and Neo4j Meetup Berlin 55
graphId = secondGraph.getGraphHead()
  .map(new Id<G>());
// graph head of the second graph:
Id Label Properties
2 Community {interest:Hard Rock}
// mapped to its id:
Id
2
newVertices = firstGraph.getVertices()
  .filter(new NotInGraphBroadCast<V>())
  .withBroadcastSet(graphId, GRAPH_ID);
// vertices of the first graph:
Id Label Properties Graphs
1 Person {name:Alice} {1}
2 Band {name:Metallica,founded:1981} {1}
3 Person {name:Bob} {1,2}
// after filtering out the vertices contained in graph 2:
Id Label Properties Graphs
1 Person {name:Alice} {1}
2 Band {name:Metallica,founded:1981} {1}
150. Social Network Benchmark
Apache Flink and Neo4j Meetup Berlin 61
1. Extract subgraph containing only Persons and knows relations
2. Transform Persons to necessary information
3. Find communities using Label Propagation
4. Aggregate vertex count for each community
5. Select communities with more than 50K users
6. Combine large communities to a single graph
7. Group graph by Persons' location and gender
8. Aggregate vertex and edge count of grouped graph
http://www.ldbcouncil.org/
https://git.io/vgozj
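Steps 7 and 8 of the workflow boil down to a group-and-count over vertices. On toy data the effect can be sketched with plain Java (a local illustration, not the Gradoop grouping operator; the Person fields are invented for the example):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class GroupingSketch {
    record Person(String name, String city, String gender) {}

    public static void main(String[] args) {
        List<Person> persons = List.of(
            new Person("Alice", "Leipzig", "f"),
            new Person("Bob",   "Leipzig", "m"),
            new Person("Carol", "Berlin",  "f"),
            new Person("Dave",  "Leipzig", "m"));

        // Steps 7/8: group vertices by (location, gender), aggregate the
        // vertex count per group; each group becomes one summary vertex.
        Map<String, Long> grouped = persons.stream()
            .collect(Collectors.groupingBy(
                p -> p.city() + "|" + p.gender(),
                Collectors.counting()));

        System.out.println(grouped.get("Leipzig|m")); // 2
        System.out.println(grouped.get("Berlin|f"));  // 1
    }
}
```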