This document describes the LDBC Social Network Benchmark (SNB). The SNB was created to benchmark graph databases and RDF stores using a social network scenario. It includes a data generator that produces synthetic social network data with correlations between attributes. It also defines workloads of interactive, business intelligence, and graph analytics queries. Systems are evaluated based on metrics like query response times. The SNB aims to uncover performance choke points for different database systems and drive improvements in graph processing capabilities.
Ethics & (Explainable) AI – Semantic AI & the Role of the Knowledge Scientist (Stratos Kontopoulos)
Presentation for the NexTech Experts Panel II during the NexTech 2021 Congress (https://www.iaria.org/conferences2021/NexTech21.html).
Discusses the emerging and versatile role of the Knowledge Scientist in designing and developing explainable Semantic AI applications.
Detecting eCommerce Fraud with Neo4j and Linkurious (Neo4j)
Last year, the global eCommerce market represented $1.9 trillion. As the market expands worldwide, the opportunity for fraud keeps growing, with fraudsters constantly refining their tactics to outsmart anti-fraud frameworks. From chargeback fraud to re-shipping scams or identity fraud, numerous types of fraud can impact your organization. While collecting data is essential to enable real-time risk assessment, many organizations don’t have the necessary tools to find the insights needed to block fraud attempts.
Neo4j and Linkurious offer a solution to tackle the eCommerce fraud challenge. Their combined technologies provide a 360° overview of an organization’s data and allow real-time analysis and detection of eCommerce fraud patterns and activities.
In this webinar, you will learn about:
- The current trends in eCommerce fraud and the risks for organizations;
- The challenges of detecting fraud attempts in real time and the advantages of the graph approach;
- How to use Linkurious’ graph visualization and analysis software to prevent and investigate eCommerce fraud.
The Future is Big Graphs: A Community View on Graph Processing Systems (Neo4j)
Alexandru Iosup, Full Professor, Vrije Universiteit Amsterdam (VU Amsterdam)
Angela Bonifati, Full Professor of Computer Science, Université de Lyon
Hannes Voigt, Software Engineer, Neo4j
AI, ML and Graph Algorithms: Real Life Use Cases with Neo4j (Ivan Zoratti)
I gave this presentation at DataOps 19 in Barcelona.
You will find information about Neo4j and how to use it with Graph Algorithms for Machine Learning and Artificial Intelligence.
GraphTour London 2020 - Graphs for AI, Amy Hodler (Neo4j)
The document discusses how graph data science and network analysis can be used to make better predictions by incorporating relationship data. It outlines the steps of graph data science, including building knowledge graphs, performing graph analytics using queries and algorithms, engineering graph features, and applying graph embeddings and machine learning. The use of graph techniques is shown to improve outcomes in applications such as predictive maintenance, fraud detection, and recommendations.
Social media monitoring with ML-powered Knowledge Graph (GraphAware)
Ever wondered how ML can be used to build a Knowledge Graph that allows businesses to successfully differentiate and compete today? We will show how Computer Vision, NLP/NLU, knowledge enrichment and graph-native algorithms fit together to build powerful insights from various unstructured data sources.
About the speakers:
Vlasta Kus - Lead Data Scientist at GraphAware - Machine Learning, Deep Learning and Natural Language Processing expert.
Background in particle physics research at CERN. 10+ years of experience in software development (C/C++, Java, Python) and statistical data analysis.
Neo4j certified professional.
Specialised in using Machine Learning for building Knowledge Graphs (Hume @ GraphAware).
Golven Leroy - Student - I am an engineering student who is interested in everything graph. I love travelling and good food, especially when it is cheese-related and accompanied by good wine. Wannabe Gyro Gearloose, early-age Spider-Man fan, and beatmaker in my free time.
NODES 2019 - Neo4j Online Developer Expo & Summit - 10th October 2019
1) The document discusses how MATLAB can be used for big data analytics, machine learning, and deploying analytic applications.
2) It presents an agenda covering introduction to MATLAB for data analytics, big data, machine learning/deep learning, and deploying analytics.
3) MATLAB allows domain experts to perform tasks like data cleaning, exploration, machine learning using familiar MATLAB functions and interfaces while handling large datasets.
This document provides an agenda and overview for a graph data science demo focusing on fraud analysis. The demo will review Neo4j's graph data science library and algorithms for pathfinding, centrality, community detection, and similarity. It will use sample bank transaction and customer data modeled as a graph to demonstrate PageRank, betweenness centrality, weakly connected components, Louvain modularity, and node similarity algorithms. The goal is to identify important nodes, communities, and similar entities to detect potential fraud.
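As a rough sketch of two of the algorithms that demo covers, the snippet below implements PageRank (power iteration) and a BFS analogue of weakly connected components in plain Python, standing in for the Neo4j GDS library calls; the customer-transaction edges are invented for illustration.

```python
from collections import defaultdict, deque

# Hypothetical customer graph: an edge means two accounts share a transaction.
edges = [("alice", "bob"), ("bob", "carol"), ("carol", "alice"),
         ("dave", "eve")]

adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

def pagerank(adj, damping=0.85, iters=50):
    """Power-iteration PageRank over an undirected adjacency map."""
    rank = {n: 1.0 / len(adj) for n in adj}
    for _ in range(iters):
        nxt = {n: (1 - damping) / len(adj) for n in adj}
        for n in adj:
            share = damping * rank[n] / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        rank = nxt
    return rank

def connected_components(adj):
    """BFS analogue of the demo's weakly connected components step."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n not in comp:
                comp.add(n)
                queue.extend(adj[n] - comp)
        seen |= comp
        comps.append(comp)
    return comps

ranks = pagerank(adj)
components = connected_components(adj)  # two separate transaction rings
```

In a fraud setting, accounts with unusually high rank or small isolated components are candidates for closer review; the actual demo would run these as GDS procedures over Cypher projections.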
Many powerful Machine Learning algorithms are based on graphs, e.g., Page Rank (Pregel), Recommendation Engines (collaborative filtering), text summarization, and other NLP tasks. Also, the recent developments with Graph Neural Networks connect the worlds of Graphs and Machine Learning even further.
Considering data pre-processing and feature engineering, which are both vital tasks in Machine Learning pipelines, extends this relationship across the entire ecosystem. In this session, we will investigate the entire range of Graphs and Machine Learning with many practical exercises.
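As a toy illustration of the collaborative-filtering side of that list, the sketch below scores item-item similarity with cosine similarity over a tiny ratings matrix; all user, item and rating values are invented.

```python
from math import sqrt

# Hypothetical user -> {item: rating} matrix.
ratings = {
    "u1": {"book_a": 5, "book_b": 4},
    "u2": {"book_a": 4, "book_b": 5, "book_c": 2},
    "u3": {"book_c": 5, "book_d": 4},
}

def item_vector(item):
    """The item's column of the ratings matrix, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}

def cosine(v, w):
    """Cosine similarity of two sparse vectors."""
    num = sum(v[u] * w[u] for u in set(v) & set(w))
    den = sqrt(sum(x * x for x in v.values())) * sqrt(sum(x * x for x in w.values()))
    return num / den if den else 0.0

sim_ab = cosine(item_vector("book_a"), item_vector("book_b"))
sim_ad = cosine(item_vector("book_a"), item_vector("book_d"))
# book_a and book_b are rated by the same users, so sim_ab exceeds sim_ad.
```

A recommender would then suggest, for a user who liked book_a, the items most similar to it; the bipartite user-item structure is exactly why this family of algorithms is naturally expressed on graphs.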
RDF and OWL are powerful tools for making data smart. RDF uses a simple triple format to represent metadata and link data using unique identifiers, allowing for data integration. OWL builds on RDF by adding more formal semantics and defining concepts, properties, and relationships to allow for automated reasoning and inference over data. Combining OWL and RDF results in smart data that computers can understand, enabling intelligent automation and decision making.
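The triple model described above can be sketched in a few lines: RDF statements become plain (subject, predicate, object) tuples, and a single RDFS-style subclass rule stands in for full OWL reasoning. The ex: identifiers are invented for illustration.

```python
# RDF statements as (subject, predicate, object) tuples; ex: names are invented.
triples = {
    ("ex:Alice", "rdf:type", "ex:Employee"),
    ("ex:Employee", "rdfs:subClassOf", "ex:Person"),
    ("ex:Person", "rdfs:subClassOf", "ex:Agent"),
}

def infer_types(triples):
    """One RDFS-style rule: x type C and C subClassOf D entail x type D."""
    inferred = set(triples)
    while True:
        new = {(s, "rdf:type", d)
               for (s, p, c) in inferred if p == "rdf:type"
               for (c2, p2, d) in inferred
               if p2 == "rdfs:subClassOf" and c2 == c}
        if new <= inferred:  # fixpoint reached: nothing new to add
            return inferred
        inferred |= new

facts = infer_types(triples)
# The "reasoner" now also knows Alice is a Person and an Agent.
```

Real OWL reasoners apply many such entailment rules at once; the point here is only that inference over triples is mechanical once the data is expressed in this uniform shape.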
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle... (Connected Data World)
Do you want to learn how to use the low-hanging fruit of knowledge graphs — schema.org and JSON-LD — to annotate content and improve your SEO with semantics and entities? This hands-on workshop with one of the leading Semantic SEO practitioners will help you get started.
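As a minimal sketch of that low-hanging fruit, the snippet below builds a schema.org Article annotation and serializes it as JSON-LD, the payload one would embed in a page; the article fields are invented.

```python
import json

# Invented article metadata expressed with schema.org types.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Knowledge Graphs for SEO",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "datePublished": "2021-06-01",
}

# Serialize to the JSON-LD payload that would go inside a
# <script type="application/ld+json"> element on the page.
payload = json.dumps(article, indent=2)
```

Search engines parse this block to attach entities (the article, its author) to the page, which is the "semantics and entities" angle the workshop covers.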
The document discusses semantic construction with graphs. It provides background on the speaker including their engineering and entrepreneurial experience. It then discusses trends in the AECO industry toward increased digitalization and use of building information modeling (BIM). The document proposes that an integrated approach using a semantic graph database can help address information gaps and complexity issues across project stages in the AECO industry.
This document discusses graph data science and Neo4j's capabilities. It describes how Neo4j can help simplify graph data science through its native graph database, graph data science library, and data visualization tool. Example use cases are also provided that demonstrate how Neo4j has helped companies with fraud detection, customer journey analysis, supply chain management, and patient outcomes.
How Graph Algorithms Answer your Business Questions in Banking and Beyond (Neo4j)
This document provides an agenda and overview for a presentation on using graph algorithms in banking. The presentation introduces graphs and the Neo4j graph database, demonstrates sample banking data modeled as a graph, and reviews several graph algorithms that could be used for applications like fraud detection, including PageRank, weakly connected components, node similarity, and Louvain modularity. The document concludes with a demo and Q&A section.
Building Enterprise-Ready Knowledge Graph Applications in the Cloud (Peter Haase)
The document provides an agenda for a workshop on building enterprise-ready knowledge graph applications in the cloud. The workshop will cover understanding knowledge graphs and related technologies, setting up a knowledge graph architecture on Amazon Neptune for scalable storage and querying, and using the metaphactory platform to rapidly build applications and APIs. Attendees will learn concepts for maintaining, querying and searching knowledge graphs, and building end-user and developer applications on top of knowledge graphs. The tutorial will include hands-on demonstrations and exercises to set up a small knowledge graph application.
What you need to know to start an AI company? (Mo Patel)
An overview of why AI and Deep Learning are hot now, and of Machine Intelligence startups. What are the key ingredients for an AI startup? How can AI startups compete with big tech companies, and which areas should they focus on for differentiation?
An Introduction to Graph: Database, Analytics, and Cloud Services (Jean Ihm)
Graph analysis employs powerful algorithms to explore and discover relationships in social network, IoT, big data, and complex transaction data. Learn how graph technologies are used in applications such as fraud detection for banking, customer 360, public safety, and manufacturing. This session will provide an overview and demos of graph technologies for Oracle Cloud Services, Oracle Database, NoSQL, Spark and Hadoop, including PGX analytics and PGQL property graph query language.
Presented at Analytics and Data Summit, March 20, 2018
This document discusses Quantum Information Systems and their use of semantic web technology for sustainability applications. It describes the Intellidimension Semantic Web and Open Semantic Framework platform and how it is interoperable with other technologies. The goal of the technology architecture is to facilitate the use of semantic web technology for sustainability applications through software, services, and web components. It provides high-level overviews of the semantic web processes, educational tools, scientific applications, and ambient intelligence components that are part of the system.
A VERY high-level overview of Graph Analytics concepts and techniques, including Structural Analytics, Connectivity Analytics, Community Analytics, Path Analytics, as well as Pattern Matching.
This webinar focuses on the particular use case of graph databases in Network & IT-Management. This webinar is designed for people who work with Network Management at telecom companies or professionals within industries that handle and rely on complex networks.
We’ll start with an overview of Neo4j and graph thinking within networks, explaining how networks are naturally modelled as graphs. We’ll explain how graph databases vastly help mitigate some of the major challenges Network and Security Managers face on a daily basis, including intrusions and other cyber crimes, performance optimization, outage simulations, fraud prevention and more.
Ajay Ohri is an experienced principal data scientist with 14 years of experience. He has expertise in R, Python, machine learning, data visualization, SAS, SQL and cloud computing. Ohri has extensive experience in financial services domains including credit cards, loans, and insurance. He is proficient in data science tasks like exploratory data analysis, regression modeling, and data cleaning. Ohri has worked on significant projects for government and private clients. He also publishes books and articles on data science topics.
This document discusses hybrid enterprise knowledge graphs and the metaphactory platform. It describes how metaphactory uses a knowledge graph as an integration hub, connecting to various data sources like databases, APIs, and machine learning models through its Ephedra federation engine. Ephedra allows querying over these different data sources together using SPARQL 1.1 federation. It provides examples of use cases involving similarity search, sensor data, chemical structures, and demonstrates federation between Wikidata and other sources.
Geophy CTO Sander Mulders presented their Metadata platform at our March meetup at Skillsmatters' CodeNode. The talk was about how Geophy use Linked Data approaches to accelerate & improve the accuracy of real estate requirements such as valuations.
Sander talked about the thousands of data sources used, how they use RDF for data integration, how to construct features and metadata driven services using components such as Apache Kafka and Stardog.
Graph Databases and Graph Data Science in Neo4j (ijtsrd)
The document discusses graph databases, Neo4j graph database software, and graph data science algorithms. It provides an overview of graph databases and their components like nodes, edges, and properties. It then describes Neo4j's features including querying, visualization, hosting options, and the Graph Data Science library. Finally, it explains different types of graph data science algorithms in Neo4j like centrality, similarity, and pathfinding algorithms and provides an example of each.
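A pathfinding algorithm of the kind that library offers can be sketched in plain Python; below is Dijkstra's shortest path on a small weighted graph, standing in for a GDS pathfinding procedure. The graph and its edge weights are invented.

```python
import heapq

# Invented weighted graph: node -> [(neighbour, edge weight)].
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}

def dijkstra(graph, source):
    """Cheapest-known-distance map from source, using a binary heap."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist.get(n, float("inf")):
            continue  # stale heap entry for an already-improved node
        for m, w in graph[n]:
            if d + w < dist.get(m, float("inf")):
                dist[m] = d + w
                heapq.heappush(heap, (d + w, m))
    return dist

dist = dijkstra(graph, "A")
# A -> B -> C -> D costs 1 + 2 + 1 = 4, beating the direct A -> C -> D route.
```

In Neo4j the equivalent would be a single call to the shortest-path procedure over a projected graph; the heap-based relaxation above is what runs underneath.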
This document discusses Logicterm's work with ISO integration standards over 20 years of research and development. It summarizes Logicterm's approach of using top-down logical data models and concepts that cross domains to create integration models. It provides examples of Logicterm's work modeling children's social services and developing a public sector ontology. The document concludes by outlining next steps to pilot an ISO standard for integration in a proof of concept with two or three application domains.
1. Graphs add predictive power to machine learning models by incorporating network structure and relationships between entities.
2. Building graph machine learning models involves aggregating data from various sources to construct a graph, engineering graph features using algorithms and embeddings, and training predictive models that leverage the graph structure.
3. Graph algorithms, embeddings, and neural networks are increasingly being used to power applications in domains like financial services, healthcare, cybersecurity, and more by enabling novel and more accurate predictions based on relationships in data.
Document Discovery with Graph Data Science (Neo4j)
This document discusses using graphs for document discovery and data science. Graphs can combine structured and unstructured data, show relationships between information, and enable visual exploration of data. Graph algorithms can enhance graphs by identifying important entities, predicting unknown relationships, and supporting analytical use cases like discovery. The document advocates building a graph from documents, applying graph analytics to aid discovery, enabling search and exploration of the graph, and developing applications to integrate these capabilities.
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment (Paris Sud University)
This document outlines techniques for expanding and enriching knowledge graphs through data linking, key discovery, and link invalidation. It begins with an introduction to linked open data and knowledge graphs. The technical part discusses approaches to data linking, including instance-based methods that consider attribute similarity and graph-based methods that propagate similarity across object properties. Supervised methods use training data while rule-based methods apply expert-defined rules. Evaluation measures effectiveness using recall, precision and F1-score, and efficiency. The document also covers similarity measures and techniques for knowledge graph expansion and enrichment.
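The evaluation step described above reduces to a few lines; the sketch below computes precision, recall and F1 for a set of predicted identity links against a gold standard. The link pairs are invented for illustration.

```python
# Invented predicted owl:sameAs links versus a gold standard.
gold = {("db:Paris", "wd:Q90"), ("db:Lyon", "wd:Q456"), ("db:Nice", "wd:Q33959")}
predicted = {("db:Paris", "wd:Q90"), ("db:Lyon", "wd:Q456"), ("db:Metz", "wd:Q22690")}

true_positives = len(gold & predicted)
precision = true_positives / len(predicted)  # share of predicted links that are correct
recall = true_positives / len(gold)          # share of gold links that were found
f1 = 2 * precision * recall / (precision + recall)
# Here precision = recall = 2/3, so F1 is also 2/3.
```

Link-invalidation work shifts this balance: pruning dubious links raises precision at the cost of recall, which is exactly what the F1 trade-off makes visible.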
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator (LDBC council)
The document discusses the Social Network Benchmark (SNB) developed by the Linked Data Benchmark Council (LDBC). The SNB aims to enable benchmarking of graph databases at scale by developing an open source benchmark including a social network data model and schema, workload definitions, and tools to generate synthetic but realistic graph data and validate benchmark results. The data generator creates correlated graph data by considering attributes like names, locations, interests that influence how the graph is generated to mimic real social networks. The SNB aims to evaluate systems on challenging graph queries and workloads.
LDBC 8th TUC Meeting: Introduction and status update (LDBC council)
The document summarizes an 8th Technical User Community meeting on the LDBC benchmark. It discusses:
1) The LDBC Organization, which sponsors benchmarks and the task forces that develop them.
2) The key elements of a benchmark - data/schema, workloads, performance metrics, and execution rules.
3) The Semantic Publishing Benchmark and Social Network Benchmark being developed to evaluate graph and RDF databases on industry workloads.
4) The workloads include interactive, business intelligence, and graph analytics to test different database capabilities.
5) Various database systems that can be evaluated using the benchmarks.
Many powerful Machine Learning algorithms are based on graphs, e.g., Page Rank (Pregel), Recommendation Engines (collaborative filtering), text summarization, and other NLP tasks. Also, the recent developments with Graph Neural Networks connect the worlds of Graphs and Machine Learning even further.
Considering data pre-processing and feature engineering which are both vital tasks in Machine Learning Pipelines extends this relationship across the entire ecosystem. In this session, we will investigate the entire range of Graphs and Machine Learning with many practical exercises.
RDF and OWL are powerful tools for making data smart. RDF uses a simple triple format to represent metadata and link data using unique identifiers, allowing for data integration. OWL builds on RDF by adding more formal semantics and defining concepts, properties, and relationships to allow for automated reasoning and inference over data. Combining OWL and RDF results in smart data that computers can understand, enabling intelligent automation and decision making.
From Knowledge Graphs to AI-powered SEO: Using taxonomies, schemas and knowle...Connected Data World
Do you want to learn how to use the low-hanging fruit of knowledge graphs — schema.org and JSON-LD — to annotate content and improve your SEO with semantics and entities? This hands-on workshop with one of the leading Semantic SEO practitioners will help you get started.
The document discusses semantic construction with graphs. It provides background on the speaker including their engineering and entrepreneurial experience. It then discusses trends in the AECO industry toward increased digitalization and use of building information modeling (BIM). The document proposes that an integrated approach using a semantic graph database can help address information gaps and complexity issues across project stages in the AECO industry.
This document discusses graph data science and Neo4j's capabilities. It describes how Neo4j can help simplify graph data science through its native graph database, graph data science library, and data visualization tool. Example use cases are also provided that demonstrate how Neo4j has helped companies with fraud detection, customer journey analysis, supply chain management, and patient outcomes.
How Graph Algorithms Answer your Business Questions in Banking and BeyondNeo4j
This document provides an agenda and overview for a presentation on using graph algorithms in banking. The presentation introduces graphs and the Neo4j graph database, demonstrates sample banking data modeled as a graph, and reviews several graph algorithms that could be used for applications like fraud detection, including PageRank, weakly connected components, node similarity, and Louvain modularity. The document concludes with a demo and Q&A section.
Building Enterprise-Ready Knowledge Graph Applications in the CloudPeter Haase
The document provides an agenda for a workshop on building enterprise-ready knowledge graph applications in the cloud. The workshop will cover understanding knowledge graphs and related technologies, setting up a knowledge graph architecture on Amazon Neptune for scalable storage and querying, and using the metaphactory platform to rapidly build applications and APIs. Attendees will learn concepts for maintaining, querying and searching knowledge graphs, and building end-user and developer applications on top of knowledge graphs. The tutorial will include hands-on demonstrations and exercises to set up a small knowledge graph application.
What you need to know to start an AI company?Mo Patel
An overview of why AI and Deep Learning are hot now? Overview f Machine Intelligence startups. What are the key ingredients for AI startup? How can AI startups compete with big tech companies and areas to focus on for differentiation?
An Introduction to Graph: Database, Analytics, and Cloud ServicesJean Ihm
Graph analysis employs powerful algorithms to explore and discover relationships in social network, IoT, big data, and complex transaction data. Learn how graph technologies are used in applications such as fraud detection for banking, customer 360, public safety, and manufacturing. This session will provide an overview and demos of graph technologies for Oracle Cloud Services, Oracle Database, NoSQL, Spark and Hadoop, including PGX analytics and PGQL property graph query language.
Presented at Analytics and Data Summit, March 20, 2018
This document discusses Quantum Information Systems and their use of semantic web technology for sustainability applications. It describes the Intellidimension Semantic Web and Open Semantic Framework platform and how it is interoperable with other technologies. The goal of the technology architecture is to facilitate the use of semantic web technology for sustainability applications through software, services, and web components. It provides high-level overviews of the semantic web processes, educational tools, scientific applications, and ambient intelligence components that are part of the system.
A VERY high level over view of Graph Analytics concepts and techniques, including structural analytics, Connectivity Analytics, Community Analytics, Path Analytics, as well as Pattern Matching
This webinar focuses on the particular use case of graph databases in Network & IT-Management. This webinar is designed for people who work with Network Management at telecom companies or professionals within industries that handle and rely on complex networks.
We’ll start with an overview of Neo4j and Graph-thinking within Networks, explaining how Neworks are naturally modelled as graphs. We’ll explain how graph databases vastly help mitigate some of the major challenges the Network and Security Managers face on daily basis — including intrusions and other cyber crimes, performance optimization, outage simulations, fraud prevention and more.
Ajay Ohri is an experienced principal data scientist with 14 years of experience. He has expertise in R, Python, machine learning, data visualization, SAS, SQL and cloud computing. Ohri has extensive experience in financial services domains including credit cards, loans, and insurance. He is proficient in data science tasks like exploratory data analysis, regression modeling, and data cleaning. Ohri has worked on significant projects for government and private clients. He also publishes books and articles on data science topics.
This document discusses hybrid enterprise knowledge graphs and the metaphactory platform. It describes how metaphactory uses a knowledge graph as an integration hub, connecting to various data sources like databases, APIs, and machine learning models through its Ephedra federation engine. Ephedra allows querying over these different data sources together using SPARQL 1.1 federation. It provides examples of use cases involving similarity search, sensor data, chemical structures, and demonstrates federation between Wikidata and other sources.
Geophy CTO Sander Mulders presented their Metadata platform at our March meetup at Skillsmatters' CodeNode. The talk was about how Geophy use Linked Data approaches to accelerate & improve the accuracy of real estate requirements such as valuations.
Sander talked about the thousands of data sources used, how they use RDF for data integration, how to construct features and metadata driven services using components such as Apache Kafka and Stardog.
Graph Databases and Graph Data Science in Neo4jijtsrd
The document discusses graph databases, Neo4j graph database software, and graph data science algorithms. It provides an overview of graph databases and their components like nodes, edges, and properties. It then describes Neo4j's features including querying, visualization, hosting options, and the Graph Data Science library. Finally, it explains different types of graph data science algorithms in Neo4j like centrality, similarity, and pathfinding algorithms and provides an example of each.
This document discusses Logicterm's work with ISO integration standards over 20 years of research and development. It summarizes Logicterm's approach of using top-down logical data models and concepts that cross domains to create integration models. It provides examples of Logicterm's work modeling children's social services and developing a public sector ontology. The document concludes by outlining next steps to pilot an ISO standard for integration in a proof of concept with two or three application domains.
1. Graphs add predictive power to machine learning models by incorporating network structure and relationships between entities.
2. Building graph machine learning models involves aggregating data from various sources to construct a graph, engineering graph features using algorithms and embeddings, and training predictive models that leverage the graph structure.
3. Graph algorithms, embeddings, and neural networks are increasingly being used to power applications in domains like financial services, healthcare, cybersecurity, and more by enabling novel and more accurate predictions based on relationships in data.
4. Document Discovery with Graph Data ScienceNeo4j
This document discusses using graphs for document discovery and data science. Graphs can combine structured and unstructured data, show relationships between information, and enable visual exploration of data. Graph algorithms can enhance graphs by identifying important entities, predicting unknown relationships, and supporting analytical use cases like discovery. The document advocates building a graph from documents, applying graph analytics to aid discovery, enabling search and exploration of the graph, and developing applications to integrate these capabilities.
Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Paris Sud University
This document outlines techniques for expanding and enriching knowledge graphs through data linking, key discovery, and link invalidation. It begins with an introduction to linked open data and knowledge graphs. The technical part discusses approaches to data linking, including instance-based methods that consider attribute similarity and graph-based methods that propagate similarity across object properties. Supervised methods use training data while rule-based methods apply expert-defined rules. Evaluation measures effectiveness using recall, precision and F1-score, and efficiency. The document also covers similarity measures and techniques for knowledge graph expansion and enrichment.
FOSDEM 2014: Social Network Benchmark (SNB) Graph Generator (LDBC council)
The document discusses the Social Network Benchmark (SNB) developed by the Linked Data Benchmark Council (LDBC). The SNB aims to enable benchmarking of graph databases at scale by developing an open source benchmark including a social network data model and schema, workload definitions, and tools to generate synthetic but realistic graph data and validate benchmark results. The data generator creates correlated graph data by considering attributes like names, locations, interests that influence how the graph is generated to mimic real social networks. The SNB aims to evaluate systems on challenging graph queries and workloads.
LDBC 8th TUC Meeting: Introduction and status update (LDBC council)
The document summarizes the 8th Technical User Community (TUC) meeting on the LDBC benchmarks. It discusses:
1) The LDBC Organization which sponsors benchmarks and task forces to develop them.
2) The key elements of a benchmark - data/schema, workloads, performance metrics, and execution rules.
3) The Semantic Publishing Benchmark and Social Network Benchmark being developed to evaluate graph and RDF databases on industry workloads.
4) The workloads include interactive, business intelligence, and graph analytics to test different database capabilities.
5) Various database systems that can be evaluated using the benchmarks.
The LDBC Social Network Benchmark Interactive Workload - SIGMOD 2015 (Ioan Toma)
The document discusses the LDBC Social Network Benchmark for evaluating database and graph processing systems. It describes the benchmark's social network data generator which produces realistic data following power law distributions and correlations. It also outlines the benchmark's three workloads: interactive, business intelligence, and graph analytics. The focus is on the interactive workload, which includes complex read queries, simple read queries, and concurrent updates. It aims to identify choke points and measure the acceleration factor a system can sustain for the query mix while meeting a maximum query latency. Parameter curation is used to select query parameters that produce stable performance. The parallel query driver respects dependencies between queries to evaluate a system's ability to handle the workload concurrently.
The document describes analyzing clickstream data using Spark clustering algorithms. It discusses parsing raw clickstream data containing user agent strings to extract useful fields like device type, OS, and timezone. It then applies distributed K-modes clustering in Spark to group users into subsets exhibiting similar patterns. The results show 10 clusters with different proportions of country, timezone, device and other fields, revealing distinct user types within the dataset.
Chengqi Zhang: Graph Processing and Mining in the Era of Big Data (jins0618)
The document discusses challenges and opportunities in graph processing and mining for big data, including new graph semantics, mining tasks, query processing algorithms, indexing techniques, and computing models needed to handle large graph datasets. It outlines the speaker's work on topics such as structural keyword search, graph matching, community detection, and graph classification. It also covers the speaker's work on efficient graph algorithms, parallel and distributed graph computation, and graph processing system design.
Graph Database Use Cases - StampedeCon 2015 (StampedeCon)
Presented by Max De Marzi at StampedeCon 2015: Graphs are eating the world – but in what form? Starting off with a primer on Graph Databases, this talk will focus on practical examples of graph applications.
We’ll look at multiple use cases like job boards, dating sites, recommendation engines of all kinds, network management, scheduling engines, etc. We'll also see some examples of graph search in action.
This document provides an overview of graph databases and their use cases. It begins with definitions of graphs and graph databases. It then gives examples of how graph databases can be used for social networking, network management, and other domains where data is interconnected. It provides Cypher examples for creating and querying graph patterns in a social networking and IT network management scenario. Finally, it discusses the graph database ecosystem and how graphs can be deployed for both online transaction processing and batch processing use cases.
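The kind of pattern such Cypher examples express — e.g. a friend-of-friend match like `MATCH (p)-[:FRIEND]-()-[:FRIEND]-(fof)` — can be sketched in plain Python to show what the traversal computes (the data and names here are invented):

```python
from collections import defaultdict

def friends_of_friends(friendships, person):
    """People exactly two hops away, excluding direct friends and the person."""
    adj = defaultdict(set)
    for a, b in friendships:
        adj[a].add(b)
        adj[b].add(a)
    direct = adj[person]
    fof = set()
    for f in direct:
        fof |= adj[f]
    return fof - direct - {person}

friendships = [("ann", "bob"), ("bob", "cat"), ("ann", "dan"), ("dan", "eve")]
result = friends_of_friends(friendships, "ann")
# result == {"cat", "eve"}
```

A graph database evaluates this as an index-free traversal from `person`, which is why such queries stay fast as the total graph grows.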
(1) Amundsen is a data discovery platform developed by Lyft to help users find, understand, and use data.
(2) The platform addresses challenges around data discovery such as lack of understanding about what data exists and where to find it.
(3) Amundsen provides searchable metadata about data resources, previews of data, and usage statistics to help data scientists and others explore and understand data.
"Semantic Integration Is What You Do Before The Deep Learning". dev.bg Machine Learning seminar, 13 May 2019.
It's well known that 80% of the effort of a data scientist is spent on data preparation. Semantic integration is arguably the best way to spend this effort more efficiently and to reuse it between tasks, projects and organizations. Knowledge Graphs (KG) and Linked Open Data (LOD) have become very popular recently. They are used by Google, Amazon, Bing, Samsung, Springer Nature, Microsoft Academic, Airbnb… and any large enterprise that would like to have a holistic (360 degree) view of its business. The Semantic Web (web 3.0) is a way to build a Giant Global Graph, just like the normal web is a Global Web of Documents. IEEE already talks about Big Data Semantics. We review the topic of KGs and their applicability to Machine Learning.
EgoSystem: Presentation to LITA, American Library Association, Nov 8 2014 (James Powell)
The Internet represents the connections among computers and devices, the world wide web is a network of interconnected documents, and the semantic web is the closest thing we have today to a network of interconnected facts. Noticeably absent from these global networks is any sort of open, formal representation for an online global social network. Each user's online presence, and its immediate social network, are isolated and typically only available within the confines of the social networking site that hosts them. Discovery across explicit online social networks and implicit social networks, such as those that can be inferred from co-authorship relationships and affiliations, is, for all practical purposes, impossible. And yet there are practical and non-nefarious reasons why an organization might be interested in exploring portions of such a network. Outreach is one such interest. Los Alamos National Laboratory (LANL) prototyped EgoSystem to harvest and explore the professional social networks of postdoctoral students. The project's goal is to enlist past students and other Lab alumni as ambassadors and advocates for LANL's ongoing mission. During this talk we will discuss the various technologies that support EgoSystem and demonstrate some of its capabilities.
An early look at the LDBC Social Network Benchmark's Business Intelligence workload (Gábor Szárnyas)
This document summarizes an early look at the Business Intelligence workload of the LDBC Social Network Benchmark. It describes the LDBC timeline and goals, defines the graph processing landscape including interactive, graph analytics, and BI workloads, and provides examples of BI global queries. It also outlines the choke points identified in benchmark design and language features, and discusses implementing the BI workload through data generation, query specification, and cross-validation of multiple systems.
Multi-Model Data Query Languages and Processing Paradigms (Jiaheng Lu)
Specifying users' interests with a formal query language is a typically challenging task, which becomes even harder in the context of multi-model data management because we have to deal with data variety. It usually lacks a unified schema to help the users issuing their queries, or has an incomplete schema as data come from disparate sources. Multi-Model DataBases (MMDBs) have emerged as a promising approach for dealing with this task as they are capable of accommodating and querying the multi-model data in a single system. This tutorial aims to offer a comprehensive presentation of a wide range of query languages for MMDBs and to make comparisons of their properties from multiple perspectives. We will discuss the essence of cross-model query processing and provide insights on the research challenges and directions for future work. The tutorial will also offer the participants hands-on experience in applying MMDBs to issue multi-model data queries.
BI, Decision Support, and Graphs (Cédric Fauvet)
The document discusses how graph databases and graph technologies can be used for business intelligence, analytics, and decision making. It provides examples of how companies in various industries like communications, logistics, online recruiting, and consumer web have used graph databases from Neo4j to power applications, gain insights, and improve user experiences. Specific use cases discussed include network management, parcel routing, social job search, recommendations, and interactive television programming. The benefits of the graph model over relational databases for complex connected data are also highlighted.
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q... (Databricks)
In real-life applications, we often deal with situations where analysis needs to be conducted on graphs where the nodes and edges are associated with multiple labels. For example, in a graph that represents user activities in social networks, the labels associated with nodes may indicate their membership in communities (e.g. group, school, company, etc.), and the labels associated with edges may denote types of activities (e.g. comment, like, share, etc.). The current GraphX library in Spark does not directly support efficient calculation on the label-defined subgraph analysis and computations.
In this session, the speakers will propose a general API library that is able to support analysis on multi-label graphs, and can be reused and extended to design more complicated algorithms. It includes a method to create multi-label graphs and calculate basic statistics and metrics at both the global and subgraph level. Common graph algorithms, such as PageRank, can also be efficiently implemented in a parallel scheme by reusing the module/algorithm in GraphX, such as Pregel API.
See how LinkedIn is able to leverage this tool to efficiently find top LinkedIn feed influencers in different communities and by different actions.
This document provides an overview of Near Real Time Analysis of Web Scale Social Data. It discusses how SpotRight collects and organizes publicly available user-generated social data at web scale, including connections between users, actions, events, profiles, demographics, and more from sources like Twitter, Pinterest, blogs and articles. It describes SpotRight's goals, architecture, algorithms, and tools used to perform real-time analysis and deliver timely insights to clients, including graph building, profile creation, and delivery of results. Key aspects involve collecting petabytes of social data, performing distributed graph algorithms at scale, and querying and delivering insights from the organized data.
Lyft developed Amundsen, an internal metadata and data discovery platform, to help their data scientists and engineers find data more efficiently. Amundsen provides search-based and lineage-based discovery of Lyft's data resources. It uses a graph database and Elasticsearch to index metadata from various sources. While initially built using a pull model with crawlers, Amundsen is moving toward a push model where systems publish metadata to a message queue. The tool has increased data team productivity by over 30% and will soon be open sourced for other organizations to use.
How Are Graph Databases Used in Police Departments? (Samet KILICTAS)
This presentation delivers the basics of graph concepts and graph databases to the audience. It explains how graph databases are used, with sample use cases from industry, and how they can be applied in police departments. Questions like "When should I use a graph DB?" and "Should I solve this problem with a graph DB?" are answered.
This document provides a summary of an event on optimized graph algorithms in Neo4j. It includes an introduction to graph analytics and algorithms, examples of analyzing real-world networks, and a demonstration of Neo4j's native graph database capabilities for graph analytics and algorithms. The presentation discusses preprocessing data from multiple sources into a graph, running algorithms like PageRank and community detection, and visualizing results.
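A compact power-iteration PageRank of the kind such demos run can be sketched in plain Python; this is a generic textbook version, not Neo4j's implementation:

```python
def pagerank(adj, d=0.85, iters=50):
    """Textbook power-iteration PageRank over an adjacency dict."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - d) / n for v in nodes}
        for v, outs in adj.items():
            if outs:
                share = d * rank[v] / len(outs)
                for w in outs:
                    new[w] += share
            else:  # dangling node: spread its rank evenly
                for w in nodes:
                    new[w] += d * rank[v] / n
        rank = new
    return rank

adj = {"a": ["b"], "b": ["c"], "c": ["a", "b"]}
ranks = pagerank(adj)
# ranks sum to ~1.0; "b" scores highest (in-links from both "a" and "c")
```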
Similar to FOSDEM 2014 - Social Network Benchmark (SNB) Graph Generator - Peter Boncz
LDBC 6th TUC Meeting conclusions by Peter Boncz (Ioan Toma)
The document discusses the conclusions and future perspectives of the Linked Data Benchmark Council (LDBC). It summarizes that the EU project is on track, four benchmarks were developed including for social networks and graph analytics, and there has been growing participation from academic and industry groups. It outlines plans for LDBC to transition to an independent council and develop additional benchmarks in areas like graph programming, temporal graphs, and ontology reasoning.
MODAClouds Decision Support System for Cloud Service Selection (Ioan Toma)
The document discusses MODAClouds' Decision Support System (DSS) for cloud service selection. Some key points:
- The DSS helps users select cloud services by considering multiple dimensions like cost, quality, risks, and technical/business constraints.
- It allows multiple stakeholders like architects, operators, managers to provide input on tangible/intangible assets and risks.
- The DSS performs risk analysis and generates requirements. It also considers issues around multi-cloud environments like interoperability and migration challenges.
- Other features include automatic data gathering from various sources, and progressive learning over time from user inputs and service selections.
This document outlines the process for auditing an implementation of the LDBC Social Network Benchmark (SNB). It describes downloading the necessary driver and data generator software. The main metrics for benchmarking are throughput at scale and query latency constraints during crash recovery testing. It provides guidelines for implementing the driver, loading data, running queries for 2 hours, and verifying crash recovery. Audit results will report throughput, load time, recovery time and system configuration. Rules prohibit hinted query plans, precomputed indexes, and require serializable consistency.
Social Network Benchmark Interactive Workload (Ioan Toma)
The document summarizes the LDBC Social Network Benchmark (SNB) for interactive workloads on social network systems. It describes:
- The SNB data generator that produces synthetic social network data with characteristics similar to Facebook, including persons, groups, posts, and relationships. It scales to very large datasets.
- The types of queries in the benchmark - complex reads, short reads, and updates - that mimic real user interactions. Complex reads target specific system bottlenecks.
- The workload driver that schedules queries and updates over time based on the synthetic data, aiming to balance read and write loads and mimic user browsing behavior. It measures latency and throughput.
- Validation and execution rules to ensure fair and comparable results across systems.
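The driver's core idea — issuing a weighted mix of query types over time while recording latency and throughput — can be sketched in a few lines. This is a toy stand-in, not the actual SNB driver: the query bodies are stubs and the mix ratios are illustrative.

```python
import random
import time
from collections import defaultdict

def run_workload(queries, mix, duration=0.2, seed=42):
    """Toy driver: issues queries drawn from a weighted mix for a fixed
    wall-clock duration and records per-query latencies."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[n] for n in names]
    latencies = defaultdict(list)
    count = 0
    start = time.perf_counter()
    while time.perf_counter() - start < duration:
        name = rng.choices(names, weights=weights)[0]
        t0 = time.perf_counter()
        queries[name]()  # execute the (stub) query
        latencies[name].append(time.perf_counter() - t0)
        count += 1
    return count / duration, latencies

# Stub query bodies and an illustrative mix ratio (not the SNB's actual mix).
queries = {
    "complex_read": lambda: sum(range(1000)),
    "short_read":   lambda: sum(range(10)),
    "update":       lambda: None,
}
mix = {"complex_read": 1, "short_read": 7, "update": 2}
throughput, latencies = run_workload(queries, mix)
```

The real driver additionally respects dependencies between updates (a comment cannot be inserted before its post) and scales the schedule by an acceleration factor.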
MarkLogic is an enterprise NoSQL database platform that can be used for semantic search, data integration, and intelligent recommendation engines. It natively stores XML, JSON, and RDF triples alongside documents. Triples provide context to documents and enable semantic queries over data. MarkLogic also supports full-text and geospatial search, transactions, and flexible indexing and replication of data at scale.
SADI: A design-pattern for "native" Linked-Data Semantic Web Services (Ioan Toma)
Semantic Automated Discovery and Integration: a design-pattern for "native" Linked-Data Semantic Web Services, by Mark D. Wilkinson (Universidad Politécnica de Madrid)
The document discusses the UniProt SPARQL endpoint, which provides access to 20 billion biomedical data triples. It notes challenges in hosting large life science databases and standards-based federated querying. The endpoint uses Apache load balancing across two nodes each with 64 CPU cores and 256GB RAM. Loading rates of 500,000 triples/second are achieved. Usage peaks at 35 million queries per month from an estimated 300-2000 real users. Very large queries involving counting all IRIs can take hours. Template-based compilation and key-value approaches are proposed to optimize challenging queries. Public monitoring is also discussed.
Lighthouse: Large-scale graph pattern matching on Giraph (Ioan Toma)
This document provides an overview of Lighthouse, a system for large-scale graph pattern matching on Giraph. It discusses the timeline of Lighthouse's development, its vertex-centric API, architecture based on BSP and Pregel, and execution algebra using operations like scan, select, project, and joins. Examples are provided of matching graph patterns in Cypher queries and finding the k-shortest paths between nodes while restricting to safe edges. The implementation uses two phases, first computing the routes for each top k path and then traversing back to build the full paths. Preliminary results are mentioned but not detailed.
HP Labs: Titan DB on LDBC SNB interactive, by Tomer Sagi (HP) (Ioan Toma)
HP has a long history of innovation dating back to its founding in a Palo Alto garage in 1939. Some of its notable innovations include the first programmable calculator in 1968, the first pocket scientific calculator in 1972, launching the first inkjet printer in 1984, and being first to commercialize RISC technology in 1986. More recently, HP Labs has developed technologies like ePrint in 2010, 3D Photon technology in 2011, and Project Moonshot in 2013. Going forward, HP Labs is focusing its research on systems, networking, security, analytics, and printing to deliver the fastest and most efficient route from data to value.
LDBC Semantic Publishing Benchmark evolution. The Semantic Publishing Benchmark (SPB) needed to evolve in order to allow retrieval of semantically relevant content based on rich metadata descriptions and diverse reference knowledge, and to demonstrate that triplestores offer simplified and more efficient querying, making them a competitive technology in the RDF and graph arena. The presentations detail all the changes.
GRAPH-TA 2013 - RDF and Graph benchmarking - Jose Lluis Larriba Pey (Ioan Toma)
The document discusses an agenda for a meeting on benchmarking RDF and graph databases. It provides an overview of the LDBC benchmarking project including its objectives to create standardized benchmarks, spur industry cooperation, and push technological improvements. It outlines the working groups and task forces that will focus on specific types of benchmarks. It also discusses common issues to be addressed like suitable use cases, benchmark methodologies, and benchmark workloads for querying, updating, and integrating RDF and graph data. Open questions are raised about benchmark realism, benchmark rules, and technological convergence of RDF and graph databases.
The LDBC benchmark council develops industry-strength benchmarks for graph and RDF data management systems. It focuses on benchmarks for transactional updates, business intelligence queries, graph analytics, and complex RDF workloads. The benchmarks are designed based on choke points - difficulties in the workload that stimulate technical progress. Examples of choke points in TPC-H include dependent group by keys and sparse joins. A key choke point for graph benchmarks is structural correlation, where the structure of the graph depends on node attribute values. The LDBC benchmark includes a generator that produces synthetic graphs with correlated properties and edges to test how systems handle this challenge.
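Structural correlation, as described above, can be illustrated with a toy generator in which the probability of an edge depends on whether two nodes share an attribute value; all probabilities and attribute values here are invented:

```python
import random

def correlated_graph(people, rng, base=0.02, boost=0.40):
    """Toy structurally correlated generator: two persons sharing an
    interest are far more likely to be connected."""
    edges = []
    for i in range(len(people)):
        for j in range(i + 1, len(people)):
            p = boost if people[i]["interest"] == people[j]["interest"] else base
            if rng.random() < p:
                edges.append((i, j))
    return edges

rng = random.Random(7)
people = [{"interest": rng.choice(["metal", "jazz"])} for _ in range(60)]
edges = correlated_graph(people, rng)
same_interest = sum(people[i]["interest"] == people[j]["interest"] for i, j in edges)
# the vast majority of edges connect persons with a shared interest
```

A system that exploits this correlation (e.g. by clustering storage on the attribute) will answer attribute-plus-traversal queries much faster than one that ignores it, which is exactly the behaviour the choke point is designed to expose.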
"Scaling RAG Applications to serve millions of users", Kevin GoedeckeFwdays
How we managed to grow and scale a RAG application from zero to thousands of users in 7 months. Lessons from technical challenges around managing high load for LLMs, RAGs and Vector databases.
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
Session 1 - Intro to Robotic Process Automation.pdfUiPathCommunity
👉 Check out our full 'Africa Series - Automation Student Developers (EN)' page to register for the full program:
https://bit.ly/Automation_Student_Kickstart
In this session, we shall introduce you to the world of automation, the UiPath Platform, and guide you on how to install and setup UiPath Studio on your Windows PC.
📕 Detailed agenda:
What is RPA? Benefits of RPA?
RPA Applications
The UiPath End-to-End Automation Platform
UiPath Studio CE Installation and Setup
💻 Extra training through UiPath Academy:
Introduction to Automation
UiPath Business Automation Platform
Explore automation development with UiPath Studio
👉 Register here for our upcoming Session 2 on June 20: Introduction to UiPath Studio Fundamentals: https://community.uipath.com/events/details/uipath-lagos-presents-session-2-introduction-to-uipath-studio-fundamentals/
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
Essentials of Automations: Exploring Attributes & Automation ParametersSafe Software
Building automations in FME Flow can save time, money, and help businesses scale by eliminating data silos and providing data to stakeholders in real-time. One essential component to orchestrating complex automations is the use of attributes & automation parameters (both formerly known as “keys”). In fact, it’s unlikely you’ll ever build an Automation without using these components, but what exactly are they?
Attributes & automation parameters enable the automation author to pass data values from one automation component to the next. During this webinar, our FME Flow Specialists will cover leveraging the three types of these output attributes & parameters in FME Flow: Event, Custom, and Automation. As a bonus, they’ll also be making use of the Split-Merge Block functionality.
You’ll leave this webinar with a better understanding of how to maximize the potential of automations by making use of attributes & automation parameters, with the ultimate goal of setting your enterprise integration workflows up on autopilot.
Connector Corner: Seamlessly power UiPath Apps, GenAI with prebuilt connectorsDianaGray10
Join us to learn how UiPath Apps can directly and easily interact with prebuilt connectors via Integration Service--including Salesforce, ServiceNow, Open GenAI, and more.
The best part is you can achieve this without building a custom workflow! Say goodbye to the hassle of using separate automations to call APIs. By seamlessly integrating within App Studio, you can now easily streamline your workflow, while gaining direct access to our Connector Catalog of popular applications.
We’ll discuss and demo the benefits of UiPath Apps and connectors including:
Creating a compelling user experience for any software, without the limitations of APIs.
Accelerating the app creation process, saving time and effort
Enjoying high-performance CRUD (create, read, update, delete) operations, for
seamless data management.
Speakers:
Russell Alfeche, Technology Leader, RPA at qBotic and UiPath MVP
Charlie Greenberg, host
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
zkStudyClub - LatticeFold: A Lattice-based Folding Scheme and its Application...Alex Pruden
Folding is a recent technique for building efficient recursive SNARKs. Several elegant folding protocols have been proposed, such as Nova, Supernova, Hypernova, Protostar, and others. However, all of them rely on an additively homomorphic commitment scheme based on discrete log, and are therefore not post-quantum secure. In this work we present LatticeFold, the first lattice-based folding protocol based on the Module SIS problem. This folding protocol naturally leads to an efficient recursive lattice-based SNARK and an efficient PCD scheme. LatticeFold supports folding low-degree relations, such as R1CS, as well as high-degree relations, such as CCS. The key challenge is to construct a secure folding protocol that works with the Ajtai commitment scheme. The difficulty, is ensuring that extracted witnesses are low norm through many rounds of folding. We present a novel technique using the sumcheck protocol to ensure that extracted witnesses are always low norm no matter how many rounds of folding are used. Our evaluation of the final proof system suggests that it is as performant as Hypernova, while providing post-quantum security.
Paper Link: https://eprint.iacr.org/2024/257
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
The Department of Veteran Affairs (VA) invited Taylor Paschal, Knowledge & Information Management Consultant at Enterprise Knowledge, to speak at a Knowledge Management Lunch and Learn hosted on June 12, 2024. All Office of Administration staff were invited to attend and received professional development credit for participating in the voluntary event.
The objectives of the Lunch and Learn presentation were to:
- Review what KM ‘is’ and ‘isn’t’
- Understand the value of KM and the benefits of engaging
- Define and reflect on your “what’s in it for me?”
- Share actionable ways you can participate in Knowledge - - Capture & Transfer
High performance Serverless Java on AWS- GoTo Amsterdam 2024Vadym Kazulkin
Java is for many years one of the most popular programming languages, but it used to have hard times in the Serverless community. Java is known for its high cold start times and high memory footprint, comparing to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption, cold start times for Java Serverless development on AWS including GraalVM (Native Image) and AWS own offering SnapStart based on Firecracker microVM snapshot and restore and CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide a lot of benchmarking on Lambda functions trying out various deployment package sizes, Lambda memory settings, Java compilation options and HTTP (a)synchronous clients and measure their impact on cold and warm start times.
3. What is the LDBC?
Linked Data Benchmark Council = LDBC
• Industry entity similar to TPC (www.tpc.org)
• Focusing on graph and RDF store benchmarking
• Kick-started by an EU project
  – Runs from September 2012 – March 2015
  – 9 project partners
5. SNB Scenario: Social Network Analysis
• Intuitive: everybody knows what a SN is
– Facebook, Twitter, LinkedIn, …
• SNs can be easily represented as a graph
– Entities are the nodes (Person, Group, Tag, Post, ...)
– Relationships are the edges (Friend, Likes, Follows, …)
• Different scales: from small to very large SNs
– Up to billions of nodes and edges
• Multiple query needs:
– interactive, analytical, transactional
• Multiple types of uses:
– marketing, recommendation, social interactions, fraud detection, ...
6. Audience
• For developers facing graph processing tasks
– recognizable scenario to compare merits of different
products and technologies
• For vendors of graph database technology
– checklist of features and performance characteristics
• For researchers, both industrial and academic
– challenges in multiple choke-point areas such as
graph query optimization and (distributed) graph
analysis
7. What was developed?
• Four main elements:
– data schema: defines the structure of the data
– workloads: defines the set of operations to perform
– performance metrics: used to measure (quantitatively) the performance of the systems
– execution rules: defined to assure that the results from different executions of the benchmark are valid and comparable
• Software as Open Source (GitHub)
– data generator, query drivers, validation tools, ...
9. Data Schema
• Specified in UML for portability
– Classes
– Associations between classes
– Attributes for classes and associations
• Some of the relationships represent dimensions
– Time (Y,QT,Month,Day)
– Geography (Continent,Country,Place)
• Data Formats
– CSV
– RDF (Turtle + N3)
11. Data Generation Process
• Produce synthetic data that mimics the
characteristics of real SN data
• Based on SIB–S3G2 Social Graph Generator
– www.w3.org/wiki/Social_Network_Intelligence_BenchMark
• Graph model:
– property (directed labeled) graph
• Building blocks
– Correlated data dictionaries
– Sequential subtree generation
– Graph generation among correlation dimensions
12. Choke Points
• Hidden challenges in a benchmark influence database system design, e.g. TPC-H:
– Functional Dependency Analysis in aggregation
– Bloom Filters for sparse joins
– Subquery predicate propagation
• LDBC explicitly designs benchmarks looking at choke-point “coverage”
– requires access to database kernel architects
13. Data Correlations between attributes
Correlation:
SELECT personID from person
WHERE firstName = ‘Joachim’ AND addressCountry = ‘Germany’
Anti-Correlation:
SELECT personID from person
WHERE firstName = ‘Cesare’ AND addressCountry = ‘Italy’
• Query optimizers may underestimate or overestimate the result size of conjunctive predicates
[Figure: example names (Joachim Loew, Cesare, Cesare, Joachim Prandelli) illustrating correlated and anti-correlated first-name/country combinations]
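The optimizer pitfall above can be simulated in a few lines. The sketch below (illustrative Python with made-up name/country frequencies, not the benchmark's actual data) builds a correlated population and compares an independence-assumption cardinality estimate with the true result size of a conjunctive predicate:

```python
import random

random.seed(42)
N = 100_000

# Hypothetical correlated population: Germans are far more likely to be
# named "Joachim" than Italians are (made-up frequencies, for illustration).
def make_person():
    country = random.choice(["Germany", "Italy"])
    p_joachim = 0.04 if country == "Germany" else 0.001
    first = "Joachim" if random.random() < p_joachim else "Other"
    return first, country

people = [make_person() for _ in range(N)]

# Marginal selectivities, as an independence-assuming optimizer sees them.
sel_name = sum(1 for f, _ in people if f == "Joachim") / N
sel_country = sum(1 for _, c in people if c == "Germany") / N
estimate = N * sel_name * sel_country   # independence assumption

# True result size of firstName = 'Joachim' AND addressCountry = 'Germany'.
actual = sum(1 for f, c in people if f == "Joachim" and c == "Germany")
print(f"estimate: {estimate:.0f}, actual: {actual}")
```

For the correlated pair the independence estimate lands well below the actual count; running the same arithmetic on an anti-correlated pair overestimates instead.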
14. Property Dictionaries
• Property values for each literal property
defined by:
– a dictionary D, a fixed set of values (e.g., terms
from DBPedia or GeoNames)
– a ranking function R (between 1 and |D|)
– a probability density function F that specifies
how the data generator chooses values from
dictionary D using the rank R for each term in
the dictionary
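A minimal sketch of this mechanism, assuming a toy dictionary and a Zipf-like density (the real generator draws its dictionaries from sources such as DBpedia or GeoNames and uses its own tuned distributions):

```python
import random

random.seed(0)

# Toy dictionary D (rank 1 = most popular term); contents are illustrative.
D = ["Anna", "Laura", "Joachim", "Cesare", "Maria"]

def R(term):                # ranking function: 1 .. |D|
    return D.index(term) + 1

def F(rank, s=1.0):         # Zipf-like probability density over ranks
    return 1.0 / rank ** s

weights = [F(R(t)) for t in D]

def pick_value():
    # Draw a term from D with probability proportional to F(R(term)).
    return random.choices(D, weights=weights)[0]

sample = [pick_value() for _ in range(10_000)]
# Highly ranked terms dominate the generated property values.
print(sample.count("Anna"), sample.count("Maria"))
```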
16. Correlated Edge Generation
[Figure: example property graph of persons P1–P5 with attributes such as firstName (“Anna”, “Laura”), birthYear (“1990”), studyAt (“University of Leipzig”, “University of Amsterdam”), liveAt (“Germany”, “Netherlands”), isA (Student), likes (<Britney Spears>), connected by <knows> edges]
17. How to Generate a Correlated Graph?
• Compute similarity of two nodes based on their (correlated) properties
• Use a probability density function w.r.t. this similarity for connecting nodes
• Danger: this is very expensive to compute on a large graph! (quadratic, random access)
[Figure: connection probability as a function of node similarity, from “highly similar” to “less similar”, shown over the example persons P1–P5]
18. Window Optimization
• Trick: disregard nodes with too large similarity distance (only connect nodes in a similarity window)
• Probability that two nodes are connected is skewed w.r.t. the similarity between the nodes (due to the probability distr.)
[Figure: the connection-probability curve with a sliding similarity window over the example persons; only pairs inside the window are candidates for edges]
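The window trick can be sketched as follows (illustrative Python; the window size and connection probabilities are assumptions, not the generator's actual parameters). Nodes are sorted along one correlation dimension, and only pairs inside a fixed window are candidate edges, reducing the cost from O(n²) to O(n·W):

```python
import random

random.seed(1)

# 1000 nodes with one numeric correlation dimension (e.g. an encoding of
# university + graduation year); values are illustrative.
nodes = sorted(random.random() for _ in range(1000))

W = 20            # similarity window size (assumption)
edges = []
for i in range(len(nodes)):
    # Only pairs inside the window are candidates: O(n * W), not O(n^2).
    for j in range(i + 1, min(i + 1 + W, len(nodes))):
        # Connection probability skewed toward the most similar pairs.
        if random.random() < 0.5 / (j - i):
            edges.append((i, j))

print(len(edges))
```

Because the nodes are pre-sorted, the pass is also sequential rather than random-access, which is what makes it practical at scale.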
19. Graph generation
• Create new edges and nodes in one pass: starting from an existing node, create the new nodes to which it will be connected
• Uses degree distributions that follow a power law
• Generates correlated and highly connected graph
data by creating edges after generating many
nodes
– Similarity metric
– Random dimension
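The power-law degree distribution mentioned above can be sketched like this (the exponent and cutoff are illustrative assumptions, not the generator's values):

```python
import random

random.seed(7)

# Discrete power-law degree distribution (exponent/cutoff are illustrative).
ALPHA, MAX_DEG = 2.5, 1000
ranks = list(range(1, MAX_DEG + 1))
weights = [d ** -ALPHA for d in ranks]

# Target "knows" degree for each of 10,000 generated persons.
degrees = random.choices(ranks, weights=weights, k=10_000)

# Power law: a large majority of low-degree nodes plus a few hubs.
low = sum(1 for d in degrees if d <= 2)
print(low / len(degrees), max(degrees))
```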
20. Implementation
• MapReduce
– A Map function runs on different parts of the input data, in
parallel and on many clusters
– Main Reduce function:
• Sort the correlation dimensions
• Generate edges based on sliding windows (similarity metric)
– Successive MapReduce phases for different correlation
dimensions
• Scalable, e.g.:
– Generate 1.2TB of data in 30 minutes (16 4-core nodes)
24. Workloads
• On-Line: tests a system's throughput with relatively
simple queries with concurrent updates
– Show all photos posted by my friends that I was tagged in
• Business Intelligence: consists of complex structured
queries for analyzing online behavior
– Influential people on the topic of open source development?
• Graph Analytics: tests the functionality and scalability
on most of the data as a single operation
– PageRank, Shortest Path(s), Community Detection
26. Interactive Query Set
• Tests system throughput with relatively simple queries and
concurrent updates
• Current set: 12 read-only queries
• For each query:
– Name and detailed description in plain English
– List of input parameters
– Expected result: content and format
– Textual functional description
– Relevance:
• textual description (plain English) of the reasoning for including this
query in the workload
• discussion about the technical challenges (Choke Points) targeted
– Validation parameters and validation results
– SPARQL query
27. Some SNB Interactive Choke Points
• Graph Traversals. Query execution time heavily depends
on the ability to quickly traverse friends graph.
• Plan Variability. Each query has many different best plans depending on parameter choices (e.g. hash- vs index-based joins).
• Top k and distinct: Many queries return the first results
in a specific order: Late projection, pushing conditions
from the sort into the query
• Repetitive short queries, differing only in literals,
opportunity for query plan recycling
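The last choke point, plan recycling for repetitive queries differing only in literals, can be illustrated with a toy plan cache keyed on the literal-stripped query template (a sketch only; no real engine's API):

```python
import re

plan_cache = {}
compilations = 0

def get_plan(sql):
    """Return a (stand-in) query plan, recompiling only per template."""
    global compilations
    # Normalize: replace string and numeric literals with a placeholder,
    # so queries differing only in literals map to the same cache key.
    template = re.sub(r"'[^']*'|\b\d+\b", "?", sql)
    if template not in plan_cache:
        compilations += 1
        plan_cache[template] = f"PLAN[{template}]"  # stand-in for a real plan
    return plan_cache[template]

get_plan("SELECT personID FROM person WHERE firstName = 'Joachim'")
get_plan("SELECT personID FROM person WHERE firstName = 'Cesare'")
print(compilations)  # 1: the second query reuses the cached plan
```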
29. Example: Q3
•
Name: Friends within 2 hops that have been in two countries
Description:
Find Friends and Friends of Friends of the user A that have made a post in
the foreign countries X and Y within a specified period. We count only posts
that are made in the country that is different from the country of a friend. The
result should be sorted descending by total number of posts, and then by person
URI. Top 20 should be shown. The user A (as friend of his friend) should not be
in the result
Parameter:
- Person
- CountryX
- CountryY
- startDate - the beginning of the requested period
- Duration - requested period in days
Result:
- Person.id, Person.firstname, Person.lastName
- Number of posts for each country and the sum of all posts
Relevance:
- Choke Points: CP3.3
- If one country is large but anticorrelated with the country of self then
processing this before a smaller but positively correlated country can be
beneficial
30. Example: Q5 - SPARQL
select ?group count (*)
where {
{select distinct ?fr
where {
{%Person% snvoc:knows ?fr.} union
{%Person% snvoc:knows ?fr2.
?fr2 snvoc:knows ?fr. filter (?fr != %Person%)}
}
} .
?group snvoc:hasMember ?mem . ?mem snvoc:hasPerson ?fr .
?mem snvoc:joinDate ?date . filter (?date >=
"%Date0%"^^xsd:date) .
?post snvoc:hasCreator ?fr . ?group snvoc:containerOf ?post
}
group by ?group
order by desc(2) ?group
limit 20
31. Example: Q5 - Cypher
MATCH (person:Person)-[:KNOWS*1..2]-(friend:Person)
WHERE person.id={person_id}
MATCH (friend)<-[membership:HAS_MEMBER]-(forum:Forum)
WHERE membership.joinDate>{join_date}
MATCH (friend)<-[:HAS_CREATOR]-(comment:Comment)
WHERE (comment)-[:REPLY_OF*0..]->(:Comment)-[:REPLY_OF]->(:Post)<-[:CONTAINER_OF]-(forum)
RETURN forum.title AS forum, count(comment) AS commentCount
ORDER BY commentCount DESC
MATCH (person:Person)-[:KNOWS*1..2]-(friend:Person)
WHERE person.id={person_id}
MATCH (friend)<-[membership:HAS_MEMBER]-(forum:Forum)
WHERE membership.joinDate>{join_date}
MATCH (friend)<-[:HAS_CREATOR]-(post:Post)<-[:CONTAINER_OF]-(forum)
RETURN forum.title AS forum, count(post) AS postCount
ORDER BY postCount DESC
34. SNB Software
• SNB data
– Dataset generator
– Dataset Validator
– SN Graph Analyzer
• SNB Query drivers
– BSBM based
– YCSB based
35. Some Experiments
• Virtuoso (RDF)
– 100k users over a 3-year period (3.3 billion triples, 60GB)
– Ten SPARQL query mixes
– 4 x Intel Xeon 2.30GHz CPU, 193 GB of RAM
• DEX (Graph Database)
– Validation setup: 10k users during 3 years (19GB)
– Validation query set and parameters (API-based)
– 2 x Intel Xeon 2.40Ghz CPU, 128 GB of RAM
36. Virtuoso Interactive Workload
• Some queries could not be considered as truly interactive
– e.g. Q4, Q5 and Q9
– … still all queries are very interesting challenges
• “Irregular” data distribution reflecting the reality of the SN
– … but complicates the selection of query parameters
37. Exploration in Scale
• 3.3 bn RDF triples per 100K users, 24G in triples,
36G in literals
• 2/3 of data in interactive working set, 1/4 in BI
working set
• scale-out becomes increasingly necessary beyond 1M users
• 10-100M users are data center scales
– as in real social networks
– larger scales will favor space efficient data models,
e.g. column store with a schema, but
– larger scales also have greater need for schema-last
features