APOC has become the de facto standard utility library for Neo4j. In this talk, I will demonstrate some of the lesser-known but very useful components of APOC that will save you a lot of work. You will also learn how to combine individual functions into powerful constructs to achieve impressive feats.
This will be a fast-paced demo/live-coding talk.
Video: https://neo4j.com/graphconnect-2018/session/neo4j-utility-library-apoc-pearls
Unicorn images by TeeTurtle.com (Unstable Unicorns is a fun game & cool t-shirts)
4. Extending Neo4j
[Diagram: applications connect via the Bolt protocol to the Neo4j execution engine, which invokes user-defined procedures]
User Defined Procedures let you write custom code that is:
• Written in any JVM language
• Deployed to the Database
• Accessed by applications via Cypher
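To make the last point concrete: a deployed procedure is invoked from Cypher with CALL. A minimal sketch using APOC's built-in help procedure:
CALL apoc.help('path')        // list APOC procedures/functions whose name matches 'path'
YIELD name, text
RETURN name, text
LIMIT 5;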
5. APOC History
• My Unicorn Moment
• 3.0 was about to have User Defined Procedures
• Add the missing utilities
• Grew quickly 50 - 150 - 450
• Active OSS project
• Many contributors
41. Graph Grouping
MATCH (p:Person) SET p.decade = p.born / 10;
MATCH (p1:Person)-->()<--(p2:Person)
WITH p1, p2, count(*) AS c
MERGE (p1)-[r:INTERACTED]-(p2)
ON CREATE SET r.count = c;
CALL apoc.nodes.group(['Person'],['decade'])
YIELD node, relationship RETURN *;
52. Expand Operations
Customized path expansion from start node(s)
• Min/max traversals
• Limit number of results
• Optional (no rows removed if no results)
• Choice of BFS/DFS expansion
• Custom uniqueness (restrictions on visitations of nodes/rels)
• Relationship and label filtering
• Supports repeating sequences
53. Expand Operations
apoc.path.expand(startNode(s), relationshipFilter, labelFilter, minLevel, maxLevel) YIELD path
• The original, when you don’t need much customization
apoc.path.expandConfig(startNode(s), configMap) YIELD path
• Most flexible, rich configuration map
apoc.path.subgraphNodes(startNode(s), configMap) YIELD node
• Only distinct nodes, don't care about paths
apoc.path.spanningTree(startNode(s), configMap) YIELD path
• Only one distinct path to each node
apoc.path.subgraphAll(startNode(s), configMap) YIELD nodes, relationships
• Only (collected) distinct nodes (and all rels between them)
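To make the config map concrete, here is a minimal sketch of apoc.path.expandConfig against a movie-style graph (the start node, property values, and filters are illustrative assumptions, not from the talk):
MATCH (p:Person {name: 'Tom Hanks'})
CALL apoc.path.expandConfig(p, {
  relationshipFilter: 'ACTED_IN>|<DIRECTED',
  labelFilter: '+Person|+Movie',
  minLevel: 1,
  maxLevel: 3,
  limit: 25
}) YIELD path
RETURN path;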
55. Relationship Filter
• '<ACTED_IN' - Incoming Rel
• 'DIRECTED>' - Outgoing Rel
• 'REVIEWED' - Any direction
• '<ACTED_IN | DIRECTED> | REVIEWED' - Multiple, in varied directions
• You can't express that in plain Cypher: -[:ACTED_IN|DIRECTED|REVIEWED]-> forces a single direction for all relationship types
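The same mixed-direction idea with the positional apoc.path.expand signature from slide 53 (node and filter values are again just illustrative):
MATCH (p:Person {name: 'Keanu Reeves'})
CALL apoc.path.expand(p, '<ACTED_IN|DIRECTED>|REVIEWED', '+Person|+Movie', 1, 2) YIELD path
RETURN path LIMIT 10;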
56. Label Filter
What is/isn't allowed during expansion, and what is/isn't returned
• '-Director' – Blacklist, not allowed in path
• '+Person' – Whitelist, only allowed in path (no whitelist = all allowed)
• '>Reviewer' – End node, only return these, and continue expansion
• '/Actor:Producer' – Terminator node, only return these, stop expansion
'Person|Movie|-Director|>Reviewer|/Actor:Producer' – Combine them
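A short sketch of a blacklist plus end-node label filter using apoc.path.subgraphNodes (the labels follow the slide; the start node is an assumption):
MATCH (m:Movie {title: 'The Matrix'})
CALL apoc.path.subgraphNodes(m, {
  labelFilter: '-Director|>Reviewer',
  maxLevel: 3
}) YIELD node
RETURN node;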
57. Sequences
Repeating sequences of relationships, labels, or both.
Uses labelFilter and relationshipFilter, just add commas
Or use sequence for both together
labelFilter:'Post | -Blocked, Reply, >Admin'
relationshipFilter:'NEXT>,<FROM,POSTED>|REPLIED>'
sequence:'Post|-Blocked, NEXT>, Reply, <FROM, >Admin, POSTED>|REPLIED>'
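Sequences go into the same config map. A sketch over the hypothetical forum graph from the slide (Post, Reply, Admin labels; the start node is an assumption):
MATCH (start:Post {id: 123})
CALL apoc.path.expandConfig(start, {
  sequence: 'Post|-Blocked, NEXT>, Reply, <FROM, >Admin, POSTED>|REPLIED>',
  maxLevel: 12
}) YIELD path
RETURN path;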
58. End nodes / Terminator nodes
What if we already have the nodes that should end the expansion?
endNodes – like filter, but takes a collection of nodes (or ids)
terminatorNodes – like filter (stop expand), but also takes a collection
(whitelistNodes and blacklistNodes too!)
Can be used with labelFilter or sequence, but continue or include must be unanimous
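A sketch of passing a collected node list as terminatorNodes (the graph and the released-year predicate are assumptions):
MATCH (p:Person {name: 'Tom Hanks'})
MATCH (m:Movie) WHERE m.released >= 2000
WITH p, collect(m) AS stops
CALL apoc.path.subgraphNodes(p, {terminatorNodes: stops, maxLevel: 4}) YIELD node
RETURN node;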
65. Turn JSON List into Cypher List
with "[1,2,3]" as str
with split(substring(str,1, length(str)-2),",") as numbers
return [x IN numbers| toInteger(x)]
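APOC shortens this to a single call with one of its conversion functions:
RETURN apoc.convert.fromJsonList('[1,2,3]') AS numbers;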
72. Gephi Integration
MATCH path = (:Person)-[:ACTED_IN]->(:Movie)
WITH path LIMIT 1000
WITH collect(path) AS paths
CALL apoc.gephi.add(null, 'workspace0', paths) YIELD nodes, relationships, time
RETURN nodes, relationships, time;
Incremental send to Gephi; needs the Gephi Streaming extension.
102. Procedures / Functions from Cypher
CALL apoc.custom.asProcedure('answer','RETURN 42 as answer');
CALL custom.answer();
Also works with parameters and return-column declarations:
CALL apoc.custom.asFunction('answer', 'RETURN $input', 'long', [['input','number']]);
RETURN custom.answer(42) as answer;
103. Neo4j Developer Surface
[Diagram: native language drivers speak Bolt to the server, where user-defined procedures run]
2000-2010 0.x Embedded Java API
2010-2014 1.x REST
2014-2015 2.x Cypher over HTTP
2016 3.0.x Bolt, Official Language Drivers, User Defined Procedures
2016 3.1.x User Defined Functions
2017 3.2.x User Defined Aggregation Functions
110. Build a procedure or function you'd like
Start with the template repo: github.com/neo4j-examples/neo4j-procedure-template
111. User Defined Procedures
User-defined procedures are
● @Procedure-annotated, named Java methods
○ default name: package + method name
● take @Name'ed parameters (default values since 3.1)
● return a Stream of value objects
● the fields of those value objects become result columns
● can use an @Context-injected GraphDatabaseService etc.
● run within a transaction
112. public class FullTextIndex {
    @Context
    public GraphDatabaseService db;

    @Procedure( name = "example.search", mode = Procedure.Mode.READ )
    public Stream<SearchHit> search( @Name("index") String index,
                                     @Name("query") String query ) {
        if ( !db.index().existsForNodes( index ) ) {
            return Stream.empty();
        }
        return db.index().forNodes( index ).query( query ).stream()
                 .map( SearchHit::new );
    }

    public static class SearchHit {
        public final Node node;
        SearchHit( Node node ) { this.node = node; }
    }
}
113. try ( Driver driver = GraphDatabase.driver( "bolt://localhost",
                                                 Config.build().toConfig() ) ) {
    try ( Session session = driver.session() ) {
        String call = "CALL example.search('User', $query)";
        Map<String,Object> params = singletonMap( "query", "name:Brook*" );
        StatementResult result = session.run( call, params );
        while ( result.hasNext() ) {
            Record record = result.next(); // process each result row
        }
    }
}
Deploy & Register in Neo4j Server via neo4j-harness
Call & test via neo4j-java-driver
114. Deploying User Defined Procedures
Build or download (shadow) jar
● Drop jar-file into $NEO4J_HOME/plugins
● Restart server
● Procedure should be available
● Otherwise check neo4j.log / debug.log
124. Aggregation Functions in APOC
• more efficient variants of collect(x)[a..b]
• apoc.agg.nth, apoc.agg.first, apoc.agg.last, apoc.agg.slice
• apoc.agg.median(x)
• apoc.agg.percentiles(x,[0.5,0.9])
• apoc.agg.product(x)
• apoc.agg.statistics(x) provides a full set of numeric statistics in one pass
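These drop straight into a RETURN or WITH clause like any other aggregation. A minimal sketch, assuming a Person.age property:
MATCH (p:Person)
RETURN apoc.agg.median(p.age) AS medianAge,
       apoc.agg.percentiles(p.age, [0.5, 0.9]) AS agePercentiles,
       apoc.agg.statistics(p.age) AS ageStats;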