Performance Improvements in Neo4j 3.2


The document discusses performance improvements in Neo4j 3.2 including faster native label indexing, composite indexes for multi-property queries, orders of magnitude speed increases for reachability queries through pruning, and up to 300% faster query performance from the new compiled Cypher runtime. It provides details on these improvements such as the new GBPTree index design for native labels and conditions for using composite indexes.

Craig Taverner
Cypher Team Lead
Performance Improvements
in Neo4j 3.2

Craig Taverner
craig@neo4j.com
@craigtaverner

Native Graph Performance Improvements
• Native Label index: Writes now 30-300% faster
• Composite indexes: Faster multi-property queries
• Reachability Queries: Can improve by orders of magnitude
• Compiled Cypher runtime: Speeds queries up to 300%
• Neo4j Browser: Ground-up rewrite yields snappier performance

Performance Improvements in Neo4j 3.2
• Native Label Index
• Composite Indexes
• Compiled Runtime
• Reachability (Pruning Var Expand)
• Solving OR Using Indexes
[Diagram: the Neo4j stack (Cypher Parser, Cypher Planner, Cypher Runtime, Neo4j Kernel, Storage), annotated with the features Composite Indexes, Compiled, Native Label Index, Pruning VX and Or]

Native Label Index
• New GBPTree Index
• Designed for concurrent read and write load
• Performance
  • Similar for reads
  • Faster for writes

Composite Indexes
Full-stack generalisation of schema indexes to multiple properties
• Cypher:
CREATE INDEX ON :Person(firstname, lastname);
MATCH (n:Person)
WHERE n.firstname = 'Joe' AND n.lastname = 'Soap' RETURN n;
• Core API:
gds.schema().indexFor(Label.label("Person"))
    .on("firstname")
    .on("lastname")
    .create();

Composite Indexes
Cypher queries will use the composite index only under the following conditions:
• There must be an equality predicate on every property in the index.
• Predicates for existence, range, STARTS WITH, ENDS WITH and CONTAINS cannot use the index (yet).
CREATE INDEX ON :Person(firstname, lastname); 
// Direct composite index search on multiple property equality
MATCH (n:Person) WHERE n.firstname = 'Joe' AND n.lastname = 'Soap' RETURN n;
// No use of composite index (yet)
MATCH (n:Person) WHERE n.firstname = 'Joe' AND exists(n.lastname) RETURN n;
MATCH (n:Person) WHERE n.firstname = 'Joe' AND n.lastname STARTS WITH 'Soap' RETURN n;
MATCH (n:Person) WHERE n.firstname = 'Joe' RETURN n;
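
The rule in the queries above can be written as a small check. This is only a toy model of the stated conditions, not the Cypher planner's logic; the function name and the predicate encoding (a dict from property name to operator string) are assumptions made for illustration.

```python
def can_use_composite_index(index_props, predicates):
    """Toy version of the 3.2 rule: the composite index applies only when
    every indexed property has an equality predicate.

    index_props: tuple of property names in the index (hypothetical encoding)
    predicates:  dict mapping property name -> operator, e.g. "=", "exists"
    """
    return all(predicates.get(prop) == "=" for prop in index_props)


INDEX = ("firstname", "lastname")

# The four queries from the slide, in the same order.
both_equal = can_use_composite_index(INDEX, {"firstname": "=", "lastname": "="})
with_exists = can_use_composite_index(INDEX, {"firstname": "=", "lastname": "exists"})
with_prefix = can_use_composite_index(INDEX, {"firstname": "=", "lastname": "STARTS WITH"})
one_property = can_use_composite_index(INDEX, {"firstname": "="})
```

Only the first query qualifies; the other three fall back to a different plan in this model.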

Reachability Queries - Pruning Var Expand
MATCH (kevin {name:'Kevin Bacon'})-[*1..5]-(actor)
RETURN DISTINCT actor
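
Why pruning helps here: the query only needs the distinct end nodes, so expanding every path to a node that has already been found is wasted work. The sketch below contrasts the two strategies on an in-memory graph. It is a simplified illustration, not Neo4j's operator: the function names and the graph encoding are assumptions, and the corner case of the start node being reached again through a cycle is ignored in both versions.

```python
from collections import deque


def build_graph(edges):
    """Undirected adjacency list; each relationship gets an id."""
    adj = {}
    for rel_id, (a, b) in enumerate(edges):
        adj.setdefault(a, []).append((rel_id, b))
        adj.setdefault(b, []).append((rel_id, a))
    return adj


def naive_reachable(adj, start, max_depth):
    """Enumerate every path (no relationship reused), then de-duplicate."""
    found, expansions = set(), 0

    def walk(node, used, depth):
        nonlocal expansions
        if depth == max_depth:
            return
        for rel_id, other in adj[node]:
            if rel_id in used:
                continue
            expansions += 1
            if other != start:
                found.add(other)
            walk(other, used | {rel_id}, depth + 1)

    walk(start, frozenset(), 0)
    return found, expansions


def pruned_reachable(adj, start, max_depth):
    """Breadth-first expansion that never re-expands a node already seen."""
    seen, expansions = {start}, 0
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for _rel_id, other in adj[node]:
            expansions += 1
            if other not in seen:
                seen.add(other)
                frontier.append((other, depth + 1))
    return seen - {start}, expansions


# A small dense graph: every pair of the six nodes is connected.
nodes = ["kevin", "a", "b", "c", "d", "e"]
edges = [(x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]]
graph = build_graph(edges)

naive_nodes, naive_cost = naive_reachable(graph, "kevin", 5)
pruned_nodes, pruned_cost = pruned_reachable(graph, "kevin", 5)
```

Both functions return the same set of nodes, but the pruned version performs far fewer relationship expansions, and the gap widens as the graph gets denser or the depth limit grows.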

Compiled Runtime
• Coverage
  • 50% of Cypher Operators supported
  • Cypher Benchmarks Suite 15% supported
  • LDBC suite 18% supported
  • Does this mean many simple and few complex?
• Performance
  • Operators 2x to 20x faster
  • Queries … well it depends… let’s say 2x for supported queries

Solving OR using Indexes
• Consider AND:
MATCH (n:X)
WHERE n.firstName = $first AND n.lastName = $last
RETURN n
• Solved with Index and Filter - FAST
• But what happens with OR?
MATCH (n:X)
WHERE n.firstName = $first OR n.lastName = $last
RETURN n
• 3.1: Solved with LabelScan and Filter - SLOW
• 3.2: Solved with two IndexSeeks - FAST
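
The difference between the two plans can be sketched with dictionaries standing in for the indexes. This is a toy model under stated assumptions: the function names, the node layout and the use of one single-property index per property are illustrative, not Neo4j internals.

```python
def label_scan_or(nodes, first, last):
    """3.1-style plan: visit every node with the label, then filter."""
    return sorted(
        n["id"] for n in nodes
        if n["firstName"] == first or n["lastName"] == last
    )


def build_index(nodes, prop):
    """Hypothetical single-property index: value -> list of node ids."""
    index = {}
    for n in nodes:
        index.setdefault(n[prop], []).append(n["id"])
    return index


def index_seek_or(first_index, last_index, first, last):
    """3.2-style plan: one seek per index, union the hits, drop duplicates."""
    hits = first_index.get(first, []) + last_index.get(last, [])
    return sorted(set(hits))


people = [
    {"id": 1, "firstName": "Joe", "lastName": "Soap"},
    {"id": 2, "firstName": "Joe", "lastName": "Bloggs"},
    {"id": 3, "firstName": "Ann", "lastName": "Soap"},
    {"id": 4, "firstName": "Ann", "lastName": "Bloggs"},
]
by_first = build_index(people, "firstName")
by_last = build_index(people, "lastName")

scan_result = label_scan_or(people, "Joe", "Soap")
seek_result = index_seek_or(by_first, by_last, "Joe", "Soap")
```

Both plans return the same nodes. The scan touches every node with the label, while the two seeks touch only the matching index entries; the de-duplication step is needed because a node matching both predicates (node 1 here) is found by both seeks.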

Resources
• Compiled Runtime Coverage