The document provides an overview of the internal workings of Neo4j. It describes how the graph data is stored on disk as linked lists of fixed size records and how two levels of caching work - a low-level filesystem cache and a high-level object cache that stores node and relationship data in a structure optimized for traversals. It also explains how traversals are implemented using relationship expanders and evaluators to iteratively expand paths through the graph, and how Cypher builds on this but uses graph pattern matching rather than the full traversal system.
An overview of Neo4j Internals
1. An overview of Neo4j Internals
Tobias Lindaaker, Hacker @ Neo Technology
tobias@neotechnology.com
twitter: @thobe, #neo4j (@neo4j)
web: neo4j.org, neotechnology.com; personal: thobe.org
2. Outline
This is a rough structure of how the pieces of Neo4j fit together.
This talk will not cover how disks/fs works, we just assume it does.
Nor will it cover the "Core API", you are assumed to know it.
[Diagram: the layered stack used throughout this talk - Traversals, Core API and Cypher on top; the Node/Relationship object cache with thread-local diffs; the FS cache; the record files and the transaction log; the disk(s); with HA alongside.]
3. Outline
Let's start at the bottom: the on disk storage file layout.
4. Simple sample graph. It all boils down to linked lists of fixed size records on disk.
Your graph on disk:
Properties are stored as a linked list of property records, each holding key+value. Each node/relationship references its first property record.
Each Node also references the first relationship in its relationship chain.
Each Relationship references its start and end node. It also references the prev/next relationship record for the start/end node respectively.
[Diagram: an example graph of four people - Alistair (Age: 34), Tobias (Age: 27, Nationality: Swedish), Ian (Age: 42) and Jim (Age: 37, Stuff: good) - connected by KNOWS relationships.]
[Slides 5-10 repeat the same example graph, step by step annotating each relationship record with its start/end node pointers (SP, EP) and its prev/next relationship pointers in the start-node and end-node chains (SN, EN).]
11. Store files
๏ Node store
๏ Relationship store
• Relationship type store
๏ Property store
• Property key store
• (long) String store
• (long) Array store
Short string and array values are inlined in the property store, long values are stored in separate store files.
12. Neo4j Storage Record Layout
Node (9 bytes): inUse | nextRelId | nextPropId — byte positions 1, 5, 9
Relationship (33 bytes): inUse | firstNode | secondNode | relationshipType | firstPrevRelId | firstNextRelId | secondPrevRelId | secondNextRelId | nextPropId — byte positions 1, 5, 9, 13, 17, 21, 25, 29, 33
Relationship Type (5 bytes): inUse | typeBlockId — byte positions 1, 5
Property (33 bytes): inUse | type | keyIndexId | propBlock | nextPropId — byte positions 1, 3, 5, 29, 33
Property Index (9 bytes): inUse | propCount | keyBlockId — byte positions 1, 5, 9
Dynamic Store (125 bytes): inUse | next | data — byte positions 1, 5
NeoStore (5 bytes): inUse | datum — byte positions 1, 5
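Because the records are fixed size, a record id maps directly to a file offset (id times record size). Below is a minimal sketch of decoding the 9-byte node record layout above, assuming a 1-byte in-use flag followed by two 4-byte pointers; the NodeRecord class is illustrative, not Neo4j's code.

import java.nio.ByteBuffer;

// Illustrative only: decodes the 9-byte node record layout shown above.
// NodeRecord and the field widths are assumptions for this sketch, not Neo4j's classes.
class NodeRecord {
    boolean inUse;
    long nextRelId;   // id of the first relationship in this node's chain
    long nextPropId;  // id of the first property record

    static final int RECORD_SIZE = 9;

    static NodeRecord read(ByteBuffer storeFile, long nodeId) {
        ByteBuffer buf = storeFile.duplicate();
        buf.position((int) (nodeId * RECORD_SIZE)); // fixed-size records: id -> offset
        NodeRecord record = new NodeRecord();
        record.inUse = buf.get() != 0;                  // byte 1
        record.nextRelId = buf.getInt() & 0xFFFFFFFFL;  // bytes 2-5, read unsigned
        record.nextPropId = buf.getInt() & 0xFFFFFFFFL; // bytes 6-9, read unsigned
        return record;
    }
}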
13. Outline
Next: the two levels of cache in Neo4j.
The low level FS Cache for the record files.
And the high level Object cache storing a structure more optimized for traversal.
14. The caches
๏ Filesystem cache:
• Caches regions of the store files (divides each store file into equally sized regions)
• The cache holds a fixed number of regions for each file
• Regions are evicted based on an LFU-like policy (hit count vs. miss count, i.e. hit in a non-cached region)
• Default implementation of regions uses OS mmap
๏ Node/Relationship cache
• Caches a version more optimized for traversals
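One way to read the eviction policy above is sketched below: cached regions accumulate hit counts, non-cached regions accumulate miss counts, and a non-cached region displaces the least-hit cached region once its misses overtake that region's hits. This is a hypothetical illustration of the policy as described, not Neo4j's implementation.

import java.util.HashMap;
import java.util.Map;

// Hypothetical LFU-like region cache sketch; not Neo4j's actual code.
class RegionCache {
    private final int maxRegions;
    private final Map<Integer, Long> hits = new HashMap<>();   // cached region -> hit count
    private final Map<Integer, Long> misses = new HashMap<>(); // non-cached region -> miss count

    RegionCache(int maxRegions) { this.maxRegions = maxRegions; }

    /** Returns true if the region for this offset is (now) cached. */
    boolean access(long fileOffset, long regionSize) {
        int region = (int) (fileOffset / regionSize);
        if (hits.containsKey(region)) {
            hits.merge(region, 1L, Long::sum);
            return true;
        }
        long missCount = misses.merge(region, 1L, Long::sum);
        if (hits.size() < maxRegions) {
            promote(region);
            return true;
        }
        // Evict the least-hit cached region once this region has been missed
        // more often than that region has been hit.
        Map.Entry<Integer, Long> coldest = hits.entrySet().stream()
                .min(Map.Entry.comparingByValue()).get();
        if (missCount > coldest.getValue()) {
            hits.remove(coldest.getKey());
            promote(region);
            return true;
        }
        return false;
    }

    private void promote(int region) {
        misses.remove(region);
        hits.put(region, 1L);
    }
}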
15. What we put in cache
The structure of the elements in the high level object cache.
On disk most of the information is contained in the relationship records, with the nodes just referencing their first relationship. In the Node cache this is turned around: the nodes hold references to all their relationships. The relationships are simple, only holding their properties.
The relationships for each node are grouped by RelationshipType to allow fast traversal of a specific type.
All references (dotted arrows in the diagram) are by ID, and traversals do indirect lookup through the cache.
[Diagram: a cached node holding, per RelationshipType, "in" and "out" lists of relationship IDs (R1 ... Rn) plus its property key/value entries; a cached relationship holding its ID, start and end node IDs, type, and property key/value entries.]
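A sketch of the shape of this cached structure; the class and field names are illustrative, not Neo4j's internal types.

import java.util.Map;

// Illustrative shape of the object cache described above, not Neo4j's classes:
// the cached node holds all of its relationship ids grouped by type and direction,
// the cached relationship carries only its endpoints, type and properties.
// All cross-references are by id, resolved through the cache during traversal.
class CachedNode {
    long id;
    Map<String, long[]> incomingByType;  // relationship ids, grouped by RelationshipType
    Map<String, long[]> outgoingByType;
    Map<String, Object> properties;      // key -> value
}

class CachedRelationship {
    long id;
    long startNodeId;
    long endNodeId;
    String type;
    Map<String, Object> properties;
}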
16. Outline
So how do traversals work...
17. Traversals - how do they work?
๏ RelationshipExpanders: given (a path to) a node, returns Relationships to continue traversing from that node
๏ Evaluators: given (a path to) a node, returns whether to:
• Continue traversing on that branch (i.e. expand) or not
• Include (the path to) the node in the result set or not
๏ Then a projection to Path, Node, or Relationship is applied to each Path in the result set.
(This is the surface layer, the one you interact with.)
... but also:
๏ Uniqueness level: policy for when it is ok to revisit a node that has already been visited
๏ Implemented on top of the Core API
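For concreteness, a small sketch of how expanders, evaluators and the uniqueness level combine in a traversal description, assuming the 2012-era Neo4j 1.x embedded Java API and the KNOWS relationship type from the sample graph earlier.

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;
import org.neo4j.kernel.Traversal;
import org.neo4j.kernel.Uniqueness;

public class FriendsOfFriends {
    // Finds nodes up to two KNOWS hops away from 'start', excluding 'start' itself.
    // relationships() configures the expander, the evaluators decide what to include,
    // and the uniqueness level controls when a node may be revisited.
    static Iterable<Path> friendsOfFriends(Node start) {
        TraversalDescription description = Traversal.description()
                .relationships(DynamicRelationshipType.withName("KNOWS"), Direction.BOTH)
                .evaluator(Evaluators.toDepth(2))
                .evaluator(Evaluators.excludeStartPosition())
                .uniqueness(Uniqueness.NODE_GLOBAL);
        return description.traverse(start); // each Path can be projected to its end node
    }
}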
18. More on Traversals
๏ Fetch node data from cache - non-blocking access
• If not in cache, retrieve from storage, into cache
‣ If region is in FS cache: blocking but short duration access
‣ If region is outside FS cache: blocking slower access
๏ Get relationships from cached node
• If not fetched, retrieve from storage, by following chains
๏ Expand relationship(s) to end up on next node(s)
• The relationship knows the node, no need to fetch it yet
๏ Evaluate
• possibly emitting a Path into the result set
๏ Repeat
(This is what happens under the hood.)
19. Outline
How is Cypher different? And how does it work?
20. Cypher - Just convenient traversal descriptions?
๏ Builds on the same infrastructure as Traversals - Expanders
• but not on the full Traversal system
๏ Uses graph pattern matching for traversing the graph
• Recursive backtracking matching with
START x=...
MATCH x-->y, x-->z, y-->z, z-->a-->b, z-->b
[Diagram: red = pattern graph, blue = actual graph, green = start node, purple = matches]
29. What about gremlin?
๏ gremlin is a third party language, built by Marko Rodriguez of
Tinkerpop (a group of people who like to hack on graphs)
๏ Originally based on the idea of using xpath to describe traversals:
./HAS_CART/CONTAINS_ITEM/PURCHASED/PURCHASED
but bastardized to distinguish between nodes and relationships:
./outE[label=HAS_CART]/inV
/outE[label=CONTAINS_ITEM]/inV
/inE[label=PURCHASED]/outV
/outE[label=PURCHASED]/inV
(Traversals are close to xpath, which is why xpath-like descriptions of traversals seemed like a good idea.)
๏ xpath is not complete enough to express full algorithms, it needs a host language; gremlin originally defined its own.
This changed to Groovy as a more complete host language and abandoned xpath in favor of method chaining [replace '/' with '.']
30. Gremlin compared to Cypher
๏ start me=node:people(name={myname})
match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item,
item<-[:PURCHASED]-user-[:PURCHASED]->recommendation
return recommendation
๏ Cypher is declarative, describes what data to get - its shape
๏ Gremlin is imperative, prescribes how to get the data
๏ Cypher has more opportunities for optimization by the engine
๏ Gremlin can implement pagerank, Cypher can’t (yet?)
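For reference, a sketch of how a query like the one above could be run from embedded Java in the Neo4j 1.x era, assuming an existing GraphDatabaseService and the javacompat ExecutionEngine of that time (the exact entry point varies by version).

import java.util.Collections;
import java.util.Map;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;

public class Recommendations {
    // Runs the purchase-recommendation query from the slide, parameterized on the user name.
    static ExecutionResult recommend(GraphDatabaseService db, String name) {
        ExecutionEngine engine = new ExecutionEngine(db);
        String query =
            "start me=node:people(name={myname}) " +
            "match me-[:HAS_CART]->cart-[:CONTAINS_ITEM]->item, " +
            "      item<-[:PURCHASED]-user-[:PURCHASED]->recommendation " +
            "return recommendation";
        Map<String, Object> params = Collections.<String, Object>singletonMap("myname", name);
        return engine.execute(query, params);
    }
}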
31. Outline
Transactions involve two parts:
The (thread local) changes being done by an active transaction,
and the transaction replay log for recovery.
32. Transaction Isolation
๏ Mutating operations are not written when performed
๏ They are stored in a thread confined transaction state object
๏ This prevents other threads from seeing uncommitted changes
from the transactions of other threads
๏ When Transaction.finish() is invoked the transaction is either
committed or rolled back
๏ Rollback is simple: discard the transaction state object
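A minimal usage sketch of the Transaction.finish() pattern mentioned above, assuming an embedded GraphDatabaseService and the Neo4j 1.x era API.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class CreateNodeExample {
    // Writes accumulate in the thread-confined transaction state; success() marks the
    // transaction for commit, and finish() either commits or discards that state.
    static Node createPerson(GraphDatabaseService db, String name) {
        Transaction tx = db.beginTx();
        try {
            Node person = db.createNode();
            person.setProperty("name", name);
            tx.success();   // without this, finish() rolls back
            return person;
        } finally {
            tx.finish();    // commit or roll back, depending on success()/failure()
        }
    }
}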
33. Transactions & Durability
๏ Commit is:
• Changes made in the transaction are collected as commands
• Commands are sorted to get predictable update order
‣This prevents concurrent readers from seeing inconsistent
data when the changes are applied to the store
• Write changes (in sorted order) to the transaction log
• Mark the transaction as committed in the log
• Apply the changes (in sorted order) to the store files
34. Recovery
๏ Transaction commands dictate state, they don’t modify state
• i.e. SET property "count" to 5
• rather than ADD 1 to property "count"
๏ Thus: Applying the same command twice yields the same state
๏ Recovery simply replays all transactions since the last safe point
๏ If tx A mutates node1.name, and then tx B also mutates node1.name, that doesn't matter, because the database is not recovered until all transactions have been replayed
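To make the idempotence point concrete, a toy sketch with entirely hypothetical classes (not Neo4j's actual command objects): a command that dictates state can safely be replayed any number of times.

import java.util.HashMap;
import java.util.Map;

// Toy illustration of idempotent recovery commands; the classes are hypothetical.
class SetPropertyCommand {
    final String key;
    final Object value;

    SetPropertyCommand(String key, Object value) {
        this.key = key;
        this.value = value;
    }

    // Dictates state: applying it once or ten times leaves the same result,
    // which is what makes replay-based recovery safe.
    void apply(Map<String, Object> properties) {
        properties.put(key, value);
    }
}

class RecoveryDemo {
    public static void main(String[] args) {
        Map<String, Object> node = new HashMap<>();
        SetPropertyCommand cmd = new SetPropertyCommand("count", 5);
        cmd.apply(node);
        cmd.apply(node);            // replay during recovery: same state, count == 5
        System.out.println(node);   // {count=5}
        // An "ADD 1 to count" command would not have this property:
        // replaying it would double-count.
    }
}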
35. Outline
High Availability in Neo4j builds on top of the transaction replay.
36. Outline
The transaction logs are shared between all instances in a High Availability setup, all other parts operate on the local data just like in the standalone case.
[Diagram: the same layered stack, with the local stores marked "Local" and the transaction log marked "Shared".]
37. HA - the parts to it:
๏ Based on streaming transactions between servers
๏ All transactions are committed through the master
• Then (eventually) applied to the slaves
• Eventuality defined by the update interval, or when synchronization is mandated by interaction
๏ When writing to a slave:
• Locks coordinated through the master
• Transaction data buffered on the slave, applied first on the master to get a txid, then applied with the same txid on the slave
38. Creating new Nodes and Relationships
๏ New Nodes/Relationships don’t need locks, so they don’t need a
transaction synced with master until the transaction commit
๏ They do need an ID that is unique and equal among all instances
๏ Each instance allocates IDs in blocks from the master, then assigns
them to new Nodes/Relationships locally
• This batch allocation can be seen in WebAdmin (Enterprise) as Node/Relationship counts jumping in steps of 1000
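A hypothetical sketch of the block allocation idea described above: the slave asks the master for a range of ids once, then hands them out locally until the block is exhausted. The IdBlockSource interface, field names and block size are illustrative, not Neo4j's API.

// Hypothetical sketch of allocating ids in blocks from the master; not Neo4j's code.
class LocalIdAllocator {
    interface IdBlockSource {
        long requestBlock(int size);   // returns the first id of a reserved range
    }

    private static final int BLOCK_SIZE = 1000;  // matches the "steps of 1000" seen in WebAdmin
    private final IdBlockSource master;
    private long next = 0;
    private long end = 0;   // exclusive

    LocalIdAllocator(IdBlockSource master) { this.master = master; }

    synchronized long nextId() {
        if (next == end) {                          // block exhausted: ask the master for a new one
            next = master.requestBlock(BLOCK_SIZE);
            end = next + BLOCK_SIZE;
        }
        return next++;                              // assign locally, no per-id round trip
    }
}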
39. HA synchronization points
๏ Transactions are uniquely identified by monotonically increasing ID
๏ All Requests from slave send the current latest txid on that slave
๏ Responses from master send back a stream of transactions that
have occurred since then, along with the actual result
๏ Transaction is replayed just like when committed / recovered
๏ Nodes/Relationships touched during this application are evicted
from cache to avoid cache staleness
๏ Transaction commands are only sorted when created, before
stored/transmitted, thus consistency is preserved during all
application phases
40. Locking semantics of HA
๏ To be granted a lock the slave must have the latest version of the
Node/Relationship it is trying to lock
• This ensures consistency
• The implementation of "Latest version of the Node/Relationship" is "Latest version of the entire graph"
• The slave must thus sync transactions from the master
41. Master election
๏ Each instance communicates/coordinates:
• its latest transaction id (including the master id for that tx)
• the id for that instance
• (logical) clock value for when the txid was written
๏ Election chooses:
1. The instance with highest txid
2. IF multiple: The instance that was master for that tx
3. IF unavailable: The instance with the lowest clock value
4. IF multiple: The instance with the lowest id
๏ Election happens when the current master cannot be reached
• Any instance can choose to re-elect
• Each instance runs the election protocol individually
• Notify others when election chooses new master
๏ When elected, the new master broadcasts to all instances, forcing them to bind to the new master
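A sketch of the election ordering above expressed as a comparator over hypothetical instance descriptors; this illustrates the tie-breaking rules, not the actual election code, and the InstanceInfo fields are assumptions.

import java.util.Comparator;
import java.util.List;

// Illustrative election ordering only; InstanceInfo and its fields are assumptions.
class MasterElection {
    static class InstanceInfo {
        int instanceId;
        long lastTxId;          // latest transaction id seen by this instance
        int masterIdOfLastTx;   // id of the instance that was master for that tx
        long clock;             // logical clock value when that txid was written
    }

    // 1. highest txid; 2. prefer the instance that was master for that tx (if it is a
    // candidate); 3. lowest clock value; 4. lowest instance id.
    static final Comparator<InstanceInfo> ORDER = Comparator
            .comparingLong((InstanceInfo i) -> i.lastTxId).reversed()
            .thenComparingInt(i -> i.instanceId == i.masterIdOfLastTx ? 0 : 1)
            .thenComparingLong(i -> i.clock)
            .thenComparingInt(i -> i.instanceId);

    static InstanceInfo elect(List<InstanceInfo> candidates) {
        return candidates.stream().min(ORDER).orElse(null);
    }
}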
42. Thank you for listening!
Tobias Lindaaker, Hacker @ Neo Technology
tobias@neotechnology.com
twitter: @thobe, #neo4j (@neo4j)
web: neo4j.org, neotechnology.com; personal: thobe.org